Dataframe:
id  review                                        name    label
1   it is a great product for turning lights on.  Ashley  1
2   plays music and have a good sound.            Alex    1
3   I love it, lots of fun.                       Peter   0
The aim is to classify the text: if the review is about the functionality of the product (e.g. turning the light on, playing music), label=1; otherwise label=0.
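To make the setup reproducible, the sample rows can be rebuilt as a small pandas DataFrame. This is only a sketch: the names df, X and y are assumptions introduced here, and three rows are far too few to train on.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "review": [
        "it is a great product for turning lights on.",
        "plays music and have a good sound.",
        "I love it, lots of fun.",
    ],
    "name": ["Ashley", "Alex", "Peter"],
    "label": [1, 1, 0],
})

# Hypothetical names: the text column is the feature, 'label' is the target.
X = df["review"]
y = df["label"]
```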
I am running several sklearn models to see which one works best:
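from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Naive Bayes
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

# Linear Support Vector Classifier
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC(loss='hinge', penalty='l2', max_iter=50))])

# SGDClassifier
text_clf_sgd = Pipeline([('tfidf', TfidfVectorizer()), ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=50, tol=None))])

# Random Forest
text_clf_rf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', RandomForestClassifier())])

# Neural network (MLPClassifier)
text_clf_mlp = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MLPClassifier())])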
Problem: How to tune models using GridSearchCV? What I have so far:
from sklearn.model_selection import GridSearchCV

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}

gs_clf = GridSearchCV(text_clf_nb, param_grid=parameters, cv=2, scoring='roc_auc', n_jobs=-1)
gs_clf = gs_clf.fit((X_train, y_train))
Running gs_clf = gs_clf.fit((X_train, y_train)) gives the following error:
ValueError: Invalid parameter C for estimator Pipeline(memory=None, steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.float64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True, stop_words=None, strip_accents=None, sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))], verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
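The error message points to `estimator.get_params().keys()`, so a minimal sketch of that check on the Naive Bayes pipeline may help. Grid keys have to follow the `<step name>__<parameter>` pattern using the step names from the pipeline ('tfidf' and 'clf' here), so 'vect__ngram_range' would not match any step. The error text mentions a parameter C that is not in the grid shown, so it presumably came from a different grid with a bare 'C' key. The toy reviews and labels below are made up purely so the sketch runs, and X_train and y_train are passed as two separate arguments rather than as one tuple.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

# All tunable names for this pipeline, in '<step>__<parameter>' form.
valid_names = set(text_clf_nb.get_params().keys())

# Keys use the pipeline's own step names: 'tfidf' (not 'vect') and 'clf'.
parameters = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

# Any key listed here would trigger the "Invalid parameter" error.
unknown = [name for name in parameters if name not in valid_names]
print("unknown grid keys:", unknown)

# Made-up toy data (assumption): 1 = about functionality, 0 = otherwise.
X_train = [
    "turns the lights on quickly", "plays music with good sound",
    "the timer and alarm work well", "switches the lamp off by voice",
    "I love it", "lots of fun", "great gift for my mom", "arrived late",
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

# n_jobs is left at its default to keep the sketch single-process.
gs_clf = GridSearchCV(text_clf_nb, param_grid=parameters, cv=2, scoring='roc_auc')
gs_clf.fit(X_train, y_train)  # two arguments, not a single tuple
print(gs_clf.best_params_)
```

The same check applies to the other pipelines: `clf__alpha` is valid for MultinomialNB, SGDClassifier and MLPClassifier, while LinearSVC exposes `clf__C` instead and RandomForestClassifier has neither, so each pipeline likely needs its own grid.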
I would appreciate any suggestions. Thanks.