24.07.2016 Views


Learning Data Mining with Python

Classifying with function words

Next, we import our classes. The only new thing here is the support vector machine (SVM), which we will cover in the next section (for now, just consider it a standard classification algorithm). We import the SVC class, an SVM for classification, as well as the other standard workflow tools we have seen before:

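from sklearn.svm import SVC
# From scikit-learn 0.18 onward, cross_val_score and GridSearchCV live in
# sklearn.model_selection (the older cross_validation and grid_search
# modules were removed in 0.20).
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# Aliased so that calls of the form grid_search.GridSearchCV keep working
from sklearn import model_selection as grid_search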

Chapter 9

Support vector machines take a number of parameters. As I said, we will use one blindly here, before going into detail in the next section. We use a dictionary to set which parameter values we are going to search. For the kernel parameter, we will try linear and rbf. For C, we will try values of 1 and 10 (both parameters are described in the next section). We then create a grid search that tries these values and picks the best combination:

# Candidate values for each SVC parameter
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
# The base classifier whose parameters are searched
svc = SVC()
grid = grid_search.GridSearchCV(svc, parameters)
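To make concrete what this grid covers, here is a minimal sketch in plain Python that expands the same dictionary into its individual candidates. The helper `expand_grid` is hypothetical and written only for illustration; it is not part of scikit-learn. The grid search itself goes further than this: it fits and scores a classifier for each candidate using cross-validation and keeps the best one.

```python
from itertools import product

# Same grid as in the text: two kernels and two values of C.
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}


def expand_grid(param_grid):
    """Return every combination of parameter values as a list of dicts.

    Hypothetical helper for illustration only.
    """
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(parameters)
for candidate in candidates:
    print(candidate)
```

With two kernels and two values of C, the search has four candidates to evaluate, so each extra value added to the dictionary multiplies the training cost.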

Gaussian kernels (such as rbf) are practical only for reasonably sized datasets, for example when the number of features is fewer than about 10,000.

Next, we set up a pipeline that combines the feature extraction step, which uses the CountVectorizer (counting only function words), with our grid search over the SVM. The code is as follows:

pipeline1 = Pipeline([('feature_extraction', extractor),
                      ('clf', grid)])
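As a rough picture of what the feature extraction step hands to the classifier, the following sketch counts function words in a single document using only the standard library. The five-word list is an illustrative subset, not the chapter's full list, and the whitespace tokenization is a simplification that is assumed here; CountVectorizer uses its own tokenizer and will not behave identically.

```python
from collections import Counter

# Illustrative subset only; the chapter's list of function words is longer.
function_words = ["a", "and", "of", "the", "to"]


def count_function_words(document, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    tokens = document.lower().split()  # naive whitespace tokenization
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


document = "The cat and the dog went to the edge of the town"
features = count_function_words(document, function_words)
print(features)  # one count each for: a, and, of, the, to
```

Every document becomes a vector of the same length, one entry per function word, and content words such as "cat" or "town" are ignored entirely.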

Next, we apply cross_val_score to get our cross-validated score for this pipeline. The result is 0.811, which means we get roughly 81 percent of the predictions correct. For seven authors, this is a good result!
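The whole workflow can be sketched end to end on a toy corpus. Everything about the data here is an assumption made for illustration: the documents are randomly generated strings, the six-word vocabulary is a stand-in for the chapter's function words, and the extractor is assumed to be a CountVectorizer with a fixed vocabulary. The imports assume scikit-learn 0.18 or later. The score this prints says nothing about the 0.811 reported for the real dataset.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

random.seed(14)

# Stand-in vocabulary; the chapter's list of function words is longer.
function_words = ["and", "of", "the", "to", "but", "with"]


def make_document(favoured, n_words=40):
    """Build a fake document that over-uses the given function words."""
    pool = function_words + favoured * 4
    return " ".join(random.choice(pool) for _ in range(n_words))


# Two synthetic "authors" with different function-word habits.
documents = ([make_document(["the", "of"]) for _ in range(15)] +
             [make_document(["but", "with"]) for _ in range(15)])
classes = [0] * 15 + [1] * 15

# Assumed definition of the extractor: count only the listed words.
extractor = CountVectorizer(vocabulary=function_words)
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
grid = GridSearchCV(SVC(), parameters, cv=3)
pipeline1 = Pipeline([('feature_extraction', extractor),
                      ('clf', grid)])

# The outer cross-validation scores the pipeline; the grid search runs
# its own inner cross-validation on each training split.
scores = cross_val_score(pipeline1, documents, classes, cv=3)
print(scores.mean())
```

Because the grid search sits inside the pipeline, the parameter choice is made separately within each training split, so the held-out documents never influence which kernel or C is selected.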

