Learning Data Mining with Python


Chapter 2

Finally, we take the last value of the row and set the class. We set it to 1 (or True) if it is a good sample, and 0 if it is not:

y[i] = row[-1] == 'g'

We now have a dataset of samples and features in X, and the corresponding classes in y, as we did in the classification example in Chapter 1, Getting Started with Data Mining.
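The loading step above can be sketched end to end. The two in-memory rows below are hypothetical stand-ins for lines of the dataset file (which appears to be the Ionosphere dataset: 34 numeric features followed by a 'g'/'b' label); in practice you would read the rows with the csv module instead:

```python
import numpy as np

# Hypothetical stand-ins for two parsed CSV rows: 34 numeric
# features followed by the class label ('g' = good, 'b' = bad).
rows = [
    [0.1] * 34 + ['g'],
    [0.2] * 34 + ['b'],
]

X = np.zeros((len(rows), 34), dtype='float')
y = np.zeros((len(rows),), dtype='bool')

for i, row in enumerate(rows):
    # All but the last value are features; the last value is the class.
    X[i] = [float(datum) for datum in row[:-1]]
    y[i] = row[-1] == 'g'

print(y)  # [ True False]
```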

Moving towards a standard workflow

Estimators in scikit-learn have two main functions: fit() and predict(). We train the algorithm using the fit method and our training set. We evaluate it using the predict method on our testing set.

First, we need to create these training and testing sets. As before, import and run the train_test_split function:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)
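By default, train_test_split holds out 25 percent of the samples for testing. A quick check with synthetic stand-in data (the 100-sample, 5-feature array below is purely illustrative) makes the split sizes visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 5 features, boolean classes.
X = np.random.random((100, 5))
y = np.random.random(100) > 0.5

# With no test_size given, 25 percent of samples go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

print(X_train.shape, X_test.shape)  # (75, 5) (25, 5)
```

Passing random_state=14 makes the split reproducible, so repeated runs produce the same accuracy figures.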

Then, we import the nearest neighbor class and create an instance of it. We leave the parameters as defaults for now, and will choose good parameters later in this chapter. By default, the algorithm will choose the five nearest neighbors to predict the class of a testing sample:

from sklearn.neighbors import KNeighborsClassifier

estimator = KNeighborsClassifier()
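You can confirm the default of five neighbors by inspecting the estimator, and override it with the n_neighbors parameter when you want a different value (the value 3 below is just an example):

```python
from sklearn.neighbors import KNeighborsClassifier

# The default instance uses the five nearest neighbors.
estimator = KNeighborsClassifier()
print(estimator.n_neighbors)  # 5

# The number of neighbors can be overridden at construction time.
estimator = KNeighborsClassifier(n_neighbors=3)
print(estimator.n_neighbors)  # 3
```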

After creating our estimator, we must then fit it on our training dataset. For the nearest neighbor class, this records our dataset, allowing us to find the nearest neighbor for a new data point, by comparing that point to the training dataset:

estimator.fit(X_train, y_train)

We then run the trained algorithm on our testing set, predicting a class for each test sample, and evaluate the accuracy of those predictions:

y_predicted = estimator.predict(X_test)

accuracy = np.mean(y_test == y_predicted) * 100

print("The accuracy is {0:.1f}%".format(accuracy))

This scores 86.4 percent accuracy, which is impressive for a default algorithm and just a few lines of code! Most scikit-learn default parameters are chosen explicitly to work well with a range of datasets. However, you should always aim to choose parameters based on knowledge of the application and its data.
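The whole workflow above condenses into a short, self-contained sketch. The synthetic data here stands in for the Ionosphere features (two classes separated along the first feature), so the printed accuracy will differ from the 86.4 percent reported for the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: two classes separated along the first feature.
rng = np.random.RandomState(14)
X = rng.random_sample((200, 4))
y = X[:, 0] > 0.5

# Split, fit on the training set, predict on the testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

estimator = KNeighborsClassifier()
estimator.fit(X_train, y_train)
y_predicted = estimator.predict(X_test)

# Accuracy is the percentage of test samples predicted correctly.
accuracy = np.mean(y_test == y_predicted) * 100
print("The accuracy is {0:.1f}%".format(accuracy))
```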

