
Finally, we take the last value of the row and set the class. We set it to 1 (or True) if it is a good sample, and 0 if it is not:

y[i] = row[-1] == 'g'

We now have a dataset of samples and features in X, and the corresponding classes in y, as we did in the classification example in Chapter 1, Getting Started with Data Mining.

Moving towards a standard workflow

Estimators in scikit-learn have two main functions: fit() and predict(). We train the algorithm using the fit method and our training set. We evaluate it using the predict method on our testing set.

First, we need to create these training and testing sets. As before, import and run the train_test_split function:

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

Then, we import the nearest neighbor class and create an instance of it. We leave the parameters at their defaults for now and will choose good parameters later in this chapter. By default, the algorithm will use the five nearest neighbors to predict the class of a testing sample:

from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()

After creating our estimator, we must fit it on our training dataset. For the nearest neighbor class, this simply records the training dataset, allowing us to find the nearest neighbors of a new data point by comparing it to the samples in the training data:

estimator.fit(X_train, y_train)

We then run the trained estimator on our testing set and evaluate its predictions:

y_predicted = estimator.predict(X_test)
accuracy = np.mean(y_test == y_predicted) * 100
print("The accuracy is {0:.1f}%".format(accuracy))

This scores 86.4 percent accuracy, which is impressive for a default algorithm and just a few lines of code! Most scikit-learn default parameters are chosen explicitly to work well with a range of datasets. However, you should always aim to choose parameters based on knowledge of the application and on experimentation.
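As a quick aside, scikit-learn classifiers also provide a score() method that computes the mean accuracy on a given dataset directly, so the manual comparison above can be shortened. The following is a minimal sketch, assuming the estimator, X_test, and y_test created in the code above:

# score() returns the mean accuracy of the estimator's predictions on X_test against y_test
accuracy = estimator.score(X_test, y_test) * 100
print("The accuracy is {0:.1f}%".format(accuracy))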

Running the algorithm

In our earlier experiments, we set aside a portion of the dataset as a testing set, with the rest being the training set. We train our algorithm on the training set and evaluate how effective it will be based on the testing set. However, what happens if we get lucky and choose an easy testing set? Alternatively, what if it is particularly troublesome? We can discard a good model due to poor results caused by such an "unlucky" split of our data.

The cross-fold validation framework is a way to address the problem of choosing a testing set, and it is a standard methodology in data mining. The process works by doing a number of experiments with different training and testing splits, but using each sample in a testing set only once. The procedure is as follows:

1. Split the entire dataset into a number of sections called folds.
2. For each fold in the dataset, execute the following steps:
    ° Set that fold aside as the current testing set
    ° Train the algorithm on the remaining folds
    ° Evaluate on the current testing set
3. Report on all the evaluation scores, including the average score.

In this process, each sample is used in the testing set only once. This reduces (but doesn't completely eliminate) the likelihood of choosing lucky testing sets.

Throughout this book, the code examples build upon each other within a chapter. Each chapter's code should be entered into the same IPython Notebook, unless otherwise specified.

The scikit-learn library contains a number of cross-fold validation methods. A helper function is given that performs the preceding procedure. We can import it now in our IPython Notebook:

from sklearn.cross_validation import cross_val_score

By default, cross_val_score uses a specific methodology called Stratified K-Fold to split the dataset into folds. This creates folds that have approximately the same proportion of classes in each fold, again reducing the likelihood of choosing poor folds. This is a great default, so we won't mess with it right now.
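Putting this together, we can pass our estimator and the full dataset to cross_val_score, which runs the whole fold-by-fold procedure for us and returns one score per fold. The following is a minimal sketch, assuming the estimator, X, y, and the numpy import (np) from earlier in this chapter; the scoring='accuracy' argument asks for classification accuracy on each fold:

# Run cross-fold validation: returns an array with one accuracy score per fold
scores = cross_val_score(estimator, X, y, scoring='accuracy')
# Summarize the experiments by averaging the per-fold scores
average_accuracy = np.mean(scores) * 100
print("The average accuracy is {0:.1f}%".format(average_accuracy))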

