Chapter 2

Finally, we take the last value of the row and set the class. We set it to 1 (or True) if it is a good sample, and 0 if it is not:

y[i] = row[-1] == 'g'

We now have a dataset of samples and features in X, and the corresponding classes in y, as we did in the classification example in Chapter 1, Getting Started with Data Mining.

Moving towards a standard workflow

Estimators in scikit-learn have two main functions: fit() and predict(). We train the algorithm using the fit method and our training set, and evaluate it using the predict method on our testing set.

First, we need to create these training and testing sets. As before, import and run the train_test_split function:

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

Then, we import the nearest neighbor class and create an instance of it. We leave the parameters at their defaults for now, and will choose good parameters later in this chapter. By default, the algorithm will use the five nearest neighbors to predict the class of a testing sample:

from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()

After creating our estimator, we must fit it on our training dataset. For the nearest neighbor class, this simply records the dataset, allowing us to find the nearest neighbors of a new data point by comparing that point to the training dataset:

estimator.fit(X_train, y_train)

We then run the trained algorithm on our testing set and evaluate its accuracy:

y_predicted = estimator.predict(X_test)
accuracy = np.mean(y_test == y_predicted) * 100
print("The accuracy is {0:.1f}%".format(accuracy))

This scores 86.4 percent accuracy, which is impressive for a default algorithm and just a few lines of code! Most scikit-learn default parameters are chosen deliberately to work well with a range of datasets. However, you should always aim to choose parameters based on knowledge of the application, and experiment to find what works best.
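The whole workflow above can be sketched end to end. As a minimal, self-contained version, the snippet below substitutes a synthetic dataset for the Ionosphere data loaded earlier (the generated data is an assumption, chosen only so the sketch runs on its own), and imports the split helper from sklearn.model_selection, where it lives in scikit-learn 0.20 and later:

```python
import numpy as np
from sklearn.datasets import make_classification
# In scikit-learn 0.20+, train_test_split moved from
# sklearn.cross_validation to sklearn.model_selection.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Ionosphere dataset (an assumption;
# 351 samples and 34 features mirror that dataset's shape).
X, y = make_classification(n_samples=351, n_features=34, random_state=14)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

estimator = KNeighborsClassifier()  # defaults to the five nearest neighbors
estimator.fit(X_train, y_train)     # simply records the training data

y_predicted = estimator.predict(X_test)
accuracy = np.mean(y_test == y_predicted) * 100
print("The accuracy is {0:.1f}%".format(accuracy))
```

The exact accuracy printed will differ from the 86.4 percent reported for the real Ionosphere data, since the stand-in data is randomly generated.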
Classifying with scikit-learn Estimators

Running the algorithm

In our earlier experiments, we set aside a portion of the dataset as a testing set, with the rest being the training set. We train our algorithm on the training set and evaluate how effective it will be based on the testing set. However, what happens if we get lucky and choose an easy testing set? Alternatively, what if it was particularly troublesome? We can discard a good model due to the poor results that such an "unlucky" split of our data produces.

The cross-fold validation framework addresses the problem of choosing a testing set and is a standard methodology in data mining. The process works by doing a number of experiments with different training and testing splits, but using each sample in a testing set only once. The procedure is as follows:

1. Split the entire dataset into a number of sections, called folds.
2. For each fold in the dataset, execute the following steps:
   - Set that fold aside as the current testing set
   - Train the algorithm on the remaining folds
   - Evaluate on the current testing set
3. Report on all the evaluation scores, including the average score.

In this process, each sample is used in the testing set only once. This reduces (but doesn't completely eliminate) the likelihood of choosing lucky testing sets.

Throughout this book, the code examples build upon each other within a chapter. Each chapter's code should be entered into the same IPython Notebook, unless otherwise specified.

The scikit-learn library contains a number of cross-fold validation methods, including a helper function that performs the preceding procedure. We can import it now in our IPython Notebook:

from sklearn.cross_validation import cross_val_score

By default, cross_val_score uses a specific methodology called Stratified K-Fold to split the dataset into folds. This creates folds that have approximately the same proportion of classes in each fold, again reducing the likelihood of choosing poor folds. This is a great default, so we won't mess with it right now.
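The three-step procedure above can be sketched by hand, and then compared with the one-call helper function. The snippet below again uses a synthetic stand-in dataset (an assumption; the chapter works with the Ionosphere data), and imports from sklearn.model_selection, where these helpers live in scikit-learn 0.20 and later:

```python
import numpy as np
from sklearn.datasets import make_classification
# In scikit-learn 0.20+, these helpers moved from
# sklearn.cross_validation to sklearn.model_selection.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Ionosphere dataset (an assumption).
X, y = make_classification(n_samples=351, n_features=34, random_state=14)

# Steps 1-3 by hand: split into folds, hold each fold out once as the
# testing set, train on the remaining folds, evaluate on the held-out fold.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    estimator = KNeighborsClassifier()
    estimator.fit(X[train_idx], y[train_idx])
    y_predicted = estimator.predict(X[test_idx])
    manual_scores.append(np.mean(y[test_idx] == y_predicted))

# The helper function performs the same procedure in a single call,
# returning one accuracy score per fold:
scores = cross_val_score(KNeighborsClassifier(), X, y,
                         scoring='accuracy', cv=5)

print("The average accuracy is {0:.1f}%".format(np.mean(scores) * 100))
```

Because cross_val_score also stratifies by class for classifiers (and neither loop shuffles), both approaches evaluate the same five splits.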