
Learning Data Mining with Python


Chapter 9

Evaluation

It is rarely a good idea to base an assessment on a single number. The f-score is usually more robust than accuracy against tricks that give good scores despite not being useful. As we said in our previous chapter, a spam classifier could predict everything as being spam and get over 80 percent accuracy, although that solution is not useful at all. For that reason, it is usually worth going more in-depth on the results.

To start with, we will look at the confusion matrix, as we did in Chapter 8, Beating CAPTCHAs with Neural Networks. Before we can do that, we need predictions for a testing set. The previous code uses cross_val_score, which doesn't actually give us a trained model we can use. So, we will need to refit one. To do that, we need training and testing subsets:

    # In scikit-learn releases before 0.18, this function lived in sklearn.cross_validation
    from sklearn.model_selection import train_test_split
    training_documents, testing_documents, y_train, y_test = train_test_split(
        documents, classes, random_state=14)

Next, we fit the pipeline to our training documents and create our predictions for the testing set:

    pipeline.fit(training_documents, y_train)
    y_pred = pipeline.predict(testing_documents)

At this point, you might be wondering what the best combination of parameters actually was. We can extract this quite easily from our grid search object (which is the classifier step of our pipeline):

    print(pipeline.named_steps['classifier'].best_params_)

The result is a dictionary of the parameter values that the grid search selected. The parameters we searched over were C and kernel, which were set to 1 and linear, respectively. All the other parameters of the classifier are defaults that we didn't touch.

Now we can create a confusion matrix:

    import numpy as np
    from sklearn.metrics import confusion_matrix
    # Rows are the actual authors, columns are the predicted authors
    cm = confusion_matrix(y_test, y_pred)
    # Normalize each row so that it sums to 1
    cm = cm.astype(float) / cm.sum(axis=1, keepdims=True)

Next we get our authors so that we can label the axes correctly. For this purpose, we use the authors dictionary that was built when we loaded our Enron dataset. The code is as follows:

    # Order the author names by their class number
    sorted_authors = sorted(authors.keys(), key=lambda x: authors[x])
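To see what the row normalization does and why a single accuracy figure can mislead, here is a minimal sketch in plain Python (no scikit-learn). It builds a confusion matrix with actual classes as rows for a toy version of the spam example above. The 80/20 class split, the label names, and the three helper functions are assumptions made for illustration, and macro averaging is only one of several ways to average an f-score.

```python
def confusion_counts(y_true, y_pred, labels):
    """Count matrix with actual classes as rows and predicted classes as columns."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its sum, so a row shows where that actual class ended up."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        fp = sum(matrix[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(matrix[k]) - tp                       # actually k, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Hypothetical data: 80 spam and 20 ham emails, with everything predicted as spam
labels = ["ham", "spam"]
y_true = ["spam"] * 80 + ["ham"] * 20
y_pred = ["spam"] * 100

counts = confusion_counts(y_true, y_pred, labels)
accuracy = sum(counts[i][i] for i in range(len(labels))) / len(y_true)

print(counts)
print(normalize_rows(counts))
print(round(accuracy, 3), round(macro_f1(counts), 3))
```

In this toy case the accuracy is 0.8 while the macro-averaged F1 is about 0.44, because the ham row of the matrix has nothing on its diagonal.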

Authorship Attribution

Finally, we show the confusion matrix using matplotlib. The only changes from the last chapter are the tick labels; just replace the letter labels with the authors from this chapter's experiments:

    %matplotlib inline
    from matplotlib import pyplot as plt
    plt.figure(figsize=(10,10))
    plt.imshow(cm, cmap='Blues')
    # One tick per author, labeled with the author's name
    tick_marks = np.arange(len(sorted_authors))
    plt.xticks(tick_marks, sorted_authors)
    plt.yticks(tick_marks, sorted_authors)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

The results are shown in the following figure:
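A heat map is easy to scan, but it can also help to list the largest off-diagonal cells directly. The sketch below assumes a row-normalized matrix with actual authors as rows and predicted authors as columns. The matrix values, the author names, and the `top_confusions` helper are made up for illustration and are not results from the Enron experiment.

```python
def top_confusions(cm, names, n=3):
    """Return the n largest off-diagonal cells as (actual, predicted, rate) tuples."""
    cells = [
        (names[i], names[j], cm[i][j])
        for i in range(len(names))
        for j in range(len(names))
        if i != j and cm[i][j] > 0
    ]
    # Highest confusion rate first
    cells.sort(key=lambda cell: cell[2], reverse=True)
    return cells[:n]


# Hypothetical row-normalized matrix for three placeholder authors
names = ["author-a", "author-b", "author-c"]
cm = [
    [0.90, 0.10, 0.00],
    [0.25, 0.70, 0.05],
    [0.00, 0.40, 0.60],
]

for actual, predicted, rate in top_confusions(cm, names):
    print(f"{actual} mistaken for {predicted}: {rate:.0%}")
```

With a real matrix, `cm.tolist()` converts the NumPy array into the nested lists this helper expects, and `sorted_authors` can be passed as `names`.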

