Chapter 9

Evaluation

It is rarely a good idea to base an assessment on a single number. The f-score is usually more robust than metrics that give good scores to models that are not actually useful. Accuracy is an example of such a metric: as we said in the previous chapter, a spam classifier could predict everything as spam and achieve over 80 percent accuracy, even though that solution is not useful at all. For that reason, it is usually worth going more in-depth on the results.

To start with, we will look at the confusion matrix, as we did in Chapter 8, Beating CAPTCHAs with Neural Networks. Before we can do that, we need to predict on a testing set. The previous code used cross_val_score, which doesn't actually give us a trained model we can use, so we will need to refit one. To do that, we need training and testing subsets:

    from sklearn.model_selection import train_test_split
    training_documents, testing_documents, y_train, y_test = \
        train_test_split(documents, classes, random_state=14)

(In older versions of scikit-learn, train_test_split lived in the sklearn.cross_validation module, which has since been removed.)

Next, we fit the pipeline to our training documents and create our predictions for the testing set:

    pipeline.fit(training_documents, y_train)
    y_pred = pipeline.predict(testing_documents)

At this point, you might be wondering what the best combination of parameters actually was. We can extract this quite easily from our grid search object, which is the classifier step of our pipeline:

    print(pipeline.named_steps['classifier'].best_params_)

The results give you all of the parameters for the classifier. However, most of the parameters are defaults that we didn't touch. The ones we did search over were C and kernel, which were set to 1 and linear, respectively.

Now we can create a confusion matrix. Note that confusion_matrix expects the true labels first and the predictions second, and that each row must be divided by its own sum (hence keepdims=True) to turn counts into per-author fractions:

    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred)
    cm = cm / cm.sum(axis=1, keepdims=True)  # normalize each row to sum to 1

Next, we get our authors so that we can label the axes correctly. For this purpose, we use the authors dictionary that we built when loading the Enron dataset.
The code is as follows:

    sorted_authors = sorted(authors.keys(), key=lambda x: authors[x])
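This sorts the author names by the numeric class id stored as the dictionary's value, so that the label order matches the row and column order of the confusion matrix. As a quick illustrative sketch, with a made-up authors mapping (the names and ids here are hypothetical, not from the Enron dataset):

```python
# Hypothetical mapping from author name to the class id assigned at load time.
authors = {"carol": 2, "alice": 0, "bob": 1}

# Sorting the keys by their class id puts the names in the same order as the
# confusion matrix rows and columns (class 0 first, then class 1, and so on).
sorted_authors = sorted(authors.keys(), key=lambda x: authors[x])
print(sorted_authors)  # ['alice', 'bob', 'carol']
```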
Finally, we show the confusion matrix using matplotlib. The only change from the last chapter's plotting code is that the letter labels are replaced with the authors from this chapter's experiments:

    %matplotlib inline
    from matplotlib import pyplot as plt
    plt.figure(figsize=(10, 10))
    plt.imshow(cm, cmap='Blues')
    tick_marks = np.arange(len(sorted_authors))
    plt.xticks(tick_marks, sorted_authors)
    plt.yticks(tick_marks, sorted_authors)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

The results are shown in the following figure:
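Before moving on, the accuracy pitfall mentioned at the start of this section is easy to demonstrate with a toy sketch. The 80/20 spam split below is made up to mirror the chapter's example, and is not real data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 80 spam messages (label 1) and 20 legitimate ones (label 0).
y_true = np.array([1] * 80 + [0] * 20)

# A useless "classifier" that predicts spam for everything.
y_pred = np.ones(100, dtype=int)

print(accuracy_score(y_true, y_pred))          # 0.8 -- looks respectable
print(f1_score(y_true, y_pred, pos_label=0))   # 0.0 -- no ham is ever found
```

The headline accuracy looks fine, but computing the f-score for the minority class immediately exposes that the model never identifies a single legitimate message, which is why per-class scores and the confusion matrix above are worth the extra effort.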