Learning Data Mining with Python

Chapter 9

We can see that the authors are predicted correctly in most cases: there is a clear diagonal line with high values. There are some large sources of error, though (darker values are larger): e-mails from user baughman-d, for instance, are typically predicted as being from reitmeyer-j.
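The kind of matrix being described can be reproduced with scikit-learn's `confusion_matrix`. This is a minimal sketch with made-up labels standing in for the chapter's actual author predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted author labels; the real arrays
# come from the Enron experiment, these merely stand in for them.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred)
# Rows are true authors, columns are predicted authors. Large values
# off the diagonal mark systematic confusions, such as the
# baughman-d / reitmeyer-j errors described above.
print(cm)
```

Plotting this matrix as an image (for example with `matplotlib.pyplot.imshow`) gives exactly the diagonal-with-dark-spots picture discussed above.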

Summary

In this chapter, we looked at the text mining problem of authorship attribution. To perform it, we analyzed two types of features: function words and character n-grams. For function words, we were able to use the bag-of-words model, simply restricted to a set of words we chose beforehand; this gave us the frequencies of only those words. For character n-grams, we used a very similar workflow with the same class; however, we changed the analyzer to look at characters rather than words. An n-gram is a sequence of n tokens in a row, in our case characters. Word n-grams are also worth testing in some applications, as they can provide a cheap way to capture the context in which a word is used.
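Both feature types can be extracted with scikit-learn's `CountVectorizer`, switching only its parameters. A minimal sketch, with a tiny illustrative function-word list rather than the chapter's real one:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate the bone"]

# Function words: the bag-of-words model restricted to a fixed
# vocabulary chosen beforehand (a toy list here, not the book's).
function_words = ["the", "on", "at", "of"]
word_vec = CountVectorizer(vocabulary=function_words)
word_counts = word_vec.fit_transform(docs)
print(word_counts.toarray())

# Character n-grams: the same class, with the analyzer switched to
# characters and n-grams of length 3.
char_vec = CountVectorizer(analyzer='char', ngram_range=(3, 3))
char_counts = char_vec.fit_transform(docs)
print(char_counts.shape)
```

Only the `analyzer` and `ngram_range` parameters change between the two feature types, which is why the two workflows are so similar.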

For classification, we used SVMs, which optimize a line of separation between the classes based on the idea of finding the maximum margin: anything on one side of the line belongs to one class, and anything on the other side belongs to the other. As with the other classification tasks we have considered, we had a set of samples (in this case, our documents).
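The full workflow of vectorizing documents and training an SVM fits naturally into a scikit-learn `Pipeline`. This is a sketch on a toy two-author corpus, not the chapter's actual experiment:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus with two "authors"; the real experiment uses full
# documents, but the pipeline has the same shape.
docs = ["hello there friend", "hello again friend",
        "greetings colleague", "greetings dear colleague"]
authors = [0, 0, 1, 1]

# Character 3-gram features feeding a linear-kernel SVM.
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(analyzer='char', ngram_range=(3, 3))),
    ('classifier', SVC(kernel='linear'))
])
pipeline.fit(docs, authors)
predictions = pipeline.predict(docs)
print(predictions)
```

Wrapping the steps in a pipeline also makes it easy to cross-validate the whole process, so the vectorizer is always fit only on the training folds.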

We then used a very messy dataset, the Enron e-mails. This dataset contains many artefacts and other issues, which resulted in a lower accuracy than on the books dataset, which was much cleaner. Even so, we were able to choose the correct author more than half the time, out of 10 possible authors.
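To put that number in context: with 10 balanced classes, guessing at random gives about 10% accuracy, so better than 50% is well above chance. Accuracy like this is typically estimated with cross-validation; a sketch on synthetic stand-in data (the real features are character n-gram counts):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a 10-author task; these are random
# informative features, not the Enron n-gram counts.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, n_classes=10,
                           random_state=14)

scores = cross_val_score(SVC(kernel='linear'), X, y, cv=3)
print(scores.mean())  # chance level for 10 balanced classes is 0.1
```

Averaging over folds gives a more reliable estimate than a single train/test split, which matters on messy data like Enron.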

In the next chapter, we consider what we can do when we don't have target classes. This is called unsupervised learning: an exploratory problem rather than a prediction problem. We also continue to deal with messy text-based datasets.

