Learning Data Mining with Python

Authorship Attribution

We can reuse the grid search from our previous code. All we need to do is specify the new feature extractor in a new pipeline:

pipeline = Pipeline([
    ('feature_extraction', CountVectorizer(analyzer='char', ngram_range=(3, 3))),
    ('classifier', grid)
])
scores = cross_val_score(pipeline, documents, classes, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))

There is a lot of implicit overlap between function words and character n-grams, because the character sequences that make up function words appear frequently. However, the actual features are very different, and character n-grams capture punctuation, which function words do not. For example, a character n-gram can include the full stop at the end of a sentence, while a function word-based method would only use the preceding word itself.
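
The difference can be seen in a minimal pure-Python sketch. It is an illustration only, not the CountVectorizer implementation: the helper names `char_ngrams` and `function_word_counts`, the sample sentence, and the tiny function word list are all assumptions made for this example, and the sketch skips any preprocessing a real vectorizer may apply.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count every overlapping run of n characters in text (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def function_word_counts(text, function_words):
    """Count only the listed function words, discarding punctuation (hypothetical helper)."""
    tokens = [token.strip(".,;:!?").lower() for token in text.split()]
    return Counter(token for token in tokens if token in function_words)


# Sample sentence and word list are made up for illustration.
sentence = "It is the end. It was the start."
trigram_counts = char_ngrams(sentence, n=3)
word_counts = function_word_counts(sentence, {"it", "is", "the", "was"})

# The function word "the" is also a character trigram, so both views count it.
print(trigram_counts["the"], word_counts["the"])

# Only the trigram view keeps the full stop that ends the first sentence.
print(trigram_counts["d. "])
print(any("." in word for word in word_counts))
```

Both views see the word "the", but only the character trigrams keep a feature such as "d. ", which spans the end of one sentence and the space before the next.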

Using the Enron dataset

Enron was one of the largest energy companies in the world in the late 1990s, reporting revenue over $100 billion. It had over 20,000 staff and, as of the year 2000, there seemed to be no indications that something was very wrong.

In 2001, the Enron scandal broke, when it was discovered that Enron was undertaking systematic, fraudulent accounting practices. This fraud was deliberate, wide-ranging across the company, and involved significant amounts of money. After this became public, its share price dropped from more than $90 in 2000 to less than $1 in 2001. Enron filed for bankruptcy shortly afterward, in a mess that would take more than 5 years to finally be resolved.

As part of the investigation into Enron, the Federal Energy Regulatory Commission in the United States made more than 600,000 e-mails publicly available. Since then, this dataset has been used for everything from social network analysis to fraud analysis. It is also a great dataset for authorship analysis, as we are able to extract e-mails from the sent folder of individual users. This allows us to create a dataset much larger than many previous datasets.
