
Social Media Insight Using Naive Bayes

Printing c.most_common(5) gives the list of the five most frequently occurring words. Ties are not handled well: only five entries are returned, even if a large number of words are tied for fifth place.
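
As a minimal sketch of the call above (the token list is a made-up stand-in for words extracted from a document):

    from collections import Counter

    # Hypothetical tokens standing in for words extracted from a document
    words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran"]
    c = Counter(words)

    # Returns (word, count) pairs in descending order of count;
    # any words tied with the fifth entry are silently dropped
    print(c.most_common(5))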

The bag-of-words model has three major types. The first is to use the raw frequencies, as shown in the preceding example. This has a drawback when documents vary in length, as the overall values will be very different for short and long documents. The second is to use the normalized frequency, where each document's counts sum to 1. This is a much better solution, as the length of the document matters much less. The third is to simply use binary features: a value is 1 if the word occurs at all and 0 if it does not. We will use the binary representation in this chapter.
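
The three variants can be sketched in a few lines of NumPy; the counts array below is invented purely for illustration:

    import numpy as np

    # Hypothetical raw term counts: three documents (rows), four words (columns)
    counts = np.array([[3, 0, 1, 0],
                       [10, 2, 0, 8],
                       [1, 1, 1, 1]], dtype=float)

    # Normalized frequencies: each row sums to 1, so document length matters less
    normalized = counts / counts.sum(axis=1, keepdims=True)

    # Binary features: 1 if the word occurs at all, 0 otherwise
    binary = (counts > 0).astype(int)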

Another popular (arguably more popular) method of normalization is called term frequency-inverse document frequency, or tf-idf. In this weighting scheme, term counts are first normalized to frequencies and then scaled down by the number of documents in the corpus in which the term appears, so that words common to nearly all documents receive lower weights. We will use tf-idf in Chapter 10, Clustering News Articles.
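
One common way to compute tf-idf in practice is scikit-learn's TfidfVectorizer class; the corpus below is made up for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical corpus of three short documents
    documents = ["the cat sat on the mat",
                 "the dog sat on the log",
                 "cats and dogs"]

    # fit_transform returns a sparse matrix of tf-idf weights,
    # one row per document and one column per term
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(documents)
    print(X.shape)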

There are a number of libraries for working with text data in Python. We will use a major one, called the Natural Language Toolkit (NLTK). The scikit-learn library also has the CountVectorizer class, which performs a similar job, and it is recommended you take a look at it (we will use it in Chapter 9, Authorship Attribution). However, the NLTK version has more options for word tokenization. If you are doing natural language processing in Python, NLTK is a great library to use.
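
As a small sketch contrasting the two (NLTK's tokenizer models must be downloaded once with nltk.download('punkt'); the sample sentence is made up):

    from nltk import word_tokenize
    from sklearn.feature_extraction.text import CountVectorizer

    sentence = "Don't stop believing!"

    # NLTK's tokenizer splits contractions and keeps punctuation as tokens
    print(word_tokenize(sentence))

    # CountVectorizer uses a simpler regular-expression tokenizer by default,
    # dropping punctuation and single-character tokens
    vectorizer = CountVectorizer(binary=True)  # binary features, as in this chapter
    vectorizer.fit([sentence])
    print(vectorizer.vocabulary_)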

N-grams

A step up from single bag-of-words features is that of n-grams. An n-gram is a subsequence of n consecutive tokens. In this context, a word n-gram is a sequence of n words that appear in a row.

They are counted in the same way as single words, with each n-gram treated as a word that is put in the bag. The value of a cell in this dataset is the frequency with which a particular n-gram appears in the given document.

The value of n is a parameter. For English, setting it to between 2 and 5 is a good start, although some applications call for higher values.
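
A minimal sketch of n-gram extraction using CountVectorizer's ngram_range parameter (the sample document is invented):

    from sklearn.feature_extraction.text import CountVectorizer

    documents = ["the quick brown fox jumps"]

    # Extract word bigrams and trigrams (n from 2 to 3);
    # each n-gram becomes a single token in the bag
    vectorizer = CountVectorizer(ngram_range=(2, 3))
    vectorizer.fit(documents)
    print(vectorizer.vocabulary_)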

