Learning Data Mining with Python


Chapter 9

Character n-grams are found in text documents by representing the document as a sequence of characters. The n-grams are then extracted from this sequence and a model is trained on them. There are a number of different models for this, but a standard one is very similar to the bag-of-words model we have used earlier.

For each distinct n-gram in the training corpus, we create a feature for it. An example of an n-gram is <e t>, which is the letter e, a space, and then the letter t (the angle brackets denote the start and end of the n-gram and aren't part of it). We then compute the frequency of each n-gram in the training documents and train the classifier using the resulting feature matrix.
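As a concrete sketch (not the chapter's own code), extracting the character n-grams of a string takes only a short helper; note that spaces are kept as ordinary characters:

```python
def char_ngrams(text, n):
    """Extract all character n-grams of length n from a string,
    keeping whitespace and punctuation as ordinary characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# The 3-gram "e t" (the letter e, a space, the letter t) appears here:
print(char_ngrams("the time", 3))
# → ['the', 'he ', 'e t', ' ti', 'tim', 'ime']
```

Counting how often each distinct n-gram occurs per document then gives the feature matrix described above.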

Character n-grams are defined in different ways. For instance, some applications use only within-word characters, ignoring whitespace and punctuation, while others include this information (as our implementation in this chapter does).

A common theory for why character n-grams work is that people more typically write words they can easily say, and character n-grams (at least when n is between 2 and 6) are a good approximation for phonemes: the sounds we make when saying words. In this sense, using character n-grams approximates the sounds of words, which in turn approximates a writing style. This is a common pattern when creating new features: first we have a theory about which concepts will impact the end result (here, authorship style), and then we create features to approximate or measure those concepts.

A notable property of a character n-gram matrix is that it is sparse, and it becomes sparser quite quickly as n increases. For an n-value of 2, approximately 75 percent of our feature matrix is zeros. For an n-value of 5, over 93 percent is zeros. This is typically less sparse than a word n-gram matrix of the same type, though, and shouldn't cause many issues for a classifier that is used for word-based classifications.
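The effect of n on sparsity can be checked on a toy corpus. The documents below are made up for illustration, so the exact percentages will differ from the chapter's dataset, but the trend (higher n, sparser matrix) holds:

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["the cat sat on the mat",
             "the dog sat on the log",
             "a completely different sentence here"]

for n in (2, 5):
    vectorizer = CountVectorizer(analyzer='char', ngram_range=(n, n))
    X = vectorizer.fit_transform(documents)
    # X is a scipy sparse matrix; X.nnz counts its nonzero entries
    sparsity = 1 - X.nnz / (X.shape[0] * X.shape[1])
    print(f"n={n}: {sparsity:.0%} zeros")
```

Scikit-learn stores the result as a scipy sparse matrix, so the zeros cost almost no memory.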

Extracting character n-grams

We are going to use our CountVectorizer class to extract character n-grams. To do that, we set the analyzer parameter to 'char' and specify the value of n to extract n-grams with.

The implementation in scikit-learn uses an n-gram range, allowing you to extract n-grams of multiple sizes at the same time. We won't delve into different n-values in this experiment, so we just set both values the same. To extract n-grams of size 3, you specify (3, 3) as the value for the n-gram range.

