24.07.2016 Views

Learning Data Mining with Python

Chapter 12

The first parameter, /blogs/51* (remember to prepend the full path to your data folder), selects a sample of the data: all files starting with 51, which is only 11 documents. We then set the output directory to a new folder inside the data folder and specify that the streamed data should not be output. Without this last option, the output data is printed to the command line when we run the script, which isn't very helpful to us and slows down the computer quite a lot.

Run the script, and quite quickly each of the blog posts will be extracted and stored in our output folder. This script only ran on a single thread on the local computer, so we didn't get a speedup at all, but we know the code runs.

We can now look in the output folder for the results. Several files are created, and each file contains one blog post per line, preceded by the gender of the blog's author.

Training Naive Bayes

Now that we have extracted the blog posts, we can train our Naive Bayes model on them. The intuition is that we record the probability of a word being written by a particular gender. To classify a new sample, we would multiply the probabilities of its words for each gender and choose the gender with the highest result.
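The intuition above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the chapter's own code: the `word_probs` values are made up, the `classify` function and its `floor` parameter (a small probability used when a gender has never used a word) are hypothetical, and both genders are assumed equally likely beforehand. Logarithms are summed rather than multiplying the raw probabilities, which gives the same ranking while avoiding underflow on long posts.

```python
import math

# Hypothetical per-word probabilities, keyed by word and then by gender.
# These numbers are invented for illustration only.
word_probs = {
    "football": {"male": 0.004, "female": 0.001},
    "dress": {"male": 0.001, "female": 0.005},
    "today": {"male": 0.003, "female": 0.003},
}


def classify(words, word_probs, genders=("male", "female"), floor=1e-5):
    """Return the gender whose word probabilities give the largest product."""
    scores = {}
    for gender in genders:
        # Summing logs ranks genders the same way as multiplying
        # probabilities, without floating-point underflow.
        scores[gender] = sum(
            math.log(word_probs[word].get(gender, floor))
            for word in words
            if word in word_probs  # words never seen in training are skipped
        )
    return max(scores, key=scores.get)


prediction = classify(["football", "today"], word_probs)
```

Here `prediction` is the gender with the higher score for the two given words. Skipping unseen words and flooring missing probabilities are design choices for this sketch; other smoothing schemes are equally reasonable.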

The aim of this code is to output a file that lists each word in the corpus, along with the frequencies of that word for each gender. The output file will look something like this:

"'ailleurs" {"female": 0.003205128205128205}
"'air" {"female": 0.003205128205128205}
"'an" {"male": 0.0030581039755351682, "female": 0.004273504273504274}
"'angoisse" {"female": 0.003205128205128205}
"'apprendra" {"male": 0.0013047113868622459, "female": 0.0014172668603481887}
"'attendent" {"female": 0.00641025641025641}
"'autistic" {"male": 0.002150537634408602}
"'auto" {"female": 0.003205128205128205}
"'avais" {"female": 0.00641025641025641}
"'avait" {"female": 0.004273504273504274}
"'behind" {"male": 0.0024390243902439024}
"'bout" {"female": 0.002034152292059272}
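To make the shape of this output concrete, here is a minimal single-process sketch that produces lines in a similar format. It is not the chapter's MapReduce job. The functions `word_frequencies` and `format_line` are hypothetical names, the two sample posts are invented, the separator between the word and the dictionary is assumed to be a tab, and "frequency" is assumed to mean a word's count divided by the total number of words written by that gender, which may differ from the normalization used in the actual job.

```python
import json
from collections import Counter, defaultdict


def word_frequencies(posts):
    """Map each word to {gender: relative frequency within that gender}.

    `posts` is an iterable of (gender, text) pairs.
    """
    counts = defaultdict(Counter)  # gender -> word counts
    for gender, text in posts:
        counts[gender].update(text.lower().split())

    frequencies = defaultdict(dict)  # word -> {gender: frequency}
    for gender, word_counts in counts.items():
        total = sum(word_counts.values())
        for word, count in word_counts.items():
            frequencies[word][gender] = count / total
    return dict(frequencies)


def format_line(word, per_gender):
    """Render one word as a JSON key and JSON value (tab separator assumed)."""
    return json.dumps(word) + "\t" + json.dumps(per_gender)


# Invented sample data, for illustration only.
posts = [
    ("male", "the game was great"),
    ("female", "the book was great great"),
]
freqs = word_frequencies(posts)
lines = [format_line(word, freqs[word]) for word in sorted(freqs)]
```

Words used by only one gender end up with a single key in their dictionary, which matches entries such as "'behind" in the listing above.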
