Learning Data Mining with Python

Chapter 12

The first function is the mapper for the first step. Its goal is to take each blog post, extract all the words in that post, and note each occurrence. We want word frequencies, so we return 1 / len(all_words), which allows us to later sum the values into frequencies. The computation here isn't exactly correct: we would also need to normalize for the number of documents. In this dataset, however, the class sizes are the same, so we can conveniently ignore this with little impact on our final result.

We also output the gender of the post's author, as we will need it later:

    def extract_words_mapping(self, key, value):
        tokens = value.split()
        gender = eval(tokens[0])
        blog_post = eval(" ".join(tokens[1:]))
        all_words = word_search_re.findall(blog_post)
        all_words = [word.lower() for word in all_words]
        for word in all_words:
            yield (gender, word), 1. / len(all_words)
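The behavior of this mapper can be checked on a single sample line without running a Hadoop job. The sketch below writes it as a plain function (no self) and assumes a simple word pattern for word_search_re; the chapter defines its own pattern, so treat the regex here as an illustration:

```python
import re

# Assumed word pattern; the chapter defines its own word_search_re
word_search_re = re.compile(r"[\w']+")

def extract_words_mapping(key, value):
    tokens = value.split()
    gender = eval(tokens[0])
    blog_post = eval(" ".join(tokens[1:]))
    all_words = [word.lower() for word in word_search_re.findall(blog_post)]
    for word in all_words:
        yield (gender, word), 1. / len(all_words)

# A line as stored in the dataset: the repr of the gender, then the repr of the post
line = "'male' 'Data mining with Python'"
results = list(extract_words_mapping(None, line))
# Four words, each carrying a fraction of 1/4
```

Each emitted pair shares the same key structure, (gender, word), so Hadoop's shuffle phase will group all fractions for a given gender and word together before the reducer runs.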

We used eval in the preceding code to simplify the parsing of the blog posts from the file, for this example. This is not recommended. Instead, use a format such as JSON to properly store and parse the data from the files. A malicious user with access to the dataset could insert code into these tokens and have that code run on your server.
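A safer parsing routine could use the json module instead. The sketch below assumes each line is stored as a JSON array of gender and post text; this field layout is an illustration, not the book's actual file format:

```python
import json

def parse_line(line):
    # Assumed layout: a JSON array of [gender, blog_post],
    # e.g. '["male", "Some blog post text"]'
    gender, blog_post = json.loads(line)
    return gender, blog_post

gender, post = parse_line('["male", "Hello world"]')
# json.loads only deserializes data; unlike eval, it never executes
# code embedded in the input
```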

In the reducer for the first step, we sum the frequencies for each (gender, word) pair. We also change the key to be the word, rather than the combination, as this allows us to search by word when we use the final trained model (although we still need to output the gender for later use):

    def reducer_count_words(self, key, frequencies):
        s = sum(frequencies)
        gender, word = key
        yield word, (gender, s)
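The interaction between the shuffle phase and this reducer can be simulated in plain Python. The sketch below groups hypothetical mapper output by key, then applies the reducer's logic of summing fractions and re-keying by word:

```python
from collections import defaultdict

# Hypothetical mapper output: ((gender, word), fraction) pairs from three posts
mapper_output = [
    (("male", "python"), 0.5), (("male", "data"), 0.5),
    (("female", "python"), 1.0),
    (("male", "python"), 0.5), (("male", "mining"), 0.5),
]

# Shuffle: group all values sharing a key, as Hadoop does between map and reduce
grouped = defaultdict(list)
for key, fraction in mapper_output:
    grouped[key].append(fraction)

# Reduce: sum the fractions per (gender, word), then re-key by word alone
reduced = []
for (gender, word), fractions in grouped.items():
    reduced.append((word, (gender, sum(fractions))))
# "python" now appears twice, once per gender, which is what lets the
# final step combine both genders' frequencies under a single word
```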

The final step doesn't need a mapper function, so we don't add one; the data will pass straight through as a type of identity mapper. The reducer, however, will combine the frequencies for each gender under the given word and then output the word and its frequency dictionary.
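A reducer matching that description might look like the following. This is a sketch rather than the book's exact code; the method name is assumed:

```python
def compare_words_reducer(self, word, values):
    # values holds the (gender, frequency) pairs emitted for this word
    # by the previous step; collect them into one dictionary per word
    per_gender = {}
    for gender, frequency in values:
        per_gender[gender] = frequency
    yield word, per_gender
```

Calling it on grouped values for one word yields a single (word, dictionary) pair, which is the per-word model the trained classifier can later look up.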

