24.07.2016 Views

Learning Data Mining with Python


Chapter 12

The map function takes a key-value pair and returns a list of key-value pairs. The keys for the input and output don't necessarily relate to each other. For example, for a MapReduce program that performs a word count, the input key might be a sample document's ID value, while the output key would be a given word. The input value would be the text of the document and the output value would be the frequency of each word:

from collections import defaultdict

def map_word_count(document_id, document):

We first count the frequency of each word. In this simplified example, we split the document on whitespace to obtain the words, although there are better options:

    counts = defaultdict(int)
    for word in document.split():
        counts[word] += 1

We then yield each (word, count) pair. The word here is the key, and the count is the value in MapReduce terms:

    for word in counts:
        yield (word, counts[word])
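Assembled into a single function, the fragments above can be sketched as follows. The document ID and text used to call it are made up for illustration, and the whitespace split is the same simplification described in the text.

```python
from collections import defaultdict


def map_word_count(document_id, document):
    """Map step: yield one (word, count) pair per distinct word."""
    # document_id is the input key; this map step does not need it
    # to compute its output keys (the words).
    counts = defaultdict(int)
    for word in document.split():
        counts[word] += 1
    for word in counts:
        yield (word, counts[word])


# Hypothetical input, not taken from the original text.
pairs = dict(map_word_count("doc-1", "the cat saw the dog"))
print(pairs)  # {'the': 2, 'cat': 1, 'saw': 1, 'dog': 1}
```

Because the function yields its pairs, it returns a generator rather than a list; wrapping the call in `dict` or `list` consumes it.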

By using the word as the key, we can then perform a shuffle step, which groups all of the values for each key:

def shuffle_words(results_generators):

First, we aggregate the resulting counts for each word into a list of counts:<br />

    records = defaultdict(list)

We then iterate over all the results that were returned by the map function:

    for results in results_generators:
        for word, count in results:
            records[word].append(count)

Next, we yield each of the words along with all the counts that were obtained in our dataset:

    for word in records:
        yield (word, records[word])
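To see the map and shuffle steps working together, here is a minimal sketch that runs both on two small documents. The document IDs and texts are hypothetical, and the shuffle function's parameter is named `results_generators` to match the loop that consumes it.

```python
from collections import defaultdict


def map_word_count(document_id, document):
    """Map step: yield one (word, count) pair per distinct word."""
    counts = defaultdict(int)
    for word in document.split():
        counts[word] += 1
    for word in counts:
        yield (word, counts[word])


def shuffle_words(results_generators):
    """Shuffle step: group every count emitted for the same word."""
    records = defaultdict(list)
    for results in results_generators:
        for word, count in results:
            records[word].append(count)
    for word in records:
        yield (word, records[word])


# Hypothetical dataset, not taken from the original text.
documents = {
    "doc-1": "the cat saw the dog",
    "doc-2": "the dog barked",
}

# One map output (a generator) per document.
map_results = [map_word_count(doc_id, text) for doc_id, text in documents.items()]
shuffled = dict(shuffle_words(map_results))
print(shuffled["the"])  # [2, 1]
print(shuffled["dog"])  # [1, 1]
```

Each word now maps to a list holding one count per document in which it appeared, which is the grouped form a later step can aggregate.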

