


The map function takes a key-value pair and returns a list of key-value pairs. The keys for the input and output don't necessarily relate to each other. For example, for a MapReduce program that performs a word count, the input key might be a sample document's ID value, while the output key would be a given word. The input value would be the text of the document and the output value would be the frequency of each word:

    from collections import defaultdict

    def map_word_count(document_id, document):

We first count the frequency of each word. In this simplified example, we split the document on whitespace to obtain the words, although there are better options:

        counts = defaultdict(int)
        for word in document.split():
            counts[word] += 1

We then yield each of the (word, count) pairs. The word here is the key, with the count being the value in MapReduce terms:

        for word in counts:
            yield (word, counts[word])

By using the word as the key, we can then perform a shuffle step, which groups all of the values for each key:

    def shuffle_words(results_generators):

First, we aggregate the resulting counts for each word into a list of counts:

        records = defaultdict(list)

We then iterate over all of the result generators that were returned by the map function (the parameter is named results_generators because each item is itself a generator of (word, count) pairs):

        for results in results_generators:
            for word, count in results:
                records[word].append(count)

Next, we yield each of the words along with all the counts that were obtained in our dataset:

        for word in records:
            yield (word, records[word])
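As a quick sanity check, we can run these two steps over a tiny corpus before scaling up. This is a minimal sketch assuming the definitions above; the two sentences and the toy_docs name are made up purely for illustration:

    # Two made-up documents, just to exercise the map and shuffle steps
    toy_docs = ["the cat sat on the mat", "the dog sat"]

    # Each call to map_word_count returns a generator of (word, count) pairs
    toy_map_results = [map_word_count(doc_id, doc)
                       for doc_id, doc in enumerate(toy_docs)]

    for word, counts in shuffle_words(toy_map_results):
        print(word, counts)
    # 'the' appears twice in the first document and once in the second,
    # so its line reads: the [2, 1]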

The final step is the reduce step, which takes a key-value pair (the value in this case is always a list) and produces a key-value pair as a result. In our example, the key is the word, the input list is the list of counts produced in the shuffle step, and the output value is the sum of the counts:

    def reduce_counts(word, list_of_counts):
        return (word, sum(list_of_counts))

To see this in action, we can use the 20 Newsgroups dataset, which is provided in scikit-learn:

    from sklearn.datasets import fetch_20newsgroups

    dataset = fetch_20newsgroups(subset='train')
    documents = dataset.data

We then apply our map step, using enumerate to automatically generate document IDs for us. While they aren't important in this application, these keys are important in other applications. Note that we use itertools.starmap rather than the built-in map, so that each (document ID, document) pair produced by enumerate is unpacked into the two arguments that map_word_count expects:

    from itertools import starmap

    map_results = starmap(map_word_count, enumerate(documents))

The actual result here is just a generator; no counts have been computed yet. That said, it is a generator whose items are themselves generators of (word, count) pairs, one per document. Next, we perform the shuffle step to group these word counts by word:

    shuffle_results = shuffle_words(map_results)

This, in essence, is a MapReduce job; however, it runs on only a single thread, meaning we aren't getting any of the scalability benefits of the MapReduce paradigm. In the next section, we will start using Hadoop, an open source implementation of MapReduce, to get the benefits of this type of paradigm.
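To complete the single-threaded pipeline, the reduce step can be applied to every (word, list-of-counts) pair that the shuffle emits. A minimal sketch, assuming the functions defined above (the reduce_results and top_words names are just for illustration):

    # Apply the reduce step to each (word, list_of_counts) pair
    reduce_results = [reduce_counts(word, counts)
                      for word, counts in shuffle_results]

    # Peek at the five most frequent tokens in the corpus
    top_words = sorted(reduce_results, key=lambda result: result[1],
                       reverse=True)[:5]
    print(top_words)

Because we split only on whitespace, the "words" here still include punctuation and mixed case; a real application would normalize tokens first.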

Hadoop MapReduce

Hadoop is a set of open source tools from Apache that includes an implementation of MapReduce; in many cases, it is the de facto implementation. The project is managed by the Apache Software Foundation (which is also responsible for the famous Apache web server).
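As a preview of how such a job can be expressed from Python when targeting Hadoop, here is a minimal word-count sketch using the third-party mrjob library, which wraps Hadoop Streaming. This is one common approach rather than the only one, and the class name and filename below are assumptions for illustration:

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # The input key is ignored; the value is one line of text
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            # counts is an iterator over every value emitted for this word
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

Saved as, say, word_count.py, this can be run locally with python word_count.py input.txt, or against a Hadoop cluster by adding the -r hadoop flag.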
