The map function takes a key/value pair and returns a list of key/value pairs. The keys for the input and output don't necessarily relate to each other. For example, for a MapReduce program that performs a word count, the input key might be a sample document's ID value, while the output key would be a given word. The input value would be the text of the document and the output value would be the frequency of each word:

    from collections import defaultdict

    def map_word_count(document_id, document):

We first count the frequency of each word. In this simplified example, we split the document on whitespace to obtain the words, although there are better options:

        counts = defaultdict(int)
        for word in document.split():
            counts[word] += 1

We then yield each of the (word, count) pairs. The word here is the key, with the count being the value in MapReduce terms:

        for word in counts:
            yield (word, counts[word])

By using the word as the key, we can then perform a shuffle step, which groups all of the values for each key:

    def shuffle_words(results_generators):

First, we aggregate the resulting counts for each word into a list of counts:

        records = defaultdict(list)

We then iterate over all the results that were returned by the map function:

        for results in results_generators:
            for word, count in results:
                records[word].append(count)

Next, we yield each of the words, along with all the counts that were obtained in our dataset:

        for word in records:
            yield (word, records[word])
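To see what these two steps produce, we can run them on a toy document. This example is our own, not from the original listing, and the exact ordering of the pairs may vary:

    # Map a single tiny document and inspect the (word, count) pairs
    toy_results = list(map_word_count(0, "the cat sat on the mat"))
    print(toy_results)
    # e.g. [('the', 2), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1)]

    # Shuffle expects an iterable of per-document result streams
    print(list(shuffle_words([toy_results])))
    # e.g. [('the', [2]), ('cat', [1]), ('sat', [1]), ('on', [1]), ('mat', [1])]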
The final step is the reduce step, which takes a key/value pair (the value in this case is always a list) and produces a key/value pair as a result. In our example, the key is the word, the input list is the list of counts produced in the shuffle step, and the output value is the sum of the counts:

    def reduce_counts(word, list_of_counts):
        return (word, sum(list_of_counts))
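As a quick check (our own example, not from the original listing), summing a grouped list of counts for a single word:

    print(reduce_counts('python', [1, 3, 2]))
    # ('python', 6)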
To see this in action, we can use the 20 newsgroups dataset, which is provided in scikit-learn:

    from sklearn.datasets import fetch_20newsgroups

    dataset = fetch_20newsgroups(subset='train')
    documents = dataset.data
We then apply our map step. We use enumerate here to automatically generate document IDs for us. While they aren't important in this application, these keys are important in other applications. Because enumerate yields (document_id, document) tuples and map_word_count takes two separate arguments, we use itertools.starmap, which unpacks each tuple, rather than the built-in map:

    from itertools import starmap

    map_results = starmap(map_word_count, enumerate(documents))
The result here is lazy; no counts have actually been computed yet. When iterated, it produces one generator of (word, count) pairs per document.
Next, we perform the shuffle step to group these word counts by word:

    shuffle_results = shuffle_words(map_results)
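The original listing stops at the shuffle step; to complete the single-machine job, we can feed each grouped record through the reduce step. The following is a minimal sketch, and the variable names are our own:

    # Apply the reduce step to every (word, list_of_counts) pair
    reduce_results = [reduce_counts(word, counts)
                      for word, counts in shuffle_results]

    # Sanity check: print the five most frequent words in the corpus
    print(sorted(reduce_results, key=lambda r: r[1], reverse=True)[:5])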
This, in essence, is a MapReduce job; however, it runs on a single thread, so we gain none of the benefits of the MapReduce paradigm. In the next section, we will start using Hadoop, an open source implementation of MapReduce, to get the benefits of this type of paradigm.
Hadoop MapReduce

Hadoop is a set of open source tools from Apache that includes an implementation of MapReduce; in many cases, it is the de facto implementation. The project is managed by the Apache Software Foundation (which is also responsible for the famous Apache web server).