Learning Data Mining with Python

Working with Big Data

We start by creating a function that takes two numbers and adds them together.

def add(a, b):
    return a + b

We then perform the reduce. The signature of reduce is reduce(function, sequence, initial), where the function is applied at each step to the running result and the next element of the sequence. In the first step, the initial value is used as the running result, rather than the first element of the list:

from functools import reduce
print(reduce(add, sums, 0))

The result, 25, is the sum of the values in the sums list and is consequently the sum of all the elements in the original array.

The preceding code is equivalent to the following:

initial = 0
current_result = initial
for element in sums:
    current_result = add(current_result, element)
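The equivalence can be checked by running both versions side by side. This is a minimal sketch: the sums list below is a hypothetical stand-in (this excerpt does not define it), chosen so that its total is 25, and loop_reduce is an illustrative helper name rather than anything from the text.

```python
from functools import reduce


def add(a, b):
    return a + b


def loop_reduce(function, sequence, initial):
    """Explicit-loop version of reduce(function, sequence, initial)."""
    current_result = initial
    for element in sequence:
        current_result = function(current_result, element)
    return current_result


# Hypothetical per-sublist sums; assumed values, not taken from the text.
sums = [4, 5, 16]

builtin_total = reduce(add, sums, 0)
loop_total = loop_reduce(add, sums, 0)
print(builtin_total, loop_total)

# With an initial value, an empty sequence simply returns that value.
empty_total = reduce(add, [], 0)
```

Without the initial argument, reduce raises a TypeError on an empty sequence, which is one practical reason to supply it.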

In this trivial example, our code can be greatly simplified, but the real gains come from distributing the computation. For instance, if we have a million sublists and each of those sublists contains a million elements, we can distribute this computation over many computers.

In order to do this, we distribute the map step. We send each element of our list, along with a description of our function, to a computer. This computer then returns the result to our main computer (the master).

The master then sends the result to a computer for the reduce step. In our example of a million sublists, we would send a million jobs to different computers (the same computer may be reused after it completes its first job). The returned result would be just a single list of a million numbers, which we then sum.

The result is that no computer ever needed to store more than a million numbers, despite our original data having a trillion numbers in it.
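The flow described above can be imitated on a single machine. In this sketch a thread pool stands in for the worker computers, which is an assumption made purely for illustration; a real cluster would ship the data and the function over a network. The names distributed_sum and max_workers are hypothetical, and the data is scaled down to a million numbers in total.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def add(a, b):
    return a + b


def distributed_sum(sublists, max_workers=4):
    # Map step: each sublist is handed to a worker, which returns one number.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_sums = list(pool.map(sum, sublists))
    # Reduce step: the master combines the per-sublist results.
    return reduce(add, partial_sums, 0)


# Small stand-in data: 1,000 sublists of 1,000 ones each.
data = [[1] * 1000 for _ in range(1000)]
total = distributed_sum(data)
print(total)
```

Each worker only ever sees one sublist, and the master only ever holds one number per sublist, which mirrors the storage argument made in the text.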

A word count example

The implementation of MapReduce is a little more complex than just using a map and reduce step. Both steps are invoked using keys, which allows for the separation of data and tracking of values.
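One way to picture key-based map and reduce steps is the following single-process sketch of a word count. It is not the book's implementation, which this excerpt does not show; the function names map_words, shuffle, and reduce_counts and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def map_words(document):
    """Map step: emit a (key, value) pair of (word, 1) for every word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine all values that share one key."""
    return key, sum(values)


documents = ["the cat sat", "the dog sat down"]  # hypothetical input
pairs = [pair for doc in documents for pair in map_words(doc)]
counts = dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because every value travels with its key, the reduce step for one word never needs to see the values for another, which is what allows the work to be split.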
