www.allitebooks.com

Learning Data Mining with Python

24.07.2016

Chapter 12

MapReduce originates from Google, where it was developed with distributed computing in mind. It also introduces fault tolerance and scalability improvements. The "original" research for MapReduce was published in 2004, and since then there have been thousands of projects, implementations, and applications using it.

While the concept is similar to many previous concepts, MapReduce has become a staple in big data analytics.

Intuition

MapReduce has two main steps: the Map step and the Reduce step. These are built on the functional programming concepts of mapping a function to a list and reducing the result. To explain the concept, we will develop code that will iterate over a list of lists and produce the sum of all numbers in those lists.

There are also shuffle and combine steps in the MapReduce paradigm, which we will see later.

To start with, the Map step takes a function and applies it to each element in a list. The returned result is a list of the same size, with the results of the function applied to each element.

Open a new IPython Notebook and start by creating a list of lists with numbers in each sublist:

    a = [[1,2,1], [3,2], [4,9,1,0,2]]

Next, we can perform a map, using the sum function. This step will apply the sum function to each element of a:

    sums = map(sum, a)

In Python 3, sums is a lazy iterator (the actual values aren't computed until we ask for them, and they can be consumed only once). Apart from that, the above step is approximately equal to the following code:

    sums = []
    for sublist in a:
        results = sum(sublist)
        sums.append(results)

The reduce step is a little more complicated. It combines the elements of the returned result into a single value, beginning from some starting value. We start with an initial value, and then apply a given function to that initial value and the first element. We then apply the given function to the result and the next element, and so on.
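The lazy behaviour of map and the step-by-step folding described above can be checked directly. The following is a minimal sketch, not code from the chapter: the names lazy_sums, leftover, and steps are illustrative, operator.add stands in for a hand-written addition function, and the initial keyword of itertools.accumulate assumes Python 3.8 or later.

```python
from itertools import accumulate
from operator import add

a = [[1, 2, 1], [3, 2], [4, 9, 1, 0, 2]]

# In Python 3, map returns a lazy iterator that can be consumed only once.
lazy_sums = map(sum, a)
sums = list(lazy_sums)       # forces the computation of every sublist sum
leftover = list(lazy_sums)   # the iterator is already exhausted at this point

# accumulate yields each intermediate result of the fold, starting from `initial`
# (the `initial` keyword is assumed to be available, i.e. Python 3.8+).
steps = list(accumulate(sums, add, initial=0))

print(sums)      # [4, 5, 16]
print(leftover)  # []
print(steps)     # [0, 4, 9, 25]
```

The last entry of steps is the value a reduce over the same list would return. Converting the map object to a list first is one way to make sure the sums can be reused after they have been iterated over once.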

Working with Big Data

We start by creating a function that takes two numbers and adds them together:

    def add(a, b):
        return a + b

We then perform the reduce. The signature of reduce is reduce(function, sequence, initial), where the function is applied at each step to the running result and the next element of the sequence. In the first step, the initial value takes the place of the running result, so the function is first called with the initial value and the first element of the list:

    from functools import reduce
    print(reduce(add, sums, 0))

The result, 25, is the sum of the values in the sums list and is consequently the sum of all the elements in the original list of lists. The preceding code is equivalent to the following:

    initial = 0
    current_result = initial
    for element in sums:
        current_result = add(current_result, element)

In this trivial example, our code can be greatly simplified, but the real gains come from distributing the computation. For instance, if we have a million sublists and each of those sublists contains a million elements, we can distribute this computation over many computers.

In order to do this, we distribute the map step. We send each element of our list, along with a description of our function, to a computer. This computer then returns the result to our main computer (the master). The master then sends the results to a computer for the reduce step.

In our example of a million sublists, we would send a million jobs to different computers (the same computer may be reused after it completes its first job). The returned result would be just a single list of a million numbers, which we then sum. As a result, no computer ever needs to store more than a million numbers, despite our original data having a trillion numbers in it.

A word count example

The implementation of MapReduce is a little more complex than just using a map and reduce step. Both steps are invoked using keys, which allows for the separation of data and tracking of values.
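To make the role of keys more concrete, here is a minimal single-machine sketch of a keyed word count. It is an illustration under stated assumptions rather than the chapter's implementation: the documents list is made-up sample data, and map_word_count, shuffle, and reduce_word_count are hypothetical helper names.

```python
from collections import defaultdict
from functools import reduce

# Made-up sample data, used only for illustration.
documents = ["the quick brown fox", "the lazy dog", "the fox"]


def map_word_count(document):
    """Map step: emit a (key, value) pair of (word, 1) for every word."""
    return [(word, 1) for word in document.split()]


def shuffle(mapped):
    """Shuffle step: group all emitted values under their key."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_word_count(key, values):
    """Reduce step: fold the values for one key into a single count."""
    return key, reduce(lambda x, y: x + y, values, 0)


mapped = map(map_word_count, documents)
grouped = shuffle(mapped)
counts = dict(reduce_word_count(key, values) for key, values in grouped.items())

print(counts)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

Because every value travels with its key, each key's list of values could in principle be reduced on a different machine without needing the rest of the data.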

