Chapter 12

MapReduce originates from Google, where it was developed with distributed computing in mind; it also introduces fault tolerance and scalability improvements. The original research on MapReduce was published in 2004, and since then thousands of projects, implementations, and applications have used it. While the concept is similar to many earlier ideas, MapReduce has become a staple in big data analytics.

Intuition

MapReduce has two main steps: the Map step and the Reduce step. These are built on the functional programming concepts of mapping a function over a list and reducing the result. To explain the concept, we will develop code that iterates over a list of lists and produces the sum of all numbers in those lists.

There are also shuffle and combine steps in the MapReduce paradigm, which we will see later.

To start with, the Map step takes a function and applies it to each element in a list. The returned result is a list of the same size, with the result of the function applied to each element.

Open a new IPython Notebook and start by creating a list of lists with numbers in each sublist:

    a = [[1, 2, 1], [3, 2], [4, 9, 1, 0, 2]]

Next, we can perform a map using the sum function. This step applies the sum function to each element of a:

    sums = map(sum, a)

While sums is a generator (the actual values aren't computed until we ask for them), the preceding step is approximately equal to the following code:

    sums = []
    for sublist in a:
        results = sum(sublist)
        sums.append(results)

The Reduce step is a little more complicated. It involves applying a function to each element of the returned result, starting from some initial value. We start with the initial value and apply a given function to it and the first value. We then apply the given function to that result and the next value, and so on.
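Before moving on to the Reduce step, a quick sanity check of the Map step (a minimal sketch): in Python 3, map returns a lazy iterator, so nothing is computed until we consume it, for example by converting it to a list:

```python
a = [[1, 2, 1], [3, 2], [4, 9, 1, 0, 2]]

sums = map(sum, a)   # lazy: no sums have been computed yet
print(list(sums))    # forces evaluation: [4, 5, 16]
```

Note that once the iterator is consumed, it is exhausted; calling list(sums) a second time would return an empty list.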
We start by creating a function that takes two numbers and adds them together:

    def add(a, b):
        return a + b

We then perform the reduce. The signature of reduce is reduce(function, sequence, initial), where the function is applied at each step to the sequence. In the first step, the initial value is used as the first value, rather than the first element of the sequence:

    from functools import reduce
    print(reduce(add, sums, 0))

The result, 25, is the sum of all the values in the sums list and is consequently the sum of all the elements in the original array. The preceding code is equal to the following:

    initial = 0
    current_result = initial
    for element in sums:
        current_result = add(current_result, element)

In this trivial example, our code can be greatly simplified, but the real gains come from distributing the computation. For instance, if we have a million sublists and each of those sublists contains a million elements, we can distribute this computation over many computers.

In order to do this, we distribute the Map step. For each element in our list, we send it, along with a description of our function, to a computer. This computer then returns the result to our main computer (the master). The master then sends the results to a computer for the Reduce step. In our example of a million sublists, we would send a million jobs to different computers (the same computer may be reused after it completes its first job). The returned result would be a single list of a million numbers, which we then compute the sum of. The result is that no computer ever needed to store more than a million numbers, despite our original data containing a trillion numbers.

A word count example

Any implementation of MapReduce is a little more complex than just using a Map and a Reduce step. Both steps are invoked using keys, which allow for the separation of data and the tracking of values.
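To make the role of keys concrete, here is a minimal single-machine sketch of key-based word counting; the names map_word_count, shuffle, and reduce_counts are illustrative only, not part of any MapReduce framework:

```python
from collections import defaultdict

def map_word_count(document):
    # Map step: emit a (key, value) pair of (word, 1) for each word
    for word in document.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    # Shuffle step: group all values by their key, so each reducer
    # receives every value associated with a single key
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    # Reduce step: combine the values for one key into a single count
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog the end"]
mapped = [pair for doc in documents for pair in map_word_count(doc)]
counts = dict(reduce_counts(key, values)
              for key, values in shuffle(mapped).items())
print(counts["the"])  # 3
```

In a real distributed setting, the Map calls and the Reduce calls would run on different computers, and the shuffle step would route each key's values to the machine responsible for reducing that key.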