The Hadoop ecosystem is quite complex, with a large number of tools. The main component we will use is Hadoop MapReduce. Other tools for working with big data that are included in Hadoop are as follows:

• Hadoop Distributed File System (HDFS): This is a file system that can store files over many computers, with the goal of being robust against hardware failure while providing high bandwidth.
• YARN: This is a method for scheduling applications and managing clusters of computers.
• Pig: This is a higher-level programming language for MapReduce. Hadoop MapReduce is implemented in Java, and Pig sits on top of the Java implementation, allowing you to write programs in other languages, including Python.
• Hive: This is for managing data warehouses and performing queries.
• HBase: This is an implementation of Google's BigTable, a distributed database.

These tools all solve different issues that come up when doing big data experiments, including data analytics. There are also non-Hadoop-based implementations of MapReduce, as well as other projects with similar goals. In addition, many cloud providers have MapReduce-based systems.

Application

In this application, we will look at predicting the gender of a writer based on their use of different words. We will use a Naive Bayes method for this, trained in MapReduce. The final model doesn't need MapReduce to run, although we can use the Map step to apply it; that is, run the prediction model on each document in a list. This is a common Map operation for data mining in MapReduce, with the Reduce step simply organizing the list of predictions so they can be tracked back to the original documents. We will be using Amazon's infrastructure to run our application, allowing us to leverage their computing resources.
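The Map-only prediction pattern described above can be sketched in plain Python. This is only an illustration of the shape of the computation, not the chapter's actual implementation: the predict_gender function here is a hypothetical stand-in for a trained Naive Bayes model, and the map/reduce steps are simulated locally rather than run on a cluster.

```python
def predict_gender(document):
    # Hypothetical stand-in for a trained Naive Bayes model. A real model
    # would compute per-class log-probabilities from word frequencies.
    return "female" if "love" in document.lower() else "male"


def map_step(documents):
    # The Map step runs the prediction model on each document, tagging each
    # prediction with the document's index so results can be tracked back
    # to the original document.
    for doc_id, document in enumerate(documents):
        yield doc_id, predict_gender(document)


def reduce_step(mapped):
    # The Reduce step simply organizes the predictions, keyed by document ID.
    return dict(mapped)


documents = ["I love this dataset", "Hadoop runs on clusters"]
predictions = reduce_step(map_step(documents))
```

On a real cluster, map_step and reduce_step would be distributed by the MapReduce framework; the point is that prediction needs only the Map step, with the Reduce step doing bookkeeping.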
Working with Big Data

Getting the data

The data we are going to use is a set of blog posts that are labeled for age, gender, industry (that is, work) and, funnily enough, star sign. This data was collected from http://blogger.com in August 2004 and has over 140 million words in more than 600,000 posts. Each blog is probably written by just one person, with some work put into verifying this (although we can never be really sure). Posts are also matched with the date of posting, making this a very rich dataset.

To get the data, go to http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm and click on Download Corpus. From there, unzip the file to a directory on your computer.

The dataset is organized with a single blog to a file, with the filename giving the classes. For instance, one of the filenames is as follows:

1005545.male.25.Engineering.Sagittarius.xml

The filename is separated by periods, and the fields are as follows:

• Blogger ID: This is a simple ID value to organize the identities.
• Gender: This is either male or female, and all the blogs are identified as one of these two options (no other options are included in this dataset).
• Age: The exact ages are given, but some gaps are deliberately present. The ages present are in the (inclusive) ranges of 13-17, 23-27, and 33-48. The reason for the gaps is to allow for splitting the blogs into age ranges with gaps, as it would be quite difficult to separate an 18 year old's writing from a 19 year old's, and it is possible that the stated age itself is a little outdated.
• Industry: This is one of 40 different industries, including science, engineering, arts, and real estate. Also included is indUnk, for an unknown industry.
• Star Sign: This is one of the 12 astrological star signs.

All values are self-reported, meaning there may be errors or inconsistencies with the labeling, but they are assumed to be mostly reliable; people had the option of not setting values if they wanted to preserve their privacy in those ways.
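Since the filename encodes the class labels, they can be recovered with a simple split on periods. A minimal sketch follows; the helper name parse_filename and the dictionary keys are our own choices, not part of the dataset.

```python
import os


def parse_filename(filename):
    # A filename such as "1005545.male.25.Engineering.Sagittarius.xml"
    # has five period-separated fields followed by the .xml extension.
    base = os.path.basename(filename)
    blogger_id, gender, age, industry, star_sign, _extension = base.split(".")
    return {
        "blogger_id": blogger_id,
        "gender": gender,
        "age": int(age),  # ages are given as exact integers
        "industry": industry,
        "star_sign": star_sign,
    }


record = parse_filename("1005545.male.25.Engineering.Sagittarius.xml")
```

Unpacking into exactly six names also acts as a cheap sanity check: a filename with a different number of fields raises a ValueError rather than silently mislabeling a blog.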
A single file is in a pseudo-XML format, containing a <Blog> tag and then a sequence of <post> tags. Each of the <post> tags is preceded by a <date> tag as well. While we can parse this as XML, it is much simpler to parse it on a line-by-line basis as the files are not exactly well-formed XML, with some errors (mostly encoding problems). To read the posts in the file, we can use a loop to iterate over the lines.
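A sketch of that line-by-line loop is below. The helper name load_posts, the choice of encoding, and the assumption that the <post> and </post> markers sit on their own lines are ours; the real files vary in encoding, which is why errors are ignored rather than parsed strictly.

```python
def load_posts(filename):
    # Read posts line by line instead of using an XML parser, since the
    # files are not well-formed XML. Lines between <post> and </post>
    # belong to the current post; other lines (dates, Blog tags) are skipped.
    posts = []
    in_post = False
    current_lines = []
    with open(filename, encoding="latin-1", errors="ignore") as f:
        for line in f:
            line = line.strip()
            if line == "<post>":
                in_post = True
                current_lines = []
            elif line == "</post>":
                in_post = False
                posts.append(" ".join(current_lines))
            elif in_post:
                current_lines.append(line)
    return posts
```

Collecting each post as a single string keeps the downstream word-counting code simple, and skipping the <date> lines means only the text itself reaches the model.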