
Learning Data Mining with Python


Chapter 12

The Hadoop ecosystem is quite complex, with a large number of tools. The main component we will use is Hadoop MapReduce. Other tools for working with big data that are included in Hadoop are as follows:

• Hadoop Distributed File System (HDFS): This is a file system that can store files over many computers, with the goal of being robust against hardware failure while providing high bandwidth.
• YARN: This is a method for scheduling applications and managing clusters of computers.
• Pig: This is a higher-level programming language for MapReduce. Hadoop MapReduce is implemented in Java, and Pig sits on top of the Java implementation, allowing you to write programs in other languages, including Python.
• Hive: This is for managing data warehouses and performing queries.
• HBase: This is an implementation of Google's BigTable, a distributed database.

These tools all solve different issues that come up when doing big data experiments, including data analytics.

There are also non-Hadoop-based implementations of MapReduce, as well as other projects with similar goals. In addition, many cloud providers have MapReduce-based systems.

Application

In this application, we will look at predicting the gender of a writer based on their use of different words. We will use a Naive Bayes method for this, trained in MapReduce. Applying the final model doesn't need MapReduce, although we can use the Map step to do so: that is, run the prediction model on each document in a list. This is a common Map operation for data mining in MapReduce, with the reduce step simply organizing the list of predictions so they can be traced back to the original document.

We will be using Amazon's infrastructure to run our application, allowing us to leverage their computing resources.
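The idea of running a prediction model in the Map step can be sketched in plain Python, without Hadoop. This is only an illustration under stated assumptions: the function names map_predict and reduce_collect are hypothetical, and toy_model is a placeholder rule standing in for a trained Naive Bayes classifier.

```python
from collections import defaultdict


def map_predict(documents, model):
    """Map step: emit one (document_id, prediction) pair per document."""
    for doc_id, text in documents:
        yield doc_id, model(text)


def reduce_collect(pairs):
    """Reduce step: group predictions by document ID so that each
    prediction can be traced back to its original document."""
    grouped = defaultdict(list)
    for doc_id, prediction in pairs:
        grouped[doc_id].append(prediction)
    return dict(grouped)


def toy_model(text):
    # Hypothetical placeholder, NOT a trained Naive Bayes model:
    # it labels a document by word count so the sketch is runnable.
    return "long" if len(text.split()) > 3 else "short"


documents = [
    ("doc1", "a short post"),
    ("doc2", "this post has rather more words in it"),
]
predictions = reduce_collect(map_predict(documents, toy_model))
print(predictions)
```

In a real MapReduce job, the mapper would run on many machines at once, each handling its own share of the documents, and the framework would route the emitted pairs to the reducers.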

Working with Big Data

Getting the data

The data we are going to use is a set of blog posts that are labeled for age, gender, industry (that is, work) and, funnily enough, star sign. This data was collected from http://blogger.com in August 2004 and has over 140 million words in more than 600,000 posts. Each blog is probably written by just one person, with some work put into verifying this (although we can never be really sure). Posts are also matched with the date of posting, making this a very rich dataset.

To get the data, go to http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm and click on Download Corpus. From there, unzip the file to a directory on your computer.

The dataset is organized with a single blog to a file, with the filename giving the classes. For instance, one of the filenames is as follows:

1005545.male.25.Engineering.Sagittarius.xml

The filename is separated by periods, and the fields are as follows:

• Blogger ID: This is a simple ID value to organize the identities.
• Gender: This is either male or female, and all the blogs are identified as one of these two options (no other options are included in this dataset).
• Age: The exact ages are given, but some gaps are deliberately present. Ages present are in the (inclusive) ranges of 13-17, 23-27, and 33-48. The reason for the gaps is to allow for splitting the blogs into age ranges with gaps, as it would be quite difficult to separate an 18-year-old's writing from a 19-year-old's, and it is possible that the age itself is a little outdated.
• Industry: This is one of 40 different industries, including science, engineering, arts, and real estate. Also included is indUnk, for unknown industry.
• Star Sign: This is one of the 12 astrological star signs.

All values are self-reported, meaning there may be errors or inconsistencies with labeling, but they are assumed to be mostly reliable; people had the option of not setting values if they wanted to preserve their privacy in those ways.
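The filename layout described above can be unpacked with a simple split on periods. The following is a minimal sketch; the names BlogLabels and parse_blog_filename are hypothetical helpers introduced here, and the code assumes every filename has exactly five label fields followed by the xml extension.

```python
from collections import namedtuple

# Hypothetical container for the five labels encoded in a filename
BlogLabels = namedtuple(
    "BlogLabels", ["blogger_id", "gender", "age", "industry", "star_sign"]
)


def parse_blog_filename(filename):
    """Split a corpus filename on periods and return its labels.

    Assumes the form ID.gender.age.industry.starsign.xml, as in the
    example filename given in the text.
    """
    parts = filename.split(".")
    if len(parts) != 6 or parts[-1] != "xml":
        raise ValueError("Unexpected filename format: {}".format(filename))
    blogger_id, gender, age, industry, star_sign = parts[:5]
    # The ID and age are numeric in the filename, so convert them
    return BlogLabels(int(blogger_id), gender, int(age), industry, star_sign)


labels = parse_blog_filename("1005545.male.25.Engineering.Sagittarius.xml")
print(labels.gender, labels.age, labels.industry)
```

For the gender prediction task, only the gender field is needed as the class label; the other fields are parsed here simply because they come for free.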

A single file is in a pseudo-XML format, containing a <Blog> tag and then a sequence of <post> tags. Each <post> tag is preceded by a <date> tag as well. While we can parse this as XML, it is much simpler to parse it line by line, as the files are not exactly well-formed XML and contain some errors (mostly encoding problems). To read the posts in the file, we can use a loop to iterate over the lines.
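A line-by-line reader along these lines might look like the following sketch. It assumes that each post is delimited by lines containing only <post> and </post>, which is how the Blog Authorship Corpus files are laid out; the function name read_posts and the sample text are made up for illustration.

```python
def read_posts(lines):
    """Collect the text of each post from an iterable of lines.

    Assumes the opening and closing post tags each sit on their own
    line; everything between them is treated as the post's text.
    """
    posts = []
    current = []
    in_post = False
    for line in lines:
        line = line.strip()
        if line == "<post>":
            # Start collecting a new post
            in_post = True
            current = []
        elif line == "</post>" and in_post:
            # Finish the current post and store it
            in_post = False
            posts.append("\n".join(current))
        elif in_post:
            current.append(line)
    return posts


# Made-up sample in the same layout as a corpus file
sample = """<Blog>
<date>14,May,2004</date>
<post>
First post text.
</post>
<date>15,May,2004</date>
<post>
Second post,
over two lines.
</post>
</Blog>"""

posts = read_posts(sample.splitlines())
print(len(posts))
```

To read a real file, the same function can be given an open file object, for example open(filename, errors="ignore"), where ignoring decoding errors is one possible way to cope with the encoding problems mentioned above.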

