
    # Function header assumed from the call to nb_predict_many below;
    # this fragment continues the prediction function started earlier.
    def nb_predict_many(model, input_filename):
        with open(input_filename) as inf:
            for line in inf:
                tokens = line.split()
                actual_gender = eval(tokens[0])
                blog_post = eval(" ".join(tokens[1:]))
                yield actual_gender, nb_predict(model, blog_post)

We then record the predictions and actual genders across our entire dataset. Our predictions here are either male or female. In order to use the f1_score function from scikit-learn, we need to turn these into ones and zeroes: we record a 0 if the gender is male and a 1 if it is female, using a Boolean test that checks whether the gender is female. We then convert these Boolean values to int using NumPy:

    y_true = []
    y_pred = []
    for actual_gender, predicted_gender in nb_predict_many(model, testing_filenames[0]):
        y_true.append(actual_gender == "female")
        y_pred.append(predicted_gender == "female")
    y_true = np.array(y_true, dtype='int')
    y_pred = np.array(y_pred, dtype='int')

Now, we test the quality of this result using the F1 score in scikit-learn:

    from sklearn.metrics import f1_score
    print("f1={:.4f}".format(f1_score(y_true, y_pred, pos_label=None)))

The result of 0.78 is not bad. We can probably improve this by using more data, but to do that, we need to move to a more powerful infrastructure that can handle it.

Training on Amazon's EMR infrastructure

We are going to use Amazon's Elastic Map Reduce (EMR) infrastructure to run our parsing and model-building jobs. In order to do that, we first need to create a bucket in Amazon's storage cloud. To do this, open the Amazon S3 console in your web browser by going to http://console.aws.amazon.com/s3 and click on Create Bucket. Remember the name of the bucket, as we will need it later.

Right-click on the new bucket and select Properties. Then, change the permissions, granting everyone full access. This is not a good security practice in general, and I recommend that you change the access permissions after you complete this chapter.
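If you would rather script this step than click through the web console, the AWS CLI (which we install in the next section) can create the bucket and open up its permissions as well. The following is only a sketch: the bucket name ch12 is a placeholder for whatever name you chose, and the public-read-write ACL simply mirrors the deliberately permissive console settings described above, so remember to tighten it again afterwards:

    # Sketch only: replace "ch12" with the bucket name you chose above.
    aws s3 mb s3://ch12
    # Mirrors the permissive console settings; restrict this again after finishing the chapter.
    aws s3api put-bucket-acl --bucket ch12 --acl public-read-write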

Left-click the bucket to open it and click on Create Folder. Name the folder blogs_train. We are going to upload our training data to this folder for processing on the cloud.

On your computer, we are going to use Amazon's AWS CLI, a command-line interface for processing on Amazon's cloud. To install it, use the following:

    sudo pip2 install awscli

Follow the instructions at http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html to set the credentials for this program.

We now want to upload our data to our new bucket. First, we create our training dataset, which is all the blogs whose filenames do not start with a 6 or 7. There are more graceful ways to do this copy, but none are cross-platform enough to recommend. Instead, simply copy all the files and then delete, from the training dataset, the ones that start with a 6 or 7:

    cp -R ~/Data/blogs ~/Data/blogs_train_large
    rm ~/Data/blogs_train_large/6*
    rm ~/Data/blogs_train_large/7*

Next, upload the data to your Amazon S3 bucket. Note that this will take some time and use quite a lot of upload data (several hundred megabytes). For those with slower internet connections, it may be worth doing this at a location with a faster connection:

    aws s3 cp ~/Data/blogs_train_large/ s3://ch12/blogs_train_large --recursive --exclude "*" --include "*.xml"

We are going to connect to Amazon's EMR using mrjob, which handles the whole thing for us; it only needs our credentials to do so. Follow the instructions at https://pythonhosted.org/mrjob/guides/emr-quickstart.html to set up mrjob with your Amazon credentials.

After this is done, we alter our mrjob run only slightly to run on Amazon EMR: we tell mrjob to use EMR with the -r switch and then set our S3 containers as the input and output directories, as shown in the sketch below. Even though this will be run on Amazon's infrastructure, it will still take quite a long time to run.
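To make that concrete, the altered run might look something like the sketch below. The script name extract_posts.py and the output path are assumptions; use the mrjob script you wrote earlier in the chapter and the bucket you created above:

    # Sketch only: the script name and S3 paths are placeholders.
    python extract_posts.py -r emr \
        s3://ch12/blogs_train_large/ \
        --output-dir=s3://ch12/blogs_train_post \
        --no-output

The -r emr switch selects mrjob's EMR runner, and --no-output stops the results from being streamed back to your terminal, leaving them in the S3 output directory instead.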
