    for line in inf:
        tokens = line.split()
        actual_gender = eval(tokens[0])
        blog_post = eval(" ".join(tokens[1:]))
        yield actual_gender, nb_predict(model, blog_post)

We then record the predictions and actual genders across our entire dataset. Our predictions here are either male or female. In order to use the f1_score function from scikit-learn, we need to turn these into ones and zeroes: we record a 0 if the gender is male and a 1 if it is female. To do this, we use a Boolean test, checking whether the gender is female, and then convert these Boolean values to int using NumPy:

y_true = []
y_pred = []
for actual_gender, predicted_gender in nb_predict_many(model, testing_filenames[0]):
    y_true.append(actual_gender == "female")
    y_pred.append(predicted_gender == "female")
y_true = np.array(y_true, dtype='int')
y_pred = np.array(y_pred, dtype='int')

Now, we test the quality of this result using the F1 score in scikit-learn:

from sklearn.metrics import f1_score
print("f1={:.4f}".format(f1_score(y_true, y_pred, pos_label=None)))

The result of 0.78 is not bad. We can probably improve it by using more data, but to do that we need to move to a more powerful infrastructure that can handle it.

Training on Amazon's EMR infrastructure

We are going to use Amazon's Elastic Map Reduce (EMR) infrastructure to run our parsing and model-building jobs. To do that, we first need to create a bucket in Amazon's storage cloud. Open the Amazon S3 console in your web browser by going to http://console.aws.amazon.com/s3 and click on Create Bucket. Remember the name of the bucket, as we will need it later.

Right-click on the new bucket and select Properties. Then, change the permissions, granting everyone full access. This is not a good security practice in general, and I recommend that you change the access permissions after you complete this chapter.
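As an alternative, once the AWS CLI described below is installed and configured, the same bucket can be created from the command line. This is only a sketch: the bucket name ch12 is an example chosen to match the upload command used later in this section, and creating the bucket this way does not apply the permission changes described above:

# Create an S3 bucket; bucket names are globally unique, so substitute your own
aws s3 mb s3://ch12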
Left-click the bucket to open it and click on Create Folder. Name the folder blogs_train. We are going to upload our training data to this folder for processing on the cloud.

On your computer, we are going to use Amazon's AWS CLI, a command-line interface for processing on Amazon's cloud. To install it, use the following:

sudo pip2 install awscli

Follow the instructions at http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html to set the credentials for this program.

We now want to upload our data to our new bucket. First, we create our training dataset, which is all the blogs whose filenames do not start with a 6 or 7. There are more graceful ways to do this copy, but none are cross-platform enough to recommend. Instead, simply copy all the files and then delete the ones that start with a 6 or 7 from the training dataset:

cp -R ~/Data/blogs ~/Data/blogs_train_large
rm ~/Data/blogs_train_large/6*
rm ~/Data/blogs_train_large/7*

Next, upload the data to your Amazon S3 bucket. Note that this will take some time and use quite a lot of upload data (several hundred megabytes). For those with slower internet connections, it may be worth doing this at a location with a faster connection:

aws s3 cp ~/Data/blogs_train_large/ s3://ch12/blogs_train_large --recursive --exclude "*" --include "*.xml"

We are going to connect to Amazon's EMR using mrjob, which handles the whole thing for us; it only needs our credentials to do so. Follow the instructions at https://pythonhosted.org/mrjob/guides/emr-quickstart.html to set up mrjob with your Amazon credentials.

After this is done, we alter our mrjob run only slightly to run on Amazon EMR: we tell mrjob to use EMR with the -r switch and set our S3 buckets as the input and output directories. Even though this will run on Amazon's infrastructure, it will still take quite a long time.
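As a rough sketch of what that invocation looks like, assuming a placeholder script name and output folder (substitute your own mrjob script and bucket names; these are not commands from the text):

# Run the mrjob script on EMR instead of locally, reading from and writing to S3
python your_mrjob_script.py -r emr \
    s3://ch12/blogs_train_large/ \
    --output-dir=s3://ch12/blogs_train/ \
    --no-output

Apart from the -r emr switch and the S3 paths, the job code itself is unchanged, which is the main appeal of using mrjob here.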