Learning Data Mining with Python


Chapter 12

python extract_posts.py -r emr s3://ch12gender/blogs_train_large/ --output-dir=s3://ch12/blogposts_train/ --no-output

python nb_train.py -r emr s3://ch12/blogposts_train/ --output-dir=s3://ch12/model/ --no-output

You will also be charged for the usage. This will only be a few dollars, but keep it in mind if you are going to keep running these jobs or run other jobs on bigger datasets. I ran a very large number of jobs and was charged about $20 in total. Running just these few should cost less than $4. You can check your balance and set up pricing alerts by going to https://console.aws.amazon.com/billing/home.

The blogposts_train and model folders don't need to exist beforehand; EMR will create them. In fact, if they already exist, you will get an error. If you are rerunning this, just change the names of these folders to something new, but remember to change both commands to use the same names (that is, the output directory of the first command is the input directory of the second command).
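Because the second command must read exactly what the first one wrote, and because neither output folder may already exist, it can help to derive both command lines from a single run label. The following is a minimal sketch rather than part of the original scripts: the helper name build_commands and the run_label suffix are hypothetical, and the bucket names are simply copied from the commands shown earlier.

```python
def build_commands(run_label,
                   input_uri="s3://ch12gender/blogs_train_large/",
                   bucket="s3://ch12"):
    """Return the two EMR command lines as argument lists.

    run_label is a hypothetical suffix (for example "run2") that keeps
    the folder names fresh on each rerun, so EMR does not fail because
    an output folder already exists.
    """
    posts_dir = "{}/blogposts_train_{}/".format(bucket, run_label)
    model_dir = "{}/model_{}/".format(bucket, run_label)

    # The first job writes to posts_dir ...
    extract_cmd = ["python", "extract_posts.py", "-r", "emr", input_uri,
                   "--output-dir=" + posts_dir, "--no-output"]
    # ... and the second job reads from that same posts_dir.
    train_cmd = ["python", "nb_train.py", "-r", "emr", posts_dir,
                 "--output-dir=" + model_dir, "--no-output"]
    return extract_cmd, train_cmd


if __name__ == "__main__":
    for cmd in build_commands("run2"):
        print(" ".join(cmd))
```

Printing the commands first, as done here, lets you check the folder names before anything is submitted; the argument lists could also be passed to subprocess if you prefer to launch the jobs from Python.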

If you are getting impatient, you can always stop the first job after a while and just use the training data gathered so far. I recommend leaving the job running for an absolute minimum of 15 minutes, and preferably at least an hour. You can't stop the second job early and still get good results, though; the second job will probably take about two to three times as long as the first job did.

You can now go back to the S3 console and download the output model from your bucket. After saving it locally, we can go back to our IPython Notebook and use the new model. We reenter the code here; the only differences are the ones needed to switch to our new model:

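import os
import numpy as np
from sklearn.metrics import f1_score

# load_model, nb_predict_many and testing_filenames are assumed to be
# defined earlier in the notebook.
aws_model_filename = os.path.join(os.path.expanduser("~"), "models", "aws_model")
aws_model = load_model(aws_model_filename)

y_true = []
y_pred = []
for actual_gender, predicted_gender in nb_predict_many(aws_model, testing_filenames[0]):
    # Encode each label as True for "female" and False otherwise
    y_true.append(actual_gender == "female")
    y_pred.append(predicted_gender == "female")
y_true = np.array(y_true, dtype='int')
y_pred = np.array(y_pred, dtype='int')
# pos_label=None is kept from the original; newer scikit-learn releases may
# reject it, in which case average='weighted' is likely the closest equivalent.
print("f1={:.4f}".format(f1_score(y_true, y_pred, pos_label=None)))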
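To make the evaluation step more concrete, the sketch below computes an F1 score by hand from a few label pairs. It is illustrative only: the sample pairs are invented, the function f1_for_label is hypothetical, and it scores the "female" class alone, so its value may differ from the averaged score that f1_score reports when pos_label=None.

```python
def f1_for_label(pairs, positive="female"):
    """Compute F1 for one class from (actual, predicted) label pairs."""
    tp = fp = fn = 0
    for actual, predicted in pairs:
        if predicted == positive and actual == positive:
            tp += 1  # correctly predicted the positive class
        elif predicted == positive:
            fp += 1  # predicted positive, but the actual label differs
        elif actual == positive:
            fn += 1  # missed an actual positive
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented (actual, predicted) pairs, in the same order that
# nb_predict_many yields them in the loop above.
sample_pairs = [
    ("female", "female"),
    ("female", "male"),
    ("male", "male"),
    ("male", "female"),
    ("female", "female"),
]

print("f1={:.4f}".format(f1_for_label(sample_pairs)))  # prints f1=0.6667
```

With two true positives, one false positive, and one false negative, precision and recall are both 2/3, so the F1 score is also 2/3.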
