24.07.2016 Views

Learning Data Mining with Python


Working with Big Data

Then, make a folder for our test set:

mkdir blogs_test

Copy any file starting with a 6 or 7 from the blogs folder into the test set:

cp blogs/6* blogs_test/
cp blogs/7* blogs_test/
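The same split can be scripted from Python, which may be convenient where shell globbing is not available. The following is a minimal sketch rather than code from the book: the helper name copy_by_prefix is our own, and the file names in the test folder are assumed to be plain files directly inside the source folder.

```python
import shutil
from pathlib import Path


def copy_by_prefix(source_dir, target_dir, prefixes):
    """Copy files whose names start with any of the prefixes.

    Hypothetical helper: returns the sorted list of copied file names.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    # Create the target folder if needed (the mkdir step above)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        # str.startswith accepts a tuple of candidate prefixes
        if path.is_file() and path.name.startswith(tuple(prefixes)):
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

Calling copy_by_prefix("blogs", "blogs_test", ["6", "7"]) from the folder that contains blogs would mirror the two cp commands. Like cp, it leaves the original files in place.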

We will rerun the blog extraction on all files in the training set. However, this is a large computation that is better suited to cloud infrastructure than our system. For this reason, we will now move the parsing job to Amazon's infrastructure.

Run the following on the command line, as you did before. The only difference is that we train on a different folder of input files. Before you run these commands, delete all files in the blog posts and models folders:

python extract_posts.py ~/Data/blogs_train --output-dir=/home/bob/Data/blogposts --no-output

python nb_train.py ~/Data/blogposts/ --output-dir=/home/bob/models/ --no-output
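The clean-up step mentioned before these commands can also be scripted. This is a minimal sketch under our own assumptions: clear_folder is a hypothetical helper, not part of the book's code, and it deletes everything inside a folder while keeping the folder itself.

```python
import shutil
from pathlib import Path


def clear_folder(folder):
    """Delete every entry inside folder, keeping the folder itself.

    Hypothetical helper: returns the number of top-level entries removed,
    or 0 if the folder does not exist.
    """
    folder = Path(folder).expanduser()
    if not folder.is_dir():
        return 0
    removed = 0
    for entry in folder.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subfolder and its contents
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed
```

Assuming the home directory is /home/bob, as in the commands above, clear_folder("~/Data/blogposts") and clear_folder("~/models") would empty the two output folders. The deletion is permanent, so check the paths first.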

The code here will take quite a bit longer to run.

We will test on any blog file in our test set. To get the posts, we first need to extract them. We will use the extract_posts.py MapReduce job, but store the files in a separate folder:

python extract_posts.py ~/Data/blogs_test --output-dir=/home/bob/Data/blogposts_testing --no-output

Back in the IPython Notebook, we list all the outputted testing files:

import os

testing_folder = os.path.join(os.path.expanduser("~"), "Data",
                              "blogposts_testing")
testing_filenames = []
for filename in os.listdir(testing_folder):
    testing_filenames.append(os.path.join(testing_folder, filename))

For each of these files, we extract the gender and document and then call the predict function. We do this in a generator, as there are a lot of documents, and we don't want to use too much memory. The generator yields the actual gender and the predicted gender:

def nb_predict_many(model, input_filename):
    with open(input_filename) as inf:
        # remove leading and trailing whitespace
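The generator pattern described above can be sketched independently of the model. The following is an illustration under stated assumptions, not the book's nb_predict_many: predict_fn stands in for any prediction function, and each input line is assumed to hold a label and a document separated by a tab, which may differ from the real output format of extract_posts.py.

```python
def predict_many(predict_fn, lines):
    """Lazily yield (actual_label, predicted_label) pairs.

    Hypothetical sketch: `lines` is any iterable of strings, such as an
    open file, and `predict_fn` maps a document string to a label.
    """
    for line in lines:
        line = line.strip()  # remove leading and trailing whitespace
        if not line:
            continue  # skip blank lines
        # Assumed format: "<label>\t<document>"
        actual, document = line.split("\t", 1)
        yield actual, predict_fn(document)
```

Because the function is a generator, only one line is held in memory at a time. Passing an open file handle as lines lets us stream through a large file and, for example, count correct predictions without building a list of all documents.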
