
Social Media Insight Using Naive Bayes

Putting it all together

Now comes the moment to put all of these pieces together. In our IPython Notebook, set the filenames and load the dataset and classes as we have done before. Set the filenames for both the tweets themselves (not the IDs!) and the labels that we assigned to them. The code is as follows:

import os
import json  # needed below to parse the stored tweets

input_filename = os.path.join(os.path.expanduser("~"), "Data",
                              "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data",
                               "twitter", "python_classes.json")

Load the tweets themselves. We are only interested in the content of the tweets, so we extract the text value and store only that. The code is as follows:

tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line)['text'])

Load the labels for each of the tweets:

with open(labels_filename) as inf:
    labels = json.load(inf)
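
The classifier will pair tweets and labels by position, so it is worth a quick sanity check that the two files line up. This check is an addition, not part of the original code:

# Each label should correspond to exactly one tweet, in order.
assert len(tweets) == len(labels)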

Now, create a pipeline putting together the components from before. Our pipeline has three parts:

• The NLTKBOW transformer we created (a reminder sketch follows the pipeline code)
• A DictVectorizer transformer
• A BernoulliNB classifier

The code is as follows:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())])
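
If you are running this in a fresh Notebook, the NLTKBOW class must already be defined. As a reminder, the transformer built earlier in this chapter looks roughly like this (a minimal sketch, not a verbatim copy):

from sklearn.base import TransformerMixin
from nltk import word_tokenize

class NLTKBOW(TransformerMixin):
    # Stateless transformer: fit learns nothing and just returns self.
    def fit(self, X, y=None):
        return self

    # Turn each document into a {word: True} dictionary, the
    # bag-of-words format that DictVectorizer expects.
    def transform(self, X):
        return [{word: True for word in word_tokenize(document)}
                for document in X]

With the pipeline assembled, a quick way to confirm it works end to end is cross-validation. This snippet is a sketch rather than part of the original text: the F1 scoring choice is an assumption, and note that cross_val_score lives in sklearn.model_selection in current scikit-learn versions (older versions used sklearn.cross_validation):

import numpy as np
from sklearn.model_selection import cross_val_score

# Run the whole pipeline (tokenize, vectorize, classify) inside
# each fold and report the mean F1 score across folds.
scores = cross_val_score(pipeline, tweets, labels, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))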
