Learning Data Mining with Python

Social Media Insight Using Naive Bayes

As a last bit of JavaScript for this chapter (I promise), we call the load_next_tweet() function. This sets the first tweet to be labeled, and we then close off the JavaScript. The code is as follows:

load_next_tweet();
</script>

After you run this cell, you will get an HTML textbox alongside the first tweet's text. Click in the textbox and enter 1 if the tweet is relevant to our goal (in this case, if it is about the Python programming language) and 0 if it is not. After you do this, the next tweet will load. Enter its label and the next one will load. This continues until the tweets run out.
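
The labeling loop described above can also be sketched in plain Python, outside the Notebook's HTML form. This is only an illustration of the same flow, not the chapter's JavaScript: the `label_tweets` function, its `ask` parameter, and the `q` shortcut for stopping early are all hypothetical names and choices introduced here.

```python
def label_tweets(tweets, labels, ask=input):
    """Append a 0/1 label for every tweet that has no label yet.

    `tweets` is a list of tweet texts and `labels` holds the labels
    collected so far, so labeling resumes at index len(labels).
    `ask` is any function that takes a prompt and returns a string
    (hypothetical hook; defaults to the built-in input).
    """
    for text in tweets[len(labels):]:
        prompt = f"{text}\nRelevant to Python the language? (1/0, q to stop): "
        # Keep asking until we get one of the accepted answers.
        while True:
            answer = ask(prompt).strip()
            if answer in ("0", "1", "q"):
                break
        if answer == "q":
            break  # stop early; labels collected so far are kept
        labels.append(int(answer))
    return labels

# Usage, assuming a list of tweet texts called tweet_texts (hypothetical name):
# labels = label_tweets(tweet_texts, labels)
```

Passing the question function in as `ask` keeps the loop easy to try without typing at a prompt, for example by feeding it a prepared list of answers.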

When you finish all of this, simply save the labels to the output filename we defined earlier for the class values:

with open(labels_filename, 'w') as outf:
    json.dump(labels, outf)

You can call the preceding code even if you haven't finished. Any labeling you have done to that point will be saved. Running this Notebook again will pick up where you left off and you can keep labeling your tweets.
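
To make the save-and-resume behavior concrete, here is a minimal sketch of how labels could be reloaded when the Notebook starts. It assumes the labels are stored as a JSON list, as in the preceding code; the helper names `load_labels` and `save_labels` are hypothetical and not taken from the chapter.

```python
import json
import os


def load_labels(path):
    """Return previously saved labels, or an empty list if none exist yet."""
    if os.path.exists(path):
        with open(path) as inf:
            return json.load(inf)
    return []


def save_labels(labels, path):
    """Write the labels collected so far to `path` as a JSON list."""
    with open(path, 'w') as outf:
        json.dump(labels, outf)

# Usage, with labels_filename defined as earlier in the chapter:
# labels = load_labels(labels_filename)   # resume where you left off
# ... label some more tweets ...
# save_labels(labels, labels_filename)    # safe to call at any point
```

Because the number of saved labels tells you how many tweets are already done, the next tweet to label is the one at index `len(labels)`.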

This might take a while! If you have a lot of tweets in your dataset, you'll need to classify all of them. If you are pushed for time, you can download the same dataset I used, which contains classifications.

Creating a replicable dataset from Twitter

In data mining, there are lots of variables. These aren't just in the data mining algorithms; they also appear in the data collection, the environment, and many other factors. Being able to replicate your results is important, as it enables you to verify or improve upon them.

Getting 80 percent accuracy on one dataset with algorithm X and 90 percent accuracy on another dataset with algorithm Y doesn't mean that Y is better. We need to test on the same dataset under the same conditions to be able to compare them properly.
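
As a rough illustration of comparing under the same conditions, the sketch below scores two stand-in classifiers on exactly the same test split, fixed by a random seed. Everything in it is an assumption made for illustration: the toy tweets and labels are made up, and `split_indices`, `accuracy`, `always_relevant`, and `keyword_rule` are hypothetical names, not the chapter's code.

```python
import random


def split_indices(n_samples, test_fraction=0.5, seed=14):
    """Shuffle sample indices with a fixed seed and split them.

    Reusing the same seed reproduces the same train/test split.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def accuracy(predict, samples, targets, test_idx):
    """Fraction of test samples for which predict(sample) matches the target."""
    correct = sum(predict(samples[i]) == targets[i] for i in test_idx)
    return correct / len(test_idx)


# Made-up toy data, for illustration only (1 = about the Python language).
tweets = [
    "python list comprehension tips",
    "saw a huge python at the zoo",
    "pip install fails on python 3",
    "monty python reunion tonight",
]
labels = [1, 0, 1, 0]


def always_relevant(text):
    """Baseline stand-in: label every tweet as relevant."""
    return 1


def keyword_rule(text):
    """Stand-in rule: relevant if a programming-related keyword appears."""
    return int("pip" in text or "list" in text)


# Both rules are scored on the same test indices, so the scores are comparable.
# (These toy rules need no training, so train_idx goes unused here.)
train_idx, test_idx = split_indices(len(tweets))
scores = {
    rule.__name__: accuracy(rule, tweets, labels, test_idx)
    for rule in (always_relevant, keyword_rule)
}
```

The point of the fixed seed is that anyone rerunning the comparison evaluates both approaches on the same samples, so a difference in score reflects the approach and not the split.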
