
Social Media Insight Using Naive Bayes

Then, for each tweet in the search results, we save it to our file in the same way we did when we were collecting the dataset originally:

for tweet in search_results:
    if 'text' in tweet:
        output_file.write(json.dumps(tweet))
        output_file.write("\n\n")

As a final step here (and still under the preceding if block), we want to store the labeling of this tweet. We can do this using the label_mapping dictionary we created before, looking up the tweet ID. The code is as follows:

        actual_labels.append(label_mapping[tweet['id']])

Run the previous cell and the code will collect all of the tweets for you. If you created a really big dataset, this may take a while, as Twitter rate-limits requests. As a final step here, save the actual_labels to our classes file:

with open(labels_filename, 'w') as outf:
    json.dump(actual_labels, outf)
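Putting the pieces above together, a self-contained sketch of the whole collection step might look as follows. The search_results, label_mapping, and filename values below are placeholders standing in for the objects and paths built earlier in the chapter:

```python
import json

# Placeholder stand-ins for objects built earlier in the chapter
search_results = [{"id": 1, "text": "first tweet"}, {"id": 2, "text": "second tweet"}]
label_mapping = {1: 1, 2: 0}
tweets_filename = "python_tweets.json"    # hypothetical path
labels_filename = "python_classes.json"   # hypothetical path

actual_labels = []
with open(tweets_filename, "a") as output_file:
    for tweet in search_results:
        if "text" in tweet:
            # Save each tweet as a JSON object, separated by blank lines
            output_file.write(json.dumps(tweet))
            output_file.write("\n\n")
            # Record this tweet's label by looking up its ID
            actual_labels.append(label_mapping[tweet["id"]])

# Save the labels to the classes file
with open(labels_filename, "w") as outf:
    json.dump(actual_labels, outf)
```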

Text transformers

Now that we have our dataset, how are we going to perform data mining on it? Text-based datasets include books, essays, websites, manuscripts, programming code, and other forms of written expression. All of the algorithms we have seen so far deal with numerical or categorical features, so how do we convert our text into a format that the algorithm can deal with?

There are a number of measurements that could be taken. For instance, average word length and average sentence length are used to predict the readability of a document. However, there are many other feature types, such as word occurrence, which we will now investigate.
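As a quick illustration of such measurements, a minimal sketch of the two readability-style features mentioned above might look like this (the sample sentence and the crude full-stop sentence splitter are assumptions for illustration, not code from the book):

```python
def readability_features(text):
    """Compute average word length and average sentence length for a text."""
    # Split into sentences on full stops (a crude heuristic)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Strip the full stops before measuring word lengths
    words = text.replace(".", "").split()
    avg_word_length = sum(len(w) for w in words) / len(words)
    avg_sentence_length = sum(len(s.split()) for s in sentences) / len(sentences)
    return avg_word_length, avg_sentence_length

awl, asl = readability_features(
    "Short words are easy. Longer vocabulary increases difficulty.")
```

Longer average word and sentence lengths generally indicate harder-to-read text, which is why these statistics appear in classic readability formulas.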

Bag-of-words

One of the simplest, but highly effective, models is to simply count each word in the dataset. We create a matrix where each row represents a document in our dataset and each column represents a word. The value of each cell is the frequency of that word in the document.
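This matrix can be sketched in a few lines of plain Python; the two example documents below are made up for illustration:

```python
from collections import Counter

# Two toy documents; each becomes one row of the bag-of-words matrix
documents = ["the cat sat on the mat", "the dog sat"]

# The matrix columns: every distinct word across all documents, in sorted order
vocabulary = sorted(set(word for doc in documents for word in doc.split()))

# One row per document; each cell is that word's frequency in the document
matrix = []
for doc in documents:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])
```

In practice, a vectorizer such as scikit-learn's CountVectorizer builds this matrix for you, handling tokenization and storing the (mostly zero) counts in a sparse format.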

