
Clustering News Articles

As the last action inside the loop, we get each of the stories from the returned result and add them to our stories list. We don't need all of the data; we only keep the title, URL, and score. The code is as follows:

    stories.extend([(story['data']['title'], story['data']['url'],
                     story['data']['score'])
                    for story in result['data']['children']])

Finally (and outside the loop), we return all the stories we have found:

    return stories
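For context, a minimal sketch of how the surrounding get_links function might be put together is shown below. The paging loop, the number of pages, the User-Agent string, and the assumption that the token is a dictionary containing an 'access_token' key are all illustrative choices based on reddit's standard listing API, not the exact code from earlier in the chapter:

    import requests

    def get_links(subreddit, token, n_pages=5):
        stories = []
        after = None  # reddit's cursor for fetching the next page of results
        for page_number in range(n_pages):
            headers = {"Authorization": "bearer {}".format(token['access_token']),
                       "User-Agent": "python:data-mining-example:v1.0"}
            url = "https://oauth.reddit.com/r/{}?limit=100".format(subreddit)
            if after:
                url += "&after={}".format(after)
            result = requests.get(url, headers=headers).json()
            after = result['data']['after']
            # As described above: keep only the title, URL, and score of each story
            stories.extend([(story['data']['title'], story['data']['url'],
                             story['data']['score'])
                            for story in result['data']['children']])
        return stories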

Calling the get_links function is a simple case of passing the authorization token and the subreddit name:

    stories = get_links("worldnews", token)

The returned results should contain the title, URL, and score of 500 stories, which we will now use to extract the actual text from the resulting websites.
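As a quick sanity check (this snippet is illustrative, not part of the book's code), we can confirm how many stories came back and look at the shape of one entry:

    print(len(stories))   # roughly 500, depending on how many pages were fetched
    print(stories[0])     # a (title, url, score) tuple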

Extracting text from arbitrary websites

The links that we get from reddit go to arbitrary websites run by many different organizations. To make it harder, those pages were designed to be read by a human, not a computer program. This can cause a problem when trying to get the actual content/story of those results, as modern websites have a lot going on in the background. JavaScript libraries are called, style sheets are applied, advertisements are loaded using AJAX, extra content is added to sidebars, and various other things are done to make the modern webpage a complex document. These features make the modern Web what it is, but make it difficult to automatically get good information from!

Finding the stories in arbitrary websites

To start with, we will download the full webpage from each of these links and store them in our data folder, under a raw subfolder. We will process these to extract the useful information later on. This caching of results ensures that we don't have to continuously download the websites while we are working. First, we set up the data folder path:

    import os
    data_folder = os.path.join(os.path.expanduser("~"), "Data",
                               "websites", "raw")

