24.07.2016 Views


Learning Data Mining with Python


Clustering News Articles

Our system will start with the popular link aggregation website reddit, which stores lists of links to other websites, as well as a comments section for discussion. Links on reddit are grouped into categories called subreddits. There are subreddits devoted to particular TV shows, funny images, and many other things. We are interested in the subreddits for news. We will use the /r/worldnews subreddit in this chapter, but the code should work with any other subreddit.

In this chapter, our goal is to download popular stories and then cluster them to reveal any major themes or concepts that occur. This will give us insight into the popular focus without having to manually analyze hundreds of individual stories.

Using a Web API to get data

We have used web-based APIs to extract data in several of our previous chapters. For instance, in Chapter 7, Discovering Accounts to Follow Using Graph Mining, we used Twitter's API to extract data. Collecting data is a critical part of the data mining pipeline, and web-based APIs are a fantastic way to collect data on a variety of topics.

There are three things you need to consider when using a web-based API for collecting data: authorization methods, rate limiting, and API endpoints.

Authorization methods allow the data provider to know who is collecting the data, so that it can ensure collectors are appropriately rate-limited and that data access can be tracked. For most websites, a personal account is enough to start collecting data, but some websites will ask you to create a formal developer account to get this access.
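
As a rough illustration of how credentials are attached to an API call, the sketch below builds an HTTP request carrying an access token. It assumes a bearer-token scheme, which is only one of several authorization methods in use. The function name `build_authorized_request`, the URL, the token, and the user agent string are all hypothetical placeholders, not values from any real service. No network request is made.

```python
import urllib.request


def build_authorized_request(url, token, user_agent):
    """Return a Request that identifies the caller to the data provider.

    Assumes a bearer-token scheme; a given provider may require something
    different (API keys, signed requests, and so on).
    """
    headers = {
        # The token tells the provider which account is collecting data.
        "Authorization": "bearer {}".format(token),
        # Many providers also ask for a descriptive user agent string.
        "User-Agent": user_agent,
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical values for illustration only.
request = build_authorized_request(
    "https://api.example.com/v1/items",
    token="MY_ACCESS_TOKEN",
    user_agent="python:data-mining-example (by /u/your_username)",
)
```

The resulting object could then be passed to `urllib.request.urlopen` once real credentials are in place.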

Rate limiting is applied to data collection, particularly on free services. It is important to be aware of the rules when using APIs, as they can and do change from website to website. Twitter's API limit is 180 requests per 15 minutes (depending on the particular API call). Reddit, as we will see later, allows 30 requests per minute. Some websites impose daily limits, while others limit on a per-second basis. Even within a single website, limits can differ drastically between API calls. For example, Google Maps has smaller limits and different API limits per resource, with different allowances for the number of requests per hour.
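
One simple way to stay within limits like these is to space requests evenly. The sketch below is a minimal illustration, not part of any API client: the `RateLimiter` class and its parameters are hypothetical names introduced here. Even spacing is stricter than most providers require, since they usually allow bursts within the window, but it is easy to reason about: 30 requests per minute becomes one request every 2 seconds, and 180 requests per 15 minutes becomes one every 5 seconds.

```python
import time


class RateLimiter:
    """Space out calls so at most max_calls happen per period (in seconds)."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        # Minimum gap between consecutive calls, in seconds.
        self.min_interval = period / max_calls
        # clock and sleep are injectable so the class can be tested
        # without actually waiting.
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Limits quoted in the text above.
reddit_limiter = RateLimiter(max_calls=30, period=60)
twitter_limiter = RateLimiter(max_calls=180, period=15 * 60)
```

Calling `reddit_limiter.wait()` immediately before each request would then keep a collection loop under the stated limit.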

If you find you are creating an app or running an experiment that needs more requests and faster responses, most API providers have commercial plans that allow for more calls.

