
Clustering News Articles

In most of the previous chapters, we performed data mining knowing what we were looking for. Our use of target classes allowed us to learn how our variables model those targets during the training phase. This type of learning, where we have targets to train against, is called supervised learning. In this chapter, we consider what we do without those targets. This is unsupervised learning, which is much more of an exploratory task. Rather than using our model to classify, in unsupervised learning the goal is to explore the data to find insights.

In this chapter, we look at clustering news articles to find trends and patterns in the data. We look at how we can extract data from different websites, using a link aggregation website to find a variety of news stories.

The key concepts covered in this chapter include:

• Obtaining text from arbitrary websites
• Using the reddit API to collect interesting news stories
• Cluster analysis for unsupervised data mining
• Extracting topics from documents
• Online learning for updating a model without retraining it
• Cluster ensembling to combine different models

Obtaining news articles

In this chapter, we will build a system that takes a live feed of news articles and groups them together, where the groups have similar topics. You could run the system over several weeks (or longer) to see how trends change over that time.
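To make that goal concrete before we collect any data, here is a minimal sketch of document clustering with scikit-learn, using a handful of made-up headlines in place of the live feed. The TF-IDF-plus-k-means combination shown is one common approach; the specifics here are illustrative, not the chapter's final implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical stand-ins for the news stories we will collect later.
documents = [
    "Election results announced in several countries",
    "New vaccine trial shows promising results",
    "Markets rally after central bank decision",
    "Health officials report fewer new cases this week",
]

# Convert the raw text into TF-IDF feature vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Group the documents into two clusters. Note that no target labels are
# supplied anywhere: the groupings come entirely from the data itself.
model = KMeans(n_clusters=2, random_state=14)
labels = model.fit_predict(X)
print(labels)

Unlike the classifiers of earlier chapters, fit_predict is never shown a target class; this is what makes the task unsupervised.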

Our system will start with the popular link aggregation website reddit, which stores lists of links to other websites, as well as a comments section for discussion. Links on reddit are broken into several categories, called subreddits. There are subreddits devoted to particular TV shows, funny images, and many other things. We are interested in the subreddits for news. We will use the /r/worldnews subreddit in this chapter, but the code should work with any other subreddit.

Our goal in this chapter is to download popular stories and then cluster them to see the major themes or concepts that occur. This will give us an insight into the popular focus, without having to manually analyze hundreds of individual stories.

Using a Web API to get data

We have used web-based APIs to extract data in several of our previous chapters. For instance, in Chapter 7, Discovering Accounts to Follow Using Graph Mining, we used Twitter's API to extract data. Collecting data is a critical part of the data mining pipeline, and web-based APIs are a fantastic way to collect data on a variety of topics.

There are three things you need to consider when using a web-based API for collecting data: authorization methods, rate limiting, and API endpoints.

Authorization methods allow the data provider to know who is collecting the data, to ensure that they are being appropriately rate-limited and that data access can be tracked. For most websites, a personal account is often enough to start collecting data, but some websites will ask you to create a formal developer account to get this access.

Rate limiting is applied to data collection, particularly for free services. It is important to be aware of the rules when using APIs, as they can and do change from website to website. Twitter's API limit is 180 requests per 15 minutes (depending on the particular API call). Reddit, as we will see later, allows 30 requests per minute. Other websites impose daily limits, while some limit on a per-second basis. Even within a single website, there can be drastic differences between API calls. For example, Google Maps has smaller limits and different API limits per resource, with different allowances for the number of requests per hour.

If you find you are creating an app or running an experiment that needs more requests and faster responses, most API providers offer commercial plans that allow for more calls.
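As a rough illustration of working within these limits, the sketch below pulls top stories from /r/worldnews through reddit's public JSON listing endpoint, pausing between requests to stay under 30 requests per minute. This is an unauthenticated sketch for illustration only: the User-Agent string is a made-up example (reddit asks for a unique, descriptive one per application), and a real collection script should use proper authorization.

import time
import requests

# Example User-Agent only; substitute your own descriptive string.
headers = {"User-Agent": "python:news-clustering-example:v0.1 (example)"}
url = "https://www.reddit.com/r/worldnews/top.json"

stories = []
after = None
for page in range(3):  # fetch three pages of listings
    params = {"limit": 25, "after": after}
    response = requests.get(url, headers=headers, params=params)
    data = response.json()["data"]
    for child in data["children"]:
        stories.append((child["data"]["title"], child["data"]["url"]))
    after = data["after"]  # cursor for the next page
    if after is None:
        break
    time.sleep(2)  # 30 requests per minute = one request every 2 seconds

print(len(stories), "stories collected")

The two-second pause is the simplest way to respect a 30-requests-per-minute cap; more robust code would also back off when the server returns an error status.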

