Clustering News Articles

In most of the previous chapters, we performed data mining knowing what we were looking for. Our use of target classes allowed us to learn how our variables model those targets during the training phase. This type of learning, where we have targets to train against, is called supervised learning. In this chapter, we consider what we can do without those targets. This is unsupervised learning, which is much more of an exploratory task. Rather than classifying with our model, the goal in unsupervised learning is to explore the data for insights.

In this chapter, we look at clustering news articles to find trends and patterns in the data. We look at how we can extract data from different websites, using a link aggregation website to point us to a variety of news stories.

The key concepts covered in this chapter include:

• Obtaining text from arbitrary websites
• Using the reddit API to collect interesting news stories
• Cluster analysis for unsupervised data mining
• Extracting topics from documents
• Online learning for updating a model without retraining it
• Cluster ensembling to combine different models

Obtaining news articles

In this chapter, we will build a system that takes a live feed of news articles and groups them together, where the groups have similar topics. You could run this system over several weeks (or longer) to see how trends change over that time.
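Before we start collecting data, a minimal sketch may help make the end goal concrete: we convert each story's text into a vector and group the vectors with a clustering algorithm such as k-means, which we cover later in this chapter. The toy documents and the choice of two clusters below are illustrative assumptions only:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy documents standing in for the text of real news stories
documents = ["Election results announced in parliament",
             "Parliament votes on new election law",
             "Rover sends back images from Mars",
             "New images show water ice on Mars"]

# Convert the text to tf-idf vectors, then group them into two clusters
X = TfidfVectorizer().fit_transform(documents)
labels = KMeans(n_clusters=2).fit_predict(X)
print(labels)  # stories with the same label were clustered together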
Our system will start with the popular link aggregation website reddit, which stores lists of links to other websites, along with comments sections for discussing them. Links on reddit are organized into categories called subreddits. There are subreddits devoted to particular TV shows, funny images, and many other things. What we are interested in are the subreddits for news. We will use the /r/worldnews subreddit in this chapter, but the code should work with any other subreddit. Our goal is to download popular stories and then cluster them to see the major themes and concepts that occur. This will give us an insight into the popular focus without having to manually analyze hundreds of individual stories.

Using a Web API to get data

We have used web-based APIs to extract data in several of our previous chapters. For instance, in Chapter 7, Discovering Accounts to Follow Using Graph Mining, we used Twitter's API to extract data. Collecting data is a critical part of the data mining pipeline, and web-based APIs are a fantastic way to collect data on a variety of topics.

There are three things you need to consider when using a web-based API for collecting data: authorization methods, rate limiting, and API endpoints.

Authorization methods allow the data provider to know who is collecting the data, to ensure that they are being appropriately rate-limited and that their data access can be tracked. For most websites, a personal account is enough to start collecting data, but some websites will ask you to create a formal developer account for this access.

Rate limiting restricts how quickly you can collect data, and it applies particularly to free services. It is important to be aware of the rules, as they can and do change from website to website. Twitter's API allows 180 requests per 15-minute window (depending on the particular API call). Reddit, as we will see later, allows 30 requests per minute. Other websites impose daily limits, while some limit on a per-second basis. Even within a single website, there can be drastic differences between API calls. For example, Google Maps has separate limits for each resource, with different allowances for the number of requests per hour.

If you find you are creating an app or running an experiment that needs more requests and faster responses, most API providers offer commercial plans that allow for more calls.
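To make reddit's limit concrete, here is a minimal sketch of polite, rate-limited collection: it fetches top-story listings from a subreddit through reddit's public JSON listing endpoint and sleeps between calls so that we never exceed 30 requests per minute. The User-Agent string is a placeholder (reddit asks for a descriptive one identifying your script), and the sketch skips formal authorization for brevity:

import time
import requests

def fetch_top_stories(subreddit, pages=1):
    """Fetch top-story listings, pausing to respect the rate limit."""
    # Placeholder User-Agent -- replace with one describing your own script
    headers = {"User-Agent": "python:news-clustering-example:v0.1 (placeholder)"}
    url = "https://www.reddit.com/r/{}/top.json".format(subreddit)
    stories, after = [], None
    for _ in range(pages):
        response = requests.get(url, headers=headers,
                                params={"limit": 100, "after": after})
        result = response.json()
        stories.extend(child["data"] for child in result["data"]["children"])
        after = result["data"]["after"]  # cursor for the next page of results
        time.sleep(2)  # 30 requests per minute = at most one call every 2 seconds
    return stories

for story in fetch_top_stories("worldnews")[:5]:
    print(story["title"])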