Clustering News Articles<br />

After this point, the inertia improves only slightly, although a criterion this<br />

vague is hard to pin down precisely. Looking for this type of pattern<br />

is called the elbow rule, in that we are looking for an elbow-like bend in the graph.<br />

Some datasets have more pronounced elbows than others, but an elbow isn't guaranteed<br />

to appear at all (some graphs are smooth).<br />
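One rough way to make the elbow rule less subjective is to measure how much each additional cluster reduces the inertia, and to stop once the relative improvement becomes small. The sketch below assumes a hypothetical helper, `find_elbow`, a hypothetical 10 percent threshold, and made-up inertia values chosen only for illustration; they are not results from the news dataset.

```python
def find_elbow(n_cluster_values, inertia_values, min_relative_drop=0.1):
    """Return the first cluster count after which adding one more cluster
    reduces the inertia by less than min_relative_drop.

    Hypothetical helper: the threshold is an assumption, not a standard value.
    Falls back to the largest cluster count if no such point is found.
    """
    for i in range(len(inertia_values) - 1):
        current, following = inertia_values[i], inertia_values[i + 1]
        if current <= 0:
            # Inertia cannot improve any further
            return n_cluster_values[i]
        relative_drop = (current - following) / current
        if relative_drop < min_relative_drop:
            return n_cluster_values[i]
    return n_cluster_values[-1]


# Made-up inertia values for illustration only
example_n_clusters = [2, 3, 4, 5, 6, 7, 8]
example_inertias = [1000.0, 800.0, 650.0, 540.0, 460.0, 445.0, 435.0]
elbow = find_elbow(example_n_clusters, example_inertias)
print(elbow)
```

The threshold is itself a judgment call, so a check like this complements reading the graph rather than replacing it.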

Based on this analysis, we set n_clusters to be 6 and then rerun the algorithm:<br />

n_clusters = 6<br />

pipeline = Pipeline([('feature_extraction',<br />

                      TfidfVectorizer(max_df=0.4)),<br />

                     ('clusterer', KMeans(n_clusters=n_clusters))<br />

                     ])<br />

pipeline.fit(documents)<br />

labels = pipeline.predict(documents)<br />

Extracting topic information from clusters<br />

Now we set our sights on the clusters in an attempt to discover the topics in each.<br />

We first extract the term list from our feature extraction step:<br />

terms = pipeline.named_steps['feature_extraction'].get_feature_names_out()<br />

We also set up another counter for counting the size of each of our clusters:<br />

c = Counter(labels)<br />
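As a small illustration of what the counter holds, the sketch below uses a made-up list of labels (hypothetical values, not output from the pipeline). A `Counter` maps each label to the number of times it occurs, and it returns 0 for a label that never appears, so looking up an empty cluster does not raise an error.

```python
from collections import Counter

# Hypothetical cluster labels for seven documents (illustration only)
example_labels = [0, 2, 2, 1, 2, 0, 2]
example_counts = Counter(example_labels)

print(example_counts[2])              # number of documents assigned to cluster 2
print(example_counts[5])              # a label that never occurs counts as 0
print(example_counts.most_common(1))  # the largest cluster and its size
```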

Iterating over each cluster, we print the size of the cluster as before. It is important<br />

to keep in mind the sizes of the clusters when evaluating the results: some of the<br />

clusters will have only one sample, and are therefore not indicative of a general<br />

trend. The code is as follows:<br />

for cluster_number in range(n_clusters):<br />

    print("Cluster {} contains {} samples".format(cluster_number,<br />

                                                  c[cluster_number]))<br />

Next (and still in the loop), we iterate over the most important terms for this cluster.<br />

To do this, we take the five features with the highest values in the cluster's centroid.<br />

Sorting the feature indices by their centroid values gives us this ranking. The code is as follows:<br />

    print(" Most important terms")<br />

    centroid = pipeline.named_steps['clusterer'].cluster_centers_[cluster_number]<br />

    # argsort returns the feature indices ordered from lowest to highest centroid value<br />

    most_important = centroid.argsort()<br />

