

After this point, only slight improvements are made to the inertia, although it is hard to be precise about a criterion as vague as this. Looking for this type of pattern is called the elbow rule, in that we are looking for an elbow-like bend in the graph of inertia against the number of clusters. Some datasets have more pronounced elbows than others, but this feature isn't guaranteed to appear at all (some graphs may be smooth!).
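If you want to recreate that graph, the following minimal sketch computes the inertia for a range of cluster counts and plots it. It assumes the documents variable from earlier in the chapter; the range of values to try is a hypothetical choice, not the book's:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

# Vectorize once, then fit a KMeans model for each candidate cluster count
X = TfidfVectorizer(max_df=0.4).fit_transform(documents)
n_cluster_values = list(range(2, 20))  # hypothetical range to try
inertia_scores = [KMeans(n_clusters=n).fit(X).inertia_
                  for n in n_cluster_values]

# Plot inertia against the number of clusters and look for the bend
plt.plot(n_cluster_values, inertia_scores)
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()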

Based on this analysis, we set n_clusters to be 6 and then rerun the algorithm:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

n_clusters = 6
pipeline = Pipeline([('feature_extraction', TfidfVectorizer(max_df=0.4)),
                     ('clusterer', KMeans(n_clusters=n_clusters))])
pipeline.fit(documents)
labels = pipeline.predict(documents)
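As a side note, because KMeans implements fit_predict, the pipeline exposes it too, so the two calls above could be combined into a single pass over the data:

labels = pipeline.fit_predict(documents)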

Extracting topic information from clusters

Now we set our sights on the clusters in an attempt to discover the topics in each. We first extract the term list from our feature extraction step:

terms = pipeline.named_steps['feature_extraction'].get_feature_names()

(In scikit-learn 1.2 and later, get_feature_names() has been removed in favour of get_feature_names_out().)

We also set up another counter for counting the size of each of our clusters:

from collections import Counter
c = Counter(labels)

Iterating over each cluster, we print the size of the cluster as before. It is important to keep the sizes of the clusters in mind when evaluating the results; some clusters will have only one sample, and are therefore not indicative of a general trend. The code is as follows:

for cluster_number in range(n_clusters):
    print("Cluster {} contains {} samples".format(cluster_number,
                                                  c[cluster_number]))

Next (and still in the loop), we iterate over the most important terms for this cluster. To do this, we take the five features with the largest values in the centroid, as these are the terms that best characterize the cluster. The code is as follows:

print(" Most important terms")<br />

centroid = pipeline.named_steps['clusterer'].cluster_centers_<br />

[cluster_number]<br />

most_important = centroid.argsort()<br />
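The extract ends here, with the sorted indices in hand. Since argsort sorts in ascending order, the indices of the five largest values sit at the end of the array. A minimal sketch of how the top terms could then be printed, still inside the loop and reusing the terms list from above (an illustration, not necessarily the book's exact continuation):

    # argsort is ascending, so reverse and take the first five indices
    for term_index in most_important[::-1][:5]:
        print("  {0} (score: {1:.4f})".format(terms[term_index],
                                              centroid[term_index]))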

