Chapter 10

Our function definition takes a set of labels:

    def create_coassociation_matrix(labels):

We then record the rows and columns of each match, storing these in two lists. Sparse matrices are commonly just sets of lists recording the positions of nonzero values, and csr_matrix is an example of this type of sparse matrix:

        rows = []
        cols = []

We then iterate over each of the individual labels:

        unique_labels = set(labels)
        for label in unique_labels:

We look for all samples that have this label:

            indices = np.where(labels == label)[0]

For each pair of samples with the preceding label, we record the positions of both samples in our lists. The code is as follows:

            for index1 in indices:
                for index2 in indices:
                    rows.append(index1)
                    cols.append(index2)

Outside all loops, we then create the data, which is simply the value 1 for every time two samples were listed together. We get the number of ones to place from the total number of matches we recorded. The code is as follows:

        data = np.ones((len(rows),))
        return csr_matrix((data, (rows, cols)), dtype='float')

To get the coassociation matrix from the labels, we simply call this function:

    C = create_coassociation_matrix(labels)

From here, we can add multiple instances of these matrices together. This allows us to combine the results from multiple runs of k-means (a short sketch of this follows below). Printing out C (just enter C into a new cell and run it) will tell you how many cells have nonzero values in them. In my case, about half of the cells had values in them, as my clustering result had a large cluster (the more even the clusters, the lower the number of nonzero values).

The next step involves the hierarchical clustering of the coassociation matrix. We will do this by finding minimum spanning trees on this matrix and removing edges with a weight lower than a given threshold.
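As a minimal sketch of the combining step mentioned above, the following code sums the coassociation matrices from several k-means runs. It assumes the create_coassociation_matrix function defined above is in scope; the stand-in dataset, the number of runs, and the choice of n_clusters are illustrative assumptions rather than the book's values:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Hypothetical stand-in data; the chapter uses the news article matrix X.
    X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

    n_runs = 10  # illustrative number of k-means runs
    C_total = None
    for run in range(n_runs):
        # A different random_state can give a different clustering each run.
        labels = KMeans(n_clusters=3, n_init=10, random_state=run).fit_predict(X)
        C = create_coassociation_matrix(labels)  # defined above
        # Summing sparse matrices accumulates, per pair of samples,
        # the number of runs in which they shared a cluster.
        C_total = C if C_total is None else C_total + C

Each entry of C_total then counts how often a pair of samples was clustered together, which is exactly the evidence the next step operates on.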
In graph theory, a spanning tree is a set of edges on a graph that connects all of the nodes together. The Minimum Spanning Tree (MST) is simply the spanning tree with the lowest total weight. For our application, the nodes in our graph are samples from our dataset, and the edge weights are the number of times those two samples were clustered together, that is, the values from our coassociation matrix.

In the following figure, an MST on a graph of six nodes is shown. A node can be the endpoint of more than one edge in the MST; the only criterion for a spanning tree is that all nodes should be connected together.

To compute the MST, we use SciPy's minimum_spanning_tree function, which is found in the csgraph submodule of the sparse package:

    from scipy.sparse.csgraph import minimum_spanning_tree

The minimum_spanning_tree function can be called directly on the sparse matrix returned by our coassociation function:

    mst = minimum_spanning_tree(C)

However, in our coassociation matrix C, higher values indicate samples that were clustered together more often; they are similarity values. In contrast, minimum_spanning_tree interprets its input as distances, where higher scores are penalized. For this reason, we compute the minimum spanning tree on the negation of the coassociation matrix instead:

    mst = minimum_spanning_tree(-C)
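To connect this back to the thresholding step described at the end of the previous page, here is a minimal sketch of how edges could be pruned from the MST and the final clusters read off. The threshold value and variable names are illustrative assumptions, not the book's final implementation; connected_components is SciPy's standard routine for finding connected subgraphs:

    from scipy.sparse.csgraph import connected_components

    # Illustrative cut-off: keep only edges where the pair of samples
    # was clustered together in at least `threshold` runs. Because we
    # negated C, weak edges have values greater than -threshold.
    threshold = 2  # hypothetical value
    mst.data[mst.data > -threshold] = 0
    mst.eliminate_zeros()

    # Samples that remain connected after pruning form the final clusters.
    n_clusters, final_labels = connected_components(mst)

Pruning a weak edge splits the tree into two components, so each removal increases the number of clusters by one; the surviving components are the consensus clusters across all runs.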