Learning Data Mining with Python

Chapter 7

Next, we need to get the labels that indicate which connected component each sample was placed in. We iterate over all the connected components, noting in a dictionary which user belonged to which connected component. The code is as follows:

label_dict = {}
for i, sub_graph in enumerate(sub_graphs):
    for node in sub_graph.nodes():
        label_dict[node] = i

Then we iterate over the nodes in the graph to get the label for each node in order. We need this two-step process because nodes are not clearly ordered within a graph, but they do maintain their order as long as no changes are made to the graph. This means that, until we change the graph, calling .nodes() on it returns the same ordering. The code is as follows:

labels = np.array([label_dict[node] for node in G.nodes()])
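The two steps above can be sketched end to end on a small toy graph (hypothetical, not from the book's dataset). Note that the book's `sub_graphs` came from an older NetworkX helper, `connected_component_subgraphs`, which has since been removed; recent versions build the subgraphs from `nx.connected_components`, as below:

```python
import numpy as np
import networkx as nx

# A toy graph with two connected components: {0, 1, 2} and {3, 4}.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (3, 4)])

# Recent NetworkX: build one subgraph per connected component.
sub_graphs = [G.subgraph(c) for c in nx.connected_components(G)]

# Step 1: record which component each node belongs to.
label_dict = {}
for i, sub_graph in enumerate(sub_graphs):
    for node in sub_graph.nodes():
        label_dict[node] = i

# Step 2: read the labels back in the graph's own node order.
labels = np.array([label_dict[node] for node in G.nodes()])
print(labels)  # nodes 0-2 share one label; nodes 3-4 share another
```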

Next, the Silhouette Coefficient function takes a distance matrix, not a graph. Addressing this is another two-step process. First, NetworkX provides a handy function, to_scipy_sparse_matrix, which returns the graph in a matrix format that we can use:

X = nx.to_scipy_sparse_matrix(G).todense()
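As a hedged aside: to_scipy_sparse_matrix was the API at the time the book was written, but it was removed in NetworkX 3.0 in favour of to_scipy_sparse_array, whose result is densified with .toarray(). A small sketch on a hypothetical weighted graph that works on both old and new versions:

```python
import networkx as nx

# Toy weighted graph (illustrative only); edge weights play the role of
# the chapter's similarity scores.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 0.5), (1, 2, 1.0)])

# Old NetworkX exposes to_scipy_sparse_matrix; NetworkX >= 3.0 removed it,
# so fall back to the replacement, to_scipy_sparse_array.
try:
    X = nx.to_scipy_sparse_matrix(G).todense()
except AttributeError:
    X = nx.to_scipy_sparse_array(G).toarray()

print(X.shape)  # one row and one column per node
```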

The Silhouette Coefficient implementation in scikit-learn, at the time of writing, doesn't support sparse matrices. For this reason, we need to call the todense() function. Typically, this is a bad idea: sparse matrices are usually used precisely because the data shouldn't be stored in a dense format. In this case it is fine, because our dataset is relatively small; however, don't try this for larger datasets.

For evaluating sparse datasets, I recommend that you look into V-Measure or Adjusted Mutual Information. These are both implemented in scikit-learn, but they have very different parameters for performing their evaluation.
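Both measures are available in scikit-learn's metrics module. Unlike the Silhouette Coefficient, they compare two labelings directly, so no distance matrix (dense or sparse) is needed. A minimal sketch with hypothetical toy labels:

```python
from sklearn.metrics import v_measure_score, adjusted_mutual_info_score

# Toy ground-truth and predicted cluster labels (illustrative only).
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]

# Both scores compare two labelings; 1.0 means a perfect match.
print(v_measure_score(true_labels, pred_labels))
print(adjusted_mutual_info_score(true_labels, pred_labels))
```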

However, the values are based on our weights, which are a similarity, not a distance. For a distance, higher values indicate more difference. We can convert from similarity to distance by subtracting each value from the maximum possible value, which for our weights was 1:

X = 1 - X
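Putting the conversion and the evaluation together on a small hypothetical similarity matrix (not the chapter's data): after subtracting from 1, we pass the result to scikit-learn's silhouette_score with metric='precomputed' so it treats the matrix as precomputed distances.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical 4x4 similarity matrix with weights in [0, 1]:
# samples 0-1 are similar to each other, as are samples 2-3.
similarity = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])

# Convert similarity to distance by subtracting from the maximum weight (1);
# the diagonal becomes 0, as a distance matrix requires.
X = 1 - similarity
labels = np.array([0, 0, 1, 1])

# metric='precomputed' tells scikit-learn X is already a distance matrix.
score = silhouette_score(X, labels, metric='precomputed')
print(score)  # close to 1 for well-separated clusters
```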

