Chapter 10

Figure 4: Plot of the final centroids after 100 iterations

Please note that the plot command works in Matplotlib 3.1.1 or higher.

In the preceding code we limited the number of clusters to 3, but with unlabeled data one is rarely sure how many clusters exist. One can estimate a suitable number of clusters using the elbow method. The method is based on the principle that we should choose the number of clusters beyond which the sum of squared errors (SSE), the total squared distance between each point and its assigned centroid, stops decreasing substantially. If k is the number of clusters, then as k increases the SSE decreases, reaching SSE = 0 when k equals the number of data points, because each point is then its own cluster. We want a low value of k such that the SSE is also low. For the famous Fisher's Iris data set, if we plot SSE for different k values, we can see from the plot below that the SSE drops steeply up to k=3 and decreases only slowly after that, thus the elbow point is k=3:

Figure 5: Plotting SSE against the Number of Clusters, using Fisher's Iris data set
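The elbow computation described above can be sketched in plain Python. This is a minimal illustration, not the code used to produce Figure 5: it assumes synthetic two-dimensional data with three blobs instead of the Iris data set, and the names `make_blobs`, `kmeans`, `total_sse`, and `elbow_curve` are hypothetical helpers introduced only for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k data points picked at random
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members;
        # a centroid with no members stays where it is.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:  # no centroid moved, so the run has converged
            break
        centroids = updated
    return centroids


def total_sse(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


def elbow_curve(points, k_values, restarts=5):
    """Map each k to the lowest SSE found over a few random restarts."""
    curve = {}
    for k in k_values:
        curve[k] = min(total_sse(points, kmeans(points, k, seed=s)) for s in range(restarts))
    return curve


def make_blobs(seed=42, per_blob=30):
    """Hypothetical test data: three Gaussian blobs in the plane."""
    rng = random.Random(seed)
    centres = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in centres for _ in range(per_blob)]


points = make_blobs()
sse_by_k = elbow_curve(points, range(1, 9))
for k, sse in sse_by_k.items():
    print(f"k={k}: SSE={sse:.2f}")
```

Plotting the values of `sse_by_k` against k gives a curve of the kind shown in Figure 5. Because the synthetic data here was generated from three blobs, the bend should typically appear near k=3. The restarts are a design choice: k-means can settle in a poor local minimum, so keeping the lowest SSE over several seeds makes the curve less noisy.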
