MACHINE LEARNING TECHNIQUES - LASA


5. Go back to step 2 and repeat the process until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.

K-means clustering corresponds to a particular case of Gaussian Mixture Model estimation with EM (expectation-maximization) in which the covariance matrices are fixed (diagonal and isotropic); see Section 3.1.4.

The properties of the K-means algorithm can be summarized as follows:

- There are always K clusters.
- There is always at least one item in each cluster.
- The clusters are non-hierarchical and do not overlap.
- Every member of a cluster is closer to its own cluster than to any other cluster (note that closeness is measured to the cluster, which does not always involve its 'center').
- The algorithm is guaranteed to converge in a finite number of iterations.

Advantages:

- With a large number of variables, K-means may be computationally faster than other clustering techniques, such as hierarchical clustering or non-parametric clustering.
- K-means may produce tighter clusters, especially if the clusters are globular.
- K-means is guaranteed to converge.

Drawbacks:

- The choice of initial partition can greatly affect the final clusters, in terms of inter-cluster and intra-cluster distances and cohesion; it can therefore be difficult to compare the quality of the clusters produced (different initial partitions or values of K affect the outcome).
- K-means assumes a fixed number K of clusters, which is difficult to estimate beforehand.
- It does not work well with non-globular clusters.
- Different initial partitions can result in different final clusters (see Figure 3-8). It is therefore good practice to run the algorithm several times, using different values of K, to determine the optimal number of clusters.

Figure 3-8: The random initialization of K-means can lead to different clustering results (in this case using 3 clusters). [DEMOS\CLUSTERING\KMEANS-RANDOM_INITIALIZATIONS.ML]
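To make the loop above concrete, here is a minimal NumPy sketch of K-means (an illustrative re-implementation, not the MLDemos code referenced above; the function name, random-point initialization, and convergence test on assignments are choices of this sketch):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means sketch. X: (N, d) data array. Returns (means, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the K means with randomly chosen data points.
    means = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest mean (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop once a full pass moves no point between clusters.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Steps 3-4: recompute each mean as the centroid of its cluster.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # guard against an emptied cluster
                means[k] = members.mean(axis=0)
    return means, labels
```

Running it several times with different values of seed on the same data reproduces the initialization sensitivity illustrated in Figure 3-8.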

Cases where K-means might be viewed as failing:

Unbalanced clusters: The K-means algorithm takes into account only the distance between the means and the data points; it has no representation of the weight and breadth of each cluster. Consequently, data points belonging to unbalanced clusters (clusters with an unbalanced number of points, spread over a smaller breadth) will be incorrectly assigned.

Elongated clusters: The K-means algorithm has no way to represent the shape of a cluster. Hence, a simple measure of distance is unable to separate two elongated clusters, as shown in Figure 3-9.

Figure 3-9: Typical examples for which K-means is ill-suited: unbalanced clusters (left) and elongated clusters (right). In both cases the algorithm fails to separate the two clusters. [DEMOS\CLUSTERING\KMEANS-UNBALANCED.ML] [DEMOS\CLUSTERING\KMEANS-ELONGATED.ML]
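The elongated failure case is easy to reproduce with the kmeans sketch above (the cluster shapes and sizes below are illustrative choices, not the data of Figure 3-9):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two elongated clusters: long along x (std 5), thin along y (std 0.3),
# stacked vertically with their means 2 apart.
a = rng.normal(loc=[0.0, 0.0], scale=[5.0, 0.3], size=(200, 2))
b = rng.normal(loc=[0.0, 2.0], scale=[5.0, 0.3], size=(200, 2))
X = np.vstack([a, b])

means, labels = kmeans(X, K=2, seed=1)
# K-means typically splits this data left/right rather than top/bottom:
# the left/right split yields a lower sum of squared distances, even
# though it cuts straight through both true clusters.
print(means)
```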

3.1.3 Soft K-means

The soft K-means algorithm was proposed as an alternative to K-means in order to "soften" the assignment of points to clusters. Each data point x_i is given a soft 'degree of assignment' to each of the means. The degree to which x_i is assigned to cluster k is the responsibility r_k^i of cluster k for point i, a real value between 0 and 1:

$$ r_k^i = \frac{e^{-\beta \, d(\mu_k, x_i)}}{\sum_{k'} e^{-\beta \, d(\mu_{k'}, x_i)}} \qquad (3.5) $$

β is the stiffness and determines the binding between clusters: the larger β, the bigger the distance across two clusters. A measure of the disparity across clusters is given by σ ≡ 1/√β, the radius of the circle that surrounds the means of the K clusters. The sum of the K responsibilities for the i-th point is 1, i.e. ∑_k r_k^i = 1.
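A minimal sketch of one soft K-means pass under these definitions: the responsibilities implement Eq. (3.5), while the choice of squared Euclidean distance for d and the responsibility-weighted update of the means are standard soft K-means choices assumed here, since the text above only defines the responsibilities:

```python
import numpy as np

def soft_kmeans_step(X, means, beta):
    """One soft K-means pass. X: (N, d), means: (K, d), beta: stiffness."""
    # d(mu_k, x_i): squared Euclidean distance (an assumed choice of d).
    d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    # Responsibilities r_k^i of Eq. (3.5); subtracting the row-wise max
    # before exponentiating avoids numerical underflow/overflow.
    logits = -beta * d
    logits -= logits.max(axis=1, keepdims=True)
    r = np.exp(logits)
    r /= r.sum(axis=1, keepdims=True)  # each row now sums to 1, as required
    # Assumed standard update: each mean becomes the responsibility-weighted
    # average of all points.
    new_means = (r.T @ X) / r.sum(axis=0)[:, None]
    return new_means, r
```

As β grows, the responsibilities approach hard 0/1 assignments and the update reduces to ordinary K-means; a small β (a large σ) spreads each point's responsibility across all clusters.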
