MACHINE LEARNING TECHNIQUES - LASA
The CURE algorithm has a number of features of general significance. It takes special care of outliers, and it uses two devices to achieve scalability. A major feature of CURE is that it represents a cluster by a fixed number c of points scattered around it. The distance between two clusters used in the agglomerative process is the minimum of the distances between their scattered representatives. CURE therefore takes a middle-ground approach between the graph (all-points) methods and the geometric (one-centroid) methods: single- and average-link closeness is replaced by the closeness of representatives. Selecting representatives scattered around a cluster makes it possible to cover non-spherical shapes. As before, agglomeration continues until the requested number k of clusters is reached. CURE employs one additional device: the originally selected scattered points are shrunk towards the geometric centroid of the cluster by a user-specified factor α. Shrinkage suppresses the effect of outliers, since outliers tend to be located further from the cluster centroid than the other scattered representatives. CURE is capable of finding clusters of different shapes and sizes, and it is insensitive to outliers. Since CURE uses sampling, estimation of its complexity is not straightforward.

Figure 3-7: Agglomeration with CURE. Three clusters, each with three representatives, are shown before and after the merge and shrinkage. The two closest representatives are connected by an arrow.
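To make the two CURE devices above concrete, the following short Python sketch (function names and toy data are illustrative, not from these notes, and it omits the sampling and partitioning that a full CURE implementation uses) shows representatives being shrunk towards their cluster centroid by a factor α, and the inter-cluster distance being taken as the minimum distance between representatives.

import numpy as np

def shrink_representatives(reps, alpha):
    # Move each scattered representative a fraction alpha towards the cluster centroid.
    centroid = reps.mean(axis=0)
    return reps + alpha * (centroid - reps)

def cluster_distance(reps_a, reps_b):
    # CURE inter-cluster distance: minimum distance between any pair of representatives.
    diffs = reps_a[:, None, :] - reps_b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()

# Toy example: two clusters, each represented by c = 3 scattered points, alpha = 0.3.
reps_a = shrink_representatives(np.array([[0.0, 0.0], [1.0, 0.2], [0.5, 1.0]]), alpha=0.3)
reps_b = shrink_representatives(np.array([[4.0, 4.0], [5.0, 4.2], [4.5, 5.0]]), alpha=0.3)
print(cluster_distance(reps_a, reps_b))  # agglomeration merges the pair of clusters with the smallest such distance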
Summary:
Advantages of hierarchical clustering include:
• Embedded flexibility regarding the level of granularity
• Ease of handling any form of similarity or distance
• Consequently, applicability to any attribute type

Disadvantages of hierarchical clustering are related to:
• Vagueness of the termination criteria
• The fact that most hierarchical algorithms do not revisit intermediate clusters once they have been constructed in order to improve them

Exercise:
Determine the optimal distance metric for two examples, one in which the data are linearly separable and one in which they are not.
3.1.2 K-means clustering

K-Means clustering generates a number K of disjoint, flat (non-hierarchical) clusters C_k, k = 1,...,K, so as to minimize the sum-of-squares criterion

J(\mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \sum_{i \in C_k} \left\| x_i - \mu_k \right\|^2        (3.2)

where x_i is a vector representing the i-th data point and \mu_k is the geometric centroid of the data points associated to cluster C_k. K-means is well suited to generating globular clusters. The K-Means method is numerical, unsupervised, non-deterministic and iterative. In general, the algorithm is not guaranteed to converge to a global minimum of J.
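As a quick illustration of Eq. (3.2), the following Python fragment (variable names and the toy data are mine, purely for illustration) evaluates J for a given set of centroids and hard cluster assignments.

import numpy as np

def sum_of_squares(X, centroids, labels):
    # J(mu_1,...,mu_K): squared distance of every point to the centroid of the cluster it is assigned to.
    return sum(np.sum((X[labels == k] - mu) ** 2) for k, mu in enumerate(centroids))

X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])   # four data points x_i
centroids = np.array([[0.1, 0.05], [4.05, 3.95]])                # mu_1, mu_2
labels = np.array([0, 0, 1, 1])                                  # cluster membership i -> C_k
print(sum_of_squares(X, centroids, labels))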
K-Means Algorithm:
1 Initialization: Pick K arbitrary centroids and set their geometric means \mu_1, \ldots, \mu_K to random values.
2 Calculate the distance d(x_i, \mu_k) from each data point i to each centroid k.
3 Assignment Step: Assign the responsibility r_i^k of each data point i to its "closest" centroid k_i (E-step). If a tie happens (i.e. two centroids are equidistant from a data point), one assigns the data point to the winning centroid with the smallest index.

k_i = \arg\min_k \left\{ d\left( x_i, \mu_k \right) \right\}        (3.3)

r_i^k = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}

4 Update Step: Adjust the centroids to be the means of all data points assigned to them (M-step):

\mu_k = \frac{\sum_i r_i^k x_i}{\sum_i r_i^k}        (3.4)
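A minimal Python sketch of this procedure follows (function and variable names are mine, not from these notes); it alternates the assignment (E) and update (M) steps until the assignments stop changing, and simply keeps the previous centroid whenever a cluster becomes empty.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (Initialization): pick K distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Steps 2-3 (Assignment / E-step): distance of every point to every centroid;
        # argmin breaks ties in favour of the centroid with the smallest index.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable: the algorithm has converged (to a local minimum of J)
        labels = new_labels
        # Step 4 (Update / M-step): each centroid becomes the mean of its assigned points, Eq. (3.4).
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

# Toy usage: two well-separated blobs, K = 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
centroids, labels = kmeans(X, K=2)

Note that, as stated above, different random initializations can lead to different local minima of J.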
© A.G.Billard 2004 – Last Update March 2011