MACHINE LEARNING TECHNIQUES - LASA


While hierarchical algorithms build clusters gradually (as crystals are grown), partitioning algorithms learn clusters directly. In doing so, they either try to discover clusters by iteratively relocating points between subsets, or try to identify clusters as areas highly populated with data. Algorithms of the first kind are surveyed in the section Partitioning Relocation Methods. They are further categorized into k-means methods (different schemes, initialization, optimization, harmonic means, extensions) and probabilistic clustering or density-based partitioning (e.g. soft k-means and Mixture of Gaussians). Such methods concentrate on how well points fit into their clusters and tend to build clusters of proper convex shapes.

When reading the following section, keep in mind that the major properties one is concerned with when designing a clustering method include:

• Type of attributes the algorithm can handle
• Scalability to large datasets
• Ability to work with high-dimensional data
• Ability to find clusters of irregular shape
• Handling of outliers
• Time complexity (when there is no confusion, we use the term complexity)
• Data order dependency
• Labeling or assignment (hard or strict vs. soft or fuzzy)
• Reliance on a priori knowledge and user-defined parameters
• Interpretability of results

3.1.1 Hierarchical Clustering

In hierarchical clustering, the data is partitioned iteratively, either by agglomerating the data or by dividing the data. The result of such an algorithm is best represented by a dendrogram.

Figure 3-2: Dendrogram
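As a concrete illustration of the dendrogram representation, the short sketch below builds and draws one for a small synthetic dataset using SciPy's hierarchical-clustering utilities. The dataset, the Euclidean metric, and the average-linkage choice are illustrative assumptions, not taken from Figure 3-2.

```python
# Minimal sketch: building and plotting a dendrogram for a small 2-D dataset.
# The data, the Euclidean metric and the 'average' linkage are arbitrary
# illustrative choices, not taken from Figure 3-2.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Two loose groups of points in the plane
X = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
               rng.normal(3.0, 0.5, (10, 2))])

# linkage() performs agglomerative clustering and returns the merge history,
# which dendrogram() draws as a tree
Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z)
plt.xlabel("data point index")
plt.ylabel("inter-cluster distance at merge")
plt.show()
```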

An agglomerative clustering starts with one-point (singleton) clusters and recursively merges the two or more most appropriate clusters. A divisive clustering starts with one cluster containing all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved.

To merge or split subsets of points rather than individual points, the distance between individual points has to be generalized to a distance between subsets. Such a derived proximity measure is called a linkage metric. The type of linkage metric used significantly affects hierarchical algorithms, since it reflects the particular concept of closeness and connectivity. Major inter-cluster linkage metrics include single link, average link, and complete link. The underlying dissimilarity measure (usually, a distance) is computed for every pair of points with one point in the first set and another point in the second set. A specific operation such as minimum (single link), average (average link), or maximum (complete link) is applied to the pair-wise dissimilarity measures:

d(c₁, c₂) = operation{ d(x, y) | x ∈ c₁, y ∈ c₂ }

An agglomerative method proceeds as follows:

1. Initialization: To each of the N data points pᵢ, i = 1,..,N, associate one cluster cᵢ. You thus start with N clusters.

2. Find the closest clusters according to a distance metric d(cᵢ, cⱼ). The distance between groups can be either:

• Single Linkage Clustering: the distance between the closest pair of data points,
  d(cᵢ, cⱼ) = Min { d(pᵢ, pⱼ) : data point pᵢ is in cluster cᵢ and data point pⱼ is in cluster cⱼ }

• Complete Linkage Clustering: the distance between the farthest pair of data points,
  d(cᵢ, cⱼ) = Max { d(pᵢ, pⱼ) : data point pᵢ is in cluster cᵢ and data point pⱼ is in cluster cⱼ }

• Average Linkage Clustering: the average distance between all pairs of data points,
  d(cᵢ, cⱼ) = Mean { d(pᵢ, pⱼ) : data point pᵢ is in cluster cᵢ and data point pⱼ is in cluster cⱼ }

3. Merge the two closest clusters into a single cluster, taking, e.g., either the mean or the median across the two clusters.

4. Repeat steps 2 and 3 until some criterion is achieved, for instance when the minimal number of clusters or the maximal distance between two clusters has been reached (see the code sketch below).

The divisive hierarchical clustering methods work the other way around. They start by considering all data points as belonging to one cluster and iteratively divide the clusters according to the points furthest apart.
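The agglomerative procedure above can be written down quite compactly. The sketch below is a minimal, non-optimized rendering of steps 1-4, assuming Euclidean point distances and a requested number of clusters as the stopping criterion; both are illustrative choices, not fixed by the text.

```python
# Minimal sketch of the agglomerative procedure (steps 1-4 above).
# Assumptions for illustration only: Euclidean point distance and a target
# number of clusters as the stopping criterion.
import numpy as np

def pair_distances(a, b):
    """All pairwise Euclidean distances between points of cluster a and cluster b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

LINKAGES = {
    "single":   lambda a, b: pair_distances(a, b).min(),   # closest pair
    "complete": lambda a, b: pair_distances(a, b).max(),   # farthest pair
    "average":  lambda a, b: pair_distances(a, b).mean(),  # mean over all pairs
}

def agglomerative(X, n_clusters=2, linkage="single"):
    d = LINKAGES[linkage]
    # Step 1: one singleton cluster per data point
    clusters = [X[i:i + 1] for i in range(len(X))]
    while len(clusters) > n_clusters:          # Step 4: stopping criterion
        # Step 2: find the closest pair of clusters under the chosen linkage
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        # Step 3: merge them into a single cluster (keeping all their points;
        # the mean/median variant in the text would replace them by a centroid)
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(2, 0.3, (5, 2))])
    for c in agglomerative(X, n_clusters=2, linkage="average"):
        print(len(c), "points, mean =", c.mean(axis=0).round(2))
```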
