MACHINE LEARNING TECHNIQUES - LASA
3 Clustering and Classification
Classification consists of categorizing data according to some criteria. It can be used as a means to reduce the dimensionality of a dataset, prior to performing other forms of data processing on the resulting categories.
Classification is a key feature of most learning methods. In this chapter, we will first look at one way to perform classification through data clustering. Then, we will look at a number of linear classification methods that exploit eigenvalue decomposition, such as those seen for PCA and CCA in the previous chapter. All of these methods are in effect supervised, as they rely on an a priori estimate of the number of classes or clusters. Note that other methods covered in these lecture notes, such as feed-forward neural networks and Hidden Markov Models, can also be used to perform classification. As is often the case, machine learning methods may be used in various ways, and there is thus no single way to classify them!
3.1 Clustering Techniques
Clustering is a type of multivariate statistical analysis also known as cluster analysis, unsupervised classification analysis, or numerical taxonomy.
Looking at Figure 3-1, one can easily see that the points group naturally into three separate clusters, albeit of different shapes and sizes. While this is obvious when looking at the graph, it would be far less obvious if you were to look only at the coordinates of the points.

Figure 3-1: Example of a dataset that forms three clusters
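A dataset of this kind can be generated by sampling from a few Gaussian distributions with different means and covariances. The following sketch is illustrative only; the means, covariances, and sample sizes are arbitrary choices, not values taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three Gaussian blobs of different shape and size (parameters are
# illustrative, not taken from Figure 3-1).
means = [np.array([0.0, 0.0]), np.array([5.0, 1.0]), np.array([2.0, 6.0])]
covs = [np.diag([0.3, 0.3]), np.diag([1.5, 0.2]), np.diag([0.5, 1.0])]
sizes = [50, 80, 60]

# Stack the three samples into a single (N x 2) dataset.
points = np.vstack([rng.multivariate_normal(m, c, size=n)
                    for m, c, n in zip(means, covs, sizes)])
print(points.shape)  # (190, 2)
```

Plotted, such points visibly form three groups; seen only as a list of 190 coordinate pairs, the grouping is much harder to detect.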
There are many ways to cluster data, depending on the criteria of the task. One can, for instance, group the data or segment it into clusters. Clusters can be either globular (convex), such that any line segment drawn between two cluster members stays inside the boundaries of the cluster, or non-globular (concave), i.e. taking any shape. Clusters can contain an equal number of data points, or be of equal extent (distance between the data points). Central to all the goals of cluster analysis is the notion of degree of similarity (or dissimilarity) between the individual objects being clustered. This notion determines the type of clusters that will be formed.
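To illustrate how the choice of (dis)similarity measure changes which objects count as "close", the sketch below compares the Euclidean distance with the cosine dissimilarity on three hand-picked 2D points. The points and the two measures are illustrative choices, not part of the original notes.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance: sensitive to magnitude and position."""
    return np.linalg.norm(a - b)

def cosine_dissimilarity(a, b):
    """1 - cosine similarity: small when vectors point the same way,
    regardless of their magnitude."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 1.0])
y = np.array([4.0, 4.0])   # same direction as x, but far away in space
z = np.array([1.0, -1.0])  # near x in space, orthogonal in direction

# Under the Euclidean metric, x is closer to z than to y ...
print(euclidean(x, y), euclidean(x, z))        # ~4.243 vs 2.0
# ... while under cosine dissimilarity, x and y are identical.
print(cosine_dissimilarity(x, y), cosine_dissimilarity(x, z))  # 0.0 vs 1.0
```

A clustering algorithm fed the Euclidean measure would group x with z, whereas one fed the cosine measure would group x with y; the measure thus determines the type of clusters formed.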
In this chapter, we will see different methods of clustering belonging to two major classes of clustering algorithms, namely hierarchical clustering and partitioning algorithms, and discuss their limitations. In particular, we will look at generalizations of K-means clustering, namely soft K-means and mixtures of Gaussians, which overcome some, but not all, of the limitations of simple K-means.
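As a preview of the partitioning class of algorithms, the sketch below implements plain K-means: given an a priori number of clusters k, it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. This is a minimal didactic version (random initialization from the data, Euclidean distance), not an excerpt from the notes.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j)
             else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs; note that k must be supplied a priori.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, size=(30, 2)),
               rng.normal(5.0, 0.2, size=(30, 2))])
centroids, labels = kmeans(X, k=2)
print(np.sort(centroids[:, 0]))  # roughly [0., 5.]
```

The need to fix k in advance, and the algorithm's preference for globular clusters of similar extent, are exactly the limitations that soft K-means and mixtures of Gaussians address later in the chapter.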
© A.G.Billard 2004 – Last Update March 2011