MACHINE LEARNING TECHNIQUES - LASA

3 Clustering and Classification

Classification consists of categorizing data according to some criteria. This can be used as a means to reduce the dimensionality of the dataset, prior to performing other forms of data processing on the constructed categories.

Classification is a key feature of most learning methods. In this chapter, we will first look at one way to perform classification through data clustering. Then, we will look at a number of linear classification methods that exploit eigenvalue decomposition, such as those seen for PCA and CCA in the previous chapter. All of these methods are in effect supervised, as they rely on an a priori estimate of the number of classes or clusters. Note that we will see other methods that can be used to perform classification in other chapters of these lecture notes, for instance feed-forward neural networks or Hidden Markov Models. As is often the case, machine learning methods may be used in various ways, and there is thus no single way to classify them!

3.1 Clustering Techniques

Clustering is a type of multivariate statistical analysis also known as cluster analysis, unsupervised classification analysis, or numerical taxonomy.

Looking at Figure 3-1, one can easily see that the points group naturally into 3 separate clusters, albeit of different shape and size. While this is obvious when looking at the graph, it would be far less obvious if you only looked at the coordinates of these points.

Figure 3-1: Example of a dataset that forms 3 clusters
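To make this concrete, one can generate a toy dataset in the spirit of Figure 3-1. The centers, spreads, and sizes below are illustrative choices, not taken from the figure; the point is that the grouping, obvious on a plot, is invisible in the raw coordinate list:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three synthetic clusters of different shape and size
# (all parameters are illustrative, not from the figure).
cluster_a = rng.normal(loc=[0.0, 0.0], scale=[0.5, 0.5], size=(50, 2))
cluster_b = rng.normal(loc=[4.0, 0.0], scale=[1.0, 0.3], size=(80, 2))
cluster_c = rng.normal(loc=[2.0, 3.0], scale=[0.3, 0.8], size=(30, 2))

data = np.vstack([cluster_a, cluster_b, cluster_c])
print(data.shape)  # prints (160, 2): as a flat list of coordinates,
                   # the three groups are no longer apparent
```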

There are many ways to cluster data, depending on the criteria required for the task. One can, for instance, group or segment the data into clusters. Clusters can be globular or convex, such that any line you can draw between two cluster members stays inside the boundaries of the cluster, or non-globular or concave, i.e. taking any shape. Clusters can contain an equal number of data points, or be of equal size (distance between the data points). Central to all of the goals of cluster analysis is the notion of a degree of similarity (or dissimilarity) between the individual objects being clustered. This notion determines the type of clusters that will be formed.
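For instance, measuring dissimilarity by Euclidean distance emphasizes absolute proximity, whereas cosine similarity compares only the directions of the data vectors; the same points can therefore cluster differently under each choice. A minimal sketch (both functions and the sample points are illustrative):

```python
import numpy as np

def euclidean_distance(x, y):
    """Dissimilarity measured as straight-line (Euclidean) distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def cosine_similarity(x, y):
    """Similarity measured as the cosine of the angle between vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

print(euclidean_distance([0, 0], [3, 4]))  # prints 5.0
print(cosine_similarity([1, 0], [1, 1]))   # prints ~0.7071 (cos 45 degrees)
```

Under the cosine measure, [1, 1] and [10, 10] are maximally similar despite being far apart in Euclidean terms, which is why the choice of measure shapes the resulting clusters.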

In this chapter, we will see different methods of clustering that belong to two major classes of clustering algorithms, namely hierarchical clustering and partitioning algorithms. We will discuss the limitations of such methods. In particular, we will look at generalizations of K-means clustering, namely soft K-means and mixtures of Gaussians, which overcome some, but not all, of the limitations of simple K-means.
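As a preview, the basic K-means iteration alternates a hard assignment of each point to its nearest centroid with an update of each centroid to the mean of its assigned points. The sketch below uses a random initialization and toy data that are illustrative choices; the soft variants seen later replace the hard assignment with graded responsibilities:

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    """Minimal K-means: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct points drawn from the data.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Hard assignment: each point goes to its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments stable: converged to a local optimum
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs: K-means with k=2 recovers them.
rng = np.random.default_rng(1)
blob1 = rng.normal([0.0, 0.0], 0.2, size=(20, 2))
blob2 = rng.normal([5.0, 5.0], 0.2, size=(20, 2))
X = np.vstack([blob1, blob2])
labels, centroids = kmeans(X, k=2)
```

Note the reliance on a fixed k: like the other methods of this chapter, the number of clusters must be supplied a priori, and the result depends on the initialization.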

© A.G.Billard 2004 – Last Update March 2011
