MACHINE LEARNING TECHNIQUES - LASA
5.7 Support Vector Machines
Adapted from "Learning with Kernels" by B. Schölkopf and A. Smola, MIT Press 2002, and "A Tutorial on Support Vector Machines for Pattern Recognition" by C.J.C. Burges, Data Mining and Knowledge Discovery, 2, 1998.
The support vector machine (SVM) is probably one of the most popular applications of kernel methods. SVM exploits the kernel trick to extend classical linear classification to non-linear classification. It has been shown to be very powerful at separating highly intertwined data. Its simplicity of use and the large number of available software implementations make it easy to apply. We will here very briefly review the principle and the derivation of the algorithm. We will highlight the sensitivity to some hyperparameters so as to guide the potential user and ensure optimal use of the algorithm.
Linear Case
Let us first start by considering a very simple classification problem which illustrates well the reasoning behind using kernel methods for classification. Suppose we are given two classes of objects. We are then faced with a new object, and we have to assign it to one of the two classes. This problem can be formalized as follows:
Consider a training set composed of M input-output pairs, where each input x^i ∈ ℝ^N, i = 1...M, is associated with a label y^i, i = 1...M. The label y^i denotes the class to which the pattern x^i belongs. In SVM, we consider solely binary classification problems, i.e.:

{x^i, y^i}_{i=1...M} ∈ X × {±1}        (5.27)
Note that there exist extensions to multiclass SVM. We will here focus first on the binary classification case.
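To make the setting of Eq. (5.27) concrete, here is a minimal sketch in Python. It assumes the scikit-learn library (not used elsewhere in these notes) is available; the toy data and variable names are our own, chosen only to show M input-output pairs with labels in {±1} being fed to a linear SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Toy training set: M = 4 pairs {x^i, y^i}, inputs in R^2, labels in {±1}
X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.5, 2.8]])
y = np.array([-1, -1, +1, +1])

# A linear SVM: the kernel is the canonical dot product (see Eq. 5.28)
clf = SVC(kernel='linear')
clf.fit(X, y)

# Inference on two new patterns: one near each class
preds = clf.predict([[0.2, 0.1], [3.2, 3.1]])  # predicts -1 and +1 respectively
```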
Given this training set, we wish to build a model of the relationship between the input points and their associated class labels that would be a good predictor of the class to which each pattern belongs and would allow us to do inference: that is, given a new pattern x, we could estimate the class to which this new pattern belongs. In some sense, for a given new pattern x, we would choose a corresponding y so that the pair {x, y} is somewhat similar to the training examples. To this end, we need a notion of similarity in X and in {±1}.
Similarity Measures: Characterizing the similarity of the outputs {±1} is easy. In binary classification, only two situations occur: two labels can either be identical or different. The choice of the similarity measure for the inputs, on the other hand, is more complex and is tightly linked to the idea of kernel. As we have seen previously in these lecture notes, the kernel k(x, x') gives a measure of similarity across two datapoints x and x'. A natural choice for the kernel when considering the simple linear classification problem outlined above is to take the dot product, i.e.:
k(x, x') = ⟨x, x'⟩ = Σ_{i=1}^{N} x_i x'_i        (5.28)

The geometrical interpretation of the canonical dot product is that it computes the cosine of the angle between the vectors x and x', provided they are normalized to length 1.
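The kernel of Eq. (5.28) and its cosine interpretation can be checked numerically. The following is a minimal sketch in Python/NumPy; the function name `linear_kernel` is our own, not part of any library.

```python
import numpy as np

def linear_kernel(x, xp):
    """Canonical dot-product kernel of Eq. (5.28): k(x, x') = sum_i x_i x'_i."""
    return float(np.dot(x, xp))

# For unit-length vectors, k(x, x') equals the cosine of the angle between them.
x = np.array([1.0, 0.0])                                # unit vector at 0 degrees
xp = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3)])   # unit vector at 60 degrees
similarity = linear_kernel(x, xp)                       # approximately cos(60°) = 0.5
```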
© A.G.Billard 2004 – Last Update March 2011