SVM Example

Dan Ventura

April 16, 2009

Abstract

We try to give a helpful simple example that demonstrates a linear SVM and then extend the example to a simple non-linear case to illustrate the use of mapping functions and kernels.

1 Introduction

Many learning models make use of the idea that any learning problem can be made easy with the right set of features. The trick, of course, is discovering that “right set of features”, which in general is a very difficult thing to do. SVMs are another attempt at a model that does this. The idea behind SVMs is to make use of a (nonlinear) mapping function Φ that transforms data in input space to data in feature space in such a way as to render a problem linearly separable. The SVM then automatically discovers the optimal separating hyperplane (which, when mapped back into input space via Φ⁻¹, can be a complex decision surface). SVMs are rather interesting in that they enjoy both a sound theoretical basis as well as state-of-the-art success in real-world applications.

To illustrate the basic ideas, we will begin with a linear SVM (that is, a model that assumes the data is linearly separable). We will then expand the example to the nonlinear case to demonstrate the role of the mapping function Φ, and finally we will explain the idea of a kernel and how it allows SVMs to make use of high-dimensional feature spaces while remaining tractable.

2 Linear Example – when Φ is trivial

Suppose we are given the following positively labeled data points in R²:

    {(3, 1), (3, −1), (6, 1), (6, −1)}

and the following negatively labeled data points in R² (see Figure 1):

    {(1, 0), (0, 1), (0, −1), (−1, 0)}


Figure 1: Sample data points in R². Blue diamonds are positive examples and red squares are negative examples.

We would like to discover a simple SVM that accurately discriminates the two classes. Since the data is linearly separable, we can use a linear SVM (that is, one whose mapping function Φ() is the identity function). By inspection, it should be obvious that there are three support vectors (see Figure 2):

    {s₁ = (1, 0), s₂ = (3, 1), s₃ = (3, −1)}

In what follows we will use vectors augmented with a 1 as a bias input, and for clarity we will differentiate these with an over-tilde. So, if s₁ = (1, 0), then s̃₁ = (1, 0, 1). Figure 3 shows the SVM architecture, and our task is to find values for the α_i such that

    α₁ Φ(s₁) · Φ(s₁) + α₂ Φ(s₂) · Φ(s₁) + α₃ Φ(s₃) · Φ(s₁) = −1
    α₁ Φ(s₁) · Φ(s₂) + α₂ Φ(s₂) · Φ(s₂) + α₃ Φ(s₃) · Φ(s₂) = +1
    α₁ Φ(s₁) · Φ(s₃) + α₂ Φ(s₂) · Φ(s₃) + α₃ Φ(s₃) · Φ(s₃) = +1

Since for now we have let Φ() = I, this reduces to

    α₁ s̃₁ · s̃₁ + α₂ s̃₂ · s̃₁ + α₃ s̃₃ · s̃₁ = −1
    α₁ s̃₁ · s̃₂ + α₂ s̃₂ · s̃₂ + α₃ s̃₃ · s̃₂ = +1
    α₁ s̃₁ · s̃₃ + α₂ s̃₂ · s̃₃ + α₃ s̃₃ · s̃₃ = +1


Figure 2: The three support vectors are marked as yellow circles.

Figure 3: The SVM architecture.


Now, computing the dot products results in

    2α₁ + 4α₂ + 4α₃ = −1
    4α₁ + 11α₂ + 9α₃ = +1
    4α₁ + 9α₂ + 11α₃ = +1

A little algebra reveals that the solution to this system of equations is α₁ = −3.5, α₂ = 0.75 and α₃ = 0.75.

Now, we can look at how these α values relate to the discriminating hyperplane; or, in other words, now that we have the α_i, how do we find the hyperplane that discriminates the positive from the negative examples? It turns out that

    w̃ = ∑_i α_i s̃_i
       = −3.5 (1, 0, 1) + 0.75 (3, 1, 1) + 0.75 (3, −1, 1)
       = (1, 0, −2)

Finally, remembering that our vectors are augmented with a bias, we can equate the last entry in w̃ with the hyperplane offset b and write the separating hyperplane equation, 0 = wᵀx + b, with w = (1, 0) and b = −2. Solving for x gives the set of 2-vectors with x₁ = 2, and plotting the line gives the expected decision surface (see Figure 4).
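The algebra above is easy to check numerically. The following short numpy sketch (ours; the variable names are not from the text) builds the matrix of dot products of the augmented support vectors, solves the 3×3 system for the α_i, and recovers w̃:

```python
import numpy as np

# Augmented support vectors s~_i = (s_i, 1)
S = np.array([[1.0,  0.0, 1.0],   # s~1
              [3.0,  1.0, 1.0],   # s~2
              [3.0, -1.0, 1.0]])  # s~3

# Desired outputs: s1 is a negative example, s2 and s3 are positive
y = np.array([-1.0, 1.0, 1.0])

# Matrix of dot products s~_j . s~_i (the coefficients of the system above)
G = S @ S.T
print(G)        # [[ 2.  4.  4.] [ 4. 11.  9.] [ 4.  9. 11.]]

# Solve for the alpha_i
alpha = np.linalg.solve(G, y)
print(alpha)    # [-3.5   0.75  0.75]

# w~ = sum_i alpha_i s~_i; the last entry is the bias b
w_tilde = alpha @ S
print(w_tilde)  # [ 1.  0. -2.]  ->  w = (1, 0), b = -2
```

Note that solving the dot-product system directly like this only works because the support vectors have already been identified by inspection; a full SVM solver would instead discover them by optimizing over all of the training points.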


Figure 4: The discriminating hyperplane corresponding to the values α₁ = −3.5, α₂ = 0.75 and α₃ = 0.75.

2.1 Input space vs. Feature space

3 Nonlinear Example – when Φ is non-trivial

Now suppose instead that we are given the following positively labeled data points in R²:

    {(2, 2), (2, −2), (−2, −2), (−2, 2)}

and the following negatively labeled data points in R² (see Figure 5):

    {(1, 1), (1, −1), (−1, −1), (−1, 1)}

Figure 5: Nonlinearly separable sample data points in R². Blue diamonds are positive examples and red squares are negative examples.


Figure 6: The data represented in feature space.

Our goal, again, is to discover a separating hyperplane that accurately discriminates the two classes. Of course, it is obvious that no such hyperplane exists in the input space (that is, in the space in which the original input data live). Therefore, we must use a nonlinear SVM (that is, one whose mapping function Φ is a nonlinear mapping from input space into some feature space). Define

    Φ₁(x₁, x₂) = (4 − x₂ + |x₁ − x₂|, 4 − x₁ + |x₁ − x₂|)   if √(x₁² + x₂²) > 2
               = (x₁, x₂)                                    otherwise          (1)

Referring back to Figure 3, we can see how Φ transforms our data before the dot products are performed. Therefore, we can rewrite the data in feature space as

    {(2, 2), (6, 2), (6, 6), (2, 6)}

for the positive examples and

    {(1, 1), (1, −1), (−1, −1), (−1, 1)}

for the negative examples (see Figure 6). Now we can once again easily identify the support vectors (see Figure 7):

    {s₁ = (1, 1), s₂ = (2, 2)}

We again use vectors augmented with a 1 as a bias input and will differentiate them as before. Now given the [augmented] support vectors, we must again find values for the α_i.
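A direct transcription of Eq. 1 into Python (a sketch of ours; the function name phi1 is not from the text) confirms the claim used below that both support vectors map to themselves under Φ₁:

```python
import numpy as np

def phi1(x):
    """Eq. 1: a nonlinear map from input space to a feature space of the same dimension."""
    x1, x2 = x
    if np.sqrt(x1**2 + x2**2) > 2:
        return np.array([4 - x2 + abs(x1 - x2), 4 - x1 + abs(x1 - x2)])
    return np.array([x1, x2], dtype=float)

s1 = np.array([1.0, 1.0])  # inside the radius-2 circle, so the "otherwise" branch applies
s2 = np.array([2.0, 2.0])  # outside the radius-2 circle: (4 - 2 + 0, 4 - 2 + 0) = (2, 2)

print(phi1(s1))  # [1. 1.]
print(phi1(s2))  # [2. 2.]
```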


Figure 7: The two support vectors (in feature space) are marked as yellow circles.

This time our constraints are

    α₁ Φ₁(s₁) · Φ₁(s₁) + α₂ Φ₁(s₂) · Φ₁(s₁) = −1
    α₁ Φ₁(s₁) · Φ₁(s₂) + α₂ Φ₁(s₂) · Φ₁(s₂) = +1

Given Eq. 1, this reduces to

    α₁ s̃₁ · s̃₁ + α₂ s̃₂ · s̃₁ = −1
    α₁ s̃₁ · s̃₂ + α₂ s̃₂ · s̃₂ = +1

(Note that even though Φ₁ is a nontrivial function, both s₁ and s₂ map to themselves under Φ₁. This will not be the case for other inputs as we will see later.)

Now, computing the dot products results in

    3α₁ + 5α₂ = −1
    5α₁ + 9α₂ = +1

And the solution to this system of equations is α₁ = −7 and α₂ = 4.
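As before, the small system can be verified with a few lines of numpy (our own check; the last line also previews the weight vector derived next):

```python
import numpy as np

# Augmented support vectors in feature space: (1, 1, 1) and (2, 2, 1)
S = np.array([[1.0, 1.0, 1.0],
              [2.0, 2.0, 1.0]])
y = np.array([-1.0, 1.0])  # s1 is a negative example, s2 is positive

G = S @ S.T                # [[3. 5.] [5. 9.]] -- the coefficients above
alpha = np.linalg.solve(G, y)
print(alpha)               # [-7.  4.]

w_tilde = alpha @ S
print(w_tilde)             # [ 1.  1. -3.]  ->  w = (1, 1), b = -3
```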


Figure 8: The discriminating hyperplane corresponding to the values α₁ = −7 and α₂ = 4.

Finally, we can again look at the discriminating hyperplane in input space that corresponds to these α.

    w̃ = ∑_i α_i s̃_i
       = −7 (1, 1, 1) + 4 (2, 2, 1)
       = (1, 1, −3)

giving us the separating hyperplane equation y = wx + b with w = (1, 1) and b = −3. Plotting the line gives the expected decision surface (see Figure 8).

3.1 Using the SVM

Let's briefly look at how we would use the SVM model to classify data. Given x, the classification f(x) is given by the equation

    f(x) = σ( ∑_i α_i Φ(s_i) · Φ(x) )        (2)

where σ(z) returns the sign of z. For example, if we wanted to classify the point x = (4, 5) using the mapping function of Eq. 1,

    f(4, 5) = σ( −7 Φ₁(1, 1) · Φ₁(4, 5) + 4 Φ₁(2, 2) · Φ₁(4, 5) )
            = σ( −7 (1, 1, 1) · (0, 1, 1) + 4 (2, 2, 1) · (0, 1, 1) )
            = σ(−2)

and thus we would classify x = (4, 5) as negative.
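Eq. 2 is easy to turn into code. The sketch below (ours; the names classify and phi1 are not from the text) hard-codes the two support vectors and the α values found above, uses the augmented vectors exactly as in the worked example, and reproduces the classification of x = (4, 5):

```python
import numpy as np

def phi1(x):
    """The mapping function of Eq. 1."""
    x1, x2 = x
    if np.sqrt(x1**2 + x2**2) > 2:
        return np.array([4 - x2 + abs(x1 - x2), 4 - x1 + abs(x1 - x2)])
    return np.array([x1, x2], dtype=float)

def augment(v):
    """Append the bias input 1."""
    return np.append(v, 1.0)

support = [np.array([1.0, 1.0]), np.array([2.0, 2.0])]  # s1, s2 in input space
alpha = [-7.0, 4.0]

def classify(x):
    """Eq. 2: f(x) = sigma( sum_i alpha_i Phi(s_i) . Phi(x) )."""
    z = sum(a * np.dot(augment(phi1(s)), augment(phi1(x))) for a, s in zip(alpha, support))
    return np.sign(z)

print(classify(np.array([4.0, 5.0])))  # -1.0, i.e. sigma(-2): a negative classification
```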


Figure 9: The decision surface in input space corresponding to Φ₁. Note the singularity.

Looking again at the input space, we might be tempted to think this is not a reasonable classification; however, it is what our model says, and our model is consistent with all the training data. As always, there are no guarantees on generalization accuracy, and if we are not happy about our generalization, the likely culprit is our choice of Φ. Indeed, if we map our discriminating hyperplane (which lives in feature space) back into input space, we can see the effective decision surface of our model (see Figure 9). Of course, we may or may not be able to improve generalization accuracy by choosing a different Φ; however, there is another reason to revisit our choice of mapping function.

4 The Kernel Trick

Our definition of Φ in Eq. 1 preserved the number of dimensions. In other words, our input and feature spaces are the same size. However, it is often the case that in order to effectively separate the data, we must use a feature space that is of (sometimes very much) higher dimension than our input space. Let us now consider an alternative mapping function

    Φ₂(x₁, x₂) = (x₁, x₂, (x₁² + x₂² − 5)/3)        (3)

which transforms our data from 2-dimensional input space to 3-dimensional feature space. Using this alternative mapping, the data in the new feature space looks like

    {(2, 2, 1), (2, −2, 1), (−2, −2, 1), (−2, 2, 1)}

for the positive examples and

    {(1, 1, −1), (1, −1, −1), (−1, −1, −1), (−1, 1, −1)}

for the negative examples.
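A minimal numpy sketch (ours; the name phi2 is not from the text) applies Eq. 3 to each training point and makes the separation explicit: in this 3-dimensional feature space the two classes differ simply in the sign of the third coordinate.

```python
import numpy as np

def phi2(x):
    """Eq. 3: map 2-d input space into 3-d feature space."""
    x1, x2 = x
    return np.array([x1, x2, (x1**2 + x2**2 - 5) / 3.0])

positives = [(2, 2), (2, -2), (-2, -2), (-2, 2)]
negatives = [(1, 1), (1, -1), (-1, -1), (-1, 1)]

for p in positives:
    print(p, "->", phi2(p))  # third coordinate is +1 for every positive example
for n in negatives:
    print(n, "->", phi2(n))  # third coordinate is -1 for every negative example
```

In an SVM, only dot products such as Φ₂(a) · Φ₂(b) are ever needed; a kernel function computes these directly from a and b without explicitly constructing the feature vectors, which is what keeps high-dimensional feature spaces tractable.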
