MACHINE LEARNING TECHNIQUES - LASA


Let us now rewrite this result in terms of the input patterns $x^i$, using the kernel $k$ to compute the dot product. We get the decision function:

$$y = \operatorname{sgn}\left( \frac{1}{m_+} \sum_{i \,|\, y_i = +1} \langle x, x^i \rangle \;-\; \frac{1}{m_-} \sum_{i \,|\, y_i = -1} \langle x, x^i \rangle \;+\; b \right) \qquad (5.31)$$

$$\;\;\, = \operatorname{sgn}\left( \frac{1}{m_+} \sum_{i \,|\, y_i = +1} k(x, x^i) \;-\; \frac{1}{m_-} \sum_{i \,|\, y_i = -1} k(x, x^i) \;+\; b \right) \qquad (5.32)$$

If $b = 0$, i.e. the two classes' means are equidistant from the origin, then $k$ can be viewed as a probability density when one of its arguments is fixed. By this, we mean that it is positive and has unit integral,

$$\int_X k(x, x')\, dx = 1 \qquad \forall\, x' \in X.$$

In this case, $y$ takes the form of the so-called Bayes classifier separating the two classes, subject to the assumption that the two classes of patterns were generated by sampling from two probability distributions that are correctly estimated by the Parzen windows estimators of the two class densities,

$$p_+(x) := \frac{1}{m_+} \sum_{i \,|\, y_i = +1} k(x, x^i) \qquad \text{and} \qquad p_-(x) := \frac{1}{m_-} \sum_{i \,|\, y_i = -1} k(x, x^i),$$

where $x \in X$.

Thus, given some point $x$, the class label is computed by checking which of the two values $p_+$ or $p_-$ is larger. This is the best decision one can take if one has no prior information on the data distribution.
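As a concrete illustration, here is a minimal NumPy sketch of the decision rule (5.32), assuming a normalised Gaussian kernel as one admissible choice of $k$; the function names, bandwidth and toy data are illustrative choices, not fixed by the derivation.

```python
import numpy as np

def gaussian_kernel(x, xi, sigma=1.0):
    # Normalised Gaussian kernel: positive and integrates to 1 over x,
    # so it qualifies as a Parzen window.
    d = x.shape[-1]
    diff = x - xi
    return np.exp(-np.dot(diff, diff) / (2 * sigma**2)) / ((2 * np.pi * sigma**2) ** (d / 2))

def parzen_decision(x, X_train, y_train, b=0.0, sigma=1.0):
    # Implements (5.32): compare the class-conditional Parzen estimates
    # p_plus and p_minus and return the sign of their difference (plus offset b).
    plus = X_train[y_train == +1]
    minus = X_train[y_train == -1]
    p_plus = np.mean([gaussian_kernel(x, xi, sigma) for xi in plus])
    p_minus = np.mean([gaussian_kernel(x, xi, sigma) for xi in minus])
    return np.sign(p_plus - p_minus + b)

# Toy usage: two well-separated 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(+2.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)
print(parzen_decision(np.array([1.5, 1.5]), X, y))   # expected: +1.0
```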

5.7.1 Support Vector Machine for Linearly Separable Datasets

A Support Vector Machine (SVM) determines the hyperplane that serves as decision boundary in a binary classification problem. We consider first the linear case and then move to the non-linear case.

Linear Support Vector Machines

Let us assume for a moment that our datapoints $x$ live in a feature space $H$. The class of hyperplanes in the dot product space $H$ is given by:

$$\langle w, x \rangle + b = 0, \qquad \text{where } w \in H,\; b \in \mathbb{R}, \qquad (5.33)$$

with the corresponding decision functions

$$f(x) = \operatorname{sign}\left( \langle w, x \rangle + b \right). \qquad (5.34)$$

We can now define a learning algorithm for linearly separable problems. First, observe that among all hyperplanes separating the data, there exists a unique optimal hyperplane, distinguished by the maximum margin of separation between any training point and the hyperplane, defined by:

$$\max_{w \in H,\, b \in \mathbb{R}} \;\min\left\{ \, \| x - x^i \| \;:\; x \in H,\; \langle w, x \rangle + b = 0,\; i = 1, \dots, M \, \right\}. \qquad (5.35)$$

While in the simple classification problem presented earlier it was sufficient to compute the vector connecting the two clusters' means to define the normal vector, and hence the hyperplane, here the problem of finding the normal vector that leads to the largest margin is slightly more complex. To construct the optimal hyperplane, we have to solve for the objective function $\tau(w)$:

$$\min_{w \in H,\, b \in \mathbb{R}} \; \tau(w) = \frac{1}{2} \| w \|^2 \qquad (5.36)$$

subject to the inequality constraints:

$$y_i \left( \langle w, x^i \rangle + b \right) \ge 1, \qquad \forall\, i = 1, \dots, M. \qquad (5.37)$$

Consider the points for which the equality in (5.37) holds (requiring that there exists such a point is equivalent to choosing a scale for $w$ and $b$). These points lie on two hyperplanes $H_1$ ($\langle w, x^i \rangle + b = +1$) and $H_2$ ($\langle w, x^i \rangle + b = -1$), both with normal $w$ and with perpendicular distances from the origin $|1 - b| / \|w\|$ and $|-1 - b| / \|w\|$, respectively. Hence $d_+ = d_- = 1/\|w\|$ and the margin is simply $2/\|w\|$. Note that $H_1$ and $H_2$ are parallel (they have the same normal) and that no training points fall between them. Thus we can find the pair of hyperplanes which gives the maximum margin by minimizing $\|w\|^2$, subject to the constraints (5.37), which ensure that the class label for a given $x^i$ will be $+1$ if $y_i = +1$, and $-1$ if $y_i = -1$.

Let us now rephrase the minimization under constraints given by (5.36) and (5.37) in terms of the Lagrange multipliers $\alpha_i$, $i = 1, \dots, l$, one for each of the inequality constraints in (5.37). Recall that the rule is that for constraints of the form $c_i \ge 0$, the constraint equations are multiplied by positive Lagrange multipliers and subtracted from the objective function (5.36) to form the Lagrangian; for equality constraints, the Lagrange multipliers are unconstrained. This gives the Lagrangian:

$$L_P(w, b, \alpha) \equiv \frac{1}{2} \| w \|^2 \;-\; \sum_{i=1}^{l} \alpha_i\, y_i \left( \langle w, x^i \rangle + b \right) \;+\; \sum_{i=1}^{l} \alpha_i. \qquad (5.38)$$

We must now minimize $L_P$ with respect to $w$ and $b$, and simultaneously require that the derivatives of $L_P$ with respect to all the $\alpha_i$ vanish, all subject to the constraints $\alpha_i \ge 0$. This is a convex quadratic programming problem.
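As an illustration, the constrained problem (5.36)-(5.37) can be handed directly to a generic constrained optimizer. Below is a minimal sketch using SciPy's SLSQP solver on toy 2-D data; the variable names and data are illustrative choices, and in practice the problem is solved with dedicated quadratic programming or dual solvers rather than this way.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: patterns x^i in R^2 with labels y_i in {-1, +1}.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, (15, 2)), rng.normal(+2.0, 0.5, (15, 2))])
y = np.array([-1] * 15 + [+1] * 15)

def objective(theta):
    # theta = [w_1, w_2, b]; tau(w) = 0.5 * ||w||^2, cf. (5.36).
    w = theta[:2]
    return 0.5 * np.dot(w, w)

def margin_constraints(theta):
    # y_i (<w, x^i> + b) - 1 >= 0 for all i, cf. (5.37).
    w, b = theta[:2], theta[2]
    return y * (X @ w + b) - 1.0

# Start from a rough separating direction so the initial point is feasible.
res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])

w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin 2/||w|| =", 2.0 / np.linalg.norm(w))

# Points whose constraint is (approximately) active lie on H1 or H2:
# these are the support vectors.
active = np.isclose(y * (X @ w + b), 1.0, atol=1e-3)
print("support vector indices:", np.where(active)[0])
```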
