MACHINE LEARNING TECHNIQUES - LASA
quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set (any linear constraint defines a convex set, and a set of N simultaneous linear constraints defines the intersection of N convex sets, which is also a convex set). This means that we can equivalently solve the following dual problem: maximize L_P, subject to the constraints that the gradient of L_P with respect to w and b vanish, and subject also to the constraints that the α_i ≥ 0.
Requiring that the gradient of L_P with respect to w and b vanish,

    ∂L_P/∂w_j = w_j − ∑_i α_i y_i x_{ij} ,   j = 1, …, N        (5.39)

    ∂L_P/∂b = − ∑_i α_i y_i = 0                                 (5.40)

gives the conditions:

    w = ∑_{i=1}^{M} α_i y_i x_i                                 (5.41)

    ∑_{i=1}^{M} α_i y_i = 0 .                                   (5.42)
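Conditions (5.39) and (5.40) can be checked numerically. The sketch below is an illustration only; the toy data, the arbitrary multipliers, and the standard primal Lagrangian L_P = ½‖w‖² − ∑_i α_i [y_i(⟨w, x_i⟩ + b) − 1] from (5.38) are assumptions made here, not part of the notes. It compares the analytic gradients against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 2                               # M training points in N dimensions
X = rng.normal(size=(M, N))               # hypothetical toy inputs x_i
y = np.where(X[:, 0] > 0, 1.0, -1.0)      # toy labels y_i in {-1, +1}
alpha = rng.uniform(0.1, 1.0, size=M)     # arbitrary multipliers alpha_i
w = rng.normal(size=N)                    # arbitrary weight vector
b = 0.3

def L_P(w, b):
    # assumed primal Lagrangian (5.38): 1/2||w||^2 - sum_i alpha_i [y_i(<w,x_i> + b) - 1]
    return 0.5 * w @ w - np.sum(alpha * (y * (X @ w + b) - 1.0))

# analytic gradients, as in (5.39) and (5.40)
grad_w = w - (alpha * y) @ X              # dL_P/dw_j = w_j - sum_i alpha_i y_i x_ij
grad_b = -np.sum(alpha * y)               # dL_P/db  = -sum_i alpha_i y_i

# central finite differences agree with the analytic expressions
eps = 1e-6
num_w = np.array([(L_P(w + eps * e, b) - L_P(w - eps * e, b)) / (2 * eps)
                  for e in np.eye(N)])
num_b = (L_P(w, b + eps) - L_P(w, b - eps)) / (2 * eps)
print(np.allclose(grad_w, num_w, atol=1e-5), np.isclose(grad_b, num_b, atol=1e-5))
# → True True
```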
Since these are equality constraints in the dual formulation, we can substitute them into (5.38) to give:

    L_D(α) = ∑_i α_i − (1/2) ∑_{i,j} α_i α_j y_i y_j ⟨x_i , x_j⟩ .        (5.43)
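Because (5.43) is a quadratic form in α, it is often rewritten with the Gram matrix K_ij = ⟨x_i, x_j⟩ as ∑_i α_i − ½ αᵀ(yyᵀ ∘ K)α, which is the shape quadratic-programming solvers expect. The sketch below (toy data and multipliers are assumptions for illustration) verifies that this matrix form agrees with the double sum in (5.43):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 5
X = rng.normal(size=(M, 2))               # hypothetical toy inputs
y = rng.choice([-1.0, 1.0], size=M)       # toy labels
alpha = rng.uniform(0.0, 1.0, size=M)     # arbitrary multipliers

K = X @ X.T                               # Gram matrix K_ij = <x_i, x_j>

# double-sum form, exactly as written in (5.43)
L_D_sum = alpha.sum() - 0.5 * sum(
    alpha[i] * alpha[j] * y[i] * y[j] * K[i, j]
    for i in range(M) for j in range(M))

# equivalent matrix form: sum_i alpha_i - 1/2 alpha^T (y y^T * K) alpha
H = np.outer(y, y) * K
L_D_mat = alpha.sum() - 0.5 * alpha @ H @ alpha

print(np.isclose(L_D_sum, L_D_mat))       # → True
```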
Note that we have now given the Lagrangian different labels (P for primal, D for dual) to emphasize that the two formulations are different: L_P and L_D arise from the same objective function but with different constraints; and the solution is found either by minimizing L_P or by maximizing L_D. Note also that if we formulate the problem with b = 0, which amounts to requiring that all hyperplanes contain the origin (this is a mild restriction for high-dimensional spaces, since it amounts to reducing the number of degrees of freedom by one), support vector training (for the separable, linear case) then amounts to maximizing L_D with respect to the α_i, subject to constraint (5.42) and positivity of the α_i, with solution given by (5.41). Notice that there is a Lagrange multiplier α_i for every training point. In the solution, those points for which α_i > 0 are called support vectors and lie on one of the hyperplanes H_1, H_2. All other training points have α_i = 0 and lie either on H_1 or H_2, or on that side of H_1 or H_2 such that the strict inequality in (5.37) is satisfied.
For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around, but so as not to cross H_1 or H_2), and training was repeated, the same separating hyperplane would be found.
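This property can be illustrated end to end. The sketch below is built on assumptions not in the notes: a hypothetical separable toy set, the b = 0 formulation (so only positivity of the α_i is enforced here), and simple projected gradient ascent as the solver, chosen purely for illustration. It maximizes L_D, recovers w from (5.41), and checks that retraining on the support vectors alone yields the same hyperplane:

```python
import numpy as np

rng = np.random.default_rng(2)
# hypothetical linearly separable toy set; with b = 0 the separating
# hyperplane passes through the origin, so the two clusters are placed
# symmetrically about it
X = np.vstack([rng.normal(loc=[3.0, 3.0], size=(10, 2)),
               rng.normal(loc=[-3.0, -3.0], size=(10, 2))])
y = np.concatenate([np.ones(10), -np.ones(10)])

def train(X, y, steps=100_000, lr=1e-3):
    # maximize L_D = sum_i a_i - 1/2 a^T H a subject to a_i >= 0
    # (b = 0 case), by projected gradient ascent
    H = np.outer(y, y) * (X @ X.T)
    a = np.zeros(len(y))
    for _ in range(steps):
        a += lr * (1.0 - H @ a)           # gradient of L_D
        a = np.maximum(a, 0.0)            # project back onto a_i >= 0
    return a

a = train(X, y)
w = (a * y) @ X                           # solution via (5.41)
sv = a > 1e-6                             # support vectors: alpha_i > 0

# retraining on the support vectors alone recovers the same hyperplane
a2 = train(X[sv], y[sv])
w2 = (a2 * y[sv]) @ X[sv]
print(sv.sum(), np.allclose(w, w2, atol=5e-2))
```

Only a handful of the 20 points end up with α_i > 0; all others have α_i = 0 and can be discarded without changing w, as the text states.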
© A.G.Billard 2004 – Last Update March 2011