MACHINE LEARNING TECHNIQUES - LASA


In SVM, we found that minimizing $\|w\|$ was interesting as it allowed a better separation of the two classes. Schölkopf and Smola argue that, in SVR, minimizing $\|w\|$ is also interesting, albeit for a different geometrical reason. They start by computing the ε-margin, given by:

$$m_\varepsilon(f) := \inf \left\{ \left\| \varphi(x) - \varphi(x') \right\| \;\; \forall x, x' \;\text{ s.t. }\; \left| f(x) - f(x') \right| \geq 2\varepsilon \right\} \qquad (5.61)$$

Figure 5-11: Minimizing the ε-margin in SVR is equivalent to maximizing the slope of the function f.

Similarly to the margin in SVM, the ε-margin is a function of the projection vector $w$. As illustrated in Figure 5-11, the flatter the slope of the function $f$, the larger the margin. Conversely, the steeper the slope, the smaller the margin. Hence, to maximize the margin, we must minimize $\|w\|$ (minimizing each projection of $w$ flattens the slope). The linear illustration in Figure 5-11 holds in feature space, since we will proceed to a linear fit in feature space. Finding the optimal estimate of $f$ can now be formulated as an optimization problem of the form:

$$\min_{w} \left( \frac{1}{2}\|w\|^{2} + C \cdot R_{\varepsilon}[f] \right) \qquad (5.62)$$

where $R_{\varepsilon}[f]$ is the empirical risk measured with the ε-insensitive loss:

$$R_{\varepsilon}[f] = \frac{1}{M}\sum_{i=1}^{M} \left| y^{i} - f\!\left(x^{i}\right) \right|_{\varepsilon} \qquad (5.63)$$

$C$ in (5.62) is a constant, and hence a free parameter, that determines the tradeoff between minimizing the fitting error and penalizing the complexity of the fit. This procedure is called ε-SVR.
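To make the ε-insensitive error concrete, here is a minimal Python sketch that evaluates the empirical risk of (5.63) and the regularized objective of (5.62) for a linear fit, i.e. taking φ to be the identity. The function names and the use of NumPy are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def eps_insensitive_risk(y, f_x, eps):
    """Empirical eps-insensitive risk R_eps[f] of eq. (5.63):
    errors smaller than eps cost nothing, larger errors cost
    their excess over eps, averaged over the M datapoints."""
    residuals = np.abs(y - f_x)
    return np.mean(np.maximum(residuals - eps, 0.0))

def svr_objective(w, b, X, y, C, eps):
    """Regularized objective of eq. (5.62) for a linear fit
    f(x) = <w, x> + b (phi taken as the identity here)."""
    f_x = X @ w + b
    return 0.5 * np.dot(w, w) + C * eps_insensitive_risk(y, f_x, eps)
```

Only datapoints falling outside the ε-tube contribute to the risk: inside the tube, `np.maximum(residuals - eps, 0.0)` returns zero.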

Formalism:

The optimization problem given by (5.62) can be rephrased as an optimization under constraints of the form:

$$\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\|w\|^{2} \\
\text{subject to} \quad & \left\langle w, \varphi\!\left(x^{i}\right)\right\rangle + b - y^{i} \leq \varepsilon \\
& y^{i} - \left\langle w, \varphi\!\left(x^{i}\right)\right\rangle - b \leq \varepsilon
\end{aligned} \qquad \forall\, i = 1, \dots, M \qquad (5.64)$$

By minimizing the norm of $w$, we ensure a) that the optimization problem at stake is convex (hence easier to solve) and b) that the conditions given in (5.64) are easier to satisfy (for a small value of $\|w\|$, the left-hand side of each condition is more likely to be bounded above by ε).

Note that satisfying the conditions of (5.64) may be difficult (or impossible) for an arbitrarily small ε. One can hence once more introduce a set of slack variables $\xi_{i}$ and $\xi_{i}^{*}$ $(i = 1, \dots, M)$ for each datapoint, to denote on which side of the function $f$ the datapoint lies (see footnote 5). The slack variables measure by how much each datapoint is incorrectly fitted. This yields the following optimization problem:

$$\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\|w\|^{2} + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_{i} + \xi_{i}^{*}\right) \\
\text{subject to} \quad & \left\langle w, \varphi\!\left(x^{i}\right)\right\rangle + b - y^{i} \leq \varepsilon + \xi_{i} \\
& y^{i} - \left\langle w, \varphi\!\left(x^{i}\right)\right\rangle - b \leq \varepsilon + \xi_{i}^{*} \\
& \xi_{i} \geq 0, \quad \xi_{i}^{*} \geq 0
\end{aligned} \qquad \forall\, i = 1, \dots, M \qquad (5.65)$$

Again, one can solve this quadratic problem through Lagrange multipliers. Introducing a set of Lagrange multipliers $\alpha_{i}^{(*)}, \eta_{i}^{(*)} \geq 0$ for each of our inequality constraints and writing the Lagrangian:

$$\begin{aligned}
L\!\left(w, \xi, \xi^{*}, b\right) = \; & \frac{1}{2}\|w\|^{2} + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_{i} + \xi_{i}^{*}\right) - \sum_{i=1}^{M}\left(\eta_{i}\xi_{i} + \eta_{i}^{*}\xi_{i}^{*}\right) \\
& - \sum_{i=1}^{M}\alpha_{i}\left(\varepsilon + \xi_{i} + y^{i} - \left\langle w, \varphi\!\left(x^{i}\right)\right\rangle - b\right) \\
& - \sum_{i=1}^{M}\alpha_{i}^{*}\left(\varepsilon + \xi_{i}^{*} - y^{i} + \left\langle w, \varphi\!\left(x^{i}\right)\right\rangle + b\right)
\end{aligned} \qquad (5.66)$$

Solving for each of the partial derivatives,

$$\frac{\partial L}{\partial b} = 0; \qquad \frac{\partial L}{\partial w} = 0; \qquad \frac{\partial L}{\partial \xi_{i}^{(*)}} = 0,$$

yields, for the first of these conditions:

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{M}\left(\alpha_{i} - \alpha_{i}^{*}\right) = 0 \qquad (5.67)$$

Footnote 5: We will use the shorthand $\xi_{i}^{(*)}$ when the notation applies to both $\xi_{i}$ and $\xi_{i}^{*}$.
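As a numerical sanity check on this derivation, the sketch below solves the primal problem (5.65) with the generic convex solver cvxpy for a linear feature map (φ(x) = x), reads the multipliers α_i and α_i^* off the dual values of the two constraint blocks, and verifies the stationarity condition (5.67). The toy data, the variable names, and the choice of cvxpy are assumptions made here for illustration; the notes themselves proceed analytically.

```python
import numpy as np
import cvxpy as cp

# Toy 1-D regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = 0.8 * X[:, 0] + 0.1 * rng.standard_normal(40)

M, d = X.shape
C, eps = 10.0, 0.1

# Primal variables of eq. (5.65): w, b and the slacks xi, xi*.
w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(M, nonneg=True)
xi_star = cp.Variable(M, nonneg=True)

objective = cp.Minimize(0.5 * cp.sum_squares(w)
                        + (C / M) * cp.sum(xi + xi_star))
constraints = [X @ w + b - y <= eps + xi,        # upper side of the tube
               y - X @ w - b <= eps + xi_star]   # lower side of the tube
prob = cp.Problem(objective, constraints)
prob.solve()

# The dual values of the two constraint blocks play the role of
# alpha_i and alpha_i^*; their sums should cancel, as in eq. (5.67).
alpha, alpha_star = constraints[0].dual_value, constraints[1].dual_value
print("sum(alpha - alpha*):", np.sum(alpha - alpha_star))
```

On this toy problem the printed sum should be close to zero (up to solver tolerance), as required by (5.67).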

