MACHINE LEARNING TECHNIQUES - LASA
In SVM, we found that minimizing $\|w\|$ was interesting as it allowed a better separation of the two classes. Schölkopf and Smola argue that, in SVR, minimizing $\|w\|$ is also interesting, albeit for a different geometrical reason. They start by computing the ε-margin, given by:

$$m_\varepsilon(f) := \inf \left\{ \left\| \phi(x) - \phi(x') \right\| \;:\; \forall x, x' \;\text{ s.t. }\; \left| f(x) - f(x') \right| \ge 2\varepsilon \right\} \qquad (5.61)$$

Figure 5-11: Minimizing the ε-margin in SVR is equivalent to maximizing the slope of the function f.

Similarly to the margin in SVM, the ε-margin is a function of the projection vector $w$. As illustrated in Figure 5-11, the flatter the slope of the function $f$ (i.e., the smaller $\|w\|$), the larger the margin; conversely, the steeper the slope, the smaller the margin. Hence, to maximize the margin, we must minimize $\|w\|$ (minimizing each projection of $w$ flattens the slope). The linear illustration in Figure 5-11 holds in feature space, as we will proceed to a linear fit in feature space.

Finding the optimal estimate of $f$ can now be formulated as an optimization problem of the form:

$$\min_{w} \; \left( \frac{1}{2}\|w\|^2 + C \cdot R_\varepsilon[f] \right) \qquad (5.62)$$

where $R_\varepsilon[f]$ is a regularized risk function that gives a measure of the ε-insensitive error:

$$R_\varepsilon[f] = \frac{1}{M} \sum_{i=1}^{M} \left| y^i - f\left(x^i\right) \right|_\varepsilon \qquad (5.63)$$

$C$ in (5.62) is a constant and hence a free parameter that determines a tradeoff between minimizing the error and optimizing for the complexity (flatness) of the fit. This procedure is called ε-SVR.
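The ε-insensitive risk in (5.62)–(5.63) is straightforward to evaluate numerically, recalling that $|u|_\varepsilon = \max(0, |u| - \varepsilon)$. The short NumPy sketch below only illustrates these definitions for a linear model $f(x) = \langle w, x\rangle + b$; the toy data, the values of ε and C, and the helper names are assumptions made for illustration, not part of the notes.

```python
# A minimal NumPy sketch of the eps-insensitive risk (5.63) and of the
# regularized objective (5.62) for a linear model f(x) = <w, x> + b.
# Toy data, parameter values (eps, C) and helper names are illustrative.
import numpy as np

def eps_insensitive_loss(y, f_x, eps):
    """|y - f(x)|_eps = max(0, |y - f(x)| - eps): zero inside the eps-tube."""
    return np.maximum(np.abs(y - f_x) - eps, 0.0)

def regularized_objective(w, b, X, y, eps, C):
    """0.5 * ||w||^2 + C * R_eps[f], with R_eps[f] the mean eps-insensitive error."""
    f_x = X @ w + b
    risk = eps_insensitive_loss(y, f_x, eps).mean()
    return 0.5 * np.dot(w, w) + C * risk

# Toy data: a noisy line y = 2x + 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.05 * rng.standard_normal(50)

# Comparing a flat fit with a steep one: the steep fit pays a larger
# regularization term 0.5 * ||w||^2 and, once it leaves the tube, a larger risk.
print(regularized_objective(np.array([2.0]), 1.0, X, y, eps=0.1, C=10.0))
print(regularized_objective(np.array([6.0]), 1.0, X, y, eps=0.1, C=10.0))
```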
Formalism:

The optimization problem given by (5.62) can be rephrased as an optimization under constraint problem of the form:

$$\begin{aligned}
&\text{minimize} \quad \frac{1}{2}\|w\|^2 \\
&\text{subject to} \quad
\begin{cases}
\langle w, \phi(x^i)\rangle + b - y^i \le \varepsilon \\[2pt]
y^i - \langle w, \phi(x^i)\rangle - b \le \varepsilon
\end{cases}
\quad \forall\, i = 1, \dots, M
\end{aligned} \qquad (5.64)$$

Minimizing the norm of $w$, we ensure a) that the optimization problem at stake is convex (hence easier to solve) and b) that the conditions given in (5.64) are easier to satisfy (for a small value of $\|w\|$, the left-hand side of each condition is more likely to be bounded above by $\varepsilon$).

Since satisfying the conditions of (5.64) may be difficult (or even impossible) for an arbitrarily small $\varepsilon$, one can once more introduce a set of slack variables $\xi_i$ and $\xi_i^*$ $(i = 1, \dots, M)$, one pair per datapoint, to denote whether the datapoint lies on the left- or right-hand side of the function $f$.⁵ The slack variables measure by how much each datapoint is incorrectly fitted. This yields the following optimization problem:

$$\begin{aligned}
&\text{minimize} \quad \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right) \\
&\text{subject to} \quad
\begin{cases}
\langle w, \phi(x^i)\rangle + b - y^i \le \varepsilon + \xi_i \\[2pt]
y^i - \langle w, \phi(x^i)\rangle - b \le \varepsilon + \xi_i^* \\[2pt]
\xi_i \ge 0, \quad \xi_i^* \ge 0
\end{cases}
\end{aligned} \qquad (5.65)$$

Again, one can solve this quadratic problem through Lagrange. Introducing a set of Lagrange multipliers $\alpha_i^{(*)}, \eta_i^{(*)} \ge 0$ for each of our inequality constraints and writing the Lagrangian:

$$\begin{aligned}
L(w, \xi, \xi^*, b) = \; & \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right) - \sum_{i=1}^{M}\left(\eta_i \xi_i + \eta_i^* \xi_i^*\right) \\
& - \sum_{i=1}^{M} \alpha_i \left(\varepsilon + \xi_i + y^i - \langle w, \phi(x^i)\rangle - b\right) \\
& - \sum_{i=1}^{M} \alpha_i^* \left(\varepsilon + \xi_i^* - y^i + \langle w, \phi(x^i)\rangle + b\right)
\end{aligned} \qquad (5.66)$$

Solving for each of the partial derivatives:

$$\frac{\partial L}{\partial b} = 0; \qquad \frac{\partial L}{\partial w} = 0; \qquad \frac{\partial L}{\partial \xi_i^{(*)}} = 0$$

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) = 0 \qquad (5.67)$$

⁵ We will use the shorthand $\xi_i^{(*)}$ when the notation applies to both $\xi_i$ and $\xi_i^*$.
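Before completing the derivation of the dual, it may help to see the slack-variable primal (5.65) solved numerically. The sketch below treats the linear case (φ taken as the identity) and hands the problem to the generic convex-optimization library cvxpy; the library, the toy data, and the values of ε and C are assumptions made for illustration only. In practice one solves the dual derived here, so that kernels can be used.

```python
# A sketch of the slack-variable primal (5.65) for the linear case
# (phi = identity), modelled with cvxpy (an assumed external dependency).
# Toy data and the values of eps and C are illustrative.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
M, d = 60, 1
X = rng.uniform(-1.0, 1.0, size=(M, d))
y = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.standard_normal(M)

eps, C = 0.1, 10.0
w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(M, nonneg=True)       # slack when f(x_i) exceeds y_i + eps
xi_star = cp.Variable(M, nonneg=True)  # slack when y_i exceeds f(x_i) + eps

objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / M) * cp.sum(xi + xi_star))
constraints = [
    X @ w + b - y <= eps + xi,         # <w, x_i> + b - y_i <= eps + xi_i
    y - X @ w - b <= eps + xi_star,    # y_i - <w, x_i> - b <= eps + xi_i*
]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
# Points with a strictly positive slack lie outside the eps-tube.
print("points outside the tube:", int(np.sum(xi.value + xi_star.value > 1e-6)))
```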
© A.G.Billard 2004 – Last Update March 2011