MACHINE LEARNING TECHNIQUES - LASA
Figure 5-16: Increase of the number of support vectors and of the proportion of datapoints outside the ε-insensitive tube when increasing ν. From top to bottom, ν takes the values 0.08, 0.1 and 0.9. ν-SVR was fitted with C = 50 and a Gaussian kernel with kernel width 0.021.
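The trend shown in Figure 5-16 can be reproduced in a few lines. The sketch below uses scikit-learn's NuSVR as a stand-in implementation (an assumption of convenience, not the code used for the figure); the toy data and the mapping from kernel width to the gamma parameter are likewise illustrative choices.

# Sketch (not from the notes): the fraction of support vectors grows with nu,
# as in Figure 5-16. gamma = 1 / (2 * width^2) mimics a Gaussian kernel of
# width 0.021; data are an arbitrary noisy sine on [0, 1].
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 1.0, 200)).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X).ravel() + 0.1 * rng.standard_normal(200)

for nu in (0.08, 0.1, 0.9):
    model = NuSVR(nu=nu, C=50.0, kernel="rbf", gamma=1.0 / (2.0 * 0.021 ** 2))
    model.fit(X, y)
    frac_sv = len(model.support_) / len(X)
    print(f"nu = {nu:.2f}: fraction of support vectors = {frac_sv:.2f}")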
5.9 Gaussian Process Regression

Adapted from C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
In Section 4.3, we introduced probabilistic regression, a method by which the standard linear regressive model y = w^T x + \epsilon, with noise \epsilon \sim N(0, \sigma^2), was extended to build a probabilistic estimate of the conditional distribution p(y | x). For a new query point x^*, one can then compute an estimate y^* by taking the expectation of y given x^*, \hat{y}^* = E[y | x^*]. Further, assuming that all training points are i.i.d. and placing a Gaussian prior with covariance \Sigma_w on the parameters w of the model, we found that the predictive distribution is also Gaussian and is given by:

p(y^* \mid x^*, X, y) = N\left( \frac{1}{\sigma^2}\, x^{*T} A^{-1} X y,\; x^{*T} A^{-1} x^* \right), \quad \text{with } A = \sigma^{-2} X X^T + \Sigma_w^{-1}   (5.77)
where X is the N × M matrix of training inputs and y the 1 × M vector of the corresponding training outputs. Next, we show how this probabilistic linear regression model can be extended to perform non-linear regression, exploiting once more the kernel trick.
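Before turning to the non-linear case, the following minimal sketch computes the predictive mean and variance of (5.77) numerically. It is not code from these notes: the toy data, noise variance, and prior covariance are arbitrary illustrative choices.

# Minimal sketch of the Bayesian linear regression predictive (5.77).
# Dimensions follow the text: X is N x M (M inputs of dimension N), y is 1 x M.
import numpy as np

def linear_predictive(x_star, X, y, sigma2, Sigma_w):
    # A = sigma^-2 X X^T + Sigma_w^-1 ;  p(y* | x*, X, y) = N(mean, var)
    A = X @ X.T / sigma2 + np.linalg.inv(Sigma_w)
    A_inv = np.linalg.inv(A)
    mean = x_star @ A_inv @ X @ y / sigma2
    var = x_star @ A_inv @ x_star
    return mean, var

# Toy 1-D example: data generated from y = 2x + Gaussian noise (illustrative).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, (1, 30))                   # N = 1, M = 30
y = 2.0 * X.ravel() + 0.1 * rng.standard_normal(30)
mean, var = linear_predictive(np.array([0.5]), X, y, sigma2=0.01, Sigma_w=np.eye(1))
print(mean, var)                                      # predictive mean and variance at x* = 0.5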
Non-linear Case

Assume a non-linear transformation into feature space through a function \phi(x) that maps each N-dimensional datapoint x into a D-dimensional feature space. Substituting \phi(x) for x everywhere in the linear model, the predictive distribution for the non-linear model becomes:

p(y^* \mid x^*, X, y) = N\left( \frac{1}{\sigma^2}\, \phi(x^*)^T A^{-1} \Phi(X)\, y,\; \phi(x^*)^T A^{-1} \phi(x^*) \right), \quad \text{with } A = \sigma^{-2}\, \Phi(X) \Phi(X)^T + \Sigma_w^{-1}   (5.78)

Here \Phi(X) is the D × M matrix whose columns are the projections \phi(x) of the training points x \in X. While the expression of this density is quite simple, computing the inverse of the matrix A may be very difficult in practice: A is a D × D matrix, and the dimension D of the feature space may be very large.
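The sketch below instantiates (5.78) with an explicit polynomial feature map standing in for \phi; the map, its degree, and the toy data are illustrative assumptions rather than choices made in these notes. It makes the cost issue visible: A is built and inverted in the D-dimensional feature space, so the computation grows with D rather than with the number of training points.

# Sketch of the feature-space predictive (5.78) with an explicit (hypothetical)
# polynomial feature map phi. A is D x D, so inversion cost grows with D.
import numpy as np

def phi(x, degree=5):
    # Map a scalar input to a D = degree + 1 dimensional polynomial feature vector.
    return np.array([x ** d for d in range(degree + 1)])

def feature_space_predictive(x_star, X, y, sigma2, Sigma_w):
    Phi = np.column_stack([phi(x) for x in X])            # D x M, columns are phi(x_i)
    A = Phi @ Phi.T / sigma2 + np.linalg.inv(Sigma_w)     # D x D matrix
    A_inv = np.linalg.inv(A)
    phi_star = phi(x_star)
    mean = phi_star @ A_inv @ Phi @ y / sigma2
    var = phi_star @ A_inv @ phi_star
    return mean, var

# Illustrative data: noisy samples of sin(3x) on [-1, 1].
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, 40)                            # M = 40 scalar inputs
y = np.sin(3.0 * X) + 0.05 * rng.standard_normal(40)
D = 6                                                     # degree-5 polynomial features
mean, var = feature_space_predictive(0.3, X, y, sigma2=0.05 ** 2, Sigma_w=np.eye(D))
print(mean, var)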
5.9.1 What is a Gaussian Process

The Bayesian regression model given by (5.78) is one example of a Gaussian process. In its generic definition, a "Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution".
Assume that the real process you wish to describe is governed by a function f(x), where x spans the data space. A Gaussian Process (GP) estimate of the function f is then entirely specified by its mean function m(x) and its covariance function k(x, x').
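To make this definition concrete, the short sketch below draws sample functions from a zero-mean GP prior with a Gaussian covariance function: any finite collection of inputs induces a joint Gaussian over the corresponding function values. The kernel width, input grid, and jitter term are arbitrary illustrative choices, not values prescribed by these notes.

# Sketch: sampling from a zero-mean GP prior with a Gaussian (RBF) covariance.
# For a finite set of inputs x_1..x_M, the function values are jointly Gaussian
# with covariance K[i, j] = k(x_i, x_j).
import numpy as np

def gaussian_kernel(x1, x2, width=0.2):
    return np.exp(-((x1 - x2) ** 2) / (2.0 * width ** 2))

x = np.linspace(0.0, 1.0, 100)                    # finite collection of query points
K = gaussian_kernel(x[:, None], x[None, :])       # 100 x 100 covariance matrix
K += 1e-8 * np.eye(len(x))                        # small jitter for numerical stability

rng = np.random.default_rng(3)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)                              # (3, 100): each row is one sampled function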