MACHINE LEARNING TECHNIQUES - LASA


Figure 5-16: Increase of the number of support vectors and of the proportion of datapoints outside the ε-insensitive tube when increasing ν. From top to bottom, ν takes the values 0.08, 0.1 and 0.9. ν-SVR was fitted with C=50 and a Gaussian kernel of width 0.021.


5.9 Gaussian Process Regression

Adapted from C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006.

In Section 4.3, we introduced probabilistic regression, a method by which the standard linear regressive model $y = w^T x + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$, was extended to build a probabilistic estimate of the conditional distribution $p(y \mid x)$. Then, for a new query point $x^*$, one could compute an estimate $y^*$ by taking the expectation of $y$ given $x$, $\hat{y} = E_{p(y \mid x)}\{y\}$. Further, using the assumption that all training points are i.i.d. and using a Gaussian prior with covariance $\Sigma_w$ on the distribution of the parameters $w$ of the model, we found that the predictive distribution is also Gaussian and is given by:

$$ p(y^* \mid x^*, X, y) = N\!\left(\frac{1}{\sigma^2}\, x^{*\,T} \Sigma_w X\, y,\;\; x^{*\,T} \Sigma_w\, x^*\right) \qquad (5.77) $$

where $X$ is the $N \times M$ matrix whose columns are the $M$ training inputs and $y$ is the $M \times 1$ vector of the corresponding outputs. Next, we see how this probabilistic linear regressive model can be extended to allow non-linear regression, exploiting once more the kernel trick.
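As a quick numerical check of (5.77), the sketch below evaluates the predictive mean and variance on invented toy data. It is only a sketch: the data, the noise variance and the concrete choice of $\Sigma_w$ are assumptions made here for illustration ($\Sigma_w$ is computed as the posterior covariance of $w$ under a unit Gaussian prior, in the spirit of Rasmussen & Williams), not values prescribed by the text.

```python
# Minimal sketch of the predictive distribution (5.77); toy data only.
import numpy as np

def predict_linear(x_star, X, y, Sigma_w, sigma2):
    """Mean and variance of p(y*|x*, X, y) as written in (5.77)."""
    mean = x_star @ Sigma_w @ X @ y / sigma2   # (1/sigma^2) x*^T Sigma_w X y
    var = x_star @ Sigma_w @ x_star            # x*^T Sigma_w x*
    return mean, var

# Toy example (assumed data): M = 30 noisy samples of y = 0.7*x1 - 0.2*x2.
rng = np.random.default_rng(0)
N, M, sigma2 = 2, 30, 0.01
X = rng.uniform(-1.0, 1.0, size=(N, M))        # one column per training input
y = np.array([0.7, -0.2]) @ X + rng.normal(0.0, np.sqrt(sigma2), size=M)

# Assumed choice: Sigma_w taken as the posterior covariance of w under a unit Gaussian prior.
Sigma_w = np.linalg.inv(X @ X.T / sigma2 + np.eye(N))
x_star = np.array([0.5, -0.5])                 # query point
mu, var = predict_linear(x_star, X, y, Sigma_w, sigma2)
print(f"prediction at x*: mean={mu:.3f}, variance={var:.5f}")
```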

Non-linear Case

Assuming a non-linear transformation into feature space through the function $\phi(x)$, which maps each $N$-dimensional datapoint $x$ into a $D$-dimensional feature space, and substituting $\phi(x)$ for $x$ everywhere in the linear model, the predictive distribution for the non-linear model becomes:

$$ p(y^* \mid x^*, X, y) = N\!\left(\frac{1}{\sigma^2}\, \phi(x^*)^T A^{-1} \Phi(X)\, y,\;\; \phi(x^*)^T A^{-1} \phi(x^*)\right), \quad \text{with } A = \sigma^{-2}\, \Phi(X)\, \Phi(X)^T + \Sigma_w^{-1} \qquad (5.78) $$

$\Phi(X)$ is the matrix whose columns are the projections $\phi(x)$ of each training point $x \in X$. While the expression of this density is quite simple, in practice computing the inverse of the matrix $A$ may be very costly, as its dimension equals that of the feature space, which may be quite large.
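To make (5.78) concrete, the sketch below uses an explicit, low-dimensional feature map. The polynomial map $\phi(x) = (1, x, x^2)$, the toy data and the prior covariance $\Sigma_w$ are all assumptions chosen for illustration; the point to notice is that $A$ is a $D \times D$ matrix, so the cost of inverting it grows with the dimension of the feature space rather than with the number of training points.

```python
# Minimal sketch of (5.78) with an explicit (assumed) feature map phi(x) = (1, x, x^2).
import numpy as np

def phi(x):
    """Assumed feature map: scalar input -> D = 3 dimensional feature vector."""
    return np.array([1.0, x, x**2])

def predict_feature_space(x_star, X, y, Sigma_w, sigma2):
    """Mean and variance of p(y*|x*, X, y) as in (5.78)."""
    Phi = np.column_stack([phi(x) for x in X])         # D x M, columns are phi(x_i)
    A = Phi @ Phi.T / sigma2 + np.linalg.inv(Sigma_w)  # A = sigma^-2 Phi Phi^T + Sigma_w^-1
    A_inv = np.linalg.inv(A)                           # D x D inverse: the costly step when D is large
    p_star = phi(x_star)
    mean = p_star @ A_inv @ Phi @ y / sigma2           # (1/sigma^2) phi(x*)^T A^-1 Phi(X) y
    var = p_star @ A_inv @ p_star                      # phi(x*)^T A^-1 phi(x*)
    return mean, var

# Toy example (assumed data): noisy samples of a quadratic function.
rng = np.random.default_rng(1)
M, sigma2 = 40, 0.01
X = rng.uniform(-1.0, 1.0, size=M)
y = 0.3 - 0.5 * X + 1.2 * X**2 + rng.normal(0.0, np.sqrt(sigma2), size=M)

Sigma_w = np.eye(3)                                    # assumed prior covariance in feature space
mu, var = predict_feature_space(0.8, X, y, Sigma_w, sigma2)
print(f"prediction at x* = 0.8: mean={mu:.3f}, variance={var:.5f}")
```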

5.9.1 What is a Gaussian Process

The Bayesian regression model given by (5.78) is one example of a Gaussian Process.

In its generic definition, a “Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution”.
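To make this definition concrete, the short sketch below picks a finite collection of input points, builds the corresponding joint Gaussian over the function values, and draws samples from it. The zero mean and the Gaussian (RBF) covariance function used here are assumed choices made for illustration, not ones prescribed by the text.

```python
# Minimal sketch of the GP definition: any finite set of function values is jointly Gaussian.
import numpy as np

def rbf_kernel(x1, x2, width=0.5):
    """Gaussian covariance k(x, x') = exp(-(x - x')^2 / (2 * width^2)); an assumed choice."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / width) ** 2)

rng = np.random.default_rng(2)
x = np.linspace(-2.0, 2.0, 50)                  # any finite collection of inputs
K = rbf_kernel(x, x) + 1e-10 * np.eye(len(x))   # joint covariance (small jitter for stability)

# Three draws of the vector (f(x_1), ..., f(x_50)) from the joint Gaussian N(0, K);
# each row is one sampled "function" evaluated at the chosen points.
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)                            # (3, 50)
```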

Assume that the real process you wish to describe is regulated by the function $f(x)$, where $x$ spans the data space. Then, a Gaussian Process (GP) estimate of the function $f$ is entirely specified by its mean function and its covariance function.
