MACHINE LEARNING TECHNIQUES - LASA


In this case the covariance of the prior becomes $\operatorname{cov}(y) = \operatorname{cov}\!\left(f(X) + \varepsilon\right) = K(X,X) + \sigma^2 I$, where $y$ is a matrix whose columns $y^i$, $i = 1 \dots M$, correspond to the projection of the associated training point $x^i$ through (5.84).

As done previously, we can express the joint distribution of the prior $y$ (now including noise in the estimate of the prior on the training datapoints) and of the testing points given through $f_*$:

$$
\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( 0,\; \begin{bmatrix} K(X,X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)
\qquad (5.85)
$$

Again, one can compute the conditional distribution of $f_*$ given the training datapoints $X$, the testing datapoints $X_*$ and the noisy prior $y$:

$$
f_* \mid X_*, X, y \;\sim\; \mathcal{N}\!\left(\bar{f}_*, \operatorname{cov}(f_*)\right)
$$
$$
\bar{f}_* = E\{f_* \mid X_*, X, y\} = K(X_*, X)\left[K(X,X) + \sigma^2 I\right]^{-1} y
$$
$$
\operatorname{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\left[K(X,X) + \sigma^2 I\right]^{-1} K(X, X_*)
\qquad (5.86)
$$

We are usually interested in computing solely the response of the model to one query point $x_*$. In this case, the estimate of the associated output $y_*$ is given by:

$$
y_* \sim \bar{f}_* = E\{f_* \mid x_*, X, y\} = k(x_*, X)^T \left[K(X,X) + \sigma^2 I\right]^{-1} y
\qquad (5.87)
$$

$k(x_*, X)$ is the vector of covariances $k(x_*, x^i)$ between the query point and the $M$ training datapoints $x^i$, $i = 1 \dots M$. Since all the training pairs $(x^i, y^i)$, $i = 1 \dots M$, are given, these can be treated as parameters of the system and hence the prediction of $y_*$ from Equation (5.87) can be expressed as a linear combination of kernel functions $k(x_*, x^i)$:

$$
y_* \sim \bar{f}_* = E\{f_* \mid x_*, X, y\} = \sum_{i=1}^{M} \alpha_i \, k(x_*, x^i),
\qquad \text{with } \alpha = \left[K(X,X) + \sigma^2 I\right]^{-1} y
\qquad (5.88)
$$

We have $M$ kernel functions, one for each of the $M$ training points $x^i$.
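To illustrate the prediction rule of Equations (5.87)-(5.88), here is a minimal NumPy sketch assuming a squared-exponential kernel; the function names `rbf_kernel` and `gp_predict`, the kernel width and noise values, and the toy data are illustrative choices, not part of the notes.

```python
import numpy as np

def rbf_kernel(A, B, width=1.0):
    """Squared-exponential kernel k(a, b) = exp(-||a - b||^2 / (2 width^2))."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * width**2))

def gp_predict(X, y, X_star, sigma=0.1, width=1.0):
    """GP regression mean at query points X_star, following (5.87)-(5.88):
    y_* = k(x_*, X) [K(X, X) + sigma^2 I]^{-1} y."""
    K = rbf_kernel(X, X, width)                                 # Gram matrix K(X, X)
    alpha = np.linalg.solve(K + sigma**2 * np.eye(len(X)), y)   # alpha of (5.88)
    return rbf_kernel(X_star, X, width) @ alpha                 # sum_i alpha_i k(x_*, x^i)

# Toy usage: recover y = sin(x) from 20 noisy samples.
X = np.linspace(0.0, 2.0 * np.pi, 20)[:, None]
y = np.sin(X[:, 0]) + 0.05 * np.random.randn(20)
print(gp_predict(X, y, np.array([[1.5], [3.0]])))
```

Note that the training outputs enter only through the weight vector $\alpha$, which can be computed once and reused for every new query point.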

5.9.2 Equivalence of Gaussian Process Regression and Gaussian Mixture Regression

The expression found in Equation (5.88) is very similar to that found for Gaussian Mixture Regression, see Section 4.4.2 and Equation (4.28). The non-linear term $k(x_*, x^i)$ is equivalent to the non-linear weights $w^i(x_*)$ in Equation (4.28), whereas the parameters $\alpha_i$ roughly correspond to the linear terms stemming from the local PCA projections through the cross-covariance matrices of each Gaussian in the GMM (given by $\mu_Y^i + \Sigma_{YX}^i \left(\Sigma_{XX}^i\right)^{-1}\!\left(x - \mu_X^i\right)$ in Equation (4.28)).

The difference between GPR and GMR lies primarily in the fact that GPR uses all the datapoints to do inference, whereas GMR performs some local clustering and uses a much smaller number of points (the centers of the Gaussians) to do inference. However, the two methods may become equivalent under certain conditions.

Assume a normalized Gaussian kernel for the function $k(x_*, x^i)$ and a noise-free model, i.e. $\sigma = 0$. Let us further assume that, for a well-chosen kernel width, $k(x^i, x^j)$ is non-zero only for data points deemed close to one another according to the metric $k(\cdot,\cdot)$ and is (almost) zero for all other pairs. As a result, the matrix $K$ is sparse. We can hence define a partitioning of the datapoints into a set of $m$ clusters, $l = 1 \dots m$, each centered on one of $m$ datapoints $x^l \in X$; $\delta > 0$ is an arbitrary threshold that determines the closeness of the datapoints. How to choose the $m$ datapoints is at the core of most clustering techniques, and we refer the reader to Chapter 3.1 for a discussion of these techniques.

Rearranging the ordering of the datapoints so that points belonging to each cluster are located on adjacent columns, and duplicating each column of datapoints for points that belong to more than one cluster, one can create the following block-diagonal Gram matrix:

$$
\tilde{K} = \begin{bmatrix} \tilde{K}^1 & 0 & \dots & 0 \\ \vdots & \ddots & & \vdots \\ 0 & \dots & & \tilde{K}^m \end{bmatrix}
\qquad (5.89)
$$

where the elements $\tilde{K}^l_{ij} = k\!\left(x^{l,i}, x^{l,j}\right)$ of the $\tilde{K}^l$ matrix are composed of the kernel function applied on each pair of datapoints $\left(x^{l,i}, x^{l,j}\right)$ belonging to the associated cluster.

Using the properties of the inverse of a block-diagonal matrix, we obtain a simplified expression for:
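To make the block-diagonal argument concrete, here is a minimal Python sketch of a noise-free ($\sigma = 0$) prediction that touches only one block $\tilde{K}^l$ of (5.89); it reuses the squared-exponential kernel from the previous sketch, and the function name `clustered_gp_predict` and the nearest-centre assignment rule are illustrative assumptions, not part of the notes.

```python
import numpy as np

def rbf_kernel(A, B, width):
    """Squared-exponential kernel, as in the previous sketch."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * width**2))

def clustered_gp_predict(clusters, x_star, width=0.3):
    """Noise-free GP prediction when the Gram matrix is block diagonal as in
    (5.89): only the block of the cluster close to the query contributes, so
    the inverse is computed block by block.
    `clusters` is a list of (X_l, y_l) pairs; assigning the query to the
    cluster with the nearest centre is an illustrative rule, not from the notes."""
    centres = [X_l.mean(axis=0) for X_l, _ in clusters]
    l = int(np.argmin([np.linalg.norm(x_star - c) for c in centres]))
    X_l, y_l = clusters[l]
    K_l = rbf_kernel(X_l, X_l, width)        # block K~^l of (5.89)
    alpha_l = np.linalg.solve(K_l, y_l)      # block-wise inverse applied to y^l
    return (rbf_kernel(x_star[None, :], X_l, width) @ alpha_l)[0]
```

Inverting $m$ blocks of size roughly $M/m$ costs on the order of $m\,(M/m)^3$ operations instead of $M^3$ for the full Gram matrix, and when the cross-cluster kernel values truly vanish the block-wise prediction coincides with the full GPR prediction, which is the sense in which GPR approaches a GMR-like local model.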
