MACHINE LEARNING TECHNIQUES - LASA
defined by its mean m(x) and covariance function k(x, x') (k is defined for each pair of points x, x' that span the data space):

m(x) = E{ f(x) }
k(x, x') = E{ (f(x) − m(x)) (f(x') − m(x')) }        (5.79)
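As a concrete illustration (not from the text), the covariance function can be taken to be a squared-exponential kernel; the sketch below, assuming NumPy, evaluates a zero mean m(x) and such a k for pairs of inputs. The kernel choice and length scale are assumptions for illustration only.

```python
import numpy as np

def m(x):
    # Zero-mean assumption, as used throughout this section.
    return 0.0

def k(x, xp, length_scale=1.0):
    # Squared-exponential covariance (one common choice; the text only
    # requires k to be defined for every pair of inputs x, x').
    return np.exp(-0.5 * (x - xp) ** 2 / length_scale ** 2)

print(k(0.0, 0.0))   # 1.0 (a point is maximally correlated with itself)
print(k(0.0, 1.0))   # exp(-0.5), decays with the distance |x - x'|
```

Note that k is symmetric, k(x, x') = k(x', x), as required of a covariance function.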
For simplicity, most GPs assume a zero-mean process, i.e. m(x) = 0. While the above description may be multidimensional, most regression techniques on GPs assume that the output f(x) is unidimensional. This is a limitation in the application of the process for regression, as it allows making inferences solely on a single dimension, say y = f(x), y ∈ ℝ. For multidimensional inference, one may run one GP per output variable.
Using (5.79) and (5.78), and assuming a zero-mean distribution, the Bayesian regression model can be rewritten as a Gaussian process defined by:

E{ f(x) } = φ(x)ᵀ E{ w } = 0,
k(x, x') = E{ f(x) f(x') } = φ(x)ᵀ E{ w wᵀ } φ(x') = φ(x)ᵀ Σ_w φ(x')        (5.80)
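A small sketch of (5.80), showing that the kernel is just an inner product in feature space. The polynomial feature map φ and the prior covariance Σ_w below are hand-picked assumptions, not values from the text:

```python
import numpy as np

def phi(x):
    # Hypothetical feature map: [1, x, x^2], so D = 3.
    return np.array([1.0, x, x ** 2])

# Assumed prior covariance of the weights w (any PSD matrix would do).
Sigma_w = np.diag([1.0, 0.5, 0.1])

def k(x, xp):
    # k(x, x') = phi(x)^T Sigma_w phi(x'), as in (5.80).
    return phi(x) @ Sigma_w @ phi(xp)

print(k(1.0, 2.0))   # 1*1 + 0.5*1*2 + 0.1*1*4 = 2.4
```

Because Σ_w is symmetric, k(x, x') = k(x', x), so this construction always yields a valid covariance function.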
We are now endowed with a probabilistic representation of our real process f. The values taken by f at any given pair of inputs x, x' are jointly Gaussian, with zero mean and covariance given by k(x, x'). This means that an estimate of f can be drawn only by looking conjointly at the distribution of f across two or more input variables. In practice, to visualize the process, one may sample a set X* = { x*_i }, i = 1 ... M*, of M* data points and compute f*, an M*-dimensional vector of estimates of f, such that:

f* ~ N( 0, K(X*, X*) )        (5.81)

where K(X*, X*) is an M* × M* covariance matrix whose elements are computed using ( K(X*, X*) )_ij = k(x*_i, x*_j), ∀ i, j = 1 ... M*. Note that if M* > D, i.e. the number of datapoints exceeds the dimension of the feature space, the matrix is singular, as the rank of K is D.
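The prior draw of (5.81) and the rank deficiency just noted can be sketched as follows, assuming NumPy and a finite-dimensional feature-space kernel in the spirit of (5.80) (the feature map and Σ_w = I are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3   # dimension of the feature space

def phi(x):
    # Hypothetical D-dimensional feature map.
    return np.array([1.0, x, x ** 2])

def k(x, xp):
    # Finite-dimensional kernel phi(x)^T phi(x'), taking Sigma_w = I.
    return phi(x) @ phi(xp)

# M* query points, with M* > D on purpose.
X_star = np.linspace(-1.0, 1.0, 10)
M_star = len(X_star)
K = np.array([[k(xi, xj) for xj in X_star] for xi in X_star])

# rank(K) = D < M*, so K is singular, as stated in the text.
print(np.linalg.matrix_rank(K))   # 3

# Draw f* ~ N(0, K(X*, X*)); the SVD-based sampler tolerates a
# singular (but positive semi-definite) covariance.
f_star = rng.multivariate_normal(np.zeros(M_star), K)
print(f_star.shape)   # (10,)
```

Each such draw is one plausible realization of f at the query points, drawn from the prior alone, i.e. before seeing any training data.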
Generating such a vector is called drawing from the prior distribution of f, as it uses solely information on the query datapoints themselves and the prior assumption that the underlying process is jointly Gaussian and zero-mean, as given by (5.81). A better inference can be made if one can make use of prior information in the form of a set of training points. Consider the set X = { x_i }, i = 1 ... M, as the training datapoints; one can then express the joint distribution of the estimates f and f*, associated with the training and testing points respectively, as:
⎡ f  ⎤        ⎛      ⎡ K(X, X)    K(X, X*)  ⎤ ⎞
⎢    ⎥  ~  N ⎜ 0,   ⎢                      ⎥ ⎟        (5.82)
⎣ f* ⎦        ⎝      ⎣ K(X*, X)   K(X*, X*) ⎦ ⎠
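Assembling the joint covariance of (5.82) as a block matrix can be sketched like this (NumPy, with an assumed squared-exponential k; the input values are illustrative, not from the text):

```python
import numpy as np

def k(x, xp):
    # Assumed squared-exponential covariance with unit length scale.
    return np.exp(-0.5 * (x - xp) ** 2)

def K(A, B):
    # Covariance matrix between two input sets: K(A, B)_ij = k(a_i, b_j).
    return np.array([[k(a, b) for b in B] for a in A])

X = np.array([0.0, 0.5, 1.0])      # training inputs (M = 3)
X_star = np.array([0.25, 0.75])    # test inputs (M* = 2)

# Joint covariance of [f; f*], block-structured as in (5.82).
joint = np.block([
    [K(X, X),      K(X, X_star)],
    [K(X_star, X), K(X_star, X_star)],
])
print(joint.shape)   # (5, 5): (M + M*) x (M + M*)
```

Since k is symmetric, K(X*, X) = K(X, X*)ᵀ and the joint covariance is itself a valid (symmetric, positive semi-definite) covariance matrix.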
© A.G.Billard 2004 – Last Update March 2011