
$$
\begin{aligned}
p(w \mid X, y) &\propto \exp\!\left(-\frac{1}{2\sigma^{2}}\,(y - X^{T}w)^{T}(y - X^{T}w)\right)\cdot \exp\!\left(-\frac{1}{2}\,w^{T}\Sigma_{w}^{-1}\,w\right)\\
&\propto \exp\!\left(-\frac{1}{2}\,(w - v)^{T}\left(\frac{1}{\sigma^{2}}XX^{T} + \Sigma_{w}^{-1}\right)(w - v)\right)\\
&\propto \exp\!\left(-\frac{1}{2}\,(w - v)^{T}\,\Sigma_{v}^{-1}\,(w - v)\right)
\end{aligned}
\qquad (4.15)
$$

where $v = \sigma^{-2}\left(\sigma^{-2}XX^{T} + \Sigma_{w}^{-1}\right)^{-1}Xy$ and $\Sigma_{v} = \left(\sigma^{-2}XX^{T} + \Sigma_{w}^{-1}\right)^{-1}$.

The posterior distribution of the weights is thus a Gaussian distribution with mean $v$ and covariance matrix $\Sigma_{v}$. Notice that the first term inside the inverse in both expressions, $\sigma^{-2}XX^{T}$, is the covariance of the inputs scaled by the inverse of the noise variance. In the limit of an uninformative prior on the weights ($\Sigma_{w}^{-1} \to 0$), the posterior mean reduces to $v = (XX^{T})^{-1}Xy$ and we recover the classical linear regression solution.
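
As a minimal numerical sketch of (4.15), assuming the conventions used above ($X$ stored column-wise as a $D \times M$ matrix, scalar noise variance $\sigma^{2}$, prior covariance $\Sigma_{w}$); the function and variable names are illustrative only:

```python
import numpy as np

def posterior_params(X, y, sigma2, Sigma_w):
    """Posterior N(v, Sigma_v) of the weights for y = X^T w + noise (Eq. 4.15).

    X       : (D, M) inputs, one column per example
    y       : (M,)   outputs
    sigma2  : scalar noise variance sigma^2
    Sigma_w : (D, D) prior covariance of the weights
    """
    A = X @ X.T / sigma2 + np.linalg.inv(Sigma_w)  # posterior precision Sigma_v^{-1}
    Sigma_v = np.linalg.inv(A)                     # posterior covariance
    v = Sigma_v @ X @ y / sigma2                   # posterior mean
    return v, Sigma_v
```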

We can compute our best estimate of the weights by taking the expectation over the posterior distribution, i.e.:

$$
E\{p(w \mid y, X)\} = \sigma^{-2}\left(\sigma^{-2}XX^{T} + \Sigma_{w}^{-1}\right)^{-1}Xy. \qquad (4.16)
$$

This is called the maximum a posteriori (MAP) estimate of $w$.
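
As a quick illustration on synthetic data (hypothetical names, reusing the `posterior_params` sketch above), the MAP estimate of (4.16) is simply the posterior mean $v$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 2, 50
w_true = np.array([1.5, -0.5])
X = rng.normal(size=(D, M))
y = X.T @ w_true + 0.1 * rng.normal(size=M)   # noisy linear observations

w_map, Sigma_v = posterior_params(X, y, sigma2=0.01, Sigma_w=np.eye(D))
print(w_map)  # close to w_true when the data are informative
```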

Notice that the posterior of the weights, and by extension its MAP estimate, depends not only on the input variable $X$ but also on the output variable $y$. In practice, the cost of computing the MAP estimate grows quadratically with the number of examples $M$, which is a major drawback of such methods. Current research efforts on such regression techniques are devoted to reducing the size of the training set to keep this computation tractable; these are the so-called sparse methods. This is particularly crucial when one extends this computation to kernel methods, see Chapter 5.

Once the parameters $w$ have been estimated, we can use our probabilistic regression model to make predictions for new data points. Hence, given a so-called query point $x^{*}$ (such a point is usually not used for training the model; in other words, it belongs to the testing set or the validation set), we can compute an estimate of $y^{*} = f(x^{*})$. This can be obtained by averaging over all possible values of the weights, i.e.:

$$
p(y^{*} \mid x^{*}, X, y) = \int p(y^{*} \mid x^{*}, w)\,p(w \mid X, y)\,dw
= \mathcal{N}\!\left(\frac{1}{\sigma^{2}}\,x^{*T}A^{-1}Xy,\; x^{*T}A^{-1}x^{*}\right) \qquad (4.17)
$$

where $A = \frac{1}{\sigma^{2}}XX^{T} + \Sigma_{w}^{-1}$.

The predictive distribution is again Gaussian. Notice that the uncertainty (variance) of the prediction grows quadratically with the amplitude of the query point and with the variance of the weight distribution. This is expected from a linear regression model and is hence a limitation of the approach. This effect is illustrated in Figure 4-2.
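
To make (4.17) concrete, here is a minimal sketch, under the same assumptions and naming as the previous snippets, that computes the predictive mean and variance at a query point and shows the variance growing quadratically with the amplitude of $x^{*}$:

```python
import numpy as np

def predictive(x_star, X, y, sigma2, Sigma_w):
    """Predictive mean and variance of y* at a query point x_star (Eq. 4.17)."""
    A = X @ X.T / sigma2 + np.linalg.inv(Sigma_w)  # A = XX^T / sigma^2 + Sigma_w^{-1}
    A_inv = np.linalg.inv(A)
    mean = x_star @ A_inv @ X @ y / sigma2         # (1/sigma^2) x*^T A^{-1} X y
    var = x_star @ A_inv @ x_star                  # x*^T A^{-1} x*
    return mean, var

# Reusing the synthetic X, y, D from the snippet above: the predictive variance
# grows quadratically with the amplitude of the query point.
for scale in (1.0, 2.0, 4.0):
    _, var = predictive(scale * np.ones(D), X, y, sigma2=0.01, Sigma_w=np.eye(D))
    print(scale, var)
```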

