MACHINE LEARNING TECHNIQUES - LASA
$$
\begin{aligned}
p(\mathbf{w}\mid X,\mathbf{y})
&\propto \exp\!\left(-\tfrac{1}{2\sigma^{2}}\,(\mathbf{y}-X^{T}\mathbf{w})^{T}(\mathbf{y}-X^{T}\mathbf{w})\right)\cdot
\exp\!\left(-\tfrac{1}{2}\,\mathbf{w}^{T}\Sigma_{w}^{-1}\mathbf{w}\right)\\
&\propto \exp\!\left(-\tfrac{1}{2}\,(\mathbf{w}-\mathbf{v})^{T}\Sigma_{v}^{-1}(\mathbf{w}-\mathbf{v})\right)
\end{aligned}
\qquad (4.15)
$$

where $\mathbf{v}=\sigma^{-2}\left(\sigma^{-2}XX^{T}+\Sigma_{w}^{-1}\right)^{-1}X\mathbf{y}$ and $\Sigma_{v}=\left(\sigma^{-2}XX^{T}+\Sigma_{w}^{-1}\right)^{-1}$.
The posterior distribution of the weights is thus a Gaussian distribution with mean $\mathbf{v}$ and covariance matrix $\Sigma_{v}$. Notice that the first term on the right-hand side of both expressions is $\sigma^{-2}XX^{T}$, the covariance of the input scaled by the inverse of the noise variance. When the prior on the weights is uninformative ($\Sigma_{w}^{-1}\rightarrow 0$), we recover the solution of the classical linear regression model.
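To make the connection to classical regression concrete, the following sketch compares the posterior mean under an (almost) flat prior with the ordinary least-squares solution; the data, noise level, and prior precision are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data, with X stored as a D x M matrix as in the text.
D, M = 3, 40
w_true = rng.standard_normal(D)
sigma = 0.05
X = rng.standard_normal((D, M))
y = X.T @ w_true + sigma * rng.standard_normal(M)

# Nearly uninformative prior: Sigma_w^{-1} -> 0 (tiny assumed precision).
prior_precision = 1e-12 * np.eye(D)

# Posterior mean v = (sigma^-2 X X^T + Sigma_w^-1)^-1 sigma^-2 X y
v = np.linalg.solve(X @ X.T / sigma**2 + prior_precision, X @ y / sigma**2)

# Classical least-squares fit of y = X^T w
w_ls = np.linalg.lstsq(X.T, y, rcond=None)[0]

# v and w_ls agree up to numerical precision when the prior vanishes.
```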
We can compute our best estimate of the weights by computing the expectation over the posterior distribution, i.e.:

$$
E_{p(\mathbf{w}\mid X,\mathbf{y})}\{\mathbf{w}\} = \sigma^{-2}\left(\sigma^{-2}XX^{T}+\Sigma_{w}^{-1}\right)^{-1}X\mathbf{y}.
\qquad (4.16)
$$

This is called the maximum a posteriori (MAP) estimate of $\mathbf{w}$ (for a Gaussian posterior, the mean and the mode coincide).
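As a minimal numerical sketch of Eqs. (4.15)–(4.16), the posterior mean and covariance can be computed directly with NumPy. The data below (a noisy linear model) and the unit prior covariance are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: M = 50 samples of a D = 2 dimensional linear model
# y = X^T w_true + noise, with X stored as a D x M matrix as in the text.
D, M = 2, 50
w_true = np.array([1.5, -0.7])
sigma = 0.1                       # std of the observation noise
X = rng.standard_normal((D, M))
y = X.T @ w_true + sigma * rng.standard_normal(M)

Sigma_w = np.eye(D)               # assumed prior covariance of the weights

# Posterior covariance and mean, Eqs. (4.15)-(4.16):
#   Sigma_v = (sigma^-2 X X^T + Sigma_w^-1)^-1
#   v       = sigma^-2 Sigma_v X y    (the MAP estimate)
A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_w)
Sigma_v = np.linalg.inv(A)
v = Sigma_v @ (X @ y) / sigma**2

print(v)  # close to w_true for this amount of data
```

With 50 samples and low noise, the prior term barely shrinks the estimate, so `v` lands near `w_true`; with fewer samples or a tighter prior the estimate would be pulled toward zero.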
Notice that the posterior of the weights, and by extension its MAP estimate, depends not only on the input variable $X$ but also on the output variable $\mathbf{y}$. In practice, the cost of computing the MAP estimate grows quadratically with the number of examples $M$, which is a major drawback of such methods. Current research efforts on such regression techniques are largely devoted to reducing the size of the training set to make this computation tractable; these are the so-called sparse methods. This is particularly crucial when one extends this computation to kernel methods, see Chapter 5.
Once the parameters $\mathbf{w}$ have been estimated, we can use our probabilistic regression model to make predictions at new data points. Given a so-called query point $x^{*}$ (such a point is usually not used for training the model; in other words, it belongs to the testing set or the validation set), we can compute an estimate of $y^{*}=f(x^{*})$. This can be obtained by averaging over all possible values of the weights, i.e.:
$$
\begin{aligned}
p(y^{*}\mid x^{*}, X, \mathbf{y})
&= \int p(y^{*}\mid x^{*},\mathbf{w})\, p(\mathbf{w}\mid X,\mathbf{y})\, d\mathbf{w}\\
&= \mathcal{N}\!\left(\tfrac{1}{\sigma^{2}}\,{x^{*}}^{T}A^{-1}X\mathbf{y},\; {x^{*}}^{T}A^{-1}x^{*}\right)
\end{aligned}
\qquad (4.17)
$$

where $A = \tfrac{1}{\sigma^{2}}XX^{T}+\Sigma_{w}^{-1}$.
The predictive distribution is again Gaussian. Notice that the uncertainty (variance) of the predictive model grows quadratically with the amplitude of the query point and with the variance of the weight distribution. This is expected from a linear regression model and is hence one of its limitations. This effect is illustrated in Figure 4-2.
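A self-contained sketch of the predictive computation (4.17) follows; the one-dimensional training data, noise level, and prior are illustrative assumptions. It also checks the point made above: the predictive variance grows quadratically with the query amplitude.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 1-D training data (D = 1, M = 30), X stored as D x M.
D, M = 1, 30
w_true = np.array([2.0])
sigma = 0.2
X = rng.uniform(-1.0, 1.0, size=(D, M))
y = X.T @ w_true + sigma * rng.standard_normal(M)

Sigma_w = np.eye(D)                         # assumed prior covariance
A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_w)
A_inv = np.linalg.inv(A)

def predict(x_star):
    """Predictive mean and variance at query point x_star, Eq. (4.17)."""
    x_star = np.atleast_1d(x_star)
    mean = (x_star @ A_inv @ (X @ y)) / sigma**2
    var = x_star @ A_inv @ x_star
    return mean, var

m0, v0 = predict([0.5])
m1, v1 = predict([2.0])
# For D = 1 the variance is x*^2 * A_inv, so quadrupling the query
# amplitude multiplies the predictive variance by 16.
```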
© A.G.Billard 2004 – Last Update March 2011