MACHINE LEARNING TECHNIQUES - LASA

4.3 Probabilistic Regression

Probabilistic regression is a statistical approach to classical linear regression. It assumes that the observed instances of $x$ and $y$ have been generated by an underlying probabilistic process, of which it tries to build an estimate. The rationale goes as follows: if one knows the joint distribution $p(x,y)$ of $x$ and $y$, then an estimate $\hat{y}$ of the output $y$ can be built from the input $x$ by computing the expectation of $y$ given $x$:

\[
\hat{y} = E_{p(y|x)}\{y\} \tag{4.6}
\]

Note that in many cases, if one is solely interested in constructing a regressive model, one does not need to build a model of the joint density; it suffices to estimate the conditional $p(y|x)$.

Probabilistic regression extends the concept of linear regression by assuming that the observed values of $y$ differ from $f(x)$ by an additive random noise $\varepsilon$ (the noise is usually assumed to be independent of the observable $x$):

\[
y = f(x, w) + \varepsilon \tag{4.7}
\]

Usually, to simplify computation, one further assumes that the noise follows a zero-mean Gaussian distribution with uncorrelated isotropic variance $\sigma^2$, i.e. its covariance matrix is diagonal with all elements equal to $\sigma^2$; so we write simply $\varepsilon \sim N(0, \sigma^2)$. Such an assumption is called putting a prior distribution over the noise.

Let us first consider the probabilistic solution to the linear regression problem described before, that is:

\[
y = x^T w + \varepsilon, \qquad \varepsilon \sim N\left(0, \sigma^2\right) \tag{4.8}
\]

We now have one more variable to estimate, namely the noise variance $\sigma^2$. Assuming that all pairs of observables are i.i.d. (independent and identically distributed), we can construct an estimate of the conditional probability of $y$ given $x$ and a choice of parameters $w, \sigma$ as:

\[
p(y \mid x, w, \sigma) = \prod_{i=1}^{M} p\left(y^i \mid x^i, w, \sigma\right) \tag{4.9}
\]

Since only the noise model is probabilistic, and it follows a Gaussian distribution with zero mean, we obtain:
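To make the factorized likelihood (4.9) concrete, here is a minimal sketch (not part of the original notes; names such as `log_likelihood` are chosen purely for illustration) that evaluates the resulting Gaussian log-likelihood of a candidate $w$ and $\sigma$ on synthetic data generated according to (4.8). It uses the convention, introduced just below, of stacking the $M$ inputs as the columns of an $N \times M$ matrix $X$.

```python
import numpy as np

def log_likelihood(X, y, w, sigma):
    """Log of p(y | X, w, sigma) in Eqs. (4.9)-(4.10): i.i.d. zero-mean
    Gaussian noise around the linear model y_i = x_i^T w + eps.
    X: (N, M) matrix whose columns are the M inputs, y: (M,) outputs."""
    residuals = y - X.T @ w                      # y_i - x_i^T w for all i
    M = y.shape[0]
    return (-0.5 * M * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.sum(residuals**2) / sigma**2)

# Synthetic data generated exactly as in Eq. (4.8)
rng = np.random.default_rng(0)
N, M, sigma_true = 3, 200, 0.1
w_true = rng.normal(size=N)
X = rng.normal(size=(N, M))
y = X.T @ w_true + sigma_true * rng.normal(size=M)

print(log_likelihood(X, y, w_true, sigma_true))       # near the maximum
print(log_likelihood(X, y, np.zeros(N), sigma_true))  # much lower
```

Maximizing this quantity over $w$ and $\sigma$ gives the maximum-likelihood estimates; the Bayesian treatment developed next instead places a prior on $w$ and reasons about its full posterior.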

\[
p(y \mid X, w, \sigma) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^i - (x^i)^T w\right)^2}{2\sigma^2}\right)
= \frac{1}{\left(2\pi\sigma^2\right)^{M/2}} \exp\left(-\frac{1}{2\sigma^2}\left(y - X^T w\right)^T \left(y - X^T w\right)\right) \tag{4.10}
\]

In other words, the conditional is a Gaussian distribution with mean $X^T w$ and covariance matrix $\sigma^2 I$, i.e.

\[
p(y \mid X, w, \sigma) = N\left(X^T w, \sigma^2 I\right) \tag{4.11}
\]

To get a good estimate of the conditional distribution given in (4.11), it remains to find a good estimate of the two open parameters $w$ and $\sigma$.

In the Bayesian formalism, to simplify the search for the optimal $w$, one would also specify a prior over the parameter $w$. Typically, one would assume a zero-mean Gaussian prior with fixed covariance matrix $\Sigma_w$:

\[
p(w) = N\left(0, \Sigma_w\right) \propto \exp\left(-\frac{1}{2} w^T \Sigma_w^{-1} w\right) \tag{4.12}
\]

Let us now assume that we are provided with a set of input-output pairs $\{X, y\}$, where $X$ is an $N \times M$ matrix whose columns are the $M$ inputs and $y$ is the $M \times 1$ vector of associated outputs. One can then compute the weights $w$ that best explain (in a likelihood sense) the distribution of input-output pairs in the training set. To this end, one must first find an expression for the posterior distribution over $w$, which is given by Bayes' theorem:

\[
\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}, \qquad
p(w \mid X, y) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)} \tag{4.13}
\]

The marginal likelihood is independent of $w$ and can be computed from our current estimate of the likelihood and from our prior on the weight distribution:

\[
p(y \mid X) = \int p(y \mid X, w)\, p(w)\, dw \tag{4.14}
\]

The quantity we are interested in is the posterior over $w$, which we can now compute by combining the likelihood (4.10), the prior (4.12) and the marginal likelihood (4.14) in (4.13).
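Because both the likelihood (4.11) and the prior (4.12) are Gaussian in $w$, the posterior (4.13) is itself Gaussian, with mean $\sigma^{-2} A^{-1} X y$ and covariance $A^{-1}$, where $A = \sigma^{-2} X X^T + \Sigma_w^{-1}$; this is the standard Bayesian linear-regression result, and the sketch below is an illustrative implementation of it rather than code from the notes. Note that the marginal likelihood (4.14) only normalizes the posterior and is not needed to obtain its mean and covariance.

```python
import numpy as np

def posterior_over_w(X, y, sigma, Sigma_w):
    """Posterior p(w | X, y) of Eq. (4.13) under the Gaussian likelihood (4.11)
    and Gaussian prior (4.12):
        w | X, y ~ N(sigma^-2 A^-1 X y, A^-1),  A = sigma^-2 X X^T + Sigma_w^-1.
    X: (N, M) with inputs as columns, y: (M,) outputs."""
    A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_w)
    A_inv = np.linalg.inv(A)
    mean = A_inv @ X @ y / sigma**2
    return mean, A_inv

# Synthetic data following the generative model of Eq. (4.8)
rng = np.random.default_rng(1)
N, M, sigma = 3, 100, 0.1
w_true = rng.normal(size=N)
X = rng.normal(size=(N, M))
y = X.T @ w_true + sigma * rng.normal(size=M)

mean, cov = posterior_over_w(X, y, sigma, Sigma_w=np.eye(N))
print(mean)          # close to w_true
print(np.diag(cov))  # small posterior variances given many observations
```

As the number of observations grows, the posterior mean approaches the true weights and the posterior covariance shrinks, which is what the printout illustrates.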

