MACHINE LEARNING TECHNIQUES - LASA
4.3 Probabilistic Regression

Probabilistic regression is a statistical approach to classical linear regression. It assumes that the observed instances of x and y have been generated by an underlying probabilistic process, of which it tries to build an estimate. The rationale goes as follows: if one knows the joint distribution p(x, y) of x and y, then an estimate \hat{y} of the output y can be built from knowing the input x by computing the expectation of y given x:

\hat{y} = E\{ p(y \mid x) \}     (4.6)

Note that in many cases, if one is solely interested in constructing a regressive model, one does not need to build a model of the joint density; it suffices to estimate the conditional p(y \mid x).

Probabilistic regression extends the concept of linear regression by assuming that the observed values of y differ from f(x) by an additive random noise \varepsilon (the noise is usually assumed to be independent of the observable x):

y = f(x, w) + \varepsilon     (4.7)

Usually, to simplify computation, one further assumes that the noise follows a zero-mean Gaussian distribution with uncorrelated isotropic variance \sigma^2; the covariance matrix is then diagonal with all elements equal to \sigma^2, so we write simply \varepsilon \sim N(0, \sigma^2). Such an assumption is called putting a prior distribution over the noise.

Let us first consider the probabilistic solution to the linear regression problem described before, that is:

y = x^T w + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)     (4.8)

We now have one more parameter to estimate, namely the variance of the noise \sigma^2. Assuming that all pairs of observables are i.i.d. (independently and identically distributed), we can construct an estimate of the conditional probability of y given x for a choice of parameters w, \sigma as:

p(y \mid x, w, \sigma) = \prod_{i=1}^{M} p(y^i \mid x^i, w, \sigma)     (4.9)

Using the fact that only the noise model is probabilistic and that it follows a Gaussian distribution with zero mean, we obtain:
p(y \mid X, w, \sigma) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left(y^i - (x^i)^T w\right)^2}{2\sigma^2} \right) = \frac{1}{(2\pi\sigma^2)^{M/2}} \exp\left( -\frac{1}{2\sigma^2} (y - X^T w)^T (y - X^T w) \right)     (4.10)

In other words, the conditional is a Gaussian distribution with mean X^T w and covariance matrix \sigma^2 I, i.e.

p(y \mid X, w, \sigma) = N(X^T w, \sigma^2 I)     (4.11)

To get a good estimate of the conditional distribution given in (4.11), it remains to find a good estimate of the two open parameters w and \sigma.

In the Bayesian formalism, to simplify the search for the optimal w, one would also specify a prior over the parameter w. Typically, one assumes a zero-mean Gaussian prior with fixed covariance matrix \Sigma_w:

p(w) = N(0, \Sigma_w) \propto \exp\left( -\tfrac{1}{2} w^T \Sigma_w^{-1} w \right)     (4.12)

Let us now assume that we are provided with a set of input-output pairs \{X, y\}, where X is an N \times M matrix whose M columns are the training inputs and y the vector of the M associated outputs. One can then compute the weights w that best explain (in a likelihood sense) the distribution of input-output pairs in the training set. To this end, one must first find an expression for the posterior distribution over w, which is given by Bayes' theorem:

p(w \mid X, y) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)}, \qquad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}     (4.13)

The marginal likelihood is independent of w and can be computed from our current estimate of the likelihood and our prior on the weight distribution:

p(y \mid X) = \int p(y \mid X, w)\, p(w)\, dw     (4.14)

The quantity we are interested in is the posterior over w, which can now be obtained by combining (4.13) with the marginal likelihood (4.14).
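To make the likelihood model of (4.8)-(4.11) concrete, here is a minimal numerical sketch (not part of the original notes): it draws synthetic data from y = x^T w + \varepsilon and evaluates the log of the Gaussian likelihood (4.10). The variable names, dimensions, and the choice of NumPy are illustrative assumptions.

import numpy as np

# Synthetic data from the model (4.8): y = x^T w + eps, eps ~ N(0, sigma^2).
# Following the notes, X is N x M with one training input per column.
rng = np.random.default_rng(0)
N, M = 3, 50                                     # input dimension, number of pairs
w_true = np.array([0.5, -1.0, 2.0])              # weights that generated the data
sigma = 0.3                                      # noise standard deviation
X = rng.normal(size=(N, M))
y = X.T @ w_true + sigma * rng.normal(size=M)

def log_likelihood(y, X, w, sigma):
    """log p(y | X, w, sigma) of eq. (4.10): the product of M univariate
    Gaussians, i.e. the multivariate Gaussian N(X^T w, sigma^2 I) of (4.11)."""
    r = y - X.T @ w                              # residuals y^i - (x^i)^T w
    m = len(y)
    return -0.5 * m * np.log(2 * np.pi * sigma**2) - 0.5 * (r @ r) / sigma**2

# The likelihood is highest near the parameters that generated the data.
print(log_likelihood(y, X, w_true, sigma))       # large value
print(log_likelihood(y, X, np.zeros(N), sigma))  # much smaller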
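The prior (4.12), the marginal likelihood (4.14), and the posterior (4.13) can likewise be illustrated numerically. The sketch below is again an illustrative assumption rather than the notes' derivation: it uses a one-dimensional weight so that the integral in (4.14) can be approximated on a grid and the posterior (4.13) evaluated pointwise.

import numpy as np

# 1-D illustration of (4.12)-(4.14): scalar weight w, Gaussian prior
# p(w) = N(0, Sigma_w), Gaussian likelihood (4.10), posterior via Bayes' rule.
rng = np.random.default_rng(1)
M, sigma, Sigma_w = 30, 0.4, 1.0              # nb. of pairs, noise std, prior variance
w_true = 1.3
x = rng.normal(size=M)                        # scalar training inputs
y = w_true * x + sigma * rng.normal(size=M)   # outputs generated as in (4.8)

w_grid = np.linspace(-4.0, 4.0, 2001)         # grid over the weight
dw = w_grid[1] - w_grid[0]

# prior p(w), eq. (4.12), here properly normalised
prior = np.exp(-0.5 * w_grid**2 / Sigma_w) / np.sqrt(2 * np.pi * Sigma_w)

# likelihood p(y | X, w, sigma) for every w on the grid, eq. (4.10)
resid = y[None, :] - w_grid[:, None] * x[None, :]
lik = np.exp(-0.5 * np.sum(resid**2, axis=1) / sigma**2) / (2 * np.pi * sigma**2)**(M / 2)

# marginal likelihood p(y | X), eq. (4.14), approximated by a Riemann sum,
# and the posterior p(w | X, y), eq. (4.13)
marginal = np.sum(lik * prior) * dw
posterior = lik * prior / marginal

print("posterior mean of w:", np.sum(w_grid * posterior) * dw)   # close to w_true

For a higher-dimensional w such a grid approximation quickly becomes impractical; this is precisely where the Gaussian prior pays off, since the Gaussian likelihood and Gaussian prior combine into a posterior that can be handled analytically.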