Starting from an initial estimate of the parameters

$$\Theta^{(0)} = \left\{ \alpha_1^{(0)}, \dots, \alpha_K^{(0)},\; \mu_1^{(0)}, \dots, \mu_K^{(0)},\; \Sigma_1^{(0)}, \dots, \Sigma_K^{(0)} \right\}$$

the algorithm alternates between the following two steps:

E-step:
At each step $t$, estimate, for each Gaussian $k$, the probability that this Gaussian is responsible for generating each point $x^i$ of the dataset by computing:

$$p^{(t)}\left(k \mid x^i, \Theta^{(t)}\right) = \frac{p\left(x^i \mid k, \mu_k^{(t)}, \Sigma_k^{(t)}\right) \cdot \alpha_k^{(t)}}{\sum_{j=1}^{K} p\left(x^i \mid j, \mu_j^{(t)}, \Sigma_j^{(t)}\right) \cdot \alpha_j^{(t)}} \qquad (3.18)$$

M-step:
Recompute the means, covariances and prior probabilities so as to maximize the log-likelihood of the data under the current estimate, $\log L\left(X \mid \Theta^{(t)}\right)$, using the current estimate of the probabilities $p^{(t)}\left(k \mid x^j, \Theta^{(t)}\right)$:

$$\mu_k^{(t+1)} = \frac{\sum_{j} p\left(k \mid x^j, \Theta^{(t)}\right) x^j}{\sum_{j} p\left(k \mid x^j, \Theta^{(t)}\right)} \qquad (3.19)$$

$$\Sigma_k^{(t+1)} = \frac{\sum_{j} p\left(k \mid x^j, \Theta^{(t)}\right) \left(x^j - \mu_k^{(t+1)}\right)\left(x^j - \mu_k^{(t+1)}\right)^T}{\sum_{j} p\left(k \mid x^j, \Theta^{(t)}\right)} \qquad (3.20)$$

$$\alpha_k^{(t+1)} = \frac{1}{M} \sum_{i=1}^{M} p\left(k \mid x^i, \Theta^{(t)}\right) \qquad (3.21)$$

where $M$ is the number of datapoints.
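To make the two steps concrete, the following is a minimal NumPy/SciPy sketch of this EM loop. It is an illustration under the assumptions of this example, not the implementation behind these notes: the function name em_gmm, the random-sample initialization, the fixed iteration count (rather than a convergence test on the log-likelihood) and the small ridge added to each covariance are all choices made here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Fit a K-Gaussian mixture to X (M x d) with EM, following (3.18)-(3.21)."""
    M, d = X.shape
    rng = np.random.default_rng(seed)

    # Initialization Theta^(0): means drawn at random from the data,
    # identity covariances, uniform priors (one choice among many).
    mu = X[rng.choice(M, size=K, replace=False)]          # (K, d)
    sigma = np.stack([np.eye(d) for _ in range(K)])       # (K, d, d)
    alpha = np.full(K, 1.0 / K)                           # (K,)

    for _ in range(n_iter):
        # E-step (3.18): responsibility of Gaussian k for each point x^i.
        lik = np.stack([multivariate_normal.pdf(X, mu[k], sigma[k])
                        for k in range(K)], axis=1)       # (M, K)
        resp = lik * alpha                                # numerator of (3.18)
        resp /= resp.sum(axis=1, keepdims=True)           # normalize over k

        # M-step: re-estimate the parameters with the responsibilities fixed.
        Nk = resp.sum(axis=0)                             # effective counts per Gaussian
        mu = (resp.T @ X) / Nk[:, None]                   # means (3.19)
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]   # (3.20)
            sigma[k] += 1e-6 * np.eye(d)   # small ridge to keep Sigma_k positive definite
        alpha = Nk / M                                    # priors (3.21)

    return alpha, mu, sigma
```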
Note finally that we have not discussed how to choose the optimal number of Gaussians $K$. Various techniques exist; they are based on finding a tradeoff between increasing the number of Gaussians, and hence the number of parameters (which greatly increases the computation required to estimate them), and the improvement such an increase brings to the likelihood. This tradeoff can be measured by criteria such as BIC, DIC or AIC; a more detailed description is given in Section 7.2.3, and a minimal BIC sketch is given after the figure below.

Figure 3-16 shows an example of clustering with a 3-Gaussian GMM with full covariance matrices and illustrates how a poor initialization can result in poor clustering.

Figure 3-16: Clustering with 3 Gaussians using a full covariance matrix. Original data (top left); superposed Gaussians (top right); regions associated with each Gaussian, using Bayes' rule to determine the separation (bottom left); effect of a poor initialization on the clustering (bottom right).
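Returning to the choice of $K$: as a rough illustration of the BIC criterion mentioned above, the sketch below scores a fitted model with $\mathrm{BIC} = -2 \log L + p \log M$, where $p$ counts the free parameters of a full-covariance GMM. The helper name gmm_bic, the parameter count and the candidate range of $K$ are assumptions of this example; see Section 7.2.3 for the actual discussion of these criteria.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_bic(X, alpha, mu, sigma):
    """BIC = -2 log L + p log M for a fitted GMM; lower is better."""
    M, d = X.shape
    K = len(alpha)
    lik = np.stack([multivariate_normal.pdf(X, mu[k], sigma[k])
                    for k in range(K)], axis=1)           # (M, K)
    log_lik = np.log(lik @ alpha).sum()                   # log-likelihood of the data
    # Free parameters: (K-1) priors + K*d means + K*d(d+1)/2 full covariances.
    p = (K - 1) + K * d + K * d * (d + 1) // 2
    return -2.0 * log_lik + p * np.log(M)

# Usage: fit candidate models with em_gmm above and keep the K with the lowest BIC.
# scores = {K: gmm_bic(X, *em_gmm(X, K)) for K in range(1, 6)}
# best_K = min(scores, key=scores.get)
```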