MACHINE LEARNING TECHNIQUES - LASA





Starting from an initial estimate of the parameters:

$$\Theta^{(0)} = \left\{ \alpha_1^{(0)}, \ldots, \alpha_K^{(0)},\; \mu_1^{(0)}, \ldots, \mu_K^{(0)},\; \Sigma_1^{(0)}, \ldots, \Sigma_K^{(0)} \right\}$$

the algorithm iterates between the two following steps:

E-step: At each step t, estimate, for each Gaussian k, the probability that this Gaussian is responsible for generating each point $x^i$ of the dataset by computing:

$$p\left(k \mid x^i, \Theta^{(t)}\right) = \frac{p\left(x^i \mid k, \mu_k^{(t)}, \Sigma_k^{(t)}\right) \cdot \alpha_k^{(t)}}{\sum_{j=1}^{K} p\left(x^i \mid j, \mu_j^{(t)}, \Sigma_j^{(t)}\right) \cdot \alpha_j^{(t)}} \qquad (3.18)$$

M-step: Recompute the means, covariances and prior probabilities so as to maximize the log-likelihood of the current estimate, $\log\left(L\left(X \mid \Theta^{(t)}\right)\right)$, using the current estimate of the probabilities $p\left(k \mid x^j, \Theta^{(t)}\right)$:

$$\mu_k^{(t+1)} = \frac{\sum_j p\left(k \mid x^j, \Theta^{(t)}\right) \cdot x^j}{\sum_j p\left(k \mid x^j, \Theta^{(t)}\right)} \qquad (3.19)$$

$$\Sigma_k^{(t+1)} = \frac{\sum_j p\left(k \mid x^j, \Theta^{(t)}\right) \left(x^j - \mu_k^{(t+1)}\right)\left(x^j - \mu_k^{(t+1)}\right)^T}{\sum_j p\left(k \mid x^j, \Theta^{(t)}\right)} \qquad (3.20)$$

$$\alpha_k^{(t+1)} = \frac{1}{M} \sum_j p\left(k \mid x^j, \Theta^{(t)}\right) \qquad (3.21)$$

where M is the number of data points in the dataset.
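To make the updates concrete, here is a minimal NumPy/SciPy sketch of the full EM loop of Equations 3.18 to 3.21. The function name em_gmm, the identity-covariance initialization and the small regularization term added to the covariances are illustrative assumptions, not part of these notes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Fit a K-component GMM to data X (M x d) by EM (Eqs. 3.18-3.21)."""
    M, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialization Theta^(0) (assumed here): means drawn from the data,
    # identity covariances, uniform priors.
    mu = X[rng.choice(M, size=K, replace=False)]      # K x d
    Sigma = np.stack([np.eye(d) for _ in range(K)])   # K x d x d
    alpha = np.full(K, 1.0 / K)                       # K

    for _ in range(n_iter):
        # E-step (Eq. 3.18): responsibilities p(k | x^i, Theta^(t)),
        # i.e. each component's weighted likelihood, normalized per point.
        lik = np.column_stack([alpha[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X)
                               for k in range(K)])    # M x K
        resp = lik / lik.sum(axis=1, keepdims=True)

        # M-step: means (3.19), covariances (3.20), priors (3.21).
        Nk = resp.sum(axis=0)                         # sum_j p(k | x^j) for each k
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]                          # x^j - mu_k^(t+1)
            Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(d)              # keep covariances well-conditioned
        alpha = Nk / M
    return alpha, mu, Sigma
```

The small term added to each covariance is a common safeguard against singular matrices when a Gaussian collapses onto very few points; it is a design choice of this sketch, not part of the update rules above.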

Note finally that we have not discussed how to choose the optimal number of states K. Various techniques determine a tradeoff between increasing the number of states, and hence the number of parameters (which greatly increases the computation required to estimate such a large set of parameters), and the improvement that such an increase brings to the likelihood. This tradeoff can be measured by the BIC, DIC or AIC criteria. A more detailed description is given in Section 7.2.3.
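As an illustration of such a criterion, the sketch below scores a fitted model with the BIC, $-2\log L + p\log M$, where $p$ counts the free parameters of a full-covariance GMM. The helper name gmm_bic and the reuse of the em_gmm sketch above are assumptions for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_bic(X, alpha, mu, Sigma):
    """BIC = -2 log L + p log M for a full-covariance GMM (lower is better)."""
    M, d = X.shape
    K = len(alpha)
    # Total log-likelihood of the data under the fitted mixture.
    lik = sum(alpha[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X) for k in range(K))
    log_L = np.log(lik).sum()
    # Free parameters: K-1 priors, K*d means, K*d*(d+1)/2 covariance entries.
    p = (K - 1) + K * d + K * d * (d + 1) // 2
    return -2.0 * log_L + p * np.log(M)

# Fit models of increasing complexity and keep the K with the lowest BIC:
# best_K = min(range(1, 10), key=lambda K: gmm_bic(X, *em_gmm(X, K)))
```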

Figure 3-16 shows an example of clustering using a GMM with 3 Gaussians and a full covariance matrix, and illustrates how a poor initialization can result in poor clustering.

Figure 3-16: Clustering with 3 Gaussians using a full covariance matrix. Original data (top left); superposed Gaussians (top right); regions associated to each Gaussian, using Bayes' rule to determine the separation (bottom left); effect of a poor initialization on the clustering (bottom right).

© A.G.Billard 2004 – Last Update March 2011
