The \alpha_k are the so-called mixing coefficients. Their sum is 1, i.e. \sum_{k=1}^{K} \alpha_k = 1. These coefficients are usually estimated together with the parameters of the Gaussians (i.e. the means and covariance matrices). In some cases, they can however be set to a constant, for instance to give equal weight to each Gaussian (in this case, \alpha_k = \frac{1}{K} \; \forall k = 1, \dots, K). When the coefficients are estimated, they end up representing the proportion of data points that belong most to that particular Gaussian (this is similar to the definition of the \alpha seen in the case of Mixture of Gaussians in the previous section). In a probabilistic sense, these coefficients represent the prior probability with which each Gaussian may have generated the whole dataset and can hence be written:

\alpha_k = p(k) = \frac{1}{M} \sum_{j=1}^{M} p(k \mid x^j)
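As a minimal illustration of this relationship, the sketch below recovers the mixing coefficients by averaging the posterior responsibilities p(k | x^j) over the dataset. The array name and the toy values are assumptions made for the example, not part of these notes:

```python
import numpy as np

# responsibilities[j, k] = p(k | x^j): one row per datapoint, one column per Gaussian.
# Toy values chosen by hand; in practice these come from the E-step of E-M.
responsibilities = np.array([[0.9, 0.1],
                             [0.8, 0.2],
                             [0.1, 0.9]])

# alpha_k = p(k) = (1/M) * sum_j p(k | x^j)
alphas = responsibilities.mean(axis=0)
print(alphas)        # [0.6 0.4]
print(alphas.sum())  # 1.0, as required of mixing coefficients
```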
Learning a GMM requires determining the means, covariance matrices and prior probabilities of the K Gaussians. The most popular method relies on Expectation-Maximization (E-M). We advise the reader to take a detour here and read the tutorial by Bilmes provided in the annexes of these lecture notes for a full derivation of GMM parameter estimation through E-M. We briefly summarize the principle next.
We want to maximize the likelihood of the model's parameters \Theta = \{\alpha_1, \dots, \alpha_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K\} given the data, that is:

\max_{\Theta} L(\Theta \mid X) = \max_{\Theta} p(X \mid \Theta) \qquad (3.15)

Assuming that the set of M datapoints X = \{x^j\}_{j=1}^{M} is identically and independently distributed (iid), we get:

\max_{\Theta} p(X \mid \Theta) = \max_{\Theta} \prod_{j=1}^{M} \sum_{k=1}^{K} \alpha_k \cdot p(x^j \mid \mu_k, \Sigma_k) \qquad (3.16)
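To make Eq. (3.16) concrete, the sketch below evaluates the likelihood of a toy one-dimensional dataset under a two-component GMM. All numerical values, and the use of SciPy's multivariate_normal, are assumptions made for the illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy GMM with K = 2 components (parameters chosen by hand)
alphas = np.array([0.6, 0.4])    # mixing coefficients, sum to 1
means = [0.0, 5.0]
covs = [1.0, 2.0]

X = np.array([0.5, 4.8, -1.0])   # M = 3 one-dimensional datapoints

# Inner sum of Eq. (3.16): p(x^j | Theta) = sum_k alpha_k * p(x^j | mu_k, Sigma_k)
point_densities = sum(a * multivariate_normal.pdf(X, mean=m, cov=c)
                      for a, m, c in zip(alphas, means, covs))

# Outer product of Eq. (3.16): likelihood of the iid dataset
likelihood = np.prod(point_densities)
print(likelihood)
```

Note that for realistic values of M this raw product underflows to zero in floating point, which is one more reason to work with the log-likelihood introduced next.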
Taking the log of the likelihood is often a good approach, as it simplifies the computation. Using the fact that the optimum x^* of a function f(x) is also an optimum of \log f(x), one can compute:
\max_{\Theta} p(X \mid \Theta) = \max_{\Theta} \log p(X \mid \Theta)

\max_{\Theta} \log \prod_{j=1}^{M} \sum_{k=1}^{K} \alpha_k \cdot p(x^j \mid \mu_k, \Sigma_k) = \max_{\Theta} \sum_{j=1}^{M} \log \left( \sum_{k=1}^{K} \alpha_k \cdot p(x^j \mid \mu_k, \Sigma_k) \right) \qquad (3.17)
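In practice, the inner sum of Eq. (3.17) is evaluated in the log domain with a log-sum-exp to avoid underflow; this is a standard numerical device, not something derived in these notes. A self-contained sketch, reusing the same hypothetical toy model as above:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

# Same toy GMM as in the previous sketch (values are assumptions)
alphas = np.array([0.6, 0.4])
means = [0.0, 5.0]
covs = [1.0, 2.0]
X = np.array([0.5, 4.8, -1.0])

# log sum_k alpha_k * p(x^j | mu_k, Sigma_k), computed stably as
# logsumexp_k( log alpha_k + log p(x^j | mu_k, Sigma_k) )
log_terms = np.stack([np.log(a) + multivariate_normal.logpdf(X, mean=m, cov=c)
                      for a, m, c in zip(alphas, means, covs)])

# Eq. (3.17): sum over datapoints of the log of the inner mixture sum
log_likelihood = np.sum(logsumexp(log_terms, axis=0))
print(log_likelihood)
```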
The log of a sum is difficult to maximize analytically, and one is therefore led to proceed iteratively, computing an approximation at each step; this is the E-M procedure, see Section 9.4.2.2. The final update procedure runs as follows:
Initialization of the parameters:
Initialize all parameters to a value to start with. The priors \alpha_1, \dots, \alpha_K can for instance be initialized with a uniform prior, while the means can be initialized by running K-Means first. The complete set of initial parameters is then given by \Theta = \{\alpha_1, \dots, \alpha_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K\}, as in the sketch below.
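A minimal sketch of this initialization step follows. The dataset, the choice K = 3, the identity covariances, and the use of SciPy's kmeans2 for the K-Means step are all assumptions made for the illustration:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # placeholder dataset: M = 200 points in 2-D
K = 3

# Uniform prior: alpha_k = 1/K for all k
alphas = np.full(K, 1.0 / K)

# Means initialized by running K-Means first
means, _ = kmeans2(X, K, minit='++')

# Covariances initialized to the identity (one common convention)
covariances = np.stack([np.eye(X.shape[1]) for _ in range(K)])

# Complete initial parameter set Theta = {alpha_k, mu_k, Sigma_k}, k = 1..K
theta0 = {'alpha': alphas, 'mu': means, 'sigma': covariances}
```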