MACHINE LEARNING TECHNIQUES - LASA

This implies that $\det(A^{-1})$ is a constant (and is equal to $\pm 1$ in the particular case of whitened mixtures). Assuming then that the data are zero-mean and white, and that the number of sources equals the dimension of the data, i.e. $N$, and substituting into (5.20), one can see that maximizing the negentropy is equivalent to minimizing the mutual information, as the two quantities differ only by a constant, i.e.

$I(s_1,\dots,s_N) = \mathrm{cst.} - \sum_{i=1}^{N} J(x_i)$     (5.23)

Given that the mutual information of a random vector is nonnegative, and zero if and only if the components of the vector are independent, minimizing the mutual information for ICA is attractive in that it has a single global minimum. Unfortunately, the mutual information is difficult to approximate and to optimize on the basis of a finite sample of datapoints, and much research on ICA has focused on alternative contrast functions. The earliest approach, which optimizes the negentropy with the fast-ICA algorithm (see Section 2.3.6), relies on an ad-hoc set of non-linear functions. While not very generic, this approach remains advantageous in that it allows each component to be estimated separately and in that it performs the estimation only on uncorrelated patterns.

Non-linear Case

We will here assume that the sources and the original datapoints have the same dimension and that both are zero-mean and white. In practice this can be achieved by first performing PCA and then reducing to the smallest set of eigenvectors that best represents the data. Hence, given a set of observation vectors $X = \{x^j\}_{j=1,\dots,M}$, we look for the sources $s_1,\dots,s_N$ such that $s^j = W x^j$ with $W = A^{-1}$. As in linear ICA, we do not compute $A$ and then invert it, but rather estimate its inverse directly by computing the columns of $W$.

While the fast-ICA method in linear ICA relied on choosing one or two predefined non-linear functions for the transformation, kernel ICA solves ICA by considering not just a single non-linear function, but an entire function space $H := \{h_1(\cdot), h_2(\cdot), \dots\}$ of candidate nonlinearities. The hope is that using a function space makes it possible to adapt to a variety of sources and thus makes the algorithm more robust to varying source distributions. Each function $h_1, h_2, \dots$ in $H$ maps $\mathbb{R}^N \to \mathbb{R}$. We can hence determine a set of $N$ transformations $h_i : \{x^j\}_{j=1,\dots,M} \to H$, $i = 1,\dots,N$, with associated kernels $K_i = \langle h_i(\cdot), h_i(\cdot) \rangle$, $i = 1,\dots,N$. Beware that we are here comparing the projections of the datapoints along the dimensions $i = 1,\dots,N$ of the original dataset, which is assumed to have the same dimension as the sources. Each matrix $K_i$ is therefore $M \times M$.

In Section 5.4, when considering kernel CCA, we saw that one can determine the projections that maximize the correlation across the variables using different sets of projections $h_1, h_2, \dots$, as measured by $\rho(K_1,\dots,K_N)$. This quantity lies between zero and one. While CCA aimed at finding the maximal correlation, kernel ICA instead looks for the projections that maximize statistical independence.
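For the linear case discussed at the beginning of this section, the complete pipeline (PCA whitening followed by fast-ICA) can be sketched in a few lines. The sketch below is illustrative only: the Laplace-distributed sources, the random mixing matrix and the parameter values are assumptions, not part of the text, and it relies on scikit-learn's PCA and FastICA implementations.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Illustrative data: M observations of N linearly mixed, non-Gaussian sources.
rng = np.random.default_rng(0)
M, N = 2000, 3
S_true = rng.laplace(size=(M, N))      # hypothetical non-Gaussian sources
A = rng.normal(size=(N, N))            # hypothetical mixing matrix
X = S_true @ A.T                       # observed mixtures x = A s

# Step 1: centre and whiten the data (zero mean, unit covariance) with PCA.
X_white = PCA(whiten=True).fit_transform(X)

# Step 2: fast-ICA on the whitened data; the unmixing matrix W is estimated
# directly, without ever forming A and inverting it.
ica = FastICA(n_components=N, whiten=False, random_state=0)
S_est = ica.fit_transform(X_white)     # estimated sources s = W x, shape (M, N)
W = ica.components_                    # rows of the unmixing matrix W
```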

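For the kernel case, the $N$ centred Gram matrices $K_i$ introduced above can be built from the current source estimates $s = Wx$, one matrix per estimated component. The sketch below assumes a Gaussian kernel; the kernel choice and its width are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def centered_gram(s_i, sigma=1.0):
    """M x M Gram matrix of one estimated component s_i (a vector of length M),
    built with an assumed Gaussian kernel and then centred."""
    d = s_i[:, None] - s_i[None, :]
    K = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    M = len(s_i)
    H = np.eye(M) - np.ones((M, M)) / M      # centring matrix
    return H @ K @ H

def gram_matrices(S_est, sigma=1.0):
    """One centred M x M Gram matrix per column of S_est (shape (M, N)),
    i.e. one per dimension of the estimated sources."""
    return [centered_gram(S_est[:, i], sigma) for i in range(S_est.shape[1])]
```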
We define a measure of this independence through $\lambda(K_1,\dots,K_N) = 1 - \rho(K_1,\dots,K_N)$. This quantity also lies between 0 and 1 and is equal to 1 if the variables $X_1,\dots,X_N$ are pairwise independent. Optimizing for independence can thus be done by maximizing $\lambda(K_1,\dots,K_N)$, which is equivalent to minimizing the following objective function:

$J(K_1,\dots,K_N) = -\log \lambda(K_1,\dots,K_N)$     (5.24)

This optimization problem is solved with an iterative method: one starts from an initial guess for $W$ and then performs gradient descent on $J$. Details can be found in Bach & Jordan (2002). Interestingly, an implementation of kernel-ICA whose computational complexity is linear in the number of datapoints is proposed there, which reduces the computational cost considerably.
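As a rough illustration of how the contrast (5.24) can be evaluated, the sketch below follows the kernel canonical-correlation formulation of Bach & Jordan (2002), in which $\lambda$ is obtained as the smallest generalized eigenvalue of a block problem built from the regularised Gram matrices. The regularisation constant $\kappa$, and the reuse of the gram_matrices helper sketched above, are assumptions made for illustration; a complete implementation would add the gradient-descent update of $W$ and the low-rank approximations that yield the linear complexity mentioned in the text.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_ica_contrast(Ks, kappa=2e-2):
    """J = -log lambda(K_1, ..., K_N), Eq. (5.24).
    Ks: list of N centred M x M Gram matrices (e.g. from gram_matrices above).
    lambda is taken as the smallest generalized eigenvalue of the block
    problem A v = lambda B v of Bach & Jordan (2002); kappa is an assumed
    regularisation constant."""
    N = len(Ks)
    M = Ks[0].shape[0]
    R = [K + 0.5 * M * kappa * np.eye(M) for K in Ks]   # regularised Gram matrices
    A = np.zeros((N * M, N * M))
    B = np.zeros((N * M, N * M))
    for i in range(N):
        A[i*M:(i+1)*M, i*M:(i+1)*M] = R[i] @ R[i]
        B[i*M:(i+1)*M, i*M:(i+1)*M] = R[i] @ R[i]
        for j in range(N):
            if j != i:
                A[i*M:(i+1)*M, j*M:(j+1)*M] = Ks[i] @ Ks[j]
    lam = eigh(A, B, eigvals_only=True)[0]   # smallest generalized eigenvalue
    return -np.log(max(lam, 1e-12))          # guard against numerical round-off
```

In the iterative scheme described above, this value would be recomputed after each gradient step on $W$ (which, for whitened data, can be kept orthonormal), stopping once $J$ no longer decreases.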
