MACHINE LEARNING TECHNIQUES - LASA
This implies that $\det(A^{-1})$ is a constant (and is equal to $\pm 1$ in the particular case of whitened mixtures).
Assuming then that the data is zero-mean and white, that the number of sources equals the dimension of the data, i.e. $N$, and replacing in (5.20), one can see that maximizing the negentropy is equivalent to minimizing the mutual information, as the two quantities differ only by a constant, i.e.

$I(s_1, \dots, s_N) = \text{cst.} - \sum_{i=1}^{N} J(x_i)$    (5.23)

Given that the mutual information of a random vector is nonnegative, and zero if and only if the components of the vector are independent, minimization of mutual information for ICA is interesting in that it has a single global minimum. Unfortunately, the mutual information is difficult to approximate and optimize on the basis of a finite sample of datapoints, and much research on ICA has focused on alternative contrast functions. The earliest approach, which optimizes the negentropy using the fast-ICA algorithm (see Section 2.3.6), relies on an ad hoc set of non-linear functions. While not very generic, this approach remains advantageous in that it allows estimating each component separately and in that it performs the estimation only on uncorrelated patterns.
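As an illustration, the following is a minimal sketch of the one-unit fast-ICA fixed-point iteration on zero-mean, whitened data, using the common $G(u) = \log\cosh(u)$ contrast, so that $g = \tanh$ and $g'(u) = 1 - \tanh^2(u)$; the function name and default parameters are assumptions for illustration, not part of these notes.

    import numpy as np

    def fastica_one_unit(X, max_iter=200, tol=1e-6, seed=0):
        # X: (N, M) array of zero-mean, whitened data (N dimensions, M samples).
        # One-unit fast-ICA fixed-point update: w <- E[x g(w'x)] - E[g'(w'x)] w,
        # followed by renormalization; returns one unmixing direction w.
        N, M = X.shape
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(N)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            wx = w @ X                                # projections, shape (M,)
            g = np.tanh(wx)                           # g = tanh
            g_prime = 1.0 - g ** 2                    # g' = 1 - tanh^2
            w_new = (X * g).mean(axis=1) - g_prime.mean() * w
            w_new /= np.linalg.norm(w_new)
            if abs(abs(w_new @ w) - 1.0) < tol:       # converged up to sign
                return w_new
            w = w_new
        return w

Several components can then be extracted by repeating this update while orthogonalizing (e.g. by Gram-Schmidt deflation) against the directions already found, which is possible precisely because the components are estimated separately on whitened data.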
Non-linear Case
We will here assume that the sources and the original datapoints have the same dimension and that both are zero-mean and white. In practice this can be achieved by first performing PCA and then reducing the data to the smallest set of eigenvectors that best represents it. Hence, given a set of observation vectors $X = \{x^j\}_{j=1,\dots,M}$, we will look for the sources $s_1, \dots, s_N$ such that $s = Wx$ with $W = A^{-1}$. As in linear ICA, we do not compute $A$ and then invert it, but rather estimate its inverse directly by computing the rows of $W$.
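For concreteness, here is a minimal sketch of this preprocessing (centering, PCA and whitening); the function name, the data layout (one datapoint per column) and the threshold for discarding near-null eigendirections are assumptions for illustration.

    import numpy as np

    def whiten(X, eps=1e-12):
        # X: (N, M) array, one datapoint per column.
        # Returns whitened data Xw (identity covariance) and the whitening
        # matrix V such that Xw = V @ (X - mean).
        Xc = X - X.mean(axis=1, keepdims=True)        # center the data
        C = Xc @ Xc.T / Xc.shape[1]                   # sample covariance (N x N)
        eigvals, E = np.linalg.eigh(C)                # C = E diag(eigvals) E'
        keep = eigvals > eps                          # drop near-null directions
        V = np.diag(eigvals[keep] ** -0.5) @ E[:, keep].T
        return V @ Xc, V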
While the fast-ICA method in linear ICA relied on choosing one or two predefined non-linear functions for the transformation, kernel ICA solves ICA by considering not just a single non-linear function, but an entire function space $H := \{h_1(\cdot), h_2(\cdot), \dots\}$ of candidate nonlinearities. The hope is that using a function space makes it possible to adapt to a variety of sources and thus makes the algorithm more robust to varying source distributions. Each function $h_1, h_2, \dots$ from $H$ goes from $\mathbb{R}^N$ to $\mathbb{R}$. We can hence determine a set of $N$ transformations $h_i : \{x^j\}_{j=1,\dots,M} \to H$, $i = 1, \dots, N$, with associated kernels $K_i = \langle h_i(\cdot), h_i(\cdot) \rangle$, $i = 1, \dots, N$. Beware that we are here comparing the projections of the datapoints along the dimensions $i = 1, \dots, N$ of the original dataset, which is assumed to have the same dimension as the sources. Each matrix $K_i$ is then $M \times M$.
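To fix ideas, the sketch below builds one centered $M \times M$ Gram matrix per dimension of the (whitened) data. The Gaussian kernel and its bandwidth sigma are assumptions for illustration; any kernel spanning a suitable function space $H$ could be substituted.

    import numpy as np

    def centered_gram_matrices(X, sigma=1.0):
        # X: (N, M) whitened data; returns [K_1, ..., K_N], each M x M.
        # K_i[a, b] = k(x_i^a, x_i^b) with a Gaussian kernel, then centered
        # in feature space via K <- H K H, with H = I - (1/M) 1 1'.
        N, M = X.shape
        H = np.eye(M) - np.ones((M, M)) / M           # centering matrix
        grams = []
        for i in range(N):
            d = X[i][:, None] - X[i][None, :]         # pairwise differences
            K = np.exp(-d ** 2 / (2.0 * sigma ** 2))  # Gaussian Gram matrix
            grams.append(H @ K @ H)
        return grams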
In Section 5.4, when considering kernel CCA, we saw that one can determine the projections that maximize the correlation across the variables, using different sets of projections $h_1, h_2, \dots$ measured by $\rho(K_1, \dots, K_N)$. This quantity lies between zero and one. While CCA aimed at finding the maximal correlation, kernel ICA will instead try to find the projections that maximize statistical independence.
We define a measure of this independence through $\lambda(K_1, \dots, K_N) = 1 - \rho(K_1, \dots, K_N)$. This quantity also lies between 0 and 1, and is equal to 1 if the variables $x_1, \dots, x_N$ are pairwise independent. Optimizing for independence can hence be done by maximizing $\lambda(K_1, \dots, K_N)$, which is equivalent to minimizing the following objective function:

$J(K_1, \dots, K_N) = -\log \lambda(K_1, \dots, K_N)$    (5.24)

This optimization problem is solved iteratively: one starts with an initial guess for $W$ and then performs gradient descent on $J$. Details can be found in Bach & Jordan (2002); interestingly, an implementation of kernel ICA whose computational complexity is linear in the number of datapoints is also proposed there, which reduces the computational cost considerably.
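As a sketch of how $\lambda$ and $J$ might be evaluated in practice, following the kernel CCA construction of Section 5.4 and Bach & Jordan (2002), $\lambda$ below is taken as the smallest generalized eigenvalue of a regularized block matrix built from the centered Gram matrices (for two variables the eigenvalues of this problem are $1 \pm \rho$, so the smallest equals $\lambda = 1 - \rho$). The ridge parameter kappa, its scaling, and the function name are illustrative assumptions.

    import numpy as np
    from scipy.linalg import eigh

    def kernel_ica_contrast(grams, kappa=1e-2):
        # grams: list of N centered M x M Gram matrices K_1, ..., K_N.
        # lambda is the smallest generalized eigenvalue of A v = lambda B v,
        # with off-diagonal blocks K_i K_j, diagonal blocks (K_i + M*kappa/2 I)^2,
        # and B block-diagonal with those same diagonal blocks.
        N, M = len(grams), grams[0].shape[0]
        R = [K + 0.5 * M * kappa * np.eye(M) for K in grams]
        A = np.empty((N * M, N * M))
        B = np.zeros((N * M, N * M))
        for i in range(N):
            Dii = R[i] @ R[i]
            B[i*M:(i+1)*M, i*M:(i+1)*M] = Dii
            for j in range(N):
                A[i*M:(i+1)*M, j*M:(j+1)*M] = Dii if i == j else grams[i] @ grams[j]
        lam = eigh(A, B, eigvals_only=True, subset_by_index=[0, 0])[0]
        return -np.log(lam)                           # J = -log(lambda), eq. (5.24)

In the full algorithm, one would map the data through the current unmixing estimate, $X \mapsto WX$, rebuild the Gram matrices, and perform gradient descent on $J$ with respect to $W$ (which can be kept orthogonal, since the data is whitened).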