For a variable x distributed uniformly over the rectangle [0, a] × [0, b], i.e., with constant density p(x) = 1/(a·b), the differential entropy is:

$$h(x) = -\int_0^a \int_0^b \frac{1}{a}\,\frac{1}{b}\,\log\!\left(\frac{1}{a}\,\frac{1}{b}\right) dx_2\, dx_1 = -\log\frac{1}{a} - \log\frac{1}{b} = \log a + \log b = \log(a \cdot b)$$
This definition is appealing, as the entropy grows when a or b has a large value, i.e., when the spread of the distribution is large.
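As a quick numerical sanity check, the following sketch approximates the double integral above by a Riemann sum over a grid (a minimal illustration; the side lengths a = 2, b = 3 are arbitrary, and natural logarithms are used, so entropies are in nats):

```python
import numpy as np

# Riemann-sum approximation of h(x) for a uniform density on [0,a] x [0,b].
a, b = 2.0, 3.0                      # arbitrary illustrative side lengths
n = 500                              # grid resolution per axis
p = np.full((n, n), 1.0 / (a * b))   # constant density over the rectangle
cell = (a / n) * (b / n)             # area of one grid cell

h = -np.sum(p * np.log(p)) * cell    # -sum p log(p) dA
print(h, np.log(a * b))              # both ~1.7918 (= log 6)
```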
One can show that the Gaussian distribution is the distribution with maximal entropy among all distributions with a given variance. In other words, the Gaussian distribution is the ``most random'', or least structured, of all such distributions. Entropy is small for distributions that are concentrated on a few values, i.e., when the variable is clearly clustered or has a very ``spiky'' probability distribution function.
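Both claims are easy to check numerically. The sketch below (illustrative values only) computes the entropy of a ``spiky'' versus a uniform discrete distribution, and compares the closed-form differential entropies of a Gaussian and of a uniform density with the same variance:

```python
import numpy as np

def shannon_entropy(p):
    """Discrete entropy -sum p log p, with the convention 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A ``spiky'' distribution versus a uniform one over 10 values.
spiky = np.array([0.91] + [0.01] * 9)
uniform = np.full(10, 0.1)
print(shannon_entropy(spiky))    # ~0.50 nats: mass concentrated, low entropy
print(shannon_entropy(uniform))  # ~2.30 nats = log(10), the maximum

# Differential entropy at equal variance sigma^2: the Gaussian gives
# 0.5*log(2*pi*e*sigma^2); a uniform density on an interval of length
# sqrt(12)*sigma (same variance) gives log(sqrt(12)*sigma).
sigma = 1.0
print(0.5 * np.log(2 * np.pi * np.e * sigma ** 2))  # ~1.42 nats
print(np.log(np.sqrt(12) * sigma))                  # ~1.24 nats < Gaussian
```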
9.3.2 Joint and conditional entropy
If x and y are two discrete random variables taking values i ∈ {0, …, I} and j ∈ {0, …, J} respectively, and P(i, j) is the joint probability that x takes value i and y takes value j, then the entropy of the joint distribution is equal to:
$$H(x, y) = -\sum_{i,j} P(i,j)\,\log P(i,j) \qquad (8.32)$$
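In practice, Eq. (8.32) is a single sum over a joint probability table. A minimal sketch, using an invented 2 × 3 table (rows index i, columns index j):

```python
import numpy as np

# Joint entropy H(x,y) per Eq. (8.32), natural log (entropy in nats).
P = np.array([[0.10, 0.20, 0.10],
              [0.25, 0.05, 0.30]])   # assumed joint probabilities P(i,j)
assert np.isclose(P.sum(), 1.0)      # a valid joint distribution

H_xy = -np.sum(P * np.log(P))        # all entries are > 0 here
print(H_xy)                          # ~1.64 nats
```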
We can then derive the equation for the conditional entropy:

$$\begin{aligned} H(x \mid y) &= \sum_j P(j)\, H(x \mid y = j) \\ &= -\sum_j P(j) \sum_i P(i \mid j) \log P(i \mid j) \\ &= -\sum_{i,j} P(i,j) \log P(i \mid j) \end{aligned}$$
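The first and last lines of this derivation can be verified numerically; the sketch below computes H(x|y) both ways on the same invented table as above:

```python
import numpy as np

P = np.array([[0.10, 0.20, 0.10],    # assumed joint table P(i,j)
              [0.25, 0.05, 0.30]])
P_j = P.sum(axis=0)                  # marginal P(j)
P_i_given_j = P / P_j                # conditional P(i|j), column by column

# First line: sum_j P(j) * H(x | y = j)
H1 = sum(P_j[j] * -np.sum(P_i_given_j[:, j] * np.log(P_i_given_j[:, j]))
         for j in range(P.shape[1]))

# Last line: -sum_{i,j} P(i,j) log P(i|j)
H2 = -np.sum(P * np.log(P_i_given_j))
print(H1, H2)                        # identical, ~0.56 nats
```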
Finally, conditional and joint entropy can be related as follows:

$$H(x, y) = H(x) + H(y \mid x) \qquad (8.33)$$
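A quick check of Eq. (8.33) on the same invented table:

```python
import numpy as np

P = np.array([[0.10, 0.20, 0.10],        # assumed joint table P(i,j)
              [0.25, 0.05, 0.30]])

def H(p):
    p = p[p > 0]                          # convention 0 log 0 = 0
    return -np.sum(p * np.log(p))

H_xy = H(P)                               # joint entropy H(x,y)
H_x = H(P.sum(axis=1))                    # marginal entropy H(x)
P_j_given_i = P / P.sum(axis=1, keepdims=True)
H_y_given_x = -np.sum(P * np.log(P_j_given_i))
print(H_xy, H_x + H_y_given_x)            # both ~1.64 nats
```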
Moreover, if the two variables are independent, then the entropy is additive:

$$H(x, y) = H(x) + H(y) \quad \text{iff} \quad P(x, y) = P(x)\,P(y)$$
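Additivity can be checked by building an independent joint distribution as the outer product of two assumed marginals:

```python
import numpy as np

px = np.array([0.4, 0.6])            # assumed marginal P(x)
py = np.array([0.35, 0.25, 0.40])    # assumed marginal P(y)
P = np.outer(px, py)                 # P(x,y) = P(x)P(y): independent

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(H(P), H(px) + H(py))           # equal, ~1.75 nats
```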
9.3.3 Mutual Information
The mutual information between two random variables x and y is denoted by