MACHINE LEARNING TECHNIQUES - LASA
1.2.4 Exercise
For the following scenarios: 1) learning to ride a bicycle; 2) learning how to open a box with a
lever; 3) learning sign language (if stranded on an island with only deaf people), determine:
a) the variables at hand;
b) a good measure of performance;
c) a criterion of "good enough" optimality;
d) a threshold of sub-optimality ("too poor");
e) the minimal time lag.
1.3 Best Practices in ML
ML algorithms may be extremely sensitive to the particular choice of data used to train them.
Ideally, you would like your training set to be large enough to sample the real distribution of
the data you are trying to estimate. In practice, this is not feasible. For instance, imagine
that you wish to train an algorithm to recognize human faces. During training, the algorithm
may observe only a subset of all the faces you may encounter in life, yet you would still like it
to generalize a model of "human faces" from observing, say, only 100 of them. If you provide
the algorithm with too many examples from the same subset of faces (e.g. by training it
only on faces of people with long hair), the algorithm may overfit, i.e. learn features
that are representative of this part of the data but not of the global pattern you aimed to
capture. In this case, the algorithm will end up recognizing only faces with long hair.
Each instance of a given class that the algorithm fails to detect correctly (e.g. a human face
with short hair) is called a false negative.
In addition to training the algorithm on a representative sample of the data to be recognized,
you should also provide it with counterexamples to this set, to avoid what are called false
positives. For instance, in the above example, an algorithm that retained only the feature "long
hair" might incorrectly report a human face when presented with pictures of horses.
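The two error types above can be made concrete with a minimal sketch. The labels and the small counting helper below are illustrative assumptions, not output of any real face classifier: 1 stands for "face", 0 for "not a face".

```python
# Count false negatives (a true face the classifier missed) and
# false positives (a non-face the classifier reported as a face).
def count_errors(y_true, y_pred):
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return false_negatives, false_positives

y_true = [1, 1, 0, 0, 1]   # ground truth: faces (1) and non-faces (0)
y_pred = [1, 0, 0, 1, 1]   # hypothetical classifier output
fn, fp = count_errors(y_true, y_pred)  # fn = 1, fp = 1
```

Here the second sample is a missed face (false negative) and the fourth is a non-face wrongly reported as a face (false positive).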
Finally, since ML algorithms essentially look for correlations across data, they may fit
spurious correlations if provided with a poorly chosen training set. To ensure that the
algorithm has generalized correctly beyond the training examples, a number of good
practices have been developed, which we review briefly below.
1.3.1 Training, validation and testing sets
A common practice to assess the validity of a machine learning algorithm is to measure its
performance against three data sets: the training, validation and testing sets. These three sets
are disjoint partitions of all the data at hand.
The training and validation sets are used for cross-validation (see below) during the training
phase. In the above example of training an algorithm to recognize human faces, one would
typically choose a set of N different faces representative of all genders, ethnicities and other
appearance variations (haircuts, glasses, moustaches, etc.). One would then split this set into a
training set and a validation set, usually half/half or 1/3–2/3.
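Such a split can be sketched in a few lines. The sketch below assumes the 2/3–1/3 ratio mentioned above and uses a list of integers as a stand-in for the N face images; the function name and the fixed seed are illustrative choices, not part of any particular library.

```python
import random

def split_data(samples, train_fraction=2/3, seed=0):
    """Split a data set into disjoint training and validation partitions."""
    shuffled = samples[:]                      # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)      # shuffle to avoid ordering bias
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]      # training set, validation set

faces = list(range(100))                       # placeholder for 100 face images
train_set, val_set = split_data(faces)         # 66 for training, 34 for validation
```

Shuffling before cutting matters: if the data were sorted (e.g. all long-haired faces first), a plain front/back split would reproduce exactly the biased-subset problem described earlier.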
The testing set consists of a subset of the data that you would normally encounter once training
is completed. In the above example, this would consist of faces recorded by the camera of the
client to whom you sell the algorithm after training it in the laboratory.
© A.G.Billard 2004 – Last Update March 2011