MACHINE LEARNING TECHNIQUES - LASA
1.2.4 Exercise
For the following scenarios: 1) learning to ride a bicycle; 2) learning how to open a box with a
lever; 3) learning sign language (if stranded on an island with only deaf people), determine:
a) the variables at hand;
b) a good measure of performance;
c) a criterion of "good enough" optimality;
d) a threshold of sub-optimality ("too poor");
e) the minimal time lag.
1.3 Best Practices in ML
ML algorithms may be extremely sensitive to the particular choice of data used to train them.
Ideally, you would like your training set to be large enough to sample the real distribution of
the data you are trying to estimate. In practice, this is not feasible. For instance, imagine
that you wish to train an algorithm to recognize human faces. During training, the algorithm
may observe only a subset of all the faces you may encounter in life, yet you would still like it
to generalize a model of "human faces" from observing, say, only 100 of them. If you provide
the algorithm with too many examples from the same subset of faces (e.g. by training it
only on faces of people with long hair), the algorithm may overfit, i.e. learn features
that are representative of this part of the data but not of the global pattern you aimed to
capture. In this case, the algorithm will end up recognizing only faces with long hair.
Each instance of a given class that the algorithm fails to detect correctly (e.g. a human face
with short hair) is called a false negative.
In addition to training the algorithm on a representative sample of the data to be recognized,
you should also provide it with counterexamples to this set, to avoid what are called false
positives. For instance, in the above example, an algorithm that retained only the feature "long
hair" might incorrectly report a human face when presented with pictures of horses.
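The two error types above can be made concrete with a minimal sketch. The labels and the small counting helper below are illustrative assumptions, not output of any real face classifier: 1 stands for "face", 0 for "not a face".

```python
# Count false negatives (a true face the classifier missed) and
# false positives (a non-face the classifier reported as a face).
def count_errors(y_true, y_pred):
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return false_negatives, false_positives

y_true = [1, 1, 0, 0, 1]   # ground truth: faces (1) and non-faces (0)
y_pred = [1, 0, 0, 1, 1]   # hypothetical classifier output
fn, fp = count_errors(y_true, y_pred)  # fn = 1, fp = 1
```

Here the second sample is a missed face (false negative) and the fourth is a non-face wrongly reported as a face (false positive).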
Finally, since ML algorithms essentially look for correlations across data, they may fit
spurious correlations if provided with a poorly chosen training set. To ensure that the
algorithm has generalized correctly beyond the training examples, a number of good
practices have been developed, which we review briefly below.
1.3.1 Training, validation and testing sets
A common practice to assess the validity of a machine learning algorithm is to measure its
performance against three data sets: the training, validation and testing sets. These three sets
are disjoint partitions of all the data at hand.
The training and validation sets are used for cross-validation (see below) during the training
phase. In the above example of training an algorithm to recognize human faces, one would
typically choose a set of N different faces representative of all genders, ethnicities and other
appearance variations (haircuts, glasses, moustaches, etc.). One would then split this set into a
training set and a validation set, usually half/half or 1/3–2/3.
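Such a split can be sketched in a few lines. The sketch below assumes the 2/3–1/3 ratio mentioned above and uses a list of integers as a stand-in for the N face images; the function name and the fixed seed are illustrative choices, not part of any particular library.

```python
import random

def split_data(samples, train_fraction=2/3, seed=0):
    """Split a data set into disjoint training and validation partitions."""
    shuffled = samples[:]                      # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)      # shuffle to avoid ordering bias
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]      # training set, validation set

faces = list(range(100))                       # placeholder for 100 face images
train_set, val_set = split_data(faces)         # 66 for training, 34 for validation
```

Shuffling before cutting matters: if the data were sorted (e.g. all long-haired faces first), a plain front/back split would reproduce exactly the biased-subset problem described earlier.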
The testing set consists of a subset of the data that you would normally encounter once training
is completed. In the above example, this would consist of faces recorded by the camera of the
client to whom you sell the algorithm after training it in the laboratory.
© A.G.Billard 2004 – Last Update March 2011