
1.2.4 Exercise

For the following scenarios: 1) learning to ride a bicycle; 2) learning how to open a box with a lever; 3) learning sign language (if lost on an island with only deaf people), determine:

a) the variables at hand;
b) a good measure of performance;
c) a criterion of "good enough" optimality;
d) a threshold of sub-optimality ("too poor");
e) the minimal time lag.

1.3 Best Practices in ML

ML algorithms may be extremely sensitive to the particular choice of data used to train them. Ideally, you would like your training set to be large enough to sample the real distribution of the data you are trying to estimate. In practice, this is not feasible. For instance, imagine that you wish to train an algorithm to recognize human faces. When training the algorithm, you can observe only a subset of all the faces you may encounter in life, yet you would still like your algorithm to generalize a model of "human faces" from observing, say, only 100 of them. If you provide the algorithm with too many examples drawn from the same subset of faces (e.g. by training it only on faces of people with long hair), the algorithm may overfit, i.e. learn features that are representative of this part of the data but not of the global pattern you aimed to capture. In this case, the algorithm will end up recognizing only faces with long hair.

Each instance of a given class that the algorithm fails to detect (e.g. a human face with short hair) is called a false negative.

In addition to training the algorithm on a representative sample of the data to recognize, you should also provide it with counterexamples to this set, to avoid what are called false positives. For instance, in the above example, an algorithm that retained only the feature "long hair" may incorrectly report a human face when presented with pictures of horses.
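As a minimal sketch of these two error types, the following Python snippet counts false negatives (faces the classifier missed) and false positives (non-faces it mistook for faces) from a list of ground-truth labels and predictions; the labels and predictions shown are made up purely for illustration:

```python
# Count false negatives and false positives for a binary classifier.
# Convention: 1 = "face", 0 = "not a face".

def error_counts(y_true, y_pred):
    """Return (false_negatives, false_positives)."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn, fp

y_true = [1, 1, 1, 0, 0, 0]   # ground truth: three faces, three non-faces
y_pred = [1, 0, 1, 0, 1, 0]   # (hypothetical) classifier output

fn, fp = error_counts(y_true, y_pred)
print(fn, fp)  # 1 1 -> one missed face, one non-face taken for a face
```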

Finally, since ML algorithms essentially look at correlations across data, they may fit spurious correlations if provided with a poorly chosen set of training data. To ensure that the algorithm has generalized correctly beyond the training examples you used, a number of good practices have been developed, which we review briefly below.

1.3.1 Training, validation and testing sets

Common practice to assess the validity of a machine learning algorithm is to measure its performance against three data sets: the training, validation and testing sets. These three sets are disjoint partitions of all the data at hand.

The training and validation sets are used for cross-validation (see below) during the training phase. In the above example of training an algorithm to recognize human faces, one would typically choose a set of N different faces representative of all genders, ethnicities and other variations in appearance (haircuts, glasses, moustaches, etc.). One would then split this set into a training set and a validation set, usually half/half or 1/3–2/3.

The testing set consists of a subset of the data which you would normally encounter once training is completed. In the above example, this would consist of faces recorded by the camera of the client to whom you will have sold the algorithm after training it in the laboratory.
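The three disjoint partitions described above can be sketched in Python as follows: shuffle the data once, then cut it at the chosen fractions (here half for training and a quarter each for validation and testing; the list of face identifiers merely stands in for actual face images):

```python
import random

def split_dataset(data, train_frac=0.5, valid_frac=0.25, seed=0):
    """Shuffle and partition data into disjoint training, validation
    and testing sets; whatever remains after the first two cuts
    becomes the testing set."""
    rng = random.Random(seed)     # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_valid = int(valid_frac * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

faces = [f"face_{i:03d}" for i in range(100)]  # placeholder identifiers
train, valid, test = split_dataset(faces)
print(len(train), len(valid), len(test))  # 50 25 25
```

Because every element lands in exactly one of the three lists, the partitions are disjoint by construction, as the text requires.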

© A.G.Billard 2004 – Last Update March 2011
