MACHINE LEARNING TECHNIQUES - LASA
The final classifier $C: X \to Y$ aggregates the decisions of the $K$ classifiers by majority vote:

$$C(x) = \arg\max_{y} \sum_{k=1}^{K} \mathbb{1}\left(C^k(x) = y\right) \qquad (3.26)$$
The hope is thus that the aggregation of all the classifiers will give better classification results than training a single classifier on the whole dataset.
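As a concrete illustration, here is a minimal sketch of bagging with the majority vote of Eq. (3.26). The choice of decision trees as base classifiers, the depth, and the number of classifiers K are illustrative assumptions, not prescribed by the text:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, K=25, seed=0):
    """Train K classifiers, each on a bootstrap sample of the training set."""
    rng = np.random.default_rng(seed)
    M = len(X)
    classifiers = []
    for _ in range(K):
        idx = rng.integers(0, M, size=M)            # draw M points with replacement
        clf = DecisionTreeClassifier(max_depth=3)   # illustrative base classifier
        clf.fit(X[idx], y[idx])
        classifiers.append(clf)
    return classifiers

def bagging_predict(classifiers, X):
    """Majority vote over the K classifiers, as in Eq. (3.26)."""
    votes = np.stack([clf.predict(X) for clf in classifiers])   # shape (K, n_points)
    # For each point, return the label that received the most votes
    # (assumes non-negative integer class labels, for np.bincount).
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

Each classifier sees a bootstrap sample of the same size M as the original set, so roughly a third of the points are left out of any given sample; this is what makes the individual classifiers diverse enough for the vote to help.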
If the classification method bases its estimate on the average of the data, then, for a very heterogeneous dataset with local structures, such an approach may fail to represent these local structures properly unless the classifier is given enough granularity to encapsulate the non-linearities³. A set of classifiers trained on smaller subsets of the data stands a better chance of capturing these local structures. However, training classifiers on only a small subset of the data at hand may also degrade performance, as they may fail to extract the generic features of the data (because they see too small a sample) and instead focus on noise or on particularities of each subset. This is a usual problem in machine learning, and in classification in particular: one must find a tradeoff between generalizing (hence having far fewer parameters than original datapoints) and representing all the local structures in the data.
3.2.3.2 Boosting / AdaBoost
While bagging creates a set of K classifiers in parallel, boosting creates the classifiers sequentially and uses each previously created classifier to boost the training of the next one.
Principle:
A weight is associated with each datapoint of the training set, as well as with each classifier. The weights associated with the datapoints are adapted at each iteration to reflect how well each datapoint is predicted by the global classifier: the less well classified the datapoint, the larger its associated weight. This way, poorly classified data are given more influence on the measure of the error and are more likely to be selected to train the new classifier created at each iteration. As a result, they should be better estimated by this new classifier.
Similarly, the classifiers are weighted when combined to form the final classifier, so as to reflect their classification power: the poorer the classification power of a given classifier on its associated training set, the less influence that classifier is given in the final classification.
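For reference, the standard discrete AdaBoost update implements exactly this principle. The notation below ($\varepsilon_k$ for the weighted error and $\alpha_k$ for the classifier weight) is the conventional one and is supplied here as an assumption, since this excerpt does not define it:

$$\varepsilon_k = \frac{\sum_{i=1}^{M} w^i \, \mathbb{1}\left(C^k(x^i) \neq y^i\right)}{\sum_{i=1}^{M} w^i}, \qquad \alpha_k = \frac{1}{2}\ln\frac{1-\varepsilon_k}{\varepsilon_k}$$

$$w^i \leftarrow w^i \, e^{-\alpha_k \, y^i C^k(x^i)}, \quad \text{followed by renormalization so that } \textstyle\sum_i w^i = 1.$$

A classifier barely better than chance ($\varepsilon_k$ close to $1/2$) receives a weight $\alpha_k$ close to $0$, while an accurate one receives a large weight, matching the principle above.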
Algorithm:
Let us consider a binary classification problem with training set $X = \{x^i, y^i\}_{i=1}^{M}$, where $x^i \in \mathbb{R}^N$ and $y^i \in \{+1, -1\}$. Let $w^i$, $i = 1, \dots, M$, be the weights associated with each datapoint. Usually these are uniformly distributed at the start ($w^i = 1/M$), and hence one builds a first set of $l$ classifiers $C^k$, $k = 1, \dots, l$, by drawing uniformly from the whole set of datapoints.
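A minimal sketch of the resulting training loop, assuming the standard discrete-AdaBoost updates given earlier. The use of decision stumps as weak classifiers and of scikit-learn's sample_weight support (in place of weighted resampling) are illustrative choices, not prescribed by the text:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, K=50):
    """y must take values in {+1, -1}; returns the classifiers C^k and weights alpha_k."""
    M = len(X)
    w = np.full(M, 1.0 / M)                         # uniform initial weights w^i = 1/M
    classifiers, alphas = [], []
    for _ in range(K):
        clf = DecisionTreeClassifier(max_depth=1)   # a decision stump as weak classifier
        clf.fit(X, y, sample_weight=w)              # train on the weighted dataset
        pred = clf.predict(X)
        eps = w[pred != y].sum()                    # weighted training error (w sums to 1)
        if eps <= 0.0 or eps >= 0.5:                # perfect, or no better than chance: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)     # classifier weight
        w *= np.exp(-alpha * y * pred)              # increase weights of misclassified points
        w /= w.sum()                                # renormalize
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas
```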
The final classifier is a linear combination of the classifiers, so that each datapoint x is classified according to the function:
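(The formula itself falls outside this excerpt; what follows is the standard AdaBoost decision rule, supplied here as an assumption.)

$$C(x) = \operatorname{sign}\left(\sum_{k} \alpha_k \, C^k(x)\right)$$

That is, the final decision is a weighted majority vote of the individual classifiers.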
³ For instance, in the case of the multi-layer perceptron, one can add several hidden neurons and achieve optimal performance. However, in this case one may argue that each neuron in the hidden layer is some sort of sub-classifier (especially when using the threshold function as the activation function for the output of the hidden neurons). Similarly, in the case of Support Vector Machines, one may increase the number of support vectors until reaching an optimal description of all local non-linearities; at the extreme, one may take all points as support vectors!