Semi-supervised Learning With
Support Vector Machines

Bachelor's Thesis
by Andre Guggenberger
Matriculation number 0327514
submitted to the
Technische Universität Wien
in September 2008
ABSTRACT

Support Vector Machines are a modern technique in the field of machine learning and have been used successfully in many fields of application. In general they are used for classification tasks: they learn from a randomly selected training set that has been labeled in advance and are then applied to unseen instances. To obtain a good classification result it is often necessary that this training set contains a large number of labeled instances, but for humans labeling data is a time-consuming and tedious task. Some algorithms address this problem by learning from both a small amount of labeled and a large amount of unlabeled instances. The learner has access to the pool of unlabeled instances and requests the labels for some specific instances from a user; it then uses all labeled data to learn the model. The choice of which unlabeled instances should be labeled next has a significant impact on the quality of the resulting model. This kind of learning is called semi-supervised learning or active learning. Several different approaches to semi-supervised learning currently exist. This work focuses on the best-known ones and gives an overview of them.
KURZFASSUNG

Support Vector Machines are a modern technique in the field of machine learning and have by now been used successfully in various fields of application. In general they are used for classification tasks, where they learn from a randomly chosen set of pre-labeled training data and are then applied to unknown data. To obtain a good classification result it is often necessary to train on a large amount of pre-labeled training data, but manually labeling this data is a time-consuming and tedious task for humans. To alleviate this, algorithms have been developed that build a model from only a few labeled and many unlabeled instances. The classifier has access to the pool of unlabeled data and asks a user for the class of certain specific instances; it then uses all labeled data to build the model. The choice of the as yet unlabeled instances that an expert should label has a significant impact on the quality of the resulting model. This kind of machine learning is called semi-supervised learning or active learning. Various approaches to semi-supervised learning currently exist. This work covers the best-known ones and provides an overview of the different approaches.
Contents

1 Introduction
2 Basic Definitions of Support Vector Machines
3 Semi-supervised Learning
  3.1 Random Subset
  3.2 Clustering
  3.3 Version Space Based Methods
    3.3.1 Theory of the Version Space
    3.3.2 Simple Method
    3.3.3 Batch-Simple Method
    3.3.4 Angle Diversity Strategy
    3.3.5 Multi-Class Problem
  3.4 Probability Based Method
    3.4.1 The Probability Model
    3.4.2 Least Certainty and Breaking Ties
  3.5 Other Approaches
    3.5.1 A Semidefinite Programming Approach
    3.5.2 S³VM
  3.6 Summary
4 Experiments
  4.1 Experiment Setting
    4.1.1 Evaluated Approaches
    4.1.2 ssSVM
    4.1.3 ssSVMToolbox
  4.2 Artificial Datasets
    4.2.1 Gaussian Distributed Data
    4.2.2 Two Spirals Dataset
    4.2.3 Chain Link Dataset
    4.2.4 Summary
  4.3 Datasets from UCI Machine Learning Repository
5 Conclusion
A Relevant Links
Chapter 1

Introduction

Support Vector Machines (SVMs) are a modern technique in the field of machine learning and have been used successfully in many fields of application. Most of the time they are used in a supervised learning context, where the learner has access to a large set of labeled data and builds a model using this information. After this learning step the learner is presented with new instances and tries to predict the correct labels. Besides supervised learning there is also unsupervised learning, where the learner cannot access the labels of the instances. In this case the learner tries to predict the labels by partitioning the data into so-called clusters.

Providing a huge set of labeled data (as in the supervised case) can be very time-consuming and therefore costly. Semi-supervised learning tries to reduce the amount of labeled data needed by analyzing the unlabeled data, so that only the relevant instances have to be labeled by a human expert. Of course the overall accuracy has to be on par with the supervised learning accuracy.

In this work I explain the most common approaches to semi-supervised learning with SVMs. I begin by introducing some basic definitions, i.e. the SVM hyperplane, the kernel function and the SVM maximization task (Chapter 2); a detailed discussion of the theory of Support Vector Machines is not provided. The main part of the work focuses on semi-supervised learning. I present a definition of semi-supervised learning in contrast to supervised and unsupervised learning, discuss the most common approaches for Support Vector Machines (Chapter 3), compare semi-supervised SVMs with supervised SVMs and present the results of my experiments with some of them. I show how they perform on different datasets, including some common machine learning datasets and one real-world dataset (Chapter 4).
Chapter 2

Basic Definitions of Support Vector Machines

Consider a typical classification problem. Some input vectors (feature vectors) and some labels are given. The objective is to predict the labels of new input vectors so that the classification error rate is minimal.

There are many algorithms to solve such problems. Some of them require that the input data is linearly separable (by a hyperplane), but for many applications this assumption is not appropriate. And even if the assumption holds, most of the time there are many possible solutions for the hyperplane (Figure 2.1). Because we are looking for a hyperplane where the classification error is minimal, this can be seen as an optimization problem. In 1965 Vapnik ([VC04], [Vap00]) introduced a mathematical approach to find a hyperplane with low generalization error. It is based on the theory of structural risk minimization, which states that the generalization error is influenced by the error on the training set and the complexity of the model. Based on this work Support Vector Machines were developed. They belong to the family of generalized linear classifiers and are so-called maximum margin classifiers: the resulting hyperplane maximizes the distance to the 'nearest' vectors of the different classes, under the assumption that a large margin is better for the generalization ability of the SVM. These 'nearest' vectors are called support vectors (SV), and SVMs consider only these vectors for the classification task; all other vectors can be ignored. Figure 2.2 illustrates a maximum margin classifier and its support vectors.

In the context of SVMs it is also important to consider kernel functions. They project the low-dimensional training data into a higher-dimensional feature space, because separating the training data is often easier in this higher-dimensional space. In particular, through this projection training data that could not be separated linearly in the low-dimensional input space may become linearly separable in the high-dimensional space.

To understand semi-supervised learning we have to consider some mathematical
Figure 2.1: Positive samples (green boxes) and negative samples (red circles). There are many possible solutions for the hyperplane (from [Mar03]).

Figure 2.2: Maximum margin. The middle line is the hyperplane; the vectors on the other lines are the support vectors (from [Mar03]).
background of SVMs. This is just a very short summary; besides the many very good resources on the internet, Vapnik, Cristianini and Shawe-Taylor provide comprehensive introductions to Support Vector Machines ([Vap00], [VC04], [CST00]).

At first we have to define the hyperplane, which separates the data and acts as the decision boundary:
    H(ω, b) = { x | ω^T ∗ x + b = 0 }    (2.1)

where ω is a weight vector, x is an input vector and b is the bias. Note that ω points orthogonal to H.
Because we are interested in maximizing the margin, we have to define the distance from a support vector to the hyperplane:

    (ω^T ∗ x + b) / ||ω|| = ±1 / ||ω||    (2.2)
From this definition the margin m follows straightforwardly (see Figure 2.2 for an illustration):

    m = 2 / ||ω||    (2.3)
The maximization task can be summarized as [TC01]:

    max_{ω∈F} min_i { y_i (ω ∗ φ(x_i)) }    (2.4)

    subject to ||ω|| = 1,
               y_i (ω ∗ φ(x_i)) ≥ 1, i = 1...n.
Note that this definition is only correct if the data is linearly separable. In the non-linearly separable case we have to introduce slack variables:

    max_{ω∈F} min_i { y_i (ω ∗ φ(x_i)) }    (2.5)

    subject to ξ_i ≥ 0,
               y_i (ω ∗ φ(x_i)) ≥ 1 − ξ_i, i = 1...n

where ξ_i are slack variables.
Because SVMs try to maximize the margin, we can restate the optimization task using the definition of the margin:

    min_{ω,ξ} (1/2) ||ω||² + C ∑_{i=1}^{n} ξ_i    (2.6)

    subject to ξ_i ≥ 0,
               y_i (ω ∗ φ(x_i)) ≥ 1 − ξ_i, i = 1...n

where C is the complexity parameter. It controls the complexity of the decision boundary: a large C penalizes errors, whereas a small C penalizes complexity [Mei02].
As said, Support Vector Machines usually use so-called kernels or kernel functions to project the data from a low-dimensional input space into a high-dimensional feature space. The kernel function K satisfies Mercer's condition, and we define K as:

    K(u, v) = φ(u) ∗ φ(v)    (2.7)

where φ : X → F is a feature map [Mei02], [MIJ04]. One example of a feature map:

    φ(x_1, x_2) = (x_1², √2 x_1 x_2, x_2²)    (2.8)
Using this feature map we can evaluate the kernel K(u, v) = φ(u) ∗ φ(v) by computing an inner product of the data vectors u and v instead of the feature vectors φ(u) and φ(v):

    K(u, v) = φ(u) ∗ φ(v)    (2.9)
            = u_1² v_1² + 2 u_1 u_2 v_1 v_2 + u_2² v_2²    (2.10)
            = (u_1 v_1 + u_2 v_2)²    (2.11)
            = (⟨u, v⟩)²    (2.12)

where ⟨u, v⟩ is the inner product of u and v.

In the context of SVMs we consider classifiers of the form:

    f(x) = ∑_{i=1}^{n} α_i K(x, x_i)    (2.13)

where α_i are the Lagrange multipliers.
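This kernel identity can be checked numerically. The following sketch (plain Python; the two toy vectors are chosen arbitrarily) evaluates the degree-2 polynomial kernel once through the explicit feature map of Equation 2.8 and once through the inner-product shortcut of Equation 2.12:

```python
import math

def phi(x):
    # Explicit feature map from Equation 2.8
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(u, v):
    # Kernel trick (Equation 2.12): squared inner product in input space
    return (u[0] * v[0] + u[1] * v[1]) ** 2

u, v = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(u), phi(v)))  # phi(u) . phi(v)
implicit = kernel(u, v)                                # (<u, v>)^2
print(abs(explicit - implicit) < 1e-9)  # → True
```

Both routes give 16.0 for these vectors, but the implicit route never constructs the (here three-, in general much higher-dimensional) feature vectors.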
Chapter 3

Semi-supervised Learning

The task of classification is also called supervised learning; in contrast, the task of clustering is called unsupervised learning. There the learner does not use labeled data; instead it tries to partition a dataset into clusters so that the data within a cluster share some common characteristics.

Semi-supervised learning is a combination of supervised and unsupervised learning, where typically a small amount of labeled and a large amount of unlabeled data are used for training. This is done for two reasons. First, labeling a huge set of instances can be a time-consuming task; it has to be done by a skilled human expert and can be quite costly. Semi-supervised learning reduces the required amount of labeled instances and the associated costs (note that, in contrast, the acquisition of unlabeled data is usually relatively inexpensive). Second, it has been shown that using unlabeled data for learning improves the accuracy of the produced learner [BD99]. G. Schohn and D. Cohn [SC00] report similar results: they state that an SVM trained on a well-chosen subset often performs better than one trained on all available instances.
Summing up, the advantages of semi-supervised learning are (in many cases) better accuracy, less data and less training time. To achieve this, the examples to be labeled have to be selected properly.

There are many different algorithms for semi-supervised learning with Support Vector Machines. Most of them involve querying some unlabeled instances and requesting their labels from a human expert; they differ in the way they select the next instances. This process of querying is called selective sampling. Sometimes semi-supervised learning is called active learning: as opposed to passive learning, where a classifier is trained on randomly selected labeled data, an active learner asks a user to label only 'important' instances. Because the classifier gets feedback (the labels) about the instances relevant for the classification from a user, this process is called relevance feedback.

Note that the approaches presented in sections 3.5.1 and 3.5.2 differ in this respect, because there no feedback is necessary.
3.1 Random Subset

Obviously, if we use a random process to select the unlabeled instances, this cannot be considered real semi-supervised learning. To get an appropriate accuracy the sampling strategy is as important as it is in the supervised case; supervised learning and random-subset semi-supervised learning are very similar and share most of their characteristics.

Some researchers have experimented with this strategy and found that its accuracy cannot keep up with real semi-supervised strategies, but they used it as a baseline for comparison with the other semi-supervised learning approaches [FM01], [LKG+05].
3.2 Clustering

One approach is to use a clustering algorithm (unsupervised learning) on the unlabeled data. Then we can, for example, choose the cluster centers (centroids) as the instances to be labeled by an expert. G. Fung and O. Mangasarian have used k-median clustering and report a good classification accuracy in comparison with supervised learning, but with fewer labeled instances [FM01]. It is worth keeping in mind that one has to define the correct number of clusters in advance; correct means that the clusters should be good representatives of the available classes. G. Fung and O. Mangasarian do not really address this, but as for other clustering algorithms the choice of the number of clusters can be assumed to be critical. An obvious solution is to set the number of clusters equal to the number of classes. Additionally, G. Fung and O. Mangasarian extend the clustering by an approach similar to that described in section 3.5.
A general algorithm could be described this way:

1. Use the labeled data to build a model
2. Using the unlabeled data, calculate n clusters
3. Query some instances for labeling by a human expert. Which instances are chosen depends on the algorithm. Some examples:
   (a) Query the centroids
   (b) Query instances on the cluster boundaries
   (c) A combination of the above approaches
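Variant 3a above can be sketched in a few lines. The snippet below (plain Python with a deliberately tiny k-means implementation; the function names and the toy data are my own, not taken from [FM01], who use k-median) clusters the unlabeled pool and returns the point nearest each centroid as the query set:

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two points
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(cluster):
    # Component-wise mean of a non-empty list of points
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def kmeans(points, k, iters=20, seed=0):
    # Minimal k-means: returns the k cluster centroids
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[idx].append(p)
        centroids = [mean(cl) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids

def query_instances(unlabeled, k):
    # Step 3a: for each centroid, query the unlabeled point nearest to it
    centroids = kmeans(unlabeled, k)
    return [min(unlabeled, key=lambda p: dist2(p, c)) for c in centroids]

# Two well-separated blobs; k equals the assumed number of classes.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
to_label = query_instances(data, k=2)
```

The points in `to_label` would then be handed to the human expert, and the labeled result used to train the SVM.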
Cebron and Berthold introduced a more advanced clustering technique: they proposed a prototype-based learning approach that uses a density estimation technique and a probability model to select the prototypes whose labels are obtained from an expert [CB07].
3.3 Version Space Based Methods

Random subset (section 3.1) and clustering (section 3.2) are simple but effective methods for semi-supervised learning; depending on the given classification task the results can be quite good. Note that both can be used with other classifiers and are not limited to Support Vector Machines. Version space based methods are a more advanced technique that exploits specific properties of Support Vector Machines for semi-supervised learning. But as we will see, these approaches suffer from some critical limitations.

The following approaches can be analyzed by their influence on the version space. Therefore it is worth considering the theory of version spaces first.
3.3.1 Theory of the Version Space

The version space was introduced by Tom Mitchell [Mit97]. It is the space containing all consistent hypotheses from the hypothesis space, where the hypothesis space contains all possible hypotheses.

In the context of SVMs the hypotheses are the hyperplanes, and the version space contains all hyperplanes consistent with the current training data [TC01]. More formally, the hypothesis space (all possible hypotheses) is defined as:

    H = { f | f(x) = (φ(x) ∗ ω) / ||ω||, where ω ∈ W }    (3.1)

where the parameter space W is equal to the feature space F and f is a hypothesis. As explained in Chapter 2, (φ(x) ∗ ω) / ||ω|| is the definition of the (normalized) hyperplanes (Definition 2.1), so this space contains all possible hyperplanes. Using this definition we can define the version space:
    V = { f ∈ H | ∀ i ∈ {1...n}: y_i f(x_i) > 0 }    (3.2)

where y_i is the class label. This definition eliminates all hypotheses (hyperplanes) not consistent with the given training data (Definition 2.4).
Because there is a bijection between W (containing the unit vectors) and H (containing the hyperplanes) we can redefine V [TC01]:

    V = { ω ∈ W | ||ω|| = 1, y_i (ω ∗ φ(x_i)) > 0, i = 1...n }    (3.3)
There is a restriction to this definition: the training data has to be linearly separable in the feature space. But because it is possible to make any data linearly separable by modifying the kernel, we can ignore this issue [STC99]. Furthermore, because we often work in a high-dimensional feature space, in many cases the data will be linearly separable anyway.
For our analysis it is important to note the duality between the feature space F and the parameter space W [TC01]. The unit vectors ω correspond to the decision boundaries f in F. This follows intuitively from the above definitions, but the correspondence also holds in the converse direction. Let's have a closer look at this. If one observes a new training instance x_i in the feature space, this instance reduces the set of all allowable hyperplanes to those that classify x_i correctly. We can write this down more formally: every hyperplane must satisfy y_i (ω ∗ φ(x_i)) > 0, where y_i is the label of the instance x_i. As said before, ω is the normal vector of the hyperplane in F. But we can think of y_i φ(x_i) as being the normal vector of a hyperplane in W; it follows that ω ∗ (y_i φ(x_i)) = 0 defines a hyperplane in W. Recall that we have defined the version space V in W; therefore this hyperplane is a boundary of the version space. It can be shown that the hyperplanes in W delimit the version space, and from the definition of the SVM maximization task it follows that the SVM maximizes the minimum distance to any of these hyperplanes in W. SVMs find the center of the largest hypersphere in the version space, whose radius is the maximum margin; it can be shown that the hyperplanes touched by the hypersphere correspond to the support vectors and that the ω_i often lie in the center of the version space [TC01].
3.3.2 Simple Method

Linear SVMs perform best when applied in high-dimensional domains (such as text classification). There the number of features is much larger than the number of examples, and therefore the training data cannot cover all dimensions, meaning that the subspace spanned by the training examples is much smaller than the space containing all dimensions. Considering this observation, G. Schohn and D. Cohn propose as a simple method to select for labeling those examples that are orthogonal to the space spanned by the current training data [SC00]. Doing this gives the learner information about dimensions not yet covered. Alternatively, one can choose those instances which are near the dividing hyperplane, to improve the confidence in the currently known dimensions. This is an attempt to narrow the existing margin; to maximally narrow the margin one would select those instances lying on the hyperplane. The interesting result of G. Schohn and D. Cohn is that training on a small subset of the data leads in most cases to a better performance than training on all available data.

What remains is the computation of the proximity of a training instance to the hyperplane. This is inexpensive, because one can compute the hyperplane and evaluate each instance using a single dot product. The distance between a feature vector φ(x) and the hyperplane ω is:
    |φ(x) ∗ ω|    (3.4)
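In the linear case this selection rule is only a few lines of code. A minimal sketch (plain Python; the hyperplane (w, b) is assumed to come from a previously trained SVM, and the toy numbers are made up for illustration):

```python
def decision_value(w, b, x):
    # f(x) = w . x + b: proportional to the signed distance of x
    # from the hyperplane w . x + b = 0
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def simple_margin_query(w, b, unlabeled):
    # Simple method: query the unlabeled instance with the smallest |f(x)|,
    # i.e. the one closest to the current decision boundary (Equation 3.4)
    return min(unlabeled, key=lambda x: abs(decision_value(w, b, x)))

w, b = (1.0, -1.0), 0.0                      # toy hyperplane x1 = x2
pool = [(3.0, 0.0), (1.0, 0.9), (-2.0, 2.5)]
print(simple_margin_query(w, b, pool))       # → (1.0, 0.9)
```

After the expert labels the chosen instance, the SVM is retrained and the query is repeated.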
Let's have a look at how this simple method influences the version space. Given an unlabeled instance x_i, we can test how close the corresponding hyperplane in
Figure 3.1: The gray line is the old hyperplane, the green lines are the old margins, 'o' is a new example, and the black line is the new hyperplane after the new instance was labeled as '-' (from [SC00]).
W comes to the center of the hypersphere (the ω_i). If we choose the instance x_i closest to the center, we reduce the version space as much as possible (and this of course reduces the number of consistent hypotheses). This distance can easily be computed using the above formula. By choosing the instance x_i that comes closest to the hyperplane in F, we maximally reduce the margin and the version space. Figure 3.1 shows the effect of an instance on the hyperplane graphically: the bottom figure shows that placing an instance close to the center of the old hyperplane changes the margin (calculated using the new hyperplane) significantly, whereas placing an instance on the old hyperplane but far out has little impact on the margin, as we can see in the top figure.

A more sophisticated description of this can be found in [TK02]. There, three different approaches are presented, each trying to reduce the version space as much as possible. Note that these definitions rely on the assumption that the given problem is binary (two classes).
1. Simple Margin: This is the method already described: choose the next instance closest to the hyperplane.

2. MaxMin Margin: Let the instance x be a candidate for being labeled by a human expert. First x is labeled as -1, assigning it to class -1. Then
the margin m⁻ of the resulting SVM is calculated. After this, x is labeled as +1, assigning it to class +1, and again the margin m⁺ is computed. This procedure is repeated for all instances, and the instance with the largest min(m⁻, m⁺) is chosen.

3. Ratio Margin: This is similar to the MaxMin Margin method, but uses the relative sizes of m⁻ and m⁺: choose the instance with the largest min(m⁻/m⁺, m⁺/m⁻).
All three methods perform well; the Simple Margin method is computationally the fastest, but it has to be used carefully because it can be unstable under some circumstances [HGC01], [TK02]. MaxMin Margin and Ratio Margin try to overcome these instability problems. The results of the experiments of S. Tong and D. Koller show that all three methods outperform random sampling [TK02].
3.3.3 Batch-Simple Method

One possible problem with the above methods is that every instance has to be labeled separately: after each instance the user has to determine the label, a new hyperplane is calculated, and the next instance is queried. Often this approach is not practicable and some kind of batch mechanism is necessary. Different approaches to batch sampling for version space based algorithms exist [Cha05]. One of them is the batch-simple sampling algorithm, where the h unlabeled instances closest to the hyperplane are chosen and have to be labeled by a user. This can be seen as a rather naive extension of the above methods (of course naive doesn't mean bad). The batch-simple method has been used to classify images [TC01], and the researchers in this paper report good results. The algorithm can be expressed as follows:

1. Initial model building: build a model using the labeled data
2. Feedback round: query the n instances closest to the hyperplane and ask the user to label them

The feedback round can be repeated m times. Because this algorithm can be unstable during the first feedback round [TC01], Tong and Chang suggest an initial feedback round with random sampling:

1. Initial model building: build a model using the labeled data
2. First feedback round: choose n instances randomly for labeling
3. Advanced feedback round: query the n instances closest to the hyperplane and ask the user to label them
Now the advanced feedback round can be repeated m times. But how does one choose 'good' values for n and m? Simon Tong and Edward Chang do not explain a way to determine these values [TC01], but it is clear that n has to be set in advance; they used a query size of 20. m can be determined by some kind of cross-validation. It is also obvious that by decreasing the query size n one has to increase the number of rounds m and vice versa; otherwise the accuracy of the classifier would decrease. Besides these technical reasons, the choice of the values depends on the user whose task it is to label the instances: to take advantage of active learning, this user should not have to label a huge set of examples. As a starting point one can use the values from [TC01]: query size = 20, number of rounds = 5.
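The core of a feedback round reduces to a sort over the distances to the hyperplane. A sketch, assuming a linear SVM whose hyperplane (w, b) was fitted on the currently labeled data (the retraining and the user's labeling step are deliberately left out):

```python
def batch_query(w, b, unlabeled, n):
    # Batch-simple sampling: return the n unlabeled instances closest
    # to the hyperplane w . x + b = 0 (|f(x)| is proportional to the distance)
    dist = lambda x: abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return sorted(unlabeled, key=dist)[:n]

w, b = (1.0, -1.0), 0.0   # assumed hyperplane from the current model
pool = [(4.0, 0.0), (0.5, 0.4), (1.0, 1.2), (-3.0, 0.0)]
batch = batch_query(w, b, pool, n=2)
```

In a full loop one would ask the user to label `batch`, move those points into the labeled set, retrain, and repeat for m rounds.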
3.3.4 Angle Diversity Strategy

One problem with the batch-simple method is that by sampling a batch of instances their diversity is not guaranteed. One can expect that diverse instances reduce the version space more efficiently, so considering the diversity can have a significant impact on the performance of the classifier. A measure of diversity is the angle between the samples. The angle diversity strategy proposed in [Cha05] balances the closeness to the hyperplane against the diversity of the instances.

More formally, the angle between two instances x_i and x_j (respectively their corresponding hyperplanes h_i and h_j) is:

    |cos(∠(h_i, h_j))| = |φ(x_i) ∗ φ(x_j)| / (||φ(x_i)|| ||φ(x_j)||) = |K(x_i, x_j)| / √(K(x_i, x_i) K(x_j, x_j))    (3.5)

where φ(x_i) is the normal vector corresponding to x_i and K is the kernel function, which satisfies Mercer's condition [Bur98].
From these theoretical considerations the algorithm follows straightforwardly:

1. Train a hyperplane h_i on the given labeled set
2. For each unlabeled instance x_j, calculate its distance to the hyperplane h_i
3. Calculate the maximal angle from x_j to any instance x_i in the current labeled set

What is left is to combine the distance to the hyperplane with the diversity of the samples. To do this we introduce another parameter α [Cha05], which balances the distance to the hyperplane against the diversity among the instances. The final decision rule chooses the unlabeled instance x_i minimizing:

    α ∗ |f(x_i)| + (1 − α) ∗ max_{x_j} ( |K(x_i, x_j)| / √(K(x_i, x_i) K(x_j, x_j)) )    (3.6)
As we can see, α acts as a trade-off factor between proximity and diversity. This parameter has to be set in advance, and [Cha05] suggests setting it to 0.5. They also present a more sophisticated solution for determining this parameter, and it is clearly possible to use cross-validation to get the best value for α.

Version space based methods have been tested in different fields of application [Cha05], [MPE06]. Whereas the former concentrated on image datasets and the latter tested these strategies on music datasets, both come to the conclusion that the angle diversity strategy works best. Furthermore, Tong concludes that active learning outperforms passive learning [Cha05].
3.3.5 Multi-Class Problem<br />
So far we have just considered and analyzed the two-class case. But to be useful in<br />
general a semi-supervised learning approach should be easily used in a multi-class<br />
environment.<br />
There exist different strategies for solving a multi-class problem with N classes
for supervised learning with SVMs. In the case of the one-versus-one approach
N(N−1)/2 SVMs are trained and a majority vote is used to determine the class of the
given instance. In contrast the one-versus-all method uses N SVMs and assigns the
label of the class whose SVM yields the largest margin. An overview of different
multi-class approaches for SVMs can be found in [Pal08]. The one-versus-all
method was introduced by Vapnik [Vap00]. Hsu and Lin have compared different
multi-class approaches for SVMs [HL02]. Platt has described another multi-class
SVM approach: the decision directed acyclic graph [PCT00].
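The two decomposition schemes can be sketched as follows. The class name and the representation of the binary results (precomputed decision values for one-versus-all, a table of pairwise winners for one-versus-one) are simplifying assumptions for illustration:

```java
// Sketch of the two standard multi-class decompositions for binary SVMs.
public class MultiClassSVM {

    // one-versus-all: assign the class whose SVM yields the largest
    // decision value (margin) for the instance
    static int oneVersusAll(double[] decisionValues) {
        int best = 0;
        for (int c = 1; c < decisionValues.length; c++)
            if (decisionValues[c] > decisionValues[best]) best = c;
        return best;
    }

    // one-versus-one: pairwiseWinner[p][q] (p < q) holds the winning class
    // (p or q) of the binary SVM trained on classes p and q; a majority
    // vote over all N(N-1)/2 classifiers decides
    static int oneVersusOne(int numClasses, int[][] pairwiseWinner) {
        int[] votes = new int[numClasses];
        for (int p = 0; p < numClasses; p++)
            for (int q = p + 1; q < numClasses; q++)
                votes[pairwiseWinner[p][q]]++;
        int best = 0;
        for (int c = 1; c < numClasses; c++)
            if (votes[c] > votes[best]) best = c;
        return best;
    }
}
```

Note the cost trade-off: for N classes the one-versus-one table holds N(N−1)/2 trained classifiers, the one-versus-all variant only N.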
From the above discussions it does not become clear how to use these version space
based methods for multi-class problems. Consider the simple method and the one-versus-all
approach. In the case of a multi-class problem we have N decision boundaries,
so which of the margins do we want to narrow? There a single instance has
N distances (to the N hyperplanes) and narrowing one margin does not automatically
narrow all margins. Until now little work has been done on solving multi-class
semi-supervised problems. Mitra, Shankar, and Pal have applied the simple method
to multi-class problems [MSP04]. They used a ’naive’ approach where they labeled
N samples at a time. As said, this approach lacks an analysis of which example is best
for all hyperplanes, because the influence of an example can be very large for one
hyperplane but useless for the others. The angle diversity strategy
suffers from the same problem; additionally it is not clear which angle should be
considered.
The following section 3.4 describes probability based methods which overcome
these problems and are more suitable for multi-class problems.
3.4 Probability Based Method<br />
As we have seen, the version space based methods do not adequately consider multi-class
problems. An approach which can handle multi-class problems easily is the probability
based method [LKG+05]. There a probability model for multiple SVMs is created.
The result of each SVM is interpreted as a probability and can be seen as a
measurement of certainty that a given instance belongs to the class. In the case
of semi-supervised learning using this approach is straightforward and using the
probabilities we have many possibilities to query unlabeled instances for labeling.
A simple method would be to train a model on the given labeled dataset. Then
this model is applied on the unlabeled data and each of these unlabeled instances is
assigned probabilities that it belongs to a given class. Now we can query
the least certain instances or the most certain instances. It is also possible to query
the instances with the smallest difference in probability between their most likely
and second most likely class. Using these probabilities there exist many different
approaches and it is also possible to mix some of them [LKG+05].
3.4.1 The Probability Model<br />
To get probabilities we have to extend the default Support Vector Machines. For
a given instance the result of a default SVM is a distance where e.g. 0 means
that the instance lies on the hyperplane and 1 that the instance is a support vector.
To assign a probability value to a class the sigmoid function can be used. Then
the parametric model has the following form [LKG+05]:

P(y = 1|f) = 1 / (1 + exp(Af + B)),   (3.7)
where A and B are scalar values, which have to be estimated, and f is the decision
function of the SVM. Based on this parametric model there are some approaches
for calculating the probabilities. As we can see, when we use this model we have to
calculate the SVM parameters (complexity parameter C, kernel parameter k) and
the parameters A and B, where A and B have to be calculated for
each binary SVM. We can use cross validation for this calculation but it is clear
that this can be computationally expensive.
A pragmatic approximation method could assume that all binary SVMs have
the same A, eliminate B by assigning 0.5 to instances lying on the decision boundary
and try to compute the SVM parameters and A simultaneously [LKG+05].
The decision function can be normalized by its margin to include the margin in the
calculation of the probabilities. More formally:
P_pq(y = 1|f) = 1 / (1 + exp(Af / ||ω||)),   (3.8)
where we currently look at class p and P_pq is the probability of class p versus class
q. We assume that the P_pq, q = 1, 2, ..., are independent. The final probability for class
p is:

P(p) = ∏_{q≠p} P_pq(y = 1|f)   (3.9)
It has been reported that this approximation is very fast and delivers good
accuracy results. Using this probability model there exist different approaches for
semi-supervised learning. The next section outlines some of them.
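A minimal sketch of the probability model (3.7)–(3.9), under two assumptions: the shared parameter A has already been estimated, and the margin normalization of (3.8) is folded into the decision values f_pq. The class and method names are hypothetical, not the ssSVM API:

```java
// Sketch of the pairwise sigmoid probability model for multi-class SVMs.
public class PairwiseProbabilities {

    // sigmoid model (3.7)/(3.8): maps the decision value fpq of the binary
    // SVM "p versus q" to a probability; A is typically negative so that a
    // large positive margin maps to a probability near 1
    static double sigmoid(double fpq, double a) {
        return 1.0 / (1.0 + Math.exp(a * fpq));
    }

    // final probability of class p as in (3.9): product of the pairwise
    // probabilities against all other classes q (assumed independent);
    // f[p][q] holds the decision value of the SVM "p versus q"
    static double classProbability(int p, double[][] f, double a) {
        double prob = 1.0;
        for (int q = 0; q < f.length; q++)
            if (q != p) prob *= sigmoid(f[p][q], a);
        return prob;
    }
}
```

An instance lying exactly on a decision boundary (f_pq = 0) gets the pairwise probability 0.5, as required by the approximation above; the final classification assigns the class p with the largest P(p).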
3.4.2 Least Certainty and Breaking Ties<br />
The algorithms for both strategies are very similar:
1. Build a multi-class model from the labeled training data
2. Compute the probabilities
3. Least Certainty: Query the instances with the smallest classification confidence
for labeling by a human expert. Add them to the training set.
4. Breaking Ties: Query the instances with the smallest difference in probabilities
between the two highest probability classes and obtain the correct labels from a
human expert. Add them to the training set.
5. Goto 1<br />
Suppose a is the class with the highest probability, b is the class with the second
highest probability and P(a) and P(b) are the probabilities of these classes. Then
least certainty tries to improve P(a) and breaking ties tries to improve P(a) − P(b).
Intuitively, both methods improve the confidence of the classification. The number
of instances which should be queried has to be set by the SVM designer.
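Steps 3 and 4 above reduce to two different rankings of the same class probability matrix. A minimal sketch, where the class name and the layout of the probs matrix (one row per unlabeled instance, one column per class) are assumptions for illustration:

```java
import java.util.Arrays;

// Sketch of the two probability based query strategies.
public class QueryStrategies {

    // least certainty: the instance whose highest class probability P(a)
    // is smallest
    static int leastCertain(double[][] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++)
            if (max(probs[i]) < max(probs[best])) best = i;
        return best;
    }

    // breaking ties: the instance with the smallest difference P(a) - P(b)
    // between its two most likely classes
    static int breakingTies(double[][] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++)
            if (margin(probs[i]) < margin(probs[best])) best = i;
        return best;
    }

    private static double max(double[] p) {
        double m = p[0];
        for (double v : p) m = Math.max(m, v);
        return m;
    }

    // difference between the largest and the second largest probability
    private static double margin(double[] p) {
        double[] s = p.clone();
        Arrays.sort(s); // ascending
        return s[s.length - 1] - s[s.length - 2];
    }
}
```

The two criteria can disagree: an instance may have a low top probability but a clear gap to the runner-up, in which case least certainty queries it while breaking ties prefers an instance with two nearly tied classes.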
These approaches were tested on gray-scale image datasets [LKG+05]. The authors
report a good accuracy and a reduced number of labeled images required to reach it.
The breaking ties approach outperforms least certainty and using batch sampling
was also effective.
3.5 Other approaches<br />
3.5.1 A Semidefinite Programming Approach
Semidefinite programming is an extension of linear and quadratic programming. A
semidefinite programming problem is a convex constrained optimization problem.
With semidefinite programming one tries to optimize a symmetric n × n matrix of
variables X [XS05]. Semidefinite programming can be used to apply Support Vector
Machines in an unsupervised and semi-supervised context. For clustering the goal
is not to find a large margin classifier using the labeled data (as with supervised
learning) but instead to find a labeling that results in a large margin classifier.
In principle every possible labeling has to be evaluated and the labeling with the
maximum margin has to be chosen. Obviously this is computationally very expensive,
but Xu and Schuurmans have shown that it can be approximated using
semidefinite programming. This unsupervised approach can be easily extended to
semi-supervised learning where a small labeled training set has to be considered.
Note that this approach also works for multi-class problems [XS05]. There is one important
difference between this approach and the approaches discussed above:
here the algorithm uses the unlabeled data directly, that is, no human expert is
asked to label them. In this case the semi-supervised learning is a combination of
supervised learning using the given labeled training set and unsupervised learning
using the unlabeled data.
3.5.2 S³VM
This approach was introduced by Bennett and Demiriz [BD99]. Similar to the above
approach no human gets asked to label instances. Instead the unlabeled data gets
incorporated into the formulation of the optimization problem. S³VM reformulates
the original definition by adding two constraints for the instances of the unlabeled
dataset. Considering a binary SVM, one constraint calculates the misclassification
error as if the instance were in class 1 and the second constraint as if the instance
were in class −1. S³VM tries to minimize these two possible misclassification errors.
The labeling with the smallest error is the final labeling. Moreover Bennett and
Demiriz introduce some optimization techniques for this. An analysis of how this
approach performs in a multi-class environment is not presented.
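The two error variables can be written down explicitly. The following is a sketch along the lines of Bennett and Demiriz, with ℓ labeled and k unlabeled instances, slacks η for the labeled part and ξ, z for the two possible labelings of each unlabeled instance; the notation is simplified and may deviate from [BD99] in details:

```latex
\min_{w,\,b,\,\eta,\,\xi,\,z}\quad
  C\Big[\sum_{i=1}^{\ell}\eta_i
      + \sum_{j=\ell+1}^{\ell+k}\min(\xi_j,\,z_j)\Big] + \|w\|
```

```latex
\text{s.t.}\quad
  y_i\,(w \cdot x_i - b) + \eta_i \ge 1,\quad \eta_i \ge 0
     \qquad\text{(labeled instances)}\\
\phantom{\text{s.t.}}\quad
  \;\;\,(w \cdot x_j - b) + \xi_j \ge 1,\quad \xi_j \ge 0
     \qquad\text{(unlabeled, error if in class } 1\text{)}\\
\phantom{\text{s.t.}}\quad
  -(w \cdot x_j - b) + z_j \ge 1,\quad z_j \ge 0
     \qquad\text{(unlabeled, error if in class } -1\text{)}
```

Taking the minimum of ξ_j and z_j in the objective realizes exactly the idea described above: each unlabeled instance is charged only with the misclassification error of its more favorable labeling.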
3.6 Summary<br />
Semi-supervised learning is a promising approach to reduce the amount of labeled
instances needed for training SVMs by asking a human expert to label relevant
instances from an unlabeled pool of instances. As outlined there are many different
approaches available. We can use clustering, which can also be used as a semi-supervised
learning approach with other machine learning algorithms. In contrast
the version space based methods presented here focus on SVMs and promise good
accuracy results but are primarily usable for binary classification tasks. Extending
these approaches to multi-class problems is an ongoing research topic. Simple but
effective approaches are the probability based methods which can be easily used in
a multi-class context and are therefore very convenient. S³VM and the semidefinite
programming approach are also semi-supervised learning approaches but here no
human gets asked to label relevant instances. Whereas the former incorporates
unlabeled instances into the formulation of the optimization problem, the latter
tries to find the labeling with the largest margin.
Chapter 4<br />
Experiments<br />
4.1 Experiment Setting<br />
To experiment with the different approaches presented in this work I have implemented
two applications. ssSVM is a semi-supervised SVM implementation and supports
different semi-supervised learning approaches like Least Certainty and Breaking
Ties. ssSVM uses RapidMiner, an open-source data mining platform, which provides
a comprehensive API for machine learning tasks like classification,
clustering and of course different SVM implementations. ssSVM is also
based on Spring, mainly an inversion-of-control container, and is therefore highly
configurable and extensible. Furthermore it wraps the WordVector Tool for creating
word vectors from texts. The second implemented application is the GUI for
ssSVM. It is called ssSVMToolbox and is based on Eclipse RCP. Sections 4.1.2
and 4.1.3 as well as the links in appendix A provide detailed information.
4.1.1 Evaluated Approaches<br />
I compared the following approaches and evaluated their performance on the different
data sets:
1. Least Certainty (LC)
2. Breaking Ties (BT)
3. Most Certainty (MC)
4. Simple Margin (SM)
5. Random Sampling (RS)
I separated every dataset into three subsets:
1. training set for supervised learning (in this work also called the reduced set)
2. training set for semi-supervised learning (used to query instances for the
feedback)
3. test set to evaluate the performance
Using the reduced set and the training set for semi-supervised learning (merged also
called the whole set) I trained a common SVM to get an upper bound and used
the reduced set alone to get the lower bound. So the accuracies of the different
approaches should lie between these bounds. Furthermore I used a random sampling
strategy (RS) to show that the different approaches are better than an approach
which randomly chooses instances for feedback.
I compared two different modes:
1. incrementally increased training size: the feedback size is set to 1 and the
training size is incrementally increased
2. batch mode: the feedback size is set to a certain value (e.g. 50); in
some iterations the feedback size is increased and the results with these different
feedback sizes are compared
4.1.2 ssSVM<br />
ssSVM (semi-supervised Support Vector Machine) is a Java application capable
of performing semi-supervised learning tasks with Support Vector Machines. It is
based on RapidMiner, an open-source data mining tool, and on Spring, an IOC
container. See Relevant Links for more information (appendix A).
The core of the application is the application context sssvmContext.xml. Like
RapidMiner, ssSVM supports different operators; this file configures which operators
ssSVM actually supports (which input sources, which SVM implementations, which
validators, ...).
com.rapidminer.operator.tokenizer.SimpleTokenizer
com.rapidminer.operator.tokenizer.NGramTokenizer
com.rapidminer.operator.tokenizer.TermNGramGenerator
com.rapidminer.operator.reducer.GermanStemmer
com.rapidminer.operator.reducer.LovinsStemmer
com.rapidminer.operator.reducer.PorterStemmer
com.rapidminer.operator.reducer.SnowballStemmer
com.rapidminer.operator.reducer.ToLowerCaseConverter
com.rapidminer.operator.wordfilter.EnglishStopwordFilter
com.rapidminer.operator.wordfilter.GermanStopwordFilter
com.rapidminer.operator.wordfilter.StopwordFilterFile
com.rapidminer.operator.wordfilter.TokenLengthFilter

<property name="tokenProcessors">

<property name="supportedReader">
com.rapidminer.operator.io.CSVExampleSource
com.rapidminer.operator.io.SparseFormatExampleSource
com.rapidminer.operator.io.ArffExampleSource

<property name="params">

<property name="supportedSVMLearer">
com.rapidminer.operator.learner.functions.kernel.LibSVMLearner
com.rapidminer.operator.learner.functions.kernel.JMySVMLearner

<property name="supportedValidator">
com.rapidminer.operator.validation.XValidation
com.rapidminer.operator.validation.FixedSplitValidationChain

<property name="supportedPerfEvaluator">
com.rapidminer.operator.performance.SimplePerformanceEvaluator
com.rapidminer.operator.performance.PolynominalClassificationPerformanceEvaluator

<property name="supportedClusterer">
com.rapidminer.operator.learner.clustering.clusterer.KMeans
com.rapidminer.operator.learner.clustering.clusterer.SVClusteringOperator
com.rapidminer.operator.learner.clustering.clusterer.KernelKMeans
To use ssSVM for a concrete experiment another configuration file is necessary.
There the runtime properties for the experiment have to be provided. Instead of
describing this file I provide an example. For a detailed description of which parameters
and parameter values are supported, see the RapidMiner documentation
(appendix A).
false

<property name="preprocessing">
<property name="parameter">
<property name="additionalProps">
./datasets/breast_cancer_wisconsin/wdbc_as_labeled.data

<property name="preprocessing">
<property name="parameter">
<property name="additionalProps">
./datasets/breast_cancer_wisconsin/wdbc_testset.data

<property name="preprocessing">
<property name="parameter">
<property name="additionalProps">
./datasets/breast_cancer_wisconsin/wdbc_as_unlabeled.data

<property name="numberOfInstancesForFeedback" value="0" />

<property name="comparator">
<property name="numberOfInstancesForFeedback" value="10" />
<property name="comparator">
<property name="numberOfInstancesForFeedback" value="10" />
<property name="comparator">
<property name="numberOfInstancesForFeedback" value="10" />
<property name="numberOfInstancesForFeedback" value="10" />
<property name="params">

<property name="sssvmLearner">
<property name="seed" value="123456789"/>
<property name="numberOfInstancesForFeedback" value="10" />
<property name="props">
com.rapidminer.operator.learner.functions.kernel.jmysvm.kernel.KernelRadial
0.8

<property name="params">
<property name="clusterParams">
<property name="svmLearner" value="libSVM" />
<property name="validator" value="xval" />
<property name="perfEvaluator" value="simple" />
<property name="clusterModelHandler">

<property name="clusterers">
kmeans
kernelKmeans

<property name="samplingStrategies">
The following code performs this experiment:
final RuntimeHandler r = new RuntimeHandler("wdbc.xml");
final SSSVMLearner learner = r.getSSSVMLearner();
// one feedback round
final ExampleSet feedbackSet = learner.queryInstances(r.getLabeledExampleSet(), r.getUnlabeledExampleSet());
final ExampleSet all = ExampleSetUtils.merge(r.getLabeledExampleSet(), feedbackSet);
// use a SVM implementation for training
final IOObject[] resultSSSVM = r.getSVMLearner().learn(r.getRuntimeConfig().getSvmLearner(),
        r.getRuntimeConfig().getSvmPerfEvaluator(), all);
// get performance of self test
final PerformanceVector pvXval = ((PerformanceVector) resultSSSVM[1]);
// use model on a separate test set
final PerformanceVector pvTest = r.getSVMLearner().test((Model) resultSSSVM[0],
        r.getRuntimeConfig().getSvmPerfEvaluator(), r.getTestExampleSet());
The incrementally increased training size mode can be executed by this code:<br />
protected List<Performance> performSSSVMStepwise(final String experiment, final SamplingStrategy samplingStrategy)
        throws Exception {
    final RuntimeHandler r = new RuntimeHandler(experiment);
    // learn sssvm
    final SSSVMLearner learner = r.getSSSVMLearner();

    ExampleSet all = (ExampleSet) r.getLabeledExampleSet().clone();
    final ExampleSet unlabeledSet = (ExampleSet) r.getUnlabeledExampleSet().clone();
    final List<Performance> results = new LinkedList<Performance>();
    final int feedbackSize = 10;
    learner.getSamplingStrategies().clear();
    learner.addSamplingStrategy(samplingStrategy);
    samplingStrategy.setNumberOfInstancesForFeedback(feedbackSize);
    for (int i = 0; i < unlabeledSet.size() / 10; i++) {
        final ExampleSet feedbackSet = learner.queryInstances(all,
                ExampleSetUtils.intersect(unlabeledSet, all));
        all = ExampleSetUtils.merge(all, feedbackSet);
        final IOObject[] resultSSSVM = r.getSVMLearner().learn(r.getRuntimeConfig().getSvmLearner(),
                r.getRuntimeConfig().getSvmPerfEvaluator(), all);
        final PerformanceVector pvXval = ((PerformanceVector) resultSSSVM[1]);
        final PerformanceVector pvTest = r.getSVMLearner().test((Model) resultSSSVM[0],
                r.getRuntimeConfig().getSvmPerfEvaluator(), r.getTestExampleSet());
        final Performance perf = new Performance(pvXval, pvTest, all.size(), all.size()
                - r.getLabeledExampleSet().size());
        results.add(perf);
    }
    return results;
}
If, for example, a new input source should be used, it has to be configured in sssvmContext.xml
and after that it can be used in the experiment configuration.
Table 4.1 describes the important packages of ssSVM.<br />
package name          description
sssvm                 contains the ssSVM implementation and the core classes for running experiments
sssvm.clustermodel    contains cluster models for using clusterers in a semi-supervised manner
sssvm.confidencemodel contains the implementation for probability based semi-supervised approaches (Breaking Ties, Least Certainty, ...)
sssvm.sampling        contains different sampling strategies
sssvm.preprocessing   contains preprocessing methods
sssvm.text            wraps the WVTool for creating word vectors from texts

Table 4.1: Packages

4.1.3 ssSVMToolbox
This is the graphical user interface of ssSVM. It is based on Eclipse RCP. Using the
ssSVMToolbox one can create, configure and run experiments. This application
uses ssSVM to perform supervised and semi-supervised learning with SVMs and
has the same abilities as ssSVM. Technically the toolbox is a GUI to manipulate
experiment XML files.
Running experiments is straightforward. First one has to create a new experiment.
The toolbox consists of several tabs. On the Input tab one can configure
the datasources of the experiment. Here one has to provide the input format (e.g.
csv), the filenames of the example sets and additional parameters for the example
sets. The Preprocessing tab provides the configuration for preprocessing tasks like
discretization and transformation of nominal to numeric attributes. The ssSVM
Learner tab is the core of the toolbox. Here one can choose between different SVM
learners, has to set the SVM parameters like kernel type and can activate and deactivate
the different sampling strategies. For every sampling strategy one can set the
feedback size. Finally one can execute the ssSVM experiment. After doing this, the
Feedback Set table shows the instances for labeling by the human expert. Some
features are shown (by double clicking on the row a dialog is opened and the
whole instance is shown) and the user can label the instances by clicking on the cell
Label. The current accuracy on the test set is also shown. The Result tab shows
the accuracies and the confusion matrix.
Figure 4.1 shows the Input tab whereas Figure 4.2 represents the ssSVM tab.<br />
By repeatedly executing the experiment one can experiment with incrementally
increased training sizes; by setting the feedback sizes to values > 1 one can test
the batch mode. By choosing different sampling strategies one can experiment with
different combinations of them.
In the next sections I show the results of my experiments. For some of these
experiments I used the ssSVMToolbox. For more sophisticated results (e.g. to
Figure 4.1: Screenshot of the Input tab of the ssSVMToolbox<br />
Figure 4.2: Screenshot of the ssSVM tab of the ssSVMToolbox<br />
Figure 4.3: Binary Gaussian Distribution (µ1 = 3, σ1 = 3, µ2 = 4, σ2 = 3)
create the different figures) I used a programmatic approach where I could execute
different experiments with different settings all at once. See Section 4.1.2 for detailed
information and example source code.
4.2 Artificial Datasets<br />
4.2.1 Gaussian Distributed Data<br />
For these experiments I used generated Gaussian distributed data. I generated two
different datasets with two different classes where the two classes overlap.
In the first dataset ds1 the σs are equal, in the second ds2 the σs are different.
Figures 4.3 and 4.4 show plots of these datasets.
For these datasets I evaluated the different approaches (section 4.1.1). Table 4.2
shows the upper and lower bounds and the results of the ssSVM approaches using
a feedback size of 50.
                   whole set  reduced set  LC    BT    MC    SM    RS
self test          0.67       0.5          0.56  0.6   0.68  0.74  0.38
test set           0.67       0.5          0.62  0.5   0.66  0.67  0.46
training set size  840        40           50    50    50    50    50

Table 4.2: Summary of experiments with ds1
Figure 4.4: Binary Gaussian Distribution (µ1 = 12, σ1 = 15, µ2 = 17, σ2 = 1)
Figure 4.5 gives a more detailed insight into the performance of the semi-supervised
SVM. There the feedback size was set to 1 and ssSVM was used
to incrementally increase the training set size. As we can see, after approx. 50 iterations
Simple Margin and Most Certainty deliver good results in comparison with
conventional SVMs but with much less data. Breaking Ties, Least Certainty and
Most Certainty are most stable and outperform Random Sampling.
Figure 4.6 shows how the implementation performs with different feedback sizes<br />
in a batch mode.<br />
The lower and upper bounds of the second dataset and the performance of
ssSVM with feedback size 50 can be found in Table 4.3.
                   whole set  reduced set  LC    BT    MC    SM    RS
self test          0.77       0.9          0.83  0.78  0.95  0.66  0.76
test set           0.77       0.43         0.70  0.63  0.44  0.5   0.59
training set size  840        40           90    90    90    90    90

Table 4.3: Summary of experiments with ds2
The performance of ssSVM with feedback size 1 and incrementally increased
training size is highlighted in Figure 4.7.
What remains is an overview of how ssSVM performs on this dataset in batch mode;
Figure 4.8 highlights these results.
Figure 4.5: Incrementally increased training size, ds1
Figure 4.6: Different feedback sizes in batch mode, ds1
Figure 4.7: Incrementally increased training size, ds2
Figure 4.8: Different feedback sizes in batch mode, ds2
Figure 4.9: Incrementally increased training size, ds1, RBF kernel
Both datasets show that the semi-supervised SVM approaches deliver results similar
to the supervised approach but with a smaller training set. The incremental
version outperforms the supervised approach with respect to the training
set size and is better than the batch semi-supervised version, which is of course
more practical and also performs better than the supervised approach.
Different Kernels<br />
For the above experiments I used the linear kernel. To see how the chosen kernel
influences the results of the semi-supervised learning approaches I used polynomial
and RBF kernels for experimenting with the dataset ds1. Figures 4.10 and
4.9 are analogous to Figure 4.5. For these datasets we can conclude that the
chosen kernel influences the result of the SVM but has no specific impact on the
semi-supervised approaches.
4.2.2 Two Spirals Dataset<br />
I also applied ssSVM to a Two Spirals dataset (Figure 4.11).<br />
Table 4.4 shows the lower and upper bounds and the ssSVM accuracies for this dataset.
The performance of ssSVM with feedback size 1 and incrementally increased
training size is highlighted in Figure 4.12; the results of using the batch mode can be
found in Figure 4.13.
Figure 4.10: Incrementally increased training size, ds1, polynomial kernel (degree = 3)
Figure 4.11: Two Spirals Dataset<br />
                   whole set  reduced set  LC    BT    MC    SM    RS
self test          1          0.1          0     0     1     1     1
test set           0.85       0.32         0.33  0.31  0.67  0.74  0.72
training set size  104        38           48    48    48    48    48

Table 4.4: Summary of experiments with the Two Spirals Dataset
Figure 4.12: Incrementally increased training size, Two Spirals Dataset
Figure 4.13: Different feedback sizes in batch mode, Two Spirals Dataset
Like the experiments on the Gaussian datasets, these experiments show that with ssSVM
the necessary amount of training instances can be reduced significantly.
4.2.3 Chain Link Dataset<br />
The last artificial dataset I used to evaluate ssSVM is the Chain Link Dataset (Figure 4.14).
Table 4.5 shows the upper and lower bounds; Figures 4.15 and 4.16 show the
accuracies with incrementally increased training sets and with different batch sizes.
                   whole set  reduced set  LC    BT    MC    SM    RS
self test          0.89       0.66         0.77  0.7   0.67  0.67  0.86
test set           0.9        0.76         0.86  0.75  0.73  0.81  0.66
training set size  681        30           40    40    40    40    40

Table 4.5: Summary of experiments with the Chain Link dataset
4.2.4 Summary<br />
We could see that the semi-supervised SVM approaches reduced the amount of
needed labeled data significantly. They delivered accuracies similar to the common
SVM approach but the training set size was much smaller. As expected the
Figure 4.14: Chain Link Dataset<br />
Figure 4.15: Incrementally increased training size, Chain Link dataset
Figure 4.16: Different feedback sizes in batch mode, Chain Link Dataset
incremental version performs better than the batch version. Breaking Ties, Least
Certainty, Simple Margin and Most Certainty perform better than Random Sampling,
but no single ’winner’ could be found.
4.3 Datasets from UCI Machine Learning Repository
Besides the generated datasets I evaluated my implementation using some datasets
from the UCI Machine Learning Repository (appendix A).
I used the following datasets:
1. abalone<br />
2. breast cancer (WDBC)<br />
3. heart scale<br />
4. hill valley<br />
5. kr-vs-kp<br />
Detailed information about the datasets can be found on the UCI Machine Learning
Repository homepage. Again I separated the datasets into training sets for
supervised learning, semi-supervised learning and testing. Note that I did not try
to optimize the SVM kernel parameters to get good accuracies and therefore some
accuracies are rather low. Instead I used different parameters for different datasets
(e.g. different kernel types) and for each dataset the same parameters for comparing
supervised and semi-supervised learning.
Two modes were used for the semi-supervised approach: a simple batch mode with
only one feedback round, and a mode with 10 feedback rounds.
Tables 4.6 and 4.7 outline the results of these experiments. Again, the
semi-supervised approaches deliver good accuracy, but with a much smaller sample
size than the whole training set.
             whole set  reduced set  LC    BT    MC    SM    RS
heart scale  0.84       0.75         0.83  0.83  0.78  0.77  0.80
WDBC         0.94       0.75         0.94  0.94  0.77  0.80  0.87
WDBC (RBF)   0.85       0.25         0.52  0.52  0.28  0.50  0.45
abalone      0.54       0.44         0.51  0.53  0.44  0.51  0.51
hill valley  0.94       0.85         0.89  0.89  0.87  0.85  0.86
kr-vs-kp     0.44       0.29         0.39  0.43  0.42  0.33  0.22

Table 4.6: Evaluation of semi-supervised SVM approaches (1 iteration, feedback size 50)
             whole set  reduced set  LC    BT    MC    SM    RS
heart scale  0.84       0.75         0.83  0.83  0.76  0.84  0.80
WDBC         0.94       0.75         0.94  0.94  0.77  0.87  0.80
WDBC (RBF)   0.85       0.25         0.69  0.69  0.28  0.47  0.46
abalone      0.54       0.44         0.50  0.51  0.43  0.49  0.51
hill valley  0.94       0.85         0.89  0.88  0.88  0.86  0.86
kr-vs-kp     0.44       0.29         0.47  0.53  0.31  0.21  0.16

Table 4.7: Evaluation of semi-supervised SVM approaches (10 iterations, feedback size 50)
These datasets show that Least Certainty and Breaking Ties often deliver similar
results and outperform the other approaches.
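The selection criteria compared in these experiments are small functions of the model's per-instance confidence. The sketch below is illustrative rather than the thesis code: it assumes the SVM exposes class-probability estimates (used by Least Certainty, Breaking Ties and Most Certainty) and raw decision values (used by Simple Margin), and each function returns the single instance to query next.

```python
import random

def least_certainty(probs):
    """Query the instance whose most probable class has the lowest probability."""
    return min(probs, key=lambda i: max(probs[i]))

def breaking_ties(probs):
    """Query the instance with the smallest gap between its two top classes."""
    def gap(i):
        top = sorted(probs[i], reverse=True)
        return top[0] - top[1]
    return min(probs, key=gap)

def most_certainty(probs):
    """Query the instance the current model is most confident about."""
    return max(probs, key=lambda i: max(probs[i]))

def simple_margin(decision_values):
    """Query the instance closest to the separating hyperplane."""
    return min(decision_values, key=lambda i: abs(decision_values[i]))

def random_sampling(pool):
    """Baseline: ignore the model entirely."""
    return random.choice(list(pool))

# Hypothetical confidence values for three unlabeled instances.
probs = {
    "a": [0.98, 0.01, 0.01],   # model is very sure
    "b": [0.40, 0.35, 0.25],   # low peak probability
    "c": [0.48, 0.47, 0.05],   # near tie between the top two classes
}
decision = {"a": 1.7, "b": -0.2, "c": 0.9}  # signed distances to the hyperplane

print(least_certainty(probs), breaking_ties(probs),
      most_certainty(probs), simple_margin(decision))  # prints: b c a b
```

Note how Least Certainty and Breaking Ties can disagree: "b" has the lowest peak probability, while "c" has the tightest race between its top two classes — which matches the observation that the two strategies often behave similarly but not identically.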
Chapter 5
Conclusion
In this work I summarized different approaches to semi-supervised learning for
Support Vector Machines. We have seen that most of them try to narrow the margin
of the hyperplane; the version-space-based and the probability-based methods
belong to this category. Semi-supervised learning approaches promise to reduce
the amount of training data needed by performing so-called feedback rounds, in
which a human expert is asked to label instances that are relevant for the given
classification task. The experiments with different datasets have shown that
ssSVM, my semi-supervised learning implementation for SVMs, keeps this promise:
with ssSVM one can obtain similar accuracies with less training data than with
conventional SVMs.
One drawback of the presented semi-supervised learning approaches is that they
introduce a new parameter, the feedback size. The feedback size influences not
only the accuracy but also the acceptance by the human expert: if it is too
large, the expert has to label many instances and may get bored (as in the
supervised case); if it is too small, the accuracy can be too low. Because the
optimal feedback size depends on the dataset and the chosen approach, there is
no general rule for setting it. Additionally, the number of feedback rounds must
also be chosen.
I compared Least Certainty, Breaking Ties, Most Certainty and Simple Margin
with Random Sampling and could show that these approaches outperform the latter.
Which approach should be chosen depends on the dataset, although Least Certainty
and Breaking Ties seem to be the most stable and are generally good choices.
A remaining problem is that no practical online tuning algorithm for kernel
parameters exists yet: if we add a new instance to the training set, the optimal
kernel parameters can change.
Nevertheless, my experiments with ssSVM show that semi-supervised approaches
help to reduce the amount of labeled training data needed and are therefore
valuable.
Appendix A
Relevant Links
• Word Vector Tool - An Open-Source Tool for creating word vectors from texts http://www.wvtool.nemoz.org/
• RapidMiner - An Open-Source Data Mining Tool http://www.rapidminer.com
• Spring Framework - An IoC Container http://springframework.org/
• Eclipse RCP - The Eclipse Rich Client Platform http://wiki.eclipse.org/index.php/Rich Client Platform
• UCI Machine Learning Repository - Repository containing different data sets http://archive.ics.uci.edu/ml/