
Semi-supervised Learning With
Support Vector Machines

BACHELOR'S THESIS (Bakkalaureatsarbeit)

by Andre Guggenberger
Matriculation number 0327514

submitted to the Technische Universität Wien
in September 2008


ABSTRACT

Support Vector Machines are a modern technique in the field of machine learning and have been successfully used in different fields of application. In general they are used for some kind of classification task, where they learn from a randomly selected training set which has been classified in advance, and are then applied to unseen instances. To get a good classification result it is often necessary that this training set contains a huge number of labeled instances. But for humans, labeling data is a time-consuming and tedious task. Some algorithms address this problem by learning on both a small amount of labeled and a huge amount of unlabeled instances. There the learner has access to the pool of unlabeled instances and requests the labels for some specific instances from a user. Then the learner uses all labeled data to learn the model. The choice of the unlabeled instances which should be labeled next has a significant impact on the quality of the resulting model. This kind of learning is called semi-supervised learning or active learning. Currently there exist several different solutions for semi-supervised learning. This work focuses on the best-known ones and gives an overview of them.

KURZFASSUNG

Support Vector Machines are a modern technique in the field of machine learning and have by now been used successfully in various fields of application. In general they are used for classification tasks, where they learn from a randomly chosen set of pre-classified training data and are then applied to still unknown data. To obtain a good classification result it is often necessary to use a large amount of pre-classified training data for the training. Manually classifying the data is often a time-consuming and tedious task for humans. To ease this, algorithms have been developed that build a model from only a few classified and many unclassified instances. Here the classifier has access to the pool of unclassified data and asks a user for the class of some specific instances. It then uses all classified data to build the model. The choice of those not yet classified instances that are to be classified by an expert has a significant impact on the quality of the resulting model. This kind of machine learning is called semi-supervised learning or active learning. Currently there exist a wide variety of approaches to semi-supervised learning. This work covers the best-known ones and provides an overview of the different approaches.


Contents

1 Introduction
2 Basic Definitions of Support Vector Machines
3 Semi-supervised Learning
  3.1 Random Subset
  3.2 Clustering
  3.3 Version Space Based Methods
    3.3.1 Theory of the Version Space
    3.3.2 Simple Method
    3.3.3 Batch-Simple Method
    3.3.4 Angle Diversity Strategy
    3.3.5 Multi-Class Problem
  3.4 Probability Based Method
    3.4.1 The Probability Model
    3.4.2 Least Certainty and Breaking Ties
  3.5 Other approaches
    3.5.1 A Semidefinite Programming Approach
    3.5.2 S³VM
  3.6 Summary
4 Experiments
  4.1 Experiment Setting
    4.1.1 Evaluated Approaches
    4.1.2 ssSVM
    4.1.3 ssSVMToolbox
  4.2 Artificial Datasets
    4.2.1 Gaussian Distributed Data
    4.2.2 Two Spirals Dataset
    4.2.3 Chain Link Dataset
    4.2.4 Summary
  4.3 Datasets from UCI Machine Learning Repository
5 Conclusion
A Relevant Links


Chapter 1

Introduction

Support Vector Machines (SVMs) are a modern technique in the field of machine learning and have been successfully used in different fields of application. Most of the time they are used in a supervised learning context. There the learner has access to a large set of labeled data and builds a model using this information. After this learning step the learner is presented new instances and tries to predict the correct labels. Besides supervised learning there is also unsupervised learning, where the learner cannot access the labels of the instances. In this case the learner tries to predict the labels by partitioning the data into so-called clusters.

Providing a huge set of labeled data (as in the supervised case) can be very time-consuming (and therefore costly). Semi-supervised learning tries to reduce the needed amount of labeled data by analyzing the unlabeled data; only relevant instances have to be labeled by a human expert. Of course the overall accuracy has to be on par with the supervised learning accuracy.

In this work I explain the most common approaches for semi-supervised learning with SVMs. I begin by introducing some basic definitions, i.e. the SVM hyperplane, the kernel function and the SVM maximization task (Chapter 2). A detailed discussion of the theory of Support Vector Machines is not provided. The main part of the work focuses on semi-supervised learning. I present a definition of semi-supervised learning in contrast to supervised and unsupervised learning, discuss the most common approaches (Chapter 3) for Support Vector Machines, compare semi-supervised SVMs and supervised SVMs, and present the results of my experiments with some of them. I show how they perform with different datasets, including some common machine learning datasets and one real-world dataset (Chapter 4).


Chapter 2

Basic Definitions of Support Vector Machines

Consider a typical classification problem. Some input vectors (feature vectors) and some labels are given. The objective of classification problems is to predict the labels of new input vectors so that the error rate of the classification is minimal. There are many algorithms to solve such kinds of problems. Some of them require that the input data is linearly separable (by a hyperplane). But for many applications this assumption is not appropriate. And even if the assumption holds, most of the time there are many possible solutions for the hyperplane (Figure 2.1). Because we are looking for a hyperplane where the classification error is minimal, this can be seen as an optimization problem. In 1965 Vapnik ([VC04], [Vap00]) introduced a mathematical approach to find a hyperplane with low generalization error. It is based on the theory of structural risk minimization, which states that the generalization error is influenced by the error on the training set and the complexity of the model. Based on this work Support Vector Machines were developed. They belong to the family of generalized linear classifiers and are so-called maximum margin classifiers. This means that the resulting hyperplane maximizes the distance between the 'nearest' vectors of different classes, with the assumption that a large margin is better for the generalization ability of the SVM. These 'nearest' vectors are called support vectors (SV), and SVMs consider only these vectors for the classification task. All other vectors can be ignored. Figure 2.2 illustrates a maximum margin classifier and the support vectors.

Figure 2.1: Positive samples (green boxes) and negative samples (red circles). There are many possible solutions for the hyperplane (from [Mar03])

Figure 2.2: Maximum margin; the middle line is the hyperplane, the vectors on the other lines are the support vectors (from [Mar03])

In the context of SVMs it is also important to mind kernel functions. They project the low-dimensional training data into a higher dimensional feature space, because the separation of the training data is often easier to achieve in this higher dimensional space. Moreover, through this projection it is possible that training data which couldn't be separated linearly in the low-dimensional feature space can be separated linearly in the high-dimensional space.

To understand semi-supervised learning we have to consider some mathematical background of SVMs. This is just a very short summary; besides very good resources on the internet, Vapnik, Cristianini and Shawe-Taylor provide comprehensive introductions to Support Vector Machines [Vap00], [VC04], [CST00].

At first we have to define the hyperplane, which separates the data and acts as the decision boundary:

H(ω, b) = { x | ω^T ∗ x + b = 0 }   (2.1)

where ω is a weight vector, x is an input vector and b is the bias. Note that ω is orthogonal to H.

Because we are interested in maximizing the margin, we have to define the distance from a support vector to the hyperplane:

(ω^T ∗ x + b) / ||ω|| = ±1 / ||ω||   (2.2)

From this definition the margin m follows straightforwardly (see Figure 2.2 for an illustration): each support vector lies at distance 1/||ω|| from the hyperplane, one on each side, so

m = 2 / ||ω||   (2.3)

The maximization task can be summarized as [TC01]:

max_{ω∈F} min_i { y_i (ω ∗ φ(x_i)) }   (2.4)

subject to ||ω|| = 1
           y_i (ω ∗ φ(x_i)) ≥ 1, i = 1...n.

Note that this definition is only correct if the data is linearly separable. In the non-linearly separable case we have to introduce slack variables:

max_{ω∈F} min_i { y_i (ω ∗ φ(x_i)) }   (2.5)

subject to ξ_i ≥ 0
           y_i (ω ∗ φ(x_i)) ≥ 1 − ξ_i, i = 1...n

where the ξ_i are slack variables.

Because SVMs try to maximize the margin, we can restate the optimization task using the definition of the margin:

min_{ω,ξ}  (1/2) ||ω||² + C ∑_{i=1}^{n} ξ_i   (2.6)

subject to ξ_i ≥ 0
           y_i (ω ∗ φ(x_i)) ≥ 1 − ξ_i, i = 1...n

where C is the complexity parameter. This parameter controls the complexity of the decision boundary: a large C penalizes errors whereas a small C penalizes complexity [Mei02].

As mentioned, Support Vector Machines usually use so-called kernels or kernel functions to project the data from a low-dimensional input space to a high-dimensional feature space. The kernel function K satisfies Mercer's condition and we define K as:

K(u, v) = φ(u) ∗ φ(v)   (2.7)

where φ : X → F is a feature map [Mei02], [MIJ04]. One example of a feature map:

φ(x_1, x_2) = (x_1², √2 x_1 x_2, x_2²)   (2.8)

Using this feature map we can calculate the projection using the kernel K(u, v) = φ(u) ∗ φ(v) by computing the inner product of the data vectors instead of the feature vectors:

K(u, v) = φ(u) ∗ φ(v)   (2.9)
        = u_1² v_1² + 2 u_1 u_2 v_1 v_2 + u_2² v_2²   (2.10)
        = (u_1 v_1 + u_2 v_2)²   (2.11)
        = (⟨u, v⟩)²   (2.12)

where ⟨u, v⟩ is the inner product of u and v.

In the context of SVMs we consider classifiers of the form:

f(x) = ∑_{i=1}^{n} α_i K(x, x_i)   (2.13)

where the α_i are the Lagrange multipliers.
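To make the kernel trick concrete, the following small Java example (illustrative only, not part of any library used in this work) checks numerically that computing (⟨u, v⟩)² in the input space gives the same value as first mapping u and v with the feature map of (2.8) and then taking the inner product in the feature space.

public class KernelTrickDemo {

    // Feature map from Equation 2.8: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    static double[] phi(double x1, double x2) {
        return new double[] { x1 * x1, Math.sqrt(2) * x1 * x2, x2 * x2 };
    }

    // Inner product of two vectors of equal length
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Kernel from Equation 2.12: K(u, v) = <u, v>^2, computed in input space
    static double kernel(double[] u, double[] v) {
        double d = dot(u, v);
        return d * d;
    }

    public static void main(String[] args) {
        double[] u = { 1.0, 2.0 };
        double[] v = { 3.0, -1.0 };
        // Both values agree: the kernel avoids the explicit projection.
        System.out.println(dot(phi(u[0], u[1]), phi(v[0], v[1]))); // prints 1.0
        System.out.println(kernel(u, v));                          // prints 1.0
    }
}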


Chapter 3

Semi-supervised Learning

The task of classification is also called supervised learning. In contrast to this, the task of clustering is called unsupervised learning. There the learner doesn't use labeled data; instead it tries to partition a dataset into clusters so that the data in a cluster share some common characteristics.

Semi-supervised learning is a combination of supervised and unsupervised learning where typically a small amount of labeled and a large amount of unlabeled data are used for training. This is done for two reasons. First, labeling a huge set of instances can be a time-consuming task. This classification has to be done by a skilled human expert and can be quite costly. Semi-supervised learning reduces the needed amount of labeled instances and the associated costs; note that, in contrast, the acquisition of the unlabeled data is usually relatively inexpensive. Second, it has been shown that using unlabeled data for learning improves the accuracy of the produced learner [BD99]. G. Schohn and D. Cohn [SC00] report similar results; they state that an SVM trained on a well-chosen subset often performs better than one trained on all available instances.

Summing up, the advantages of semi-supervised learning are (in many cases) better accuracy, fewer data and less training time. To achieve these, the examples to be labeled have to be selected properly.

There are many different algorithms for semi-supervised learning with Support Vector Machines. Most of them involve querying unlabeled instances to request their labels from a human expert; they differ in the way they select the next instances. The process of querying is called selective sampling. Sometimes semi-supervised learning is called active learning. As opposed to passive learning, where a classifier is trained using randomly selected labeled data, an active learner asks a user to label only 'important' instances. Because the classifier gets feedback (the labels) from a user about the instances relevant for the classification, this process is called relevance feedback.

Note that the approaches presented in 3.5.1 and 3.5.2 differ in this respect, because there no feedback is necessary.


3.1 Random Subset

Obviously, if we use a random process to select the unlabeled instances, this learning cannot be considered real semi-supervised learning. To get an appropriate accuracy the sampling strategy is as important as it is in the case of supervised learning. Supervised learning and random subset semi-supervised learning are very similar and share most characteristics.

Some researchers have experimented with this strategy and have stated that its accuracy cannot keep up with real semi-supervised strategies. But they used this approach as a baseline against which to compare the other semi-supervised learning approaches [FM01], [LKG+05].

3.2 Clustering

One approach is to use a clustering algorithm (unsupervised learning) on the unlabeled data. Then we can e.g. choose the cluster centers (centroids) as instances to be labeled by an expert. G. Fung and O. Mangasarian have used k-median clustering and report a good classification accuracy in comparison with supervised learning, but with fewer labeled instances [FM01]. It's worth keeping in mind that one has to define the correct number of clusters in advance. Correct means that the clusters should be good representatives of the available classes. G. Fung and O. Mangasarian do not really address this, but as for other clustering algorithms the choice of the number of clusters can be assumed to be critical. An obvious solution is to set the number of clusters equal to the number of classes. Additionally, G. Fung and O. Mangasarian extend the clustering by an approach similar to that described in chapter 3.5.

A general algorithm could be described this way (a sketch of step 3(a) follows below):

1. Use the labeled data to build a model
2. Using the unlabeled data, calculate n clusters
3. Query some instances for labeling by a human expert. Which instances depends on the algorithm. Some examples:
   (a) Query the centroids
   (b) Query instances on the cluster boundaries
   (c) A combination of the above approaches

Cebron and Berthold introduced an advanced clustering technique: they proposed a prototype based learning approach using a density estimation technique and a probability model for selecting prototypes to obtain the labels from an expert [CB07].
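As an illustration of step 3(a) of the general algorithm above, the following minimal Java sketch queries, for each centroid, the unlabeled instance closest to it. The plain-array vector representation and the Euclidean distance are assumptions made for the example, not the ssSVM implementation.

import java.util.ArrayList;
import java.util.List;

public class CentroidQuery {

    // Euclidean distance between two feature vectors
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // For each centroid, pick the unlabeled instance closest to it;
    // these instances are then presented to the human expert for labeling.
    static List<double[]> selectForLabeling(List<double[]> centroids,
                                            List<double[]> unlabeled) {
        List<double[]> queries = new ArrayList<>();
        for (double[] c : centroids) {
            double[] best = null;
            double bestDist = Double.POSITIVE_INFINITY;
            for (double[] x : unlabeled) {
                double d = distance(c, x);
                if (d < bestDist) {
                    bestDist = d;
                    best = x;
                }
            }
            queries.add(best);
        }
        return queries;
    }
}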


3.3 Version Space Based Methods

Random Subset (chapter 3.1) and clustering (chapter 3.2) are simple but effective methods for semi-supervised learning. Depending on the given classification task the results can be quite good. Note that both can be used with other classifiers and are not limited to Support Vector Machines. Version space based methods are a more advanced technique which uses specific properties of Support Vector Machines for semi-supervised learning. But as we will see, these approaches suffer from some critical limitations.

The following approaches can be analyzed by their influence on the version space. Therefore it is worth considering the theory of version spaces.

3.3.1 Theory of the Version Space

The version space was introduced by Tom Mitchell [Mit97]. It is the space containing all consistent hypotheses from the hypothesis space, whereas the hypothesis space contains all possible hypotheses.

In the context of SVMs the hypotheses are the hyperplanes, and the version space contains all hyperplanes consistent with the current training data [TC01]. More formally, the hypothesis space (all possible hypotheses) is defined as:

H = { f | f(x) = (φ(x) ∗ ω) / ||ω||, where ω ∈ W }   (3.1)

where the parameter space W is equal to the feature space F and f is a hypothesis. As explained in chapter 2, (φ(x) ∗ ω) / ||ω|| is the definition of the (normalized) hyperplanes (Definition 2.1). So this space contains all possible hyperplanes. Using this definition we can define the version space:

V = { f ∈ H | y_i f(x_i) > 0  ∀ i ∈ {1...n} }   (3.2)

where y_i is the class label. This definition eliminates all hypotheses (hyperplanes) not consistent with the given training data (Definition 2.4).

Because there is a bijection between W (containing the unit vectors) and H (containing hyperplanes) we can redefine V [TC01]:

V = { ω ∈ W | ||ω|| = 1, y_i (ω ∗ φ(x_i)) > 0, i = 1...n }   (3.3)

There is a restriction of this definition: the training data has to be linearly separable in the feature space. But because it is possible to make any data linearly separable by modifying the kernel, we can ignore this issue [STC99]. Furthermore, because we often work in a high-dimensional feature space, in many cases the data will be linearly separable anyway.


For our analysis it is important to note the duality between the feature space F and the parameter space W [TC01]. The unit vectors ω correspond to the decision boundaries f in F. This follows intuitively from the above definitions, but the correspondence also holds conversely. Let's have a closer look at this issue. If one observes a new training instance x_i in the feature space, this instance reduces the set of all allowable hyperplanes to those that classify x_i correctly. We can write this down more formally: every hyperplane must satisfy y_i (ω ∗ φ(x_i)) > 0, where y_i is the label for the instance x_i. As said before, ω is the normal vector of the hyperplane in F. But we can think of y_i φ(x_i) as being the normal vector of a hyperplane in W. It follows that ω ∗ (y_i φ(x_i)) = 0 defines a hyperplane in W. Recall that we have defined the version space V in W; therefore this hyperplane is a boundary of the version space. It can be shown that the hyperplanes in W delimit the version space, and from the definition of the maximization task the SVM maximizes the minimum distance to any of these hyperplanes in W. SVMs find the center of the largest hypersphere in the version space, whose radius is the maximum margin; it can be shown that the hyperplanes touched by the hypersphere correspond to the support vectors and that ω often lies in the center of the version space [TC01].

3.3.2 Simple Method

Linear SVMs perform best when applied in high-dimensional domains (such as text classification). There the number of features is much larger than the number of examples, and therefore the training data cannot cover all dimensions, meaning that the subspace spanned by the training examples is much smaller than the space containing all dimensions. Considering this observation, G. Schohn and D. Cohn propose as a simple method for selecting instances for labeling to search for examples that are orthogonal to the space spanned by the current training data [SC00]. Doing this gives the learner information about dimensions not yet covered. Alternatively, one can choose those instances which are near the dividing hyperplane to improve the confidence in the currently known dimensions. This is an attempt to narrow the existing margin. To maximally narrow the margin one would select those instances lying on the hyperplane. The interesting result from G. Schohn and D. Cohn is that training on a small subset of the data leads in most cases to a better performance than training on all available data.

What remains is the computation of the proximity of a training instance to the hyperplane: this is inexpensive, because one can compute the hyperplane once and evaluate each instance using a single dot product. The distance between a feature vector φ(x) and the hyperplane ω is:

|φ(x) ∗ ω|   (3.4)
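A minimal sketch of this selection step, assuming the trained SVM is given in its dual form (support vectors, labels, Lagrange multipliers and bias) so that the dot product with the hyperplane can be evaluated through the kernel. All names are illustrative and not the ssSVM API; the RBF kernel is just one possible choice.

public class SimpleMargin {

    // Decision value f(x) = sum_i alpha_i * y_i * K(sv_i, x) + b
    static double decisionValue(double[][] sv, double[] y, double[] alpha,
                                double b, double[] x) {
        double f = b;
        for (int i = 0; i < sv.length; i++) {
            f += alpha[i] * y[i] * rbfKernel(sv[i], x);
        }
        return f;
    }

    // RBF kernel K(u, v) = exp(-gamma * ||u - v||^2), one possible choice
    static double rbfKernel(double[] u, double[] v) {
        double gamma = 0.8, s = 0;
        for (int i = 0; i < u.length; i++) {
            double d = u[i] - v[i];
            s += d * d;
        }
        return Math.exp(-gamma * s);
    }

    // Simple margin: query the unlabeled instance closest to the hyperplane,
    // i.e. the one with the smallest |f(x)|.
    static int selectClosest(double[][] unlabeled, double[][] sv, double[] y,
                             double[] alpha, double b) {
        int best = -1;
        double bestAbs = Double.POSITIVE_INFINITY;
        for (int j = 0; j < unlabeled.length; j++) {
            double abs = Math.abs(decisionValue(sv, y, alpha, b, unlabeled[j]));
            if (abs < bestAbs) {
                bestAbs = abs;
                best = j;
            }
        }
        return best; // index of the next instance to label
    }
}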

Let's have a look at how this simple method influences the version space. Given an unlabeled instance x_i, we can test how close the corresponding hyperplane in W comes to the center of the hypersphere (the ω). If we choose the instance x_i closest to the center, we reduce the version space as much as possible (and this will of course reduce the number of consistent hypotheses). This distance can be easily computed using the above formula. By choosing the instance x_i that comes closest to the hyperplane in F, we maximally reduce the margin and the version space. Figure 3.1 shows the effect of an instance on the hyperplane graphically. There the bottom figure shows that placing an instance close to the center of the old hyperplane changes the margin (calculated using the new hyperplane) significantly. Placing an instance on the old hyperplane too far out has little impact on the margin, as we can see in the top figure.

Figure 3.1: The gray line is the old hyperplane, the green lines are the old margins, 'o' is a new example and the black line is the new hyperplane when the new instance was labeled as '-' (from [SC00])

A more sophisticated description of this can be found in [TK02]. There three different approaches are presented, each trying to reduce the version space as much as possible. Note that these definitions rely on the assumption that the given problem is binary (two classes).

1. Simple Margin: This is the method already described: choose the next instance closest to the hyperplane.

2. MaxMin Margin: Let the instance x be a candidate for being labeled by a human expert. This instance gets labeled as -1, assigning it to class -1, and the margin m− of the resulting SVM gets calculated. After this, x gets labeled as +1, assigning it to class +1, and again the margin m+ gets computed. This procedure is repeated for all instances and the instance with the largest min(m−, m+) is chosen.

3. Ratio Margin: This is similar to the MaxMin Margin method, but uses the relative sizes of m− and m+: choose the instance with the largest min(m−/m+, m+/m−).

All three methods perform well; the simple margin method is computationally the fastest, but it has to be used carefully, because it can be unstable under some circumstances [HGC01], [TK02]. MaxMin Margin and Ratio Margin try to overcome these instability problems. The results of the experiments of S. Tong and D. Koller show that all three methods outperform random sampling [TK02]. A sketch of the MaxMin Margin selection follows below.
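The MaxMin Margin rule could be sketched as follows. Trainer is a hypothetical stand-in for an actual SVM library, and the sketch makes explicit why the method is expensive: the SVM is retrained twice per candidate.

import java.util.List;

public class MaxMinMargin {

    // Hypothetical helper: train an SVM on (data, labels) and return its margin.
    interface Trainer {
        double trainAndGetMargin(List<double[]> data, List<Double> labels);
    }

    // For each candidate, try both labels, compute the resulting margins m- and
    // m+, and choose the candidate with the largest min(m-, m+). The labeled
    // lists are modified in place and restored before the next candidate.
    static int select(List<double[]> labeledData, List<Double> labels,
                      List<double[]> candidates, Trainer svm) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int j = 0; j < candidates.size(); j++) {
            labeledData.add(candidates.get(j));

            labels.add(-1.0);
            double mMinus = svm.trainAndGetMargin(labeledData, labels);
            labels.set(labels.size() - 1, +1.0);
            double mPlus = svm.trainAndGetMargin(labeledData, labels);

            labels.remove(labels.size() - 1);
            labeledData.remove(labeledData.size() - 1);

            double score = Math.min(mMinus, mPlus);
            if (score > bestScore) {
                bestScore = score;
                best = j;
            }
        }
        return best;
    }
}

The Ratio Margin variant would only change the score line to Math.min(mMinus / mPlus, mPlus / mMinus).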

3.3.3 Batch-Simple Method

One possible problem with the above methods is that every instance has to be labeled separately. That means that after each instance the user has to determine the label, a new hyperplane has to be calculated, and the next instance has to be queried. Often this approach is not practicable and some kind of batch mechanism is necessary. There exist different approaches to batch sampling for version space based algorithms [Cha05]. One of these approaches is the batch-simple sampling algorithm, where the h unlabeled instances closest to the hyperplane are chosen and have to be labeled by a user. This could be seen as a rather naive extension of the above methods (of course naive doesn't mean bad). The batch-simple method has been used to classify images [TC01] and the researchers report good results. The algorithm can be expressed as follows:

1. Initial model building: build a model using the labeled data
2. Feedback round: query the n instances closest to the hyperplane and ask the user to label them

The feedback round can be repeated m times. Because this algorithm can be unstable during the first feedback round [TC01], Tong and Chang suggest an initial feedback round with random sampling:

1. Initial model building: build a model using the labeled data
2. First feedback round: choose n instances randomly for labeling
3. Advanced feedback round: query the n instances closest to the hyperplane and ask the user to label them

Now the advanced feedback round can be repeated m times. But how do we choose 'good' values for n and m? Simon Tong and Edward Chang do not explain a way to determine these values [TC01], but it is clear that n has to be set in advance; they have used a query size of 20. m can be determined by using some kind of cross validation. It is also obvious that by decreasing the query size n one has to increase the number of rounds m and vice versa; otherwise the accuracy of the classifier would decrease. Besides the technical reasons, the choice of the values depends on the user whose task is to label the instances: to take advantage of active learning this user should not have to label a huge set of examples. As a starting point one can use the values from [TC01]: query size = 20, number of rounds = 5.
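With these values, the feedback loop could be sketched like this. Svm and Expert are hypothetical stand-ins for a real SVM implementation and the human labeler; the optional random first feedback round is omitted for brevity.

import java.util.List;

public class BatchSimple {

    // Hypothetical collaborators standing in for a real SVM and a human expert.
    interface Svm {
        void train(List<double[]> x, List<Double> y);
        double decision(double[] x);
    }
    interface Expert {
        double labelOf(double[] x);
    }

    static void run(Svm svm, Expert expert,
                    List<double[]> labeledX, List<Double> labeledY,
                    List<double[]> unlabeled) {
        final int querySize = 20; // n, from [TC01]
        final int rounds = 5;     // m, from [TC01]

        svm.train(labeledX, labeledY); // initial model building
        for (int round = 0; round < rounds; round++) {
            // sort unlabeled instances by distance to the hyperplane
            unlabeled.sort((a, b) -> Double.compare(
                    Math.abs(svm.decision(a)), Math.abs(svm.decision(b))));
            // query the n closest instances and let the expert label them
            int n = Math.min(querySize, unlabeled.size());
            for (int i = 0; i < n; i++) {
                double[] x = unlabeled.remove(0);
                labeledX.add(x);
                labeledY.add(expert.labelOf(x));
            }
            svm.train(labeledX, labeledY); // retrain on all labeled data
        }
    }
}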

3.3.4 Angle Diversity Strategy

One problem with the batch-simple method is that when sampling a batch of instances their diversity is not guaranteed. One can expect that diverse instances reduce the version space more efficiently, so considering the diversity can have a significant impact on the performance of the classifier. A measure of the diversity is the angle between the samples. The angle diversity strategy proposed in [Cha05] balances the closeness to the hyperplane against the diversity of the instances.

More formally, the angle between two instances x_i and x_j (or rather between their corresponding hyperplanes h_i and h_j) is:

|cos(∠(h_i, h_j))| = |φ(x_i) ∗ φ(x_j)| / (||φ(x_i)|| ||φ(x_j)||) = |K(x_i, x_j)| / √(K(x_i, x_i) K(x_j, x_j))   (3.5)

where x_i is an instance, φ(x_i) is its normal vector and K(x_i, x_j) is the kernel function, which satisfies Mercer's condition [Bur98].

From these theoretical considerations the algorithm follows straightforwardly:

1. Train a hyperplane h_i on the given labeled set
2. Calculate for each unlabeled instance x_j its distance to the hyperplane h_i
3. Calculate the maximal angle from x_j to any instance x_i in the current labeled set

What's left is to consider the distance to the hyperplane; until now we have focused on the diversity of the samples. To do this we introduce another parameter α [Cha05]. This parameter balances the distance to the hyperplane against the diversity among the instances. The final decision rule can be expressed this way, where the unlabeled instance x_i with the smallest value is queried:

α ∗ |f(x_i)| + (1 − α) ∗ max_{x_j} ( |K(x_i, x_j)| / √(K(x_i, x_i) K(x_j, x_j)) )   (3.6)

As we can see, α acts as a trade-off factor between proximity and diversity. This parameter has to be set in advance, and it is suggested to set it to 0.5 [Cha05]. A more sophisticated solution for determining this parameter is also presented there, and clearly it is possible to use cross validation to get the best value for α.
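A sketch of the decision rule (3.6), with the kernel passed in as a function; the instance with the smallest score would be queried. The names and the representation are illustrative.

import java.util.List;
import java.util.function.BiFunction;

public class AngleDiversity {

    // Combined score of Equation 3.6 for candidate x: a trade-off between
    // closeness to the hyperplane |f(x)| and the maximal angle-similarity
    // to any instance already in the labeled set.
    static double score(double[] x, double fx, List<double[]> labeled,
                        BiFunction<double[], double[], Double> k, double alpha) {
        double maxSim = 0;
        for (double[] xi : labeled) {
            double sim = Math.abs(k.apply(x, xi))
                    / Math.sqrt(k.apply(x, x) * k.apply(xi, xi));
            maxSim = Math.max(maxSim, sim);
        }
        return alpha * Math.abs(fx) + (1 - alpha) * maxSim;
    }
}

With alpha = 0.5 this reproduces the suggested equal weighting of proximity and diversity.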

Some version space based methods have been tested in different fields of application [Cha05], [MPE06]. Whereas the former concentrated on image datasets and the latter tested these strategies on music datasets, both come to the conclusion that the angle diversity strategy works best. Furthermore, Tong concludes that active learning outperforms passive learning [Cha05].

3.3.5 Multi-Class Problem

So far we have only considered and analyzed the two-class case. But to be useful in general, a semi-supervised learning approach should be easily usable in a multi-class environment.

There exist different strategies for solving a multi-class problem with N classes for supervised learning with SVMs. In the case of the one-versus-one approach, N(N−1)/2 SVMs are trained and a majority vote is used to determine the class of a given instance. In contrast, the one-versus-all method uses N SVMs and assigns the label of the class whose SVM has the largest margin. An overview of different multi-class approaches for SVMs can be found in [Pal08]. The one-versus-all method was introduced by Vapnik [Vap00]. Hsu and Lin have compared different multi-class approaches for SVMs [HL02]. Platt has described another multi-class SVM approach: the decision directed acyclic graph [PCT00].
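For illustration, the prediction step of the one-versus-all method is just an argmax over the N binary decision values; a minimal sketch, with the decision values assumed to be given:

public class OneVersusAll {

    // One-versus-all: given the decision values f_1(x), ..., f_N(x) of the
    // N binary SVMs, assign the class whose SVM has the largest output.
    static int predict(double[] decisionValues) {
        int best = 0;
        for (int c = 1; c < decisionValues.length; c++) {
            if (decisionValues[c] > decisionValues[best]) {
                best = c;
            }
        }
        return best; // index of the predicted class
    }
}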

From the above discussion it is not clear how to use these version space based methods for multi-class problems. Consider the simple method and the one-versus-all approach. In the case of a multi-class problem we have N decision boundaries, so which of the margins do we want to narrow? A single instance has N distances (to the N hyperplanes), and narrowing one margin doesn't automatically narrow all margins. Until now little work has been done on solving multi-class semi-supervised problems. Mitra, Shankar, and Pal have applied the simple method to multi-class problems [MSP04]. They used a 'naive' approach where they labeled N samples at a time. As said, this approach lacks an analysis of which example is best for all hyperplanes, because the influence of an example can be very large for one hyperplane but useless for the others. The angle diversity strategy suffers from the same problem; additionally it is not clear which angle should be considered.

The following section 3.4 describes probability based methods, which overcome these problems and are more suitable for multi-class problems.


3.4 Probability Based Method

As we have seen, the version space based methods do not handle multi-class problems well. An approach which can handle multi-class problems easily is the probability based method [LKG+05]. There a probability model for multiple SVMs is created. The result of each SVM is interpreted as a probability and can be seen as a measure of certainty that a given instance belongs to the class. Using this approach for semi-supervised learning is straightforward, and with the probabilities we have many possibilities to query unlabeled instances for labeling. A simple method would be to train a model on the given labeled dataset. Then this model is applied to the unlabeled data and each unlabeled instance is assigned a probability of belonging to each class. Now we can query the least certain instances or the most certain instances. It is also possible to query the instances with the smallest difference in probability between their most likely and second most likely class. Using these probabilities, many different approaches exist, and it is also possible to mix some of them [LKG+05].

3.4.1 The Probability Model

To get probabilities we have to extend the default Support Vector Machines. For a given instance the result of a default SVM is a distance, where e.g. 0 means that the instance lies on the hyperplane and 1 that the instance is a support vector. To assign a probability value to a class, the sigmoid function can be used. Then the parametric model has the following form [LKG+05]:

P(y = 1 | f) = 1 / (1 + exp(A f + B))   (3.7)

where A and B are scalar values which have to be estimated and f is the decision function of the SVM. Based on this parametric model there are several approaches for calculating the probabilities. As we can see, when we use this model we have to calculate the SVM parameters (complexity parameter C, kernel parameter k) and the parameters A and B, where A and B have to be calculated for each binary SVM. We can use cross validation for this calculation, but it is clear that this can be computationally expensive.

A pragmatic approximation method could assume that all binary SVMs have the same A, eliminate B by assigning 0.5 to instances lying on the decision boundary, and compute the SVM parameters and A simultaneously [LKG+05]. The decision function can be normalized by its margin to include the margin in the calculation of the probabilities. More formally:

P_pq(y = 1 | f) = 1 / (1 + exp(A f / ||ω||))   (3.8)


where we currently look at class p and P_pq is the probability of class p versus class q. We assume that the P_pq, q = 1, 2, ..., are independent. The final probability for class p is:

P(p) = ∏_{q≠p} P_pq(y = 1 | f)   (3.9)

It has been reported that this approximation is very fast and delivers good accuracy results. Using this probability model there exist different approaches for semi-supervised learning. The next section outlines some of them.
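A sketch of this approximated model: equation (3.8) squashes each margin-normalized decision value with the sigmoid, and equation (3.9) multiplies the pairwise probabilities. The array layout is an assumption made for the example.

public class ProbabilityModel {

    // Equation 3.8: sigmoid of the margin-normalized decision value.
    // f is the decision value of the binary SVM "class p versus class q",
    // normOmega the norm of its weight vector, a the shared scalar parameter A.
    static double pairwiseProbability(double f, double normOmega, double a) {
        return 1.0 / (1.0 + Math.exp(a * f / normOmega));
    }

    // Equation 3.9: probability of class p as the product over all q != p.
    // f[p][q] and normOmega[p][q] hold the values of the binary SVM p-vs-q.
    static double classProbability(int p, double[][] f, double[][] normOmega,
                                   double a, int numClasses) {
        double prob = 1.0;
        for (int q = 0; q < numClasses; q++) {
            if (q != p) {
                prob *= pairwiseProbability(f[p][q], normOmega[p][q], a);
            }
        }
        return prob;
    }
}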

3.4.2 Least Certainty and Breaking Ties

The algorithms for both are very similar:

1. Build a multi-class model from the labeled training data
2. Compute the probabilities
3. Least Certainty: query the instances with the smallest classification confidence for labeling by a human expert. Add them to the training set.
4. Breaking Ties: query the instances with the smallest difference in probabilities for the two highest probability classes and obtain the correct label from a human expert. Add them to the training set.
5. Go to 1

Suppose a is the class with the highest probability, b is the class with the second highest probability, and P(a) and P(b) are the probabilities of these classes. Then least certainty tries to improve P(a) and breaking ties tries to improve P(a) − P(b). Intuitively, both methods improve the confidence of the classification. The number of instances which should be queried has to be set by the SVM designer.
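Given a matrix of class probabilities for the unlabeled instances, both query rules are only a few lines each; a sketch, where probs[j][c] is assumed to hold the probability that instance j belongs to class c:

public class UncertaintySampling {

    // Least certainty: query the instance whose most likely class has
    // the smallest probability P(a).
    static int leastCertain(double[][] probs) {
        int best = -1;
        double smallestTop = Double.POSITIVE_INFINITY;
        for (int j = 0; j < probs.length; j++) {
            double top = max(probs[j]);
            if (top < smallestTop) { smallestTop = top; best = j; }
        }
        return best;
    }

    // Breaking ties: query the instance with the smallest difference
    // P(a) - P(b) between its two most likely classes.
    static int breakingTies(double[][] probs) {
        int best = -1;
        double smallestGap = Double.POSITIVE_INFINITY;
        for (int j = 0; j < probs.length; j++) {
            double first = Double.NEGATIVE_INFINITY;
            double second = Double.NEGATIVE_INFINITY;
            for (double p : probs[j]) {
                if (p > first) { second = first; first = p; }
                else if (p > second) { second = p; }
            }
            double gap = first - second;
            if (gap < smallestGap) { smallestGap = gap; best = j; }
        }
        return best;
    }

    private static double max(double[] a) {
        double m = Double.NEGATIVE_INFINITY;
        for (double v : a) m = Math.max(m, v);
        return m;
    }
}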

These approaches were tested on a gray-scale image dataset [LKG+05]. The authors report a good accuracy and a reduced number of labeled images required to reach it. The breaking ties approach outperforms least certainty, and using batch sampling was also effective.

3.5 Other approaches

3.5.1 A Semidefinite Programming Approach

Semidefinite programming is an extension of linear and quadratic programming. A semidefinite programming problem is a convex constrained optimization problem. With semidefinite programming one tries to optimize a symmetric n × n matrix of variables X [XS05]. Semidefinite programming can be used to apply Support Vector Machines in an unsupervised and semi-supervised context. For clustering, the goal is not to find a large margin classifier using the labeled data (as with supervised learning) but instead to find a labeling that results in a large margin classifier. Therefore every possible labeling has to be evaluated and the labeling with the maximum margin has to be chosen. Obviously this is computationally very expensive, but Xu and Schuurmans found that it can be approximated using semidefinite programming. This unsupervised approach can be easily extended to semi-supervised learning, where a small labeled training set has to be considered. Note that this approach also works for multi-class problems [XS05]. There is one important difference between this approach and the approaches discussed above: here the algorithm uses the unlabeled data directly, which means no human expert is asked to label it. In this case semi-supervised learning is a combination of supervised learning using the given labeled training set and unsupervised learning using the unlabeled data.

3.5.2 S³VM

This approach was introduced by Bennett and Demiriz [BD99]. Similar to the above approach, no human is asked to label instances. Instead the unlabeled data is incorporated into the formulation of the optimization problem. S³VM reformulates the original definition by adding two constraints for each instance of the unlabeled dataset. Considering a binary SVM, one constraint calculates the misclassification error as if the instance were in class 1 and the second constraint as if the instance were in class -1. S³VM tries to minimize these two possible misclassification errors; the labeling with the smallest error is the final labeling. Moreover, Bennett and Demiriz introduce some optimization techniques for this. An analysis of how this approach performs in a multi-class environment is not presented.
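In formula form the idea can be sketched as follows. This is a simplified rendering of the construction in [BD99], not their exact notation; l labeled and u unlabeled instances are assumed, ξ_j⁺ is the error x_j would incur as a member of class +1 and ξ_j⁻ the error it would incur as a member of class -1, and only the smaller of the two enters the objective:

min_{ω,b,ξ,ξ⁺,ξ⁻}  ||ω|| + C ∑_{i=1}^{l} ξ_i + C ∑_{j=l+1}^{l+u} min(ξ_j⁺, ξ_j⁻)

subject to  y_i (ω ∗ x_i + b) ≥ 1 − ξ_i,     i = 1...l
            ω ∗ x_j + b ≥ 1 − ξ_j⁺,          j = l+1...l+u
            −(ω ∗ x_j + b) ≥ 1 − ξ_j⁻,       j = l+1...l+u
            ξ_i, ξ_j⁺, ξ_j⁻ ≥ 0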

3.6 Summary

Semi-supervised learning is a promising approach to reduce the amount of labeled instances needed for training SVMs by asking a human expert to label relevant instances from an unlabeled pool. As outlined, there are many different approaches available. We can use clustering, which can also be used as a semi-supervised learning approach with other machine learning algorithms. In contrast, the version space based methods presented here focus on SVMs and promise good accuracy results, but are primarily usable for binary classification tasks; extending these approaches to multi-class problems is an ongoing research topic. Simple but effective approaches are the probability based methods, which can be easily used in a multi-class context and are therefore very convenient. S³VM and the semidefinite programming approach are also semi-supervised learning approaches, but here no human is asked to label relevant instances: whereas the former incorporates unlabeled instances into the formulation of the optimization problem, the latter tries to find the labeling with the largest margin.


Chapter 4

Experiments

4.1 Experiment Setting

To experiment with the different approaches presented in this work I have implemented two applications. ssSVM is a semi-supervised SVM implementation and supports different semi-supervised learning approaches like Least Certainty and Breaking Ties. ssSVM uses RapidMiner, an open-source data mining platform, which provides a comprehensive API for machine learning tasks like different classification methods, different clustering methods and of course different SVM implementations. ssSVM is also based on Spring, mainly an inversion-of-control container, and is therefore highly configurable and extensible. Furthermore it wraps the WordVector Tool for creating word vectors from texts. The second implemented application is the GUI for ssSVM. It is called ssSVMToolbox and is based on Eclipse RCP. The chapters 4.1.2 and 4.1.3 as well as the links in appendix A provide detailed information.

4.1.1 Evaluated Approaches

I compared the following approaches and evaluated their performance on the different data sets:

1. Least Certainty (LC)
2. Breaking Ties (BT)
3. Most Certainty (MC)
4. Simple Margin (SM)
5. Random Sampling (RS)

I separated every dataset into three subsets:

1. a training set for supervised learning (in this work also called the reduced set)
2. a training set for semi-supervised learning (used to query instances for the feedback)
3. a test set to evaluate the performance

Using the reduced set and the training set for semi-supervised learning (merged, also called the whole set) I trained a common SVM to get an upper bound, and used the reduced set alone to get the lower bound. So the accuracies of the different approaches should lie between these bounds. Furthermore I used a random sampling strategy (RS) to show that the different approaches are better than an approach which randomly chooses instances for feedback.

I compared two different modes:

1. incrementally increased training size: the feedback size is set to 1 and the training size is incrementally increased
2. batch mode: the feedback size is set to a certain value (e.g. 50); over some iterations the feedback size is increased and the results with these different feedback sizes are compared

4.1.2 ssSVM

ssSVM (semi-supervised Support Vector Machine) is a Java application capable of performing semi-supervised learning tasks with Support Vector Machines. It is based on RapidMiner, an Open Source data mining tool, and on Spring, an IOC container. See Relevant Links for more information (appendix A).

The core of the application is the application context sssvmContext.xml. As in RapidMiner, ssSVM supports different operators; this file configures which operators ssSVM actually supports (which input sources, which SVM implementations, which validators, ...).

<!-- ... -->
<property name="tokenProcessors">
    <list>
        <value>com.rapidminer.operator.tokenizer.SimpleTokenizer</value>
        <value>com.rapidminer.operator.tokenizer.NGramTokenizer</value>
        <value>com.rapidminer.operator.tokenizer.TermNGramGenerator</value>
        <value>com.rapidminer.operator.reducer.GermanStemmer</value>
        <value>com.rapidminer.operator.reducer.LovinsStemmer</value>
        <value>com.rapidminer.operator.reducer.PorterStemmer</value>
        <value>com.rapidminer.operator.reducer.SnowballStemmer</value>
        <value>com.rapidminer.operator.reducer.ToLowerCaseConverter</value>
        <value>com.rapidminer.operator.wordfilter.EnglishStopwordFilter</value>
        <value>com.rapidminer.operator.wordfilter.GermanStopwordFilter</value>
        <value>com.rapidminer.operator.wordfilter.StopwordFilterFile</value>
        <value>com.rapidminer.operator.wordfilter.TokenLengthFilter</value>
    </list>
</property>
<property name="supportedReader">
    <list>
        <value>com.rapidminer.operator.io.CSVExampleSource</value>
        <value>com.rapidminer.operator.io.SparseFormatExampleSource</value>
        <value>com.rapidminer.operator.io.ArffExampleSource</value>
    </list>
</property>
<property name="params">
    <!-- ... -->
</property>
<property name="supportedSVMLearer">
    <list>
        <value>com.rapidminer.operator.learner.functions.kernel.LibSVMLearner</value>
        <value>com.rapidminer.operator.learner.functions.kernel.JMySVMLearner</value>
    </list>
</property>
<property name="supportedValidator">
    <list>
        <value>com.rapidminer.operator.validation.XValidation</value>
        <value>com.rapidminer.operator.validation.FixedSplitValidationChain</value>
    </list>
</property>
<property name="supportedPerfEvaluator">
    <list>
        <value>com.rapidminer.operator.performance.SimplePerformanceEvaluator</value>
        <value>com.rapidminer.operator.performance.PolynominalClassificationPerformanceEvaluator</value>
    </list>
</property>
<property name="supportedClusterer">
    <list>
        <value>com.rapidminer.operator.learner.clustering.clusterer.KMeans</value>
        <value>com.rapidminer.operator.learner.clustering.clusterer.SVClusteringOperator</value>
        <value>com.rapidminer.operator.learner.clustering.clusterer.KernelKMeans</value>
    </list>
</property>
<!-- ... -->

To use ssSVM for a concrete experiment another configuration is necessary. There the runtime properties for the experiment have to be provided. Instead of describing this file I provide an example. For a detailed description of which parameters and parameter values are supported, see the RapidMiner documentation (appendix A).

<!-- ... -->
<value>false</value>
<!-- ... -->
<property name="preprocessing"><!-- ... --></property>
<property name="parameter"><!-- ... --></property>
<property name="additionalProps">
    <value>./datasets/breast_cancer_wisconsin/wdbc_as_labeled.data</value>
</property>
<!-- ... -->
<property name="preprocessing"><!-- ... --></property>
<property name="parameter"><!-- ... --></property>
<property name="additionalProps">
    <value>./datasets/breast_cancer_wisconsin/wdbc_testset.data</value>
</property>
<!-- ... -->
<property name="preprocessing"><!-- ... --></property>
<property name="parameter"><!-- ... --></property>
<property name="additionalProps">
    <value>./datasets/breast_cancer_wisconsin/wdbc_as_unlabeled.data</value>
</property>
<!-- ... -->
<property name="numberOfInstancesForFeedback" value="0" />
<!-- ... -->
<property name="comparator"><!-- ... --></property>
<property name="numberOfInstancesForFeedback" value="10" />
<property name="comparator"><!-- ... --></property>
<property name="numberOfInstancesForFeedback" value="10" />
<property name="comparator"><!-- ... --></property>
<property name="numberOfInstancesForFeedback" value="10" />
<property name="numberOfInstancesForFeedback" value="10" />
<property name="params"><!-- ... --></property>
<property name="sssvmLearner"><!-- ... --></property>
<property name="seed" value="123456789" />
<property name="numberOfInstancesForFeedback" value="10" />
<property name="props">
    <!-- ... -->
    <value>com.rapidminer.operator.learner.functions.kernel.jmysvm.kernel.KernelRadial</value>
    <value>0.8</value>
    <!-- ... -->
</property>
<property name="params"><!-- ... --></property>
<property name="clusterParams"><!-- ... --></property>
<property name="svmLearner" value="libSVM" />
<property name="validator" value="xval" />
<property name="perfEvaluator" value="simple" />
<property name="clusterModelHandler"><!-- ... --></property>
<!-- ... -->
<property name="clusterers">
    <list>
        <value>kmeans</value>
        <value>kernelKmeans</value>
    </list>
</property>
<property name="samplingStrategies">
    <!-- ... -->
</property>
<!-- ... -->

The following code performs this experiment:

final RuntimeHandler r = new RuntimeHandler("wdbc.xml");
final SSSVMLearner learner = r.getSSSVMLearner();

// one feedback round
final ExampleSet feedbackSet = learner.queryInstances(r.getLabeledExampleSet(),
        r.getUnlabeledExampleSet());
final ExampleSet all = ExampleSetUtils.merge(r.getLabeledExampleSet(), feedbackSet);

// use a SVM implementation for training
final IOObject[] resultSSSVM = r.getSVMLearner().learn(r.getRuntimeConfig().getSvmLearner(),
        r.getRuntimeConfig().getSvmPerfEvaluator(), all);

// get performance of self test
final PerformanceVector pvXval = ((PerformanceVector) resultSSSVM[1]);

// use model on a separate test set
final PerformanceVector pvTest = r.getSVMLearner().test((Model) resultSSSVM[0],
        r.getRuntimeConfig().getSvmPerfEvaluator(), r.getTestExampleSet());

The incrementally increased training size mode can be executed by this code:

protected List<Performance> performSSSVMStepwise(final String experiment,
        final SamplingStrategy samplingStrategy) throws Exception {
    final RuntimeHandler r = new RuntimeHandler(experiment);
    // learn sssvm
    final SSSVMLearner learner = r.getSSSVMLearner();

    ExampleSet all = (ExampleSet) r.getLabeledExampleSet().clone();
    final ExampleSet unlabeledSet = (ExampleSet) r.getUnlabeledExampleSet().clone();
    final List<Performance> results = new LinkedList<Performance>();
    final int feedbackSize = 10;
    learner.getSamplingStrategies().clear();
    learner.addSamplingStrategy(samplingStrategy);
    samplingStrategy.setNumberOfInstancesForFeedback(feedbackSize);
    for (int i = 0; i < unlabeledSet.size() / 10; i++) {
        final ExampleSet feedbackSet = learner.queryInstances(all,
                ExampleSetUtils.intersect(unlabeledSet, all));
        all = ExampleSetUtils.merge(all, feedbackSet);
        final IOObject[] resultSSSVM = r.getSVMLearner().learn(r.getRuntimeConfig().getSvmLearner(),
                r.getRuntimeConfig().getSvmPerfEvaluator(), all);
        final PerformanceVector pvXval = ((PerformanceVector) resultSSSVM[1]);
        final PerformanceVector pvTest = r.getSVMLearner().test((Model) resultSSSVM[0],
                r.getRuntimeConfig().getSvmPerfEvaluator(), r.getTestExampleSet());
        final Performance perf = new Performance(pvXval, pvTest, all.size(), all.size()
                - r.getLabeledExampleSet().size());
        results.add(perf);
    }
    return results;
}

If, e.g., a new input source should be used, it has to be configured in sssvmContext.xml; after that it can be used in the experiment configuration.

Table 4.1 describes the important packages of ssSVM.


package name           description

sssvm                  contains the ssSVM implementation and the core classes for running experiments
sssvm.clustermodel     contains cluster models for using clusterers in a semi-supervised manner
sssvm.confidencemodel  contains the implementation of the probability based semi-supervised approaches (Breaking Ties, Least Certainty, ...)
sssvm.sampling         contains different sampling strategies
sssvm.preprocessing    contains preprocessing methods
sssvm.text             wraps the WVTool for creating word vectors from texts

Table 4.1: Packages

4.1.3 ssSVMToolbox

This is the graphical user interface of ssSVM, based on Eclipse RCP. Using the ssSVMToolbox one can create, configure and run experiments. The application uses ssSVM to perform supervised and semi-supervised learning with SVMs and offers the same capabilities as ssSVM. Technically, the toolbox is a GUI for manipulating experiment XML files.

Running experiments is straightforward. First one has to create a new experiment. The toolbox consists of several tabs. On the Input tab one can configure the data sources of the experiment: the input format (e.g. CSV), the filenames of the example sets and additional parameters for the example sets. The Preprocessing tab provides the configuration for preprocessing tasks like discretization and the transformation of nominal to numeric attributes. The ssSVM Learner tab is the core of the toolbox. Here one can choose between different SVM learners, set the SVM parameters such as the kernel type, and activate or deactivate the different sampling strategies. For every sampling strategy one can set the feedback size. Finally one can execute the ssSVM experiment. After doing so, the Feedback Set table shows the instances to be labeled by the human expert. Some features are shown (double-clicking on a row opens a dialog with the whole instance) and the user can label the instances by clicking on the Label cell. The current accuracy on the test set is also shown. The Result tab shows the accuracies and the confusion matrix.

Figure 4.1 shows the Input tab, whereas Figure 4.2 shows the ssSVM tab.

By repeatedly executing the experiment one can work with incrementally increased training sizes; by setting the feedback size to values > 1 one can test the batch mode (see the sketch below); and by choosing different sampling strategies one can experiment with combinations of them.
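To make the two modes concrete, a short sketch using the API from the listing in Section 4.1.2 (the surrounding setup is assumed to be the same as shown there):

// incremental mode: one instance per feedback round, many rounds
samplingStrategy.setNumberOfInstancesForFeedback(1);

// batch mode: e.g. 50 instances in a single feedback round
samplingStrategy.setNumberOfInstancesForFeedback(50);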

In the next sections I show the results of my experiments. For some of them I used the ssSVMToolbox; for more sophisticated results (e.g. to create the different figures) I used a programmatic approach that executes several experiments with different settings at once. See Section 4.1.2 for detailed information and example source code.

Figure 4.1: Screenshot of the Input tab of the ssSVMToolbox

Figure 4.2: Screenshot of the ssSVM tab of the ssSVMToolbox

Figure 4.3: Binary Gaussian Distribution (µ₁ = 3, σ₁ = 3, µ₂ = 4, σ₂ = 3)

4.2 Artificial Datasets

4.2.1 Gaussian Distributed Data

For these experiments I generated Gaussian distributed data: two datasets, each with two overlapping classes. In the first dataset, ds₁, the σs are equal; in the second, ds₂, they differ. Figures 4.3 and 4.4 show plots of these datasets.
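Such data is easy to reproduce. The following is only a sketch, assuming one-dimensional classes with the ds₁ parameters from Figure 4.3; the class GaussianData is hypothetical and not part of ssSVM:

import java.util.Random;

public class GaussianData {
    public static void main(final String[] args) {
        final Random rnd = new Random(42);
        // ds1: class 1 with mu = 3, sigma = 3; class 2 with mu = 4, sigma = 3
        for (int i = 0; i < 420; i++) {
            System.out.println((3 + 3 * rnd.nextGaussian()) + ",1");
            System.out.println((4 + 3 * rnd.nextGaussian()) + ",2");
        }
    }
}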

For these datasets I evaluated the different approaches (Section 4.1.1). Table 4.2 shows the upper and lower bounds and the results of the ssSVM approach using a feedback size of 50.

                   whole set   reduced set   LC     BT     MC     SM     RS
self test          0.67        0.5           0.56   0.6    0.68   0.74   0.38
test set           0.67        0.5           0.62   0.5    0.66   0.67   0.46
training set size  840         40            50     50     50     50     50

Table 4.2: Summary of the experiments with ds₁

Figure 4.4: Binary Gaussian Distribution (µ₁ = 12, σ₁ = 15, µ₂ = 17, σ₂ = 1)

Figure 4.5 gives a more detailed insight into the performance of the semi-supervised SVM. There the feedback size was set to 1 and ssSVM was used to incrementally increase the training set size. As we can see, after approximately 50 iterations Simple Margin and Most Certainty deliver good results in comparison with conventional SVMs, but with much less data. Breaking Ties, Least Certainty and Most Certainty are the most stable and outperform Random Sampling.

Figure 4.6 shows how the implementation performs with different feedback sizes in batch mode.

The lower and upper bounds of the second dataset and the performance of ssSVM with feedback size 50 can be found in Table 4.3.

                   whole set   reduced set   LC     BT     MC     SM     RS
self test          0.77        0.9           0.83   0.78   0.95   0.66   0.76
test set           0.77        0.43          0.70   0.63   0.44   0.5    0.59
training set size  840         40            90     90     90     90     90

Table 4.3: Summary of the experiments with ds₂

The performance of ssSVM with feedback size 1 and incrementally increased training set size is highlighted in Figure 4.7. What remains is an overview of how ssSVM performs on this dataset in batch mode; Figure 4.8 highlights these results.

Figure 4.5: Incrementally increased training size, ds₁

Figure 4.6: Different feedback sizes in batch mode, ds₁

Figure 4.7: Incrementally increased training size, ds₂

Figure 4.8: Different feedback sizes in batch mode, ds₂


Figure 4.9: Incrementally increased training size, ds₁, RBF kernel

Both datasets show that the semi-supervised SVM approaches deliver results similar to the supervised approach, but with a smaller training set. The incremental version outperforms the supervised approach with respect to the training set size and is better than the batch semi-supervised version; the batch version is of course more practical and also performs better than the supervised approach.

Different Kernels

For the above experiments I used the linear kernel. To see how the chosen kernel influences the results of the semi-supervised learning approaches, I also ran the experiments on dataset ds₁ with polynomial and RBF kernels. Figures 4.10 and 4.9 are analogous to Figure 4.5. For these datasets we can conclude that the chosen kernel influences the result of the SVM but has no specific impact on the semi-supervised approaches.
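For reference, these are the usual parameterizations of the kernels involved; γ, c and d denote the standard kernel parameters, set per experiment:

\[
K_{\mathrm{lin}}(x, y) = \langle x, y \rangle, \qquad
K_{\mathrm{poly}}(x, y) = (\langle x, y \rangle + c)^d, \qquad
K_{\mathrm{RBF}}(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right)
\]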

4.2.2 Two Spirals Dataset

I also applied ssSVM to a Two Spirals dataset (Figure 4.11). Table 4.4 shows the lower and upper bounds and the ssSVM accuracies for this dataset. The performance of ssSVM with feedback size 1 and incrementally increased training set size is highlighted in Figure 4.12; the results of using batch mode can be found in Figure 4.13.

Figure 4.10: Incrementally increased training size, ds₁, polynomial kernel (degree = 3)

Figure 4.11: Two Spirals Dataset

                   whole set   reduced set   LC     BT     MC     SM     RS
self test          1           0.1           0      0      1      1      1
test set           0.85        0.32          0.33   0.31   0.67   0.74   0.72
training set size  104         38            48     48     48     48     48

Table 4.4: Summary of the experiments with the Two Spirals dataset

Figure 4.12: Incrementally increased training size, Two Spirals dataset


Figure 4.13: Different feedback sizes in batch mode, Two Spirals dataset

As with the Gaussian datasets, these experiments show that with ssSVM the necessary amount of training instances can be reduced significantly.

4.2.3 Chain Link Dataset

The last artificial dataset I used to evaluate ssSVM is the Chain Link dataset (Figure 4.14). Table 4.5 shows the upper and lower bounds; Figures 4.15 and 4.16 show the accuracies with incrementally increased training sets and with different batch sizes.

                   whole set   reduced set   LC     BT     MC     SM     RS
self test          0.89        0.66          0.77   0.7    0.67   0.67   0.86
test set           0.9         0.76          0.86   0.75   0.73   0.81   0.66
training set size  681         30            40     40     40     40     40

Table 4.5: Summary of the experiments with the Chain Link dataset

Figure 4.14: Chain Link Dataset

Figure 4.15: Incrementally increased training size, Chain Link dataset

Figure 4.16: Different feedback sizes in batch mode, Chain Link dataset

4.2.4 Summary

We could see that the semi-supervised SVM approaches reduced the amount of labeled data needed significantly. They delivered accuracies similar to those of the common SVM approach, but with a much smaller training set. As expected, the incremental version performs better than the batch version. Breaking Ties, Least Certainty, Simple Margin and Most Certainty perform better than Random Sampling, but no single 'winner' could be found.

4.3 Datasets from UCI Machine Learning Repository

Besides the generated datasets, I evaluated my implementation using some datasets from the UCI Machine Learning Repository (Appendix A). I used the following datasets:

1. abalone
2. breast cancer (WDBC)
3. heart scale
4. hill valley
5. kr-vs-kp

Detailed information about the datasets can be found on the UCI Machine Learning Repository homepage. Again I separated each dataset into sets for supervised learning, semi-supervised learning and testing (see the sketch below). Note that I did not try to optimize the SVM kernel parameters to get good accuracies, so some accuracies are rather low. Instead I used different parameters for different datasets (e.g. different kernel types) and, for each dataset, the same parameters when comparing supervised and semi-supervised learning.
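A minimal sketch of such a three-way split; the proportions shown are illustrative, not the exact ones used, and the class SplitSketch is hypothetical:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public final class SplitSketch {
    // splits the shuffled examples into an initially labeled set,
    // an unlabeled pool and a held-out test set
    public static <T> List<List<T>> split(final List<T> examples, final long seed) {
        final List<T> shuffled = new ArrayList<T>(examples);
        Collections.shuffle(shuffled, new Random(seed));
        final int labeledEnd = shuffled.size() / 10;          // e.g. 10% initially labeled
        final int unlabeledEnd = (7 * shuffled.size()) / 10;  // e.g. 60% unlabeled pool
        final List<List<T>> parts = new ArrayList<List<T>>();
        parts.add(new ArrayList<T>(shuffled.subList(0, labeledEnd)));
        parts.add(new ArrayList<T>(shuffled.subList(labeledEnd, unlabeledEnd)));
        parts.add(new ArrayList<T>(shuffled.subList(unlabeledEnd, shuffled.size()))); // 30% test
        return parts;
    }
}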

Two modes were used for the semi-supervised approach: a simple batch mode with only one feedback round, and a mode with 10 feedback rounds.

Tables 4.6 and 4.7 outline the results of these experiments. Again, the semi-supervised approaches deliver good accuracy, but with a reduced sample size compared to the whole training set.

               whole set   reduced set   LC     BT     MC     SM     RS
heart scale    0.84        0.75          0.83   0.83   0.78   0.77   0.80
WDBC           0.94        0.75          0.94   0.94   0.77   0.80   0.87
WDBC (RBF)     0.85        0.25          0.52   0.52   0.28   0.50   0.45
abalone        0.54        0.44          0.51   0.53   0.44   0.51   0.51
hill valley    0.94        0.85          0.89   0.89   0.87   0.85   0.86
kr-vs-kp       0.44        0.29          0.39   0.43   0.42   0.33   0.22

Table 4.6: Evaluation of the semi-supervised SVM approaches (1 iteration, feedback size 50)

               whole set   reduced set   LC     BT     MC     SM     RS
heart scale    0.84        0.75          0.83   0.83   0.76   0.84   0.8
WDBC           0.94        0.75          0.94   0.94   0.77   0.87   0.80
WDBC (RBF)     0.85        0.25          0.69   0.69   0.28   0.47   0.46
abalone        0.54        0.44          0.50   0.51   0.43   0.49   0.51
hill valley    0.94        0.85          0.89   0.88   0.88   0.86   0.86
kr-vs-kp       0.44        0.29          0.47   0.53   0.31   0.21   0.16

Table 4.7: Evaluation of the semi-supervised SVM approaches (10 iterations, feedback size 50)

These datasets show that Least Certainty and Breaking Ties often deliver similar results and outperform the other approaches.
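As a reminder of what these two criteria compute, here is a minimal, implementation-independent sketch; the class probabilities are assumed to come from a probabilistic SVM output, and the class SelectionScores is hypothetical:

import java.util.Arrays;

public final class SelectionScores {
    // Least Certainty: the highest class probability of an instance;
    // the unlabeled instances with the lowest score are queried first
    static double leastCertainty(final double[] classProbabilities) {
        double max = 0.0;
        for (final double p : classProbabilities) {
            max = Math.max(max, p);
        }
        return max;
    }

    // Breaking Ties: the gap between the two highest class probabilities;
    // a small gap means the classifier nearly 'ties' between two classes,
    // so again the instances with the lowest score are queried first
    static double breakingTies(final double[] classProbabilities) {
        final double[] sorted = classProbabilities.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length - 1] - sorted[sorted.length - 2];
    }
}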



Chapter 5

Conclusion

In this work I summarized different approaches to semi-supervised learning for Support Vector Machines. We have seen that most of them try to narrow the margin of the hyperplane; the version-space-based and the probability-based methods belong to this category. Semi-supervised learning approaches promise to reduce the amount of training data needed by performing so-called feedback rounds, in which a human expert is asked to label instances that are relevant for the given classification task. The experiments with different datasets have shown that ssSVM, my semi-supervised learning implementation for SVMs, keeps this promise: with ssSVM one can obtain accuracies similar to those of usual SVMs with less training data.

One drawback of the presented semi-supervised learning approaches is that they introduce a new parameter, the feedback size. The feedback size influences not only the accuracy but also the acceptance by the human expert. If the feedback size is too large, the human expert has to label many instances and can get bored (as in the supervised case); if it is too small, the accuracy can suffer. Because the optimal value for the feedback size depends on the dataset and the chosen approach, there is no general rule for setting it. Additionally, the number of feedback rounds must also be chosen.

I compared Least Certainty, Breaking Ties, Most Certainty and Simple Margin with Random Sampling and could show that these approaches outperform the latter. Which approach should be chosen depends on the dataset, although Least Certainty and Breaking Ties seem to be the most stable and are generally good choices.

A remaining problem is that no practical online tuning algorithm for kernel parameters exists yet: if we add a new instance to the training set, the optimal kernel parameters can change.

Nevertheless, my experiments with ssSVM show that semi-supervised approaches help to reduce the amount of labeled training data needed and are therefore valuable.


Appendix A

Relevant Links

• Word Vector Tool - an open-source tool for creating word vectors from texts: http://www.wvtool.nemoz.org/

• RapidMiner - an open-source data mining tool: http://www.rapidminer.com

• Spring Framework - an IoC container: http://springframework.org/

• Eclipse RCP - the Eclipse Rich Client Platform: http://wiki.eclipse.org/index.php/Rich_Client_Platform

• UCI Machine Learning Repository - repository containing different data sets: http://archive.ics.uci.edu/ml/


