Semi-Supervised Learning With SVMs.pdf - Read

Semi-supervised Learning With 

Support Vector Machines 

B A K K A L A U R E A T S A R B E I T 

von Andre Guggenberger 

Matrikelnummer 0327514 

eingereicht an der 

Technischen Universität Wien 

im September 2008

ABSTRACT 

Support Vector Machines are a modern technique in the field of machine learning 

and have been successfully used in different fields of application. In general they 

are used for some kind of classification task, where they learn from a randomly 

selected training set, which has been classified in advance and then are applied on 

unseen instances. To get a good classification result it is often necessary that this 

training set contains a huge set of labeled instances. But for humans labeling of 

data is a time-consuming and boring task. Some algorithms address this problem 

and overcome this by learning on both, a small amount of labeled and a huge 

amount of unlabeled instances. There the learner has access to the pool of 

unlabeled instances and requests the label for some specific instances from an 

user. Then the learner uses all labeled data to learn the model. The choice of the 

unlabeled instances which should be labeled next has a significant impact on the 

quality of the resulting model. This kind of learning is called semi-supervised 

learning or active learning. Currently there exist some different solutions for 

semi-supervised learning. This work focuses on the most known ones and gives an 

overview about them. 

KURZFASSUNG 

Support Vector Machines sind eine moderne Technik im Bereich vom maschinellen 

Lernen und wurden mittlerweile in verschiedenen Anwendungsgebieten erfolgreich 

eingesetzt. Generell werden sie für Klassifikationsaufgaben verwendet, wobei sie 

von einer zufällig gewählte Menge von schon vorklassifizierten Trainingsdaten 

lernen und dann auf noch unbekannte Daten angewendet werden. Um ein gutes 

Klassifikationsergebnis zu erhalten, ist es oft notwendig, eine große Menge von 

vorklassifizierten Trainingsdaten zum Training zu verwenden. Das manuelle 

Klassifizieren von den Daten durch Menschen ist oft eine zeitaufwendige und 

langweilige Aufgabe. Um dies zu erleichtern wurden Algorithmen entwickelt, um 

mit schon wenigen klassifizierten und vielen nichtklassifizierten Daten ein Modell 

zu erstellen. Dabei hat der Klassifikator Zugang zu dem Pool von 

nichtklassifizierten Daten und fragt einen Benutzer nach der Klasse für einige 

spezielle Instanzen. Dann benützt er alle klassifizierten Daten zum Erstellen des 

Modells. Die Wahl jener noch nicht klassifizierten Instanzen, die von einem 

Experten klassifiziert werden sollen, hat einen signifikanten Einfluss auf die 

Qualität des resultierenden Modells. Diese Art des maschinellen Lernens wird als 

Semi-überwachtes Lernen oder aktives Lernen) bezeichnet. Momentan existieren 

verschiedenste Ansätze für Semi-überwachtes Lernen. Diese Arbeit behandelt die 

bekanntesten und liefert eine Übersicht über die verschiedenen Ansätze. 

1

Contents 

1 Introduction 4 

2 Basic Definitions of Support Vector Machines 5 

3 Semi-supervised Learning 9 

3.1 Random Subset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 

3.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 

3.3 Version Space Based Methods . . . . . . . . . . . . . . . . . . . . . 11 

3.3.1 Theory of the Version Space . . . . . . . . . . . . . . . . . . 11 

3.3.2 Simple Method . . . . . . . . . . . . . . . . . . . . . . . . . 12 

3.3.3 Batch-Simple Method . . . . . . . . . . . . . . . . . . . . . 14 

3.3.4 Angle Diversity Strategy . . . . . . . . . . . . . . . . . . . . 15 

3.3.5 Multi-Class Problem . . . . . . . . . . . . . . . . . . . . . . 16 

3.4 Probability Based Method . . . . . . . . . . . . . . . . . . . . . . . 17 

3.4.1 The Probability Model . . . . . . . . . . . . . . . . . . . . . 17 

3.4.2 Least Certainty and Breaking Ties . . . . . . . . . . . . . . 18 

3.5 Other approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 

3.5.1 A Semidefinite Programming Approach . . . . . . . . . . . . 18 

3.5.2 S 3 V M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 

4 Experiments 21 

4.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 

4.1.1 Evaluated Approaches . . . . . . . . . . . . . . . . . . . . . 21 

4.1.2 ssSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 

4.1.3 ssSVMToolbox . . . . . . . . . . . . . . . . . . . . . . . . . 28 

4.2 Artificial Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 

4.2.1 Gaussian Distributed Data . . . . . . . . . . . . . . . . . . . 31 

4.2.2 Two Spirals Dataset . . . . . . . . . . . . . . . . . . . . . . 35 

4.2.3 Chain Link Dataset . . . . . . . . . . . . . . . . . . . . . . . 38 

4.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 

4.3 Datasets from UCI Machine Learning Repository . . . . . . . . . . 40 

2

5 Conclusion 42 

A Relevant Links 43 

3

Chapter 1 

Introduction 

Support Vector Machines (SVMs) are a modern technique in the field of machine 

learning and have been successfully used in different fields of application. Most of 

the time they were used in a supervised learning context. There the learner has 

access to a large set of labeled data and builds a model using these informations. 

After this learning step the learner is presented new instances and tries to predict 

the correct labels. Beside supervised learning there exists also unsupervised learning 

where the learner cannot access the labels of the instances. In this case the learner 

tries to predict the labels by partitioning the data and creating so called clusters. 

Providing a huge set of labeled data (as in the supervised case) can be very 

time-consuming (and therefore costly). Semi-supervised learning tries to reduce 

the needed amount of labeled data by analyzing the unlabeled data. There only 

relevant instances have to be labeled by a human expert. Of course the overall 

accuracy has to be on par with the supervised learning accuracy. 

In this work I explain the most common approaches for semi-supervised learning 

with SVMs. I begin by introducing some basic definitions i.e. the SVM hyperplane, 

the kernel function and the SVM maximization task (Chapter 2). A detailed discussion 

about the theory of Support Vector Machines is not provided. The main 

part of the work focuses on semi-supervised learning. I present a definition of semisupervised 

learning in contrast to supervised and unsupervised learning, discuss the 

most common approaches (Chapter 3) for Support Vector Machines, compare semisupervised 

SVMs and supervised SVMs and present the results of my experiments 

with some of them. I show how they perform with different datasets including some 

common machine learning datasets and one real-world datasets (Chapter 4). 

4

Chapter 2 

Basic Definitions of Support 

Vector Machines 

Consider a typical classification problem. Some input vectors (feature vectors) and 

some labels are given. The objective of classification problems is to predict the 

labels of new input vectors so that the error rate of the classification is minimal. 

There are many algorithms to solve such kind of problems. Some of them require 

that the input data is linearly separable (by a hyperplane). But for many 

applications this assumption is not appropriate. And even if the assumption holds, 

most of the time there are many possible solutions for the hyperplane (Figure 2.1). 

Because we are looking for a hyperplane where the classification error is minimal 

this can be seen as an optimization problem. In 1965 Vapnik ( [VC04], [Vap00]) 

introduced a mathematical approach to find a hyperplane with low generalization 

error. It is based on the theory of structural risk minimization, which states that 

the generalization error is influenced by the error on the training set and the complexity 

of the model. Based on this work Support Vector Machines were developed. 

They belong to the family of generalized linear classifiers and are so called maximum 

margin classifier. This means that the resulting hyperplane maximizes the 

distance between the ’nearest’ vectors of different classes with the assumption that 

a large margin is better for the gerneraliziation ability of the SVM. These ’nearest’ 

vectors are called support vectors (SV) and SVMs consider only these vectors for 

the classification task. All other vectors can be ignored. Figure 2.2 illustrates a 

maximum margin classifier and the support vectors. 

In the context of SVMs it is also important to mind kernel functions. They 

project the low-dimensional training data to a higher dimensional feature space, 

because the separation of the training data is often easier achieved in this higher 

dimensional space. Moreover through this projection it is possible that training 

data, which couldn’t be separated linearly in the low-dimensional feature space, 

can be separated linearly in the high-dimensionl space. 

To understand semi-supervised learning we have to consider some mathematical 

5

Figure 2.1: Positive samples (green boxes) and negative samples (red circles). There 

are many possible solutions for the hyperplane (from [Mar03]) 

Figure 2.2: Maximum margin, the middle line is the hyperplane, the vectors on the 

other lines are the support vectors (from [Mar03]) 

6

ackground of SVMs. This is just a very short summary, beside very good resources 

on the internet Vapnik, Cristianini and Shawe-Taylor provide comprehensive introductions 

to Support Vector Machines [Vap00], [VC04] or [CST00]. 

At first we have to define the hyperplane, which separates the data and acts as 

the decision boundary. 

H(ω, b) = x|ω T ∗ x + b = 0 (2.1) 

where ω is a weight vector, x is an input vector and b is the bias. 

Note that ω points orthogonal to H. 

Because we are interested in maximizing the margin, we have to define the 

distance from a support vector to the hyperplane. 

ω T ∗ x + b 

||ω|| 

= ±1 

||ω|| 

(2.2) 

From this definition the margin m follows straightforward (see Figure 2.2 for an 

illustration). 

2 

(2.3) 

||ω|| 

The maximization task can be summarized as [TC01]: 

max min i{y i (ω ∗ φ(x ) )} (2.4) 

w∈F 

subject to ||ω|| = 1 

y i (ω ∗ φ(x i )) ≥ 1, i = 1...n. 

Note that this definition is only correct, if the data is linearly separable. In a 

non-linearly separable case we have to introduce slack variables. 

max min i{y i (ω ∗ φ(x ) )} (2.5) 

w∈F 

subject to 

ξ i ≥ 0 

y i (ω ∗ φ(x i )) ≥ 1 − ξ i , i = 1...n 

where ξ i are slack variables. 

Because SVMs try to maximize the margin we can restate the optimization task 

using the definition of the margin: 

min 

ω,ξ 

1 

2 ||ω||2 + C 

7 

n∑ 

ξ i (2.6) 

i=1

subject to 

ξ i ≥ 0 

y i (ω ∗ φ(x i )) ≥ 1 − ξ i , i = 1...n 

where C is the complexity parameter. This parameter controls the complexity 

of the decision boundary. Large C penalize errors whereas small C penalize the 

complexity [Mei02]. 

As said Support Vector Machines usually use so called kernels or kernel functions 

to project the data from a low-dimensional input space to a high-dimensional feature 

space. The kernel function K satisfies Mercer’s condition and we define K as: 

K(u, v) = φ(u) ∗ φ(v) (2.7) 

where φ : X− > F is a feature map [Mei02], [MIJ04]. One example of a feature 

map: 

f(x 1 , x 2 ) = (x 2 1, √ 2x 1 x 2 , x 2 2) (2.8) 

Using these feature map we can calculate the projection using the kernel K(u,v) = 

φ(u) ∗ φ(v) by computing the inner products of the data vectors and not the feature 

vectors φ(u) ∗ φ(v): 

K(u, v) = φ(u) ∗ φ(v) (2.9) 

= u 2 1v 2 1 + 2u 1 u 2 v 1 v 2 + u 2 2v 2 2 (2.10) 

whereas < u, v > is the inner product of u and v. 

In the context of SVMs we consider classifiers of the form: 

f(x) = 

where α i are the Lagrange multipliers. 

= (u 1 v 1 + u 2 v 2 ) 2 (2.11) 

= (< u, v >) 2 (2.12) 

n∑ 

α i K(x, x). (2.13) 

i=1 

8

Chapter 3 

Semi-supervised Learning 

The task of classification is also called supervised learning. In contrast to this the 

task of clustering is called unsupervised learning. There the learner doesn’t use 

labeled data, instead the learner tries to partitioning a datasets into clusters so 

that the data in a cluster share some common characteristics. 

Semi-supervised learning is a combination of supervised and unsupervised learning 

where typically a small amount of labeled and a large amount of unlabeled data 

are used for training. This is done because of two reasons. First labeling of a huge 

set of instances can be a time-consuming task. This classification has to be done by 

a skilled human expert and can be quiet costly. Semi-supervised learning reduces 

the needed amount of labeled instances and the associated costs. Note that in contrast 

the acquisition of the unlabeled data is usually relatively inexpensive. Second 

it has been shown that using unlabeled data for learning improves the accuracy of 

the produced learner [BD99]. G. Schohn and D. Cohn [SC00] report similar results, 

they state that a trained SVM on a well-chosen subset performs often better than 

on all available instances. 

Summing up the advantages of semi-supervised learning are (in many cases) 

better accuracy, fewer data and less training time. To achieve these the examples 

to be labeled should be selected properly. 

There are many different algorithms for semi-supervised learning with Support 

Vector Machines. Most of them require some kind of querying unlabeled instances 

to request the labels for them from a human expert. They differ in the way they 

select the next instances. The process of querying is called selective sampling. 

Sometimes semi-supervised learning is called active learning. As opposed to passive 

learning, where a classifier is trained using randomly selected labeled data, an 

active learner asks a user to label only ’important’ instances. Because the classifier 

gets feedback (the labels) about the for the classification relevant instances from an 

user, this process is called relevant feedback. 

Note that the approaches presented in 3.5.1 and 3.5.2 differ in that because here 

no feedback is necessary. 

9

3.1 Random Subset 

Obviously if we use a random process to select the unlabeled instances this learning 

cannot be considered as real semi-supervised learning. To get an appropriate accuracy 

the sampling strategy is as important as it is in case of supervised learning. 

Supervised learning and random subset semi-supervised learning are very similar 

and share most of the characteristics. 

Some researchers have experimented with this and have stated that the accuracy 

cannot keep up with real semi-supervised strategies. But they used this approach to 

compare it with the other semi-supervised learning approaches [FM01], [LKG + 05]. 

3.2 Clustering 

One approach is to use a clustering algorithm (unsupervised learning) on the unlabeled 

data. Then we can e.g. choose the cluster centers (centroids) as instances 

to be labeled by an expert. G. Fung and O. Mangasarian have used k-median 

clustering and report a good classification accuracy in comparison with supervised 

learning but with fewer labeled instances [FM01]. It’s worth to keep in mind that 

one has to define the correct number of the clusters in advance. Correct means that 

the clusters should be good representatives of the available classes. G. Fung and 

O. Mangasarian do not really address this but as for other clustering algorithm the 

choice of the number of clusters can be assumed to be critical. An obvious solution 

is to set the number of clusters equal to the number of classes. Additionally G. 

Fung and O. Mangasarian extend the clustering by an approach similiar to that 

described in chapter 3.5. 

A general algorithm could be described this way: 

1. Use the labeled data to build a model 

2. Using the unlabeled data calculate n clusters 

3. Query some instances for labeling by a human expert. Which instances depends 

on the alogrithm. Some examples: 

(a) Query the centroids 

(b) Query instances on the cluster boundaries 

(c) A combination of the above approaches 

Cebron and Berthold introduced an advanced clustering technique, they proposed 

a prototype based learning approach using a density estimation technique 

and a probability model for selecting prototypes to obtain the labels from an expert 

[CB07]. 

10

3.3 Version Space Based Methods 

Random Subset (chapter 3.1) and clustering (chapter 3.2) are simple but effective 

methods for semi-supervised learning. Depending on the given classification task 

the results can be quite good. Note that both can be used with other classifiers and 

are not limited to Support Vector Machines. Version space based methods are a 

more advanced technique, which use specific properties of Support Vector Machines 

for semi-supervised learning. But as we will see these approaches suffer from some 

critical limitations. 

The following approaches can be analyzed by their influence to the version space. 

Therefore it is worth to consider the theory of version spaces. 

3.3.1 Theory of the Version Space 

The version space was introduced by Tom Mitchell [Mit97]. It is the space containing 

all consistent hypotheses from the hypotheses space whereas the hypotheses 

space contains all possible hypotheses. 

In the context of SVMs the hypotheses are the hyperplanes and the version 

spaces contain all hyperplanes consistent with the current training data [TC01]. 

More formally, the hypotheses space (all possible hypotheses) is defined as: 

H = {f|f(x) = φ(x) ∗ ω whereω ∈ W } (3.1) 

||ω|| 

where the parameter space W is equal to the feature space F, f is a hypothesis. As 

in chapter 2 explained φ(x)∗ω is the definition of the (normalized) hyperplanes (Definition 

2.1). So this space contains all possible hyperplanes. Using this definition 

||ω|| 

we can define the version space: 

∨ 

V = {f ∈ H| y i f(x i ) > 0} (3.2) 

i∈{1...n} 

where y i is the class label. This definition eliminates all hypotheses (hyperplanes) 

not consistent with the given training data (Definition 2.4) 

Because there is a bijection between W (containing the unit vectors) and H 

(containing hyperplanes) we can redefine V [TC01]: 

V = {w ∈ W |||ω|| = 1, y i ∗ φ(x i )) > 0, i = 1...n} (3.3) 

There is a restriction of this definition: the training data has to be linearly 

separable in the feature space. But because it is possible to make every data linearly 

separable by modifying the kernel we can ignore this issue [STC99]. Furthermore 

because we often work in a high-dimensional feature space in many cases the data 

will be linearly separable. 

11

For our analysis it is important to note the duality between the feature space 

F and the parameter space W [TC01]. The unit vectors ω correspond to the decision 

boundaries f in F. This follows intuitively from the above definitions but this 

correspondence exists also converse. Let’s have a closer look on this issue. If one 

observes a new training instance x i in the feature space this instance reduces the 

set of all allowable hyperplane to ones that classify x i correctly. We can write this 

down more formally: every hyperplane must satisfy y i (ωφ(x i )) > 0, where y i is the 

label for the instance x i . As said before ω is the normal vector of the hyperplane 

in F. But we can think of y i φ(x i ) as being the normal vector of a hyperplane in 

W. It follows that ω(y i φ(x i )) = 0 defines a hyperplane in W. Recall that we have 

defined the version space V in W. Therefore the hyperplane is a boundary to the 

version space. It can be shown that the hyperplanes in W delimit the version space 

and from the definition of the maximization task of the SVMs it maximizes the 

minimum distance to any of this hyperplanes in W. SVMs find a center of the 

largest hypersphere in the version space, whose radius is the maximum margin and 

it can be shown that the hyperplanes touched by the hypersphere correspond to the 

support vectors and that the ω i often lie in the center of the version space [TC01]. 

3.3.2 Simple Method 

Linear SVMs perform best when applied in high-dimensional domains (such as 

text classification). There the number of features is much larger than the number 

of examples and therefore the training data cannot cover the whole dimensions, 

meaning that the subspace spanned by the training examples is much smaller than 

the space containing all dimensions. Considering this observation G. Schohn and 

D. Cohn propose that a simple method to select instances for labeling is to search 

for examples that are orthogonal to the space spanned by the current training 

data [SC00]. Doing this would give the learner informations about yet not covered 

dimensions. Alternatively one can choose those instances which are near to the 

dividing hyperplane to improve the confidence in currently known dimensions. This 

is an attempt to narrow the existing margin. To maximally narrow the margin one 

would select those instances lying on the hyperplane. The interesting result from 

G. Schohn and D. Cohn is that training on a small subset of the data leads in most 

cases to a better performance than training on all available data. 

Remains the analysis of the computation of the proximity of a training instance 

to the hyperplane: this is inexpensive, because one can compute the hyperplane 

and evaluate each instance using a single dot product. 

The distance between a feature vector φ(x) and the hyperplane ω: 

|φ(x) ∗ ω| (3.4) 

Let’s have a look, how this simple method influences the version space. Given 

an unlabeled instance x i we can test how close the corresponding hyperplane in 

12

Figure 3.1: The gray line is the old hyperplane, the green lines are the old margins, 

’o’ is a new example and the black line the new hyperplane, when the new instance 

was labeld as ’-’ (from [SC00]) 

W comes to the center of the hypersphere (the ω i ). If we choose the instance x i 

closest to the center we can reduce the version space as much as possible (and 

this will of course reduce the amount of consistent hypotheses). This distance 

can be easily computed using the above formular. By choosing the instance x i , 

who come closest to the hyperplane in F, we maximally reduce the margin and 

the version space. Figure 3.1 shows the effect of an instance on the hyperplane 

graphically. There the bottom figure shows that by placing an instance to the 

center of the old hyperplane the margin (calculated using the new hyperplane) gets 

changed significantly. Placing an instance on the old hyperplane too far out has 

little impact on the margin, as we can see on the top figure. 

A more sophisticated description of this can be found in [TK02]. There three 

different approaches are presented, each trying to reduce the version space as much 

as possible. Note that these definitions rely on the assumption that the given 

problem is binary (two classes). 

1. Simple Margin: This is the method already described: choose the next instance 

closest to the hyperplane 

2. MaxMin Margin: Let the instance x be a candidate for being labeled by a 

human expert. This instance gets labeled as -1, assigning it to class -1. Then 

13

the margin m − of the resulting SVM gets calculated. After this x gets labeled 

as +1, assigning it to class +1 and again the margin m + gets computed. 

This procedure is repeated for all instances and the instance with the largest 

min(m − , m + ) is chosen. 

3. Ratio Margin: This is similar to the MaxMin Margin method, but uses the 

relative sizes of m − and m + : choose the instance with largest min( m− , m+ ). 

m + m − 

All three methods perform well, the simple margin method is computationally 

the fastest. But it has to be used carefully, because it can be unstable under 

some circumstances [HGC01], [TK02]. MaxMin Margin and Ratio Margin try to 

overcome these instability problems. The results of the experiments of S. Tong and 

D. Koller show that all three methods outperform random sampling [TK02]. 

3.3.3 Batch-Simple Method 

One possible problem with the above methods is that every instance has to be 

labeled separately. That means that after each instance the user has to determine 

the label. A new hyperplane will be calculated and the next instance has to be 

queried. Often this approach is not practicable and some kind of batch mechanism 

is necessary. There exist different approaches of batch sampling for version space 

based algorithms [Cha05]. One of those approaches is the batch-simple sampling 

algorithm, where h unlabeled instances closest to the hyperplane are chosen and 

have to be labeled by a user. This could be seen as a rather naive extension of the 

above methods (of course naive doesn’t mean bad). The batch-simple method has 

been used to classify images [TC01] and the researchers in this paper report good 

results. The algorithm can be expressed as follows: 

1. initial model building: Build a model using the labeled data 

2. feedback round: query n instances closest to the hyperplane and ask the user 

to label them 

The feedback round can be repeated m times. Because this algorithm can be 

unstable during the first feedback round [TC01], Tong and Chang suggest an initial 

feedback round with random sampling: 

1. initial modell building: Build a modell using the labeled data 

2. first feedback round: choose randomly n instances for labeling 

3. advanced feedback round: query n instances closest to the hyperplane and 

ask the user to label them 

14

Now the advanced feedback round could be repeated m times. But how to 

choose ’good’ values for n and m? Simon Tong and Edward Chang do not explain 

a way to determine these values [TC01]. But it is clear that n has to be set in 

advance. They have used a query size of 20. m can be determined by using some 

kind of cross validation. It is also obvious that by decreasing the query size n one 

has to increase the number of rounds m and vice versa. Otherwise the accuracy of 

the classifier would decrease. Beside the technical reasons the choice of the values 

depends on the user, whose task is to label the instances. To take advantage of 

active learning this user should not have to label a huge set of examples. As an 

starting point one can use the values from [TC01]: query size = 20, number of 

rounds = 5. 

3.3.4 Angle Diversity Strategy 

One problem with the batch-simple method is that by sampling a batch of instances 

the diversity of them is not guaranteed. One can expect that divers instances 

can reduce the version space more efficiently, considering the diversity can have 

a significant impact on the performance of the classifier. A measurement of the 

diversity is the angles between the samples. The angle diversity strategy proposed in 

[Cha05] balances the closeness to the hyperplane and the diversity of the instances. 

More formally the angle between two instances x i and x j (respective their corresponding 

hyperplanes h i and h j : 

|cos(< (h i , h j ))| = |φ(x i).φ(x j )| 

||φ(x i )||||φ(x j )|| = |K(x i , x j )| 

√ 

K(xi , x i )K(x j , x j ) 

(3.5) 

where x i is an instance, φ(x i ) is its normal vector and K(x i , x j ) is the kernel function, 

which satisfies Mercer’s condition [Bur98]. 

From these theoretical considerations the algorithm follows straightforward: 

1. Train a hyperplane h i by the given labeled set 

2. Calculate for each unlabeled instance x j its distance to the hyperplane h i 

3. Calculate the maximal angle from x j to any instance x i in the current labeled 

set 

What’s left is to consider the distance to the hyperplane, until now we have 

focused on the diversity of the samples. To do this we introduce another parameter 

α [Cha05]. This parameter balances the distance to the hyperplane and the diversity 

among the instances. The final decision rule can be expressed this way: 

|K(x i , x j )| 

α ∗ |f(x i | + (1 − α) ∗ (argmax√ x j K(xi , x i )K(x j , x j ) ) (3.6) 

15

As we can see α acts as a trade-off-factor between proximity and diversity. This 

parameter has to be set in advance and it is suggest to set it to 0.5 [Cha05]. They 

also present a more sophisticated solution for determining this parameter and clearly 

it is possible to use cross validation to get the best value for α. 

Some version space based methods have been tested in different fields of application 

[Cha05], [MPE06]. Whereas former have concentrated on image datasets 

and latter have tested these strategies on music datasets both come to the conclusion 

that the angle diversity strategy works best. Furthermore Tong concludes that 

active learning outperforms passive learning [Cha05]. 

3.3.5 Multi-Class Problem 

So far we have just considered and analyzed the two-class case. But to be useful in 

general a semi-supervised learning approach should be easily used in a multi-class 

environment. 

There exist different strategies for solving a multi-class problem with N classes 

for supervised learning with SVMs. In the case of the one-versus-one approach 

N(N−1) 

SVMs are developed and a majority vote is used to determine the class of the 

2 

given instance. In contrast the one-versus-all method uses N SVMs and assigns the 

label of the class which SVM has the largest margin. An overview about different 

multi-class approaches for SVMs can be found here [Pal08]. The one-versus-all 

method was introduced by Vapnik [Vap00]. Hsu and Lin have compared different 

multi-class approaches for SVMs [HL02]. Platt has described another multi-class 

SVM approach: the decision directed acyclic graph [PCT00]. 

From the above discussions it becomes not clear how to use these version space 

based methods for multi-class problems. Consider the simple method and the oneversus-all 

approach. In the case of a multi-class problem we have N decision boundaries, 

so which of the margins do we want to narrow? There a single instance has 

N distances (to the N hyperplanes) and narrowing one margin doesn’t automatically 

mean to narrow all margins. Until now little work has done solving multi-class 

semi-supervised problems. Mitra, Shankar, and Pal have applied the simple method 

to multi-class problems [MSP04]. They used a ’naive’ approach where they labeled 

N samples at a time. As said this approach lacks the analysis which example is best 

for all hyperplanes, because the influence of an example can be very large for one 

hyperplane but for other hyperplanes it can be useless. The angle diversity strategy 

suffers from the same problem, additionally it is not clear, which angle should be 

considered. 

The following section 3.4 describes probability based methods which overcome 

these problems and are more suitable for multi-class problem. 

16

3.4 Probability Based Method 

As we have seen the version space based methods lack of considering multi-class 

problems. An approach which can handle multi-class problems easily are probability 

based method [LKG + 05]. There a probability model for multiple SVMs is created. 

The results of each SVMs are interpreted as a probability and can be seen as a 

measurement of certainty that a given instance belongs to the class. In the case 

of semi-supervised learning using this approach is straightforward and using the 

probabilities we have many possibilities to query unlabeled instances for labeling. 

A simple method would be to train a model on the given labeled datasets. Than 

this model is applied on the unlabeled data and each of these unlabeled instances is 

given probabilities that these instances belong to a given class. Now we can query 

the least certain instances or the most certain instances. It is also possible to query 

the instances with the smallest difference in probability between its most likely 

and second most likely class. Using these probabilities there exist many different 

approaches and it is also possible to mixture some of them [LKG + 05]. 

3.4.1 The Probability Model 

To get probabilities we have to extend the default Support Vector Machines. For 

a given instance the results of the default SVMs are distances where f.ex. 0 means 

that the instance lies on the hyperplane and 1 that the instance is a support vector. 

To assign a probability value to a class the sigmoid function can be used. Then 

the parametric model has the following form [LKG + 05]: 

P (y = 1|f) = 

1 

1 + exp(Af + B) ′ (3.7) 

where A and B are scalar values, which have to be estimated and f is the decision 

function of the SVM. Based on this parametric model there are some approaches 

for calculating the probabilities. As we can see, when we use this model we have to 

calculate the SVM parameters (complexity parameter C, kernel parameter k) and 

the parameter A and B where the parameter A and B have to be calculated for 

each binary SVM. We can use cross validation for this calculation but it is clear 

that this can be computationally expensive. 

A pragmatic approximation method could assume that all binary SVMs have 

the same A, eliminate B by assigning 0.5 to instances lying on the decision boundary 

and by trying to compute the SVM parameters and A simultaneously [LKG + 05]. 

The decision function can be normalized by its margin to include the margin in the 

calculation of the probabilities. More formally: 

P pq (y = 1|f) = 

1 

(3.8) 

1 + exp( Af 

||ω|| )′ 

17

where we currently look at class p and P pq is the probability of class p versus class 

q. We assume that P pq , q=1,2,... are independent. The final probability for class 

p: 

q≠p 

∏ 

P (p) = P pq (y = 1|f) (3.9) 

q 

It has been reported that this approximation is very fast and delivers good 

accuracy results. Using this probability model there exist different approaches for 

semi-supervised learning. The next section outlines some. 

3.4.2 Least Certainty and Breaking Ties 

The algorithms for both are very similar. 

1. Built a multi-class model from the labeled training data 

2. Compute the probabilities 

3. Least Certainty: Query the instances with the smallest classification confidence 

for labeling by a human expert. Add them to the training set. 

4. Breaking ties: Query the instances with the smallest difference in probabilities 

for the two highest probability classes and obtain the correct label from a 

human expert. Add them to the training set. 

5. Goto 1 

Suppose a is the class with the highest probability, b is the class with second 

highest probability and P(a) and P(b) are the probabilities of the classes. Then 

least certainty tries to improve P(a) and breaking ties tries to improve P(a) - P(b). 

Intuitively, both methods improve the confidence of the classification. The number 

of instances, which should be queried, has to be set by the SVM designer. 

These approaches were tested on a gray-scale image datasets [LKG + 05]. They 

report a good accuracy and a reduced number of labeled images required to reach it. 

The breaking ties approach outperforms least certainty and using batch sampling 

was also effective. 

3.5 Other approaches 

3.5.1 A Semidefinite Programming Approach 

Semidefinite programming is an extension of linear and quadratic programming. A 

semidefinite programming problem is a convex constrained optimization problem. 

With semidefinite programming one tries to optimize a symmetric n × n matrix of 

18

variables X [XS05]. Semidefinite programming can be used to use Support Vector 

Machines in an unsupervised and semi-supervised context. For clustering the goal 

is not to find a large margin classifier using the labeled data (as with supervised 

learning) but instead to find a labeling that results in a large margin classifier. 

Therefore every possible labeling has to be computed and the labeling with the 

maximum margin has to be chosen. Obviously this is computationally very expensive 

but Xu and Schuurmans have found out that it can be approximated using 

semidefinite programming. This unsupervised approach can be easily extended to 

semi-supervised learning where a small labeled training set has to be considered. 

Note that this approach also works for multi-class problems [XS05]. There is one important 

difference between this approach and the other above discussed approaches: 

Here the algorithm uses the unlabeled data directly that means no human expert is 

asked to label them. In this case the semi-supervised learning is a combination of 

supervised learning using the given labeled training set and unsupervised learning 

using the unlabeled data. 

3.5.2 S 3 V M 

This approach was introduced by Bennet and Demiriz [BD99]. Similar to the above 

approach no human gets asked to label instances. Instead the unlabeled data gets 

incorporated to the formulation of the optimization problem. S 3 V M reformulates 

the original definition by adding two constraints to the instances of the unlabeled 

datasets. Considering a binary SVM one constraint calculates the misclassification 

error as if the instance were in class 1 and the second constraint as if the instance 

were in class -1. S 3 V M tries to minimize these two possible misclassification errors. 

The labeling with the smallest error is the final labeling. Moreover Bennet and 

Demiriz introduce some optimization techniques for this. An analysis, how this 

approach performs in a multi-class environment, is not presented. 

3.6 Summary 

Semi-supervised learning is a promising approach to reduce the amount of needed 

labeled instances for training SVMs by asking a human expert to label relevant 

instances from an unlabeled pool of instances. As outlined there are many different 

approaches available. We can use clustering, which can also be used as a semisupervised 

learning approach with other machine learning algorithm. In contrast 

the here presented version space based methods focus on SVMs and promise good 

accuracy results but are primary usable for binary classification tasks. Extending 

these approaches for multi-class problems is an ongoing research topic. Simple but 

effective approaches are the probability based methods which can be easily used in 

a multi-class context and are therefore very convenient. S 3 V M and the semidefinite 

programming approach are also semi-supervised learning approaches but here no 

19

human gets asked to label relevant instances. Whereas the former incorporates 

unlabeled instances to the formulation of the optimization problem, the latter one 

tries to find the labeling with the largest margin. 

20

Chapter 4 

Experiments 

4.1 Experiment Setting 

To experiment with different approaches presented in this work I have implemented 

two applications. ssSVM is a semi-supervised SVM implementation and suppports 

different semi-supervised learning approaches like Least Certainty and Breaking 

Ties. ssSVM uses RapidMiner, an open-source data mining plattform, which provides 

a comprehensive API for machine learning tasks like different classification, 

different clustering and of course different SVM implementations. ssSVM is also 

based on Spring, mainly an inversion-of-control container and therefore it is highly 

configurable and extensible. Furthermore it wraps the WordVector Tool for creating 

word vectors from texts. The second implemented application is the GUI for 

ssSVM. It is called ssSVMToolbox and is based on Eclipse RCP. The chapters 4.1.2 

and 4.1.3 as well as the links in the appendix A provide detailed informations. 

4.1.1 Evaluated Approaches 

I compared following approaches and evaluated their performances on the different 

data sets: 

1. Least Certainty (LS) 

2. Breaking Ties (BT) 

3. Most Certainty (MC) 

4. Simple Margin (SM) 

5. Random Sampling (RS) 

I separated every datasets into three sub sets. 

1. training set for supervised learning (in this work also called reduced set) 

21

2. training set for semi-supervised learning (is used to query instances for the 

feedback) 

3. test set to evaluate the performance 

. 

Using the reduced and the training set for semi-supervised learning (merged also 

called the whole set) I trained a common SVM to get an upper bound and used 

the reduced set alone to get the lower bound. So the accuracies of the different 

approaches should be between these bounds. Furthermore I used a random sampling 

strategy (RS) to show that the different approaches are better than an approach 

which randomly chooses instances for feedback. 

I compared two different modes: 

1. incremental increased training size: there the feedback size is set to 1 and the 

training size is incremental increased 

2. batch mode: there the feedback size is set to a certain value (f.ex. 50), in 

some iterations the feedback size is increased and results with these different 

feedback sizes are compared 

4.1.2 ssSVM 

ssSVM (semi-supervised Support Vector Machine) is a Java application capable 

of performing semi-supervised learning tasks with Support Vector Machines. It is 

based on RapidMiner , an Open Source data mining tool, and on Spring, an IOC 

container. See Relevant Links for more informations (appendix A). 

The core of the application is the application context sssvmContext.xml. As 

RapidMiner ssSVM supports different operators, this file configures which operators 

ssSVM actually supports (which input sources, which SVM implementations, which 

validators,...). 

 

 

com . rapidminer . operator . tokenizer . SimpleTokenizer 

 

 

com . rapidminer . operator . tokenizer . NGramTokenizer 

 

 

com . rapidminer . operator . tokenizer . TermNGramGenerator 

 

 

com . rapidminer . operator . reducer . GermanStemmer 

 

 

com . rapidminer . operator . reducer . LovinsStemmer 

 

 

com . rapidminer . operator . reducer . PorterStemmer 

 

 

com . rapidminer . operator . reducer . SnowballStemmer 

 

 

com . rapidminer . operator . reducer . ToLowerCaseConverter 

 

 

com . rapidminer . operator . wordfilter . EnglishStopwordFilter 

 

 

com . rapidminer . operator . wordfilter . GermanStopwordFilter 

 

 

com . rapidminer . operator . wordfilter . StopwordFilterFile 

 

 

com . rapidminer . operator . wordfilter . TokenLengthFilter 

 

 

 

< property name =" tokenProcessors "> 

 

 

 

 

< property name =" supportedReader "> 

 

 

com . rapidminer . operator .io. CSVExampleSource 

 

 

com . rapidminer . operator .io. SparseFormatExampleSource 

 

 

com . rapidminer . operator .io. ArffExampleSource 

 

 

 

 

 

< property name =" params "> 

 

 

< property name =" supportedSVMLearer "> 

 

 

com . rapidminer . operator . learner . functions . kernel . LibSVMLearner 

 

 

com . rapidminer . operator . learner . functions . kernel . JMySVMLearner 

 

 

 

< property name =" supportedValidator "> 

 

 

com . rapidminer . operator . validation . XValidation 

 

 

com . rapidminer . operator . validation . FixedSplitValidationChain 

 

 

23

< property name =" supportedPerfEvaluator "> 

 

 

com . rapidminer . operator . performance . SimplePerformanceEvaluator 

 

 

com . rapidminer . operator . performance . PolynominalClassificationPerformanceEvaluator 

 

 

 

 

 

< property name =" supportedClusterer "> 

 

 

com . rapidminer . operator . learner . clustering . clusterer . KMeans 

 

 

com . rapidminer . operator . learner . clustering . clusterer . SVClusteringOperator 

 

 

com . rapidminer . operator . learner . clustering . clusterer . KernelKMeans 

 

 

 

 

 

To use ssSVM for a concrete experiment another configuration is necessary. 

There the runtime properties for the experiment have to be provided. Instead of 

describing this file I provide an example. For a detailed description which parameters 

and parameter values are supported, see the RapidMiner documentation 

(appendix A). 

 

false 

 

 

 

 

< property name =" preprocessing "> 

 

 

< property name =" parameter "> 

 

 

< property name =" additionalProps "> 

 

 

./ datasets / breast_cancer_wisconsin / wdbc_as_labeled . data 

 

 

 

 

 


 

 


 

 


 

 

./ datasets / breast_cancer_wisconsin / wdbc_testset . data 

 

 

 

 

 


 

 


 

 


 

 

./ datasets / breast_cancer_wisconsin / wdbc_as_unlabeled . data 

 

 

 

 

 

 

< property name =" numberOfInstancesForFeedback " value ="0" /> 

 

 

 

 

 

 

 

 

 

< property name =" comparator "> 

 

 


 

25


 

 


 

 


 

 


 

 



 

 

< property name =" sssvmLearner "> 

 

 

 

 

< property name =" seed " value =" 123456789 "/> 


 

 

< property name =" props "> 

 

 

com . rapidminer . operator . learner . functions . kernel . jmysvm . kernel . KernelRadial 

 

0.8 

 

 

 

 


 

 

< property name =" clusterParams "> 

 

 

< property name =" svmLearner " value =" libSVM " /> 

< property name =" validator " value =" xval " /> 

< property name =" perfEvaluator " value =" simple " /> 

< property name =" clusterModelHandler "> 

 

 

 

 

 

 

 

 

 

 

 

 

< property name =" clusterers "> 

 

kmeans 

kernelKmeans 

 

 

< property name =" samplingStrategies "> 

 

 

 

 

 

 

26

 

 

 

 

 

 

 

 

 

 

 

Following code performs this experiment: 

final RuntimeHandler r = new RuntimeHandler (" wdbc . xml "); 

final SSSVMLearner learner = r. getSSSVMLearner (); 

// one feedback round 

final ExampleSet feedbackSet = learner . queryInstances (r. getLabeledExampleSet () , r. getUnlabeledExampleSet ()); 

final ExampleSet all = ExampleSetUtils . merge (r. getLabeledExampleSet () , feedbackSet ); 

// use a SVM implementation for training 

final IOObject [] resultSSSVM = r. getSVMLearner (). learn (r. getRuntimeConfig (). getSvmLearner () , 

r. getRuntimeConfig (). getSvmPerfEvaluator () , all ); 

// get performance of self test 

final PerformanceVector pvXval = (( PerformanceVector ) resultSSSVM [1]); 

// use model on a separate test set 

final PerformanceVector pvTest = r. getSVMLearner (). test (( Model ) resultSSSVM [0] , 

r. getRuntimeConfig (). getSvmPerfEvaluator () , r. getTestExampleSet ()); 

The incrementally increased training size mode can be executed by this code: 

protected List < Performance > performSSSVMStepwise ( final String experiment , final SamplingStrategy samplingStrategy ) 

throws Exception { 

final RuntimeHandler r = new RuntimeHandler ( experiment ); 

// learn sssvm 

final SSSVMLearner learner = r. getSSSVMLearner (); 

} 

ExampleSet all = ( ExampleSet ) r. getLabeledExampleSet (). clone (); 

final ExampleSet unlabeledSet = ( ExampleSet ) r. getUnlabeledExampleSet (). clone (); 

final List < Performance > results = new LinkedList < Performance >(); 

final int feedbackSize = 10; 

learner . getSamplingStrategies (). clear (); 

learner . addSamplingStrategy ( samplingStrategy ); 

samplingStrategy . setNumberOfInstancesForFeedback ( feedbackSize ); 

for ( int i = 0; i < unlabeledSet . size () / 10; i ++) { 

final ExampleSet feedbackSet = learner . queryInstances (all , 

ExampleSetUtils . intersect ( unlabeledSet , all )); 

all = ExampleSetUtils . merge (all , feedbackSet ); 

final IOObject [] resultSSSVM = r. getSVMLearner (). learn (r. getRuntimeConfig (). getSvmLearner () , 

r. getRuntimeConfig (). getSvmPerfEvaluator () , all ); 

final PerformanceVector pvXval = (( PerformanceVector ) resultSSSVM [1]); 

final PerformanceVector pvTest = r. getSVMLearner (). test (( Model ) resultSSSVM [0] , 

r. getRuntimeConfig (). getSvmPerfEvaluator () , r. getTestExampleSet ()); 

final Performance perf = new Performance ( pvXval , pvTest , all . size () , all . size () 

- r. getLabeledExampleSet (). size ()); 

results . add ( perf ); 

} 

return results ; 

If f.ex. a new input source should be used it has to be configured in sssvmContext.xml 

and after that it can be used in the experiment configuration. 

Table 4.1 describes the important packages of ssSVM. 

27

package name 

sssvm 

sssvm.clustermodel 

sssvm.confidencemodel 

sssvm.sampling 

sssvm.preprocessing 

sssvm.text 

describtion 

contains the ssSVM implementation 

and the core classes for running experiments 

contains cluster models 

for using clusterer in a semi-supervised manner 

contains the implementation 

for probability based semi-supervised approaches 

(Breaking Ties, Least Certainty,...) 

contains different sampling strategies 

contains preprocessing methods 

wraps the WVTool for creating word vectors from texts 

4.1.3 ssSVMToolbox 

Table 4.1: Packages 

This is the graphical user interface of ssSVM. It is based on Eclipse RCP. Using the 

ssSVMToolbox one can create, configure and run experiments. This application 

uses ssSVM to perform supervised and semi-supervised learning with SVMs and 

has the same abilities as ssSVM. Technically the toolbox is a GUI to manipulate 

experiment xml files. 

Running experiments is straightforward. At first one has to create a new experiment. 

The toolbox consists of some tabs. On the Input tab one can configure 

the datasources of the experiment. Here one has to provide the input format (f.ex. 

cvs), the filenames of the example sets and additional parameters for the example 

sets. The Preprocessing tab provides the configuration for preprocessing tasks like 

discretization and transformation of nominal to numeric attributes. The ssSVM 

Learner tab is the core of the toolbox. Here one can choose between different SVM 

learners, has to set the SVM parameters like kernel type and can activate and deactivate 

the different sampling strategies. For every sampling strategy one can set the 

feedback size. Finally one can execute the ssSVM experiment. After doing this, the 

Feedback Set table shows the instances for labeling by the human expert. There 

some features are shown (by double clicking on the row a dialog is opened and the 

whole instance is shown) and the user can label the instances by clicking on the cell 

Label. The current accuracy on the testset is also shown. The Result tab shows 

the accuracies and the confusion matrix. 

Figure 4.1 shows the Input tab whereas Figure 4.2 represents the ssSVM tab. 

By repeatedly executing the experiment one can experiment with incrementally 

increased training sizes, by setting the feedback sizes to values > 1 one can test 

the batch mode. By choosing different sampling strategies one can experiment with 

different combinations of them. 

In the next sections I show the results of my experiments. For some of these 

experiments I used the ssSVMToolbox. For more sophisticated results (f.ex. to 

28

Figure 4.1: Screenshot of the Input tab of the ssSVMToolbox 

29

Figure 4.2: Screenshot of the ssSVM tab of the ssSVMToolbox 

30

Figure 4.3: Binary Gaussian Distribution (µ 1 = 3, σ 1 = 3, µ 2 = 4, σ 2 = 3) 

create the different figures) I used a programmatic approach where I could execute 

different experiments with different settings all at once. See Section 4.1.2 for detailed 

informations and example source code. 

4.2 Artificial Datasets 

4.2.1 Gaussian Distributed Data 

For these experiments I used generated gaussian distributed data. I generated two 

different datasets with two different classes where the two classes overlap each other. 

In the first datasets ds 1 the σs are equal, in the second ds 2 the σs are different. The 

Figures 4.3 and 4.4 show plots of these datasets. 

For these datasets I evaluated different approaches (section 4.1.1). Table 4.2 

shows the upper and the lower bounds and the result of the ssSVM approach using 

a feedback size of 50. 

whole set reduced set LC BT MC SM RS 

self test 0.67 0.5 0.56 0.6 0.68 0.74 0.38 

test set 0.67 0.5 62 0.5 0.66 0.67 0.46 

trainingset size 840 40 50 50 50 50 50 

Table 4.2: Summary experiments with ds 1 

31

Figure 4.4: Binary Gaussian Distribution (µ 1 = 12, σ 1 = 15, µ 2 = 17, σ 2 = 1) 

Figure 4.5 gives a more detailed insight into the performance of the semisupervised 

SVM. There the feedback size was set equal to 1 and ssSVM were used 

to incrementally increase the training set size. As we can see after approx. 50 iterations 

Simple Margin and Most Certainty deliver good results in comparison with 

conventional SVMs but with much fewer data. Breaking Ties, Least Certainty and 

Most Certainty are most stable and outperform Random Sampling. 

Figure 4.6 shows how the implementation performs with different feedback sizes 

in a batch mode. 

The lower and upper bounds of the second datasets and the performance of 

ssSVM with feedback size 50 can be found in Table 4.3 . 


self test 0.77 0.9 0.83 0.78 0.95 0.66 0.76 

test set 0.77 0.43 0.70 0.63 0.44 0.5 0.59 


Table 4.3: Summary experiments with ds 2 

The performance of ssSVM with feedback size 1 and incremental increased feedback 

size is highlighted in Figure 4.7. 

Remains the overview how ssSVM performs on this datasets in a batch mode. 

Figure 4.8 highlights the results of this. 

32

Figure 4.5: Incremental increased training size ds 1 

Figure 4.6: different feedback sizes in batch mode ds 1 

33

Figure 4.7: Incremental increased training size ds 2 

Figure 4.8: different feedback sizes in batch mode for ds 2 

34

Figure 4.9: Incremental increased training size ds 1 , RBF kernel 

Both datasets show that the semi-supervised SVM approaches delivers similar 

results than the supervised approach but with a smaller training set. The incremental 

version outperforms the supervised approach with respect to the training 

set size and is better than the batch semi-supervised version, which is of course 

more practically and also performs better then the supervised approach. 

Different Kernels 

For the above experiments I used the Linear kernel. To look how the chosen kernel 

influences the result of the semi-supervised learning approaches I used polynomial 

and RBF kernels for experimenting with the datasets ds 1 . The Figures 4.10 and 

4.9 are similar to the Figure 4.5. For these datasets we can conclude that the 

chosen kernel influences the result of the SVM but has no specific impact on the 

semi-supervised approaches. 

4.2.2 Two Spirals Dataset 

I also applied ssSVM to a Two Spirals dataset (Figure 4.11). 

Table 4.4 shows lower and upper bound and the ssSVM accuracy of the dataset. 

The performance of ssSVM with feedback size 1 and incremental increased feedback 

size is highlighted in Figure 4.12, the results of using a batch mode can be 

found in Figure 4.13. 

35

Figure 4.10: Incremental increased training size ds 1 , polynomial kernel (degree = 

3) 

Figure 4.11: Two Spirals Dataset 

36


self test 1 0.1 0 0 1 1 1 

test set 0.85 0.32 0.33 0.31 0.67 0.74 0.72 


Table 4.4: Summary experiments with Two Spirals Dataset 

Figure 4.12: Incremental increased training size Two Spirals Dataset 

37

Figure 4.13: different feedback sizes in batch mode for Two Spirals Datasets 

As the Gaussian Datasets these experiments show that with ssSVM the necessary 

amount of training instances can be reduced significantly. 

4.2.3 Chain Link Dataset 

The last artificial dataset I used to evaluate ssSVM is the Chain Link Dataset 4.14. 

Table 4.5 shows upper and lower bounds, the Figures 4.15 and 4.16 show the 

accuracies with incremental increased training sets and with different batch sizes. 


self test 0.89 0.66 0.77 0.7 0.67 0.67 0.86 

test set 0.9 0.76 0.86 0.75 0.73 0.81 0.66 


Table 4.5: Summary experiments with Chain Link dataset 

4.2.4 Summary 

We could see that the semi-supervised SVM approaches reduced the amount of 

needed labeled data significantly. They delivered similar accuracies than the common 

SVM approach but the training set size was much smaller. As expected the 

38

Figure 4.14: Chain Link Dataset 

Figure 4.15: Incremental increased training size Chain Link dataset 

39

Figure 4.16: different feedback sizes in batch mode for Chain Link Dataset 

incremental version performs better than the batch version. Breaking Ties, Least 

Certainty, Simple Margin and Most Certainty perform better than Random Sampling 

but no single ’winner’ could be found. 

4.3 Datasets from UCI Machine Learning Repository 

Beside the generated datasets I evaluated my implementation using some datasets 

from the UCI Machine Learning Repository (appendix A). 

I used following datasets: 

1. abalone 

2. breast cancer (WDBC) 

3. heart scale 

4. hill valley 

5. kr-vs-kp 

Detailed informations about the datasets can be found on the UCI Machine Learning 

Repository homepage. Again I separated the datasets into training sets for 

40

supervised learning, semi-supervised learning and testing. Note that I did not try 

to optimize the SVM kernel parameters to get good accuracies and therefore some 

accuracies are rather low. Instead I used different parameters for different datasets 

(f.ex. different kernel types) and for each datasets the same parameters for comparing 

supervised and semi-supervised learning. 

Two modes were used for the semi-supervised approach. First a simple batch 

mode where only one feedback round is used. The other mode uses 10 feedback 

rounds. 

The tables 4.6 and 4.7 outline the results of these experiments. Again, the semisupervised 

approaches deliver good accuracy but with reduced sample size with 

respect to the whole training set. 


heart scale 0.84 0.75 0.83 0.83 0.78 0.77 0.80 

WDBC 0.94 0.75 0.94 0.94 0.77 0.80 0.87 

WDBC (RBF) 0.85 0.25 0.52 0.52 0.28 0.50 0.45 

abalone 0.54 0.44 0.51 0.53 0.44 0.51 0.51 

hill valley 0.94 0.85 0.89 0.89 0.87 0.85 0.86 

kr-vs-kp 0.44 0.29 0.39 0.43 0.42 0.33 0.22 

Table 4.6: Evaluation of semi-supervised SVM approaches (1 iteration, feedback 

size 50) 


heart scale 0.84 0.75 0.83 0.83 0.76 0.84 0.8 

WDBC 0.94 0.75 0.94 0.94 0.77 0.87 0.80 

WDBC (RBF) 0.85 0.25 0.69 0.69 0.28 0.47 0.46 

abalone 0.54 0.44 0.50 0.51 0.43 0.49 0.51 

hill valley 0.94 0.85 0.89 0.88 0.88 0.86 0.86 

kr-vs-kp 0.44 0.29 0.47 0.53 0.31 0.21 0.16 

Table 4.7: Evaluation of semi-supervised SVM approaches, (10 iterations, feedback 

size 50) 

These datasets show that Least Certainty and Breaking Ties often delivers similar 

results and outperform the other approaches. 

41

Chapter 5 

Conclusion 

In this work I summarized different approaches of semi-supervised learning for Support 

Vector Machines. We have seen that most of them try to narrow the margin 

of the hyperplane. The version space based and the probability based methods 

belong to this category. Semi-supervised learning approaches promise to reduce the 

amount of the needed training data through performing so called feedback rounds 

where a human expert gets asked for labeling instances which are relevant for the 

given classification task. The experiments with different datasets have shown that 

ssSVM, my semi-supervised learning implementation for SVMs, keep this promise. 

With ssSVM one can obtain similar accuracies with fewer training data as with 

usual SVMs. 

One drawback of the presented semi-supervised learning approaches is that they 

introduce a new parameter, the feedback size. The feedback size influences not only 

the accuracy but also the acceptance of the human expert. If the feedback size is 

too large, the human expert has to label many instances and can get bored (as in 

the supervised case), if the feedback size is too small the accuracy can be too low. 

Because the optimal value for the feedback size depends on the datasets and the 

chosen approach there is no general rule how to set it. Additionally the number of 

feedback rounds must also be chosen. 

I compared Least Certainty, Breaking Ties, Most Certainty and Simple Margin 

with Random Sampling and could show that these approaches outperform the 

latter one. Which approach should be chosen depends on the datasets although 

Least Certainty and Breaking Ties seem to be most stable and are general good 

approaches. 

A problem is that until now there does not exist a practical online tuning algorithm 

for kernel parameters. If we add a new instance to the training set the 

optimal kernel parameters can change. 

Nevertheless my experiments with ssSVM show that using semi-supervised approaches 

help to reduce the size of needed labeled training data and are therefore 

valuable. 

42

Appendix A 

Relevant Links 

• Word Vector Tool - An Open-Source Tool for creating word vectors from texts 

http://www.wvtool.nemoz.org/ 

• RapidMiner - An Open-Source Datamining Tool http://www.rapidminer.com 

• Spring Framework - An IOC Container http://springframework.org/ 

• Eclipse RCP - The Eclipse Rich Client Platform http://wiki.eclipse.org/index. 

php/Rich Client Platform 

• UCI Machine Learning Repository - Repository containg different data sets 

http://archive.ics.uci.edu/ml/ 

43

Bibliography 

[BD99] 

Kristin P. Bennett and Ayhan Demiriz. Semi-supervised support vector 

machines. In Proceedings of the 1998 conference on Advances in neural 

information processing systems II, pages 368–374, Cambridge, MA, USA, 

1999. MIT Press. 

[Bur98] Christopher J. C. Burges. A tutorial on support vector machines for 

pattern recognition. Data Mining and Knowledge Discovery, 2(2):121– 

167, 1998. 

[CB07] 

[Cha05] 

Nicolas Cebron and Michael R. Berthold. An adaptive multi objective selection 

strategy for active learning. Konstanzer Schriften in Mathematik 

und Informatik, No. 235, 2007. 

Edward Chang Simon Tong Kingsby Goh Chang-Wei Chang. Support 

vector machine concept-dependent active learning for image retrieval. 

IEEE Transactions on Multimedia 2005, 2005. 

[CST00] Nello Cristianini and John Shawe-Taylor. An introduction to support 

Vector Machines: and other kernel-based learning methods. Cambridge 

University Press, New York, NY, USA, 2000. 

[FM01] 

[HGC01] 

[HL02] 

F. Fung and O. Mangasarian. Semi-supervised support vector machines 

for unlabeled data classification. Optimization Methods and Software, 

15:29–44, 2001. 

Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point machines. 

Journal of Machine Learning Research, 1:245–279, 2001. 

Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass 

support vector machines. IEEE Transactions on Neural Networks, 

13:415–425, 2002. 

[LKG + 05] Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, Scott 

Samson, Andrew Remsen, and Thomas Hopkins. Active learning to 

recognize multiple types of plankton. Journal of Machine Learning Research, 

6:589–613, 2005. 

44

[Mar03] Florian Markowetz. Klassifikation mit support vector machines. 

http://lectures.molgen.mpg.de/statistik03/docs/Kapitel 16.pdf, 2003. 

Lectures, Max Planck Institute For Molecular Genetics. 

[Mei02] Ron Meir. Support vector machines - an introduction. 

http://www.ee.technion.ac.il/ rmeir/SVMReview.pdf, 2002. Electrical 

Engineering Department, Israel Institute of Technology, Tutorial. 

[MIJ04] Romain Thibaux Michael I. Jordan. The kernel 

trick. http://www.cs.berkeley.edu/ jordan/courses/281Bspring04/lectures/lec3.pdf, 

Spring 2004. Lectures, CS Berkeley. 

[Mit97] 

[MPE06] 

[MSP04] 

[Pal08] 

[PCT00] 

[SC00] 

[STC99] 

[TC01] 

Thomas Mitchell. Machine Learning. McGraw-Hill Education (ISE Editions), 

October 1997. 

Michael Mandel, Graham Poliner, and Daniel Ellis. Support vector machine 

active learning for music retrieval. Multimedia Systems, 12(1):3–13, 

2006. 

Pabitra Mitra, B. Uma Shankar, and Sankar K. Pal. Segmentation of 

multispectral remote sensing images using active support vector machines. 

Pattern Recognition Letters, 25(9):1067–1074, 2004. 

Mahesh Pal. Multiclass approaches for support vector machine based 

land cover classification. CoRR, abs/0802.2411, 2008. informal publication. 

John C. Platt, Nello Cristianini, and Shawe J. Taylor. Large margin 

DAGs for multiclass classification. In Sara A. Solla, T. K. Leen, and 

K. R. Müller, editors, Advances in Neural Information Processing Systems, 

volume 12. MIT Press, 2000. 

Greg Schohn and David Cohn. Less is more: Active learning with support 

vector machines. In ICML ’00: Proceedings of the Seventeenth International 

Conference on Machine Learning, pages 839–846, San Francisco, 

CA, USA, 2000. Morgan Kaufmann Publishers Inc. 

John Shawe-Taylor and Nello Cristianini. Further results on the margin 

distribution. In COLT ’99: Proceedings of the twelfth annual conference 

on Computational learning theory, pages 278–285, New York, NY, USA, 

1999. ACM. 

Simon Tong and Edward Chang. Support vector machine active learning 

for image retrieval. In MULTIMEDIA ’01: Proceedings of the ninth ACM 

international conference on Multimedia, pages 107–118, New York, NY, 

USA, 2001. ACM. 

45

[TK02] 

[Vap00] 

[VC04] 

[XS05] 

Simon Tong and Daphne Koller. Support vector machine active learning 

with applications to text classification. Journal of Machine Learning 

Research, 2:45–66, 2002. 

Vladimir N. Vapnik. The nature of statistical learning theory. Springer- 

Verlag New York, Inc., New York, NY, USA, 2000. 

V. N. Vapnik and A. Ya. Chervonenkis. Theory of pattern recognition. 

www.cs.berkeley.edu/ jordan/courses/281B-spring04/lectures/lec3.pdf, 

Spring 2004. Lectures, CS Berkeley. 

Linli Xu and Dale Schuurmans. Unsupervised and semi-supervised multiclass 

support vector machines. In Manuela M. Veloso and Subbarao 

Kambhampati, editors, AAAI, pages 904–910. AAAI Press / The MIT 

Press, 2005. 

46

Semi-Supervised Learning With SVMs.pdf - Read

Create successful ePaper yourself

Delete template?

Save as template?