Unni Cathrine Eiken February 2005

There are several different distance metrics incorporated in TiMBL and, as will be described later, the user can choose the one that best suits the data. The basic metric is the Overlap Metric, where the distance between two patterns is calculated as the sum of the differences between the features of the two patterns (Daelemans et al. 2003, p. 20). The algorithm combining the k-NN approach with the Overlap Metric within TiMBL is called IB1 (Aha, Kibler and Albert 1991, in Daelemans et al. 2003). In this algorithm the value of k is the number of nearest distances (usually 1) rather than the number of nearest instances, so the nearest-neighbour set may comprise several instances that all share the same distance to the test example. The IB1 algorithm finds the k nearest neighbours of a test case by calculating the distance between the test instance Y and each training instance X. The distance between two instances is the sum of the distances between their corresponding features. If k = 1 and there is a single nearest neighbour, the test instance is assigned the category of that neighbour. In cases where the algorithm finds a set of nearest neighbours, the category holding the majority vote within the set is chosen. This implies a certain bias toward high-frequency categories, which in many cases will hold the majority vote.
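The procedure described above can be illustrated with a minimal Python sketch. It is not TiMBL's implementation: the function names, the toy patterns and the tie-breaking behaviour (whichever category was counted first wins a tied vote) are assumptions made for illustration only. What it does follow from the description is that the overlap distance counts mismatching feature positions, and that k selects the k nearest distances, so every training instance at one of those distances takes part in the vote.

```python
from collections import Counter


def overlap_distance(x, y):
    """Overlap metric: number of feature positions where the patterns differ."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same number of features")
    return sum(1 for a, b in zip(x, y) if a != b)


def classify(training, test_features, k=1):
    """Assign a category by majority vote over the k nearest distances.

    training is a list of (features, category) pairs. All training
    instances lying at one of the k smallest distinct distances vote.
    """
    scored = [(overlap_distance(f, test_features), c) for f, c in training]
    nearest_distances = set(sorted({d for d, _ in scored})[:k])
    votes = Counter(c for d, c in scored if d in nearest_distances)
    # Tie-breaking here is an assumption: the first-counted category wins.
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three symbolic features followed by a category.
toy_training = [
    (("a", "b", "c"), "X"),
    (("a", "b", "d"), "X"),
    (("a", "e", "c"), "Y"),
    (("f", "g", "h"), "Y"),
]

print(classify(toy_training, ("a", "b", "z"), k=1))
```

With k = 1 the test pattern has two training instances at the smallest distance, so both enter the nearest-neighbour set and vote, which is the situation the text describes.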

4.1.2 Testing<br />

To create a classifier, TiMBL needs training and test data in the form of feature vectors, where each instance consists of a fixed number of feature values followed by a category. For testing purposes, the feature sequence is used when the distance between a test instance and the training data is calculated, and the category serves as a means to evaluate whether the assigned classification was correct. Because the test data is compared directly with the training data, separate training and test sets are needed. In this project, the EPAS list was split into a training set consisting of all EPAS without pronouns and a test set consisting of the EPAS with pronouns. In addition, testing through TiMBL’s leave-one-out option was performed; here testing is done on each pattern of the training file by treating each pattern in turn as a test case (Daelemans et al. 2003, p. 35).
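Leave-one-out testing as described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TiMBL's own code: the helper names, the toy instances and the simple overlap-based classifier are all hypothetical, and the sketch assumes at least two instances so that something remains to compare against once one pattern is held out.

```python
from collections import Counter


def overlap_distance(x, y):
    """Number of feature positions where the two patterns differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def classify(training, features, k=1):
    """Majority vote over all training instances at the k nearest distances."""
    scored = [(overlap_distance(f, features), c) for f, c in training]
    nearest_distances = set(sorted({d for d, _ in scored})[:k])
    votes = Counter(c for d, c in scored if d in nearest_distances)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(instances, k=1):
    """Treat each instance in turn as the test case, train on the rest.

    The stored category of the held-out instance is used only to check
    whether the assigned classification was correct.
    """
    correct = 0
    for i, (features, category) in enumerate(instances):
        rest = instances[:i] + instances[i + 1:]
        if classify(rest, features, k) == category:
            correct += 1
    return correct / len(instances)


# Hypothetical instances: two feature values followed by a category.
toy_instances = [
    (("a", "b"), "X"),
    (("a", "c"), "X"),
    (("d", "e"), "Y"),
    (("d", "f"), "Y"),
]

print(leave_one_out_accuracy(toy_instances))
```

Because every pattern is classified against all the others, this form of testing needs no separate test file, at the cost of one classification run per training pattern.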

In the classification phase the alignment of category and description features is stored, so that the categories of new, unseen sequences of description features can be probabilistically inferred in the following test phase.

Regardless of the input format chosen, a classification with TiMBL presupposes that the training material consists of a number of features to be learned from, as well as a predetermined category
