Unni Cathrine Eiken February 2005
There are several different distance metrics incorporated in TiMBL and, as will be described
later, the user can choose the one that best suits the data material. The basic metric is the
Overlap Metric, where the distance between two patterns is calculated as the sum of the differences
between the features of the two patterns (Daelemans et al. 2003, p. 20). The algorithm
combining the k-NN approach with the overlap metric within TiMBL is called IB1 (Aha, Kibler
and Albert 1991, in Daelemans et al. 2003). In this algorithm the value of k is the number of
nearest distances (usually 1) rather than the number of nearest instances, so the nearest neighbour
set may comprise several instances that all share the same distance to the test example. The IB1
algorithm finds the k nearest neighbours of a test instance Y by calculating the distance between Y
and each training instance X. The distance between two instances is the sum of the distances
between their corresponding features. If k = 1 and there is a single nearest neighbour, the test
instance is assigned that neighbour’s category. In cases where the algorithm finds a set of nearest
neighbours, the category holding the majority vote in the set is chosen. This implies a certain bias
toward high-frequency categories, which in many cases will hold the majority vote.
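
The description above can be illustrated with a minimal Python sketch. It is not TiMBL's implementation: it assumes an unweighted overlap metric (each mismatching feature contributes 1), and the function names, the toy data and the tie-breaking behaviour (the category encountered first wins a tied vote) are assumptions made purely for illustration.

```python
from collections import Counter


def overlap_distance(x, y):
    """Unweighted overlap distance: the number of positions where two patterns differ."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same number of features")
    return sum(1 for a, b in zip(x, y) if a != b)


def classify_ib1(training, test_features, k=1):
    """Assign a category by majority vote over the k nearest *distances*.

    training is a list of (features, category) pairs. All training
    instances lying at one of the k smallest distinct distances form the
    nearest neighbour set, so the set can hold more than k instances.
    """
    scored = [(overlap_distance(features, test_features), category)
              for features, category in training]
    nearest_distances = set(sorted({d for d, _ in scored})[:k])
    votes = Counter(c for d, c in scored if d in nearest_distances)
    # A tied vote is resolved in favour of the category seen first (an assumption).
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three symbolic features followed by a category.
training = [
    (("a", "b", "c"), "X"),
    (("a", "b", "d"), "Y"),
    (("a", "e", "d"), "Y"),
    (("f", "g", "h"), "X"),
]
print(classify_ib1(training, ("a", "e", "z"), k=1))
```

With k = 2 the neighbour set for the same test pattern grows to three instances, because two training patterns share the second-nearest distance; this is the situation in which the majority vote, and with it the bias toward frequent categories, comes into play.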
4.1.2 Testing<br />
To create a classifier, TiMBL needs training and test data in a feature vector format where each
instance consists of a fixed number of feature values followed by a category. For testing purposes,
the feature sequence is used when the distance between a test instance and the training data is<br />
calculated, and the category functions as a means to evaluate whether the assigned classification<br />
was valid. Because the test data is compared directly with the training data, separate training and<br />
test sets are needed. In this project, the EPAS list was split into a training set consisting of all<br />
EPAS without pronouns and a test set consisting of the EPAS with pronouns. In addition, testing<br />
through TiMBL’s leave-one-out option was performed; here testing is done on each pattern of<br />
the training file by treating each pattern in turn as a test case (Daelemans et al. 2003, p. 35).
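
Leave-one-out testing as described here can be sketched as follows. This is an illustrative approximation rather than TiMBL's own procedure: it assumes k = 1 with an unweighted overlap metric, and the function names and the toy instances are hypothetical.

```python
from collections import Counter


def overlap(x, y):
    """Number of feature positions where two patterns differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def nearest_category(train, features):
    """Majority category among all training instances at the smallest distance."""
    best = min(overlap(f, features) for f, _ in train)
    votes = Counter(c for f, c in train if overlap(f, features) == best)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(instances):
    """Treat each instance in turn as the test case and train on all the others."""
    if len(instances) < 2:
        raise ValueError("leave-one-out needs at least two instances")
    correct = 0
    for i, (features, category) in enumerate(instances):
        rest = instances[:i] + instances[i + 1:]
        if nearest_category(rest, features) == category:
            correct += 1
    return correct / len(instances)


# Hypothetical toy instances: two features followed by a category.
instances = [
    (("a", "b"), "X"),
    (("a", "c"), "X"),
    (("d", "e"), "Y"),
    (("d", "f"), "Y"),
    (("g", "h"), "Z"),
]
print(leave_one_out_accuracy(instances))
```

The single instance of category Z can never be classified correctly in this setup, since removing it from the training material leaves no other example of its category. Categories that occur only once in the data are therefore always counted as errors under leave-one-out testing.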
In the classification phase the alignment of category and description features is stored, so that<br />
the categories of new, unseen sequences of description features can be probabilistically inferred<br />
in the following test phase.<br />
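
As a rough illustration of how such instances might be read and stored, the sketch below parses lines of comma-separated values into pairs of description features and category. The comma-separated layout, the function name and the example values are assumptions for illustration; TiMBL itself supports several input formats, and this is not its parser.

```python
def parse_instances(lines, sep=","):
    """Split each line into a tuple of feature values and a final category.

    Every instance must have the same number of fields, mirroring the
    fixed-length feature vectors described in the text.
    """
    instances = []
    width = None
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = [field.strip() for field in line.split(sep)]
        if len(fields) < 2:
            raise ValueError(f"line {number}: need at least one feature and a category")
        if width is None:
            width = len(fields)
        elif len(fields) != width:
            raise ValueError(f"line {number}: expected {width} fields, got {len(fields)}")
        instances.append((tuple(fields[:-1]), fields[-1]))
    return instances


# Hypothetical input: three description features followed by a category.
sample = [
    "v1,subj,n1,catA",
    "",
    "v2,obj,n2,catB",
]
for features, category in parse_instances(sample):
    print(features, "->", category)
```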
Regardless of the input format chosen, a classification with TiMBL presupposes that the training<br />
material consists of a number of features to be learned from, as well as a predetermined category<br />