
language processing tasks tend to benefit from lazy learning methods, particularly because the individual examples in the training material are not abstracted away in the process of creating rules. When a new data instance is classified, it is compared to all previously seen examples, including low-frequency ones. This suggests that in the case of relatively small data sets, such as the one in the present work, MBL tools are particularly suitable.

By consulting previously seen data and estimating the similarity between old and new data instances, MBL algorithms such as TiMBL are able to assign the most likely category to new instances. This is done by creating a classifier which essentially consists of an example set of particular patterns together with their associated categories. The classifier can subsequently classify unknown input patterns by applying algorithms that calculate the similarity, or distance, to the known patterns stored in memory. The Nearest Neighbor approach is one commonly used means of estimating this distance and is described in more detail in the following section.
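To make this store-and-compare scheme concrete, the following Python sketch, an illustration rather than TiMBL's actual code, stores every training pattern verbatim together with its category and classifies a new pattern by the category of its closest stored example. The overlap distance used here, counting mismatching feature positions, is an assumption chosen for the example.

    # Minimal sketch of a memory-based (lazy) classifier: "training" is pure
    # storage, and all comparison work is deferred to classification time.
    # The overlap distance is an assumption made for this illustration.

    class MemoryBasedClassifier:
        def __init__(self):
            self.examples = []  # (pattern, category) pairs kept verbatim

        def train(self, pattern, category):
            # No rules are abstracted from the data, so low-frequency
            # examples are retained alongside frequent ones.
            self.examples.append((tuple(pattern), category))

        @staticmethod
        def distance(a, b):
            # Overlap distance: the number of feature positions on which
            # two discrete-valued patterns disagree.
            return sum(1 for x, y in zip(a, b) if x != y)

        def classify(self, pattern):
            # Compare the new instance to every stored example and return
            # the category of the single nearest one.
            _, category = min(
                (self.distance(p, pattern), c) for p, c in self.examples
            )
            return category

    # Hypothetical usage with symbolic (discrete) linguistic features:
    clf = MemoryBasedClassifier()
    clf.train(("det", "noun", "verb"), "NP")
    clf.train(("verb", "det", "noun"), "VP")
    print(clf.classify(("det", "noun", "adj")))  # -> "NP"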

4.1.1 The Nearest Neighbor approach

Daelemans et al. (2003, p. 19) state that all MBL approaches are founded on the classical k-Nearest Neighbor (k-NN) method of classification (Cover and Hart 1967). This approach classifies patterns of numeric data by using information gained from examining and classifying the pattern distributions observed in a data collection. In the k-NN algorithm, a new data instance is classified according to the previously classified points that lie nearest to it. The intuition is that observations which are close together will have categories which are close together. When classifying a new data instance, the k-NN approach weights the known information about the closest similar data instances most heavily. In other words, a new instance is classified in the category of its nearest neighbour. In large samples, this rule can be modified to classify according to the majority of the nearest neighbours, rather than just the single nearest neighbour. The k-NN approach has several implementations in TiMBL. As TiMBL is designed to classify linguistic patterns, which in most cases consist of discrete data values and allow for a large number of attributes with varying relevance, the k-NN algorithm is not used directly. Instead, the classification of discrete data is made possible through a modified version of the k-NN approach, as well as other algorithms.
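As a sketch of the majority-vote variant just described, the following Python function, an illustration and not a description of TiMBL's internals, ranks all stored examples by overlap distance and lets the k nearest neighbours vote on the category. The default k and the tie-breaking behaviour are assumptions of the example.

    from collections import Counter

    def knn_classify(examples, pattern, k=3):
        # `examples` is a list of (pattern, category) pairs. Distance is
        # the overlap metric: the number of mismatching feature positions,
        # a common choice for discrete (symbolic) features.
        ranked = sorted(
            examples,
            key=lambda ex: sum(1 for x, y in zip(ex[0], pattern) if x != y),
        )
        # Majority vote among the k nearest neighbours; Counter.most_common
        # breaks ties by first occurrence, i.e. in favour of the nearer one.
        votes = Counter(category for _, category in ranked[:k])
        return votes.most_common(1)[0][0]

With k = 1 this reduces to the single-nearest-neighbour rule described above; raising k trades sensitivity to individual stored examples for robustness in larger samples.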
