Unni Cathrine Eiken February 2005


language processing tasks tend to benefit from lazy learning methods, particularly because the individual examples in the training material are not abstracted away in the process of creating rules. When a new data instance is classified, it is compared to all previously seen examples, including low-frequency ones. This suggests that MBL tools are particularly suitable for relatively small data sets, such as the one in the present work. By consulting previously seen data and estimating the similarity between old and new instances of data, MBL algorithms such as TiMBL are able to estimate the most likely category of a new instance. This is done by creating a classifier which essentially consists of an example set of patterns together with their associated categories. The classifier can subsequently classify unknown input patterns by applying algorithms that calculate the similarity, or distance, to the known patterns stored in memory. The Nearest Neighbor approach is one commonly used means of estimating this distance and is described in more detail in the following section.

4.1.1 The Nearest Neighbor approach

Daelemans et al. (2003, p. 19) state that all MBL approaches are founded on the classical k-Nearest Neighbor (k-NN) method of classification (Cover and Hart 1967). This approach classifies patterns of numeric data by using information gained from examining and classifying pattern distributions observed in a data collection. In the k-NN algorithm, a new instance of data is classified according to the previously classified points that lie nearest to it. The intuition is that observations which are close together will have categories which are close together. When classifying a new instance of data, the k-NN approach weights the known information about the closest similar data instances most heavily. In its simplest form, a new instance of data is assigned the category of its nearest neighbour.
In large samples, this rule can be modified to classify according to the majority of the nearest neighbours, rather than just the single nearest neighbour.

The k-NN approach has several implementations in TiMBL. As TiMBL is designed to classify linguistic patterns, which in most cases consist of discrete data values and allow for a large number of attributes with varying relevance, the k-NN algorithm is not used directly. Instead, the classification of discrete data is made possible through a modified version of the k-NN approach, as well as other algorithms.
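The difference between using the single nearest neighbour and using the majority of the k nearest neighbours can be illustrated with a minimal sketch of classical k-NN on numeric data. The function name knn_classify and the toy points are assumptions made for illustration only; they are not taken from TiMBL or from the data of this project.

```python
import math
from collections import Counter


def knn_classify(train, query, k=1):
    """Classical k-NN on numeric vectors (illustrative sketch).

    train: list of (vector, category) pairs
    query: vector to classify
    k:     number of nearest training instances that get a vote
    """
    # Rank all stored examples by Euclidean distance to the query.
    ranked = sorted(train, key=lambda example: math.dist(example[0], query))
    # Let the k closest examples vote; the most frequent category wins.
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy data: two-dimensional points with categories "A" and "B".
train = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((2.4, 2.4), "B"),
    ((3.8, 4.0), "B"),
    ((4.0, 4.2), "B"),
]

print(knn_classify(train, (2.0, 2.0), k=1))  # B: the single closest point is (2.4, 2.4)
print(knn_classify(train, (2.0, 2.0), k=3))  # A: two of the three closest points are "A"
```

With k = 1 the query takes the category of the one closest point, while with k = 3 the two slightly more distant "A" points outvote it.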
There are several different distance metrics incorporated in TiMBL and, as will be described later, the user can choose the one that best suits the data material. The basic metric is the Overlap Metric, where the distance between two patterns is calculated as the sum of the differences between the features of the two patterns (Daelemans et al. 2003, p. 20). The algorithm combining the k-NN approach with the overlap metric within TiMBL is called IB1 (Aha, Kibler and Albert 1991, in Daelemans et al. 2003). In this algorithm the value of k is the number of nearest distances (usually 1), so that the nearest neighbour set may comprise several instances which all share the same distance to the test example. The IB1 algorithm finds the k nearest neighbours of a test case by calculating the distance between a test instance Y and each training instance X. The distance between two instances is the sum of the distances between the instances’ individual features. If k = 1 and there is a single nearest neighbour, the test instance is assigned the category of that neighbour. In cases where the algorithm finds a set of nearest neighbours, the category is decided by the majority vote of the set. This implies a certain bias toward high-frequency categories, which in many cases will hold the majority vote.

4.1.2 Testing

To create a classifier, TiMBL needs training and test data in the form of feature vectors, where each instance consists of a fixed number of feature values followed by a category. For testing purposes, the feature sequence is used when the distance between a test instance and the training data is calculated, and the category functions as a means of evaluating whether the assigned classification was correct. Because the test data is compared directly with the training data, separate training and test sets are needed. In this project, the EPAS list was split into a training set consisting of all EPAS without pronouns and a test set consisting of the EPAS with pronouns.
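The IB1 procedure from section 4.1.1 and the instance format described above can be sketched together as follows. This is a simplified illustration and not TiMBL's implementation: the function names, the comma-separated toy instances and the English feature values are all invented for the example, and any feature weighting or tie-breaking that TiMBL may apply is left out.

```python
from collections import Counter


def parse_instance(line):
    """Split 'feature,...,feature,category' into (features, category)."""
    *features, category = line.strip().split(",")
    return tuple(features), category


def overlap_distance(x, y):
    """Overlap metric: the number of feature positions whose values differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def ib1_classify(train, features, k=1):
    """IB1-style vote where k counts nearest *distances*, not instances."""
    distances = [(overlap_distance(x, features), cat) for x, cat in train]
    # Keep the k smallest distinct distance values.
    nearest = set(sorted({d for d, _ in distances})[:k])
    # Every training instance at one of those distances joins the vote.
    votes = Counter(cat for d, cat in distances if d in nearest)
    return votes.most_common(1)[0][0]


# Invented toy instances: two feature values followed by a category.
train = [parse_instance(line) for line in [
    "arrest,suspect,police",
    "question,suspect,police",
    "question,witness,police",
    "examine,body,doctor",
    "examine,wound,doctor",
]]

print(ib1_classify(train, ("examine", "body")))   # doctor: an exact match at distance 0
print(ib1_classify(train, ("question", "body")))  # police: 2 of the 3 instances at distance 1
```

In the second call three stored instances share the nearest distance of 1, so the category is decided by majority vote within that set, which is where the bias toward frequent categories comes from.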
In addition, testing with TiMBL’s leave-one-out option was performed; here testing is done on the training file itself, by treating each pattern in turn as a test case (Daelemans et al. 2003, p. 35). In the classification phase the alignment of category and description features is stored, so that the categories of new, unseen sequences of description features can be probabilistically inferred in the following test phase. Regardless of the input format chosen, a classification with TiMBL presupposes that the training material consists of a number of features to be learned from, as well as a predetermined category
