Unni Cathrine Eiken February 2005
4 Classification

In order to use the structures in the EPAS list as an aid in anaphora resolution, they have to be processed. The pre-processing in section 3.6.4 has shown that interesting distributions do exist in the data set, and it indicates that certain groups of arguments display distributions particular to the domain. As a step toward exploring whether these distributions can be used to represent selectional restrictions, and thus function as real-world knowledge for the domain, the words in the EPAS list must be classified. This procedure uses the context patterns that a word occurs in to classify the word, for example allowing an argument to be classified according to the predicates it co-occurs with. A classification of this type gives information about which word to expect in a given context pattern, and the results can therefore be used in the process of choosing the most likely antecedent for an anaphor. In this respect, the most likely antecedent must be interpreted as the most likely antecedent given a particular contextual pattern.

In the following, the EPAS list will first be classified to see if the context patterns represented by the EPAS contain enough information to suggest the correct antecedent for anaphoric expressions from the text collection. Then an association of concepts will be performed, creating bundles of those arguments which occur in similar contexts/with similar predicates. These concepts will then be applied in combination with the classification method to see if they improve the process of suggesting the correct antecedent for the anaphors.

For the purposes of classification and testing, the EPAS list was divided into training and test sets. The test set consists of all structures containing pronouns, while the training set consists of the remaining EPAS. In the case of the test set, the correct antecedent for each pronoun was identified manually and added to the test file.
When testing with the test instances, the classifier assigns an antecedent based on the patterns it has seen in the training set. In this way, the correct antecedent in each test case functions as a means of measuring the success rate of the classification. The test set thus provides a good way of evaluating the product of the classification, and gives a measure of whether the correct antecedent can be assigned based on training on occurrences of EPAS/context patterns.
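The division described above can be sketched as follows. This is an illustrative sketch, not the code used in the thesis: the EPAS are assumed here to be (predicate, arg1, arg2) tuples, and the small Norwegian pronoun list is a hypothetical stand-in for whatever pronoun inventory was actually used.

```python
# Sketch: splitting an EPAS list into training and test sets.
# Structures containing a pronoun argument become test instances;
# all remaining structures form the training set.

PRONOUNS = {"han", "hun", "den", "det", "de"}  # hypothetical pronoun list

def split_epas(epas_list):
    train, test = [], []
    for pred, arg1, arg2 in epas_list:
        if arg1 in PRONOUNS or arg2 in PRONOUNS:
            # the correct antecedent would be added manually to each
            # test instance in a later step
            test.append((pred, arg1, arg2))
        else:
            train.append((pred, arg1, arg2))
    return train, test

epas = [
    ("etterlyse", "politi", "kvinne"),
    ("observere", "vitne", "hun"),
    ("obdusere", None, "kvinne"),
]
train, test = split_epas(epas)
# only the structure containing "hun" ends up in the test set
```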
The process of classifying the constituents of the EPAS is most useful if the aim of the classification is held clearly in mind. Classifying arguments relative to the predicates and the other arguments they co-occur with can give information about two things:

• Is the data set generalisable enough to allow inference of the single correct antecedent in each test case?
• Is the data set generalisable enough to allow inference of words within the semantic concept group that the correct antecedent belongs to?

In this thesis, it is of interest to identify all the words which occur in specific environments. Accordingly, we are interested in finding all the members which can co-occur in a specific pattern, and not necessarily only in the single correct antecedent.

The classification phase in the present work has three steps: first, classification through a memory-based learning algorithm; second, association of semantic classes from the text material by looking at contextual environments; and third, classification through application of the concept groups gathered in step two. In the following, the classification method will be described in more detail.

4.1 Step I: Classification with TiMBL

TiMBL (Tilburg Memory Based Learner) (Daelemans et al. 2003) is a memory-based learning (MBL) tool developed by the ILK research group at the University of Tilburg (ILK 2004). TiMBL has been developed specifically with the domain of NLP in mind and provides an implementation of several MBL algorithms.

Within MBL, or lazy learning (Daelemans et al. 1999), training instances are simply stored in memory. Upon encountering a new instance, classification is performed by comparing it to the stored experiences and estimating its similarity to the old ones. The stored example(s) most similar to the new instance determine its classification. This approach stands in opposition to rule-induction based methods, which are also called greedy algorithms.
In greedy learning algorithms, the learning material is used to create a model with expected characteristics for each category to be learned. Daelemans et al. (1999) show that
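The lazy-learning idea can be illustrated with a minimal classifier in the spirit of TiMBL's IB1 algorithm with the overlap metric: training instances are stored verbatim, and a new instance simply receives the class of its most similar stored neighbour. This is a simplified sketch for exposition, not TiMBL itself (TiMBL additionally offers feature weighting, k > 1 neighbours, and other metrics); the Norwegian example instances are invented for illustration.

```python
# A minimal memory-based ("lazy") classifier: store all training
# instances, then classify a new instance by the class of the stored
# instance with the greatest feature overlap.

def overlap(a, b):
    """Number of feature positions on which two instances agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

class MemoryBasedClassifier:
    def __init__(self):
        self.memory = []  # list of (features, cls) pairs, stored as-is

    def train(self, instances):
        # "training" is nothing more than remembering the examples
        self.memory.extend(instances)

    def classify(self, features):
        # pick the stored instance with maximal overlap with the input
        best_features, best_cls = max(
            self.memory, key=lambda m: overlap(m[0], features)
        )
        return best_cls

mbl = MemoryBasedClassifier()
mbl.train([
    (("etterlyse", "politi"), "kvinne"),   # hypothetical EPAS patterns
    (("avhoere", "politi"), "vitne"),
])
print(mbl.classify(("etterlyse", "politi")))  # -> kvinne
```

The contrast with greedy learning is visible in `train`: no abstraction or rule model is built from the material; all generalisation is deferred to classification time.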