Unni Cathrine Eiken February 2005
This thesis explores the value of using co-occurrence patterns to create concept groups that can aid in finding what a pronoun refers to. In order to find the entity that the pronoun han (he) refers to in example (1-1), the following two alternative patterns can be considered:

(1-3) a. lensmann etterlyser vitne
         sergeant calls-for witness
      b. gjerningsmann etterlyser vitne
         perpetrator calls-for witness

When considering which of these patterns is the more likely one, data collected from a corpus can be consulted (Dagan and Itai 1990; Dagan et al. 1995; Nasukawa 1994; inter alia). If one of the patterns is found literally in the corpus, it will receive a strong preference. If neither pattern occurs in the data collection, similar patterns can be considered. Given that the patterns in example (1-4) below do feature in the data collection, they can contribute to guessing the correct referent for the anaphor in example (1-1):

(1-4) a. politi etterlyser vitne
         police call-for witness
      b. etterforsker etterlyser vitne
         investigator calls-for witness
      c. lensmann avhører vitne
         sergeant interviews witness
      d. politi avhører person
         police interview person
      e. gjerningsmann dreper offer
         perpetrator kills victim
      f. gjerningsmann angriper kvinne
         perpetrator attacks woman

In view of the patterns in (1-4), the word lensmann (sergeant) occurs in contexts similar to those of politi (police), which in turn occurs in contexts similar to those of etterforsker (investigator). By using association techniques, lensmann can be associated with the other arguments which occur in similar linguistic environments, and subsequently be preferred as the referent in (1-1).

Approaches within the field of anaphora resolution have in recent years focused on knowledge-poor strategies used in combination with corpora. At the same time, the notion of constructing a large and comprehensive base of real-world knowledge has been abandoned somewhat (see Mitkov 2003 for a brief overview).
The approach in the present work extends the co-occurrence patterns found in a text collection by also considering words and patterns that are semantically similar to those present in the corpus. The association of semantically similar concepts is carried out through machine learning techniques and an association technique developed for this project.

The project described in the present work develops and examines a method for the automatic inference of concept groups consisting of semantically similar words from a collection of limited-domain texts. Information that becomes available by automatically classifying context patterns from a closed thematic domain is examined for the purpose of aiding anaphora resolution. The resulting concept groups can function as a form of real-world knowledge by representing information about which words can be expected in certain contextual environments. Since it is an established problem within computational linguistics that the construction of knowledge bases requires a large amount of manual labour, it is of interest to examine methods which can contribute to automating this task.

1.1 Project outline

This thesis describes a method for the automatic association of clusters of semantically similar words collected from a limited thematic domain. The association of concepts is based on the distribution of arguments in particular syntactic contexts.

The method described consists of three steps:

1) the extraction method, which deals with the extraction of meaning structures from a text corpus
2) the classification method, which deals with the association of the extracted meaning structures into concept groups
3) the application of the meaning structures and the concept groups to anaphora resolution

In the extraction phase, semantic structures mainly corresponding to subject-verb-object relations are extracted and normalised to the form shown in (1-5) below. This type of relation is in this thesis termed an elementary predicate-argument structure (EPAS) and is described in greater detail in section 3.2.
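The general idea of preferring a candidate referent on the basis of corpus patterns can be illustrated with a minimal sketch. It is not the method developed in this thesis: the triple representation, the function names and the simple overlap measure (a Jaccard coefficient over the verbs a subject occurs with) are assumptions made for illustration only, standing in for the machine learning and association techniques described later. The data are the patterns from example (1-4).

```python
from collections import defaultdict

# Patterns from example (1-4), written as (subject, verb, object) triples.
# The triple layout is an illustrative assumption, not the thesis's EPAS format.
CORPUS = [
    ("politi", "etterlyser", "vitne"),
    ("etterforsker", "etterlyser", "vitne"),
    ("lensmann", "avhører", "vitne"),
    ("politi", "avhører", "person"),
    ("gjerningsmann", "dreper", "offer"),
    ("gjerningsmann", "angriper", "kvinne"),
]


def verbs_by_subject(patterns):
    """Map each subject to the set of verbs it occurs with."""
    contexts = defaultdict(set)
    for subj, verb, _obj in patterns:
        contexts[subj].add(verb)
    return contexts


def score(candidate, verb, obj, patterns):
    """Hypothetical preference score for `candidate` as subject of (verb, obj).

    A literal match in the corpus gets a fixed high score. Otherwise the
    candidate is compared with every subject attested in that context,
    using the Jaccard overlap of the verbs the two words occur with.
    """
    if (candidate, verb, obj) in patterns:
        return 2.0  # literal match: strong preference
    contexts = verbs_by_subject(patterns)
    cand_verbs = contexts.get(candidate, set())
    best = 0.0
    for subj, v, o in patterns:
        if v == verb and o == obj and subj != candidate:
            union = cand_verbs | contexts[subj]
            if union:
                overlap = len(cand_verbs & contexts[subj]) / len(union)
                best = max(best, overlap)
    return best


def preferred_referent(candidates, verb, obj, patterns):
    """Return the candidate with the highest score for the given context."""
    return max(candidates, key=lambda c: score(c, verb, obj, patterns))


if __name__ == "__main__":
    print(preferred_referent(["lensmann", "gjerningsmann"],
                             "etterlyser", "vitne", CORPUS))
```

With these six patterns, neither candidate pattern from (1-3) occurs literally, so the sketch falls back on similarity: lensmann shares the verb avhører with politi, which is attested as the subject of etterlyser vitne, whereas gjerningsmann shares no verb with any attested subject.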