Unni Cathrine Eiken February 2005


Abstract This thesis describes an approach that uses corpus-based classification of semantically related words as a referent-guessing helper in anaphora resolution. A small limited-domain corpus was collected, and elementary predicate-argument structures were extracted from it using a method based on the semantic structures available from syntactic parses of the texts. The extracted structures were processed using an association technique which created bundles of semantically similar words based on their distribution in the text collection. The groups of semantically similar words represent valid selectional restrictions for the domain of the text collection in the sense that they characterise types of arguments which tend to occur in certain contexts. These groups can be used to create an expectation of which words will occur in a given contextual pattern, and can thus be used in anaphora resolution to select a probable referent from a set of possible referents. The experiments in the thesis show that this approach produces promising results; the concept groups can function as a helper to find likely referents in anaphora resolution.

Sammendrag (Norwegian summary, translated) The method described in this thesis builds on corpus-based classification of semantically similar words and relates it to use in anaphora resolution. A domain-specific corpus was collected, and simplified predicate-argument structures were extracted using a method based on the semantic structures that are available after a syntactic analysis of the texts. The structures were processed with an association technique which, based on the distribution of the words in the text collection, formed groupings of semantically similar words. These word groups represent valid selectional restrictions within the limited domain of the text collection, since they characterise groups of arguments that occur in given contexts. The word groups can be used to indicate which words are expected in a given contextual pattern. In anaphora resolution, this can help in selecting a probable referent from a list of possible referents. The experiments in the thesis show that this method gives promising results; the word groups can serve as an aid in the process of finding probable referents in anaphora resolution.
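The idea summarised in the abstract, grouping words by the contexts they share and using those groups to prefer one candidate referent over another, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the toy (predicate, argument) pairs are invented, the function names are hypothetical, and Jaccard overlap is a simple stand-in rather than the association technique used in the thesis.

```python
from collections import defaultdict

# Hypothetical toy (predicate, argument) pairs; not taken from the thesis corpus.
PAIRS = [
    ("arrest", "police"),
    ("arrest", "investigator"),
    ("question", "police"),
    ("question", "investigator"),
    ("live", "woman"),
    ("live", "student"),
    ("study", "student"),
]


def contexts_by_word(pairs):
    """Map each argument word to the set of predicates it occurs with."""
    ctx = defaultdict(set)
    for predicate, argument in pairs:
        ctx[argument].add(predicate)
    return ctx


def jaccard(a, b):
    """Overlap of two context sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(predicate, candidates, pairs):
    """Order candidate referents by how closely their contexts resemble
    those of words already observed as arguments of the predicate."""
    ctx = contexts_by_word(pairs)
    seen = [word for word, preds in ctx.items() if predicate in preds]
    scores = {}
    for cand in candidates:
        cand_ctx = ctx.get(cand, set())
        scores[cand] = max(
            (jaccard(cand_ctx, ctx[word]) for word in seen), default=0.0
        )
    return sorted(candidates, key=lambda c: scores[c], reverse=True)


print(rank_candidates("study", ["police", "woman"], PAIRS))
# → ['woman', 'police']
```

In this toy data neither candidate has been seen with "study", but "woman" shares the context "live" with "student", so it is ranked first. This mirrors, in a very reduced form, how distributional groups can create an expectation about likely referents.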
Preface The project presented in this paper is a Cand. Philol. thesis in Computational Linguistics and Language Technology, submitted at the University of Bergen in February 2005. The thesis was written in loose cooperation with the research project KunDoc (KunDoc 2004). KunDoc (Kunnskapsbasert dokumentanalyse / Knowledge-based document analysis), which was started in October 2003 and is funded by the Norwegian Research Council (NFR), has served as an inspiration for articulating the approach in the thesis. The research within KunDoc is carried out in cooperation between the firm CognIT AS (CognIT 2004) and the University of Bergen. KunDoc aims at developing a method for the automatic recognition of discourse structures in written Norwegian texts. The project examines whether automated identification of coreference in a text can be used to create an unambiguous discourse structure of the text, identifying both its thematic and contextual structure. A further goal is to examine whether these techniques are useful within a closed thematic domain to create unambiguous automated summaries. Within KunDoc, it is of interest to generate ontologies that represent real-world knowledge. In the work on my thesis I have also cooperated with the research project NorGram (NorGram 2004) at the University of Bergen. This project develops a computational grammar for Norwegian bokmål and is a part of the ParGram project at Palo Alto Research Center. The pre-processing of the text collection used in my project was carried out using NorGram’s grammar on the XLE platform.
