Unni Cathrine Eiken February 2005



4.1 Step I: Classification with TiMBL
4.1.1 The Nearest Neighbor approach
4.1.2 Testing
4.1.3 Comments on the results
4.2 Step II: Association of concept groups
4.2.1 Classify
4.2.2 Associated concept classes
4.3 Step III: Using concept groups in TiMBL
4.3.1 Testing
4.4 Are concept classes useful for anaphora resolution?
5 FINAL REMARKS
5.1 Is a parser vital for the extraction process?
5.2 Summary and conclusions
6 REFERENCES
APPENDIX A: EKSTRAKTOR.PL – ALGORITHM
APPENDIX B: EKSTRAKTOR.PL – PROGRAM CODE
APPENDIX C: THE EPAS LIST
APPENDIX D: TEXT ALIGNED WITH EPAS
APPENDIX E: CLASSIFY.PL – PROGRAM CODE
APPENDIX F: POS-BASED STRUCTURES

1 Introduction and problem statement

For many applications within the field of Natural Language Processing (NLP) it is vital to identify what a pronoun refers to. Consider a piece of text where (1-1a) is followed immediately by (1-1b)¹.

(1-1)
a. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig kommer til å drepe igjen.
   The sergeant leading the investigation says that the perpetrator probably will kill again.
b. Han etterlyser vitner som var i sentrum søndag kveld.
   He puts out a call for witnesses who were in the city centre Sunday evening.

In an application consisting of, for example, summarising the text, a selection of the second sentence (1-1b) without the preceding sentence (1-1a) leaves the reader with the pronoun han (he), the referent of which cannot be identified. The task of identifying the referent of a pronoun is called anaphora resolution and its computer implementation is relevant in many NLP applications, such as machine translation, automatic abstracting, dialogue systems, question answering and information extraction.

The problem of correctly identifying the referent of a pronoun is not trivial, as will be apparent from the comparison of examples (1-1) and (1-2). As will be further described in section 2.1, strategies that do not incorporate some sort of real-world knowledge cannot confidently identify the entities that the pronoun han (he) is linked to in these examples.

(1-2)
Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig kommer til å drepe igjen. Han ble observert i sentrum søndag kveld.
The sergeant leading the investigation says that the perpetrator probably will kill again. He was observed in the city centre Sunday evening.

¹ The sentences in (1-1) and (1-2) are constructed example sentences and are not part of the data set collected and used in this thesis.
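The contrast between (1-1) and (1-2) can be made concrete with a small sketch. It is an illustration only and not the method developed in this thesis: the `Mention` class, the hand-annotated features, the word positions and the two strategies (`most_recent`, `main_subject`) are all assumptions introduced here. It also assumes the intended readings are the sergeant in (1-1b) and the perpetrator in (1-2), which the examples suggest. Because both candidates agree with han in gender and number, and neither strategy looks at the sentence containing the pronoun, each strategy returns the same antecedent for both examples and so is necessarily wrong on one of them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Mention:
    """A candidate antecedent, annotated by hand for this illustration."""
    text: str
    gender: str            # assumed annotation: "masc", "fem" or "neut"
    number: str            # "sg" or "pl"
    position: int          # word index in the first sentence, punctuation ignored
    is_main_subject: bool  # True if the mention is the subject of the main clause


# The first sentence is identical in (1-1) and (1-2), so both examples
# share the same list of candidate antecedents.
CANDIDATES: List[Mention] = [
    Mention("lensmannen", "masc", "sg", 0, True),
    Mention("gjerningsmannen", "masc", "sg", 6, False),
]

SENTENCE_1_1B = "Han etterlyser vitner som var i sentrum søndag kveld."
SENTENCE_1_2 = "Han ble observert i sentrum søndag kveld."


def most_recent(candidates: List[Mention]) -> Mention:
    """Recency heuristic: prefer the candidate closest to the pronoun."""
    return max(candidates, key=lambda m: m.position)


def main_subject(candidates: List[Mention]) -> Mention:
    """Subject heuristic: prefer the subject of the main clause."""
    return next(m for m in candidates if m.is_main_subject)


def resolve_han(pronoun_sentence: str,
                strategy: Callable[[List[Mention]], Mention]) -> str:
    """Resolve 'han' using agreement filtering plus one heuristic.

    pronoun_sentence is deliberately unused: a strategy without
    real-world knowledge has no way to exploit what the sentence says.
    """
    agreeing = [m for m in CANDIDATES
                if m.gender == "masc" and m.number == "sg"]
    return strategy(agreeing).text


if __name__ == "__main__":
    for name, strategy in [("recency", most_recent), ("subject", main_subject)]:
        print(name,
              resolve_han(SENTENCE_1_1B, strategy),
              resolve_han(SENTENCE_1_2, strategy))
```

Separating the two cases would require the resolver to know something about the events described, for instance that the one who calls for witnesses is more plausibly the investigator than the perpetrator.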

