University of Bergen
Section for linguistic studies

CORPUS-BASED SEMANTIC CATEGORISATION FOR ANAPHORA RESOLUTION

Unni Cathrine Eiken

Cand. Philol. Thesis in Computational Linguistics and Language Technology

February 2005


Abstract

This thesis describes an approach that uses corpus-based classification of semantically related words as a referent-guessing helper in anaphora resolution. A small limited-domain corpus was collected, and elementary predicate-argument structures were extracted from it using a method based on semantic structures available from syntactic parses of the texts. The extracted structures were processed with an association technique which created bundles of semantically similar words based on their distribution in the text collection. The groups of semantically similar words represent valid selectional restrictions for the domain of the text collection in the sense that they characterise types of arguments which tend to occur in certain contexts. These groups can be used to create an expectation of which words to expect in a given contextual pattern, and can thus be used in anaphora resolution to select a probable referent from a set of possible referents. The experiments in the thesis show that this approach produces promising results; the concept groups can function as a helper for finding likely referents in anaphora resolution.

Sammendrag

The method described in this thesis builds on corpus-based classification of semantically similar words and relates it to use in anaphora resolution. A domain-specific corpus was collected, and simplified predicate-argument structures were extracted using a method based on the semantic structures available after a syntactic analysis of the texts. The structures were processed with an association technique which, based on the distribution of the words in the text collection, formed groups of semantically similar words. These word groups represent valid selectional restrictions within the limited domain of the text collection, as they characterise groups of arguments that occur in given contexts. The word groups can be used to give an indication of which words are expected in a given context pattern. In anaphora resolution, this can help in selecting a probable referent from a list of possible referents. The experiments in the thesis show that this method produces promising results; the word groups can function as an aid in the process of finding probable referents in anaphora resolution.



Preface

The project presented in this thesis is a Cand. Philol. thesis in Computational Linguistics and Language Technology, submitted at the University of Bergen in February 2005.

The thesis is written in loose cooperation with the research project KunDoc (KunDoc 2004). KunDoc (Kunnskapsbasert dokumentanalyse / Knowledge-based document analysis), which was started in October 2003 and is funded by the Norwegian Research Council (NFR), has served as an inspiration for formulating the approach in the thesis. The research within KunDoc is carried out in cooperation between the firm CognIT AS (CognIT 2004) and the University of Bergen. KunDoc aims at developing a method for the automatic recognition of discourse structures in written Norwegian texts. The project examines whether automated identification of coreference in a text can be used to create an unambiguous discourse structure of the text, identifying both its thematic and contextual structure. A further goal is to examine whether these techniques are useful within a closed thematic domain for creating unambiguous automated summaries. Within KunDoc, it is of interest to generate ontologies that represent real-world knowledge.

In the work on my thesis I have also cooperated with the research project NorGram (NorGram 2004) at the University of Bergen. This project develops a computational grammar for Norwegian bokmål and is part of the ParGram project at Palo Alto Research Center. The pre-processing of the text collection used in my project has been carried out using NorGram's grammar on the XLE platform.


Acknowledgements

I would like to thank my supervisors, Professor Koenraad de Smedt and Professor Helge Dyvik, who have given me invaluable support and new ideas, especially in the process of developing the method in the thesis.

The approach in the thesis has been developed in loose cooperation with KunDoc. In this connection I wish to thank Till Christopher Lech at CognIT AS, who has contributed tips and support.

Other people I would like to thank are Paul Meurer at Aksis, who installed XLE and NorGram on my home Linux computer, and Martin Rasmussen Lie, who has been a great help with programming questions and has implemented one of the approaches used in the thesis in Perl. Thanks also go to Aleksander Krzywinski, whose achievement it is that the pink computer exists.

Many people who have been a great support in the process of finishing the thesis have not been mentioned; they are still, however, very greatly thanked. You know who you are!


Table of Contents

1 INTRODUCTION AND PROBLEM STATEMENT
 1.1 Project outline
2 THEORETICAL BACKGROUND
 2.1 Anaphora resolution
  2.1.1 Frameworks for anaphora resolution
  2.1.2 Computational approaches to anaphora resolution
  2.1.3 Anaphora resolution and text summarisation
 2.2 Finding meaning in the context
  2.2.1 The distributional approach
  2.2.2 Different types of context
  2.2.3 Context and selectional restrictions
3 FROM TEXT TO EPAS – THE EXTRACTION METHOD
 3.1 Selecting the texts
 3.2 Predicate-argument structures
  3.2.1 What is represented in the EPAS?
 3.3 Parsing with NorGram
  3.3.1 NorGram in outline
  3.3.2 Extracting EPAS from NorGram
 3.4 Altering the source
 3.5 Finding the words
 3.6 Evaluation of the data set
  3.6.1 Errors from the grammar
  3.6.2 Irrelevant structures
  3.6.3 Manually added structures
  3.6.4 Comments about the EPAS list
4 CLASSIFICATION
 4.1 Step I: Classification with TiMBL
  4.1.1 The Nearest Neighbor approach
  4.1.2 Testing
  4.1.3 Comments on the results
 4.2 Step II: Association of concept groups
  4.2.1 Classify
  4.2.2 Associated concept classes
 4.3 Step III: Using concept groups in TiMBL
  4.3.1 Testing
 4.4 Are concept classes useful for anaphora resolution?
5 FINAL REMARKS
 5.1 Is a parser vital for the extraction process?
 5.2 Summary and conclusions
6 REFERENCES
APPENDIX A: EKSTRAKTOR.PL – ALGORITHM
APPENDIX B: EKSTRAKTOR.PL – PROGRAM CODE
APPENDIX C: THE EPAS LIST
APPENDIX D: TEXT ALIGNED WITH EPAS
APPENDIX E: CLASSIFY.PL – PROGRAM CODE
APPENDIX F: POS-BASED STRUCTURES


1 Introduction and problem statement

For many applications within the field of Natural Language Processing (NLP) it is vital to identify what a pronoun refers to. Consider a piece of text where (1-1a) is followed immediately by (1-1b)¹.

(1-1)
a. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig kommer til å drepe igjen.
   The sergeant leading the investigation says that the perpetrator probably will kill again.
b. Han etterlyser vitner som var i sentrum søndag kveld.
   He puts out a call for witnesses who were in the city centre Sunday evening.

In an application such as summarising the text, selecting the second sentence (1-1b) without the preceding sentence (1-1a) leaves the reader with the pronoun han (he), whose referent cannot be identified. The task of identifying the referent of a pronoun is called anaphora resolution, and its computational implementation is relevant in many NLP applications, such as machine translation, automatic abstracting, dialogue systems, question answering and information extraction.

The problem of correctly identifying the referent of a pronoun is not trivial, as is apparent from a comparison of examples (1-1) and (1-2). As will be further described in section 2.1, strategies that do not incorporate some sort of real-world knowledge cannot confidently identify the entities that the pronoun han (he) is linked to in these examples.

(1-2)
Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig kommer til å drepe igjen. Han ble observert i sentrum søndag kveld.
The sergeant leading the investigation says that the perpetrator probably will kill again. He was observed in the city centre Sunday evening.

¹ The sentences in (1-1) and (1-2) are constructed example sentences and are not part of the data set collected and used in this thesis.


This thesis explores the value of using co-occurrence patterns to create concept groups that can act as an aid in the process of finding what a pronoun refers to. In order to find the entity that the pronoun han (he) refers to in example (1-1), the following two alternative patterns can be considered:

(1-3)
a. lensmann etterlyser vitne    sergeant calls-for witness
b. gjerningsmann etterlyser vitne    perpetrator calls-for witness

When considering which of these patterns is the most likely one, data collected from a corpus can be consulted (Dagan and Itai 1990; Dagan et al. 1995; Nasukawa 1994; inter al.). If one of the patterns is found literally in the corpus, it will receive a strong preference. If none of the patterns occurs in the data collection, similar patterns can be considered. Given that the patterns in example (1-4) below do feature in the data collection, they can contribute to guessing the correct referent for the anaphor in example (1-1):

(1-4)
a. politi etterlyser vitne    police call-for witness
b. etterforsker etterlyser vitne    investigator calls-for witness
c. lensmann avhører vitne    sergeant interviews witness
d. politi avhører person    police interview person
e. gjerningsmann dreper offer    perpetrator kills victim
f. gjerningsmann angriper kvinne    perpetrator attacks woman

In view of the patterns in (1-4), the word lensmann (sergeant) occurs in contexts similar to those of politi (police), which in turn occurs in contexts similar to those of etterforsker (investigator). By using association techniques, lensmann can be associated with the other arguments which occur in similar linguistic environments, and subsequently be preferred as the referent in (1-1).
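To make this preference mechanism concrete, here is a minimal Perl sketch (Perl being the language of the programs in the appendices) of the referent-guessing idea: a literal corpus pattern wins outright; otherwise patterns whose subject belongs to the same concept group as the candidate are consulted. The pattern counts and group labels below are invented for illustration and are not part of the thesis data.

#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Invented corpus counts for patterns of the kind shown in (1-4).
my %count = (
    'politi etterlyser vitne'       => 4,
    'etterforsker etterlyser vitne' => 2,
    'lensmann avhører vitne'        => 3,
    'gjerningsmann dreper offer'    => 5,
);

# Invented concept groups associating words that share contexts.
my %group_of = (
    lensmann      => 'investigator-like',
    politi        => 'investigator-like',
    etterforsker  => 'investigator-like',
    gjerningsmann => 'offender-like',
);

# Score a candidate referent for the pattern "<candidate> etterlyser vitne":
# a literal corpus match wins; otherwise fall back to patterns whose subject
# belongs to the same concept group as the candidate.
sub score {
    my ($candidate) = @_;
    my $literal = $count{"$candidate etterlyser vitne"} // 0;
    return $literal if $literal;
    my $fallback = 0;
    for my $pattern (keys %count) {
        my ($subj, $rest) = split / /, $pattern, 2;
        next unless $rest eq 'etterlyser vitne';
        $fallback += $count{$pattern}
            if ($group_of{$subj} // '') eq ($group_of{$candidate} // '?');
    }
    return $fallback;
}

# lensmann scores 6 via its concept group; gjerningsmann scores 0.
printf "%-15s %d\n", $_, score($_) for qw(lensmann gjerningsmann);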

Approaches within the field of anaphora resolution have in recent years focused on knowledge-poor strategies used in combination with corpora; at the same time, the notion of constructing a large and comprehensive base of real-world knowledge has been somewhat abandoned (see Mitkov 2003 for a brief overview). The approach in the present work expands the co-occurrence patterns found in a text collection to also consider words and patterns semantically similar to those present in the corpus. The association of semantically similar concepts is carried out through machine learning techniques and an association technique developed for this project.

The project described in the present work develops and examines a method for the automatic inference of concept groups consisting of semantically similar words from a collection of limited-domain texts. Information that becomes available by automatically classifying context patterns from a closed thematic domain is examined for the purpose of aiding anaphora resolution. The resulting concept groups can function as a form of real-world knowledge by representing information about which words can be expected in certain contextual environments. Since it is an established problem within the field of computational linguistics that the construction of knowledge bases requires a large amount of manual labour, it is of interest to examine methods which can contribute to automating this task.

1.1 Project outline

This thesis describes a method for the automatic association of clusters of semantically similar words collected from a limited thematic domain. The association of concepts is based on the distribution of arguments in particular syntactic contexts.

The method described consists of three steps:

1) the extraction method, which deals with the extraction of meaning structures from a text corpus
2) the classification method, which deals with the association of the extracted meaning structures into concept groups
3) the application of the meaning structures and the concept groups to anaphora resolution

In the extraction phase, semantic structures mainly corresponding to subject-verb-object relations are extracted and normalised to the form shown in (1-5) below. This type of relation is in this thesis termed an elementary predicate-argument structure (EPAS) and is described in greater detail in section 3.2.

(1-5)
predicate, argument 1, argument 2
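As a small illustration (not code from the thesis, whose extraction program is the Perl script ekstraktor.pl in Appendix B), an EPAS of the form in (1-5) could be held as a simple record; the field names here are assumptions made for the sketch:

use strict;
use warnings;

# Hypothetical representation of two EPAS of the form in (1-5):
# a predicate plus its first and second argument.
my @epas = (
    { pred => 'etterlyse', arg1 => 'lensmann',      arg2 => 'vitne' },
    { pred => 'drepe',     arg1 => 'gjerningsmann', arg2 => 'offer' },
);

# Print each structure in the normalised "predicate, argument 1, argument 2" order.
print join(', ', $_->{pred}, $_->{arg1}, $_->{arg2}), "\n" for @epas;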

In the classification phase, the extracted structures undergo processes that result in the grouping of concepts into clusters of semantically similar words.

The evaluation of the results obtained with the method developed in the project is twofold:

• the resulting concept classes are evaluated: does the method produce semantic clusters that are valid for the thematic domain of the text collection?
• the usefulness of the concept classes in anaphora resolution is evaluated: does the method provide a means to infer which entity is referred to in examples such as (1-1) and (1-2)?

Chapter 3 describes the extraction method, which uses output from a syntactic parser to collect semantic structures in the form of EPAS from the texts. The text collection used in this project consists of newspaper texts, all concerning a criminal case. The constraints that hold on the corpus are further described in sections 3.1 and 3.4. Section 3.2 explains the format of the meaning structures extracted from the text corpus as well as the motivation for choosing EPAS as the meaning representation. In sections 3.3 and 3.5 the process of parsing the texts and gathering the meaning representations from the parse output is outlined in further detail. Finally, the list of EPAS resulting from the extraction method is evaluated in section 3.6.

The classification method is described in chapter 4. In section 4.1 a classification approach using machine learning techniques is described; in section 4.2 the constituents of the EPAS are associated into semantically similar groups based on their contextual distribution; and finally these two approaches are applied in connection with one another in section 4.3. In section 4.4 the potential of using concept groups in anaphora resolution is discussed.

Final remarks and conclusions are found in chapter 5. Here the foundation of the extraction method is also briefly discussed.


The results obtained in this project provide a preliminary indication of the feasibility and usefulness of a referent-guessing helper such as the one described in the introduction. Hindle (1990) states that small corpora and human intervention in the analysis phase are factors that have contributed to obscuring the usefulness of semantic classification based on distributional context. Within the framework of the present work it was not possible to conduct a large-scale corpus-based study. This is partly due to the lack of a sufficiently robust and powerful extraction method, and will be discussed in greater detail in chapter 3. The text collection that the study is based on is clearly much too small to offer anything more than a tendency, and the degree to which the extraction process is manually manipulated is too high for the method to be called fully automated. Nevertheless, this thesis describes a pilot study and provides an indication of the quality and usefulness of the method.

Before going on to describe the method developed in the present work, a brief introduction to the concepts of context and anaphora resolution is needed. The importance and usefulness of classifying words according to the contexts they occur in, as well as a brief background on anaphora resolution, is provided in chapter 2.


2 Theoretical background

In order to understand the motivation for developing an extraction and classification method such as the one described in the present work, a brief explanation of the theoretical foundation on which the method is based is needed. In this chapter, the theoretical background of the method is described. In section 2.1 the concept of anaphora resolution and the need for context information in anaphora resolution systems are outlined. In section 2.2 the notion of using context as a means to identify semantically similar words is explained.

2.1 Anaphora resolution

Most natural language texts contain an abundance of pronouns and other expressions which are referentially linked to other items in the texts. In order to understand the meaning conveyed by a text, one needs a method for finding out which entities these expressions are linked to. It is difficult to determine what a pronoun refers to without taking context and real-world knowledge into account. Natural language requires a certain amount of context to be intelligible. We distinguish between linguistic context, which denotes the concrete linguistic setting in which a given word occurs, and a more general notion of context that refers to the non-linguistic setting. In the following, a background on the theoretical basics of anaphora will be given before some approaches to anaphora resolution are briefly outlined.

Anaphor and referring expression are both terms used for words that point back either to other words or to entities in the world. Anaphora² can be defined as the linguistic phenomenon of using an anaphor to point back to a previously mentioned item in a text (Mitkov 2003, p. 266).

In the Oxford Concise Dictionary of Linguistics (Matthews 1997), a referring expression is defined as a linguistic element that refers to a specific entity in the real world, termed a referent. A referring expression can be any natural language expression used to refer to a real-world entity, including nouns and pronouns. As such, the linguistic expressions James and he in a given text may both refer to a person called "James" existing in the real world.

² The term anaphora is in the present work used in alignment with current literature on anaphora resolution. Anaphora is the linguistic phenomenon of an anaphor pointing to another item in the text and should not be understood as the plural form of anaphor, which is anaphors.


The term anaphor describes a linguistic element, often a pronoun or a nominal, which is linked to another linguistic element previously presented in the text (Mitkov 2003). An anaphoric reference is usually supported by a preceding nominal, which is called an antecedent. If a referring pronoun is mentioned before its referent, the term cataphora applies (Jurafsky and Martin 2000, p. 675). Anaphora provides us with an indirect reference to a real-world entity. When a referring expression, such as James, has been introduced in a text, it allows for subsequent reference by anaphors, such as he or the boy. The original referring expression is therefore the antecedent of the subsequent referring anaphor, for example the pronoun he. If the anaphor and the antecedent it is linked to have the same referent in the real world, they are termed coreferential (Mitkov 2003, p. 267).

(2-1)
Politimannen sier at han har flere observasjoner
The policeman says that he has several observations

In example (2-1) above, the pronoun han (he) is an anaphor which points back to its antecedent, the referring expression politimannen (the policeman). Han and politimannen both refer to the same real-world referent, the entity "the policeman", and are therefore coreferential.

There are various and complex structural conditions on the co-occurrence of an anaphor and its antecedent. These include constraints on how far apart the antecedent and the referring anaphor can be without disturbing the understanding of the text. An elaborate discussion of these conditions is, however, not within the scope of the present work.

Mitkov (2003, p. 268) distinguishes between the following types of anaphora:

• pronominal anaphora: The anaphor is a pronoun.
• lexical noun phrase anaphora: The anaphor is a definite description or a proper name that gives additional information and has a meaning independent of the antecedent.
• verb anaphora: The anaphor is a verb and refers to an action.
• adverb anaphora: The anaphor is an adverb.
• zero anaphora: The anaphor is implicitly present in the text, but physically omitted.
• nominal anaphora: The anaphor has a non-pronominal noun phrase as antecedent.
  o direct: anaphor and antecedent are linked through identity, synonymy, generalisation or specialisation.
  o indirect: anaphor and antecedent are linked through part-of relations or set membership.

He states that pronominal anaphora is the most frequent type, while nominal anaphora (indirect anaphora in particular) usually requires real-world knowledge to be resolved. In this thesis, the described method will be tested on occurrences of pronominal anaphora from the text collection.

2.1.1 Frameworks for anaphora resolution

Anaphora resolution is the process of determining the antecedent of an anaphor (Mitkov 2003, p. 269). In our minds, we build a discourse model that represents the entities mentioned in the discourse and shows the relationships between them (Webber 1978, in Jurafsky and Martin 2000). A representation is evoked in the model upon the first mention of an entity in the discourse, and is subsequently accessed from the model if the entity is mentioned again, either by name or by way of anaphora. Entities will have varying degrees of salience in the discourse model, depending on how frequently they have been mentioned and on how long ago they were last mentioned. This notion of a discourse model is used both in theories which aim at describing the process of anaphora resolution, and in computational approaches that automate the anaphora resolution process.

There are different approaches to resolving the referring expressions and anaphors that occur in natural language discourse. Several formalisms offer frameworks describing the theory of discourse representation in general and anaphora binding in particular. In the following sections, two of these formalisms will be briefly outlined.


2.1.1.1 Discourse representation theory

Discourse representation theory (DRT), proposed by Hans Kamp in 1981, represents a way of creating dynamic semantic representations for natural language discourse. The framework aims at representing linguistic units larger than sentences and is particularly useful for representing the way a discourse changes with every new sentence that is introduced. The core structure within DRT is the discourse representation structure (DRS), which is transformed through the processing of each sentence of a discourse. Since every sentence in a discourse can potentially introduce new concepts and entities which are referentially linked to previously introduced entities, it is not possible to infer the full meaning of an individual sentence without regarding the discourse it fits into. As Kamp and Reyle put it, "the meaning of the whole is more, one might say, than the conjunction of its parts" (Kamp and Reyle 1993, p. 59). The interpretation of a new sentence relies both on the meaning of the sentence itself and on the structure representing the context of the earlier sentences (Kamp and Reyle 1993, p. 59). Thus, a new sentence is interpreted as a contribution to the existing representation of the discourse.

DRT establishes anaphoric links across sentence boundaries, between anaphors in the current sentence and antecedents in the DRS, as the new sentence is processed. A very simplified outline of what happens when the discourse in (2-2) is processed and a DRS is created is given below (after Kamp and Reyle 1993).

(2-2)
a. Jones owns Ulysses.
b. It fascinates him.

(2-2a) is entered into the DRS by applying phrase structure rules and lexical insertion rules in order to associate the sentence with a syntactic representation and a set of features and values for the individual words. Examples of assigned features are number, gender and transitivity (for verbs). The DRS for (2-2a) will in abridged form look like this:


(2-3)

Discourse referents: x y
DRS conditions:
  Jones (x)
  Ulysses (y)
  x owns y

Upon entering (2-2b) into the DRS, a series of actions must be performed to calculate which entities in the DRS the two pronouns of sentence (2-2b) refer to. In the case of the sentences in (2-2), where the DRS only contains two members, this can be determined on the basis of gender agreement. The updated DRS will appear as in (2-4) below.

(2-4)

Discourse referents: x y u v
DRS conditions:
  Jones (x)
  Ulysses (y)
  x owns y
  u = y
  v = x
  u fascinates v
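A DRS of this kind is just a set of discourse referents plus a list of conditions, so it can be mirrored directly in a small data structure. The following Perl fragment is an illustration added here, not part of DRT or of the thesis:

use strict;
use warnings;

# The DRS of (2-4): discourse referents and the conditions that hold on them.
my %drs = (
    referents  => [qw(x y u v)],
    conditions => [
        'Jones(x)', 'Ulysses(y)', 'owns(x, y)',
        'u = y', 'v = x', 'fascinates(u, v)',
    ],
);

print "Discourse referents: @{ $drs{referents} }\n";
print "  $_\n" for @{ $drs{conditions} };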

DRT offers a framework for creating and storing semantic representations of the meaning conveyed in a natural language discourse. The theory does not, however, offer a means to identify the referent for ambiguous anaphors, or for anaphors which require real-world knowledge in the process of determining their referents.

2.1.1.2 Binding Theory

Binding Theory (BT) is a theoretical framework that describes syntactic conditions for intra-sentential anaphoric linking. BT offers conditions for whether a nominal expression can, must, or must not be linked to another nominal in the sentence. Within BT, reflexive pronouns and reciprocals are termed anaphors, while non-reflexive pronouns are called pronouns or pronominals. This understanding of these terms will also be used in the following when referring to BT. The NP which is linked to by an anaphor or pronoun is, for BT purposes, the binder of the anaphor or pronoun. In example (2-5), he is the binder of himself, while himself is bound by he.

(2-5)
He hurt himself.

Chomsky's binding theory has three principles, shown in (2-6) below (after Chomsky 1981, in Asudeh and Dalrymple 2004).

(2-6)
A. An anaphor (reflexive or reciprocal) must be bound in its local domain.
B. A pronominal (nonreflexive pronoun) must not be bound in its local domain.
C. A nonpronoun must not be bound.

The implications of these principles, as well as the notion of a local domain, are exemplified by example (2-7). In (2-7a) the subclause the thief hurt himself constitutes the local domain for the anaphor himself. As, according to principle A above, the anaphor can only be bound in its local domain, the noun phrase the sergeant is not a possible binder. In (2-7b), the pronominal him must not (cannot) be bound in its local domain, and can therefore be bound by the noun phrase the sergeant. The pronominal need not, however, be bound to a noun phrase expressed in the sentence; it can also refer to a discourse referent not mentioned in the sentence.

(2-7)
a. The sergeant said that the thief hurt himself.
b. The sergeant said that the thief hurt him.

The fact that not all possible candidates in a syntactic domain can be binders is maintained by the requirement that the binder must be in a position structurally dominating the entity to be bound. This ensures that the noun phrase the sergeant cannot be the binder of the pronoun himself in example (2-8).

(2-8)
The sergeant's suspect hurt himself.

Hellan (1988) suggests that the principles of standard Government and Binding theory are primarily based on English and that they cover "a very limited subpart of what constitutes a possible anaphoric system" (Hellan 1988, preface). He proposes several additional principles, perhaps most notably the Command Principle, which states, among other things pertaining to the command relation, that relations within hierarchies of thematic roles can also stand in a command relation to an anaphor.

2.1.2 Computational approaches to anaphora resolution

Automated anaphora resolution systems basically have to perform three separate tasks (Mitkov 2003):

• identify the anaphors to be resolved
• locate the candidates for antecedents
• select the antecedent from the candidate list

Different computational approaches apply different resolution factors and knowledge sources. The process of resolving the antecedent is based on several resolution factors, which in turn draw on quite different sources of background knowledge. Using morphological knowledge may be the simplest approach: gender and/or number is compared, and candidates are discounted if their gender or number does not fit that of the anaphor. Syntactic knowledge is used to identify syntactic parallelism; the antecedent is often found in a syntactic position similar to that of the anaphor. In many cases, the correct antecedent cannot be identified without the help of semantic information. Selectional constraints are one example of semantic knowledge that can be used to narrow down the list of candidates for the antecedent. Repeated mention of an entity in the text passage preceding the anaphor may indicate that this entity has a higher degree of salience in the discourse and that it therefore is a likely antecedent for a following anaphor (Jurafsky and Martin 2000, p. 682). Morphological, lexical, syntactic, semantic and salience criteria used as background knowledge do not immediately suggest the most likely candidate, but rather act as filters to eliminate unsuitable candidates (Mitkov 2003, p. 271). Mitkov (2003, p. 272) states that for some examples "the crucial and most reliable factor in deciding on the antecedent" is real-world knowledge. Even the most sophisticated anaphora resolution system will not be able to resolve anaphora of the type that needs real-world knowledge to rule out candidates that simply do not make common sense. The examples in (2-9) below illustrate the point; without access to real-world knowledge or semantics, there is no way to confidently resolve the antecedent of the anaphoric han (he).

(2-9)
a. Politimannen skjøt etter morderen, og han falt.
   The policeman shot at the murderer and he fell.
b. Politimannen skjøt etter morderen, og han bommet.
   The policeman shot at the murderer and he missed.

2.1.2.1 Knowledge-free approaches

Botley and McEnery term anaphora resolution systems which do not consult any form of knowledge representation in the process of identifying the antecedent of an anaphor "knowledge-free" (Botley and McEnery 2000, p. 17). In the two following sections it will be shown, on the basis of two well-established syntactic algorithms for anaphora resolution, that knowledge-free approaches, which resolve anaphors without employing real-world knowledge, cannot identify different antecedents for examples (1-1) and (1-2).

2.1.2.1.1 Lappin and Leass' algorithm for pronoun resolution

Lappin and Leass (Lappin and Leass 1994, in Jurafsky and Martin 2000) offer an algorithm for pronoun interpretation which takes into account recency and syntactically based preferences. The algorithm does not employ semantic preferences or background knowledge, but uses a weighting system which reflects various syntactic features as well as recency in the discourse. When testing this algorithm on test data from the same genre as was used to develop the weighting system, Lappin and Leass report an accuracy of 86%. Jurafsky and Martin present a somewhat simplified part of the algorithm for the resolution of non-reflexive, third-person pronouns (Jurafsky and Martin 2000, p. 684). The Lappin and Leass algorithm builds a discourse model while processing a sentence and assigns each member of the discourse model a salience value. A set of salience factors determines the salience weight each of the members is assigned. The aspect of recency is maintained by reducing each member's salience value by half upon the processing of a new sentence. (2-10) below shows the weighting system of the salience factors in the system.

(2-10)

SALIENCE FACTOR                                   VALUE
Sentence recency                                    100
Subject emphasis                                     80
Existential emphasis                                 70
Accusative emphasis                                  50
Indirect object or oblique complement emphasis       40
Non-adverbial emphasis                               50
Head noun emphasis                                   80

In the following, this algorithm will be used in an attempt to resolve the referent of the anaphor han (he) in the second sentence of each member of the sentence pair presented in (1-1) and (1-2) and repeated in (2-11) below:

(2-11)
a. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig kommer til å drepe igjen. Han etterlyser vitner som var i sentrum søndag kveld.
   The sergeant leading the investigation says that the perpetrator probably will kill again. He puts out a call for witnesses who were in the city centre Sunday evening.
b. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig kommer til å drepe igjen. Han er observert i sentrum.
   The sergeant leading the investigation says that the perpetrator probably will kill again. He is observed in the city centre.

When attempting to find the antecedent for the pronoun in (2-11a), all potential referents must be collected. Since the discourse consists of only two sentences, only the first sentence is processed; in the event of a longer discourse, the referent is searched for in up to four preceding sentences. The potential referents lensmannen (the sergeant), etterforskningen (the investigation) and gjerningsmannen (the perpetrator) are assigned salience values as shown in (2-12) below.


(2-12)

                 REC   SUBJ   EXIST   OBJ   IND-OBJ   NON-ADV   HEAD N   TOT
lensmann         100    80                                50       80     310
etterforskning   100                   50                                 150
gjerningsmann    100    80                                50       80     310

As the algorithm moves on to the next sentence, the values assigned in (2-12) are reduced by half, as shown in (2-13).

(2-13)

Referent        Phrases           Value
lensmann        lensmannen        155
etterforskning  etterforskningen  75
gjerningsmann   gjerningsmannen   155

Now potential referents which do not agree in gender or number are removed. In our case, the pronoun is han (he), which for Norwegian bokmål specifies an animate referent. According to the preference factors in the algorithm, which check for gender and number only, the potential referent etterforskning cannot, however, be removed. At this stage, referents which do not satisfy intra-sentential syntactic coreference constraints are also removed. Final salience values are calculated by assigning values for syntactic role parallelism and cataphora. In our case, lensmann is given extra weight (35) due to its syntactic parallelism with the anaphor. This results in the values shown below:

(2-14)

Referent        Phrases           Value
lensmann        lensmannen        190
etterforskning  etterforskningen  75
gjerningsmann   gjerningsmannen   155

Since lensmann has the highest salience weight, this word is chosen as the referent for the pronoun han. From the processing of sentence (2-11a) it is clear that the same referent would be assigned in a processing of sentence (2-11b). The algorithm does not take the semantic meaning of the sentence being processed into account, and is therefore unable to differentiate between the referent assignments in examples such as (2-11a) and (2-11b).
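As a rough sketch of the bookkeeping just described (an illustration written for this walkthrough, not Lappin and Leass' implementation), the weights from (2-10), the halving step and a role-parallelism reward of 35 (the reward used by Lappin and Leass 1994, consistent with the step from (2-13) to (2-14)) can be coded as follows:

use strict;
use warnings;

# Salience weights from (2-10).
my %w = (
    recency => 100, subject => 80, existential => 70, accusative => 50,
    indirect => 40, non_adverbial => 50, head_noun => 80,
);

# Salience after the first sentence of (2-11), as in (2-12).
my %salience = (
    lensmann       => $w{recency} + $w{subject} + $w{non_adverbial} + $w{head_noun},  # 310
    etterforskning => $w{recency} + $w{accusative},                                   # 150
    gjerningsmann  => $w{recency} + $w{subject} + $w{non_adverbial} + $w{head_noun},  # 310
);

# Moving on to the next sentence halves every value, as in (2-13).
$salience{$_} /= 2 for keys %salience;

# The candidate parallel in syntactic role to the anaphor is rewarded, as in (2-14).
$salience{lensmann} += 35;

# Highest salience wins: lensmann (190) over gjerningsmann (155).
printf "%-15s %d\n", $_, $salience{$_}
    for sort { $salience{$b} <=> $salience{$a} } keys %salience;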



2.1.2.1.2 Hobbs' tree search algorithm

Hobbs' tree search algorithm (Hobbs 1978, in Jurafsky and Martin 2000) is a pronoun resolution algorithm based on syntactic tree structures of the sentences to be processed. To resolve the antecedent of a pronoun, the tree search algorithm processes syntactic representations of all previous sentences in the discourse, as well as of the sentence containing the pronoun to be resolved. The syntactic representations of the discourse, in combination with the order in which the syntactic structures are searched, to some degree represent a discourse model and salience preferences. In its search for the antecedent of a pronoun, the algorithm traverses syntactic trees in a left-to-right, breadth-first manner.

To find the antecedent for the pronoun han in the sentences in (2-11), syntactic tree structures of the sentences are needed. The syntactic structures of (2-11) presented in Figure 1 and Figure 2 were generated by the NorGram grammar web version³. This is a more complex grammar than the one used in Jurafsky and Martin's outline of the algorithm (Jurafsky and Martin 2000, p. 689), but as stated there, the algorithm to a large degree allows any choice of grammar. Since the algorithm is based on searches in syntactic trees, and therefore relies on assumptions regarding the build-up of the syntactic structures, the grammar must be specified in any case. In the following, the tree search as specified in Jurafsky and Martin (2000) will be carried through, with the grammar assumptions that this implies. The syntactic trees (and their labels) are included for illustrational purposes and should not be thought of as input for the algorithm.

³ Generated at http://decentius.hit.uib.no:8010/logon/xle-mrs.xml 31/01-2005

Figure 1

Figure 2

In the process of identifying the antecedent for the pronoun han in the second sentence of (2-11a), the algorithm takes as its starting point the NP immediately dominating the pronoun. From there, it moves upwards in the tree to the first NP or sentence node. This node is called X, and the path from the pronoun to X is called p. In our case, this means that X is the topmost sentence node (the IP node). Since there are no branches to the left of X and p, in other words no possible antecedents introduced earlier in the same sentence, the algorithm moves along to the parse tree of the previous sentence. Searching left-to-right and breadth-first, the first NP node that is encountered is suggested as the antecedent for the pronoun. In our case, this means that the algorithm would propose the NP lensmannen som leder etterforskningen as the antecedent for the pronoun han. As will be clear from examining the example sentences in (2-11), this is a correct resolution of the antecedent in (2-11a), but not in (2-11b). Like the Lappin and Leass resolution algorithm, the tree search algorithm does not consider the semantic meaning of the sentence containing the anaphor to be resolved. The pronouns han in the second sentences of (2-11a) and (2-11b) are treated in the same way, and lensmannen is chosen as the most likely antecedent in both cases.
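The traversal itself is easy to picture in code. The toy Perl fragment below (an illustration only, not the full Hobbs algorithm, which also tracks the path p and performs intra-sentential steps) runs a left-to-right, breadth-first search over a hand-built, heavily simplified tree for the first sentence of (2-11) and proposes the first NP it meets:

use strict;
use warnings;
use utf8;

# A node is [label, child, child, ...]; leaves are plain strings.
my $tree = [ 'IP',
    [ 'NP', [ 'N', 'lensmannen' ],
            [ 'CPrel', 'som leder etterforskningen' ] ],
    [ 'VP', [ 'V', 'sier' ],
            [ 'CP', [ 'NP', [ 'N', 'gjerningsmannen' ] ],
                    [ 'VP', 'trolig kommer til å drepe igjen' ] ] ],
];

# Breadth-first, left-to-right search: the first NP node found is proposed.
sub first_np {
    my @queue = (shift);
    while (my $node = shift @queue) {
        next unless ref $node;              # skip leaf strings
        my ($label, @children) = @$node;
        return $node if $label eq 'NP';     # shallowest, leftmost NP wins
        push @queue, @children;             # enqueue children left to right
    }
    return undef;
}

my $np = first_np($tree);
print "Proposed antecedent headed by: $np->[1][1]\n";   # lensmannen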

2.1.2.2 Traditional approaches to anaphora resolution

As seen in the examples above, anaphors of the type that requires semantic information to be resolved simply cannot be resolved using purely syntactic algorithms. In order to find the antecedent of such anaphors, some sort of real-world knowledge must be consulted. Mitkov (1999) distinguishes between traditional and alternative approaches to anaphora resolution. The traditional approaches are those that use knowledge factors to filter out unlikely candidates and then use preference rules on a smaller set of likely candidates, while the alternative approaches find the most likely candidate based on statistical or AI techniques (Mitkov 1999, p. 8). The traditional approaches usually draw in the factor of real-world or domain knowledge, often in the form of a comprehensive knowledge or domain base, in order to resolve anaphors of the type in examples (2-9) and (2-11) above (Mitkov 2003). Such approaches are also called knowledge-based (Botley and McEnery 2000, p. 11). It has repeatedly been emphasised above that some types of anaphors cannot be correctly resolved without access to real-world information. Carbonell and Brown's (1988) multi-strategy approach is one traditional knowledge-based anaphora resolution system. Their approach follows what Botley and McEnery call "a trend […] towards the integration of several different resolution algorithms into large-scale modular architectures" (Botley and McEnery 2000, p. 17). Their system draws on different knowledge sources, including syntactic structure, case-frame semantics, dialog structure and real-world knowledge. The resolution of anaphors is based on constraints and preferences; first the constraints are applied to narrow down the list of potential antecedents, and then the preferences are applied to each of the remaining candidates (Carbonell and Brown 1988, p. 98). Real-world knowledge is realised as a set of precondition and postcondition constraints. These constraints determine, for example, that a given object is no longer in the possession of the actor after a successful act of giving has been carried out. The main problem with such an approach is stated by the developers: "the strategy is simple, but requires a fairly large amount of knowledge to be useful for a broad range of cases" (Carbonell and Brown 1988, p. 97).

Generally speaking, the knowledge bases that knowledge-based systems for anaphora resolution rely on are difficult to represent and process, and require a considerable amount of human input (Mitkov 2001, p. 110). The information is structured using different frameworks; often, each anaphora resolution system structures its knowledge base in a system-specific manner. Rather than outlining various specific methods belonging to the traditional approaches, some of the formats used for knowledge representation are briefly mentioned below. Several frameworks have been developed to cope with the need for a formalism to represent real-world or domain knowledge. Most of these have been part of specific anaphora resolution systems and have not constituted independent frameworks for the representation of real-world knowledge.

Minsky's Frames (Minsky 1975, in Botley and McEnery 2000) is a framework for representing knowledge about stereotyped objects and events. The frames are dynamic in the sense that the information they hold about a particular object or event can change if new information is encountered. Input into the system is interpreted in accordance with the information present in the frames; the frames generate expectations about the input (Botley and McEnery 2000, p. 12). If a "shooting frame" is evoked upon processing the sentence in (2-9a), it creates the expectation that if somebody misses, it is likely to be the same person who was doing the shooting. Following such an expectation, it is easy to identify the correct antecedent for the anaphor. Schank's Scripts (Schank 1972, in Botley and McEnery 2000) have some similarity to Minsky's Frames, but are primarily used to represent knowledge about events which do not undergo change (Botley and McEnery 2000, p. 12). Information about role assignment and the sequence of events in given contexts is represented in the script.

2.1.2.3 Alternative approaches to anaphora resolution

Hand-coded knowledge bases that aim at representing real-world or domain knowledge are expensive and labour-intensive to build and maintain. As a consequence, the focus has over the last 15 years shifted toward systems that rely less heavily on world knowledge (see Mitkov 2003 for an overview). Many of these systems incorporate semantic and real-world knowledge, but use methods that allow this information to be collected with a high degree of automation (Baldwin 1997; Dagan and Itai 1990; Dagan et al. 1995; Nasukawa 1994; inter al.). Mitkov (2003) terms these systems knowledge-poor and attributes their growth in number in recent years to the fact that corpora and similar electronic linguistic resources have become better, larger and more available. Some of these systems do not really attempt to build a world or domain knowledge base (Baldwin 1997; Nasukawa 1994), but rather look at features such as co-occurrence patterns in the text itself, while others integrate corpora and use them as a form of abstract knowledge base (Dagan and Itai 1990; Dagan et al. 1995).

Among the different "alternative" approaches, Dagan and Itai's (1990) statistical approach, Dagan et al.'s (1995) estimation of unseen patterns and Nasukawa's (1994) knowledge-free method are of particular interest for this project. Dagan and Itai's (1990) method uses co-occurrence patterns observed in a corpus as a type of selectional restriction. Co-occurrence patterns observed in a large corpus are thought to reflect the semantic constraints that apply to natural language. Candidate antecedents for the anaphor it are identified in the text and put in the place of the anaphor to be resolved. This produces co-occurrence patterns that are checked against the corpus. Subsequently, the candidate present in the most frequently occurring co-occurrence pattern is chosen as the antecedent. This method relies on a large corpus, as only patterns which actually have been seen in the corpus are considered. Infrequent patterns will not be picked since, generally speaking, they will not feature at the top of the pattern list. Dagan et al. (1995) offer a solution to this problem by presenting a similar method which also estimates the probability of co-occurrence patterns that have not been observed in the training data. They state the importance of distinguishing between probable and improbable unobserved co-occurrence patterns and emphasise that the "distinctions ought to be made using the data that do occur in the corpus" (Dagan et al. 1995, p. 164). Analogies are made between specific unseen co-occurrence patterns and observed co-occurrences which contain similar words, with word similarity determined by a similarity metric. Patterns that contain words similar to the target word and that have been observed in the training data are used to calculate how likely the target word is to occur in the same pattern. Nasukawa (1994) reports a resolution rate of 93.8% with an even more knowledge-poor method for pronoun resolution. Instead of drawing information from a corpus, word frequency and co-occurrence patterns in the text itself are used to filter out the most likely candidate for the antecedent. In Nasukawa's approach, inter-sentential data is exploited in the process of resolving the pronoun it. The likelihood of the antecedent is determined statistically, and the antecedent candidate with the highest value is selected by the system. The approach uses a syntax-based heuristic rule for the selection of the antecedent. Nasukawa states that approaches using real-world knowledge are not yet large-scale enough to be of use in broad-coverage systems, and instead attempts to extract information corresponding to world-knowledge case frames from the texts to be processed (Nasukawa 1994, p. 1157). In this way, collocation patterns are used as a form of world knowledge for the domain of the texts.
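A minimal Perl sketch of Dagan and Itai's substitution step, with invented English words and counts (the method itself operates on patterns drawn from a large parsed corpus):

use strict;
use warnings;

# Invented corpus counts for subject-verb-object co-occurrence patterns.
my %corpus_count = (
    'police arrest suspect'  => 12,
    'suspect arrest suspect' => 0,
);

# Each candidate antecedent is substituted into the anaphor's slot and the
# resulting pattern is looked up; the most frequent pattern wins.
my @candidates = qw(police suspect);
my ($best) = sort {
    ($corpus_count{"$b arrest suspect"} // 0)
        <=> ($corpus_count{"$a arrest suspect"} // 0)
} @candidates;

print "Preferred antecedent: $best\n";   # police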

As outlined in the introduction, this thesis describes a method that can aid in the resolution of anaphoric expressions that require real-world knowledge for their antecedents to be correctly resolved. The method automatically extracts and classifies nominal arguments, resulting in associated classes of similar words. This is a knowledge-poor method in the sense that it does not require a comprehensive knowledge base to be built, but rather uses data and co-occurrence patterns from a corpus to find the most likely antecedent from a list of possible candidates found in a text.

2.1.3 Anaphora resolution and text summarisation

As already mentioned, several NLP applications need a reliable means to resolve anaphoric expressions and identify coreferences. In the field of text summarisation, which belongs to the domain of the KunDoc project, anaphora resolution is vital for the process of finding coreferential chains, identifying discourse structure and ultimately producing a coherent summary. Systems for automatic text summarisation need to make a number of choices regarding the resolution of anaphoric expressions. Mani (2001, p. 70) identifies "dangling anaphors" as a coherence problem in automatic summaries; without a means to resolve anaphoric expressions, the summary may contain anaphors but not the antecedents they refer to. This disturbs the coherence of the summary; not all the information that the reader needs is present in the summarised text. The (constructed) example below illustrates this: consider the full-text example in (2-15a) in connection with the summarised version in (2-15b). Neither instance of the pronoun han (he) in the summarised version in (2-15b) has an identified referent in the text. For a reader presented only with the summary, it is highly unclear what these pronouns refer to.

(2-15)
a. Politiet etterlyste i dag tidlig en syklist i forbindelse med drapet på 23 år gamle Anne Slåtten. I formiddag meldte han seg til politiet, skriver bt.no
   - Jeg har foreløpig ikke klarhet i hva han har sagt, forteller lensmannen i Førde Kjell Fonn.

   This morning the police instituted a search for a biker in connection with the murder of 23-year-old Anne Slåtten. This morning he reported to the police, writes bt.no
   - For the time being I am not in the clear about what he has said, says the sergeant in Førde, Kjell Fonn.

b. I formiddag meldte han seg til politiet, skriver bt.no
   - Jeg har foreløpig ikke klarhet i hva han har sagt, forteller lensmannen i Førde Kjell Fonn.

   This morning he reported to the police, writes bt.no
   - For the time being I am not in the clear about what he has said, says the sergeant in Førde, Kjell Fonn.

Another reason why anaphora resolution is important for text summarisation is that the methods used for retrieving relevant sentences for a summary perform more accurately if anaphoric references to central concepts are also considered (Mitkov 2003, p. 276).

The emergence of knowledge-poor and corpus-based approaches to anaphora resolution suggests that the representation of real-world knowledge does not necessarily have to take the form of a human-made system. Alternative approaches show that information available from the text to be analysed, or from larger bodies of natural language text, can be used to provide information that resembles real-world knowledge. The following section explains this notion of using contextual information to gain intuitions about world knowledge.


2.2 Finding meaning in the context

"You shall know a word by the company it keeps!" Firth (1957, p. 179)

"The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities." Harris (1968)

2.2.1 The distributional approach

The semantic meaning of a word is often readily suggested by the lexical context in which it occurs. This is an idea advanced by many scholars, starting with Firth (1957) and Harris (1968). Human beings use the context of a word in the process of deciding its semantic meaning. When encountering an ambiguous word, the language user has a finite number of possible meanings to consider. By examining the environment the ambiguous word occurs in, the language user finds clues toward deciding which of the possible meanings is applicable. The same mechanism applies when a language user is confronted with a novel word; by observing the usage of the word, preferably over several instances, a human being is able to induce its semantic meaning from the setting it occurs in. This is in accordance with the Distributional Hypothesis as proposed by Harris (1968) and contributes to explaining why humans rarely have problems identifying, for example, what an ambiguous word means, or which entity in the discourse a pronoun refers to. The fact that properties of a word's linguistic environment can carry information about the meaning of the word makes the environment a useful tool for the semantic comparison of words. A word which is used within a limited thematic domain is likely to be used in a sense specific to the contextual setting, or domain, in which it occurs. This entails that the linguistic environment in which the word exists also gives information about the meaning of the word. Words that appear in the same linguistic setting in texts that describe the same theme may have similar or related meanings as well. Texts belonging to the same domain will to some extent contain information about the same things, and as such also contain semantically similar words which are used in similar ways. Following this line of thought, it should be possible to gain relevant information about which words to expect in specific positions in a text by looking at the context patterns they should fit into. Thus, words that occur in limited-domain texts can be classified relative to how they combine with each other.

The idea that the contextual environment can give clues about the semantic meaning of a word is clearly not a new one, considering the quotes from Firth and Harris in the introduction to this section. The theory dates back to the empiricists of the mid-twentieth century. Linguistic theory in the first half of the twentieth century was to a large degree dominated by empiricism. Linguistic thought, particularly in the United States but also in Europe, was strongly influenced by the positivism of behaviourist philosophy. Bloomfield is regarded as one of the chief advocates of linguistic positivism, and his interpretation of linguistics dominated American linguistics in the 1930s and 1940s (Robbins 1997, p. 237). The positivist/behaviourist view of linguistic science put emphasis on the observable. Reliable facts could only be found through objective observation of data, and only phenomena which could be empirically experienced by any observer were considered valid data for further analysis. Robbins states that the favoured model of description of the time was that of distribution; for some linguists the notion of linguistic description coincided with the statement of distributional relations (Robbins 1997, p. 239). He also attributes the lack of emphasis on the study of semantics in the early twentieth century to Bloomfield's dismissal of the possibility of an empirically based study of this field. Since the analysis of meaning requires non-linguistic knowledge as well, semantic analysis was deemed less suited to empiricist methods. While the study of semantics had previously aimed at creating an exhaustive description of what is referred to by a linguistic entity, Firth represents a challenge to this way of thinking. His “contextual theory of language” introduced a move in semantics toward a statement of meaning as a function of how words are used (Robbins 1997, p. 247). Together with Harris, Firth represents the distributional approach to finding semantic meaning. Within this approach, meaning is treated as semantic functions related to contexts of situation. This way of analysing meaning is data-driven in the same sense as empiricist approaches in other fields of linguistics and is strongly connected to the positivistic philosophy of science predominant at the time. However, using bottom-up methods as a means to formulate linguistic theories is a direction that was more or less abandoned after Chomsky's criticism of the structuralist approaches. Chomsky challenged the philosophical and scientific foundation of the Bloomfieldian canon through his proposal of transformational-generative grammar. He dismissed the behaviourist approach to language as the unacceptable product of the strong empiricism of the Bloomfieldian behaviourist school. The shift from empiricism to rationalism marks an important turning point in linguistic theory (Robbins 1997, p. 260). Botley and McEnery state that Chomsky, and the generation of linguists following his theories, represent a knowledge-driven approach with the goal of formulating linguistic theories (Botley and McEnery 2000, p. 24).


The method of describing semantic meaning by looking at the distribution of text in context was more or less abandoned in the decades following the paradigm shift from empiricism to rationalism. Semantic analysis was approached through new methods within linguistic theory, and the meaning-is-use approach was deemed too simple. In recent years, however, computational linguistics has brought some of these old ideas forward again. This is mainly due to the increasing availability of large, computer-readable corpora and powerful processing tools that can reliably perform operations on large data sets. The emergence of corpus approaches is a move away from the Chomskyan view toward an emphasis on actual observable linguistic behaviour (Botley and McEnery 2000, p. 24). Thus, the bottom-up approaches of Firth and Harris are again in fashion; by using corpora, computational linguists today are able to look at actual occurrences of data and use these to develop theories of linguistic performance. Leech argues that corpus linguistics is not a linguistic theory, but rather a methodology (Leech 1992, in Botley and McEnery 2000, p. 23). Rather than being primarily theoretically founded, corpus linguistics as a discipline focuses on linguistic performance and description, as found in actual occurrences of natural language text. One can say that the ideas of Firth and Harris have received a pragmatic renaissance, probably in large part because of the available computational tools. Whether these new applications of the distributional theories from the 1950s reflect a reconsideration of their usefulness and theoretical foundation, or whether they merely show that computational linguistics is a more pragmatically than theoretically founded science, is a discussion well outside the scope of the present work. What can be stated, though, is that the notion of using a word's context to find out something about the meaning of that word is an approach that seems to provide interesting results regarding semantic meaning, regardless of the motivation for such an approach. The type of semantic information available from the context is, however, not necessarily of the same kind as that referred to when speaking of the semantic meaning of a word. Information obtainable from looking at distribution over several contexts rather provides a measure of semantic relatedness or closeness. Instead of providing a means to obtain or define the direct semantic meaning, methods that rely on contextual distribution return words that are more or less semantically related to a target word.


2.2.2 Different types of context

So far, we have argued that using context as a tool to indicate the semantic meaning of a word is a useful method in linguistics. The method's theoretical foundation dates back to the middle of the twentieth century, but it has not been pursued much in the last few decades. Even though the linguistic foundation of this method has been contested, the advance in computational resources in recent years has brought the approach forward again. What has not yet been discussed in this thesis, however, is the different types of context that can be taken into consideration. Accepting that the semantic meaning of a word is suggested by the linguistic context in which it occurs, or “the company it keeps”, supports the notion that different words used in the same context are semantically similar. It does not, however, provide a means of calculating the degree of this similarity, or even of finding out exactly which words are similar to each other. Depending on the information one wishes to obtain about a target word, different context types mirror different aspects of its semantic meaning. Any approach that attempts to describe semantic meaning based on the contextual distribution of words in a text collection must first define the type of context that will best reflect the desired information. Somewhat simplified, we distinguish between topical context and local context.

2.2.2.1 Topical context

Topical context (Miller and Leacock 2000), or document context, is a quite wide term that covers what we could call the “wide conception” of context. All other content words which occur in the same environment as a target word are considered to make up the context of the word and, following the discussion above, contribute to indicating the semantic meaning of the target word. A target word's contextual environment can be further specified depending on the purpose; in short, the context is simply all the words which occur within a context window of varying size. The window can be set to cover a certain number of words before and after a target word, or to consist of the entire document the target word occurs in. Different parameters determine the weighting of each word found within the context window; for example, words can be weighted according to their distance from the target word. One extreme way of looking at topical context is the bag-of-words model, where a document is seen as an unordered collection of words, and the words are weighted by the number of times they occur in a document. In a more narrow sense, topical context can be limited to consisting only of the other words in the same sentence as the target word.
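To make the notion of a context window concrete, the following is a minimal sketch of how topical context could be collected for a target word, written in Perl like the extraction tooling described in chapter 3. The target word, the window size and the crude tokenisation are illustrative assumptions only, not part of the method used in this thesis.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Collect the topical context of a target word: every word occurring
    # within a fixed window around each occurrence of the target.
    # Window size and tokenisation are illustrative choices.
    my $target = 'sykepleiestudent';   # hypothetical target word
    my $window = 5;                    # words before and after the target

    my %context;                       # context word => frequency
    my @tokens = split /\W+/, lc join ' ', <>;   # crude tokenisation of input

    for my $i (0 .. $#tokens) {
        next unless $tokens[$i] eq $target;
        my $from = $i - $window < 0 ? 0 : $i - $window;
        my $to   = $i + $window > $#tokens ? $#tokens : $i + $window;
        for my $j ($from .. $to) {
            next if $j == $i;          # the target itself is not context
            $context{ $tokens[$j] }++;
        }
    }

    # Print the context words by falling frequency.
    for my $word (sort { $context{$b} <=> $context{$a} } keys %context) {
        print "$word\t$context{$word}\n";
    }

Run over a small collection of news texts, a sketch like this returns frequency-ranked context words of the kind listed in example (2-16) below.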

The extraction of topical context does not draw on syntactic or semantic information and therefore provides no indication of the relationship that the words in the context have to each other or to the target word. It is therefore not possible to say anything specific about semantic similarity based solely on the occurrence of words in the topical context. As the name indicates, this type of context gives information about the topic, or domain, of the text that the target word exists in.

Consider the words in example (2-16) as the topical context for the target word sykepleiestudent (student nurse). The context words are words which occur more than once in a short newspaper text from the text collection used in this project. Even with such a rudimentary method of selecting the words in the topical context, it is clear that this type of context provides cues to the thematic domain that the target word occurs in. The topical context does not, however, provide a means of finding words that are semantically similar to the target word. No close synonyms are retrieved, but rather words belonging to the same discourse domain as the target word.

(2-16)

kvinne              woman
funn (subst)        finding (noun)
funnet (partisipp)  found (participle)
død                 dead
Førde               Førde
politi              police
leteaksjon          search party

2.2.2.2 Local context

Local context provides a more finely tuned way of looking at semantic similarities as expressed through the distribution of words in a text. In its simplest form, a word's local context consists of its immediately surrounding words; that is, the words immediately preceding and following a target word. The notion of local context can also be extended to include information about syntactic and grammatical properties that belong to the target word and its immediate neighbours. For example, a target word's local context can be seen as its subject and object, or as the adjective preceding it.

Several studies show that classifying words based on the local context in which they occur gives information about the semantic meaning of the words, rather than their membership in a thematic domain, as found when examining the topical context (Hindle 1990; Grefenstette 1992; Lin 1998; Lin and Pantel 2001; Pereira et al. 1993; inter alia). This indicates that access to features within a word's local context can contribute to saying something about the meaning of the word and can ultimately serve as a foundation for the formation of concept groups of semantically similar words. Distributional representations based on a word's local context are useful for measuring the semantic similarity of words. Lin (1997) exploits this in an algorithm for word sense disambiguation and states that local context gives crucial clues about the meaning of a word, following the intuition that:

“Two different words are likely to have similar meanings if they occur in identical local contexts.” (Lin 1997, p. 64)

Since the local context can comprise syntactic and semantic information, it provides a means to access different information relevant to the type of analysis that will be performed on the material. Several approaches describe methods for finding similar nouns based on the distributional patterns of words in the local context (Hindle 1990; Grefenstette 1992; Lin 1998; Lin and Pantel 2001; Pantel and Lin 2002; Pereira et al. 1993; inter alia). These methods classify words in accordance with their distributional patterns, not using hand-coded semantic knowledge as a basis, but rather inferring the required knowledge from a text corpus as part of the analysis process. The approaches all adopt different methods for judging the similarity of words. Below, some of the approaches to finding similar words are described briefly; the similarity metrics, however, will not be discussed in this outline.
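As a rough illustration of the general idea behind such methods, the sketch below compares nouns by the overlap of the (predicate, argument position) contexts they have been observed in. The simple Jaccard-style measure and the miniature context table are illustrative stand-ins for the far more elaborate metrics and corpus counts used in the works cited above.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy table of local contexts: noun => { "predicate:position" => 1 }.
    # The entries are invented for illustration.
    my %contexts = (
        morder => { 'drepe:arg1' => 1, 'avhøre:arg2' => 1, 'pågripe:arg2' => 1 },
        mann   => { 'drepe:arg1' => 1, 'avhøre:arg2' => 1, 'se:arg2'      => 1 },
        kvinne => { 'drepe:arg2' => 1, 'finne:arg2'  => 1 },
    );

    # Jaccard-style overlap of the context sets of two nouns.
    sub similarity {
        my ($x, $y) = @_;
        my %union  = (%{ $contexts{$x} }, %{ $contexts{$y} });
        my $shared = grep { exists $contexts{$y}{$_} } keys %{ $contexts{$x} };
        return $shared / keys %union;
    }

    printf "sim(morder, mann)   = %.2f\n", similarity('morder', 'mann');
    printf "sim(morder, kvinne) = %.2f\n", similarity('morder', 'kvinne');

On this toy data, morder and mann come out as far more similar than morder and kvinne, mirroring the intuition that words sharing local contexts share meaning.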

Hindle (1990) shows that the contextual distribution of words provides a useful semantic classification, also in the case of an automated classification process with no human intervention. His method examines predicate-argument structures in a large corpus and automatically classifies nouns into semantically similar sets on the basis of the predicates they combine with. The similarity between nouns is measured as a function of mutual information estimated from the text. Hindle's results show that semantic relatedness can be derived from the distribution of syntactic forms (Hindle 1990, p. 274). This is a similar approach to the one taken in the present work, if on a substantially smaller scale. Hindle (1990) addresses the data sparseness problem by estimating the probability of an unseen event by comparing it to similar events which have been seen. Grefenstette (1992) presents a method which looks for context patterns in large domain-specific corpora and finds similar words relative to how a target word is used in a specific text or domain. His program SEXTANT uses syntactically derived contexts and estimates the similarity of two words by considering the overlap of all the contexts associated with them over a large corpus (Grefenstette 1992, p. 325). As a result, a word's context consists of all the words co-occurring with it in the corpus. Pereira et al. (1993) also report a method for clustering words according to their distributions in given syntactic contexts. In their approach, nouns are classified based on their syntactic relations to predicates in the corpus. The method enables the automatic derivation of classes of semantically similar words from a text corpus and produces clusters the authors term “intuitively informative” (Pereira et al. 1993, p. 190). Lin and Pantel (2001) present the unsupervised algorithm UNICON for the creation of groups of semantically similar words. Their approach examines collocation patterns consisting of dependency relationships and employs a method for selecting significant collocation patterns. Those dependency relations which occur more frequently than they would if the words were independent of each other are selected as collocation patterns. This approach is further developed in Pantel and Lin (2002). Here, clusters which are relatively semantically different are initially identified, and a subset of the cluster members is used to create so-called centroids, which represent the average features of the subsets. Subsequently, new words are assigned to their most similar clusters. A word can be assigned to several clusters, each cluster corresponding to a sense of the word.

2.2.3 Context and selectional restrictions

In the above it has been argued that a given word will tend to co-occur with a limited class of other words, and that this information can be exploited to find words that are similar in meaning. One of the reasons for this expected occurrence of similar words in similar contexts is that predicates to a certain extent limit the semantic properties of the arguments that they can combine with. This behaviour is captured through the notion of selectional restrictions, which define how a predicate restricts the class of arguments that can combine with it in a specific position. Selectional constraints allow a predicate to specify semantic restrictions on its arguments (Jurafsky and Martin 2000, p. 512). This accounts for the intuition that only a certain class of words can occur in a specific argument position of a given predicate. In the case of a verb such as avhøre (interrogate, take a statement from), a possible selectional restriction for the first argument could be that it must represent a human. Jurafsky and Martin formulate it like this: interrogate restricts the constituents appearing as the first argument to those whose underlying concepts can actually partake in an interrogation (Jurafsky and Martin 2000, p. 512, slightly modified).

More nuanced intuitions about selectional restrictions can be obtained by combining the knowledge of distribution in context with that of semantic restrictions placed on arguments by the predicate. This thesis applies a practical approach in order to find properties of the selectional restrictions of predicates within a limited thematic domain. Without aiming to formulate a comprehensive list of the selectional restrictions that apply within the domain in question, it is possible to obtain a list of examples that illustrate certain properties of the selectional restrictions. This is an extensional approach; by examining the structure of a set of arguments that all occur in the same contextual environment, for example as the first argument of a certain predicate, it is possible to draw certain conclusions about the selectional restrictions placed by the predicate. The aim of this project is not to define the selectional restrictions of the predicates in the data set, but rather to collect and examine a list of examples of valid restrictions for the domain. It is obvious that selectional restrictions also vary over different thematic domains; the allowed first arguments of a predicate will be very different in a formal text than in a fairy tale for children. This is again the intuition outlined in the first section of this chapter; words are used in different ways depending on the thematic domain they exist in. The distribution that classes of semantically similar arguments show within a limited domain may therefore very well be seen as a type of selectional constraint. To exemplify this, consider the domain used in the present work: newspaper texts concerning a criminal case. Of the constructed phrases in (2-17), the first two are valid for the domain in the sense that they exemplify structures which are found in the data set, while the third violates the selectional constraints assigned by the verb within this particular thematic domain. In the event of a killing within the domain in question, it can be expected that a perpetrator or a man has the thematic role of actor, but it has not been seen in the data material that a student can initiate this action.


(2-17)

gjerningsmannen drepte kvinnen
the perpetrator killed the woman

mannen drepte kvinnen
the man killed the woman

studenten drepte kvinnen
the student killed the woman

The example above illustrates how context is used within the present work to formulate a notion of selectional restrictions. These can later be used to say something about which arguments can be expected to feature in a specific contextual environment, and thus function as a type of real-world knowledge for the domain of the text collection. With specific reference to anaphora resolution, these selectional restrictions can be used to give an indication of the most likely antecedent for an anaphor which would normally require access to real-world knowledge in order to be resolved.
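To preview how this can work in practice, the following is a minimal Perl sketch of referent selection guided by such domain-derived restrictions. The table of observed first arguments and the candidate words are hypothetical; the ranking simply prefers candidates that have been attested as argument 1 of the governing predicate. It illustrates the intuition behind the experiments later in this thesis, not their actual implementation.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical table of first arguments observed with each predicate
    # in the domain corpus, i.e. extensional selectional restrictions.
    my %arg1_seen = (
        drepe    => { gjerningsmann => 1, mann => 1 },
        'avhøre' => { politi => 1, lensmann => 1 },
    );

    # Given the predicate governing an anaphor and a list of candidate
    # antecedents, prefer candidates attested as argument 1 of that predicate.
    sub rank_candidates {
        my ($predicate, @candidates) = @_;
        my $seen = $arg1_seen{$predicate} || {};
        return sort { ($seen->{$b} ? 1 : 0) <=> ($seen->{$a} ? 1 : 0) } @candidates;
    }

    # 'han' in subject position of 'drepe'; 'mann' outranks 'student'.
    my @ranked = rank_candidates('drepe', 'student', 'mann');
    print "most likely antecedent: $ranked[0]\n";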



3 From text to EPAS – the extraction method

This chapter describes the extraction method used in this project. The method extracts EPAS (elementary predicate-argument structures) from a text corpus consisting of newspaper texts collected from the internet.

3.1 Selecting the texts

Specifying the requirements for a suitable text collection is not as trivial as it may seem. To make sure that the extracted EPAS would produce semantically valid results when classified, the texts from which the structures were extracted had to fulfil certain requirements. Since the classification builds on the distributional hypothesis and relies on EPAS which show a distribution particular to a restricted domain, the most important initial specification for the texts was that they all had to belong to the same thematic domain. As such, the main focus in the requirements specification for the text collection was that of one closed thematic domain. But how exactly does one define the notion of a thematic domain? The first test set collected for the project consisted of factual prose texts dealing with roughly the same field. These texts, however, proved to be quite unsuitable for the later analysis, for reasons that will be explained in the following.

It is clear that certain specifications must be fulfilled by the text collection from which the EPAS are derived. Texts displaying longer discourse chains are most suitable for the purpose of this project. One thematic domain must be described over several paragraphs, or preferably over the entire course of the discourse in the text. In order to extract the desired information from the texts and subsequently test whether useful information has been extracted, the presence of anaphora, or referring expressions, in the text is needed. This entails the need for pronouns in particular. As such, texts containing discourse with a certain amount of concrete content were particularly useful for my purpose.

Texts that are too vague, both with regard to their textual content and to their membership in a particular thematic category, were not suitable for the purpose of this project because they do not contain the type of theme-specific selectional constraints we are interested in extracting. One recurring problem in the text collections tested for the project was too little information expressed in full text, and too much information presented in bullet lists, tables and other similar constructions. This information was only accessible after manual editing of the texts, and even then it was often not useful, since precisely the desirable discourse chains are avoided by the use of this type of textual shorthand. The information present in bullet lists and tables in the unedited text is most often not formulated in well-formed sentences, and the use of referring expressions and pronouns is usually avoided. Such texts are also not immediately suitable for parsing, thus making it complicated to extract EPAS (semi-)automatically.

As mentioned above, selecting the texts to be analyzed and creating the text collection to serve as the basis of the classifications in the project was a task not to be underestimated. Several different types of texts were experimented with in an attempt to find a text type that satisfied the following criteria, in addition to being available for collection on the internet:

• Limited and naturally confined thematic domain
• Relatively long chains of discourse
• Fairly high occurrence of anaphora, pronouns in particular
• Several paragraphs where the same phenomenon is discussed
• Low occurrence of tables and illustrations; ideally all the information in the texts should be expressed in complete and grammatical sentences

The text type that fulfilled these criteria to the highest degree was news texts. By picking newspaper articles that all concerned the same theme, the criterion of a limited domain was satisfied. The articles, as provided on the internet, additionally fulfilled all the other requirements which had been set for the text collection. For this project, articles concerning a criminal case in the small town of Førde on the west coast of Norway were chosen, mainly because this was a very big case in the Norwegian newspapers and a large number of articles had been written on the subject. The articles were selected from the newspaper Verdens Gang (VG) in June and July 2004.
34


3.2 Predicate-argument structures

"Not the same thing a bit," said the Hatter. "Why, you might as well say that 'I see what I eat' is the same thing as 'I eat what I see'." from Alice in Wonderland by Lewis Carroll.

For the purposes of the subsequent classification phase, a meaning representation that would not allow for ambiguity or vagueness was desirable. Using the term EPAS, rather than referring to the verb and its subject and object, contributes to normalising and generalising the data. The motivation for choosing elementary predicate-argument structures, or EPAS, as the representation of the meaning structures in the text collection is explained in the following.

By choosing EPAS as the meaning representation, the focus of the structure is the verbal predicate. Instead of structuring the semantic representations extracted from the texts according to the grammatical roles and the formal function each word holds in the sentence, we look at how the verbal predicate combines with arguments. This is closely related to the idea of thematic roles, where the focus is on which roles the entities in a sentence occupy. It has been suggested that “verbs must have their thematic role requirements listed in the lexicon” (Saeed 1997, p. 140), and as such that each verb has a predetermined set of possible argument frames. Thematic roles span a wide range that describes the various roles the entities in a sentence can occupy. Using Saeed's hierarchy of thematic roles, the agent is the initiator of action, while the patient and the theme are the entities an action is performed on. For Norwegian and English, there is a tendency for subjects to be agents and direct objects to be patients and themes (Saeed 1997, p. 145). This tendency can be altered by the speaker as a result of stylistic choice or a desire to alter the information structure, for example by using passive verbal voice. The assignment of thematic roles to particular positions in a sentence is closely connected to the hierarchical structure of the thematic roles. There is a hierarchy of defined thematic roles for each sentence position; the hierarchy in (3-1) exemplifies the preferred order of roles in subject position (Saeed 1997, p. 146).

(3-1)

agent > recipient/benefactive > theme/patient > instrument > location

The structuring of a semantic representation into predicates with associated arguments does not, however, express exactly the same information as the assignment of thematic roles does.


When using predicate-argument structures, the definitions of argument 1 and argument 2 presuppose an underlying semantic hierarchy which defines the roles of agent and patient. For example, argument 1 can be defined as always representing the agent of the sentence. An important distinction between the predicate/argument paradigm and that of thematic roles is that a representation with agent/patient at its core does not focus on the predicate and its associated arguments, while in a predicate/argument classification the definition of the individual arguments does not directly consider the thematic roles. Since a semantic hierarchy has more finely defined roles than can be expressed with arguments, different instances of argument 1 will not have exactly the same semantic role; this varies with the predicate they co-occur with.

For the purpose of processing the extracted structures, it is useful that the structures are in a simplified form, and also that structures that semantically convey the same information but are expressed in a syntactically different manner are represented with the same structure. This is in alignment with the doctrine of canonical form, which states the usefulness of letting linguistic constructions which display the same meaning content give rise to the same meaning representation (Jurafsky and Martin 2000, p. 507). By using a normalised form of representation, such as EPAS, for the structures extracted from the text, as well as founding the extraction method on semantic representations of the analysed texts, the generation of a generalisable data set is achieved. Active and passive constructions with equivalent semantic meaning will be treated in the same way and will receive identical meaning representations. This can be seen in the following example:

(3-2)

a. Morderen drepte kvinnen.
   The murderer killed the woman.

b. Kvinnen ble drept av morderen.
   The woman was killed by the murderer.

c. Kvinnen ble drept.
   The woman was killed.

Sentences (3-2a) and (3-2b) in essence convey the same information and differ only with respect to verbal voice. The use of diathesis alternations of active and passive voice gives the speaker flexibility with regard to the relationship between grammatical structure and thematic roles. The use of passive versus active voice does not really alter the semantic content of a sentence, but represents a difference in its information structure. Differences of this kind are not relevant for the present purposes and will not be reflected in the extracted structures. The word kvinnen (the woman) has different syntactic roles in the active sentence in (3-2a) and the passive sentence in (3-2b), but has the same thematic role. Regardless of the fact that the phrase kvinnen (the woman) represents a subject in (3-2b), while it represents an object in sentence (3-2a), both expressions have the thematic role of patient, or the entity acted upon. For the purposes of the present work, both sentences will be represented by the single EPAS shown in (3-3):

(3-3)

Predicate   Argument 1   Argument 2
drepe       morder       kvinne
kill        murderer     woman

Sentence (3-2c) is in passive voice, and the subject from the active-voice sentence (3-2a) is not present. The formal subject of the sentence, “woman”, is logically a patient, and refers to the entity on which the activity of killing is performed. As such, this sentence will be represented by an EPAS which lacks its argument 1:

(3-4)

Predicate   Argument 1   Argument 2
kill        ?            woman

Extraction methods that are not based on a syntactic parse of the original texts do not have access to semantic relations within a sentence. This means that such methods must rely on more superficial structures, such as part-of-speech tags, and will not have the same degree of accuracy, or finesse, in the actual extraction of the meaning structures. Since the present work aims at providing results that can be useful as part of an anaphora resolution system, it is particularly important that the results obtained can be generalised as much as possible. Especially since the individual elements of the extracted structures will be used in subsequent processes, it is highly important that they do not contain errors or irregularities resulting from the extraction process. To be as useful as possible, the meaning structures should be normalised and generalisable.

The examples above show how normalisation through the use of EPAS realises the concept of canonical form to some degree and seems particularly useful for the purpose of the present work. By using grammatical relations such as subject and object as reference points, semantically equivalent sentences, such as (3-2a) and (3-2b), would be given different meaning structures due to the difference in verbal voice. Structuring the meanings conveyed by the sentences in (3-2) within a grammatical relations paradigm would make it necessary to mark the verbal voice as well as the grammatical relations. In addition, active and passive structures would have to be treated differently in the subsequent analysis. Basing the extraction merely on syntactic properties of the sentences in the corpus would make the extracted material very difficult to classify, mainly because similar meanings would be represented differently.

The advantages of a normalised and generalisable data set are further clarified by the following example. Upon a simple grammatical analysis, the sentences shown in (3-2) can be categorised based on the syntactic roles predicate, subject and object. The result of such a classification is shown in examples (3-5) and (3-6):

(3-5)

   predicate   subject    object
a. drepe       morder     kvinne
   kill        murderer   woman
b. drepe       kvinne     morder
   kill        woman      murderer
c. drepe       kvinne     ?
   kill        woman      ?

The structures in (3-5) above can be extracted upon part-of-speech tagging of the sentences in (3-2). The active and passive predicate receive the same structure, and as no semantic information is available, the structuring of the arguments is in accordance with their status as subject or object. Attempting to classify these subjects and objects based on their co-occurrence with the predicate produces groupings of words which are not directly generalisable. Murderer and woman occur together both in subject and object position, which does not reflect the preferred selectional restrictions within the domain.

(3-6)

   predicate   subject   object
a. drepe       morder    kvinne
   kill        murderer  woman
b. drepes      kvinne    morder
   is-killed   woman     murderer
c. drepes      kvinne    ?
   is-killed   woman     ?

Example (3-6) provides a more elegant structuring. Because an extraction method based on syntactic relations is unable to generalise over verbal voice, two separate predicates are extracted, one for the passive and one for the active voice. Even though, logically, the same action is performed on the entity the woman in all the sentences in (3-6), a method as outlined above would not allow for a straightforward interpretation of this. The generalisation between active and passive versions of the same sentence is lost in such an approach. This would result in a higher number of predicates, and therefore in a less generalisable data material. It is likely that results as outlined above would also be of less use as a referent-guessing helper in an anaphora resolution system, precisely because of the lower level of generalisability.
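The following minimal Perl sketch illustrates the normalisation argued for above: simple active and passive clauses of the pattern seen in (3-2) are mapped onto one canonical EPAS triple, with '?' for a missing argument. The regular expressions and the crude verb lemmatisation are illustrative assumptions; in the actual project this normalisation falls out of the parser's semantic analysis, as described in section 3.3.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Map simple active and passive clauses onto one canonical EPAS:
    # (predicate, argument 1, argument 2), '?' marking a missing argument.
    # The patterns are naive stand-ins for a real syntactic parse, and the
    # nouns are left in their surface form (no lemmatisation).
    sub to_epas {
        my ($clause) = @_;
        if ($clause =~ /^(\w+) ble (\w+)t av (\w+)$/) {    # passive with agent
            return [$2 . 'e', $3, $1];                     # drept -> drepe
        }
        if ($clause =~ /^(\w+) ble (\w+)t$/) {             # agentless passive
            return [$2 . 'e', '?', $1];
        }
        if ($clause =~ /^(\w+) (\w+)te (\w+)$/) {          # simple active
            return [$2 . 'e', $1, $3];
        }
        return;
    }

    for my $clause ('morderen drepte kvinnen',
                    'kvinnen ble drept av morderen',
                    'kvinnen ble drept') {
        my $epas = to_epas($clause);
        printf "%-30s => %s\n", $clause, join(', ', @$epas);
    }

All three clauses come out with the predicate drepe and their arguments in the same canonical order, which is exactly the property the EPAS representation is chosen for.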

3.2.1 What is represented in the EPAS?

Jurafsky and Martin (2000, p. 510) state that all languages have predicate-argument structures at the core of their semantic structure. They further describe how the grammar organises the predicate-argument structure and how selectional constraints restrict the ways in which other words and phrases can combine with a given word. In this project, a simplified version of predicate-argument structures is used as the meaning representation. The EPAS, or meaning representations, are limited to consisting of at most two nominal arguments. Either one of the arguments in an EPAS may be empty/unidentified. This means that the EPAS extracted from my texts will belong to one of the following three patterns:


(3-7)

a. predicate, argument 1, argument 2
b. predicate, argument 1, ?
c. predicate, ?, argument 2

The reason for letting the EPAS consist of a maximum of two arguments is not primarily a principled decision, but rather a consequence of the empirical material that the data structures were collected from. When extracting EPAS from the data collection, the resulting structures consisted of a predicate with maximally two arguments. It is probable that predicates with more than two associated arguments generally occur less frequently, and since my data collection is quite small, such occurrences do not feature in it.

Only nominal arguments are featured in the EPAS, entailing that sentences with a nominal clause as object will be extracted as an EPAS lacking argument 2. This is clarified through the examples below:

(3-8)

Et vitne opplyste at hun hadde hørt høye rop.
A witness informed that she had heard loud screams.

The sentence in example (3-8) above will yield the following three EPAS:

(3-9)

a. høre, vitne, rop
   hear, witness, scream
b. høy, rop, ?
   loud, scream, ?
c. opplyse, vitne, ?
   inform, witness, ?

(3-9c) does not display an argument 2, despite the fact that the original sentence has a nominal clause as object. The main reason for this choice of representation is that the subsequent classification phase aims at creating classes of nominal arguments based on the verbs they co-occur with. Arguments which are unlikely to represent relevant and interesting information for our classification are therefore omitted. In cases where a verbal predicate takes a nominal clause or a sentence as its argument, it is unlikely that the predicate selects the argument based on (for us) semantically interesting selectional restrictions. A verb can show restrictions in the selection of an argument, presenting us with the possibility of saying something about this argument relative to the environment it occurs in. The same restrictions cannot be expected to apply in cases where the verb takes a sentence as its argument. As a consequence, the meaning representations are limited to arguments which can be represented by single symbols and which do not refer to clauses or sentences.

In order to extract all EPAS present in the texts, it is also necessary to extract non-verbal predicates. These will generally correspond to adjective-noun combinations in the text. This way, phrases of the type “the statement is important” and “the important statement” will both produce the following EPAS:

(3-10)

Predicate   Argument 1   Argument 2
important   statement    ?

The process of extracting the EPAS is a challenging part of the project, especially since the available tools for Norwegian are not robust enough to make this a trivial and straightforward task. Regardless of which method is used to extract the EPAS, it is evident that a large part of the work on this project must be dedicated to the development of a suitable extraction method. For several reasons, it is desirable to develop an extraction method that is as automatic as possible. Most importantly, such a method saves a lot of time, but another important aspect is that more manual extraction methods could easily become subjective and less systematically consistent. The next two sections discuss the task of extracting the EPAS in more detail.


3.3 Parsing with NorGram

To be able to extract the EPAS from the text in a semi-automatic fashion, some sort of linguistic analysis of the texts is needed. One problem with working on a small language like Norwegian is that the linguistic tools one might need in the process are simply not fully developed yet. Velldal (2003) describes a project where a set of Norwegian nouns is grouped into semantic classes based on their distribution over a large body of text. A word's distribution in different contexts is represented as a feature vector in a semantic space model. In his project, Velldal addresses the lack of a parser for Norwegian by stating that there does not exist any syntactic parser for the language. Instead, he uses a shallow processing tool on a tagged corpus. The processing tool “translates” the tagged structures into predicate-argument structures, overcoming the need for a parser by only analysing those parts of the text relevant for the extraction of the needed structures. As explained in section 3.2, an extraction method that is based on surface structures and does not take semantic relations into account might produce results that are unsuitable both for subsequent use in anaphora resolution and for the generalisation of concepts. In view of this, the present work has aimed at developing an extraction method that uses parsed text to collect the meaning structures from the text.

Although it is true that no parser fully covers the Norwegian language at the moment, there are a few alternative parsers available. Even if these grammars are not entirely robust enough to return parses of randomly chosen texts, they can be used for the experiments outlined in this project. The extraction method described in this thesis implements one of the existing parsing tools for Norwegian bokmål, NorGram (NorGram 2004).

Since there are no easy-to-use automated tools available for the extraction process, obtaining the EPAS from the text involved a substantial amount of manual work, even when using a parser to automate the extraction. Parsing the texts was definitely of value, though: once the texts were parsed and there was a syntactic analysis to work on, the EPAS could be extracted more readily. Because of the modular nature of the extraction method, the extraction process is not parser-dependent. Should a new and more robust grammar become available, the extraction method can be modified to accommodate this. The next section of this chapter briefly describes how the NorGram/XLE parser was used in the project, while section 3.3.2 describes in greater detail how the EPAS were extracted from the parser's output.


3.3.1 NorGram in outline

Norsk komputasjonell grammatikk (NorGram) is a computational grammar for Norwegian bokmål. NorGram is based on the unification-based grammar formalism Lexical Functional Grammar (LFG), where language is described by means of feature structures that can be combined in the process of unification. Researchers involved in the NorGram project cooperate with researchers at Palo Alto Research Center (PARC), formerly Xerox PARC, who have developed a well-functioning platform for the development of large-scale computational grammars. This system is called the Xerox Linguistic Environment (XLE) and uses LFG as its theoretical linguistic framework. As such, NorGram can be said to be an LFG grammar for Norwegian, while XLE is an implementation of LFG.

The NorGram grammar combined with an XLE module is a relatively broad parser that can analyse most structures found in Norwegian. It was chosen for the purposes of this project because it was likely to return successful parse trees for a large part of the sentences found in the text collections. NorGram's lexicon is quite large and includes entries for most regular Norwegian words. One problem with the lexicon with regard to the text collections used for this project is that it contains relatively few compounds. All theme-specific texts feature a theme-specific vocabulary, sometimes with words (especially compound nouns) that cannot be expected to be found in ordinary dictionaries. This was also the case for the text collection in this project. Compound nouns represented the largest group of words added to the lexicon. In Norwegian, one is fairly free to form compounds consisting of words that can also exist individually with an individual meaning. Whereas in English such compounds are written as two separate words, for example police investigator, in Norwegian they together form a new noun, for example politietterforsker (police investigator). This opens up a potentially infinite class of nouns and makes it virtually impossible to include all possible words in any lexicon.

The NorGram lexicon was extended in order to be used as a tool to extract the EPAS from the text collection. Compounds and proper nouns that were part of sentences to be analysed were added to the lexicon files. To ensure that all EPAS could be successfully extracted, all sentences that were not parsed were examined to identify the word that caused the problem. Subsequently, that word was added to the lexicon. A more elegant way to solve the compound issue would be to make use of a module that splits compounds into the individual words they consist of, or of a component that predicts the part of speech of an unknown word. One solution that would have been suitable for the purposes of the present work would be to assume that all unknown words were nouns. However, due to the small size of the corpus used for this project, none of these strategies was implemented, and unknown words were added to the lexicon by hand.
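A minimal Perl sketch of the compound-splitting alternative mentioned above could look as follows. It tries to divide an unknown word into two known lemmas, optionally joined by the linking element s. The tiny lexicon fragment is hypothetical, and real Norwegian compound analysis would of course need considerably more machinery (longer compounds, more linking elements, inflection).

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Naive compound splitter: try to divide an unknown word into two
    # known lemmas, optionally joined by the linking element 's'.
    # The lexicon below is a hypothetical fragment, not NorGram's.
    my %lexicon = map { $_ => 1 } qw(politi etterforsker lete aksjon drap sak);

    sub split_compound {
        my ($word) = @_;
        for my $i (2 .. length($word) - 2) {
            my ($head, $tail) = (substr($word, 0, $i), substr($word, $i));
            return ($head, $tail) if $lexicon{$head} && $lexicon{$tail};
            # allow a linking 's' between the parts: drap+s+sak
            return ($head, substr($tail, 1))
                if $lexicon{$head} && $tail =~ /^s/ && $lexicon{substr($tail, 1)};
        }
        return;   # no split found; fall back to treating the word as a noun
    }

    my @parts = split_compound('politietterforsker');
    print @parts ? "parts: @parts\n" : "unknown word, assumed noun\n";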

When parsing texts with NorGram and XLE, the user has several choices with regard to the format of the final syntactic analysis. For example, it is possible to receive partial parses, or to let the system return all the potential analyses of the input sentence. For the purposes of this project, I received full parses of each sentence in the text material and manually checked each instance where the system returned multiple valid parses, actively deciding on the correct one from which to extract the EPAS.

3.3.2 Extracting EPAS from NorGram

The output provided by XLE upon a successful parse using the NorGram grammar is particularly useful for a subsequent extraction of EPAS. NorGram is based on the LFG grammar formalism and produces constituent structures (c-structures), functional structures (f-structures) and minimal recursion semantics structures (MRS structures) upon parsing a sentence. Each of these outputs can be useful for a subsequent extraction of predicates and their arguments.

The c-structure in LFG is an external structure which displays an ordered representation of the words in a sentence or phrase (Bresnan 2001, p. 44). In XLE, the c-structure is represented by a phrase structure tree, where the terminal nodes are fully inflected word forms. F-structures represent the internal structure of a sentence. On this level, the “syntactic functions are associated with semantic predicate argument relations” (Bresnan 2001, p. 45). C-structures and f-structures are different structures, but display parallel information. Figure 3 below shows the graphical representation of the c- and f-structures for the sentence Politiet leter etter morderen (The police are looking for the murderer) generated by NorGram.


Figure 3: c-structure and f-structure for the sentence Politiet leter etter morderen.

The most useful structure for the purpose of extracting EPAS from the parse output is the MRS structure. In comparison to the c- and f-structures, which are more syntactically motivated, the MRS structure displays the semantic structure within a sentence. In the next section, this structure is described in greater detail.

3.3.2.1 Minimal Recursion Semantics

Minimal Recursion Semantics (MRS), developed by Copestake et al. (2003), is a framework for computational semantics, providing a meta-language for describing semantic structures. The concept of MRS is primarily semantically motivated and aims at preserving the semantic structures in the input sentence. MRS allows for expressive adequacy, ensuring that the linguistic meanings conveyed by a sentence are expressed correctly in the semantic structure. The primary unit within the framework of MRS is the elementary predication (EP). An EP is a single relation with associated arguments and will generally correspond to a lexeme with its argument roles filled. Since MRS provides a “flat” representation where the EPs are never nested within each other, semantically irrelevant implicit information about the syntactic structure of a phrase is avoided. The simple principle is that each EP has a “handle” which identifies it as belonging to a particular tree node, and argument positions in EPs can be filled with handles which correspond to the EPs that belong immediately below them in the tree structure. More than one EP with the same handle entails that the EPs are conjoined and on the same node in the structure. Tree structure in this sense does not refer to the c-structure, but to an abstract structure which shows the hierarchical representation of the EPs.

MRS is implemented in NorGram, and the MRS structures provided there are the most convenient of the output structures from the point of view of EPAS extraction. A sentence or phrase can contain more than one EPAS, and all predicates with their associated arguments are displayed in the MRS representation. The MRS structures NorGram displays following the successful parse of a sentence contain all the information needed to extract the EPAS. However, because of the manner in which they are displayed in the XLE graphical interface, it is not straightforward for a human to see which arguments belong where and thereby identify the EPAS directly. Only by tracing each individual EP and finding corresponding values in other EPs is it possible to extract the sentence's EPAS. Figure 4 below shows the graphical representation of the MRS structure for the sentence Politiet leter etter morderen (The police are looking for the murderer).

Figure 4: MRS structure for the sentence Politiet leter etter morderen.


3.4 Altering the source

As already mentioned, parsing randomly selected Norwegian texts is not an entirely straightforward task. Although NorGram provides quite a broad grammar, not all linguistic constructions are parsed and, more importantly, not all words are covered in the lexicon. Ideally, it would be desirable to collect a limited-domain treebank consisting of parsed sentences of the original texts as I found them on the internet. In practice, this was not a feasible task. It soon became evident that the texts to be analyzed would have to be simplified for practical reasons. For the purpose of classification, I needed to extract the EPAS present in the texts. All the other information included in every sentence was not essential or necessary for the project. Although aware that it would be more scientifically sound, and in any respect better, to extract the EPAS from original texts that I had not tampered with, this was not possible within the framework of this thesis. Given that I would have to simplify the texts in any case, I decided to cut most information that was irrelevant for the extraction of the (most central) EPAS. This process was performed on all sentences in the text collection. Mainly adverbial phrases were excluded, on the basis that they would not have been included in the extracted EPAS in any case. Example (3-11) below illustrates a typical case:

(3-11)

a. Original sentence:
   Etter at hun ble funnet opplyste et vitne at hun hadde hørt høye rop om hjelp fra stedet tidlig søndag morgen.
   After she was found, a witness informed that she had heard loud screams for help from the area early Sunday morning.

b. Simplified form:
   Et vitne opplyste at hun hadde hørt høye rop.
   A witness informed that she had heard loud screams.

c. Extracted structures:
   høre, vitne, rop        hear, witness, scream
   høy, rop, ?             loud, scream, ?
   opplyse, vitne, ?       inform, witness, ?


The pre-editing of the text collection will naturally have affected the resulting EPAS list. Not all structures from the original texts are extracted, and the EPAS list consequently does not include all relevant context patterns for the domain. Still, for the purposes of a pilot study such as this thesis, the central structures, which display the most typical context patterns for the domain, include enough information to give an indication of the usefulness of the method. For subsequent analyses, the extraction process can easily be performed on unedited original texts.

3.5 Finding the words

The process of extracting meaning structures such as the EPAS from the texts in the text collection is a substantial undertaking. It is also quite a tedious task, and since tedious tasks tend to benefit from being automated, I wrote the Perl script Ekstraktor, which interprets the MRS structures of a sentence and thereby puts together the EPAS for each parsed sentence. This section describes the outline of the automated extraction process.

XLE provides the user with a choice of several output formats, including a graphical user interface that displays a tree graph of the parse as well as its f-structure and MRS structure. Optionally, the output can also be viewed as a file of Prolog predicates. In the process of extracting the EPAS, Ekstraktor reads the Prolog output, saves relevant information in a system of arrays, and subsequently performs several tests and actions on the stored information in order to present a list of all EPAS found in the parsed sentence.

The MRS structure as represented in the Prolog output provides all the information needed to extract the EPAS. Initially, the main EP and its arguments must be found. Since, for the purposes of this thesis, the linguistic structures analysed are limited to full sentences, the main EP must display the category 'v' for verb. Once the main EP is identified, the semantic values for it and for its arguments must be found. Subsequently, all the remaining predicate-argument structures must be found; for these, there is no restriction on category. Consider the sentence shown in (3-12) together with an extract of the Prolog output of the parse shown in (3-13):


(3-12)
Politiet leter etter morderen
The police are looking for the murderer

(3-13)
cf(1,eq(attr(var(19),'ARG0'),var(20))),
cf(1,eq(attr(var(19),'ARG1'),var(21))),
cf(1,eq(attr(var(19),'ARG2'),var(22))),
cf(1,eq(attr(var(19),'LBL'),var(10))),
cf(1,eq(attr(var(19),'LNK'),14)),
cf(1,eq(attr(var(19),'_CAT'),'p')),
cf(1,eq(attr(var(19),'_CATSUFF'),'sel')),
cf(1,eq(attr(var(19),'relation'),semform('etter',15,[],[]))),
cf(1,eq(attr(var(20),'type'),'event')),
cf(1,eq(attr(var(21),'PERF'),'-')),
cf(1,eq(attr(var(21),'TENSE'),'pres')),
cf(1,eq(attr(var(21),'type'),'event')),
cf(1,eq(attr(var(22),'NUM'),'sg')),
cf(1,eq(attr(var(22),'PERS'),'3')),
cf(1,eq(attr(var(22),'type'),'ref-ind')),
cf(1,eq(attr(var(23),'ARG0'),var(21))),
cf(1,eq(attr(var(23),'ARG1'),var(24))),
cf(1,eq(attr(var(23),'ARG2'),var(22))),
cf(1,eq(attr(var(23),'LBL'),var(10))),
cf(1,eq(attr(var(23),'LNK'),10)),
cf(1,eq(attr(var(23),'_CAT'),'v')),
cf(1,eq(attr(var(23),'_PRT'),'etter')),
cf(1,eq(attr(var(23),'relation'),semform('lete',11,[],[]))),
cf(1,eq(attr(var(24),'NUM'),'sg')),
cf(1,eq(attr(var(24),'PERS'),'3')),
cf(1,eq(attr(var(24),'type'),'ref-ind')),
cf(1,eq(attr(var(25),'ARG0'),var(22))),
cf(1,eq(attr(var(25),'BODY'),var(26))),
cf(1,eq(attr(var(25),'LBL'),var(27))),
cf(1,eq(attr(var(25),'LNK'),18)),
cf(1,eq(attr(var(25),'RSTR'),var(14))),
cf(1,eq(attr(var(25),'relation'),semform('def',31,[],[]))),
cf(1,eq(attr(var(26),'type'),'handle')),
cf(1,eq(attr(var(27),'type'),'handle')),
cf(1,eq(attr(var(28),'ARG0'),var(24))),
cf(1,eq(attr(var(28),'BODY'),var(29))),
cf(1,eq(attr(var(28),'LBL'),var(30))),
cf(1,eq(attr(var(28),'LNK'),0)),
cf(1,eq(attr(var(28),'RSTR'),var(17))),
cf(1,eq(attr(var(28),'relation'),semform('def',9,[],[]))),
cf(1,eq(attr(var(29),'type'),'handle')),
cf(1,eq(attr(var(30),'type'),'handle')),
cf(1,eq(attr(var(31),'ARG0'),var(22))),
cf(1,eq(attr(var(31),'LBL'),var(13))),
cf(1,eq(attr(var(31),'LNK'),18)),
cf(1,eq(attr(var(31),'_CAT'),'n')),
cf(1,eq(attr(var(31),'relation'),semform('morder',19,[],[]))),
cf(1,eq(attr(var(32),'ARG0'),var(24))),
cf(1,eq(attr(var(32),'LBL'),var(16))),
cf(1,eq(attr(var(32),'LNK'),0)),
cf(1,eq(attr(var(32),'_CAT'),'n')),
cf(1,eq(attr(var(32),'relation'),semform('politi1',1,[],[]))),


The Prolog code extract in (3-13) shows the MRS representation of the sentence in (3-12) by listing all the EPs in the sentence as well as the relationships that hold between the individual EPs. In simplified terms, the value of the attribute 'semform' holds the semantic form of the predicate, and the values of 'ARG1' and 'ARG2' point to the EPs where the semantic forms for argument 1 and argument 2 can be found. In order to extract all EPAS from such a Prolog file, one must go through all the EPs in turn and find the semantic forms of each main EP and its arguments 1 and 2. In the extraction process, this matching and tracing of values is performed by the script Ekstraktor.

The algorithm behind Ekstraktor is divided into two more or less separate parts: retrieval of information from the Prolog file, and processing of the information that was found and stored. Perl was chosen as the programming language mainly because of its excellent pattern matching facilities: Perl offers a powerful and flexible regular expression syntax which lets the programmer construct regular expressions to handle all kinds of pattern matching. For the information retrieval part of Ekstraktor, it was desirable to go through an input file, check for various patterns and store the parts of the input file relevant to how the patterns were matched. (3-14) shows one of the pattern checks in Ekstraktor: if the line read from the file contains the string

'relation'),semform(

the entire line is stored in the array @semform.

(3-14)
# If the line holds a 'relation' attribute, i.e. a semantic form,
# store the entire line for later tracing:
if ($linjeFraFil =~ m/'relation'\),semform\(/){
    push(@semform, $linjeFraFil);
}

By going through the input file line by line and checking for several patterns, all information relevant to extracting the EPAS is stored in a system of arrays. To keep track of which EP the various values belong to, a pair of arrays is used for each argument type: one for the EP number and one for the argument value. The ARG0 arrays correspond to the predicates in the structures, and for each of these the semantic form can be found directly in the semform array. The ARG1 and ARG2 arrays hold a value that must be traced before the semantic form can be extracted. A simplified example of the argument arrays for the sentence in (3-12) is shown in (3-15):

(3-15)
ARG0:
EP   VALUE
23   21
25   22
28   24
31   22
32   24

ARG1:
EP   VALUE
19   21
23   24

ARG2:
EP   VALUE
19   22
23   22

To find the EPAS for this sentence, the first EP in the ARG0 array is incorporated in a regular expression which is then used for pattern matching against the members of the semform array. If there is an entry which matches the pattern, that is, which has an EP value identical to the first EP in the ARG0 array, the semantic form is retrieved. To find the corresponding arguments 1 and 2, the ARG1 and ARG2 arrays are searched for an EP identical to that of the predicate. If such an EP is found, the corresponding value is retrieved; for ARG1 in our example that would be the value 24. To find the semantic form of this value, we must find the EP where this value is identical to the value of ARG0, that is, the ARG0 array must again be consulted. When the EP is found, the semform array can be pattern matched and the semantic form retrieved. To retrace the example: following this procedure, the sentence in (3-16):

(3-16)
Politiet leter etter morderen
The police are looking for the murderer

is represented with the following EPAS, extracted from the Prolog file of the parse:

(3-17)
lete-etter,politi,morder
look-for,police,murderer

For a detailed walkthrough of Ekstraktor, please consult Appendix A. The program code is available in Appendix B.


3.6 Evaluation of the data set

The data set created by the extraction process consisted of 195 elementary predicate-argument structures in its raw form. The original EPAS list was not directly applicable to the next parts of the project, as not all of the extracted structures were suitable for further analysis. Some of the EPAS were not given an optimal analysis (for my purposes) by the grammar, some were irrelevant for the later analysis, and some were not extracted correctly from the MRS by the Perl script. The data set was post-edited to achieve a set of EPAS without erroneously extracted or undesired structures. With such a small collection of structures as in this project, the inclusion of even a few incorrect structures would be likely to skew the subsequent analysis and possibly produce false results.

In the following, I will briefly outline some of the reasons why the EPAS list included incorrect structures and describe how the list was revised.

3.6.1 Errors from the grammar

Some of the undesired structures in the original EPAS list were directly caused by characteristics of the NorGram grammar. The original list contained, for instance, several structures of the type exemplified by (3-18):

(3-18)
a. verbal predicate, nominal argument
b. preposition, verbal predicate, nominal argument

These structures should preferably have been combined into one EPAS. The example in (3-19) below shows a concrete instance from the EPAS list and is analogous to several other instances:

(3-19)
a. bo, Anne                   live, Anne
b. i, bo, studentkollektiv    in, live, student housing

The structures in (3-19) are extracted from the following sentence from the text material:


(3-20)
Anne Slåtten bodde i et studentkollektiv utenfor Førde sentrum.
Anne Slåtten lived in student housing outside central Førde.

As example (3-20) shows, these structures originate from sentences featuring a verb with an adverbial complement. The adverbial is realised as a prepositional phrase where the preposition is selected by the verb. One would have expected sentences such as Anne bodde i et studentkollektiv (Anne lived in student housing) to result in one EPAS with the entity studentkollektiv somehow realised as the structure's argument 2. Instead, the MRS structure of this and other similar sentences did not provide the necessary link between the verb as predicate and studentkollektiv as the second argument. When discussing this problem with the developers of the grammar, the source of the obstacle was easily identified: in the grammar, the verb bo (live) existed as an intransitive verb, which did not allow an adverbial complement to be analysed as required to produce the desired EPAS. In order to allow this and similar sentences with the verb bo to produce one EPAS with the correct relationship between the predicate and its arguments, the entry for bo was altered. A solution which allows for an arbitrary preposition was favoured over creating a new template that specifies the possible following prepositions. Analysing the sentence above with the revised grammar produces structures of the following type:

(3-21)
bo, Anne, studentkollektiv
live, Anne, student housing

The same phenomenon was observed for a few other verbs with prepositional phrases as complements, such as gjemme i (hide in) and observere i (observe in). In these instances, the structures were edited manually.

3.6.2 Irrelevant structures

Some of the structures that were extracted correctly from the text collection were simply removed in the final post-editing of the EPAS list. These structures were not directly relevant to the later analysis and would not contribute any valuable information to the referent-guessing procedures. In total, 22 such structures were removed. The majority of them originate from adverbial phrases in the text collection. Locative and temporal adverbials, appearing as prepositional phrases in the texts, show up in the EPAS list as structures disjoint from the rest of the sentence, not unlike the structures mentioned above. The preposition functions as the EPAS' predicate, resulting in structures of this type:

(3-22)
på, funn, åsted
on, finding, crime scene

Such structures were left out of the final EPAS list.

Another type of correctly extracted structure omitted from the final list were structures particular to the information structure of the grammatical analysis. The extraction script returned all predicate-argument structures present in the MRS structure for each parsed sentence. This resulted in a few structures that did not hold information that was desirable to maintain in the EPAS list. Below is an example of such a structure:

(3-23)
unspec_loc, , place

3.6.3 Manually added structures

Not all the predicate-argument structures present in the text collection were successfully extracted by the automated method. After removing unwanted structures from the EPAS list, the texts in the collection were reviewed manually to gather any EPAS that had not been returned by the automated extraction process. Had the text collection been larger, so that the list of automatically extracted EPAS had been correspondingly bigger, this might not have been necessary: with a (substantially) larger EPAS list, the structures missed by the extraction process could have been dispensed with, as the extracted structures would have provided enough information for the subsequent classifications and analyses. In this project, though, the text collection and the resulting EPAS list are very small, and all information that can be extracted from the texts is of


value and highly desirable. As such, it was a logical next step, following the removal of unwanted structures, to make sure that all desirable structures had been collected from the texts. Several structures were added, many of which had undergone only partial extraction in the automated process. This may in part be due to the syntactic analysis and in part to the matching performed by the Perl script. Further, EPAS were manually extracted from one additional text that had not been parsed and therefore had not been part of the initial extraction process. In total, this yielded 74 additional EPAS. (3-24) below provides an example of a manually edited EPAS: (3-24a) shows the EPAS as it was after the automatic extraction process. While going through the texts, it became clear that this EPAS had not been extracted in a way that represented the meaning of the sentence it originated from, and therefore did not have an optimal structure. The EPAS was therefore manually modified to the form shown in (3-24b).

(3-24)
a. Original EPAS:
ta, syklist, kontakt
make, biker, contact

b. Manually corrected EPAS:
ta-kontakt-med, syklist, politi
make-contact-with, biker, police

Appendix C contains the EPAS list, while Appendix D shows the alignment between sentences in the text and the extracted EPAS.

3.6.4 Comments about the EPAS list

The revised EPAS list consists of 223 elementary predicate-argument structures: 24 structures have been modified as described above, and 74 have been added. The list contains most EPAS present in the text collection and represents a set of verb-subject-object relations found within a limited thematic domain. While it is clear that the list could have been expanded by adding further texts to the collection, it was not possible to extend the list further within the framework of this project. Certainly, with additional texts the counts of individual EPAS would have been higher and the list would also have been


enriched by several new EPAS. Still, for the purposes of this thesis, the list is large and varied enough to show a tendency and includes a broad enough variety of structures to be of use in the classification phase.

In the process of assessing the quality of the EPAS list, it became evident that the most interesting structures are the simplest ones. The EPAS corresponding to verb-subject-object relations are the ones that contribute the most information about the selectional restrictions of the domain. An alternative way to obtain an effective and robust extraction of EPAS might have been to concentrate only on this type of structure, rather than extracting all EPAS from the text collection and then filtering out unwanted ones.
from the text collections and then filtering out unwanted ones.<br />

In order to estimate the potential of a classification of the EPAS list, line diagrams were created using Formal Concept Analysis (FCA). FCA is a methodology for data analysis and knowledge representation which identifies conceptual structures in data sets, and it was a useful tool in the process of identifying how the predicates and arguments in the EPAS list relate to each other. FCA distinguishes between two types of elements: formal objects and formal attributes. A formal concept is seen as a unit consisting of all its objects and attributes (Wolff 1991, p. 430). Starting with any set of formal objects, all formal attributes the objects have in common can be identified. When using FCA to structure the data in the EPAS list, the arguments were treated as objects, while the predicates were treated as attributes. An FCA line diagram consists of all objects and attributes in a given context, organised hierarchically according to their shared properties. Figure 5 below shows the FCA line diagram for part of the structures in the EPAS list 4. Each white label corresponding to an argument from the EPAS list should be understood as a concept, and information about each concept can be read by following the upward leading paths from it. An object has a given attribute if there is an upward leading path from the object to the attribute (Wolff 1994, p. 431). Using the arguments/formal objects lensmann (sergeant) and Fonn as a starting point, the associated predicates/formal attributes gi (give) and bede-om (ask for) can be identified. The arguments lensmann and Fonn co-occur with the predicates gi and bede-om, while politi (police), which is further down in the hierarchy, co-occurs with other predicates as well as those higher up in the diagram (gi, bede-om and bekrefte (confirm)). In other words, more general concepts are found toward the bottom of the diagram, while specialised concepts are found by following the paths upwards. For the data material in this project, this can be interpreted in terms of the contextual distribution of the arguments. Arguments found in the lower parts of the diagram are more general and co-occur with a wider range of predicates than the arguments found higher up in the hierarchy. In Figure 5, it can be seen that gjerningsmann (perpetrator) and drapsmann (killer) have similar distributions in the data material; drapsmann co-occurs with the predicates velge (choose) and gjemme (hide), while gjerningsmann is only found in connection with gjemme. On the basis of the formal concept analysis, it is clear that the EPAS list contains several arguments whose distribution is particular to their semantic meaning. The different lines in the diagram show interesting bundles of semantically related arguments and confirm the assumption that different types of arguments show different contextual distributions within the thematic domain.

4 The diagram was made using the program Concept Explorer, downloadable from http://sourceforge.net/projects/conexp

Figure 5
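The derivation step used when reading the diagram, finding all attributes shared by a set of objects, is easy to state programmatically. The following Perl fragment is a minimal sketch over a hand-picked toy context taken from the discussion above; it is illustrative only and was not part of the project.

#!/usr/bin/perl
use strict;
use warnings;

# Toy formal context: each argument (formal object) is mapped to the
# predicates (formal attributes) it occurs with, cf. the discussion of
# lensmann, Fonn and politi above.
my %context = (
    lensmann => { gi => 1, 'bede-om' => 1 },
    Fonn     => { gi => 1, 'bede-om' => 1 },
    politi   => { gi => 1, 'bede-om' => 1, bekrefte => 1 },
);

# Derivation operator: the attributes shared by all given objects.
sub common_attributes {
    my @objects = @_;
    my %count;
    $count{$_}++ for map { keys %{ $context{$_} } } @objects;
    return grep { $count{$_} == @objects } keys %count;
}

print join(', ', sort(common_attributes('lensmann', 'Fonn'))), "\n";
# prints: bede-om, gi
print join(', ', sort(common_attributes('lensmann', 'Fonn', 'politi'))), "\n";
# prints: bede-om, gi  (bekrefte is not shared by all three)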


4 Classification

In order to use the structures in the EPAS list as an aid in anaphora resolution, they have to be processed. The pre-processing in section 3.6.4 has shown that interesting distributions do exist in the data set and indicates that certain groups of arguments display distributions particular to the domain. As a step toward exploring whether these distributions can be used to represent selectional restrictions, and thus function as real-world knowledge for the domain, the words in the EPAS list must be classified. This procedure uses the context patterns that a word occurs in to classify the word, for example allowing an argument to be classified according to the predicates it co-occurs with. A classification of this type gives information about which word to expect in a given context pattern, and the results can therefore be used in the process of choosing the most likely antecedent for an anaphor. In this respect, the most likely antecedent must be interpreted as the most likely antecedent given a particular contextual pattern.

In the following, the EPAS list will first be classified to see if the context patterns represented by the EPAS contain enough information to suggest the correct antecedent for anaphoric expressions from the text collection. Then an association of concepts will be performed, creating bundles of those arguments which occur in similar contexts, that is, with similar predicates. These concepts will then be applied in combination with the classification method to see if they improve the process of suggesting the correct antecedent for the anaphors.

For the purposes of classification and testing, the EPAS list was divided into training and test sets. The test set consists of all structures containing pronouns, while the training set consists of the remaining EPAS. For the test set, the correct antecedent for each pronoun was identified manually and added to the test file. When testing, the classifier assigns an antecedent based on the patterns it has seen in the training set. In this way, the correct antecedent in each test case functions as a means of measuring the success rate of the classification. The test set thus provides a good way of evaluating the product of the classification and gives a measure of whether the correct antecedent can be assigned based on training on occurrences of EPAS/context patterns.


The process of classifying the constituents of the EPAS is most useful if the aim of the classification is held clearly in mind. Classifying arguments relative to the predicates and the other arguments they co-occur with can give information about two things:

• is the data set generalisable enough to allow inference of the single correct antecedent in each test case?
• is the data set generalisable enough to allow inference of words within the semantic concept group that the correct antecedent belongs to?

In this thesis, it is of interest to identify all the words which occur in specific environments. Accordingly, we are interested in finding the members which can co-occur in a specific pattern, and not necessarily only the single correct antecedent.

The classification phase in the present work has three steps: firstly, classification through a memory-based learning algorithm; secondly, association of semantic classes from the text material by looking at contextual environments; and thirdly, classification through application of the concept groups gathered in step two. In the following, the classification method is described in more detail.

4.1 Step I: Classification with TiMBL

TiMBL (Tilburg Memory Based Learner) (Daelemans et al. 2003) is a memory-based learning (MBL) tool developed by the ILK research group at the University of Tilburg (ILK 2004). TiMBL has been developed specifically with the domain of NLP in mind and provides an implementation of several MBL algorithms.

Within MBL, or lazy learning (Daelemans et al. 1999), training instances are simply stored in memory. Upon encountering new instances, classification is performed by comparing the new instance to the stored experiences and estimating the similarity of the new instance to the old ones. The stored example(s) most similar to the new instance provide its classification. This approach stands in opposition to rule-induction based methods, also called greedy algorithms, in which the learning material is used to create a model with expected characteristics for each category to be learned. Daelemans et al. (1999) show that


language processing tasks tend to benefit from lazy learning methods, particularly because the individual examples in the training material are not abstracted away from in the process of creating rules. When a new data instance is classified, it is compared to all previously seen examples, including low-frequency ones. This suggests that for relatively small data sets, such as the one in the present work, MBL tools are particularly suitable.

By consulting previously seen data and estimating the similarity between old and new instances, MBL algorithms such as those in TiMBL are able to calculate the likelihood of new data instances. This is done by creating a classifier which essentially consists of an example set of particular patterns together with their associated categories. The classifier can subsequently classify unknown input patterns by applying algorithms that calculate the similarity, or distance, to the known patterns stored in memory. The nearest neighbour approach is one commonly used means of estimating this distance and is described in more detail in the following section.

4.1.1 The Nearest Neighbor approach

Daelemans et al. (2003, p. 19) state that all MBL approaches are founded on the classical k-Nearest Neighbor (k-NN) method of classification (Cover and Hart 1967). This approach classifies patterns of numeric data by using information gained from examining and classifying pattern distributions observed in a data collection. In the k-NN algorithm, a new instance of data is classified as nearest to a set of previously classified points. The intuition is that observations which are close together will have categories which are close together. When classifying a new data instance, the k-NN approach weights the known information about the closest similar data instances most heavily. In other words, a new instance is classified in the category of its nearest neighbour. In large samples, this rule can be modified to classify according to the majority of the nearest neighbours, rather than just the single nearest neighbour. The k-NN approach has several implementations in TiMBL. As TiMBL is designed to classify linguistic patterns, which in most cases consist of discrete data values and allow for a large number of attributes of varying relevance, the k-NN algorithm is not used directly. Instead, the classification of discrete data is made possible through a modified version of the k-NN approach, as well as other algorithms.


There are several different distance metrics incorporated in TiMBL and, as will be described later, the user can choose the one that suits the data material best. The basic metric is the Overlap Metric, where the distance between two patterns is calculated as the sum of the differences between the features of the two patterns (Daelemans et al. 2003, p. 20). The algorithm combining the k-NN approach with the overlap metric within TiMBL is called IB1 (Aha, Kibler and Albert 1991, in Daelemans et al. 2003). In this algorithm the value of k is the number of nearest distances (usually 1), allowing for a nearest neighbour set which may comprise several instances that all share the same distance to a test example. The IB1 algorithm finds the k nearest neighbours of a test case by calculating the distance between a test instance Y and a training instance X. The distance between the two instances is the sum of the distances between the instances' individual features. If k = 1, a test instance is assigned the category of its single nearest neighbour. In cases where the algorithm finds a set of nearest neighbours, the majority vote of the set is chosen. This implies a certain bias toward high-frequency categories, which in many cases will hold the majority vote.
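As an illustration of the overlap-metric nearest neighbour idea, the following Perl sketch classifies a test pattern by the category of its closest training pattern. It is a simplification for exposition, not TiMBL's implementation, which adds feature weighting, efficient storage and tie handling. The training instances are made-up EPAS-style patterns consisting of a predicate and an argument 2, with argument 1 as the category.

#!/usr/bin/perl
use strict;
use warnings;

# Made-up training patterns: [predicate, argument 2] => category (argument 1).
my @training = (
    { features => ['ankomme',  '?'],     category => 'etterforsker' },
    { features => ['kontakte', 'vitne'], category => 'politi' },
    { features => ['ha',       'teori'], category => 'politi' },
);

# Overlap distance: count the feature positions where two patterns
# disagree; all mismatches are treated as equally dissimilar.
sub overlap {
    my ($x, $y) = @_;
    my $d = 0;
    for my $i (0 .. $#$x) { $d++ if $x->[$i] ne $y->[$i] }
    return $d;
}

# 1-NN classification: return the category of the closest stored pattern.
sub classify_nn {
    my ($test) = @_;
    my ($best, $bestdist);
    for my $inst (@training) {
        my $d = overlap($test, $inst->{features});
        ($best, $bestdist) = ($inst, $d) if !defined($bestdist) || $d < $bestdist;
    }
    return $best->{category};
}

print classify_nn(['kontakte', 'person']), "\n";   # prints: politi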

4.1.2 Testing

To create a classifier, TiMBL needs training and test data as feature vectors, where each instance consists of a fixed number of feature values followed by a category. For testing purposes, the feature sequence is used when the distance between a test instance and the training data is calculated, and the category functions as a means of evaluating whether the assigned classification was valid. Because the test data is compared directly with the training data, separate training and test sets are needed. In this project, the EPAS list was split into a training set consisting of all EPAS without pronouns and a test set consisting of the EPAS with pronouns. In addition, testing through TiMBL's leave-one-out option was performed; here, testing is done on each pattern of the training file by treating each pattern in turn as a test case (Daelemans et al. 2003, p. 35).

In the classification phase, the alignment of category and description features is stored, so that the categories of new, unseen sequences of description features can be probabilistically inferred in the subsequent test phase.

Regardless of the input format chosen, a classification with TiMBL presupposes that the training material consists of a number of features to be learned from, as well as a predetermined category


which is the desired output category. The comma-separated values format was used for the EPAS classification. In order to classify the constituents of each EPAS based on the contextual patterns in the structure, each part of the EPAS was classified with reference to the other constituents in it. Somewhat analogous to the way a screw can be described as being small, long and containing no holes, the argument åsted (crime scene) can be described through its co-occurrence with the predicate ankomme (arrive) and the argument etterforsker (investigator) (example (4-1)). This makes it possible to train a classifier on the EPAS list, using the argument whose environment is to be learned as the category label, and each constituent of the EPAS as a feature. To avoid having the category explicitly present in the training material, and to ensure that the classifier was trained only on the environment of the desired category, the relevant feature was ignored using TiMBL's ignore option. In order to classify the structures once for each argument type, two different data sets were prepared. Example (4-1) shows the format of the three-feature data set that was used; the parentheses indicate that the feature in question was ignored during training and classification.

(4-1)
features                                   category
a. predicate, (argument 1), argument 2     argument 1
b. predicate, argument 1, (argument 2)     argument 2

Example (4-2) shows excerpts from the two input files: (4-2a) shows the structures with argument 1 as category, while (4-2b) shows the same structures with argument 2 as the category. The classifier is given two constituents of an EPAS to learn from, and the target constituent is given as the EPAS' category.

(4-2)
a. ankomme,etterforsker,?,etterforsker
   ankomme,etterforsker,?,etterforsker
   ankomme,etterforsker,åsted,etterforsker
   antyde,politi,?,politi
   avhøre,?,person,?
   avhøre,?,vedkommende,?
   avhøre,politi,vitne,politi

b. ankomme,etterforsker,?,?
   ankomme,etterforsker,?,?
   ankomme,etterforsker,åsted,åsted
   antyde,politi,?,?
   avhøre,?,person,person
   avhøre,?,vedkommende,vedkommende
   avhøre,politi,vitne,vitne

The output file created when TiMBL has classified the input data and run a test consists of the input given in the test set, with the category predicted by TiMBL appended at the end of each line. Further, the output supplied by TiMBL upon a successful training and testing round gives information about the actions in the various stages of the analysis. TiMBL's actions can be divided into three separate phases: in phase 1 the training data is analysed, in phase 2 the items in the training data are stored for efficient use during testing, and in phase 3 the trained classifier is applied to the test set. For the purposes of the EPAS analysis, the default algorithm was used in the test phase. This algorithm computes the similarity between a test item and a training item in terms of weighted overlap: the total difference between two patterns is the sum of the relevance weights of those features which are not equal (Daelemans et al. 2003, p. 13).

The classification of the EPAS and the subsequent testing were carried out in two distinct steps: classification and testing of argument 1 and of argument 2 were done separately. The results are described in the following sections.

4.1.2.1 Classifying argument 1

Several experiments were run in TiMBL with the aim of classifying occurrences of argument 1 according to the environment they occur in. The classifier was trained on all EPAS not containing pronouns and then tested. For the purpose of classifying occurrences of argument 1, an EPAS list with the relevant argument 1 as category label was used. In the following descriptions of the tests, this list will be referred to as EPAS_arg1.

Test 1
Training set: EPAS_arg1 with no pronouns, argument 1 ignored.
Test set: EPAS with pronouns in argument 1 position.
Result: 57,69% (15/26) correct classifications


The classifier was created with EPAS_arg1 without pronouns as the training set and tested with all EPAS containing pronouns in argument 1 position. For the test set, each EPAS was completed with the antecedent of its pronoun. For classification and testing purposes, the antecedent was appended at the end of each EPAS, thus functioning as the category label for the structure. In total, there were 26 EPAS with pronouns in argument 1 position. (4-3) below shows an example from the test file with a pronoun as argument 1:

(4-3)
få,pron,rapport,politi

When classifying with argument 1 as category label and testing on EPAS with pronouns as argument 1, TiMBL assigned the correct category in 57,69% (15/26) of the test cases. One of the cases where the classifier had assigned the "wrong" category was actually not incorrect: the antecedent was of a form that did not exist in the training material (antecedent: kvinne/vitne (woman/witness), assigned category: vitne (witness)). Furthermore, in six of the incorrectly assigned cases, the category chosen by the classifier was semantically close to the correct antecedent. Example (4-4) below shows the seven examples where the incorrect categories assigned by the classifier can in fact be viewed as belonging to the same semantic group, and thus as at least partially successful classifications. Regarding all these instances as successful category assignments would raise the classifier's rate of correct categorisations to 84,61% (22/26).

(4-4)
Correct antecedent                       Assigned category
kvinne/vitne (woman/witness)             vitne (witness)
Fonn (Fonn)                              politi (police)
Kripos-spesialist (Kripos specialist)    politi (police)
politimester (police chief)              Fonn (Fonn)
politi (police)                          etterforsker (investigator)
politi (police)                          Fonn (Fonn)
Slåtten (Slåtten)                        kvinne (woman)


Test 2
Training set: EPAS_arg1 with no pronouns, argument 1 ignored.
Test method: leave-one-out
Result: 42,40% (81/191) correct classifications

When training and testing on the EPAS_arg1 list with pronouns removed, the classifier produced a rather poor accuracy of 42,40%. TiMBL's leave-one-out option makes it possible to train and test on the same material, as each pattern in the training file is used as a test case while the rest of the patterns are used as training material. One reason for the relatively low percentage of correctly classified instances is most likely the small size of the data set. With only 191 patterns to learn from, the classifier does not see enough diversity in the examples to provide correct classifications, and also does not find enough occurrences of the individual patterns to be able to pick the correct category. Since politi (police) is by far the most frequent feature in the EPAS list, many instances are wrongly assigned the category politi by virtue of the majority vote of the nearest neighbour classification. An attempt to avoid this effect is described in test 3. Examining the instances where the classifier assigned the wrong category to an EPAS showed that in 27 of the incorrectly classified cases, the assigned category was semantically similar to the correct category. This suggests that the list does in itself contain some relevant information about the distribution of argument 1 in the data set. Example (4-5) below shows the correct categories and the categories assigned by the classifier.

(4-5)
Correct category                         Assigned category
Anne                                     kvinne (woman), Slåtten
drapsmann (killer)                       gjerningsmann (perpetrator)
etterforsker (investigator)              politi (police)
Fonn                                     lensmann (deputy), politi (police)
gjerningsmann (perpetrator)              person (person)
Kripos-spesialist (Kripos specialist)    politi (police)
kvinne (woman)                           23-åring (23-year-old)
lensmann (deputy)                        Fonn, politi (police)
medarbeider (co-worker)                  politi (police)
person (person)                          gjerningsmann (perpetrator)
politi (police)                          lensmann (deputy), etterforsker (investigator)
politimester (chief of police)           politi (police)
polititjenestefolk (police workers)      politi (police)
Slåtten                                  Anne, kvinne (woman)
tekniker (technician)                    politi (police)
23-åring (23-year-old)                   kvinne (woman)

Test 3
When using the overlap metric, all feature values are regarded as equally dissimilar (Daelemans et al. 2003, p. 23). This means that the classifier is unable to determine the similarity of values such as politi (police), etterforsker (investigator) and politimester (chief of police) by looking at their co-occurrence with target classes. With the Modified Value Difference Metric (MVDM), the features are instead weighted according to the patterns they occur in. Unfortunately, MVDM does not perform well on small data sets with values that occur only a few times. When trained and tested on the EPAS_arg1 list, MVDM produced slightly lower accuracies than the corresponding test with the overlap metric (see test 2 above). In practice, this meant that the benefits of MVDM could not be exploited, due to the size of the data material.

Test 4
Training set: EPAS_arg1 excluding structures with pronouns and structures with non-verbal predicates
Test method: leave-one-out
Result: 45,03% (68/151) correct classifications


The training and test material was modified by excluding all EPAS with non-verbal predicates, as well as all EPAS with the predicate være (be). This was done to see if these structures disturb the data material by adding irrelevant information that does not contribute to describing the distribution of arguments in the EPAS. The accuracy increased slightly upon this modification of the data set. The editing did not, however, increase the accuracy when training on the edited EPAS_arg1 list and testing on EPAS containing pronouns in argument 1 position.

4.1.2.2 Classifying argument 2

Analogous to the classification steps performed for argument 1, the classifications were repeated for occurrences of argument 2. The EPAS list with the second argument as category label will in the following be referred to as EPAS_arg2.

Test 1
Training set: EPAS_arg2 with no pronouns, argument 2 ignored.
Test set: EPAS with pronouns in argument 2 position
Result: (0/6) correct classifications

Training the classifier on the EPAS_arg2 list without pronouns and testing on the EPAS with pronouns in argument 2 position did not produce any correct classifications. This is likely due in part to the small size of the test set, as well as to the homogeneous nature of the test instances. Five of the wrongly classified instances were in fact of the same type: in all of them, the classifier had assigned the category kvinne (woman), while the correct antecedent was Slåtten.

Test 2
Training set: EPAS_arg2 with no pronouns, argument 2 ignored
Test method: leave-one-out
Result: 49,73% (95/191) correct classifications


Training and testing the classifier on the EPAS_arg2 list with no pronouns produced an accuracy of 49,73%. As was the case for the corresponding classification of argument 1, the relatively small data set is likely a disadvantage for the classification process.

4.1.3 Comments on the results

The results obtained through classifying the EPAS indicate that the information present in the EPAS derived from the text collection does provide clues about which word to expect in a specific position. The accuracy scores obtained by training and testing on the EPAS extracted from a collection of texts suggest that even a small collection of texts on the same domain provides enough information to enable a classification approach based on contextual distribution. In the tests described above, there was a recurring tendency that in a number of the cases where the wrong category was assigned in the test phase, the assigned category bore some semantic resemblance to the correct category. This reinforces the initial intuition that similar words are used in similar environments and that the environment can contribute clues toward the semantic meaning of a word.

In the following section, the notion of finding words which are similar to each other by virtue of occurring in the same environments will be explored further.

4.2 Step II: Association of concept groups

The fundamental idea in this thesis is that words display certain semantic features based solely on the context they are found in. Therefore, when looking for possible antecedents for an anaphoric expression, the candidates should be weighted not only according to their occurrence in an identical context pattern in a corpus, but also according to their co-occurrence with similar context patterns. The assumption that words which occur in identical contexts have related meanings can be used to retrieve words with similar meanings from the data material. With a target argument and a target predicate as starting point, the association method goes through the EPAS list and returns words which occur in environments similar to those of the target argument. This association is performed in three steps:


• level 0: words which co-occur with the target predicate are returned
• level 1: words which occur in the same contexts as the target argument are returned
• level 2: words which occur in the same contexts as the words found in level 1 are returned

Level 0 considers the information that is directly accessible from the EPAS list: with a given predicate as reference point, the co-occurring arguments are retrieved. Level 1 looks at the other arguments that occur with the same predicates as the arguments retrieved in the first step. Finally, level 2 performs the same step once again and looks at the arguments that occur in the same contexts as the arguments collected in level 1. As a result, bundles of concepts are produced, each concept class consisting of words that are used in the same textual contexts and therefore are likely to be semantically similar. The following example shows how the association of argument classes is performed.
association of argument classes is performed.<br />

Level 0
The association method takes as its starting point a predicate from the EPAS list. For a given predicate, the first and second arguments are listed. The nominal argument etterforsker (investigator) occurs as argument 1 of the verbal predicate ankomme (arrive) in the text collection. (4-6) below shows the EPAS with ankomme as predicate.

(4-6)
ankomme,etterforsker,?
ankomme,etterforsker,åsted

arrive, investigator, ?
arrive, investigator, crime scene

Level 1
In order to find other nominal arguments that occur in the same context patterns as etterforsker, we must look at the other EPAS in which etterforsker occurs as argument 1. This yields the EPAS shown in (4-7) below. For the sake of the concept association, the EPAS corresponding to adjective-noun relations in the original texts are not considered, and are therefore not included in (4-7).


(4-7)
bistå,etterforsker,lensmann
bistå,etterforsker,politi
ha,etterforsker,observasjon
kontakte,etterforsker,vitne
mene,etterforsker,?
rigge,etterforsker,lyskaster
undersøke,etterforsker,åsted

assist, investigator, deputy
assist, investigator, police
have, investigator, observation
contact, investigator, witness
mean, investigator, ?
build-up, investigator, searchlight
examine, investigator, crime scene

For each of the predicates in (4-7), we want to find the other arguments, besides etterforsker (investigator), which occur in the corpus material as the first argument of the predicate. Traversing the EPAS list in search of these nominal arguments yields the list presented in (4-8). Pronouns and empty argument slots are omitted from the association, since they generally occur in too many different context patterns to contribute relevant information to this kind of analysis.

(4-8)
ha,politi,medarbeider
ha,politi,teori
kontakte,politi,vitne
mene,politi,?
undersøke,politi,aktivitet

have, police, co-worker
have, police, theory
contact, police, witness
mean, police, ?
examine, police, activity

As can be seen from the EPAS in (4-8), politi (police) is the only other argument which occurs in the same contexts as etterforsker (investigator). So far, the association tells us that there is a relationship between the concepts etterforsker and politi in the sense that these words occur in the same environments in the text collection.

Level 2
In order to explore the possibility of further associated concepts, the association method goes one level further. Essentially, the first step of the association is repeated, but with new parameters: for each of the first arguments in (4-8), we need to know which other words can occur in the same contextual position. The EPAS list is therefore consulted again, and all the other first arguments that occur in the same environments as politi (police) are returned. This produces the list in (4-9).


(4-9)
avklare,obduksjon,?
bede-om,lensmann,assistanse
bede-om,Fonn,?
bede-om,lensmann,?
bekrefte,lensmann,?
bekrefte,politimester,?
finne,leteaksjon,kvinne
få,kjæreste,telefon
gi,Fonn,opplysning
gi,kamera,indikasjon
gi,lensmann,opplysning
gi,lensmann,opplysning
gi,vitneavhør,indikasjon
ha,etterforsker,observasjon
kjenne,generic-nom,Slåtten
kontakte,etterforsker,vitne
mene,etterforsker,?
tro,lensmann,?

clarify,autopsy,?
ask-for,sergeant,assistance
ask-for,Fonn,?
ask-for,sergeant,?
confirm,sergeant,?
confirm,chief of police,?
find,search party,woman
get,boy/girlfriend,telephone
give,Fonn,information
give,camera,indication
give,sergeant,information
give,sergeant,information
give,interview,indication
have,investigator,observation
know,generic-nom,Slåtten
contact,investigator,witness
mean,investigator,?
believe,sergeant,?

As a step toward disregarding arguments which do not occur often enough in the context in question to be of significance, the method may be limited to considering only arguments which occur more than once in the text material. For a larger text collection, a different method of filtering out low-frequency arguments would have to be adopted; for the small data set in this project, disregarding arguments which occur only once was sufficient. The steps outlined above produce the following associated group of concepts:

Figure 6
etterforsker (investigator)
politi (police)
lensmann (sergeant)
Fonn (Fonn)


Intuitively, this is quite a good association of concepts, since all the entities in the grouping belong to the group law enforcement. If a person were to group nominals from the text collection into semantically similar concept classes, the grouping in Figure 6 would not be an unlikely result. The grouping in Figure 6, however, is the result of an association based purely on context information from the text itself.

4.2.1 Classify

Manually performing the association method described above on all the EPAS in the data set proved to border on the impossible, mainly because it implied consulting the data set multiple times, each time looking for different values while keeping track of the partial goals in the process. Based on the method described above, the Perl script classify was therefore written 5. In the following, the algorithm implemented in classify is outlined in brief; a compact sketch of the same procedure is given after the outline.

For each predicate:
1. Level 0:
   What are ARG1 and ARG2 in the corpus/EPAS list?
2. Level 1:
   For each ARG1 = x found in 1:
      With which other predicates does ARG1 = x also occur?
      For each of these predicates:
         Which other words occur as ARG1?
      This produces a list of words which occur in the same contexts as x.
3. Level 2:
   For each word = y in the list from level 1:
      Which other predicates does this word co-occur with?
      For each of these predicates:
         Which other words occur as ARG1?
      This produces a list of words which occur in the same contexts as y.

The same procedure is repeated for ARG2.

5 The algorithm was implemented in Perl by Martin Rasmussen Lie, informatics student at the University of Bergen.

(4-10) below shows the output for the predicate ankomme (arrive) as produced by classify (NIVÅ is Norwegian for LEVEL; "Ingen referanser" means "no references").

(4-10)
NIVÅ0, ARG1 (ankomme): etterforsker x 3
NIVÅ1, ARG1 (etterforsker): ? x 5, politi x 5, pron x 3
NIVÅ2, ARG1 (politi): ? x 17, pron x 9, lensmann x 6, Fonn x 2,
forbipasserende x 1, generic-nom x 1, kamera x 1, kjæreste x 1,
leteaksjon x 1, obduksjon x 1, politimester x 1, syklist x 1,
vitneavhør x 1
NIVÅ2, ARG1 (pron): politi x 7, kvinne x 4, vitne x 4, ? x 2,
Anne x 2, bilfører x 2, Slåtten x 2, syklist x 2, etterforskning x 1,
Fonn x 1, generic-nom x 1, kjæreste x 1, lensmann x 1, lommebok x 1,
rapport x 1, teori x 1
NIVÅ0, ARG2 (ankomme): ? x 2, åsted x 1
NIVÅ1, ARG2 (åsted): aktivitet x 1, hybelhus x 1,
minibankaktiviteter x 1, mobiltelefontrafikk x 1, område x 1,
overvåkningsfilmer x 1
NIVÅ2, ARG2 (aktivitet): minibankaktiviteter x 1,
mobiltelefontrafikk x 1, område x 1, overvåkningsfilmer x 1
NIVÅ2, ARG2 (hybelhus): (Ingen referanser)
NIVÅ2, ARG2 (minibankaktiviteter): aktivitet x 1,
mobiltelefontrafikk x 1, område x 1, overvåkningsfilmer x 1
NIVÅ2, ARG2 (mobiltelefontrafikk): aktivitet x 1,
minibankaktiviteter x 1, område x 1, overvåkningsfilmer x 1
NIVÅ2, ARG2 (område): aktivitet x 1, minibankaktiviteter x 1,
mobiltelefontrafikk x 1, overvåkningsfilmer x 1
NIVÅ2, ARG2 (overvåkningsfilmer): aktivitet x 1,
minibankaktiviteter x 1, mobiltelefontrafikk x 1, område x 1

Please consult Appendix E for the full program code for classify.pl.

4.2.2 Associated concept classes

Running the three-level association of semantic classes described above on the EPAS list produced six distinct groupings. These concept groups are shown in (4-11) below.

(4-11)
a. POLICE:
etterforsker, politi, lensmann, Fonn
investigator, police, deputy, Fonn

b. WOMAN:
Anne, Slåtten, 23-åring, sykepleiestudent, kvinne, beboer
Anne, Slåtten, 23-year-old, nursing student, woman, inhabitant


c. PERP:
gjerningsmann, drapsmann
perpetrator, killer

d. PERSON:
person, bilfører, syklist, vedkommende
person, car driver, biker, generic-nom

e. OBSERV:
teori, observasjon
theory, observation

f. PLACE:
studentkollektiv, Førde
student housing, Førde

The classes of words shown in (4-11) form groups of concepts which occur in the same contextual environments within the thematic domain that the EPAS are extracted from. The groupings seem to reflect real semantic clusters in the sense that one can easily find a label to describe each group. For the text collection in the present work, these six concept groups represent six distinct semantic groupings that share many features with respect to pattern distribution in the data set. With a larger data set to run the concept association on, more concept groups, and also more members within each group, would have been a likely outcome. The results of the concept association on the small data set in this project do, however, suggest the feasibility of the method, and show that frequent patterns in smaller text collections also serve to capture interesting concept groupings.

4.3 Step III: Using concept groups in TiMBL<br />

The concept groups which emerged as a result of the association performed in section 4.2 above,<br />

represent clusters of words that occur in similar constellations in the data material. The<br />

emergence of concept groups which intuitively seem to have some semantic resemblance to<br />

each other confirms that the context a word fits into does indeed say something about what the<br />

word means, as per the distributional hypothesis.<br />
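To make this distributional intuition concrete, the following minimal Perl sketch (illustrative only; the variable names and input format are assumptions, not part of the thesis scripts) counts how many predicate contexts two first arguments share in a list of EPAS triples. Words with many shared contexts are candidates for the same concept group:

#!/usr/bin/perl
# counts how many "predicate/second-argument" contexts two first arguments share
# input: one EPAS per line, e.g. "etterlyse,politi,bilfører"
my %contexts;
while (my $line = <STDIN>) {
    chomp $line;
    my ($pred, $arg1, $arg2) = split /,/, $line;
    next unless defined $arg1 && $arg1 ne '' && $arg1 ne 'pron';
    $arg2 = '' unless defined $arg2;
    $contexts{$arg1}{"$pred/$arg2"} = 1;
}
# print the number of shared contexts for every pair of arguments
my @words = sort keys %contexts;
for my $i (0 .. $#words) {
    for my $j ($i + 1 .. $#words) {
        my $shared = grep { exists $contexts{$words[$j]}{$_} }
                     keys %{ $contexts{$words[$i]} };
        print "$words[$i]\t$words[$j]\t$shared\n" if $shared > 0;
    }
}

Run on the EPAS list, a pair such as politi and lensmann would receive a high shared-context count, which is exactly the signal the association method exploits.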

74


In the introduction to this chapter, it was stated that the aim of classifying the EPAS list is
twofold. On the one hand, it is of interest to see to what degree the environments that an
argument occurs in over a collection of texts provide sufficient cues to ensure a correct guess of
which argument can be expected in a specific context; on the other hand, it is equally interesting
to see whether classification can narrow down the set of possible arguments for a specific
context pattern. Through the association technique, six groups of words emerged; the members
of each group share the feature that they all tend to occur in the same environments.

Previously, it has been stated that some anaphors need access to information about the world in<br />

order to be resolved. This information can to some extent be represented by the concept groups<br />

associated from the data set. By identifying groups of words which typically occur in the same<br />

textual environment, an intuition about which words to expect in which contexts is captured. In<br />

the event of “difficult” anaphors which depend on world knowledge, an anaphora resolution<br />

system can retrieve potential antecedents from the text, check which concept group an expected<br />

antecedent is likely to belong to, and consequently choose the antecedent candidate belonging to

the expected concept group. As a first step of examining the usefulness of concept groups in<br />

combination with anaphora, experiments aiming at enhancing the performance of the classifier<br />

in section 4.1 were performed. These experiments are described in the following section.<br />
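The selection step just described can be pictured in a few lines of Perl. This is an illustrative sketch, not one of the thesis scripts: the %group_of lookup is a hypothetical structure built from the concept groups in (4-11), and the candidate list is assumed to come from an anaphora resolution system.

# illustrative sketch of concept-group based antecedent selection
my %group_of = (
    etterforsker  => 'POLICE', politi    => 'POLICE',
    lensmann      => 'POLICE', Fonn      => 'POLICE',
    gjerningsmann => 'PERP',   drapsmann => 'PERP',
);

# returns the first candidate whose concept group matches the expected group
sub pick_antecedent {
    my ($expected_group, @candidates) = @_;
    for my $cand (@candidates) {
        return $cand if defined $group_of{$cand}
                     && $group_of{$cand} eq $expected_group;
    }
    return undef;    # no candidate belongs to the expected group
}

# example: a context pattern leads us to expect a POLICE argument
print pick_antecedent('POLICE', 'gjerningsmann', 'lensmann'), "\n";    # lensmann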

4.3.1 Testing<br />

Tests were performed in TiMBL, using the relevant concept group as the category for a feature<br />

pattern. Analogous to the testing in section 4.1.2 above, two separate test sets were prepared,<br />

one for the classification of each argument. In the cases where the relevant argument was a<br />

member of one of the concept groups, the head label of the concept group was used as the<br />

category label in the input data. If the relevant argument did not belong to any concept group,<br />

the argument itself was used as category label, as in the tests in section 4.1.2. Example (4-12)<br />

below shows an excerpt of the input file used for training the classifier for argument 1<br />

classification.<br />

(4-12)

drepe,gjerningsmann,kvinne,PERP<br />

drept,sykepleiestudent,?,WOMAN<br />

død,sykepleiestudent,?,WOMAN<br />

ekstra,patrulje,?,patrulje<br />
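A minimal Perl sketch of this relabelling step, assuming a hypothetical %group table mapping group members to the head labels in (4-11) and ignoring the handling of the '?' placeholder, could look as follows:

# relabels EPAS training instances with concept group labels
my %group = (
    gjerningsmann    => 'PERP',
    sykepleiestudent => 'WOMAN',
    politi           => 'POLICE',
);
while (my $line = <STDIN>) {
    chomp $line;
    my ($pred, $arg1, $arg2) = split /,/, $line;
    # use the group label as category if argument 1 belongs to a group,
    # otherwise fall back to the argument itself
    my $category = exists $group{$arg1} ? $group{$arg1} : $arg1;
    print "$pred,$arg1,$arg2,$category\n";
}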



The aim of the tests performed in this section was to see if the accuracy of the classifier could be<br />

enhanced by training on a complete context pattern with the appropriate concept group as<br />

category label.<br />

Test 1<br />

Training set: EPAS_arg1 with no pronouns and concept classes as category label, argument 1<br />

ignored.<br />

Test method: leave-one-out<br />

Result: 56,54% (108/191)<br />

In this test, the classifier was trained on two features of the EPAS, ignoring argument 1. This<br />

test is analogous to test 2 in section 4.1.2.1, which had an accuracy of 41,20% correct<br />

classifications. In addition to the 108 correctly classified instances, five further instances

were assigned categories which are semantically similar to the correct category. This was true<br />

for Kripos-spesialist (Kripos specialist), politimester (chief of police), medarbeider (co-worker)<br />

and polititjenestefolk (police workers), which were all assigned the category POLICE. These<br />

words are not part of the concept group POLICE, but are obviously semantically related to the<br />

members of this concept group. Had these words occurred more frequently in the data material,<br />

they could have been expected to show a distribution allowing for their inclusion in POLICE.<br />

The results of this test suggest that labeling EPAS with concept group labels raises the

accuracy of the classifier. This is not surprising, given the fact that a higher number of context<br />

patterns/EPAS are labeled with the same category in such an approach, making the generalisable<br />

material larger.<br />
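For reference, a leave-one-out experiment of this kind is run from the TiMBL command line roughly as follows (the file name is hypothetical, and the exact option syntax should be checked against the TiMBL reference guide (Daelemans et al. 2003)):

Timbl -f EPAS_arg1_concepts.train -t leave_one_out

Here -f names the training file and -t leave_one_out requests leave-one-out evaluation over the training instances.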

Test 2<br />

Training set: EPAS_arg1 with no pronouns and concept classes as category label.<br />

Test method: leave-one-out<br />

Result: 86,91% (166/191)<br />

This test was performed to see if training the classifier on the entire structure of an EPAS<br />

increases the accuracy of assigning concept labels to the structures. The classifier was trained on<br />

all three features of the EPAS. In this case, the classifier performed with a fairly high accuracy,<br />

assigning the correct category in 166 of 191 cases. It is obviously an advantage that all parts of<br />



the EPAS can be used in the classification phase when the category to be assigned is not literally<br />

a part of the structures to be learnt from.<br />

Test 3<br />

Training set: EPAS_arg1 with no pronouns and concept classes as category label, argument 1<br />

ignored.<br />

Test set: EPAS_arg1 with pronouns and concept classes as category label<br />

Result: 76,92% (20/26)<br />

As was the case in the corresponding test in section 4.1.2.1, two of the wrongly assigned<br />

categories were in fact within the same semantic group as the correct category. Regarding these<br />

as correct assignments would raise the result to 84,61%. As was the case in the previous two

tests, the assigned categories in these cases are too infrequent in the EPAS list to surface in the<br />

associated concept groups.<br />

Test 4<br />

Training set: EPAS_arg2 with no pronouns and concept classes as category label, argument 2<br />

ignored.<br />

Test set: EPAS_arg2 with pronouns and concept classes as category label<br />

Result: 83,33% (5/6)<br />

When training the classifier on the EPAS_arg2 list using concept class labels as categories and<br />

testing on the set of EPAS with pronouns in argument 2 position, the classifier resolved five of<br />

the six test instances correctly. In the corresponding test in section 4.1.2, the classifier did not<br />

assign the correct category in any of the six test cases. We did, however, see that five of the test<br />

instances were assigned categories which were semantically similar to the correct antecedent. In<br />

view of the results in the initial test, it came as no surprise that the classifier performed so much<br />

better when used in connection with the concept class labels.<br />



4.4 Are concept classes useful for anaphora resolution?<br />

The EPAS list has been processed in different ways in this chapter. The tests which have been<br />

described provide an indication of how context patterns extracted from the text collection can be<br />

used to create expectations of which words (or which types of words) are likely to occur in a

given contextual environment. These expectations can be used to anticipate which word, or<br />

rather which concept, might be the antecedent for an anaphor. The concept groups which<br />

emerged in the association process are simply classes of semantically related words which tend<br />

to have similar contextual distributions within the domain of the text corpus. In order to indicate<br />

the usefulness of such concept classes in the process of resolving an anaphor, the test set of the<br />

EPAS list (all EPAS containing pronouns) was processed with different methods. In (4-13) the<br />

results of these methods are shown. In addition to the tests in TiMBL described in the above, the<br />

anaphors in the test set were resolved manually using the Lappin and Leass approach as<br />

described in section 2.1.2. For these tests, the sentence with the anaphor, as well as the preceding

sentence, was considered in each case. This purely syntactic approach identified the correct<br />

antecedent in 16 of the 32 test instances.<br />

(4-13)

Method                          Correct assignments
Syntactic method                50%     (16/32)
TiMBL                           46,87%  (15/32)
TiMBL with concept groups       78,12%  (25/32)

The results shown in (4-13) suggest that using concept groups may indeed be a useful approach<br />

in anaphora resolution. Especially in the case of anaphoric expressions where the antecedent is<br />

not clearly stated in the text, it may be useful to have an idea of which type of antecedent one

might expect. 10 of the 32 EPAS containing pronouns were of this kind. The syntactic approach<br />

could naturally not resolve these anaphors, as an antecedent not clearly present in the text can
hardly feature on a list of possible candidates. These types of anaphors require real-world or

domain knowledge to be resolved. In the case of 4 of these 10 EPAS, the EPAS list could not be<br />

consulted to find likely antecedents. Because of the small size of the data set, some predicates<br />

only feature once. This was the case for the five predicates jobbe-utfra (work-from), kartlegge<br />

(map), ta (take), varsle (notify) and ville (want) which all only co-occur with pronouns. With the<br />



exception of jobbe-utfra, none of the antecedents in these cases can be predicted on the basis of

the distribution of predicates and arguments in the EPAS list. (4-14) shows the instances where<br />

the EPAS list could be consulted in the process of finding likely antecedents for these anaphors.<br />

In the case of ha (have) and komme-i-kontakt-med (come-into-contact-with), other EPAS with<br />

the same predicates were retrieved from the EPAS list. Since ha and komme-i-kontakt-med

occur in identical or very similar patterns with politi as the first argument, this would be the<br />

preferred candidate for the antecedent in (4-14a), (4-14c) and (4-14d). In the case of (4-14b), the<br />

predicate jobbe-utfra only has this one occurrence in the EPAS list. This means that similar<br />

patterns must be examined in the search for a possible antecedent. By consulting the EPAS list,<br />

it can be found that teori (theory) only occurs as a second argument in connection with politi as<br />

first argument. This would suggest that politi is a potential antecedent for the pronoun. By<br />

applying the concept groups, the list of possible antecedents motivated by the texts can be<br />

expanded to also include the other arguments which have been found to display a similar<br />

distribution to the arguments which actually co-occur with the predicate in question. In the case<br />

of the pronouns in (4-14), politi is the correct antecedent in all of the cases.<br />

(4-14)

     EPAS with pronoun                      similar EPAS from list                     antecedents    concepts

a.   ha,pron,teori                          ha,etterforsker,observasjon                politi         lensmann
                                            ha,politi,medarbeider                      etterforsker   Fonn
                                            ha,politi,teori

b.   jobbe-utfra,pron,teori                 ha,politi,teori                            politi         etterforsker
                                            forkaste,politi,teori                                     lensmann
                                                                                                      Fonn

c.   komme-i-kontakt-med,pron,bilfører      komme-i-kontakt-med,politi,bilfører        politi         etterforsker
                                            komme-i-kontakt-med,politi,generic-nom                    lensmann
                                                                                                      Fonn

d.   komme-i-kontakt-med,pron,syklist       komme-i-kontakt-med,politi,bilfører        politi         etterforsker
                                            komme-i-kontakt-med,politi,generic-nom                    lensmann
                                                                                                      Fonn
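The consultation procedure illustrated in (4-14) can be sketched in Perl as follows. Again, this is an illustrative sketch with hypothetical data structures rather than one of the thesis scripts: @$epas is assumed to hold [predicate, arg1, arg2] triples from the EPAS list, $group_of maps words to group labels, and $members maps group labels to their member lists.

# retrieves first arguments of EPAS with the same predicate, then expands the
# candidate set with the remaining members of their concept groups
sub suggest_candidates {
    my ($pred, $epas, $group_of, $members) = @_;
    my (%seen, @antecedents);
    # collect non-pronoun first arguments of EPAS with the same predicate
    for my $triple (@$epas) {
        my ($p, $a1) = @$triple;
        next unless $p eq $pred && defined $a1 && $a1 ne '' && $a1 ne 'pron';
        push @antecedents, $a1 unless $seen{$a1}++;
    }
    # add the other members of each antecedent's concept group
    my @concepts;
    for my $ant (@antecedents) {
        my $g = $group_of->{$ant} or next;
        push @concepts, grep { !$seen{$_}++ } @{ $members->{$g} };
    }
    return (\@antecedents, \@concepts);
}

For the pronoun in (4-14a), this would return politi and etterforsker as text-motivated antecedents, expanded with lensmann and Fonn from the POLICE group.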

The examples in (4-14) indicate how the method described in this thesis can function. In cases<br />

where there is no clearly expressed antecedent in a text, or where the resolution of an anaphor
requires knowledge about the world (or knowledge about how predicates and arguments

combine within a domain), the method can be of aid. Consider again the examples from the<br />

introduction, repeated in (4-15) below:<br />

(4-15)

a. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig<br />

kommer til å drepe igjen. Han etterlyser vitner som var i sentrum søndag<br />

kveld.<br />

The sergeant leading the investigation says that the perpetrator probably will<br />

kill again. He puts out a call for witnesses who were in the city centre Sunday<br />

evening.<br />

b. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig<br />

kommer til å drepe igjen. Han er observert i sentrum.<br />

The sergeant leading the investigation says that the perpetrator probably will<br />

kill again. He is observed in the city centre.<br />

As was established in chapters 1 and 2, the antecedent of the anaphor han (he) in (4-15b) cannot<br />

be resolved differently from the anaphor in (4-15a) without consulting some sort of knowledge<br />

source. However, the information present in the EPAS list can be used as domain knowledge in<br />

the process of resolving these anaphors. For the anaphor in (4-15a) other occurrences of the<br />

predicate etterlyse (call-for) can be consulted. This would produce the following list:<br />

(4-16)

etterlyse,?,bilfører<br />

etterlyse,politi,bilfører<br />

etterlyse,politi,person<br />

etterlyse,politi,syklist<br />

call-for,?,driver<br />

call-for,police,driver<br />

call-for,police,person<br />

call-for,police,biker<br />

It is clear from the list that etterlyse tends to co-occur with politi as its first argument, and that it<br />

does not occur with any other first arguments. Through the concept association we know that<br />

politi and lensmann (sergeant) both belong to the same concept group. Consequently, we also
know that politi and lensmann both occur in similar environments and thus share features.

Given that the possible antecedents in (4-15a) are lensmann and gjerningsmann (perpetrator),<br />

the consultation of the EPAS list and the concept groups leads us to select lensmann as the<br />



antecedent for (4-15a). In the case of (4-15b) the EPAS list is unfortunately not equally helpful.<br />

Other occurrences of observere (observe) are:<br />

(4-17)

observere,?,23-åring<br />

observere,?,bile<br />

observere,?,person<br />

observe,?,23-year-old<br />

observe,?,car<br />

observe,?,person<br />

Neither the EPAS in (4-17), nor the concept groups from section 4.2.2, give us any clues as to<br />

whether lensmann or gjerningsmann is a more likely second argument in connection with the<br />

predicate observere. This can be explained by two circumstances: firstly, the example sentences
in (4-15) are constructed and therefore do not reflect examples from the data set; secondly, the

small size of the data material obviously limits the extent to which one can expect that all valid<br />

patterns of the domain are in fact found in the data set.<br />



5 Final remarks<br />

5.1 Is a parser vital for the extraction process?<br />

An initial assumption during the development of the method in this thesis was that it was of high<br />

importance to base the extraction method on a syntactic parse of the text collection. As the

reader will recall, the reasons for this assumption were elaborated in chapter 3 and will therefore<br />

not be discussed further here. However, as a means of evaluating the extraction method, the<br />

texts in the text collection were processed using the Oslo-Bergen Tagger (OBT 2005). This is a

part of speech (POS) tagger which among other options offers the user a syntactic<br />

disambiguation of the input text. The texts were POS tagged using the web version of the tagger<br />

and structures corresponding to subject-verb-object relations were manually extracted from the<br />

output. This yielded a list of 169 structures, 26 of them with pronouns. The structures were<br />

extracted using a quite rudimentary method; for example, no distinction was made between

active and passive versions of the same predicate. This resulted in a list featuring exactly the<br />

problematic issues discussed in chapter 3; the arguments were represented (and subsequently<br />

structured) not according to thematic roles, but merely according to their syntactic roles in the<br />

sentence. As a result, the list did not reflect characteristic arguments of the different predicates<br />

to the same degree as the EPAS list did. The list of the POS-based structures is available in<br />

Appendix F.<br />

Consider the FCA diagram in Figure 7 below. Figure 7 shows part of the FCA diagram created<br />

for the POS-based structures; the section of the diagram with the argument politi (police) as<br />

starting point is highlighted. When comparing this figure to the corresponding figure for the<br />

EPAS list (Figure 5 in section 3.6.4), it is quite clear that the POS-based list is significantly less<br />

generalisable. There are no clear groupings of arguments which display specific behaviour<br />

through their combination with a certain subset of predicates. Because formal subjects of both<br />

active and passive sentences are realised as first arguments in this extraction, it is hardly<br />

possible to group arguments into groups of semantically related words based on their<br />

distribution. As can be seen from the diagram, politi co-occurs with both sykepleiestudent<br />

(student nurse) and bilfører (driver), as well as other, more relevant, arguments.<br />



[Figure 7: Part of the FCA diagram for the POS-based structures, with the section of the diagram starting from the argument politi (police) highlighted]

Interestingly enough, however, the POS-based list of structures proved to be just as well suited<br />

as the EPAS list for subsequent classification using TiMBL. When training and testing the<br />

classifier on the POS-based structures, it assigned the correct antecedent in 57,69% (15/26) of<br />

the test cases. In comparison, the EPAS classifier performed with an accuracy of 57,69% when<br />

trained and tested on argument 1, and with an overall accuracy of 46,87%.<br />

These results are interesting mainly because they show that, for the purposes of using a memory-

based classifier, an extraction method based on a syntactic parser does not necessarily provide<br />

better results than a POS-tagger based method. Even though the list of extracted structures was<br />

decidedly poorer than the EPAS list, especially because it contained “wrong” information in the<br />

sense that logical objects were listed as subjects by virtue of their syntactic role, it provided<br />

useful input for the classification process. It is, however, as suggested by the FCA diagram<br />

above, likely that the POS-based list would be of less use for the concept association phase,<br />

since this approach relies on the presence of similar entities in similar positions in the structures.<br />

In conclusion, it can probably be stated that the advantages of using a syntactic parser in the
extraction process are less clear-cut than first presumed. For the purposes of aiding in anaphora
resolution, it may well be that an extraction method can perform equally well when based on
shallower processing methods.

5.2 Summary and conclusions<br />

This thesis has described a method for corpus-based semantic categorisation of predicates and<br />

arguments in a limited thematic domain. The aim of the project was to create a means of<br />

automatically inferring selectional restrictions corresponding to real-world knowledge of the<br />

domain of the text collection. The classification of the predicates and arguments extracted from<br />

the text collection resulted in several concept groups, where each concept group displayed a<br />

particular distribution in the text collection.<br />

In the introduction of the thesis, it was stated that a chief goal of the project was to assess the<br />

value of using co-occurrence patterns to create concept groups which can act as an aid in the<br />

process of pronoun resolution. The concept groups were thought to function as an intuition<br />

about which word to expect in a given environment. Two criteria were formulated with regard

to the evaluation of the results obtained by the project:<br />

• were the concept groups created valid for the domain of the text collection?<br />

• were the concept groups useful in the process of anaphora resolution?<br />

Through classification and testing of the extracted data set some remarks can be made with<br />

regard to these two criteria. The concept groups that emerged as a product of the association

performed in the classification phase did indeed seem to constitute valid groupings of<br />

semantically similar words. The concept groups were made based on the contextual distribution<br />

of arguments in the text collection and represent groups of words which “keep the same<br />

company” and tend to occur in similar environments. They are valid groupings for the domain of<br />

the text collection and confirm the intuition that similar words display similar distribution, and<br />

thus similar behaviour in the data set. The tests performed with the concept groups show that<br />

they do contribute to raising the success rate of the MBL classifier; when testing with

EPAS containing pronouns the classifier assigned the correct concept group as antecedent in<br />

78% of the instances, in comparison to an almost 47% success rate without concept groups.<br />



When testing on knowledge-dependent anaphors and on anaphors which do not have an<br />

explicitly mentioned antecedent in the text, it was evident that concept groups contribute
interesting information. Ideally, a referent-guessing helper using concept groups should be

consulted as part of an anaphora resolution system. In the event of several possible antecedent<br />

candidates motivated from the text and proposed by the system, the concept groups in<br />

connection with the context pattern of the anaphor can provide useful information about which<br />

type of antecedent is likely. In this way the concept groups represent information about the valid
contextual patterns of the domain.

The main stumbling block of the method described in this thesis is its limited scale. The data set used
for the analyses is fairly small, and as a consequence the results are less powerful than they could
have been. The extraction method is at best semi-automatic and employs far too much manual
intervention. This is a recurring problem for many methods within the field of NLP; Mitkov,
for example, notes that “only a few anaphora resolution systems operate in fully automatic mode”

(Mitkov 2001, p. 111). Most systems rely on manual pre-editing of the input texts, some<br />

methods are only manually simulated. In order for a method to be fully automatic there should<br />

be no human intervention at any stage (Mitkov 2001, p. 114). In the case of the project described<br />

in this thesis, the extraction method is far too manually manipulated to be considered automatic.<br />

The scope of the results is naturally influenced by the limitations of the data set, but regardless<br />

of the size of the data set and the manual intervention employed in the extraction phase, the<br />

method shows promising results. It was clear from the beginning that this would be a pilot study<br />

aiming at providing an indication of the usefulness of the method.<br />

In view of the results, it should be stated that using contextual distribution to derive intuitions of<br />

selectional restrictions in a limited domain is a promising venture. The results obtained in this<br />

project suggest that the distribution of predicates and arguments within a closed domain has a<br />

potential use as a representation of real-world knowledge. More definite conclusions about the<br />

extent to which such a method captures enough relevant intuitions about real-world knowledge<br />

to replace it in an anaphora resolution system can, however, only be drawn in the event of a

larger-scale study.<br />



6 References<br />

Asudeh, Ash and Mary Dalrymple. (2004): Binding Theory. Working paper.<br />

Available at: www.ling.canterbury.ac.nz/personal/asudeh/pdf/asudeh-dalrymple-binding.pdf

Baldwin, Breck. (1997): CogNIAC: high precision coreference with limited knowledge and linguistic<br />

resources. Proceedings of the ACL’97/EACL’97 Workshop on Operational Factors in Practical, Robust<br />

Anaphora Resolution (Madrid), pp. 38-45.<br />

Available at http://acl.eldoc.ub.rug.nl/mirror/W/W97/index.html<br />

Botley, Simon and Tony McEnery. (2000): Discourse anaphora: The need for synthesis. Chapter 1 in<br />

Botley and McEnery (eds): Corpus-based and Computational Approaches to Discourse Anaphora. John<br />

Benjamins Publishing Company, pp. 1-41.<br />

Bresnan, Joan. (2001): Lexical-Functional Syntax. Blackwell.<br />

Carbonell, Jaime G. and Ralf D. Brown. (1988): Anaphora Resolution: A Multi-Strategy Approach.
Proceedings of the 12th International Conference on Computational Linguistics (COLING’88, Budapest),

pp. 96-101.<br />

Available at: http://acl.ldc.upenn.edu/C/C88/C88-1021.pdf<br />

CognIT website (2004): http://www.cognit.no/<br />

Consulted 23/11-2004<br />

Copestake, A., D. Flickinger, I. Sag, C. Pollard. (2003): Minimal Recursion Semantics. An Introduction.<br />

Working paper.<br />

Available at: http://lingo.stanford.edu/sag/papers/copestake.pdf<br />

Cover, T. M. and P. E. Hart. (1967): Nearest neighbor pattern classification. Institute of Electrical and<br />

Electronics Engineers Transactions on Information Theory, pp. 21-27.<br />

Available at: http://yreka.stanford.edu/~cover/papers/transIT/0021cove.pdf<br />

Daelemans, Walter, A. van den Bosch and J. Zavrel. (1999): Forgetting Exceptions is Harmful in<br />

Language Learning. Machine Learning 34, special issue in natural language learning, pp. 11-43.<br />

Available at: http://ilk.kub.nl/pub/papers/harmful.ps<br />

Daelemans, Walter, J. Zavrel, K. van der Sloot, A. van den Bosch. (2003): TiMBL: Tilburg Memory<br />

Based Learner, version 5.0, Reference Guide. ILK Technical Report 03-10.<br />

Available at: http://ilk.uvt.nl/downloads/pub/papers/ilk0310.ps.gz<br />

Dagan, Ido and Alon Itai. (1990): Automatic Processing of Large Corpora for the Resolution of<br />

Anaphora References. Proceedings of the 13th International Conference on Computational Linguistics

(COLING ’90, Helsinki), pp. 330-332.<br />

Available at: http://acl.ldc.upenn.edu/C/C90/C90-3063.pdf<br />

Dagan, Ido, S. Marcus, S. Markovitch. (1995): Contextual word similarity and estimation from sparse<br />

data. In 30th Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio. Ohio<br />

State University, Association for Computational Linguistics, Morristown, New Jersey, pp. 164-171.<br />

Available at: http://citeseer.ist.psu.edu/article/dagan95contextual.html<br />



Firth, J. R. (1957): A synopsis of linguistic theory, 1930-55. In Studies in Linguistic Analysis,<br />

Philological Society, Oxford; reprinted in F. R. Palmer (ed.) (1968): Selected Papers of J. R. Firth 1952-<br />

59. Longman, pp. 168-205.<br />

Grefenstette, Gregory. (1992): SEXTANT: Exploring unexplored contexts for semantic extraction from<br />

syntactic analysis. Proceedings, 30th Annual Meeting of the Association for Computational Linguistics,<br />

pp. 324-326.<br />

Available at http://citeseer.ist.psu.edu/grefenstette92sextant.html<br />

Harris, Zellig S. (1968). Mathematical Structures of Language. New York: Wiley.<br />

Hellan, Lars. (1988): Anaphora in Norwegian and the Theory of Grammar. No 32 in Studies in<br />

Generative Grammar. Foris Publications, the Netherlands.<br />

Hindle, Donald. (1990): Noun classification from predicate-argument structures. In Proceedings of the<br />

28th annual meeting of the Association for Computational Linguistics, pp. 268-275.<br />

Available at http://citeseer.ist.psu.edu/hindle90noun.html<br />

ILK website (2004): http://ilk.kub.nl/<br />

Consulted 12/12-2004<br />

Jurafsky, Daniel and James H. Martin. (2000): Speech and Language Processing. An Introduction to<br />

Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall.<br />

Kamp, Hans and Uwe Reyle. (1993): From Discourse to Logic. Introduction to Modeltheoretic<br />

Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Kluwer Academic<br />

Publishers (the Netherlands).<br />

KunDoc website (2004): http://www.kundoc.net/<br />

Consulted 23/11-2004<br />

Lin, Dekang. (1997): Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity. In<br />

Proceedings of ACL-97 (Madrid), pp. 64-71.<br />

Available at: http://citeseer.ist.psu.edu/article/lin97using.html<br />

Lin, Dekang. (1998): Automatic Retrieval and Clustering of Similar Words. In Proceedings of<br />

COLING-ACL ’98 (Montreal), pp. 768-774.

Available at: http://citeseer.ist.psu.edu/16998.html<br />

Lin, Dekang and Patrick Pantel. (2001): Induction of Semantic Classes from Natural Language Text. In
Proceedings of SIGKDD-01 (San Francisco), pp. 317-322.

Available at: http://citeseer.ist.psu.edu/lin01induction.html<br />

Mani, Inderjeet. (2001): Automatic summarization. John Benjamins.<br />

Matthews, P. H. (1997): The Oxford Concise Dictionary of Linguistics. Oxford University Press.<br />

Miller, G. and C. Leacock (2000): Lexical representations for sentence processing. Chapter 8 in Y.<br />

Ravin and C. Leacock (ed.): Polysemy: Theoretical and computational approaches. Oxford University<br />

Press.<br />

Mitkov, Ruslan. (1999): Anaphora Resolution: The State of the Art. Working paper, University of<br />

Wolverhampton.<br />

Available at: http://citeseer.ist.psu.edu/mitkov99anaphora.html<br />



Mitkov, Ruslan. (2001): Outstanding issues in anaphora resolution. In: Alexander Gelbukh (ed):<br />

Computational Linguistics and Intelligent Text Processing, pp. 110-125<br />

Mitkov, Ruslan. (2003): Anaphora Resolution. Chapter 14 in Mitkov (ed): The Oxford Handbook of<br />

Computational Linguistics. Oxford University Press, pp. 266-283.<br />

Nasukawa, Tetsuya. (1994): Robust method of pronoun resolution using full-text information.<br />

Proceedings of the 15th International Conference on Computational Linguistics (COLING’94, Kyoto),
pp. 1157-1163.

Available at: http://acl.eldoc.ub.rug.nl/mirror/C/C94/index.html<br />

NorGram website (2004): http://www.hf.uib.no/i/LiLi/SLF/Dyvik/norgram/<br />

Consulted 23/11-2004<br />

OBT (2005): Oslo-Bergen-taggeren

Available at: http://decentius.aksis.uib.no/cl/cgp/obt.html<br />

Pantel, Patrick and Dekang Lin (2002): Discovering word senses from text. In Proceedings of ACM<br />

SIGKDD Conference on Knowledge Discovery and Data Mining (Edmonton), pp. 613-619.<br />

Pereira, Fernando, N. Tishby, L. Lee. (1993): Distributional clustering of English words. Proceedings of<br />

the 31st Annual Meeting of the ACL, pp. 183-190.<br />

Available at: http://acl.eldoc.ub.rug.nl/mirror/P/P93/index.html<br />

Robins, R. H. (1997): A Short History of Linguistics. Longman.

Saeed, John I. (1997): Semantics. Blackwell.<br />

Velldal, Erik. (2003): Modelling Word Senses With Fuzzy Clustering. Cand. Philol. Thesis in Language,<br />

Logic and Information. University of Oslo.<br />

Wolff, Karl Erich. (1994): A first course in formal concept analysis. In: Faulbaum, F. (ed): SoftStat’93<br />

Advances in Statistical Software 4, pp. 429-438.<br />



Appendix A: Ekstraktor.pl – algorithm<br />

The algorithm behind Ekstraktor is divided into two separate parts: information retrieval<br />

from the Prolog file and processing of the information that was found and stored.<br />

First a Prolog output file is opened and each line of the file is read. Based on pattern matching,

lines from the file are stored in different arrays according to which pattern they<br />

match.<br />

Subsequent to the information-extraction from the Prolog file, the information stored in<br />

the arrays is processed for the purpose of creating predicate-argument structures. In the<br />

following, I will give a brief outline of the processing steps. I will do this by describing

each of the central functions in Ekstraktor.<br />

The term epmor (eng: ep mother) corresponds to the first EP in the ARG0ep-array, in<br />

most cases meaning the EP “in question”.<br />

finnHoved();<br />

Finds the semantic forms of the main/first predicate-argument structure in the sentence.<br />

This function calls the following (sub)functions:<br />

finnEP1();<br />

Since the entities parsed are full sentences, the main structure is limited to having a verb

as its head. This function searches the array catsuff for a pattern with the first member of<br />

ARG0ep as its EP. If such a pattern is found, the EP is discarded and the first members of<br />

arrays ARG0ep and ARG0verdi are removed.<br />

finnPred();<br />

Finds the semantic value of the sentence’s predicate/ARG0. Goes through the array<br />

semform searching for a pattern with the first member of ARG0ep as EP. If such a pattern<br />

is found, the semantic form is retrieved and stored in the array predikat.<br />

In order to avoid an “empty” semantic form if the argument is a proper noun, it is<br />

checked if the retrieved form matches named. If so, the array navn is searched for a<br />

pattern with the first member of ARG0ep as EP. If such an entry is found, predikat is<br />

emptied and the new semantic form is stored there.<br />

Some predicates have an extra attribute which is stored in the array prt. Each line in this<br />

array is searched for a pattern with the first member of ARG0ep as EP. If such an entry is<br />

found, the semantic form is retrieved and stored in the array ekstra.<br />

lagVerbStruktur();<br />

Creates the correct verbal structure for the predicate. This is for the cases where the<br />

predicate has an additional attribute – as in the predicate “lete etter” (Eng: look for). The<br />



function checks if there are any members in the array ekstra. If so, the main predicate and<br />

this extra attribute are stored in the array hovedpred.<br />

If there is nothing stored in ekstra, the main predicate is simply stored in the array<br />

hovedpred.<br />

finnARG1();<br />

Returns the semantic form of argument 1 and stores it in the array ARG1. First the arrays<br />

ARGxep, ARGxverdi, ep and ARGx are emptied and subsequently set to the corresponding<br />

argument 1 values. Then finnARGx() is called.<br />

finnARGx();<br />

Generalized function that finds the EP where the semantic form of the argument<br />

in question is stored, calls finnARGxsemform() and returns the semantic form.<br />

For ARG1 the actions are as follows:<br />

Goes through each member in ARG1ep. If the first member of ARG0ep is EP, the<br />

entry on the same index in ARG1verdi is stored as ARGx. Goes through each<br />

member in ARG1verdi. If ARGx matches an entry in ARG1verdi, the entry on the<br />

same index in ARG0ep is retrieved and stored in the array ep.<br />

finnARGxsemform() is called.<br />

finnARGxsemform();<br />

Generalized function that finds the semantic form of the argument in<br />

question.<br />

For ARG1 the actions are as follows:<br />

Find semantic form of double predicates, if there are any:<br />

The variable epARGx is set to the first member of the array ep (this array<br />

holds the indexes of EPs where the semantic form of ARGx is stored).<br />

Goes through the array index (holds pointers to semantic forms of double<br />

arguments), if an entry matches epARGx as EP, the index pointer is<br />

retrieved and stored in the array ARGxind. Goes through semform, if an<br />

entry matches epARGx as EP, it is removed from the array.<br />

If there are any entries in ARGxind, each member is looked at. If an entry<br />

matches an entry in ARG0verdi, the entry on the same index in ARG0ep is<br />

added to the array liste. The array semform is gone through, if an entry<br />

matches an entry from liste as EP, the semantic form is retrieved and<br />

stored in the array ARGx.<br />

Else find the semantic form of the argument in question:<br />

Goes through semform; if an entry matches epARGx as EP, the semantic

form is retrieved and stored in the array ARGx. If the element stored in<br />

ARGx matches ‘named’, the proper noun must be found. The array navn is<br />

searched for a pattern with epARGx as EP. The semantic form is retrieved<br />

and stored in the array ARGx.<br />

The contents of the array ARGx are stored in the array ARG1.
The contents of ARG1 are stored as HovedARG1 in finnHoved().



finnARG2();<br />

This function behaves exactly like finnARG1(), only with
correspondingly different variable and array names.
The contents of ARG2 are stored as HovedARG2 in finnHoved().

fjernEP();<br />

Removes elements from the arrays ARG1ep, ARG1verdi, ARG2ep and ARG2verdi if they<br />

belong to the main EP.<br />

Goes through ARG1ep and ARG2ep. If the first member of ARG0ep matches the entry,<br />

the entry and the entry on the same index in the value-array is removed.<br />

The first entry in ARG0ep and ARG0verdi is subsequently removed.<br />

sjekkEkstra();<br />

Checks if there are more predicate-argument structures to be found, calls finnResten() if<br />

there are.<br />

Goes through ARG1ep and ARG2ep trying to match each element in ARG0ep. If there is a<br />

match, there exists a predicate with a belonging argument and finnResten() is called.<br />

finnResten();<br />

Finds the remaining predicate-argument structures.<br />

Calls the following (sub)functions:<br />

finnPred();<br />

finnARG1();<br />

finnARG2();<br />

fjernEP();<br />

sjekkEkstra();<br />

lagStruktur();<br />

Creates the predicate-argument structures as printed to the output file.<br />

If HovedARG1 or HovedARG2 contains more than one element, each element is printed<br />

with predicate and argument 2.<br />

Else, hovedpred,HovedARG1 and HovedARG2 are printed to file, separated by commas.<br />



Appendix B: Ekstraktor.pl – program code<br />

Perl script Ekstraktor.pl<br />

# opens the file given on the command line when the program is run
open(FIL, $ARGV[0]) or die("Kan ikke åpne filen!!\n");

# reads each line of the file and stores it in different arrays depending on its
# content, capturing all information needed to extract the pred-arg structures
while ($linjeFraFil = <FIL>) {

    # stores the index value in @ARG0ep and the arg0 value in @ARG0verdi
    # if the line from the file contains ARG0
    if ($linjeFraFil =~ m/ARG0/){
        henteVerdi();
        push(@ARG0ep, $ep);
        push(@ARG0verdi, $verdi);
    }
    # stores the index value in @ARG1ep and the arg1 value in @ARG1verdi
    # if the line from the file contains ARG1
    if ($linjeFraFil =~ m/ARG1/){
        henteVerdi();
        push(@ARG1ep, $ep);
        push(@ARG1verdi, $verdi);
    }
    # stores the index value in @ARG2ep and the arg2 value in @ARG2verdi
    # if the line from the file contains ARG2
    if ($linjeFraFil =~ m/ARG2/){
        henteVerdi();
        push(@ARG2ep, $ep);
        push(@ARG2verdi, $verdi);
    }
    # stores the index value in @ARG3ep and the arg3 value in @ARG3verdi
    # if the line from the file contains ARG3
    if ($linjeFraFil =~ m/ARG3/){
        henteVerdi();
        push(@ARG3ep, $ep);
        push(@ARG3verdi, $verdi);
    }
    # stores the line in @restriksjoner if it contains 'BODY'
    if ($linjeFraFil =~ m/'BODY'/){
        push(@restriksjoner, $linjeFraFil);
    }
    # stores the line in @restriksjoner if it contains 'RSTR'
    if ($linjeFraFil =~ m/'RSTR'/){
        push(@restriksjoner, $linjeFraFil);
    }
    # stores the line in @semform if it contains 'semform', among other things
    if ($linjeFraFil =~ m/'relation'\),semform\(/){
        push(@semform, $linjeFraFil);
    }
    # stores the line in @cat if it contains '_CAT'
    if ($linjeFraFil =~ m/'_CAT'\)/){
        push(@cat, $linjeFraFil);
    }
    # stores the line in @catsuff if it contains '_CATSUFF'
    if ($linjeFraFil =~ m/'_CATSUFF'\)/){
        push(@catsuff, $linjeFraFil);
    }
    # stores the line in @prt if it contains '_PRT'
    if ($linjeFraFil =~ m/'_PRT'\)/){
        push(@prt, $linjeFraFil);
    }
    # stores the line in @index if it contains 'L-INDEX'
    if ($linjeFraFil =~ m/'L-INDEX'\)/){
        push(@index, $linjeFraFil);
    }
    # stores the line in @index if it contains 'R-INDEX'
    if ($linjeFraFil =~ m/'R-INDEX'\)/){
        push(@index, $linjeFraFil);
    }
    # stores the line in @navn if it contains 'CARG'
    if ($linjeFraFil =~ m/'CARG'\)/){
        push(@navn, $linjeFraFil);
    }
} # end while loop

close(FIL);

# here the processing of the information extracted from the input file begins:

# removes EPs containing information we want to disregard
fjernRestri();

# removes the first EP if it does not have category 'v'
finnCat();

#print("ARG0ep = \n@ARG0ep\nARG0verdi = \n@ARG0verdi\nARG1ep = \n@ARG1ep\nARG1verdi = \n@ARG1verdi\nARG2ep = \n@ARG2ep\nARG2verdi = \n@ARG2verdi\n");

# finds the main structure
finnHoved();

#print("ARG0ep = \n@ARG0ep\nARG0verdi = \n@ARG0verdi\nARG1ep = \n@ARG1ep\nARG1verdi = \n@ARG1verdi\nARG2ep = \n@ARG2ep\nARG2verdi = \n@ARG2verdi\n");
#print("@semform\n");
#print("@navn\n");

# finds predicate-argument structure no. 2
#sjekkEkstra();
#finnResten();

# appends the predicate-argument structures to the end of the given file
open(OUTPUTFIL, ">>strukturer.txt") or die("kan ikke skrive til fil\n");
#open(OUTPUTFIL, ">>home/unni/Hovedoppgave/parse/pas-strukturer.txt") or die("kan ikke skrive til fil\n");

sjekkEkstra();

# creates the main structure
lagStruktur();

close(OUTPUTFIL);

# all the subfunctions follow below:

# henteVerdi():
# extracts the relation index and the value of ARGx from a line of the
# input file and stores them in $ep and $verdi

# the line from the file is split at commas and stored in @utenKomma;
# the values are extracted with substr()
sub henteVerdi {
    @utenKomma = split(/,/, $linjeFraFil);
    push(@args, @utenKomma);
    $ep = substr($utenKomma[1], 12, 2);
    if ($ep =~ /\)/){
        $ep = split(/\)/, $ep);
    }
    $verdi = substr($utenKomma[3], 4, 2);
    if ($verdi =~ /\)/){
        $verdi = split(/\)/, $verdi);
    }
}

# finnHoved:
# finds the main pred-arg structure,
# usually predicate,arg1,arg2
sub finnHoved {
    finnEP1();
    finnPred();
    lagVerbStruktur();
    finnARG1();
    @HovedARG1 = @ARG1;
    finnARG2();
    @HovedARG2 = @ARG2;
    fjernEP();
}

sub sjekkEkstra {<br />

foreach $element (@ARG0ep){<br />

foreach $element2 (@ARG1ep){<br />

if ($element =~ $element2){<br />

#print("match!\n");<br />

finnResten();<br />

}<br />

}<br />

foreach $element3 (@ARG2ep){<br />

if ($element =~ $element3){<br />

#print("match2!\n");<br />

finnResten();<br />

}<br />

}<br />

}<br />

#print("ARG0ep: @ARG0ep\nARG1ep: @ARG1ep\nARG2ep: @ARG2ep\n");<br />

}<br />

sub finnResten {<br />

finnPred();<br />

finnARG1();<br />

finnARG2();<br />

print(OUTPUTFIL "@predikat,@ARG1,@ARG2\n");<br />

print("@predikat,@ARG1,@ARG2\n");<br />

fjernEP();<br />

splice (@predikat);<br />

splice (@ARG1);<br />

splice (@ARG2);<br />

sjekkEkstra();<br />

}<br />



sub fjernEP{<br />

# removes elements from @ARG1ep/verdi and @ARG2ep/verdi if they belong to the main EP

$epmor = $ARG0ep[0];<br />

for ($i = 0; $i < @ARG1ep; $i++){<br />

if ($epmor =~ $ARG1ep[$i]){<br />

splice(@ARG1ep, $i, 1);<br />

splice(@ARG1verdi, $i, 1);<br />

#print("@semform\n");<br />

}<br />

}<br />

for ($i = 0; $i < @ARG2ep; $i++){<br />

if ($epmor =~ $ARG2ep[$i]){<br />

splice(@ARG2ep, $i, 1);<br />

splice(@ARG2verdi, $i, 1);<br />

}<br />

}<br />

shift(@ARG0ep);<br />

shift(@ARG0verdi);<br />

}<br />

sub finnCat {<br />

foreach $linje (@cat){<br />

$epmor = $ARG0ep[0];<br />

if ($linje =~ m/\(attr\(var\($epmor\)/){<br />

@utenDings = split(/\'/, $linje);<br />

push (@args, @utenDings);<br />

$epcat = substr(@utenDings[3],0,1);<br />

#print("$epcat\n");<br />

if ($epcat !~ /v/){<br />

shift(@ARG0ep);<br />

shift(@ARG0verdi);<br />

}<br />

}<br />

}<br />

}<br />

# finnEP1():
# finds the EP that is to be the starting point of the predicate-argument structure
# $epmor is set to the first element of the ARG0 array
# goes through @catsuff; if the line read matches $epmor, the first elements of
# @ARG0ep and @ARG0verdi are removed

sub finnEP1 {<br />

foreach $linje (@catsuff){<br />

$epmor = $ARG0ep[0];<br />

if ($linje =~ m/\(attr\(var\($epmor\)/){<br />

shift(@ARG0ep);<br />

shift(@ARG0verdi);<br />

}<br />

}<br />

}<br />

# finnPred():
# finds the semantic value of ARG0/the predicate of the sentence
# $epmor is set to the first element of @ARG0ep
# if a line matches $epmor and 'semform', it is split at ' and the elements are
# stored in @utenDings; @pred is set to the fourth element of @utenDings
# if a line matches $epmor and '_PRT', it is split at ' and the semantic form is
# added to @ekstra
sub finnPred {
    $epmor = $ARG0ep[0];
    foreach $linje (@semform){
        if ($linje =~ /\(attr\(var\($epmor\),'relation'\),semform/){
            @utenDings = split(/\'/, $linje);
            push(@args, @utenDings);
            @pred = @utenDings[3];
            push(@predikat, @pred);
        }
    }
    if ($predikat[0] =~ /named/){
        foreach $verdi (@navn){
            if ($verdi =~ /\(attr\(var\($epmor\)/){
                splice(@pred);
                splice(@predikat);
                @uten = split(/\'/, $verdi);
                push(@arg, @uten);
                @pred = @uten[3];
                push(@predikat, @pred);
                #print("@predikat\n");
            }
        }
    }
    foreach $linje (@prt){
        if ($linje =~ /\(attr\(var\($epmor\)/){
            @utenDings = split(/\'/, $linje);
            push(@args, @utenDings);
            @ekstr = @utenDings[3];
            push(@ekstra, @ekstr);
        }
    }
}

sub finnARG1 {
    $imax = 0;
    splice(@ARGxep);
    splice(@ARGxverdi);
    splice(@ep);
    splice(@ARGx);
    $imax = @ARG1ep;
    @ARGxep = @ARG1ep;
    @ARGxverdi = @ARG1verdi;
    finnARGx();
    @ARG1 = @ARGx;
}

sub finnARG2 {
    $imax = 0;
    splice(@ARGxep);
    splice(@ARGxverdi);
    splice(@ep);
    splice(@ARGx);
    $imax = @ARG2ep;
    @ARGxep = @ARG2ep;
    @ARGxverdi = @ARG2verdi;
    finnARGx();
    @ARG2 = @ARGx;
}

sub finnARG3 {
    $imax = 0;
    splice(@ARGxep);
    splice(@ARGxverdi);
    splice(@ep);
    splice(@ARGx);
    $imax = @ARG3ep;
    @ARGxep = @ARG3ep;
    @ARGxverdi = @ARG3verdi;
    finnARGx();
    @ARG3 = @ARGx;
}

sub finnARGx {
    $epmor = $ARG0ep[0];
    for ($i = 0; $i < $imax; $i++){
        if ($epmor =~ /$ARGxep[$i]/){
            $ARGx = $ARGxverdi[$i];
            $imax2 = @ARG0verdi;
            for ($ii = 0; $ii < $imax2; $ii++){
                if ($ARGx =~ /$ARG0verdi[$ii]/){
                    push(@ep, $ARG0ep[$ii]);
                }
            } # end for2
        }
    } # end for1
    finnARGxsemform();
}

# fjernRestri():
# sets @ARGxep and @ARGxverdi to the ARG0 values
# runs restrik()

sub fjernRestri {<br />

@ARGxep = @ARG0ep;<br />

@ARGxverdi = @ARG0verdi;<br />

restrik();<br />

@ARG0ep = @ARGxep;<br />

@ARG0verdi = @ARGxverdi;<br />

}<br />

# restrik():
# goes through @restriksjoner and @index and removes values from @ARG0ep and
# @ARG0verdi if these arrays contain information about them
sub restrik {
    $imax = @ARGxep;
    for ($i = 0; $i < $imax; $i++){
        foreach $linje (@restriksjoner){
            if ($linje =~ /\(attr\(var\($ARGxep[$i]\)/){
                splice(@ARGxep, $i, 1);
                splice(@ARGxverdi, $i, 1);
            }
        }
    }
    foreach $linje (@index){
        @utenKomma = split(/,/, $linje);
        push(@args, @utenKomma);
        $ep = substr($utenKomma[1], 12, 2);
        push(@indexep, $ep);
        #print("@indexep\n");
        #print("@semform\n");
    }
    foreach $linje (@indexep){
        for ($i = 0; $i < @semform; $i++){
            if ($semform[$i] =~ /\(attr\(var\($linje\)/){
                splice(@semform, $i, 1);
                #print("@semform\n");
            }
        }
    }
}

# restrimatch():
# removes EPs that do not contain information about the semantic form
# goes through each element of @restriksjoner, and for each element $epARGx is
# set to the first element of @ep
# if the element from @restriksjoner contains $epARGx as its index value, the
# first element is removed from @ep

sub restrimatch {<br />

foreach $linje (@restriksjoner){<br />

$epARGx = $ep[0];<br />

if ($linje =~ m/\(attr\(var\($epARGx\)/){<br />

shift(@ep);<br />

}<br />

}<br />

}<br />

# restrimatch for double argument 2:
# same approach as restrimatch(), but with different variables etc.

sub restrimatch4 {<br />

$imax = @liste;<br />

for ($i = 0; $i < $imax; $i++){<br />

foreach $linje (@restriksjoner){<br />

$epARGx = $liste[$i];<br />

if ($linje =~ m/\(attr\(var\($epARGx\)/){<br />

splice(@liste,$i,1);<br />

}<br />

}<br />

}<br />

}<br />

# MODULARISED VERSION - GENERIC FUNCTION FOR FINDING THE SEMANTIC FORM
# finnARGxsemform():
sub finnARGxsemform {
    $epARGx = $ep[0];
    foreach $linje (@index){
        if ($linje =~ /\(attr\(var\($epARGx\)/){
            @utenKomma = split(/,/, $linje);
            push(@args, @utenKomma);
            $verdi = substr($utenKomma[3],4,2);
            push(@ARGxind, $verdi);
            for ($i = 0; $i < @semform; $i++){
                if ($semform[$i] =~ /\(attr\(var\($epARGx\)/){
                    splice(@semform, $i, 1);
                }
            }
        }
    }
    # finds the EPs where an element of @ARGxind is the value of ARG0 and
    # stores them in the array @liste
    if (@ARGxind != 0){
        foreach $element (@ARGxind){
            $imax = @ARG0verdi;
            for ($i = 0; $i < $imax; $i++){
                if($element =~ /$ARG0verdi[$i]/){
                    push(@liste, $ARG0ep[$i]);
                }
            }
        }
        foreach $linje (@semform){
            foreach $element (@liste){
                if ($linje =~ /\(attr\(var\($element\)/){
                    @utenDings = split(/\'/, $linje);
                    push(@args, @utenDings);
                    $ARG = @utenDings[3];
                    push(@ARGx, $ARG);
                }
            }
        }
    }
    else{
        foreach $linje (@semform){
            if ($linje =~ /\(attr\(var\($epARGx\)/){
                @utenDings = split(/\'/, $linje);
                push(@args, @utenDings);
                @ARGx = @utenDings[3];
            }
        }
        if ($ARGx[0] =~ /named/){
            foreach $verdi (@navn){
                if ($verdi =~ /\(attr\(var\($epARGx\)/){
                    splice(@ARGx);
                    #splice(@predikat);
                    @uten = split(/\'/, $verdi);
                    push(@arg, @uten);
                    @ARGx = @uten[3];
                    #push(@predikat, @pred);
                    #print("@predikat\n");
                }
            }
        }
    }
}
# end finnARGxsemform

# lagStruktur():
# creates the predicate-argument structure that is written to the output file
# if @HovedARG1 or @HovedARG2 contains more than one element, each element is
# printed together with the predicate and the other argument; otherwise
# @hovedpred, @HovedARG1 and @HovedARG2 are written to the file
sub lagStruktur {
    #lagVerbStruktur();
    # builds the correct arg1 structure
    if (@HovedARG1 > 1){
        foreach $element (@HovedARG1){
            print(OUTPUTFIL "\n@hovedpred,$element,@HovedARG2\n");
            print("@hovedpred,$element,@HovedARG2\n");
        }
    }
    #print(OUTPUTFIL "\n@hovedpred,@ARG1sem[0],@ARG2,$ARG3\n");
    if (@HovedARG2 > 1){
        foreach $element (@HovedARG2){
            print(OUTPUTFIL "\n@hovedpred,@HovedARG1,$element\n");
            print("@hovedpred,@HovedARG1,$element\n");
        }
    }
    else {
        print(OUTPUTFIL "\n@hovedpred,@HovedARG1,@HovedARG2\n");
        print("@hovedpred,@HovedARG1,@HovedARG2\n");
    }
    #if (@predikat != 0){
    #    print(OUTPUTFIL "$predikat[0],@ARG1,@ARG2\n");
    #    print("$predikat[0],@ARG1,@ARG2\n");
    #}
}

# builds the correct verb structure, e.g. "lete etter" (Eng: look for)
sub lagVerbStruktur {
    if(@ekstra != 0){
        @hovedpred = ($predikat[0],$ekstra[0]);
        shift(@ekstra);
        shift(@predikat);
    }
    else {
        @hovedpred = $predikat[0];
        shift(@predikat);
    }
}



Appendix C: the EPAS list<br />

23-år-gammel,student,<br />

aktuell,tidsrom,<br />

analysere,Kripos-spesialist,spor<br />

ankomme,etterforsker,<br />

ankomme,etterforsker,<br />

ankomme,etterforsker,åsted<br />

antyde,politi,<br />

avhøre,,person<br />

avhøre,,vedkommende<br />

avhøre,politi,vitne<br />

avklare,obduksjon,<br />

bede om,lensmann,assistanse<br />

bede om,politi,bistand<br />

bede,lensmann,<br />

bede,lensmann,<br />

bede-om,Fonn,bistand<br />

bekrefte,lensmann,<br />

bekrefte,politi,<br />

bekrefte,politi,<br />

bekrefte,politimester,<br />

bistå,etterforsker,lensmann<br />

bistå,etterforsker,politi<br />

bistå,etterforsker,politi<br />

bli,Anne,offer<br />

bo,23-åring,studentkollektiv<br />

bo,Anne,studentkollektiv<br />

bo,beboer,studentkollektiv<br />

bo,Slåtten,studentkollektiv<br />

brutal,drapsmann,<br />

desperat,rop,<br />

død,sykepleiestudent,<br />

drepe,,kvinne<br />

drepe,,pron<br />

drepe,,pron<br />

drepe,,Slåtten<br />

drepe,gjerningsmann,kvinne<br />

drept,sykepleiestudent<br />

ekstra,patrulje,<br />

endelig,rapport,<br />

etterforske,medarbeider,drap<br />

etterlyse,,bilfører<br />

etterlyse,,bilfører<br />

etterlyse,politi,bilfører<br />

etterlyse,politi,person<br />

etterlyse,politi,syklist<br />

etterlyse,politi,syklist<br />

etterlyst,syklist,<br />

etterlyst,syklist,<br />

fastslå,politi,<br />

finkjemme,politi,bygning<br />

finne,,død<br />

finne,,død<br />

finne,,død<br />

finne,,kvinne<br />

finne,,lommebok<br />

finne,,pron<br />

finne,,pron<br />

finne,,sykepleiestudent<br />

finne,forbipasserende,sykepleiestudent<br />

finne,leteaksjon,kvinne<br />

finne,politi,drapsmann<br />

forfølge,,pron<br />

forkaste,politi,teori<br />

fortelle,beboer,politi<br />

fortelle,Fonn,<br />

fortelle,Fonn,



fra-kripos,etterforsker,<br />

fra-kripos,etterforsker,<br />

fra-kripos,etterforsker,<br />

første,praksisdag,<br />

førsteårs,sykepleiestudent,<br />

førsteårs,sykepleiestudent,<br />

få,politi,svar<br />

få,politi,tips<br />

få,pron,rapport<br />

få,pron,telefon<br />

gi,Fonn,opplysning<br />

gi,kamera,indikasjon<br />

gi,lensmann,opplysning<br />

gi,politi,informasjon<br />

gi,politi,opplysning<br />

gi,vitneavhør,indikasjon<br />

gjemme,drapsmann,<br />

gjemme,gjerningsmann,<br />

gjemme,gjerningsmann,<br />

gjennomføre,,rekonstruksjon<br />

gjennomgå,tekniker,studentkollektiv<br />

gjennomsøke,politi,studenthybel<br />

gjøre,politi,avhør<br />

gjøre,politi,rundspørring<br />

gå-gjennom,polititjenestefolk,material<br />

ha,etterforsker,observasjon<br />

ha,politi,medarbeider<br />

ha,politi,teori<br />

ha,pron,observasjon<br />

ha,pron,teori<br />

ha,pron,teori<br />

holde åpen,politi,mulighet<br />

holde,politi,kort<br />

holde,politi,pressekonferanse<br />

holde-åpen,politi,mulighet<br />

høre,pron,rop<br />

høre,pron,rop<br />

høre,vitne,rop<br />

høy,rop,<br />

håpe,politi,<br />

identifisere,politi,pron<br />

igangsette,,leteaksjon<br />

informere,,politi<br />

jobbe-utfra,pron,teori<br />

kartlegge,pron,bevegelse<br />

kartlegge,pron,bevegelse<br />

kjenne,generic-nom,Slåtten<br />

kjenne,politi,dødsårsak<br />

komme-i-kontakt-med,politi,bilfører<br />

komme-i-kontakt-med,politi,generic-nom<br />

komme-i-kontakt-med,pron,bilfører<br />

komme-i-kontakt-med,pron,syklist<br />

komme-inn,tips,<br />

kommentere,pron,<br />

kontakte,etterforsker,vitne<br />

kriminell,handling,<br />

kriminell,handling,<br />

kriminell,handling,<br />

melde-savnet,,student<br />

melde-seg,syklist,<br />

melde-seg,syklist,politi<br />

melde-seg,syklist,politi<br />

mene,etterforsker,<br />

mene,politi,<br />

merke,kjæreste,<br />

mistenkelig,dødsfall,<br />

mulig,teori,<br />

muntlig,rapport<br />

møte-opp-til,pron,praksisdag<br />

ny,tips,<br />

nær,opplysning


103<br />

obdusere,,kvinne<br />

observere,,23-åring<br />

observere,,bile<br />

observere,,person<br />

opplyse,Fonn,<br />

opplyse,vitne,<br />

oppmerksom,kvinne,<br />

overfalle,,Slåtten<br />

plombere,politi,hybelhus<br />

pågå,leteaksjon,<br />

påkledd,Slåtten,<br />

rigge,etterforsker,lyskaster<br />

samle,politi,observasjon<br />

sanke-inn,politi,video<br />

savne,,kvinne<br />

se,pron,Slåtten<br />

se,vitne,kvinne<br />

sentral,vitne,<br />

sette-igang,,leteaksjon<br />

sette-inn,politi,patrulje<br />

si,Fonn,<br />

si,Fonn,<br />

skade,,kvinne<br />

skje-med,generic-nom,kvinne<br />

slutte-seg-til,pron,Førde-politi<br />

sperre av,,hybelhus<br />

sperre av,politi,åsted<br />

spesiell,teori<br />

spesiell,teori,<br />

stenge av,politi,studentkollektiv<br />

stor,leteaksjon,<br />

systematisere,,tips<br />

søke-med,politi,hund<br />

ta,pron,utgangspunkt<br />

ta-høyde-for,lensmann,eventualitet<br />

ta-kontakt-med,politi,vitne<br />

ta-kontakt-med,syklist,politi<br />

taktisk,etterforsker,<br />

teknisk,etterforsker,<br />

teknisk,etterforsker,<br />

teknisk,spor,<br />

teknisk,spor,<br />

tidlig,teori,<br />

tilfeldig,forbipasserende,<br />

tilfeldig,offer,<br />

trenge,,vitne<br />

tro,lensmann,<br />

tro,politi,<br />

tro,pron,<br />

ukjent,gjerningsmann,<br />

ukjent,person,<br />

undersøke,,minibankaktiviteter<br />

undersøke,,mobiltelefontrafikk<br />

undersøke,,område<br />

undersøke,,overvåkningsfilmer<br />

undersøke,etterforsker,åsted<br />

undersøke,politi,aktivitet<br />

understreke,Fonn,<br />

understreke,pron,<br />

understreke,pron,<br />

understreke,pron,<br />

understreke,pron,generic-nom<br />

varsle,pron,Kripos<br />

velge,drapsmann,sykepleiestudent<br />

velge,drapsmann,sykepleiestudent<br />

ville,pron,kartlegge<br />

vise,funn,<br />

vise,funn,<br />

vise,undersøkelse,<br />

vite,politi,<br />

være,Anne,offer


være,bilfører,vitne<br />

være,bilfører,vitne<br />

være,etterforskning,bred<br />

være,kvinne,død<br />

være,kvinne,død<br />

være,kvinne,skadet<br />

være,kvinne,Slåtten<br />

være,lommebok,funn<br />

være,pron,funn<br />

være,pron,omkommet<br />

være,rapport,klar<br />

være,Slåtten,sykepleiestudent<br />

være,syklist,vitne<br />

være,vitne,kvinne<br />

ønske,politi,<br />

ønske,politi,<br />

åpen,mulighet,<br />

åpen,mulighet,<br />

104
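The entries above are plain comma-separated triples (predicate, ARG1, ARG2), with an empty field where an argument was not realised in the sentence. As a minimal illustration (not part of the thesis code), the following sketch reads such a list from pred_argliste.txt, the file name used by classify.pl in Appendix E, and prints the fields:

#!/bin/perl
# Illustrative only: reads an EPAS list and prints its fields.
my $infile = "pred_argliste.txt";
open(FILE, $infile) or die "Cannot open $infile: $!";
while (my $line = <FILE>) {
    chomp($line);
    my ($pred, $arg1, $arg2) = split(/,/, $line);
    # An empty field means the argument was not realised in the sentence.
    printf("pred=%s  ARG1=%s  ARG2=%s\n", $pred, $arg1 || '-', $arg2 || '-');
}
close(FILE);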


Appendix D: Text aligned with EPAS

Each SENTENCE is followed by the EPAS extracted from it; the extraction METHOD (automatic, edited or manual) is given in parentheses.

Kvinne funnet død i Førde.
    finne,,død  (automatic)
    være,kvinne,død  (automatic)

Den savnede kvinnen i Førde er nå funnet død.
    finne,,død  (automatic)
    savne,,kvinne  (automatic)
    være,kvinne,død  (automatic)

Politiet har gitt media opplysninger om funnet.
    gi,politi,opplysning  (automatic)

Lensmannen bekrefter at kvinnen er funnet død.
    finne,,død  (automatic)
    bekrefte,lensmann,  (automatic)

Politiet har bedt Kripos om bistand i søket etter kvinnen.
    bede om,politi,bistand  (automatic)

23-åringen var førsteårs sykepleiestudent i Førde.
    førsteårs,sykepleiestudent,  (edited)

Hun møtte ikke opp til sin første praksisdag ved Førde aldershjem.
    første,praksisdag,  (manual)
    møte_opp_til,pron,praksisdag  (edited)

Politiet ble informert.
    informere,,politi  (automatic)

En leteaksjon ble satt igang.
    sette_igang,,leteaksjon  (edited)

Leteaksjonen pågikk til kvinnen ble funnet.
    pågå,leteaksjon,  (automatic)
    finne,,kvinne  (manual)
    finne,leteaksjon,kvinne  (manual)

Politiet holder alle muligheter åpne i saken.
    holde åpen,politi,mulighet  (edited)
    åpen,mulighet,  (automatic)

Etterforskerne vil ankomme i morgen.
    ankomme,etterforsker,  (automatic)

Et vitne hørte desperate rop om hjelp.
    desperat,rop,  (automatic)
    høre,vitne,rop  (automatic)

Lensmannen har bedt om assistanse fra Kripos.
    bede om,lensmann,assistanse  (automatic)

Etterforskere fra Kripos skal bistå lensmannen i etterforskningen.
    bistå,etterforsker,lensmann  (automatic)
    fra_kripos,etterforsker,  (edited)

Etterforskerne forventes å ankomme i løpet av dagen.
    ankomme,etterforsker,  (manual)

Den 23 år gamle studenten ble meldt savnet tidlig søndag morgen.
    23-år_gammel,student,  (edited)
    melde_savnet,,student  (edited)

Anne Slåtten bodde i et studentkollektiv i Førde.
    bo,Anne,studentkollektiv  (edited)
    bo,Slåtten,studentkollektiv  (manual)

Slåtten var førsteårs sykepleiestudent i Førde.
    førsteårs,sykepleiestudent,  (edited)
    være,Slåtten,sykepleiestudent  (manual)

Hun ble funnet omkommet i et skogholt.
    finne,,pron  (automatic)
    være,pron,omkommet  (manual)
    være,pron,funn  (automatic)

Et vitne opplyste at hun hadde hørt høye rop.
    høy,rop,  (automatic)
    opplyse,vitne,  (automatic)
    høre,pron,rop  (automatic)

Mandag holdt politiet en pressekonferanse.
    holde,politi,pressekonferanse  (automatic)

Lensmannen vil ikke gi nærmere opplysninger om åstedet.
    gi,lensmann,opplysning  (automatic)
    nær,opplysning  (automatic)

Beboerne i studentkollektivet har fortalt politiet at de så Slåtten lørdag kveld.
    fortelle,beboer,politi  (edited)
    se,pron,Slåtten  (automatic)
    bo,beboer,studentkollektiv  (manual)

Politiet har sperret av åstedet.
    sperre av,politi,åsted  (edited)

Flere personer er avhørt i saken.
    avhøre,,person  (automatic)

Politiet holder alle muligheter åpne.
    holde_åpen,politi,mulighet  (edited)
    åpen,mulighet,  (automatic)

Kvinnen blir trolig obdusert i løpet av tirsdag.
    obdusere,,kvinne  (automatic)

Politiet håper obduksjonen vil avklare hva som skjedde med kvinnen.
    håpe,politi,  (manual)
    avklare,obduksjon,  (manual)
    skje_med,generic-nom,kvinne  (edited)

Mandag kveld ankom etterforskere fra Kripos åstedet.
    ankomme,etterforsker,åsted  (edited)
    fra_kripos,etterforsker,  (edited)

Sent mandag kveld rigget etterforskerne opp lyskastere.
    rigge,etterforsker,lyskaster  (automatic)

Fonn vil ikke gi flere opplysninger om åstedet.
    gi,Fonn,opplysning  (automatic)

Han vil ikke kommentere om kvinnen var skadet.
    kommentere,pron,  (automatic)
    skade,,kvinne  (automatic)
    være,kvinne,skadet  (manual)

Politiet holder kortene svært tett til brystet.
    holde,politi,kort  (automatic)

Det er ikke kommet inn mange tips i saken.
    komme_inn,tips,  (manual)

Tipsene skal nå systematiseres.
    systematisere,,tips  (automatic)

Fonn forteller at politiet vil ta kontakt med vitner.
    fortelle,Fonn,  (manual)
    ta_kontakt_med,politi,vitne  (manual)

Politiet har flere mulige teorier.
    mulig,teori,  (automatic)
    ha,politi,teori  (automatic)

Det mest sentrale vitnet i saken er en kvinne.
    sentral,vitne,  (automatic)
    være,vitne,kvinne  (automatic)

Hun skal ha hørt rop fra en kvinne.
    høre,pron,rop  (automatic)

Politiet har stengt av studentkollektivet der 23-åringen bodde.
    stenge av,politi,studentkollektiv  (automatic)
    bo,23-åring,studentkollektiv  (edited)

Studentkollektivet vil bli gjennomgått av teknikere.
    gjennomgå,tekniker,studentkollektiv  (automatic)

Fonn har bedt om teknisk bistand.
    bede-om,Fonn,bistand  (automatic)

Politiet bekrefter at Slåtten ble drept.
    bekrefte,politi,  (automatic)
    drepe,,Slåtten  (automatic)

Undersøkelsene på stedet viser at hun ble drept.
    vise,undersøkelse,  (automatic)
    drepe,,pron  (automatic)

Politiet tror at Slåtten ble overfalt.
    tro,politi,  (automatic)
    overfalle,,Slåtten  (automatic)

De tror at kvinnen ble drept av en ukjent gjerningsmann.
    tro,pron,  (automatic)
    ukjent,gjerningsmann,  (automatic)
    drepe,gjerningsmann,kvinne  (automatic)

Politiet fastslår at kvinnens lommebok ikke er funnet.
    fastslå,politi,  (automatic)
    finne,,lommebok  (manual)
    være,lommebok,funn  (automatic)

Fonn opplyser at området ikke er undersøkt.
    opplyse,Fonn,  (automatic)
    undersøke,,område  (automatic)

Politiet har forkastet en tidligere teori.
    forkaste,politi,teori  (automatic)
    tidlig,teori,  (automatic)

Politiet får senere i dag svar på dødsårsaken.
    få,politi,svar  (automatic)

Politiet har ikke gjennomsøkt Slåttens studenthybel.
    gjennomsøke,politi,studenthybel  (automatic)

Hele hybelhuset ble sperret av.
    sperre av,,hybelhus  (edited)

Politiet har plombert hybelhuset.
    plombere,politi,hybelhus  (automatic)

Politiet skal finkjemme bygningen for tekniske spor.
    finkjemme,politi,bygning  (automatic)
    teknisk,spor,  (automatic)

De tekniske etterforskerne har undersøkt åstedet.
    teknisk,etterforsker,  (automatic)
    undersøke,etterforsker,åsted  (automatic)

To tekniske etterforskere bistår politiet i Førde.
    teknisk,etterforsker,  (automatic)
    bistå,etterforsker,politi  (automatic)

En taktisk etterforsker fra Kripos bistår politiet.
    bistå,etterforsker,politi  (automatic)
    taktisk,etterforsker,  (automatic)
    fra_kripos,etterforsker,  (edited)

Lensmannen tar høyde for alle eventualiteter.
    ta_høyde_for,lensmann,eventualitet  (manual)

Vi varslet Kripos.
    varsle,pron,Kripos  (automatic)

Den døde sykepleierstudenten ble funnet av en tilfeldig forbipasserende.
    død,sykepleiestudent,  (automatic)
    finne,forbipasserende,sykepleiestudent  (automatic)
    tilfeldig,forbipasserende,  (edited)

23-åringen ble sist observert lørdag kveld.
    observere,,23-åring  (automatic)

Politiet vet at hun fikk en telefon fra kjæresten sin.
    vite,politi,  (automatic)
    få,pron,telefon  (automatic)

Kjæresten merket ikke at noe var galt.
    merke,kjæreste,  (automatic)

Vedkommende er avhørt.
    avhøre,,vedkommende  (automatic)

En større leteaksjon ble igangsatt.
    igangsette,,leteaksjon  (automatic)
    stor,leteaksjon,  (edited)

Politiet etterlyser en syklist.
    etterlyse,politi,syklist  (automatic)

Den etterlyste syklisten har tatt kontakt med politiet.
    etterlyst,syklist,  (manual)
    ta_kontakt_med,syklist,politi  (manual)

Fortsatt etterlyses to bilførere.
    etterlyse,,bilfører  (automatic)

Politiet etterlyste i dag to bilførere.
    etterlyse,politi,bilfører  (automatic)

To biler er observert på veien.
    observere,,bile  (automatic)

Politiet ønsker å komme i kontakt med bilførerne.
    ønske,politi,  (automatic)
    komme_i_kontakt_med,politi,bilfører  (manual)

Fonn understreker at bilførerne er vitner.
    understreke,Fonn,  (automatic)
    være,bilfører,vitne  (edited)

Fonn sier at han understreker dette.
    understreke,pron,generic-nom  (edited)
    si,Fonn,  (automatic)

Slåtten var påkledd da hun ble funnet drept.
    påkledd,Slåtten,  (automatic)
    finne,,pron  (automatic)
    drepe,,pron  (automatic)

Vi vil nå kartlegge alle bevegelser på åstedet.
    kartlegge,pron,bevegelse  (automatic)
    ville,pron,kartlegge  (manual)

Vi har ingen spesiell teori som vi tar utgangspunkt i.
    ha,pron,teori  (automatic)
    spesiell,teori,  (automatic)
    ta,pron,utgangspunkt  (automatic)

Funnene på åstedet viser at det er en kriminell handling.
    vise,funn,  (automatic)
    kriminell,handling,  (automatic)

Det er ikke et mistenkelig dødsfall, men en kriminell handling.
    mistenkelig,dødsfall,  (automatic)
    kriminell,handling,  (automatic)

Trenger flere vitner.
    trenge,,vitne  (automatic)

Politiet ønsker å komme i kontakt med alle som kjente Slåtten.
    komme_i_kontakt_med,politi,generic-nom  (edited)
    ønske,politi,  (automatic)
    kjenne,generic-nom,Slåtten  (automatic)

Etterforskerne fra Kripos vil kontakte vitner.
    kontakte,etterforsker,vitne  (automatic)

Politiet kjenner dødsårsaken.
    kjenne,politi,dødsårsak  (manual)

Politimesteren bekrefter at de har fått en muntlig rapport.
    bekrefte,politimester,  (manual)
    få,pron,rapport  (manual)
    muntlig,rapport  (manual)

Han understreker at politiet ikke vil gi informasjon om dødsårsaken.
    understreke,pron,  (manual)
    gi,politi,informasjon  (manual)

Politiet har ikke bekreftet hvor kvinnen ble drept.
    bekrefte,politi,  (manual)
    drepe,,kvinne  (manual)

Politiet har nå 32 medarbeidere som etterforsker drapet.
    ha,politi,medarbeider  (manual)
    etterforske,medarbeider,drap  (manual)

Syklisten meldte seg.
    melde_seg,syklist,  (manual)

Den etterlyste syklisten har nå meldt seg til politiet i Førde.
    melde_seg,syklist,politi  (manual)
    etterlyst,syklist,  (manual)

Fortsatt etterlyses to bilførere.
    etterlyse,,bilfører  (manual)

Politiet etterlyste i dag tidlig en syklist.
    etterlyse,politi,syklist  (manual)

I formiddag meldte syklisten seg til politiet.
    melde_seg,syklist,politi  (manual)

Jeg vil understreke at vi ønsker å komme i kontakt med både syklisten og bilførerne som vitner, sier Fonn.
    understreke,pron,  (manual)
    komme_i_kontakt_med,pron,bilfører  (manual)
    komme_i_kontakt_med,pron,syklist  (manual)
    si,Fonn,  (manual)
    være,bilfører,vitne  (manual)
    være,syklist,vitne  (manual)

Vi vil nå kartlegge alle bevegelser på funnstedet og i boligen.
    kartlegge,pron,bevegelse  (manual)

Vi har ingen spesiell teori som vi jobber utifra nå.
    ha,pron,teori  (manual)
    spesiell,teori  (manual)
    jobbe_utfra,pron,teori  (manual)

Men funnene på åstedet viser at det er en kriminell handling, forteller Fonn.
    vise,funn,  (manual)
    kriminell,handling,  (manual)
    fortelle,Fonn,  (manual)

I tillegg vil politiet gjøre en rundspørring rundt åstedet i løpet av dagen.
    gjøre,politi,rundspørring  (manual)

Den endelige rapporten vil være klar på torsdag.
    endelig,rapport,  (manual)
    være,rapport,klar  (manual)

To Kripos-spesialister skal analysere alle tekniske spor i Førde.
    analysere,Kripos-spesialist,spor  (manual)
    teknisk,spor,  (manual)

De to sluttet seg til Førde-politiet i går.
    slutte_seg_til,pron,Førde-politi  (manual)

Alt av mobiltelefontrafikk, overvåkingsfilmer og minibankaktiviteter rundt drapstidspunktet skal undersøkes.
    undersøke,,minibankaktiviteter  (manual)
    undersøke,,mobiltelefontrafikk  (manual)
    undersøke,,overvåkningsfilmer  (manual)

Slik kan politiet undersøke aktiviteten i området sykepleiestudenten ble funnet drept.
    undersøke,politi,aktivitet  (manual)
    finne,,sykepleiestudent  (manual)
    drept,sykepleiestudent  (manual)

Lensmann Kjell Fonn ber alle som var i sentrum om å melde seg.
    bede,lensmann,  (automatic)

Han understreker at etterforskningen er svært bred.
    understreke,pron,  (manual)
    være,etterforskning,bred  (manual)

Politiet har sanket inn videoer fra alle overvåkningskameraer i Førde.
    sanke_inn,politi,video  (manual)

Polititjenestefolk går gjennom materialet.
    gå_gjennom,polititjenestefolk,material  (manual)

Kameraene vil gi en indikasjon på aktiviteten i Førde i det aktuelle tidsrommet.
    gi,kamera,indikasjon  (manual)
    aktuell,tidsrom,  (manual)

Gjerningsmannen gjemte seg i busker på åstedet.
    gjemme,gjerningsmann,  (automatic)

Det er sannsynlig at gjerningsmannen gjemte seg i busker ved åstedet.
    gjemme,gjerningsmann,  (manual)

Tror Anne ble et tilfeldig offer.
    tilfeldig,offer,  (manual)
    bli,Anne,offer  (manual)
    være,Anne,offer  (manual)

Politiet avhører flere vitner.
    avhøre,politi,vitne  (automatic)

Politiet har søkt med hunder på åstedet.
    søke_med,politi,hund  (edited)

Politiet har samlet mange observasjoner.
    samle,politi,observasjon  (automatic)

Politiet antyder at drapsmannen har valgt sykepleiestudenten tilfeldig.
    antyde,politi,  (automatic)
    velge,drapsmann,sykepleiestudent  (automatic)

Vitneavhør gir indikasjoner på at den brutale drapsmannen har valgt sykepleierstudenten tilfeldig.
    gi,vitneavhør,indikasjon  (manual)
    brutal,drapsmann,  (manual)
    velge,drapsmann,sykepleiestudent  (manual)

Etterforskerne har flere observasjoner.
    ha,etterforsker,observasjon  (manual)

Vitner så en kvinne som gikk alene.
    se,vitne,kvinne  (manual)

Politiet mener kvinnen er Anne Slåtten.
    mene,politi,  (automatic)
    være,kvinne,Slåtten  (automatic)

Etterforskerne mener at hun ikke ble forfulgt.
    mene,etterforsker,  (automatic)
    forfølge,,pron  (automatic)

Drapsmannen kan ha gjemt seg i busker ved åstedet.
    gjemme,drapsmann,  (manual)

Lensmannen ber unge kvinner være oppmerksomme.
    oppmerksom,kvinne,  (automatic)
    bede,lensmann,  (automatic)

Politiet setter ikke inn ekstra patruljer i Førde.
    sette_inn,politi,patrulje  (edited)
    ekstra,patrulje,  (automatic)

Politiet har fått flere nye tips.
    få,politi,tips  (automatic)
    ny,tips,  (automatic)

En rekonstruksjon ble gjennomført på tirsdag.
    gjennomføre,,rekonstruksjon  (automatic)

Lensmannen tror at politiet finner drapsmannen.
    tro,lensmann,  (automatic)
    finne,politi,drapsmann  (automatic)

Politiet etterlyser fem personer.
    etterlyse,politi,person  (automatic)

Personene er observert i Førde.
    observere,,person  (automatic)

Politiet har ikke identifisert dem.
    identifisere,politi,pron  (automatic)

Politiet har gjort 1275 avhør.
    gjøre,politi,avhør  (automatic)

De har fem observasjoner av ukjente personer.
    ha,pron,observasjon  (automatic)
    ukjent,person,  (automatic)
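The METHOD column can be summarised mechanically. The sketch below is not part of the thesis code; it assumes the per-line layout used above, where each EPAS line ends in "(automatic)", "(edited)" or "(manual)", and counts how many structures each method contributed:

#!/bin/perl
# Illustrative only: tallies the METHOD labels in an aligned EPAS file.
my %count;
while (my $line = <>) {
    # Count a line if it ends in one of the three method labels.
    $count{$1}++ if $line =~ /\((automatic|edited|manual)\)\s*$/;
}
foreach my $method (sort keys %count) {
    print "$method: $count{$method}\n";
}

Invoked as, for example, "perl tally_methods.pl appendix_d.txt" (both file names are hypothetical), it prints one total per method.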


Appendix E: classify.pl – program code

#!/bin/perl

### Configuration and initialisation ###

# Predicate/argument file.
my $infile = "pred_argliste.txt";

# Initialise empty data structures.
my %pred_args1 = ();
my %pred_args2 = ();
my %arg1_preds = ();
my %arg2_preds = ();

# Read the input data.
open(FILE, $infile);
my @lines = <FILE>;
close(FILE);

### The data structure is built here ###

foreach $line (@lines) {
  # Remove the line break from each line.
  chomp($line);
  # Extract each individual word on the line.
  my ($pred, $arg1, $arg2) = split(/,/, $line);
  # Register arg1 and arg2 on the predicate (increment the counter).
  $pred_args1{$pred}{$arg1} += 1;
  $pred_args2{$pred}{$arg2} += 1;
  # Register the predicates that arg1 and arg2 occur with.
  $arg1_preds{$arg1}{$pred} += 1;
  $arg2_preds{$arg2}{$pred} += 1;
}

### The actual program logic ###

# Get the predicates to display from the command line (display all if none are given).
my @preds = scalar @ARGV ? @ARGV : sort keys %pred_args1;

# Loop that controls the flow of the program.
foreach $pred (@preds) {
  foreach $arg (1, 2) {
    ### LEVEL 0 ###
    my @args_lvl0 = parse_level0($arg, $pred);
    foreach $arg_lvl0 (@args_lvl0) {
      ### LEVEL 1 ###
      my @args_lvl1 = parse_level1or2(1, $arg, $arg_lvl0, $pred);
      foreach $arg_lvl1 (@args_lvl1) {
        ### LEVEL 2 ###
        parse_level1or2(2, $arg, $arg_lvl1, $pred, $arg_lvl0);
      }
    }
    print "\n";
  }
  print "\n";
}

# Subroutine that takes an argument number (1 or 2, corresponding to ARG1 or ARG2) and a predicate.
# Displays the predicate's arguments (LEVEL 0) and returns them for use at the next level.
sub parse_level0 {
  # Get the subroutine's parameters.
  my $argnum = shift;
  my $pred = shift;
  # Get all of the predicate's arguments.
  my %args = $argnum == 1 ? %{$pred_args1{$pred}} : %{$pred_args2{$pred}};
  # Get the arguments themselves in sorted order (by number of occurrences, then alphabetically).
  my @args = sort { $args{$b} <=> $args{$a} } sort { lc($a) cmp lc($b) } keys %args;
  # Display the predicate's argument list.
  print "NIVÅ0, ARG$argnum ($pred): ";
  print join(', ', map { "$_ x $args{$_}" } @args), "\n";
  return @args;
}

# Subroutine that takes an argument number (1 or 2, corresponding to ARG1 or ARG2) and an
# argument (plus extra control parameters).
# Finds other predicates that the argument has been used with.
# Displays the arguments of all predicates found, together with the number of times these
# arguments were used.
#
# In the rest of the subroutine these arguments are called "referenced arguments (at the next
# level)", because they are referenced from an argument indirectly via a predicate (e.g. a
# LEVEL 0 argument "references" a number of LEVEL 1 arguments, which in turn reference a number
# of LEVEL 2 arguments).
sub parse_level1or2 {
  # Get the subroutine's parameters.
  my $level = shift;
  my $argnum = shift;
  my $arg = shift;
  my $pred_lvl0 = shift;
  my $arg_lastlevel = $level == 1 ? '' : shift;
  # Ignore '?' arguments.
  return if $arg eq '?';
  # Data structure that counts the referenced arguments at the next level.
  my %argrefs = ();
  # Get all the predicates that contain the argument.
  my @preds = $argnum == 1 ? keys %{$arg1_preds{$arg}} : keys %{$arg2_preds{$arg}};
  # ...and loop through these predicates.
  foreach $pred (@preds) {
    # Skip the predicate currently being processed.
    next if $pred eq $pred_lvl0;
    # Get all of the predicate's arguments.
    my %args = $argnum == 1 ? %{$pred_args1{$pred}} : %{$pred_args2{$pred}};
    # ...and loop through and count them.
    foreach $argref (keys %args) {
      # Skip the argument currently being processed.
      next if $argref eq $arg || $argref eq $arg_lastlevel;
      # Increment the counter (number of referenced arguments at the next level).
      $argrefs{$argref} += $args{$argref};
    }
  }
  # Get the referenced arguments at the next level.
  my @argrefs = sort { $argrefs{$b} <=> $argrefs{$a} } sort { lc($a) cmp lc($b) } keys %argrefs;
  # Display the referenced arguments at the next level.
  print " " x (3*$level), "NIVÅ$level, ARG$argnum ($arg): ";
  print scalar @argrefs ? join(', ', map { "$_ x $argrefs{$_}" } @argrefs) : '(Ingen referanser)', "\n";
  return @argrefs;
}
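classify.pl reads pred_argliste.txt and takes an optional list of predicates on the command line; run with no arguments, it walks through every predicate in the file in sorted order. From the print statements above, a call for a single predicate produces output of roughly the following shape (the arguments and counts are placeholders, not actual results):

$ perl classify.pl finne
NIVÅ0, ARG1 (finne): <arg> x <count>, <arg> x <count>, ...
   NIVÅ1, ARG1 (<arg>): <arg'> x <count>, ...
      NIVÅ2, ARG1 (<arg'>): <arg''> x <count>, ...

NIVÅ0, ARG2 (finne): <arg> x <count>, ...
   NIVÅ1, ARG2 (<arg>): ...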


Appendix F: POS-based structures

Each SENTENCE is followed by the POS-based structure(s) extracted from it.

Kvinne funnet død i Førde.
    finne,,kvinne

Den savnede kvinnen i Førde er nå funnet død.
    finne,kvinne,død

Politiet har gitt media opplysninger om funnet.
    gi,politi,opplysning

Lensmannen bekrefter at kvinnen er funnet død.
    bekrefte,lensmann,at
    finne,kvinne,død

Politiet har bedt Kripos om bistand i søket etter kvinnen.
    be,politi,kripos

23-åringen var førsteårs sykepleiestudent i Førde.
    være,23-åring,

Hun møtte ikke opp til sin første praksisdag ved Førde aldershjem.
    møte,hun,

Politiet ble informert.
    informere,politi,

En leteaksjon ble satt igang.
    sette,leteaksjon,

Leteaksjonen pågikk til kvinnen ble funnet.
    pågå,leteaksjon,
    finne,kvinne,

Politiet holder alle muligheter åpne i saken.
    holde,politi,mulighet

Etterforskerne vil ankomme i morgen.
    ankomme,etterforsker,

Et vitne hørte desperate rop om hjelp.
    høre,vitne,rop

Lensmannen har bedt om assistanse fra Kripos.
    be,lensmann,

Etterforskere fra Kripos skal bistå lensmannen i etterforskningen.
    bistå,etterforsker,lensmann

Etterforskerne forventes å ankomme i løpet av dagen.
    ankomme,etterforsker,

Den 23 år gamle studenten ble meldt savnet tidlig søndag morgen.
    melde savne,student,

Anne Slåtten bodde i et studentkollektiv i Førde.
    bo,Slåtten,

Slåtten var førsteårs sykepleiestudent i Førde.
    være,Slåtten,sykepleiestudent

Hun ble funnet omkommet i et skogholt.
    finne,hun,

Et vitne opplyste at hun hadde hørt høye rop.
    opplyse,vitne,
    høre,hun,rop

Mandag holdt politiet en pressekonferanse.
    holde,politi,pressekonferanse

Lensmannen vil ikke gi nærmere opplysninger om åstedet.
    gi,lensmann,opplysning

Beboerne i studentkollektivet har fortalt politiet at de så Slåtten lørdag kveld.
    fortelle,beboer,politi
    se,de,

Politiet har sperret av åstedet.
    sperre,politi,

Flere personer er avhørt i saken.
    avhøre,person,

Politiet holder alle muligheter åpne.
    holde,politi,mulighet

Kvinnen blir trolig obdusert i løpet av tirsdag.
    obdusere,kvinne,

Politiet håper obduksjonen vil avklare hva som skjedde med kvinnen.
    håpe,politi,
    avklare,obduksjon,hva

Mandag kveld ankom etterforskere fra Kripos åstedet.
    ankomme,etterforsker,åsted

Sent mandag kveld rigget etterforskerne opp lyskastere.
    rigge,etterforsker,

Fonn vil ikke gi flere opplysninger om åstedet.
    gi,Fonn,opplysning

Han vil ikke kommentere om kvinnen var skadet.
    kommentere,han,
    skade,kvinne,

Politiet holder kortene svært tett til brystet.
    holde,politi,kort

Det er ikke kommet inn mange tips i saken.
    komme,det,

Tipsene skal nå systematiseres.
    systematisere,tips,

Fonn forteller at politiet vil ta kontakt med vitner.
    fortelle,Fonn,
    ta,politi,kontakt

Politiet har flere mulige teorier.
    ha,politi,teori

Det mest sentrale vitnet i saken er en kvinne.
    være,vitne,kvinne

Hun skal ha hørt rop fra en kvinne.
    høre,hun,rop

Politiet har stengt av studentkollektivet der 23-åringen bodde.
    stenge,politi,
    bo,23-åring,

Studentkollektivet vil bli gjennomgått av teknikere.
    gjennomgå,studentkollektiv,

Fonn har bedt om teknisk bistand.
    be,Fonn,

Politiet bekrefter at Slåtten ble drept.
    bekrefte,politi,at
    drepe,Slåtten,

Undersøkelsene på stedet viser at hun ble drept.
    vise,undersøkelse,at
    drepe,hun,

Politiet tror at Slåtten ble overfalt.
    tro,politi,at
    overfalle,Slåtten,

De tror at kvinnen ble drept av en ukjent gjerningsmann.
    tro,de,at
    drepe,kvinne,

Politiet fastslår at kvinnens lommebok ikke er funnet.
    fastslå,politi,at
    finne,lommebok,

Fonn opplyser at området ikke er undersøkt.
    opplyse,Fonn,at
    undersøke,område,

Politiet har forkastet en tidligere teori.
    forkaste,politi,teori

Politiet får senere i dag svar på dødsårsaken.
    få,politi,svar

Politiet har ikke gjennomsøkt Slåttens studenthybel.
    gjennomsøke,politi,studenthybel

Hele hybelhuset ble sperret av.
    sperre,hybelhus,

Politiet har plombert hybelhuset.
    plombere,politi,hybelhus

Politiet skal finkjemme bygningen for tekniske spor.
    finkjemme,politi,bygning

De tekniske etterforskerne har undersøkt åstedet.
    undersøke,etterforsker,åsted

To tekniske etterforskere bistår politiet i Førde.
    bistå,etterforsker,politi

En taktisk etterforsker fra Kripos bistår politiet.
    bistå,etterforsker,politi

Lensmannen tar høyde for alle eventualiteter.
    ta,lensmann,høyde

Vi varslet Kripos.
    varsle,vi,kripos

Den døde sykepleierstudenten ble funnet av en tilfeldig forbipasserende.
    finne,sykepleierstudent,forbipasserende

23-åringen ble sist observert lørdag kveld.
    observere,23-åring,

Politiet vet at hun fikk en telefon fra kjæresten sin.
    vite,politi,
    få,hun,telefon

Kjæresten merket ikke at noe var galt.
    merke,kjæreste,at

Vedkommende er avhørt.

En større leteaksjon ble igangsatt.
    igangsette,leteaksjon,

Politiet etterlyser en syklist.
    etterlyse,politi,syklist

Den etterlyste syklisten har tatt kontakt med politiet.
    ta,syklist,kontakt

Fortsatt etterlyses to bilførere.
    etterlyse,,bilfører

Politiet etterlyste i dag to bilførere.
    etterlyse,politi,bilfører

To biler er observert på veien.
    observere,bil,

Politiet ønsker å komme i kontakt med bilførerne.
    ønske,politi,å

Fonn understreker at bilførerne er vitner.
    understreke,Fonn,at
    være,bilfører,vitne

Fonn sier at han understreker dette.
    si,Fonn,at
    understreke,han,dette

Slåtten var påkledd da hun ble funnet drept.
    være,Slåtten,påkledd
    finne,hun,drepe

Vi vil nå kartlegge alle bevegelser på åstedet.
    kartlegge,vi,bevegelse

Vi har ingen spesiell teori som vi tar utgangspunkt i.
    ha,vi,teori
    ta,vi,utgangspunkt

Funnene på åstedet viser at det er en kriminell handling.
    vise,funn,at
    være,det,handling

Det er ikke et mistenkelig dødsfall, men en kriminell handling.
    være,det,dødsfall

Trenger flere vitner.
    trenge,vitne,

Politiet ønsker å komme i kontakt med alle som kjente Slåtten.
    ønske,politi,å
    kjenne,,Slåtten

Etterforskerne fra Kripos vil kontakte vitner.
    kontakte,etterforsker,vitne

Politiet kjenner dødsårsaken.
    kjenne,politi,dødsårsak

Politimesteren bekrefter at de har fått en muntlig rapport.
    bekrefte,politimester,at
    ha,de,rapport

Han understreker at politiet ikke vil gi informasjon om dødsårsaken.
    understreke,han,at
    gi,politi,informasjon

Politiet har ikke bekreftet hvor kvinnen ble drept.
    bekrefte,politi,
    drepe,kvinne,

Politiet har nå 32 medarbeidere som etterforsker drapet.
    ha,politi,medarbeider
    etterforske,medarbeider,drap

Syklisten meldte seg.
    melde,syklist,seg

Den etterlyste syklisten har nå meldt seg til politiet i Førde.
    melde,syklist,seg

Fortsatt etterlyses to bilførere.
    etterlyse,bilfører,

Politiet etterlyste i dag tidlig en syklist.
    etterlyse,politi,syklist

I formiddag meldte syklisten seg til politiet.
    melde,syklist,seg

Jeg vil understreke at vi ønsker å komme i kontakt med både syklisten og bilførerne som vitner, sier Fonn.
    understreke,jeg,at
    ønske,vi,å
    vitne,bilfører,
    si,Fonn,

Vi vil nå kartlegge alle bevegelser på funnstedet og i boligen.
    kartlegge,vi,bevegelse

Vi har ingen spesiell teori som vi jobber utifra nå.
    ha,vi,teori
    jobbe,vi,

Men funnene på åstedet viser at det er en kriminell handling, forteller Fonn.
    vise,funn,at
    være,det,handling
    fortelle,Fonn,

I tillegg vil politiet gjøre en rundspørring rundt åstedet i løpet av dagen.
    gjøre,politi,rundspørring

Den endelige rapporten vil være klar på torsdag.
    være,rapport,klar

To Kripos-spesialister skal analysere alle tekniske spor i Førde.
    analysere,Kripos-spesialist,spor

De to sluttet seg til Førde-politiet i går.
    slutte,to,seg

Alt av mobiltelefontrafikk, overvåkingsfilmer og minibankaktiviteter rundt drapstidspunktet skal undersøkes.
    undersøke,minibankaktivitet,

Slik kan politiet undersøke aktiviteten i området sykepleiestudenten ble funnet drept.
    undersøke,politi,aktivitet
    finne,sykepleierstudent,

Lensmann Kjell Fonn ber alle som var i sentrum om å melde seg.
    be,Fonn,alle
    melde,,seg

Han understreker at etterforskningen er svært bred.
    understreke,han,at
    være,etterforskning,bred

Politiet har sanket inn videoer fra alle overvåkningskameraer i Førde.
    sanke,politi,

Polititjenestefolk går gjennom materialet.
    gå,polititjenestefolk,

Kameraene vil gi en indikasjon på aktiviteten i Førde i det aktuelle tidsrommet.
    gi,kamera,indikasjon

Gjerningsmannen gjemte seg i busker på åstedet.
    gjemme,gjerningsmann,seg

Det er sannsynlig at gjerningsmannen gjemte seg i busker ved åstedet.
    være,det,sannsynlig
    gjemme,gjerningsmann,seg

Tror Anne ble et tilfeldig offer.
    bli,Anne,offer

Politiet avhører flere vitner.
    avhøre,politi,vitne

Politiet har søkt med hunder på åstedet.
    søke,politi,

Politiet har samlet mange observasjoner.
    samle,politi,observasjon

Politiet antyder at drapsmannen har valgt sykepleiestudenten tilfeldig.
    antyde,politi,at
    velge,drapsmann,sykepleiestudent

Vitneavhør gir indikasjoner på at den brutale drapsmannen har valgt sykepleierstudenten tilfeldig.
    gi,vitneavhør,indikasjon
    velge,drapsmann,sykepleiestudent

Etterforskerne har flere observasjoner.
    ha,etterforsker,observasjon

Vitner så en kvinne som gikk alene.
    se,vitne,kvinne

Politiet mener kvinnen er Anne Slåtten.
    mene,politi,
    være,Slåtten,kvinne

Etterforskerne mener at hun ikke ble forfulgt.
    mene,etterforsker,at
    bli,hun,forfulgt

Drapsmannen kan ha gjemt seg i busker ved åstedet.
    gjemme,drapsmann,seg

Lensmannen ber unge kvinner være oppmerksomme.
    be,lensmann,kvinne

Politiet setter ikke inn ekstra patruljer i Førde.
    sette,politi,

Politiet har fått flere nye tips.
    ha,politi,tips

En rekonstruksjon ble gjennomført på tirsdag.
    gjennomføre,rekonstruksjon,

Lensmannen tror at politiet finner drapsmannen.
    tro,lensmann,at
    finne,politi,drapsmann

Politiet etterlyser fem personer.
    etterlyse,politi,person

Personene er observert i Førde.
    observere,person,

Politiet har ikke identifisert dem.
    identifisere,politi,de

Politiet har gjort 1275 avhør.
    gjøre,politi,avhør

De har fem observasjoner av ukjente personer.
    ha,de,observasjon
