Unni Cathrine Eiken February 2005
University of Bergen
Section for linguistic studies

CORPUS-BASED SEMANTIC CATEGORISATION FOR ANAPHORA RESOLUTION

Unni Cathrine Eiken

Cand. Philol. Thesis in Computational Linguistics and Language Technology

February 2005
Abstract

This thesis describes an approach that uses corpus-based classification of semantically related words as a referent-guessing aid in anaphora resolution. A small limited-domain corpus was collected, and elementary predicate-argument structures were extracted from it using a method based on the semantic structures available from syntactic parses of the texts. The extracted structures were processed with an association technique that created groups of semantically similar words based on their distribution in the text collection. These groups of semantically similar words represent valid selectional restrictions for the domain of the text collection in the sense that they characterise the types of arguments that tend to occur in certain contexts. The groups can be used to form an expectation of which words are likely to occur in a given contextual pattern, and can thus be used in anaphora resolution to select a probable referent from a set of possible referents. The experiments in the thesis show that this approach produces promising results; the concept groups can serve as an aid in finding likely referents in anaphora resolution.
Sammendrag

Metoden som beskrives i denne hovedoppgaven bygger på korpusbasert klassifikasjon av semantisk like ord og relaterer dette til bruk innenfor anaforresolusjon. Et domenespesifikt korpus ble samlet, og forenklede predikat-argumentstrukturer ble ekstrahert ved hjelp av en metode basert på semantiske strukturer som er tilgjengelige etter en syntaktisk analyse av tekstene. Strukturene ble prosessert med en assosiasjonsteknikk som, basert på ordenes distribusjon i tekstsamlingen, dannet grupperinger av semantisk like ord. Disse ordgruppene representerer gyldige seleksjonsrestriksjoner innenfor tekstsamlingens avgrensede domene da de karakteriserer grupper av argumenter som forekommer i gitte kontekster. Ordgruppene kan brukes til å gi en indikasjon på hvilke ord som forventes i et gitt kontekstmønster. Ved anaforresolusjon kan dette være til hjelp ved utvelgelsen av en sannsynlig referent fra en liste med mulige referenter. Eksperimentene i oppgaven viser at denne metoden gir lovende resultater; ordgruppene kan fungere som et hjelpemiddel i prosessen med å finne sannsynlige referenter i anaforresolusjon.
Preface

The project presented here is a Cand. Philol. thesis in Computational Linguistics and Language Technology, submitted at the University of Bergen in February 2005.

The thesis was written in loose cooperation with the research project KunDoc (KunDoc 2004). KunDoc (Kunnskapsbasert dokumentanalyse / Knowledge-based document analysis), which was started in October 2003 and is funded by the Norwegian Research Council (NFR), has served as an inspiration for formulating the approach in the thesis. The research within KunDoc is carried out in cooperation between the firm CognIT AS (CognIT 2004) and the University of Bergen. KunDoc aims at developing a method for the automatic recognition of discourse structures in written Norwegian texts. The project examines whether automated identification of coreference in a text can be used to create an unambiguous discourse structure of the text, identifying both its thematic and contextual structure. A further goal is to examine whether these techniques are useful within a closed thematic domain for creating unambiguous automated summaries. Within KunDoc, it is also of interest to generate ontologies that represent real-world knowledge.

In the work on my thesis I have also cooperated with the research project NorGram (NorGram 2004) at the University of Bergen. This project develops a computational grammar for Norwegian bokmål and is part of the ParGram project at Palo Alto Research Center. The pre-processing of the text collection used in my project has been carried out using NorGram's grammar on the XLE platform.
Acknowledgements

I would like to thank my supervisors, Professor Koenraad de Smedt and Professor Helge Dyvik, who have given me invaluable support and new ideas, especially in the process of developing the method in the thesis.

The approach in the thesis has been developed in loose cooperation with KunDoc. In this connection I wish to thank Till Christopher Lech at CognIT AS, who has contributed tips and support.

I would also like to thank Paul Meurer at Aksis, who installed XLE and NorGram on my home Linux computer, and Martin Rasmussen Lie, who has been a great help with programming questions and has implemented one of the approaches used in the thesis in Perl. Thanks also go to Aleksander Krzywinski, whose achievement it is that the pink computer exists.

Many people who have been a great support in the process of finishing the thesis have not been mentioned here; they are nonetheless very warmly thanked. You know who you are!
Table of Contents

1 INTRODUCTION AND PROBLEM STATEMENT
1.1 Project outline
2 THEORETICAL BACKGROUND
2.1 Anaphora resolution
2.1.1 Frameworks for anaphora resolution
2.1.2 Computational approaches to anaphora resolution
2.1.3 Anaphora resolution and text summarisation
2.2 Finding meaning in the context
2.2.1 The distributional approach
2.2.2 Different types of context
2.2.3 Context and selectional restrictions
3 FROM TEXT TO EPAS – THE EXTRACTION METHOD
3.1 Selecting the texts
3.2 Predicate-argument structures
3.2.1 What is represented in the EPAS?
3.3 Parsing with NorGram
3.3.1 NorGram in outline
3.3.2 Extracting EPAS from NorGram
3.4 Altering the source
3.5 Finding the words
3.6 Evaluation of the data set
3.6.1 Errors from the grammar
3.6.2 Irrelevant structures
3.6.3 Manually added structures
3.6.4 Comments about the EPAS list
4 CLASSIFICATION
4.1 Step I: Classification with TiMBL
4.1.1 The Nearest Neighbor approach
4.1.2 Testing
4.1.3 Comments on the results
4.2 Step II: Association of concept groups
4.2.1 Classify
4.2.2 Associated concept classes
4.3 Step III: Using concept groups in TiMBL
4.3.1 Testing
4.4 Are concept classes useful for anaphora resolution?
5 FINAL REMARKS
5.1 Is a parser vital for the extraction process?
5.2 Summary and conclusions
6 REFERENCES
APPENDIX A: EKSTRAKTOR.PL – ALGORITHM
APPENDIX B: EKSTRAKTOR.PL – PROGRAM CODE
APPENDIX C: THE EPAS LIST
APPENDIX D: TEXT ALIGNED WITH EPAS
APPENDIX E: CLASSIFY.PL – PROGRAM CODE
APPENDIX F: POS-BASED STRUCTURES
1 Introduction and problem statement

For many applications within the field of Natural Language Processing (NLP) it is vital to identify what a pronoun refers to. Consider a piece of text where (1-1a) is followed immediately by (1-1b).¹

(1-1)
a. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig kommer til å drepe igjen.
   The sergeant leading the investigation says that the perpetrator will probably kill again.
b. Han etterlyser vitner som var i sentrum søndag kveld.
   He puts out a call for witnesses who were in the city centre Sunday evening.

In an application such as text summarisation, for example, selecting the second sentence (1-1b) without the preceding sentence (1-1a) leaves the reader with the pronoun han (he), the referent of which cannot be identified. The task of identifying the referent of a pronoun is called anaphora resolution, and its computational implementation is relevant in many NLP applications, such as machine translation, automatic abstracting, dialogue systems, question answering and information extraction.
The problem of correctly identifying the referent of a pronoun is not trivial, as is apparent from a comparison of examples (1-1) and (1-2). As will be further described in section 2.1, strategies that do not incorporate some sort of real-world knowledge cannot confidently identify the entities that the pronoun han (he) is linked to in these examples.

(1-2)
Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig kommer til å drepe igjen. Han ble observert i sentrum søndag kveld.
   The sergeant leading the investigation says that the perpetrator will probably kill again. He was observed in the city centre Sunday evening.
¹ The sentences in (1-1) and (1-2) are constructed example sentences and are not part of the data set collected and used in this thesis.
This thesis explores the value of using co-occurrence patterns to create concept groups that can act as an aid in the process of finding what a pronoun refers to. In order to find the entity that the pronoun han (he) refers to in example (1-1), the following two alternative patterns can be considered:

(1-3)
a. lensmann etterlyser vitne (sergeant calls-for witness)
b. gjerningsmann etterlyser vitne (perpetrator calls-for witness)

When considering which of these patterns is the more likely one, data collected from a corpus can be consulted (Dagan and Itai 1990; Dagan et al. 1995; Nasukawa 1994; inter alia). If one of the patterns is found literally in the corpus, it receives a strong preference. If neither pattern occurs in the data collection, similar patterns can be considered. Given that the patterns in example (1-4) below do feature in the data collection, they can contribute to guessing the correct referent for the anaphor in example (1-1):

(1-4)
a. politi etterlyser vitne (police call-for witness)
b. etterforsker etterlyser vitne (investigator calls-for witness)
c. lensmann avhører vitne (sergeant interviews witness)
d. politi avhører person (police interview person)
e. gjerningsmann dreper offer (perpetrator kills victim)
f. gjerningsmann angriper kvinne (perpetrator attacks woman)

In view of the patterns in (1-4), the word lensmann (sergeant) occurs in contexts similar to those of politi (police), which in turn occurs in contexts similar to those of etterforsker (investigator). By using association techniques, lensmann can be associated with the other arguments that occur in similar linguistic environments, and subsequently be preferred as the referent in (1-1).
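The association step described above can be illustrated with a toy computation. The sketch below is not the method used in this thesis (which employs TiMBL and a purpose-built association technique, described in chapter 4); it is a minimal, hypothetical illustration of how arguments can be grouped by the overlap of the (predicate, slot) contexts they occur in, using the patterns from (1-4):

```python
from collections import defaultdict

# Toy patterns from (1-4), as (predicate, argument1, argument2) triples.
patterns = [
    ("etterlyse", "politi", "vitne"),
    ("etterlyse", "etterforsker", "vitne"),
    ("avhøre", "lensmann", "vitne"),
    ("avhøre", "politi", "person"),
    ("drepe", "gjerningsmann", "offer"),
    ("angripe", "gjerningsmann", "kvinne"),
]

# Context profile of each argument: the (predicate, slot) pairs it occurs with.
profiles = defaultdict(set)
for pred, arg1, arg2 in patterns:
    profiles[arg1].add((pred, "arg1"))
    profiles[arg2].add((pred, "arg2"))

def similarity(a, b):
    """Jaccard overlap between the context profiles of two arguments."""
    union = profiles[a] | profiles[b]
    return len(profiles[a] & profiles[b]) / len(union) if union else 0.0

# lensmann shares the (avhøre, arg1) context with politi, but no context
# with gjerningsmann, so it associates with the "investigator" group.
assert similarity("lensmann", "politi") > similarity("lensmann", "gjerningsmann")
```

With these six toy patterns, lensmann and politi share the subject slot of avhøre, so their similarity is positive, whereas lensmann and gjerningsmann share no contexts at all.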
In recent years, approaches within the field of anaphora resolution have focused on knowledge-poor strategies used in combination with corpora; at the same time, the notion of constructing a large and comprehensive base of real-world knowledge has been somewhat abandoned (see Mitkov 2003 for a brief overview). The approach in the present work expands the co-occurrence patterns found in a text collection to also consider words and patterns that are semantically similar to those present in the corpus. The association of semantically similar concepts is carried out through machine learning techniques and an association technique developed for this project.
The project described in the present work develops and examines a method for the automatic inference of concept groups, consisting of semantically similar words, from a collection of limited-domain texts. The information that becomes available by automatically classifying context patterns from a closed thematic domain is examined for its potential to aid anaphora resolution. The resulting concept groups can function as a form of real-world knowledge by representing information about which words can be expected in certain contextual environments. Since it is an established problem within computational linguistics that the construction of knowledge bases requires a great deal of manual labour, it is of interest to examine methods that can contribute to automating this task.
1.1 Project outline

This thesis describes a method for the automatic association of clusters of semantically similar words collected from a limited thematic domain. The association of concepts is based on the distribution of arguments in particular syntactic contexts.

The method described consists of three steps:

1) the extraction method, which deals with the extraction of meaning structures from a text corpus
2) the classification method, which deals with the association of the extracted meaning structures into concept groups
3) the application of the meaning structures and the concept groups to anaphora resolution

In the extraction phase, semantic structures mainly corresponding to subject-verb-object relations are extracted and normalised to the form shown in (1-5) below. This type of relation is termed an elementary predicate-argument structure (EPAS) in this thesis and is described in greater detail in section 3.2.
(1-5)
predicate, argument 1, argument 2
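Concretely, a structure of the form (1-5) can be represented as a flat triple of lemmas. The sketch below is merely an illustrative representation (the actual extraction in this thesis is implemented in Perl; see Appendix B), using the pattern from (1-3a):

```python
from typing import NamedTuple, Optional

class EPAS(NamedTuple):
    """Elementary predicate-argument structure: a predicate with its arguments."""
    predicate: str            # verb lemma
    argument1: str            # typically the subject lemma
    argument2: Optional[str]  # typically the object lemma, or None if absent

# The pattern in (1-3a), normalised to lemma form:
epas = EPAS(predicate="etterlyse", argument1="lensmann", argument2="vitne")
assert epas.predicate == "etterlyse"
```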
In the classification phase, the extracted structures undergo processes that result in the grouping of concepts into clusters of semantically similar words.

The evaluation of the results obtained with the method developed in the project is twofold:

• the resulting concept classes are evaluated: does the method produce semantic clusters that are valid for the thematic domain of the text collection?
• the usefulness of the concept classes in anaphora resolution is evaluated: does the method provide a means to infer which entity is referred to in examples such as (1-1) and (1-2)?
Chapter 3 describes the extraction method, which uses output from a syntactic parser to collect semantic structures in the form of EPAS from the texts. The text collection used in this project consists of newspaper texts, all concerning a criminal case. The constraints that hold on the corpus are further described in sections 3.1 and 3.4. Section 3.2 explains the format of the meaning structures extracted from the text corpus as well as the motivation for choosing EPAS as the meaning representation. Sections 3.3 and 3.5 outline in further detail the process of parsing the texts and gathering the meaning representations from the parse output. Finally, the list of EPAS resulting from the extraction method is evaluated in section 3.6.

The classification method is described in chapter 4. Section 4.1 describes a classification approach using machine learning techniques, in section 4.2 the constituents of the EPAS are associated into semantically similar groups based on their contextual distribution, and finally these two approaches are applied in combination in section 4.3. In section 4.4 the potential of using concept groups in anaphora resolution is discussed.

Final remarks and conclusions are found in chapter 5, where the foundation of the extraction method is also briefly discussed.
The results obtained in this project provide a preliminary indication of the feasibility and usefulness of a referent-guessing aid such as the one described in the introduction. Hindle (1990) states that small corpora and human intervention in the analysis phase are factors that have contributed to obscuring the usefulness of semantic classification based on distributional context. Within the framework of the present work it was not possible to conduct a large-scale corpus-based study. This is partly due to the lack of a sufficiently robust and powerful extraction method, and will be discussed in greater detail in chapter 3. The text collection that the study is based on is clearly much too small to offer anything but a tendency, and the degree to which the extraction process is manually manipulated is too high to call the method fully automated. Nevertheless, this thesis describes a pilot study and provides an indication of the quality and usefulness of the method.

Before going on to describe the method developed in the present work, a brief introduction to the concepts of context and anaphora resolution is needed. Chapter 2 discusses the importance and usefulness of classifying words according to the contexts they occur in, and provides a brief background on anaphora resolution.
2 Theoretical background

In order to understand the motivation for developing an extraction and classification method as described in the present work, a brief explanation of the theoretical foundation on which the method is based is needed. This chapter describes that theoretical background. Section 2.1 outlines the concept of anaphora resolution and the need for context information in anaphora resolution systems. Section 2.2 explains the notion of using context as a means to identify semantically similar words.
2.1 Anaphora resolution

Most natural language texts contain an abundance of pronouns and other expressions which are referentially linked to other items in the text. In order to understand the meaning conveyed by a text, one needs a method for finding out which entities these expressions are linked to. It is difficult to determine what a pronoun refers to without taking context and real-world knowledge into account. Natural language requires a certain amount of context to be intelligible. We distinguish between linguistic context, which denotes the concrete linguistic setting that a given word occurs in, and a more general notion of context that refers to the non-linguistic setting. In the following, a background on the theoretical basics of anaphora is given, before some approaches to anaphora resolution are briefly outlined.

Anaphor and referring expression are both terms used for words that point back either to other words or to entities in the world. Anaphora² can be defined as the linguistic phenomenon of using an anaphor to point back to a previously mentioned item in a text (Mitkov 2003, p. 266).
In the Oxford Concise Dictionary of Linguistics (Matthews 1997), a referring expression is defined as a linguistic element that refers to a specific entity in the real world, termed a referent. A referring expression can be any natural language expression that is used to refer to a real-world entity, including nouns and pronouns. As such, the linguistic expressions James and he in a given text may both refer to a person called "James" existing in the real world.
² The term anaphora is used in the present work in alignment with current literature on anaphora resolution. Anaphora is the linguistic phenomenon of an anaphor pointing to another item in the text, and should not be understood as the plural form of anaphor, which is anaphors.
The term anaphor describes a linguistic element, often a pronoun or a nominal, which is linked to another linguistic element previously presented in the text (Mitkov 2003). An anaphoric reference is usually supported by a preceding nominal, which is called an antecedent. If a referring pronoun is mentioned before its referent, the term cataphora applies (Jurafsky and Martin 2000, p. 675). Anaphora provides us with an indirect reference to a real-world entity. When a referring expression, such as James, has been introduced in a text, it allows for subsequent reference by anaphors, such as he or the boy. The original referring expression is then the antecedent of the subsequent referring anaphor, for example the pronoun he. If the anaphor and the antecedent it is linked to have the same referent in the real world, they are termed coreferential (Mitkov 2003, p. 267).
(2-1)
Politimannen sier at han har flere observasjoner.
   The policeman says that he has several observations.

In example (2-1) above, the pronoun han (he) is an anaphor which points back to its antecedent, the referring expression politimannen (the policeman). Han and politimannen both refer to the same real-world referent, the entity "the policeman", and are therefore coreferential.
There are various and complex structural conditions on the co-occurrence of an anaphor and its antecedent. These include constraints on how far apart the antecedent and the referring anaphor can be without disturbing the understanding of the text. An elaborate discussion of these conditions is, however, beyond the scope of the present work.
Mitkov (2003, p. 268) distinguishes between the following types of anaphora:

• pronominal anaphora: The anaphor is a pronoun.
• lexical noun phrase anaphora: The anaphor is a definite description or proper name that gives additional information and has a meaning independent of the antecedent.
• verb anaphora: The anaphor is a verb and refers to an action.
• adverb anaphora: The anaphor is an adverb.
• zero anaphora: The anaphor is implicitly present in the text, but physically omitted.
• nominal anaphora: The anaphor has a non-pronominal noun phrase as antecedent.
  o direct: anaphor and antecedent are linked through identity, synonymy, generalisation or specialisation.
  o indirect: anaphor and antecedent are linked through part-of relations or set membership.
He states that pronominal anaphora is the most frequent type, while nominal anaphora (indirect anaphora in particular) usually requires real-world knowledge to be resolved. In this thesis, the described method is tested on occurrences of pronominal anaphora from the text collection.
2.1.1 Frameworks for anaphora resolution

Anaphora resolution is the process of determining the antecedent of an anaphor (Mitkov 2003, p. 269). In our minds, we build a discourse model that represents the entities mentioned in the discourse and the relationships between them (Webber 1978, in Jurafsky and Martin 2000). A representation is evoked in the model upon an entity's first mention in the discourse, and is subsequently accessed from the model if the entity is mentioned again, either by name or by way of anaphora. Entities have varying degrees of salience in the discourse model, depending on how frequently they have been mentioned and on how long ago they were last mentioned. This notion of a discourse model is used both in theories which aim at describing the process of anaphora resolution, and in computational approaches that automate the anaphora resolution process.

There are different approaches to resolving the referring expressions and anaphors that occur in natural language discourse. Several formalisms offer frameworks describing the theory of discourse representation in general and anaphora binding in particular. In the following sections, two of these formalisms are briefly outlined.
2.1.1.1 Discourse representation theory

Discourse representation theory (DRT), proposed by Hans Kamp in 1981, represents a way of creating dynamic semantic representations for natural language discourse. The framework aims at representing larger linguistic units than sentences and is particularly useful for representing the way a discourse changes with every new sentence that is introduced. The core structure within DRT is the discourse representation structure (DRS), which is transformed through the processing of each sentence of a discourse. Since every sentence in a discourse can potentially introduce new concepts and entities which are referentially linked to previously introduced entities, it is not possible to infer the full meaning of an individual sentence without regard to the discourse it fits into. As Kamp and Reyle state, "the meaning of the whole is more, one might say, than the conjunction of its parts" (Kamp and Reyle 1993, p. 59). The interpretation of a new sentence relies both on the meaning of the sentence itself and on the structure representing the context of the earlier sentences (Kamp and Reyle 1993, p. 59). Thus, a new sentence is interpreted as a contribution to the existing representation of the discourse.

DRT establishes anaphoric links across sentence boundaries between anaphors in the current sentence and antecedents in the DRS as the new sentence is processed. A very simplified outline of what happens when the discourse in (2-2) is processed and a DRS is created is given below (after Kamp and Reyle 1993).
(2-2)
a. Jones owns Ulysses.
b. It fascinates him.

(2-2a) is entered into the DRS by applying phrase structure rules and lexical insertion rules in order to associate the sentence with a syntactic representation and a set of features and values for the individual words. Examples of assigned features are number, gender and transitivity (for verbs). The DRS for (2-2a) will in abridged form look like this:
(2-3)
Discourse referents: x, y
DRS conditions:
   Jones(x)
   Ulysses(y)
   x owns y

Upon entering (2-2b) into the DRS, a series of actions must be performed to calculate which entities in the DRS the two pronouns of sentence (2-2b) refer to. In the case of the sentences in (2-2), where the DRS only contains two members, this can be determined on the basis of gender agreement. The updated DRS appears as in (2-4) below.

(2-4)
Discourse referents: x, y, u, v
DRS conditions:
   Jones(x)
   Ulysses(y)
   x owns y
   u = y
   v = x
   u fascinates v
DRT offers a framework for creating and storing semantic representations of the meaning conveyed in a natural language discourse. The theory does not, however, offer a means to identify the referent of ambiguous anaphors, or of anaphors which require real-world knowledge in the process of determining their referents.
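The incremental DRS construction outlined above can be mimicked in a few lines of code. The following is a hypothetical sketch, not part of DRT proper nor of this thesis's implementation; it assumes that gender agreement alone suffices to resolve the pronouns, as in the Kamp and Reyle example (2-2):

```python
# Minimal DRS for (2-2a) "Jones owns Ulysses": discourse referents with
# features, plus a list of conditions, as in (2-3).
referents = {"x": {"name": "Jones", "gender": "masc"},
             "y": {"name": "Ulysses", "gender": "neut"}}
conditions = [("owns", "x", "y")]

def resolve(pronoun_gender):
    """Return the discourse referents compatible with a pronoun's gender."""
    return [r for r, feats in referents.items() if feats["gender"] == pronoun_gender]

# Processing (2-2b) "It fascinates him": 'it' is neuter, 'him' is masculine.
u = resolve("neut")[0]   # u = y (Ulysses)
v = resolve("masc")[0]   # v = x (Jones)
conditions.append(("fascinates", u, v))
```

After the update the condition list matches (2-4): u is identified with y and v with x, and "u fascinates v" is added to the DRS.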
2.1.1.2 Binding Theory

Binding Theory (BT) is a theoretical framework that describes syntactic conditions for intra-sentential anaphoric linking. BT offers conditions for whether a nominal expression can, must, or must not be linked to another nominal in the sentence. Within BT, reflexive pronouns and reciprocals are termed anaphors, while non-reflexive pronouns are called pronouns or pronominals. This understanding of the terms is also used in the following when referring to BT. The NP which is linked to by an anaphor or pronoun is, for BT purposes, the binder of the anaphor or pronoun. In example (2-5), he is the binder of himself, while himself is bound by he.

(2-5)
He hurt himself.
Chomsky's binding theory has three principles, shown in (2-6) below (after Chomsky 1981, in Asudeh and Dalrymple 2004).

(2-6)
A. An anaphor (reflexive or reciprocal) must be bound in its local domain.
B. A pronominal (non-reflexive pronoun) must not be bound in its local domain.
C. A nonpronoun must not be bound.
The implications of these principles, as well as the notion of local domain, are exemplified by (2-7). In (2-7a) the subclause the thief hurt himself constitutes the local domain for the anaphor himself. Since, according to Principle A above, the anaphor can only be bound in its local domain, the noun phrase the sergeant is not a possible binder. In (2-7b), the pronominal him must not (cannot) be bound in its local domain, and can therefore be bound to the noun phrase the sergeant. The pronominal need not, however, be bound to a noun phrase expressed in the sentence; it can also refer to a discourse referent not mentioned in the sentence.

(2-7)
a. The sergeant said that the thief hurt himself.
b. The sergeant said that the thief hurt him.
The fact that not all possible candidates in a syntactic domain can be binders is captured by the requirement that the binder must be in a structurally dominant position relative to the entity to be bound. This ensures that the noun phrase the sergeant cannot be the binder of the anaphor himself in example (2-8).

(2-8)
The sergeant's suspect hurt himself.
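The local-domain conditions of Principles A and B can be sketched as simple checks over a flat clause annotation. The following toy code is hypothetical and deliberately simplified (it ignores structural dominance and many other conditions); it encodes only the local-domain requirement illustrated in (2-7):

```python
# Toy clause annotation for (2-7): clause 0 is the matrix clause, clause 1
# the subclause "the thief hurt himself/him". Features are invented here.
nps = [
    {"id": "sergeant", "clause": 0, "type": "nonpronoun"},
    {"id": "thief",    "clause": 1, "type": "nonpronoun"},
    {"id": "himself",  "clause": 1, "type": "anaphor"},
]

def possible_binders(target, inventory):
    """NPs that may bind `target` under Principles A/B (very simplified)."""
    others = [n for n in inventory if n["id"] != target["id"]]
    if target["type"] == "anaphor":      # Principle A: bound in local domain
        return [n for n in others if n["clause"] == target["clause"]]
    if target["type"] == "pronominal":   # Principle B: not bound locally
        return [n for n in others if n["clause"] != target["clause"]]
    return []                            # Principle C: a nonpronoun is not bound

# As in (2-7a): only "the thief" can bind "himself"; "the sergeant" cannot.
assert [n["id"] for n in possible_binders(nps[2], nps)] == ["thief"]

# As in (2-7b): a pronominal "him" in clause 1 could be bound by "the sergeant".
him = {"id": "him", "clause": 1, "type": "pronominal"}
assert [n["id"] for n in possible_binders(him, nps)] == ["sergeant"]
```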
Hellan (1988) suggests that the principles of standard Government and Binding theory are primarily based on English and that they cover "a very limited subpart of what constitutes a possible anaphoric system" (Hellan 1988, preface). He proposes several additional principles, perhaps most notably the Command Principle, which states, among other things pertaining to the command relation, that relations within hierarchies of thematic roles can also stand in a command relation to an anaphor.
2.1.2 Computational approaches to anaphora resolution

Automated anaphora resolution systems basically have to perform three separate tasks (Mitkov 2003):

• identify the anaphors to be resolved
• locate the candidates for antecedents
• select the antecedent from the candidate list
Different computational approaches apply different resolution factors and knowledge sources.<br />
The process of resolving the antecedent is based on several resolution factors, which in turn<br />
draw into account quite different sources of background knowledge. Using morphological<br />
knowledge may be the simplest approach; gender and/or number is compared and candidates are<br />
discounted if their gender/number does not fit that of the anaphor. Syntactic knowledge is used<br />
to identify syntactic parallelism; the antecedent is often found in a similar syntactic position as<br />
the anaphor. In many cases, the correct antecedent cannot be identified without the help of<br />
semantic information. Selectional constraints is one example of semantic knowledge that can be<br />
used to narrow down the list of candidates for the antecedent. Repeated mention of an entity in<br />
the text passage preceding the anaphor may indicate that this entity has a higher degree of<br />
salience in the discourse and that it therefore is a likely antecedent for a following anaphor<br />
(Jurafsky and Martin 2000, p. 682). Morphological, lexical, syntactic, semantic and<br />
salience criteria used as background knowledge do not immediately suggest the most likely<br />
candidate, but rather act as filters to eliminate unsuitable candidates (Mitkov 2003, p. 271).<br />
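To make the filtering idea concrete, the sketch below discards candidates whose gender or number clashes with the anaphor. The feature dictionaries and example words are invented for illustration; note that since etterforskningen can be masculine in bokmål, agreement alone cannot remove it, in line with the discussion of example (2-11) below.

```python
# Sketch of a morphological agreement filter: candidates that clash with
# the anaphor's gender or number are discounted, the rest survive for
# later preference ranking. Feature values here are hand-typed toys.

def agreement_filter(anaphor, candidates):
    """Keep only candidates compatible with the anaphor's gender/number.

    A missing feature counts as unspecified and matches anything.
    """
    def compatible(a, c):
        for feature in ("gender", "number"):
            if a.get(feature) and c.get(feature) and a[feature] != c[feature]:
                return False
        return True
    return [c for c in candidates if compatible(anaphor, c)]

han = {"form": "han", "gender": "masc", "number": "sg"}
candidates = [
    {"form": "lensmannen", "gender": "masc", "number": "sg"},
    {"form": "etterforskningen", "gender": "masc", "number": "sg"},
    {"form": "vitner", "gender": "neut", "number": "pl"},
]
surviving = agreement_filter(han, candidates)
```

The filter removes vitner (wrong number and gender) but, as expected, cannot choose between the two remaining candidates; that is left to the later preference factors.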
Mitkov (2003, p. 272) states that for some examples “the crucial and most reliable factor in<br />
deciding on the antecedent” is real-world knowledge. Even the most exquisite anaphora<br />
resolution system will not be able to resolve anaphora of the type that needs real-world<br />
knowledge to rule out candidates that just do not make common sense. The examples in (2-9)<br />
below illustrate the point; without access to real-world knowledge or semantics, there is no way<br />
to confidently resolve the antecedent of the anaphoric han (he).<br />
(2- 9)<br />
a. Politimannen skjøt etter morderen, og han falt.<br />
The policeman shot at the murderer and he fell.<br />
b. Politimannen skjøt etter morderen, og han bommet.<br />
The policeman shot at the murderer and he missed.<br />
2.1.2.1 Knowledge-free approaches<br />
Botley and McEnery term anaphora resolution systems which do not consult any form of<br />
knowledge representation in the process of identifying the antecedent of an anaphor<br />
“knowledge-free” (Botley and McEnery 2000, p. 17). In the two following sections, it will be<br />
shown, on the basis of two well-established syntactic algorithms for anaphora resolution, that<br />
knowledge-free approaches that resolve anaphors without employing real-world knowledge<br />
cannot identify different antecedents in the case of examples (1-1) and (1-2).<br />
2.1.2.1.1 Lappin and Leass’ algorithm for pronoun resolution<br />
Lappin and Leass (Lappin and Leass 1994, in Jurafsky and Martin 2000) offer an algorithm for<br />
pronoun interpretation which takes into account recency and syntactically-based preferences.<br />
The algorithm does not employ semantic preferences or background knowledge, but uses a<br />
weighting system which reflects various syntactic features as well as salience of recency in the<br />
discourse. When testing this algorithm on test data from the same genre as was used to develop<br />
the weighting system, Lappin and Leass report an accuracy of 86%. Jurafsky and Martin present<br />
a somewhat simplified version of the algorithm for the resolution of non-reflexive, third-person<br />
pronouns (Jurafsky and Martin 2000, p. 684). The Lappin and Leass algorithm creates a<br />
discourse model upon processing a sentence and assigns each member of the discourse model a<br />
salience value. A set of salience factors determine the salience weight each of the members is<br />
assigned. The aspect of recency is maintained by reducing each member’s salience value by half<br />
upon processing of a new sentence. (2-10) below shows the weighting system of the salience<br />
factors in the system.<br />
(2- 10)<br />
SALIENCE FACTOR VALUE<br />
Sentence recency 100<br />
Subject emphasis 80<br />
Existential emphasis 70<br />
Accusative emphasis 50<br />
Indirect object or oblique complement emphasis 40<br />
Non-adverbial emphasis 50<br />
Head noun emphasis 80<br />
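The bookkeeping described above can be sketched as follows, computed strictly from the weights in (2-10) and the halving rule. The factor assignments are typed in by hand (a real system would derive them from a parse; the choice of the non-adverbial factor for etterforskning is an assumption), and the role-parallelism bonus of 35 is read off the difference between tables (2-13) and (2-14).

```python
# Sketch of Lappin and Leass-style salience scoring with the weights
# from (2-10), applied to the first sentence of example (2-11).

WEIGHTS = {
    "recency": 100, "subject": 80, "existential": 70, "accusative": 50,
    "indirect_object": 40, "non_adverbial": 50, "head_noun": 80,
}

def salience(factors):
    """Sum the weights of the salience factors a mention exhibits."""
    return sum(WEIGHTS[f] for f in factors)

def halve(scores):
    """Degrade every salience value by half when a new sentence is read."""
    return {referent: value / 2 for referent, value in scores.items()}

# Sentence 1 of (2-11): factor assignments as in table (2-12).
scores = {
    "lensmann": salience(["recency", "subject", "non_adverbial", "head_noun"]),
    "etterforskning": salience(["recency", "non_adverbial"]),
    "gjerningsmann": salience(["recency", "subject", "non_adverbial", "head_noun"]),
}
scores = halve(scores)    # moving on to the second sentence
scores["lensmann"] += 35  # role parallelism bonus with the anaphor han
best = max(scores, key=scores.get)
```

The candidate with the highest resulting value, here lensmann, is selected as the antecedent.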
In the following, this algorithm will be used in an attempt to resolve the referent for the anaphor<br />
han (he) in the second sentence of each member of the sentence pair presented in (1-1) and (1-2)<br />
and repeated in (2-11) below:<br />
(2- 11)<br />
a.<br />
Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig<br />
kommer til å drepe igjen. Han etterlyser vitner som var i sentrum søndag<br />
kveld.<br />
The sergeant leading the investigation says that the perpetrator probably will<br />
kill again. He puts out a call for witnesses who were in the city centre Sunday<br />
evening.<br />
b. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig<br />
kommer til å drepe igjen. Han er observert i sentrum.<br />
The sergeant leading the investigation says that the perpetrator probably will<br />
kill again. He is observed in the city centre.<br />
When attempting to find the antecedent for the pronoun in (2-11a), all potential referents must<br />
be collected. Since the discourse consists of only two sentences, only the first sentence is<br />
processed; in a longer discourse, the antecedent would be searched for in up to four preceding<br />
sentences. The potential referents lensmannen (the sergeant), etterforskningen (the<br />
investigation) and gjerningsmannen (the perpetrator) are assigned salience values as shown in<br />
(2-12) below.<br />
(2- 12)<br />
REC SUBJ EXIST OBJ IND-OBJ NON-ADV HEAD N TOT<br />
lensmann 100 80 50 80 310<br />
etterforskning 100 50 150<br />
gjerningsmann 100 80 50 80 310<br />
As the algorithm moves on to the next sentence, the values assigned in (2-12) are reduced by<br />
half, as shown in (2-13).<br />
(2- 13)<br />
Referent Phrases Value<br />
lensmann lensmannen 155<br />
etterforskning etterforskningen 75<br />
gjerningsmann gjerningsmannen 155<br />
Now potential referents which do not agree in gender or number are removed. In our case, the<br />
pronoun is han (he), which for Norwegian bokmål specifies an animate referent. According to<br />
the preference factors in the algorithm, which check for gender and number only, the potential<br />
referent etterforskning cannot, however, be removed. At this stage, referents which do not<br />
satisfy intra-sentential syntactic coreference constraints will also be removed. Final salience<br />
values are calculated by assigning values for syntactic role parallelism and cataphora. In our<br />
case, lensmann is given extra weight due to the syntactic parallelism to the anaphor. This results<br />
in the values shown below:<br />
(2- 14)<br />
Referent Phrases Value<br />
lensmann lensmannen 190<br />
etterforskning etterforskningen 75<br />
gjerningsmann gjerningsmannen 155<br />
Since lensmann has the highest salience weight, this word is also chosen as the referent for the<br />
pronoun han. The processing of sentence (2-11a) makes it clear that the same referent would<br />
be assigned in a processing of sentence (2-11b). The algorithm does not take the semantic<br />
meaning of the sentence to be processed into account, and is therefore not able to distinguish<br />
between the referent assignments in examples such as (2-11a) and (2-11b).<br />
2.1.2.1.2 Hobbs’ Tree search algorithm<br />
Hobbs’ tree search algorithm (Hobbs 1978, in Jurafsky and Martin 2000) is a pronoun resolution<br />
algorithm based on syntactic tree structures of the sentences to be processed. Proceeding to<br />
resolve the antecedent of a pronoun, the tree search algorithm processes syntactic<br />
representations of all previous sentences in the discourse, as well as the sentence with the<br />
pronoun to be resolved. The syntactic representations of the discourse, in combination with the<br />
order in which the syntactic structures are searched, to some degree represent a discourse<br />
model and salience preferences. In its search for the antecedent of a pronoun, the algorithm<br />
traverses syntactic trees in a left-to-right, breadth-first manner.<br />
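The search order can be sketched as below. This is not the full Hobbs algorithm (which also climbs node by node within the anaphor's own sentence); it only shows how the first NP encountered in the left-to-right, breadth-first walk over a preceding sentence's tree is proposed as the antecedent. The toy parse and its labels are simplified inventions and do not follow NorGram.

```python
# Sketch of the left-to-right, breadth-first tree walk that the Hobbs
# algorithm applies to the parse trees of preceding sentences.
# Trees are represented as ("LABEL", child, child, ...) tuples.

from collections import deque

def first_np_breadth_first(tree):
    """Return the first NP node met in a left-to-right, breadth-first walk."""
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        label, children = node[0], node[1:]
        if label == "NP":
            return node
        queue.extend(child for child in children if isinstance(child, tuple))
    return None

# Toy parse of the first sentence of (2-11).
previous_sentence = (
    "IP",
    ("NP", ("N", "lensmannen"), ("CP", ("NP", ("N", "etterforskningen")))),
    ("VP", ("V", "sier"), ("CP", ("NP", ("N", "gjerningsmannen")))),
)
antecedent = first_np_breadth_first(previous_sentence)
```

Because the walk is breadth-first, the subject NP headed by lensmannen is reached before the NPs embedded deeper in the tree.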
To find the antecedent for the pronoun han in the sentences in (2-11), syntactic tree structures of<br />
the sentences are needed. The syntactic structures of (2-11) presented in Figure 1 and Figure 2<br />
are generated by the NorGram grammar web version 3 . This is a more complex grammar than the<br />
one used in Jurafsky and Martin’s outline of the algorithm (Jurafsky and Martin 2000, p. 689),<br />
but as stated there, the algorithm to a large degree allows any choice of grammar. Since the<br />
algorithm is based on searches in syntactic trees, and therefore relies on making assumptions<br />
regarding the build-up of the syntactic structures, the grammar must be specified in any case. In<br />
the following, the tree search as specified in Jurafsky and Martin (2000) will be carried through,<br />
with the grammar assumptions this implies. The syntactic trees (and their<br />
labels) are included for illustrative purposes and should not be thought of as input for the<br />
algorithm.<br />
3 Generated at http://decentius.hit.uib.no:8010/logon/xle-mrs.xml 31/01-2005<br />
Figure 1<br />
Figure 2<br />
In the process of identifying the antecedent for the pronoun han in the second sentence of<br />
(2-11a), the algorithm takes as its starting point the NP immediately dominating the pronoun.<br />
From there, it moves upwards in the tree to the first NP or sentence node. This node is called X<br />
and the path from the pronoun to X is called p. In our case, that means that X is the topmost<br />
sentence node (IP node). Since there are no more branches to the left of X and p, and in other<br />
words no possible antecedents introduced earlier in the same sentence, the algorithm moves<br />
along to the parse tree of the previous sentence. Searching left-to-right and breadth-first, the first<br />
NP node that is encountered is suggested as the antecedent for the pronoun. In our case, this<br />
means that the algorithm would propose the NP lensmannen som leder etterforskningen as the<br />
antecedent for the pronoun han. As will be clear from examining the example sentences in<br />
(2-11), this is a correct resolution of the antecedent in (2-11a), but not for the antecedent in<br />
(2-11b). Parallel to the Lappin and Leass resolution algorithm, the tree search algorithm also<br />
does not consider the semantic meaning of the sentence with the anaphor to be resolved. The<br />
pronouns han in the second sentences of (2-11a) and (2-11b) are treated in the same way, and<br />
lensmannen is chosen as the most likely antecedent in both cases.<br />
2.1.2.2 Traditional approaches to anaphora resolution<br />
As seen through the examples above, anaphors of the type that requires semantic information<br />
simply cannot be resolved using purely syntactic algorithms. In order to find the<br />
antecedent for such anaphors, some sort of real-world knowledge must be consulted. Mitkov<br />
(1999) distinguishes between traditional and alternative approaches for anaphora resolution.<br />
The traditional approaches are those that use knowledge factors to filter out unlikely candidates<br />
and then use preference rules on a smaller set of likely candidates, while the alternative<br />
approaches find the most likely candidate based on statistical or AI techniques (Mitkov 1999, p.<br />
8). The traditional approaches usually draw in the factor of real-world or domain knowledge,<br />
often in the form of a comprehensive knowledge or domain base, in order to resolve anaphors of<br />
the type in examples (2-9) and (2-11) above (Mitkov 2003). Such approaches are also called<br />
knowledge-based (Botley and McEnery 2000, p. 11). As repeatedly emphasised above, some<br />
types of anaphors cannot be correctly resolved without access to real-world<br />
information. Carbonell and Brown’s (1988) multi-strategy approach is one traditional<br />
knowledge-based anaphora resolution system. Their approach follows what Botley and<br />
McEnery call “a trend […] towards the integration of several different resolution algorithms into<br />
large-scale modular architectures” (Botley and McEnery 2000, p. 17). Their system draws on<br />
different knowledge sources, including syntactic structure, case-frame semantics, dialog<br />
structure and real-world knowledge. The resolution of anaphors is based on constraints and<br />
preferences; first the constraints are applied to narrow down the list of potential antecedents and<br />
then the preferences are applied to each of the remaining candidates (Carbonell and Brown<br />
1988, p. 98). Real-world knowledge is realised as a set of precondition and postcondition<br />
constraints. These constraints determine, for example, that the object given is no longer in the<br />
possession of the actor after a successful act of giving has been carried out. The main problem<br />
with such an approach is stated by the developers: “the strategy is simple, but requires a fairly<br />
large amount of knowledge to be useful for a broad range of cases” (Carbonell and Brown 1988,<br />
p. 97).<br />
Generally speaking, the knowledge bases that knowledge-based systems for anaphora resolution<br />
rely on are difficult to represent and process, and require a considerable amount of human input<br />
(Mitkov 2001, p. 110). The information is structured using different frameworks; often each<br />
anaphora resolution system structures its knowledge base in a system-specific manner. Rather<br />
than giving an outline of various specific methods belonging to the traditional approaches, some<br />
of the formats used for knowledge representation are briefly mentioned below. Several<br />
frameworks have been developed to cope with the need for a formalism to represent real-world<br />
or domain knowledge. Most of these have been part of specific anaphora resolution systems<br />
and have not constituted independent frameworks for the representation of real-world<br />
knowledge.<br />
Minsky’s Frames (Minsky 1975, in Botley and McEnery 2000) is a framework for representing<br />
knowledge about stereotyped objects and events. The frames are dynamic in the sense that the<br />
information they hold about a particular object or event can change if new information is<br />
encountered. Input into the system is interpreted in accordance with the information present in<br />
the frames; the frames generate expectations about the input (Botley and McEnery 2000, p. 12).<br />
In the case of a “shooting frame” being evoked upon processing of the sentence in (2-9a), the<br />
expectation that if somebody misses, it is likely to be the same person that also was doing the<br />
shooting, is created. Following such an expectation, it is easy to identify the correct antecedent<br />
for the anaphor. Schank’s Scripts (Schank 1972, in Botley and McEnery 2000) have some<br />
similarity to Minsky’s Frames, but are primarily used to represent knowledge about events<br />
which do not undergo change (Botley and McEnery 2000, p. 12). Information about role<br />
assignment and the sequence of events in given contexts is represented in the script.<br />
2.1.2.3 Alternative approaches to anaphora resolution<br />
Hand-coded knowledge bases that aim at representing real-world or domain knowledge are<br />
expensive and labor-intensive to build and maintain. As a consequence, the focus has shifted<br />
toward systems that rely less heavily on world knowledge in the last 15 years (see Mitkov 2003<br />
for an overview). Many of these systems incorporate semantic and real-world knowledge, but<br />
use methods that enable the collection of this information to have a high degree of automation<br />
(Baldwin 1997; Dagan and Itai 1990; Dagan et al. 1995; Nasukawa 1994; inter al.). Mitkov<br />
(2003) terms these systems knowledge-poor and attributes their growth in number in recent<br />
years to the fact that corpora and similar electronic linguistic resources have become better,<br />
larger and more available. Some of these systems do not really attempt to build a world- or<br />
domain knowledge base (Baldwin 1997; Nasukawa 1994), but rather look at features such as<br />
co-occurrence patterns in the text itself, while others integrate corpora and use them as a form of<br />
abstract knowledge base (Dagan and Itai 1990; Dagan et al. 1995).<br />
Among the different “alternative” approaches, Dagan and Itai’s (1990) statistical approach,<br />
Dagan et al.’s (1995) estimation of unseen patterns and Nasukawa’s (1994) knowledge-free<br />
method are of particular interest for this project. Dagan and Itai’s (1990) method is that of using<br />
co-occurrence patterns observed in a corpus as a type of selectional restrictions. Co-occurrence<br />
patterns observed in a large corpus are thought to reflect the semantic constraints that apply to<br />
natural language. Candidates for antecedents for the anaphor it are identified in the text and put<br />
in the place of the anaphor to be resolved. This produces co-occurrence patterns that are checked<br />
against the corpus. Subsequently, the candidate present in the most frequently occurring<br />
co-occurrence pattern is chosen as the antecedent. This method relies on a large corpus, as only<br />
patterns which actually have been seen in the corpus are considered. Infrequent patterns will not<br />
be picked since they generally speaking will not feature at the top of the pattern list. Dagan et<br />
al. (1995) offer a solution to this problem by presenting a similar method which also estimates<br />
the probability of co-occurrence patterns that have not been observed in the training data. They<br />
state the importance of distinguishing between probable and improbable unobserved<br />
co-occurrence patterns and emphasise that the “distinctions ought to be made using the data that do<br />
occur in the corpus” (Dagan et al. 1995, p. 164). Analogies are made between specific unseen<br />
co-occurrence patterns and observed co-occurrences which contain similar words, determining<br />
word similarity by a similarity metric. Patterns that contain similar words to the target word and<br />
that have been observed in the training data are used to calculate how likely the target word is to<br />
occur in the same pattern. Nasukawa (1994) presents a resolution rate of 93.8% for an even<br />
more knowledge-poor method for pronoun resolution. Instead of drawing information from a<br />
corpus, word frequency and co-occurrence patterns in the text itself are used to filter out the<br />
most likely candidate for the antecedent. In Nasukawa’s approach, inter-sentential data is<br />
exploited in the process of resolving the pronoun it. The likelihood of the antecedent is<br />
determined statistically and the antecedent candidate with the highest value is selected by the<br />
system. The approach uses a syntactic-based heuristic rule for the selection of the antecedent.<br />
Nasukawa states that approaches using real-world knowledge are not yet large-scale enough to<br />
be of use in broad-coverage systems, and attempts to extract information corresponding to<br />
case frames in world knowledge from the texts to be processed (Nasukawa 1994, p. 1157). In<br />
this way, collocation patterns are used as a form of world knowledge for the domain of the texts.<br />
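Dagan and Itai's substitution idea can be sketched as follows. Each antecedent candidate is substituted into the anaphor's position, and the candidate whose resulting co-occurrence pattern is most frequent in the corpus is preferred. The pattern counts below are an invented stand-in for real corpus statistics, keyed here to example (2-9).

```python
# Sketch of Dagan and Itai (1990): prefer the candidate whose substituted
# co-occurrence pattern is most frequent in a corpus. The counts are toys.

corpus_counts = {
    ("skyte", "subject", "politimann"): 12,
    ("skyte", "subject", "morder"): 2,
    ("falle", "subject", "morder"): 7,
    ("falle", "subject", "politimann"): 1,
}

def resolve(verb, relation, candidates, counts):
    """Substitute each candidate into the pattern; keep the most frequent."""
    return max(candidates, key=lambda c: counts.get((verb, relation, c), 0))

# (2-9a) "... og han falt": who is the more likely subject of falle (fall)?
winner = resolve("falle", "subject", ["politimann", "morder"], corpus_counts)
```

Unseen patterns receive a count of zero here, which illustrates the sparseness problem that Dagan et al.'s (1995) similarity-based estimation is designed to address.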
As has been outlined in the introduction, this thesis describes a method that can aid in the<br />
resolution of the anaphoric expressions that require real-world knowledge to correctly resolve<br />
their antecedents. The method automatically extracts and classifies nominal arguments, resulting<br />
in associated classes of similar words. This is a knowledge-poor method in the sense that it does<br />
not require a comprehensive knowledge base to be built, but rather uses data and co-occurrence<br />
patterns from a corpus to find the most likely antecedent from a list of possible candidates found<br />
in a text.<br />
2.1.3 Anaphora resolution and text summarisation<br />
As already mentioned, several NLP applications need a reliable means to resolve anaphoric<br />
expressions and identify coreferences. In the field of text summarisation, which belongs to the<br />
domain of the KunDoc project, anaphora resolution is vital for the process of finding<br />
coreferential chains, identifying discourse structure and ultimately producing a coherent<br />
summary. Systems for automatic text summarisation need to make a number of choices<br />
regarding the resolution of anaphoric expressions. Mani (2001, p. 70) identifies “dangling<br />
anaphors” as a coherence problem in automatic summaries; without a means to resolve<br />
anaphoric expressions, the summary may contain anaphors, but not the antecedents they refer to.<br />
This disturbs the coherence in the summary; not all the information that the reader needs is<br />
present in the summarised text. The (constructed) example below illustrates this: consider the<br />
full text example in (2-15a) in connection with the summarised version in (2-15b). Neither<br />
instance of the pronoun han (he) in the summarised version in (2-15b) has an<br />
identified referent in the text. For a reader presented only with the summary, it is highly<br />
unclear what these pronouns refer to.<br />
(2- 15)<br />
a. Politiet etterlyste i dag tidlig en syklist i<br />
forbindelse med drapet på 23 år gamle Anne Slåtten. I<br />
formiddag meldte han seg til politiet, skriver bt.no<br />
- Jeg har foreløpig ikke klarhet i hva han har sagt,<br />
forteller lensmannen i Førde Kjell Fonn.<br />
This morning the police instituted a search for a biker<br />
in connection with the murder of 23-year-old Anne<br />
Slåtten. This morning he reported to the police, writes<br />
bt.no<br />
- For the time being I am not in the clear about what he<br />
has said, tells the sergeant in Førde Kjell Fonn.<br />
b. I formiddag meldte han seg til politiet, skriver bt.no<br />
- Jeg har foreløpig ikke klarhet i hva han har sagt,<br />
forteller lensmannen i Førde Kjell Fonn.<br />
This morning he reported to the police, writes bt.no<br />
- For the time being I am not in the clear about what he<br />
has said, tells the sergeant in Førde Kjell Fonn.<br />
Another reason why anaphora resolution is important for text summarisation is that the methods<br />
used for retrieving relevant sentences for a summary perform more accurately if anaphoric<br />
references to central concepts are also considered (Mitkov 2003, p. 276).<br />
The emergence of knowledge-poor and corpus-based approaches for anaphora resolution<br />
suggests that the representation of real-world knowledge does not necessarily have to take the<br />
form of a human-made system. Alternative approaches show that information available from the<br />
text to be analysed, or from larger bodies of natural language text, can be used to provide<br />
information that resembles real-world knowledge. The following section explains this notion of<br />
using contextual information to approximate world knowledge.<br />
2.2 Finding meaning in the context<br />
“You shall know a word by the company it keeps!” Firth (1957, p. 179)<br />
“The meaning of entities, and the meaning of grammatical relations among them, is related to<br />
the restriction of combinations of these entities relative to other entities.” Harris (1968)<br />
2.2.1 The distributional approach<br />
The semantic meaning of a word is often readily suggested by the lexical context in which it<br />
occurs. This is an idea advanced by many scholars, starting with Firth (1957) and Harris (1968).<br />
Human beings use the context of a word in the process of deciding the semantic meaning of the<br />
word. When encountering an ambiguous word, the language user has a finite number of possible<br />
meanings to consider. By examining the environment in which the ambiguous word occurs, the<br />
language user finds clues toward deciding which of the possible meanings is applicable.<br />
The same mechanism applies when a language user is confronted with a novel word; by<br />
observing the usage of the word, preferably over several instances, a human being is able to<br />
induce the semantic meaning from the setting the word occurs in. This is in accordance with the<br />
Distributional Hypothesis as proposed by Harris (1968) and contributes to explaining the fact<br />
that humans rarely have problems identifying for example what an ambiguous word means, or<br />
what entity in the discourse a pronoun refers to. The fact that properties in a word’s linguistic<br />
environment can contain information about the meaning of the word is a useful tool for the<br />
semantic comparison of words. A word which is used within a limited thematic domain is likely<br />
to be used in a sense specific to the contextual setting, or domain, in which it occurs. This entails<br />
that the linguistic environment in which the word exists also gives information about the<br />
meaning of the word. Words that appear in the same linguistic setting in texts that describe the<br />
same theme may have similar or related meanings as well. Texts belonging to the same domain<br />
will to some extent contain information about the same things, and as such also contain<br />
semantically similar words which are used in similar ways. Following this line of thought, it<br />
should be possible to gain relevant information about which words to expect in specific<br />
positions in a text by way of looking at the context patterns they should fit into. Thus, words that<br />
occur in limited-domain texts can be classified relative to how they combine with each other.<br />
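The classification of words by how they combine with each other can be sketched with context-count vectors: each word is described by how often it fills each context pattern, and words are compared by the cosine of their vectors. The context patterns and counts below are invented toys, not drawn from the thesis corpus.

```python
# Sketch of the distributional idea: words that occur in similar contexts
# receive similar context-count vectors, and vector similarity is taken
# as a proxy for semantic relatedness.

import math

# Toy context patterns: subject-of-etterlyse, subject-of-drepe, object-of-avhøre
counts = {
    "lensmann":      [9, 0, 1],
    "politimann":    [7, 1, 2],
    "gjerningsmann": [0, 8, 6],
}

def cosine(u, v):
    """Cosine similarity between two context-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sim_police = cosine(counts["lensmann"], counts["politimann"])
sim_killer = cosine(counts["lensmann"], counts["gjerningsmann"])
```

On these toy counts, lensmann is far closer to politimann than to gjerningsmann, which is the kind of grouping the thesis's concept classes aim to capture for a limited domain.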
The idea that the contextual environment can give clues about the semantic meaning of a word is<br />
clearly not a new one, considering the quotes of Firth and Harris in the introduction to this<br />
section. The theory dates back to the empiricists of the mid-twentieth century. Linguistic theory<br />
in the first half of the twentieth century was to a large degree dominated by empiricism.<br />
Linguistic thought, particularly in the United States, but also in Europe, was strongly influenced<br />
by the positivism of the behaviourist philosophy. Bloomfield is regarded as one of the chief<br />
advocates of linguistic positivism, and his interpretation of linguistics dominated American<br />
linguistics in the 1930s and 1940s (Robbins 1997, p. 237). The positivistic/behaviouristic view<br />
on linguistic science put emphasis on the observable. Reliable facts could only be found through<br />
objective observation of data, and only phenomena which could be empirically experienced by<br />
any observer were considered valid data for further analysis. Robbins states that the favoured<br />
model of description of the time was that of distribution; for some linguists the notion of<br />
linguistic description coincided with the statement of distributional relations (Robbins 1997, p.<br />
239). He also attributes the fact that there was little emphasis on the study of semantics in the<br />
early twentieth century to Bloomfield’s dismissal of the possibilities of an empirically based<br />
study of this field. Since the analysis of meaning requires non-linguistic knowledge as well,<br />
semantic analysis was deemed less suitable for empiricist methods. While the study of semantics<br />
previously had aimed at creating an exhaustive description of what is referred to by a linguistic<br />
entity, Firth represents a challenge to this way of thinking. His “contextual theory of language”<br />
introduced a move in semantics, toward a statement of meaning as a function of how words are<br />
used (Robbins 1997, p. 247). Together with Harris, Firth represents the distributional approach<br />
to finding semantic meaning. Within this approach, meaning is treated as semantic functions<br />
related to contexts of situation. This way of analysing meaning is data-driven in the same sense<br />
as empiricist approaches in other fields of linguistics and is strongly connected to the positivistic<br />
philosophy of science predominant at the time. However, using bottom-up methods as a means<br />
to formulate theories of linguistics is a direction that was more or less abandoned after<br />
Chomsky’s criticism of the structuralist approaches. Chomsky challenged the philosophical and<br />
scientific foundation of the Bloomfieldian canon through his proposal of the<br />
transformational-generative grammar. He dismissed the behaviouristic approach to language as the unacceptable<br />
product of the strong empiricism of the Bloomfieldian behaviourist school. The shift from<br />
empiricism to rationalism marks an important turning point in linguistic theory (Robbins 1997,<br />
p. 260). Botley and McEnery state that Chomsky, and the generation of linguists following his<br />
theories, represent a knowledge-driven approach with the goal of formulating linguistic theories<br />
(Botley and McEnery 2000, p. 24).<br />
The method of describing semantic meaning by looking at the distribution of words in context<br />
was more or less abandoned in the decades following the shift of paradigm from empiricism to<br />
rationalism. Semantic analysis has been approached through new methods within linguistic<br />
theory, deeming the meaning-is-use approach too simple. In recent years, however,<br />
computational linguistics has brought some of these old ideas forward again. This is mainly due<br />
to the increasing availability of large, computer-readable corpora and powerful processing tools<br />
that can reliably perform operations on large data sets. The emergence of corpus approaches is a<br />
move away from the Chomskyan view toward an emphasis on actual observable linguistic<br />
behaviour (Botley and McEnery 2000, p. 24). Thus, the bottom-up approaches of Firth and<br />
Harris are again in fashion; by using corpora, computational linguists today are able to look at<br />
actual occurrences of data and use these to develop theories of linguistic performance. Leech<br />
argues that corpus linguistics is not a linguistic theory, but rather a methodology (Leech 1992, in<br />
Botley and McEnery 2000, p. 23). Rather than being primarily theoretically founded, corpus<br />
linguistics as a discipline focuses on linguistic performance and description, as found in actual<br />
occurrences of natural language text. One can say that Firth and Harris’ ideas have received a<br />
pragmatic renaissance, probably in large part because of the computational tools now available.<br />
Whether these new applications of the distributional theories from the 1950s reflect a<br />
reconsideration of their usefulness and theoretical foundation, or whether they merely<br />
show that computational linguistics is a pragmatic rather than a theoretically founded<br />
science, is a discussion far outside the scope of the present work. What can be stated,<br />
though, is that the notion of using a word’s context to find out something about the meaning of<br />
that word, is an approach that seems to provide interesting results regarding semantic meaning,<br />
regardless of the motivation of such an approach. The type of semantic information available<br />
from the context is, however, not necessarily of the same type as that referred to when speaking<br />
of the semantic meaning of a word. Information obtainable from looking at distribution over<br />
several contexts rather provides a measure of semantic relatedness or closeness. Instead of<br />
providing a means to obtain or define the direct semantic meaning, methods that rely on<br />
contextual distribution return words whose meaning is similar or related to that of a target word.<br />
2.2.2 Different types of context<br />
So far, we have argued that using context as a tool to indicate the semantic meaning of a word is<br />
a useful method in linguistics. The method’s theoretical foundation dates back to the middle of<br />
the twentieth century, but has not been pursued much in the last few decades. Even though the<br />
linguistic foundation of this method has been discussed, the advance in computational resources<br />
in recent years has brought this approach forward again. However, the different<br />
types of context that can be taken into consideration have not yet been discussed in this thesis.<br />
Agreeing that the semantic meaning of a word is suggested by the linguistic<br />
context in which it occurs, or “the company it keeps”, supports the notion that different words<br />
used in the same context are semantically similar. It does, however, not provide a means for<br />
calculating the degree of this similarity, or even for finding out exactly which words are similar to<br />
each other. Depending on the information one wishes to obtain about a target word, different<br />
context types mirror different aspects of the semantic meaning of a word. Any approach that<br />
attempts to describe semantic meaning based on the contextual distribution of words in a text<br />
collection must first define the type of context that best reflects the desired information.<br />
Somewhat simplified, we distinguish between topical context and local context.<br />
2.2.2.1 Topical context<br />
Topical context (Miller and Leacock 2000), or document context, is a rather broad term that<br />
covers what we could call the “wide conception” of context. All other content words<br />
which occur in the same environment as a target word are considered to make up the context of<br />
the word and, following the discussion above, contribute to indicating the semantic meaning of<br />
the target word. A target word’s contextual environment can be further specified depending on<br />
the purpose; in short, the context is simply all the words which occur within a context window<br />
of varying size. The window can be set to cover a certain number of words before and after a<br />
target word, or to consist of the entire document the target word occurs in. Different<br />
parameters determine the weighting of each word found within the context window; for example,<br />
words can be weighted according to their distance from the target word. One extreme way of<br />
looking at topical context is the bag-of-words model, where a document is seen as an<br />
unordered collection of words, and the words are weighted by the number of times they occur in<br />
a document. In a narrower sense, topical context can be limited to consist only of the<br />
other words in the same sentence as a target word.<br />
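The notion of a topical context can be sketched as follows: every content word inside a window (or the whole document) is collected into a bag of words, with no syntactic structure recorded. The stopword list and example sentence are illustrative assumptions.<br />

```python
from collections import Counter

# A small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "was", "by"}

def topical_context(tokens, target, window=None):
    """Bag-of-words context of `target`: every content word inside the
    window counts once per occurrence. window=None takes the entire
    document as context (the widest conception)."""
    bag = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        span = tokens if window is None else tokens[max(0, i - window): i + window + 1]
        bag.update(t for t in span if t != target and t not in STOPWORDS)
    return bag

tokens = "the police found the dead woman in Førde".split()
document_bag = topical_context(tokens, "woman")          # whole document
windowed_bag = topical_context(tokens, "woman", window=3)  # narrow window
```

As in example (2-16) below, such a bag reveals the thematic domain of the text, but says nothing about how the collected words relate to the target word.<br />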
The extraction of topical context does not draw on syntactic or semantic information and<br />
therefore does not provide an indication of the relationship that the words in the context have to<br />
each other or to the target word. It is therefore not possible to say anything specific about<br />
semantic similarity based solely on the occurrence of words in the topical context. As the name<br />
indicates, this type of context gives information about the topic, or domain, of the text in which<br />
the target word occurs.<br />
Consider the words in example (2-16) as the topical context for the target word sykepleiestudent<br />
(student nurse). The context words are words which occur more than once in a short newspaper<br />
text from the text collection used in this project. Even with such a rudimentary method of<br />
selecting the words in the topical context, it is clear that this type of context provides cues to the<br />
thematic domain that the target word occurs in. The topical context does not, however, provide a<br />
means of finding words that are semantically similar to the target word. No close synonyms are<br />
retrieved, but rather words belonging to the same discourse domain as the target word.<br />
(2- 16)<br />
kvinne                woman<br />
funn (subst)          finding (noun)<br />
funnet (partisipp)    found (participle)<br />
død                   dead<br />
Førde                 Førde<br />
politi                police<br />
leteaksjon            search party<br />
2.2.2.2 Local context<br />
Local context provides a more finely tuned way of looking at semantic similarities as expressed<br />
through the distribution of words in a text. In its simplest form, a word’s local context consists<br />
of its immediately surrounding words; that is, the words immediately preceding and following a<br />
target word. The notion of local context can also be extended to include information about<br />
syntactic and grammatical properties that belong to the target word and its immediate<br />
neighbours. For example, a target word’s local context can be seen as its subject and object, or<br />
as the adjective preceding it.<br />
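In its simplest form, local-context extraction can be sketched as recording the immediate neighbours of each occurrence of a target word. The function below is an illustrative simplification; richer variants would record syntactic relations instead of raw adjacency.<br />

```python
def local_contexts(tokens, target):
    """Return the immediate (preceding, following) word pair for each
    occurrence of `target`; None marks a document edge."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok == target:
            prev = tokens[i - 1] if i > 0 else None
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            pairs.append((prev, nxt))
    return pairs

print(local_contexts("the witness saw the suspect".split(), "witness"))
# → [('the', 'saw')]
```

Two words that repeatedly yield the same such pairs are, in Lin’s (1997) sense, likely to have similar meanings.<br />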
Several studies show that classifying words based on the local context in which they occur gives<br />
information about the semantic meaning of the words, rather than about their membership of a<br />
thematic domain, as found when examining the topical context (Hindle 1990; Grefenstette 1992;<br />
Lin 1998; Lin and Pantel 2001; Pereira et al. 1993; inter alia). This indicates that access to<br />
features within a word’s local context can contribute to saying something about the meaning of<br />
the word and can ultimately act as a foundation for the formation of concept groups of<br />
semantically similar words. Distributional representations based on a word’s local context are<br />
thus useful for measuring the semantic similarity of words. Lin (1997) exploits this in an algorithm<br />
for word sense disambiguation and states that local context gives crucial clues about the<br />
meaning of a word, following the intuition that:<br />
“Two different words are likely to have similar meanings if they occur in identical local<br />
contexts.” (Lin 1997, p 64).<br />
Since the local context can comprise syntactic and semantic information, it provides a means to<br />
access different information relevant to the type of analysis to be performed on the<br />
material. Several approaches describe methods for finding similar nouns based on the<br />
distributional patterns of words in the local context (Hindle 1990; Grefenstette 1992; Lin 1998;<br />
Lin and Pantel 2001; Pantel and Lin 2002; Pereira et al. 1993; inter alia). These methods classify<br />
words in accordance with their distributional patterns, not using hand-coded semantic<br />
knowledge as a basis, but rather inferring the required knowledge from a text corpus as part of<br />
the analysis process. The approaches adopt different methods for judging the similarity of<br />
words. Below, some of the approaches to finding similar words are described briefly; the<br />
similarity metrics, however, will not be discussed in this outline.<br />
Hindle (1990) shows that the contextual distribution of words provides a useful semantic<br />
classification, even when the classification process is automated with no human<br />
intervention. His method examines predicate-argument structures in a large corpus and<br />
automatically classifies nouns into semantically similar sets on the basis of the predicates they<br />
combine with. The similarity between nouns is measured as being a function of mutual<br />
information estimated from the text. Hindle’s results show that semantic relatedness can be<br />
derived from the distribution of syntactic forms (Hindle 1990, p. 274). This is a similar approach<br />
to the one taken in the present work, if on a substantially smaller scale. Hindle (1990) addresses<br />
the data sparseness problem by estimating the probability of an unseen event by comparing it to<br />
similar events which have been seen. Grefenstette (1992) presents a method which looks for<br />
context patterns in large domain-specific corpora and finds similar words relative to how a target<br />
word is used in a specific text or domain. His program SEXTANT uses syntactically derived<br />
contexts and estimates the similarity of two words by considering the overlapping of all the<br />
contexts associated with them over a large corpus (Grefenstette 1992, p. 325). As a result, a<br />
word’s context consists of all the words co-occurring with it in the corpus. Pereira et al. (1993)<br />
also report a method for clustering words according to their distributions in given syntactic<br />
contexts. In their approach, nouns are classified based on their syntactic relations to predicates in<br />
the corpus. The method enables the automatic derivation of classes of semantically similar<br />
words from a text corpus and produces clusters the authors term “intuitively informative”<br />
(Pereira et al. 1993, p. 190). Lin and Pantel (2001) present the unsupervised algorithm UNICON<br />
for the creation of groups of semantically similar words. Their approach examines collocation<br />
patterns consisting of dependency relationships, and employs a method for selecting significant<br />
collocation patterns. Those dependency relations which occur more frequently than if the words<br />
were independent of each other are selected as collocation patterns. This approach is further<br />
developed in Pantel and Lin (2002). Here, clusters which are relatively semantically different<br />
are initially identified and a subset of the cluster members are used to create so-called centroids,<br />
which represent the average features of the subsets. Subsequently new words are assigned to<br />
their most similar clusters. A word can be assigned to several clusters, each cluster<br />
corresponding to a sense of the word.<br />
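As an illustration of the kind of association measure these approaches build on, a Hindle-style pointwise mutual information score over (verb, noun) pairs can be sketched as follows. The pair counts are invented toy data for the crime domain, and the sketch shows only the general idea, not Hindle’s exact procedure or similarity metric.<br />

```python
from collections import Counter
from math import log2

# Invented (verb, first-argument) pairs; toy data, not a real corpus.
pairs = [("kill", "murderer"), ("kill", "man"), ("kill", "murderer"),
         ("question", "witness"), ("question", "witness")]

pair_counts = Counter(pairs)
verb_counts = Counter(v for v, _ in pairs)
noun_counts = Counter(n for _, n in pairs)
total = len(pairs)

def pmi(verb, noun):
    """Pointwise mutual information:
    log2( P(verb, noun) / (P(verb) * P(noun)) ).
    Positive values mean the pair co-occurs more often than chance."""
    p_vn = pair_counts[(verb, noun)] / total
    return log2(p_vn / ((verb_counts[verb] / total) * (noun_counts[noun] / total)))
```

Nouns that score high with the same set of verbs can then be grouped as semantically similar, which is the intuition the clustering approaches above formalise in different ways.<br />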
2.2.3 Context and selectional restrictions<br />
In the above it has been argued that a given word will tend to co-occur with a limited class of<br />
other words, and that this information can be exploited to find words that are similar in meaning.<br />
One of the reasons for this expected occurrence of similar words in similar contexts, is that<br />
predicates to a certain extent limit the semantic properties of the arguments that they can<br />
combine with. This behaviour is captured through the notion of selectional restrictions, which<br />
define how a predicate restricts the class of arguments that can combine in a specific position<br />
with it. Selectional constraints allow a predicate to specify semantic restrictions on its arguments<br />
(Jurafsky and Martin 2000, p. 512). This accounts for the intuition that only a certain class of<br />
words can occur in a specific argument position to a given predicate. In the case of a verb such<br />
as avhøre (interrogate, take statement from) a possible selectional restriction for the first<br />
argument could be that it must represent a human. Jurafsky and Martin formulate it like this:<br />
interrogate restricts the constituents appearing as the first argument to those whose underlying<br />
concepts can actually partake in an interrogation (Jurafsky and Martin 2000, p. 512, slightly<br />
modified).<br />
More nuanced intuitions of selectional restrictions can be obtained by combining the knowledge<br />
of distribution in context and that of semantic restrictions placed on arguments by the predicate.<br />
This thesis applies a practical approach in order to find properties of the selectional restrictions<br />
of predicates within a limited thematic domain. Without aiming at formulating a comprehensive<br />
list of the selectional restrictions that apply within the domain in question, it is possible to obtain<br />
a list of examples that illustrate certain properties of the selectional restrictions. This is an<br />
extensional approach; by examining the structure of a set of arguments that all occur in the same<br />
contextual environment, for example as the first argument of a certain predicate, it is possible to<br />
draw certain conclusions about the selectional restrictions placed by the predicate. The aim of<br />
this project is not to define the selectional restrictions of the predicates in the data set, but rather<br />
to collect a list of examples of valid restrictions for the domain and examine them. It is obvious<br />
that selectional restrictions also vary across thematic domains; the permitted first<br />
arguments of a predicate will be very different in a formal text and in a fairy tale for children.<br />
This is again the intuition outlined in the first section of this chapter: words are used in different<br />
ways depending on the thematic domain they occur in. The distribution that classes of<br />
semantically similar arguments show within a limited domain may therefore very well be seen<br />
as a type of selectional constraint. To exemplify this, consider the domain used in the present<br />
work: newspaper texts concerning a criminal case. Of the constructed phrases in<br />
(2-17), the first two are valid for the domain in the sense that they exemplify structures which<br />
are found in the data set, while the third violates the selectional constraints assigned by<br />
the verb within this particular thematic domain. In the event of a killing within the domain in<br />
question, a perpetrator or a man can be expected to hold the thematic role of actor, but the<br />
data material contains no instance of a student initiating this action.<br />
(2- 17)<br />
gjerningsmannen drepte kvinnen<br />
the perpetrator killed the woman<br />
mannen drepte kvinnen<br />
the man killed the woman<br />
studenten drepte kvinnen<br />
the student killed the woman<br />
The example above illustrates how context is used within the present work to formulate a notion<br />
of selectional restrictions. These can later be used to say something about which argument can<br />
be expected to feature in a specific contextual environment, and thus function as a type of real-world<br />
knowledge for the domain of the text collection. With specific reference to anaphora<br />
resolution, these selectional restrictions can be used to give an indication of the most likely<br />
antecedent for an anaphor which would normally require access to real-world knowledge in<br />
order to be resolved.<br />
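How such domain-derived restrictions might act as a referent-guessing helper can be sketched as follows: among the candidate antecedents of a pronoun, prefer the one most often observed in the argument slot the pronoun occupies. The EPAS list, function names and candidate sets are illustrative assumptions, not the thesis’s actual data or system.<br />

```python
from collections import Counter

# Invented (predicate, argument1, argument2) triples for the crime
# domain; None marks a missing argument.
epas = [("kill", "perpetrator", "woman"),
        ("kill", "man", "woman"),
        ("question", "police", "witness"),
        ("arrest", "police", "man")]

def rank_candidates(predicate, slot, candidates):
    """Order candidate antecedents by how often they fill `slot`
    (1 or 2) of `predicate` in the observed EPAS."""
    counts = Counter()
    for pred, a1, a2 in epas:
        if pred == predicate:
            filler = a1 if slot == 1 else a2
            if filler is not None:
                counts[filler] += 1
    return sorted(candidates, key=lambda c: -counts[c])

# "He killed the woman": which candidate is "he" more likely to be?
print(rank_candidates("kill", 1, ["student", "man"]))
# → ['man', 'student']
```

This mirrors the intuition of (2-17): a man has been seen as the first argument of kill in the domain, a student has not, so the man is the preferred antecedent.<br />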
3 From text to EPAS – the extraction method<br />
This chapter describes the extraction method used in this project. The method extracts EPAS<br />
(elementary predicate-argument structures) from a text corpus consisting of newspaper texts<br />
collected from the internet.<br />
3.1 Selecting the texts<br />
Specifying the requirements for a suitable text collection is not as trivial as it may seem. To<br />
make sure that the extracted EPAS would produce semantically valid results when classified, the<br />
texts from which the structures were extracted had to fulfil certain requirements. Since the<br />
classification builds on the distributional hypothesis and relies on EPAS whose distribution is<br />
particular to a restricted domain, the most important specification for the texts was initially that<br />
they all had to belong to the same thematic domain. As such, the main focus of the requirements<br />
specification for the text collection was that of one closed thematic domain. But how exactly<br />
does one define the notion of a thematic domain? The first test set collected for the project<br />
consisted of factual prose texts dealing with roughly the same field. These texts, however,<br />
proved to be quite unsuitable for the later analysis, for reasons that will be explained in the<br />
following.<br />
It is clear that certain specifications must be fulfilled in the text collection from which the EPAS<br />
are derived. Texts displaying longer discourse chains are most suitable for the purpose of this<br />
project. One thematic domain must be described over several paragraphs, or preferably over the<br />
entire course of the discourse in the text. In order to extract the desired information from the<br />
texts and subsequently test if useful information has been extracted, the presence of anaphora, or<br />
referring expressions, in the text is needed. This entails the need for pronouns in particular. As<br />
such, texts containing discourse with a certain amount of concrete content were particularly<br />
useful for my purpose.<br />
Texts that are too vague, both with regard to their textual content and to their membership of a<br />
particular thematic category, were not suitable for the purpose of this project because they do<br />
not contain the type of theme-specific selectional constraints we are interested in extracting. One<br />
recurring problem in the text collections tested for the project was that too little of the<br />
information was expressed in running text, with much of it presented in bullet lists, tables and<br />
other similar constructions. This information was only accessible after manual editing of the<br />
texts, and even then it was often not useful, since precisely the desirable discourse chains are<br />
avoided by this type of textual shorthand. The information present in bullet<br />
lists and tables in the unedited text is most often not formulated in well-formed sentences, and<br />
the use of referring expressions and pronouns is usually avoided. Such texts are also not<br />
immediately suitable for parsing, making it complicated to extract EPAS (semi-)<br />
automatically.<br />
As mentioned above, selecting the texts to be analysed and creating the text collection to be the<br />
basis of the classifications in the project was a task not to be underestimated. Several<br />
different types of texts were experimented with in an attempt to find a text type that satisfied<br />
the following criteria, in addition to being available for collection on the internet:<br />
• Limited and naturally confined thematic domain<br />
• Relatively long chains of discourse<br />
• Fairly high occurrence of anaphora, pronouns in particular<br />
• Several paragraphs where the same phenomenon is discussed<br />
• Low occurrence of tables and illustrations; ideally, all the information in the texts should<br />
be expressed in complete and grammatical sentences<br />
The text type that fulfilled these criteria to the highest degree was news text. By picking<br />
newspaper articles that all concerned the same theme, the criterion of a limited domain was<br />
satisfied. The articles, as provided on the internet, additionally fulfilled all the other<br />
requirements set for the text collection. For this project, articles concerning a<br />
criminal case in the small town of Førde on the west coast of Norway were chosen, mainly because<br />
this was a very big case in the Norwegian newspapers and a large number of articles were<br />
written on the subject. The articles were selected from the newspaper Verdens Gang (VG) in<br />
June and July 2004.<br />
3.2 Predicate-argument structures<br />
"Not the same thing a bit," said the Hatter. "Why, you might as well say that 'I see what I eat' is<br />
the same thing as 'I eat what I see'." from Alice in Wonderland by Lewis Carroll.<br />
For the purposes of the subsequent classification phase, a meaning representation that would not<br />
allow for ambiguity or vagueness was desirable. Using the term EPAS, rather than referring to<br />
the verb and its subject and object, contributes to normalising and generalising the data. The<br />
motivation for choosing elementary predicate-argument structures, or EPAS, as the<br />
representation of the meaning structures in the text collection will be explained in the following.<br />
By choosing EPAS as meaning representation, the focus of the structure is the verbal predicate.<br />
Instead of structuring the semantic representations extracted from the texts according to the<br />
grammatical roles and the formal function each word holds in the sentence, we look at how the<br />
verbal predicate combines with arguments. This is closely related to the idea of thematic roles,<br />
where the focus is on which roles the entities in a sentence occupy. It is suggested that “verbs<br />
must have their thematic role requirements listed in the lexicon” (Saeed 1997, p. 140) and as<br />
such that each verb has a predetermined set of possible argument frames. Thematic roles cover<br />
the various roles the entities in a sentence can occupy. Using<br />
Saeed’s hierarchy of thematic roles, the agent is the initiator of action, while the patient and the<br />
theme are the entities an action is performed on. For Norwegian and English, there is a tendency<br />
for subjects to be agents and direct objects to be patients and themes (Saeed 1997, p. 145). This<br />
tendency can be altered by the speaker as a result of stylistic choice or desire to alter the<br />
information structure, for example by using passive verbal voice. The assignment of thematic<br />
roles to particular positions in a sentence is closely connected to the hierarchical structure of the<br />
thematic roles. There is a hierarchy of defined thematic roles for each sentence position; the<br />
hierarchy in (3-1) exemplifies the preferred order of roles in subject position (Saeed 1997, p.<br />
146).<br />
(3- 1)<br />
agent > recipient/benefactive > theme/patient > instrument > location<br />
The structuring of a semantic representation into predicates with associated arguments does<br />
not, however, express exactly the same information as the assignment of thematic roles does.<br />
When using predicate-argument structures, the definitions of argument 1 and argument 2<br />
presuppose the existence of an underlying semantic hierarchy which defines the roles of agent<br />
and patient. For example, argument 1 can be defined as always representing the agent of the<br />
sentence. An important distinction between the predicate-argument paradigm and that of<br />
thematic roles is that a classification using agent/patient as the core of the semantic<br />
representation does not focus on the predicate and its associated arguments. Conversely, in a<br />
predicate-argument classification, the definition of the individual arguments does not directly<br />
consider the thematic roles. Since a semantic hierarchy has more finely defined roles than can<br />
be expressed with argument positions, different instances of argument 1 will not have exactly<br />
the same semantic role; the role depends on the predicate they co-occur with.<br />
For the purpose of processing the extracted structures, it is useful that the structures are in a<br />
simplified form, and also that structures which convey the same semantic information but are<br />
expressed differently in the syntax are represented with the same structure. This is in<br />
alignment with the doctrine of canonical form, which states the usefulness of letting linguistic<br />
constructions which display the same meaning content give rise to the same meaning<br />
representation (Jurafsky and Martin 2000, p. 507). By using a normalised form of<br />
representation, such as EPAS, for the structures extracted from the text, as well as founding the<br />
extraction method on semantic representations of the analysed texts, the generation of a<br />
generalisable data set is achieved. Active and passive constructions with equivalent semantic<br />
meaning will be treated in the same way and will receive identical meaning representations. This<br />
can be seen in the following example:<br />
(3- 2)<br />
a. Morderen drepte kvinnen.<br />
The murderer killed the woman.<br />
b. Kvinnen ble drept av morderen.<br />
The woman was killed by the murderer.<br />
c. Kvinnen ble drept.<br />
The woman was killed.<br />
Sentences (3-2a) and (3-2b) in essence convey the same information and differ only in<br />
verbal voice. The use of diathesis alternations of active and passive voice gives the<br />
speaker flexibility with regard to the relationship between grammatical structure and thematic<br />
roles. The use of passive versus active voice does not really alter the semantic content of a<br />
sentence, but represents a difference in its information structure. Differences of<br />
this kind are not relevant for the present purposes, and will not be reflected in the extracted<br />
structures. The word kvinnen (the woman) has different syntactic roles in the active sentence in<br />
(3-2a) and the passive sentence in (3-2b), but has the same thematic role. Regardless of the fact<br />
that the phrase kvinnen (the woman) in (3-2b) represents a subject, while it represents an object<br />
in sentence (3-2a), both expressions have the thematic role of patient, the entity acted upon.<br />
For the purposes of the present work, both sentences will be represented by the single EPAS<br />
shown in (3-3):<br />
shown in (3-3):<br />
(3- 3)<br />
Predicate   Argument 1   Argument 2<br />
drepe       morder       kvinne<br />
(kill)      (murderer)   (woman)<br />
Sentence (3-2c) is in passive voice and the subject from the active-voice sentence (3-2a) is not<br />
present. The formal subject of the sentence, “woman”, is logically a patient, and refers to the<br />
entity on which the activity of killing is performed. As such, this sentence will be represented by<br />
an EPAS which lacks its argument 1:<br />
(3- 4)<br />
Predicate   Argument 1   Argument 2<br />
drepe       ?            kvinne<br />
(kill)                   (woman)<br />
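The normalisation over verbal voice illustrated in (3-2) to (3-4) can be sketched as a single mapping step. The flat clause representation (verb, subject, object-or-agent, voice) is a simplification assumed here for illustration; the thesis derives the structures from full syntactic parses instead.<br />

```python
def to_epas(verb, subject, obj, voice):
    """Return (predicate, argument1, argument2), with argument1 the
    logical agent regardless of verbal voice."""
    if voice == "active":
        return (verb, subject, obj)  # the subject is the agent
    # passive: the subject is the patient; the by-phrase agent (if any)
    # is passed in as obj and becomes argument 1, or None if absent
    return (verb, obj, subject)

# (3-2a), (3-2b) and (3-2c) all map onto the structures in (3-3)/(3-4):
assert to_epas("drepe", "morder", "kvinne", "active") == ("drepe", "morder", "kvinne")
assert to_epas("drepe", "kvinne", "morder", "passive") == ("drepe", "morder", "kvinne")
assert to_epas("drepe", "kvinne", None, "passive") == ("drepe", None, "kvinne")
```

The active and the full passive sentence thus receive one and the same EPAS, realising the canonical form discussed above.<br />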
Extraction methods that are not based on a syntactic parse of the original texts do not have<br />
access to the semantic relations within a sentence. This means that such methods must rely on more<br />
superficial structures, such as part-of-speech tags, and will not have the same degree of accuracy,<br />
or finesse, in the actual extraction of the meaning structures. Since the present work aims at<br />
providing results that can be useful as part of an anaphora resolution system, it is particularly<br />
important that the results obtained can be generalised as much as possible. Especially since the<br />
individual elements of the extracted structures will be used in subsequent processes, it is<br />
highly important that they do not contain errors or irregularities resulting from the extraction<br />
process. To be as useful as possible, the meaning structures should be normalised and<br />
generalisable.<br />
The examples above show how normalisation through the use of EPAS realises the concept of<br />
canonical form to some degree and seems particularly useful for the purpose of the present<br />
work. If grammatical relations such as subject and object were used as reference points, semantically<br />
equivalent sentences, such as (3-2a) and (3-2b), would be given different meaning structures due<br />
to the difference in verbal voice. Structuring the meanings conveyed by the sentences in (3-2)<br />
within a grammatical-relations paradigm would make it necessary to mark the verbal voice as<br />
well as the grammatical relations. In addition, active and passive structures would have to be<br />
treated differently in the subsequent analysis. Basing the extraction merely on syntactic<br />
properties of the sentences in the corpus would make the extracted material very difficult to<br />
classify, mainly because similar meanings would be represented differently.<br />
The advantages of a normalised and generalisable data set are further clarified by the following<br />
example. Upon a simple grammatical analysis, the sentences shown in (3-2) can be categorised<br />
based on the syntactic roles predicate, subject and object. The result of such a classification is<br />
shown in examples (3-5) and (3-6):<br />
(3- 5)<br />
   predicate   subject    object<br />
a. drepe       morder     kvinne<br />
   kill        murderer   woman<br />
b. drepe       kvinne     morder<br />
   kill        woman      murderer<br />
c. drepe       kvinne     ?<br />
   kill        woman<br />
The structures in (3-5) above can be extracted after part-of-speech tagging of the sentences in<br />
(3-2). The active and passive predicate receive the same structure, and as no semantic<br />
information is available, the arguments are structured in accordance with their status as<br />
subject or object. Attempting to classify these subjects and objects based on their co-occurrence<br />
with the predicate produces groupings of words which are not directly generalisable. Murderer<br />
and woman occur together both in subject and object position, which does not reflect the preferred<br />
selectional constraints within the domain.<br />
(3- 6)<br />
   predicate   subject   object<br />
a. drepe       morder    kvinne<br />
   kill        murderer  woman<br />
b. drepes      kvinne    morder<br />
   is-killed   woman     murderer<br />
c. drepes      kvinne    ?<br />
   is-killed   woman<br />
Example (3-6) provides a more elegant structuring. Because an extraction method based on<br />
syntactic relations is unable to generalise over verbal voice, two separate predicates are<br />
extracted, one for the passive and one for the active voice. Even though, logically, the same<br />
action is performed on the entity the woman in all the sentences in (3-6), a method as outlined<br />
above would not allow for a straightforward interpretation of this. The generalisation between<br />
active and passive versions of the same sentence is lost in such an approach. This results in<br />
a higher number of predicates, and therefore in a less generalisable data material. Results as<br />
outlined above would likely also be of less use as a referent-guessing helper in an anaphora<br />
resolution system, precisely because of the lower level of generalisability.<br />
3.2.1 What is represented in the EPAS?<br />
Jurafsky and Martin (2000, p. 510) state that all languages have predicate-argument structures at<br />
the core of their semantic structure. They further describe that the grammar organises the<br />
predicate-argument structure and selectional constraints restrict how other words and phrases<br />
can combine with a given word. In this project, a simplified version of predicate-argument<br />
structures is used as meaning representation. The EPAS, or meaning representations, are limited<br />
to at most two nominal arguments. Either of the arguments in an EPAS may<br />
be empty or unidentified. This means that the EPAS extracted from my texts will belong to one of<br />
the following three patterns:<br />
(3- 7)<br />
a. predicate, argument 1, argument 2<br />
b. predicate, argument 1, ?<br />
c. predicate, ?, argument 2<br />
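The three patterns in (3-7) can be captured by a minimal data structure with optional argument slots. The class name, field names and pattern method below are assumptions introduced for illustration, not part of the thesis’s implementation.<br />

```python
from typing import NamedTuple, Optional

class EPAS(NamedTuple):
    """A predicate with at most two nominal arguments; either argument
    slot may be empty (None), as in patterns (3-7b) and (3-7c)."""
    predicate: str
    arg1: Optional[str] = None
    arg2: Optional[str] = None

    def pattern(self) -> str:
        """Which of the three patterns in (3-7) this structure matches."""
        if self.arg1 and self.arg2:
            return "a"
        return "b" if self.arg1 else "c"

assert EPAS("drepe", "morder", "kvinne").pattern() == "a"
assert EPAS("opplyse", "vitne").pattern() == "b"
assert EPAS("drepe", arg2="kvinne").pattern() == "c"
```

Representing the structures as tuples in this way also makes the later association step straightforward, since arguments can be grouped by the (predicate, slot) pair they occur in.<br />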
The reason for letting the EPAS consist of a maximum of two arguments is not primarily a<br />
principled decision, but rather emerged from the empirical material from which the data<br />
structures were collected. When extracting EPAS from the data collection, the resulting<br />
structures consisted of a predicate with maximally two arguments. It is probable that predicates<br />
with more than two arguments are generally less frequent, and since my data collection is quite<br />
small, such occurrences do not feature in it.<br />
Only nominal arguments are featured in the EPAS, entailing that sentences with a nominal<br />
clause as object will be extracted as an EPAS lacking argument 2. This is clarified by the<br />
examples below:<br />
(3- 8)<br />
Et vitne opplyste at hun hadde hørt høye rop.<br />
A witness informed that she had heard loud screams.<br />
The sentence in example (3-8) above will yield the following three EPAS:<br />
(3- 9)<br />
a. høre, vitne, rop<br />
hear, witness, scream<br />
b. høy, rop, ?<br />
loud, scream, ?<br />
c. opplyse, vitne, ?<br />
inform, witness, ?<br />
(3-9c) does not display an argument 2 despite the fact that the original sentence has a nominal<br />
clause as object. The main reason for this choice of representation is that the subsequent<br />
classifying phase aims at creating classes of nominal arguments, based on the verbs they co-<br />
40
occur with. Arguments which are unlikely to represent relevant and interesting information for<br />
our classification are therefore omitted. In cases where a verbal predicate takes a nominal clause<br />
or a sentence as its argument, it is unlikely that the predicate selects the argument based on (for<br />
us) semantically interesting selectional restrictions. A verb can restrict its selection of an<br />
argument, which makes it possible to say something about that argument based on the<br />
environment it occurs in. The same restrictions cannot be expected to apply when the verb takes<br />
a sentence as its argument. As a consequence, the meaning<br />
representations are limited to dealing with arguments which can be represented by single<br />
symbols and which do not refer to clauses or sentences.<br />
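The three patterns in (3-7) and the structures in (3-9) can be illustrated as simple triples, with the empty slot marked explicitly. The sketch below is a Python illustration of the representation, not the actual data format used in the project:<br />

```python
# An EPAS as a (predicate, argument 1, argument 2) triple;
# None marks an empty/unidentified argument slot.
epas = [
    ("høre", "vitne", "rop"),     # hear, witness, scream
    ("høy", "rop", None),         # loud, scream, ?
    ("opplyse", "vitne", None),   # inform, witness, ?
]

def pattern(e):
    """Classify an EPAS as pattern a, b or c from (3-7)."""
    _, a1, a2 = e
    if a1 is not None and a2 is not None:
        return "a"   # predicate, argument 1, argument 2
    if a1 is not None:
        return "b"   # predicate, argument 1, ?
    return "c"       # predicate, ?, argument 2

print([pattern(e) for e in epas])  # prints ['a', 'b', 'b']
```

Marking the empty slot explicitly keeps all EPAS uniform, so the same matching procedures can be applied regardless of which arguments are present.<br />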
In order to extract all EPAS present in the texts, it will also be necessary to extract non-verbal<br />
predicates. These will generally speaking correspond to adjective-noun combinations in the text.<br />
This way, phrases of the type “the statement is important” or “the important statement” will both<br />
produce the following EPAS:<br />
(3- 10)<br />
Predicate Argument 1 Argument 2<br />
important statement ?<br />
The process of extracting the EPAS is a challenging part of the project, especially since the<br />
available tools for Norwegian are not robust enough to make this a trivial and straightforward<br />
task. Regardless of which method is used to extract the EPAS, it is evident that a large part of the<br />
work on this project must be dedicated to developing a suitable method for their extraction. For<br />
several reasons, it is desirable to develop an extraction method that is as<br />
automatic as possible. Most importantly, such a method saves a lot of time, but another<br />
important aspect is that more manual extraction methods could easily become subjective and<br />
less systematically consistent. The next two sections will discuss the task of extracting the EPAS in<br />
more detail.<br />
3.3 Parsing with NorGram<br />
To be able to extract the EPAS from the text in a semi-automatic fashion, some sort of linguistic<br />
analysis of the texts is needed. One problem with working on a small language like Norwegian<br />
is that the linguistic tools one might need in the process are simply not fully developed yet. Velldal<br />
(2003) describes a project where a set of Norwegian nouns are grouped into semantic classes<br />
based on their distribution over a large body of text. A word’s distribution in different contexts<br />
is represented as a feature vector in a semantic space model. In his project, Velldal notes that no<br />
syntactic parser exists for Norwegian; instead, he uses a shallow processing tool on a tagged<br />
corpus. The<br />
processing tool “translates” the tagged structures into predicate-argument structures, overcoming<br />
the need for a parser by only analysing those parts of the text relevant for the extraction of the<br />
needed structures. As has been explained in section 3.2, an extraction method that is based on<br />
surface structures and does not take semantic relations into account might produce results that<br />
are unsuitable both for subsequent use in anaphora resolution and for generalisation of concepts.<br />
In view of this, the present work has aimed at developing an extraction method that uses parsed<br />
text to collect the meaning structures from the text.<br />
Although it is true that there does not exist any parser that fully covers the Norwegian language<br />
at the moment, there are a few alternative parsers available. Even if these grammars are not<br />
robust enough to return parses on randomly chosen texts, they can be used for the<br />
experiments outlined in this project. The extraction method described in this thesis employs<br />
one of the existing parsing tools for Norwegian bokmål, NorGram (NorGram 2004).<br />
Since there are no easy-to-use automated tools available for use in the extraction process,<br />
obtaining the EPAS from the text involved a substantial amount of manual work, even when<br />
using a parser to automate the extraction. Parsing the texts was definitely of value, though, since<br />
once the texts were parsed and there was a syntactic analysis to work on, the EPAS could more<br />
readily be extracted. Because of the modular nature of the extraction method, the extraction<br />
process is not parser-dependent. Should a new and more robust grammar become available, the<br />
extraction method can be modified to accommodate this. The next section of this chapter briefly<br />
describes how the NorGram/XLE parser was used in the project, while section 3.3.2 describes in<br />
greater detail how the EPAS were extracted from the parser’s output.<br />
3.3.1 NorGram in outline<br />
Norsk komputasjonell grammatikk (NorGram) is a computational grammar for Norwegian<br />
bokmål. NorGram is based on the unification-based grammar formalism Lexical Functional<br />
Grammar (LFG), where language is described by means of feature structures that can be<br />
combined in the process of unification. Researchers involved in the NorGram project cooperate<br />
with researchers at Palo Alto Research Center (PARC), formerly Xerox PARC, who have<br />
developed a well-functioning platform for the development of large-scale computational<br />
grammars. This system is called Xerox Linguistic Environment (XLE) and uses LFG as its<br />
theoretical linguistic framework. As such, NorGram can be said to be an LFG grammar for<br />
Norwegian, while XLE is an implementation of the LFG formalism.<br />
The NorGram grammar combined with an XLE-module is a relatively broad parser that can<br />
analyse most structures found in Norwegian. It was chosen for the purposes of this project<br />
because it was likely to return successful parse trees of a large part of the sentences found in the<br />
text collections. NorGram’s lexicon is quite large and includes entries of most regular<br />
Norwegian words. One problem with the lexicon with regard to the text collections used for<br />
this project is that it contains relatively few compounds. All theme-specific texts feature a<br />
theme-specific vocabulary, sometimes with words (especially compound nouns) that cannot be<br />
expected to be found in ordinary dictionaries. This was also the case for the text collection in<br />
this project. Compound nouns represented the largest group of words added to the lexicon. In<br />
Norwegian, one is fairly free to form compounds consisting of words that can also exist<br />
individually with their own meaning. Whereas in English such compounds are written as<br />
two separate words, for example police investigator, in Norwegian they together form a new<br />
noun, for example politietterforsker (police investigator). This opens up a potentially<br />
infinite class of nouns and makes it virtually impossible to include all possible words in any<br />
lexicon.<br />
The NorGram lexicon was extended in order to be used as a tool to extract the EPAS from the<br />
text collection. Compounds and proper nouns that were part of sentences to be analysed were<br />
added to the lexicon files. To ensure that all EPAS could successfully be extracted, all sentences<br />
that were not parsed were examined to identify the word that represented the problem.<br />
Subsequently, that word was added to the lexicon. A more elegant way to solve the compound<br />
issue would be to make use of a module that splits compounds into the individual words they<br />
consist of, or to make use of a component that predicts the part of speech of an unknown word.<br />
One solution that would have been suitable for the purposes of the present work would be to<br />
assume that all unknown words were nouns. However, due to the small size of the corpus used<br />
for this project, none of these strategies were implemented, and unknown words were added to<br />
the lexicon by hand.<br />
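The unknown-words-as-nouns strategy mentioned above could be sketched roughly as follows; the lexicon and token list are made-up illustrations, not actual NorGram lexicon entries:<br />

```python
# Sketch of the "treat unknown words as nouns" fallback discussed above.
# The lexicon and tokens are invented illustrations, not NorGram data.
lexicon = {"politi": "noun", "lete": "verb", "etter": "prep"}

def guess_entries(tokens, lexicon):
    """Propose a noun entry for every token missing from the lexicon."""
    return {t: "noun" for t in tokens if t not in lexicon}

tokens = ["politietterforsker", "lete", "etter", "morder"]
new_entries = guess_entries(tokens, lexicon)
print(new_entries)  # {'politietterforsker': 'noun', 'morder': 'noun'}
```

Such a fallback would over-generate for unknown verbs and adjectives, which is why it was left unimplemented here and unknown words were added by hand instead.<br />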
When parsing texts with NorGram and XLE, the user has several choices with regard to the<br />
format of the final syntactic analysis. For example, it is possible to receive partial parses, or to<br />
let the system return all the potential analyses of the input sentence. For the purposes of this<br />
project, I received full parses of each sentence in the text material. In each instance where the<br />
system returned multiple valid parses, I manually checked the alternatives and decided on the<br />
correct one to extract the EPAS from.<br />
3.3.2 Extracting EPAS from NorGram<br />
The output provided by XLE upon a successful parse using the NorGram grammar is<br />
particularly useful for a subsequent extraction of EPAS. NorGram is based on the LFG grammar<br />
formalism and produces constituent structures (c-structures), functional structures (f-structures)<br />
and minimal recursion semantics structures (MRS-structures) upon parsing a sentence. Each of<br />
these outputs can be useful for a subsequent extraction of predicates and their arguments.<br />
The c-structure in LFG is an external structure which displays an ordered representation of the<br />
words in a sentence or phrase (Bresnan 2001, p. 44). In XLE, the c-structure is represented by a<br />
phrase structure tree, where the terminal nodes are fully inflected word forms. F-structures<br />
represent the internal structure of a sentence. On this level, the “syntactic functions are<br />
associated with semantic predicate argument relations” (Bresnan 2001, p. 45). C-structures and<br />
f-structures are different structures, but display parallel information. Figure 3 below shows the<br />
graphical representation of the c- and f-structures for the sentence Politiet leter etter morderen<br />
(The police are looking for the murderer) generated by NorGram.<br />
Figure 3<br />
The most useful structure for the purpose of extracting EPAS from the parse output is the MRS<br />
structure. In comparison to the c- and f-structures, which are more syntactically motivated, the<br />
MRS structure displays the semantic structure within a sentence. In the next section, this<br />
structure is described in greater detail.<br />
3.3.2.1 Minimal Recursion Semantics<br />
Minimal Recursion Semantics (MRS), developed by Copestake et al. (2003), is a framework for<br />
computational semantics, providing a meta-language for describing semantic structures. The<br />
concept of MRS is primarily semantically motivated and aims at preserving the semantic<br />
structures in the input sentence. MRS allows for expressive adequacy, ensuring that the<br />
linguistic meanings conveyed by a sentence are expressed correctly in the semantic structure.<br />
The primary unit within the framework of MRS is the elementary predication (EP). An EP is a<br />
single relation with associated arguments and will generally speaking correspond to a lexeme<br />
with its argument roles filled. Since MRS provides a “flat” representation where the EPs are<br />
never nested within each other, semantically irrelevant implicit information about the syntactic<br />
structure of a phrase is avoided. The simple principle is that each EP has a “handle” which<br />
identifies it as belonging to a particular tree node and argument positions in EPs can be filled<br />
with handles which correspond to the EPs that belong immediately under it in the tree structure.<br />
More than one EP with the same handle entails that the EPs are conjoined and on the same node<br />
in the structure. Tree structure in this sense does not refer to the c-structure, but to an abstract<br />
structure which shows the hierarchical representation of the EPs.<br />
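The handle mechanism can be illustrated with a small sketch: EPs sharing a handle sit on the same node and are conjoined, as is the case for a verb and a preposition it selects. The handle names and relations below are invented for illustration and simplify the MRS formalism considerably:<br />

```python
from collections import defaultdict

# Each EP: (handle, relation, args). EPs sharing a handle are conjoined;
# an argument slot filled with a handle points at the EPs under that node.
# Handles and relations here are invented; this is not full MRS.
eps = [
    ("h1", "lete", {"ARG1": "x1", "ARG2": "x2"}),
    ("h1", "etter", {"ARG1": "e1"}),   # same handle: conjoined with 'lete'
    ("h2", "politi", {"ARG0": "x1"}),
    ("h3", "morder", {"ARG0": "x2"}),
]

by_handle = defaultdict(list)
for handle, relation, args in eps:
    by_handle[handle].append(relation)

print(by_handle["h1"])  # ['lete', 'etter'] -- conjoined EPs on one node
```

Grouping by handle recovers the abstract tree nodes: the verb and its selected preposition end up on one node, while each nominal EP occupies its own.<br />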
MRS is implemented in NorGram and the MRS structures provided there are the most<br />
convenient of the output structures from the point of view of EPAS extraction. A sentence or<br />
phrase can contain more than one EPAS, and all predicates with their associated arguments are<br />
displayed in the MRS representation. The MRS structures NorGram displays following the<br />
successful parse of a sentence contain all the information needed to extract the EPAS. However,<br />
because of the manner in which they are displayed in the XLE graphical interface, it is not<br />
straightforward for a human to see which arguments belong where and thus identify the EPAS<br />
directly. Only by tracing each individual EP and finding corresponding values<br />
in other EPs, is it possible to extract the sentence’s EPAS. Figure 4 below shows the graphical<br />
representation of the MRS structure for the sentence Politiet leter etter morderen (The police are<br />
looking for the murderer).<br />
Figure 4<br />
3.4 Altering the source<br />
As already mentioned, parsing randomly selected Norwegian texts is not an entirely<br />
straightforward task. Although NorGram provides a quite broad grammar, not all linguistic<br />
constructions are parsed and, more importantly, not all words are covered in the lexicon. Ideally,<br />
it would be desirable to collect a limited domain treebank consisting of parsed sentences of the<br />
original texts as I found them on the internet. In practice, this was not a feasible task. It became<br />
evident early on that the texts to be analysed would have to be simplified for practical reasons.<br />
For the purpose of classification, I needed to extract the EPAS present in the texts. The other<br />
information included in each sentence was not essential or necessary for the project.<br />
Although I was aware that it would be more scientifically sound, and in every respect better, to<br />
extract the EPAS from original texts that I had not tampered with, this was not possible within<br />
the framework of this thesis. Given that I would have to simplify the texts in any case, I decided<br />
to cut most information that was irrelevant for the extraction of the (most central) EPAS. This<br />
process was performed on all sentences in the text collection. Mainly adverbial phrases were<br />
excluded, on the basis that they would not be included in the extracted EPAS in any case. The<br />
example in (3-11) below illustrates a typical case:<br />
(3- 11)<br />
a. Original sentence:<br />
Etter at hun ble funnet opplyste et vitne at hun hadde hørt<br />
høye rop om hjelp fra stedet tidlig søndag morgen.<br />
After she was found a witness informed that she had heard loud screams for<br />
help from the area early Sunday morning.<br />
b. Simplified form:<br />
Et vitne opplyste at hun hadde hørt høye rop.<br />
A witness informed that she had heard loud screams.<br />
c. Extracted structures:<br />
høre,vitne,rop<br />
høy,rop,?<br />
opplyse,vitne,?<br />
hear, witness, scream<br />
loud, scream,?<br />
inform, witness,?<br />
The pre-editing of the text collection will naturally have affected the resulting EPAS list. Not all<br />
structures from the original texts are extracted, and as a consequence the EPAS list does not<br />
include all relevant context patterns for the domain. Still, for the purposes of a pilot study such<br />
as this thesis, the central structures, which display the most typical context patterns for the<br />
domain, include enough information to give an indication of the usefulness of the method. For<br />
the purpose of subsequent analyses, the extraction process can easily be performed on unedited<br />
original texts.<br />
3.5 Finding the words<br />
The process of extracting meaning structures such as the EPAS from the texts in the text<br />
collection is a substantial undertaking. It is also quite a tedious task, and since tedious tasks tend<br />
to benefit from being automated I wrote the Perl script Ekstraktor, which interprets the MRS-<br />
structures of a sentence and thereby puts together the EPAS for each parsed sentence. This<br />
section describes the outline of the automated extraction process.<br />
XLE provides the user with the choice of several output formats, including a graphical user<br />
interface that displays a tree graph of the parse as well as its F-structure and MRS-structure.<br />
Optionally, the output can also be viewed as a file of Prolog predicates. In the process of extracting<br />
the EPAS, Ekstraktor reads the Prolog output, saves relevant information in a system of arrays,<br />
and subsequently performs several tests and actions on the stored information in order to present<br />
a list of all EPAS found in the parsed sentence.<br />
The MRS structures as represented in the Prolog output provide all the information needed to<br />
extract the EPAS. Initially, the main EP with its associated arguments must be found. Since, for<br />
the purposes of this thesis, the linguistic structures analysed are limited to full sentences, the main EP<br />
must display the category ‘v’ for verb. Once the main EP is identified, the semantic values for it<br />
and for its associated arguments must be found. Subsequently, all the remaining predicate-argument<br />
structures must be found. For these, there is no restriction as to which category they<br />
have. Consider the sentence shown in (3-12) together with an extract of the Prolog output of the<br />
parse shown in (3-13):<br />
(3- 12)<br />
(3- 13)<br />
Politiet leter etter morderen<br />
The police are looking for the murderer<br />
cf(1,eq(attr(var(19),'ARG0'),var(20))),<br />
cf(1,eq(attr(var(19),'ARG1'),var(21))),<br />
cf(1,eq(attr(var(19),'ARG2'),var(22))),<br />
cf(1,eq(attr(var(19),'LBL'),var(10))),<br />
cf(1,eq(attr(var(19),'LNK'),14)),<br />
cf(1,eq(attr(var(19),'_CAT'),'p')),<br />
cf(1,eq(attr(var(19),'_CATSUFF'),'sel')),<br />
cf(1,eq(attr(var(19),'relation'),semform('etter',15,[],[]))),<br />
cf(1,eq(attr(var(20),'type'),'event')),<br />
cf(1,eq(attr(var(21),'PERF'),'-')),<br />
cf(1,eq(attr(var(21),'TENSE'),'pres')),<br />
cf(1,eq(attr(var(21),'type'),'event')),<br />
cf(1,eq(attr(var(22),'NUM'),'sg')),<br />
cf(1,eq(attr(var(22),'PERS'),'3')),<br />
cf(1,eq(attr(var(22),'type'),'ref-ind')),<br />
cf(1,eq(attr(var(23),'ARG0'),var(21))),<br />
cf(1,eq(attr(var(23),'ARG1'),var(24))),<br />
cf(1,eq(attr(var(23),'ARG2'),var(22))),<br />
cf(1,eq(attr(var(23),'LBL'),var(10))),<br />
cf(1,eq(attr(var(23),'LNK'),10)),<br />
cf(1,eq(attr(var(23),'_CAT'),'v')),<br />
cf(1,eq(attr(var(23),'_PRT'),'etter')),<br />
cf(1,eq(attr(var(23),'relation'),semform('lete',11,[],[]))),<br />
cf(1,eq(attr(var(24),'NUM'),'sg')),<br />
cf(1,eq(attr(var(24),'PERS'),'3')),<br />
cf(1,eq(attr(var(24),'type'),'ref-ind')),<br />
cf(1,eq(attr(var(25),'ARG0'),var(22))),<br />
cf(1,eq(attr(var(25),'BODY'),var(26))),<br />
cf(1,eq(attr(var(25),'LBL'),var(27))),<br />
cf(1,eq(attr(var(25),'LNK'),18)),<br />
cf(1,eq(attr(var(25),'RSTR'),var(14))),<br />
cf(1,eq(attr(var(25),'relation'),semform('def',31,[],[]))),<br />
cf(1,eq(attr(var(26),'type'),'handle')),<br />
cf(1,eq(attr(var(27),'type'),'handle')),<br />
cf(1,eq(attr(var(28),'ARG0'),var(24))),<br />
cf(1,eq(attr(var(28),'BODY'),var(29))),<br />
cf(1,eq(attr(var(28),'LBL'),var(30))),<br />
cf(1,eq(attr(var(28),'LNK'),0)),<br />
cf(1,eq(attr(var(28),'RSTR'),var(17))),<br />
cf(1,eq(attr(var(28),'relation'),semform('def',9,[],[]))),<br />
cf(1,eq(attr(var(29),'type'),'handle')),<br />
cf(1,eq(attr(var(30),'type'),'handle')),<br />
cf(1,eq(attr(var(31),'ARG0'),var(22))),<br />
cf(1,eq(attr(var(31),'LBL'),var(13))),<br />
cf(1,eq(attr(var(31),'LNK'),18)),<br />
cf(1,eq(attr(var(31),'_CAT'),'n')),<br />
cf(1,eq(attr(var(31),'relation'),semform('morder',19,[],[]))),<br />
cf(1,eq(attr(var(32),'ARG0'),var(24))),<br />
cf(1,eq(attr(var(32),'LBL'),var(16))),<br />
cf(1,eq(attr(var(32),'LNK'),0)),<br />
cf(1,eq(attr(var(32),'_CAT'),'n')),<br />
cf(1,eq(attr(var(32),'relation'),semform('politi1',1,[],[]))),<br />
The Prolog code extract in (3-13) shows the MRS representation of the sentence in (3-12) by<br />
listing all the EPs in the sentence as well as the relationships that hold between the individual<br />
EPs. In simplified terms, the value of the attribute ‘semform’ holds the semantic form of the<br />
predicate, and the values of ‘ARG1’ and ‘ARG2’ point to the EPs where the semantic forms for<br />
argument 1 and argument 2 can be found. In order to extract all EPAS from such a Prolog file,<br />
one must go through all the EPs in turn, and find the semantic forms of each main EP and its<br />
associated argument 1 and argument 2. In the extraction process, this matching and tracing of<br />
values is performed by the script Ekstraktor.<br />
The algorithm behind Ekstraktor is divided into two more or less separate parts: information<br />
retrieval from the Prolog file and processing of the information that was found and stored. Perl<br />
was chosen as the programming language mainly because of its excellent pattern matching<br />
facilities. Perl offers a very powerful and flexible regular expression syntax which lets the<br />
programmer construct regular expressions that will handle all kinds of pattern matching. For the<br />
information retrieval part of Ekstraktor, it was desirable to go through an input file, check for<br />
various patterns and store parts of the input file relevant to how the patterns were matched. (3-<br />
14) shows one of the pattern checks in Ekstraktor – if the line read from the file contains the<br />
string:<br />
'relation'),semform(<br />
the entire line is stored in the array @semform.<br />
(3- 14)<br />
if ($linjeFraFil =~ m/'relation'\),semform\(/) {<br />
    push(@semform, $linjeFraFil);<br />
}<br />
By going through the input file line by line and checking for several patterns, all information<br />
relevant to extracting the EPAS is stored in a system of arrays. To be able to keep track of which<br />
EP the various values belong to, a system of two arrays for each argument type is used – one for<br />
EP number and one for argument value. The ARG0 arrays correspond to the predicates in the<br />
structures and for each, the semantic form can directly be found in the semform-array. The<br />
ARG1 and ARG2 arrays display a value that must be traced before the semantic form can be<br />
extracted. A simplified example of the argument arrays for the sentence in (3-12) is shown in<br />
(3-15):<br />
(3- 15)<br />
ARG0:<br />
EP VALUE<br />
23 21<br />
25 22<br />
28 24<br />
31 22<br />
32 24<br />
ARG1:<br />
EP VALUE<br />
19 21<br />
23 24<br />
ARG2:<br />
EP VALUE<br />
19 22<br />
23 22<br />
To find the EPAS for this sentence, the first EP in the ARG0-array is incorporated in a regular<br />
expression which then is used for pattern matching in the members of the semform-array. If<br />
there exists an entry which matches the pattern, that is, which has an EP-value identical to the<br />
first EP in the ARG0-array, the semantic form is retrieved. To find the belonging arguments 1<br />
and 2, the ARG1 and ARG2-arrays are searched for an EP identical to the one of the predicate.<br />
If such an EP is found, the corresponding value is retrieved – for ARG1 in our example that<br />
would be the value 24. To find the semantic form of this value, we must find the EP where this<br />
value is identical to the value of ARG0, that is, the ARG0-array must again be consulted. When<br />
the EP is found, the semform-array can be pattern matched and the semantic form can be<br />
retrieved. To retrace the example: following such a procedure, the sentence in example (3-16):<br />
(3- 16)<br />
Politiet leter etter morderen<br />
The police are looking for the murderer<br />
is represented with the following EPAS, extracted from the Prolog file of the parse:<br />
(3- 17)<br />
lete-etter,politi,morder<br />
look-for,police,murderer<br />
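The tracing procedure retraced above can be sketched as follows. This is a Python re-illustration of the matching Ekstraktor performs, using the simplified arrays from (3-15) together with the semantic forms and _CAT values from the Prolog file in (3-13); the actual implementation is the Perl script in Appendix B:<br />

```python
# Python re-illustration of the value tracing performed by Ekstraktor.
# Data transcribed from (3-15) and (3-13); not the Perl implementation itself.
semform = {19: "etter", 23: "lete", 25: "def", 28: "def",
           31: "morder", 32: "politi1"}
cat     = {19: "p", 23: "v", 31: "n", 32: "n"}   # _CAT attributes from (3-13)
arg0    = {23: 21, 25: 22, 28: 24, 31: 22, 32: 24}
arg1    = {19: 21, 23: 24}
arg2    = {19: 22, 23: 22}

def semform_for(value):
    """Find the nominal EP whose ARG0 carries this value, and return its form."""
    for ep, v in arg0.items():
        if v == value and cat.get(ep) == "n":
            return semform[ep]
    return None

# The main EP is the one with category 'v'.
main_ep = next(ep for ep, c in cat.items() if c == "v")
pred = semform[main_ep]
a1 = semform_for(arg1.get(main_ep))   # trace ARG1's value back through ARG0
a2 = semform_for(arg2.get(main_ep))   # trace ARG2's value back through ARG0
print((pred, a1, a2))  # ('lete', 'politi1', 'morder')
```

In the full extraction, the particle etter (EP 19, which shares the verb's handle) is additionally folded into the predicate, giving the EPAS shown in (3-17).<br />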
For a detailed walkthrough of Ekstraktor, please consult Appendix A. The program code is<br />
available in Appendix B.<br />
3.6 Evaluation of the data set<br />
The data set created by the extraction process consisted of 195 elementary predicate-argument<br />
structures in its raw form. The original EPAS list was not directly applicable for the next parts of<br />
the project. Not all of the extracted structures on the list were suitable for further analysis. Some<br />
of the EPAS were not given an optimal analysis (for my purposes) by the grammar, some were<br />
irrelevant for the later analysis and some were not extracted correctly from the MRS by the Perl<br />
script. The dataset was post-edited to achieve a set of EPAS that did not include erroneously<br />
extracted or undesired structures. With such a small collection of structures as is the case in this<br />
project, the inclusion of only a few incorrect structures would be likely to skew the subsequent<br />
analysis and possibly produce false results.<br />
In the following, I will briefly outline some of the reasons why the EPAS list included incorrect<br />
structures and describe how the list was revised.<br />
3.6.1 Errors from the grammar<br />
Some of the undesired structures in the original EPAS list were directly caused by<br />
characteristics in the NorGram grammar. In the original EPAS list, there were for instance<br />
several structures of the type exemplified by (3-18):<br />
(3- 18)<br />
a. verbal predicate, nominal argument<br />
b. preposition, verbal predicate, nominal argument<br />
These structures should preferably have been combined into one EPAS. The example in (3-19)<br />
below shows a concrete instance from the EPAS list and is analogous to several other instances:<br />
(3- 19)<br />
a. bo, Anne live, Anne<br />
b. i, bo, studentkollektiv in, live, student housing<br />
The structure is extracted from the following sentence from the text material:<br />
(3- 20)<br />
Anne Slåtten bodde i et studentkollektiv utenfor Førde sentrum.<br />
Anne Slåtten lived in student housing outside central Førde.<br />
As example (3-20) shows, these structures originate from sentences featuring a verb with an<br />
adverbial complement. The adverbial is realised as a prepositional phrase where the preposition<br />
is selected by the verb. It would have been expected that sentences such as Anne bodde i et<br />
studentkollektiv (Anne lived in student housing) would result in one EPAS with the entity<br />
studentkollektiv somehow realized as the structure’s argument 2. Instead, the MRS structure of<br />
this and other similar sentences did not provide the necessary link between the verb as predicate<br />
and studentkollektiv as the second argument. When I discussed this problem with the developers<br />
of the grammar, the source of the obstacle was easily identified. In the grammar, the verb bo<br />
(live) existed as an intransitive verb, not allowing for an adverbial complement to be analysed as<br />
required to produce the desired EPAS. In order to allow this and similar sentences with the<br />
verb bo to produce one EPAS with the correct relationship between the predicate and its<br />
arguments, the entry for bo was altered. A solution which allows for an arbitrary preposition was<br />
favoured, instead of creating a new template that specifies the possible following prepositions.<br />
Analysing the sentence above with the revised grammar produces structures of the following<br />
type:<br />
(3- 21)<br />
bo, Anne, studentkollektiv<br />
live, Anne, student housing<br />
The same phenomenon was observed for a few other verbs with prepositional phrases as<br />
complements, such as gjemme i (hide in) and observere i (observe in). In these instances, the<br />
structures were manually edited.<br />
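For the manually edited cases, the merge can be sketched as a simple rule: when a structure has a preposition as its predicate and a verb as its first argument, its second argument is folded into the verb's own structure. The sketch below is illustrative only; in the thesis the lexicon entry for bo was revised instead, and the remaining cases were edited by hand:<br />

```python
# Illustrative sketch of merging pairs like (3-19a/b) into one EPAS.
# The verb list is a made-up sample; in the thesis this was resolved by
# revising the grammar (for 'bo') or by manual editing.
verbs = {"bo", "gjemme", "observere"}

def merge(epas_list):
    merged, leftovers = [], []
    for pred, a1, a2 in epas_list:
        if a1 in verbs:                 # e.g. ('i', 'bo', 'studentkollektiv')
            leftovers.append((a1, a2))  # the verb should gain a2 as argument 2
        else:
            merged.append([pred, a1, a2])
    for verb, a2 in leftovers:
        for e in merged:
            if e[0] == verb and e[2] is None:
                e[2] = a2               # fold the PP object into the verb's EPAS
    return [tuple(e) for e in merged]

result = merge([("bo", "Anne", None), ("i", "bo", "studentkollektiv")])
print(result)  # [('bo', 'Anne', 'studentkollektiv')]
```

This loses the identity of the preposition itself; the grammar revision described above, which allows an arbitrary selected preposition, is the cleaner solution.<br />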
3.6.2 Irrelevant structures<br />
Some of the structures that were extracted correctly from the text collection were simply<br />
removed in the final post-editing of the EPAS list. These structures were not directly relevant for<br />
the later analysis process and would not contribute any valuable information for the<br />
referent-guessing procedures. In total, 22 such structures were removed. The majority of these<br />
structures originate from adverbial phrases in the text collections. Locative and temporal<br />
adverbials, realised as prepositional phrases in the texts, show up in the EPAS list as structures<br />
disjoint from the rest of the sentence, not unlike the structures mentioned above. The preposition<br />
functions as the EPAS’ predicate, resulting in structures of this type:<br />
(3- 22)<br />
på, funn, åsted<br />
on, finding, crime scene<br />
Such structures were left out of the final EPAS list.<br />
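Filtering out such structures amounts to a stop-list check on the predicate. The preposition list below is a small made-up sample; in the thesis the removal was performed during post-editing:<br />

```python
# Sketch of filtering out EPAS whose predicate is a preposition, as with
# (3-22). The preposition list is an invented sample, not the thesis's list.
prepositions = {"på", "i", "etter", "fra", "til"}

def keep(epas):
    """Keep an EPAS only if its predicate is not a preposition."""
    pred, _, _ = epas
    return pred not in prepositions

epas_list = [
    ("på", "funn", "åsted"),    # on, finding, crime scene -- removed
    ("høre", "vitne", "rop"),   # hear, witness, scream -- kept
]
print([e for e in epas_list if keep(e)])  # [('høre', 'vitne', 'rop')]
```

The same stop-list mechanism could be extended to cover the information-structure artefacts discussed next, by also listing predicates such as unspec_loc.<br />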
Another type of correctly extracted structure that was omitted from the final list was structures<br />
particular to the information structure in the grammatical analysis. The extraction script returned<br />
all predicate-argument structures present in the MRS structure for each parsed sentence. This<br />
resulted in a few structures that did not hold information that it was desirable to maintain in the<br />
EPAS list. Below is an example of such a structure:<br />
(3- 23)<br />
unspec_loc, , place<br />
3.6.3 Manually added structures<br />
Not all the predicate-argument structures present in the text collection were successfully<br />
extracted by means of the extraction method. After removing unwanted structures from the<br />
EPAS list, the texts in the text collection were gone through manually to gather any EPAS that<br />
were not returned by the automated extraction process. Had the text collection been larger, so<br />
that the list of automatically extracted EPAS had been correspondingly bigger, this might not<br />
have been a necessary step. In the case of a (substantially) larger EPAS list, the structures that<br />
were not collected in the extraction process could have been dispensed with, as the structures<br />
that would have been extracted would have provided enough information for the subsequent<br />
classifications and analyses. As is the case in this project, though, the text collection and the<br />
resulting EPAS list are very small. All information that can be extracted from the texts is of<br />
value and highly desirable. As such, it was a logical next step following the removal of<br />
unwanted structures, to make sure that all desirable structures had been collected from the texts.<br />
Several structures were added, many of which had been subjected to only partial extraction in<br />
the extraction process. This may in part be due to the syntactic analysis and in part to the<br />
matching by the Perl script. Further, EPAS were manually extracted from one additional text<br />
that had not been parsed, and therefore not been part of the initial extraction process. In total,<br />
this yielded 74 additional EPAS. (3-24) below provides an example of a<br />
manually edited EPAS. (3-24a) shows the EPAS as it was after the automatic extraction process.<br />
While going through the texts, it became clear that this EPAS had not been extracted in a way<br />
that represented the meaning in the sentence it originated from, and therefore did not have an<br />
optimal structure. The EPAS was therefore manually modified to the form shown in (3-24b).<br />
(3- 24)<br />
a. Original EPAS:<br />
ta, syklist, kontakt<br />
make, biker, contact<br />
b. Manually corrected EPAS:<br />
ta-kontakt-med, syklist, politi<br />
make-contact-with, biker, police<br />
Appendix C contains the EPAS list, while Appendix D shows the alignment between sentences<br />
in the text and the extracted EPAS.<br />
3.6.4 Comments about the EPAS list<br />
The revised EPAS list consists of 223 elementary predicate-argument structures. 24 structures<br />
have been modified as described above, and 74 have been added. The list contains most EPAS<br />
present in the text collection and represents a list of verb-subject-object relations found within a<br />
limited thematic domain. While it is clear that the list could have been expanded by adding<br />
further texts to the collection, it was not possible to extend the list within the framework<br />
of this project. Certainly, with more texts the counts of individual EPAS would have been<br />
higher, and the list would also have been<br />
enriched by several new EPAS. Still, for the purposes of this thesis, the list includes a broad<br />
enough variety of structures to be of use in the classification phase.<br />
In the process of assessing the quality of the EPAS list, it became evident that the most<br />
interesting structures are the simplest ones. The EPAS corresponding to verb-subject-object<br />
relations are the ones that contribute the most information about the selectional restrictions of<br />
the domain. An alternative way to obtain an effective and robust extraction of EPAS might have<br />
been to concentrate only on this type of structure, rather than extracting all EPAS<br />
from the text collections and then filtering out unwanted ones.<br />
In order to estimate the potential of a classification of the EPAS list, line diagrams were created<br />
using Formal Concept Analysis (FCA). FCA is a methodology of data analysis and knowledge<br />
representation which identifies conceptual structures in data sets, and was a useful tool in the<br />
process of identifying how the predicates and arguments in the EPAS list related to each other.<br />
FCA distinguishes between two types of elements: formal objects and formal attributes. A<br />
formal concept is seen as a unit consisting of all its associated objects and attributes (Wolff 1994,<br />
p. 430). Starting with any set of formal objects, all formal attributes the objects have in common<br />
can be identified. When using FCA to structure the data in the EPAS list, the arguments were<br />
termed objects, while the predicates were termed attributes. An FCA line diagram consists of all<br />
objects and attributes in a given context, organised hierarchically according to their shared<br />
properties. Figure 5 below shows the FCA line diagram for part of the structures in the EPAS<br />
list 4 . Each white label corresponding to an argument from the EPAS list should be understood as<br />
a concept, and information about each concept can be read by following the upward leading<br />
paths from each concept. An object has a given attribute if there is an upward leading path from<br />
the object to the attribute (Wolff 1994, p. 431). Using the arguments/formal objects lensmann<br />
(sergeant) and Fonn as a starting point, the associated predicates/formal attributes gi (give), and<br />
bede-om (ask-for) can be identified. The arguments lensmann and Fonn co-occur with the<br />
predicates gi and bede-om, while politi (police), which is further down in the hierarchy, co-occurs<br />
with other predicates as well as those higher up in the diagram (gi, bede-om and bekrefte<br />
(confirm)). In other words, more general concepts are found toward the bottom of the diagram,<br />
while specialised concepts are found by following the paths upwards. For the data material in<br />
4 The diagram was made using the program Concept Explorer, downloadable from<br />
http://sourceforge.net/projects/conexp<br />
this project, this can be interpreted in terms of the contextual distribution the arguments have.<br />
Arguments found in the lower parts of the diagram are more general and co-occur with a wider<br />
range of predicates than the arguments found higher up in the hierarchy. In Figure 5, it can be<br />
seen that gjerningsmann (perpetrator) and drapsmann (killer) have similar distributions in the<br />
data material; drapsmann co-occurs with the predicates velge (choose) and gjemme (hide), while<br />
gjerningsmann only is found in connection with gjemme. On the basis of the formal concept<br />
analysis, it is clear that the EPAS list contains several arguments which show a distribution<br />
particular to their semantic meaning. The different lines in the diagram show interesting bundles<br />
of semantically related arguments and confirm the assumption that different types of arguments<br />
show different contextual distribution within the thematic domain.<br />
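The object/attribute grouping behind such a line diagram can be sketched in a few lines. This is a minimal illustration, not the program used in the thesis; the EPAS triples below are a small excerpt reconstructed from the discussion, with arguments as formal objects and predicates as formal attributes:

```python
# Formal context sketch: objects are arguments, attributes are the
# predicates they occur with. The triples are an illustrative excerpt.
epas = [
    ("gi", "lensmann", "opplysning"),
    ("gi", "Fonn", "opplysning"),
    ("gi", "politi", "opplysning"),
    ("bede-om", "lensmann", "assistanse"),
    ("bede-om", "Fonn", None),
    ("bede-om", "politi", None),
    ("bekrefte", "politi", None),
]

# Incidence relation: the set of attributes (predicates) for each object.
incidence = {}
for pred, arg1, _arg2 in epas:
    incidence.setdefault(arg1, set()).add(pred)

def common_attributes(objects):
    # The FCA derivation: attributes shared by all the given objects.
    return set.intersection(*(incidence[o] for o in objects))

print(sorted(common_attributes({"lensmann", "Fonn"})))
# ['bede-om', 'gi'] -- politi additionally has 'bekrefte', so it sits
# lower (more general) in the diagram.
```

Reading upward-leading paths in the diagram corresponds to taking such intersections over larger object sets.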
Figure 5<br />
4 Classification<br />
In order to use the structures in the EPAS list as an aid in anaphora resolution, they have to be<br />
processed. The pre-processing in section 3.6.4 has shown that interesting distributions do exist<br />
in the data set and indicates that certain groups of arguments display distributions<br />
particular to the domain. As a step toward exploring if these distributions can be used to<br />
represent selectional restrictions and thus function as real-world knowledge for the domain, the<br />
words in the EPAS list must be classified. This procedure uses the context patterns that a word<br />
occurs in to classify the word, for example allowing for an argument to be classified according<br />
to the predicates it co-occurs with. A classification of this type gives information about which<br />
word to expect in a given context pattern and the results can therefore be used in the process of<br />
choosing the most likely antecedent for an anaphor. In this respect, the most likely antecedent<br />
must be interpreted as the most likely antecedent given a particular contextual pattern.<br />
In the following, the EPAS list will first be classified to see if the context patterns represented<br />
by the EPAS contain enough information to suggest the correct antecedent for anaphoric<br />
expressions from the text collection. Then an association of concepts will be performed, creating<br />
bundles of those arguments which occur in similar contexts/with similar predicates. These<br />
concepts will then be applied in co-occurrence with the classification method to see if they<br />
improve the process of suggesting the correct antecedent for the anaphors.<br />
For the purposes of classification and testing, the EPAS list was divided into training and test<br />
sets. The test set consists of all structures containing pronouns, while the training set consists of<br />
the remaining EPAS. In the case of the test set, the correct antecedent for each pronoun was<br />
identified manually and added to the test file. When testing with the test instances, the classifier<br />
assigns an antecedent based on the patterns it has seen in the training set. In this way, the correct<br />
antecedent in each test case functions as a means of measuring the success rate of the<br />
classification. The test set provides a good way of testing the product of the classification and<br />
gives a measure as to whether the correct antecedent can be assigned based on training on<br />
occurrences of EPAS/context patterns.<br />
The process of classifying the constituents of the EPAS is most useful if the aim of the<br />
classification is held clearly in mind. Classifying arguments relative to the predicates and the<br />
other arguments they co-occur with can give information about two things:<br />
• is the data set generalisable enough to allow inference of the single correct antecedent in<br />
each test case?<br />
• is the data set generalisable enough to allow inference of words within the semantic<br />
concept group that the correct antecedent belongs to?<br />
In this thesis, it is of interest to identify all the words which occur in specific environments.<br />
Accordingly, we are interested in finding all the members which can occur in a specific pattern – and<br />
not necessarily only the single correct antecedent.<br />
The classification phase in the present work has three steps: first, classification through a<br />
memory-based learning algorithm; second, association of semantic classes from the text<br />
material by looking at contextual environments; and third, classification through application of<br />
the concept groups gathered in step two. In the following, the classification method will be<br />
described in more detail.<br />
4.1 Step I: Classification with TiMBL<br />
TiMBL (Tilburg Memory Based Learner) (Daelemans et al. 2003) is a memory-based learning<br />
(MBL) tool developed by the ILK research group at the University of Tilburg (ILK 2004).<br />
TiMBL has been developed with the domain of NLP specifically in mind and provides an<br />
implementation of several MBL algorithms.<br />
Within MBL, or lazy learning (Daelemans et al. 1999), training instances are simply stored in<br />
memory. Upon encountering new instances, classification is performed by comparing the new<br />
instance to the stored experiences and estimating the similarity of the new instance to the old<br />
ones. The stored example(s) most similar to the new instance is picked as its classification. This<br />
approach stands in opposition to rule-induction-based methods, which are also called greedy<br />
algorithms. In greedy learning algorithms, the learning material is used to create a model with<br />
expected characteristics for each category to be learned. Daelemans et al. (1999) show that<br />
language processing tasks tend to benefit from lazy learning methods, particularly because the<br />
individual examples in the training material are not abstracted away from in the process of<br />
creating rules. When a new data instance is classified, it is compared to all previously seen<br />
examples, including low-frequency ones. This suggests that in the case of relatively small data<br />
sets, such as the one in the present work, MBL tools are particularly suitable.<br />
By consulting previously seen data and estimating the similarity between old and new instances<br />
of data, MBL algorithms such as TiMBL are able to calculate the likelihood of new instances of<br />
data. This is done by creating a classifier which essentially consists of an example set of<br />
particular patterns together with their associated categories. The classifier can subsequently<br />
classify unknown input patterns by applying algorithms to calculate the similarity, or distance,<br />
to the known patterns stored in memory. The Nearest Neighbor approach is one commonly used<br />
means to estimate this distance and is described in more detail in the following section.<br />
4.1.1 The Nearest Neighbor approach<br />
Daelemans et al. (2003, p. 19) state that all MBL approaches are founded on the classical k-<br />
Nearest Neighbor (k-NN) method of classification (Cover and Hart 1967). This approach<br />
classifies patterns of numeric data by using information gained from examining and classifying<br />
pattern distributions observed in a data collection. In the k-NN algorithm, a new instance of data<br />
is classified as nearest to a set of previously classified points. The intuition is that observations<br />
which are close together will have categories which are close together. When classifying a new<br />
instance of data, the k-NN approach weights the known information about the closest similar<br />
data instances most heavily. In other words, a new instance of data is classified in the category<br />
of its nearest neighbour. In large samples, this rule can be modified to classifying according to<br />
the majority of the nearest neighbours, rather than just using the single nearest neighbour. The<br />
k-NN approach has several implementations in TiMBL. As TiMBL is designed to classify<br />
linguistic patterns, which in most cases consist of discrete data values and allow for a large<br />
number of attributes with varying relevance, the k-NN algorithm is not used directly. Instead,<br />
the classification of discrete data is made possible through a modified version of the k-NN<br />
approach, as well as other algorithms.<br />
There are several different distance metrics incorporated in TiMBL and, as will be described<br />
later, the user can choose the one that suits the data material best. The basic metric is the<br />
Overlap Metric, where the distance between two patterns is calculated as the sum of differences<br />
between the features of the two patterns (Daelemans et al. 2003, p. 20). The algorithm<br />
combining the k-NN approach with the overlap metric within TiMBL is called IB1 (Aha, Kibler<br />
and Albert 1991, in Daelemans et al. 2003). In this algorithm the value of k is the number of<br />
nearest distances (usually 1), allowing for a nearest neighbour set which may comprise several<br />
instances which all share the same distance to a test example. The IB1 algorithm finds the k<br />
nearest neighbours of a test case by calculating the distance between a test instance Y and a<br />
training instance X. The distance between the two instances is the sum of the distances between<br />
the instances’ different features. If k = 1, a test instance is assigned the category of its single<br />
nearest neighbour. In cases where the algorithm finds a set of nearest neighbours, the majority<br />
vote of the set is chosen. This implies a certain bias toward high-frequency categories, which in<br />
many cases will hold the majority vote.<br />
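The behaviour of IB1 with the basic (unweighted) Overlap Metric can be sketched as follows. This is a simplified illustration, not TiMBL's implementation, and the EPAS-like training patterns are hypothetical:

```python
# Minimal IB1-style sketch: distance is the number of mismatching feature
# values; all training instances at the nearest distance form the
# neighbour set, and the majority vote among them decides the category.
from collections import Counter

def overlap_distance(x, y):
    # Basic Overlap Metric: count of features where the values differ.
    return sum(1 for a, b in zip(x, y) if a != b)

def ib1_classify(train, instance):
    # train: list of (features, category) pairs.
    dists = [(overlap_distance(f, instance), cat) for f, cat in train]
    nearest = min(d for d, _ in dists)  # k = 1 nearest *distance*
    votes = Counter(cat for d, cat in dists if d == nearest)
    return votes.most_common(1)[0][0]

# Illustrative (predicate, other-argument) patterns with their categories.
train = [
    (("ankomme", "åsted"), "etterforsker"),
    (("undersøke", "åsted"), "etterforsker"),
    (("avhøre", "vitne"), "politi"),
    (("kontakte", "vitne"), "politi"),
]
print(ib1_classify(train, ("avhøre", "vitne")))  # -> politi
```

Note how a tie in distance is resolved by the majority of the neighbour set, which produces the frequency bias discussed above.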
4.1.2 Testing<br />
To create a classifier, TiMBL needs training and test data in feature-vector format, where each<br />
instance consists of a fixed number of feature values followed by a category. For testing purposes,<br />
the feature sequence is used when the distance between a test instance and the training data is<br />
calculated, and the category functions as a means to evaluate whether the assigned classification<br />
was valid. Because the test data is compared directly with the training data, separate training and<br />
test sets are needed. In this project, the EPAS list was split into a training set consisting of all<br />
EPAS without pronouns and a test set consisting of the EPAS with pronouns. In addition, testing<br />
through TiMBL’s leave-one-out option was performed; here testing is done on each pattern of<br />
the training file by treating each pattern in turn as a test case (Daelemans et al. 2003, p. 35).<br />
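The leave-one-out procedure itself is simple to state in code. The sketch below uses a hypothetical stand-in classifier (a majority-class baseline, not TiMBL's algorithm) purely to illustrate the evaluation loop:

```python
# Leave-one-out sketch: each pattern serves in turn as the test case
# while the remaining patterns form the training set.
from collections import Counter

def leave_one_out(data, classify):
    # data: list of (features, category) pairs; returns accuracy.
    correct = 0
    for i, (features, category) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if classify(train, features) == category:
            correct += 1
    return correct / len(data)

def majority_baseline(train, _features):
    # Stand-in classifier: always predict the most frequent training
    # category (illustrates the procedure, and the frequency bias).
    return Counter(cat for _, cat in train).most_common(1)[0][0]

data = [
    (("avhøre", "vitne"), "politi"),
    (("kontakte", "vitne"), "politi"),
    (("antyde", "?"), "politi"),
    (("ankomme", "åsted"), "etterforsker"),
]
print(leave_one_out(data, majority_baseline))  # -> 0.75
```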
In the classification phase the alignment of category and description features is stored, so that<br />
the categories of new, unseen sequences of description features can be probabilistically inferred<br />
in the following test phase.<br />
Regardless of the input format chosen, a classification with TiMBL presupposes that the training<br />
material consists of a number of features to be learned from, as well as a predetermined category<br />
which is the desired output category. The comma-separated values format was used for the<br />
EPAS classification. In order to classify the constituents of each EPAS based on the contextual<br />
patterns in the structure, each part of the EPAS was classified with reference to the other<br />
constituents in it. Somewhat analogous to the way that a screw can be described as being small,<br />
long and containing no holes, the argument åsted (crime scene) can be described through its co-occurrence<br />
with the predicate ankomme (arrive) and the argument etterforsker (investigator)<br />
(example (4-1)). This makes it possible to train a classifier on the EPAS list, using the argument<br />
whose environment is to be learned as category label, and each constituent in the EPAS as<br />
features. To prevent the category from being explicitly present in the training material and to ensure<br />
that the classifier was trained only on the environment of the desired category, the relevant<br />
feature was ignored using TiMBL’s ignore option. In order to classify the structures once for<br />
each argument type, two different data sets were prepared. Example (4-1) shows the format of<br />
the three-feature dataset that was used. The parentheses indicate that the feature in question was<br />
ignored when training and classifying.<br />
(4-1)<br />
features category<br />
a. predicate, (argument 1), argument 2 argument 1<br />
b. predicate, argument 1, (argument 2) argument 2<br />
Example (4-2) shows excerpts of the two input files: (4-2a) shows the structures with argument<br />
1 as category, while (4-2b) shows the same structures with argument 2 as the category. The<br />
classifier is given two constituents of an EPAS to learn from and the target constituent is given<br />
as the EPAS’ category.<br />
(4-2)<br />
a. ankomme,etterforsker,?,etterforsker<br />
ankomme,etterforsker,?,etterforsker<br />
ankomme,etterforsker,åsted,etterforsker<br />
antyde,politi,?,politi<br />
avhøre,?,person,?<br />
avhøre,?,vedkommende,?<br />
avhøre,politi,vitne,politi<br />
b. ankomme,etterforsker,?,?<br />
ankomme,etterforsker,?,?<br />
ankomme,etterforsker,åsted,åsted<br />
antyde,politi,?,?<br />
avhøre,?,person,person<br />
avhøre,?,vedkommende,vedkommende<br />
avhøre,politi,vitne,vitne<br />
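Generating the two input formats in (4-1)/(4-2) from the EPAS triples is mechanical. The sketch below is illustrative (the triples are a small hypothetical sample, and the thesis used a Perl script rather than this code):

```python
# Sketch of preparing comma-separated TiMBL input rows from EPAS triples,
# duplicating the target constituent as the category label at the end of
# each row; at training time the corresponding feature is then hidden
# with TiMBL's ignore option so only the environment is learned.
epas = [
    ("ankomme", "etterforsker", "åsted"),
    ("antyde", "politi", "?"),
    ("avhøre", "politi", "vitne"),
]

def make_rows(triples, target_index):
    # Keep all three constituents as features, append the target as category.
    return [",".join(t) + "," + t[target_index] for t in triples]

rows_arg1 = make_rows(epas, 1)  # category = argument 1
rows_arg2 = make_rows(epas, 2)  # category = argument 2
print(rows_arg1[0])  # ankomme,etterforsker,åsted,etterforsker
print(rows_arg2[2])  # avhøre,politi,vitne,vitne
```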
The output file that is created when TiMBL has classified the input data and run a test with the<br />
test data consists of the input given in the test set with the category predicted by TiMBL added<br />
at the end of each line. Further, the output supplied by TiMBL upon a successful training and<br />
testing round gives information about the actions in the various stages of analysis. TiMBL’s<br />
actions can be divided into three separate phases: in phase 1 the training data is analysed, in<br />
phase 2 the items in the training data are stored for efficient use during testing and in phase 3 the<br />
trained classifier is applied to the test set. For the purposes of the EPAS analysis, the default<br />
algorithm was used in the test phase. This algorithm computes the similarity between a test and<br />
a training item in terms of weighted overlap; the total difference between two patterns is the sum<br />
of relevance weights of those features which are not equal (Daelemans et al. 2003, p. 13).<br />
The classification of the EPAS and the subsequent testing were carried out in two distinct steps:<br />
classification and testing of argument 1 and argument 2 were done separately. The results of the<br />
classification and testing are described in the following sections.<br />
4.1.2.1 Classifying argument 1<br />
Several experiments were run through TiMBL with the aim of classifying occurrences of<br />
argument 1 according to the environment they occur in. The classifier was trained on all EPAS<br />
not containing pronouns and then tested. For the purpose of classifying occurrences of argument<br />
1, an EPAS list with the relevant argument 1 as category label was used. In the following<br />
descriptions of the performed tests, this list will be referred to as EPAS_arg1.<br />
Test 1<br />
Training set: EPAS_arg1 with no pronouns, argument 1 ignored.<br />
Test set: EPAS with pronouns in argument 1 position.<br />
Result: 57.69% (15/26) correct classifications<br />
The classifier was created with EPAS_arg1 with no pronouns as training set and tested with all<br />
EPAS containing pronouns in the position of argument 1. For the test set, each EPAS was<br />
completed with the antecedent for its pronoun. For reasons of classification and testing, the<br />
antecedent was appended at the end of each EPAS, thus functioning as the category label for the<br />
structure. In total, there were 26 EPAS with pronouns in the position of argument 1. (4-3) below<br />
shows an example from the test file with pronouns as argument 1:<br />
(4-3)<br />
få,pron,rapport,politi<br />
When classifying with argument 1 as category label and testing with EPAS with pronouns as<br />
argument 1, TiMBL assigned the correct category in 57.69% (15/26) of the test cases. One of<br />
the cases where the classifier had assigned the “wrong” category was actually not incorrect: the<br />
antecedent was of a form that did not exist in the training material (antecedent: kvinne/vitne<br />
(woman/witness), assigned category: vitne (witness)). Furthermore, in six of the incorrectly<br />
assigned categories, the category chosen by the classifier was semantically close to the correct<br />
antecedent. Example (4-4) below shows the seven examples where the incorrect categories<br />
assigned by the classifier can in fact be viewed as belonging to the same semantic group, and<br />
thus at least as a partially successful classification. Regarding all these instances as successful<br />
category assignments would raise the classifier’s accuracy to 84.61% (22/26).<br />
(4-4)<br />
Correct antecedent Assigned category<br />
kvinne/vitne (woman/witness) vitne (witness)<br />
Fonn (Fonn) politi (police)<br />
Kripos-spesialist (Kripos specialist) politi (police)<br />
politimester (police chief) Fonn (Fonn)<br />
politi (police) etterforsker (investigator)<br />
politi (police) Fonn (Fonn)<br />
Slåtten (Slåtten) kvinne (woman)<br />
Test 2<br />
Training set: EPAS_arg1 with no pronouns, argument 1 ignored.<br />
Test method: leave-one-out<br />
Result: 42.40% (81/191) correct classifications<br />
When training and testing on the EPAS_arg1 list with pronouns removed, the classifier<br />
produced a rather poor accuracy of 42.40%. TiMBL’s leave-one-out option makes it possible to<br />
train and test on the same material, as each pattern in the training file is used as a test case while<br />
the rest of the patterns are used as training material. One reason for the relatively low percentage<br />
of correctly classified instances is most likely the small size of the data set. With only 191<br />
patterns to learn from, the classifier does not have enough diversity in the examples to provide<br />
correct classifications and also does not find enough occurrences of the individual patterns to be<br />
able to pick the correct category. Since politi (police) is by far the most frequent feature in the<br />
EPAS list, many instances are wrongly assigned the category “police” by virtue of the majority<br />
vote of the nearest neighbour classification. An attempt to avoid this effect is described in test 3.<br />
Examining the instances where the classifier assigned the wrong category to an EPAS showed<br />
that in 27 of the incorrectly classified cases, the assigned category was semantically similar to<br />
the correct category. This suggests that the list in itself does contain some relevant information<br />
about the distribution of argument 1 in the data set. Example (4-5) below shows the correct<br />
categories and the categories assigned by the classifier.<br />
(4-5)<br />
Correct category Assigned category<br />
Anne kvinne (woman)<br />
Slåtten<br />
drapsmann (killer) gjerningsmann (perpetrator)<br />
etterforsker (investigator) politi (police)<br />
Fonn lensmann (deputy)<br />
politi (police)<br />
gjerningsmann (perpetrator) person (person)<br />
Kripos-spesialist(Kripos specialist) politi (police)<br />
kvinne (woman) 23-åring (23-year-old)<br />
lensmann (deputy) Fonn<br />
politi (police)<br />
medarbeider (co-worker) politi (police)<br />
person (person) gjerningsmann (perpetrator)<br />
politi (police) lensmann (deputy)<br />
etterforsker (investigator)<br />
politimester (chief of police) politi (police)<br />
polititjenestefolk (police workers) politi (police)<br />
Slåtten Anne<br />
kvinne (woman)<br />
tekniker (technician) politi (police)<br />
23-åring (23-year-old) kvinne (woman)<br />
Test 3<br />
When using the overlap metric, all feature values are seen as equally dissimilar (Daelemans et<br />
al. 2003, p. 23). This means that the classifier is unable to determine the similarity of values<br />
such as politi (police), etterforsker (investigator) and politimester (chief of police) by means of<br />
looking at their co-occurrence with target classes. By using the Modified Value Difference<br />
Metric (MVDM), the features are weighted according to the patterns they occur in.<br />
Unfortunately, MVDM does not perform well on small data sets with values that<br />
only occur a few times in the data set. When trained and tested on the EPAS_arg1 list, MVDM<br />
produced slightly lower accuracies than in the corresponding test with the overlap metric (see<br />
test 2 above). In practice, this meant that the benefits of MVDM could not be exploited due to<br />
the size of the data material.<br />
Test 4<br />
Training set: EPAS_arg1 excluding structures with pronouns and structures with non-verbal<br />
predicates<br />
Test method: leave-one-out<br />
Result: 45,03% (68/151) correct classifications<br />
The training and test material was modified by excluding all EPAS with non-verbal predicates,<br />
as well as all EPAS with the predicate være (be). This was done to see if these structures disturb<br />
the data material by adding noise that does not contribute information about the<br />
distribution of arguments in the EPAS. The accuracy increased slightly<br />
upon this modification of the data set. The editing did not, however, increase the accuracy when<br />
training on the edited EPAS_arg1 list and testing on EPAS containing pronouns in argument 1<br />
position.<br />
4.1.2.2 Classifying argument 2<br />
Analogous to the classification steps performed for argument 1, the classifications were repeated<br />
for occurrences of argument 2. The EPAS list with the second argument as category label will in<br />
the following be referred to as EPAS_arg2.<br />
Test 1<br />
Training set: EPAS_arg2 with no pronouns, argument 2 ignored.<br />
Test set: EPAS with pronouns in argument 2 position<br />
Result: (0/6) correct classifications<br />
Training the classifier on the EPAS_arg2 list without pronouns and testing on the EPAS with<br />
pronouns in argument 2 position did not produce any correct classifications. This is likely due in<br />
part to the small size of the test data set, as well as the homogeneous nature of the test<br />
instances. Five of the wrongly classified instances were in fact of the same type; in all instances,<br />
the classifier had assigned the category kvinne (woman), while the correct antecedent was<br />
Slåtten.<br />
Test 2<br />
Training set: EPAS_arg2 with no pronouns, argument 2 ignored<br />
Test method: leave-one-out<br />
Result: 49.73% (95/191) correct classifications<br />
Training and testing the classifier on the EPAS_arg2 list with no pronouns produced an accuracy<br />
of 49.73%. As was the case for the corresponding classification of argument 1, it is likely that<br />
the relatively small dataset is a disadvantage for the classification process.<br />
4.1.3 Comments on the results<br />
The results obtained through classifying the EPAS indicate that the information present in the<br />
EPAS derived from the text collection does provide clues about which word to expect in a<br />
specific position. The accuracy scores obtained by training and testing on the EPAS extracted<br />
from a collection of texts suggest that even a small collection of texts on the same domain<br />
provides enough information to enable a classification approach based on contextual distribution. In the<br />
tests described above, there was a recurring tendency: in a number of the cases where the<br />
wrong category was assigned in the test phase, the assigned category bore some semantic<br />
resemblance to the correct category. This reinforces the initial intuition that similar words are<br />
used in similar environments and that the environment can contribute clues toward the<br />
semantic meaning of a word.<br />
In the following section, the notion of finding words which are similar to each other by virtue of<br />
occurring in the same environments will be explored further.<br />
4.2 Step II: Association of concept groups<br />
The fundamental idea in this thesis is that words display certain semantic features based solely<br />
on the context they are found in. Therefore, when looking for possible antecedents for an<br />
anaphoric expression, the candidates should not only be weighted according to their co-occurrence<br />
in an identical context pattern in a corpus, but also according to their co-occurrence<br />
with similar context patterns. The assumption that words which occur in identical contexts have<br />
related meanings can be used to retrieve words with similar meanings from the data material.<br />
With a target argument and a target predicate as starting point, the association method goes<br />
through the EPAS list and returns words which occur in similar environments to the target<br />
argument. This association is performed in three steps:<br />
• level 0: words which co-occur with the target predicate are returned<br />
• level 1: words which occur in the same context as the target argument are returned<br />
• level 2: words which occur in the same context as the words found in level 1 are returned<br />
Level 0 considers the information that is directly accessible from the EPAS list; with a given<br />
predicate as reference point, the co-occurring arguments are retrieved. Level 1 looks at the other<br />
arguments that occur with the same predicates as the arguments retrieved in the first step.<br />
Finally, level 2 performs the same step once again and looks at the arguments that occur in the<br />
same contexts as the arguments collected in level 1. As a result, bundles of concepts are<br />
produced; each concept class consisting of words that are used in the same textual context, and<br />
therefore are likely to be semantically similar. The following example explains how the<br />
association of argument classes is performed.<br />
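Procedurally, the three levels can be sketched as below. This is an illustrative reconstruction of the traversal, not the thesis's actual implementation; the EPAS triples are a small hypothetical excerpt:

```python
# Sketch of the three association levels: from a target predicate, collect
# its co-occurring first arguments (level 0), then repeatedly add
# arguments that share a predicate context with those found so far
# (levels 1 and 2).
epas = [
    ("ankomme", "etterforsker", "åsted"),
    ("undersøke", "etterforsker", "åsted"),
    ("undersøke", "politi", "aktivitet"),
    ("kontakte", "politi", "vitne"),
    ("bekrefte", "politi", None),
    ("bekrefte", "politimester", None),
]

def args_of(predicate):
    # Level 0: first arguments co-occurring with the target predicate.
    return {a1 for p, a1, _ in epas if p == predicate}

def expand(arguments):
    # One association level: all first arguments sharing a predicate
    # with any argument already in the set.
    preds = {p for p, a1, _ in epas if a1 in arguments}
    return {a1 for p, a1, _ in epas if p in preds}

level0 = args_of("ankomme")  # {'etterforsker'}
level1 = expand(level0)      # adds 'politi' (shared predicate 'undersøke')
level2 = expand(level1)      # adds 'politimester' (shared 'bekrefte')
```

Each application of `expand` widens the concept group, mirroring the worked example that follows.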
Level 0<br />
The association method takes as its starting point a predicate from the EPAS list. For a given<br />
predicate, the first and second arguments are listed. The nominal argument etterforsker<br />
(investigator) occurs as argument 1 of the verbal predicate ankomme (arrive) in the text<br />
collection. (4-6) below shows the EPAS with ankomme as predicate.<br />
(4-6)<br />
ankomme,etterforsker,?<br />
ankomme,etterforsker,åsted<br />
arrive, investigator, ?<br />
arrive, investigator, crime scene<br />
Level 1<br />
In order to find other nominal arguments that occur in the same context patterns as etterforsker,<br />
we must look at the other EPAS in which etterforsker occurs as argument 1. This yields the<br />
EPAS shown in (4-7) below. For the sake of the concept association, the EPAS corresponding to<br />
adjective-noun relations in the original texts are not considered, and therefore not included in<br />
(4-7).<br />
(4-7)<br />
bistå,etterforsker,lensmann<br />
bistå,etterforsker,politi<br />
ha,etterforsker,observasjon<br />
kontakte,etterforsker,vitne<br />
mene,etterforsker,?<br />
rigge,etterforsker,lyskaster<br />
undersøke,etterforsker,åsted<br />
assist, investigator, deputy<br />
assist, investigator, police<br />
have, investigator, observation<br />
contact, investigator, witness<br />
mean, investigator, ?<br />
build-up, investigator,searchlight<br />
examine, investigator, crime scene<br />
For each of the predicates in (4-7), we want to find the other arguments, in addition to<br />
etterforsker (investigator), which occur in the corpus material as the first argument of the<br />
predicate. Traversing the EPAS list in search of these nominal arguments yields the list<br />
presented in (4-8). Pronouns and empty argument slots are omitted from the association since<br />
they generally occur in too many different context patterns to contribute relevant<br />
information in this kind of analysis.<br />
(4-8)<br />
ha,politi,medarbeider<br />
ha,politi,teori<br />
kontakte,politi,vitne<br />
mene,politi,?<br />
undersøke,politi,aktivitet<br />
have,police,co-worker<br />
have,police,theory<br />
contact,police,witness<br />
mean,police,?<br />
examine,police,activity<br />
As can be seen from the EPAS in (4-8), politi (police) is the only other argument which occurs<br />
in the same contexts as etterforsker (investigator). So far, the association tells us that there is a<br />
relationship between the concepts etterforsker (investigator) and politi (police) in the sense that<br />
these words occur in the same environments in the text collection.<br />
Level 2<br />
In order to explore the possibility of further associated concepts, the association method goes<br />
one level further. Basically, the first step of the association is repeated, but with new parameters;<br />
for each of the first arguments in (4-8) we need to know which other words can occur in the<br />
same contextual position. Therefore the EPAS list is again consulted and all the other first<br />
arguments that occur in the same environment as politi (police) are returned. This produces the<br />
list in (4-9).<br />
(4- 9)<br />
avklare,obduksjon,?<br />
bede-om,lensmann,assistanse<br />
bede-om,Fonn,?<br />
bede-om,lensmann,?<br />
bekrefte,lensmann,?<br />
bekrefte,politimester,?<br />
finne,leteaksjon,kvinne<br />
få,kjæreste,telefon<br />
gi,Fonn,opplysning<br />
gi,kamera,indikasjon<br />
gi,lensmann,opplysning<br />
gi,lensmann,opplysning<br />
gi,vitneavhør,indikasjon<br />
ha,etterforsker,observasjon<br />
kjenne,generic-nom,Slåtten<br />
kontakte,etterforsker,vitne<br />
mene,etterforsker,?<br />
tro,lensmann,?<br />
clarify,autopsy,?<br />
ask-for,sergeant,assistance<br />
ask-for,Fonn,?<br />
ask-for,sergeant,?<br />
confirm,sergeant,?<br />
confirm,chief of police,?<br />
find,search party,woman<br />
get,boy/girlfriend,telephone<br />
give,Fonn,information<br />
give,camera,indication<br />
give,sergeant,information<br />
give,sergeant,information<br />
give,interview,indication<br />
have,investigator,observation<br />
know,generic-nom,Slåtten<br />
contact,investigator,witness<br />
mean,investigator,?<br />
believe,sergeant,?<br />
As a step toward disregarding arguments which do not occur often enough in the context in<br />
question to be significant, the method may be limited to considering only arguments which<br />
occur more than once in the text material. In the case of a larger text collection, a different<br />
method of filtering out low-frequency arguments would have to be adopted; for the small data set<br />
in this project, disregarding arguments which occur only once proved useful. The steps outlined<br />
above produce the following associated group of concepts:<br />
Figure 6<br />
etterforsker (investigator)<br />
politi (police)<br />
lensmann (sergeant)<br />
Fonn (Fonn)<br />
Intuitively, this is quite a good association of concepts, since all the entities in the grouping<br />
belong to the group law enforcement. If a person were to group nominals from the text<br />
collection into semantically similar concept classes, the grouping in Figure 6 would not be an<br />
unlikely result. The grouping as shown in Figure 6, however, is the result of an association<br />
based on context information from the text itself.<br />
4.2.1 Classify<br />
Manually performing the association method described above on all the EPAS in the data set<br />
proved close to impossible, mainly because it implied consulting the data set<br />
multiple times, each time looking for different values while keeping track of the partial goals in<br />
the process. Based on this method, the Perl script classify was written 5 . In the<br />
following, the algorithm implemented in classify is outlined in brief.<br />
For each predicate:<br />
1. Level 0:<br />
What is ARG1 and ARG2 in the corpus/EPAS list?<br />
2. Level 1:<br />
For each ARG1 = x that was found in 1:<br />
In connection with which other predicates is ARG1 also = x?<br />
For each of these predicates:<br />
Which other words occur as ARG1?<br />
Produces a list of words which occur in the same contexts as x<br />
3. Level 2:<br />
For each word = y in the list from level 1:<br />
Which other predicates does this word also co-occur with?<br />
For each of these predicates:<br />
Which other words occur as ARG1?<br />
Produces a list of words which occur in the same contexts as y<br />
Same procedure is repeated for ARG2.<br />
5<br />
The algorithm was implemented in Perl by Martin Rasmussen Lie, informatics student at the University of<br />
Bergen.<br />
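The outline above can be rendered as a runnable sketch. The following Python fragment is illustrative only: the actual implementation (classify.pl) was written in Perl, the EPAS shown are a hand-picked subset of the list, and all function and variable names are invented for the example.

```python
from collections import Counter

SKIP = {"?", "pron"}  # empty slots and pronouns are not associated

def cooccurring(epas, word, slot):
    """Predicates with which `word` occurs in the given argument slot (1 or 2)."""
    return {pred for pred, a1, a2 in epas if (a1, a2)[slot - 1] == word}

def level(epas, words, slot):
    """For each word, find the predicates it occurs with, then collect the
    other arguments filling the same slot of those predicates (Levels 1 and 2)."""
    counts = Counter()
    for w in words:
        preds = cooccurring(epas, w, slot)
        for pred, a1, a2 in epas:
            if pred in preds:
                other = (a1, a2)[slot - 1]
                if other not in SKIP and other != w:
                    counts[other] += 1
    return counts

def associate(epas, seed, slot, min_freq=2):
    """Levels 0-2 of the outline: build a concept group around `seed`,
    keeping only arguments that occur at least `min_freq` times."""
    lvl1 = {w for w, n in level(epas, {seed}, slot).items() if n >= min_freq}
    lvl2 = {w for w, n in level(epas, lvl1, slot).items() if n >= min_freq}
    return {seed} | lvl1 | lvl2

# Illustrative subset of the EPAS list: (predicate, argument 1, argument 2)
epas = [
    ("ankomme", "etterforsker", "åsted"),
    ("ha", "etterforsker", "observasjon"),
    ("ha", "politi", "teori"),
    ("ha", "politi", "medarbeider"),
    ("kontakte", "etterforsker", "vitne"),
    ("kontakte", "politi", "vitne"),
    ("gi", "politi", "opplysning"),
    ("gi", "lensmann", "opplysning"),
    ("gi", "lensmann", "opplysning"),
]
print(associate(epas, "etterforsker", slot=1))
```

With min_freq set to 2, the sketch mirrors the "occurs more than once" threshold discussed above; on the subset shown it associates etterforsker with politi and lensmann, in line with Figure 6.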
(4-10) below shows the output for the predicate ankomme (arrive) as it is produced by classify.<br />
(4- 10)<br />
NIVÅ0, ARG1 (ankomme): etterforsker x 3<br />
NIVÅ1, ARG1 (etterforsker): ? x 5, politi x 5, pron x 3<br />
NIVÅ2, ARG1 (politi): ? x 17, pron x 9, lensmann x 6, Fonn x 2,<br />
forbipasserende x 1, generic-nom x 1, kamera x 1, kjæreste x 1,<br />
leteaksjon x 1, obduksjon x 1, politimester x 1, syklist x 1,<br />
vitneavhør x 1<br />
NIVÅ2, ARG1 (pron): politi x 7, kvinne x 4, vitne x 4, ? x 2,<br />
Anne x 2, bilfører x 2, Slåtten x 2, syklist x 2, etterforskning x 1,<br />
Fonn x 1, generic-nom x 1, kjæreste x 1, lensmann x 1, lommebok x 1,<br />
rapport x 1, teori x 1<br />
NIVÅ0, ARG2 (ankomme): ? x 2, åsted x 1<br />
NIVÅ1, ARG2 (åsted): aktivitet x 1, hybelhus x 1,<br />
minibankaktiviteter x 1, mobiltelefontrafikk x 1, område x 1,<br />
overvåkningsfilmer x 1<br />
NIVÅ2, ARG2 (aktivitet): minibankaktiviteter x 1,<br />
mobiltelefontrafikk x 1, område x 1, overvåkningsfilmer x 1<br />
NIVÅ2, ARG2 (hybelhus): (Ingen referanser)<br />
NIVÅ2, ARG2 (minibankaktiviteter): aktivitet x 1,<br />
mobiltelefontrafikk x 1, område x 1, overvåkningsfilmer x 1<br />
NIVÅ2, ARG2 (mobiltelefontrafikk): aktivitet x 1,<br />
minibankaktiviteter x 1, område x 1, overvåkningsfilmer x 1<br />
NIVÅ2, ARG2 (område): aktivitet x 1, minibankaktiviteter x 1,<br />
mobiltelefontrafikk x 1, overvåkningsfilmer x 1<br />
NIVÅ2, ARG2 (overvåkningsfilmer): aktivitet x 1,<br />
minibankaktiviteter x 1, mobiltelefontrafikk x 1, område x 1<br />
Please consult Appendix E for the full program code for classify.pl.<br />
4.2.2 Associated concept classes<br />
Running the three-level association described above on the EPAS list produced six distinct<br />
groupings. These concept groups<br />
are shown in (4-11) below.<br />
(4- 11)<br />
a. POLICE:<br />
etterforsker, politi, lensmann, Fonn<br />
investigator, police, sergeant, Fonn<br />
b. WOMAN:<br />
Anne, Slåtten, 23-åring, sykepleiestudent, kvinne, beboer<br />
Anne, Slåtten, 23-year-old, nurse student, woman, inhabitant<br />
c. PERP:<br />
gjerningsmann, drapsmann<br />
perpetrator, killer<br />
d. PERSON:<br />
person, bilfører, syklist, vedkommende<br />
person, car driver, biker, generic-nom<br />
e. OBSERV:<br />
teori, observasjon<br />
theory, observation<br />
f. PLACE:<br />
studentkollektiv, Førde<br />
student housing, Førde<br />
The classes of words shown in (4-11) form groups of concepts which occur in the same<br />
contextual environments within the thematic domain that the EPAS are extracted from. The<br />
groupings seem to reflect real semantic clusters in the sense that one can easily find a label to<br />
describe each group. For the purpose of the text collection in the present work, these six concept<br />
groups represent six distinct semantic groupings that share many features with respect to pattern<br />
distribution in the data set. With a larger data set to run the concept association on, more concept<br />
groups, and more members within each group, would have been a likely outcome. The results<br />
of the concept association on the small data set in this project do, however, suggest the<br />
feasibility of the method, and show that frequent patterns in smaller text collections can<br />
also capture interesting concept groupings.<br />
4.3 Step III: Using concept groups in TiMBL<br />
The concept groups which emerged as a result of the association performed in section 4.2 above,<br />
represent clusters of words that occur in similar constellations in the data material. The<br />
emergence of concept groups which intuitively seem to have some semantic resemblance to<br />
each other confirms that the context a word fits into does indeed say something about what the<br />
word means, as per the distributional hypothesis.<br />
In the introduction to this chapter, it was stated that the aim of classifying the EPAS list is<br />
twofold. On the one hand, it is of interest to see to what degree the environments that an<br />
argument occurs in over a collection of texts provide sufficient cues for a correct guess of<br />
which argument can be expected in a specific context. On the other hand, it is equally interesting<br />
to see whether classification can narrow down the set of possible arguments for a specific<br />
context pattern. Through the association technique, six groups of words emerged, the members<br />
of each group sharing the feature that they all tend to occur in the same environments.<br />
Previously, it has been stated that some anaphors need access to information about the world in<br />
order to be resolved. This information can to some extent be represented by the concept groups<br />
associated from the data set. By identifying groups of words which typically occur in the same<br />
textual environment, an intuition about which words to expect in which contexts is captured. In<br />
the event of “difficult” anaphors which depend on world knowledge, an anaphora resolution<br />
system can retrieve potential antecedents from the text, check which concept group an expected<br />
antecedent is likely to belong to, and consequently choose the antecedent candidate belonging to<br />
the expected concept group. As a first step of examining the usefulness of concept groups in<br />
combination with anaphora, experiments aiming at enhancing the performance of the classifier<br />
in section 4.1 were performed. These experiments are described in the following section.<br />
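As a minimal illustration of this referent-guessing step, the following Python sketch filters a set of antecedent candidates by an expected concept group. The group contents follow (4-11); the function name and candidate lists are invented for the example.

```python
# Group contents as associated in (4-11); only two groups shown.
CONCEPT_GROUPS = {
    "POLICE": {"etterforsker", "politi", "lensmann", "Fonn"},
    "PERP": {"gjerningsmann", "drapsmann"},
}

def filter_candidates(candidates, expected_group):
    """Keep the antecedent candidates belonging to the expected concept group;
    if none of them belongs, fall back to the full candidate list."""
    members = CONCEPT_GROUPS.get(expected_group, set())
    matching = [c for c in candidates if c in members]
    return matching or list(candidates)

# Candidates lensmann and gjerningsmann, classifier expects POLICE:
print(filter_candidates(["lensmann", "gjerningsmann"], "POLICE"))  # ['lensmann']
```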
4.3.1 Testing<br />
Tests were performed in TiMBL, using the relevant concept group as the category for a feature<br />
pattern. Analogous to the testing in section 4.1.2 above, two separate test sets were prepared,<br />
one for the classification of each argument. In the cases where the relevant argument was a<br />
member of one of the concept groups, the head label of the concept group was used as the<br />
category label in the input data. If the relevant argument did not belong to any concept group,<br />
the argument itself was used as category label, as in the tests in section 4.1.2. Example (4-12)<br />
below shows an excerpt of the input file used for training the classifier for argument 1<br />
classification.<br />
(4- 12)<br />
drepe,gjerningsmann,kvinne,PERP<br />
drept,sykepleiestudent,?,WOMAN<br />
død,sykepleiestudent,?,WOMAN<br />
ekstra,patrulje,?,patrulje<br />
The aim of the tests performed in this section was to see if the accuracy of the classifier could be<br />
enhanced by training on a complete context pattern with the appropriate concept group as<br />
category label.<br />
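The relabelling illustrated in (4-12) amounts to a simple lookup. A Python sketch of this step, assuming the group memberships from (4-11) (only three groups shown; the function names are invented for the example):

```python
# Group contents as associated in (4-11).
CONCEPT_GROUPS = {
    "PERP": {"gjerningsmann", "drapsmann"},
    "WOMAN": {"Anne", "Slåtten", "23-åring", "sykepleiestudent", "kvinne", "beboer"},
    "POLICE": {"etterforsker", "politi", "lensmann", "Fonn"},
}

def category(argument):
    """Concept-group label for the argument, or the argument itself if ungrouped."""
    for label, members in CONCEPT_GROUPS.items():
        if argument in members:
            return label
    return argument

def to_instance(pred, arg1, arg2):
    """One training line: comma-separated features plus the category label."""
    return ",".join([pred, arg1, arg2, category(arg1)])

print(to_instance("drepe", "gjerningsmann", "kvinne"))  # drepe,gjerningsmann,kvinne,PERP
print(to_instance("ekstra", "patrulje", "?"))           # ekstra,patrulje,?,patrulje
```

The two printed lines reproduce the first and last instances of the excerpt in (4-12).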
Test 1<br />
Training set: EPAS_arg1 with no pronouns and concept classes as category label, argument 1<br />
ignored.<br />
Test method: leave-one-out<br />
Result: 56.54% (108/191)<br />
In this test, the classifier was trained on two features of the EPAS, ignoring argument 1. This<br />
test is analogous to test 2 in section 4.1.2.1, which had an accuracy of 41.20%. In addition<br />
to the 108 correctly classified instances, five additional instances were assigned categories<br />
which are semantically similar to the correct category. This was true<br />
for Kripos-spesialist (Kripos specialist), politimester (chief of police), medarbeider (co-worker)<br />
and polititjenestefolk (police workers), which were all assigned the category POLICE. These<br />
words are not part of the concept group POLICE, but are obviously semantically related to the<br />
members of this concept group. Had these words occurred more frequently in the data material,<br />
they could have been expected to show a distribution allowing for their inclusion in POLICE.<br />
The results of this test suggest that labeling EPAS with concept group labels heightens the<br />
accuracy of the classifier. This is not surprising, given the fact that a higher number of context<br />
patterns/EPAS are labeled with the same category in such an approach, making the generalisable<br />
material larger.<br />
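For readers unfamiliar with TiMBL: its default IB1 algorithm is in essence nearest-neighbour classification with an overlap metric, and leave-one-out testing holds out each instance in turn. A minimal Python sketch of this regime on toy data (not the actual EPAS set; names are invented for the example):

```python
def overlap(a, b):
    """Number of matching feature values (an unweighted overlap metric)."""
    return sum(x == y for x, y in zip(a, b))

def classify_1nn(train, features):
    """Category of the training instance with the highest feature overlap."""
    best = max(train, key=lambda inst: overlap(inst[0], features))
    return best[1]

def leave_one_out(data):
    """Hold out each instance in turn, classify it on the rest, report accuracy."""
    correct = sum(
        classify_1nn(data[:i] + data[i + 1:], feats) == cat
        for i, (feats, cat) in enumerate(data)
    )
    return correct / len(data)

data = [  # (features, category): toy EPAS-like patterns
    (("drepe", "kvinne"), "PERP"),
    (("drepe", "offer"), "PERP"),
    (("etterlyse", "vitne"), "POLICE"),
    (("etterlyse", "bilfører"), "POLICE"),
]
print(f"{leave_one_out(data):.2%}")  # 100.00% on this toy set
```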
Test 2<br />
Training set: EPAS_arg1 with no pronouns and concept classes as category label.<br />
Test method: leave-one-out<br />
Result: 86.91% (166/191)<br />
This test was performed to see if training the classifier on the entire structure of an EPAS<br />
increases the accuracy of assigning concept labels to the structures. The classifier was trained on<br />
all three features of the EPAS. In this case, the classifier performed with a fairly high accuracy,<br />
assigning the correct category in 166 of 191 cases. It is obviously an advantage that all parts of<br />
the EPAS can be used in the classification phase when the category to be assigned is not literally<br />
a part of the structures to be learnt from.<br />
Test 3<br />
Training set: EPAS_arg1 with no pronouns and concept classes as category label, argument 1<br />
ignored.<br />
Test set: EPAS_arg1 with pronouns and concept classes as category label<br />
Result: 76.92% (20/26)<br />
As was the case in the corresponding test in section 4.1.2.1, two of the wrongly assigned<br />
categories were in fact within the same semantic group as the correct category. Regarding these<br />
as correct assignments would raise the result to 84.62% (22/26). As in the previous two<br />
tests, the assigned categories in these cases are too infrequent in the EPAS list to surface in the<br />
associated concept groups.<br />
Test 4<br />
Training set: EPAS_arg2 with no pronouns and concept classes as category label, argument 2<br />
ignored.<br />
Test set: EPAS_arg2 with pronouns and concept classes as category label<br />
Result: 83.33% (5/6)<br />
When training the classifier on the EPAS_arg2 list using concept class labels as categories and<br />
testing on the set of EPAS with pronouns in argument 2 position, the classifier resolved five of<br />
the six test instances correctly. In the corresponding test in section 4.1.2, the classifier did not<br />
assign the correct category in any of the six test cases. We did, however, see that five of the test<br />
instances were assigned categories which were semantically similar to the correct antecedent. In<br />
view of the results in the initial test, it came as no surprise that the classifier performed so much<br />
better when used in connection with the concept class labels.<br />
4.4 Are concept classes useful for anaphora resolution?<br />
The EPAS list has been processed in different ways in this chapter. The tests which have been<br />
described provide an indication of how context patterns extracted from the text collection can be<br />
used to create expectations of which words (or which types of words) are likely to occur in a<br />
given contextual environment. These expectations can be used to anticipate which word, or<br />
rather which concept, might be the antecedent for an anaphor. The concept groups which<br />
emerged in the association process are simply classes of semantically related words which tend<br />
to have similar contextual distributions within the domain of the text corpus. In order to indicate<br />
the usefulness of such concept classes in the process of resolving an anaphor, the test set of the<br />
EPAS list (all EPAS containing pronouns) was processed with different methods. In (4-13) the<br />
results of these methods are shown. In addition to the tests in TiMBL described in the above, the<br />
anaphors in the test set were resolved manually using the Lappin and Leass approach as<br />
described in section 2.1.2. For these tests, the sentence with the anaphor, as well as the preceding<br />
sentence, was considered in each case. This purely syntactic approach identified the correct<br />
antecedent in 16 of the 32 test instances.<br />
(4- 13)<br />
Method                      Correct assignments<br />
Syntactic method            50.00% (16/32)<br />
TiMBL                       46.87% (15/32)<br />
TiMBL with concept groups   78.12% (25/32)<br />
The results shown in (4-13) suggest that using concept groups may indeed be a useful approach<br />
in anaphora resolution. Especially in the case of anaphoric expressions where the antecedent is<br />
not clearly stated in the text, it may be useful to have an idea of which type of antecedent one<br />
might expect. 10 of the 32 EPAS containing pronouns were of this kind. The syntactic approach<br />
could naturally not resolve these anaphors, as an antecedent not clearly present in the text hardly<br />
can feature on a list of possible candidates. These types of anaphors require real-world or<br />
domain knowledge to be resolved. In the case of 4 of these 10 EPAS, the EPAS list could not be<br />
consulted to find likely antecedents. Because of the small size of the data set, some predicates<br />
only feature once. This was the case for the five predicates jobbe-utfra (work-from), kartlegge<br />
(map), ta (take), varsle (notify) and ville (want) which all only co-occur with pronouns. With the<br />
exception of jobbe-utfra, none of the antecedents in these cases can be predicted on the basis of<br />
the distribution of predicates and arguments in the EPAS list. (4-14) shows the instances where<br />
the EPAS list could be consulted in the process of finding likely antecedents for these anaphors.<br />
In the case of ha (have) and komme-i-kontakt-med (come-into-contact-with), other EPAS with<br />
the same predicates were retrieved from the EPAS list. Since ha and komme-i-kontakt-med<br />
occur in identical or very similar patterns with politi as the first argument, this would be the<br />
preferred candidate for the antecedent in (4-14a), (4-14c) and (4-14d). In the case of (4-14b), the<br />
predicate jobbe-utfra only has this one occurrence in the EPAS list. This means that similar<br />
patterns must be examined in the search for a possible antecedent. By consulting the EPAS list,<br />
it can be found that teori (theory) only occurs as a second argument in connection with politi as<br />
first argument. This would suggest that politi is a potential antecedent for the pronoun. By<br />
applying the concept groups, the list of possible antecedents motivated by the texts can be<br />
expanded to also include the other arguments which have been found to display a similar<br />
distribution to the arguments which actually co-occur with the predicate in question. In the case<br />
of the pronouns in (4-14), politi is the correct antecedent in all of the cases.<br />
(4- 14)<br />
EPAS with pronoun            similar EPAS                              antecedents from list   concepts<br />
a. ha,pron,teori             ha,etterforsker,observasjon               politi                  lensmann<br />
                             ha,politi,medarbeider                     etterforsker            Fonn<br />
                             ha,politi,teori<br />
b. jobbe-utfra,pron,teori    ha,politi,teori                           politi                  etterforsker<br />
                             forkaste,politi,teori                                             lensmann<br />
                                                                                               Fonn<br />
c. komme-i-kontakt-med,      komme-i-kontakt-med,politi,bilfører       politi                  etterforsker<br />
   pron,bilfører             komme-i-kontakt-med,politi,generic-nom                            lensmann<br />
                                                                                               Fonn<br />
d. komme-i-kontakt-med,      komme-i-kontakt-med,politi,bilfører       politi                  etterforsker<br />
   pron,syklist              komme-i-kontakt-med,politi,generic-nom                            lensmann<br />
                                                                                               Fonn<br />
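The lookup behind row (a) of (4-14) can be sketched as follows. The EPAS shown are the excerpt from (4-7)/(4-8); the function name and the restriction to a single concept group are simplifications for the example.

```python
POLICE = {"etterforsker", "politi", "lensmann", "Fonn"}  # group from (4-11)

def propose_antecedents(epas, pred, groups=(POLICE,)):
    """Collect first arguments of EPAS sharing the anaphor's predicate, then
    expand the proposals with the remaining members of their concept groups."""
    from_list = {a1 for p, a1, a2 in epas
                 if p == pred and a1 not in {"?", "pron"}}
    expanded = set(from_list)
    for group in groups:
        if from_list & group:   # any proposal belongs to a concept group...
            expanded |= group   # ...so all its members become candidates
    return from_list, expanded

epas = [  # excerpt corresponding to row (a)
    ("ha", "etterforsker", "observasjon"),
    ("ha", "politi", "medarbeider"),
    ("ha", "politi", "teori"),
]
from_list, expanded = propose_antecedents(epas, "ha")
print(sorted(from_list))  # ['etterforsker', 'politi']
print(sorted(expanded))   # ['Fonn', 'etterforsker', 'lensmann', 'politi']
```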
The examples in (4-14) indicate how the method described in this thesis can function. In cases<br />
where there is no clearly expressed antecedent in a text, or where the resolution of an antecedent
requires knowledge about the world (or knowledge about how predicates and arguments<br />
combine within a domain), the method can be of aid. Consider again the examples from the<br />
introduction, repeated in (4-15) below:<br />
(4- 15)<br />
a. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig<br />
kommer til å drepe igjen. Han etterlyser vitner som var i sentrum søndag<br />
kveld.<br />
The sergeant leading the investigation says that the perpetrator probably will<br />
kill again. He puts out a call for witnesses who were in the city centre Sunday<br />
evening.<br />
b. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig<br />
kommer til å drepe igjen. Han er observert i sentrum.<br />
The sergeant leading the investigation says that the perpetrator probably will<br />
kill again. He is observed in the city centre.<br />
As was established in chapters 1 and 2, the antecedent of the anaphor han (he) in (4-15b) cannot<br />
be resolved differently from the anaphor in (4-15a) without consulting some sort of knowledge<br />
source. However, the information present in the EPAS list can be used as domain knowledge in<br />
the process of resolving these anaphors. For the anaphor in (4-15a) other occurrences of the<br />
predicate etterlyse (call-for) can be consulted. This would produce the following list:<br />
(4- 16)<br />
etterlyse,?,bilfører<br />
etterlyse,politi,bilfører<br />
etterlyse,politi,person<br />
etterlyse,politi,syklist<br />
call-for,?,driver<br />
call-for,police,driver<br />
call-for,police,person<br />
call-for,police,biker<br />
It is clear from the list that etterlyse tends to co-occur with politi as its first argument, and that it<br />
does not occur with any other first arguments. Through the concept association we know that<br />
politi and lensmann (sergeant) both belong to the same concept group. Consequently, we also<br />
know that politi and lensmann both occur in similar environments and thus share features.<br />
Given that the possible antecedents in (4-15a) are lensmann and gjerningsmann (perpetrator),<br />
the consultation of the EPAS list and the concept groups leads us to select lensmann as the<br />
antecedent for (4-15a). In the case of (4-15b) the EPAS list is unfortunately not equally helpful.<br />
Other occurrences of observere (observe) are:<br />
(4- 17)<br />
observere,?,23-åring<br />
observere,?,bile<br />
observere,?,person<br />
observe,?,23-year-old<br />
observe,?,car<br />
observe,?,person<br />
Neither the EPAS in (4-17), nor the concept groups from section 4.2.2, give us any clues as to<br />
whether lensmann or gjerningsmann is a more likely second argument in connection with the<br />
predicate observere. This can be explained by two circumstances: firstly, the example sentences<br />
in (4-15) are constructed and therefore do not reflect examples from the data set; secondly, the<br />
small size of the data material obviously limits the extent to which one can expect all valid<br />
patterns of the domain to be found in the data set.<br />
5 Final remarks<br />
5.1 Is a parser vital for the extraction process?<br />
An initial assumption during the development of the method in this thesis was that it was of high<br />
importance to base the extraction method on a syntactic parse of the text collection. As the<br />
reader will recall, the reasons for this assumption were elaborated in chapter 3 and will therefore<br />
not be discussed further here. However, as a means of evaluating the extraction method, the<br />
texts in the text collection were processed using the Oslo-Bergen Tagger (OBT 2005). This is a<br />
part of speech (POS) tagger which among other options offers the user a syntactic<br />
disambiguation of the input text. The texts were POS tagged using the web version of the tagger<br />
and structures corresponding to subject-verb-object relations were manually extracted from the<br />
output. This yielded a list of 169 structures, 26 of them with pronouns. The structures were<br />
extracted using a quite rudimentary method; for example, no distinction was made between<br />
active and passive versions of the same predicate. This resulted in a list featuring exactly the<br />
problematic issues discussed in chapter 3; the arguments were represented (and subsequently<br />
structured) not according to thematic roles, but merely according to their syntactic roles in the<br />
sentence. As a result, the list did not reflect characteristic arguments of the different predicates<br />
to the same degree as the EPAS list did. The list of the POS-based structures is available in<br />
Appendix F.<br />
Consider the FCA diagram in Figure 7 below. Figure 7 shows part of the FCA diagram created<br />
for the POS-based structures; the section of the diagram with the argument politi (police) as<br />
starting point is highlighted. When comparing this figure to the corresponding figure for the<br />
EPAS list (Figure 5 in section 3.6.4), it is quite clear that the POS-based list is significantly less<br />
generalisable. There are no clear groupings of arguments which display specific behaviour<br />
through their combination with a certain subset of predicates. Because formal subjects of both<br />
active and passive sentences are realised as first arguments in this extraction, it is hardly<br />
possible to group arguments into groups of semantically related words based on their<br />
distribution. As can be seen from the diagram, politi co-occurs with both sykepleiestudent<br />
(student nurse) and bilfører (driver), as well as other, more relevant, arguments.<br />
Figure 7<br />
Interestingly enough, however, the POS-based list of structures proved to be just as well suited<br />
as the EPAS list for subsequent classification using TiMBL. When training and testing the<br />
classifier on the POS-based structures, it assigned the correct antecedent in 57.69% (15/26) of<br />
the test cases. In comparison, the EPAS classifier performed with an accuracy of 57.69% when<br />
trained and tested on argument 1, and with an overall accuracy of 46.87%.<br />
These results are interesting mainly because they show that for the purposes of using a memory<br />
based classifier, an extraction method based on a syntactic parser does not necessarily provide<br />
better results than a POS-tagger based method. Even though the list of extracted structures was<br />
decidedly poorer than the EPAS list, especially because it contained “wrong” information in the<br />
sense that logical objects were listed as subjects by virtue of their syntactic role, it provided<br />
useful input for the classification process. It is, however, as suggested by the FCA diagram<br />
above, likely that the POS-based list would be of less use for the concept association phase,<br />
since this approach relies on the presence of similar entities in similar positions in the structures.<br />
As a conclusion, it can probably be stated that the advantages of using a syntactic parser in the<br />
extraction process are less clear than first presumed. For the purposes of aiding anaphora<br />
resolution, it may well be that an extraction method performs equally well when based on<br />
shallower processing methods.<br />
5.2 Summary and conclusions<br />
This thesis has described a method for corpus-based semantic categorisation of predicates and<br />
arguments in a limited thematic domain. The aim of the project was to create a means of<br />
automatically inferring selectional restrictions corresponding to real-world knowledge of the<br />
domain of the text collection. The classification of the predicates and arguments extracted from<br />
the text collection resulted in several concept groups, where each concept group displayed a<br />
particular distribution in the text collection.<br />
In the introduction of the thesis, it was stated that a chief goal of the project was to assess the<br />
value of using co-occurrence patterns to create concept groups which can act as an aid in the<br />
process of pronoun resolution. The concept groups were thought to function as an intuition<br />
about which word to expect in a given environment. Two criteria were formulated with regards<br />
to the evaluation of the results obtained by the project:<br />
• were the concept groups created valid for the domain of the text collection?<br />
• were the concept groups useful in the process of anaphora resolution?<br />
Through classification and testing of the extracted data set some remarks can be made with<br />
regards to these two criteria. The concept groups that emerged as a product of the association<br />
performed in the classification phase did indeed seem to constitute valid groupings of<br />
semantically similar words. The concept groups were made based on the contextual distribution<br />
of arguments in the text collection and represent groups of words which “keep the same<br />
company” and tend to occur in similar environments. They are valid groupings for the domain of<br />
the text collection and confirm the intuition that similar words display similar distribution, and<br />
thus similar behaviour in the data set. The tests performed with the concept groups show that<br />
they do contribute to heightening the success rate of the MBL classifier; when testing with<br />
EPAS containing pronouns the classifier assigned the correct concept group as antecedent in<br />
78% of the instances, in comparison to an almost 47% success rate without concept groups.<br />
When testing on knowledge-dependent anaphors and on anaphors which do not have an<br />
explicitly mentioned antecedent in the text, it was evident that concept groups contribute<br />
interesting information. Ideally, a referent-guessing helper using concept groups should be<br />
consulted as part of an anaphora resolution system. In the event of several possible antecedent<br />
candidates motivated from the text and proposed by the system, the concept groups in<br />
connection with the context pattern of the anaphor can provide useful information about which<br />
type of antecedent is likely. In this way the concept groups represent information about the valid<br />
contextual patterns for the domain.<br />
The stumbling block of the method in this thesis is its limited scale. The data set used<br />
for the analyses is fairly small, and as a consequence the results are less powerful than they could<br />
have been. The extraction method is at best semi-automatic and relies on far too much manual<br />
intervention. This is a recurring problem for many methods within the field of NLP; Mitkov,<br />
for example, notes that “only a few anaphora resolution systems operate in fully automatic mode”<br />
(Mitkov 2001, p. 111). Most systems rely on manual pre-editing of the input texts, and some<br />
methods are only manually simulated. In order for a method to be fully automatic, there should<br />
be no human intervention at any stage (Mitkov 2001, p. 114). In the case of the project described<br />
in this thesis, the extraction method involves far too much manual manipulation to be considered automatic.<br />
The scope of the results is naturally influenced by the limitations of the data set, but regardless<br />
of the size of the data set and the manual intervention employed in the extraction phase, the<br />
method shows promising results. It was clear from the beginning that this would be a pilot study<br />
aiming at providing an indication of the usefulness of the method.<br />
In view of the results, it can be stated that using contextual distribution to derive intuitions about<br />
selectional restrictions in a limited domain is a promising venture. The results obtained in this<br />
project suggest that the distribution of predicates and arguments within a closed domain has<br />
potential use as a representation of real-world knowledge. More definite conclusions about the<br />
extent to which such a method captures enough relevant intuitions about real-world knowledge<br />
to replace it in an anaphora resolution system can, however, only be drawn from a<br />
larger-scale study.<br />
6 References<br />
Asudeh, Ash and Mary Dalrymple. (2004): Binding Theory. Working paper.<br />
Available at: www.ling.canterbury.ac.nz/personal/asudeh/pdf/asudeh-dalrymple-binding.pdf<br />
Baldwin, Breck. (1997): CogNIAC: high precision coreference with limited knowledge and linguistic<br />
resources. Proceedings of the ACL’97/EACL’97 Workshop on Operational Factors in Practical, Robust<br />
Anaphora Resolution (Madrid), pp. 38-45.<br />
Available at http://acl.eldoc.ub.rug.nl/mirror/W/W97/index.html<br />
Botley, Simon and Tony McEnery. (2000): Discourse anaphora: The need for synthesis. Chapter 1 in<br />
Botley and McEnery (eds): Corpus-based and Computational Approaches to Discourse Anaphora. John<br />
Benjamins Publishing Company, pp. 1-41.<br />
Bresnan, Joan. (2001): Lexical-Functional Syntax. Blackwell.<br />
Carbonell, Jamie G. and Ralf D. Brown. (1988): Anaphora Resolution: A Multi-Strategy Approach.<br />
Proceedings of the 12th International Conference on Computational Linguistics (COLING’88, Budapest),<br />
pp. 96-101.<br />
Available at: http://acl.ldc.upenn.edu/C/C88/C88-1021.pdf<br />
CognIT website (2004): http://www.cognit.no/<br />
Consulted 23/11-2004<br />
Copestake, A., D. Flickinger, I. Sag, C. Pollard. (2003): Minimal Recursion Semantics. An Introduction.<br />
Working paper.<br />
Available at: http://lingo.stanford.edu/sag/papers/copestake.pdf<br />
Cover, T. M. and P. E. Hart. (1967): Nearest neighbor pattern classification. Institute of Electrical and<br />
Electronics Engineers Transactions on Information Theory, pp. 21-27.<br />
Available at: http://yreka.stanford.edu/~cover/papers/transIT/0021cove.pdf<br />
Daelemans, Walter, A. van den Bosch and J. Zavrel. (1999): Forgetting Exceptions is Harmful in<br />
Language Learning. Machine Learning 34, special issue in natural language learning, pp. 11-43.<br />
Available at: http://ilk.kub.nl/pub/papers/harmful.ps<br />
Daelemans, Walter, J. Zavrel, K. van der Sloot, A. van den Bosch. (2003): TiMBL: Tilburg Memory<br />
Based Learner, version 5.0, Reference Guide. ILK Technical Report 03-10.<br />
Available at: http://ilk.uvt.nl/downloads/pub/papers/ilk0310.ps.gz<br />
Dagan, Ido and Alon Itai. (1990): Automatic Processing of Large Corpora for the Resolution of<br />
Anaphora References. Proceedings of the 13th International Conference on Computational Linguistics<br />
(COLING ’90, Helsinki), pp. 330-332.<br />
Available at: http://acl.ldc.upenn.edu/C/C90/C90-3063.pdf<br />
Dagan, Ido, S. Marcus, S. Markovitch. (1995): Contextual word similarity and estimation from sparse<br />
data. In 30th Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio. Ohio<br />
State University, Association for Computational Linguistics, Morristown, New Jersey, pp. 164-171.<br />
Available at: http://citeseer.ist.psu.edu/article/dagan95contextual.html<br />
Firth, J. R. (1957): A synopsis of linguistic theory, 1930-55. In Studies in Linguistic Analysis,<br />
Philological Society, Oxford; reprinted in F. R. Palmer (ed.) (1968): Selected Papers of J. R. Firth 1952-<br />
59. Longman, pp. 168-205.<br />
Grefenstette, Gregory. (1992): SEXTANT: Exploring unexplored contexts for semantic extraction from<br />
syntactic analysis. Proceedings, 30th Annual Meeting of the Association for Computational Linguistics,<br />
pp. 324-326.<br />
Available at http://citeseer.ist.psu.edu/grefenstette92sextant.html<br />
Harris, Zellig S. (1968). Mathematical Structures of Language. New York: Wiley.<br />
Hellan, Lars. (1988): Anaphora in Norwegian and the Theory of Grammar. No 32 in Studies in<br />
Generative Grammar. Foris Publications, the Netherlands.<br />
Hindle, Donald. (1990): Noun classification from predicate-argument structures. In Proceedings of the<br />
28th annual meeting of the Association for Computational Linguistics, pp. 268-275.<br />
Available at http://citeseer.ist.psu.edu/hindle90noun.html<br />
ILK website (2004): http://ilk.kub.nl/<br />
Consulted 12/12-2004<br />
Jurafsky, Daniel and James H. Martin. (2000): Speech and Language Processing. An Introduction to<br />
Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall.<br />
Kamp, Hans and Uwe Reyle. (1993): From Discourse to Logic. Introduction to Modeltheoretic<br />
Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Kluwer Academic<br />
Publishers (the Netherlands).<br />
KunDoc website (2004): http://www.kundoc.net/<br />
Consulted 23/11-2004<br />
Lin, Dekang. (1997): Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity. In<br />
Proceedings of ACL-97 (Madrid), pp. 64-71.<br />
Available at: http://citeseer.ist.psu.edu/article/lin97using.html<br />
Lin, Dekang. (1998): Automatic Retrieval and Clustering of Similar Words. In Proceedings of<br />
COLING-ACL '98 (Montreal), pp. 768-774.<br />
Available at: http://citeseer.ist.psu.edu/16998.html<br />
Lin, Dekang and Patrik Pantel. (2001): Induction of Semantic Classes from Natural Language Text. In<br />
Proceedings of SIGKDD-01 (San Francisco), pp. 317-322.<br />
Available at: http://citeseer.ist.psu.edu/lin01induction.html<br />
Mani, Inderjeet. (2001): Automatic summarization. John Benjamins.<br />
Matthews, P. H. (1997): The Oxford Concise Dictionary of Linguistics. Oxford University Press.<br />
Miller, G. and C. Leacock (2000): Lexical representations for sentence processing. Chapter 8 in Y.<br />
Ravin and C. Leacock (ed.): Polysemy: Theoretical and computational approaches. Oxford University<br />
Press.<br />
Mitkov, Ruslan. (1999): Anaphora Resolution: The State of the Art. Working paper, University of<br />
Wolverhampton.<br />
Available at: http://citeseer.ist.psu.edu/mitkov99anaphora.html<br />
Mitkov, Ruslan. (2001): Outstanding issues in anaphora resolution. In: Alexander Gelbukh (ed):<br />
Computational Linguistics and Intelligent Text Processing, pp. 110-125.<br />
Mitkov, Ruslan. (2003): Anaphora Resolution. Chapter 14 in Mitkov (ed): The Oxford Handbook of<br />
Computational Linguistics. Oxford University Press, pp. 266-283.<br />
Nasukawa, Tetsuya. (1994): Robust method of pronoun resolution using full-text information.<br />
Proceedings of the 15th International Conference on Computational Linguistics (COLING’94, Kyoto),<br />
pp.1157-1163.<br />
Available at: http://acl.eldoc.ub.rug.nl/mirror/C/C94/index.html<br />
NorGram website (2004): http://www.hf.uib.no/i/LiLi/SLF/Dyvik/norgram/<br />
Consulted 23/11-2004<br />
OBT (2005): Oslo-Bergen-taggeren<br />
Available at: http://decentius.aksis.uib.no/cl/cgp/obt.html<br />
Pantel, Patrick and Dekang Lin (2002): Discovering word senses from text. In Proceedings of ACM<br />
SIGKDD Conference on Knowledge Discovery and Data Mining (Edmonton), pp. 613-619.<br />
Pereira, Fernando, N. Tishby, L. Lee. (1993): Distributional clustering of English words. Proceedings of<br />
the 31st Annual Meeting of the ACL, pp. 183-190.<br />
Available at: http://acl.eldoc.ub.rug.nl/mirror/P/P93/index.html<br />
Robins, R. H. (1997): A Short History of Linguistics. Longman.<br />
Saeed, John I. (1997): Semantics. Blackwell.<br />
Velldal, Erik. (2003): Modelling Word Senses With Fuzzy Clustering. Cand. Philol. Thesis in Language,<br />
Logic and Information. University of Oslo.<br />
Wolff, Karl Erich. (1994): A first course in formal concept analysis. In: Faulbaum, F. (ed): SoftStat’93<br />
Advances in Statistical Software 4, pp. 429-438.<br />
Appendix A: Ekstraktor.pl – algorithm<br />
The algorithm behind Ekstraktor is divided into two separate parts: information retrieval<br />
from the Prolog file and processing of the information that was found and stored.<br />
First a Prolog output file is opened and each line of the file is read. Based on pattern matching,<br />
lines from the file are stored in different arrays according to which pattern they<br />
match.<br />
Subsequent to the information-extraction from the Prolog file, the information stored in<br />
the arrays is processed for the purpose of creating predicate-argument structures. In the<br />
following, I will give a brief outline of the processing steps. I will do this by describing<br />
each of the central functions in Ekstraktor.<br />
The term epmor (eng: ep mother) corresponds to the first EP in the ARG0ep-array, in<br />
most cases meaning the EP “in question”.<br />
finnHoved();<br />
Finds the semantic forms of the main/first predicate-argument structure in the sentence.<br />
This function calls the following (sub)functions:<br />
finnEP1();<br />
Since the entities parsed are full sentences, the main structure is limited to having a verb<br />
as its head. This function searches the array catsuff for a pattern with the first member of<br />
ARG0ep as its EP. If such a pattern is found, the EP is discarded and the first members of<br />
arrays ARG0ep and ARG0verdi are removed.<br />
finnPred();<br />
Finds the semantic value of the sentence’s predicate/ARG0. Goes through the array<br />
semform searching for a pattern with the first member of ARG0ep as EP. If such a pattern<br />
is found, the semantic form is retrieved and stored in the array predikat.<br />
In order to avoid an “empty” semantic form when the argument is a proper noun, it is<br />
checked whether the retrieved form matches named. If so, the array navn is searched for a<br />
pattern with the first member of ARG0ep as EP. If such an entry is found, predikat is<br />
emptied and the new semantic form is stored there.<br />
Some predicates have an extra attribute which is stored in the array prt. Each line in this<br />
array is searched for a pattern with the first member of ARG0ep as EP. If such an entry is<br />
found, the semantic form is retrieved and stored in the array ekstra.<br />
lagVerbStruktur();<br />
Creates the correct verbal structure for the predicate. This is for the cases where the<br />
predicate has an additional attribute – as in the predicate “lete etter” (Eng: look for). The<br />
function checks if there are any members in the array ekstra. If so, the main predicate and<br />
this extra attribute are stored in the array hovedpred.<br />
If there is nothing stored in ekstra, the main predicate is simply stored in the array<br />
hovedpred.<br />
finnARG1();<br />
Returns the semantic form of argument 1 and stores it in the array ARG1. First the arrays<br />
ARGxep, ARGxverdi, ep and ARGx are emptied and subsequently set to the corresponding<br />
argument 1 values. Then finnARGx() is called.<br />
finnARGx();<br />
Generalized function that finds the EP where the semantic form of the argument<br />
in question is stored, calls finnARGxsemform() and returns the semantic form.<br />
For ARG1 the actions are as follows:<br />
Goes through each member in ARG1ep. If an entry matches the first member of ARG0ep as EP, the<br />
entry on the same index in ARG1verdi is stored as ARGx. Goes through each<br />
member in ARG0verdi. If ARGx matches an entry in ARG0verdi, the entry on the<br />
same index in ARG0ep is retrieved and stored in the array ep.<br />
finnARGxsemform() is called.<br />
finnARGxsemform();<br />
Generalized function that finds the semantic form of the argument in<br />
question.<br />
For ARG1 the actions are as follows:<br />
Find semantic form of double predicates, if there are any:<br />
The variable epARGx is set to the first member of the array ep (this array<br />
holds the indexes of EPs where the semantic form of ARGx is stored).<br />
Goes through the array index (which holds pointers to semantic forms of double<br />
arguments); if an entry matches epARGx as EP, the index pointer is<br />
retrieved and stored in the array ARGxind. Goes through semform; if an<br />
entry matches epARGx as EP, it is removed from the array.<br />
If there are any entries in ARGxind, each member is looked at. If an entry<br />
matches an entry in ARG0verdi, the entry on the same index in ARG0ep is<br />
added to the array liste. The array semform is gone through; if an entry<br />
matches an entry from liste as EP, the semantic form is retrieved and<br />
stored in the array ARGx.<br />
Else find the semantic form of the argument in question:<br />
Goes through semform; if an entry matches epARGx as EP, the semantic<br />
form is retrieved and stored in the array ARGx. If the element stored in<br />
ARGx matches ‘named’, the proper noun must be found. The array navn is<br />
searched for a pattern with epARGx as EP. The semantic form is retrieved<br />
and stored in the array ARGx.<br />
The contents of the array ARGx are stored in the array ARG1.<br />
The contents of ARG1 are stored as HovedARG1 in finnHoved().<br />
90
finnARG2();<br />
This function behaves exactly like finnARG1(), only with<br />
correspondingly different variable and array names.<br />
The contents of ARG2 are stored as HovedARG2 in finnHoved().<br />
fjernEP();<br />
Removes elements from the arrays ARG1ep, ARG1verdi, ARG2ep and ARG2verdi if they<br />
belong to the main EP.<br />
Goes through ARG1ep and ARG2ep. If the first member of ARG0ep matches the entry,<br />
the entry and the entry on the same index in the value-array is removed.<br />
The first entry in ARG0ep and ARG0verdi is subsequently removed.<br />
sjekkEkstra();<br />
Checks if there are more predicate-argument structures to be found, calls finnResten() if<br />
there are.<br />
Goes through ARG1ep and ARG2ep trying to match each element in ARG0ep. If there is a<br />
match, there exists a predicate with an associated argument, and finnResten() is called.<br />
finnResten();<br />
Finds the remaining predicate-argument structures.<br />
Calls the following (sub)functions:<br />
finnPred();<br />
finnARG1();<br />
finnARG2();<br />
fjernEP();<br />
sjekkEkstra();<br />
lagStruktur();<br />
Creates the predicate-argument structures as printed to the output file.<br />
If HovedARG1 or HovedARG2 contains more than one element, each element is printed<br />
together with the predicate and the other argument.<br />
Otherwise, hovedpred, HovedARG1 and HovedARG2 are printed to file, separated by commas.<br />
Appendix B: Ekstraktor.pl – program code<br />
Perl script Ekstraktor.pl<br />
#opens the file given on the command line when the program is run<br />
open(FIL, $ARGV[0]) or die("Kan ikke åpne filen!!\n");<br />
#reads each line of the file and stores it in different arrays depending on what is read;<br />
#this stores all information needed to extract the pred-arg structures<br />
while ($linjeFraFil = <FIL>) {<br />
#stores the index value in @ARG0ep and the arg0 value in @ARG0verdi if the line from the file contains ARG0<br />
if ($linjeFraFil =~ m/ARG0/){<br />
henteVerdi();<br />
push(@ARG0ep, $ep);<br />
push(@ARG0verdi, $verdi);<br />
}<br />
#stores the index value in @ARG1ep and the arg1 value in @ARG1verdi if the line from the file contains ARG1<br />
if ($linjeFraFil =~ m/ARG1/){<br />
henteVerdi();<br />
push(@ARG1ep, $ep);<br />
push(@ARG1verdi, $verdi);<br />
}<br />
#stores the index value in @ARG2ep and the arg2 value in @ARG2verdi if the line from the file contains ARG2<br />
if ($linjeFraFil =~ m/ARG2/){<br />
henteVerdi();<br />
push(@ARG2ep, $ep);<br />
push(@ARG2verdi, $verdi);<br />
}<br />
#stores the index value in @ARG3ep and the arg3 value in @ARG3verdi if the line from the file contains ARG3<br />
if ($linjeFraFil =~ m/ARG3/){<br />
henteVerdi();<br />
push(@ARG3ep, $ep);<br />
push(@ARG3verdi, $verdi);<br />
}<br />
#stores the line in @restriksjoner if it contains 'BODY'<br />
if ($linjeFraFil =~ m/'BODY'/){<br />
push(@restriksjoner, $linjeFraFil);<br />
}<br />
#stores the line in @restriksjoner if it contains 'RSTR'<br />
if ($linjeFraFil =~ m/'RSTR'/){<br />
push(@restriksjoner, $linjeFraFil);<br />
}<br />
#stores the line in @semform if it contains 'semform'<br />
if ($linjeFraFil =~ m/'relation'\),semform\(/){<br />
push(@semform, $linjeFraFil);<br />
}<br />
#stores the line in @cat if it contains '_CAT'<br />
if ($linjeFraFil =~ m/'_CAT'\)/){<br />
push(@cat, $linjeFraFil);<br />
}<br />
#stores the line in @catsuff if it contains '_CATSUFF'<br />
if ($linjeFraFil =~ m/'_CATSUFF'\)/){<br />
push(@catsuff, $linjeFraFil);<br />
}<br />
#stores the line in @prt if it contains '_PRT'<br />
if ($linjeFraFil =~ m/'_PRT'\)/){<br />
push(@prt, $linjeFraFil);<br />
}<br />
#stores the line in @index if it contains 'L-INDEX'<br />
if ($linjeFraFil =~ m/'L-INDEX'\)/){<br />
push(@index, $linjeFraFil);<br />
}<br />
#stores the line in @index if it contains 'R-INDEX'<br />
if ($linjeFraFil =~ m/'R-INDEX'\)/){<br />
push(@index, $linjeFraFil);<br />
}<br />
#stores the line in @navn if it contains 'CARG'<br />
if ($linjeFraFil =~ m/'CARG'\)/){<br />
push(@navn, $linjeFraFil);<br />
}<br />
} #end while loop<br />
close(FIL);<br />
#Here the processing of the information extracted from the input file begins:<br />
#removes EPs containing information to be disregarded<br />
fjernRestri();<br />
#removes the first EP if it does not have category 'v'<br />
finnCat();<br />
#print("ARG0ep = \n@ARG0ep\nARG0verdi = \n@ARG0verdi\nARG1ep = \n@ARG1ep\nARG1verdi = \n@ARG1verdi\nARG2ep = \n@ARG2ep\nARG2verdi = \n@ARG2verdi\n");<br />
#finds the main structure<br />
finnHoved();<br />
#print("ARG0ep = \n@ARG0ep\nARG0verdi = \n@ARG0verdi\nARG1ep = \n@ARG1ep\nARG1verdi = \n@ARG1verdi\nARG2ep = \n@ARG2ep\nARG2verdi = \n@ARG2verdi\n");<br />
#print("@semform\n");<br />
#print("@navn\n");<br />
#finds predicate-argument structure no. 2<br />
#sjekkEkstra();<br />
#finnResten();<br />
#appends the predicate-argument structures to the end of the given file<br />
open(OUTPUTFIL, ">>strukturer.txt") or die("kan ikke skrive til fil\n");<br />
#open(OUTPUTFIL, ">>home/unni/Hovedoppgave/parse/pas-strukturer.txt") or die("kan ikke skrive til fil\n");<br />
sjekkEkstra();<br />
#creates the main structure<br />
lagStruktur();<br />
close(OUTPUTFIL);<br />
#here come all the subfunctions:<br />
#henteVerdi():<br />
#extracts the relation index and the value of ARGx from a line from<br />
#the input file and stores them in $ep and $verdi<br />
#The line from the file is split at commas and stored in @utenKomma. The values are extracted with substr().<br />
sub henteVerdi {<br />
@utenKomma = split(/,/, $linjeFraFil);<br />
push(@args, @utenKomma);<br />
$ep = substr(@utenKomma[1], 12, 2);<br />
if ($ep =~ /\)/){<br />
$ep = split(/\)/, $ep);<br />
}<br />
$verdi = substr(@utenKomma[3], 4, 2);<br />
if ($verdi =~ /\)/){<br />
$verdi = split(/\)/, $verdi);<br />
}<br />
}<br />
#finnHoved:<br />
#finds the main pred-arg structure<br />
#usually predicate,arg1,arg2<br />
sub finnHoved {<br />
finnEP1();<br />
finnPred();<br />
lagVerbStruktur();<br />
finnARG1();<br />
@HovedARG1 = @ARG1;<br />
finnARG2();<br />
@HovedARG2 = @ARG2;<br />
fjernEP();<br />
}<br />
sub sjekkEkstra {<br />
foreach $element (@ARG0ep){<br />
foreach $element2 (@ARG1ep){<br />
if ($element =~ $element2){<br />
#print("match!\n");<br />
finnResten();<br />
}<br />
}<br />
foreach $element3 (@ARG2ep){<br />
if ($element =~ $element3){<br />
#print("match2!\n");<br />
finnResten();<br />
}<br />
}<br />
}<br />
#print("ARG0ep: @ARG0ep\nARG1ep: @ARG1ep\nARG2ep: @ARG2ep\n");<br />
}<br />
sub finnResten {<br />
finnPred();<br />
finnARG1();<br />
finnARG2();<br />
print(OUTPUTFIL "@predikat,@ARG1,@ARG2\n");<br />
print("@predikat,@ARG1,@ARG2\n");<br />
fjernEP();<br />
splice (@predikat);<br />
splice (@ARG1);<br />
splice (@ARG2);<br />
sjekkEkstra();<br />
}<br />
sub fjernEP{<br />
#removes elements from @ARG1ep/verdi and @ARG2ep/verdi if they belong to the main EP<br />
$epmor = $ARG0ep[0];<br />
for ($i = 0; $i < @ARG1ep; $i++){<br />
if ($epmor =~ $ARG1ep[$i]){<br />
splice(@ARG1ep, $i, 1);<br />
splice(@ARG1verdi, $i, 1);<br />
#print("@semform\n");<br />
}<br />
}<br />
for ($i = 0; $i < @ARG2ep; $i++){<br />
if ($epmor =~ $ARG2ep[$i]){<br />
splice(@ARG2ep, $i, 1);<br />
splice(@ARG2verdi, $i, 1);<br />
}<br />
}<br />
shift(@ARG0ep);<br />
shift(@ARG0verdi);<br />
}<br />
sub finnCat {<br />
foreach $linje (@cat){<br />
$epmor = $ARG0ep[0];<br />
if ($linje =~ m/\(attr\(var\($epmor\)/){<br />
@utenDings = split(/\'/, $linje);<br />
push (@args, @utenDings);<br />
$epcat = substr(@utenDings[3],0,1);<br />
#print("$epcat\n");<br />
if ($epcat !~ /v/){<br />
shift(@ARG0ep);<br />
shift(@ARG0verdi);<br />
}<br />
}<br />
}<br />
}<br />
#finnEP1():<br />
#finds the EP that is to be the starting point of the predicate-argument structure<br />
#$epmor is set to the first element of the ARG0 array<br />
#goes through @catsuff; if the line read matches $epmor, the first element of @ARG0ep and @ARG0verdi is removed<br />
sub finnEP1 {<br />
foreach $linje (@catsuff){<br />
$epmor = $ARG0ep[0];<br />
if ($linje =~ m/\(attr\(var\($epmor\)/){<br />
shift(@ARG0ep);<br />
shift(@ARG0verdi);<br />
}<br />
}<br />
}<br />
#finnPred():<br />
#finds the semantic value of ARG0/the predicate in the sentence<br />
#$epmor is set to the first element of @ARG0ep<br />
#if the line read matches $epmor and 'semform', it is split at ' and the elements are stored in @utenDings<br />
#the predicate is set to the fourth element of @utenDings<br />
#if the line matches $epmor and '_PRT', the line is split at ' and the semantic form is added to @ekstra<br />
sub finnPred {<br />
$epmor = $ARG0ep[0];<br />
foreach $linje (@semform){<br />
if ($linje =~ /\(attr\(var\($epmor\),'relation'\),semform/){<br />
@utenDings = split(/\'/, $linje);<br />
push(@args, @utenDings);<br />
@pred = @utenDings[3];<br />
push(@predikat, @pred);<br />
}<br />
}<br />
if ($predikat[0] =~ /named/){<br />
foreach $verdi (@navn){<br />
if ($verdi =~ /\(attr\(var\($epmor\)/){<br />
splice(@pred);<br />
splice(@predikat);<br />
@uten = split(/\'/, $verdi);<br />
push(@arg, @uten);<br />
@pred = @uten[3];<br />
push(@predikat, @pred);<br />
#print("@predikat\n");<br />
}<br />
}<br />
}<br />
foreach $linje (@prt){<br />
if ($linje =~ /\(attr\(var\($epmor\)/){<br />
@utenDings = split(/\'/, $linje);<br />
push(@args, @utenDings);<br />
@ekstr = @utenDings[3];<br />
push(@ekstra, @ekstr);<br />
}<br />
}<br />
}<br />
sub finnARG1 {<br />
$imax = 0;<br />
splice(@ARGxep);<br />
splice(@ARGxverdi);<br />
splice(@ep);<br />
splice(@ARGx);<br />
$imax = @ARG1ep;<br />
@ARGxep = @ARG1ep;<br />
@ARGxverdi = @ARG1verdi;<br />
finnARGx();<br />
@ARG1 = @ARGx;<br />
}<br />
sub finnARG2 {<br />
$imax = 0;<br />
splice(@ARGxep);<br />
splice(@ARGxverdi);<br />
splice(@ep);<br />
splice(@ARGx);<br />
$imax = @ARG2ep;<br />
@ARGxep = @ARG2ep;<br />
@ARGxverdi = @ARG2verdi;<br />
finnARGx();<br />
@ARG2 = @ARGx;<br />
}<br />
sub finnARG3 {<br />
$imax = 0;<br />
splice(@ARGxep);<br />
splice(@ARGxverdi);<br />
splice(@ep);<br />
splice(@ARGx);<br />
$imax = @ARG3ep;<br />
@ARGxep = @ARG3ep;<br />
@ARGxverdi = @ARG3verdi;<br />
finnARGx();<br />
@ARG3 = @ARGx;<br />
}<br />
sub finnARGx {<br />
$epmor = $ARG0ep[0];<br />
for ($i = 0; $i < $imax; $i++){<br />
if ($epmor =~ /$ARGxep[$i]/){<br />
$ARGx = $ARGxverdi[$i];<br />
$imax2 = @ARG0verdi;<br />
for ($ii = 0; $ii < $imax2; $ii++){<br />
if ($ARGx =~ /$ARG0verdi[$ii]/){<br />
push(@ep, $ARG0ep[$ii]);<br />
}<br />
}#end for2<br />
}<br />
}#end for1<br />
finnARGxsemform();<br />
}<br />
#fjernRestri():<br />
#sets @ARGxep and @ARGxverdi to the ARG0 values<br />
#runs restrik()<br />
sub fjernRestri {<br />
@ARGxep = @ARG0ep;<br />
@ARGxverdi = @ARG0verdi;<br />
restrik();<br />
@ARG0ep = @ARGxep;<br />
@ARG0verdi = @ARGxverdi;<br />
}<br />
#restrik():<br />
#Goes through @restriksjoner and @index and removes values from @ARG0ep and @ARG0verdi if<br />
#these arrays contain information about them<br />
sub restrik {<br />
$imax = @ARGxep;<br />
for ($i = 0; $i < $imax; $i++){<br />
foreach $linje (@restriksjoner){<br />
if ($linje =~ /\(attr\(var\($ARGxep[$i]\)/){<br />
splice(@ARGxep, $i, 1);<br />
splice(@ARGxverdi, $i, 1);<br />
}<br />
}<br />
}<br />
foreach $linje (@index){<br />
@utenKomma = split(/,/, $linje);<br />
push(@args, @utenKomma);<br />
$ep = substr(@utenKomma[1], 12, 2);<br />
push(@indexep, $ep);<br />
}<br />
#print("@indexep\n");<br />
#print("@semform\n");<br />
foreach $linje (@indexep){<br />
for ($i = 0; $i < @semform; $i++){<br />
if ($semform[$i] =~ /\(attr\(var\($linje\)/){<br />
splice(@semform, $i, 1);<br />
#print("@semform\n");<br />
}<br />
}<br />
}<br />
}<br />
#restrimatch():<br />
#removes EPs that do not contain information about the semantic form<br />
#goes through each element of @restriksjoner and for each element $epARGx is set to the first element of @ep<br />
#if the element from @restriksjoner contains $epARGx as index value, it is removed from @ep<br />
sub restrimatch {<br />
foreach $linje (@restriksjoner){<br />
$epARGx = $ep[0];<br />
if ($linje =~ m/\(attr\(var\($epARGx\)/){<br />
shift(@ep);<br />
}<br />
}<br />
}<br />
#restrimatch for double arguments:<br />
#same procedure as for restrimatch(), but with other variables etc.<br />
sub restrimatch4 {<br />
$imax = @liste;<br />
for ($i = 0; $i < $imax; $i++){<br />
foreach $linje (@restriksjoner){<br />
$epARGx = $liste[$i];<br />
if ($linje =~ m/\(attr\(var\($epARGx\)/){<br />
splice(@liste,$i,1);<br />
}<br />
}<br />
}<br />
}<br />
#MODULARISED VERSION - GENERIC FUNCTION FOR FINDING THE SEMANTIC FORM<br />
#finnARGxsemform():<br />
sub finnARGxsemform {<br />
$epARGx = $ep[0];<br />
foreach $linje (@index){<br />
if ($linje =~ /\(attr\(var\($epARGx\)/){<br />
@utenKomma = split(/,/, $linje);<br />
push(@args, @utenKomma);<br />
$verdi = substr(@utenKomma[3],4,2);<br />
push(@ARGxind, $verdi);<br />
for ($i = 0; $i < @semform; $i++){<br />
if ($semform[$i] =~ /\(attr\(var\($epARGx\)/){<br />
splice(@semform, $i, 1);<br />
}<br />
}<br />
}<br />
}<br />
#finds the EPs where an element of @ARGxind is the value of ARG0 and stores them in the array @liste<br />
if (@ARGxind != 0){<br />
foreach $element (@ARGxind){<br />
$imax = @ARG0verdi;<br />
for ($i = 0; $i < $imax; $i++){<br />
if($element =~ /$ARG0verdi[$i]/){<br />
push(@liste, $ARG0ep[$i]);<br />
}<br />
}<br />
}<br />
foreach $linje (@semform){<br />
foreach $element (@liste){<br />
if ($linje =~ /\(attr\(var\($element\)/){<br />
@utenDings = split(/\'/, $linje);<br />
push(@args, @utenDings);<br />
$ARG = @utenDings[3];<br />
push(@ARGx, $ARG);<br />
}<br />
}<br />
}<br />
}<br />
else{<br />
foreach $linje (@semform){<br />
if ($linje =~ /\(attr\(var\($epARGx\)/){<br />
@utenDings = split(/\'/, $linje);<br />
push(@args, @utenDings);<br />
@ARGx = @utenDings[3];<br />
}<br />
}<br />
if ($ARGx[0] =~ /named/){<br />
foreach $verdi (@navn){<br />
if ($verdi =~ /\(attr\(var\($epARGx\)/){<br />
splice(@ARGx);<br />
#splice(@predikat);<br />
@uten = split(/\'/, $verdi);<br />
push(@arg, @uten);<br />
@ARGx = @uten[3];<br />
#push(@predikat, @pred);<br />
#print("@predikat\n");<br />
}<br />
}<br />
}<br />
}<br />
}<br />
#end finnARGxsemform<br />
#lagStruktur():<br />
#creates the predicate-argument structure to be written to the output file<br />
#if @HovedARG1 or @HovedARG2 contains more than one element, each element is printed separately<br />
#@hovedpred, @HovedARG1 and @HovedARG2 are written to the output file<br />
sub lagStruktur {<br />
#lagVerbStruktur();<br />
#creates the correct arg1 structure<br />
if (@HovedARG1 > 1){<br />
foreach $element (@HovedARG1){<br />
print(OUTPUTFIL "\n@hovedpred,$element,@HovedARG2\n");<br />
print("@hovedpred,$element,@HovedARG2\n");<br />
}<br />
}<br />
# print(OUTPUTFIL "\n@hovedpred,@ARG1sem[0],@ARG2,$ARG3\n");<br />
if (@HovedARG2 > 1){<br />
foreach $element (@HovedARG2){<br />
print(OUTPUTFIL "\n@hovedpred,@HovedARG1,$element\n");<br />
print("@hovedpred,@HovedARG1,$element\n");<br />
}<br />
}<br />
else {<br />
print(OUTPUTFIL "\n@hovedpred,@HovedARG1,@HovedARG2\n");<br />
print("@hovedpred,@HovedARG1,@HovedARG2\n");<br />
}<br />
#if (@predikat != 0){<br />
# print(OUTPUTFIL "$predikat[0],@ARG1,@ARG2\n");<br />
# print("$predikat[0],@ARG1,@ARG2\n");<br />
#}<br />
}<br />
#creates the correct verb structure, e.g. "lete etter" (look for)<br />
sub lagVerbStruktur {<br />
if(@ekstra != 0){<br />
@hovedpred = ($predikat[0],$ekstra[0]);<br />
shift(@ekstra);<br />
shift(@predikat);<br />
}<br />
else {<br />
@hovedpred = $predikat[0];<br />
shift(@predikat);<br />
}<br />
}<br />
Appendix C: the EPAS list<br />
23-år-gammel,student,<br />
aktuell,tidsrom,<br />
analysere,Kripos-spesialist,spor<br />
ankomme,etterforsker,<br />
ankomme,etterforsker,<br />
ankomme,etterforsker,åsted<br />
antyde,politi,<br />
avhøre,,person<br />
avhøre,,vedkommende<br />
avhøre,politi,vitne<br />
avklare,obduksjon,<br />
bede om,lensmann,assistanse<br />
bede om,politi,bistand<br />
bede,lensmann,<br />
bede,lensmann,<br />
bede-om,Fonn,bistand<br />
bekrefte,lensmann,<br />
bekrefte,politi,<br />
bekrefte,politi,<br />
bekrefte,politimester,<br />
bistå,etterforsker,lensmann<br />
bistå,etterforsker,politi<br />
bistå,etterforsker,politi<br />
bli,Anne,offer<br />
bo,23-åring,studentkollektiv<br />
bo,Anne,studentkollektiv<br />
bo,beboer,studentkollektiv<br />
bo,Slåtten,studentkollektiv<br />
brutal,drapsmann,<br />
desperat,rop,<br />
død,sykepleiestudent,<br />
drepe,,kvinne<br />
drepe,,pron<br />
drepe,,pron<br />
drepe,,Slåtten<br />
drepe,gjerningsmann,kvinne<br />
drept,sykepleiestudent<br />
ekstra,patrulje,<br />
endelig,rapport,<br />
etterforske,medarbeider,drap<br />
etterlyse,,bilfører<br />
etterlyse,,bilfører<br />
etterlyse,politi,bilfører<br />
etterlyse,politi,person<br />
etterlyse,politi,syklist<br />
etterlyse,politi,syklist<br />
etterlyst,syklist,<br />
etterlyst,syklist,<br />
fastslå,politi,<br />
finkjemme,politi,bygning<br />
finne,,død<br />
finne,,død<br />
finne,,død<br />
finne,,kvinne<br />
finne,,lommebok<br />
finne,,pron<br />
finne,,pron<br />
finne,,sykepleiestudent<br />
finne,forbipasserende,sykepleiestudent<br />
finne,leteaksjon,kvinne<br />
finne,politi,drapsmann<br />
forfølge,,pron<br />
forkaste,politi,teori<br />
fortelle,beboer,politi<br />
fortelle,Fonn,<br />
fortelle,Fonn,
fra-kripos,etterforsker,<br />
fra-kripos,etterforsker,<br />
fra-kripos,etterforsker,<br />
første,praksisdag,<br />
førsteårs,sykepleiestudent,<br />
førsteårs,sykepleiestudent,<br />
få,politi,svar<br />
få,politi,tips<br />
få,pron,rapport<br />
få,pron,telefon<br />
gi,Fonn,opplysning<br />
gi,kamera,indikasjon<br />
gi,lensmann,opplysning<br />
gi,politi,informasjon<br />
gi,politi,opplysning<br />
gi,vitneavhør,indikasjon<br />
gjemme,drapsmann,<br />
gjemme,gjerningsmann,<br />
gjemme,gjerningsmann,<br />
gjennomføre,,rekonstruksjon<br />
gjennomgå,tekniker,studentkollektiv<br />
gjennomsøke,politi,studenthybel<br />
gjøre,politi,avhør<br />
gjøre,politi,rundspørring<br />
gå-gjennom,polititjenestefolk,material<br />
ha,etterforsker,observasjon<br />
ha,politi,medarbeider<br />
ha,politi,teori<br />
ha,pron,observasjon<br />
ha,pron,teori<br />
ha,pron,teori<br />
holde åpen,politi,mulighet<br />
holde,politi,kort<br />
holde,politi,pressekonferanse<br />
holde-åpen,politi,mulighet<br />
høre,pron,rop<br />
høre,pron,rop<br />
høre,vitne,rop<br />
høy,rop,<br />
håpe,politi,<br />
identifisere,politi,pron<br />
igangsette,,leteaksjon<br />
informere,,politi<br />
jobbe-utfra,pron,teori<br />
kartlegge,pron,bevegelse<br />
kartlegge,pron,bevegelse<br />
kjenne,generic-nom,Slåtten<br />
kjenne,politi,dødsårsak<br />
komme-i-kontakt-med,politi,bilfører<br />
komme-i-kontakt-med,politi,generic-nom<br />
komme-i-kontakt-med,pron,bilfører<br />
komme-i-kontakt-med,pron,syklist<br />
komme-inn,tips,<br />
kommentere,pron,<br />
kontakte,etterforsker,vitne<br />
kriminell,handling,<br />
kriminell,handling,<br />
kriminell,handling,<br />
melde-savnet,,student<br />
melde-seg,syklist,<br />
melde-seg,syklist,politi<br />
melde-seg,syklist,politi<br />
mene,etterforsker,<br />
mene,politi,<br />
merke,kjæreste,<br />
mistenkelig,dødsfall,<br />
mulig,teori,<br />
muntlig,rapport<br />
møte-opp-til,pron,praksisdag<br />
ny,tips,<br />
nær,opplysning
103<br />
obdusere,,kvinne<br />
observere,,23-åring<br />
observere,,bile<br />
observere,,person<br />
opplyse,Fonn,<br />
opplyse,vitne,<br />
oppmerksom,kvinne,<br />
overfalle,,Slåtten<br />
plombere,politi,hybelhus<br />
pågå,leteaksjon,<br />
påkledd,Slåtten,<br />
rigge,etterforsker,lyskaster<br />
samle,politi,observasjon<br />
sanke-inn,politi,video<br />
savne,,kvinne<br />
se,pron,Slåtten<br />
se,vitne,kvinne<br />
sentral,vitne,<br />
sette-igang,,leteaksjon<br />
sette-inn,politi,patrulje<br />
si,Fonn,<br />
si,Fonn,<br />
skade,,kvinne<br />
skje-med,generic-nom,kvinne<br />
slutte-seg-til,pron,Førde-politi<br />
sperre av,,hybelhus<br />
sperre av,politi,åsted<br />
spesiell,teori<br />
spesiell,teori,<br />
stenge av,politi,studentkollektiv<br />
stor,leteaksjon,<br />
systematisere,,tips<br />
søke-med,politi,hund<br />
ta,pron,utgangspunkt<br />
ta-høyde-for,lensmann,eventualitet<br />
ta-kontakt-med,politi,vitne<br />
ta-kontakt-med,syklist,politi<br />
taktisk,etterforsker,<br />
teknisk,etterforsker,<br />
teknisk,etterforsker,<br />
teknisk,spor,<br />
teknisk,spor,<br />
tidlig,teori,<br />
tilfeldig,forbipasserende,<br />
tilfeldig,offer,<br />
trenge,,vitne<br />
tro,lensmann,<br />
tro,politi,<br />
tro,pron,<br />
ukjent,gjerningsmann,<br />
ukjent,person,<br />
undersøke,,minibankaktiviteter<br />
undersøke,,mobiltelefontrafikk<br />
undersøke,,område<br />
undersøke,,overvåkningsfilmer<br />
undersøke,etterforsker,åsted<br />
undersøke,politi,aktivitet<br />
understreke,Fonn,<br />
understreke,pron,<br />
understreke,pron,<br />
understreke,pron,<br />
understreke,pron,generic-nom<br />
varsle,pron,Kripos<br />
velge,drapsmann,sykepleiestudent<br />
velge,drapsmann,sykepleiestudent<br />
ville,pron,kartlegge<br />
vise,funn,<br />
vise,funn,<br />
vise,undersøkelse,<br />
vite,politi,<br />
være,Anne,offer
være,bilfører,vitne<br />
være,bilfører,vitne<br />
være,etterforskning,bred<br />
være,kvinne,død<br />
være,kvinne,død<br />
være,kvinne,skadet<br />
være,kvinne,Slåtten<br />
være,lommebok,funn<br />
være,pron,funn<br />
være,pron,omkommet<br />
være,rapport,klar<br />
være,Slåtten,sykepleiestudent<br />
være,syklist,vitne<br />
være,vitne,kvinne<br />
ønske,politi,<br />
ønske,politi,<br />
åpen,mulighet,<br />
åpen,mulighet,<br />
104
Appendix D: Text aligned with EPAS

SENTENCE
    EPAS  [METHOD]

Kvinne funnet død i Førde.
    finne,,død  [automatic]
    være,kvinne,død  [automatic]

Den savnede kvinnen i Førde er nå funnet død.
    finne,,død  [automatic]
    savne,,kvinne  [automatic]
    være,kvinne,død  [automatic]

Politiet har gitt media opplysninger om funnet.
    gi,politi,opplysning  [automatic]

Lensmannen bekrefter at kvinnen er funnet død.
    finne,,død  [automatic]
    bekrefte,lensmann,  [automatic]

Politiet har bedt Kripos om bistand i søket etter kvinnen.
    bede om,politi,bistand  [automatic]

23-åringen var førsteårs sykepleiestudent i Førde.
    førsteårs,sykepleiestudent,  [edited]

Hun møtte ikke opp til sin første praksisdag ved Førde aldershjem.
    første,praksisdag,  [manual]
    møte_opp_til,pron,praksisdag  [edited]

Politiet ble informert.
    informere,,politi  [automatic]

En leteaksjon ble satt igang.
    sette_igang,,leteaksjon  [edited]

Leteaksjonen pågikk til kvinnen ble funnet.
    pågå,leteaksjon,  [automatic]
    finne,,kvinne  [manual]
    finne,leteaksjon,kvinne  [manual]

Politiet holder alle muligheter åpne i saken.
    holde åpen,politi,mulighet  [edited]
    åpen,mulighet,  [automatic]

Etterforskerne vil ankomme i morgen.
    ankomme,etterforsker,  [automatic]

Et vitne hørte desperate rop om hjelp.
    desperat,rop,  [automatic]
    høre,vitne,rop  [automatic]

Lensmannen har bedt om assistanse fra Kripos.
    bede om,lensmann,assistanse  [automatic]

Etterforskere fra Kripos skal bistå lensmannen i etterforskningen.
    bistå,etterforsker,lensmann  [automatic]
    fra_kripos,etterforsker,  [edited]

Etterforskerne forventes å ankomme i løpet av dagen.
    ankomme,etterforsker,  [manual]

Den 23 år gamle studenten ble meldt savnet tidlig søndag morgen.
    23-år_gammel,student,  [edited]
    melde_savnet,,student  [edited]

Anne Slåtten bodde i et studentkollektiv i Førde.
    bo,Anne,studentkollektiv  [edited]
    bo,Slåtten,studentkollektiv  [manual]

Slåtten var førsteårs sykepleiestudent i Førde.
    førsteårs,sykepleiestudent,  [edited]
    være,Slåtten,sykepleiestudent  [manual]

Hun ble funnet omkommet i et skogholt.
    finne,,pron  [automatic]
    være,pron,omkommet  [manual]
    være,pron,funn  [automatic]

Et vitne opplyste at hun hadde hørt høye rop.
    høy,rop,  [automatic]
    opplyse,vitne,  [automatic]
    høre,pron,rop  [automatic]

Mandag holdt politiet en pressekonferanse.
    holde,politi,pressekonferanse  [automatic]

Lensmannen vil ikke gi nærmere opplysninger om åstedet.
    gi,lensmann,opplysning  [automatic]
    nær,opplysning  [automatic]

Beboerne i studentkollektivet har fortalt politiet at de så Slåtten lørdag kveld.
    fortelle,beboer,politi  [edited]
    se,pron,Slåtten  [automatic]
    bo,beboer,studentkollektiv  [manual]

Politiet har sperret av åstedet.
    sperre av,politi,åsted  [edited]

Flere personer er avhørt i saken.
    avhøre,,person  [automatic]

Politiet holder alle muligheter åpne.
    holde_åpen,politi,mulighet  [edited]
    åpen,mulighet,  [automatic]

Kvinnen blir trolig obdusert i løpet av tirsdag.
    obdusere,,kvinne  [automatic]

Politiet håper obduksjonen vil avklare hva som skjedde med kvinnen.
    håpe,politi,  [manual]
    avklare,obduksjon,  [manual]
    skje_med,generic-nom,kvinne  [edited]

Mandag kveld ankom etterforskere fra Kripos åstedet.
    ankomme,etterforsker,åsted  [edited]
    fra_kripos,etterforsker,  [edited]

Sent mandag kveld rigget etterforskerne opp lyskastere.
    rigge,etterforsker,lyskaster  [automatic]

Fonn vil ikke gi flere opplysninger om åstedet.
    gi,Fonn,opplysning  [automatic]

Han vil ikke kommentere om kvinnen var skadet.
    kommentere,pron,  [automatic]
    skade,,kvinne  [automatic]
    være,kvinne,skadet  [manual]

Politiet holder kortene svært tett til brystet.
    holde,politi,kort  [automatic]

Det er ikke kommet inn mange tips i saken.
    komme_inn,tips,  [manual]

Tipsene skal nå systematiseres.
    systematisere,,tips  [automatic]

Fonn forteller at politiet vil ta kontakt med vitner.
    fortelle,Fonn,  [manual]
    ta_kontakt_med,politi,vitne  [manual]

Politiet har flere mulige teorier.
    mulig,teori,  [automatic]
    ha,politi,teori  [automatic]

Det mest sentrale vitnet i saken er en kvinne.
    sentral,vitne,  [automatic]
    være,vitne,kvinne  [automatic]

Hun skal ha hørt rop fra en kvinne.
    høre,pron,rop  [automatic]

Politiet har stengt av studentkollektivet der 23-åringen bodde.
    stenge av,politi,studentkollektiv  [automatic]
    bo,23-åring,studentkollektiv  [edited]

Studentkollektivet vil bli gjennomgått av teknikere.
    gjennomgå,tekniker,studentkollektiv  [automatic]

Fonn har bedt om teknisk bistand.
    bede-om,Fonn,bistand  [automatic]

Politiet bekrefter at Slåtten ble drept.
    bekrefte,politi,  [automatic]
    drepe,,Slåtten  [automatic]

Undersøkelsene på stedet viser at hun ble drept.
    vise,undersøkelse,  [automatic]
    drepe,,pron  [automatic]

Politiet tror at Slåtten ble overfalt.
    tro,politi,  [automatic]
    overfalle,,Slåtten  [automatic]

De tror at kvinnen ble drept av en ukjent gjerningsmann.
    tro,pron,  [automatic]
    ukjent,gjerningsmann,  [automatic]
    drepe,gjerningsmann,kvinne  [automatic]

Politiet fastslår at kvinnens lommebok ikke er funnet.
    fastslå,politi,  [automatic]
    finne,,lommebok  [manual]
    være,lommebok,funn  [automatic]

Fonn opplyser at området ikke er undersøkt.
    opplyse,Fonn,  [automatic]
    undersøke,,område  [automatic]

Politiet har forkastet en tidligere teori.
    forkaste,politi,teori  [automatic]
    tidlig,teori,  [automatic]

Politiet får senere i dag svar på dødsårsaken.
    få,politi,svar  [automatic]

Politiet har ikke gjennomsøkt Slåttens studenthybel.
    gjennomsøke,politi,studenthybel  [automatic]

Hele hybelhuset ble sperret av.
    sperre av,,hybelhus  [edited]

Politiet har plombert hybelhuset.
    plombere,politi,hybelhus  [automatic]

Politiet skal finkjemme bygningen for tekniske spor.
    finkjemme,politi,bygning  [automatic]
    teknisk,spor,  [automatic]

De tekniske etterforskerne har undersøkt åstedet.
    teknisk,etterforsker,  [automatic]
    undersøke,etterforsker,åsted  [automatic]

To tekniske etterforskere bistår politiet i Førde.
    teknisk,etterforsker,  [automatic]
    bistå,etterforsker,politi  [automatic]

En taktisk etterforsker fra Kripos bistår politiet.
    bistå,etterforsker,politi  [automatic]
    taktisk,etterforsker,  [automatic]
    fra_kripos,etterforsker,  [edited]

Lensmannen tar høyde for alle eventualiteter.
    ta_høyde_for,lensmann,eventualitet  [manual]

Vi varslet Kripos.
    varsle,pron,Kripos  [automatic]

Den døde sykepleierstudenten ble funnet av en tilfeldig forbipasserende.
    død,sykepleiestudent,  [automatic]
    finne,forbipasserende,sykepleiestudent  [automatic]
    tilfeldig,forbipasserende,  [edited]

23-åringen ble sist observert lørdag kveld.
    observere,,23-åring  [automatic]

Politiet vet at hun fikk en telefon fra kjæresten sin.
    vite,politi,  [automatic]
    få,pron,telefon  [automatic]

Kjæresten merket ikke at noe var galt.
    merke,kjæreste,  [automatic]

Vedkommende er avhørt.
    avhøre,,vedkommende  [automatic]

En større leteaksjon ble igangsatt.
    igangsette,,leteaksjon  [automatic]
    stor,leteaksjon,  [edited]

Politiet etterlyser en syklist.
    etterlyse,politi,syklist  [automatic]

Den etterlyste syklisten har tatt kontakt med politiet.
    etterlyst,syklist,  [manual]
    ta_kontakt_med,syklist,politi  [manual]

Fortsatt etterlyses to bilførere.
    etterlyse,,bilfører  [automatic]

Politiet etterlyste i dag to bilførere.
    etterlyse,politi,bilfører  [automatic]

To biler er observert på veien.
    observere,,bile  [automatic]

Politiet ønsker å komme i kontakt med bilførerne.
    ønske,politi,  [automatic]
    komme_i_kontakt_med,politi,bilfører  [manual]

Fonn understreker at bilførerne er vitner.
    understreke,Fonn,  [automatic]
    være,bilfører,vitne  [edited]

Fonn sier at han understreker dette.
    understreke,pron,generic-nom  [edited]
    si,Fonn,  [automatic]

Slåtten var påkledd da hun ble funnet drept.
    påkledd,Slåtten,  [automatic]
    finne,,pron  [automatic]
    drepe,,pron  [automatic]

Vi vil nå kartlegge alle bevegelser på åstedet.
    kartlegge,pron,bevegelse  [automatic]
    ville,pron,kartlegge  [manual]

Vi har ingen spesiell teori som vi tar utgangspunkt i.
    ha,pron,teori  [automatic]
    spesiell,teori,  [automatic]
    ta,pron,utgangspunkt  [automatic]

Funnene på åstedet viser at det er en kriminell handling.
    vise,funn,  [automatic]
    kriminell,handling,  [automatic]

Det er ikke et mistenkelig dødsfall, men en kriminell handling.
    mistenkelig,dødsfall,  [automatic]
    kriminell,handling,  [automatic]

Trenger flere vitner.
    trenge,,vitne  [automatic]

Politiet ønsker å komme i kontakt med alle som kjente Slåtten.
    komme_i_kontakt_med,politi,generic-nom  [edited]
    ønske,politi,  [automatic]
    kjenne,generic-nom,Slåtten  [automatic]

Etterforskerne fra Kripos vil kontakte vitner.
    kontakte,etterforsker,vitne  [automatic]

Politiet kjenner dødsårsaken.
    kjenne,politi,dødsårsak  [manual]

Politimesteren bekrefter at de har fått en muntlig rapport.
    bekrefte,politimester,  [manual]
    få,pron,rapport  [manual]
    muntlig,rapport  [manual]

Han understreker at politiet ikke vil gi informasjon om dødsårsaken.
    understreke,pron,  [manual]
    gi,politi,informasjon  [manual]

Politiet har ikke bekreftet hvor kvinnen ble drept.
    bekrefte,politi,  [manual]
    drepe,,kvinne  [manual]

Politiet har nå 32 medarbeidere som etterforsker drapet.
    ha,politi,medarbeider  [manual]
    etterforske,medarbeider,drap  [manual]

Syklisten meldte seg.
    melde_seg,syklist,  [manual]

Den etterlyste syklisten har nå meldt seg til politiet i Førde.
    melde_seg,syklist,politi  [manual]
    etterlyst,syklist,  [manual]

Fortsatt etterlyses to bilførere.
    etterlyse,,bilfører  [manual]

Politiet etterlyste i dag tidlig en syklist.
    etterlyse,politi,syklist  [manual]

I formiddag meldte syklisten seg til politiet.
    melde_seg,syklist,politi  [manual]

Jeg vil understreke at vi ønsker å komme i kontakt med både syklisten og bilførerne som vitner, sier Fonn.
    understreke,pron,  [manual]
    komme_i_kontakt_med,pron,bilfører  [manual]
    komme_i_kontakt_med,pron,syklist  [manual]
    si,Fonn,  [manual]
    være,bilfører,vitne  [manual]
    være,syklist,vitne  [manual]

Vi vil nå kartlegge alle bevegelser på funnstedet og i boligen.
    kartlegge,pron,bevegelse  [manual]

Vi har ingen spesiell teori som vi jobber utifra nå.
    ha,pron,teori  [manual]
    spesiell,teori  [manual]
    jobbe_utfra,pron,teori  [manual]

Men funnene på åstedet viser at det er en kriminell handling, forteller Fonn.
    vise,funn,  [manual]
    kriminell,handling,  [manual]
    fortelle,Fonn,  [manual]

I tillegg vil politiet gjøre en rundspørring rundt åstedet i løpet av dagen.
    gjøre,politi,rundspørring  [manual]

Den endelige rapporten vil være klar på torsdag.
    endelig,rapport,  [manual]
    være,rapport,klar  [manual]

To Kripos-spesialister skal analysere alle tekniske spor i Førde.
    analysere,Kripos-spesialist,spor  [manual]
    teknisk,spor,  [manual]

De to sluttet seg til Førde-politiet i går.
    slutte_seg_til,pron,Førde-politi  [manual]

Alt av mobiltelefontrafikk, overvåkingsfilmer og minibankaktiviteter rundt drapstidspunktet skal undersøkes.
    undersøke,,minibankaktiviteter  [manual]
    undersøke,,mobiltelefontrafikk  [manual]
    undersøke,,overvåkningsfilmer  [manual]

Slik kan politiet undersøke aktiviteten i området sykepleiestudenten ble funnet drept.
    undersøke,politi,aktivitet  [manual]
    finne,,sykepleiestudent  [manual]
    drept,sykepleiestudent  [manual]

Lensmann Kjell Fonn ber alle som var i sentrum om å melde seg.
    bede,lensmann,  [automatic]

Han understreker at etterforskningen er svært bred.
    understreke,pron,  [manual]
    være,etterforskning,bred  [manual]

Politiet har sanket inn videoer fra alle overvåkningskameraer i Førde.
    sanke_inn,politi,video  [manual]

Polititjenestefolk går gjennom materialet.
    gå_gjennom,polititjenestefolk,material  [manual]

Kameraene vil gi en indikasjon på aktiviteten i Førde i det aktuelle tidsrommet.
    gi,kamera,indikasjon  [manual]
    aktuell,tidsrom,  [manual]

Gjerningsmannen gjemte seg i busker på åstedet.
    gjemme,gjerningsmann,  [automatic]

Det er sannsynlig at gjerningsmannen gjemte seg i busker ved åstedet.
    gjemme,gjerningsmann,  [manual]

Tror Anne ble et tilfeldig offer.
    tilfeldig,offer,  [manual]
    bli,Anne,offer  [manual]
    være,Anne,offer  [manual]

Politiet avhører flere vitner.
    avhøre,politi,vitne  [automatic]

Politiet har søkt med hunder på åstedet.
    søke_med,politi,hund  [edited]

Politiet har samlet mange observasjoner.
    samle,politi,observasjon  [automatic]

Politiet antyder at drapsmannen har valgt sykepleiestudenten tilfeldig.
    antyde,politi,  [automatic]
    velge,drapsmann,sykepleiestudent  [automatic]

Vitneavhør gir indikasjoner på at den brutale drapsmannen har valgt sykepleierstudenten tilfeldig.
    gi,vitneavhør,indikasjon  [manual]
    brutal,drapsmann,  [manual]
    velge,drapsmann,sykepleiestudent  [manual]

Etterforskerne har flere observasjoner.
    ha,etterforsker,observasjon  [manual]

Vitner så en kvinne som gikk alene.
    se,vitne,kvinne  [manual]

Politiet mener kvinnen er Anne Slåtten.
    mene,politi,  [automatic]
    være,kvinne,Slåtten  [automatic]

Etterforskerne mener at hun ikke ble forfulgt.
    mene,etterforsker,  [automatic]
    forfølge,,pron  [automatic]

Drapsmannen kan ha gjemt seg i busker ved åstedet.
    gjemme,drapsmann,  [manual]

Lensmannen ber unge kvinner være oppmerksomme.
    oppmerksom,kvinne,  [automatic]
    bede,lensmann,  [automatic]

Politiet setter ikke inn ekstra patruljer i Førde.
    sette_inn,politi,patrulje  [edited]
    ekstra,patrulje,  [automatic]

Politiet har fått flere nye tips.
    få,politi,tips  [automatic]
    ny,tips,  [automatic]

En rekonstruksjon ble gjennomført på tirsdag.
    gjennomføre,,rekonstruksjon  [automatic]

Lensmannen tror at politiet finner drapsmannen.
    tro,lensmann,  [automatic]
    finne,politi,drapsmann  [automatic]

Politiet etterlyser fem personer.
    etterlyse,politi,person  [automatic]

Personene er observert i Førde.
    observere,,person  [automatic]

Politiet har ikke identifisert dem.
    identifisere,politi,pron  [automatic]

Politiet har gjort 1275 avhør.
    gjøre,politi,avhør  [automatic]

De har fem observasjoner av ukjente personer.
    ha,pron,observasjon  [automatic]
    ukjent,person,  [automatic]
Appendix E: classify.pl – program code

#!/bin/perl

### Configuration and initialisation ###

# Predicate/argument file.
my $infile = "pred_argliste.txt";

# Initialise empty data structures.
my %pred_args1 = ();
my %pred_args2 = ();
my %arg1_preds = ();
my %arg2_preds = ();

# Read the input data.
open(FILE, $infile);
my @lines = <FILE>;
close(FILE);

### The data structure is built here ###
foreach $line (@lines) {
    # Remove the newline from each line.
    chomp($line);
    # Extract each individual word on the line.
    my ($pred, $arg1, $arg2) = split(/,/, $line);
    # Register arg1 and arg2 on the predicate (increment the counter).
    $pred_args1{$pred}{$arg1} += 1;
    $pred_args2{$pred}{$arg2} += 1;
    # Register the predicates that arg1 and arg2 occur with.
    $arg1_preds{$arg1}{$pred} += 1;
    $arg2_preds{$arg2}{$pred} += 1;
}

### The actual logic of the program ###

# Get the predicates to display from the command line (display all if none are given).
my @preds = scalar @ARGV ? @ARGV : sort keys %pred_args1;

# Loop controlling the flow of the program.
foreach $pred (@preds) {
    foreach $arg (1, 2) {
        ### NIVÅ 0 ###
        my @args_lvl0 = parse_level0($arg, $pred);
        foreach $arg_lvl0 (@args_lvl0) {
            ### NIVÅ 1 ###
            my @args_lvl1 = parse_level1or2(1, $arg, $arg_lvl0, $pred);
            foreach $arg_lvl1 (@args_lvl1) {
                ### NIVÅ 2 ###
                parse_level1or2(2, $arg, $arg_lvl1, $pred, $arg_lvl0);
            }
        }
        print "\n";
    }
    print "\n";
}

# Subroutine taking an argument number (1 or 2, corresponding to ARG1 or ARG2) and a predicate.
# Displays the predicate's arguments (NIVÅ0) and returns them for use on the next level.
sub parse_level0 {
    # Get the subroutine's parameters.
    my $argnum = shift;
    my $pred = shift;
    # Get all of the predicate's arguments.
    my %args = $argnum == 1 ? %{$pred_args1{$pred}} : %{$pred_args2{$pred}};
    # Get the arguments themselves in sorted order (by number of occurrences, then alphabetically).
    my @args = sort { $args{$b} <=> $args{$a} } sort { lc($a) cmp lc($b) } keys %args;
    # Display the predicate's argument list.
    print "NIVÅ0, ARG$argnum ($pred): ";
    print join(', ', map { "$_ x $args{$_}" } @args), "\n";
    return @args;
}

# Subroutine taking an argument number (1 or 2, corresponding to ARG1 or ARG2) and an
# argument (plus extra control parameters).
# Finds other predicates that the argument has been used with.
# Displays the arguments of all found predicates, along with the number of times these
# arguments were used.
#
# In the rest of the subroutine these found arguments are called "referred arguments (on
# the next level)", because they are referred to from an argument indirectly via a
# predicate (e.g. a NIVÅ0 argument "refers" to some NIVÅ1 arguments, which in turn refer
# to some NIVÅ2 arguments).
sub parse_level1or2 {
    # Get the subroutine's parameters.
    my $level = shift;
    my $argnum = shift;
    my $arg = shift;
    my $pred_lvl0 = shift;
    my $arg_lastlevel = $level == 1 ? '' : shift;
    # Ignore '?' arguments.
    return if $arg eq '?';
    # Data structure counting the referred arguments on the next level.
    my %argrefs = ();
    # Get all the predicates that contain the argument.
    my @preds = $argnum == 1 ? keys %{$arg1_preds{$arg}} : keys %{$arg2_preds{$arg}};
    # ...and loop through these predicates.
    foreach $pred (@preds) {
        # Ignore the same predicate as the one being processed.
        next if $pred eq $pred_lvl0;
        # Get all of the predicate's arguments.
        my %args = $argnum == 1 ? %{$pred_args1{$pred}} : %{$pred_args2{$pred}};
        # ...and loop through and count these.
        foreach $argref (keys %args) {
            # Ignore the same argument as the one being processed.
            next if $argref eq $arg || $argref eq $arg_lastlevel;
            # Increment the counter (number of referred arguments on the next level).
            $argrefs{$argref} += $args{$argref};
        }
    }
    # Get the referred arguments on the next level.
    my @argrefs = sort { $argrefs{$b} <=> $argrefs{$a} } sort { lc($a) cmp lc($b) } keys %argrefs;
    # Display the referred arguments on the next level.
    print " " x (3*$level), "NIVÅ$level, ARG$argnum ($arg): ";
    print scalar @argrefs ? join(', ', map { "$_ x $argrefs{$_}" } @argrefs) : '(Ingen referanser)', "\n";
    return @argrefs;
}
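As an illustration of the counting logic in classify.pl, the same idea can be sketched in a few lines of Python. The sample triples and the helper names `by_freq`, `level0` and `level1` are invented for the example; the NIVÅ2 step and the '?' filter are omitted for brevity.

```python
from collections import defaultdict

# Toy EPAS triples (predicate, ARG1, ARG2) in the Appendix C format; an empty
# string marks a missing argument. Illustrative data only.
epas = [
    ("etterlyse", "politi", "syklist"),
    ("etterlyse", "politi", "bilfører"),
    ("etterlyse", "", "bilfører"),
    ("avhøre", "politi", "vitne"),
    ("avhøre", "lensmann", "vitne"),
    ("kontakte", "etterforsker", "vitne"),
]

# pred_args[n][pred] counts the arguments seen in slot n of pred;
# arg_preds[n][arg] counts the predicates that arg occurs with in slot n.
pred_args = {n: defaultdict(lambda: defaultdict(int)) for n in (1, 2)}
arg_preds = {n: defaultdict(lambda: defaultdict(int)) for n in (1, 2)}
for pred, arg1, arg2 in epas:
    for n, arg in ((1, arg1), (2, arg2)):
        pred_args[n][pred][arg] += 1
        arg_preds[n][arg][pred] += 1

def by_freq(counts):
    # Most frequent first, ties broken alphabetically (as in the Perl sorts).
    return sorted(sorted(counts), key=counts.get, reverse=True)

def level0(n, pred):
    # NIVÅ0: the arguments observed in slot n of the given predicate.
    return by_freq(pred_args[n][pred])

def level1(n, arg, pred_lvl0):
    # NIVÅ1: other arguments that share a predicate (except pred_lvl0) with
    # arg in the same slot; counts are summed over all such predicates.
    refs = defaultdict(int)
    for pred in arg_preds[n][arg]:
        if pred == pred_lvl0:
            continue
        for other, count in pred_args[n][pred].items():
            if other != arg:
                refs[other] += count
    return by_freq(refs)

print(level0(2, "etterlyse"))            # ['bilfører', 'syklist']
print(level1(2, "syklist", ""))          # ['bilfører']
print(level1(1, "politi", "etterlyse"))  # ['lensmann']
```

The last two calls show the association step that groups semantically similar words: "syklist" and "bilfører" end up in the same bundle because they fill the same ARG2 slot of the same predicate, and "politi" and "lensmann" likewise share an ARG1 slot.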
Appendix F: POS-based structures<br />
SENTENCE POS-STRUCTURE<br />
Kvinne funnet død i Førde. finne,,kvinne<br />
Den savnede kvinnen i Førde er nå funnet død. finne,kvinne,død<br />
Politiet har gitt media opplysninger om funnet. gi,politi,opplysning<br />
Lensmannen bekrefter at kvinnen er funnet død. bekrefte,lensmann,at<br />
finne,kvinne,død<br />
Politiet har bedt Kripos om bistand i søket etter kvinnen. be,politi,kripos<br />
23-åringen var førsteårs sykepleiestudent i Førde. være,23-åring,<br />
Hun møtte ikke opp til sin første praksisdag ved Førde<br />
aldershjem.<br />
møte,hun,<br />
Politiet ble informert. informere,politi,<br />
En leteaksjon ble satt igang. sette,leteaksjon,<br />
Leteaksjonen pågikk til kvinnen ble funnet. pågå,leteaksjon,<br />
finne,kvinne,<br />
Politiet holder alle muligheter åpne i saken. holde,politi,mulighet<br />
Etterforskerne vil ankomme i morgen. ankomme,etterforsker,<br />
Et vitne hørte desperate rop om hjelp. høre,vitne,rop<br />
Lensmannen har bedt om assistanse fra Kripos. be,lensmann,<br />
Etterforskere fra Kripos skal bistå lensmannen i<br />
etterforskningen.<br />
bistå,etterforsker,lensmann<br />
Etterforskerne forventes å ankomme i løpet av dagen. ankomme,etterforsker,<br />
Den 23 år gamle studenten ble meldt savnet tidlig søndag<br />
morgen.<br />
melde savne,student,<br />
Anne Slåtten bodde i et studentkollektiv i Førde. bo,Slåtten,<br />
Slåtten var førsteårs sykepleiestudent i Førde. være,Slåtten,sykepleiestudent<br />
Hun ble funnet omkommet i et skogholt. finne,hun,<br />
Et vitne opplyste at hun hadde hørt høye rop. opplyse,vitne,<br />
høre,hun,rop<br />
Mandag holdt politiet en pressekonferanse. holde,politi,pressekonferanse<br />
Lensmannen vil ikke gi nærmere opplysninger om åstedet. gi,lensmann,opplysning<br />
Beboerne i studentkollektivet har fortalt politiet at de så fortelle,beboer,politi<br />
Slåtten lørdag kveld.<br />
se,de,<br />
Politiet har sperret av åstedet. sperre,politi,<br />
Flere personer er avhørt i saken. avhøre,person,<br />
Politiet holder alle muligheter åpne. holde,politi,mulighet<br />
Kvinnen blir trolig obdusert i løpet av tirsdag. obdusere,kvinne,<br />
Politiet håper obduksjonen vil avklare hva som skjedde med håpe,politi,<br />
kvinnen<br />
avklare,obduksjon,hva<br />
Mandag kveld ankom etterforskere fra Kripos åstedet. ankomme,etterforsker,åsted<br />
Sent mandag kveld rigget etterforskerne opp lyskastere. rigge,etterforsker,<br />
Fonn vil ikke gi flere opplysninger om åstedet. gi,Fonn,opplysning<br />
Han vil ikke kommentere om kvinnen var skadet. kommentere,han,<br />
skade,kvinne,<br />
Politiet holder kortene svært tett til brystet. holde,politi,kort<br />
Det er ikke kommet inn mange tips i saken. komme,det,<br />
Tipsene skal nå systematiseres. systematisere,tips,<br />
Fonn forteller at politiet vil ta kontakt med vitner. fortelle,Fonn,<br />
ta,politi,kontakt<br />
Politiet har flere mulige teorier. ha,politi,teori<br />
Det mest sentrale vitnet i saken er en kvinne. være,vitne,kvinne<br />
Hun skal ha hørt rop fra en kvinne. høre,hun,rop<br />
Politiet har stengt av studentkollektivet der 23-åringen stenge,politi,<br />
bodde.<br />
bo,23-åring,<br />
Studentkollektivet vil bli gjennomgått av teknikere. gjennomgå,studentkollektiv,<br />
Fonn har bedt om teknisk bistand. be,Fonn,<br />
Politiet bekrefter at Slåtten ble drept. bekrefte,politi,at<br />
drepe,Slåtten,<br />
Undersøkelsene på stedet viser at hun ble drept. vise,undersøkelse,at<br />
drepe,hun,<br />
Politiet tror at Slåtten ble overfalt. tro,politi,at<br />
overfalle,Slåtten,<br />
De tror at kvinnen ble drept av en ukjent gjerningsmann. tro,de,at<br />
drepe,kvinne,<br />
113
Politiet fastslår at kvinnens lommebok ikke er funnet. fastslå,politi,at<br />
finne,lommebok,<br />
Fonn opplyser at området ikke er undersøkt. opplyse,Fonn,at<br />
undersøke,område,<br />
Politiet har forkastet en tidligere teori. forkaste,politi,teori<br />
Politiet får senere i dag svar på dødsårsaken. få,politi,svar<br />
Politiet har ikke gjennomsøkt Slåttens studenthybel. gjennomsøke,politi,studenthybel<br />
Hele hybelhuset ble sperret av. sperre,hybelhus,<br />
Politiet har plombert hybelhuset. plombere,politi,hybelhus<br />
Politiet skal finkjemme bygningen for tekniske spor. finkjemme,politi,bygning<br />
De tekniske etterforskerne har undersøkt åstedet. undersøke,etterforsker,åsted<br />
To tekniske etterforskere bistår politiet i Førde. bistå,etterforsker,politi<br />
En taktisk etterforsker fra Kripos bistår politiet. bistå,etterforsker,politi<br />
Lensmannen tar høyde for alle eventualiteter. ta,lensmann,høyde<br />
Vi varslet Kripos. varsle,vi,kripos<br />
Den døde sykepleierstudenten ble funnet av en tilfeldig<br />
finne,sykepleierstudent,forbipasse<br />
forbipasserende.<br />
rende<br />
23-åringen ble sist observert lørdag kveld. observere,23-åring,<br />
Politiet vet at hun fikk en telefon fra kjæresten sin. vite,politi,<br />
få,hun,telefon<br />
Kjæresten merket ikke at noe var galt.<br />
Vedkommende er avhørt.<br />
merke,kjæreste,at<br />
En større leteaksjon ble igangsatt. igangsette,leteaksjon,<br />
Politiet etterlyser en syklist. etterlyse,politi,syklist<br />
Den etterlyste syklisten har tatt kontakt med politiet. ta,syklist,kontakt<br />
Fortsatt etterlyses to bilførere. etterlyse,,bilfører<br />
Politiet etterlyste i dag to bilførere. etterlyse,politi,bilfører<br />
To biler er observert på veien. observere,bil,<br />
Politiet ønsker å komme i kontakt med bilførerne. ønske,politi,å<br />
Fonn understreker at bilførerne er vitner. understreke,Fonn,at<br />
være,bilfører,vitne<br />
Fonn sier at han understreker dette. si,Fonn,at<br />
understreke,han,dette<br />
Slåtten var påkledd da hun ble funnet drept. være,Slåtten,påkledd<br />
finne,hun,drepe<br />
Vi vil nå kartlegge alle bevegelser på åstedet. kartlegge,vi,bevegelse<br />
Vi har ingen spesiell teori som vi tar utgangspunkt i. ha,vi,teori<br />
ta,vi,utgangspunkt<br />
Funnene på åstedet viser at det er en kriminell handling. vise,funn,at<br />
være,det,handling<br />
Det er ikke et mistenkelig dødsfall, men en kriminell<br />
handling.<br />
være,det,dødsfall<br />
Trenger flere vitner. trenge,vitne,<br />
Politiet ønsker å komme i kontakt med alle som kjente<br />
ønske,politi,å<br />
Slåtten.<br />
kjenne,,Slåtten<br />
Etterforskerne fra Kripos vil kontakte vitner. kontakte,etterforsker,vitne<br />
Politiet kjenner dødsårsaken. kjenne,politi,dødsårsak<br />
Politimesteren bekrefter at de har fått en muntlig rapport. bekrefte,politimester,at<br />
ha,de,rapport<br />
Han understreker at politiet ikke vil gi informasjon om<br />
understreke,han,at<br />
dødsårsaken.<br />
gi,politi,informasjon<br />
Politiet har ikke bekreftet hvor kvinnen ble drept. bekrefte,politi,<br />
drepe,kvinne,<br />
Politiet har nå 32 medarbeidere som etterforsker drapet. ha,politi,medarbeider<br />
etterforske,medarbeider,drap<br />
Syklisten meldte seg. melde,syklist,seg<br />
Den etterlyste syklisten har nå meldt seg til politiet i<br />
Førde.<br />
melde,syklist,seg<br />
Fortsatt etterlyses to bilførere. etterlyse,bilfører,<br />
Politiet etterlyste i dag tidlig en syklist. etterlyse,politi,syklist<br />
I formiddag meldte syklisten seg til politiet. melde,syklist,seg<br />
Jeg vil understreke at vi ønsker å komme i kontakt med både syklisten og bilførerne som vitner, sier Fonn. understreke,jeg,at<br />
ønske,vi,å<br />
vitne,bilfører,<br />
si,Fonn,<br />
Vi vil nå kartlegge alle bevegelser på funnstedet og i boligen. kartlegge,vi,bevegelse<br />
Vi har ingen spesiell teori som vi jobber utifra nå. ha,vi,teori<br />
jobbe,vi,<br />
Men funnene på åstedet viser at det er en kriminell handling, forteller Fonn. vise,funn,at<br />
være,det,handling<br />
fortelle,Fonn,<br />
I tillegg vil politiet gjøre en rundspørring rundt åstedet i løpet av dagen. gjøre,politi,rundspørring<br />
Den endelige rapporten vil være klar på torsdag. være,rapport,klar<br />
To Kripos-spesialister skal analysere alle tekniske spor i Førde. analysere,Kripos-spesialist,spor<br />
De to sluttet seg til Førde-politiet i går. slutte,to,seg<br />
Alt av mobiltelefontrafikk, overvåkingsfilmer og minibankaktiviteter rundt drapstidspunktet skal undersøkes. undersøke,minibankaktivitet,<br />
Slik kan politiet undersøke aktiviteten i området sykepleiestudenten ble funnet drept. undersøke,politi,aktivitet<br />
finne,sykepleierstudent,<br />
Lensmann Kjell Fonn ber alle som var i sentrum om å melde seg. be,Fonn,alle<br />
melde,,seg<br />
Han understreker at etterforskningen er svært bred. understreke,han,at<br />
være,etterforskning,bred<br />
Politiet har sanket inn videoer fra alle overvåkningskameraer i Førde. sanke,politi,<br />
Polititjenestefolk går gjennom materialet. gå,polititjenestefolk,<br />
Kameraene vil gi en indikasjon på aktiviteten i Førde i det aktuelle tidsrommet. gi,kamera,indikasjon<br />
Gjerningsmannen gjemte seg i busker på åstedet. gjemme,gjerningsmann,seg<br />
Det er sannsynlig at gjerningsmannen gjemte seg i busker ved åstedet. være,det,sannsynlig<br />
gjemme,gjerningsmann,seg<br />
Tror Anne ble et tilfeldig offer. bli,Anne,offer<br />
Politiet avhører flere vitner. avhøre,politi,vitne<br />
Politiet har søkt med hunder på åstedet. søke,politi,<br />
Politiet har samlet mange observasjoner. samle,politi,observasjon<br />
Politiet antyder at drapsmannen har valgt sykepleiestudenten tilfeldig. antyde,politi,at<br />
velge,drapsmann,sykepleiestudent<br />
Vitneavhør gir indikasjoner på at den brutale drapsmannen har valgt sykepleierstudenten tilfeldig. gi,vitneavhør,indikasjon<br />
velge,drapsmann,sykepleiestudent<br />
Etterforskerne har flere observasjoner. ha,etterforsker,observasjon<br />
Vitner så en kvinne som gikk alene. se,vitne,kvinne<br />
Politiet mener kvinnen er Anne Slåtten. mene,politi,<br />
være,Slåtten,kvinne<br />
Etterforskerne mener at hun ikke ble forfulgt. mene,etterforsker,at<br />
bli,hun,forfulgt<br />
Drapsmannen kan ha gjemt seg i busker ved åstedet. gjemme,drapsmann,seg<br />
Lensmannen ber unge kvinner være oppmerksomme. be,lensmann,kvinne<br />
Politiet setter ikke inn ekstra patruljer i Førde. sette,politi,<br />
Politiet har fått flere nye tips. ha,politi,tips<br />
En rekonstruksjon ble gjennomført på tirsdag. gjennomføre,rekonstruksjon,<br />
Lensmannen tror at politiet finner drapsmannen. tro,lensmann,at<br />
finne,politi,drapsmann<br />
Politiet etterlyser fem personer. etterlyse,politi,person<br />
Personene er observert i Førde. observere,person,<br />
Politiet har ikke identifisert dem. identifisere,politi,de<br />
Politiet har gjort 1275 avhør. gjøre,politi,avhør<br />
De har fem observasjoner av ukjente personer. ha,de,observasjon<br />
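The listing above pairs each corpus sentence with one or more comma-separated triples of the form verb,subject,object, where an empty field marks a missing argument (e.g. <em>observere,bil,</em>). As a minimal illustrative sketch, not the thesis's actual implementation, such triples can be parsed and the argument fillers grouped per verb context, which is the raw material a distributional similarity method would then compare:<br />

```python
from collections import defaultdict

def parse_triple(line):
    """Split a 'verb,subject,object' line; empty slots become None."""
    verb, subj, obj = (field.strip() or None for field in line.split(","))
    return verb, subj, obj

def objects_by_verb(triples):
    """Collect the object fillers observed with each verb, i.e. the
    candidate set a distributional method would cluster for similarity."""
    groups = defaultdict(set)
    for verb, subj, obj in triples:
        if obj:
            groups[verb].add(obj)
    return groups

# A few triples taken from the listing above.
lines = [
    "etterlyse,politi,bilfører",
    "etterlyse,politi,syklist",
    "etterlyse,politi,person",
    "observere,bil,",  # object slot is empty
]
groups = objects_by_verb(parse_triple(line) for line in lines)
# groups["etterlyse"] now holds {"bilfører", "syklist", "person"}
```

The function and variable names here are hypothetical; only the triple format itself comes from the appendix.<br />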