Unni Cathrine Eiken February 2005
University of Bergen
Section for linguistic studies

CORPUS-BASED SEMANTIC CATEGORISATION FOR ANAPHORA RESOLUTION

Unni Cathrine Eiken

Cand. Philol. Thesis in Computational Linguistics and Language Technology

February 2005
Abstract

This thesis describes an approach that uses corpus-based classification of semantically related words as a referent-guessing aid in anaphora resolution. A small limited-domain corpus was collected, and elementary predicate-argument structures were extracted from it using a method based on the semantic structures available from syntactic parses of the texts. The extracted structures were processed with an association technique that created groups of semantically similar words based on their distribution in the text collection. These groups of semantically similar words represent valid selectional restrictions for the domain of the text collection in the sense that they characterise the types of arguments that tend to occur in certain contexts. The groups can be used to form an expectation of which words are likely to occur in a given contextual pattern, and can thus be used in anaphora resolution to select a probable referent from a set of possible referents. The experiments in the thesis show that this approach produces promising results; the concept groups can serve as an aid in finding likely referents in anaphora resolution.
Sammendrag

Metoden som beskrives i denne hovedoppgaven bygger på korpusbasert klassifikasjon av semantisk like ord og relaterer dette til bruk innenfor anaforresolusjon. Et domenespesifikt korpus ble samlet, og forenklede predikat-argumentstrukturer ble ekstrahert ved hjelp av en metode basert på semantiske strukturer som er tilgjengelige etter en syntaktisk analyse av tekstene. Strukturene ble prosessert med en assosiasjonsteknikk som, basert på ordenes distribusjon i tekstsamlingen, dannet grupperinger av semantisk like ord. Disse ordgruppene representerer gyldige seleksjonsrestriksjoner innenfor tekstsamlingens avgrensede domene da de karakteriserer grupper av argumenter som forekommer i gitte kontekster. Ordgruppene kan brukes til å gi en indikasjon på hvilke ord som forventes i et gitt kontekstmønster. Ved anaforresolusjon kan dette være til hjelp ved utvelgelsen av en sannsynlig referent fra en liste med mulige referenter. Eksperimentene i oppgaven viser at denne metoden gir lovende resultater; ordgruppene kan fungere som et hjelpemiddel i prosessen med å finne sannsynlige referenter i anaforresolusjon.
Preface

The project presented here is a Cand. Philol. thesis in Computational Linguistics and Language Technology, submitted at the University of Bergen in February 2005.

The thesis was written in loose cooperation with the research project KunDoc (KunDoc 2004). KunDoc (Kunnskapsbasert dokumentanalyse / Knowledge-based document analysis), which was started in October 2003 and is funded by the Norwegian Research Council (NFR), has served as an inspiration for formulating the approach in the thesis. The research within KunDoc is carried out in cooperation between the firm CognIT AS (CognIT 2004) and the University of Bergen. KunDoc aims at developing a method for the automatic recognition of discourse structures in written Norwegian texts. The project examines whether automated identification of coreference in a text can be used to create an unambiguous discourse structure of the text, identifying both its thematic and contextual structure. A further goal is to examine whether these techniques are useful within a closed thematic domain for creating unambiguous automated summaries. Within KunDoc, it is also of interest to generate ontologies that represent real-world knowledge.

In the work on my thesis I have also cooperated with the research project NorGram (NorGram 2004) at the University of Bergen. This project develops a computational grammar for Norwegian bokmål and is part of the ParGram project at Palo Alto Research Center. The pre-processing of the text collection used in my project has been carried out using NorGram's grammar on the XLE platform.
Acknowledgements

I would like to thank my supervisors, Professor Koenraad de Smedt and Professor Helge Dyvik, who have given me invaluable support and new ideas, especially in the process of developing the method in the thesis.

The approach in the thesis has been developed in loose cooperation with KunDoc. In this connection I wish to thank Till Christopher Lech at CognIT AS, who has contributed tips and support.

I would also like to thank Paul Meurer at Aksis, who installed XLE and NorGram on my home Linux computer, and Martin Rasmussen Lie, who has been a great help with programming questions and has implemented one of the approaches used in the thesis in Perl. Thanks also go to Aleksander Krzywinski, whose achievement it is that the pink computer exists.

Many people who have been a great support in the process of finishing the thesis have not been mentioned here; they are nonetheless very warmly thanked. You know who you are!
Table of Contents

1 INTRODUCTION AND PROBLEM STATEMENT
1.1 Project outline
2 THEORETICAL BACKGROUND
2.1 Anaphora resolution
2.1.1 Frameworks for anaphora resolution
2.1.2 Computational approaches to anaphora resolution
2.1.3 Anaphora resolution and text summarisation
2.2 Finding meaning in the context
2.2.1 The distributional approach
2.2.2 Different types of context
2.2.3 Context and selectional restrictions
3 FROM TEXT TO EPAS – THE EXTRACTION METHOD
3.1 Selecting the texts
3.2 Predicate-argument structures
3.2.1 What is represented in the EPAS?
3.3 Parsing with NorGram
3.3.1 NorGram in outline
3.3.2 Extracting EPAS from NorGram
3.4 Altering the source
3.5 Finding the words
3.6 Evaluation of the data set
3.6.1 Errors from the grammar
3.6.2 Irrelevant structures
3.6.3 Manually added structures
3.6.4 Comments about the EPAS list
4 CLASSIFICATION
4.1 Step I: Classification with TiMBL
4.1.1 The Nearest Neighbor approach
4.1.2 Testing
4.1.3 Comments on the results
4.2 Step II: Association of concept groups
4.2.1 Classify
4.2.2 Associated concept classes
4.3 Step III: Using concept groups in TiMBL
4.3.1 Testing
4.4 Are concept classes useful for anaphora resolution?
5 FINAL REMARKS
5.1 Is a parser vital for the extraction process?
5.2 Summary and conclusions
6 REFERENCES
APPENDIX A: EKSTRAKTOR.PL – ALGORITHM
APPENDIX B: EKSTRAKTOR.PL – PROGRAM CODE
APPENDIX C: THE EPAS LIST
APPENDIX D: TEXT ALIGNED WITH EPAS
APPENDIX E: CLASSIFY.PL – PROGRAM CODE
APPENDIX F: POS-BASED STRUCTURES
1 Introduction and problem statement

For many applications within the field of Natural Language Processing (NLP) it is vital to identify what a pronoun refers to. Consider a piece of text where (1-1a) is followed immediately by (1-1b).¹

(1-1)
a. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig kommer til å drepe igjen.
   The sergeant leading the investigation says that the perpetrator will probably kill again.
b. Han etterlyser vitner som var i sentrum søndag kveld.
   He puts out a call for witnesses who were in the city centre Sunday evening.

In an application such as text summarisation, for example, selecting the second sentence (1-1b) without the preceding sentence (1-1a) leaves the reader with the pronoun han (he), the referent of which cannot be identified. The task of identifying the referent of a pronoun is called anaphora resolution, and its computational implementation is relevant in many NLP applications, such as machine translation, automatic abstracting, dialogue systems, question answering and information extraction.
The problem of correctly identifying the referent of a pronoun is not trivial, as is apparent from a comparison of examples (1-1) and (1-2). As will be further described in section 2.1, strategies that do not incorporate some sort of real-world knowledge cannot confidently identify the entities that the pronoun han (he) is linked to in these examples.

(1-2)
Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig kommer til å drepe igjen. Han ble observert i sentrum søndag kveld.
   The sergeant leading the investigation says that the perpetrator will probably kill again. He was observed in the city centre Sunday evening.
¹ The sentences in (1-1) and (1-2) are constructed example sentences and are not part of the data set collected and used in this thesis.
This thesis explores the value of using co-occurrence patterns to create concept groups that can act as an aid in the process of finding what a pronoun refers to. In order to find the entity that the pronoun han (he) refers to in example (1-1), the following two alternative patterns can be considered:

(1-3)
a. lensmann etterlyser vitne (sergeant calls-for witness)
b. gjerningsmann etterlyser vitne (perpetrator calls-for witness)

When considering which of these patterns is the more likely one, data collected from a corpus can be consulted (Dagan and Itai 1990; Dagan et al. 1995; Nasukawa 1994; inter alia). If one of the patterns is found literally in the corpus, it receives a strong preference. If neither pattern occurs in the data collection, similar patterns can be considered. Given that the patterns in example (1-4) below do feature in the data collection, they can contribute to guessing the correct referent for the anaphor in example (1-1):

(1-4)
a. politi etterlyser vitne (police call-for witness)
b. etterforsker etterlyser vitne (investigator calls-for witness)
c. lensmann avhører vitne (sergeant interviews witness)
d. politi avhører person (police interview person)
e. gjerningsmann dreper offer (perpetrator kills victim)
f. gjerningsmann angriper kvinne (perpetrator attacks woman)

In view of the patterns in (1-4), the word lensmann (sergeant) occurs in contexts similar to those of politi (police), which in turn occurs in contexts similar to those of etterforsker (investigator). By using association techniques, lensmann can be associated with the other arguments that occur in similar linguistic environments, and subsequently be preferred as the referent in (1-1).
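The association step described above can be illustrated with a toy computation. The sketch below is not the method used in this thesis (which employs TiMBL and a purpose-built association technique, described in chapter 4); it is a minimal, hypothetical illustration of how arguments can be grouped by the overlap of the (predicate, slot) contexts they occur in, using the patterns from (1-4):

```python
from collections import defaultdict

# Toy patterns from (1-4), as (predicate, argument1, argument2) triples.
patterns = [
    ("etterlyse", "politi", "vitne"),
    ("etterlyse", "etterforsker", "vitne"),
    ("avhøre", "lensmann", "vitne"),
    ("avhøre", "politi", "person"),
    ("drepe", "gjerningsmann", "offer"),
    ("angripe", "gjerningsmann", "kvinne"),
]

# Context profile of each argument: the (predicate, slot) pairs it occurs with.
profiles = defaultdict(set)
for pred, arg1, arg2 in patterns:
    profiles[arg1].add((pred, "arg1"))
    profiles[arg2].add((pred, "arg2"))

def similarity(a, b):
    """Jaccard overlap between the context profiles of two arguments."""
    union = profiles[a] | profiles[b]
    return len(profiles[a] & profiles[b]) / len(union) if union else 0.0

# lensmann shares the (avhøre, arg1) context with politi, but no context
# with gjerningsmann, so it associates with the "investigator" group.
assert similarity("lensmann", "politi") > similarity("lensmann", "gjerningsmann")
```

With these six toy patterns, lensmann and politi share the subject slot of avhøre, so their similarity is positive, whereas lensmann and gjerningsmann share no contexts at all.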
In recent years, approaches within the field of anaphora resolution have focused on knowledge-poor strategies used in combination with corpora; at the same time, the notion of constructing a large and comprehensive base of real-world knowledge has been somewhat abandoned (see Mitkov 2003 for a brief overview). The approach in the present work expands the co-occurrence patterns found in a text collection to also consider words and patterns that are semantically similar to those present in the corpus. The association of semantically similar concepts is carried out through machine learning techniques and an association technique developed for this project.
The project described in the present work develops and examines a method for the automatic inference of concept groups, consisting of semantically similar words, from a collection of limited-domain texts. The information that becomes available by automatically classifying context patterns from a closed thematic domain is examined for its potential to aid anaphora resolution. The resulting concept groups can function as a form of real-world knowledge by representing information about which words can be expected in certain contextual environments. Since it is an established problem within computational linguistics that the construction of knowledge bases requires a great deal of manual labour, it is of interest to examine methods that can contribute to automating this task.
1.1 Project outline

This thesis describes a method for the automatic association of clusters of semantically similar words collected from a limited thematic domain. The association of concepts is based on the distribution of arguments in particular syntactic contexts.

The method described consists of three steps:

1) the extraction method, which deals with the extraction of meaning structures from a text corpus
2) the classification method, which deals with the association of the extracted meaning structures into concept groups
3) the application of the meaning structures and the concept groups to anaphora resolution

In the extraction phase, semantic structures mainly corresponding to subject-verb-object relations are extracted and normalised to the form shown in (1-5) below. This type of relation is termed an elementary predicate-argument structure (EPAS) in this thesis and is described in greater detail in section 3.2.
(1-5)
predicate, argument 1, argument 2
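Concretely, a structure of the form (1-5) can be represented as a flat triple of lemmas. The sketch below is merely an illustrative representation (the actual extraction in this thesis is implemented in Perl; see Appendix B), using the pattern from (1-3a):

```python
from typing import NamedTuple, Optional

class EPAS(NamedTuple):
    """Elementary predicate-argument structure: a predicate with its arguments."""
    predicate: str            # verb lemma
    argument1: str            # typically the subject lemma
    argument2: Optional[str]  # typically the object lemma, or None if absent

# The pattern in (1-3a), normalised to lemma form:
epas = EPAS(predicate="etterlyse", argument1="lensmann", argument2="vitne")
assert epas.predicate == "etterlyse"
```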
In the classification phase, the extracted structures undergo processes that result in the grouping of concepts into clusters of semantically similar words.

The evaluation of the results obtained with the method developed in the project is twofold:

• the resulting concept classes are evaluated: does the method produce semantic clusters that are valid for the thematic domain of the text collection?
• the usefulness of the concept classes in anaphora resolution is evaluated: does the method provide a means to infer which entity is referred to in examples such as (1-1) and (1-2)?
Chapter 3 describes the extraction method, which uses output from a syntactic parser to collect semantic structures in the form of EPAS from the texts. The text collection used in this project consists of newspaper texts, all concerning a criminal case. The constraints that hold on the corpus are further described in sections 3.1 and 3.4. Section 3.2 explains the format of the meaning structures extracted from the text corpus as well as the motivation for choosing EPAS as the meaning representation. Sections 3.3 and 3.5 outline in further detail the process of parsing the texts and gathering the meaning representations from the parse output. Finally, the list of EPAS resulting from the extraction method is evaluated in section 3.6.

The classification method is described in chapter 4. Section 4.1 describes a classification approach using machine learning techniques, in section 4.2 the constituents of the EPAS are associated into semantically similar groups based on their contextual distribution, and finally these two approaches are applied in combination in section 4.3. In section 4.4 the potential of using concept groups in anaphora resolution is discussed.

Final remarks and conclusions are found in chapter 5, where the foundation of the extraction method is also briefly discussed.
The results obtained in this project provide a preliminary indication of the feasibility and usefulness of a referent-guessing aid such as the one described in the introduction. Hindle (1990) states that small corpora and human intervention in the analysis phase are factors that have contributed to obscuring the usefulness of semantic classification based on distributional context. Within the framework of the present work it was not possible to conduct a large-scale corpus-based study. This is partly due to the lack of a sufficiently robust and powerful extraction method, and will be discussed in greater detail in chapter 3. The text collection that the study is based on is clearly much too small to offer anything but a tendency, and the degree to which the extraction process is manually manipulated is too high to call the method fully automated. Nevertheless, this thesis describes a pilot study and provides an indication of the quality and usefulness of the method.

Before going on to describe the method developed in the present work, a brief introduction to the concepts of context and anaphora resolution is needed. Chapter 2 discusses the importance and usefulness of classifying words according to the contexts they occur in, and provides a brief background on anaphora resolution.
2 Theoretical background

In order to understand the motivation for developing an extraction and classification method as described in the present work, a brief explanation of the theoretical foundation on which the method is based is needed. This chapter describes that theoretical background. Section 2.1 outlines the concept of anaphora resolution and the need for context information in anaphora resolution systems. Section 2.2 explains the notion of using context as a means to identify semantically similar words.
2.1 Anaphora resolution

Most natural language texts contain an abundance of pronouns and other expressions which are referentially linked to other items in the text. In order to understand the meaning conveyed by a text, one needs a method for finding out which entities these expressions are linked to. It is difficult to determine what a pronoun refers to without taking context and real-world knowledge into account. Natural language requires a certain amount of context to be intelligible. We distinguish between linguistic context, which denotes the concrete linguistic setting that a given word occurs in, and a more general notion of context that refers to the non-linguistic setting. In the following, a background on the theoretical basics of anaphora is given, before some approaches to anaphora resolution are briefly outlined.

Anaphor and referring expression are both terms used for words that point back either to other words or to entities in the world. Anaphora² can be defined as the linguistic phenomenon of using an anaphor to point back to a previously mentioned item in a text (Mitkov 2003, p. 266).
In the Oxford Concise Dictionary of Linguistics (Matthews 1997), a referring expression is defined as a linguistic element that refers to a specific entity in the real world, termed a referent. A referring expression can be any natural language expression that is used to refer to a real-world entity, including nouns and pronouns. As such, the linguistic expressions James and he in a given text may both refer to a person called "James" existing in the real world.
² The term anaphora is used in the present work in alignment with current literature on anaphora resolution. Anaphora is the linguistic phenomenon of an anaphor pointing to another item in the text, and should not be understood as the plural form of anaphor, which is anaphors.
The term anaphor describes a linguistic element, often a pronoun or a nominal, which is linked to another linguistic element previously presented in the text (Mitkov 2003). An anaphoric reference is usually supported by a preceding nominal, which is called an antecedent. If a referring pronoun is mentioned before its referent, the term cataphora applies (Jurafsky and Martin 2000, p. 675). Anaphora provides us with an indirect reference to a real-world entity. When a referring expression, such as James, has been introduced in a text, it allows for subsequent reference by anaphors, such as he or the boy. The original referring expression is then the antecedent of the subsequent referring anaphor, for example the pronoun he. If the anaphor and the antecedent it is linked to have the same referent in the real world, they are termed coreferential (Mitkov 2003, p. 267).
(2-1)
Politimannen sier at han har flere observasjoner.
   The policeman says that he has several observations.

In example (2-1) above, the pronoun han (he) is an anaphor which points back to its antecedent, the referring expression politimannen (the policeman). Han and politimannen both refer to the same real-world referent, the entity "the policeman", and are therefore coreferential.
There are various and complex structural conditions on the co-occurrence of an anaphor and its antecedent. These include constraints on how far apart the antecedent and the referring anaphor can be without disturbing the understanding of the text. An elaborate discussion of these conditions is, however, beyond the scope of the present work.
Mitkov (2003, p. 268) distinguishes between the following types of anaphora:

• pronominal anaphora: The anaphor is a pronoun.
• lexical noun phrase anaphora: The anaphor is a definite description or proper name that gives additional information and has a meaning independent of the antecedent.
• verb anaphora: The anaphor is a verb and refers to an action.
• adverb anaphora: The anaphor is an adverb.
• zero anaphora: The anaphor is implicitly present in the text, but physically omitted.
• nominal anaphora: The anaphor has a non-pronominal noun phrase as antecedent.
  o direct: anaphor and antecedent are linked through identity, synonymy, generalisation or specialisation.
  o indirect: anaphor and antecedent are linked through part-of relations or set membership.
He states that pronominal anaphora is the most frequent type, while nominal anaphora (indirect anaphora in particular) usually requires real-world knowledge to be resolved. In this thesis, the described method is tested on occurrences of pronominal anaphora from the text collection.
2.1.1 Frameworks for anaphora resolution

Anaphora resolution is the process of determining the antecedent of an anaphor (Mitkov 2003, p. 269). In our minds, we build a discourse model that represents the entities mentioned in the discourse and the relationships between them (Webber 1978, in Jurafsky and Martin 2000). A representation is evoked in the model upon an entity's first mention in the discourse, and is subsequently accessed from the model if the entity is mentioned again, either by name or by way of anaphora. Entities have varying degrees of salience in the discourse model, depending on how frequently they have been mentioned and on how long ago they were last mentioned. This notion of a discourse model is used both in theories which aim at describing the process of anaphora resolution, and in computational approaches that automate the anaphora resolution process.

There are different approaches to resolving the referring expressions and anaphors that occur in natural language discourse. Several formalisms offer frameworks describing the theory of discourse representation in general and anaphora binding in particular. In the following sections, two of these formalisms are briefly outlined.
2.1.1.1 Discourse representation theory

Discourse representation theory (DRT), proposed by Hans Kamp in 1981, represents a way of creating dynamic semantic representations for natural language discourse. The framework aims at representing larger linguistic units than sentences and is particularly useful for representing the way a discourse changes with every new sentence that is introduced. The core structure within DRT is the discourse representation structure (DRS), which is transformed through the processing of each sentence of a discourse. Since every sentence in a discourse can potentially introduce new concepts and entities which are referentially linked to previously introduced entities, it is not possible to infer the full meaning of an individual sentence without regard to the discourse it fits into. As Kamp and Reyle state, "the meaning of the whole is more, one might say, than the conjunction of its parts" (Kamp and Reyle 1993, p. 59). The interpretation of a new sentence relies both on the meaning of the sentence itself and on the structure representing the context of the earlier sentences (Kamp and Reyle 1993, p. 59). Thus, a new sentence is interpreted as a contribution to the existing representation of the discourse.

DRT establishes anaphoric links across sentence boundaries between anaphors in the current sentence and antecedents in the DRS as the new sentence is processed. A very simplified outline of what happens when the discourse in (2-2) is processed and a DRS is created is given below (after Kamp and Reyle 1993).
(2-2)
a. Jones owns Ulysses.
b. It fascinates him.

(2-2a) is entered into the DRS by applying phrase structure rules and lexical insertion rules in order to associate the sentence with a syntactic representation and a set of features and values for the individual words. Examples of assigned features are number, gender and transitivity (for verbs). The DRS for (2-2a) will in abridged form look like this:
(2-3)
Discourse referents: x, y
DRS conditions:
   Jones(x)
   Ulysses(y)
   x owns y

Upon entering (2-2b) into the DRS, a series of actions must be performed to calculate which entities in the DRS the two pronouns of sentence (2-2b) refer to. In the case of the sentences in (2-2), where the DRS only contains two members, this can be determined on the basis of gender agreement. The updated DRS appears as in (2-4) below.

(2-4)
Discourse referents: x, y, u, v
DRS conditions:
   Jones(x)
   Ulysses(y)
   x owns y
   u = y
   v = x
   u fascinates v
DRT offers a framework for creating and storing semantic representations of the meaning conveyed in a natural language discourse. The theory does not, however, offer a means to identify the referent of ambiguous anaphors, or of anaphors which require real-world knowledge in the process of determining their referents.
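The incremental DRS construction outlined above can be mimicked in a few lines of code. The following is a hypothetical sketch, not part of DRT proper nor of this thesis's implementation; it assumes that gender agreement alone suffices to resolve the pronouns, as in the Kamp and Reyle example (2-2):

```python
# Minimal DRS for (2-2a) "Jones owns Ulysses": discourse referents with
# features, plus a list of conditions, as in (2-3).
referents = {"x": {"name": "Jones", "gender": "masc"},
             "y": {"name": "Ulysses", "gender": "neut"}}
conditions = [("owns", "x", "y")]

def resolve(pronoun_gender):
    """Return the discourse referents compatible with a pronoun's gender."""
    return [r for r, feats in referents.items() if feats["gender"] == pronoun_gender]

# Processing (2-2b) "It fascinates him": 'it' is neuter, 'him' is masculine.
u = resolve("neut")[0]   # u = y (Ulysses)
v = resolve("masc")[0]   # v = x (Jones)
conditions.append(("fascinates", u, v))
```

After the update the condition list matches (2-4): u is identified with y and v with x, and "u fascinates v" is added to the DRS.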
2.1.1.2 Binding Theory

Binding Theory (BT) is a theoretical framework that describes syntactic conditions for intra-sentential anaphoric linking. BT offers conditions for whether a nominal expression can, must, or must not be linked to another nominal in the sentence. Within BT, reflexive pronouns and reciprocals are termed anaphors, while non-reflexive pronouns are called pronouns or pronominals. This understanding of the terms is also used in the following when referring to BT. The NP which is linked to by an anaphor or pronoun is, for BT purposes, the binder of the anaphor or pronoun. In example (2-5), he is the binder of himself, while himself is bound by he.

(2-5)
He hurt himself.
Chomsky's binding theory has three principles, shown in (2-6) below (after Chomsky 1981, in Asudeh and Dalrymple 2004).

(2-6)
A. An anaphor (reflexive or reciprocal) must be bound in its local domain.
B. A pronominal (non-reflexive pronoun) must not be bound in its local domain.
C. A nonpronoun must not be bound.
The implications of these principles, as well as the notion of local domain, are exemplified by (2-7). In (2-7a) the subclause the thief hurt himself constitutes the local domain for the anaphor himself. Since, according to Principle A above, the anaphor can only be bound in its local domain, the noun phrase the sergeant is not a possible binder. In (2-7b), the pronominal him must not (cannot) be bound in its local domain, and can therefore be bound to the noun phrase the sergeant. The pronominal need not, however, be bound to a noun phrase expressed in the sentence; it can also refer to a discourse referent not mentioned in the sentence.

(2-7)
a. The sergeant said that the thief hurt himself.
b. The sergeant said that the thief hurt him.
The fact that not all possible candidates in a syntactic domain can be binders is captured by the requirement that the binder must be in a structurally dominant position relative to the entity to be bound. This ensures that the noun phrase the sergeant cannot be the binder of the anaphor himself in example (2-8).

(2-8)
The sergeant's suspect hurt himself.
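The local-domain conditions of Principles A and B can be sketched as simple checks over a flat clause annotation. The following toy code is hypothetical and deliberately simplified (it ignores structural dominance and many other conditions); it encodes only the local-domain requirement illustrated in (2-7):

```python
# Toy clause annotation for (2-7): clause 0 is the matrix clause, clause 1
# the subclause "the thief hurt himself/him". Features are invented here.
nps = [
    {"id": "sergeant", "clause": 0, "type": "nonpronoun"},
    {"id": "thief",    "clause": 1, "type": "nonpronoun"},
    {"id": "himself",  "clause": 1, "type": "anaphor"},
]

def possible_binders(target, inventory):
    """NPs that may bind `target` under Principles A/B (very simplified)."""
    others = [n for n in inventory if n["id"] != target["id"]]
    if target["type"] == "anaphor":      # Principle A: bound in local domain
        return [n for n in others if n["clause"] == target["clause"]]
    if target["type"] == "pronominal":   # Principle B: not bound locally
        return [n for n in others if n["clause"] != target["clause"]]
    return []                            # Principle C: a nonpronoun is not bound

# As in (2-7a): only "the thief" can bind "himself"; "the sergeant" cannot.
assert [n["id"] for n in possible_binders(nps[2], nps)] == ["thief"]

# As in (2-7b): a pronominal "him" in clause 1 could be bound by "the sergeant".
him = {"id": "him", "clause": 1, "type": "pronominal"}
assert [n["id"] for n in possible_binders(him, nps)] == ["sergeant"]
```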
Hellan (1988) suggests that the principles of standard Government and Binding theory are primarily based on English and that they cover "a very limited subpart of what constitutes a possible anaphoric system" (Hellan 1988, preface). He proposes several additional principles, perhaps most notably the Command Principle, which states, among other things pertaining to the command relation, that relations within hierarchies of thematic roles can also stand in a command relation to an anaphor.
2.1.2 Computational approaches to anaphora resolution

Automated anaphora resolution systems basically have to perform three separate tasks (Mitkov 2003):

• identify the anaphors to be resolved
• locate the candidates for antecedents
• select the antecedent from the candidate list
Different computational approaches apply different resolution factors and knowledge sources.<br />
The process of resolving the antecedent is based on several resolution factors, which in turn<br />
draw into account quite different sources of background knowledge. Using morphological<br />
knowledge may be the simplest approach; gender and/or number is compared and candidates are<br />
discounted if their gender/number does not fit that of the anaphor. Syntactic knowledge is used<br />
to identify syntactic parallelism; the antecedent is often found in a similar syntactic position as<br />
the anaphor. In many cases, the correct antecedent cannot be identified without the help of<br />
semantic information. Selectional constraints is one example of semantic knowledge that can be<br />
used to narrow down the list of candidates for the antecedent. Repeated mention of an entity in<br />
the text passage preceding the anaphor may indicate that this entity has a higher degree of<br />
salience in the discourse and that it therefore is a likely antecedent for a following anaphor<br />
(Jurafsky and Martin 2000, p. 682). Morphological, lexical, syntactic, semantic and<br />
salience criteria used as background knowledge do not immediately suggest the most likely<br />
candidate, but rather act as filters to eliminate unsuitable candidates (Mitkov 2003, p. 271).<br />
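To make the filtering idea concrete, the sketch below discards candidates whose gender or number clashes with the anaphor. The feature dictionaries and example words are invented for illustration; note that since etterforskningen can be masculine in bokmål, agreement alone cannot remove it, in line with the discussion of example (2-11) below.

```python
# Sketch of a morphological agreement filter: candidates that clash with
# the anaphor's gender or number are discounted, the rest survive for
# later preference ranking. Feature values here are hand-typed toys.

def agreement_filter(anaphor, candidates):
    """Keep only candidates compatible with the anaphor's gender/number.

    A missing feature counts as unspecified and matches anything.
    """
    def compatible(a, c):
        for feature in ("gender", "number"):
            if a.get(feature) and c.get(feature) and a[feature] != c[feature]:
                return False
        return True
    return [c for c in candidates if compatible(anaphor, c)]

han = {"form": "han", "gender": "masc", "number": "sg"}
candidates = [
    {"form": "lensmannen", "gender": "masc", "number": "sg"},
    {"form": "etterforskningen", "gender": "masc", "number": "sg"},
    {"form": "vitner", "gender": "neut", "number": "pl"},
]
surviving = agreement_filter(han, candidates)
```

The filter removes vitner (wrong number and gender) but, as expected, cannot choose between the two remaining candidates; that is left to the later preference factors.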
Mitkov (2003, p. 272) states that for some examples “the crucial and most reliable factor in<br />
deciding on the antecedent” is real-world knowledge. Even the most exquisite anaphora<br />
resolution system will not be able to resolve anaphora of the type that needs real-world<br />
knowledge to rule out candidates that just do not make common sense. The examples in (2-9)<br />
below illustrate the point; without access to real-world knowledge or semantics, there is no way<br />
to confidently resolve the antecedent of the anaphoric han (he).<br />
(2- 9)<br />
a. Politimannen skjøt etter morderen, og han falt.<br />
The policeman shot at the murderer and he fell.<br />
b. Politimannen skjøt etter morderen, og han bommet.<br />
The policeman shot at the murderer and he missed.<br />
2.1.2.1 Knowledge-free approaches<br />
Botley and McEnery term anaphora resolution systems which do not consult any form of<br />
knowledge representation in the process of identifying the antecedent of an anaphor<br />
“knowledge-free” (Botley and McEnery 2000, p. 17). In the two following sections, it will be<br />
shown, on the basis of two well-established syntactic algorithms for anaphora resolution, that<br />
knowledge-free approaches that resolve anaphors without employing real-world knowledge<br />
cannot identify different antecedents in the case of examples (1-1) and (1-2).<br />
2.1.2.1.1 Lappin and Leass’ algorithm for pronoun resolution<br />
Lappin and Leass (Lappin and Leass 1994, in Jurafsky and Martin 2000) offer an algorithm for<br />
pronoun interpretation which takes into account recency and syntactically-based preferences.<br />
The algorithm does not employ semantic preferences or background knowledge, but uses a<br />
weighting system which reflects various syntactic features as well as salience of recency in the<br />
discourse. When testing this algorithm on test data from the same genre as was used to develop<br />
the weighting system, Lappin and Leass report an accuracy of 86%. Jurafsky and Martin present<br />
a somewhat simplified version of the algorithm for the resolution of non-reflexive, third-person<br />
pronouns (Jurafsky and Martin 2000, p. 684). The Lappin and Leass algorithm creates a<br />
discourse model upon processing a sentence and assigns each member of the discourse model a<br />
salience value. A set of salience factors determine the salience weight each of the members is<br />
assigned. The aspect of recency is maintained by reducing each member’s salience value by half<br />
upon processing of a new sentence. (2-10) below shows the weighting system of the salience<br />
factors in the system.<br />
(2- 10)<br />
SALIENCE FACTOR VALUE<br />
Sentence recency 100<br />
Subject emphasis 80<br />
Existential emphasis 70<br />
Accusative emphasis 50<br />
Indirect object or oblique complement emphasis 40<br />
Non-adverbial emphasis 50<br />
Head noun emphasis 80<br />
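The bookkeeping described above can be sketched as follows, computed strictly from the weights in (2-10) and the halving rule. The factor assignments are typed in by hand (a real system would derive them from a parse; the choice of the non-adverbial factor for etterforskning is an assumption), and the role-parallelism bonus of 35 is read off the difference between tables (2-13) and (2-14).

```python
# Sketch of Lappin and Leass-style salience scoring with the weights
# from (2-10), applied to the first sentence of example (2-11).

WEIGHTS = {
    "recency": 100, "subject": 80, "existential": 70, "accusative": 50,
    "indirect_object": 40, "non_adverbial": 50, "head_noun": 80,
}

def salience(factors):
    """Sum the weights of the salience factors a mention exhibits."""
    return sum(WEIGHTS[f] for f in factors)

def halve(scores):
    """Degrade every salience value by half when a new sentence is read."""
    return {referent: value / 2 for referent, value in scores.items()}

# Sentence 1 of (2-11): factor assignments as in table (2-12).
scores = {
    "lensmann": salience(["recency", "subject", "non_adverbial", "head_noun"]),
    "etterforskning": salience(["recency", "non_adverbial"]),
    "gjerningsmann": salience(["recency", "subject", "non_adverbial", "head_noun"]),
}
scores = halve(scores)    # moving on to the second sentence
scores["lensmann"] += 35  # role parallelism bonus with the anaphor han
best = max(scores, key=scores.get)
```

The candidate with the highest resulting value, here lensmann, is selected as the antecedent.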
In the following, this algorithm will be used in an attempt to resolve the referent for the anaphor<br />
han (he) in the second sentence of each member of the sentence pair presented in (1-1) and (1-2)<br />
and repeated in (2-11) below:<br />
(2- 11)<br />
a.<br />
Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig<br />
kommer til å drepe igjen. Han etterlyser vitner som var i sentrum søndag<br />
kveld.<br />
The sergeant leading the investigation says that the perpetrator probably will<br />
kill again. He puts out a call for witnesses who were in the city centre Sunday<br />
evening.<br />
b. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig<br />
kommer til å drepe igjen. Han er observert i sentrum.<br />
The sergeant leading the investigation says that the perpetrator probably will<br />
kill again. He is observed in the city centre.<br />
When attempting to find the antecedent for the pronoun in (2-11a), all potential referents must<br />
be collected. Since the discourse consists of only two sentences, only the first sentence is<br />
processed; in a longer discourse, the antecedent would be searched for in up to four preceding<br />
sentences. The potential referents lensmannen (the sergeant), etterforskningen (the<br />
investigation) and gjerningsmannen (the perpetrator) are assigned salience values as shown in<br />
(2-12) below.<br />
(2- 12)<br />
REC SUBJ EXIST OBJ IND-OBJ NON-ADV HEAD N TOT<br />
lensmann 100 80 50 80 310<br />
etterforskning 100 50 150<br />
gjerningsmann 100 80 50 80 310<br />
As the algorithm moves on to the next sentence, the values assigned in (2-12) are reduced by<br />
half, as shown in (2-13).<br />
(2- 13)<br />
Referent Phrases Value<br />
lensmann lensmannen 155<br />
etterforskning etterforskningen 75<br />
gjerningsmann gjerningsmannen 155<br />
Now potential referents which do not agree in gender or number are removed. In our case, the<br />
pronoun is han (he), which for Norwegian bokmål specifies an animate referent. According to<br />
the preference factors in the algorithm, which check for gender and number only, the potential<br />
referent etterforskning cannot, however, be removed. At this stage, referents which do not<br />
satisfy intra-sentential syntactic coreference constraints will also be removed. Final salience<br />
values are calculated by assigning values for syntactic role parallelism and cataphora. In our<br />
case, lensmann is given extra weight due to the syntactic parallelism to the anaphor. This results<br />
in the values shown below:<br />
(2- 14)<br />
Referent Phrases Value<br />
lensmann lensmannen 190<br />
etterforskning etterforskningen 75<br />
gjerningsmann gjerningsmannen 155<br />
Since lensmann has the highest salience weight, this word is also chosen as the referent for the<br />
pronoun han. The processing of sentence (2-11a) makes it clear that the same referent would<br />
be assigned in a processing of sentence (2-11b). The algorithm does not take the semantic<br />
meaning of the sentence to be processed into account, and is therefore not able to distinguish<br />
between the referent assignments in examples such as (2-11a) and (2-11b).<br />
2.1.2.1.2 Hobbs’ Tree search algorithm<br />
Hobbs’ tree search algorithm (Hobbs 1978, in Jurafsky and Martin 2000) is a pronoun resolution<br />
algorithm based on syntactic tree structures of the sentences to be processed. Proceeding to<br />
resolve the antecedent of a pronoun, the tree search algorithm processes syntactic<br />
representations of all previous sentences in the discourse, as well as the sentence with the<br />
pronoun to be resolved. The syntactic representations of the discourse, in combination with the<br />
order in which the syntactic structures are searched, to some degree represent a discourse<br />
model and salience preferences. In its search for the antecedent of a pronoun, the algorithm<br />
traverses syntactic trees in a left-to-right, breadth-first manner.<br />
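The search order can be sketched as below. This is not the full Hobbs algorithm (which also climbs node by node within the anaphor's own sentence); it only shows how the first NP encountered in the left-to-right, breadth-first walk over a preceding sentence's tree is proposed as the antecedent. The toy parse and its labels are simplified inventions and do not follow NorGram.

```python
# Sketch of the left-to-right, breadth-first tree walk that the Hobbs
# algorithm applies to the parse trees of preceding sentences.
# Trees are represented as ("LABEL", child, child, ...) tuples.

from collections import deque

def first_np_breadth_first(tree):
    """Return the first NP node met in a left-to-right, breadth-first walk."""
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        label, children = node[0], node[1:]
        if label == "NP":
            return node
        queue.extend(child for child in children if isinstance(child, tuple))
    return None

# Toy parse of the first sentence of (2-11).
previous_sentence = (
    "IP",
    ("NP", ("N", "lensmannen"), ("CP", ("NP", ("N", "etterforskningen")))),
    ("VP", ("V", "sier"), ("CP", ("NP", ("N", "gjerningsmannen")))),
)
antecedent = first_np_breadth_first(previous_sentence)
```

Because the walk is breadth-first, the subject NP headed by lensmannen is reached before the NPs embedded deeper in the tree.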
To find the antecedent for the pronoun han in the sentences in (2-11), syntactic tree structures of<br />
the sentences are needed. The syntactic structures of (2-11) presented in Figure 1 and Figure 2<br />
are generated by the NorGram grammar web version 3 . This is a more complex grammar than the<br />
one used in Jurafsky and Martin’s outline of the algorithm (Jurafsky and Martin 2000, p. 689),<br />
but as stated there, the algorithm to a large degree allows any choice of grammar. Since the<br />
algorithm is based on searches in syntactic trees, and therefore relies on making assumptions<br />
regarding the build-up of the syntactic structures, the grammar must be specified in any case. In<br />
the following, the tree search as specified in Jurafsky and Martin (2000) will be carried through,<br />
with the grammar assumptions this implies. The syntactic trees (and their<br />
labels) are included for illustrative purposes and should not be thought of as input for the<br />
algorithm.<br />
3 Generated at http://decentius.hit.uib.no:8010/logon/xle-mrs.xml 31/01-2005<br />
Figure 1<br />
Figure 2<br />
In the process of identifying the antecedent for the pronoun han in the second sentence of<br />
(2-11a), the algorithm takes as its starting point the NP immediately dominating the pronoun.<br />
From there, it moves upwards in the tree to the first NP or sentence node. This node is called X<br />
and the path from the pronoun to X is called p. In our case, that means that X is the topmost<br />
sentence node (IP node). Since there are no more branches to the left of X and p, and in other<br />
words no possible antecedents introduced earlier in the same sentence, the algorithm moves<br />
along to the parse tree of the previous sentence. Searching left-to-right and breadth-first, the first<br />
NP node that is encountered is suggested as the antecedent for the pronoun. In our case, this<br />
means that the algorithm would propose the NP lensmannen som leder etterforskningen as the<br />
antecedent for the pronoun han. As will be clear from examining the example sentences in<br />
(2-11), this is a correct resolution of the antecedent in (2-11a), but not for the antecedent in<br />
(2-11b). Parallel to the Lappin and Leass resolution algorithm, the tree search algorithm also<br />
does not consider the semantic meaning of the sentence with the anaphor to be resolved. The<br />
pronouns han in the second sentences of (2-11a) and (2-11b) are treated in the same way, and<br />
lensmannen is chosen as the most likely antecedent in both cases.<br />
2.1.2.2 Traditional approaches to anaphora resolution<br />
As seen through the examples above, anaphors of the type that requires semantic information<br />
simply cannot be resolved using purely syntactic algorithms. In order to find the<br />
antecedent for such anaphors, some sort of real-world knowledge must be consulted. Mitkov<br />
(1999) distinguishes between traditional and alternative approaches for anaphora resolution.<br />
The traditional approaches are those that use knowledge factors to filter out unlikely candidates<br />
and then use preference rules on a smaller set of likely candidates, while the alternative<br />
approaches find the most likely candidate based on statistical or AI techniques (Mitkov 1999, p.<br />
8). The traditional approaches usually draw in the factor of real-world or domain knowledge,<br />
often in the form of a comprehensive knowledge or domain base, in order to resolve anaphors of<br />
the type in examples (2-9) and (2-11) above (Mitkov 2003). Such approaches are also called<br />
knowledge-based (Botley and McEnery 2000, p. 11). As repeatedly emphasised above, some<br />
types of anaphors cannot be correctly resolved without access to real-world<br />
information. Carbonell and Brown’s (1988) multi-strategy approach is one traditional<br />
knowledge-based anaphora resolution system. Their approach follows what Botley and<br />
McEnery call “a trend […] towards the integration of several different resolution algorithms into<br />
large-scale modular architectures” (Botley and McEnery 2000, p. 17). Their system draws on<br />
different knowledge sources, including syntactic structure, case-frame semantics, dialog<br />
structure and real-world knowledge. The resolution of anaphors is based on constraints and<br />
preferences; first the constraints are applied to narrow down the list of potential antecedents and<br />
then the preferences are applied to each of the remaining candidates (Carbonell and Brown<br />
1988, p. 98). Real-world knowledge is realised as a set of precondition and postcondition<br />
constraints. These constraints determine, for example, that the object given is no longer in the<br />
possession of the actor after a successful act of giving has been carried out. The main problem<br />
with such an approach is stated by the developers: “the strategy is simple, but requires a fairly<br />
large amount of knowledge to be useful for a broad range of cases” (Carbonell and Brown 1988,<br />
p. 97).<br />
Generally speaking, the knowledge bases that knowledge-based systems for anaphora resolution<br />
rely on are difficult to represent and process, and require a considerable amount of human input<br />
(Mitkov 2001, p. 110). The information is structured using different frameworks; often each<br />
anaphora resolution system structures its knowledge base in a system-specific manner. Rather<br />
than giving an outline of various specific methods belonging to the traditional approaches, some<br />
of the formats used for knowledge representation are briefly mentioned below. Several<br />
frameworks have been developed to cope with the need for a formalism to represent real-world<br />
or domain knowledge. Most of these have been part of specific anaphora resolution systems<br />
and have not constituted independent frameworks for the representation of real-world<br />
knowledge.<br />
Minsky’s Frames (Minsky 1975, in Botley and McEnery 2000) is a framework for representing<br />
knowledge about stereotyped objects and events. The frames are dynamic in the sense that the<br />
information they hold about a particular object or event can change if new information is<br />
encountered. Input into the system is interpreted in accordance with the information present in<br />
the frames; the frames generate expectations about the input (Botley and McEnery 2000, p. 12).<br />
In the case of a “shooting frame” being evoked upon processing of the sentence in (2-9a), the<br />
expectation that if somebody misses, it is likely to be the same person that also was doing the<br />
shooting, is created. Following such an expectation, it is easy to identify the correct antecedent<br />
for the anaphor. Schank’s Scripts (Schank 1972, in Botley and McEnery 2000) have some<br />
similarity to Minsky’s Frames, but are primarily used to represent knowledge about events<br />
which do not undergo change (Botley and McEnery 2000, p. 12). Information about role<br />
assignment and the sequence of events in given contexts is represented in the script.<br />
2.1.2.3 Alternative approaches to anaphora resolution<br />
Hand-coded knowledge bases that aim at representing real-world or domain knowledge are<br />
expensive and labor-intensive to build and maintain. As a consequence, the focus has shifted<br />
toward systems that rely less heavily on world knowledge in the last 15 years (see Mitkov 2003<br />
for an overview). Many of these systems incorporate semantic and real-world knowledge, but<br />
use methods that enable the collection of this information to have a high degree of automation<br />
(Baldwin 1997; Dagan and Itai 1990; Dagan et al. 1995; Nasukawa 1994; inter al.). Mitkov<br />
(2003) terms these systems knowledge-poor and attributes their growth in number in recent<br />
years to the fact that corpora and similar electronic linguistic resources have become better,<br />
larger and more available. Some of these systems do not really attempt to build a world- or<br />
domain knowledge base (Baldwin 1997; Nasukawa 1994), but rather look at features such as<br />
co-occurrence patterns in the text itself, while others integrate corpora and use them as a form of<br />
abstract knowledge base (Dagan and Itai 1990; Dagan et al. 1995).<br />
Among the different “alternative” approaches, Dagan and Itai’s (1990) statistical approach,<br />
Dagan et al.’s (1995) estimation of unseen patterns and Nasukawa’s (1994) knowledge-free<br />
method are of particular interest for this project. Dagan and Itai’s (1990) method is that of using<br />
co-occurrence patterns observed in a corpus as a type of selectional restrictions. Co-occurrence<br />
patterns observed in a large corpus are thought to reflect the semantic constraints that apply to<br />
natural language. Candidates for antecedents for the anaphor it are identified in the text and put<br />
in the place of the anaphor to be resolved. This produces co-occurrence patterns that are checked<br />
against the corpus. Subsequently, the candidate present in the most frequently occurring<br />
co-occurrence pattern is chosen as the antecedent. This method relies on a large corpus, as only<br />
patterns which actually have been seen in the corpus are considered. Infrequent patterns will not<br />
be picked since they generally speaking will not feature at the top of the pattern list. Dagan et<br />
al. (1995) offer a solution to this problem by presenting a similar method which also estimates<br />
the probability of co-occurrence patterns that have not been observed in the training data. They<br />
state the importance of distinguishing between probable and improbable unobserved<br />
co-occurrence patterns and emphasise that the “distinctions ought to be made using the data that do<br />
occur in the corpus” (Dagan et al. 1995, p. 164). Analogies are made between specific unseen<br />
co-occurrence patterns and observed co-occurrences which contain similar words, determining<br />
word similarity by a similarity metric. Patterns that contain similar words to the target word and<br />
that have been observed in the training data are used to calculate how likely the target word is to<br />
occur in the same pattern. Nasukawa (1994) presents a resolution rate of 93.8% for an even<br />
more knowledge-poor method for pronoun resolution. Instead of drawing information from a<br />
corpus, word frequency and co-occurrence patterns in the text itself are used to filter out the<br />
most likely candidate for the antecedent. In Nasukawa’s approach, inter-sentential data is<br />
exploited in the process of resolving the pronoun it. The likelihood of the antecedent is<br />
determined statistically and the antecedent candidate with the highest value is selected by the<br />
system. The approach uses a syntactic-based heuristic rule for the selection of the antecedent.<br />
Nasukawa states that approaches using real-world knowledge are not yet large-scale enough to<br />
be of use in broad-coverage systems, and attempts to extract information corresponding to<br />
case frames in world knowledge from the texts to be processed (Nasukawa 1994, p. 1157). In<br />
this way, collocation patterns are used as a form of world knowledge for the domain of the texts.<br />
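Dagan and Itai's substitution idea can be sketched as follows. Each antecedent candidate is substituted into the anaphor's position, and the candidate whose resulting co-occurrence pattern is most frequent in the corpus is preferred. The pattern counts below are an invented stand-in for real corpus statistics, keyed here to example (2-9).

```python
# Sketch of Dagan and Itai (1990): prefer the candidate whose substituted
# co-occurrence pattern is most frequent in a corpus. The counts are toys.

corpus_counts = {
    ("skyte", "subject", "politimann"): 12,
    ("skyte", "subject", "morder"): 2,
    ("falle", "subject", "morder"): 7,
    ("falle", "subject", "politimann"): 1,
}

def resolve(verb, relation, candidates, counts):
    """Substitute each candidate into the pattern; keep the most frequent."""
    return max(candidates, key=lambda c: counts.get((verb, relation, c), 0))

# (2-9a) "... og han falt": who is the more likely subject of falle (fall)?
winner = resolve("falle", "subject", ["politimann", "morder"], corpus_counts)
```

Unseen patterns receive a count of zero here, which illustrates the sparseness problem that Dagan et al.'s (1995) similarity-based estimation is designed to address.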
As has been outlined in the introduction, this thesis describes a method that can aid in the<br />
resolution of the anaphoric expressions that require real-world knowledge to correctly resolve<br />
their antecedents. The method automatically extracts and classifies nominal arguments, resulting<br />
in associated classes of similar words. This is a knowledge-poor method in the sense that it does<br />
not require a comprehensive knowledge base to be built, but rather uses data and co-occurrence<br />
patterns from a corpus to find the most likely antecedent from a list of possible candidates found<br />
in a text.<br />
2.1.3 Anaphora resolution and text summarisation<br />
As already mentioned, several NLP applications need a reliable means to resolve anaphoric<br />
expressions and identify coreferences. In the field of text summarisation, which belongs to the<br />
domain of the KunDoc project, anaphora resolution is vital for the process of finding<br />
coreferential chains, identifying discourse structure and ultimately producing a coherent<br />
summary. Systems for automatic text summarisation need to make a number of choices<br />
regarding the resolution of anaphoric expressions. Mani (2001, p. 70) identifies “dangling<br />
anaphors” as a coherence problem in automatic summaries; without a means to resolve<br />
anaphoric expressions, the summary may contain anaphors, but not the antecedents they refer to.<br />
This disturbs the coherence in the summary; not all the information that the reader needs is<br />
present in the summarised text. The (constructed) example below illustrates this: consider the<br />
full text example in (2-15a) in connection with the summarised version in (2-15b). Neither<br />
instance of the pronoun han (he) in the summarised version in (2-15b) has an<br />
identified referent in the text. For a reader presented only with the summary, it is highly<br />
unclear what these pronouns refer to.<br />
(2- 15)<br />
a. Politiet etterlyste i dag tidlig en syklist i<br />
forbindelse med drapet på 23 år gamle Anne Slåtten. I<br />
formiddag meldte han seg til politiet, skriver bt.no<br />
- Jeg har foreløpig ikke klarhet i hva han har sagt,<br />
forteller lensmannen i Førde Kjell Fonn.<br />
This morning the police instituted a search for a biker<br />
in connection with the murder of 23-year-old Anne<br />
Slåtten. This morning he reported to the police, writes<br />
bt.no<br />
- For the time being I am not in the clear about what he<br />
has said, tells the sergeant in Førde Kjell Fonn.<br />
b. I formiddag meldte han seg til politiet, skriver bt.no<br />
- Jeg har foreløpig ikke klarhet i hva han har sagt,<br />
forteller lensmannen i Førde Kjell Fonn.<br />
This morning he reported to the police, writes bt.no<br />
- For the time being I am not in the clear about what he<br />
has said, tells the sergeant in Førde Kjell Fonn.<br />
Another reason why anaphora resolution is important for text summarisation is that the methods<br />
used for retrieving relevant sentences for a summary perform more accurately if anaphoric<br />
references to central concepts are also considered (Mitkov 2003, p. 276).<br />
The emergence of knowledge-poor and corpus-based approaches for anaphora resolution<br />
suggests that the representation of real-world knowledge does not necessarily have to take the<br />
form of a human-made system. Alternative approaches show that information available from the<br />
text to be analysed, or from larger bodies of natural language text, can be used to provide<br />
information that resembles real-world knowledge. The following section explains this notion of<br />
using contextual information to approximate world knowledge.<br />
2.2 Finding meaning in the context<br />
“You shall know a word by the company it keeps!” Firth (1957, p. 179)<br />
“The meaning of entities, and the meaning of grammatical relations among them, is related to<br />
the restriction of combinations of these entities relative to other entities.” Harris (1968)<br />
2.2.1 The distributional approach<br />
The semantic meaning of a word is often readily suggested by the lexical context in which it<br />
occurs. This is an idea advanced by many scholars, starting with Firth (1957) and Harris (1968).<br />
Human beings use the context of a word in the process of deciding the semantic meaning of the<br />
word. When encountering an ambiguous word, the language user has a finite number of possible<br />
meanings to consider. By examining the environment in which the ambiguous word occurs, the<br />
language user finds clues toward deciding which of the possible meanings is applicable.<br />
The same mechanism applies when a language user is confronted with a novel word; by<br />
observing the usage of the word, preferably over several instances, a human being is able to<br />
induce the semantic meaning from the setting the word occurs in. This is in accordance with the<br />
Distributional Hypothesis as proposed by Harris (1968) and contributes to explaining the fact<br />
that humans rarely have problems identifying for example what an ambiguous word means, or<br />
what entity in the discourse a pronoun refers to. The fact that properties in a word’s linguistic<br />
environment can contain information about the meaning of the word is a useful tool for the<br />
semantic comparison of words. A word which is used within a limited thematic domain is likely<br />
to be used in a sense specific to the contextual setting, or domain, in which it occurs. This entails<br />
that the linguistic environment in which the word exists also gives information about the<br />
meaning of the word. Words that appear in the same linguistic setting in texts that describe the<br />
same theme may have similar or related meanings as well. Texts belonging to the same domain<br />
will to some extent contain information about the same things, and as such also contain<br />
semantically similar words which are used in similar ways. Following this line of thought, it<br />
should be possible to gain relevant information about which words to expect in specific<br />
positions in a text by way of looking at the context patterns they should fit into. Thus, words that<br />
occur in limited-domain texts can be classified relative to how they combine with each other.<br />
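The classification of words by how they combine with each other can be sketched with context-count vectors: each word is described by how often it fills each context pattern, and words are compared by the cosine of their vectors. The context patterns and counts below are invented toys, not drawn from the thesis corpus.

```python
# Sketch of the distributional idea: words that occur in similar contexts
# receive similar context-count vectors, and vector similarity is taken
# as a proxy for semantic relatedness.

import math

# Toy context patterns: subject-of-etterlyse, subject-of-drepe, object-of-avhøre
counts = {
    "lensmann":      [9, 0, 1],
    "politimann":    [7, 1, 2],
    "gjerningsmann": [0, 8, 6],
}

def cosine(u, v):
    """Cosine similarity between two context-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sim_police = cosine(counts["lensmann"], counts["politimann"])
sim_killer = cosine(counts["lensmann"], counts["gjerningsmann"])
```

On these toy counts, lensmann is far closer to politimann than to gjerningsmann, which is the kind of grouping the thesis's concept classes aim to capture for a limited domain.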
The idea that the contextual environment can give clues about the semantic meaning of a word is<br />
clearly not a new one, considering the quotes of Firth and Harris in the introduction to this<br />
section. The theory dates back to the empiricists of the mid-twentieth century. Linguistic theory<br />
in the first half of the twentieth century was to a large degree dominated by empiricism.<br />
Linguistic thought, particularly in the United States, but also in Europe, was strongly influenced<br />
by the positivism of the behaviourist philosophy. Bloomfield is regarded as one of the chief<br />
advocates of linguistic positivism, and his interpretation of linguistics dominated American<br />
linguistics in the 1930s and 1940s (Robbins 1997, p. 237). The positivistic/behaviouristic view<br />
on linguistic science put emphasis on the observable. Reliable facts could only be found through<br />
objective observation of data, and only phenomena which could be empirically experienced by<br />
any observer were considered valid data for further analysis. Robbins states that the favoured<br />
model of description of the time was that of distribution; for some linguists the notion of<br />
linguistic description coincided with the statement of distributional relations (Robbins 1997, p.<br />
239). He also attributes the fact that there was little emphasis on the study of semantics in the<br />
early twentieth century to Bloomfield’s dismissal of the possibilities of an empirically based<br />
study of this field. Since the analysis of meaning requires non-linguistic knowledge as well,<br />
semantic analysis was deemed less suitable for empiricist methods. While the study of semantics<br />
previously had aimed at creating an exhaustive description of what is referred to by a linguistic<br />
entity, Firth represents a challenge to this way of thinking. His “contextual theory of language”<br />
introduced a move in semantics, toward a statement of meaning as a function of how words are<br />
used (Robbins 1997, p. 247). Together with Harris, Firth represents the distributional approach<br />
to finding semantic meaning. Within this approach, meaning is treated as semantic functions<br />
related to contexts of situation. This way of analysing meaning is data-driven in the same sense<br />
as empiricist approaches in other fields of linguistics and is strongly connected to the positivistic<br />
philosophy of science predominant at the time. However, using bottom-up methods as a means<br />
to formulate theories of linguistics is a direction that was more or less abandoned after<br />
Chomsky’s criticism of the structuralist approaches. Chomsky challenged the philosophical and<br />
scientific foundation of the Bloomfieldian canon through his proposal of the<br />
transformational-generative grammar. He dismissed the behaviouristic approach to language as the unacceptable<br />
product of the strong empiricism of the Bloomfieldian behaviourist school. The shift from<br />
empiricism to rationalism marks an important turning point in linguistic theory (Robbins 1997,<br />
p. 260). Botley and McEnery state that Chomsky, and the generation of linguists following his<br />
theories, represent a knowledge-driven approach with the goal of formulating linguistic theories<br />
(Botley and McEnery 2000, p. 24).<br />
The method of describing semantic meaning by looking at the distribution of words in context<br />
was more or less abandoned in the decades following the shift of paradigm from empiricism to<br />
rationalism. Semantic analysis has been approached through new methods within linguistic<br />
theory, deeming the meaning-is-use approach too simple. In recent years, however,<br />
computational linguistics has brought some of these old ideas forward again. This is mainly due<br />
to the increasing availability of large, computer-readable corpora and powerful processing tools<br />
that can reliably perform operations on large data sets. The emergence of corpus approaches is a<br />
move away from the Chomskyan view toward an emphasis on actual observable linguistic<br />
behaviour (Botley and McEnery 2000, p. 24). Thus, the bottom-up approaches of Firth and<br />
Harris are again in fashion; by using corpora, computational linguists today are able to look at<br />
actual occurrences of data and use these to develop theories of linguistic performance. Leech<br />
argues that corpus linguistics is not a linguistic theory, but rather a methodology (Leech 1992, in<br />
Botley and McEnery 2000, p. 23). Rather than being primarily theoretically founded, corpus<br />
linguistics as a discipline focuses on linguistic performance and description, as found in actual<br />
occurrences of natural language text. One can say that Firth and Harris’ ideas have received a<br />
pragmatic renaissance, probably in large part because of the computational tools now available.<br />
Whether these new applications of the distributional theories from the 1950s reflect a<br />
reconsideration of their usefulness and theoretical foundation, or whether they merely<br />
show that computational linguistics is a pragmatic rather than a theoretically founded<br />
science, is a discussion far outside the scope of the present work. What can be stated,<br />
though, is that the notion of using a word’s context to find out something about the meaning of<br />
that word, is an approach that seems to provide interesting results regarding semantic meaning,<br />
regardless of the motivation of such an approach. The type of semantic information available<br />
from the context is, however, not necessarily of the same type as that referred to when speaking<br />
of the semantic meaning of a word. Information obtainable from looking at distribution over<br />
several contexts rather provides a measure of semantic relatedness or closeness. Instead of<br />
providing a means to obtain or define the direct semantic meaning, methods that rely on<br />
contextual distribution return words whose meaning is similar or related to that of a target word.<br />
2.2.2 Different types of context<br />
So far, we have argued that using context as a tool to indicate the semantic meaning of a word is<br />
a useful method in linguistics. The method’s theoretical foundation dates back to the middle of<br />
the twentieth century, but has not been pursued much in the last few decades. Even though the<br />
linguistic foundation of this method has been discussed, the advance in computational resources<br />
in recent years has brought this approach forward again. However, the different<br />
types of context that can be taken into consideration have not yet been discussed in this thesis.<br />
Agreeing that the semantic meaning of a word is suggested by the linguistic<br />
context in which it occurs, or “the company it keeps”, supports the notion that different words<br />
used in the same context are semantically similar. It does, however, not provide a means for<br />
calculating the degree of this similarity, or even for finding out exactly which words are similar to<br />
each other. Depending on the information one wishes to obtain about a target word, different<br />
context types mirror different aspects of the semantic meaning of a word. Any approach that<br />
attempts to describe semantic meaning based on the contextual distribution of words in a text<br />
collection must first define the type of context that best reflects the desired information.<br />
Somewhat simplified, we distinguish between topical context and local context.<br />
2.2.2.1 Topical context<br />
Topical context (Miller and Leacock 2000), or document context, is a rather broad term that<br />
covers what we could call the “wide conception” of context. All other content words<br />
which occur in the same environment as a target word are considered to make up the context of<br />
the word and, following the discussion above, contribute to indicating the semantic meaning of<br />
the target word. A target word’s contextual environment can be further specified depending on<br />
the purpose; in short, the context is simply all the words which occur within a context window<br />
of varying size. The window can be set to cover a certain number of words before and after a<br />
target word, or to consist of the entire document the target word occurs in. Different<br />
parameters determine the weighting of each word found within the context window; for example,<br />
words can be weighted according to their distance from the target word. One extreme way of<br />
looking at topical context is the bag-of-words model, where a document is seen as an<br />
unordered collection of words, and the words are weighted by the number of times they occur in<br />
a document. In a narrower sense, topical context can be limited to consist only of the<br />
other words in the same sentence as a target word.<br />
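The notion of a topical context can be sketched as follows: every content word inside a window (or the whole document) is collected into a bag of words, with no syntactic structure recorded. The stopword list and example sentence are illustrative assumptions.<br />

```python
from collections import Counter

# A small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "was", "by"}

def topical_context(tokens, target, window=None):
    """Bag-of-words context of `target`: every content word inside the
    window counts once per occurrence. window=None takes the entire
    document as context (the widest conception)."""
    bag = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        span = tokens if window is None else tokens[max(0, i - window): i + window + 1]
        bag.update(t for t in span if t != target and t not in STOPWORDS)
    return bag

tokens = "the police found the dead woman in Førde".split()
document_bag = topical_context(tokens, "woman")          # whole document
windowed_bag = topical_context(tokens, "woman", window=3)  # narrow window
```

As in example (2-16) below, such a bag reveals the thematic domain of the text, but says nothing about how the collected words relate to the target word.<br />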
The extraction of topical context does not draw on syntactic or semantic information and<br />
therefore does not provide an indication of the relationship that the words in the context have to<br />
each other or to the target word. It is therefore not possible to say anything specific about<br />
semantic similarity based solely on the occurrence of words in the topical context. As the name<br />
indicates, this type of context gives information about the topic, or domain, of the text in which<br />
the target word occurs.<br />
Consider the words in example (2-16) as the topical context for the target word sykepleiestudent<br />
(student nurse). The context words are words which occur more than once in a short newspaper<br />
text from the text collection used in this project. Even with such a rudimentary method of<br />
selecting the words in the topical context, it is clear that this type of context provides cues to the<br />
thematic domain that the target word occurs in. The topical context does not, however, provide a<br />
means of finding words that are semantically similar to the target word. No close synonyms are<br />
retrieved, but rather words belonging to the same discourse domain as the target word.<br />
(2- 16)<br />
kvinne                woman<br />
funn (subst)          finding (noun)<br />
funnet (partisipp)    found (participle)<br />
død                   dead<br />
Førde                 Førde<br />
politi                police<br />
leteaksjon            search party<br />
2.2.2.2 Local context<br />
Local context provides a more finely tuned way of looking at semantic similarities as expressed<br />
through the distribution of words in a text. In its simplest form, a word’s local context consists<br />
of its immediately surrounding words; that is, the words immediately preceding and following a<br />
target word. The notion of local context can also be extended to include information about<br />
syntactic and grammatical properties that belong to the target word and its immediate<br />
neighbours. For example, a target word’s local context can be seen as its subject and object, or<br />
as the adjective preceding it.<br />
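In its simplest form, local-context extraction can be sketched as recording the immediate neighbours of each occurrence of a target word. The function below is an illustrative simplification; richer variants would record syntactic relations instead of raw adjacency.<br />

```python
def local_contexts(tokens, target):
    """Return the immediate (preceding, following) word pair for each
    occurrence of `target`; None marks a document edge."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok == target:
            prev = tokens[i - 1] if i > 0 else None
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            pairs.append((prev, nxt))
    return pairs

print(local_contexts("the witness saw the suspect".split(), "witness"))
# → [('the', 'saw')]
```

Two words that repeatedly yield the same such pairs are, in Lin’s (1997) sense, likely to have similar meanings.<br />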
Several studies show that classifying words based on the local context in which they occur gives<br />
information about the semantic meaning of the words, rather than about their membership of a<br />
thematic domain, as found when examining the topical context (Hindle 1990; Grefenstette 1992;<br />
Lin 1998; Lin and Pantel 2001; Pereira et al. 1993; inter alia). This indicates that access to<br />
features within a word’s local context can contribute to saying something about the meaning of<br />
the word and can ultimately act as a foundation for the formation of concept groups of<br />
semantically similar words. Distributional representations based on a word’s local context are<br />
thus useful for measuring the semantic similarity of words. Lin (1997) exploits this in an algorithm<br />
for word sense disambiguation and states that local context gives crucial clues about the<br />
meaning of a word, following the intuition that:<br />
“Two different words are likely to have similar meanings if they occur in identical local<br />
contexts.” (Lin 1997, p 64).<br />
Since the local context can comprise syntactic and semantic information, it provides a means to<br />
access different information relevant to the type of analysis to be performed on the<br />
material. Several approaches describe methods for finding similar nouns based on the<br />
distributional patterns of words in the local context (Hindle 1990; Grefenstette 1992; Lin 1998;<br />
Lin and Pantel 2001; Pantel and Lin 2002; Pereira et al. 1993; inter alia). These methods classify<br />
words in accordance with their distributional patterns, not using hand-coded semantic<br />
knowledge as a basis, but rather inferring the required knowledge from a text corpus as part of<br />
the analysis process. The approaches adopt different methods for judging the similarity of<br />
words. Below, some of the approaches to finding similar words are described briefly; the<br />
similarity metrics, however, will not be discussed in this outline.<br />
Hindle (1990) shows that the contextual distribution of words provides a useful semantic<br />
classification, even when the classification process is automated with no human<br />
intervention. His method examines predicate-argument structures in a large corpus and<br />
automatically classifies nouns into semantically similar sets on the basis of the predicates they<br />
combine with. The similarity between nouns is measured as being a function of mutual<br />
information estimated from the text. Hindle’s results show that semantic relatedness can be<br />
derived from the distribution of syntactic forms (Hindle 1990, p. 274). This is a similar approach<br />
to the one taken in the present work, if on a substantially smaller scale. Hindle (1990) addresses<br />
the data sparseness problem by estimating the probability of an unseen event by comparing it to<br />
similar events which have been seen. Grefenstette (1992) presents a method which looks for<br />
context patterns in large domain-specific corpora and finds similar words relative to how a target<br />
word is used in a specific text or domain. His program SEXTANT uses syntactically derived<br />
contexts and estimates the similarity of two words by considering the overlapping of all the<br />
contexts associated with them over a large corpus (Grefenstette 1992, p. 325). As a result, a<br />
word’s context consists of all the words co-occurring with it in the corpus. Pereira et al. (1993)<br />
also report a method for clustering words according to their distributions in given syntactic<br />
contexts. In their approach, nouns are classified based on their syntactic relations to predicates in<br />
the corpus. The method enables the automatic derivation of classes of semantically similar<br />
words from a text corpus and produces clusters the authors term “intuitively informative”<br />
(Pereira et al. 1993, p. 190). Lin and Pantel (2001) present the unsupervised algorithm UNICON<br />
for the creation of groups of semantically similar words. Their approach examines collocation<br />
patterns consisting of dependency relationships, and employs a method for selecting significant<br />
collocation patterns. Those dependency relations which occur more frequently than if the words<br />
were independent of each other are selected as collocation patterns. This approach is further<br />
developed in Pantel and Lin (2002). Here, clusters which are relatively semantically different<br />
are initially identified and a subset of the cluster members are used to create so-called centroids,<br />
which represent the average features of the subsets. Subsequently new words are assigned to<br />
their most similar clusters. A word can be assigned to several clusters, each cluster<br />
corresponding to a sense of the word.<br />
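As an illustration of the kind of association measure these approaches build on, a Hindle-style pointwise mutual information score over (verb, noun) pairs can be sketched as follows. The pair counts are invented toy data for the crime domain, and the sketch shows only the general idea, not Hindle’s exact procedure or similarity metric.<br />

```python
from collections import Counter
from math import log2

# Invented (verb, first-argument) pairs; toy data, not a real corpus.
pairs = [("kill", "murderer"), ("kill", "man"), ("kill", "murderer"),
         ("question", "witness"), ("question", "witness")]

pair_counts = Counter(pairs)
verb_counts = Counter(v for v, _ in pairs)
noun_counts = Counter(n for _, n in pairs)
total = len(pairs)

def pmi(verb, noun):
    """Pointwise mutual information:
    log2( P(verb, noun) / (P(verb) * P(noun)) ).
    Positive values mean the pair co-occurs more often than chance."""
    p_vn = pair_counts[(verb, noun)] / total
    return log2(p_vn / ((verb_counts[verb] / total) * (noun_counts[noun] / total)))
```

Nouns that score high with the same set of verbs can then be grouped as semantically similar, which is the intuition the clustering approaches above formalise in different ways.<br />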
2.2.3 Context and selectional restrictions<br />
In the above it has been argued that a given word will tend to co-occur with a limited class of<br />
other words, and that this information can be exploited to find words that are similar in meaning.<br />
One of the reasons for this expected occurrence of similar words in similar contexts, is that<br />
predicates to a certain extent limit the semantic properties of the arguments that they can<br />
combine with. This behaviour is captured through the notion of selectional restrictions, which<br />
define how a predicate restricts the class of arguments that can combine in a specific position<br />
with it. Selectional constraints allow a predicate to specify semantic restrictions on its arguments<br />
(Jurafsky and Martin 2000, p. 512). This accounts for the intuition that only a certain class of<br />
words can occur in a specific argument position to a given predicate. In the case of a verb such<br />
as avhøre (interrogate, take statement from) a possible selectional restriction for the first<br />
argument could be that it must represent a human. Jurafsky and Martin formulate it like this:<br />
interrogate restricts the constituents appearing as the first argument to those whose underlying<br />
concepts can actually partake in an interrogation (Jurafsky and Martin 2000, p. 512, slightly<br />
modified).<br />
More nuanced intuitions of selectional restrictions can be obtained by combining the knowledge<br />
of distribution in context and that of semantic restrictions placed on arguments by the predicate.<br />
This thesis applies a practical approach in order to find properties of the selectional restrictions<br />
of predicates within a limited thematic domain. Without aiming at formulating a comprehensive<br />
list of the selectional restrictions that apply within the domain in question, it is possible to obtain<br />
a list of examples that illustrate certain properties of the selectional restrictions. This is an<br />
extensional approach; by examining the structure of a set of arguments that all occur in the same<br />
contextual environment, for example as the first argument of a certain predicate, it is possible to<br />
draw certain conclusions about the selectional restrictions placed by the predicate. The aim of<br />
this project is not to define the selectional restrictions of the predicates in the data set, but rather<br />
to collect a list of examples of valid restrictions for the domain and examine them. It is obvious<br />
that selectional restrictions also vary across thematic domains; the permitted first<br />
arguments of a predicate will be very different in a formal text and in a fairy tale for children.<br />
This is again the intuition outlined in the first section of this chapter: words are used in different<br />
ways depending on the thematic domain they occur in. The distribution that classes of<br />
semantically similar arguments show within a limited domain may therefore very well be seen<br />
as a type of selectional constraint. To exemplify this, consider the domain used in the present<br />
work: newspaper texts concerning a criminal case. Of the constructed phrases in<br />
(2-17), the first two are valid for the domain in the sense that they exemplify structures which<br />
are found in the data set, while the third violates the selectional constraints assigned by<br />
the verb within this particular thematic domain. In the event of a killing within the domain in<br />
question, a perpetrator or a man can be expected to hold the thematic role of actor, but the<br />
data material contains no instance of a student initiating this action.<br />
(2- 17)<br />
gjerningsmannen drepte kvinnen<br />
the perpetrator killed the woman<br />
mannen drepte kvinnen<br />
the man killed the woman<br />
studenten drepte kvinnen<br />
the student killed the woman<br />
The example above illustrates how context is used within the present work to formulate a notion<br />
of selectional restrictions. These can later be used to say something about which argument can<br />
be expected to feature in a specific contextual environment, and thus function as a type of real-world<br />
knowledge for the domain of the text collection. With specific reference to anaphora<br />
resolution, these selectional restrictions can be used to give an indication of the most likely<br />
antecedent for an anaphor which would normally require access to real-world knowledge in<br />
order to be resolved.<br />
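How such domain-derived restrictions might act as a referent-guessing helper can be sketched as follows: among the candidate antecedents of a pronoun, prefer the one most often observed in the argument slot the pronoun occupies. The EPAS list, function names and candidate sets are illustrative assumptions, not the thesis’s actual data or system.<br />

```python
from collections import Counter

# Invented (predicate, argument1, argument2) triples for the crime
# domain; None marks a missing argument.
epas = [("kill", "perpetrator", "woman"),
        ("kill", "man", "woman"),
        ("question", "police", "witness"),
        ("arrest", "police", "man")]

def rank_candidates(predicate, slot, candidates):
    """Order candidate antecedents by how often they fill `slot`
    (1 or 2) of `predicate` in the observed EPAS."""
    counts = Counter()
    for pred, a1, a2 in epas:
        if pred == predicate:
            filler = a1 if slot == 1 else a2
            if filler is not None:
                counts[filler] += 1
    return sorted(candidates, key=lambda c: -counts[c])

# "He killed the woman": which candidate is "he" more likely to be?
print(rank_candidates("kill", 1, ["student", "man"]))
# → ['man', 'student']
```

This mirrors the intuition of (2-17): a man has been seen as the first argument of kill in the domain, a student has not, so the man is the preferred antecedent.<br />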
3 From text to EPAS – the extraction method<br />
This chapter describes the extraction method used in this project. The method extracts EPAS<br />
(elementary predicate-argument structures) from a text corpus consisting of newspaper texts<br />
collected from the internet.<br />
3.1 Selecting the texts<br />
Specifying the requirements for a suitable text collection is not as trivial as it may seem. To<br />
make sure that the extracted EPAS would produce semantically valid results when classified, the<br />
texts from which the structures were extracted had to fulfil certain requirements. Since the<br />
classification builds on the distributional hypothesis and relies on EPAS whose distribution is<br />
particular to a restricted domain, the most important specification for the texts was initially that<br />
they all had to belong to the same thematic domain. As such, the main focus of the requirements<br />
specification for the text collection was that of one closed thematic domain. But how exactly<br />
does one define the notion of a thematic domain? The first test set collected for the project<br />
consisted of factual prose texts dealing with roughly the same field. These texts, however,<br />
proved to be quite unsuitable for the later analysis, for reasons that will be explained in the<br />
following.<br />
It is clear that certain specifications must be fulfilled in the text collection from which the EPAS<br />
are derived. Texts displaying longer discourse chains are most suitable for the purpose of this<br />
project. One thematic domain must be described over several paragraphs, or preferably over the<br />
entire course of the discourse in the text. In order to extract the desired information from the<br />
texts and subsequently test if useful information has been extracted, the presence of anaphora, or<br />
referring expressions, in the text is needed. This entails the need for pronouns in particular. As<br />
such, texts containing discourse with a certain amount of concrete content were particularly<br />
useful for my purpose.<br />
Texts that are too vague, both with regard to their textual content and to their membership of a<br />
particular thematic category, were not suitable for the purpose of this project because they do<br />
not contain the type of theme-specific selectional constraints we are interested in extracting. One<br />
recurring problem in the text collections tested for the project was that too little of the<br />
information was expressed in running text, with much of it presented in bullet lists, tables and<br />
other similar constructions. This information was only accessible after manual editing of the<br />
texts, and even then it was often not useful, since precisely the desirable discourse chains are<br />
avoided by this type of textual shorthand. The information present in bullet<br />
lists and tables in the unedited text is most often not formulated in well-formed sentences, and<br />
the use of referring expressions and pronouns is usually avoided. Such texts are also not<br />
immediately suitable for parsing, making it complicated to extract EPAS (semi-)<br />
automatically.<br />
As mentioned above, selecting the texts to be analysed and creating the text collection to be the<br />
basis of the classifications in the project was a task not to be underestimated. Several<br />
different types of texts were experimented with in an attempt to find a text type that satisfied<br />
the following criteria, in addition to being available for collection on the internet:<br />
• Limited and naturally confined thematic domain<br />
• Relatively long chains of discourse<br />
• Fairly high occurrence of anaphora, pronouns in particular<br />
• Several paragraphs where the same phenomenon is discussed<br />
• Low occurrence of tables and illustrations; ideally, all the information in the texts should<br />
be expressed in complete and grammatical sentences<br />
The text type that fulfilled these criteria to the highest degree was news text. By picking<br />
newspaper articles that all concerned the same theme, the criterion of a limited domain was<br />
satisfied. The articles, as provided on the internet, additionally fulfilled all the other<br />
requirements set for the text collection. For this project, articles concerning a<br />
criminal case in the small town of Førde on the west coast of Norway were chosen, mainly because<br />
this was a very big case in the Norwegian newspapers and a large number of articles were<br />
written on the subject. The articles were selected from the newspaper Verdens Gang (VG) in<br />
June and July 2004.<br />
3.2 Predicate-argument structures<br />
"Not the same thing a bit," said the Hatter. "Why, you might as well say that 'I see what I eat' is<br />
the same thing as 'I eat what I see'." from Alice in Wonderland by Lewis Carroll.<br />
For the purposes of the subsequent classification phase, a meaning representation that would not<br />
allow for ambiguity or vagueness was desirable. Using the term EPAS, rather than referring to<br />
the verb and its subject and object, contributes to normalising and generalising the data. The<br />
motivation for choosing elementary predicate-argument structures, or EPAS, as the<br />
representation of the meaning structures in the text collection will be explained in the following.<br />
By choosing EPAS as meaning representation, the focus of the structure is the verbal predicate.<br />
Instead of structuring the semantic representations extracted from the texts according to the<br />
grammatical roles and the formal function each word holds in the sentence, we look at how the<br />
verbal predicate combines with arguments. This is closely related to the idea of thematic roles,<br />
where the focus is on which roles the entities in a sentence occupy. It is suggested that “verbs<br />
must have their thematic role requirements listed in the lexicon” (Saeed 1997, p. 140) and as<br />
such that each verb has a predetermined set of possible argument frames. Thematic roles cover<br />
the various roles the entities in a sentence can occupy. Using<br />
Saeed’s hierarchy of thematic roles, the agent is the initiator of action, while the patient and the<br />
theme are the entities an action is performed on. For Norwegian and English, there is a tendency<br />
for subjects to be agents and direct objects to be patients and themes (Saeed 1997, p. 145). This<br />
tendency can be altered by the speaker as a result of stylistic choice or desire to alter the<br />
information structure, for example by using passive verbal voice. The assignment of thematic<br />
roles to particular positions in a sentence is closely connected to the hierarchical structure of the<br />
thematic roles. There is a hierarchy of defined thematic roles for each sentence position; the<br />
hierarchy in (3-1) exemplifies the preferred order of roles in subject position (Saeed 1997, p.<br />
146).<br />
(3- 1)<br />
agent > recipient/benefactive > theme/patient > instrument > location<br />
The structuring of a semantic representation into predicates with associated arguments does<br />
not, however, express exactly the same information as the assignment of thematic roles does.<br />
When using predicate-argument structures, the definitions of argument 1 and argument 2<br />
presuppose the existence of an underlying semantic hierarchy which defines the roles of agent<br />
and patient. For example, argument 1 can be defined as always representing the agent of the<br />
sentence. An important distinction between the predicate-argument paradigm and that of<br />
thematic roles is that a classification using agent/patient as the core of the semantic<br />
representation does not focus on the predicate and its associated arguments. Conversely, in a<br />
predicate-argument classification, the definition of the individual arguments does not directly<br />
consider the thematic roles. Since a semantic hierarchy has more finely defined roles than can<br />
be expressed with argument positions, different instances of argument 1 will not have exactly<br />
the same semantic role; the role depends on the predicate they co-occur with.<br />
For the purpose of processing the extracted structures, it is useful that the structures are in a<br />
simplified form, and also that structures which convey the same semantic information but are<br />
expressed differently in the syntax are represented with the same structure. This is in<br />
alignment with the doctrine of canonical form, which states the usefulness of letting linguistic<br />
constructions which display the same meaning content give rise to the same meaning<br />
representation (Jurafsky and Martin 2000, p. 507). By using a normalised form of<br />
representation, such as EPAS, for the structures extracted from the text, as well as founding the<br />
extraction method on semantic representations of the analysed texts, the generation of a<br />
generalisable data set is achieved. Active and passive constructions with equivalent semantic<br />
meaning will be treated in the same way and will receive identical meaning representations. This<br />
can be seen in the following example:<br />
(3- 2)<br />
a. Morderen drepte kvinnen.<br />
The murderer killed the woman.<br />
b. Kvinnen ble drept av morderen.<br />
The woman was killed by the murderer.<br />
c. Kvinnen ble drept.<br />
The woman was killed.<br />
Sentences (3-2a) and (3-2b) in essence convey the same information and differ only in<br />
verbal voice. The use of diathesis alternations of active and passive voice gives the<br />
speaker flexibility with regard to the relationship between grammatical structure and thematic<br />
roles. The use of passive versus active voice does not really alter the semantic content of a<br />
sentence, but represents a difference in its information structure. Differences of<br />
this kind are not relevant for the present purposes, and will not be reflected in the extracted<br />
structures. The word kvinnen (the woman) has different syntactic roles in the active sentence in<br />
(3-2a) and the passive sentence in (3-2b), but has the same thematic role. Regardless of the fact<br />
that the phrase kvinnen (the woman) in (3-2b) represents a subject, while it represents an object<br />
in sentence (3-2a), both expressions have the thematic role of patient, the entity acted upon.<br />
For the purposes of the present work, both sentences will be represented by the single EPAS<br />
shown in (3-3):<br />
shown in (3-3):<br />
(3- 3)<br />
Predicate   Argument 1   Argument 2<br />
drepe       morder       kvinne<br />
(kill)      (murderer)   (woman)<br />
Sentence (3-2c) is in passive voice and the subject from the active-voice sentence (3-2a) is not<br />
present. The formal subject of the sentence, “woman”, is logically a patient, and refers to the<br />
entity on which the activity of killing is performed. As such, this sentence will be represented by<br />
an EPAS which lacks its argument 1:<br />
(3- 4)<br />
Predicate   Argument 1   Argument 2<br />
drepe       ?            kvinne<br />
(kill)                   (woman)<br />
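The normalisation over verbal voice illustrated in (3-2) to (3-4) can be sketched as a single mapping step. The flat clause representation (verb, subject, object-or-agent, voice) is a simplification assumed here for illustration; the thesis derives the structures from full syntactic parses instead.<br />

```python
def to_epas(verb, subject, obj, voice):
    """Return (predicate, argument1, argument2), with argument1 the
    logical agent regardless of verbal voice."""
    if voice == "active":
        return (verb, subject, obj)  # the subject is the agent
    # passive: the subject is the patient; the by-phrase agent (if any)
    # is passed in as obj and becomes argument 1, or None if absent
    return (verb, obj, subject)

# (3-2a), (3-2b) and (3-2c) all map onto the structures in (3-3)/(3-4):
assert to_epas("drepe", "morder", "kvinne", "active") == ("drepe", "morder", "kvinne")
assert to_epas("drepe", "kvinne", "morder", "passive") == ("drepe", "morder", "kvinne")
assert to_epas("drepe", "kvinne", None, "passive") == ("drepe", None, "kvinne")
```

The active and the full passive sentence thus receive one and the same EPAS, realising the canonical form discussed above.<br />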
Extraction methods that are not based on a syntactic parse of the original texts do not have<br />
access to the semantic relations within a sentence. This means that such methods must rely on more<br />
superficial structures, such as part-of-speech tags, and will not have the same degree of accuracy,<br />
or finesse, in the actual extraction of the meaning structures. Since the present work aims at<br />
providing results that can be useful as part of an anaphora resolution system, it is particularly<br />
important that the results obtained can be generalised as much as possible. Especially since the<br />
individual elements of the extracted structures will be used in subsequent processes, it is<br />
highly important that they do not contain errors or irregularities resulting from the extraction<br />
process. To be as useful as possible, the meaning structures should be normalised and<br />
generalisable.<br />
The examples above show how normalisation through the use of EPAS realises the concept of<br />
canonical form to some degree and seems particularly useful for the purpose of the present<br />
work. If grammatical relations such as subject and object were used as reference points, semantically<br />
equivalent sentences, such as (3-2a) and (3-2b), would be given different meaning structures due<br />
to the difference in verbal voice. Structuring the meanings conveyed by the sentences in (3-2)<br />
within a grammatical-relations paradigm would make it necessary to mark the verbal voice as<br />
well as the grammatical relations. In addition, active and passive structures would have to be<br />
treated differently in the subsequent analysis. Basing the extraction merely on syntactic<br />
properties of the sentences in the corpus would make the extracted material very difficult to<br />
classify, mainly because similar meanings would be represented differently.<br />
The advantages of a normalised and generalisable data set are further clarified by the following<br />
example. Upon a simple grammatical analysis, the sentences shown in (3-2) can be categorised<br />
based on the syntactic roles predicate, subject and object. The result of such a classification is<br />
shown in examples (3-5) and (3-6):<br />
(3- 5)<br />
   predicate   subject    object<br />
a. drepe       morder     kvinne<br />
   kill        murderer   woman<br />
b. drepe       kvinne     morder<br />
   kill        woman      murderer<br />
c. drepe       kvinne     ?<br />
   kill        woman<br />
The structures in (3-5) above can be extracted after part-of-speech tagging of the sentences in<br />
(3-2). The active and passive predicate receive the same structure, and as no semantic<br />
information is available, the arguments are structured in accordance with their status as<br />
subject or object. Attempting to classify these subjects and objects based on their co-occurrence<br />
with the predicate produces groupings of words which are not directly generalisable. Murderer<br />
and woman occur together both in subject and object position, which does not reflect the preferred<br />
selectional constraints within the domain.<br />
(3- 6)<br />
   predicate   subject   object<br />
a. drepe       morder    kvinne<br />
   kill        murderer  woman<br />
b. drepes      kvinne    morder<br />
   is-killed   woman     murderer<br />
c. drepes      kvinne    ?<br />
   is-killed   woman<br />
Example (3-6) provides a more elegant structuring. Because an extraction method based on<br />
syntactic relations is unable to generalise over verbal voice, two separate predicates are<br />
extracted, one for the passive and one for the active voice. Even though, logically, the same<br />
action is performed on the entity the woman in all the sentences in (3-6), a method as outlined<br />
above would not allow for a straightforward interpretation of this. The generalisation between<br />
active and passive versions of the same sentence is lost in such an approach. This results in<br />
a higher number of predicates, and therefore in a less generalisable data material. Results as<br />
outlined above would likely also be of less use as a referent-guessing helper in an anaphora<br />
resolution system, precisely because of the lower level of generalisability.<br />
3.2.1 What is represented in the EPAS?<br />
Jurafsky and Martin (2000, p. 510) state that all languages have predicate-argument structures at<br />
the core of their semantic structure. They further describe that the grammar organises the<br />
predicate-argument structure and selectional constraints restrict how other words and phrases<br />
can combine with a given word. In this project, a simplified version of predicate-argument<br />
structures is used as meaning representation. The EPAS, or meaning representations, are limited<br />
to at most two nominal arguments. Either of the arguments in an EPAS may<br />
be empty or unidentified. This means that the EPAS extracted from my texts will belong to one of<br />
the following three patterns:<br />
(3- 7)<br />
a. predicate, argument 1, argument 2<br />
b. predicate, argument 1, ?<br />
c. predicate, ?, argument 2<br />
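The three patterns in (3-7) can be captured by a minimal data structure with optional argument slots. The class name, field names and pattern method below are assumptions introduced for illustration, not part of the thesis’s implementation.<br />

```python
from typing import NamedTuple, Optional

class EPAS(NamedTuple):
    """A predicate with at most two nominal arguments; either argument
    slot may be empty (None), as in patterns (3-7b) and (3-7c)."""
    predicate: str
    arg1: Optional[str] = None
    arg2: Optional[str] = None

    def pattern(self) -> str:
        """Which of the three patterns in (3-7) this structure matches."""
        if self.arg1 and self.arg2:
            return "a"
        return "b" if self.arg1 else "c"

assert EPAS("drepe", "morder", "kvinne").pattern() == "a"
assert EPAS("opplyse", "vitne").pattern() == "b"
assert EPAS("drepe", arg2="kvinne").pattern() == "c"
```

Representing the structures as tuples in this way also makes the later association step straightforward, since arguments can be grouped by the (predicate, slot) pair they occur in.<br />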
The reason for letting the EPAS consist of a maximum of two arguments is not primarily a<br />
principled decision, but rather emerged from the empirical material from which the data<br />
structures were collected. When extracting EPAS from the data collection, the resulting<br />
structures consisted of a predicate with maximally two arguments. It is probable that predicates<br />
with more than two arguments are generally less frequent, and since my data collection is quite<br />
small, such occurrences do not feature in it.<br />
Only nominal arguments are featured in the EPAS, entailing that sentences with a nominal<br />
clause as object will be extracted as an EPAS lacking argument 2. This is clarified by the<br />
examples below:<br />
(3- 8)<br />
Et vitne opplyste at hun hadde hørt høye rop.<br />
A witness informed that she had heard loud screams.<br />
The sentence in example (3-8) above will yield the following three EPAS:<br />
(3- 9)<br />
a. høre, vitne, rop<br />
hear, witness, scream<br />
b. høy, rop, ?<br />
loud, scream, ?<br />
c. opplyse, vitne, ?<br />
inform, witness, ?<br />
(3-9c) does not display an argument 2 despite the fact that the original sentence has a nominal<br />
clause as object. The main reason for this choice of representation is that the subsequent<br />
classifying phase aims at creating classes of nominal arguments, based on the verbs they co-<br />
40
occur with. Arguments which are unlikely to represent relevant and interesting information for<br />
our classification are therefore omitted. In cases where a verbal predicate takes a nominal clause<br />
or a sentence as its argument, it is unlikely that the predicate selects the argument based on (for<br />
us) semantically interesting selectional restrictions. A verb can restrict its selection of an<br />
argument, which makes it possible to say something about that argument based on the<br />
environment it occurs in. The same restrictions cannot be expected to apply when the verb takes<br />
a sentence as its argument. As a consequence, the meaning<br />
representations are limited to dealing with arguments which can be represented by single<br />
symbols and which do not refer to clauses or sentences.<br />
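The three patterns in (3-7) and the structures in (3-9) can be illustrated as simple triples, with the empty slot marked explicitly. The sketch below is a Python illustration of the representation, not the actual data format used in the project:<br />

```python
# An EPAS as a (predicate, argument 1, argument 2) triple;
# None marks an empty/unidentified argument slot.
epas = [
    ("høre", "vitne", "rop"),     # hear, witness, scream
    ("høy", "rop", None),         # loud, scream, ?
    ("opplyse", "vitne", None),   # inform, witness, ?
]

def pattern(e):
    """Classify an EPAS as pattern a, b or c from (3-7)."""
    _, a1, a2 = e
    if a1 is not None and a2 is not None:
        return "a"   # predicate, argument 1, argument 2
    if a1 is not None:
        return "b"   # predicate, argument 1, ?
    return "c"       # predicate, ?, argument 2

print([pattern(e) for e in epas])  # prints ['a', 'b', 'b']
```

Marking the empty slot explicitly keeps all EPAS uniform, so the same matching procedures can be applied regardless of which arguments are present.<br />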
In order to extract all EPAS present in the texts, it will also be necessary to extract non-verbal<br />
predicates. These will generally speaking correspond to adjective-noun combinations in the text.<br />
This way, phrases of the type “the statement is important” or “the important statement” will both<br />
produce the following EPAS:<br />
(3- 10)<br />
Predicate Argument 1 Argument 2<br />
important statement ?<br />
The process of extracting the EPAS is a challenging part of the project, especially since the<br />
available tools for Norwegian are not robust enough to make this a trivial and straightforward<br />
task. Regardless of which method is used to extract the EPAS, it is evident that a large part of the<br />
work on this project must be dedicated to developing a suitable method for their extraction. For<br />
several reasons, it is desirable to develop an extraction method that is as<br />
automatic as possible. Most importantly, such a method saves a lot of time, but another<br />
important aspect is that more manual extraction methods could easily become subjective and<br />
less systematically consistent. The next two sections will discuss the task of extracting the EPAS in<br />
more detail.<br />
3.3 Parsing with NorGram<br />
To be able to extract the EPAS from the text in a semi-automatic fashion, some sort of linguistic<br />
analysis of the texts is needed. One problem with working on a small language like Norwegian<br />
is that the linguistic tools one might need in the process are simply not fully developed yet. Velldal<br />
(2003) describes a project where a set of Norwegian nouns are grouped into semantic classes<br />
based on their distribution over a large body of text. A word’s distribution in different contexts<br />
is represented as a feature vector in a semantic space model. In his project, Velldal notes that no<br />
syntactic parser exists for Norwegian; instead, he uses a shallow processing tool on a tagged<br />
corpus. The<br />
processing tool “translates” the tagged structures into predicate-argument structures, overcoming<br />
the need for a parser by only analysing those parts of the text relevant for the extraction of the<br />
needed structures. As has been explained in section 3.2, an extraction method that is based on<br />
surface structures and does not take semantic relations into account might produce results that<br />
are unsuitable both for subsequent use in anaphora resolution and for generalisation of concepts.<br />
In view of this, the present work has aimed at developing an extraction method that uses parsed<br />
text to collect the meaning structures from the text.<br />
Although it is true that there does not exist any parser that fully covers the Norwegian language<br />
at the moment, there are a few alternative parsers available. Even if these grammars are not<br />
robust enough to return parses on randomly chosen texts, they can be used for the<br />
experiments outlined in this project. The extraction method described in this thesis employs<br />
one of the existing parsing tools for Norwegian bokmål, NorGram (NorGram 2004).<br />
Since there are no easy-to-use automated tools available for use in the extraction process,<br />
obtaining the EPAS from the text involved a substantial amount of manual work, even when<br />
using a parser to automate the extraction. Parsing the texts was definitely of value, though, since<br />
once the texts were parsed and there was a syntactic analysis to work on, the EPAS could more<br />
readily be extracted. Because of the modular nature of the extraction method, the extraction<br />
process is not parser-dependent. Should a new and more robust grammar become available, the<br />
extraction method can be modified to accommodate this. The next section of this chapter briefly<br />
describes how the NorGram/XLE parser was used in the project, while section 3.3.2 describes in<br />
greater detail how the EPAS were extracted from the parser’s output.<br />
3.3.1 NorGram in outline<br />
Norsk komputasjonell grammatikk (NorGram) is a computational grammar for Norwegian<br />
bokmål. NorGram is based on the unification-based grammar formalism Lexical Functional<br />
Grammar (LFG), where language is described by means of feature structures that can be<br />
combined in the process of unification. Researchers involved in the NorGram project cooperate<br />
with researchers at Palo Alto Research Center (PARC), formerly Xerox PARC, who have<br />
developed a well-functioning platform for the development of large-scale computational<br />
grammars. This system is called Xerox Linguistic Environment (XLE) and uses LFG as its<br />
theoretical linguistic framework. As such, NorGram can be said to be an LFG grammar for<br />
Norwegian, while XLE is an implementation of the LFG formalism.<br />
The NorGram grammar combined with an XLE-module is a relatively broad parser that can<br />
analyse most structures found in Norwegian. It was chosen for the purposes of this project<br />
because it was likely to return successful parse trees of a large part of the sentences found in the<br />
text collections. NorGram’s lexicon is quite large and includes entries of most regular<br />
Norwegian words. One problem with the lexicon with regard to the text collections used for<br />
this project is that it contains relatively few compounds. All theme-specific texts feature a<br />
theme-specific vocabulary, sometimes with words (especially compound nouns) that cannot be<br />
expected to be found in ordinary dictionaries. This was also the case for the text collection in<br />
this project. Compound nouns represented the largest group of words added to the lexicon. In<br />
Norwegian, one is fairly free to form compounds consisting of words that can also exist<br />
individually with their own meaning. Whereas in English such compounds are written as<br />
two separate words, for example police investigator, in Norwegian they together form a new<br />
noun, for example politietterforsker (police investigator). This opens up a potentially<br />
infinite class of nouns and makes it virtually impossible to include all possible words in any<br />
lexicon.<br />
The NorGram lexicon was extended in order to be used as a tool to extract the EPAS from the<br />
text collection. Compounds and proper nouns that were part of sentences to be analysed were<br />
added to the lexicon files. To ensure that all EPAS could successfully be extracted, all sentences<br />
that were not parsed were examined to identify the word that represented the problem.<br />
Subsequently, that word was added to the lexicon. A more elegant way to solve the compound<br />
issue would be to make use of a module that splits compounds into the individual words they<br />
consist of, or to make use of a component that predicts the part of speech of an unknown word.<br />
One solution that would have been suitable for the purposes of the present work would be to<br />
assume that all unknown words were nouns. However, due to the small size of the corpus used<br />
for this project, none of these strategies were implemented, and unknown words were added to<br />
the lexicon by hand.<br />
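The unknown-words-as-nouns strategy mentioned above could be sketched roughly as follows; the lexicon and token list are made-up illustrations, not actual NorGram lexicon entries:<br />

```python
# Sketch of the "treat unknown words as nouns" fallback discussed above.
# The lexicon and tokens are invented illustrations, not NorGram data.
lexicon = {"politi": "noun", "lete": "verb", "etter": "prep"}

def guess_entries(tokens, lexicon):
    """Propose a noun entry for every token missing from the lexicon."""
    return {t: "noun" for t in tokens if t not in lexicon}

tokens = ["politietterforsker", "lete", "etter", "morder"]
new_entries = guess_entries(tokens, lexicon)
print(new_entries)  # {'politietterforsker': 'noun', 'morder': 'noun'}
```

Such a fallback would over-generate for unknown verbs and adjectives, which is why it was left unimplemented here and unknown words were added by hand instead.<br />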
When parsing texts with NorGram and XLE, the user has several choices with regard to the<br />
format of the final syntactic analysis. For example, it is possible to receive partial parses, or to<br />
let the system return all the potential analyses of the input sentence. For the purposes of this<br />
project, I received full parses of each sentence in the text material. In each instance where the<br />
system returned multiple valid parses, I manually checked the alternatives and decided on the<br />
correct one to extract the EPAS from.<br />
3.3.2 Extracting EPAS from NorGram<br />
The output provided by XLE upon a successful parse using the NorGram grammar is<br />
particularly useful for a subsequent extraction of EPAS. NorGram is based on the LFG grammar<br />
formalism and produces constituent structures (c-structures), functional structures (f-structures)<br />
and minimal recursion semantics structures (MRS-structures) upon parsing a sentence. Each of<br />
these outputs can be useful for a subsequent extraction of predicates and their arguments.<br />
The c-structure in LFG is an external structure which displays an ordered representation of the<br />
words in a sentence or phrase (Bresnan 2001, p. 44). In XLE, the c-structure is represented by a<br />
phrase structure tree, where the terminal nodes are fully inflected word forms. F-structures<br />
represent the internal structure of a sentence. On this level, the “syntactic functions are<br />
associated with semantic predicate argument relations” (Bresnan 2001, p. 45). C-structures and<br />
f-structures are different structures, but display parallel information. Figure 3 below shows the<br />
graphical representation of the c- and f-structures for the sentence Politiet leter etter morderen<br />
(The police are looking for the murderer) generated by NorGram.<br />
Figure 3<br />
The most useful structure for the purpose of extracting EPAS from the parse output is the MRS<br />
structure. In comparison to the c- and f-structures, which are more syntactically motivated, the<br />
MRS structure displays the semantic structure within a sentence. In the next section, this<br />
structure is described in greater detail.<br />
3.3.2.1 Minimal Recursion Semantics<br />
Minimal Recursion Semantics (MRS), developed by Copestake et al. (2003), is a framework for<br />
computational semantics, providing a meta-language for describing semantic structures. The<br />
concept of MRS is primarily semantically motivated and aims at preserving the semantic<br />
structures in the input sentence. MRS allows for expressive adequacy, ensuring that the<br />
linguistic meanings conveyed by a sentence are expressed correctly in the semantic structure.<br />
The primary unit within the framework of MRS is the elementary predication (EP). An EP is a<br />
single relation with associated arguments and will generally speaking correspond to a lexeme<br />
with its argument roles filled. Since MRS provides a “flat” representation where the EPs are<br />
never nested within each other, semantically irrelevant implicit information about the syntactic<br />
structure of a phrase is avoided. The simple principle is that each EP has a “handle” which<br />
identifies it as belonging to a particular tree node and argument positions in EPs can be filled<br />
with handles which correspond to the EPs that belong immediately under it in the tree structure.<br />
More than one EP with the same handle entails that the EPs are conjoined and on the same node<br />
in the structure. Tree structure in this sense does not refer to the c-structure, but to an abstract<br />
structure which shows the hierarchical representation of the EPs.<br />
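The handle mechanism can be illustrated with a small sketch: EPs sharing a handle sit on the same node and are conjoined, as is the case for a verb and a preposition it selects. The handle names and relations below are invented for illustration and simplify the MRS formalism considerably:<br />

```python
from collections import defaultdict

# Each EP: (handle, relation, args). EPs sharing a handle are conjoined;
# an argument slot filled with a handle points at the EPs under that node.
# Handles and relations here are invented; this is not full MRS.
eps = [
    ("h1", "lete", {"ARG1": "x1", "ARG2": "x2"}),
    ("h1", "etter", {"ARG1": "e1"}),   # same handle: conjoined with 'lete'
    ("h2", "politi", {"ARG0": "x1"}),
    ("h3", "morder", {"ARG0": "x2"}),
]

by_handle = defaultdict(list)
for handle, relation, args in eps:
    by_handle[handle].append(relation)

print(by_handle["h1"])  # ['lete', 'etter'] -- conjoined EPs on one node
```

Grouping by handle recovers the abstract tree nodes: the verb and its selected preposition end up on one node, while each nominal EP occupies its own.<br />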
MRS is implemented in NorGram and the MRS structures provided there are the most<br />
convenient of the output structures from the point of view of EPAS extraction. A sentence or<br />
phrase can contain more than one EPAS, and all predicates with their associated arguments are<br />
displayed in the MRS representation. The MRS structures NorGram displays following the<br />
successful parse of a sentence contain all the information needed to extract the EPAS. However,<br />
because of the manner in which they are displayed in the XLE graphical interface, it is not<br />
straightforward for a human to see which arguments belong where and thus identify the EPAS<br />
directly. Only by tracing each individual EP and finding corresponding values<br />
in other EPs, is it possible to extract the sentence’s EPAS. Figure 4 below shows the graphical<br />
representation of the MRS structure for the sentence Politiet leter etter morderen (The police are<br />
looking for the murderer).<br />
Figure 4<br />
3.4 Altering the source<br />
As already mentioned, parsing randomly selected Norwegian texts is not an entirely<br />
straightforward task. Although NorGram provides a quite broad grammar, not all linguistic<br />
constructions are parsed and, more importantly, not all words are covered in the lexicon. Ideally,<br />
it would be desirable to collect a limited domain treebank consisting of parsed sentences of the<br />
original texts as I found them on the internet. In practice, this was not a feasible task. It became<br />
evident early on that the texts to be analysed would have to be simplified for practical reasons.<br />
For the purpose of classification, I needed to extract the EPAS present in the texts. The other<br />
information included in each sentence was not essential or necessary for the project.<br />
Although I was aware that it would be more scientifically sound, and in every respect better, to<br />
extract the EPAS from original texts that I had not tampered with, this was not possible within<br />
the framework of this thesis. Given that I would have to simplify the texts in any case, I decided<br />
to cut most information that was irrelevant for the extraction of the (most central) EPAS. This<br />
process was performed on all sentences in the text collection. Mainly adverbial phrases were<br />
excluded, on the basis that they would not be included in the extracted EPAS in any case. The<br />
example in (3-11) below illustrates a typical case:<br />
(3- 11)<br />
a. Original sentence:<br />
Etter at hun ble funnet opplyste et vitne at hun hadde hørt<br />
høye rop om hjelp fra stedet tidlig søndag morgen.<br />
After she was found a witness informed that she had heard loud screams for<br />
help from the area early Sunday morning.<br />
b. Simplified form:<br />
Et vitne opplyste at hun hadde hørt høye rop.<br />
A witness informed that she had heard loud screams.<br />
c. Extracted structures:<br />
høre,vitne,rop<br />
høy,rop,?<br />
opplyse,vitne,?<br />
hear, witness, scream<br />
loud, scream,?<br />
inform, witness,?<br />
The pre-editing of the text collection will naturally have affected the resulting EPAS list. Not all<br />
structures from the original texts are extracted, and as a consequence the EPAS list does not<br />
include all relevant context patterns for the domain. Still, for the purposes of a pilot study such<br />
as this thesis, the central structures, which display the most typical context patterns for the<br />
domain, include enough information to give an indication of the usefulness of the method. For<br />
the purpose of subsequent analyses, the extraction process can easily be performed on unedited<br />
original texts.<br />
3.5 Finding the words<br />
The process of extracting meaning structures such as the EPAS from the texts in the text<br />
collection is a substantial undertaking. It is also quite a tedious task, and since tedious tasks tend<br />
to benefit from being automated I wrote the Perl script Ekstraktor, which interprets the MRS-<br />
structures of a sentence and thereby puts together the EPAS for each parsed sentence. This<br />
section describes the outline of the automated extraction process.<br />
XLE provides the user with the choice of several output formats, including a graphical user<br />
interface that displays a tree graph of the parse as well as its F-structure and MRS-structure.<br />
Optionally, the output can also be viewed as a file of Prolog predicates. In the process of extracting<br />
the EPAS, Ekstraktor reads the Prolog output, saves relevant information in a system of arrays,<br />
and subsequently performs several tests and actions on the stored information in order to present<br />
a list of all EPAS found in the parsed sentence.<br />
The MRS structures as represented in the Prolog output provide all the information needed to<br />
extract the EPAS. Initially, the main EP with its associated arguments must be found. Since, for<br />
the purposes of this thesis, the linguistic structures analysed are limited to full sentences, the main EP<br />
must display the category ‘v’ for verb. Once the main EP is identified, the semantic values for it<br />
and for its associated arguments must be found. Subsequently, all the remaining predicate-argument<br />
structures must be found. For these, there is no restriction as to which category they<br />
have. Consider the sentence shown in (3-12) together with an extract of the Prolog output of the<br />
parse shown in (3-13):<br />
(3- 12)<br />
(3- 13)<br />
Politiet leter etter morderen<br />
The police are looking for the murderer<br />
cf(1,eq(attr(var(19),'ARG0'),var(20))),<br />
cf(1,eq(attr(var(19),'ARG1'),var(21))),<br />
cf(1,eq(attr(var(19),'ARG2'),var(22))),<br />
cf(1,eq(attr(var(19),'LBL'),var(10))),<br />
cf(1,eq(attr(var(19),'LNK'),14)),<br />
cf(1,eq(attr(var(19),'_CAT'),'p')),<br />
cf(1,eq(attr(var(19),'_CATSUFF'),'sel')),<br />
cf(1,eq(attr(var(19),'relation'),semform('etter',15,[],[]))),<br />
cf(1,eq(attr(var(20),'type'),'event')),<br />
cf(1,eq(attr(var(21),'PERF'),'-')),<br />
cf(1,eq(attr(var(21),'TENSE'),'pres')),<br />
cf(1,eq(attr(var(21),'type'),'event')),<br />
cf(1,eq(attr(var(22),'NUM'),'sg')),<br />
cf(1,eq(attr(var(22),'PERS'),'3')),<br />
cf(1,eq(attr(var(22),'type'),'ref-ind')),<br />
cf(1,eq(attr(var(23),'ARG0'),var(21))),<br />
cf(1,eq(attr(var(23),'ARG1'),var(24))),<br />
cf(1,eq(attr(var(23),'ARG2'),var(22))),<br />
cf(1,eq(attr(var(23),'LBL'),var(10))),<br />
cf(1,eq(attr(var(23),'LNK'),10)),<br />
cf(1,eq(attr(var(23),'_CAT'),'v')),<br />
cf(1,eq(attr(var(23),'_PRT'),'etter')),<br />
cf(1,eq(attr(var(23),'relation'),semform('lete',11,[],[]))),<br />
cf(1,eq(attr(var(24),'NUM'),'sg')),<br />
cf(1,eq(attr(var(24),'PERS'),'3')),<br />
cf(1,eq(attr(var(24),'type'),'ref-ind')),<br />
cf(1,eq(attr(var(25),'ARG0'),var(22))),<br />
cf(1,eq(attr(var(25),'BODY'),var(26))),<br />
cf(1,eq(attr(var(25),'LBL'),var(27))),<br />
cf(1,eq(attr(var(25),'LNK'),18)),<br />
cf(1,eq(attr(var(25),'RSTR'),var(14))),<br />
cf(1,eq(attr(var(25),'relation'),semform('def',31,[],[]))),<br />
cf(1,eq(attr(var(26),'type'),'handle')),<br />
cf(1,eq(attr(var(27),'type'),'handle')),<br />
cf(1,eq(attr(var(28),'ARG0'),var(24))),<br />
cf(1,eq(attr(var(28),'BODY'),var(29))),<br />
cf(1,eq(attr(var(28),'LBL'),var(30))),<br />
cf(1,eq(attr(var(28),'LNK'),0)),<br />
cf(1,eq(attr(var(28),'RSTR'),var(17))),<br />
cf(1,eq(attr(var(28),'relation'),semform('def',9,[],[]))),<br />
cf(1,eq(attr(var(29),'type'),'handle')),<br />
cf(1,eq(attr(var(30),'type'),'handle')),<br />
cf(1,eq(attr(var(31),'ARG0'),var(22))),<br />
cf(1,eq(attr(var(31),'LBL'),var(13))),<br />
cf(1,eq(attr(var(31),'LNK'),18)),<br />
cf(1,eq(attr(var(31),'_CAT'),'n')),<br />
cf(1,eq(attr(var(31),'relation'),semform('morder',19,[],[]))),<br />
cf(1,eq(attr(var(32),'ARG0'),var(24))),<br />
cf(1,eq(attr(var(32),'LBL'),var(16))),<br />
cf(1,eq(attr(var(32),'LNK'),0)),<br />
cf(1,eq(attr(var(32),'_CAT'),'n')),<br />
cf(1,eq(attr(var(32),'relation'),semform('politi1',1,[],[]))),<br />
The Prolog code extract in (3-13) shows the MRS representation of the sentence in (3-12) by<br />
listing all the EPs in the sentence as well as the relationships that hold between the individual<br />
EPs. In simplified terms, the value of the attribute ‘semform’ holds the semantic form of the<br />
predicate, and the values of ‘ARG1’ and ‘ARG2’ point to the EPs where the semantic forms for<br />
argument 1 and argument 2 can be found. In order to extract all EPAS from such a Prolog file,<br />
one must go through all the EPs in turn, and find the semantic forms of each main EP and its<br />
associated argument 1 and argument 2. In the extraction process, this matching and tracing of<br />
values is performed by the script Ekstraktor.<br />
The algorithm behind Ekstraktor is divided into two more or less separate parts: information<br />
retrieval from the Prolog file and processing of the information that was found and stored. Perl<br />
was chosen as the programming language mainly because of its excellent pattern matching<br />
facilities. Perl offers a very powerful and flexible regular expression syntax which lets the<br />
programmer construct regular expressions that will handle all kinds of pattern matching. For the<br />
information retrieval part of Ekstraktor, it was desirable to go through an input file, check for<br />
various patterns and store parts of the input file relevant to how the patterns were matched. (3-<br />
14) shows one of the pattern checks in Ekstraktor – if the line read from the file contains the<br />
string:<br />
'relation'),semform(<br />
the entire line is stored in the array @semform.<br />
(3- 14)<br />
if ($linjeFraFil =~ m/'relation'\),semform\(/) {<br />
    push(@semform, $linjeFraFil);<br />
}<br />
By going through the input file line by line and checking for several patterns, all information<br />
relevant to extracting the EPAS is stored in a system of arrays. To be able to keep track of which<br />
EP the various values belong to, a system of two arrays for each argument type is used – one for<br />
EP number and one for argument value. The ARG0 arrays correspond to the predicates in the<br />
structures and for each, the semantic form can directly be found in the semform-array. The<br />
ARG1 and ARG2 arrays display a value that must be traced before the semantic form can be<br />
extracted. A simplified example of the argument arrays for the sentence in (3-12) is shown in<br />
(3-15):<br />
(3- 15)<br />
ARG0:<br />
EP VALUE<br />
23 21<br />
25 22<br />
28 24<br />
31 22<br />
32 24<br />
ARG1:<br />
EP VALUE<br />
19 21<br />
23 24<br />
ARG2:<br />
EP VALUE<br />
19 22<br />
23 22<br />
To find the EPAS for this sentence, the first EP in the ARG0-array is incorporated in a regular<br />
expression which then is used for pattern matching in the members of the semform-array. If<br />
there exists an entry which matches the pattern, that is, which has an EP-value identical to the<br />
first EP in the ARG0-array, the semantic form is retrieved. To find the belonging arguments 1<br />
and 2, the ARG1 and ARG2-arrays are searched for an EP identical to the one of the predicate.<br />
If such an EP is found, the corresponding value is retrieved – for ARG1 in our example that<br />
would be the value 24. To find the semantic form of this value, we must find the EP where this<br />
value is identical to the value of ARG0, that is, the ARG0-array must again be consulted. When<br />
the EP is found, the semform-array can be pattern matched and the semantic form can be<br />
retrieved. To retrace the example: following such a procedure, the sentence in example (3-16):<br />
(3- 16)<br />
Politiet leter etter morderen<br />
The police are looking for the murderer<br />
is represented with the following EPAS, extracted from the Prolog file of the parse:<br />
(3- 17)<br />
lete-etter,politi,morder<br />
look-for,police,murderer<br />
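The tracing procedure retraced above can be sketched as follows. This is a Python re-illustration of the matching Ekstraktor performs, using the simplified arrays from (3-15) together with the semantic forms and _CAT values from the Prolog file in (3-13); the actual implementation is the Perl script in Appendix B:<br />

```python
# Python re-illustration of the value tracing performed by Ekstraktor.
# Data transcribed from (3-15) and (3-13); not the Perl implementation itself.
semform = {19: "etter", 23: "lete", 25: "def", 28: "def",
           31: "morder", 32: "politi1"}
cat     = {19: "p", 23: "v", 31: "n", 32: "n"}   # _CAT attributes from (3-13)
arg0    = {23: 21, 25: 22, 28: 24, 31: 22, 32: 24}
arg1    = {19: 21, 23: 24}
arg2    = {19: 22, 23: 22}

def semform_for(value):
    """Find the nominal EP whose ARG0 carries this value, and return its form."""
    for ep, v in arg0.items():
        if v == value and cat.get(ep) == "n":
            return semform[ep]
    return None

# The main EP is the one with category 'v'.
main_ep = next(ep for ep, c in cat.items() if c == "v")
pred = semform[main_ep]
a1 = semform_for(arg1.get(main_ep))   # trace ARG1's value back through ARG0
a2 = semform_for(arg2.get(main_ep))   # trace ARG2's value back through ARG0
print((pred, a1, a2))  # ('lete', 'politi1', 'morder')
```

In the full extraction, the particle etter (EP 19, which shares the verb's handle) is additionally folded into the predicate, giving the EPAS shown in (3-17).<br />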
For a detailed walkthrough of Ekstraktor, please consult Appendix A. The program code is<br />
available in Appendix B.<br />
3.6 Evaluation of the data set<br />
The data set created by the extraction process consisted of 195 elementary predicate-argument<br />
structures in its raw form. The original EPAS list was not directly applicable for the next parts of<br />
the project. Not all of the extracted structures on the list were suitable for further analysis. Some<br />
of the EPAS were not given an optimal analysis (for my purposes) by the grammar, some were<br />
irrelevant for the later analysis and some were not extracted correctly from the MRS by the Perl<br />
script. The dataset was post-edited to achieve a set of EPAS that did not include erroneously<br />
extracted or undesired structures. With such a small collection of structures as is the case in this<br />
project, the inclusion of only a few incorrect structures would be likely to skew the subsequent<br />
analysis and possibly produce false results.<br />
In the following, I will briefly outline some of the reasons why the EPAS list included incorrect<br />
structures and describe how the list was revised.<br />
3.6.1 Errors from the grammar<br />
Some of the undesired structures in the original EPAS list were directly caused by<br />
characteristics in the NorGram grammar. In the original EPAS list, there were for instance<br />
several structures of the type exemplified by (3-18):<br />
(3- 18)<br />
a. verbal predicate, nominal argument<br />
b. preposition, verbal predicate, nominal argument<br />
These structures should preferably have been combined into one EPAS. The example in (3-19)<br />
below shows a concrete instance from the EPAS list and is analogous to several other instances:<br />
(3- 19)<br />
a. bo, Anne live, Anne<br />
b. i, bo, studentkollektiv in, live, student housing<br />
The structure is extracted from the following sentence from the text material:<br />
(3- 20)<br />
Anne Slåtten bodde i et studentkollektiv utenfor Førde sentrum.<br />
Anne Slåtten lived in student housing outside central Førde.<br />
As example (3-20) shows, these structures originate from sentences featuring a verb with an<br />
adverbial complement. The adverbial is realised as a prepositional phrase where the preposition<br />
is selected by the verb. It would have been expected that sentences such as Anne bodde i et<br />
studentkollektiv (Anne lived in student housing) would result in one EPAS with the entity<br />
studentkollektiv somehow realized as the structure’s argument 2. Instead, the MRS structure of<br />
this and other similar sentences did not provide the necessary link between the verb as predicate<br />
and studentkollektiv as the second argument. When I discussed this problem with the developers<br />
of the grammar, the source of the obstacle was easily identified. In the grammar, the verb bo<br />
(live) existed as an intransitive verb, not allowing for an adverbial complement to be analysed as<br />
required to produce the desired EPAS. In order to allow this and similar sentences with the<br />
verb bo to produce one EPAS with the correct relationship between the predicate and its<br />
arguments, the entry for bo was altered. A solution which allows for an arbitrary preposition was<br />
favoured, instead of creating a new template that specifies the possible following prepositions.<br />
Analysing the sentence above with the revised grammar produces structures of the following<br />
type:<br />
(3- 21)<br />
bo, Anne, studentkollektiv<br />
live, Anne, student housing<br />
The same phenomenon was observed for a few other verbs with prepositional phrases as<br />
complements, such as gjemme i (hide in) and observere i (observe in). In these instances, the<br />
structures were manually edited.<br />
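For the manually edited cases, the merge can be sketched as a simple rule: when a structure has a preposition as its predicate and a verb as its first argument, its second argument is folded into the verb's own structure. The sketch below is illustrative only; in the thesis the lexicon entry for bo was revised instead, and the remaining cases were edited by hand:<br />

```python
# Illustrative sketch of merging pairs like (3-19a/b) into one EPAS.
# The verb list is a made-up sample; in the thesis this was resolved by
# revising the grammar (for 'bo') or by manual editing.
verbs = {"bo", "gjemme", "observere"}

def merge(epas_list):
    merged, leftovers = [], []
    for pred, a1, a2 in epas_list:
        if a1 in verbs:                 # e.g. ('i', 'bo', 'studentkollektiv')
            leftovers.append((a1, a2))  # the verb should gain a2 as argument 2
        else:
            merged.append([pred, a1, a2])
    for verb, a2 in leftovers:
        for e in merged:
            if e[0] == verb and e[2] is None:
                e[2] = a2               # fold the PP object into the verb's EPAS
    return [tuple(e) for e in merged]

result = merge([("bo", "Anne", None), ("i", "bo", "studentkollektiv")])
print(result)  # [('bo', 'Anne', 'studentkollektiv')]
```

This loses the identity of the preposition itself; the grammar revision described above, which allows an arbitrary selected preposition, is the cleaner solution.<br />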
3.6.2 Irrelevant structures<br />
Some of the structures that were extracted correctly from the text collection were simply<br />
removed in the final post-editing of the EPAS list. These structures were not directly relevant for<br />
the later analysis process and would not contribute any valuable information for the<br />
referent-guessing procedures. In total, 22 such structures were removed. The majority of these<br />
structures originate from adverbial phrases in the text collections. Locative and temporal<br />
adverbials, realised as prepositional phrases in the texts, show up in the EPAS list as structures<br />
disjoint from the rest of the sentence, not unlike the structures mentioned above. The preposition<br />
functions as the EPAS’ predicate, resulting in structures of this type:<br />
(3- 22)<br />
på, funn, åsted<br />
on, finding, crime scene<br />
Such structures were left out of the final EPAS list.<br />
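Filtering out such structures amounts to a stop-list check on the predicate. The preposition list below is a small made-up sample; in the thesis the removal was performed during post-editing:<br />

```python
# Sketch of filtering out EPAS whose predicate is a preposition, as with
# (3-22). The preposition list is an invented sample, not the thesis's list.
prepositions = {"på", "i", "etter", "fra", "til"}

def keep(epas):
    """Keep an EPAS only if its predicate is not a preposition."""
    pred, _, _ = epas
    return pred not in prepositions

epas_list = [
    ("på", "funn", "åsted"),    # on, finding, crime scene -- removed
    ("høre", "vitne", "rop"),   # hear, witness, scream -- kept
]
print([e for e in epas_list if keep(e)])  # [('høre', 'vitne', 'rop')]
```

The same stop-list mechanism could be extended to cover the information-structure artefacts discussed next, by also listing predicates such as unspec_loc.<br />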
Another type of correctly extracted structure that was omitted from the final list was structures<br />
particular to the information structure in the grammatical analysis. The extraction script returned<br />
all predicate-argument structures present in the MRS structure for each parsed sentence. This<br />
resulted in a few structures that did not hold information that it was desirable to maintain in the<br />
EPAS list. Below is an example of such a structure:<br />
(3- 23)<br />
unspec_loc, , place<br />
3.6.3 Manually added structures<br />
Not all the predicate-argument structures present in the text collection were successfully<br />
extracted by means of the extraction method. After removing unwanted structures from the<br />
EPAS list, the texts in the text collection were gone through manually to gather any EPAS that<br />
were not returned by the automated extraction process. Had the text collection been larger, so<br />
that the list of automatically extracted EPAS had been correspondingly bigger, this might not<br />
have been a necessary step. In the case of a (substantially) larger EPAS list, the structures that<br />
were not collected in the extraction process could have been dispensed with, as the structures<br />
that would have been extracted would have provided enough information for the subsequent<br />
classifications and analyses. As is the case in this project, though, the text collection and the<br />
resulting EPAS list are very small. All information that can be extracted from the texts is of<br />
value and highly desirable. As such, it was a logical next step following the removal of<br />
unwanted structures, to make sure that all desirable structures had been collected from the texts.<br />
Several structures were added, many of which had been subjected to only partial extraction in<br />
the extraction process. This may in part be due to the syntactic analysis and in part to the<br />
matching by the Perl script. Further, EPAS were manually extracted from one additional text<br />
that had not been parsed, and therefore not been part of the initial extraction process. In total,<br />
this yielded 74 additional EPAS. (3-24) below provides an example of a<br />
manually edited EPAS. (3-24a) shows the EPAS as it was after the automatic extraction process.<br />
While going through the texts, it became clear that this EPAS had not been extracted in a way<br />
that represented the meaning in the sentence it originated from, and therefore did not have an<br />
optimal structure. The EPAS was therefore manually modified to the form shown in (3-24b).<br />
(3- 24)<br />
a. Original EPAS:<br />
ta, syklist, kontakt<br />
make, biker, contact<br />
b. Manually corrected EPAS:<br />
ta-kontakt-med, syklist, politi<br />
make-contact-with, biker, police<br />
Appendix C contains the EPAS list, while Appendix D shows the alignment between sentences<br />
in the text and the extracted EPAS.<br />
3.6.4 Comments about the EPAS list<br />
The revised EPAS list consists of 223 elementary predicate-argument structures. 24 structures<br />
have been modified as described above, and 74 have been added. The list contains most EPAS<br />
present in the text collection and represents a list of verb-subject-object relations found within a<br />
limited thematic domain. While it is clear that the list could have been expanded by adding<br />
further texts to the collection, it was not possible to extend the list within the framework<br />
of this project. Certainly, with more texts the counts of individual EPAS would have been<br />
higher, and the list would also have been<br />
enriched by several new EPAS. Still, for the purposes of this thesis, the list includes a broad<br />
enough variety of structures to be of use in the classification phase.<br />
In the process of assessing the quality of the EPAS list, it became evident that the most<br />
interesting structures are the simplest ones. The EPAS corresponding to verb-subject-object<br />
relations are the ones that contribute the most information about the selectional restrictions of<br />
the domain. An alternative way to obtain an effective and robust extraction of EPAS might have<br />
been to concentrate only on this type of structure, rather than extracting all EPAS<br />
from the text collections and then filtering out unwanted ones.<br />
In order to estimate the potential of a classification of the EPAS list, line diagrams were created<br />
using Formal Concept Analysis (FCA). FCA is a methodology of data analysis and knowledge<br />
representation which identifies conceptual structures in data sets, and was a useful tool in the<br />
process of identifying how the predicates and arguments in the EPAS list related to each other.<br />
FCA distinguishes between two types of elements: formal objects and formal attributes. A<br />
formal concept is seen as a unit consisting of all its associated objects and attributes (Wolff 1994,<br />
p. 430). Starting with any set of formal objects, all formal attributes the objects have in common<br />
can be identified. When using FCA to structure the data in the EPAS list, the arguments were<br />
termed objects, while the predicates were termed attributes. An FCA line diagram consists of all<br />
objects and attributes in a given context, organised hierarchically according to their shared<br />
properties. Figure 5 below shows the FCA line diagram for part of the structures in the EPAS<br />
list 4 . Each white label corresponding to an argument from the EPAS list should be understood as<br />
a concept, and information about each concept can be read by following the upward leading<br />
paths from each concept. An object has a given attribute if there is an upward leading path from<br />
the object to the attribute (Wolff 1994, p. 431). Using the arguments/formal objects lensmann<br />
(sergeant) and Fonn as a starting point, the associated predicates/formal attributes gi (give), and<br />
bede-om (ask-for) can be identified. The arguments lensmann and Fonn co-occur with the<br />
predicates gi and bede-om, while politi (police), which is further down in the hierarchy, co-occurs<br />
with other predicates as well as those higher up in the diagram (gi, bede-om and bekrefte<br />
(confirm)). In other words, more general concepts are found toward the bottom of the diagram,<br />
while specialised concepts are found by following the paths upwards. For the data material in<br />
4 The diagram was made using the program Concept Explorer, downloadable from<br />
http://sourceforge.net/projects/conexp<br />
this project, this can be interpreted in terms of the contextual distribution the arguments have.<br />
Arguments found in the lower parts of the diagram are more general and co-occur with a wider<br />
range of predicates than the arguments found higher up in the hierarchy. In Figure 5, it can be<br />
seen that gjerningsmann (perpetrator) and drapsmann (killer) have similar distributions in the<br />
data material; drapsmann co-occurs with the predicates velge (choose) and gjemme (hide), while<br />
gjerningsmann only is found in connection with gjemme. On the basis of the formal concept<br />
analysis, it is clear that the EPAS list contains several arguments which show a distribution<br />
particular to their semantic meaning. The different lines in the diagram show interesting bundles<br />
of semantically related arguments and confirm the assumption that different types of arguments<br />
show different contextual distribution within the thematic domain.<br />
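The object/attribute grouping behind such a line diagram can be sketched in a few lines. This is a minimal illustration, not the program used in the thesis; the EPAS triples below are a small excerpt reconstructed from the discussion, with arguments as formal objects and predicates as formal attributes:

```python
# Formal context sketch: objects are arguments, attributes are the
# predicates they occur with. The triples are an illustrative excerpt.
epas = [
    ("gi", "lensmann", "opplysning"),
    ("gi", "Fonn", "opplysning"),
    ("gi", "politi", "opplysning"),
    ("bede-om", "lensmann", "assistanse"),
    ("bede-om", "Fonn", None),
    ("bede-om", "politi", None),
    ("bekrefte", "politi", None),
]

# Incidence relation: the set of attributes (predicates) for each object.
incidence = {}
for pred, arg1, _arg2 in epas:
    incidence.setdefault(arg1, set()).add(pred)

def common_attributes(objects):
    # The FCA derivation: attributes shared by all the given objects.
    return set.intersection(*(incidence[o] for o in objects))

print(sorted(common_attributes({"lensmann", "Fonn"})))
# ['bede-om', 'gi'] -- politi additionally has 'bekrefte', so it sits
# lower (more general) in the diagram.
```

Reading upward-leading paths in the diagram corresponds to taking such intersections over larger object sets.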
Figure 5<br />
4 Classification<br />
In order to use the structures in the EPAS list as an aid in anaphora resolution, they have to be<br />
processed. The pre-processing in section 3.6.4 has shown that interesting distributions do exist<br />
in the data set and indicates that certain groups of arguments display distributions<br />
particular to the domain. As a step toward exploring if these distributions can be used to<br />
represent selectional restrictions and thus function as real-world knowledge for the domain, the<br />
words in the EPAS list must be classified. This procedure uses the context patterns that a word<br />
occurs in to classify the word, for example allowing for an argument to be classified according<br />
to the predicates it co-occurs with. A classification of this type gives information about which<br />
word to expect in a given context pattern and the results can therefore be used in the process of<br />
choosing the most likely antecedent for an anaphor. In this respect, the most likely antecedent<br />
must be interpreted as the most likely antecedent given a particular contextual pattern.<br />
In the following, the EPAS list will first be classified to see if the context patterns represented<br />
by the EPAS contain enough information to suggest the correct antecedent for anaphoric<br />
expressions from the text collection. Then an association of concepts will be performed, creating<br />
bundles of those arguments which occur in similar contexts/with similar predicates. These<br />
concepts will then be applied in co-occurrence with the classification method to see if they<br />
improve the process of suggesting the correct antecedent for the anaphors.<br />
For the purposes of classification and testing, the EPAS list was divided into training and test<br />
sets. The test set consists of all structures containing pronouns, while the training set consists of<br />
the remaining EPAS. In the case of the test set, the correct antecedent for each pronoun was<br />
identified manually and added to the test file. When testing with the test instances, the classifier<br />
assigns an antecedent based on the patterns it has seen in the training set. In this way, the correct<br />
antecedent in each test case functions as a means of measuring the success rate of the<br />
classification. The test set provides a good way of testing the product of the classification and<br />
gives a measure as to whether the correct antecedent can be assigned based on training on<br />
occurrences of EPAS/context patterns.<br />
The process of classifying the constituents of the EPAS is most useful if the aim of the<br />
classification is held clearly in mind. Classifying arguments relative to the predicates and the<br />
other arguments they co-occur with can give information about two things:<br />
• is the data set generalisable enough to allow inference of the single correct antecedent in<br />
each test case?<br />
• is the data set generalisable enough to allow inference of words within the semantic<br />
concept group that the correct antecedent belongs to?<br />
In this thesis, it is of interest to identify all the words which occur in specific environments.<br />
Accordingly, we are interested in finding all the members which can occur in a specific pattern – and<br />
not necessarily only the single correct antecedent.<br />
The classification phase in the present work has three steps: first, classification through a<br />
memory-based learning algorithm; second, association of semantic classes from the text<br />
material by looking at contextual environments; and third, classification through application of<br />
the concept groups gathered in step two. In the following, the classification method will be<br />
described in more detail.<br />
4.1 Step I: Classification with TiMBL<br />
TiMBL (Tilburg Memory Based Learner) (Daelemans et al. 2003) is a memory-based learning<br />
(MBL) tool developed by the ILK research group at the University of Tilburg (ILK 2004).<br />
TiMBL has been developed with the domain of NLP specifically in mind and provides an<br />
implementation of several MBL algorithms.<br />
Within MBL, or lazy learning (Daelemans et al. 1999), training instances are simply stored in<br />
memory. Upon encountering new instances, classification is performed by comparing the new<br />
instance to the stored experiences and estimating the similarity of the new instance to the old<br />
ones. The stored example(s) most similar to the new instance is picked as its classification. This<br />
approach stands in opposition to rule-induction-based methods, which are also called greedy<br />
algorithms. In greedy learning algorithms, the learning material is used to create a model with<br />
expected characteristics for each category to be learned. Daelemans et al. (1999) show that<br />
language processing tasks tend to benefit from lazy learning methods, particularly because the<br />
individual examples in the training material are not abstracted away from in the process of<br />
creating rules. When a new data instance is classified, it is compared to all previously seen<br />
examples, including low-frequency ones. This suggests that in the case of relatively small data<br />
sets, such as the one in the present work, MBL tools are particularly suitable.<br />
By consulting previously seen data and estimating the similarity between old and new instances<br />
of data, MBL algorithms such as TiMBL are able to calculate the likelihood of new instances of<br />
data. This is done by creating a classifier which essentially consists of an example set of<br />
particular patterns together with their associated categories. The classifier can subsequently<br />
classify unknown input patterns by applying algorithms to calculate the similarity, or distance,<br />
to the known patterns stored in memory. The Nearest Neighbor approach is one commonly used<br />
means to estimate this distance and is described in more detail in the following section.<br />
4.1.1 The Nearest Neighbor approach<br />
Daelemans et al. (2003, p. 19) state that all MBL approaches are founded on the classical k-<br />
Nearest Neighbor (k-NN) method of classification (Cover and Hart 1967). This approach<br />
classifies patterns of numeric data by using information gained from examining and classifying<br />
pattern distributions observed in a data collection. In the k-NN algorithm, a new instance of data<br />
is classified as nearest to a set of previously classified points. The intuition is that observations<br />
which are close together will have categories which are close together. When classifying a new<br />
instance of data, the k-NN approach weights the known information about the closest similar<br />
data instances most heavily. In other words, a new instance of data is classified in the category<br />
of its nearest neighbour. In large samples, this rule can be modified to classifying according to<br />
the majority of the nearest neighbours, rather than just using the single nearest neighbour. The<br />
k-NN approach has several implementations in TiMBL. As TiMBL is designed to classify<br />
linguistic patterns, which in most cases consist of discrete data values and allow for a large<br />
number of attributes with varying relevance, the k-NN algorithm is not used directly. Instead,<br />
the classification of discrete data is made possible through a modified version of the k-NN<br />
approach, as well as other algorithms.<br />
There are several different distance metrics incorporated in TiMBL and, as will be described<br />
later, the user can choose the one that suits the data material best. The basic metric is the<br />
Overlap Metric, where the distance between two patterns is calculated as the sum of differences<br />
between the features of the two patterns (Daelemans et al. 2003, p. 20). The algorithm<br />
combining the k-NN approach with the overlap metric within TiMBL is called IB1 (Aha, Kibler<br />
and Albert 1991, in Daelemans et al. 2003). In this algorithm the value of k is the number of<br />
nearest distances (usually 1), allowing for a nearest neighbour set which may comprise several<br />
instances which all share the same distance to a test example. The IB1 algorithm finds the k<br />
nearest neighbours of a test case by calculating the distance between a test instance Y and a<br />
training instance X. The distance between the two instances is the sum of the distances between<br />
the instances’ different features. If k = 1, a test instance is assigned the category of its single<br />
nearest neighbour. In cases where the algorithm finds a set of nearest neighbours, the majority<br />
vote of the set is chosen. This implies a certain bias toward high-frequency categories, which in<br />
many cases will hold the majority vote.<br />
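The behaviour of IB1 with the basic (unweighted) Overlap Metric can be sketched as follows. This is a simplified illustration, not TiMBL's implementation, and the EPAS-like training patterns are hypothetical:

```python
# Minimal IB1-style sketch: distance is the number of mismatching feature
# values; all training instances at the nearest distance form the
# neighbour set, and the majority vote among them decides the category.
from collections import Counter

def overlap_distance(x, y):
    # Basic Overlap Metric: count of features where the values differ.
    return sum(1 for a, b in zip(x, y) if a != b)

def ib1_classify(train, instance):
    # train: list of (features, category) pairs.
    dists = [(overlap_distance(f, instance), cat) for f, cat in train]
    nearest = min(d for d, _ in dists)  # k = 1 nearest *distance*
    votes = Counter(cat for d, cat in dists if d == nearest)
    return votes.most_common(1)[0][0]

# Illustrative (predicate, other-argument) patterns with their categories.
train = [
    (("ankomme", "åsted"), "etterforsker"),
    (("undersøke", "åsted"), "etterforsker"),
    (("avhøre", "vitne"), "politi"),
    (("kontakte", "vitne"), "politi"),
]
print(ib1_classify(train, ("avhøre", "vitne")))  # -> politi
```

Note how a tie in distance is resolved by the majority of the neighbour set, which produces the frequency bias discussed above.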
4.1.2 Testing<br />
To create a classifier, TiMBL needs training and test data in feature-vector format, where each<br />
instance consists of a fixed number of feature values followed by a category. For testing purposes,<br />
the feature sequence is used when the distance between a test instance and the training data is<br />
calculated, and the category functions as a means to evaluate whether the assigned classification<br />
was valid. Because the test data is compared directly with the training data, separate training and<br />
test sets are needed. In this project, the EPAS list was split into a training set consisting of all<br />
EPAS without pronouns and a test set consisting of the EPAS with pronouns. In addition, testing<br />
through TiMBL’s leave-one-out option was performed; here testing is done on each pattern of<br />
the training file by treating each pattern in turn as a test case (Daelemans et al. 2003, p. 35).<br />
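The leave-one-out procedure itself is simple to state in code. The sketch below uses a hypothetical stand-in classifier (a majority-class baseline, not TiMBL's algorithm) purely to illustrate the evaluation loop:

```python
# Leave-one-out sketch: each pattern serves in turn as the test case
# while the remaining patterns form the training set.
from collections import Counter

def leave_one_out(data, classify):
    # data: list of (features, category) pairs; returns accuracy.
    correct = 0
    for i, (features, category) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if classify(train, features) == category:
            correct += 1
    return correct / len(data)

def majority_baseline(train, _features):
    # Stand-in classifier: always predict the most frequent training
    # category (illustrates the procedure, and the frequency bias).
    return Counter(cat for _, cat in train).most_common(1)[0][0]

data = [
    (("avhøre", "vitne"), "politi"),
    (("kontakte", "vitne"), "politi"),
    (("antyde", "?"), "politi"),
    (("ankomme", "åsted"), "etterforsker"),
]
print(leave_one_out(data, majority_baseline))  # -> 0.75
```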
In the classification phase the alignment of category and description features is stored, so that<br />
the categories of new, unseen sequences of description features can be probabilistically inferred<br />
in the following test phase.<br />
Regardless of the input format chosen, a classification with TiMBL presupposes that the training<br />
material consists of a number of features to be learned from, as well as a predetermined category<br />
which is the desired output category. The comma-separated values format was used for the<br />
EPAS classification. In order to classify the constituents of each EPAS based on the contextual<br />
patterns in the structure, each part of the EPAS was classified with reference to the other<br />
constituents in it. Somewhat analogous to the way that a screw can be described as being small,<br />
long and containing no holes, the argument åsted (crime scene) can be described through its co-occurrence<br />
with the predicate ankomme (arrive) and the argument etterforsker (investigator)<br />
(example (4-1)). This makes it possible to train a classifier on the EPAS list, using the argument<br />
whose environment is to be learned as category label, and each constituent in the EPAS as<br />
features. To prevent the category from being explicitly present in the training material and to ensure<br />
that the classifier was trained only on the environment of the desired category, the relevant<br />
feature was ignored using TiMBL’s ignore option. In order to classify the structures once for<br />
each argument type, two different data sets were prepared. Example (4-1) shows the format of<br />
the three-feature dataset that was used. The parentheses indicate that the feature in question was<br />
ignored when training and classifying.<br />
(4-1)<br />
features category<br />
a. predicate, (argument 1), argument 2 argument 1<br />
b. predicate, argument 1, (argument 2) argument 2<br />
Example (4-2) shows excerpts of the two input files: (4-2a) shows the structures with argument<br />
1 as category, while (4-2b) shows the same structures with argument 2 as the category. The<br />
classifier is given two constituents of an EPAS to learn from and the target constituent is given<br />
as the EPAS’ category.<br />
(4-2)<br />
a. ankomme,etterforsker,?,etterforsker<br />
ankomme,etterforsker,?,etterforsker<br />
ankomme,etterforsker,åsted,etterforsker<br />
antyde,politi,?,politi<br />
avhøre,?,person,?<br />
avhøre,?,vedkommende,?<br />
avhøre,politi,vitne,politi<br />
b. ankomme,etterforsker,?,?<br />
ankomme,etterforsker,?,?<br />
ankomme,etterforsker,åsted,åsted<br />
antyde,politi,?,?<br />
avhøre,?,person,person<br />
avhøre,?,vedkommende,vedkommende<br />
avhøre,politi,vitne,vitne<br />
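Generating the two input formats in (4-1)/(4-2) from the EPAS triples is mechanical. The sketch below is illustrative (the triples are a small hypothetical sample, and the thesis used a Perl script rather than this code):

```python
# Sketch of preparing comma-separated TiMBL input rows from EPAS triples,
# duplicating the target constituent as the category label at the end of
# each row; at training time the corresponding feature is then hidden
# with TiMBL's ignore option so only the environment is learned.
epas = [
    ("ankomme", "etterforsker", "åsted"),
    ("antyde", "politi", "?"),
    ("avhøre", "politi", "vitne"),
]

def make_rows(triples, target_index):
    # Keep all three constituents as features, append the target as category.
    return [",".join(t) + "," + t[target_index] for t in triples]

rows_arg1 = make_rows(epas, 1)  # category = argument 1
rows_arg2 = make_rows(epas, 2)  # category = argument 2
print(rows_arg1[0])  # ankomme,etterforsker,åsted,etterforsker
print(rows_arg2[2])  # avhøre,politi,vitne,vitne
```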
The output file that is created when TiMBL has classified the input data and run a test with the<br />
test data consists of the input given in the test set with the category predicted by TiMBL added<br />
at the end of each line. Further, the output supplied by TiMBL upon a successful training and<br />
testing round gives information about the actions in the various stages of analysis. TiMBL’s<br />
actions can be divided into three separate phases: in phase 1 the training data is analysed, in<br />
phase 2 the items in the training data are stored for efficient use during testing and in phase 3 the<br />
trained classifier is applied to the test set. For the purposes of the EPAS analysis, the default<br />
algorithm was used in the test phase. This algorithm computes the similarity between a test and<br />
a training item in terms of weighted overlap; the total difference between two patterns is the sum<br />
of relevance weights of those features which are not equal (Daelemans et al. 2003, p. 13).<br />
The classification of the EPAS and the subsequent testing were carried out in two distinct steps:<br />
classification and testing of argument 1 and argument 2 were done separately. The results of the<br />
classification and testing are described in the following sections.<br />
4.1.2.1 Classifying argument 1<br />
Several experiments were run through TiMBL with the aim of classifying occurrences of<br />
argument 1 according to the environment they occur in. The classifier was trained on all EPAS<br />
not containing pronouns and then tested. For the purpose of classifying occurrences of argument<br />
1, an EPAS list with the relevant argument 1 as category label was used. In the following<br />
descriptions of the performed tests, this list will be referred to as EPAS_arg1.<br />
Test 1<br />
Training set: EPAS_arg1 with no pronouns, argument 1 ignored.<br />
Test set: EPAS with pronouns in argument 1 position.<br />
Result: 57.69% (15/26) correct classifications<br />
The classifier was created with EPAS_arg1 with no pronouns as training set and tested with all<br />
EPAS containing pronouns in the position of argument 1. For the test set, each EPAS was<br />
completed with the antecedent for its pronoun. For reasons of classification and testing, the<br />
antecedent was appended at the end of each EPAS, thus functioning as the category label for the<br />
structure. In total, there were 26 EPAS with pronouns in the position of argument 1. (4-3) below<br />
shows an example from the test file with pronouns as argument 1:<br />
(4-3)<br />
få,pron,rapport,politi<br />
When classifying with argument 1 as category label and testing with EPAS with pronouns as<br />
argument 1, TiMBL assigned the correct category in 57.69% (15/26) of the test cases. One of<br />
the cases where the classifier had assigned the “wrong” category was actually not incorrect: the<br />
antecedent was of a form that did not exist in the training material (antecedent: kvinne/vitne<br />
(woman/witness), assigned category: vitne (witness)). Furthermore, in six of the incorrectly<br />
assigned categories, the category chosen by the classifier was semantically close to the correct<br />
antecedent. Example (4-4) below shows the seven examples where the incorrect categories<br />
assigned by the classifier can in fact be viewed as belonging to the same semantic group, and<br />
thus at least as a partially successful classification. Regarding all these instances as successful<br />
category assignments would raise the classifier’s accuracy to 84.61% (22/26).<br />
(4-4)<br />
Correct antecedent Assigned category<br />
kvinne/vitne (woman/witness) vitne (witness)<br />
Fonn (Fonn) politi (police)<br />
Kripos-spesialist (Kripos specialist) politi (police)<br />
politimester (police chief) Fonn (Fonn)<br />
politi (police) etterforsker (investigator)<br />
politi (police) Fonn (Fonn)<br />
Slåtten (Slåtten) kvinne (woman)<br />
Test 2<br />
Training set: EPAS_arg1 with no pronouns, argument 1 ignored.<br />
Test method: leave-one-out<br />
Result: 42.40% (81/191) correct classifications<br />
When training and testing on the EPAS_arg1 list with pronouns removed, the classifier<br />
produced a rather poor accuracy of 42.40%. TiMBL’s leave-one-out option makes it possible to<br />
train and test on the same material, as each pattern in the training file is used as a test case while<br />
the rest of the patterns are used as training material. One reason for the relatively low percentage<br />
of correctly classified instances is most likely the small size of the data set. With only 191<br />
patterns to learn from, the classifier does not have enough diversity in the examples to provide<br />
correct classifications and also does not find enough occurrences of the individual patterns to be<br />
able to pick the correct category. Since politi (police) is by far the most frequent feature in the<br />
EPAS list, many instances are wrongly assigned the category “police” by virtue of the majority<br />
vote of the nearest neighbour classification. An attempt to avoid this effect is described in test 3.<br />
Examining the instances where the classifier assigned the wrong category to an EPAS showed<br />
that in 27 of the incorrectly classified cases, the assigned category was semantically similar to<br />
the correct category. This suggests that the list in itself does contain some relevant information<br />
about the distribution of argument 1 in the data set. Example (4-5) below shows the correct<br />
categories and the categories assigned by the classifier.<br />
(4-5)<br />
Correct category Assigned category<br />
Anne kvinne (woman)<br />
Slåtten<br />
drapsmann (killer) gjerningsmann (perpetrator)<br />
etterforsker (investigator) politi (police)<br />
Fonn lensmann (deputy)<br />
politi (police)<br />
gjerningsmann (perpetrator) person (person)<br />
Kripos-spesialist(Kripos specialist) politi (police)<br />
kvinne (woman) 23-åring (23-year-old)<br />
lensmann (deputy) Fonn<br />
politi (police)<br />
medarbeider (co-worker) politi (police)<br />
person (person) gjerningsmann (perpetrator)<br />
politi (police) lensmann (deputy)<br />
etterforsker (investigator)<br />
politimester (chief of police) politi (police)<br />
polititjenestefolk (police workers) politi (police)<br />
Slåtten Anne<br />
kvinne (woman)<br />
tekniker (technician) politi (police)<br />
23-åring (23-year-old) kvinne (woman)<br />
Test 3<br />
When using the overlap metric, all feature values are seen as equally dissimilar (Daelemans et<br />
al. 2003, p. 23). This means that the classifier is unable to determine the similarity of values<br />
such as politi (police), etterforsker (investigator) and politimester (chief of police) by means of<br />
looking at their co-occurrence with target classes. By using the Modified Value Difference<br />
Metric (MVDM), the features are weighted according to the patterns they occur in.<br />
Unfortunately, MVDM does not perform well on small data sets with values that<br />
only occur a few times in the data set. When trained and tested on the EPAS_arg1 list, MVDM<br />
produced slightly lower accuracies than in the corresponding test with the overlap metric (see<br />
test 2 above). In practice, this meant that the benefits of MVDM could not be exploited due to<br />
the size of the data material.<br />
Test 4<br />
Training set: EPAS_arg1 excluding structures with pronouns and structures with non-verbal<br />
predicates<br />
Test method: leave-one-out<br />
Result: 45,03% (68/151) correct classifications<br />
The training and test material was modified by excluding all EPAS with non-verbal predicates,<br />
as well as all EPAS with the predicate være (be). This was done to see if these structures disturb<br />
the data material by adding noise that does not contribute information about the<br />
distribution of arguments in the EPAS. The accuracy increased slightly<br />
upon this modification of the data set. The editing did not, however, increase the accuracy when<br />
training on the edited EPAS_arg1 list and testing on EPAS containing pronouns in argument 1<br />
position.<br />
4.1.2.2 Classifying argument 2<br />
Analogous to the classification steps performed for argument 1, the classifications were repeated<br />
for occurrences of argument 2. The EPAS list with the second argument as category label will in<br />
the following be referred to as EPAS_arg2.<br />
Test 1<br />
Training set: EPAS_arg2 with no pronouns, argument 2 ignored.<br />
Test set: EPAS with pronouns in argument 2 position<br />
Result: (0/6) correct classifications<br />
Training the classifier on the EPAS_arg2 list without pronouns and testing on the EPAS with<br />
pronouns in argument 2 position did not produce any correct classifications. This is likely due in<br />
part to the small size of the test data set, as well as the homogeneous nature of the test<br />
instances. Five of the wrongly classified instances were in fact of the same type; in all instances,<br />
the classifier had assigned the category kvinne (woman), while the correct antecedent was<br />
Slåtten.<br />
Test 2<br />
Training set: EPAS_arg2 with no pronouns, argument 2 ignored<br />
Test method: leave-one-out<br />
Result: 49.73% (95/191) correct classifications<br />
Training and testing the classifier on the EPAS_arg2 list with no pronouns produced an accuracy<br />
of 49.73%. As was the case for the corresponding classification of argument 1, it is likely that<br />
the relatively small dataset is a disadvantage for the classification process.<br />
4.1.3 Comments on the results<br />
The results obtained through classifying the EPAS indicate that the information present in the<br />
EPAS derived from the text collection does provide clues about which word to expect in a<br />
specific position. The accuracy scores obtained by training and testing on the EPAS extracted<br />
from a collection of texts suggest that even a small collection of texts on the same domain<br />
provides enough information to enable a classification approach based on contextual distribution. In the<br />
tests described above, there was a recurring tendency: in a number of the cases where the<br />
wrong category was assigned in the test phase, the assigned category bore some semantic<br />
resemblance to the correct category. This reinforces the initial intuition that similar words are<br />
used in similar environments and that the environment can contribute clues toward the<br />
semantic meaning of a word.<br />
In the following section, the notion of finding words which are similar to each other by virtue of<br />
occurring in the same environments will be explored further.<br />
4.2 Step II: Association of concept groups<br />
The fundamental idea in this thesis is that words display certain semantic features based solely<br />
on the context they are found in. Therefore, when looking for possible antecedents for an<br />
anaphoric expression, the candidates should not only be weighted according to their co-occurrence<br />
in an identical context pattern in a corpus, but also according to their co-occurrence<br />
with similar context patterns. The assumption that words which occur in identical contexts have<br />
related meanings can be used to retrieve words with similar meanings from the data material.<br />
With a target argument and a target predicate as starting point, the association method goes<br />
through the EPAS list and returns words which occur in similar environments to the target<br />
argument. This association is performed in three steps:<br />
• level 0: words which co-occur with the target predicate are returned<br />
• level 1: words which occur in the same context as the target argument are returned<br />
• level 2: words which occur in the same context as the words found in level 1 are returned<br />
Level 0 considers the information that is directly accessible from the EPAS list; with a given<br />
predicate as reference point, the co-occurring arguments are retrieved. Level 1 looks at the other<br />
arguments that occur with the same predicates as the arguments retrieved in the first step.<br />
Finally, level 2 performs the same step once again and looks at the arguments that occur in the<br />
same contexts as the arguments collected in level 1. As a result, bundles of concepts are<br />
produced; each concept class consisting of words that are used in the same textual context, and<br />
therefore are likely to be semantically similar. The following example explains how the<br />
association of argument classes is performed.<br />
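Procedurally, the three levels can be sketched as below. This is an illustrative reconstruction of the traversal, not the thesis's actual implementation; the EPAS triples are a small hypothetical excerpt:

```python
# Sketch of the three association levels: from a target predicate, collect
# its co-occurring first arguments (level 0), then repeatedly add
# arguments that share a predicate context with those found so far
# (levels 1 and 2).
epas = [
    ("ankomme", "etterforsker", "åsted"),
    ("undersøke", "etterforsker", "åsted"),
    ("undersøke", "politi", "aktivitet"),
    ("kontakte", "politi", "vitne"),
    ("bekrefte", "politi", None),
    ("bekrefte", "politimester", None),
]

def args_of(predicate):
    # Level 0: first arguments co-occurring with the target predicate.
    return {a1 for p, a1, _ in epas if p == predicate}

def expand(arguments):
    # One association level: all first arguments sharing a predicate
    # with any argument already in the set.
    preds = {p for p, a1, _ in epas if a1 in arguments}
    return {a1 for p, a1, _ in epas if p in preds}

level0 = args_of("ankomme")  # {'etterforsker'}
level1 = expand(level0)      # adds 'politi' (shared predicate 'undersøke')
level2 = expand(level1)      # adds 'politimester' (shared 'bekrefte')
```

Each application of `expand` widens the concept group, mirroring the worked example that follows.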
Level 0<br />
The association method takes as its starting point a predicate from the EPAS list. For a given<br />
predicate, the first and second arguments are listed. The nominal argument etterforsker<br />
(investigator) occurs as argument 1 of the verbal predicate ankomme (arrive) in the text<br />
collection. (4-6) below shows the EPAS with ankomme as predicate.<br />
(4-6)<br />
ankomme,etterforsker,?<br />
ankomme,etterforsker,åsted<br />
arrive, investigator, ?<br />
arrive, investigator, crime scene<br />
Level 1<br />
In order to find other nominal arguments that occur in the same context patterns as etterforsker,<br />
we must look at the other EPAS in which etterforsker occurs as argument 1. This yields the<br />
EPAS shown in (4-7) below. For the sake of the concept association, the EPAS corresponding to<br />
adjective-noun relations in the original texts are not considered, and therefore not included in<br />
(4-7).<br />
(4-7)<br />
bistå,etterforsker,lensmann<br />
bistå,etterforsker,politi<br />
ha,etterforsker,observasjon<br />
kontakte,etterforsker,vitne<br />
mene,etterforsker,?<br />
rigge,etterforsker,lyskaster<br />
undersøke,etterforsker,åsted<br />
assist, investigator, deputy<br />
assist, investigator, police<br />
have, investigator, observation<br />
contact, investigator, witness<br />
mean, investigator, ?<br />
build-up, investigator,searchlight<br />
examine, investigator, crime scene<br />
For each of the predicates in (4-7), we want to find the other arguments, in addition to<br />
etterforsker (investigator), which occur in the corpus material as the first argument of the<br />
predicate. Traversing the EPAS list in search of these nominal arguments yields the list<br />
presented in (4-8). Pronouns and empty argument slots are omitted from the association since<br />
they generally occur in too many different context patterns to contribute relevant<br />
information in this kind of analysis.<br />
(4-8)<br />
ha,politi,medarbeider<br />
ha,politi,teori<br />
kontakte,politi,vitne<br />
mene,politi,?<br />
undersøke,politi,aktivitet<br />
have,police,co-worker<br />
have,police,theory<br />
contact,police,witness<br />
mean,police,?<br />
examine,police,activity<br />
As can be seen from the EPAS in (4-8), politi (police) is the only other argument which occurs<br />
in the same contexts as etterforsker (investigator). So far, the association tells us that there is a<br />
relationship between the concepts etterforsker (investigator) and politi (police) in the sense that<br />
these words occur in the same environments in the text collection.<br />
Level 2<br />
In order to explore the possibility of further associated concepts, the association method goes<br />
one level further. Basically, the first step of the association is repeated, but with new parameters;<br />
for each of the first arguments in (4-8) we need to know which other words can occur in the<br />
same contextual position. Therefore the EPAS list is again consulted and all the other first<br />
arguments that occur in the same environment as politi (police) are returned. This produces the<br />
list in (4-9).<br />
(4- 9)<br />
avklare,obduksjon,?<br />
bede-om,lensmann,assistanse<br />
bede-om,Fonn,?<br />
bede-om,lensmann,?<br />
bekrefte,lensmann,?<br />
bekrefte,politimester,?<br />
finne,leteaksjon,kvinne<br />
få,kjæreste,telefon<br />
gi,Fonn,opplysning<br />
gi,kamera,indikasjon<br />
gi,lensmann,opplysning<br />
gi,lensmann,opplysning<br />
gi,vitneavhør,indikasjon<br />
ha,etterforsker,observasjon<br />
kjenne,generic-nom,Slåtten<br />
kontakte,etterforsker,vitne<br />
mene,etterforsker,?<br />
tro,lensmann,?<br />
clarify,autopsy,?<br />
ask-for,sergeant,assistance<br />
ask-for,Fonn,?<br />
ask-for,sergeant,?<br />
confirm,sergeant,?<br />
confirm,chief of police,?<br />
find,search party,woman<br />
get,boy/girlfriend,telephone<br />
give,Fonn,information<br />
give,camera,indication<br />
give,sergeant,information<br />
give,sergeant,information<br />
give,interview,indication<br />
have,investigator,observation<br />
know,generic-nom,Slåtten<br />
contact,investigator,witness<br />
mean,investigator,?<br />
believe,sergeant,?<br />
As a step toward disregarding arguments which do not occur often enough in the context in<br />
question to be significant, the method may be limited to considering only arguments which<br />
occur more than once in the text material. In the case of a larger text collection, a different<br />
method of filtering out low-frequency arguments would have to be adopted; for the small data set<br />
in this project, disregarding arguments which occur only once proved useful. The steps outlined<br />
above produce the following associated group of concepts:<br />
Figure 6<br />
etterforsker (investigator)<br />
politi (police)<br />
lensmann (sergeant)<br />
Fonn (Fonn)<br />
Intuitively, this is quite a good association of concepts, since all the entities in the grouping<br />
belong to the group law enforcement. If a person were to group nominals from the text<br />
collection into semantically similar concept classes, the grouping in Figure 6 would not be an<br />
unlikely result. The grouping as shown in Figure 6, however, is the result of an association<br />
based on context information from the text itself.<br />
4.2.1 Classify<br />
Manually performing the association method described above on all the EPAS in the data set<br />
proved close to impossible, mainly because it implied consulting the data set<br />
multiple times, each time looking for different values while keeping track of the partial goals in<br />
the process. Based on this method, the Perl script classify was written 5 . In the<br />
following, the algorithm implemented in classify is outlined in brief.<br />
For each predicate:<br />
1. Level 0:<br />
What is ARG1 and ARG2 in the corpus/EPAS list?<br />
2. Level 1:<br />
For each ARG1 = x that was found in 1:<br />
In connection with which other predicates is ARG1 also = x?<br />
For each of these predicates:<br />
Which other words occur as ARG1?<br />
Produces a list of words which occur in the same contexts as x<br />
3. Level 2:<br />
For each word = y in the list from level 1:<br />
Which other predicates does this word also co-occur with?<br />
For each of these predicates:<br />
Which other words occur as ARG1?<br />
Produces a list of words which occur in the same contexts as y<br />
Same procedure is repeated for ARG2.<br />
5<br />
The algorithm was implemented in Perl by Martin Rasmussen Lie, informatics student at the University of<br />
Bergen.<br />
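The outline above can be rendered as a runnable sketch. The following Python fragment is illustrative only: the actual implementation (classify.pl) was written in Perl, the EPAS shown are a hand-picked subset of the list, and all function and variable names are invented for the example.

```python
from collections import Counter

SKIP = {"?", "pron"}  # empty slots and pronouns are not associated

def cooccurring(epas, word, slot):
    """Predicates with which `word` occurs in the given argument slot (1 or 2)."""
    return {pred for pred, a1, a2 in epas if (a1, a2)[slot - 1] == word}

def level(epas, words, slot):
    """For each word, find the predicates it occurs with, then collect the
    other arguments filling the same slot of those predicates (Levels 1 and 2)."""
    counts = Counter()
    for w in words:
        preds = cooccurring(epas, w, slot)
        for pred, a1, a2 in epas:
            if pred in preds:
                other = (a1, a2)[slot - 1]
                if other not in SKIP and other != w:
                    counts[other] += 1
    return counts

def associate(epas, seed, slot, min_freq=2):
    """Levels 0-2 of the outline: build a concept group around `seed`,
    keeping only arguments that occur at least `min_freq` times."""
    lvl1 = {w for w, n in level(epas, {seed}, slot).items() if n >= min_freq}
    lvl2 = {w for w, n in level(epas, lvl1, slot).items() if n >= min_freq}
    return {seed} | lvl1 | lvl2

# Illustrative subset of the EPAS list: (predicate, argument 1, argument 2)
epas = [
    ("ankomme", "etterforsker", "åsted"),
    ("ha", "etterforsker", "observasjon"),
    ("ha", "politi", "teori"),
    ("ha", "politi", "medarbeider"),
    ("kontakte", "etterforsker", "vitne"),
    ("kontakte", "politi", "vitne"),
    ("gi", "politi", "opplysning"),
    ("gi", "lensmann", "opplysning"),
    ("gi", "lensmann", "opplysning"),
]
print(associate(epas, "etterforsker", slot=1))
```

With min_freq set to 2, the sketch mirrors the "occurs more than once" threshold discussed above; on the subset shown it associates etterforsker with politi and lensmann, in line with Figure 6.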
(4-10) below shows the output for the predicate ankomme (arrive) as it is produced by classify.<br />
(4- 10)<br />
NIVÅ0, ARG1 (ankomme): etterforsker x 3<br />
NIVÅ1, ARG1 (etterforsker): ? x 5, politi x 5, pron x 3<br />
NIVÅ2, ARG1 (politi): ? x 17, pron x 9, lensmann x 6, Fonn x 2,<br />
forbipasserende x 1, generic-nom x 1, kamera x 1, kjæreste x 1,<br />
leteaksjon x 1, obduksjon x 1, politimester x 1, syklist x 1,<br />
vitneavhør x 1<br />
NIVÅ2, ARG1 (pron): politi x 7, kvinne x 4, vitne x 4, ? x 2,<br />
Anne x 2, bilfører x 2, Slåtten x 2, syklist x 2, etterforskning x 1,<br />
Fonn x 1, generic-nom x 1, kjæreste x 1, lensmann x 1, lommebok x 1,<br />
rapport x 1, teori x 1<br />
NIVÅ0, ARG2 (ankomme): ? x 2, åsted x 1<br />
NIVÅ1, ARG2 (åsted): aktivitet x 1, hybelhus x 1,<br />
minibankaktiviteter x 1, mobiltelefontrafikk x 1, område x 1,<br />
overvåkningsfilmer x 1<br />
NIVÅ2, ARG2 (aktivitet): minibankaktiviteter x 1,<br />
mobiltelefontrafikk x 1, område x 1, overvåkningsfilmer x 1<br />
NIVÅ2, ARG2 (hybelhus): (Ingen referanser)<br />
NIVÅ2, ARG2 (minibankaktiviteter): aktivitet x 1,<br />
mobiltelefontrafikk x 1, område x 1, overvåkningsfilmer x 1<br />
NIVÅ2, ARG2 (mobiltelefontrafikk): aktivitet x 1,<br />
minibankaktiviteter x 1, område x 1, overvåkningsfilmer x 1<br />
NIVÅ2, ARG2 (område): aktivitet x 1, minibankaktiviteter x 1,<br />
mobiltelefontrafikk x 1, overvåkningsfilmer x 1<br />
NIVÅ2, ARG2 (overvåkningsfilmer): aktivitet x 1,<br />
minibankaktiviteter x 1, mobiltelefontrafikk x 1, område x 1<br />
Please consult Appendix E for the full program code for classify.pl.<br />
4.2.2 Associated concept classes<br />
Running the three-level association described above on the EPAS list produced six distinct<br />
groupings. These concept groups<br />
are shown in (4-11) below.<br />
(4- 11)<br />
a. POLICE:<br />
etterforsker, politi, lensmann, Fonn<br />
investigator, police, sergeant, Fonn<br />
b. WOMAN:<br />
Anne, Slåtten, 23-åring, sykepleiestudent, kvinne, beboer<br />
Anne, Slåtten, 23-year-old, nurse student, woman, inhabitant<br />
c. PERP:<br />
gjerningsmann, drapsmann<br />
perpetrator, killer<br />
d. PERSON:<br />
person, bilfører, syklist, vedkommende<br />
person, car driver, biker, generic-nom<br />
e. OBSERV:<br />
teori, observasjon<br />
theory, observation<br />
f. PLACE:<br />
studentkollektiv, Førde<br />
student housing, Førde<br />
The classes of words shown in (4-11) form groups of concepts which occur in the same<br />
contextual environments within the thematic domain that the EPAS are extracted from. The<br />
groupings seem to reflect real semantic clusters in the sense that one can easily find a label to<br />
describe each group. For the purpose of the text collection in the present work, these six concept<br />
groups represent six distinct semantic groupings that share many features with respect to pattern<br />
distribution in the data set. With a larger data set to run the concept association on, more concept<br />
groups, and more members within each group, would have been a likely outcome. The results<br />
of the concept association on the small data set in this project do, however, suggest the<br />
feasibility of the method, and show that frequent patterns in smaller text collections can<br />
also capture interesting concept groupings.<br />
4.3 Step III: Using concept groups in TiMBL<br />
The concept groups which emerged as a result of the association performed in section 4.2 above,<br />
represent clusters of words that occur in similar constellations in the data material. The<br />
emergence of concept groups which intuitively seem to have some semantic resemblance to<br />
each other confirms that the context a word fits into does indeed say something about what the<br />
word means, as per the distributional hypothesis.<br />
In the introduction to this chapter, it was stated that the aim of classifying the EPAS list is<br />
twofold. On the one hand, it is of interest to see to what degree the environments that an<br />
argument occurs in over a collection of texts provide sufficient cues for a correct guess of<br />
which argument can be expected in a specific context. On the other hand, it is equally interesting<br />
to see whether classification can narrow down the set of possible arguments for a specific<br />
context pattern. Through the association technique, six groups of words emerged, the members<br />
of each group sharing the feature that they all tend to occur in the same environments.<br />
Previously, it has been stated that some anaphors need access to information about the world in<br />
order to be resolved. This information can to some extent be represented by the concept groups<br />
associated from the data set. By identifying groups of words which typically occur in the same<br />
textual environment, an intuition about which words to expect in which contexts is captured. In<br />
the event of “difficult” anaphors which depend on world knowledge, an anaphora resolution<br />
system can retrieve potential antecedents from the text, check which concept group an expected<br />
antecedent is likely to belong to, and consequently choose the antecedent candidate belonging to<br />
the expected concept group. As a first step of examining the usefulness of concept groups in<br />
combination with anaphora, experiments aiming at enhancing the performance of the classifier<br />
in section 4.1 were performed. These experiments are described in the following section.<br />
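As a minimal illustration of this referent-guessing step, the following Python sketch filters a set of antecedent candidates by an expected concept group. The group contents follow (4-11); the function name and candidate lists are invented for the example.

```python
# Group contents as associated in (4-11); only two groups shown.
CONCEPT_GROUPS = {
    "POLICE": {"etterforsker", "politi", "lensmann", "Fonn"},
    "PERP": {"gjerningsmann", "drapsmann"},
}

def filter_candidates(candidates, expected_group):
    """Keep the antecedent candidates belonging to the expected concept group;
    if none of them belongs, fall back to the full candidate list."""
    members = CONCEPT_GROUPS.get(expected_group, set())
    matching = [c for c in candidates if c in members]
    return matching or list(candidates)

# Candidates lensmann and gjerningsmann, classifier expects POLICE:
print(filter_candidates(["lensmann", "gjerningsmann"], "POLICE"))  # ['lensmann']
```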
4.3.1 Testing<br />
Tests were performed in TiMBL, using the relevant concept group as the category for a feature<br />
pattern. Analogous to the testing in section 4.1.2 above, two separate test sets were prepared,<br />
one for the classification of each argument. In the cases where the relevant argument was a<br />
member of one of the concept groups, the head label of the concept group was used as the<br />
category label in the input data. If the relevant argument did not belong to any concept group,<br />
the argument itself was used as category label, as in the tests in section 4.1.2. Example (4-12)<br />
below shows an excerpt of the input file used for training the classifier for argument 1<br />
classification.<br />
(4- 12)<br />
drepe,gjerningsmann,kvinne,PERP<br />
drept,sykepleiestudent,?,WOMAN<br />
død,sykepleiestudent,?,WOMAN<br />
ekstra,patrulje,?,patrulje<br />
The aim of the tests performed in this section was to see if the accuracy of the classifier could be<br />
enhanced by training on a complete context pattern with the appropriate concept group as<br />
category label.<br />
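The relabelling illustrated in (4-12) amounts to a simple lookup. A Python sketch of this step, assuming the group memberships from (4-11) (only three groups shown; the function names are invented for the example):

```python
# Group contents as associated in (4-11).
CONCEPT_GROUPS = {
    "PERP": {"gjerningsmann", "drapsmann"},
    "WOMAN": {"Anne", "Slåtten", "23-åring", "sykepleiestudent", "kvinne", "beboer"},
    "POLICE": {"etterforsker", "politi", "lensmann", "Fonn"},
}

def category(argument):
    """Concept-group label for the argument, or the argument itself if ungrouped."""
    for label, members in CONCEPT_GROUPS.items():
        if argument in members:
            return label
    return argument

def to_instance(pred, arg1, arg2):
    """One training line: comma-separated features plus the category label."""
    return ",".join([pred, arg1, arg2, category(arg1)])

print(to_instance("drepe", "gjerningsmann", "kvinne"))  # drepe,gjerningsmann,kvinne,PERP
print(to_instance("ekstra", "patrulje", "?"))           # ekstra,patrulje,?,patrulje
```

The two printed lines reproduce the first and last instances of the excerpt in (4-12).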
Test 1<br />
Training set: EPAS_arg1 with no pronouns and concept classes as category label, argument 1<br />
ignored.<br />
Test method: leave-one-out<br />
Result: 56.54% (108/191)<br />
In this test, the classifier was trained on two features of the EPAS, ignoring argument 1. This<br />
test is analogous to test 2 in section 4.1.2.1, which had an accuracy of 41.20%. In addition<br />
to the 108 correctly classified instances, five additional instances were assigned categories<br />
which are semantically similar to the correct category. This was true<br />
for Kripos-spesialist (Kripos specialist), politimester (chief of police), medarbeider (co-worker)<br />
and polititjenestefolk (police workers), which were all assigned the category POLICE. These<br />
words are not part of the concept group POLICE, but are obviously semantically related to the<br />
members of this concept group. Had these words occurred more frequently in the data material,<br />
they could have been expected to show a distribution allowing for their inclusion in POLICE.<br />
The results of this test suggest that labeling EPAS with concept group labels heightens the<br />
accuracy of the classifier. This is not surprising, given the fact that a higher number of context<br />
patterns/EPAS are labeled with the same category in such an approach, making the generalisable<br />
material larger.<br />
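For readers unfamiliar with TiMBL: its default IB1 algorithm is in essence nearest-neighbour classification with an overlap metric, and leave-one-out testing holds out each instance in turn. A minimal Python sketch of this regime on toy data (not the actual EPAS set; names are invented for the example):

```python
def overlap(a, b):
    """Number of matching feature values (an unweighted overlap metric)."""
    return sum(x == y for x, y in zip(a, b))

def classify_1nn(train, features):
    """Category of the training instance with the highest feature overlap."""
    best = max(train, key=lambda inst: overlap(inst[0], features))
    return best[1]

def leave_one_out(data):
    """Hold out each instance in turn, classify it on the rest, report accuracy."""
    correct = sum(
        classify_1nn(data[:i] + data[i + 1:], feats) == cat
        for i, (feats, cat) in enumerate(data)
    )
    return correct / len(data)

data = [  # (features, category): toy EPAS-like patterns
    (("drepe", "kvinne"), "PERP"),
    (("drepe", "offer"), "PERP"),
    (("etterlyse", "vitne"), "POLICE"),
    (("etterlyse", "bilfører"), "POLICE"),
]
print(f"{leave_one_out(data):.2%}")  # 100.00% on this toy set
```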
Test 2<br />
Training set: EPAS_arg1 with no pronouns and concept classes as category label.<br />
Test method: leave-one-out<br />
Result: 86.91% (166/191)<br />
This test was performed to see if training the classifier on the entire structure of an EPAS<br />
increases the accuracy of assigning concept labels to the structures. The classifier was trained on<br />
all three features of the EPAS. In this case, the classifier performed with a fairly high accuracy,<br />
assigning the correct category in 166 of 191 cases. It is obviously an advantage that all parts of<br />
the EPAS can be used in the classification phase when the category to be assigned is not literally<br />
a part of the structures to be learnt from.<br />
Test 3<br />
Training set: EPAS_arg1 with no pronouns and concept classes as category label, argument 1<br />
ignored.<br />
Test set: EPAS_arg1 with pronouns and concept classes as category label<br />
Result: 76.92% (20/26)<br />
As was the case in the corresponding test in section 4.1.2.1, two of the wrongly assigned<br />
categories were in fact within the same semantic group as the correct category. Regarding these<br />
as correct assignments would raise the result to 84.62% (22/26). As in the previous two<br />
tests, the assigned categories in these cases are too infrequent in the EPAS list to surface in the<br />
associated concept groups.<br />
Test 4<br />
Training set: EPAS_arg2 with no pronouns and concept classes as category label, argument 2<br />
ignored.<br />
Test set: EPAS_arg2 with pronouns and concept classes as category label<br />
Result: 83.33% (5/6)<br />
When training the classifier on the EPAS_arg2 list using concept class labels as categories and<br />
testing on the set of EPAS with pronouns in argument 2 position, the classifier resolved five of<br />
the six test instances correctly. In the corresponding test in section 4.1.2, the classifier did not<br />
assign the correct category in any of the six test cases. We did, however, see that five of the test<br />
instances were assigned categories which were semantically similar to the correct antecedent. In<br />
view of the results in the initial test, it came as no surprise that the classifier performed so much<br />
better when used in connection with the concept class labels.<br />
4.4 Are concept classes useful for anaphora resolution?<br />
The EPAS list has been processed in different ways in this chapter. The tests which have been<br />
described provide an indication of how context patterns extracted from the text collection can be<br />
used to create expectations of which words (or which types of words) are likely to occur in a<br />
given contextual environment. These expectations can be used to anticipate which word, or<br />
rather which concept, might be the antecedent for an anaphor. The concept groups which<br />
emerged in the association process are simply classes of semantically related words which tend<br />
to have similar contextual distributions within the domain of the text corpus. In order to indicate<br />
the usefulness of such concept classes in the process of resolving an anaphor, the test set of the<br />
EPAS list (all EPAS containing pronouns) was processed with different methods. In (4-13) the<br />
results of these methods are shown. In addition to the tests in TiMBL described in the above, the<br />
anaphors in the test set were resolved manually using the Lappin and Leass approach as<br />
described in section 2.1.2. For these tests, the sentence with the anaphor, as well as the preceding<br />
sentence, was considered in each case. This purely syntactic approach identified the correct<br />
antecedent in 16 of the 32 test instances.<br />
(4- 13)<br />
Method                      Correct assignments<br />
Syntactic method            50.00% (16/32)<br />
TiMBL                       46.87% (15/32)<br />
TiMBL with concept groups   78.12% (25/32)<br />
The results shown in (4-13) suggest that using concept groups may indeed be a useful approach<br />
in anaphora resolution. Especially in the case of anaphoric expressions where the antecedent is<br />
not clearly stated in the text, it may be useful to have an idea of which type of antecedent one<br />
might expect. 10 of the 32 EPAS containing pronouns were of this kind. The syntactic approach<br />
could naturally not resolve these anaphors, as an antecedent not clearly present in the text hardly<br />
can feature on a list of possible candidates. These types of anaphors require real-world or<br />
domain knowledge to be resolved. In the case of 4 of these 10 EPAS, the EPAS list could not be<br />
consulted to find likely antecedents. Because of the small size of the data set, some predicates<br />
only feature once. This was the case for the five predicates jobbe-utfra (work-from), kartlegge<br />
(map), ta (take), varsle (notify) and ville (want) which all only co-occur with pronouns. With the<br />
exception of jobbe-utfra, none of the antecedents in these cases can be predicted on the basis of<br />
the distribution of predicates and arguments in the EPAS list. (4-14) shows the instances where<br />
the EPAS list could be consulted in the process of finding likely antecedents for these anaphors.<br />
In the case of ha (have) and komme-i-kontakt-med (come-into-contact-with), other EPAS with<br />
the same predicates were retrieved from the EPAS list. Since ha and komme-i-kontakt-med<br />
occur in identical or very similar patterns with politi as the first argument, this would be the<br />
preferred candidate for the antecedent in (4-14a), (4-14c) and (4-14d). In the case of (4-14b), the<br />
predicate jobbe-utfra only has this one occurrence in the EPAS list. This means that similar<br />
patterns must be examined in the search for a possible antecedent. By consulting the EPAS list,<br />
it can be found that teori (theory) only occurs as a second argument in connection with politi as<br />
first argument. This would suggest that politi is a potential antecedent for the pronoun. By<br />
applying the concept groups, the list of possible antecedents motivated by the texts can be<br />
expanded to also include the other arguments which have been found to display a similar<br />
distribution to the arguments which actually co-occur with the predicate in question. In the case<br />
of the pronouns in (4-14), politi is the correct antecedent in all of the cases.<br />
(4- 14)<br />
EPAS with pronoun            similar EPAS                              antecedents from list   concepts<br />
a. ha,pron,teori             ha,etterforsker,observasjon               politi                  lensmann<br />
                             ha,politi,medarbeider                     etterforsker            Fonn<br />
                             ha,politi,teori<br />
b. jobbe-utfra,pron,teori    ha,politi,teori                           politi                  etterforsker<br />
                             forkaste,politi,teori                                             lensmann<br />
                                                                                               Fonn<br />
c. komme-i-kontakt-med,      komme-i-kontakt-med,politi,bilfører       politi                  etterforsker<br />
   pron,bilfører             komme-i-kontakt-med,politi,generic-nom                            lensmann<br />
                                                                                               Fonn<br />
d. komme-i-kontakt-med,      komme-i-kontakt-med,politi,bilfører       politi                  etterforsker<br />
   pron,syklist              komme-i-kontakt-med,politi,generic-nom                            lensmann<br />
                                                                                               Fonn<br />
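The lookup behind row (a) of (4-14) can be sketched as follows. The EPAS shown are the excerpt from (4-7)/(4-8); the function name and the restriction to a single concept group are simplifications for the example.

```python
POLICE = {"etterforsker", "politi", "lensmann", "Fonn"}  # group from (4-11)

def propose_antecedents(epas, pred, groups=(POLICE,)):
    """Collect first arguments of EPAS sharing the anaphor's predicate, then
    expand the proposals with the remaining members of their concept groups."""
    from_list = {a1 for p, a1, a2 in epas
                 if p == pred and a1 not in {"?", "pron"}}
    expanded = set(from_list)
    for group in groups:
        if from_list & group:   # any proposal belongs to a concept group...
            expanded |= group   # ...so all its members become candidates
    return from_list, expanded

epas = [  # excerpt corresponding to row (a)
    ("ha", "etterforsker", "observasjon"),
    ("ha", "politi", "medarbeider"),
    ("ha", "politi", "teori"),
]
from_list, expanded = propose_antecedents(epas, "ha")
print(sorted(from_list))  # ['etterforsker', 'politi']
print(sorted(expanded))   # ['Fonn', 'etterforsker', 'lensmann', 'politi']
```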
The examples in (4-14) indicate how the method described in this thesis can function. In cases<br />
where there is no clearly expressed antecedent in a text, or where the resolution of an antecedent
requires knowledge about the world (or knowledge about how predicates and arguments<br />
combine within a domain), the method can be of aid. Consider again the examples from the<br />
introduction, repeated in (4-15) below:<br />
(4- 15)<br />
a. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig<br />
kommer til å drepe igjen. Han etterlyser vitner som var i sentrum søndag<br />
kveld.<br />
The sergeant leading the investigation says that the perpetrator probably will<br />
kill again. He puts out a call for witnesses who were in the city centre Sunday<br />
evening.<br />
b. Lensmannen som leder etterforskningen, sier at gjerningsmannen trolig<br />
kommer til å drepe igjen. Han er observert i sentrum.<br />
The sergeant leading the investigation says that the perpetrator probably will<br />
kill again. He is observed in the city centre.<br />
As was established in chapters 1 and 2, the antecedent of the anaphor han (he) in (4-15b) cannot<br />
be resolved differently from the anaphor in (4-15a) without consulting some sort of knowledge<br />
source. However, the information present in the EPAS list can be used as domain knowledge in<br />
the process of resolving these anaphors. For the anaphor in (4-15a) other occurrences of the<br />
predicate etterlyse (call-for) can be consulted. This would produce the following list:<br />
(4- 16)<br />
etterlyse,?,bilfører<br />
etterlyse,politi,bilfører<br />
etterlyse,politi,person<br />
etterlyse,politi,syklist<br />
call-for,?,driver<br />
call-for,police,driver<br />
call-for,police,person<br />
call-for,police,biker<br />
It is clear from the list that etterlyse tends to co-occur with politi as its first argument, and that it<br />
does not occur with any other first arguments. Through the concept association we know that<br />
politi and lensmann (sergeant) both belong to the same concept group. Consequently, we also<br />
know that politi and lensmann both occur in similar environments and thus share features.<br />
Given that the possible antecedents in (4-15a) are lensmann and gjerningsmann (perpetrator),<br />
the consultation of the EPAS list and the concept groups leads us to select lensmann as the<br />
antecedent for (4-15a). In the case of (4-15b) the EPAS list is unfortunately not equally helpful.<br />
Other occurrences of observere (observe) are:<br />
(4- 17)<br />
observere,?,23-åring<br />
observere,?,bile<br />
observere,?,person<br />
observe,?,23-year-old<br />
observe,?,car<br />
observe,?,person<br />
Neither the EPAS in (4-17), nor the concept groups from section 4.2.2, give us any clues as to<br />
whether lensmann or gjerningsmann is a more likely second argument in connection with the<br />
predicate observere. This can be explained by two circumstances: firstly, the example sentences<br />
in (4-15) are constructed and therefore do not reflect examples from the data set; secondly, the<br />
small size of the data material obviously limits the extent to which one can expect all valid<br />
patterns of the domain to be found in the data set.<br />
5 Final remarks<br />
5.1 Is a parser vital for the extraction process?<br />
An initial assumption during the development of the method in this thesis was that it was of high<br />
importance to base the extraction method on a syntactic parse of the text collection. As the<br />
reader will recall, the reasons for this assumption were elaborated in chapter 3 and will therefore<br />
not be discussed further here. However, as a means of evaluating the extraction method, the<br />
texts in the text collection were processed using the Oslo-Bergen Tagger (OBT 2005). This is a<br />
part of speech (POS) tagger which among other options offers the user a syntactic<br />
disambiguation of the input text. The texts were POS tagged using the web version of the tagger<br />
and structures corresponding to subject-verb-object relations were manually extracted from the<br />
output. This yielded a list of 169 structures, 26 of them with pronouns. The structures were<br />
extracted using a quite rudimentary method; for example, no distinction was made between<br />
active and passive versions of the same predicate. This resulted in a list featuring exactly the<br />
problematic issues discussed in chapter 3; the arguments were represented (and subsequently<br />
structured) not according to thematic roles, but merely according to their syntactic roles in the<br />
sentence. As a result, the list did not reflect characteristic arguments of the different predicates<br />
to the same degree as the EPAS list did. The list of the POS-based structures is available in<br />
Appendix F.<br />
Consider the FCA diagram in Figure 7 below. Figure 7 shows part of the FCA diagram created<br />
for the POS-based structures; the section of the diagram with the argument politi (police) as<br />
starting point is highlighted. When comparing this figure to the corresponding figure for the<br />
EPAS list (Figure 5 in section 3.6.4), it is quite clear that the POS-based list is significantly less<br />
generalisable. There are no clear groupings of arguments which display specific behaviour<br />
through their combination with a certain subset of predicates. Because formal subjects of both<br />
active and passive sentences are realised as first arguments in this extraction, it is hardly<br />
possible to group arguments into groups of semantically related words based on their<br />
distribution. As can be seen from the diagram, politi co-occurs with both sykepleiestudent<br />
(student nurse) and bilfører (driver), as well as other, more relevant, arguments.<br />
Figure 7<br />
Interestingly enough, however, the POS-based list of structures proved to be just as well suited<br />
as the EPAS list for subsequent classification using TiMBL. When training and testing the<br />
classifier on the POS-based structures, it assigned the correct antecedent in 57.69% (15/26) of<br />
the test cases. In comparison, the EPAS classifier performed with an accuracy of 57.69% when<br />
trained and tested on argument 1, and with an overall accuracy of 46.87%.<br />
These results are interesting mainly because they show that for the purposes of using a memory<br />
based classifier, an extraction method based on a syntactic parser does not necessarily provide<br />
better results than a POS-tagger based method. Even though the list of extracted structures was<br />
decidedly poorer than the EPAS list, especially because it contained “wrong” information in the<br />
sense that logical objects were listed as subjects by virtue of their syntactic role, it provided<br />
useful input for the classification process. It is, however, as suggested by the FCA diagram<br />
above, likely that the POS-based list would be of less use for the concept association phase,<br />
since this approach relies on the presence of similar entities in similar positions in the structures.<br />
As a conclusion, it can probably be stated that the advantages of using a syntactic parser in the<br />
extraction process are less clear than first presumed. For the purposes of aiding anaphora<br />
resolution, it may well be that an extraction method performs equally well when based on<br />
shallower processing methods.<br />
5.2 Summary and conclusions<br />
This thesis has described a method for corpus-based semantic categorisation of predicates and<br />
arguments in a limited thematic domain. The aim of the project was to create a means of<br />
automatically inferring selectional restrictions corresponding to real-world knowledge of the<br />
domain of the text collection. The classification of the predicates and arguments extracted from<br />
the text collection resulted in several concept groups, where each concept group displayed a<br />
particular distribution in the text collection.<br />
In the introduction of the thesis, it was stated that a chief goal of the project was to assess the<br />
value of using co-occurrence patterns to create concept groups which can act as an aid in the<br />
process of pronoun resolution. The concept groups were thought to function as an intuition<br />
about which word to expect in a given environment. Two criteria were formulated with regards<br />
to the evaluation of the results obtained by the project:<br />
• were the concept groups created valid for the domain of the text collection?<br />
• were the concept groups useful in the process of anaphora resolution?<br />
Through classification and testing of the extracted data set some remarks can be made with<br />
regards to these two criteria. The concept groups that emerged as a product of the association<br />
performed in the classification phase did indeed seem to constitute valid groupings of<br />
semantically similar words. The concept groups were made based on the contextual distribution<br />
of arguments in the text collection and represent groups of words which “keep the same<br />
company” and tend to occur in similar environments. They are valid groupings for the domain of<br />
the text collection and confirm the intuition that similar words display similar distribution, and<br />
thus similar behaviour in the data set. The tests performed with the concept groups show that<br />
they do contribute to heightening the success rate of the MBL classifier; when testing with<br />
EPAS containing pronouns the classifier assigned the correct concept group as antecedent in<br />
78% of the instances, in comparison to an almost 47% success rate without concept groups.<br />
When testing on knowledge-dependent anaphors and on anaphors which do not have an<br />
explicitly mentioned antecedent in the text, it was evident that concept groups contribute<br />
interesting information. Ideally, a referent-guessing helper using concept groups should be<br />
consulted as part of an anaphora resolution system. In the event of several possible antecedent<br />
candidates motivated from the text and proposed by the system, the concept groups in<br />
connection with the context pattern of the anaphor can provide useful information about which<br />
type of antecedent is likely. In this way the concept groups represent information about the valid<br />
contextual patterns for the domain.<br />
The stumbling block of the method in this thesis is its limited scale. The data set used<br />
for the analyses is fairly small, and as a consequence the results are less powerful than they could<br />
have been. The extraction method is at best semi-automatic and relies on far too much manual<br />
intervention. This is a recurring problem for many methods within the field of NLP; Mitkov,<br />
for example, notes that “only a few anaphora resolution systems operate in fully automatic mode”<br />
(Mitkov 2001, p. 111). Most systems rely on manual pre-editing of the input texts, and some<br />
methods are only manually simulated. In order for a method to be fully automatic, there should<br />
be no human intervention at any stage (Mitkov 2001, p. 114). In the case of the project described<br />
in this thesis, the extraction method involves far too much manual manipulation to be considered automatic.<br />
The scope of the results is naturally influenced by the limitations of the data set, but regardless<br />
of the size of the data set and the manual intervention employed in the extraction phase, the<br />
method shows promising results. It was clear from the beginning that this would be a pilot study<br />
aiming at providing an indication of the usefulness of the method.<br />
In view of the results, it can be stated that using contextual distribution to derive intuitions about<br />
selectional restrictions in a limited domain is a promising venture. The results obtained in this<br />
project suggest that the distribution of predicates and arguments within a closed domain has<br />
potential use as a representation of real-world knowledge. More definite conclusions about the<br />
extent to which such a method captures enough relevant intuitions about real-world knowledge<br />
to replace it in an anaphora resolution system can, however, only be drawn from a<br />
larger-scale study.<br />
6 References<br />
Asudeh, Ash and Mary Dalrymple. (2004): Binding Theory. Working paper.<br />
Available at: www.ling.canterbury.ac.nz/personal/asudeh/pdf/asudeh-dalrymple-binding.pdf<br />
Baldwin, Breck. (1997): CogNIAC: high precision coreference with limited knowledge and linguistic<br />
resources. Proceedings of the ACL’97/EACL’97 Workshop on Operational Factors in Practical, Robust<br />
Anaphora Resolution (Madrid), pp. 38-45.<br />
Available at http://acl.eldoc.ub.rug.nl/mirror/W/W97/index.html<br />
Botley, Simon and Tony McEnery. (2000): Discourse anaphora: The need for synthesis. Chapter 1 in<br />
Botley and McEnery (eds): Corpus-based and Computational Approaches to Discourse Anaphora. John<br />
Benjamins Publishing Company, pp. 1-41.<br />
Bresnan, Joan. (2001): Lexical-Functional Syntax. Blackwell.<br />
Carbonell, Jamie G. and Ralf D. Brown. (1988): Anaphora Resolution: A Multi-Strategy Approach.<br />
Proceedings of the 12th International Conference on Computational Linguistics (COLING’88, Budapest),<br />
pp. 96-101.<br />
Available at: http://acl.ldc.upenn.edu/C/C88/C88-1021.pdf<br />
CognIT website (2004): http://www.cognit.no/<br />
Consulted 23/11-2004<br />
Copestake, A., D. Flickinger, I. Sag, C. Pollard. (2003): Minimal Recursion Semantics. An Introduction.<br />
Working paper.<br />
Available at: http://lingo.stanford.edu/sag/papers/copestake.pdf<br />
Cover, T. M. and P. E. Hart. (1967): Nearest neighbor pattern classification. Institute of Electrical and<br />
Electronics Engineers Transactions on Information Theory, pp. 21-27.<br />
Available at: http://yreka.stanford.edu/~cover/papers/transIT/0021cove.pdf<br />
Daelemans, Walter, A. van den Bosch and J. Zavrel. (1999): Forgetting Exceptions is Harmful in<br />
Language Learning. Machine Learning 34, special issue in natural language learning, pp. 11-43.<br />
Available at: http://ilk.kub.nl/pub/papers/harmful.ps<br />
Daelemans, Walter, J. Zavrel, K. van der Sloot, A. van den Bosch. (2003): TiMBL: Tilburg Memory<br />
Based Learner, version 5.0, Reference Guide. ILK Technical Report 03-10.<br />
Available at: http://ilk.uvt.nl/downloads/pub/papers/ilk0310.ps.gz<br />
Dagan, Ido and Alon Itai. (1990): Automatic Processing of Large Corpora for the Resolution of<br />
Anaphora References. Proceedings of the 13th International Conference on Computational Linguistics<br />
(COLING ’90, Helsinki), pp. 330-332.<br />
Available at: http://acl.ldc.upenn.edu/C/C90/C90-3063.pdf<br />
Dagan, Ido, S. Marcus, S. Markovitch. (1995): Contextual word similarity and estimation from sparse<br />
data. In 30th Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio. Ohio<br />
State University, Association for Computational Linguistics, Morristown, New Jersey, pp. 164-171.<br />
Available at: http://citeseer.ist.psu.edu/article/dagan95contextual.html<br />
Firth, J. R. (1957): A synopsis of linguistic theory, 1930-55. In Studies in Linguistic Analysis,<br />
Philological Society, Oxford; reprinted in F. R. Palmer (ed.) (1968): Selected Papers of J. R. Firth 1952-<br />
59. Longman, pp. 168-205.<br />
Grefenstette, Gregory. (1992): SEXTANT: Exploring unexplored contexts for semantic extraction from<br />
syntactic analysis. Proceedings, 30th Annual Meeting of the Association for Computational Linguistics,<br />
pp. 324-326.<br />
Available at http://citeseer.ist.psu.edu/grefenstette92sextant.html<br />
Harris, Zellig S. (1968). Mathematical Structures of Language. New York: Wiley.<br />
Hellan, Lars. (1988): Anaphora in Norwegian and the Theory of Grammar. No 32 in Studies in<br />
Generative Grammar. Foris Publications, the Netherlands.<br />
Hindle, Donald. (1990): Noun classification from predicate-argument structures. In Proceedings of the<br />
28th annual meeting of the Association for Computational Linguistics, pp. 268-275.<br />
Available at http://citeseer.ist.psu.edu/hindle90noun.html<br />
ILK website (2004): http://ilk.kub.nl/<br />
Consulted 12/12-2004<br />
Jurafsky, Daniel and James H. Martin. (2000): Speech and Language Processing. An Introduction to<br />
Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall.<br />
Kamp, Hans and Uwe Reyle. (1993): From Discourse to Logic. Introduction to Modeltheoretic<br />
Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Kluwer Academic<br />
Publishers (the Netherlands).<br />
KunDoc website (2004): http://www.kundoc.net/<br />
Consulted 23/11-2004<br />
Lin, Dekang. (1997): Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity. In<br />
Proceedings of ACL-97 (Madrid), pp. 64-71.<br />
Available at: http://citeseer.ist.psu.edu/article/lin97using.html<br />
Lin, Dekang. (1998): Automatic Retrieval and Clustering of Similar Words. In Proceedings of<br />
COLING-ACL '98 (Montreal), pp. 768-774.<br />
Available at: http://citeseer.ist.psu.edu/16998.html<br />
Lin, Dekang and Patrik Pantel. (2001): Induction of Semantic Classes from Natural Language Text. In<br />
Proceedings of SIGKDD-01 (San Francisco), pp. 317-322.<br />
Available at: http://citeseer.ist.psu.edu/lin01induction.html<br />
Mani, Inderjeet. (2001): Automatic summarization. John Benjamins.<br />
Matthews, P. H. (1997): The Oxford Concise Dictionary of Linguistics. Oxford University Press.<br />
Miller, G. and C. Leacock (2000): Lexical representations for sentence processing. Chapter 8 in Y.<br />
Ravin and C. Leacock (ed.): Polysemy: Theoretical and computational approaches. Oxford University<br />
Press.<br />
Mitkov, Ruslan. (1999): Anaphora Resolution: The State of the Art. Working paper, University of<br />
Wolverhampton.<br />
Available at: http://citeseer.ist.psu.edu/mitkov99anaphora.html<br />
Mitkov, Ruslan. (2001): Outstanding issues in anaphora resolution. In: Alexander Gelbukh (ed):<br />
Computational Linguistics and Intelligent Text Processing, pp. 110-125.<br />
Mitkov, Ruslan. (2003): Anaphora Resolution. Chapter 14 in Mitkov (ed): The Oxford Handbook of<br />
Computational Linguistics. Oxford University Press, pp. 266-283.<br />
Nasukawa, Tetsuya. (1994): Robust method of pronoun resolution using full-text information.<br />
Proceedings of the 15th International Conference on Computational Linguistics (COLING’94, Kyoto),<br />
pp.1157-1163.<br />
Available at: http://acl.eldoc.ub.rug.nl/mirror/C/C94/index.html<br />
NorGram website (2004): http://www.hf.uib.no/i/LiLi/SLF/Dyvik/norgram/<br />
Consulted 23/11-2004<br />
OBT (2005): Oslo-Bergen-taggeren<br />
Available at: http://decentius.aksis.uib.no/cl/cgp/obt.html<br />
Pantel, Patrick and Dekang Lin (2002): Discovering word senses from text. In Proceedings of ACM<br />
SIGKDD Conference on Knowledge Discovery and Data Mining (Edmonton), pp. 613-619.<br />
Pereira, Fernando, N. Tishby, L. Lee. (1993): Distributional clustering of English words. Proceedings of<br />
the 31st Annual Meeting of the ACL, pp. 183-190.<br />
Available at: http://acl.eldoc.ub.rug.nl/mirror/P/P93/index.html<br />
Robins, R. H. (1997): A Short History of Linguistics. Longman.<br />
Saeed, John I. (1997): Semantics. Blackwell.<br />
Velldal, Erik. (2003): Modelling Word Senses With Fuzzy Clustering. Cand. Philol. Thesis in Language,<br />
Logic and Information. University of Oslo.<br />
Wolff, Karl Erich. (1994): A first course in formal concept analysis. In: Faulbaum, F. (ed): SoftStat’93<br />
Advances in Statistical Software 4, pp. 429-438.<br />
Appendix A: Ekstraktor.pl – algorithm<br />
The algorithm behind Ekstraktor is divided into two separate parts: information retrieval<br />
from the Prolog file and processing of the information that was found and stored.<br />
First a Prolog output file is opened and each line of the file is read. Based on pattern matching,<br />
lines from the file are stored in different arrays according to which pattern they<br />
match.<br />
Subsequent to the information-extraction from the Prolog file, the information stored in<br />
the arrays is processed for the purpose of creating predicate-argument structures. In the<br />
following, I will give a brief outline of the processing steps. I will do this by describing<br />
each of the central functions in Ekstraktor.<br />
The term epmor (eng: ep mother) corresponds to the first EP in the ARG0ep-array, in<br />
most cases meaning the EP “in question”.<br />
finnHoved();<br />
Finds the semantic forms of the main/first predicate-argument structure in the sentence.<br />
This function calls the following (sub)functions:<br />
finnEP1();<br />
Since the entities parsed are full sentences, the main structure is limited to having a verb<br />
as its head. This function searches the array catsuff for a pattern with the first member of<br />
ARG0ep as its EP. If such a pattern is found, the EP is discarded and the first members of<br />
arrays ARG0ep and ARG0verdi are removed.<br />
finnPred();<br />
Finds the semantic value of the sentence’s predicate/ARG0. Goes through the array<br />
semform searching for a pattern with the first member of ARG0ep as EP. If such a pattern<br />
is found, the semantic form is retrieved and stored in the array predikat.<br />
In order to avoid an “empty” semantic form when the argument is a proper noun, it is<br />
checked whether the retrieved form matches named. If so, the array navn is searched for a<br />
pattern with the first member of ARG0ep as EP. If such an entry is found, predikat is<br />
emptied and the new semantic form is stored there.<br />
Some predicates have an extra attribute which is stored in the array prt. Each line in this<br />
array is searched for a pattern with the first member of ARG0ep as EP. If such an entry is<br />
found, the semantic form is retrieved and stored in the array ekstra.<br />
lagVerbStruktur();<br />
Creates the correct verbal structure for the predicate. This is for the cases where the<br />
predicate has an additional attribute – as in the predicate “lete etter” (Eng: look for). The<br />
function checks if there are any members in the array ekstra. If so, the main predicate and<br />
this extra attribute are stored in the array hovedpred.<br />
If there is nothing stored in ekstra, the main predicate is simply stored in the array<br />
hovedpred.<br />
finnARG1();<br />
Returns the semantic form of argument 1 and stores it in the array ARG1. First the arrays<br />
ARGxep, ARGxverdi, ep and ARGx are emptied and subsequently set to the corresponding<br />
argument 1 values. Then finnARGx() is called.<br />
finnARGx();<br />
Generalized function that finds the EP where the semantic form of the argument<br />
in question is stored, calls finnARGxsemform() and returns the semantic form.<br />
For ARG1 the actions are as follows:<br />
Goes through each member in ARG1ep. If an entry matches the first member of ARG0ep as EP, the<br />
entry on the same index in ARG1verdi is stored as ARGx. Goes through each<br />
member in ARG0verdi. If ARGx matches an entry in ARG0verdi, the entry on the<br />
same index in ARG0ep is retrieved and stored in the array ep.<br />
finnARGxsemform() is called.<br />
finnARGxsemform();<br />
Generalized function that finds the semantic form of the argument in<br />
question.<br />
For ARG1 the actions are as follows:<br />
Find semantic form of double predicates, if there are any:<br />
The variable epARGx is set to the first member of the array ep (this array<br />
holds the indexes of EPs where the semantic form of ARGx is stored).<br />
Goes through the array index (which holds pointers to semantic forms of double<br />
arguments); if an entry matches epARGx as EP, the index pointer is<br />
retrieved and stored in the array ARGxind. Goes through semform; if an<br />
entry matches epARGx as EP, it is removed from the array.<br />
If there are any entries in ARGxind, each member is looked at. If an entry<br />
matches an entry in ARG0verdi, the entry on the same index in ARG0ep is<br />
added to the array liste. The array semform is gone through; if an entry<br />
matches an entry from liste as EP, the semantic form is retrieved and<br />
stored in the array ARGx.<br />
Else find the semantic form of the argument in question:<br />
Goes through semform; if an entry matches epARGx as EP, the semantic<br />
form is retrieved and stored in the array ARGx. If the element stored in<br />
ARGx matches ‘named’, the proper noun must be found. The array navn is<br />
searched for a pattern with epARGx as EP. The semantic form is retrieved<br />
and stored in the array ARGx.<br />
The contents of the array ARGx are stored in the array ARG1.<br />
The contents of ARG1 are stored as HovedARG1 in finnHoved().<br />
90
finnARG2();<br />
This function behaves exactly like finnARG1(), only with<br />
correspondingly different variable and array names.<br />
The contents of ARG2 are stored as HovedARG2 in finnHoved().<br />
fjernEP();<br />
Removes elements from the arrays ARG1ep, ARG1verdi, ARG2ep and ARG2verdi if they<br />
belong to the main EP.<br />
Goes through ARG1ep and ARG2ep. If the first member of ARG0ep matches the entry,<br />
the entry and the entry on the same index in the value-array is removed.<br />
The first entry in ARG0ep and ARG0verdi is subsequently removed.<br />
sjekkEkstra();<br />
Checks if there are more predicate-argument structures to be found, calls finnResten() if<br />
there are.<br />
Goes through ARG1ep and ARG2ep trying to match each element in ARG0ep. If there is a<br />
match, there exists a predicate with an associated argument, and finnResten() is called.<br />
finnResten();<br />
Finds the remaining predicate-argument structures.<br />
Calls the following (sub)functions:<br />
finnPred();<br />
finnARG1();<br />
finnARG2();<br />
fjernEP();<br />
sjekkEkstra();<br />
lagStruktur();<br />
Creates the predicate-argument structures as printed to the output file.<br />
If HovedARG1 or HovedARG2 contains more than one element, each element is printed<br />
together with the predicate and the other argument.<br />
Otherwise, hovedpred, HovedARG1 and HovedARG2 are printed to file, separated by commas.<br />
Appendix B: Ekstraktor.pl – program code<br />
Perl script Ekstraktor.pl<br />
#opens the file given on the command line when the program is run<br />
open(FIL, $ARGV[0]) or die("Kan ikke åpne filen!!\n");<br />
#reads each line of the file and stores it in different arrays depending on what is read;<br />
#this stores all information needed to extract the pred-arg structures<br />
while ($linjeFraFil = <FIL>) {<br />
#stores the index value in @ARG0ep and the arg0 value in @ARG0verdi if the line from the file contains ARG0<br />
if ($linjeFraFil =~ m/ARG0/){<br />
henteVerdi();<br />
push(@ARG0ep, $ep);<br />
push(@ARG0verdi, $verdi);<br />
}<br />
#stores the index value in @ARG1ep and the arg1 value in @ARG1verdi if the line from the file contains ARG1<br />
if ($linjeFraFil =~ m/ARG1/){<br />
henteVerdi();<br />
push(@ARG1ep, $ep);<br />
push(@ARG1verdi, $verdi);<br />
}<br />
#stores the index value in @ARG2ep and the arg2 value in @ARG2verdi if the line from the file contains ARG2<br />
if ($linjeFraFil =~ m/ARG2/){<br />
henteVerdi();<br />
push(@ARG2ep, $ep);<br />
push(@ARG2verdi, $verdi);<br />
}<br />
#stores the index value in @ARG3ep and the arg3 value in @ARG3verdi if the line from the file contains ARG3<br />
if ($linjeFraFil =~ m/ARG3/){<br />
henteVerdi();<br />
push(@ARG3ep, $ep);<br />
push(@ARG3verdi, $verdi);<br />
}<br />
#stores the line in @restriksjoner if it contains 'BODY'<br />
if ($linjeFraFil =~ m/'BODY'/){<br />
push(@restriksjoner, $linjeFraFil);<br />
}<br />
#stores the line in @restriksjoner if it contains 'RSTR'<br />
if ($linjeFraFil =~ m/'RSTR'/){<br />
push(@restriksjoner, $linjeFraFil);<br />
}<br />
#stores the line in @semform if it contains 'semform'<br />
if ($linjeFraFil =~ m/'relation'\),semform\(/){<br />
push(@semform, $linjeFraFil);<br />
}<br />
#stores the line in @cat if it contains '_CAT'<br />
if ($linjeFraFil =~ m/'_CAT'\)/){<br />
push(@cat, $linjeFraFil);<br />
}<br />
#stores the line in @catsuff if it contains '_CATSUFF'<br />
if ($linjeFraFil =~ m/'_CATSUFF'\)/){<br />
push(@catsuff, $linjeFraFil);<br />
}<br />
#stores the line in @prt if it contains '_PRT'<br />
if ($linjeFraFil =~ m/'_PRT'\)/){<br />
push(@prt, $linjeFraFil);<br />
}<br />
#stores the line in @index if it contains 'L-INDEX'<br />
if ($linjeFraFil =~ m/'L-INDEX'\)/){<br />
push(@index, $linjeFraFil);<br />
}<br />
#stores the line in @index if it contains 'R-INDEX'<br />
if ($linjeFraFil =~ m/'R-INDEX'\)/){<br />
push(@index, $linjeFraFil);<br />
}<br />
#stores the line in @navn if it contains 'CARG'<br />
if ($linjeFraFil =~ m/'CARG'\)/){<br />
push(@navn, $linjeFraFil);<br />
}<br />
} #end while loop<br />
close(FIL);<br />
#Here the processing of the information extracted from the input file begins:<br />
#removes EPs containing information to be disregarded<br />
fjernRestri();<br />
#removes the first EP if it does not have category 'v'<br />
finnCat();<br />
#print("ARG0ep = \n@ARG0ep\nARG0verdi = \n@ARG0verdi\nARG1ep = \n@ARG1ep\nARG1verdi = \n@ARG1verdi\nARG2ep = \n@ARG2ep\nARG2verdi = \n@ARG2verdi\n");<br />
#finds the main structure<br />
finnHoved();<br />
#print("ARG0ep = \n@ARG0ep\nARG0verdi = \n@ARG0verdi\nARG1ep = \n@ARG1ep\nARG1verdi = \n@ARG1verdi\nARG2ep = \n@ARG2ep\nARG2verdi = \n@ARG2verdi\n");<br />
#print("@semform\n");<br />
#print("@navn\n");<br />
#finds predicate-argument structure no. 2<br />
#sjekkEkstra();<br />
#finnResten();<br />
#appends the predicate-argument structures to the end of the given file<br />
open(OUTPUTFIL, ">>strukturer.txt") or die("kan ikke skrive til fil\n");<br />
#open(OUTPUTFIL, ">>home/unni/Hovedoppgave/parse/pas-strukturer.txt") or die("kan ikke skrive til fil\n");<br />
sjekkEkstra();<br />
#creates the main structure<br />
lagStruktur();<br />
close(OUTPUTFIL);<br />
#here come all the subfunctions:<br />
#henteVerdi():<br />
#extracts the relation index and the value of ARGx from a line from<br />
#the input file and stores them in $ep and $verdi<br />
#The line from the file is split at commas and stored in @utenKomma. The values are extracted with substr().<br />
sub henteVerdi {<br />
@utenKomma = split(/,/, $linjeFraFil);<br />
push(@args, @utenKomma);<br />
$ep = substr(@utenKomma[1], 12, 2);<br />
if ($ep =~ /\)/){<br />
$ep = split(/\)/, $ep);<br />
}<br />
$verdi = substr(@utenKomma[3], 4, 2);<br />
if ($verdi =~ /\)/){<br />
$verdi = split(/\)/, $verdi);<br />
}<br />
}<br />
#finnHoved:<br />
#finds the main pred-arg structure<br />
#usually predicate,arg1,arg2<br />
sub finnHoved {<br />
finnEP1();<br />
finnPred();<br />
lagVerbStruktur();<br />
finnARG1();<br />
@HovedARG1 = @ARG1;<br />
finnARG2();<br />
@HovedARG2 = @ARG2;<br />
fjernEP();<br />
}<br />
sub sjekkEkstra {<br />
foreach $element (@ARG0ep){<br />
foreach $element2 (@ARG1ep){<br />
if ($element =~ $element2){<br />
#print("match!\n");<br />
finnResten();<br />
}<br />
}<br />
foreach $element3 (@ARG2ep){<br />
if ($element =~ $element3){<br />
#print("match2!\n");<br />
finnResten();<br />
}<br />
}<br />
}<br />
#print("ARG0ep: @ARG0ep\nARG1ep: @ARG1ep\nARG2ep: @ARG2ep\n");<br />
}<br />
sub finnResten {<br />
finnPred();<br />
finnARG1();<br />
finnARG2();<br />
print(OUTPUTFIL "@predikat,@ARG1,@ARG2\n");<br />
print("@predikat,@ARG1,@ARG2\n");<br />
fjernEP();<br />
splice (@predikat);<br />
splice (@ARG1);<br />
splice (@ARG2);<br />
sjekkEkstra();<br />
}<br />
sub fjernEP{<br />
#removes elements from @ARG1ep/verdi and @ARG2ep/verdi if they belong to the main EP<br />
$epmor = $ARG0ep[0];<br />
for ($i = 0; $i < @ARG1ep; $i++){<br />
if ($epmor =~ $ARG1ep[$i]){<br />
splice(@ARG1ep, $i, 1);<br />
splice(@ARG1verdi, $i, 1);<br />
#print("@semform\n");<br />
}<br />
}<br />
for ($i = 0; $i < @ARG2ep; $i++){<br />
if ($epmor =~ $ARG2ep[$i]){<br />
splice(@ARG2ep, $i, 1);<br />
splice(@ARG2verdi, $i, 1);<br />
}<br />
}<br />
shift(@ARG0ep);<br />
shift(@ARG0verdi);<br />
}<br />
sub finnCat {<br />
foreach $linje (@cat){<br />
$epmor = $ARG0ep[0];<br />
if ($linje =~ m/\(attr\(var\($epmor\)/){<br />
@utenDings = split(/\'/, $linje);<br />
push (@args, @utenDings);<br />
$epcat = substr(@utenDings[3],0,1);<br />
#print("$epcat\n");<br />
if ($epcat !~ /v/){<br />
shift(@ARG0ep);<br />
shift(@ARG0verdi);<br />
}<br />
}<br />
}<br />
}<br />
#finnEP1():<br />
#finds the EP that is to be the starting point of the predicate-argument structure<br />
#$epmor is set to the first element of the ARG0 array<br />
#goes through @catsuff; if the line read matches $epmor, the first element of @ARG0ep and @ARG0verdi is removed<br />
sub finnEP1 {<br />
foreach $linje (@catsuff){<br />
$epmor = $ARG0ep[0];<br />
if ($linje =~ m/\(attr\(var\($epmor\)/){<br />
shift(@ARG0ep);<br />
shift(@ARG0verdi);<br />
}<br />
}<br />
}<br />
#finnPred():<br />
#finds the semantic value of ARG0/the predicate in the sentence<br />
#$epmor is set to the first element of @ARG0ep<br />
#if the line read matches $epmor and 'semform', it is split at ' and the elements are stored in @utenDings<br />
#the predicate is set to the fourth element of @utenDings<br />
#if the line matches $epmor and '_PRT', the line is split at ' and the semantic form is added to @ekstra<br />
sub finnPred {<br />
$epmor = $ARG0ep[0];<br />
foreach $linje (@semform){<br />
if ($linje =~ /\(attr\(var\($epmor\),'relation'\),semform/){<br />
@utenDings = split(/\'/, $linje);<br />
push(@args, @utenDings);<br />
@pred = @utenDings[3];<br />
push(@predikat, @pred);<br />
}<br />
}<br />
if ($predikat[0] =~ /named/){<br />
foreach $verdi (@navn){<br />
if ($verdi =~ /\(attr\(var\($epmor\)/){<br />
splice(@pred);<br />
splice(@predikat);<br />
@uten = split(/\'/, $verdi);<br />
push(@arg, @uten);<br />
@pred = @uten[3];<br />
push(@predikat, @pred);<br />
#print("@predikat\n");<br />
}<br />
}<br />
}<br />
foreach $linje (@prt){<br />
if ($linje =~ /\(attr\(var\($epmor\)/){<br />
@utenDings = split(/\'/, $linje);<br />
push(@args, @utenDings);<br />
@ekstr = @utenDings[3];<br />
push(@ekstra, @ekstr);<br />
}<br />
}<br />
}<br />
sub finnARG1 {<br />
$imax = 0;<br />
splice(@ARGxep);<br />
splice(@ARGxverdi);<br />
splice(@ep);<br />
splice(@ARGx);<br />
$imax = @ARG1ep;<br />
@ARGxep = @ARG1ep;<br />
@ARGxverdi = @ARG1verdi;<br />
finnARGx();<br />
@ARG1 = @ARGx;<br />
}<br />
sub finnARG2 {<br />
$imax = 0;<br />
splice(@ARGxep);<br />
splice(@ARGxverdi);<br />
splice(@ep);<br />
splice(@ARGx);<br />
$imax = @ARG2ep;<br />
@ARGxep = @ARG2ep;<br />
@ARGxverdi = @ARG2verdi;<br />
finnARGx();<br />
@ARG2 = @ARGx;<br />
}<br />
sub finnARG3 {<br />
$imax = 0;<br />
splice(@ARGxep);<br />
splice(@ARGxverdi);<br />
splice(@ep);<br />
splice(@ARGx);<br />
$imax = @ARG3ep;<br />
@ARGxep = @ARG3ep;<br />
@ARGxverdi = @ARG3verdi;<br />
finnARGx();<br />
@ARG3 = @ARGx;<br />
}<br />
sub finnARGx {<br />
$epmor = $ARG0ep[0];<br />
for ($i = 0; $i < $imax; $i++){<br />
if ($epmor =~ /$ARGxep[$i]/){<br />
$ARGx = $ARGxverdi[$i];<br />
$imax2 = @ARG0verdi;<br />
for ($ii = 0; $ii < $imax2; $ii++){<br />
if ($ARGx =~ /$ARG0verdi[$ii]/){<br />
push(@ep, $ARG0ep[$ii]);<br />
}<br />
}#end for2<br />
}<br />
}#end for1<br />
finnARGxsemform();<br />
}<br />
#fjernRestri():<br />
#sets @ARGxep and @ARGxverdi to the ARG0 values<br />
#runs restrik()<br />
sub fjernRestri {<br />
@ARGxep = @ARG0ep;<br />
@ARGxverdi = @ARG0verdi;<br />
restrik();<br />
@ARG0ep = @ARGxep;<br />
@ARG0verdi = @ARGxverdi;<br />
}<br />
#restrik():<br />
#Goes through @restriksjoner and @index and removes values from @ARG0ep and @ARG0verdi if<br />
#these arrays contain information about them<br />
sub restrik {<br />
$imax = @ARGxep;<br />
for ($i = 0; $i < $imax; $i++){<br />
foreach $linje (@restriksjoner){<br />
if ($linje =~ /\(attr\(var\($ARGxep[$i]\)/){<br />
splice(@ARGxep, $i, 1);<br />
splice(@ARGxverdi, $i, 1);<br />
}<br />
}<br />
}<br />
foreach $linje (@index){<br />
@utenKomma = split(/,/, $linje);<br />
push(@args, @utenKomma);<br />
$ep = substr(@utenKomma[1], 12, 2);<br />
push(@indexep, $ep);<br />
}<br />
#print("@indexep\n");<br />
#print("@semform\n");<br />
foreach $linje (@indexep){<br />
for ($i = 0; $i < @semform; $i++){<br />
if ($semform[$i] =~ /\(attr\(var\($linje\)/){<br />
splice(@semform, $i, 1);<br />
#print("@semform\n");<br />
}<br />
}<br />
}<br />
}<br />
#restrimatch():<br />
#removes EPs that do not contain information about the semantic form<br />
#goes through each element of @restriksjoner and for each element $epARGx is set to the first element of @ep<br />
#if the element from @restriksjoner contains $epARGx as index value, it is removed from @ep<br />
sub restrimatch {<br />
foreach $linje (@restriksjoner){<br />
$epARGx = $ep[0];<br />
if ($linje =~ m/\(attr\(var\($epARGx\)/){<br />
shift(@ep);<br />
}<br />
}<br />
}<br />
#restrimatch for double arguments:<br />
#same procedure as for restrimatch(), but with other variables etc.<br />
sub restrimatch4 {<br />
$imax = @liste;<br />
for ($i = 0; $i < $imax; $i++){<br />
foreach $linje (@restriksjoner){<br />
$epARGx = $liste[$i];<br />
if ($linje =~ m/\(attr\(var\($epARGx\)/){<br />
splice(@liste,$i,1);<br />
}<br />
}<br />
}<br />
}<br />
#MODULARISED VERSION - GENERIC FUNCTION FOR FINDING THE SEMANTIC FORM<br />
#finnARGxsemform():<br />
sub finnARGxsemform {<br />
$epARGx = $ep[0];<br />
foreach $linje (@index){<br />
if ($linje =~ /\(attr\(var\($epARGx\)/){<br />
@utenKomma = split(/,/, $linje);<br />
push(@args, @utenKomma);<br />
$verdi = substr(@utenKomma[3],4,2);<br />
push(@ARGxind, $verdi);<br />
for ($i = 0; $i < @semform; $i++){<br />
if ($semform[$i] =~ /\(attr\(var\($epARGx\)/){<br />
splice(@semform, $i, 1);<br />
}<br />
}<br />
}<br />
}<br />
#finds the EPs where an element of @ARGxind is the value of ARG0 and stores them in the array @liste<br />
if (@ARGxind != 0){<br />
foreach $element (@ARGxind){<br />
$imax = @ARG0verdi;<br />
for ($i = 0; $i < $imax; $i++){<br />
if($element =~ /$ARG0verdi[$i]/){<br />
push(@liste, $ARG0ep[$i]);<br />
}<br />
}<br />
}<br />
foreach $linje (@semform){<br />
foreach $element (@liste){<br />
if ($linje =~ /\(attr\(var\($element\)/){<br />
@utenDings = split(/\'/, $linje);<br />
push(@args, @utenDings);<br />
$ARG = @utenDings[3];<br />
push(@ARGx, $ARG);<br />
}<br />
}<br />
}<br />
}<br />
else{<br />
foreach $linje (@semform){<br />
if ($linje =~ /\(attr\(var\($epARGx\)/){<br />
@utenDings = split(/\'/, $linje);<br />
push(@args, @utenDings);<br />
@ARGx = @utenDings[3];<br />
}<br />
}<br />
if ($ARGx[0] =~ /named/){<br />
foreach $verdi (@navn){<br />
if ($verdi =~ /\(attr\(var\($epARGx\)/){<br />
splice(@ARGx);<br />
#splice(@predikat);<br />
@uten = split(/\'/, $verdi);<br />
push(@arg, @uten);<br />
@ARGx = @uten[3];<br />
#push(@predikat, @pred);<br />
#print("@predikat\n");<br />
}<br />
}<br />
}<br />
}<br />
}<br />
#end finnARGxsemform<br />
#lagStruktur():<br />
#creates the predicate-argument structure to be written to the output file<br />
#if @HovedARG1 or @HovedARG2 contains more than one element, each element is printed separately<br />
#@hovedpred, @HovedARG1 and @HovedARG2 are written to the output file<br />
sub lagStruktur {<br />
#lagVerbStruktur();<br />
#creates the correct arg1 structure<br />
if (@HovedARG1 > 1){<br />
foreach $element (@HovedARG1){<br />
print(OUTPUTFIL "\n@hovedpred,$element,@HovedARG2\n");<br />
print("@hovedpred,$element,@HovedARG2\n");<br />
}<br />
}<br />
# print(OUTPUTFIL "\n@hovedpred,@ARG1sem[0],@ARG2,$ARG3\n");<br />
if (@HovedARG2 > 1){<br />
foreach $element (@HovedARG2){<br />
print(OUTPUTFIL "\n@hovedpred,@HovedARG1,$element\n");<br />
print("@hovedpred,@HovedARG1,$element\n");<br />
}<br />
}<br />
else {<br />
print(OUTPUTFIL "\n@hovedpred,@HovedARG1,@HovedARG2\n");<br />
print("@hovedpred,@HovedARG1,@HovedARG2\n");<br />
}<br />
#if (@predikat != 0){<br />
# print(OUTPUTFIL "$predikat[0],@ARG1,@ARG2\n");<br />
# print("$predikat[0],@ARG1,@ARG2\n");<br />
#}<br />
}<br />
#creates the correct verb structure, e.g. "lete etter" (look for)<br />
sub lagVerbStruktur {<br />
if(@ekstra != 0){<br />
@hovedpred = ($predikat[0],$ekstra[0]);<br />
shift(@ekstra);<br />
shift(@predikat);<br />
}<br />
else {<br />
@hovedpred = $predikat[0];<br />
shift(@predikat);<br />
}<br />
}<br />
Appendix C: the EPAS list<br />
23-år-gammel,student,<br />
aktuell,tidsrom,<br />
analysere,Kripos-spesialist,spor<br />
ankomme,etterforsker,<br />
ankomme,etterforsker,<br />
ankomme,etterforsker,åsted<br />
antyde,politi,<br />
avhøre,,person<br />
avhøre,,vedkommende<br />
avhøre,politi,vitne<br />
avklare,obduksjon,<br />
bede om,lensmann,assistanse<br />
bede om,politi,bistand<br />
bede,lensmann,<br />
bede,lensmann,<br />
bede-om,Fonn,bistand<br />
bekrefte,lensmann,<br />
bekrefte,politi,<br />
bekrefte,politi,<br />
bekrefte,politimester,<br />
bistå,etterforsker,lensmann<br />
bistå,etterforsker,politi<br />
bistå,etterforsker,politi<br />
bli,Anne,offer<br />
bo,23-åring,studentkollektiv<br />
bo,Anne,studentkollektiv<br />
bo,beboer,studentkollektiv<br />
bo,Slåtten,studentkollektiv<br />
brutal,drapsmann,<br />
desperat,rop,<br />
død,sykepleiestudent,<br />
drepe,,kvinne<br />
drepe,,pron<br />
drepe,,pron<br />
drepe,,Slåtten<br />
drepe,gjerningsmann,kvinne<br />
drept,sykepleiestudent<br />
ekstra,patrulje,<br />
endelig,rapport,<br />
etterforske,medarbeider,drap<br />
etterlyse,,bilfører<br />
etterlyse,,bilfører<br />
etterlyse,politi,bilfører<br />
etterlyse,politi,person<br />
etterlyse,politi,syklist<br />
etterlyse,politi,syklist<br />
etterlyst,syklist,<br />
etterlyst,syklist,<br />
fastslå,politi,<br />
finkjemme,politi,bygning<br />
finne,,død<br />
finne,,død<br />
finne,,død<br />
finne,,kvinne<br />
finne,,lommebok<br />
finne,,pron<br />
finne,,pron<br />
finne,,sykepleiestudent<br />
finne,forbipasserende,sykepleiestudent<br />
finne,leteaksjon,kvinne<br />
finne,politi,drapsmann<br />
forfølge,,pron<br />
forkaste,politi,teori<br />
fortelle,beboer,politi<br />
fortelle,Fonn,<br />
fortelle,Fonn,
fra-kripos,etterforsker,<br />
fra-kripos,etterforsker,<br />
fra-kripos,etterforsker,<br />
første,praksisdag,<br />
førsteårs,sykepleiestudent,<br />
førsteårs,sykepleiestudent,<br />
få,politi,svar<br />
få,politi,tips<br />
få,pron,rapport<br />
få,pron,telefon<br />
gi,Fonn,opplysning<br />
gi,kamera,indikasjon<br />
gi,lensmann,opplysning<br />
gi,politi,informasjon<br />
gi,politi,opplysning<br />
gi,vitneavhør,indikasjon<br />
gjemme,drapsmann,<br />
gjemme,gjerningsmann,<br />
gjemme,gjerningsmann,<br />
gjennomføre,,rekonstruksjon<br />
gjennomgå,tekniker,studentkollektiv<br />
gjennomsøke,politi,studenthybel<br />
gjøre,politi,avhør<br />
gjøre,politi,rundspørring<br />
gå-gjennom,polititjenestefolk,material<br />
ha,etterforsker,observasjon<br />
ha,politi,medarbeider<br />
ha,politi,teori<br />
ha,pron,observasjon<br />
ha,pron,teori<br />
ha,pron,teori<br />
holde åpen,politi,mulighet<br />
holde,politi,kort<br />
holde,politi,pressekonferanse<br />
holde-åpen,politi,mulighet<br />
høre,pron,rop<br />
høre,pron,rop<br />
høre,vitne,rop<br />
høy,rop,<br />
håpe,politi,<br />
identifisere,politi,pron<br />
igangsette,,leteaksjon<br />
informere,,politi<br />
jobbe-utfra,pron,teori<br />
kartlegge,pron,bevegelse<br />
kartlegge,pron,bevegelse<br />
kjenne,generic-nom,Slåtten<br />
kjenne,politi,dødsårsak<br />
komme-i-kontakt-med,politi,bilfører<br />
komme-i-kontakt-med,politi,generic-nom<br />
komme-i-kontakt-med,pron,bilfører<br />
komme-i-kontakt-med,pron,syklist<br />
komme-inn,tips,<br />
kommentere,pron,<br />
kontakte,etterforsker,vitne<br />
kriminell,handling,<br />
kriminell,handling,<br />
kriminell,handling,<br />
melde-savnet,,student<br />
melde-seg,syklist,<br />
melde-seg,syklist,politi<br />
melde-seg,syklist,politi<br />
mene,etterforsker,<br />
mene,politi,<br />
merke,kjæreste,<br />
mistenkelig,dødsfall,<br />
mulig,teori,<br />
muntlig,rapport<br />
møte-opp-til,pron,praksisdag<br />
ny,tips,<br />
nær,opplysning
103<br />
obdusere,,kvinne<br />
observere,,23-åring<br />
observere,,bile<br />
observere,,person<br />
opplyse,Fonn,<br />
opplyse,vitne,<br />
oppmerksom,kvinne,<br />
overfalle,,Slåtten<br />
plombere,politi,hybelhus<br />
pågå,leteaksjon,<br />
påkledd,Slåtten,<br />
rigge,etterforsker,lyskaster<br />
samle,politi,observasjon<br />
sanke-inn,politi,video<br />
savne,,kvinne<br />
se,pron,Slåtten<br />
se,vitne,kvinne<br />
sentral,vitne,<br />
sette-igang,,leteaksjon<br />
sette-inn,politi,patrulje<br />
si,Fonn,<br />
si,Fonn,<br />
skade,,kvinne<br />
skje-med,generic-nom,kvinne<br />
slutte-seg-til,pron,Førde-politi<br />
sperre av,,hybelhus<br />
sperre av,politi,åsted<br />
spesiell,teori<br />
spesiell,teori,<br />
stenge av,politi,studentkollektiv<br />
stor,leteaksjon,<br />
systematisere,,tips<br />
søke-med,politi,hund<br />
ta,pron,utgangspunkt<br />
ta-høyde-for,lensmann,eventualitet<br />
ta-kontakt-med,politi,vitne<br />
ta-kontakt-med,syklist,politi<br />
taktisk,etterforsker,<br />
teknisk,etterforsker,<br />
teknisk,etterforsker,<br />
teknisk,spor,<br />
teknisk,spor,<br />
tidlig,teori,<br />
tilfeldig,forbipasserende,<br />
tilfeldig,offer,<br />
trenge,,vitne<br />
tro,lensmann,<br />
tro,politi,<br />
tro,pron,<br />
ukjent,gjerningsmann,<br />
ukjent,person,<br />
undersøke,,minibankaktiviteter<br />
undersøke,,mobiltelefontrafikk<br />
undersøke,,område<br />
undersøke,,overvåkningsfilmer<br />
undersøke,etterforsker,åsted<br />
undersøke,politi,aktivitet<br />
understreke,Fonn,<br />
understreke,pron,<br />
understreke,pron,<br />
understreke,pron,<br />
understreke,pron,generic-nom<br />
varsle,pron,Kripos<br />
velge,drapsmann,sykepleiestudent<br />
velge,drapsmann,sykepleiestudent<br />
ville,pron,kartlegge<br />
vise,funn,<br />
vise,funn,<br />
vise,undersøkelse,<br />
vite,politi,<br />
være,Anne,offer
være,bilfører,vitne<br />
være,bilfører,vitne<br />
være,etterforskning,bred<br />
være,kvinne,død<br />
være,kvinne,død<br />
være,kvinne,skadet<br />
være,kvinne,Slåtten<br />
være,lommebok,funn<br />
være,pron,funn<br />
være,pron,omkommet<br />
være,rapport,klar<br />
være,Slåtten,sykepleiestudent<br />
være,syklist,vitne<br />
være,vitne,kvinne<br />
ønske,politi,<br />
ønske,politi,<br />
åpen,mulighet,<br />
åpen,mulighet,<br />
104
Appendix D: Text aligned with EPAS

SENTENCE
    EPAS  [METHOD]

Kvinne funnet død i Førde.
    finne,,død  [automatic]
    være,kvinne,død  [automatic]

Den savnede kvinnen i Førde er nå funnet død.
    finne,,død  [automatic]
    savne,,kvinne  [automatic]
    være,kvinne,død  [automatic]

Politiet har gitt media opplysninger om funnet.
    gi,politi,opplysning  [automatic]

Lensmannen bekrefter at kvinnen er funnet død.
    finne,,død  [automatic]
    bekrefte,lensmann,  [automatic]

Politiet har bedt Kripos om bistand i søket etter kvinnen.
    bede om,politi,bistand  [automatic]

23-åringen var førsteårs sykepleiestudent i Førde.
    førsteårs,sykepleiestudent,  [edited]

Hun møtte ikke opp til sin første praksisdag ved Førde aldershjem.
    første,praksisdag,  [manual]
    møte_opp_til,pron,praksisdag  [edited]

Politiet ble informert.
    informere,,politi  [automatic]

En leteaksjon ble satt igang.
    sette_igang,,leteaksjon  [edited]

Leteaksjonen pågikk til kvinnen ble funnet.
    pågå,leteaksjon,  [automatic]
    finne,,kvinne  [manual]
    finne,leteaksjon,kvinne  [manual]

Politiet holder alle muligheter åpne i saken.
    holde åpen,politi,mulighet  [edited]
    åpen,mulighet,  [automatic]

Etterforskerne vil ankomme i morgen.
    ankomme,etterforsker,  [automatic]

Et vitne hørte desperate rop om hjelp.
    desperat,rop,  [automatic]
    høre,vitne,rop  [automatic]

Lensmannen har bedt om assistanse fra Kripos.
    bede om,lensmann,assistanse  [automatic]

Etterforskere fra Kripos skal bistå lensmannen i etterforskningen.
    bistå,etterforsker,lensmann  [automatic]
    fra_kripos,etterforsker,  [edited]

Etterforskerne forventes å ankomme i løpet av dagen.
    ankomme,etterforsker,  [manual]

Den 23 år gamle studenten ble meldt savnet tidlig søndag morgen.
    23-år_gammel,student,  [edited]
    melde_savnet,,student  [edited]

Anne Slåtten bodde i et studentkollektiv i Førde.
    bo,Anne,studentkollektiv  [edited]
    bo,Slåtten,studentkollektiv  [manual]

Slåtten var førsteårs sykepleiestudent i Førde.
    førsteårs,sykepleiestudent,  [edited]
    være,Slåtten,sykepleiestudent  [manual]

Hun ble funnet omkommet i et skogholt.
    finne,,pron  [automatic]
    være,pron,omkommet  [manual]
    være,pron,funn  [automatic]

Et vitne opplyste at hun hadde hørt høye rop.
    høy,rop,  [automatic]
    opplyse,vitne,  [automatic]
    høre,pron,rop  [automatic]

Mandag holdt politiet en pressekonferanse.
    holde,politi,pressekonferanse  [automatic]

Lensmannen vil ikke gi nærmere opplysninger om åstedet.
    gi,lensmann,opplysning  [automatic]
    nær,opplysning  [automatic]

Beboerne i studentkollektivet har fortalt politiet at de så Slåtten lørdag kveld.
    fortelle,beboer,politi  [edited]
    se,pron,Slåtten  [automatic]
    bo,beboer,studentkollektiv  [manual]

Politiet har sperret av åstedet.
    sperre av,politi,åsted  [edited]

Flere personer er avhørt i saken.
    avhøre,,person  [automatic]

Politiet holder alle muligheter åpne.
    holde_åpen,politi,mulighet  [edited]
    åpen,mulighet,  [automatic]

Kvinnen blir trolig obdusert i løpet av tirsdag.
    obdusere,,kvinne  [automatic]

Politiet håper obduksjonen vil avklare hva som skjedde med kvinnen.
    håpe,politi,  [manual]
    avklare,obduksjon,  [manual]
    skje_med,generic-nom,kvinne  [edited]

Mandag kveld ankom etterforskere fra Kripos åstedet.
    ankomme,etterforsker,åsted  [edited]
    fra_kripos,etterforsker,  [edited]

Sent mandag kveld rigget etterforskerne opp lyskastere.
    rigge,etterforsker,lyskaster  [automatic]

Fonn vil ikke gi flere opplysninger om åstedet.
    gi,Fonn,opplysning  [automatic]

Han vil ikke kommentere om kvinnen var skadet.
    kommentere,pron,  [automatic]
    skade,,kvinne  [automatic]
    være,kvinne,skadet  [manual]

Politiet holder kortene svært tett til brystet.
    holde,politi,kort  [automatic]

Det er ikke kommet inn mange tips i saken.
    komme_inn,tips,  [manual]

Tipsene skal nå systematiseres.
    systematisere,,tips  [automatic]

Fonn forteller at politiet vil ta kontakt med vitner.
    fortelle,Fonn,  [manual]
    ta_kontakt_med,politi,vitne  [manual]

Politiet har flere mulige teorier.
    mulig,teori,  [automatic]
    ha,politi,teori  [automatic]

Det mest sentrale vitnet i saken er en kvinne.
    sentral,vitne,  [automatic]
    være,vitne,kvinne  [automatic]

Hun skal ha hørt rop fra en kvinne.
    høre,pron,rop  [automatic]

Politiet har stengt av studentkollektivet der 23-åringen bodde.
    stenge av,politi,studentkollektiv  [automatic]
    bo,23-åring,studentkollektiv  [edited]

Studentkollektivet vil bli gjennomgått av teknikere.
    gjennomgå,tekniker,studentkollektiv  [automatic]

Fonn har bedt om teknisk bistand.
    bede-om,Fonn,bistand  [automatic]

Politiet bekrefter at Slåtten ble drept.
    bekrefte,politi,  [automatic]
    drepe,,Slåtten  [automatic]

Undersøkelsene på stedet viser at hun ble drept.
    vise,undersøkelse,  [automatic]
    drepe,,pron  [automatic]

Politiet tror at Slåtten ble overfalt.
    tro,politi,  [automatic]
    overfalle,,Slåtten  [automatic]

De tror at kvinnen ble drept av en ukjent gjerningsmann.
    tro,pron,  [automatic]
    ukjent,gjerningsmann,  [automatic]
    drepe,gjerningsmann,kvinne  [automatic]

Politiet fastslår at kvinnens lommebok ikke er funnet.
    fastslå,politi,  [automatic]
    finne,,lommebok  [manual]
    være,lommebok,funn  [automatic]

Fonn opplyser at området ikke er undersøkt.
    opplyse,Fonn,  [automatic]
    undersøke,,område  [automatic]

Politiet har forkastet en tidligere teori.
    forkaste,politi,teori  [automatic]
    tidlig,teori,  [automatic]

Politiet får senere i dag svar på dødsårsaken.
    få,politi,svar  [automatic]

Politiet har ikke gjennomsøkt Slåttens studenthybel.
    gjennomsøke,politi,studenthybel  [automatic]

Hele hybelhuset ble sperret av.
    sperre av,,hybelhus  [edited]

Politiet har plombert hybelhuset.
    plombere,politi,hybelhus  [automatic]

Politiet skal finkjemme bygningen for tekniske spor.
    finkjemme,politi,bygning  [automatic]
    teknisk,spor,  [automatic]

De tekniske etterforskerne har undersøkt åstedet.
    teknisk,etterforsker,  [automatic]
    undersøke,etterforsker,åsted  [automatic]

To tekniske etterforskere bistår politiet i Førde.
    teknisk,etterforsker,  [automatic]
    bistå,etterforsker,politi  [automatic]

En taktisk etterforsker fra Kripos bistår politiet.
    bistå,etterforsker,politi  [automatic]
    taktisk,etterforsker,  [automatic]
    fra_kripos,etterforsker,  [edited]

Lensmannen tar høyde for alle eventualiteter.
    ta_høyde_for,lensmann,eventualitet  [manual]

Vi varslet Kripos.
    varsle,pron,Kripos  [automatic]

Den døde sykepleierstudenten ble funnet av en tilfeldig forbipasserende.
    død,sykepleiestudent,  [automatic]
    finne,forbipasserende,sykepleiestudent  [automatic]
    tilfeldig,forbipasserende,  [edited]

23-åringen ble sist observert lørdag kveld.
    observere,,23-åring  [automatic]

Politiet vet at hun fikk en telefon fra kjæresten sin.
    vite,politi,  [automatic]
    få,pron,telefon  [automatic]

Kjæresten merket ikke at noe var galt.
    merke,kjæreste,  [automatic]

Vedkommende er avhørt.
    avhøre,,vedkommende  [automatic]

En større leteaksjon ble igangsatt.
    igangsette,,leteaksjon  [automatic]
    stor,leteaksjon,  [edited]

Politiet etterlyser en syklist.
    etterlyse,politi,syklist  [automatic]

Den etterlyste syklisten har tatt kontakt med politiet.
    etterlyst,syklist,  [manual]
    ta_kontakt_med,syklist,politi  [manual]

Fortsatt etterlyses to bilførere.
    etterlyse,,bilfører  [automatic]

Politiet etterlyste i dag to bilførere.
    etterlyse,politi,bilfører  [automatic]

To biler er observert på veien.
    observere,,bile  [automatic]

Politiet ønsker å komme i kontakt med bilførerne.
    ønske,politi,  [automatic]
    komme_i_kontakt_med,politi,bilfører  [manual]

Fonn understreker at bilførerne er vitner.
    understreke,Fonn,  [automatic]
    være,bilfører,vitne  [edited]

Fonn sier at han understreker dette.
    understreke,pron,generic-nom  [edited]
    si,Fonn,  [automatic]

Slåtten var påkledd da hun ble funnet drept.
    påkledd,Slåtten,  [automatic]
    finne,,pron  [automatic]
    drepe,,pron  [automatic]

Vi vil nå kartlegge alle bevegelser på åstedet.
    kartlegge,pron,bevegelse  [automatic]
    ville,pron,kartlegge  [manual]

Vi har ingen spesiell teori som vi tar utgangspunkt i.
    ha,pron,teori  [automatic]
    spesiell,teori,  [automatic]
    ta,pron,utgangspunkt  [automatic]

Funnene på åstedet viser at det er en kriminell handling.
    vise,funn,  [automatic]
    kriminell,handling,  [automatic]

Det er ikke et mistenkelig dødsfall, men en kriminell handling.
    mistenkelig,dødsfall,  [automatic]
    kriminell,handling,  [automatic]

Trenger flere vitner.
    trenge,,vitne  [automatic]

Politiet ønsker å komme i kontakt med alle som kjente Slåtten.
    komme_i_kontakt_med,politi,generic-nom  [edited]
    ønske,politi,  [automatic]
    kjenne,generic-nom,Slåtten  [automatic]

Etterforskerne fra Kripos vil kontakte vitner.
    kontakte,etterforsker,vitne  [automatic]

Politiet kjenner dødsårsaken.
    kjenne,politi,dødsårsak  [manual]

Politimesteren bekrefter at de har fått en muntlig rapport.
    bekrefte,politimester,  [manual]
    få,pron,rapport  [manual]
    muntlig,rapport  [manual]

Han understreker at politiet ikke vil gi informasjon om dødsårsaken.
    understreke,pron,  [manual]
    gi,politi,informasjon  [manual]

Politiet har ikke bekreftet hvor kvinnen ble drept.
    bekrefte,politi,  [manual]
    drepe,,kvinne  [manual]

Politiet har nå 32 medarbeidere som etterforsker drapet.
    ha,politi,medarbeider  [manual]
    etterforske,medarbeider,drap  [manual]

Syklisten meldte seg.
    melde_seg,syklist,  [manual]

Den etterlyste syklisten har nå meldt seg til politiet i Førde.
    melde_seg,syklist,politi  [manual]
    etterlyst,syklist,  [manual]

Fortsatt etterlyses to bilførere.
    etterlyse,,bilfører  [manual]

Politiet etterlyste i dag tidlig en syklist.
    etterlyse,politi,syklist  [manual]

I formiddag meldte syklisten seg til politiet.
    melde_seg,syklist,politi  [manual]

Jeg vil understreke at vi ønsker å komme i kontakt med både syklisten og bilførerne som vitner, sier Fonn.
    understreke,pron,  [manual]
    komme_i_kontakt_med,pron,bilfører  [manual]
    komme_i_kontakt_med,pron,syklist  [manual]
    si,Fonn,  [manual]
    være,bilfører,vitne  [manual]
    være,syklist,vitne  [manual]

Vi vil nå kartlegge alle bevegelser på funnstedet og i boligen.
    kartlegge,pron,bevegelse  [manual]

Vi har ingen spesiell teori som vi jobber utifra nå.
    ha,pron,teori  [manual]
    spesiell,teori  [manual]
    jobbe_utfra,pron,teori  [manual]

Men funnene på åstedet viser at det er en kriminell handling, forteller Fonn.
    vise,funn,  [manual]
    kriminell,handling,  [manual]
    fortelle,Fonn,  [manual]

I tillegg vil politiet gjøre en rundspørring rundt åstedet i løpet av dagen.
    gjøre,politi,rundspørring  [manual]

Den endelige rapporten vil være klar på torsdag.
    endelig,rapport,  [manual]
    være,rapport,klar  [manual]

To Kripos-spesialister skal analysere alle tekniske spor i Førde.
    analysere,Kripos-spesialist,spor  [manual]
    teknisk,spor,  [manual]

De to sluttet seg til Førde-politiet i går.
    slutte_seg_til,pron,Førde-politi  [manual]

Alt av mobiltelefontrafikk, overvåkingsfilmer og minibankaktiviteter rundt drapstidspunktet skal undersøkes.
    undersøke,,minibankaktiviteter  [manual]
    undersøke,,mobiltelefontrafikk  [manual]
    undersøke,,overvåkningsfilmer  [manual]

Slik kan politiet undersøke aktiviteten i området sykepleiestudenten ble funnet drept.
    undersøke,politi,aktivitet  [manual]
    finne,,sykepleiestudent  [manual]
    drept,sykepleiestudent  [manual]

Lensmann Kjell Fonn ber alle som var i sentrum om å melde seg.
    bede,lensmann,  [automatic]

Han understreker at etterforskningen er svært bred.
    understreke,pron,  [manual]
    være,etterforskning,bred  [manual]

Politiet har sanket inn videoer fra alle overvåkningskameraer i Førde.
    sanke_inn,politi,video  [manual]

Polititjenestefolk går gjennom materialet.
    gå_gjennom,polititjenestefolk,material  [manual]

Kameraene vil gi en indikasjon på aktiviteten i Førde i det aktuelle tidsrommet.
    gi,kamera,indikasjon  [manual]
    aktuell,tidsrom,  [manual]

Gjerningsmannen gjemte seg i busker på åstedet.
    gjemme,gjerningsmann,  [automatic]

Det er sannsynlig at gjerningsmannen gjemte seg i busker ved åstedet.
    gjemme,gjerningsmann,  [manual]

Tror Anne ble et tilfeldig offer.
    tilfeldig,offer,  [manual]
    bli,Anne,offer  [manual]
    være,Anne,offer  [manual]

Politiet avhører flere vitner.
    avhøre,politi,vitne  [automatic]

Politiet har søkt med hunder på åstedet.
    søke_med,politi,hund  [edited]

Politiet har samlet mange observasjoner.
    samle,politi,observasjon  [automatic]

Politiet antyder at drapsmannen har valgt sykepleiestudenten tilfeldig.
    antyde,politi,  [automatic]
    velge,drapsmann,sykepleiestudent  [automatic]

Vitneavhør gir indikasjoner på at den brutale drapsmannen har valgt sykepleierstudenten tilfeldig.
    gi,vitneavhør,indikasjon  [manual]
    brutal,drapsmann,  [manual]
    velge,drapsmann,sykepleiestudent  [manual]

Etterforskerne har flere observasjoner.
    ha,etterforsker,observasjon  [manual]

Vitner så en kvinne som gikk alene.
    se,vitne,kvinne  [manual]

Politiet mener kvinnen er Anne Slåtten.
    mene,politi,  [automatic]
    være,kvinne,Slåtten  [automatic]

Etterforskerne mener at hun ikke ble forfulgt.
    mene,etterforsker,  [automatic]
    forfølge,,pron  [automatic]

Drapsmannen kan ha gjemt seg i busker ved åstedet.
    gjemme,drapsmann,  [manual]

Lensmannen ber unge kvinner være oppmerksomme.
    oppmerksom,kvinne,  [automatic]
    bede,lensmann,  [automatic]

Politiet setter ikke inn ekstra patruljer i Førde.
    sette_inn,politi,patrulje  [edited]
    ekstra,patrulje,  [automatic]

Politiet har fått flere nye tips.
    få,politi,tips  [automatic]
    ny,tips,  [automatic]

En rekonstruksjon ble gjennomført på tirsdag.
    gjennomføre,,rekonstruksjon  [automatic]

Lensmannen tror at politiet finner drapsmannen.
    tro,lensmann,  [automatic]
    finne,politi,drapsmann  [automatic]

Politiet etterlyser fem personer.
    etterlyse,politi,person  [automatic]

Personene er observert i Førde.
    observere,,person  [automatic]

Politiet har ikke identifisert dem.
    identifisere,politi,pron  [automatic]

Politiet har gjort 1275 avhør.
    gjøre,politi,avhør  [automatic]

De har fem observasjoner av ukjente personer.
    ha,pron,observasjon  [automatic]
    ukjent,person,  [automatic]
Appendix E: classify.pl – program code

#!/bin/perl

### Configuration and initialisation ###

# Predicate/argument file.
my $infile = "pred_argliste.txt";

# Initialise empty data structures.
my %pred_args1 = ();
my %pred_args2 = ();
my %arg1_preds = ();
my %arg2_preds = ();

# Read the input data.
open(FILE, $infile);
my @lines = <FILE>;
close(FILE);

### The data structure is built here ###
foreach $line (@lines) {
    # Remove the newline from each line.
    chomp($line);
    # Extract each individual word on the line.
    my ($pred, $arg1, $arg2) = split(/,/, $line);
    # Register arg1 and arg2 on the predicate (increment the counter).
    $pred_args1{$pred}{$arg1} += 1;
    $pred_args2{$pred}{$arg2} += 1;
    # Register the predicates that arg1 and arg2 occur with.
    $arg1_preds{$arg1}{$pred} += 1;
    $arg2_preds{$arg2}{$pred} += 1;
}

### The actual logic of the program ###

# Get the predicates to display from the command line (display all if none are given).
my @preds = scalar @ARGV ? @ARGV : sort keys %pred_args1;

# Loop controlling the flow of the program.
foreach $pred (@preds) {
    foreach $arg (1, 2) {
        ### NIVÅ 0 ###
        my @args_lvl0 = parse_level0($arg, $pred);
        foreach $arg_lvl0 (@args_lvl0) {
            ### NIVÅ 1 ###
            my @args_lvl1 = parse_level1or2(1, $arg, $arg_lvl0, $pred);
            foreach $arg_lvl1 (@args_lvl1) {
                ### NIVÅ 2 ###
                parse_level1or2(2, $arg, $arg_lvl1, $pred, $arg_lvl0);
            }
        }
        print "\n";
    }
    print "\n";
}

# Subroutine taking an argument number (1 or 2, corresponding to ARG1 or ARG2) and a predicate.
# Displays the predicate's arguments (NIVÅ0) and returns them for use on the next level.
sub parse_level0 {
    # Get the subroutine's parameters.
    my $argnum = shift;
    my $pred = shift;
    # Get all of the predicate's arguments.
    my %args = $argnum == 1 ? %{$pred_args1{$pred}} : %{$pred_args2{$pred}};
    # Get the arguments themselves in sorted order (by number of occurrences, then alphabetically).
    my @args = sort { $args{$b} <=> $args{$a} } sort { lc($a) cmp lc($b) } keys %args;
    # Display the predicate's argument list.
    print "NIVÅ0, ARG$argnum ($pred): ";
    print join(', ', map { "$_ x $args{$_}" } @args), "\n";
    return @args;
}

# Subroutine taking an argument number (1 or 2, corresponding to ARG1 or ARG2) and an
# argument (plus extra control parameters).
# Finds other predicates that the argument has been used with.
# Displays the arguments of all found predicates, along with the number of times these
# arguments were used.
#
# In the rest of the subroutine these found arguments are called "referred arguments (on
# the next level)", because they are referred to from an argument indirectly via a
# predicate (e.g. a NIVÅ0 argument "refers" to some NIVÅ1 arguments, which in turn refer
# to some NIVÅ2 arguments).
sub parse_level1or2 {
    # Get the subroutine's parameters.
    my $level = shift;
    my $argnum = shift;
    my $arg = shift;
    my $pred_lvl0 = shift;
    my $arg_lastlevel = $level == 1 ? '' : shift;
    # Ignore '?' arguments.
    return if $arg eq '?';
    # Data structure counting the referred arguments on the next level.
    my %argrefs = ();
    # Get all the predicates that contain the argument.
    my @preds = $argnum == 1 ? keys %{$arg1_preds{$arg}} : keys %{$arg2_preds{$arg}};
    # ...and loop through these predicates.
    foreach $pred (@preds) {
        # Ignore the same predicate as the one being processed.
        next if $pred eq $pred_lvl0;
        # Get all of the predicate's arguments.
        my %args = $argnum == 1 ? %{$pred_args1{$pred}} : %{$pred_args2{$pred}};
        # ...and loop through and count these.
        foreach $argref (keys %args) {
            # Ignore the same argument as the one being processed.
            next if $argref eq $arg || $argref eq $arg_lastlevel;
            # Increment the counter (number of referred arguments on the next level).
            $argrefs{$argref} += $args{$argref};
        }
    }
    # Get the referred arguments on the next level.
    my @argrefs = sort { $argrefs{$b} <=> $argrefs{$a} } sort { lc($a) cmp lc($b) } keys %argrefs;
    # Display the referred arguments on the next level.
    print " " x (3*$level), "NIVÅ$level, ARG$argnum ($arg): ";
    print scalar @argrefs ? join(', ', map { "$_ x $argrefs{$_}" } @argrefs) : '(Ingen referanser)', "\n";
    return @argrefs;
}
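As an illustration of the counting logic in classify.pl, the same idea can be sketched in a few lines of Python. The sample triples and the helper names `by_freq`, `level0` and `level1` are invented for the example; the NIVÅ2 step and the '?' filter are omitted for brevity.

```python
from collections import defaultdict

# Toy EPAS triples (predicate, ARG1, ARG2) in the Appendix C format; an empty
# string marks a missing argument. Illustrative data only.
epas = [
    ("etterlyse", "politi", "syklist"),
    ("etterlyse", "politi", "bilfører"),
    ("etterlyse", "", "bilfører"),
    ("avhøre", "politi", "vitne"),
    ("avhøre", "lensmann", "vitne"),
    ("kontakte", "etterforsker", "vitne"),
]

# pred_args[n][pred] counts the arguments seen in slot n of pred;
# arg_preds[n][arg] counts the predicates that arg occurs with in slot n.
pred_args = {n: defaultdict(lambda: defaultdict(int)) for n in (1, 2)}
arg_preds = {n: defaultdict(lambda: defaultdict(int)) for n in (1, 2)}
for pred, arg1, arg2 in epas:
    for n, arg in ((1, arg1), (2, arg2)):
        pred_args[n][pred][arg] += 1
        arg_preds[n][arg][pred] += 1

def by_freq(counts):
    # Most frequent first, ties broken alphabetically (as in the Perl sorts).
    return sorted(sorted(counts), key=counts.get, reverse=True)

def level0(n, pred):
    # NIVÅ0: the arguments observed in slot n of the given predicate.
    return by_freq(pred_args[n][pred])

def level1(n, arg, pred_lvl0):
    # NIVÅ1: other arguments that share a predicate (except pred_lvl0) with
    # arg in the same slot; counts are summed over all such predicates.
    refs = defaultdict(int)
    for pred in arg_preds[n][arg]:
        if pred == pred_lvl0:
            continue
        for other, count in pred_args[n][pred].items():
            if other != arg:
                refs[other] += count
    return by_freq(refs)

print(level0(2, "etterlyse"))            # ['bilfører', 'syklist']
print(level1(2, "syklist", ""))          # ['bilfører']
print(level1(1, "politi", "etterlyse"))  # ['lensmann']
```

The last two calls show the association step that groups semantically similar words: "syklist" and "bilfører" end up in the same bundle because they fill the same ARG2 slot of the same predicate, and "politi" and "lensmann" likewise share an ARG1 slot.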
Appendix F: POS-based structures<br />
SENTENCE POS-STRUCTURE<br />
Kvinne funnet død i Førde. finne,,kvinne<br />
Den savnede kvinnen i Førde er nå funnet død. finne,kvinne,død<br />
Politiet har gitt media opplysninger om funnet. gi,politi,opplysning<br />
Lensmannen bekrefter at kvinnen er funnet død. bekrefte,lensmann,at<br />
finne,kvinne,død<br />
Politiet har bedt Kripos om bistand i søket etter kvinnen. be,politi,kripos<br />
23-åringen var førsteårs sykepleiestudent i Førde. være,23-åring,<br />
Hun møtte ikke opp til sin første praksisdag ved Førde<br />
aldershjem.<br />
møte,hun,<br />
Politiet ble informert. informere,politi,<br />
En leteaksjon ble satt igang. sette,leteaksjon,<br />
Leteaksjonen pågikk til kvinnen ble funnet. pågå,leteaksjon,<br />
finne,kvinne,<br />
Politiet holder alle muligheter åpne i saken. holde,politi,mulighet<br />
Etterforskerne vil ankomme i morgen. ankomme,etterforsker,<br />
Et vitne hørte desperate rop om hjelp. høre,vitne,rop<br />
Lensmannen har bedt om assistanse fra Kripos. be,lensmann,<br />
Etterforskere fra Kripos skal bistå lensmannen i<br />
etterforskningen.<br />
bistå,etterforsker,lensmann<br />
Etterforskerne forventes å ankomme i løpet av dagen. ankomme,etterforsker,<br />
Den 23 år gamle studenten ble meldt savnet tidlig søndag<br />
morgen.<br />
melde savne,student,<br />
Anne Slåtten bodde i et studentkollektiv i Førde. bo,Slåtten,<br />
Slåtten var førsteårs sykepleiestudent i Førde. være,Slåtten,sykepleiestudent<br />
Hun ble funnet omkommet i et skogholt. finne,hun,<br />
Et vitne opplyste at hun hadde hørt høye rop. opplyse,vitne,<br />
høre,hun,rop<br />
Mandag holdt politiet en pressekonferanse. holde,politi,pressekonferanse<br />
Lensmannen vil ikke gi nærmere opplysninger om åstedet. gi,lensmann,opplysning<br />
Beboerne i studentkollektivet har fortalt politiet at de så fortelle,beboer,politi<br />
Slåtten lørdag kveld.<br />
se,de,<br />
Politiet har sperret av åstedet. sperre,politi,<br />
Flere personer er avhørt i saken. avhøre,person,<br />
Politiet holder alle muligheter åpne. holde,politi,mulighet<br />
Kvinnen blir trolig obdusert i løpet av tirsdag. obdusere,kvinne,<br />
Politiet håper obduksjonen vil avklare hva som skjedde med håpe,politi,<br />
kvinnen<br />
avklare,obduksjon,hva<br />
Mandag kveld ankom etterforskere fra Kripos åstedet. ankomme,etterforsker,åsted<br />
Sent mandag kveld rigget etterforskerne opp lyskastere. rigge,etterforsker,<br />
Fonn vil ikke gi flere opplysninger om åstedet. gi,Fonn,opplysning<br />
Han vil ikke kommentere om kvinnen var skadet. kommentere,han,<br />
skade,kvinne,<br />
Politiet holder kortene svært tett til brystet. holde,politi,kort<br />
Det er ikke kommet inn mange tips i saken. komme,det,<br />
Tipsene skal nå systematiseres. systematisere,tips,<br />
Fonn forteller at politiet vil ta kontakt med vitner. fortelle,Fonn,<br />
ta,politi,kontakt<br />
Politiet har flere mulige teorier. ha,politi,teori<br />
Det mest sentrale vitnet i saken er en kvinne. være,vitne,kvinne<br />
Hun skal ha hørt rop fra en kvinne. høre,hun,rop<br />
Politiet har stengt av studentkollektivet der 23-åringen stenge,politi,<br />
bodde.<br />
bo,23-åring,<br />
Studentkollektivet vil bli gjennomgått av teknikere. gjennomgå,studentkollektiv,<br />
Fonn har bedt om teknisk bistand. be,Fonn,<br />
Politiet bekrefter at Slåtten ble drept. bekrefte,politi,at<br />
drepe,Slåtten,<br />
Undersøkelsene på stedet viser at hun ble drept. vise,undersøkelse,at<br />
drepe,hun,<br />
Politiet tror at Slåtten ble overfalt. tro,politi,at<br />
overfalle,Slåtten,<br />
De tror at kvinnen ble drept av en ukjent gjerningsmann. tro,de,at<br />
drepe,kvinne,<br />
113
Politiet fastslår at kvinnens lommebok ikke er funnet. fastslå,politi,at<br />
finne,lommebok,<br />
Fonn opplyser at området ikke er undersøkt. opplyse,Fonn,at<br />
undersøke,område,<br />
Politiet har forkastet en tidligere teori. forkaste,politi,teori<br />
Politiet får senere i dag svar på dødsårsaken. få,politi,svar<br />
Politiet har ikke gjennomsøkt Slåttens studenthybel. gjennomsøke,politi,studenthybel<br />
Hele hybelhuset ble sperret av. sperre,hybelhus,<br />
Politiet har plombert hybelhuset. plombere,politi,hybelhus<br />
Politiet skal finkjemme bygningen for tekniske spor. finkjemme,politi,bygning<br />
De tekniske etterforskerne har undersøkt åstedet. undersøke,etterforsker,åsted<br />
To tekniske etterforskere bistår politiet i Førde. bistå,etterforsker,politi<br />
En taktisk etterforsker fra Kripos bistår politiet. bistå,etterforsker,politi<br />
Lensmannen tar høyde for alle eventualiteter. ta,lensmann,høyde<br />
Vi varslet Kripos. varsle,vi,kripos<br />
Den døde sykepleierstudenten ble funnet av en tilfeldig<br />
finne,sykepleierstudent,forbipasse<br />
forbipasserende.<br />
rende<br />
23-åringen ble sist observert lørdag kveld. observere,23-åring,<br />
Politiet vet at hun fikk en telefon fra kjæresten sin. vite,politi,<br />
få,hun,telefon<br />
Kjæresten merket ikke at noe var galt.<br />
Vedkommende er avhørt.<br />
merke,kjæreste,at<br />
En større leteaksjon ble igangsatt. igangsette,leteaksjon,<br />
Politiet etterlyser en syklist. etterlyse,politi,syklist<br />
Den etterlyste syklisten har tatt kontakt med politiet. ta,syklist,kontakt<br />
Fortsatt etterlyses to bilførere. etterlyse,,bilfører<br />
Politiet etterlyste i dag to bilførere. etterlyse,politi,bilfører<br />
To biler er observert på veien. observere,bil,<br />
Politiet ønsker å komme i kontakt med bilførerne. ønske,politi,å<br />
Fonn understreker at bilførerne er vitner. understreke,Fonn,at<br />
være,bilfører,vitne<br />
Fonn sier at han understreker dette. si,Fonn,at<br />
understreke,han,dette<br />
Slåtten var påkledd da hun ble funnet drept. være,Slåtten,påkledd<br />
finne,hun,drepe<br />
Vi vil nå kartlegge alle bevegelser på åstedet. kartlegge,vi,bevegelse<br />
Vi har ingen spesiell teori som vi tar utgangspunkt i. ha,vi,teori<br />
ta,vi,utgangspunkt<br />
Funnene på åstedet viser at det er en kriminell handling. vise,funn,at<br />
være,det,handling<br />
Det er ikke et mistenkelig dødsfall, men en kriminell<br />
handling.<br />
være,det,dødsfall<br />
Trenger flere vitner. trenge,vitne,<br />
Politiet ønsker å komme i kontakt med alle som kjente<br />
ønske,politi,å<br />
Slåtten.<br />
kjenne,,Slåtten<br />
Etterforskerne fra Kripos vil kontakte vitner. kontakte,etterforsker,vitne<br />
Politiet kjenner dødsårsaken. kjenne,politi,dødsårsak<br />
Politimesteren bekrefter at de har fått en muntlig rapport. bekrefte,politimester,at<br />
ha,de,rapport<br />
Han understreker at politiet ikke vil gi informasjon om<br />
understreke,han,at<br />
dødsårsaken.<br />
gi,politi,informasjon<br />
Politiet har ikke bekreftet hvor kvinnen ble drept. bekrefte,politi,<br />
drepe,kvinne,<br />
Politiet har nå 32 medarbeidere som etterforsker drapet. ha,politi,medarbeider<br />
etterforske,medarbeider,drap<br />
Syklisten meldte seg. melde,syklist,seg<br />
Den etterlyste syklisten har nå meldt seg til politiet i<br />
Førde.<br />
melde,syklist,seg<br />
Fortsatt etterlyses to bilførere. etterlyse,bilfører,<br />
Politiet etterlyste i dag tidlig en syklist. etterlyse,politi,syklist<br />
I formiddag meldte syklisten seg til politiet. melde,syklist,seg<br />
Jeg vil understreke at vi ønsker å komme i kontakt med både syklisten og bilførerne som vitner, sier Fonn. understreke,jeg,at<br />
ønske,vi,å<br />
vitne,bilfører,<br />
si,Fonn,<br />
Vi vil nå kartlegge alle bevegelser på funnstedet og i boligen. kartlegge,vi,bevegelse<br />
Vi har ingen spesiell teori som vi jobber utifra nå. ha,vi,teori<br />
jobbe,vi,<br />
Men funnene på åstedet viser at det er en kriminell handling, forteller Fonn. vise,funn,at<br />
være,det,handling<br />
fortelle,Fonn,<br />
I tillegg vil politiet gjøre en rundspørring rundt åstedet i løpet av dagen. gjøre,politi,rundspørring<br />
Den endelige rapporten vil være klar på torsdag. være,rapport,klar<br />
To Kripos-spesialister skal analysere alle tekniske spor i Førde. analysere,Kripos-spesialist,spor<br />
De to sluttet seg til Førde-politiet i går. slutte,to,seg<br />
Alt av mobiltelefontrafikk, overvåkingsfilmer og minibankaktiviteter rundt drapstidspunktet skal undersøkes. undersøke,minibankaktivitet,<br />
Slik kan politiet undersøke aktiviteten i området sykepleiestudenten ble funnet drept. undersøke,politi,aktivitet<br />
finne,sykepleierstudent,<br />
Lensmann Kjell Fonn ber alle som var i sentrum om å melde seg. be,Fonn,alle<br />
melde,,seg<br />
Han understreker at etterforskningen er svært bred. understreke,han,at<br />
være,etterforskning,bred<br />
Politiet har sanket inn videoer fra alle overvåkningskameraer i Førde. sanke,politi,<br />
Polititjenestefolk går gjennom materialet. gå,polititjenestefolk,<br />
Kameraene vil gi en indikasjon på aktiviteten i Førde i det aktuelle tidsrommet. gi,kamera,indikasjon<br />
Gjerningsmannen gjemte seg i busker på åstedet. gjemme,gjerningsmann,seg<br />
Det er sannsynlig at gjerningsmannen gjemte seg i busker ved åstedet. være,det,sannsynlig<br />
gjemme,gjerningsmann,seg<br />
Tror Anne ble et tilfeldig offer. bli,Anne,offer<br />
Politiet avhører flere vitner. avhøre,politi,vitne<br />
Politiet har søkt med hunder på åstedet. søke,politi,<br />
Politiet har samlet mange observasjoner. samle,politi,observasjon<br />
Politiet antyder at drapsmannen har valgt sykepleiestudenten tilfeldig. antyde,politi,at<br />
velge,drapsmann,sykepleiestudent<br />
Vitneavhør gir indikasjoner på at den brutale drapsmannen har valgt sykepleierstudenten tilfeldig. gi,vitneavhør,indikasjon<br />
velge,drapsmann,sykepleiestudent<br />
Etterforskerne har flere observasjoner. ha,etterforsker,observasjon<br />
Vitner så en kvinne som gikk alene. se,vitne,kvinne<br />
Politiet mener kvinnen er Anne Slåtten. mene,politi,<br />
være,Slåtten,kvinne<br />
Etterforskerne mener at hun ikke ble forfulgt. mene,etterforsker,at<br />
bli,hun,forfulgt<br />
Drapsmannen kan ha gjemt seg i busker ved åstedet. gjemme,drapsmann,seg<br />
Lensmannen ber unge kvinner være oppmerksomme. be,lensmann,kvinne<br />
Politiet setter ikke inn ekstra patruljer i Førde. sette,politi,<br />
Politiet har fått flere nye tips. ha,politi,tips<br />
En rekonstruksjon ble gjennomført på tirsdag. gjennomføre,rekonstruksjon,<br />
Lensmannen tror at politiet finner drapsmannen. tro,lensmann,at<br />
finne,politi,drapsmann<br />
Politiet etterlyser fem personer. etterlyse,politi,person<br />
Personene er observert i Førde. observere,person,<br />
Politiet har ikke identifisert dem. identifisere,politi,de<br />
Politiet har gjort 1275 avhør. gjøre,politi,avhør<br />
De har fem observasjoner av ukjente personer. ha,de,observasjon<br />
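The listing above pairs each corpus sentence with one or more comma-separated triples of the form verb,subject,object, where an empty field marks a missing argument (e.g. <em>observere,bil,</em>). As a minimal illustrative sketch, not the thesis's actual implementation, such triples can be parsed and the argument fillers grouped per verb context, which is the raw material a distributional similarity method would then compare:<br />

```python
from collections import defaultdict

def parse_triple(line):
    """Split a 'verb,subject,object' line; empty slots become None."""
    verb, subj, obj = (field.strip() or None for field in line.split(","))
    return verb, subj, obj

def objects_by_verb(triples):
    """Collect the object fillers observed with each verb, i.e. the
    candidate set a distributional method would cluster for similarity."""
    groups = defaultdict(set)
    for verb, subj, obj in triples:
        if obj:
            groups[verb].add(obj)
    return groups

# A few triples taken from the listing above.
lines = [
    "etterlyse,politi,bilfører",
    "etterlyse,politi,syklist",
    "etterlyse,politi,person",
    "observere,bil,",  # object slot is empty
]
groups = objects_by_verb(parse_triple(line) for line in lines)
# groups["etterlyse"] now holds {"bilfører", "syklist", "person"}
```

The function and variable names here are hypothetical; only the triple format itself comes from the appendix.<br />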