Unni Cathrine Eiken February 2005
other similar constructions. This information was only accessible after manual editing of the texts, and even then it was often not useful, since this type of textual shorthand avoids precisely the discourse chains that are desirable. The information in bullet lists and tables in the unedited text is most often not formulated in well-formed sentences, and referring expressions and pronouns are usually avoided. Such texts are also not immediately suitable for parsing, which makes it complicated to extract EPAS (semi-)automatically.

As mentioned above, selecting the texts to be analyzed and creating the text collection that forms the basis of the classifications in the project was a task not to be underestimated. Several different types of text were tried in an attempt to find a text type that satisfied the following criteria, in addition to being available for collection on the internet:
• Limited and naturally confined thematic domain
• Relatively long chains of discourse
• Fairly high occurrence of anaphora, pronouns in particular
• Several paragraphs where the same phenomenon is discussed
• Low occurrence of tables and illustrations; ideally, all the information in the texts should be expressed in complete and grammatical sentences

The text type that fulfilled these criteria to the highest degree was news text. By picking newspaper articles that all concerned the same theme, the criterion of a limited domain was satisfied. The articles, as provided on the internet, also fulfilled all the other requirements set for the text collection. For this project, articles concerning a criminal case in the small town of Førde on the west coast of Norway were chosen, mainly because the case received extensive coverage in the Norwegian newspapers and a large number of articles have been written on the subject. The articles were selected from the newspaper Verdens Gang (VG) in June and July 2004.
3.2 Predicate-argument structures

"Not the same thing a bit," said the Hatter. "Why, you might as well say that 'I see what I eat' is the same thing as 'I eat what I see'."
from Alice in Wonderland by Lewis Carroll.

For the purposes of the subsequent classification phase, a meaning representation that would not allow for ambiguity or vagueness was desirable. The motivation for choosing elementary predicate-argument structures, or EPAS, as the representation of the meaning structures in the text collection is explained in the following. Using the term EPAS, rather than referring to the verb and its subject and object, contributes to normalising and generalising the data.

With EPAS as the meaning representation, the focus of the structure is the verbal predicate. Instead of structuring the semantic representations extracted from the texts according to the grammatical role and formal function each word holds in the sentence, we look at how the verbal predicate combines with arguments. This is closely related to the idea of thematic roles, where the focus is on which roles the entities in a sentence occupy. It is suggested that “verbs must have their thematic role requirements listed in the lexicon” (Saeed 1997, p. 140), and thus that each verb has a predetermined set of possible argument frames.

Thematic roles cover a wide range of functions that the entities in a sentence can have. In Saeed’s hierarchy of thematic roles, the agent is the initiator of an action, while the patient and the theme are the entities on which an action is performed. In Norwegian and English, subjects tend to be agents, and direct objects tend to be patients and themes (Saeed 1997, p. 145). The speaker can depart from this tendency for stylistic reasons or to alter the information structure, for example by using the passive voice.
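The difference between organising a clause by grammatical role and organising it around the verbal predicate can be sketched in a few lines of Python. This is only an illustration: the `Epas` class, its field names, the `from_clause` helper and the example words are hypothetical and are not the representation format used in the project. The sketch also assumes that an active clause and its passive counterpart should map to the same structure, which the text above does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Epas:
    """A verbal predicate with up to two arguments (hypothetical layout)."""
    predicate: str
    arg1: Optional[str] = None  # agent-like argument (assumption)
    arg2: Optional[str] = None  # patient/theme-like argument (assumption)


def from_clause(verb, subject, obj=None, passive=False, by_agent=None):
    """Build an Epas from a clause already analysed into grammatical roles.

    In a passive clause the grammatical subject is the affected entity,
    so it goes into arg2 and the optional by-phrase agent into arg1.
    """
    if passive:
        return Epas(verb, by_agent, subject)
    return Epas(verb, subject, obj)


# "The police arrest the man" / "The man is arrested by the police"
active = from_clause("arrest", "police", "man")
passive = from_clause("arrest", "man", passive=True, by_agent="police")

# "The man arrests the police": same words, different argument order
swapped = from_clause("arrest", "man", "police")
```

Under these assumptions `active` and `passive` compare equal even though their grammatical subjects differ, while `swapped` is a distinct structure, which mirrors the Hatter's point that the order of arguments matters.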
The assignment of thematic roles to particular positions in a sentence is closely connected to the hierarchical structure of the thematic roles. Each sentence position has its own hierarchy of thematic roles; the hierarchy in (3-1) gives the preferred order of roles in subject position (Saeed 1997, p. 146).

(3-1) agent > recipient/benefactive > theme/patient > instrument > location

Structuring a semantic representation into predicates with their arguments does not, however, express exactly the same information as the assignment of thematic roles does.
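The hierarchy in (3-1) can be read as a ranking: of the thematic roles present in a clause, the highest-ranked one is the preferred candidate for subject position. A minimal sketch in Python follows; the function name, the use of plain strings as role labels and the alphabetical tie-break within a tier are assumptions made for illustration, not part of Saeed's account.

```python
# Tiers of (3-1), highest preference first. Roles separated by "/" in the
# hierarchy share a tier.
SUBJECT_HIERARCHY = [
    {"agent"},
    {"recipient", "benefactive"},
    {"theme", "patient"},
    {"instrument"},
    {"location"},
]


def preferred_subject_role(roles):
    """Return the role in `roles` that ranks highest for subject position.

    Returns None if none of the given roles appears in the hierarchy.
    Ties within a tier are broken alphabetically (an arbitrary choice).
    """
    present = set(roles)
    for tier in SUBJECT_HIERARCHY:
        matches = sorted(tier & present)
        if matches:
            return matches[0]
    return None
```

For a clause with a patient and an agent, `preferred_subject_role(["patient", "agent"])` selects the agent; with only an instrument and a location available, the instrument is preferred.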