Unni Cathrine Eiken February 2005



other similar constructions. This information was only accessible after manual editing of the texts, and even then it was often not useful, since this type of textual shorthand avoids precisely the discourse chains that are desirable. The information in bullet lists and tables in the unedited text is usually not formulated in well-formed sentences, and referring expressions and pronouns are usually avoided. Such texts are also not immediately suitable for parsing, which makes it complicated to extract EPAS (semi-)automatically.

As mentioned above, selecting the texts to be analyzed and creating the text collection that forms the basis of the classifications in the project was a task not to be underestimated. Several different types of texts were tried in the attempt to find a text type that satisfied the following criteria, in addition to being available for collection on the internet:

• Limited and naturally confined thematic domain
• Relatively long chains of discourse
• Fairly high occurrence of anaphora, pronouns in particular
• Several paragraphs where the same phenomenon is discussed
• Low occurrence of tables and illustrations; ideally, all the information in the texts should be expressed in complete and grammatical sentences

The text type that fulfilled these criteria to the highest degree was news text. By picking newspaper articles that all concerned the same theme, the criterion of a limited domain was satisfied. The articles, as provided on the internet, additionally fulfilled all the other requirements that had been set for the text collection. For this project, articles concerning a criminal case in the small town of Førde on the west coast of Norway were chosen, mainly because this was a very big case in the Norwegian newspapers and a large number of articles have been written on the subject. The articles were selected from the newspaper Verdens Gang (VG) in June and July 2004.

3.2 Predicate-argument structures

"Not the same thing a bit," said the Hatter. "Why, you might as well say that 'I see what I eat' is the same thing as 'I eat what I see'."
from Alice in Wonderland by Lewis Carroll.

For the purposes of the subsequent classification phase, a meaning representation that would not allow for ambiguity or vagueness was desirable. Using the term EPAS, rather than referring to the verb and its subject and object, contributes to normalising and generalising the data. The motivation for choosing elementary predicate-argument structures, or EPAS, as the representation of the meaning structures in the text collection is explained in the following.

With EPAS as the meaning representation, the focus of the structure is the verbal predicate. Instead of structuring the semantic representations extracted from the texts according to the grammatical roles and the formal function each word holds in the sentence, we look at how the verbal predicate combines with arguments. This is closely related to the idea of thematic roles, where the focus is on which roles the entities in a sentence occupy. It is suggested that “verbs must have their thematic role requirements listed in the lexicon” (Saeed 1997, p. 140) and, as such, that each verb has a predetermined set of possible argument frames.

Thematic roles span a wide range, describing the various roles the entities in a sentence can occupy. In Saeed’s hierarchy of thematic roles, the agent is the initiator of an action, while the patient and the theme are the entities on which an action is performed. In Norwegian and English, there is a tendency for subjects to be agents and for direct objects to be patients and themes (Saeed 1997, p. 145). The speaker can override this tendency as a stylistic choice or to alter the information structure, for example by using the passive voice.

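The Hatter's contrast can be made concrete with a minimal sketch. It assumes, for illustration only, that an EPAS is stored as a verbal predicate plus up to two arguments; the `EPAS` class, its field names and the `bag_of_words` helper are hypothetical and do not describe the tooling actually used in the project.

```python
from typing import NamedTuple, Optional


class EPAS(NamedTuple):
    """Hypothetical container for one elementary predicate-argument structure."""
    predicate: str               # the verbal predicate
    arg1: str                    # first argument (typically the subject)
    arg2: Optional[str] = None   # second argument (typically the object), if any


def bag_of_words(sentence: str) -> list:
    """Return the sorted words of a sentence, discarding all structure."""
    return sorted(sentence.split())


# The two sentences from the epigraph, with arguments assigned by hand.
see_what_i_eat = EPAS(predicate="see", arg1="I", arg2="what I eat")
eat_what_i_see = EPAS(predicate="eat", arg1="I", arg2="what I see")

# As unordered word lists the two sentences are indistinguishable ...
same_words = bag_of_words("I see what I eat") == bag_of_words("I eat what I see")

# ... but their predicate-argument structures are not.
same_structure = see_what_i_eat == eat_what_i_see
```

Under these assumptions `same_words` is true while `same_structure` is false, which is the distinction the predicate-centred representation is meant to preserve.
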
The assignment of thematic roles to particular positions in a sentence is closely connected to the hierarchical structure of the thematic roles. There is a hierarchy of defined thematic roles for each sentence position; the hierarchy in (3-1) exemplifies the preferred order of roles in subject position (Saeed 1997, p. 146).

(3-1) agent > recipient/benefactive > theme/patient > instrument > location

Structuring a semantic representation into predicates with their associated arguments does not, however, express exactly the same information as the assignment of thematic roles does.
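
The subject hierarchy in (3-1) can be read as a simple ranking. The sketch below is an illustration under stated assumptions, not a procedure from the project: the role labels are taken from (3-1), while `SUBJECT_HIERARCHY`, `ROLE_RANK`, `preferred_subject` and the example entities are hypothetical names introduced here.

```python
from typing import Dict

# Tiers of the hierarchy in (3-1), from most to least preferred in subject position.
SUBJECT_HIERARCHY = [
    {"agent"},
    {"recipient", "benefactive"},
    {"theme", "patient"},
    {"instrument"},
    {"location"},
]

# Map each role to the index of its tier; a lower index means a stronger preference.
ROLE_RANK = {
    role: tier_index
    for tier_index, tier in enumerate(SUBJECT_HIERARCHY)
    for role in tier
}


def preferred_subject(roles: Dict[str, str]) -> str:
    """Return the entity whose thematic role ranks highest in (3-1).

    `roles` maps each entity to its thematic role. Entities in the same tier
    are not distinguished; the first one listed is returned. An unknown role
    label raises KeyError.
    """
    return min(roles, key=lambda entity: ROLE_RANK[roles[entity]])


with_agent = preferred_subject(
    {"the door": "patient", "the key": "instrument", "the thief": "agent"}
)
without_agent = preferred_subject(
    {"the key": "instrument", "the door": "patient"}
)
```

With an agent present, the agent is selected as the preferred subject; without one, the patient outranks the instrument. The ranking expresses a preference only, so it does not rule out sentences in which a lower-ranked role surfaces as subject.
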

