
Unni Cathrine Eiken February 2005



3 From text to EPAS – the extraction method

This chapter describes the extraction method used in this project. The method extracts EPAS (elementary predicate-argument structures) from a text corpus consisting of newspaper texts collected from the internet.

3.1 Selecting the texts

Specifying the requirements for a suitable text collection is not as trivial as it may seem. To ensure that the extracted EPAS would produce semantically valid results when classified, the texts from which the structures were extracted had to fulfil certain requirements. Since the classification builds on the distributional hypothesis and relies on EPAS whose distribution is particular to a restricted domain, the most important initial requirement was that all texts belong to the same closed thematic domain. But how exactly does one define the notion of a thematic domain? The first test set collected for the project consisted of factual prose texts dealing with roughly the same field. These texts, however, proved to be quite unsuitable for the later analysis, for reasons explained below.

It is clear that the text collection from which the EPAS are derived must meet certain specifications. Texts displaying longer discourse chains are most suitable for the purpose of this project: one thematic domain must be described over several paragraphs, or preferably over the entire course of the discourse in the text. In order to extract the desired information from the texts and subsequently test whether useful information has been extracted, the texts must contain anaphora, or referring expressions, and pronouns in particular. As such, texts containing discourse with a certain amount of concrete content were particularly useful for my purpose.
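The two requirements above (discourse that spans several paragraphs, and the presence of pronouns) can be illustrated with a rough screening sketch. This is not the selection procedure used in the project, which is not specified at this level of detail. The function name `screen_text`, the thresholds, and the pronoun list are hypothetical, and paragraphs are assumed to be separated by blank lines.

```python
import re
from dataclasses import dataclass


@dataclass
class ScreeningResult:
    paragraphs: int       # number of non-empty paragraphs
    tokens: int           # number of word tokens
    pronoun_ratio: float  # share of tokens found in the pronoun list
    suitable: bool        # True if both thresholds are met


def screen_text(text, pronouns, min_paragraphs=3, min_pronoun_ratio=0.02):
    """Roughly check a candidate text against two selection criteria.

    Both thresholds are illustrative assumptions, not values from the thesis.
    `pronouns` is a caller-supplied collection for the language of the corpus.
    """
    pronoun_set = {p.lower() for p in pronouns}
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Crude tokenisation: lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    hits = sum(1 for t in tokens if t in pronoun_set)
    ratio = hits / len(tokens) if tokens else 0.0
    suitable = len(paragraphs) >= min_paragraphs and ratio >= min_pronoun_ratio
    return ScreeningResult(len(paragraphs), len(tokens), ratio, suitable)


if __name__ == "__main__":
    sample = "The minister met the press.\n\nShe said it was late.\n\nThey left."
    print(screen_text(sample, pronouns={"she", "it", "they"}))
```

A check like this only flags candidates for manual inspection. Whether a text stays within one thematic domain still has to be judged by reading it.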

Texts that are too vague, both with regard to their textual content and to their membership in a particular thematic category, were not suitable for the purpose of this project because they do not contain the type of theme-specific selectional constraints we are interested in extracting. One recurring problem in the text collections tested for the project was that too little information was expressed in running text and too much was presented in bullet lists, tables and
