Unni Cathrine Eiken February 2005

3.3 Parsing with NorGram

To extract the EPAS from the text in a semi-automatic fashion, some form of linguistic analysis of the texts is needed. One problem with working on a small language like Norwegian is that the linguistic tools needed in the process are often not fully developed. Velldal (2003) describes a project in which a set of Norwegian nouns is grouped into semantic classes based on their distribution over a large body of text. A word’s distribution across different contexts is represented as a feature vector in a semantic space model. Velldal notes that no syntactic parser exists for Norwegian and instead uses a shallow processing tool on a tagged corpus. The processing tool “translates” the tagged structures into predicate-argument structures, avoiding the need for a parser by analysing only those parts of the text that are relevant for extracting the required structures.

As explained in section 3.2, an extraction method that is based on surface structures and does not take semantic relations into account might produce results that are unsuitable both for subsequent use in anaphora resolution and for generalisation of concepts. In view of this, the present work has aimed at developing an extraction method that uses parsed text to collect the meaning structures from the text. Although no parser currently covers the Norwegian language fully, a few alternative parsers are available. Even if these grammars are not robust enough to return parses for randomly chosen texts, they can be used for the experiments outlined in this project. The extraction method described in this thesis uses one of the existing parsing tools for Norwegian bokmål, NorGram (NorGram 2004).
Since there are no easy-to-use automated tools available for the extraction process, obtaining the EPAS from the text involved a substantial amount of manual work, even with a parser automating part of the extraction. Parsing the texts was nevertheless valuable: once a syntactic analysis was available, the EPAS could be extracted more readily. Because the extraction method is modular, the extraction process is not tied to a particular parser. Should a new and more robust grammar become available, the extraction method can be modified to accommodate it. The next section of this chapter briefly describes how the NorGram/XLE parser was used in the project, while section 3.3.2 describes in greater detail how the EPAS were extracted from the parser’s output.
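The parser independence described above can be illustrated with a minimal sketch. It is not the thesis implementation: the class and function names (Epas, Parser, ToyParser, extract), the flat PRED/SUBJ/OBJ analysis format, and the example sentence are all assumptions made for illustration. The point is only that the extraction step talks to a small interface, so a different grammar could be plugged in behind it.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Epas:
    """One extracted structure: a predicate with up to two arguments (assumed layout)."""
    predicate: str
    arg1: Optional[str] = None
    arg2: Optional[str] = None


class Parser(Protocol):
    """Anything that turns a sentence into a flat analysis dictionary, or None."""
    def parse(self, sentence: str) -> Optional[dict]: ...


class ToyParser:
    """Hypothetical stand-in for a real parser; it only knows canned sentences."""
    _ANALYSES = {
        # Invented example sentence: "The police questioned the witness."
        "Politiet avhørte vitnet.": {"PRED": "avhøre", "SUBJ": "politi", "OBJ": "vitne"},
    }

    def parse(self, sentence: str) -> Optional[dict]:
        return self._ANALYSES.get(sentence)  # None means "no parse"


def extract(parser: Parser, sentences):
    """Collect EPAS from parsed sentences; keep unparsed ones for manual review."""
    extracted, unparsed = [], []
    for sentence in sentences:
        analysis = parser.parse(sentence)
        if analysis is None:
            unparsed.append(sentence)
            continue
        extracted.append(
            Epas(analysis["PRED"], analysis.get("SUBJ"), analysis.get("OBJ"))
        )
    return extracted, unparsed
```

In this sketch, replacing the grammar would mean writing another class with the same parse method, while extract stays unchanged. The list of unparsed sentences corresponds to the manual work mentioned above.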
3.3.1 NorGram in outline

Norsk komputasjonell grammatikk (NorGram) is a computational grammar for Norwegian bokmål. NorGram is based on the unification-based grammar formalism Lexical Functional Grammar (LFG), in which language is described by means of feature structures that can be combined through unification. Researchers involved in the NorGram project cooperate with researchers at Palo Alto Research Center (PARC), formerly Xerox PARC, who have developed a well-functioning platform for the development of large-scale computational grammars. This system is called Xerox Linguistic Environment (XLE) and uses LFG as its theoretical linguistic framework. As such, NorGram can be described as an LFG grammar for Norwegian, while XLE is an implementation of LFG.

The NorGram grammar combined with an XLE module is a relatively broad-coverage parser that can analyse most structures found in Norwegian. It was chosen for this project because it was likely to return successful parses for a large part of the sentences found in the text collections. NorGram’s lexicon is quite large and includes entries for most regular Norwegian words. One problem with the lexicon with regard to the text collections used for this project is that it contains relatively few compounds. All theme-specific texts feature a theme-specific vocabulary, sometimes with words (especially compound nouns) that cannot be expected to be found in ordinary dictionaries. This was also the case for the text collection in this project. Compound nouns represented the largest group of words added to the lexicon. In Norwegian, speakers are fairly free to form compounds from words that can also occur on their own with an individual meaning. Whereas such compounds are written as two separate words in English, for example police investigator, they form a single new noun in Norwegian, for example politietterforsker (police investigator).
This opens up a potentially infinite class of nouns and makes it virtually impossible to include all possible words in any lexicon.

The NorGram lexicon was extended so that it could be used as a tool to extract the EPAS from the text collection. Compounds and proper nouns occurring in the sentences to be analysed were added to the lexicon files. To ensure that all EPAS could be extracted, every sentence that failed to parse was examined to identify the word causing the failure, and that word was then added to the lexicon. A more elegant way to solve the compound issue would be to make use of a module that splits compounds into the individual words they
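The idea of a compound-splitting module can be sketched as follows. This is an illustration only, not part of the project, where the lexicon was extended by hand. The function name split_compound, the toy lexicon, and the set of linking elements are assumptions; a real module would need a full lexicon and a way to rank competing splits.

```python
from typing import List, Optional, Set

# Assumed linking elements that may occur between the parts of a compound.
LINKING_ELEMENTS = ("", "s", "e")


def split_compound(word: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Return the parts of `word` if it can be built from lexicon entries, else None."""
    if word in lexicon:
        return [word]
    # Try the longest candidate first part before shorter ones.
    for end in range(len(word) - 1, 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        for link in LINKING_ELEMENTS:
            if not rest.startswith(link):
                continue
            # Skip the linking element and split what remains recursively.
            tail = split_compound(rest[len(link):], lexicon)
            if tail is not None:
                return [head] + tail
    return None


toy_lexicon = {"politi", "etterforsker"}
parts = split_compound("politietterforsker", toy_lexicon)
```

With the toy lexicon above, politietterforsker is analysed as politi plus etterforsker, so the compound itself would not need its own lexicon entry. Words that cannot be decomposed return None and would still have to be added manually.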