
Unni Cathrine Eiken February 2005



3.3.1 NorGram in outline

Norsk komputasjonell grammatikk (NorGram) is a computational grammar for Norwegian bokmål. NorGram is based on Lexical Functional Grammar (LFG), a unification-based grammar formalism in which language is described by means of feature structures that can be combined through unification. Researchers involved in the NorGram project cooperate with researchers at Palo Alto Research Center (PARC), formerly Xerox PARC, who have developed a well-functioning platform for the development of large-scale computational grammars. This system is called the Xerox Linguistic Environment (XLE) and uses LFG as its theoretical linguistic framework. As such, NorGram can be said to be an LFG grammar for Norwegian, while XLE is an implementation of LFG.
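The unification of feature structures mentioned above can be illustrated with a minimal sketch. It assumes feature structures are plain nested dictionaries with atomic string values; the feature names (GEND, NUM, SPEC, PRED) and the toy entries for en, et and bil are illustrative assumptions and are not taken from NorGram or XLE. Structure sharing (reentrancy), which full LFG implementations support, is not modelled here.

```python
from typing import Any, Dict, Optional

# A feature structure is modelled as a nested dict (an assumption of this sketch).
FeatureStructure = Dict[str, Any]


def unify(a: FeatureStructure, b: FeatureStructure) -> Optional[FeatureStructure]:
    """Return the unification of a and b, or None if their values clash."""
    result = dict(a)  # shallow copy so the inputs are never modified
    for feature, b_value in b.items():
        if feature not in result:
            # Only b carries this feature: the information is simply added.
            result[feature] = b_value
            continue
        a_value = result[feature]
        if isinstance(a_value, dict) and isinstance(b_value, dict):
            # Both values are themselves feature structures: unify recursively.
            merged = unify(a_value, b_value)
            if merged is None:
                return None
            result[feature] = merged
        elif a_value != b_value:
            # Conflicting values: unification fails.
            return None
    return result


# Hypothetical toy entries, for illustration only.
det_en = {"SPEC": {"DEF": "-"}, "GEND": "masc", "NUM": "sg"}
det_et = {"SPEC": {"DEF": "-"}, "GEND": "neut", "NUM": "sg"}
noun_bil = {"PRED": "bil", "GEND": "masc", "NUM": "sg"}

en_bil = unify(det_en, noun_bil)  # compatible: the information is merged
et_bil = unify(det_et, noun_bil)  # GEND clashes, so the result is None
```

In this sketch, combining en with bil succeeds because the two structures agree on every shared feature, while combining et with bil fails on the gender feature. That failure is how a unification-based grammar rules out an ill-formed combination.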

The NorGram grammar combined with an XLE module is a relatively broad-coverage parser that can analyse most structures found in Norwegian. It was chosen for this project because it was likely to return successful parses for a large share of the sentences found in the text collections. NorGram’s lexicon is quite large and includes entries for most ordinary Norwegian words. One problem with the lexicon with regard to the text collections used for this project is that it contains relatively few compounds. All theme-specific texts feature a theme-specific vocabulary, sometimes with words (especially compound nouns) that cannot be expected to be found in ordinary dictionaries. This was also the case for the text collection in this project. Compound nouns represented the largest group of words added to the lexicon. In Norwegian, one is fairly free to form compounds from words that can also stand alone and carry a meaning of their own. Whereas in English such compounds are written as two separate words, for example police investigator, in Norwegian they together form a single new noun, for example politietterforsker (police investigator). This opens up a potentially infinite class of nouns and makes it virtually impossible to include all possible words in any lexicon.
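To make the compounding pattern concrete, the following minimal sketch checks whether a word missing from a lexicon can be decomposed into entries the lexicon already holds. Everything in it is an assumption made for illustration: the function name split_compound, the toy lexicon, the minimum part length, and the small set of linking elements. It is not the method used in NorGram or in this project.

```python
from typing import List, Optional, Set

# Assumption: an optional linking element ("s" or "e") may join two parts.
LINKING_ELEMENTS = ("", "s", "e")


def split_compound(word: str, lexicon: Set[str], min_len: int = 3) -> Optional[List[str]]:
    """Split word into two or more lexicon entries, or return None."""
    for i in range(min_len, len(word) - min_len + 1):
        head, rest = word[:i], word[i:]
        if head not in lexicon:
            continue
        for link in LINKING_ELEMENTS:
            if not rest.startswith(link):
                continue
            tail = rest[len(link):]
            if len(tail) < min_len:
                continue
            if tail in lexicon:
                return [head, tail]
            # The remainder may itself be a compound of known words.
            deeper = split_compound(tail, lexicon, min_len)
            if deeper is not None:
                return [head] + deeper
    return None


# Hypothetical toy lexicon: the parts are listed, the compound is not.
toy_lexicon = {"politi", "etterforsker", "bil"}
parts = split_compound("politietterforsker", toy_lexicon)
```

Here politietterforsker is absent from the toy lexicon, yet both of its parts are present, so the word can be recognised as a combination of known entries. A naive splitter of this kind returns the first decomposition it finds and would need further constraints to handle ambiguous splits.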

The NorGram lexicon was extended so that it could be used as a tool to extract the EPAS from the text collection. Compounds and proper nouns that occurred in the sentences to be analysed were added to the lexicon files. To ensure that all EPAS could be extracted successfully, every sentence that failed to parse was examined to identify the word that caused the failure. That word was then added to the lexicon. A more elegant way to solve the compound issue would be to make use of a module that splits compounds into the individual words they

