21.04.2013 Views

Eckhard Bick - VISL


(iii) corvos-marinhos
      "corvo-marinho" N M P

(iv) Estados=Unidos
      "Estados=Unidos" PROP M P

(2) offers examples of derivational tags (DERP for prefixes and DERS for suffixes), as well as of polylexical word boundaries (the '=' sign in (iv) is introduced by the tagger to mark a non-hyphen polylexical link). Purely orthographic or procedural information can also be added to the tag list, for instance tags marking capitalisation or use of the heuristics module (see footnote 7).
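The reading lines in (iii) and (iv) share a simple layout: a quoted base form followed by space-separated tags. The following is a minimal sketch of how such a line could be split, assuming exactly that layout. The names `Reading` and `parse_reading` are hypothetical and are not part of PALAVRAS.

```python
import re
from typing import List, NamedTuple


class Reading(NamedTuple):
    base_form: str    # e.g. "corvo-marinho"
    tags: List[str]   # e.g. ["N", "M", "P"]
    parts: List[str]  # base form split at '-' or '=' links


# Assumed layout: a double-quoted base form, then whitespace-separated tags.
READING_RE = re.compile(r'^\s*"([^"]+)"\s*(.*)$')


def parse_reading(line: str) -> Reading:
    """Split one reading line into base form, tags and polylexical parts."""
    match = READING_RE.match(line)
    if match is None:
        raise ValueError(f"not a reading line: {line!r}")
    base_form, tag_field = match.groups()
    # '-' is an orthographic hyphen; '=' is the tagger's non-hyphen link.
    parts = re.split(r"[-=]", base_form)
    return Reading(base_form, tag_field.split(), parts)
```

For example, `parse_reading('"Estados=Unidos" PROP M P')` yields the base form `Estados=Unidos`, the tags `PROP`, `M`, `P`, and the two parts `Estados` and `Unidos`.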

The morphological analyser constitutes the lowest level of the PALAVRAS parsing system. It feeds its output to Constraint Grammar morphological disambiguation, and ultimately to the syntactic and semantic modules. PALAVRAS was originally designed for written Brazilian Portuguese, but now also recognises European Portuguese orthography and grammar, either directly (through lexical additions) or, if necessary, by systematic orthographic variation (pre-heuristics module).
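To illustrate what systematic orthographic variation can mean, here is a minimal sketch under stated assumptions. It uses one well-known difference, the mute consonants of traditional European spelling (acção, óptimo) against Brazilian ação, ótimo. The rule, the function name `normalise_variant` and the toy lexicon are illustrative assumptions, not the actual rules of the pre-heuristics module.

```python
import re
from typing import Optional, Set

# Assumed rule: a 'c' or 'p' directly before 'ç' or 't' may be a mute
# consonant of traditional European spelling (acção, óptimo).
MUTE_CONSONANT = re.compile(r"[cp](?=[çt])")


def normalise_variant(word: str, lexicon: Set[str]) -> Optional[str]:
    """Return a lexicon form for word, trying mute-consonant deletion."""
    if word in lexicon:
        return word  # direct lexical hit, no variation needed
    for match in MUTE_CONSONANT.finditer(word):
        candidate = word[:match.start()] + word[match.end():]
        if candidate in lexicon:
            return candidate  # accepted only if the lexicon confirms it
    return None  # left for later heuristics
```

Checking each candidate against the lexicon keeps the rule from damaging words such as "pacto", where the consonant is pronounced and the form is listed as it stands.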

Not all registers are equally accessible to automatic analysis: phonetic dialect spelling in fiction texts or phonetically precise transcription of speech data, for instance, causes obvious problems. Scientific texts can have a very rich vocabulary, but many of the difficult words are open to systematic Latin/Greek based derivation, which has been implemented in PALAVRAS. News texts often contain many names, but name candidate words can be identified quite effectively by heuristic rules based on capitalisation, in combination with character inventory and immediate context (cp. chapter 2.2.4.4). Only words derived from names (e.g. adjectives) and chemical or pharmaceutical names evade this solution, since they are not capitalised. These need to be treated by another morphological heuristics module, which is also used for misspellings, foreign loan words and the few Portuguese words that are neither listed in the PALAVRAS lexicon nor derivable by the analyser (cp. 2ii).
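The three cues named above for name candidates (capitalisation, character inventory, immediate context) can be sketched as follows. This is only an illustration under simple assumptions. The actual rules are described in chapter 2.2.4.4, and the function name, the allowed characters and the sentence-boundary test are hypothetical.

```python
from typing import List, Set

SENTENCE_END = {".", "!", "?"}


def name_candidates(tokens: List[str], lexicon: Set[str]) -> List[str]:
    """Return tokens that look like proper-name candidates."""
    candidates = []
    for i, token in enumerate(tokens):
        # Capitalisation: the first character must be upper case.
        if not token or not token[0].isupper():
            continue
        # Character inventory (assumed): letters, hyphens, apostrophes only.
        if not all(ch.isalpha() or ch in "-'" for ch in token):
            continue
        # Immediate context: a sentence-initial capital proves nothing
        # if the lower-cased word is an ordinary lexicon entry.
        sentence_initial = i == 0 or tokens[i - 1] in SENTENCE_END
        if sentence_initial and token.lower() in lexicon:
            continue
        candidates.append(token)
    return candidates
```

With the tokens `O presidente visitou Lisboa ontem . Maria chegou .` and a lexicon containing the lower-case common words, the sketch keeps `Lisboa` and `Maria` and discards the sentence-initial `O`.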

PALAVRAS' typical lexical recognition rate is 99.6-99.9% (cp. chapters 2.2.4.7 and 2.2.6). In these figures a word is counted as “recognised” if the correct base form or derivation is among those offered (ambiguity is only resolved at a later stage), and if propria are recognised as such (though without necessarily matching a lexicon entry).
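The counting rule behind these figures can be written out as a small sketch. The data layout (a gold base form, a gold proper-noun flag, and a list of offered readings per token) is an assumption made for illustration, as are the names `Token` and `recognition_rate`.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    gold_base: str                    # correct base form or derivation
    gold_is_proper: bool              # True if the token is a proprium
    offered: List[Tuple[str, bool]]   # (base form, is_proper) per reading


def is_recognised(token: Token) -> bool:
    """Apply the counting rule; ambiguity among readings is ignored."""
    if token.gold_is_proper:
        # Propria only need to be recognised as such, whatever the entry.
        return any(is_proper for _, is_proper in token.offered)
    # Other words need the correct base form among the offered readings.
    return any(base == token.gold_base for base, _ in token.offered)


def recognition_rate(tokens: List[Token]) -> float:
    """Share of tokens counted as recognised (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return sum(is_recognised(t) for t in tokens) / len(tokens)
```

Because ambiguity is resolved at a later stage, a token with several offered readings counts as recognised as soon as one of them is correct.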

2.2 The program and its data-bases

2.2.1 Program specifications

7 Any orthographical changes introduced by the tagger's heuristics module (spelling/accent correction etc.) are marked with an ALT-tag after the original word form. The xxx in (ii) means a hypothesized root not found in the current PALAVRAS lexicon, or one normally disallowed by inflexional or word class - affix combination rules.
