21.04.2013 Views

Eckhard Bick - VISL


word-final 'r', especially in infinitives, approaches the zero-morpheme: *amá - amar (to love), *agradecê - agradecer (to thank), *mulhé - mulher (woman).

The quantitatively most demanding problem, however, is faulty accentuation, due either to typing errors made when the corpus was compiled (the Borba-Ramsey corpus on the European Corpus Initiative CD-ROM was not scanned or collected from pre-existing electronic text, but typed), or to 8th-bit-ASCII losses in traffic accidents on the information highway (where only English gets a safe ride). Removing or adding accents may, however, introduce mistakes where both the accented and the unaccented word form are perfectly normal lexical items, as in maca (hammock), maça (club) and maçã (apple). There may also be ambiguity as to which accent to add. Therefore, most of the accentuation heuristics module is applied only to otherwise unanalysable words. Safe bets are adding the til (tilde) in word-final 'ao' and 'oes' (which are nearly unthinkable without the accent), yielding 'ão' and 'ões', whereas the change of 'c' into 'ç' before dark vowels is much more likely in the suffix '-ção' (plural '-ções') than, say, in word-initial position.
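The tilde and cedilla heuristics described above can be illustrated with a minimal Python sketch. This is not the parser's actual implementation: the function names (`add_tilde`, `add_cedilla_in_suffix`, `repair`) and the representation of the lexicon as a plain set of word forms are assumptions made for illustration only. The sketch applies the repairs only to words that are otherwise unanalysable, and restricts the 'c' to 'ç' change to the suffix position.

```python
def add_tilde(word: str) -> str:
    """Restore the tilde in word-final 'ao' / 'oes' (the 'safe bets')."""
    if word.endswith("oes"):
        return word[:-3] + "ões"
    if word.endswith("ao"):
        return word[:-2] + "ão"
    return word


def add_cedilla_in_suffix(word: str) -> str:
    """Change 'c' to 'ç' only inside a final '-cão' / '-cões' (hypothetical helper)."""
    for plain, fixed in (("cões", "ções"), ("cão", "ção")):
        # Require at least one character before the suffix.
        if word.endswith(plain) and len(word) > len(plain):
            return word[: -len(plain)] + fixed
    return word


def repair(word: str, lexicon: set[str]) -> str:
    """Try the heuristics only on words the lexicon cannot analyse."""
    if word in lexicon:
        return word  # analysable as is: do not touch it
    candidate = add_tilde(word)
    if candidate in lexicon:
        return candidate
    candidate = add_cedilla_in_suffix(candidate)
    if candidate in lexicon:
        return candidate
    return word  # no safe repair found


# Toy lexicon, assumed for this example only.
toy_lexicon = {"nação", "nações", "limão", "maca", "maça", "maçã"}
print(repair("nacao", toy_lexicon))
print(repair("maca", toy_lexicon))
```

Because `maca` is itself a valid lexical item in the toy lexicon, the sketch leaves it unchanged, which mirrors the reason given above for restricting the heuristics to unanalysable words.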

Of the non-nasal accents in Portuguese, the grave accent only appears when the preposition a is fused with pronouns whose first letter is 'a': à (=a a), às (=a as), àquela (=a aquela). Since, on the tagging level, the parser does not yet have enough contextual knowledge to distinguish the isolated pronoun from the misspelled fused form a + pronoun, no accent-adding is attempted here.

Spelling errors involving the acute and circumflex accents are handled by the tagger module in the following way:

If there is no prior analysis, and if:

(a1) the word contains no accent, and only 1 vowel
-> add an acute accent to the vowel
-> if the word is still unanalysable, add a circumflex instead

(a2) the word contains no accent, and more than 1 vowel
-> look the word's potential stems up as unaccented root-forms ("R-forms") in the lexicon.
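Rules (a1) and (a2) can be sketched in Python as follows. This is a simplified illustration, not the tagger's actual code: the function name `correct_accent`, the lexicon as a set of word forms, and the `r_forms` dictionary mapping an unaccented root form to its accented lexicon entry are all assumptions. In particular, the sketch treats the whole word as its only potential stem and does no inflectional analysis.

```python
VOWELS = "aeiou"
ACCENTED = set("áéíóúâêôãõà")
ACUTE = {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú"}
# Portuguese uses the circumflex only on a, e and o.
CIRCUMFLEX = {"a": "â", "e": "ê", "o": "ô"}


def correct_accent(word, lexicon, r_forms):
    """Return an accented correction for `word`, or None if no rule applies."""
    # Precondition: no prior analysis, and the word carries no accent.
    if word in lexicon or any(ch in ACCENTED for ch in word):
        return None
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]

    if len(vowel_positions) == 1:
        # (a1): try the acute accent first, then the circumflex.
        i = vowel_positions[0]
        for table in (ACUTE, CIRCUMFLEX):
            marked = table.get(word[i])
            if marked is not None:
                candidate = word[:i] + marked + word[i + 1:]
                if candidate in lexicon:
                    return candidate
        return None

    if len(vowel_positions) > 1:
        # (a2): look the word up as an unaccented R-form.
        return r_forms.get(word)

    return None


# Toy data, assumed for this example only.
toy_lexicon = {"pé", "três", "café"}
toy_r_forms = {"cafe": "café"}
for w in ("pe", "tres", "cafe"):
    print(w, "->", correct_accent(w, toy_lexicon, toy_r_forms))
```

Keeping the R-forms in a separate mapping reflects the idea that they are lexicon entries in their own right, normally reserved for derivation but made available here for accent repair.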

Since the acute and circumflex accents in Portuguese - besides denoting vowel opening in 'e' and 'o' - are used as stress markers, and since stress can change in derivation, accented word stems that can potentially take suffixes (i.e. typically nouns and adjectives 38) have "R-forms" (derivation root forms) entered in the lexicon, where the accent has been removed. Ordinarily, these are intended to be used only in combination with a stress-taking suffix, like '-ável' or '-inho'. In the spelling correction module, however, this condition is suspended, and "R-forms" may be used to recognise missing-accent errors in suffix-less words. There are (acute- and

38 Verbs, too, combine with a number of suffixes, but all verbs' base forms (infinitives) have oxytonic stress, and since accents in Portuguese are stress markers, no extra lexicon entry is necessary here.

