21.04.2013 Views

Eckhard Bick - VISL

pôlos ALT pólos
"pólo" N M S

Obviously, such changes may create unrealistic words with unwanted and improbable analyses. Thus the slang-decoder rule that changes word-final '-ê' into an infinitive '-er' might, for the French-Portuguese word ateliê, permit an analysis like "a+tela+ia+er", by wrongly "recognising" an infinitive ending and reducing the word to the root 'tela' - while drawing upon both the prefix and the suffix lexica. Therefore, derivational depth is limited in these cases to one prefix (and no suffix). Similar, but less rigorous, restrictions apply to Luso-Brazilian variation and spelling correction.
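The effect of such a depth limit can be illustrated with a minimal sketch. It is not the PALMORF implementation: the lexica, the function names `decode_slang` and `analyse_infinitive`, and the parameters `max_prefixes` and `max_suffixes` are all hypothetical, and the toy entries are stored in truncated surface form ('tel' for the root 'tela', 'i' for the suffix '-ia') because the sketch does no vowel elision. The input "fazê" is likewise only an illustrative slang spelling.

```python
# Toy lexica: hypothetical stand-ins, not PALMORF data. Entries are stored in
# truncated surface form because this sketch does no vowel elision.
PREFIXES = {"a"}
SUFFIXES = {"i"}        # stands in for the suffix '-ia'
ROOTS = {"tel", "faz"}  # 'tel' stands in for the root 'tela'
SLANG_ENDING = "ê"
INFINITIVE_ENDING = "er"


def decode_slang(word):
    """Rewrite word-final '-ê' as the infinitive ending '-er'."""
    if word.endswith(SLANG_ENDING):
        return word[: -len(SLANG_ENDING)] + INFINITIVE_ENDING
    return word


def analyse_infinitive(word, max_prefixes=1, max_suffixes=0):
    """Return (prefix, root, suffix, ending) tuples licensed by the toy lexica.

    At most one prefix and one suffix are ever tried; setting a limit to 0
    switches the corresponding lexicon off. The defaults mirror the
    restriction in the text: one prefix, no suffix.
    """
    if not word.endswith(INFINITIVE_ENDING):
        return []
    stem = word[: -len(INFINITIVE_ENDING)]
    prefixes = [""]
    if max_prefixes >= 1:
        prefixes += sorted(p for p in PREFIXES if stem.startswith(p))
    suffixes = [""]
    if max_suffixes >= 1:
        suffixes += sorted(s for s in SUFFIXES if stem.endswith(s))
    analyses = []
    for prefix in prefixes:
        for suffix in suffixes:
            root = stem[len(prefix): len(stem) - len(suffix)]
            if root in ROOTS:
                analyses.append((prefix, root, suffix, INFINITIVE_ENDING))
    return analyses


decoded = decode_slang("ateliê")                 # 'atelier'
unrestricted = analyse_infinitive(decoded, 1, 1)  # spurious a+tel+i+er
restricted = analyse_infinitive(decoded)          # blocked: no suffix allowed
genuine = analyse_infinitive(decode_slang("fazê"))
```

With both lexica open, the decoded form of ateliê receives the spurious four-part analysis; with the suffix lexicon switched off it receives none, while the genuine infinitive is still recognised.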

2.2.4.7 Heuristics: The last promille

About 0.05% - 0.2% [40] of lower case word forms in running text cannot be reduced to stems found in the PALMORF lexicon, even when using the derivational, variational or correctional modules described earlier. Name heuristics are not used on lower case word forms; exceptions, such as unknown pharmaceutical names, are treated as common nouns.

Since the parser's higher levels (for example, syntax) need some reading for every word to work on, these unanalysable lower case word forms need to be given one or more heuristic readings with regard to word class and inflexion morphology. Three main groups may be distinguished, comprising roughly one third of the cases each (cf. the corresponding statistics table in 2.2.6 on recall figures):

a) orthographic errors not detected by the accent module<br />

b) unknown and underivable Portuguese words or abbreviations<br />

c) unknown foreign loan words<br />

Sadly, for optimal performance, the three groups would require different strategies. Foreign words appearing in running Portuguese text are typically nouns or noun phrases, and trying to identify verbal elements only causes trouble. In "real" Portuguese words without spelling errors, structural clues - like inflexion endings and suffixes - should be emphasised. These will be meaningful in misspelled Portuguese words, too, but, in addition, specific rules about letter manipulation (doubling of letters, missing letters, letter inversion, missing blanks etc.) and even knowledge about keyboard characteristics might make a difference.
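The letter-manipulation rules listed above (doubled letters, missing letters, inverted letters, missing blanks) could be sketched roughly as follows. This is only an illustration of what such rules might look like, not something the text says was implemented: the function `correction_candidates`, the `ALPHABET` constant and the four-word lexicon are hypothetical, and keyboard characteristics are not modelled at all.

```python
# Letters tried when restoring a missing letter (an assumed inventory).
ALPHABET = "abcdefghijklmnopqrstuvwxyzáàâãçéêíóôõú"


def correction_candidates(word, lexicon):
    """Return lexicon entries reachable from `word` by one manipulation."""
    found = set()

    # Doubled letter: drop one of two identical adjacent letters.
    for i in range(len(word) - 1):
        if word[i] == word[i + 1]:
            candidate = word[:i] + word[i + 1:]
            if candidate in lexicon:
                found.add(candidate)

    # Missing letter: insert every letter at every position.
    for i in range(len(word) + 1):
        for letter in ALPHABET:
            candidate = word[:i] + letter + word[i:]
            if candidate in lexicon:
                found.add(candidate)

    # Letter inversion: swap each pair of adjacent letters.
    for i in range(len(word) - 1):
        candidate = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if candidate in lexicon:
            found.add(candidate)

    # Missing blank: split into two known words.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            found.add(left + " " + right)

    return found


# Hypothetical mini-lexicon, for illustration only.
TOY_LEXICON = {"casa", "livro", "por", "favor"}
doubled = correction_candidates("cassa", TOY_LEXICON)
missing = correction_candidates("lvro", TOY_LEXICON)
inverted = correction_candidates("csaa", TOY_LEXICON)
unsplit = correction_candidates("porfavor", TOY_LEXICON)
```

Each rule only proposes candidates that exist in the lexicon, so the sketch is useful for group (a) but does nothing for groups (b) and (c), where the stem is unknown by definition.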

Motivated by a grammatical perspective rather than a probabilistic one, my approach has been to emphasise groups (a) and (b) and look for Portuguese morphological clues in words with unknown stems. Since prefixes have very little bearing on the probability of a word's word class or inflexional categories, only the inflexion endings and suffix lexica are used. As in ordinary analysis (chapter 2.2.3)

[40] These figures are heavily dependent on text type and corpus quality (i.e. number of orthographical errors). As I corrected and improved my lexica, the percentage of unanalysed word forms has fallen to well below 0.1% for "good quality" texts.
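The ending-based strategy described in the main text (using only inflexion endings and suffixes, and ignoring prefixes) might be sketched as below. This is a toy under stated assumptions, not the PALMORF heuristics module: the ending table, the tag strings (modelled loosely on the "N M S" notation used earlier), the default noun reading and the function name `heuristic_readings` are all hypothetical.

```python
# Hypothetical ending table: word-final string -> heuristic readings.
# A real system would draw these from its inflexion and suffix lexica.
ENDING_READINGS = {
    "mente": ["ADV"],
    "ções": ["N F P"],
    "ando": ["V GER"],
    "ção": ["N F S"],
    "ar": ["V INF"],
    "os": ["N M P", "ADJ M P"],
    "as": ["N F P", "ADJ F P"],
}

# Assumed fallback when no ending matches: a masculine singular noun.
DEFAULT_READINGS = ["N M S"]


def heuristic_readings(word):
    """Return one or more readings based on the longest matching ending.

    Prefixes are deliberately not inspected, and a match must leave a
    non-empty (unknown) stem in front of the ending.
    """
    for ending in sorted(ENDING_READINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            return list(ENDING_READINGS[ending])
    return list(DEFAULT_READINGS)


gerund = heuristic_readings("blogueando")
plural_noun = heuristic_readings("zumbificações")
ambiguous = heuristic_readings("gizmos")
fallback = heuristic_readings("krk")
```

Returning a list rather than a single tag keeps ambiguous endings ambiguous, so that later levels of the parser can choose between the readings in context.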
