21.04.2013 Views

Eckhard Bick - VISL



(i.e., not too complex) derivational analysis, one that escapes the heuristic filters described above. Research on large corpora can weed out the high-frequency cases of these words, which can then be entered into the lexicon. Checking against a 35 million word corpus, where I filtered the output of the parser for derived and unknown words, I found only some hundred words (118 word form types) where a derivative analysis had - wrongly - been preferred over the proper noun analysis.[33] Many instances were syntactically isolated in one-word headlines or brackets. Ten lexemes accounted for half the cases. Quite a few of these words had been given a derivative analysis with very short or rare roots ('mar' for 'Maria', 'pá' for 'Paulo', the chemical element 'frâncio' for 'Francisco', 'tê', 'fê' and 'zê'). Since I have a tag in the lexicon () for non-deriving lexemes, it was easy to prevent these roots from overgenerating. For others, like the group Cristiana, Cristiano, Cristina (root 'crista'), entering the names into the lexicon may be the appropriate solution.
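The effect of a non-deriving tag can be pictured with a minimal sketch. Everything in it is hypothetical: the `Root` record, the `non_deriving` flag, the toy suffix list and the `derivational_readings` function are illustrative assumptions, not the parser's actual lexicon format or derivation rules. Only the roots 'mar' and 'crista' and the example names come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Root:
    form: str
    # Hypothetical stand-in for the lexicon's tag for non-deriving lexemes.
    non_deriving: bool = False


# Toy lexicon with two roots mentioned in the text (illustrative only).
LEXICON = {
    "mar": Root("mar", non_deriving=True),
    "crista": Root("crista"),
}

# Toy suffix list, assumed for illustration; not the real suffix inventory.
SUFFIXES = ("ia", "ina", "iano", "iana")


def derivational_readings(word: str) -> list[tuple[str, str]]:
    """Return (root, suffix) splits whose root is in the lexicon and may derive."""
    lower = word.lower()
    readings = []
    for suffix in SUFFIXES:
        if not lower.endswith(suffix):
            continue
        stem = lower[: -len(suffix)]
        # Try the bare stem and the stem with a restored final vowel (crist- -> crista).
        for candidate in (stem, stem + "a"):
            root = LEXICON.get(candidate)
            if root is not None and not root.non_deriving:
                readings.append((root.form, suffix))
    return readings


print(derivational_readings("Maria"))     # root 'mar' is blocked by the flag
print(derivational_readings("Cristina"))  # root 'crista' still derives
```

In this sketch the flag removes the spurious reading for 'Maria', while 'Cristina' still receives a derivational reading from 'crista'. That is the kind of case where a lexicon entry for the name itself would be needed.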

Given its size, it was not feasible to inspect the large corpus (especially sentence-initial words) for the opposite error, i.e. preferring a proper noun analysis over a lexical derivational analysis, but shorter samples suggest that sentence-initial derived words are much less frequent than names. In mid-sentence, finally, the contextual constraints are quite effective and likely to make the right choices.

A final, thorough quantification on 21,806 words from the Borba-Ramsey corpus, containing 452 (2.1%) real or supposed name chains, yielded an error rate of about 2% for the PROP class (positive and negative errors combined, shaded in table 9). This is higher than the parser's usual morphological/PoS error rate of under 1%, but one must take into consideration that all 11 errors occurred heuristically, mostly with lexically unknown words, of which half were spelled incorrectly.

(9) Table: name frequency statistics

                      correct analysis:
chosen tag:           Proper noun     Other, simple   Other, derived
PROP                  79 (17.5%)      0               0
PROP                  362 (80.1%)     2 (0.4%)        0
Other word classes    9 (2.0%)        -               -
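The figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that the percentages are taken relative to the 452 name chains and that the error rate combines both error types. The variable names are illustrative.

```python
# Counts taken from the text and the table.
corpus_words = 21_806
name_chains = 452
false_positives = 2  # non-names wrongly tagged PROP
false_negatives = 9  # names wrongly given another word class

errors = false_positives + false_negatives
name_chain_share = name_chains / corpus_words
error_rate = errors / name_chains

print(f"name chains in corpus: {name_chain_share:.1%}")
print(f"PROP errors: {errors} ({error_rate:.1%} of name chains)")
print(f"false positives: {false_positives / name_chains:.1%}")
print(f"false negatives: {false_negatives / name_chains:.1%}")
```

Under these assumptions the combined rate comes to roughly 2.4%, in line with the rounded figure of about 2% quoted in the text.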

The 2 cases of wrong positive choice were the sentence-initial words Lagartixou (which should have been a verb, derived from lagarto - 'lizard'), and Les (misspelled for the verbal inflexion form Lês - of ler 'to read'). Of the 9 cases involving wrong negative choices, 4 were names spelled in lower case (geraldinho, juraçy, sanhaço, playboy), 2 were sentence-initial words also occurring as common nouns (nogueira - 'nut tree', and bezerra - 'female calf'), one was a place name (Santo Amaro, read as a

[33] These statistics were compiled with an older version of the parser, which included ordinary lower case words in the prename context. With the up-to-date version, there is not such a strong bias in favour of PROP readings, and the percentage of false positive choices of a derivational reading might be expected to be somewhat higher.
