
Eckhard Bick - VISL


                            2412 words          1837 words          4249 words
Error types:                errors  % correct   errors  % correct   errors  % correct
Morphology (all)              29     98.8 %        7     99.6 %       36     99.2 %
  unknown English           - 10                 - 1                - 11
  words in headlines         - 3                 - 0                 - 3
Morphology (pure)             16     99.3 %        6     99.7 %       22     99.5 %
Syntax (all)                  66     97.3 %       46     97.5 %      112     97.4 %
  syntax caused by
  morphology                - 37                 - 7                - 44
Syntax (pure)                 29     98.8 %       39     97.9 %       68     98.4 %
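The figures in the table can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch, assuming that "% correct" means 100 x (1 - errors / words) rounded to one decimal, and that each "pure" count is the "all" count minus the deductions listed beneath it. The column keys (`col_2412`, `col_1837`, `col_4249`) and the function names are hypothetical labels introduced for illustration; they do not appear in the source.

```python
# Word counts per table column (keys are hypothetical labels, not from the source).
WORDS = {"col_2412": 2412, "col_1837": 1837, "col_4249": 4249}

# Error counts copied from the table.
MORPH_ALL = {"col_2412": 29, "col_1837": 7, "col_4249": 36}
SYNTAX_ALL = {"col_2412": 66, "col_1837": 46, "col_4249": 112}

# Deductions copied from the table: (unknown English, words in headlines)
# for morphology, and "syntax caused by morphology" for syntax.
MORPH_DEDUCT = {"col_2412": (10, 3), "col_1837": (1, 0), "col_4249": (11, 3)}
SYNTAX_DEDUCT = {"col_2412": 37, "col_1837": 7, "col_4249": 44}


def percent_correct(errors: int, words: int) -> float:
    """Assumed definition: share of words without an error, in percent."""
    return round(100.0 * (1 - errors / words), 1)


def pure_morphology(col: str) -> int:
    """'All' morphological errors minus both listed deductions."""
    return MORPH_ALL[col] - sum(MORPH_DEDUCT[col])


def pure_syntax(col: str) -> int:
    """'All' syntactic errors minus those caused by morphology."""
    return SYNTAX_ALL[col] - SYNTAX_DEDUCT[col]


if __name__ == "__main__":
    for col, words in WORDS.items():
        print(
            col,
            percent_correct(MORPH_ALL[col], words),
            percent_correct(pure_morphology(col), words),
            percent_correct(SYNTAX_ALL[col], words),
            percent_correct(pure_syntax(col), words),
        )
```

Under these assumptions the recomputed values agree with the table, which also supports reading the two morphology deductions as separate sub-rows of each column.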

3.9.3 Text type interference and tag set complexity

However, a closer look at the texts involved reveals that the news texts are quite
different from the prose fiction example, both lexically and syntactically. First of all,
there is a rather high percentage of complex names (e.g. 'Massachusetts Institute of
Technology'), abbreviations ('MIT') and English loan words and vogue terms like 'joy
stick', 'bad boy' and the like. Thus a single word, console, used as an unknown
English noun ['video console'] and not as a Portuguese verb ['to comfort'], is
responsible for a third (!) of all errors in the video game text. Second, VEJA news texts
are syntactically very rich in free predicatives (typically information about persons,
institutions or abbreviations, like age, place, definition etc.), all acting as false
"argument candidates", as well as in other types of parenthetical information, bracketing,
headlines and interfering "syntactically superfluous" finite verb forms in the form of
quotations, all of which tend to blur the clause boundaries that would otherwise be
important structural information for the parser.

Still, none of the above problems is in principle intractable for the CG approach,
and by providing for special features like these in the rule set (and lexicon),
error rates can be reduced for any text type.

One might assume that errors are evenly spread throughout the text, which
would - for an average sentence length of 15 words - mean about one morphological
error in every tenth sentence, and a syntactic error in every third. However, this is not
true: for all text types, errors appear in clusters. Obviously, most morphological errors
also appear in the list of syntactic errors, and many syntactic errors interfere with
readings in their neighbourhood, due to rules that depend on clause boundary words,
the uniqueness principle and so forth. Thus, a V-N word class error not only affects
syntactic mapping and disambiguation for the word in question, but can cause 2 or 3
