Eckhard Bick - VISL


other languages from the Germanic, Romance and Finno-Ugric language families (Swedish, German, French, Finnish etc.) 105 . A mature CG grammar for the morphological level (word class or PoS disambiguation) typically consists of at least 1.000-2.000 rules. For the English ENGCG system, word class error rates of under 0.3% have been reported at a disambiguation level of 94-97% (Voutilainen, 1992).

In a recent direct comparison 106 between an updated ENGCG and a statistical tagger trained on a 357.000 107 word section of the Brown corpus, Samuelsson & Voutilainen (1999) found that error rates for the Constraint Grammar system were at least an order of magnitude lower than those of the probabilistic system at comparable disambiguation levels. Thus, ENGCG error rates were 0.1% at a 1.07 tag/word ratio and 0.43% at a 1.026 tag/word ratio, while the statistical system achieved error rates of 2.8% and 3.72%, respectively.
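The comparison above pairs each error rate with a residual ambiguity (tag/word ratio). As a rough illustration of how these two metrics are computed, the following toy sketch (function name and data are ours, not from the cited studies) counts the readings left per word and the words whose correct tag was discarded:

```python
# Hypothetical sketch: computing tag/word ratio and error rate for a
# partially disambiguated text. Not the evaluation code of the cited studies.
def evaluate(tagged, gold):
    """tagged: one set of remaining candidate tags per word;
    gold: the single correct tag per word."""
    ratio = sum(len(t) for t in tagged) / len(tagged)      # residual ambiguity
    errors = sum(1 for t, g in zip(tagged, gold) if g not in t)
    return ratio, errors / len(tagged)                     # error rate

# Toy text of four words, one still two-ways ambiguous:
tagged = [{"DET"}, {"N", "V"}, {"V"}, {"ADV"}]
gold = ["DET", "N", "V", "ADV"]
ratio, err = evaluate(tagged, gold)
# ratio = 1.25 tags/word, err = 0.0 (no correct reading was discarded)
```

Lower ambiguity ratios mean fuller disambiguation, which is why error rates are only comparable at similar tag/word ratios.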

Constraint Grammar type rules have also been used in hybrid systems, for instance where an automated learning algorithm is trained on a morphologically tagged corpus with the objective of constructing or selecting local context discard rules. Thus, Lindberg (1998), using Progol inductive logic programming 108 and a ±2 word context window, reports 98% recall on Swedish test texts, with a residual ambiguity of 1.13 readings per word and a rule body of 7.000 rules. Another hybrid system is described in Padró i Cirera (1997), where a relaxation labelling tagger is applied to English and Spanish. In this system, CG style rules for POS-tagging were integrated with HMM tagging, creating a statistical model for the distribution of tag targets and context conditions. Constraint rules were partly learned from a training corpus using statistical decision trees, and partly hand-written on the basis of output errors in probabilistic HMM taggers 109 . In comparison with HMM and relaxation labelling baseline taggers, both types of constraint rules improved tagger performance individually and, when combined, resulted in an overall precision rate of 97.35% for fully disambiguated Wall Street Journal text.
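The local-context discard rules described above can be sketched as follows. This is a minimal illustration under our own assumptions: the rule fields and the safeguard against removing a word's last reading follow general CG practice, not the actual rule format of Lindberg's or Padró i Cirera's systems.

```python
# Hypothetical sketch of CG-style discard rules with a local context window.
from dataclasses import dataclass

@dataclass
class Rule:
    target: str    # reading to discard, e.g. "V"
    offset: int    # context position relative to the word (-2 .. +2)
    required: str  # reading that must be present at that position

def apply_rules(cohorts, rules):
    """cohorts: one set of candidate readings per word, modified in place."""
    for i, cohort in enumerate(cohorts):
        for rule in rules:
            j = i + rule.offset
            if 0 <= j < len(cohorts) and rule.required in cohorts[j]:
                # Standard CG safeguard: never remove the last reading.
                if rule.target in cohort and len(cohort) > 1:
                    cohort.discard(rule.target)
    return cohorts

# "the round table": discard the verb reading of "round" after a determiner.
cohorts = [{"DET"}, {"ADJ", "V", "N"}, {"N", "V"}]
rules = [Rule(target="V", offset=-1, required="DET")]
apply_rules(cohorts, rules)
# -> "round" is left with {"ADJ", "N"}; "table" keeps its full cohort
```

The ±2 window limits each rule to the immediate neighbourhood of its target, which is exactly the restriction that distinguishes such induced rules from the sentence-scope contexts of a full CG, as discussed below.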

While hybrid systems thus seem to offer some advances in comparison with ordinary HMM modelling and related techniques, they are still far from achieving ENGCG level results, one likely explanation being that the automatically learned rules of such systems (so far) lack the global scope (i.e. sentence scope) and

105 For a short comparison of CG systems, cp. chapter 8.1.

106 Both systems used the same tag set: CG tags were filtered into the kind of fused single tags typical of statistical taggers. Both systems were tested on the same 50.000 word benchmark text, consisting of journalistic, scientific and manual excerpts.

107 At this training corpus size, the learning curve of the statistical tagger flattened out, suggesting that larger training corpora would not lead to any significant improvement in tagging performance.

108 In addition, Lindberg used so-called “lexical” rules (not induced), removing rare readings of frequent word forms, much like the heuristic rules in a regular CG - but with the important difference that the CG rules would be used after at least one round of regular disambiguation, whereas Lindberg’s lexical rules came into play before the ordinary (induced) rules.

109 With only 20 linguist-written rules, the balance was heavily in favour of the automatically generated constraints (8473).
