Eckhard Bick - VISL


3.5 Tools for disambiguation

In corpus linguistics, most systems of automatic analysis can be classified by measuring them against the bipolarity of rule based versus probabilistic approaches. Thus, Karlsson (1995) distinguishes between “pure” rule based or probabilistic systems, hybrid systems and compound systems, i.e. rule based systems supplemented with probabilistic modules, or probabilistic systems with rule based “bias” or postprocessing. As a second parameter, lexicon dependency might be added, since both rule based and probabilistic systems differ internally as to how much use they make of extensive lexica, both in terms of lexical coverage and granularity of lexical information.

Typically, in terms of computational viability, probabilistic systems are good at lower level analysis, especially word class (part of speech, PoS) annotation and speech recognition, while rule based systems have been preferred for higher level annotation, like constituent trees and argument structure. As a result of this polarisation, the older - linguistically motivated - term "parsing", though derived from "pars orationis" (part of speech), has come to mean, more narrowly, higher level syntactic analysis, while the newer - computationally motivated - term "tagging" has mostly been limited to lower level PoS annotation, which is the obvious application, at least for word based tags.

Even implementationally, the bipolarity is quite distinct: the archetypal rule based systems, PSG grammars and their descendants, have embraced declarative programming languages like Prolog and Lisp, while probabilistic systems huddle together around the Hidden Markov Model using procedural programming languages like C or - for statistics proper - common UNIX-tools like sort, uniq, awk and perl.
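
The kind of frequency statistics alluded to here can be illustrated with a minimal sketch. It is not taken from any of the systems discussed; it merely assumes that a tagged corpus is available as a flat sequence of tags and counts tag trigrams, roughly what a `sort | uniq -c` pipeline would do on a file with one n-gram per line. The function name `count_tag_ngrams` and the tag labels (DET, N, V, ADJ) are hypothetical and chosen for illustration only.

```python
from collections import Counter
from typing import Sequence


def count_tag_ngrams(tags: Sequence[str], n: int = 3) -> Counter:
    """Count every run of n consecutive tags in the sequence."""
    # Slide a window of width n over the tag sequence; a sequence
    # shorter than n yields an empty counter.
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


# Hypothetical tag sequence, standing in for a PoS-tagged corpus.
tags = ["DET", "N", "V", "DET", "N", "V", "DET", "ADJ", "N"]
counts = count_tag_ngrams(tags, 3)

# Print the most frequent trigrams, frequency first (as uniq -c would).
for ngram, freq in counts.most_common(3):
    print(freq, " ".join(ngram))
```

Relative frequencies derived from such counts are the raw material for the transition probabilities of an n-gram tagger; smoothing and the handling of unseen n-grams are left out of the sketch.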

With the advent of larger, multi-million word corpora, error rates have become more crucial alongside annotation speed, since manual post-processing is becoming less and less feasible. On the one hand, this should favour rule based systems, since they can - at least in theory - be made more "perfect", so the high initial price in manpower for writing a grammar should pay off for large corpora - the larger the corpus, the better the investment. On the other hand, large corpora supply better training facilities for the "cheap" probabilistic systems and should thus make them more accurate [88]. Then again, since what is really needed are tagged training corpora, co-operation between systems might be the best solution. This, however, presupposes more or less compatible category definitions and tag sets, which is, in spite of normalising initiatives like the EU's EAGLES convention (Monachini and Calzolari, 1996), far from being a reality today.

88 For a tagset of 50 PoS-inflexion tags or tag chains, for example, it is as hard to train trigrams on a million word corpus as it is to train tetragrams on a 50 million word corpus, the reason being that in both cases the number of words is only 8 times as high as the number of different n-grams (50^3 = 125,000 trigrams and 50^4 = 6,250,000 tetragrams, respectively). Training trigrams on a hundred million word corpus, however, yields on average 800 examples of each trigram combination - even when ignoring the relatively higher frequency of the more relevant trigrams -, which should be enough to do statistics on.
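
The arithmetic behind this footnote can be checked with a short sketch. It assumes, as the footnote implicitly does, that every combination of the 50 tags is a possible n-gram and that a corpus of N words contributes roughly N n-gram tokens; the function names are hypothetical and serve only the illustration.

```python
def distinct_ngrams(tagset_size: int, n: int) -> int:
    """Number of possible tag n-grams for a tagset of the given size."""
    return tagset_size ** n


def mean_examples_per_ngram(corpus_words: int, tagset_size: int, n: int) -> float:
    """Average number of corpus tokens per possible n-gram.

    Assumes a uniform spread over all combinations, which real
    corpora do not show; it is the footnote's simplification.
    """
    return corpus_words / distinct_ngrams(tagset_size, n)


TAGSET_SIZE = 50  # figure taken from the footnote

trigram_1m = mean_examples_per_ngram(1_000_000, TAGSET_SIZE, 3)
tetragram_50m = mean_examples_per_ngram(50_000_000, TAGSET_SIZE, 4)
trigram_100m = mean_examples_per_ngram(100_000_000, TAGSET_SIZE, 3)

print(trigram_1m, tetragram_50m, trigram_100m)  # prints: 8.0 8.0 800.0
```

Since actual tag n-grams are far from uniformly distributed, the frequent and more relevant combinations will be better attested than these averages suggest, while many rare combinations will not occur at all.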

