
3.5 Tools for disambiguation

In corpus linguistics, most systems of automatic analysis can be classified by measuring them against the bipolarity of rule based versus probabilistic approaches. Thus, Karlsson (1995) distinguishes between “pure” rule based or probabilistic systems, hybrid systems and compound systems, i.e. rule based systems supplemented with probabilistic modules, or probabilistic systems with rule based "bias" or postprocessing. As a second parameter, lexicon dependency might be added, since both rule based and probabilistic systems differ internally as to how much use they make of extensive lexica, both in terms of lexical coverage and granularity of lexical information.

Typically, in terms of computational viability, probabilistic systems are good at lower level analysis, especially word class (part of speech, PoS) annotation and speech recognition, while rule based systems have been preferred for higher level annotation, like constituent trees and argument structure. As a result of this polarisation, the older - linguistically motivated - term "parsing", though derived from "pars orationis" (part of speech), has come to mean, more narrowly, higher level syntactic analysis, while the newer - computationally motivated - term "tagging" has mostly been limited to lower level PoS-annotation, which is the obvious application for at least word based tags.

Even implementationally, the bipolarity is quite distinct: the archetypal rule based systems, PSG grammars and their descendants, have embraced declarative programming languages like Prolog and Lisp, while probabilistic systems huddle together around the Hidden Markov Model, using procedural programming languages like C or - for statistics proper - common UNIX tools like sort, uniq, awk and perl.
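The kind of "statistics proper" alluded to here - extracting tag sequences from an annotated corpus and counting their frequencies - can be illustrated with a short sketch. The following is merely illustrative, written in Python rather than the awk/perl pipelines mentioned above; the one-token-per-line word/TAG input format and the file name are assumptions made for the example, not features of any particular system discussed in this chapter.

    # Count PoS tag trigram frequencies in a tagged corpus.
    # Assumed input format: one "word/TAG" token per line (illustrative only).
    from collections import Counter

    def tag_trigrams(path):
        """Yield successive (tag, tag, tag) triples from a word/TAG file."""
        with open(path, encoding="utf-8") as f:
            tags = [line.rsplit("/", 1)[1].strip() for line in f if "/" in line]
        for i in range(len(tags) - 2):
            yield tuple(tags[i:i + 3])

    counts = Counter(tag_trigrams("tagged_corpus.txt"))
    # The ten most frequent trigrams, analogous to a sort | uniq -c | sort -rn | head pipeline
    for trigram, n in counts.most_common(10):
        print(n, " ".join(trigram))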

With the advent of larger, multi-million word corpora, apart from annotation speed, error rates have become more crucial, since manual post-processing is becoming less and less feasible. On the one hand, this should favour rule based systems, since they can - at least in theory - be made more "perfect", so the high initial price in manpower for writing a grammar should pay off for large corpora - the larger the corpus, the better the investment. On the other hand, large corpora supply better training facilities for the "cheap" probabilistic systems and should thus make them more accurate 88 . Yet again, since what is really needed are tagged training corpora, co-operation between systems might be the best solution. This, however, presupposes more or less compatible category definitions and tag sets, which is, in spite of normalising initiatives like the EU's EAGLES convention (Monachini and Calzolari, 1996), far from being a reality today.

88 For a tagset of 50 PoS-inflexion tags or tag chains, for example, it is as hard to train trigrams on a million word corpus as it is to train tetragrams on a 50 million word corpus, the reason being that in both cases the corpus size is only 8 times as high as the number of different n-grams. Training trigrams on a hundred million word corpus, however, yields on average 800 examples of each trigram combination - even when ignoring the relatively higher frequency of the more relevant trigrams - which should be enough to do statistics on.
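The arithmetic behind this footnote can be checked directly; the sketch below uses only the figures given in the footnote itself (a tagset of 50 tags).

    # Worked numbers for footnote 88 (tagset size 50 as given in the footnote).
    tagset = 50
    trigram_types = tagset ** 3          # 125,000 possible tag trigrams
    tetragram_types = tagset ** 4        # 6,250,000 possible tag tetragrams

    print(1_000_000 / trigram_types)      # 8.0   examples per trigram, 1M-word corpus
    print(50_000_000 / tetragram_types)   # 8.0   examples per tetragram, 50M-word corpus
    print(100_000_000 / trigram_types)    # 800.0 examples per trigram, 100M-word corpus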

