
account, and transition probabilities are computed from n+1-gram frequencies (using so-called trigrams, tetragrams etc.). In practice, however, due to the exponential combinatorial growth of the number of possible n-grams, such an approach is not feasible for an MM where states are words (or rather, in this context, word forms). Even the 1 million bigrams of a 1-million-word corpus are of little use for predicting the 40,000,000,000 possible transitions of a language with 200,000 word forms.
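The sparsity argument can be made concrete with a small sketch; the figures below are simply the ones quoted in the text, not measured values.

```python
# Sketch of the data-sparsity arithmetic described above
# (figures are the ones quoted in the text, not measured values).

vocabulary_size = 200_000          # word forms in the language
corpus_tokens = 1_000_000          # running words in the training corpus

possible_bigrams = vocabulary_size ** 2          # 40,000,000,000
observed_bigrams_at_most = corpus_tokens - 1     # ~1,000,000 bigram tokens

coverage = observed_bigrams_at_most / possible_bigrams
print(f"possible word-form bigrams:   {possible_bigrams:,}")
print(f"bigram tokens in the corpus:  {observed_bigrams_at_most:,}")
print(f"maximal coverage of the bigram space: {coverage:.6%}")   # about 0.0025%
```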

This is why most part-of-speech taggers use Hidden Markov Models, where states stand for word classes, or morphologically subclassified word classes, like NS (noun singular) or even VBE3S (the verb "to be" in the 3rd person singular), and each (PoS) state generates words from a matrix of so-called lexical probabilities. An English article, for example, might be said to have a probability of 0.6 of being 'the', and 0.4 of being 'a'. The model is called hidden because only the word symbols can be directly observed, whereas the underlying state transitions remain hidden from view.
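A minimal sketch of the components just described may help: hidden PoS states, transition probabilities between states, and lexical (emission) probabilities from each state to word forms. The tag set and all numbers below are illustrative assumptions, except for the article example (0.6 for 'the', 0.4 for 'a') taken from the text.

```python
# Hidden states are PoS tags; only the word symbols are observed.

transition_probs = {         # p(next_tag | current_tag), illustrative values
    "ART":   {"NS": 0.7, "ADJ": 0.3},
    "ADJ":   {"NS": 0.8, "ADJ": 0.2},
    "NS":    {"VBE3S": 0.5, "ART": 0.5},
    "VBE3S": {"ART": 0.6, "ADJ": 0.4},
}

lexical_probs = {            # p(word | tag), the matrix of lexical probabilities
    "ART":   {"the": 0.6, "a": 0.4},          # the article example from the text
    "ADJ":   {"old": 0.5, "hidden": 0.5},
    "NS":    {"tagger": 0.5, "model": 0.5},
    "VBE3S": {"is": 1.0},                     # "to be", 3rd person singular
}

# Only the word chain is directly observable; the generating tag chain is hidden.
observed = ["the", "model", "is", "hidden"]
```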

For word classes, trigram frequencies can be meaningfully computed from a tagged corpus of reasonable size, and the same corpus can be used to determine lexical frequencies. The trained tagger can then be used on unknown text, provided a lexicon of word forms, or at least of inflexion and suffix morphemes, is available. Interestingly, for small training corpora, the trigram approach even performs slightly better than a variable-context algorithm (Lezius et al., 1996).
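The training step described here amounts to counting tag trigrams and word-tag pairs in the tagged corpus. The following is a sketch under assumed conventions (sentences as lists of (word, tag) pairs); it does not reproduce any particular tagger's implementation.

```python
# Collect tag-trigram and lexical (word|tag) counts from a tagged corpus.
from collections import Counter

tagged_corpus = [
    [("the", "ART"), ("tagger", "NS"), ("is", "VBE3S"), ("hidden", "ADJ")],
    [("a", "ART"), ("model", "NS"), ("is", "VBE3S"), ("the", "ART"), ("tagger", "NS")],
]

tag_trigrams = Counter()     # basis for p(t_i | t_i-1, t_i-2)
lexical_counts = Counter()   # basis for p(word | tag)

for sentence in tagged_corpus:
    tags = [tag for _, tag in sentence]
    for word, tag in sentence:
        lexical_counts[(tag, word)] += 1
    for i in range(len(tags) - 2):
        tag_trigrams[tuple(tags[i:i + 3])] += 1

print(tag_trigrams.most_common(3))
print(lexical_counts.most_common(3))
```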

To make its decision, the HMM tagger computes the probability of a given string of words being generated by a certain sequence of word class transitions, and tries to maximise this value. The probability value (for a string w1 w2 w3 ... wn of n words) is the product of all n transition probabilities and all n lexical probabilities 89:

for bigrams:

p(T) * p(W|T) = p(t1) * p(t2|t1) * p(t3|t2) * ... * p(tn|tn-1) * p(w1|t1) * p(w2|t2) * p(w3|t3) * ... * p(wn|tn)

for trigrams:

p(T) * p(W|T) = p(t1) * p(t2|t1) * p(t3|t2,t1) * ... * p(tn|tn-1,tn-2) * p(w1|t1) * p(w2|t2) * p(w3|t3) * ... * p(wn|tn)

[where p = probability, W = word chain, T = tag chain, w = word, t = tag]

Since p(T|W) = p(W|T) * p(T) / p(W), and p(W) is constant for all readings, p(T|W) is maximised at the same time as p(T) * p(W|T).
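The maximisation itself is usually done with dynamic programming rather than by enumerating all tag chains. The sketch below is one way to do this (a standard Viterbi search over bigram transitions), assuming probability tables of the kind collected in the earlier sketches; it is not taken from any specific tagger.

```python
# Viterbi sketch: find the tag chain T maximising p(T) * p(W|T) for a word chain W,
# using bigram transition probabilities and lexical probabilities.

def viterbi(words, tags, start_p, trans_p, lex_p):
    # best[i][t] = (probability, backpointer) of the best tag chain ending in t at position i
    best = [{t: (start_p.get(t, 0.0) * lex_p.get(t, {}).get(words[0], 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emit = lex_p.get(t, {}).get(words[i], 0.0)
            prob, prev = max(
                (best[i - 1][s][0] * trans_p.get(s, {}).get(t, 0.0) * emit, s)
                for s in tags)
            column[t] = (prob, prev)
        best.append(column)
    # trace back the most probable tag chain from the last position
    last = max(tags, key=lambda t: best[-1][t][0])
    chain = [last]
    for column in reversed(best[1:]):
        chain.append(column[chain[-1]][1])
    return list(reversed(chain))
```

Because each column only needs the previous one, the search stays linear in sentence length instead of exponential in the number of possible tag chains.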

89 Eeg-Olofsson (1996, IV, p. 73) considers relative (i.e. lexical) and transitional probabilities to be, in a way, complementary, with one of them being able to compensate for a lack of information in the other.
