21.04.2013 Views

Eckhard Bick - VISL


The remaining 2 words of the 4-word pre-comma group are checked first; only then is the search window reset to start after the break (comma).

WORD1 WORD2 WORD3 WORD4, WORD5 ......
step 1 | |
step 2 | |
step 3 |XXXXXXXXXXXXXXX| | |
step 4 |_________________ _ _ _ _ _ _
|

The overlapping search is clearly necessary to find all possible combinations: without punctuation breaks, n words may form roughly n*(m-1) combinations of 2 to m elements. With a depth of 4 this amounts to about 3000 possible polylexicals for a 1000-word text.
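The estimate can be checked with a short count. This is only a sketch, assuming that a "combination" is a contiguous string of 2 to m words and that the text contains no punctuation breaks; the function name `count_candidates` is hypothetical.

```python
def count_candidates(n: int, m: int) -> int:
    """Count contiguous word strings of length 2..m in a break-free text of n words.

    A string of length k can start at n - k + 1 positions, so the exact
    total falls slightly below the rule of thumb n * (m - 1).
    """
    return sum(max(n - k + 1, 0) for k in range(2, m + 1))


if __name__ == "__main__":
    n, m = 1000, 4
    print(count_candidates(n, m))  # exact count: 999 + 998 + 997 = 2994
    print(n * (m - 1))             # rule of thumb from the text: 3000
```

The small difference comes from the end of the text, where the longer windows no longer fit.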

It is crucial to begin with the longest string and then work backwards, since one might otherwise miss 3- or 4-word polylexicals that "contain" smaller ones. E.g., in Portuguese, 'dentro=em' (inside) is a complex preposition, and 'dentro=em=breve' (before long) a complex adverb. In searching from left to right, one would miss the (longer) adverb reading, because 'dentro=em' is found first and the search string is reset to start from scratch at position 3.
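The effect of the search order can be illustrated with a minimal sketch. It is not the preprocessor's actual implementation: the function `find_polylexicals`, the two-entry toy lexicon, the example sentence and the set of break tokens are all assumptions made for illustration. Windows never cross a punctuation break, and the `longest_first` flag switches between the two strategies.

```python
BREAKS = frozenset({",", ".", ";", ":", "!", "?"})  # assumed break tokens


def find_polylexicals(tokens, lexicon, max_depth=4, longest_first=True):
    """Group tokens into polylexical units ('a=b=c') found in the lexicon.

    At each position the candidate window covers at most max_depth words
    and stops before the next punctuation break.
    """
    units = []
    i = 0
    while i < len(tokens):
        if tokens[i] in BREAKS:
            units.append(tokens[i])
            i += 1
            continue
        # Find the right edge of the window (exclusive).
        limit = i
        while (limit < len(tokens) and limit - i < max_depth
               and tokens[limit] not in BREAKS):
            limit += 1
        # Candidate end positions for strings of at least 2 words.
        if longest_first:
            ends = range(limit, i + 1, -1)
        else:
            ends = range(i + 2, limit + 1)
        for end in ends:
            candidate = "=".join(tokens[i:end]).lower()
            if candidate in lexicon:
                units.append(candidate)
                i = end  # restart the search after the match
                break
        else:
            units.append(tokens[i])  # no polylexical starts here
            i += 1
    return units


if __name__ == "__main__":
    lexicon = {"dentro=em", "dentro=em=breve"}  # toy lexicon
    tokens = ["Volto", "dentro", "em", "breve", ",", "prometo", "."]
    print(find_polylexicals(tokens, lexicon))
    print(find_polylexicals(tokens, lexicon, longest_first=False))
```

With the longest string tried first, the adverb 'dentro=em=breve' is found; with the shortest tried first, only the preposition 'dentro=em' is matched and 'breve' is left as a separate word.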

2.2.4.2 Word or morpheme: enclitic pronouns

Generally, in inflecting languages like Portuguese, future tense endings are regarded as bound morphemes, whereas pronouns are classified as words (free morphemes). However, to complicate matters for the preprocessor, Portuguese allows both to appear as hyphenated, "linked" morphemes as well. Consider the following examples:

(1a) O comprarei amanhã. (I'll buy it tomorrow.)

(1b) Comprá-lo-ei amanhã.

(2a) Não o pode fazer. (He can't do it.)

(2b) Não pode fazê-lo.

(3a) O tinham visto. (They had seen him.)

(3b) Tinham-no visto.

(4) Chove. (It rains.)

In (1b) the direct object pronoun 'o'/'lo' is placed mesoclitically, before the future tense inflexion ending, which thus becomes enclitic. The preprocessor has to recognise this structure and transform it into a canonical form that the word-based tagger can understand:

(1c) *Comprarei- o amanhã.
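A transformation of this kind might be sketched with a regular expression. The sketch below is an illustration under stated assumptions, not the preprocessor's actual rule set: the function `normalise_mesoclitic`, the lists of endings and pronouns, and the vowel table are all hypothetical. It restores the infinitive 'r' that is dropped before 'lo'/'la'/'los'/'las' (comprá → comprar), reattaches the future or conditional ending to the stem, and splits the pronoun off after a hyphen.

```python
import re
import unicodedata

# Assumed inventories, for illustration only.
ENDINGS = "ei|ás|á|emos|eis|ão|ia|ias|íamos|íeis|iam"  # future / conditional
PRONOUNS = "lo|la|los|las|me|te|se|lhe|lhes|nos|vos"
MESOCLITIC = re.compile(
    rf"(?P<stem>\w+)-(?P<pron>{PRONOUNS})-(?P<end>{ENDINGS})", re.IGNORECASE
)
# Stressed stem-final vowel -> vowel plus the restored infinitive 'r'.
RESTORE_R = {"á": "ar", "ê": "er", "i": "ir", "ô": "or"}


def normalise_mesoclitic(token: str) -> str:
    """Rewrite 'stem-pronoun-ending' as 'stem+ending- pronoun'.

    Tokens that do not look like a mesoclitic form are returned unchanged.
    """
    token = unicodedata.normalize("NFC", token)  # one code point per accented vowel
    match = MESOCLITIC.fullmatch(token)
    if match is None:
        return token
    stem = match.group("stem")
    pron = match.group("pron").lower()
    end = match.group("end")
    if pron in ("lo", "la", "los", "las"):
        last = stem[-1]
        if last not in RESTORE_R:
            return token  # unexpected stem shape: leave the token alone
        stem = stem[:-1] + RESTORE_R[last]
        pron = pron[1:]  # 'lo' -> 'o', 'las' -> 'as'
    return f"{stem}{end}- {pron}"


if __name__ == "__main__":
    for word in "Comprá-lo-ei amanhã .".split():
        print(normalise_mesoclitic(word))
```

Pronouns that do not trigger the loss of 'r' (e.g. 'me') pass through the same function with the stem left intact, so a form like 'dar-me-á' would become 'dará- me'.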

As can be seen in (2) and (3), both the stem and the enclitic pronoun undergo phonetically motivated changes, the infinitive losing its 'r' and receiving a stress
