21.04.2013 Views

Eckhard Bick - VISL


(d) as part of words. For instance, $” will become a “left quote border” tag if attached to the left of an alphanumeric string, and a “right quote border” tag if attached to the right. Also, abbreviations often include punctuation (. , - /), which is especially problematic, since it creates ambiguity with sentence-boundary punctuation. To resolve the ambiguity, the preprocessor consults an abbreviation lexicon file and checks for typical sentence-initial/final context or typical context for individual abbreviations.
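
The abbreviation check can be pictured roughly as follows. This is only a minimal sketch in Python, not the preprocessor's actual code: the small `ABBREVIATIONS` table is a hypothetical stand-in for the abbreviation lexicon file, and the two context rules (end of input, or a following capitalised word unless the abbreviation is a title) are assumed simplifications of the context checks mentioned in the text.

```python
from typing import Optional

# Hypothetical stand-in for the abbreviation lexicon file.
# Value True marks titles, which are normally followed by a capitalised name.
ABBREVIATIONS = {
    "sr.": True,
    "dr.": True,
    "etc.": False,
    "p.ex.": False,
}

def ends_sentence(token: str, next_token: Optional[str]) -> bool:
    """Guess whether the final dot of `token` is sentence-boundary punctuation."""
    if not token.endswith("."):
        return False
    key = token.lower()
    if key not in ABBREVIATIONS:
        return True  # ordinary word followed by a full stop
    if next_token is None:
        return True  # abbreviation at the very end of the input
    precedes_name = ABBREVIATIONS[key]
    # Assumed context rule: a capitalised follower suggests a new sentence,
    # except after titles, where a capitalised name is the expected context.
    return next_token[:1].isupper() and not precedes_name
```

Under these assumptions, `ends_sentence("Sr.", "Silva")` is false, while `ends_sentence("etc.", "Depois")` is true.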

Second, the preprocessor separates what it takes to be words by line feeds. Here, the basic assumption of word-hood defines words as alphanumeric strings separated by blank spaces, hyphens, non-abbreviation punctuation, line feeds or tabs. The reason for including hyphens in the list is the need to morphologically analyse enclitic and mesoclitic pronouns (e.g. ‘dar-lhe-ei’), and to reduce the number of words unknown to the lexicon: the elements of hyphenated strings can thus be recognised and analysed individually by the PALMORF analyser, even if the compound as such does not figure in the lexicon. As a result, a word class and inflexional analysis can usually be provided and passed on to the syntactic and higher modules of the parser, even if only the last part of a hyphenated string is “analysable”.
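
As an illustration of this word-hood assumption, the splitting step might look like the sketch below. It is a simplification under stated assumptions: the function name `split_words` is hypothetical, hyphens are simply dropped, and punctuation and abbreviation handling are left out entirely.

```python
import re
from typing import List

def split_words(text: str) -> List[str]:
    """Split text into word candidates at whitespace and hyphens.

    Whitespace covers blank spaces, tabs and line feeds. Punctuation is
    not treated here, which is a simplification of the described step.
    """
    return [part for part in re.split(r"[\s\-]+", text) if part]

def one_word_per_line(text: str) -> str:
    """Render the word candidates separated by line feeds."""
    return "\n".join(split_words(text))
```

With this sketch, `split_words("dar-lhe-ei")` yields `["dar", "lhe", "ei"]`, so each element of the mesoclitic form can be looked up on its own.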

Third, for pragmatic reasons, a number of polylexicals have been entered in the PALMORF lexicon, consisting of several space- or hyphen-separated units that would otherwise qualify as individual words (e.g. ‘guarda-chuva’, ‘em vez de’). These polylexicals have been defined ad hoc, according to parsing needs (e.g. complex prepositions), semantic considerations (machine translation) or dictionary tradition. Polylexicals are treated like ordinary words by the parser, i.e. assigned form and function tags etc., and can be addressed as individual contexts by Constraint Grammar rules. In the newest version of the parser, one type of polylexical is assembled independently of existing lexicon entries: proper noun chains are fused into polylexical “words” if specified patterns of capital letters, non-Portuguese letter combinations and name chain particles (like ‘de’, ‘von’, ‘van’ etc.) are matched.
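
The fusion of proper noun chains could be sketched as follows. This is a hedged illustration, not the parser's own pattern matcher: it uses only capitalisation and the three particles named in the text, ignores the non-Portuguese letter criterion, and joins the fused units with a space, which is an assumption about the output format. The names `PARTICLES` and `fuse_name_chains` are hypothetical.

```python
from typing import List

# Only the particles named in the text; the real list is longer ("etc.").
PARTICLES = {"de", "von", "van"}

def fuse_name_chains(tokens: List[str]) -> List[str]:
    """Fuse runs of capitalised tokens, optionally linked by particles."""
    result = []
    i = 0
    while i < len(tokens):
        if not tokens[i][:1].isupper():
            result.append(tokens[i])
            i += 1
            continue
        chain = [tokens[i]]
        j = i + 1
        while j < len(tokens):
            if tokens[j][:1].isupper():
                chain.append(tokens[j])
                j += 1
            elif (tokens[j] in PARTICLES and j + 1 < len(tokens)
                  and tokens[j + 1][:1].isupper()):
                # A particle only joins the chain if a capitalised token follows.
                chain.extend([tokens[j], tokens[j + 1]])
                j += 2
            else:
                break
        result.append(" ".join(chain))
        i = j
    return result
```

Note that a sketch this simple would also fuse a capitalised sentence-initial word with a following name, so a real implementation needs additional context checks.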

Criteria for the heuristic identification of non-Portuguese strings are, among others, letters like ‘y’ and ‘w’, gemination of letters other than ‘r’ and ‘s’, and word-final letters other than vowels, ‘r’, ‘s’ and ‘m’. Apart from name recognition, identification of non-Portuguese strings is useful in connection with hyphenated word chains, which will not be split if they contain at least one non-Portuguese element, in order to avoid “accidental” (i.e. affix- or inflexion-heuristics-based) assignment of a non-noun word class (see footnote 9).
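
The three criteria named above can be written down directly, as in the following sketch. It is a literal reading of the text and nothing more: the function name `looks_non_portuguese` and the vowel inventory are assumptions, and the further criteria implied by “among others” are not modelled.

```python
# Assumed vowel inventory, including common accented forms.
VOWELS = set("aeiouáàâãéêíóôõú")
ALLOWED_FINALS = VOWELS | set("rsm")
ALLOWED_DOUBLES = set("rs")

def looks_non_portuguese(word: str) -> bool:
    """Apply the three heuristics quoted in the text to a single word."""
    w = word.lower()
    if not w:
        return False
    # Criterion 1: letters like 'y' and 'w'.
    if "y" in w or "w" in w:
        return True
    # Criterion 2: gemination of letters other than 'r' and 's'.
    for a, b in zip(w, w[1:]):
        if a == b and a.isalpha() and a not in ALLOWED_DOUBLES:
            return True
    # Criterion 3: word-final letter other than a vowel, 'r', 's' or 'm'.
    last = w[-1]
    return last.isalpha() and last not in ALLOWED_FINALS
```

Read this literally, the final-letter test also flags native words ending in other consonants (for instance ‘vez’), so the actual heuristic presumably applies further conditions or is restricted to particular contexts.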

2.2.2.3 Data bases and searching techniques<br />

On start-up the program arranges its data-bases in a particular way in RAM:<br />

a) the grammatical lexicon is organised alphabetically with grammatical information attached to the head word string. Each grammatical field has its own pointer. The

9 N (noun) and PROP (proper noun) are overwhelmingly the most common word classes for foreign language material in Portuguese.
