21.04.2013 Views

Eckhard Bick - VISL



As shown in the diagram (yellow boxes), PALMORF, or rather its preprocessor and heuristics modules, is quite capable of “meddling” with its data. Still, orthographic intervention as such (*) is used only heuristically, where no ordinary analysis has been found, and the altered word forms are marked 'ALT', so that they can be identified later, for example for output statistics and for the sake of general corpus fidelity. Affected areas are Luso-Brazilian orthographic variation (e.g. oi/ou digraphs, ct -> t, cp -> p), typographically based accentuation errors (e.g. 7-bit ASCII vs. 8-bit ASCII input) and some common spelling errors (e.g. cão -> ção, çao -> ção).
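The fallback behaviour described above can be illustrated with a minimal Python sketch. It is not PALMORF's actual code: the toy `LEXICON`, the `REWRITES` table and the function name `analyse` are all hypothetical, and only the substitution patterns themselves (ct -> t, cp -> p, oi/ou, cão -> ção, çao -> ção) are taken from the text. The point shown is the ordering: ordinary lookup first, orthographic alteration only if that fails, and an 'ALT' mark on every altered form.

```python
# Hypothetical toy lexicon; the real PALMORF lexicon is not reproduced here.
LEXICON = {"ato", "coisa", "informação"}

# Substitution patterns named in the text, tried in this (assumed) order.
# The oi/ou digraph variation is tried in both directions.
REWRITES = [
    ("ct", "t"),
    ("cp", "p"),
    ("ou", "oi"),
    ("oi", "ou"),
    ("cão", "ção"),
    ("çao", "ção"),
]


def analyse(word):
    """Return (form, tags); tags is None when no analysis is found.

    Ordinary lookup comes first. Orthographic alteration is tried only
    when that lookup fails, and a successful alteration is marked 'ALT'.
    """
    if word in LEXICON:
        return word, []
    for old, new in REWRITES:
        if old in word:
            candidate = word.replace(old, new)
            if candidate in LEXICON:
                return candidate, ["ALT"]
    return word, None
```

Because the 'ALT' mark travels with the analysis, altered forms can be counted or filtered afterwards, for instance with `[w for w in words if analyse(w)[1] == ["ALT"]]`.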

2.2.2.2 Preprocessing


Unlike post-analysis heuristics, preprocessor intervention (+) applies to all input, and is close to being a general parsing necessity. Among other things, a natural and unavoidable step in all NLP is the decision of what to tag. Obviously, in a word-based tagger and a sentence-based parser, this amounts to establishing word and sentence boundaries.

First, the preprocessor strives to establish what is not a word, and marks it by prefixing a $-sign: $. - $, - $( - $) - $% - $78.7 - $± - $” - $7:20 etc. Of these, some are later treated as words anyway. Thus, numbers will be assigned the word class NUM and a syntactic function, $% will be treated as a noun (N), and $7:20 as a time adverbial. Punctuation is treated in four ways:

(a) as sentence delimiter. Ordinarily, it is the DELIMITERS list of the CG rule file that determines which punctuation marks are treated as sentence boundaries (e.g. $. and $:, but not $- and $,). However, the preprocessor can add sentence delimiters () where it identifies sentence-final abbreviations or, for instance, in place of double line feeds around punctuation-free headlines.

(b) as a regular non-word. Such punctuation is shown in the analysis file without a tag (e.g. $: or $!), but can still be referred to by CG-rules.

(c) as tag-bearing “words”. This is unusual in a Constraint Grammar, but $% (as a noun) is an example, and $, as a co-ordinator (like the conjunction ‘e’) is another one.
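The two preprocessing decisions discussed in this section, marking non-words with a $ prefix and cutting the token stream at sentence delimiters, can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual preprocessor: the regular expression `TOKEN`, the function `preprocess` and the test for "is a word" (first character is a letter) are hypothetical simplifications, and the Python set `DELIMITERS` merely stands in for the DELIMITERS list of the CG rule file, using the two examples given in the text ($. and $:).

```python
import re

# Stand-in for the DELIMITERS list of the CG rule file (assumed content,
# taken from the examples "$." and "$:" in the text).
DELIMITERS = {"$.", "$:"}

# Hypothetical tokenizer: numbers such as 78.7 or 7:20 are kept together,
# then runs of word characters, then single punctuation characters.
TOKEN = re.compile(r"\d+(?:[.:,]\d+)*|\w+|[^\w\s]")


def preprocess(text):
    """Split text into sentences of tokens, prefixing non-words with '$'."""
    # Simplifying assumption: a token is a word if it starts with a letter.
    tokens = [t if t[0].isalpha() else "$" + t for t in TOKEN.findall(text)]

    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in DELIMITERS:      # sentence boundary reached
            sentences.append(current)
            current = []
    if current:                    # trailing material without a delimiter
        sentences.append(current)
    return sentences
```

For the input `"Chegou às 7:20, cedo. Custa 78.7 %!"` this yields two sentences: the comma is marked as `$,` but does not end the first sentence, while `$.` does. Assigning tags to items such as `$%` or `$7:20` is left to later stages and is not modelled here.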


[Diagram labels: OUTPUT; orthographic variation*; accentuation errors*; spelling errors*; propria heuristics+; non-propria heuristics+]
