21.04.2013 Views

Eckhard Bick - VISL

Eckhard Bick - VISL

Eckhard Bick - VISL

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

3.10 Speech data tagging:<br />

Probing the limits of robustness<br />

3.10.1 Text/speech differences in a CG perspective<br />

In terms of test texts, the probably most difficult task for the parser has been a pilot<br />

project on the tagging of transcribed speech data (<strong>Bick</strong>, 1998-2) from the NURC corpus<br />

of Brazilian educated urban speech (Castilho, 1989). While the morphological/PoS<br />

tagger module, with a success rate of around 99% even without additional rules, proved<br />

quite robust in test runs on spoken language data, syntactic analysis fared somewhat<br />

worse, with an initial correctness rate of 91-92% for the - rule-wise - unmodified system<br />

(cp. 3.10.5).<br />

In order to explain this discrepancy between morphological robustness and<br />

syntactic failure, a number of hypotheses were formulated and subsequently put to the<br />

test by changing the system’s preprocessor module and CG rule set accordingly:<br />

• In my parser, rules with morphological targets mostly use a shorter context range<br />

(group structure) than those with syntactic targets (cf. chapter 3.7.3). Thus, the<br />

proportion of rules without and with unbounded contexts is 10 times as high for<br />

rules targeting morphological tags than for syntactic targets, and 70-80% of all<br />

syntactic rules stretch their context all the way to the sentence delimiters – making<br />

these rules vulnerable to the speech specific absence or vagueness of such delimiters.<br />

• Incomplete utterances tend to leave group structure intact more often than clause<br />

structure, - at least if one doesn’t count repetitional modifications/corrections of<br />

prenominal modifiers (essas esses progressos, esta este caminho, da dos nomes),<br />

where word class adjacency rules can often override agreement rules.<br />

• Speech data lacks punctuation and has unclear sentence window borders, which is<br />

especially bad for syntactic CG analysis which tends to use many unbounded context<br />

restrictions (cp 1).<br />

• Speech data is filled with ”syntactic noise”, repetitions and false starts of one- or<br />

two-word chunks, as well as pause and phatic interjections (ahn, uh, eeh etc.).<br />

3.10.2 Preprocessing tasks<br />

In order to make these problems more accessible to Constraint Grammar rules, and to<br />

improve syntactic performance, a preprocessor was designed with the specific goal of<br />

establishing utterance or sentence boundary candidates and removing syntactic noise.<br />

Its task areas are the following:<br />

- 191 -

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!