21.04.2013 Views

Eckhard Bick - VISL


The NURC speech corpus (“Norma urbana culta”) 241, described in (Castilho, 1989)
ca. 100.000 words, for testing purposes only (Bick, 1998-2)
Brazilian transcribed interviews, monologue and conversation

Folha de São Paulo (1994-1996 running editions)
ca. 90.000.000 words, for a research project 242 at the Catholic University of São Paulo
Brazilian newspaper texts (all topics)

The Tycho Brahe corpus (17th century sample), cf. www.ime.usp.br/~tycho
ca. 50.000 words, for external use
historical Portuguese (Antonio das Chagas) 243

To make automatic comparison possible, the system’s morphological tag set was filtered into specific synthetic tags also recognized by the probabilistic tagger used in the Tycho Brahe project.

The NILC corpus (Núcleo Interinstitucional de Lingüística Computacional, http://www.nilc.icmc.sc.usp.br/) 244
ca. 39.000.000 words, used for testing purposes
ca. 100.000 words for external evaluation
journalistic, didactic and student essay texts

Originally, I tagged this corpus for internal purposes only, as a means of testing the robustness of the morphological part of the CG parser. However, part of the corpus (100.000 words of mixed science, literature and economics) also exists in a hand-tagged version established by NILC in order to train a probabilistic or hybrid tagging system. As in the Tycho Brahe case, the CG morphological tag set proved rich enough to allow filtering into the specific synthetic tags preferred by the NILC team, making direct comparison possible. A special challenge in this case was the distinction between 6 different verbal “valency word classes”, VAUX, VLIG, VINT, VTD, VTI, VBI, roughly matching the (instantiated) CG valency tags , , , , and /, respectively.
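The filtering step can be pictured as a lookup table from CG tags to the coarser target classes, followed by a token-by-token agreement count against the hand-tagged version. What follows is only a minimal sketch under stated assumptions: the six target class names are those listed above, but the CG-side keys (`CG_TAG_1` to `CG_TAG_6B`) are hypothetical placeholders rather than the parser's actual valency tags, the function names `filter_tags` and `agreement` are invented for illustration, and the four-token sample is toy data, not corpus material.

```python
# Hypothetical placeholder keys stand in for the CG valency tags;
# only the six target classes (VAUX ... VBI) are taken from the text.
CG_TO_NILC = {
    "CG_TAG_1": "VAUX",
    "CG_TAG_2": "VLIG",
    "CG_TAG_3": "VINT",
    "CG_TAG_4": "VTD",
    "CG_TAG_5": "VTI",
    "CG_TAG_6A": "VBI",  # two CG tags collapse into
    "CG_TAG_6B": "VBI",  # one target class
}


def filter_tags(cg_tags, mapping=CG_TO_NILC):
    """Map each CG tag to its target class; unmapped tags pass through unchanged."""
    return [mapping.get(tag, tag) for tag in cg_tags]


def agreement(system_tags, gold_tags):
    """Return the share of token positions where both tag sequences agree."""
    if len(system_tags) != len(gold_tags):
        raise ValueError("tag sequences must be token-aligned")
    if not gold_tags:
        return 0.0
    matches = sum(s == g for s, g in zip(system_tags, gold_tags))
    return matches / len(gold_tags)


# Toy data: four verb tokens, one deliberate disagreement in the last position.
cg_output = ["CG_TAG_1", "CG_TAG_3", "CG_TAG_6B", "CG_TAG_4"]
hand_tagged = ["VAUX", "VINT", "VBI", "VTI"]

filtered = filter_tags(cg_output)
print(filtered)                          # ['VAUX', 'VINT', 'VBI', 'VTD']
print(agreement(filtered, hand_tagged))  # 0.75
```

A many-to-one table of this kind only works in one direction: the richer tag set can be reduced to the coarser one, but not the other way round, which is why the comparison is run on the filtered output.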

As can be seen from the list, the parser can handle a fairly broad spectrum of Portuguese language data. The largest task, the tagging of 3 years of running newspaper text (Folha de São Paulo) for a research group at the Catholic University of São Paulo, took 50 hours of CPU processing time on a Linux system, averaging a speed of 500 words per second, and demonstrated the robustness of the system not only in grammatical, but also in technical terms.
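The quoted figures are mutually consistent, as a quick back-of-the-envelope check suggests. The sketch below simply divides the approximate corpus size given in the list (ca. 90.000.000 words) by the reported average speed; the variable names are illustrative, and since both inputs are rounded values from the text, the result is an estimate only.

```python
corpus_words = 90_000_000   # approximate size of the Folha de São Paulo corpus
words_per_second = 500      # reported average tagging speed

cpu_seconds = corpus_words / words_per_second
cpu_hours = cpu_seconds / 3600  # 3600 seconds per hour

print(cpu_hours)  # 50.0
```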

So far, no large-scale semantic annotation has been attempted, and automatic post-CG tree structure annotation of running text has only been performed on test texts and a 20.000 word corpus of teaching sentences.

241 I would like to thank Professor Ataliba de Castilho for making the NURC corpus accessible to me in electronic form.

242 In this connection, I would like to mention Tony Berber Sardinha, who has shown a great deal of (to-be-rewarded) confidence in my parser.

243 This text and the Tycho Brahe tag set were kindly made available by Helena Britto.

244 I would like to thank the NILC team for letting me have a go at their corpus, and Sandra Maria Aluisio for her patience in discussing tagging differences with me.
