Mehrsprachigkeit in Europa: Plurilinguismo in Europa ... - EURAC


Elena Chiocchetti, Verena Lyding

set of subfields of the subjects under analysis (see figure 1). Structural annotation adds markers to explicitly codify text structure, such as sentence and paragraph boundaries, as well as preamble, annex or title information.

Table 2 lists a number of categories adopted for structural annotation and meta data information.

META INFORMATION

• title of the document
• abbreviation
• official passing date
• in Official Journal (number, date)
• legal system
• legal hierarchy (e.g. regional or national legislation)
• language
• translation status (e.g. original or translated version)
• belonging to subfield (see figure 1 for full list of subfields)

STRUCTURAL INFORMATION

simple structural segments:
• sentence
• textual paragraph

structural segments with content-related information about segment type:
• title
• preamble, annex, …
• legal paragraph
• chapter

Table 2: Meta data categories and categories of structural annotation
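The categories in Table 2 could be held in memory roughly as follows. This is only a minimal sketch: the class names `MetaInformation` and `Segment`, the helper `count_segments`, the field names and all sample values are assumptions made for illustration and do not describe the actual LexALP data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetaInformation:
    """Document-level meta data, one field per category in Table 2."""
    title: str
    abbreviation: str
    passing_date: str        # official passing date
    official_journal: str    # number and date of the Official Journal issue
    legal_system: str
    legal_hierarchy: str     # e.g. "regional" or "national"
    language: str
    translation_status: str  # e.g. "original" or "translated"
    subfield: str


@dataclass
class Segment:
    """A structural segment; `kind` names the segment type."""
    kind: str                # "sentence", "paragraph", "title", "preamble", ...
    text: str = ""
    children: List["Segment"] = field(default_factory=list)


def count_segments(segment: Segment, kind: str) -> int:
    """Count segments of the given kind in the tree rooted at `segment`."""
    own = 1 if segment.kind == kind else 0
    return own + sum(count_segments(child, kind) for child in segment.children)


# Placeholder values, invented purely for illustration.
meta = MetaInformation(
    title="Example Act", abbreviation="ExA", passing_date="2000-01-01",
    official_journal="No. 1, 2000-01-15", legal_system="example system",
    legal_hierarchy="national", language="de",
    translation_status="original", subfield="example subfield",
)
document = Segment("chapter", children=[
    Segment("title", "Example Act"),
    Segment("legal_paragraph", children=[
        Segment("sentence", "First sentence."),
        Segment("sentence", "Second sentence."),
    ]),
])
print(meta.abbreviation, count_segments(document, "sentence"))
```

Nesting segments in a tree keeps the simple segments (sentences, textual paragraphs) inside the typed ones (chapters, legal paragraphs), so queries can be restricted to a given part of a document.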

Another important level of annotation is the alignment of multilingual documents. In multilingual versions of one legal text, the sentences corresponding to each other are linked to allow for multilingual searches in parallel texts. Though not yet realised for the LexALP corpus, the first step would be to align sentences and subsequently go down to align phrases or sentence fragments and even words, where possible.
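To illustrate what sentence-level alignment would make possible, the following sketch searches one language and returns the linked sentence in another. It assumes the alignment already exists and is stored as one dictionary per group of corresponding sentences; the function `parallel_search`, the language codes and the sample sentences are invented for illustration and are not taken from the LexALP corpus.

```python
from typing import Dict, List, Tuple

# One alignment unit: sentences that correspond to each other,
# keyed by language code (an assumed representation).
AlignedUnit = Dict[str, str]


def parallel_search(units: List[AlignedUnit], query: str,
                    source_lang: str, target_lang: str) -> List[Tuple[str, str]]:
    """Return (source, target) pairs whose source sentence contains `query`."""
    needle = query.lower()
    results = []
    for unit in units:
        source = unit.get(source_lang)
        target = unit.get(target_lang)
        # Skip units that lack either language version.
        if source is None or target is None:
            continue
        if needle in source.lower():
            results.append((source, target))
    return results


# Invented sample sentences, not corpus data.
units = [
    {"de": "Der Wald wird geschützt.", "it": "La foresta è protetta."},
    {"de": "Dieses Gesetz tritt in Kraft.", "it": "La presente legge entra in vigore."},
]

for source, target in parallel_search(units, "Wald", "de", "it"):
    print(source, "|", target)
```

The same structure would extend to phrase or word alignment by storing smaller units, although finding those correspondences automatically is a considerably harder task than looking them up.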

3.2. Representation and storage of corpus data

3.2.1. Character encoding

The corpus documents are collected as raw text files. Because Slovene documents are commonly encoded in a different character set (ISO 8859-2/Latin-2) than documents in Italian, German and French (ISO 8859-1/Latin-1) 11, all texts are converted to the overarching UTF-8 (Unicode) 12 encoding.
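The conversion step can be sketched in a few lines. The mapping from language to source encoding follows the paragraph above; the names `SOURCE_ENCODINGS` and `to_utf8` are hypothetical, and the sketch assumes that each file's language, and hence its encoding, is known in advance.

```python
# Source encodings per language, as stated in the text.
SOURCE_ENCODINGS = {
    "sl": "iso-8859-2",  # Slovene
    "it": "iso-8859-1",  # Italian
    "de": "iso-8859-1",  # German
    "fr": "iso-8859-1",  # French
}


def to_utf8(raw: bytes, language: str) -> bytes:
    """Decode raw bytes with the language's source encoding, re-encode as UTF-8."""
    return raw.decode(SOURCE_ENCODINGS[language]).encode("utf-8")


# The same byte stands for different characters in the two charsets:
# 0xE8 is "č" in ISO 8859-2 but "è" in ISO 8859-1.
raw = b"\xe8"
print(to_utf8(raw, "sl").decode("utf-8"))
print(to_utf8(raw, "it").decode("utf-8"))
```

The example shows why a shared encoding is needed: without knowing the source charset, the byte 0xE8 is ambiguous, whereas after conversion each character has a single UTF-8 representation.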

11 For a description of the ISO-8859 charsets see http://czyborra.com/charsets/iso8859.html

12 http://www.unicode.org
