28.01.2015 Views

Download - Российский комитет Программы ЮНЕСКО ...


A Hidden Component: The Universal Declaration of Human Rights

Hidden inside the language identification instrument is a set of training texts for the software. Because the richness and quality of the training texts are the most critical factors in a language identification task, we used a set of translations of the Universal Declaration of Human Rights (UDHR) provided by the UN High Commissioner for Human Rights (UNHCHR).
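To illustrate why the training texts matter, here is a minimal sketch of how a language identifier can be built from them. It assumes a character-trigram profile compared by cosine similarity, which is a common textbook technique and not necessarily the method used by the instrument described here. The two training texts are Article 1 of the UDHR in English and French, standing in for the full set of translations; the names `TRAINING_TEXTS`, `trigram_profile`, `cosine` and `identify` are hypothetical.

```python
from collections import Counter
from math import sqrt

# Article 1 of the UDHR, used as a tiny stand-in for the real training set.
TRAINING_TEXTS = {
    "en": ("All human beings are born free and equal in dignity and rights. "
           "They are endowed with reason and conscience and should act "
           "towards one another in a spirit of brotherhood."),
    "fr": ("Tous les êtres humains naissent libres et égaux en dignité et en "
           "droits. Ils sont doués de raison et de conscience et doivent agir "
           "les uns envers les autres dans un esprit de fraternité."),
}


def trigram_profile(text):
    """Count overlapping character trigrams in lower-cased, space-normalised text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# One profile per language, built once from the training texts (assumed setup).
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING_TEXTS.items()}


def identify(text):
    """Return the language code whose training profile is most similar to the text."""
    query = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


if __name__ == "__main__":
    print(identify("the rights of all human beings"))
    print(identify("les droits des êtres humains"))
```

With only one short article per language the profiles are very sparse, which is why a larger and cleaner set of training texts leads to more reliable identification.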

Note that not all translated UDHR texts are provided as encoded text; some are available only as image files. Image files can be read by humans but not directly by computers, so we had to convert these images into text data. Table 1 shows how many of the translated texts were available in each format (322 languages were available at the time of the first search, in early 2004). More than two hundred languages use Latin script, with or without diacritics, and only three of them were provided in PDF or GIF format. In contrast, among the languages using so-called Abugida scripts [3], not a single one was provided as encoded text. This fact may itself point to the existence of a digital language divide or, in this particular case, a “digital script divide”.

Table 1. Number of available UDHR texts from the UNHCHR website by format

| Format | Latin | Cyrillic | Other alphabets | Abjad | Abugida | Hanzi | All others | Total |
|---|---|---|---|---|---|---|---|---|
| Encoded | 253 | 10 | 1 | 1 | 0 | 3 | 0 | 268 |
| PDF | 2 | 4 | 2 | 3 | 10 | 0 | 4 | 25 |
| GIF | 1 | 3 | 0 | 9 | 15 | 0 | 1 | 29 |
| Total | 256 | 17 | 3 | 13 | 25 | 3 | 5 | 322 |

Other alphabets: Greek, Armenian and Georgian; Abjad: Arabic and Hebrew; Abugida: Amharic and all Brahmi-origin scripts used in South and Southeast Asia; Hanzi: Chinese, Japanese and Korean; All others: Assyrian, Canadian syllabics, Ojibwa, Cree, Mongolian and Yi.
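The “digital script divide” suggested by Table 1 can be made concrete by computing, for each script group, the share of texts that were available as encoded text. The following is a minimal sketch: the cell counts are transcribed from the Encoded, PDF and GIF rows of Table 1, while the names `SCRIPTS`, `COUNTS`, `column_totals` and `encoded_share` are illustrative and not part of the original study.

```python
# Script groups in the column order of Table 1.
SCRIPTS = ["Latin", "Cyrillic", "Other alphabets", "Abjad",
           "Abugida", "Hanzi", "All others"]

# Cell counts transcribed from Table 1 (one list per file format).
COUNTS = {
    "Encoded": [253, 10, 1, 1, 0, 3, 0],
    "PDF":     [2, 4, 2, 3, 10, 0, 4],
    "GIF":     [1, 3, 0, 9, 15, 0, 1],
}


def column_totals(counts):
    """Sum the formats column by column, giving one total per script group."""
    return [sum(column) for column in zip(*counts.values())]


def encoded_share(counts):
    """Fraction of each script group's texts that were available as encoded text."""
    totals = column_totals(counts)
    return {script: counts["Encoded"][i] / totals[i]
            for i, script in enumerate(SCRIPTS)}


if __name__ == "__main__":
    for script, share in encoded_share(COUNTS).items():
        print(f"{script:16s} {share:6.1%}")
```

Run on these counts, the Latin group comes out at roughly 99% encoded and the Abugida group at 0%, which is the contrast the text describes.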

Around the same time as we launched the Language Observatory Project, Eric Miller launched the UDHR-in-Unicode project. The objective of this project was to demonstrate the use of Unicode for a wide variety of languages, using the Universal Declaration of Human Rights (UDHR) as a representative text. Currently, UDHR-in-Unicode is hosted on the Unicode Consortium website, and the texts are used in natural language processing research.

[3] Abugida scripts are syllabic scripts, most of which are derived from the Indian Brahmi script and are currently used in South and Southeast Asia. Another important Abugida script is the one used for Amharic.
