28.01.2015 Views

Download - Russian Committee of the UNESCO Programme ...

Yoshiki MIKAMI
Leader, Language Observatory Project;
Professor, Nagaoka University of Technology;
(Nagaoka, Japan)

Katsuko T. NAKAHIRA
Member, Language Observatory Project;
Assistant Professor, Nagaoka University of Technology
(Nagaoka, Japan)

Measuring Linguistic Diversity on the Web

The Language Observatory

The Language Observatory [1] Project was founded in 2003. The main objective of the project is to observe the actual state of language use on the Web. When the first workshop of the project was held on 21 February 2004, the Russian Committee of the UNESCO Information for All Programme kindly reported on it in Russian. We then received responses from various language communities around the world, which greatly encouraged us.

How It Works

The Language Observatory is designed to measure the use of each language on the World Wide Web. Measurement is done by counting the number of written pages on the Web in each language.
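The counting step can be illustrated with a minimal Python sketch. It assumes that each crawled page has already been labelled with one language code; the function name `language_shares` and the sample data are hypothetical and are not part of the project's software.

```python
from collections import Counter


def language_shares(page_languages):
    """Return {language: (page_count, share_of_all_pages)}.

    `page_languages` holds one language code per page, as produced by
    some language identifier (assumed here, not the project's own tool).
    """
    counts = Counter(page_languages)
    total = sum(counts.values())
    # An empty input yields an empty dict, so no division by zero occurs.
    return {lang: (n, n / total) for lang, n in counts.items()}


# Hypothetical identifier output: one ISO 639-1 code per crawled page.
sample = ["en", "ru", "en", "ja", "en", "ru"]
shares = language_shares(sample)
```

Reporting both the raw count and the share keeps the result comparable across crawls of different sizes.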

The observatory consists of two major components. The first is an instrument for collecting data from the Web: a crawler robot developed at the University of Milan. It can collect millions of Web pages per day.

The second component is a language identification instrument. We have developed software to identify the language, script and encoding properties of Web pages with high accuracy and maximum coverage. The first version of the identification algorithm, LIM (Language Identification Module), was developed in 2002 [2] and implemented in 2004. The most recent version is called G2LI. It is available for use on the Web.

According to a recent verification test, G2LI is capable of identifying 184 languages in the ISO language code (ISO 639-1) with an average accuracy of 94%. In addition to its wide coverage of languages, it can identify various types of legacy encodings², which are still extensively used by many non-Latin-script user communities.
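The text does not say how the average accuracy was computed. One plausible reading, shown here only as a sketch under that assumption, is an unweighted mean of per-language accuracies over a labelled test set. The function name `average_accuracy` and the sample pairs are hypothetical.

```python
def average_accuracy(results):
    """Compute per-language accuracy and its unweighted mean.

    `results` is an iterable of (true_language, predicted_language) pairs.
    Averaging over languages (rather than over pages) is an assumption;
    the source does not state which averaging was used.
    """
    correct, total = {}, {}
    for truth, predicted in results:
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (truth == predicted)
    if not total:
        raise ValueError("results must contain at least one pair")
    per_language = {lang: correct[lang] / total[lang] for lang in total}
    mean = sum(per_language.values()) / len(per_language)
    return per_language, mean


# Hypothetical test set: Russian is misidentified once as Ukrainian.
pairs = [("en", "en"), ("en", "en"), ("ru", "ru"), ("ru", "uk"), ("ja", "ja")]
per_language, mean = average_accuracy(pairs)
```

Averaging over languages gives a sparsely represented language the same weight as a dominant one, which suits a tool whose goal is wide language coverage.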

² Legacy encodings are non-standardised and often proprietary encodings.
