Exploiting multilingual nomenclatures and language-independent text features<br />

as an interlingua for cross-lingual text analysis applications<br />

Ralf Steinberger, Bruno Pouliquen & Camelia Ignat<br />

European Commission – Joint Research Centre (JRC)<br />

Via E. Fermi, T.P. 267, 21020 Ispra (VA), Italy<br />

Firstname.Lastname@jrc.it – http://www.jrc.it/langtech<br />

Abstract<br />

We are proposing a simple, but efficient basic approach for a number of multilingual and cross-lingual language technology applications<br />

that are not limited to the usual two or three languages, but that can be applied with relatively little effort to larger sets of languages.<br />

The approach consists of using existing multilingual linguistic resources such as thesauri, nomenclatures and gazetteers, as<br />

well as exploiting the existence of additional more or less language-independent text items such as dates, currency expressions, numbers,<br />

names and cognates. Mapping texts onto the multilingual resources and identifying word token links between texts in different<br />

languages are basic ingredients for applications such as cross-lingual document similarity calculation, multilingual clustering and categorisation,<br />

cross-lingual document retrieval, and tools to provide cross-lingual information access.<br />

1. Background and Motivation<br />

The European Union (EU) currently has 20 official languages,<br />

plus a few non-official ones. Most existing text<br />

analysis software tools have been developed for a few major<br />

languages, while very few resources and tools are<br />

available for the less widely spoken languages. There<br />

clearly is a need for more tools that can help the European<br />

citizens to access textual information written in the other<br />

languages.<br />

The 20 official EU languages add up to 190 language<br />

pair combinations. Almost all cross-lingual text analysis<br />

applications, including Machine Translation (MT), Cross-<br />

Lingual Information Retrieval (CLIR) and Cross-Lingual<br />

News Topic Tracking (CLNTT), make use of bilingual<br />

equivalences and rules. The few approaches to CLNTT, for<br />

instance, are either based on bilingual dictionaries (Wactlar<br />

1999) or use MT (Leek et al. 1999). In the EU setting,<br />

interlingua approaches and approaches towards unified<br />

multilingual resources, such as EuroWordNet and<br />

MULTEXT, clearly gain in attraction. However, there are<br />

many more unexploited resources that may not have been<br />

developed for machine use, but that can be exploited for<br />

multilingual Information Extraction (IE) and to provide<br />

cross-lingual information access.<br />
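As a quick check of the arithmetic, the figure of 190 is the number of unordered pairs among 20 languages. The comparison with 20 per-language mappings is our own illustration of why an interlingua is attractive in this setting:

```latex
\[
\binom{20}{2} \;=\; \frac{20 \cdot 19}{2} \;=\; 190
\quad\text{language pairs, versus}\quad
n = 20 \ \text{language-to-interlingua mappings.}
\]
```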

The Language Technology team of the Joint Research<br />

Centre (JRC) has the aim to produce a number of text<br />

analysis applications for ideally all official EU languages<br />

(and more) that help users to navigate in large multilingual<br />

document collections and that provide them with<br />

cross-lingual information access. Due to a lack of manpower<br />

and due to the limited availability of machine-usable<br />

linguistic resources, we developed the following<br />

preferences:<br />

(a) limiting language-specific text processing to a minimum,<br />

by using heuristics and other shallow methods;<br />

(b) using statistics and Machine Learning (ML) methods<br />

rather than hand-crafted linguistic rules, where possible;<br />

(c) making use of various available multilingual lexical<br />

resources, even if they were not initially developed for<br />

machine use.<br />

While it is clear that more thorough knowledge-driven methods would produce better results in many cases, the JRC’s work has shown that a shallow and mostly language-independent approach can yield a number of useful and new text analysis applications while keeping the language-specific effort to between one and three person-months per language.

The following sections describe efforts to map texts<br />

onto multilingual knowledge structures (Section 2) and to<br />

exploit further almost language-independent text features<br />

(Section 3). Section 4 explains how to deal with some<br />

language-specific issues and Section 5 lists a few language-independent<br />

methods and tools that can be used together<br />

with the resources mentioned in the previous sections.<br />

Section 6 shows some useful applications built with<br />

the procedures described in this article. In Section 7, we<br />

draw a few conclusions.<br />

2. Mapping texts onto existing multilingual<br />

thesauri, nomenclatures and gazetteers<br />

When mapping a given text onto a knowledge structure<br />

such as a thesaurus, we create a text representation consisting<br />

of a choice of thesaurus nodes, and possibly also of<br />

the relative importance of various nodes for the text representation.<br />

One, but not the only way of carrying out this<br />

mapping process is by verifying the lexical overlap between<br />

the document’s vocabulary and the terms of the<br />

thesaurus. Two documents can be assumed to be similar if<br />

they have a similar representation according to the mapping<br />

onto this thesaurus.<br />

In a multilingual thesaurus, nodes in the various language<br />

versions are linked via language-independent (typically<br />

numerical) node identifiers. While the conceptual<br />

world of a given language or of a specific thesaurus is, of<br />

course, not completely language-independent, the numerical<br />

thesaurus links between various language versions are<br />

good enough for an interlingua approximation. Two<br />

documents written in different languages can thus be assumed<br />

to be similar if they have a similar text representation<br />

according to this multilingual thesaurus.<br />

In addition to thesauri, gazetteers and nomenclatures can fulfil the same function. Gazetteers are geographical dictionaries, i.e. lists of place names. According to Norviliené (forthcoming), the term nomenclature is used to describe ordered systems of words (e.g. product names) used in a particular discipline (e.g. business or customs), containing a description of entities from a particular domain and their, typically mono-hierarchical, relationship. Thesauri are poly-hierarchically ordered systems of concepts and their natural language names that are mainly used for documentation purposes such as indexing and retrieval.

Figure 1. Recognition of place names in Bulgarian and Czech text, and display of the results in English.

The aim of this section is to show how texts can be<br />

mapped onto one or more thesauri to create a multifaceted<br />

language-independent document representation.<br />

The more thesauri can be used, the more information will<br />

be available for the document representation and the better<br />

documents can be compared with each other. The following<br />

sub-sections sketch our current mapping process<br />

onto various such lexical knowledge sources.<br />
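The idea of comparing documents through shared thesaurus node identifiers can be sketched as follows. This is a minimal illustration, not the JRC implementation: the toy thesaurus, its node identifiers and the helper names `map_to_nodes` and `cosine` are all invented here, and the mapping uses plain lexical overlap on lower-cased words.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy thesaurus: language -> {term: node identifier}.
# The node ids and terms are invented for illustration only.
THESAURUS = {
    "en": {"onions": 70310, "garlic": 70320, "tomatoes": 702},
    "de": {"zwiebeln": 70310, "knoblauch": 70320, "tomaten": 702},
}

def map_to_nodes(text, lang):
    """Count thesaurus nodes whose terms occur among the text's words."""
    terms = THESAURUS[lang]
    words = text.lower().split()
    return Counter(terms[w] for w in words if w in terms)

def cosine(a, b):
    """Cosine similarity of two sparse node-frequency vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_en = map_to_nodes("Onions and garlic prices rose while tomatoes fell", "en")
doc_de = map_to_nodes("Zwiebeln und Knoblauch wurden teurer", "de")
similarity = cosine(doc_en, doc_de)
```

Because both documents are reduced to the same numerical identifiers, the similarity calculation itself never needs to know which language each text was written in.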

2.1. Gazetteers of place names<br />

Unlike people’s names and other named entities, place<br />

names cannot be recognised by searching for patterns in<br />

text because there are as good as no contextual clues (Gey<br />

2000). Instead, geographical place name recognition has<br />

to rely on gazetteers and can only be carried out via a<br />

lookup of text words in the gazetteer. As places are<br />

spelled with a first uppercase letter in EU languages, only<br />

uppercase words need to be looked up. The lookup process<br />

sounds simple, but there are four major difficulties:<br />

(a) Place names can also be words in one or more languages, such as ‘And’ (Iran) and ‘Split’ (Croatia);

(b) Some place names are homonymic with people’s names, such as ‘Victoria’ (capital of the Seychelles, and others) and ‘Annan’ (UK);

(c) Many major places have varying names in different languages (exonyms; Venezia vs. Venice, etc.) or even in the same language (‘Saint Petersburg’, ‘Saint Pétersbourg’, ‘Санкт-Петербург’ [Sankt-Peterbúrg], ‘Leningrad’, ‘Petrograd’, etc.);

(d) Multiple places share the same name, such as the fourteen cities and villages in the world called ‘Paris’.

While place name recognition in general is a very well understood named entity recognition task, disambiguation between various homographic place names (issue d) has only recently been tackled (Pouliquen et al. 2004a). Exonym recognition (issue c) has to rely on an exhaustive multilingual database. While a number of monolingual gazetteers are freely available (see Gey 2000), we are only aware of two multilingual place name lists: the KNAB database of the Institute of the Estonian Language¹ and the European Commission’s NUTS database², currently available in fifteen languages.

1 See http://www.eki.ee/knab/knab.htm.

Even for languages with relatively few speakers such as Slovene, good resources exist. For instance, KNAB currently contains about 150 Slovene place names. The freely available database of the Geonet Name Server³ has 6600 English language references of Slovene place names. Slovene place names are handled by the Slovene governmental commission for the standardisation of geographical names⁴, who even provide a link to a gazetteer of exonyms⁵.

Our approach consists of looking up all uppercase<br />

words in the gazetteer database and of applying a number<br />

of heuristics for disambiguation (see Pouliquen et al.<br />

2004a). When a string could be a single word or be part of<br />

a multi-word place name, the longer place name is preferred.<br />

The result is a list of place names occurring in the<br />

text with their offset and length, plus latitude and longitude,<br />

as well as information on the country they belong to<br />

and possibly information about the hierarchical organisation<br />

of the country (e.g. town, province, region, country).<br />

In Figure 1, automatically identified place names in Bulgarian<br />

and Czech text are highlighted and translated. Additional<br />

information is available in the underlying XML<br />

file, but is not displayed here.<br />
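The lookup just described (uppercase candidates only, longest multi-word name preferred, offsets and coordinates returned) might look roughly like the sketch below. The miniature gazetteer, its approximate coordinates and the function name `find_places` are assumptions made for illustration; the disambiguation heuristics of Pouliquen et al. (2004a) are not reproduced, and the optional `geo_stop_words` argument only hints at the stop word lists discussed in this section.

```python
import re

# Hypothetical miniature gazetteer: name -> (country, latitude, longitude).
# Entries and coordinates are approximate and for illustration only.
GAZETTEER = {
    "New York": ("US", 40.71, -74.01),
    "York": ("GB", 53.96, -1.08),
    "Split": ("HR", 43.51, 16.44),
    "Paris": ("FR", 48.86, 2.35),
}
MAX_WORDS = max(len(name.split()) for name in GAZETTEER)

def find_places(text, geo_stop_words=frozenset()):
    """Return (offset, length, name, country, lat, lon) for gazetteer hits.

    Only candidates starting with an uppercase letter are looked up, and
    the longest matching multi-word name wins.
    """
    tokens = [(m.group(), m.start()) for m in re.finditer(r"\w+", text)]
    hits, i = [], 0
    while i < len(tokens):
        word, start = tokens[i]
        matched = False
        if word[0].isupper():
            # Try the longest candidate first, then shorter ones.
            for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
                last_word, last_start = tokens[i + n - 1]
                end = last_start + len(last_word)
                candidate = text[start:end]
                if candidate in GAZETTEER and candidate not in geo_stop_words:
                    country, lat, lon = GAZETTEER[candidate]
                    hits.append((start, end - start, candidate, country, lat, lon))
                    i += n
                    matched = True
                    break
        if not matched:
            i += 1
    return hits

hits = find_places("She flew from New York to Paris.")
```

Here "New York" is reported as one place rather than as "York", which mirrors the preference for the longer name.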

To limit the negative impact of place names that could<br />

also be common words or people’s names (problems (a)<br />

and (b)), which would lead to many wrong hits and thus to<br />

a low precision, we currently use lists of geo-stop words,<br />

i.e. words that should not be marked as place names even<br />

if they are found in text. As ambiguous place names such<br />

as ‘And’ and ‘Split’ are only a problem for English language<br />

texts, but not for German or other languages, there<br />

should be a different geo-stop word list for each language.<br />

Producing a geo-stop word list for a new language takes<br />

little effort as word frequency lists of the language can be<br />

used. By automatically geo-coding a frequency list of the<br />

ten thousand most frequent words of the language and collecting<br />

those words that were found by the system, but<br />

that are not place names, such a geo-stop word list is<br />

quickly produced. Person names such as Victoria are<br />

harder to work around, as the person name is rather frequent<br />

and there are 190 places with this name in the<br />

world, including the capital of the Seychelles. This problem<br />

can only be overcome by using the outcome of the<br />

person name recognition tool described in section 3.2.<br />
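Producing candidates for a geo-stop word list from a frequency list can be sketched as below. The sample word list, the gazetteer names and the function name `build_geo_stop_candidates` are invented for illustration; in practice the returned candidates would still be reviewed by a person, since some frequent words really are intended as place names.

```python
def build_geo_stop_candidates(frequent_words, gazetteer_names):
    """Return frequent words that collide with gazetteer names.

    `frequent_words` is assumed to hold the most frequent word forms of
    one language (for example the top ten thousand); a person then
    reviews the returned candidates and keeps those that are not
    normally place names in that language.
    """
    names = {name.lower() for name in gazetteer_names}
    return [w for w in frequent_words if w.lower() in names]

# Invented sample data for illustration only.
english_top_words = ["the", "and", "split", "of", "nice", "house"]
gazetteer_names = ["And", "Split", "Nice", "Paris"]
candidates = build_geo_stop_candidates(english_top_words, gazetteer_names)
```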

An evaluation of the place name recognition tool in<br />

English texts yielded a precision of 96.8% and a recall of<br />

96.5%. For details, see Ignat et al. (2003).<br />

Language-specific issues regarding the lookup process,<br />

such as place name inflection, will be discussed in<br />

section 4 as they are not only relevant to place name recognition.<br />

2 Available at http://europa.eu.int/comm/eurostat/ramon/.

3 See http://earth-info.nga.mil/gns/html/.

4 See http://www.sigov.si/kszi/.

5 Available at http://www.sigov.si/kszi/ang/exonyms.pdf.

TARIC CODE    | PRODUCT DESCRIPTION
0702          | Tomatoes, fresh or chilled
0702 00 00 07 | Cherry tomatoes
0702 00 00 99 | Other
0703          | Onions, shallots, garlic, leeks and other alliaceous vegetables, fresh or chilled
0703 10       | Onions and shallots
0703 20       | Garlic
0703 90       | Leeks and other alliaceous vegetables
0703 90 00 10 | Leeks
0703 90 00 90 | Other

Table 1. English product descriptions in TARIC chapter: Edible Vegetables and Certain Roots and Tubers.

The result of the mapping process is thus a vector of place names where each place name is a dimension and the frequency with which it has been mentioned in the text is the length of the vector along that dimension. For some applications, it may be useful to restrict the recognition resolution to the country level, i.e. each mention of a place in the country adds to the country score. The occurrence frequency and the country score can also be normalised, using TF.IDF or similar, to down-weight the importance of places like Washington that are highly frequent in some text types such as world news.
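The country-level aggregation and TF.IDF weighting of the place name vector could be sketched as follows. The sample place hits and the function names are hypothetical, and the plain `tf * log(N / df)` variant is only one of several TF.IDF formulations; the text does not say which one is used.

```python
from collections import Counter
from math import log

def country_vector(place_hits):
    """Aggregate recognised places to country level: one count per mention."""
    return Counter(country for _place, country in place_hits)

def tf_idf(doc_counts, collection):
    """Weight each country by term frequency times inverse document frequency.

    Uses the plain variant tf * log(N / df); other TF.IDF variants exist.
    """
    n_docs = len(collection)
    weights = {}
    for country, tf in doc_counts.items():
        df = sum(1 for other in collection if country in other)
        weights[country] = tf * log(n_docs / df)
    return weights

# Invented (place, country) hits for three short news items.
docs = [
    country_vector([("Washington", "US"), ("Ljubljana", "SI"), ("Maribor", "SI")]),
    country_vector([("Washington", "US"), ("Paris", "FR")]),
    country_vector([("Washington", "US")]),
]
weights = tf_idf(docs[0], docs)
```

In this toy collection every document mentions a US place, so the US weight drops to zero, while the two Slovene mentions in the first document dominate its vector.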

2.2. Nomenclatures of products, etc.<br />

Other views of the same document can be produced by<br />

listing all document terms from various other fields, such<br />

as products and product groups, professions, medical or<br />

electro-technical terms, etc. Various nomenclatures can be<br />

downloaded from the internet (see Norviliené 2004), and<br />

many of them are available on the EC’s classification<br />

server Ramon (see Footnote 2). For instance, there is the<br />

electro-technical nomenclature ETIM⁶, the Statistical Classification<br />

of Products by Activity in the European Economic<br />

Community CPA, the Statistical Classification of<br />

Economic Activities in the European Community NACE,<br />

and many more.<br />

To date, we have only worked with the Integrated Tariff<br />

of the European Communities TARIC⁷, which is the hierarchical<br />

product list used by the Customs Offices in the<br />

EU to declare the movement of goods across borders.<br />

TARIC is a more detailed version of the so-called Combined<br />

Nomenclature CN, which is again more detailed<br />

than the Harmonised System HS used by the World Customs<br />

Organisation. TARIC distinguishes about 28,000<br />

headings and subdivisions.<br />

We chose TARIC because it exists in twenty languages<br />

(including Slovene) and because it is a rather complete list<br />

of tangible items that can be imported or exported. It is in<br />

the nature of TARIC that illegal products such as bombs<br />

and many drugs are not included (although heroin and cocaine<br />

are part of TARIC). It includes live animals, food,<br />

chemicals, pharmaceuticals, textiles, precious stones, metals,<br />

machinery, vehicles, optical material, works of art,<br />

and much more.<br />

6 See http://www.etim.de/html/download.html

7 See http://europa.eu.int/comm/taxation_customs/databases/taric_en.htm

Table 1 shows some of the product descriptions that are organised hierarchically into up to 5 levels (two digits per level). Knowing which of the products and product groups are referred to in a text can be very useful to generate a product-related document representation, i.e. a vector of products and their relative importance in the text. We can furthermore use the numerical TARIC codes as an interlingua to represent the product aspect of documents written in the twenty languages in which the product nomenclature exists. However, before being able to use the product lists of this resource in a lookup process, we needed to overcome several difficulties:

(a) As the entire TARIC product description (e.g. “Leeks<br />

and other alliaceous vegetables” in code 070390) will<br />

not be found verbatim in the text, the product terminology<br />

first needs to be extracted from the description<br />

(e.g. leeks and alliaceous vegetables in Table 1).<br />

(b) Usually, the plural forms are used in TARIC so that the<br />

singular or other inflected forms need to be added for<br />

the lookup process to be successful. For further issues<br />

concerning inflection of words and suffixes, see Section<br />

4.<br />

(c) Syntactic co-ordination constructions such as in code<br />

0703 need to be resolved and expanded out to produce<br />

lists such as fresh onions, chilled onions, fresh shallots,<br />

chilled alliaceous vegetables, etc.<br />

(d) This process typically results in product lists such as<br />

fresh onions and chilled onions, while the most usual<br />

underspecified term onions is not part of the list. This<br />

needs to be added.<br />

(e) While multi-word terms are usually monosemous,<br />

many single-word terms such as onion or juice can be<br />

part of many different TARIC classes as there are many<br />

different types of juices and onions (wild onions, pearl<br />

onions, dried onions, etc.). As we did not want to miss<br />

frequently used products such as onions or juice, and<br />

we did not want one term to trigger many different<br />

TARIC classes, we decided to add about 350 super-groups<br />

such as vegetables and milk products and to<br />

place the under-specified term directly under the super-group.<br />
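Steps (c) and (d) above, expanding co-ordinated descriptions and adding the underspecified term, can be illustrated with a deliberately naive heuristic. This sketch is not the semi-automatic CIS procedure: the function name `expand_description` is invented, and it assumes that modifiers always follow the last comma, which holds for code 0703 but certainly not for all TARIC descriptions.

```python
import re

def expand_description(description):
    """Expand 'A, B and other C, m1 or m2' into modifier-head term variants.

    Assumes the modifiers follow the last comma; real TARIC descriptions
    are far more varied and were checked manually in the original work.
    """
    heads_part, _, modifiers_part = description.rpartition(",")
    if not heads_part:               # no comma: nothing to expand
        return [description.strip().lower()]
    heads = [re.sub(r"^other\s+", "", h.strip().lower())
             for h in re.split(r",|\band\b", heads_part) if h.strip()]
    modifiers = [m.strip().lower()
                 for m in re.split(r"\bor\b", modifiers_part) if m.strip()]
    terms = []
    for head in heads:
        terms.append(head)           # underspecified term, step (d)
        terms.extend(f"{mod} {head}" for mod in modifiers)  # step (c)
    return terms

terms = expand_description(
    "Onions, shallots, garlic, leeks and other alliaceous vegetables, fresh or chilled"
)
```

Singular and other inflected forms (step b) and the assignment of ambiguous single words to super-groups (step e) would still have to be handled separately.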

These steps were carried out, mostly by the Centre for Information<br />

and Language Processing CIS⁸ at the University<br />

of Munich in Germany, in the context of a collaborative<br />

agreement, for the languages English, German, French,<br />

Spanish, Italian and Portuguese. In the semi-automatic<br />

process, heuristics were used and results were checked<br />

manually. Inflection forms were added by making use of<br />

extensive morphological dictionaries available at CIS. The<br />

English and Italian dictionary resources created by CIS<br />

were then checked thoroughly for correctness at the JRC.<br />

The resulting dictionaries are thus of the form SUPER-GROUP<br />

| CODE | TERM where several terms are allowed for<br />

the same code if written one term per line, and several<br />

codes are obviously allowed for each super-group. The<br />

super-group column furthermore allows us to do a more<br />

coarse-grained classification of texts so that documents<br />

triggering the class vegetables several times are identified<br />

as similar even if they do not mention the same vegetables.<br />

To date, the dictionaries have been developed for the<br />

languages English, Italian, German, French, Spanish and<br />

Portuguese.<br />
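A dictionary in the SUPER-GROUP | CODE | TERM layout and the coarse-grained comparison it enables can be sketched as follows. The sample lines are invented (the real dictionaries are not publicly available), the function names are hypothetical, and only single-word terms are matched to keep the example short.

```python
from collections import Counter

# Invented sample lines in the SUPER-GROUP | CODE | TERM layout.
DICTIONARY_LINES = """\
vegetables | 0703 10 | shallot
vegetables | 0703 20 | garlic
vegetables | 0703 90 00 10 | leeks
vegetables | 0703 90 00 10 | leek
"""

def load_dictionary(lines):
    """Map each term to its (super-group, code) pair; one term per line."""
    lookup = {}
    for line in lines.splitlines():
        if line.strip():
            super_group, code, term = (field.strip() for field in line.split("|"))
            lookup[term] = (super_group, code)
    return lookup

def product_profile(text, lookup):
    """Count TARIC codes and super-groups triggered by single-word terms."""
    codes, groups = Counter(), Counter()
    for word in text.lower().split():
        word = word.strip(".,;:!?")
        if word in lookup:
            super_group, code = lookup[word]
            codes[code] += 1
            groups[super_group] += 1
    return codes, groups

lookup = load_dictionary(DICTIONARY_LINES)
codes_a, groups_a = product_profile("Add garlic and a shallot.", lookup)
codes_b, groups_b = product_profile("Leeks are in season.", lookup)
```

The two texts share no TARIC code, yet both trigger the super-group vegetables, which is the kind of coarse similarity the super-group column is meant to capture.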

Regarding the recognition of the derived product terminology<br />

in the text, the same lookup procedure can be<br />

used as for geographical place names. However, in most<br />

European languages, products are not spelled with a first<br />

uppercase letter so that all words need to be checked<br />

against the terms in the product list. Figure 2 shows some<br />

product recognition results.<br />

8 See http://www.cis.uni-muenchen.de/


The difficulties involved in the lookup process are<br />

again linked to polysemous words like bush, joint, bus,<br />

etc. Some of these terms belong to very different TARIC<br />

classes (e.g. joint). Others are simply homographic with<br />

words not related to products (e.g. Bush). For testing, we<br />

applied the system to various text types and, more importantly,<br />

to the 10,000 top frequent words derived from reference<br />

corpora. This gave us a good idea of the most frequent<br />

missing products, which were then added to the dictionaries.<br />

Furthermore, this helped us to identify those<br />

high-frequency words that are homographic with products<br />

and that could thus potentially generate wrong hits. Depending<br />

on the type of problem, we used one of two solutions.<br />

(a) For words triggering different TARIC product<br />

classes, we usually amended the dictionary by adding<br />

some additional specification (e.g. joint was changed to<br />

rubber joint) that helps in the disambiguation. The disadvantage<br />

is that the single word joint will no longer be recognised.<br />

(b) For words that are homographic with non-product vocabulary of the language (e.g. Bush), we produced a language-dependent product stop word list containing all those words that the system should not recognise. This prevents the US president from triggering the product class live plants. We thus decided to sacrifice recall for precision.

The effort to prepare and tune the product dictionaries<br />

for each language ranges between two and six months per<br />

language, but we foresaw that the advantage of mapping<br />

texts onto the TARIC nomenclature with its encompassing<br />

coverage would be worth the effort. The result of the<br />

product recognition procedure is thus a product information<br />

extraction tool that allows us also to provide users<br />

with product-specific cross-lingual information access and<br />

to produce a product-specific feature vector for each<br />

document that can be used for monolingual and cross-lingual<br />

document similarity calculation.<br />

The TARIC nomenclature is seemingly distributed for<br />

free, but the dictionaries derived from it cannot currently<br />

be made available due to the agreement with CIS. However,<br />

the JRC would be interested in collaborations creating<br />

publicly available resources for more languages.<br />

2.3. Thesauri and classification systems<br />

Libraries and documentation centres of most large organisations<br />

use hierarchically organised thesauri or flat lists of<br />

subject domain descriptions as classification systems to<br />

store and retrieve their documents. Documents are often<br />

multiply classified, meaning that each document is<br />

marked as belonging to several classes (multi-label categorisation).<br />

Such a classification of a document leads to<br />

yet another vector space representation of documents, using<br />

the descriptors as dimensions and, if the descriptors<br />

are ordered or weighted, the weight as vector length.<br />

Figure 2. Automatic recognition of products in English text (example sentence: “They ate young river salmon with cream and potatoes.”). Display of the results in English and Portuguese.

The European Parliament (EP) and the European<br />

Commission (EC) have jointly developed a thesaurus<br />

called EUROVOC (EUROVOC 1995) that is used by them and<br />

about twenty regional and national European parliaments<br />

to index (i.e. classify) their texts. Though other classification<br />

systems exist, EUROVOC is adopted by a growing<br />

number of national organisations so that it has now become<br />

something of a standard. To obtain a licence, it is necessary<br />

to contact the EC’s Publications Office OPOCE.<br />

EUROVOC is a wide-coverage thesaurus that organises<br />

its over 6,000 descriptors (classes) from 21 different fields<br />

(e.g. politics, finance, science, social questions, organisations,<br />

foodstuff, etc.) hierarchically into a maximum of 8<br />

levels. EUROVOC currently exists in 22 languages, where<br />

each numerical descriptor code has exactly one terminological<br />

correspondence per language.<br />

As EUROVOC is a wide-coverage thesaurus with only<br />

6000 classes, its descriptors are mostly rather high-level,<br />

conceptual terms. Examples are PROTECTION OF MINORITIES,<br />

FISHERY MANAGEMENT and CONSTRUCTION AND<br />

TOWN PLANNING.⁹ Unlike the concrete low-level terms<br />

from TARIC and many other nomenclatures, EUROVOC descriptors<br />

cannot normally be extracted from texts, i.e. they<br />

can only rarely be found via a lookup procedure. Instead,<br />

EUROVOC classification is a keyword assignment task, i.e.<br />

the most pertinent descriptors from an independent reference<br />

list (the thesaurus) are assigned to a text even if these<br />

terms do not occur verbatim in the text.<br />

In the various European parliaments, this assignment<br />

is done manually by professional librarians, but the JRC<br />

has developed a system that learns from manually classified<br />

documents to assign a ranked list of EUROVOC descriptors<br />

to any given text. This work is described in detail<br />

in Pouliquen et al. (2003a) so that we only summarise the<br />

procedure here: The system maps documents onto EUROVOC<br />

by carrying out category-ranking classification using<br />

Machine Learning methods. In an inductive process, it<br />

builds a profile-based classifier by observing the manual<br />

classification on a training set of documents with only<br />

positive examples. Table 2 shows the first few of a long<br />

list of words automatically identified as being significant<br />

for the EUROVOC descriptor FISHERY MANAGEMENT. Before feeding the training texts to the ML algorithm, some linguistic pre-processing was carried out to lemmatise words and to mark up multi-word terms such as power_plant and New_York as one token, and a large stop word list of words with low semantic content was used. However, tests have shown that lemmatisation and multi-word mark-up had only little impact on the performance for Spanish and English. Assignment results for the highly inflected Finnish language were comparable, showing that the statistical method can be applied without using linguistic tools, if necessary.

A manual evaluation of the EUROVOC descriptor assignment<br />

process for English and Spanish parliamentary<br />

documents, taking human performance as a benchmark,<br />

showed that the system performs, respectively, 86% and 80% as well as<br />

the professional indexers did. For details, see Pouliquen et<br />

al. (2003a).<br />

The outcome of the mapping process for a given text is a ranked list of the EUROVOC classes that are most pertinent for this text. Table 3 shows the first few EUROVOC descriptors assigned automatically to a text found on the internet.

9 We write all EUROVOC descriptors in small caps.

Lemma                 | Weight
fishery_resource      | 54.47
fishing               | 49.11
fish                  | 46.19
common_fishery_policy | 44.67
fishery               | 44.19
fishing_activity      | 43.37
fly_the_flag          | 42.87
aquaculture           | 39.27
conservation          | 38.34
vessel                | 37.91

Table 2. The first few of a long list of lemmas that have been automatically identified as being highly relevant and typical for documents that were manually classified with the EUROVOC descriptor FISHERY MANAGEMENT, plus their weight (the profile of the descriptor). The presence of many of these lemmas in a given text indicates a certain likelihood that FISHERY MANAGEMENT is an appropriate descriptor for this text.
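To make the notion of a descriptor profile concrete, the sketch below ranks descriptors for a lemmatised text by summing the profile weights of the lemmas it contains. The FISHERY MANAGEMENT weights are copied from Table 2; the second profile, the function name `rank_descriptors` and the summing rule are assumptions for illustration only. The actual scoring used by the JRC system is described in Pouliquen et al. (2003a) and is not reproduced here.

```python
# Profile of FISHERY MANAGEMENT: weights copied from Table 2.
FISHERY_MANAGEMENT = {
    "fishery_resource": 54.47, "fishing": 49.11, "fish": 46.19,
    "common_fishery_policy": 44.67, "fishery": 44.19,
    "fishing_activity": 43.37, "fly_the_flag": 42.87,
    "aquaculture": 39.27, "conservation": 38.34, "vessel": 37.91,
}
# Hypothetical second profile with invented weights, for contrast only.
PUBLIC_HEALTH = {"health": 50.0, "disease": 45.0, "conservation": 5.0}

PROFILES = {"FISHERY MANAGEMENT": FISHERY_MANAGEMENT,
            "PUBLIC HEALTH": PUBLIC_HEALTH}

def rank_descriptors(lemmas, profiles):
    """Rank descriptors by the summed profile weight of the lemmas present.

    Summing weights is a deliberately simple stand-in; it is not the
    scoring formula of Pouliquen et al. (2003a).
    """
    present = set(lemmas)
    scores = {name: sum(w for lemma, w in profile.items() if lemma in present)
              for name, profile in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank_descriptors(
    ["vessel", "fishing", "conservation", "quota"], PROFILES)
```

Note that the descriptor name itself never has to occur in the text, which is what distinguishes this keyword assignment from the lookup procedures of sections 2.1 and 2.2.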

Due to the multilingual nature of EUROVOC, this representation<br />

is independent of the text language so that it is<br />

very suitable for cross-lingual document similarity calculation.<br />

The system has currently been trained for thirteen<br />

languages so that documents written in any of these languages<br />

can be represented with the same language-independent<br />

EUROVOC descriptor vector. Unlike the applications<br />

described in sections 2.1 and 2.2, this Machine<br />

Learning method to map documents onto thesauri requires<br />

training material, i.e. documents that have been manually<br />

classified. While some linguistic, rule-based or dictionary-based<br />

approaches exist for automatic thesaurus indexing<br />

(e.g. Marjorie & Hainebach 1996), more recent efforts<br />

such as the one by Montejo-Ráez (2002) tend to exploit<br />

the power of ML approaches. The advantage of these becomes<br />

even more evident for highly multilingual applications<br />

such as automatic EUROVOC indexing.<br />

Most other highly multilingual thesauri we are aware<br />

of are subject-specific, such as the agricultural thesaurus<br />

AGROVOC, the particle physics thesaurus DESY and the<br />

medical thesauri UMLS and MeSH. AGROVOC, which is<br />

freely available at the FAO web site¹⁰, exists in six major<br />

world languages. The medical thesaurus MeSH exists in<br />

twelve mainly European languages, but according to Nelson<br />

et al. (2000), the thesaurus has fully or partially been<br />

translated into a further eight world languages, including<br />

Slovene.<br />

10 See http://www.fao.org/agrovoc/.

3. Language-independent text features

The mapping processes described in section 2 yield several vector space document representations, one for each thesaurus, nomenclature, gazetteer or word list used. Further multilingual representations can be generated by extracting named entities to create lists of text features such as (a) date or (b) currency expressions, (c) numbers and (d) names, as these can be represented in a normalised, language-independent format. For an introduction to the state of the art of the field of Named Entity Recognition (NER), see Daille & Morin (2000). Names of people or organisations are not strictly language-independent because names may be written differently depending on the language (and sometimes even within the language), but at least among European languages many names are spelled the same. Due to the historical relatedness of many European languages, there are even (e) a few general language words that are similar or the same. These are usually referred to as cognates. The English and German words ‘finger’, ‘arm’, ‘demonstration’, ‘computer’, etc. are some examples. In this section, we describe how these five additional text features can be recognised and exploited to contribute to linking related documents both monolingually and across languages.

3.1. Date and currency expressions<br />

Within the same language, there are usually different<br />

ways of writing a certain date or currency expression (e.g.<br />

English 13 October 2004, 13/10/2004, 13.10.2004, thirteenth<br />

of October of the year two thousand and four, etc.).<br />

Some of these date expressions may be the same as in<br />

other languages (e.g. 13.10.2004), but others are not. As<br />

the underlying concept is the same, namely a reference to<br />

a specific date in the same time reference system, the concept<br />

can be expressed in a standard way (see, for instance,<br />

ISO standard ISO-8601) so that it is the same across languages.<br />

For dates, we currently use ‘DD’YYYYMMDD.<br />

Expressions such as 13.10.2004 are thus normalised to<br />

DD20041013.<br />
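
As a rough illustration of this normalisation step, the following Python sketch maps a few numeric and English spellings of a date onto the DDYYYYMMDD code. The function name, the month table and the two patterns are assumptions made for this example only; they are not the parameter files or rules of the JRC tool.

```python
import re

# Hypothetical month lexicon; a real system would read such lists from
# language-specific parameter files.
MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4, "may": 5,
          "june": 6, "july": 7, "august": 8, "september": 9,
          "october": 10, "november": 11, "december": 12}

# 13/10/2004 or 13.10.2004 (day first, as in the examples of the text)
NUMERIC = re.compile(r"\b(\d{1,2})[./](\d{1,2})[./](\d{4})\b")
# 13 October 2004
WORDED = re.compile(r"\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\b")


def normalise_dates(text):
    """Return the DDYYYYMMDD codes of the dates found, in text order."""
    found = []
    for m in NUMERIC.finditer(text):
        day, month, year = int(m.group(1)), int(m.group(2)), int(m.group(3))
        if 1 <= day <= 31 and 1 <= month <= 12:
            found.append((m.start(), f"DD{year:04d}{month:02d}{day:02d}"))
    for m in WORDED.finditer(text):
        month = MONTHS.get(m.group(2).lower())
        day, year = int(m.group(1)), int(m.group(3))
        if month is not None and 1 <= day <= 31:
            found.append((m.start(), f"DD{year:04d}{month:02d}{day:02d}"))
    # Sort by position so that the codes follow the order of the text.
    return [code for _, code in sorted(found)]


print(normalise_dates("13 October 2004, 13/10/2004 and 13.10.2004"))
# ['DD20041013', 'DD20041013', 'DD20041013']
```

All three surface forms end up as the same code, which is what makes the normalised date usable as a language-independent feature.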

At the JRC, we do not currently recognise currency expressions,<br />

but we have developed a tool that recognises<br />

and normalises date expressions. It is a language-independent<br />

software tool that uses language-specific parameter<br />

files, one per language. The set of languages includes<br />

Slovene. A preliminary version of this tool is described<br />

in Ignat et al. (2003). It is available on request.<br />

The language-specific parameter file allows users to list<br />

days of the week, months of the year, common abbreviations<br />

for week days and months, cardinal and ordinal<br />

number expressions, words that can be part of the date expression<br />

(e.g. of the year), as well as expressions used for<br />

relative dates such as yesterday, last December, etc. It furthermore<br />

allows ordering rules to be specified. In English, for<br />

Rank Descriptor Similarity<br />

1 VETERINARY LEGISLATION 42.4%<br />

2 PUBLIC HEALTH 37.1%<br />

3 VETERINARY INSPECTION 36.6%<br />

4 FOOD CONTROL 35.6%<br />

5 FOOD INSPECTION 34.8%<br />

6 AUSTRIA 29.5%<br />

7 VETERINARY PRODUCT 28.9%<br />

8 COMMUNITY CONTROL 28.4%<br />

Table 3. Assignment results (8 top-ranking descriptors)<br />

for the document Food and Veterinary Office mission to<br />

Austria, found on the internet at<br />

http://europa.eu.int/comm/food/fs/inspections/vi/reports/austria/vi_rep_oste_1074-1999_en.html.


instance, it is possible to mention the DAY after the<br />

MONTH (e.g. May 2nd) whereas this is not allowed in German<br />

and other languages. The tool recognises absolute<br />

and relative dates, as well as complete and incomplete<br />

dates. The expression last December thus is a relative incomplete<br />

date with underspecified DAY. If a reference<br />

date is given (this can, for instance, be the publication<br />

date for newspaper articles), the tool can calculate the<br />

normalised expression DD20031200 for the words last<br />

December if the reference date is in the year 2004.<br />
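
The resolution of such a relative incomplete date can be sketched in a few lines of Python. The function name and the reading of "last" used here (the most recent such month that lies before the reference month) are assumptions for illustration; the text only fixes the outcome for last December with a reference date in 2004.

```python
from datetime import date


def resolve_last_month(month, reference):
    """Normalise 'last <month>' against a reference date.

    Assumption: 'last <month>' is the most recent such month that lies
    before the reference month. DAY stays underspecified and is coded 00.
    """
    year = reference.year if month < reference.month else reference.year - 1
    return f"DD{year:04d}{month:02d}00"


# 'last December' in an article published on 13 October 2004
print(resolve_last_month(12, date(2004, 10, 13)))  # DD20031200
```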

The tool does not currently attempt to recognise time<br />

expressions (e.g. 5 PM; 17:15), date periods (e.g. 14-15<br />

October 2004; in the 1960s), incomplete dates with only<br />

one of DAY, MONTH or YEAR (e.g. in October; on the<br />

third), or named cultural festivities (e.g. at Christmas).<br />

An evaluation of the tool on English texts from the<br />

Message Understanding Conference MUC (considering<br />

only the date expressions the tool attempts to recognise)<br />

yielded the following precision/recall values: relative<br />

dates: 86%/67%; complete dates: 100%/100%; incomplete<br />

dates: 98%/98% (for details, see Ignat et al. 2003). The main<br />

problems regarding relative dates have since been corrected<br />

(e.g. this may was recognised as ‘May of the reference<br />

year’) so that the results are now better. The evaluation<br />

of the tool on Romanian news texts yielded similar<br />

results.<br />

For some document types such as news articles, a list<br />

of the normalised date expressions can be a meaningful<br />

signature of the text. Together with further signatures for<br />

names, etc., documents can be described rather accurately.<br />

Following recognition, date expressions can be highlighted<br />

in text for faster retrieval (similar to place names<br />

in Figure 1). Another advantage of the application is that,<br />

once the recognised dates are normalised and stored in a<br />

database, users can search for all articles mentioning a<br />

date in a certain period, by using a simple SQL query.<br />
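
A minimal sketch of such a period search, using Python's built-in sqlite3 module, is given below. The table name, column names and sample rows are invented for this example and do not reflect the JRC database schema. Dates are stored as YYYYMMDD strings, so that plain string comparison gives the chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per normalised date mention in an article.
conn.execute("CREATE TABLE date_mentions (article_id TEXT, norm_date TEXT)")
conn.executemany(
    "INSERT INTO date_mentions VALUES (?, ?)",
    [("a1", "20030115"), ("a2", "20040321"),
     ("a3", "20030331"), ("a1", "20021224")],
)


def articles_in_period(conn, start, end):
    """Return the ids of all articles mentioning a date in [start, end]."""
    rows = conn.execute(
        "SELECT DISTINCT article_id FROM date_mentions "
        "WHERE norm_date BETWEEN ? AND ? ORDER BY article_id",
        (start, end),
    )
    return [row[0] for row in rows]


print(articles_in_period(conn, "20030101", "20030331"))  # ['a1', 'a3']
```

Incomplete dates coded with a day of 00 sort before the first day of their month in this representation, so a query has to decide explicitly whether to include them.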

3.2. Proper names<br />

According to Gey (2000), 30% of content-bearing words<br />

in journalistic text are proper names such as names of<br />

people and of organisations. Friburger & Maurel (2002)<br />

showed that names recognised in text are very valuable<br />

for document similarity calculation, but say that the usage<br />

of names alone is not sufficient for this purpose. It is obvious,<br />

though, that a list of proper names can be a highly<br />

significant signature for at least journalistic text. If combined<br />

with further signatures, as proposed in this article,<br />

name lists can be very powerful.<br />

Proper name recognition is a subject area that is very<br />

well understood and a number of named entity recognition<br />

(NER) tools are available either commercially or for research.<br />

At the JRC, we are currently using two alternative<br />

approaches to recognise people’s names: (a) a PERL tool<br />

with regular expressions that identifies sequences of uppercase<br />

words as names if they are introduced or followed<br />

by cue words such as President, Professor, teacher, etc.;<br />

(b) the part-of-speech output of the readily trained Tree<br />

Tagger 11 , combined with some minimalist local grammar<br />

rules. Until now, we have exploited the Tree Tagger tool<br />

only for English text, although trained Tree Tagger versions<br />

are also available for French, German and Italian.<br />
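
Approach (a) can be sketched as follows. The example is written in Python rather than Perl, covers only cue words that precede the name, and handles only unaccented Latin letters; the cue word list and the pattern are illustrative assumptions, not the expressions used at the JRC.

```python
import re

# Illustrative subset of cue words taken from the text above.
CUE_WORDS = ["President", "Professor", "teacher"]

# A name is approximated as two or more capitalised words in a row.
NAME = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+"
PATTERN = re.compile(r"\b(?:%s)\s+(%s)" % ("|".join(CUE_WORDS), NAME))


def find_names(text):
    """Return the capitalised word sequences introduced by a cue word."""
    return [m.group(1) for m in PATTERN.finditer(text)]


sample = "Russian President Vladimir Putin met teacher Anna Smith yesterday."
print(find_names(sample))  # ['Vladimir Putin', 'Anna Smith']
```

Extending such a tool to a new language mainly means translating the cue word list, which is consistent with the small porting effort reported in this section.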

Spelling                Language(s)<br />

Vladimir Putin          DA, EN, ES, IT, NO, SV<br />

Vladimir Poetin         NL<br />

Vladimir Poutine        FR<br />

Vladimir V Putin        EN<br />

Vladmir Putin           EN<br />

Vladímir Putin          ES<br />

Wladimir Putin          DE<br />

Władimir Putin          PL<br />

Table 4. Variations of the name of the Russian President<br />

found in news texts in various languages.<br />

The less sophisticated PERL tool misses names that are not<br />

surrounded by cue words, but it has the advantage that it<br />

is just a question of a few hours to extend it to new languages,<br />

so that we are now able to recognise names in<br />

English, French, German, Spanish, Italian, Estonian and<br />

Bulgarian.<br />

Even within the same language and the same text, authors<br />

often use different versions of the same name. This<br />

is not only true for foreign names such as Al Qaida (Al<br />

Qaeda, Al Kaida, etc.), but also for known names such as<br />

George Bush (George W. Bush, George Bush Jr., George<br />

Walker Bush, etc.). After having examined a number of<br />

approximate matching techniques, we decided to implement<br />

a simple letter trigram measure that allows us to<br />

recognise many monolingual and cross-lingual name<br />

variations found, as shown in Table 4. The most frequent<br />

variation is now taken as the prototypical one that is<br />

stored in the database, and all others are stored in an alias<br />

list of variations. Via an automatic lookup of the Wikipedia<br />

online encyclopaedia in various languages 12 , further<br />

name variations such as Japanese ウラジーミル プーチン,<br />

Chinese 普京 and Russian Владимир Путин<br />

can be found automatically.<br />
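
The text does not specify the exact trigram measure, so the following Python sketch uses the Dice coefficient over padded letter trigrams as one plausible instance. The function names and the padding scheme are assumptions for illustration.

```python
from collections import Counter


def trigrams(name):
    """Set of letter trigrams of a lower-cased, space-padded name."""
    padded = f"  {name.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Dice coefficient of the two trigram sets (assumed measure)."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def prototypical_spelling(mentions):
    """The most frequent variant is kept as the database entry."""
    return Counter(mentions).most_common(1)[0][0]


for variant in ["Vladimir Poutine", "Wladimir Putin", "George Bush"]:
    print(variant, round(trigram_similarity("Vladimir Putin", variant), 2))

print(prototypical_spelling(
    ["Vladimir Putin", "Wladimir Putin", "Vladimir Putin"]))
```

Variants from Table 4 share most of their trigrams with Vladimir Putin, whereas an unrelated name shares none, so a threshold on this score can be used to merge variants into one alias list.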

By using the PERL regular expressions continuously<br />

over time, a database of frequently mentioned person<br />

names can be built up so that names can then be found in<br />

new text by using simple lookup procedures, without the<br />

need for cue words.<br />

The result of the proper name recognition is thus a list<br />

of people’s names mentioned in a given text, together<br />

with possible name variants and with information on how<br />

often the name was mentioned, both in the given text and<br />

in other texts over time. This latter frequency can be used<br />

to weight the relevance of names in a given text, using<br />

TF.IDF or a related measure, in order to down-weight frequently<br />

mentioned names such as George Bush and to<br />

highlight new or rarely used person names.<br />
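
A minimal sketch of such a weighting is shown below, using the textbook TF.IDF form (term frequency multiplied by the logarithm of the inverse document frequency). The function name, the counts and the second person name are invented for this example.

```python
import math


def name_weights(name_counts, doc_freq, n_docs):
    """Weight each name by TF.IDF.

    name_counts: mentions of each name in the current text (TF)
    doc_freq:    number of earlier texts mentioning each name (DF)
    n_docs:      number of earlier texts in the collection
    Names never seen before are treated as having DF = 1 (assumption).
    """
    return {
        name: tf * math.log(n_docs / max(doc_freq.get(name, 0), 1))
        for name, tf in name_counts.items()
    }


weights = name_weights(
    {"George Bush": 3, "Jane Doe": 3},
    {"George Bush": 900, "Jane Doe": 2},
    n_docs=1000,
)
for name, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {weight:.2f}")
```

With equal counts in the current text, the rarely mentioned name receives a much higher weight than the name that occurs in most earlier texts.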

3.3. Cognates and numbers<br />

When comparing the tokens of texts written in different<br />

languages with each other, one can frequently find some<br />

overlap. This overlap usually consists of (a) numbers in<br />

numerical form (e.g. 596), (b) names or (c) other words<br />

that are coincidentally the same across languages (cognates).<br />

Cognates are normally due to common historical<br />

11 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html<br />

12 See various language versions at<br />

http://en.wikipedia.org, http://de.wikipedia.org, etc.


roots (e.g. English finger and arm vs. German Finger and<br />

Arm) or because they adapted the same loanwords (e.g.<br />

German Computer and Italian computer). These three<br />

types of identical text tokens can be exploited to contribute<br />

constructively to cross-lingual document similarity<br />

calculation. Two news articles about the same event written<br />

in English and Spanish, for instance, are likely to have<br />

a number of tokens in common, while two articles about<br />

different events are likely to have fewer tokens in common.<br />

Obviously, several limitations are linked to this approach:<br />

(a) Number formats can differ from one language to the<br />

other, for instance due to the different usage of number<br />

separators (e.g. English 1,000.00 vs. German<br />

1.000,00), but more often than not there is no difference<br />

(1000 is used in both languages).<br />

(b) Names of people and places often differ from one language<br />

to the other because of different pronunciation<br />

rules (e.g. English Al Qaeda vs. German Al Kaida), or<br />

for historical reasons (e.g. English Venice vs. German<br />

Venedig vs. French Venise, etc.). Languages with different<br />

writing systems are much less likely to have<br />

word tokens in common, even if the pronunciation of<br />

the words is identical (e.g. Italian Venezia vs. Greek<br />

Βενετία).<br />

(c) So-called false friends (words that are the same without<br />

sharing the same meaning, such as English manifestation<br />

and French manifestation or English war and<br />

German war) would cause false hits.<br />

Many more historically related words across languages<br />

could theoretically be exploited, by writing rules that implement<br />

some historical language change phenomena. Especially<br />

the large number of European words with Greek,<br />

Latin or Germanic origin should be easy to identify: Examples<br />

include English pharmacy vs. French pharmacie<br />

and English elephant vs. French éléphant vs. German Elefant<br />

vs. Italian elefante. While the benefit of the rule-based<br />

or trigram-based similarity measure has not been<br />

proven, we are already exploiting identical cognates,<br />

numbers and other identical text tokens across languages<br />

in a system for multilingual news topic tracking, as described<br />

in section 6.<br />
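
The use of identical tokens can be illustrated with the Python sketch below. The tokenisation, the minimum token length and the Jaccard score are assumptions chosen for this example; the text does not state which overlap measure the news tracking system uses.

```python
import re

TOKEN = re.compile(r"\w+")


def token_set(text, min_len=3):
    """Lower-cased tokens of a text, ignoring very short ones."""
    return {t.lower() for t in TOKEN.findall(text) if len(t) >= min_len}


def shared_tokens(text_a, text_b):
    """Numbers, names and cognates that are identical in both texts."""
    return token_set(text_a) & token_set(text_b)


def overlap_score(text_a, text_b):
    """Jaccard overlap of the two token sets (assumed measure)."""
    a, b = token_set(text_a), token_set(text_b)
    return len(a & b) / len(a | b) if a | b else 0.0


english = "NASA computer images show Sedna, found in March 2004"
german = "Computer der NASA zeigen Sedna, entdeckt im März 2004"
print(sorted(shared_tokens(english, german)))
# ['2004', 'computer', 'nasa', 'sedna']
```

A stop list of known false friends could be subtracted from the shared set to limit the false hits described under (c).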

4. Dealing with language-specific issues<br />

From a linguistic point of view, the procedures described<br />

in the previous sections are relatively simplistic. They<br />

mainly rely on tokenisation, case information, dictionary<br />

lookup procedures, stop word lists, simple local patterns,<br />

heuristics, and statistics and Machine Learning methods<br />

operating on ‘words’ without part-of-speech information.<br />

Many of these procedures will work well with English<br />

texts as English has a rather poor morphology. However,<br />

this approach will be much less successful for more highly<br />

inflected languages like Hungarian or those of the Slavic<br />

language family.<br />

It should be possible to overcome most of these phenomena<br />

with the help of good morphology tools, but these<br />

are not available to us for the large range of languages we<br />

are interested in (all twenty official EU languages and<br />

more!). As the manpower available in the JRC’s Language<br />

Technology group is rather limited, as well, we had to resort,<br />

yet again, to some simple heuristics that would allow<br />

us to benefit as much as possible from the available multilingual<br />

resources and the language-independent text features<br />

while limiting the effort to a few weeks per language.<br />

With the existing applications already being set up,<br />

adding the language-specific resources for a new language<br />

takes between two and twelve weeks. Extracting the relevant<br />

terminology from the TARIC product description and<br />

preparing it for the application described in section 2.2 is<br />

rather labour-intensive so that it takes an additional estimated<br />

12 weeks. It is clear that not all linguistic phenomena<br />

and not all languages can be dealt with, but for a large<br />

number of European languages this is sufficient to produce<br />

good and very useful text analysis applications, as<br />

described in section 6.<br />

For the statistical EUROVOC thesaurus text classification<br />

task, experiments with Spanish have shown that, surprisingly,<br />

performance improves by only approximately 2%<br />

when operating on lemmas rather than on inflected words.<br />

Furthermore, multilingual performance tests for EUROVOC<br />

descriptor assignment on eleven different languages from<br />

different language families, including German, Spanish,<br />

Finnish and Lithuanian, have shown that performance is<br />

rather uniform across the languages. Details about these<br />

experiments can be found in Pouliquen et al. (2003a).<br />

Simple dictionary lookup procedures such as for geocoding<br />

and product recognition are, however, more sensitive<br />

to word form variations because inflected word forms<br />

such as New Yorker will not be found in text if the gazetteer<br />

only contains the base form New York. We solve this<br />

problem partially by providing language-specific regular<br />

expressions that strip potential suffixes off those uppercase<br />

words that were found in a text, but not in the place<br />

name gazetteer. For instance, if words like Londonit,<br />

Frankfurdis or New Yorgile are found in Estonian text,<br />

regular expressions will strip -it to produce London and<br />

will replace -dis with -t and -gile with -k in order to produce<br />

Frankfurt and New York. Together with Finnish, Estonian<br />

is known for its extremely sophisticated morphology.<br />

However, place names occur with a limited number of<br />

case endings (in/to/from/… London) so that 37 regular expressions<br />

cover most cases. For most languages, a much<br />

smaller number of regular expressions is needed. A small<br />

evaluation on Estonian news headlines showed that 63 out<br />

of 72 place names were recognised correctly (Recall =<br />

87.5%). The remaining nine places were not found because<br />

either the place name was not in the database or because<br />

the suffix stripping rule was missing (about equal<br />

parts). No wrong hits occurred in the test set (precision =<br />

100%).<br />
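
The suffix-stripping lookup can be sketched in Python as follows. Only the three Estonian rules mentioned in the text are included, and the gazetteer is a toy set; the rule format and function name are assumptions for illustration.

```python
import re

# Toy gazetteer holding base forms only.
GAZETTEER = {"London", "Frankfurt", "New York"}

# (pattern, replacement) pairs for the three examples given in the text;
# the full Estonian rule set is reported to contain 37 expressions.
RULES = [
    (re.compile(r"it$"), ""),
    (re.compile(r"dis$"), "t"),
    (re.compile(r"gile$"), "k"),
]


def lookup_place(word):
    """Return the gazetteer entry for a possibly inflected place name."""
    if word in GAZETTEER:
        return word
    for pattern, replacement in RULES:
        candidate, n_subs = pattern.subn(replacement, word)
        # Accept a stripped form only if the gazetteer confirms it.
        if n_subs and candidate in GAZETTEER:
            return candidate
    return None


for form in ["Londonit", "Frankfurdis", "New Yorgile", "Summit"]:
    print(form, "->", lookup_place(form))
```

Because a stripped form is accepted only when it is found in the gazetteer, a word such as Summit is not turned into a false hit, which fits the high precision reported above.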

It should be possible to apply the same suffix-stripping<br />

procedure to other kinds of vocabulary lists such as products,<br />

professions, etc. However, as these lists are likely to<br />

be larger and we cannot limit our search to upper case<br />

words, the lookup process is likely to be slower and it is possible<br />

that it will produce more wrong hits.<br />

It is not certain that for an agglutinative language like<br />

Hungarian, which can add many different types of suffixes<br />

one after the other, suffix stripping is feasible. It<br />

would be an interesting experiment to apply cascades of<br />

suffix-stripping regular expressions to see whether this<br />

helps to find place names, but the risk of false hits<br />

due to over-stripping is high.<br />

Further tokenisation issues arise when dealing with<br />

languages such as Chinese which do not mark word borders<br />

by a space, and compounding languages like German<br />

where (mostly) nouns can be combined to form long<br />

words. While, at least in German, expressions like Berliner<br />

actor (an actor from Berlin) are not compounded<br />

(Berliner Schauspieler), nouns referring to products are:<br />

Sauerstoffflaschenventilverschluss (oxygen bottle valve<br />

closure).<br />

Figure 2. Document profile (mock-up) summarising the information extracted from documents. Entities linked to multilingual<br />

thesauri and nomenclatures can be displayed in several languages.<br />

For most European languages, the uppercase/lowercase<br />

distinction can be exploited when looking<br />

for the names of people or places. The same is not true for<br />

languages like Japanese, Hindi and Arabic. Furthermore,<br />

case rules even differ to some extent between languages<br />

such as English and French (e.g. the English vs. les anglais)<br />

so that rules either have to be adapted specifically to<br />

each language or lower recall has to be accepted when<br />

looking only at uppercase words.<br />

5. Language-independent procedures and<br />

applications<br />

In the highly multilingual setting of the set of applications<br />

discussed in this article, language-independent text analysis<br />

procedures are very useful. We currently use the following<br />

applications:<br />

(a) An automatic language guessing tool using letter bigram<br />

and trigram statistics, which has so far been<br />

trained for 25 languages.<br />

(b) A keyword extraction tool that identifies the statistically<br />

most salient words and their relative importance<br />

(their keyness) by comparing the word frequency in<br />

the text with an average word frequency as found in<br />

large reference corpora. While we use the log-likelihood<br />

formula to extract and rank the words, other<br />

formulae like TF.IDF are possible alternatives. A list of<br />

stop words can be used to stop some words from being<br />

identified as keywords that are low in semantic content<br />

or that are meaningless when being out of context. A<br />

ranked list of keywords for a document is a good vector<br />

space representation of this document.<br />

(c) A tool to measure the similarity between two documents<br />

by calculating the cosine or another similarity<br />

measure between the vector space representations of<br />

two documents. Monolingually, the list of extracted<br />

keywords and their keyness can be used as input. For<br />

cross-lingual similarity calculation, features like the<br />

ones discussed in this article can be fed to the system.<br />

(d) This document similarity measure can be used for a<br />

number of applications, including hierarchical unsupervised<br />

document clustering, classification and<br />

query-by-example document retrieval.<br />
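
Items (b) and (c) can be illustrated together in a short Python sketch. The keyness function uses the two-term log-likelihood form that is common in corpus linguistics; whether this is exactly the formula used at the JRC is an assumption, as are all names and sample counts.

```python
import math
from collections import Counter


def keyness(tokens, ref_counts, ref_size, stop_words=frozenset()):
    """Log-likelihood keyness of the words over-represented in a text."""
    n = len(tokens)
    scores = {}
    for word, a in Counter(tokens).items():
        if word in stop_words:
            continue
        b = ref_counts.get(word, 0)
        # Expected frequencies if text and reference corpus did not differ.
        e_text = n * (a + b) / (n + ref_size)
        e_ref = ref_size * (a + b) / (n + ref_size)
        ll = 2 * (a * math.log(a / e_text)
                  + (b * math.log(b / e_ref) if b else 0.0))
        if a / n > b / ref_size:
            scores[word] = ll
    return scores


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


reference = {"the": 60000, "planet": 50}          # invented counts
doc = keyness(["sedna", "planet", "the", "sedna"], reference,
              ref_size=1_000_000, stop_words={"the"})
print(sorted(doc, key=doc.get, reverse=True))     # ['sedna', 'planet']
```

The resulting dictionaries of keywords and keyness values serve directly as the sparse vectors passed to the cosine function.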

Further applications that can be based on language-independent<br />

methods are automatic document summarisation<br />

by extracting the most relevant sentences (e.g. those<br />

containing most keywords), and the generation of document<br />

maps. Document maps such as Kohonen maps are<br />

two-dimensional representations of the multi-dimensional<br />

document space that can be useful to get a first overview<br />

of the main contents of a large document collection or to<br />

navigate in the document collection.


Figure 3. German news automatically identified as being about the same subject, together with the title of the most representative<br />

news article, the keywords for this cluster and a map showing the place names mentioned in the cluster. The links<br />

below lead to the corresponding news article cluster in English, Spanish, French and Italian.<br />

6. Applications<br />

At the JRC, we combine applications based on the language-independent<br />

algorithms listed in section 5 with the<br />

information extracted according to the procedures described<br />

in sections 2 and 3. In spite of the relatively shallow<br />

linguistic processing, we were able to produce applications<br />

that are being used as regular in-house services<br />

and for the ad-hoc analysis of document collections given<br />

to us by various users.<br />

Once entities such as dates, names or products have<br />

been identified, they can be highlighted in text in different<br />

colours to allow users to find them quickly, as shown in<br />

Figure 1. For foreign language text, the entity can be displayed<br />

in another language to give users information<br />

about a text that they might not otherwise understand<br />

(cross-lingual information access). The various information<br />

aspects (products, places, keywords, etc.) extracted<br />

from unrestricted and unstructured text can also be displayed<br />

together to provide users with a kind of document<br />

profile, as shown in Figure 2. Those information aspects<br />

that are linked to multilingual nomenclatures, gazetteers<br />

and thesauri can furthermore be displayed in languages<br />

other than the document language.<br />

The structured meta-information is stored in a database<br />

to enable users to search document collections by using<br />

this meta-data as features. This makes it possible, for<br />

instance, to search for all documents mentioning tobacco<br />

products, making reference to Turkey and mentioning a<br />

date in the range from 1.01.2003 to 31.03.2003.<br />

When the reference of geographical place names has<br />

been identified unambiguously, i.e. when we have identified<br />

latitude and longitude of the places, it is easy to create<br />

a map showing the geographical coverage of a document,<br />

of a cluster of documents or of a whole document<br />

collection. Figure 3 shows a small map with those geographical<br />

places highlighted that were mentioned in a<br />

cluster of news articles about the same subject. It also<br />

shows how the clustering of news represented by their<br />

automatically identified keywords successfully identifies<br />

all those articles that talk about the same event (in Figure<br />

3, it is the discovery of our solar system’s tenth planet,<br />

Sedna, in March 2004). The vector space representation of<br />

the whole cluster can be compared to that of each individual<br />

article, by calculating the cosine, so that the article<br />

whose representation is closest to the centroid of the cluster’s<br />

representation can be chosen as the most typical article<br />

whose title can be chosen as the cluster title.<br />
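
The choice of the most typical article can be sketched as follows. The cluster representation is taken here to be the average of the article vectors, which is an assumption, and the titles and keyword weights are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Average of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for key, weight in vec.items():
            total[key] = total.get(key, 0.0) + weight
    return {key: weight / len(vectors) for key, weight in total.items()}


def most_typical(articles):
    """Title of the article whose vector is closest to the centroid."""
    centre = centroid(list(articles.values()))
    return max(articles, key=lambda title: cosine(articles[title], centre))


cluster = {
    "A": {"sedna": 3, "planet": 2},
    "B": {"sedna": 1, "nasa": 4},
    "C": {"sedna": 2, "planet": 1, "nasa": 1},
}
print(most_typical(cluster))  # C
```

Article C covers all three keywords of the cluster and is therefore closest to the centroid, so its title would be shown as the cluster title.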

Figure 3 also shows how cross-lingual links between a<br />

cluster and the news clusters in other languages can be established<br />

successfully by using the multilingual nomenclatures,<br />

thesauri and gazetteers as an interlingua. The<br />

JRC's cross-lingual news tracking system (Pouliquen et al.<br />

2004b) represents each cluster by three different vectors.<br />

When comparing this document representation with those<br />

of clusters in other languages, each of the three vectors<br />

contributes with a different weight to the overall similarity<br />

between the clusters of documents written in different<br />

languages, as described in Pouliquen et al. (2004b). Another<br />

usage of the cross-lingual document similarity calculation<br />

is the automatic compilation of collections of parallel (or<br />

comparable) texts to train and test information extraction<br />

or Machine Translation software. When testing the document<br />

similarity calculation based only on the EUROVOC<br />

descriptor vector representation of 820 English documents<br />

and their Spanish translations (Pouliquen et al. 2003b), we<br />

found that in 90.61% of cases, the Spanish translation was<br />

successfully found as being the most similar Spanish


document for a given English document. When adding information<br />

about the length of texts to exploit the fact that<br />

translations should have a similar length to the original<br />

document, the result increased to 96.83%. This result<br />

shows that processes to map documents onto a multilingual<br />

thesaurus can lead to extremely powerful applications.<br />

Cross-lingual document similarity calculation is<br />

also an essential ingredient for cross-lingual document<br />

plagiarism detection, an application for which, to our<br />

knowledge, no solutions have been proposed to date.<br />

7. Conclusion<br />

The intention of this article was to describe how multilingual<br />

knowledge sources such as gazetteers, vocabulary<br />

lists, nomenclatures and thesauri, as well as language-independent<br />

text features such as dates, can be exploited<br />

for information extraction tasks, to provide cross-lingual<br />

information access and to calculate cross-lingual document<br />

similarity, which itself is a basic ingredient for many<br />

more text analysis applications. We furthermore wanted to<br />

show how relatively naïve text analysis tools can be helpful<br />

to develop powerful text analysis applications for<br />

many different languages with rather little effort, once the<br />

methodology has been decided on and the tools have been<br />

set up. At the JRC, we have already developed the language-specific<br />

resources for a number of European languages<br />

and we are currently making an effort to extend<br />

this tool set to all twenty official languages of the European<br />

Union. While we have no doubt that it is possible to<br />

produce better results with more thorough linguistic<br />

methods, such labour-intensive language-specific work is<br />

not an option for our small team whose aim it is to work<br />

on 20 or more languages. Instead, we exploit existing<br />

multilingual lexical resources (even if they had not initially<br />

been developed for machine use) and language-independent<br />

text features, and we make use of Machine<br />

Learning techniques, statistical methods and heuristics.<br />

We believe we have shown that this approach can lead to<br />

good results and that it is even possible to produce working<br />

versions of novel applications such as cross-lingual<br />

news topic tracking using an interlingua document representation.<br />

The effort required to develop the language-specific<br />

resources for a new language ranges between one<br />

week and three months for the applications we are currently<br />

using. Extracting and developing TARIC product<br />

nomenclature terms is a comparatively labour-intensive<br />

task that requires an additional estimated two to three<br />

months. In order to extend the current tool set to new languages<br />

and applications, the JRC is actively seeking collaborators<br />

such as mother tongue students who would join<br />

us as trainees.<br />

Individual applications out of the set presented in this<br />

paper have been tested and proven, including date and<br />

place name recognition, EUROVOC thesaurus descriptor assignment,<br />

monolingual news clustering and news topic<br />

tracking, and cross-lingual news topic tracking. A number<br />

of other applications presented here still need to be evaluated<br />

formally. Furthermore, it would be useful to carry out<br />

a thorough one-by-one evaluation of the effectiveness of<br />

each of the text features presented here, and of their relative<br />

impact for cross-lingual document similarity calculation.<br />

The JRC can share tools and resources with non-commercial<br />

entities if they are not bound by copyrights<br />

owned by other organisations. The JRC is furthermore interested<br />

in collaborations yielding more language resources,<br />

especially for the new EU languages.<br />

Acknowledgements<br />

Many people have contributed to developing the tool set<br />

described in this paper and to developing and evaluating<br />

the language-specific resources for various languages. We<br />

would particularly like to thank Laima Norviliené (born<br />

Cekyte) and Irina Temnikova for their help with the product<br />

recognition tool, Victoria Fernandez Mera, Elisabet<br />

Lindkvist Michailaki and Arturo Montejo-Ráez for their<br />

help regarding the EUROVOC thesaurus indexing application,<br />

Marco Kimler for his refinement of the geo-coding<br />

tool, and Emilia Käsper, Ippolita Valerio, Tom de Groeve,<br />

Victoria Fernandez Mera, Tomaž Erjavec, Christian Gold<br />

and Irina Temnikova for their help in creating language-specific<br />

resources for Estonian, Italian, Dutch, Spanish,<br />

Slovene, German, Bulgarian and Russian. We would also<br />

like to thank the JRC’s Web Technology team for providing<br />

us with the multilingual news collection to develop<br />

and test many of the applications described here.<br />

8. References<br />

Daille Béatrice & Emmanuel Morin (2000). Reconnaissance<br />

automatique des noms propres de la langue<br />

écrite : les récentes réalisations. In: D. Maurel & F.<br />

Guenthner : Traitement automatique des langues<br />

vol. 41, No. 3. Traitement des noms propres, pp. 601-<br />

623. Hermes, Paris.<br />

Eurovoc (1995). Thesaurus EUROVOC - Volume 2: Subject-Oriented<br />

Version. Ed. 3/English Language. Annex<br />

to the index of the Official Journal of the EC. Luxembourg,<br />

Office for Official Publications of the European<br />

Communities. http://europa.eu.int/celex/eurovoc.<br />

Friburger N. & D. Maurel (2002). Textual Similarity<br />

Based on Proper Names. Proceedings of the workshop<br />

‘Mathematical/Formal Methods in Information Retrieval’<br />

(MFIR’2002) at the 25th ACM SIGIR Conference,<br />

pp. 155-167. Tampere, Finland.<br />

Gey Frederic (2000). Research to Improve Cross-<br />

Language Retrieval – Position Paper for CLEF. In C.<br />

Peters (ed.): Cross-Language Information Retrieval and<br />

Evaluation, Workshop of Cross-Language Evaluation<br />

Forum (CLEF’2000), Lisbon, Portugal. Lecture Notes in<br />

Computer Science 2069, Springer.<br />

Hyland R., C. Clifton & R. Holland (1999). GeoNODE:<br />

Visualizing News in Geospatial Context. In Afca99.<br />

Ignat Camelia, Bruno Pouliquen, António Ribeiro & Ralf<br />

Steinberger (2003). Extending an Information Extraction<br />

Tool Set to Central and Eastern European Languages.<br />

In: Proceedings of the International Workshop<br />

‘Information Extraction for Slavonic and other Central<br />

and Eastern European Languages’ (IESL'2003), held at<br />

RANLP'2003, pp. 33-39. Borovets, Bulgaria, 8 - 9 September<br />

2003.<br />

Leek Tim, Hubert Jin, Sreenivasa Sista & Richard<br />

Schwartz (1999). The BBN Crosslingual Topic Detection<br />

and Tracking System. In 1999 TDT Evaluation System<br />

Summary Papers.<br />

http://www.nist.gov/speech/tests/tdt/tdt99/papers<br />

Hlava Marjorie M.K. & Richard Hainebach (1996). Multilingual<br />

Machine Indexing. NIT'1996. Available at<br />

http://joan.simmons.edu/~chen/nit/NIT'96/96-105-Hava.htm


Montejo-Ráez Arturo (2002). Towards conceptual indexing<br />

using automatic assignment of descriptors. Workshop<br />

on ‘Personalisation Techniques in Electronic Publishing<br />

on the Web: Trends and Perspectives’. Málaga,<br />

Spain, May 2002.<br />

Nelson Stuart, Michael Schopen, Jacque-Lynne Schulman<br />

& Natalie Arluk (2000). An interlingual database of<br />

MeSH translations. 8th International Conference on<br />

Medical Librarianship. London, July 2000.<br />

Norviliené Laima (forthcoming). Computerlinguistische<br />

Analyse von Produktthesauri. Unpublished Master’s<br />

Thesis. Ludwig-Maximilians University Munich, Centre<br />

for Information and Language Processing.<br />

Pouliquen Bruno, Ralf Steinberger & Camelia Ignat<br />

(2003a). Automatic Annotation of Multilingual Text<br />

Collections with a Conceptual Thesaurus. In: Proceedings<br />

of the Workshop ‘Ontologies and Information Extraction’<br />

at the Summer School ‘The Semantic Web and<br />

Language Technology - Its Potential and Practicalities’<br />

(EUROLAN'2003). Bucharest, Romania, 28 July - 8 August<br />

2003.<br />

Pouliquen Bruno, Ralf Steinberger & Camelia Ignat<br />

(2003b). Automatic Identification of Document Translations<br />

in Large Multilingual Document Collections.<br />

Proceedings of the International Conference Recent Advances<br />

in Natural Language Processing (RANLP'2003),<br />

Borovets, Bulgaria.<br />

Pouliquen Bruno, Ralf Steinberger, Camelia Ignat & Tom<br />

de Groeve (2004a). Geographical Information Recognition<br />

and Visualisation in Texts Written in Various Languages.<br />

In: Proceedings of the 19th Annual ACM Symposium<br />

on Applied Computing (SAC'2004), Special<br />

Track on Information Access and Retrieval (SAC-IAR),<br />

vol. 2, pp. 1051-1058. Nicosia, Cyprus, 14 - 17 March<br />

2004.<br />

Pouliquen Bruno, Ralf Steinberger, Camelia Ignat, Emilia<br />

Käsper & Irina Temnikova (2004b). Multilingual and<br />

Cross-lingual News Topic Tracking. In: Proceedings of<br />

the 20th International Conference on Computational<br />

Linguistics (CoLing'2004). Geneva, Switzerland, 23-27<br />

August 2004.<br />

Wactlar H.D. (1999). New Directions in Video Information<br />

Extraction and Summarization. In Proceedings of<br />

the 10th DELOS Workshop, Santorini, Greece, 24-25 June<br />

1999.
