30.08.2013 Views

Wordnet


From WordNet,<br />

to EuroWordNet,<br />

to the Global <strong>Wordnet</strong> Grid:<br />

anchoring languages to universal meaning<br />

Piek Vossen<br />

VU University Amsterdam<br />

1


Princeton WordNet<br />

• http://wordnet.princeton.edu/<br />

• Developed by George Miller and his team at<br />

Princeton University, as the implementation of<br />

a mental model of the lexicon<br />

• Organized around the notion of a synset: a set<br />

of synonyms in a language that represent a<br />

single concept<br />

• Semantic relations between concepts<br />

• Covers over 117,000 concepts and over<br />

150,000 English words<br />

2


What kind of resource is<br />

WordNet?<br />

• Most widely used database in language<br />

technology<br />

• Enormous impact in language technology<br />

development<br />

• Large<br />

• Free and downloadable<br />

• English<br />

3


<strong>Wordnet</strong> Starting point<br />

• Lexical database organized around concepts instead of lexical<br />

forms:<br />

– Separates lexical forms from concepts<br />

– Defines concepts through a relational model of meaning and<br />

not an encyclopedic view<br />

• A concept is defined by a synset; synsets distinguish<br />

word meanings:<br />

– {board, plank} {board, committee} {board, get on}<br />

• The ‘synset’ as a weak notion of synonymy:<br />

“two expressions are synonymous in a linguistic context C if the<br />

substitution of one for the other in C does not alter the truth<br />

value.” (Miller et al. 1993)<br />

4


<strong>Wordnet</strong>: a network of semantically<br />

related words<br />

{conveyance;transport}<br />

{vehicle}<br />

{motor vehicle; automotive vehicle}<br />

{car; auto; automobile; machine; motorcar}<br />

{cruiser; squad car; patrol car;<br />

police car; prowl car}<br />

{car door}<br />

{bumper}<br />

{cab; taxi; hack; taxicab}<br />

{car mirror}<br />

{car window}<br />

{armrest}<br />

{doorlock}<br />

{hinge;<br />

flexible joint}<br />

5


Polysemy & Word Forms<br />

Synsets (word meanings)<br />

contain one or more<br />

word forms<br />

Word forms in multiple<br />

synsets are polysemous<br />

Polysemous word forms<br />

are said to have multiple<br />

senses
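The relation between synsets, word forms and polysemy can be sketched in a few lines of Python. This is only an illustration: the synset identifiers `s1` to `s3` and the function names are hypothetical, and the three "board" synsets are the ones from the earlier slide, not data read from WordNet itself.

```python
from collections import defaultdict

# Toy synsets with hypothetical IDs; the "board" senses follow the slide example.
SYNSETS = {
    "s1": {"board", "plank"},
    "s2": {"board", "committee"},
    "s3": {"board", "get on"},
}


def senses_by_word(synsets):
    """Map each word form to the sorted list of synset IDs that contain it."""
    index = defaultdict(list)
    for synset_id, word_forms in sorted(synsets.items()):
        for form in word_forms:
            index[form].append(synset_id)
    return dict(index)


def polysemous_forms(synsets):
    """Return the word forms that occur in more than one synset."""
    index = senses_by_word(synsets)
    return sorted(form for form, ids in index.items() if len(ids) > 1)


if __name__ == "__main__":
    print(polysemous_forms(SYNSETS))
```

In this toy data only "board" is polysemous; "plank", "committee" and "get on" each occur in a single synset.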


Polysemy, Familiarity<br />

& Zipf's Law<br />

Zipf's law:<br />

There is a constant k such that f * r = k<br />

In words: there is a predictable relation between the<br />

frequency of a word and its rank<br />

Zipf's other law:<br />

The number of meanings of a word is related to its<br />

frequency of use<br />

Polysemy indicates familiarity in wordnet:<br />

• “horse” has 6 meanings, “Equus caballus” has 1
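As a rough illustration of f * r = k, the sketch below ranks words by frequency and multiplies each frequency by its rank. The token counts are made up and chosen so that the product is exactly constant; real corpora only approximate this. The function name `rank_frequency` is an assumption for this example.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ordered, start=1)
    ]


# Made-up toy counts, chosen so that f * r is the same for every word (k = 12).
toy_tokens = ["the"] * 12 + ["of"] * 6 + ["horse"] * 4 + ["canary"] * 3
rows = rank_frequency(toy_tokens)

if __name__ == "__main__":
    for rank, word, freq, product in rows:
        print(rank, word, freq, product)
```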


Familiarity & SemCor<br />

Sense numbers indicate frequency in SemCor (250,000<br />

tokens from Brown corpus manually tagged with WordNet<br />

senses):<br />

• {horse:1, Equus caballus:1} => animal<br />

• {horse:5, knight:2} => in chess


<strong>Wordnet</strong> 3.0 statistics<br />

POS | Unique Strings | Synsets | Total Word-Sense Pairs<br />
Noun | 117,798 | 82,115 | 146,312<br />
Verb | 11,529 | 13,767 | 25,047<br />
Adjective | 21,479 | 18,156 | 30,002<br />
Adverb | 4,481 | 3,621 | 5,580<br />
Totals | 155,287 | 117,659 | 206,941<br />

9


<strong>Wordnet</strong> 3.0 statistics<br />

POS | Monosemous Words and Senses | Polysemous Words | Polysemous Senses<br />
Noun | 101,863 | 15,935 | 44,449<br />
Verb | 6,277 | 5,252 | 18,770<br />
Adjective | 16,503 | 4,976 | 14,399<br />
Adverb | 3,748 | 733 | 1,832<br />
Totals | 128,391 | 26,896 | 79,450<br />
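The counts in the two statistics tables can be combined into an average number of senses per word. The sketch below is one way to do that arithmetic; the function name and the `include_monosemous` switch are assumptions of this example, and the numbers are copied from the tables.

```python
# Counts copied from the WordNet 3.0 statistics tables.
STATS = {
    # pos: (monosemous words, polysemous words, polysemous senses)
    "noun": (101_863, 15_935, 44_449),
    "verb": (6_277, 5_252, 18_770),
    "adjective": (16_503, 4_976, 14_399),
    "adverb": (3_748, 733, 1_832),
}


def average_polysemy(mono, poly_words, poly_senses, include_monosemous=True):
    """Average number of senses per word form.

    A monosemous word contributes exactly one sense, so the total number
    of senses is mono + poly_senses when monosemous words are included.
    """
    if include_monosemous:
        return (mono + poly_senses) / (mono + poly_words)
    return poly_senses / poly_words


if __name__ == "__main__":
    for pos, counts in STATS.items():
        with_mono = average_polysemy(*counts)
        without_mono = average_polysemy(*counts, include_monosemous=False)
        print(f"{pos}: {with_mono:.2f} incl. monosemous, {without_mono:.2f} excl.")
```

For nouns the two denominators add up to the 117,798 unique strings of the first table, and the numerator to its 146,312 word-sense pairs, which is a useful consistency check between the tables.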

10


Semantic organization of<br />

Nouns in WordNet<br />

25 unique beginners


noun.Tops file<br />

Contains very general classifications


Lexicalization patterns<br />

[Figure: hyponymy tree. Node labels: entity, object, artifact, building, church, abbey, organism, animal, bird, canary, common canary, dog, crocodile, plant, tree, flower, rose. Layer labels: top-layer (25 unique beginners); Basic Level Concepts (Rosch).]<br />

13


Lexicalization patterns<br />

[Figure: hyponymy tree. Node labels: entity, object, artifact, building, church, abbey, organism, animal, bird, canary, common canary, dog, crocodile, plant, tree, flower, rose. Layer labels: top-layer (25 unique beginners); basic level concepts.]<br />

Basic level concepts:<br />

• balance of two principles:<br />

● predict most features<br />

● apply to most subclasses<br />

• where most concepts are created<br />

• amalgamate most parts<br />

• most abstract level to draw a picture<br />

14


Lexicalization patterns<br />

[Figure: hyponymy tree. Node labels: entity, object, artifact, building, church, abbey, organism, animal, bird, canary, common canary, dog, crocodile, plant, tree, flower, rose. Additional labels placed around the tree: inessential, souvenir, garbage, threat, curiosity, waste, variable, ....etc.... Layer labels: top-layer (25 unique beginners); basic level concepts.]<br />

15




<strong>Wordnet</strong> top level<br />

17


Meronymy & pictures<br />

[Figure labels: leg, beak, tail]<br />

18


Meronymy & pictures<br />

19


Dogs in WordNet<br />

20


Type-role distinction<br />

• Current WordNet treatment:<br />

(1) a husky is a kind of dog (type)<br />

(2) a husky is a kind of working dog (role)<br />

• What’s wrong?<br />

(2) is defeasible, (1) is not:<br />

*This husky is not a dog<br />

This husky is not a working dog<br />

Other roles: watchdog, sheepdog, herding dog, lapdog, etc….<br />

21


Ontological observations<br />

• Identity criteria as used in OntoClean (Guarino &<br />

Welty 2002):<br />

– rigidity: to what extent are properties true for all<br />

instances of entities in all worlds? You are always a<br />

human, but you can be a student for a short while.<br />

• Ignoring this distinction leads to ISA-overloading<br />

22


Ontology and lexicon<br />

• Hierarchy of disjoint types:<br />

Canine => PoodleDog; NewfoundlandDog;<br />

GermanShepherdDog; Husky<br />

• Lexicon:<br />

– NAMES for TYPES:<br />

{poodle}EN, {poedel}NL, {pudoru}JP<br />

=> (instance x Poodle)<br />

– LABELS for ROLES:<br />

{watchdog}EN, {waakhond}NL, {banken}JP<br />

=>((instance x Canine) and (role x GuardingProcess))<br />
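One way to picture the split between names for types and labels for roles is a small lexicon that maps language-specific words to ontological expressions. The sketch below is hypothetical: the `Meaning` class, its fields and the `equivalents` helper are illustration names, and only the words and the type and role names come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Meaning:
    """Simplified ontological expression attached to a synset."""

    type_: str                  # rigid type the instance belongs to
    role: Optional[str] = None  # process in which the instance plays a role

    @property
    def is_rigid(self) -> bool:
        # A name for a type is rigid; a label for a role is not.
        return self.role is None


LEXICON = {
    ("poodle", "EN"): Meaning("Poodle"),
    ("poedel", "NL"): Meaning("Poodle"),
    ("pudoru", "JP"): Meaning("Poodle"),
    ("watchdog", "EN"): Meaning("Canine", role="GuardingProcess"),
    ("waakhond", "NL"): Meaning("Canine", role="GuardingProcess"),
    ("banken", "JP"): Meaning("Canine", role="GuardingProcess"),
}


def equivalents(word, lang):
    """Words in the lexicon that share the ontological expression of (word, lang)."""
    target = LEXICON[(word, lang)]
    return sorted(
        key for key, meaning in LEXICON.items()
        if meaning == target and key != (word, lang)
    )


if __name__ == "__main__":
    print(equivalents("watchdog", "EN"))
    print(LEXICON[("poodle", "EN")].is_rigid)
```

Because equivalence is computed over the expressions, no word-to-word link between the languages is needed.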

23


Expansion with pure hyponymy relations<br />

[Figure: node labels dog, lapdog, hunting dog, puppy, street dog, watchdog, poodle, dachshund, short hair dachshund, long hair dachshund, bitch.]<br />

Expansion from a type to roles<br />

24


Expansion with pure hyponymy relations<br />

[Figure: node labels dog, lapdog, hunting dog, puppy, street dog, watchdog, poodle, dachshund, short hair dachshund, long hair dachshund, bitch.]<br />

Expansion from a role to types and other roles<br />

25


Synset definition<br />

Synsets consist of interchangeable words or synonyms (Miller, 1998)<br />

loose criteria<br />

lion, king_of_beasts,<br />

Panthera_leo [large gregarious<br />

predatory feline of Africa and India<br />

having a tawny coat with a shaggy<br />

mane in the male]<br />

strict criteria<br />

dog, domestic dog, Canis<br />

familiaris (a member of the<br />

genus Canis..)<br />

Pooch, doggie, doggy,<br />

barker, bow-wow (informal<br />

terms for dogs)<br />

26


Differences among wordnets<br />

English <strong>Wordnet</strong><br />

large number of synsets<br />

asshole, bastard, cocksucker, dickhead, shit,<br />

mother fucker, motherfucker, prick, whoreson,<br />

son of a bitch, SOB<br />

cad, bounder, blackguard, dog, hound, heel<br />

gasbag, windbag<br />

rotter, rat, skunk, stinker, bum, puke, crumb,<br />

lowlife, scum_bag, so-and-so<br />

pain, pain_in_the_neck, nuisance<br />

worm, louse, insect, dirt_ball<br />

Dutch <strong>Wordnet</strong><br />

62 synonyms<br />

naarling:1/r_n-24518, beroerling:1/d_n-26921,<br />

ellendeling:1/r_n-12324, etterbak:1/d_n-75936,<br />

etterbuil:2/d_n-75940, fielt:1/d_n-80137,<br />

fluim:2/d_n-81948, gemenerik:1/r_n-14607,<br />

hond:2/r_n-79023, hondenlul:1/r_n-17019,<br />

kankerlijer:1/d_n-130709, kelerelijder:1/d_n-540923,<br />

kelerelijer:1/d_n-147148, klerelijer:1/r_n-19790,<br />

kloot:1/r_n-19887, kloothommel:1/d_n-137246,<br />

klootspiraal:1/d_n-412711, klootzak:1/r_n-19888,<br />

kwal:2/r_n-21077, lamgat:1/d_n-152244,<br />

lammeling:1/r_n-21272, lamstraal:1/d_n-152396,<br />

lamzak:1/r_n-21286, lazersteen:1/d_n-413025,<br />

lazerstraal:1/d_n-154087, loeder:1/r_n-22410,<br />

lul:2/r_n-22757, lulhannes:1/d_n-161976,<br />

lulletje:1/d_n-541138, miesgasser:1/d_n-172163,<br />

mispunt:1/r_n-24006, onverlaat:1/r_n-26320,<br />

paardelul:1/d_n-228940, paardenlul:1/n_n-501022,<br />

patjakker:1/d_n-212558, pleurislijder:1/r_n-28842,<br />

ploert:1/r_n-28881, plurk:1/d_n-220067, etc. etc.<br />

insulting terms for people who are stupid, ridiculous, irritating, lazy, slow, ……<br />

27


Splitting or Lumping<br />

Lumping<br />

“one denotation and several connotations per synset”<br />

lion, king_of_beasts, Panthera_leo<br />

Splitting<br />

“one denotation and one connotation per synset”<br />

Kraut, Krauthead, Jerry, Hun<br />

[Figure: this synset is linked as a hyponym of German]


Splitting<br />

• On which criteria? Usage (register, domain,<br />

frequency, style, etc.); Attitude (polarity,<br />

subjectivity); Morphology, Syntax, etc.<br />

• Consistent splitting leads to synsets without<br />

synonyms<br />

• Leads to ISA-overloading (German is not a<br />

hypernym of Krauthead)


Lumping<br />

• Consistent lumping leads to extremely<br />

large synsets<br />

• Low interchangeability of synonyms as<br />

their connotations differ too much<br />

• Low interoperability between wordnets:<br />

precise translation equivalence is impossible<br />

• Leads to ‘unintuitive’ synsets


Hybrid and 2-layered


Summary<br />

• Synsets are more compact representations for concepts than word<br />

meanings in traditional lexicons<br />

• Synonyms and hypernyms are substitutional variants:<br />

– begin – commence<br />

– I once had a canary. The bird got sick. The poor animal died.<br />

• Hyponymy and meronymy chains are important transitive relations for<br />

predicting properties and explaining textual properties:<br />

object -> artifact -> vehicle -> 4-wheeled vehicle -> car<br />

• Strict separation of part of speech although concepts are closely<br />

related (bed – sleep) and are similar (dead – death)<br />

• Lexicalization patterns reveal important mental structures<br />
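The transitivity of hyponymy mentioned in the summary can be sketched with a plain dictionary from each word to its hypernym. The data below only encodes the two chains used as examples in the summary; the function names are hypothetical.

```python
# Each word points to its direct hypernym (toy data taken from the summary).
HYPERNYM = {
    "car": "4-wheeled vehicle",
    "4-wheeled vehicle": "vehicle",
    "vehicle": "artifact",
    "artifact": "object",
    "canary": "bird",
    "bird": "animal",
}


def hypernym_chain(word, hypernym=HYPERNYM):
    """Follow hypernym links upward and return all ancestors, nearest first."""
    chain = []
    seen = {word}
    while word in hypernym:
        word = hypernym[word]
        if word in seen:  # guard against accidental cycles in the data
            break
        seen.add(word)
        chain.append(word)
    return chain


def is_a(word, ancestor):
    """True if ancestor is reachable from word through hypernym links."""
    return ancestor in hypernym_chain(word)


if __name__ == "__main__":
    print(hypernym_chain("car"))
    print(is_a("canary", "animal"))
```

This is what licenses the substitution in "I once had a canary. The bird got sick. The poor animal died.": both "bird" and "animal" are on the hypernym chain of "canary".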

32


EuroWordNet<br />

• The development of a multilingual database with wordnets<br />

for several European languages<br />

• Funded by the European Commission, DG XIII,<br />

Luxembourg as projects LE2-4003 and LE4-8328<br />

• March 1996 - September 1999<br />

• 2.5 Million EURO.<br />

• http://www.hum.uva.nl/~ewn<br />

• http://www.illc.uva.nl/EuroWordNet/finalresults-ewn.html<br />

33


EuroWordNet<br />

• Languages covered:<br />

– EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian<br />

– EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian.<br />

• Size of vocabulary:<br />

– EuroWordNet-1: 30,000 concepts - 50,000 word meanings.<br />

– EuroWordNet-2: 15,000 concepts - 25,000 word meanings.<br />

• Type of vocabulary:<br />

– the most frequent words of the languages<br />

– all concepts needed to relate more specific concepts<br />

34


EuroWordNet Model<br />

[Figure: architecture of the EuroWordNet database. Recoverable labels:<br />

– Lexical Items Tables, one per language: English (ride, move, go, drive); Spanish (cabalgar, jinetear, mover, transitar, conducir); Dutch (rijden, berijden, bewegen, gaan); Italian (guidare, andare, muoversi, cavalcare)<br />

– Inter-Lingual-Index with the ILI-record {drive}<br />

– Domains: Traffic, Air, Road<br />

– Ontology: 2OrderEntity, Location, Dynamic<br />

– Link types: I = Language Independent link; II = Link from Language Specific to Inter-Lingual-Index; III = Language Dependent Link]<br />

35


Differences in relations between<br />

EuroWordNet and WordNet<br />

• Added Features to relations<br />

• Cross-Part-Of-Speech relations<br />

• New relations to differentiate shallow hierarchies<br />

• Different interpretations of some relations<br />

36


Cross-Part-Of-Speech relations<br />

WordNet1.5: nouns and verbs are not interrelated by basic semantic<br />

relations such as hyponymy and synonymy:<br />

adornment 2 => change of state-- (the act of changing something)<br />

adorn 1 => change, alter-- (cause to change; make different)<br />

EuroWordNet: words of different parts of speech can be inter-linked with<br />

explicit xpos-synonymy, xpos-antonymy and xpos-hyponymy relations:<br />

{adorn V} XPOS_NEAR_SYNONYM {adornment N}<br />

{size N} XPOS_NEAR_HYPONYM {tall A}<br />

{short A}<br />

37


Role relations<br />

In the case of many verbs and nouns the most salient relation is not the hyperonym<br />

but the relation between the event and the involved participants. These relations<br />

are expressed as follows:<br />

{knife} ROLE_INSTRUMENT {to cut}<br />

{to cut} INVOLVED_INSTRUMENT {knife} reversed<br />

{school} ROLE_LOCATION {to teach}<br />

{to teach} INVOLVED_LOCATION {school} reversed<br />

These relations are typically used when other relations, mainly hyponymy, do not<br />

clarify the position of the concept in the network, but the word is still closely related to<br />

another word.<br />
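The pairing of each ROLE relation with its reversed INVOLVED relation can be sketched as a lookup table plus a closure step. This is an illustration only: the triple representation and the function name `with_reversed` are assumptions, and only the two relation pairs shown on the slide are covered.

```python
# Each relation name maps to the name of its reversed counterpart.
INVERSE = {
    "ROLE_INSTRUMENT": "INVOLVED_INSTRUMENT",
    "ROLE_LOCATION": "INVOLVED_LOCATION",
}
# Make the table symmetric so that INVOLVED_* maps back to ROLE_*.
INVERSE.update({involved: role for role, involved in list(INVERSE.items())})


def with_reversed(relations):
    """Return the relations plus the reversed counterpart of each one.

    A relation is a (source, relation name, target) triple.
    """
    closed = set(relations)
    for source, name, target in relations:
        closed.add((target, INVERSE[name], source))
    return closed


stated = {
    ("knife", "ROLE_INSTRUMENT", "to cut"),
    ("school", "ROLE_LOCATION", "to teach"),
}

if __name__ == "__main__":
    for triple in sorted(with_reversed(stated)):
        print(triple)
```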

38


Co_Role relations<br />

guitar player HAS_HYPERONYM player<br />

CO_AGENT_INSTRUMENT guitar<br />

player HAS_HYPERONYM person<br />

ROLE_AGENT to play music<br />

CO_AGENT_INSTRUMENT musical instrument<br />

to play music HAS_HYPERONYM to make<br />

ROLE_INSTRUMENT musical instrument<br />

guitar HAS_HYPERONYM musical instrument<br />

CO_INSTRUMENT_AGENT guitar player<br />

ice saw HAS_HYPERONYM saw<br />

CO_INSTRUMENT_PATIENT ice<br />

saw HAS_HYPERONYM saw<br />

ROLE_INSTRUMENT to saw<br />

ice CO_PATIENT_INSTRUMENT ice saw REVERSED<br />

39


Horizontal & vertical semantic relations<br />

[Figure: network of relations around the verbs cure and treat. Node labels: patient (hyponyms: chronic patient, mental patient); disease, disorder (hyponyms: stomach disease, kidney disorder); physiotherapy, medicine, etc.; hospital, etc.; doctor (hyponym: child doctor); child. Relation labels: HYPONYM, STATE, ρ-PATIENT, ρ-CAUSE, ρ-INSTRUMENT, ρ-LOCATION, ρ-AGENT, co-ρ-AGENT-PATIENT.]<br />

40


Overview of the Language<br />

Internal relations in Euro<strong>Wordnet</strong><br />

Same Part of Speech relations:<br />

NEAR_SYNONYMY apparatus - machine<br />

HYPERONYMY/HYPONYMY car - vehicle<br />

ANTONYMY open - close<br />

HOLONYMY/MERONYMY head - nose<br />

Cross-Part-of-Speech relations:<br />

XPOS_NEAR_SYNONYMY dead - death; to adorn - adornment<br />

XPOS_HYPERONYMY/HYPONYMY to love - emotion<br />

XPOS_ANTONYMY to live - dead<br />

CAUSE die - death<br />

SUBEVENT buy - pay; sleep - snore<br />

ROLE/INVOLVED write - pencil; hammer - hammer<br />

STATE the poor - poor<br />

MANNER to slurp - noisily<br />

BELONG_TO_CLASS Rome - city<br />

41


The Multilingual Design<br />

• Inter-Lingual-Index: unstructured fund of concepts to<br />

provide an efficient mapping across the languages;<br />

• Index-records are mainly based on WordNet synsets and<br />

consist of synonyms, glosses and source references;<br />

• Various types of complex equivalence relations are<br />

distinguished;<br />

• Equivalence relations from synsets to index records: not on a<br />

word-to-word basis;<br />

• Indirect matching of synsets linked to the same index items;<br />
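Indirect matching through the index can be sketched as follows. The record identifiers (`ili:drive` and so on), the table name and the `matches` function are hypothetical; the toestel/apparaat links follow the many:many example in this deck, while the links of the four "drive" verbs to one record are an assumption made for illustration.

```python
# (language, synset) -> set of ILI records it is linked to.
# Identifiers are hypothetical; synsets are represented by one word each.
EQUIVALENCE = {
    ("EN", "drive"): {"ili:drive"},
    ("NL", "rijden"): {"ili:drive"},
    ("ES", "conducir"): {"ili:drive"},
    ("IT", "guidare"): {"ili:drive"},
    ("NL", "toestel"): {"ili:machine", "ili:device", "ili:apparatus", "ili:tool"},
    ("NL", "apparaat"): {"ili:machine", "ili:device", "ili:apparatus", "ili:tool"},
}


def matches(language, synset, target_language):
    """Synsets of target_language that share at least one ILI record.

    No direct synset-to-synset link is stored; the match is derived
    from the links of both synsets to the index.
    """
    records = EQUIVALENCE[(language, synset)]
    return sorted(
        other
        for (lang, other), ili in EQUIVALENCE.items()
        if lang == target_language
        and ili & records
        and (lang, other) != (language, synset)
    )


if __name__ == "__main__":
    print(matches("NL", "rijden", "ES"))
    print(matches("NL", "toestel", "NL"))
```

Adding a further language then only requires links from its synsets to the index, not to every other wordnet.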

42


Equivalent Near Synonym<br />

1. Multiple Targets (1:many)<br />

Dutch wordnet: schoonmaken (to clean) matches with 4<br />

senses of clean in WordNet1.5:<br />

● make clean by removing dirt, filth, or unwanted substances from<br />

● remove unwanted substances from, such as feathers or pits, as of chickens or fruit<br />

● remove in making clean; "Clean the spots off the rug"<br />

● remove unwanted substances from - (as in chemistry)<br />

2. Multiple Sources (many:1)<br />

Dutch wordnet: versiersel near_synonym versiering<br />

ILI-Record: decoration.<br />

3. Multiple Targets and Sources (many:many)<br />

Dutch wordnet: toestel near_synonym apparaat<br />

ILI-records: machine; device; apparatus; tool<br />

43


Complex mappings across languages<br />

[Figure: wordnet entries linked to ILI records. EN-Net: toe, finger, head. NL-Net: hoofd, kop. IT-Net: dito. ES-Net: dedo. ILI records: { toe : part of foot }, { finger : part of hand }, { dedo, dito : finger or toe }, { head : part of body }, { hoofd : human head }, { kop : animal head }. Link types in the legend: normal equivalence, eq_has_hyponym, eq_has_hyperonym.]<br />

44


Typical gaps in the (English)<br />

ILI<br />

• Dutch:<br />

doodschoppen (to kick to death):<br />

eq_hyperonym {kill}V and to {kick}V<br />

aardig (Adjective, to like):<br />

eq_near_synonym {like}V<br />

cassière (female cashier)<br />

eq_hyperonym {cashier}, {woman}<br />

kunstproduct (artifact substance)<br />

eq_hyperonym {artifact} and to {product}<br />

• Spanish:<br />

alevín (young fish):<br />

eq_hyperonym {fish} and eq_be_in_state {young}<br />

cajera (female cashier)<br />

eq_hyperonym {cashier}, {woman}<br />

45


<strong>Wordnet</strong>s as semantic<br />

structures<br />

• <strong>Wordnet</strong>s are unique language-specific structures:<br />

– different lexicalizations<br />

– differences in synonymy and homonymy<br />

– different relations between synsets<br />

– same organizational principles: synset structure and same set<br />

of semantic relations.<br />

• Language independent knowledge is assigned to the<br />

ILI and can thus be shared by all languages linked to<br />

the ILI: both an ontology and domain hierarchy<br />

46


Autonomous & Language-Specific<br />

[Figure: two hierarchies side by side.<br />

<strong>Wordnet</strong>1.5 node labels: object; artifact, artefact (a man-made object); natural object (an object occurring naturally); block; instrumentality; body; implement; container; device; tool; instrument; box; spoon; bag.<br />

Dutch <strong>Wordnet</strong> node labels: voorwerp {object}; blok {block}; bak {box}; werktuig {tool}; lepel {spoon}; tas {bag}; lichaam {body}.]<br />

47


<strong>Wordnet</strong>s versus ontologies<br />

• <strong>Wordnet</strong>s:<br />

• autonomous language-specific lexicalization patterns in a<br />

relational network.<br />

• Usage: to predict substitution in text for information<br />

retrieval, text generation, machine translation, word-sense<br />

disambiguation.<br />

• Ontologies:<br />

• data structure with formally defined concepts.<br />

• Usage: making semantic inferences.<br />

48


Building wordnets<br />

• Two major approaches:<br />

– Expand model: translate the English synonyms and<br />

copy the synsets and relations;<br />

• Fast and cheap<br />

• Benefits from English research and resources<br />

• Biased by the Princeton wordnet<br />

– Merge model: build wordnet independently of English<br />

and create equivalence mapping afterwards:<br />

• Slow and expensive<br />

• Complicated since the structures differ and you cannot<br />

change English<br />

• Better representation of the language structure:<br />

theoretically more sound (true WORD-net)<br />

49


How to harmonize wordnets?<br />

• Define universal sets of concepts that play a major role in<br />

many different wordnets: so-called Base Concepts<br />

• Define base concepts in each language wordnet<br />

– High level in the hierarchy<br />

– Many hyponyms<br />

• Provide the closest equivalent in English wordnet<br />

• Expand down-ward with hyponyms & determine the<br />

intersection of English equivalences<br />
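The selection steps above can be sketched as follows, under strong simplifying assumptions: each wordnet is a toy child-to-parent dictionary whose nodes are already labelled with their closest English equivalent, "high in the hierarchy" is approximated by a maximum depth, and "many hyponyms" by a minimum count. The thresholds, the `min_languages` parameter and all function names are hypothetical, not the criteria actually used in EuroWordNet.

```python
from collections import Counter


def descendants(children, node):
    """All direct and indirect hyponyms of node."""
    found = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def depth(parent, node):
    """Number of hypernym links between node and the top of its hierarchy."""
    steps = 0
    while node in parent:
        node = parent[node]
        steps += 1
    return steps


def base_concepts(parent, max_depth, min_hyponyms):
    """Concepts that are high in the hierarchy and have many hyponyms."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    nodes = set(parent) | set(parent.values())
    return {
        node for node in nodes
        if depth(parent, node) <= max_depth
        and len(descendants(children, node)) >= min_hyponyms
    }


def common_base_concepts(per_language, min_languages=2):
    """English equivalents selected as base concept in enough languages."""
    counts = Counter(c for concepts in per_language.values() for c in set(concepts))
    return {concept for concept, n in counts.items() if n >= min_languages}


# Toy hierarchies (child -> parent), already mapped to English equivalents.
nl = {"animal": "organism", "dog": "animal", "bird": "animal",
      "canary": "bird", "plant": "organism", "tree": "plant"}
es = {"animal": "organism", "dog": "animal", "plant": "organism",
      "tree": "plant", "flower": "plant"}

selected = {
    "NL": base_concepts(nl, max_depth=1, min_hyponyms=2),
    "ES": base_concepts(es, max_depth=1, min_hyponyms=2),
}
common = common_base_concepts(selected)

if __name__ == "__main__":
    print(selected)
    print(common)
```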

50


Base Concepts in <strong>Wordnet</strong><br />

[Figure: hyponymy tree. Node labels: entity, object, artifact, building, church, abbey, organism, animal, bird, canary, common canary, dog, crocodile, plant, tree, flower, rose; additional labels: garbage, threat. Layer labels: 25 unique beginners; 1024 base concepts; basic level concepts.]<br />

51


Common Base Concepts<br />

Important in at least two languages<br />

Category | Nouns | Verbs | Total<br />
Physical objects & substances | 491 | - | 491<br />
Processes and states | 272 | 228 | 500<br />
Mental objects | 33 | - | 33<br />
Total | 796 | 228 | 1024<br />

52


EuroWordNet data<br />

Language | Synsets | No. of senses | Sens./syns. | Entries | Sens./entry | LIRels. | LIRels/syns | EQRels-ILI | EQRels/syn | Synsets without ILI<br />
Dutch | 44015 | 70201 | 1.59 | 56283 | 1.25 | 111639 | 2.54 | 53448 | 1.21 | 7203<br />
Spanish | 23370 | 50526 | 2.16 | 27933 | 1.81 | 55163 | 2.36 | 21236 | 0.91 | 0<br />
Italian | 40428 | 48499 | 1.20 | 32978 | 1.47 | 117068 | 2.90 | 71789 | 1.78 | 1561<br />
French | 22745 | 32809 | 1.44 | 18777 | 1.75 | 49494 | 2.18 | 22730 | 1.00 | 20<br />
German | 15132 | 20453 | 1.35 | 17098 | 1.20 | 34818 | 2.30 | 16347 | 1.08 | 0<br />
Czech | 12824 | 19949 | 1.56 | 12283 | 1.62 | 26259 | 2.05 | 12824 | 1.00 | 0<br />
Estonian | 7678 | 13839 | 1.80 | 10961 | 1.26 | 16318 | 2.13 | 9004 | 1.17 | 0<br />
English | 16361 | 40588 | 2.48 | 17320 | 2.34 | 42140 | 2.58 | n.a. | n.a. | n.a.<br />
WN15 | 94515 | 187602 | 1.98 | 126617 | 1.48 | 211375 | 2.24 | n.a. | n.a. | n.a.<br />

53


From EuroWordNet to Global<br />

WordNet<br />

• Currently, wordnets exist for more than 70 languages,<br />

including: Arabic, Bantu, Basque, Chinese, Bulgarian,<br />

Estonian, Hebrew, Icelandic, Japanese, Kannada, Korean,<br />

Latvian, Nepali, Persian, Romanian, Sanskrit, Tamil, Thai,<br />

Turkish, Zulu...<br />

• Many languages are genetically and typologically<br />

unrelated<br />

• http://www.globalwordnet.org<br />

54


Indo <strong>Wordnet</strong> Project<br />

http://www.cfilt.iitb.ac.in/wordnet/webhwn/<br />

• Basis for a 10-year Indian machine translation<br />

project: translation through ILI is more<br />

efficient than building translation memories<br />

• Hindi wordnet as the ILI, while Hindi is linked<br />

to the English wordnet<br />

• About 20 Indian languages (900 million<br />

speakers)<br />

55


Indo <strong>Wordnet</strong> progress 2010<br />

Language | Synsets | Words<br />
Assamese | 353 | 19,609<br />
Bengali | 8,679 | 18,563<br />
Bodo | 3,837 | 13,357<br />
Gujarati | 970 | 2,125<br />
Hindi | 33,900 | 8,200<br />
Kannada | 5,920 | 7,344<br />
Kashmiri | 6,569 | 8,674<br />
Malayalam | 6,154 | 8,622<br />
Manipuri | 2,744 | 5,231<br />
Marathi | 9,739 | 21,223<br />
Nepali | 5,802 | 10,278<br />
Oriya | To start<br />
Punjabi | To start<br />
Sanskrit | 3,340 | 17,820<br />
Tamil | 4,750 | 9,821<br />
Telugu | 10,639 | 18,250<br />
Urdu | 6,123 | 9,641<br />

56


Asian <strong>Wordnet</strong> Project<br />

http://asianwordnet.org/<br />

57


Asian <strong>Wordnet</strong> Project<br />

http://asianwordnet.org/<br />

• Built using a <strong>Wordnet</strong> Management System<br />

developed by NICT, NECTEC Bangkok,<br />

Thailand<br />

• Free download: Apache 2.0+, PHP 5.2+,<br />

MySQL 5.0+<br />

• Translation of English wordnet (expand<br />

model), voting, progress monitoring,<br />

intersection and overlap<br />

• Uses <strong>Wordnet</strong>-LMF as exchange format<br />

58


Asian <strong>Wordnet</strong> Editor<br />

59




South African <strong>Wordnet</strong>s<br />

• Started in 2008 as a collaboration of the Department of<br />

African Languages at UNISA and the Centre for Text<br />

Technology (CTexT®) at the North-West University.<br />

• Languages:<br />

– Afrikaans<br />

– Setswana, isiNdebele, isiZulu, isiXhosa and Sesotho sa<br />

Leboa<br />

• Number of synsets: between 5,000 and 15,000 per language; completion is<br />

planned for the end of January 2011.<br />

• Linked to the English wordnet<br />

• DebVisDic wordnet editing environment:<br />

http://deb.fi.muni.cz/clients-debvisdic.php<br />

62


Some downsides of the<br />

Euro<strong>Wordnet</strong> model<br />

• Construction is not done uniformly<br />

• Coverage differs<br />

• Not all wordnets can communicate with one another<br />

• Proprietary rights restrict free access and usage<br />

• A lot of semantics is duplicated<br />

• Complex and obscure equivalence relations due to<br />

linguistic differences between English and other<br />

languages<br />

63


Next step: Global WordNet Grid<br />

[Figure: word tables for eight languages linked to a shared Inter-Lingual Ontology with the concepts Object, Device, TransportDevice. The numbers 1, 2 and 3 label links in the figure. Recoverable word tables:<br />

– English Words: vehicle; car, train<br />

– Spanish Words: vehículo; auto, tren<br />

– Italian Words: veicolo; auto, treno<br />

– Czech Words: dopravní prostředník; auto, vlak<br />

– Dutch Words: voertuig; auto, trein<br />

– French Words: véhicule; voiture, train<br />

– German Words: Fahrzeug; Auto, Zug<br />

– Estonian Words: liiklusvahend; auto, killavoor]<br />

64


GWN-Grid: Main Features<br />

• Construct separate wordnets for each Grid<br />

language<br />

• Contributors from each language encode the<br />

same core set of concepts plus<br />

culture/language-specific ones<br />

• Synsets (concepts) can be mapped<br />

crosslinguistically via an ontology<br />

65


The Ontology: Main Features<br />

• Formal ontology serves as universal index of<br />

concepts<br />

• List of concepts is not just based on the lexicon of<br />

a particular language (unlike in EuroWordNet) but<br />

uses ontological observations<br />

• Ontology contains only upper and mid-level<br />

concepts but concepts can be derived using formal<br />

expressions in e.g. KIF or RDF<br />

• Concepts are related in a type hierarchy<br />

• Concepts are defined with axioms<br />

66


The Ontology: Main Features<br />

• Minimal set of concepts (Reductionist view):<br />

– to express equivalence across languages<br />

– to support inferencing<br />

• Ontology must be powerful enough to encode all concepts<br />

that are lexically expressed in any of the Grid languages<br />

• Ontology need not and cannot provide a linguistic<br />

encoding (label) for all concepts found in the Grid<br />

languages<br />

– Lexicalization in a language is not sufficient to warrant inclusion<br />

in the ontology<br />

– Lexicalization in all or many languages may be sufficient<br />

• Ontological observations will be used to define the<br />

concepts in the ontology<br />

67


Lexicalizations not mapped to<br />

WordNet<br />

• Not added to the type hierarchy:<br />

{straathond}NL (a dog that lives in the streets)<br />

=> ((instance x Canine) and (habitat x Street))<br />

• Added to the type hierarchy:<br />

{klunen}NL (to walk on skates from one frozen body to the next<br />

over land)<br />

WalkProcess = KluunProcess<br />

Axioms:<br />

(and (instance x Human) (instance y Walk) (instance z Skates)<br />

(wear x z) (instance s1 Skate) (instance s2 Skate) (before s1 y)<br />

(before y s2) etc…<br />

• National dishes, customs, games,....<br />

68


Most mismatching concepts are<br />

not new types<br />

• Refer to sets of types in specific circumstances or<br />

to concepts that are dependent on these types; next<br />

to {rivierwater}NL there are many others:<br />

{theewater}NL (water used for making tea)<br />

{koffiewater}NL (water used for making coffee)<br />

{bluswater}NL (water used for extinguishing fire)<br />

• Relate to linguistic phenomena:<br />

– gender, perspective, aspect, diminutives, politeness,<br />

pejoratives, part-of-speech constraints<br />

69


KIF expression for gender<br />

marking<br />

• {teacher}EN => ((instance x Human) and<br />

(agent x TeachingProcess))<br />

• {Lehrer}DE => ((instance x Man) and<br />

(agent x TeachingProcess))<br />

• {Lehrerin}DE => ((instance x Woman) and<br />

(agent x TeachingProcess))<br />
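The three expressions differ only in the type of x, so a comparison can be sketched by treating each expression as a set of (predicate, variable, type) conjuncts. This is a hypothetical illustration: the tuple encoding, the function names and the `SUBTYPE_OF` fragment (Man and Woman as subtypes of Human) are assumptions, not part of the slide or of any KIF tooling.

```python
# Each expression is a conjunction of (predicate, variable, type) triples.
EXPRESSIONS = {
    ("teacher", "EN"): {("instance", "x", "Human"), ("agent", "x", "TeachingProcess")},
    ("Lehrer", "DE"): {("instance", "x", "Man"), ("agent", "x", "TeachingProcess")},
    ("Lehrerin", "DE"): {("instance", "x", "Woman"), ("agent", "x", "TeachingProcess")},
}

# Hypothetical fragment of a type hierarchy: subtype -> supertype.
SUBTYPE_OF = {"Man": "Human", "Woman": "Human"}


def subsumes(general, specific):
    """True if general equals specific or is one of its supertypes."""
    while specific != general and specific in SUBTYPE_OF:
        specific = SUBTYPE_OF[specific]
    return specific == general


def is_more_general(word_a, word_b):
    """True if every conjunct of word_a is matched in word_b by the same
    predicate and variable with an equal or more specific type."""
    a, b = EXPRESSIONS[word_a], EXPRESSIONS[word_b]
    return all(
        any(pa == pb and va == vb and subsumes(ta, tb) for (pb, vb, tb) in b)
        for (pa, va, ta) in a
    )


if __name__ == "__main__":
    print(is_more_general(("teacher", "EN"), ("Lehrerin", "DE")))
    print(is_more_general(("Lehrerin", "DE"), ("teacher", "EN")))
```

Under these assumptions {teacher}EN comes out as more general than both German words, while {Lehrer}DE and {Lehrerin}DE are not comparable with each other.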

70


KIF expression for perspective<br />

sell: subj(x), direct obj(z),indirect obj(y)<br />

versus<br />

buy: subj(y), direct obj(z),indirect obj(x)<br />

=> (and (instance x Human)(instance y Human)<br />

(instance z Entity) (instance e FinancialTransaction)<br />

(source x e) (destination y e) (patient z e))<br />

The same process but a different perspective through subject<br />

and object realization: Russian has two verbs for marry;<br />

apprendre in French can mean both teach and learn<br />

71


Open Questions/Challenges<br />

• What is a word, i.e., a lexical unit?<br />

• What is the status of complex lexemes like<br />

English lightning rod, word of mouth, find<br />

out, kick the bucket?<br />

• What is a semantic unit, i.e. a concept?<br />

72


Open Questions/Challenges<br />

• Is there a core inventory of concepts that are<br />

universally encoded?<br />

• If so, what are these concepts?<br />

• How can crosslinguistic equivalence be verified?<br />

• Is there systematicity to the language-specific<br />

extensions?<br />

• What are the lexicalization patterns of individual<br />

languages?<br />

• Are lexical gaps accidental or systematic?<br />

73


Global <strong>Wordnet</strong> Grid<br />

Installations<br />

• Global <strong>Wordnet</strong> Association:<br />

– Upload and GWG viewer through DebVisDic<br />

• KYOTO project<br />

– WordNet-LMF<br />

– Web service and editor<br />

– Mapping of complete English <strong>Wordnet</strong> to<br />

DOLCE


DebVisDic GWA<br />

http://deb.fi.muni.cz:9000/gwgeng?action=listPreview


GWA-KYOTO


Knowledge Integration in KYOTO<br />

• A model of division of labour (along the lines of Putnam 1975) in which<br />

knowledge is stored in 3 layers:<br />

– Vocabularies, term databases, etc. (SKOS)<br />

– WordNet (WN-LMF)<br />

– Ontology (OWL-DL)<br />

• Mapping relations that support the division of labour<br />

– language-specific conceptualizations<br />

• Each layer supports different types of inferencing<br />

– SPARQL queries<br />

– Graph algorithms (UKB, SSID+)<br />

– Formal reasoning (OWL-DL reasoners, FaCT++)


3-layered knowledge model<br />

Division of labor (Putnam 1975)<br />

[Figure: three knowledge layers and the mappings between them. Recoverable labels:<br />

– Geonames: 8 million places<br />

– Species 2000: 3 million species<br />

– English <strong>Wordnet</strong> 3.0: 100,000 concepts; top synsets; Base Concepts: 1,000; all nouns, all verbs, all adjectives<br />

– Domain <strong>Wordnet</strong>: 1216 synsets (210 WN3.0 synsets, 1,006 new synsets)<br />

– Kyoto ontology: 2,000 classes, 3,000 axioms; Dolce-Lite; OntoWordNet; Endurant, Perdurant, Quality; domain concepts<br />

– Mapping counts shown: 150,486 mappings; 663 mappings; 200,000 mappings; 990 mappings<br />

– Example labels: Wn: polluted water; Wn: greenhouse gas; wo:patient; wo:done-by; Ont: pollution; Ont: warming<br />

– Linked wordnets (BC+equi): Spanish WN, Japan WN, Basque WN, Chinese WN, Italian WN, Dutch WN]


Division of labor in knowledge sources<br />

• Skos database: 2.1 million species<br />

– Animalia > Chordata > Amphibia > Anura > Leptodactylidae > Eleutherodactylus<br />

– Eleutherodactylus atrabracus, Eleutherodactylus augusti<br />

• <strong>Wordnet</strong>-LMF: 100,000 synsets<br />

– animal:1 (Base Concept) > chordate:1 > {vertebrate:1, craniate:1} > amphibian:3<br />

– {frog:1, toad:1, toad frog:1, anuran:1, batrachian:1, salientian:1} > barking frog<br />

• Ontology-OWL-DL: 2,000 classes & 3,000 axioms<br />

– endurant: physical-object, organism<br />

– perdurant: endanger<br />

• Term database: 500,000 terms<br />

– endemic frog, endangered frog, poisonous frog, alien frog
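The division of labor on this slide can be sketched as a two-step lookup: climb the species taxonomy until a taxon is linked to a synset, then climb the wordnet hypernyms until a synset is linked to an ontology class. The taxonomy and hypernym chains below are copied from the slide. The two cross-layer links (`TAXON_TO_SYNSET`, `SYNSET_TO_CLASS`) and all function names are hypothetical and only illustrate the idea; they are not the actual KYOTO mappings.

```python
from typing import Dict, Optional

# Species taxonomy from the slide (child -> parent).
TAXON_PARENT: Dict[str, str] = {
    "Eleutherodactylus augusti": "Eleutherodactylus",
    "Eleutherodactylus atrabracus": "Eleutherodactylus",
    "Eleutherodactylus": "Leptodactylidae",
    "Leptodactylidae": "Anura",
    "Anura": "Amphibia",
    "Amphibia": "Chordata",
    "Chordata": "Animalia",
}

# Wordnet hypernym chain from the slide (synset -> hypernym).
HYPERNYM: Dict[str, str] = {
    "barking frog": "frog:1",
    "frog:1": "amphibian:3",
    "amphibian:3": "vertebrate:1",
    "vertebrate:1": "chordate:1",
    "chordate:1": "animal:1",
}

# ASSUMPTION: illustrative cross-layer links, not taken from the slides.
TAXON_TO_SYNSET: Dict[str, str] = {"Anura": "frog:1"}
SYNSET_TO_CLASS: Dict[str, str] = {"animal:1": "organism"}


def climb(start: str, parent_of: Dict[str, str],
          mapping: Dict[str, str]) -> Optional[str]:
    """Walk upwards from `start` and return the first mapped target, if any."""
    node: Optional[str] = start
    while node is not None:
        if node in mapping:
            return mapping[node]
        node = parent_of.get(node)
    return None


def ontology_class_for_taxon(taxon: str) -> Optional[str]:
    """Resolve a species-level term to an ontology class via the wordnet."""
    synset = climb(taxon, TAXON_PARENT, TAXON_TO_SYNSET)
    if synset is None:
        return None
    return climb(synset, HYPERNYM, SYNSET_TO_CLASS)


print(ontology_class_for_taxon("Eleutherodactylus augusti"))
```

Under these assumptions only a handful of links between layers is needed: millions of species inherit their ontology class through the hierarchy above them.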


Example<br />

268 Species 2000 concepts<br />

Animalia/Chordata/Aves/Anseriformes/Anatidae/Anas/ITS-175103 : Yellow-billed Pintail<br />

eng-3.0-01847565-n <br />

297 WN3.0 Base Concepts<br />

01507175-n 05 399 bird_genus<br />

Connected to KYOTO ontology<br />

bird_genus-eng-3.0-01507175-n type
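The synset identifiers in this example appear to follow a language-version-offset-part-of-speech pattern (for instance eng-3.0-01847565-n). A minimal sketch for splitting such an identifier is given below. The pattern is inferred from the two identifiers on the slide, and the function name and the set of part-of-speech letters are assumptions, not part of the KYOTO specification.

```python
import re
from typing import NamedTuple


class SynsetId(NamedTuple):
    language: str   # e.g. "eng"
    version: str    # e.g. "3.0"
    offset: str     # 8-digit offset, kept as a string to preserve leading zeros
    pos: str        # part-of-speech letter


# ASSUMPTION: three-letter language code, dotted version, 8-digit offset,
# and one of the letters n, v, a, r for the part of speech.
_ID_PATTERN = re.compile(
    r"^(?P<language>[a-z]{3})-(?P<version>\d+(?:\.\d+)*)"
    r"-(?P<offset>\d{8})-(?P<pos>[nvar])$"
)


def parse_synset_id(identifier: str) -> SynsetId:
    """Split an identifier such as 'eng-3.0-01847565-n' into its parts."""
    match = _ID_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognised synset identifier: {identifier!r}")
    return SynsetId(**match.groupdict())


print(parse_synset_id("eng-3.0-01847565-n"))
```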


<strong>Wordnet</strong> ontology relations<br />

Rigid vs. Non-rigid<br />

Rigid<br />

• Synset:Endurant; Synset:Perdurant; Synset:Quality:<br />

• sc_equivalenceOf or sc_subclassOf<br />

Non-rigid:<br />

• Synset:Role; Synset:Endurant<br />

• sc_domainOf: range of ontology types that restricts a role<br />

• sc_playRole: role that is being played<br />

Rigidity can be detected automatically (Rudify, 80% precision,<br />

IAG 80%) and is stored in <strong>Wordnet</strong>-LMF as an attribute of synsets
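The rigid versus non-rigid distinction can be read as a constraint on which mapping relations a synset may carry. The sketch below checks that constraint. The relation names come from the slides; accepting sc_participantOf for non-rigid synsets is an assumption based on the mapping examples that follow, and the function name is hypothetical.

```python
from typing import Iterable, List

# Relations listed on the slide for rigid synsets.
RIGID_RELATIONS = frozenset({"sc_equivalenceOf", "sc_subclassOf"})

# Relations listed for non-rigid synsets; sc_participantOf is added as an
# ASSUMPTION because the mapping examples use it for role-like nouns.
NON_RIGID_RELATIONS = frozenset(
    {"sc_domainOf", "sc_playRole", "sc_participantOf"}
)


def invalid_relations(is_rigid: bool, relations: Iterable[str]) -> List[str]:
    """Return the relations that do not fit the synset's rigidity."""
    allowed = RIGID_RELATIONS if is_rigid else NON_RIGID_RELATIONS
    return [relation for relation in relations if relation not in allowed]


# A rigid synset mapped with a role relation is flagged.
print(invalid_relations(True, ["sc_subclassOf", "sc_playRole"]))
```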


<strong>Wordnet</strong> to ontology mappings<br />

{create, produce, make}Verb, English<br />

-> sc_equivalenceOf construction<br />

{artifact, artefact}Noun, English<br />

-> sc_domainOf physical_object<br />

-> sc_playRole result-existence<br />

-> sc_participantOf construction<br />

{kunststof}Noun, Dutch // lit. artifact substance<br />

-> sc_domainOf amount_of_matter<br />

-> sc_playRole result-existence<br />

-> sc_participantOf construction<br />

{meat}Noun, English<br />

-> sc_domainOf cow, sheep, pig<br />

-> sc_playRole patient<br />

-> sc_participantOf eat<br />

{ 名 肉 , 食物 , 餐 }Noun, Chinese<br />

-> sc_domainOf animal<br />

-> sc_playRole patient<br />

-> sc_participantOf eat<br />

{غذاء, لحم, طعام}Noun, Arabic<br />

-> sc_domainOf cow, sheep<br />

-> sc_playRole patient<br />

-> sc_participantOf eat


<strong>Wordnet</strong> to ontology mappings<br />

{teacher}Noun, English<br />

-> sc_domainOf human<br />

-> sc_playRole done-by<br />

-> sc_participantOf teach<br />

{leraar}Noun, Dutch // lit. male teacher<br />

-> sc_domainOf man<br />

-> sc_playRole done-by<br />

-> sc_participantOf teach<br />

{lerares}Noun, Dutch // lit. female teacher<br />

-> sc_domainOf woman<br />

-> sc_playRole done-by<br />

-> sc_participantOf teach
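The teacher examples show how synsets from different languages can share a role and a process in the ontology while differing in their domain restriction. A small sketch of that comparison follows. The data are copied from the slide; the class and function names are hypothetical and do not reflect the Wordnet-LMF schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SynsetMapping:
    lemmas: Tuple[str, ...]
    language: str
    domain_of: Tuple[str, ...]   # sc_domainOf
    play_role: str               # sc_playRole
    participant_of: str          # sc_participantOf


MAPPINGS = [
    SynsetMapping(("teacher",), "English", ("human",), "done-by", "teach"),
    SynsetMapping(("leraar",), "Dutch", ("man",), "done-by", "teach"),
    SynsetMapping(("lerares",), "Dutch", ("woman",), "done-by", "teach"),
]


def same_role(a: SynsetMapping, b: SynsetMapping) -> bool:
    """True when both synsets play the same role in the same process."""
    return a.play_role == b.play_role and a.participant_of == b.participant_of


def related_synsets(source: SynsetMapping,
                    mappings: List[SynsetMapping]) -> List[SynsetMapping]:
    """Synsets in other languages that share the source's role and process."""
    return [
        m for m in mappings
        if m.language != source.language and same_role(source, m)
    ]


teacher = MAPPINGS[0]
for candidate in related_synsets(teacher, MAPPINGS):
    # The domain restriction shows where the languages differ.
    print(candidate.lemmas, candidate.domain_of)
```

Here both Dutch synsets are retrieved for English teacher, and their sc_domainOf values (man, woman) make the language-specific difference explicit.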


<strong>Wordnet</strong>-LMF<br />



WN-LMF Synset relations<br />



KYOTO project: Wikyoto editor<br />

http://www.wikyoto.net/
