25.08.2013 Views

Creating a phraseme matrix based on a Tertium ... - Euralex


In addition to the variants of phrasemes listed in section 2, all of which involve lexical or structural similarity, a phraseme could be considered a variant or synonym of another phraseme because of a shared meaning, independent of lexical or syntactic aspects. Detecting those groups automatically is impossible, since NLP methods either rely on surface forms (for example, when lemmatizing word forms or extracting noun phrases) or operate on the meaning by consulting ontologies like WordNet or GermaNet, but they do not operate on the underlying idiomatic meaning. However, grouping phrasemes according to their meaning would be a very useful strategy for providing a network of phrasemes, which could serve as the basis of a meta-index for phraseological collections from different centuries: although syntactic structures or vocabulary might have changed over time, it is very probable that the expressed meaning was preserved, as in example 4.

As a starting point, we assumed that phrasemes sharing autosemantica (e.g., nouns) could probably share meaning to some extent. We therefore automatically lemmatized all phrasemes from our four resources, resulting in pairs of phrasemes and nouns. If a phraseme had more than one noun, we manually chose the most representative one, that is, the noun under which you would expect the phraseme to be listed in a general dictionary. We could then sort phrasemes according to their shared noun. Each noun-phraseme pair was then manually assigned an identifier; seeing phrasemes that share a common noun grouped together speeds up the assignment of identifiers.
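The grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: lemmatization and the manual choice of the representative noun are taken as already done, the function name `group_by_noun` is hypothetical, and the example phrasemes are illustrative German idioms, not entries taken from the four resources.

```python
from collections import defaultdict

# Illustrative (phraseme, representative noun lemma) pairs.
# Assumption: these are made-up examples, not data from the paper's resources.
pairs = [
    ("den Kopf verlieren", "Kopf"),
    ("jemandem einen Bären aufbinden", "Bär"),
    ("sich den Kopf zerbrechen", "Kopf"),
]

def group_by_noun(pairs):
    """Collect all phrasemes that share the same representative noun."""
    groups = defaultdict(list)
    for phraseme, noun in pairs:
        groups[noun].append(phraseme)
    # Sort nouns and the phrasemes under each noun for stable browsing.
    return {noun: sorted(phrasemes) for noun, phrasemes in sorted(groups.items())}

groups = group_by_noun(pairs)
for noun, phrasemes in groups.items():
    print(noun, phrasemes)
```

Presenting the pairs noun by noun in this way is what lets an annotator assign identifiers to related phrasemes in one pass.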

Identifiers consist of two parts: (a) a five-character code representing the semantic category, and (b) a three-character index representing a prototypical instantiation of this category. After first experiments, we decided not to give an explicit verbal designation for semantic categories: because the inventory of the semantic index grows while noun-phraseme pairs are being annotated, the verbal designation would have to be adjusted all the time. The semantic category is therefore only given implicitly by the prototypical instantiations. As this resource is intended to be used primarily by human experts, there is no need for explicit categorization.
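To make the two-part identifier concrete, the following sketch models it as a small Python class. The text fixes only the lengths of the two parts (five and three characters). Everything else here is an assumption made for illustration: the class name `PhrasemeId`, the concatenation of both parts without a separator, and the sample value `AB123001`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhrasemeId:
    category: str  # five-character code for the semantic category
    instance: str  # three-character index of the prototypical instantiation

    def __post_init__(self):
        # Only the lengths are given in the text; the alphabet is left open.
        if len(self.category) != 5:
            raise ValueError("category code must have five characters")
        if len(self.instance) != 3:
            raise ValueError("instantiation index must have three characters")

    @classmethod
    def parse(cls, text):
        # Assumption: both parts are written back to back, e.g. "AB123001".
        if len(text) != 8:
            raise ValueError("identifier must have eight characters")
        return cls(text[:5], text[5:])

    def __str__(self):
        return self.category + self.instance

pid = PhrasemeId.parse("AB123001")  # hypothetical identifier
print(pid.category, pid.instance)
```

Because no verbal label is stored, a category in this sketch is recognisable only through the phrasemes attached to its instantiations, which mirrors the implicit categorization described above.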

We also decided not to explicitly mark the relationship between phrasemes belonging to the same semantic category. The relation always involves semantic similarity of the underlying concept, but could involve various formal aspects concerning vocabulary, syntactic structure, transformations, morphosyntactic features of the whole multi-word unit, possible syntactic roles of the whole multi-word unit, etc.

4. Results and Conclusion

In this paper we presented our approach for creating a meta-index of phrasemes in German dictionaries and collections from various points in time. We use the semantic concept of phrasemes as a Tertium Comparationis to be able to group phrasemes expressing the same meaning by using different words and syntactic structures. This meta-index presents an overview of the inventory of special-purpose collections and general-purpose dictionaries for German with respect to phrasemes. The phrasemes included in these collections are listed in the meta-index according to their underlying idiomatic meaning. The index can be sorted by the kind of information involved, that is, concepts, collections, or nouns. Human experts can see at a glance (a) whether a specific semantic concept is included in all collections (by browsing concepts), (b) the degree of variation concerning various aspects like vocabulary or syntactic structure (by browsing concepts and inspecting listed original entries), (c) the semantic concepts a noun is part of (by browsing or searching nouns), or (d) the semantic concepts presented in a specific collection (by browsing collections).
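The browsing modes listed above can be illustrated with a minimal Python sketch of such an index. It is not the authors' implementation: the record layout `Entry`, the functions `browse` and `concepts_in_all_collections`, the collection names, the identifiers, and the sample phrasemes (including their grouping under one concept) are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple

class Entry(NamedTuple):
    concept: str     # identifier of the semantic concept (hypothetical values)
    noun: str        # representative noun lemma
    collection: str  # source dictionary or collection (hypothetical names)
    phraseme: str    # original entry as listed in the source

# Illustrative records only; not data from the paper.
entries = [
    Entry("C0001001", "Kopf", "collection_A", "den Kopf verlieren"),
    Entry("C0001001", "Fassung", "collection_B", "die Fassung verlieren"),
    Entry("C0002001", "Bär", "collection_A", "jemandem einen Bären aufbinden"),
]

def browse(entries, field):
    """Group entries by 'concept', 'collection', or 'noun', sorted by key."""
    index = defaultdict(list)
    for entry in entries:
        index[getattr(entry, field)].append(entry)
    return dict(sorted(index.items()))

def concepts_in_all_collections(entries):
    """Return the concepts that are attested in every collection, as in (a)."""
    all_collections = {e.collection for e in entries}
    by_concept = browse(entries, "concept")
    return [
        concept
        for concept, group in by_concept.items()
        if {e.collection for e in group} == all_collections
    ]

print(concepts_in_all_collections(entries))
print(list(browse(entries, "noun")))
```

Browsing by `"concept"` and reading the `phraseme` field of each group corresponds to use (b), browsing by `"noun"` to use (c), and browsing by `"collection"` to use (d).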

Our meta-index presents a first step towards an ontology of semantic concepts expressed by idiomatic multi-word units. Usual ontologies express semantic relations based on literal
