Creating a phraseme matrix based on a Tertium ... - Euralex
In addition to the variants of phrasemes listed in section 2 (which all involve lexical or
structural similarity), a phraseme could be considered a variant or synonym of another
phraseme because of a shared meaning, independent of lexical or syntactic aspects.
Detecting those groups automatically is impossible, since NLP methods either rely on surface forms (for
example, when lemmatizing word forms or extracting noun phrases) or operate on
meaning by consulting ontologies like WordNet or GermaNet; they do not operate on the
underlying idiomatic meaning. However, grouping phrasemes according to their meaning
would be a very useful strategy for providing a network of phrasemes, which could serve as
the basis of a meta-index for phraseological collections from different centuries: although
syntactic structures or vocabulary might have changed over time, it is very probable that the
expressed meaning was preserved, as in example 4.
As a starting point, we assumed that phrasemes sharing autosemantica (e.g., nouns) would
probably share meaning to some extent. We therefore automatically lemmatized all phrasemes
from our four resources, resulting in pairs of phrasemes and nouns. If a phraseme had more
than one noun, we manually chose the most representative one, that is, the noun you would
expect the phraseme to be listed under in a general dictionary. We could then sort phrasemes
according to their shared noun. Each noun-phraseme pair was then manually assigned an
identifier; seeing phrasemes that share a common noun grouped together speeds up the
assignment of identifiers.
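The grouping step described above can be sketched as follows. This is only a minimal illustration, not the authors' tooling: the example phrasemes, the function name `group_by_noun`, and the assumption that the representative noun has already been chosen are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (phraseme, representative noun) pairs. In the paper, nouns come
# from automatic lemmatization, and the representative noun is picked manually
# when a phraseme contains more than one (here: "Nagel" rather than "Kopf").
pairs = [
    ("jemandem einen Bären aufbinden", "Bär"),
    ("den Nagel auf den Kopf treffen", "Nagel"),
    ("Nägel mit Köpfen machen", "Nagel"),
]


def group_by_noun(pairs):
    """Group phrasemes under their representative noun, sorted by noun."""
    groups = defaultdict(list)
    for phraseme, noun in pairs:
        groups[noun].append(phraseme)
    # Sort nouns and the phrasemes under each noun so that an annotator
    # sees related entries next to each other.
    return {noun: sorted(groups[noun]) for noun in sorted(groups)}


if __name__ == "__main__":
    for noun, phrasemes in group_by_noun(pairs).items():
        print(noun, phrasemes)
```

Presenting all phrasemes for one noun together is what makes the subsequent manual assignment of identifiers faster; the sketch does not attempt to assign identifiers itself.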
Identifiers consist of two parts: (a) a five-character code representing the semantic
category, and (b) a three-character index representing a prototypical instantiation of this
category. After first experiments, we decided not to give an explicit verbal designation for
semantic categories: because the inventory of the semantic index grows while noun-phraseme
pairs are being annotated, the verbal designation would have to be adjusted all the time. The semantic
category is therefore only given implicitly by the prototypical instantiations. As this resource
is intended to be used primarily by human experts, there is no need for explicit categorization.
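To make the two-part identifier concrete, here is a small sketch of how such an identifier could be split into its parts. The paper does not say how the two parts are joined or which characters are allowed, so the plain eight-character concatenation, the sample value, and the names `ConceptId` and `parse_identifier` are assumptions made for illustration.

```python
from typing import NamedTuple


class ConceptId(NamedTuple):
    category: str  # five-character code for the semantic category
    index: str     # three-character index of the prototypical instantiation


def parse_identifier(identifier: str) -> ConceptId:
    """Split an identifier into category code and instantiation index.

    Assumes the two parts are simply concatenated (5 + 3 characters);
    the actual layout used by the authors is not specified in the text.
    """
    if len(identifier) != 8:
        raise ValueError(f"expected 8 characters, got {len(identifier)}")
    return ConceptId(category=identifier[:5], index=identifier[5:])


if __name__ == "__main__":
    print(parse_identifier("AB123001"))
```

Because categories carry no verbal label, two identifiers belong to the same semantic category exactly when their `category` parts are equal.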
We also decided not to explicitly mark the relationship between phrasemes belonging to
the same semantic category. The relation always involves semantic similarity of the
underlying concept, but could involve various formal aspects concerning vocabulary,
syntactic structure, transformations, morphosyntactic features of the whole multi-word unit,
possible syntactic roles of the whole multi-word unit, etc.
4. Results and Conclusion
In this paper we presented our approach for creating a meta-index of phrasemes in German
dictionaries and collections from various points in time. We use the semantic concept of
phrasemes as a Tertium Comparationis to be able to group phrasemes expressing the same
meaning by using different words and syntactic structures. This meta-index presents an
overview of the inventory of special-purpose collections and general-purpose dictionaries for
German with respect to phrasemes. The phrasemes included in these collections are listed in
the meta-index according to their underlying idiomatic meaning. The index can be sorted by
the kind of information involved, that is, concepts, collections, or nouns. Human experts can
see at a glance (a) whether a specific semantic concept is included in all collections (by
browsing concepts), (b) the degree of variation concerning various aspects like vocabulary or
syntactic structure (by browsing concepts and inspecting listed original entries), (c) the
semantic concepts a noun is part of (by browsing or searching nouns), or (d) the semantic
concepts presented in a specific collection (by browsing collections).
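The browsing views (a) to (d) can be illustrated with a small sketch over a flat list of index entries. The record layout, the collection names, the identifiers, and the grouping of the sample phrasemes under one concept are all hypothetical; the paper does not describe the data model behind the meta-index.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    concept: str     # concept identifier (hypothetical values)
    collection: str  # source dictionary or collection (hypothetical names)
    noun: str        # representative noun
    phraseme: str    # original entry as listed in the collection


entries = [
    Entry("AAAAA001", "Collection A", "Nagel", "den Nagel auf den Kopf treffen"),
    Entry("AAAAA001", "Collection B", "Schwarze", "ins Schwarze treffen"),
    Entry("BBBBB001", "Collection A", "Bär", "jemandem einen Bären aufbinden"),
]


def browse(entries, field):
    """Group entries by 'concept', 'collection', or 'noun', sorted by key."""
    view = defaultdict(list)
    for entry in entries:
        view[getattr(entry, field)].append(entry)
    return dict(sorted(view.items()))


def concepts_in_all_collections(entries):
    """Return concepts attested in every collection, as in view (a)."""
    all_collections = {e.collection for e in entries}
    seen = defaultdict(set)
    for e in entries:
        seen[e.concept].add(e.collection)
    return sorted(c for c, cols in seen.items() if cols == all_collections)


if __name__ == "__main__":
    print(concepts_in_all_collections(entries))
    for noun, group in browse(entries, "noun").items():
        print(noun, [e.concept for e in group])
```

Calling `browse(entries, "concept")` and reading the `phraseme` fields side by side corresponds to view (b), while `browse(entries, "noun")` and `browse(entries, "collection")` correspond to views (c) and (d).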
Our meta-index represents a first step towards an ontology of semantic concepts expressed
by idiomatic multi-word units. Usual ontologies express semantic relations based on literal