insight into the slovak and czech corpus linguistics - Slovenský ...
insight into the slovak and czech corpus linguistics - Slovenský ... insight into the slovak and czech corpus linguistics - Slovenský ...
A brief example 4 now presents a simple sentence as a sequence of annotated words: Form (Czech) (Lit.) Tag , , Z:------------že that J,------------litera the-letter NNFS1-----A---výše above Dg-------2A---1 uvedené of-mentioned AAFS2----1A---mezinárodní of-international AAFS2----1A---smlouvy of-agreement NNFS2-----A---mezi between RR--7---------- ČR Czech Rep. NNFXX-----A---8 a and J^------------- SR Slovakia NNFXX-----A---8 bude will VB-S---3F-AA--mít have Vf--------A---co pretty TT------------nevidět soon Vf--------N---- Special symbols are used for combinations of values that are not easily distinguished, or the processing of which was simply left for the future. In most cases, we use the symbol ‘X’ for ‘any value’ in the particular grammatical category. The lemma represents a unique identification of the word in the morphological dictionary. Usually, the customary dictionary base form (headword) is used as the identification string, extended (if necessary) by a dash and a number distinguishing it from its homographs. We use the following convention: all forms of a lemma must have the same part of speech, and for nouns, they also have to have the same gender. (This is, obviously, in accordance with the conventions of the morphological dictionary we use – see below in 3.1.2 Morphological Analysis). 3.1.2 MORPHOLOGICAL ANALYSIS Morphological analysis is a process of which the input is a word form as found in the text, and the output is a set of possible lemmas which represent the form in the dictionary, with each lemma accompanied by a set of possible tags (as defined in the previous section). For example, for the word form ženu the morphological analysis returns the following results: Lemma tag(s) žena (woman) NNFS4-----A---hnát (to rush) VB-S---1P-AA--- This example exhibits an ambiguity at the lemma level, but no ambiguity within the lemmas. On the other hand, the word form učení displays both types of ambiguity: 4 Example from the weekly journal Českomoravský profit, 10/1994. 56
Lemma tag(s) učení (theory) NNNS1-----A----, NNNS2-----A----, NNNS3-----A----, NNNS4-----A----, NNNS5-----A----, NNNS6-----A----, NNNP1-----A----, NNNP2-----A----, NNNP4-----A----, NNNP5-----A---učený (educated) AAMP1----1A----, AAMP5----1A---- There could be as many as five different lemmas for a given word form and as many as 27 different tags for one lemma. Morphological analysis currently covers about a million Czech lemmas (including derivations), and is based on about 520,000 stems. It can recognize about 25 million word forms and their tags. 3.1.3 THE PROCESS OF MANUAL MORPHOLOGICAL ANNOTATION Morphological analysis is the first step towards the first level of annotation (morphological tagging) in the Prague Dependency Treebank. It can proceed fully automatically and very quickly (about 20000 word forms per second on today’s average machine). We have developed a special software tool (called sgd on a Unix platform, and DA under Windows) which enables easy manual disambiguation of the morphological output. It also helps the annotators to edit the output of the morphology, thus facilitating the identification of possible problems and unknown words in the morphology itself. The morphological annotation has been performed on every sentence in the PDT twice, with a third person resolving the differences between the two annotators. Inter-annotator agreement has been around 97% (measured as a percentage of input tokens receiving the same tag by both annotators). After the adjudication process, errors still remain, though as we are currently preparing version 2.0 of the PDT, we are better able to identify those errors (based on the upper levels of annotation) and we are correcting them. A total of 1,800,000 words (tokens) is now available with manually annotated lemmas and tags. 3.2 THE ANALYTICAL LEVEL The analytical (surface-syntactic) level of annotation is a newly designed level to more easily use (and compare) the results achieved in English parsing to Czech, and to have a preliminary analysis of a sentence structure before proceeding to the most detailed level, the tectogrammatical one. We have chosen the dependency structure to represent the syntactic relations within the sentence. Thus, the basic principles can be formulated as follows: • The structure of the sentence is an oriented, acyclic graph with one entry (root) node; the nodes of the tree are annotated by complex symbols (attribute-value pairs); • The number of nodes of the graph is equal to the number of words in the sentence plus one for the extra root node; • The annotation result is only • 1. the structure of the tree, • 2. the analytical function of every node. An analytical function determines the relation between the (dependent) node and its governing node (which is the node one level up the tree). All the other node attributes (see the table below) 57
- Page 7 and 8: Corpora and dictionaries VĚRA SCHM
- Page 9 and 10: The actual lexicographical work beg
- Page 11 and 12: HISTORY OF THE MOST RELEVANT CORPOR
- Page 13 and 14: 1.6 OXFORD COLLOCATIONS DICTIONARY
- Page 15 and 16: BIBLIOGRAPHY Bejoin, H. (2003): Mod
- Page 17 and 18: FDC consists of five dictionaries:
- Page 19 and 20: In addition, there was also an enor
- Page 21 and 22: - detection of auxiliary forms of t
- Page 23 and 24: Perhaps it should be explained why
- Page 25 and 26: tvary je nebo jedli) či spíše po
- Page 27 and 28: 1 WHY IS CORRECT POS AND MORPHOLOGI
- Page 29 and 30: form: které, lemma: který, tags:
- Page 31 and 32: • to make it possible to develop
- Page 33 and 34: LEAVE ONLY correct interpretation(s
- Page 35 and 36: these corpora contain errors). The
- Page 37 and 38: Here the prepositional reading is d
- Page 39 and 40: • syncretism of the nominative/ac
- Page 41 and 42: EXAMPLE 14 (13) Tyto problémy se s
- Page 43 and 44: Karlsson F., Voutilainen A., Heikki
- Page 45 and 46: INTRODUCTION Apart from deciding on
- Page 47 and 48: morphological characteristics have
- Page 49 and 50: on the other hand, such sentences a
- Page 51 and 52: can safely be considered ungram mat
- Page 53 and 54: ABSTRACT A natural language is usua
- Page 55: The textual data used for the task
- Page 59 and 60: afun _Co _Ap _Pa Description AuxT x
- Page 61 and 62: Dependencies between nodes serve as
- Page 63 and 64: Every node of the tree is further a
- Page 65 and 66: oughly speaking, the structural dif
- Page 67 and 68: amongst themselves during the cours
- Page 69 and 70: the annotators did not have to run
- Page 71 and 72: approximated instead. This would al
- Page 73 and 74: Lopatková, M., Panevová, J. (2006
- Page 75 and 76: (a) syntactic relations are depende
- Page 77 and 78: For the representation of TFA, a sp
- Page 79 and 80: 4.2 The following types of textual
- Page 81 and 82: established; actually the form ‘t
- Page 83 and 84: Recent Developments in the Theory o
- Page 85 and 86: Their form is governed by their hea
- Page 87 and 88: noun phrases or by the subject posi
- Page 89 and 90: elation (labels of underlying roles
- Page 91 and 92: Directionality proper and direction
- Page 93 and 94: Verbal Valency in the Prague Depend
- Page 95 and 96: 2.2 THE CONCEPT OF SHIFTING OF “C
- Page 97 and 98: lemma (its analytical form, i.e. th
- Page 99 and 100: 4.1 MISSING OPTIONAL VALENCY SLOTS
- Page 101 and 102: For instance: odebral mu.BEN krev z
- Page 103 and 104: změnit účes z kudrn na rovné vl
- Page 105 and 106: this volume) 11 . On the other hand
A brief example 4 now presents a simple sentence as a sequence of annotated words:<br />
Form (Czech) (Lit.) Tag<br />
, , Z:------------že<br />
that J,------------litera<br />
<strong>the</strong>-letter NNFS1-----A---výše<br />
above Dg-------2A---1<br />
uvedené of-mentioned AAFS2----1A---mezinárodní<br />
of-international AAFS2----1A---smlouvy<br />
of-agreement NNFS2-----A---mezi<br />
between RR--7----------<br />
ČR Czech Rep. NNFXX-----A---8<br />
a <strong>and</strong> J^-------------<br />
SR Slovakia NNFXX-----A---8<br />
bude will VB-S---3F-AA--mít<br />
have Vf--------A---co<br />
pretty TT------------nevidět<br />
soon Vf--------N----<br />
Special symbols are used for combinations of values that are not easily distinguished, or <strong>the</strong><br />
processing of which was simply left for <strong>the</strong> future. In most cases, we use <strong>the</strong> symbol ‘X’ for<br />
‘any value’ in <strong>the</strong> particular grammatical category.<br />
The lemma represents a unique identification of <strong>the</strong> word in <strong>the</strong> morphological dictionary.<br />
Usually, <strong>the</strong> customary dictionary base form (headword) is used as <strong>the</strong> identification string,<br />
extended (if necessary) by a dash <strong>and</strong> a number distinguishing it from its homographs. We<br />
use <strong>the</strong> following convention: all forms of a lemma must have <strong>the</strong> same part of speech, <strong>and</strong><br />
for nouns, <strong>the</strong>y also have to have <strong>the</strong> same gender. (This is, obviously, in accordance with <strong>the</strong><br />
conventions of <strong>the</strong> morphological dictionary we use – see below in 3.1.2 Morphological<br />
Analysis).<br />
3.1.2 MORPHOLOGICAL ANALYSIS<br />
Morphological analysis is a process of which <strong>the</strong> input is a word form as found in <strong>the</strong><br />
text, <strong>and</strong> <strong>the</strong> output is a set of possible lemmas which represent <strong>the</strong> form in <strong>the</strong> dictionary,<br />
with each lemma accompanied by a set of possible tags (as defined in <strong>the</strong> previous<br />
section). For example, for <strong>the</strong> word form ženu <strong>the</strong> morphological analysis returns <strong>the</strong><br />
following results:<br />
Lemma tag(s)<br />
žena (woman) NNFS4-----A---hnát<br />
(to rush) VB-S---1P-AA---<br />
This example exhibits an ambiguity at <strong>the</strong> lemma level, but no ambiguity within <strong>the</strong> lemmas.<br />
On <strong>the</strong> o<strong>the</strong>r h<strong>and</strong>, <strong>the</strong> word form učení displays both types of ambiguity:<br />
4 Example from <strong>the</strong> weekly journal Českomoravský profit, 10/1994.<br />
56