insight into the slovak and czech corpus linguistics - Slovenský ...

18.07.2013 Views
A brief example 4 now presents a simple sentence as a sequence of annotated words: Form (Czech) (Lit.) Tag , , Z:------------že that J,------------litera the-letter NNFS1-----A---výše above Dg-------2A---1 uvedené of-mentioned AAFS2----1A---mezinárodní of-international AAFS2----1A---smlouvy of-agreement NNFS2-----A---mezi between RR--7---------- ČR Czech Rep. NNFXX-----A---8 a and J^------------- SR Slovakia NNFXX-----A---8 bude will VB-S---3F-AA--mít have Vf--------A---co pretty TT------------nevidět soon Vf--------N---- Special symbols are used for combinations of values that are not easily distinguished, or the processing of which was simply left for the future. In most cases, we use the symbol ‘X’ for ‘any value’ in the particular grammatical category. The lemma represents a unique identification of the word in the morphological dictionary. Usually, the customary dictionary base form (headword) is used as the identification string, extended (if necessary) by a dash and a number distinguishing it from its homographs. We use the following convention: all forms of a lemma must have the same part of speech, and for nouns, they also have to have the same gender. (This is, obviously, in accordance with the conventions of the morphological dictionary we use – see below in 3.1.2 Morphological Analysis). 3.1.2 MORPHOLOGICAL ANALYSIS Morphological analysis is a process of which the input is a word form as found in the text, and the output is a set of possible lemmas which represent the form in the dictionary, with each lemma accompanied by a set of possible tags (as defined in the previous section). For example, for the word form ženu the morphological analysis returns the following results: Lemma tag(s) žena (woman) NNFS4-----A---hnát (to rush) VB-S---1P-AA--- This example exhibits an ambiguity at the lemma level, but no ambiguity within the lemmas. On the other hand, the word form učení displays both types of ambiguity: 4 Example from the weekly journal Českomoravský profit, 10/1994. 56

Lemma tag(s) učení (theory) NNNS1-----A----, NNNS2-----A----, NNNS3-----A----, NNNS4-----A----, NNNS5-----A----, NNNS6-----A----, NNNP1-----A----, NNNP2-----A----, NNNP4-----A----, NNNP5-----A---učený (educated) AAMP1----1A----, AAMP5----1A---- There could be as many as five different lemmas for a given word form and as many as 27 different tags for one lemma. Morphological analysis currently covers about a million Czech lemmas (including derivations), and is based on about 520,000 stems. It can recognize about 25 million word forms and their tags. 3.1.3 THE PROCESS OF MANUAL MORPHOLOGICAL ANNOTATION Morphological analysis is the first step towards the first level of annotation (morphological tagging) in the Prague Dependency Treebank. It can proceed fully automatically and very quickly (about 20000 word forms per second on today’s average machine). We have developed a special software tool (called sgd on a Unix platform, and DA under Windows) which enables easy manual disambiguation of the morphological output. It also helps the annotators to edit the output of the morphology, thus facilitating the identification of possible problems and unknown words in the morphology itself. The morphological annotation has been performed on every sentence in the PDT twice, with a third person resolving the differences between the two annotators. Inter-annotator agreement has been around 97% (measured as a percentage of input tokens receiving the same tag by both annotators). After the adjudication process, errors still remain, though as we are currently preparing version 2.0 of the PDT, we are better able to identify those errors (based on the upper levels of annotation) and we are correcting them. A total of 1,800,000 words (tokens) is now available with manually annotated lemmas and tags. 3.2 THE ANALYTICAL LEVEL The analytical (surface-syntactic) level of annotation is a newly designed level to more easily use (and compare) the results achieved in English parsing to Czech, and to have a preliminary analysis of a sentence structure before proceeding to the most detailed level, the tectogrammatical one. We have chosen the dependency structure to represent the syntactic relations within the sentence. Thus, the basic principles can be formulated as follows: • The structure of the sentence is an oriented, acyclic graph with one entry (root) node; the nodes of the tree are annotated by complex symbols (attribute-value pairs); • The number of nodes of the graph is equal to the number of words in the sentence plus one for the extra root node; • The annotation result is only • 1. the structure of the tree, • 2. the analytical function of every node. An analytical function determines the relation between the (dependent) node and its governing node (which is the node one level up the tree). All the other node attributes (see the table below) 57

Page 1 and 2: INSIGHT INTO THE SLOVAK AND CZECH C

Page 4 and 5: © Silvie Cinková, Radovan Garabí

Page 7 and 8: Corpora and dictionaries VĚRA SCHM

Page 9 and 10: The actual lexicographical work beg

Page 11 and 12: HISTORY OF THE MOST RELEVANT CORPOR

Page 13 and 14: 1.6 OXFORD COLLOCATIONS DICTIONARY

Page 15 and 16: BIBLIOGRAPHY Bejoin, H. (2003): Mod

Page 17 and 18: FDC consists of five dictionaries:

Page 19 and 20: In addition, there was also an enor

Page 21 and 22: - detection of auxiliary forms of t

Page 23 and 24: Perhaps it should be explained why

Page 25 and 26: tvary je nebo jedli) či spíše po

Page 27 and 28: 1 WHY IS CORRECT POS AND MORPHOLOGI

Page 29 and 30: form: které, lemma: který, tags:

Page 31 and 32: • to make it possible to develop

Page 33 and 34: LEAVE ONLY correct interpretation(s

Page 35 and 36: these corpora contain errors). The

Page 37 and 38: Here the prepositional reading is d

Page 39 and 40: • syncretism of the nominative/ac

Page 41 and 42: EXAMPLE 14 (13) Tyto problémy se s

Page 43 and 44: Karlsson F., Voutilainen A., Heikki

Page 45 and 46: INTRODUCTION Apart from deciding on

Page 47 and 48: morphological characteristics have

Page 49 and 50: on the other hand, such sentences a

Page 51 and 52: can safely be considered ungram mat

Page 53 and 54: ABSTRACT A natural language is usua

Page 55: The textual data used for the task

Page 59 and 60: afun _Co _Ap _Pa Description AuxT x

Page 61 and 62: Dependencies between nodes serve as

Page 63 and 64: Every node of the tree is further a

Page 65 and 66: oughly speaking, the structural dif

Page 67 and 68: amongst themselves during the cours

Page 69 and 70: the annotators did not have to run

Page 71 and 72: approximated instead. This would al

Page 73 and 74: Lopatková, M., Panevová, J. (2006

Page 75 and 76: (a) syntactic relations are depende

Page 77 and 78: For the representation of TFA, a sp

Page 79 and 80: 4.2 The following types of textual

Page 81 and 82: established; actually the form ‘t

Page 83 and 84: Recent Developments in the Theory o

Page 85 and 86: Their form is governed by their hea

Page 87 and 88: noun phrases or by the subject posi

Page 89 and 90: elation (labels of underlying roles

Page 91 and 92: Directionality proper and direction

Page 93 and 94: Verbal Valency in the Prague Depend

Page 95 and 96: 2.2 THE CONCEPT OF SHIFTING OF “C

Page 97 and 98: lemma (its analytical form, i.e. th

Page 99 and 100: 4.1 MISSING OPTIONAL VALENCY SLOTS

Page 101 and 102: For instance: odebral mu.BEN krev z

Page 103 and 104: změnit účes z kudrn na rovné vl

Page 105 and 106: this volume) 11 . On the other hand

Page 107 and 108: For instance: ocitnout se ocitla se

Page 109 and 110: The reason for treating these four

Page 111 and 112: only as a first step, in this respe

Page 113 and 114: Nouns as Components of Support Verb

Page 115 and 116: valency representations. Apart from

Page 117 and 118: Even in the cross-linguistic perspe

Page 119 and 120: deprived of a part of its original

Page 121 and 122: present list of SVCs in PDT (i.e. t

Page 123 and 124: 5.1.2 CHANGES IN VALENCY FRAMES OF

Page 125 and 126: SVCs formed by nouns derived from v

Page 127 and 128: obligatory complementation of the g

Page 129 and 130: (22) Petrovou.ACT povinností je p

Page 131 and 132: the possible explanations. The foll

Page 133 and 134: (34) Takový byl alespoň slib od o

Page 135 and 136: on the two of them dealing with the

Page 137 and 138: Person, I. (1992): Das kausative Fu

Page 139 and 140: V případě, že se substantivum o

Page 141 and 142: However, even here we are not told

Page 143 and 144: Définition I.30: catégorie flexio

Page 145 and 146: 2. With our other two examples, the

Page 147 and 148: there is a clear difference in comp

Page 149 and 150: 5. By answering the ten questions a

Page 151 and 152: Slovak National Corpus - History an

Page 153 and 154: events abroad and at home. The inte

Page 155 and 156: explaining the objective, the conte

Page 157 and 158: corpus manager Korman, which was de

Page 159 and 160: ABSTRACT História budovania Sloven

Page 161 and 162: 4 WHY PYTHON Our programming langua

Page 163 and 164: The text is then lemmatised and mor

Page 165 and 166: linear sequence of tokens (or other

Page 167 and 168: morphological, syntactical, lexical

Page 169 and 170: proper names is also indicated on t

Page 171 and 172: For example: Nechcem cestovať v to

Page 173 and 174: 10. Citation forms (%) include fore

Page 175 and 176: Other verb values (where relevant)

Page 177 and 178: RESUME V príspevku sa zaoberáme p

Page 179 and 180: Options for the Generation of a Cor

Page 181 and 182: need a grammar which would utilize

Page 183 and 184: can serve as a basis: F. Štícha e

Page 185 and 186: As the starting point for morphosyn

Page 187 and 188: Construction of semantic structures

Page 189 and 190: The morphosyntactic approach means

Page 191 and 192: a) auto-semantic (predicates): - ac

Page 193 and 194: 4.1.3 Genus verbi (morphosyntactica

Page 195 and 196: • Semantic, syntactic, and morpho

Page 197 and 198: 4.2.5 Form structure (declination)

Page 199 and 200: Syntactic, semantic, and pragmatic

Page 201 and 202: BIBLIOGRAPHY 1. Běličová, H.: N

Page 203 and 204: 51. Najnowsze dzieje języków sło

Page 205 and 206: ABSTRAKT Korpusové gramatiky vych

Page 207 and 208: 31. 3. 2003 Garabík, Radovan: USEN

valency

corpus

verb

czech

morphological

annotation

slovak

verbs

noun

dictionary

insight

linguistics

korpus.juls.savba.sk

insight into the slovak and czech corpus linguistics - Slovenský ...

insight into the slovak and czech corpus linguistics - Slovenský ... ... View more insight into the slovak and czech corpus linguistics - Slovenský ...

Delete template?

Save as template ?

insight into the slovak and czech corpus linguistics - Slovenský ... insight into the slovak and czech corpus linguistics - Slovenský ...