Foreign Language     Words f ∈ F     Cognates E_f+     False Friends E_f−
Japanese (Rômaji)    napukin         napkin            nanking, pumpkin, snacking, sneaking
French               abondamment     abundantly        abandonment, abatement, ..., wonderment
German               prozyklische    procyclical       polished, prophylactic, prophylaxis
Spanish              viudos          widows            avoids, idiots, video, videos, virtuous

Table 7.1: Foreign-English cognates and false friend training examples.

[..., 2001] define a word pair (e, f) to be cognate if they are a translation pair (same meaning) and their edit distance is less than three (same form). We adopt an improved definition (suggested by [Melamed, 1999] for the French-English Canadian Hansards) that does not over-propose shorter word pairs: (e, f) are cognate if they are translations and their LCSR ≥ 0.58. Note that this cutoff is somewhat conservative: the English/German cognates light/Licht (LCSR = 0.8) are included, but not the cognates eight/acht (LCSR = 0.4).

If two words must have LCSR ≥ 0.58 to be cognate, then for a given word f ∈ F, we need only consider as possible cognates the subset of words in E having an LCSR with f larger than 0.58, a set we call E_f. The portion of E_f with the same meaning as f, E_f+, are cognates, while the part with different meanings, E_f−, are not cognates. The words in E_f− with similar spelling but different meaning are sometimes called false friends. The cognate identification task is, for every word f ∈ F and a list of similarly spelled words E_f, to distinguish the cognate subset E_f+ from the false friend set E_f−.

To create training data for our learning approaches, and to generate a high-quality labeled test set, we need to annotate some of the (f, e_f ∈ E_f) word pairs for whether or not the words share a common meaning. In Section 7.5, we explain our two high-precision automatic annotation methods: checking whether each pair of words (a) was aligned in a word-aligned bitext, or (b) was listed as a translation pair in a bilingual dictionary.

Table 7.1 provides some labeled examples with non-empty cognate and false friend lists. Note that despite what these examples may suggest, this is not a ranking task: even in highly related languages, most words in F have empty E_f+ lists, and many have empty E_f− lists as well. Thus one natural formulation for cognate identification is a pairwise (and symmetric) cognation classification that looks at each pair (f, e_f) separately and makes an individual decision:

+ (napukin, napkin)
− (napukin, nanking)
− (napukin, pumpkin)

In this formulation, the benefits of a discriminative approach are clear: it must find substrings that distinguish cognate pairs from word pairs with otherwise similar form. [Klementiev and Roth, 2006], although using a discriminative approach, do not provide their infinite-attribute perceptron with competitive counter-examples. They instead use transliterations as positive examples and randomly paired English and Russian words as negative examples. In the following section, we also improve on [Klementiev and Roth, 2006] by using a character-based string alignment to focus the features for discrimination.
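As an aside, the LCSR scores used above can be computed with a short dynamic program. The following is a minimal Python sketch, assuming Melamed's standard definition LCSR(a, b) = |LCS(a, b)| / max(|a|, |b|); function names are ours, for illustration only. It reproduces the light/Licht and eight/acht values quoted above:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcsr(a, b):
    """Longest common subsequence ratio: |LCS| / max(|a|, |b|)."""
    return lcs_length(a, b) / max(len(a), len(b))

# Case is normalized before comparison (Licht -> licht).
assert lcsr("light", "licht") == 0.8   # cognate, above the 0.58 cutoff
assert lcsr("eight", "acht") == 0.4    # cognate, but below the cutoff
```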

7.4 Features for Discriminative String Similarity

As Chapter 2 explained, discriminative training learns a classifier from a set of labeled training examples, each represented as a set of features. In the previous section we showed how labeled word pairs can be collected. We now address methods of representing these word pairs as sets of features useful for determining cognation.

Consider the Rômaji Japanese/English cognates (sutoresu, stress). The LCSR is 0.625. Note that the LCSR of sutoresu with the English false friend stories is higher: 0.75. LCSR alone is too weak a feature to pick out cognates. We need to look at the actual character substrings.

[Klementiev and Roth, 2006] generate features for a pair of words by splitting both words into all possible substrings of up to size two:

sutoresu ⇒ { s, u, t, o, r, e, s, u, su, ut, to, ... su }
stress ⇒ { s, t, r, e, s, s, st, tr, re, es, ss }

Then, a feature vector is built from all substring pairs from the two words such that the difference in positions of the substrings is within one:

{ s-s, s-t, s-st, su-s, su-t, su-st, su-tr, ... r-s, r-s, r-es, ... }

This feature vector provides the feature representation used in supervised machine learning. This example also highlights the limitations of the Klementiev and Roth approach. The learner can give weight to features like s-s or s-st at the beginning of the word, but because of the gradual accumulation of positional differences, the learner never sees the tor-tr and es-es correspondences that really help indicate the words are cognate.

Our solution is to use the minimum-edit-distance alignment of the two strings as the basis for feature extraction, rather than the positional correspondences. We also include beginning-of-word (^) and end-of-word ($) markers (referred to as boundary markers) to highlight correspondences at those positions. The pair (sutoresu, stress) can be aligned:

^ s u t o r e s u $
^ s - t - r e s s $

For the feature representation, we only extract substring pairs that are consistent with this alignment.¹ That is, the letters in our pairs can only be aligned to each other and not to letters outside the pairing:

{ ^-^, ^s-^s, s-s, su-s, ut-t, t-t, ... es-es, s-s, su-ss, ... }

We define phrase pairs to be the pairs of substrings consistent with the alignment. A similar use of the term "phrase" exists in machine translation, where phrases are often pairs of word sequences consistent with word-based alignments [Koehn et al., 2003].

By limiting the substrings to only those pairs that are consistent with the alignment, we generate fewer, more-informative features. Computationally, using more-precise features allows a larger maximum substring size L than is feasible with the positional approach. Larger substrings allow us to capture important recurring deletions, like the "u" in the pair sut-st observed in Japanese-English.

[Tiedemann, 1999] and others have shown the importance of using the mismatching portions of cognate pairs to learn the recurrent spelling changes between two languages. In order to capture mismatching segments longer than our maximum substring size will allow, we include special features in our representation called mismatches. Mismatches are phrases that span the entire sequence of unaligned characters between two pairs of aligned characters.

¹ If the words are from different writing systems, we can get the alignment by mapping the foreign letters to their closest Roman equivalents, or by using the EM algorithm to learn the edits [Ristad and Yianilos, 1998]. In recent work on recognizing transliterations between different writing systems [Jiampojamarn et al., 2010], we used the output of a many-to-many alignment model [Jiampojamarn et al., 2007] to directly extract the substring-alignment features.
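The two feature schemes above can be made concrete with short sketches. First, a minimal Python rendering of the positional substring pairing of [Klementiev and Roth, 2006]; the function names are ours and this is an illustration of the scheme as described, not the authors' code:

```python
def substrings(word, max_len=2):
    """All substrings of up to max_len characters, with start positions."""
    return [(i, word[i:i + n])
            for i in range(len(word))
            for n in range(1, max_len + 1)
            if i + n <= len(word)]

def positional_features(f_word, e_word, max_len=2):
    """Pair every substring of one word with every substring of the
    other whose start positions differ by at most one."""
    return [s1 + "-" + s2
            for i, s1 in substrings(f_word, max_len)
            for j, s2 in substrings(e_word, max_len)
            if abs(i - j) <= 1]

print(positional_features("sutoresu", "stress")[:6])
# ['s-s', 's-st', 's-t', 's-tr', 'su-s', 'su-st']
```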

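Second, a sketch of the alignment-based extraction: compute a minimum-edit-distance character alignment with boundary markers, then emit every substring pair consistent with it. This is an illustrative reconstruction under one reading of "consistent" (contiguous runs of alignment operations); the function names and the max_len parameter are assumptions, not the original implementation:

```python
def align(f_word, e_word):
    """Minimum-edit-distance character alignment, returned as a list of
    (f_char, e_char) operations, with '' marking an unaligned gap.
    Boundary markers ^ and $ are added to both words."""
    a, b = "^" + f_word + "$", "^" + e_word + "$"
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j
            else:
                dp[i][j] = min(dp[i-1][j-1] + (a[i-1] != b[j-1]),
                               dp[i-1][j] + 1, dp[i][j-1] + 1)
    # Backtrace from the bottom-right corner to recover the operations.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (a[i-1] != b[j-1]):
            ops.append((a[i-1], b[j-1])); i, j = i - 1, j - 1   # match/sub
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            ops.append((a[i-1], "")); i -= 1                    # deletion
        else:
            ops.append(("", b[j-1])); j -= 1                    # insertion
    return ops[::-1]

def phrase_pair_features(f_word, e_word, max_len=2):
    """Substring pairs consistent with the character alignment: every
    contiguous run of alignment operations whose two sides are
    non-empty and at most max_len characters long."""
    ops = align(f_word, e_word)
    feats = []
    for start in range(len(ops)):
        for end in range(start + 1, len(ops) + 1):
            fs = "".join(x for x, _ in ops[start:end])
            es = "".join(y for _, y in ops[start:end])
            if 0 < len(fs) <= max_len and 0 < len(es) <= max_len:
                feats.append(fs + "-" + es)
    return feats

feats = phrase_pair_features("sutoresu", "stress")
assert "ut-t" in feats and "es-es" in feats and "su-ss" in feats
```

Running this on (sutoresu, stress) recovers the ut-t, es-es, and su-ss correspondences that the positional scheme never pairs up.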