Foreign Language     Words f ∈ F     Cognates E_f+     False Friends E_f−
Japanese (Rômaji)    napukin         napkin            nanking, pumpkin, snacking, sneaking
French               abondamment     abundantly        abandonment, abatement, ..., wonderment
German               prozyklische    procyclical       polished, prophylactic, prophylaxis
Spanish              viudos          widows            avoids, idiots, video, videos, virtuous

Table 7.1: Foreign-English cognates and false friend training examples.

[..., 2001] define a word pair (e, f) to be cognate if they are a translation pair (same meaning) and their edit distance is less than three (same form). We adopt an improved definition (suggested by [Melamed, 1999] for the French-English Canadian Hansards) that does not over-propose shorter word pairs: (e, f) are cognate if they are translations and their LCSR ≥ 0.58. Note that this cutoff is somewhat conservative: the English/German cognates light/Licht (LCSR = 0.8) are included, but not the cognates eight/acht (LCSR = 0.4).

If two words must have LCSR ≥ 0.58 to be cognate, then for a given word f ∈ F, we need only consider as possible cognates the subset of words in E having an LCSR with f larger than 0.58, a set we call E_f. The portion of E_f with the same meaning as f, E_f+, are cognates, while the part with different meanings, E_f−, are not cognates. The words in E_f− with similar spelling but different meaning are sometimes called false friends. The cognate identification task is, for every word f ∈ F and a list of similarly spelled words E_f, to distinguish the cognate subset E_f+ from the false friend set E_f−.

To create training data for our learning approaches, and to generate a high-quality labeled test set, we need to annotate some of the (f, e_f ∈ E_f) word pairs for whether or not the words share a common meaning. In Section 7.5, we explain our two high-precision automatic annotation methods: checking whether each pair of words (a) was aligned in a word-aligned bitext, or (b) was listed as a translation pair in a bilingual dictionary.

Table 7.1 provides some labeled examples with non-empty cognate and false friend lists. Note that despite what these examples may suggest, this is not a ranking task: even in highly related languages, most words in F have empty E_f+ lists, and many have empty E_f− lists as well. Thus one natural formulation for cognate identification is a pairwise (and symmetric) cognation classification that looks at each pair (f, e_f) separately and makes an individual decision:

+ (napukin, napkin)
− (napukin, nanking)
− (napukin, pumpkin)

In this formulation, the benefits of a discriminative approach are clear: it must find substrings that distinguish cognate pairs from word pairs with otherwise similar form. [Klementiev and Roth, 2006], although using a discriminative approach, do not provide their infinite-attribute perceptron with competitive counter-examples. They instead use transliterations as positive examples and randomly paired English and Russian words as negative examples. In the following section, we also improve on [Klementiev and Roth, 2006] by using a character-based string alignment to focus the features for discrimination.
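As an aside, the LCSR scores used above can be computed with a short dynamic program. The following is a minimal Python sketch, assuming Melamed's standard definition LCSR(a, b) = |LCS(a, b)| / max(|a|, |b|); function names are ours, for illustration only. It reproduces the light/Licht and eight/acht values quoted above:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcsr(a, b):
    """Longest common subsequence ratio: |LCS| / max(|a|, |b|)."""
    return lcs_length(a, b) / max(len(a), len(b))

# Case is normalized before comparison (Licht -> licht).
assert lcsr("light", "licht") == 0.8   # cognate, above the 0.58 cutoff
assert lcsr("eight", "acht") == 0.4    # cognate, but below the cutoff
```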

7.4 Features for Discriminative String Similarity

As Chapter 2 explained, discriminative training learns a classifier from a set of labeled training examples, each represented as a set of features. In the previous section we showed how labeled word pairs can be collected. We now address methods of representing these word pairs as sets of features useful for determining cognation.

Consider the Rômaji Japanese/English cognates (sutoresu, stress). The LCSR is 0.625. Note that the LCSR of sutoresu with the English false friend stories is higher: 0.75. LCSR alone is too weak a feature to pick out cognates. We need to look at the actual character substrings.

[Klementiev and Roth, 2006] generate features for a pair of words by splitting both words into all possible substrings of up to size two:

sutoresu ⇒ { s, u, t, o, r, e, s, u, su, ut, to, ... su }
stress ⇒ { s, t, r, e, s, s, st, tr, re, es, ss }

Then, a feature vector is built from all substring pairs from the two words such that the difference in positions of the substrings is within one:

{ s-s, s-t, s-st, su-s, su-t, su-st, su-tr, ... r-s, r-s, r-es, ... }

This feature vector provides the feature representation used in supervised machine learning. This example also highlights the limitations of the Klementiev and Roth approach. The learner can give weight to features like s-s or s-st at the beginning of the word, but because of the gradual accumulation of positional differences, the learner never sees the tor-tr and es-es correspondences that really help indicate the words are cognate.

Our solution is to use the minimum-edit-distance alignment of the two strings as the basis for feature extraction, rather than the positional correspondences. We also include beginning-of-word (^) and end-of-word ($) markers (referred to as boundary markers) to highlight correspondences at those positions. The pair (sutoresu, stress) can be aligned:

^ s u t o r e s u $
^ s - t - r e s s $

For the feature representation, we only extract substring pairs that are consistent with this alignment.¹ That is, the letters in our pairs can only be aligned to each other and not to letters outside the pairing:

{ ^-^, ^s-^s, s-s, su-s, ut-t, t-t, ... es-es, s-s, su-ss, ... }

We define phrase pairs to be the pairs of substrings consistent with the alignment. A similar use of the term "phrase" exists in machine translation, where phrases are often pairs of word sequences consistent with word-based alignments [Koehn et al., 2003].

By limiting the substrings to only those pairs that are consistent with the alignment, we generate fewer, more-informative features. Computationally, using more-precise features allows a larger maximum substring size L than is feasible with the positional approach. Larger substrings allow us to capture important recurring deletions, like the "u" in the pair sut-st observed in Japanese-English.

[Tiedemann, 1999] and others have shown the importance of using the mismatching portions of cognate pairs to learn the recurrent spelling changes between two languages. In order to capture mismatching segments longer than our maximum substring size will allow, we include special features in our representation called mismatches. Mismatches are phrases that span the entire sequence of unaligned characters between two pairs of aligned characters.

¹ If the words are from different writing systems, we can get the alignment by mapping the foreign letters to their closest Roman equivalents, or by using the EM algorithm to learn the edits [Ristad and Yianilos, 1998]. In recent work on recognizing transliterations between different writing systems [Jiampojamarn et al., 2010], we used the output of a many-to-many alignment model [Jiampojamarn et al., 2007] to directly extract the substring-alignment features.
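The two feature schemes above can be made concrete with short sketches. First, a minimal Python rendering of the positional substring pairing of [Klementiev and Roth, 2006]; the function names are ours and this is an illustration of the scheme as described, not the authors' code:

```python
def substrings(word, max_len=2):
    """All substrings of up to max_len characters, with start positions."""
    return [(i, word[i:i + n])
            for i in range(len(word))
            for n in range(1, max_len + 1)
            if i + n <= len(word)]

def positional_features(f_word, e_word, max_len=2):
    """Pair every substring of one word with every substring of the
    other whose start positions differ by at most one."""
    return [s1 + "-" + s2
            for i, s1 in substrings(f_word, max_len)
            for j, s2 in substrings(e_word, max_len)
            if abs(i - j) <= 1]

print(positional_features("sutoresu", "stress")[:6])
# ['s-s', 's-st', 's-t', 's-tr', 'su-s', 'su-st']
```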

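Second, a sketch of the alignment-based extraction: compute a minimum-edit-distance character alignment with boundary markers, then emit every substring pair consistent with it. This is an illustrative reconstruction under one reading of "consistent" (contiguous runs of alignment operations); the function names and the max_len parameter are assumptions, not the original implementation:

```python
def align(f_word, e_word):
    """Minimum-edit-distance character alignment, returned as a list of
    (f_char, e_char) operations, with '' marking an unaligned gap.
    Boundary markers ^ and $ are added to both words."""
    a, b = "^" + f_word + "$", "^" + e_word + "$"
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j
            else:
                dp[i][j] = min(dp[i-1][j-1] + (a[i-1] != b[j-1]),
                               dp[i-1][j] + 1, dp[i][j-1] + 1)
    # Backtrace from the bottom-right corner to recover the operations.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (a[i-1] != b[j-1]):
            ops.append((a[i-1], b[j-1])); i, j = i - 1, j - 1   # match/sub
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            ops.append((a[i-1], "")); i -= 1                    # deletion
        else:
            ops.append(("", b[j-1])); j -= 1                    # insertion
    return ops[::-1]

def phrase_pair_features(f_word, e_word, max_len=2):
    """Substring pairs consistent with the character alignment: every
    contiguous run of alignment operations whose two sides are
    non-empty and at most max_len characters long."""
    ops = align(f_word, e_word)
    feats = []
    for start in range(len(ops)):
        for end in range(start + 1, len(ops) + 1):
            fs = "".join(x for x, _ in ops[start:end])
            es = "".join(y for _, y in ops[start:end])
            if 0 < len(fs) <= max_len and 0 < len(es) <= max_len:
                feats.append(fs + "-" + es)
    return feats

feats = phrase_pair_features("sutoresu", "stress")
assert "ut-t" in feats and "es-es" in feats and "su-ss" in feats
```

Running this on (sutoresu, stress) recovers the ut-t, es-es, and su-ss correspondences that the positional scheme never pairs up.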