Chapter 7

Alignment-Based Discriminative String Similarity

"Kimono... kimono... kimono... Ha! Of course! Kimono is come from the Greek word himona, is mean winter. So, what do you wear in the wintertime to stay warm? A robe. You see: robe, kimono. There you go!"
- Gus Portokalos, My Big Fat Greek Wedding

A version of this chapter has been published as [Bergsma and Kondrak, 2007a].

This chapter proposes a new model of string similarity that exploits a character-based alignment of the two strings. We again adopt a discriminative approach. Positive pairs are generated automatically from word pairs with a high association in an aligned bitext, or else mined from dictionary translations. Negatives are constructed from pairs with a high amount of character overlap, but which are not translations. So in this work, there are three types of information that allow us to generate examples automatically: statistics from a bitext, entries in a dictionary, and characters in the strings. This information is unlabeled in the sense that no human annotator has specifically labeled cognates in the data. It is useful because once a definition is adopted, examples can be generated automatically, and different methods can be empirically evaluated on a level playing field.

7.1 Introduction

String similarity is often used as a means of quantifying the likelihood that two strings have the same underlying meaning, based purely on the character composition of the two words. [Strube et al., 2002] use edit distance [Levenshtein, 1966] as a feature for determining if two words are coreferent. [Taskar et al., 2005] use French-English common letter sequences as a feature for discriminative word alignment in bilingual texts. [Brill and Moore, 2000] learn misspelled-word to correctly-spelled-word similarities for spelling correction. In each of these examples, a similarity measure can make use of the recurrent substring pairings that reliably occur between words having the same meaning.

Across natural languages, these recurrent substring correspondences are found in word pairs known as cognates: words with a common form and meaning across languages. Cognates arise either from words in a common ancestor language (e.g. light/Licht, night/Nacht in English/German) or from foreign word borrowings (e.g. trampoline/toranporin in English/Japanese). Knowledge of cognates is useful for a number of applications, including
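Returning to the example-generation procedure described at the start of this chapter, the sketch below shows one way such training pairs might be constructed automatically. It is a minimal illustration rather than the thesis's actual pipeline: the overlap measure (Python's difflib ratio, standing in for an LCS-style score), the 0.5 threshold, the function names, and the toy word lists are all assumptions introduced here.

```python
from difflib import SequenceMatcher

def char_overlap(w1, w2):
    # Character-overlap ratio in [0, 1] from difflib; a crude
    # stand-in for the alignment-based similarity developed later
    # in the chapter.
    return SequenceMatcher(None, w1, w2).ratio()

def generate_examples(translation_pairs, foreign_vocab, english_vocab,
                      threshold=0.5):
    # translation_pairs: (foreign, english) pairs with a high bitext
    # association, or taken from a dictionary (assumed to be given).
    # Positives are those translation pairs; negatives are
    # high-overlap pairs that are NOT translations.
    positives = list(translation_pairs)
    negatives = [(f, e)
                 for f in foreign_vocab
                 for e in english_vocab
                 if (f, e) not in translation_pairs
                 and char_overlap(f, e) >= threshold]
    return positives, negatives

# Toy illustration with invented German/English data:
pairs = {("licht", "light"), ("nacht", "night")}
pos, neg = generate_examples(pairs, ["licht", "nacht"], ["light", "night"])
print(pos)  # translation pairs kept as positives
print(neg)  # e.g. ("licht", "night"): confusable, but not a translation
```

Restricting negatives to high-overlap non-translations, as the text above describes, keeps them superficially similar to the positives, so character overlap alone cannot separate the two classes.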
