word alignment [Kondrak et al., 2003], sentence alignment [Simard et al., 1992; Church, 1993; McEnery and Oakes, 1996; Melamed, 1999] and learning translation lexicons [Mann and Yarowsky, 2001; Koehn and Knight, 2002]. The related task of identifying transliterations has also received much recent attention [Klementiev and Roth, 2006; Zelenko and Aone, 2006; Yoon et al., 2007; Jiampojamarn et al., 2010]. Extending dictionaries with automatically-acquired knowledge of cognates and transliterations can improve machine translation systems [Knight et al., 1995]. Also, cognates have been used to help assess the readability of a foreign language text by new language learners [Uitdenbogerd, 2005]. Developing automatic ways to identify these cognates is thus a prerequisite for a robust automatic readability assessment.

We propose an alignment-based, discriminative approach to string similarity and we evaluate this approach on the task of cognate identification. Section 7.2 describes previous approaches and their limitations. In Section 7.3, we explain our technique for automatically creating a cognate-identification training set. A novel aspect of this set is the inclusion of competitive counter-examples for learning. Section 7.4 shows how discriminative features are created from a character-based, minimum-edit-distance alignment of a pair of strings. In Section 7.5, we describe our bitext and dictionary-based experiments on six language pairs, including three based on non-Roman alphabets. In Section 7.6, we show significant improvements over traditional approaches, as well as significant gains over more recent techniques by [Ristad and Yianilos, 1998], [Tiedemann, 1999], [Kondrak, 2005], and [Klementiev and Roth, 2006].

7.2 Related Work

String similarity is a fundamental concept in a variety of fields and hence a range of techniques have been developed. We focus on approaches that have been applied to words, i.e., uninterrupted sequences of characters found in natural language text. The most well-known measure of the similarity of two strings is the edit distance or Levenshtein distance [Levenshtein, 1966]: the number of insertions, deletions and substitutions required to transform one string into another. In our experiments, we use normalized edit distance (NED): edit distance divided by the length of the longer word. Other popular measures include Dice’s Coefficient (DICE) [Adamson and Boreham, 1974], and the length-normalized measures longest common subsequence ratio (LCSR) [Melamed, 1999], the length of the longest common subsequence divided by the length of the longer word (used by [Melamed, 1998]), and longest common prefix ratio (PREFIX) [Kondrak, 2005], the length of the longest common prefix divided by the longer word length (four-letter prefix match was used by [Simard et al., 1992]). These baseline approaches have the important advantage of not requiring training data. We can also include in the non-learning category [Kondrak, 2005]’s longest common subsequence formula (LCSF), a probabilistic measure designed to mitigate LCSR’s preference for shorter words.

Although simple to use, the untrained measures cannot adapt to the specific spelling differences between a pair of languages. Researchers have therefore investigated adaptive measures that are learned from a set of known cognate pairs.
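Before turning to these learned measures, the untrained baselines above can be made concrete with a minimal sketch. The function names, the bigram-based DICE variant, and the example word pair are illustrative assumptions rather than the exact formulations used in this chapter.

```python
# Illustrative sketch of the untrained baselines: NED, LCSR, PREFIX,
# and a character-bigram DICE. Not the thesis implementation.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                   # deletion
                            curr[j - 1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))     # substitution
        prev = curr
    return prev[-1]

def ned(a: str, b: str) -> float:
    """Normalized edit distance: edit distance over the longer length."""
    return edit_distance(a, b) / max(len(a), len(b))

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer length."""
    return lcs_length(a, b) / max(len(a), len(b))

def prefix_ratio(a: str, b: str) -> float:
    """Longest common prefix length over the longer length."""
    p = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        p += 1
    return p / max(len(a), len(b))

def dice(a: str, b: str) -> float:
    """Dice's coefficient over character bigrams (one common variant;
    single-character strings are a degenerate case handled arbitrarily)."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    if not A and not B:
        return 1.0
    return 2 * len(A & B) / (len(A) + len(B))

if __name__ == "__main__":
    # Hypothetical cognate pair, e.g. English "colour" vs. French "couleur".
    pair = ("colour", "couleur")
    print("NED   ", round(ned(*pair), 3))
    print("LCSR  ", round(lcsr(*pair), 3))
    print("PREFIX", round(prefix_ratio(*pair), 3))
    print("DICE  ", round(dice(*pair), 3))
```

In the experiments below, such untrained scores serve only as points of comparison for the learned, alignment-based similarity measures discussed next.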
[Ristad and Yianilos, 1998] developed a stochastic transducer version of edit distance learned from unaligned string pairs. [Mann and Yarowsky, 2001] saw little improvement over edit distance when applying this transducer to cognates, even when filtering the transducer’s probabilities into different weight classes to better approximate edit distance. [Tiedemann, 1999] used var-
