12.07.2015 Views

Large-Scale Semi-Supervised Learning for Natural Language ...



Fr-En Bitext              Es-En Bitext                De-En Bitext
film:film                 agenda:agenda               akt:act
ambassadeur:ambassador    natural:natural             asthma:asthma
bio:bio                   márgenes:margins            lobby:lobby
radios:radios             hormonal:hormonal           homosexuell:homosexual
abusif:abusive            radón:radon                 brutale:brutal
irréfutable:irrefutable   higiénico:hygienic          inzidenz:incidence

Gr-En Dict.               Jp-En Dict.                 Rs-En Dict.
alkali:alkali             baiohoronikusu:bioholonics  aerozol:aerosol
makaroni:macaroni         mafia:mafia                 gondola:gondola
adrenalini:adrenaline     manierisumu:manierisme      rubidiy:rubidium
flamingko:flamingo        ebonaito:ebonite            panteon:pantheon
spasmodikos:spasmodic     oratorio:oratorio           antonim:antonym
amvrosia:ambrosia         mineraru:mineral            gladiator:gladiator

Table 7.5: Highest scored pairs by Alignment-Based Discriminative classifier (negative pair in italics).

7.7 Conclusion and Future Work

This is the first research to apply discriminative string similarity to the task of cognate identification. We have introduced and successfully applied an alignment-based framework for discriminative similarity that consistently demonstrates improved performance in both bitext and dictionary-based cognate identification on six language pairs. Our improved approach can be applied in any of the diverse applications where traditional similarity measures like edit distance and LCSR are prevalent. We have also made available our cognate identification data sets, which will be of interest to general string similarity researchers.

Furthermore, we have provided a natural framework for future cognate identification research. Phonetic, semantic, or syntactic features could be included within our discriminative infrastructure to aid in the identification of cognates in text. In particular, we could investigate approaches that do not require the bilingual dictionaries or bitexts to generate training data. For example, researchers have automatically developed translation lexicons by seeing if words from each language have similar frequencies, contexts [Koehn and Knight, 2002], burstiness, inverse document frequencies, and date distributions [Schafer and Yarowsky, 2002]. Semantic and string similarity might be learned jointly with a co-training or bootstrapping approach [Klementiev and Roth, 2006]. We may also compare alignment-based discriminative string similarity with a more complex discriminative model that learns the alignments as latent structure [McCallum et al., 2005].

Since the original publication of this work, we have also applied the alignment-based string similarity model to the task of transliteration identification [Jiampojamarn et al., 2010] with good results. In that work, we also proposed a new model of string similarity that uses a string kernel to implicitly represent substring pairs of arbitrary length, lifting one of the computational limitations of the model in this chapter.

In addition, we also looked specifically at the cognate identification problem from a multilingual perspective [Bergsma and Kondrak, 2007b]. While the current chapter looks to detect cognates in pairs of languages, we provided a methodology that directly forms
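
For readers unfamiliar with the traditional measures that the conclusion names as points of comparison, edit distance and LCSR, the following is a minimal Python sketch. It is not the alignment-based discriminative model of this chapter. It assumes unit-cost Levenshtein distance and takes LCSR to be the length of the longest common subsequence divided by the length of the longer string. The function names, and the convention that two empty strings score 1.0, are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer length."""
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # Word pairs taken from Table 7.5.
    for src, tgt in [("film", "film"), ("abusif", "abusive"), ("akt", "act")]:
        print(src, tgt, edit_distance(src, tgt), round(lcsr(src, tgt), 3))
```

Both measures treat every character operation identically, whereas the discriminative approach summarized above learns its weights from training data.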
