
Large-Scale Semi-Supervised Learning for Natural Language ...



Lang.         Feat.            Wt.    Example
Fr (Bitext)   ées - ed         +8.0   vérifiées:verified
Jp (Dict.)    ru - l           +5.9   penaruti:penalty
De (Bitext)   k - c            +5.5   kreativ:creative
De (Bitext)   eren$ - e$       +5.2   ignorieren:ignore
Fr (Bitext)   lement$ - ly$    +5.2   admirablement:admirably
Es (Bitext)   ar - ating       +5.0   acelerar:accelerating
Rs (Dict.)    irov -           +4.9   motivirovat:motivate
Gr (Dict.)    f - ph           +4.1   symfonia:symphony
Gr (Dict.)    kos - c          +3.3   anarchikos:anarchic
Gr (Dict.)    os$ - y$         -2.5   anarchikos:anarchy
Jp (Dict.)    ou - ou          -2.6   handoutai:handout
Es (Dict.)    - un             -3.1   balance:unbalance
Fr (Bitext)   s$ - ly$         -4.2   fervents:fervently
Fr (Dict.)    er$ - er$        -5.0   former:former
Es (Bitext)   mos - s          -5.1   toleramos:tolerates

Table 7.4: Example features and weights for various Alignment-Based Discriminative classifiers (Foreign-English, negative pairs in italics).

age derivational and inflectional morphology. For example, Greek-English pairs with the adjective-ending correspondence kos-c, e.g. anarchikos:anarchic, are favoured, but pairs with the adjective ending in Greek and noun ending in English, os$-y$, are penalized; indeed, by our definition, anarchikos:anarchy is not cognate. In a bitext, the feature ées-ed captures that feminine-plural inflection of past tense verbs in French corresponds to regular past tense in English. On the other hand, words ending in the Spanish first person plural verb suffix -amos are rarely translated to English words ending with the suffix -s, causing mos-s to be penalized. The ability to leverage negative features, learned from appropriate counter examples, is a key innovation of our discriminative framework.

Table 7.5 gives the top pairs scored by our system on the three bitext and three of the dictionary test sets. Notice that unlike traditional similarity measures that always score identical words higher than all other pairs, by virtue of our feature weighting, our discriminative classifier prefers some pairs with very characteristic spelling changes.

We performed error analysis by looking at all the pairs our system scored quite confidently (highly positive or highly negative similarity), but which were labeled oppositely. Highly-scored false positives arose equally from 1) actual cognates not linked as translations in the data, 2) related words with diverged meanings, e.g. the only error in Table 7.5: makaroni in Greek actually means spaghetti in English (makaronada is macaroni), and 3) the same word stem, a different part of speech (e.g. the Greek/English adjective/noun synonymos:synonym). Meanwhile, inspection of the highly-confident false negatives revealed some (often erroneously-aligned in the bitext) positive pairs with incidental letter match (e.g. the French/English recettes:proceeds) that we would not actually deem to be cognate. Thus the errors that our system makes are often either linguistically interesting or point out mistakes in our automatically-labeled bitext and (to a lesser extent) dictionary data.
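
To make the role of the weighted features in Table 7.4 concrete, the following is a minimal sketch of how a linear classifier could combine them, together with the kind of filter used in the error analysis above. It is an illustration under stated assumptions, not the system's actual implementation: the weights and example pairs are copied from Table 7.4, but the names `WEIGHTS`, `ACTIVE`, `score`, `rank` and `confident_errors`, the `threshold` parameter, and the choice to list a single active feature per pair by hand (standing in for a real character aligner and feature extractor) are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# A feature is an aligned (foreign substring, English substring) pair;
# "$" marks the end of a word, as in Table 7.4.
Feature = Tuple[str, str]
Pair = Tuple[str, str]

# A subset of the weights listed in Table 7.4.
WEIGHTS: Dict[Feature, float] = {
    ("ées", "ed"): 8.0,
    ("kos", "c"): 3.3,
    ("os$", "y$"): -2.5,
    ("mos", "s"): -5.1,
}

# ASSUMPTION: hand-listed active features, one per pair, taken from the
# Example column of Table 7.4. A real system would derive many features
# per pair from a character alignment.
ACTIVE: Dict[Pair, List[Feature]] = {
    ("vérifiées", "verified"): [("ées", "ed")],
    ("anarchikos", "anarchic"): [("kos", "c")],
    ("anarchikos", "anarchy"): [("os$", "y$")],
    ("toleramos", "tolerates"): [("mos", "s")],
}


def score(features: Iterable[Feature],
          weights: Dict[Feature, float] = WEIGHTS) -> float:
    """Linear score: sum the weights of the active features.

    Features without a learned weight contribute nothing.
    """
    return sum(weights.get(f, 0.0) for f in features)


def rank(active: Dict[Pair, List[Feature]]) -> List[Pair]:
    """Order word pairs from highest to lowest score."""
    return sorted(active, key=lambda pair: score(active[pair]), reverse=True)


def confident_errors(scored: Iterable[Tuple[Pair, float, bool]],
                     threshold: float) -> List[Pair]:
    """Return pairs scored confidently but labeled the opposite way.

    Each item is (pair, score, is_cognate). `threshold` is a hypothetical
    cut-off for what counts as a confident score.
    """
    errors = []
    for pair, s, is_cognate in scored:
        confident = abs(s) >= threshold
        predicted_cognate = s > 0
        if confident and predicted_cognate != is_cognate:
            errors.append(pair)
    return errors
```

In this sketch, negative weights such as os$-y$ pull a pair's score down, which is how anarchikos:anarchy ends up ranked below anarchikos:anarchic even though both share a long common prefix.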
