Large-Scale Semi-Supervised Learning for Natural Language ...
“Natural Automatic Examples,” as described in Chapter 2, Section 2.5.4. In analysis, we disambiguate semantic labels, such as part-of-speech tags, representing abstract properties of surface words. For these problems, we have historically needed manually-labeled data. For generation tasks, a model of each candidate’s distribution in text is created. The models indicate which usage best fits each context, enabling candidate disambiguation in tasks such as spelling correction [Golding and Roth, 1999], preposition selection [Chodorow et al., 2007; Felice and Pulman, 2007], and diacritic restoration [Yarowsky, 1994]. The models can be large-scale classifiers or standard N-gram language models (LMs).

An N-gram is a sequence of words. A unigram is one word, a bigram is two words, a trigram is three words, and so on. An N-gram language model computes the probability of a sentence as the product of the probabilities of the N-grams in the sentence, where each word is conditioned on the N-1 words that precede it. The maximum-likelihood estimate of this probability is the count of the N-gram divided by the count of its (N-1)-word prefix in the corpus. Higher-probability sentences will thus be composed of N-grams that are more frequent. For resolving confusable words, we could select the candidate that results in the highest whole-sentence probability, effectively combining the counts of N-grams at different positions.

The power of an N-gram language model crucially depends on the data from which the counts are taken: the more data, the better. Trigram LMs have long been used for spelling correction, an approach sometimes referred to as the Mays, Damerau, and Mercer model [Wilcox-O’Hearn et al., 2008]. Gamon et al. [2008] use a Gigaword 5-gram LM for preposition selection. While web-scale LMs have proved useful for machine translation [Brants et al., 2007], most web-scale disambiguation approaches compare specific sequence counts rather than full-sentence probabilities. Counts are usually gathered using an Internet search engine [Lapata and Keller, 2005; Yi et al., 2008].

In analysis problems such as part-of-speech tagging, it is not as obvious how an LM can be used to score the candidates, since LMs do not contain the candidates themselves, only surface words. However, large LMs can also benefit these applications, provided there are surface words that correlate with the semantic labels. Essentially, we devise some surrogates for each label, and determine the likelihood of these surrogates occurring with the given context. For example, Mihalcea and Moldovan [1999] perform sense disambiguation by creating label surrogates from similar-word lists for each sense. To choose the sense of bass in the phrase “caught a huge bass,” we might consider tenor, alto, and pitch for sense one and snapper, mackerel, and tuna for sense two. The sense whose group has the higher web-frequency count in bass’s context is chosen. Yu et al. [2007] use a similar approach to verify the near-synonymy of the words in the sense pools of the OntoNotes project [Hovy et al., 2006]. They check whether a word can be substituted in place of another element in its sense pool, using a few sentences where the sense pool of the original element has been annotated. The substitution likelihood is computed using the counts of N-grams of various orders from the Google web-scale N-gram corpus (discussed in the following subsection).
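As a rough sketch of this surrogate-counting idea (not code from the thesis or from the cited systems), the following Python fragment uses a hypothetical get_count lookup with invented counts standing in for web-frequency or Google N-gram queries; the choose_sense helper and all count values are purely illustrative:

```python
# Toy sketch of surrogate-based sense disambiguation via N-gram counts.
# get_count() stands in for a lookup in a web-scale N-gram corpus or a
# search-engine hit count; the numbers below are invented for illustration.

TOY_COUNTS = {
    "caught a huge tenor": 2,
    "caught a huge alto": 1,
    "caught a huge pitch": 40,
    "caught a huge snapper": 320,
    "caught a huge mackerel": 180,
    "caught a huge tuna": 2600,
}

def get_count(ngram):
    """Placeholder for a web-scale N-gram (or search-engine) count lookup."""
    return TOY_COUNTS.get(ngram, 0)

def choose_sense(context, senses):
    """Pick the sense whose surrogate words are most frequent in the context.

    `context` is a pattern with a slot for the target word, e.g.
    "caught a huge ___"; `senses` maps sense labels to surrogate word lists.
    """
    scores = {}
    for sense, surrogates in senses.items():
        scores[sense] = sum(get_count(context.replace("___", w))
                            for w in surrogates)
    return max(scores, key=scores.get)

senses_of_bass = {
    "bass/music": ["tenor", "alto", "pitch"],
    "bass/fish": ["snapper", "mackerel", "tuna"],
}
print(choose_sense("caught a huge ___", senses_of_bass))
# -> "bass/fish" with these toy counts
```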
We build on similar ideas in our unified view of analysis and generation disambiguation problems (Section 3.3). For generation problems, we gather counts for each surface candidate filling our 2-to-5-gram patterns. For analysis problems, we use surrogates as the fillers. We collect our pattern counts from a web-scale corpus.
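To make the pattern-count idea concrete, here is a minimal sketch, assuming a hypothetical web_count lookup (with invented counts) in place of a real web-scale N-gram corpus; the exact patterns and weighting used in this thesis are described in Section 3.3, so score_candidate and choose below are illustrative only:

```python
# Illustrative sketch: score candidates (surface words for generation tasks,
# surrogates for analysis tasks) by the summed counts of every 2-to-5-gram
# pattern that spans the target position. web_count() is a stand-in for a
# lookup into a web-scale N-gram corpus; the counts below are invented.

TOY_COUNTS = {
    "sat at": 120, "we sat at": 40, "sat at the": 85, "at the table": 400,
    "we sat at the": 30, "sat at the table": 55, "we sat at the table": 20,
    "sat on": 300, "we sat on": 90, "sat on the": 210, "on the table": 900,
    "we sat on the": 70, "sat on the table": 75, "we sat on the table": 25,
    "sat in": 250, "we sat in": 60, "sat in the": 190, "in the table": 15,
    "we sat in the": 45, "sat in the table": 2, "we sat in the table": 1,
}

def web_count(tokens):
    """Placeholder lookup; a real system would query a web-scale N-gram corpus."""
    return TOY_COUNTS.get(" ".join(tokens).lower(), 0)

def score_candidate(tokens, position, candidate, min_n=2, max_n=5):
    """Sum the counts of every N-gram (N=2..5) containing `candidate` at `position`."""
    filled = tokens[:position] + [candidate] + tokens[position + 1:]
    total = 0
    for n in range(min_n, max_n + 1):
        # every window of length n that covers the target position
        for start in range(position - n + 1, position + 1):
            if start < 0 or start + n > len(filled):
                continue
            total += web_count(filled[start:start + n])
    return total

def choose(tokens, position, candidates):
    """Return the candidate with the largest summed pattern count."""
    return max(candidates, key=lambda c: score_candidate(tokens, position, c))

# Example (generation task): preposition selection for "We sat ___ the table".
print(choose("We sat ___ the table".split(), 2, ["at", "on", "in"]))
# -> "on" with these toy counts
```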
