One common disambiguation task is the identification of word-choice errors in text. A language checker can flag an error if a confusable alternative better fits a given context:

(1) The system tried to decide {among, between} the two confusable words.

Most NLP systems resolve such ambiguity with the help of a large corpus of text. The corpus indicates which candidate is more frequent in similar contexts. The larger the corpus, the more accurate the disambiguation [Banko and Brill, 2001]. Since few corpora are as large as the world wide web,¹ many systems incorporate web counts into their selection process.

For the above example, a typical web-based system would query a search engine with the sequences “decide among the” and “decide between the” and select the candidate that returns the most pages [Lapata and Keller, 2005]. Clearly, this approach fails when more context is needed for disambiguation.
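To make this baseline concrete, here is a minimal sketch of single-pattern selection. The get_count function and its toy counts are hypothetical stand-ins for a search-engine hit count or a pre-computed N-gram-corpus lookup; they are illustrative only, not part of the systems described in this chapter.

```python
# Minimal sketch of the single-pattern baseline [Lapata and Keller, 2005]:
# fill each candidate into one short context and keep the most frequent.

# Illustrative counts only, not real web frequencies.
TOY_COUNTS = {
    "decide among the": 1_800,
    "decide between the": 153_000,
}

def get_count(pattern: str) -> int:
    """Frequency of a space-separated pattern (assumed lookup)."""
    return TOY_COUNTS.get(pattern, 0)

def baseline_select(left: str, candidates: list[str], right: str) -> str:
    """Return the candidate c maximizing count('left c right')."""
    return max(candidates, key=lambda c: get_count(f"{left} {c} {right}"))

print(baseline_select("decide", ["among", "between"], "the"))  # -> between
```

Because only the immediate neighbours “decide” and “the” are consulted, any evidence outside this three-word window is ignored, which is exactly the failure mode noted above.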

We propose a unified view of using web-scale data for lexical disambiguation. Rather than using a single context sequence, we use contexts of various lengths and positions. There are five 5-grams, four 4-grams, three trigrams, and two bigrams spanning the target word in Example (1). We gather counts for each of these sequences, with each candidate in the target position. We first show how the counts can be used as features in a supervised classifier, with a count’s contribution weighted by its context’s size and position. We also propose a novel unsupervised system that simply sums a subset of the (log) counts for each candidate. Surprisingly, this system achieves most of the gains of the supervised approach without requiring any training data.
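The sketch below illustrates this unified view: it enumerates every N-gram spanning the target position, exposes the log counts as features indexed by context size and position (the supervised view), and sums the log counts over all spanning contexts to rank candidates (the unsupervised view). The get_count lookup, the restriction to orders 2 through 5, and the add-one smoothing are assumptions of the sketch, not details fixed by the text above.

```python
import math

def spanning_ngrams(tokens: list[str], i: int, n: int) -> list[list[str]]:
    """All n-grams of `tokens` that include position i (the target word)."""
    starts = range(max(0, i - n + 1), min(i, len(tokens) - n) + 1)
    return [tokens[s:s + n] for s in starts]

def count_features(tokens, i, candidate, get_count, orders=(2, 3, 4, 5)):
    """Supervised view: one log-count feature per (candidate, size, position)
    context, so a classifier can learn a separate weight for each
    context's size and position."""
    toks = tokens[:i] + [candidate] + tokens[i + 1:]
    return {(candidate, n, j): math.log(get_count(" ".join(g)) + 1)
            for n in orders
            for j, g in enumerate(spanning_ngrams(toks, i, n))}

def sum_select(tokens, i, candidates, get_count, orders=(2, 3, 4, 5)):
    """Unsupervised view: pick the candidate with the highest sum of
    log counts over all spanning n-grams."""
    def score(c):
        toks = tokens[:i] + [c] + tokens[i + 1:]
        return sum(math.log(get_count(" ".join(g)) + 1)
                   for n in orders for g in spanning_ngrams(toks, i, n))
    return max(candidates, key=score)

# e.g. sum_select("The system tried to decide X the two confusable words"
#                 .split(), 5, ["among", "between"], get_count)
```

On Example (1), spanning_ngrams yields exactly the five 5-grams, four 4-grams, three trigrams, and two bigrams described above.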

Since we make use of features derived from the distribution of patterns in large amounts of unlabeled data, this work is an instance of a semi-supervised approach in the category “Using Features from Unlabeled Data,” discussed in Chapter 2, Section 2.5.5.

In Section 3.2, we discuss the range of problems that fit the lexical disambiguation framework, and also discuss previous work using the web as a corpus. In Section 3.3 we discuss our general disambiguation methodology. While all disambiguation problems can be tackled in a common framework, most approaches are developed for a specific task. Like Roth [1998] and Cucerzan and Yarowsky [2002], we take a unified view of disambiguation, and apply our systems to preposition selection (Section 3.5), spelling correction (Section 3.6), and non-referential pronoun detection (Section 3.7). In particular, we spend a fair amount of time on non-referential pronoun detection. On each of these applications, our systems outperform traditional web-scale approaches.

3.2 Related Work

3.2.1 Lexical Disambiguation

Yarowsky [1994] defines lexical disambiguation as a task where a system must “disambiguate two or more semantically distinct word-forms which have been conflated into the same representation in some medium.” Lapata and Keller [2005] divide disambiguation problems into two groups: generation and analysis. In generation, the confusable candidates are actual words, like among and between. Generation problems permit learning with

¹ Google recently announced they are now indexing over 1 trillion unique URLs (http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html). This figure represents a staggering amount of textual data.