…examples from unlabeled data in order to train better models of selectional preference (Chapter 6) and string similarity (Chapter 7). The discriminative classifiers trained from this data exploited several novel sources of information, including character-level (string and capitalization) features for selectional preferences, and features derived from a character-based sequence alignment for discriminative string similarity. While automatic example generation was not applied to web-scale unlabeled data in this work, it promises to scale easily to web-scale text. For example, after summarizing web-scale data with N-gram statistics, we can create examples using only several gigabytes of compressed N-gram text, rather than using the petabytes of raw web text directly. Automatic example generation from aggregate statistics therefore promises both better scaling and cleaner data, since aggregate statistics naturally exclude phenomena that occur purely due to chance (a small code sketch of this idea appears at the end of this section).

The key methods of parts one and two are, of course, compatible in another way: it would be straightforward to use the output of the pseudo-trained models as features in supervised systems. This is similar to the approach of Ando and Zhang [2005], and, in fact, was pursued in some of our concurrent work [Bergsma et al., 2009a], with good results.

8.2 The Impact of this Work

We hope the straightforward but effective techniques presented in this dissertation will help promote simple, scalable semi-supervised learning as a future paradigm for NLP research. We advocate such a direction for several reasons.

First, only via machine learning can we combine the millions of parameters that interact in natural language processing. Second, only by leveraging unlabeled data can we go beyond the limited models that can be learned from small, hand-annotated training sets. Furthermore, it is highly advantageous to have an NLP system that both benefits from unlabeled data and can readily take advantage of even more unlabeled data when it becomes available. Both the volume of text on the web and the power of computer architecture continue to grow exponentially over time. Systems that use unlabeled data will therefore improve automatically over time, without any special annotation, research, or engineering effort. For example, in [Pitler et al., 2010], we presented a parser whose performance improves logarithmically with the number of unique N-grams in a web-scale N-gram corpus. A useful direction for future work would be to identify other problems that can benefit from web-scale volumes of unlabeled data, thereby enabling an even greater proportion of NLP systems to improve automatically over time.

The following section describes some specific directions for future work, and notes some tasks where web-scale data might be productively exploited.

Once we find out, for a range of tasks, just how far we can get with big data and machine learning alone, we will have a better handle on what other sources of linguistic knowledge might be needed. For example, we can now reach around 75% accuracy on preposition selection using N-grams alone (Section 3.5).
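As a concrete illustration of that count-based approach, the following is a minimal sketch of preposition selection by N-gram lookup. It is a simplification, not the models evaluated in Section 3.5; the candidate list, the toy count table, and the fallback behavior are assumptions introduced purely for illustration.

# Minimal sketch: choose the preposition whose filled-in pattern is most
# frequent in an N-gram collection. The counts below are toy stand-ins
# for web-scale statistics, not real data.

CANDIDATES = ["in", "on", "at", "for", "of", "with", "to", "by"]

# Hypothetical N-gram -> count table.
NGRAM_COUNTS = {
    "interested in mathematics": 90000,
    "interested on mathematics": 40,
    "depends on the": 500000,
    "depends of the": 300,
}

def select_preposition(left, right, candidates=CANDIDATES):
    """Fill the preposition slot with each candidate, score it by the count
    of the resulting N-gram, and return the best-scoring preposition
    (or None if no filled-in pattern was ever observed)."""
    scores = {p: NGRAM_COUNTS.get(f"{left} {p} {right}", 0) for p in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(select_preposition("interested", "mathematics"))  # -> "in"
print(select_preposition("depends", "the"))             # -> "on"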
To correct preposition errors with even higher accuracy, we needed to exploit knowledge of the speaker’s native language (and thus their likely preposition confusions), reaching above 95% accuracy in this manner, at the cost of a small but arguably reasonable sacrifice in coverage. It is unlikely that N-gram data alone would ever allow us to select the correct preposition in phrases like, “I like to swim before/after school.” Similarly, we argued that to perform even better on non-referential pronoun detection (Section 3.7), we will need to pay attention to wider segments of discourse.
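Finally, to make the earlier point about example generation from aggregate statistics concrete (the sketch promised in the summary above): the following toy code builds pseudo-positive and pseudo-negative (verb, object) training pairs for selectional preference from a table of counts rather than from raw text. The count table, the frequency threshold, and the random re-pairing scheme are illustrative assumptions, not the exact procedure of Chapter 6.

import random

# Hypothetical aggregate statistics: (verb, object) -> count in a large corpus.
PAIR_COUNTS = {
    ("drink", "coffee"): 12000,
    ("drink", "water"): 45000,
    ("eat", "sandwich"): 8000,
    ("park", "car"): 15000,
}

MIN_COUNT = 100  # pairs at or above this count are treated as pseudo-positives

def generate_examples(pair_counts, negatives_per_positive=1, seed=0):
    """Build (verb, object, label) triples: frequently observed pairs become
    pseudo-positives; random verb/object re-pairings that were rarely or
    never observed become pseudo-negatives."""
    rng = random.Random(seed)
    verbs = sorted({v for v, _ in pair_counts})
    objects = sorted({o for _, o in pair_counts})
    positives = [(v, o) for (v, o), c in pair_counts.items() if c >= MIN_COUNT]
    # Candidate negatives: all re-pairings below the frequency threshold.
    candidates = [(v, o) for v in verbs for o in objects
                  if pair_counts.get((v, o), 0) < MIN_COUNT]
    negatives = rng.sample(candidates,
                           min(len(candidates),
                               negatives_per_positive * len(positives)))
    return ([(v, o, 1) for v, o in positives] +
            [(v, o, 0) for v, o in negatives])

for verb, obj, label in generate_examples(PAIR_COUNTS):
    print(f"{verb:>6} {obj:<10} label={label}")

In practice the same pattern applies at web scale: the counts come from a compressed N-gram collection, and the generated examples feed a discriminative classifier together with the character-level features described above.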
