3.2.2 Web-Scale Statistics in NLP

Exploiting the vast amount of data on the web is part of a growing trend in natural language processing [Keller and Lapata, 2003]. In this section, we focus on some research that has had a particular influence on our own work. We begin by discussing approaches that extract information using Internet search engines, before discussing recent approaches that have made use of the Google web-scale N-gram corpus.

There were initially three main avenues of research that used the web as a corpus; all were based on the use of Internet search engines.

In the first line of research, search-engine page counts are used as substitutes for counts of a phrase in a corpus [Grefenstette, 1999; Keller and Lapata, 2003; Chklovski and Pantel, 2004; Lapata and Keller, 2005]. That is, a phrase is issued to a search engine as a query, and the count, given by the search engine, of how many pages contain that query is taken as a substitute for the number of times that phrase occurs on the web. Quotation marks are placed around the phrase so that the words are only matched when they occur in their exact phrasal order. By using Internet-derived statistics, these approaches automatically benefit from the growing size and variety of documents on the world wide web. We previously used this approach to collect pattern counts that indicate the gender of noun phrases; this provided very useful information for an anaphora resolution system [Bergsma, 2005]. We also previously showed how a variety of search-engine counts can be used to improve the performance of search-engine query segmentation [Bergsma and Wang, 2007] (a problem closely related to Noun-Compound Bracketing, which we explore in Chapter 5).

In another line of work, search engines are used to assess how often a pair of words occurs on the same page (or how often they occur close to each other), irrespective of their order. Thus the page counts returned by a search engine are taken at face value as document co-occurrence counts. Applications in this area include determining the phrasal semantic orientation (good or bad) for sentiment analysis [Turney, 2002] and assessing the coherence of key phrases [Turney, 2003]; a brief sketch of this count-based methodology is given below, after the third line of research.

A third line of research involves issuing queries to a search engine and then making use of the returned documents. Resnik [1999] shows how the web can be used to gather bilingual text for machine translation, while Jones and Ghani [2000] use the web to build corpora for minority languages. Ravichandran and Hovy [2002] process returned web pages to identify answer patterns for question answering. In an answer-typing system, Pinchak and Bergsma [2007] use the web to find documents that provide information on unit types for how-questions. Many other question-answering systems use the web to assist in finding a correct answer to a question [Brill et al., 2001; Cucerzan and Agichtein, 2005; Radev et al., 2001]. Nakov and Hearst [2005a; 2005b] use search engines both to return counts for N-grams, and also to process the returned results to extract information not available from a search engine directly, such as punctuation and capitalization.
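The first two lines of research above both reduce to issuing (possibly quoted) queries and reading off the reported page counts. The following Python sketch illustrates how such counts might be combined into a pointwise-mutual-information score of the kind Turney [2002] uses for semantic orientation; the hit_count function is a hypothetical placeholder for whatever search-engine interface is available, and the query syntax and scoring details are illustrative rather than a reproduction of any published system.

    import math

    def hit_count(query):
        # Hypothetical stand-in for a search-engine API call; it should
        # return the number of pages the engine reports for the query.
        raise NotImplementedError("plug in a search-engine API here")

    def phrase_count(phrase):
        # Exact-phrase count: quotation marks restrict the match to the
        # exact word order, and the resulting page count stands in for
        # the phrase's frequency on the web.
        return hit_count('"%s"' % phrase)

    def pmi(phrase, word):
        # Pointwise mutual information estimated from page counts, up to
        # an additive constant log(N) (the total number of pages) that
        # cancels in the difference taken below.  Add-one smoothing
        # avoids taking the log of zero.
        joint = hit_count('"%s" %s' % (phrase, word))  # pages with both
        return (math.log(joint + 1)
                - math.log(phrase_count(phrase) + 1)
                - math.log(hit_count(word) + 1))

    def semantic_orientation(phrase):
        # Turney-style orientation: positive if the phrase co-occurs more
        # with "excellent" than with "poor".  (Turney [2002] used a
        # proximity operator rather than whole-page co-occurrence, and a
        # base-2 logarithm.)
        return pmi(phrase, 'excellent') - pmi(phrase, 'poor')

Because every call to hit_count is a network query against a commercial service, even this small scoring function already runs into the practical problems discussed next.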
While a lot of progress has been made using search engines to extract web-scale statistics, there are many fundamental issues with this approach. First of all, since the web changes every day, the results obtained using a search engine are not exactly reproducible. Secondly, some have questioned the reliability of search-engine page counts [Kilgarriff, 2007]. Most importantly, using search engines to extract count information is terribly inefficient, and thus search engines restrict the number of queries one can issue to gather web-scale information. With limited queries, we can only use limited information in our systems.

A solution to these issues was enabled by Thorsten Brants and Alex Franz at Google when they released the Google Web 1T 5-gram Corpus Version 1.1 in 2006 [Brants and Franz, 2006].
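Because the corpus distributes pre-computed N-gram counts as static files rather than behind a query-limited interface, count lookups become reproducible and essentially unlimited. As a minimal sketch, assuming the familiar plain-text format of one tab-separated N-gram and count per line (and ignoring the fact that the actual corpus is split across many large compressed files), such counts could be loaded and used as follows; the adjacency-style comparison at the end is only an illustration, not the bracketing method developed later in this thesis.

    def load_ngram_counts(path, vocab=None):
        # Read "n-gram<TAB>count" lines into a dictionary.  Passing a
        # vocabulary and keeping only N-grams whose words all occur in it
        # is one way to cope with a corpus far too large to hold in memory.
        counts = {}
        with open(path, encoding='utf-8') as f:
            for line in f:
                ngram, count = line.rstrip('\n').split('\t')
                if vocab is not None and any(w not in vocab for w in ngram.split()):
                    continue
                counts[ngram] = int(count)
        return counts

    def prefer_left_bracketing(bigram_counts, w1, w2, w3):
        # Toy adjacency-style test for a three-word noun compound: bracket
        # as [[w1 w2] w3] when the bigram "w1 w2" is at least as frequent
        # as "w2 w3", and as [w1 [w2 w3]] otherwise.
        left = bigram_counts.get('%s %s' % (w1, w2), 0)
        right = bigram_counts.get('%s %s' % (w2, w3), 0)
        return left >= right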
