actual sentence said, “My son’s simple trophy won their hearts,” then we should guess VBD for won, regardless of the global distribution of trophy won. Of course, we let the learning algorithm choose the relative weight on global vs. local information. In my experience, when good local features are available, the learning algorithm will usually put most of the weight on them, as it finds these features to be statistically more reliable. So we must lower our expectations for the possible benefits of purely distributional information. When other good sources of information are already available locally, the effect of global information is diminished. Section 5.6 presents some experimental results on VBN-VBD disambiguation and discusses this point further.

Using N-grams for Learning from Unlabeled Data

In our work, we make use of aggregate counts over a large corpus; we don’t inspect the individual instances of each phrase. That is, we do not separately process the 4868 sentences where “won the trophy” occurs on the web; rather, we use the N-gram, won the trophy, and its count, 4868, as a single unit of information. We do this mainly because it is computationally inefficient to process all the instances (that is, the entire web), and very good inferences can be drawn from the aggregate statistics alone. Chapter 2 describes a range of alternative methods for exploiting unlabeled data; many of these cannot scale to web-scale text.
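To make the “single unit of information” idea concrete, the sketch below tags an ambiguous verb as VBD or VBN by comparing pooled counts of disambiguating contexts. This is a minimal illustration, not the feature set used in this thesis: the pattern choices are simplistic, and every count except the 4868 cited above is a made-up stand-in for a lookup into a web-scale N-gram corpus.

# Sketch: drawing an inference from aggregate N-gram counts,
# never from the individual instances behind them.
# All counts except "won the trophy": 4868 are hypothetical.
NGRAM_COUNTS = {
    "won the trophy": 4868,
    "they won the trophy": 310,  # subject pronoun before the verb: VBD-like
    "he won the trophy": 255,
    "has won the trophy": 120,   # perfect auxiliary before the verb: VBN-like
    "was won the trophy": 1,     # noisy patterns are expected at web scale
}

VBD_PATTERNS = ["they {p}", "he {p}"]  # simple-past contexts
VBN_PATTERNS = ["has {p}", "was {p}"]  # past-participle contexts

def count(phrase: str) -> int:
    """Aggregate count lookup; a real system would query an N-gram corpus."""
    return NGRAM_COUNTS.get(phrase, 0)

def guess_tag(phrase: str) -> str:
    """Pool counts of VBD-style vs. VBN-style contexts around the phrase."""
    vbd = sum(count(pat.format(p=phrase)) for pat in VBD_PATTERNS)
    vbn = sum(count(pat.format(p=phrase)) for pat in VBN_PATTERNS)
    return "VBD" if vbd >= vbn else "VBN"

print(guess_tag("won the trophy"))  # -> VBD under these toy counts

In a full system, such pooled counts would enter a learned classifier as features alongside the local contextual features discussed above, with the learner setting their relative weights.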
1.4 A Perspective on Statistical vs. Linguistic Approaches

When reading any document, it can be useful to think about the author’s perspective. Sometimes, once we establish the author’s perspective, we might also establish that the document is not worth reading any further. This might happen, for example, if the author’s perspective is completely at odds with our own, or if it seems likely that the author’s perspective will prevent them from viewing evidence objectively.

Surely, some readers of this document are also wondering about the perspective of its author. Does he approach language from a purely statistical viewpoint, or is he interested in linguistics itself? The answer: although I certainly advocate the use of statistical methods and huge volumes of data, I am mostly interested in how these resources can help with real linguistic phenomena. I agree that linguistics has an essential role to play in the future of NLP [Jelinek, 2005; Hajič and Hajičová, 2007]. I aim to be aware of the knowledge of linguists, and I try to think about where this knowledge might apply in my own work. I try to gain insight into problems by annotating data myself. When I tackle a particular linguistic phenomenon, I try to think about how that phenomenon serves human communication and thought, how it may work differently in written or spoken language, how it may work differently across human languages, and how a particular computational representation may be inadequate. By doing these things, I hope not only to produce more interesting and insightful research, but to produce systems that work better. For example, while a search on Google Scholar reveals a number of papers proposing “language independent” approaches to tasks such as named-entity recognition, parsing, grapheme-to-phoneme conversion, and information retrieval, it is my experience that approaches that pay attention to language-specific issues tend to work better (e.g., in transliteration [Jiampojamarn et al., 2010]). In fact, exploiting linguistic knowledge can even help the Google statistical translation system [Xu et al., 2009] – a system often mentioned as an example of a purely data-driven NLP approach.
