Large-Scale Semi-Supervised Learning for Natural Language ...

On the other hand, mapping language to meaning is a very hard task, and statistical tools help a lot too. It does not seem likely that we will solve the problems of NLP any time soon. Machine learning allows us to make very good predictions (in the face of uncertainty) by combining multiple, individually inadequate sources of evidence. Furthermore, it is empirically very effective to make predictions based on something previously observed (say, on the web), rather than trying to interpret everything purely on the basis of a very rich linguistic (or multi-modal) model. The observations that we rely on can sometimes be subtle (as in the verb tagging example from Section 1.3) and sometimes obvious (e.g., just count which preposition occurs most frequently in a given context, Section 3.5). Crucially, even if our systems do not really model the underlying linguistic (and other mental) processes,⁴ such predictions may still be quite useful for real applications (e.g., in speech, machine translation, writing aids, information retrieval, etc.). Finally, once we understand what can be solved trivially with big data and machine learning, it might better help us focus our attention on the appropriate deeper linguistic issues; i.e., the long tail of linguistic behaviour predicted by Zipf's law. Of course, we need to be aware of the limitations of N-gram models and big data, because, as Mark Steedman writes [Steedman, 2008]:

"One day, either because of the demise of Moore's law, or simply because we have done all the easy stuff, the Long Tail will come back to haunt us."

Not long ago, many in our community were dismissive of applying large volumes of data and machine learning to linguistic problems at all. For example, IBM's first paper on statistical machine translation was met with a famously negative (anonymous) review (1988) (quoted in [Jelinek, 2009]):

"The crude force of computers is not science. The paper is simply beyond the scope of COLING."

Of course, statistical approaches are now clearly dominant in NLP (see Section 2.1). In fact, what is interesting about the field of NLP today is the growing concern that our field is now too empirical. These concerns even come from researchers who were the leaders of the shift to statistical methods. For example, an upcoming talk at COLING 2010 by Ken Church and Mark Johnson discusses the topic, "The Pendulum has swung too far. The revival of empiricism in the 1990s was an exciting time. But now there is no longer much room for anything else."⁵ Richard Sproat adds:⁶

"... the field [of computational linguistics] has devolved in large measure into a group of technicians who are more interested in tweaking the techniques than in the problems they are applied to; who are far more impressed by a clever new ML approach to an old problem than by the application of known techniques to a new problem."

Although my own interests lie both in understanding linguistic problems and in "tweaking" ML techniques, I don't think everyone needs to approach NLP the same way. We need

⁴ Our models obviously do not reflect real human cognition, since humans do not have access to the trillions of pages of data that we use to train our models. The main objective of this dissertation is to investigate what kinds of useful and scientifically interesting things we can do with computers. In general, my research aims to exploit models of human linguistic processing where possible, as opposed to trying to replicate them.
⁵ http://nlp.stanford.edu/coling10/full-program.html#ring
⁶ http://www.cslu.ogi.edu/~sproatr/newindex/ncfom.html
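The "obvious" kind of observation mentioned above (count which preposition occurs most frequently in a given context, Section 3.5) can be illustrated with a toy count-and-argmax sketch. The candidate set, corpus, and function name below are illustrative assumptions, not the dissertation's actual data or method:

```python
from collections import Counter

# Hypothetical candidate prepositions; a real system would use a fuller set.
CANDIDATES = {"in", "on", "at", "with", "for"}

def choose_preposition(left, right, corpus):
    """Return the candidate preposition observed most often between
    the words `left` and `right` in the (toy) corpus."""
    tokens = corpus.split()
    counts = Counter()
    for i in range(1, len(tokens) - 1):
        if (tokens[i - 1] == left and tokens[i + 1] == right
                and tokens[i] in CANDIDATES):
            counts[tokens[i]] += 1
    if not counts:
        return None  # context never seen; back off in a real system
    return counts.most_common(1)[0][0]

# Toy stand-in for web-scale text:
corpus = ("alice is interested in the topic . "
          "bob was interested in the job . "
          "carol seemed interested at the time")

print(choose_preposition("interested", "the", corpus))  # → in
```

At web scale the same idea would query precomputed N-gram counts rather than scan raw text, but the decision rule is the same: pick the filler the data has seen most often in that slot.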
