so within some branches of psychology, linguistics and artificial intelligence even today. Manning and Schütze believe that

“much of the skepticism towards probabilistic models for language (and cognition in general) stems from the fact that the well-known early probabilistic models (developed in the 1940s and 1950s) are extremely simplistic. Because these simplistic models clearly do not do justice to the complexity of human language, it is easy to view probabilistic models in general as inadequate.”

The stochastic paradigm became much more influential again after the 1970s and early 1980s, when N-gram models were successfully applied to speech recognition by the IBM Thomas J. Watson Research Center [Jelinek, 1976; Bahl et al., 1983] and by James Baker at Carnegie Mellon University [Baker, 1975]. Previous efforts in speech recognition had been rather “ad hoc and fragile, and were demonstrated on only a few specially selected examples” [Russell and Norvig, 2003]. The work by Jelinek and others soon made it apparent that data-driven approaches simply work better. As Hajič and Hajičová [2007] summarize:

“[The] IBM Research group under Fred Jelinek’s leadership realized (and experimentally showed) that linguistic rules and Artificial Intelligence techniques had inferior results even when compared to very simplistic statistical techniques. This was first demonstrated on phonetic baseforms in the acoustic model for a speech recognition system, but later it became apparent that this can be safely assumed almost for every other problem in the field (e.g., Jelinek [1976]). Statistical learning mechanisms were apparently and clearly superior to any human-designed rules, especially those using any preference system, since humans are notoriously bad at estimating quantitative characteristics in a system with many parameters (such as a natural language).”

Probabilistic and machine learning techniques such as decision trees, clustering, EM, and maximum entropy gradually became the foundation of speech processing [Fung and Roth, 2005]. The successes in speech then inspired a range of empirical approaches to natural language processing. Simple statistical techniques were soon applied to part-of-speech tagging, parsing, machine translation, word-sense disambiguation, and a range of other NLP tasks. While there was only one statistical paper at the ACL conference in 1990, virtually all papers in ACL today employ statistical techniques [Hajič and Hajičová, 2007].

Of course, the fact that statistical techniques currently work better is only partly responsible for their rise to prominence. There was a fairly large gap in time between their proven performance on speech recognition and their widespread acceptance in NLP. Advances in computer technology and the greater availability of data resources also played a role. According to Church and Mercer [1993]:

“Back in the 1970s, the more data-intensive methods were probably beyond the means of many researchers, especially those working in universities... Fortunately, as a result of improvements in computer technology and the increasing availability of data due to numerous data collection efforts, the data-intensive methods are no longer restricted to those working in affluent industrial laboratories.”

Two other important developments were the practical application and commercialization of NLP algorithms and the emphasis that was placed on empirical evaluation. A greater
