precise make-up and genre of the training text, limiting generalizability of the results and the reach of the annotation effort. Second, in modeling aspects of human language acquisition, the role of supervision in learning must be carefully considered, given that children are not provided explicit indications of linguistic distinctions, and generally do not attend to explicit correction of their errors. Moreover, batch methods, even in an unsupervised setting, cannot model the actual online processes of child learning, which show gradual development of linguistic knowledge and competence.”

Theoretical motivations aside, the practical benefit of this line of research is essentially to have the high performance and flexibility of discriminatively-trained systems, without the cost of labeling huge numbers of examples. One can always label more examples to achieve better performance on a particular task and domain, but the expense can be severe. Even companies with great resources, like Google and Microsoft, prefer solutions that do not require paying annotators to create labeled data. This is because any cost of annotation would have to be repeated in each language and potentially each domain in which the system might be deployed (because of the dependence on the “precise make-up and genre of the training text” mentioned above). While some annotation jobs can be shipped to overseas annotators at relatively low cost, finding annotation experts in many languages and domains might be more difficult.⁵ Furthermore, after initial results, if the objective of the program is changed slightly, then new data would have to be annotated once again. Not only is this expensive, but it slows down the product development cycle. Finally, for many companies and government organizations, data privacy and security concerns prevent the outsourcing of annotation altogether. All labeling must be done by expensive and overstretched internal analysts.

Of course, even when there are plentiful labeled examples and the problem is well-defined and unchanging, it may still boost performance to incorporate statistics from unlabeled data. We have recently seen impressive gains from using unlabeled evidence, even with large amounts of labeled data, for example in the work of Ando and Zhang [2005], Suzuki and Isozaki [2008], and Pitler et al. [2010].

In the remainder of this section, we briefly outline approaches to transductive learning, self-training, bootstrapping, learning with heuristically-labeled examples, and using features derived from unlabeled data. We focus on the work that best characterizes each area, simply noting in passing some research that does not fit cleanly into a particular category.

2.5.1 Transductive Learning

Transductive learning gives us a great opportunity to talk more about document classification (where it was perhaps most famously applied in [Joachims, 1999b]), but otherwise this approach does not seem to be widely used in NLP. Most learners operate in the inductive learning framework: you learn your model from the training set, and apply it to unseen data. In the transductive framework, on the other hand, you assume that, at learning time, you are given access to the test examples you wish to classify (but not their labels).

⁵ Another trend worth highlighting is work that leverages large numbers of cheap, non-expert annotations through online services such as Amazon’s Mechanical Turk [Snow et al., 2008]. This has been shown to work surprisingly well for a number of simple problems. Combining the benefits of non-expert annotations with the benefits of semi-supervised learning is a potentially rich area for future work.
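To make the transductive setting above concrete, here is a minimal sketch on a toy document-classification task, assuming scikit-learn is available. It uses graph-based label spreading as a stand-in for the transductive SVM of Joachims [1999b], and the documents, labels, and parameter choices are invented purely for illustration. The key point is only the setup: the unlabeled test documents are handed to the learner at training time, and the learner produces labels for exactly those documents.

```python
# Minimal transductive-learning sketch (illustrative only; the documents,
# labels, and parameter choices below are invented for this example).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

train_docs = ["stock prices fell sharply today", "the team won the match"]
train_labels = [0, 1]                     # 0 = finance, 1 = sports (toy classes)
test_docs = ["stock prices rose today", "the team lost the match"]

# The unlabeled test documents are available at learning time, so they are
# vectorized together with the labeled training documents.
X = TfidfVectorizer().fit_transform(train_docs + test_docs).toarray()
y = np.array(train_labels + [-1] * len(test_docs))  # -1 marks unlabeled points

model = LabelSpreading(kernel="knn", n_neighbors=2)
model.fit(X, y)  # the test examples participate in learning

# Labels inferred for exactly the test documents we were given up front.
print(model.transduction_[len(train_docs):])
```

An inductive learner, by contrast, would be fit on the labeled documents alone and only later asked to predict on whatever new data happens to arrive.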
