precise make-up and genre of the training text, limiting generalizability of the results and the reach of the annotation effort. Second, in modeling aspects of human language acquisition, the role of supervision in learning must be carefully considered, given that children are not provided explicit indications of linguistic distinctions, and generally do not attend to explicit correction of their errors. Moreover, batch methods, even in an unsupervised setting, cannot model the actual online processes of child learning, which show gradual development of linguistic knowledge and competence.”

Theoretical motivations aside, the practical benefit of this line of research is essentially to have the high performance and flexibility of discriminatively-trained systems, without the cost of labeling huge numbers of examples. One can always label more examples to achieve better performance on a particular task and domain, but the expense can be severe. Even companies with great resources, like Google and Microsoft, prefer solutions that do not require paying annotators to create labeled data. This is because any cost of annotation would have to be repeated in each language and potentially each domain in which the system might be deployed (because of the dependence on the “precise make-up and genre of the training text” mentioned above). While some annotation jobs can be shipped to overseas annotators at relatively low cost, finding annotation experts in many languages and domains might be more difficult.⁵ Furthermore, after initial results, if the objective of the program is changed slightly, then new data would have to be annotated once again. Not only is this expensive, but it slows down the product development cycle. Finally, for many companies and government organizations, data privacy and security concerns prevent the outsourcing of annotation altogether. All labeling must be done by expensive and overstretched internal analysts.

Of course, even when there are plentiful labeled examples and the problem is well-defined and unchanging, it may still boost performance to incorporate statistics from unlabeled data. We have recently seen impressive gains from using unlabeled evidence, even with large amounts of labeled data, for example in the work of Ando and Zhang [2005], Suzuki and Isozaki [2008], and Pitler et al. [2010].

In the remainder of this section, we briefly outline approaches to transductive learning, self-training, bootstrapping, learning with heuristically-labeled examples, and using features derived from unlabeled data. We focus on the work that best characterizes each area, simply noting in passing some research that does not fit cleanly into a particular category.

2.5.1 Transductive Learning

Transductive learning gives us a great opportunity to talk more about document classification (where it was perhaps most famously applied in [Joachims, 1999b]), but otherwise this approach does not seem to be widely used in NLP. Most learners operate in the inductive learning framework: you learn your model from the training set, and apply it to unseen data. In the transductive framework, on the other hand, you assume that, at learning time, you are given access to the test examples you wish to classify (but not their labels).

⁵ Another trend worth highlighting is work that leverages large numbers of cheap, non-expert annotations through online services such as Amazon’s Mechanical Turk [Snow et al., 2008]. This has been shown to work surprisingly well for a number of simple problems. Combining the benefits of non-expert annotations with the benefits of semi-supervised learning is a potentially rich area for future work.
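To make the transductive setting above concrete, here is a minimal sketch on a toy document-classification task, assuming scikit-learn is available. It uses graph-based label spreading as a stand-in for the transductive SVM of Joachims [1999b], and the documents, labels, and parameter choices are invented purely for illustration. The key point is only the setup: the unlabeled test documents are handed to the learner at training time, and the learner produces labels for exactly those documents.

```python
# Minimal transductive-learning sketch (illustrative only; the documents,
# labels, and parameter choices below are invented for this example).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

train_docs = ["stock prices fell sharply today", "the team won the match"]
train_labels = [0, 1]                     # 0 = finance, 1 = sports (toy classes)
test_docs = ["stock prices rose today", "the team lost the match"]

# The unlabeled test documents are available at learning time, so they are
# vectorized together with the labeled training documents.
X = TfidfVectorizer().fit_transform(train_docs + test_docs).toarray()
y = np.array(train_labels + [-1] * len(test_docs))  # -1 marks unlabeled points

model = LabelSpreading(kernel="knn", n_neighbors=2)
model.fit(X, y)  # the test examples participate in learning

# Labels inferred for exactly the test documents we were given up front.
print(model.transduction_[len(train_docs):])
```

An inductive learner, by contrast, would be fit on the labeled documents alone and only later asked to predict on whatever new data happens to arrive.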
