precise make-up and genre of the training text, limiting generalizability of the results and the reach of the annotation effort. Second, in modeling aspects of human language acquisition, the role of supervision in learning must be carefully considered, given that children are not provided explicit indications of linguistic distinctions, and generally do not attend to explicit correction of their errors. Moreover, batch methods, even in an unsupervised setting, cannot model the actual online processes of child learning, which show gradual development of linguistic knowledge and competence.”

Theoretical motivations aside, the practical benefit of this line of research is essentially to have the high performance and flexibility of discriminatively-trained systems, without the cost of labeling huge numbers of examples. One can always label more examples to achieve better performance on a particular task and domain, but the expense can be severe. Even companies with great resources, like Google and Microsoft, prefer solutions that do not require paying annotators to create labeled data. This is because any cost of annotation would have to be repeated in each language and potentially each domain in which the system might be deployed (because of the dependence on the “precise make-up and genre of the training text” mentioned above). While some annotation jobs can be shipped to cheap overseas annotators at relatively low cost, finding annotation experts in many languages and domains might be more difficult.⁵ Furthermore, after initial results, if the objective of the program is changed slightly, then new data would have to be annotated once again. Not only is this expensive, but it slows down the product development cycle. Finally, for many companies and government organizations, data privacy and security concerns prevent the outsourcing of annotation altogether.
All labeling must be done by expensive and overstretched internal analysts.

Of course, even when there are plentiful labeled examples and the problem is well-defined and unchanging, it may still boost performance to incorporate statistics from unlabeled data. We have recently seen impressive gains from using unlabeled evidence, even with large amounts of labeled data, for example in the work of Ando and Zhang [2005], Suzuki and Isozaki [2008], and Pitler et al. [2010].

In the remainder of this section, we briefly outline approaches to transductive learning, self-training, bootstrapping, learning with heuristically-labeled examples, and using features derived from unlabeled data. We focus on the work that best characterizes each area, simply noting in passing some research that does not fit cleanly into a particular category.

2.5.1 Transductive Learning

Transductive learning gives us a great opportunity to talk more about document classification (where it was perhaps most famously applied in [Joachims, 1999b]), but otherwise this approach does not seem to be widely used in NLP. Most learners operate in the inductive learning framework: you learn your model from the training set, and apply it to unseen data. In the transductive framework, on the other hand, you assume that, at learning time, you are given access to the test examples you wish to classify (but not their labels).

⁵ Another trend worth highlighting is work that leverages large numbers of cheap, non-expert annotations through online services such as Amazon’s Mechanical Turk [Snow et al., 2008]. This has been shown to work surprisingly well for a number of simple problems. Combining the benefits of non-expert annotations with the benefits of semi-supervised learning is a potentially rich area for future work.
Figure 2.2: Learning from labeled and unlabeled examples, from [Zhu, 2005]

Consider Figure 2.2. In the typical inductive set-up, we would design our classifier based purely on the labeled points for the two classes: the o’s and +’s. We would draw the best hyperplane to separate these labeled vectors. However, when we look at all the dots that do not have labels, we may wish to draw a different hyperplane. It appears that there are two clusters of data, one on the left and one on the right. Drawing a hyperplane down the middle would appear to be the optimum choice to separate the two classes. This is only apparent after inspecting unlabeled examples.

We can always train a classifier using both labeled and unlabeled examples in the transductive set-up, but then apply the classifier to unseen data in an inductive evaluation. So in some sense we can group other semi-supervised approaches that make use of labeled and unlabeled examples into this category (e.g. work by Wang et al. [2008]), even if they are not applied transductively per se.

There are many computational algorithms that can make use of unlabeled examples when learning the separating hyperplane. The intuition behind them is to say something like: of all combinations of possible labels on the unseen examples, find the overall best separating hyperplane. Thus, in some sense we pretend we know the labels on the unlabeled data, and use these labels to train our model via traditional supervised learning. In most semi-supervised algorithms, we either implicitly or explicitly generate labels for unlabeled data in a conceptually similar fashion, to (hopefully) enhance the data we use to train the classifier.

These approaches are not applicable to the problems that we wish to tackle in this dissertation, mainly due to practicality. We want to leverage huge volumes of unlabeled data: all the data on the web, if possible. Most transductive algorithms cannot scale to this many examples.
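The two-cluster intuition above can be made concrete in a few lines of code. The following is a minimal sketch, not a method proposed in this dissertation: it uses scikit-learn’s LabelSpreading (one of many graph-based semi-supervised learners; the choice of library, the synthetic data, and all parameter values are illustrative assumptions). Unlabeled points are marked with −1, and labels from a single labeled point per cluster are propagated through the unlabeled cluster structure.

```python
# Illustrative sketch (assumed setup, not from the thesis): two well-separated
# clusters, one labeled point each, and graph-based label propagation.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.RandomState(0)
# Left cluster (class 0) and right cluster (class 1), 50 points each.
left = rng.normal(loc=[-2.0, 0.0], scale=0.4, size=(50, 2))
right = rng.normal(loc=[2.0, 0.0], scale=0.4, size=(50, 2))
X = np.vstack([left, right])

y = np.full(100, -1)   # -1 marks an unlabeled example
y[0] = 0               # one labeled point in the left cluster
y[50] = 1              # one labeled point in the right cluster

model = LabelSpreading(kernel="rbf", gamma=1.0)
model.fit(X, y)

# transduction_ holds the labels inferred for every training point,
# including the 98 originally unlabeled ones.
pred = model.transduction_
print((pred[:50] == 0).mean(), (pred[50:] == 1).mean())
```

A purely inductive learner given only the two labeled points could draw almost any separating hyperplane; the unlabeled points are what pin the boundary to the gap between the clusters.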
Another potential problem is that for many NLP applications, the space of possible labels is simply too large to enumerate. For example, work in parsing aims to produce a tree indicating the syntactic relationships of the words in a sentence. Church and Patil [1982] show the number of possible binary trees increases with the Catalan numbers. For twenty-word sentences, there are billions of possible trees. We are currently exploring linguistically-motivated ways to perform a high-precision pruning of the output space for
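The Catalan growth is easy to verify directly. As a quick worked example (the function name is ours): a sentence of n words has C(n−1) distinct binary bracketings, where C(m) = (2m choose m)/(m+1) is the m-th Catalan number.

```python
# Counting binary parse trees of an n-word sentence via Catalan numbers.
from math import comb

def num_binary_trees(n_words: int) -> int:
    """Number of distinct binary bracketings of a sentence of n_words words:
    the Catalan number C(n_words - 1) = C(2m, m) / (m + 1), m = n_words - 1."""
    m = n_words - 1
    return comb(2 * m, m) // (m + 1)

for n in (5, 10, 20):
    print(n, num_binary_trees(n))
# 5-word sentences already have 14 trees; at 20 words the count is
# 1,767,263,190 -- roughly 1.8 billion possible trees.
```

Enumerating the label space per example, as the transductive intuition would require, is therefore hopeless for structured outputs of even modest size.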