algorithm). Yarowsky used it for word-sense disambiguation. He essentially showed that a bootstrapping approach can achieve performance comparable to fully supervised learning. An example from word-sense disambiguation will help illustrate: To disambiguate whether the noun bass is used in the fish sense or in the music sense, we can rely on just a few key contexts to identify unambiguous instances of the noun in text. Suppose we know that caught a bass means the fish sense of bass. Now, whenever we see caught a bass, we label that noun for the fish sense. This is the context-based view of the problem. The other view is a document-based view. It has been shown experimentally that all instances of a unique word type in a single document tend to share the same sense [Gale et al., 1992]. Once we have one instance of bass labeled, we can extend this classification to the other instances of bass in the same document using this second view. We can then re-learn our context-based classifier from these new examples and repeat the process in new documents and new contexts, until all the instances are labeled.
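To make the alternation between the two views concrete, the sketch below implements a toy version of this loop in Python. Everything in it is illustrative: the documents, the bigram-context representation, and the music-sense seed rule are invented here (only the caught a bass seed comes from the example above), and Yarowsky's actual algorithm learns a decision list over many context features rather than matching exact bigrams.

```python
# Toy two-view bootstrapping for the noun "bass", in the style of Yarowsky.
# Documents are token lists; a "context" is the bigram preceding "bass".
docs = [
    ["yesterday", "we", "caught", "a", "bass", "near", "the", "dock",
     "then", "the", "bass", "got", "away"],
    ["she", "plays", "the", "bass", "in", "a", "band",
     "her", "bass", "has", "four", "strings"],
]

# View 1 (context-based): rules mapping a preceding bigram to a sense.
# The fish seed is from the text above; the music seed is an invented extra.
context_rules = {("caught", "a"): "fish", ("plays", "the"): "music"}

def occurrences(doc):
    """Indices of "bass" with at least two preceding tokens of context."""
    return [i for i, tok in enumerate(doc) if tok == "bass" and i >= 2]

def document_sense(doc):
    """View 2 (document-based): commit to a sense for the whole document
    only if the context rules label it consistently, following the
    one-sense-per-document observation of Gale et al. [1992]."""
    senses = {context_rules[(doc[i - 2], doc[i - 1])]
              for i in occurrences(doc)
              if (doc[i - 2], doc[i - 1]) in context_rules}
    return senses.pop() if len(senses) == 1 else None

# Alternate the views: context rules label instances, the document view
# extends those labels, and the extended labels yield new context rules.
while True:
    new_rules = {}
    for doc in docs:
        sense = document_sense(doc)
        if sense is not None:
            for i in occurrences(doc):
                new_rules[(doc[i - 2], doc[i - 1])] = sense
    if all(rule in context_rules for rule in new_rules):
        break  # no new contexts learned; the process has converged
    context_rules.update(new_rules)

# The loop has now also learned ("then", "the") -> "fish" and
# ("band", "her") -> "music" from the newly labeled instances.
print(context_rules)
```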
Multi-view bootstrapping is also used in information extraction [Etzioni et al., 2005]. Collins and Singer [1999] and Cucerzan and Yarowsky [1999] apply bootstrapping to the task of named-entity recognition. Klementiev and Roth [2006] used bootstrapping to extract interlingual named entities. Our research has also been influenced by co-training-style weakly supervised algorithms used in coreference resolution [Ge et al., 1998; Harabagiu et al., 2001; Müller et al., 2002; Ng and Cardie, 2003b; 2003a; Bean and Riloff, 2004] and grammatical gender determination [Cucerzan and Yarowsky, 2003].

Bootstrapping from Seeds

A distinct line of bootstrapping research has also evolved in NLP, which we call Bootstrapping from Seeds. These approaches all involve starting with a small number of examples, building predictors from these examples, labeling more examples with the new predictors, and then repeating the process to build a large collection of information. While this research generally does not explicitly cast the tasks as exploiting orthogonal views of the data, it is instructive to describe these techniques from the multi-view perspective.

An early example is described by Hearst [1992]. Suppose we wish to find hypernyms in text. A hypernym is a relation between two things such that one thing is a sub-class of the other. It is sometimes known as the is-a relation. For example, a wound is-a type of injury, Ottawa is-a city, a Cadillac is-a car, etc. Suppose we see the words “Cadillacs and other cars...” in text. There are two separate sources of information in this example:

1. The string pair itself: Cadillac, car

2. The context: Xs and other Ys

We can perform bootstrapping in this framework as follows: First, we obtain a list of seed pairs of words, e.g. Cadillac/car, Ottawa/city, wound/injury, etc. Now, we create a predictor that will label examples as being hypernyms based purely on whether they occur in this seed set. We are thus only using the first view of the problem: the actual string pairs. We use this predictor to label a number of examples in actual text, e.g. “Cadillacs and other cars, cars such as Cadillacs, cars including Cadillacs, etc.” We then train a predictor for the other view of the problem: From all the labeled examples, we extract predictive contexts: “Xs and other Ys, Ys such as Xs, Ys including Xs, etc.” The contexts extracted in this view can now be used to extract more seeds, and the seeds can then be used to extract more contexts, etc., in an iterative fashion. Hearst described an early form of this algorithm.
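The seed-then-context alternation can likewise be sketched in a few lines of Python. Again everything is illustrative: the corpus sentences are invented, we start from a single seed pair (rather than a full seed list) so the growth is visible, plain string matching stands in for real tokenization and lemmatization, and the XSLOT/YSLOT placeholders are simply an ad hoc device for turning a matched span into a reusable pattern.

```python
import re

# Invented toy corpus; each sentence contains one hypernym-bearing context.
corpus = [
    "Cadillacs and other cars lined the street",
    "cars such as Porsches are expensive",
    "Porsches and other cars raced by",
    "cities including Ottawa grew quickly",
    "Ottawa and other cities held festivals",
]

seeds = {("Cadillacs", "cars")}   # view 1: known (hyponym, hypernym) pairs
patterns = set()                  # view 2: contexts such as "X and other Y"

def harvest_patterns(sentence, pairs):
    """View 1 -> view 2: abstract an occurrence of a known pair into a
    context pattern by replacing the pair with placeholder slots."""
    found = set()
    for hypo, hyper in pairs:
        i, j = sentence.find(hypo), sentence.find(hyper)
        if i < 0 or j < 0:
            continue
        span = sentence[min(i, j):max(i + len(hypo), j + len(hyper))]
        found.add(span.replace(hypo, "XSLOT").replace(hyper, "YSLOT"))
    return found

def apply_pattern(pattern, sentence):
    """View 2 -> view 1: turn a pattern back into a regex and extract a
    candidate (hyponym, hypernym) pair from the sentence."""
    regex = (re.escape(pattern)
             .replace("XSLOT", r"(?P<x>\w+)")
             .replace("YSLOT", r"(?P<y>\w+)"))
    m = re.search(regex, sentence)
    return (m.group("x"), m.group("y")) if m else None

# Alternate the views: pairs yield patterns, patterns yield new pairs,
# until a full pass over the corpus adds nothing new.
while True:
    for sentence in corpus:
        patterns |= harvest_patterns(sentence, seeds)
    new_pairs = set()
    for pattern in patterns:
        for sentence in corpus:
            pair = apply_pattern(pattern, sentence)
            if pair:
                new_pairs.add(pair)
    if new_pairs <= seeds:
        break
    seeds |= new_pairs

print(sorted(seeds))     # adds ('Ottawa', 'cities') and ('Porsches', 'cars')
print(sorted(patterns))  # "X and other Y", "Y such as X", "Y including X"
```

On this toy corpus the loop converges after two passes, having induced exactly the three context types named in the text, and the seed set has grown from one pair to three.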
