and heuristic approaches are now being called unsupervised, since they are not based on learning from labeled data. For example, in Chapter 1, Section 1.1, we discussed how a part-of-speech tagger could be based on linguistic rules. A rule-based tagger could in some sense be considered unsupervised, since a human presumably created the rules from intuition, not from labeled data. However, since the human probably looked at some data to come up with the rules (a textbook, maybe?), calling this unsupervised is a little misleading from a machine learning perspective. Most people would probably simply call this a "rule-based approach." In Chapter 3, we propose unsupervised systems for lexical disambiguation, where a designer need only specify the words that are correlated with the classes of interest, rather than label any training data. We also discuss previous approaches that use counts derived from Internet search engine results. These approaches have usually been unsupervised.

From a machine learning perspective, true unsupervised approaches are those that induce output structure from properties of the problem, with guidance from probabilistic models rather than human intuition. We can illustrate this concept most clearly with the example of document classification again. Suppose we know there are two classes: documents about sports, and documents that are not about sports. We can generate the feature vectors as discussed above, and then simply form two groups of vectors such that members of each group are close to each other (in terms of Euclidean distance) in N-dimensional space. New feature vectors can be assigned to whichever group, or cluster, they are closest to. The points closest to one cluster will be separated from the points closest to the other cluster by a hyperplane in N-dimensional space. Where there's a hyperplane, there's a corresponding linear classifier with a set of weights. So clustering can learn a linear classifier as well.
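As a concrete sketch of this idea, the following toy code clusters points with 2-means and then reads a linear classifier directly off the two centroids: under Euclidean distance, the boundary between the two clusters is the perpendicular-bisector hyperplane, which gives an explicit weight vector and bias. The function names and the tiny data set are my own, purely for illustration.

```python
import random

def kmeans_2(points, iters=20, seed=0):
    """Minimal 2-means clustering of tuples, using squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d = [sum((a - c) ** 2 for a, c in zip(p, cen)) for cen in centroids]
            groups[d.index(min(d))].append(p)
        # recompute each centroid as the mean of its group (keep old if empty)
        centroids = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def linear_classifier_from(centroids):
    """Nearest-centroid decisions are linear: the boundary is w.x + b = 0,
    the perpendicular bisector between the two centroids."""
    c0, c1 = centroids
    w = [b - a for a, b in zip(c0, c1)]                      # weight vector
    b = (sum(a * a for a in c0) - sum(a * a for a in c1)) / 2.0
    return w, b

def classify(w, b, x):
    """Label 1 if x falls on centroid 1's side of the hyperplane, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

Assigning a new feature vector to its nearest cluster and evaluating the sign of w.x + b give the same answer, which is the sense in which clustering "learns" a linear classifier.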
We don't know what the clusters represent, but hopefully one of them has all the sports documents (if we inspect the clusters and define one of them as the sports class, we're essentially doing a form of semi-supervised learning).

Clustering can also be regarded as an "exploratory science" that seeks to discover useful patterns and structures in data [Pantel, 2003]. This structure might later be exploited for other forms of language processing; later we will see how clustering can be used to provide helpful feature information for supervised classifiers (Section 2.5.5).

Clustering is the simplest unsupervised learning algorithm. In more complicated set-ups, we can define a probability model over our features (and possibly over other hidden variables), and then try to learn the parameters of the model such that our unlabeled data has a high likelihood under this model. We previously used such a technique to train a pronoun-resolution system using expectation maximization [Cherry and Bergsma, 2005]. Similar techniques can be used to train hidden Markov models and other generative models.

These models can provide a very nice way to incorporate lots of unlabeled data. In some sense, however, doing anything beyond an HMM requires one to be a bit of a probabilistic-modeling guru. The more features you incorporate in the model, the more you have to account for the interdependence of these features explicitly in your model. Some assumptions you make may not be valid and may impair performance. It's hard to know exactly what's wrong with your model, and how to change it to make it better. Also, when setting the parameters of your model using clustering or expectation maximization, you might reach only a local optimum, from which the algorithm can proceed no further to better settings under your model (and you have no idea you've reached this point).
But since these algorithms are not optimizing discriminative performance anyway, it's not clear you want the global maximum even if you can find it.
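To make the likelihood-maximization idea concrete, here is a toy sketch of expectation maximization for a one-dimensional mixture of two fixed-variance Gaussians. The function name, the fixed-variance simplification, and the data are invented for illustration; real models, like the pronoun-resolution model above, are far richer. Note that the answer you get depends on the initial means: a poor initialization can leave EM stuck at exactly the kind of local optimum described above.

```python
import math

def em_two_gaussians(data, mu_init, sigma=1.0, iters=50):
    """EM for a 1-D mixture of two Gaussians with fixed variance.
    Learns the two means and mixture weights so that the unlabeled
    data has high likelihood; convergence point depends on mu_init."""
    pi = [0.5, 0.5]
    mu = list(mu_init)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in data:
            p = [pi[k] * math.exp(-((x - mu[k]) ** 2) / (2 * sigma ** 2))
                 for k in (0, 1)]
            z = p[0] + p[1]
            resp.append((p[0] / z, p[1] / z))
        # M-step: re-estimate means and mixture weights from responsibilities
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            pi[k] = nk / len(data)
    return mu, pi
```

With well-separated data and a sensible initialization, the learned means land on the two clumps; with both means initialized inside one clump, EM can happily converge somewhere worse, and nothing in the algorithm tells you so.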
One way to find a better solution in this bumpy optimization space is to initialize or fix parameters of your model in ways that bias things toward what you know you want. For example, Haghighi and Klein [2010] fix a number of parameters in their entity-type/coreference model using prototypes of different classes. That is, they ensure, e.g., that Bush or Gore are in the PERSON class, as are the nominals president, official, etc., and that this class is referred to by the appropriate set of pronouns. They also set a number of other parameters to "fixed heuristic values." When the unsupervised learning kicks in, it initially has less freedom to go off the rails, as the hand-tuning has started the model from a good spot in the optimization space.

One argument that is sometimes made against fully unsupervised approaches is that the set-up is a little unrealistic. You will likely want to evaluate your approach. To evaluate your approach, you will need some labeled data. If you can produce labeled data for testing, you can produce some labeled data for training. It seems that semi-supervised learning is a more realistic situation: you have lots of unlabeled data, but you also have a few labeled examples to help you configure your parameters. In our unsupervised pronoun-resolution work [Cherry and Bergsma, 2005], we also used some labeled examples to re-weight the parameters learned by EM (using the discriminative technique known as maximum entropy).
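A toy analogue of this prototype-based biasing, in the clustering setting rather than Haghighi and Klein's actual coreference model, is to pin a few "prototype" points to known clusters and let the unsupervised procedure fill in the rest. The seeding scheme below is invented for illustration only.

```python
def seeded_kmeans(points, seeds, iters=20):
    """2-means where `seeds` maps a few point indices to fixed cluster ids.
    The pinned prototypes bias where the clusters end up, much as fixing
    model parameters biases unsupervised learning toward the desired answer."""
    # initialize each centroid from its prototype points alone
    cents = []
    for k in (0, 1):
        pts = [points[i] for i, lab in seeds.items() if lab == k]
        cents.append(tuple(sum(xs) / len(pts) for xs in zip(*pts)))
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            if i in seeds:
                assign[i] = seeds[i]          # prototypes never move
            else:
                d = [sum((a - c) ** 2 for a, c in zip(p, cen)) for cen in cents]
                assign[i] = d.index(min(d))
        for k in (0, 1):
            grp = [points[i] for i in range(len(points)) if assign[i] == k]
            cents[k] = tuple(sum(xs) / len(grp) for xs in zip(*grp))
    return assign, cents
```

Because the prototypes can never switch sides, the learner starts from, and stays near, a good spot in the optimization space.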
Another interaction between unsupervised and supervised learning occurs when an unsupervised method provides intermediate structural information for a supervised algorithm. For example, unsupervised algorithms often generate the unseen word-to-word alignments in statistical machine translation [Brown et al., 1993], and also the unseen character-to-phoneme alignments in grapheme-to-phoneme conversion [Jiampojamarn et al., 2007]. This alignment information is then leveraged by subsequent supervised processing.

2.5 Semi-Supervised Learning

Semi-supervised learning is a huge and growing area of interest in many different research communities. The name semi-supervised learning has come to mean, essentially, that a predictor is being created from information from both labeled and unlabeled examples. There are a variety of flavours of semi-supervised learning that are relevant to NLP and merit discussion. A good recent survey of semi-supervised techniques in general is by Zhu [2005].

Semi-supervised learning was the "Special Topic of Interest" at the 2009 Conference on Natural Language Learning. The organizers, Suzanne Stevenson and Xavier Carreras, provided a thoughtful motivation for semi-supervised learning in the call for papers:4

"The field of natural language learning has made great strides over the last 15 years, especially in the design and application of supervised and batch learning methods. However, two challenges arise with this kind of approach. First, in core NLP tasks, supervised approaches typically require large amounts of manually annotated data, and experience has shown that results often depend on the

3 The line between supervised and unsupervised learning can be a little blurry.
We called our use of labeleddata and maximum entropy the “supervised extension” of our unsupervised system in our EM paper [Cherryand Bergsma, 2005]. A later unsupervised approach by Charniak and Elsner [2009], which also uses EMtraining <strong>for</strong> pronoun resolution, involved tuning essentially the same number of hyperparameters by hand (tooptimize per<strong>for</strong>mance on a development set) as the number of parameters we tuned with supervision. Is thisstill unsupervised learning?4 http://www.cnts.ua.ac.be/conll2009/cfp.html24