and heuristic approaches are now being called unsupervised, since they are not based on learning from labeled data. For example, in Chapter 1, Section 1.1, we discussed how a part-of-speech tagger could be based on linguistic rules. A rule-based tagger could in some sense be considered unsupervised, since a human presumably created the rules from intuition, not from labeled data. However, since the human probably looked at some data to come up with the rules (a textbook, maybe?), calling this unsupervised is a little misleading from a machine learning perspective. Most people would probably simply call this a "rule-based approach." In Chapter 3, we propose unsupervised systems for lexical disambiguation, where a designer need only specify the words that are correlated with the classes of interest, rather than label any training data. We also discuss previous approaches that use counts derived from Internet search engine results. These approaches have usually been unsupervised.

From a machine learning perspective, true unsupervised approaches are those that induce output structure from properties of the problem, with guidance from probabilistic models rather than human intuition. We can illustrate this concept most clearly with the example of document classification again. Suppose we know there are two classes: documents about sports, and documents that are not about sports. We can generate the feature vectors as discussed above, and then simply form two groups of vectors such that members of each group are close to each other (in terms of Euclidean distance) in N-dimensional space. New feature vectors can be assigned to whichever group, or cluster, they are closest to. The points closest to one cluster will be separated from the points closest to the other cluster by a hyperplane in N-dimensional space. Where there's a hyperplane, there's a corresponding linear classifier with a set of weights. So clustering can learn a linear classifier as well.
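As a concrete sketch of this idea, the following toy code clusters points with 2-means and then reads a linear classifier directly off the two centroids: under Euclidean distance, the boundary between the two clusters is the perpendicular-bisector hyperplane, which gives an explicit weight vector and bias. The function names and the tiny data set are my own, purely for illustration.

```python
import random

def kmeans_2(points, iters=20, seed=0):
    """Minimal 2-means clustering of tuples, using squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d = [sum((a - c) ** 2 for a, c in zip(p, cen)) for cen in centroids]
            groups[d.index(min(d))].append(p)
        # recompute each centroid as the mean of its group (keep old if empty)
        centroids = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def linear_classifier_from(centroids):
    """Nearest-centroid decisions are linear: the boundary is w.x + b = 0,
    the perpendicular bisector between the two centroids."""
    c0, c1 = centroids
    w = [b - a for a, b in zip(c0, c1)]                      # weight vector
    b = (sum(a * a for a in c0) - sum(a * a for a in c1)) / 2.0
    return w, b

def classify(w, b, x):
    """Label 1 if x falls on centroid 1's side of the hyperplane, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

Assigning a new feature vector to its nearest cluster and evaluating the sign of w.x + b give the same answer, which is the sense in which clustering "learns" a linear classifier.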
We don't know what the clusters represent, but hopefully one of them has all the sports documents (if we inspect the clusters and define one of them as the sports class, we're essentially doing a form of semi-supervised learning).

Clustering can also be regarded as an "exploratory science" that seeks to discover useful patterns and structures in data [Pantel, 2003]. This structure might later be exploited for other forms of language processing; later we will see how clustering can be used to provide helpful feature information for supervised classifiers (Section 2.5.5).

Clustering is the simplest unsupervised learning algorithm. In more complicated set-ups, we can define a probability model over our features (and possibly over other hidden variables), and then try to learn the parameters of the model such that our unlabeled data has a high likelihood under this model. We previously used such a technique to train a pronoun-resolution system using expectation maximization [Cherry and Bergsma, 2005]. Similar techniques can be used to train hidden Markov models and other generative models.

These models can provide a very nice way to incorporate lots of unlabeled data. In some sense, however, doing anything beyond an HMM requires one to be a bit of a probabilistic-modeling guru. The more features you incorporate in the model, the more you have to account for the interdependence of these features explicitly in your model. Some assumptions you make may not be valid and may impair performance. It's hard to know exactly what's wrong with your model, and how to change it to make it better. Also, when setting the parameters of your model using clustering or expectation maximization, you might reach only a local optimum, from which the algorithm can proceed no further to better settings under your model (and you have no idea you've reached this point).
But since these algorithms are not optimizing discriminative performance anyway, it's not clear you want the global maximum even if you can find it.
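To make the likelihood-maximization idea concrete, here is a toy sketch of expectation maximization for a one-dimensional mixture of two fixed-variance Gaussians. The function name, the fixed-variance simplification, and the data are invented for illustration; real models, like the pronoun-resolution model above, are far richer. Note that the answer you get depends on the initial means: a poor initialization can leave EM stuck at exactly the kind of local optimum described above.

```python
import math

def em_two_gaussians(data, mu_init, sigma=1.0, iters=50):
    """EM for a 1-D mixture of two Gaussians with fixed variance.
    Learns the two means and mixture weights so that the unlabeled
    data has high likelihood; convergence point depends on mu_init."""
    pi = [0.5, 0.5]
    mu = list(mu_init)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in data:
            p = [pi[k] * math.exp(-((x - mu[k]) ** 2) / (2 * sigma ** 2))
                 for k in (0, 1)]
            z = p[0] + p[1]
            resp.append((p[0] / z, p[1] / z))
        # M-step: re-estimate means and mixture weights from responsibilities
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            pi[k] = nk / len(data)
    return mu, pi
```

With well-separated data and a sensible initialization, the learned means land on the two clumps; with both means initialized inside one clump, EM can happily converge somewhere worse, and nothing in the algorithm tells you so.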
One way to find a better solution in this bumpy optimization space is to initialize or fix parameters of your model in ways that bias things toward what you know you want. For example, Haghighi and Klein [2010] fix a number of parameters in their entity-type/coreference model using prototypes of different classes. That is, they ensure, e.g., that Bush or Gore are in the PERSON class, as are the nominals president, official, etc., and that this class is referred to by the appropriate set of pronouns. They also set a number of other parameters to "fixed heuristic values." When the unsupervised learning kicks in, it initially has less freedom to go off the rails, as the hand-tuning has started the model from a good spot in the optimization space.

One argument that is sometimes made against fully unsupervised approaches is that the set-up is a little unrealistic. You will likely want to evaluate your approach. To evaluate your approach, you will need some labeled data. If you can produce labeled data for testing, you can produce some labeled data for training. It seems that semi-supervised learning is a more realistic situation: you have lots of unlabeled data, but you also have a few labeled examples to help you configure your parameters. In our unsupervised pronoun-resolution work [Cherry and Bergsma, 2005], we also used some labeled examples to re-weight the parameters learned by EM (using the discriminative technique known as maximum entropy).
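A toy analogue of this prototype-based biasing, in the clustering setting rather than Haghighi and Klein's actual coreference model, is to pin a few "prototype" points to known clusters and let the unsupervised procedure fill in the rest. The seeding scheme below is invented for illustration only.

```python
def seeded_kmeans(points, seeds, iters=20):
    """2-means where `seeds` maps a few point indices to fixed cluster ids.
    The pinned prototypes bias where the clusters end up, much as fixing
    model parameters biases unsupervised learning toward the desired answer."""
    # initialize each centroid from its prototype points alone
    cents = []
    for k in (0, 1):
        pts = [points[i] for i, lab in seeds.items() if lab == k]
        cents.append(tuple(sum(xs) / len(pts) for xs in zip(*pts)))
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            if i in seeds:
                assign[i] = seeds[i]          # prototypes never move
            else:
                d = [sum((a - c) ** 2 for a, c in zip(p, cen)) for cen in cents]
                assign[i] = d.index(min(d))
        for k in (0, 1):
            grp = [points[i] for i in range(len(points)) if assign[i] == k]
            cents[k] = tuple(sum(xs) / len(grp) for xs in zip(*grp))
    return assign, cents
```

Because the prototypes can never switch sides, the learner starts from, and stays near, a good spot in the optimization space.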
Another interaction between unsupervised and supervised learning occurs when an unsupervised method provides intermediate structural information for a supervised algorithm. For example, unsupervised algorithms often generate the unseen word-to-word alignments in statistical machine translation [Brown et al., 1993], and also the unseen character-to-phoneme alignments in grapheme-to-phoneme conversion [Jiampojamarn et al., 2007]. This alignment information is then leveraged by subsequent supervised processing.

2.5 Semi-Supervised Learning

Semi-supervised learning is a huge and growing area of interest in many different research communities. The name semi-supervised learning has come to mean, essentially, that a predictor is being created from information from both labeled and unlabeled examples. There are a variety of flavours of semi-supervised learning that are relevant to NLP and merit discussion. A good recent survey of semi-supervised techniques in general is by Zhu [2005].

Semi-supervised learning was the "Special Topic of Interest" at the 2009 Conference on Natural Language Learning. The organizers, Suzanne Stevenson and Xavier Carreras, provided a thoughtful motivation for semi-supervised learning in the call for papers:4

"The field of natural language learning has made great strides over the last 15 years, especially in the design and application of supervised and batch learning methods. However, two challenges arise with this kind of approach. First, in core NLP tasks, supervised approaches typically require large amounts of manually annotated data, and experience has shown that results often depend on the

3 The line between supervised and unsupervised learning can be a little blurry.
We called our use of labeleddata and maximum entropy the “supervised extension” of our unsupervised system in our EM paper [Cherryand Bergsma, 2005]. A later unsupervised approach by Charniak and Elsner [2009], which also uses EMtraining <strong>for</strong> pronoun resolution, involved tuning essentially the same number of hyperparameters by hand (tooptimize per<strong>for</strong>mance on a development set) as the number of parameters we tuned with supervision. Is thisstill unsupervised learning?4 http://www.cnts.ua.ac.be/conll2009/cfp.html24