                             true class
                            +1      -1
    predicted class   +1    TP      FP
                      -1    FN      TN

Table 2.1: The classifier confusion matrix. Assuming "+1" is the positive class and "-1" is the negative class, each instance assigned a class by a classifier is either a true positive (TP), false positive (FP), false negative (FN), or true negative (TN), depending on its actual class membership (true class) and what was predicted by the classifier (predicted class).

Precision tells us the percentage of predicted sports documents that actually are sports documents. That is, precision is the ratio of true positives divided by the number of true positives plus the number of false positives (together, all the elements that we predicted to be members of the positive class). Recall, on the other hand, tells us the percentage of actual sports documents that were also predicted by the classifier to be sports documents. That is, recall is the ratio of true positives divided by the number of true positives plus the number of false negatives (together, all the true, gold-standard positives). It is possible to achieve 100% recall on any task by predicting all instances to be of the positive class (eliminating false negatives). In isolation, therefore, precision or recall may not be very informative, and so they are often stated together. For a single performance number, precision and recall are often combined into the F-score, which is simply the harmonic mean of precision and recall.

We summarize these measures using Table 2.1 and the following equations:

\[
\text{Precision} = \frac{TP}{TP + FP}
\qquad
\text{Recall} = \frac{TP}{TP + FN}
\qquad
\text{F-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
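As a concrete check of these definitions, here is a minimal sketch (in Python, with illustrative counts that are not taken from the thesis) computing all three measures from the confusion-matrix cells:

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-score from confusion-matrix counts."""
    precision = tp / (tp + fp)  # correct positives among all predicted positives
    recall = tp / (tp + fn)     # correct positives among all gold-standard positives
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score

# Hypothetical counts: 40 sports documents found (TP), 10 non-sports documents
# mislabeled as sports (FP), and 20 sports documents missed (FN).
p, r, f = precision_recall_f(tp=40, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} f-score={f:.3f}")
# precision=0.800 recall=0.667 f-score=0.727
```

Note that predicting every instance positive drives FN to zero (recall = 1.0) while precision falls toward the positive-class base rate, which is exactly why the two measures are reported together.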
2.3.3 Supervised Learning Algorithms

We want a learning algorithm that will give us the best accuracy on our evaluation data. How do we choose it? As you might imagine, there are many different ways to choose the weights, and some algorithms are better suited to some situations than others. There are generative models like naive Bayes that work well when you have smaller amounts of training data [Ng and Jordan, 2002]. Generative approaches jointly model both the input and output variables in a probabilistic formulation. They require one to explicitly model the interdependencies between the features of the model. There are also perceptrons, maximum entropy/logistic regression models, support vector machines, and many other discriminative techniques that all have various advantages and disadvantages in certain situations. These models are known as discriminative because they are optimized to distinguish the output labels given the input features (to discriminate between the different classes), rather than to jointly model the input and output variables as in the generative approach.

As Vapnik [1998] says (quoted in [Ng and Jordan, 2002]): "One should solve the [classification] problem directly and never solve a more general problem as an intermediate step." Indeed, Roth [1998] shows that generative and discriminative classifiers both make use of a linear feature space. Given the same representation, the difference between generative and discriminative models therefore rests solely in how the weights are chosen. Rather than choosing weights that best fit the generative model on the training data (and satisfy the model's simplifying assumptions, typically concerning the interdependence or independence of different features), a discriminative model chooses the weights that best attain the desired objective: better predictions [Fung and Roth, 2005]. Discriminative models thus tend to perform better, and are correspondingly the preferred approach today in many areas of NLP (including, increasingly, in semantics, where we recently proposed a discriminative approach to selectional preference; Chapter 6). Unlike generative approaches, when using discriminative algorithms we can generally use arbitrary and interdependent features in our model without worrying about modeling such interdependencies. Use of the word discriminative in NLP has thus come to indicate both an approach that optimizes for classification accuracy directly and one that uses a wide variety of features. In fact, one kind of feature you might use in a discriminative system is the prediction or output of a generative model (sketched below). This illustrates another advantage of discriminative learning: competing approaches can always be included as new features.
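To make the feature-stacking idea concrete, here is a minimal sketch (assuming scikit-learn and a toy word-count dataset; the data and variable names are illustrative, not from the thesis) in which a naive Bayes posterior becomes one extra feature for a logistic regression classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix (4 documents, 3 vocabulary items) and +1/-1 labels.
X_train = np.array([[2, 0, 1],
                    [0, 3, 0],
                    [1, 0, 2],
                    [0, 2, 1]])
y_train = np.array([1, -1, 1, -1])

nb = MultinomialNB().fit(X_train, y_train)      # generative model
nb_feature = nb.predict_proba(X_train)[:, [1]]  # Pr(y = +1 | x) as one new feature

# Discriminative model over the original features plus the generative prediction.
X_augmented = np.hstack([X_train, nb_feature])
clf = LogisticRegression().fit(X_augmented, y_train)
```

The discriminative learner is free to weight the generative model's output like any other (possibly interdependent) feature.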
Note that the clear advantages of discriminative models really only hold for supervised learning in NLP. A growing number of generative, Bayesian, unsupervised algorithms are now being developed. It may be the case that the pendulum will soon swing back and generative models will again dominate the supervised playing field as well, particularly if they can provide principled ways to incorporate unlabeled data into a semi-supervised framework.

2.3.4 Support Vector Machines

When you have lots of features and lots of examples, support vector machines (SVMs) [Cortes and Vapnik, 1995] seem to be the best discriminative approach. One reason might be that they perform well in situations, like natural language, where many features are relevant [Joachims, 1999a], as opposed to situations where a few key indicators may be sufficient for prediction. Conceptually, SVMs take a geometric view of the problem, as depicted in Figure 2.1. The training algorithm chooses the hyperplane location such that it is maximally far from the closest positive and negative points on either side of it (this is known as the max-margin solution). These closest vectors are known as support vectors; the hyperplane can be reconstructed from this set of vectors alone, hence the name support vector machine. In fact, Figure 2.1 depicts the hyperplane that would be learned by an SVM, with marks on the corresponding support vectors.

It can be shown that the hyperplane that maximizes the margin corresponds to the weight vector that solves the following constrained optimization problem:

\[
\min_{\bar{w}} \; \frac{1}{2} ||\bar{w}||^2
\quad \text{subject to:} \quad \forall i, \; y_i(\bar{w} \cdot \bar{x}_i) \geq 1
\tag{2.2}
\]

where $||\bar{w}||$ is the Euclidean norm of the weight vector (note $||\bar{w}||^2 = \bar{w} \cdot \bar{w}$). The factor of $\frac{1}{2}$ is a mathematical convenience so that the coefficient cancels when we take the derivative. The optimization says that we want to find the smallest weight vector (in terms of its Euclidean norm) such that our linear classifier's output ($h(\bar{x}) = \bar{w} \cdot \bar{x}$) is at least 1 when the correct label is the positive class ($y_i = +1$), and at most -1 when the correct label is the negative class ($y_i = -1$). The constraint in Equation 2.2 is a succinct way of writing these two conditions in one line.
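To ground Equation 2.2, here is a minimal sketch (assuming scikit-learn and tiny illustrative 2-D data, not the thesis's own experimental setup). A linear SVM with a very large C approximates the hard-margin solution, after which the margin constraints and support vectors can be inspected directly; note that scikit-learn also fits a bias term b, which the formulation above omits:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable 2-D dataset (illustrative only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin objective of Equation 2.2.
svm = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]

# Every training point should satisfy y_i * (w . x_i + b) >= 1 (up to numerical
# precision), with the support vectors lying exactly on the margin.
print("support vectors:", svm.support_vectors_)
print("margins:", y * (X @ w + b))
```

The printed margins show the constraint of Equation 2.2 in action: the support vectors sit at margin 1, and all other points lie strictly outside it.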