Having the largest possible margin (or, equivalently, the smallest possible weight vector subject to the constraints) that classifies the training examples correctly seems to be a good idea, as it is most likely to generalize to new data. Once again, consider text categorization. We may have a feature for each word in each document. There are enough words, and few enough documents, that our training algorithm could get all the training examples classified correctly simply by putting all the weight on the rare words in each document. So if Obama occurs in a single sports document in our training set, but nowhere else in the training set, our algorithm could get that document classified correctly by putting all its weight on the word Obama and ignoring the other features. Although this approach would do well on the training set, it will likely not generalize well to unseen documents. It is also likely not the maximum-margin (smallest-weight-vector) solution. If we can instead separate the positive and negative examples using more frequent words like score and win and teams, then we should do so. We will use fewer weights overall, and the weight vector will have a smaller norm (fewer weights will be non-zero). It intuitively seems like a good idea to rely on more frequent words to make decisions, and the SVM optimization simply encodes this intuition in a theoretically well-grounded formulation (it is all based on "empirical risk minimization" [Vapnik, 1998]).

Sometimes the positive and negative examples are not separable, and there will be no solution to the above optimization. At other times, even if the data is separable, it may be better to turn the hard constraints in the above equation into soft preferences, and place even greater emphasis on using the frequent features. That is, we may wish to have a weight vector with a small norm even at the expense of not separating the data. In terms of categorizing sports documents, words like score and win and teams may sometimes occur in non-sports documents in the training set (so we may get some training documents wrong if we put positive weight on them), but they are a better bet for getting test documents correct than putting high weight on rare words like Obama (blindly enforcing separability). Geometrically, we can view this as saying we might want to allow some points to lie on the opposite side of the hyperplane (or at least closer to it), if we can do so with weights on fewer dimensions.

[Cortes and Vapnik, 1995] give the optimization program for a soft-margin SVM as:

$$
\min_{\bar{w},\,\xi_1,\ldots,\xi_m} \;\; \frac{1}{2}\|\bar{w}\|^2 + C\sum_{i=1}^{m}\xi_i
\qquad \text{subject to: } \forall i,\;\; \xi_i \geq 0,\;\; y_i(\bar{w}\cdot\bar{x}_i) \geq 1 - \xi_i
\tag{2.3}
$$

The $\xi_i$ values are known as the slacks. Each example may use some slack: it must either satisfy the margin constraint outright (in which case $\xi_i = 0$) or instead use its slack to satisfy the inequality. The weighted sum of the slacks is minimized along with the norm of $\bar{w}$.

The relative importance of the slacks (getting the training examples separated nicely) versus the minimization of the weights (using more general features) is controlled by tuning C. If the feature weights learned by the algorithm are the parameters, then this C value is known as a hyperparameter, since it is set separately from the regular parameter learning.
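To make Equation (2.3) concrete, the following is a minimal sketch (not code from this dissertation) showing how, for a fixed weight vector, the slacks reduce to hinge losses and the soft-margin objective is computed. The toy feature vectors, labels, and candidate weights are invented purely for illustration.

```python
import numpy as np

# Toy data: rows of X are feature vectors x_i, y holds labels in {-1, +1}.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0]])
y = np.array([+1, -1, +1])

w = np.array([0.5, -1.0, 0.3])   # a candidate weight vector (no bias term, as in Eq. 2.3)
C = 1.0                          # regularization parameter

# For a fixed w, the smallest feasible slack for each example is the hinge loss:
# xi_i = max(0, 1 - y_i * (w . x_i)); it is zero when the margin constraint is met.
margins = y * (X @ w)
xi = np.maximum(0.0, 1.0 - margins)

# Soft-margin objective: (1/2)||w||^2 + C * sum_i xi_i
objective = 0.5 * np.dot(w, w) + C * np.sum(xi)

print("slacks:", xi)
print("objective:", objective)
```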
The general practice is to try various values for this hyperparameter and choose the one that achieves the highest performance on the development set. In an SVM, this hyperparameter is known as the regularization parameter. It controls how much we penalize training vectors that lie on the opposite side of the hyperplane (with the distance given by their slack values). In practice, I usually try a range of values for this parameter, starting at 0.000001 and going up by a factor of 10 to around 100000.
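As an illustration of such a sweep, here is a minimal sketch using scikit-learn's LinearSVC as a stand-in linear SVM solver; the function name tune_C and the training/development arrays are assumptions made for the example, not part of any system described in this dissertation.

```python
from sklearn.svm import LinearSVC

def tune_C(X_train, y_train, X_dev, y_dev):
    """Try C from 1e-6 to 1e5 (factors of 10) and keep the value
    that gives the best accuracy on the development set."""
    best_C, best_acc = None, -1.0
    for C in [10.0 ** k for k in range(-6, 6)]:   # 1e-6 ... 1e5
        clf = LinearSVC(C=C)
        clf.fit(X_train, y_train)
        acc = clf.score(X_dev, y_dev)             # dev-set accuracy
        if acc > best_acc:
            best_C, best_acc = C, acc
    return best_C, best_acc
```

LinearSVC is just one convenient stand-in here; any of the solvers mentioned in the Software section below would fit the same loop.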
Note that you would not want to tune the regularization parameter by measuring performance on the training set, as less regularization will always lead to better performance on the training data itself. Regularization is a way to prevent overfitting the training data, and thus should be set on separate examples, i.e., the development set. However, some people like to do 10-fold cross-validation on the training data to set their hyperparameters. I have no problem with this.

Another detail regarding SVM learning is that sometimes it makes sense to scale or normalize the features to enable faster, and sometimes better, learning. For many tasks, it makes sense to divide all the feature values by the Euclidean norm of the feature vector, such that the resulting vector has a magnitude of one. In the chapters that follow, we specify if we use such a technique. Again, we can test whether such a transformation is worth it by seeing how it affects performance on our development data.

SVMs have been shown to work quite well on a range of tasks. If you want to use a linear classifier, they seem to be a good choice. The SVM formulation is also perfectly suited to using kernels to automatically expand the feature space, allowing for non-linear classification. For all the tasks investigated in this dissertation, however, standard kernels were not found to improve performance. Furthermore, training and testing take longer when kernels are used.

2.3.5 Software

We view the current best practice in most NLP classification applications as follows: use as many labeled examples as you can find for the task and domain of interest. Then, carefully construct a linear feature space such that all potentially useful combinations of properties are explicit dimensions in that space (rather than implicitly creating such dimensions through the use of kernels). For training, use the LIBLINEAR package [Fan et al., 2008], an amazingly fast solver that can return the SVM model in seconds even for tens of thousands of features and instances (other fast alternatives exist, but have not been explored in this dissertation). This set-up allows for very rapid system development and evaluation, letting us focus on the features themselves rather than the learning algorithm.

Since many of the tasks in this dissertation were completed before LIBLINEAR was available, we also present results using older solvers such as the logistic regression package in Weka [Witten and Frank, 2005], the efficient SVM multiclass instance of SVM struct [Tsochantaridis et al., 2004], and our old stand-by, Thorsten Joachims' SVMlight [Joachims, 1999a]. Whatever package is used, it should now be clear that in terms of this dissertation, training simply means learning a set of weights for a linear classifier using a given set of labeled data.
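To make this set-up concrete, here is a minimal sketch of the normalize-then-train pipeline, assuming scikit-learn's LinearSVC (which is backed by LIBLINEAR) as the solver and a toy bag-of-words feature space; the documents and labels are invented, and this is not the dissertation's actual experimental code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# Toy document collection; real experiments would use the task's labeled data.
train_docs = ["the team won the game", "the senator gave a speech"]
train_labels = [1, 0]   # 1 = sports, 0 = not sports

# Make every potentially useful property an explicit dimension (here, just unigrams).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Divide each feature vector by its Euclidean norm so it has magnitude one.
X_train = normalize(X_train, norm="l2")

# LinearSVC is built on LIBLINEAR; C would be tuned on development data as above.
clf = LinearSVC(C=1.0)
clf.fit(X_train, train_labels)

test_doc = vectorizer.transform(["the team won again"])
print(clf.predict(normalize(test_doc, norm="l2")))
```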
2.4 Unsupervised Learning

There is a way to gather linguistic annotations without using any training data: unsupervised learning. This at first seems rather magical. How can a system produce labels without ever seeing them?

Most current unsupervised approaches in NLP are decidedly unmagical. Probably since so much current work is based on supervised training from labeled data, some rule-based