Having the largest possible margin (or, equivalently, the smallest possible weight vector subject to the constraints) that classifies the training examples correctly seems to be a good idea, as it is most likely to generalize to new data. Once again, consider text categorization. We may have a feature for each word in each document. There are enough words, and few enough documents, that our training algorithm could get all the training examples classified correctly simply by putting all the weight on the rare words in each document. So if Obama occurs in a single sports document in our training set, but nowhere else in the training set, our algorithm could get that document classified correctly by putting all its weight on the word Obama and ignoring the other features. Although this approach would do well on the training set, it will likely not generalize well to unseen documents. It is also likely not the maximum-margin (smallest-weight-vector) solution. If we can instead separate the positive and negative examples using more frequent words like score and win and teams, then we should do so. We will use fewer weights overall, and the weight vector will have a smaller norm (fewer weights will be non-zero). It intuitively seems like a good idea to rely on more frequent words to make decisions, and the SVM optimization simply encodes this intuition in a theoretically well-grounded formulation (it is all based on "empirical risk minimization" [Vapnik, 1998]).

Sometimes the positive and negative examples are not separable, and there will be no solution to the above optimization. At other times, even if the data is separable, it may be better to turn the hard constraints in the above equation into soft preferences, and place even greater emphasis on using the frequent features. That is, we may wish to have a weight vector with a small norm even at the expense of not separating the data. In terms of categorizing sports documents, words like score and win and teams may sometimes occur in non-sports documents in the training set (so we may get some training documents wrong if we put positive weight on them), but they are a better bet for getting test documents correct than putting high weight on rare words like Obama (blindly enforcing separability). Geometrically, we can view this as saying we might want to allow some points to lie on the opposite side of the hyperplane (or at least closer to it), if we can do so with weights on fewer dimensions.

[Cortes and Vapnik, 1995] give the optimization program for a soft-margin SVM as:

$$
\min_{\bar{w},\,\xi_1,\ldots,\xi_m} \;\; \frac{1}{2}\|\bar{w}\|^2 + C\sum_{i=1}^{m}\xi_i
\qquad \text{subject to: } \forall i,\;\; \xi_i \geq 0,\;\; y_i(\bar{w}\cdot\bar{x}_i) \geq 1 - \xi_i
\tag{2.3}
$$

The $\xi_i$ values are known as the slacks. Each example may use some slack: it must either satisfy the margin constraint outright (in which case $\xi_i = 0$) or instead use its slack to satisfy the inequality. The weighted sum of the slacks is minimized along with the norm of $\bar{w}$.

The relative importance of the slacks (getting the training examples separated nicely) versus the minimization of the weights (using more general features) is controlled by tuning C. If the feature weights learned by the algorithm are the parameters, then this C value is known as a hyperparameter, since it is set separately from the regular parameter learning.
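To make Equation (2.3) concrete, the following is a minimal sketch (not code from this dissertation) showing how, for a fixed weight vector, the slacks reduce to hinge losses and the soft-margin objective is computed. The toy feature vectors, labels, and candidate weights are invented purely for illustration.

```python
import numpy as np

# Toy data: rows of X are feature vectors x_i, y holds labels in {-1, +1}.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0]])
y = np.array([+1, -1, +1])

w = np.array([0.5, -1.0, 0.3])   # a candidate weight vector (no bias term, as in Eq. 2.3)
C = 1.0                          # regularization parameter

# For a fixed w, the smallest feasible slack for each example is the hinge loss:
# xi_i = max(0, 1 - y_i * (w . x_i)); it is zero when the margin constraint is met.
margins = y * (X @ w)
xi = np.maximum(0.0, 1.0 - margins)

# Soft-margin objective: (1/2)||w||^2 + C * sum_i xi_i
objective = 0.5 * np.dot(w, w) + C * np.sum(xi)

print("slacks:", xi)
print("objective:", objective)
```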
The general practice is to try various values for this hyperparameter and choose the one that achieves the highest performance on the development set. In an SVM, this hyperparameter is known as the regularization parameter. It controls how much we penalize training vectors that lie on the opposite side of the hyperplane (with the distance given by their slack values). In practice, I usually try a range of values for this parameter, starting at 0.000001 and going up by a factor of 10 to around 100000.
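As an illustration of such a sweep, here is a minimal sketch using scikit-learn's LinearSVC as a stand-in linear SVM solver; the function name tune_C and the training/development arrays are assumptions made for the example, not part of any system described in this dissertation.

```python
from sklearn.svm import LinearSVC

def tune_C(X_train, y_train, X_dev, y_dev):
    """Try C from 1e-6 to 1e5 (factors of 10) and keep the value
    that gives the best accuracy on the development set."""
    best_C, best_acc = None, -1.0
    for C in [10.0 ** k for k in range(-6, 6)]:   # 1e-6 ... 1e5
        clf = LinearSVC(C=C)
        clf.fit(X_train, y_train)
        acc = clf.score(X_dev, y_dev)             # dev-set accuracy
        if acc > best_acc:
            best_C, best_acc = C, acc
    return best_C, best_acc
```

LinearSVC is just one convenient stand-in here; any of the solvers mentioned in the Software section below would fit the same loop.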
Note that you would not want to tune the regularization parameter by measuring performance on the training set, as less regularization will always lead to better performance on the training data itself. Regularization is a way to prevent overfitting the training data, and thus should be set on separate examples, i.e., the development set. However, some people like to do 10-fold cross-validation on the training data to set their hyperparameters. I have no problem with this.

Another detail regarding SVM learning is that sometimes it makes sense to scale or normalize the features to enable faster, and sometimes better, learning. For many tasks, it makes sense to divide all the feature values by the Euclidean norm of the feature vector, such that the resulting vector has a magnitude of one. In the chapters that follow, we specify if we use such a technique. Again, we can test whether such a transformation is worth it by seeing how it affects performance on our development data.

SVMs have been shown to work quite well on a range of tasks. If you want to use a linear classifier, they seem to be a good choice. The SVM formulation is also perfectly suited to using kernels to automatically expand the feature space, allowing for non-linear classification. For all the tasks investigated in this dissertation, however, standard kernels were not found to improve performance. Furthermore, training and testing take longer when kernels are used.

2.3.5 Software

We view the current best practice in most NLP classification applications as follows: use as many labeled examples as you can find for the task and domain of interest. Then, carefully construct a linear feature space such that all potentially useful combinations of properties are explicit dimensions in that space (rather than implicitly creating such dimensions through the use of kernels). For training, use the LIBLINEAR package [Fan et al., 2008], an amazingly fast solver that can return the SVM model in seconds even for tens of thousands of features and instances (other fast alternatives exist, but have not been explored in this dissertation). This set-up allows for very rapid system development and evaluation, letting us focus on the features themselves rather than the learning algorithm.

Since many of the tasks in this dissertation were completed before LIBLINEAR was available, we also present results using older solvers such as the logistic regression package in Weka [Witten and Frank, 2005], the efficient SVM multiclass instance of SVM struct [Tsochantaridis et al., 2004], and our old stand-by, Thorsten Joachims' SVMlight [Joachims, 1999a]. Whatever package is used, it should now be clear that in terms of this dissertation, training simply means learning a set of weights for a linear classifier using a given set of labeled data.
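To make this set-up concrete, here is a minimal sketch of the normalize-then-train pipeline, assuming scikit-learn's LinearSVC (which is backed by LIBLINEAR) as the solver and a toy bag-of-words feature space; the documents and labels are invented, and this is not the dissertation's actual experimental code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# Toy document collection; real experiments would use the task's labeled data.
train_docs = ["the team won the game", "the senator gave a speech"]
train_labels = [1, 0]   # 1 = sports, 0 = not sports

# Make every potentially useful property an explicit dimension (here, just unigrams).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Divide each feature vector by its Euclidean norm so it has magnitude one.
X_train = normalize(X_train, norm="l2")

# LinearSVC is built on LIBLINEAR; C would be tuned on development data as above.
clf = LinearSVC(C=1.0)
clf.fit(X_train, train_labels)

test_doc = vectorizer.transform(["the team won again"])
print(clf.predict(normalize(test_doc, norm="l2")))
```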
2.4 Unsupervised Learning

There is a way to gather linguistic annotations without using any training data: unsupervised learning. This at first seems rather magical. How can a system produce labels without ever seeing them?

Most current unsupervised approaches in NLP are decidedly unmagical. Probably since so much current work is based on supervised training from labeled data, some rule-based