where ¯1 is an N-dimensional vector of ones. This is CS-SVM with all weights set to unity. The counts for each preposition are simply summed, and whichever one scores the highest is taken as the output (in fact only a subset of the counts is used; see Section 4.4.1). As mentioned earlier, this system performs remarkably well on several tasks.

4.2.3 Variance Regularization SVMs

Suppose we choose our attribute partition well and train the CS-SVM on a sufficient number of examples to achieve good performance. It is a reasonable hypothesis that the learned weights will be predominantly positive. This is because each sub-vector ¯x_r was chosen to include only attributes that are predictive of class r. Unlike the classifier in (4.1), which weighs positive and negative evidence together for each class, in the CS-SVM negative evidence only plays a role insofar as it contributes to the scores of competing classes.

If all the attributes are equally important, the weights should be equal, as in the unsupervised approach in (4.5). If some are more important than others, the training examples should reflect this and the learner can adjust the weights accordingly.3 In the absence of such training evidence, it is reasonable to bias the classifier toward an equal-weight solution. Rather than the standard SVM regularization that minimizes the norm of the weights as in (4.4), we therefore regularize toward weights that have low variance. More formally, we can regard the set of weights, w_1, ..., w_N, as the distribution of a discrete random variable, W. We can calculate the mean and variance of this variable from its distribution. We seek a variable that has low variance.

We begin with a more general objective and then explain how a specific choice of covariance matrix, C, minimizes the variance of the weights. We propose the regularizer:

\[
\begin{aligned}
\min_{\bar{w},\,\xi_1,\ldots,\xi_m}\quad & \tfrac{1}{2}\,\bar{w}^T C\,\bar{w} \;+\; C \sum_{i=1}^{m} \xi_i \\
\text{subject to:}\quad & \forall i \;\; \xi_i \ge 0 \\
& \forall r \ne y_i,\;\; \bar{w}_{y_i} \cdot \bar{x}^i_{y_i} - \bar{w}_r \cdot \bar{x}^i_r \;\ge\; 1 - \xi_i \qquad (4.6)
\end{aligned}
\]

where C is a normalized covariance matrix such that \(\sum_{i,j} C_{i,j} = 0\). This ensures that uniform weight vectors receive zero regularization penalty. Since all covariance matrices are positive semi-definite, the quadratic program (QP) remains convex in ¯w, and is thus amenable to general-purpose QP solvers.

Since the unsupervised system in (4.5) has zero weight variance, the SVM learned with (4.6) should do at least as well as (4.5) as we tune the C-parameter on development data. That is, as C approaches zero, variance minimization becomes the sole objective of (4.6), and uniform weights are produced.

We use covariance matrices of the form:

\[
C = \mathrm{diag}(\bar{p}) - \bar{p}\bar{p}^T \qquad (4.7)
\]

where diag(¯p) is the matrix constructed by putting ¯p on the main diagonal. Here, ¯p is an arbitrary N-dimensional weighting vector such that p_i ≥ 0 and \(\sum_i p_i = 1\). The vector ¯p dictates the contribution of each w_i to the mean and variance of the weights in ¯w. It is easy to see that \(\sum_{i,j} C_{i,j} = \sum_i p_i - \sum_i \sum_j p_i p_j = 0\).

3 E.g., recall from Chapter 3 that the true preposition might be better predicted by the counts of patterns that tend to include the preposition's grammatical object, i.e., patterns that include more right-context.
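As a concrete illustration of the construction in (4.7), the following sketch (not from the thesis; plain NumPy, with a hypothetical dimension N) builds the matrix C and checks the two properties claimed above: its entries sum to zero, and a uniform weight vector incurs zero regularization penalty.

```python
import numpy as np

N = 5                                   # illustrative number of weights (hypothetical)
p = np.full(N, 1.0 / N)                 # weighting vector with p_i = 1/N, as in the experiments

# The regularizer matrix of (4.7): C = diag(p) - p p^T
C = np.diag(p) - np.outer(p, p)

# Its entries sum to zero ...
print(np.isclose(C.sum(), 0.0))                        # True

# ... so a uniform weight vector (the unsupervised solution) gets zero penalty.
w_uniform = np.ones(N)
print(np.isclose(w_uniform @ C @ w_uniform, 0.0))      # True

# A non-uniform weight vector is penalized by exactly its variance under p
# (the identity derived in the text below; here p is uniform, so this is np.var).
w = np.array([0.2, 1.0, 1.3, 0.9, 1.1])
print(np.isclose(w @ C @ w, np.var(w)))                # True
print(w @ C @ w)                                        # > 0
```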
We now show that ¯w^T(diag(¯p) − ¯p¯p^T)¯w expresses the variance of the weights in ¯w with respect to the probability weighting ¯p. The variance of a random variable with mean E[W] = µ is:

\[
\mathrm{Var}[W] = E[(W-\mu)^2] = E[W^2] - E[W]^2
\]

The mean of the weights using probability weighting ¯p is E[W] = ¯w^T¯p = ¯p^T¯w. Also, E[W^2] = ¯w^T diag(¯p) ¯w. Thus:

\[
\begin{aligned}
\mathrm{Var}[W] &= \bar{w}^T \mathrm{diag}(\bar{p})\,\bar{w} - (\bar{w}^T\bar{p})(\bar{p}^T\bar{w}) \\
&= \bar{w}^T\bigl(\mathrm{diag}(\bar{p}) - \bar{p}\bar{p}^T\bigr)\,\bar{w}
\end{aligned}
\]

In our experiments, we deem each weight to be equally important to the variance calculation, and set p_i = 1/N, ∀i = 1,...,N.

The regularization in (4.6) using C from (4.7) can be regarded as directing the SVM toward a good unsupervised system, regardless of the constraints (training examples). In some unsupervised systems, however, only a subset of the attributes is used. In other cases, distinct subsets of weights should each have low variance, rather than minimizing the variance across all weights. There are examples of these situations in Section 4.4.

We can account for these cases in our QP by providing separate terms in the quadratic function for the subsets of ¯w that should have low variance. Suppose we create L subsets of ¯w: ˜ω_1, ..., ˜ω_L, where ˜ω_j is ¯w with the elements that are not in subset j set to zero. We then minimize ½(˜ω_1^T C_1 ˜ω_1 + ... + ˜ω_L^T C_L ˜ω_L). If the terms in subset j should have low variance, we use C_j = C from (4.7). If the subset corresponds to attributes that are not a priori known to be useful, an identity matrix can instead be used, C_j = I, and these weights will be regularized toward zero as in a standard SVM.4

Variance regularization therefore exploits extra knowledge provided by the system designer. The designer decides which weights should have similar values, and the SVM is biased to prefer this solution.

One consequence of being able to regularize different subsets of weights is that we can also apply variance regularization to the standard multi-class SVM (Section 4.2.1). We can use an identity matrix C_i for all irrelevant weights, i.e., weights that correspond to class-attribute pairs where the attribute is not directly relevant to the class. In our experiments, however, we apply variance regularization to the more efficient CS-SVM.

We refer to a CS-SVM trained using the variance minimization quadratic program as the VAR-SVM.

Web-Scale N-gram VAR-SVM

If variance regularization is applied to all weights, the attributes COUNT(in Russia during), COUNT(Russia during 1997), and COUNT(during 1997 to) will be encouraged to have similar weights in the score for the class during. Furthermore, these will be weighted similarly to other patterns, filled with other prepositions, used in the scores for other classes.

Alternatively, we could minimize the variance separately over all 5-gram patterns, then over all 4-gram patterns, etc., or over all patterns with a filler in the same position. In our

4 Weights must appear in ≥ 1 subsets (possibly only in the C_j = I subset); each occurs in at most one subset in our experiments. Note that it is straightforward to express this as a single covariance matrix regularizer over ¯w; we omit the details.
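Footnote 4 states that the per-subset terms can be folded into a single covariance-matrix regularizer over ¯w, but omits the details. The following sketch (not from the thesis) shows one way such a matrix could be assembled; the index layout for 5-gram, 4-gram, and "unknown usefulness" attributes is purely hypothetical, and the helper names are my own.

```python
import numpy as np

def variance_block(p):
    """Per-subset regularizer block of (4.7): diag(p) - p p^T."""
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)

def assemble_regularizer(dim, subsets):
    """Combine per-subset blocks into one matrix Q over the full weight vector,
    so that w^T Q w = sum_j omega_j^T C_j omega_j. `subsets` maps tuples of
    weight indices to 'var' (low-variance subset) or 'identity' (standard L2
    regularization toward zero)."""
    Q = np.zeros((dim, dim))
    for indices, kind in subsets.items():
        idx = np.array(indices)
        if kind == 'var':
            block = variance_block(np.full(len(idx), 1.0 / len(idx)))
        else:  # 'identity'
            block = np.eye(len(idx))
        Q[np.ix_(idx, idx)] += block
    return Q

# Hypothetical layout: weights 0-2 are 5-gram patterns, 3-4 are 4-gram patterns,
# and 5-6 are attributes with no a priori known usefulness.
Q = assemble_regularizer(7, {
    (0, 1, 2): 'var',        # variance minimized within the 5-gram weights
    (3, 4):    'var',        # ... and separately within the 4-gram weights
    (5, 6):    'identity',   # regularized toward zero, as in a standard SVM
})

# Uniform within each variance subset and zero elsewhere: zero total penalty.
w = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 0.0, 0.0])
print(np.isclose(w @ Q @ w, 0.0))   # True
```

Since Q is a sum of positive semi-definite blocks it is itself positive semi-definite, so replacing C in (4.6) with such a matrix keeps the quadratic program convex.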