where ¯1 is an N-dimensional vector of ones. This is CS-SVM with all weights set to unity. The counts for each preposition are simply summed, and whichever one scores the highest is taken as the output (in fact only a subset of the counts is used; see Section 4.4.1). As mentioned earlier, this system performs remarkably well on several tasks.

4.2.3 Variance Regularization SVMs

Suppose we choose our attribute partition well and train the CS-SVM on a sufficient number of examples to achieve good performance. It is a reasonable hypothesis that the learned weights will be predominantly positive. This is because each sub-vector ¯x_r was chosen to include only attributes that are predictive of class r. Unlike the classifier in (4.1), which weighs positive and negative evidence together for each class, in the CS-SVM negative evidence only plays a role insofar as it contributes to the scores of competing classes.

If all the attributes are equally important, the weights should be equal, as in the unsupervised approach in (4.5). If some are more important than others, the training examples should reflect this and the learner can adjust the weights accordingly.3 In the absence of such training evidence, it is reasonable to bias the classifier toward an equal-weight solution. Rather than the standard SVM regularization that minimizes the norm of the weights as in (4.4), we therefore regularize toward weights that have low variance. More formally, we can regard the set of weights, w_1, ..., w_N, as the distribution of a discrete random variable, W. We can calculate the mean and variance of this variable from its distribution. We seek a variable that has low variance.

We begin with a more general objective and then explain how a specific choice of covariance matrix, C, minimizes the variance of the weights. We propose the regularizer:

\[
\begin{aligned}
\min_{\bar{w},\,\xi_1,\ldots,\xi_m}\quad & \tfrac{1}{2}\,\bar{w}^T C\,\bar{w} \;+\; C \sum_{i=1}^{m} \xi_i \\
\text{subject to:}\quad & \forall i \;\; \xi_i \ge 0 \\
& \forall r \ne y_i,\;\; \bar{w}_{y_i} \cdot \bar{x}^i_{y_i} - \bar{w}_r \cdot \bar{x}^i_r \;\ge\; 1 - \xi_i \qquad (4.6)
\end{aligned}
\]

where C is a normalized covariance matrix such that \(\sum_{i,j} C_{i,j} = 0\). This ensures that uniform weight vectors receive zero regularization penalty. Since all covariance matrices are positive semi-definite, the quadratic program (QP) remains convex in ¯w, and is thus amenable to general-purpose QP solvers.

Since the unsupervised system in (4.5) has zero weight variance, the SVM learned with (4.6) should do at least as well as (4.5) as we tune the C-parameter on development data. That is, as C approaches zero, variance minimization becomes the sole objective of (4.6), and uniform weights are produced.

We use covariance matrices of the form:

\[
C = \mathrm{diag}(\bar{p}) - \bar{p}\bar{p}^T \qquad (4.7)
\]

where diag(¯p) is the matrix constructed by putting ¯p on the main diagonal. Here, ¯p is an arbitrary N-dimensional weighting vector such that p_i ≥ 0 and \(\sum_i p_i = 1\). The vector ¯p dictates the contribution of each w_i to the mean and variance of the weights in ¯w. It is easy to see that \(\sum_{i,j} C_{i,j} = \sum_i p_i - \sum_i \sum_j p_i p_j = 0\).

3 E.g., recall from Chapter 3 that the true preposition might be better predicted by the counts of patterns that tend to include the preposition's grammatical object, i.e., patterns that include more right-context.
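As a concrete illustration of the construction in (4.7), the following sketch (not from the thesis; plain NumPy, with a hypothetical dimension N) builds the matrix C and checks the two properties claimed above: its entries sum to zero, and a uniform weight vector incurs zero regularization penalty.

```python
import numpy as np

N = 5                                   # illustrative number of weights (hypothetical)
p = np.full(N, 1.0 / N)                 # weighting vector with p_i = 1/N, as in the experiments

# The regularizer matrix of (4.7): C = diag(p) - p p^T
C = np.diag(p) - np.outer(p, p)

# Its entries sum to zero ...
print(np.isclose(C.sum(), 0.0))                        # True

# ... so a uniform weight vector (the unsupervised solution) gets zero penalty.
w_uniform = np.ones(N)
print(np.isclose(w_uniform @ C @ w_uniform, 0.0))      # True

# A non-uniform weight vector is penalized by exactly its variance under p
# (the identity derived in the text below; here p is uniform, so this is np.var).
w = np.array([0.2, 1.0, 1.3, 0.9, 1.1])
print(np.isclose(w @ C @ w, np.var(w)))                # True
print(w @ C @ w)                                        # > 0
```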
We now show that ¯w^T(diag(¯p) − ¯p¯p^T)¯w expresses the variance of the weights in ¯w with respect to the probability weighting ¯p. The variance of a random variable with mean E[W] = µ is:

\[
\mathrm{Var}[W] = E[(W-\mu)^2] = E[W^2] - E[W]^2
\]

The mean of the weights using probability weighting ¯p is E[W] = ¯w^T¯p = ¯p^T¯w. Also, E[W^2] = ¯w^T diag(¯p) ¯w. Thus:

\[
\begin{aligned}
\mathrm{Var}[W] &= \bar{w}^T \mathrm{diag}(\bar{p})\,\bar{w} - (\bar{w}^T\bar{p})(\bar{p}^T\bar{w}) \\
&= \bar{w}^T\bigl(\mathrm{diag}(\bar{p}) - \bar{p}\bar{p}^T\bigr)\,\bar{w}
\end{aligned}
\]

In our experiments, we deem each weight to be equally important to the variance calculation, and set p_i = 1/N, ∀i = 1,...,N.

The regularization in (4.6) using C from (4.7) can be regarded as directing the SVM toward a good unsupervised system, regardless of the constraints (training examples). In some unsupervised systems, however, only a subset of the attributes is used. In other cases, distinct subsets of weights should each have low variance, rather than minimizing the variance across all weights. There are examples of these situations in Section 4.4.

We can account for these cases in our QP by providing separate terms in the quadratic function for the subsets of ¯w that should have low variance. Suppose we create L subsets of ¯w: ˜ω_1, ..., ˜ω_L, where ˜ω_j is ¯w with the elements that are not in subset j set to zero. We then minimize ½(˜ω_1^T C_1 ˜ω_1 + ... + ˜ω_L^T C_L ˜ω_L). If the terms in subset j should have low variance, we use C_j = C from (4.7). If the subset corresponds to attributes that are not a priori known to be useful, an identity matrix can instead be used, C_j = I, and these weights will be regularized toward zero as in a standard SVM.4

Variance regularization therefore exploits extra knowledge provided by the system designer. The designer decides which weights should have similar values, and the SVM is biased to prefer this solution.

One consequence of being able to regularize different subsets of weights is that we can also apply variance regularization to the standard multi-class SVM (Section 4.2.1). We can use an identity matrix C_i for all irrelevant weights, i.e., weights that correspond to class-attribute pairs where the attribute is not directly relevant to the class. In our experiments, however, we apply variance regularization to the more efficient CS-SVM.

We refer to a CS-SVM trained using the variance minimization quadratic program as the VAR-SVM.

Web-Scale N-gram VAR-SVM

If variance regularization is applied to all weights, the attributes COUNT(in Russia during), COUNT(Russia during 1997), and COUNT(during 1997 to) will be encouraged to have similar weights in the score for the class during. Furthermore, these will be weighted similarly to other patterns, filled with other prepositions, used in the scores for other classes.

Alternatively, we could minimize the variance separately over all 5-gram patterns, then over all 4-gram patterns, etc., or over all patterns with a filler in the same position. In our

4 Weights must appear in ≥ 1 subsets (possibly only in the C_j = I subset); each occurs in at most one subset in our experiments. Note that it is straightforward to express this as a single covariance matrix regularizer over ¯w; we omit the details.
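Footnote 4 states that the per-subset terms can be folded into a single covariance-matrix regularizer over ¯w, but omits the details. The following sketch (not from the thesis) shows one way such a matrix could be assembled; the index layout for 5-gram, 4-gram, and "unknown usefulness" attributes is purely hypothetical, and the helper names are my own.

```python
import numpy as np

def variance_block(p):
    """Per-subset regularizer block of (4.7): diag(p) - p p^T."""
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)

def assemble_regularizer(dim, subsets):
    """Combine per-subset blocks into one matrix Q over the full weight vector,
    so that w^T Q w = sum_j omega_j^T C_j omega_j. `subsets` maps tuples of
    weight indices to 'var' (low-variance subset) or 'identity' (standard L2
    regularization toward zero)."""
    Q = np.zeros((dim, dim))
    for indices, kind in subsets.items():
        idx = np.array(indices)
        if kind == 'var':
            block = variance_block(np.full(len(idx), 1.0 / len(idx)))
        else:  # 'identity'
            block = np.eye(len(idx))
        Q[np.ix_(idx, idx)] += block
    return Q

# Hypothetical layout: weights 0-2 are 5-gram patterns, 3-4 are 4-gram patterns,
# and 5-6 are attributes with no a priori known usefulness.
Q = assemble_regularizer(7, {
    (0, 1, 2): 'var',        # variance minimized within the 5-gram weights
    (3, 4):    'var',        # ... and separately within the 4-gram weights
    (5, 6):    'identity',   # regularized toward zero, as in a standard SVM
})

# Uniform within each variance subset and zero elsewhere: zero total penalty.
w = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 0.0, 0.0])
print(np.isclose(w @ Q @ w, 0.0))   # True
```

Since Q is a sum of positive semi-definite blocks it is itself positive semi-definite, so replacing C in (4.6) with such a matrix keeps the quadratic program convex.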