where ¯1 is an N-dimensional vector of ones. This is CS-SVM with all weights set to unity. The counts for each preposition are simply summed, and whichever one scores the highest is taken as the output (in fact only a subset of the counts is used; see Section 4.4.1). As mentioned earlier, this system performs remarkably well on several tasks.

4.2.3 Variance Regularization SVMs

Suppose we choose our attribute partition well and train the CS-SVM on a sufficient number of examples to achieve good performance. It is a reasonable hypothesis that the learned weights will be predominantly positive. This is because each sub-vector ¯x_r was chosen to include only attributes that are predictive of class r. Unlike the classifier in (4.1), which weighs positive and negative evidence together for each class, in CS-SVM negative evidence plays a role only insofar as it contributes to the scores of competing classes.

If all the attributes are equally important, the weights should be equal, as in the unsupervised approach in (4.5). If some are more important than others, the training examples should reflect this and the learner can adjust the weights accordingly.³ In the absence of such training evidence, it is reasonable to bias the classifier toward an equal-weight solution. Rather than the standard SVM regularization that minimizes the norm of the weights as in (4.4), we therefore regularize toward weights that have low variance. More formally, we can regard the set of weights, w_1, ..., w_N, as the distribution of a discrete random variable, W. We can calculate the mean and variance of this variable from its distribution. We seek a variable that has low variance.

We begin with a more general objective and then explain how a specific choice of covariance matrix, C, minimizes the variance of the weights. We propose the regularizer:

\[
\begin{aligned}
\min_{\bar{w},\,\xi_1,\ldots,\xi_m}\;\; & \frac{1}{2}\,\bar{w}^{T} C\, \bar{w} + C\sum_{i=1}^{m}\xi_i \\
\text{subject to: } & \forall i,\;\; \xi_i \ge 0 \\
& \forall r \ne y_i,\;\; \bar{w}_{y_i}\cdot\bar{x}^{\,i}_{y_i} - \bar{w}_r\cdot\bar{x}^{\,i}_r \ge 1 - \xi_i
\end{aligned}
\tag{4.6}
\]

where C is a normalized covariance matrix such that ∑_{i,j} C_{i,j} = 0. This ensures that uniform weight vectors receive zero regularization penalty. Since all covariance matrices are positive semi-definite, the quadratic program (QP) remains convex in ¯w, and is thus amenable to general-purpose QP-solvers.

Since the unsupervised system in (4.5) has zero weight variance, the SVM learned in (4.6) should do at least as well as (4.5) as we tune the C-parameter on development data. That is, as C approaches zero, variance minimization becomes the sole objective of (4.6), and uniform weights are produced.

We use covariance matrices of the form:

\[
C = \operatorname{diag}(\bar{p}) - \bar{p}\,\bar{p}^{T}
\tag{4.7}
\]

where diag(¯p) is the matrix constructed by putting ¯p on the main diagonal. Here, ¯p is an arbitrary N-dimensional weighting vector such that ¯p ≥ 0 and ∑_i p_i = 1. The vector ¯p dictates the contribution of each w_i to the mean and variance of the weights in ¯w. It is easy to see that ∑_{i,j} C_{i,j} = ∑_i p_i − ∑_i ∑_j p_i p_j = 0.

³ E.g., recall from Chapter 3 that the true preposition might be better predicted by the counts of patterns that tend to include the preposition's grammatical object, i.e., patterns that include more right-context.
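To make the role of the covariance matrix in (4.7) concrete, the following sketch (not from the thesis; the dimension N and the vectors below are illustrative assumptions) checks numerically that ¯w^T C ¯w equals the ¯p-weighted variance of the weights, that the entries of C sum to zero, and that a uniform weight vector therefore incurs no regularization penalty.

```python
import numpy as np

# Illustrative dimensions and vectors (assumptions, not values from the thesis).
N = 5
rng = np.random.default_rng(0)

p = rng.random(N)
p /= p.sum()                      # p >= 0 and sum_i p_i = 1

C = np.diag(p) - np.outer(p, p)   # Eq. (4.7)

# The entries of C sum to zero, so uniform weight vectors get zero penalty.
assert np.isclose(C.sum(), 0.0)

w = rng.random(N)                 # an arbitrary weight vector
mean_w = p @ w                    # p-weighted mean of the weights
var_w = p @ (w - mean_w) ** 2     # p-weighted variance of the weights

# w^T C w reproduces the p-weighted variance of w.
assert np.isclose(w @ C @ w, var_w)

# A uniform weight vector (the unsupervised solution in (4.5)) has zero
# variance, hence zero regularization penalty.
w_uniform = np.ones(N)
assert np.isclose(w_uniform @ C @ w_uniform, 0.0)
```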

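Since (4.6) is a convex QP, it can be handed to a general-purpose solver. The sketch below is a minimal illustration using the cvxpy modeling library, not the thesis implementation; the toy data, the dimensions, the uniform choice of ¯p, and the value of the C-parameter are all assumptions. The regularizer is written directly as half the ¯p-weighted variance of the weights, which is equivalent to (1/2) ¯w^T C ¯w for C as in (4.7).

```python
import numpy as np
import cvxpy as cp

K = 3          # classes in the confusion set (e.g., prepositions); an assumption
d = 4          # attributes per class sub-vector; an assumption
N = K * d      # total number of weights in w
m = 50         # number of toy training examples
C_param = 1.0  # slack trade-off (the "C-parameter" tuned on development data)

rng = np.random.default_rng(1)

# Toy per-class attribute sub-vectors x^i_r: nonnegative counts, inflated for
# the true class so the data is roughly separable.
X = rng.poisson(1.0, size=(m, K, d)).astype(float)
y = rng.integers(0, K, size=m)
for i in range(m):
    X[i, y[i]] += rng.poisson(2.0, size=d)

# Uniform weighting vector p (an assumption); the regularizer is then the
# ordinary variance of the weights, i.e., Eq. (4.7) with p_i = 1/N.
p = np.full(N, 1.0 / N)

w = cp.Variable(N)    # concatenation of the per-class sub-vectors w_r
xi = cp.Variable(m)   # slack variables

# (1/2) w^T C w with C = diag(p) - p p^T, expressed as half the p-weighted
# variance of w so that cvxpy sees a directly convex expression.
mean_w = p @ w
regularizer = 0.5 * cp.sum(cp.multiply(p, cp.square(w - mean_w)))

constraints = [xi >= 0]
for i in range(m):
    true_score = X[i, y[i]] @ w[y[i] * d:(y[i] + 1) * d]
    for r in range(K):
        if r == y[i]:
            continue
        # The true class must outscore each competitor by a margin of 1 - xi_i.
        constraints.append(true_score - X[i, r] @ w[r * d:(r + 1) * d] >= 1 - xi[i])

problem = cp.Problem(cp.Minimize(regularizer + C_param * cp.sum(xi)), constraints)
problem.solve()

print("learned weights:", np.round(w.value, 3))
```

Driving the C-parameter toward zero in this sketch makes variance minimization the sole objective, so the solution approaches the uniform-weight system of (4.5), matching the behaviour described above.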