We seek weights such that the classifier makes few errors on training data and generalizes well to unseen data. There are $KN$ weights to learn, for the cross-product of attributes and classes. The most common approach is to train $K$ separate one-versus-all binary SVMs, one for each class. The weights learned for the $r$th SVM provide the weights $\bar{W}_r$ in (4.1). We call this approach OvA-SVM. Note that in some settings various one-versus-one strategies may be more effective than one-versus-all [Hsu and Lin, 2002].

The weights can also be found using a single constrained optimization [Vapnik, 1998; Weston and Watkins, 1998]. Following the soft-margin version in [Crammer and Singer, 2001]:

\[
\begin{aligned}
\min_{W,\,\xi_1,\ldots,\xi_M} \quad & \frac{1}{2}\sum_{i=1}^{K} \|\bar{W}_i\|^2 + C\sum_{i=1}^{M} \xi_i \\
\text{subject to:} \quad & \forall i \quad \xi_i \geq 0 \\
& \forall i,\ \forall r \neq y_i, \quad \bar{W}_{y_i} \cdot \bar{x}_i - \bar{W}_r \cdot \bar{x}_i \geq 1 - \xi_i
\end{aligned}
\tag{4.2}
\]

The constraints require the correct class to be scored higher than the other classes by a certain margin, with slack for non-separable cases. Minimizing the weights is a form of regularization. Tuning the C-parameter controls the emphasis on regularization versus separation of the training examples.

We call this the K-SVM. The K-SVM outperformed the OvA-SVM in [Crammer and Singer, 2001], but see [Rifkin and Klautau, 2004]. The popularity of K-SVM is partly due to convenience; it is included in popular SVM software such as SVM-multiclass¹ and LIBLINEAR [Fan et al., 2008].

Note that with two classes, K-SVM is less efficient than a standard binary SVM. A binary classifier outputs class 1 if $\bar{w} \cdot \bar{x} > 0$ and class 2 otherwise. The K-SVM encodes a binary classifier using $\bar{W}_1 = \bar{w}$ and $\bar{W}_2 = -\bar{w}$, therefore requiring twice the memory of a binary SVM. However, both the binary and the 2-class formulations have the same solution [Weston and Watkins, 1998].

Web-Scale N-gram K-SVM

K-SVM was used to combine the N-gram counts in Chapter 3. This was the SUPERLM model. Recall that for preposition selection, attributes were web counts of patterns filled with 34 prepositions, corresponding to the 34 classes. Each preposition serves as the filler of each context pattern. Fourteen patterns were used for each filler: all five 5-grams, four 4-grams, three 3-grams, and two 2-grams spanning the position to be predicted. There are $N = 14 \times 34 = 476$ total attributes, and therefore $KN = 476 \times 34 = 16184$ weights in the $W$ matrix.

Figure 4.1 depicts the optimization problem for the preposition-selection classifier. For the $i$th training example, the optimizer must set the weights such that the score for the true class (from) is higher than the scores of all the other classes by a margin of 1. Otherwise, it must use the slack variable, $\xi_i$. The score is the dot product of the preposition-specific weights, $\bar{W}_r$, and all the features, $\bar{x}_i$. For illustration, seven of the thirty-four total classes are depicted. Note that these constraints must be satisfied collectively across all training examples.

A K-SVM classifier can potentially exploit very subtle information for this task. Let $\bar{W}_{in}$ and $\bar{W}_{before}$ be the weights for the classes in and before. Notice some of the attributes

¹ http://svmlight.joachims.org/svm_multiclass.html
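As a rough illustration of the two training strategies discussed above, the following sketch fits both an OvA-SVM and a Crammer-Singer K-SVM using LIBLINEAR's solvers, here accessed through scikit-learn's LinearSVC as a convenience; this is not necessarily the software used for the SUPERLM experiments. The matrix dimensions mirror the preposition-selection setup (N = 476 attributes, K = 34 classes), but the data itself is random placeholder input rather than real web counts.

# Minimal sketch (not the thesis implementation): OvA-SVM vs. the
# Crammer-Singer K-SVM of (4.2), trained with LIBLINEAR via scikit-learn.
# Feature values are random placeholders; in the actual task each column
# would hold a web count for one of the 14 patterns x 34 preposition fillers.
import numpy as np
from sklearn.svm import LinearSVC

M, N, K = 1000, 476, 34                   # training examples, attributes, classes
rng = np.random.default_rng(0)
X = rng.random((M, N))                    # placeholder for N-gram count features
y = rng.integers(0, K, size=M)            # placeholder preposition labels

# K separate one-versus-all binary SVMs (OvA-SVM).
ova = LinearSVC(multi_class="ovr", C=1.0).fit(X, y)

# Single joint optimization with margin constraints over all classes (K-SVM).
ksvm = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)

# Both produce a K x N weight matrix W; prediction is argmax_r of W_r . x_i.
print(ova.coef_.shape, ksvm.coef_.shape)  # (34, 476) (34, 476)

In both cases the C-parameter plays the role described above, trading off regularization of the weights against separation of the training examples.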
