and discriminative models therefore rests solely in how the weights are chosen. Rather than choosing weights that best fit the generative model on the training data (and satisfy the model's simplifying assumptions, typically concerning the interdependence or independence of different features), a discriminative model chooses the weights that best attain the desired objective: better predictions [Fung and Roth, 2005]. Discriminative models thus tend to perform better, and are correspondingly the preferred approach today in many areas of NLP (including increasingly in semantics, where we recently proposed a discriminative approach to selectional preference; Chapter 6). Unlike generative approaches, discriminative algorithms generally let us use arbitrary and interdependent features in our model without worrying about modeling such interdependencies. Use of the word discriminative in NLP has thus come to indicate both an approach that optimizes for classification accuracy directly and one that uses a wide variety of features. In fact, one kind of feature you might use in a discriminative system is the prediction or output of a generative model. This illustrates another advantage of discriminative learning: competing approaches can always be included as new features.

Note that these clear advantages of discriminative models really only hold for supervised learning in NLP. A growing number of generative, Bayesian, unsupervised algorithms are now being developed. The pendulum may soon swing back and generative models may again dominate the supervised playing field as well, particularly if they can provide principled ways to incorporate unlabeled data into a semi-supervised framework.
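To make the feature-combination point concrete, the following minimal Python sketch (not from the thesis; it assumes NumPy and scikit-learn, and uses toy data and arbitrary model choices for illustration) appends a generative model's predicted log-probability as one extra feature in a discriminative classifier:

```python
# A minimal sketch, assuming scikit-learn: use a generative model's output
# as one feature among many in a discriminative classifier. Data and labels
# below are synthetic and purely illustrative.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: 200 examples with 20 count-valued features (e.g., word counts).
X = rng.poisson(1.0, size=(200, 20))
y = (X[:, 0] + X[:, 1] > X[:, 2]).astype(int)  # arbitrary toy labels

# Generative model (Naive Bayes); its log P(y=1 | x) becomes a new feature.
# In practice this feature would be computed on held-out folds to avoid
# overfitting; here it is computed on the training data for brevity.
nb = MultinomialNB().fit(X, y)
nb_feature = nb.predict_log_proba(X)[:, 1:2]

# Discriminative model over the original, interdependent features plus the
# generative model's prediction as one additional feature.
X_aug = np.hstack([X, nb_feature])
clf = LogisticRegression(max_iter=1000).fit(X_aug, y)
print("training accuracy:", clf.score(X_aug, y))
```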
2.3.4 Support Vector Machines

When you have lots of features and lots of examples, support vector machines (SVMs) [Cortes and Vapnik, 1995] seem to be the best discriminative approach. One reason might be that they perform well in situations, like natural language, where many features are relevant [Joachims, 1999a], as opposed to situations where a few key indicators may be sufficient for prediction. Conceptually, SVMs take a geometric view of the problem, as depicted in Figure 2.1. The training algorithm chooses the hyperplane location such that it is maximally far away from the closest positive and negative points on either side of it (this is known as the max-margin solution). These closest vectors are known as support vectors; the hyperplane can be reconstructed from this set of vectors alone, hence the name support vector machine. In fact, Figure 2.1 depicts the hyperplane that would be learned by an SVM, with marks on the corresponding support vectors.

It can be shown that the hyperplane that maximizes the margin corresponds to the weight vector that solves the following constrained optimization problem:

\min_{\bar{w}} \; \frac{1}{2} \|\bar{w}\|^2 \quad \text{subject to: } \forall i,\; y_i (\bar{w} \cdot \bar{x}_i) \geq 1 \qquad (2.2)

where \|\bar{w}\| is the Euclidean norm of the weight vector. Note that \|\bar{w}\|^2 = \bar{w} \cdot \bar{w}. The factor of 1/2 is a mathematical convenience so that the coefficient disappears when we take the derivative. The optimization says that we want to find the smallest weight vector (in terms of its Euclidean norm) such that our linear classifier's output (h(\bar{x}) = \bar{w} \cdot \bar{x}) is at least 1 when the correct label is the positive class (y = +1), and at most -1 when the correct label is the negative class (y = -1). The constraint in Equation 2.2 is a succinct way of writing these two conditions in one line.
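The following minimal Python sketch (not part of the thesis; it assumes NumPy and scikit-learn) illustrates the max-margin solution of Equation 2.2 on a toy two-dimensional dataset. A very large C approximates the hard-margin objective; note that scikit-learn's SVC also fits a bias term, which Equation 2.2 omits.

```python
# A minimal sketch, assuming scikit-learn: recover the max-margin hyperplane
# and its support vectors on linearly separable toy data.
import numpy as np
from sklearn.svm import SVC

# Toy 2-D training set: three positive and three negative examples.
X = np.array([[2.0, 2.0], [3.0, 1.5], [1.0, 3.0],
              [-2.0, -1.0], [-1.5, -2.5], [-3.0, -0.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A large C approximates the hard-margin problem of Equation 2.2:
# minimize (1/2)||w||^2 subject to y_i (w . x_i) >= 1 for all i.
# (Unlike Equation 2.2, SVC also learns an intercept term.)
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                      # learned weight vector w
margin_width = 2.0 / np.linalg.norm(w)  # distance between the margin hyperplanes
print("weights:", w)
print("support vectors:", clf.support_vectors_)
print("margin width:", margin_width)
```

Only the support vectors printed above constrain the solution; removing any of the other training points would leave the learned hyperplane unchanged.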
