Having the largest possible margin (or, equivalently, the smallest possible weight vector subject to the constraints) that classifies the training examples correctly seems to be a good idea, as it is most likely to generalize to new data. Once again, consider text categorization. We may have a feature for each word in each document. There are enough words and few enough documents that our training algorithm could possibly get all the training examples classified correctly if it just puts all the weight on the rare words in each document. So if Obama occurs in a single sports document in our training set, but nowhere else in the training set, our algorithm could get that document classified correctly if it were to put all its weight on the word Obama and ignore the other features. Although this approach would do well on the training set, it will likely not generalize well to unseen documents. It's likely not the maximum-margin (smallest weight vector) solution. If we can instead separate the positive and negative examples using more-frequent words like score and win and teams, then we should do so. We will use fewer weights overall, and the weight vector will have a smaller norm (fewer weights will be non-zero). It intuitively seems like a good idea to rely on more frequent words to make decisions, and the SVM optimization just encodes this intuition in a theoretically well-grounded formulation (it's all based on 'empirical risk minimization' [Vapnik, 1998]).

Sometimes, the positive and negative examples are not separable, and there will be no solution to the above optimization. At other times, even if the data is separable, it may be better to turn the hard constraints in the above equation into soft preferences, and place even greater emphasis on using the frequent features. That is, we may wish to have a weight vector with a small norm even at the expense of not separating the data. In terms of categorizing sports documents, words like score and win and teams may sometimes occur in non-sports documents in the training set (so we may get some training documents wrong if we put positive weight on them), but they are a better bet for getting test documents correct than putting high weight on rare words like Obama (blindly enforcing separability). Geometrically, we can view this as saying we might want to allow some points to lie on the opposite side of the hyperplane (or at least closer to it), if we can do this with weights on fewer dimensions.

[Cortes and Vapnik, 1995] give the optimization program for a soft-margin SVM as:

\[
\min_{\bar{w},\, \xi_1, \ldots, \xi_m} \; \frac{1}{2}\|\bar{w}\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to:} \quad \forall i,\ \xi_i \geq 0, \quad y_i(\bar{w} \cdot \bar{x}_i) \geq 1 - \xi_i
\tag{2.3}
\]

The ξ_i values are known as the slacks. Each example may use some slack. The classification must either be separable and satisfy the margin constraint (in which case ξ_i = 0) or it may instead use its slack to satisfy the inequality. The weighted sum of the slacks is minimized along with the norm of \bar{w}.

The relative importance of the slacks (getting the training examples separated nicely) versus the minimization of the weights (using more general features) is controlled by tuning C. If the feature weights learned by the algorithm are the parameters, then this C value is known as a hyperparameter, since it is set separately from the regular parameter learning.
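To make Equation (2.3) and the role of C concrete, the following minimal sketch (not from the thesis) evaluates the soft-margin objective for a given weight vector. It uses the standard observation that, at the optimum, each slack ξ_i equals the hinge loss max(0, 1 − y_i(\bar{w} · \bar{x}_i)); the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def soft_margin_objective(w, X, y, C):
    """Value of the soft-margin objective in Equation (2.3).

    At the optimum each slack xi_i equals the hinge loss
    max(0, 1 - y_i * (w . x_i)), so we compute the slacks directly
    instead of treating them as free variables.
    """
    margins = y * (X @ w)                     # y_i * (w . x_i) for every example
    slacks = np.maximum(0.0, 1.0 - margins)   # xi_i = 0 when the margin constraint is met
    return 0.5 * np.dot(w, w) + C * slacks.sum()

# Toy data (hypothetical): two features, labels in {-1, +1}.
X = np.array([[2.0, 0.0], [0.5, 1.0], [-1.5, -0.5], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([0.8, -0.2])

# Larger C puts more emphasis on satisfying the margin constraints;
# smaller C favours a small-norm weight vector even at the cost of slack.
for C in (0.1, 1.0, 10.0):
    print(C, soft_margin_objective(w, X, y, C))
```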
The general practice is to try various values for this hyperparameter, and choose the one that gets the highest performance on the development set. In an SVM, this hyperparameter is known as the regularization parameter. It controls how much we penalize training vectors that lie on the opposite side of the hyperplane (with distance given by
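The tuning procedure described above might look like the following sketch: train a linear SVM for each candidate value of C and keep the one with the best development-set accuracy. The use of scikit-learn's LinearSVC and the particular grid of C values are assumptions for illustration, not choices made in the thesis.

```python
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

def tune_C(X_train, y_train, X_dev, y_dev, grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Choose the regularization parameter C by development-set accuracy."""
    best_C, best_acc, best_model = None, -1.0, None
    for C in grid:
        model = LinearSVC(C=C)           # soft-margin linear SVM
        model.fit(X_train, y_train)      # learn the feature weights (the parameters)
        acc = accuracy_score(y_dev, model.predict(X_dev))
        if acc > best_acc:               # keep the C with the best dev performance
            best_C, best_acc, best_model = C, acc, model
    return best_C, best_acc, best_model
```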
