Figure 2.1: The linear classifier hyperplane (as given by an SVM, with support vectors indicated)

when the dot product is computed (here between three weights and three features). For our sports example, we would hope that the algorithm would learn, for example, that curling should get a positive weight, since documents that contain the word curling are usually about sports. It should assign a fairly low weight, perhaps zero weight, to the word the, since this word doesn't have much to say one way or the other. Choosing an appropriate weight for the Obama feature is left as an exercise for the reader. Note that weights can be negative. Section 2.3 has more details on some of the different algorithms that learn the weights.

If we take a geometric view, and think of the feature vectors as points in N-dimensional space, then learning the weights can also be thought of as learning a separating hyperplane. Once we have any classifier, then all feature vectors that get positive scores will be in one region of space, and all the feature vectors that get negative scores will be in another. With a linear classifier, a hyperplane will divide these two regions. Figure 2.1 depicts this set-up in two dimensions, with the points of one class on the left, the points for the other class on the right, and the dividing hyperplane as a bar down the middle.¹

In this discussion, we've focused on binary classification: is the document about sports or not? In many practical applications, however, we have more than two categories, e.g. sports, finance, politics, etc. It's fairly easy to adapt the binary linear classifier to the multiclass case. For K classes, one common approach is the one-versus-all strategy: we have K binary classifiers that each predict whether a document is part of a given category or not. Thus we might classify a document about Obama going curling as both a sports and a politics document. In cases where only one category is possible (i.e., the classes are mutually exclusive, such as the restriction that each word have only one part-of-speech tag), we could take the highest-scoring classifier (the highest h(x̄)) as the class. There are also multiclass classifiers, like the approach we use in Chapter 3, that essentially jointly optimize the K classifiers (e.g., [Crammer and Singer, 2001]). Chapter 4 defines and evaluates various multi-class learning approaches.

A final point to address: should we be using a linear classifier for our problems at all? Linear classifiers are very simple, extremely fast, and work very well on a range of

¹ From: www.stat.columbia.edu/~cook/movabletype/archives/2006/02/interesting_cas_1.html
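To make the dot-product scoring and the one-versus-all strategy concrete, here is a minimal Python sketch. The feature names, weight values, and helper names (score, class_weights) are illustrative assumptions for the running sports/politics example, not weights learned by any system described in this thesis.

    # Minimal sketch: linear classifier scoring and one-versus-all prediction.
    # All weight values below are made up for illustration.

    def score(weights, features):
        """Dot product between a weight vector and a feature vector,
        both represented as {feature_name: value} dictionaries."""
        return sum(weights.get(f, 0.0) * v for f, v in features.items())

    # One binary classifier (weight vector) per class, as in one-versus-all.
    class_weights = {
        "sports":   {"curling": 2.0,  "the": 0.0, "obama": -0.5},
        "politics": {"curling": -0.3, "the": 0.0, "obama": 1.8},
        "finance":  {"curling": -1.0, "the": 0.0, "obama": 0.2},
    }

    # Bag-of-words counts for a document about Obama going curling.
    doc = {"obama": 1, "curling": 2, "the": 5}

    # Every classifier with a positive score would label the document with its
    # class; if the classes are mutually exclusive, take the highest score.
    scores = {c: score(w, doc) for c, w in class_weights.items()}
    predicted = max(scores, key=scores.get)
    print(scores)     # {'sports': 3.5, 'politics': 1.2, 'finance': -1.8}
    print(predicted)  # 'sports'

With these toy weights the document gets positive scores from both the sports and politics classifiers, mirroring the point above that a one-versus-all setup can assign multiple labels unless we force a single choice by taking the argmax.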
