
problems. However, they do not work well in all situations. Suppose we wanted a binary classifier to tell us whether a given Canadian city is in Manitoba or not. Suppose we had only one feature: distance from the Pacific Ocean. It would be difficult to choose a weight and a threshold for a linear classifier such that we could separate Manitoban cities from other Canadian cities with only this one feature. If we took cities below a threshold, we could get all cities west of Ontario. If we took those above, we could get all cities east of Saskatchewan. We would say that the positive and negative examples are not separable using this feature representation; the positives and negatives can't be placed nicely onto either side of a hyperplane. There are lots of non-linear classifiers to choose from that might do better.

On the other hand, we're always free to choose whatever feature function we like; for most problems, we can just choose a feature space that does work well with linear classifiers (i.e., a feature space that perhaps does make the training data separable). We could divide distance-from-Pacific-Ocean into multiple features: say, a binary feature if the distance is between 0 and 100 km, another if it's between 100 and 200, etc. Also, many learning algorithms permit us to use the kernel trick, which maps the feature vectors into an implicit higher-dimensional space where a linear hyperplane can better divide the classes. We return to this point briefly in the following section. For many natural language problems, we have thousands of relevant features, and good classification is possible with linear classifiers. Generally, the more features, the more separable the examples.

2.3 Supervised Learning

In this section, we provide a very practical discussion of how the parameters of the linear classifier are chosen. This is the NLP view of machine learning: what you need to know to use it as a tool.

2.3.1 Experimental Set-up

The proper set-up is to have at least three sets of labeled data when designing a supervised machine learning system. First, you have a training set, which you use to learn your model (yet another word that means the same thing as the weights or parameters: the model is the set of weights). Secondly, you have a development set, which serves two roles: a) you can set any of your algorithm's hyperparameters on this set (hyperparameters are discussed below), and b) you can test your system on this set as you are developing. Rather than having a single development set, you could optimize your parameters by ten-fold cross-validation on the training set, essentially re-using the training data to set development parameters. Finally, you have a hold-out set or test set of unseen data which you use for your final evaluation. You only evaluate on the test set once, to generate the final results of your experiments for your paper. This simulates how your algorithm would actually be used in practice: classifying data it has not seen before.

To run machine learning in this framework, we typically begin by converting the three sets into feature vectors and labels. We then supply the training set, in labeled feature-vector format, to a standard software package, and this package returns the weights. The package can also be used to multiply the feature vectors by the weights, and return the classification decisions for new examples. It thus can and often does calculate performance on the development or test sets for you.
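For concreteness, this workflow might look as follows with an off-the-shelf package (a minimal sketch assuming scikit-learn; the feature names, distance bins, and toy examples are invented for illustration):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data for the running example: binary features indicating which
# distance-from-Pacific-Ocean bin a city falls into (bins are invented).
train_feats = [{"dist_0_100km": 1},
               {"dist_600_700km": 1},
               {"dist_1800_1900km": 1}]
train_labels = ["not-manitoba", "not-manitoba", "manitoba"]
dev_feats = [{"dist_1800_1900km": 1}]
dev_labels = ["manitoba"]

# Convert the feature dictionaries into feature vectors.
vec = DictVectorizer()
X_train = vec.fit_transform(train_feats)
X_dev = vec.transform(dev_feats)

# The package learns the weights (the "model") from the training set...
clf = LogisticRegression()
clf.fit(X_train, train_labels)

# ...and can apply them to new examples and score the development set.
print(clf.predict(X_dev))              # predicted labels
print(clf.score(X_dev, dev_labels))    # accuracy on the development set
```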

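The ten-fold cross-validation alternative mentioned above is also typically handled by the package. Here is a sketch of using it to choose a regularization hyperparameter (again assuming scikit-learn; the toy data and the grid of C values are invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Invented toy data: 100 examples, each with 5 binary features; the label
# depends only on the first two features.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(100, 5))
y = X[:, 0] | X[:, 1]

# Score each candidate regularization strength C by ten-fold
# cross-validation on the training data alone; the held-out test set is
# never touched until the single final evaluation.
search = GridSearchCV(LogisticRegression(),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                      cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```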
The above experimental set-up is sometimes referred to as a batch learning approach, because the algorithm is given the entire training set at once. A typical algorithm learns a single, static model using the entire training set in one training session (remember: for a linear classifier, by model we just mean the set of weights). This is the approach taken by SVMs and maximum entropy models. This is clearly different from how humans learn; we adapt over time as new data is presented. Alternatively, an online learning algorithm is one that is presented with training examples in sequence. Online learning iteratively re-estimates the model each time a new training instance is encountered. The perceptron is the classic example of an online learning approach, while currently MIRA [Crammer and Singer, 2003; Crammer et al., 2006] is a popular maximum-margin online learner (see Section 2.3.4 for more on max-margin classifiers). In practice, there is little difference between how batch and online learners are used; if new training examples become available to a batch learner, the new examples can simply be added to the existing training set and the model can be re-trained on the old-plus-new combined data as another batch process.
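To make the online setting concrete, the classic perceptron update can be sketched as follows (an illustrative Python version, not code from this dissertation; the (x, y) example format is an assumption):

```python
import numpy as np

def perceptron_train(examples, n_features, epochs=5):
    """Online perceptron: the model is re-estimated after each mistake.

    examples: a list of (x, y) pairs, where x is a NumPy feature vector
    and y is +1 or -1. The returned weights and bias are the model.
    """
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            # A mistake: the current weights put this example on the
            # wrong side of the hyperplane (or exactly on it).
            if y * (np.dot(w, x) + b) <= 0:
                w += y * x   # nudge the weights toward the correct label
                b += y
    return w, b
```

Because the loop touches one example at a time, the same code works whether the examples form a fixed training set or arrive as a stream.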
It is also worth mentioning another learning paradigm known as active learning [Cohn et al., 1994; Tong and Koller, 2002]. Here the learner does not simply train passively on whatever labeled data is available; rather, the learner can request that specific examples be labeled if it deems that adding these examples to the training set will most improve the classifier's predictive power. Active learning could potentially be used in conjunction with the techniques in this dissertation to get the most benefit out of the smallest amount of training data possible.

2.3.2 Evaluation Measures

Performance is often evaluated in terms of accuracy: what percentage of examples did we classify correctly? For example, if our decision is whether a document is about sports or not (i.e., sports is the positive class), then accuracy is the percentage of documents that are correctly labeled as sports or non-sports. Note it is difficult to compare the accuracy of classifiers across tasks, because typically the class balance strongly affects the achievable accuracy. For example, suppose there are 100 documents in our test set, and only five of these are sports documents. Then a system could trivially achieve 95% accuracy by assigning every document the non-sports label. 95% might be much harder to obtain on another task with a 50-50 balance of the positive and negative classes. Accuracy is most useful as a measure when the performance of the proposed system is compared to a baseline: a reasonable, simple, and perhaps even trivial classifier, such as one that picks the majority class (the most frequent class in the training data). We use baselines whenever we state accuracy in this dissertation.

Accuracy also does not tell us whether our classifier is predicting one class disproportionately more often than another (that is, whether it has a bias). Statistical measures that do identify classifier biases are Precision, Recall, and F-Score. These measures are used together extensively in classifier evaluation.² Again, suppose sports is the class we're predicting. Precision tells us: of the documents that our classifier predicted to be sports, what percentage are actually sports? That is, precision is the ratio of true positives (elements we predicted to be of the positive class that truly are positive, where sports is the positive class in our running example) divided by the sum of true positives and false positives (together, all the examples we predicted to be positive).

² Wikipedia has a detailed discussion of these measures: http://en.wikipedia.org/wiki/Precision_and_recall
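As a toy illustration (the helper function and label lists below are invented for this sketch), precision can be computed directly from the gold and predicted labels, alongside the standard definitions of recall (true positives divided by true positives plus false negatives) and the balanced F-score (the harmonic mean of precision and recall):

```python
def precision_recall_f(gold, pred, positive="sports"):
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted sports, how many truly are
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of true sports, how many we found
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return precision, recall, f_score

# The biased system from the accuracy example above: it labels all 100
# documents non-sports, so it is 95% accurate but finds no sports at all.
gold = ["sports"] * 5 + ["non-sports"] * 95
pred = ["non-sports"] * 100
print(precision_recall_f(gold, pred))   # (0.0, 0.0, 0.0)
```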

