and heuristic approaches are now being called unsupervised, since they are not based on learning from labeled data. For example, in Chapter 1, Section 1.1, we discussed how a part-of-speech tagger could be based on linguistic rules. A rule-based tagger could in some sense be considered unsupervised, since a human presumably created the rules from intuition, not from labeled data. However, since the human probably looked at some data to come up with the rules (a textbook, maybe?), calling this unsupervised is a little misleading from a machine learning perspective. Most people would probably simply call this a “rule-based approach.” In Chapter 3, we propose unsupervised systems for lexical disambiguation, where a designer need only specify the words that are correlated with the classes of interest, rather than label any training data. We also discuss previous approaches that use counts derived from Internet search engine results. These approaches have usually been unsupervised.

From a machine learning perspective, true unsupervised approaches are those that induce output structure from properties of the problem, with guidance from probabilistic models rather than human intuition. We can illustrate this concept most clearly again with the example of document classification. Suppose we know there are two classes: documents about sports, and documents that are not about sports. We can generate the feature vectors as discussed above, and then simply form two groups of vectors such that members of each group are close to each other (in terms of Euclidean distance) in N-dimensional space. New feature vectors can be assigned to whatever group or cluster they are closest to. The points closest to one cluster will be separated from points closest to the other cluster by a hyperplane in N-dimensional space. Where there’s a hyperplane, there’s a corresponding linear classifier with a set of weights, so clustering can learn a linear classifier as well (a small sketch of this equivalence is given at the end of this section). We don’t know what the clusters represent, but hopefully one of them has all the sports documents (if we inspect the clusters and define one of them as the sports class, we’re essentially doing a form of semi-supervised learning).

Clustering can also be regarded as an “exploratory science” that seeks to discover useful patterns and structures in data [Pantel, 2003]. This structure might later be exploited for other forms of language processing; later we will see how clustering can be used to provide helpful feature information for supervised classifiers (Section 2.5.5).

Clustering is the simplest unsupervised learning algorithm. In more complicated setups, we can define a probability model over our features (and possibly over other hidden variables), and then try to learn the parameters of the model such that our unlabeled data has a high likelihood under this model. We previously used such a technique to train a pronoun resolution system using expectation maximization [Cherry and Bergsma, 2005]. Similar techniques can be used to train hidden Markov models and other generative models. These models can provide a very nice way to incorporate lots of unlabeled data. In some sense, however, doing anything beyond an HMM requires one to be a bit of a probabilistic-modeling guru. The more features you incorporate in the model, the more you have to account for the interdependence of these features explicitly in your model.
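As a purely illustrative sketch of fitting a model so that unlabeled data has high likelihood, the following code runs expectation maximization on a two-component Naive Bayes (Bernoulli mixture) model over binary bag-of-words vectors. The random data, the two-component assumption, and the feature dimensionality are chosen only for illustration; this is not the model used in this thesis.

    import numpy as np

    # Minimal EM sketch: a two-component Bernoulli mixture (Naive Bayes with a
    # hidden class variable) fit to unlabeled binary bag-of-words vectors.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 50)).astype(float)  # 200 unlabeled "documents"

    K, eps = 2, 1e-6
    pi = np.full(K, 1.0 / K)                                # mixing weights P(class)
    theta = rng.uniform(0.25, 0.75, size=(K, X.shape[1]))   # P(feature = 1 | class)

    for _ in range(50):
        # E-step: posterior responsibility of each component for each document
        log_p = np.log(pi) + X @ np.log(theta + eps).T + (1 - X) @ np.log(1 - theta + eps).T
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters to raise the likelihood of the unlabeled data
        Nk = resp.sum(axis=0)
        pi = Nk / len(X)
        theta = (resp.T @ X + eps) / (Nk[:, None] + 2 * eps)

    # After EM, each document can be soft-assigned to its most probable component.
    labels = resp.argmax(axis=1)

Adding more features to such a model means either assuming they are conditionally independent given the class (as this sketch does) or modeling their interactions explicitly, which is exactly where the difficulty discussed above arises.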
Some assumptions you make may not be valid and may impair performance. It’s hard to know exactly what’s wrong with your model, and how to change it to make it better. Also, when setting the parameters of your model using clustering or expectation maximization, you might reach only a local optimum, from which the algorithm can proceed no further to better settings under your model (and you have no idea you’ve reached this point). But since these algorithms are not optimizing discriminative performance anyway, it’s not clear you want the global maximum even if you can find it.
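The sketch promised above makes the clustering-as-linear-classifier point concrete. It is a minimal illustration, not code from this thesis: random two-cloud data, a bare-bones K=2 k-means loop, and a check that the nearest-centroid rule agrees with the sign of a linear function whose weights are derived from the two centroids.

    import numpy as np

    # Nearest-centroid assignment under Euclidean distance is a linear classifier:
    # ||x - c0||^2 < ||x - c1||^2  iff  w.x + b > 0,
    # with w = 2(c0 - c1) and b = ||c1||^2 - ||c0||^2.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (100, 20)),    # one cloud of feature vectors
                   rng.normal(2.0, 1.0, (100, 20))])   # a second, shifted cloud

    # Bare-bones k-means with K=2: group vectors so that members of each group are close.
    centroids = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(20):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        centroids = np.array([X[assign == k].mean(axis=0) if (assign == k).any() else centroids[k]
                              for k in range(2)])

    # The hyperplane separating the two clusters gives an equivalent set of weights.
    c0, c1 = centroids
    w = 2 * (c0 - c1)
    b = np.dot(c1, c1) - np.dot(c0, c0)

    x_new = rng.normal(1.0, 1.0, 20)
    by_distance = 0 if np.linalg.norm(x_new - c0) < np.linalg.norm(x_new - c1) else 1
    by_hyperplane = 0 if np.dot(w, x_new) + b > 0 else 1
    assert by_distance == by_hyperplane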
