Figure 2.2: Learning from labeled and unlabeled examples, from Zhu (2005).

Consider Figure 2.2. In the typical inductive set-up, we would design our classifier based purely on the labeled points for the two classes: the o's and +'s. We would draw the best hyperplane to separate these labeled vectors. However, when we look at all the dots that do not have labels, we may wish to draw a different hyperplane. It appears that there are two clusters of data, one on the left and one on the right. Drawing a hyperplane down the middle would appear to be the optimum choice for separating the two classes. This is only apparent after inspecting unlabeled examples.

We can always train a classifier using both labeled and unlabeled examples in the transductive set-up, but then apply the classifier to unseen data in an inductive evaluation. So in some sense we can group other semi-supervised approaches that make use of labeled and unlabeled examples into this category (e.g. work by Wang et al. [2008]), even if they are not applied transductively per se.

There are many computational algorithms that can make use of unlabeled examples when learning the separating hyperplane. The intuition behind them is to say something like: of all combinations of possible labels on the unseen examples, find the overall best separating hyperplane. Thus, in some sense we pretend we know the labels on the unlabeled data, and use these labels to train our model via traditional supervised learning. In most semi-supervised algorithms, we either implicitly or explicitly generate labels for unlabeled data in a conceptually similar fashion, to (hopefully) enhance the data we use to train the classifier.

These approaches are not applicable to the problems that we wish to tackle in this dissertation, mainly for reasons of practicality. We want to leverage huge volumes of unlabeled data: all the data on the web, if possible. Most transductive algorithms cannot scale to this many examples. Another potential problem is that for many NLP applications, the space of possible labels is simply too large to enumerate. For example, work in parsing aims to produce a tree indicating the syntactic relationships of the words in a sentence. Church and Patil [1982] show that the number of possible binary trees grows with the Catalan numbers. For twenty-word sentences, there are billions of possible trees. We are currently exploring linguistically-motivated ways to perform a high-precision pruning of the output space for
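To make the "pretend we know the labels" intuition above concrete, here is a minimal self-training sketch: fit a classifier on the labeled points, adopt its confident predictions on unlabeled points as labels, and retrain on the enlarged set. This is an illustration only (the function name and the 0.95 confidence threshold are my own choices), and it is a greedy stand-in rather than the joint transductive optimization discussed in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, max_rounds=10):
    """Iteratively adopt confident predictions on unlabeled data as labels."""
    X, y = X_labeled.copy(), y_labeled.copy()
    pool = X_unlabeled.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(X, y)
        if len(pool) == 0:
            break
        probs = clf.predict_proba(pool)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():
            break  # the model is not sure about anything that remains
        # Adopt the model's own predictions as labels for the confident points.
        guessed = clf.classes_[probs[confident].argmax(axis=1)]
        X = np.vstack([X, pool[confident]])
        y = np.concatenate([y, guessed])
        pool = pool[~confident]
    return clf
```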

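As a rough check of the Catalan-number growth cited above, the sketch below counts binary bracketings over a sentence of n words (n leaves), which is the Catalan number C_{n-1}; num_binary_trees is my own helper name.

```python
from math import comb

def num_binary_trees(n_words):
    """Catalan number C_{n-1}: distinct binary trees over n_words leaves."""
    n = n_words - 1
    return comb(2 * n, n) // (n + 1)

print(num_binary_trees(20))  # 1767263190 -- already in the billions
```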