Large-Scale Semi-Supervised Learning for Natural Language ...

which used some manual intervention, but later approaches have essentially differed quite little from her original proposal. Google co-founder Sergey Brin [1998] used a similar technique to extract relations such as (author, title) from the web. Similar work was also presented in [Riloff and Jones, 1999] and [Agichtein and Gravano, 2000]. Pantel and Pennacchiotti [2006] used this approach to extract general semantic relations (such as part-of, succession, production, etc.), while Paşca et al. [2006] present extraction results on a web-scale corpus. Another famous variation of this method is Ravichandran and Hovy's system for finding patterns for answering questions [Ravichandran and Hovy, 2002]. They begin with seeds such as (Mozart, 1756) and use these to find patterns that contain the answers to questions such as When was X born?

Note the contrast with the traditional supervised machine-learning framework, where we would have annotators mark up text with examples of hypernyms, relations, question-answer pairs, etc., and then learn a predictor from these labeled examples using supervised learning. In bootstrapping from seeds, we do not label segments of text, but rather pairs of words (labeling only one view of the problem). When we find instances of these pairs in text, we essentially label more data automatically, and then infer a context-based predictor from this labeled set. This context-based predictor can then be used to find more examples of the relation of interest (hypernyms, authors of books, question-answer pairs, etc.). Notice, however, that in contrast to standard supervised learning, we do not label any negative examples, only positive instances. Thus, when building a context-based predictor, there is no obvious way to exploit our powerful machinery for feature-based discriminative learning and classification. Very simple methods are instead used to keep track of the best context-based patterns for identifying new examples in text.

In iterative bootstrapping, although the first round of training often produces reasonable results, things often go wrong in later iterations. The first round will inevitably produce some noise: some wrong pairs extracted by the predictor. The contexts extracted from these false predictions will lead to more false pairs being extracted, and so on. In all published research on this topic that we are aware of, the precision of the extractions decreases in each stage.

2.5.4 Learning with Heuristically-Labeled Examples

In the above discussion of bootstrapping, we outlined a number of approaches that extend an existing set of classifications (or seeds) by iteratively classifying and learning from new examples. Another interesting, non-iterative scenario is the situation where, rather than having a few seed examples, we begin with many positive examples of a class or relation, and attempt to classify new relations in this context. With a relatively comprehensive set of seeds, there is little value in iterating to obtain more.⁶ A large set of seeds also provides a way to generate the negative examples we need for discriminative learning. In this section we look at two flavours: special cases where the examples can be created automatically, and cases where we have only positive seeds, and so create pseudo-negative examples through some heuristic means.

⁶ There are also non-iterative approaches that start with limited seed data. Haghighi and Klein [2006] create a generative, unsupervised sequence prediction model, but add features to indicate if a word to be classified is distributionally similar to a seed word. Like the approaches presented in our discussion of bootstrapping with seeds, this system achieves impressive results starting with very little manually-provided information.
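As a concrete illustration, the seed-driven bootstrapping loop discussed above can be sketched in a few lines. This is a deliberately simplified, hypothetical version: the corpus, the seed pair, and the function names are toy stand-ins, patterns are whole sentences with the seed terms replaced by slots, and none of the pattern scoring and filtering that real systems like Ravichandran and Hovy's rely on is included.

```python
import re

def find_patterns(corpus, seeds):
    """Turn every sentence containing both halves of a seed pair
    into a slotted pattern, e.g. '<X> was born in <Y> .'"""
    patterns = set()
    for sentence in corpus:
        for x, y in seeds:
            if x in sentence and y in sentence:
                patterns.add(sentence.replace(x, "<X>").replace(y, "<Y>"))
    return patterns

def apply_patterns(corpus, patterns):
    """Match the slotted patterns against the corpus to harvest
    new candidate pairs for the relation."""
    pairs = set()
    for pattern in patterns:
        # Escape regex metacharacters in the pattern, then turn the
        # slots into capture groups.
        regex = re.escape(pattern).replace("<X>", r"(\w+)").replace("<Y>", r"(\w+)")
        for sentence in corpus:
            match = re.search(regex, sentence)
            if match:
                pairs.add((match.group(1), match.group(2)))
    return pairs

corpus = ["Mozart was born in 1756 .", "Einstein was born in 1879 ."]
seeds = {("Mozart", "1756")}
patterns = find_patterns(corpus, seeds)
new_pairs = apply_patterns(corpus, patterns)  # includes ("Einstein", "1879")
```

One full round maps the seeds to patterns and the patterns to new pairs; iterating feeds the harvested pairs back in as seeds, which is exactly where the noise amplification described above sets in, since every wrong pair contributes wrong patterns to the next round.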
