Large-Scale Semi-Supervised Learning for Natural Language ...
without the need for manual annotation of examples. Our proposed work here is therefore an example of a semi-supervised system that uses "Learning with Heuristically-Labeled Examples," as described in Chapter 2, Section 2.5.4. We evaluate DSP on the task of assigning verb-object selectional preference. We encode a noun's textual distribution as feature information. The learned feature weights are linguistically interesting, yielding high-quality similar-word lists as latent information. With these features, DSP is also an example of a semi-supervised system that creates features from unlabeled data (Section 2.5.5). It thus encapsulates the two main thrusts of this dissertation.

Despite its representational power, DSP scales to real-world data sizes: examples are partitioned by predicate, and a separate SVM is trained for each partition. This allows us to efficiently learn with over 57 thousand features and 6.5 million examples. DSP outperforms recently proposed alternatives in a range of experiments, and better correlates with human plausibility judgments. It also shows strong gains over a Mutual Information-based co-occurrence model on two tasks: identifying objects of verbs in an unseen corpus and finding pronominal antecedents in coreference data.

6.2 Related Work

Most approaches to SPs generalize from observed predicate-argument pairs to semantically similar ones by modeling the semantic class of the argument, following Resnik [1996]. For example, we might have a class Mexican Food and learn that the entire class is suitable for eating. Usually, the classes are from WordNet [Miller et al., 1990], although they can also be inferred from clustering [Rooth et al., 1999].
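The partition-by-predicate training scheme described above can be illustrated with a minimal sketch. This is not the dissertation's implementation: a simple perceptron stands in for the SVM, and the predicates, feature names, and `train_partitioned`/`plausibility` helpers are hypothetical, but the structure — group examples by predicate, then fit one independent linear model per partition — is the technique named in the text.

```python
from collections import defaultdict

def train_partitioned(examples, epochs=10):
    """Train one linear model per predicate partition.

    examples: list of (predicate, feature_dict, label), label in {+1, -1}.
    Returns a dict mapping each predicate to its own weight vector.
    A perceptron is used here as a stand-in for the per-partition SVM.
    """
    # Partition examples by predicate so each model trains independently.
    partitions = defaultdict(list)
    for pred, feats, label in examples:
        partitions[pred].append((feats, label))

    weights = {}
    for pred, part in partitions.items():
        w = defaultdict(float)
        for _ in range(epochs):
            for feats, label in part:
                score = sum(w[f] * v for f, v in feats.items())
                if label * score <= 0:  # misclassified: perceptron update
                    for f, v in feats.items():
                        w[f] += label * v
        weights[pred] = dict(w)
    return weights

def plausibility(weights, pred, feats):
    """Score a candidate argument under its predicate's model."""
    w = weights.get(pred, {})
    return sum(w.get(f, 0.0) * v for f, v in feats.items())
```

Because the partitions share no parameters, each model sees only its own predicate's examples, which is what makes training tractable at the scale of millions of examples.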
Brockmann and Lapata [2003] compared a number of WordNet-based approaches, including Resnik [1996], Li and Abe [1998], and Clark and Weir [2002], and found that more sophisticated class-based approaches do not always outperform frequency-based models.

Another line of research generalizes using similar words. Suppose we are calculating the probability of a particular noun, n, occurring as the object argument of a given verbal predicate, v. Let Pr(n|v) be the empirical maximum-likelihood estimate from observed text. Dagan et al. [1999] define the similarity-weighted probability, Pr_SIM, to be:

    Pr_SIM(n|v) = Σ_{v′ ∈ SIMS(v)} Sim(v′, v) Pr(n|v′)    (6.1)

where Sim(v′, v) returns a real-valued similarity between two verbs v′ and v (normalized over all pair similarities in the sum). In contrast, Erk [2007] generalizes by substituting similar arguments, while Wang et al. [2005] use the cross-product of similar pairs. One key issue is how to define the set of similar words, SIMS(w). Erk [2007] compared a number of techniques for creating similar-word sets and found that both the Jaccard coefficient and Lin [1998a]'s information-theoretic metric work best. Similarity-smoothed models are simple to compute, potentially adaptable to new domains, and require no manually-compiled resources such as WordNet.

Selectional preferences have also been a recent focus of researchers investigating the learning of paraphrases and inference rules [Pantel et al., 2007; Roberto et al., 2007]. Inferences such as "[X wins Y] ⇒ [X plays Y]" are only valid for certain arguments X and Y. We follow Pantel et al. [2007] in using automatically-extracted semantic classes to help characterize plausible arguments.
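The similarity-weighted estimate of Eq. 6.1 can be sketched directly. This is a minimal illustration, not Dagan et al.'s code: the `pr` and `sims` data structures and the example verbs are hypothetical, and similarities are normalized over the pairs in the sum, as the text specifies.

```python
def pr_sim(n, v, pr, sims):
    """Similarity-weighted Pr(n|v), following Eq. 6.1.

    pr:   nested dict with pr[v'][n] = empirical Pr(n|v').
    sims: sims[v] = {v': raw similarity Sim(v', v)} over SIMS(v).
    Each Sim(v', v) is normalized by the total similarity mass in the sum.
    """
    total = sum(sims.get(v, {}).values())
    if total == 0.0:
        return 0.0
    return sum((sim / total) * pr.get(v2, {}).get(n, 0.0)
               for v2, sim in sims[v].items())
```

For instance, with SIMS(eat) = {devour, consume} and normalized weights 0.6 and 0.4, Pr_SIM(taco | eat) is simply the weighted mixture of Pr(taco | devour) and Pr(taco | consume), smoothing over objects never seen with "eat" itself.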
