Large-Scale Semi-Supervised Learning for Natural Language ...

[Figure 6.1: Disambiguation results by noun frequency. F-score (y-axis, 0 to 0.9) versus noun frequency (x-axis, 10 to 1e+06) for DSP all, Erk (2007), and Keller and Lapata (2003).]

on the low-frequency nouns. A richer feature set allows DSP to make correct inferences on examples that provide minimal co-occurrence data. These are also the examples for which we would expect co-occurrence models like MI to fail.

As a further experiment, we re-trained DSP with only the string-based features removed. Overall macro-averaged F-score dropped from 0.65 to 0.64 (a statistically significant reduction in performance). The system scored nearly identically to DSP on the high-frequency nouns, but performed roughly 15% worse on the nouns that occurred fewer than ten times. This shows that the string-based features are important for selectional preference, and particularly helpful for low-frequency nouns.

Finally, we note that some suggestions for improving pseudo-word evaluations have been proposed in a recent paper by Chambers and Jurafsky [Chambers and Jurafsky, 2010]. For example, our evaluation in this section only considers unseen pairs in the sense that our training and testing pairs are distinct. It may be more realistic to evaluate on all pairs extracted from a corpus, in which case we can directly compare to co-occurrence (or conditional probability, or MI), as in the following sections.

6.4.4 Human Plausibility

Table 6.2 compares some of our systems on data used by Resnik [1996] (also Appendix 2 in Holmes et al. [1989]). The plausibility of these pairs was initially judged based on the experimenters' intuitions, and later confirmed in a human experiment. We include the scores of Resnik's system, and note that its errors were attributed to sense ambiguity and other limitations of class-based approaches [Resnik, 1996].
The other comparison approaches also make a number of mistakes, which can often be traced to a misguided choice of similar word to smooth with. We also compare to our empirical MI model, trained on our parsed corpus. Although

[Footnote 8: For example, warn-engine scores highly because engines are in the class entity, and physical entities (e.g. people) are often objects of warn. Unlike DSP, Resnik's approach cannot learn that for warn, "the property of being a person is more important than the property of being an entity" [Resnik, 1996].]
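The empirical MI model mentioned above scores a verb-argument pair by its pointwise mutual information under co-occurrence counts from a parsed corpus. A minimal sketch of that style of baseline follows; the function name and toy data are illustrative assumptions, not the implementation used in the thesis:

```python
import math
from collections import Counter

def pmi_scores(pairs):
    """Score (verb, noun) pairs by pointwise mutual information:
    PMI(v, n) = log( P(v, n) / (P(v) * P(n)) ), with all probabilities
    estimated by maximum likelihood from the observed pair counts."""
    pair_counts = Counter(pairs)
    verb_counts = Counter(v for v, _ in pairs)
    noun_counts = Counter(n for _, n in pairs)
    total = len(pairs)
    return {
        (v, n): math.log((c / total) /
                         ((verb_counts[v] / total) * (noun_counts[n] / total)))
        for (v, n), c in pair_counts.items()
    }

# Toy sample of (verb, object) pairs standing in for a parsed corpus:
pairs = ([("eat", "pizza")] * 4 + [("eat", "rock")] * 1 +
         [("see", "rock")] * 4 + [("see", "pizza")] * 1)
scores = pmi_scores(pairs)
# (eat, pizza) co-occurs more often than chance, so its PMI is positive;
# (eat, rock) co-occurs less often than chance, so its PMI is negative.
```

Note that such a model assigns no score at all to pairs it has never observed, which is exactly the low-frequency failure mode discussed above and the motivation for DSP's richer feature set.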
