Large-Scale Semi-Supervised Learning for Natural Language ...
1. Is there a benefit in combining web-scale counts with the standard features used in state-of-the-art supervised approaches?

2. How well do web-based models perform on new domains or when labeled data is scarce?

We address these questions on two generation and two analysis tasks, using both existing N-gram data and a novel web-scale N-gram corpus that includes part-of-speech information (Section 5.2). While previous work has combined web-scale features with other features in specific classification problems [Modjeska et al., 2003; Yang et al., 2005; Vadas and Curran, 2007b; Tratz and Hovy, 2010], we provide a multi-task, multi-domain comparison.

Some may question why supervised learning with standard features is needed at all for generation problems. Why not rely solely on direct evidence from a giant corpus? For example, for the task of prenominal adjective ordering (Section 5.3), a system that needs to describe a ball that is both big and red can simply check that big red is more common on the web than red big, and order the adjectives accordingly.

It is, however, suboptimal to rely only on simple counts from N-gram data. For example, ordering adjectives by direct web evidence performs 7% worse than our best supervised system (Section 5.3.2). No matter how large the web becomes, there will always be plausible constructions that never occur. For example, there are currently no pages indexed by Google with the preferred adjective ordering for bedraggled 56-year-old [professor]. Also, in a particular domain, words may have a non-standard usage. Systems trained on labeled data can learn the domain usage and leverage other regularities, such as suffixes and transitivity for adjective ordering.

With these benefits, systems trained on labeled data have become the dominant technology in academic NLP. There is a growing recognition, however, that these systems are highly domain dependent. For example, parsers trained on annotated newspaper text perform poorly on other genres [Gildea, 2001]. While many approaches have adapted NLP systems to specific domains [Tsuruoka et al., 2005; McClosky et al., 2006b; Blitzer et al., 2007; Daumé III, 2007; Rimell and Clark, 2008], these techniques assume that the system knows which domain it is operating in, and that it has access to representative data from that domain. These assumptions are unrealistic in many real-world situations; for example, when automatically processing a heterogeneous collection of web pages. How well do supervised and unsupervised NLP systems perform when used uncustomized, out-of-the-box, on new domains, and how can we best design our systems for robust open-domain performance?

Our results show that using web-scale N-gram data in supervised systems advances the state-of-the-art performance on standard analysis and generation tasks. More importantly, when operating out-of-domain, or when labeled data is not plentiful, using web-scale N-gram data not only helps achieve good performance – it is essential.

5.2 Experiments and Data

5.2.1 Experimental Design

We again evaluate the benefit of N-gram data on multi-class classification problems. For each task, we have labeled data indicating the correct output for each example. We evaluate with accuracy: the percentage of examples correctly classified in test data.
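To make the direct-evidence baseline and the accuracy metric concrete, here is a minimal sketch in Python. The count table and its values are invented for illustration; a real system would query a web-scale N-gram corpus, and this is not the supervised system evaluated in Section 5.3.

```python
from typing import Callable, List, Tuple

NgramCount = Callable[[str], int]  # hypothetical lookup into an N-gram corpus


def order_by_counts(adj1: str, adj2: str, count: NgramCount) -> Tuple[str, str]:
    # Direct web evidence: emit whichever order ("adj1 adj2" vs. "adj2 adj1")
    # has the higher N-gram count. Ties, including the zero-count case for
    # plausible constructions never seen on the web (e.g. "bedraggled
    # 56-year-old"), fall back arbitrarily to the input order.
    if count(f"{adj1} {adj2}") >= count(f"{adj2} {adj1}"):
        return (adj1, adj2)
    return (adj2, adj1)


def accuracy(test_pairs: List[Tuple[str, str]], count: NgramCount) -> float:
    # Accuracy: the percentage of test examples classified correctly.
    # Each test pair is given in its attested (correct) order.
    correct = sum(order_by_counts(a, b, count) == (a, b) for a, b in test_pairs)
    return 100.0 * correct / len(test_pairs)


# Toy counts standing in for a web-scale N-gram lookup (values invented).
TOY_COUNTS = {"big red": 53_000, "red big": 1_200}
lookup = lambda ngram: TOY_COUNTS.get(ngram, 0)

print(order_by_counts("red", "big", lookup))  # -> ('big', 'red')
print(accuracy([("big", "red")], lookup))     # -> 100.0
```

The tie case noted in the comments is exactly where direct counts break down: a pair never observed on the web receives an arbitrary order, whereas a supervised system can back off to regularities such as suffixes or transitivity.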
