with standard lexical features could possibly allow robust functional relation identification across different domains and genres.

8.3.4 Improving Core NLP Technologies

I also plan to apply the web-scale semi-supervised framework to core NLP technologies that are in great demand in the NLP community.

I have previously explored a range of enhancements to pronoun resolution systems [Cherry and Bergsma, 2005; Bergsma, 2005; Bergsma and Lin, 2006; Bergsma et al., 2008b; 2008a; 2009a]. My next step will be to develop and distribute an efficient, state-of-the-art, N-gram-enabled pronoun resolution system for academic and industrial applications. In conversation with colleagues at conferences, I have found that many researchers shy away from machine-learned pronoun resolution systems because of a fear that they would not work well on new domains (i.e., the specific domain on which the research is being conducted). By incorporating web-scale statistics into pronoun resolvers, I plan to produce a robust system that people can confidently apply wherever needed.

I will also use web-scale resources to make advances in parsing, the cornerstone technology of NLP. A parser gives the structure of a sentence, identifying who is doing what to whom. Parsing digs deeper into text than typical information retrieval technology, extracting richer levels of knowledge. Companies like Google and Microsoft have recognized the need to access these deeper linguistic structures and are making parsing a focus for their next generation of search engines. I will create an accurate open-domain parser: a domain-independent parser that can reliably analyze any genre of text. A few approaches have successfully adapted a parser to a specific domain, such as general non-fiction [McClosky et al., 2006b] or biomedical text [Rimell and Clark, 2008], but these systems make assumptions that would be unrealistic when parsing text in a heterogeneous collection of web pages, for example. A parser that could reliably process a variety of genres, without manual involvement, would be of great practical and scientific value.

I will create an open-domain parser by essentially adapting to all the text on the web, again building on the robust classifiers presented in Chapter 5. Parsing decisions will be based on observations in web-scale N-gram data, rather than on observed (and potentially overly-specific) constructions in a particular domain. Custom algorithms could also be used to extract web-scale knowledge for difficult parsing decisions in coordination, noun compounding, and prepositional phrase attachment; a sketch of one such decision is given below. Work in open-domain parsing will also require the development of new, cross-domain, task-based evaluations; these could facilitate comparison of parsers based on different formalisms.
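To make the kind of attachment decision mentioned above concrete, here is a minimal sketch, assuming a lookup function ngram_count() backed by a web-scale N-gram table; the function name, the tie-breaking rule, and the placeholder counts are illustrative assumptions, not part of the thesis.

```python
# Hypothetical sketch (not from the thesis): using web-scale N-gram counts to
# resolve a single prepositional-phrase attachment ambiguity.

# Placeholder counts, purely for illustration; a real system would query a
# web-scale N-gram corpus such as the one described in Chapter 5.
_EXAMPLE_COUNTS = {
    ("ate", "with", "fork"): 12000,
    ("spaghetti", "with", "fork"): 300,
}

def ngram_count(tokens):
    """Look up how often the token sequence occurs (stubbed with example counts)."""
    return _EXAMPLE_COUNTS.get(tuple(tokens), 0)

def pp_attachment(verb, noun1, prep, noun2):
    """Decide whether the PP '<prep> <noun2>' attaches to the verb or to noun1,
    preferring whichever pattern is better attested in the N-gram data."""
    verbal = ngram_count([verb, prep, noun2])    # e.g. "ate with fork"
    nominal = ngram_count([noun1, prep, noun2])  # e.g. "spaghetti with fork"
    return "verb" if verbal >= nominal else "noun"

if __name__ == "__main__":
    # "I ate spaghetti with a fork": the PP should attach to the verb.
    print(pp_attachment("ate", "spaghetti", "with", "fork"))  # -> "verb"
```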
I have recently explored both methods to improve the speed of highly accurate graph-based parsers [Bergsma and Cherry, 2010] (thus allowing new features to be incorporated with less overhead) and ways to incorporate web-scale statistics into the subtask of noun phrase parsing [Pitler et al., 2010]. In preliminary experiments, I have identified a number of other simple N-gram-derived features that improve full-sentence parsing accuracy.

I also plan to investigate whether open-domain parsing could be improved by manually annotating parses of the most frequent N-grams in our new web-scale N-gram corpus (Chapter 5). Recall that the new N-gram corpus includes part-of-speech tags. These tags might help identify N-grams that are likely to be both syntactic constituents and syntactically ambiguous (e.g., noun compounds); one way to select such candidates is sketched below. The annotation could be done either by experts or by crowdsourcing the annotation via Amazon's Mechanical Turk. A similar technique was recently demonstrated successfully for MT [Bloodgood and Callison-Burch, 2010].
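As a rough illustration of how the part-of-speech tags could be used to pick out such candidates, the following sketch filters a POS-tagged N-gram stream for frequent all-noun sequences; the data format, frequency threshold, and helper names are assumptions made here for illustration only.

```python
# Hypothetical sketch (not from the thesis): using the part-of-speech tags in a
# POS-tagged N-gram corpus to select frequent, all-noun N-grams as candidates
# for manual or crowdsourced bracketing annotation.

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags

def looks_like_noun_compound(tags):
    """A run of three or more noun tags is a plausible noun compound whose
    internal bracketing is ambiguous."""
    return len(tags) >= 3 and all(t in NOUN_TAGS for t in tags)

def annotation_candidates(tagged_ngrams, min_count=10000):
    """Yield (tokens, count) for frequent N-grams worth annotating.
    `tagged_ngrams` is assumed to be an iterable of (tokens, tags, count)."""
    for tokens, tags, count in tagged_ngrams:
        if count >= min_count and looks_like_noun_compound(tags):
            yield tokens, count

if __name__ == "__main__":
    corpus = [(["computer", "science", "department"], ["NN", "NN", "NN"], 52000),
              (["ran", "very", "quickly"], ["VBD", "RB", "RB"], 48000)]
    print(list(annotation_candidates(corpus)))
    # -> [(['computer', 'science', 'department'], 52000)]
```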
