with standard lexical features could possibly allow robust functional relation identification across different domains and genres.

8.3.4 Improving Core NLP Technologies

I also plan to apply the web-scale semi-supervised framework to core NLP technologies that are in great demand in the NLP community.

I have previously explored a range of enhancements to pronoun resolution systems [Cherry and Bergsma, 2005; Bergsma, 2005; Bergsma and Lin, 2006; Bergsma et al., 2008b; 2008a; 2009a]. My next step will be to develop and distribute an efficient, state-of-the-art, N-gram-enabled pronoun resolution system for academic and industrial applications. In conversation with colleagues at conferences, I have found that many researchers shy away from machine-learned pronoun resolution systems because of a fear that they would not work well on new domains (i.e., the specific domain on which the research is being conducted). By incorporating web-scale statistics into pronoun resolvers, I plan to produce a robust system that people can confidently apply wherever needed.

I will also use web-scale resources to make advances in parsing, the cornerstone technology of NLP. A parser gives the structure of a sentence, identifying who is doing what to whom. Parsing digs deeper into text than typical information retrieval technology, extracting richer levels of knowledge. Companies like Google and Microsoft have recognized the need to access these deeper linguistic structures and are making parsing a focus for their next generation of search engines. I will create an accurate open-domain parser: a domain-independent parser that can reliably analyze any genre of text.
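To illustrate the kind of web-scale statistic that can make a pronoun resolver robust, consider the following toy sketch of a gender-compatibility cue. This is only an illustration, not the actual system: the counts below are invented stand-ins for lookups in a web-scale N-gram corpus, and the pattern used (a noun co-occurring with a reflexive pronoun) is just one of many possible cues.

```python
# Toy sketch of a web-scale gender/number cue for pronoun resolution.
# The counts below are invented stand-ins for N-gram corpus lookups.
NGRAM_COUNTS = {
    ("company", "itself"): 90000,
    ("company", "himself"): 500,
    ("ceo", "himself"): 40000,
    ("ceo", "itself"): 300,
}

def gender_score(candidate: str, reflexive: str) -> float:
    """Relative frequency with which the candidate noun co-occurs
    with the given reflexive pronoun in the (hypothetical) counts."""
    total = sum(c for (n, _), c in NGRAM_COUNTS.items() if n == candidate)
    return NGRAM_COUNTS.get((candidate, reflexive), 0) / total if total else 0.0

def best_antecedent(candidates, reflexive):
    """Pick the candidate most compatible with the pronoun's gender."""
    return max(candidates, key=lambda n: gender_score(n, reflexive))

# "The CEO of the company said he ..." -> "he"/"himself" prefers "ceo"
print(best_antecedent(["company", "ceo"], "himself"))  # -> ceo
```

Because such counts are aggregated over the whole web rather than a single training corpus, a cue like this degrades gracefully when the resolver is moved to a new domain.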
A few approaches have successfully adapted a parser to a specific domain, such as general non-fiction [McClosky et al., 2006b] or biomedical text [Rimell and Clark, 2008], but these systems make assumptions that would be unrealistic when parsing text in a heterogeneous collection of web pages, for example. A parser that could reliably process a variety of genres, without manual involvement, would be of great practical and scientific value.

I will create an open-domain parser by essentially adapting to all the text on the web, again building on the robust classifiers presented in Chapter 5. Parsing decisions will be based on observations in web-scale N-gram data, rather than observed (and potentially overly-specific) constructions in a particular domain. Custom algorithms could also be used to extract web-scale knowledge for difficult parsing decisions in coordination, noun compounding, and prepositional phrase attachment. Work in open-domain parsing will also require the development of new, cross-domain, task-based evaluations; these could facilitate comparison of parsers based on different formalisms.

I have recently explored both methods to improve the speed of highly-accurate graph-based parsers [Bergsma and Cherry, 2010] (thus allowing the incorporation of new features with less overhead) and ways to incorporate web-scale statistics into the subtask of noun phrase parsing [Pitler et al., 2010]. In preliminary experiments, I have identified a number of other simple N-gram-derived features that improve full-sentence parsing accuracy.

I also plan to investigate whether open-domain parsing could be improved by manually annotating parses of the most frequent N-grams in our new web-scale N-gram corpus (Chapter 5). Recall that the new N-gram corpus includes part-of-speech tags. These tags might help identify N-grams that are likely to be both syntactic constituents and syntactically ambiguous (e.g., noun compounds).
The annotation could be done either by experts or by crowdsourcing via Amazon's Mechanical Turk. A similar technique was recently demonstrated successfully for MT [Bloodgood and Callison-Burch, 2010].
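As a concrete illustration of how web-scale counts can inform one of the difficult parsing decisions mentioned above, here is a minimal sketch of count-based prepositional-phrase attachment. The counts are hypothetical stand-ins for N-gram corpus lookups, and a real system would use such counts as features in a trained classifier rather than as a hard rule:

```python
# Sketch: prepositional-phrase attachment via web-scale counts.
# Counts are hypothetical stand-ins for N-gram corpus lookups.
NGRAM_COUNTS = {
    ("pizza", "with", "anchovies"): 12000,  # noun-attachment evidence
    ("eat", "with", "anchovies"): 800,
    ("pizza", "with", "fork"): 150,
    ("eat", "with", "fork"): 9000,          # verb-attachment evidence
}

def attach(verb: str, noun: str, prep: str, obj: str) -> str:
    """Attach the PP to whichever head co-occurs with it more often."""
    noun_count = NGRAM_COUNTS.get((noun, prep, obj), 0)
    verb_count = NGRAM_COUNTS.get((verb, prep, obj), 0)
    return "noun" if noun_count > verb_count else "verb"

print(attach("eat", "pizza", "with", "anchovies"))  # -> noun
print(attach("eat", "pizza", "with", "fork"))       # -> verb
```

The same count-comparison pattern extends naturally to coordination scope and noun-compound bracketing, the other attachment-style decisions discussed above.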
My focus is thus on enabling robust, open-domain systems through better features and new kinds of labeled data. These improvements should combine constructively with recent, orthogonal advances in domain detection and adaptation [McClosky et al., 2010].

8.3.5 Mining New Data Sources

While web-scale N-gram data is very effective, future NLP technology will combine information from a variety of other structured and unstructured data sources to make better natural language inferences. Query logs, parallel bilingual corpora, and collaborative projects like Wikipedia will provide crucial knowledge for syntactic and semantic analysis. For example, there is a tremendous amount of untapped information in the Wikipedia edit histories, which record all the changes made to Wikipedia pages. As a first step in harvesting this information, we could extract a database of real spelling corrections made to Wikipedia pages. This data could be used to train and test NLP spelling correction systems at an unprecedented scale.

Furthermore, it also seems likely that information from the massive volume of online images and video will be used to inform automatic language processing. Many simple statistics can be computed from visual sources and stored, just like N-gram counts, in precompiled databases. For example, we might extract visual descriptors using algorithms like the popular and efficient SIFT algorithm [Lowe, 1999], convert these descriptors to image codewords (i.e., the bag-of-words representation of images), and then store the codeword co-occurrence counts in a large database.

In fact, services like the Google Image Search and Flickr photo-sharing websites effectively already link caption words to images in a database. These services could be exploited for building special language models, for example, for selectional preference.
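The codeword co-occurrence database described above might be assembled as in the following sketch, where each image has already been reduced to a bag of codeword ids (quantized descriptors); the ids and data here are invented for illustration:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(images):
    """Count unordered codeword pairs occurring in the same image,
    analogous to precompiled N-gram co-occurrence counts over text."""
    counts = Counter()
    for codewords in images:
        # each image contributes each distinct pair at most once
        for pair in combinations(sorted(set(codewords)), 2):
            counts[pair] += 1
    return counts

# Each image is a bag of codeword ids (e.g., quantized SIFT
# descriptors); the ids here are invented for illustration.
images = [[3, 7, 7, 12], [3, 7, 9], [7, 12]]
print(cooccurrence_counts(images)[(3, 7)])  # -> 2
```

Once precomputed, such a table could be queried exactly like an N-gram count database, making visual statistics a drop-in feature source for the classifiers of Chapter 5.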
When creating features for nouns occurring with particular verbs, for example (as in Chapter 6), we might query the image search service using the noun string as the keyword, and then create SIFT-style features for the retrieved images. Could we build a model, for example, of things that can be eaten, purely based on visual images of edible substances?

In general, I envision some breakthroughs once NLP moves beyond solving text processing in isolation and instead adopts an approach that integrates advances in large-scale processing across a variety of disciplines.

– Thanks for reading the dissertation!