amples from unlabeled data in order to train better models of selectional preference (Chapter 6) and string similarity (Chapter 7). The discriminative classifiers trained from this data exploited several novel sources of information, including character-level (string and capitalization) features for selectional preferences, and features derived from a character-based sequence alignment for discriminative string similarity. While automatic example generation was not applied to web-scale unlabeled data in this work, it promises to scale easily to web-scale text. For example, after summarizing web-scale data with N-gram statistics, we can create examples using only several gigabytes of compressed N-gram text, rather than using the petabytes of raw web text directly. Automatic example generation from aggregate statistics thus promises both better scaling and cleaner data (since aggregate statistics naturally exclude phenomena that occur purely due to chance).

The key methods of parts one and two are, of course, compatible in another way: it would be straightforward to use the output of the pseudo-trained models as features in supervised systems. This is similar to the approach of Ando and Zhang [2005], and, in fact, was pursued in some of our concurrent work [Bergsma et al., 2009a] (with good results).

8.2 The Impact of this Work

We hope the straightforward but effective techniques presented in this dissertation will help promote simple, scalable semi-supervised learning as a future paradigm for NLP research. We advocate such a direction for several reasons.

First, only via machine learning can we combine the millions of parameters that interact in natural language processing. Second, only by leveraging unlabeled data can we go beyond the limited models that can be learned from small, hand-annotated training sets. Furthermore, it is highly advantageous to have an NLP system that both benefits from unlabeled data and that can readily take advantage of even more unlabeled data when it becomes available. Both the volume of text on the web and the power of computer architecture continue to grow exponentially over time. Systems that use unlabeled data will therefore improve automatically over time, without any special annotation, research, or engineering effort. For example, in [Pitler et al., 2010], we presented a parser whose performance improves logarithmically with the number of unique N-grams in a web-scale N-gram corpus. A useful direction for future work would be to identify other problems that can benefit from the use of web-scale volumes of unlabeled data. This would hopefully enable an even greater proportion of NLP systems to achieve automatic improvements in performance.

The following section describes some specific directions for future work, and notes some tasks where web-scale data might be productively exploited.

Once we find out, for a range of tasks, just how far we can get with big data and ML alone, we will have a better handle on what other sources of linguistic knowledge might be needed. For example, we can now get to around 75% accuracy on preposition selection using N-grams alone (Section 3.5).
To correct preposition errors with even higher accuracy, we needed to exploit knowledge of the speaker's native language (and thus their likely preposition confusions), getting above 95% accuracy in this manner (but also sacrificing a small but perhaps reasonable amount of coverage). It is unlikely that N-gram data alone would ever allow us to select the correct preposition in phrases like, "I like to swim before/after school." Similarly, we argued that to perform even better on non-referential pronoun detection (Section 3.7), we will need to pay attention to wider segments of discourse.
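As a concrete illustration of this count-based approach, the following minimal Python sketch selects a preposition by comparing the frequencies of the candidate-filled context pattern. The candidate list, count table, and function names are illustrative stand-ins, not the thesis's actual models or data:

PREPOSITIONS = ["in", "on", "at", "for", "before", "after"]

# Toy stand-in for a web-scale N-gram count table.
NGRAM_COUNTS = {
    "swim before school": 160,
    "swim after school": 3400,
    "swim at school": 90,
}

def fill_count(left, prep, right, counts):
    """Count of the context pattern with the candidate preposition filled in."""
    return counts.get(" ".join([left, prep, right]), 0)

def select_preposition(left, right, counts=NGRAM_COUNTS):
    """Return the candidate whose filled pattern is most frequent in the counts."""
    return max(PREPOSITIONS, key=lambda p: fill_count(left, p, right, counts))

print(select_preposition("swim", "school"))  # -> "after" on these toy counts

Note that on the example above, raw frequency picks "after" even when the writer actually meant "before"; this is exactly the kind of case where counts alone fall short.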
8.3 Future Work

This section outlines some specific ways to extend or apply insights from this thesis.

8.3.1 Improved Learning with Automatically-Generated Examples

In part two of this thesis, we achieved good results by automatically generating training examples, but we left open some natural questions arising from this work. For example, how many negatives should be generated for each positive? How do we ensure that training with pseudo-examples transfers well to testing on real examples? While the size of the learning problem prevented extensive experiments at the time the research was originally conducted, recent advances in large-scale machine learning enable much faster training. This allows us to perform large-scale empirical studies to address the above questions. In combination with the usual advances in computer speed and memory, large-scale empirical studies will become even easier. In fact, some have even suggested that large-scale learning of linear classifiers is now essentially a solved problem [Yu et al., 2010]. This provides even greater impetus to test and exploit large-scale linear pseudo-classifiers in NLP.

8.3.2 Exploiting New ML Techniques

Another interesting direction for future research will be the development of learning algorithms that exploit correlations between local and global features (see Chapter 1 for an example of local and global features for VBN/VBD disambiguation). Often the local and global patterns represent the same linguistic construction, and their weights should thus be similar. For example, suppose at test time we encounter the phrase, "it was the Bears who won." Even if we haven't seen the pattern "noun who verb" as local context in the training set, we may have seen it in the global context of a VBD training instance. Laplacian regularization (previously used to exploit the distributional similarity of words for syntactic parsing [Wang et al., 2006]) provides a principled way to force global and local features to have similar weights, roughly by adding a penalty of the form λ Σ_(i,j) (w_i − w_j)² over linked pairs (i, j) of corresponding local and global features, although simpler feature-based techniques also exist [Daumé III, 2007]. In particular, combining Laplacian regularization with the scaling of feature values (to allow the more predictive, local features to have higher weight) is a promising direction to explore. In any case, identifying an effective solution here could have implications for other, related problems, such as multi-task learning [Raina et al., 2006], domain adaptation [McClosky et al., 2010], and sharing feature knowledge across languages [Berg-Kirkpatrick and Klein, 2010].

8.3.3 New NLP Problems

There are a number of other important, but largely unexplored, NLP problems where web-scale solutions could have an impact. One such problem is the detection of functional relations for information extraction. A functional relation is a binary relation where each element of the domain is related to a unique element in the codomain. For example, each person has a unique birthplace and date of birth, but may have multiple children, residences, and alma maters. There are a number of novel contextual clues that could flag these relations. For example, the indefinite articles a/an tend not to occur with functional relations; we frequently observe a cousin of in text, but we rarely see a birthplace of. The latter is functional. Based on our results in Chapter 5, a classifier combining such simple statistics could be effective at detecting functional relations.
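As an illustration of the indefinite-article statistic described above, here is a minimal Python sketch that flags a relation as functional when a/an almost never precedes it. The counts and threshold are invented for illustration, not measured values from the thesis, and a real classifier would combine many such cues:

# Toy stand-in for determiner + relation phrase counts from N-gram data.
TOY_COUNTS = {
    ("a", "cousin of"): 9500,
    ("the", "cousin of"): 12000,
    ("a", "birthplace of"): 40,
    ("the", "birthplace of"): 31000,
}

def indefinite_ratio(relation, counts):
    """Fraction of determiner+relation occurrences using the indefinite a/an."""
    indef = counts.get(("a", relation), 0) + counts.get(("an", relation), 0)
    total = indef + counts.get(("the", relation), 0)
    return indef / total if total else 0.0

def looks_functional(relation, counts=TOY_COUNTS, threshold=0.05):
    """Flag the relation as functional when a/an almost never precedes it."""
    return indefinite_ratio(relation, counts) < threshold

print(looks_functional("birthplace of"))  # True on these toy counts
print(looks_functional("cousin of"))      # False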