8.3 Future Work

This section outlines some specific ways to extend or apply insights from this thesis.

8.3.1 Improved Learning with Automatically-Generated Examples

In part two of this thesis, we achieved good results by automatically generating training examples, but we left open some natural questions arising from this work. For example, how many negatives should be generated for each positive? How do we ensure that training with pseudo-examples transfers well to testing on real examples? While the size of the learning problem prevented extensive experiments at the time the research was originally conducted, recent advances in large-scale machine learning enable much faster training. This allows us to perform large-scale empirical studies to address the above questions, and in combination with the usual advances in computer speed and memory, such studies will become even easier. In fact, some have even suggested that large-scale learning of linear classifiers is now essentially a solved problem [Yu et al., 2010]. This provides even greater impetus to test and exploit large-scale linear pseudo-classifiers in NLP.

8.3.2 Exploiting New ML Techniques

Another interesting direction for future research is the development of learning algorithms that exploit correlations between local and global features (see Chapter 1 for an example of local and global features for VBN/VBD disambiguation). Often the local and global patterns represent the same linguistic construction, and their weights should thus be similar. For example, suppose at test time we encounter the phrase, "it was the Bears who won." Even if we haven't seen the pattern "noun who verb" as local context in the training set, we may have seen it in the global context of a VBD training instance. Laplacian regularization (previously used to exploit the distributional similarity of words for syntactic parsing [Wang et al., 2006]) provides a principled way to force global and local features to have similar weights, although simpler feature-based techniques also exist [Daumé III, 2007]; a sketch of such an objective is given at the end of this section. In particular, combining Laplacian regularization with the scaling of feature values (to allow the more predictive, local features to have higher weight) is a promising direction to explore. In any case, identifying an effective solution here could have implications for other, related problems, such as multi-task learning [Raina et al., 2006], domain adaptation [McClosky et al., 2010], and sharing feature knowledge across languages [Berg-Kirkpatrick and Klein, 2010].

8.3.3 New NLP Problems

There are a number of other important, but largely unexplored, NLP problems where web-scale solutions could have an impact. One such problem is the detection of functional relations for information extraction. A functional relation is a binary relation where each element of the domain is related to a unique element in the codomain. For example, each person has a unique birthplace and date of birth, but may have multiple children, residences, and alma maters. There are a number of novel contextual clues that could flag these relations. For example, the indefinite articles a/an tend not to occur with functional relations: we frequently observe a cousin of in text, but we rarely see a birthplace of. The latter relation is functional (a sketch of how such a statistic might be computed is also given at the end of this section). Based on our results in Chapter 5, a classifier combining such simple statistics
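To make the weight-tying idea from Section 8.3.2 concrete, the following is a minimal sketch, not a formulation from this thesis, of how a Laplacian regularizer in the spirit of Wang et al. [2006] could encourage a local feature and its corresponding global feature to receive similar weights. Here \ell is the training loss, P is an assumed set of (local, global) feature index pairs that encode the same linguistic pattern, and s_{jk} >= 0 is an assumed similarity weight for each pair:

    \min_{w} \; \sum_{i=1}^{n} \ell\big(y_i, w^{\top} x_i\big)
        \;+\; \lambda \, \|w\|_2^2
        \;+\; \gamma \sum_{(j,k) \in P} s_{jk} \, (w_j - w_k)^2

The final term can be written as w^{\top} L w for the Laplacian L of the feature-similarity graph, so standard solvers for graph-regularized objectives apply. Scaling local feature values upward, as suggested above, would let the more predictive local features retain larger weights while still being pulled toward their global counterparts.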

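As an illustration of the indefinite-article clue from Section 8.3.3, here is a minimal Python sketch, not code from the thesis, that estimates how often a candidate relation noun is preceded by a/an. The get_count function is a hypothetical stand-in for a lookup into a large n-gram corpus, and the counts used below are invented for the example:

    def indefinite_article_ratio(noun, get_count):
        """Fraction of '<noun> of' occurrences preceded by an indefinite article."""
        indefinite = get_count("a " + noun + " of") + get_count("an " + noun + " of")
        total = get_count(noun + " of")
        return indefinite / total if total > 0 else 0.0

    # Toy illustration with invented counts: a low ratio (as for birthplace) is
    # weak evidence that the relation is functional; a high ratio (as for cousin)
    # is evidence that it is not.
    toy_counts = {"a cousin of": 900, "cousin of": 2000,
                  "a birthplace of": 5, "birthplace of": 3000}
    lookup = lambda phrase: toy_counts.get(phrase, 0)
    for noun in ("cousin", "birthplace"):
        print(noun, indefinite_article_ratio(noun, lookup))

A ratio of this kind would be only one feature among the many simple statistics that such a classifier could combine.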