other tasks we only had a handful of N-GM features; here there are 21K features for the distributional patterns of N,V pairs. With more features, we need more labeled data to make the N-GM features work properly. A similar issue with such high-dimensional global features is explored in [Huang and Yates, 2009]. Reducing this feature space by pruning or performing transformations may improve accuracy both in and out of domain.

5.7 Discussion and Future Work

Of all classifiers, LEX performs worst on all cross-domain tasks. Clearly, many of the regularities that a typical classifier exploits in one domain do not transfer to new genres. N-GM features, however, do not depend directly on training examples, and thus work better cross-domain. Of course, using web-scale N-grams is not the only way to create robust classifiers. Counts from any large auxiliary corpus may also help, but web counts should help more [Lapata and Keller, 2005]. Section 5.6.2 suggests that another way to mitigate domain-dependence is to have multiple feature views.

[Banko and Brill, 2001] argue that "a logical next step for the research community would be to direct efforts towards increasing the size of annotated training collections." Assuming we really do want systems that operate beyond the specific domains on which they are trained, the community also needs to identify which systems behave as in Figure 5.2, where the accuracy of the best in-domain system actually decreases with more training examples. Our results suggest that better features, such as web pattern counts, may help more than expanding the training data. Also, systems using web-scale unlabeled data will improve automatically as the web expands, without annotation effort.

In some sense, using web counts as features is a form of domain adaptation: adapting a web model to the training domain. How do we ensure these features are adapted well and not used in domain-specific ways (especially with many features to adapt, as in Section 5.6)? One option may be to regularize the classifier specifically for out-of-domain accuracy. We found that adjusting the SVM misclassification penalty (for more regularization) can either help or hurt out-of-domain accuracy. Other regularizations are possible. In fact, one good option might be to extend the results of Chapter 4 to cross-domain usage. In this chapter, each task had a domain-neutral unsupervised approach. We could encode these systems as linear classifiers with corresponding weights. Rather than a typical SVM that minimizes the weight norm ||w|| (plus the slacks), we could regularize toward the domain-neutral weights, and otherwise minimize weight variance, so that the classifier prefers this system in the absence of other information. This regularization could be optimized on creative splits of the training data, in an effort to simulate the lexical overlap we can expect with a particular test domain. A sketch of this modified objective is given below.
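To make the proposed regularization concrete, the following is a minimal sketch, not code from this dissertation: a linear classifier with a hinge loss, trained by simple per-example subgradient steps, whose regularizer penalizes distance from a prior weight vector w0 (the domain-neutral weights) rather than distance from zero. The function name, optimizer, and toy data are illustrative assumptions.

```python
import numpy as np

def train_prior_svm(X, y, w0, C=1.0, epochs=100, lr=0.01):
    """Linear SVM regularized toward a prior weight vector w0.

    Approximately minimizes, by per-example subgradient steps,
        0.5 * ||w - w0||^2 + C * sum_i max(0, 1 - y_i * (w . x_i))
    X: (n, d) feature matrix; y: labels in {-1, +1}; w0: (d,) prior weights.
    The bias term is omitted for brevity.
    """
    w0 = np.asarray(w0, dtype=float)
    w = w0.copy()                          # start at the domain-neutral weights
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = w - w0                  # pulls w back toward w0, not toward 0
            if yi * np.dot(w, xi) < 1:     # margin violated: add hinge subgradient
                grad -= C * yi * xi
            w -= lr * grad
    return w

# Toy usage: two features, a prior that trusts only the first (web-count) feature.
if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = np.sign(X[:, 0] + 0.1 * rng.randn(200))
    w0 = np.array([1.0, 0.0])              # hypothetical domain-neutral weights
    print("learned weights:", train_prior_svm(X, y, w0))
```

Setting w0 to the zero vector recovers the standard SVM objective, so the prior only changes the solution where the labeled data is uninformative: in the absence of other evidence, a feature's weight stays near its domain-neutral value.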
5.8 Conclusion

We presented results on tasks spanning a range of NLP research: generation, disambiguation, parsing and tagging. Using web-scale N-gram data improves accuracy on each task. When less training data is used, or when the system is applied to a different domain, N-gram features greatly improve performance. Since most supervised NLP systems do not use web-scale counts, further cross-domain evaluation may reveal some very brittle systems. Continued effort in new domains should be a priority for the community going forward.

This concludes the part of the dissertation that explored using web-scale N-gram data as features within a supervised classifier. I hope you enjoyed it.
