Chapter 8
Conclusions and Future Work

8.1 Summary

This dissertation outlined two simple, scalable, effective methods for large-scale semi-supervised learning: constructing features from web-scale N-gram data, and using unlabeled data to automatically generate training examples.

The availability of web-scale N-gram data was crucial for our improved web-scale feature-based approaches. While the Google N-gram data was originally created to support the language model of an MT system, we confirmed that this data can be useful for a range of tasks, including both analysis and generation problems. Unlike previous work using search engines, it is possible to extract millions of web-scale counts efficiently from N-gram data. We can thus freely exploit numerous overlapping and interdependent contexts for each example, for both training and test instances. Chapter 3 presented a unified framework for integrating such N-gram information for various lexical disambiguation tasks. Excellent results were achieved on three tasks. In particular, we proposed a novel and successful method of using web-scale counts for the identification of non-referential pronouns, a long-standing challenge in the anaphora resolution community.

In Chapter 4, we introduced a new form of SVM training to mitigate the dependence of the discriminative web-N-gram systems on large amounts of training data. Since the unsupervised system was known to achieve good performance with equal weights, we changed the SVM's regularization to prefer low-weight-variance solutions, biasing it toward the unsupervised solution. The optimization problem remained a convex function of the feature weights, and was thus theoretically no harder to optimize than a standard SVM. On smaller amounts of training data, the variance-regularization SVM performed dramatically better than the standard multi-class SVM.

Chapter 5 addressed a pair of open questions on the use of web-scale data in NLP. First, we showed there was indeed a significant benefit in combining web-scale counts with the traditional features used in state-of-the-art supervised approaches. For example, we proposed a novel system for adjective ordering that exceeded state-of-the-art performance without using any N-gram data, and then further improved this system by adding N-gram features. Second, and perhaps more importantly, models with web-based features were shown to perform much better than traditional supervised systems when moving to new domains or when labeled training data was scarce (realistic situations for the practical application of NLP technology).

In the second part of the dissertation, we showed how to automatically create labeled examples from unlabeled data.
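To make the variance-regularization idea from Chapter 4 concrete, the sketch below shows one plausible form of the training objective, assuming a standard multi-class hinge loss; the symbols (lambda, w-bar, phi) are illustrative notation and not necessarily the dissertation's exact formulation. The usual squared-norm penalty on the weights is replaced by a penalty on the variance of the weights, so the equal-weight (unsupervised) solution incurs zero regularization cost:

\[
\min_{\mathbf{w}} \;\; \frac{\lambda}{2} \sum_{i=1}^{d} \left( w_i - \bar{w} \right)^2
\;+\; \sum_{j=1}^{n} \max\!\Big( 0,\; 1 + \max_{y \neq y_j} \mathbf{w} \cdot \boldsymbol{\phi}(x_j, y) - \mathbf{w} \cdot \boldsymbol{\phi}(x_j, y_j) \Big),
\qquad \bar{w} = \frac{1}{d} \sum_{i=1}^{d} w_i .
\]

The variance term is a positive semi-definite quadratic in the weights, so the objective remains convex, consistent with the statement above that the modified problem is theoretically no harder to optimize than a standard SVM.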
