                         Training Examples
System      10     100    1K     10K    100K
OvA-SVM     16.0   50.6   66.1   71.1   73.5
K-SVM       13.7   50.0   65.8   72.0   74.7
K-SVM+      22.2   56.8   70.5   73.7   75.2
CS-SVM      27.1   58.8   69.0   73.5   74.2
CS-SVM+     39.6   64.8   71.5   74.0   74.4
VAR-SVM     73.8   74.2   74.7   74.9   74.9

Table 4.1: Accuracy (%) of preposition-selection SVMs. Unsupervised (SUMLM) accuracy is 73.7%.

                         Training Examples
System      10     100    1K     10K    100K
CS-SVM      86.0   93.5   95.1   95.7   95.7
CS-SVM+     91.0   94.9   95.3   95.7   95.7
VAR-SVM     94.9   95.3   95.6   95.7   95.8

Table 4.2: Accuracy (%) of spell-correction SVMs. Unsupervised (SUMLM) accuracy is 94.8%.

VAR-SVM significantly outperforms the other systems, and always exceeds the performance of the unsupervised approach.[5] Augmenting the attributes with sum counts (the "+" systems) helps strongly when there are few examples, especially in conjunction with the more efficient CS-SVM. However, VAR-SVM clearly helps more. We noted earlier that VAR-SVM is guaranteed to do as well as the unsupervised system on the development data, but here we confirm that it can also exploit even small amounts of training data to further improve accuracy.

CS-SVM outperforms K-SVM except with 100K examples, while OvA-SVM is better than K-SVM for small amounts of data.[6] K-SVM performs best with all the data: it uses the most expressive representation, but needs 100K examples to make use of it. On the other hand, feature augmentation and variance regularization provide diminishing returns as the amount of training data increases.

4.4.2 Context-Sensitive Spelling Correction

Recall that in context-sensitive spelling correction, for every occurrence of a word in a predefined confusion set (e.g., {cite, sight, site}), the classifier selects the most likely word from the set. We use the five confusion sets from Chapter 3; four are binary and one is a 3-way classification. We use 100K training, 10K development, and 10K test examples for each, and average accuracy across the sets. All 2-to-5 gram counts are used in the unsupervised system, so the variance of all weights is regularized in VAR-SVM.

[5] Significance is calculated using a chi-squared test over the test-set correct/incorrect contingency table.
[6] [Rifkin and Klautau, 2004] argue that OvA-SVM is as good as K-SVM, but this is "predicated on the assumption that the classes are 'independent'," i.e., that examples from class 0 are no closer to class 1 than to class 2. This is not true of this task (e.g., x̄_before is closer to x̄_after than to x̄_in).
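To make the VAR-SVM guarantee discussed above concrete, its objective can be sketched as follows. This is a hedged reconstruction, not the thesis's exact formulation (which is given earlier in the chapter): it assumes the regularizer penalizes the empirical variance of the N count-feature weights rather than their norm,

    \min_{\mathbf{w}} \; \lambda \sum_{i=1}^{N} \left( w_i - \bar{w} \right)^2 \;+\; \sum_{j} \max\!\left( 0,\; 1 - y_j\, \mathbf{w}^{\top} \mathbf{x}_j \right), \qquad \bar{w} = \frac{1}{N} \sum_{i=1}^{N} w_i .

Under such a regularizer, zero training signal drives all count-feature weights to a common value, and a classifier with uniform weights over the (log) counts is exactly the SUMLM decision rule; the hinge loss then moves individual weights away from uniformity only where the training data supports it. This is consistent with VAR-SVM matching (and slightly exceeding) unsupervised accuracy with as few as 10 examples in Tables 4.1 and 4.2.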
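The unsupervised SUMLM selection that these systems build on can also be illustrated with a short Python sketch. The function ngram_count below is a hypothetical stand-in for a lookup into the web-scale N-gram counts of Chapter 3; the scoring (summing log counts of all 2-to-5 grams that cover the candidate) follows the description above, with +1 smoothing added here as an assumption to handle zero counts:

    import math

    def sumlm_select(left, right, confusion_set, ngram_count):
        """Return the confusion-set member whose filled-in context
        has the highest total log N-gram count."""
        best_word, best_score = None, float("-inf")
        for word in confusion_set:
            tokens = left + [word] + right
            pos = len(left)  # index of the candidate word
            score = 0.0
            for n in range(2, 6):  # all 2-to-5 grams
                for start in range(len(tokens) - n + 1):
                    if not (start <= pos < start + n):
                        continue  # keep only N-grams covering the candidate
                    count = ngram_count(tuple(tokens[start:start + n]))
                    score += math.log(count + 1)  # +1 smoothing (assumption)
            if score > best_score:
                best_word, best_score = word, score
        return best_word

For example, sumlm_select(["please", "do", "not"], ["this", "webpage"], ["cite", "sight", "site"], counts) picks whichever candidate has the highest summed log counts under the supplied counts function. The "+" systems above augment the supervised attributes with sums of counts of this kind, and VAR-SVM can be viewed as learning a reweighting of the same counts.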
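The significance test of footnote 5 is straightforward to reproduce. Below is a minimal sketch using scipy.stats (the toolkit is an assumption; the thesis does not name one), with placeholder counts rather than reported results:

    from scipy.stats import chi2_contingency

    # Correct/incorrect counts for two systems on the same 10K test set.
    # These numbers are illustrative placeholders, not reported results.
    table = [[7490, 2510],   # system A: correct, incorrect
             [7380, 2620]]   # system B: correct, incorrect
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")

A p-value below the chosen threshold (e.g., 0.05) indicates that the accuracy difference is unlikely under the null hypothesis that correctness is independent of which system produced the prediction.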
