their slack value). In practice, I usually try a range of values for this parameter, starting at 0.000001 and going up by a factor of 10 to around 100000. Note you would not want to tune the regularization parameter by measuring performance on the training set, as less regularization is always going to lead to better performance on the training data itself. Regularization is a way to prevent overfitting the training data, and thus should be set on separate examples, i.e., the development set. However, some people like to do 10-fold cross validation on the training data to set their hyperparameters. I have no problem with this.

Another detail regarding SVM learning is that sometimes it makes sense to scale or normalize the features to enable faster and sometimes better learning. For many tasks, it makes sense to divide all the feature values by the Euclidean norm of the feature vector, such that the resulting vector has a magnitude of one. In the chapters that follow, we specify if we use such a technique. Again, we can test whether such a transformation is worth it by seeing how it affects performance on our development data.

SVMs have been shown to work quite well on a range of tasks. If you want to use a linear classifier, they seem to be a good choice. The SVM formulation is also perfectly suited to using kernels to automatically expand the feature space, allowing for non-linear classification. For all the tasks investigated in this dissertation, however, standard kernels were not found to improve performance. Furthermore, training and testing take longer when kernels are used.

2.3.5 Software

We view the current best practice in most NLP classification applications as follows: Use as many labeled examples as you can find for the task and domain of interest. Then, carefully construct a linear feature space such that all potentially useful combinations of properties are explicit dimensions in that space (rather than implicitly creating such dimensions through the use of kernels). For training, use the LIBLINEAR package [Fan et al., 2008], an amazingly fast solver that can return the SVM model in seconds even for tens of thousands of features and instances (other fast alternatives exist, but haven't been explored in this dissertation). This set-up allows for very rapid system development and evaluation, allowing us to focus on the features themselves, rather than the learning algorithm.

Since many of the tasks in this dissertation were completed before LIBLINEAR was available, we also present results using older solvers such as the logistic regression package in Weka [Witten and Frank, 2005], the efficient SVMmulticlass instance of SVMstruct [Tsochantaridis et al., 2004], and our old stand-by, Thorsten Joachims' SVMlight [Joachims, 1999a]. Whatever package is used, it should now be clear that in terms of this dissertation, training simply means learning a set of weights for a linear classifier using a given set of labeled data.
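As a concrete illustration of this recipe, the sketch below uses scikit-learn's LinearSVC, a wrapper around the same LIBLINEAR solver. The feature matrices, label vectors, and the random 200-dimensional data are hypothetical placeholders rather than anything from this dissertation; the sketch simply shows the unit-norm feature scaling and the regularization sweep described above, with every setting scored on the development set rather than the training set.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC  # thin wrapper around the LIBLINEAR solver

# Hypothetical feature matrices and labels standing in for a real NLP task;
# rows are instances, columns are the explicit feature dimensions.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 200)), rng.integers(0, 2, 1000)
X_dev, y_dev = rng.random((200, 200)), rng.integers(0, 2, 200)

# Optionally rescale each feature vector to unit Euclidean norm, as discussed
# above; whether this helps is itself checked on the development set.
X_train = normalize(X_train, norm="l2")
X_dev = normalize(X_dev, norm="l2")

# Sweep the regularization parameter from 0.000001 up to 100000 by factors of
# ten, always measuring performance on the development set.
best_C, best_acc = None, -1.0
for exponent in range(-6, 6):  # 1e-6, 1e-5, ..., 1e5
    C = 10.0 ** exponent
    clf = LinearSVC(C=C)
    clf.fit(X_train, y_train)
    acc = clf.score(X_dev, y_dev)
    if acc > best_acc:
        best_C, best_acc = C, acc

print(f"best C = {best_C:g}, dev accuracy = {best_acc:.3f}")
```

The same sweep can be run with the LIBLINEAR command-line tools by varying the -c option of its train program; only the setting that performs best on the development data is then carried forward to the test data.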
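To underline what that last sentence means, the trained model is nothing more than a weight vector and a bias term, and "testing" amounts to a dot product followed by a sign check. The values below are invented purely for illustration:

```python
import numpy as np

# A linear classifier is completely described by a weight vector and a bias;
# these particular numbers are made up, not weights learned by any solver.
w = np.array([0.7, -1.2, 0.3])
b = -0.1

# A test instance is a feature vector in the same space.
x = np.array([1.0, 0.5, 2.0])

# Classification: weighted sum of the features, then take the sign.
score = float(np.dot(w, x) + b)
label = 1 if score >= 0 else -1
print(score, label)
```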
2.4 Unsupervised Learning

There is a way to gather linguistic annotations without using any training data: unsupervised learning. This at first seems rather magical. How can a system produce labels without ever seeing them?

Most current unsupervised approaches in NLP are decidedly unmagical. Probably since so much current work is based on supervised training from labeled data, some rule-based
