
• Continually monitor the system and adjust it as necessary if its performance begins to degrade

In this particular section we will avoid discussion of how to download multiple articles from external sources and instead make use of a given dataset that already comes with its own provided labels. This will allow us to concentrate on the implementation of the "classification pipeline", rather than spend a substantial amount of time obtaining and tagging documents.

While beyond the scope of this section, it is possible to make use of Python libraries, such as Scrapy and BeautifulSoup, to automatically obtain many web-based articles and effectively extract their text from the HTML that makes up each page.
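As a flavour of what such extraction involves, the following minimal sketch uses requests and BeautifulSoup to pull the visible text out of a single page. The URL is purely illustrative and the tag-stripping is deliberately crude:

# A minimal sketch of HTML text extraction (illustrative URL)
import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.example.com/some-article.html")
soup = BeautifulSoup(response.text, "html.parser")

# Drop non-content tags, then collapse the remaining visible text
for tag in soup(["script", "style"]):
    tag.decompose()
article_text = " ".join(soup.get_text().split())
print(article_text[:200])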

Under the assumption that we have a document corpus that is pre-labelled (the process of which will be outlined below), we will begin by taking the training corpus and incorporating it into a Python data structure suitable for pre-processing and consumption by the classifier.
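To make that data structure concrete, here is a minimal sketch that reads a pre-labelled corpus into a list of (text, label) pairs. The per-class subdirectory layout is an assumption for illustration, not the format of the Reuters dataset used later:

# Sketch: load a pre-labelled corpus laid out as corpus/<label>/<doc>.txt
# (the directory layout here is hypothetical)
import os

def load_corpus(root_dir):
    docs = []  # list of (document text, class label) tuples
    for label in sorted(os.listdir(root_dir)):
        label_dir = os.path.join(root_dir, label)
        if not os.path.isdir(label_dir):
            continue
        for fname in sorted(os.listdir(label_dir)):
            with open(os.path.join(label_dir, fname), encoding="utf-8") as f:
                docs.append((f.read(), label))
    return docs

corpus = load_corpus("corpus")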

23.2 Supervised Document Classification

Consider a collection of text documents. Each document has an associated set of keywords, which we will call "features". Each of these documents might possess a class label describing what the article is about.

For instance, a website discussing pets might have articles that are primarily about dogs, cats or hamsters (say). Certain words, such as "cage" (hamster), "leash" (dog) or "milk" (cat), might be more representative of certain pets than others. Supervised classifiers are able to isolate the words that are representative of particular labels by learning from a set of training articles that have already been pre-labelled, often manually by a human.

Mathematically, each article $j$ about pets within a training corpus has an associated feature vector $x_j$, with components of this vector representing the "strength" of words (we will define "strength" below). Each article also has an associated class label, $y_j$, which in this case would be the name of the pet most associated with the article.

The classifier is fed feature vector-class label pairs and learns how representative features are of particular class labels.
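As a concrete, if toy, illustration of this process, the following sketch converts four invented pet "articles" into TF-IDF feature vectors (one possible measure of word "strength"; the measure actually used is defined below) and feeds the resulting vector-label pairs to a linear SVM via scikit-learn:

# Toy illustration: feature vector-class label pairs fed to an SVM.
# The four "articles" are invented for demonstration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = [
    "the dog pulled hard on its leash during the walk",
    "our cat lapped up the milk left in the bowl",
    "the hamster ran on the wheel inside its cage",
    "we bought a new leash and collar for the dog",
]
train_labels = ["dog", "cat", "hamster", "dog"]

# Each document j becomes a feature vector x_j
vectoriser = TfidfVectorizer()
X = vectoriser.fit_transform(train_docs)

# Fit on the (x_j, y_j) pairs, then classify an unseen sentence
svm = LinearSVC()
svm.fit(X, train_labels)
print(svm.predict(vectoriser.transform(["a wheel and cage for the hamster"])))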

In the following example we will use the SVM as our model and train it on a corpus (a collection of documents) that has been previously generated.

23.3 Preparing a Dataset for Classification

A famous dataset often used in machine learning classification design is the Reuters 21578 set, the details of which can be found at http://www.daviddlewis.com/resources/testcollections/reuters21578/. It is one of the most widely used testing datasets for text classification.

The set consists of a collection of news articles (a corpus) that are tagged with a selection of topics and geographic locations. Thus it comes ready-made for use in classification tasks as it is already pre-labelled.

We will now download, extract and prepare the dataset. The following instructions assume you have access to a command line interface, such as the Terminal that ships with Linux or Mac OS X. If you are using Windows then you will need to download a Tar/GZIP extraction tool, such as 7-Zip, to get hold of the data.
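Alternatively, if you would rather stay within Python than use the shell, a sketch along the following lines downloads and unpacks the archive. The filename reuters21578.tar.gz is assumed from the page linked above:

# Sketch: download and unpack the Reuters 21578 archive in Python.
# The archive filename is assumed from the page linked above.
import tarfile
import urllib.request

URL = ("http://www.daviddlewis.com/resources/testcollections/"
       "reuters21578/reuters21578.tar.gz")

urllib.request.urlretrieve(URL, "reuters21578.tar.gz")
with tarfile.open("reuters21578.tar.gz", "r:gz") as archive:
    archive.extractall("reuters21578")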
