
21.1 High Dimensional Data

Quantitative finance and algorithmic trading extend well beyond the analysis of asset price time series. The increased competition from the proliferation of quantitative funds has forced new and old firms alike to consider alternative data sources. Many of these sources are inhomogeneous, non-numeric and form extremely large databases. When suitably quantified, much of this data is extremely high-dimensional. Examples include satellite imagery, high-resolution video, corpora of text documents and sensor data.

To give some sense of the extreme dimensionality of these datasets, consider a standard 1080p monitor, which has a resolution of $1920 \times 1080 = 2073600$ pixels. If each of these pixels is restricted to displaying either black or white then there are $2^{2073600}$ potential images that can be displayed. This is a vast number. It becomes significantly worse when considering that each pixel often has $2^{24}$ potential colours, namely three separate 8-bit channels for red, green and blue respectively.
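
These counts are straightforward to reproduce. The following minimal Python sketch (purely illustrative, not part of the text) computes the number of pixels, the size of the black-and-white image space and the number of 24-bit colours:

import math

# Number of pixels on a standard 1080p display
pixels = 1920 * 1080                      # 2,073,600 pixels

# Each pixel restricted to black or white gives 2**pixels distinct images.
# The integer itself is enormous, so report its number of decimal digits instead.
digits = math.floor(pixels * math.log10(2)) + 1
print(f"2^{pixels} has {digits:,} decimal digits")

# 24-bit colour: three separate 8-bit channels for red, green and blue
colours_per_pixel = 2 ** 24
print(f"Colours per pixel: {colours_per_pixel:,}")   # 16,777,216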

Hence there is significant motivation, when searching through such datasets, to reduce the dimensionality to a manageable level. This is achieved by trying to find lower-dimensional subspaces that still capture the essence of the data signal. A key problem is that even with a huge number of samples the "training data cannot be expected to populate the space"[20]. If $N$ is the number of samples available and $p$ is the dimensionality of the space then we are in a situation where $p \gg N$. In essence, there are large subsets of the feature space where very little is known. This problem is often referred to as the Curse of Dimensionality.
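
One way to see the $p \gg N$ problem concretely is to draw a fixed, modest number of points uniformly from the unit hypercube and watch how sparse they become as the dimension grows. The sketch below uses only NumPy; the sample size and dimensions are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(42)
N = 100  # number of samples, held fixed

for p in (2, 10, 100, 1000):
    # N points drawn uniformly from the unit hypercube [0, 1]^p
    X = rng.random((N, p))

    # Pairwise Euclidean distances, ignoring the zero self-distances
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)

    # As p grows the nearest neighbour drifts further away:
    # a fixed sample cannot populate the space
    print(f"p={p:5d}  median nearest-neighbour distance = "
          f"{np.median(dists.min(axis=1)):.3f}")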

Much of unsupervised learning is thus concerned with means of reducing this dimensionality to a reasonable level while still retaining the "signal" within the data. Mathematically, we are attempting to describe the key variations in the data using a lower-dimensional manifold of dimension $q < p$, which is embedded within the larger $p$-dimensional space. Dimensionality reduction algorithms such as linear Principal Components Analysis (PCA) and non-linear kernel PCA have been developed for this task.
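
As a concrete illustration of such a reduction, the short sketch below applies scikit-learn's PCA and KernelPCA to synthetic data that lives near a $q$-dimensional subspace of $\mathbb{R}^p$. The data-generating process, the choice of $q$ and the RBF kernel parameters are assumptions made purely for the example:

import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)

# Synthetic data: N samples living near a q-dimensional subspace of R^p
N, p, q = 500, 50, 3
latent = rng.normal(size=(N, q))                      # hidden q-dimensional signal
mixing = rng.normal(size=(q, p))                      # embedding into p dimensions
X = latent @ mixing + 0.1 * rng.normal(size=(N, p))   # plus noise

# Linear PCA: project onto the first q principal components
pca = PCA(n_components=q)
X_pca = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))

# Kernel PCA: a non-linear variant using an RBF kernel
kpca = KernelPCA(n_components=q, kernel="rbf", gamma=0.01)
X_kpca = kpca.fit_transform(X)
print("Reduced shapes:", X_pca.shape, X_kpca.shape)   # both (500, 3)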

21.2 Mathematical Overview of Unsupervised Learning

In a supervised learning task a training set exists consisting of $N$ pairs of feature vectors, or predictors, $x_i \in \mathbb{R}^p$, as well as associated outputs or responses, $y_i \in \mathbb{R}$. Thus the dataset consists of $N$ tuples $(x_1, y_1), \ldots, (x_N, y_N)$. The responses $y_i$ can be considered as "labels" for the set of features. They are used to guide the supervised learning algorithm in its training phase. In order to train the model we need to define a loss function between the true value of the response $y$ and its estimate from the model $\hat{y}$, given by $L(y, \hat{y})$.
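
A common choice for $L(y, \hat{y})$ in the regression setting is the squared-error loss. The following minimal sketch, with a purely illustrative linear model and synthetic data, shows the loss being evaluated against a fitted model:

import numpy as np

def squared_error_loss(y, y_hat):
    """L(y, y_hat) = (y - y_hat)^2, averaged over the training set."""
    return np.mean((y - y_hat) ** 2)

# Toy supervised dataset of N (x_i, y_i) pairs with x_i in R^p
rng = np.random.default_rng(1)
N, p = 200, 5
X = rng.normal(size=(N, p))
beta = rng.normal(size=p)
y = X @ beta + 0.1 * rng.normal(size=N)

# Fit a linear model by ordinary least squares and evaluate the training loss
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
print("Training loss L(y, y_hat):", squared_error_loss(y, y_hat))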

The unsupervised learning setting differs in that the only available data is a set of unlabelled predictors $x_i$. That is, there are no associated labelled responses $y_i$ for each data point. Thus there is no concept of training or supervision for such techniques, since there is nothing for the algorithm to use as ground truth. Instead interest lies solely in the structure of the $x_i$ themselves.

As with supervised learning, one approach involves formulating the task probabilistically via a concept known as conditional density estimation[51, 71].

In the supervised learning case models are built of the form $p(y_i \mid x_i, \theta)$. Specific interest lies in the distribution of the responses $y_i$, conditional on both the feature vectors $x_i$ and the model parameters $\theta$.
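
For comparison, the two probabilistic formulations can be placed side by side; the unconditional form on the right is the standard formulation of unsupervised density estimation and is stated here purely for contrast with the supervised case above:

\[
\underbrace{p(y_i \mid x_i, \theta)}_{\text{supervised: conditional density estimation}}
\qquad \text{vs.} \qquad
\underbrace{p(x_i \mid \theta)}_{\text{unsupervised: unconditional density estimation}}
\]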
