
Secondly, note that in the 50-50 split of testing/training data we are leaving out half of all observations, reducing the information that would otherwise be used to train the model. The model is thus likely to perform worse than if it had been fitted on all of the observations, including those in the validation set. This means we may actually overestimate the test error for the full data set.

Thirdly, time series data (especially in quantitative finance) possess a degree of serial correlation. This means that the observations are not independent and identically distributed (iid). Hence the process of randomly assigning observations to separate samples is not strictly valid and will introduce its own error, which must be considered.

In order to reduce the impact of these issues we will consider a more sophisticated splitting of the data known as k-fold cross-validation.

20.2.4 k-Fold Cross Validation

K-fold cross-validation improves upon the validation set approach by dividing the n observations into k mutually exclusive and approximately equally sized subsets known as "folds". The first fold becomes the validation set, while the remaining k − 1 folds (aggregated together) become the training set. The model is fit on the training set and its test error is estimated on the validation set. This procedure is repeated k times, with each repetition holding out a different fold as the validation set while the remaining k − 1 folds are used for training.

This allows an overall test estimate, $\text{CV}_k$, to be calculated that is an average of all the individual mean-squared errors, $\text{MSE}_i$, for each fold:

$$\text{CV}_k = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i \tag{20.12}$$
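As a concrete illustration, the following is a minimal sketch of this procedure in Python using scikit-learn. The synthetic dataset generated by make_regression and the linear regression model are hypothetical stand-ins, not examples from the text:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical synthetic regression dataset standing in for a real one
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

fold_mses = []
for train_idx, val_idx in kf.split(X):
    # Fit on the aggregated k-1 training folds,
    # then estimate the MSE on the held-out validation fold
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    fold_mses.append(mean_squared_error(y[val_idx], preds))

# CV_k is the average of the k individual fold MSEs, as in Eq. (20.12)
cv_k = np.mean(fold_mses)
print("Fold MSEs: %s" % ["%.2f" % m for m in fold_mses])
print("CV_k = %.2f" % cv_k)
```

Note that shuffle=True is appropriate here only because the synthetic observations are iid; for the serially correlated time series discussed above, random assignment of observations to folds would introduce the error noted earlier.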

The obvious question that arises at this stage is: what value do we choose for k? The short answer (based on empirical studies) is to choose k = 5 or k = 10. The longer answer relates to both computational expense and the bias-variance tradeoff.

Leave-One-Out Cross Validation

We can actually choose k = n, which means that we fit the model n times, with only a single observation left out for each fitting. This is known as leave-one-out cross-validation (LOOCV). It can be very computationally expensive, particularly if n is large and the model has an expensive fitting procedure, as fitting must be repeated n times.
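As a short sketch under the same assumptions as the k-fold example above (a hypothetical synthetic dataset and a linear model), LOOCV can be carried out with scikit-learn's LeaveOneOut splitter:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# A small hypothetical dataset; LOOCV requires n separate model fits
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=42)

# scikit-learn scorers maximise, so the MSE is returned negated
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)
cv_n = -scores.mean()  # average of the n single-observation squared errors
print("LOOCV estimate of test MSE: %.2f" % cv_n)
```

Using cross_val_score here simply automates the n fit-and-score repetitions that the loop in the k-fold example performs explicitly.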

While LOOCV is beneficial in reducing bias, due to the fact that nearly all of the samples are used for fitting in each case, it actually suffers from the problem of high variance. This is because we are calculating the test error on a single response each time, for each observation in the data set.

k-fold cross-validation reduces this variance at the expense of introducing some more bias, due to the fact that some of the observations are not used for training in each fit. With k = 5 or k = 10 the bias-variance tradeoff is generally optimised.
