
20.2.1 Overview of Cross-Validation

Recall from the section above the definitions of test error and flexibility:

• Test Error - The average error, where the average is taken across many observations, associated with the predictive performance of a particular statistical model when assessed on new observations that were not used to train the model.

• Flexibility - The degrees of freedom available to the model to "fit" to the training data. A linear regression is very inflexible (it has only two degrees of freedom) whereas a high-degree polynomial is very flexible (and as such can have many degrees of freedom).

With these concepts in mind we can now define cross-validation.

The goal of cross-validation is to estimate the test error associated with a statistical model or to select the appropriate level of flexibility for a particular statistical method.

Recall that the training error associated with a model can vastly underestimate its test error. Cross-validation provides us with the capability to estimate the test error more accurately, since we will never know the true test error in practice.

Cross-validation works by "holding out" particular subsets of the training set in order to use them as test observations. In this section we will discuss the various ways in which such subsets are held out. In addition we will implement the methods using Python on an example forecasting model based on prior historical data.
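The hold-out mechanism can be sketched with scikit-learn's KFold splitter. The data below are synthetic stand-ins, not the Amazon prices introduced later; the model-fitting step is left as a comment since the choice of model comes in the following subsection.

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic (predictor, response) observations -- a hypothetical
# stand-in for real pricing data, used purely to show the splits.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=100)

# Each iteration "holds out" one fold of the training set as test
# observations and leaves the remaining folds for fitting.
kf = KFold(n_splits=5, shuffle=False)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ...fit the model on (X_train, y_train), score it on
    # (X_test, y_test); averaging the fold scores estimates test error.
```

With 100 observations and five folds, each split holds out 20 observations, and every observation appears as a test point in exactly one fold.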

20.2.2 Forecasting Example

In order to make the following theoretical discussion concrete we will consider the development of a new trading strategy based on the prediction of price levels of Amazon, Inc. We could equally pick the S&P500 index, as in the time series sections, or any other asset with pricing data.

For this approach we will simply consider the closing prices of the historical daily Open-High-Low-Close (OHLC) bars as predictors and the following day's closing price as the response. Hence we are attempting to predict tomorrow's price using daily historic prices. This is similar to an autoregressive model from the time series section, except that we will use both a linear regression and a non-linear polynomial regression as our machine learning models.

An observation will consist of a pair, x_i and y_i, which contain the predictor values and the response value respectively. If we consider a daily lag of p days, then x_i has p components. Each of these components represents the closing price from one day further back: x_p represents today's closing price (known), x_{p-1} represents yesterday's closing price, and x_1 represents the price p − 1 days ago.

y_i contains only a single value, namely tomorrow's closing price, and is thus a scalar. Hence each observation is a tuple (x_i, y_i). We will consider a set of n observations corresponding to n days' worth of historical pricing information for Amazon (see Fig 20.3).
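Under these definitions, constructing the observations amounts to sliding a window of length p over the closing-price series. A minimal sketch follows; the function name and the toy prices are illustrative, not taken from the text:

```python
import numpy as np

def make_lagged_dataset(close, p):
    """Build (x_i, y_i) pairs from a closing-price series.

    Each row of X holds the p most recent closing prices, ordered so
    that the first column is x_1 (the price p-1 days ago) and the last
    column is x_p (today's close); y holds tomorrow's closing price.
    """
    close = np.asarray(close, dtype=float)
    n = len(close) - p  # number of usable observations
    X = np.column_stack([close[j:j + n] for j in range(p)])
    y = close[p:]
    return X, y

# Toy series standing in for historical Amazon closes (illustrative only).
prices = [10.0, 10.5, 10.2, 10.8, 11.0, 11.3]
X, y = make_lagged_dataset(prices, p=3)
# X[0] = [10.0, 10.5, 10.2] (x_1 .. x_3), y[0] = 10.8
```

Row i of X therefore contains the closes from day i through day i + p − 1, and y[i] is the close on day i + p, matching the (x_i, y_i) pairing above.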

Our goal is to find a statistical model that attempts to predict the price level of Amazon based on the previous days' prices. If we were to achieve an accurate prediction, we could use it to generate basic trading signals.

We will use cross-validation in two ways: firstly to estimate the test error of particular statistical learning methods (i.e. their individual predictive performance), and secondly to select the optimal flexibility of the chosen method in order to minimise the errors associated with bias and variance.
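Both uses can be sketched together with scikit-learn: cross_val_score estimates the test error of each candidate model, and comparing those estimates across polynomial degrees selects the flexibility. The data here are synthetic stand-ins for the lagged Amazon observations, and plain k-fold ignores the temporal ordering a real price series would demand (a forward-chaining splitter would be more appropriate there).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic lagged observations -- a hypothetical stand-in for the
# Amazon dataset: "tomorrow's" price driven mostly by today's close.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, -1] + 0.1 * rng.normal(size=200)

# Cross-validated MSE for each candidate flexibility (polynomial
# degree); the degree with the lowest estimated test error would
# be the one selected.
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(degree, -scores.mean())
```

Here the degree-1 model should win, since the synthetic response is linear in the predictors; with real pricing data the comparison is exactly how cross-validation trades off bias against variance.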
