
In linear regression we are trying to find the orientation of the hyperplane that best linearly characterises the data.

"Best" in this case means minimising some form of error function. The most popular method for doing this is ordinary least squares (OLS). If we define the residual sum of squares (RSS), which is the sum of the squared differences between the outputs and the linear regression estimates:

\begin{align}
\text{RSS}(\beta) &= \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 \tag{5.2} \\
&= \sum_{i=1}^{N} \left( y_i - \beta^T x_i \right)^2 \tag{5.3}
\end{align}

Then the goal of OLS is to minimise the RSS via adjustment of the $\beta$ coefficients. Although we won't derive it here (see Hastie et al[51] for details), the Maximum Likelihood Estimate of $\beta$, which minimises the RSS, is given by:

\begin{equation}
\hat{\beta} = (X^T X)^{-1} X^T y \tag{5.4}
\end{equation}

To make a subsequent prediction $y_{N+1}$, given some new data $x_{N+1}$, we simply multiply the components of $x_{N+1}$ by the associated $\beta$ coefficients and obtain $y_{N+1}$.
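
As a concrete illustration of (5.4), the following minimal NumPy sketch estimates $\hat{\beta}$ via the normal equations on simulated data and then forms a prediction for a new point. The data, coefficient values and noise level are invented for demonstration purposes:

import numpy as np

# Simulated data: N observations, p predictors, plus an intercept column.
# N, p, true_beta and the noise scale are illustrative choices.
rng = np.random.default_rng(42)
N, p = 100, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # N x (p+1) design matrix
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=N)

# OLS estimate via the normal equations: beta_hat = (X^T X)^{-1} X^T y.
# Solving the linear system is numerically preferable to forming the
# inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for a new observation x_{N+1}: multiply its components
# by the associated beta coefficients.
x_new = np.array([1.0, 0.3, -1.2])  # first component is the intercept
y_new = x_new @ beta_hat
print(beta_hat, y_new)

Note that beta_hat comes back as a single fixed vector, which is precisely the point-estimate interpretation discussed next.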

The important point here is that $\hat{\beta}$ is a point estimate, meaning that it is a single value in real-valued $(p+1)$-dimensional space, $\mathbb{R}^{p+1}$. In the Bayesian formulation we will see that the interpretation differs substantially.

5.2 Bayesian Linear Regression

In a Bayesian framework linear regression is stated in a probabilistic manner; that is, the linear regression model above is reformulated in probabilistic language. The syntax for a linear regression in a Bayesian framework looks like this:

\begin{equation}
\mathbf{y} \sim \mathcal{N}\left( \beta^T X, \sigma^2 I \right) \tag{5.5}
\end{equation}

The response values $\mathbf{y}$ are sampled from a multivariate normal distribution that has a mean equal to the product of the $\beta$ coefficients and the predictors, $X$, and a variance of $\sigma^2$. Here $I$ refers to the identity matrix, which is necessary because the distribution is multivariate.
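
To make this generative reading of (5.5) concrete, here is a minimal sketch (again with invented $\beta$ and $\sigma$ values) that samples a response vector from the stated multivariate normal; the mean $\beta^T X$ is written as the matrix-vector product X @ beta:

import numpy as np

# Sample y ~ N(X beta, sigma^2 I); beta, sigma and the design are illustrative.
rng = np.random.default_rng(0)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept plus one predictor
beta = np.array([1.0, 2.0])
sigma = 0.5

mean = X @ beta              # mean vector of the multivariate normal
cov = sigma**2 * np.eye(N)   # sigma^2 I: independent, equal-variance noise
y = rng.multivariate_normal(mean, cov)

Because the covariance is $\sigma^2 I$, this is equivalent to adding independent $\mathcal{N}(0, \sigma^2)$ noise to each component of the mean vector.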

This is different to how the frequentist approach is usually outlined. In the frequentist setting above there is no mention of probability distributions for anything other than the measurement error. In the Bayesian formulation the entire problem is recast such that the $y_i$ values are samples from a normal distribution.

A common question at this stage is "What is the benefit of doing this?" What do we get out of this reformulation? There are two main reasons for doing so[99]:

• Prior Distributions: If we have any prior knowledge about the parameters $\beta$ then we can choose prior distributions that reflect this. If we do not then we can still choose non-informative priors.
