
advanced-algorithmic-trading


Figure 22.2: K-Means Algorithm on Simulated Data with k = 3 and k = 4

distributions with differing means and variances. It is immediately apparent that the choice of

K when carrying out the K-Means algorithm is important for the interpretation of results.

In the left subplot the algorithm is forced to choose three clusters. It has largely captured

the three separate Gaussian clusters, assigning blue, red and green colours to each, although

it clearly has difficulty in selecting the closest cluster for the "outlying" points of each cluster

that lie in the neighbourhood of x_1 = 5, x_2 = 8. This is a difficult situation for any clustering

algorithm that involves overlapping data.

Recall that the K-Means algorithm is a hard clustering tool. That is, it assigns each point to

exactly one cluster, creating a distinct boundary between clusters, rather than assigning

membership probabilistically as in a soft clustering algorithm.
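
The distinction between hard and soft assignment can be illustrated with a minimal sketch. It is not taken from the text: the centroids and the test point are made-up values, and the soft membership assumes equal-weight spherical Gaussian components with unit variance, which is only one possible soft clustering model.

```python
import math


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def hard_assign(point, centroids):
    """K-Means style: return the index of the single nearest centroid."""
    distances = [squared_distance(point, c) for c in centroids]
    return distances.index(min(distances))


def soft_assign(point, centroids):
    """Soft style: return a membership probability for every cluster.

    Assumption: equal-weight spherical Gaussians with unit variance, so each
    probability is proportional to exp(-0.5 * squared distance).
    """
    distances = [squared_distance(point, c) for c in centroids]
    smallest = min(distances)  # shift before exponentiating, for stability
    weights = [math.exp(-0.5 * (d - smallest)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


centroids = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centres
boundary_point = (1.9, 0.0)           # lies close to the midpoint between them

hard_label = hard_assign(boundary_point, centroids)
soft_probs = soft_assign(boundary_point, centroids)

print("hard label:", hard_label)
print("soft memberships:", [round(p, 3) for p in soft_probs])
```

The hard assignment returns a single label even though the point lies almost halfway between the two centres, whereas the soft assignment reports how evenly the membership is split.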

In the right subplot the algorithm is forced to choose four clusters and has divided the

grouping on the left-hand side of the plot into two separate regions (yellow and red). However,

this particular cluster is known to have been generated from a single Gaussian distribution, so

the algorithm has clustered the data incorrectly. The remaining clusters on the right-hand

side are nonetheless correctly identified.

The choice of K has significant implications for the usefulness of the algorithm, particularly

with regard to quantitative trading applications.
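
The effect of K can be explored with a small sketch along the following lines. It is a plain-Python version of Lloyd's algorithm, not the code used to produce Figure 22.2. The cluster means, the unit variances, the sample sizes and the random seeds are all assumptions chosen for illustration; the simulated data in the figure uses differing variances.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k sampled data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Simulate three Gaussian clusters (hypothetical means, unit variance).
data_rng = random.Random(42)
true_means = [(0.0, 0.0), (6.0, 6.0), (6.0, 0.0)]
points = [
    (data_rng.gauss(mx, 1.0), data_rng.gauss(my, 1.0))
    for mx, my in true_means
    for _ in range(100)
]

results = {}
for k in (3, 4):
    centroids, labels = k_means(points, k)
    results[k] = [labels.count(j) for j in range(k)]
    print("k =", k, "cluster sizes:", results[k])
```

With k = 4 the algorithm must still place every point in one of four groups, so at least one of the three generating distributions ends up split, even though the data contain only three true clusters.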

22.1.4 OHLC Clustering

In this section K-Means Clustering will be used on daily Open-High-Low-Close (OHLC) data,

also known as bars or candles. Such analysis is interesting because it considers extra dimensions

to daily data that are often ignored, in favour of making sole use of adjusted closing prices.

Because it is important to compare each candle on a "like-for-like" basis, each of the High, Low

and Close dimensions will be normalised by the corresponding Open price. This has the added

benefit that stock splits, dividends and other "discrete" price-affecting corporate actions will

automatically be accounted for. By normalising each candle in this manner, the dimensionality

is reduced from four (Open, High, Low, Close) to three: High/Open, Low/Open, Close/Open.
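
The normalisation described above can be sketched for a single bar as follows. The function name `normalise_bar` and the prices are hypothetical and serve only as an illustration. The second bar mimics the same candle quoted after a 2-for-1 split, to show why the ratios are unaffected by such rescaling.

```python
def normalise_bar(open_, high, low, close):
    """Return (High/Open, Low/Open, Close/Open) for a single OHLC bar."""
    if open_ <= 0:
        raise ValueError("Open price must be positive")
    return (high / open_, low / open_, close / open_)


# Made-up prices for one daily candle: Open, High, Low, Close.
bar = (100.0, 104.0, 98.0, 102.0)

# The same candle with every price halved, as after a 2-for-1 split.
split_bar = tuple(price / 2.0 for price in bar)

features = normalise_bar(*bar)
split_features = normalise_bar(*split_bar)

print("normalised bar:      ", features)
print("normalised split bar:", split_features)
```

Because every price within a bar is divided by that bar's own Open, any constant rescaling of the bar cancels out, and each candle is described by three numbers rather than four.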

In the following code two years of S&P500 data will be downloaded. The bars will then

be plotted using Matplotlib. The data will then be normalised in the manner described above,
