
22.1.3 Simulated Data

In this section K-Means Clustering will be applied to a set of simulated data in order to provide familiarisation with the specific Scikit-Learn implementation of the algorithm. It will also be shown how the choice of K in the algorithm is extremely important in order to achieve good results.
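To make the sensitivity to K concrete before working through the main example, the following self-contained sketch (not from the book; the blob centres, sample counts and the use of the KMeans `inertia_` attribute are illustrative assumptions) fits K-Means over a range of K values on three well-separated Gaussian blobs and records the within-cluster sum of squares. The drop in inertia is steep up to the true number of clusters and marginal afterwards:

```python
# Illustrative sketch: how inertia (within-cluster sum of squares)
# behaves as K varies on three well-separated Gaussian blobs.
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)

# Three clusters of 50 points each, centred far apart
blobs = np.vstack([
    np.random.randn(50, 2) + centre
    for centre in [(0, 0), (10, 0), (0, 10)]
])

# Fit K-Means for K = 1..4 and store the inertia for each
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(blobs).inertia_
    for k in (1, 2, 3, 4)
}
# Inertia falls sharply until K matches the true cluster count (3),
# then only marginally; a poor choice of K misrepresents the structure.
```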

The task will involve sampling three separate two-dimensional Gaussian distributions to create a selection of observation data. The K-Means Algorithm will then be used, with various choices of the parameter K, to infer cluster membership. A comparison of two separate choices for K will be plotted along with inferred cluster membership by colour.

Similar simulated data exercises have been carried out throughout the book. They help identify the potential flaws in models on synthetic data, the statistical properties of which are easily controlled. This provides strong insight into the limitations of such models prior to their application on real financial data, where we most certainly cannot control the statistical properties!

The first step is to import the necessary libraries. The Python itertools library is used to chain lists of lists together when generating the random sample data. Itertools is an extremely useful library, which can save a significant amount of development time. Reading through the documentation is a good exercise for any budding quant developer or trader.

The remaining imports are NumPy, Matplotlib and the KMeans class from Scikit-Learn, which lives in the cluster module:

# simulated_data.py

import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

The first task within the __main__ function is to set the random seed so that the code below is completely reproducible. Subsequently the number of samples for each cluster is set (samples=100), as well as the two-dimensional means and covariance matrices of each Gaussian cluster (one per element in each list).

The norm_dists list contains three separate two-dimensional lists of observations, one for each cluster, generated using a list comprehension. Finally the observational data X is generated by chaining each of these sublists using the itertools library:

np.random.seed(1)

# Set the number of samples, the means and
# variances of each of the three simulated clusters
samples = 100
mu = [(7, 5), (8, 12), (1, 10)]
cov = [
    [[0.5, 0], [0, 1.0]],
    [[2.0, 0], [0, 3.5]],
    [[3, 0], [0, 5]],
]
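The excerpt ends at the covariance matrices, so the remaining steps described in the prose are sketched below in a self-contained form: the norm_dists list comprehension, the chained observation array X, and K-Means fits for two choices of K. The use of np.random.multivariate_normal and the n_init/random_state arguments are assumptions added for reproducibility, not necessarily the book's exact listing:

```python
# Sketch of the remaining steps described in the text (assumed, not verbatim).
import itertools
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(1)
samples = 100
mu = [(7, 5), (8, 12), (1, 10)]
cov = [
    [[0.5, 0], [0, 1.0]],
    [[2.0, 0], [0, 3.5]],
    [[3, 0], [0, 5]],
]

# One set of `samples` two-dimensional draws per Gaussian cluster
norm_dists = [
    np.random.multivariate_normal(m, c, samples)
    for m, c in zip(mu, cov)
]

# Chain the three sublists into a single (300, 2) observation array
X = np.array(list(itertools.chain(*norm_dists)))

# Fit K-Means for two separate choices of K to compare the
# inferred cluster memberships (e.g. coloured scatter plots)
km3 = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
km2 = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
```

With K=3 each fitted cluster recovers one of the three simulated Gaussians; with K=2 two of the generating distributions are forced into a single cluster, which is visible once the labels are plotted by colour.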
