
Chapter 2

Nearest neighbors can be used for nearly any dataset; however, it can be very computationally expensive to compute the distance between all pairs of samples. For example, if there are 10 samples in the dataset, there are 45 unique distances to compute. However, if there are 1000 samples, there are nearly 500,000! Various methods exist for improving this speed dramatically; some of them are covered in the later chapters of this book.
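As a quick check of those counts (this snippet is illustrative, not the book's code), the number of unique pairs among n samples is n(n-1)/2:

```python
def unique_pair_count(n):
    # Each of the n samples is paired with every other sample once,
    # and the pair (a, b) is the same as (b, a), so divide by two.
    return n * (n - 1) // 2

print(unique_pair_count(10))    # 45 unique distances
print(unique_pair_count(1000))  # 499,500 -- nearly half a million
```

The quadratic growth is why a tenfold increase in samples produces roughly a hundredfold increase in distance computations.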

It can also perform poorly on categorical-based datasets, and another algorithm should be used for these instead.

Distance metrics

A key underlying concept in data mining is that of distance. If we have two samples, we need to know how close they are to each other. Furthermore, we need to answer questions such as: are these two samples more similar than those other two? Answering questions like these is important to the outcome of the case.

The most common distance metric that people are aware of is Euclidean distance, which is the real-world distance. If you were to plot the points on a graph and measure the distance with a straight ruler, the result would be the Euclidean distance. A little more formally, it is the square root of the sum of the squared differences for each feature.
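That formula can be sketched in a few lines of plain Python (a minimal illustration, not code from the book):

```python
import math

def euclidean_distance(x, y):
    # Square root of the sum of squared per-feature differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0 -- the familiar 3-4-5 triangle
```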

Euclidean distance is intuitive, but it provides poor accuracy if some features have larger values than others. It also gives poor results when lots of features have a value of 0, known as a sparse matrix. There are other distance metrics in use; two commonly employed ones are the Manhattan and Cosine distances.

The Manhattan distance is the sum of the absolute differences in each feature (with no use of squared distances). Intuitively, it can be thought of as the number of moves a rook piece (or castle) in chess would take to move between the points, if it were limited to moving one square at a time. While the Manhattan distance does suffer if some features have larger values than others, the effect is not as dramatic as in the case of the Euclidean distance.
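A small sketch of the Manhattan distance, for comparison with the Euclidean version (again illustrative, not the book's code):

```python
def manhattan_distance(x, y):
    # Sum of absolute per-feature differences -- no squaring,
    # so large differences are not amplified as much as in Euclidean distance.
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan_distance([0, 0], [3, 4]))  # 7, versus 5.0 for Euclidean distance
```

Note how the single squared-then-rooted path of length 5 becomes a "grid walk" of 3 + 4 = 7 steps, matching the rook-move intuition above.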

