24.07.2016 Views

www.allitebooks.com

Learning%20Data%20Mining%20with%20Python

Learning%20Data%20Mining%20with%20Python

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Chapter 8<br />

However, it isn't very good. We don't really expect letters to be moved around, just<br />

individual letter <strong>com</strong>parisons to be wrong. For this reason, we will create a different<br />

distance metric, which is simply the number of letters in the same positions that are<br />

incorrect. The code is as follows:<br />

def <strong>com</strong>pute_distance(prediction, word):<br />

return len(prediction) - sum(prediction[i] == word[i] for i in<br />

range(len(prediction)))<br />

We subtract the value from the length of the prediction word (which is four) to make<br />

it a distance metric where lower values indicate more similarity between the words.<br />

Putting it all together<br />

We can now test our improved prediction function using similar code to before.<br />

First we define a prediction, which also takes our list of valid words:<br />

from operator import itemgetter<br />

def improved_prediction(word, net, dictionary, shear=0.2):<br />

captcha = create_captcha(word, shear=shear)<br />

prediction = predict_captcha(captcha, net)<br />

prediction = prediction[:4]<br />

Up to this point, the code is as before. We do our prediction and limit it to the first<br />

four characters. However, we now check if the word is in the dictionary or not. If it<br />

is, we return that as our prediction. If it is not, we find the next nearest word. The<br />

code is as follows:<br />

if prediction not in dictionary:<br />

We <strong>com</strong>pute the distance between our predicted word and each other word in the<br />

dictionary, and sort it by distance (lowest first). The code is as follows:<br />

distances = sorted([(word, <strong>com</strong>pute_distance(prediction, word))<br />

for word in dictionary],<br />

key=itemgetter(1))<br />

We then get the best matching word—that is, the one with the lowest distance—and<br />

predict that word:<br />

best_word = distances[0]<br />

prediction = best_word[0]<br />

We then return the correctness, word, and prediction as before:<br />

return word == prediction, word, prediction<br />

[ 181 ]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!