24.07.2016 Views


Learning Data Mining with Python



Chapter 8

Summary

In this chapter, we worked with images, using simple pixel values to predict the letter portrayed in a CAPTCHA. Our CAPTCHAs were somewhat simplified; we only used complete four-letter English words. In practice, the problem is much harder, as it should be! With some improvements, it would be possible to solve much harder CAPTCHAs with neural networks and a methodology similar to what we discussed. The scikit-image library contains many useful functions for extracting shapes from images, improving contrast, and other image tools that will help.

We took our larger problem of predicting words and created a smaller, simpler problem of predicting letters. From here, we were able to create a feed-forward neural network to accurately predict which letter was in the image. At this stage, our results were very good, with 97 percent accuracy.

Neural networks are simply connected sets of neurons, which are basic computation devices consisting of a single function. However, when you connect these together, they can solve incredibly complex problems. Neural networks are the basis for deep learning, which is one of the most effective areas of data mining at the moment.
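
To make the idea of a neuron as "a single function" concrete, here is a minimal sketch in plain Python. It assumes a sigmoid activation and uses hand-picked weights that are purely illustrative (they are not trained, and they are not the network built in this chapter); the names `sigmoid`, `neuron`, and `layer` are hypothetical helpers.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias):
    """One neuron: a weighted sum of its inputs passed through an activation."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(total)


def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical inputs and weights, chosen by hand for illustration only.
pixels = [0.0, 1.0, 1.0]
hidden = layer(pixels, [[0.5, -0.5, 0.25], [1.0, 1.0, -1.0]], [0.0, 0.1])
output = layer(hidden, [[1.0, -1.0]], [0.0])
print(output)
```

Each neuron on its own is trivial; feeding the outputs of one layer into the next is what gives a feed-forward network its expressive power.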

Despite our great per-letter accuracy, performance drops to just over 50 percent when predicting a whole word. Several factors contributed to this, reflecting the difficulty of taking a problem from an experiment to the real world.
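
As a rough sanity check on those numbers, the sketch below estimates whole-word accuracy under the simplifying assumption that each letter is predicted independently with the same accuracy. That independence assumption is ours, for illustration, and `word_accuracy` is a hypothetical helper.

```python
def word_accuracy(letter_accuracy, word_length):
    """Chance that every letter is right, assuming independent errors."""
    return letter_accuracy ** word_length


# 97 percent per-letter accuracy on four-letter words.
estimate = word_accuracy(0.97, 4)
print(round(estimate, 3))  # 0.885
```

Even this idealized estimate is lower than the per-letter figure, because one wrong letter spoils the whole word. The observed result of just over 50 percent is lower still, which is consistent with additional factors being at play beyond per-letter classification errors.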

We improved our accuracy using a dictionary, searching for the best matching word. To do this, we considered the commonly used edit distance; however, we simplified it because we were only concerned with individual mistakes on letters, not insertions or deletions. This improvement netted some benefit, but there are still many improvements you could try to further boost the accuracy.
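
The dictionary lookup described above can be sketched roughly as follows. This is a minimal illustration, not necessarily the exact code from the chapter: the function names and the tiny word list are hypothetical, and the distance simply counts positions where two same-length strings differ (substitutions only, no insertions or deletions).

```python
def letter_distance(prediction, word):
    """Count the positions at which two same-length strings differ."""
    if len(prediction) != len(word):
        raise ValueError("strings must have the same length")
    return sum(p != w for p, w in zip(prediction, word))


def best_match(prediction, dictionary):
    """Return the same-length dictionary word closest to the prediction.

    Ties go to whichever candidate appears first in the dictionary.
    Raises ValueError if no word of the right length exists.
    """
    candidates = [w for w in dictionary if len(w) == len(prediction)]
    return min(candidates, key=lambda w: letter_distance(prediction, w))


# Hypothetical four-letter word list standing in for a real dictionary.
words = ["BEAR", "BEAT", "BOAT", "GOAT"]
print(best_match("GOQT", words))  # GOAT
```

Because the CAPTCHA words always have a fixed length, restricting the comparison to substitutions keeps the distance cheap to compute and easy to reason about.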

In the next chapter, we will continue with string comparisons. We will attempt to determine which author (out of a set of authors) wrote a particular document, using only the content and no other information!

