
Learning Data Mining with Python

24.07.2016 Views

Beating CAPTCHAs with Neural Networks

Interpreting information contained in images has long been a difficult problem in data mining, but it is one that is now starting to be addressed. The latest research is providing algorithms that detect and understand images to the point where automated commercial surveillance systems are now being used in real-world scenarios by major vendors. These systems are capable of understanding and recognizing objects and people in video footage.

It is difficult to extract information from images. There is a lot of raw data in an image, and the standard method for encoding images, pixels, isn't that informative by itself. Images, particularly photos, can be blurry, too close to the targets, too dark, too light, scaled, cropped, skewed, or affected by any of a variety of other problems that cause havoc for a computer system trying to extract useful information.

In this chapter, we look at extracting text from images by using neural networks to predict each letter. The problem we are trying to solve is to automatically understand CAPTCHA messages. CAPTCHAs are images designed to be easy for humans to solve and hard for a computer to solve, as per the acronym: Completely Automated Public Turing test to tell Computers and Humans Apart. Many websites use them for registration and commenting systems to stop automated programs flooding their site with fake accounts and spam comments.

The topics covered in this chapter include:

• Neural networks
• Creating our own dataset of CAPTCHAs and letters
• The scikit-image library for working with image data
• The PyBrain library for neural networks

• Extracting basic features from images
• Using neural networks for larger-scale classification tasks
• Improving performance using postprocessing

Artificial neural networks

Neural networks are a class of algorithm originally designed based on the way that human brains work. However, modern advances are generally based on mathematics rather than biological insights. A neural network is a collection of neurons that are connected together. Each neuron is a simple function of its inputs, which generates an output.

The function that defines a neuron's processing can be any standard function, such as a linear combination of the inputs, and is called the activation function. For the commonly used learning algorithms to work, we need the activation function to be differentiable and smooth. A frequently used activation function is the logistic function, which is defined by the following equation (k is often simply 1, x is the input into the neuron, and L is normally 1, that is, the maximum value of the function):

f(x) = L / (1 + e^(-kx))

The graph of this function, for x from -6 to +6, is shown as follows:

[Figure: plot of the logistic function for x from -6 to +6]
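As a rough illustration of the logistic activation function and of a single neuron described above, here is a minimal Python sketch. The function names (logistic, neuron_output), the weights, and the bias term are assumptions made for this example and are not taken from the chapter; the neuron is modeled as a weighted sum of its inputs passed through the logistic function, which is one common choice rather than the only one.

```python
import math


def logistic(x, L=1.0, k=1.0):
    """Logistic function L / (1 + e^(-k * x)).

    L is the maximum value of the function and k controls its steepness;
    both default to 1, as the text says is typical.
    """
    return L / (1.0 + math.exp(-k * x))


def neuron_output(inputs, weights, bias=0.0):
    """Hypothetical single neuron: a linear combination of the inputs
    (weights and bias are illustrative assumptions) passed through the
    logistic activation function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(total)


if __name__ == "__main__":
    # Print the function over the same range as the plot, -6 to +6.
    for x in range(-6, 7, 2):
        print(f"{x:+d} -> {logistic(x):.4f}")

    # Two inputs with made-up weights.
    print(neuron_output([1.0, 0.5], [0.4, -0.8]))
```

The output stays strictly between 0 and L, is exactly L / 2 at x = 0, and flattens out towards either end of the range. This smooth, bounded shape is what makes the function convenient for the learning algorithms mentioned in the text.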

