Chapter 8: Beating CAPTCHAs with Neural Networks

Our targets are integer values from 0 to 25, each representing a letter of the alphabet. Neural networks don't usually output multiple values from a single neuron, instead preferring to have multiple outputs, each with values 0 or 1. We therefore perform one-hot encoding of the targets, giving us a target array that has 26 outputs per sample, with values near 1 if that letter is likely and near 0 otherwise. The code is as follows:

```python
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
y = onehot.fit_transform(targets.reshape(targets.shape[0], 1))
```

The library we are going to use doesn't support sparse arrays, so we need to turn our sparse matrix into a dense NumPy array. The code is as follows:

```python
y = y.todense()
```

Adjusting our training dataset to our methodology

Our training dataset differs from our final methodology quite significantly. Our dataset here consists of neatly created individual letters, each fitting a 20-pixel by 20-pixel image. The methodology involves extracting letters from words, which may squash them, move them away from the center, or create other problems.

Ideally, the data you train your classifier on should mimic the environment it will be used in. In practice we make concessions, but we aim to minimize the differences as much as possible.

For this experiment, we would ideally extract letters from actual CAPTCHAs and label those. In the interests of speeding up the process a bit, we will instead run our segmentation function on the training dataset and use the letters it returns.

We will need the resize function from scikit-image, as our sub-images won't always be 20 pixels by 20 pixels. The code is as follows:

```python
from skimage.transform import resize
```

From here, we can run our segment_image function on each sample and then resize the results to 20 pixels by 20 pixels. The code is as follows:

```python
dataset = np.array([resize(segment_image(sample)[0], (20, 20))
                    for sample in dataset])
```
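As a quick sanity check of the encoding step, here is a minimal sketch on a toy stand-in for the `targets` array (the three-letter array and the explicit `categories` argument are illustrative assumptions, not the chapter's data; fixing the categories to 0..25 guarantees 26 columns even when not every letter appears in the sample):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the chapter's `targets`: letter indices for 'C', 'A', 'B'
targets = np.array([2, 0, 1])

# Pin the category list to 0..25 so we always get 26 output columns
onehot = OneHotEncoder(categories=[list(range(26))])
y = onehot.fit_transform(targets.reshape(-1, 1))

# The result is sparse; densify it as the chapter does
y_dense = np.asarray(y.todense())
print(y_dense.shape)        # (3, 26)
print(y_dense[0].argmax())  # 2 -- the 'C' row has its single 1 in column 2
```

Each row sums to exactly 1, which is what the 26 output neurons will be trained to approximate.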
Next, we create our dataset. The dataset array is three-dimensional, as it is an array of two-dimensional images. Our classifier will need a two-dimensional array, so we simply flatten the last two dimensions:

```python
X = dataset.reshape((dataset.shape[0], dataset.shape[1] * dataset.shape[2]))
```

Finally, using the train_test_split function of scikit-learn, we create a set of data for training and one for testing. The code is as follows:

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, train_size=0.9)
```

Training and classifying

We are now going to build a neural network that will take an image as input and try to predict which (single) letter is in the image. We will use the training set of single letters we created earlier.

The dataset itself is quite simple. We have a 20 by 20 pixel image, with each pixel being 1 (black) or 0 (white). These are the 400 features that we will use as inputs to the neural network. The outputs will be 26 values between 0 and 1, where a higher value indicates a higher likelihood that the associated letter (the first neuron is A, the second is B, and so on) is the letter represented by the input image.

We are going to use the PyBrain library for our neural network. As with all the libraries we have seen so far, PyBrain can be installed from pip: pip install pybrain.

The PyBrain library uses its own dataset format, but luckily it isn't too difficult to create training and testing datasets using this format. The code is as follows:

```python
from pybrain.datasets import SupervisedDataSet
```

First, we iterate over our training dataset and add each sample to a new SupervisedDataSet instance. The code is as follows:

```python
training = SupervisedDataSet(X.shape[1], y.shape[1])
for i in range(X_train.shape[0]):
    training.addSample(X_train[i], y_train[i])
```
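The flatten-and-split step can be sketched end to end on toy data (the random 10-image array and the fixed random seed are illustrative assumptions standing in for the segmented letters):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the segmented-letter dataset: ten random 20x20 "images"
rng = np.random.RandomState(0)
dataset = (rng.rand(10, 20, 20) > 0.5).astype(float)
y = rng.rand(10, 26)  # stand-in for the one-hot target array

# Flatten each 20x20 image into a single 400-feature row, as in the chapter
X = dataset.reshape((dataset.shape[0], dataset.shape[1] * dataset.shape[2]))

# Hold out 10 percent of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.9, random_state=14)
print(X_train.shape, X_test.shape)  # (9, 400) (1, 400)
```

The same shapes carry straight through to the PyBrain step: SupervisedDataSet is constructed with 400 inputs and 26 outputs, one pair per flattened row.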