24.07.2016 Views


Learning Data Mining with Python



Chapter 11

This dataset is a popular one called CIFAR-10. It contains 60,000 images that are 32 pixels wide and 32 pixels high, with each pixel having a red-green-blue (RGB) value. The dataset is already split into training and testing sets, although we will not use the testing set until after we complete our training.

The CIFAR-10 dataset is available for download at: http://www.cs.toronto.edu/~kriz/cifar.html. Download the Python version, which has already been converted to NumPy arrays.

Opening a new IPython Notebook, we can see what the data looks like. First, we set up the data filenames. We will only worry about the first batch to start with, and scale up to the full dataset size towards the end:

import os
# Folder where the downloaded archive was extracted (under the home directory)
data_folder = os.path.join(os.path.expanduser("~"), "Data", "cifar-10-batches-py")
batch1_filename = os.path.join(data_folder, "data_batch_1")
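Before loading anything, it can help to confirm that the archive was extracted where the code expects it. The following is a minimal sketch and not part of the book's code: `existing_batches` is a hypothetical helper, and it assumes the training files are named data_batch_1 through data_batch_5, as in the Python version of the download.

```python
import os


def existing_batches(folder, n_batches=5):
    """Return the paths of the training batch files found in folder.

    Hypothetical helper: assumes files are named data_batch_1 ... data_batch_N.
    """
    paths = []
    for i in range(1, n_batches + 1):
        path = os.path.join(folder, "data_batch_{}".format(i))
        if os.path.isfile(path):
            paths.append(path)
    return paths
```

Calling existing_batches(data_folder) should return five paths once the archive is in place; an empty list suggests that the folder name is wrong.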

Next, we create a function that can read the data stored in the batches. The batches have been saved using pickle, which is a Python library for saving objects. Usually, we can just call pickle.load on the file to get the object. However, there is a small issue with this data: it was saved in Python 2, but we need to open it in Python 3. To address this, we set the encoding to latin1 (even though we are opening the file in byte mode):

import pickle
# Bugfix thanks to: http://stackoverflow.com/questions/11305790/pickle-incompatability-of-numpy-arrays-between-python-2-and-3
def unpickle(filename):
    with open(filename, 'rb') as fo:
        return pickle.load(fo, encoding='latin1')
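To see why the encoding argument matters, the sketch below loads a hand-written protocol-0 pickle of the kind Python 2 produced for an 8-bit string. The bytes in `PY2_PICKLE` are invented for illustration and are not taken from CIFAR-10, and both helper functions are hypothetical. With the default ASCII encoding, Python 3 cannot decode the byte 0xE9; with latin1, every byte maps to a character, so loading succeeds.

```python
import pickle

# Invented example data: a protocol-0 pickle of the Python 2 str '\xe9'.
# "S" is the opcode for an 8-bit string and "." ends the pickle.
PY2_PICKLE = b"S'\\xe9'\n."


def load_py2(data, encoding):
    """Load pickled bytes, decoding Python 2 str objects with encoding."""
    return pickle.loads(data, encoding=encoding)


def default_encoding_works(data):
    """Return True if loading with the default (ASCII) encoding succeeds."""
    try:
        pickle.loads(data)
        return True
    except UnicodeDecodeError:
        return False
```

Passing encoding="bytes" instead leaves such strings as bytes objects, which is an alternative when the content is binary rather than text.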

Using this function, we can now load the batch dataset:

batch1 = unpickle(batch1_filename)

This batch is a dictionary containing the actual data in NumPy arrays, the corresponding labels and filenames, and a note saying which batch it is (for instance, this one is training batch 1 of 5).
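A quick way to check this structure is to list each key with the type and size of its value. The sketch below is only an illustration: `summarize_batch` is a hypothetical helper and `fake_batch` is a tiny stand-in with invented values. The key names `data`, `labels`, `filenames`, and `batch_label` are the ones used by the Python version of CIFAR-10.

```python
def summarize_batch(batch):
    """Return {key: (type name, length or None)} for a batch dictionary."""
    summary = {}
    for key, value in batch.items():
        # Strings have a length too, but it is not informative here.
        if isinstance(value, str) or not hasattr(value, "__len__"):
            length = None
        else:
            length = len(value)
        summary[key] = (type(value).__name__, length)
    return summary


# Stand-in with invented values; a real batch is far larger.
fake_batch = {
    "batch_label": "training batch 1 of 5",
    "data": [[0] * 3072, [255] * 3072],
    "labels": [6, 9],
    "filenames": ["a.png", "b.png"],
}
```

Running summarize_batch(batch1) on the loaded batch should show the same four keys, with a NumPy array under the data key.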

We can extract an image by using its index in the batch's data key:

image_index = 100
image = batch1['data'][image_index]
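Each row of the data array is a flat vector of 3,072 values rather than a 32 by 32 picture. According to the dataset's download page, the first 1,024 entries hold the red channel, the next 1,024 the green, and the last 1,024 the blue, each stored row by row. Under that assumption, a single pixel can be looked up as sketched below; `pixel_rgb` is a hypothetical helper, not part of the book's code.

```python
SIDE = 32             # images are 32 pixels wide and 32 pixels high
PLANE = SIDE * SIDE   # 1,024 values per colour channel


def pixel_rgb(flat_image, row, col):
    """Return the (red, green, blue) values of one pixel from a flat image."""
    offset = row * SIDE + col
    return tuple(flat_image[channel * PLANE + offset] for channel in range(3))
```

For the image extracted above, pixel_rgb(image, 0, 0) would give the colour of the top-left pixel.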

