www.allitebooks.com

Learning%20Data%20Mining%20with%20Python Learning%20Data%20Mining%20with%20Python

24.07.2016 Views

Chapter 1 The dataset we are going to use for this example is a NumPy two-dimensional array, which is a format that underlies most of the examples in the rest of the book. The array looks like a table, with rows representing different samples and columns representing different features. The cells represent the value of a particular feature of a particular sample. To illustrate, we can load the dataset with the following code: import numpy as np dataset_filename = "affinity_dataset.txt" X = np.loadtxt(dataset_filename) For this example, run the IPython Notebook and create an IPython Notebook. Enter the above code into the first cell of your Notebook. You can then run the code by pressing Shift + Enter (which will also add a new cell for the next lot of code). After the code is run, the square brackets to the left-hand side of the first cell will be assigned an incrementing number, letting you know that this cell has been completed. The first cell should look like the following: For later code that will take more time to run, an asterisk will be placed here to denote that this code is either running or scheduled to be run. This asterisk will be replaced by a number when the code has completed running. You will need to save the dataset into the same directory as the IPython Notebook. If you choose to store it somewhere else, you will need to change the dataset_filename value to the new location. Next, we can show some of the rows of the dataset to get a sense of what the dataset looks like. Enter the following line of code into the next cell and run it, in order to print the first five lines of the dataset: print(X[:5]) Downloading the example code You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. [ 9 ]

Getting Started with Data Mining The result will show you which items were bought in the first five transactions listed: The dataset can be read by looking at each row (horizontal line) at a time. The first row (0, 0, 1, 1, 1) shows the items purchased in the first transaction. Each column (vertical row) represents each of the items. They are bread, milk, cheese, apples, and bananas, respectively. Therefore, in the first transaction, the person bought cheese, apples, and bananas, but not bread or milk. Each of these features contain binary values, stating only whether the items were purchased and not how many of them were purchased. A 1 indicates that "at least 1" item was bought of this type, while a 0 indicates that absolutely none of that item was purchased. Implementing a simple ranking of rules We wish to find rules of the type If a person buys product X, then they are likely to purchase product Y. We can quite easily create a list of all of the rules in our dataset by simply finding all occasions when two products were purchased together. However, we then need a way to determine good rules from bad ones. This will allow us to choose specific products to recommend. Rules of this type can be measured in many ways, of which we will focus on two: support and confidence. Support is the number of times that a rule occurs in a dataset, which is computed by simply counting the number of samples that the rule is valid for. It can sometimes be normalized by dividing by the total number of times the premise of the rule is valid, but we will simply count the total for this implementation. While the support measures how often a rule exists, confidence measures how accurate they are when they can be used. It can be computed by determining the percentage of times the rule applies when the premise applies. We first count how many times a rule applies in our dataset and divide it by the number of samples where the premise (the if statement) occurs. [ 10 ]

Chapter 1<br />

The dataset we are going to use for this example is a NumPy two-dimensional<br />

array, which is a format that underlies most of the examples in the rest of the book.<br />

The array looks like a table, with rows representing different samples and columns<br />

representing different features.<br />

The cells represent the value of a particular feature of a particular sample. To<br />

illustrate, we can load the dataset with the following code:<br />

import numpy as np<br />

dataset_filename = "affinity_dataset.txt"<br />

X = np.loadtxt(dataset_filename)<br />

For this example, run the IPython Notebook and create an IPython Notebook.<br />

Enter the above code into the first cell of your Notebook. You can then run the<br />

code by pressing Shift + Enter (which will also add a new cell for the next lot of<br />

code). After the code is run, the square brackets to the left-hand side of the first cell<br />

will be assigned an incrementing number, letting you know that this cell has been<br />

<strong>com</strong>pleted. The first cell should look like the following:<br />

For later code that will take more time to run, an asterisk will be placed here to<br />

denote that this code is either running or scheduled to be run. This asterisk will be<br />

replaced by a number when the code has <strong>com</strong>pleted running.<br />

You will need to save the dataset into the same directory as the IPython<br />

Notebook. If you choose to store it somewhere else, you will need to change<br />

the dataset_filename value to the new location.<br />

Next, we can show some of the rows of the dataset to get a sense of what the dataset<br />

looks like. Enter the following line of code into the next cell and run it, in order to<br />

print the first five lines of the dataset:<br />

print(X[:5])<br />

Downloading the example code<br />

You can download the example code files from your account at<br />

http://<strong>www</strong>.packtpub.<strong>com</strong> for all the Packt Publishing books you<br />

have purchased. If you purchased this book elsewhere, you can visit<br />

http://<strong>www</strong>.packtpub.<strong>com</strong>/support and register to have the<br />

files e-mailed directly to you.<br />

[ 9 ]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!