24.07.2016 Views

www.allitebooks.com

Learning%20Data%20Mining%20with%20Python

Learning%20Data%20Mining%20with%20Python

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Getting Started with Data Mining<br />

The result will show you which items were bought in the first five transactions listed:<br />

The dataset can be read by looking at each row (horizontal line) at a time. The first<br />

row (0, 0, 1, 1, 1) shows the items purchased in the first transaction. Each<br />

column (vertical row) represents each of the items. They are bread, milk, cheese,<br />

apples, and bananas, respectively. Therefore, in the first transaction, the person<br />

bought cheese, apples, and bananas, but not bread or milk.<br />

Each of these features contain binary values, stating only whether the items were<br />

purchased and not how many of them were purchased. A 1 indicates that "at least<br />

1" item was bought of this type, while a 0 indicates that absolutely none of that item<br />

was purchased.<br />

Implementing a simple ranking of rules<br />

We wish to find rules of the type If a person buys product X, then they are likely to<br />

purchase product Y. We can quite easily create a list of all of the rules in our dataset by<br />

simply finding all occasions when two products were purchased together. However,<br />

we then need a way to determine good rules from bad ones. This will allow us to<br />

choose specific products to re<strong>com</strong>mend.<br />

Rules of this type can be measured in many ways, of which we will focus on two:<br />

support and confidence.<br />

Support is the number of times that a rule occurs in a dataset, which is <strong>com</strong>puted by<br />

simply counting the number of samples that the rule is valid for. It can sometimes be<br />

normalized by dividing by the total number of times the premise of the rule is valid,<br />

but we will simply count the total for this implementation.<br />

While the support measures how often a rule exists, confidence measures how<br />

accurate they are when they can be used. It can be <strong>com</strong>puted by determining the<br />

percentage of times the rule applies when the premise applies. We first count how<br />

many times a rule applies in our dataset and divide it by the number of samples<br />

where the premise (the if statement) occurs.<br />

[ 10 ]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!