24.07.2016 Views

www.allitebooks.com

Learning Data Mining with Python


Chapter 4

We want to break out of the preceding loop if we didn't find any new frequent itemsets (and also print a message to let us know what is going on):

    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break

We use sys.stdout.flush() to ensure that the printouts happen while the code is still running. Sometimes, particularly with large loops in a cell, the printouts do not appear until the code has completed. Flushing the output in this way ensures that the printout happens when we want. Don't do it too often, though: the flush operation carries a computational cost (as does printing), and this will slow down the program.
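
The trade-off between timely output and flushing cost can be sketched as follows. This is only an illustration: the helper `report_progress`, its `flush_every` parameter, and the in-memory buffer are assumptions made for the example and are not part of the chapter's code.

```python
import io
import sys


def report_progress(n_steps, flush_every=10, stream=sys.stdout):
    """Print a line on every step, but flush only every `flush_every` steps.

    Hypothetical helper for illustration; returns the number of flushes.
    """
    n_flushes = 0
    for step in range(1, n_steps + 1):
        print("Finished step {}".format(step), file=stream)
        if step % flush_every == 0:
            stream.flush()  # push buffered text out now, not at the end
            n_flushes += 1
    return n_flushes


# Writing to an in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
flush_count = report_progress(25, flush_every=10, stream=buffer)
```

In Python 3.3 and later, `print(..., flush=True)` combines the print and the flush in one call, which may be more convenient when every message should appear immediately.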

If we do find frequent itemsets, we print a message to let us know the loop will run again. This algorithm can take a while to run, so it is helpful to know that the code is still running while you wait for it to complete! Let's look at the code:

    else:
        print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()
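
To see how the two branches fit together, here is a minimal, self-contained sketch of the loop on toy data. It makes several assumptions: the body of `find_frequent_itemsets` is a simplified stand-in rather than the chapter's function, and `run_levelwise_search`, `max_length`, and `toy_transactions` are hypothetical names introduced only for this example. Unlike a version that stores every result, this sketch keeps a level only when it is non-empty.

```python
import sys
from collections import defaultdict


def find_frequent_itemsets(transactions, k_1_itemsets, min_support):
    """Hypothetical stand-in: grow each frequent itemset by one item."""
    counts = defaultdict(int)
    for items in transactions:
        candidates = set()
        for itemset in k_1_itemsets:
            if itemset.issubset(items):
                for other in items - itemset:
                    candidates.add(itemset | frozenset((other,)))
        # Count each candidate at most once per transaction.
        for candidate in candidates:
            counts[candidate] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def run_levelwise_search(transactions, min_support, max_length=20):
    # Itemsets of length one seed the search.
    item_counts = defaultdict(int)
    for items in transactions:
        for item in items:
            item_counts[item] += 1
    frequent_itemsets = {
        1: {frozenset((item,)): n
            for item, n in item_counts.items() if n >= min_support}
    }
    for k in range(2, max_length):
        cur_frequent_itemsets = find_frequent_itemsets(
            transactions, frequent_itemsets[k - 1], min_support)
        if len(cur_frequent_itemsets) == 0:
            print("Did not find any frequent itemsets of length {}".format(k))
            sys.stdout.flush()
            break
        else:
            print("I found {} frequent itemsets of length {}".format(
                len(cur_frequent_itemsets), k))
            sys.stdout.flush()
            frequent_itemsets[k] = cur_frequent_itemsets
    return frequent_itemsets


# Toy data: five "transactions" over the items a, b, and c.
toy_transactions = [
    frozenset("abc"), frozenset("ab"), frozenset("ac"),
    frozenset("bc"), frozenset("abc"),
]
result = run_levelwise_search(toy_transactions, min_support=3)
```

With a minimum support of 3, every pair in the toy data is frequent but the triple is not, so the loop stops at length three.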

Finally, after the end of the loop, we are no longer interested in the first set of itemsets. These are itemsets of length one, which won't help us create association rules, because we need at least two items to create an association rule. Let's delete them:

del frequent_itemsets[1]
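
The reason length-one itemsets are useless here can be shown with a small sketch. It assumes one common way of forming candidate rules, in which each item in turn becomes the conclusion and the remaining items form the premise. The function name `candidate_rules` is hypothetical and is not taken from the chapter's code.

```python
def candidate_rules(itemset):
    """Return (premise, conclusion) pairs that have a non-empty premise.

    Hypothetical helper: each item in turn is treated as the conclusion.
    """
    rules = []
    for conclusion in sorted(itemset):
        premise = itemset - {conclusion}
        if premise:  # an empty premise cannot form a rule
            rules.append((premise, conclusion))
    return rules


single = candidate_rules(frozenset({"a"}))
pair = candidate_rules(frozenset({"a", "b"}))
```

A single-item itemset leaves nothing for the premise once the conclusion is removed, so it yields no rules, whereas a two-item itemset yields one rule per item.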

You can now run this code. It may take a few minutes, or longer on older hardware. If you have trouble running any of the code samples, consider using an online cloud provider for additional speed. Details about using the cloud to do the work are given in Appendix, Next Steps.

The preceding code returns 1,718 frequent itemsets of varying lengths. You'll notice that the number of itemsets grows as the length increases before it shrinks. It grows because the number of possible combinations increases with length. After a while, most of these combinations no longer have the support necessary to be considered frequent, so the number shrinks. This shrinking is the benefit of the Apriori algorithm. If we searched all possible itemsets (not just the supersets of frequent ones), we would be searching thousands of times more itemsets to see if they are frequent.
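
To get a feel for how quickly an exhaustive search grows, the sketch below counts every possible itemset of a given length. The figure of 100 items is a made-up number chosen for illustration, not the size of the dataset used in this chapter, and `count_possible_itemsets` is a hypothetical helper.

```python
from math import comb  # available in Python 3.8 and later


def count_possible_itemsets(n_items, length):
    """Number of distinct itemsets of `length` drawn from `n_items` items."""
    return comb(n_items, length)


n_items = 100  # hypothetical catalogue size, for illustration only
possible_by_length = {
    k: count_possible_itemsets(n_items, k) for k in range(1, 5)
}
```

Each extra item in the itemset multiplies the number of candidates many times over, which is why restricting the search to supersets of already frequent itemsets saves so much work.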
