24.07.2016 Views

www.allitebooks.com

Learning Data Mining with Python


Recommending Movies Using Affinity Analysis

However, it can be applied to many processes:

• Fraud detection
• Customer segmentation
• Software optimization
• Product recommendations

Affinity analysis is usually much more exploratory than classification. We often don't have the complete dataset we expect for many classification tasks. For instance, in movie recommendation, we have reviews from different people on different movies. However, it is unlikely we have each reviewer review all of the movies in our dataset. This leaves an important and difficult question in affinity analysis. If a reviewer hasn't reviewed a movie, is that an indication that they aren't interested in the movie (and therefore wouldn't recommend it) or simply that they haven't reviewed it yet?

We won't answer that question in this chapter, but thinking about gaps in your datasets can lead to questions like this. In turn, that can lead to answers that may help improve the efficacy of your approach.

Algorithms for affinity analysis<br />

We introduced a basic method for affinity analysis in Chapter 1, Getting Started with Data Mining, which tested all of the possible rule combinations. We computed the confidence and support for each rule, which in turn allowed us to rank them to find the best rules.
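As a reminder of what that exhaustive approach looks like, here is a minimal sketch, not the Chapter 1 code itself. It assumes rules of the form "if a person buys X, they also buy Y", takes support to be the raw number of baskets in which the rule holds, and takes confidence to be that count divided by the number of baskets containing X. The transaction data and the function name `score_pair_rules` are made up for illustration.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical transactions: each set lists the items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "cheese"},
    {"bread", "milk", "apples"},
    {"milk", "bananas"},
    {"bread", "milk", "cheese"},
]


def score_pair_rules(transactions):
    """Return {(premise, conclusion): (support, confidence)} for one-item rules.

    Assumption: support is a raw count of baskets where both items appear;
    confidence is that count divided by the baskets containing the premise.
    """
    premise_counts = defaultdict(int)
    rule_counts = defaultdict(int)
    for basket in transactions:
        for item in basket:
            premise_counts[item] += 1
        # Every ordered pair of distinct items in the basket is a candidate rule.
        for premise, conclusion in permutations(basket, 2):
            rule_counts[(premise, conclusion)] += 1
    return {
        rule: (count, count / premise_counts[rule[0]])
        for rule, count in rule_counts.items()
    }


rules = score_pair_rules(transactions)

# Rank by confidence first, then by support, and show the top few rules.
ranked = sorted(rules.items(), key=lambda kv: (kv[1][1], kv[1][0]), reverse=True)
for (premise, conclusion), (support, confidence) in ranked[:3]:
    print(f"{premise} -> {conclusion}: support={support}, confidence={confidence:.2f}")
```

Even in this reduced form, every ordered pair of items in every basket is examined, which is the cost that becomes a problem as the number of items grows.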

However, this approach is not efficient. Our dataset in Chapter 1, Getting Started with Data Mining, had just five items for sale. We could expect even a small store to have hundreds of items for sale, while many online stores would have thousands (or millions!). With naive rule creation, such as our previous algorithm, the time needed to compute these rules grows exponentially with the number of items. Specifically, the total possible number of rules is $2^n - 1$, where $n$ is the number of items. For our five-item dataset, there are 31 possible rules. For 10 items, it is 1023. For just 100 items, the number has 31 digits. Even the drastic increase in computing power couldn't possibly keep up with the increases in the number of items stored online. Therefore, we need algorithms that work smarter, as opposed to computers that work harder.
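The growth described above is easy to check directly. The short sketch below evaluates the rule-count formula for a few dataset sizes and reports how many digits each result has; the helper name `possible_rule_count` is ours, introduced only for illustration.

```python
def possible_rule_count(n_items):
    """Number of possible rules for n_items, using the formula 2^n - 1."""
    return 2 ** n_items - 1


# Python integers have arbitrary precision, so the 100-item case is exact.
for n in (5, 10, 100):
    count = possible_rule_count(n)
    print(f"{n} items: {count} possible rules ({len(str(count))} digits)")
```

Each additional item doubles the count (plus one), so no constant-factor speed-up in hardware can offset adding even a handful of items.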
