
Learning Data Mining with Python


Recommending Movies Using Affinity Analysis

The movie recommendation problem

Product recommendation is big business. Online stores use it to up-sell to customers by recommending other products that they could buy. Making better recommendations leads to better sales. When online stores sell to millions of customers every year, there is a lot of potential money to be made by selling more items to these customers.

Product recommendations have been researched for many years; however, the field gained a significant boost when Netflix ran their Netflix Prize between 2006 and 2009. This competition aimed to determine whether anyone could predict a user's rating of a film better than Netflix was doing at the time. The prize went to a team whose predictions were just over 10 percent better than Netflix's existing solution. While this may not seem like a large improvement, it would net Netflix millions in revenue from better movie recommendations.

Obtaining the dataset

GroupLens, a research group at the University of Minnesota, has released several datasets that are often used for testing algorithms in this area. Among them are several versions of a movie rating dataset, which differ in size: one with 100,000 reviews, one with 1 million reviews, and one with 10 million reviews.

The datasets are available from http://grouplens.org/datasets/movielens/ and the dataset we are going to use in this chapter is the MovieLens 100k dataset, the version with 100,000 reviews. Download this dataset and unzip it in your data folder. Start a new IPython Notebook and type the following code:

import os
import pandas as pd

data_folder = os.path.join(os.path.expanduser("~"), "Data", "ml-100k")
ratings_filename = os.path.join(data_folder, "u.data")

Ensure that ratings_filename points to the u.data file in the unzipped folder.<br />
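
A quick way to confirm the path before going further is an existence check. This is only a minimal sketch: it assumes the archive was unzipped to ~/Data/ml-100k as in the code above, and the helper name ratings_file_exists is hypothetical, introduced here for illustration.

```python
import os


def ratings_file_exists(filename):
    """Return True if filename points to an existing regular file."""
    return os.path.isfile(filename)


# Same path construction as in the chapter's code.
data_folder = os.path.join(os.path.expanduser("~"), "Data", "ml-100k")
ratings_filename = os.path.join(data_folder, "u.data")

if not ratings_file_exists(ratings_filename):
    # Most likely the archive was unzipped somewhere else.
    print("Could not find {}; check where the archive was unzipped."
          .format(ratings_filename))
```

If the message is printed, adjust data_folder to match the location of the unzipped folder.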

Loading with pandas

The MovieLens dataset is in good shape; however, we need to change some of the default options in pandas.read_csv. To start with, the data is separated by tabs, not commas. Next, there is no heading line. This means the first line in the file is actually data and we need to manually set the column names.
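
The two changes described above could be sketched as follows. The column names and the load_ratings helper are assumptions chosen for illustration, and the two sample rows are made-up stand-ins in the u.data layout (user ID, movie ID, rating, timestamp, separated by tabs) so that the snippet runs without the downloaded file.

```python
import io

import pandas as pd

# Our own labels for the four columns (an assumption; the file has no header).
COLUMN_NAMES = ["UserID", "MovieID", "Rating", "Timestamp"]


def load_ratings(source):
    """Read tab-separated ratings with no header row from a path or file object."""
    return pd.read_csv(source, sep="\t", header=None, names=COLUMN_NAMES)


# Illustrative rows in the same tab-separated layout as u.data.
sample = io.StringIO("196\t242\t3\t881250949\n186\t302\t3\t891717742\n")
ratings = load_ratings(sample)
print(ratings.head())
```

With the real file, the same call would be load_ratings(ratings_filename). Passing header=None stops pandas from treating the first data row as column names, and names supplies the labels instead.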
