CMPT-741 Fall 2009 Data Mining Martin Ester Course Project
CMPT-741 Fall 2009<br />
Data Mining<br />
Martin Ester<br />
Course Project<br />
Total marks: 30% of the class<br />
Due date: November 30, 2009<br />
Introduction<br />
This course project is to be conducted in small groups of two to three students. You will use<br />
the open-source data mining tool WEKA, which can be downloaded from<br />
http://www.cs.waikato.ac.nz/ml/weka/index_downloading.html.<br />
Dataset<br />
You will be using reviews from Epinions.com. Each review consists of the full text, the pros and<br />
cons (a summary of the review) and the numerical rating (1, 2, 3, 4, or 5). The texts are provided<br />
both as raw texts and in a version with part-of-speech tags. More specifically, here are the<br />
attributes of a product review:<br />
PKey|ProductRate|Pros|Cons|FullReview|TaggedPros|TaggedCons|TaggedFullReview<br />
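As a minimal sketch of how one record in this format could be read, the following Python splits a<br />
line into its eight attributes. It assumes one review per line and no "|" characters inside the<br />
text fields, neither of which is stated above. The Review type, the parse_review function and the<br />
sample line (including its tag notation) are made up for illustration.<br />

```python
from typing import NamedTuple

# Attribute names in the order given by the schema line above.
FIELDS = ["PKey", "ProductRate", "Pros", "Cons", "FullReview",
          "TaggedPros", "TaggedCons", "TaggedFullReview"]


class Review(NamedTuple):
    """One product review (hypothetical container, not part of the dataset)."""
    pkey: int
    rating: int
    pros: str
    cons: str
    full_review: str
    tagged_pros: str
    tagged_cons: str
    tagged_full_review: str


def parse_review(line: str) -> Review:
    """Split one pipe-delimited line into a Review.

    Assumption: text fields never contain the '|' character.
    """
    parts = line.rstrip("\n").split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return Review(int(parts[0]), int(parts[1]), *parts[2:])


# Invented sample record; the real tag notation in the dataset may differ.
SAMPLE_LINE = ("1|4|Small and light|Weak battery|Nice camcorder overall.|"
               "Small_JJ and_CC light_JJ|Weak_JJ battery_NN|"
               "Nice_JJ camcorder_NN overall_RB ._.")

if __name__ == "__main__":
    review = parse_review(SAMPLE_LINE)
    print(review.pkey, review.rating, review.pros)
```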
Our dataset contains reviews for each of the following product categories and products:<br />
Camcorder<br />
PKey Product<br />
1 Pure Digital Flip Video Ultra (2 GB) Camcorder<br />
2 Canon Elura 100 Mini DV Camcorder<br />
3 Canon ZR40 Mini DV Camcorder<br />
4 Canon GL2 Mini DV Camcorder<br />
5 Samsung SC-D23 Mini DV Camcorder<br />
6 Sony Handycam DCR-TRV250 Digital-8 Camcorder<br />
7 Panasonic Palmcorder® PV-DV203 Mini DV Camcorder<br />
8 Canon Elura 50 Mini DV Camcorder<br />
Cellular Phone<br />
PKey Product<br />
1 Apple iPhone (8 GB) Smartphone<br />
2 Nokia 6210 Cell Phone<br />
3 Samsung SPH-A500 Cell Phone<br />
4 RIM BlackBerry Pearl 8100 Smartphone<br />
5 Palm Treo 650 Smartphone<br />
6 Sony Ericsson T68IS Cell Phone<br />
7 Motorola V400 Cell Phone<br />
8 Motorola Q™ Smartphone<br />
Digital Camera<br />
PKey Product<br />
1 Sony Mavica MVC-FD83 Digital Camera<br />
2 Olympus Camedia D-360L Digital Camera<br />
3 Nikon D40 Digital Camera with 18-55mm Lens<br />
4 Canon PowerShot® S5 IS Digital Camera<br />
5 Kodak EasyShare LS443 Digital Camera<br />
6 Hewlett Packard Photosmart 435 Digital Camera<br />
7 BenQ DC1500 Digital Camera<br />
8 Agfa ePhoto Smile Digital Camera<br />
DVD Player<br />
PKey Product<br />
1 Panasonic DVD-LV50 5 in. Portable DVD Player<br />
2 Toshiba SD-1200 DVD Player<br />
3 Panasonic DMR-E85H (120 GB) DVD Recorder<br />
4 Toshiba RD-XS32SU DVD Recorder / HDD Recorder<br />
5 Apex Digital AD-1200 DVD Player<br />
6 Initial IDM-1731 7 in. Portable DVD Player<br />
7 Cyberhome DVR 1600 DVD Recorder<br />
8 Cyberhome CH-LDV 712 7 in. Portable DVD Player<br />
MP3 Player<br />
PKey Product<br />
1 Apple iPod touch 2nd Generation (16 GB) MP3 Player<br />
2 Intel Pocket Concert (128 MB) MP3 Player<br />
3 Apple iPod classic 5th Generation White (30 GB) MP3 Player<br />
4 Apple iPod classic 4th Generation (20 GB) MP3 Player<br />
5 Apple iPod shuffle 1st Generation White (512 MB, M9724LL/A) MP3 Player<br />
6 SanDisk Sansa m240 (1 GB) MP3 Player<br />
7 Dell DJ (20 GB) MP3 Player<br />
8 RCA Lyra RD2840 (40 GB) MP3 Player<br />
The dataset contains multiple reviews for each product.<br />
Tasks<br />
Your task is sentiment classification. We create class labels for reviews as follows:<br />
Rating = 1 or 2: negative, rating = 3: neutral, rating = 4 or 5: positive.<br />
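The labelling rule above can be sketched in a few lines of Python. The function name<br />
sentiment_label is hypothetical; only the rating-to-label mapping comes from this handout.<br />

```python
def sentiment_label(rating: int) -> str:
    """Map a numerical rating (1 to 5) to a sentiment class label."""
    if rating in (1, 2):
        return "negative"
    if rating == 3:
        return "neutral"
    if rating in (4, 5):
        return "positive"
    # Ratings outside 1..5 are not defined by the dataset description.
    raise ValueError(f"rating must be between 1 and 5, got {rating}")


if __name__ == "__main__":
    for r in range(1, 6):
        print(r, sentiment_label(r))
```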
Build a sentiment classifier using any of the methods implemented in WEKA. Submit a<br />
hard-copy report addressing the following issues:<br />
1) Describe the features that you used for your classifier as well as why and how you selected<br />
them.<br />
2) Discuss your choice of a classification method. You may have to experiment with several<br />
methods and may report some of the preliminary results as arguments.<br />
3) Give the classification accuracy as well as the precision, recall and F-measure (for the<br />
positive class and for the negative class) of your sentiment classifier. Compute these<br />
performance measures using 5-fold cross-validation.<br />
4) Design and implement a method that explains the results of your sentiment classifier. Discuss<br />
your design and provide the actual explanations produced for the classification of the first<br />
reviews of each product category.<br />
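For reference, the performance measures named in issue 3 can be computed from paired lists of true<br />
and predicted labels as in the following Python sketch. The function names and the small label<br />
lists are invented for illustration, and returning 0.0 when a denominator is zero is a convention<br />
chosen here, not a requirement of this project.<br />

```python
from typing import Dict, Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of reviews whose predicted label equals the true label."""
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def precision_recall_f(gold: Sequence[str], pred: Sequence[str],
                       target: str) -> Dict[str, float]:
    """Precision, recall and F-measure (F1) for one class, e.g. 'positive'."""
    pairs = list(zip(gold, pred))
    tp = sum(g == target and p == target for g, p in pairs)  # true positives
    fp = sum(g != target and p == target for g, p in pairs)  # false positives
    fn = sum(g == target and p != target for g, p in pairs)  # false negatives
    # Convention chosen here: a measure with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Invented labels, standing in for true and predicted classes of six reviews.
GOLD = ["positive", "positive", "negative", "neutral", "negative", "positive"]
PRED = ["positive", "negative", "negative", "positive", "negative", "positive"]

if __name__ == "__main__":
    print("accuracy:", accuracy(GOLD, PRED))
    for cls in ("positive", "negative"):
        print(cls, precision_recall_f(GOLD, PRED, cls))
```

With 5-fold cross-validation, one common convention is to pool the predictions made on the five<br />
held-out folds and compute the measures once over the pooled lists; averaging the per-fold values<br />
is another option, so the report should state which was used.<br />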
Marking scheme<br />
Each of the four issues listed above is worth 25% of the course project marks.