
Learning Data Mining with Python

24.07.2016

Chapter 4: Recommending Movies Using Affinity Analysis

The process starts by creating dictionaries to store how many times we see the premise leading to the conclusion (a correct example of the rule) and how many times it doesn't (an incorrect example):

    from collections import defaultdict

    correct_counts = defaultdict(int)
    incorrect_counts = defaultdict(int)

We iterate over all of the users, their favorable reviews, and each candidate association rule:

    for user, reviews in favorable_reviews_by_users.items():
        for candidate_rule in candidate_rules:
            premise, conclusion = candidate_rule

We then test whether the premise applies to this user. In other words, did the user favorably review all of the movies in the premise?

            if premise.issubset(reviews):

If the premise applies, we check whether the conclusion movie was also rated favorably. If so, the rule is correct in this instance. If not, it is incorrect:

            if premise.issubset(reviews):
                if conclusion in reviews:
                    correct_counts[candidate_rule] += 1
                else:
                    incorrect_counts[candidate_rule] += 1

We then compute the confidence for each rule by dividing the correct count by the total number of times the premise applied (correct plus incorrect):

    rule_confidence = {candidate_rule: correct_counts[candidate_rule]
                       / float(correct_counts[candidate_rule] +
                               incorrect_counts[candidate_rule])
                       for candidate_rule in candidate_rules}

Now we can print the top five rules by sorting this confidence dictionary and printing the results:

    from operator import itemgetter

    sorted_confidence = sorted(rule_confidence.items(),
                               key=itemgetter(1), reverse=True)
    for index in range(5):
        print("Rule #{0}".format(index + 1))
        (premise, conclusion) = sorted_confidence[index][0]
        print("Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion))
        print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
        print("")

The result is as follows:

    Rule #1
    Rule: If a person recommends frozenset({64, 56, 98, 50, 7}) they will also recommend 174
     - Confidence: 1.000

    Rule #2
    Rule: If a person recommends frozenset({98, 100, 172, 79, 50, 56}) they will also recommend 7
     - Confidence: 1.000

    Rule #3
    Rule: If a person recommends frozenset({98, 172, 181, 174, 7}) they will also recommend 50
     - Confidence: 1.000

    Rule #4
    Rule: If a person recommends frozenset({64, 98, 100, 7, 172, 50}) they will also recommend 174
     - Confidence: 1.000

    Rule #5
    Rule: If a person recommends frozenset({64, 1, 7, 172, 79, 50}) they will also recommend 181
     - Confidence: 1.000

The printout shows only the movie IDs, which isn't very helpful without the movie names. The dataset came with a file called u.item, which stores the movie names and their corresponding MovieID (as well as other information, such as the genre). We can load the titles from this file using pandas. Additional information about the file and categories is available in the README that came with the dataset. The file is in CSV format but uses the | symbol as the separator; it has no header row, and the encoding must be set explicitly. The column names were found in the README file.

    import os
    import pandas as pd

    movie_name_filename = os.path.join(data_folder, "u.item")
    movie_name_data = pd.read_csv(movie_name_filename, delimiter="|",
                                  header=None, encoding="mac-roman")
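As a minimal sketch, the rule evaluation and title lookup described above can be tried end to end on toy data. Everything concrete here is an assumption made for illustration: the five users, the three candidate rules, the titles "Movie A" to "Movie C", the in-memory `sample_item_lines` string standing in for the item file, and the helper `describe`. The sketch also assumes the movie ID is the first field and the title the second, and it skips any rule whose premise never applies so that the division cannot fail.

```python
import csv
import io
from collections import defaultdict
from operator import itemgetter

# Toy stand-in (assumption): user ID -> set of favorably reviewed movie IDs.
favorable_reviews_by_users = {
    1: frozenset({1, 2, 3}),
    2: frozenset({1, 2}),
    3: frozenset({1, 3}),
    4: frozenset({2}),
    5: frozenset({1, 2, 3}),
}
# Each candidate rule is (premise, conclusion): a frozenset of IDs and one ID.
candidate_rules = [
    (frozenset({1}), 2),
    (frozenset({1, 2}), 3),
    (frozenset({3}), 1),
]

correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):  # the premise applies to this user
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1

# Confidence = correct / (correct + incorrect); rules never seen are skipped.
rule_confidence = {
    rule: correct_counts[rule] / (correct_counts[rule] + incorrect_counts[rule])
    for rule in candidate_rules
    if correct_counts[rule] + incorrect_counts[rule] > 0
}
sorted_confidence = sorted(rule_confidence.items(),
                           key=itemgetter(1), reverse=True)

# Hypothetical pipe-delimited rows, assuming "id|title|..." field order.
sample_item_lines = "1|Movie A|x\n2|Movie B|x\n3|Movie C|x\n"
rows = csv.reader(io.StringIO(sample_item_lines), delimiter="|")
titles = {int(row[0]): row[1] for row in rows}


def describe(rule):
    """Render a rule with movie titles instead of raw IDs."""
    premise, conclusion = rule
    premise_names = ", ".join(titles[movie_id] for movie_id in sorted(premise))
    return "If a person recommends {0} they will also recommend {1}".format(
        premise_names, titles[conclusion])


for index, (rule, confidence) in enumerate(sorted_confidence):
    print("Rule #{0}".format(index + 1))
    print("Rule: " + describe(rule))
    print(" - Confidence: {0:.3f}".format(confidence))
    print("")
```

The sketch parses the sample rows with the standard-library csv module rather than pandas only to stay dependency-free; with the real file, the same ID-to-title lookup could be built from the DataFrame loaded above.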

