Chapter 4

We will sample our dataset to form a training dataset. This also helps reduce the size of the dataset that will be searched, making the Apriori algorithm run faster. We obtain all reviews from the first 200 users:

ratings = all_ratings[all_ratings['UserID'].isin(range(200))]

Next, we can create a dataset of only the favorable reviews in our sample:

favorable_ratings = ratings[ratings["Favorable"]]

We will be searching the users' favorable reviews for our itemsets. So, the next thing we need is the movies each user has given a favorable review. We can compute this by grouping the dataset by the User ID and iterating over the movies in each group:

favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in
                                  favorable_ratings.groupby("UserID")["MovieID"])

In the preceding code, we stored the values as a frozenset, allowing us to quickly check whether a movie has been rated by a user. Sets are much faster than lists for this type of membership test, and we will use them in later code.

Finally, we can create a DataFrame that tells us how frequently each movie has been given a favorable review:

num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()

We can see the top five movies by running the following code:

num_favorable_by_movie.sort_values("Favorable", ascending=False)[:5]

Let's see the top five movies list:

MovieID  Favorable
50       100
100      89
258      83
181      79
174      74
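The sampling and grouping steps above can be exercised end-to-end on a tiny made-up ratings frame. The column names follow the chapter; the data here is purely illustrative, not the MovieLens data the chapter uses:

```python
import pandas as pd

# Hypothetical stand-in for the chapter's all_ratings DataFrame
all_ratings = pd.DataFrame({
    "UserID":    [1, 1, 1, 2, 2, 3],
    "MovieID":   [50, 100, 258, 50, 181, 100],
    "Favorable": [True, True, False, True, True, True],
})

# Sample the first users and keep only the favorable reviews
ratings = all_ratings[all_ratings["UserID"].isin(range(200))]
favorable_ratings = ratings[ratings["Favorable"]]

# Map each user to the frozenset of movies they reviewed favorably
favorable_reviews_by_users = dict(
    (k, frozenset(v.values))
    for k, v in favorable_ratings.groupby("UserID")["MovieID"])

print(favorable_reviews_by_users)
# e.g. {1: frozenset({50, 100}), 2: frozenset({50, 181}), 3: frozenset({100})}

# Count favorable reviews per movie (booleans sum as 0/1)
num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()
```

Note that user 1's entry excludes movie 258, since that review was not favorable: the filter happens before the grouping.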
Recommending Movies Using Affinity Analysis

The Apriori algorithm

The Apriori algorithm is part of our affinity analysis and deals specifically with finding frequent itemsets within the data. The basic procedure of Apriori builds up new candidate itemsets from previously discovered frequent itemsets. These candidates are tested to see if they are frequent, and the algorithm then iterates as explained here:

1. Create initial frequent itemsets by placing each item in its own itemset. Only items with at least the minimum support are used in this step.
2. Create new candidate itemsets from the most recently discovered frequent itemsets by finding supersets of the existing frequent itemsets.
3. Test all candidate itemsets to see if they are frequent. If a candidate is not frequent, it is discarded. If there are no new frequent itemsets from this step, go to the last step.
4. Store the newly discovered frequent itemsets and go to the second step.
5. Return all of the discovered frequent itemsets.

This process is outlined in the following workflow:
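The five steps above can be sketched in Python. This is a minimal illustration of the algorithm, not the chapter's implementation; the parameter names (reviews_by_users, min_support) mirror the variables used in this chapter, and support is counted as the number of users whose favorable set contains the itemset:

```python
from collections import defaultdict

def find_frequent_itemsets(reviews_by_users, min_support):
    """Minimal Apriori sketch. reviews_by_users maps each user to a
    frozenset of items (here, favorably reviewed movie IDs)."""
    # Step 1: initial frequent itemsets, one item each, kept only if
    # they meet the minimum support
    counts = defaultdict(int)
    for items in reviews_by_users.values():
        for item in items:
            counts[frozenset((item,))] += 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    while current:
        # Step 2: candidate supersets, built by extending each frequent
        # itemset with one extra item the same user also has
        candidate_counts = defaultdict(int)
        for user_items in reviews_by_users.values():
            seen = set()  # count each candidate at most once per user
            for itemset in current:
                if itemset.issubset(user_items):
                    for other in user_items - itemset:
                        seen.add(itemset | frozenset((other,)))
            for candidate in seen:
                candidate_counts[candidate] += 1
        # Step 3: discard candidates below the minimum support; if none
        # survive, the loop ends and we fall through to the return
        current = {s: c for s, c in candidate_counts.items()
                   if c >= min_support}
        # Step 4: store the newly discovered frequent itemsets, iterate
        frequent.update(current)
    # Step 5: return all discovered frequent itemsets and their supports
    return frequent
```

For example, with five users and min_support=3, only itemsets favorably reviewed by at least three users survive; each pass grows the surviving itemsets by one item until no candidate is frequent.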