Learning Data Mining with Python
Chapter 7

Next, we are going to remove any user who doesn't have any friends. For these users,
we can't really make a recommendation in this way. Instead, we might have to look at
their content or people who follow them. We will leave that out of the scope of this
chapter, though, so let's just remove these users. The code is as follows:

    friends = {user_id: friends[user_id] for user_id in friends
               if len(friends[user_id]) > 0}

We now have between 30 and 50 users, depending on your initial search results. We are
now going to increase that amount to 150. The following code will take quite a long
time to run: given the limits on the API, we can only get the friends for a user once
every minute. Simple math will tell us that 150 users will take 150 minutes, or
2.5 hours. Given the time we are going to be spending on getting this data, it pays to
ensure we get only good users.

What makes a good user, though? Given that we will be looking to make recommendations
based on shared connections, we will search for users based on shared connections. We
will get the friends of our existing users, starting with those users who are better
connected to our existing users. To do that, we maintain a count of all the times a
user is in one of our friends lists. It is worth considering the goals of the
application when considering your sampling strategy. For this purpose, getting lots of
similar users enables the recommendations to be more regularly applicable.

To do this, we simply iterate over all the friends lists we have and then count each
time a friend occurs.

    from collections import defaultdict

    def count_friends(friends):
        friend_count = defaultdict(int)
        for friend_list in friends.values():
            for friend in friend_list:
                friend_count[friend] += 1
        return friend_count

Computing our current friend count, we can then get the most connected (that is, most
friends from our existing list) person from our sample.
The code is as follows:

    from operator import itemgetter

    friend_count = count_friends(friends)
    best_friends = sorted(friend_count.items(),
                          key=itemgetter(1), reverse=True)
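
As a quick sanity check of the counting and ranking steps, here is a minimal sketch run on a made-up friends dictionary. The user IDs and friend lists (toy_friends) are invented for illustration and do not come from the Twitter API; count_friends is repeated so the example runs on its own.

```python
from collections import defaultdict
from operator import itemgetter

def count_friends(friends):
    """Count how many of our users list each account as a friend."""
    friend_count = defaultdict(int)
    for friend_list in friends.values():
        for friend in friend_list:
            friend_count[friend] += 1
    return friend_count

# Hypothetical toy data: user ID -> list of friend IDs (not real accounts)
toy_friends = {
    1: [10, 11, 12],
    2: [10, 12],
    3: [10],
}

toy_count = count_friends(toy_friends)
# Highest count first, so the most connected account leads the list
toy_best = sorted(toy_count.items(), key=itemgetter(1), reverse=True)
print(toy_best)  # [(10, 3), (12, 2), (11, 1)]
```

Account 10 appears in all three friends lists, so it would be the first candidate whose own friends we fetch.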
Discovering Accounts to Follow Using Graph Mining

From here, we set up a loop that continues until we have the friends of 150 users.
We then iterate over all of our best friends (which happens in order of the number
of people who have them as friends) until we find a user whose friends we haven't
already got. We then get the friends of that user and update the friends counts.
Finally, we work out who is the most connected user who we haven't already got
in our list:

    while len(friends) < 150:
        # Pick the best-connected user whose friends we haven't fetched yet
        for user_id, count in best_friends:
            if user_id not in friends:
                break
        else:
            # Every candidate is already in friends; stop rather than refetch
            break
        friends[user_id] = get_friends(t, user_id)
        # Update the counts with the newly fetched friends
        for friend in friends[user_id]:
            friend_count[friend] += 1
        best_friends = sorted(friend_count.items(),
                              key=itemgetter(1), reverse=True)

The code will then loop and continue until we reach 150 users.
You may want to set this value lower, such as 40 or 50 users
(or even just skip this bit of code temporarily). Then, complete the
chapter's code and get a feel for how the results work. After that,
reset the number of users in this loop to 150, leave the code to run
for a few hours, and then come back and rerun the later code.
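
To see how the loop behaves without waiting on the rate limit, the sketch below runs the same greedy expansion against a small in-memory network. Everything in it is an assumption for illustration: FAKE_NETWORK, fake_get_friends (standing in for get_friends(t, user_id)), the expand helper, and the target of 4 users are hypothetical and not part of the chapter's code. The for/else guard stops the loop if no unfetched candidate remains.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical in-memory network standing in for the Twitter API
FAKE_NETWORK = {
    1: [10, 11],
    2: [10, 12],
    10: [1, 2, 12],
    11: [1],
    12: [2, 10],
}

def fake_get_friends(user_id):
    """Stand-in for get_friends(t, user_id); returns [] for unknown users."""
    return FAKE_NETWORK.get(user_id, [])

def expand(friends, target):
    """Greedily fetch friends of the best-connected unfetched users."""
    friend_count = defaultdict(int)
    for friend_list in friends.values():
        for friend in friend_list:
            friend_count[friend] += 1
    order = []  # users fetched, in the order they were chosen
    while len(friends) < target:
        best_friends = sorted(friend_count.items(),
                              key=itemgetter(1), reverse=True)
        for user_id, count in best_friends:
            if user_id not in friends:
                break
        else:
            break  # no unfetched candidates remain
        friends[user_id] = fake_get_friends(user_id)
        order.append(user_id)
        for friend in friends[user_id]:
            friend_count[friend] += 1
    return order

seed = {1: FAKE_NETWORK[1], 2: FAKE_NETWORK[2]}
fetched = expand(seed, target=4)
print(fetched)  # [10, 12]
```

Account 10 is chosen first because both seed users follow it. Account 12 comes next: once 10's friends are added, 12 has two followers among the fetched users.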

Given that collecting that data probably took over 2 hours, it would be a good idea
to save it in case we have to turn our computer off. Using the json library, we can
easily save our friends dictionary to a file:

    import os
    import json

    friends_filename = os.path.join(data_folder, "python_friends.json")
    with open(friends_filename, 'w') as outf:
        json.dump(friends, outf)

If you need to load the file, use the json.load function:

    with open(friends_filename) as inf:
        friends = json.load(inf)
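
One detail to watch when reloading: JSON object keys are always strings, so if the user IDs in friends were integers when saved, they come back as strings. The sketch below demonstrates this with a made-up dictionary and a temporary file (both are assumptions for illustration, not the chapter's data), and converts the keys back to integers.

```python
import json
import os
import tempfile

# Hypothetical data: integer user IDs mapped to lists of friend IDs
toy_friends = {1: [10, 11], 2: [10]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_friends.json")
    with open(path, "w") as outf:
        json.dump(toy_friends, outf)
    with open(path) as inf:
        loaded = json.load(inf)

# Keys come back as strings; the list values keep their integer IDs
print(sorted(loaded.keys()))  # ['1', '2']

# Convert the keys back to integers before mixing with in-memory data
restored = {int(user_id): friend_list
            for user_id, friend_list in loaded.items()}
print(restored == toy_friends)  # True
```

Without the conversion, a check such as user_id not in friends could miss users that were already fetched, because an integer ID never equals its string form.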