
Learning Data Mining with Python

24.07.2016 Views

Discovering Accounts to Follow Using Graph Mining

Lots of things can be represented as graphs. This is particularly true in this day of Big Data, online social networks, and the Internet of Things. In particular, online social networks are big business, with sites such as Facebook having over 500 million active users (50 percent of whom log in each day). These sites often monetize themselves through targeted advertising. However, for users to be engaged with a website, they often need to follow interesting people or pages.

In this chapter, we will look at the concept of similarity and how we can create graphs based on it. We will also see how to split this graph up into meaningful subgraphs using connected components. This simple algorithm introduces the concept of cluster analysis: splitting a dataset into subsets based on similarity. We will investigate cluster analysis in more depth in Chapter 10, Clustering News Articles.

The topics covered in this chapter include:

• Creating graphs from social networks
• Loading and saving built classifiers
• The NetworkX package
• Converting graphs to matrices
• Distance and similarity
• Optimizing parameters based on scoring functions
• Loss functions and scoring functions
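To make the idea of connected components concrete before the chapter develops it, here is a minimal pure-Python sketch. It is an illustration only, not the chapter's implementation: the function name connected_components, the edge list, and the user names are all hypothetical. The NetworkX package listed above provides an equivalent networkx.connected_components function.

```python
from collections import deque


def connected_components(edges):
    """Return a list of sets, one per connected component of an undirected graph.

    `edges` is an iterable of (node, node) pairs; this input format is an
    assumption made for the sketch.
    """
    # Build an adjacency map so each node knows its neighbours.
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node] - seen:
                seen.add(other)
                component.add(other)
                queue.append(other)
        components.append(component)
    return components


# Hypothetical example: two separate groups of users.
example_edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
components = connected_components(example_edges)
```

Each returned set is one subgraph whose members are linked to each other, directly or indirectly, and to nobody outside the set.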

Loading the dataset

In this chapter, our task is to recommend users on online social networks based on shared connections. Our logic is that if two users have the same friends, they are highly similar and worth recommending to each other.

We are going to create a small social graph from Twitter using the API we introduced in the previous chapter. The data we are looking for is a subset of users interested in a similar topic (again, the Python programming language) and a list of all of their friends (people they follow). With this data, we will check how similar two users are, based on how many friends they have in common.

There are many other online social networks apart from Twitter. The reason we have chosen Twitter for this experiment is that their API makes it quite easy to get this sort of information. The information is available from other sites, such as Facebook, LinkedIn, and Instagram, as well. However, getting this information is more difficult.

To start collecting data, set up a new IPython Notebook and an instance of the twitter connection, as we did in the previous chapter. You can reuse the app information from the previous chapter or create a new one:

    import twitter
    consumer_key = ""
    consumer_secret = ""
    access_token = ""
    access_token_secret = ""
    authorization = twitter.OAuth(access_token, access_token_secret,
                                  consumer_key, consumer_secret)
    t = twitter.Twitter(auth=authorization, retry=True)

Also, create the output filename:

    import os
    data_folder = os.path.join(os.path.expanduser("~"), "Data", "twitter")
    output_filename = os.path.join(data_folder, "python_tweets.json")

We will also need the json library to save our data:

    import json
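The similarity idea described earlier in this section, comparing two users by the friends they share, could be sketched as follows. This is only an illustration under stated assumptions: the text says similarity is based on the number of common friends, and the normalization used here (shared friends divided by all distinct friends, the Jaccard index) is one plausible choice rather than something confirmed by this section. The function name friend_similarity and the sample data are hypothetical.

```python
def friend_similarity(friends_a, friends_b):
    """Jaccard-style similarity between two collections of friend IDs.

    Returns a value between 0.0 (no shared friends) and 1.0 (identical
    friend lists). The choice of Jaccard is an assumption of this sketch.
    """
    friends_a, friends_b = set(friends_a), set(friends_b)
    union = friends_a | friends_b
    if not union:
        # Neither user follows anyone, so there is no evidence of similarity.
        return 0.0
    return len(friends_a & friends_b) / len(union)


# Hypothetical data: user name -> set of friend IDs (people they follow).
friends = {
    "user_a": {1, 2, 3, 4},
    "user_b": {3, 4, 5},
    "user_c": {9},
}

sim_ab = friend_similarity(friends["user_a"], friends["user_b"])
sim_ac = friend_similarity(friends["user_a"], friends["user_c"])
```

Dividing by the size of the union keeps users who follow a very large number of accounts from looking similar to everyone simply because of their list size.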

