Our resulting DeepWalk model is just a Word2Vec model, so anything you can do
with Word2Vec in the context of words, you can do with this model in the context
of vertices. Let us use the model to discover similarities between documents:
def evaluate_model(td_matrix, model_file, source_id):
    model = gensim.models.Word2Vec.load(model_file).wv
    most_similar = model.most_similar(str(source_id))
    scores = [x[1] for x in most_similar]
    target_ids = [x[0] for x in most_similar]
    # vertex IDs come back as strings; convert them to integer row indices
    target_idx = [int(t) for t in target_ids]
    # compare top 10 scores with cosine similarity
    # between source and each target
    X = np.repeat(td_matrix[source_id].todense(), 10, axis=0)
    Y = td_matrix[target_idx].todense()
    cosims = [cosine_similarity(X[i], Y[i])[0, 0] for i in range(10)]
    for i in range(10):
        print("{:d} {:s} {:.3f} {:.3f}".format(source_id, target_ids[i], cosims[i], scores[i]))

source_id = np.random.choice(E.shape[0])
evaluate_model(TD, W2V_MODEL_FILE, source_id)
In the output of this code, the first and second columns are the source and
target vertex IDs. The third column is the cosine similarity between the term vectors
corresponding to the source and target documents, and the fourth is the similarity
score reported by the Word2Vec model. As the output shows, cosine similarity reports
a nonzero similarity for only 2 of the 10 document pairs, whereas the Word2Vec model
is able to detect latent similarities in the embedding space. This is similar to the
behavior we noticed between one-hot encoding and dense embeddings.
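
The contrast between sparse term vectors and dense embeddings can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: the `cosine` helper, the six-term vocabulary, and the embedding values are made up for illustration and are not taken from the NeurIPS papers dataset or from the trained model. It only shows why two documents that share no terms score zero on term-vector cosine similarity, while their dense vectors can still be close.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors over a 6-term vocabulary. The two
# documents have no term in common, so every product in the dot
# product is zero.
doc_a_terms = [2, 1, 0, 0, 0, 0]
doc_b_terms = [0, 0, 0, 3, 1, 0]

# Hypothetical dense embeddings for the same two documents (made-up values).
doc_a_embedding = [0.9, 0.1, 0.3]
doc_b_embedding = [0.8, 0.2, 0.4]

sparse_sim = cosine(doc_a_terms, doc_b_terms)
dense_sim = cosine(doc_a_embedding, doc_b_embedding)
print("sparse: {:.3f}  dense: {:.3f}".format(sparse_sim, dense_sim))
```

Because sparse term vectors only overlap where two documents use exactly the same terms, most pairs score zero. A dense embedding spreads information across every dimension, so related vertices can end up close together even when their documents share no vocabulary.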

Chapter 7

We are now ready to create our word embedding model. The gensim package offers a
simple API that allows us to declaratively create and train a Word2Vec model, using
the following code. The trained model will be serialized to the file given by
W2V_MODEL_FILE. The Documents class allows us to stream large input files to train
the Word2Vec model without running into memory issues. We will train the Word2Vec
model in skip-gram mode with a window size of 10, which means we train it to predict
neighboring vertices up to 10 positions away from a central vertex. The resulting
embedding for each vertex is a dense vector of size 128:

W2V_MODEL_FILE = os.path.join(DATA_DIR, "w2v-neurips-papers.model")

class Documents(object):
    def __init__(self, input_file):
        self.input_file = input_file

    def __iter__(self):
        with open(self.input_file, "r") as f:
            for i, line in enumerate(f):
                if i % 1000 == 0:
                    logging.info("{:d} random walks extracted".format(i))
                yield line.strip().split()

def train_word2vec_model(random_walks_file, model_file):
    if os.path.exists(model_file):
        print("Model file {:s} already present, skipping training".format(model_file))
        return
    docs = Documents(random_walks_file)
    model = gensim.models.Word2Vec(docs,


