
# NOTE: the opening of this call is inferred from the surrounding
# code (the extract begins mid-function)
model = gensim.models.Word2Vec(
    docs,
    size=128,     # size of embedding vector
    window=10,    # window size
    sg=1,         # skip-gram model
    min_count=2,  # ignore vertices seen fewer than 2 times
    workers=4     # number of parallel worker threads
)
model.train(
    docs,
    total_examples=model.corpus_count,
    epochs=50)
model.save(model_file)

# train model
train_word2vec_model(RANDOM_WALKS_FILE, W2V_MODEL_FILE)
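Before moving on, note what the model is consuming: each line of RANDOM_WALKS_FILE is (presumably) one random walk, that is, a space-separated sequence of vertex IDs, which Word2Vec treats exactly like a sentence of words. The following toy sketch, with made-up walks and vertex IDs, illustrates the idea (note that gensim 4.0 and later renames the size parameter to vector_size):

import gensim

# two made-up random walks over a document graph; each walk is a
# "sentence" and each vertex ID is a "word"
walks = [
    ["0", "35", "89", "35"],
    ["12", "7", "0", "44"],
]

# min_count=1 because this toy corpus is tiny (the real model uses 2)
toy = gensim.models.Word2Vec(walks, size=128, window=10, sg=1,
                             min_count=1, workers=1)
print(toy.wv.most_similar("0"))  # nearest vertices to vertex 0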

Our resulting DeepWalk model is just a Word2Vec model, so anything you can do with Word2Vec in the context of words, you can do with this model in the context of vertices (we will sketch a couple of these word-style operations right after the code that follows). Let us use the model to discover similarities between documents:

# assumes numpy as np, gensim, and sklearn's cosine_similarity
# (from sklearn.metrics.pairwise) were imported earlier in the chapter
def evaluate_model(td_matrix, model_file, source_id):
    model = gensim.models.Word2Vec.load(model_file).wv
    most_similar = model.most_similar(str(source_id))
    scores = [x[1] for x in most_similar]
    target_ids = [x[0] for x in most_similar]
    # compare top 10 scores with cosine similarity
    # between source and each target
    X = np.repeat(td_matrix[source_id].todense(), 10, axis=0)
    # most_similar() returns vertex IDs as strings, so convert them
    # back to integers before indexing into the term-document matrix
    Y = td_matrix[[int(t) for t in target_ids]].todense()
    cosims = [cosine_similarity(X[i], Y[i])[0, 0] for i in range(10)]
    for i in range(10):
        print("{:d} {:s} {:.3f} {:.3f}".format(
            source_id, target_ids[i], cosims[i], scores[i]))

source_id = np.random.choice(E.shape[0])
evaluate_model(TD, W2V_MODEL_FILE, source_id)
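For instance, here is a short sketch (assumed usage, with hypothetical vertex IDs "0" and "1" that must be present in the model's vocabulary) of two word-style operations applied to vertices: looking up a vertex's dense embedding, and scoring the similarity of an arbitrary pair of vertices:

import gensim

wv = gensim.models.Word2Vec.load(W2V_MODEL_FILE).wv
vec = wv["0"]                   # 128-dimensional embedding of vertex 0
print(vec.shape)                # (128,)
print(wv.similarity("0", "1"))  # cosine similarity of vertices 0 and 1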

In the output, the first and second columns are the source and target vertex IDs. The third column is the cosine similarity between the term vectors corresponding to the source and target documents, and the fourth is the similarity score reported by the Word2Vec model. As you can see, cosine similarity reports a similarity between only 2 of the 10 document pairs, but the Word2Vec model is able to detect latent similarities in the embedding space. This is similar to the behavior we have noticed between one-hot encoding and dense embeddings.
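To make the latent-similarity point concrete, here is a toy illustration with made-up numbers: two documents that share no terms have a term-vector cosine of 0, yet their graph embeddings can still be close if the documents sit near each other in the document graph:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

doc_a = np.array([[1, 1, 0, 0]])  # term counts for document A
doc_b = np.array([[0, 0, 1, 1]])  # document B shares no terms with A
print(cosine_similarity(doc_a, doc_b)[0, 0])  # 0.0: no surface overlap

emb_a = np.array([[0.9, 0.1]])    # hypothetical 2-d DeepWalk embeddings
emb_b = np.array([[0.8, 0.2]])
print(cosine_similarity(emb_a, emb_b)[0, 0])  # ~0.99: latent similarity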

