
0 3422 3455 118 4527 2304 772 3659 2852 4515 5135 3439 1273
0 906 3498 2286 4755 2567 2632
0 5769 638 3574 79 2825 3532 2363 360 1443 4789 229 4515 3014 3683 2967 5206 2288 1615 1166
0 2469 1353 5596 2207 4065 3100
0 2236 1464 1596 2554 4021
0 4688 864 3684 4542 3647 2859
0 4884 4590 5386 621 4947 2784 1309 4958 3314
0 5546 200 3964 1817 845

We are now ready to create our word embedding model. The gensim package offers a simple API that allows us to declaratively create and train a Word2Vec model, using the following code. The trained model will be serialized to the file given by W2V_MODEL_FILE. The Documents class allows us to stream large input files to train the Word2Vec model without running into memory issues. We will train the Word2Vec model in skip-gram mode with a window size of 10, which means we train it to predict up to ten neighboring vertices on either side of a central vertex. The resulting embedding for each vertex is a dense vector of size 128:

W2V_MODEL_FILE = os.path.join(DATA_DIR, "w2v-neurips-papers.model")

class Documents(object):
    def __init__(self, input_file):
        self.input_file = input_file

    def __iter__(self):
        with open(self.input_file, "r") as f:
            for i, line in enumerate(f):
                if i % 1000 == 0:
                    logging.info("{:d} random walks extracted".format(i))
                yield line.strip().split()

def train_word2vec_model(random_walks_file, model_file):
    if os.path.exists(model_file):
        print("Model file {:s} already present, skipping training"
            .format(model_file))
        return
    docs = Documents(random_walks_file)
    model = gensim.models.Word2Vec(
        docs,
        size=128,      # size of embedding vector
        window=10,     # window size
        sg=1,          # skip-gram model
        min_count=2,
        workers=4
    )
    model.train(
        docs,
        total_examples=model.corpus_count,
        epochs=50)
    model.save(model_file)

# train model
train_word2vec_model(RANDOM_WALKS_FILE, W2V_MODEL_FILE)
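Once training has run, the serialized model can be reloaded and inspected directly. Here is a minimal sketch, assuming the model has been saved to W2V_MODEL_FILE as above, that vertex 0 occurs in at least min_count random walks, and the gensim 3.x API implied by the size parameter:

model = gensim.models.Word2Vec.load(W2V_MODEL_FILE)
# vertex IDs are stored as string tokens, so they are looked up as strings
vec = model.wv["0"]
print(vec.shape)             # (128,), matching the size parameter above
print(len(model.wv.vocab))   # number of vertices that received an embedding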
Our resulting DeepWalk model is just a Word2Vec model, so anything you can do with Word2Vec in the context of words, you can do with this model in the context of vertices. Let us use the model to discover similarities between documents:

def evaluate_model(td_matrix, model_file, source_id):
    model = gensim.models.Word2Vec.load(model_file).wv
    most_similar = model.most_similar(str(source_id))
    scores = [x[1] for x in most_similar]
    # vertex tokens are strings; convert back to integer row indices
    target_ids = [int(x[0]) for x in most_similar]
    # compare top 10 scores with cosine similarity
    # between source and each target
    X = np.repeat(td_matrix[source_id].todense(), 10, axis=0)
    Y = td_matrix[target_ids].todense()
    cosims = [cosine_similarity(X[i], Y[i])[0, 0] for i in range(10)]
    for i in range(10):
        print("{:d} {:d} {:.3f} {:.3f}".format(
            source_id, target_ids[i], cosims[i], scores[i]))

source_id = np.random.choice(E.shape[0])
evaluate_model(TD, W2V_MODEL_FILE, source_id)

The output is shown below. The first and second columns are the source and target vertex IDs. The third column is the cosine similarity between the term vectors corresponding to the source and target documents, and the fourth is the similarity score reported by the Word2Vec model. As you can see, cosine similarity reports a similarity between only 2 of the 10 document pairs, but the Word2Vec model is able to detect latent similarities in the embedding space. This is similar to the behavior we have noticed between one-hot encoding and dense embeddings:
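Other operations on the underlying KeyedVectors carry over to vertices in the same way. As a minimal sketch, assuming two vertex IDs such as 0 and 1 that are both present in the model vocabulary, the similarity between a specific pair of documents can be read directly from the embedding space:

wv = gensim.models.Word2Vec.load(W2V_MODEL_FILE).wv
# cosine similarity between the embeddings of two document vertices
print(wv.similarity("0", "1"))
# the three vertices closest to vertex 0 in the embedding space
print(wv.most_similar("0", topn=3))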

