0 3422 3455 118 4527 2304 772 3659 2852 4515 5135 3439 1273
0 906 3498 2286 4755 2567 2632
0 5769 638 3574 79 2825 3532 2363 360 1443 4789 229 4515 3014 3683 2967 5206 2288 1615 1166
0 2469 1353 5596 2207 4065 3100
0 2236 1464 1596 2554 4021
0 4688 864 3684 4542 3647 2859
0 4884 4590 5386 621 4947 2784 1309 4958 3314
0 5546 200 3964 1817 845

We are now ready to create our word embedding model. The gensim package offers a simple API that allows us to declaratively create and train a Word2Vec model using the following code. The trained model will be serialized to the file given by W2V_MODEL_FILE. The Documents class allows us to stream large input files to train the Word2Vec model without running into memory issues. We will train the Word2Vec model in skip-gram mode with a window size of 10, which means we train it to predict up to ten neighboring vertices on either side of a central vertex. The resulting embedding for each vertex is a dense vector of size 128:

W2V_MODEL_FILE = os.path.join(DATA_DIR, "w2v-neurips-papers.model")

class Documents(object):
    def __init__(self, input_file):
        self.input_file = input_file

    def __iter__(self):
        with open(self.input_file, "r") as f:
            for i, line in enumerate(f):
                if i % 1000 == 0:
                    logging.info(
                        "{:d} random walks extracted".format(i))
                yield line.strip().split()


def train_word2vec_model(random_walks_file, model_file):
    if os.path.exists(model_file):
        print("Model file {:s} already present, skipping training"
            .format(model_file))
        return
    docs = Documents(random_walks_file)
    model = gensim.models.Word2Vec(
        docs,
        size=128,     # size of embedding vector
        window=10,    # window size
        sg=1,         # skip-gram model
        min_count=2,
        workers=4
    )
    model.train(docs,
        total_examples=model.corpus_count,
        epochs=50)
    model.save(model_file)

# train model
train_word2vec_model(RANDOM_WALKS_FILE, W2V_MODEL_FILE)
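Once training completes, it is worth a quick sanity check on the saved model. The snippet below is only an illustrative sketch, not part of the chapter's code: it assumes the gensim 3.x API used above and the W2V_MODEL_FILE path we defined, and it picks the vertex ID "1273" (seen in the random walks above) purely as an example. It loads the saved model and confirms that each vertex maps to a dense vector of size 128:

# illustrative sketch (assumptions: gensim 3.x API, W2V_MODEL_FILE as defined above)
import gensim

m = gensim.models.Word2Vec.load(W2V_MODEL_FILE)
word_vectors = m.wv                      # keyed vectors, one per vertex ID

# vertex IDs are stored as strings; "1273" is just an example vertex that
# appears in the random walks above (any ID seen >= min_count times works)
vec = word_vectors["1273"]
print(vec.shape)                         # expected: (128,)

# five nearest neighbors of the same vertex in embedding space
for vertex_id, score in word_vectors.most_similar("1273", topn=5):
    print(vertex_id, score)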
Our resulting DeepWalk model is just a Word2Vec model, so anything you can do with Word2Vec in the context of words, you can do with this model in the context of vertices. Let us use the model to discover similarities between documents:

def evaluate_model(td_matrix, model_file, source_id):
    model = gensim.models.Word2Vec.load(model_file).wv
    most_similar = model.most_similar(str(source_id))
    scores = [x[1] for x in most_similar]
    target_ids = [x[0] for x in most_similar]
    # compare top 10 scores with cosine similarity
    # between source and each target
    X = np.repeat(td_matrix[source_id].todense(), 10, axis=0)
    # target IDs come back as strings, convert them for matrix indexing
    Y = td_matrix[[int(t) for t in target_ids]].todense()
    cosims = [cosine_similarity(X[i], Y[i])[0, 0] for i in range(10)]
    for i in range(10):
        print("{:d} {:s} {:.3f} {:.3f}".format(
            source_id, target_ids[i], cosims[i], scores[i]))

source_id = np.random.choice(E.shape[0])
evaluate_model(TD, W2V_MODEL_FILE, source_id)

The output is shown below. The first and second columns are the source and target vertex IDs. The third column is the cosine similarity between the term vectors corresponding to the source and target documents, and the fourth is the similarity score reported by the Word2Vec model. As you can see, cosine similarity is reported between only 2 of the 10 document pairs, but the Word2Vec model is able to detect latent similarities in the embedding space. This is similar to the behavior we have noticed between one-hot encoding and dense embeddings: