
([Earth, The], 1)
([Earth, travels], 1)
([Earth, around], 1)
([travels, The], 1)
([travels, Earth], 1)
([travels, around], 1)
([travels, the], 1)

We also need negative samples to train the model properly, so we generate them by pairing each input word with a random word from the vocabulary. This process is called negative sampling and might result in the following additional inputs:

([Earth, aardvark], 0)
([Earth, zebra], 0)

A model trained with all of these inputs is called a Skip-gram with Negative Sampling (SGNS) model.
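
If you want to see how such (word, context, label) triples can be produced programmatically, the following is a minimal sketch using tf.keras.preprocessing.sequence.skipgrams, which emits both the positive pairs from a context window and randomly drawn negative pairs. The toy vocabulary and sentence here are illustrative assumptions, not the chapter's own code:

import tensorflow as tf

# Toy vocabulary and an integer-encoded sentence (illustrative only).
word2id = {"the": 1, "earth": 2, "travels": 3, "around": 4, "sun": 5}
sentence = [1, 2, 3, 4, 1, 5]   # "the earth travels around the sun"

# skipgrams() returns (target, context) couples labeled 1 for real
# co-occurrences and 0 for randomly sampled negative pairs.
couples, labels = tf.keras.preprocessing.sequence.skipgrams(
    sentence,
    vocabulary_size=len(word2id) + 1,
    window_size=2,
    negative_samples=1.0)

for (target, context), label in zip(couples, labels):
    print((target, context), label)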

It is important to understand that we are not interested in the ability of these models to classify; rather, we are interested in the side effect of training: the learned weights. These learned weights are what we call the embedding.
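
As a concrete illustration of where those weights live, here is a minimal sketch of an SGNS-style model in tf.keras. The vocabulary size, embedding dimension, and layer names are assumptions made for the example and do not correspond to the chapter's tf2_cbow_skipgram.py code; the point is the last line, where the embedding is read back as the weight matrix of the Embedding layer after training:

import tensorflow as tf

vocab_size = 10000   # assumed vocabulary size
embed_dim = 300      # assumed embedding dimensionality

# One input for the target word, one for the (positive or negative) context word.
target_in = tf.keras.layers.Input(shape=(1,), name="target")
context_in = tf.keras.layers.Input(shape=(1,), name="context")

target_emb = tf.keras.layers.Embedding(vocab_size, embed_dim, name="word_embedding")(target_in)
context_emb = tf.keras.layers.Embedding(vocab_size, embed_dim, name="context_embedding")(context_in)

# Dot product of the two embedding vectors, squashed to a probability that the
# pair is a true co-occurrence (label 1) rather than a negative sample (label 0).
dot = tf.keras.layers.Dot(axes=-1)([target_emb, context_emb])
dot = tf.keras.layers.Flatten()(dot)
output = tf.keras.layers.Activation("sigmoid")(dot)

model = tf.keras.Model(inputs=[target_in, context_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")

# After model.fit(...), this matrix of shape (vocab_size, embed_dim) is the embedding.
embedding_matrix = model.get_layer("word_embedding").get_weights()[0]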

While it may be instructive to implement the models on your own as an academic exercise, at this point Word2Vec is so commoditized that you are unlikely to ever need to do so. For the curious, you will find code implementing the CBOW and skip-gram models in the files tf2_cbow_model.py and tf2_cbow_skipgram.py in the source code accompanying this chapter.

Google's pretrained Word2Vec model is available here (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). The model was trained on roughly 100 billion words from a Google News dataset and contains a vocabulary of 3 million words and phrases. The output vector dimensionality is 300. It is available as a .bin file and can be opened with gensim using gensim.models.KeyedVectors.load_word2vec_format() or via the gensim data downloader.
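
As a quick sketch of the two loading routes just mentioned (the local file name below is what Google's archive typically unpacks to, so treat it as an assumption):

from gensim.models import KeyedVectors
import gensim.downloader as api

# Route 1: load the downloaded .bin file directly (file name assumed).
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Route 2: fetch the same vectors through the gensim data downloader.
wv = api.load("word2vec-google-news-300")

print(wv["Earth"].shape)   # (300,)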

The other early implementation of word embedding is GloVe, which we will talk about next.

