Word Embeddings

A much earlier related line of work on producing embeddings for long sequences such as paragraphs and documents was proposed by Le and Mikolov [26] soon after Word2Vec; it is now known interchangeably as Doc2Vec or Paragraph2Vec. Doc2Vec extends Word2Vec: where Word2Vec (in its CBOW formulation) uses the surrounding words to predict a target word, Doc2Vec additionally provides a paragraph ID as an extra input during training. At the end of training, the Doc2Vec network has learned an embedding for every word and an embedding for every paragraph. During inference, the network is given a paragraph with some missing words. It uses the known part of the paragraph to produce a paragraph embedding, then uses this paragraph embedding together with the word embeddings to infer the missing words. The Doc2Vec algorithm comes in two flavors, Paragraph Vectors - Distributed Memory (PV-DM) and Paragraph Vectors - Distributed Bag of Words (PV-DBOW), roughly analogous to CBOW and skip-gram in Word2Vec. We will not look at Doc2Vec further in this book, except to note that the gensim toolkit provides a prebuilt implementation that you can train with your own corpus, as sketched below.
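As a quick illustration, here is a minimal sketch of training Doc2Vec with gensim. It assumes gensim 4.x; the toy corpus and the parameter values are only illustrative, not recommendations:

```python
# A minimal Doc2Vec sketch, assuming gensim 4.x.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Each paragraph is wrapped in a TaggedDocument with a unique tag (the paragraph ID)
documents = [
    TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)
]

# dm=1 selects the PV-DM variant; dm=0 would select PV-DBOW
model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, epochs=40, dm=1)

# Look up the learned embedding for the first training paragraph
print(model.dv[0])

# Infer an embedding for a new, unseen paragraph
new_vector = model.infer_vector("the cat chased the dog".split())
print(new_vector)
```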

Having looked at the different forms of static embeddings, we will now switch gears a bit and look at dynamic embeddings.

Language model-based embeddings

Language model-based embeddings represent the next step in the evolution of word embeddings. A language model is a probability distribution over sequences of words; once we have such a model, we can ask it to predict the most likely next word given a particular sequence of words. Like traditional word embeddings, both static and dynamic, language model-based embeddings are trained to predict the next word (or the previous word as well, if the language model is bidirectional) given a partial sentence from the corpus. Training does not involve active labeling, since it leverages the natural grammatical structure of large volumes of text, so in a sense this is an unsupervised learning process.
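To make the idea of "a probability distribution over sequences of words" concrete, the sketch below builds a toy bigram language model from a small made-up corpus and uses it to score next-word candidates and whole sequences. This is only an illustration of the concept; the language-model-based embeddings discussed here come from neural networks trained on large corpora:

```python
# A toy bigram language model, purely to illustrate next-word prediction
# and sequence probability; not a neural language model.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each preceding word
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def next_word_distribution(prev):
    """Estimate P(w | prev) from bigram counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sequence_probability(words):
    """Chain-rule approximation: P(w1..wn) ~ product of P(wi | w_{i-1})."""
    p = 1.0
    for prev, curr in zip(words, words[1:]):
        p *= next_word_distribution(prev).get(curr, 0.0)
    return p

# Words that can follow "the", each with probability 0.25 in this toy corpus
print(next_word_distribution("the"))

# Probability of a full sequence under the bigram model (0.0625 here)
print(sequence_probability("the cat sat on the mat".split()))
```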
