
Equation 11.1 - Embedding arithmetic

This arithmetic is cool and all, but you won’t actually be using it much; the whole point was to show you that the word embeddings indeed capture the relationship between different words. We can use them to train other models, though.

Using Word Embeddings

It seems easy enough: Get the text corpora tokenized, look the tokens up in the table of pre-trained word embeddings, and then use the embeddings as inputs of another model. But what if the vocabulary of your corpora is not quite properly represented in the embeddings? Even worse, what if the preprocessing steps you used resulted in a lot of tokens that do not exist in the embeddings?

"Choose your word embeddings wisely."

Grail Knight

Vocabulary Coverage

Once again, the Grail Knight has a point: the chosen word embeddings must provide good vocabulary coverage. First and foremost, most of the usual preprocessing steps do not apply when you’re using pre-trained word embeddings like GloVe: no lemmatization, no stemming, no stop-word removal. These steps would likely end up producing a lot of [UNK] tokens.

Second, even without those preprocessing steps, maybe the words used in the given text corpora are simply not a good match for a particular pre-trained set of word embeddings.

Let’s see how good a match the glove-wiki-gigaword-50 embeddings are to our own vocabulary. Our vocabulary has 3,706 words (3,704 from our text corpora plus the padding and unknown special tokens):

vocab = list(dictionary.token2id.keys())
len(vocab)

Output

3706
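
The comparison below relies on a glove object holding the pre-trained vectors. The loading code is not part of this excerpt, but a minimal sketch using Gensim’s downloader API could look like this (treat the details as an assumption):

import gensim.downloader as api

# Downloads (on first use) and loads the 50-dimensional GloVe vectors
# trained on Wikipedia 2014 + Gigaword 5, returned as a KeyedVectors object
glove = api.load('glove-wiki-gigaword-50')

# Gensim 3.x exposes the model's vocabulary as glove.vocab (a dict);
# Gensim 4.x renamed it to glove.key_to_index
len(glove.vocab)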

Let’s see how many words of our own vocabulary are unknown to the embeddings:

unknown_words = sorted(
    list(set(vocab).difference(set(glove.vocab)))
)
print(len(unknown_words))
print(unknown_words[:5])

Output

44
['[PAD]', '[UNK]', 'arrum', 'barrowful', 'beauti']

There are only 44 unknown words: the two special tokens, and some other weird words like "arrum" and "barrowful." It looks good, right? It means that there are 3,662 matches out of 3,706 words, hinting at 98.81% coverage. But it is actually better than that.
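
To make the arithmetic explicit, the vocabulary-level coverage can be computed directly from the two counts above (a small sketch using the variables already defined):

# Share of distinct vocabulary words that have a pre-trained vector
vocab_coverage = (len(vocab) - len(unknown_words)) / len(vocab)
vocab_coverage  # (3706 - 44) / 3706 ≈ 0.9881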

If we look at how often the unknown words show up in our text corpora, we’ll have a precise measure of how many tokens will be unknown to the embeddings. To actually get the total count we need to get the IDs of the unknown words first, and then look at their frequencies in the corpora:

unknown_ids = [dictionary.token2id[w]
               for w in unknown_words
               if w not in ['[PAD]', '[UNK]']]
# dictionary.cfs maps each token ID to its total count across the corpora;
# dictionary.num_pos is the total number of processed words (token positions)
unknown_count = np.sum([dictionary.cfs[idx]
                        for idx in unknown_ids])
unknown_count, dictionary.num_pos

Output

(82, 50802)
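
The excerpt stops at the raw counts, but the token-level coverage follows directly from them (a small sketch using the values just computed):

# Share of all token occurrences in the corpora that map to a known embedding
token_coverage = 1 - unknown_count / dictionary.num_pos
token_coverage  # 1 - 82 / 50802 ≈ 0.9984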

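Once coverage looks acceptable, the pre-trained vectors can be turned into inputs for another model, for example by building an embedding matrix aligned with our own vocabulary. The sketch below only illustrates that idea, assuming the glove and dictionary objects from above; names like embedding_matrix are placeholders, and the approach used later in the book may differ:

import numpy as np

embedding_dim = 50            # glove-wiki-gigaword-50 vectors have 50 dimensions
vocab_size = len(dictionary)  # 3,706 words, including [PAD] and [UNK]

# One row per word in OUR vocabulary; words without a pre-trained vector
# (including the special tokens) keep all-zero rows
embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype=np.float32)
for token, token_id in dictionary.token2id.items():
    if token in glove.vocab:  # glove.key_to_index in Gensim 4.x
        embedding_matrix[token_id] = glove[token]

Such a matrix could then be used to initialize an embedding layer, for instance with torch.nn.Embedding.from_pretrained(torch.as_tensor(embedding_matrix)).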
