Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide
Equation 11.1 - Embedding arithmetic

This arithmetic is cool and all, but you won’t actually be using it much; the whole
point was to show you that the word embeddings indeed capture the relationship
between different words. We can use them to train other models, though.
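If you want to try this kind of arithmetic yourself, here is a minimal sketch. It
assumes the glove-wiki-gigaword-50 vectors are loaded through Gensim’s downloader,
and it uses the classic "king - man + woman" example purely for illustration (the
score in the comment is approximate, not an output from our corpora):

from gensim import downloader as api

# Load the same 50-dimensional pre-trained GloVe vectors used in this chapter
glove = api.load('glove-wiki-gigaword-50')

# "king - man + woman" should land closest to "queen"
result = glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(result)  # something like [('queen', 0.85)]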
Using Word Embeddings

It seems easy enough: Get the text corpora tokenized, look the tokens up in the
table of pre-trained word embeddings, and then use the embeddings as inputs of
another model. But what if the vocabulary of your corpora is not quite properly
represented in the embeddings? Even worse, what if the preprocessing steps you
used resulted in a lot of tokens that do not exist in the embeddings?

"Choose your word embeddings wisely."

Grail Knight

Vocabulary Coverage

Once again, the Grail Knight has a point: the chosen word embeddings must
provide good vocabulary coverage. First and foremost, most of the usual
preprocessing steps do not apply when you’re using pre-trained word embeddings
like GloVe: no lemmatization, no stemming, no stop-word removal. These steps
would likely end up producing a lot of [UNK] tokens.

Second, even without those preprocessing steps, maybe the words used in the
given text corpora are simply not a good match for a particular pre-trained set of
word embeddings.

Let’s see how good a match the glove-wiki-gigaword-50 embeddings are to our
own vocabulary. Our vocabulary has 3,706 words (3,704 from our text corpora
plus the padding and unknown special tokens):

vocab = list(dictionary.token2id.keys())
len(vocab)
Output
3706
Let’s see how many words of our own vocabulary are unknown to the
embeddings:
unknown_words = sorted(
    list(set(vocab).difference(set(glove.vocab)))
)
print(len(unknown_words))
print(unknown_words[:5])
Output
44
['[PAD]', '[UNK]', 'arrum', 'barrowful', 'beauti']
There are only 44 unknown words: the two special tokens, and some other weird
words like "arrum" and "barrowful." It looks good, right? It means that there are
3,662 matches out of 3,706 words, hinting at 98.81% coverage. But it is actually
better than that.
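As a quick sanity check of that figure, vocabulary-level coverage can be computed
directly from the objects above (a minimal sketch, reusing vocab and unknown_words):

# Fraction of distinct words in our vocabulary that have a GloVe vector
vocab_coverage = 1 - len(unknown_words) / len(vocab)
print(f'{vocab_coverage:.2%}')  # 98.81%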
If we look at how often the unknown words show up in our text corpora, we’ll have
a precise measure of how many tokens will be unknown to the embeddings. To
actually get the total count we need to get the IDs of the unknown words first, and
then look at their frequencies in the corpora:
unknown_ids = [dictionary.token2id[w]
               for w in unknown_words
               if w not in ['[PAD]', '[UNK]']]
unknown_count = np.sum([dictionary.cfs[idx]
                        for idx in unknown_ids])
unknown_count, dictionary.num_pos
Output
(82, 50802)
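Only 82 out of the 50,802 token positions in our corpora correspond to words that
are unknown to the embeddings. Turning those counts into a token-level coverage
figure is straightforward (a minimal sketch, reusing unknown_count and dictionary):

# Fraction of all token positions in the corpora covered by the embeddings
token_coverage = 1 - unknown_count / dictionary.num_pos
print(f'{token_coverage:.2%}')  # 99.84%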