You got that right, arithmetic, really! Maybe you’ve seen this "equation" somewhere else already:

KING - MAN + WOMAN = QUEEN

Awesome, right? We’ll try this "equation" out shortly, hang in there!

Pre-trained Word2Vec

Word2Vec is a simple model, but it still requires a sizable amount of text data to learn meaningful embeddings. Luckily for us, someone else has already done the hard work of training these models, and we can use Gensim’s downloader to choose from a variety of pre-trained word embeddings.

For a detailed list of the available models (embeddings), please check Gensim-data’s repository [184] on GitHub.

"Why so many embeddings? How are they different from each other?"

Good question! It turns out that using different text corpora to train a Word2Vec model produces different embeddings. On the one hand, this shouldn’t be a surprise; after all, these are different datasets, and it’s expected that they will produce different results. On the other hand, if these datasets all contain sentences in the same language (English, for example), how come the embeddings are different?

The embeddings will be influenced by the kind of language used in the text: the phrasing and wording used in novels are different from those used in news articles, and radically different from those used on Twitter, for example.

"Choose your word embeddings wisely."
Grail Knight

Moreover, not every word embedding is learned using a Word2Vec model architecture. There are many different ways of learning word embeddings, one of them being…
Global Vectors (GloVe)
The Global Vectors model was proposed by Pennington, J. et al. in their 2014 paper
"GloVe: Global Vectors for Word Representation." [185] It combines the skip-gram
model with co-occurrence statistics at the global level (hence the name). We’re
not diving into its inner workings here, but if you’re interested in knowing more
about it, check its official website: https://nlp.stanford.edu/projects/glove/.
The pre-trained GloVe embeddings come in many sizes and shapes: dimensions vary between 25 and 300, and vocabularies vary between 400,000 and 2,200,000 words. Let’s use Gensim’s downloader to retrieve the smallest one: glove-wiki-gigaword-50. It was trained on Wikipedia 2014 and Gigaword 5; it contains 400,000 words in its vocabulary, and its embeddings have 50 dimensions.
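By the way, we can also browse the catalog straight from Python before downloading anything. The sketch below uses Gensim’s downloader.info() function from the same gensim.downloader module we’re about to use; the metadata field names printed here (description, num_records) are the ones published in the Gensim-data catalog, so if your Gensim version exposes different keys, just print the whole dictionary instead:

from gensim import downloader

# List every pre-trained embedding available through Gensim-data
catalog = downloader.info()
print(sorted(catalog['models'].keys()))

# Inspect the metadata of the model we are about to download
meta = downloader.info(name='glove-wiki-gigaword-50')
print(meta['description'])  # corpus and training details
print(meta['num_records'])  # number of word vectors (400,000)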
Downloading Pre-trained Word Embeddings

from gensim import downloader
glove = downloader.load('glove-wiki-gigaword-50')
len(glove.vocab)
Output
400000
Let’s check the embeddings for "alice" (the vocabulary is uncased):
glove['alice']
Output
array([ 0.16386, 0.57795, -0.59197, -0.32446, 0.29762, 0.85151,
-0.76695, -0.20733, 0.21491, -0.51587, -0.17517, 0.94459,
0.12705, -0.33031, 0.75951, 0.44449, 0.16553, -0.19235,
0.06553, -0.12394, 0.61446, 0.89784, 0.17413, 0.41149,
1.191 , -0.39461, -0.459 , 0.02216, -0.50843, -0.44464,
0.68721, -0.7167 , 0.20835, -0.23437, 0.02604, -0.47993,
0.31873, -0.29135, 0.50273, -0.55144, -0.06669, 0.43873,
-0.24293, -1.0247 , 0.02937, 0.06849, 0.25451, -1.9663 ,
0.26673, 0.88486], dtype=float32)
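Since the loaded object is a regular set of Gensim keyed vectors, there are two quick follow-ups we can try right away, sketched below: first, confirming that the vocabulary is indeed uncased; second, giving the "equation" from the beginning of this section a try using Gensim’s most_similar() method, where "positive" words are added and "negative" words are subtracted (the exact similarity score may vary slightly between model versions):

# The vocabulary is uncased, so the capitalized word is not present
'Alice' in glove.vocab  # False
'alice' in glove.vocab  # True

# KING - MAN + WOMAN: add the positive words, subtract the negative one
glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
# expected to rank 'queen' first, e.g., [('queen', 0.85...)]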