Maybe you filled this blank in with "too," or maybe you chose a different word like "here" or "now," depending on what you assumed to be preceding the first word.
Figure 11.7 - Many options for filling in the [BLANK]
That's easy, right? How did you do it, though? How do you know that "you" should follow "nice to meet"? You've probably read and said "nice to meet you" thousands of times. But have you ever read or said: "Nice to meet aardvark"? Me neither!
What about the second sentence? It's not that obvious anymore, but I bet you can still rule out "to meet you aardvark" (or at least admit that's very unlikely to be the case).
It turns out, we have a language model in our heads too, and it's straightforward to guess which words are good choices to fill in the blanks using sequences that are familiar to us.
N-grams
The structure, in the examples above, is composed of three words and a blank: a
four-gram. If we were using two words and a blank, that would be a trigram, and, for a
given number of words (n-1) followed by a blank, an n-gram.
Figure 11.8 - N-grams
N-gram models are based on pure statistics: They fill in the blanks using the most
common sequence that matches the words preceding the blank (that’s called the
context). On the one hand, larger values of n (longer sequences of words) may yield
better predictions; on the other hand, they may yield no predictions since a
particular sequence of words may have never been observed. In the latter case, one
can always fall back to a shorter n-gram and try again (that’s called a stupid back-off,
by the way).
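Just to make the mechanics concrete, here is a minimal sketch (not from the book) of an n-gram "fill in the blanks" model with stupid back-off, built from plain Python counters. The function names (build_counts(), predict_next()), the back-off factor alpha, and the toy corpus are all made up for illustration.

from collections import Counter, defaultdict

def build_counts(tokens, max_n=4):
    # Count every n-gram up to max_n: maps a context tuple (n-1 words)
    # to a Counter of the words that followed it
    counts = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            counts[tuple(context)][nxt] += 1
    return counts

def predict_next(counts, context, alpha=0.4):
    # Pick the most common continuation of the context; if that context was
    # never observed, drop its oldest word and try again (stupid back-off),
    # discounting the score by alpha at each step
    context = tuple(context)
    discount = 1.0
    while context:
        if counts[context]:
            word, freq = counts[context].most_common(1)[0]
            return word, discount * freq / sum(counts[context].values())
        context = context[1:]
        discount *= alpha
    return None, 0.0

corpus = 'nice to meet you . it was nice to meet you too'.split()
counts = build_counts(corpus)
predict_next(counts, ['nice', 'to', 'meet'])   # ('you', 1.0)
predict_next(counts, ['happy', 'to', 'meet'])  # backs off: ('you', 0.4)

In the second call, the four-gram context was never seen, so the model falls back to the shorter "to meet" context; the discounted score of 0.4 signals that the prediction came from a back-off rather than the full context.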
For a more detailed explanation of n-gram models, please check
the "N-gram Language Models" [178] section of Lena Voita’s
amazing "NLP Course | For You." [179]
These models are simple, but they are somewhat limited because they can only
look back.
"Can we look ahead too?"
Sure, we can!