Output

{'labels': 1,
 'sentence': 'There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, `Oh dear!',
 'source': 'alice28-1476.txt'}

Now that the labels are in place, we can finally shuffle the dataset and split it into training and test sets:

Data Preparation

shuffled_dataset = dataset.shuffle(seed=42)
split_dataset = shuffled_dataset.train_test_split(test_size=0.2)
split_dataset

Output

DatasetDict({
    train: Dataset({
        features: ['sentence', 'source'],
        num_rows: 3081
    })
    test: Dataset({
        features: ['sentence', 'source'],
        num_rows: 771
    })
})

The splits are actually a dataset dictionary, so you may want to retrieve the actual datasets from it:

Data Preparation

train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

Done! We have two randomly shuffled datasets: training and test.
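As a quick sanity check, each split behaves like a regular dataset, so we can inspect it directly. The snippet below is just a sketch, assuming the train_dataset and test_dataset variables defined above: it prints the number of rows in each split, which should add up to the total shown in the output (3,081 + 771 = 3,852 sentences), and peeks at the first training example.

# Sketch: sanity-check the split (assumes train_dataset / test_dataset above)
print(len(train_dataset), len(test_dataset))   # 3081 771
print(len(train_dataset) + len(test_dataset))  # 3852 sentences in total

# Each row is a plain dictionary of features
print(train_dataset[0])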
Word Tokenization
The naïve word tokenization, as we’ve already seen, simply splits a sentence into
words using the white space as a separator:
sentence = "I'm following the white rabbit"
tokens = sentence.split(' ')
tokens
Output
["I'm", 'following', 'the', 'white', 'rabbit']
But, as we’ve also seen, there are issues with the naïve approach (how to handle
contractions, for example). Let’s try using Gensim, [170] a popular library for topic
modeling, which offers some out-of-the-box tools for performing word
tokenization:
from gensim.parsing.preprocessing import *
preprocess_string(sentence)
Output
['follow', 'white', 'rabbit']
"That doesn’t look right … some words are simply gone!"
Welcome to the world of tokenization :-) It turns out, Gensim’s
preprocess_string() applies many filters by default, namely:
• strip_tags() (for removing HTML-like tags between brackets)
• strip_punctuation()
• strip_multiple_whitespaces()
• strip_numeric()
The filters above are pretty straightforward, and they are used to remove typical elements from the text.
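By the way, if making words disappear is not what you want, preprocess_string() also takes a filters argument, so you can choose which filters get applied. The snippet below is only a sketch of that idea: it applies just the four straightforward filters listed above, skipping everything else (such as stop word removal and stemming).

from gensim.parsing.preprocessing import (
    preprocess_string, strip_tags, strip_punctuation,
    strip_multiple_whitespaces, strip_numeric
)

# Only the four "straightforward" filters; nothing else is applied,
# so no words should disappear from the sentence
filters = [strip_tags,
           strip_punctuation,
           strip_multiple_whitespaces,
           strip_numeric]
preprocess_string(sentence, filters=filters)

Since strip_punctuation() replaces punctuation with spaces, the contraction "I'm" should come out as two tokens, 'I' and 'm', while every other word is kept as-is.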