Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)
The actual texts of the books are contained between lines 105 and 3703 (remember Python's zero-based indexing) and 309 and 5099, respectively. Moreover, we're joining all the lines together into a single large string of text for each book, because we're going to organize the resulting texts into sentences, and in a regular book there are line breaks mid-sentence all over.

We definitely do not want to do that manually every time, right? Although it would be more difficult to automatically remove any additions to the original text, we can partially automate the removal of the extra lines by setting the real start and end lines of each text in a configuration file (lines.cfg):

Configuration File

text_cfg = """fname,start,end
alice28-1476.txt,104,3704
wizoz10-1740.txt,310,5100"""
bytes_written = open(
    os.path.join(localfolder, 'lines.cfg'), 'w'
).write(text_cfg)

Your local folder (texts) should have three files now: alice28-1476.txt, lines.cfg, and wizoz10-1740.txt.
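To make the configuration concrete, here is a minimal sketch of how lines.cfg could be used to load a clean, single-string version of each book. The load_text() helper below is ours, not the book's (the book builds its own loading routine later on), and it assumes the text files sit in the texts folder:

import os

def load_text(localfolder, fname, config_fname='lines.cfg'):
    # Hypothetical helper, not from the book.
    # Parses lines.cfg, skipping its CSV header (fname,start,end)
    with open(os.path.join(localfolder, config_fname)) as f:
        rows = [row.split(',') for row in f.read().splitlines()[1:]]
    bounds = {name: (int(start), int(end)) for name, start, end in rows}
    # Keeps only the lines between the configured start and end...
    start, end = bounds[fname]
    with open(os.path.join(localfolder, fname)) as f:
        lines = f.read().splitlines()[start:end]
    # ...and joins them into one large string, since a book has
    # line breaks mid-sentence all over
    return ' '.join(lines)

alice = load_text('texts', 'alice28-1476.txt')
wizard = load_text('texts', 'wizoz10-1740.txt')

Either way, we end up with two long strings, alice and wizard, which are what the tokenization snippets below operate on.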
Output["I'm", 'following', 'the', 'white', 'rabbit']"What about 'I’m'? Isn’t it two words?"Yes, and no. Not helpful, right? As usual, it depends—word contractions like thatare fairly common, and maybe you want to keep them as single tokens. But it is alsopossible to have the token itself split into its two basic components, "I" and "am,"such that the sentence above has six tokens instead of five. For now, we’re onlyinterested in sentence tokenization, which, as you probably already guessed,means to split a text into its sentences.We’ll get back to the topic of tokenization at word (and subword)levels later.For a brief introduction to the topic, check the "Tokenization" [163]section of the Introduction to Information Retrieval [164] book byChristopher D. Manning, Prabhakar Raghavan, and HinrichSchütze, Cambridge University Press (2008).We’re using NLTK’s sent_tokenize() method to accomplish this instead of tryingto devise the splitting rules ourselves (NLTK is the natural language toolKit libraryand is one of the most traditional tools for handling NLP tasks):import nltkfrom nltk.tokenize import sent_tokenizenltk.download('punkt')corpus_alice = sent_tokenize(alice)corpus_wizard = sent_tokenize(wizard)len(corpus_alice), len(corpus_wizard)Output(1612, 2240)There are 1,612 sentences in Alice’s Adventures in Wonderland and 2,240 sentencesin The Wonderful Wizard of Oz.886 | Chapter 11: Down the Yellow Brick Rabbit Hole
For now, we're only interested in sentence tokenization, which, as you probably already guessed, means to split a text into its sentences.

We'll get back to the topic of tokenization at word (and subword) levels later.

For a brief introduction to the topic, check the "Tokenization" [163] section of the Introduction to Information Retrieval [164] book by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press (2008).

We're using NLTK's sent_tokenize() method to accomplish this instead of trying to devise the splitting rules ourselves (NLTK is the Natural Language Toolkit library and is one of the most traditional tools for handling NLP tasks):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')
corpus_alice = sent_tokenize(alice)
corpus_wizard = sent_tokenize(wizard)
len(corpus_alice), len(corpus_wizard)

Output

(1612, 2240)

There are 1,612 sentences in Alice's Adventures in Wonderland and 2,240 sentences in The Wonderful Wizard of Oz.
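The reason we downloaded punkt above is that sent_tokenize() is backed by NLTK's pre-trained Punkt sentence model, so it does more than split on periods; it knows, for instance, that an abbreviation does not end a sentence. A quick illustration (the sample string is ours, not the book's):

from nltk.tokenize import sent_tokenize

text = "Alice met Mr. Rabbit. He was in a hurry! Where was he going?"
sent_tokenize(text)

Expected output

['Alice met Mr. Rabbit.', 'He was in a hurry!', 'Where was he going?']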