device_index = (model.device.index
                if model.device.type != 'cpu'
                else -1)

gpt2_gen = pipeline('text-generation',
                    model=model,
                    tokenizer=auto_tokenizer,
                    device=device_index)

The only parameter we may have to change is, once again, the max_length:

result = gpt2_gen(base_text, max_length=250)
print(result[0]['generated_text'])

Output

Alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had peeped
into the book her sister was reading, but it had no pictures or
conversations in it, `and what is the use of a book,' thought Alice
`without pictures or conversation?' So she was considering in her
own mind (as well as she could, for the hot day made her feel very
sleepy and stupid), whether the pleasure of making a daisy-chain
would be worth the trouble of getting up and picking the daisies,
when suddenly a White Rabbit with pink eyes ran close by her.
The rabbit was running away quite as quickly as it had jumped to her
feet. She had nothing of the kind, and, as she made it up to Alice,
was beginning to look at the door carefully in one thought. `It's
very curious,' after having been warned, `that I should be talking to
Alice!' `It's not,' she went on, `it wasn't even a cat,' so very very
quietly indeed.' In that instant he began to cry out aloud. Alice
began to sob out, 'I am not to cry out!' `What

This time, I’ve kept the whole thing, the base and the generated text. I tried it out
several times and, in my humble opinion, the output looks more "Alice-y" now.
What do you think?
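If you want to play around with the generation itself, the pipeline also forwards
keyword arguments such as do_sample, top_k, and num_return_sequences to the
underlying generate() method. The snippet below is a minimal sketch of that idea; it
assumes the gpt2_gen pipeline and base_text defined above and is not something we
ran in the chapter, so your mileage may vary:

# A minimal sketch: sampling-based generation with the same pipeline.
# Extra keyword arguments are forwarded to the model's generate() method.
results = gpt2_gen(base_text,
                   max_length=250,
                   do_sample=True,          # sample instead of greedy decoding
                   top_k=50,                # consider only the 50 most likely tokens per step
                   num_return_sequences=2)  # produce two alternative continuations

for i, res in enumerate(results):
    print(f"--- Continuation #{i} ---")
    print(res['generated_text'])

Since sampling is stochastic, each run produces different continuations; setting a
seed (for example, with transformers.set_seed()) makes them reproducible.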
Recap
In this chapter, we took a deep dive into the natural language processing world. We
built our own dataset from scratch using two books, Alice’s Adventures in
Wonderland and The Wonderful Wizard of Oz, and performed sentence and word
tokenization. Then, we built a vocabulary and used it with a tokenizer to generate
the primary input of our models: sequences of token IDs. Next, we created
numerical representations for our tokens, starting with a basic one-hot encoding
and working our way to using word embeddings to train a model for classifying the
source of a sentence. We also learned about the limitations of classical
embeddings, and the need for contextual word embeddings produced by language
models like ELMo and BERT. We got to know our Muppet friend in detail: input
embeddings, pre-training tasks, and hidden states (the actual embeddings). We
leveraged the HuggingFace library to fine-tune a pre-trained model using a
Trainer and to deliver predictions using a pipeline. Lastly, we used the famous
GPT-2 model to generate text that, hopefully, looks like it was written by Lewis
Carroll. This is what we’ve covered (a few minimal refresher sketches follow the list):
• using NLTK to perform sentence tokenization on our text corpora
• converting each book into a CSV file containing one sentence per line
• building a dataset using HuggingFace’s Dataset to load the CSV files
• creating new columns in the dataset using map()
• learning about data augmentation for text data
• using Gensim to perform word tokenization
• building a vocabulary and using it to get a token ID for each word
• adding special tokens to the vocabulary, like [UNK] and [PAD]
• loading our own vocabulary into HuggingFace’s tokenizer
• understanding the output of a tokenizer: input_ids, token_type_ids, and
attention_mask
• using the tokenizer to tokenize two sentences as a single input
• creating numerical representations for each token, starting with one-hot
encoding
• learning about the simplicity and limitations of the bag-of-words (BoW)
approach
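To jog your memory on the first few bullets, here is a minimal, self-contained sketch
of the overall flow: sentence-tokenizing a raw text file with NLTK, writing it out as
a CSV with one sentence per line, loading that CSV with HuggingFace’s datasets, and
adding a new column with map(). The file name, column names, and labeling rule are
made up for illustration; the chapter’s actual data preparation involves more
cleaning steps:

import nltk
from nltk.tokenize import sent_tokenize
import pandas as pd
from datasets import load_dataset

nltk.download('punkt', quiet=True)  # models used by the sentence tokenizer

# Hypothetical file name, standing in for one of our two books
raw_text = open('alice.txt', 'r').read()
sentences = sent_tokenize(raw_text)

# One sentence per line, plus a column identifying the source book
pd.DataFrame({'sentence': sentences, 'source': 'alice.txt'}) \
  .to_csv('alice.csv', index=False)

# Build a dataset from the CSV file(s)
dataset = load_dataset('csv', data_files=['alice.csv'], split='train')

# map() creates new columns; here, a binary label derived from the source column
dataset = dataset.map(lambda row: {'labels': 1 if 'alice' in row['source'] else 0})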
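The vocabulary-related bullets can be sketched in a similar fashion: word-tokenizing
sentences with Gensim, building a vocabulary that maps each word to a token ID,
prepending special tokens, loading that vocabulary into one of HuggingFace’s
tokenizers, and inspecting its output for a single sentence and for a pair of
sentences. This is only an illustration of the idea, not the chapter’s exact code:

from gensim import corpora
from gensim.utils import simple_preprocess
from transformers import BertTokenizer

sentences = ['Alice was beginning to get very tired.',
             'The Wizard of Oz lived in the Emerald City.']

# Word tokenization with Gensim (lowercases and drops punctuation)
tokens = [simple_preprocess(s) for s in sentences]

# Vocabulary: each unique word gets an ID; special tokens come first
dictionary = corpora.Dictionary(tokens)
vocab = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'] + list(dictionary.token2id)

# One token per line, then load the file into HuggingFace's BertTokenizer
with open('our_vocab.txt', 'w') as f:
    f.write('\n'.join(vocab))
tokenizer = BertTokenizer('our_vocab.txt')

# Single sentence: input_ids, token_type_ids, and attention_mask
print(tokenizer('alice was very tired'))
# Two sentences as a single input: token_type_ids tells them apart
print(tokenizer('alice was very tired', 'the wizard lived in the emerald city'))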
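As for the last bullets, the bag-of-words idea can be illustrated in a couple of
lines with Gensim (again, just a sketch of the concept): each sentence becomes a list
of (token ID, count) pairs, so word order, and with it most of the meaning, is
thrown away.

from gensim import corpora
from gensim.utils import simple_preprocess

docs = [simple_preprocess(s) for s in
        ['Alice was very tired.', 'The Wizard of Oz lived in the Emerald City.']]
dictionary = corpora.Dictionary(docs)

# Bag-of-words: (token ID, count) pairs; 'very' appears twice, order is lost
print(dictionary.doc2bow(simple_preprocess('Alice was very very tired')))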