Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide


device_index = (model.device.index
                if model.device.type != 'cpu'
                else -1)

gpt2_gen = pipeline('text-generation',
                    model=model,
                    tokenizer=auto_tokenizer,
                    device=device_index)

The only parameter we may have to change is, once again, the max_length:

result = gpt2_gen(base_text, max_length=250)
print(result[0]['generated_text'])

Output

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. The rabbit was running away quite as quickly as it had jumped to her feet. She had nothing of the kind, and, as she made it up to Alice, was beginning to look at the door carefully in one thought. `It's very curious,' after having been warned, `that I should be talking to Alice!' `It's not,' she went on, `it wasn't even a cat,' so very very quietly indeed.' In that instant he began to cry out aloud. Alice began to sob out, 'I am not to cry out!' `What

This time, I've kept the whole thing, the base and the generated text. I tried it out several times and, in my humble opinion, the output looks more "Alice-y" now. What do you think?
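In case you are running just this snippet, the model, auto_tokenizer, and base_text objects it relies on can be created along these lines. This is only a minimal sketch: it assumes the plain pre-trained 'gpt2' checkpoint (rather than a model you may have fine-tuned earlier), and it uses the opening paragraph of Alice's Adventures in Wonderland as the base text, matching the output shown above.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# assumption: the off-the-shelf 'gpt2' checkpoint; swap in your own
# fine-tuned model directory if you have one
auto_tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

# the opening paragraph of Alice's Adventures in Wonderland, used as the
# prompt to be continued by the model
base_text = (
    "Alice was beginning to get very tired of sitting by her sister on "
    "the bank, and of having nothing to do: once or twice she had peeped "
    "into the book her sister was reading, but it had no pictures or "
    "conversations in it, `and what is the use of a book,' thought Alice "
    "`without pictures or conversation?'"
)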


Recap

In this chapter, we took a deep dive into the natural language processing world. We built our own dataset from scratch using two books, Alice's Adventures in Wonderland and The Wonderful Wizard of Oz, and performed sentence and word tokenization. Then, we built a vocabulary and used it with a tokenizer to generate the primary input of our models: sequences of token IDs. Next, we created numerical representations for our tokens, starting with a basic one-hot encoding and working our way to using word embeddings to train a model for classifying the source of a sentence. We also learned about the limitations of classical embeddings, and the need for contextual word embeddings produced by language models like ELMo and BERT. We got to know our Muppet friend in detail: input embeddings, pre-training tasks, and hidden states (the actual embeddings). We leveraged the HuggingFace library to fine-tune a pre-trained model using a Trainer and to deliver predictions using a pipeline. Lastly, we used the famous GPT-2 model to generate text that, hopefully, looks like it was written by Lewis Carroll. This is what we've covered:

• using NLTK to perform sentence tokenization on our text corpora
• converting each book into a CSV file containing one sentence per line
• building a dataset using HuggingFace's Dataset to load the CSV files
• creating new columns in the dataset using map()
• learning about data augmentation for text data
• using Gensim to perform word tokenization
• building a vocabulary and using it to get a token ID for each word
• adding special tokens to the vocabulary, like [UNK] and [PAD]
• loading our own vocabulary into HuggingFace's tokenizer
• understanding the output of a tokenizer: input_ids, token_type_ids, and attention_mask
• using the tokenizer to tokenize two sentences as a single input (see the sketch after this list)
• creating numerical representations for each token, starting with one-hot encoding
• learning about the simplicity and limitations of the bag-of-words (BoW) approach
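As a quick refresher on the tokenizer items above, here is a minimal sketch of tokenizing two sentences as a single input. It uses the off-the-shelf 'bert-base-uncased' tokenizer instead of the tokenizer loaded with our own vocabulary in the chapter, and the two example sentences are arbitrary; the structure of the output is the same either way.

from transformers import AutoTokenizer

# assumption: the pre-trained BERT tokenizer stands in for the tokenizer
# built from our own vocabulary in the chapter
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# two sentences passed as a single input: BERT-style tokenizers return
# input_ids (the token IDs), token_type_ids (0 for the first sentence,
# 1 for the second), and attention_mask (1 for real tokens, 0 for padding)
encoded = bert_tokenizer('Alice was beginning to get very tired.',
                         'There was nothing so very remarkable in that.')
print(encoded['input_ids'])
print(encoded['token_type_ids'])
print(encoded['attention_mask'])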

