Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub

22.02.2024 Views

"Down the Yellow Brick Rabbit Hole"

Where does the phrase in the title come from? On the one hand, if it were "down the rabbit hole," one could guess Alice’s Adventures in Wonderland. On the other hand, if it were "the yellow brick road," one could guess The Wonderful Wizard of Oz. But it is neither (or maybe it is both?). What if, instead of trying to guess it ourselves, we trained a model to classify sentences? This is a book about deep learning, after all :-)

Training models on text data is what natural language processing (NLP) is all about. The whole field is enormous, and we’ll only be scratching the surface of it in this chapter. We’ll start with the most obvious question, "how do you convert text data into numerical data?", and we’ll end up using a pre-trained model (our famous Muppet friend, BERT) to classify sentences.

Building a Dataset

There are many freely available datasets for NLP. The texts are usually already nicely organized into sentences that you can easily feed to a pre-trained model like BERT. Isn’t it awesome? Well, yeah, but…

"But what?"

But the texts you’ll find in the real world are not nicely organized into sentences. You have to organize them yourself.

So, we’ll start our NLP journey by following the steps of Alice and Dorothy, from Alice’s Adventures in Wonderland [158] by Lewis Carroll and The Wonderful Wizard of Oz [159] by L. Frank Baum.

Both texts are freely available at the Oxford Text Archive (OTA) [160] under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license.
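To make the point about organizing raw text into sentences a bit more concrete, here is a minimal sketch of a naive sentence splitter. It is only an illustration, not the approach this chapter necessarily uses: the function name naive_sentences and the sample string are made up for this example, and the splitting rule (break after ".", "!" or "?" followed by whitespace) is an assumption that fails on abbreviations such as "L. Frank Baum".

```python
import re


def naive_sentences(text):
    """Split raw text into sentences using a deliberately simple rule."""
    # Collapse line breaks and repeated whitespace into single spaces
    flat = re.sub(r"\s+", " ", text).strip()
    # Split on whitespace that follows a period, exclamation or question mark
    parts = re.split(r"(?<=[.!?])\s+", flat)
    # Drop empty strings (e.g. when the input text is empty)
    return [p for p in parts if p]


# Made-up sample: one sentence is broken across two lines, as in a raw book file
raw = "This is the first sentence,\nbroken across lines. Is this the second? Yes!"
sentences = naive_sentences(raw)
for sentence in sentences:
    print(sentence)
```

A rule this simple is enough to show the idea, but real books are full of abbreviations, quotations, and dialogue, so a dedicated sentence tokenizer is usually a safer choice.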

Figure 11.1 - Left: "Alice and the Baby Pig" illustration by John Tenniel, from "Alice’s Adventures in Wonderland" (1865). Right: "Dorothy meets the Cowardly Lion" illustration by W. W. Denslow, from "The Wonderful Wizard of Oz" (1900).

The direct links to both texts are alice28-1476.txt [161] (we’re naming it ALICE_URL) and wizoz10-1740.txt [162] (we’re naming it WIZARD_URL). You can download both of them to a local folder using the helper function download_text() (included in data_generation.nlp):

Data Loading

    from data_generation.nlp import download_text

    # ALICE_URL and WIZARD_URL hold the direct links [161] and [162]
    localfolder = 'texts'
    download_text(ALICE_URL, localfolder)
    download_text(WIZARD_URL, localfolder)

If you open these files in a text editor, you’ll see that there is a lot of information at the beginning (and some at the end) that has been added to the original text of the books for legal reasons. We need to remove these additions and keep only the original texts:

Downloading Books

    import os

    # Keep only the lines that belong to each book, dropping the added
    # material at the beginning and at the end of the file
    fname1 = os.path.join(localfolder, 'alice28-1476.txt')
    with open(fname1, 'r') as f:
        alice = ''.join(f.readlines()[104:3704])
    fname2 = os.path.join(localfolder, 'wizoz10-1740.txt')
    with open(fname2, 'r') as f:
        wizard = ''.join(f.readlines()[310:5100])
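Since the same "open, slice, join" pattern is applied to both books, it could be wrapped in a small helper. The sketch below is one possible way to do that. The name read_book_body, the explicit encoding argument (the files' actual encoding is not stated here, so "utf-8" is an assumption), and the tiny stand-in file are all hypothetical and not part of data_generation.nlp.

```python
import os
import tempfile


def read_book_body(fname, start, end, encoding="utf-8"):
    """Return lines[start:end] of a text file, joined into a single string."""
    with open(fname, "r", encoding=encoding) as f:
        lines = f.readlines()
    # Same slicing idea as above: skip added header lines, stop before the footer
    return "".join(lines[start:end])


# Stand-in file: two "legal" header lines, two body lines, one footer line
sample = (
    "LEGAL HEADER\n"
    "MORE HEADER\n"
    "Chapter I\n"
    "Down the Rabbit-Hole\n"
    "LEGAL FOOTER\n"
)
with tempfile.TemporaryDirectory() as folder:
    fname = os.path.join(folder, "sample.txt")
    with open(fname, "w", encoding="utf-8") as f:
        f.write(sample)
    body = read_book_body(fname, 2, 4)

print(body)
```

With the downloaded files in place, the two books could then be loaded as read_book_body(fname1, 104, 3704) and read_book_body(fname2, 310, 5100), using the same line ranges as in the listing above. Printing the first and last few hundred characters of each result is a quick way to check that the slice boundaries really fall where the book text begins and ends.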

