Daniel Voigt Godoy, Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)

elements from the text. But preprocess_string() also includes the following filters:

• strip_short(): It discards any word less than three characters long.

• remove_stopwords(): It discards any word that is considered a stop word (like "the," "but," "then," and so on).

• stem_text(): It modifies words by stemming them; that is, reducing them to a common base form (from "following" to its base "follow," for example).

For a brief introduction to stemming (and the related lemmatization) procedures, please check the "Stemming and lemmatization" [171] section of the Introduction to Information Retrieval [172] book by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press (2008).

We won’t be removing stop words or performing stemming here. Since our goal is to use HF’s pre-trained BERT model, we’ll also use its corresponding pre-trained tokenizer.
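As a quick aside, loading that pre-trained tokenizer from Hugging Face could look like the sketch below; the 'bert-base-uncased' checkpoint is an illustrative choice here, not necessarily the one used later in the chapter:

```python
from transformers import AutoTokenizer

# downloads the tokenizer files on first use
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# WordPiece keeps punctuation as its own token instead of discarding it
tokens = tokenizer.tokenize("I'm following the white rabbit")
print(tokens)
```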

So, let’s use the first four filters only (and make everything lowercase too):

# imports shown here for completeness
from gensim.parsing.preprocessing import (preprocess_string,
                                          strip_tags,
                                          strip_punctuation,
                                          strip_multiple_whitespaces,
                                          strip_numeric)

filters = [lambda x: x.lower(),
           strip_tags,
           strip_punctuation,
           strip_multiple_whitespaces,
           strip_numeric]
preprocess_string(sentence, filters=filters)

Output

['i', 'm', 'following', 'the', 'white', 'rabbit']

Word Tokenization | 897