Daniel Voigt Godoy, Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)


elements from the text. But preprocess_string() also includes the following filters:

• strip_short(): It discards any word less than three characters long.
• remove_stopwords(): It discards any word that is considered a stop word (like "the," "but," "then," and so on).
• stem_text(): It modifies words by stemming them; that is, reducing them to a common base form (from "following" to its base "follow," for example).

For a brief introduction to stemming (and the related lemmatization) procedures, please check the "Stemming and lemmatization" [171] section of the Introduction to Information Retrieval [172] book by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press (2008).

We won't be removing stop words or performing stemming here. Since our goal is to use HF's pre-trained BERT model, we'll also use its corresponding pre-trained tokenizer.

So, let's use the first four filters only (and make everything lowercase too):

filters = [lambda x: x.lower(),
           strip_tags,
           strip_punctuation,
           strip_multiple_whitespaces,
           strip_numeric]
preprocess_string(sentence, filters=filters)

Output

['i', 'm', 'following', 'the', 'white', 'rabbit']

Word Tokenization | 897
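To make the role of the skipped filters concrete, here is a minimal pure-Python sketch of what strip_short() and remove_stopwords() do to a token list. It does not use Gensim's actual implementation, and the tiny stop-word set is purely illustrative:

```python
import re

# Hypothetical, tiny stop-word set for illustration only; Gensim
# ships a much larger one.
STOP_WORDS = {"the", "but", "then", "and"}

def strip_punctuation(text):
    # Replace every punctuation character with a space.
    return re.sub(r"[^\w\s]", " ", text)

def strip_short(tokens, minsize=3):
    # Discard any token shorter than minsize characters.
    return [t for t in tokens if len(t) >= minsize]

def remove_stopwords(tokens):
    # Discard any token found in the stop-word set.
    return [t for t in tokens if t not in STOP_WORDS]

sentence = "I'm following the white rabbit"
tokens = strip_punctuation(sentence.lower()).split()
print(remove_stopwords(strip_short(tokens)))
# ['following', 'white', 'rabbit']
```

Notice that, unlike the output above, "the" is gone (it is a stop word) along with the short leftovers "i" and "m"; that is exactly the extra pruning we chose not to apply before feeding text to BERT's own tokenizer.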


Another option is to use Gensim's simple_preprocess(), which converts the text into a list of lowercase tokens, discarding tokens that are either too short (less than three characters) or too long (more than fifteen characters):

from gensim.utils import simple_preprocess
tokens = simple_preprocess(sentence)
tokens

Output

['following', 'the', 'white', 'rabbit']

"Why are we using Gensim? Can't we use NLTK to perform word tokenization?"

Fair enough. NLTK can be used to tokenize words as well, but Gensim cannot be used to tokenize sentences. Besides, since Gensim has many other interesting tools for building vocabularies, bag-of-words (BoW) models, and Word2Vec models (we'll get to that soon), it makes sense to introduce it as soon as possible.
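If you want to see the mechanics behind simple_preprocess() without installing Gensim, the sketch below approximates its default behavior: lowercase the text, split it into alphabetic tokens, and keep only tokens within the length bounds. This is an approximation for illustration, not Gensim's actual code, and the function name is my own:

```python
import re

def simple_preprocess_sketch(text, min_len=2, max_len=15):
    # Lowercase and extract runs of letters (apostrophes and digits
    # act as separators), then keep tokens within the length bounds.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(simple_preprocess_sketch("I'm following the white rabbit"))
# ['following', 'the', 'white', 'rabbit']
```

The leftovers "i" and "m" (from the contraction "I'm") fall below the minimum length and are discarded, which matches the output we got from Gensim above.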

898 | Chapter 11: Down the Yellow Brick Rabbit Hole
