elements from the text. But preprocess_string() also includes the following filters:

• strip_short(): It discards any word less than three characters long.
• remove_stopwords(): It discards any word that is considered a stop word (like "the," "but," "then," and so on).
• stem_text(): It modifies words by stemming them; that is, reducing them to a common base form (from "following" to its base "follow," for example).

For a brief introduction to stemming (and the related lemmatization) procedures, please check the "Stemming and lemmatization" [171] section of the Introduction to Information Retrieval [172] book by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press (2008).

We won’t be removing stop words or performing stemming here. Since our goal is to use HF’s pre-trained BERT model, we’ll also use its corresponding pre-trained tokenizer.

So, let’s use the first four filters only (and make everything lowercase too):

from gensim.parsing.preprocessing import (
    preprocess_string, strip_tags, strip_punctuation,
    strip_multiple_whitespaces, strip_numeric
)

filters = [lambda x: x.lower(),
           strip_tags,
           strip_punctuation,
           strip_multiple_whitespaces,
           strip_numeric]
preprocess_string(sentence, filters=filters)

Output

['i', 'm', 'following', 'the', 'white', 'rabbit']
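For reference, here is a minimal sketch of what the three skipped filters would do to our sentence; the results in the comments are what Gensim’s defaults should produce:

from gensim.parsing.preprocessing import (
    strip_short, remove_stopwords, stem_text
)

# drops words shorter than three characters ("i" and "m")
strip_short('i m following the white rabbit')
# expected: 'following the white rabbit'

# drops stop words ("the")
remove_stopwords('the white rabbit')
# expected: 'white rabbit'

# reduces words to their base form ("following" -> "follow")
stem_text('following the white rabbit')
# expected: 'follow the white rabbit'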
Another option is to use Gensim’s simple_preprocess(), which converts the text
into a list of lowercase tokens, discarding tokens that are either too short (fewer
than two characters) or too long (more than fifteen characters):
from gensim.utils import simple_preprocess
tokens = simple_preprocess(sentence)
tokens
Output
['following', 'the', 'white', 'rabbit']
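The length cutoffs are just the defaults of simple_preprocess()'s min_len and max_len arguments (2 and 15, respectively). As a minimal sketch, we could lower min_len to keep the single-character tokens as well:

# min_len/max_len control the length cutoffs (defaults: 2 and 15)
simple_preprocess(sentence, min_len=1)
# expected: ['i', 'm', 'following', 'the', 'white', 'rabbit']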
"Why are we using Gensim? Can’t we use NLTK to perform word
tokenization?"
Fair enough. NLTK can be used to tokenize words as well, but Gensim cannot be
used to tokenize sentences. Besides, since Gensim has many other interesting tools
for building vocabularies, bag-of-words (BoW) models, and Word2Vec models (we’ll
get to that soon), it makes sense to introduce it as soon as possible.
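For completeness, here is a minimal sketch of word tokenization in NLTK (assuming the punkt models used earlier for sentence tokenization are already downloaded). Notice that it handles the contraction differently than Gensim does:

from nltk.tokenize import word_tokenize

# splits "I'm" into "I" and "'m" instead of "i" and "m"
word_tokenize(sentence)
# expected: ['I', "'m", 'following', 'the', 'white', 'rabbit']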