Output

{'labels': 1,
 'sentence': 'There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, `Oh dear!',
 'source': 'alice28-1476.txt'}

Now that the labels are in place, we can finally shuffle the dataset and split it into training and test sets:

Data Preparation

shuffled_dataset = dataset.shuffle(seed=42)
split_dataset = shuffled_dataset.train_test_split(test_size=0.2)
split_dataset

Output

DatasetDict({
    train: Dataset({
        features: ['sentence', 'source'],
        num_rows: 3081
    })
    test: Dataset({
        features: ['sentence', 'source'],
        num_rows: 771
    })
})

The splits are actually a dataset dictionary, so you may want to retrieve the actual datasets from it:

Data Preparation

train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

Done! We have two randomly shuffled datasets: training and test.
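As a quick sanity check (our addition, not part of the original listing), we can confirm the sizes of the splits and peek at a row; this assumes the Hugging Face datasets library and the train_dataset and test_dataset variables defined above:

# The 80/20 split should yield 3,081 training and 771 test rows
print(len(train_dataset), len(test_dataset))

# Each row is a plain dictionary of features
print(train_dataset[0])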


Word Tokenization

The naïve word tokenization, as we've already seen, simply splits a sentence into words using white space as a separator:

sentence = "I'm following the white rabbit"

tokens = sentence.split(' ')

tokens

Output

["I'm", 'following', 'the', 'white', 'rabbit']

But, as we've also seen, there are issues with the naïve approach (how to handle contractions, for example). Let's try using Gensim,[170] a popular library for topic modeling, which offers some out-of-the-box tools for performing word tokenization:

from gensim.parsing.preprocessing import *

preprocess_string(sentence)

Output

['follow', 'white', 'rabbit']

"That doesn’t look right … some words are simply gone!"

Welcome to the world of tokenization :-) It turns out, Gensim's preprocess_string() applies many filters by default, namely:

• strip_tags() (for removing HTML-like tags between brackets)
• strip_punctuation()
• strip_multiple_whitespaces()
• strip_numeric()

The filters above are pretty straightforward, and they are used to remove typical elements from the text.
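If you'd rather not lose words to the more aggressive defaults, preprocess_string() also accepts a custom list of filters. Here is a minimal sketch that keeps only lowercasing plus the four filters listed above; this particular selection is our own illustration, not a recommendation from the text:

from gensim.parsing.preprocessing import (
    preprocess_string, strip_tags, strip_punctuation,
    strip_multiple_whitespaces, strip_numeric
)

# Lowercase first, then apply only the four "straightforward" filters
filters = [lambda x: x.lower(), strip_tags, strip_punctuation,
           strip_multiple_whitespaces, strip_numeric]

preprocess_string(sentence, filters=filters)

Output

['i', 'm', 'following', 'the', 'white', 'rabbit']

Since strip_punctuation() replaces punctuation with a space, the contraction "I'm" still gets broken into two tokens, but no word is silently dropped or stemmed.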

