
Our objective is to train the model so that, given a sentence as input, it learns to predict the corresponding sentiment provided in the label. Each sentence is a sequence of words. However, in order to feed it into the model, we have to convert it into a sequence of integers, where each integer points to a word. The mapping between integers and the words in our corpus is called the vocabulary. Thus we need to tokenize the sentences and produce a vocabulary. This is done using the following code:

import tensorflow as tf

# Fit the tokenizer on the corpus to build the vocabulary
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(sentences)
vocab_size = len(tokenizer.word_counts)
print("vocabulary size: {:d}".format(vocab_size))

# Lookup tables between words and their integer indices
word2idx = tokenizer.word_index
idx2word = {v: k for (k, v) in word2idx.items()}

Our vocabulary consists of 5271 unique words. It is possible to make it smaller by dropping words that occur fewer than some threshold number of times; the counts can be found by inspecting the tokenizer.word_counts dictionary. In such cases, we need to add 1 to the vocabulary size for the UNK (unknown) entry, which will be used to replace every word that is not found in the vocabulary.
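
As a rough sketch of how such a cutoff could be applied (the threshold value, the reserved index choices, and the variable names here are illustrative assumptions, not code from this chapter):

# Illustrative sketch, not the book's code: keep only words seen at
# least MIN_COUNT times, and reserve index 1 for the UNK entry
MIN_COUNT = 5  # hypothetical threshold
kept_words = sorted(w for w, c in tokenizer.word_counts.items() if c >= MIN_COUNT)
word2idx_small = {w: i + 2 for i, w in enumerate(kept_words)}  # 0 stays reserved for padding
word2idx_small["UNK"] = 1
vocab_size_small = len(word2idx_small)  # equals len(kept_words) + 1, the extra 1 being UNK

def encode(s):
    # out-of-vocabulary words map to the UNK index
    return [word2idx_small.get(w, word2idx_small["UNK"]) for w in s.split()]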

We also construct lookup dictionaries to convert from word to word index and back. The first dictionary is useful during training, in order to construct integer sequences to feed the network. The second dictionary is used to convert from word index back to word in our prediction code later.
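
For example, a round trip through the two dictionaries could look like the following (the sample sentence is a made-up illustration and assumes every word in it appears in the vocabulary):

# Illustrative round trip between words and integer indices
sentence = "the movie was great"  # hypothetical example sentence
ids = [word2idx[w] for w in sentence.split()]  # words -> integers, used at training time
words = [idx2word[i] for i in ids]             # integers -> words, used at prediction time
print(ids)    # actual indices depend on word frequencies in the corpus
print(words)  # ['the', 'movie', 'was', 'great']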

Each sentence can have a different number of words, but our model requires sequences of integers of identical length for every sentence. To support this requirement, it is common to choose a maximum sequence length that is large enough to accommodate most of the sentences in the training set. Any sentence that is shorter will be padded with zeros, and any sentence that is longer will be truncated. An easy way to choose a good value for the maximum sequence length is to look at the sentence length (in number of words) at different percentile positions:

import numpy as np

# Sentence lengths (in words) at selected percentiles
seq_lengths = np.array([len(s.split()) for s in sentences])
print([(p, np.percentile(seq_lengths, p)) for p in [75, 80, 90, 95, 99, 100]])

This gives us the following output:

[(75, 16.0), (80, 18.0), (90, 22.0), (95, 26.0), (99, 36.0), (100, 71.0)]
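
Based on these numbers, a maximum length of 64 (an illustrative choice; any value comfortably above the 99th percentile of 36 words would behave similarly) keeps all but the very longest sentences intact. A minimal sketch of the padding and truncation step, using Keras's pad_sequences helper:

# Sketch: convert sentences to fixed-length integer sequences
max_seqlen = 64  # assumed choice, well above the 99th percentile
sentences_as_ints = tokenizer.texts_to_sequences(sentences)
sentences_as_ints = tf.keras.preprocessing.sequence.pad_sequences(
    sentences_as_ints, maxlen=max_seqlen)
# By default pad_sequences pads and truncates at the front of each
# sequence; pass padding="post" / truncating="post" to do it at the end.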

