
Let us use the preceding generated arrays to get some information about the corpus that will help us figure out good values for the constants of our LSTM network:

print("Total number of sentences are: {:d} ".format(len(sents)))
print("Sentence distribution min {:d}, max {:d} , mean {:3f}, median {:3f}".format(
    np.min(sent_lens), np.max(sent_lens), np.mean(sent_lens), np.median(sent_lens)))
print("Vocab size (full) {:d}".format(len(word_freqs)))

This gives us the following information about the corpus:

Total number of sentences are: 131545
Sentence distribution min 1, max 2434 , mean 120.525052, median 115.000000
Vocab size (full) 50743

Based on this information, we set the following constants for our LSTM model. We choose our VOCAB_SIZE as 5000; that is, our vocabulary covers the 5,000 most frequent words, which account for over 93% of the words used in the corpus. The remaining words are treated as out of vocabulary (OOV) and replaced with the token UNK. At prediction time, any word that the model hasn't seen before will also be assigned the token UNK. SEQUENCE_LEN is set to approximately twice the median length of sentences in the training set, and indeed, approximately 110,000 of our 131,545 sentences are shorter than this setting. Sentences that are shorter than SEQUENCE_LEN will be padded with a special PAD character, and those that are longer will be truncated to fit the limit:

VOCAB_SIZE = 5000
SEQUENCE_LEN = 50
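If you want to sanity-check the vocabulary cutoff against the corpus, the coverage can be computed directly from word_freqs, which we assume is a collections.Counter (as its use with most_common in the next snippet implies). The following check is only an illustrative sketch, not part of the chapter's code:

# fraction of all word occurrences covered by the VOCAB_SIZE - 2 most frequent words
total_count = sum(word_freqs.values())
top_count = sum(c for _, c in word_freqs.most_common(VOCAB_SIZE - 2))
print("vocabulary coverage: {:.3f}".format(top_count / total_count))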

Since the input to our LSTM will be numeric, we need to build lookup tables that go back and forth between words and word IDs. Because we limit our vocabulary size to 5,000 and have to add the two pseudo-words PAD and UNK, our lookup table contains entries for the 4,998 most frequently occurring words plus PAD and UNK:

word2id = {}
word2id["PAD"] = 0
word2id["UNK"] = 1
for v, (k, _) in enumerate(word_freqs.most_common(VOCAB_SIZE - 2)):
    word2id[k] = v + 2
id2word = {v: k for k, v in word2id.items()}
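With these lookup tables, a tokenized sentence can be converted into a fixed-length list of word IDs, with unknown words mapped to UNK, long sentences truncated, and short ones padded with PAD. The helper below is a minimal sketch of that conversion (the function name and the right-padding choice are our own assumptions; Keras' pad_sequences utility could be used instead):

def sentence_to_ids(words, word2id, seq_len=SEQUENCE_LEN):
    # map each word to its ID, falling back to UNK for out-of-vocabulary words
    ids = [word2id.get(w, word2id["UNK"]) for w in words]
    # truncate sentences longer than seq_len; pad shorter ones (here on the right) with PAD
    ids = ids[:seq_len]
    return ids + [word2id["PAD"]] * (seq_len - len(ids))

print(sentence_to_ids(["the", "market", "rallied", "sharply"], word2id))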

The input to our network is a sequence of words, where each word is represented by a vector. Simplistically, we could just use a one-hot encoding for each word, but that makes the input data very large. So, we encode each word using its 50-dimensional GloVe embedding.
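Assuming the 50-dimensional GloVe vectors are available locally as glove.6B.50d.txt (the file name, variable names, and loading code below are a sketch of the idea, not necessarily how the chapter loads them), an embedding matrix aligned with our word IDs can be built as follows:

import numpy as np  # already imported earlier in the chapter

EMBED_SIZE = 50  # dimensionality of the GloVe vectors we load

# build a (VOCAB_SIZE x EMBED_SIZE) matrix whose row i holds the GloVe vector
# for the word with ID i; PAD, UNK, and words missing from GloVe remain zero vectors
embedding_weights = np.zeros((VOCAB_SIZE, EMBED_SIZE))
with open("glove.6B.50d.txt", "r", encoding="utf8") as fglove:
    for line in fglove:
        parts = line.strip().split()
        word = parts[0]
        if word in word2id:
            embedding_weights[word2id[word]] = np.asarray(parts[1:], dtype="float32")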
