
Recurrent Neural Networks

As can be seen, the maximum sentence length is 71 words, but 99% of the sentences are under 36 words. If we choose a value of 64, for example, we should be able to get away with not having to truncate most of the sentences.

The preceding blocks of code can be run interactively multiple times to choose good values for the vocabulary size and the maximum sequence length respectively. In our example, we have chosen to keep all the words (so vocab_size = 5271), and we have set our max_seqlen to 64.
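For reference, a minimal sketch of how such a length check might look. This is not part of the original listing; it assumes sentences is the list of raw sentence strings built earlier in the chapter:

import numpy as np

# hypothetical check: distribution of sentence lengths (in words)
sentence_lengths = np.array([len(s.split()) for s in sentences])
print("max length: {:d}".format(sentence_lengths.max()))
print("99th percentile: {:.0f}".format(np.percentile(sentence_lengths, 99)))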

Our next step is to create a dataset that our model can consume. We first use our trained tokenizer to convert each sentence from a sequence of words (sentences) to a sequence of integers (sentences_as_ints), where each integer corresponds to the index of the word in tokenizer.word_index. The sequences are then truncated and padded with zeros. The labels are also converted to a NumPy array labels_as_ints, and finally we combine the tensors sentences_as_ints and labels_as_ints to form a TensorFlow dataset:

import numpy as np
import tensorflow as tf

max_seqlen = 64

# create dataset: integer-encode, pad/truncate to max_seqlen, pair with labels
sentences_as_ints = tokenizer.texts_to_sequences(sentences)
sentences_as_ints = tf.keras.preprocessing.sequence.pad_sequences(
    sentences_as_ints, maxlen=max_seqlen)
labels_as_ints = np.array(labels)
dataset = tf.data.Dataset.from_tensor_slices(
    (sentences_as_ints, labels_as_ints))
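As a quick sanity check (a sketch, not part of the original listing, assuming the variables above), we can take one element from the dataset and map the integer IDs back to words using the tokenizer's index_word lookup, skipping the zeros introduced by padding:

# peek at one (sentence, label) pair; index 0 is the padding token
idx2word = tokenizer.index_word
for ints, label in dataset.take(1):
    words = [idx2word[i] for i in ints.numpy() if i != 0]
    print(" ".join(words), "->", label.numpy())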

We want to set aside 1/3 of the dataset for evaluation. Of the remaining data, we will use 10% as an inline validation dataset that the model will use to gauge its own progress during training, and the remainder as the training dataset. Finally, we create batches of 64 sentences for each dataset:

dataset = dataset.shuffle(10000)

# split: 1/3 for test, 10% of the rest for validation, remainder for training
test_size = len(sentences) // 3
val_size = (len(sentences) - test_size) // 10
test_dataset = dataset.take(test_size)
val_dataset = dataset.skip(test_size).take(val_size)
train_dataset = dataset.skip(test_size + val_size)

# batch each split into groups of 64 sentences
batch_size = 64
train_dataset = train_dataset.batch(batch_size)
val_dataset = val_dataset.batch(batch_size)
test_dataset = test_dataset.batch(batch_size)
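To verify the result (a quick sketch under the same assumptions), pulling a single batch from train_dataset should show a tensor of up to 64 padded sentences, each of length max_seqlen, together with the matching labels:

# inspect the shapes of one training batch
for x_batch, y_batch in train_dataset.take(1):
    print(x_batch.shape, y_batch.shape)   # e.g. (64, 64) and (64,)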
