Recurrent Neural Networks

Next, we handle word sequences of different lengths by padding them with zeros at the end, using the pad_sequences() function. Because our strings are fairly short, we do not truncate them; we simply pad to the length of the longest sentence we have (8 words for English and 16 words for French):

# English side: tokenize, convert to integer sequences, pad at the end.
tokenizer_en = tf.keras.preprocessing.text.Tokenizer(
    filters="", lower=False)
tokenizer_en.fit_on_texts(sents_en)
data_en = tokenizer_en.texts_to_sequences(sents_en)
data_en = tf.keras.preprocessing.sequence.pad_sequences(
    data_en, padding="post")

# French side: one tokenizer is fit on both the input and output
# sentences so that they share a single vocabulary.
tokenizer_fr = tf.keras.preprocessing.text.Tokenizer(
    filters="", lower=False)
tokenizer_fr.fit_on_texts(sents_fr_in)
tokenizer_fr.fit_on_texts(sents_fr_out)
data_fr_in = tokenizer_fr.texts_to_sequences(sents_fr_in)
data_fr_in = tf.keras.preprocessing.sequence.pad_sequences(
    data_fr_in, padding="post")
data_fr_out = tokenizer_fr.texts_to_sequences(sents_fr_out)
data_fr_out = tf.keras.preprocessing.sequence.pad_sequences(
    data_fr_out, padding="post")

vocab_size_en = len(tokenizer_en.word_index)
vocab_size_fr = len(tokenizer_fr.word_index)

# Lookup tables: word -> index and index -> word.
word2idx_en = tokenizer_en.word_index
idx2word_en = {v: k for k, v in word2idx_en.items()}
word2idx_fr = tokenizer_fr.word_index
idx2word_fr = {v: k for k, v in word2idx_fr.items()}
print("vocab size (en): {:d}, vocab size (fr): {:d}".format(
    vocab_size_en, vocab_size_fr))

# After padding, the second dimension is the maximum sequence length.
maxlen_en = data_en.shape[1]
maxlen_fr = data_fr_out.shape[1]
print("seqlen (en): {:d}, (fr): {:d}".format(maxlen_en, maxlen_fr))
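To make the effect of post-padding concrete, here is a minimal pure-Python sketch that mimics what the tokenizer and pad_sequences() with padding="post" do in the listing above. It is an illustration under simplifying assumptions, not the Keras implementation: the helpers build_word_index and pad_post and the toy sentences are hypothetical, tokens are split on whitespace only, and indices start at 1 so that 0 is free to act as the padding value.

```python
from collections import Counter


def build_word_index(sentences):
    """Map each whitespace-separated token to an integer index starting at 1.

    More frequent tokens get smaller indices; 0 is left free for padding.
    """
    counts = Counter(tok for sent in sentences for tok in sent.split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=1)}


def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest length."""
    maxlen = max(len(seq) for seq in sequences)
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]


# Hypothetical toy data, not taken from the chapter's corpus.
toy_sents = ["go .", "i am cold .", "i see ."]

word2idx = build_word_index(toy_sents)
idx2word = {v: k for k, v in word2idx.items()}

# Convert each sentence to a list of indices, then pad at the end.
sequences = [[word2idx[tok] for tok in sent.split()] for sent in toy_sents]
padded = pad_post(sequences)

for row in padded:
    print(row)
```

Every row ends up with the length of the longest sentence, and the zeros sit at the end of the shorter rows. Since index 0 never appears in word2idx, the number of distinct integer values in the padded data is one more than len(word2idx).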

Finally, we convert the data to a TensorFlow dataset and then split it into training and test datasets:

batch_size = 64
dataset = tf.data.Dataset.from_tensor_slices(
    (data_en, data_fr_in, data_fr_out))
dataset = dataset.shuffle(10000)
test_size = NUM_SENT_PAIRS // 4
test_dataset = dataset.take(test_size).batch(
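
The idea behind the take-based split can be pictured with ordinary Python lists. The sketch below is only an analogy to the tf.data calls, not the chapter's code: it assumes that the data is shuffled once, that the first quarter is held out for testing, and that the training set is whatever remains. The names batch, split_and_batch, and toy_pairs are hypothetical.

```python
import random


def batch(items, batch_size):
    """Group items into consecutive lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def split_and_batch(examples, batch_size, seed=0):
    """Shuffle once, hold out the first quarter as test data, batch both parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # one fixed, reproducible shuffle
    test_size = len(shuffled) // 4         # one quarter, as in the text
    test = shuffled[:test_size]            # plays the role of take(test_size)
    train = shuffled[test_size:]           # assumed: the remaining examples
    return batch(train, batch_size), batch(test, batch_size)


# Hypothetical stand-ins for (data_en, data_fr_in, data_fr_out) triples.
toy_pairs = [(f"en_{i}", f"fr_in_{i}", f"fr_out_{i}") for i in range(10)]

train_batches, test_batches = split_and_batch(toy_pairs, batch_size=4)
print(len(train_batches), len(test_batches))
```

The split is only meaningful if the shuffle order is the same for both halves. In tf.data, shuffle() reshuffles on every iteration by default, so it is worth checking whether reshuffle_each_iteration=False is needed to keep the training and test examples disjoint across epochs.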
