
The output of the encoder is a single tensor (return_sequences=False) of shape (batch_size, encoder_dim) and represents a batch of context vectors representing the input sentences. The encoder state tensor has the same dimensions. The decoder outputs (the labels) are also a batch of sequences of integers, but the maximum size of a French sentence is 16; therefore, the dimensions are (batch_size, maxlen_fr). The decoder predictions are a batch of probability distributions across all time steps; hence the dimensions are (batch_size, maxlen_fr, vocab_size_fr+1), and the decoder state has the same dimensions as the encoder state, (batch_size, decoder_dim):

encoder input : (64, 8)
encoder output : (64, 1024) state: (64, 1024)
decoder output (logits): (64, 16, 7658) state: (64, 1024)
decoder output (labels): (64, 16)
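To see where these shapes come from, the following minimal sketch (not the book's Encoder class; the embedding dimension here is an arbitrary assumption) shows that a GRU layer with return_sequences=False and return_state=True returns an output and a state with the same (batch_size, encoder_dim) shape:

import tensorflow as tf

batch_size, maxlen_en, embedding_dim, encoder_dim = 64, 8, 256, 1024
# a stand-in for a batch of embedded English sentences
dummy_embedded = tf.random.uniform((batch_size, maxlen_en, embedding_dim))
gru = tf.keras.layers.GRU(encoder_dim,
    return_sequences=False, return_state=True)
output, state = gru(dummy_embedded)
print(output.shape, state.shape)    # (64, 1024) (64, 1024)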

Next we define the loss function. Because we padded our sentences, we don't want to bias our results by considering equality of pad words between the labels and the predictions. Our loss function masks the predictions with the labels, so padded positions on the labels are also removed from the predictions, and we only compute our loss using the non-zero elements of both the labels and the predictions. This is done as follows:

import tensorflow as tf

def loss_fn(ytrue, ypred):
    # cross-entropy computed directly on the decoder logits
    scce = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True)
    # mask out PAD positions (label id 0) so they do not contribute to the loss
    mask = tf.math.logical_not(tf.math.equal(ytrue, 0))
    mask = tf.cast(mask, dtype=tf.int64)
    loss = scce(ytrue, ypred, sample_weight=mask)
    return loss
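As a quick sanity check (hypothetical values, not from the book), we can call loss_fn() on a dummy batch; label positions equal to 0 receive a sample weight of 0 and therefore contribute nothing to the loss:

labels = tf.constant([[12, 7, 3, 0, 0]], dtype=tf.int64)  # one sentence, two PAD positions
logits = tf.random.uniform((1, 5, 7658))                  # dummy decoder logits
print(loss_fn(labels, logits))                            # PAD positions add zero to the loss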

Because the seq2seq model is not easy to package into a simple Keras model, we have to handle the training loop manually as well. Our train_step() function handles the flow of data through the network, computes the loss at each step, applies the gradients of the loss back to the trainable weights, and returns the loss.
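A minimal sketch of what such a train_step() might look like is shown below. The encoder, decoder, and optimizer objects and their call signatures are assumed to match the ones built earlier in the chapter, so treat this as an outline rather than the exact listing:

@tf.function
def train_step(encoder_in, decoder_in, decoder_out, encoder_state):
    with tf.GradientTape() as tape:
        # encode the English sentence into a context vector / state
        encoder_out, encoder_state = encoder(encoder_in, encoder_state)
        decoder_state = encoder_state
        # teacher forcing: the full decoder input is fed in one call
        decoder_pred, decoder_state = decoder(decoder_in, decoder_state)
        loss = loss_fn(decoder_out, decoder_pred)
    variables = (encoder.trainable_variables +
        decoder.trainable_variables)
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss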

Notice that the training code is not quite the same as what was described in our discussion of the seq2seq model earlier. Here it appears that the entire decoder_input is fed in one go into the decoder to produce the output offset by one time step, whereas in the discussion, we said that this happens sequentially, where the token generated in the previous time step is used as the input to the next time step.
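The one-time-step offset itself comes from how the decoder input and label sequences are paired up during preprocessing; conceptually (using made-up tokens rather than the chapter's dataset), the pairing looks like this:

# the decoder input starts with a BOS marker and the label ends with EOS,
# so predicting position t of the label from positions 0..t of the input
# is exactly a one-time-step shift (teacher forcing)
sentence      = ["il", "fait", "froid"]
decoder_input = ["BOS"] + sentence       # fed to the decoder in one go
decoder_label = sentence + ["EOS"]       # what the decoder is trained to emit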

