

3. Because of the parallelism referred to in the previous point, the position of each element in the sequence is no longer implicit in the order of computation. To provide this positional information, a positional encoding layer is added to the transformer network (a minimal sketch of one such encoding follows this list).
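The text does not specify the form of this encoding, but the reference implementation [30] uses fixed sine and cosine functions of the token position. The following NumPy sketch shows that scheme under those assumptions; the function name and the max_len/d_model parameters are illustrative only.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # Returns a (max_len, d_model) matrix that is added to the input embeddings.
    positions = np.arange(max_len)[:, np.newaxis]              # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                   # (1, d_model)
    # Each pair of dimensions shares one frequency: 1 / 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])    # odd dimensions use cosine
    return pe

# Usage: embeddings of shape (seq_len, 512) become
# embeddings + sinusoidal_positional_encoding(seq_len, 512)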

Let us walk through how data flows through the transformer. The Encoder side consists of an Embedding and a Positional Encoding layer, followed by some number (6 in the reference implementation [30]) of transformer blocks. Each transformer block on the Encoder side consists of a Multi-head attention layer and a position-wise Feed-Forward Network (FFN).
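To make this data flow concrete, here is a minimal structural sketch in NumPy-style Python. The sub-layer functions passed in are stand-ins (identity functions in the smoke test below) for the multi-head attention and position-wise FFN layers described in the rest of this section; only the overall wiring is taken from the text.

import numpy as np

NUM_BLOCKS = 6      # transformer blocks on the Encoder side in the reference [30]
D_MODEL = 512       # embedding size in the reference implementation [30]

def encoder(embeddings, positional_encoding, blocks):
    # embeddings: (seq_len, D_MODEL) output of the Embedding layer.
    # positional_encoding: (seq_len, D_MODEL) matrix added to the embeddings.
    # blocks: list of (multi_head_attention_fn, position_wise_ffn_fn) pairs.
    x = embeddings + positional_encoding
    for attention_fn, ffn_fn in blocks:
        x = ffn_fn(attention_fn(x))    # one transformer block: attention, then FFN
    return x                           # output of the last block: the context for the decoder

# Structural smoke test with identity functions standing in for the real sub-layers.
seq = np.zeros((10, D_MODEL))
context = encoder(seq, np.zeros((10, D_MODEL)),
                  [(lambda t: t, lambda t: t)] * NUM_BLOCKS)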

We have already briefly seen that self-attention is the process of attending to parts of the same sequence. Thus, when processing a sentence, we might want to know which other words are most aligned with the current word. The multi-head attention layer consists of multiple (8 in the reference implementation [30]) parallel self-attention layers. Self-attention is carried out by constructing three vectors Q (query), K (key), and V (value) out of the input embedding. These vectors are created by multiplying the input embedding with three trainable weight matrices W_Q, W_K, and W_V. The output vector z is created by combining K, Q, and V at each self-attention layer using the following formula, where d_k refers to the dimension of the K, Q, and V vectors (64 in the reference implementation [30]):

z = softmax(QK^T / √d_k) V
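As a quick illustration of this formula, the NumPy sketch below computes z for a single self-attention layer. The weight matrices would be learned during training; here they are random placeholders, and the sizes (d_model = 512, d_k = 64) are the reference values from [30].

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k). Returns z: (seq_len, d_k).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # (seq_len, seq_len) alignment scores
    return softmax(scores) @ v          # each output row is a weighted sum of the value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 512))          # 10 tokens, 512-dimensional embeddings
w_q, w_k, w_v = (rng.normal(size=(512, 64)) for _ in range(3))
z = self_attention(x, w_q, w_k, w_v)    # shape (10, 64)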

The multi-head attention layer will create multiple values for z (based on the multiple trainable weight matrices W_Q, W_K, and W_V at each self-attention layer), and then concatenate them for input into the position-wise FFN layer.
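A minimal sketch of that concatenation, reusing the scaled dot-product computation from the previous snippet; note that the reference implementation [30] also applies an output projection matrix to the concatenated heads, a detail omitted here because it is not covered in the text above.

import numpy as np

NUM_HEADS = 8                 # parallel self-attention layers in the reference [30]
D_MODEL, D_K = 512, 64

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads):
    # heads: one (w_q, w_k, w_v) tuple of trainable matrices per self-attention layer.
    zs = []
    for w_q, w_k, w_v in heads:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        zs.append(softmax(q @ k.T / np.sqrt(D_K)) @ v)    # (seq_len, D_K) per head
    return np.concatenate(zs, axis=-1)                    # (seq_len, NUM_HEADS * D_K) = (seq_len, 512)

rng = np.random.default_rng(1)
heads = [tuple(rng.normal(size=(D_MODEL, D_K)) for _ in range(3)) for _ in range(NUM_HEADS)]
z = multi_head_attention(rng.normal(size=(10, D_MODEL)), heads)   # shape (10, 512)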

The input to the position-wise FFN consists of the embeddings for the different elements in the sequence (or words in the sentence), attended to via self-attention in the multi-head attention layer. Each token is represented internally by a fixed-length embedding vector (512 in the reference implementation [30]). Each vector is run through the FFN in parallel. The output of the FFN is the input to the Multi-head attention layer in the next transformer block. If this is the last transformer block in the encoder, then the output is the context vector that is passed to the decoder.
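A rough sketch of the position-wise FFN is shown below. In the reference implementation [30] it is a two-layer network with a ReLU in between and a hidden size of 2048; the random weights here are placeholders for the trained parameters. Because the same weights are applied to every position independently, all positions can be processed in parallel, as the text describes.

import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    # x: (seq_len, 512); the same weights are applied to every position (row) of x.
    hidden = np.maximum(0.0, x @ w1 + b1)    # (seq_len, 2048), ReLU activation
    return hidden @ w2 + b2                  # projected back to (seq_len, 512)

rng = np.random.default_rng(2)
w1, b1 = 0.02 * rng.normal(size=(512, 2048)), np.zeros(2048)
w2, b2 = 0.02 * rng.normal(size=(2048, 512)), np.zeros(512)
x = rng.normal(size=(10, 512))               # 10 attended token vectors
y = position_wise_ffn(x, w1, b1, w2, b2)     # shape (10, 512)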

In addition to the signal from the previous layer, both the multi-head attention layer and the position-wise FFN layer pass a residual signal from their input to their output. The sub-layer output and the residual input are then passed through a layer-normalization [32] step, shown in Figure 7 as the Add and Norm layer.
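A minimal sketch of that Add and Norm step: the sub-layer output and the residual input are summed, then layer-normalized [32] over the feature dimension. The epsilon and the scalar gain/bias used here are placeholders; in practice the gain and bias are learned per-feature vectors.

import numpy as np

def add_and_norm(residual, sublayer_out, gamma=1.0, beta=0.0, eps=1e-6):
    # Add the residual connection, then apply layer normalization [32].
    x = residual + sublayer_out                       # the "Add"
    mean = x.mean(axis=-1, keepdims=True)             # statistics per position
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta    # the "Norm"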

