

"… with great depth comes great complexity …"Peter Parker…and, along with that, overfitting.But we also know that dropout works pretty well as a regularizer, so we can throwthat in the mix as well."How are we adding normalization, residual connections, and dropoutto our model?"We’ll wrap each and every "sub-layer" with them! Cool, right? But that brings upanother question: How to wrap them? It turns out, we can wrap a "sub-layer" in oneof two ways: norm-last or norm-first.Figure 10.7 - "Sub-Layers"—norm-last vs norm-firstThe norm-last wrapper follows the "Attention Is All you Need" [149] paper to theletter:"We employ a residual connection around each of the two sub-layers, followed bylayer normalization. That is, the output of each sub-layer isLayerNorm(x+Sublayer(x)), where Sublayer(x) is the function implemented bythe sub-layer itself."The norm-first wrapper follows the "sub-layer" implementation described in "TheAnnotated Transformer," [150] which explicitly places norm first as opposed to lastWrapping "Sub-Layers" | 809


for the sake of code simplicity.

Let’s turn the diagrams above into equations:

Equation 10.3 - Outputs—norm-first vs norm-last
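Written out in the same notation as the quote above (a reconstruction of the two expressions; the book’s exact typesetting may differ slightly), the two wrappers compute:

norm-last:  outputs = LayerNorm(x + SubLayer(x))
norm-first: outputs = x + SubLayer(LayerNorm(x))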

The equations are almost the same, except for the fact that the norm-last wrapper (from "Attention Is All You Need") normalizes the outputs and the norm-first wrapper (from "The Annotated Transformer") normalizes the inputs. That’s a small, yet important, difference.

"Why?"

If you’re using positional encoding, you want to normalize your inputs, so norm-first is more convenient.

"What about the outputs?"

We’ll normalize the final outputs; that is, the output of the last "layer" (which is the output of its last, not normalized, "sub-layer"). Any intermediate output is simply the input of the subsequent "sub-layer," and each "sub-layer" normalizes its own inputs.
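To make that flow explicit, here is an illustrative sketch (not the book’s own equation) of two chained norm-first "sub-layers," S1 and S2, followed by the one extra normalization of the final output:

out1  = x + S1(LayerNorm1(x))
out2  = out1 + S2(LayerNorm2(out1))
final = LayerNormFinal(out2)

Notice that out1 and out2 are never normalized themselves; only the very last output gets that one additional normalization.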

There is another important difference that will be discussed in the next section.

From now on, we’re sticking with norm-first, thus normalizing the inputs:

Equation 10.4 - Outputs—norm-first
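A minimal PyTorch sketch of such a norm-first wrapper could look like the following (the class and argument names are illustrative, not necessarily those of the book’s own implementation; dropout is applied to the "sub-layer" output before the residual addition, as in the original paper):

import torch.nn as nn

class SubLayerWrapper(nn.Module):
    # Norm-first wrapping: normalize the inputs, run the "sub-layer",
    # apply dropout, then add the residual connection
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # "sublayer" is any callable taking and returning tensors of shape
        # (N, L, d_model), e.g. self-attention or the feed-forward network
        return x + self.drop(sublayer(self.norm(x)))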

By wrapping each and every "sub-layer" inside both encoder "layers" and decoder "layers," we’ll arrive at the desired Transformer architecture.

Let’s start with the…

