
Wide vs Narrow Attention

This mechanism is known as wide attention: Each attention head gets the full hidden state and produces a context vector of the same size. This is totally fine if the number of hidden dimensions is small.

For a larger number of dimensions, though, each attention head will get a chunk of the affine transformation of the hidden state to work with. This is a detail of utmost importance: It is not a chunk of the original hidden state, but of its transformation. For example, say there are 512 dimensions in the hidden state and we'd like to use eight attention heads: Each attention head would work with a chunk of 64 dimensions only. This mechanism is known as narrow attention, and we'll get back to it in the next chapter.

"Which one should I use?"

On the one hand, wide attention will likely yield better models compared to using narrow attention on the same number of dimensions. On the other hand, narrow attention makes it possible to use more dimensions, which may improve the quality of the model as well. It's hard to tell you which one is best overall, but I can tell you that state-of-the-art large Transformer models use narrow attention. In our much simpler and smaller model, though, we're sticking with wide attention.

The multi-headed attention mechanism is usually depicted like this:

Figure 9.22 - Multi-headed attention mechanism
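By the way, the chunking arithmetic behind narrow attention is easy to see in code. The sketch below is only an illustration with made-up names (proj, hidden); the actual narrow-attention implementation appears in the next chapter:

import torch
import torch.nn as nn

d_model, n_heads = 512, 8
head_dim = d_model // n_heads            # 64 dimensions per head

proj = nn.Linear(d_model, d_model)       # affine transformation of the hidden state
hidden = torch.randn(16, 10, d_model)    # N, L, D

transformed = proj(hidden)               # heads get chunks of THIS, not of `hidden`
chunks = transformed.view(16, 10, n_heads, head_dim)
print(chunks.shape)                      # torch.Size([16, 10, 8, 64])

Notice that the chunks are taken from the output of the linear layer, not from the hidden state itself; that's exactly the detail highlighted above.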

The code for the multi-headed attention mechanism looks like this:

Multi-Headed Attention Mechanism

class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, d_model,
                 input_dim=None, proj_values=True):
        super().__init__()
        self.linear_out = nn.Linear(n_heads * d_model, d_model)
        self.attn_heads = nn.ModuleList(
            [Attention(d_model,
                       input_dim=input_dim,
                       proj_values=proj_values)
             for _ in range(n_heads)]
        )

    def init_keys(self, key):
        for attn in self.attn_heads:
            attn.init_keys(key)

    @property
    def alphas(self):
        # Shape: n_heads, N, 1, L (source)
        return torch.stack(
            [attn.alphas for attn in self.attn_heads], dim=0
        )

    def output_function(self, contexts):
        # N, 1, n_heads * D
        concatenated = torch.cat(contexts, axis=-1)
        # Linear transf. to go back to original dimension
        out = self.linear_out(concatenated)  # N, 1, D
        return out

    def forward(self, query, mask=None):
        contexts = [attn(query, mask=mask)
                    for attn in self.attn_heads]
        out = self.output_function(contexts)
        return out

It is pretty much a list of attention mechanisms with an extra linear layer on top. But it is not any list; it is a special list: an nn.ModuleList.
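In case you're wondering why using a special list matters at all, here is a quick, self-contained sketch (PlainList and ProperList are made-up names for illustration): submodules kept in a regular Python list are invisible to PyTorch, while an nn.ModuleList registers them properly.

import torch.nn as nn

# Submodules in a plain Python list are NOT registered: their parameters
# won't show up in .parameters(), won't be moved by .to(device), and
# won't be included in the state dict.
class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

# Wrapping the same list in nn.ModuleList registers every submodule.
class ProperList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

print(len(list(PlainList().parameters())))   # 0
print(len(list(ProperList().parameters())))  # 6 (weight and bias for each of the three layers)

That's why the attention heads above are wrapped in an nn.ModuleList: otherwise their parameters would never make it to the optimizer.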

