
So far, our models used these embeddings (or a bag of them) as their only input. But BERT, being a Transformer encoder, also needs positional information. In Chapter 9 we used positional encoding, but BERT uses position embeddings instead.

"What’s the difference between encoding and embedding?"

While the position encoding we used in the past had fixed values for each position, the position embeddings are learned by the model (like any other embedding layer). The number of entries in this lookup table is given by the maximum length of the sequence.
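
If you'd like to see what a learned position embedding boils down to, here is a minimal sketch (not BERT's actual code, just an illustration using the same sizes as BERT base, 512 positions and 768 dimensions):

import torch
import torch.nn as nn

# A minimal sketch of a learned position embedding: it is just a regular
# embedding layer indexed by each token's position in the sequence
max_len, hidden_size = 512, 768  # BERT base: up to 512 positions, 768 dimensions
position_embeddings = nn.Embedding(max_len, hidden_size)

seq_len = 6                                        # say, a six-token input
position_ids = torch.arange(seq_len).unsqueeze(0)  # tensor([[0, 1, 2, 3, 4, 5]])
pos_emb = position_embeddings(position_ids)        # shape: (1, 6, 768)

Since it is a regular embedding layer, its weights get updated during training, unlike the fixed sinusoidal values we computed before.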

And there is more! BERT also adds a third embedding, namely, segment embedding, which is a position embedding at the sentence level (since inputs may have either one or two sentences). That's what the token_type_ids produced by the tokenizer are good for: They work as a sentence index for each token.
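
To see the sentence index in action, here is a quick sketch (assuming the bert-base-uncased tokenizer; the two short sentences are made up for illustration):

from transformers import AutoTokenizer

# Tokenize a pair of sentences and inspect the sentence index of each token
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer('The cat sat.', 'It was tired.')
print(tokens['token_type_ids'])
# zeros for every token of the first sentence (including [CLS] and its [SEP]),
# ones for every token of the second sentence (and its final [SEP])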

Figure 11.25 - BERT’s input embeddings

Talking (or writing) is cheap, though, so let’s take a look under BERT’s hood:

input_embeddings = bert_model.embeddings

Output

BertEmbeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
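
To make the figure concrete, here is a simplified sketch of what this module does when called: the three embeddings are looked up, summed, and the result goes through layer normalization and dropout (the token IDs below are only illustrative):

import torch

# Reproduce the embedding sum by hand using the submodules printed above
input_ids = torch.tensor([[101, 7592, 2088, 102]])           # illustrative IDs
token_type_ids = torch.zeros_like(input_ids)                 # single sentence: all zeros
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)  # tensor([[0, 1, 2, 3]])

emb = input_embeddings  # the BertEmbeddings module we just retrieved
summed = (emb.word_embeddings(input_ids)
          + emb.position_embeddings(position_ids)
          + emb.token_type_embeddings(token_type_ids))
manual_output = emb.dropout(emb.LayerNorm(summed))           # shape: (1, 4, 768)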

