Deep Learning with PyTorch Step-by-Step: A Beginner's Guide, by Daniel Voigt Godoy

Every word piece is prefixed with ## to indicate that it does not stand on its own as a word.

Given enough word pieces in a vocabulary, the tokenizer will be able to represent every unknown word as a concatenation of word pieces. Problem solved! That's what BERT's pre-trained tokenizer does.

For more details on the WordPiece tokenizer, as well as other sub-word tokenizers like Byte-Pair Encoding (BPE) and SentencePiece, please check HuggingFace's "Summary of the Tokenizers" [207] and Cathal Horan's great post "Tokenizers: How machines read" [208] on FloydHub.

Let's tokenize a pair of sentences using BERT's WordPiece tokenizer:

sentence1 = 'Alice is inexplicably following the white rabbit'
sentence2 = 'Follow the white rabbit, Neo'
tokens = bert_tokenizer(sentence1, sentence2, return_tensors='pt')
tokens

Output

{'input_ids': tensor([[  101,  5650,  2003,  1999, 10288, 24759,  5555,  6321,
          2206,  1996,  2317, 10442,   102,  3582,  1996,  2317,
         10442,  1010,  9253,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          1, 1, 1, 1, 1, 1, 1]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1]])}

Notice that, since there are two sentences, the token_type_ids have two distinct values (zero and one) that work as a sentence index, indicating which sentence each token belongs to. Hold this thought, because we're using this information in the next section.
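The bert_tokenizer object used above is assumed to have been created earlier in the chapter. If you are running this snippet on its own, a minimal sketch of how to load it, and of how WordPiece breaks an out-of-vocabulary word into pieces, could look like the one below (the class and checkpoint names are the standard HuggingFace ones, not necessarily the exact code used earlier in the book):

from transformers import BertTokenizer

# Load BERT's pre-trained WordPiece tokenizer (uncased variant)
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# A word missing from the vocabulary is represented as a concatenation of word pieces
print(bert_tokenizer.tokenize('inexplicably'))

Output

['in', '##ex', '##pl', '##ica', '##bly']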

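The attention_mask above came out as all ones because nothing was padded. A quick sketch (not from the book) of tokenizing the same two sentences as a batch of independent sequences, with padding turned on, shows where the mask earns its keep:

# Treating the two sentences as a batch of separate sequences;
# padding=True pads the shorter one to the length of the longer one
batch = bert_tokenizer([sentence1, sentence2], padding=True, return_tensors='pt')

# Padded positions get a zero in the attention mask, so the model ignores them
print(batch['attention_mask'])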
To actually see the word pieces, it's easier to convert the input IDs back into tokens:

print(bert_tokenizer.convert_ids_to_tokens(tokens['input_ids'][0]))

Output

['[CLS]', 'alice', 'is', 'in', '##ex', '##pl', '##ica', '##bly',
 'following', 'the', 'white', 'rabbit', '[SEP]', 'follow', 'the',
 'white', 'rabbit', ',', 'neo', '[SEP]']

There it is: "inexplicably" got disassembled into its word pieces, the separator token [SEP] got inserted between the two sentences (and at the end as well), and there is a classifier token [CLS] at the start.

AutoTokenizer

If you want to quickly try different tokenizers without having to import their corresponding classes, you can use HuggingFace's AutoTokenizer instead:

from transformers import AutoTokenizer

auto_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(auto_tokenizer.__class__)

Output

<class 'transformers.tokenization_bert.BertTokenizer'>

As you can see, it infers the correct tokenizer class based on the name of the model you're loading, e.g., bert-base-uncased.
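To illustrate the point, here is a small sketch (not from the book) loading a tokenizer for a different checkpoint. The example below uses the 'gpt2' checkpoint; the exact class you get back (e.g., GPT2Tokenizer or GPT2TokenizerFast) depends on the version of the transformers library you have installed:

from transformers import AutoTokenizer

# Same call, different checkpoint: AutoTokenizer picks the matching tokenizer class
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')
print(gpt2_tokenizer.__class__)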
Input Embeddings

Once the sentences are tokenized, we can use their tokens' IDs to look up the corresponding embeddings as usual. These are the word / token embeddings.
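Since the token IDs are just indices into the vocabulary, the lookup itself is an ordinary embedding lookup. Here is a minimal sketch (not from the book) using a randomly initialized nn.Embedding layer, assuming the vocabulary size of bert-base-uncased (30,522 tokens) and BERT base's embedding dimension (768); the real pre-trained embeddings, of course, ship with the model itself:

import torch
import torch.nn as nn

# One 768-dimensional vector for each of the 30,522 tokens in BERT's vocabulary
torch.manual_seed(42)
word_embedding = nn.Embedding(num_embeddings=30522, embedding_dim=768)

# Look up one embedding per token ID in our pair of sentences
token_embeddings = word_embedding(tokens['input_ids'])
print(token_embeddings.shape)

Output

torch.Size([1, 20, 768])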
