Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide


That’s a fairly simple model, right? If our vocabulary had only five words ("the," "small," "is," "barking," and "dog"), we could try to represent each word with an embedding of three dimensions. Let’s create a dummy model to inspect its (randomly initialized) embeddings:

torch.manual_seed(42)
dummy_cbow = CBOW(vocab_size=5, embedding_size=3)
dummy_cbow.embedding.state_dict()

Output

OrderedDict([('weight', tensor([[ 0.3367,  0.1288,  0.2345],
                                [ 0.2303, -1.1229, -0.1863],
                                [ 2.2082, -0.6380,  0.4617],
                                [ 0.2674,  0.5349,  0.8094],
                                [ 1.1103, -1.6898, -0.9890]]))])

Figure 11.12 - Word embeddings

As depicted in the figure above, PyTorch’s nn.Embedding layer is a large lookup table. It may be randomly initialized given the size of the vocabulary (num_embeddings) and the number of dimensions (embedding_dim). To actually retrieve the values, we need to call the embedding layer with a list of token indices, and it will return the corresponding rows of the table.

For example, we can retrieve the embeddings for the tokens "is" and "barking" using their corresponding indices (two and three):

# tokens: ['is', 'barking']
dummy_cbow.embedding(torch.as_tensor([2, 3]))


Output

tensor([[ 2.2082, -0.6380,  0.4617],
        [ 0.2674,  0.5349,  0.8094]], grad_fn=<EmbeddingBackward>)
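The CBOW class instantiated above is defined earlier in the chapter and is not reproduced in this excerpt. A minimal sketch of a model along those lines, pairing an embedding layer with a single linear layer (the exact architecture here is an assumption, not the author’s verbatim code), could look like this:

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        super().__init__()
        # lookup table: one row of embedding_size values per token in the vocabulary
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        # maps the averaged embedding back to one logit per word in the vocabulary
        self.linear = nn.Linear(embedding_size, vocab_size)

    def forward(self, X):
        # X: (batch_size, n_context_words) tensor of token indices
        embeddings = self.embedding(X)   # (batch, n_context, embedding_size)
        bow = embeddings.mean(dim=1)     # continuous bag-of-words: average over the context
        logits = self.linear(bow)        # (batch, vocab_size)
        return logits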

That’s why the main job of the tokenizer is to transform a sentence into a list of token IDs. That list is used as an input to the embedding layer, and from then on, the tokens are represented by dense vectors.
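As a rough sketch of that pipeline, reusing the dummy_cbow model created above (the word2idx mapping and the sentence below are made up for illustration), we could map words to IDs and look up their dense vectors:

# hypothetical word-to-index mapping, in the same order as the tiny vocabulary
word2idx = {'the': 0, 'small': 1, 'is': 2, 'barking': 3, 'dog': 4}

sentence = 'the small dog is barking'
token_ids = [word2idx[w] for w in sentence.split()]  # [0, 1, 4, 2, 3]

# the embedding layer returns one dense vector per token ID
dense_vectors = dummy_cbow.embedding(torch.as_tensor(token_ids))
dense_vectors.shape  # torch.Size([5, 3])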

"How do you choose the number of dimensions?"

It is commonplace to use between 50 and 300 dimensions for word embeddings, but some embeddings may be as large as 3,000 dimensions. That may look like a lot but, compared to one-hot-encoded vectors, it is a bargain! The vocabulary of our tiny dataset would already require more than 3,000 dimensions if it were one-hot encoded.
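To make the comparison concrete, here is a rough sketch contrasting the two representations (the 3,000-word vocabulary and 300 dimensions below are illustrative figures matching the paragraph above, not the dataset’s actual sizes):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_size = 3000, 300  # illustrative sizes

# one-hot: each word becomes a 3,000-dimensional vector that is almost all zeros
one_hot = F.one_hot(torch.tensor([2]), num_classes=vocab_size).float()
one_hot.shape  # torch.Size([1, 3000])

# dense embedding: the same word becomes a 300-dimensional vector
embedding = nn.Embedding(vocab_size, embedding_size)
dense = embedding(torch.tensor([2]))
dense.shape    # torch.Size([1, 300])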

In our former example, "dog" was the central word and the other four words were the context words:

tiny_vocab = ['the', 'small', 'is', 'barking', 'dog']
context_words = ['the', 'small', 'is', 'barking']
target_words = ['dog']

Now, let’s pretend that we tokenized the words and got their corresponding indices:

batch_context = torch.as_tensor([[0, 1, 2, 3]]).long()
batch_target = torch.as_tensor([4]).long()

In its very first training step, the model would compute the continuous bag-of-words for the inputs by averaging the corresponding embeddings.
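A minimal sketch of that step, reusing the dummy_cbow model and the batch_context tensor defined above (and assuming the embedding attribute used in the earlier snippets), could look like this:

# look up the embeddings of the four context words: shape (1, 4, 3)
context_embeddings = dummy_cbow.embedding(batch_context)

# average over the context dimension to get the continuous bag-of-words: shape (1, 3)
cbow_features = context_embeddings.mean(dim=1)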

