Chapter 7

This usage pattern (instantiating a model and tokenizer from a pretrained
checkpoint, optionally fine-tuning it on a comparatively small labeled dataset,
and then using it for predictions) is fairly typical and applies to the other
fine-tuning classes as well. The Transformers library provides a standardized
API for working with multiple Transformer models and performing standard
fine-tuning tasks on them. The code described above can be found in the file
bert_paraphrase.py in the code accompanying this chapter.
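The three steps of this pattern can be sketched roughly as follows. This is a minimal illustration and not the contents of bert_paraphrase.py: the checkpoint name "bert-base-cased", the helper names, the learning rate, and the label convention (1 = paraphrase, 0 = not a paraphrase) are all assumptions made for the example. It also assumes a Transformers release that still ships the TensorFlow model classes.

```python
def label_from_logits(logits):
    """Return the index of the largest logit (assumed: 1 = paraphrase)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def load_model_and_tokenizer(name="bert-base-cased"):
    # Heavy dependencies are imported lazily so that this module can be
    # imported without TensorFlow or Transformers installed.
    from transformers import BertTokenizer, TFBertForSequenceClassification

    # Step 1: instantiate tokenizer and model from a pretrained checkpoint.
    # The classification head on top of BERT is freshly initialized.
    tokenizer = BertTokenizer.from_pretrained(name)
    model = TFBertForSequenceClassification.from_pretrained(name, num_labels=2)
    return tokenizer, model


def fine_tune(model, train_dataset, epochs=2):
    import tensorflow as tf

    # Step 2 (optional): fine-tune on a small labeled dataset.
    # train_dataset is assumed to be a batched tf.data.Dataset that yields
    # (tokenized_inputs, integer_label) pairs.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(train_dataset, epochs=epochs)
    return model


def predict_paraphrase(tokenizer, model, sentence_1, sentence_2):
    # Step 3: encode the sentence pair and classify it.
    inputs = tokenizer(sentence_1, sentence_2,
                       return_tensors="tf", truncation=True)
    logits = model(dict(inputs))[0]  # first output holds the logits, shape (1, 2)
    return label_from_logits(logits[0].numpy().tolist())
```

Without the fine-tuning step, the classification head is randomly initialized, so predictions from a freshly loaded model are not meaningful. Swapping in a different model family would, under the same assumptions, mostly be a matter of changing the two class names and the checkpoint name.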


Summary

In this chapter, we learned about the concepts behind distributional
representations of words and their various implementations, starting from
static word embeddings such as Word2Vec and GloVe.

We then looked at improvements to the basic idea, such as subword embeddings,
sentence embeddings that capture the context of a word within its sentence,
and the use of entire language models to generate embeddings.

Although language model-based embeddings now achieve state-of-the-art results,
there are still plenty of applications where more traditional approaches
yield very good results, so it is important to know them all and understand the


We also looked briefly at other interesting uses of embeddings outside the
realm of natural language, where the distributional properties of other kinds
of sequences are leveraged to make predictions in domains such as information
retrieval and recommendation systems.

You are now ready to use embeddings not only in your text-based neural
networks, which we will look at in greater depth in the next chapter, but also
in other areas of machine learning.


References

1. Mikolov, T., et al. (2013, Sep 7). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781v3 [cs.CL].

2. Mikolov, T., et al. (2013, Sep 17). Exploiting Similarities among Languages for Machine Translation. arXiv:1309.4168v1 [cs.CL].

3. Mikolov, T., et al. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26 (NIPS 2013).

