We can also compute the distance between two words in the embedding space using
the distance() function. This is really just 1 - similarity():
print("distance(singapore, malaysia) = {:.3f}".format(
word_vectors.distance("singapore", "malaysia")
))
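As a quick sanity check (this snippet is ours, not part of the accompanying script), we can verify that relationship numerically:

sim = word_vectors.similarity("singapore", "malaysia")
dist = word_vectors.distance("singapore", "malaysia")
print(abs(dist - (1.0 - sim)))  # should print a value at (or very near) zero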
We can also look up the vector for a vocabulary word either directly from the
word_vectors object, or by using the word_vec() wrapper, shown as follows:
vec_song = word_vectors["song"]
vec_song_2 = word_vectors.word_vec("song", use_norm=True)
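The use_norm=True argument asks for the L2-normalized (unit length) version of the vector. Here is a minimal check of what that means, assuming the gensim 3.x API used in this chapter and that numpy is imported as np:

import numpy as np

print(np.linalg.norm(vec_song_2))  # should print a value close to 1.0
print(np.allclose(vec_song_2, vec_song / np.linalg.norm(vec_song)))  # should print True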
There are a few other functions that you may find useful depending on your use
case. The documentation page for KeyedVectors contains a list of all the available
functions [10].
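For example (two such functions, shown purely as illustrations rather than an exhaustive list):

print(word_vectors.doesnt_match(["singapore", "malaysia", "song"]))  # the odd one out
print(word_vectors.similar_by_word("song", topn=3))  # nearest neighbors of "song"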
The code shown here can be found in the explore_text8_embedding.py file in the
code accompanying this book.
Using word embeddings for spam detection
Because of the widespread availability of robust embeddings generated from large
corpora, it has become quite common to use one of these embeddings to convert text
input for use with machine learning models. Text is treated as a sequence of tokens.
The embedding provides a dense, fixed-dimension vector for each token, so replacing
each token with its vector converts a sequence of text into a matrix of examples,
each of which has a fixed number of features corresponding to the dimensionality
of the embedding.
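As a concrete illustration of this conversion (a minimal sketch of our own; the GloVe model name and the example tokens are not from the accompanying code):

import gensim.downloader as api
import numpy as np

word_vectors = api.load("glove-wiki-gigaword-50")  # 50-dimensional GloVe vectors
tokens = ["free", "entry", "to", "win", "a", "prize"]
matrix = np.array([word_vectors[t] for t in tokens])
print(matrix.shape)  # (6, 50): one row per token, one column per embedding dimension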
This matrix of examples can be used directly as input to standard (non-neural
network based) machine learning programs, but since this book is about deep
learning and TensorFlow, we will demonstrate its use with a one-dimensional
version of the Convolutional Neural Network (CNN) that you learned about
in Chapter 4, Convolutional Neural Networks. Our example is a spam detector that
will classify Short Message Service (SMS) or text messages as either "ham" or
"spam." The example is very similar to the sentiment analysis example in Chapter 5,
Advanced Convolutional Neural Networks that used a one-dimensional CNN, but our
focus here will be on the embedding layer.
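The model itself is built later in the chapter; purely as a preview of its general shape (every layer size below is a placeholder assumption, not the book's actual hyperparameters), a 1D CNN classifier over an embedding layer looks like this:

import tensorflow as tf

model = tf.keras.Sequential([
    # vocabulary size, embedding dimension, and sequence length are placeholders
    tf.keras.layers.Embedding(input_dim=9000, output_dim=300, input_length=256),
    tf.keras.layers.Conv1D(filters=256, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),  # collapse the sequence dimension
    tf.keras.layers.Dense(2, activation="softmax")  # "ham" vs. "spam"
])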
Specifically, we will see how the program learns an embedding from scratch that
is customized to the spam detection task. Next we will see how to use an external
third-party embedding like the ones we have learned about in this chapter, a process
similar to transfer learning in computer vision. Finally, we will learn how to combine
the two approaches, starting with a third-party embedding and letting the network
use that as a starting point for its custom embedding, a process similar to fine-tuning
in computer vision.

As usual, we will start with our imports:

import argparse
import gensim.downloader as api
import numpy as np
import os
import shutil
import tensorflow as tf
from sklearn.metrics import accuracy_score, confusion_matrix

Scikit-learn is an open source Python machine learning toolkit that contains many
efficient and easy-to-use tools for data mining and data analysis. In this chapter
we use two of its predefined metrics, accuracy_score and confusion_matrix, to
evaluate our model after it is trained. You can learn more about scikit-learn at
https://scikit-learn.org/stable/.

Getting the data

The data for our model is available publicly and comes from the SMS spam
collection dataset from the UCI Machine Learning Repository [11]. The following
code will download the file and parse it to produce a list of SMS messages and their
corresponding labels:

def download_and_read(url):
    local_file = url.split('/')[-1]
    p = tf.keras.utils.get_file(local_file, url,
        extract=True, cache_dir=".")
    labels, texts = [], []
    local_file = os.path.join("datasets", "SMSSpamCollection")
    with open(local_file, "r") as fin:
        for line in fin:
            # each line has the form "<label>\t<text>",
            # where <label> is either "ham" or "spam"
            label, text = line.strip().split('\t')
            labels.append(label)
            texts.append(text)
    return labels, texts
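A quick usage sketch (the URL below is the UCI download location for this dataset, and the message count is quoted from the repository; treat both as assumptions here rather than part of the accompanying code):

dataset_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
labels, texts = download_and_read(dataset_url)
print(len(labels), len(texts))  # the corpus contains 5,574 labeled messages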