
We can also compute the distance between two words in the embedding space using the distance() function. This is really just 1 - similarity():

print("distance(singapore, malaysia) = {:.3f}".format(
    word_vectors.distance("singapore", "malaysia")
))
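As a quick sanity check, the following snippet (ours, not from the book's code) verifies that relationship using the KeyedVectors similarity() method:

sim = word_vectors.similarity("singapore", "malaysia")
dist = word_vectors.distance("singapore", "malaysia")
# the two quantities should match up to floating-point error
print("1 - similarity = {:.3f}, distance = {:.3f}".format(1.0 - sim, dist))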
We can also look up vectors for a vocabulary word either directly from the word_vectors object, or by using the word_vec() wrapper, shown as follows:

vec_song = word_vectors["song"]
vec_song_2 = word_vectors.word_vec("song", use_norm=True)
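The use_norm=True argument returns the L2-normalized version of the vector rather than the raw one. A quick way to see the difference (our sketch, assuming numpy is available as np):

import numpy as np

vec_raw = word_vectors.word_vec("song")
vec_norm = word_vectors.word_vec("song", use_norm=True)
print(np.linalg.norm(vec_raw))   # arbitrary magnitude
print(np.linalg.norm(vec_norm))  # approximately 1.0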
There are a few other functions that you may find useful depending on your use case. The documentation page for KeyedVectors contains a list of all the available functions [10].

The code shown here can be found in the explore_text8_embedding.py file in the code accompanying this book.

Using word embeddings for spam detection

Because of the widespread availability of various robust embeddings generated from large corpora, it has become quite common to use one of these embeddings to convert text input for use with machine learning models. Text is treated as a sequence of tokens. The embedding provides a dense, fixed-dimension vector for each token. Each token is replaced with its vector, and this converts the sequence of text into a matrix of examples, each of which has a fixed number of features corresponding to the dimensionality of the embedding.
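To make the conversion concrete, here is a toy sketch (not part of the book's spam detector); the vocabulary and embedding weights below are made up for illustration:

import numpy as np

# a made-up four-word vocabulary and a random 5-dimensional embedding
toy_vocab = {"free": 0, "prize": 1, "call": 2, "now": 3}
toy_embedding = np.random.rand(len(toy_vocab), 5)

# replacing each token with its vector turns a 4-token message
# into a 4x5 matrix: one row per token, one column per feature
message = ["call", "now", "free", "prize"]
matrix = toy_embedding[[toy_vocab[w] for w in message]]
print(matrix.shape)  # (4, 5)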
This matrix of examples can be used directly as input to standard (non-neural network based) machine learning programs, but since this book is about deep learning and TensorFlow, we will demonstrate its use with a one-dimensional version of the Convolutional Neural Network (CNN) that you learned about in Chapter 4, Convolutional Neural Networks. Our example is a spam detector that will classify Short Message Service (SMS) or text messages as either "ham" or "spam." The example is very similar to the sentiment analysis example in Chapter 5, Advanced Convolutional Neural Networks, that used a one-dimensional CNN, but our focus here will be on the embedding layer.

Specifically, we will see how the program learns an embedding from scratch that is customized to the spam detection task. Next we will see how to use an external third-party embedding like the ones we have learned about in this chapter, a process similar to transfer learning in computer vision. Finally, we will learn how to combine the two approaches, starting with a third-party embedding and letting the network use that as a starting point for its custom embedding, a process similar to fine-tuning in computer vision.
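These three approaches correspond to three ways of configuring the tf.keras Embedding layer. The sketch below is a schematic preview rather than the book's actual model; vocab_size, embed_dim, and pretrained_weights are placeholder names:

import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 5000, 300  # placeholder sizes
pretrained_weights = np.random.rand(vocab_size, embed_dim)  # stand-in for a real embedding

# 1. learn the embedding from scratch along with the task
scratch = tf.keras.layers.Embedding(vocab_size, embed_dim)

# 2. pretrained and frozen, similar to transfer learning
frozen = tf.keras.layers.Embedding(vocab_size, embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_weights),
    trainable=False)

# 3. pretrained as a starting point and trained further, similar to fine-tuning
fine_tuned = tf.keras.layers.Embedding(vocab_size, embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_weights),
    trainable=True)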
As usual, we will start with our imports:

import argparse
import gensim.downloader as api
import numpy as np
import os
import shutil
import tensorflow as tf
from sklearn.metrics import accuracy_score, confusion_matrix

Scikit-learn is an open source Python machine learning toolkit that contains many efficient and easy-to-use tools for data mining and data analysis. In this chapter we have used two of its predefined metrics, accuracy_score and confusion_matrix, to evaluate our model after it is trained. You can learn more about scikit-learn at https://scikit-learn.org/stable/.
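For instance, with hypothetical labels (1 for spam, 0 for ham), the two metrics behave as follows:

y_true = [0, 1, 1, 0]  # actual labels
y_pred = [0, 1, 0, 0]  # model predictions
print(accuracy_score(y_true, y_pred))    # 0.75
print(confusion_matrix(y_true, y_pred))  # [[2 0]
                                         #  [1 1]]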
Getting the data

The data for our model is available publicly, and comes from the SMS spam collection dataset from the UCI Machine Learning Repository [11]. The following code will download the file and parse it to produce a list of SMS messages and their corresponding labels:

def download_and_read(url):
    local_file = url.split('/')[-1]
    p = tf.keras.utils.get_file(local_file, url,
        extract=True, cache_dir=".")
    labels, texts = [], []
    local_file = os.path.join("datasets", "SMSSpamCollection")
    with open(local_file, "r") as fin:
        for line in fin:
            label, text = line.strip().split('\t')
            # the original page ends mid-function here; the inferred
            # continuation collects each message and a 0/1 label
            labels.append(1 if label == "spam" else 0)
            texts.append(text)
    return texts, labels
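A call to this function would then look something like the following; the exact download URL should be taken from the UCI repository page [11], and the one shown here is our assumption rather than a value copied from the book:

DATASET_URL = ("https://archive.ics.uci.edu/ml/"
    "machine-learning-databases/00228/smsspamcollection.zip")  # assumed URL
texts, labels = download_and_read(DATASET_URL)
print(len(texts), len(labels))  # one label per SMS message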