09.05.2023 Views


Word Embeddings

Specifically, we will see how the program learns an embedding from scratch that
is customized to the spam detection task. Next, we will see how to use an external
third-party embedding like the ones we have learned about in this chapter, a process
similar to transfer learning in computer vision. Finally, we will learn how to combine
the two approaches, starting with a third-party embedding and letting the network
use it as a starting point for its custom embedding, a process similar to fine-tuning
in computer vision.
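The three approaches differ only in how the embedding weights are initialized and whether they are updated during training. The sketch below illustrates that distinction with NumPy alone; the helper `build_embedding_matrix`, the toy vocabulary, and the three-dimensional "pretrained" vectors are all hypothetical and are not taken from the chapter's code. In Keras, this would roughly correspond to choosing the initial weights of a `tf.keras.layers.Embedding` layer and setting its `trainable` flag.

```python
import numpy as np

def build_embedding_matrix(vocab, dim, pretrained=None, seed=0):
    """Return a (len(vocab), dim) matrix of initial embedding weights.

    With pretrained=None every row is random (learning from scratch).
    Otherwise, rows for words found in `pretrained` are copied from it,
    and the remaining rows keep their random initialization.
    """
    rng = np.random.default_rng(seed)
    E = rng.uniform(-0.05, 0.05, size=(len(vocab), dim))
    if pretrained is not None:
        for idx, word in enumerate(vocab):
            if word in pretrained:
                E[idx] = pretrained[word]
    return E

# Hypothetical toy vocabulary and "third-party" vectors (assumptions).
vocab = ["<pad>", "free", "prize", "hello"]
pretrained = {
    "free": np.array([1.0, 0.0, 0.0]),
    "hello": np.array([0.0, 1.0, 0.0]),
}

# Each mode is (initial weights, whether training may update them).
modes = {
    "scratch": (build_embedding_matrix(vocab, 3), True),
    "pretrained": (build_embedding_matrix(vocab, 3, pretrained), False),
    "fine-tuned": (build_embedding_matrix(vocab, 3, pretrained), True),
}

for name, (weights, trainable) in modes.items():
    print(name, weights.shape, "trainable:", trainable)
```

The pretrained and fine-tuned modes start from identical weights; only the fine-tuned one would allow the training loop to change them.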

As usual, we will start with our imports:

import argparse
import gensim.downloader as api
import numpy as np
import os
import shutil
import tensorflow as tf
from sklearn.metrics import accuracy_score, confusion_matrix

Scikit-learn is an open-source Python machine learning toolkit that
contains many efficient and easy-to-use tools for data mining and
data analysis. In this chapter we use two of its predefined
metrics, accuracy_score and confusion_matrix, to evaluate
our model after it is trained.

You can learn more about scikit-learn at https://scikit-learn.org/stable/.
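To make the two metrics concrete, here is a minimal plain-Python sketch of what they compute. The functions `accuracy` and `confusion` are illustrative re-implementations with hypothetical names, not scikit-learn's code, and the label lists are made up. The sketch follows scikit-learn's convention that rows of the confusion matrix correspond to true classes and columns to predicted classes.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion(y_true, y_pred, num_classes=2):
    """Count matrix: rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels (assumption): 0 = ham, 1 = spam.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

acc = accuracy(y_true, y_pred)
cm = confusion(y_true, y_pred)
print(acc)  # 4 of 6 predictions are correct
print(cm)   # [[2, 1], [1, 2]]
```

With scikit-learn itself, the equivalent calls are `accuracy_score(y_true, y_pred)` and `confusion_matrix(y_true, y_pred)`, both taking the true labels as the first argument.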

Getting the data

The data for our model is publicly available and comes from the SMS Spam
Collection dataset from the UCI Machine Learning Repository [11]. The following
code downloads the file and parses it to produce a list of SMS messages and their
corresponding labels:

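def download_and_read(url):
    # Use the last component of the URL as the local archive name
    local_file = url.split('/')[-1]
    # Download and extract the archive under the current directory
    p = tf.keras.utils.get_file(local_file, url,
        extract=True, cache_dir=".")
    labels, texts = [], []
    # Path of the extracted data file
    local_file = os.path.join("datasets", "SMSSpamCollection")
    with open(local_file, "r") as fin:
        for line in fin:
            # Each line holds a label and a message separated by a tab
            label, text = line.strip().split('\t')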
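As a hedged illustration of the parsing step, the following self-contained sketch splits tab-separated lines of the form `label<TAB>text`, the layout used by the SMS Spam Collection file, where the label is either `ham` or `spam`. The function name `parse_sms_lines`, the sample messages, and the 0/1 label encoding are assumptions made for this example and are not taken from the listing above.

```python
def parse_sms_lines(lines):
    """Split 'label<TAB>text' lines into parallel lists of texts and labels."""
    labels, texts = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so a tab inside the message is kept
        label, text = line.split("\t", 1)
        labels.append(1 if label == "spam" else 0)  # assumed encoding
        texts.append(text)
    return texts, labels

# Made-up sample lines standing in for the real file contents.
sample = [
    "ham\tSee you at lunch?\n",
    "spam\tWINNER! Claim your free prize now\n",
    "\n",
]

texts, labels = parse_sms_lines(sample)
print(texts)
print(labels)
```

Passing `1` as the second argument to `split` is a defensive choice for this sketch: it avoids an unpacking error if a message happens to contain a tab character.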
