

The first sequence starts at position 0 and stops one short of the final word in the sentence, and the second sequence starts at position 1 and goes all the way to the end of the sentence:

import os
import re
import unicodedata

def preprocess_sentence(sent):
    # strip accents: decompose each character (NFD) and drop the
    # combining marks (Unicode category "Mn")
    sent = "".join([c for c in unicodedata.normalize("NFD", sent)
        if unicodedata.category(c) != "Mn"])
    # separate sentence-ending punctuation from the preceding word
    sent = re.sub(r"([!.?])", r" \1", sent)
    # replace anything that is not a letter or punctuation with a space
    sent = re.sub(r"[^a-zA-Z!.?]+", r" ", sent)
    # collapse runs of whitespace into a single space
    sent = re.sub(r"\s+", " ", sent)
    sent = sent.lower()
    return sent

def download_and_read():
    en_sents, fr_sents_in, fr_sents_out = [], [], []
    local_file = os.path.join("datasets", "fra.txt")
    with open(local_file, "r") as fin:
        for i, line in enumerate(fin):
            # each line holds a tab-separated (English, French) pair
            en_sent, fr_sent = line.strip().split('\t')
            en_sent = [w for w in preprocess_sentence(en_sent).split()]
            fr_sent = preprocess_sentence(fr_sent)
            # decoder input is prefixed with BOS, decoder output is
            # suffixed with EOS, producing the one-step shift described above
            fr_sent_in = [w for w in ("BOS " + fr_sent).split()]
            fr_sent_out = [w for w in (fr_sent + " EOS").split()]
            en_sents.append(en_sent)
            fr_sents_in.append(fr_sent_in)
            fr_sents_out.append(fr_sent_out)
            # num_sent_pairs is assumed to be defined earlier in the chapter
            if i >= num_sent_pairs - 1:
                break
    return en_sents, fr_sents_in, fr_sents_out

sents_en, sents_fr_in, sents_fr_out = download_and_read()
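To make the shift concrete, here is what a single French sentence would produce; the sentence itself is made up for illustration and not taken from the dataset:

fr_sent = preprocess_sentence("Va-t'en !")   # "va t en !"
fr_sent_in = ("BOS " + fr_sent).split()      # ["BOS", "va", "t", "en", "!"]
fr_sent_out = (fr_sent + " EOS").split()     # ["va", "t", "en", "!", "EOS"]

Lining the two lists up shows the one-step offset: at each position, fr_sent_out holds the token the decoder should predict after seeing the corresponding token in fr_sent_in.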

Our next step is to tokenize our inputs and create the vocabulary. Since we have sequences in two different languages, we will create two different tokenizers and vocabularies, one for each language. The tf.keras framework provides a very powerful and versatile tokenizer class, tf.keras.preprocessing.text.Tokenizer; here we set filters to an empty string and lower to False because we have already done what was needed for tokenization in our preprocess_sentence() function. The Tokenizer creates various data structures from which we can compute the vocabulary sizes and lookup tables that allow us to go from word to word index and back, as the sketch below illustrates.
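The listing for this step falls outside the excerpted page, so what follows is a minimal sketch of how the two tokenizers could be built with this API; the variable names (tokenizer_en, data_en, and so on) are assumptions rather than the author's:

import tensorflow as tf

# one tokenizer per language; filters="" and lower=False because
# preprocess_sentence() has already cleaned and lowercased the text
tokenizer_en = tf.keras.preprocessing.text.Tokenizer(
    filters="", lower=False)
tokenizer_en.fit_on_texts(sents_en)
data_en = tokenizer_en.texts_to_sequences(sents_en)
data_en = tf.keras.preprocessing.sequence.pad_sequences(
    data_en, padding="post")

# the French tokenizer must see both the BOS- and EOS-marked sequences
tokenizer_fr = tf.keras.preprocessing.text.Tokenizer(
    filters="", lower=False)
tokenizer_fr.fit_on_texts(sents_fr_in)
tokenizer_fr.fit_on_texts(sents_fr_out)
data_fr_in = tf.keras.preprocessing.sequence.pad_sequences(
    tokenizer_fr.texts_to_sequences(sents_fr_in), padding="post")
data_fr_out = tf.keras.preprocessing.sequence.pad_sequences(
    tokenizer_fr.texts_to_sequences(sents_fr_out), padding="post")

# vocabulary sizes (+1 because index 0 is reserved for padding)
vocab_size_en = len(tokenizer_en.word_index) + 1
vocab_size_fr = len(tokenizer_fr.word_index) + 1

# lookup tables: word -> index and index -> word
word2idx_en = tokenizer_en.word_index
idx2word_en = {v: k for k, v in word2idx_en.items()}
word2idx_fr = tokenizer_fr.word_index
idx2word_fr = {v: k for k, v in word2idx_fr.items()}

Note that padding="post" appends zeros after each sequence so all of them share the length of the longest one, which is why index 0 must stay out of the vocabulary.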
