That is, if the input is the sequence [c_1, c_2, …, c_n], the output will be [c_2, c_3, …, c_n+1]. We will train the network for 50 epochs, and at the end of every 10 epochs we will generate a fixed-size sequence of characters starting with a standard prefix. In the following example, we have used the prefix "Alice", the name of the protagonist in our novels.

As always, we will first import the necessary libraries and set up some constants. Here, DATA_DIR points to a data folder under the location where you downloaded the source code for this chapter. CHECKPOINT_DIR is the location, a folder checkpoints under the data folder, where we will save the weights of the model at the end of every 10 epochs:

import os
import numpy as np
import re
import shutil
import tensorflow as tf

DATA_DIR = "./data"
CHECKPOINT_DIR = os.path.join(DATA_DIR, "checkpoints")

Next, we download and prepare the data for our network to consume. The texts of both books are publicly available from the Project Gutenberg website. The tf.keras.utils.get_file() function will check whether the file has already been downloaded to your local drive and, if not, will download it to a datasets folder under the location of the code. We also preprocess the input a little here, removing newline and byte order mark characters from the text. This step will create the texts variable, a flat list of characters for these two books:

def download_and_read(urls):
    texts = []
    for i, url in enumerate(urls):
        p = tf.keras.utils.get_file("ex1-{:d}.txt".format(i), url,
            cache_dir=".")
        text = open(p, "r").read()
        # remove byte order mark
        text = text.replace("\ufeff", "")
        # convert newlines and runs of whitespace to single spaces
        text = text.replace('\n', ' ')
        text = re.sub(r'\s+', " ", text)
        # add the characters to the list
        texts.extend(text)
    return texts

texts = download_and_read([
    "http://www.gutenberg.org/cache/epub/28885/pg28885.txt",
    "https://www.gutenberg.org/files/12/12-0.txt"
])

Next, we will create our vocabulary. In our case, our vocabulary contains 90 unique characters, composed of uppercase and lowercase letters, numbers, and special characters. We also create some mapping dictionaries to convert each vocabulary character to a unique integer and vice versa. As noted earlier, the input and output of the network are sequences of characters; however, the actual input and output of the network are sequences of integers, and we will use these mapping dictionaries to handle the conversion:

# create the vocabulary
vocab = sorted(set(texts))
print("vocab size: {:d}".format(len(vocab)))

# create mappings from vocab chars to ints and back
char2idx = {c: i for i, c in enumerate(vocab)}
idx2char = {i: c for c, i in char2idx.items()}

The next step is to use these mapping dictionaries to convert our character sequence input into an integer sequence, and then into a TensorFlow dataset. Each of our sequences is going to be 100 characters long, with the output offset from the input by one character position. We first batch the dataset into slices of 101 characters, then apply the split_train_labels() function to every element of the dataset to create our sequences dataset: a dataset of tuples of two elements, each element of the tuple being a vector of size 100 and type tf.int64. We then shuffle these sequences and create batches of 64 tuples each for input to our network. Each element of the dataset is now a tuple consisting of a pair of matrices, each of size (64, 100) and type tf.int64:

# numericize the texts
texts_as_ints = np.array([char2idx[c] for c in texts])
data = tf.data.Dataset.from_tensor_slices(texts_as_ints)

# number of characters to show before asking for prediction
# sequences: [None, 100]
seq_length = 100
sequences = data.batch(seq_length + 1, drop_remainder=True)

def split_train_labels(sequence):
    input_seq = sequence[0:-1]
    output_seq = sequence[1:]
    return input_seq, output_seq
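The paragraph above also calls for mapping split_train_labels() over the dataset and then shuffling and batching into groups of 64, but the corresponding listing continues on the next page. The following is only a minimal sketch of those remaining steps, written to match the description; the variable name dataset and the shuffle buffer size of 10000 are illustrative assumptions rather than the book's exact code:

# apply split_train_labels() to every 101-character slice,
# producing (input, label) pairs of length 100 each
sequences = sequences.map(split_train_labels)

# shuffle the sequences and group them into batches of 64
# (the buffer size of 10000 is an illustrative choice)
batch_size = 64
dataset = sequences.shuffle(10000).batch(batch_size, drop_remainder=True)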

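To confirm the (64, 100) shapes and tf.int64 type mentioned in the text, one can pull a single batch from the dataset built in the sketch above; again, this is a verification snippet of ours, not the book's listing:

# inspect one batch: both tensors should have shape (64, 100) and dtype int64
for input_batch, label_batch in dataset.take(1):
    print(input_batch.shape, input_batch.dtype)
    print(label_batch.shape, label_batch.dtype)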

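Similarly, the idx2char dictionary gives the "vice versa" direction of the mapping described earlier, turning an integer sequence back into readable text. The snippet below reuses the input_batch variable from the previous check and is likewise an illustration rather than the book's code:

# decode the first input sequence of the batch back to characters
# using the idx2char reverse mapping
print("".join(idx2char[int(i)] for i in input_batch[0].numpy()))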