
Recurrent Neural Networks

texts = download_and_read([
    "http://www.gutenberg.org/cache/epub/28885/pg28885.txt",
    "https://www.gutenberg.org/files/12/12-0.txt"
])
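The download_and_read() helper itself is not shown here. A minimal sketch of what it might look like, assuming it simply fetches each URL and joins the texts into one string (the actual helper may also cache files locally and strip the Project Gutenberg header and footer):

```python
from urllib.request import urlopen

def download_and_read(urls):
    # hypothetical helper: fetch each URL, normalize line endings,
    # and join everything into a single string of characters
    texts = []
    for url in urls:
        raw = urlopen(url).read().decode("utf-8")
        texts.append(raw.replace("\r\n", " "))
    return " ".join(texts)
```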

Next, we will create our vocabulary. In our case, the vocabulary contains 90 unique characters: uppercase and lowercase letters, digits, and special characters. We also create a pair of mapping dictionaries to convert each vocabulary character to a unique integer and vice versa. As noted earlier, the network's input and output are sequences of characters; concretely, however, they are sequences of integers, and we will use these mapping dictionaries to handle the conversion:

# create the vocabulary
vocab = sorted(set(texts))
print("vocab size: {:d}".format(len(vocab)))
# create mappings from vocab chars to ints and back
char2idx = {c: i for i, c in enumerate(vocab)}
idx2char = {i: c for c, i in char2idx.items()}
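To see how the two dictionaries work together, here is a small round-trip example on a toy string (the string "hello world" is used purely for illustration; in the chapter, vocab is built from the downloaded texts):

```python
# build a toy vocabulary and the two mapping dictionaries
vocab = sorted(set("hello world"))
char2idx = {c: i for i, c in enumerate(vocab)}
idx2char = {i: c for c, i in char2idx.items()}

# encode characters to integers, then decode back to characters
encoded = [char2idx[c] for c in "hello"]
decoded = "".join(idx2char[i] for i in encoded)
# decoded == "hello", i.e., the mapping is lossless
```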

The next step is to use these mapping dictionaries to convert our character sequence input into an integer sequence, and then into a TensorFlow dataset. Each of our sequences will be 100 characters long, with the output offset from the input by one character position. We first batch the dataset into slices of 101 characters, then apply the split_train_labels() function to every element to create our sequences dataset: a dataset of two-element tuples, each element of which is a vector of size 100 and type tf.int64. We then shuffle these sequences and create batches of 64 tuples each for input to our network. Each element of the dataset is now a tuple consisting of a pair of matrices, each of size (64, 100) and type tf.int64:

import numpy as np
import tensorflow as tf

# numericize the texts
texts_as_ints = np.array([char2idx[c] for c in texts])
data = tf.data.Dataset.from_tensor_slices(texts_as_ints)

# number of characters to show before asking for a prediction
# sequences: [None, 100]
seq_length = 100
sequences = data.batch(seq_length + 1, drop_remainder=True)

def split_train_labels(sequence):
    input_seq = sequence[0:-1]
    output_seq = sequence[1:]
    return input_seq, output_seq
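The mapping, shuffling, and batching steps described above can be sketched end-to-end as follows. The toy text, the shuffle buffer of 10000, and the variable names sequences and dataset are illustrative assumptions; the shapes match the (64, 100) pairs described in the text:

```python
import numpy as np
import tensorflow as tf

# toy corpus, long enough to yield at least one full batch of 64 sequences
text = "the quick brown fox jumps over the lazy dog " * 200
vocab = sorted(set(text))
char2idx = {c: i for i, c in enumerate(vocab)}

# numericize and wrap in a tf.data.Dataset of single characters
texts_as_ints = np.array([char2idx[c] for c in text])
data = tf.data.Dataset.from_tensor_slices(texts_as_ints)

seq_length = 100
batch_size = 64

# slices of 101 chars: 100 inputs plus the 1-character-offset outputs
sequences = data.batch(seq_length + 1, drop_remainder=True)

def split_train_labels(sequence):
    return sequence[0:-1], sequence[1:]

# map to (input, output) pairs, shuffle, and batch for training
sequences = sequences.map(split_train_labels)
dataset = sequences.shuffle(10000).batch(batch_size, drop_remainder=True)

# each element is a pair of (64, 100) integer matrices
for input_batch, output_batch in dataset.take(1):
    print(input_batch.shape, output_batch.shape)
```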

