

    for idx, line in enumerate(fposs):
        poss.append(line.strip())
        if num_pairs is not None and idx >= num_pairs:
            break
    return sents, poss

sents, poss = download_and_read("./datasets")
assert len(sents) == len(poss)
print("# of records: {:d}".format(len(sents)))

There are 3194 sentences in our dataset. We then use the TensorFlow (tf.keras) Tokenizer to tokenize the sentences and create a list of sentence tokens. We reuse the same infrastructure to tokenize the parts of speech, although we could simply have split on spaces. Each input record to the network is currently a sequence of text tokens, but the network needs sequences of integers. During the tokenizing process, the Tokenizer also maintains the vocabulary of tokens, from which we can build mappings from token to integer and back.
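To make the token-to-integer mapping concrete, here is a small illustrative example; the toy corpus and variable names below are ours, not part of the dataset. The Tokenizer assigns lower indices to more frequent tokens:

import tensorflow as tf

toy_tokenizer = tf.keras.preprocessing.text.Tokenizer()
toy_tokenizer.fit_on_texts(["the cat sat", "the dog ran"])
# "the" is the most frequent token, so it gets index 1; the
# remaining tokens are indexed in order of first appearance.
print(toy_tokenizer.texts_to_sequences(["the cat ran"]))
# [[1, 2, 5]]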

We have two vocabularies to consider: first, the vocabulary of word tokens in the sentence collection, and second, the vocabulary of POS tags in the part-of-speech collection. The following code shows how to tokenize both collections and generate the necessary mapping dictionaries:

def tokenize_and_build_vocab(texts, vocab_size=None, lower=True):
    if vocab_size is None:
        tokenizer = tf.keras.preprocessing.text.Tokenizer(lower=lower)
    else:
        tokenizer = tf.keras.preprocessing.text.Tokenizer(
            num_words=vocab_size + 1, oov_token="UNK", lower=lower)
    tokenizer.fit_on_texts(texts)
    if vocab_size is not None:
        # additional workaround, see issue 8092
        # https://github.com/keras-team/keras/issues/8092
        tokenizer.word_index = {e: i for e, i in
            tokenizer.word_index.items() if i <= vocab_size + 1}
    word2idx = tokenizer.word_index
    idx2word = {v: k for k, v in word2idx.items()}
    return word2idx, idx2word, tokenizer

word2idx_s, idx2word_s, tokenizer_s = tokenize_and_build_vocab(
    sents, vocab_size=9000)
word2idx_t, idx2word_t, tokenizer_t = tokenize_and_build_vocab(
    poss, vocab_size=38, lower=False)
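With both tokenizers fitted, a natural next step is to convert the raw sentences and tag strings into equal-length integer sequences. The sketch below is ours, not the book's exact code; the choice of max_seqlen and post-padding are illustrative assumptions:

import tensorflow as tf

sents_as_ints = tokenizer_s.texts_to_sequences(sents)
poss_as_ints = tokenizer_t.texts_to_sequences(poss)
# Pad at the end (post) so every record has the same length;
# using the longest sentence as the target length is one option.
max_seqlen = max(len(s) for s in sents_as_ints)
sents_padded = tf.keras.preprocessing.sequence.pad_sequences(
    sents_as_ints, maxlen=max_seqlen, padding="post")
poss_padded = tf.keras.preprocessing.sequence.pad_sequences(
    poss_as_ints, maxlen=max_seqlen, padding="post")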

