
Example ‒ seq2seq without attention for machine translation

To understand the seq2seq model in greater detail, we will look at an example of one that learns how to translate from English to French using the French-English bilingual dataset from the Tatoeba Project (1997-2019) [26]. The dataset contains approximately 167,000 sentence pairs. To make training faster, we will use only the first 30,000 sentence pairs.

As always, we will start with the imports:

import os
import re
import shutil
import unicodedata

import nltk
import numpy as np
import tensorflow as tf
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

The data is provided as a remote zip file. The easiest way to access the file is to download it from http://www.manythings.org/anki/fra-eng.zip and expand it locally using unzip. The zip file contains a tab-separated file called fra.txt, with French and English sentence pairs separated by a tab, one pair per line. The code expects the fra.txt file in a dataset folder in the same directory as the script. We want to extract three different datasets from it.
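If you would rather fetch and expand the archive from code than by hand, a minimal sketch along the following lines works; the maybe_download helper, the datasets folder name, and the use of tf.keras.utils.get_file here are illustrative assumptions, not the book's exact code:

import os
import zipfile

import tensorflow as tf

def maybe_download(url="http://www.manythings.org/anki/fra-eng.zip"):
    # get_file caches the download (by default under ~/.keras)
    zip_path = tf.keras.utils.get_file("fra-eng.zip", origin=url)
    # Extract only the tab-separated pairs file into a local folder
    # (the folder name "datasets" is an assumption for illustration)
    os.makedirs("datasets", exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extract("fra.txt", path="datasets")
    return os.path.join("datasets", "fra.txt")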

If you recall the structure of the seq2seq network, the input to the encoder is a sequence of English words. On the decoder side, the input is a sequence of French words, and the output is the same sequence of French words offset by 1 timestep. The following function will download the zip file, expand it, and create the datasets described above.

The input is preprocessed to "asciify" the characters, separate out specific punctuation marks from their neighboring word, and remove all characters other than letters and these specific punctuation symbols. Finally, the sentences are converted to lowercase. Each English sentence is converted to a single sequence of words. Each French sentence is converted into two sequences, one preceded by the beginning-of-sentence (BOS) pseudo-word and the other followed by the end-of-sentence (EOS) pseudo-word.
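A rough sketch of what this preprocessing might look like follows; since the download step was sketched above, this version only reads the expanded file and builds the three datasets. The helper names preprocess_sentence and read_sentence_pairs and the exact regular expressions are illustrative assumptions:

import re
import unicodedata

def preprocess_sentence(sent):
    # "Asciify": decompose accented characters and drop the combining marks
    sent = "".join(c for c in unicodedata.normalize("NFD", sent)
                   if unicodedata.category(c) != "Mn")
    # Separate out specific punctuation marks from the neighboring word
    sent = re.sub(r"([!.?])", r" \1", sent)
    # Remove everything other than letters and those punctuation marks
    sent = re.sub(r"[^a-zA-Z!.?]+", " ", sent)
    sent = re.sub(r"\s+", " ", sent)
    return sent.strip().lower()

def read_sentence_pairs(filename, num_pairs=30000):
    en_sents, fr_sents_in, fr_sents_out = [], [], []
    with open(filename, "r", encoding="utf-8") as fin:
        for i, line in enumerate(fin):
            if i >= num_pairs:
                break
            # Each line is English TAB French (newer versions of the
            # file add a third attribution column, which we ignore)
            en_sent, fr_sent = line.strip().split("\t")[:2]
            en_sents.append(preprocess_sentence(en_sent).split())
            fr_sent = preprocess_sentence(fr_sent)
            # Decoder input starts with BOS; the decoder target is the
            # same sequence offset by 1 timestep, ending with EOS
            fr_sents_in.append(["BOS"] + fr_sent.split())
            fr_sents_out.append(fr_sent.split() + ["EOS"])
    return en_sents, fr_sents_in, fr_sents_out

For example, a pair like ("Go on.", "Poursuis.") would yield the encoder input ["go", "on", "."], the decoder input ["BOS", "poursuis", "."], and the decoder target ["poursuis", ".", "EOS"].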

