
Autoencoders

# inside sentence_generator: look up the GloVe vectors for the batch of
# word IDs and yield identical (input, target) pairs for the autoencoder
Xbatch = embeddings[X[sids, :]]
yield Xbatch, Xbatch

train_size = 0.7
Xtrain, Xtest = train_test_split(sent_wids, train_size=train_size)
train_gen = sentence_generator(Xtrain, embeddings, BATCH_SIZE)
test_gen = sentence_generator(Xtest, embeddings, BATCH_SIZE)
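Before wiring these generators into training, it can help to sanity-check what they emit. The snippet below is an illustrative check, not part of the original listing; it assumes BATCH_SIZE, SEQUENCE_LEN, and EMBED_SIZE are already defined as elsewhere in this chapter:

# illustrative sanity check: pull one batch from the training generator and
# confirm its shape matches what the encoder LSTM expects
Xbatch, Ybatch = next(train_gen)
assert Xbatch.shape == (BATCH_SIZE, SEQUENCE_LEN, EMBED_SIZE)
assert (Xbatch == Ybatch).all()   # autoencoder: input and target are identical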

Now we are ready to define the autoencoder. As we have shown in the diagram, it is composed of an encoder LSTM and a decoder LSTM. The encoder LSTM reads a tensor of shape (BATCH_SIZE, SEQUENCE_LEN, EMBED_SIZE) representing a batch of sentences. Each sentence is represented as a padded fixed-length sequence of words of size SEQUENCE_LEN, and each word is represented as a 300-dimensional GloVe vector. The output dimension of the encoder LSTM is a hyperparameter LATENT_SIZE, which is the size of the sentence vector that will come from the encoder part of the trained autoencoder later. The vector space of dimensionality LATENT_SIZE represents the latent space that encodes the meaning of the sentence. The output of the encoder LSTM is a vector of size LATENT_SIZE for each sentence, so for the batch the shape of the output tensor is (BATCH_SIZE, LATENT_SIZE). This is now fed to a RepeatVector layer, which replicates this vector across the entire sequence; that is, the output tensor from this layer has the shape (BATCH_SIZE, SEQUENCE_LEN, LATENT_SIZE). This tensor is now fed into the decoder LSTM, whose output dimension is EMBED_SIZE, so the output tensor has shape (BATCH_SIZE, SEQUENCE_LEN, EMBED_SIZE), that is, the same shape as the input tensor.

We compile this model with the SGD optimizer and the MSE loss function. The reason we use MSE is that we want to reconstruct a sentence that has a similar meaning, that is, something that is close to the original sentence in the embedded space of dimension LATENT_SIZE:

# encoder: a bidirectional LSTM that collapses each sentence into a single
# LATENT_SIZE vector (forward and backward outputs are summed)
inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum",
                        name="encoder_lstm")(inputs)

# decoder: repeat the sentence vector SEQUENCE_LEN times, then use a
# bidirectional LSTM to expand it back into a sequence of EMBED_SIZE vectors
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = Bidirectional(LSTM(EMBED_SIZE, return_sequences=True),
                        merge_mode="sum", name="decoder_lstm")(decoded)

autoencoder = Model(inputs, decoded)

We define the loss function as mean squared error and choose the SGD optimizer:

autoencoder.compile(optimizer="sgd", loss="mse")
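With the model compiled, the two generators defined above can drive training, and afterwards the encoder half can be pulled out to produce sentence vectors. The sketch below is one plausible way to do this, not the original listing; NUM_EPOCHS and the use of fit_generator are assumptions here:

# minimal training sketch (assumes NUM_EPOCHS is defined and that the Keras
# version in use still supports fit_generator)
num_train_steps = len(Xtrain) // BATCH_SIZE
num_test_steps = len(Xtest) // BATCH_SIZE
autoencoder.fit_generator(train_gen,
                          steps_per_epoch=num_train_steps,
                          epochs=NUM_EPOCHS,
                          validation_data=test_gen,
                          validation_steps=num_test_steps)

# after training, reuse the encoder on its own: it maps a batch of sentences
# of shape (BATCH_SIZE, SEQUENCE_LEN, EMBED_SIZE) to sentence vectors of
# shape (BATCH_SIZE, LATENT_SIZE)
encoder = Model(autoencoder.input,
                autoencoder.get_layer("encoder_lstm").output)
sentence_vectors = encoder.predict(next(test_gen)[0])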

