Autoencoders

The embedding is generated as a matrix of shape (VOCAB_SIZE, EMBED_SIZE), where each row holds the GloVe embedding for a word in our vocabulary. The PAD row (index 0) is populated with zeros, and the UNK row (index 1) with random uniform values:

import numpy as np

EMBED_SIZE = 50

def lookup_word2id(word):
    # map a word to its ID, falling back to UNK for out-of-vocabulary words
    try:
        return word2id[word]
    except KeyError:
        return word2id["UNK"]

def load_glove_vectors(glove_file, word2id, embed_size):
    embedding = np.zeros((len(word2id), embed_size))
    fglove = open(glove_file, "r", encoding="utf-8")
    for line in fglove:
        cols = line.strip().split()
        word = cols[0]
        if embed_size == 0:
            embed_size = len(cols) - 1
        # copy over only the vectors for words in our vocabulary
        if word in word2id:
            vec = np.array([float(v) for v in cols[1:]])
            embedding[lookup_word2id(word)] = vec
    fglove.close()
    embedding[word2id["PAD"]] = np.zeros(embed_size)
    embedding[word2id["UNK"]] = np.random.uniform(-1, 1, embed_size)
    return embedding
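These functions assume a word2id dictionary built earlier from the corpus vocabulary. A minimal sketch of such a construction, assuming sents holds the sentences and VOCAB_SIZE caps the vocabulary (the details here are an assumption, not necessarily the chapter's exact code), might look like this:

from collections import Counter

# count word frequencies over all sentences (sents is assumed to be a
# list of whitespace-tokenizable strings)
counter = Counter()
for sent in sents:
    for word in sent.split():
        counter[word] += 1

# reserve IDs 0 and 1 for PAD and UNK, then assign IDs to the most
# frequent words
word2id = {"PAD": 0, "UNK": 1}
for wid, (word, _) in enumerate(counter.most_common(VOCAB_SIZE - 2), start=2):
    word2id[word] = wid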

Next, we use these functions to generate embeddings:

sent_wids = [[lookup_word2id(w) for w in s.split()] for s in sents]
sent_wids = sequence.pad_sequences(sent_wids, SEQUENCE_LEN)

# load GloVe vectors into weight matrix
embeddings = load_glove_vectors(os.path.join(DATA_DIR,
    "glove.6B.{:d}d.txt".format(EMBED_SIZE)), word2id, EMBED_SIZE)
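The resulting weight matrix is typically used to seed a Keras Embedding layer at the encoder input. A minimal sketch, assuming the Keras 2 functional API with the weights argument (the layer and tensor names here are illustrative):

from keras.layers import Input, Embedding

# encoder input: a sequence of word IDs of length SEQUENCE_LEN
inputs = Input(shape=(SEQUENCE_LEN,), name="input")

# seed the embedding layer with the GloVe matrix and freeze it so the
# pretrained vectors are not updated during training
embedded = Embedding(input_dim=embeddings.shape[0],
                     output_dim=EMBED_SIZE,
                     weights=[embeddings],
                     trainable=False,
                     name="embedding")(inputs)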
