Word Embeddings

gensim also provides a doesnt_match() function, which can be used to detect the odd one out of a list of words:

print(word_vectors.doesnt_match(
    ["hindus", "parsis", "singapore", "christians"]))

This gives us singapore as expected, since it is the only country among a set of words identifying religions.
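To make the idea behind the odd-one-out check concrete, here is a minimal sketch in plain Python. It assumes the approach of comparing each word's unit-length vector against the mean of all the vectors and picking the least similar one. The odd_one_out() and unit() helpers and the 3-dimensional toy vectors are made up for illustration; they are not taken from the trained model or from gensim's source.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    # Mean of the unit vectors, itself rescaled to unit length.
    mean = unit([sum(u[i] for u in units.values()) / len(units)
                 for i in range(dim)])
    # The dot product of two unit vectors is their cosine similarity.
    scores = {word: sum(a * b for a, b in zip(u, mean))
              for word, u in units.items()}
    return min(scores, key=scores.get)


# Hypothetical toy vectors, invented for this example only.
toy_vectors = {
    "hindus": [0.9, 0.1, 0.0],
    "parsis": [0.8, 0.2, 0.1],
    "christians": [0.85, 0.15, 0.05],
    "singapore": [0.1, 0.9, 0.3],
}

print(odd_one_out(toy_vectors))
```

With these toy values the three religion vectors point in nearly the same direction, so the mean stays close to them and singapore ends up with the lowest score.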

We can also calculate the similarity between two words. Here we demonstrate that related words score a higher similarity (that is, lie at a smaller distance) than unrelated words:

for word in ["woman", "dog", "whale", "tree"]:
    print("similarity({:s}, {:s}) = {:.3f}".format(
        "man", word,
        word_vectors.similarity("man", word)
    ))

This gives the following interesting result:

similarity(man, woman) = 0.759

similarity(man, dog) = 0.474

similarity(man, whale) = 0.290

similarity(man, tree) = 0.260
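The scores above are cosine similarities, so a higher value means a smaller angle between the two word vectors. The following sketch shows the computation and the matching cosine distance (1 minus the similarity) in plain Python. The cosine_similarity() helper and the toy vectors are assumptions for illustration, so the numbers it prints will not match the model's output above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors, invented for this example only.
toy = {
    "man": [1.0, 0.2, 0.0],
    "woman": [0.9, 0.4, 0.1],
    "tree": [0.1, 0.2, 1.0],
}

for word in ["woman", "tree"]:
    sim = cosine_similarity(toy["man"], toy[word])
    # Cosine distance is the complement of cosine similarity.
    print("similarity(man, {:s}) = {:.3f}, distance = {:.3f}".format(
        word, sim, 1.0 - sim))
```

In this toy setup the related pair gets the higher similarity and therefore the smaller distance, which is the same pattern the real vectors show.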

The similar_by_word() function returns the words most similar to a given word and is functionally equivalent to calling most_similar() with that single word; the scores it returns are cosine similarities. There is also a related similar_by_vector() function, which allows you to find similar words by specifying a vector as input. Here we try to find words that are similar to "singapore":

print(print_most_similar(
    word_vectors.similar_by_word("singapore"), 5))

And we get the following output, which seems to be mostly correct, at least from a geographical point of view:

0.882 malaysia

0.837 indonesia

0.826 philippines

0.825 uganda

0.822 thailand

...
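A lookup of this kind can be pictured as ranking every word in the vocabulary by its cosine similarity to a query vector. The sketch below is a plain-Python illustration under that assumption; most_similar_to_vector(), its exclude parameter, and the toy vectors are hypothetical and are not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_to_vector(query, vectors, topn=3, exclude=()):
    """Return the topn (word, score) pairs ranked by cosine similarity."""
    scored = [(word, cosine_similarity(query, vec))
              for word, vec in vectors.items()
              if word not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vocabulary, invented for this example only.
toy_vectors = {
    "singapore": [0.9, 0.8, 0.1],
    "malaysia": [0.85, 0.75, 0.2],
    "indonesia": [0.8, 0.6, 0.3],
    "whale": [0.1, 0.2, 0.9],
}

# Look up by word: fetch the word's vector and skip the word itself.
result = most_similar_to_vector(
    toy_vectors["singapore"], toy_vectors, topn=2, exclude=("singapore",))
for word, score in result:
    print("{:.3f} {:s}".format(score, word))
```

Looking up by word is then just a lookup by vector that starts from the word's own vector and leaves that word out of the results.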
