
Word Embeddings

The dataset is an 11463 × 5812 matrix of word counts, where the rows represent words and the columns represent conference papers. We will use this to construct a graph of papers, where an edge between two papers means that at least one word occurs in both of them. Both node2vec and DeepWalk assume that the graph is undirected and unweighted. Our graph is undirected, since a relationship between a pair of papers is bidirectional. However, our edges could carry weights based on the number of word co-occurrences between the two documents. For our example, we will treat any number of co-occurrences above 0 as a valid unweighted edge.
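To make the edge rule concrete, here is a minimal sketch on a toy matrix. The 4 × 3 `counts` array and the names `TD`, `weights`, and `adjacency` are hypothetical and chosen only for illustration; the rows are words and the columns are papers, as in the real dataset. Multiplying the transposed count matrix by the count matrix gives a paper-by-paper weight that is non-zero exactly when two papers share a word, and thresholding at 0 yields the unweighted graph.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy counts: 4 words (rows) x 3 papers (columns).
counts = np.array([
    [2, 0, 1],   # word 0 occurs in papers 0 and 2
    [0, 3, 0],   # word 1 occurs only in paper 1
    [1, 0, 0],   # word 2 occurs only in paper 0
    [0, 1, 0],   # word 3 occurs only in paper 1
])
TD = csr_matrix(counts)

# (papers x words) @ (words x papers) -> papers x papers co-occurrence weights.
weights = TD.T @ TD

# Any weight above 0 becomes an unweighted edge.
adjacency = (weights > 0).astype(np.int64).toarray()
print(adjacency)
```

In this toy graph, papers 0 and 2 are connected because they share word 0, while paper 1 is isolated. The diagonal is also non-zero, since every non-empty paper shares words with itself; whether to keep such self-loops is a design choice this sketch leaves open.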

As usual, we will start by declaring our imports:

import gensim
import logging
import numpy as np
import os
import shutil
import tensorflow as tf
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

The next step is to download the data from the UCI repository and convert it to a sparse term-document matrix, TD, then construct a document-document matrix, E, by multiplying the transpose of the term-document matrix with itself. The document-document matrix serves as the adjacency (edge) matrix of our graph. Since each element represents a similarity between two documents, we will binarize the matrix E by setting any non-zero elements to 1:

DATA_DIR = "./data"
UCI_DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00371/NIPS_1987-2015.csv"

def download_and_read(url):
    local_file = url.split('/')[-1]
    p = tf.keras.utils.get_file(local_file, url, cache_dir=".")
    row_ids, col_ids, data = [], [], []
    rid = 0
    f = open(p, "r")
    for line in f:
        line = line.strip()
        if line.startswith("\"\","):
            # header
