
                continue
            # compute non-zero elements for current row
            counts = np.array([int(x) for x in line.split(',')[1:]])
            nz_col_ids = np.nonzero(counts)[0]
            nz_data = counts[nz_col_ids]
            nz_row_ids = np.repeat(rid, len(nz_col_ids))
            rid += 1
            # add data to big lists
            row_ids.extend(nz_row_ids.tolist())
            col_ids.extend(nz_col_ids.tolist())
            data.extend(nz_data.tolist())
    f.close()
    TD = csr_matrix((np.array(data),
        (np.array(row_ids), np.array(col_ids))),
        shape=(rid, counts.shape[0]))
    return TD

Chapter 7

# read data and convert to Term-Document matrix
TD = download_and_read(UCI_DATA_URL)
# compute undirected, unweighted edge matrix
E = TD.T * TD
# binarize
E[E > 0] = 1

Once we have our sparse binarized adjacency matrix, E, we can then generate random walks from each of the vertices. From each node, we construct 32 random walks of a maximum length of 40 nodes. The walks have a random restart probability of 0.15, which means that for any node, the particular random walk could end with 15% probability. The following code will construct the random walks and write them out to a file given by RANDOM_WALKS_FILE. Note that this is a very slow process. A copy of the output is provided along with the source code for this chapter in case you prefer to skip the random walk generation process:

NUM_WALKS_PER_VERTEX = 32
MAX_PATH_LENGTH = 40
RESTART_PROB = 0.15
RANDOM_WALKS_FILE = os.path.join(DATA_DIR, "random-walks.txt")

def construct_random_walks(E, n, alpha, l, ofile):
    if os.path.exists(ofile):


Word Embeddings

        print("random walks generated already, skipping")
        return
    f = open(ofile, "w")
    for i in range(E.shape[0]):  # for each vertex
        if i % 100 == 0:
            print("{:d} random walks generated from {:d} vertices"
                .format(n * i, i))
        for j in range(n):  # construct n random walks
            curr = i
            walk = [curr]
            target_nodes = np.nonzero(E[curr])[1]
            for k in range(l):  # each of max length l
                # should we restart?
                if np.random.random() < alpha and len(walk) > 5:
                    break
                # choose one outgoing edge and append to walk
                try:
                    curr = np.random.choice(target_nodes)
                    walk.append(curr)
                    target_nodes = np.nonzero(E[curr])[1]
                except ValueError:
                    continue
            f.write("{:s}\n".format(" ".join([str(x) for x in walk])))
    print("{:d} random walks generated from {:d} vertices, COMPLETE"
        .format(n * i, i))
    f.close()

# construct random walks (caution: very long process!)
construct_random_walks(E, NUM_WALKS_PER_VERTEX, RESTART_PROB,
    MAX_PATH_LENGTH, RANDOM_WALKS_FILE)
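To see the restart-and-step loop in isolation, here is a toy sketch of a single walk over a made-up 4-node ring graph (the graph, seed, and the `random_walk` helper are illustrative, not part of the chapter's code; because E here is a dense array, the neighbor lookup uses `np.nonzero(E[curr])[0]` rather than the `[1]` needed for a sparse matrix row):

```python
import numpy as np

# Toy undirected graph: a 4-node ring (illustrative, not the UCI data)
E = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
])

np.random.seed(42)  # make the walk reproducible

def random_walk(E, start, alpha=0.15, max_len=10):
    """One random walk with restart probability alpha, mirroring the
    inner loop of construct_random_walks for a single start vertex."""
    curr = start
    walk = [curr]
    for _ in range(max_len):
        # restart check: terminate this walk with probability alpha,
        # but only once it is longer than 5 nodes
        if np.random.random() < alpha and len(walk) > 5:
            break
        # move to a uniformly chosen neighbor of the current node
        neighbors = np.nonzero(E[curr])[0]  # dense row, hence [0]
        curr = np.random.choice(neighbors)
        walk.append(curr)
    return walk

walk = random_walk(E, start=0)
print(walk)
```

Every consecutive pair in the resulting walk is an edge of the ring, which is exactly the invariant the full function maintains over the document graph.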

A few lines from the RANDOM_WALKS_FILE are shown below. You could imagine that these look like sentences in a language where the vocabulary of words is all the node IDs in our graph. We have learned that word embeddings exploit the structure of language to generate a distributional representation for words. Graph embedding schemes such as DeepWalk and node2vec do the exact same thing with these "sentences" created out of random walks. Such embeddings are able to capture similarities between nodes in a graph that go beyond immediate neighbors, as we shall see below:

0 1405 4845 754 4391 3524 4282 2357 3922 1667

0 1341 456 495 1647 4200 5379 473 2311
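To make the sentence analogy concrete, each line of the walks file can be parsed into a "sentence" whose "words" are node-ID tokens. The sketch below inlines the two sample walks shown above so it is self-contained (in practice you would read RANDOM_WALKS_FILE line by line):

```python
# Two walks in the same format as RANDOM_WALKS_FILE
# (copied from the sample output above)
walks_text = """0 1405 4845 754 4391 3524 4282 2357 3922 1667
0 1341 456 495 1647 4200 5379 473 2311
"""

# Parse each line into a "sentence" of node-ID tokens
sentences = [line.split() for line in walks_text.splitlines()
             if line.strip()]

# The "vocabulary" of this corpus is the set of node IDs seen so far
vocab = sorted(set(tok for s in sentences for tok in s), key=int)

print(len(sentences), sentences[0][:3], len(vocab))
```

A list of token lists like this is precisely the input format that word2vec-style trainers (for example, gensim's Word2Vec) expect, which is what lets DeepWalk reuse word-embedding machinery unchanged on graph data.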

