

Recurrent Neural Networks

To get the data, you need the NLTK library (it is included in the Anaconda distribution) and the 10% treebank dataset, which is not installed by default. To install NLTK, follow the steps on the NLTK install page [23]. To install the treebank dataset, run the following at the Python REPL:

>>> import nltk
>>> nltk.download("treebank")

Once this is done, we are ready to build our network. As usual, we will start by importing the necessary packages:

import numpy as np
import os
import shutil
import tensorflow as tf
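For orientation, NLTK's treebank corpus yields each tagged sentence as a list of (word, POS tag) pairs. The following is a minimal sketch of how one such sentence can be split into two aligned, space-joined strings and paired up again. The sample sentence and the helper names `to_parallel_lines` and `from_parallel_lines` are assumptions made for illustration; they are not read from the corpus or taken from the chapter's code.

```python
# Hand-written sample in the (word, tag) shape of a tagged sentence.
# The words and tags are illustrative only, not read from the corpus.
sample_sent = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), (".", ".")]


def to_parallel_lines(tagged_sent):
    """Split one tagged sentence into a word line and a tag line."""
    word_line = " ".join(w for w, _ in tagged_sent)
    tag_line = " ".join(t for _, t in tagged_sent)
    return word_line, tag_line


def from_parallel_lines(word_line, tag_line):
    """Re-align the two lines token by token (hypothetical helper)."""
    words, tags = word_line.split(), tag_line.split()
    if len(words) != len(tags):
        raise ValueError("word and tag lines have different lengths")
    return list(zip(words, tags))


word_line, tag_line = to_parallel_lines(sample_sent)
print(word_line)  # The cat sat .
print(tag_line)   # DT NN VBD .
```

Because both lines are joined on single spaces and hold the same number of tokens, the i-th word always lines up with the i-th tag, which is what makes a pair of flat files a workable storage format for this kind of data.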

We will lazily import the NLTK treebank dataset into a pair of parallel flat files, creating them only if they do not already exist. One file contains the sentences and the other contains the corresponding sequences of POS tags, one sentence per line:

def download_and_read(dataset_dir, num_pairs=None):
    sent_filename = os.path.join(dataset_dir, "treebank-sents.txt")
    poss_filename = os.path.join(dataset_dir, "treebank-poss.txt")
    # build the two flat files only if either one is missing
    if not(os.path.exists(sent_filename) and os.path.exists(poss_filename)):
        import nltk
        if not os.path.exists(dataset_dir):
            os.makedirs(dataset_dir)
        fsents = open(sent_filename, "w")
        fposs = open(poss_filename, "w")
        sentences = nltk.corpus.treebank.tagged_sents()
        # each sentence is a list of (word, POS tag) pairs
        for sent in sentences:
            fsents.write(" ".join([w for w, p in sent]) + "\n")
            fposs.write(" ".join([p for w, p in sent]) + "\n")
        fsents.close()
        fposs.close()
    sents, poss = [], []
    with open(sent_filename, "r") as fsent:
        for idx, line in enumerate(fsent):
            sents.append(line.strip())
            # the check runs after the append, so up to num_pairs + 1 lines are kept
            if num_pairs is not None and idx >= num_pairs:
                break
    with open(poss_filename, "r") as fposs:
