28.06.2013 Views

Papers in PDF format


Building the collection

The lowest common denominator for representing information in conventional libraries is paper, and many digital library efforts involve scanned paper documents (e.g. [Crocca & Anderson 1995; Van House 1995]). In the world of electronic information, PostScript, rather than plain, unformatted text, is the closest analog to paper as a document storage medium, and its page-based nature turns out to be helpful in structuring the collection. In order to build the index, it is necessary to be able to extract plain, unformatted text automatically from the documents. In fact, the system design accommodates not just PostScript but any format from which paginated ASCII text can be extracted, for example DVI, RTF or HTML files. Document images can be accommodated by OCR-ing them for indexing purposes: the inevitable recognition errors will reduce the quality of the index, but this can be ameliorated by using ranked queries containing redundant terms. However, from an initial investigation it appeared that PostScript files are almost universal in computer science technical report archives, and the system currently deals only with this format.
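
To make the role of paginated text and redundant query terms concrete, the following is a minimal sketch, not the system's actual indexer. It assumes that extracted pages are separated by form-feed characters and ranks documents simply by how many distinct query terms they contain; the function names and the sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict

FORM_FEED = "\f"  # assumption: pages in the extracted text are form-feed separated


def split_pages(text):
    """Split extracted text into pages on form-feed characters."""
    return text.split(FORM_FEED)


def tokenize(page):
    """Lower-case a page and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", page.lower())


def build_index(docs):
    """Map each term to the set of (doc_id, page_number) pairs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for page_no, page in enumerate(split_pages(text), start=1):
            for term in tokenize(page):
                index[term].add((doc_id, page_no))
    return index


def ranked_query(index, query):
    """Rank documents by the number of distinct query terms they contain."""
    scores = Counter()
    for term in set(tokenize(query)):
        # Count each document once per term, however many pages match.
        for doc_id in {d for d, _ in index.get(term, ())}:
            scores[doc_id] += 1
    return scores.most_common()


# Hypothetical documents: "tr-b" imitates OCR output with recognition errors.
docs = {
    "tr-a": "digital library indexing\fcompression of the index",
    "tr-b": "d1gital libary compression",
}
index = build_index(docs)
results = ranked_query(index, "digital library compression index")
```

In this toy collection the OCR-damaged document still appears in `results`, ranked below the clean one, because the redundant term "compression" survives the recognition errors that destroyed "digital" and "library".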

Archives of technical reports are located through several lists maintained on the Internet. The directory hierarchy of each archive is then descended recursively in search of (possibly compressed) PostScript files. Each file is downloaded, its size and date are recorded, and the appropriate information is extracted.
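
The recursive descent can be sketched as follows. This is an illustration under stated assumptions rather than the harvesting code itself: the remote archive is replaced by a hypothetical in-memory listing (dictionaries for directories, `(size, date)` tuples for files), and the set of file suffixes is a guess at what "possibly compressed" covers.

```python
import posixpath

# Assumed suffixes; the text only says "(possibly compressed) PostScript files".
POSTSCRIPT_SUFFIXES = (".ps", ".ps.gz", ".ps.Z")


def is_postscript(name):
    """Return True if the file name looks like a (compressed) PostScript file."""
    return name.endswith(POSTSCRIPT_SUFFIXES)


def find_postscript(listing, prefix="/"):
    """Yield (path, size, date) for each PostScript file under a listing.

    `listing` is a hypothetical stand-in for a remote directory:
    sub-directories are dicts, files are (size, date) tuples.
    """
    for name, entry in sorted(listing.items()):
        path = posixpath.join(prefix, name)
        if isinstance(entry, dict):
            # Descend into the sub-directory.
            yield from find_postscript(entry, path)
        elif is_postscript(name):
            size, date = entry
            yield path, size, date


# Hypothetical archive layout used only for illustration.
archive = {
    "pub": {
        "reports": {
            "tr-95-01.ps.Z": (1000, "1995-03-01"),
            "README": (10, "1995-01-01"),
        },
        "tr-94-07.ps": (2000, "1994-11-20"),
    }
}
found = list(find_postscript(archive))
```

A real harvester would obtain each listing over the network instead of from a dictionary, but the traversal and the filtering step would have the same shape.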


Figure 2: Conversion from PostScript: a PostScript file, the text extracted from it, and a facsimile image<br />

Information stored centrally

With an Internet-based digital library, a crucial question is how much information to store centrally. It was decided that the library would comprise an index and search engine, and the documents themselves would remain in their original repositories. Periodically, all repositories are scanned to refresh the index, adding any new documents and removing references to those that have been deleted. There is no intention to provide an archiving service that will retain documents that are removed from their original sites.
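
One way to picture the periodic refresh is as a comparison between what the index currently references and what a fresh scan finds. The sketch below is hypothetical: it assumes each document is identified by its URL and described by a `(size, date)` pair, and it adds a re-indexing step for changed files that the text does not explicitly mention.

```python
def plan_refresh(indexed, scanned):
    """Compare the stored index against a fresh repository scan.

    Both arguments map a document URL to its (size, date) pair.
    Returns three sorted lists of URLs: to add, to re-index, to remove.
    """
    # New documents: present in the scan but not yet indexed.
    to_add = sorted(scanned.keys() - indexed.keys())
    # Deleted documents: indexed but no longer present in the repository.
    to_remove = sorted(indexed.keys() - scanned.keys())
    # Assumption: a changed size or date means the file should be re-indexed.
    to_update = sorted(
        url
        for url in scanned.keys() & indexed.keys()
        if scanned[url] != indexed[url]
    )
    return to_add, to_update, to_remove


# Hypothetical URLs and metadata, for illustration only.
indexed = {
    "ftp://example.org/x.ps": (1, "1995-01-01"),
    "ftp://example.org/y.ps": (2, "1995-02-01"),
    "ftp://example.org/z.ps": (3, "1995-03-01"),
}
scanned = {
    "ftp://example.org/y.ps": (2, "1995-02-01"),
    "ftp://example.org/z.ps": (4, "1995-04-01"),
    "ftp://example.org/w.ps": (5, "1995-05-01"),
}
to_add, to_update, to_remove = plan_refresh(indexed, scanned)
```

Because only references are held centrally, removing a deleted document amounts to dropping its index entries; nothing is archived.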
