28.06.2013 Views

Papers in PDF format

OODB Support for WWW Applications:
Disclosing the internal structure of Hyperdocuments

J.T. de Munk, A.T.M. Aerts and P.M.E. De Bra
Department of Mathematics and Computing Science
Eindhoven University of Technology
PO Box 513, 5600 MB Eindhoven
The Netherlands
{munk,wsinatma,debra}@win.tue.nl

1. Introduction

Most World Wide Web (WWW or Web) servers use the operating system's native file system to store HTML documents and their embedded images. This access mechanism works fine for direct access, but is ill-suited for finding documents containing certain information. Even just finding all documents that exist on a server is difficult because of the complexity of the hypertext link structure. The structure may not be completely connected, meaning that some documents on a server may not even be reachable by following links from other documents.

Several attempts have been made to build additional structures that provide search facilities for the information stored on a WWW server or a cluster of servers. Glimpse [Manber & Wu 94] provides indexing at the server level. Harvest [Schwartz et al. 94] extends Glimpse to offer retrieval over a set of servers.

The existing index databases ignore most or all of the internal structure of the documents. Asking for information that appears in a "header" of certain levels, e.g. levels 1, 2 and 3, is not possible. Finding information in a particular field is impossible as well.

The source of the problem is the lack of a sound access mechanism for the information one is interested in. The flat file system approach taken by Web servers makes it easy to access a whole document, given its address, but makes associative retrieval difficult. An "inverted" access mechanism is needed, providing access to documents or parts of documents, given a description of their contents, their internal structure, or the link structure of their environment.
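The idea of an "inverted" access mechanism can be illustrated with a small sketch. It is not the mechanism built in this work: the input format (an address mapped to pairs of element name and text), the sample documents, and the names `DOCS`, `build_index` and `find` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical, already-parsed input: address -> list of (element, text) pairs.
DOCS = {
    "/a.html": [("h1", "Ode database"), ("p", "storage of objects")],
    "/b.html": [("h2", "Database servers"), ("address", "Eindhoven")],
}


def build_index(docs):
    """Map (element, word) to the set of addresses with that word in that element."""
    index = defaultdict(set)
    for address, parts in docs.items():
        for element, text in parts:
            for word in text.lower().split():
                index[(element, word)].add(address)
    return index


def find(index, word, elements):
    """Return the sorted addresses where `word` occurs inside any of `elements`."""
    hits = set()
    for element in elements:
        hits |= index.get((element, word.lower()), set())
    return sorted(hits)


INDEX = build_index(DOCS)
```

With such an index, a request like `find(INDEX, "database", ["h1", "h2", "h3"])` goes from a description of content and structure to document addresses, which is the opposite direction from the address-to-document lookup a file system offers.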

Instead of adding an index database onto a file-system-based Web server, we propose a server based on an object-oriented database, delivering the documents and the answers to search requests from the same information source. Documents are stored as objects whose internal structure represents the HTML structure, thus enabling querying for structural elements like headers, hypertext links, quotes, addresses, etc.
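As a rough illustration of documents stored as objects that mirror the HTML structure, the sketch below builds an in-memory tree and queries it for structural elements. It uses the HTML parser from Python's standard library rather than the database and parser of the prototype, and the names `Element`, `TreeBuilder`, `parse` and the `VOID` set are hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements that never have content, so they are not pushed on the open-element stack.
VOID = {"br", "img", "hr", "meta", "link", "input"}


@dataclass
class Element:
    tag: str
    attrs: dict
    children: list = field(default_factory=list)  # Element objects or plain strings

    def text(self):
        """Concatenate all text found below this element."""
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)

    def find_all(self, tags):
        """Yield every descendant whose tag is in `tags`, in document order."""
        for child in self.children:
            if isinstance(child, Element):
                if child.tag in tags:
                    yield child
                yield from child.find_all(tags)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Element("document", {})
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Element(tag, dict(attrs))
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

For example, `[e.text() for e in parse(page).find_all({"h1", "h2", "h3"})]` lists the text of all headers of levels 1 to 3 in `page`, and `find_all({"a"})` returns the hypertext links with their attributes. Unclosed or stray tags do not raise an error in this sketch; they only affect where later elements are attached in the tree.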

The most influential similar approach to enhancing the WWW is the Hyper-G project [Andrews et al. 95], which has a richer structure than the WWW, but can reduce documents to HTML in order to serve them to WWW browsers. We take a different approach by keeping the WWW architecture, to the point where we integrate an existing WWW server with an object-oriented database system.

2. Requirements and Properties<br />

When we started the development of the new server architecture, the following requirements guided the process, and resulted in the corresponding properties in the prototype system:

The new server had to be built using freely available technology wherever possible. This requirement resulted in the following choices:

Instead of building a completely new WWW server, the code for the CERN server was used. Two versions of the new server have been built: one that leaves the CERN server code completely intact, and one that uses a small modification to eliminate the need for an additional CGI script (which slows things down).

The Ode database system was chosen [Agrawal & Gehani 89], [Gehani 91], because it was and is freely available to universities (after signing a license agreement with AT&T).

The HTML parser needed to be at least as forgiving to syntax errors in HTML documents as most browsers. For this reason we have developed our own HTML parser, instead of reusing an existing parser
