28.06.2013 Views

Papers in PDF format


Gatherers and Brokers. The Gatherer is a subsystem which collects metadata (indexing information) from a set of providers (file servers). The Gatherer knows about different object types (e.g., HTML, text, PostScript) and is able to parse and extract different metadata depending on the type of object being processed. A Broker provides a search interface to information collected by one or more Gatherers or other Brokers. Harvest provides much of the system framework for handling distributed information. However, as discussed below, the basic Harvest system does not adequately address content routing to multiple Brokers when information cannot be linked a priori to a particular Broker, nor does it effectively handle query routing, where broker-level metadata is needed to assign user queries to appropriate Brokers. These areas and others motivated a number of extensions to the basic Harvest system.
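The type-dependent extraction performed by a Gatherer can be illustrated with a small dispatch table. This is a minimal sketch, not Harvest's actual summarizer code: the function names (`extract_html`, `extract_text`, `summarize`), the choice of file suffix as the type indicator, and the attribute names `Type` and `Title` are all assumptions made for illustration.

```python
import os
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_html(content):
    """Hypothetical HTML extractor: uses the <title> text as metadata."""
    parser = _TitleParser()
    parser.feed(content)
    parser.close()
    return {"Type": "HTML", "Title": parser.title.strip()}


def extract_text(content):
    """Hypothetical plain-text extractor: uses the first non-empty line."""
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    return {"Type": "Text", "Title": lines[0] if lines else ""}


# Assumed mapping from file suffix to extractor; a real Gatherer may
# recognize object types by other means.
EXTRACTORS = {".html": extract_html, ".htm": extract_html, ".txt": extract_text}


def summarize(filename, content):
    """Return metadata for a known object type, or None if the type is unknown."""
    suffix = os.path.splitext(filename)[1].lower()
    extractor = EXTRACTORS.get(suffix)
    return extractor(content) if extractor else None
```

Supporting a further object type under this design only requires registering another extractor in the table.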

System Architecture

The MIDS system architecture is given in [Fig. 1]. The Gatherers are Harvest subsystems which extract metadata from one or more providers (sources of information). The metadata format used, called the Summary Object Interchange Format (SOIF), is based on a combination of the Internet Anonymous FTP Archives (IAFA) IETF Working Group templates [Deutsch et al. 94] and BibTeX [Lamport 86]. It has an attribute-value format which is easily parsed and yet sufficiently expressive to handle many kinds of objects. The Gatherer Dissemination Service (GDS) periodically collects streams of metadata in SOIF format from a set of Harvest Gatherers. Using the collected metadata, the GDS provides a content routing capability that classifies the documents associated with the metadata into a set of topical areas, and builds a set of files containing streams of SOIF records for Brokers that have registered themselves with the GDS. It also disseminates its processed information to Brokers and the Broker Information Service (BIS). Brokers are Harvest subsystems that provide a full-text search capability for MIDS. Each Broker manages a set of topical areas and is periodically updated with SOIF records from the GDS based on the information type it desires. The BIS provides a topical browse service, enabling a user to retrieve documents based on topic selection. It also provides a query routing [Sheldon et al. 94] capability by determining which Brokers to search for a particular topical area. The client interface provides a user interface to enable effective navigation, browsing, and searching of information.
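To make the attribute-value nature of the metadata concrete, the sketch below writes and parses a single record. The layout it assumes (a header line `@TYPE { URL`, one `Name{size}:<tab>value` line per attribute, and a closing `}`) follows the commonly documented SOIF layout, but it is not taken from this paper and should be treated as an assumption; the helper names `format_record` and `parse_record` are hypothetical.

```python
import re

# Assumed header form: "@TYPE { URL" followed by a newline.
HEADER = re.compile(r"@(?P<type>[\w-]+)\s*\{\s*(?P<url>\S+)\s*\n")
# Assumed attribute form: "Name{size}:" followed by a tab, then the value.
ATTR = re.compile(r"(?P<name>[^\s{}:]+)\{(?P<size>\d+)\}:\t")


def format_record(rtype, url, attrs):
    """Serialize one record; sizes are character counts (equal to bytes for ASCII)."""
    parts = [f"@{rtype} {{ {url}\n"]
    for name, value in attrs.items():
        parts.append(f"{name}{{{len(value)}}}:\t{value}\n")
    parts.append("}\n")
    return "".join(parts)


def parse_record(text):
    """Parse one record into a dict with 'type', 'url', and 'attrs'."""
    header = HEADER.match(text)
    if header is None:
        raise ValueError("missing record header")
    record = {"type": header.group("type"), "url": header.group("url"), "attrs": {}}
    pos = header.end()
    while not text.startswith("}", pos):
        attr = ATTR.match(text, pos)
        if attr is None:
            raise ValueError(f"malformed attribute at offset {pos}")
        start = attr.end()
        size = int(attr.group("size"))
        # The explicit size lets a value span several lines without escaping.
        record["attrs"][attr.group("name")] = text[start:start + size]
        pos = start + size
        if text.startswith("\n", pos):
            pos += 1  # skip the newline that terminates the value
    return record
```

Because each value carries its own length, a parser of this kind needs no quoting rules, which is one plausible reading of "easily parsed" above.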

Information Collection

Information is collected into MIDS by utilizing Harvest Gatherers. The GDS contains a list of Gatherers that it periodically contacts for streams of metadata. This information is stored in an input queue on disk for later processing by the classification engine, which is part of the GDS. To make better use of network bandwidth, it is more efficient to run a Gatherer at each provider site, although Gatherers can access data remotely as well. Running a Gatherer locally enables it to collect metadata by directly accessing documents via file system I/O, and then to ship one compressed file of its processed information to the GDS. Running a Gatherer remotely requires that it obtain each document through the HTTP, Gopher, or FTP protocol, which incurs a much greater performance penalty.
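The collection step described above (contact each Gatherer, store the returned stream in an on-disk input queue, process it later) can be sketched as follows. This is an illustrative outline only: the class `InputQueue`, the function `collect_once`, the file naming scheme, and the use of gzip are assumptions, and each Gatherer is modeled as a plain callable returning bytes rather than a real network fetch.

```python
import gzip
from pathlib import Path
from typing import Callable, Dict, Optional


class InputQueue:
    """A directory of compressed metadata streams, consumed oldest first."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        # Continue numbering after any entries left over from an earlier run.
        existing = [int(p.name.split("-", 1)[0])
                    for p in self.directory.glob("*.soif.gz")]
        self._next_id = max(existing, default=0) + 1

    def enqueue(self, gatherer: str, stream: bytes) -> Path:
        """Store one stream; write to a partial file first, then rename."""
        name = f"{self._next_id:08d}-{gatherer}.soif.gz"
        self._next_id += 1
        final = self.directory / name
        partial = self.directory / (name + ".part")
        partial.write_bytes(gzip.compress(stream))
        partial.replace(final)  # readers never see a half-written entry
        return final

    def dequeue(self) -> Optional[bytes]:
        """Return and remove the oldest stream, or None if the queue is empty."""
        pending = sorted(self.directory.glob("*.soif.gz"))
        if not pending:
            return None
        oldest = pending[0]
        stream = gzip.decompress(oldest.read_bytes())
        oldest.unlink()
        return stream


def collect_once(queue: InputQueue, gatherers: Dict[str, Callable[[], bytes]]) -> None:
    """Contact every Gatherer once; skip those that fail with an I/O error."""
    for name, fetch in gatherers.items():
        try:
            stream = fetch()
        except OSError:
            continue  # an unreachable Gatherer is retried on the next round
        queue.enqueue(name, stream)
```

Keeping the queue on disk decouples collection from classification, so a slow classification pass does not delay the next round of contacts.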

Information Classification

There are currently two methods used to classify information within the GDS. To provide for a tailorable classification scheme, tools are provided for a knowledge engineer to define a knowledge base of classification rules based on a fixed taxonomy of topics. The knowledge engineer defines the relationships between topics of interest as well as the descriptors used to map documents into the topical hierarchy. The classification engine used to process information into the fixed taxonomy is a SIFT [Yan et al. 95] filtering engine, modified to provide a more expressive query language and to process streams of SOIF records as opposed to regular full-text documents. SIFT's output processing logic was also modified to generate more classification results, and to generate Broker-specific output files containing streams of SOIF records matching Broker interest profiles.
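The rule-based scheme above can be illustrated with a toy classifier. It is a stand-in based on simple keyword matching, not the modified SIFT engine: the taxonomy, descriptor sets, Broker profiles, and the functions `classify` and `route` are all hypothetical values chosen for the example.

```python
import re

# Hypothetical fixed taxonomy: each topic maps to its parent (None for a root).
TAXONOMY = {"computing": None, "networking": "computing", "databases": "computing"}

# Hypothetical descriptors a knowledge engineer might attach to each topic.
DESCRIPTORS = {
    "networking": {"tcp", "routing", "protocol"},
    "databases": {"sql", "transaction", "index"},
}

# Hypothetical Broker interest profiles: the topics each Broker wants.
BROKER_PROFILES = {"broker-net": {"networking"}, "broker-all": {"computing"}}


def classify(text):
    """Return every topic whose descriptors occur in the text, plus ancestors."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    topics = set()
    for topic, descriptors in DESCRIPTORS.items():
        if words & descriptors:
            # Walk up the hierarchy so a match also counts for parent topics.
            node = topic
            while node is not None:
                topics.add(node)
                node = TAXONOMY[node]
    return topics


def route(records):
    """Map each Broker to the ids of the records matching its profile."""
    output = {broker: [] for broker in BROKER_PROFILES}
    for record_id, text in records.items():
        topics = classify(text)
        for broker, profile in BROKER_PROFILES.items():
            if topics & profile:
                output[broker].append(record_id)
    return output
```

In this sketch a record that matches no descriptor is routed to no Broker; how the real system treats such records is not stated in the text.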
