
Papers in PDF format


MITRE Information Discovery System

Daniel J. Helm, Raymond J. D'Amore, Puck-Fai Yan

Digital Libraries Group

The MITRE Corporation

1820 Dolley Madison Blvd., McLean, VA 22102-3481, USA

{dhelm, rdamore, pyan}@mitre.org

Abstract: The MITRE Information Discovery System (MIDS) is a baseline system for integrating advanced processing tools for information discovery and retrieval in large-scale distributed environments. The system is built on a modular, extendible architecture that allows for system-level decoupling and allocation of component processing tools across network nodes to provide for efficient processing in distributed environments. At one level, the system provides for multi-platform user access to HTTP, Gopher, FTP, and news servers using an HTML based client interface. However, more significantly, the system provides advanced tools for metadata generation from disparate network objects, and a content routing mediation layer for classification of metadata into appropriate information brokers. This bottom-up layered information organization approach supports a wide range of information retrieval and browsing strategies.

Introduction

The MITRE Information Discovery System (MIDS) project is a multi-year MITRE-sponsored research program to develop a set of multi-faceted capabilities for collecting, categorizing, organizing, and discovering digital information in a distributed environment. MIDS tools support not only information management and retrieval, but also post-retrieval operations for summarizing and presenting information to the user.

A white pages directory organizes items by name (e.g., Internet user names), while a yellow pages directory organizes information by attributes. MIDS is intended to support a dynamic yellow pages capability; that is, information is organized by "topic," and can be dynamically reorganized as information collections change or new collections are added. The system consists of an information organizer that groups the contents of multiple collections topically, and a multi-broker network that uses object summaries to create a type of adaptable subject catalog that supports browsing and searching specific collections. The organizer and broker are the primary components that provide the foundation for the MIDS architectural framework.
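The difference between the two directory styles, and what "dynamic" reorganization means, can be illustrated with a minimal sketch. The `Directory` class and its method names are hypothetical and are not part of MIDS; the sketch only assumes that each item carries a name and a set of topic labels.

```python
from collections import defaultdict


class Directory:
    """Toy directory with a white pages view (by name) and a
    yellow pages view (by topic). Hypothetical; not the MIDS design."""

    def __init__(self):
        self.white = {}                 # name -> set of topics
        self.yellow = defaultdict(set)  # topic -> set of names

    def remove(self, name):
        # Drop the item from every topic it was listed under.
        for topic in self.white.pop(name, set()):
            self.yellow[topic].discard(name)
            if not self.yellow[topic]:
                del self.yellow[topic]  # forget topics that became empty

    def add(self, name, topics):
        # Re-adding a name replaces its old topic assignments, which is
        # the "dynamic reorganization" step in this toy model.
        self.remove(name)
        self.white[name] = set(topics)
        for topic in topics:
            self.yellow[topic].add(name)

    def by_name(self, name):
        return sorted(self.white.get(name, ()))

    def by_topic(self, topic):
        return sorted(self.yellow.get(topic, ()))


d = Directory()
d.add("doc1", ["retrieval", "nlp"])
d.add("doc2", ["retrieval"])
d.add("doc1", ["summarization"])  # collection changed: doc1 is re-filed
```

After the last call, `doc1` is listed only under its new topic, and the topic that lost its only member disappears from the yellow pages view.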

The main emphasis of the program so far has been on the effective integration of advanced information retrieval and natural language processing capabilities, since there are a number of important system integration issues and effectiveness and efficiency concerns that are critical to developing even a rudimentary pilot system. While the short-term objective is not to support a large-scale system evaluation such as the one currently being supported by ARPA's Tipster program, the current work will focus on selected operationally based, end-user assessments of system effectiveness and efficiency.

MITRE Information Discovery System

Essentially, MIDS is organized around several key functions: collection, indexing, routing, organization, storage, retrieval, and browsing. In traditional commercial off-the-shelf systems, these functions are typically handled as tightly integrated components; MIDS instead provides them as separate modules that can communicate within a networked environment. This distributed, modular approach provides flexibility in terms of the methods employed and in defining extensions to the baseline architecture.
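As a rough illustration of why separating these functions adds flexibility, the following sketch treats two of them (indexing and routing) as independent stages that share one record format. All names, the record layout, and the term-overlap routing rule are assumptions made for this example; they do not describe the actual MIDS modules, which communicate across network nodes rather than within one process.

```python
from typing import Any, Callable, Dict, Iterable, List, Set

Record = Dict[str, Any]
Stage = Callable[[Record], Record]


def index(record: Record) -> Record:
    # Hypothetical indexing stage: unique lower-cased words of the text.
    terms = sorted(set(str(record["text"]).lower().split()))
    return {**record, "terms": terms}


def make_router(brokers: Dict[str, Set[str]]) -> Stage:
    # Hypothetical routing stage: send each record to the broker whose
    # vocabulary shares the most terms with it. Ties (including zero
    # overlap) go to the broker listed first.
    def route(record: Record) -> Record:
        terms = set(record["terms"])
        best = max(brokers, key=lambda name: len(brokers[name] & terms))
        return {**record, "broker": best}

    return route


def run_pipeline(records: Iterable[Record], stages: List[Stage]) -> List[Record]:
    # Stages only agree on the record format, so any stage can be
    # replaced or moved elsewhere without touching the others.
    results = []
    for record in records:
        for stage in stages:
            record = stage(record)
        results.append(record)
    return results


brokers = {
    "retrieval": {"search", "query", "index"},
    "nlp": {"parsing", "summarization", "language"},
}
docs = [
    {"url": "a", "text": "Query and search tips"},
    {"url": "b", "text": "Natural language parsing"},
]
routed = run_pipeline(docs, [index, make_router(brokers)])
```

Swapping `index` for a different term extractor, or `make_router` for a different classifier, changes one entry in the stage list and nothing else.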

The MIDS project draws heavily on both MITRE-developed technologies and tools available in the public domain. As a result, the system incorporates technologies for metadata extraction, clustering, filtering, retrieval, and document summarization, with new development focused on specialized problems not readily addressed by off-the-shelf capabilities. In a sense, MIDS is both an integration framework and a set of tools.

As an example, the integration framework is centered around the Harvest [Bowman et al. 94] system. Harvest is a scalable, customizable discovery and access system developed by the Internet Research Task Force Research Group on Resource Discovery (IRTF-RD). Major Harvest components utilized include
