28.06.2013 Views

Papers in PDF format


(Uniform Resource Locators) are presented. The user can then select a document, which is subsequently retrieved from the appropriate server (e.g., HTTP, Gopher, FTP) at the provider site.

Post-retrieval Tools

The system is designed to support a wide range of post-retrieval tools that help users process information after it is retrieved. The main services provided include document summarization and abstracting. These services use a Part-of-Speech (POS) tagger based on the work of Eric Brill [Brill 93], which MITRE has modified for better performance. Processing documents with a POS tagger allows us to abstract and summarize based on parts of speech. In the initial capability, a tagged document can be filtered for verb and noun forms, thereby reducing the word volume of the document. We eliminate common verb forms to avoid biasing results. The remaining verb and noun forms are then counted to produce a vector for the document. This vector can be used for clustering or passed forward to the summarization process. The summarization process examines the verb and noun vector for the document and selects the most frequently used words in the document. Those words are then used to score each sentence of the document. The highest-scoring sentences are then presented as a summary in the order in which they appeared in the original document. The number of sentences chosen is relative to the size of the document, but will not exceed a maximum threshold.
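The summarization steps described above can be illustrated with a minimal sketch. It is not the MITRE implementation: it assumes the document has already been tagged (here with Penn Treebank-style tags such as `NN` and `VB`), and the names `document_vector` and `summarize`, the list of common verbs, and the parameters `top_words`, `ratio`, and `max_sentences` are all hypothetical choices made for illustration.

```python
from collections import Counter

# Assumption: Penn Treebank-style tags; the text only says "verb and noun forms".
CONTENT_TAG_PREFIXES = ("NN", "VB")

# Assumption: a small illustrative list of common verb forms to discard.
COMMON_VERBS = {
    "be", "is", "are", "was", "were", "been", "being", "am",
    "have", "has", "had", "do", "does", "did",
}


def document_vector(tagged_sentences):
    """Count noun and verb forms in the document, skipping common verbs."""
    counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            token = word.lower()
            if tag.startswith(CONTENT_TAG_PREFIXES) and token not in COMMON_VERBS:
                counts[token] += 1
    return counts


def summarize(tagged_sentences, top_words=10, ratio=0.2, max_sentences=5):
    """Return the highest-scoring sentences in their original order."""
    vector = document_vector(tagged_sentences)
    keywords = {word for word, _ in vector.most_common(top_words)}

    # A sentence's score is the number of its tokens that are keywords.
    scores = [
        sum(1 for word, _ in sentence if word.lower() in keywords)
        for sentence in tagged_sentences
    ]

    # Summary length grows with document size but is capped (hypothetical rule).
    limit = min(max_sentences, max(1, round(len(tagged_sentences) * ratio)))

    # Pick the top-scoring sentences; ties go to the earlier sentence.
    best = sorted(range(len(tagged_sentences)), key=lambda i: (-scores[i], i))[:limit]

    # Present the chosen sentences in document order.
    return [tagged_sentences[i] for i in sorted(best)]
```

Each sentence is passed as a list of `(word, tag)` pairs, so any tagger that produces that shape could feed this sketch. The same `Counter` returned by `document_vector` could also serve as the input to a clustering step.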

Experimental System

We have built a prototype MIDS to demonstrate our ideas and to evaluate and extend the system. The system currently collects and processes approximately 20,000 documents from a number of servers at MITRE. We currently have one Harvest Gatherer collecting information from three MITRE servers, four Harvest Brokers providing the full-text search functionality, one GDS, and one BIS. Within the GDS classification knowledge base, we have defined a number of fixed taxonomies covering the areas of business, government, science, and education. Within the area of science, we have incorporated a subset of the ACM Computing Reviews Classification System. [1]

Example Session

Our prototype system initially presents the user with the top-level screen shown in [Fig. 2]. The screen displays the high-level topical categories as well as a list of tools, including “Search” (full-text search). [Fig. 3] shows a subset of the topics displayed as the result of selecting the “Science” topical area. The list box presents the hierarchy of topics beginning with the selected root category. Indentation is used to show the topic/sub-topic relationship. The number to the right of each leaf topic is the number of documents assigned to that area. A user can select one or more topics from the list box, select the “View Documents” button, and receive a breakdown of all documents in each selected area, as shown in [Fig. 4]. The score associated with each document in the list represents how similar the document is to the topical profile. Optionally, a user can specify “List groups” under “Results Type” to receive a clustered listing of documents associated with the selected topics. [Fig. 5] shows a sample clustered view of documents. Each cluster of documents is preceded by a list of the top discriminating terms, with the left-most terms being the highest discriminators. The score associated with each document in a cluster corresponds to how similar the document is to the discriminating term list (cluster centroid). The user can then click on a document title to have it retrieved from the provider site and displayed with topical descriptors highlighted in the text [Fig. 6], or click on “Summary Information” to see a summary of the document. The summary includes a list of other topics assigned to the document as a result of the classification process, and an abstract of the document generated by processing it through the Brill tagger software. [Fig. 7] shows a sample document summary. Within a topic list screen, a user can select the “Search” function to perform a keyword search against all Harvest Brokers that manage information associated with the selected topical area. The query routing service provided by the BIS determines the appropriate Brokers to query, and the results from the different Brokers are combined into a single list similar to that shown in [Fig. 4].
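The query routing and result merging described above might be sketched as follows. The text does not say how the BIS selects Brokers or how the combined list is ordered, so everything here is an assumption: the topic-to-Broker table, the `Hit` record, the `routed_search` function, and the rule of keeping the highest-scoring copy of a duplicate document are hypothetical, and the sketch assumes that scores from different Brokers are directly comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search result; the fields are illustrative, not the Harvest format."""
    url: str
    title: str
    score: float


def route(topic, topic_to_brokers):
    """Look up which brokers manage information for a topic (hypothetical table)."""
    return topic_to_brokers.get(topic, [])


def routed_search(topic, query, topic_to_brokers, brokers):
    """Query each routed broker and merge the hits into a single ranked list.

    `brokers` maps a broker name to a callable that takes a query string
    and returns a list of Hit objects.
    """
    best = {}
    for name in route(topic, topic_to_brokers):
        for hit in brokers[name](query):
            # If several brokers return the same URL, keep the best-scoring copy.
            if hit.url not in best or hit.score > best[hit.url].score:
                best[hit.url] = hit
    # Highest score first; URL breaks ties so the order is deterministic.
    return sorted(best.values(), key=lambda h: (-h.score, h.url))
```

In a real deployment the Broker scores would likely need normalizing before they are merged, since independent indexes rarely produce scores on the same scale.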

[1] The ACM Computing Classification System is Copyright (c) 1996 by the Association for Computing Machinery and is included here with permission.
