28.06.2013 Views

Papers in PDF format


the choice should be. It may be that the page-level index suffices for all retrieval, and the document-level one should be abandoned. Perhaps the problem can be solved by a more sophisticated information display mechanism: for example, TileBars [Hearst 1995] provide a way to visualize the distribution of query terms throughout a document that should be easy to integrate into our system. A number of other searching mechanisms would be very useful. Mixed Boolean and ranked queries would allow users greater control over the documents that are retrieved. Browsing by location needs some human pre-processing to identify machine locations by institution rather than by network address. Even simpler, and also useful, would be to allow users to look at the directory that a particular report comes from, because sometimes multipart documents are stored as separate files. Finally, co-location queries allow users to examine which words occur together; this is useful for textual analysis purposes.
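As an illustration of two of the searching mechanisms mentioned above, the following is a minimal sketch, not the system's actual implementation. It assumes a small in-memory positional index; the function names (`build_index`, `mixed_query`, `colocated`), the term-count ranking, and the window-based definition of co-location are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a toy positional full-text index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def mixed_query(index, required, ranked):
    """Mixed query: Boolean AND over `required`, then rank by `ranked` term counts.

    The ranking here is a plain occurrence count, an assumption for brevity;
    a real system would likely use a weighted similarity measure.
    """
    candidates = None
    for term in required:
        ids = set(index.get(term, {}))
        candidates = ids if candidates is None else candidates & ids
    if candidates is None:  # no Boolean constraint: consider every indexed document
        candidates = {d for postings in index.values() for d in postings}
    scores = {
        d: sum(len(index.get(t, {}).get(d, [])) for t in ranked)
        for d in candidates
    }
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def colocated(docs, term, window=2):
    """Count the words that occur within `window` positions of `term`."""
    counts = Counter()
    for text in docs.values():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts
```

Under these assumptions, `mixed_query(index, ["digital"], ["text"])` restricts the result to documents containing "digital" and orders them by how often "text" occurs, while `colocated(docs, "full", window=1)` tallies the immediate neighbours of "full" across the collection.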

Conclusion

We have distinguished two opposing philosophies for the construction of networked digital libraries. One is to have contributors supply bibliographic details for the information that they place in the collection. The other is to garner the collection automatically and use a full-text index for access instead of an ordinary library catalogue.

This paper has focused on the second approach. With it, public digital libraries can be constructed on the Internet entirely automatically, in any area for which repositories of suitable text can be located. Extracting all index information from the documents themselves is feasible if full-text indexing is used, and eliminates the manual cataloguing effort involved in creating and maintaining the library. The material in such a library can be distributed globally. The central database in our prototype library, which comprises a full-text index, facsimile images of the first page or two of each document, and the plain text of all documents, represents only 10% of the collection size.

Looking into the future, it is likely that new generations of digital library will marry the two approaches. The first offers high-quality cataloguing information, while the second provides significantly increased coverage. Improved techniques for information extraction from text, along with large public-domain bibliographies, offer the possibility of being able to match reports in a collection with items in a bibliography file, thus providing cataloguing information at no additional cost. Moreover, personal contributions to people-oriented research databases like Hypatia are likely to provide more authoritative reference information than do general bibliographies, so bibliographic quality can be increased by amalgamating information from different sources, tagged according to likely reliability.
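The idea of amalgamating bibliographic information tagged by likely reliability can be sketched as follows. This is only an illustrative sketch under stated assumptions: the `Claim` record, the numeric `reliability` score, and the "highest score wins per field" rule are hypothetical and are not described in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One bibliographic field value as reported by one source (hypothetical type)."""
    field: str          # e.g. "title", "author", "year"
    value: str
    source: str         # label for where the value came from
    reliability: float  # assumed score in [0, 1]; higher means more trusted


def amalgamate(claims):
    """For each field, keep the value from the most reliable source.

    Returns {field: (value, source)} so every value stays tagged with its origin.
    On equal reliability the claim seen first is kept.
    """
    best = {}
    for claim in claims:
        current = best.get(claim.field)
        if current is None or claim.reliability > current.reliability:
            best[claim.field] = claim
    return {field: (c.value, c.source) for field, c in best.items()}
```

With this rule, a value supplied personally by an author would override one taken from a general bibliography whenever it is given a higher score, while fields that only one source covers are carried through unchanged.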

Finally, there is the problem of distribution. Architectures like HARVEST and NCSTRL provide a distributed infrastructure that underpins all aspects of the collection. However, there is a danger in being too distributed: whereas users want to see a unified system, these schemes allow sites to provide their own browsing software through which their repository must be viewed, and this nonuniformity can be a great annoyance in practice. Web search engines, in contrast, are not distributed, though they may involve multiprocessors accessing the same database in order to provide adequate power. Again, we will probably see an amalgamation of the two approaches, and indeed distributed indexes expressly designed for multi-collection environments are a current research topic in the information retrieval community.

Digital libraries represent one way of dealing with the new reality of Internet publishing. Making a minimum of assumptions, a library based on full-text retrieval imposes structure on a fundamentally anarchic, uncatalogued system, giving information consumers a tool to find what they need.

References<br />

[Bowman et al. 1994] Bowman, C.M., Danzig, P.B., Manber, U., and Schwartz, M.F. (1994) “Scalable Internet resource discovery: Research problems and approaches,” Communications of the ACM 37(8), pp. 98–107.

