28.02.2013

Sharing Knowledge: Scientific Communication - SSOAR


162 Judith Plümer

The summarizers generate SOIF records for each document. SOIF was designed for the storage and exchange of summaries of documents that may originate in different formats. It was built as part of the Harvest software. The structure of SOIF is simple and effective: a SOIF object consists of multiple attribute-value pairs. For example,

author{16}: Erwin Mustermann

is such a pair; the number in braces gives the number of characters of the value. There are no restrictions on the number and names of attributes.
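The notation above can be illustrated with a minimal Python sketch. It only handles a single pair written as `name{n}: value`, as in the example. It is not the complete SOIF syntax of Harvest, and the function names `format_pair` and `parse_pair` are hypothetical.

```python
def format_pair(name: str, value: str) -> str:
    """Serialize one attribute-value pair in the notation used above."""
    # The doubled braces produce literal '{' and '}' around the length.
    return f"{name}{{{len(value)}}}: {value}"


def parse_pair(line: str) -> tuple[str, str]:
    """Parse 'name{n}: value' and check n against the value length."""
    head, _, rest = line.partition("}: ")
    name, _, size = head.partition("{")
    # The declared size, not a delimiter, decides where the value ends.
    value = rest[: int(size)]
    if len(value) != int(size):
        raise ValueError(f"value shorter than declared size {size}")
    return name, value


print(format_pair("author", "Erwin Mustermann"))
print(parse_pair("author{16}: Erwin Mustermann"))
```

Because the length is stated explicitly, a reader of the record does not need to escape or quote the value; it simply takes the announced number of characters.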

This principle of attribute-value pairs makes SOIF equivalent to the use of the HTML2.0 META tag with its NAME-CONTENT pairs. But it can also be used to store information from the body of an HTML document or from other formats. For example, the HTML summarizer of Harvest assigns all words enclosed by a particular pair of opening and closing tags to the header attribute.

Obviously, it makes sense to build such attribute-value pairs from any markup language such as HTML, SGML, XML or TeX/LaTeX.
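How a summarizer might turn marked-up text into attribute-value pairs can be sketched with Python's standard `html.parser` module. This is not the Harvest summarizer. The class `TagSummarizer`, the function `summarize` and, in particular, the mapping from the `h1` tag to a `header` attribute are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class TagSummarizer(HTMLParser):
    """Collect the words found inside selected tags into named attributes."""

    def __init__(self, tag_to_attribute: dict[str, str]):
        super().__init__()
        self.tag_to_attribute = tag_to_attribute  # e.g. {"h1": "header"} (assumed)
        self.open_tags: list[str] = []
        self.attributes: dict[str, list[str]] = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in self.tag_to_attribute:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        # Text is credited to every selected tag that is currently open.
        for tag in self.open_tags:
            attribute = self.tag_to_attribute[tag]
            self.attributes.setdefault(attribute, []).extend(data.split())


def summarize(html: str, tag_to_attribute: dict[str, str]) -> dict[str, str]:
    parser = TagSummarizer(tag_to_attribute)
    parser.feed(html)
    parser.close()
    return {name: " ".join(words) for name, words in parser.attributes.items()}


page = "<html><body><h1>Sharing Knowledge</h1><p>Body text</p></body></html>"
print(summarize(page, {"h1": "header"}))
```

The same pattern would carry over to other markup languages by replacing the parser and the tag table.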

The summarizers for documents in a typeset format such as .dvi, .ps or .pdf work differently: they only store the first 100 words of these documents as keywords because the text is not semantically structured. One could think of criteria to extract more structure out of these unstructured documents via methods of artificial intelligence, heuristics or the use of thesauri (see below).
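The keyword rule for typeset formats amounts to a simple truncation. The sketch below assumes that the plain text has already been extracted from the .dvi, .ps or .pdf file (that step is not shown) and that words are separated by whitespace. The function name is hypothetical.

```python
def first_words_as_keywords(text: str, limit: int = 100) -> str:
    """Return the first `limit` whitespace-separated words of `text`."""
    return " ".join(text.split()[:limit])


sample = " ".join(f"w{i}" for i in range(250))
keywords = first_words_as_keywords(sample)
print(len(keywords.split()))
```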

The summarizers of Harvest also store metadata such as the MIME type of the original document, its URL, the URLs of links contained in the document, the name and host of the gathering agent, the time of generation, the update time and the time to live. All these pieces of information play an important role for the system because the information is not constant. They are used to control the expiration times.
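How the time fields could control expiration is sketched below. The text does not say from which moment the time to live is counted. Counting it from the update time, and storing all times as seconds since the epoch, are assumptions of this sketch, as are the class and field names.

```python
from dataclasses import dataclass


@dataclass
class RecordTimes:
    generated: int      # time of generation, seconds since the epoch
    updated: int        # time of the last update, seconds since the epoch
    time_to_live: int   # lifetime in seconds, counted from `updated` (assumed)

    def expires_at(self) -> int:
        return self.updated + self.time_to_live

    def is_expired(self, now: int) -> bool:
        return now >= self.expires_at()


times = RecordTimes(generated=0, updated=1000, time_to_live=500)
print(times.expires_at(), times.is_expired(now=1200))
```

A gatherer could use such a check to decide which records need to be fetched and summarized again.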

Collection of documents and problems of heterogeneity<br />

Since SOIF objects are just ASCII files, they can be transferred by any protocol as SOIF streams. Harvest builds a database or an index of SOIF objects by using Glimpse, which makes the attribute-value pairs searchable.
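What it means for attribute-value pairs to be searchable can be shown with a small inverted index. This is a plain dictionary-based stand-in and not the indexing method of Glimpse. The record identifiers, the example URLs and the function names are hypothetical.

```python
from collections import defaultdict


def build_index(records: dict[str, dict[str, str]]) -> dict[tuple[str, str], set[str]]:
    """Map (attribute, lowercased word) to the ids of records containing it."""
    index: dict[tuple[str, str], set[str]] = defaultdict(set)
    for record_id, attributes in records.items():
        for name, value in attributes.items():
            for word in value.lower().split():
                index[(name, word)].add(record_id)
    return index


def search(index, attribute: str, word: str) -> set[str]:
    """Return the ids of all records whose `attribute` contains `word`."""
    return set(index.get((attribute, word.lower()), set()))


records = {
    "http://example.org/a": {"author": "Erwin Mustermann", "title": "Sharing Knowledge"},
    "http://example.org/b": {"author": "Erika Mustermann", "title": "Erwin on Indexing"},
}
index = build_index(records)
print(search(index, "author", "Erwin"))
```

Keying the index on the attribute name as well as the word keeps a search for an author from matching the same word in a title.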

A problem arises when upgrading to HTML4.0 or to XML with metadata in RDF (Resource Description Framework). In MPRESS we currently store the SCHEME information of the HTML4.0 META tag as a qualifier. For example,

<META NAME="DC.subject" SCHEME="msc" CONTENT="19D10">

is mapped to<br />

DC.subject.msc{5}: 19D10
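The mapping of the SCHEME information to a qualifier can be sketched as follows. The sketch assumes that the scheme is appended to the attribute name after a dot and written in lower case, which matches the result shown above but is not confirmed as the MPRESS implementation. The function name `meta_to_pair` is hypothetical.

```python
from typing import Optional


def meta_to_pair(name: str, content: str, scheme: Optional[str] = None) -> str:
    """Append the lowercased SCHEME to the attribute name as a qualifier."""
    attribute = f"{name}.{scheme.lower()}" if scheme else name
    # Same 'name{n}: value' notation as the pairs shown in the text.
    return f"{attribute}{{{len(content)}}}: {content}"


print(meta_to_pair("DC.subject", "19D10", scheme="MSC"))
print(meta_to_pair("DC.creator", "Erwin Mustermann"))
```

A META element without a SCHEME keeps its plain name, so records from HTML2.0 pages and HTML4.0 pages can share one index.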
