
Sharing Knowledge: Scientific Communication - SSOAR


MPRESS - transition of metadata formats

The gathering part of the Harvest software [Hardy, 1996, Technical Report], which is used in Math-Net, is responsible for this collecting process. It is a configurable robot. Gathering documents does not mean making copies and keeping them somewhere. It means taking a temporary copy, extracting the relevant information from this copy, storing this information in a database and deleting the temporary copy.
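The four steps of gathering can be sketched as follows. This is a minimal illustration, not the Harvest implementation: the function `gather` and its `fetch`, `extract` and `store` parameters are hypothetical names introduced here.

```python
import os
import tempfile


def gather(url, fetch, extract, store):
    """Gather one document: temporary copy, extract, store, delete.

    fetch(url) -> bytes, extract(path) -> summary and the dict-like
    store are hypothetical stand-ins for the real components.
    """
    data = fetch(url)
    # Step 1: take a temporary copy of the document.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Step 2: extract the relevant information from the copy.
        summary = extract(path)
        # Step 3: store only the extracted information.
        store[url] = summary
    finally:
        # Step 4: delete the temporary copy in every case.
        os.remove(path)
    return path
```

Only the summary survives the call; the path is returned solely so that a caller can verify that the temporary copy is gone.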

Each gathering agent is defined by several configuration files that an administrator has to create. One configuration file contains the URLs that the agent should gather and evaluate, together with a number of parameters that specify whether it should run recursively, at which depth the recursion should stop, how many documents should be gathered at most, which documents should be included or excluded during the recursion, how many different hosts may be visited during the recursion and which protocols should be used (ftp, http).
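To make the role of these parameters concrete, the sketch below models them as a small data structure with a check that decides whether a URL may be visited. It does not reproduce the syntax of the Harvest configuration files; `GatherConfig`, `may_visit` and all field names and default values are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class GatherConfig:
    """Hypothetical model of one agent's gathering parameters."""
    root_urls: list
    recursive: bool = True
    max_depth: int = 2                 # depth at which recursion stops
    max_documents: int = 250           # upper bound on gathered documents
    max_hosts: int = 1                 # distinct hosts that may be visited
    include: str = r".*"               # regex a URL must match
    exclude: Optional[str] = None      # regex a URL must not match
    protocols: tuple = ("http", "ftp")


def may_visit(cfg, url, depth, hosts_seen, docs_gathered):
    """Return True if the agent may fetch url at the given recursion depth."""
    parts = urlparse(url)
    if parts.scheme not in cfg.protocols:
        return False
    if depth > 0 and not cfg.recursive:
        return False
    if depth > cfg.max_depth or docs_gathered >= cfg.max_documents:
        return False
    # A new host is only allowed while the host limit is not reached.
    if parts.hostname not in hosts_seen and len(hosts_seen) >= cfg.max_hosts:
        return False
    if cfg.exclude and re.search(cfg.exclude, url):
        return False
    return re.search(cfg.include, url) is not None
```

A crawler loop would call `may_visit` for every link it discovers and update `hosts_seen` and `docs_gathered` after each successful fetch.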

The other configuration files can be used in their default form or be modified, too. They specify how to handle different MIME types and formats, how to interpret suffixes, and how to summarize different formats.

The gatherer pipes each retrieved document through the essence machine [Hardy, 1996, Trans. Comp. Sci], which generates summaries of the documents in SOIF [Hardie, 1999] (Summary Object Interchange Format, see below). The SOIF documents are stored by the gatherer and the original resources are deleted.
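As a rough illustration of what such a summary looks like, the sketch below writes a record in the commonly documented SOIF layout: a template type and URL, followed by attribute lines that carry the byte length of each value in braces. The function name `to_soif`, the attribute names and the choice of UTF-8 for counting bytes are assumptions for this example and are not taken from the Harvest sources.

```python
def to_soif(template_type, url, attributes):
    """Serialize one summary as a SOIF-style record (illustrative sketch)."""
    lines = [f"@{template_type} {{ {url}"]
    for name, value in attributes.items():
        # The number in braces is the size of the value in bytes
        # (assumption: values are encoded as UTF-8).
        size = len(value.encode("utf-8"))
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


record = to_soif(
    "FILE",
    "http://example.org/preprint.html",
    {"Title": "A preprint", "Author": "N. N."},
)
```

Because every value is preceded by its length, a reader can skip over values without scanning them for delimiters, which also allows values that span several lines.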

The SOIF documents of one agent reside in a gzip-compressed ASCII file together with some additional information. For example, there is a small database storing the MD5 [Rivest, 1992] checksums of the original files to avoid multiple copies of one document. Internally, the MD5 checksum is also used to check whether a document has changed when the agent revisits a site. (The agents of MPRESS run every other week.) If the document has not changed, the essence process does not have to be run on it again. In this case only the time-to-live (TTL) of the respective SOIF record is updated. This saves local computing power, but it does not reduce network load, which would be desirable. What we wanted, therefore, was an incrementally running agent. The original Harvest software was modified accordingly, so that it runs incrementally for MPRESS.
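The checksum-based change detection described above can be sketched as follows. This is a simplified model, not the Harvest code: `update_record`, the layout of the record dictionary and the two-week default TTL are assumptions chosen to mirror the text.

```python
import hashlib
import time

TWO_WEEKS = 14 * 24 * 3600  # assumed TTL, matching the biweekly agent runs


def update_record(records, url, content, summarize, ttl=TWO_WEEKS, now=None):
    """Re-summarize a document only if its MD5 checksum has changed.

    Returns True if the summarizer was run, False if only the TTL
    of the existing record was refreshed.
    """
    now = time.time() if now is None else now
    digest = hashlib.md5(content).hexdigest()
    record = records.get(url)
    if record is not None and record["md5"] == digest:
        # Unchanged document: skip summarizing, just extend the lifetime.
        record["expires"] = now + ttl
        return False
    records[url] = {
        "md5": digest,
        "summary": summarize(content),
        "expires": now + ttl,
    }
    return True
```

Note that `content` has to be downloaded before the checksum can be computed, which is why this scheme saves computation but not network traffic.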

There are summarizers for a variety of formats. The pages on the Web that contain mathematically relevant information are usually stored in PostScript, PDF or HTML. The summarizers of the Harvest software do not handle 8-bit and Unicode characters correctly. We therefore had to modify the PostScript and HTML summarizers to interpret at least umlauts and "ß" for German needs, so that queries for German names return correct results. This problem is solved with X-Harvest and HyREX because they are able to handle Unicode characters (see below).
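For the HTML case, the kind of normalization that is needed can be illustrated with a short sketch. It is not the modification made to the Harvest summarizers; the function `normalize_html_text` and the assumption that the input is Latin-1 encoded are introduced here for illustration only.

```python
import html


def normalize_html_text(raw, encoding="latin-1"):
    """Turn 8-bit bytes and HTML character references into Unicode text."""
    # Decode the 8-bit input (assumption: ISO 8859-1 / Latin-1).
    text = raw.decode(encoding)
    # Resolve references such as &ouml; or &szlig; to real characters.
    return html.unescape(text)


names = normalize_html_text(b"Wei\xdf, M\xfcller, G&ouml;del")
```

After this step, a name is indexed in one canonical form regardless of whether the page used a raw 8-bit byte or a character reference, so a query for the name matches both.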
