Sharing Knowledge: Scientific Communication - SSOAR
MPRESS - transition of metadata formats<br />
The gathering part of the Harvest software [Hardy, 1996, Technical Report],<br />
which is used in Math-Net, is responsible for this collecting process. It is a configurable<br />
robot. Gathering a document does not mean copying it<br />
and keeping the copy somewhere. It means taking a temporary copy, extracting<br />
the relevant information from this copy, storing this information in a database,<br />
and deleting the temporary copy.<br />
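The gathering cycle described above can be sketched as follows. This is a minimal illustration, not the Harvest implementation: the function `gather_one` and its parameters `fetch`, `summarize` and `store` are hypothetical names introduced here, and a plain dictionary stands in for the database.<br />

```python
import os
import tempfile


def gather_one(url, fetch, summarize, store):
    """Fetch one document, keep only its summary, and delete the temporary copy.

    fetch(url) returns the document as bytes, summarize(path) returns the
    extracted information, and store is a dict standing in for the database.
    All three are assumptions made for this sketch.
    """
    data = fetch(url)
    fd, path = tempfile.mkstemp()
    try:
        # Write the temporary copy to disk.
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # Extract the relevant information and store it.
        store[url] = summarize(path)
    finally:
        # The temporary copy is removed even if summarizing fails.
        os.remove(path)
    return path
```

After a call to `gather_one`, only the summary remains in `store`; the file at the returned path no longer exists.<br />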
Each gathering agent is defined by several configuration files that have to<br />
be created by an administrator. One configuration file contains the<br />
URLs that the agent should gather and evaluate, together with parameters<br />
that specify whether it should run recursively, at which depth the recursion<br />
should stop, how many documents should be gathered at most, which documents<br />
should be included or excluded during the recursion, how many different<br />
hosts may be visited during the recursion, and which protocols should be<br />
used (ftp, http).<br />
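To make the role of these parameters concrete, the following sketch models them as a small Python data structure with a check that decides whether a URL may be visited. It does not reproduce the syntax of the Harvest configuration files; the class `GathererConfig`, the function `may_visit`, all field names and the default values are assumptions chosen for illustration.<br />

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class GathererConfig:
    """Hypothetical container for the parameters named in the text."""
    root_urls: list
    recursive: bool = True
    max_depth: int = 2            # depth at which the recursion stops
    max_documents: int = 250      # upper bound on gathered documents
    max_hosts: int = 1            # number of different hosts that may be visited
    allowed_schemes: tuple = ("http", "ftp")
    include_patterns: list = field(default_factory=list)
    exclude_patterns: list = field(default_factory=list)


def may_visit(config, url, depth, seen_hosts, gathered_count):
    """Return True if the agent is allowed to gather url at the given depth."""
    parts = urlparse(url)
    if parts.scheme not in config.allowed_schemes:
        return False
    if depth > 0 and not config.recursive:
        return False
    if depth > config.max_depth:
        return False
    if gathered_count >= config.max_documents:
        return False
    # A new host is only accepted while the host limit is not yet reached.
    if parts.hostname not in seen_hosts and len(seen_hosts) >= config.max_hosts:
        return False
    if any(re.search(p, url) for p in config.exclude_patterns):
        return False
    if config.include_patterns and not any(
        re.search(p, url) for p in config.include_patterns
    ):
        return False
    return True
```

In this sketch, exclusion patterns take precedence over inclusion patterns, and an empty inclusion list means that every URL not excluded is accepted. Whether Harvest applies the same precedence is not stated here.<br />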
The other configuration files can be used in their default form or be modified,<br />
too. They specify how to handle different MIME types<br />
and formats, how to interpret file suffixes, and how to summarize different<br />
formats.<br />
The gatherer pipes each retrieved document through the Essence machine<br />
[Hardy, 1996, Trans. Comp. Sci], which generates summaries of the documents in<br />
SOIF [Hardie, 1999] (Summary Object Interchange Format, see below). The<br />
SOIF documents are stored by the gatherer and the original resources are deleted.<br />
The SOIF documents of one agent reside in a gzip-compressed ASCII file together with<br />
some additional information. For example, there is a small database storing the<br />
MD5 [Rivest, 1992] checksums of the original files to avoid multiple copies of one<br />
document. Internally, the MD5 checksum is also used to check whether a document<br />
has changed when the agent revisits a site.<br />
(The agents of MPRESS run every other week.) If the document has not<br />
changed, the Essence process does not have to be run on it again.<br />
In this case only the time-to-live (TTL) of the respective SOIF record is<br />
updated. This saves local computing power, but it does not reduce the network load,<br />
because the document still has to be retrieved to compute its checksum. Reducing the<br />
network load would be desirable, so what we wanted was an incrementally running agent.<br />
The original Harvest software was therefore modified so that it runs incrementally<br />
for MPRESS.<br />
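The MD5-based change check can be sketched as follows. This is an illustration under stated assumptions, not the Harvest code: the function `update_record`, the record layout (`md5`, `summary`, `expires`) and the parameters `now` and `ttl_seconds` are hypothetical, and a dictionary stands in for the checksum database.<br />

```python
import hashlib


def update_record(records, url, content, now, ttl_seconds, summarize):
    """Re-summarize a document only if its MD5 checksum has changed.

    records maps URLs to dicts with the keys "md5", "summary" and "expires".
    Returns True if the summarizer was run, False if only the TTL was refreshed.
    """
    digest = hashlib.md5(content).hexdigest()
    record = records.get(url)
    if record is not None and record["md5"] == digest:
        # Unchanged document: skip the summarizer and extend the lifetime.
        record["expires"] = now + ttl_seconds
        return False
    records[url] = {
        "md5": digest,
        "summary": summarize(content),
        "expires": now + ttl_seconds,
    }
    return True
```

Note that `content` has to be downloaded before the checksum can be computed, which is why this check alone saves computing time but no network traffic.<br />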
There are summarizers for a variety of formats. The pages on the Web that<br />
contain mathematically relevant information are usually stored in PostScript,<br />
PDF or HTML. The summarizers of the Harvest software do not handle 8-bit and<br />
Unicode characters correctly. We therefore had to modify the PostScript and HTML<br />
summarizers to interpret at least umlauts and "ß" for German needs, so that<br />
queries for German names return correct results. This problem is solved<br />
with X-Harvest and HyREX because they are able to handle Unicode<br />
characters (see below).
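
For the HTML case, the kind of character handling that is needed can be illustrated with a short sketch. It is not the modification made to the Harvest summarizers; the function `normalize_text` is a hypothetical name, and the default encoding ISO 8859-1 (Latin-1) is an assumption about the input. PostScript documents encode characters differently and are not covered here.<br />

```python
import html


def normalize_text(raw_bytes, encoding="latin-1"):
    """Decode 8-bit text and resolve HTML character references.

    Both the raw byte 0xFC and the reference &uuml; end up as the same
    character, so that a query for a German name matches either spelling.
    """
    text = raw_bytes.decode(encoding)
    return html.unescape(text)
```

With this normalization, the byte string `b"Wei\xdf, M&uuml;ller"` becomes the text `Weiß, Müller`.<br />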