28.02.2013

Sharing Knowledge: Scientific Communication - SSOAR


Judith Plümer

Migration from HTML META to RDF

Above we indicated the advantages of RDF over HTML META. The question we want to discuss here is how the software can handle the richer structure.

HTML META is, in a mathematical sense, equivalent to the SOIF format, since both store attribute/value pairs. But RDF has a richer structure than flat attribute/value pairs, which precludes its use on the basis of the current Harvest software. What we need is software with the power to handle RDF and the wisdom to integrate SOIF data coming from gatherer agents around the world whose operators do not want to update their software.
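The equivalence between HTML META and SOIF can be illustrated with a minimal Python sketch. It reads the `name`/`content` pairs from META tags and writes them out as a simplified SOIF-style record. The function names, the `@FILE` template type, and the example URL are assumptions made for illustration; this is not the Harvest implementation, and real SOIF records may carry details that are omitted here.

```python
from html.parser import HTMLParser


class MetaPairParser(HTMLParser):
    """Collect (name, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        # Keep only tags that really carry an attribute/value pair.
        if name and content is not None:
            self.pairs.append((name, content))


def meta_pairs(html):
    """Return the attribute/value pairs stored in HTML META tags."""
    parser = MetaPairParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


def to_soif(url, pairs):
    """Render the pairs as a simplified SOIF-style record (illustrative only)."""
    lines = ["@FILE { " + url]
    for name, value in pairs:
        size = len(value.encode("utf-8"))  # value size in bytes
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines)
```

Calling `to_soif(url, meta_pairs(html))` turns the META tags of a page into one flat record. Nothing is lost in that direction because both formats are lists of attribute/value pairs, which is also why neither can express the nested structure that RDF allows.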

In the CARMEN (http://www.math.uos.de/projects/carmen/) project of the federal ministry of science (Global Info program of the BMBF), tools were developed that can be plugged together to operate MPRESS on the basis of RDF.

X-Harvest

The X-Harvest [Kokkelink, 2000] software is a modification of the Harvest-NG (http://webharvest.sourceforge.net/ng/) software that replaces the internal SOIF format with RDF. In other words, X-Harvest is a substitute for the gatherer component of the Harvest software and stores the summaries of the documents in RDF format. X-Harvest is written entirely in Perl. That makes adding and modifying features easy, but it also makes the installation a real challenge: the Harvest-NG software requires 8 additional Perl modules, which in turn require further Perl modules. X-Harvest needs additional Perl modules of its own, bringing the total to 25 Perl modules that have to be installed. We therefore decided to implement a script that takes care of the installation procedure.

The modular architecture does, however, solve problems such as that of character sets: X-Harvest uses the Unicode module and hence handles Unicode characters.
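Replacing SOIF with RDF, and integrating SOIF data from existing gatherers, amounts to mapping flat attribute/value pairs onto RDF properties of a resource. The following Python sketch shows one way this could look. The mapping table `SOIF_TO_DC`, the choice of Dublin Core as the target vocabulary, and the function name are assumptions for illustration, not the mapping used by X-Harvest.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from SOIF attribute names to Dublin Core elements.
SOIF_TO_DC = {
    "title": "title",
    "author": "creator",
    "keywords": "subject",
    "abstract": "description",
}


def pairs_to_rdf(url, pairs):
    """Serialize attribute/value pairs about `url` as an RDF/XML string."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": url}
    )
    for name, value in pairs:
        dc_name = SOIF_TO_DC.get(name.lower())
        if dc_name is None:
            continue  # attributes without a known mapping are skipped here
        prop = ET.SubElement(desc, f"{{{DC_NS}}}{dc_name}")
        prop.text = value  # ElementTree strings are Unicode
    return ET.tostring(root, encoding="unicode")
```

Skipping unmapped attributes keeps the sketch short. A real converter would more likely preserve them under a vocabulary of its own so that no gathered data is discarded.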

There are also modules for the X-Harvest software that address the heterogeneity problems mentioned above.

What do we do with HTML documents without any metadata, and with documents in formats such as PostScript that cannot carry metadata? For this purpose there is a summarizer that generates metadata from HTML documents on the basis of probabilistic guesses, and a summarizer for PostScript documents that extracts metadata using given heuristics (http://www.math.uos.de/projects/carmen/AP11/). The use of heuristics makes complete sense in this context, since almost all mathematical papers have the same structure and are mostly offered in PostScript generated from TeX/LaTeX. This, however, is an aspect in which the described solution does not necessarily scale to other disciplines.
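
To show what a layout heuristic of this kind might look like, here is a small Python sketch. It assumes the text of a paper has already been extracted from the PostScript file by some other tool, and it relies on the typical structure of a TeX/LaTeX article: title first, authors next, then a paragraph introduced by "Abstract". These rules and the function name are illustrative assumptions, not the heuristics implemented in the CARMEN summarizers.

```python
import re


def guess_metadata(text):
    """Guess title, creator and description from the plain text of a paper."""
    non_empty = [line.strip() for line in text.splitlines() if line.strip()]
    meta = {}
    if non_empty:
        meta["title"] = non_empty[0]  # assumption: first line is the title
    if len(non_empty) > 1:
        meta["creator"] = non_empty[1]  # assumption: second line names the authors
    # Assumption: the abstract is the paragraph that follows the word "Abstract"
    # and ends at the next blank line (or at the end of the text).
    match = re.search(
        r"^\s*Abstract[.:]?\s*(.+?)(?:\n\s*\n|\Z)",
        text,
        flags=re.S | re.M | re.I,
    )
    if match:
        meta["description"] = " ".join(match.group(1).split())
    return meta
```

Rules this simple only work because the input is so uniform. A paper with a multi-line title or a different front-matter layout would be misread, which is the scalability limit noted above.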
