Rome Wasn't Digitized in a Day - Council on Library and Information ...
uncertainty regarding dates in the legal texts in Volterra. To integrate HGV and Volterra, they created annotations databases for each project or “randomly-generated values associated with each record in the original databases” so they could “demonstrate cross-database joins and third-party annotations” (Jackson et al. 2009).
The project used OGSA-DAI [372] for data integration because it was considered a de facto standard by many other e-science projects for integrating heterogeneous databases, it was open source, and it was compatible with many relational databases, XML, and other file-based resources. OGSA-DAI also supported the exposure of data resources onto grids (Bodard et al. 2009). Most important, in terms of data integration:
… OGSA-DAI can abstract the underlying databases using SQL views and provide an integrated interface onto them using distributed querying. This fulfils the essential requirement of the project to leave the underlying data resources untouched as far as possible (Jackson et al. 2009).
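The approach described in this quotation can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in (the project itself used OGSA-DAI over MySQL), and every table name, column name, and sample row below is hypothetical rather than taken from HGV or Volterra. Each source keeps its own table unchanged, a view in each source exposes a shared column layout, and a small function sends the same query to every view and merges the rows.

```python
import sqlite3


def make_source(ddl, insert_sql, rows, view_sql):
    """Create an in-memory database, load its rows, and add a view on top."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.execute(view_sql)
    return conn


# Hypothetical source A: its own table and column names (not the real HGV schema).
source_a = make_source(
    "CREATE TABLE papyri (id INTEGER, provenance TEXT, year_ad INTEGER)",
    "INSERT INTO papyri VALUES (?, ?, ?)",
    [(1, "Oxyrhynchus", 212), (2, "Arsinoe", 301)],
    "CREATE VIEW common_view AS "
    "SELECT 'A' AS source, id AS record_id, provenance AS place, "
    "year_ad AS year FROM papyri",
)

# Hypothetical source B: same kind of information under different names.
source_b = make_source(
    "CREATE TABLE legal_texts (text_no INTEGER, location TEXT, date_year INTEGER)",
    "INSERT INTO legal_texts VALUES (?, ?, ?)",
    [(10, "Rome", 212), (11, "Arsinoe", 380)],
    "CREATE VIEW common_view AS "
    "SELECT 'B' AS source, text_no AS record_id, location AS place, "
    "date_year AS year FROM legal_texts",
)


def federated_query(sources, where_clause, params=()):
    """Run one query against each source's view and merge the results."""
    sql = (
        "SELECT source, record_id, place, year "
        f"FROM common_view WHERE {where_clause}"
    )
    merged = []
    for conn in sources:
        merged.extend(conn.execute(sql, params).fetchall())
    return sorted(merged)


results = federated_query([source_a, source_b], "year = ?", (212,))
print(results)
```

The views do the schema harmonization, so the original tables are never altered; only the thin `federated_query` layer knows that more than one database exists.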
One goal of LaQuAT was to be able to support federated searching of a “virtual database” in order that the underlying databases would not have to undergo major changes for inclusion in such a resource. “The ability to link up such diverse data resources, in a way that respects the original data resources and the communities responsible for them,” Bodard et al. (2009) asserted, “is a pressing need among humanities researchers.”
A number of issues complicated data integration, however, including data consistency and some specific features of OGSA-DAI. To begin with, some of the original data in the HGV database had been “contaminated by control characters,” a factor that had serious implications for the OGSA-DAI system since it provided access to databases via web services, which are based on the exchange of XML documents. Since the use of control characters within an XML document results in an invalid XML file that cannot be parsed, they had to extend the system’s “relational data to XML conversion classes to filter out such control characters and replace these with spaces.” The Volterra database also presented its own unique challenges, particularly in terms of database design, since not all tables had the same columns and some columns with the same information had different names. A second major challenge was the lack of suitable database drivers, so the data from both Volterra and HGV were ported into MySQL to be able to interact with OGSA-DAI. Other issues included needing to adapt the way OGSA-DAI exposed metadata and having to alter the way the system used SQL views because of the large size of the HGV database. In the end, the project could use only a subset of the HGV database to ensure that query time would be reasonable. Despite these and other challenges, the project was able to develop a demonstrator that provided integrated access to both HGV and Volterra. [373]
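The control-character fix described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the actual change was made inside OGSA-DAI's relational-data-to-XML conversion classes, whereas the function names, element names, and sample row here are hypothetical and use only Python's standard library. The sketch replaces the C0 control characters that XML 1.0 forbids with spaces, leaving tab, line feed, and carriage return alone, and then checks whether the serialized row can be parsed.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 does not allow in a document.
# Tab (\x09), line feed (\x0A), and carriage return (\x0D) stay legal.
_XML_ILLEGAL_CONTROLS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def clean_for_xml(value):
    """Replace each illegal control character with a single space."""
    return _XML_ILLEGAL_CONTROLS.sub(" ", value)


def row_to_xml(row):
    """Serialize one database row (column name -> text) as an XML string."""
    record = ET.Element("record")
    for column, value in row.items():
        ET.SubElement(record, column).text = value
    return ET.tostring(record, encoding="unicode")


def parses(xml_text):
    """Return True if the text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical row with stray control characters in two fields.
dirty_row = {"title": "P.Oxy.\x01 example", "note": "line one\nline two\x0b"}
clean_row = {column: clean_for_xml(value) for column, value in dirty_row.items()}

print(parses(row_to_xml(dirty_row)))
print(parses(row_to_xml(clean_row)))
```

Substituting a space for each offending character, rather than deleting it, keeps every field the same length, so character offsets into the original text remain usable.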
The LaQuAT project had originally assumed that one of the most useful outcomes of integrating the two databases would be where data overlapped (such as in terms of personal and place names), but they found instead that clear-cut overlaps were fairly easy to identify. A far more interesting question, they proposed, was to try to automatically recognize “the co-existence of homonymous persons or
372 While the technical details of this software are beyond the scope of this paper, Jackson et al. explain that “OGSA-DAI executes workflows which can be viewed as scripts which specify what data is to be accessed and what is to be done to it. Workflows consist of activities, which are well-defined functional units which perform some data-related operation e.g. query a database, transform data to XML, deliver data via FTP. A client submits a workflow to an OGSA-DAI server via an OGSA-DAI web service. The server parses, compiles and executes the workflow.”
373 For more on the infrastructure proof of concept design, see Jackson et al. (2009). This demonstrator can be viewed at http://domain001.vidar.ngs.manchester.ac.uk:8080/laquat/laquatDemo.jsp