26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...


the inscription, not the inscription itself. When a digital surrogate becomes available, I can point to
that. In the meantime, a way of standardizing references to parts of a work would be useful,” he added.

Other recent research has examined some potential methods for resolving the issues of semantic
encoding and linking. Romanello (2008) proposed the use of microformats[174] and the CTS to provide
semantic linking between classics e-journals and the primary sources/canonical texts they referenced.
One of the first challenges was simply to detect the canonical references themselves, for as Romanello
demonstrated, references to ancient texts were often abridged, the abbreviations used for author and
work names varied greatly, only some citations included the editors’ names, and the reference schemes
could differ (e.g., for Aeschylus Persae, variant citations included A. Pers., Aesch. Pers., and Aeschyl.
Pers.). For this reason, Romanello et al. (2009a) explored the use of machine learning to extract
canonical references to primary classical sources from unstructured texts. Although references to
primary sources within the secondary literature can vary greatly, as seen above, they noted that a
number of similar patterns could often be detected. They thus trained conditional random fields (CRF)
to identify references to primary source texts within larger unstructured texts. CRF was a particularly
suitable algorithm because of its ability to consider a large number of token features when classifying
training data as either “citations” or “not citations.” Preliminary results on a sample of 24 pages
achieved a precision of 81 percent and a recall of 94.1 percent.[175]
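
The token-level view that a CRF works with can be pictured with a small feature extractor. The sketch below is only an illustration and is not the feature set used by Romanello et al. (2009a): the function names, the whitespace tokenization, the short abbreviation list, every feature name, and the example sentence are assumptions made here. It shows the kind of per-token evidence (capitalization, a trailing period, digits, neighbouring tokens) that a sequence labeller could weigh when deciding whether a token belongs to a citation.

```python
import re

# Hypothetical abbreviation list, taken from the variants quoted in the text.
# A real system would learn such cues from annotated data.
KNOWN_AUTHOR_ABBREVIATIONS = {"A.", "Aesch.", "Aeschyl."}


def token_features(tokens, i):
    """Return a feature dict for tokens[i], with context from its neighbours."""
    # Drop surrounding brackets and commas but keep a trailing period,
    # because the period is itself a cue for an abbreviation.
    core = tokens[i].strip("()[],;:")
    return {
        "lower": core.lower(),
        "is_capitalized": core[:1].isupper(),
        "ends_with_period": core.endswith("."),
        "has_digit": any(ch.isdigit() for ch in core),
        # Line or section references such as "353" or "353-432".
        "looks_like_range": bool(re.fullmatch(r"\d+(-\d+)?", core)),
        "known_author_abbrev": core in KNOWN_AUTHOR_ABBREVIATIONS,
        # Neighbouring tokens give the labeller its sequence context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


def sentence_features(sentence):
    """Tokenize on whitespace and build one feature dict per token."""
    tokens = sentence.split()
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


# Invented example sentence, used only to exercise the extractor.
tokens, feats = sentence_features("Compare (Aesch. Pers. 353-432) and the later account.")
for tok, f in zip(tokens, feats):
    print(tok, f["known_author_abbrev"], f["ends_with_period"], f["looks_like_range"])
```

Feature dicts of this shape are what a CRF toolkit would typically take as input, one list per sentence, alongside a parallel list of “citation” / “not citation” labels. Note that a sentence-final word also ends with a period, which is one reason single features are weak on their own and the labeller needs to combine many of them.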

Even when references are successfully identified, the challenges of encoding and linking remain.
Romanello (2008) stated that most references to primary texts within electronic secondary sources
were hard linked “through a tightly coupled linking system” and were also rarely encoded in a
machine-readable format. Other obstacles to semantic linking included the lack of shared standards or
best practices for encoding primary references in most corpora served as XHTML documents
and the lack of common protocols to support interoperability among different text collections that
would allow the linking of primary and secondary sources. To allow as much interoperability as
possible, Romanello promoted using “a common protocol to access collections of texts and a shared
format to encode canonical references within web online resources” (Romanello 2008). The other
requirements of a semantic linking system were that it must be open ended, interoperable, and
semantic- and language-neutral. Language-neutral and unique identifiers for authors and works (such
as those of the TLG) were also recommended to support cross-linking across languages.
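
One way to picture a language-neutral identifier is as a lookup that collapses the variant abbreviations onto a single key. The sketch below is illustrative only and does not reproduce Romanello's system: the table `VARIANT_TO_ID`, the helper names, and the regular expression are made up here; the `urn:cts:greekLit:` prefix follows the CTS URN pattern as it is commonly written; and the identifier `tlg0085.tlg002` for Aeschylus, Persae is an assumption that should be checked against the TLG Canon before use.

```python
import re

# Assumed identifier for Aeschylus, Persae (TLG-style author and work numbers).
# Verify against the TLG Canon before relying on it.
PERSAE_ID = "tlg0085.tlg002"

# The variant citation forms quoted in the text, collapsed onto one identifier.
VARIANT_TO_ID = {
    "a. pers.": PERSAE_ID,
    "aesch. pers.": PERSAE_ID,
    "aeschyl. pers.": PERSAE_ID,
}

# Abbreviated author/work part ending in a period, then an optional
# passage such as "353" or "353-432".
CITATION_RE = re.compile(r"^(?P<work>.+?\.)\s*(?P<passage>\d+(?:-\d+)?)?$")


def normalize(text):
    """Lowercase and collapse whitespace so lookups ignore spacing and case."""
    return " ".join(text.lower().split())


def to_cts_urn(citation, namespace="greekLit"):
    """Map a citation like 'Aesch. Pers. 353' to a CTS-style URN, or None."""
    match = CITATION_RE.match(citation.strip())
    if match is None:
        return None
    work_id = VARIANT_TO_ID.get(normalize(match.group("work")))
    if work_id is None:
        return None  # abbreviation not in the (hypothetical) table
    urn = f"urn:cts:{namespace}:{work_id}"
    if match.group("passage"):
        urn += ":" + match.group("passage")
    return urn


# All three variants resolve to the same identifier.
for variant in ("A. Pers. 353", "Aesch. Pers. 353", "Aeschyl. Pers. 353"):
    print(variant, "->", to_cts_urn(variant))
```

Because the key carries no author or title in any particular language, the same identifier can be shared by articles written in English, Italian, or German, which is the cross-linking property the recommendation is after.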

The basic system for semantic linking outlined by Romanello thus made use of the CTS URN scheme,
which uses the TLG Canon[176] of Greek authors for identifiers; a series of microformats that he
specifically developed to embed canonical references in HTML elements; and open protocols such as
the CTS text-retrieval protocol to retrieve either whole texts or parts of texts in order to support various
value-added services such as reference indexing. Romanello proposed three microformats for his
system: ctauthor (references to canonical authors, or statements that can be made machine-readable
through the CTS URN structure); ctwork (references to works without author names); and ctref, “a
compound microformat to encode a complete canonical reference” that requires the use of ctauthor,
ctwork, and a range property to specify the text sections that were referred to. While implementation of
such microformat encoding and CTS protocols would enable a number of interesting value-added
services such as semantic linking, granular text retrieval, and cross-lingual reference indexing (e.g.,

[174] According to the microformats website, “microformats are a set of simple, open data formats built upon existing and widely adopted standards” that have been designed to be both human and machine readable (http://microformats.org/about).

[175] Work by Romanello continues in this area through crefex (Canonical REFerences Extractor, http://code.google.com/p/crefex/) and was presented at the Digital Classicist/ICS Work in Progress Seminar in July 2010. See Matteo Romanello, “Towards a Tool for the Automatic Extraction of Canonical References.” http://www.digitalclassicist.org/wip/wip2010-04mr.pdf

[176] http://www.tlg.uci.edu/canon/fontsel
