Rome Wasn't Digitized in a Day - Council on Library and Information ...
generated by the decision tree program (C4.5) were used to determine the more likely variant. The
authors found that the parallel-text correction rate was consistently higher than the single-text
correction rate by between 5 percent and 16 percent. Baseline character accuracy in this final
experiment rose to 99.49 percent.
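Character accuracy figures of this kind are commonly computed as one minus the character error rate, that is, the edit distance between the OCR output and a ground-truth transcription divided by the length of the ground truth. The report does not state the exact formula the authors used, so the following is a minimal sketch under that assumption; the function names are illustrative. Both strings are normalized to Unicode NFC first so that a Greek letter with a combining accent and its precomposed equivalent compare as equal.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Assumed definition: 1 - (edit distance / ground-truth length)."""
    ocr_text = unicodedata.normalize("NFC", ocr_text)
    ground_truth = unicodedata.normalize("NFC", ground_truth)
    if not ground_truth:
        raise ValueError("ground truth must be non-empty")
    return 1.0 - edit_distance(ocr_text, ground_truth) / len(ground_truth)
```

Under this definition, one wrong character in a four-character string gives an accuracy of 0.75, and the value can fall below zero when the OCR output is much longer than the ground truth.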
The ability to search text variants and to automatically collate various editions of the same work in a
digital library through the use of OCR and a number of automated techniques offers new research
opportunities. In addition, the work of Stewart et al. provides useful lessons in how curated digital
corpora, automated methods, and million-book libraries can be used to create new, more sophisticated
digital libraries:
By situating corpus production within a digital library (i.e., a collection of authenticated digital
objects with basic cataloging data), exploiting the strengths of large collections (e.g., multiple
editions), and making judicious use of practical automated methods, we can start to build new
corpora on top of our digital libraries that are not only larger but, in many ways, more useful
than their manually constructed predecessors (Stewart et al. 2007).
Further research reported by Boschetti et al. (2009) was informed by the preliminary techniques
reported in Stewart et al. (2007), but also expanded on them, since the initial work did not include the
recognition of Greek accents and diacritical marks.
Boschetti et al. (2009) conducted a series of experiments in an attempt to create a scalable workflow
for outputting highly accurate OCR of Greek text. This workflow used progressive multiple alignment
of the OCR output of two commercial products (Anagnostis, Abbyy FineReader) and one open-source⁴⁷
OCR engine (OCRopus), which was not available when Stewart et al. (2007) conducted their
research. Multiple editions of Athenaeus’ Deipnosophistae, one edition of Aeschylus, and a 1475
edition of Augustine’s De Civitate Dei were used for the OCR experiments. This research determined
that the accuracy of single engines was very dependent on the training set created for it, but it also
revealed that in several cases OCRopus obtained better results than either commercial option. The
highest accuracy level (99.01 percent), which was for sample pages from the fairly recent Loeb edition
of Athenaeus, was obtained through the use of multiple progressive alignment and a spell-checking
algorithm. (Accuracy levels on earlier editions of Athenaeus ranged from 94 percent to 98 percent.)
The addition of accents did produce lower character accuracy results than those reported by Stewart et
al., but at the same time, accents are an important part of Ancient Greek, and any OCR system for this
language will ultimately need to include them. This research also demonstrated that OCRopus, a
relatively new open-source OCR engine, could produce results comparable to those of expensive
commercial products.
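The idea of combining several OCR engines can be illustrated with a simplified stand-in for the workflow described above. This is not the authors' progressive multiple alignment: the sketch below aligns each additional engine's output pairwise against the first output using Python's standard difflib module, then takes a majority vote at each character position. The function names and the sample strings are hypothetical, and characters that are missing from the first output are deliberately not recovered.

```python
from collections import Counter
from difflib import SequenceMatcher

GAP = ""  # marks "no character from this engine aligns to this position"


def project_onto(reference: str, other: str) -> list[str]:
    """For each position in `reference`, return the character of `other` aligned to it."""
    projected = [GAP] * len(reference)
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only one-to-one spans are projected; other spans stay as gaps.
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for offset in range(i2 - i1):
                projected[i1 + offset] = other[j1 + offset]
    return projected


def vote(outputs: list[str]) -> str:
    """Majority vote per position of the first output; ties keep the first output's character."""
    reference = outputs[0]
    columns = [list(reference)] + [project_onto(reference, o) for o in outputs[1:]]
    merged = []
    for position in zip(*columns):
        counts = Counter(ch for ch in position if ch != GAP)  # gaps do not vote
        best, best_count = counts.most_common(1)[0]
        if counts[position[0]] == best_count:
            best = position[0]
        merged.append(best)
    return "".join(merged)


# Hypothetical outputs from three engines for the same line of text.
engines = ["menin acide thea", "menin aeide thea", "menin aeide thca"]
merged_line = vote(engines)
```

Because each engine tends to make different errors, a character misread by one engine is outvoted by the other two. A full multiple alignment would also handle insertions and deletions, which this sketch leaves to the first output.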
While both Stewart et al. (2007) and Boschetti et al. (2009) focused on using OCR to recognize printed
editions of Ancient Greek, a variety of both classical scholarship and document-recognition research⁴⁸
has been conducted on the Archimedes Palimpsest,⁴⁹ a thirteenth-century prayer book that contains
erased texts that were written several centuries before, including previously “lost” treatises by
Archimedes and Hypereides. This manuscript has since been digitized, and the images created of the
47 http://code.google.com/p/ocropus/
48 A palimpsest is a manuscript “on which more than one text has been written with the earlier writing incompletely erased and still visible”
(http://wordnetweb.princeton.edu/perl/webwns=palimpsest). For a full list of research publications using the Archimedes Palimpsest, see
http://www.archimedespalimpsest.org/bibliography1.html
49 http://www.archimedespalimpsest.org/