26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...


generated by the decision tree program (C4.5) were used to determine the more likely variant. The authors found that the parallel-text correction rate was consistently higher than the single-text correction rate by between 5 percent and 16 percent. Baseline character accuracy in this final experiment rose to 99.49 percent.
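The accuracy figures quoted in this section are character-level percentages. The source does not spell out how they were computed, so the following is only a minimal sketch of one common convention: one minus the edit distance between the OCR output and a hand-checked ground truth, divided by the length of the ground truth. The function names `levenshtein` and `character_accuracy` are illustrative assumptions, not taken from Stewart et al.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Assumed measure: 1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        raise ValueError("ground truth must be non-empty")
    errors = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

Under this convention, one wrong character in a four-character ground truth gives an accuracy of 0.75; a figure such as 99.49 percent would correspond to roughly one error per 200 characters.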

The ability to search text variants and to automatically collate various editions of the same work in a digital library through the use of OCR and a number of automated techniques offers new research opportunities. In addition, the work of Stewart et al. provides useful lessons in how curated digital corpora, automated methods, and million-book libraries can be used to create new, more sophisticated digital libraries:

By situating corpus production within a digital library (i.e., a collection of authenticated digital objects with basic cataloging data), exploiting the strengths of large collections (e.g., multiple editions), and making judicious use of practical automated methods, we can start to build new corpora on top of our digital libraries that are not only larger but, in many ways, more useful than their manually constructed predecessors (Stewart et al. 2007).

Further research reported by Boschetti et al. (2009) was informed by the preliminary techniques reported in Stewart et al. (2007), but also expanded on them, since the initial work did not include the recognition of Greek accents and diacritical marks.

Boschetti et al. (2009) conducted a series of experiments in an attempt to create a scalable workflow for producing highly accurate OCR of Greek text. This workflow used progressive multiple alignment of the OCR output of two commercial products (Anagnostis, ABBYY FineReader) and one open-source[47] OCR engine (OCRopus), which was not available when Stewart et al. (2007) conducted their research. Multiple editions of Athenaeus’ Deipnosophistae, one edition of Aeschylus, and a 1475 edition of Augustine’s De Civitate Dei were used for the OCR experiments. This research determined that the accuracy of single engines was very dependent on the training set created for each, but it also revealed that in several cases OCRopus obtained better results than either commercial option. The highest accuracy level (99.01 percent), which was for sample pages from the fairly recent Loeb edition of Athenaeus, was obtained through the use of progressive multiple alignment and a spell-checking algorithm. (Accuracy levels on earlier editions of Athenaeus ranged from 94 percent to 98 percent.) The addition of accents did produce lower character accuracy results than those reported by Stewart et al., but at the same time, accents are an important part of Ancient Greek, and any OCR system for this language will ultimately need to include them. This research also demonstrated that OCRopus, a relatively new open-source OCR engine, could produce results comparable to those of expensive commercial products.
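The idea of combining several engines' output can be illustrated with a much simpler stand-in for progressive multiple alignment: align each additional OCR output to one "pivot" output and let the engines vote character by character. This is a hedged sketch, not the workflow of Boschetti et al. The functions `project_onto_pivot` and `vote` are hypothetical names, the alignment uses Python's standard `difflib.SequenceMatcher`, insertions and deletions are ignored (an engine simply abstains there), and no spell-checking step is included.

```python
from collections import Counter
from difflib import SequenceMatcher


def project_onto_pivot(pivot: str, other: str) -> list[str]:
    """For each pivot position, the aligned character from `other`, or "" to abstain."""
    projected = [""] * len(pivot)
    matcher = SequenceMatcher(None, pivot, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # Use matched runs and equal-length substitutions; skip inserts/deletes.
        if tag == "equal" or (tag == "replace" and same_length):
            for k in range(i2 - i1):
                projected[i1 + k] = other[j1 + k]
    return projected


def vote(pivot: str, others: list[str]) -> str:
    """Majority vote per pivot character; the pivot wins ties."""
    columns = [list(pivot)] + [project_onto_pivot(pivot, o) for o in others]
    result = []
    for i, pivot_char in enumerate(pivot):
        votes = Counter(col[i] for col in columns if col[i])
        best, count = votes.most_common(1)[0]
        result.append(best if count > votes[pivot_char] else pivot_char)
    return "".join(result)


# Three simulated engine outputs, each from the same transliterated line;
# the pivot and the second alternative each contain one different error.
merged = vote("menin aeide tbea", ["menin aeide thea", "menim aeide thea"])
```

Because the two simulated errors fall in different positions, each is out-voted by the other two outputs, which is the intuition behind combining engines whose mistakes are not strongly correlated.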

While both Stewart et al. (2007) and Boschetti et al. (2009) focused on using OCR to recognize printed editions of Ancient Greek, a variety of both classical scholarship and document-recognition research[48] has been conducted on the Archimedes Palimpsest,[49] a thirteenth-century prayer book that contains erased texts that were written several centuries before, including previously “lost” treatises by Archimedes and Hypereides. This manuscript has since been digitized, and the images created of the

[47] http://code.google.com/p/ocropus/

[48] A palimpsest is a manuscript “on which more than one text has been written with the earlier writing incompletely erased and still visible” (http://wordnetweb.princeton.edu/perl/webwn?s=palimpsest). For a full list of research publications using the Archimedes Palimpsest, see http://www.archimedespalimpsest.org/bibliography1.html

[49] http://www.archimedespalimpsest.org/
