Rome Wasn't Digitized in a Day - Council on Library and Information ...
One major project that has recently been funded in this area is “New Technology for Digitization of
Ancient Objects and Documents,” a joint project of the Archaeological Computing Research Group
(ACRG) and the School of Electronics and Computer Science (ECS), Southampton; the Centre for the
Study of Ancient Documents (CSAD), Oxford; the CDLI, Los Angeles-Philadelphia-Oxford-Berlin;
and the Electronic Text Corpus of Sumerian Literature (ETCSL), Oxford. 42 This project has received a
12-month Arts and Humanities Research Council (AHRC) grant to develop a “Reflectance
Transformation Imaging (RTI) System for Ancient Documentary Artefacts.” The team plans to
develop two RTI systems that can be used to capture high-quality digital images of documentary texts
and archaeological materials. The initial testing will be conducted on stylus tablets from Vindolanda,
stone inscriptions, Linear B, and cuneiform tablets.
Other relevant research is being conducted by the IMPACT (Improving Access to Text) 43 project, which
is funded by the European Commission and is exploring how to develop advanced OCR
methods for historical texts, particularly for use in mass digitization processes. 44
While this research is not specifically focused on developing techniques for classical languages, Latin
was the major language of intellectual discourse in Europe for centuries, so techniques adapted
for either manuscripts or early printed books would be useful to classical scholarship and beyond.
Ancient Greek<br />
Only a limited amount of work has considered using automatic techniques in the optical recognition of
ancient or classical Greek. While some recent research has focused on the development of OCR for
“Old Greek” historical manuscripts, 45 little work has explored developing techniques for either
manuscripts or printed editions of Ancient Greek texts.
Some preliminary work in developing an automatic-recognition methodology for Ancient Greek is
detailed by Stewart et al. (2007). In their initial survey of Greek editions, the authors found that on
average almost 14 percent of the Greek words on a text page appeared in the notes or apparatus
criticus. The authors first used a multi-tiered approach to OCR that applied two major post-processing
techniques to the output of two commercial OCR packages, ABBYY FineReader (8.0) 46 and
Anagnostis 4.1. During this experiment, they found that character accuracy on simple uncorrected text
averaged about 98.57 percent. Other preliminary experiments with OCR-generated text revealed that
the uncorrected OCR output could serve as a searchable corpus. Even when working with a mid-nineteenth-century
edition of Aristotle in a nonstandard Greek font, searches of the OCR-generated text typically
provided better recall than searches of texts that had been manually typed, because the OCR text
included variant readings found in the notes. In a second experiment, the automatic correction of single
texts was performed using a list of one million Greek words and the Morpheus Greek morphological
analyzer that was developed by the PDL.
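The kind of word-list correction described in the second experiment can be illustrated with a minimal Python sketch. It is not the method of Stewart et al. (2007): the function name `correct_tokens`, the four-word lexicon, the similarity cutoff, and the optional `is_valid_form` callback (a hypothetical stand-in for a morphological analyzer such as Morpheus) are all assumptions made for illustration, and the fuzzy matching relies on Python's standard `difflib` module rather than on any tool named in the report.

```python
from difflib import get_close_matches


def correct_tokens(tokens, lexicon, is_valid_form=None, cutoff=0.75):
    """Replace OCR tokens that are not in the lexicon with the closest entry.

    is_valid_form is a hypothetical callback standing in for a morphological
    analyzer: tokens it accepts are kept even if the word list lacks them.
    """
    known = set(lexicon)
    candidates = sorted(known)
    corrected = []
    for token in tokens:
        # Keep tokens that are already attested in the word list.
        if token in known:
            corrected.append(token)
            continue
        # Keep tokens the (assumed) analyzer recognizes as valid forms.
        if is_valid_form is not None and is_valid_form(token):
            corrected.append(token)
            continue
        # Otherwise take the most similar lexicon entry, if one is close enough.
        matches = get_close_matches(token, candidates, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected


# Toy data: a tiny word list and OCR output with two typical confusions
# (Latin "o" for omicron, Latin "c" for final sigma) plus one unlisted word.
lexicon = ["λόγος", "ἄνθρωπος", "πόλις", "φύσις"]
ocr_tokens = ["λόγος", "ἄνθρωπoς", "πόλιc", "ζῷον"]

print(correct_tokens(ocr_tokens, lexicon))
```

In this sketch a token with no sufficiently similar lexicon entry is left unchanged, which avoids forcing rare but valid words into the nearest listed form. The cutoff controls that trade-off and would need tuning against real OCR output.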
For their third experiment, Stewart and colleagues used the OCR output of multiple editions of the
same work to correct one another in a three-step process. First, different editions of a text were aligned
by finding unique strings in each. Second, if an error word was found in one text, a fuzzy search was
performed in the aligned parallel text to try to locate the correct form. Finally, once error words in a
base text had been matched against potential ground truth counterparts in the parallel texts, rules
42 http://www.southampton.ac.uk/archaeology/news/news_2010/acrg_dedefi_main.shtml
43 http://www.impact-project.eu/home/<br />
44 For a recent overview of some of the IMPACT project’s research, see Ploeger et al. (2009).<br />
45 For an example, see Ntzios et al. (2007).<br />
46 http://www.abbyy.com/