Rome Wasn't Digitized in a Day - Council on Library and Information ...
Schibel and Rydberg-Cox argued that good bibliographic description is required for this historical
source material (ideally so that such collections can be sorted by period, place, language, literary
genre, publisher, and audience), particularly since many digitized texts will be reused in other
contexts. Their second recommendation (Schibel and Rydberg-Cox 2006) was to identify
at least basic structural metadata for such books (e.g., front, body, back) or to create a rough table of
contents that provides a framework by which to make page images available. They suggested that such
structural metadata would support new research into traditional questions of textual influence:
researchers could use automatic text-similarity measures to recognize text families and trace either
the influence of major authors or the purposes of a given document. Despite such new opportunities,
problems remain. An initial analysis by the authors of digital libraries of page images of early modern
books revealed that the page images produced were often inaccurate or inadequate, OCR tools were not
yet flexible enough to produce transcriptions, and automated tagging and linking was far more difficult
with “pre-standardized language.”
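As a rough illustration of what an automatic text-similarity measure might look like, the sketch below compares texts by the overlap of their character trigrams (Jaccard similarity). This is an assumption chosen for illustration, not the measure Schibel and Rydberg-Cox had in mind; the function names `char_ngrams` and `jaccard_similarity` and the sample strings are hypothetical. Character n-grams are used here because they tolerate some of the spelling variation found in pre-standardized language.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a whitespace-normalized, lowercased text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard_similarity(text_a, text_b, n=3):
    """Jaccard similarity of two texts' character n-gram sets (0.0 to 1.0)."""
    grams_a = char_ngrams(text_a, n)
    grams_b = char_ngrams(text_b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty texts are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Illustrative strings only: a line, a spelling variant of it, and an unrelated line.
base = "arma virumque cano troiae qui primus ab oris"
variant = "arma uirumque cano troie qui primus ab oris"
unrelated = "gallia est omnis divisa in partes tres"

print(jaccard_similarity(base, variant))
print(jaccard_similarity(base, unrelated))
```

Texts whose pairwise scores are high could then be grouped into candidate "text families" for a human reader to inspect; any threshold for grouping would have to be tuned on real data.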
Schibel and Rydberg-Cox concluded, however, that the greatest challenge faced in providing access to
early modern books is that linguistic tools for Early Modern Latin are considerably underdeveloped:
Aside from the issues outlined above, two major challenges face humans and computers alike.
First, we have no comprehensive dictionary of Neo-Latin. Readers must cope with neologisms
or, often much harder to decipher, idioms and turns of expression of particular groups. Second,
aside from morphological analyzers such as Morpheus—the Latin morphological analyzer
found in the Perseus Digital Library—we have few computational tools for Latin. Even
Morpheus does not use contextual clues to prioritize analyses, and we are not aware of any
substantive work on named entity recognition in Latin. We do not yet have mature electronic
authority lists for the Greco-Roman world, much less the people, places, etc. of the early
modern period (Schibel and Rydberg-Cox 2006).
Some of the issues listed here, such as the development of linguistic tools for early modern Latin, have
received further attention in the past four years by authors such as Reddy and Crane (2006). They
tested the abilities of the commercial OCR software ABBYY FineReader and the open-source document
recognition system Gamera 56 to recognize glyphs in early modern Latin documents. They found that
after extensive training Gamera could recognize about 80 percent of glyphs while FineReader could
recognize about 84 percent. To improve the character-recognition output, they recommended the use of
language modeling for future work.
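The source does not say how these recognition rates were measured. One common way to score OCR output, assumed here purely for illustration, is character accuracy: one minus the edit distance between the OCR output and a hand-corrected reference, divided by the reference length. The function names `levenshtein` and `char_accuracy` and the sample strings are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(reference, ocr_output):
    """Character accuracy of OCR output against a reference transcription (0.0 to 1.0)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    errors = levenshtein(reference, ocr_output)
    # Clamp at zero: heavily garbled output can contain more errors than reference characters.
    return max(0.0, 1.0 - errors / len(reference))


# Illustrative strings only: the OCR output confuses a long s with 'f'.
reference = "sanctissimus"
ocr_output = "fanctissimus"
print(char_accuracy(reference, ocr_output))
```

A score of this kind only measures the raw recognizer; a language model, as Reddy and Crane recommended, would act as a later step that prefers character sequences plausible in Latin.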
Rydberg-Cox (2009) also explored some of the computational challenges in creating a corpus of early
modern Latin and reported on work from the NEH project, “Approaching the Problems of Digitizing
Latin Incunables.” The primary aim of this project was to examine the “challenges associated with
representing in digital form the complex and non-standard typefaces used in these texts to abbreviate
words,” a practice that imitated medieval handwriting. Such features of early
typography occurred at varying rates in different books, Rydberg-Cox noted, but they appear so
frequently that no digitization project can fail to consider them. This issue was also faced by the
Archimedes Digital Library 57 project, which, when digitizing texts published between 1495 and 1691,
discovered between three and five abbreviations on every printed page. Rydberg-Cox emphasized that
56 http://gamera.informatik.hsnr.de/. In March 2011, the Gamera Project announced that they were releasing a GreekOCR Toolkit
(http://gamera.informatik.hsnr.de/addons/greekocr4gamera/), an OCR system that can be used for “polytonal Greek text documents.” Although still in the
testing stage, it includes extensive documentation and the ability to recognize accents.
57 http://archimedes.fas.harvard.edu/