26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...

Schibel and Rydberg-Cox argued that good bibliographic description is required for this historical source material (ideally so that such collections can be sorted by period, place, language, literary genre, publisher, and audience), particularly since many digitized texts will often be reused in other contexts. A second recommendation made by Schibel and Rydberg-Cox (2006) is the need to identify at least basic structural metadata for such books (e.g., front, body, back) or to create a rough table of contents that provides a framework by which to make page images available. They suggested that such structural metadata would support new research into traditional questions of textual influence for researchers who could use automatic text-similarity measures to recognize text families and trace either the influence of major authors or the purposes of a given document. Despite such new opportunities, problems remain. An initial analysis by the authors of digital libraries of page images of early modern books revealed that page images produced were often inaccurate or inadequate, OCR tools were not yet flexible enough to produce transcriptions, and automated tagging and linking is far more difficult with “pre-standardized language.”
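The report does not say which text-similarity measure Schibel and Rydberg-Cox had in mind. As a minimal sketch of the general idea, the following Python compares two passages by the Jaccard overlap of their character trigrams, a measure that tolerates some of the spelling variation found in pre-standardized language. The function names and the sample passages are hypothetical and chosen only for illustration.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lightly normalized text."""
    # Lowercase and collapse whitespace. A real pipeline would likely also
    # normalize u/v and i/j and expand abbreviations (not done here).
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of two texts' character n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty texts are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical snippets: one sentence in two spellings, plus an unrelated line.
passage_a = "Gallia est omnis divisa in partes tres"
passage_b = "Gallia est omnis diuisa in partes tres"
passage_c = "Arma virumque cano Troiae qui primus ab oris"

print(jaccard(passage_a, passage_b))  # high: differs only in u/v spelling
print(jaccard(passage_a, passage_c))  # low: unrelated text
```

Passages whose pairwise scores stay high could then be grouped as candidate members of one text family, with the grouping threshold left as a matter for experiment.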

Schibel and Rydberg-Cox concluded, however, that the greatest challenge faced in providing access to early modern books is that linguistic tools for Early Modern Latin are considerably underdeveloped:

Aside from the issues outlined above, two major challenges face humans and computers alike. First, we have no comprehensive dictionary of Neo-Latin. Readers must cope with neologisms or, often much harder to decipher, idioms and turns of expression of particular groups. Second, aside from morphological analyzers such as Morpheus—the Latin morphological analyzer found in the Perseus Digital Library—we have few computational tools for Latin. Even Morpheus does not use contextual clues to prioritize analyses, and we are not aware of any substantive work on named entity recognition in Latin. We do not yet have mature electronic authority lists for the Greco-Roman world, much less the people, places, etc. of the early modern period (Schibel and Rydberg-Cox 2006).

Some of the issues listed here, such as the development of linguistic tools for early modern Latin, have received further attention in the past four years from authors such as Reddy and Crane (2006). They tested the abilities of the commercial OCR package ABBYY FineReader and the open-source document recognition system Gamera (see note 56) to recognize glyphs in early modern Latin documents. They found that after extensive training Gamera could recognize about 80 percent of glyphs while FineReader could recognize about 84 percent. To improve the character-recognition output, they recommended the use of language modeling for future work.
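The report does not describe what kind of language model Reddy and Crane had in mind. As a rough sketch only, the Python below shows the simplest form the idea could take: a unigram (word-frequency) model that replaces an unknown OCR token with the most frequent known word differing from it by one substituted character. The training text, function names, and misread tokens are all hypothetical.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def train_unigram(corpus: str) -> Counter:
    """Count word frequencies in a clean reference text (a unigram model)."""
    return Counter(corpus.lower().split())


def substitutions(word: str) -> set[str]:
    """All strings that differ from `word` by exactly one substituted letter."""
    return {
        word[:i] + letter + word[i + 1:]
        for i in range(len(word))
        for letter in ALPHABET
        if letter != word[i]
    }


def correct(token: str, model: Counter) -> str:
    """Return `token` if known, else the most frequent one-substitution match."""
    if token in model:
        return token
    candidates = [w for w in substitutions(token) if w in model]
    # With no known candidate, leave the token untouched rather than guess.
    return max(candidates, key=lambda w: model[w]) if candidates else token


# Hypothetical clean text standing in for a real training corpus.
model = train_unigram(
    "in principio erat verbum et verbum erat apud deum et deus erat verbum"
)

print(correct("verbnm", model))  # 'u' misread as 'n'
print(correct("crat", model))    # 'e' misread as 'c'
```

A model useful in practice would need a far larger corpus, handling of insertions and deletions, and probably contextual (n-gram) probabilities; the sketch only illustrates why word statistics can repair individual glyph errors.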

Rydberg-Cox (2009) also explored some of the computational challenges in creating a corpus of Early Modern Latin and reported on work from the NEH project, “Approaching the Problems of Digitizing Latin Incunables.” The primary aim of this project was to examine the “challenges associated with representing in digital form the complex and non-standard typefaces used in these texts to abbreviate words,” a practice that imitated medieval handwriting. Such features of early typography occurred at varying rates in different books, Rydberg-Cox noted, but they appear so frequently that no digitization project can fail to consider them. This issue was also faced by the Archimedes Digital Library project (see note 57), which, when digitizing texts published between 1495 and 1691, discovered between three and five abbreviations on every printed page. Rydberg-Cox emphasized that

56 http://gamera.informatik.hsnr.de/. In March 2011, the Gamera Project announced that they were releasing a GreekOCR Toolkit (http://gamera.informatik.hsnr.de/addons/greekocr4gamera/), an OCR system that can be used for “polytonal Greek text documents.” Although still in the testing stage, it includes extensive documentation and the ability to recognize accents.

57 http://archimedes.fas.harvard.edu/
