Rome Wasn't Digitized in a Day - Council on Library and Information ...
The image-registration algorithm developed by Baumann and Seales was successful, and the authors rightly concluded that:
High-resolution, multispectral digital imaging of important documents is emerging as a standard practice for enabling scholarly analysis of difficult or damaged texts. As imaging techniques improve, documents are revisited and re-imaged, and registration of these images into the same frame of reference for direct comparison can be a powerful tool (Baumann and Seales 2009).
The work of the EDUCE Project illustrates how the state of the art is being used to provide new levels of access to valuable and damaged manuscripts.
Latin
In light of the extensive digitization of cultural heritage materials such as manuscripts and the large number of Latin texts that are becoming available through massive digitization projects, techniques for improving access to these materials are an area of growing research. This subsection examines that research.
A variety of approaches have been explored for improving access to Latin manuscripts. Leydier et al. (2007) explored the use of “word-spotting” to improve information retrieval of textual data in primarily Latin medieval manuscript images. They describe the technique as follows:
In practice, word-spotting consists in retrieving all the occurrences of an image of a word. This template word is selected by the user by outlining one occurrence on the document. It results in the system proposing a sorted list of hits that the user can prune manually. … Word-spotting is based on a similarity or a distance between two images, the reference image defined by the user and the target images representing the rest of the page or all the pages of a multi-page document. Contrary to text query on a document processed by OCR, a word-image query can be sensitive to the style of the writing or the typography used. This technique is used when word recognition cannot be done, for example on very deteriorated printed documents or on manuscripts (Leydier et al. 2007).
The authors report that the main drawback of this approach is that a user has to select a keyword in a manuscript image (typically based on an ASCII transcript) as the basis for further image retrieval, which limits their approach to retrieving other images by word only.
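The ranking step in the description above can be illustrated with a minimal Python sketch. It assumes that the candidate word images have already been cropped and resized to the shape of the user-selected reference image, which is a simplification. The names `image_distance` and `spot_word` and the toy arrays are hypothetical, and the mean squared pixel difference is a stand-in chosen for brevity; it is not the similarity measure used by Leydier et al.

```python
import numpy as np


def image_distance(a, b):
    """Mean squared pixel difference between two equally sized grayscale images.

    Assumption: a plain pixel distance stands in for a real similarity measure.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    return float(np.mean((a - b) ** 2))


def spot_word(reference, candidates):
    """Return (candidate_index, distance) pairs sorted from best to worst match."""
    scored = [(i, image_distance(reference, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda hit: hit[1])


# Toy 3x3 "word images" (hypothetical data, 1 = ink, 0 = background).
reference = np.array([[0, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0]])
candidates = [
    np.zeros((3, 3)),                 # blank patch
    reference.copy(),                 # exact occurrence of the word
    np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 0, 0]]),            # near match, one pixel differs
]

hits = spot_word(reference, candidates)
for index, distance in hits:
    print(index, round(distance, 3))
```

The sorted list plays the role of the hit list that the user prunes manually: the exact occurrence ranks first, the near match second, and the blank patch last.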
Another approach, presented by Edwards et al. (2004), trained a generalized Hidden Markov Model (gHMM) on the transcription of a Latin manuscript to obtain a transition model, and on one example each of 22 letters to create an emission model. Their transition model for unigrams, bigrams, and trigrams was fitted using the Latin Library’s electronic version of Caesar’s Gallic Wars, and their emission model was trained on 22 glyphs taken from a twelfth-century manuscript of Terence’s Comoediae. In contrast to Leydier et al., the authors argued that word-spotting was not entirely appropriate for a highly inflected language such as Latin:
Manmatha et al. … introduce the technique of “word spotting,” which segments text into word images, rectifies the word images, and then uses an aligned training set to learn correspondences between rectified word images and strings. The method is not suitable for a heavily inflected language, because words take so many forms. In an inflected language, the natural unit to match to is a subset of a word, rather than a whole word, implying that one