26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...


The image-registration algorithm developed by Baumann and Seales was successful, and the authors rightly concluded that:

High-resolution, multispectral digital imaging of important documents is emerging as a standard practice for enabling scholarly analysis of difficult or damaged texts. As imaging techniques improve, documents are revisited and re-imaged, and registration of these images into the same frame of reference for direct comparison can be a powerful tool (Baumann and Seales 2009).

The work of the EDUCE Project illustrates how the state of the art is being used to provide new levels of access to valuable and damaged manuscripts.
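The registration algorithm itself is not described here. As a deliberately simplified illustration of what bringing two images "into the same frame of reference" involves, the Python sketch below searches for the integer shift that best aligns a re-imaged page with a reference image. It assumes grayscale images stored as nested lists and a pure translation between them; real manuscript registration also has to handle rotation, scale, and page warping. The names `mean_abs_diff` and `register_translation` are hypothetical.

```python
from typing import List, Tuple

Gray = List[List[float]]  # grayscale image as rows of pixel intensities


def mean_abs_diff(ref: Gray, moving: Gray, dy: int, dx: int) -> float:
    """Mean absolute difference over the region where the two images overlap.

    The shift (dy, dx) means that ref[r][c] is compared with moving[r + dy][c + dx].
    """
    total, count = 0.0, 0
    for r in range(len(ref)):
        for c in range(len(ref[0])):
            rr, cc = r + dy, c + dx
            if 0 <= rr < len(moving) and 0 <= cc < len(moving[0]):
                total += abs(ref[r][c] - moving[rr][cc])
                count += 1
    return total / count if count else float("inf")


def register_translation(ref: Gray, moving: Gray, max_shift: int = 3) -> Tuple[int, int]:
    """Exhaustively try integer shifts and return the (dy, dx) with the lowest difference."""
    candidates = [
        (dy, dx)
        for dy in range(-max_shift, max_shift + 1)
        for dx in range(-max_shift, max_shift + 1)
    ]
    return min(candidates, key=lambda s: mean_abs_diff(ref, moving, s[0], s[1]))
```

Once the shift is known, a pixel in the reference image can be compared directly with its counterpart in the newer image. The exhaustive search is only practical for small shifts and small images.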

Latin

In light of the extensive digitization of cultural heritage materials such as manuscripts and the large number of Latin texts that are becoming available through massive digitization projects, techniques for improving access to these materials are an area of growing research, which is examined in this subsection.

A variety of approaches have been explored for improving access to Latin manuscripts. Leydier et al. (2007) explored the use of “word-spotting” to improve information retrieval of textual data in primarily Latin medieval manuscript images. They describe the technique as follows:

In practice, word-spotting consists in retrieving all the occurrences of an image of a word. This template word is selected by the user by outlining one occurrence on the document. It results in the system proposing a sorted list of hits that the user can prune manually. … Word-spotting is based on a similarity or a distance between two images, the reference image defined by the user and the target images representing the rest of the page or all the pages of a multi-page document. Contrary to text query on a document processed by OCR, a word-image query can be sensitive to the style of the writing or the typography used. This technique is used when word recognition cannot be done, for example on very deteriorated printed documents or on manuscripts (Leydier et al. 2007).

The authors report that the main drawback of this approach is that a user has to select a keyword in a manuscript image (typically based on an ASCII transcript) as the basis for further image retrieval, which limits their approach to retrieving other images by word only.
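The distance-based ranking that Leydier et al. describe can be illustrated with a minimal Python sketch. It is not their method: it assumes binarized images stored as nested lists, and it uses a plain pixel-mismatch ratio as the distance, where a practical system would rely on features that tolerate variation in handwriting. The names `patch_distance` and `spot_word` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # binarized image: rows of 0 (background) or 1 (ink)


def patch_distance(template: Image, page: Image, top: int, left: int) -> float:
    """Fraction of pixels that differ between the template and the page patch at (top, left)."""
    h, w = len(template), len(template[0])
    diff = sum(
        template[r][c] != page[top + r][left + c]
        for r in range(h)
        for c in range(w)
    )
    return diff / (h * w)


def spot_word(template: Image, page: Image, max_hits: int = 5) -> List[Tuple[float, int, int]]:
    """Score every placement of the template on the page and return the best matches first.

    Each hit is (distance, top, left); a distance of 0.0 is a pixel-perfect match.
    """
    h, w = len(template), len(template[0])
    hits = [
        (patch_distance(template, page, top, left), top, left)
        for top in range(len(page) - h + 1)
        for left in range(len(page[0]) - w + 1)
    ]
    hits.sort()  # smallest distance first, mirroring the "sorted list of hits"
    return hits[:max_hits]


# Toy example: the user outlines a 2x3 "word"; the page holds an exact copy
# starting at column 1 and a slightly damaged copy starting at column 6.
template = [[1, 0, 1],
            [1, 1, 1]]
page = [[0, 1, 0, 1, 0, 0, 1, 0, 1],
        [0, 1, 1, 1, 0, 0, 1, 1, 0]]
for distance, top, left in spot_word(template, page, max_hits=2):
    print(f"distance={distance:.2f} at row {top}, column {left}")
```

The sorted list corresponds to the hits that the user would then prune manually. Because the query is an image, the search can only find other images that resemble the outlined word, which is the limitation noted above.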

Another approach, presented by Edwards et al. (2004), trained a generalized Hidden Markov Model (gHMM), using the transcription of a Latin manuscript to obtain a transition model and one example each of 22 letters to create an emission model. Their transition model for unigrams, bigrams, and trigrams was fitted using the Latin Library’s electronic version of Caesar’s Gallic Wars, and their emission model was trained on 22 glyphs taken from a twelfth-century manuscript of Terence’s Comoediae. In contrast to Leydier et al., the authors argued that word-spotting was not entirely appropriate for a highly inflected language such as Latin:

Manmatha et al. … introduce the technique of “word spotting,” which segments text into word images, rectifies the word images, and then uses an aligned training set to learn correspondences between rectified word images and strings. The method is not suitable for a heavily inflected language, because words take so many forms. In an inflected language, the natural unit to match to is a subset of a word, rather than a whole word, implying that one
