26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...



recognized. In addition, IPA may be used for pronunciation, Greek letters may be used for classical Greek quotations, and Greek letters and other special characters may be used for indicating footnotes or other references (Breuel 2009).

The challenge of multilingual document recognition is a significant one for classical scholarship that has been reported by many digital classics projects. OCRopus has built-in line recognizers for Latin scripts, and unlike those of many other OCR systems, these recognizers make few assumptions about character sets and fonts and are instead “trained” on text-line input that is then aligned against ground truth data and can be used to automatically train “individual character shape models.” For Devanagari, OCRopus handled diacritics by treating “character+diacritic” combinations as novel characters.
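
The “character+diacritic” treatment can be illustrated with a minimal Python sketch. It is not OCRopus code: the function names `clusters` and `build_label_set` are hypothetical, and detecting marks through the Unicode general category is an assumption made purely for illustration. The sketch groups each base character with the marks that follow it and assigns every distinct combination its own class label, as if it were a novel character.

```python
import unicodedata


def clusters(text):
    """Split text into units of one base character plus any following marks.

    Assumption: a code point whose Unicode general category starts with "M"
    (Mn, Mc, Me) is treated as a diacritic attached to the preceding unit.
    """
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch  # attach the mark to the preceding base character
        else:
            units.append(ch)  # start a new unit
    return units


def build_label_set(lines):
    """Map every distinct non-space unit in the lines to an integer class id."""
    labels = {}
    for line in lines:
        for unit in clusters(line):
            if not unit.isspace():
                labels.setdefault(unit, len(labels))
    return labels
```

Under this scheme a recognizer would be trained on one class per observed combination rather than on separate classes for base characters and marks. The label set grows with the number of combinations that occur in the ground truth, which is the trade-off of treating each combination as its own character.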

The final processing stage of OCRopus is language modeling, which is based on WFSTs. WFSTs allow language models and character recognition alternatives to be “manipulated algebraically,” and such language models can be learned from training data or constructed manually. One important use of such models for mixed-language classical texts is that they can automatically identify languages within a digital text. “We can take existing language models for English and Sanskrit and combine them,” Breuel explained. “As part of the combination, we can train or specify the probable locations and frequencies of transitions between the two language models, corresponding to, for example, isolated foreign words within one language, or long quotations” (Breuel 2009).
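
The idea of combining two language models with explicit transitions between them can be sketched in a few lines of Python. This is an illustrative stand-in, not the OCRopus implementation: a plain Viterbi decoder over per-language states replaces the WFST machinery, the character-unigram models are deliberately simplistic, and the names `train_char_model`, `tag_languages`, and `p_switch` are hypothetical.

```python
import math
from collections import Counter


def train_char_model(words, alpha=1.0):
    """Return a function scoring a word by smoothed character log-probability."""
    counts = Counter("".join(words))
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen characters

    def logprob(word):
        return sum(
            math.log((counts[ch] + alpha) / (total + alpha * vocab)) for ch in word
        )

    return logprob


def tag_languages(tokens, models, p_switch=0.1):
    """Label each token with the most likely language (Viterbi decoding).

    models maps a language name to a word-scoring function (at least two
    languages are assumed); p_switch is the assumed probability of changing
    language between adjacent tokens.
    """
    if not tokens:
        return []
    langs = list(models)
    stay = math.log(1.0 - p_switch)
    switch = math.log(p_switch / (len(langs) - 1))

    def step(prev, cur):
        return stay if prev == cur else switch

    # Uniform prior over languages for the first token.
    score = {l: -math.log(len(langs)) + models[l](tokens[0]) for l in langs}
    backpointers = []
    for tok in tokens[1:]:
        new_score, pointers = {}, {}
        for cur in langs:
            best = max(langs, key=lambda prev: score[prev] + step(prev, cur))
            pointers[cur] = best
            new_score[cur] = score[best] + step(best, cur) + models[cur](tok)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the final token.
    path = [max(langs, key=lambda l: score[l])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

In this sketch the single parameter `p_switch` plays the role of the transition frequencies mentioned in the quotation: a larger value makes isolated foreign words cheaper to label, while a smaller value favors long uninterrupted runs such as quotations. A WFST-based system could express the same constraint as weighted arcs joining the two language models.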

As this subsection has indicated, the computational challenges of processing Sanskrit are being actively researched, and some of the technical solutions may be adaptable to other historical languages as well.

Syriac<br />

The Syriac dialect belongs to the Aramaic branch of the Semitic languages and flourished between the third and seventh centuries AD, although it continued to be used as a written language through the nineteenth century. It has a somewhat smaller body of research in terms of document analysis and recognition than other ancient dialects and languages covered in this review, but nonetheless there is an active and growing body of research on this topic. Although there are fewer digital texts available in Syriac online than for Greek, Latin, Sumerian, or Sanskrit, some texts written in this dialect can be found in many papyri and manuscript collections. 66 In addition, a number of printed editions and reference works on Syriac can be found online in both Google Books and the Internet Archive. 67

Document-recognition work with Syriac has been reported by Bilane et al. (2008), who have investigated the use of word-spotting 68 for handwriting analysis in digitized Syriac manuscripts. They noted that Syriac presents a particularly interesting case for automatic historical-document analysis because it combines the word structure and calligraphy of Arabic handwriting while also being intentionally written at an angle. Bilane et al. (2008) used a word-spotting method so that they would not need to rely on any prior information or be dependent on specific word or character-segmentation algorithms. Tse and Bigun (2007) have also reported on work in the automatic recognition of Syriac, with a focus on developing an initial character-recognition system that can serve as a baseline OCR for Syriac-Aramaic texts that use the Serto script. Their system does not require the use of segmentation

66 For example, see the Syriac manuscripts available through the Virtual Manuscript Room, http://vmr.bham.ac.uk/Collections/Mingana/part/Syriac/

67 See, for example, Breviarium: juxta ritum ecclesiæ Antiochenæ Syrorum (http://books.google.com/booksid=w-UOAAAAQAAJ) in Google Books, or The book of consolations; or, The pastoral epistles; the Syriac text (with both the Syriac text and the English translation) in the Internet Archive (http://www.archive.org/details/bookofconsolatio00ishouoft).

68 Word-spotting is a technique that has also been used with Latin manuscripts and is explained in detail earlier in this paper.
