26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...


Cayless thus developed a method for generating a “Scalable Vector Graphics (SVG) 413 representation of the text in an image of a manuscript” (Cayless 2009). This work was inspired by experiments in which Tom Elliott and Sean Gillies used the OpenLayers 414 JavaScript library to trace the text on a sample inscription, 415 and Cayless sought to create a “toolchain” that used only open-source software. To begin, Cayless converted JPEG images of manuscripts into a bitmap format using ImageMagick; 416 he then used an open-source tool called Potrace 417 to convert the bitmap to SVG. The SVG conversion process required some manual intervention, and an SVG editor called Inkscape was used to clean up the resulting SVG files. The resulting SVG documents were analyzed using a Python script that attempted to “detect lines in the image and organize paths within those lines into groups within the document” (Cayless 2008).
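The first two steps of this toolchain can be sketched as follows. This is a minimal illustration, not Cayless's own script: the function name `build_toolchain_commands`, the file names, and the 50% threshold are assumptions, and the source notes that the black/white cutoff was in fact chosen by hand for each page image. The sketch only builds the command lines; it assumes ImageMagick's legacy `convert` command (ImageMagick 7 installs it as `magick`) and Potrace's `-s` (SVG backend) and `-o` (output file) options.

```python
from pathlib import Path


def build_toolchain_commands(jpeg_path, threshold_percent=50):
    """Return argument lists for the JPEG -> bitmap -> SVG conversion.

    threshold_percent is a hypothetical default; the black/white cutoff
    has to be tuned for each page image.
    """
    jpeg = Path(jpeg_path)
    bitmap = jpeg.with_suffix(".pbm")  # Potrace reads PBM bitmaps
    svg = jpeg.with_suffix(".svg")

    # ImageMagick: reduce the photograph to pure black and white.
    to_bitmap = ["convert", str(jpeg),
                 "-threshold", f"{threshold_percent}%", str(bitmap)]
    # Potrace: trace the bitmap into vector paths and write SVG.
    to_svg = ["potrace", "-s", "-o", str(svg), str(bitmap)]
    return [to_bitmap, to_svg]


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch works
    # even where the two tools are not installed.
    for command in build_toolchain_commands("page.jpg"):
        print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` once both tools are installed; the manual cleanup of the traced paths in an SVG editor would still follow.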

After the text image within a larger manuscript page image had been converted into SVG paths, these paths could be grouped within the document to mark the words therein, and these groups could then be linked using various methods to tokenized versions of the transcriptions (Cayless 2009). Cayless then used the OpenLayers library to display the linked manuscript image and TEI transcription simultaneously. Importantly, OpenLayers “allows the insertion of a single image as a base layer (though it supports tiled images as well), so it is quite simple to insert a page image into it” (Cayless 2008). This initial system also required the addition of several functions to the OpenLayers library, particularly the ability to support “paths and groups of paths.” Ultimately, Cayless reported that:

The experiments outlined above prove that it is feasible to go from a page image with a TEI-based transcription to an online display in which the image can be panned and zoomed, and the text on the page can be linked to the transcription (and vice-versa). The steps in the process that have not yet been fully automated are the selection of a black/white cutoff for the page image, the decision of what percentage of vertical overlap to use in recognizing that two paths are members of the same line, and the need for line beginning (<lb/>) tags to be inserted into the TEI transcription (Cayless 2008).
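The vertical-overlap decision mentioned in this passage can be illustrated with a small sketch. It is not Cayless's script: it assumes each SVG path has already been reduced to a bounding box `(x_min, y_min, x_max, y_max)`, and the function names and the 0.5 default for `min_overlap` are hypothetical. The heuristic shown is a simple greedy grouping, one of several ways such a threshold could be applied.

```python
def vertical_overlap(a, b):
    """Fraction of the shorter box's height that boxes a and b share.

    Boxes are (x_min, y_min, x_max, y_max) tuples.
    """
    shared = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    if shared <= 0 or shorter <= 0:
        return 0.0
    return shared / shorter


def group_into_lines(boxes, min_overlap=0.5):
    """Greedily assign each box to the first line it overlaps enough.

    min_overlap is the tunable "percentage of vertical overlap"; the
    default here is an assumption, not a value from the source.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):  # top to bottom
        for line in lines:
            # Vertical extent currently covered by this line.
            extent = (0, min(b[1] for b in line), 0, max(b[3] for b in line))
            if vertical_overlap(box, extent) >= min_overlap:
                line.append(box)
                break
        else:
            lines.append([box])  # no line matched: start a new one
    # Order the boxes in each line from left to right.
    return [sorted(line, key=lambda b: b[0]) for line in lines]


if __name__ == "__main__":
    sample = [(0, 0, 10, 10), (12, 2, 20, 11), (0, 30, 10, 40), (15, 31, 22, 39)]
    for number, line in enumerate(group_into_lines(sample), start=1):
        print(number, line)
```

A lower `min_overlap` merges more paths into each line, which helps with ascenders and descenders but risks joining closely spaced lines, which is presumably why the choice resisted full automation.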

While automatic analysis of the SVG output has supported the detection of lines of text in page images, work continues on the automatic detection of words or other features in the image. Cayless concluded that this research raised two issues. First, further research would need to consider what structures (beyond lines) could be detected in an SVG document and how they could be linked to transcriptions. Second, TEI transcriptions often define document structure in a “semantic” rather than physical way, and even though line, word, and letter segments can be marked in TEI, they often are not. This makes it difficult, if not impossible, to automate the linking process. Cayless proposed that a standard would need to be developed for this type of linking.
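To make the dependence on physical markup concrete, the sketch below shows how linking becomes mechanical once a transcription does carry TEI line-beginning (`<lb/>`) markers. It is an illustration under stated assumptions, not a method from the source: the sample fragment, the group identifiers such as `line-1`, and the functions `lines_from_tei` and `link_lines` are all hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def lines_from_tei(xml_text):
    """Split a TEI fragment into physical lines at each <lb/> element."""
    root = ET.fromstring(xml_text)
    lines = []
    buffer = []

    def flush():
        # Join the collected text and normalize whitespace.
        text = " ".join("".join(buffer).split())
        if text:
            lines.append(text)
        buffer.clear()

    def walk(elem):
        if elem.tag == TEI_NS + "lb":
            flush()  # an <lb/> closes the previous physical line
        if elem.text:
            buffer.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                buffer.append(child.tail)

    walk(root)
    flush()
    return lines


def link_lines(group_ids, tei_lines):
    """Pair SVG line-group ids with transcription lines, in order."""
    if len(group_ids) != len(tei_lines):
        raise ValueError("line counts differ; manual alignment needed")
    return dict(zip(group_ids, tei_lines))


if __name__ == "__main__":
    fragment = (
        '<ab xmlns="http://www.tei-c.org/ns/1.0">'
        '<lb n="1"/>arma virumque <hi>cano</hi>'
        '<lb n="2"/>Troiae qui primus</ab>'
    )
    print(link_lines(["line-1", "line-2"], lines_from_tei(fragment)))
```

Without the `<lb/>` elements the fragment would come back as a single line, and the count check in `link_lines` would fail, which is the situation the paragraph above describes for semantically encoded transcriptions.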

Other experiments in automatic linking of images and transcriptions have been conducted by the TILE project. 418 This project seeks to build a “web-based image markup tool” 419 and is based on the existing code of the Ajax XML (AXE) image encoder. 420 It will be interoperable with both the EPPT and the

413 SVG is “a language for describing two-dimensional graphics and graphical applications in XML.” http://www.w3.org/Graphics/SVG/

414 http://trac.openlayers.org/wiki/Release/2.6/Notes

415 http://sgillies.net/blog/691/digitizing-ancient-inscriptions-with-openlayers

416 http://www.imagemagick.org/

417 http://potrace.sourceforge.net/

418 This project’s approach to digital editions was discussed earlier in this paper.

419 An initial release of TILE 0.9 is now available for download (http://mith.umd.edu/tile/), including extensive step-by-step documentation (http://mith.info/tile/documentation/) and a forum for users. This initial version includes an image markup tool, importing and exporting tools, and a semiautomated line recognizer. There is also a TILE sandbox (http://mith.umd.edu/tile/sandbox/), a “MITH-hosted version of TILE allowing users to try the tool before installing their own copy.”

420 http://mith.info/AXE/
