26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...


First, it utilizes a nearest neighbor framework that requires no hand-crafted rules, and provides analogies to facilitate learning. Second, and perhaps more significantly, it exploits a large, unlabelled corpus to improve the prediction of novel roots (Lee 2008).

Lee observed that many students of Ancient Greek memorized “paradigmatic” verbs that could be used as analogies to identify the roots of unseen verbs. Building on this insight, Lee used a “nearest-neighbor” machine-learning framework to model this process. When given a word in an inflected form, the algorithm searched for the root form among its “neighbors” by making substitutions to its prefix and suffix. Valid substitutions were harvested from pairs of inflected and root forms in a set of training data, and these pairs then served as “analogies to reinforce learning.” Nonetheless, Ancient Greek still posed some challenges that complicated a minimally supervised approach. Lee explained that heavily inflected languages such as Greek suffer from “data sparseness,” since many inflected forms appear at most a few times and many root forms may not appear at all in a corpus. As a rule-based system, Morpheus needed a priori knowledge of possible stems and affixes, all of which had to be crafted by hand. To scale better, Lee used a data-driven method that automatically determined stems and affixes from training data (morphology data for the Greek Septuagint from the University of Pennsylvania) and then used the TLG as a source of unlabeled data to guide prediction of novel roots.
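The substitution idea described above can be illustrated with a small sketch. This is not Lee's implementation. It is a minimal toy, assuming transliterated forms, a longest-common-substring heuristic for locating the shared stem, and a ranking that prefers candidate roots attested in an unlabeled word list. All function names, the training pairs, and the word counts are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical toy data: (inflected form, root form), transliterated.
TRAINING_PAIRS = [("elusa", "luo"), ("luei", "luo"), ("graphei", "grapho")]

# Hypothetical word counts standing in for a large unlabeled corpus.
UNLABELED_COUNTS = Counter(["pauo", "pauo", "lego", "grapho"])


def harvest_rule(inflected, root):
    """Split a training pair around its longest shared substring (assumed stem)."""
    matcher = SequenceMatcher(None, inflected, root, autojunk=False)
    m = matcher.find_longest_match(0, len(inflected), 0, len(root))
    # Rule = (prefix to strip, suffix to strip, prefix to attach, suffix to attach)
    return (inflected[:m.a], inflected[m.a + m.size:],
            root[:m.b], root[m.b + m.size:])


def harvest_rules(pairs):
    """Count how often each substitution rule occurs in the training pairs."""
    return Counter(harvest_rule(inflected, root) for inflected, root in pairs)


def candidates(word, rules):
    """Apply every matching rule to `word`; return candidate roots with rule counts."""
    found = Counter()
    for (pre_in, suf_in, pre_out, suf_out), count in rules.items():
        stem_len = len(word) - len(pre_in) - len(suf_in)
        if stem_len > 0 and word.startswith(pre_in) and word.endswith(suf_in):
            stem = word[len(pre_in):len(pre_in) + stem_len]
            found[pre_out + stem + suf_out] += count
    return found


def predict_root(word, rules, corpus_counts):
    """Pick the candidate best attested in the unlabeled data, then by rule count."""
    found = candidates(word, rules)
    if not found:
        return None
    return max(found, key=lambda c: (corpus_counts.get(c, 0), found[c], c))


if __name__ == "__main__":
    rules = harvest_rules(TRAINING_PAIRS)
    for form in ("epausa", "legei"):
        print(form, "->", predict_root(form, rules, UNLABELED_COUNTS))
```

In this toy, the pair ("elusa", "luo") yields the rule "strip prefix e and suffix sa, attach suffix o," which is then applied by analogy to an unseen form. The unlabeled counts only break ties among candidates here; how Lee actually weighted corpus evidence is not specified in the text above.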

While Lee made use of machine learning and unlabeled corpora, Tambouratzis (2008) automated the morphological segmentation of Greek by “coupling an iterative pattern-recognition algorithm with a modest amount of linguistic knowledge, expressed via a set of interactions associated with weights.” He used an “ant colony optimization (ACO) metaheuristic” to automatically determine optimal weight values and found that in several cases the automatic system provided better results than those that had been manually determined by scholars. In contrast to Lee, Tambouratzis used only a subset of the TLG for training data (in this case, the speeches of several Greek orators).

In addition to the work done by Dik and Whaling for “Perseus Under PhiloLogic,” other research into automatic morphological analysis of Latin has been conducted by Finkel and Stump (2009). These authors reported on computational experiments to generate the morphology of Latin verbs.

Lexicons

Lexicons are reference tools that have long played an important role in classical scholarship and particularly in the study of historical languages. 153 As previously noted, the lack of a computational lexicon for Sanskrit is a major research challenge. This section explores some important lexicons for classical languages and suggests new roles for these traditional reference works in a digital environment.

The Comprehensive Aramaic Lexicon 154 (CAL) hopes to serve as a “new dictionary of the Aramaic language.” Aramaic is a Semitic language, and numerous inscriptions and papyri, as well as Biblical and other religious texts, are written in it. This project, currently in preparation by an international team of scholars, is based at Hebrew Union College in Cincinnati. The goal is to create a comprehensive lexicon that will take all of ancient Aramaic into account, be based on a compilation of all Aramaic literature, and include extensive references to modern scholarly literature. Although a

153 This section focuses on larger projects that plan to create online or digital lexicons in addition to printed ones, but there are also a number of lexicons for classical languages that have been placed online as PDFs or in other static formats, such as the Chicago Demotic Dictionary (http://oi.uchicago.edu/research/projects/dem/); other projects have scanned historical dictionaries and provided online searching capabilities, such as Sanskrit, Tamil and Pahlavi Dictionaries, http://webapps.uni-koeln.de/tamil/

154 http://cal1.cn.huc.edu/
