26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...

Markov Models (MEMM) 82 and outperformed previous baseline methods. “We hope to use this combined model for preannotation in an active learning setting to aid annotators in labeling a large Syriac corpus,” wrote McClanahan et al., concluding that “this corpus will contain data spanning multiple centuries and a variety of authors and genres. Future work will require addressing issues encountered in this corpus. In addition, there is much to do in getting the overall tag accuracy closer to the accuracy of individual decisions.” 83

Another challenge in building NLP tools for Syriac is that it is written in an abjad, a writing system that omits vowels and other diacritics. According to Haertel et al. (2010), however, the automatic addition of diacritics to Syriac text could greatly enhance the utility of these texts for historical and linguistic research. They consequently developed an automatic-diacritization system that utilized conditional Markov models (CMMs) and a number of already-diacritized texts as training data. When testing their system with related Semitic languages (e.g., Hebrew) as well as resource-rich languages such as English, they reported that their system outperformed other low-resource approaches and achieved nearly state-of-the-art results when compared with resource-rich systems.
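To make the idea of diacritization as sequence labeling more concrete, the following is a minimal sketch and not the system of Haertel et al. It assumes a toy setting in which Latin consonants stand in for Syriac letters and the vowels a, e, i, o, u stand in for diacritics. Each consonant receives a label (the vowel string that follows it), and each label is predicted from the consonant and the previously predicted label, which mirrors the left-to-right local conditioning of a conditional Markov model. Simple counts replace the trained classifier that a real CMM would use, and all function names and training words are hypothetical.

```python
from collections import Counter, defaultdict

VOWELS = set("aeiou")  # assumption: stand-ins for diacritics in this toy setting


def split_pairs(word):
    """Split a diacritized word into (consonant, following-vowels) pairs.

    A start marker "^" is prepended so that word-initial vowels have a slot.
    """
    pairs = [("^", "")]
    for ch in word:
        if ch in VOWELS:
            cons, vow = pairs[-1]
            pairs[-1] = (cons, vow + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def strip_vowels(word):
    """Remove the stand-in diacritics, leaving the consonant skeleton."""
    return "".join(ch for ch in word if ch not in VOWELS)


def train(diacritized_words):
    """Count labels per (previous label, consonant), plus a consonant-only backoff."""
    full = defaultdict(Counter)
    backoff = defaultdict(Counter)
    for word in diacritized_words:
        prev = "<s>"
        for cons, vow in split_pairs(word):
            full[(prev, cons)][vow] += 1
            backoff[cons][vow] += 1
            prev = vow
    return full, backoff


def diacritize(skeleton, model):
    """Greedily restore vowels left to right, conditioning on the previous label."""
    full, backoff = model
    prev = "<s>"
    pieces = []
    for cons in "^" + skeleton:
        dist = full.get((prev, cons)) or backoff.get(cons)
        vow = dist.most_common(1)[0][0] if dist else ""  # unseen consonant: no vowel
        pieces.append(cons + vow)
        prev = vow
    return "".join(pieces)[1:]  # drop the start marker


model = train(["kataba", "malka"])  # hypothetical already-diacritized training words
print(diacritize(strip_vowels("kataba"), model))  # -> kataba
```

A realistic system would replace the counts with a trained probabilistic classifier over richer features and would search over label sequences rather than decoding greedily; the sketch only shows how already-diacritized text can serve as training data once the diacritics are stripped.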

The TURGAMA Project, 84 based at the Institute for Religious Studies at Leiden University, is also exploring the use of “computer-assisted linguistic analysis” for both Aramaic and Syriac translations of the Bible. Rather than creating a tagged corpus, Project Director Wido van Peursen has explained, the TURGAMA Project has focused on developing a highly specific encoding model for its research. He noted that the project started one level below that of the Oxford-BYU project, that is, at the level of morphology rather than POS tagging.

Van Peursen outlined a number of challenges that arise when conducting linguistic analysis with ancient texts such as Syriac Biblical manuscripts: there are no native speakers of the languages involved, there are only written sources, there are multiple manuscript witnesses for individual “texts,” and the corpora are typically quite small. These challenges, according to van Peursen, lead to two concrete dilemmas when developing ancient text corpora: (1) Should computational analysis be “data-oriented or theory-driven”? and (2) What is the priority for the corpus of the language? The analysis of these dilemmas led van Peursen to argue that an encoding model, rather than a tagging approach, should be taken:

The challenges and dilemmas mentioned above require a model that is deductive rather than inductive; that goes from form (the concrete textual data) to function (the categories that we do not know a priori); that entails registering the distribution of linguistic elements, rather than merely adding functional labels—in other words, that involves encoding rather than tagging; that registers both the paradigmatic forms and their realizations; that allows grammatical categories and formal descriptions to be redefined on the basis of corpus analysis; and that involves interactive analytical procedures, which are needed for the level of accuracy we aim for (van Peursen 2009).

Consequently, the TURGAMA Project’s linguistic analysis of Hebrew and Syriac is conducted from the bottom up; i.e., at the levels of individual words, phrases, clauses, and the entire text. The workflow of their system involves pattern-recognition programs, “language-specific auxiliary files”

82 http://en.wikipedia.org/wiki/Maximum_entropy_Markov_model

83 An earlier discussion of the complicated process of combining machine learning, human annotation, and ancient corpus creation by this project can be found in Carroll et al. (2007).

84 The full project name is TURGAMA, “Computer-Assisted Analysis of the Peshitta and the Targum: Text, Language and Interpretation.” http://www.hum.leiden.edu/religion/research/research-programmes/antiquity/turgama.html
