26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...


Text Mining, Quotation Detection, and Authorship Attribution

A number of potential technologies could benefit both from automatic citation detection and from the broader use of more standardized citation encoding in digital corpora; these include text-mining applications, such as the study of text reuse, as well as quotation detection and authorship attribution. While the research presented in this section made use of various text-mining and NLP techniques with unlabeled corpora, digital texts with large numbers of citations, either automatically or manually marked up, could provide useful training data for this kind of work. Regardless of how the information is detected and extracted, the ability to examine text reuse, trace quotations,[179] analyze individual authors, and study different patterns of authorship will be an increasingly important service, expected by users not only of mass-digitization projects but of classical digital libraries as well.

The eAQUA project,[180] based in Germany, is broadly investigating how text-mining technologies might be used in the analysis of classical texts through six specific subprojects (reconstruction of the lost works of the Atthidographers, text reuse in Plato, papyri classification, extraction of templates for inscriptions, metrical analysis of Plautus, and text completion of fragmentary texts).[181] “The main focus of this project is to break down research questions from the field of Classics in a reusable format fitting with NLP algorithms,” Büchler et al. (2008) submitted, “and to apply this type of approach to the data from the Ancient sources.” This approach of first determining how classical scholars actually conduct research and then attempting to match those processes with appropriate algorithms shows the importance of understanding the discipline for which tools are being designed, an essential point that recurs throughout this review.

The basic vision of eAQUA is to present a unified approach consisting of “Data, Algorithms and Applications,” and this project specifically addresses both the development of applications (research questions) and algorithms (NLP, text mining, co-occurrence analysis, clustering, classification). Data or corpora from research partners will be imported through standardized data interfaces into an eAQUA portal that is being developed. This portal will also give scholars access to all of the extracted structured data through a variety of web services.[182]

One area of active research being conducted by the eAQUA project is the use of citation detection and textual reuse in the TLG corpus to investigate “the reception of Plato as a case study of textual reuse on ancient Greek texts” (Büchler and Geßner 2009). In their work, they first extracted word-by-word citations by combining n-gram overlaps and significant terms for several works of Plato; second, they loosened the constraints on syntactic word order to find further citations. The authors emphasized that developing appropriate visualization tools is essential to the study of textual reuse, since text-mining approaches to corpora typically produce a huge amount of data that simply cannot be explored manually. Their paper thus offers several intriguing visualizations, including highlighting the differences in citations to works of Plato across time (from the Neo-Platonists to the Middle

[179] Preliminary research on quotation identification and tracking has been reported for Google Books (Schilit and Kolak 2008).

[180] http://www.eaqua.net/en/index.php

[181] The computational challenges of automatic metrical analysis and fragmentary texts have received some research attention. For metrical analysis, see Deufert et al. (2010), Eder (2007), Fusi (2008), and Papakitsos (2011). For fragmentary texts, see Berti et al. (2009) and Romanello et al. (2009b). The use of digital technology for inscriptions and papyri is covered in their respective sections.

[182] According to the W3C, a web service can be defined as “a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards.” (http://www.w3.org/TR/ws-arch/#whatis). Two important related standards are SOAP (Simple Object Access Protocol), a “lightweight protocol intended for exchanging structured information in a decentralized, distributed environment” (http://www.w3.org/TR/soap12-part1/), and WSDL (Web Services Description Language), an “XML format for describing network services as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information.” (http://www.w3.org/TR/wsdl)
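The two-step procedure attributed above to Büchler and Geßner (2009), exact n-gram overlap followed by a match with relaxed word order, can be illustrated with a minimal sketch. This is not the eAQUA implementation: the function names, the whitespace tokenizer, the trigram window, and the English sample sentences are all assumptions made for illustration, and the "significant terms" component of their method is omitted.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_keys(text, n=3, ordered=True):
    """Return the set of n-gram keys for a text.

    ordered=True  keeps word order (word-by-word matching).
    ordered=False sorts the words in each window, so word order
                  inside the window is ignored (relaxed matching).
    """
    grams = ngrams(tokenize(text), n)
    return {g if ordered else tuple(sorted(g)) for g in grams}


def shared_ngrams(source, candidate, n=3, ordered=True):
    """Return the n-gram keys that occur in both texts."""
    return ngram_keys(source, n, ordered) & ngram_keys(candidate, n, ordered)


def reuse_score(source, candidate, n=3, ordered=True):
    """Fraction of the candidate's distinct n-gram keys found in the source."""
    cand = ngram_keys(candidate, n, ordered)
    if not cand:
        return 0.0
    return len(cand & ngram_keys(source, n, ordered)) / len(cand)


# Hypothetical sample data, in English for readability.
source = "the unexamined life is not worth living for a human being"
reordered = "not worth living is the life unexamined"

strict = shared_ngrams(source, reordered, ordered=True)
relaxed = shared_ngrams(source, reordered, ordered=False)
print(len(strict), len(relaxed))
```

Relaxing the word-order constraint finds more shared windows for the reordered paraphrase than strict matching does, which is the effect the second step is meant to capture. On a corpus the size of the TLG, a real system would also need indexing and some form of significance filtering, neither of which is attempted here.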
