22.01.2015 Views

Military Communications and Information Technology: A Trusted ...


Chapter 3: Information Technology for Interoperability and Decision...


ISAF-MT

In our research project Machine Translation for ISAF Forces (ISAF-MT) [4], we have built a Dari-German translation system. Because the project is a German-U.S. cooperation, research on SMT for Dari and English is also being conducted. The objective of the project is to apply statistical machine translation (SMT) technology to build a Dari-German translation system for a military context.

The software framework we used for the ISAF-MT project is Moses [19]. Moses is the most widely used toolbox for constructing SMT systems. It is an open-source project with a very large user and developer community, which ensures that the latest concepts and techniques are available to developers of SMT systems.

In the course of our project we built a parallel German-Dari corpus, because no corpus for this language pair was available. The corpus consists mainly of news texts focusing on OSINT terrorism topics. To find a compromise between the size of the training data and the quality of translation, we applied a multilevel approach to corpus creation. On the one hand, we efficiently extracted parallel texts from the internet to meet the corpus-size requirement. On the other hand, high-quality translations were produced by native speakers and professional translators. Our current corpus contains about 27 000 lines: 13 000 lines are web-extracted news text, about 4500 lines are high-quality translations, and about 9500 lines consist of diverse text material, such as open-source subtitles, public translation examples, the Dari constitution and general-information texts. We further improved the system by integrating a Dari-German dictionary of about 80 000 entries that also contains some military terms.
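The text does not say how the dictionary was integrated. One common option, shown here only as a minimal sketch and not as the project's actual method, is to treat each dictionary entry as a very short sentence pair and append it to the parallel training corpus. The function name `merge_dictionary` and the tuple-based data layout are assumptions made for illustration.

```python
def merge_dictionary(corpus_pairs, dictionary_entries):
    """Append dictionary entries to a parallel corpus as extra sentence pairs.

    Hypothetical helper: both arguments are iterables of
    (source_text, target_text) tuples. Entries with an empty side and
    pairs already present in the corpus are skipped.
    """
    merged = [tuple(pair) for pair in corpus_pairs]
    seen = set(merged)
    for source, target in dictionary_entries:
        pair = (source.strip(), target.strip())
        if pair[0] and pair[1] and pair not in seen:
            merged.append(pair)
            seen.add(pair)
    return merged


if __name__ == "__main__":
    # Placeholder strings stand in for real Dari and German text.
    corpus = [("src sentence 1", "tgt sentence 1")]
    entries = [("src term", "tgt term"), ("src sentence 1", "tgt sentence 1")]
    for source, target in merge_dictionary(corpus, entries):
        print(source, "|||", target)
```

Appending entries this way lets the existing alignment and phrase-extraction steps pick them up without changes to the training pipeline. Whether it helps depends on the data, which is consistent with the authors testing several integration variants.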

In the course of our project we ran various experiments to improve system performance. In the following, we describe some of the aspects we experimented with that significantly influenced SMT performance.

• Quality of the training corpus: We ran experiments with different subsets of the corpus, selected by translation quality or type of text. We also looked into various ways of integrating dictionaries into our training data to find the most beneficial one.

• Corpus preparation: System performance improved when the training corpus was normalized, e.g., through different Unicode normalizations and normalizations specific to the Dari language.

• Software experiments: To build our SMT system we tried different available software, e.g., for training the translation model or the language model.

• Training configuration: The Moses framework provides different possible approaches and configurations for SMT. We looked into various settings (e.g., the maximum length of trained phrases) as well as into the integration of linguistic knowledge (e.g., lemmas) into the translation model.
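The corpus-preparation step in the list above can be illustrated with a short sketch. It applies Unicode NFC normalization and then a small character mapping of the kind often used for Persian-script text. The specific mapping (Arabic Yeh and Kaf replaced by their Persian forms, tatweel removed) and the function name `normalize_dari` are assumptions for illustration; the text does not list the normalizations the authors actually applied.

```python
import unicodedata

# Assumed mapping, not taken from the paper: Arabic code points that
# frequently appear in web text, mapped to the forms usual in Persian script.
_DARI_CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u0640": "",        # ARABIC TATWEEL (elongation mark) is dropped
}
_TRANSLATION_TABLE = str.maketrans(_DARI_CHAR_MAP)


def normalize_dari(line):
    """Normalize one corpus line (hypothetical helper).

    1. Apply Unicode NFC so equivalent sequences share one encoding.
    2. Replace the variant characters listed in _DARI_CHAR_MAP.
    3. Collapse runs of whitespace into single spaces.
    """
    text = unicodedata.normalize("NFC", line)
    text = text.translate(_TRANSLATION_TABLE)
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "\u0643\u064A"  # Arabic Kaf + Arabic Yeh
    print([hex(ord(ch)) for ch in normalize_dari(sample)])
```

Normalizing both training data and input text in the same way reduces the number of distinct surface forms for the same word, which matters for a corpus of only about 27 000 lines.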
