
NOTICE

“This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.”


Advances in Computational Sciences and Technology
ISSN 0973-6107 Volume 2 Number 3 (2009) pp. 279–291
© Research India Publications
http://www.ripublication.com/acst.htm

Design and Implementation of XML Based Architecture for English to Hindi Language Sentence Translation

1 Harshad B. Prajapati, 2 Vipul K. Dabhi, 3 Logee Vaghela and 4 Hiren Vataliya

1,2 Department of Information Technology of Faculty of Technology, Dharmsinh Desai University, Nadiad 387001, Gujarat, INDIA
1 E-mail: harshad.b.prajapati@gmail.com
2 Department of Information Technology of Faculty of Technology, Dharmsinh Desai University, Nadiad 387001, Gujarat, INDIA
3,4 Students of UG course in Information Technology at Dharmsinh Desai University during 2006 to 2009

Abstract

The problem of automated English to Hindi language translation has been actively researched by Indian researchers, and researchers in other countries likewise work on translating English into their national or native languages. However, the majority of the attempted approaches are based on Natural Language Processing (NLP) and require program modifications to add new rules for the grammar of a language. In this paper, we propose an extensible, simple, and elegant architecture, based on Extensible Markup Language (XML), for English to Hindi language sentence translation. We also implement the proposed architecture using Java technology as a proof of concept (PoC). The architecture presented in this paper will help Indian researchers make further progress on the English to Hindi language translation problem with a novel and extensible architectural approach. Moreover, our approach is generic: it can be used for any source language and target language whose grammars can be expressed in XML schema.

Index Terms: Language translation, English to Hindi, Application of XML.

Introduction

The language understanding problem is frequently encountered in business, society, culture, etc. Disparity in the languages used is handled through language translation. Specifically, in India, government offices also circulate rules, memos, notices, etc. in Hindi, as do educational institutes and universities in the country. At present, English documents are manually translated into other languages. An automated system designed for this translation could do the work efficiently and reduce human errors. We handle the problem of English to Hindi translation using an architecture based on Extensible Markup Language (XML) [1].

We propose an XML based solution for English to Hindi language sentence translation. To provide the solution, we represent the English grammar of a particular sentence in XML schema [2]. We also represent the Hindi grammar of that sentence in XML schema, and the required translation file in the form of Extensible Stylesheet Language Transformations (XSLT) [3]. We categorize the grammar files by the number of tokens present in the sentence to make the grammar matching operation fast. We maintain a database of English words, the corresponding Hindi words, and the types of the English words (e.g., noun, pronoun, verb) to aid the translation process. Although we have attempted only certain sentence combinations of English grammar, our architecture is extensible enough to accommodate any number of grammar rules. Moreover, our results indicate that the proposed architecture is effective in English to Hindi language sentence translation.

As Hindi is a national language of India, business in the government offices of the country is also conducted in Hindi. Moreover, notices, hoardings, and railway time-tables are shown in Hindi along with English, so that those who do not understand English can still understand the instructions. These are a few examples where our work can help convert English documents to Hindi documents. Moreover, our proposed architecture is generic and can be adapted to accommodate other languages.

Our objectives in this paper are (i) to design an XML based extensible, simple, and elegant architecture for English to Hindi language sentence translation, and (ii) to implement the required components using Java technology [4] as a proof of concept (PoC) of our novel approach to sentence translation.

The paper is structured as follows. Section II discusses related work and the motivation for our work. Section III discusses language translation and the required support from XML technology. Section IV proposes and discusses an XML based architecture for English to Hindi language sentence translation. Section V discusses an implementation of the architecture. Section VI provides results and discussion. Finally, Section VII concludes our work.

Related Work and Motivation for Our Work

A. Related Work

The work in ANGLABHARTI [5] [6] represents a machine-aided translation methodology specifically designed for translating English to Indian languages. Instead of designing translators for English to each Indian language, ANGLABHARTI uses a pseudo-interlingua approach. It analyses English only once and creates an intermediate structure called PLIL (Pseudo Lingua for Indian Languages). The PLIL structure is then converted to each Indian language through a process of text generation. The English to Hindi version of the ANGLABHARTI translation methodology, which targets translation from English to all Indian languages, is AnglaHindi [7].

The work in [8] uses a Natural Language Processing [9] based approach to recognize the content of a document in order to convert it into another language. It focuses on an English-Hindi translation system with special reference to the weather narration domain.

The work in [10] focuses on the development of an Example Based Machine Translation (EBMT) system for English to Hindi. Such a system is useful where the large-scale computational resources otherwise required by a machine translation system are not available. Part of this work [11] focuses on handling divergence issues of translation in an EBMT system for English to Hindi. Divergence occurs when structurally similar sentences of the source language are not translated into structurally similar sentences in the target language. An EBMT system generates the translation of a given sentence by retrieving similar past translation examples from its example base and then adapting them suitably to meet the current translation requirements.

In our work, we focus on the design and implementation of a general-purpose XML based architecture that can be used for translating a sentence from one language to another. Specifically, however, we discuss English to Hindi translation. The requirement in our work is that the grammars of both the source and the target language be expressible in XML schema format. In our current work, we omit all advanced and complex processing of sentences, such as verb formation; suffix and prefix handling; and handling of phrases, clauses, etc. However, we intend to include all of this advanced processing in our future work.

B. Motivation for Our Work

Three main reasons motivated us to work on an XML based architecture for English to Hindi language translation. First, as English is an international language and Hindi is our national language, we felt it appropriate to select these two languages for translation. Second, Indian people who do not understand English (less than 5% of the population [12] can understand English) can understand and learn it by having it translated into Hindi. Third, we wanted to evaluate XML technology for the language translation problem, as XML provides very good support for document validation and transformation.

Language Translation and Support From XML Technology

A. Language Translation and its Need

Since the evolution of human beings, language has remained a basic instrument for communication. Language, for oral as well as written communication, has greatly contributed to the progress of business, society, culture, economy, etc. in the world. However, unfamiliarity with the language used can cause serious problems for the communicating parties. There are two approaches to a solution: either use a common language, or use a translator between sender and receiver. Thus, the need to translate from one language to another arises when people in an environment are unable to understand the language used. Specifically, in India, although English literacy is increasing day by day, a great number of people are still not able to understand English. Thus, the language translation approach is more effective and practical when the mass of people is not ready to accept a single common language for communication.

B. XML Technology in Document Translation

Here, we discuss how XML technology [1] is useful in document validation and translation. XML is a general-purpose specification for creating custom markup languages. It is classified as an extensible language because it allows its users to define their own elements. XML is recommended by the World Wide Web Consortium (W3C) [13] and is a fee-free open standard. The recommendation specifies both the lexical grammar and the requirements for parsing. Parsing an XML document makes it possible to check whether the document is well formed and whether it is schema valid. Using this facility, we can check whether a document follows a defined structure by parsing it against the schema.
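The validity check described above can be sketched with the standard `javax.xml.validation` API. This is only an illustrative sketch: the class name `SchemaCheck`, the element names (`sentence`, `pronoun`, `verb`, `article`, `noun`), and the schema itself are assumptions made for the example, not the grammar files used in the paper.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class SchemaCheck {
    // Hypothetical grammar: pronoun, verb, article, noun, in exactly that order.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='sentence'><xs:complexType><xs:sequence>"
      + "<xs:element name='pronoun' type='xs:string'/>"
      + "<xs:element name='verb' type='xs:string'/>"
      + "<xs:element name='article' type='xs:string'/>"
      + "<xs:element name='noun' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true when the XML text is well formed and valid against the schema text. */
    static boolean isValid(String xml, String xsd) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("the schema itself could not be compiled", e);
        }
        try {
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well formed, or does not follow the schema
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sentence><pronoun>This</pronoun><verb>is</verb>"
                   + "<article>a</article><noun>boy</noun></sentence>";
        System.out.println(isValid(xml, XSD));
    }
}
```

A document whose elements appear in a different order is rejected by the same call, which is the property the architecture relies on when matching a sentence to a grammar.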

XSLT [3] (Extensible Stylesheet Language Transformations) is a declarative, XML based document transformation language. An XSLT processor uses an XSLT style-sheet [14] as a guide for converting the data tree represented by one XML document into another. Thus, XSLT can be used to alter the format of XML data, either into another XML document or into HTML [15] or other formats that are suitable for display. An important XML extension, XPath [16], is a Document Object Model (DOM) like node tree data model and path expression language used for selecting the data present in XML documents. XPath makes it possible to refer to individual parts of an XML document. This provides random access to XML data for other technologies, including XSLT, XSL-FO [3], and XQuery [17]. XQuery is a W3C language for querying, constructing, and transforming XML data. XPath expressions can refer to all or part of the text, data, and values in XML elements, attributes, processing instructions, comments, etc. They can also access the names of elements and attributes. This extension is useful in different kinds of document processing. Moreover, the DOM, pull, and push parsing APIs for XML processing, which are available for most programming and scripting languages, help a lot in document processing.
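As a small illustration of how XPath addresses individual parts of a document, the sketch below evaluates path expressions with the standard `javax.xml.xpath` API. The class name `XPathDemo` and the sample document layout are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class XPathDemo {
    /** Evaluates an XPath expression against XML text and returns the result as a string. */
    static String select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical instance document: each token wrapped in an element named after its type.
        String xml = "<sentence><pronoun>This</pronoun><verb>is</verb>"
                   + "<article>a</article><noun>boy</noun></sentence>";
        System.out.println(select(xml, "/sentence/noun"));        // text of the noun element
        System.out.println(select(xml, "name(/sentence/*[2])"));  // name of the second child
    }
}
```

The first expression reads element content and the second reads an element name, which are the two kinds of access mentioned in the paragraph above.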

Architecture for English to Hindi Sentence Translation

In this section, we propose an architecture for English to Hindi language sentence translation. The architecture is, however, extensible enough to accommodate other languages as target languages of translation. Before attempting XML technology for the English to Hindi language translation problem, we also studied other methods, and found that XML technology enables the following characteristics in our architecture.

• Simple: Rather than designing a parser for parsing English sentences, we use an already available XML parser, which makes parsing of sentences easy. XML parsers are available for many programming languages. Thus, the programmer does not need to write parsing code.

• Extensible: The architecture supports adding new grammar rules in the form of XML schema (.xsd) files. Thus, extending the implementation does not require major changes in code, which are really difficult to make in other approaches to translation.

• Elegant: All the components of the architecture are modular and have well-defined behaviors. Moreover, the input and output of intermediate processing are in the form of XML documents. This makes it easy to put extra information in intermediate output or input (it will be used in future work for advanced processing of sentences).

Our proposed architecture for English to Hindi sentence translation is shown in Fig. 1. We discuss here the working of all the components.

Figure 1: XML based proposed architecture for English to Hindi sentence translation.

A. User Interface

The user interface component is responsible for taking an English sentence as input from the user. This component passes the input sentence to the token generator component and waits for the translated Hindi sentence from the Hindi sentence builder component. For example, Fig. 2 shows a sample English sentence as input and the corresponding output sentence in Hindi. We take this sentence as an example to understand the working of the other components.

Figure 2: Sample input sentence in English and corresponding output sentence in Hindi.

B. Token Generator

The token generator component separates out (i.e., tokenizes) the words that are present in the English sentence. The generated tokens are passed to the token type generator component to identify their types. The example input sentence, “This is a boy”, results in four tokens: This, is, a, and boy.
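A minimal sketch of such a token generator is given below. The paper does not specify how punctuation is treated, so trimming leading and trailing punctuation from each word is an assumption of this example, as is the class name `Tokenizer`.

```java
import java.util.ArrayList;
import java.util.List;

final class Tokenizer {
    /** Splits a sentence on whitespace and trims punctuation from both ends of each word. */
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String piece : sentence.trim().split("\\s+")) {
            // Assumption: punctuation such as a trailing full stop is not a token.
            String word = piece.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This is a boy")); // prints [This, is, a, boy]
    }
}
```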

C. Token Type Generator

The token type generator consults the token type database to find out the type of each token (i.e., word) generated by the token generator. The token type database contains English words and their corresponding types. It is quite possible for certain English words to have more than one type. For the example input sentence, the token type generator finds out that “this” is a pronoun, “is” is a verb, “a” is an article, and “boy” is a noun.

D. XML Instance-File Builder

The XML instance-file builder component takes tokens and their types as input from the token type generator component and generates an XML instance file (i.e., the tokens annotated with their types, in sequence) corresponding to the English sentence. When a particular token has more than one type, more than one XML instance-file is generated. For example, if a single word is found to be both a pronoun and an adjective in the token type database, then two XML instance-files are generated. For the example input sentence, the generated XML instance-file is shown in Fig. 3.

Figure 3: XML instance file for given sample input sentence.
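The behavior of the builder, including the case of a word with several types, can be sketched as follows. The element layout (a `sentence` root with one child per token, named after the token type) is an assumption made for illustration, since the content of Fig. 3 is not reproduced here; an in-memory map stands in for the token type database, and `InstanceBuilder` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class InstanceBuilder {
    /** Builds one candidate instance document per combination of token types. */
    static List<String> build(List<String> tokens, Map<String, List<String>> typesByWord) {
        List<String> bodies = new ArrayList<>();
        bodies.add("");
        for (String token : tokens) {
            // Assumption: a word missing from the type table is tagged "unknown".
            List<String> types = typesByWord.getOrDefault(
                    token.toLowerCase(Locale.ROOT), List.of("unknown"));
            List<String> extended = new ArrayList<>();
            for (String body : bodies) {
                for (String type : types) {
                    extended.add(body + "<" + type + ">" + escape(token) + "</" + type + ">");
                }
            }
            bodies = extended;
        }
        List<String> documents = new ArrayList<>();
        for (String body : bodies) {
            documents.add("<sentence>" + body + "</sentence>");
        }
        return documents;
    }

    // Keeps the generated text well formed if a token contains markup characters.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, List<String>> types = Map.of(
                "this", List.of("pronoun", "adjective"),
                "is", List.of("verb"),
                "a", List.of("article"),
                "boy", List.of("noun"));
        build(List.of("This", "is", "a", "boy"), types).forEach(System.out::println);
    }
}
```

With "this" listed as both pronoun and adjective, the sketch produces two candidate documents; which one survives is decided later by validation against the grammar files.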

E. XML Instance-File Parser

The XML instance-file parser finds the correct English grammar file (i.e., .xsd file) in the database of .xsd files. When multiple XML instance-files are generated by the XML instance-file builder, the XML instance-file parser finds the valid XML instance-file by parsing all the instance files against the most appropriate .xsd files present in the database. When a single XML instance-file is generated, the XML instance-file parser just parses that file against the most appropriate .xsd files. The end result of the process is the selection of a valid XML instance-file and the corresponding English grammar file. The XML instance-file parser passes the valid XML instance-file (i.e., .xml) to the English to Hindi translator component and the selected English grammar file (i.e., .xsd) to the Hindi grammar selector component. For the example English sentence, a fragment of the selected .xsd file is shown in Fig. 4.

Figure 4: Valid English grammar file (i.e., .xsd) that can parse the given sample English sentence.

As more and more English grammar rules are added, the number of files in the English grammar database also increases. The size of the English grammar database can affect the time taken to find the valid XML instance-file and/or the correct English grammar file. For this reason, when choosing the most appropriate English grammar files for parsing, we make an educated guess based on the number of tokens present in the English sentence.
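The idea of narrowing the search by token count can be sketched with a simple in-memory index. This is an illustration only: the class `GrammarIndex` and its methods are hypothetical, and each registered form stands in for one grammar (.xsd) file, using sentence forms of the kind listed in Table 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GrammarIndex {
    // Grammar forms bucketed by the number of tokens they describe.
    private final Map<Integer, List<String>> byTokenCount = new HashMap<>();

    /** Registers a form such as "pronoun verb article noun" under its token count. */
    void add(String form) {
        int count = form.trim().split("\\s+").length;
        byTokenCount.computeIfAbsent(count, k -> new ArrayList<>()).add(form);
    }

    /** Returns only the forms worth trying for a sentence with this many tokens. */
    List<String> candidates(int tokenCount) {
        return byTokenCount.getOrDefault(tokenCount, List.of());
    }

    public static void main(String[] args) {
        GrammarIndex index = new GrammarIndex();
        index.add("pronoun verb noun");
        index.add("pronoun verb article noun");
        index.add("pronoun verb adverb adjective");
        // A four-token sentence is checked against two forms instead of all three.
        System.out.println(index.candidates(4));
    }
}
```

Only the grammars in the matching bucket then need to be tried by the validating parser, so the cost of a lookup depends on the bucket size rather than on the total number of grammar files.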

F. English to Hindi Grammar Selector

This component selects the Hindi grammar file corresponding to the selected English grammar file by consulting the Hindi grammar and translation database. This database contains Hindi grammar files (i.e., .xsd files) and translation files (i.e., .xslt files). Once the Hindi grammar file is selected, the corresponding Hindi translation file is also retrieved from the database and passed to the English to Hindi translator component.

G. English to Hindi Translator

The English to Hindi translator component gets the valid XML instance-file (.xml) from the XML instance-file parser component and the valid Hindi .xsd and .xslt files from the Hindi grammar selector component. Using the .xslt file, the English to Hindi translator transforms the English XML instance-file into a Hindi XML instance-file. Once the transformation is over, the English to Hindi translator component removes all the tags and forms the Hindi sentence in English words (i.e., it does the SVO to SOV conversion, since Hindi is an SOV language). For the example sentence, the generated Hindi XML instance-file is shown in Fig. 5.

Figure 5: Fragment of the Hindi XML instance file for example English sentence.
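The transformation step can be sketched with the standard `javax.xml.transform` API. The stylesheet below is a hypothetical translation file for the form "pronoun verb article noun" that simply moves the verb to the end; the element names and the class `SvoToSov` are assumptions, since the paper's actual .xslt files are not reproduced here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class SvoToSov {
    // Hypothetical translation file: copies the children with the verb moved last.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/sentence'>"
      + "<sentence>"
      + "<xsl:copy-of select='pronoun'/>"
      + "<xsl:copy-of select='article'/>"
      + "<xsl:copy-of select='noun'/>"
      + "<xsl:copy-of select='verb'/>"
      + "</sentence>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet, then strips the tags and returns the words in their new order. */
    static List<String> reorder(String englishXml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            DOMResult result = new DOMResult();
            transformer.transform(new StreamSource(new StringReader(englishXml)), result);
            Element root = ((Document) result.getNode()).getDocumentElement();
            List<String> words = new ArrayList<>();
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    words.add(n.getTextContent());
                }
            }
            return words;
        } catch (TransformerException e) {
            throw new IllegalArgumentException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sentence><pronoun>This</pronoun><verb>is</verb>"
                   + "<article>a</article><noun>boy</noun></sentence>";
        System.out.println(reorder(xml, XSLT)); // prints [This, a, boy, is]
    }
}
```

The result is still made of English words; only their order has changed, which matches the division of work between this component and the Hindi sentence builder.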

H. Hindi Sentence Builder

The Hindi sentence builder component gets the Hindi sentence formed using the English words present in the original sentence. This component replaces each English word present in the transformed sentence by the corresponding Hindi word, which it retrieves by consulting the Hindi Words database. The end result of this component is a Hindi sentence, which is given to the user interface component for display.
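The word replacement step can be sketched as a dictionary lookup over the reordered words. In this sketch an in-memory map stands in for the Hindi Words database, and romanized Hindi is used as a placeholder for the font-encoded Hindi text; both are assumptions of the example, as is the choice to leave unknown words unchanged.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

final class HindiSentenceBuilder {
    /** Replaces each English word by its Hindi entry and joins the result into a sentence. */
    static String build(List<String> reorderedWords, Map<String, String> dictionary) {
        StringJoiner sentence = new StringJoiner(" ");
        for (String word : reorderedWords) {
            String hindi = dictionary.get(word.toLowerCase(Locale.ROOT));
            // Assumption: a word with no dictionary entry is passed through as is.
            sentence.add(hindi != null ? hindi : word);
        }
        return sentence.toString();
    }

    public static void main(String[] args) {
        // Romanized placeholders for the Hindi words of the example sentence.
        Map<String, String> dictionary = Map.of(
                "this", "yah", "is", "hai", "a", "ek", "boy", "ladka");
        System.out.println(build(List.of("This", "a", "boy", "is"), dictionary));
        // prints: yah ek ladka hai
    }
}
```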

Implementation of Proposed Architecture

We implement the proposed architecture using Java technology [4]. Java [18] supports XML processing via XML APIs [19] such as JAXP [20], SAX, and JDOM. We have used Netbeans IDE 6.0 [21] as the development environment, as it is, to date, a sophisticated IDE for Java based development. For the database implementation, we chose MS-ACCESS, as it is satisfactory for a rudimentary implementation. However, in our future work, we will adopt database systems that support indexing facilities.

A. Database <strong>Implementation</strong><br />

For the implementation <strong>of</strong> the database, we have chosen MS-ACCESS as a database.<br />

We have implemented two databases: one, <strong>for</strong> identifying the type (e.g., noun, verb,<br />

etc.) <strong>of</strong> the <strong>English</strong> word <strong>and</strong> another one is <strong>for</strong> the mapping between <strong>English</strong> word<br />

<strong>and</strong> corresponding Hindi word. Few records <strong>of</strong> the type table <strong>of</strong> type database are<br />

shown in Fig. 6. The Fig. 7 shows noun database-table containing <strong>English</strong> nouns <strong>and</strong><br />

their Hindi representation in Hindi font. In this figure, the column named Hindi<br />

contains associated Hindi word encoded using a Sarjudas font [22]. Similarly, other<br />

database tables are created <strong>for</strong> adjective, adverb, article, conjunction, pronoun, verb,<br />

<strong>and</strong> preposition.


<strong>Design</strong> <strong>and</strong> <strong>Implementation</strong> <strong>of</strong> <strong>XML</strong> <strong>Based</strong> <strong>Architecture</strong> 287<br />

Figure 6: Example records of the type database table containing English words and their types.

Figure 7: Example records of the noun database table containing English nouns and their Hindi representation in Hindi font (Sarjudas font [22]).

B. Implementation of Grammar Files

We implemented the required English grammar, Hindi grammar, and translation files for 21 different forms of English sentences. Table 1 lists all the sentence forms incorporated in our system. We place grammar files in directories named after the number of tokens, which makes searching for them easy and efficient. For example, all grammar files for three-token sentences are placed in a directory named 3, and so on. Example English and Hindi grammar files are discussed in Section IV.
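The directory convention can be illustrated with a minimal sketch that counts the tokens of a sentence and resolves the matching directory. The root folder `grammar/english` and the class and method names are assumptions made for this example; the paper does not give the actual paths.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.StringTokenizer;

// Hypothetical helper: maps a sentence to the grammar directory for its token count.
final class GrammarDirectoryLocator {
    static Path directoryFor(Path grammarRoot, String sentence) {
        // StringTokenizer splits on whitespace by default.
        int tokenCount = new StringTokenizer(sentence).countTokens();
        // A three-token sentence resolves to <grammarRoot>/3, and so on.
        return grammarRoot.resolve(Integer.toString(tokenCount));
    }

    public static void main(String[] args) {
        Path root = Paths.get("grammar", "english"); // assumed location
        System.out.println(directoryFor(root, "He is a boy"));
    }
}
```

Only the grammar files in the resolved directory then need to be tried, which keeps the search proportional to the number of forms with that token count.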

Table 1: List of forms of sentences that are supported by the implemented system.

No. of words in a sentence | Form of an English sentence
3 | Pronoun verb noun
3 | Noun verb noun1
3 | Pronoun verb pronoun1
3 | Pronoun verb adjective
3 | Pronoun verb verb1
4 | Pronoun verb article noun
4 | Pronoun verb noun verb
4 | Pronoun verb adverb adjective
4 | Pronoun noun verb noun1
5 | Pronoun verb article noun noun1
5 | Pronoun verb adverb adjective noun
5 | Pronoun verb article adjective noun
5 | Pronoun verb adjective preposition noun
5 | Pronoun verb article adjective noun
5 | Pronoun noun preposition verb noun
5 | Pronoun verb adverb adjective noun
5 | Pronoun verb noun adverb adjective
6 | Pronoun verb noun adverb pronoun1 noun1
6 | Pronoun verb noun preposition pronoun1 noun1
6 | Pronoun verb verb1 preposition pronoun1 noun
7 | Pronoun verb article noun verb1 preposition pronoun1

C. Implementation of Components

For the representation of English and Hindi grammar, we use XML Schema. For parsing XML instance documents, we use DOM parsing [19]. The API for the DOM parsing approach is available in the Java API for XML Processing (JAXP) [19]. We briefly discuss the implementation of all the architecture components in the following paragraph.
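As a rough illustration of DOM parsing through JAXP, the sketch below parses a small XML instance document and lists the element names under its root. The instance layout (a `sentence` root with one child element per token, named after the token type) is an assumption made for this example and may differ from the instance files described in Section IV.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads the sequence of token types from an instance document.
final class InstanceFileDomSketch {
    static List<String> tokenTypes(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> types = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace text nodes; keep only element children.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    types.add(child.getNodeName());
                }
            }
            return types;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse instance document", e);
        }
    }

    public static void main(String[] args) {
        // Assumed instance layout, for illustration only.
        String xml = "<sentence><pronoun>he</pronoun><verb>is</verb><noun>boy</noun></sentence>";
        System.out.println(tokenTypes(xml));
    }
}
```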

The user interface component is implemented as a Java GUI application. It handles user input and renders the output in Hindi. The token generator splits the English sentence into words using Java's StringTokenizer class. The token type generator component uses JDBC [23] to retrieve records from the type database table. The XML instance-file builder creates the XML instance file(s). The XML instance-file parser uses DocumentBuilderFactory and DocumentBuilder to parse XML instance documents; it considers only those schema files placed in the directory whose name equals the number of tokens in the XML instance file. The English to Hindi grammar selector component selects the Hindi grammar file and the transformation file from the matching Hindi directory. The English to Hindi translator component uses two main JAXP classes, TransformerFactory and Transformer, to translate the English XML instance document into a Hindi XML instance document. The Hindi sentence builder component uses JDBC to connect to the Hindi words database and retrieve the Hindi meanings of the English words. The result is a Hindi sentence, which is passed to the user interface component.
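The translator step can be sketched with the two JAXP classes named above. The stylesheet here is a hypothetical translation file for the form "Pronoun verb noun": it only moves the verb after the noun, reflecting the verb-final word order of Hindi, and leaves the English words in place for the sentence builder to replace. The element names and the stylesheet itself are assumptions for this example, not the files used in the implemented system.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical translator step: reorders tokens with an XSLT 1.0 stylesheet.
final class ReorderTransformSketch {
    // Assumed translation file for "Pronoun verb noun" -> "Pronoun noun verb".
    static final String XSLT =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:template match=\"/sentence\">"
          + "<sentence>"
          + "<xsl:copy-of select=\"pronoun\"/>"
          + "<xsl:copy-of select=\"noun\"/>"
          + "<xsl:copy-of select=\"verb\"/>"
          + "</sentence>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    static String transform(String englishXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(englishXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String english =
                "<sentence><pronoun>he</pronoun><verb>eats</verb><noun>mango</noun></sentence>";
        System.out.println(transform(english));
    }
}
```

Keeping the reordering rule in an external stylesheet rather than in Java code matches the architecture's aim of adding sentence forms by adding files.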

Results and Discussion

Fig. 8 shows a screenshot of the system during a run. The user enters an English sentence in the Input text field. On clicking the Translation button, the system starts the translation process, generates the corresponding Hindi sentence, and displays it in the Hindi text field. If the system is able to generate correct Hindi output, it displays “SENTENCE IS VALID” in the error field; otherwise, it displays “SENTENCE IS NOT VALID.”

Figure 8: Screenshot of a run of a system implementing the proposed architecture.

For the supported word list and sentence forms, the implemented system produces correct output, translates quickly, and offers a user-friendly environment. A facility for dynamically updating the database is planned as future work. At present, the system supports translation for a selected word list and a few sentence forms. However, by extending the database and adding more grammar rules in the form of XML schemas, it can be extended to cover a wider range of sentences and words.

Because the translation relies on external grammar and translation files, the architecture can be adapted for other regional languages of India. We intend to include advanced processing of sentences and complex grammar rules in our future implementation of the proposed architecture.

Conclusions

We proposed an XML-based architecture for English to Hindi sentence translation. The proposed architecture is extensible, simple, and elegant. Moreover, it is generic and can be adapted to translation between other languages. We also discussed the implementation of this architecture using Java technology. For the selected word list and sentence grammar rules, the implemented system produces correct output.

Because our initial aim was to demonstrate the capability of XML in language translation, we have not considered complex sentences or other complex grammar rules in this implementation. However, XML can annotate extra information in the form of tag attributes, which we would like to use in a future implementation of the architecture. In the future, following an incremental development approach, we would like to incorporate translation at the phrase, clause, and sentence levels. We will also address subject-verb agreement. For example, the Hindi translations of "he goes" (woh jata hai) and "she goes" (woh jati hai) differ. We would also like to consider rules for translating gerunds, present participles, past participles, infinitives, and so on. At a later stage of development, we would like to consider context-sensitive meanings of words.

References

[1] Extensible Markup Language (XML) 1.0 (Fifth Edition). [Online]. Available: http://www.w3.org/TR/REC-xml/

[2] XML Schema Part 0: Primer Second Edition. [Online]. Available: http://www.w3.org/TR/xmlschema-0/

[3] XSL Transformations (XSLT) Version 1.0. [Online]. Available: http://www.w3.org/TR/xslt

[4] Java Technology. [Online]. Available: http://www.sun.com/java/

[5] R. Sinha. ANGLABHARTI: A MULTILINGUAL MACHINE AIDED TRANSLATION METHODLOGY FOR TRANSLATION FROM ENGLISH TO INDIAN LANGUAGES. [Online]. Available: http://www.cse.iitk.ac.in/users/langtech/anglabharti.htm

[6] R. Sinha, K. Sivaraman, A. Agrawal, R. Jain, R. Srivastava, and A. Jain, “Anglabharti: a multilingual machine aided translation project on translation from english to indian languages,” vol. 2, Oct 1995, pp. 1609–1614 vol.2.

[7] R. Sinha and A. Jain, “Anglahindi: An english to hindi machine-aided translation system,” New Orleans, Louisiana, USA, 2003. [Online]. Available: http://mt-archive.info/MTS-2003-Sinha.pdf

[8] L. Gore and N. Patil. Paper on english to hindi - translation system. [Online]. Available: www.cse.iitk.ac.in/users/langtech/strans2002/latagore.doc#.#

[9] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, S. Russell and P. Norvig, Eds. Upper Saddle River, NJ: Prentice Hall, 2000.

[10] D. Gupta, “Contributions to english to hindi machine translation using example-based approach,” Ph.D. dissertation, Indian Institute of Technology Delhi, Hauz Khas, New Delhi-110016, India, January 2005, http://hermes.itc.it/people/gupta/mywork/Thesis_final_16thjuly.pdf.

[11] D. Gupta and N. Chatterjee, “Identification of divergence for english to hindi ebmt,” New Orleans, Louisiana, USA, 2003, pp. 141–148. [Online]. Available: http://mt-archive.info/MTS-2003-Gupta.pdf

[12] (1995) India: A country study. Washington: GPO for the Library of Congress. [Online]. Available: http://countrystudies.us/india/67.htm

[13] World Wide Web Consortium. [Online]. Available: http://www.w3.org/

[14] Extensible Stylesheet Language (XSL) Version 1.1. [Online]. Available: http://www.w3.org/TR/xsl

[15] HTML 4.01 Specification. [Online]. Available: http://www.w3.org/TR/html401/

[16] XML Path Language (XPath) Version 1.0. [Online]. Available: http://www.w3.org/TR/xpath

[17] XQuery 1.0: An XML Query Language. [Online]. Available: http://www.w3.org/TR/xquery/

[18] The Source for Java Developers. [Online]. Available: http://java.sun.com/

[19] A. Vohra and D. Vohra, Pro XML Development with Java Technology, S. Russell and P. Norvig, Eds. Apress, 2006.

[20] jaxp:JAXP Reference Implementation. [Online]. Available: https://jaxp.dev.java.net/

[21] Netbeans IDE. [Online]. Available: http://www.netbeans.org/

[22] Sarjudas-Hindi font file. [Online]. Available: http://www.jaishreekrishna.com/sarjudas.ttf

[23] Java Database Connectivity. [Online]. Available: http://java.sun.com/javase/technologies/database/

