Development of a Hindi Lemmatizer - arXiv




International Journal of Computational Linguistics and Natural Language Processing, Vol 2 Issue 5, May 2013
ISSN 2279 – 0756

word contains two suffixes together, ◌य and ◌ो◌ं. This makes it hard for the system to pick the correct rule for that particular word. There are many more such exceptions, for which we have written separate rules. To overcome such problems we have built a database in which these exceptional words are kept. Although building this database takes considerable time, we adopt the approach for the sake of fast and accurate results. The rule is shown in Fig. 2.

If (root) present in (knowledgebase)
{
    Fetch the root from the list;
    Display;
}
else if (root) not present in (knowledgebase)
{
    If (source) ends with (suffix)
    {
        Substring the source;
        Display the root;
    }
}

Fig. 2 Rule procedure
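The procedure in Fig. 2 can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the names `KNOWLEDGEBASE`, `SUFFIXES`, and `lemmatize` are hypothetical, the knowledgebase entry is an assumed example, and trying longer suffixes first is an assumption about rule ordering. The suffixes and expected roots are taken from Table V.

```python
# Hypothetical exception list (knowledgebase): inflected form -> root.
# Assumed entry: stripping "ताओं" from this word would wrongly leave "ने".
KNOWLEDGEBASE = {
    "नेताओं": "नेता",
}

# Hypothetical suffix list; longer suffixes are tried first (an assumption).
SUFFIXES = ["ीयता", "ताओं", "ओं", "ी"]


def lemmatize(source: str) -> str:
    """Return the knowledgebase root if present, else strip a known suffix."""
    if source in KNOWLEDGEBASE:
        return KNOWLEDGEBASE[source]
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Require a non-empty remainder so a bare suffix is never stripped.
        if source.endswith(suffix) and len(source) > len(suffix):
            return source[: -len(suffix)]
    # No rule matched: return the word unchanged.
    return source


for word in ["सफलताओं", "बालिकाओं", "भारतीयता", "खुशी", "नेताओं"]:
    print(word, "->", lemmatize(word))
```

The knowledgebase lookup runs before any rule, which is what lets an exceptional word bypass a suffix rule that would otherwise over-strip it.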

D. Algorithmic Steps

The input word is first looked up in the database. If the word exists in the database, its stored root is displayed as output; if it does not, the rules are applied to strip the suffix. The rules work by deleting the suffix from the input. If the stripped word is a proper, meaningful word, it is displayed as the result; otherwise a particular character or matra is added to the stripped word to make it meaningful. The steps are shown in Fig. 3.

1. Check the input word in the knowledgebase.
2. Display it if it exists.
3. Otherwise access the rules.
4. Generate suffix stripping rules:
   i. Delete the suffix.
   ii. Delete and add characters.
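The steps above, including the "delete and add" case, might look like the following sketch. It rests on several assumptions that are not in the paper: a small `LEXICON` set stands in for the check that a stripped word "provides a proper meaning", the `RULES` table and the knowledgebase entry are hypothetical, and Unicode NFC normalization is added so that nukta letters such as ड़ compare equal whichever way they are encoded. The pair लड़का / ◌ो◌ं comes from Table V.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize text so both encodings of nukta letters compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical knowledgebase of exceptional words (inflected form -> lemma).
KNOWLEDGEBASE = {nfc("नेताओं"): nfc("नेता")}

# Hypothetical lexicon used to decide whether a stripped word is meaningful.
LEXICON = {nfc(w) for w in ["सड़क", "लड़का", "सफल"]}

# Hypothetical rules: suffix to delete, then characters that may be added back.
RULES = [
    ("ताओं", []),
    ("ों", ["ा"]),
]


def lemmatize(word: str) -> str:
    word = nfc(word)
    # Steps 1-2: look the word up in the knowledgebase and return its lemma.
    if word in KNOWLEDGEBASE:
        return KNOWLEDGEBASE[word]
    # Steps 3-4: otherwise try each suffix-stripping rule in turn.
    for suffix, additions in RULES:
        if not word.endswith(suffix) or len(word) <= len(suffix):
            continue
        stem = word[: -len(suffix)]
        if stem in LEXICON:  # 4.i: deleting the suffix is enough
            return stem
        for extra in additions:  # 4.ii: delete, then add a character or matra
            if stem + extra in LEXICON:
                return stem + extra
    # No rule produced a known word: return the input unchanged.
    return word


for word in ["सड़कों", "लड़कों", "सफलताओं", "नेताओं"]:
    print(word, "->", lemmatize(word))
```

Here सड़कों needs only deletion, while लड़कों needs deletion followed by adding the matra ा, because the bare stem is not in the assumed lexicon.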

Some of the input words are shown in Fig. 4.

नज़र, सड़क, लड़क, लड़कयाँ, खुशी, भारतीयता, मजदूर, मिठाई, बालिकाओं, िननीय, गौरवांवत, सफलताओं, लड़क, मंज़ल, वदा, यादा, पढ़ाई, कवय, ितजोरय, सतरंगी, आतंकय, बुनाई, नकारामक, नेताओं, अपमानित, िचड़य, संशोधन, शशाली, शीलय.

Fig. 4 Snapshot of inputs

The outputs for some of these words are shown in Table V.

TABLE V
SEPARATED LEMMA AND SUFFIXES

Lemma       Suffix
नज़र         ◌े◌ं
सड़क         ◌ो◌ं
लड़क         -
खुश         ◌ी
भारत        ◌ीयता
मजदूर       ◌ी
बालिका      ओं
वास         नीय
सफल         ताओं
लड़का        ◌ो◌ं
संशोध       न
ितजोर       य
लड़क         ◌याँ
यादा        -

Some of the wrong outputs are shown in Fig. 5.

ववेचना, उोगपित, ककार, ककार, आयामक, ानामक, कलाित, शांितयता, िमान, गुणवा, गुणकार, निरंतर, नकलची, निंदनीय, जनामक, सौभायशाली, वावलंबन, तमनाएं, णत, दयालु, चौकदार, चमकला, वधुतीकरण.

Fig. 3 Algorithm

V. EVALUATION

The system was evaluated for accuracy on 500 input words. Of these, 456 were lemmatized correctly and 44 were incorrect because they violated both the exception rules and the general rules. The accuracy of the system was computed using the following equation:

\[ \text{Accuracy} = \frac{\text{number of correctly lemmatized words}}{\text{total number of words}} \times 100 = \frac{456}{500} \times 100 \approx 91\% \]
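An evaluation of this kind can be sketched as a comparison of lemmatizer output against expected lemmas. The function `evaluate`, the toy lemmatizer `strip_final_ii`, and the three gold pairs (taken from Table V) are illustrative assumptions; the paper's 500-word test set is not reproduced here.

```python
from typing import Callable, Iterable, Tuple


def evaluate(
    lemmatize: Callable[[str], str],
    gold_pairs: Iterable[Tuple[str, str]],
) -> Tuple[int, float]:
    """Return (number correct, accuracy in percent) over (word, lemma) pairs."""
    pairs = list(gold_pairs)
    if not pairs:
        raise ValueError("gold_pairs must not be empty")
    correct = sum(1 for word, expected in pairs if lemmatize(word) == expected)
    return correct, 100.0 * correct / len(pairs)


def strip_final_ii(word: str) -> str:
    """Toy stand-in lemmatizer: strips only the suffix "ी"."""
    return word[:-1] if word.endswith("ी") and len(word) > 1 else word


# Hypothetical gold pairs; the lemmas follow Table V.
gold = [("खुशी", "खुश"), ("सफलताओं", "सफल"), ("भारतीयता", "भारत")]
correct, percent = evaluate(strip_final_ii, gold)
print(f"{correct} of {len(gold)} correct, accuracy {percent:.1f}%")

# The counts reported in the paper, 456 correct out of 500:
print(f"{100.0 * 456 / 500:.1f}%")
```

With the reported counts the same arithmetic gives 91.2%, which the paper states as 91%.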

Fig. 5 Snapshot of errors

VI. CONCLUSION

In this paper we have discussed the development of a lemmatizer for Hindi. The work uses a rule-based approach, supported by a knowledgebase that contains the Hindi words commonly used in day-to-day life. The approach prioritizes time optimization over space: since storage space is no longer a significant constraint, we aimed to minimize processing time and to generate accurate results quickly. Our system achieved 91% accuracy.

Snigdha Paul et al.
www.ijclnlp.org
