Development of a Hindi Lemmatizer - arXiv

International Journal of Computational Linguistics and Natural Language Processing, Vol 2 Issue 5, May 2013, ISSN 2279 – 0756

word contains two suffixes together, which are ◌य and ◌ो◌ं. This is hard for the system, as it has difficulty picking the correct rule for the particular word. Similarly, there are many more exceptions, for which we have generated different rules. To overcome such problems we have built a database in which such exceptional words are kept. Although this work takes considerable time, the approach is applied for the sake of fast and accurate results. The rule is shown in Fig 2.

If (root) present in (knowledgebase)
{
    Fetch the root from the list
    Display;
}
else if (root) not present in (knowledgebase)
{
    If (source) ends with (suffix)
    {
        Substring the source
        Display the root;
    }
}

Fig. 2 Rule procedure

D. Algorithmic Steps

The input word is first checked in the database. If the word exists in the database, it is displayed as output. If it does not, the rules are accessed to strip the suffix. The rules work by deleting the suffix from the input. After deletion, if the remaining word is meaningful it is displayed as the result; otherwise a particular character or matra is added to the stripped word to make it a proper, meaningful word. The steps are shown in Fig 3.

1. Check input word in knowledgebase.
2. Display if it exists.
3. Otherwise access the rules.
4. Generate suffix stripping rules
   i. Delete the suffix.
   ii. Delete & add characters.

Fig. 3 Algorithm

Some of the input words are shown in Fig 4.

नज़र, सड़क, लड़क, लड़कयाँ, खुशी, भारतीयता, मजदूर, मिठाई, बालिकाओं, िननीय, गौरवांवत, सफलताओं, लड़क, मंज़ल, वदा, यादा, पढ़ाई, कवय, ितजोरय, सतरंगी, आतंकय, बुनाई, नकारामक, नेताओं, अपमानित, िचड़य, संशोधन, शशाली, शीलय.

Fig. 4 Snapshot of inputs

The output for some of these words is shown in Table V.

TABLE V SEPARATED LEMMA AND SUFFIXES

Lemma      Suffix
नज़र       ◌े◌ं
सड़क       ◌ो◌ं
लड़क       -
खुश        ◌ी
भारत       ◌ीयता
मजदूर      ◌ी
बालिका     ओं
वास        नीय
सफल        ताओं
लड़का      ◌ो◌ं
संशोध      न
ितजोर      य
लड़क       ◌याँ
यादा       -

Some of the wrongly lemmatized words are shown in Fig 5.

ववेचना, उोगपित, ककार, ककार, आयामक, ानामक, कलाित, शांितयता, िमान, गुणवा, गुणकार, निरंतर, नकलची, निंदनीय, जनामक, सौभायशाली, वावलंबन, तमनाएं, णत, दयालु, चौकदार, चमकला, वधुतीकरण.

Fig. 5 Snapshot of errors

V. EVALUATION

The system was evaluated for accuracy on 500 words given for lemmatization. Of these 500 words, 456 were correctly lemmatized and 44 were incorrect because they violated both the exceptional and the general rules. Accuracy of the system was computed using the following equation:

Accuracy = (number of correctly lemmatized words / total number of words) × 100 = (456 / 500) × 100 = 91.2%, reported as 91%

VI. CONCLUSION

In this paper we have discussed the development of a lemmatizer for Hindi. The work uses a rule-based approach together with a knowledgebase that contains the Hindi words commonly used in day-to-day life. The approach emphasizes optimizing time rather than space. Since storage space is no longer a major constraint, our approach aims to optimize time and generate accurate results in a very short period. Our system achieved 91% accuracy.

Snigdha Paul et al. 383 www.ijclnlp.org
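The lookup-then-strip procedure of Fig. 2 and Fig. 3, and the accuracy computation from the evaluation, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `KNOWLEDGE_BASE`, `SUFFIX_RULES`, `lemmatize` and `accuracy` are hypothetical, the knowledge base holds a single toy entry, and the rule list is a small sample loosely based on the suffixes in Table V. The delete-and-add rule for ियों is an assumed example taken from general Hindi noun inflection. The check that a stripped word "provides a proper meaning" would need a lexicon and is omitted here.

```python
# Toy data for illustration only; not the authors' knowledge base or rules.

# Exceptional words mapped directly to their root (hypothetical entry):
# stripping the plural suffix alone would not restore the final vowel.
KNOWLEDGE_BASE = {
    "लड़कों": "लड़का",
}

# Each rule is (suffix to delete, characters to add back).
# An empty second element means "delete only".
SUFFIX_RULES = [
    ("ीयता", ""),
    ("ताओं", ""),
    ("ियों", "ी"),  # delete-and-add rule (assumed example)
    ("ओं", ""),
    ("ों", ""),
    ("ी", ""),
]


def lemmatize(word: str) -> str:
    """Return the root of `word` using lookup first, then suffix rules."""
    # Steps 1-2: if the word is a known exception, return its stored root.
    if word in KNOWLEDGE_BASE:
        return KNOWLEDGE_BASE[word]

    # Steps 3-4: otherwise try the rules, longest suffix first, so that
    # a longer suffix such as "ताओं" wins over the shorter "ओं".
    rules = sorted(SUFFIX_RULES, key=lambda rule: len(rule[0]), reverse=True)
    for suffix, addition in rules:
        if len(word) > len(suffix) and word.endswith(suffix):
            return word[: -len(suffix)] + addition

    # No rule applies: return the word unchanged.
    return word


def accuracy(correct: int, total: int) -> float:
    """Percentage of correctly lemmatized words."""
    return 100.0 * correct / total


if __name__ == "__main__":
    for w in ["सफलताओं", "भारतीयता", "बालिकाओं", "लड़कों"]:
        print(w, lemmatize(w))
    print(accuracy(456, 500))
```

Trying the longest suffix first is one simple way to reduce the ambiguity the paper describes when a word ends in more than one candidate suffix; words that still come out wrong are the ones that belong in the knowledge base.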

