10.07.2015 Views

TLex: Thai Lexeme Analyser Based on the ... - BEST - Nectec


TABLE IV
EVALUATION RESULTS OF MERGING APPROACH ON THE combined FEATURE SET

                           Results
Genre         Approach     P       R       F1
Article       no-merge     95.45   96.80   96.12
              merge        95.71   96.54   96.13
Buddhism      no-merge     95.39   95.39   95.39
              merge        95.19   94.96   95.07
Encyclopedia  no-merge     95.17   95.21   95.19
              merge        95.15   94.83   94.99
Law           no-merge     95.14   96.28   95.70
              merge        95.36   96.00   95.68
News          no-merge     91.91   94.71   93.29
              merge        93.54   94.99   94.26
Novel         no-merge     94.57   95.57   95.07
              merge        95.43   95.72   95.57
Nsc           no-merge     95.71   95.57   95.64
              merge        95.83   95.53   95.68
Old-doc       no-merge     91.70   93.12   92.40
              merge        92.09   92.81   92.45
Royalnews     no-merge     80.83   91.42   85.80
              merge        84.41   92.28   88.17
Talk          no-merge     96.74   97.00   96.87
              merge        96.99   96.78   96.89
Tvnews        no-merge     91.07   95.06   93.02
              merge        92.29   94.92   93.59
Wiki          no-merge     90.61   93.53   92.05
              merge        91.62   93.58   92.59

TABLE V
EVALUATION RESULTS OF NE MERGING FROM ALL GENRES

              Results
Approach      P       R       F1
no-merge      92.86   94.97   93.90
merge         93.63   94.91   94.27

improvement comes from the Royalnews genre, in which NE merging yielded an F1 value of 88.17%, compared to 85.80% without merging. The reason is that the set of NEs in the Royalnews genre is limited, but those NEs occur very often in the articles.

We also measured the execution times of the two approaches. The no-merge approach processes 7,444 words/sec, while the merge approach processes 2,229 words/sec. The lower speed of the merge approach is due to the time taken for parsing and checking the named entities in the list.

VI. CONCLUSIONS AND FUTURE WORKS

In this paper, we proposed a solution to the InterBest 2009 Thai Word Segmentation task based on Conditional Random Fields (CRFs). To train the word segmentation models, we proposed three different feature sets: char (using all possible characters as features), char-type (categorizing all characters into 10 different types) and combined (using both characters and character types as features). The evaluation results showed that the char-type feature set yielded the worst performance, with an F1 value of 62.90%. The model trained with the char feature set gave an F1 value of 91.51%. The highest F1 value, 93.90%, was achieved with the combined feature set. Therefore, it can be concluded that CRFs can effectively learn the actual character sequences. However, when the exact character sequence patterns are not matched, the generalized model based on char-type can help cover such cases.

The analysis of the segmented results showed that the models trained with CRFs tend to mistakenly split named entities into a few small segments. As a solution, we performed a post-processing step that merges named entities (NEs) found in the returned segments.
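
The post-processing step described above can be illustrated with a minimal sketch. The paper states only that segments are checked against a list of named entities, so the greedy longest-match strategy, the function name `merge_named_entities` and the `max_span` parameter are assumptions made for illustration, not the authors' implementation.

```python
from typing import Iterable, List


def merge_named_entities(
    segments: List[str],
    ne_list: Iterable[str],
    max_span: int = 6,  # hypothetical cap on how many segments one NE may cover
) -> List[str]:
    """Greedily merge adjacent segments whose concatenation is a known NE."""
    known = set(ne_list)
    merged: List[str] = []
    i = 0
    while i < len(segments):
        # Try the longest candidate span first so longer NEs win over shorter ones.
        for span in range(min(max_span, len(segments) - i), 1, -1):
            candidate = "".join(segments[i:i + span])
            if candidate in known:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-segment NE starts at this position; keep the segment as is.
            merged.append(segments[i])
            i += 1
    return merged


# A segmenter output in which one place name was split into four pieces.
segments = ["กรุง", "เทพ", "มหา", "นคร", "เป็น", "เมือง", "หลวง"]
print(merge_named_entities(segments, ["กรุงเทพมหานคร"]))
```

Every position tries several candidate spans against the list, which is consistent with the lower throughput reported for the merge approach.
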
The NE merging step helped increase the performance of the combined feature model to 94.27% based on the averaged F1 value.

For future work, we will focus on improving the word segmentation models learned with CRFs. One possible solution is to improve the feature set used for learning the models. A new character feature set could be constructed automatically by using clustering algorithms. Another possible improvement is to train a named entity recognition (NER) model to merge returned segments that contain named entities into single word units.
