TLex: Thai Lexeme Analyser Based on the ... - BEST - Nectec

Fig. 4. Example of named entities collected from the training corpus.

TABLE I. NUMBER OF NAMED ENTITIES EXTRACTED FROM TRAINING CORPUS

Genre           Number of NEs
Article                 7,193
Buddhism                  349
Encyclopedia            5,462
Law                       921
News                   23,038
Novel                   4,145
Talk                    2,284
Wiki                   12,935

evaluating different word segmentation approaches. Precision is defined as the number of tokens correctly segmented by the algorithm divided by the total number of tokens returned by the algorithm. Recall is defined as the number of tokens correctly segmented by the algorithm divided by the total number of tokens in the test set. Eq. 4 shows the calculation of the F1 measure, which is the harmonic mean of precision and recall.

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$

We first evaluated the three feature sets (i.e., char, char-type and combined) explained in the previous section. Given the training corpus, we formatted the text strings according to each feature set and trained three word segmentation models based on the CRFs. The results for each of the 12 test genres are shown in Table II. The averaged performance over all genres is summarized in Table III.

The results in the tables show that the char-type feature set yielded the worst averaged performance, 62.90% based on the F1 value. The model trained with the char feature set improved significantly on this, reaching 91.51%. The reason is that the char feature set provides more precise and richer information based on the actual characters, so the CRFs could effectively observe and recognize the sequences. Combining the char-type and char feature sets increases the performance further, to 93.90%. The improvement arises because the char-type features help segment character patterns that were not previously observed with the char feature set alone. Among all 12 genres, the Talk genre yielded the best performance.

Next, we perform and evaluate the NE merging process.
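
The precision, recall and F1 definitions above can be made concrete with a short sketch. It assumes that a token counts as correctly segmented when its start and end character offsets both match a token in the reference segmentation; the text does not spell out the matching rule, so this is an assumption, and the function names are illustrative only.

```python
def token_spans(tokens):
    """Convert a list of tokens into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (Eq. 4)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def segmentation_scores(predicted, reference):
    """Return (precision, recall, f1) for one segmented string.

    Assumption: a predicted token is correct only if a reference token
    covers exactly the same character span.
    """
    pred, ref = token_spans(predicted), token_spans(reference)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical toy example: the last reference token is over-split.
reference = ["ab", "c", "de"]
predicted = ["ab", "c", "d", "e"]
p, r, f1 = segmentation_scores(predicted, reference)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

As a sanity check on Eq. 4, applying `f1_score` to the averaged precision and recall of the combined feature set in Table III (92.86 and 94.97) gives a value that rounds to the reported 93.90.
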
As mentioned in the previous section, applying the NE merging step could help combine the segments that form a named entity into a single word unit. Using the model trained on the combined feature set, the comparison between the merge and no-merge approaches for each of the 12 genres is shown in Table IV. The averaged performance over all genres is summarized in Table V. The results show that the NE merging step increases the performance of the combined feature model to 94.27% based on the F1 value. The largest improvement comes from the Royalnews genre, in which NE merging yielded an F1 value of 88.17% compared to 85.80% without merging. The reason is that the set of NEs in the Royalnews genre is limited, but those NEs occur very often in the articles.

We also measured the processing speed of the two approaches. The no-merge approach processes 7,444 words/sec, while the merge approach processes 2,229 words/sec. The lower speed of the merge approach is due to the time taken for parsing and checking the named entities in the list.

TABLE II. EVALUATION RESULTS OF THREE DIFFERENT FEATURE SETS

Genre         Feature set      P       R       F1
Article       char           92.19   92.10   92.15
              char-type      64.43   67.08   65.73
              combined       95.45   96.80   96.12
Buddhism      char           93.95   93.21   93.58
              char-type      66.95   65.46   66.20
              combined       95.39   95.39   95.39
Encyclopedia  char           92.59   91.38   91.98
              char-type      64.87   66.00   65.43
              combined       95.17   95.21   95.19
Law           char           94.17   94.52   94.34
              char-type      64.61   68.40   66.45
              combined       95.14   96.28   95.70
News          char           90.21   90.07   90.14
              char-type      59.53   64.45   61.89
              combined       91.91   94.71   93.29
Novel         char           93.66   93.97   93.82
              char-type      65.54   65.94   65.74
              combined       94.57   95.57   95.07
Nsc           char           91.53   90.90   91.21
              char-type      63.98   63.39   63.68
              combined       95.71   95.57   95.64
Old-doc       char           90.98   90.38   90.68
              char-type      55.57   59.50   57.47
              combined       91.70   93.12   92.40
Royalnews     char           80.23   86.96   83.46
              char-type      44.35   62.53   51.89
              combined       80.83   91.42   85.80
Talk          char           94.34   93.99   94.16
              char-type      66.75   67.21   66.98
              combined       96.74   97.00   96.87
Tvnews        char           90.32   92.50   91.40
              char-type      56.06   60.97   58.41
              combined       91.07   95.06   93.02
Wiki          char           91.42   90.65   91.03
              char-type      61.86   65.22   63.50
              combined       90.61   93.53   92.05

TABLE III. EVALUATION RESULTS OF DIFFERENT FEATURE SETS FROM ALL GENRES

Feature set      P       R       F1
char           91.30   91.72   91.51
char-type      61.21   64.68   62.90
combined       92.86   94.97   93.90

TABLE IV. EVALUATION RESULTS OF MERGING APPROACH ON THE combined FEATURE SET

Genre         Approach       P       R       F1
Article       no-merge     95.45   96.80   96.12
              merge        95.71   96.54   96.13
Buddhism      no-merge     95.39   95.39   95.39
              merge        95.19   94.96   95.07
Encyclopedia  no-merge     95.17   95.21   95.19
              merge        95.15   94.83   94.99
Law           no-merge     95.14   96.28   95.70
              merge        95.36   96.00   95.68
News          no-merge     91.91   94.71   93.29
              merge        93.54   94.99   94.26
Novel         no-merge     94.57   95.57   95.07
              merge        95.43   95.72   95.57
Nsc           no-merge     95.71   95.57   95.64
              merge        95.83   95.53   95.68
Old-doc       no-merge     91.70   93.12   92.40
              merge        92.09   92.81   92.45
Royalnews     no-merge     80.83   91.42   85.80
              merge        84.41   92.28   88.17
Talk          no-merge     96.74   97.00   96.87
              merge        96.99   96.78   96.89
Tvnews        no-merge     91.07   95.06   93.02
              merge        92.29   94.92   93.59
Wiki          no-merge     90.61   93.53   92.05
              merge        91.62   93.58   92.59

TABLE V. EVALUATION RESULTS OF NE MERGING FROM ALL GENRES

Approach         P       R       F1
no-merge       92.86   94.97   93.90
merge          93.63   94.91   94.27

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a solution to the InterBest 2009 Thai Word Segmentation task based on the Conditional Random Fields. To train the word segmentation models, we proposed three different feature sets: char (using all possible characters as features), char-type (categorizing all characters into 10 different types) and combined (using both characters and character types as features). The evaluation results showed that the char-type feature set yielded the worst performance, 62.90% based on the F1 value.
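
The three feature sets summarized above can be illustrated with a minimal sketch of per-character feature extraction. The character categories below are a simplified, hypothetical stand-in: this section does not list the paper's 10 character types, so `char_type` and its labels are assumptions. The "B"/"I" (word-beginning / word-inside) labelling is likewise an assumed, commonly used encoding and not necessarily the one used to train the models reported here.

```python
def char_type(ch):
    """Map a character to a coarse type label (hypothetical categories)."""
    code = ord(ch)
    if 0x0E01 <= code <= 0x0E2E:
        return "thai_consonant"
    if 0x0E30 <= code <= 0x0E3A or 0x0E40 <= code <= 0x0E45:
        return "thai_vowel"
    if 0x0E48 <= code <= 0x0E4B:
        return "thai_tone"
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "latin"
    if ch.isspace():
        return "space"
    return "other"


def extract_features(text, feature_set="combined"):
    """Build one feature dict per character for the chosen feature set."""
    if feature_set not in ("char", "char-type", "combined"):
        raise ValueError(f"unknown feature set: {feature_set}")
    features = []
    for ch in text:
        feats = {}
        if feature_set in ("char", "combined"):
            feats["char"] = ch  # the actual character
        if feature_set in ("char-type", "combined"):
            feats["type"] = char_type(ch)  # its generalized category
        features.append(feats)
    return features


def boundary_labels(tokens):
    """Label each character 'B' (starts a word) or 'I' (inside a word)."""
    labels = []
    for tok in tokens:
        labels.extend(["B"] + ["I"] * (len(tok) - 1))
    return labels


# Toy example: one Thai consonant, one Thai vowel, then a digit.
sample_tokens = ["\u0e01\u0e32", "1"]
sample_text = "".join(sample_tokens)
print(extract_features(sample_text, "combined"))
print(boundary_labels(sample_tokens))
```

Feature dicts and label sequences of this shape are the usual input to CRF toolkits; keeping both the character and its type in the combined set lets a model fall back on the type when a specific character sequence was not seen in training.
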
The model trained with the char feature set gave an F1 value of 91.51%. The highest F1 value, 93.90%, was achieved with the combined feature set. It can therefore be concluded that the CRFs could effectively learn the actual character sequences. However, when the exact character sequence patterns are not matched, the generalized model based on the char-type features could help cover such cases. The analysis of the segmented results showed that the models trained using the CRFs tend to mistakenly separate named entities into a few small segments. As a solution, we performed a post-processing step that merges named entities (NEs) found in the returned segments. The NE merging step increased the performance of the combined feature model to 94.27% based on the averaged F1 value.

For future work, we will focus on improving the word segmentation models learned using the CRFs. One possible solution is to improve the feature set used for learning the models. A new character feature set could be constructed automatically by using clustering algorithms. Another possible improvement is to train a named entity recognition (NER) model to merge returned segments that contain named entities into single word units.
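
The NE merging post-processing step can be sketched as a greedy longest-match pass over the returned segments. This is only an illustration under stated assumptions: the paper says that named entities are checked against a list, but the matching strategy, the `max_span` limit and the function name `merge_named_entities` are hypothetical choices made here.

```python
def merge_named_entities(segments, named_entities, max_span=5):
    """Merge runs of adjacent segments whose concatenation is a known NE.

    segments:        list of strings returned by the segmenter
    named_entities:  set of NE strings (e.g. collected from a training corpus)
    max_span:        assumed upper bound on how many segments one NE may cover
    """
    merged, i = [], 0
    while i < len(segments):
        match_end = None
        # Try the longest candidate first so that longer entities win.
        for j in range(min(len(segments), i + max_span), i + 1, -1):
            if "".join(segments[i:j]) in named_entities:
                match_end = j
                break
        if match_end is None:
            merged.append(segments[i])  # no NE starts here; keep the segment
            i += 1
        else:
            merged.append("".join(segments[i:match_end]))
            i = match_end
    return merged


# Toy example with Latin text standing in for an over-segmented NE.
ne_list = {"Bangkok"}
segments = ["Bang", "kok", "is", "big"]
print(merge_named_entities(segments, ne_list))
```

Each position triggers up to `max_span - 1` extra lookups, which is consistent with the merge approach running slower than the no-merge approach in the reported measurements.
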