TLex: Thai Lexeme Analyser Based on the ... - BEST - Nectec

TLex: Thai Lexeme Analyser Based on the Conditional Random Fields

Choochart Haruechaiyasak and Sarawoot Kongyoung

Abstract—In this paper, we present our proposed solution to the InterBEST 2009 Thai Word Segmentation task. We applied Conditional Random Fields (CRFs) to train word segmentation models from a given corpus. Using the CRFs, the word segmentation problem can be formulated as a sequential labeling task in which each character in the text string is predicted as one of two classes: word-beginning and intra-word characters. One of the key factors which affect the performance of the word segmentation models is the design of appropriate feature sets. We proposed and evaluated three different feature sets: char (by using all possible characters as features), char-type (by categorizing all possible characters into 10 different types) and combined (by using both characters and character types as features). The evaluation results showed that the combined feature set yielded the best performance, with an averaged F1 value over all genres of 93.90%.
To further improve the results, we performed a post-processing step by merging named entities (NEs) in the segmented texts. We used a list of NEs compiled from the training corpus. The NE merging step helped increase the performance of the combined feature model to 94.27%.

Index Terms—Word segmentation, tokenization, morphological analysis, conditional random fields.

I. INTRODUCTION

Word segmentation is considered a basic yet very important NLP task in many unsegmented languages. The main goal of the word segmentation task is to assign correct word boundaries to given text strings. Previous approaches applied to Thai word segmentation can be broadly classified as dictionary-based and machine learning. The dictionary-based approach relies on a set of terms from a dictionary for parsing and segmenting input texts into word tokens. During the parsing process, series of characters are looked up in the dictionary for matching terms. The performance of the dictionary-based approach depends on the quality and size of the word set in the dictionary used during the segmentation process.

Two main problems, unknown words and ambiguity, typically occur while parsing and segmenting the input text strings. Unknown words refer to terms which are not found in the dictionary while parsing the text strings.
The ambiguity problem, in contrast, occurs when there is more than one way to segment the text strings. One possible solution to the unknown word problem in the dictionary-based approach is to use language-dependent lexical rules to merge unknown segments. The main drawback of this solution is the requirement to keep the lexical rules updated. Today, many new words are being generated by transliterating foreign words. As a result, the process of updating the lexical rules becomes inefficient. Another alternative solution is to collect and include as many unknown words as possible in the dictionary.

(Author affiliation: Choochart Haruechaiyasak and Sarawoot Kongyoung are with the Human Language Technology Laboratory (HLT), National Electronics and Computer Technology Center (NECTEC), Thailand Science Park, Klong Luang, Pathumthani 12120, Thailand; email: {choochart.haruechaiyasak, sarawoot.kongyoung}@nectec.or.th)
However, this solution is not very practical since it requires manual addition and updating of newly generated terms.

The ambiguity problem of the dictionary-based approach can be solved via several selection techniques, such as selecting the longest possible term, i.e., longest matching [11]. Another alternative is selecting the segmentation which yields the minimum number of word tokens, i.e., maximal matching [12]. Previous works reported that the maximal matching algorithm gives a marginal improvement over the longest matching algorithm, with the tradeoff of a longer running time. Both the longest matching and maximal matching algorithms can be considered as using heuristics to solve the ambiguity problem. The main disadvantage is that these heuristics are very simple and static, i.e., unable to adapt to changing domains.

Due to the drawbacks of the dictionary-based approach, machine learning algorithms have been adopted for the word segmentation task. The machine learning approach relies on a model trained from a corpus by using machine learning algorithms. Using a tagged corpus in which word boundaries are explicitly marked with a special character, a machine learning algorithm can be applied to train a model based on the features surrounding these boundaries.
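For illustration, the longest matching heuristic described above can be sketched as follows. This is a minimal sketch with a toy English dictionary standing in for a Thai lexicon; the function name and dictionary are assumptions for the example, not the implementations cited in [11] or [12].

```python
# Minimal sketch of greedy longest matching against a dictionary.
# The toy English dictionary and function name are illustrative
# assumptions, not the cited implementations.
def longest_matching(text, dictionary, max_len=20):
    """Segment `text` by taking the longest dictionary term at each
    position; an unmatched character becomes a one-character token
    (the unknown-word case discussed above)."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        for j in range(min(len(text), i + max_len), i, -1):  # longest first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]      # unknown character: emit as-is
        tokens.append(match)
        i += len(match)
    return tokens

print(longest_matching("inthehouse", {"in", "the", "there", "house"}))
# → ['in', 'the', 'house']
```

Maximal matching differs in that it searches over all candidate segmentations for the one with the fewest tokens, which explains the longer running time mentioned above.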
Under the machine learning approach, the word segmentation problem can be formulated as a binary classification task in which each character in the text string is predicted as belonging to one of two classes: word-beginning and intra-word characters. The main advantage of the machine learning approach is its independence from a dictionary. The unknown word and ambiguity problems are handled by the classification model, which is learned from the various character patterns inside a tagged corpus. Therefore, the performance of this approach depends on the domain and size of the corpus. For example, if the model is trained on a corpus from a specific domain, it might not perform well on other domains. Moreover, the corpus must be large enough to cover all the different character patterns so that the model can be trained effectively.

Recently, the Human Language Technology Laboratory (HLT) under the National Electronics and Computer Technology Center (NECTEC) has designed and released a Thai word segmentation corpus to the Thai NLP research community. The most updated corpus, used for the InterBEST 2009 Thai Word Segmentation workshop¹, contains 5 million words (plus an-

¹ http://thailang.nectec.or.th/interbest
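The binary character-labeling formulation described above can be sketched as follows. The '|' boundary marker and the function name are assumptions for illustration (an English string stands in for a Thai text string).

```python
# Sketch of the two-class character labeling described above:
# B = word-beginning character, I = intra-word character. The '|'
# boundary marker is an assumption for illustration, not necessarily
# the corpus annotation format.
def to_char_labels(marked_text):
    """Turn boundary-marked text into (character, label) pairs."""
    pairs = []
    for word in marked_text.split("|"):
        for k, ch in enumerate(word):
            pairs.append((ch, "B" if k == 0 else "I"))
    return pairs

print(to_char_labels("the|house"))
# → [('t', 'B'), ('h', 'I'), ('e', 'I'), ('h', 'B'), ('o', 'I'), ('u', 'I'), ('s', 'I'), ('e', 'I')]
```

A sequence model then predicts these B/I labels for unseen text, and word boundaries are recovered by cutting before every B.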


… char-type (by categorizing all characters into 10 different types) and combined (by using both characters and character types as features). For the char-type feature set, we categorize all characters into 10 different types, as shown in Figure 2. The set of character types is designed based on linguistic knowledge, i.e., the lexical rules of the Thai language. These character types could provide effective information for identifying word boundaries.

Fig. 1. The overall process of our proposed solutions.

Fig. 2. Character-type (char-type) feature set:

  Tag  Type
  c    Consonant characters which can be assigned as the word-ending character
  n    Consonant characters which cannot be assigned as the word-ending character
  v    Vowel characters which are not allowed to begin a word
  w    Vowel characters which are allowed to begin a word
  t    Tonal characters
  s    Symbol characters
  d    Digit characters (0-9)
  q    Quote characters (', ", “, ”)
  p    Space character within a word (_)
  o    Other characters (e.g., a-z, A-Z)

Fig. 3. Example of a text string formatted according to the combined feature set.

Next, the training corpus is transformed into three formatted training sets (one for each feature set). Figure 3 shows an example of a text string formatted based on the combined feature set. The first column contains the character (char) features and the second column contains the character-type (char-type) features. The third column is the annotation for the predicted class labels: word-beginning character (B) and intra-word character (I).
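As an illustration of this three-column format, the sketch below maps characters to a char-type tag and emits one line per character. The membership sets are small samples chosen for the example; the c/n consonant split is collapsed into a single 'c' and the symbol type 's' is omitted. They are assumptions for illustration, not the paper's full 10-type definition.

```python
# Sketch of the char-type mapping and the three-column training format
# (char, char-type, label). The character sets below are small
# illustrative samples, not the paper's full definition.
LEAD_VOWELS = set("เแโใไ")     # vowels allowed to begin a word   -> 'w'
POST_VOWELS = set("ะาิีึืุู")  # vowels not allowed to begin a word -> 'v'
TONE_MARKS  = set("่้๊๋")       # tonal characters                  -> 't'

def char_type(ch):
    """Map a character to a char-type tag (simplified subset)."""
    if ch in LEAD_VOWELS:
        return "w"
    if ch in POST_VOWELS:
        return "v"
    if ch in TONE_MARKS:
        return "t"
    if ch in "0123456789":
        return "d"               # digit characters
    if ch in "'\"“”":
        return "q"               # quote characters
    if ch == " ":
        return "p"               # space within a word
    if "\u0e01" <= ch <= "\u0e2e":
        return "c"               # Thai consonant block (paper splits into c/n)
    return "o"                   # other characters, e.g. a-z, A-Z

def to_columns(words):
    """Emit one 'char<TAB>char-type<TAB>label' line per character."""
    lines = []
    for word in words:
        for k, ch in enumerate(word):
            lines.append(f"{ch}\t{char_type(ch)}\t{'B' if k == 0 else 'I'}")
    return "\n".join(lines)

print(to_columns(["ที่", "บ้าน"]))  # two Thai words, already segmented
```

Each training sentence becomes such a block of lines, which is the column layout consumed by CRF toolkits.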
Using the formatted training data sets, we trained the word segmentation models based on the CRFs. The CRFs algorithm learns to construct the most probable label sequences by observing the features surrounding each annotated class. The word segmentation models can then be evaluated using the test data set.

Based on our initial observation, the training corpus contains a large number of named entities (NEs), especially in the news articles. Figure 4 shows some examples of NEs of three different types: person, organization and place. Our initial analysis of the segmented results showed that the models trained using the CRFs tend to mistakenly separate named entities into a few small segments. To improve the segmented results, we perform NE merging as a post-processing step. This process involves the extraction of all named entities found in the training corpus. Table I lists the number of NEs collected from each genre of the training corpus. Since the News genre contains the largest number of NEs, we could expect a higher improvement when NE merging is applied to its segmented results. Using the NE list, the NE merging step parses the segmented texts returned from the word segmentation model.
During the parsing process, if a series of segments matches an entry in the NE list, they are merged into one single word unit. Some examples of the NE-merged results are shown in Figure 4.

V. EXPERIMENTS AND RESULTS

In this section, we evaluate the proposed solutions using the training and test corpora provided by the InterBEST 2009 workshop. The experiments (for the test corpus evaluation) were performed on a PC with an Intel Core 2 Duo 1.80 GHz and 2 GB of RAM running Windows Vista. Three performance measures of precision, recall and F1 are used for
evaluating the different word segmentation approaches. Precision is defined as the number of tokens correctly segmented by the algorithm divided by the total number of tokens returned by the algorithm. Recall is defined as the number of tokens correctly segmented by the algorithm divided by the total number of tokens in the test set. Eq. (4) shows the calculation of the F1 measure, which is the harmonic mean of precision and recall.

  F1 = (2 × Precision × Recall) / (Precision + Recall)    (4)

Fig. 4. Example of named entities collected from the training corpus.

TABLE I
NUMBER OF NAMED ENTITIES EXTRACTED FROM THE TRAINING CORPUS

  Genre          Number of NEs
  Article          7,193
  Buddhism           349
  Encyclopedia     5,462
  Law                921
  News            23,038
  Novel            4,145
  Talk             2,284
  Wiki            12,935

We first performed an evaluation of the three feature sets (i.e., char, char-type and combined) as explained in the previous section. Given the training corpus, we formatted the text strings according to the different feature sets and trained three word segmentation models based on the CRFs. The results for each of the 12 test genres are shown in Table II. The averaged performance over all genres is summarized in Table III.

TABLE II
EVALUATION RESULTS OF THREE DIFFERENT FEATURE SETS

  Genre         Feature set    P      R      F1
  Article       char         92.19  92.10  92.15
                char-type    64.43  67.08  65.73
                combined     95.45  96.80  96.12
  Buddhism      char         93.95  93.21  93.58
                char-type    66.95  65.46  66.20
                combined     95.39  95.39  95.39
  Encyclopedia  char         92.59  91.38  91.98
                char-type    64.87  66.00  65.43
                combined     95.17  95.21  95.19
  Law           char         94.17  94.52  94.34
                char-type    64.61  68.40  66.45
                combined     95.14  96.28  95.70
  News          char         90.21  90.07  90.14
                char-type    59.53  64.45  61.89
                combined     91.91  94.71  93.29
  Novel         char         93.66  93.97  93.82
                char-type    65.54  65.94  65.74
                combined     94.57  95.57  95.07
  Nsc           char         91.53  90.90  91.21
                char-type    63.98  63.39  63.68
                combined     95.71  95.57  95.64
  Old-doc       char         90.98  90.38  90.68
                char-type    55.57  59.50  57.47
                combined     91.70  93.12  92.40
  Royalnews     char         80.23  86.96  83.46
                char-type    44.35  62.53  51.89
                combined     80.83  91.42  85.80
  Talk          char         94.34  93.99  94.16
                char-type    66.75  67.21  66.98
                combined     96.74  97.00  96.87
  Tvnews        char         90.32  92.50  91.40
                char-type    56.06  60.97  58.41
                combined     91.07  95.06  93.02
  Wiki          char         91.42  90.65  91.03
                char-type    61.86  65.22  63.50
                combined     90.61  93.53  92.05

TABLE III
EVALUATION RESULTS OF THE DIFFERENT FEATURE SETS OVER ALL GENRES

  Feature set    P      R      F1
  char         91.30  91.72  91.51
  char-type    61.21  64.68  62.90
  combined     92.86  94.97  93.90

The results from the tables show that using the char-type feature set yielded the worst averaged performance, 62.90% based on the F1 value. The model trained using the char feature set gave a significant improvement, to 91.51%. The reason is that the char feature set provides more precise and richer information based on the actual characters; therefore, the CRFs could effectively observe and recognize the sequences. However, when combining both the char-type and char feature sets, the performance increases to 93.90%. The improvement comes from the fact that the char-type features could help segment character patterns which had not previously been observed using the char feature set. Comparing among all 12 genres, the Talk genre yielded the best performance.

Next, we perform and evaluate the NE merging process. As mentioned in the previous section, applying the NE merging step could help combine the segments which are named entities into a single word unit. Using the model trained on the combined feature set, the comparison results between the merge and no-merge approaches under each of the 12 genres are shown in Table IV. The averaged performance over all genres is summarized in Table V. The results show that the NE merging step helps increase the performance of the combined feature model to 94.27% based on the F1 value. The maximum
TABLE IV
EVALUATION RESULTS OF THE MERGING APPROACH ON THE combined FEATURE SET

  Genre         Approach     P      R      F1
  Article       no-merge   95.45  96.80  96.12
                merge      95.71  96.54  96.13
  Buddhism      no-merge   95.39  95.39  95.39
                merge      95.19  94.96  95.07
  Encyclopedia  no-merge   95.17  95.21  95.19
                merge      95.15  94.83  94.99
  Law           no-merge   95.14  96.28  95.70
                merge      95.36  96.00  95.68
  News          no-merge   91.91  94.71  93.29
                merge      93.54  94.99  94.26
  Novel         no-merge   94.57  95.57  95.07
                merge      95.43  95.72  95.57
  Nsc           no-merge   95.71  95.57  95.64
                merge      95.83  95.53  95.68
  Old-doc       no-merge   91.70  93.12  92.40
                merge      92.09  92.81  92.45
  Royalnews     no-merge   80.83  91.42  85.80
                merge      84.41  92.28  88.17
  Talk          no-merge   96.74  97.00  96.87
                merge      96.99  96.78  96.89
  Tvnews       no-merge   91.07  95.06  93.02
                merge      92.29  94.92  93.59
  Wiki          no-merge   90.61  93.53  92.05
                merge      91.62  93.58  92.59

TABLE V
EVALUATION RESULTS OF NE MERGING OVER ALL GENRES

  Approach     P      R      F1
  no-merge   92.86  94.97  93.90
  merge      93.63  94.91  94.27

improvement comes from the Royalnews genre, in which the NE merging yielded a performance of 88.17% based on the F1 value, compared to 85.80% without merging. The reason is that the NEs in the Royalnews genre are limited but occur very often in the articles.

We also measured the execution times of the two approaches. The no-merge approach yields 7,444 words/sec, while the merge approach yields 2,229 words/sec. The decreased speed of the merge approach is due to the time taken for parsing and checking the named entities in the list.

VI.
CONCLUSIONS AND FUTURE WORKS

In this paper, we proposed a solution to the InterBEST 2009 Thai Word Segmentation task based on the Conditional Random Fields. To train the word segmentation models, we proposed three different feature sets: char (by using all possible characters as features), char-type (by categorizing all characters into 10 different types) and combined (by using both characters and character types as features). The evaluation results showed that the char-type feature set yielded the worst performance, 62.90% based on the F1 value. The model trained using the char feature set gave an F1 value of 91.51%. The highest F1 value of 93.90% was achieved when using the combined feature set. Therefore, it could be concluded that the CRFs could effectively learn the actual character sequences; in cases where the exact sequential character patterns are not matched, the generalized model based on the char-type features helps cover them. The analysis of the segmented results showed that the models trained using the CRFs tend to mistakenly separate named entities into a few small segments. As a solution, we performed a post-processing step by merging named entities (NEs) found in the returned segments.
The NE merging step helped increase the performance of the combined feature model to 94.27% based on the averaged F1 value.

For future work, we will focus on improving the word segmentation models learned by using the CRFs. One possible direction is to improve the feature set used for learning the models; a new character feature set could be automatically constructed by using clustering algorithms. Another possible improvement is to train a named entity recognition (NER) model to merge returned segments which contain named entities into single word units.

REFERENCES

[1] C. Haruechaiyasak, S. Kongyoung and M. Dailey, "A comparative study on Thai word segmentation approaches," Proc. of the ECTI-CON 2008, 1:125-128, 2008.
[2] C. Haruechaiyasak, S. Kongyoung and C. Damrongrat, "LearnLexTo: a machine-learning based word segmentation for indexing Thai texts," Proc. of the 2nd ACM Workshop on Improving Non English Web Searching (iNEWS), pp. 85-88, 2008.
[3] A. Kawtrakul and C. Thumkanon, "A Statistical Approach to Thai Morphological Analyzer," Proc. of the 5th Workshop on Very Large Corpora, pp. 286-289, 1997.
[4] C. S. G. Khoo and T. Ee Loh, "Using Statistical and Contextual Information to Identify Two- and Three-Character Words in Chinese Text," Journal of the American Society for Information Science and Technology, 53(5):365-377, 2002.
[5] C. Kruengkrai and H. Isahara, "A Conditional Random Field Framework for Thai Morphological Analysis," Proc. of the Fifth Int. Conf. on Language Resources and Evaluation (LREC-2006), 2006.
[6] C. Kruengkrai, et al., "An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging," Proc. of the ACL-IJCNLP 2009, pp. 513-521, 2009.
[7] T. Kudo, K. Yamamoto, and Y. Matsumoto, "Applying Conditional Random Fields to Japanese Morphological Analysis," Proc. of EMNLP, pp. 89-96, 2004.
[8] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," Proc. of the Eighteenth Int. Conf. on Machine Learning (ICML), pp. 282-289, 2001.
[9] S. Meknavin, P. Charoenpornsawat, and B. Kijsirikul, "Feature-Based Thai Word Segmentation," Proc. of NLPRS 97, pp. 289-296, 1997.
[10] F. Peng, F. Feng, and A. McCallum, "Chinese Segmentation and New Word Detection Using Conditional Random Fields," Proc. of the 20th COLING, pp. 562-568, 2004.
[11] Y. Poowarawan, "Dictionary-based Thai Syllable Separation," Proc. of the Ninth Electronics Engineering Conference, 1986.
[12] V. Sornlertlamvanich, "Word Segmentation for Thai in Machine Translation System," Machine Translation, National Electronics and Computer Technology Center, Bangkok, 1993.
