
Thai Word Segmentation Based on GLR Parsing Technique and Word N-gram Model

Piya Limcharoen, Cholwich Nattee and Thanaruk Theeramunkong

Abstract—Word segmentation is one of the basic processes for languages without explicit word boundaries. Up to now, several approaches to Thai word segmentation have been proposed, such as longest matching and maximum matching. We propose a Thai word segmentation technique based on GLR parsing and a statistical language model. In this technique, an input Thai text is first segmented into a sequence of Thai Character Clusters (TCCs). Each TCC represents a group of inseparable Thai characters according to the Thai writing system. The concept of TCC helps avoid choosing segmentation points that violate the writing rules.
Then, the most suitable segmentation candidate is chosen based on the word N-gram model with interpolation. Both the candidate generation and selection processes are conducted through a two-phase GLR parsing technique. In the first phase, the production rules for TCCs are applied to parse an input sequence of characters into a sequence of TCCs, which becomes the input tokens for the second-phase parsing. The second phase groups TCCs to form words; here we construct grammar rules, derived from the words in the prepared training set, that represent each word as a sequence of TCCs. However, ambiguities in segmenting words affect the parsing result, so we apply a statistical language model to select the most appropriate segmentation. This statistical model is applied together with GLR parsing, and the beam search technique is applied to keep only the best k parsing paths. We evaluate the proposed technique using the test data provided by InterBEST 2009. The experimental results show that the technique obtains an f-measure of 87.04% when the beam size is set to 10.

I. INTRODUCTION

Word segmentation is a basic and crucial process in Natural Language Processing (NLP). Since the word is a fundamental unit of any language, most NLP systems first need to segment input text into a sequence of words before further processing. However, the writing system of the Thai language does not use any delimiter to explicitly indicate word boundaries, as shown in Figure 1. A sequence of words is written continuously, as if the English sentence "You ate an apple" were written as "Youateanapple".
This characteristic can also be found in other Asian languages such as Japanese, Chinese and Korean. It makes processing text in these languages more complicated, since the word boundaries are inherently ambiguous and sometimes depend on the semantics of the sentence. A dedicated technique is required to efficiently segment text and identify words.

Piya Limcharoen, Cholwich Nattee and Thanaruk Theeramunkong are with the School of Information and Computer Technology, Sirindhorn International Institute of Technology, Thammasat University, 131 M.5 Tiwanont Rd., Bangkadi, Muang, Pathumthani, Thailand 12000; e-mail: {piya, cholwich, thanaruk}@siit.tu.ac.th

Word segmentation can basically be split into two main processes: word candidate generation and word candidate selection. The first process aims at constructing all possible word candidates from a given input text, while the latter aims at choosing the most suitable candidate. In this paper, we propose a novel approach to Thai word segmentation based on the GLR parsing technique. We consider that word segmentation can be done by applying a syntactic analysis process using grammars derived from Thai writing rules and a word list. Furthermore, our approach embeds the candidate selection process into the candidate generation process.
This allows a reduction in the number of candidates generated.

In order to reduce the size of the input text and the search space, we also propose two-phase candidate generation. In the first phase, the input text, as a stream of characters, is split into groups based on Thai writing rules. Then, the groups of characters are combined into word candidates in the second phase. Here, the candidate selection process is performed together with the second phase of candidate generation. We apply the concept of the Thai Character Cluster (TCC) [1]. This concept is based on Thai writing rules. It aims to group together characters that depend on other characters; for example, each vowel and tone mark must correspond to a consonant. By definition, a TCC is an unambiguous and inseparable group of characters. This means that no segmentation point can occur between any two characters inside a TCC, as shown in Figure 2. Applying the concept of TCC is useful for candidate generation since it works as an intermediate level to form groups of characters and reduces the size of the data to be handled. Segmenting a sequence of characters into TCCs can be done using a parsing technique: based on the TCC concept, we select the longest subsequence of characters matched by a rule as a TCC.
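As an illustration, the longest-match grouping over TCC-style rules can be sketched with a handful of toy patterns. The character classes and the single regular expression below are simplified assumptions for illustration only; the actual grammar in this work consists of 123 hand-written rules.

```python
import re

# Toy stand-ins for TCC-style grouping rules. The real grammar in this work
# has 123 hand-written rules; these character classes and the single pattern
# below are simplified assumptions for illustration only.
CONSONANTS = "\u0e01-\u0e2e"                      # ก-ฮ
ABOVE_BELOW_VOWELS = "\u0e31\u0e34-\u0e3a\u0e47"  # ั, ิ-ฺ, ็
TONE_MARKS = "\u0e48-\u0e4b"                      # ่ ้ ๊ ๋
LEADING_VOWELS = "\u0e40-\u0e44"                  # เ-ไ

TCC_PATTERN = re.compile(
    f"[{LEADING_VOWELS}]?[{CONSONANTS}][{ABOVE_BELOW_VOWELS}]?[{TONE_MARKS}]?"
    "|."  # fall back to a single character
)

def to_tccs(text):
    """Greedily split text into inseparable clusters (longest match first)."""
    return TCC_PATTERN.findall(text)

print(to_tccs("ตอนที่"))  # ['ต', 'อ', 'น', 'ที่'] -- no split inside 'ที่'
```

Note that no segmentation point falls between a consonant and its dependent vowel or tone mark, which is the inseparability property illustrated in Figure 2.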
Applying the parsing technique for candidate generation in this approach can be viewed as a generalized version of the radix tree (trie) structure used in many Thai word segmentation approaches, since words in the dictionary can be handled together with the rules of the writing system.

Fig 1. Connected words in the Thai writing system: "You(คุณ) ate(กิน) an apple(แอปเปิ้ล)".


In addition, the proposed approach applies a GLR parsing technique [12, 13] as the main tool for segmenting words, since the parser performs similarly to a generalized version of the trie structure.

Fig 2. Concept of TCC, which groups a character depending on another character.

For candidate selection, we apply a statistical language model to select the most appropriate candidate. The model used in the proposed approach is the word N-gram model with interpolation. We propose a method that combines the statistical model with the candidate generation process. By incorporating the beam search technique into the parsing process, the approach generates only the candidates with high potential to be appropriate words. This technique provides benefits compared to most existing works, which construct all possible candidates before making a decision. Our method generates and selects candidates at the moment each input token is fetched by the parser.

II. RELATED WORKS

Most of the earlier works on Thai word segmentation are based on mapping input text to a predefined dictionary. The problem with this method lies in the way of mapping, that is, in handling segmentation ambiguity. The longest matching method maps written text from left to right, and gives first priority to the longest candidate found in the dictionary [2, 3]. The maximal matching method creates all possible word segmentation candidates, and selects the one that contains the fewest words [4]. Later, many works tried to incorporate various kinds of additional information beyond the dictionary to obtain better segmentation results. The statistical approach applies statistics based on a language model collected from corpora to select the most appropriate segmentation. A part-of-speech N-gram model was applied to filter out unnecessary segmentation candidates [5]. Aroonmanakun [6] used tri-gram statistics with syllable collocation to select the best candidate.

The feature-based approach extracts features from the generated word segmentation candidates, and applies machine learning techniques to learn a classifier that selects the most suitable candidate. Sornil and Chaiwanarom [7, 8] proposed a technique to segment Thai syllables and applied logistic regression to combine syllables into words. Theeramunkong and Usanavasin [9] proposed several kinds of features that can be extracted from each segmentation point. They also proposed using a decision tree to distinguish correct word segmentations from incorrect ones. Meknavin et al. [10] applied RIPPER and Winnow, which are supervised machine learning techniques, using features based on words and part-of-speech tags. Haruechaiyasak et al. [11] applied character-based features, based on character location and character type in the Thai writing system, to several learning techniques, e.g. naïve Bayes, decision trees, support vector machines, and conditional random fields.

III. TWO-PHASE PARSING APPROACH FOR CANDIDATE GENERATION

We first explain the GLR parsing technique applied in the proposed approach. A GLR parser parses an input sentence in a bottom-up manner using a parsing table generated from a context-free grammar (CFG). We can also say that the parser reads input tokens and uses the parsing table as a dictionary for analyzing the structure of the input. There are basically two actions conducted on the input tokens, 'shift' and 'reduce', according to the parsing table. The existing parsing technique can be used for word segmentation by taking a stream of characters as input and producing word segmentation candidates, so-called Character-to-Word segmentation, as shown in Figure 3.

In this paper, we introduce the two-phase parsing technique for word segmentation. The first phase (so-called Character-to-TCC segmentation) groups input characters into a sequence of TCCs. This helps reduce the size of the input and improve the performance of the overall process. After that, the stream of TCCs is used as the input for the second phase (so-called TCC-to-Word segmentation). The second phase then outputs all possible segmentation candidates, as shown in Figures 4 and 5.
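The exhaustive candidate generation illustrated in Figures 3 and 5 can be sketched as a simple recursive enumeration over a word list. This is not the GLR parser itself; the `max_len` bound and the single-character fallback for out-of-vocabulary text are assumptions of this sketch.

```python
# Minimal recursive sketch of exhaustive candidate generation: every way to
# cover the input with words from a word list, with single characters as a
# fallback for out-of-vocabulary text (an assumption of this sketch).
def all_candidates(text, dictionary, max_len=20):
    if not text:
        return [[]]
    candidates = []
    for j in range(1, min(len(text), max_len) + 1):
        prefix = text[:j]
        if prefix in dictionary or j == 1:
            for rest in all_candidates(text[j:], dictionary, max_len):
                candidates.append([prefix] + rest)
    return candidates

for cand in all_candidates("ตอนที่", {"ตอ", "ตอน", "ที่"}):
    print("|".join(cand))
```

The number of candidates grows quickly with the input length, which is exactly what the candidate selection of Section IV prunes.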
The basic idea of two-phase parsing is to break down one big word segmentation problem into two simpler sub-problems. The first phase reduces the scale of the input by reducing the character stream into a shorter stream of TCCs. Then, the second phase generates the word segmentation candidates as groups of TCCs, which are fewer than groups of characters.

A. Character-to-TCC Segmentation

In the first phase, we generate the TCC parsing table from the TCC context-free grammar (CFG) shown in Figure 6. The TCC grammar, consisting of 123 rules, was created manually by following the concept of grouping together characters that depend on others. This yields a parsing table that guides the parser to parse a stream of characters and form a stream of TCCs. The result of this process reduces the size of the input for the next phase, as shown in Figure 7.

B. TCC-to-Word Segmentation

In the second phase, we generate the TCC-to-Word parsing table using another set of CFG rules, as shown in Figure 8. The grammar rules define a word as a sequence of TCCs.


Fig 3. Candidate results (11 candidates) from one-phase parsing. ตอน(part), ที่(position)

Fig 6. Example of TCC grammar rules; a symbol starting with "_" is a terminal.

Fig 7. Reduction in input size resulting from the first phase (Character-to-TCC).

Fig 4. Process of two-phase parsing: first phase (Character-to-TCC) and second phase (TCC-to-Word). ตอน(part), ที่(position)

Fig 5. Candidate results (3 candidates) from two-phase parsing. ตอน(part), ที่(position)

The parsing table based on this grammar is then generated. This process is time-consuming, depending on the number of words in the dictionary. However, this extensive time is acceptable since the parsing table creation process is done only once. The result of this step is a parsing table that guides the parser in grouping a stream of TCCs into words. Since there are ambiguities in segmenting words, the CFG is also ambiguous. The system generates all possible ways to connect groups of TCCs into words in a sentence. We then need to apply candidate selection to choose the most appropriate candidate.

Applying two-phase parsing improves performance because of the data size reduction. Moreover, combining this candidate generation with the candidate selection discussed in the next section shows further advantages of this method over existing methods.
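The construction of word grammar rules from a word list can be sketched as below. The `to_tccs` stand-in trivially treats each character as one cluster, whereas the paper derives clusters from its TCC grammar; prefixing terminals with "_" follows the convention of Figures 6 and 8, and the rule strings are illustrative, not the actual parsing-table input format.

```python
# Sketch of building TCC-to-Word grammar rules from a word list. The to_tccs
# stand-in trivially treats each character as one cluster, whereas the paper
# derives clusters from its 123-rule TCC grammar.
def to_tccs(word):
    return list(word)  # placeholder TCC segmentation

def word_grammar(dictionary):
    """One production per dictionary word: WORD -> tcc_1 ... tcc_n."""
    rules = ["WORD -> " + " ".join(f"_{t}" for t in to_tccs(w))
             for w in sorted(dictionary)]
    # A sentence is any sequence of words; like the paper's CFG, this
    # grammar is ambiguous whenever words overlap in their TCCs.
    rules.append("S -> WORD | S WORD")
    return rules

for rule in word_grammar({"ตา", "ตาก", "กลม"}):
    print(rule)
```

Because several words can share TCC prefixes, a GLR parser built from such a grammar explores all segmentations in parallel, which is why the candidate selection of the next section is needed.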
By using the parsing technique, we open a new perspective on segmentation, which comes with an advantage over dictionary-based and machine learning-based methods because of the flexibility of parsing.

Based on our previous work [14], we have conducted some experiments to evaluate the two-phase parsing technique explained in this section. We found from the experiments


that the two-phase parsing technique gains a 35.48% reduction in the number of states compared to the one-phase parsing technique. This experiment was conducted using a Thai dictionary containing 42,858 words.

Fig 8. Example of word grammar rules; a symbol starting with "_" is a terminal.

IV. CANDIDATE SELECTION BASED ON STATISTICAL MODEL AND BEAM SEARCH

In this section, we explain the application of a statistical language model for candidate selection. From the example of candidates shown in Figure 9, the number of obtained candidates varies with the length of the TCC stream and the number of ambiguous words. Ambiguous words are the main factor in the number of candidates, since the result may contain two or more candidates at each ambiguous point. Once these are combined, the result is a large number of candidates, each of which could be a correct segmentation.

ปลา|มี|ตาก|ลม|แต่|เสื้อ|ตาก|ลม|
ปลา|มี|ตาก|ลม|แต่|เสื้อ|ตาก|ล|ม|
…
ปลา|มี|ตา|กลม|แต่|เสื้อ|ตาก|ลม|
ปลา|มี|ตา|กลม|แต่|เสื้อ|ตาก|ล|ม|
…
ปลา|มี|ตา|ก|ลม|แต่|เสื้อ|ตาก|ลม|
ปลา|มี|ตา|ก|ลม|แต่|เสื้อ|ตาก|ล|ม|
…
ป|ลา|มี|ตา|กลม|แต่|เสื้อ|ตาก|ลม|
ป|ลา|มี|ตา|กลม|แต่|เสื้อ|ตาก|ล|ม|
…
ป|ลา|มี|ตา|ก|ลม|แต่|เสื้อ|ตาก|ลม|
ป|ลา|มี|ตา|ก|ลม|แต่|เสื้อ|ตาก|ล|ม|
…
ป|ลา|มี|ตา|ก|ล|ม|แต่|เสื้อ|ตา|ก|ลม|
ป|ลา|มี|ตา|ก|ล|ม|แต่|เสื้อ|ตา|ก|ล|ม|

Fig 9. The 72 candidates produced by candidate generation without the candidate selection process.

A. Statistical Model Using the N-gram Model

The statistical model is applied to select among the ambiguous word segmentation points, as shown in Figure 10. This model can deal with unknown words by applying the probability that an unknown word occurs in a sentence. Furthermore, machine learning techniques can be applied to improve the predictive accuracy of the system by making the system learn from its mistakes.

In this paper, we apply the word N-gram model with an interpolation technique combining uni-gram, bi-gram and tri-gram. The N-gram probability can be calculated as:

  P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n}) / Σ_w C(w_{n-N+1}^{n-1} w)

where w_n is the n-th word in the word sequence, w_i^j denotes the sub-sequence from the i-th to the j-th word, and C(·) is a count collected from the corpus. N is the N-gram order, called uni-gram when N=1, bi-gram when N=2, and tri-gram when N=3. Figure 11 shows examples of N-gram probabilities in logarithmic form.

-8.57086e-01 | |
-1.65997e+00 |ที่|
-4.88202e+00 |จักรกล|
-6.75706e+00 |อารัก|
+7.59836e-08 |คู่ควง|ของ|
-2.31597e+00 |บริโภค|เข้าใจ|
-4.10350e+00 |แล้ว|กลุ้ม|
-5.90000e+00 | |มือยิง|
+7.59836e-08 |ครึ่ง|วง|กลม|
+7.59836e-08 |เสนอ|เงิน|ชดเชย|
-1.11394e+00 |เขา|แนะนํา|หล่อน|
-4.89695e+00 |&| |พิชิต|

Fig 11. Probability information extracted from the corpus by the uni-gram, bi-gram and tri-gram models.

In the proposed approach, we apply the interpolation technique that combines tri-gram, bi-gram and uni-gram:

  P(w_n | w_{n-2} w_{n-1}) = λ_3 P(w_n | w_{n-2} w_{n-1}) + λ_2 P(w_n | w_{n-1}) + λ_1 P(w_n)

such that the λ's are interpolation parameters whose sum is 1. In this paper, we set the λ's to 0.1 for uni-gram, 0.3 for bi-gram, and 0.6 for tri-gram. To avoid zero probability for an unknown sequence of words, we estimate the probability using the lowest probability found in the corpus.

This N-gram model is applied to all the segmentation points in a candidate. The probability of the whole sentence can be obtained from the product of all the N-gram probabilities in the sentence. However, this combination tends to favor candidates with fewer segmentation points. We relax this problem by applying the geometric average to the product of the N-gram probabilities. The geometric average of a data set [a_1, a_2, ..., a_n] is given by

  (a_1 · a_2 ··· a_n)^(1/n)

Figure 10 shows an example of the result after applying the statistical model: the best candidate, "ปลา(Fish) | มี(Have) | ตา(Eye) | กลม(Round) | แต่(But) | เสื้อ(Shirt) | ตาก(Dry) | ลม(Wind)", is the one with the highest probability.

P = 6.15e-05 ปลา|มี|ตา|กลม|แต่|เสื้อ|ตาก|ลม|
P = 5.97e-05 ปลา|มี|ตา|กลม|แต่|เสื้อ|ตา|กลม|
P = 4.80e-05 ปลา|มี|ตาก|ลม|แต่|เสื้อ|ตาก|ลม|
P = 4.66e-05 ปลา|มี|ตาก|ลม|แต่|เสื้อ|ตา|กลม|
P = 3.61e-05 ปลา|มี|ตา|กลม|แต่|เสื้อ|ตา|ก|ลม|
…
ปลา(Fish), มี(Have), ตาก(Dry), ตา(Eye), กลม(Round), ลม(Wind), แต่(but), เสื้อ(Shirt)

Fig 10. Candidates with different meanings, ordered by probability.

B. Beam Search Technique for Candidate Selection

As discussed earlier, we propose a technique that embeds candidate selection into each parsing level of candidate generation, based on beam search. This stops low-potential child candidates from becoming parent candidates at the next level.

Once a word candidate is generated from a sequence of TCCs while the parsing process is going on, the combined probability from the beginning of the sentence to the current word is calculated and attached to the candidate.
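The scoring and pruning described in this section can be sketched as follows. Only the interpolation weights (0.1, 0.3, 0.6) and the geometric average come from the text; the probability values, the `FLOOR` constant standing in for the lowest corpus probability, and the function names are illustrative assumptions.

```python
import math

# Sketch of the candidate scoring described above. Interpolation weights and
# the geometric average follow the text; FLOOR stands in for the lowest
# probability found in the corpus, and the probabilities are toy values.
LAMBDA_UNI, LAMBDA_BI, LAMBDA_TRI = 0.1, 0.3, 0.6
FLOOR = 1e-7

def interp(p_uni, p_bi, p_tri):
    """Interpolated P(w_n | w_{n-2} w_{n-1})."""
    return LAMBDA_TRI * p_tri + LAMBDA_BI * p_bi + LAMBDA_UNI * p_uni

def candidate_score(word_probs):
    """Geometric average of the per-word probabilities, so candidates with
    fewer segmentation points are not unduly favored."""
    log_sum = sum(math.log(max(p, FLOOR)) for p in word_probs)
    return math.exp(log_sum / len(word_probs))

def prune(candidates, k):
    """Beam step: keep only the k best-scoring (partial) candidates."""
    return sorted(candidates, key=candidate_score, reverse=True)[:k]

# Two partial segmentations given as per-word probability lists; with beam
# size 1 only the higher-scoring one survives to the next parsing level.
beam = prune([[0.02, 0.001, 0.04], [0.02, 0.03]], k=1)
print(beam)  # [[0.02, 0.03]]
```

Working in log space, as above, also avoids the numeric underflow that multiplying many small probabilities would otherwise cause on long sentences.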
This parsing process is conducted in a breadth-first search manner. We select only the k best candidates to be considered at the next level. This k represents the beam size in the beam search technique.

Figures 9 and 12 show an example of applying the beam search technique for candidate selection. The bold lines in Figure 12 show the two best candidates chosen at each level. This can be compared with the result when beam search is not applied.

The benefit of using this technique is that the number of candidates stays constant along all the parsing levels. This avoids the problem of the number of candidates increasing with the number of TCCs and ambiguous words.

Fig 12. Candidate results from the parser; bold circles refer to the candidates kept when using beam size 2.

Comparing the proposed method to the existing statistical methods, the proposed method combines the statistical model into the parsing process, so that all the work is done in a single process.

V. EXPERIMENT RESULTS AND DISCUSSION

To evaluate the proposed method, we conduct experiments on real-world data. We evaluate Thai word segmentation accuracy using the proposed parsing technique and statistical model with different beam sizes. Our experiments use both the training set and the test set from InterBEST 2009: Thai Word Segmentation: an International Episode [15].

The training datasets provided by InterBEST are used to create our training data and the TCC-to-Word parsing table. We then use the InterBEST test sets to evaluate the proposed technique. The results from the InterBEST evaluation system show precision, recall and f-measure for each of the 12 genres, as shown in Figure 13. In this experiment, the beam size is varied from 1 to 10 to show its effect on the performance of the proposed approach. For the λ's in the interpolation, we give the greatest weight to the tri-gram with λ=0.6, then the bi-gram with λ=0.3, and the uni-gram with λ=0.1.

Moreover, our technique is originally able to deal only with documents containing Thai characters.
To evaluate the performance of our system, English letters and symbols are therefore skipped, so that the input text contains only Thai characters before word segmentation is conducted. The skipped English characters are later inserted back into the segmented text to produce the final output.

Based on the experimental results, the highest average f-measure of 87.04% was obtained when the beam size was set to the highest value of 10. However, we found only a small difference when the beam size was set to 3 or 5. The beam search technique thus shows its benefit of reducing the number of evaluated candidates without losing accuracy. It also shows that applying the probability computation to all candidates is wasteful, because most of the good candidates have drastically higher probability than the other candidates. This may lead to a technique for dynamically assigning the beam size, where the beam size is set based on the gap between the group of candidates with high probability and the group with low probability.

We found that our technique worked well with the genres "Article", "Buddhism", "Encyclopedia", "Novel", "NSC", and "Talk", for which we obtain f-measure scores of more than 90%. However, it works poorly with the genres "Old document" and "Royal news", for which we obtain only 80.81% and 71.26%, respectively. After analyzing the experimental results and the datasets in detail, we found that the genres "Old document" and "Royal news" contain a lot of unknown words, such as Royal family member names, abbreviations, and named entities. Since unknown words cannot be properly handled by the proposed technique, all words are basically treated as regular words.
Therefore, the systemwould yield the low segmentati<strong>on</strong> accuracy for the datasetc<strong>on</strong>taining a lot of unknown words.VI. CONCLUSIONIn this paper, we propose <str<strong>on</strong>g>Thai</str<strong>on</strong>g> word segmentati<strong>on</strong> framework.This framework combines candidate generati<strong>on</strong> andcandidate selecti<strong>on</strong> in the same process. In candidate generati<strong>on</strong>,we propose a technique to reduce the problem size byseparating the problem into two phases. The first-phase intendsto reduce the input size using the c<strong>on</strong>cept of TCC. TheNumber of Beam 1 3 5 10Genre P R F P R F P R F P R FArticle 88.34 93.65 90.92 88.87 94.40 91.55 88.87 94.41 91.56 88.92 94.44 91.60Buddhism 91.06 95.80 93.37 90.92 96.23 93.50 90.86 96.21 93.46 90.84 96.20 93.44Encyclopedia 89.40 93.70 91.50 90.09 94.70 92.34 90.11 94.73 92.36 90.11 94.72 92.36Law 82.26 90.39 86.14 82.39 91.08 86.52 82.35 91.06 86.49 82.35 91.05 86.48News 77.08 91.84 83.81 77.36 92.58 84.29 77.38 92.59 84.30 77.43 92.61 84.34Novel 86.88 93.09 89.88 86.96 93.98 90.33 86.98 93.98 90.34 87.03 93.99 90.38NSC 89.00 92.41 90.67 89.81 93.51 91.62 89.83 93.52 91.64 89.85 93.53 91.65Old document 74.72 88.41 80.99 74.29 88.41 80.74 74.33 88.43 80.77 74.39 88.44 80.81Royal news 60.27 86.94 71.19 60.17 87.45 71.29 60.10 87.43 71.23 60.12 87.46 71.26Talk 91.26 94.79 92.99 91.69 95.97 93.78 91.66 95.98 93.77 91.66 95.99 93.78TV news 78.15 92.18 84.59 78.47 92.91 85.09 78.45 92.90 85.06 78.47 92.91 85.08Wiki 76.99 89.73 82.88 77.28 90.27 83.27 77.32 90.30 83.31 77.35 90.31 83.33Average 82.12 91.91 86.58 82.36 92.62 87.03 82.35 92.63 87.03 82.38 92.64 87.04*P – Precisi<strong>on</strong>, R – Recall, F – F-measureFig 13. Evaluati<strong>on</strong> result for statistical model approach with difference beam value.


sec<strong>on</strong>d phase then generates all possible word segmentati<strong>on</strong>candidates. The evaluati<strong>on</strong> of this two-phase techniqueshowed that the number of states in the parsing tables becomesmaller comparing to the <strong>on</strong>e-phase parsing.We also propose an approach that embeds the candidateselecti<strong>on</strong> into the sec<strong>on</strong>d-phase parsing. It aims to generate<strong>on</strong>ly potential candidates by applying statistical model.Moreover, the beam search techniques were applied directlyinto the parser to filter out low-potential candidates at theearlier state. This helps reduce the number of candidates andimprove the system performance in term of speed. The experimentalresults c<strong>on</strong>ducted <strong>on</strong> the 12-genre dataset showedthat the proposed technique obtain <strong>on</strong> average 87.04% f-measure when the beam size is set to 10.For the future work, we plan to incorporate a technique for<str<strong>on</strong>g>Thai</str<strong>on</strong>g> unknown word recogniti<strong>on</strong> and name entity extracti<strong>on</strong>in order to improve the system performance.Technology, 2008. ECTI-CON 2008, Vol.1, Issue 14-17 May 2008,pp.125-128, 2008.[12] M. Tomita, An efficient augmented-c<strong>on</strong>text-free parsing algorithm,Computati<strong>on</strong>al Linguistics, Vol.13 No.1-2, pp.31-46, 1987.[13] J. Earley, An efficient c<strong>on</strong>text-free parsing algorithm, Communicati<strong>on</strong>sof the ACM, Vol.13 No.2, pp.94-102, 1970.[14] P. Limcharoen, C. Nattee, and T. 
Theeramunk<strong>on</strong>g, Two-Phase CandidateGenerati<strong>on</strong> for <str<strong>on</strong>g>Thai</str<strong>on</strong>g> <str<strong>on</strong>g>Word</str<strong>on</strong>g> <str<strong>on</strong>g>Segmentati<strong>on</strong></str<strong>on</strong>g> using <strong>GLR</strong> <strong>Parsing</strong>Technique, Proceedings of the Third Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong>Knowledge, Informati<strong>on</strong> and Creativity Support Systems (KICSS2008), pp.98–103, December 2008.[15] <strong>Nectec</strong>., Inter<strong>BEST</strong>2009 <str<strong>on</strong>g>Thai</str<strong>on</strong>g> <str<strong>on</strong>g>Word</str<strong>on</strong>g> <str<strong>on</strong>g>Segmentati<strong>on</strong></str<strong>on</strong>g>: an Internati<strong>on</strong>alEpisode., http://thailang.nectec.or.th/interbest/ACKNOWLEDGMENTThis work has partially been supported by the Nati<strong>on</strong>alScience and Technology Development Agency Ministry ofScience and Technology (NSTDA), project name “Developedof <str<strong>on</strong>g>Thai</str<strong>on</strong>g> News Corpus with Entity Tags Named” projectnumber NT-B-22-KE-38-52-01 and from <str<strong>on</strong>g>Thai</str<strong>on</strong>g>land ResearchFund under project name “Research <strong>on</strong> Automatic Relati<strong>on</strong>shipDiscovery in News Articles” project numberBRG50800013.REFERENCES[1] T. Theeramunk<strong>on</strong>g, V. Sornlertlamvanich, T. Tanhermh<strong>on</strong>g, and W.Chinnan, Character cluster <str<strong>on</strong>g>based</str<strong>on</strong>g> thai informati<strong>on</strong> retrieval, IRAL ’00:Proceedings of the fifth internati<strong>on</strong>al workshop <strong>on</strong> Informati<strong>on</strong> retrievalwith Asian languages, pp.75–80, 2000.[2] Y. 