
Thai Word Segmentation Based on GLR Parsing Technique and Word N-gram Model

Piya Limcharoen, Cholwich Nattee and Thanaruk Theeramunkong

Abstract—Word segmentation is one of the basic processes for languages without explicit word boundaries. Up to now, several approaches to Thai word segmentation have been proposed, such as longest matching and maximum matching. We propose a Thai word segmentation technique based on GLR parsing and a statistical language model. In this technique, an input Thai text is first segmented into a sequence of Thai Character Clusters (TCCs). Each TCC represents a group of inseparable Thai characters according to the Thai writing system. The concept of TCC helps avoid choosing segmentation points that violate the writing rules.
Then, the most suitable segmentation candidate is chosen based on the word N-gram model with interpolation. Both the candidate generation and selection processes are conducted through a two-phase GLR parsing technique. In the first phase, the production rules for TCCs are applied to parse an input sequence of characters into a sequence of TCCs, which becomes the input tokens for the second-phase parsing. The second phase groups TCCs to form words; here we construct grammar rules, derived from the words in the prepared training set, that represent each word as a sequence of TCCs. However, ambiguities in segmenting words affect the parsing result, so we apply a statistical language model to select the most appropriate segmentation. This statistical model is applied together with GLR parsing, and the beam search technique is applied to keep only the best k parsing paths. We evaluate the proposed technique using the test data provided by InterBEST 2009. The experimental results show that the technique obtains an f-measure of 87.04% when the beam size is set to 10.

I. INTRODUCTION

Word segmentation is a basic and crucial process in Natural Language Processing (NLP). Since the word is a fundamental unit of any language, most NLP systems first need to segment input text into a sequence of words before further processing. However, the writing system of the Thai language does not use any delimiter to explicitly indicate word boundaries, as shown in Figure 1. A sequence of words is written continuously, as if the English sentence "You ate an apple" were written as "Youateanapple".
This characteristic can also be found in other Asian languages such as Japanese, Chinese and Korean. It makes processing text in these languages more complicated, since the word boundaries are inherently ambiguous and sometimes depend on the semantics of the sentence. A dedicated technique is required to efficiently segment text and identify words.

Piya Limcharoen, Cholwich Nattee and Thanaruk Theeramunkong are with the School of Information and Computer Technology, Sirindhorn International Institute of Technology, Thammasat University, 131 M.5 Tiwanont Rd., Bangkadi, Muang, Pathumthani, Thailand 12000; e-mail: {piya, cholwich, thanaruk}@siit.tu.ac.th

Word segmentation can basically be split into two main processes: word candidate generation and word candidate selection. The first process aims at constructing all possible word candidates from a given input text, while the latter aims at choosing the most suitable candidate. In this paper, we propose a novel approach to Thai word segmentation based on the GLR parsing technique. We consider that word segmentation can be done by applying a syntactic analysis process using grammars derived from Thai writing rules and a word list. Furthermore, our approach embeds the candidate selection process into the candidate generation process.
This allows a reduction in the number of candidates generated.

In order to reduce the size of the input text and the search space, we also propose two-phase candidate generation. In the first phase, the input text, as a stream of characters, is split into groups based on Thai writing rules. Then, the groups of characters are combined into word candidates in the second phase. Here, the candidate selection process is performed together with the second phase of candidate generation. We apply the concept of the Thai Character Cluster (TCC) [1]. This concept is based on Thai writing rules. It aims to group together characters that depend on other characters; for example, each vowel and tone mark must correspond to a consonant. By definition, a TCC is an unambiguous and inseparable group of characters. This means that no segmentation point can occur between any two characters inside a TCC, as shown in Figure 2. Applying the concept of TCC is useful for candidate generation since it works as an intermediate level to form groups of characters and reduces the size of the data to be handled. Segmenting a sequence of characters into TCCs can be done using a parsing technique: based on the TCC concept, we select the longest subsequence of characters matched by a rule as a TCC.
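As an illustration, the longest-match grouping over TCC-style rules can be sketched with a handful of toy patterns. The character classes and the single regular expression below are simplified assumptions for illustration only; the actual grammar in this work consists of 123 hand-written rules.

```python
import re

# Toy stand-ins for TCC-style grouping rules. The real grammar in this work
# has 123 hand-written rules; these character classes and the single pattern
# below are simplified assumptions for illustration only.
CONSONANTS = "\u0e01-\u0e2e"                      # ก-ฮ
ABOVE_BELOW_VOWELS = "\u0e31\u0e34-\u0e3a\u0e47"  # ั, ิ-ฺ, ็
TONE_MARKS = "\u0e48-\u0e4b"                      # ่ ้ ๊ ๋
LEADING_VOWELS = "\u0e40-\u0e44"                  # เ-ไ

TCC_PATTERN = re.compile(
    f"[{LEADING_VOWELS}]?[{CONSONANTS}][{ABOVE_BELOW_VOWELS}]?[{TONE_MARKS}]?"
    "|."  # fall back to a single character
)

def to_tccs(text):
    """Greedily split text into inseparable clusters (longest match first)."""
    return TCC_PATTERN.findall(text)

print(to_tccs("ตอนที่"))  # ['ต', 'อ', 'น', 'ที่'] -- no split inside 'ที่'
```

Note that no segmentation point falls between a consonant and its dependent vowel or tone mark, which is the inseparability property illustrated in Figure 2.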
Applying the parsing technique for candidate generation in this approach can be viewed as a generalized version of the radix tree (trie) structure used in many Thai word segmentation approaches, since words in the dictionary can be handled together with the rules of the writing system.

Fig 1. Connected words in the Thai writing system: "You(คุณ) ate(กิน) an apple(แอปเปิ้ล)".


In addition, the proposed approach applies a GLR parsing technique [12, 13] as the main tool for segmenting words, since the parser performs similarly to a generalized version of the trie structure.

Fig 2. Concept of TCC, which groups a character depending on another character.

For candidate selection, we apply a statistical language model to select the most appropriate candidate. The model used in the proposed approach is the word N-gram model with interpolation. We propose a method that combines the statistical model with the candidate generation process. By incorporating the beam search technique into the parsing process, the approach generates only the candidates with high potential to be appropriate words. This technique provides benefits compared to most existing works, which construct all possible candidates before making a decision. Our method generates and selects candidates at the moment each input token is fetched by the parser.

II. RELATED WORKS

Most of the earlier works on Thai word segmentation are based on mapping input text to a predefined dictionary. The problem with this method lies in the way of mapping, that is, in handling segmentation ambiguity. The longest matching method maps written text from left to right, and gives first priority to the longest candidate found in the dictionary [2, 3]. The maximal matching method creates all possible word segmentation candidates, and selects the one that contains the fewest words [4]. Later, many works tried to incorporate various kinds of additional information beyond the dictionary to obtain better segmentation results. The statistical approach applies statistics based on a language model collected from corpora to select the most appropriate segmentation. A part-of-speech N-gram model was applied to filter out unnecessary segmentation candidates [5]. Aroonmanakun [6] used tri-gram statistics with syllable collocation to select the best candidate.

The feature-based approach extracts features from the generated word segmentation candidates, and applies machine learning techniques to learn a classifier that selects the most suitable candidate. Sornil and Chaiwanarom [7, 8] proposed a technique to segment Thai syllables and applied logistic regression to combine syllables into words. Theeramunkong and Usanavasin [9] proposed several kinds of features that can be extracted from each segmentation point. They also proposed using a decision tree to distinguish correct word segmentations from incorrect ones. Meknavin et al. [10] applied RIPPER and Winnow, which are supervised machine learning techniques, using features based on words and part-of-speech tags. Haruechaiyasak et al. [11] applied character-based features, based on character location and character type in the Thai writing system, to several learning techniques, e.g. naïve Bayes, decision trees, support vector machines, and conditional random fields.

III. TWO-PHASE PARSING APPROACH FOR CANDIDATE GENERATION

We first explain the GLR parsing technique applied in the proposed approach. A GLR parser parses an input sentence in a bottom-up manner using a parsing table generated from a context-free grammar (CFG). We can also say that the parser reads input tokens and uses the parsing table as a dictionary for analyzing the structure of the input. There are basically two actions conducted on the input tokens, 'shift' and 'reduce', according to the parsing table. The existing parsing technique can be used for word segmentation by taking a stream of characters as input and producing word segmentation candidates, so-called Character-to-Word segmentation, as shown in Figure 3.

In this paper, we introduce the two-phase parsing technique for word segmentation. The first phase (so-called Character-to-TCC segmentation) groups input characters into a sequence of TCCs. This helps reduce the size of the input and improve the performance of the overall process. After that, the stream of TCCs is used as the input for the second phase (so-called TCC-to-Word segmentation). The second phase then outputs all possible segmentation candidates, as shown in Figures 4 and 5.
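The exhaustive candidate generation illustrated in Figures 3 and 5 can be sketched as a simple recursive enumeration over a word list. This is not the GLR parser itself; the `max_len` bound and the single-character fallback for out-of-vocabulary text are assumptions of this sketch.

```python
# Minimal recursive sketch of exhaustive candidate generation: every way to
# cover the input with words from a word list, with single characters as a
# fallback for out-of-vocabulary text (an assumption of this sketch).
def all_candidates(text, dictionary, max_len=20):
    if not text:
        return [[]]
    candidates = []
    for j in range(1, min(len(text), max_len) + 1):
        prefix = text[:j]
        if prefix in dictionary or j == 1:
            for rest in all_candidates(text[j:], dictionary, max_len):
                candidates.append([prefix] + rest)
    return candidates

for cand in all_candidates("ตอนที่", {"ตอ", "ตอน", "ที่"}):
    print("|".join(cand))
```

The number of candidates grows quickly with the input length, which is exactly what the candidate selection of Section IV prunes.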
The basic idea of two-phase parsing is to break down one big word segmentation problem into two simpler sub-problems. The first phase reduces the scale of the input by reducing the character stream into a shorter stream of TCCs. Then, the second phase generates the word segmentation candidates as groups of TCCs, which are fewer than groups of characters.

A. Character-to-TCC Segmentation

In the first phase, we generate the TCC parsing table from the TCC context-free grammar (CFG) shown in Figure 6. The TCC grammar, consisting of 123 rules, was created manually by following the concept of grouping together characters that depend on others. This yields a parsing table that guides the parser to parse a stream of characters and form a stream of TCCs. The result of this process reduces the size of the input for the next phase, as shown in Figure 7.

B. TCC-to-Word Segmentation

In the second phase, we generate the TCC-to-Word parsing table using another set of CFG rules, as shown in Figure 8. The grammar rules define a word as a sequence of TCCs.


Fig 3. Candidate results (11 candidates) from one-phase parsing. ตอน(part), ที่(position)

Fig 6. Example of TCC grammar rules; a symbol starting with "_" is a terminal.

Fig 7. Reduction in input size resulting from the first phase (Character-to-TCC).

Fig 4. Process of two-phase parsing: first phase (Character-to-TCC) and second phase (TCC-to-Word). ตอน(part), ที่(position)

Fig 5. Candidate results (3 candidates) from two-phase parsing. ตอน(part), ที่(position)

The parsing table based on this grammar is then generated. This process is time-consuming, depending on the number of words in the dictionary. However, this extensive time is acceptable since the parsing table creation process is done only once. The result of this step is a parsing table that guides the parser in grouping a stream of TCCs into words. Since there are ambiguities in segmenting words, the CFG is also ambiguous. The system generates all possible ways to connect groups of TCCs into words in a sentence. We then need to apply candidate selection to choose the most appropriate candidate.

Applying two-phase parsing improves performance because of the data size reduction. Moreover, combining this candidate generation with the candidate selection discussed in the next section shows further advantages of this method over existing methods.
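The construction of word grammar rules from a word list can be sketched as below. The `to_tccs` stand-in trivially treats each character as one cluster, whereas the paper derives clusters from its TCC grammar; prefixing terminals with "_" follows the convention of Figures 6 and 8, and the rule strings are illustrative, not the actual parsing-table input format.

```python
# Sketch of building TCC-to-Word grammar rules from a word list. The to_tccs
# stand-in trivially treats each character as one cluster, whereas the paper
# derives clusters from its 123-rule TCC grammar.
def to_tccs(word):
    return list(word)  # placeholder TCC segmentation

def word_grammar(dictionary):
    """One production per dictionary word: WORD -> tcc_1 ... tcc_n."""
    rules = ["WORD -> " + " ".join(f"_{t}" for t in to_tccs(w))
             for w in sorted(dictionary)]
    # A sentence is any sequence of words; like the paper's CFG, this
    # grammar is ambiguous whenever words overlap in their TCCs.
    rules.append("S -> WORD | S WORD")
    return rules

for rule in word_grammar({"ตา", "ตาก", "กลม"}):
    print(rule)
```

Because several words can share TCC prefixes, a GLR parser built from such a grammar explores all segmentations in parallel, which is why the candidate selection of the next section is needed.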
By using the parsing technique, we open a new perspective on segmentation, which comes with an advantage over dictionary-based and machine learning-based methods because of the flexibility of parsing.

Based on our previous work [14], we have conducted some experiments to evaluate the two-phase parsing technique explained in this section. We found from the experiments


that the two-phase parsing technique gains a 35.48% reduction in the number of states compared to the one-phase parsing technique. This experiment was conducted using a Thai dictionary containing 42,858 words.

Fig 8. Example of word grammar rules; a symbol starting with "_" is a terminal.

IV. CANDIDATE SELECTION BASED ON STATISTICAL MODEL AND BEAM SEARCH

In this section, we explain the application of a statistical language model for candidate selection. From the example of candidates shown in Figure 9, the number of obtained candidates varies with the length of the TCC stream and the number of ambiguous words. Ambiguous words are the main factor in the number of candidates, since the result may contain two or more candidates at each ambiguous point. Once these are combined, the result is a large number of candidates, each of which could be a correct segmentation.

ปลา|มี|ตาก|ลม|แต่|เสื้อ|ตาก|ลม|
ปลา|มี|ตาก|ลม|แต่|เสื้อ|ตาก|ล|ม|
…
ปลา|มี|ตา|กลม|แต่|เสื้อ|ตาก|ลม|
ปลา|มี|ตา|กลม|แต่|เสื้อ|ตาก|ล|ม|
…
ปลา|มี|ตา|ก|ลม|แต่|เสื้อ|ตาก|ลม|
ปลา|มี|ตา|ก|ลม|แต่|เสื้อ|ตาก|ล|ม|
…
ป|ลา|มี|ตา|กลม|แต่|เสื้อ|ตาก|ลม|
ป|ลา|มี|ตา|กลม|แต่|เสื้อ|ตาก|ล|ม|
…
ป|ลา|มี|ตา|ก|ลม|แต่|เสื้อ|ตาก|ลม|
ป|ลา|มี|ตา|ก|ลม|แต่|เสื้อ|ตาก|ล|ม|
…
ป|ลา|มี|ตา|ก|ล|ม|แต่|เสื้อ|ตา|ก|ลม|
ป|ลา|มี|ตา|ก|ล|ม|แต่|เสื้อ|ตา|ก|ล|ม|

Fig 9. The 72 candidates produced by candidate generation without the candidate selection process.

A. Statistical Model Using the N-gram Model

The statistical model is applied to select among the ambiguous word segmentation points, as shown in Figure 10. This model can deal with unknown words by applying the probability that an unknown word occurs in a sentence. Furthermore, machine learning techniques can be applied to improve the predictive accuracy of the system by making the system learn from its mistakes.

In this paper, we apply the word N-gram model with an interpolation technique combining uni-gram, bi-gram and tri-gram. The N-gram probability can be calculated as:

  P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n}) / Σ_w C(w_{n-N+1}^{n-1} w)

where w_n is the n-th word in the word sequence, w_i^j denotes the sub-sequence from the i-th to the j-th word, and C(·) is a count collected from the corpus. N is the N-gram order, called uni-gram when N=1, bi-gram when N=2, and tri-gram when N=3. Figure 11 shows examples of N-gram probabilities in logarithmic form.

-8.57086e-01 | |
-1.65997e+00 |ที่|
-4.88202e+00 |จักรกล|
-6.75706e+00 |อารัก|
+7.59836e-08 |คู่ควง|ของ|
-2.31597e+00 |บริโภค|เข้าใจ|
-4.10350e+00 |แล้ว|กลุ้ม|
-5.90000e+00 | |มือยิง|
+7.59836e-08 |ครึ่ง|วง|กลม|
+7.59836e-08 |เสนอ|เงิน|ชดเชย|
-1.11394e+00 |เขา|แนะนํา|หล่อน|
-4.89695e+00 |&| |พิชิต|

Fig 11. Probability information extracted from the corpus by the uni-gram, bi-gram and tri-gram models.

In the proposed approach, we apply the interpolation technique that combines tri-gram, bi-gram and uni-gram:

  P(w_n | w_{n-2} w_{n-1}) = λ_3 P(w_n | w_{n-2} w_{n-1}) + λ_2 P(w_n | w_{n-1}) + λ_1 P(w_n)

such that the λ's are interpolation parameters whose sum is 1. In this paper, we set the λ's to 0.1 for uni-gram, 0.3 for bi-gram, and 0.6 for tri-gram. To avoid zero probability for an unknown sequence of words, we estimate the probability using the lowest probability found in the corpus.

This N-gram model is applied to all the segmentation points in a candidate. The probability of the whole sentence can be obtained from the product of all the N-gram probabilities in the sentence. However, this combination tends to favor candidates with fewer segmentation points. We relax this problem by applying the geometric average to the product of the N-gram probabilities. The geometric average of a data set [a_1, a_2, ..., a_n] is given by

  (a_1 · a_2 ··· a_n)^(1/n)

Figure 10 shows an example of the result after applying the statistical model: the best candidate, "ปลา(Fish) | มี(Have) | ตา(Eye) | กลม(Round) | แต่(But) | เสื้อ(Shirt) | ตาก(Dry) | ลม(Wind)", is the one with the highest probability.

P = 6.15e-05 ปลา|มี|ตา|กลม|แต่|เสื้อ|ตาก|ลม|
P = 5.97e-05 ปลา|มี|ตา|กลม|แต่|เสื้อ|ตา|กลม|
P = 4.80e-05 ปลา|มี|ตาก|ลม|แต่|เสื้อ|ตาก|ลม|
P = 4.66e-05 ปลา|มี|ตาก|ลม|แต่|เสื้อ|ตา|กลม|
P = 3.61e-05 ปลา|มี|ตา|กลม|แต่|เสื้อ|ตา|ก|ลม|
…
ปลา(Fish), มี(Have), ตาก(Dry), ตา(Eye), กลม(Round), ลม(Wind), แต่(but), เสื้อ(Shirt)

Fig 10. Candidates with different meanings, ordered by probability.

B. Beam Search Technique for Candidate Selection

As discussed earlier, we propose a technique that embeds candidate selection into each parsing level of candidate generation, based on beam search. This stops low-potential child candidates from becoming parent candidates at the next level.

Once a word candidate is generated from a sequence of TCCs while the parsing process is going on, the combined probability from the beginning of the sentence to the current word is calculated and attached to the candidate.
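The scoring and pruning described in this section can be sketched as follows. Only the interpolation weights (0.1, 0.3, 0.6) and the geometric average come from the text; the probability values, the `FLOOR` constant standing in for the lowest corpus probability, and the function names are illustrative assumptions.

```python
import math

# Sketch of the candidate scoring described above. Interpolation weights and
# the geometric average follow the text; FLOOR stands in for the lowest
# probability found in the corpus, and the probabilities are toy values.
LAMBDA_UNI, LAMBDA_BI, LAMBDA_TRI = 0.1, 0.3, 0.6
FLOOR = 1e-7

def interp(p_uni, p_bi, p_tri):
    """Interpolated P(w_n | w_{n-2} w_{n-1})."""
    return LAMBDA_TRI * p_tri + LAMBDA_BI * p_bi + LAMBDA_UNI * p_uni

def candidate_score(word_probs):
    """Geometric average of the per-word probabilities, so candidates with
    fewer segmentation points are not unduly favored."""
    log_sum = sum(math.log(max(p, FLOOR)) for p in word_probs)
    return math.exp(log_sum / len(word_probs))

def prune(candidates, k):
    """Beam step: keep only the k best-scoring (partial) candidates."""
    return sorted(candidates, key=candidate_score, reverse=True)[:k]

# Two partial segmentations given as per-word probability lists; with beam
# size 1 only the higher-scoring one survives to the next parsing level.
beam = prune([[0.02, 0.001, 0.04], [0.02, 0.03]], k=1)
print(beam)  # [[0.02, 0.03]]
```

Working in log space, as above, also avoids the numeric underflow that multiplying many small probabilities would otherwise cause on long sentences.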
This parsing process is conducted in a breadth-first search manner. We select only the k best candidates to be considered at the next level. This k represents the beam size in the beam search technique.

Figures 9 and 12 show an example of applying the beam search technique for candidate selection. The bold lines in Figure 12 show the two best candidates chosen at each level. This can be compared with the result when beam search is not applied.

The benefit of using this technique is that the number of candidates stays constant along all the parsing levels. This avoids the problem of the number of candidates increasing with the number of TCCs and ambiguous words.

Fig 12. Candidate results from the parser; bold circles refer to the candidates kept when using beam size 2.

Comparing the proposed method to the existing statistical methods, the proposed method combines the statistical model into the parsing process, so that all the work is done in a single process.

V. EXPERIMENT RESULTS AND DISCUSSION

To evaluate the proposed method, we conduct experiments on real-world data. We evaluate Thai word segmentation accuracy using the proposed parsing technique and statistical model with different beam sizes. Our experiments use both the training set and the test set from InterBEST 2009: Thai Word Segmentation: an International Episode [15].

The training datasets provided by InterBEST are used to create our training data and the TCC-to-Word parsing table. We then use the InterBEST test sets to evaluate the proposed technique. The results from the InterBEST evaluation system show precision, recall and f-measure for each of the 12 genres, as shown in Figure 13. In this experiment, the beam size is varied from 1 to 10 to show its effect on the performance of the proposed approach. For the λ's in the interpolation, we give the greatest weight to the tri-gram with λ=0.6, then the bi-gram with λ=0.3, and the uni-gram with λ=0.1.

Moreover, our technique is originally able to deal only with documents containing Thai characters.
To evaluate the performance of our system, English letters and symbols are therefore skipped, so that the input text contains only Thai characters before word segmentation is conducted. The skipped English characters are later inserted back into the segmented text to produce the final output.

Based on the experimental results, the highest average f-measure of 87.04% was obtained when the beam size was set to the highest value of 10. However, we found only a small difference when the beam size was set to 3 or 5. The beam search technique thus shows its benefit of reducing the number of evaluated candidates without losing accuracy. It also shows that applying the probability computation to all candidates is wasteful, because most of the good candidates have drastically higher probability than the other candidates. This may lead to a technique for dynamically assigning the beam size, where the beam size is set based on the gap between the group of candidates with high probability and the group with low probability.

We found that our technique worked well with the genres "Article", "Buddhism", "Encyclopedia", "Novel", "NSC", and "Talk", for which we obtain f-measure scores of more than 90%. However, it works poorly with the genres "Old document" and "Royal news", for which we obtain only 80.81% and 71.26%, respectively. After analyzing the experimental results and the datasets in detail, we found that the genres "Old document" and "Royal news" contain a lot of unknown words, such as Royal family member names, abbreviations, and named entities. Since unknown words cannot be properly handled by the proposed technique, all words are basically treated as regular words.
Therefore, the systemwould yield the low segmentati<strong>on</strong> accuracy for the datasetc<strong>on</strong>taining a lot of unknown words.VI. CONCLUSIONIn this paper, we propose <str<strong>on</strong>g>Thai</str<strong>on</strong>g> word segmentati<strong>on</strong> framework.This framework combines candidate generati<strong>on</strong> andcandidate selecti<strong>on</strong> in the same process. In candidate generati<strong>on</strong>,we propose a technique to reduce the problem size byseparating the problem into two phases. The first-phase intendsto reduce the input size using the c<strong>on</strong>cept of TCC. TheNumber of Beam 1 3 5 10Genre P R F P R F P R F P R FArticle 88.34 93.65 90.92 88.87 94.40 91.55 88.87 94.41 91.56 88.92 94.44 91.60Buddhism 91.06 95.80 93.37 90.92 96.23 93.50 90.86 96.21 93.46 90.84 96.20 93.44Encyclopedia 89.40 93.70 91.50 90.09 94.70 92.34 90.11 94.73 92.36 90.11 94.72 92.36Law 82.26 90.39 86.14 82.39 91.08 86.52 82.35 91.06 86.49 82.35 91.05 86.48News 77.08 91.84 83.81 77.36 92.58 84.29 77.38 92.59 84.30 77.43 92.61 84.34Novel 86.88 93.09 89.88 86.96 93.98 90.33 86.98 93.98 90.34 87.03 93.99 90.38NSC 89.00 92.41 90.67 89.81 93.51 91.62 89.83 93.52 91.64 89.85 93.53 91.65Old document 74.72 88.41 80.99 74.29 88.41 80.74 74.33 88.43 80.77 74.39 88.44 80.81Royal news 60.27 86.94 71.19 60.17 87.45 71.29 60.10 87.43 71.23 60.12 87.46 71.26Talk 91.26 94.79 92.99 91.69 95.97 93.78 91.66 95.98 93.77 91.66 95.99 93.78TV news 78.15 92.18 84.59 78.47 92.91 85.09 78.45 92.90 85.06 78.47 92.91 85.08Wiki 76.99 89.73 82.88 77.28 90.27 83.27 77.32 90.30 83.31 77.35 90.31 83.33Average 82.12 91.91 86.58 82.36 92.62 87.03 82.35 92.63 87.03 82.38 92.64 87.04*P – Precisi<strong>on</strong>, R – Recall, F – F-measureFig 13. Evaluati<strong>on</strong> result for statistical model approach with difference beam value.


sec<strong>on</strong>d phase then generates all possible word segmentati<strong>on</strong>candidates. The evaluati<strong>on</strong> of this two-phase techniqueshowed that the number of states in the parsing tables becomesmaller comparing to the <strong>on</strong>e-phase parsing.We also propose an approach that embeds the candidateselecti<strong>on</strong> into the sec<strong>on</strong>d-phase parsing. It aims to generate<strong>on</strong>ly potential candidates by applying statistical model.Moreover, the beam search techniques were applied directlyinto the parser to filter out low-potential candidates at theearlier state. This helps reduce the number of candidates andimprove the system performance in term of speed. The experimentalresults c<strong>on</strong>ducted <strong>on</strong> the 12-genre dataset showedthat the proposed technique obtain <strong>on</strong> average 87.04% f-measure when the beam size is set to 10.For the future work, we plan to incorporate a technique for<str<strong>on</strong>g>Thai</str<strong>on</strong>g> unknown word recogniti<strong>on</strong> and name entity extracti<strong>on</strong>in order to improve the system performance.Technology, 2008. ECTI-CON 2008, Vol.1, Issue 14-17 May 2008,pp.125-128, 2008.[12] M. Tomita, An efficient augmented-c<strong>on</strong>text-free parsing algorithm,Computati<strong>on</strong>al Linguistics, Vol.13 No.1-2, pp.31-46, 1987.[13] J. Earley, An efficient c<strong>on</strong>text-free parsing algorithm, Communicati<strong>on</strong>sof the ACM, Vol.13 No.2, pp.94-102, 1970.[14] P. Limcharoen, C. Nattee, and T. 
Theeramunk<strong>on</strong>g, Two-Phase CandidateGenerati<strong>on</strong> for <str<strong>on</strong>g>Thai</str<strong>on</strong>g> <str<strong>on</strong>g>Word</str<strong>on</strong>g> <str<strong>on</strong>g>Segmentati<strong>on</strong></str<strong>on</strong>g> using <strong>GLR</strong> <strong>Parsing</strong>Technique, Proceedings of the Third Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong>Knowledge, Informati<strong>on</strong> and Creativity Support Systems (KICSS2008), pp.98–103, December 2008.[15] <strong>Nectec</strong>., Inter<strong>BEST</strong>2009 <str<strong>on</strong>g>Thai</str<strong>on</strong>g> <str<strong>on</strong>g>Word</str<strong>on</strong>g> <str<strong>on</strong>g>Segmentati<strong>on</strong></str<strong>on</strong>g>: an Internati<strong>on</strong>alEpisode., http://thailang.nectec.or.th/interbest/ACKNOWLEDGMENTThis work has partially been supported by the Nati<strong>on</strong>alScience and Technology Development Agency Ministry ofScience and Technology (NSTDA), project name “Developedof <str<strong>on</strong>g>Thai</str<strong>on</strong>g> News Corpus with Entity Tags Named” projectnumber NT-B-22-KE-38-52-01 and from <str<strong>on</strong>g>Thai</str<strong>on</strong>g>land ResearchFund under project name “Research <strong>on</strong> Automatic Relati<strong>on</strong>shipDiscovery in News Articles” project numberBRG50800013.REFERENCES[1] T. Theeramunk<strong>on</strong>g, V. Sornlertlamvanich, T. Tanhermh<strong>on</strong>g, and W.Chinnan, Character cluster <str<strong>on</strong>g>based</str<strong>on</strong>g> thai informati<strong>on</strong> retrieval, IRAL ’00:Proceedings of the fifth internati<strong>on</strong>al workshop <strong>on</strong> Informati<strong>on</strong> retrievalwith Asian languages, pp.75–80, 2000.[2] Y. 