TLex: Thai Lexeme Analyser Based on the ... - BEST - Nectec

TLex: Thai Lexeme Analyser Based on the Conditional Random Fields

Choochart Haruechaiyasak and Sarawoot Kongyoung

Abstract—In this paper, we present our proposed solution to the InterBEST 2009 Thai Word Segmentation task. We applied Conditional Random Fields (CRFs) to train word segmentation models from a given corpus. Using the CRFs, the word segmentation problem can be formulated as a sequential labeling task in which each character in the text string is predicted as one of two classes: word-beginning and intra-word characters. One of the key factors which affect the performance of the word segmentation models is the design of appropriate feature sets. We proposed and evaluated three different feature sets: char (by using all possible characters as features), char-type (by categorizing all possible characters into 10 different types) and combined (by using both characters and character types as features). The evaluation results showed that the combined feature set yielded the best performance, with an averaged F1 value over all genres of 93.90%.
To further improve the results, we performed a post-processing step by merging named entities (NEs) in the segmented texts. We used a list of NEs compiled from the training corpus. The NE merging step helped increase the performance of the combined feature model to 94.27%.

Index Terms—Word segmentation, tokenization, morphological analysis, conditional random fields.

I. INTRODUCTION

Word segmentation is considered a basic yet very important NLP task in many unsegmented languages. The main goal of the word segmentation task is to assign correct word boundaries to given text strings. Previous approaches applied to Thai word segmentation can be broadly classified as dictionary-based and machine learning. The dictionary-based approach relies on a set of terms from a dictionary for parsing and segmenting input texts into word tokens. During the parsing process, series of characters are looked up in the dictionary for matching terms. The performance of the dictionary-based approach depends on the quality and size of the word set in the dictionary used during the segmentation process.

Two main problems, unknown words and ambiguity, typically occur while parsing and segmenting the input text strings. Unknown words refer to terms which are not found in the dictionary while parsing the text strings.
The ambiguity problem, in contrast, occurs when there is more than one way to segment the text strings. One possible solution to the unknown word problem in the dictionary-based approach is to use language-dependent lexical rules to merge unknown segments. The main drawback of this solution is the requirement to keep the lexical rules updated. Today, many new words are being generated by transliterating foreign words. As a result, the process of updating the lexical rules becomes inefficient. Another alternative solution is to collect and include as many unknown words as possible in the dictionary.

(Author affiliation: Choochart Haruechaiyasak and Sarawoot Kongyoung are with the Human Language Technology Laboratory (HLT), National Electronics and Computer Technology Center (NECTEC), Thailand Science Park, Klong Luang, Pathumthani 12120, Thailand; email: {choochart.haruechaiyasak, sarawoot.kongyoung}@nectec.or.th)
However, this solution is not very practical since it requires manual addition and updating of newly generated terms.

The ambiguity problem of the dictionary-based approach can be solved via several selection techniques, such as selecting the longest possible term, i.e., longest matching [11]. Another alternative is selecting the segmentation which yields the minimum number of word tokens, i.e., maximal matching [12]. Previous works reported that the maximal matching algorithm gives a marginal improvement over the longest matching algorithm, with the tradeoff of a longer running time. Both the longest matching and maximal matching algorithms can be considered as using heuristics to solve the ambiguity problem. The main disadvantage is that these heuristics are very simple and static, i.e., unable to adapt to changing domains.

Due to the drawbacks of the dictionary-based approach, machine learning algorithms have been adopted for the word segmentation task. The machine learning approach relies on a model trained from a corpus by using machine learning algorithms. Using a tagged corpus in which word boundaries are explicitly marked with a special character, a machine learning algorithm can be applied to train a model based on the features surrounding these boundaries.
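For illustration, the longest matching heuristic described above can be sketched as follows. This is a minimal sketch with a toy English dictionary standing in for a Thai lexicon; the function name and dictionary are assumptions for the example, not the implementations cited in [11] or [12].

```python
# Minimal sketch of greedy longest matching against a dictionary.
# The toy English dictionary and function name are illustrative
# assumptions, not the cited implementations.
def longest_matching(text, dictionary, max_len=20):
    """Segment `text` by taking the longest dictionary term at each
    position; an unmatched character becomes a one-character token
    (the unknown-word case discussed above)."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        for j in range(min(len(text), i + max_len), i, -1):  # longest first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]      # unknown character: emit as-is
        tokens.append(match)
        i += len(match)
    return tokens

print(longest_matching("inthehouse", {"in", "the", "there", "house"}))
# → ['in', 'the', 'house']
```

Maximal matching differs in that it searches over all candidate segmentations for the one with the fewest tokens, which explains the longer running time mentioned above.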
Under the machine learning approach, the word segmentation problem can be formulated as a binary classification task in which each character in the text string is predicted as belonging to one of two classes: word-beginning and intra-word characters. The main advantage of the machine learning approach is its independence from a dictionary. The unknown word and ambiguity problems are handled by the classification model, which is learned from the various character patterns inside a tagged corpus. Therefore, the performance of this approach depends on the domain and size of the corpus. For example, if the model is trained on a corpus from a specific domain, it might not perform well on other domains. Moreover, the corpus must be large enough to cover all the different character patterns so that the model can be trained effectively.

Recently, the Human Language Technology Laboratory (HLT) under the National Electronics and Computer Technology Center (NECTEC) has designed and released a Thai word segmentation corpus to the Thai NLP research community. The most updated corpus, used for the InterBEST 2009 Thai Word Segmentation workshop¹, contains 5 million words (plus an-

¹ http://thailang.nectec.or.th/interbest
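The binary character-labeling formulation described above can be sketched as follows. The '|' boundary marker and the function name are assumptions for illustration (an English string stands in for a Thai text string).

```python
# Sketch of the two-class character labeling described above:
# B = word-beginning character, I = intra-word character. The '|'
# boundary marker is an assumption for illustration, not necessarily
# the corpus annotation format.
def to_char_labels(marked_text):
    """Turn boundary-marked text into (character, label) pairs."""
    pairs = []
    for word in marked_text.split("|"):
        for k, ch in enumerate(word):
            pairs.append((ch, "B" if k == 0 else "I"))
    return pairs

print(to_char_labels("the|house"))
# → [('t', 'B'), ('h', 'I'), ('e', 'I'), ('h', 'B'), ('o', 'I'), ('u', 'I'), ('s', 'I'), ('e', 'I')]
```

A sequence model then predicts these B/I labels for unseen text, and word boundaries are recovered by cutting before every B.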


… char-type (by categorizing all characters into 10 different types) and combined (by using both characters and character types as features). For the char-type feature set, we categorize all characters into 10 different types, as shown in Figure 2. The set of character types is designed based on linguistic knowledge, i.e., the lexical rules of the Thai language. These character types could provide effective information for identifying word boundaries.

Fig. 1. The overall process of our proposed solutions.

Fig. 2. Character-type (char-type) feature set:

  Tag  Type
  c    Consonant characters which can be assigned as the word-ending character
  n    Consonant characters which cannot be assigned as the word-ending character
  v    Vowel characters which are not allowed to begin a word
  w    Vowel characters which are allowed to begin a word
  t    Tonal characters
  s    Symbol characters
  d    Digit characters (0-9)
  q    Quote characters (', ", “, ”)
  p    Space character within a word (_)
  o    Other characters (e.g., a-z, A-Z)

Fig. 3. Example of a text string formatted according to the combined feature set.

Next, the training corpus is transformed into three formatted training sets (one for each feature set). Figure 3 shows an example of a text string formatted based on the combined feature set. The first column contains the character (char) features and the second column contains the character-type (char-type) features. The third column is the annotation for the predicted class labels: word-beginning character (B) and intra-word character (I).
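As an illustration of this three-column format, the sketch below maps characters to a char-type tag and emits one line per character. The membership sets are small samples chosen for the example; the c/n consonant split is collapsed into a single 'c' and the symbol type 's' is omitted. They are assumptions for illustration, not the paper's full 10-type definition.

```python
# Sketch of the char-type mapping and the three-column training format
# (char, char-type, label). The character sets below are small
# illustrative samples, not the paper's full definition.
LEAD_VOWELS = set("เแโใไ")     # vowels allowed to begin a word   -> 'w'
POST_VOWELS = set("ะาิีึืุู")  # vowels not allowed to begin a word -> 'v'
TONE_MARKS  = set("่้๊๋")       # tonal characters                  -> 't'

def char_type(ch):
    """Map a character to a char-type tag (simplified subset)."""
    if ch in LEAD_VOWELS:
        return "w"
    if ch in POST_VOWELS:
        return "v"
    if ch in TONE_MARKS:
        return "t"
    if ch in "0123456789":
        return "d"               # digit characters
    if ch in "'\"“”":
        return "q"               # quote characters
    if ch == " ":
        return "p"               # space within a word
    if "\u0e01" <= ch <= "\u0e2e":
        return "c"               # Thai consonant block (paper splits into c/n)
    return "o"                   # other characters, e.g. a-z, A-Z

def to_columns(words):
    """Emit one 'char<TAB>char-type<TAB>label' line per character."""
    lines = []
    for word in words:
        for k, ch in enumerate(word):
            lines.append(f"{ch}\t{char_type(ch)}\t{'B' if k == 0 else 'I'}")
    return "\n".join(lines)

print(to_columns(["ที่", "บ้าน"]))  # two Thai words, already segmented
```

Each training sentence becomes such a block of lines, which is the column layout consumed by CRF toolkits.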
Using the formatted training data sets, we trained the word segmentation models based on the CRFs. The CRFs algorithm learns to construct the most probable label sequences by observing the features surrounding each annotated class. The word segmentation models can then be evaluated using the test data set.

Based on our initial observation, the training corpus contains a large number of named entities (NEs), especially in the news articles. Figure 4 shows some examples of NEs of three different types: person, organization and place. Our initial analysis of the segmented results showed that the models trained using the CRFs tend to mistakenly separate named entities into a few small segments. To improve the segmented results, we perform NE merging as a post-processing step. This process involves the extraction of all named entities found in the training corpus. Table I lists the number of NEs collected from each genre of the training corpus. Since the News genre contains the largest number of NEs, we could expect a higher improvement when NE merging is applied to its segmented results. Using the NE list, the NE merging step parses the segmented texts returned from the word segmentation model.
During the parsing process, if a series of segments matches an entry in the NE list, they are merged into one single word unit. Some examples of the NE-merged results are shown in Figure 4.

V. EXPERIMENTS AND RESULTS

In this section, we evaluate the proposed solutions using the training and test corpora provided by the InterBEST 2009 workshop. The experiments (for the test corpus evaluation) were performed on a PC with an Intel Core 2 Duo 1.80 GHz and 2 GB of RAM running Windows Vista. Three performance measures of precision, recall and F1 are used for
evaluating the different word segmentation approaches. Precision is defined as the number of tokens correctly segmented by the algorithm divided by the total number of tokens returned by the algorithm. Recall is defined as the number of tokens correctly segmented by the algorithm divided by the total number of tokens in the test set. Eq. (4) shows the calculation of the F1 measure, which is the harmonic mean of precision and recall.

  F1 = (2 × Precision × Recall) / (Precision + Recall)    (4)

Fig. 4. Example of named entities collected from the training corpus.

TABLE I
NUMBER OF NAMED ENTITIES EXTRACTED FROM THE TRAINING CORPUS

  Genre          Number of NEs
  Article          7,193
  Buddhism           349
  Encyclopedia     5,462
  Law                921
  News            23,038
  Novel            4,145
  Talk             2,284
  Wiki            12,935

We first performed an evaluation of the three feature sets (i.e., char, char-type and combined) as explained in the previous section. Given the training corpus, we formatted the text strings according to the different feature sets and trained three word segmentation models based on the CRFs. The results for each of the 12 test genres are shown in Table II. The averaged performance over all genres is summarized in Table III.

TABLE II
EVALUATION RESULTS OF THREE DIFFERENT FEATURE SETS

  Genre         Feature set    P      R      F1
  Article       char         92.19  92.10  92.15
                char-type    64.43  67.08  65.73
                combined     95.45  96.80  96.12
  Buddhism      char         93.95  93.21  93.58
                char-type    66.95  65.46  66.20
                combined     95.39  95.39  95.39
  Encyclopedia  char         92.59  91.38  91.98
                char-type    64.87  66.00  65.43
                combined     95.17  95.21  95.19
  Law           char         94.17  94.52  94.34
                char-type    64.61  68.40  66.45
                combined     95.14  96.28  95.70
  News          char         90.21  90.07  90.14
                char-type    59.53  64.45  61.89
                combined     91.91  94.71  93.29
  Novel         char         93.66  93.97  93.82
                char-type    65.54  65.94  65.74
                combined     94.57  95.57  95.07
  Nsc           char         91.53  90.90  91.21
                char-type    63.98  63.39  63.68
                combined     95.71  95.57  95.64
  Old-doc       char         90.98  90.38  90.68
                char-type    55.57  59.50  57.47
                combined     91.70  93.12  92.40
  Royalnews     char         80.23  86.96  83.46
                char-type    44.35  62.53  51.89
                combined     80.83  91.42  85.80
  Talk          char         94.34  93.99  94.16
                char-type    66.75  67.21  66.98
                combined     96.74  97.00  96.87
  Tvnews        char         90.32  92.50  91.40
                char-type    56.06  60.97  58.41
                combined     91.07  95.06  93.02
  Wiki          char         91.42  90.65  91.03
                char-type    61.86  65.22  63.50
                combined     90.61  93.53  92.05

TABLE III
EVALUATION RESULTS OF THE DIFFERENT FEATURE SETS OVER ALL GENRES

  Feature set    P      R      F1
  char         91.30  91.72  91.51
  char-type    61.21  64.68  62.90
  combined     92.86  94.97  93.90

The results from the tables show that using the char-type feature set yielded the worst averaged performance, 62.90% based on the F1 value. The model trained using the char feature set gave a significant improvement, to 91.51%. The reason is that the char feature set provides more precise and richer information based on the actual characters; therefore, the CRFs could effectively observe and recognize the sequences. However, when combining both the char-type and char feature sets, the performance increases to 93.90%. The improvement comes from the fact that the char-type features could help segment character patterns which had not previously been observed using the char feature set. Comparing among all 12 genres, the Talk genre yielded the best performance.

Next, we perform and evaluate the NE merging process. As mentioned in the previous section, applying the NE merging step could help combine the segments which are named entities into a single word unit. Using the model trained on the combined feature set, the comparison results between the merge and no-merge approaches under each of the 12 genres are shown in Table IV. The averaged performance over all genres is summarized in Table V. The results show that the NE merging step helps increase the performance of the combined feature model to 94.27% based on the F1 value. The maximum
TABLE IV
EVALUATION RESULTS OF THE MERGING APPROACH ON THE combined FEATURE SET

  Genre         Approach     P      R      F1
  Article       no-merge   95.45  96.80  96.12
                merge      95.71  96.54  96.13
  Buddhism      no-merge   95.39  95.39  95.39
                merge      95.19  94.96  95.07
  Encyclopedia  no-merge   95.17  95.21  95.19
                merge      95.15  94.83  94.99
  Law           no-merge   95.14  96.28  95.70
                merge      95.36  96.00  95.68
  News          no-merge   91.91  94.71  93.29
                merge      93.54  94.99  94.26
  Novel         no-merge   94.57  95.57  95.07
                merge      95.43  95.72  95.57
  Nsc           no-merge   95.71  95.57  95.64
                merge      95.83  95.53  95.68
  Old-doc       no-merge   91.70  93.12  92.40
                merge      92.09  92.81  92.45
  Royalnews     no-merge   80.83  91.42  85.80
                merge      84.41  92.28  88.17
  Talk          no-merge   96.74  97.00  96.87
                merge      96.99  96.78  96.89
  Tvnews       no-merge   91.07  95.06  93.02
                merge      92.29  94.92  93.59
  Wiki          no-merge   90.61  93.53  92.05
                merge      91.62  93.58  92.59

TABLE V
EVALUATION RESULTS OF NE MERGING OVER ALL GENRES

  Approach     P      R      F1
  no-merge   92.86  94.97  93.90
  merge      93.63  94.91  94.27

improvement comes from the Royalnews genre, in which the NE merging yielded a performance of 88.17% based on the F1 value, compared to 85.80% without merging. The reason is that the NEs in the Royalnews genre are limited but occur very often in the articles.

We also measured the execution times of the two approaches. The no-merge approach yields 7,444 words/sec, while the merge approach yields 2,229 words/sec. The decreased speed of the merge approach is due to the time taken for parsing and checking the named entities in the list.

VI.
CONCLUSIONS AND FUTURE WORKS

In this paper, we proposed a solution to the InterBEST 2009 Thai Word Segmentation task based on the Conditional Random Fields. To train the word segmentation models, we proposed three different feature sets: char (by using all possible characters as features), char-type (by categorizing all characters into 10 different types) and combined (by using both characters and character types as features). The evaluation results showed that the char-type feature set yielded the worst performance, 62.90% based on the F1 value. The model trained using the char feature set gave an F1 value of 91.51%. The highest F1 value of 93.90% was achieved when using the combined feature set. Therefore, it could be concluded that the CRFs could effectively learn the actual character sequences; in cases where the exact sequential character patterns are not matched, the generalized model based on the char-type features helps cover them. The analysis of the segmented results showed that the models trained using the CRFs tend to mistakenly separate named entities into a few small segments. As a solution, we performed a post-processing step by merging named entities (NEs) found in the returned segments.
The NE merging step helped increase the performance of the combined feature model to 94.27% based on the averaged F1 value.

For future work, we will focus on improving the word segmentation models learned by using the CRFs. One possible direction is to improve the feature set used for learning the models; a new character feature set could be automatically constructed by using clustering algorithms. Another possible improvement is to train a named entity recognition (NER) model to merge returned segments which contain named entities into single word units.

REFERENCES

[1] C. Haruechaiyasak, S. Kongyoung and M. Dailey, "A comparative study on Thai word segmentation approaches," Proc. of the ECTI-CON 2008, 1:125-128, 2008.
[2] C. Haruechaiyasak, S. Kongyoung and C. Damrongrat, "LearnLexTo: a machine-learning based word segmentation for indexing Thai texts," Proc. of the 2nd ACM Workshop on Improving Non English Web Searching (iNEWS), pp. 85-88, 2008.
[3] A. Kawtrakul and C. Thumkanon, "A Statistical Approach to Thai Morphological Analyzer," Proc. of the 5th Workshop on Very Large Corpora, pp. 286-289, 1997.
[4] C. S. G. Khoo and T. Ee Loh, "Using Statistical and Contextual Information to Identify Two- and Three-Character Words in Chinese Text," Journal of the American Society for Information Science and Technology, 53(5):365-377, 2002.
[5] C. Kruengkrai and H. Isahara, "A Conditional Random Field Framework for Thai Morphological Analysis," Proc. of the Fifth Int. Conf. on Language Resources and Evaluation (LREC-2006), 2006.
[6] C. Kruengkrai, et al., "An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging," Proc. of the ACL-IJCNLP 2009, pp. 513-521, 2009.
[7] T. Kudo, K. Yamamoto, and Y. Matsumoto, "Applying Conditional Random Fields to Japanese Morphological Analysis," Proc. of EMNLP, pp. 89-96, 2004.
[8] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," Proc. of the Eighteenth Int. Conf. on Machine Learning (ICML), pp. 282-289, 2001.
[9] S. Meknavin, P. Charoenpornsawat, and B. Kijsirikul, "Feature-Based Thai Word Segmentation," Proc. of NLPRS 97, pp. 289-296, 1997.
[10] F. Peng, F. Feng, and A. McCallum, "Chinese Segmentation and New Word Detection Using Conditional Random Fields," Proc. of the 20th COLING, pp. 562-568, 2004.
[11] Y. Poowarawan, "Dictionary-based Thai Syllable Separation," Proc. of the Ninth Electronics Engineering Conference, 1986.
[12] V. Sornlertlamvanich, "Word Segmentation for Thai in Machine Translation System," Machine Translation, National Electronics and Computer Technology Center, Bangkok, 1993.
