TLex: Thai Lexeme Analyser Based on the ... - BEST - Nectec

(by categorizing all characters into 10 different types) and combined (by using both characters and character types as features). For the char-type feature set, we categorize all characters into 10 different types, as shown in Figure 2. The set of character types is designed based on linguistic knowledge, i.e., lexical rules, of the Thai language. These character types could provide effective information for identifying word boundaries.

Fig. 1. The overall process of our proposed solutions

Fig. 2. Character-type (char-type) feature set

Tag   Type                                                                          Value
c     Consonant characters which can be assigned as the word ending character
n     Consonant characters which cannot be assigned as the word ending character
v     Vowel characters which are not allowed to begin a word
w     Vowel characters which are allowed to begin a word
t     Tonal characters
s     Symbol characters
d     Digit characters                                                              0-9
q     Quote characters                                                              '-' “-”
p     Space character within a word                                                 _
o     Other characters                                                              a-z A-Z

Fig. 3. Example of a text string formatted according to the combined feature set

Next, the training corpus is transformed into three formatted training sets (one for each feature set). Figure 3 shows an example of a text string formatted based on the combined feature set. The first column contains the character (char) features and the second column contains the character-type (char-type) features. The third column is the annotation for the predicted class labels: word-beginning character (B) and intra-word character (I). Using the formatted training data sets, we trained the word segmentation models based on CRFs. The CRF algorithm learns to construct the most probable label sequences by observing the features surrounding each annotated class. The word segmentation models can then be evaluated using the test data set.

Based on our initial observation, the training corpus contains a large number of named entities (NEs), especially in the news articles.
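The three-column format described above (character, character type, B/I label) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `char_type` and `to_rows` are hypothetical, and the character ranges are assumptions, because the member lists of Figure 2 are not available here. In particular, the sketch does not reproduce the split between consonant types "c" and "n", and it maps every Thai consonant to "c".

```python
# Assumed members of type "w" (vowels allowed to begin a word): U+0E40..U+0E44.
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")
# Assumed members of type "t" (tonal characters): the four tone marks U+0E48..U+0E4B.
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")
# Assumed members of type "q" (quote characters).
QUOTES = set("'\"\u2018\u2019\u201c\u201d")


def char_type(ch):
    """Return a one-letter character-type tag for a single character (simplified)."""
    if ch in "0123456789":
        return "d"                      # digit characters 0-9
    if ch in QUOTES:
        return "q"                      # quote characters
    if ch == "_":
        return "p"                      # space within a word, written as "_"
    if ch in TONE_MARKS:
        return "t"                      # tonal characters
    if ch in LEADING_VOWELS:
        return "w"                      # vowels allowed to begin a word
    if "\u0e01" <= ch <= "\u0e2e":
        return "c"                      # Thai consonants; c/n split NOT reproduced
    if "\u0e30" <= ch <= "\u0e39":
        return "v"                      # assumed: vowels not allowed to begin a word
    if ch.isascii() and ch.isalpha():
        return "o"                      # other characters a-z A-Z
    return "s"                          # everything else treated as a symbol


def to_rows(words):
    """Turn a list of segmented words into (char, char_type, label) rows.

    The first character of each word is labelled "B" (word-beginning),
    every following character "I" (intra-word).
    """
    rows = []
    for word in words:
        for i, ch in enumerate(word):
            rows.append((ch, char_type(ch), "B" if i == 0 else "I"))
    return rows


if __name__ == "__main__":
    # One row per character, columns separated by a space.
    for ch, ctype, label in to_rows(["\u0e44\u0e1b", "2009"]):
        print(ch, ctype, label)
```

Dropping the second column gives the char-only training set, and dropping the first gives the char-type-only set, so one pass over the segmented corpus can produce all three formatted training sets.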
Figure 4 shows some examples of NEs of three different types: person, organization and place. Our initial analysis of the segmented results showed that the models trained using CRFs tend to mistakenly split named entities into several small segments. To improve the segmented results, we perform NE merging as a post-processing step. This process involves the extraction of all named entities found in the training corpus. Table I lists the number of NEs collected from each genre of the training corpus. Since the News genre contains the largest number of NEs, we could expect a higher improvement there when NE merging is applied to the segmented results. Using the NE list, NE merging parses the segmented texts returned by the word segmentation model. During parsing, if a series of segments matches an entry in the NE list, the segments are merged into a single word unit. Some examples of the NE-merged results are shown in Figure 4.

V. EXPERIMENTS AND RESULTS

In this section, we evaluate the proposed solutions using the training and test corpora provided by the InterBEST 2009 workshop. The experiments (for test corpus evaluation) were performed on a PC with an Intel Core 2 Duo 1.80 GHz processor and 2 GB of RAM running Windows Vista. Three performance measures of precision, recall and F1 are used for
