TLex: Thai Lexeme Analyser Based on the ... - BEST - Nectec


(by categorizing all characters into 10 different types) and combined (by using both characters and character types as features). For the char-type feature set, we categorize all characters into 10 different types as shown in Figure 2. The set of character types is designed based on linguistic knowledge, i.e., lexical rules, of the Thai language. These character types could provide effective information for identifying the word boundaries.

Fig. 1. The overall process of our proposed solutions

Tag  Type                                                                        Value
c    Consonant characters which can be assigned as the word ending character
n    Consonant characters which cannot be assigned as the word ending character
v    Vowel characters which are not allowed to begin a word
w    Vowel characters which are allowed to begin a word
t    Tonal characters
s    Symbol characters
d    Digit characters                                                            0-9
q    Quote characters                                                            '-' “-”
p    Space character within a word                                               _
o    Other characters                                                            a-z A-Z

Fig. 2. Character-type (char-type) feature set

Fig. 3. Example of a text string formatted according to the combined feature set

Next, the training corpus is transformed into three formatted training sets (one for each feature set). Figure 3 shows an example of a text string formatted based on the combined feature set. The first column contains the character (char) features and the second column contains the character-type (char-type) features. The third column is the annotation for predicted class labels: word-beginning character (B) and intra-word character (I).
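As an illustration only, the transformation of a segmented string into the combined format (character, character type, B/I label) could be sketched as follows. The function names and the small `EXAMPLE_THAI_TYPES` table are hypothetical: only the d, q, p and o classes are taken from Figure 2, and the Thai entries are a hand-picked assumption for the example, not the character sets used in our experiments.

```python
import string

# Classes whose values are listed in Figure 2.
DIGITS = set(string.digits)   # d: digit characters 0-9
QUOTES = set("'\u201c\u201d")  # q: straight single quote and curly double quotes
SPACE_IN_WORD = "_"           # p: space character within a word


def char_type(ch, thai_types):
    """Return the one-letter char-type tag for a single character.

    thai_types is a caller-supplied table mapping Thai characters to one of
    the tags c, n, v, w, t, s (hypothetical here; see the lead-in).
    Characters found in no table fall back to 'o' (other).
    """
    if ch in DIGITS:
        return "d"
    if ch in QUOTES:
        return "q"
    if ch == SPACE_IN_WORD:
        return "p"
    return thai_types.get(ch, "o")


def format_combined(words, thai_types):
    """Turn a list of segmented words into (char, char-type, label) rows.

    The first character of each word is labelled B (word-beginning),
    every following character is labelled I (intra-word).
    """
    rows = []
    for word in words:
        for i, ch in enumerate(word):
            label = "B" if i == 0 else "I"
            rows.append((ch, char_type(ch, thai_types), label))
    return rows


# Illustrative assumption: a tiny table covering only the example words.
EXAMPLE_THAI_TYPES = {"ไ": "w", "ป": "c", "ม": "c", "า": "v"}

rows = format_combined(["ไป", "มา", "10"], EXAMPLE_THAI_TYPES)
# rows now holds one (char, char-type, label) triple per character,
# e.g. the first word yields ("ไ", "w", "B") and ("ป", "c", "I").
```

Dropping the second element of each row would give the char-only format, and dropping the first would give the char-type-only format.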
Using the formatted training data sets, we trained the word segmentation models based on the CRFs. The CRFs algorithm learns to construct the most probable label sequences by observing the features surrounding each annotated class. The word segmentation models can then be evaluated by using the test data set.

Based on our initial observation, the training corpus contains a large number of named entities (NEs), especially in the news articles. Figure 4 shows some examples of NEs in three different types: person, organization and place. Our initial analysis of the segmented results showed that the models trained by using the CRFs tend to mistakenly separate named entities into a few small segments. To improve the segmented results, we perform NE merging as a post-processing step. This process involves the extraction of all named entities found in the training corpus. Table I lists the number of NEs collected from each genre of the training corpus. Since the News genre contains the largest number of NEs, we could expect a higher improvement when NE merging is applied to the segmented results for that genre. Using the NE list, the NE merging step parses the segmented texts returned from the word segmentation model.
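A minimal sketch of such an NE merging pass is given below. It assumes a greedy longest-match strategy and a `max_span` limit on the number of segments considered together; both are illustrative choices, and the function name is hypothetical. Romanized stand-ins are used in place of Thai segments.

```python
def merge_named_entities(segments, ne_list, max_span=8):
    """Merge runs of adjacent segments whose concatenation is a known NE.

    segments : list of strings returned by the word segmentation model
    ne_list  : iterable of named entities collected from the training corpus
    max_span : assumed upper bound on how many segments one NE may cover
    """
    entities = set(ne_list)
    merged = []
    i = 0
    while i < len(segments):
        match_end = None
        upper = min(len(segments), i + max_span)
        # Try the longest candidate first, down to a span of two segments.
        for end in range(upper, i + 1, -1):
            if "".join(segments[i:end]) in entities:
                match_end = end
                break
        if match_end is None:
            # No NE starts here: keep the segment unchanged.
            merged.append(segments[i])
            i += 1
        else:
            # Collapse the matching run into one single word unit.
            merged.append("".join(segments[i:match_end]))
            i = match_end
    return merged


segmented = ["the", "Bang", "kok", "Post", "said"]
result = merge_named_entities(segmented, ["Bangkok", "BangkokPost"])
# result == ["the", "BangkokPost", "said"]
```

Segments are joined without a separator because Thai text is written without spaces between words. Preferring the longest match keeps a full organization name from being cut short by a shorter NE that it contains.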
During the parsing process, if a series of segments matches an entry in the list of NEs, the segments are merged into one single word unit. Some examples of the NE merged results are shown in Figure 4.

V. EXPERIMENTS AND RESULTS

In this section, we evaluate the proposed solutions using the training and test corpora provided by the InterBEST 2009 workshop. The experiments (for test corpus evaluation) were performed on a PC with an Intel Core 2 Duo 1.80GHz processor and 2 GB of RAM running Windows Vista. Three performance measures of precision, recall and F1 are used for
