19.07.2013 Views

( ' ¦& à 2 )10 2 çWß®ý ¢1¤8¦& úED2F ò GIHbçpÖháIþ 35P QDÉ RÍÚ ...



Rast £ ¤ ¥ ¦ § ¨© <br />

¡<br />

∗<br />

¢<br />

<br />

10

2005 5<br />

<br />

<br />

N-gram<br />

<br />

<br />

<br />

Rast<br />

Rast<br />

1<br />

<br />

<br />

<br />

<br />

<br />

<br />

<br />

¡ £¢¥¤§¦¥¨<br />

<br />

£ ¥ ¦ <br />

¦ <br />

N-gram 2<br />

<br />

<br />

<br />

<br />

£© ¦¨¥¥©<br />

<br />

£<br />

<br />

¡ <br />

<br />

¥ <br />

¢¤¦ <br />

¦ ¦ §<br />

N-gram <br />

§<br />

¦<br />

¢¤¦ ¥ <br />

<br />

¥<br />

<br />

Rast£ <br />

Rast<br />

<br />

§<br />

¢¥¤<br />

§£¢¥<br />

<br />

<br />

<br />

∗ ¡<br />

1<br />

<br />

2<br />

<br />

<br />

<br />

£<br />

§<br />

2 ¥ £ ¥<br />

<br />

<br />

<br />

¥£<br />

¥<br />

£<br />

<br />

¨ <br />

<br />

¦<br />

<br />

¨§<br />

© <br />

<br />

©£<br />

¨<br />

<br />

N-gram<br />

¨ <br />

¦ ¥<br />

N-gram <br />

¦ <br />

<br />

¥<br />

¦ <br />

<br />

¡ ¥ <br />

¥ £<br />

¥<br />

<br />

<br />

<br />

<br />

<br />

¦ <br />

N-gram <br />

<br />

<br />

<br />

<br />

<br />

<br />

¥ <br />

¦ <br />

¥ £


£<br />

<br />

Rast © £<br />

£<br />

¨ <br />

£<br />

<br />

£ <br />

<br />

<br />

¦ <br />

§¥<br />

<br />

3 Rast <br />

<br />

1 <br />

<br />

<br />

<br />

<br />

Rast<br />

<br />

<br />

¥<br />

§¥ <br />

¦ ¦ <br />

N-gram ¦<br />

<br />

<br />

£ <br />

¡<br />

<br />

<br />

£<br />

<br />

<br />

SI <br />

<br />

Rast<br />

<br />

<br />

<br />

<br />

£ <br />

£¢¤<br />

4 Rast ¡<br />

¢¤£¦¥¨§©¦©<br />

4.1 <br />

¨ ¨ ¤<br />

<br />

<br />

¥¤<br />

2<br />

4.1.1 ¥¤§<br />

Rast<br />

<br />

¦<br />

1. <br />

<br />

¥<br />

2.<br />

3.<br />

¨¤ § <br />

¥§<br />

<br />

<br />

<br />

4.1.2<br />

<br />

<br />

<br />

4.1.3<br />

§ <br />

¦§<br />

¦<br />

<br />

4.<br />

<br />

<br />

ID <br />

<br />

¤¦ <br />

§¨¦<br />

5. <br />

<br />

¤<br />

<br />

6.<br />

<br />

<br />

<br />

<br />

¦¨ ¤<br />

¤<br />

7.<br />

§ <br />

<br />

2 <br />

¢¤£¦¥¨§<br />

4.1.2<br />

<br />

Rast<br />

<br />

1. §¥<br />

2.<br />

<br />

¡<br />

¦§¨§<br />

<br />

<br />

<br />

ID ¥


3.<br />

<br />

4.1.6<br />

<br />

<br />

<br />

4.<br />

5. ¨<br />

ID<br />

¤<br />

¤¨ <br />

¦¤ <br />

Rast<br />

£ <br />

¦<br />

¨¨¤ <br />

<br />

<br />

1 ¨£¤ <br />

*1<br />

4.1.3 ¤¥¨§<br />

1<br />

<br />

<br />

<br />

<br />

<br />

Rast<br />

<br />

<br />

<br />

<br />

¤¦¨ ¥<br />

¤ £ <br />

¤<br />

¥<br />

¨ ¨ <br />

¤ ¤¦¨<br />

¥ <br />

<br />

•<br />

<br />

<br />

•<br />

•<br />

<br />

<br />

<br />

<br />

<br />

<br />

£<br />

<br />

¥ <br />

<br />

<br />

.ngm,pos,rng<br />

¦¥ £ <br />

<br />

<br />

<br />

<br />

<br />

4.1.2<br />

<br />

<br />

¦¨<br />

§ <br />

<br />

<br />

<br />

<br />

£<br />

§ <br />

Berkeley DB[5] <br />

§¤<br />

B<br />

<br />

.inv<br />

<br />

¦§¤ <br />

£ <br />

A <br />

<br />

1.<br />

¨<br />

<br />

<br />

*1 Rast 0.1.0 <br />

.inv A ¤<br />

3<br />

A§ <br />

2.<br />

3.<br />

<br />

ID<br />

£<br />

A<br />

<br />

<br />

<br />

<br />

<br />

¤ <br />

1.<br />

2.<br />

¨ <br />

<br />

.inv <br />

A £ <br />

<br />

<br />

¦<br />

ID¤<br />

<br />

<br />

<br />

<br />

<br />

£<br />

<br />

<br />

<br />

¦§ ¥<br />

<br />

<br />

<br />

<br />

A A ≤<br />

≤ A<br />

4.1.4 ¨<br />

<br />

Rast <br />

<br />

<br />

1. ¨ ¨ £<br />

<br />

2. *2 ¤¦<br />

.inv<br />

¤¨¨<br />

<br />

.inv ¤¤¤¨<br />

¤¨¤¤<br />

¤¨¨<br />

3.<br />

<br />

4.<br />

5.<br />

6.<br />

¨ <br />

¤¨¨¨¤¦¨<br />

¨ ¨¡<br />

¨¤<br />

<br />

¤¡<br />

<br />

<br />

¨ <br />

¤<br />

7. ¨¤¨¨<br />

¨ <br />

*2 ID


8. <br />

¤ *3 ¨¡<br />

4.1.5 ¤<br />

<br />

3 ¤<br />

Rast ¦<br />

¦<br />

¤¨¡<br />

<br />

1. ¢¤£<br />

¨¤¨ <br />

2.<br />

3.<br />

<br />

¨¤¤<br />

¦¥¨§ <br />

4.<br />

text.ngm © text.pos ¤¦<br />

text.pos ¦¨¤<br />

¡<br />

<br />

Berkeley DB B<br />

<br />

¥§ <br />

¡ ¦¤<br />

¦¨<br />

<br />

<br />

<br />

1. <br />

¨<br />

text.pos <br />

<br />

2.<br />

<br />

¤<br />

¤<br />

<br />

text.pos<br />

¦<br />

<br />

3. <br />

<br />

4.<br />

<br />

<br />

<br />

¤<br />

ID<br />

<br />

<br />

ID<br />

<br />

<br />

¦<br />

¤¦<br />

©¦¦ <br />

text.pos<br />

¦ <br />

ID ¦<br />

<br />

<br />

¦ ¦<br />

¡ <br />

¤¨<br />

¡<br />

<br />

<br />

¤ <br />

<br />

*3 ID <br />

¢¡£<br />

4<br />

1.<br />

2.<br />

3.<br />

<br />

ID 1 <br />

ID <br />

¦¤<br />

<br />

<br />

<br />

<br />

text.pfl text.pos ¤¦¡¨<br />

¤¦<br />

<br />

<br />

text.pos <br />

ID <br />

<br />

¨<br />

¤¤¡<br />

<br />

<br />

¤¤<br />

¦¦¡<br />

1. <br />

<br />

2. <br />

<br />

<br />

text.rng ©¦¡¤¦¦<br />

¨¦<br />

ID<br />

<br />

<br />

<br />

<br />

¤¦¦¦¤¤<br />

¨ <br />

<br />

1.<br />

2.<br />

3.<br />

¤<br />

ID<br />

¦¤ <br />

<br />

<br />

¨ <br />

¦¤<br />

¡<br />

4.1.6<br />

¤¤ ¡<br />

Rast<br />

¤ <br />

¤¡<br />

<br />

¡¨¤<br />

¤¤ ¦¡<br />

Rast<br />

¤¤¦¦¦<br />

<br />

¦¨¨<br />

<br />

£<br />

£¦<br />

¡¦<br />

¡<br />

<br />

1. <br />

2.<br />

<br />

£<br />

£<br />

<br />

¨<br />

¤<br />

¤¤<br />

£<br />

¨¨<br />

¨<br />

¤¨<br />

<br />

3. ¨¨¤<br />

¨<br />

4.1.7<br />

<br />

¨¤¦¤¨¡<br />

¤¨¦¤¡¡


¤¨¤¡ ¦<br />

<br />

¤<br />

<br />

¡¤<br />

¦¨¤¨ ¨¦¤<br />

<br />

¦ ¤ <br />

Rast<br />

<br />

-IDF<br />

¦¨ <br />

TF<br />

<br />

<br />

<br />

<br />

¦¡¦¤ <br />

<br />

¨¤¨ <br />

¡ ¤<br />

¤¤<br />

¡ ¦¡¦ *4¨ <br />

<br />

¤ ¤<br />

¤¤¤<br />

TF =

IDF = log10

TF-IDF = TF × IDF
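The operands of the formulas above did not survive extraction, so the following is only a sketch under the conventional reading of TF-IDF, not a confirmed statement of Rast's scoring. The symbols f_{t,d} (occurrences of term t in document d), N (total number of documents) and n_t (number of documents containing t) are assumptions introduced here, as is the placement of the "+ 1" that appears in the source next to the logarithm.

```latex
\mathrm{TF}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t) = \log_{10}\frac{N}{n_t} + 1, \qquad
\mathrm{TF\text{-}IDF}(t,d) = \mathrm{TF}(t,d) \cdot \mathrm{IDF}(t)
```

As a rough illustration under these assumptions, a term found in 4739 of 63126 documents would have log10(63126 / 4739) ≈ 1.12, giving an IDF of about 2.12.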

<br />

<br />

¨<br />

¨ ¤ + 1<br />

¤<br />

4.1.8<br />

¡¦¤¤¡¦<br />

¨<br />

¤<br />

Rast<br />

¤¦¤¨<br />

¤¤¤¤<br />

¤¤<br />

¤<br />

¡¡¨<br />

¦¡ <br />

Rast 0.1.0<br />

¤¤¨<br />

<br />

<br />

£<br />

¤¤¤<br />

¦¦¨<br />

<br />

¨ <br />

<br />

¤<br />

¡<br />

¤<br />

<br />

¤¤¤ ¨<br />

¤ ¡ <br />

¤<br />

<br />

4.2 <br />

Rast<br />

<br />

¦¤¤<br />

<br />

¦¤¤¨¨¦<br />

¤¦¨¦¨ <br />

<br />

¦¤¤¤¦<br />

¨¦<br />

UTF-8 EUC-JP<br />

¤¡ ¨¤ <br />

N-gram

<br />

<br />

<br />

4 <br />

¦<br />

¤<br />

¨¦¨ <br />

Rast<br />

¨ dlopen(3)<br />

¦<br />

<br />

¨¨¤¨<br />

¦¨<br />

¤¨¨<br />

¤¤<br />

¨¨¤¤<br />

<br />

¤¦¤ ¦<br />

<br />

utf8 utf8.so<br />

<br />

¡ ¤<br />

rast encoding<br />

<br />

<br />

rast_encoding_module_t
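The text here refers to dlopen(3), to an encoding name such as utf8 mapping to a file utf8.so, and to a rast_encoding_module_t structure. A minimal sketch of that kind of loading, using only the standard dlopen/dlsym/dlclose calls, might look as follows. The directory, the symbol name, and the names build_module_path, load_module and loaded_module_t are hypothetical and are not taken from Rast's source.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Result of loading a module: both fields are NULL on failure. */
typedef struct {
    void *handle;   /* handle returned by dlopen() */
    void *symbol;   /* address of the requested symbol */
} loaded_module_t;

/* Builds "<dir>/<encoding>.so" into buf. Returns 0 on success,
 * -1 if the result does not fit into size bytes. */
static int build_module_path(char *buf, size_t size,
                             const char *dir, const char *encoding)
{
    int n = snprintf(buf, size, "%s/%s.so", dir, encoding);
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/* Opens the shared object at path and looks up symbol_name
 * (a hypothetical exported module table). */
static loaded_module_t load_module(const char *path, const char *symbol_name)
{
    loaded_module_t m = { NULL, NULL };

    m.handle = dlopen(path, RTLD_NOW);
    if (m.handle == NULL) {
        return m;               /* file missing or not loadable */
    }
    m.symbol = dlsym(m.handle, symbol_name);
    if (m.symbol == NULL) {
        dlclose(m.handle);      /* symbol absent: release the handle */
        m.handle = NULL;
    }
    return m;
}
```

A caller would build the path from the configured encoding name, call load_module, and cast the returned symbol to the module structure type. On older systems the program may need to be linked with -ldl.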

<br />

<br />

¨¨¤¤¦¤<br />

<br />

<br />

¥¤<br />

Rast <br />

<br />

<br />

<br />

¤<br />

¨¤¨¨<br />

<br />

¦¨<br />

1. rast_error_t *get_char_len(rast_tokenizer_t *tokenizer, rast_size_t *len)

2. rast_error_t *get_token(rast_tokenizer_t *tokenizer, rast_token_t *token)

3. rast_error_t *get_next_offset(rast_tokenizer_t *tokenizer, rast_size_t *byte_offset, rast_size_t *char_offset)

4. void normalize_text(apr_pool_t *pool, const char *src, rast_size_t src_len, char **dst, rast_size_t *dst_len)

5. void normalize_chars(apr_pool_t *pool, const char *src, rast_size_t src_len, char **dst, rast_size_t *dst_len)

6. int is_space(rast_char_t *ch)
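The functions listed above deal with character lengths, tokens, and byte and character offsets. As an illustration of how a tokenizer of this kind could enumerate character N-grams over UTF-8 text, here is a minimal standalone sketch. It does not use Rast's types: utf8_char_len, utf8_ngrams and ngram_token_t are hypothetical names, and the behaviour on malformed input is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of the UTF-8 character whose lead byte is s[0].
 * Returns 0 if s[0] cannot start a character. */
static size_t utf8_char_len(const unsigned char *s)
{
    if (s[0] < 0x80) return 1;              /* 0xxxxxxx: ASCII */
    if ((s[0] & 0xE0) == 0xC0) return 2;    /* 110xxxxx */
    if ((s[0] & 0xF0) == 0xE0) return 3;    /* 1110xxxx */
    if ((s[0] & 0xF8) == 0xF0) return 4;    /* 11110xxx */
    return 0;                               /* continuation or invalid byte */
}

typedef struct {
    size_t byte_offset;   /* start of the token, in bytes */
    size_t char_offset;   /* start of the token, in characters */
    size_t byte_len;      /* length of the token, in bytes */
} ngram_token_t;

/* Writes the overlapping n-character tokens of text[0..len) into out,
 * at most max_tokens of them. Returns the number of tokens written,
 * or (size_t)-1 if the input is not well-formed at a lead byte. */
static size_t utf8_ngrams(const char *text, size_t len, size_t n,
                          ngram_token_t *out, size_t max_tokens)
{
    const unsigned char *p = (const unsigned char *)text;
    size_t count = 0, start = 0, char_index = 0;

    assert(n > 0);
    while (start < len && count < max_tokens) {
        size_t end = start, k;

        for (k = 0; k < n; k++) {
            size_t clen;
            if (end >= len) return count;   /* fewer than n characters left */
            clen = utf8_char_len(p + end);
            if (clen == 0 || end + clen > len) return (size_t)-1;
            end += clen;
        }
        out[count].byte_offset = start;
        out[count].char_offset = char_index;
        out[count].byte_len = end - start;
        count++;

        start += utf8_char_len(p + start);  /* slide by one character */
        char_index++;
    }
    return count;
}
```

With n set to 2 this yields bi-grams and with n set to 3 tri-grams. Recording both the byte offset and the character offset mirrors the two offsets returned by get_next_offset.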

<br />

Rast<br />

<br />

¤<br />

4.2.1 utf8

utf8<br />

<br />

¤¨ <br />

<br />

¦¦<br />

UTF-8 N-gram<br />

¨¤¨¨¦<br />

<br />

<br />

[1] <br />

N UNICODE[6]<br />

<br />

¡¨¤<br />

bi-gram<br />

<br />

¤<br />

<br />

¤<br />

<br />

<br />

¡<br />

¨ <br />

<br />

¤ ¤¨<br />

<br />

<br />

¤¦<br />

£<br />

tri-gram<br />

¨¨ <br />

bi-gram<br />

¦<br />

<br />

<br />

¦<br />

¤¨<br />

bi-gram ¨¦<br />

¨ <br />

Basic Latin Latin-1 Supplement<br />

<br />

<br />

<br />

¦¨¦¡ N <br />

<br />

4.2.2 euc_jp

euc_jp<br />

<br />

£<br />

<br />

¤¨¨ <br />

¤¤ EUC-JP ¡ N-gram<br />

¨¨¦¦<br />

<br />

utf8 ¨¦¤¨<br />

¨<br />

N<br />

¨ ¡<br />

bi-gram<br />

<br />

6<br />

¤¦<br />

<br />

<br />

utf8 mecab_euc_jp<br />

13 11<br />

433 416<br />

<br />

¨ 1 utf8<br />

mecab_euc_jp

<br />

4.2.3 mecab_euc_jp

mecab_euc_jp ¤¨¦<br />

¦¨¦ ¦<br />

EUC-JP<br />

¡ ¤<br />

MeCab<br />

¨¦<br />

<br />

¨¦¤¤¨¤<br />

<br />

£<br />

¨<br />

<br />

<br />

¨<br />

5<br />

<br />

<br />

5.1<br />

N-gram<br />

<br />

utf8.c MeCab *5 <br />

¨¤¤¦¤<br />

¤<br />

¦ mecab_euc_jp.c <br />

<br />

¦<br />

C<br />

¤¤¤ ¡<br />

5.2 <br />

<br />

£<br />

<br />

¤ <br />

¦<br />

5.2.1<br />

Rast <br />

utf8 ¡ mecab_euc_jp <br />

¨<br />

<br />

¦¦¤<br />

<br />

¦<br />

<br />

Estraier[8] Namazu[7]<br />

Rast<br />

<br />

<br />

<br />

¤<br />

<br />

ruby-list@ruby-lang.org, ruby-dev@ruby-lang.org

<br />

<br />

¤<br />

63126 <br />

¦ <br />

63126<br />

¥§ <br />

<br />

¦¤¦¨¦¨¦¦<br />

*5<br />

¤¦ ChaSen ¦<br />

<br />

http://chasen.org/~taku/software/mecab/


CPU: Pentium 4 3GHz
Memory: 1GB
Disk: Ultra ATA 100

2 <br />

¨¤¡<br />

63126<br />

¤¨¤<br />

¡ ¦<br />

¨¤<br />

Ruby <br />

tk 5<br />

¤¥<br />

¨<br />

<br />

¤<br />

§<br />

3 ¨<br />

<br />

4 ¤¤<br />

2<br />

¨<br />

<br />

<br />

¡¨ cold start <br />

¤ ¤¤<br />

<br />

¡¥<br />

<br />

hot start<br />

¤¦<br />

¨ <br />

N/A<br />

¤¡<br />

¡<br />

<br />

<br />

<br />

Estraier 1.2.28, Namazu 2.0.14

<br />

5.2.2 <br />

63126 Rast Estraier<br />

3.5 ¦<br />

¤<br />

<br />

¦¤ <br />

Rast Estraier<br />

a<br />

<br />

<br />

¦<br />

<br />

¦<br />

¤¦<br />

¤<br />

<br />

¤¤¦<br />

<br />

¡ ¦¡¨ Rast N ¢¤£<br />

<br />

<br />

<br />

<br />

<br />

¨¤<br />

<br />

<br />

¦¥¤§¦©¨<br />

<br />

§ ¥¤§<br />

<br />

§¡¤¤¥¤§¤<br />

<br />

a ¦<br />

ID ¤<br />

<br />

<br />

<br />

¦¨¤<br />

¤¥§<br />

¤<br />

¥§¤¨<br />

¦<br />

¥§<br />

<br />

©<br />

<br />

<br />

7<br />

¤¤ ¡¨¤ <br />

6 <br />

¤ ¦¦<br />

Rast <br />

<br />

<br />

©¦<br />

¤<br />

<br />

<br />

mecab_euc_jp<br />

©<br />

©<br />

euc_jp<br />

¤ <br />

<br />

© © <br />

©©<br />

<br />

© <br />

<br />

© ¤¦¡ <br />

©¦<br />

<br />

¤¤<br />

¤ <br />

<br />

© <br />

<br />

<br />

<br />

¤¦<br />

¡ <br />

<br />

<br />

<br />

<br />

¡ ¤ <br />

<br />

©<br />

<br />

7 <br />

<br />

<br />

<br />

Rast ¤<br />

<br />

¤<br />

¦<br />

<br />

¦ <br />

©<br />

<br />

<br />

¦ © ¤<br />

<br />

Rast<br />

<br />

¡ <br />

<br />

<br />

<br />

©¤<br />

<br />

<br />

¤<br />

http://www.netlab.jp/rast/<br />

Rast<br />

¤ <br />

Rast<br />

¤¤ ¡¤ <br />

<br />

©<br />

<br />

<br />

<br />

¤<br />

<br />

<br />

<br />

<br />

¡ <br />

Rast<br />

16 <br />

¤<br />

©¡ £¢¥¤¡¦<br />

Rast<br />

¡© §£¨ ¡ <br />

IPA<br />

£ £ <br />

<br />

<br />

<br />

<br />

<br />

¦


£ 6<br />

¡¡ 1<br />

¤¤<br />

<br />

<br />

<br />

<br />

¤<br />

¤<br />

<br />

<br />

          Estraier    Namazu      Rast (N-gram utf8)
(sec)     156.19      626.09      531.33
(sec)     8.699       23.808      7.938018
          139MB       116MB       324MB

Table 3 (63126 documents)

                    Estraier    Namazu      Rast (N-gram utf8)
Ruby                29547       N/A         50066
  cold start (sec)  0.4318912   0.0347416   0.665318
  hot start (sec)   0.076197    0.0010586   0.5553598

tk                  2232        612         4739
  cold start (sec)  0.3576128   0.3562174   0.51231968
  hot start (sec)   0.0183802   0.0346768   0.1419772

tcl/tk              338         109         522
  cold start (sec)  0.3760914   0.2879038   0.557817832
  hot start (sec)   0.016116    0.0080904   0.019303

emacs               653         94          863
  cold start (sec)  0.330301    0.2961386   0.2069632
  hot start (sec)   0.0110054   0.0076692   0.0129308

a                   N/A         3045        61918
  cold start (sec)  0.2709114   0.6318622   15.1325726
  hot start (sec)   0.009056    0.180159    10.3322752

                    N/A         86          12991
  cold start (sec)  0.260278    0.4412032   0.9149052
  hot start (sec)   0.008311    0.0063548   0.2162194

Table 4 (63126 documents)



£ ¨ <br />

[1] , : “Unicode <br />

¤¡ £ <br />

¤<br />

N-gram ”,<br />

, <br />

£¡<br />

¦<br />

¡ <br />

<br />

[2] ¡¡¡ , £¡ , , : ¡¡ “ <br />

¤©£¤£<br />

1999.<br />

¤<br />

[3] : “¤£¦<br />

[4] <br />

, 1999.<br />

<br />

£<br />

£<br />

<br />

”,<br />

<br />

¦¡ , 2000-NL-136-17, 2000.<br />

© ”, DEWS2002,<br />

”, <br />

, 2002.<br />

£<br />

: “¤£¦<br />

<br />

<br />

[5] “Sleepycat Software: Products: Berkeley DB”, http://www.sleepycat.com/products/db.shtml

[6] “Unicode Home Page”, http://www.unicode.org/

[7] “Namazu”, http://www.namazu.org/

[8] “Estraier: a personal full-text search system”, http://estraier.sourceforge.net/
