Rast £ ¤ ¥ ¦ § ¨© <br />
¡<br />
∗<br />
¢<br />
<br />
10 <br />
2005 5<br />
<br />
<br />
N-gram<br />
<br />
<br />
<br />
Rast<br />
Rast<br />
1<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
¡ £¢¥¤§¦¥¨<br />
<br />
£ ¥ ¦ <br />
¦ <br />
N-gram 2<br />
<br />
<br />
<br />
<br />
£© ¦¨¥¥©<br />
<br />
£<br />
<br />
¡ <br />
<br />
¥ <br />
¢¤¦ <br />
¦ ¦ §<br />
N-gram <br />
§<br />
¦<br />
¢¤¦ ¥ <br />
<br />
¥<br />
<br />
Rast£ <br />
Rast<br />
<br />
§<br />
¢¥¤<br />
§£¢¥<br />
<br />
<br />
<br />
∗ ¡<br />
1<br />
<br />
2<br />
<br />
<br />
<br />
£<br />
§<br />
2 ¥ £ ¥<br />
<br />
<br />
<br />
¥£<br />
¥<br />
£<br />
<br />
¨ <br />
<br />
¦<br />
<br />
¨§<br />
© <br />
<br />
©£<br />
¨<br />
<br />
N-gram<br />
¨ <br />
¦ ¥<br />
N-gram <br />
¦ <br />
<br />
¥<br />
¦ <br />
<br />
¡ ¥ <br />
¥ £<br />
¥<br />
<br />
<br />
<br />
<br />
<br />
¦ <br />
N-gram <br />
<br />
<br />
<br />
<br />
<br />
<br />
¥ <br />
¦ <br />
¥ £
£<br />
<br />
Rast © £<br />
£<br />
¨ <br />
£<br />
<br />
£ <br />
<br />
<br />
¦ <br />
§¥<br />
<br />
3 Rast <br />
<br />
1 <br />
<br />
<br />
<br />
<br />
Rast<br />
<br />
<br />
¥<br />
§¥ <br />
¦ ¦ <br />
N-gram ¦<br />
<br />
<br />
£ <br />
¡<br />
<br />
<br />
£<br />
<br />
<br />
SI <br />
<br />
Rast<br />
<br />
<br />
<br />
<br />
£ <br />
£¢¤<br />
4 Rast ¡<br />
¢¤£¦¥¨§©¦©<br />
4.1 <br />
¨ ¨ ¤<br />
<br />
<br />
¥¤<br />
2<br />
4.1.1 ¥¤§<br />
Rast<br />
<br />
¦<br />
1. <br />
<br />
¥<br />
2.<br />
3.<br />
¨¤ § <br />
¥§<br />
<br />
<br />
<br />
4.1.2<br />
<br />
<br />
<br />
4.1.3<br />
§ <br />
¦§<br />
¦<br />
<br />
4.<br />
<br />
<br />
ID <br />
<br />
¤¦ <br />
§¨¦<br />
5. <br />
<br />
¤<br />
<br />
6.<br />
<br />
<br />
<br />
<br />
¦¨ ¤<br />
¤<br />
7.<br />
§ <br />
<br />
2 <br />
¢¤£¦¥¨§<br />
4.1.2<br />
<br />
Rast<br />
<br />
1. §¥<br />
2.<br />
<br />
¡<br />
¦§¨§<br />
<br />
<br />
<br />
ID ¥
3.<br />
<br />
4.1.6<br />
<br />
<br />
<br />
4.<br />
5. ¨<br />
ID<br />
¤<br />
¤¨ <br />
¦¤ <br />
Rast<br />
£ <br />
¦<br />
¨¨¤ <br />
<br />
<br />
1 ¨£¤ <br />
*1<br />
4.1.3 ¤¥¨§<br />
1<br />
<br />
<br />
<br />
<br />
<br />
Rast<br />
<br />
<br />
<br />
<br />
¤¦¨ ¥<br />
¤ £ <br />
¤<br />
¥<br />
¨ ¨ <br />
¤ ¤¦¨<br />
¥ <br />
<br />
•<br />
<br />
<br />
•<br />
•<br />
<br />
<br />
<br />
<br />
<br />
<br />
£<br />
<br />
¥ <br />
<br />
<br />
.ngm,pos,rng<br />
¦¥ £ <br />
<br />
<br />
<br />
<br />
<br />
4.1.2<br />
<br />
<br />
¦¨<br />
§ <br />
<br />
<br />
<br />
<br />
£<br />
§ <br />
Berkeley DB[5] <br />
§¤<br />
B<br />
<br />
.inv<br />
<br />
¦§¤ <br />
£ <br />
A <br />
<br />
1.<br />
¨<br />
<br />
<br />
*1 Rast 0.1.0 <br />
.inv A ¤<br />
3<br />
A§ <br />
2.<br />
3.<br />
<br />
ID<br />
£<br />
A<br />
<br />
<br />
<br />
<br />
<br />
¤ <br />
1.<br />
2.<br />
¨ <br />
<br />
.inv <br />
A £ <br />
<br />
<br />
¦<br />
ID¤<br />
<br />
<br />
<br />
<br />
<br />
£<br />
<br />
<br />
<br />
¦§ ¥<br />
<br />
<br />
<br />
<br />
A A ≤<br />
≤ A<br />
4.1.4 ¨<br />
<br />
Rast <br />
<br />
<br />
1. ¨ ¨ £<br />
<br />
2. *2 ¤¦<br />
.inv<br />
¤¨¨<br />
<br />
.inv ¤¤¤¨<br />
¤¨¤¤<br />
¤¨¨<br />
3.<br />
<br />
4.<br />
5.<br />
6.<br />
¨ <br />
¤¨¨¨¤¦¨<br />
¨ ¨¡<br />
¨¤<br />
<br />
¤¡<br />
<br />
<br />
¨ <br />
¤<br />
7. ¨¤¨¨<br />
¨ <br />
*2 ID
8. <br />
¤ *3 ¨¡<br />
4.1.5 ¤<br />
<br />
3 ¤<br />
Rast ¦<br />
¦<br />
¤¨¡<br />
<br />
1. ¢¤£<br />
¨¤¨ <br />
2.<br />
3.<br />
<br />
¨¤¤<br />
¦¥¨§ <br />
4.<br />
text.ngm © text.pos ¤¦<br />
text.pos ¦¨¤<br />
¡<br />
<br />
Berkeley DB B<br />
<br />
¥§ <br />
¡ ¦¤<br />
¦¨<br />
<br />
<br />
<br />
1. <br />
¨<br />
text.pos <br />
<br />
2.<br />
<br />
¤<br />
¤<br />
<br />
text.pos<br />
¦<br />
<br />
3. <br />
<br />
4.<br />
<br />
<br />
<br />
¤<br />
ID<br />
<br />
<br />
ID<br />
<br />
<br />
¦<br />
¤¦<br />
©¦¦ <br />
text.pos<br />
¦ <br />
ID ¦<br />
<br />
<br />
¦ ¦<br />
¡ <br />
¤¨<br />
¡<br />
<br />
<br />
¤ <br />
<br />
*3 ID <br />
¢¡£<br />
4<br />
1.<br />
2.<br />
3.<br />
<br />
ID 1 <br />
ID <br />
¦¤<br />
<br />
<br />
<br />
<br />
text.pfl text.pos ¤¦¡¨<br />
¤¦<br />
<br />
<br />
text.pos <br />
ID <br />
<br />
¨<br />
¤¤¡<br />
<br />
<br />
¤¤<br />
¦¦¡<br />
1. <br />
<br />
2. <br />
<br />
<br />
text.rng ©¦¡¤¦¦<br />
¨¦<br />
ID<br />
<br />
<br />
<br />
<br />
¤¦¦¦¤¤<br />
¨ <br />
<br />
1.<br />
2.<br />
3.<br />
¤<br />
ID<br />
¦¤ <br />
<br />
<br />
¨ <br />
¦¤<br />
¡<br />
4.1.6<br />
¤¤ ¡<br />
Rast<br />
¤ <br />
¤¡<br />
<br />
¡¨¤<br />
¤¤ ¦¡<br />
Rast<br />
¤¤¦¦¦<br />
<br />
¦¨¨<br />
<br />
£<br />
£¦<br />
¡¦<br />
¡<br />
<br />
1. <br />
2.<br />
<br />
£<br />
£<br />
<br />
¨<br />
¤<br />
¤¤<br />
£<br />
¨¨<br />
¨<br />
¤¨<br />
<br />
3. ¨¨¤<br />
¨<br />
4.1.7<br />
<br />
¨¤¦¤¨¡<br />
¤¨¦¤¡¡
¤¨¤¡ ¦<br />
<br />
¤<br />
<br />
¡¤<br />
¦¨¤¨ ¨¦¤<br />
<br />
¦ ¤ <br />
Rast<br />
<br />
TF-IDF<br />
<br />
<br />
<br />
<br />
¦¡¦¤ <br />
<br />
¨¤¨ <br />
¡ ¤<br />
¤¤<br />
¡ ¦¡¦ *4¨ <br />
<br />
¤ ¤<br />
¤¤¤<br />
$\mathit{TF} = \ldots$<br />
$\mathit{IDF} = \log_{10} \ldots$<br />
$\mathit{TF\text{-}IDF} = \mathit{TF} \times \mathit{IDF}$<br />
<br />
<br />
¨<br />
¨ ¤ + 1<br />
¤<br />
4.1.8<br />
¡¦¤¤¡¦<br />
¨<br />
¤<br />
Rast<br />
¤¦¤¨<br />
¤¤¤¤<br />
¤¤<br />
¤<br />
¡¡¨<br />
¦¡ <br />
Rast 0.1.0<br />
¤¤¨<br />
<br />
<br />
£<br />
¤¤¤<br />
¦¦¨<br />
<br />
¨ <br />
<br />
¤<br />
¡<br />
¤<br />
<br />
¤¤¤ ¨<br />
¤ ¡ <br />
¤<br />
<br />
4.2 <br />
Rast<br />
<br />
¦¤¤<br />
<br />
¦¤¤¨¨¦<br />
¤¦¨¦¨ <br />
<br />
¦¤¤¤¦<br />
¨¦<br />
UTF-8 EUC-JP<br />
¤¡ ¨¤ <br />
N-gram <br />
*4<br />
<br />
<br />
<br />
4 <br />
¦<br />
¤<br />
¨¦¨ <br />
Rast<br />
¨ dlopen(3)<br />
¦<br />
<br />
¨¨¤¨<br />
¦¨<br />
¤¨¨<br />
¤¤<br />
¨¨¤¤<br />
<br />
¤¦¤ ¦<br />
<br />
utf8 utf8.so<br />
<br />
¡ ¤<br />
rast encoding<br />
<br />
<br />
rast_encoding_module_t<br />
<br />
<br />
¨¨¤¤¦¤<br />
<br />
<br />
¥¤<br />
Rast <br />
<br />
<br />
<br />
¤<br />
¨¤¨¨<br />
<br />
¦¨<br />
1.<br />
2.<br />
<br />
rast_error_t *get_char_len(rast_tokenizer_t *tokenizer, rast_size_t *len)<br />
<br />
<br />
rast_error_t *get_token(rast_tokenizer_t *tokenizer, rast_token_t *token)<br />
3. <br />
<br />
<br />
<br />
¦<br />
rast_error_t *get_next_offset(rast_tokenizer_t *tokenizer, rast_size_t *byte_offset, rast_size_t *char_offset)<br />
4. ¨<br />
¤¨<br />
void normalize_text(apr_pool_t *pool, const char *src, rast_size_t src_len, char **dst, rast_size_t *dst_len)<br />
5. ¤¦¤¤¤¨¨¨¡<br />
¨¨¦ <br />
<br />
6. <br />
void normalize_chars(apr_pool_t *pool, const char *src, rast_size_t src_len, char **dst, rast_size_t *dst_len)<br />
£<br />
¨¤¡<br />
<br />
int is_space(rast_char_t *ch)<br />
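The plug-in design listed above, where each encoding supplies its own character-length, normalization and whitespace functions, can be pictured as a table of function pointers. The following is a minimal sketch of that idea only. Every name in it (demo_char_t, demo_encoding_module_t, demo_ascii_module, demo_count_spaces) is hypothetical, the callback signatures are simplified, and none of it is taken from the Rast sources.

```c
#include <stddef.h>

/* Hypothetical stand-in for a character reference; not Rast's rast_char_t. */
typedef struct {
    const char *ptr;   /* first byte of the character */
    size_t nbytes;     /* number of bytes it occupies */
} demo_char_t;

/* Hypothetical table of per-encoding callbacks (simplified signatures). */
typedef struct {
    const char *name;
    /* Bytes used by the character at s, or 0 if no bytes remain. */
    size_t (*get_char_len)(const char *s, size_t remaining);
    /* Non-zero if the character is whitespace in this encoding. */
    int (*is_space)(const demo_char_t *ch);
} demo_encoding_module_t;

static size_t ascii_get_char_len(const char *s, size_t remaining)
{
    (void) s;  /* every ASCII character is exactly one byte */
    return remaining > 0 ? 1 : 0;
}

static int ascii_is_space(const demo_char_t *ch)
{
    char c;

    if (ch->nbytes != 1) {
        return 0;
    }
    c = *ch->ptr;
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* A toy single-byte module used only to exercise the table. */
static const demo_encoding_module_t demo_ascii_module = {
    "ascii", ascii_get_char_len, ascii_is_space
};

/* Count whitespace characters using nothing but the module's callbacks,
 * so the same loop works for any encoding that fills in the table. */
size_t demo_count_spaces(const demo_encoding_module_t *mod,
                         const char *text, size_t len)
{
    size_t pos = 0;
    size_t count = 0;

    while (pos < len) {
        demo_char_t ch;

        ch.ptr = text + pos;
        ch.nbytes = mod->get_char_len(ch.ptr, len - pos);
        if (ch.nbytes == 0) {
            break;
        }
        if (mod->is_space(&ch)) {
            count++;
        }
        pos += ch.nbytes;
    }
    return count;
}
```

The caller never inspects bytes directly, which is the property that lets a search engine add a new encoding without touching its indexing loop.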
<br />
Rast<br />
<br />
¤<br />
4.2.1 utf8<br />
utf8<br />
<br />
¤¨ <br />
<br />
¦¦<br />
UTF-8 N-gram<br />
¨¤¨¨¦<br />
<br />
<br />
[1] <br />
N UNICODE[6]<br />
<br />
¡¨¤<br />
bi-gram<br />
<br />
¤<br />
<br />
¤<br />
<br />
<br />
¡<br />
¨ <br />
<br />
¤ ¤¨<br />
<br />
<br />
¤¦<br />
£<br />
tri-gram<br />
¨¨ <br />
bi-gram<br />
¦<br />
<br />
<br />
¦<br />
¤¨<br />
bi-gram ¨¦<br />
¨ <br />
Basic Latin Latin-1 Supplement<br />
<br />
<br />
<br />
¦¨¦¡ N <br />
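Character N-gram splitting of UTF-8 text, as used by the utf8 module, can be sketched as follows. This is an illustration under simplifying assumptions, not the code in utf8.c: it applies one fixed N to every character, it does not vary N by Unicode block, and it checks only lead bytes, not continuation bytes. The function names are hypothetical.

```c
#include <stddef.h>

/* Byte length of the UTF-8 sequence introduced by lead byte c,
 * or 0 if c cannot start a sequence. */
size_t utf8_char_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Locate the n-gram that begins at byte offset start of text[0..len).
 * On success, store its byte length in *gram_len and return 1.
 * Return 0 if fewer than n characters remain or a lead byte is invalid. */
int utf8_ngram_at(const char *text, size_t len, size_t start, size_t n,
                  size_t *gram_len)
{
    size_t pos = start;
    size_t i;

    if (n == 0) {
        return 0;
    }
    for (i = 0; i < n; i++) {
        size_t clen;

        if (pos >= len) {
            return 0;
        }
        clen = utf8_char_len((unsigned char) text[pos]);
        if (clen == 0 || pos + clen > len) {
            return 0;
        }
        pos += clen;
    }
    *gram_len = pos - start;
    return 1;
}

/* Count the overlapping n-grams in text: one per character position
 * that still has n characters available. */
size_t utf8_count_ngrams(const char *text, size_t len, size_t n)
{
    size_t start = 0;
    size_t count = 0;
    size_t gram_len;

    while (start < len && utf8_ngram_at(text, len, start, n, &gram_len)) {
        count++;
        /* Slide the window forward by one character, not one byte. */
        start += utf8_char_len((unsigned char) text[start]);
    }
    return count;
}
```

For a three-character string such as "a" followed by two three-byte characters (7 bytes in total), bi-gram splitting yields two tokens: one of 4 bytes starting at offset 0 and one of 6 bytes starting at offset 1.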
<br />
4.2.2 euc_jp<br />
euc_jp<br />
<br />
£<br />
<br />
¤¨¨ <br />
¤¤ EUC-JP ¡ N-gram<br />
¨¨¦¦<br />
<br />
utf8 ¨¦¤¨<br />
¨<br />
N<br />
¨ ¡<br />
bi-gram<br />
<br />
6<br />
¤¦<br />
<br />
<br />
utf8 mecab_euc_jp<br />
13 11<br />
433 416<br />
<br />
¨ 1 utf8<br />
mecab_euc_jp <br />
<br />
4.2.3 mecab_euc_jp<br />
mecab_euc_jp ¤¨¦<br />
¦¨¦ ¦<br />
EUC-JP<br />
¡ ¤<br />
MeCab<br />
¨¦<br />
<br />
¨¦¤¤¨¤<br />
<br />
£<br />
¨<br />
<br />
<br />
¨<br />
5<br />
<br />
<br />
5.1<br />
N-gram<br />
<br />
utf8.c MeCab *5 <br />
¨¤¤¦¤<br />
¤<br />
¦ mecab_euc_jp.c <br />
<br />
¦<br />
C<br />
¤¤¤ ¡<br />
5.2 <br />
<br />
£<br />
<br />
¤ <br />
¦<br />
5.2.1<br />
Rast <br />
utf8 ¡ mecab_euc_jp <br />
¨<br />
<br />
¦¦¤<br />
<br />
¦<br />
<br />
Estraier[8] Namazu[7]<br />
Rast<br />
<br />
<br />
<br />
¤<br />
<br />
ruby-list@ruby-lang.org ruby-dev@ruby-lang.org<br />
<br />
<br />
¤<br />
63126 <br />
¦ <br />
63126<br />
¥§ <br />
<br />
¦¤¦¨¦¨¦¦<br />
*5<br />
¤¦ ChaSen ¦<br />
<br />
http://chasen.org/~taku/software/mecab/
CPU Pentium 4 3GHz<br />
<br />
<br />
<br />
1GB<br />
Ultra ATA 100<br />
2 <br />
¨¤¡<br />
63126<br />
¤¨¤<br />
¡ ¦<br />
¨¤<br />
Ruby <br />
tk 5<br />
¤¥<br />
¨<br />
<br />
¤<br />
§<br />
3 ¨<br />
<br />
4 ¤¤<br />
2<br />
¨<br />
<br />
<br />
¡¨ cold start <br />
¤ ¤¤<br />
<br />
¡¥<br />
<br />
hot start<br />
¤¦<br />
¨ <br />
N/A<br />
¤¡<br />
¡<br />
<br />
<br />
<br />
Estraier 1.2.28 Namazu 2.0.14 <br />
<br />
5.2.2 <br />
63126 Rast Estraier<br />
3.5 ¦<br />
¤<br />
<br />
¦¤ <br />
Rast Estraier<br />
a<br />
<br />
<br />
¦<br />
<br />
¦<br />
¤¦<br />
¤<br />
<br />
¤¤¦<br />
<br />
¡ ¦¡¨ Rast N ¢¤£<br />
<br />
<br />
<br />
<br />
<br />
¨¤<br />
<br />
<br />
¦¥¤§¦©¨<br />
<br />
§ ¥¤§<br />
<br />
§¡¤¤¥¤§¤<br />
<br />
a ¦<br />
ID ¤<br />
<br />
<br />
<br />
¦¨¤<br />
¤¥§<br />
¤<br />
¥§¤¨<br />
¦<br />
¥§<br />
<br />
©<br />
<br />
<br />
7<br />
¤¤ ¡¨¤ <br />
6 <br />
¤ ¦¦<br />
Rast <br />
<br />
<br />
©¦<br />
¤<br />
<br />
<br />
mecab_euc_jp<br />
©<br />
©<br />
euc_jp<br />
¤ <br />
<br />
© © <br />
©©<br />
<br />
© <br />
<br />
© ¤¦¡ <br />
©¦<br />
<br />
¤¤<br />
¤ <br />
<br />
© <br />
<br />
<br />
<br />
¤¦<br />
¡ <br />
<br />
<br />
<br />
<br />
¡ ¤ <br />
<br />
©<br />
<br />
7 <br />
<br />
<br />
<br />
Rast ¤<br />
<br />
¤<br />
¦<br />
<br />
¦ <br />
©<br />
<br />
<br />
¦ © ¤<br />
<br />
Rast<br />
<br />
¡ <br />
<br />
<br />
<br />
©¤<br />
<br />
<br />
¤<br />
http://www.netlab.jp/rast/<br />
Rast<br />
¤ <br />
Rast<br />
¤¤ ¡¤ <br />
<br />
©<br />
<br />
<br />
<br />
¤<br />
<br />
<br />
<br />
<br />
¡ <br />
Rast<br />
16 <br />
¤<br />
©¡ £¢¥¤¡¦<br />
Rast<br />
¡© §£¨ ¡ <br />
IPA<br />
£ £ <br />
<br />
<br />
<br />
<br />
<br />
¦
£ 6<br />
¡¡ 1<br />
¤¤<br />
<br />
<br />
<br />
<br />
¤<br />
¤<br />
<br />
<br />
| | Estraier | Namazu | Rast (N-gram utf8) |<br />
| (sec) | 156.19 | 626.09 | 531.33 |<br />
| (sec) | 8.699 | 23.808 | 7.938018 |<br />
| | 139Mb | 116Mb | 324Mb |<br />
Table 3 (63126 documents)<br />
| Query | Measure | Estraier | Namazu | Rast (N-gram utf8) |<br />
| Ruby | hits | 29547 | N/A | 50066 |<br />
| Ruby | cold start (sec) | 0.4318912 | 0.0347416 | 0.665318 |<br />
| Ruby | hot start (sec) | 0.076197 | 0.0010586 | 0.5553598 |<br />
| tk | hits | 2232 | 612 | 4739 |<br />
| tk | cold start (sec) | 0.3576128 | 0.3562174 | 0.51231968 |<br />
| tk | hot start (sec) | 0.0183802 | 0.0346768 | 0.1419772 |<br />
| tcl/tk | hits | 338 | 109 | 522 |<br />
| tcl/tk | cold start (sec) | 0.3760914 | 0.2879038 | 0.557817832 |<br />
| tcl/tk | hot start (sec) | 0.016116 | 0.0080904 | 0.019303 |<br />
| emacs | hits | 653 | 94 | 863 |<br />
| emacs | cold start (sec) | 0.330301 | 0.2961386 | 0.2069632 |<br />
| emacs | hot start (sec) | 0.0110054 | 0.0076692 | 0.0129308 |<br />
| a | hits | N/A | 3045 | 61918 |<br />
| a | cold start (sec) | 0.2709114 | 0.6318622 | 15.1325726 |<br />
| a | hot start (sec) | 0.009056 | 0.180159 | 10.3322752 |<br />
| | hits | N/A | 86 | 12991 |<br />
| | cold start (sec) | 0.260278 | 0.4412032 | 0.9149052 |<br />
| | hot start (sec) | 0.008311 | 0.0063548 | 0.2162194 |<br />
Table 4 (63126 documents)<br />
£ ¨ <br />
[1] , : “Unicode <br />
¤¡ £ <br />
¤<br />
N-gram ”,<br />
, <br />
£¡<br />
¦<br />
¡ <br />
<br />
[2] ¡¡¡ , £¡ , , : ¡¡ “ <br />
¤©£¤£<br />
1999.<br />
¤<br />
[3] : “¤£¦<br />
[4] <br />
, 1999.<br />
<br />
£<br />
£<br />
<br />
”,<br />
<br />
¦¡ , 2000-NL-136-17, 2000.<br />
© ”, DEWS2002,<br />
”, <br />
, 2002.<br />
£<br />
: “¤£¦<br />
<br />
<br />
[5] “Sleepycat Software: Products: Berkeley DB”, http://www.sleepycat.com/products/db.shtml<br />
[6] “Unicode Home Page”, http://www.unicode.org/<br />
[7] “Namazu”, http://www.namazu.org/<br />
[8] “Estraier: a personal full-text search system”, http://estraier.sourceforge.net/<br />