DATA COLLECTION AND EVALUATION OF AURORA-2 JAPANESE CORPUS
Satoshi Nakamura 1, Kazumasa Yamamoto 2, Kazuya Takeda 3, Shingo Kuroiwa 4,
Norihide Kitaoka 5, Takeshi Yamada 6, Mitsunori Mizumachi 1, Takanobu Nishiura 7,
Masakiyo Fujimoto 8, Akira Saso 9, Toshiki Endo 1
1 ATR Spoken Language Translation Research Labs., 2 Shinshu University, 3 Nagoya University,
4 University of Tokushima, 5 Toyohashi University of Technology, 6 University of Tsukuba,
7 Wakayama University, 8 Ryukoku University, 9 Advanced Institute of Science and Technology
{slp-noise-wg@slt.atr.co.jp}
ABSTRACT
Speech recognition systems still need to be improved for use in
noisy environments. For this improvement, the development of
standard evaluation corpora and assessment technologies is
essential. Recently, the AURORA-2 and AURORA-3 corpora and their
evaluation scenarios have had a significant impact on noisy speech
recognition research. This paper introduces a Japanese noisy
speech corpus and its evaluation scripts, called AURORA-2J.
AURORA-2J is a Japanese connected digits corpus. The data collection
and evaluation scenarios are designed in the same way as
AURORA-2, with the help of the ETSI AURORA group. Furthermore,
we have collected an in-car speech corpus similar to AURORA-3.
The in-car speech corpus includes Japanese connected digits and
command words collected in a moving car. This paper describes
the data collection, the baseline scripts, and the baseline performance.
1. INTRODUCTION
The recent progress of speech recognition technology has been
brought about by the advent of statistical modeling and large-scale
corpora. Furthermore, it is known that progress was accelerated
by the U.S. DARPA projects initiated in the late ’80s,
in which project participants competitively developed speech
recognition systems on the same task using the same training and
test corpora.
However, current speech recognition performance must still
be improved if systems are to be exposed to noisy environments,
where speech recognition applications might be used in practice.
Robustness to acoustic noise is therefore an emerging and crucial
problem to be solved for speech recognition systems.
With regard to the noise robustness problem, there have been
two evaluation projects, SPINE1,2 [1] and AURORA [2]. The
SPINE (SPeech recognition In Noisy Environment) project was
organized by U.S. DARPA, with SPINE1 in 2000 and SPINE2 in
2001. The task was recognition of spontaneous English dialog
between an operator and a soldier in a noisy field, i.e., spontaneous
continuous speech recognition in noisy environments. The results of the
project brought many improvements to continuous noisy speech
recognition, though the task seems quite specialized and somewhat
difficult to handle.
On the other hand, the ETSI AURORA group initiated a special
session at the EUROSPEECH conference. The group is actively
working to develop standard technologies under ETSI for distributed
speech recognition [3]. In parallel with its standardization
activities, the group has distributed a noisy connected speech corpus
based on TI digits, with baseline HTK scripts, to academic researchers
for further noisy speech recognition research. So far,
AURORA-2, a connected digit corpus with additive noise, and
AURORA-3, an in-car noisy digit and word corpus, have been distributed
with HTK scripts, which can be used to obtain baseline performance
and relative improvements over the baseline results [4, 5]. The advantages
of AURORA are that 1) the connected digit task is relatively
small compared to spontaneous speech, and 2) the baseline
performance can be easily obtained with the attached HTK scripts.
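As an illustration of what "relative improvement over the baseline" can mean, the sketch below uses one common reading: the relative reduction in word error rate, computed from word accuracies. The function name and the 60.00% system accuracy are hypothetical placeholders; only the 46.17% baseline figure comes from Table 1 (clean training, overall average). The exact definition used by a given evaluation should be checked against its own scripts.

```python
def relative_improvement(baseline_acc, system_acc):
    """Relative reduction in error rate (%), given word accuracies in percent.

    Assumption: error rate is taken as 100 - accuracy.
    """
    baseline_err = 100.0 - baseline_acc
    system_err = 100.0 - system_acc
    return 100.0 * (baseline_err - system_err) / baseline_err


# 46.17 is the clean-training overall average from Table 1.
# 60.00 is a made-up accuracy for a hypothetical improved front-end.
improvement = relative_improvement(46.17, 60.00)
print(f"{improvement:.2f}% relative improvement")
```

With this definition, a system that matches the baseline scores 0% and a system with 100% accuracy scores 100%.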
The authors voluntarily organized a special working group
in October 2001 under the auspices of the Information Processing
Society of Japan in order to assess speech recognition technology
in noisy environments. The focus of the working group included
the planning of comprehensive fundamental assessments
of noisy speech recognition, standardized corpus collection, development
of evaluation strategies, and distribution of standardized
processing modules. To begin with, we decided to follow the
AURORA-2 corpus collection and evaluation, since the task is
small enough and the evaluation scheme is quite clear. For the
Japanese AURORA-2, AURORA-2J, we simply translated the
English digits into Japanese digits and added the same noise. Furthermore,
we collected in-car Japanese connected digits and
command word data similar to AURORA-3.
In this paper, Section 2 describes the AURORA-2J corpus collection,
its evaluation scripts, and baseline results. The in-car speech
corpus is described in Section 3. Section 4 describes the categories
within which developed noisy speech recognition systems should
be fairly compared. Finally, Section 5 summarizes the paper and
describes future directions.
Table 1. AURORA-2J baseline recognition results.

Clean Training (%Acc)
               Set A                                                  Set B                                                  Set C                            Overall
SNR           Subway     Babble        Car Exhibition    Average Restaurant     Street    Airport    Station    Average   Subway M   Street M    Average    Average
Clean          99.72      99.58      99.82      99.60      99.68      99.72      99.58      99.82      99.60      99.68      99.82      99.67      99.75      99.69
20 dB          96.90      80.80      89.59      95.90      90.80      84.86      88.51      82.17      82.29      84.46      91.50      92.26      91.88      88.48
15 dB          76.27      56.83      58.16      75.41      66.67      61.10      65.39      57.80      55.01      59.83      70.80      75.39      73.10      65.22
10 dB          47.16      38.63      38.86      41.65      41.58      40.50      42.59      41.93      37.98      40.75      43.51      47.28      45.40      42.01
5 dB           25.27      23.16      20.79      21.97      22.80      21.06      23.79      26.16      22.25      23.32      25.91      25.03      25.47      23.54
0 dB           12.28       8.16      10.38      11.97      10.70       9.89      13.75      12.68       9.84      11.54      13.72      13.60      13.66      11.63
-5 dB           7.43       4.35       7.25       7.90       6.73       1.90       8.56       4.77       5.46       5.17       8.81       8.74       8.78       6.52
Average        51.58      41.52      43.56      49.38      46.51      43.48      46.81      44.15      41.47      43.98      49.09      50.71      49.90      46.17

Multicondition Training (%Acc)
               Set A                                                  Set B                                                  Set C                            Overall
SNR           Subway     Babble        Car Exhibition    Average Restaurant     Street    Airport    Station    Average   Subway M   Street M    Average    Average
Clean          99.79      99.64      99.67      99.75      99.71      99.79      99.64      99.67      99.75      99.71      99.69      99.55      99.62      99.69
20 dB          99.63      99.67      99.70      99.57      99.64      98.62      99.46      98.90      97.99      98.74      99.51      99.40      99.46      99.25
15 dB          99.26      99.40      99.37      98.83      99.22      96.90      97.58      96.45      94.11      96.26      99.17      98.37      98.77      97.94
10 dB          98.25      97.43      97.94      97.38      97.75      86.83      89.57      91.29      84.94      88.16      96.90      93.71      95.31      93.42
5 dB           93.89      89.78      92.16      92.32      92.04      68.56      71.28      77.72      76.18      73.44      87.47      80.86      84.17      83.02
0 dB           74.85      62.48      64.96      73.68      68.99      31.87      48.22      49.36      51.90      45.34      52.32      50.57      51.45      56.02
-5 dB          30.46      25.12      23.17      29.56      27.08      -3.78      18.65      16.70      16.69      12.07      21.31      14.96      18.14      19.28
Average        93.18      89.75      90.83      92.36      91.53      76.56      81.22      82.74      81.02      80.39      87.07      84.58      85.83      85.93
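The summary cells of Table 1 can be cross-checked with a few lines of arithmetic. The sketch below is a reading of the table rather than a description of the official scoring scripts: the printed figures appear consistent with the "Average" row being the mean over the 20 dB to 0 dB rows (excluding Clean and -5 dB), and with the "Overall" column being the mean over all ten noise conditions of Sets A, B, and C. All numbers are copied from the clean-training half of the table.

```python
def mean(values):
    return sum(values) / len(values)


# Subway column, clean training, SNR 20 dB down to 0 dB.
subway_20_to_0 = [96.90, 76.27, 47.16, 25.27, 12.28]
subway_average = mean(subway_20_to_0)        # Table 1 lists 51.58

# 20 dB row, clean training, one value per noise condition.
set_a_20db = [96.90, 80.80, 89.59, 95.90]    # Subway, Babble, Car, Exhibition
set_b_20db = [84.86, 88.51, 82.17, 82.29]    # Restaurant, Street, Airport, Station
set_c_20db = [91.50, 92.26]                  # Subway M, Street M

set_a_average = mean(set_a_20db)             # Table 1 lists 90.80
overall_20db = mean(set_a_20db + set_b_20db + set_c_20db)  # Table 1 lists 88.48

print(subway_average, set_a_average, overall_20db)
```

Because Set C has only two conditions, the overall figure weights Sets A, B, and C as 4:4:2 rather than equally.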
2. AURORA-2J
2.1. Japanese digits pronunciation
AURORA-2J has the same design as AURORA-2, but the digits are uttered in
Japanese. The number of speakers is the same and the digit strings
for each speaker are identical. Table 2 shows the pronunciations
of the eleven digits in AURORA-2 and AURORA-2J. Speakers were
requested to pronounce digits as specified in this table. These pronunciations
were assigned considering their frequency of occurrence when
uttering telephone numbers and credit card numbers. Although vowel
lengthening sometimes occurs in /ni/ and /go/, the short and lengthened
pronunciations are not distinguished in AURORA-2J.
Sometimes, “4” is read as /shi/, “7” is read as /shichi/, and
“0” is read as /rei/ in Japanese. However, these pronunciations are
rarely used when a telephone number or a credit card number is
given over the telephone. Hence, AURORA-2J does not employ these
pronunciations.
Table 2. Pronunciations of digits
Digit   AURORA-2   AURORA-2J
1       one        /ichi/
2       two        /ni/
3       three      /saN/
4       four       /yoN/
5       five       /go/
6       six        /roku/
7       seven      /nana/
8       eight      /hachi/
9       nine       /kyuH/
0(Z)    zero       /zero/
0(O)    oh         /maru/
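Table 2 can be read as a lookup from digit symbols to the pronunciation each speaker was asked to use. The following is a minimal sketch of that lookup; the dictionary and function names are illustrative, and the use of "Z" and "O" as the symbols for the two readings of zero follows the labels in the table.

```python
# Digit symbol -> AURORA-2J pronunciation, copied from Table 2.
AURORA_2J_PRONUNCIATION = {
    "1": "ichi", "2": "ni", "3": "saN", "4": "yoN", "5": "go",
    "6": "roku", "7": "nana", "8": "hachi", "9": "kyuH",
    "Z": "zero", "O": "maru",
}


def to_pronunciations(digit_string):
    """Map a digit string such as '3Z91O' to a list of pronunciations."""
    return [AURORA_2J_PRONUNCIATION[symbol] for symbol in digit_string.upper()]


print(to_pronunciations("3Z91O"))
# prints ['saN', 'zero', 'kyuH', 'ichi', 'maru']
```

Readings that AURORA-2J does not employ, such as /shi/ or /rei/, are deliberately absent from the mapping.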
2.2. Data recording
A headset microphone, Sennheiser MHD25, was used for recording
with a USB audio interface (Edirol UA-5) connected to a Windows
personal computer. The recording was done in a soundproof booth,
where speakers read a list of digit strings presented on a CRT monitor
connected to the PC. The speech data are stored in Microsoft WAV
format with 16-kHz sampling.
2.3. Filtering and noise adding
The AURORA-2J database follows the AURORA-2 database and
was created in exactly the same way. All programs and scripts
used here were kindly provided by the AURORA project for both
filtering speech signals and adding noise signals.
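To make the idea of adding noise at a given SNR concrete, here is a simplified sketch. It is not the AURORA tool: it measures levels as plain mean power over the whole signal and applies no channel filtering, whereas the actual programs may measure levels differently. The function names and the synthetic sine and pseudo-noise signals are illustrative assumptions.

```python
import math


def mean_power(samples):
    return sum(x * x for x in samples) / len(samples)


def add_noise_at_snr(speech, noise, target_snr_db):
    """Scale the noise so that speech power / noise power matches the target SNR."""
    target_noise_power = mean_power(speech) / (10.0 ** (target_snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def measure_snr_db(speech, noisy):
    """Recover the SNR of a mixture when the clean speech is known."""
    noise = [y - s for s, y in zip(speech, noisy)]
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


# Toy signals: a sine wave as "speech" and a deterministic pseudo-noise sequence.
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [((t * 7919) % 101 - 50) / 50.0 for t in range(8000)]

noisy = add_noise_at_snr(speech, noise, 10.0)
print(round(measure_snr_db(speech, noisy), 3))
```

The same scaling applies to any of the SNR levels listed for the corpus; only the target value changes.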
2.4. Training/Testing dataset
The design of the training and testing datasets is the same as in AURORA-2.
Two sets of training data are prepared: a clean-training
dataset and a multi-condition training dataset. The total number of
utterances is 8,440, by 110 speakers (55 male and 55 female). For
the multi-condition training dataset, four kinds of noise (Subway, Babble,
Car, Exhibition) are added to the clean speech at five SNR
levels (clean, 20 dB, 15 dB, 10 dB, 5 dB). Each combination of noise and SNR
includes 422 utterances. The G.712 filter is applied to
all the speech data.
For the testing dataset, we prepare three test sets, exactly
the same as in AURORA-2.
[Testset A] The noise condition is the same as in the multi-condition
training set. Subway, Babble, Car, and Exhibition noises are used.
[Testset B] The noise condition is different from the multi-condition
training dataset. Restaurant, Street, Airport, and Station
noises are used.
[Testset C] The channel condition is different from the training
dataset. The MIRS channel is applied to the speech data.
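The multi-condition training set described above can be checked with simple counting: four noises times five SNR levels gives 20 partitions, and 422 utterances per partition gives the stated total of 8,440. The sketch below only enumerates the condition labels; the variable names are illustrative.

```python
from itertools import product

NOISES = ["Subway", "Babble", "Car", "Exhibition"]
SNR_LEVELS = ["clean", "20dB", "15dB", "10dB", "5dB"]
UTTERANCES_PER_CONDITION = 422

# Every (noise, SNR) pair is one partition of the multi-condition training set.
conditions = list(product(NOISES, SNR_LEVELS))
total_utterances = len(conditions) * UTTERANCES_PER_CONDITION

print(len(conditions), total_utterances)
# prints "20 8440"
```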
2.5. Reference scripts and baseline performance
The reference back-end scripts are mostly based on the original
AURORA-2 baseline back-end scripts, with some modifications
introduced from the Microsoft complex baseline back-end
scripts. Other experimental conditions, including the number
of recognition units, are basically the same as the original
AURORA-2 conditions, except for the number of Gaussian mixtures
per state: 20 mixtures for digits and 36 mixtures for pause
models.
The feature vector consists of 12 MFCCs and log energy
with their corresponding delta and acceleration coefficients. Thus
a vector contains 39 components in total. These parameters
were calculated using HCopy with the same conditions as the
AURORA-2 HTK baseline. The baseline recognition performance
of AURORA-2J is shown in Table 1.
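The 39-component count follows from 13 static values (12 MFCCs plus log energy) with 13 delta and 13 acceleration coefficients appended. The sketch below illustrates this with an HTK-style regression formula for the dynamic features; it is not HCopy itself, the window size of 2 is an assumption (the baseline configuration may use other values), and the input is a toy ramp rather than real MFCCs.

```python
def deltas(frames, window=2):
    """Regression deltas: sum(theta * (c[t+theta] - c[t-theta])) / (2 * sum(theta^2)).

    Frames beyond either end are replaced by the first or last frame.
    """
    norm = 2.0 * sum(theta * theta for theta in range(1, window + 1))
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for theta in range(1, window + 1):
                ahead = frames[min(t + theta, last)][k]
                behind = frames[max(t - theta, 0)][k]
                acc += theta * (ahead - behind)
            row.append(acc / norm)
        result.append(row)
    return result


def add_dynamic_features(static):
    """Append delta and acceleration coefficients to each static frame."""
    delta = deltas(static)
    accel = deltas(delta)  # accelerations are deltas of the deltas
    return [s + d + a for s, d, a in zip(static, delta, accel)]


# Toy input: 5 frames of 13 static values, increasing by 1 per frame.
static = [[float(t + k) for k in range(13)] for t in range(5)]
features = add_dynamic_features(static)
print(len(features[0]))  # 13 static + 13 delta + 13 acceleration
```

For the linear ramp used here, the delta at the middle frame equals the slope of the ramp, which is a quick sanity check on the regression formula.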
3. IN-CAR SPEECH CORPUS
As a standard corpus for in-car speech technologies, we collected
an in-car speech corpus with the aim of distributing a corpus similar
to AURORA-3. This corpus is a part of the CIAIR (Center for
Integrated Acoustic Information Research) in-car speech corpora
collected at Nagoya University [6]. About 38,350 word utterances
by 80 speakers were recorded while driving.
3.1. Vocabulary
The basic vocabulary consists of 50 isolated word utterances, one
4-digit isolated utterance, one 16-digit-string utterance, and
four kinds of single-digit utterances and 10-digit-string utterances.
The contents of the utterances are listed in Table 3. The average
length of the 50 words is 10.2 phonemes (ranging from 3 to 25
phonemes).
Table 3. The length and manner of utterances for the digit strings.
Content                                                  Format                Count
Words, isolated utterance                                                      x50
4 digits, with pauses at every digit                     X-X-X-X               x4
10 digits, with pauses after the first 3 and 6 digits    XXX-XXX-XXXX          x4
16 digits, with pauses at every four digits              XXXX-XXXX-XXXX-XXXX   x1
3.2. In-car data collection
The data collection was performed using a specially designed data
collection vehicle that can acquire up to 16 channels of audio signals,
three channels of video, and
other driving-related information, i.e., car position, vehicle speed,
engine speed, brake and acceleration pedals, and steering wheel.
Fig. 1. Microphone positions for data collection: Side view (top)
and top view (bottom).
Five microphones were placed around the driver’s seat, as
shown in Figure 1, where the top and side views of the driver’s
seat are depicted. Microphone positions are marked by the black
dots. While microphones #3 and #4 were located on the dashboard,
#5, #6 and #7 were attached to the ceiling. Microphone #6 was
closest to the speaker. The microphone used in this data collection
is the SONY ECM-77B omnidirectional electret microphone. In
addition to these distributed microphones, the driver wore a headset
with a close-talking microphone (#1). Since the speaker drove
while speaking, audio prompting through headphones was used.
The utterances were collected under 13 carefully controlled
driving conditions, i.e., combinations of three car speeds
(idle, driving in a city area, and driving on an expressway) and five
car conditions (fan on (hi/lo), CD player on, open window, and
normal driving condition), as shown in Table 4. The overall average
SNR is roughly 13 dB, although the SNR ranges from
0 dB to 25 dB. The long-term spectra of these noise conditions are
shown in Fig. 2. We are designing the training and testing datasets and
the evaluation baseline scripts.
Table 4. Recording conditions in car
Idling               Normal, CD Player On, Hazard On
City Area Driving    Normal, CD Player On, Air Conditioner On, Window Open,
                     CD Player On + Air Conditioner On + Window Half Open
Expressway Driving   Normal, CD Player On, Air Conditioner On, Window Open,
                     CD Player On + Air Conditioner On + Window Slightly Open
[Figure 2: six panels showing the long-term spectrum (dB) against frequency
(0 to 8000 Hz) for idling, low speed, and high speed, each with the
close-talking microphone and the hands-free microphone.]
Fig. 2. Long-term average spectra for various driving conditions in car.
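The count of 13 driving conditions can be reproduced by listing the entries of Table 4. The sketch below does so with a plain dictionary; the structure and names are illustrative. Note that a full cross product of three car speeds and five car conditions would give 15 combinations, so the 13 conditions appear to correspond to the entries actually listed in the table (3 for idling, 5 each for city and expressway driving).

```python
# Recording conditions transcribed from Table 4.
RECORDING_CONDITIONS = {
    "Idling": [
        "Normal",
        "CD Player On",
        "Hazard On",
    ],
    "City Area Driving": [
        "Normal",
        "CD Player On",
        "Air Conditioner On",
        "Window Open",
        "CD Player On + Air Conditioner On + Window Half Open",
    ],
    "Expressway Driving": [
        "Normal",
        "CD Player On",
        "Air Conditioner On",
        "Window Open",
        "CD Player On + Air Conditioner On + Window Slightly Open",
    ],
}

total_conditions = sum(len(items) for items in RECORDING_CONDITIONS.values())
print(total_conditions)
# prints 13
```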
4. EVALUATION CATEGORIES
Properly speaking, the AURORA project aims at developing and
evaluating front-ends for recognizers. However, in some papers
reported so far, many changes to the back-end HTK scripts were introduced,
such as using extra data not included in AURORA, increasing
the number of mixtures, using HMMs that were not
whole-word models, and so on. The recognition results obtained with
these methods cannot be fairly compared with those of methods using the
original back-end scripts. Therefore, we propose evaluation categories
in this paper. A method is compared with other methods
only within the same category. According to the degree of change
in the back-end scripts from the original baseline, users declare the
category to which their method belongs from the following categories:
Category 0. No changes to the back-end HTK scripts.
Category 1. If the HMM topology is the same as in the baseline
scripts, any training process is allowed. Discriminative
training can be introduced in this category. The computational
cost in the recognition phase should be the same
as that of the baseline.
Category 2. If the HMM topology is the same, adaptation processes
can be introduced using some testing data. Speaker
or environment adaptation, and PMC with a one-state noise
model, are allowed in this category. An increase in the
computational cost may be caused only by the adaptation
process.
Category 3. Changes in the standard HMM topology. A different
number of mixtures and states is allowed. However,
the recognition units should be whole-word models. PMC
with a noise model of more than one state can be included in
this category.
Category 4. Any process is allowed as long as the computational
cost is under a specified limit. For example, a complex
structure model can be used with low-dimensional feature
vectors.
Category 5. Any process with any computational cost is allowed.
Category B. The use of any training data not included in AURORA
is allowed, not only speech data but also environmental
noise data. Of course, the evaluation data is AURORA.
This category essentially differs from Categories
1-5.
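The rule that methods are compared only within the same category can be sketched as a simple grouping step. In the example below, the two baseline accuracies are the overall averages from Table 1; the other system names and accuracies are hypothetical placeholders invented purely to exercise the code, and the function name is illustrative.

```python
from collections import defaultdict

VALID_CATEGORIES = {"0", "1", "2", "3", "4", "5", "B"}


def rank_within_categories(results):
    """Group (name, category, accuracy) entries and rank each group by accuracy."""
    grouped = defaultdict(list)
    for name, category, accuracy in results:
        if category not in VALID_CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        grouped[category].append((name, accuracy))
    return {
        category: sorted(entries, key=lambda entry: entry[1], reverse=True)
        for category, entries in grouped.items()
    }


results = [
    ("baseline, clean training", "0", 46.17),            # from Table 1
    ("baseline, multicondition training", "0", 85.93),   # from Table 1
    ("hypothetical system X", "3", 90.00),               # placeholder value
    ("hypothetical system Y", "3", 88.00),               # placeholder value
]

for category, entries in sorted(rank_within_categories(results).items()):
    print(category, entries)
```

A system declared in Category 3 is never ranked against a Category 0 result, which mirrors the intent of the declaration scheme.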
5. CONCLUSION
In this paper, we introduced AURORA-2J, the first version of a
Japanese evaluation set for noisy speech recognition, and the baseline
performance of the bundled evaluation scripts. The database can
be accessed from [9]. We also mentioned our plan for further development.
In addition to the Japanese AURORA-2 and in-car speech
data, isolated word and digit-string utterances recorded under
controlled actual noisy environments, and AURORA-2.5J, which
is AURORA-2J with Lombard effects, are to be available soon.
The latter database will consist of noise-free speech uttered by speakers
listening to the same noises as AURORA-2J through headphones.
With this database, we can analyze the Lombard effects
separately from additive and convolutional noises. We are also
carefully designing a corpus like AURORA-4J. This database will
be built on large-vocabulary continuous speech recognition
tasks, and we plan to introduce new, challenging, but realistic
noises. We are planning to distribute the AURORA-2J databases at
this workshop.
In addition to developing the databases, we are also working
on the comparison and integration of noise reduction algorithms on
the AURORA-2 and AURORA-2J databases [8]. We implemented
these algorithms as modules, which can be easily combined and
applied to the AURORA evaluation process. In the near future, these
modules will also be distributed with the developed corpus.
6. ACKNOWLEDGEMENTS
The authors thank Dr. David Pearce of the AURORA group
for his help with these activities. This work was supported in part
by the Telecommunications Advancement Organization of Japan.
The present study was conducted using the AURORA-2J database developed
by the IPSJ-SIG SLP Noisy Speech Recognition Evaluation
Working Group.
7. REFERENCES
[1] http://elazar.itd.nrl.navy.mil/spine/
[2] http://eurospeech2001.org/ese/NoiseRobust/index.html,
http://www.elda.fr/proj/aurora1.html,
http://www.elda.fr/proj/aurora2.html
[3] ETSI standard document, “Speech processing, transmission
and quality aspects (STQ); Distributed speech recognition;
Front-end feature extraction algorithm; Compression algorithm”,
ETSI ES 201 108 v1.1.2 (2000-04), 2000.
[4] H. G. Hirsch, D. Pearce, “The AURORA experimental framework
for the performance evaluations of speech recognition
systems under noisy conditions”, ISCA ITRW ASR2000,
September 2000.
[5] D. Pearce, “Developing the ETSI AURORA advanced distributed
speech recognition front-end & What next”, Proc.
EUROSPEECH2001, 2001.
[6] N. Kawaguchi, K. Takeda, et al., “Construction of Speech
Corpus in Moving Car Environment”, Proc. International
Conference on Spoken Language Processing, pp. 1281-1284,
2000 (ICSLP2000, Beijing, China).
[7] K. Yamamoto, S. Nakamura, K. Takeda, S. Kuroiwa, N. Kitaoka,
T. Yamada, M. Mizumachi, T. Nishiura, M. Fujimoto,
“AURORA-2J/AURORA-3J Corpus and Evaluation Baseline
(in Japanese)”, Technical Report IPSJ SIG-SLP 2003-SLP-47,
July 2003.
[8] T. Yamada, J. Okada, K. Takeda, N. Kitaoka, M. Fujimoto,
S. Kuroiwa, K. Yamamoto, T. Nishiura, M. Mizumachi, S.
Nakamura, “Integration of noise reduction algorithms for
Aurora 2 database”, Eurospeech2003, 2003.
[9] http://sp.shinshu-u.ac.jp/%7Ekyama/SLP-WG/