Data Collection and Evaluation of AURORA-2 Japanese Corpus

DATA COLLECTION AND EVALUATION OF AURORA-2 JAPANESE CORPUS 

Satoshi Nakamura 1 , Kazumasa Yamamoto 2 , Kazuya Takeda 3 , Shingo Kuroiwa 4 , 

Norihide Kitaoka 5 , Takeshi Yamada 6 , Mitsunori Mizumachi 1 , Takanobu Nishiura 7 , 

Masakiyo Fujimoto 8 , Akira Saso 9 , Toshiki Endo 1 

1 ATR Spoken Language Translation Research Labs., 2 Shinshu University , 3 Nagoya University, 

4 University of Tokushima, 5 Toyohashi University of Technology, 6 University of Tsukuba, 

7 Wakayama University, 8 Ryukoku University, 9 Advanced Institute of Science and Technology 

{slp-noise-wg@slt.atr.co.jp} 

ABSTRACT 

Speech recognition systems still need improvement when they are

exposed to noisy environments. For this improvement, the development

of standard evaluation corpora and assessment technologies

is essential. Recently, the AURORA-2 and AURORA-3 corpora and their

evaluation scenarios have had a significant impact on noisy speech

recognition research. This paper introduces a Japanese noisy 

speech corpus and its evaluation scripts, called AURORA-2J. The 

AURORA-2J is a Japanese connected digits corpus. The data collection 

and evaluation scenarios are designed in the same way as 

AURORA-2 with the help of the ETSI AURORA group. Furthermore,

we have collected an in-car speech corpus similar to AURORA-3.

The in-car speech corpus includes Japanese connected digits and 

command words collected in a moving car. This paper describes

the data collection, the baseline scripts, and their baseline performance.

1. INTRODUCTION 

The recent progress of speech recognition technology has been 

brought about by the advent of statistical modeling and large-scale 

corpora. Progress is also known to have been accelerated

by the U.S. DARPA projects initiated in the late ’80s,

in which participants competitively developed speech

recognition systems on the same task using the same training and

test corpora.

However, current speech recognition performance must still 

be improved if the system is to be exposed to noisy environments, 

where speech recognition applications might be used in practice. 

Robustness to acoustic noise is therefore a crucial problem

still to be solved for speech recognition systems.

With regard to the noise robustness problem, there have been 

two evaluation projects, SPINE1,2[1] and AURORA[2]. The 

SPINE (SPeech recognition In Noisy Environment) project was 

organized by U.S. DARPA, with SPINE1 in 2000 and SPINE2 in 

2001. The task was recognition of spontaneous, continuous English

dialog between an operator and a soldier in a noisy field. The

project brought many improvements to continuous noisy speech

recognition, though the task is quite specialized and somewhat difficult

to handle.

On the other hand, the ETSI AURORA group initiated a special

session at the EUROSPEECH conference. They are actively

working to develop standard technologies under ETSI for distributed 

speech recognition[3]. In parallel with their standardization 

activities, they have distributed a noisy connected speech corpus 

based on TI digits with baseline HTK scripts to academic researchers 

for further noisy speech recognition research. So far, 

AURORA-2, a connected digit corpus with additive noise, and 

AURORA-3, an in-car noisy digit and word corpus, are distributed 

with HTK scripts, which can be used to get baseline performance 

and relative improvements over the baseline results[4, 5]. The advantages

of the AURORA corpora are that 1) the connected digit task is relatively

small compared to spontaneous speech, and 2) the baseline

performance can be easily reproduced with the attached HTK scripts.

The authors voluntarily organized a special working group 

in October 2001 under the auspices of the Information Processing

Society of Japan in order to assess speech recognition technology 

in noisy environments. The focus of the working group included 

the planning of comprehensive fundamental assessments 

of noisy speech recognition, standardized corpus collection, evaluation 

strategy developments, and distribution of standardized 

processing modules. To begin with, we decided to follow the 

AURORA-2 corpus collection and evaluation since the task is 

small enough and the evaluation scheme is quite clear. For the

Japanese version of AURORA-2, AURORA-2J, we simply replaced the

English digits with Japanese digits and added the same noise. Furthermore,

we collected in-car Japanese connected digit and

command word data similar to AURORA-3.

In this paper, Section 2 describes the AURORA-2J corpus collection,

its evaluation scripts, and baseline results. The in-car speech

corpus is described in Section 3. Section 4 describes the categories

within which noisy speech recognition systems should

be compared for fairness. Finally, Section 5 summarizes the paper and

describes future directions.

Table 1. AURORA-2J baseline recognition results (%Acc).

Clean Training (%Acc)

        | Set A                                       | Set B                                         | Set C                       | Overall
SNR     | Subway  Babble     Car  Exhibition  Average | Restaurant  Street  Airport  Station  Average | Subway M  Street M  Average | Average
Clean   |  99.72   99.58   99.82       99.60    99.68 |      99.72   99.58    99.82    99.60    99.68 |    99.82     99.67    99.75 |   99.69
20 dB   |  96.90   80.80   89.59       95.90    90.80 |      84.86   88.51    82.17    82.29    84.46 |    91.50     92.26    91.88 |   88.48
15 dB   |  76.27   56.83   58.16       75.41    66.67 |      61.10   65.39    57.80    55.01    59.83 |    70.80     75.39    73.10 |   65.22
10 dB   |  47.16   38.63   38.86       41.65    41.58 |      40.50   42.59    41.93    37.98    40.75 |    43.51     47.28    45.40 |   42.01
5 dB    |  25.27   23.16   20.79       21.97    22.80 |      21.06   23.79    26.16    22.25    23.32 |    25.91     25.03    25.47 |   23.54
0 dB    |  12.28    8.16   10.38       11.97    10.70 |       9.89   13.75    12.68     9.84    11.54 |    13.72     13.60    13.66 |   11.63
-5 dB   |   7.43    4.35    7.25        7.90     6.73 |       1.90    8.56     4.77     5.46     5.17 |     8.81      8.74     8.78 |    6.52
Average |  51.58   41.52   43.56       49.38    46.51 |      43.48   46.81    44.15    41.47    43.98 |    49.09     50.71    49.90 |   46.17

Multicondition Training (%Acc)

        | Set A                                       | Set B                                         | Set C                       | Overall
SNR     | Subway  Babble     Car  Exhibition  Average | Restaurant  Street  Airport  Station  Average | Subway M  Street M  Average | Average
Clean   |  99.79   99.64   99.67       99.75    99.71 |      99.79   99.64    99.67    99.75    99.71 |    99.69     99.55    99.62 |   99.69
20 dB   |  99.63   99.67   99.70       99.57    99.64 |      98.62   99.46    98.90    97.99    98.74 |    99.51     99.40    99.46 |   99.25
15 dB   |  99.26   99.40   99.37       98.83    99.22 |      96.90   97.58    96.45    94.11    96.26 |    99.17     98.37    98.77 |   97.94
10 dB   |  98.25   97.43   97.94       97.38    97.75 |      86.83   89.57    91.29    84.94    88.16 |    96.90     93.71    95.31 |   93.42
5 dB    |  93.89   89.78   92.16       92.32    92.04 |      68.56   71.28    77.72    76.18    73.44 |    87.47     80.86    84.17 |   83.02
0 dB    |  74.85   62.48   64.96       73.68    68.99 |      31.87   48.22    49.36    51.90    45.34 |    52.32     50.57    51.45 |   56.02
-5 dB   |  30.46   25.12   23.17       29.56    27.08 |      -3.78   18.65    16.70    16.69    12.07 |    21.31     14.96    18.14 |   19.28
Average |  93.18   89.75   90.83       92.36    91.53 |      76.56   81.22    82.74    81.02    80.39 |    87.07     84.58    85.83 |   85.93
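
The Average row and the Overall column of Table 1 appear to follow the usual AURORA-2 convention: the row average is taken over the 20 dB to 0 dB conditions (clean and -5 dB are excluded), and the overall figure weights sets A, B, and C by their number of noise conditions (4, 4, and 2). This reading is inferred from the numbers rather than stated in the text. The following minimal Python sketch checks it against the clean-training Subway column and the clean-training 20 dB row; the function names are illustrative only.

```python
# Clean-training accuracies (%) for the Subway column, copied from Table 1.
subway_clean_training = {
    "clean": 99.72, "20dB": 96.90, "15dB": 76.27, "10dB": 47.16,
    "5dB": 25.27, "0dB": 12.28, "-5dB": 7.43,
}

# Assumption: only these SNR conditions enter the "Average" row.
AVERAGED_SNRS = ("20dB", "15dB", "10dB", "5dB", "0dB")


def snr_average(column):
    """Mean accuracy over the 20 dB to 0 dB conditions."""
    return sum(column[snr] for snr in AVERAGED_SNRS) / len(AVERAGED_SNRS)


def overall(avg_a, avg_b, avg_c):
    """Overall average, weighting sets A, B, C by 4, 4, 2 noise conditions."""
    return (4 * avg_a + 4 * avg_b + 2 * avg_c) / 10


subway_avg = round(snr_average(subway_clean_training), 2)
# Set averages of the clean-training 20 dB row in Table 1.
overall_20db = round(overall(90.80, 84.46, 91.88), 2)
print(subway_avg, overall_20db)
```

Both values agree with the corresponding entries in Table 1 (51.58 and 88.48).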

2. AURORA-2J 

2.1. Japanese digits pronunciation 

AURORA-2J has the same design as AURORA-2, but is uttered in

Japanese. The number of speakers is the same and the digit strings

for each speaker are identical. Table 2 shows the pronunciations

of eleven digits in AURORA-2 and AURORA-2J. Speakers were

requested to pronounce digits as specified in this table. These pronunciations

were chosen considering their frequency of occurrence in

uttering telephone numbers and credit card numbers. Although vowel

lengthening sometimes occurs in /ni/ and /go/, the lengthened and

unlengthened pronunciations are not distinguished in AURORA-2J.

Sometimes, “4” is read as /shi/, “7” is read as /shichi/, and 

“0” is read as /rei/ in Japanese. However, these pronunciations are

rarely used when a telephone number or a credit card number is

given over the telephone. Hence, AURORA-2J does not employ these

pronunciations. 

Table 2. Pronunciations of digits 

Digit AURORA-2 AURORA-2J 

1 one /ichi/ 

2 two /ni/ 

3 three /saN/ 

4 four /yoN/ 

5 five /go/ 

6 six /roku/ 

7 seven /nana/ 

8 eight /hachi/ 

9 nine /kyuH/ 

0(Z) zero /zero/ 

0(O) oh /maru/ 

2.2. Data recording 

A Sennheiser MHD25 headset microphone was used for recording,

with a USB audio interface (Edirol UA-5) connected to a Windows

personal computer. The recording was done in a soundproof booth,

where speakers read a list of digit strings presented on a CRT monitor

connected to the PC. The final file format of the speech

data is the Microsoft WAV format with 16-kHz sampling.

2.3. Filtering and noise adding 

The AURORA-2J database follows the AURORA-2 database, and 

was created in exactly the same way. All programs and scripts 

used here were kindly provided by the AURORA project for both 

filtering speech signals and adding noise signals. 
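
The AURORA programs themselves are not reproduced in this paper. As a rough illustration of what adding noise at a target SNR means, the following Python sketch scales a noise signal so that the ratio of mean speech power to mean noise power matches the requested SNR. Using plain mean power is a simplifying assumption and is not claimed to be the level measurement used by the AURORA tools; the signals and function names are hypothetical.

```python
import math
import random


def mean_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` to the target SNR relative to `speech`, then add it.

    Returns the noisy signal and the scaled noise.
    """
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    scaled = [gain * n for n in noise]
    return [s + n for s, n in zip(speech, scaled)], scaled


# Toy signals (hypothetical): a sine tone as "speech", seeded Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=10)
achieved = 10 * math.log10(mean_power(speech) / mean_power(scaled_noise))
print(round(achieved, 2))
```

The achieved SNR computed from the scaled noise equals the requested value up to floating-point error.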

2.4. Training/Testing dataset 

The design of the training and testing datasets is the same as in AURORA-2.

Two sets of training data are prepared: a clean-training

dataset and a multi-condition training dataset. The total number of

utterances is 8,440, by 110 speakers (55 male and 55 female). For the

multi-condition training dataset, four kinds of noise (Subway, Babble,

Car, Exhibition) are added to the clean speech at five SNR

conditions (clean, 20 dB, 15 dB, 10 dB, 5 dB). Each combination of noise

and SNR contains 422 utterances. The G.712 filter is applied to

all the speech data.
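
The utterance counts above are mutually consistent, as the following short Python check shows. The constant names are illustrative; the numbers come from the paragraph above.

```python
from itertools import product

NOISES = ("Subway", "Babble", "Car", "Exhibition")
SNRS = ("clean", "20dB", "15dB", "10dB", "5dB")
UTTERANCES_PER_CONDITION = 422

# Every (noise, SNR) pair of the multi-condition training set.
conditions = list(product(NOISES, SNRS))
total_utterances = len(conditions) * UTTERANCES_PER_CONDITION
print(len(conditions), total_utterances)  # 20 8440
```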

For the testing dataset, we prepared three test sets that are

exactly the same as in AURORA-2.

[Testset A] The noise conditions are the same as in the multi-condition

training set. Subway, Babble, Car, and Exhibition noises are used.

[Testset B] The noise conditions are different from the multi-condition

training dataset. Restaurant, Street, Airport, and Station

noises are used.

[Testset C] The channel condition is different from the training

dataset. The MIRS channel is applied to the speech data.
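
The test-set layout can be written down as a small data structure. In the Python sketch below, the noise names for Testset C ("Subway M", "Street M") and the seven SNR conditions are taken from the column and row labels of Table 1; the variable and function names are illustrative only.

```python
# SNR conditions, taken from the row labels of Table 1.
TEST_SNRS = ("clean", "20dB", "15dB", "10dB", "5dB", "0dB", "-5dB")

# Noise types per test set; set C labels follow the Table 1 column headers.
TEST_SETS = {
    "A": ("Subway", "Babble", "Car", "Exhibition"),
    "B": ("Restaurant", "Street", "Airport", "Station"),
    "C": ("Subway M", "Street M"),
}


def list_conditions():
    """Enumerate every (test set, noise, SNR) evaluation condition."""
    return [
        (name, noise, snr)
        for name, noises in TEST_SETS.items()
        for noise in noises
        for snr in TEST_SNRS
    ]


conditions = list_conditions()
per_set = {name: sum(1 for c in conditions if c[0] == name) for name in TEST_SETS}
print(len(conditions), per_set)  # 70 {'A': 28, 'B': 28, 'C': 14}
```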

2.5. Reference scripts and baseline performance 

The reference back-end scripts were mostly based on the original 

AURORA-2 baseline back-end scripts, and some modifications 

were introduced from the Microsoft complex baseline backend 

scripts. Other experimental conditions, including the number 

of recognition units, were basically the same as the original 

AURORA-2 conditions, except for the number of Gaussian mixtures

per state: 20 mixtures for the digit models and 36 mixtures for the pause

models.

The feature vector consisted of 12 MFCCs and log energy

with their corresponding delta and acceleration coefficients. Thus 

a vector contains 39 components in total. These parameters 

were calculated using HCopy with the same conditions as the 

AURORA-2 HTK baseline. The baseline recognition performance 

of AURORA-2J is shown in Table 1. 
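
To make the 39-component count concrete, the following Python sketch appends delta and acceleration coefficients to 13 static features (12 MFCCs plus log energy). It uses the standard regression formula for delta coefficients with a window of two frames on each side and replicates the first and last frames at the edges; the window size and edge handling are assumptions for illustration and are not stated in this paper. The function names are hypothetical, and the sketch is not a substitute for HCopy.

```python
def deltas(frames, window=2):
    """Regression deltas: d_t = sum_k k*(c[t+k] - c[t-k]) / (2 * sum_k k^2)."""
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                # Assumed edge handling: replicate the first or last frame.
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Append delta and acceleration coefficients to each static frame."""
    first = deltas(static)
    second = deltas(first)
    return [s + d + a for s, d, a in zip(static, first, second)]


# Six dummy frames of 13 static features whose value rises by 1 per frame.
static = [[float(t)] * 13 for t in range(6)]
full = add_dynamic_features(static)
print(len(full[0]))  # 39
```

For the linear ramp used here, the delta of an interior frame is exactly 1, which is a quick sanity check on the regression formula.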

3. IN-CAR SPEECH CORPUS 

As a standard corpus for in-car speech technologies, we collected

an in-car speech corpus with the aim of distributing a corpus similar

to AURORA-3. This corpus is a part of the CIAIR (Center for

Integrated Acoustic Information Research) in-car speech corpora

collected at Nagoya University [6]. About 38,350 word utterances

by 80 speakers were recorded while driving.

3.1. Vocabulary 

The basic vocabulary consists of 50 isolated word utterances and

digit utterances of three lengths: 4-digit utterances with a pause after

every digit, 10-digit strings, and a 16-digit string.

The contents of the utterances are listed in Table 3. The average

phoneme length of the 50 words is 10.2 (ranging from 3 to 25

phonemes).

Table 3. The length and manner of utterances for the digit strings.

Utterance type                                         Pattern               Count
Words, isolated utterance                              -                     x50
4 digits, with pauses at every digit                   X-X-X-X               x4
10 digits, with pauses after the first 3 and 6 words   XXX-XXX-XXXX          x4
16 digits, with pauses at every four digits            XXXX-XXXX-XXXX-XXXX   x1

3.2. In-car data collection 

The data collection was performed using a specially designed data 

collection vehicle that has multiple data acquisition capabilities of 

up to 16 channels of audio signals, three channels of video and 

other driving-related information, i.e., car position, vehicle speed, 

engine speed, brake and acceleration pedals, and steering wheel. 

Fig. 1. Microphone positions for data collection: Side view (top) 

and top view (bottom). 

Five microphones were placed around the driver’s seat, as 

shown in Figure 1, where the top and the side views of the driver’s 

seat are depicted. Microphone positions are marked by the black 

dots. While microphones #3 and #4 were located on the dashboard, 

#5, #6 and #7 were attached to the ceiling. Microphone #6 was 

closest to the speaker. The microphone used in this data collection

is the SONY ECM-77B omnidirectional electret microphone. In

addition to these distributed microphones, the driver wore a headset 

with a close-talking microphone (#1). Since the speaker drove 

while speaking, audio prompting through headphones was used. 

The utterances were collected under 13 carefully controlled

driving conditions, i.e., combinations of three car speeds

(idling, driving in a city area, and driving on an expressway) and five

car conditions (fan on (hi/lo), CD player on, open window, and

normal driving condition), as shown in Table 4. The overall average

SNR is roughly 13 dB, although the SNR ranges from

0 dB to 25 dB. The long-term spectra of these noise conditions are

shown in Fig. 2. We are currently designing the training and testing

datasets and the evaluation baseline scripts.

Table 4. Recording conditions in car

Idling               Normal, CD Player On, Hazard On
City Area Driving    Normal, CD Player On, Air Conditioner On, Window Open, CD Player On + Air Conditioner On + Window Half Open
Expressway Driving   Normal, CD Player On, Air Conditioner On, Window Open, CD Player On + Air Conditioner On + Window Slightly Open

[Figure 2: six panels plotting the long-term spectrum (dB) against frequency (Hz, 0 to 8000) for idling, low speed, and high speed, each recorded with the close-talking microphone and with the hands-free microphone.]

Fig. 2. Long-term average spectra for various driving conditions in car.

4. EVALUATION CATEGORIES 

Properly speaking, the AURORA project aims at developing and 

evaluating front-ends for recognizers. However, in some papers 

reported so far, many changes to back-end HTK scripts were introduced, 

such as using extra data not included in AURORA, increasing 

the number of mixtures, using HMMs which were not 

whole word models, and so on. The recognition results using 

these methods cannot be fairly compared with methods using the 

original back-end scripts. Therefore, we propose evaluation categories 

in this paper. A method is compared with other methods 

only within the same category. According to the degree of change

in the back-end scripts from the original baseline, users declare the

category to which their method belongs from the following categories:

Category 0. No changes to back-end HTK scripts. 

Category 1. If the HMM topology is the same as in the baseline

scripts, any training process is allowed. Discriminative

training can be introduced in this category. The computational

cost in the recognition phase should be the same

as that of the baseline.

Category 2. If the HMM topology is the same, adaptation processes

that use some of the testing data can be introduced. Speaker

or environment adaptation and PMC with a one-state noise

model are allowed in this category. Any increase in the

computational cost should be caused only by the adaptation

process.

Category 3. Changes in the standard HMM topology. A different

number of mixtures and states is allowed. However,

the recognition units should be whole-word models. PMC

with a noise model of more than one state can be included in

this category.

Category 4. Any process is allowed as long as the computational

cost is under a specified limit. For example, a complex

model structure can be used with low-dimensional feature

vectors.

Category 5. Any process with any computational cost will be allowed. 

Category B. The use of any training data not included in AURORA

is allowed, including not only speech data but also environmental

noise data. The evaluation data must still be the AURORA data.

This category essentially differs from Categories

1-5.
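
The rule that a method is compared only with methods in the same category can be sketched as a simple grouping step. In the Python example below, the category labels follow the list above; the system names and all accuracies other than the clean-training baseline average from Table 1 (46.17) are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Category labels as declared in the list above.
CATEGORIES = ("0", "1", "2", "3", "4", "5", "B")


def rank_within_categories(results):
    """Group (system, category, accuracy) triples and rank inside each group."""
    grouped = defaultdict(list)
    for name, category, accuracy in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        grouped[category].append((name, accuracy))
    return {
        category: sorted(entries, key=lambda entry: entry[1], reverse=True)
        for category, entries in grouped.items()
    }


# Hypothetical submissions, except the baseline figure taken from Table 1.
results = [
    ("baseline", "0", 46.17),
    ("front-end X", "0", 60.0),
    ("adapted Y", "2", 70.0),
]
ranking = rank_within_categories(results)
print(ranking["0"][0][0])  # front-end X
```

Here "adapted Y" is never ranked against the Category 0 systems, even though its accuracy is higher, because it declares a different category.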

5. CONCLUSION 

In this paper, we introduced AURORA-2J, the first version of a 

Japanese evaluation set of noisy speech recognition, and the baseline 

performances of bundled evaluation scripts. The database can 

be accessed from [9]. We also mentioned our plan for further development. 

In addition to the Japanese AURORA-2 data and the in-car speech

data (isolated word and digit-string utterances recorded under

controlled real noisy environments), AURORA-2.5J, which

is AURORA-2J with Lombard effects, is to be available soon.

This database will consist of noise-free speech uttered by speakers 

listening to the same noises as AURORA-2J through headphones. 

With this database, we can analyze the Lombard effects 

separately from additive and convolutional noises. We are also

carefully designing a corpus such as AURORA-4J. This database will

be built on large-vocabulary continuous speech recognition

tasks, and we plan to introduce new, challenging, yet realistic

noises. We are planning to distribute the AURORA-2J database at

this workshop.

In addition to developing the databases, we are also working 

on comparison and integration of noise reduction algorithms on 

the AURORA-2 and AURORA-2J database [8]. We implemented 

these algorithms as modules, which can be easily combined and 

applied to the AURORA evaluation process. In the near future, those

modules will also be distributed with the developed corpus. 

6. ACKNOWLEDGEMENTS

The authors thank Dr. David Pearce of the AURORA group

for his help with these activities. This work was supported in part

by the Telecommunications Advancement Organization of Japan.

The present study was conducted using the AURORA-2J database developed

by the IPSJ-SIG SLP Noisy Speech Recognition Evaluation

Working Group.

7. REFERENCES

[1] http://elazar.itd.nrl.navy.mil/spine/

[2] http://eurospeech2001.org/ese/NoiseRobust/index.html, http://www.elda.fr/proj/aurora1.html, http://www.elda.fr/proj/aurora2.html

[3] ETSI standard document, “Speech processing, transmission and quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithm”, ETSI ES 201 108 v1.1.2 (2000-04), 2000.

[4] H. G. Hirsch, D. Pearce, “The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy conditions”, ISCA ITRW ASR2000, September 2000.

[5] D. Pearce, “Developing the ETSI AURORA advanced distributed speech recognition front-end & What next”, Proc. EUROSPEECH2001, 2001.

[6] N. Kawaguchi, K. Takeda, et al., “Construction of Speech Corpus in Moving Car Environment”, Proc. International Conference on Spoken Language Processing, pp. 1281-1284, 2000 (ICSLP2000, Beijing, China).

[7] K. Yamamoto, S. Nakamura, K. Takeda, S. Kuroiwa, N. Kitaoka, T. Yamada, M. Mizumachi, T. Nishiura, M. Fujimoto, “AURORA-2J/AURORA-3J Corpus and Evaluation Baseline (in Japanese)”, Technical Report IPSJ SIG-SLP 2003-SLP-47, July 2003.

[8] T. Yamada, J. Okada, K. Takeda, N. Kitaoka, M. Fujimoto, S. Kuroiwa, K. Yamamoto, T. Nishiura, M. Mizumachi, S. Nakamura, “Integration of noise reduction algorithms for Aurora 2 database”, Eurospeech2003, 2003.

[9] http://sp.shinshu-u.ac.jp/%7Ekyama/SLP-WG/
