
DATA COLLECTION AND EVALUATION OF AURORA-2 JAPANESE CORPUS

Satoshi Nakamura 1, Kazumasa Yamamoto 2, Kazuya Takeda 3, Shingo Kuroiwa 4,
Norihide Kitaoka 5, Takeshi Yamada 6, Mitsunori Mizumachi 1, Takanobu Nishiura 7,
Masakiyo Fujimoto 8, Akira Saso 9, Toshiki Endo 1

1 ATR Spoken Language Translation Research Labs., 2 Shinshu University, 3 Nagoya University,
4 University of Tokushima, 5 Toyohashi University of Technology, 6 University of Tsukuba,
7 Wakayama University, 8 Ryukoku University, 9 Advanced Institute of Science and Technology

slp-noise-wg@slt.atr.co.jp

ABSTRACT

Speech recognition systems still need to be improved for use in noisy environments. Standard evaluation corpora and assessment technologies are essential to this improvement. Recently, the AURORA-2 and AURORA-3 corpora and their evaluation scenarios have had a significant impact on noisy speech recognition research. This paper introduces a Japanese noisy speech corpus and its evaluation scripts, called AURORA-2J. AURORA-2J is a Japanese connected-digit corpus. The data collection and evaluation scenarios are designed in the same way as AURORA-2, with the help of the ETSI AURORA group. Furthermore, we have collected an in-car speech corpus similar to AURORA-3, which includes Japanese connected digits and command words recorded in a moving car. This paper describes the data collection, the baseline scripts, and the baseline performance.

1. INTRODUCTION

The recent progress of speech recognition technology has been brought about by the advent of statistical modeling and large-scale corpora. It is also recognized that progress was accelerated by the U.S. DARPA projects initiated in the late '80s, in which participants competitively developed speech recognition systems on the same task using the same training and test corpora.

However, current speech recognition performance must still be improved if systems are to be exposed to noisy environments, where speech recognition applications are likely to be used in practice. Robustness to acoustic noise is therefore an emerging and crucial problem for speech recognition systems.

With regard to the noise robustness problem, there have been two evaluation projects: SPINE1/2 [1] and AURORA [2]. The SPINE (SPeech recognition In Noisy Environment) project was organized by U.S. DARPA, with SPINE1 in 2000 and SPINE2 in 2001. The task was spontaneous continuous speech recognition in noisy environments, specifically English spontaneous dialog between an operator and a soldier in a noisy field. The project brought many improvements to continuous noisy speech recognition, though the task is quite specialized and somewhat difficult to handle.

The ETSI AURORA group, on the other hand, initiated a special session at the EUROSPEECH conference. They are actively working to develop standard technologies under ETSI for distributed speech recognition [3]. In parallel with their standardization activities, they have distributed a noisy connected-digit corpus based on TI digits, with baseline HTK scripts, to academic researchers for further noisy speech recognition research. So far, AURORA-2, a connected-digit corpus with additive noise, and AURORA-3, an in-car noisy digit and word corpus, have been distributed with HTK scripts, which can be used to obtain baseline performance and relative improvements over the baseline results [4, 5]. The advantages of AURORA are that 1) the connected-digit task is relatively small compared to spontaneous speech, and 2) baseline performance can easily be obtained with the attached HTK scripts.

The authors voluntarily organized a special working group in October 2001 under the auspices of the Information Processing Society of Japan in order to assess speech recognition technology in noisy environments. The focus of the working group includes the planning of comprehensive fundamental assessments of noisy speech recognition, standardized corpus collection, evaluation strategy development, and distribution of standardized processing modules. To begin with, we decided to follow the AURORA-2 corpus collection and evaluation, since the task is small enough and the evaluation scheme is quite clear. For the Japanese AURORA-2, AURORA-2J, we simply translated the English digits into Japanese and added the same noises. Furthermore, we collected in-car Japanese connected-digit and command-word data similar to AURORA-3.

In this paper, section 2 describes the AURORA-2J corpus collection, its evaluation scripts, and baseline results. The in-car speech corpus is described in section 3. Section 4 describes the categories within which a developed noisy speech recognition system should be fairly compared. Finally, section 5 summarizes the paper and describes future directions.


Table 1. AURORA-2J baseline recognition results (%Acc). The Average row is the mean over the 20 dB to 0 dB conditions; the Overall Average column is the mean over all noise conditions in sets A, B, and C.

Clean Training (%Acc)

Set A     Subway   Babble   Car      Exhibition   Average
Clean     99.72    99.58    99.82    99.60        99.68
20 dB     96.90    80.80    89.59    95.90        90.80
15 dB     76.27    56.83    58.16    75.41        66.67
10 dB     47.16    38.63    38.86    41.65        41.58
 5 dB     25.27    23.16    20.79    21.97        22.80
 0 dB     12.28     8.16    10.38    11.97        10.70
-5 dB      7.43     4.35     7.25     7.90         6.73
Average   51.58    41.52    43.56    49.38        46.51

Set B     Restaurant   Street   Airport   Station   Average
Clean     99.72        99.58    99.82     99.60     99.68
20 dB     84.86        88.51    82.17     82.29     84.46
15 dB     61.10        65.39    57.80     55.01     59.83
10 dB     40.50        42.59    41.93     37.98     40.75
 5 dB     21.06        23.79    26.16     22.25     23.32
 0 dB      9.89        13.75    12.68      9.84     11.54
-5 dB      1.90         8.56     4.77      5.46      5.17
Average   43.48        46.81    44.15     41.47     43.98

Set C     Subway M   Street M   Average   Overall Average
Clean     99.82      99.67      99.75     99.69
20 dB     91.50      92.26      91.88     88.48
15 dB     70.80      75.39      73.10     65.22
10 dB     43.51      47.28      45.40     42.01
 5 dB     25.91      25.03      25.47     23.54
 0 dB     13.72      13.60      13.66     11.63
-5 dB      8.81       8.74       8.78      6.52
Average   49.09      50.71      49.90     46.17

Multicondition Training (%Acc)

Set A     Subway   Babble   Car      Exhibition   Average
Clean     99.79    99.64    99.67    99.75        99.71
20 dB     99.63    99.67    99.70    99.57        99.64
15 dB     99.26    99.40    99.37    98.83        99.22
10 dB     98.25    97.43    97.94    97.38        97.75
 5 dB     93.89    89.78    92.16    92.32        92.04
 0 dB     74.85    62.48    64.96    73.68        68.99
-5 dB     30.46    25.12    23.17    29.56        27.08
Average   93.18    89.75    90.83    92.36        91.53

Set B     Restaurant   Street   Airport   Station   Average
Clean     99.79        99.64    99.67     99.75     99.71
20 dB     98.62        99.46    98.90     97.99     98.74
15 dB     96.90        97.58    96.45     94.11     96.26
10 dB     86.83        89.57    91.29     84.94     88.16
 5 dB     68.56        71.28    77.72     76.18     73.44
 0 dB     31.87        48.22    49.36     51.90     45.34
-5 dB     -3.78        18.65    16.70     16.69     12.07
Average   76.56        81.22    82.74     81.02     80.39

Set C     Subway M   Street M   Average   Overall Average
Clean     99.69      99.55      99.62     99.69
20 dB     99.51      99.40      99.46     99.25
15 dB     99.17      98.37      98.77     97.94
10 dB     96.90      93.71      95.31     93.42
 5 dB     87.47      80.86      84.17     83.02
 0 dB     52.32      50.57      51.45     56.02
-5 dB     21.31      14.96      18.14     19.28
Average   87.07      84.58      85.83     85.93
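The %Acc entries follow the HTK word accuracy measure, Acc = 100 * (N - D - S - I) / N, where N is the number of reference words and D, S, I are deletion, substitution, and insertion errors. At very low SNRs, insertions can outnumber correctly recognized words, which is how an entry such as -3.78 at -5 dB can be negative. A minimal sketch, assuming the standard HTK scoring used by the AURORA scripts (the error counts below are hypothetical):

def word_accuracy_pct(n_ref, n_del, n_sub, n_ins):
    """HTK-style word accuracy: 100 * (N - D - S - I) / N.

    Can be negative when insertions outnumber correctly recognized
    words, as in some -5 dB conditions of Table 1.
    """
    return 100.0 * (n_ref - n_del - n_sub - n_ins) / n_ref

# Hypothetical error counts at a very low SNR:
print(word_accuracy_pct(n_ref=1000, n_del=300, n_sub=400, n_ins=340))  # -4.0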

2. AURORA-2J

2.1. Japanese digit pronunciation

AURORA-2J is the same as AURORA-2, but uttered in Japanese. The number of speakers is the same, and the digit strings for each speaker are identical. Table 2 shows the pronunciations of the eleven digits in AURORA-2 and AURORA-2J. Speakers were requested to pronounce the digits as specified in this table. These pronunciations were chosen considering their frequency of occurrence when telephone numbers and credit card numbers are uttered. Although vowel lengthening sometimes occurs in /ni/ and /go/, the two pronunciations are not distinguished in AURORA-2J.

In Japanese, "4" is sometimes read as /shi/, "7" as /shichi/, and "0" as /rei/. However, these pronunciations are rarely used when a telephone number or a credit card number is read over the telephone. Hence, AURORA-2J does not employ them.

Table 2. Pronunciations of digits.

Digit    AURORA-2   AURORA-2J
1        one        /ichi/
2        two        /ni/
3        three      /saN/
4        four       /yoN/
5        five       /go/
6        six        /roku/
7        seven      /nana/
8        eight      /hachi/
9        nine       /kyuH/
0 (Z)    zero       /zero/
0 (O)    oh         /maru/

2.2. Data recording

A headset microphone, a Sennheiser HMD25, was used for recording, with a USB audio interface (Edirol UA-5) connected to a Windows personal computer. The recording was done in a soundproof booth, where speakers read a list of digit strings presented on a CRT monitor connected to the PC. The final file format of the speech data is Microsoft WAV with 16-kHz sampling.

2.3. Filtering and noise adding

The AURORA-2J database follows the AURORA-2 database and was created in exactly the same way. All programs and scripts used here, both for filtering the speech signals and for adding the noise signals, were kindly provided by the AURORA project.
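For illustration, the core of this noise addition is scaling the noise so that the mixture reaches a target SNR. The following is a minimal numpy sketch of the general technique, not the official AURORA tool, which additionally applies G.712 filtering and measures speech level over speech-active frames:

import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix `noise` into clean `speech` at a target SNR in dB.

    Illustrative sketch only: the distributed AURORA tools also apply
    a G.712 filter and estimate speech level on speech-active frames.
    """
    # Tile or trim the noise to the length of the speech signal.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    # Scale the noise so that P_speech / P_noise equals 10^(SNR/10).
    p_speech = np.mean(speech.astype(float) ** 2)
    p_noise = np.mean(noise.astype(float) ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise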

2.4. Training and testing datasets

The design of the training and testing datasets is the same as in AURORA-2. Two training sets are prepared: a clean-training dataset and a multi-condition dataset. Each contains 8,440 utterances in total, by 110 speakers (55 male and 55 female). For the multi-condition training dataset, four kinds of noise (Subway, Babble, Car, Exhibition) are added to the clean speech at five SNR conditions (clean, 20 dB, 15 dB, 10 dB, 5 dB), with 422 utterances in each noise and SNR condition. The G.712 filter is applied to all speech data.

For testing, we prepare three kinds of test sets, exactly the same as in AURORA-2; the layout is summarized in the sketch after the list.

[Testset A] The noise conditions are the same as in the multi-condition training set: Subway, Babble, Car, and Exhibition noises are used.

[Testset B] The noise conditions differ from the multi-condition training set: Restaurant, Street, Airport, and Station noises are used.

[Testset C] The channel condition differs from the training set: the MIRS channel is applied to the speech data.
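As a compact, illustrative summary of this design (the names below are ours, and the test-set SNR list follows Table 1):

# Illustrative summary of the AURORA-2J dataset design described above.
TRAIN_NOISES = ["Subway", "Babble", "Car", "Exhibition"]
TRAIN_SNRS_DB = ["clean", 20, 15, 10, 5]        # multi-condition training
TEST_SNRS_DB = ["clean", 20, 15, 10, 5, 0, -5]  # per Table 1
TEST_SETS = {
    "A": TRAIN_NOISES,                                    # matched noises
    "B": ["Restaurant", "Street", "Airport", "Station"],  # unmatched noises
    "C": ["Subway M", "Street M"],                        # MIRS channel mismatch
}
UTTS_PER_TRAIN_CONDITION = 422
# 4 noises x 5 SNR conditions x 422 utterances = 8,440 training utterances.
assert len(TRAIN_NOISES) * len(TRAIN_SNRS_DB) * UTTS_PER_TRAIN_CONDITION == 8440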

2.5. Reference scripts and baseline performance

The reference back-end scripts were mostly based on the original AURORA-2 baseline back-end scripts, with some modifications introduced from the Microsoft complex baseline back-end scripts. The other experimental conditions, including the number of recognition units, were basically the same as the original AURORA-2 conditions, except for the number of Gaussian mixtures per state: 20 mixtures for the digit models and 36 mixtures for the pause models.

The feature vector consists of 12 MFCCs and log energy, together with their corresponding delta and acceleration coefficients, so a vector contains 39 components in total. These parameters were calculated using HCopy under the same conditions as the AURORA-2 HTK baseline. The baseline recognition performance of AURORA-2J is shown in Table 1.
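In the baseline these features come directly from HCopy; purely as an illustration of how the 39 components fit together, here is a minimal numpy sketch that stacks the 13 static components with regression-based delta and acceleration coefficients. The helper names and the +/-2-frame regression window are our assumptions, not values taken from the distributed configuration:

import numpy as np

def regression_coeffs(x, window=2):
    """HTK-style regression coefficients over a +/- `window` frame span."""
    denom = 2.0 * sum(t * t for t in range(1, window + 1))
    padded = np.pad(x, ((window, window), (0, 0)), mode="edge")
    n = len(x)
    return sum(
        t * (padded[window + t : window + t + n] - padded[window - t : window - t + n])
        for t in range(1, window + 1)
    ) / denom

def make_39dim(mfcc12, log_energy):
    """Stack 12 MFCCs + log energy with delta and acceleration coefficients."""
    static = np.hstack([mfcc12, log_energy[:, None]])  # (frames, 13)
    delta = regression_coeffs(static)                  # (frames, 13)
    accel = regression_coeffs(delta)                   # (frames, 13)
    return np.hstack([static, delta, accel])           # (frames, 39)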

3. IN-CAR SPEECH CORPUS

As a standard corpus for in-car speech technologies, we collected an in-car speech corpus, aiming to distribute a corpus similar to AURORA-3. This corpus is part of the CIAIR (Center for Integrated Acoustic Information Research) in-car speech corpora collected at Nagoya University [6]. About 38,350 word utterances by 80 speakers, recorded while driving, make up the corpus.

3.1. Vocabulary

The basic vocabulary consists of 50 isolated word utterances, four 4-digit string utterances, four 10-digit string utterances, and one 16-digit string utterance per speaker. The contents of the utterances are listed in Table 3. The average phoneme length of the 50 words is 10.2 (ranging from 3 to 25 phonemes).

Table 3. The length and manner of utterances for the words and digit strings.

Words, isolated utterances                               x50
4 digits, with pauses at every digit
  (X-X-X-X)                                              x4
10 digits, with pauses after the first 3 and 6 digits
  (XXX-XXX-XXXX)                                         x4
16 digits, with pauses at every four digits
  (XXXX-XXXX-XXXX-XXXX)                                  x1

3.2. In-car data collection

The data collection was performed using a specially designed data collection vehicle with multiple data acquisition capabilities: up to 16 channels of audio, three channels of video, and other driving-related information, i.e., car position, vehicle speed, engine speed, brake and acceleration pedals, and steering wheel.

Fig. 1. Microphone positions for data collection: side view (top) and top view (bottom).

Five microphones were placed around the driver's seat, as shown in Figure 1, where the top and side views of the driver's seat are depicted and the microphone positions are marked by black dots. Microphones #3 and #4 were located on the dashboard, while #5, #6, and #7 were attached to the ceiling; microphone #6 was closest to the speaker. The microphone used in this data collection is the SONY ECM-77B omnidirectional electret microphone. In addition to these distributed microphones, the driver wore a headset with a close-talking microphone (#1). Since the speakers drove while speaking, audio prompting through headphones was used.

The utterances were collected under 13 carefully controlled driving conditions, i.e., combinations of three car speeds (idling, driving in a city area, and driving on an expressway) and five car conditions (fan on (hi/lo), CD player on, window open, and the normal driving condition), as shown in Table 4.

Table 4. Recording conditions in the car.

Idling              Normal; CD Player On; Hazard On
City Area Driving   Normal; CD Player On; Air Conditioner On; Window Open;
                    CD Player On + Air Conditioner On + Window Half Open
Expressway Driving  Normal; CD Player On; Air Conditioner On; Window Open;
                    CD Player On + Air Conditioner On + Window Slightly Open


[Fig. 2 comprises six panels of long-term average spectra (dB) versus frequency (0-8000 Hz): idling, low-speed, and high-speed conditions, each for the close-talking and the hands-free microphone.]

Fig. 2. Long-term average spectra for various driving conditions in the car.

The overall average SNR is roughly 13 dB, although the SNR ranges from about 0 dB to 25 dB. The long-term spectra of these noise conditions are shown in Fig. 2. We are currently designing the training and testing datasets and the baseline evaluation scripts.
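The paper does not state how these SNR figures were obtained; as an illustrative assumption, one common estimate compares the power of speech-active and noise-only segments of the same recording:

import numpy as np

def estimate_snr_db(speech_active, noise_only):
    """Rough SNR estimate from a speech-active and a noise-only segment.

    Illustrative only: the exact estimation procedure behind the
    roughly 13 dB average figure is not specified in the text.
    """
    p_mix = np.mean(speech_active.astype(float) ** 2)  # speech-plus-noise power
    p_noise = np.mean(noise_only.astype(float) ** 2)   # noise power
    p_speech = max(p_mix - p_noise, 1e-12)             # subtract noise power
    return 10.0 * np.log10(p_speech / p_noise)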

4. EVALUATION CATEGORIES

Properly speaking, the AURORA project aims at developing and evaluating front-ends for recognizers. However, in some papers reported so far, many changes to the back-end HTK scripts were introduced, such as using extra data not included in AURORA, increasing the number of mixtures, or using HMMs that are not whole-word models. Recognition results obtained with such methods cannot be fairly compared with those of methods using the original back-end scripts. Therefore, we propose evaluation categories in this paper: a method is compared with other methods only within the same category. According to the degree of change in the back-end scripts from the original baseline, users declare the category to which they belong from the following:

Category 0. No changes to the back-end HTK scripts.

Category 1. If the HMM topology is the same as in the baseline scripts, any training process is allowed; discriminative training, for example, can be introduced in this category. The computational cost of the recognition phase must remain as it was.

Category 2. If the HMM topology is the same, adaptation processes using some of the testing data can be introduced. Speaker or environment adaptation, and PMC with a one-state noise model, are allowed in this category. Any increase in computational cost may be caused only by the adaptation process.

Category 3. Changes to the standard HMM topology are allowed, including a different number of mixtures and states; however, the recognition units must remain whole-word models. PMC with a noise model of more than one state is included in this category.

Category 4. Any process is allowed as long as the computational cost stays under a specified limit. For example, a complex model structure can be used with low-dimensional feature vectors.

Category 5. Any process with any computational cost is allowed.

Category B. The use of any training data not included in AURORA is allowed, not only speech data but also environmental noise data. The evaluation data is, of course, AURORA. This category essentially differs from Categories 1-5.

5. CONCLUSION

In this paper, we introduced AURORA-2J, the first version of a Japanese evaluation set for noisy speech recognition, together with the baseline performance of the bundled evaluation scripts. The database can be accessed from [9]. We also described our plans for further development. In addition to the Japanese AURORA-2 and the in-car speech data, isolated word and digit-string utterances recorded in controlled actual noisy environments, as well as AURORA-2.5J, which is AURORA-2J with Lombard effects, will become available soon. The AURORA-2.5J database will consist of noise-free speech uttered by speakers listening through headphones to the same noises as in AURORA-2J. With this database, we can analyze Lombard effects separately from additive and convolutional noises. We are also carefully designing a corpus like AURORA-4J, which will be built on large-vocabulary continuous speech recognition tasks, into which we plan to introduce new, challenging, but realistic noises. We are planning to distribute the AURORA-2J databases at this workshop.

In addition to developing the databases, we are also working on the comparison and integration of noise reduction algorithms on the AURORA-2 and AURORA-2J databases [8]. We implemented these algorithms as modules, which can easily be combined and applied to the AURORA evaluation process. In the near future, these modules will also be distributed with the developed corpora.

6. ACKNOWLEDGEMENTS

The authors thank Dr. David Pearce of the AURORA group for his help with these activities. This work was supported in part by the Telecommunications Advancement Organization of Japan. The present study was conducted using the AURORA-2J database developed by the IPSJ-SIG SLP Noisy Speech Recognition Evaluation Working Group.

7. REFERENCES

[1] http://elazar.itd.nrl.navy.mil/spine/

[2] http://eurospeech2001.org/ese/NoiseRobust/index.html,
    http://www.elda.fr/proj/aurora1.html,
    http://www.elda.fr/proj/aurora2.html

[3] ETSI standard document, "Speech processing, transmission and quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithm", ETSI ES 201 108 v1.1.2 (2000-04), 2000.

[4] H.-G. Hirsch, D. Pearce, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions", Proc. ISCA ITRW ASR2000, September 2000.

[5] D. Pearce, "Developing the ETSI AURORA advanced distributed speech recognition front-end & what next", Proc. EUROSPEECH 2001, 2001.

[6] N. Kawaguchi, K. Takeda, et al., "Construction of Speech Corpus in Moving Car Environment", Proc. ICSLP 2000, pp. 1281-1284, Beijing, China, 2000.

[7] K. Yamamoto, S. Nakamura, K. Takeda, S. Kuroiwa, N. Kitaoka, T. Yamada, M. Mizumachi, T. Nishiura, M. Fujimoto, "AURORA-2J/AURORA-3J Corpus and Evaluation Baseline (in Japanese)", Technical Report IPSJ SIG-SLP 2003-SLP-47, July 2003.

[8] T. Yamada, J. Okada, K. Takeda, N. Kitaoka, M. Fujimoto, S. Kuroiwa, K. Yamamoto, T. Nishiura, M. Mizumachi, S. Nakamura, "Integration of noise reduction algorithms for Aurora2 database", Proc. EUROSPEECH 2003, 2003.

[9] http://sp.shinshu-u.ac.jp/%7Ekyama/SLP-WG/
