
Individual variation in speech perception and production

Valerie Hazan
Dept of Phonetics and Linguistics
UCL, London, UK


Aim? To understand this!

- Why do these productions sound so different?
- Why do these vary in intelligibility for different listeners?
- What is the link between acoustic aspects of the productions and their perception?


Why study variation?

- Inherent part of ‘real’ communication
- To build more robust models of speech perception and production
- To build better practical applications:
  - Speech technology: ASR and synthesis
  - Speech and language therapy
  - Language training


Why is this becoming a ‘hot topic’?

- Move away from ‘laboratory speech’ towards more natural speech stimuli
- Individual variability no longer dismissed as ‘experimental noise’
- Emergence of models of speech perception such as hyper-hypo theory and exemplar models
- Increasing number of speech technology and other applications


Complexity, complexity everywhere…

Main problem in studies of variation? Unpicking the various sources of variation.

- Speaker-related variability
  - Cross-speaker
    - Anatomical differences (vocal folds, vocal tract)
    - Accent
    - Individual articulatory behaviours (e.g. identical twin studies)
    - …
  - Within-speaker
    - Speaking style
    - Speech rate
    - Physical and mental health
    - …
- Listener-related variability
  - Auditory processing abilities
  - Perceptual weighting of acoustic cues
  - Relative use of acoustic and linguistic information


Does variability in speech production affect speech perception?

There is an efficient process of normalisation, but:

- Speakers are not equally intelligible
- Different styles of speech are not equally intelligible
- Variability might affect perception in impoverished listening conditions (a ‘stressed’ system)
- Some listener populations may be less able to cope with variation than others…


Does variability in speech production affect speech perception?

Let’s remember that poor acoustic-phonetic information can be compensated for by efficient use of contextual/linguistic information.

BUT is there greater cognitive load when the redundancy of information is reduced?


An example of the ‘power’ of context…


Some key topics

- Speaker-related variability
  - To what degree do speakers vary from each other in terms of how intelligible they are?
  - How much of the variation is due to anatomical differences?
  - Are child speakers more variable than adult speakers?
- Listener-related variability
  - Do listeners vary in the acoustic-phonetic information they use to decode speech?
  - Do children vary from adults in the use of this information?


Some methodological issues…

Two approaches in studies of speaker ‘clarity’ and variation:

- ‘Deliberately-clear’ speech: comparison of different elicited speaking styles within speakers
- ‘Intrinsically-clear’ speech: comparison of different speakers using a ‘set’ style (usually read speech)


Part 1: Speaker-related variation and its impact on intelligibility


A study of the effect of speaker-related variability on speech intelligibility (Hazan and Markham, 2004, 2005)

- To what degree do speakers vary from each other in terms of how intelligible they are for a fixed set of materials?
- Is speaker intelligibility consistent across listeners differing in age and sex?
- Can we find correlations between speaker intelligibility and acoustic-phonetic characteristics of their speech?


UCL Speaker Database (Markham & Hazan, 2002)

Speakers
- 45 speakers (18 women, 15 men, 6 girls and 6 boys aged 12)
- Homogeneous accent group (South-eastern British English accent)

Listeners
- 135 listeners (same accent group):
  - 45 children aged 7-8
  - 45 children aged 11-12
  - 45 adults


UCL Speaker Database

Word level
- Nonsense VCV lists
- Markham word test
- Manchester Junior word lists

Sentence level
- Semantically-unpredictable sentences (SUS)
- Accent-revealing sentences

Read passages
- 2 text passages (Arthur, Rainbow)

Semi-spontaneous monologues
- Description of cartoon
- Retelling of cartoon without picture


Study 1: Intelligibility tests

Markham Word test
Set of 124 monosyllabic words:
- familiar to children over 7
- covering all frequent consonant confusions, with a large spread of vowels
- e.g. might, net, night, sat, shop…


Study 1: Intelligibility tests

Methodology
- Listeners tested in a quiet classroom via headphones
- Oral response given to the experimenter
- In the triplet condition, each listener heard 25 unique words from each of 15 speakers, presented in a fully randomised fashion


A little test… ‘good’ or ‘poor’ speaker?

- Ranked 1st out of 45
- Ranked 9th out of 45
- Ranked 45th out of 45
- Ranked 2nd out of 45
- Ranked 44th out of 45


What is the range in intelligibility rates across speakers?

[Figure: word error rate (%) per speaker, 0-20% scale, for all 45 speakers (af-06 through am-14), ordered from lowest to highest error rate]


Intelligibility for speaker groups differing in sex and age

[Figure: mean intelligibility for adult listeners (proportion correct, 0.80-1.00), by speaker group: Women, Men, Girls, Boys]

ANOVA on arcsine-transformed intelligibility rates:
- Women are more intelligible than the other speaker groups
- No difference in intelligibility between the groups of children and the men
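
The arcsine transform mentioned above is the standard variance-stabilising step applied to proportion data before ANOVA. A minimal sketch (the function name and the example rates are mine, not from the study):

```python
import math

def arcsine_transform(p):
    """Variance-stabilising transform for a proportion p in [0, 1],
    commonly applied to intelligibility rates before ANOVA."""
    return math.asin(math.sqrt(p))

# Illustrative intelligibility rates (proportion correct), not study data
rates = [0.95, 0.90, 0.85]
transformed = [arcsine_transform(p) for p in rates]
```

The transform stretches the compressed top end of the scale (rates near 1.0), which matters here because listeners were close to ceiling on single-word materials.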


Intelligibility rates for individual speakers

[Figure: intelligibility rates for individual speakers, grouped as Women, Men, Girls, Boys]

Note the degree of variability within each gender/age group.


Speaker intelligibility: Cross-listener correlations

[Figure: scatterplots of word intelligibility (proportion, .70-1.00) for pairs of listener groups (young children (YC), older children (OC), adults), with talker groups (Boys, Girls, Men, Women) marked; total population Rsq = 0.8562]

Remarkable consistency in relative speaker intelligibility across the three listener groups:
Pearson’s correlation: r = 0.90 to r = 0.95
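
The cross-listener consistency above rests on Pearson correlations between per-speaker intelligibility scores from different listener groups. A small self-contained sketch (the data values are invented for illustration, not taken from the study):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length
    score lists, e.g. per-speaker intelligibility for two listener groups."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented intelligibility proportions for five speakers, two listener groups
adults = [0.96, 0.92, 0.88, 0.85, 0.78]
children = [0.93, 0.90, 0.84, 0.83, 0.74]
r = pearson_r(adults, children)
```

A value of r near 1 means the two listener groups rank the speakers almost identically, which is exactly the pattern reported on this slide.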


What does this tell us?

- Speakers differ significantly in intelligibility even for single-word materials presented in low-level noise
- There is a tendency for women to be more intelligible than men or children
- There is remarkable consistency across listener groups as to which voices are more or less intelligible
- So the intelligibility of a voice is based on something inherent to the speaker rather than on the relation between speaker and listener


Some shortcomings…

Results based on ‘laboratory speech’ (single words read in an anechoic chamber)

How do these speakers sound in connected speech?
- ‘Best’ speaker
- ‘Worst’ speaker


Can we understand why some speakers are more intelligible than others?

Approach?
- Analysis of acoustic-phonetic characteristics of the speech stimuli
- Correlation with the intelligibility results


What do previous studies tell us?

Cross-speaker studies
- Bond and Moore (1994): speech rate, vowel space
- Bradlow et al. (1996): fundamental frequency range, vowel space measures, precision of articulation

Speaking style studies
- Picheny et al. (1985, 1986): speech rate, articulatory precision
- Krause and Braida (2004): energy in the mid-frequency range, low-frequency modulation in the intensity envelope


Acoustic-phonetic correlates of intelligibility?

The following measures were made:

- Long-term average spectrum (based on all words)
  - Slope and energy in the 1-3 kHz frequency band
- Fundamental frequency (based on all words)
  - F0 mean, range, open quotient, irregularity
- Speaking rate (based on a subset of words)
  - Mean word duration
- CV amplitude ratio (based on a subset of words)
  - CV ratio for fricative-vowel, stop-vowel and nasal-vowel sequences
- Formant spacing (based on a subset of words)
  - Euclidean distance in the F1/F2 plane for /i, ɑ, u/; difference in F1 between /i/ and /ɑ/ (in ERB); difference in F2 between /i/ and /u/ (in ERB)
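
The ERB-scaled F2 distance listed above can be computed from formant values in Hz. A sketch using the Glasberg & Moore (1990) ERB-rate formula; the formant values below are illustrative, not measurements from the database:

```python
import math

def hz_to_erb_rate(f_hz):
    """Convert a frequency in Hz to ERB-rate (Glasberg & Moore, 1990)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

def f2_distance_erb(f2_i_hz, f2_u_hz):
    """Vowel-space measure: difference in F2 between /i/ and /u/, in ERB."""
    return hz_to_erb_rate(f2_i_hz) - hz_to_erb_rate(f2_u_hz)

# Illustrative adult-male formant values (Hz), not measured data
dist = f2_distance_erb(2300.0, 900.0)
```

A larger ERB distance indicates a more expanded vowel space; the ERB scale approximates the ear's frequency resolution better than raw Hz, which is why the study expresses formant spacing in ERB.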


Significant correlation between intelligibility and total energy in the 1-3 kHz region

[Figure: scatterplot of mean word error rate (proportion, .02-.20) against total energy in the 1-3 kHz band (-10 to 4), by speaker group]

- Children: Rsq = 0.1395
- Men: Rsq = 0.5655
- Women: Rsq = 0.5161
- Total population: Rsq = 0.4036
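
The 1-3 kHz energy measure can be sketched as band energy relative to total energy in the spectrum. This is a simplified, single-signal version of an LTAS-style measure (the study's actual computation may differ):

```python
import numpy as np

def relative_band_energy_db(signal, fs, lo=1000.0, hi=3000.0):
    """Energy in the lo-hi Hz band of the power spectrum, in dB
    relative to total energy (a simplified LTAS-style measure)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    in_band = spectrum[(freqs >= lo) & (freqs <= hi)].sum()
    return 10.0 * np.log10(in_band / spectrum.sum())

# A 2 kHz tone: essentially all energy falls inside the 1-3 kHz band
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 2000.0 * t)
```

Speakers with more energy concentrated in this band score closer to 0 dB on such a measure, consistent with the correlation shown above.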


Significant correlation between intelligibility and word duration (speech rate)

[Figure: scatterplot of word error rate (proportion, .02-.20) against mean word duration (.3-.7 sec.), by speaker group]

- Children: Rsq = 0.0049
- Men: Rsq = 0.4163
- Women: Rsq = 0.1202
- Total population: Rsq = 0.1284


Significant if weak correlation between intelligibility and a measure of vowel space

[Figure: scatterplot of word error rate (proportion, .02-.20) against the difference in F2 (ERB) between /i/ and /u/ (1.0-6.0), by speaker group]

- Children: Rsq = 0.0016
- Men: Rsq = 0.5912
- Women: Rsq = 0.2162
- Total population: Rsq = 0.1552


BUT lots of variability…
Profiles of the ‘best’ and ‘worst’ speakers

Talker ranks (1-45):

Talker   Word Intel. (all)   LTAS 1-3 kHz   Word duration   Diff. F2 /i/-/u/
af-06    1                   1.5            3               6
af-14    2=                  19=            16              11
am-10    2=                  9              17              40
am-08    4                   27             25              18
af-12    5                   35             1               16
…        …                   …              …               …
am-12    40                  42             39              42
af-15    41                  39             29              39
cm-06    42                  19=            19              4
am-13    44                  43=            40              44
am-14    45                  39             44              45


Conclusions

- Is speaker ‘clarity’ inherent to the speaker, or might listeners find voices closer to their own easier to understand?
  Intelligibility appeared to be determined by inherent characteristics of a speaker’s speech rather than by an interaction of listener- and speaker-related factors… Implications for exemplar-based theories?

- Are intelligibility rates correlated with acoustic-phonetic measures?
  YES: correlations with energy in the mid-frequency band, speaking rate and vowel space, BUT the strength of the correlation varies across speaker groups. No single measure was necessary or sufficient to produce highly-intelligible speech. It seems likely that high intelligibility in individual speakers arises from different combinations of acoustic-phonetic features.


Why do acoustic-phonetic correlates vary across studies?

- Methodological differences
  - Studies of ‘intrinsically-clear’ vs ‘deliberately-clear’ speech
  - Types of materials used: e.g. fundamental frequency measures more likely to be important in connected sentences?
- Lack of a strong correlation between any one acoustic-phonetic measure and intelligibility
  - Strong correlations would imply that different listeners rely on the same acoustic-phonetic information… is this the case?


Are we barking up the wrong tree?

Maybe what’s important is not how specific acoustic-phonetic characteristics are produced by a speaker but…

…the degree of ‘internal consistency’ in the production of phones (overlapping or non-overlapping categories) (Newman et al., 2001)


Newman et al. (2001) data

- Slower reaction times for fricatives for the speaker showing overlap in /s/-/ʃ/ distributions
- Reflects greater ‘processing load’?


Part 2: Listener-related variation


A bit of history…

Early studies of speech perception
- Main issue: which acoustic cues are primary for the perception of phonemic contrasts?
- Presentation of mean identification or discrimination functions
- No discussion of individual differences in identification ability or cue weighting
- ‘Poor’ performers sometimes eliminated from the dataset


Early models of speech perception

- Search for invariant correlates of linguistic units
- Assumption that variability which is irrelevant for the decoding of speech is ‘stripped away’
  - Motor Theory: invariant units are articulatory gestures
  - Quantal Theory: invariant units are acoustic properties of the signal


Why an increasing interest in variability?

Vocal tract normalisation theories vs exemplar-based memory models:

‘Rather than warp the input signal to match a fixed internal template, the internal representation adapts according to the “perceived identity of the talker”’ (Johnson, 1990)


Individual differences in speech processing

Clearly present in individuals with hearing loss, but not immediately evident in normally-hearing subjects…

Why?
- Speech is highly redundant, so individuals are functioning ‘at ceiling’
- BUT individual differences become obvious when the system is ‘stressed’ (e.g. speech in noise, simultaneous tasks)


Where might individual differences arise?

- Auditory processing
- Processing of acoustic-phonetic information (acoustic cues)
- Use of contextual/linguistic information


What about listener-related differences in our study of word perception in noise?

What was the range in intelligibility across listeners?

[Figure: mean error rate (triplets, 0.00-.20) for each of the 45 adult listeners]


Are there individual differences in auditory processing?

Study of individual differences in the processing of speech and non-speech sounds (Surprenant and Watson, 2001)

Group of 93 subjects tested on:
- Speech perception tasks (syllables, words, sentences in noise)
- Spectral-temporal discrimination tasks using simple and complex non-speech stimuli


Individual differences in auditory processing tasks

[Figure from Surprenant and Watson, JASA, 110, 2001]


Individual differences in sentence-in-noise tasks

[Figure: psychometric functions for sentences in noise, subjects divided into groups of ten; only groups 1, 2, 3, 7, 9 and 10 shown (Surprenant and Watson, 2001)]


Conclusions of study

- Significant individual differences in auditory processing
- Significant individual differences in the ‘normal population’ for speech perception in noise: e.g. at an SNR of -1.6 dB, the best 10% recognise 82% of words whilst the worst 10% recognise 38% of words
- Little correlation between performance on auditory and speech tasks (but ‘global’ vs ‘analytic’ tasks?)


Are there individual differences in the processing of acoustic cues?

What about individual differences in the ability to extract phonetic information from the signal (phonemic categorisation tasks)?


Let’s think more about the notion of ‘categorisation’

[Figure: an ambiguous stimulus (?) between the categories L and R]


Phoneme categorisation task: example

A ‘speech continuum’: Step 1, Step 2, Step 3, Step 4, Step 5, Step 6, Step 7


What tasks do you carry out with this continuum?

Identification task
- Sounds presented in random order (many repetitions)
- Each time, the listener says which sound was heard (e.g. R or L)

[Figure: identification function, percentage of "R" responses (0-100%) against stimulus steps, showing the gradient of the function and the phoneme boundary]
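
The phoneme boundary and gradient can be estimated directly from the identification function. A minimal sketch that interpolates the 50% crossover (the response data below are invented for illustration, not from an experiment):

```python
def boundary_and_gradient(steps, pct):
    """Estimate the phoneme boundary (stimulus value at the 50%
    crossover) and the local gradient (% change per step) of an
    identification function by linear interpolation."""
    for i in range(len(steps) - 1):
        p0, p1 = pct[i], pct[i + 1]
        # Find the pair of adjacent steps that straddles 50%
        if (p0 - 50.0) * (p1 - 50.0) <= 0.0 and p0 != p1:
            slope = (p1 - p0) / (steps[i + 1] - steps[i])
            boundary = steps[i] + (50.0 - p0) / slope
            return boundary, slope
    return None, None

# Invented identification data: % "R" responses on a 7-step continuum
steps = [1, 2, 3, 4, 5, 6, 7]
pct_r = [98.0, 95.0, 85.0, 50.0, 15.0, 5.0, 2.0]
boundary, gradient = boundary_and_gradient(steps, pct_r)
```

A steeper gradient indicates more consistent (more categorical) labelling; listeners can differ in both where the boundary sits and how steep the function is.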


Individual differences in the use of acoustic cues (Hazan et al., 1990s)

Contrasts:
- GATE-DATE, GAT-DAT, GEET-DEET
- GATE-KATE, GAT-KAT, GEET-KEET

Test conditions:
- combined-cue and ‘single-cue’, in order to evaluate the ‘weighting’ given to each acoustic cue


An example: DATE-GATE endpoints

- Burst and F2 varying
- Only F2 varying
- Only burst varying


Listeners & test procedure

Listeners
- 50 adults (mean age: 21.1 yrs) with normal pure-tone thresholds
- Native speakers of British English

Test procedure
- Two-alternative forced-choice identification task
- 32 responses per stimulus per listener, collected over four sessions
- Combined-cue and single-cue stimuli randomised together


/d/-/g/ contrast: individual differences in cue-weighting

There was little variability in cue-weighting in the context of /i/, but in the context of /ei/ or /a/ there was greater variability in the use of burst or transition information. Some listeners focused on a specific cue, others required cue redundancy, whilst a small number were able to make use of whichever information was present.

[Figure: percentage of listeners affected by cue removal (no burst cue, no transitions cue, both, neither) for the minimal pairs GATE-DATE, GAT-DAT and GEET-DEET]


/g/-/k/ contrast: individual differences in cue-weighting

Less than 5% of listeners were using the F1 transition cue in the context of /ei/ and /i/. However, in the context of /ae/, even though on average the effect of F1 transition cue removal was not significant, over 30% of listeners were significantly affected by the absence of this cue.

[Figure: voicing contrast, percentage of listeners affected by removal of the F1 transition cue for GATE-KATE, GAT-KAT and GEET-KEET]


Discussion

- Acoustic cues vary in their “robustness”: there was much less individual variability in the use of the clearly dominant VOT cue to voicing than in the use of burst/formant cues to the place contrast.
- Individual differences in cue-weighting strategies used by listeners: for the /d/-/g/ contrasts, some listeners appeared to focus on a specific cue (either burst or formant transition information), whilst a small number appeared to be flexible in their use of whichever cue was present.


Are there individual differences in the use of contextual cues? One example

It is not surprising that listeners also vary in the way they make use of linguistic contextual information.

One useful test: SPIN (Kalikow et al., 1977).


SPIN test

HP: The watchdog gave a warning growl.
HP: She made the bed with clean sheets.
LP: The old man discussed the dive.
LP: Bob heard Paul call about the strips.

Tells us how well people can make use of semantic information, but tells us less about the use of phonetic information.


Some data showing variation in the effect of contextual information

55 adult listeners tested on 50 SPIN sentences

[Figure: SPIN scores (% correct, 20-100) per subject for high-predictability and low-predictability sentences]


So?

Evidence of individual variability at all levels of perception:
- Auditory processing
- Processing of acoustic-phonetic cues (‘analytic’ tasks)
- Use of contextual information


My main message for today!

There is a complex interaction between variability in production and perception:

- Due to individual variability in production, the acoustic-phonetic cues present in the signal will vary from speaker to speaker.
- Some listeners seem to be more flexible in their use of acoustic cues and can cope with impoverished acoustic-phonetic information; others less so.


Bibliography

Bond, Z., & Moore, T. (1994). A note on the acoustic-phonetic characteristics of inadvertently clear speech. Speech Communication, 14, 325-337.

Bradlow, A., Torretta, G., & Pisoni, D. (1996). Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics. Speech Communication, 20, 255-272.

Hazan, V., & Markham, D. (2004). Acoustic-phonetic correlates of talker intelligibility for adults and children. Journal of the Acoustical Society of America, 116, 3108-3118.

Kalikow, D., Stevens, K., & Elliott, L. (1977). Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability. Journal of the Acoustical Society of America, 61, 1337-1351.

Krause, J. C., & Braida, L. D. (2004). Acoustic properties of naturally produced clear speech at normal speaking rates. Journal of the Acoustical Society of America, 115, 362-378.

Markham, D., & Hazan, V. (2002). The UCL Speaker Database. Speech, Hearing and Language: UCL Work in Progress, 14, 1-17.

Markham, D., & Hazan, V. (2004). The effect of talker- and listener-related factors on intelligibility for a real-word, open-set perception test. Journal of Speech, Language, and Hearing Research, 47, 725-737.

Newman, R., Clouse, S., & Burnham, J. (2001). The perceptual consequences of within-talker variability in fricative production. Journal of the Acoustical Society of America, 109, 1181-1196.

Picheny, M. A., Durlach, N. I., & Braida, L. D. (1985). Speaking clearly for the hard of hearing I: Intelligibility differences between clear and conversational speech. Journal of Speech and Hearing Research, 28, 96-103.

Picheny, M. A., Durlach, N. I., & Braida, L. D. (1986). Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech. Journal of Speech and Hearing Research, 29, 434-446.

Surprenant, A. M., & Watson, C. S. (2001). Individual differences in the processing of speech and nonspeech sounds by normal-hearing listeners. Journal of the Acoustical Society of America, 110, 2085-2095.
