
Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

Cross-modal Clustering in the Acoustic-Articulatory Space

G. Ananthakrishnan and Daniel M. Neiberg

Centre for Speech Technology, CSC, KTH, Stockholm

agopal@kth.se, neiberg@speech.kth.se

Abstract

This paper explores cross-modal clustering in the acoustic-articulatory space. A method to improve clustering using information from more than one modality is presented. Formants and Electromagnetic Articulography measurements are used to study corresponding clusters formed in the two modalities. A measure for estimating the uncertainty in correspondences between one cluster in the acoustic space and several clusters in the articulatory space is suggested.

Introduction

Trying to estimate articulatory measurements from acoustic data has been of special interest for a long time and is known as acoustic-to-articulatory inversion. Though the mapping between the two modalities is expected to be one-to-one, early research presented some interesting evidence of non-uniqueness in this mapping. Bite-block experiments have shown that speakers are capable of producing sounds perceptually close to the intended sounds even though the jaw is fixed in an unnatural position (Gay et al., 1981). Mermelstein (1967) and Schroeder (1967) have shown, through analytical articulatory models, that the inversion is unique to a class of area functions rather than to a unique configuration of the vocal tract.

With the advent of measuring techniques like Electromagnetic Articulography (EMA) and X-Ray Microbeam, it became possible to collect simultaneous measurements of acoustics and articulation during continuous speech. Several attempts have been made by researchers to perform acoustic-to-articulatory inversion by applying machine learning techniques to acoustic-articulatory data (Yehia et al., 1998; Kjellström and Engwall, 2009). The statistical methods applied to the mapping problem brought a new dimension to the concept of non-uniqueness in the mapping. In the deterministic case, one can say that if the same acoustic parameters are produced by more than one articulatory configuration, then the particular mapping is non-unique. It is almost impossible to show this using real recorded data, unless more than one articulatory configuration produces exactly the same acoustic parameters. However, not finding such instances does not imply that non-uniqueness does not exist.

Qin and Carreira-Perpiñán (2007) proposed that the mapping is non-unique if, for a particular acoustic cluster, the corresponding articulatory mapping may be found in more than one cluster. Evidence of non-uniqueness in certain acoustic clusters for phonemes like /ɹ/, /l/ and /w/ was presented. Their study quantized the acoustic space using the perceptual Itakura distance on LPC features; the articulatory space was clustered using a nonparametric Gaussian density kernel with a fixed variance. The problem with such a definition of non-uniqueness is that one does not know the optimal method and level of quantization for clustering the acoustic and articulatory spaces.

A later study by Neiberg et al. (2008) argued that, for the mapping to be called non-unique, the different articulatory clusters should not only map onto a single acoustic cluster but should also map onto acoustic distributions with the same parameters. Using an approach based on finding the Bhattacharyya distance between the distributions of the inverse mapping, they found that phonemes like /p/, /t/, /k/, /s/ and /z/ are highly non-unique.

In this study, we wish to observe how clusters in the acoustic space map onto the articulatory space. For every cluster in the acoustic space, we intend to find the uncertainty in finding a corresponding articulatory cluster. It must be noted that this uncertainty is not necessarily the non-uniqueness in the acoustic-to-articulatory mapping. However, finding this uncertainty would give an intuitive understanding of the difficulties in the mapping for different phonemes.

Clustering the acoustic and articulatory spaces separately, as was done in the previous studies by Qin and Carreira-Perpiñán (2007) as well as Neiberg et al. (2008), leads to hard boundaries between the clusters. The cluster labels for instances near these boundaries may be estimated incorrectly, which may cause an overestimation of the uncertainty. This situation is illustrated in Fig. 1 using synthetic data, where we can see both the distributions of the synthetic data and the Maximum A-posteriori Probability (MAP) estimates for the clusters. We can see that, because of the incorrect clustering, it seems as if data belonging to one cluster in mode A belongs to more than one cluster in mode B.

In order to mitigate this problem, we suggest a method of cross-modal clustering in which both the available modalities are made use of by allowing soft boundaries for the clusters in each modality. Cross-modal clustering has been dealt with in detail in several contexts of combining multi-modal data. Coen (2005) proposed a self-supervised method in which acoustic and visual features were used to learn perceptual structures based on temporal correlations between the two modalities. He used the concept of slices, which are topological manifolds encoding dynamic states. Similarly, Bolelli et al. (2007) proposed a clustering algorithm using Support Vector Machines (SVMs) for clustering inter-related text datasets. The method proposed in this paper does not make use of correlations, but mainly uses co-clustering properties between the two modalities in order to perform the cross-modal clustering. Thus, even non-linear (uncorrelated) dependencies may be modeled using this simple method.

Theory

We assume that the data follows a Gaussian Mixture Model (GMM). The acoustic space Y = {y_1, y_2, …, y_N} with N data points is modelled using I Gaussians, {λ_1, λ_2, …, λ_I}, and the articulatory space X = {x_1, x_2, …, x_N} is modelled using K Gaussians, {γ_1, γ_2, …, γ_K}. I and K are obtained by minimizing the Bayesian Information Criterion (BIC). If we know which articulatory Gaussian a particular data point belongs to, say γ_k, then the correct acoustic Gaussian λ_n for the n-th data point, having acoustic features y_n and articulatory features x_n, is given by the maximum cross-modal a-posteriori probability

$$\lambda_n = \arg\max_{1 \le i \le I} P(\lambda_i \mid x_n, y_n, \gamma_k) = \arg\max_{1 \le i \le I} p(x_n, y_n \mid \lambda_i, \gamma_k)\, P(\lambda_i \mid \gamma_k)\, P(\gamma_k) \quad (1)$$
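As a concrete illustration of the model-selection step above, the sketch below fits GMMs of increasing order and keeps the BIC-minimizing one. It is a minimal sketch assuming scikit-learn's GaussianMixture; the helper name fit_gmm_bic and the search cap max_components are illustrative choices, not details from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_bic(data, max_components=20, seed=0):
    """Fit GMMs with 1..max_components full-covariance Gaussians and
    return the model that minimizes the Bayesian Information Criterion."""
    best, best_bic = None, np.inf
    for n in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=n, covariance_type="full",
                              random_state=seed).fit(data)
        bic = gmm.bic(data)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best

# Hypothetical usage: Y is an N x 5 array of formant vectors and X an
# N x 14 array of EMA coordinates (array names assumed for illustration).
# acoustic_gmm = fit_gmm_bic(Y)       # I Gaussians, lambda_1..lambda_I
# articulatory_gmm = fit_gmm_bic(X)   # K Gaussians, gamma_1..gamma_K
```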

Figure 1. The figures above show a synthesized example of data in two modalities. The figures below show how MAP hard clustering may bring about an effect of uncertainty in the correspondence between clusters in the two modalities.

The knowledge about the articulatory cluster can then be used to improve the estimate of the correct acoustic cluster, and vice versa, as shown below

$$\gamma_n = \arg\max_{1 \le k \le K} p(x_n, y_n \mid \lambda_i, \gamma_k)\, P(\gamma_k \mid \lambda_i)\, P(\lambda_i) \quad (2)$$

where P(λ|γ) is the cross-modal prior and p(x, y|λ, γ) is the joint cross-modal distribution. If the first estimates of the correct clusters are the MAP estimates, then the estimates of the correct clusters of the speech segments are improved.
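A sketch of how Equations (1) and (2) can be alternated in practice is given below. Two details are not specified in this copy of the paper and are assumptions here: the joint likelihood p(x, y | λ_i, γ_k) is taken to factor as p(y | λ_i) p(x | γ_k), and the cross-modal priors P(λ_i | γ_k) and P(γ_k | λ_i) are estimated from co-occurrence counts of the current labels.

```python
import numpy as np
from scipy.stats import multivariate_normal

def component_log_liks(gmm, data):
    """log p(data_n | component_j) for every data point and Gaussian."""
    return np.column_stack([
        multivariate_normal.logpdf(data, mean=m, cov=c)
        for m, c in zip(gmm.means_, gmm.covariances_)])

def cross_modal_map(ac_gmm, ar_gmm, Y, X, n_iter=10, eps=1e-12):
    """Alternate Eqs. (1) and (2), assuming the joint likelihood
    factors given the two cluster labels (an assumption, see text)."""
    log_py = component_log_liks(ac_gmm, Y)   # N x I
    log_px = component_log_liks(ar_gmm, X)   # N x K
    # initial MAP labels from each modality on its own
    lam = np.argmax(log_py + np.log(ac_gmm.weights_), axis=1)
    gam = np.argmax(log_px + np.log(ar_gmm.weights_), axis=1)
    I, K = log_py.shape[1], log_px.shape[1]
    for _ in range(n_iter):
        # cross-modal priors estimated from label co-occurrence counts
        joint = np.zeros((I, K))
        np.add.at(joint, (lam, gam), 1.0)
        joint /= joint.sum()
        p_lam_given_gam = joint / (joint.sum(axis=0, keepdims=True) + eps)    # I x K
        p_gam_given_lam = joint.T / (joint.sum(axis=1, keepdims=True).T + eps)  # K x I
        # Eq. (1): terms constant in i (p(x_n|gam_k) and P(gam_k)) drop from the argmax
        lam = np.argmax(log_py + np.log(p_lam_given_gam[:, gam].T + eps), axis=1)
        # Eq. (2): terms constant in k (p(y_n|lam_i) and P(lam_i)) drop from the argmax
        gam = np.argmax(log_px + np.log(p_gam_given_lam[:, lam].T + eps), axis=1)
    return lam, gam
```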

Figure 2. The figure shows improved performance and soft boundaries for the synthetic data using cross-modal clustering; here the effect of uncertainty in correspondences is less. (Panels show the data in Mode A and Mode B.)



measure of the uncertainty in prediction, and thus forms an intuitive measure for our purpose. It always lies between 0 and 1, so comparisons between different cross-modal clusterings are easy: 1 indicates very high uncertainty, while 0 indicates a one-to-one mapping between corresponding clusters in the two modalities.
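Equation 3 itself is not recoverable from this copy of the paper, so the sketch below should be read as one plausible instantiation with the stated properties (bounded in [0, 1], 0 for a one-to-one correspondence), namely the entropy of the articulatory-cluster distribution corresponding to each acoustic cluster, normalized by its maximum value. It is an assumption, not necessarily the authors' definition.

```python
import numpy as np

def correspondence_uncertainty(lam, gam, K):
    """Uncertainty of the articulatory correspondence of each acoustic
    cluster: entropy of P(gamma | lambda = i), normalized by log K so
    the value lies in [0, 1].  NOTE: an assumed stand-in for Equation 3."""
    out = {}
    for i in np.unique(lam):
        counts = np.bincount(gam[lam == i], minlength=K).astype(float)
        p = counts / counts.sum()
        p = p[p > 0]                      # drop empty clusters (0 log 0 = 0)
        entropy = -(p * np.log(p)).sum()
        out[i] = entropy / np.log(K) if K > 1 else 0.0
    return out
```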

Experiments and Results

The MOCHA-TIMIT database (Wrench, 1999) was used to perform the experiments. The data consists of simultaneous measurements of acoustic and articulatory data for a female speaker. The articulatory data consisted of 14 channels: the X- and Y-axis positions of EMA coils on 7 articulators, the Lower Jaw (LJ), Upper Lip (UL), Lower Lip (LL), Tongue Tip (TT), Tongue Body (TB), Tongue Dorsum (TD) and Velum (V). Only vowels were considered for this study, and the acoustic space was represented by the first 5 formants, obtained from 25 ms acoustic windows shifted by 10 ms. The articulatory data was low-pass filtered and down-sampled to correspond with the acoustic data rate. The uncertainty (U) in clustering was estimated using Equation 3 for the British vowels, namely /ʊ, æ, e, ɒ, ɑ:, u:, ɜ:ʳ, ɔ:, ʌ, ɩ:, ə, ɘ/. The articulatory data was first clustered over all the articulatory channels together and then clustered individually for each of the 7 articulators.
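The rate-matching step can be sketched as below, assuming the EMA channels are sampled at 500 Hz (MOCHA-TIMIT's EMA rate) and the target is the 100 Hz frame rate implied by the 10 ms window shift; the function name and the use of scipy.signal.decimate are illustrative choices, not details from the paper.

```python
from scipy.signal import decimate

def align_ema_to_frames(ema, ema_rate=500, frame_shift_ms=10):
    """Low-pass filter and down-sample a (samples x channels) EMA array
    so that one articulatory sample matches one 10 ms acoustic frame."""
    factor = int(ema_rate * frame_shift_ms / 1000)  # 500 Hz -> 100 Hz: factor 5
    # decimate applies an anti-aliasing low-pass filter before down-sampling
    return decimate(ema, factor, axis=0)
```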

Fig. 3 shows the clusters in both the acoustic and articulatory spaces for the vowel /e/. We can see that data points corresponding to one cluster in the acoustic space (F1-F2 formant space) correspond to more than one cluster in the articulatory space. The ellipses, which correspond to the initial clusters, are replaced by different clustering labels estimated by the maximum cross-modal a-posteriori probability (MCMAP) algorithm. So although the acoustic features had more than one cluster in the first estimate, after cross-modal clustering all the instances are assigned to a single cluster.

Fig. 4 shows the correspondences between the acoustic clusters and the LJ for the vowel /ə/. We can see that the uncertainty is low for some of the clusters, while it is higher for others. Fig. 5 shows the comparative measures of the overall uncertainty (over all the articulators) of the articulatory clusters corresponding to each one of the acoustic clusters for the different vowels tested. Fig. 6 shows the correspondence uncertainty of individual articulators.

[Figure 5 chart: "Overall Uncertainty for the Vowels"; y-axis: Uncertainty; x-axis: the vowels /ʊ, æ, e, ɒ, ɑ:, u:, ɜ:ʳ, ɔ:, ʌ, ɩ:, ə, ɘ/.]

Figure 5. The figure shows the overall uncertainty (for the whole articulatory configuration) for the British vowels.

[Figure 6 chart: "Uncertainty for Individual Articulators"; y-axis: Uncertainty; x-axis: the same vowels; one series per articulator (LJ, UL, LL, TT, TB, TD, V).]

Figure 6. The figure shows the uncertainty for individual articulators for the British vowels.

Discussion

From Fig. 5 it is clear that the shorter vowels seem to have more uncertainty than the longer vowels, which is intuitive. Higher uncertainty is seen for the short vowels /e/ and /ə/, while there is almost no uncertainty for the long vowels /ɑ:/ and /ɩ:/. The overall uncertainty for the entire configuration is usually around the lowest uncertainty for a single articulator. This is intuitive, and shows that even though certain articulator correspondences are uncertain, the correspondences are more certain for the overall configuration. When the uncertainty for individual articulators is observed, it is apparent that the velum has a high uncertainty of more than 0.6 for all the vowels. This is due to the fact that nasalization is not easily observable in the formants. So even though different clusters are formed in the articulatory space, they are seen in the same cluster in the acoustic space. The uncertainty is much lower in the lower lip correspondence for the long vowels /ɑ:/, /u:/ and /ɜ:ʳ/, while it is high for /ʊ/ and /e/. The TD shows lower uncertainty for the back vowels /u:/ and /ɔ:/, while its uncertainty is higher for front vowels like /e/ and /ɘ/. The uncertainty for the tongue tip is lower for vowels like /ʊ/ and /ɑ:/, while it is higher for /ɜ:ʳ/ and /ʌ/. These results are intuitive, and show that it is easier to find correspondences between acoustic and articulatory clusters for some vowels, while it is more difficult for others.

Conclusion and Future Work

The proposed algorithm helps in improving clustering using information from multiple modalities. A measure for finding the uncertainty in correspondences between acoustic and articulatory clusters has been suggested, and empirical results on certain British vowels have been presented. The results are intuitive and show the difficulties in making predictions about articulation from acoustics for certain sounds. It follows that certain changes in the articulatory configurations cause variation in the formants, while certain other articulatory changes do not change the formants.

It is apparent that the empirical results presented depend on the type of clustering and the initialization of the algorithm. This must be explored. Future work must also be done on extending this paradigm to include other classes of phonemes as well as different languages and subjects. It would be interesting to see whether these empirical results can be generalized or are particular to certain subjects, languages and accents.

Acknowledgements

This work is supported by the Swedish Research Council project 80449001, Computer-Animated Language Teachers.

References

Bolelli, L., Ertekin, S., Zhou, D. and Giles, C. L. (2007) K-SVMeans: A Hybrid Clustering Algorithm for Multi-Type Interrelated Datasets. International Conference on Web Intelligence, 198-204.

Coen, M. H. (2005) Cross-Modal Clustering. Proceedings of the Twentieth National Conference on Artificial Intelligence, 932-937.

Gay, T., Lindblom, B. and Lubker, J. (1981) Production of bite-block vowels: acoustic equivalence by selective compensation. J. Acoust. Soc. Am. 69, 802-810.

Kjellström, H. and Engwall, O. (2009) Audiovisual-to-articulatory inversion. Speech Communication 51(3), 195-209.

Mermelstein, P. (1967) Determination of the Vocal-Tract Shape from Measured Formant Frequencies. J. Acoust. Soc. Am. 41, 1283-1294.

Neiberg, D., Ananthakrishnan, G. and Engwall, O. (2008) The Acoustic to Articulation Mapping: Non-linear or Non-unique? Proceedings of Interspeech, 1485-1488.

Qin, C. and Carreira-Perpiñán, M. Á. (2007) An Empirical Investigation of the Non-uniqueness in the Acoustic-to-Articulatory Mapping. Proceedings of Interspeech, 74-77.

Schroeder, M. R. (1967) Determination of the geometry of the human vocal tract by acoustic measurements. J. Acoust. Soc. Am. 41(2), 1002-1010.

Wrench, A. (1999) The MOCHA-TIMIT articulatory database. Queen Margaret University College, Tech. Rep. Online: http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html.

Yehia, H., Rubin, P. and Vatikiotis-Bateson, E. (1998) Quantitative association of vocal-tract and facial behavior. Speech Communication 26(1-2), 23-43.

