Cross-modal clustering in the acoustic - articulatory space
Cross-modal clustering in the acoustic - articulatory space
Cross-modal clustering in the acoustic - articulatory space
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Proceed<strong>in</strong>gs, FOETIK 2009, Dept. of L<strong>in</strong>guistics, Stockholm University<br />
<strong>Cross</strong> - <strong>modal</strong> Cluster<strong>in</strong>g <strong>in</strong> <strong>the</strong> Acoustic - Articulatory<br />
Space<br />
G. Ananthakrishnan and Daniel M. eiberg<br />
Centre for Speech Technology, CSC, KTH, Stockholm<br />
agopal@kth.se, neiberg@speech.kth.se<br />
Abstract<br />
This paper explores cross-<strong>modal</strong> <strong>cluster<strong>in</strong>g</strong> <strong>in</strong><br />
<strong>the</strong> <strong>acoustic</strong>-<strong>articulatory</strong> <strong>space</strong>. A method to<br />
improve <strong>cluster<strong>in</strong>g</strong> us<strong>in</strong>g <strong>in</strong>formation from<br />
more than one <strong>modal</strong>ity is presented. Formants<br />
and <strong>the</strong> Electromagnetic Articulography measurements<br />
are used to study correspond<strong>in</strong>g clusters<br />
formed <strong>in</strong> <strong>the</strong> two <strong>modal</strong>ities. A measure<br />
for estimat<strong>in</strong>g <strong>the</strong> uncerta<strong>in</strong>ty <strong>in</strong> correspondences<br />
between one cluster <strong>in</strong> <strong>the</strong> <strong>acoustic</strong><br />
<strong>space</strong> and several clusters <strong>in</strong> <strong>the</strong> <strong>articulatory</strong><br />
<strong>space</strong> is suggested.<br />
Introduction<br />
Try<strong>in</strong>g to estimate <strong>the</strong> <strong>articulatory</strong> measurements<br />
from <strong>acoustic</strong> data has been of special<br />
<strong>in</strong>terest for long time and is known as <strong>acoustic</strong>to-<strong>articulatory</strong><br />
<strong>in</strong>version. Though this mapp<strong>in</strong>g<br />
between <strong>the</strong> two <strong>modal</strong>ities expected to be a<br />
one-to-one mapp<strong>in</strong>g, early research presented<br />
some <strong>in</strong>terest<strong>in</strong>g evidence show<strong>in</strong>g nonuniqueness,<br />
<strong>in</strong> this mapp<strong>in</strong>g. Bite-block experiments<br />
have shown that speakers are capable<br />
of produc<strong>in</strong>g sounds perceptually close to <strong>the</strong><br />
<strong>in</strong>tended sounds even though <strong>the</strong> jaw is fixed <strong>in</strong><br />
an unnatural position (Gay et al., 1981). Mermelste<strong>in</strong><br />
(1967) and Schroeder (1967) have<br />
shown, through analytical <strong>articulatory</strong> models,<br />
that <strong>the</strong> <strong>in</strong>version is unique to a class of area<br />
functions ra<strong>the</strong>r than a unique configuration of<br />
<strong>the</strong> vocal tract.<br />
With <strong>the</strong> advent of measur<strong>in</strong>g techniques<br />
like Electromagnetic Articulography (EMA)<br />
and X-Ray Microbeam, it was possible to collect<br />
simultaneous measurements of <strong>acoustic</strong>s<br />
and articulation dur<strong>in</strong>g cont<strong>in</strong>uous speech. Several<br />
attempts have been made by researchers to<br />
perform <strong>acoustic</strong>-to-<strong>articulatory</strong> <strong>in</strong>version by<br />
apply<strong>in</strong>g mach<strong>in</strong>e learn<strong>in</strong>g techniques to <strong>the</strong><br />
<strong>acoustic</strong>-<strong>articulatory</strong> data (Yehia et al., 1998<br />
and Kjellström and Engwall, 2009). The statistical<br />
methods applied to <strong>the</strong> problem of mapp<strong>in</strong>g<br />
brought a new dimension to <strong>the</strong> concept of<br />
non-uniqueness <strong>in</strong> <strong>the</strong> mapp<strong>in</strong>g. In <strong>the</strong> determ<strong>in</strong>istic<br />
case, one can say that if <strong>the</strong> same <strong>acoustic</strong><br />
parameters are produced by more than one <strong>articulatory</strong><br />
configuration, <strong>the</strong>n <strong>the</strong> particular<br />
mapp<strong>in</strong>g is considered to be non-unique. It is<br />
almost impossible to show this us<strong>in</strong>g real recorded<br />
data, unless more than one <strong>articulatory</strong><br />
configuration produces exactly <strong>the</strong> same <strong>acoustic</strong><br />
parameters. However, not f<strong>in</strong>d<strong>in</strong>g such <strong>in</strong>stances<br />
does not imply that non-uniqueness<br />
does not exist.<br />
Q<strong>in</strong> and Carreira-Perp<strong>in</strong>án (2007) proposed<br />
that <strong>the</strong> mapp<strong>in</strong>g is non-unique if, for a particular<br />
<strong>acoustic</strong> cluster, <strong>the</strong> correspond<strong>in</strong>g <strong>articulatory</strong><br />
mapp<strong>in</strong>g may be found <strong>in</strong> more than one<br />
cluster. Evidence of non-uniqueness <strong>in</strong> certa<strong>in</strong><br />
<strong>acoustic</strong> clusters for phonemes like /ɹ/, /l/ and<br />
/w/ was presented. The study by Q<strong>in</strong> quantized<br />
<strong>the</strong> <strong>acoustic</strong> <strong>space</strong> us<strong>in</strong>g <strong>the</strong> perceptual Itakura<br />
distance on LPC features. The <strong>articulatory</strong><br />
<strong>space</strong> was clustered us<strong>in</strong>g a nonparametric<br />
Gaussian density kernel with a fixed variance.<br />
The problem with such a def<strong>in</strong>ition of nonuniqueness<br />
is that one does not know what is<br />
<strong>the</strong> optimal method and level of quantization<br />
for <strong>cluster<strong>in</strong>g</strong> <strong>the</strong> <strong>acoustic</strong> and <strong>articulatory</strong><br />
<strong>space</strong>s.<br />
A later study by Neiberg et. al. (2008) argued<br />
that <strong>the</strong> different <strong>articulatory</strong> clusters<br />
should not only map onto a s<strong>in</strong>gle <strong>acoustic</strong> cluster<br />
but should also map onto <strong>acoustic</strong> distributions<br />
with <strong>the</strong> same parameters, for it to be<br />
called non-unique. Us<strong>in</strong>g an approach based on<br />
f<strong>in</strong>d<strong>in</strong>g <strong>the</strong> Bhattacharya distance between <strong>the</strong><br />
distributions of <strong>the</strong> <strong>in</strong>verse mapp<strong>in</strong>g, <strong>the</strong>y found<br />
that phonemes like /p/, /t/, /k/, /s/ and /z/ are<br />
highly non-unique.<br />
In this study, we wish to observe how clusters<br />
<strong>in</strong> <strong>the</strong> <strong>acoustic</strong> <strong>space</strong> map onto <strong>the</strong> <strong>articulatory</strong><br />
<strong>space</strong>. For every cluster <strong>in</strong> <strong>the</strong> <strong>acoustic</strong><br />
<strong>space</strong>, we <strong>in</strong>tend to f<strong>in</strong>d <strong>the</strong> uncerta<strong>in</strong>ty <strong>in</strong> f<strong>in</strong>d<strong>in</strong>g<br />
a correspond<strong>in</strong>g <strong>articulatory</strong> cluster. It must<br />
be noted that this uncerta<strong>in</strong>ty is not necessarily<br />
<strong>the</strong> non-uniqueness <strong>in</strong> <strong>the</strong> <strong>acoustic</strong>-to<strong>articulatory</strong><br />
mapp<strong>in</strong>g. However, f<strong>in</strong>d<strong>in</strong>g this uncerta<strong>in</strong>ty<br />
would give an <strong>in</strong>tuitive understand<strong>in</strong>g
Proceed<strong>in</strong>gs, FOETIK 2009, Dept. of L<strong>in</strong>guistics, Stockholm University<br />
about <strong>the</strong> difficulties <strong>in</strong> <strong>the</strong> mapp<strong>in</strong>g for different<br />
phonemes.<br />
Cluster<strong>in</strong>g <strong>the</strong> <strong>acoustic</strong> and <strong>articulatory</strong><br />
<strong>space</strong>s separately, as was done <strong>in</strong> previous studies<br />
by Q<strong>in</strong> and Carreira-Perp<strong>in</strong>án (2007) as well<br />
as Neiberg et al. (2008), leads to hard boundaries<br />
<strong>in</strong> <strong>the</strong> clusters. The cluster labels for <strong>the</strong><br />
<strong>in</strong>stances near <strong>the</strong>se boundaries may estimated<br />
<strong>in</strong>correctly, which may cause an over estimation<br />
of <strong>the</strong> uncerta<strong>in</strong>ty. This situation is expla<strong>in</strong>ed<br />
by Fig. 1 us<strong>in</strong>g syn<strong>the</strong>tic data where we<br />
can see both <strong>the</strong> distributions of <strong>the</strong> syn<strong>the</strong>tic<br />
data as well as <strong>the</strong> Maximum A-posteriori<br />
Probability (MAP) Estimates for <strong>the</strong> clusters.<br />
We can see that, because of <strong>the</strong> <strong>in</strong>correct <strong>cluster<strong>in</strong>g</strong>,<br />
it seems as if data belong<strong>in</strong>g to one cluster<br />
<strong>in</strong> mode A belongs to more than one cluster<br />
<strong>in</strong> mode B.<br />
In order to mitigate this problem, we have<br />
suggested a method of cross-<strong>modal</strong> <strong>cluster<strong>in</strong>g</strong><br />
where both <strong>the</strong> available <strong>modal</strong>ities are made<br />
use of by allow<strong>in</strong>g soft boundaries for <strong>the</strong> clusters<br />
<strong>in</strong> each <strong>modal</strong>ity. <strong>Cross</strong>-<strong>modal</strong> <strong>cluster<strong>in</strong>g</strong><br />
has been dealt with <strong>in</strong> detail under several contexts<br />
of comb<strong>in</strong><strong>in</strong>g multi-<strong>modal</strong> data. Coen<br />
(2005) proposed a self supervised method<br />
where he used <strong>acoustic</strong> and visual features to<br />
learn perceptual structures based on temporal<br />
correlations between <strong>the</strong> two <strong>modal</strong>ities. He<br />
used <strong>the</strong> concept of slices, which are topological<br />
manifolds encod<strong>in</strong>g dynamic states. Similarly,<br />
Belolli et al.(2007) proposed a <strong>cluster<strong>in</strong>g</strong><br />
algorithm us<strong>in</strong>g Support Vector Mach<strong>in</strong>es<br />
(SVMs) for <strong>cluster<strong>in</strong>g</strong> <strong>in</strong>ter-related text datasets.<br />
The method proposed <strong>in</strong> this paper does not<br />
make use of correlations, but ma<strong>in</strong>ly uses co<strong>cluster<strong>in</strong>g</strong><br />
properties between <strong>the</strong> two <strong>modal</strong>ities<br />
<strong>in</strong> order to perform <strong>the</strong> cross-<strong>modal</strong> <strong>cluster<strong>in</strong>g</strong>.<br />
Thus, even non-l<strong>in</strong>ear dependencies (uncorrelated)<br />
may also be modeled us<strong>in</strong>g this<br />
simple method.<br />
Theory<br />
We assume that <strong>the</strong> data is a Gaussian Mixture<br />
Model (GMM). The <strong>acoustic</strong> <strong>space</strong> Y = {y 1 ,<br />
y 2 …y } with ‘’ data po<strong>in</strong>ts is modelled us<strong>in</strong>g<br />
‘I’ Gaussians, {λ 1 , λ 2 ,…λ I } and <strong>the</strong> <strong>articulatory</strong><br />
<strong>space</strong> X = {x 1 , x 2 …x } is modelled us<strong>in</strong>g ‘K’<br />
Gaussians, {γ 1 ,γ 2 ,….γ K }. ‘I’ and ‘K’ are<br />
obta<strong>in</strong>ed by m<strong>in</strong>imiz<strong>in</strong>g <strong>the</strong> Bayesian<br />
Information Criterion (BIC). If we know which<br />
<strong>articulatory</strong> Gaussian a particular data po<strong>in</strong>t<br />
belongs to, say, γ k The correct <strong>acoustic</strong><br />
Gaussian ‘λ n ’ for <strong>the</strong> <strong>the</strong> ‘n th ’ data po<strong>in</strong>t hav<strong>in</strong>g<br />
<strong>acoustic</strong> features ‘y n ’ and <strong>articulatory</strong> features<br />
‘x n ’ is given by <strong>the</strong> maximum cross-<strong>modal</strong> a-<br />
posteriori probability<br />
λ = arg max P( λ | x , y , γ )<br />
n i n n k<br />
1≤i ≤I<br />
= arg max p( xn, yn<br />
| λi , γ k )*P( λi | γ k )*P( γ k )<br />
1≤i≤<br />
I<br />
(1)<br />
The knowledge about <strong>the</strong> <strong>articulatory</strong> cluster<br />
can <strong>the</strong>n be used to improve <strong>the</strong> estimate of <strong>the</strong><br />
Figure 1. The figures above show a syn<strong>the</strong>sized<br />
example of data <strong>in</strong> two <strong>modal</strong>ities. The figures<br />
below show how MAP hard <strong>cluster<strong>in</strong>g</strong> may br<strong>in</strong>g<br />
about an effect of uncerta<strong>in</strong>ty <strong>in</strong> <strong>the</strong> correspondence<br />
between clusters <strong>in</strong> <strong>the</strong> two <strong>modal</strong>ities.<br />
correct <strong>acoustic</strong> cluster and vice versa as shown<br />
below<br />
γ = arg max p( x , y | λ , γ )*P( γ | λ )*P( λ )<br />
n n n i k k i i<br />
1≤k ≤K<br />
(2)<br />
Where P(λ|γ) is <strong>the</strong> cross-<strong>modal</strong> prior and <strong>the</strong><br />
p(x,y|λ,γ) is <strong>the</strong> jo<strong>in</strong>t cross-<strong>modal</strong> distribution.<br />
If <strong>the</strong> first estimates of <strong>the</strong> correct clusters are<br />
MAP, <strong>the</strong>n <strong>the</strong> estimates of <strong>the</strong> correct clusters<br />
of <strong>the</strong> speech segments are improved<br />
Figure 2. The figures shows an improved<br />
8<br />
6<br />
4<br />
2<br />
0<br />
2<br />
8<br />
6<br />
4<br />
2<br />
0<br />
2<br />
4<br />
6 4 2 0 2 4 6 8<br />
8<br />
6<br />
4<br />
2<br />
0<br />
2<br />
Mode A<br />
Mode A<br />
4<br />
6 4 2 0 2 4 6 8<br />
Mode A<br />
4<br />
6 4 2 0 2 4 6 8<br />
2<br />
4 2 0 2 4 6 8 10 12<br />
performance and soft boundaries for <strong>the</strong> syn<strong>the</strong>tic<br />
data us<strong>in</strong>g cross-<strong>modal</strong> clustr<strong>in</strong>g, here <strong>the</strong> effect of<br />
uncerta<strong>in</strong>ty <strong>in</strong> correspondences is less.<br />
12<br />
10<br />
8<br />
6<br />
4<br />
2<br />
0<br />
12<br />
10<br />
2<br />
4 2 0 2 4 6 8 10 12<br />
8<br />
6<br />
4<br />
2<br />
0<br />
12<br />
10<br />
8<br />
6<br />
4<br />
2<br />
0<br />
Mode B<br />
Mode B<br />
Mode B<br />
2<br />
4 2 0 2 4 6 8 10 12
Proceed<strong>in</strong>gs, FOETIK 2009, Dept. of L<strong>in</strong>guistics, Stockholm University<br />
Proceed<strong>in</strong>gs, FOETIK 2009, Dept. of L<strong>in</strong>guistics, Stockholm University<br />
measure of <strong>the</strong> uncerta<strong>in</strong>ty <strong>in</strong> prediction, and<br />
thus forms an <strong>in</strong>tuitive measure for our<br />
purpose. It is always between 0 and 1 and so<br />
comparisons between different cross-<strong>modal</strong><br />
<strong>cluster<strong>in</strong>g</strong>s is easy. 1 <strong>in</strong>dicates very high<br />
uncerta<strong>in</strong>ty while 0 <strong>in</strong>dicates one-to-one<br />
mapp<strong>in</strong>g between correspond<strong>in</strong>g clusters <strong>in</strong> <strong>the</strong><br />
two <strong>modal</strong>ities.<br />
Experiments and Results<br />
The MOCHA-TIMIT database (Wrench, 2001)<br />
was used to perform <strong>the</strong> experiments. The data<br />
consists of simultaneous measurements of<br />
<strong>acoustic</strong> and <strong>articulatory</strong> data for a female<br />
speaker. The <strong>articulatory</strong> data consisted of 14<br />
channels, which <strong>in</strong>cluded <strong>the</strong> X and Y-axis<br />
positions of EMA coils on 7 articulators, <strong>the</strong><br />
Lower Jaw (LJ), Upper Lip (UL), Lower Lip<br />
(LL), Tongue Tip (TT), Tongue Body (TB),<br />
Tongue Dorsum (TD) and Velum (V). Only<br />
vowels were considered for this study and <strong>the</strong><br />
<strong>acoustic</strong> <strong>space</strong> was represented by <strong>the</strong> first 5<br />
formants, obta<strong>in</strong>ed from 25 ms <strong>acoustic</strong><br />
w<strong>in</strong>dows shifted by 10 ms. The <strong>articulatory</strong><br />
data was low-pass fitered and down-sampled <strong>in</strong><br />
order to correspond with <strong>acoustic</strong> data rate. The<br />
uncerta<strong>in</strong>ty (U) <strong>in</strong> <strong>cluster<strong>in</strong>g</strong> was estimated<br />
us<strong>in</strong>g Equation 3 for <strong>the</strong> British vowels, namely<br />
/ʊ, æ, e, ɒ, ɑ:, u:, ɜ:ʳ, ɔ:, ʌ, ɩ:, ə, ɘ/. The<br />
<strong>articulatory</strong> data was first clustered for all <strong>the</strong><br />
<strong>articulatory</strong> channels and <strong>the</strong>n was clustered<br />
<strong>in</strong>dividually for each of <strong>the</strong> 7 articulators.<br />
Fig. 3 shows <strong>the</strong> clusters <strong>in</strong> both <strong>the</strong><br />
<strong>acoustic</strong> and <strong>articulatory</strong> <strong>space</strong> for <strong>the</strong> vowel<br />
/e/. We can see that data po<strong>in</strong>ts correspond<strong>in</strong>g<br />
to one cluster <strong>in</strong> <strong>the</strong> <strong>acoustic</strong> <strong>space</strong> (F1-F2<br />
formant <strong>space</strong>) correspond to more than one<br />
cluster <strong>in</strong> <strong>the</strong> <strong>articulatory</strong> <strong>space</strong>. The ellipses,<br />
which correspond to <strong>in</strong>itial clusters are replaced<br />
by different <strong>cluster<strong>in</strong>g</strong> labels estimated by <strong>the</strong><br />
MCMAP algorithm. So though <strong>the</strong> <strong>acoustic</strong><br />
features had more than one cluster <strong>in</strong> <strong>the</strong> first<br />
estimate, after cross-<strong>modal</strong> <strong>cluster<strong>in</strong>g</strong>, all <strong>the</strong><br />
<strong>in</strong>stances are assigned to a s<strong>in</strong>gle cluster.<br />
Fig. 4 shows <strong>the</strong> correspondences between<br />
<strong>acoustic</strong> clusters and <strong>the</strong> LJ for <strong>the</strong> vowel /ə/.<br />
We can see that <strong>the</strong> uncerta<strong>in</strong>ty is less for some<br />
of <strong>the</strong> clusters, while it is higher for some<br />
o<strong>the</strong>rs. Fig. 5 shows <strong>the</strong> comparative measures<br />
of overall <strong>the</strong> uncerta<strong>in</strong>ty (over all <strong>the</strong><br />
articulators), of <strong>the</strong> <strong>articulatory</strong> clusters<br />
correspond<strong>in</strong>g to each one of <strong>the</strong> <strong>acoustic</strong><br />
clusters for <strong>the</strong> different vowels tested. Fig 6.<br />
shows <strong>the</strong> correspondence uncerta<strong>in</strong>ty of<br />
<strong>in</strong>dividual articulators.<br />
Uncerta<strong>in</strong>ty<br />
Figure 5. The figure shows <strong>the</strong> overall<br />
uncerta<strong>in</strong>ty (for <strong>the</strong> whole <strong>articulatory</strong><br />
configuration) for <strong>the</strong> British vowels.<br />
Uncerta<strong>in</strong>ty<br />
0.9<br />
0.8<br />
0.7<br />
0.6<br />
0.5<br />
0.4<br />
0.3<br />
0.2<br />
0.1<br />
0<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
0<br />
Overall Uncerta<strong>in</strong>ty for <strong>the</strong> Vowels<br />
ʊ æ e ɒ ɑ: u: ɜ:ʳ ɔ: ʌ ɩ: ə ɘ<br />
Vow els<br />
Uncerta<strong>in</strong>ty for Individual Articulators<br />
ʊ æ e ɒ ɑ: u: ɜ:ʳ ɔ: ʌ ɩ: ə ɘ<br />
Figure 6. The figure shows <strong>the</strong> uncerta<strong>in</strong>ty for<br />
<strong>in</strong>dividual articulators for <strong>the</strong> British vowels.<br />
Discussion<br />
From Fig. 5 it is clear that <strong>the</strong> shorter vowels<br />
seem to have more uncerta<strong>in</strong>ty than longer<br />
vowels which is <strong>in</strong>tuitive. The higher uncerta<strong>in</strong>ty<br />
is seen for <strong>the</strong> short vowels /e/ and /ə/,<br />
while <strong>the</strong>re is almost no uncerta<strong>in</strong>ty for <strong>the</strong> long<br />
vowels /ɑ:/ and /ɩ:/. The overall uncerta<strong>in</strong>ty for<br />
<strong>the</strong> entire configuration is usually around <strong>the</strong><br />
lowest uncerta<strong>in</strong>ty for a s<strong>in</strong>gle articulator. This<br />
is <strong>in</strong>tuitive, and shows that even though certa<strong>in</strong><br />
articulator correspondences are uncerta<strong>in</strong>, <strong>the</strong><br />
correspondences are more certa<strong>in</strong> for <strong>the</strong> overall<br />
configuration. When <strong>the</strong> uncerta<strong>in</strong>ty for <strong>in</strong>dividual<br />
articulators is observed, <strong>the</strong>n it is apparent<br />
that <strong>the</strong> velum has a high uncerta<strong>in</strong>ty of<br />
more than 0.6 for all <strong>the</strong> vowels. This is due to<br />
<strong>the</strong> fact that nasalization is not observable <strong>in</strong><br />
<strong>the</strong> formants very easily. So even though different<br />
clusters are formed <strong>in</strong> <strong>the</strong> <strong>articulatory</strong> <strong>space</strong>,<br />
<strong>the</strong>y are seen <strong>in</strong> <strong>the</strong> same cluster <strong>in</strong> <strong>the</strong> <strong>acoustic</strong><br />
<strong>space</strong>. The uncerta<strong>in</strong>ty is much less <strong>in</strong> <strong>the</strong> lower<br />
lip correspondence for <strong>the</strong> long vowels /ɑ:/, /u:/<br />
and /ɜ:ʳ/ while it is high for /ʊ/ and /e/. The TD<br />
shows lower uncerta<strong>in</strong>ty for <strong>the</strong> back vowels<br />
/u:/ and /ɔ:/. The uncerta<strong>in</strong>ty for TD is higher<br />
V<br />
TD<br />
TB<br />
TT<br />
LL<br />
UL<br />
LJ
Proceed<strong>in</strong>gs, FOETIK 2009, Dept. of L<strong>in</strong>guistics, Stockholm University<br />
for <strong>the</strong> front vowels like /e/ and /ɘ/. The<br />
uncerta<strong>in</strong>ty for <strong>the</strong> tongue tip is lower for <strong>the</strong><br />
vowels like /ʊ/ and /ɑ:/ while it is higher for<br />
/ɜ:ʳ/ and /ʌ/. These results are <strong>in</strong>tuitive, and<br />
show that it is easier to f<strong>in</strong>d correspondences<br />
between <strong>acoustic</strong> and <strong>articulatory</strong> clusters for<br />
some vowels, while it is more difficult for o<strong>the</strong>rs.<br />
Conclusion and Future Work<br />
The algorithm proposed, helps <strong>in</strong> improv<strong>in</strong>g <strong>the</strong><br />
<strong>cluster<strong>in</strong>g</strong> ability us<strong>in</strong>g <strong>in</strong>formation from multiple<br />
<strong>modal</strong>ities. A measure for f<strong>in</strong>d<strong>in</strong>g out uncerta<strong>in</strong>ty<br />
<strong>in</strong> correspondences between <strong>acoustic</strong><br />
and <strong>articulatory</strong> clusters has been suggested and<br />
empirical results on certa<strong>in</strong> British vowels have<br />
been presented. The results presented are <strong>in</strong>tuitive<br />
and show difficulties <strong>in</strong> mak<strong>in</strong>g predictions<br />
about <strong>the</strong> articulation from <strong>acoustic</strong>s for certa<strong>in</strong><br />
sounds. It follows that certa<strong>in</strong> changes <strong>in</strong> <strong>the</strong><br />
<strong>articulatory</strong> configurations cause variation <strong>in</strong><br />
<strong>the</strong> formants, while certa<strong>in</strong> <strong>articulatory</strong> changes<br />
do not change <strong>the</strong> formants.<br />
It is apparent that <strong>the</strong> empirical results presented<br />
depend on <strong>the</strong> type of <strong>cluster<strong>in</strong>g</strong> and <strong>in</strong>itialization<br />
of <strong>the</strong> algorithm. This must be explored.<br />
Future work must also be done on extend<strong>in</strong>g<br />
this paradigm to <strong>in</strong>clude o<strong>the</strong>r classes<br />
of phonemes as well as different languages and<br />
subjects. It would be <strong>in</strong>terest<strong>in</strong>g to see if <strong>the</strong>se<br />
empirical results can be generalized or are special<br />
to certa<strong>in</strong> subjects and languages and accents.<br />
Kjellström, H. and Engwall, O. (2009) Audiovisual-to-<strong>articulatory</strong><br />
<strong>in</strong>version. Speech<br />
Communication 51(3), 195-209.<br />
Mermelste<strong>in</strong>, P., (1967) Determ<strong>in</strong>ation of <strong>the</strong><br />
Vocal-Tract Shape from Measured Formant<br />
Frequencies, J. Acoust. Soc. Am. 41, 1283-<br />
1294.<br />
Neiberg, D., Ananthakrishnan, G. and Engwall,<br />
O. (2008) The Acoustic to Articulation<br />
Mapp<strong>in</strong>g: Non-l<strong>in</strong>ear or Non-unique? Proceed<strong>in</strong>gs<br />
of Interspeech, 1485-1488.<br />
Q<strong>in</strong>, C. and Carreira-Perp<strong>in</strong>án, M. Á. (2007) An<br />
Emperical Investigation of <strong>the</strong> Nonuniqueness<br />
<strong>in</strong> <strong>the</strong> Acoustic-to-Articulatory Mapp<strong>in</strong>g.<br />
Proceed<strong>in</strong>gs of Interspeech, 74–77.<br />
Schroeder, M. R. (1967) Determ<strong>in</strong>ation of <strong>the</strong><br />
geometry of <strong>the</strong> human vocal tract by <strong>acoustic</strong><br />
measurements. J. Acoust. Soc. Am<br />
41(2), 1002–1010.<br />
Wrench, A. (1999) The MOCHA-TIMIT <strong>articulatory</strong><br />
database. Queen Margaret University<br />
College, Tech. Rep, 1999. Onl<strong>in</strong>e:<br />
http://www.cstr.ed.ac.uk/research/projects/a<br />
rtic/mocha.html.<br />
Yehia, H., Rub<strong>in</strong>, P. and Vatikiotis-Bateson.<br />
(1998) Quantitative association of vocaltract<br />
and facial behavior. Speech Communication<br />
26(1-2), 23-43.<br />
Acknowledgements<br />
This work is supported by <strong>the</strong> Swedish<br />
Research Council project 80449001, Computer-<br />
Animated Language Teachers.<br />
References<br />
Bolelli L., Ertek<strong>in</strong> S., Zhou D. and Giles C. L.<br />
(2007) K-SVMeans: A Hybrid Cluster<strong>in</strong>g<br />
Algorithm for Multi-Type Interrelated Datasets.<br />
International Conference on Web Intelligence,<br />
198–204.<br />
Coen M. H. (2005) <strong>Cross</strong>-Modal Cluster<strong>in</strong>g.<br />
Proceed<strong>in</strong>gs of <strong>the</strong> Twentieth National Conference<br />
on Artificial Intelligence, 932-937.<br />
Gay, T., L<strong>in</strong>dblom B. and Lubker, J. (1981)<br />
Production of bite-block vowels: <strong>acoustic</strong><br />
equivalence by selective compensation. J.<br />
Acoust. Soc. Am. 69, 802-810, 1981.
Proceed<strong>in</strong>gs, FONETIK 2009, Dept. of L<strong>in</strong>guistics, Stockholm University