Chapter 3

Acoustic Correlates of Voice Characteristics

3.1 Voice characterisation
3.2 Voice qualities
3.3 Speaker characteristics and gender effects
3.4 Accent characteristics
3.5 Emotion effects
3.6 Acoustic correlates and voice profiling
3.7 An example of voice profiling

The speaker-dependent acoustic correlates of voice characteristics can be considered in two categories: (i) physiological characteristics, associated with gender, age, voice qualities, etc., and (ii) psycho/socio-logical characteristics, e.g. speaking habit, regional accent and emotional expression. Techniques for modelling speaker characteristics can be applied to speaker normalisation and adaptation in speech recognition, to speaker identification and verification, and to improving the naturalness and voice-adaptation capabilities of speech synthesis. The objective of this chapter is to develop a voice profiling system for modelling speaker characteristics and their acoustic correlates. Using voice profiles, voice characteristics can be mapped between two speakers.

In terms of the physiological versus psycho/socio-logical dimensions, speaker individuality can be reasonably related to acoustic features. The physiological dimension is illustrated by the human physical profile, such as gender, age, weight and height. Most of these personal attributes are acoustically manifested and perceived in the spectral-temporal characteristics of the voice. For example, the frequency of the openings and closings of the vocal cords, i.e. the pitch, is the most important factor conveying gender and also age information. The vocal tract shape and size, another main physical correlate of gender, contributes to the distribution of the concentration of the speech spectrum. Psycho/socio-logical characteristics include all the influences of human social activity, e.g. family influence, education, residential region, speaking purpose and emotion. They are mainly distributed in the prosodic parameters. For instance, a regional accent, a main speaker effect associated with geographical region, is formed by using different phoneme sets and intonation patterns within the same language. In this thesis, a voice profile is designed to model the time-frequency variation of a speaker's individuality using acoustic parameters.

This chapter is organised as follows. In Section 3.1, an overview of speaker effects and the acoustic correlates of voice is presented. Sections 3.2 to 3.5 summarise previous work on voice quality, gender characteristics, accent effects and emotion correlates. The speaker profiling method is defined in Section 3.6. In the last section, an example illustrates the voice profiles of two speakers for voice conversion.


3.1 Voice characterisation

The characteristics of a person's voice are functions of anatomical, accent, prosodic and emotional characteristics. The effects on the distributions of the speech parameters vary with segments, e.g. phonemes, syllables, words and sentences. Within each speech segment, the speaker characteristics are distributed across time and frequency. In the frequency spectrum, voice characteristics are manifested to different degrees across different sub-bands. In the time domain, the temporal variation of the voice is affected by the stress, pitch, energy and duration, and also by the context of each segment.

3.1.1 Categorisation of voice individuality

Kuwabara and Sagisaka (1995) categorise voice individuality in terms of physiological versus socio/psycho-logical dimensions. The former is associated with voice quality and the latter with speaking style. In terms of acoustic parameters, voice quality is reflected in the vocal/nasal tract power spectrum, the glottal source frequency and spectrum, and the noise source spectrum arising from vocal tract constriction. Speaking style is realised in co-articulation, which is the temporal trajectory of the formants, and in prosodic structure, i.e. pitch accent and contour, stress frequency, duration and regional accent.

3.1.2 Physiological characteristics

Physiological characteristics of speaker individuality vary with gender, age, weight, etc. They are generally affected by the speaker's vocal tract shape and size, vocal fold length and mass, vocal fold elasticity, and lung capacity. Vocal tract shape and size acoustically affect the resonances, i.e. the formant frequencies and bandwidths. The vocal fold characteristics and the lung capacity affect the fundamental frequency and loudness respectively. For example, the vocal tract and vocal fold size and characteristics have been shown to be the significant features distinguishing male and female voices (Wu and Childers, 1991; Childers and Wu, 1991; Klatt and Klatt, 1990). It is agreed that the averages of the fundamental frequency and the formants are the most important parameters in speaker identification. The glottal excitation waveform and the ratio of the periodic and aperiodic parts of the voice source also provide a certain degree of discrimination among speakers. These parameters are also highly correlated with different types of voice qualities (Childers and Lee, 1991; Klatt and Klatt, 1990). Modelling the speech spectrum using the Liljencrants-Fant (LF) model, with four controllable parameters and modulated noise in the source waveform, has been reported by Childers and Lee (1991) to achieve modal voice, vocal fry, falsetto and breathy voice synthesis.


3.1.3 Psycho/socio-logical characteristics

In general, psycho/socio-logical characteristics are gradually built up through human social activities such as education, residential area and family background. They are usually demonstrated in linguistic, semantic and emotional differences. For instance, many researchers believe that melodic cues, i.e. intonation and/or co-articulation cues, are voice characteristics associated with female voices (Murray and Arnott, 1993; Wu and Childers, 1991). In a given language, accents reflect differences in phoneme sets and pronunciation sequences, i.e. phoneme substitution, insertion and deletion within a word (Humphries, 1997), and also in intonation patterns, e.g. duration variation of phonemes, pitch rises or falls at certain positions in a sentence, etc. (Arslan, 1996). In English, Received Pronunciation (RP) is the accent that originated in London and its surrounding area. Today it is considered a non-regional accent, continued by the media (traditionally BBC English) and taught through education to foreign learners of English. However, a survey in 1979 by Trudgill revealed that RP is spoken by just three percent of the population of England (Humphries, 1997).

3.1.4 Acoustic correlates of voice individuality

Based on the available voice analysis and synthesis technologies, it is reasonable to model voice characteristics from the perceptual aspect using acoustic parameters. Referring to the voice production system, acoustic parameters can be categorised into two groups, vocal source parameters and vocal tract parameters. The static and dynamic features of these parameters are correlated with the physiological and psycho/socio-logical characteristics of the voice. The acoustic features of the vocal source include average pitch, pitch range, pitch accent, accent frequency and pitch fluctuations. The acoustic features of the vocal tract consist of the spectral slope and shape, the long-term average spectrum, formant frequencies and bandwidths, and formant trajectories. In the following sections four groups of voice characteristics and their acoustic correlates are discussed.

3.2 Voice qualities

Voice qualities are affected by the characteristics of the vocal folds and are manifested through the speech spectrum, e.g. the energy ratio of low to high bands, harmonic-to-noise ratio, loudness, pitch level, and the perturbation of pitch and energy. They are perceived in natural voices and sometimes vary across speech (Childers and Lee, 1991). They are also the prominent effects perceived in different emotions (Murray and Arnott, 1993). The synthesis of different voice qualities has been studied not only with synthesis-by-rule speech synthesisers (Klatt and Klatt, 1990) but also with concatenative speech synthesisers (Sun, 2000). The objective of these technologies is to make synthesised speech sound more natural.


Table 3.1 Summary of four kinds of voice qualities in natural speech.

Correlates | Vocal fry | Modal | Falsetto | Breathy
Vocal fold length (L) & thickness (T) (F0 ∝ L)* | Short & thick | Medium | Long & thin |
Pitch | Low | Medium | High | Wide range
F0 range** (Hz) | M: 24-52, F: 18-46 | M: 94-287, F: 144-538 | M: 275-634, F: 495-1131 | Wide range
Pitch perturbation | High | Low | | High
Source spectral slope | Flat | Medium | Steep | Steep
Lower formant bandwidth | Low | Medium | | High
Aspiration noise | Low | Medium | | High
Loudness | Soft | Wide range | Soft | Soft
Amplitude perturbation | | Low | | High
Vocal fold vibration pattern | Sharp short pulse followed by a long closing phase; the opening phase may have several opening/closing pulses | Longer opening phase and rapid closing phase | Gradual glottal opening and closing phases with a short or no closing phase | Voice may have a slight vibratory excursion of the vocal folds with incomplete glottal closure

* For each speaker, the fundamental frequency is proportional to the vocal fold length (Titze, 1989).
** M denotes male, and F denotes female.

3.2.1 Voice qualities in natural speech

Laryngealised (vocal fry), normal and breathy voices are three very common voice types in spontaneous speech. Table 3.1 summarises the research of Klatt and Klatt (1990) and Childers and Lee (1991). In this table, the characteristics of the four kinds of voice qualities, named vocal fry, modal, falsetto and breathy voice, are compared according to each acoustic correlate.

3.3 Speaker characteristics and gender effects

Speaker characteristics can be considered in two main aspects, the anatomical aspect and the phonetic/prosodic aspect. In the anatomical aspect, gender effects are the most significant outcome. The most prominent gender effect is caused by the vocal fold characteristics: the relatively longer and thicker vocal folds of male speakers produce a lower pitch than those of female speakers. Vocal tract size is also one of the main contributions to the gender effect. Generally speaking, the vocal tract of a male speaker is longer than that of a female speaker; the consequence is that, compared to female speakers, male speakers have lower formant frequencies. Apart from the anatomical reasons, the speaking habits of speakers also consistently affect the perceptual effects of their speech. Several examples are as follows.


Figure 3.1 Distribution of F0 over 326 male speakers and 136 female speakers (from the TIMIT training database). The x-axis is F0 (Hz) and the y-axis is the occurrence frequency. The mean and standard deviation for male speakers are 117 and 16.0 Hz respectively; for female speakers they are 199 and 20.7 Hz.

Some speakers talk faster than others. Some speakers like to laryngealise their voice. Trained broadcasters can keep a steady intonation across speech. Trained singers can control their tone and stress individually.

3.3.1 Gender effects

From a study of the TIMIT American English database, two statistical results demonstrate significant gender effects, through the distribution of the speakers' fundamental frequency and through the average power spectrum of each phoneme. In Figure 3.1, the averaged fundamental frequency for each speaker is estimated and collected from the TIMIT training database. There are 326 male speakers and 136 female speakers in the database, and each speaker has about two to ten sentences. The results are displayed in several frequency bins as a bar chart, where the height of each bar is the occurrence frequency. Two Gaussian bell shapes are fitted to illustrate the distributions. The mean value and standard deviation for the male speakers are 117 and 16.0 Hz respectively; for the female speakers they are 199 and 20.7 Hz. In Figure 3.2, a set of averaged filter-bank energy spectra from male and female speakers is illustrated. The averaged energy spectra for male and female speakers are compared for each phoneme. Three vowels, AA, IY and UW, are used for illustration; these three vowels lie at three extreme positions of the vowel distribution in the F1-F2 plane, where F1 and F2 denote the first and second formants, respectively.

Figure 3.2 Spectral energy distributions of the three vowels AA, IY and UW for male and female speakers. In each plot, the characteristic spectral shift is confirmed. Data are collected from the full TIMIT database.
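As a rough illustration of how the statistics behind Figures 3.1 and 3.2 can be computed, the sketch below assumes that per-speaker mean F0 values and labelled filter-bank spectra are already available; the function and variable names are illustrative and not taken from the original analysis.

import numpy as np

def gender_f0_statistics(speaker_mean_f0, speaker_gender):
    """Mean and standard deviation of speaker-averaged F0 per gender (cf. Figure 3.1)."""
    stats = {}
    for gender in ("male", "female"):
        values = np.array([f0 for spk, f0 in speaker_mean_f0.items()
                           if speaker_gender[spk] == gender])
        stats[gender] = (values.mean(), values.std(ddof=1))
    return stats

def average_phoneme_spectra(spectra, genders, phonemes):
    """Average filter-bank energy spectrum per (gender, phoneme) pair (cf. Figure 3.2)."""
    grouped = {}
    for spectrum, gender, phoneme in zip(spectra, genders, phonemes):
        grouped.setdefault((gender, phoneme), []).append(np.asarray(spectrum, dtype=float))
    return {key: np.mean(np.stack(frames), axis=0) for key, frames in grouped.items()}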


Data are collected from the full TIMIT database, with 438 male speakers and 192 female speakers; the total number of sentences is 2313 for the male speakers and 1054 for the female speakers. The most significant feature is that, compared to the male speakers' spectra, the female speakers' spectra show a certain shift of the energy distribution towards the higher frequency regions. The amount of shift also varies across phonemes.

Wu and Childers (1991) studied gender characteristics by analysing spectral templates using different acoustic features. They concluded that the gender information is time invariant, phoneme independent and speaker independent for a given gender. Furthermore, significance tests between gender and various acoustic parameters, i.e. fundamental frequency, formant frequencies, formant bandwidths and formant intensities, were carried out by Childers and Wu (1991). They reported that their results also support the studies of Bladon (1983). Their results are summarised as follows:

• Female voices show a steeper spectral slope, and male voices have narrower formant bandwidths.
• Females have higher pitch and higher formant frequencies, caused by the vocal fold length/mass and the vocal tract length respectively.
• Male speakers have more significant source-tract interaction, which shows up as a prominent hump in the opening phase of the source wave.
• For soft voices, the spectral slopes are more similar between male and female speakers.
• For identifying the gender of a speaker, redundant information is embedded in the fundamental frequency and the vocal tract resonance features, i.e. formant frequency, bandwidth and intensity.

3.3.2 Formant ratios and fundamental frequency

Gender effects, one of the prominent speaker characteristics, are important features in speech synthesis technologies, and a significant amount of research in speech synthesis has been conducted to synthesise voices with natural gender characteristics. However, variations of speaker characteristics have been obstacles to automatic speech recognition, since the inter-speaker variations flatten the averaged spectra when training the statistical models. In the recognition stage, the mismatches between the trained models and the features of the test speech degrade the recognition accuracy significantly. To overcome the problems caused by speaker variation, speaker normalisation and adaptation have been studied and used to improve automatic speech recognition.

Vocal tract length (VTL) normalisation techniques have been studied and adopted for speaker normalisation (Wakita, 1977; Lee and Rose, 1996; Eide and Gish, 1996), under the hypothesis that, for the same vowel in the same context, the vocal tract shapes of different speakers are similar to each other and differ only in length. Findings from Wakita (1977) indicated that formant frequencies and bandwidths are inversely proportional to the vocal tract length. Using this articulatory-acoustic relation, acoustic features are normalised phoneme-dependently to reduce the inter-speaker difference. The idea was extended to maximum likelihood (ML) based methods (Lee and Rose, 1996; Eide and Gish, 1996; Zhan and Westphal, 1997), in which linear or non-linear frequency warping functions are estimated for speaker normalisation to improve speaker-independent automatic speech recognition systems.
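The ML-based methods above choose warping factors from recogniser likelihoods; as a much simplified stand-in, the sketch below grid-searches a single linear warping factor that best aligns a speaker's long-term average spectrum with a reference spectrum. The search range and the names are illustrative assumptions, not the cited algorithms.

import numpy as np

def estimate_warp_factor(speaker_spectrum, reference_spectrum, freqs_hz,
                         alphas=np.linspace(0.8, 1.2, 41)):
    """Grid search for a linear frequency-warping factor alpha (f' = alpha * f)
    minimising the mean squared difference between the warped speaker spectrum
    and a reference spectrum sampled on the same frequency grid."""
    best_alpha, best_err = 1.0, np.inf
    for alpha in alphas:
        # resample the speaker spectrum onto the warped frequency axis
        warped = np.interp(freqs_hz, alpha * freqs_hz, speaker_spectrum)
        err = np.mean((warped - reference_spectrum) ** 2)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha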


Miller (1989) presented an auditory-perceptual theory descended from the formant-ratio theory. In Miller's theory a sensory reference frequency was proposed as an absolute reference frequency with which to normalise the formant frequencies of speakers. The sensory reference frequency is proportional to the cube root of the geometric mean of the speaker's voice pitch. This theory and the accompanying experiments provide evidence of the correlation between formant frequencies and pitch for each speaker. Tuerk and Robinson (1993) applied Miller's theory to speaker normalisation and evaluated the approach on the TIMIT database. Their statistical results showed that a more accurate correlation could be modelled gender-dependently. This normalisation criterion should also be a good constraint for voice synthesis and transformation, to maintain a natural relation between pitch and formant frequencies.

3.3.3 Speaking rate

Speaking rates vary significantly across speakers, and various approaches have been studied in order to establish parameters which can perceptually represent the speaking rate. For converting a speaker's voice, Arslan (1997) used an average speech rate to transform the speaker character between two speakers. Shih and her colleagues (1998) reported that, instead of using a single speaking rate for the whole speech, it is much more appropriate to model the speaking rate for different phoneme classes. Shih demonstrated the speaking rate across six speakers using six classes of phonemes. Although the overall speaking rate is consistent across speakers, some classes showed the opposite behaviour, where the phoneme articulation rate is slow when the average rate is above "normal" and vice versa.
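A minimal sketch of the per-class speaking-rate measurement described above, assuming that phone segments with durations and a phone-to-class mapping are available; the names are illustrative and this is not Shih's implementation.

from collections import defaultdict

def speaking_rate_by_class(phone_segments, phone_class):
    """Articulation rate (phones per second) per phoneme class.
    `phone_segments` is a list of (phone_label, duration_in_seconds);
    `phone_class` maps a phone label to a class name such as 'vowel' or 'stop'."""
    totals = defaultdict(lambda: [0, 0.0])   # class -> [phone count, total duration]
    for phone, duration in phone_segments:
        cls = phone_class.get(phone, "other")
        totals[cls][0] += 1
        totals[cls][1] += duration
    return {cls: count / dur for cls, (count, dur) in totals.items() if dur > 0}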


3.4 Accent characteristics

Accent is another significant source of inter-speaker variation. Previous studies have shown that accents affect not only the phoneme sets and linguistic rules but also the intonation patterns.

3.4.1 Time-frequency variation

Arslan (1996) studied foreign accent classification in American English from both the temporal and the frequency aspects. In his experiments, the subjects were American English speakers with different foreign accents. A relatively small database was collected, comprising twenty isolated words and four sentences, repeated five times and once by each speaker respectively. The significant results are summarised as follows.

• The average pitch slope was found to be a significant feature in accent classification. For example, in the study of isolated words, the Chinese accent was shown to have a steeper slope; on the contrary, German speakers had a flatter slope. For some accents, the pitch slope also depends on the position in the sentence.
• Intensity contours showed that different foreign accents place stress on different syllables. For instance, speakers with different accents showed different stress on the /T/ of the word "thirty".
• Temporal features, e.g. word-final stop release time, voice onset time, average voicing duration, average word duration and average sentence duration, are also accent dependent.
• In the study of formant frequencies for different phonemes, the vowel centroids, i.e. the average formant frequencies, were shown to be useful features in accent classification, because accent effects are related to formant shifts. If formant frequencies shift across the acoustic boundary of a phoneme, it may perceptually be recognised as a different phoneme. It is argued that speakers with an accent have more difficulty mimicking the detailed tongue movements that affect the second and third formants. The first formant shift is led by the overall shape of the vocal tract and is observed mainly among different phonemes, e.g. /AA/ and /IY/. The study showed that the second formant was the most significant feature in both recognition and accent classification. The first formant was more important in recognition than in accent classification; on the contrary, the third formant was more important in accent classification than in recognition. For robustness, instead of using formant features, accent-sensitive filter banks were proposed to increase the dimension of the acoustic space, with the filter banks emphasised around the second and third formants.

3.4.2 Phoneme sets and pronunciation differences

With the larger databases available in automatic speech recognition, the time-frequency characteristics of accents can be modelled more feasibly and accurately by incorporating linguistic rules. For modelling and adaptation of regional accent in automatic speech recognition, Humphries (1997) described the acoustic space of an accent using the accent variations, i.e. phoneme insertion, deletion and substitution.

Humphries compared the accents of American and British English. He pointed out that the two main factors which degrade accent-mismatched recognition are different parameters in the acoustic models and the phonological differences. To demonstrate the difference between the acoustic models, the difference in the average likelihood score for each phoneme was obtained; given the matched and mismatched recognisers, the difference score was derived from the two likelihood scores of the test utterances. In addition, two methods were proposed to quantify the phonological differences between two accents (a rough sketch of the counting step follows the list).

1. A common phone set was created to map the source phoneme set to the target phoneme set. For each unseen phoneme, an extra phone sequence was selected.
2. Pronunciation differences were generated and quantified by comparing the pronunciations of the same words from the pronunciation dictionaries of the two accents. The average counts of substitution, deletion and insertion for each word were then obtained, with multiple pronunciations also taken into account.
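Humphries derived the substitution, insertion and deletion statistics from full pronunciation dictionaries with multiple pronunciations; the sketch below only illustrates the counting step for a single word pair, using difflib for the alignment (an assumption made for brevity, not the original tool).

from difflib import SequenceMatcher

def pronunciation_edit_counts(source_phones, target_phones):
    """Count phone substitutions, insertions and deletions between two accents'
    pronunciations of the same word."""
    subs = ins = dels = 0
    matcher = SequenceMatcher(a=source_phones, b=target_phones)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            n_src, n_tgt = i2 - i1, j2 - j1
            subs += min(n_src, n_tgt)        # paired phones count as substitutions
            ins += max(0, n_tgt - n_src)     # leftovers become insertions or deletions
            dels += max(0, n_src - n_tgt)
        elif op == "insert":
            ins += j2 - j1
        elif op == "delete":
            dels += i2 - i1
    return subs, ins, dels

# Hypothetical example: one vowel substituted between two pronunciations of a word.
# pronunciation_edit_counts(["t", "ow", "m", "ey", "t", "ow"],
#                           ["t", "ow", "m", "aa", "t", "ow"])  # -> (1, 0, 0)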


Pronunciation trees were applied to cluster the pronunciation differences, and the context-dependent accent variations, i.e. pronunciation rules, were modelled using the trees. For adaptation to the accent-mismatched condition, the statistics of the pronunciation rules were incorporated into the pronunciation dictionary for automatic speech recognition.

In his preliminary study, accent mapping from American to British English gave a 6% chance of substitution for each phoneme, which increased to 16% when the word's unigram probability was added. The probabilities of insertion and deletion were 2.5% and 1.6% respectively; they increased to 15% and reduced to 0.4% when the word's unigram probability was incorporated. It is apparent that the American pronunciations typically contain fewer phonemes.

3.5 Emotion effects

Modelling emotion effects in human speech is a very complex task. Murray and Arnott (1993) extensively reviewed the literature on human vocal emotion. Various factorisation approaches have been used to decompose human vocal emotion into a minimum number of units. In this section, the emotion effects and the associated acoustic correlates are reviewed.

3.5.1 Human vocal emotion vs. acoustic features

The human vocal emotion effects are usually conveyed by changes in the average pitch and pitch range, duration, intensity and voice quality. Williams and Stevens (1981) stated that with the arousal of the sympathetic nervous system, as with fear, anger or joy, the heart rate and blood pressure increase, the mouth becomes dry and there are occasional muscle tremors; speech is correspondingly loud, fast and enunciated, with strong high-frequency energy.

Table 3.2 Summary of human vocal emotion effects. The effects described are those most commonly associated with the emotions indicated, and are relative to neutral speech. Adapted from Murray and Arnott (1993).

Correlates | Anger | Happiness | Sadness | Fear | Disgust
Speech rate | Slightly faster | Faster or slower | Slightly slower | Much faster | Very much slower
Pitch average | Very much higher | Much higher | Slightly lower | Very much higher | Very much lower
Pitch range | Much wider | Much wider | Slightly narrower | Much wider | Slightly wider
Intensity | Higher | Higher | Lower | Normal | Lower
Voice quality | Breathy, chest tone | Breathy, blaring | Resonant | Irregular voicing | Grumbled, chest tone
Pitch changes | Abrupt, on stressed syllables | Smooth, upward inflections | Downward inflections | Normal | Wide, downward terminal inflections
Articulation | Tense | Normal | Slurring | Precise | Normal


With the arousal of the parasympathetic nervous system, as with boredom or sadness, the heart rate and blood pressure decrease and salivation increases, producing speech that is slow, low-pitched and with little high-frequency energy. Scherer (1986) noted that although pitch is important in emotion expression, voice quality is more important in differentiating discrete emotional states. On the correlation between speaking rate and emotion, Fonagy (1981) observed that increased vocal intensity leads to the shortening of consonants, liquids and nasals, and to the lengthening of vowels. Black (1961) noted from experiments a speech intensity range of 30 dB (normal speech being 10 dB above the "minimum vocal effort" and 20 dB below the maximum). Black also noted that speech intensity increases along with pitch and that soft speech is characterised by a slow rate. Some findings showed that some emotions are acoustically similar but semantically different, e.g. anger and enthusiasm, or boredom and sadness; they are similar in pitch range, average pitch, speech rate, timbre and enunciation, and are often mis-identified (Davitz, 1964). Table 3.2 gives a summary of human vocal emotion effects adapted from Murray and Arnott (1993).

3.5.2 Synthesis of vocal emotion

Cahn (1989) developed the Affect Editor, which provides the ability to model the acoustic correlates of voice and to produce the desired effects in synthesised speech. In this system, the effects of emotion are modelled using quantified acoustic correlates, which are then fed into a Klatt synthesiser. The same idea has been reproduced with a concatenative synthesiser by Rank and Pirker (1998). However, the synthesised effects are limited by the synthesiser capabilities and by the incomplete descriptions of the acoustic and perceptual phenomena.

Figure 3.3 A voice profiling system. A speaker's voice is decomposed into four categories of signals, i.e. formants, excitation, fundamental frequency and timing. "D" is the timing, "S" the glottal pulses and "E" the source energy. Phy: physiological, Psy: psychological.


3.6 Acoustic correlates and voice profiling

In the previous sections, sets of acoustic correlates which convey different speakers' voice characteristics were studied. Using acoustic correlates to measure and quantify a speaker's voice characteristics, a voice profiling system is proposed. In this system, the conventions for the acoustic correlates and the categorisation rules follow the work of Cahn (1989) and Kuwabara and Sagisaka (1995). The voice parameters are considered in two functions, the linguistic-dependent function and the speaker-dependent function. Each function can be categorised into physiological and psycho/socio-logical dimensions. For example, formant locations are considered physiological correlates of the linguistic function, since a phoneme can be defined as a set of formant locations in a speech segment, whereas the deviations of the formant locations for each phoneme are parameters of the speaker-dependent function, e.g. foreign accent. The formant spectral-temporal variations, affected by co-articulation, are categorised into the psychological dimension.

The voice profiling system is illustrated in Figure 3.3. In this system, the voice signals are decomposed and categorised into four sets, i.e. formants, fundamental frequency, excitation and timing, defined as follows.

Formants
1. Formant locations, formant bandwidths, spectral shape and spectral tilt are associated with the physiological dimension.
2. Formant trajectories contribute to the psychological dimension.

Fundamental frequency
3. The average F0, the range of F0 and the reference line of F0 are associated with the physiological dimension.
4. F0 trajectory patterns are the prosody signal and are affected by the speaker's psychological attitude and sociological background.

Excitation signal
5. Excitation pulses and noise are associated with the physiological dimension.
6. The energy contour is associated with psychological attitude.

Timing
7. The function of timing varies across voice signals, e.g. stress frequency, duration of phonemes, duration of words or duration of pitch patterns.

The definition and estimation of the acoustic correlates for a voice profile are presented in three categories, i.e. vocal tract parameters, pitch parameters, and timing parameters.
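One possible way to organise the correlates listed above into a per-speaker profile is a simple container such as the sketch below; the field names and types are illustrative assumptions rather than the data structure used in the thesis.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class VoiceProfile:
    """Per-speaker acoustic correlates grouped by the four signal categories of Figure 3.3."""
    # vocal tract: per-phoneme clusters of [F, BW, dF, dBW] for the first four formants
    formant_clusters: Dict[str, List[List[float]]] = field(default_factory=dict)
    vocal_tract_length_cm: float = 0.0
    # fundamental frequency: static values plus rise/fall accent-shape coefficients
    f0_mean: float = 0.0
    f0_range: Tuple[float, float] = (0.0, 0.0)
    f0_reference: float = 0.0
    accent_shapes: Dict[str, List[float]] = field(default_factory=dict)  # "rise"/"fall" -> [a1, a2, a3]
    contour_slope: float = 0.0
    # excitation and timing
    energy_regression: Tuple[float, float] = (0.0, 0.0)                  # (a, b) of E = a*F0 + b + e
    timing: Dict[str, float] = field(default_factory=dict)               # speaking rate, pause frequency, ...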


3.6.1 Vocal tract parameters

Vocal tract parameters consist of the acoustic correlates that model the temporal-spectral patterns and variations of the vocal tract filter. In the implementation of a voice profile, formant parameters are incorporated. The vocal tract length estimated from the formant frequencies is also included, to indicate the overall vocal tract effect.

Formant parameters

The resonance configuration of the vocal tract and articulators can be represented by the formant frequencies and bandwidths. Here, a linear predictive (LP) model is used to represent the vocal tract filter and is estimated from 25 ms frames of the speech signal every 10 ms. In the implementation, the signal sampling rate is 10 kHz and the LPC order is 13. Two approaches have been proposed for the estimation of formant frequencies using all-pole models: one is to estimate the peak positions of the LP spectrum, the other is to analyse the poles of the model. In this thesis, pole analysis procedures are developed for formant estimation. Let H be the transfer function of the LP model in the z-domain, expressed as

    H(z) = \frac{g}{\sum_{k=0}^{P} a_k z^{-k}}    (3.1)

where g is the gain, P denotes the order of the linear predictive model and a_k are the coefficients, with a_0 = 1. The i-th complex root of the all-pole model can be described using the pole radius and angular frequency, or the approximated formant frequency and bandwidth (Furui, 1989), as

    z_i = A_i e^{\pm j\omega_i} = e^{(2\pi/F_s)(-BW_i \pm jF_i)}    (3.2)

where F_s is the sampling frequency of the speech signal, A_i the pole radius, \omega_i the angular frequency, and F_i and BW_i the formant frequency and bandwidth, respectively. Formant analysis is based on the characteristic of an all-pole model that the significant resonances associated with low bandwidths are highly correlated with formants. Therefore, formant candidates are selected from the complex poles with positive centre frequencies; real poles, which make no contribution to the formant frequencies, are not included in the candidates. The dynamic features of formant frequency and bandwidth are obtained using the difference functions

    \Delta F_i(t) = F_i(t) - F_j(t-1), \quad \Delta BW_i(t) = BW_i(t) - BW_j(t-1)    (3.3)

where \Delta denotes the difference function and pole j is the pole at time t-1 closest to pole i at time t. The closest pole, used for the calculation of the difference function, is found by minimising the distance between the two poles:

    j = \arg\min_{j} \left\| \left[ F_i(t), BW_i(t) \right] - \left[ F_j(t-1), BW_j(t-1) \right] \right\|    (3.4)

For each formant candidate, four features, [F, BW, \Delta F, \Delta BW], are thus obtained. For each phoneme, the formant candidates are classified into four clusters associated with the first to fourth formants. Using the formant clusters of each phoneme, transformation factors are estimated for each formant parameter and used to map between two speakers' formant parameters.
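A minimal NumPy sketch of the pole-based formant analysis of Equations (3.1)-(3.4), assuming 25 ms frames at a 10 kHz sampling rate and LPC order 13 as stated above; it is an illustration of the procedure, not the thesis implementation.

import numpy as np

FS, LPC_ORDER = 10000.0, 13   # 10 kHz sampling rate, 13th-order LP model

def lpc_coefficients(frame, order=LPC_ORDER):
    """Autocorrelation-method LP analysis; returns [1, a_1, ..., a_P] as in Eq. (3.1)."""
    windowed = frame * np.hamming(len(frame))
    r = np.correlate(windowed, windowed, mode="full")[len(windowed) - 1:len(windowed) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.concatenate(([1.0], np.linalg.solve(R, -r[1:order + 1])))

def formant_candidates(lpc, fs=FS):
    """Complex poles with positive centre frequency mapped to (F_i, BW_i) via Eq. (3.2)."""
    poles = np.roots(lpc)
    poles = poles[np.imag(poles) > 0.0]                  # drop real and negative-frequency poles
    freqs = np.angle(poles) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(poles)) * fs / (2.0 * np.pi)    # bandwidth from the pole radius
    order_idx = np.argsort(freqs)
    return freqs[order_idx], bws[order_idx]

def formant_deltas(freqs_t, bws_t, freqs_prev, bws_prev):
    """Eqs. (3.3)-(3.4): difference each candidate against the closest pole of the
    previous frame in the (F, BW) plane."""
    d_f, d_bw = [], []
    for f, bw in zip(freqs_t, bws_t):
        j = np.argmin((freqs_prev - f) ** 2 + (bws_prev - bw) ** 2)
        d_f.append(f - freqs_prev[j])
        d_bw.append(bw - bws_prev[j])
    return np.array(d_f), np.array(d_bw)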


This is equivalent to having four warping factors for each formant of a phoneme:

1. A frequency warping factor, which changes the formant locations.
2. A bandwidth warping factor, which changes the formant bandwidths.
3. A frequency contour rotating factor, which changes the slope of the formant frequency contour.
4. A bandwidth contour rotating factor, which changes the slope of the formant bandwidth contour.

Vocal tract length

Vocal tract length is a significant acoustic correlate of speaker characteristics. Wakita (1977) demonstrated that the vocal tract length of a speaker is inversely proportional to the average formant frequencies. Using the average frequency of the i-th formant, the vocal tract length estimator can be expressed as

    VTL \approx \frac{(2i-1)c}{4\bar{F}_i}    (3.5)

where \bar{F}_i is the average frequency of the i-th formant and c is the velocity of sound (33145 cm/s).

3.6.2 Pitch parameters

The speaker information conveyed by the fundamental frequency can be stated as:

1. The speaker's anatomical identity (sex, age), carried by the average pitch, the pitch range and the reference line of the pitch.
2. The speaker's speaking habits, style, accent and emotion, carried by the pitch patterns and their variations.

The fundamental frequency (pitch) is estimated from the speech signal every 10 ms and denoted F0. The average pitch and the pitch dynamics are further modelled and described as follows.

Average pitch

The average pitch is defined as the average fundamental frequency of the speech spoken by a speaker under normal conditions. The expected pitch of each speech segment, e.g. word, utterance, etc., is estimated as the mean value over the segment. In the following discussion, F0 denotes the frame-based fundamental frequency, F0(wrd) the word-based fundamental frequency and F0(utt) the utterance-based fundamental frequency. The mean value is defined as

    \bar{F}_0 = \mathrm{mean}(F_0)    (3.6)

Pitch range

The pitch range represents the bandwidth of F0, bounded from the lowest to the highest frequency. The pitch range is expressed as

    \mathrm{range}(F_0) = \bar{F}_0 \pm 3 \times \mathrm{std}(F_0)    (3.7)

where std(F0) is the standard deviation of F0.
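The sketch below evaluates Equations (3.5)-(3.7) on already extracted formant averages and F0 tracks; marking unvoiced frames with F0 = 0 is an assumption of this sketch.

import numpy as np

def vocal_tract_length(mean_formant_hz, formant_index, c=33145.0):
    """Eq. (3.5): VTL ~ (2i - 1) c / (4 F_i), with c the speed of sound in cm/s."""
    return (2 * formant_index - 1) * c / (4.0 * mean_formant_hz)

def pitch_statistics(f0_track):
    """Eqs. (3.6)-(3.7): mean F0 and the +/- 3 standard-deviation pitch range,
    computed over voiced frames only (F0 > 0)."""
    f0 = np.asarray(f0_track, dtype=float)
    voiced = f0[f0 > 0]
    mean_f0, std_f0 = voiced.mean(), voiced.std()
    return mean_f0, (mean_f0 - 3.0 * std_f0, mean_f0 + 3.0 * std_f0)

# For example, the BRMS profile in Table 3.3 uses the third formant:
# vocal_tract_length(2685.0, 3) is roughly 15.4 cm.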


Pitch reference line

The pitch reference line is the frequency to which the speaker usually returns after a high or low pitch excursion. The median value of F0 is defined as the reference line of pitch:

    \mathrm{ref}(F_0) = \mathrm{median}(F_0)    (3.8)

Pitch contour

The acoustic correlates of speaker individuality for the pitch contour include the accent shape, the contour slope and the final lowering.

• Accent shape is the rate of change of the fundamental frequency. It describes the overall steepness or smoothness of the F0 contour at the position of a pitch accent.
• Contour slope is the overall trend of the pitch range over the utterance.
• Final lowering is the terminal pitch contour. It is the rate and direction of the F0 change at the end of the utterance. The rise or fall depends on linguistics or pragmatics, e.g. a rising terminal may convey an intention to continue speaking.

The contour slope of an utterance is estimated from the sub-segments of the utterance. Using the phone segments of an utterance, the contour slope can be expressed as

    \mathrm{slope}(F_0) = \arg\min_{c} \sum_{\bar{F}_0(\mathrm{phn}) > 0} D(\mathrm{phn}) \Big( \big( \bar{F}_0(\mathrm{phn}) - \bar{F}_0(\mathrm{utt}) \big) - c \times \big( T(\mathrm{phn}) - T(\mathrm{utt}) \big) \Big)^2    (3.9)

where c is the slope, D(phn) denotes the duration of a phone segment, \bar{F}_0(utt) is the average F0 of the utterance, \bar{F}_0(phn) the average F0 of a phone segment, and T(utt) and T(phn) are the time centroids of the utterance and the phone, respectively. Pitch patterns are modelled at different segment levels. The elementary unit of a pitch pattern is defined as a rising or falling pitch contour. Each rise or fall contour is modelled by a discrete 3rd-order Legendre polynomial coefficient vector [a_0, a_1, a_2, a_3], in which a_0 corresponds to the average level of the segment. The accent shape is defined as

    \mathrm{accent}(F_0) = [a_1, a_2, a_3]    (3.10)

The accent shape at the end of each utterance is related to the final lowering.
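A sketch of Equations (3.9) and (3.10): the contour slope has the usual closed-form weighted least-squares solution, and the accent shape is taken from a Legendre-series fit. Mapping each rise or fall segment onto the interval [-1, 1] is an assumption of this sketch, not a detail taken from the thesis.

import numpy as np

def contour_slope(phone_f0, phone_dur, phone_time, utt_f0, utt_time):
    """Eq. (3.9): duration-weighted least-squares slope of the per-phone mean F0
    (relative to the utterance mean) against the phone time centroids."""
    f0 = np.asarray(phone_f0, dtype=float)
    voiced = f0 > 0                                   # only voiced phones contribute
    d = np.asarray(phone_dur, dtype=float)[voiced]
    x = np.asarray(phone_time, dtype=float)[voiced] - utt_time
    y = f0[voiced] - utt_f0
    # closed-form minimiser of sum_k d_k * (y_k - c * x_k)^2
    return np.sum(d * x * y) / np.sum(d * x * x)

def accent_shape(f0_contour):
    """Eq. (3.10): fit a 3rd-order Legendre series to one rise/fall contour and
    return [a1, a2, a3]; a0 (the mean level of the segment) is discarded."""
    x = np.linspace(-1.0, 1.0, len(f0_contour))
    coeffs = np.polynomial.legendre.legfit(x, np.asarray(f0_contour, dtype=float), deg=3)
    return coeffs[1:]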


3.6.3 Timing parameters

Timing parameters include two types of information, duration and energy, described as follows.

Duration features

Duration features include the following acoustic correlates.

• Speaking rate describes the speaking speed. It can be indicated by the number of syllables or words spoken per minute and by the duration of pauses.
• Exaggeration describes the degree to which pitch-accented words receive exaggerated duration as a means of emphasis.
• Fluent pause is the frequency of pausing between syntactic or semantic units.
• Hesitation pause is the frequency of pausing within a syntactic or semantic unit; such pauses often occur after the first function word in a clause.

Duration parameters are estimated using forced alignment (Young et al., 1999) to estimate the segmentation boundaries automatically. For each phoneme and pause, the duration parameters are clustered into three levels: fast, normal and slow.

Stress and energy contour

Stress is conveyed primarily by pitch and energy variations, which are generally correlated. The stress frequency is the ratio of stressed to stressable words in an utterance (Cahn, 1989). It was defined as the likelihood that a word would be stressed according to syntax, semantics and pragmatics.

A linear relation between the energy E and the fundamental frequency F0 is used to model the energy contour as

    E(t) = aF_0(t) + b + e    (3.11)

where a and b are the coefficients of the linear regression function and e is the Gaussian error term which models the energy distribution. Given the range of F0, the speech energy can be classified into a few classes associated with different degrees of stress.
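A brief sketch of fitting Equation (3.11) by least squares over voiced frames; the residuals then stand in for the Gaussian error term e (names are illustrative).

import numpy as np

def fit_energy_contour(f0_track, energy_track):
    """Eq. (3.11): fit E(t) = a * F0(t) + b + e on voiced frames and return the
    regression coefficients together with the residuals that model e."""
    f0 = np.asarray(f0_track, dtype=float)
    energy = np.asarray(energy_track, dtype=float)
    voiced = f0 > 0
    a, b = np.polyfit(f0[voiced], energy[voiced], deg=1)   # highest degree first
    residuals = energy[voiced] - (a * f0[voiced] + b)
    return a, b, residuals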


3.7 An example of voice profiling

In this section, an example of profiling the voices of a source and a target speaker is demonstrated. Two databases collected from two British English speakers are employed as the case study for voice profiling. The database of the source speaker, British male source (BRMS), consists of four hours of narrative speech; the database of the target speaker, British male target (BRMT), consists of nine minutes of speech. These two databases are also used in Chapters 5 and 6 for the evaluation of voice mapping.

3.7.1 Vocal tract parameters

The long-term average speech spectra of both speakers are illustrated in Figure 3.4, where the differences between the two average spectra are clearly shown. For the target speaker BRMT, there are two significant humps around the frequency locations of the first and second formants; from the first to the second formant the spectrum intensity falls by 15 dB, and the third and fourth formants are barely visible.

Figure 3.4 (a) The long-term average spectrum of BRMS, and (b) the long-term average spectrum of BRMT (x-axis: Hz, y-axis: dB).

Formant parameters

The averaged formant parameters are obtained by averaging the formant parameters across thirteen vowels. They include the formant frequency, the bandwidth and the difference between the spectral intensities of adjacent formants. Table 3.3 records the averaged formant parameters of the two speakers. In this table, the ratios are the target features divided by the source features and are used as transformation factors for voice conversion. The vocal tract lengths are estimated using the third formant for BRMS and the second formant for BRMT; the large third and fourth formant bandwidths imply that the corresponding formants were unreliably estimated. The first derivatives of the formant parameters are shown in Table 3.4.

Table 3.3 The average values of formant information.

Correlates | BRMS | BRMT | Ratios
VTL | 15.4 cm | 17.3 cm | 1.12
F1 / BW1 / I12* | 424 / 67 / 8.03 | 425 / 36 / 6.18 | 1.00 / 0.54 / 0.78
F2 / BW2 / I23 | 1607 / 74 / 12.22 | 1448 / 50 / 19.44 | 0.90 / 0.68 / 1.59
F3 / BW3 / I34 | 2685 / 87 / 7.06 | 3066 / 366 / 8.66 | 1.14 / 4.21 / 1.23
F4 / BW4 | 3823 / 161 | 4109 / 378 | 1.07 / 2.35

* F1 to F4 denote the first to fourth formant frequencies; BW1 to BW4 denote the first to fourth formant bandwidths; I12 to I34 denote the spectrum intensity differences between adjacent formants.

Table 3.4 The average values of the first derivatives of formant information.

Correlates | BRMS | BRMT | Ratios
Delta F1 / BW1 | -1.6 / -4.4 | -0.9 / -1.3 | 0.56 / 0.30
Delta F2 / BW2 | -3.3 / -5.2 | -1.1 / -6.5 | 0.33 / 1.25
Delta F3 / BW3 | 6.0 / -6.9 | -1.5 / -1.8 | -0.25 / 0.26
Delta F4 / BW4 | 0.1 / -14.6 | 2.7 / -10.8 | 27 / 0.74
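The ratio columns of Tables 3.3-3.5 can be formed by a simple division of the target features by the source features, as in the sketch below (the dictionary keys are illustrative).

def transformation_factors(source_features, target_features):
    """Target-over-source ratios used as multiplicative mapping factors."""
    return {name: target_features[name] / source_features[name]
            for name in source_features}

# Example with the averaged second-formant values from Table 3.3:
# transformation_factors({"F2": 1607.0, "BW2": 74.0}, {"F2": 1448.0, "BW2": 50.0})
# gives approximately {"F2": 0.90, "BW2": 0.68}.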


Compared to the source speaker, the target speaker shows relatively steady movements of the formant contours. This can be observed from the ratios of the dynamic features in Table 3.4, which are significantly smaller than the ratios of the static features in Table 3.3. It also explains why the overall average spectrum of the target speaker shows two significant humps at the first and second formant locations. The transformation ratios of the formant parameters for the thirteen vowels between the source and target speakers are recorded in Table 3.5 and illustrated as bar charts in Figures 3.5 and 3.6 for comparison. Table 3.5 shows that the transformation factors of the static features vary across the thirteen vowels, and that the variation of the ratios is more significant for the dynamic features than for the static features.

3.7.2 Pitch parameters

The fundamental frequency (pitch) parameters for the two speakers are recorded in Table 3.6; both static and dynamic parameters are included. The static features consist of the mean, the range and the reference line of F0, each with a mean value and a standard deviation (std). Different segments of speech, i.e. frame, word, utterance and sentence, are used for the estimation of the mean pitch, and the F0 reference line is estimated for each sentence. Comparing the F0 statistics estimated from the different speech segments, the mean values show no significant difference; however, the longer segments have significantly smaller standard deviations of F0.

The dynamics of F0 are characterised using the accent shape and the contour slope. The accent shape is represented by the Legendre polynomial coefficients associated with the first, second and third derivatives, with rising and falling shapes displayed separately. Considering the first coefficient, associated with the pitch accent slope, the rising and falling shapes of the target speaker are about 40% flatter than those of the source speaker, and the contour slope of the target speaker is about 70% flatter than that of the source speaker. The final lowering is measured by the average F0(wrd) of words followed by a pause. The averaged final lowering frequency of the source speaker is significantly lower than his average F0, whereas that of the target speaker shows no significant difference from his average F0.

Table 3.6 The average values of pitch information for the two male speakers.

Correlates | BRMS | BRMT
F0 mean/std | 138.3/34.8 Hz (frame); 134.4/30.6 Hz (word); 136.3/17.6 Hz (utterance) | 106.6/20.9 Hz (frame); 104.8/17.2 Hz (word); 104.5/12.8 Hz (utterance)
F0 range | ±104.4 Hz (frame) | ±62.7 Hz (frame)
F0 reference/std | 133.8/8 Hz (sentence) | 103.1/7.3 Hz (sentence)
Accent shape | R: [11.9, -0.1, -1.4]*; F: [-12.1, 0.0, 1.3] | R: [7.0, -0.2, -0.8]; F: [-6.9, -0.1, 0.7]
Contour slope | -45.3/42.3 Hz/sec | -13.8/37.9 Hz/sec
Final lowering | 111.3/25.8 Hz (word) | 101.3/21.5 Hz (word)

* R denotes the rising pitch accent, F the falling pitch accent.

More information is illustrated in Figures 3.7, 3.8 and 3.9 using histograms and fitted Gaussian bell shapes for the different features of the two speakers.


In each diagram, the y-axis is the percentage occurrence and the x-axis is the fundamental frequency in Hz. For both speakers, the estimated reference line of F0 varies across sentences, as illustrated in Figure 3.7 (a), and is a significant discriminator between the two speakers. Figure 3.7 (b) shows the F0(utt) distributions. The distributions of F0 estimated from voiced words and from frames are shown in Figure 3.8 (a) and (b), respectively. The characteristics of the pitch patterns at the utterance level can be represented using F0(wrd) after and before a pause, as illustrated in Figure 3.9 (a) and (b). The F0(wrd) after a pause gives information about the initial frequency level of each utterance; the distributions are illustrated in Figure 3.9 (a). For BRMS, the average value is about 12 Hz higher than the overall average F0, while BRMT has about the same mean value as its overall average. The F0(wrd) before a pause gives information about the frequency level of the final lowering; the distributions are shown in Figure 3.9 (b). For BRMS, the average value is about 23 Hz lower than the overall average F0, while for BRMT it is 3 Hz lower than the overall average F0.

Significant differences between the two speakers have thus been shown in the acoustic correlates of the fundamental frequency. Both the static and the dynamic features carry speaker information and can be used for speaker identification and for voice mapping and transformation.

3.7.3 Timing parameters

Timing parameters include duration, stress and fluency. Generally, the frequency of occurrence of each acoustic correlate is used to quantify the timing parameters. The timing parameters for the two speakers are recorded in Table 3.7. In this study, the speaking rate is quantified by the rate of phonemes or words per minute. For the word-based speaking rate, two values are obtained: one includes pauses as part of the speech and is shown with "(+ pause)"; the other considers only word segments as speech and is shown with "(segment)". The pause frequency is obtained from the total number of pauses divided by the total number of words; the target speaker pauses about twice as often as the source speaker. For further modelling of the speaking rate, the phoneme durations are classified into three levels: long, medium and short.

Table 3.7 The average values of timing information for the two male speakers.

Correlates | BRMS | BRMT | Remarks
Speaking rate | 162.3 words/min (+ pause); 202.1 words/min (segment); 616.6 phones/min (segment) | 150.0 words/min (+ pause); 196.1 words/min (segment); 578.6 phones/min (segment) | Duration tree, 3 levels: long, medium and short
Pause | 1 pause / 11.6 words | 1 pause / 6.2 words |
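A sketch of the speaking-rate and pause-frequency measures reported in Table 3.7, assuming that word and pause segments with start and end times in seconds are available from the forced alignment (names are illustrative).

def timing_statistics(word_segments, pause_segments):
    """Words per minute with and without pauses, and pauses per word (cf. Table 3.7).
    Each segment is a (start_seconds, end_seconds) pair."""
    speech_time = sum(end - start for start, end in word_segments)
    pause_time = sum(end - start for start, end in pause_segments)
    total_time = speech_time + pause_time
    return {
        "words_per_min_with_pause": 60.0 * len(word_segments) / total_time,
        "words_per_min_segments_only": 60.0 * len(word_segments) / speech_time,
        "pauses_per_word": len(pause_segments) / len(word_segments),
    }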


More information can be illustrated using the distribution plots shown in Figures 3.10 to 3.13. In each figure, the y-axis is the percentage of occurrence and the unit of the x-axis is 10 ms. Figure 3.10 illustrates four groups of word-based F0 distributions for both BRMS and BRMT, where each group corresponds to a duration range. The change of the word-based F0 distribution across the duration groups is shown in Figure 3.11. For both speakers, the words in the duration group between 200 and 400 ms have the highest mean word-based F0. Comparing the groups with the longest duration, BRMT has a relatively higher F0 than BRMS: the mean word-based F0 in this group is 8.7% lower than the overall mean of the word-based F0 for BRMS, but 0.7% higher for BRMT. This might indicate that the speech of BRMT is more exaggerated than the speech of BRMS, which is also perceived when listening to the speech.

Figure 3.12 and Figure 3.13 (a) and (b) illustrate the distributions of pause durations, word durations and utterance durations, respectively, for both the source and the target speakers.


Table 3.5 The transformation factors of the formant correlates for the two male speakers.

Factor | aa | ae | ah | ao | ax | eh | er | ih | iy | oh | ow | uh | uw
F1 | 1.13 | 1.08 | 1.13 | 0.87 | 1.62 | 0.99 | 0.8 | 1.19 | 1.19 | 0.57 | 1.35 | 0.86 | 1.22
B1 | 0.46 | 0.78 | 0.51 | 0.45 | 0.47 | 0.36 | 0.31 | 0.69 | 0.8 | 0.58 | 0.42 | 0.62 | 0.84
DF1 | -0.38 | -0.39 | 0.54 | 3.88 | 0.26 | 0.6 | -0.39 | 0.64 | 0.44 | -0.04 | 0.39 | 1.15 | 0.72
DB1 | 0.27 | 0.57 | 0.14 | 2.06 | 0.3 | 0.21 | 0.29 | 0.09 | 0.2 | 0.12 | 0.17 | 0.52 | 0.42
F2 | 0.98 | 0.87 | 1 | 1.04 | 0.95 | 0.84 | 0.96 | 0.76 | 1.31 | 0.83 | 0.95 | 1.76 | 0.91
B2 | 0.78 | 0.69 | 1 | 0.85 | 0.64 | 0.72 | 0.82 | 0.63 | 0.16 | 1.51 | 0.61 | 4.75 | 0.77
DF2 | 0.31 | 0.48 | 1.28 | -11.7 | 0.1 | 0.14 | -0.56 | 0.21 | 0.73 | -0.05 | -0.05 | -0.96 | 0.29
DB2 | 1.08 | 0.58 | 0.82 | -0.01 | 2.03 | 1.98 | 0.71 | 1.63 | 0.33 | 5.26 | 1.92 | 0.39 | 2.66
F3 | 1.16 | 0.87 | 1.2 | 1.01 | 0.91 | 1.24 | 1.21 | 1.09 | 1.27 | 1.02 | 1.26 | 1.23 | 1.25
B3 | 3.96 | 3.28 | 4.17 | 1.06 | 3.45 | 5.62 | 6.77 | 4.33 | 5.18 | 3.84 | 5.4 | 3.25 | 3.61
DF3 | 0.58 | 5.29 | 1.27 | -1.85 | -2.74 | 0.02 | 1.73 | 1.88 | 0.29 | -4.22 | -6.61 | 18.82 | -2.74
DB3 | -0.18 | -2.23 | -0.21 | -1.23 | -2.02 | 0.67 | 4.03 | -0.09 | -0.84 | 1.28 | 0.35 | 1.44 | -0.24
F4 | 1.08 | 0.96 | 1.07 | 0.96 | 0.98 | 1.06 | 1.08 | 1.08 | 1.11 | 1.1 | 1.12 | 1.11 | 1.08
B4 | 4.08 | 2.06 | 3.16 | 1.19 | 1.3 | 2.59 | 3.5 | 1.93 | 3.07 | 2.84 | 2.89 | 1.2 | 0.98
DF4 | -0.98 | -1.33 | -1.69 | -16.9 | 8.63 | 0.29 | -1.69 | 0.04 | 0.69 | -3.18 | 0 | 2.14 | 0.23
DB4 | 2.11 | 1.27 | -0.7 | 2.31 | -0.12 | 1.06 | 0.99 | -0.26 | -1.68 | 1.36 | 1.58 | 0.32 | 0.82

Figure 3.5 (a) Illustration of the transformation factors for the first and second formants, and (b) the transformation factors for the first and second formant bandwidths. In each plot, the thirteen bar pairs correspond, from left to right, to the thirteen vowels shown in Table 3.5.

Figure 3.6 (a) Illustration of the transformation factors for the first derivatives of the first and second formants, and (b) the transformation factors for the first derivatives of the first and second formant bandwidths. In each plot, the thirteen bar pairs correspond, from left to right, to the thirteen vowels shown in Table 3.5.


Figure 3.7 (a) The distributions of the F0 reference line for each sentence, S: 133.8/8 Hz*, T: 103.1/7.3 Hz; (b) the F0(utt) distributions, S: 136.3/17.6 Hz, T: 104.5/12.8 Hz.

Figure 3.8 (a) The F0(wrd) distributions, S: 134.4/30.6 Hz, T: 104.8/17.2 Hz; (b) the F0 distributions, S: 138.3/34.8 Hz, T: 106.6/20.9 Hz.

Figure 3.9 (a) The F0(wrd) (after pause) distributions, S: 146.5/31.7 Hz, T: 104.3/14.5 Hz; (b) the F0(wrd) (before pause) distributions, S: 111.3/25.8 Hz, T: 101.3/21.5 Hz.

* S stands for BRMS and T for BRMT. Each number pair N/M gives the mean value N and the standard deviation M.




Figure 3.13 (a) Distributions of utterance duration, where an utterance is defined as a speech segment between two pauses, S: 3.20/2.38 sec, T: 1.76/1.16 sec; (b) distributions of word duration, S: 29.7/19.6, T: 29.4/20.2.
