13.07.2015 Views

Speech to Song Illusion: Evidence from Mandarin Chinese

Speech to Song Illusion: Evidence from Mandarin Chinese

Speech to Song Illusion: Evidence from Mandarin Chinese

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

future experiment <strong>to</strong> probe in<strong>to</strong> the robustness of<strong>to</strong>ne pitch intervals in real-time speech (by usingcorpus methods) and its effect on intelligibility.Furthermore, I demonstrate how the intervalstructure (1.b) plays a less significant role whenthe duration condition is satisfied in (1.a) in theperception of speech <strong>to</strong> song illusion. All speechanalysis, synthesis, and manipulation are done onPraat (www.praat.com).II Analysis of Deutsch’s DataTo analyze Deutsch’s data, I pay particularattention <strong>to</strong> two data sets. In Deutsch’sexperiment, it was interesting <strong>to</strong> compare the twosets of recordings, one by the group (a) of 11subjects who were asked <strong>to</strong> reproduce the phraseafter having heard it for ten times, the other bygroup (b) of 11 subjects who only heard it oncebut were asked <strong>to</strong> reproduce the phrase as sungphrase. Their output exhibited striking similarityas is shown in the graph below in average pitch.Figure 2This difference suggests that by focusing on theacoustic properties (or imitating the prosody) ofthe speech sentence one might produce the sungphrase (sung phrase is usually produced byexaggerating the acoustic features of thesentence, such as prolonged pitch stability). Thefollowing graphs show my calculation of themean speed of pitch change at a 0.01s intervalfor the two sentences <strong>from</strong> above, one <strong>from</strong>group (a) and the other <strong>from</strong> group (c).Table 1Figure 1Meanwhile, there is striking difference betweenthe output of group (a) and group(c) who onlyheard the phrase once and were simply asked <strong>to</strong>repeat:Table 2It is evident <strong>from</strong> the data in the graph 1 and 2that the speed of pitch change in sung-like2


In the following section I consider the twopart<strong>to</strong>nal hypothesis proved <strong>to</strong> be crucial <strong>to</strong> thegeneration of speech <strong>to</strong> song illusion in theexperiments on German by Falk and Rathecke byusing preliminary evidence <strong>from</strong> experiments ongenerating speech <strong>to</strong> song illusion in <strong>Mandarin</strong><strong>Chinese</strong>. Here I focus on the first <strong>to</strong>ne and third<strong>to</strong>ne of MC. It is known that in connectedspeech, <strong>to</strong>ne 1 (high level) and <strong>to</strong>ne 3(lowfalling, also known as half-third-<strong>to</strong>ne due <strong>to</strong> itsinconsistency with the citation form in mostcases in natural speech), unlike <strong>to</strong>ne 2(midrising) and <strong>to</strong>ne 4(high falling), are more likely<strong>to</strong> have stable pitches therefore facilitating thegeneration of the illusion. (or rather because the<strong>to</strong>ne 2 and <strong>to</strong>ne 4 have relatively large intervalsin their con<strong>to</strong>urs, they are most unlikely <strong>to</strong> besung-like in real speech).III Target Stabilitystatic <strong>to</strong>ne or vowel. An illustration of thisscenario is shown in Figure 3(a).2. A target is given just enough time, e.g.,when a monophthong vowel is 75 ms long, or astatic <strong>to</strong>ne is 150 ms long—Within the allottedtime interval, the movement <strong>to</strong>ward the targetwould proceed continuously, and the target valuewould not be achieved until the end of itsallotted time interval. An illustration of thisscenario is shown in Figure 3(b).3. A target is given insufficient amount oftime, e.g., a vowel is much less than 75 ms long,or a static <strong>to</strong>ne is much less than 150 ms long—Within the allotted time interval, again themovement <strong>to</strong>ward the target would proceedcontinuously; but the target would not beapproached even by the end of its allotted timeinterval. By the time its cycle ends, however, theimplementation of the target has <strong>to</strong> terminate andthe implementation of the next target has <strong>to</strong> start,as is illustrated in Figure 3(c).Recent experiments (Xu 2002; 2004) on themaximum speech of pitch change in speech flowindicated that unlike previously thought, physicallimitation does play an important role in speechproduction and perception. More specifically, themaximum speed of pitch change is actually morefrequently reached in natural speech, and humanlisteners show great robustness <strong>to</strong> high speedcomputer synthesized speech while havingserious limitations on understanding massiveundershoot in human speech. Here I apply results<strong>from</strong> the Target Approximation model in pitchchange. Directly relevant <strong>to</strong> the current researchquestion on the duration and stability of the <strong>to</strong>neproduction, Xu’s Target Approximation modeloutlined three hypothetical scenarios, all ofwhich commonly occur in any language:1. There is plenty of time for a target, e.g.,when a monophthong vowel is well over 100 mslong, or a static <strong>to</strong>ne is well over 200 ms long—The target value would be attained before theend of its allotted time; and a pseudo steady stateat that value might be achieved in the case ofFigure 3Based on this model, we expect a cu<strong>to</strong>ff valueof at least 200ms of the mean steady <strong>to</strong>neduration for the occurrence of the speech <strong>to</strong> songillusion. In a trial experiment, a male nativespeaker of <strong>Mandarin</strong> <strong>Chinese</strong> was asked <strong>to</strong> speakout the following sample sentence (consisting of<strong>to</strong>ne 1 and <strong>to</strong>ne 3 only) in three different speech4


SPEECH TO SONG ILLUSION: EVIDENCE FROM MC 5rates, fast, mid, and slow, with a broad focus.The pitch intervals of the <strong>to</strong>nes are either 5 st or7 st (perfect fourth or perfect fifth), thusfacilitating the generation of the effect:Table 3Preliminary results of perceptual trialexperiments showed that this cu<strong>to</strong>ff value isindeed responsible for the generation of theeffect. Only the slow speech rate generated theperceptual shift <strong>from</strong> speech-like signal <strong>to</strong> thesong-like signal, while the other two do notexhibit this effect no matter how many times therecording is repeated. Moreover, the illusionseems <strong>to</strong> disappear when synthesizedacceleration is applied on the slow rate sentence.In sum, when the cu<strong>to</strong>ff value of Xu’sTarget Approximation model is applied, thetarget stability is demonstrated <strong>to</strong> be crucial forthe generation of speech <strong>to</strong> song illusion. This isin agreement with Falk and Rathecke’s work onGerman. Future perceptual experiments areneeded <strong>to</strong> decide the details of the illusion inMC.IV Interval StructureThe question of interval structure in thespeech <strong>to</strong> song illusion is proved <strong>to</strong> becomplicated in MC, considering the flexibility of<strong>to</strong>ne production and perception in MC, as well asthe interaction between lexical <strong>to</strong>ne andin<strong>to</strong>nation (Shen 1990; Xu 1994). Whenrealizing transitions <strong>from</strong> one <strong>to</strong>ne <strong>to</strong> another inreal-time speech in MC, for instance, <strong>from</strong> thethird <strong>to</strong>ne <strong>to</strong> the first <strong>to</strong>ne, because the <strong>to</strong>nes aredefined relatively, there is a range of acceptedintervals (i.e. all the different intervals that mightbe interpreted as accepted third <strong>to</strong>ne <strong>to</strong> first <strong>to</strong>netransition in real speech). The robustness of thisinterval is true especially when we consider thatMC preserves an over 90% of intelligibilityspoken in mono<strong>to</strong>ne in a non-noisy background(Xu&Patel&Wang 2010). The change in broadand narrow focus semantically also is reflected inprosody, thus in <strong>to</strong>ne intervals.For the current study I use the same samplesentence <strong>from</strong> above. First I ask a question: Howis <strong>to</strong>ne interval mapped on<strong>to</strong> the specific <strong>to</strong>nerelations, such as a transition <strong>from</strong> <strong>to</strong>ne 3 <strong>to</strong> <strong>to</strong>ne1? In a series of trial experiments, the samplesentence (with third <strong>to</strong>ne vs. first <strong>to</strong>ne contrast)was read by male native MC speaker using awide range of intervals <strong>from</strong> smaller than 1 st <strong>to</strong>12 st. Results indicate that all intervals producedare accepted as normal speech (presumablyheard in different context). One explanation forthis is that when third <strong>to</strong>ne is contrasted with thefirst <strong>to</strong>ne in MC, there is no possible confusionas the other two <strong>to</strong>nes in the inven<strong>to</strong>ry are risingand falling <strong>to</strong>nes with large intervals and greaterspeed of pitch change, with no other level <strong>to</strong>nes<strong>to</strong> confuse with. Thus, as long as the third <strong>to</strong>netransits in<strong>to</strong> a first <strong>to</strong>ne in an ascending interval,presumably it can be accepted as intelligible and<strong>to</strong>lerable in normal speech. (While on the otherhand, if a descending interval is heard in thiscase it inflects a <strong>to</strong>ne accent <strong>from</strong> a differentdialect of MC, while still being intelligible inthis case). To confirm this empirically, I intend<strong>to</strong> build a corpus collecting material in realspeech and study the interval range of the level<strong>to</strong>ne contrasts.More strikingly, in one trial of the abovementionedreading where the speaker intends <strong>to</strong>speak in a small interval (such as 1 st), we foundthat in a <strong>to</strong>ne pattern of (3-1-1-1-1-1), the firstfour words are actually assigned the about samepitch height, while the last two <strong>to</strong>nes are about 1-2 st higher (Table 4). In a similar trial, anascending pattern is found within the realizationof the five first <strong>to</strong>ne words, thus assigning thefive first <strong>to</strong>ne words different pitch heights in anascending fashion.5


syllable wo he san bei ka feicentralpitch(Hz) 85 85 84 84 88 88<strong>to</strong>ne 3 1 1 1 1 1Table 4 Irregular pitch-<strong>to</strong>ne interval mappingin the sample sentenceThe great flexibility of <strong>to</strong>ne intervals inaccepted normal speech indicates that theconstraint of an interval relation that is similar <strong>to</strong>the musical scalar relation is little in thegeneration of speech <strong>to</strong> song illusion. We may aswell speculate that given the sufficient pitchtarget stability, the occurrence of speech <strong>to</strong> songillusion depends less on the <strong>to</strong>ne intervalstructure. Moreover, due <strong>to</strong> the listener’s ability<strong>to</strong> adjust perceived pitch in different tuningsystems <strong>to</strong> the ones we’re familiar with (such asinterpreting as well-tempered penta<strong>to</strong>nic scalewhen hearing a Sundanese Salandro scale by theWestern listeners), it is likely that the intervalstructure does not play a significant role in theillusion in the case of MC. Future perceptualexperiments is called for in determining the roleof interval structure as well as its complexmapping on<strong>to</strong> the <strong>to</strong>ne relations.V DiscussionIn conclusion, the current study confirms theTonal Hypothesis by Falk and Rethecke (2010),and the mechanisms in generating speech <strong>to</strong> songillusion in <strong>Mandarin</strong> <strong>Chinese</strong>. In the meantime,the current study does not negate the conclusionby Deutsch. First of all, we do have <strong>to</strong> take in<strong>to</strong>account part of the physical properties as a preconditionin generating the speech <strong>to</strong> songillusion; nonetheless, we do acknowledge thefact that the same input of the same physicalreality of the signal can generate different out putwhen repeated through the human perceptualsystem.There exists a long-standing notion of speech<strong>to</strong> song continuum by ethnomusicologistsanthropologists, who found in many culturesaround the world the existence of the vocalperforming genres that lie between speech andsong (or in some cases known as heightenedspeech). In the meantime, as illustrated in thispaper, what happens <strong>to</strong> speech perception whenrepeated signal is presented <strong>to</strong> human listenersdid not particularly attract scholar attentions untilrecently. Therefore the research in<strong>to</strong> speech <strong>to</strong>song illusion may offer us a new passage way <strong>to</strong>navigate the human audio-perceptual system andbrain modularity <strong>from</strong> speech <strong>to</strong> song.VI References[1] Deutsch, D., 1995. Musical <strong>Illusion</strong>s andParadoxes. CD: Philomel Records.[2] Deutsch, D., Lapidis, R., Henthorn, T., 2008. Thespeech-<strong>to</strong>-song-illusion. J. Acoust. Soc. America,s.124, 2471.[3] Falk, F., Rathcke, T. 2010. “On the <strong>Speech</strong>-To-<strong>Song</strong> <strong>Illusion</strong>: <strong>Evidence</strong> <strong>from</strong> German”. <strong>Speech</strong>Prosody 5 th International Conference, Chicago, IL.[4] Mertens, P. 2009. The Prosogram v.2.7(manual& tu<strong>to</strong>rial).http://bach.arts.kuleuven.be/pmertens/prosogram[5] Patel, A., 2008. Music, language and the brain.Oxford: University Press.[6] Peretz, I., in press. Music, language andmodularity in action. In Rebuschat, P., Rohrmeier,M., Hawkins, J. and Cross, I. [Eds], Language andMusic as cognitive systems. Oxford: University Press.[7] Peretz, I., Champod, S. and Hyde, K. 2003.Varieties of Musical Disorders: The Montreal Batteryof Evaluation of Amusia. Annals of the New YorkAcademy of Sciences 999, 58-75.Peretz, I. Coltheart, M., 2003. Modularity of musicprocessing. Nature neuroscience, 6, 688-691.[8] Pilotti, M., Antrobus, J.S. and Duff, M. 1997.The effect of presemantic acoustic adaptation onsemantic 'satiation'. Memory and Cognition 25 (3),305-312.[9] Sundberg, J., 1987. The Science of the SingingVoice. Illinois: Northern Illinois Press.[10] Sundberg, J., 1989. Synthesis of Singing byrule. In Mathews, M. V. and Pierce, J. R. [Eds],Current Directions in Computer Music Research.6


SPEECH TO SONG ILLUSION: EVIDENCE FROM MC 7MIT Press, 45-55.[11] Sundberg, J. 2006. "The KTH Synthesis ofSinging." in Advances in Cognitive Psychology,2006.v2,no.2-3,pp.131-143.[12] Shen, X.S. 1990. The Prosody of <strong>Mandarin</strong><strong>Chinese</strong> (University of California Publications inLinguistics). CA: UC Press.[13] Xu, Y. 2004. Understanding <strong>to</strong>ne <strong>from</strong> theperspective of production and perception. Languageand Linguistics 5: 757-797.[14] Xu, Y. and Sun X. 2002. Maximum speed ofpitch change and how it may relate <strong>to</strong> speech.Journal of the Acoustical Society of America 111:1399-1413.[15] Xu, Y. 1994. Production and perception ofcoarticulated <strong>to</strong>nes. Journal of the Acoustical Societyof America 95: 2240-2253.[16] Xu, Y. Patel, A. Wang, B. 2010. The role of F0variation in the intelligibility of <strong>Mandarin</strong> sentences.<strong>Speech</strong> Prosody 5 th International Conference,Chicago, IL.[17] Za<strong>to</strong>rre, R.J., Belin, P. and Penhune,V.B., 2002.Structure and function of audi<strong>to</strong>ry cortex: music andspeech. Trends in cognitive Science 6, 37-46.7

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!