
Numbers & notes:
An introduction to musical signal processing

Regina Collecchia


About this book

Digital analysis of music is typically difficult, with many variables and much mathematics in play. Numbers and notes: An introduction to musical signal processing seeks to illuminate in an accessible way the concepts behind audio compression, information retrieval, and acoustic design. At the core of such techniques lives the discrete Fourier transform (DFT), an historical construct that analyzes the frequencies contained in audio signals. The fast algorithm for the DFT—namely the celebrated FFT—is a special focus of the book. Given herein are actual code examples in C, MATLAB, and Mathematica.

About the author

Regina Collecchia has a B.A. degree (2009) from Reed College, where her interests focused on mathematics and music. Regina currently works at the University of Louisville's Heuser Hearing Research Laboratory in Louisville, Kentucky, where she maintains a strong research interest in digital methods as they apply to music.

"This book brings it all together for me. I wish it had been written when I first began studying digital audio and signal processing. Anyone new to these fields should make this the first book in a personal library."

—Evan Brooks, co-founder of Pro Tools and Digidesign

ISBN 1-935-63815-5



Numbers & notes:
An introduction to musical signal processing

Regina Collecchia


Perfectly Scientific Press
3754 SE Knight St.
Portland, OR 97202

Copyright © 2012 by Perfectly Scientific Press.

All Rights Reserved. No part of this book may be reproduced, used, scanned, or distributed in any printed or electronic form or in any manner without written permission, except for brief quotations embodied in critical articles and reviews.

First Perfectly Scientific Press paperback edition: February 2012.

Perfectly Scientific Press paperback ISBN: 978-1-935638-15-5.

Cover design by Julia Canright.

Cover image: Piano Orchestrion at the Musée Mécanique in San Francisco, California, a museum containing old mechanical arcade games and player pianos. The Piano Orchestrion has a spinning wheel of bumps that correspond to pitches on the piano, so the bumps notate the musical score. Its hammers are triggered when they encounter bumps, creating a binary system much like digital music. Photograph by Regina Collecchia.

Visit our website at www.perfscipress.com.

Printed in the United States of America.

9 8 7 6 5 4 3 2 1 0


Preface

A digital audio file like an MP3 or WAV file is a numerical representation of a song's frequency, timing, and loudness information. To fully understand its behavior, the physical, psychophysical, and musical properties underlying these components must be realized. Therefore, the best way to communicate about them for computational purposes is by using a mathematical tongue. Taking it one step further, we can make use of computers and algorithms to extract and analyze musical information.

The algorithm that is most frequently used to retrieve data from music is the fast Fourier transform (FFT), an expedited implementation of the discrete Fourier transform (DFT). A DFT and an FFT have identical output: the frequency spectrum of a discrete (digital) signal. A frequency spectrum tells us the relative loudness of frequencies throughout the sound file, similar to how the file itself tells us the amplitude at any instant of time. The transform converts a time-based domain into a frequency-based domain.

Fourier transforms are used not only to display frequency information, but also in the compression of digital audio, filter design, convolution, composition, and many other (digital) signal processing methods. It may be that, for your purposes, you can get by without fully understanding the physical meaning of the FFT; but if you require its explanation in explicit detail, particularly to achieve musical ends, this book is for you. The Fourier transform effectively detects periodic waveforms within signals, but how, and why?

The first chapter covers some of the basic mathematics required to understand all of the equations in this book, including logarithms and trigonometry. The next two chapters examine sound from different angles, the first from a physical perspective and the second from a musical one. The fourth chapter explores how these perspectives manifest in musical instruments and scales. In the fifth chapter, a quick overview of psychoacoustics is given, including a discussion of musical synesthesia and perfect pitch. Chapter 6 explores digital audio and begins to frame some of the parameters of the discrete Fourier transform and digital audio signal processing. The seventh chapter breaks down all components of the DFT and its inverse, finishing with several examples. The FFT and other variants of the Fourier transform are addressed in the final chapter.

The appendices explore traditional, analog signal processing with an overview of frequency-selective filter design, the Laplace and Z transforms, and explicit Mathematica, Matlab, and C code examples of the FFT. These topics are somewhat peripheral to the preceding chapters and are mere introductions to much broader fields. The concept of filtering (keeping certain frequencies in a signal and discarding others) is alluded to many times in the text, so Appendix A is there for the curious.

Only a high school level of calculus is required to evaluate all of the equations given here, with only a few instances of integrals and derivatives. What is required is experience listening to music and great curiosity about its nature. Music information retrieval, sound design, and compositional applications are just three possible directions this book can lead to. Beyond that is up to you.

Acknowledgements

This book would have been much, much different if not for the brutal honesty of my good friend Timothy Eshing. Timothy is an accomplished musician and the ideal reader of this book, so I was able to tailor much of its purpose and presentation to him.

Many thanks to the Zahorik Auditory Perception Laboratory at the Heuser Hearing Research Center for providing an environment for my research and an excellent resource for all things psychoacoustics.

Special thanks to Sarah Powers for many great images, Jeffrey Jackson for help with the C code, and Evan Brooks for elaborate edits.

Finally, thanks to my loving friends and family, especially Nada Zakaria, Meghan Mott, Zachary Thomas, Kate Eldridge, Susan Callander, Mom, Tony, and Dad.


Contents

Preface i
Contents iv

1 Review of mathematical notation and functions 1
1.1 Numbers 1
1.2 Functions 2
1.3 Calculus 6
1.4 Notation 7

2 Physical sound 9
2.1 What is sound? 9
2.2 Simple harmonic motion 13
2.3 Complex harmonic motion 19
2.4 Harmony, periodicity, and perfect intervals 22
2.5 Properties of waves 27
2.6 Chapter summary 46

3 Musical sound 49
3.1 Rhythm 49
3.2 Pitch 51
3.3 Tuning and temperament 52
3.4 Timbre 62
3.5 Chapter summary 66

4 Musical instruments 67
4.1 The piano 67
4.2 The viol family 71
4.3 Woodwinds and brasses 77
4.4 Drums 89
4.5 Electric guitars and effects units 93
4.6 Chapter summary 104

5 Auditory perception 109
5.1 Physiology of the ear 109
5.2 Psychoacoustics 116
5.3 Perfect pitch 129
5.4 Chapter summary 133

6 Digital audio basics 137
6.1 Sampling 138
6.2 Compression 150
6.3 Chapter summary 158

7 The discrete Fourier transform 161
7.1 The Fourier series 161
7.2 Euler's formula 167
7.3 The discrete Fourier transform 172
7.4 The DFT, simplified 190
7.5 Examples 198
7.6 Chapter summary 210

8 Other Fourier transforms 213
8.1 Discrete-time Fourier transform (DTFT) 214
8.2 Fast Fourier transform (FFT) 216
8.3 Short-time Fourier transform (STFT) 221
8.4 Chapter summary 227

A Frequency-selective circuits 229
A.1 Ohm's Law 231
A.2 Filtering 235
A.3 The Z-transform 239
A.4 Chapter summary 249

B Using computers to do Fourier transforms 251
B.1 Matlab 251
B.2 Mathematica 256
B.3 C 260

References 273
Glossary 283
Index 309


1. Review of mathematical notation and functions

The author's background is in mathematics, but yours doesn't need to be to work through this book. However, to get the most out of Numbers & notes, it is important to know some mathematical definitions. Most of these do not extend beyond high school algebra, though a knowledge of trigonometry and basic calculus will certainly help. This chapter will serve as a brief refresher course for these topics and can be skipped if the reader feels well versed in mathematical syntax and functions.

1.1 Numbers

The real numbers, denoted by the set R, are the numbers that do not have an imaginary component, i.e., a real number does not contain the quantity i, equal to √−1. They include the rational numbers, p/q, where p and q are integers like 1, 2, −3, etc. (with q nonzero), and the irrational numbers, like √5. The real numbers form a continuum because they can be infinitesimally small and there are an infinite number of them between any two numbers.

The complex numbers (the set C) also form a continuum. We describe a complex number c by the quantity

c = a + bi,

where a and b are real numbers. For example, 0.289576, 5, and −39.01 are all both real and complex numbers, while 0.289576 + 5i is only complex. So, the real numbers are a subset of the complex numbers, i.e., R ⊂ C.

The inverse of a number is the quantity that transforms that number into an identity value. The additive identity is 0 and the multiplicative identity is 1, so the additive inverse of 2 is −2 because 2 + (−2) = 0, and the multiplicative inverse of 2 is 1/2 because 2 · 1/2 = 1. Likewise, a function can have an inverse (in which case we call it invertible), and this is almost always the multiplicative inverse. The inverse of e^x is e^(−x) because e^x · e^(−x) = e^(x−x) = e^0 = 1. Note that the value of any quantity raised to the zeroth power is equal to 1—even 0^0, by the usual convention.

1.2 Functions

A function in mathematics accepts an input of a certain type (real or complex) and produces an output that is explicitly given by a mathematical expression that it equals. A function f with argument x is written f(x). The argument constitutes the domain of a function and f(x) constitutes the range. The function maps x to a unique, corresponding value, the point (x, f(x)).

Loosely, when an infinitesimally small change in x produces an infinitesimally small change in f(x) for all x, we say that the function is continuous. A continuous function is smooth and has no jumps or missing holes in its graph. The function f(x) = x is continuous when x ranges over all of the real numbers, for example. On the other hand, a discrete function does have jumps and gaps. Discrete functions are characterized by individual points and cases that determine where they exist. The function

f(x) = 1 if x = 5, −2 if x = 9, and 0 otherwise

is an example of a discrete function. A function is only continuous if its input is continuous.
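As a small added illustration (not code from the book), the discrete function above can be written in MATLAB as a one-line function handle and evaluated at individual points:

    % The discrete function f(x): 1 at x = 5, -2 at x = 9, and 0 elsewhere.
    f = @(x) 1*(x == 5) - 2*(x == 9);   % logical tests pick out the two special points

    f(5)      % returns 1
    f(9)      % returns -2
    f(7.3)    % returns 0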

Exponential functions (like f(x) = e^x), logarithmic functions (like f(x) = log(2x)), and trigonometric functions (like f(t) = sin(πt)) will be used frequently in this book, so we will examine some fundamentals of their behavior in this section.



Logarithms and exponents

The logarithm with base a of a value b returns the exponent x such that a^x = b, i.e., if

log_a(b) = x,

then

a^x = b,

and vice versa. For example, log_10(100) is equal to 2 because 10^2 = 10 · 10 = 100.

Three common bases for the logarithm function are 10, 2, and e. Most calculators are preprogrammed with 10 as the logarithmic base because this is the base of our counting system. In discussions of computational complexity, base 2 (binary) is the standard. When the base is the constant e, equal to approximately 2.71828183..., the logarithm of x can also be written ln(x). This is called the natural logarithm. It may be a bit surprising that log_10(x) wouldn't be considered the "natural" logarithm, given the way we write real-world quantities and regard our fingers as digits, but the number 10 is not as mathematically significant as e. The significance of the natural logarithm is supported by the property that both the derivative and the antiderivative (integral) of the exponential function e^x are themselves e^x. This means that both the slope of e^x at the point x = a and the area underneath the curve from negative infinity to a are exactly equal to e^a. See Figure 1.1.

A conceptual understanding of logarithmic functions is helpful to many aspects of the science of music. Pitch, loudness, and even the ear itself are all logarithmic in nature.
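For readers who want to check these relationships numerically, the following MATLAB lines (an added illustration, not part of the original text) evaluate logarithms in the three common bases. Note that in MATLAB, log is the natural logarithm, while log10 and log2 handle bases 10 and 2.

    % Logarithms undo exponentiation: log_a(b) = x exactly when a^x = b.
    log10(100)        % returns 2, because 10^2 = 100
    log2(8)           % returns 3, because 2^3 = 8
    log(exp(5))       % returns 5: log() is the natural logarithm, base e
    exp(1)            % the constant e, approximately 2.71828183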

Trigonometry

There are three important functions in trigonometry: sine, cosine, and tangent. Each of these functions treats its argument as an angle. This angle is compared to the unit circle, which has a radius of 1 and is centered at the origin (see Figure 1.3).



Figure 1.1: The base of the natural logarithm, e, is a constant equal to approximately 2.71828183... When e is raised to a variable x, the resulting function has the property that its derivative with respect to x, d[e^x]/dx, is equal to e^x. The derivative is defined as the slope of the curve at any point x. Likewise, its antiderivative or integral, ∫ e^x dx, is equal to e^x plus an undefined constant. The integral is defined as the area under the curve of the function. If we wanted to know the area under the curve between x = a and x = b, where a < b, we would evaluate the definite integral of e^x from a to b.



Figure 1.2: The fundamental trigonometric function is the sine function, written sin(x) where x is a varying angle. The other functions can be written as functions of sine. Cosine is the sine function shifted by 90°, i.e., cos(x) = sin(x + π/2), and tangent is the quotient of sine and cosine, written tan(x) = sin(x)/cos(x). The sine and cosine functions have a finite range of values, periodically falling in the interval [−1, 1], while tan(x) ranges between −∞ and ∞ with discontinuities twice per period, indicated by the vertical lines in the third graph.

For a right triangle with angle θ, as in Figure 1.3, sine is the ratio of the length of the opposite side to the length of the hypotenuse, and cosine is the ratio of the length of the adjacent side to the length of the hypotenuse. The tangent of an angle is given by the ratio sin(θ)/cos(θ), which is the ratio of the length of the opposite side to the length of the adjacent side. A nice mnemonic device arises here: SOH-CAH-TOA, wherein sin(θ) = opposite/hypotenuse, cos(θ) = adjacent/hypotenuse, and tan(θ) = opposite/adjacent.

The trigonometric functions sin(x), cos(x), and tan(x) treat x as an angle (like θ in Figure 1.3), but what is the unit of x? The variable x can be expressed in radians or degrees, and these functions move counterclockwise continuously along the unit circle, recording at any given point the height, the width, or the ratio of the height to the width.



Figure 1.3: The unit circle is defined as the circle centered about the origin (0, 0) with a radius of 1. An angle θ describes the angle between the right side of the horizontal axis and the hypotenuse of the right triangle. The hypotenuse is also a vector, and its coordinates are (cos(θ), sin(θ)), i.e., the width and height of the right triangle.

In mathematics, the arguments of trigonometric functions are virtually always in radians because of the association with the unit circle and its circumference of 2π. We will look at many graphs of sinusoidal functions later in the text. One important trigonometric identity to note is that cos(x) = sin(x + 90°) = sin(x + π/2), so cosine is the same as the sine function phase shifted by 90° (equivalent to π/2 radians). Another very common identity is cos²(x) + sin²(x) = 1, which implies that √(cos²(x) + sin²(x)) = 1.
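A quick numerical check of these identities, added here as an illustration, can be done in MATLAB, which, like most mathematics software, expects trigonometric arguments in radians.

    % Verify two basic identities at a handful of angles (in radians).
    x = [0, pi/6, pi/4, 1.3, 2.7];

    cos(x) - sin(x + pi/2)        % essentially zero: cos(x) = sin(x + pi/2)
    cos(x).^2 + sin(x).^2         % every entry is 1: the Pythagorean identity
    90 * pi/180                   % converting 90 degrees to radians gives pi/2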

1.3 Calculus

We examined the graphs of the derivative and antiderivative of e^x, but to recapitulate, a derivative is the rate of change of a function, and an antiderivative, or integral, is the area underneath the curve of a function.



You will only need calculus to compute continuous Fourier transforms, which are necessary when the system is continuous, i.e., when you are analyzing an electrical, analog system. In discrete systems, we use sums instead of integrals. Even in advanced applications of the Fourier transform, calculus is rarely needed, but linear algebra and numerical analysis methods typically are.

1.4 Notation

The magnitude and absolute value will be considered many times in this book. The terms are identical: both are real and nonnegative in value. We denote magnitude with double square brackets or with vertical bars, and the absolute value with vertical bars. The magnitude and absolute value of the complex quantity (a + bi) is written

|a + bi| = [[a + bi]] = √(a² + b²).

In the present text, I will use the notation |a + bi| to denote both the magnitude and the absolute value of complex arguments.
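In MATLAB, this magnitude is exactly what the built-in abs returns for a complex argument; the short check below is an added illustration of the definition rather than code from the book.

    % Magnitude (absolute value) of a complex number: |a + bi| = sqrt(a^2 + b^2).
    a = 3;  b = -4;
    c = a + b*1i;                 % the complex number 3 - 4i

    abs(c)                        % returns 5
    sqrt(a^2 + b^2)               % also 5, matching the definition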

In this book, functions written with capital letters denote Fourier transforms and other frequency-domain functions, like X(f), and functions written with lowercase letters denote time-domain signals, like x(t).

The Greek alphabet

It is nice to be able to sound out mathematical equations as you read along, so listed below are the letters from the Greek alphabet used in this book, alongside their English spellings and their functions in Numbers & notes.



Gk. letter   Pronunciation   Common scientific usage
β            beta            bandwidth (frequency)
γ            gamma           heat capacity ratio
δ            delta           the delta function
θ            theta           angle
λ            lambda          wavelength (m)
µ            mu              mass per unit length (kg/m)
π            pi              the constant 3.14159265...
ρ            rho             density (kg/m³)
Σ            Sigma (cap.)    the sum of a sequence; a series
τ            tau             time (s)
φ            phi             angle
ω            omega           angular frequency, 2πf

Other definitions

For definitions of many words used in Numbers & notes, please refer to the glossary at the end. If a word in the text is italicized, you will likely find it there.


2. Physical sound

2.1 What is sound?

Sound is the human ear's perceived effect of pressure changes in the ambient air. Sound can be modeled as a function of time.

Figure 2.1: A 0.56-second audio clip of an accordion playing C4 (middle C, 261.6 Hz).

When we hear music, we can evaluate its features almost immediately. We can recognize the instrumentation, modality, artist, genre, and perhaps the time and place it was recorded. Graphically, it is difficult to connect this image to what we actually hear: the above graph looks complicated, while the experience of this sound (the audio signal) is a single, sustained pitch on an accordion. But when we take a Fourier transform of this clip, we can actually view the frequencies present in a song.

Because pitch and timbre are made up exclusively of the change over time of frequencies and amplitudes, and they tell us so much information about musical features, the Fourier transform is an incredibly useful tool that translates time-domain signals like music onto an axis of frequencies, i.e., a frequency domain. The graph in Figure 2.2 lets our eyes verify what our ears already know.



Figure 2.2: The spectrum and listed frequencies attained by the discrete Fourier transform of the clip shown in Figure 2.1. Note the locations of the peaks with respect to frequency.

The graph describes the relative strength of the frequencies present in a signal. In this example, we see the frequency characteristics of an accordion playing C.

The frequencies themselves are not as important as the general shape of the spikes and the distance between them; most of us cannot distinguish between A and C in isolation, but we have a relatively easy time identifying the difference between a piano and a violin. This is because of the texture of the instrument's sound, called the timbre or tone color. When the frequencies are more or less equally spaced from one another, we say that the timbre is harmonic, or that we have a harmonic overtone series. Explicitly, the Fourier transform of the signal in Figure 2.1 and its graphical representation in Figure 2.2 tell us the signal contains the frequencies 263.2 Hz (C), 528.2 Hz (C), 787.9 Hz (G), 1051 Hz (C), and 1313 Hz (E). The height of each frequency's peak in the graph indicates its loudness; here, the peaks are decreasing in power.
graph indicates their loudness; hence, they are decreasing in power.



Middle C is 261.6 Hz, so apparently this accordion is slightly out of tune—but furthermore, its timbre is not perfectly harmonic: the spacing between its overtones should be 263.2 Hz, yet 528.2 − 263.2 = 265 and 787.9 − 528.2 = 259.7. There are several possible reasons why these spikes are not exactly equally spaced. Most likely, it is due to the imperfect physical proportions and construction of the instrument's metal reeds, but it could also be error introduced in the recording process or experimental error.
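As a minimal sketch of how such a spectrum can be computed, the MATLAB fragment below reads an audio clip, applies the built-in fft, and plots the magnitude against frequency in hertz. The file name accordion_C4.wav is only a placeholder for a recording like the one in Figure 2.1, and the fragment is an illustration rather than the code used to produce Figure 2.2.

    % Read a short audio clip (placeholder file name) and plot its spectrum.
    [x, fs] = audioread('accordion_C4.wav');   % x: samples, fs: sample rate in Hz
    x = x(:, 1);                               % keep one channel if the file is stereo

    N = length(x);
    X = fft(x);                                % discrete Fourier transform via the FFT
    f = (0:N-1) * fs / N;                      % frequency axis in Hz, one bin per sample

    half = 1:floor(N/2);                       % the spectrum is symmetric; keep 0 to fs/2
    plot(f(half), abs(X(half)));
    xlabel('Frequency (Hz)');
    ylabel('Magnitude');

For a clip like this one, peaks in the resulting plot would fall near the frequencies listed above, and their spacing reveals how nearly harmonic the timbre is.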

To interpret how exactly this translates to what our ears hear, we must take into account how certain frequencies are perceived by the brain. Young, healthy human ears can detect frequencies within a range of 20–20,000 Hz, where 20 Hz and 20,000 Hz are threshold and limit values, but our ears are not uniformly sensitive to these frequencies [1]. Within the range of 1000 to 5000 Hz, our ears are especially sensitive, meaning that sounds with frequencies within this range do not have to be as loud for our ears to detect them.

Mathematically, the Fourier transform constructs an orthonormal basis that takes a complicated sound wave and reduces it to its component waves, which are all simple sine and cosine waves, or sinusoids.¹ It shows us every frequency, and its amplitude, that is present in a complex sound over an interval of time. The connection between the graph of the transform and its mathematical properties is a giant step towards realizing the Fourier transform and its digital applications.

Because sight and sound deliver enormous amounts of information, we have to make decisions about what is important and what we can take for granted. Our brains are so excellent at processing information that we can give certain sensations finer resolution (like an important message from our friend), and others none at all (like the hum of the refrigerator).

¹ "Orthonormal" means orthogonal and normal. For a function to be orthogonal to another function, the two functions must be linearly independent. The condition of normality is satisfied when each function involved has, in some appropriate sense, energy 1. Finally, a basis is a set of functions such that an arbitrary function (within reason) may be written in terms of the basis. See Chapter 7.



Figure 2.3: We detect frequencies between about 20 and 20,000 Hz as pitched sound. Furthermore, each of these frequencies has a minimum threshold of loudness. This graph, the Fletcher–Munson curve, shows the minimal sound pressure level in decibels (dB) required for the frequency to be heard.

Consider seeing a relatively involved movie for the second or third time and noticing things you didn't notice before that now make sense. We seem to prefer movies like these. We may achieve a decent understanding of the plot on the first viewing because we extract salient parts of the dialogue and action and put them in order, but a complex plot can hide clues to outcomes and their rationale all over the film, clues that are more obvious when our brains can support them with familiar elements.

A complex piece of music can be a lot like a complex movie. We perceive both sound and light as signals. When a signal demands our attention, it is said to have a high amount of information. A signal with meaningless content that we don't need or want to listen to is called noise. The sound produced by white noise machines, for example, is random, unpitched, and trivial. It does not contain a message because it is formally disorganized, and it can even help some people sleep because of its uniform randomness. Noise is composed of so many periodic waves that we consider it aperiodic. We cannot extract individual frequencies from noise, as we can from a melody or a major chord in music. A signal can be half meaningful and half noise, and our brains are powerful enough to recognize the difference and attempt to separate the two.

Although sine waves are not fun to think about, they substantiate much of the mathematics and physics behind music. The mathematical and physical equations which produced the previous graphs form a basis for the sensation of sound. Many musical concepts are results of mathematical relationships. First, let us examine the basic mathematical structure of sound. Musical form will be addressed in Chapters 3 and 4.

2.2 Simple harmonic motion

Like light, sound is traveling energy, and we can model such energy mathematically with waves. The simplest wave is a sinusoid, a trigonometric function such as sin(ωt) or cos(ωt), where t denotes time and ω specifies how often the wave repeats itself—its angular frequency.² A sinusoidal wave represents the simple harmonic motion of an object because its frequency and extreme magnitudes do not change over time. Both a spring and a tuning fork exhibit simple harmonic motion.

Below, we see two states of a vibrating tuning fork called modes of vibration. Both of these modes produce sound that is near in tone to a sine wave (or pure tone).

² The frequency f in Hz (cycles per second) is related to ω by ω = 2πf.



Figure 2.4: A tuning fork and a weighted spring oscillate in simple harmonic motion.

As you might have experienced, though, the tone is more metallic and glassy than the electronic sound of a sine wave. When we strike the tuning fork, we experience the attack of the sound; then the sound sustains (decaying with time due to frictional forces) and eventually releases, leaving no sound. In an ideal physical world, i.e., one without the external forces of gravity, friction, and other resistive forces, a spring set into motion could oscillate forever at a uniform amplitude, as could a tuning fork. But that is not what happens in reality. The closest we can get to simple harmonic motion is represented by the curve in Figure 2.5.

Figure 2.5: A musical wave in reality begins at zero energy, climbs to a maximal energy, and fades to zero energy.



You can see that the amplitude of this wave varies, but the points at which it crosses the horizontal axis, i.e., when its amplitude equals 0, are evenly spaced over time. This means the frequency does not vary, but the intensity of its motion does. True simple harmonic motion can be generated by an oscillator, a computer, or a tuning fork with a driving motor attached to it [2].

Amplitude represents pressure as well as voltage: An audio function models the pressure in the air corresponding to the sound wave as a function of time, and when the signal is electrified, the amplitude represents the (relative) voltage. In acoustics as well as electrical engineering, we call this function a signal, and the amplitude tells us most of the information we need to determine how loud our ears will perceive it to be. Because air is elastic, when a sound wave travels in air, it excites the air molecules and varies the pressure. The amplitude of the graph of a sine wave describes this behavior (see Figure 2.6).

There are three fundamental aspects of a sinusoid of the form A sin(ωt + φ): its magnitude³ A, its frequency ω, and its phase φ. We have already considered amplitude: it is the pressure. A wave at amplitude 0 means that the system is at normal atmospheric pressure—the pressure of the environment to which our ears have adjusted; no extra pressure is affecting the eardrum at that instant. Frequency can be determined by the number of times per second that the signal has zero pressure, or the rate at which the signal crosses the graph's horizontal axis. Finally, we can determine the phase φ at any point t from the time of the next zero crossing, i.e., where the amplitude crosses the horizontal axis.

³ Unfortunately, there are quite a few terms that will be used somewhat interchangeably to mean magnitude: amplitude, height, displacement, energy, power, voltage, pressure, strength, and loudness. Loudness is a perceptual word, and since we do not perceive all frequencies as equal (or at all, as in Figure 2.3), this word will be used with caution. Voltage, power, and energy are typically encountered in electrical engineering texts to mean amplitude, though they absolutely do not have equivalent meanings (see Appendix A). Strength and displacement are words used here to denote the magnitude of a wave, i.e., the vertical distance from 0.



Figure 2.6: The compressions and rarefactions in air resulting from sound waves, shown two ways. The maximal points of the sine wave graph correspond to the most compressed areas of the particle graph, represented by the most densely spaced dots. The minimal points correspond to rarefactions, represented by the least dense spacings of dots. Where the amplitude of the sine wave is 0 represents normal atmospheric pressure, where the density of the dots is average.

For ease of computation, we only allow amplitude to vary between −1 and 1, so the average value of the amplitude of a simple sine wave is always zero.⁴ It may seem strange that pressure can take on negative values, but it simply means that the sound's pressure is dipping below normal atmospheric pressure. For purposes of standardization, this is defined as the pressure of air at sea level, 101,325 pascals (Pa); but in reality, it is the average atmospheric pressure of our present environment, to which our ears have adjusted. Hence, amplitudes higher than 0 imply that the pressure induced by a sound wave is greater than normal pressure (compression), and amplitudes below 0 imply that the pressure of a sound wave is less than normal pressure (rarefaction). Our ears detect sound by change over time in pressure, so a single, isolated amplitude tells us nothing about what we actually hear.

⁴ In electronic reality, sound signals can, however, have a nonzero average value due to things like DC offset, uncalibrated equipment, or postproduction changes.

Angular frequency is given by ω in radians per second (rad/s), and it is equal to 2πf, where f is ordinary frequency, given in hertz (Hz). Frequency f is inversely proportional to the time that the sine wave takes to complete one period T, as given by the following formula:

f = 1/T.

Therefore, ω = 2π/T.

Phase tells us where the wave is along the course of a single period, taking on angles between 0° and just less than 360°. We are especially interested in phase when we have two waves of identical frequency. Now let us examine the nature of a simple sinusoid where ω = 2π radians/second (so f = 1 Hz): x(t) = sin(2πt).

Figure 2.7: A simple sinusoid, x(t) = sin(2πt), with the phase φ marked.

Take note of the circular diagrams beneath the graph in Figure 2.7. These circles show different positions along the unit circle. The starting position, where φ = 0, is the rightmost point on this circle, situated at its intersection with the horizontal axis. When we move counterclockwise along the circumference of this circle, we increase the angle relative to this position. When we return to this position, we have moved 360°. In radians, 360° is equal to 2π. We can translate any angle in degrees to radians by multiplying the number of degrees by π/180; e.g., for a right angle φ = 90°, the equivalent angle in radians is calculated to be 90° · π/180 = π/2 radians.

Note that φ = 0 is positioned on the circle exactly where φ = 2π and φ = 4π are. This is true of any even-integer multiple of π (4π, 6π, 8π, ...). Similarly, the trigonometric function of any variable (such as a frequency ω) is the same as that of the variable phase shifted by an integer multiple of 2π, i.e.,

cos(ω) = cos(ω + 2πk), and
sin(ω) = sin(ω + 2πk), for k = 0, 1, 2, ...

Again, the height of a point along the unit circle at angle φ is given by sin(φ), and the width is modeled by cos(φ). The phase at the initial time, t = 0, is the angle by which the sinusoid is shifted relative to a sine wave with no phase offset. We represent phase with the Greek letter φ, so that a simple sinusoid is formally written

x(t) = A sin(2πft + φ) = A sin(ωt + φ).

We multiply frequency by the quantity 2π because this strengthens the connection between the unit circle and frequency. The angular frequency is commonly used in the Fourier transform and in physical science. However, in music, we connect pitch to frequency in hertz (like A440), so we will use 2πf instead of ω when considering sound musically.
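As a small illustration of these parameters (added here, with arbitrary values rather than ones from the text), the MATLAB sketch below builds a sampled version of x(t) = A sin(2πft + φ):

    % Generate a short sampled sinusoid x(t) = A*sin(2*pi*f*t + phi).
    fs  = 44100;               % sample rate in samples per second
    t   = 0:1/fs:0.01;         % 10 ms of time points
    A   = 0.8;                 % amplitude (kept between -1 and 1)
    f   = 440;                 % ordinary frequency in Hz, so the period T = 1/f
    phi = pi/2;                % phase offset in radians (pi/2 turns the sine into a cosine)

    x = A * sin(2*pi*f*t + phi);
    plot(t, x);
    xlabel('Time (s)');
    ylabel('Amplitude');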

It is impossible to identify the phase of a single sine wave without a reference point. However, it is important for a signal to begin and end at amplitude 0 in order to understand its behavior, because sounds beginning or ending at a nonzero amplitude surprise our ears to the point that the frequency undergoes distortion. Consider dropping a needle on a record and hearing a fuzzy click. We hear a burst of sound when there is any discontinuity in pressure. This is reflected not only on our basilar membrane, but in the Fourier transform, and it is one type of clipping.

2.3 Complex harmonic motion

Now let us examine more complex waves. In reality, virtually every sound is a complex wave. We really only encounter simple waves during hearing tests or in electronic music. In fact, listening to a sine wave for an extended period of time can cause headaches, extreme emotional responses, and hearing damage [3].

Figure 2.8: A clip of an audio signal.

The horizontal axis of Figure 2.8 is once again time and the vertical axis is amplitude or pressure. Clearly, this is complicated: it is close to impossible for our brains to detect any sort of pattern in this waveform because there is no clear repetition. Furthermore, there are no distinct frequencies we can pick out because there is no obvious repetition in the sound wave. However, this wave can be decomposed completely into sine waves solely by a Fourier transform. It may require hundreds, even infinitely many of them, but it can be done. Let us look at a simpler—but still complex—wave to illustrate the combination of simple sinusoids, as in Figure 2.9.

Figure 2.9: The combination of two simple sine waves, x_1(t) = sin(2πt) and x_2(t) = sin(4πt).

Our ears can identify a pattern because this wave is periodic. The wave is made up of two different sine waves, and they are harmonic relatives of each other: one is twice the frequency of the other. This wave repeats identically after every second. The first sinusoid x_1 has a frequency of 1 Hz (2π rad/s), and the second sinusoid x_2 has a frequency of 2 Hz (4π rad/s). These are called frequency components ω_k, where ω_1 = 2π and ω_2 = 4π.

We couldn't actually hear these frequencies as pitch in reality because they oscillate too slowly: our ears only translate frequencies above about 20 Hz into pitched sound. While using such low frequencies is preferable for ease of visualization and computation, any pair of sine waves whose frequencies have a 2:1 ratio is defined as having the interval of an octave. Say that the scale of the horizontal axis in Figure 2.9 were in milliseconds instead of seconds. Then these sine waves would have the audible frequencies of 1000 and 2000 Hz. Visually, we can see that these waves intersect half of the time that they cross the horizontal axis, and the ratio between these waves' frequencies (2:1) emphasizes the inversely proportional mathematical relationship between frequency and time (f = 1/T).
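The sum in Figure 2.9 is easy to reproduce numerically. The sketch below, a minimal added example rather than the book's own code, samples x_1(t) = sin(2πt) and x_2(t) = sin(4πt) over two seconds and adds them; the resulting wave repeats once per second, as described above.

    % Superpose two harmonically related sine waves (an octave apart).
    fs = 1000;                 % samples per second (plenty for 1 Hz and 2 Hz)
    t  = 0:1/fs:2;             % two seconds of time
    x1 = sin(2*pi*1*t);        % 1 Hz component (omega_1 = 2*pi rad/s)
    x2 = sin(2*pi*2*t);        % 2 Hz component (omega_2 = 4*pi rad/s)
    x  = x1 + x2;              % principle of superposition: the complex wave

    plot(t, x);                % the sum repeats identically every 1 s
    xlabel('Time (s)');
    ylabel('Amplitude');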

Figure 2.10: The combination of one sine wave of frequency f_1 with another sine wave of frequency f_2 = 2f_1 produces the interval of an octave. Note the periodic nature of the resultant wave: it repeats itself identically four times, just like the first wave.



The graph of the signal given in Figure 2.10 can be determined from the graphs of these two sine waves, but picking out even a small handful of simple sinusoids from a complex wave is not a task we want to leave to our senses. We need more advanced computational tools to analyze the frequencies contained in a signal that looks like a random string of numbers between −1 and 1. That is what is so very exciting about the power of the Fourier transform.

The physical law at work here is called the principle of superposition.

The principle of superposition: Every wave can be represented as a sum of simple sinusoids.

Note that this says nothing about whether the sum has finitely or infinitely many terms. Square and triangle waves, for example, have jagged corners that cannot be represented by a finite number of sinusoids. The principle of superposition is critical to understanding the concepts in the remainder of this book.

2.4 Harmony, periodicity, and perfect intervals

When two waves have frequencies that are related to each other by a small-number integer ratio like 2:1, we say that they are harmonic, that they have a harmonic relationship, or that they are harmonics of one another. The above example of f_1 = 1 Hz and f_2 = 2 Hz, i.e., f_2 = 2f_1, forms the interval of an octave. Likewise, the octave above any frequency is double that frequency, so we can calculate the frequency f_k that is k-many octaves (the kth octave) above a given frequency f_0 by the equation

f_k = 2^k · f_0.

Letting k = 0 returns f_0, so the frequency 0 octaves above a given frequency is the original or fundamental frequency. This is called unison, or perfect unison. A perfect interval is characterized by a small-number ratio between the two frequencies, restricted in Western music to 1:1 (P1, perfect unison), 2:1 (P8, perfect octave), 3:2 (P5, perfect fifth), and 4:3 (P4, perfect fourth) [4].
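As a quick numerical illustration (not taken from the book), the lines below use f_k = 2^k · f_0 and the perfect-interval ratios to compute a few frequencies above A440:

    % Octaves above a fundamental: f_k = 2^k * f0.
    f0 = 440;                          % A440 as the fundamental frequency in Hz
    k  = 0:3;
    octaves = 2.^k * f0;               % 440, 880, 1760, 3520 Hz

    % Perfect intervals above the same fundamental, by ratio.
    perfect_fifth  = 3/2 * f0;         % 660 Hz
    perfect_fourth = 4/3 * f0;         % about 586.7 Hz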

Perfect unison is trivially the smallest integer ratio. Two frequencies separated by an octave are in the ratio 2:1, the second-smallest integer ratio. The perfect fifth has a ratio very close to 3:2 in equal temperament, and it is exactly 3:2 in just intonation and Pythagorean tuning. A perfect fourth has a 4:3 ratio: it is the inversion of the perfect fifth, sounded by moving up an octave and down a perfect fifth. In fact, as we will see in the later discussion of musical temperament and tuning systems, every note of the 12-tone Pythagorean scale can be attained by moving in perfect fifths, but these intervals are related to f_0 by increasingly larger integer ratios.

Moreover, the smaller the integer ratio between two frequencies (and thereby between their two periods), the more pleasant or consonant we find their interval. In the introduction to this chapter, I mentioned the harmonic overtone series and its relationship to timbre in music: musical instruments are constructed to have tones containing integer-related frequencies. To back up a little bit: when we hear A at 440 Hz on a piano, we do not just hear the frequency 440 Hz. If this were the case, it would sound no different from an electronic beep caused by an oscillator or ideal tuning fork. When we hear A440 from a piano, we actually hear a whole spectrum of other frequencies resulting from the resonance of the piano and the nature of the fixed string. Musical, pitched instruments like the piano, with the exception of percussion instruments, generate overtone series that are very nearly harmonic, regardless of the pitch played. (The modes of vibration of circular membranes, by contrast, have Bessel-function ratios.)

We define nodes as the zero crossings of a wave (i.e., the points where x(t) is zero, crossing the horizontal axis) and antinodes as areas of maximal compression and rarefaction. A node exists on an instrument at a point or region that stays stationary while the rest of the instrument vibrates. Nodes and antinodes are used when we consider standing waves, which occur within all musical instruments and rooms. A standing wave is produced when a sound wave's forward velocity is the same as its backward velocity; hence it stands still with respect to position. For example, in a violin, waves move back and forth at the same velocity along a string fixed at both ends and therefore are only displaced up and down. Standing waves occur when the wavelengths of a given frequency are in integer proportion to the dimensions of a string, room, or column of air. They cause feedback in a recording studio because they do not die away as quickly as waves of other frequencies.

Helmholtz resonance can be witnessed when air is blown across a small opening of an otherwise closed cavity, like a bottle of water or the cracked window of a moving car. The frequency produced falls as the volume of the cavity grows and rises with the cross-sectional area of the opening, so the frequencies of larger cavities, like the interior of a car, are low. The formula for the Helmholtz resonance ω_H is

ω_H = √(γ A² P_0 / (m V_0)),

where γ is the adiabatic index of specific heats (1.4 for dry air), A is the area of the opening, P_0 is the initial pressure of the air inside of the cavity, m is the mass of air in the neck of the opening, and V_0 is the initial volume of air inside of the cavity. So, widening the crack of the car window will increase the angular frequency of the Helmholtz resonance, and reducing the amount of water in the bottle (hence increasing V_0) will reduce the characteristic Helmholtz resonant frequency ω_H.
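To make the formula concrete, here is a small MATLAB sketch that evaluates ω_H for a rough bottle-like geometry; the numbers are illustrative assumptions, not measurements from the text.

    % Helmholtz resonance: omega_H = sqrt(gamma * A^2 * P0 / (m * V0)).
    gamma = 1.4;                 % adiabatic index for dry air
    P0    = 101325;              % ambient pressure in Pa (sea level)
    rho   = 1.2;                 % approximate density of air in kg/m^3

    A  = 3e-4;                   % opening area in m^2 (roughly a 2 cm diameter neck)
    L  = 0.05;                   % neck length in m
    V0 = 1e-3;                   % cavity volume in m^3 (one liter)
    m  = rho * A * L;            % mass of the air plug in the neck, in kg

    omega_H = sqrt(gamma * A^2 * P0 / (m * V0));   % angular frequency in rad/s
    f_H     = omega_H / (2*pi)                     % ordinary frequency, roughly 130 Hz here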

Lightly placing one's finger at any of the nodes on a fixed string does noticeable things to its harmonics. Doing so at the halfway point on a guitar string, for example, causes the odd harmonics to drop out and the octaves above the fundamental to be very clear. Figure 2.11 depicts the first four modes of vibration of a fixed string, and the dots highlight their nodes.
highlight their nodes.



Figure 2.11: The first four modes of a fixed string. Because a string is secured at both ends, its overtone series is determined by its length: The string can only support wavelengths for which a whole number of half-wavelengths spans that length, so the frequencies it contains fall in the ratios f : 2f : 3f, and so on. A non-integer ratio would result in an impossible scenario: A string loose at one end.

By paying close attention to the tone of a musical instrument, you can pick out its individual overtones. If you have access to a piano, try this experiment. Find a way to depress every key on the piano except for the second-to-bottom A, using books or a friend's arms. Do this slowly so that the keys do not trigger sound. When the piano is silent, strike the A with considerable force and listen closely. You should be able to hear at least three higher notes ringing: the A an octave above, the E an octave and a fifth above, and the A two octaves above. These are the first three overtones of the fundamental frequency, 55 Hz, shown in the following table.



Frequency   Note name   Ratio to 55 Hz   Interval
55 Hz       A           1:1              Perfect Unison
110 Hz      A           2:1              Perfect Octave
165 Hz      E           3:1              Perfect Fifth
220 Hz      A           4:1              Perfect Octave
275 Hz      C♯          5:1              Major Third
330 Hz      E           6:1              Perfect Fifth
385 Hz      G           7:1              Minor Seventh
440 Hz      A           8:1              Perfect Octave

The intervals above are given in their simplest forms. As you can see, the interval between E3 at 165 Hz and A1 at 55 Hz actually spans a perfect octave plus a perfect fifth. Frequencies separated by octaves sound so similar that we call them all by the same note name, so in most cases this relaxed terminology of reducing intervals that span more than an octave is acceptable. 5

This series hypothetically continues forever <strong>to</strong> include the ratios<br />

9:1, 10:1, <strong>and</strong> so on. But the first few partials of any instrument have<br />

more energy than higher ones, so those are the ones we predominantly<br />

perceive—even though removing the higher ones would affect the<br />

perceived timbre.<br />

5 Particularly in the genre of jazz, the intervals of the ninth, eleventh, <strong>and</strong> thirteenth<br />

are used with some frequency. However, their sonority is similar <strong>to</strong> the interval minus<br />

an octave, i.e., a ninth has similar quality to a second, an eleventh to a fourth, and a

thirteenth <strong>to</strong> a sixth.



Harmonicity—<strong>and</strong> furthermore, the Western conceptualization of<br />

consonance—in music is manifested by simple mathematical relationships.<br />

We will say more about consonance <strong>and</strong> dissonance in the fourth<br />

chapter on audi<strong>to</strong>ry perception.<br />

2.5 Properties of waves<br />

When waves interact with other waves or with media like walls, water,<br />

<strong>and</strong> hot air, they exhibit <strong>to</strong> some degree the properties of reflection,<br />

refraction, interference, <strong>and</strong> damping. Some of these we observe on<br />

a daily basis, like echoes, but some are quite rare, like cancelation.<br />

Underst<strong>and</strong>ing the properties of waves helps avoid unwanted sounds<br />

(noise) <strong>and</strong> improve the desired message (signals), <strong>and</strong> all of the properties<br />

are direct consequences of the behavior of amplitude, phase, <strong>and</strong><br />

frequency in response <strong>to</strong> the physical world.<br />

Before we begin to explain these properties, three more features of waves useful to understand are wavelength, amplitude envelopes, and crests versus troughs. Wavelength λ is the distance in meters that a wave of frequency f travels away from its source in one period T. We calculate wavelength λ with the equation

λ = vT = v/f,

where v is the velocity of sound. In dry, room-temperature (68° Fahrenheit) air, the speed of sound is about 343 meters per second (m/s), and a 1000 Hz wave would therefore have a wavelength of

(343 m/s) / (1000 s⁻¹) = 0.343 m.

Note that frequency (f = 1/T) can be notated either as hertz or s⁻¹.
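The same arithmetic is easy to express in C; this small sketch (the function name is my own) simply evaluates λ = v/f.

    #include <stdio.h>

    /* Wavelength in meters of a tone of frequency f (Hz), for sound speed v (m/s). */
    double wavelength(double v, double f)
    {
        return v / f;
    }

    int main(void)
    {
        /* 1000 Hz in dry 20 C air, as in the worked example above. */
        printf("%.3f m\n", wavelength(343.0, 1000.0));  /* prints 0.343 */
        return 0;
    }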

<strong>An</strong> amplitude envelope describes the general shape of the amplitude<br />

over time for a given wave. Attack, decay, sustain, <strong>and</strong> release are<br />

the four general qualities of an amplitude envelope, <strong>and</strong> they are most



Figure 2.12: The wavelengths of two pure tones, 1 Hz and 10 Hz, calculated by the formula λ = v/f. Since hertz are measured in inverted seconds (s⁻¹), this is the same as multiplying the speed of sound by the duration of one period (1 and 0.1 seconds, respectively). Notice that λ_10Hz is one-tenth of the length of λ_1Hz.

often occur in exactly that order for acoustic instruments. They are

all notated one of three ways: As an instant <strong>to</strong> refer <strong>to</strong> the instant at<br />

which they begin (the onset time), as an interval <strong>to</strong> mean the interval<br />

over which they occur, or as a rate defining the speed at which they<br />

happen. It is easy <strong>to</strong> underst<strong>and</strong> them graphically, as in Figure 2.13.<br />

However, an amplitude envelope is rarely this simple, <strong>and</strong> the one<br />

defining the shape of the frequency domain is not described the same<br />

way. Attack time <strong>and</strong> attack rate are particularly meaningful <strong>to</strong> the<br />

mathematics of music, especially when attempting <strong>to</strong> extract features<br />

from music. Attack almost solely defines where onsets exist. Onsets<br />

help us identify the location of beats <strong>and</strong> important events like the<br />

beginning of choruses or verses in musical signals.<br />

Finally, crests <strong>and</strong> troughs occur at the antinodes of a wave or fixed<br />

string. These are simply synonyms for maximum <strong>and</strong> minimum values<br />

in pressure.



Figure 2.13: A general attack-decay-sustain-release envelope, or ADSR envelope: The<br />

first onset of a note is the attack; the movement from the peak of the attack <strong>to</strong> the<br />

sustain is the decay; the duration a note is held is shown in the sustain; <strong>and</strong> the final<br />

decrease is the release, where the note is no longer being played. This envelope also<br />

describes reverberation.<br />

Figure 2.14: The nodes, antinodes, crests, <strong>and</strong> troughs of a sine wave, shown with<br />

eight different amplitudes. The antinodes are located at the crests <strong>and</strong> troughs, i.e.,<br />

areas of extreme compression <strong>and</strong> rarefaction, <strong>and</strong> the nodes are located at normal<br />

atmospheric pressure. Along a string, the nodes are located where the string does not<br />

move. These positions depend on the string's length.

The nodes are located where the amplitude is 0. Both ends of the<br />

string are therefore nodes. <strong>An</strong>tinodes occur in exactly the opposite<br />

places: Where the magnitude (absolute value) of the amplitude is locally<br />

maximal—i.e., the magnitude is greater than the magnitudes immediately to its left and right. These extreme regions are also called compressions (where maximal) and rarefactions (where minimal). Skipping

a jump rope creates one antinode (we only count an antinode once per<br />

extrema), two nodes, one compression, <strong>and</strong> one rarefaction.<br />

Reflection<br />

The property of reflection can be readily observed when loud sounds originate in rooms with hard surfaces. We experience reflection when sound in a room reverberates or echoes. Bats use the reflection of sound to aid their night vision through echolocation, calculating their distance to objects to a high level of precision by emitting a chirp and measuring the time that it takes for its reflection to be heard [5]. The variables in echolocation are: The speed of sound v, equal to 343 m/s (also written c_v, though the notation c in physics is typically reserved for the speed of light); the round-trip time that the sound takes to hit the object and reflect back; and the speed at which the observer (the bat) is traveling. Since sound travels quickly relative to the speed of the bat, this third variable is reasonably negligible, and it is unlikely that the bat takes note of it at all. So, if the sound takes 5 seconds to reflect back to the bat's ears, the object is (343 m/s) · (5 s)/2 = 857.5 meters away. We divide by 2 because the 5 seconds is the round-trip time, so it took 2.5 seconds for the sound to travel to the object.
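In code, the round-trip arithmetic is a one-line function; this sketch mirrors the 5-second example above.

    #include <stdio.h>

    /* Distance to a reflecting object, given the round-trip echo time in seconds.
       The division by 2 accounts for the out-and-back path. */
    double echo_distance(double speed_of_sound, double round_trip_seconds)
    {
        return speed_of_sound * round_trip_seconds / 2.0;
    }

    int main(void)
    {
        printf("%.1f m\n", echo_distance(343.0, 5.0));  /* prints 857.5 */
        return 0;
    }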

The human brain perceives sonic events that begin less than about<br />

one-tenth of a second (0.1 s) apart <strong>to</strong> be part of the same sound [3]. So,<br />

reflections of sound over short distances are perceived as a single signal<br />

because they happen within 0.1 seconds of one another. The minimum distance that a sound must travel for an echo to be perceived is therefore about (343 m/s) · (0.1 s)/2 = 17.15 meters, again dividing by 2 because the sound has to make a round trip. Sounds beginning

greater than 0.2 seconds apart are separated by the brain, <strong>and</strong> between<br />

0.1 <strong>and</strong> 0.2 seconds is an interval of confusion or roughness.



The myth that a duck’s quack does not echo was only recently<br />

debunked by the Acoustics Research Centre at the University of Salford<br />

in 2003 [6]. Their best guess as <strong>to</strong> why this was ever a myth is that<br />

quacks may be difficult <strong>to</strong> detect because they do not have a sharp<br />

attack like lightning or h<strong>and</strong>clapping, <strong>and</strong> furthermore, ducks are<br />

usually in water or the air, not in tunnels where echoes are often<br />

observed.<br />

Reflection can be a useful property of sound when recording in<br />

noisy environments. During sporting events like football <strong>and</strong> basketball<br />

games, you may observe a few people wearing headphones<br />

on the sidelines holding a large, clear, circular object with a microphone<br />

at the center. This is a circular parabola, designed much like<br />

a satellite dish. The microphone is placed at the parabola’s focus, a<br />

point through which all waves that hit the parabola reflect <strong>and</strong> travel.<br />

At the Explora<strong>to</strong>rium Museum in San Francisco, there are two large<br />

parabolas about eight feet in diameter. They are installed vertically so<br />

that museum visi<strong>to</strong>rs can sit inside them on seats strategically placed<br />

so one’s ears are very close <strong>to</strong> the focal point. The parabolas face each<br />

other, but are about 50 feet apart, making it seem irrational that soft<br />

sounds could be effectively transmitted over such a distance in the<br />

popular, noisy museum. Surprisingly, speech barely louder than a<br />

whisper can be clearly heard at the other end. The same idea applies<br />

<strong>to</strong> satellite dishes, but their foci extend far beyond the rim of the dish<br />

<strong>to</strong> compensate for the great distance <strong>to</strong> their signals’ sources in outer<br />

space.<br />

Refraction<br />

When sound travels from one region to a region with a different density or stiffness, refraction and dispersion occur [4]. In waveguide synthesis, these regions are called scattering junctions. The denser a region, the less room its closely spaced particles have to move around. Sound waves can become more excited in stiffer media due to improved elasticity [7]. Therefore, sound travels more quickly in stiff, light solids than

in liquids or gases. Measurements taken with a contact mic on the<br />

metal interior of a brass horn, for example, are much richer (more<br />

partials are articulated) than measurements taken from the air outside<br />

of the horn. The type of wood used in the body of a violin influences<br />

how much the violin amplifies its sound, <strong>and</strong> the propagation of<br />

sound in spruce (a common wood in violins) is twice as fast along<br />

the grain (3000 m/s) as it is across it (1500 m/s) [8]. Refraction is an<br />

especially important property <strong>to</strong> consider for submariners, architects,<br />

<strong>and</strong> materials scientists.<br />

The speed of sound depends on the bulk modulus B of a medium (a number representing its elasticity or stiffness) and on its density ρ:

c_v = √(B/ρ).

So, the speed of sound increases with stiffness and decreases with density. Table 2.1 lists the speed of sound inside various media [9]. The bulk moduli of the woods are given parallel to (along) the grain. All of these numbers are variable, and values are averaged where a range was given.



Medium                      B (×10⁹ N/m²)   ρ (kg/m³)   c_v

Dry air (20 ◦ C) 0.000142 1.21 343 m/s<br />

Water (25 ◦ C) 2.15 965 1493 m/s<br />

Salt water (25 ◦ C) 2.34 1022 1533 m/s<br />

Ebony 13.8 1200 3391 m/s<br />

White oak 11 770 3780 m/s<br />

Honduras mahogany 10.4 650 4000 m/s<br />

Indian rosewood 12.0 740 4027 m/s<br />

White ash 12.2 750 4033 m/s<br />

Engelmann spruce 9.0 550 4036 m/s<br />

Red maple 11.3 675 4092 m/s<br />

Black cherry 12.2 630 4401 m/s<br />

Steel 200 7820 5057 m/s<br />

Glass 70 2600 5189 m/s<br />

Brazilian rosewood 16.0 830 5217 m/s<br />

Diamond 442 3500 11238 m/s<br />

Table 2.1: The speed of sound c_v in common acoustic materials is given by c_v = √(B/ρ), where B is the stiffness and ρ is the density.

The bulk modulus B describes volumetric (three-dimensional) elasticity, while Young's modulus describes tensile, or linear (one-dimensional), elasticity. For the different types of wood, B is actually Young's modulus. Both are ratios of stress to strain; the bulk modulus in particular measures the resistance of a material to uniform compression, and so it represents the inverse of compressibility.
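A short C sketch can reproduce the c_v column of Table 2.1 from the B and ρ columns. The material list here is abbreviated to three rows; the values are the table's.

    #include <math.h>
    #include <stdio.h>

    struct material {
        const char *name;
        double B;    /* bulk (or Young's) modulus, in Pa */
        double rho;  /* density, in kg/m^3 */
    };

    int main(void)
    {
        /* A few rows from Table 2.1; B is converted from x10^9 N/m^2 to Pa. */
        struct material m[] = {
            { "Dry air (20 C)", 0.000142e9, 1.21   },
            { "Water (25 C)",   2.15e9,     965.0  },
            { "Steel",          200.0e9,    7820.0 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-16s %7.0f m/s\n", m[i].name, sqrt(m[i].B / m[i].rho));
        return 0;
    }

Running this reproduces the table's values of roughly 343, 1493, and 5057 m/s.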

Reverberation<br />

Reverberation refers <strong>to</strong> sound reflecting against walls, refracting in<strong>to</strong><br />

absorbent material, <strong>and</strong> dissipating in air after its origination. We talk<br />

about reverberation especially in room acoustics, where a recording<br />

studio should ideally have no reverberation, but a cathedral may have<br />

a lot of reverberation. Its graphical representation is different: Once



again, the horizontal axis is time <strong>and</strong> the vertical axis is amplitude, but<br />

this is not the signal itself. Instead, Figure 2.15 shows us the amplitude<br />

of events over time.<br />

Figure 2.15: Reverberation is typically generalized as three main events: The source<br />

signal, its early reflections (with thicker vertical lines), <strong>and</strong> its late reflections (the<br />

thinner vertical segments). The timing of the early reflections with respect to the source sound determines whether the source seems near to or far from the observer.

The first event is the original sound—the source signal. As the<br />

source sound propagates in the room, it bounces off each of the walls.<br />

The first time it does this <strong>and</strong> returns <strong>to</strong> the receiver is depicted in<br />

the early reflections event. In the above graph, there are six early<br />

reflections representing six walls or surfaces, which is typical in a<br />

rectangular room with four walls, a floor, <strong>and</strong> a ceiling. The later the<br />

reflection, the farther the surface is from the receiver. The weaker the reflection, the longer the sound has traveled and/or the more absorbent the surface material. The late reflections event depicts the later bounces off of these surfaces with gradually less energy.

The frequency response of a reverberant space like a room or musical<br />

instrument is calculated by exciting the space with all frequencies in<br />

its range at a constant pressure <strong>and</strong> transforming this recording with<br />

a Fourier transform <strong>to</strong> deduce its resonant frequencies, which appear<br />

as peaks in the frequency response. This can be done by exciting the<br />

instrument with a sine sweep (a pure <strong>to</strong>ne that oscillates from low <strong>to</strong>



high frequencies) <strong>and</strong> recording the instrument’s vibration. In room<br />

acoustics, the frequency response is typically calculated by playing a<br />

burst of white noise, because it has equal energy at all frequencies. The<br />

Fourier transform of white noise is perfectly flat, reflecting the equal<br />

power of the frequencies, <strong>and</strong> the Fourier transform of the recording<br />

of the white noise in a room will have bumps where the room is<br />

resonating or attenuating sound on a frequency basis. This burst of<br />

white noise is also called an impulse, <strong>and</strong> a space’s reaction <strong>to</strong> it is called<br />

an impulse response. Impulses can also be taken with other loud, brief,<br />

noisy things like fireworks, balloon pops, <strong>and</strong> h<strong>and</strong>clapping, though<br />

their frequency response is naturally more variable than that of white<br />

noise.<br />

Figure 2.16: The impulse response of a room is a function of time. The general shape<br />

of the amplitude envelope is decreasing, but there are peaks where the sound is<br />

reflecting off of surfaces in the room. The horizontal axis is in samples, not seconds,<br />

so this impulse response lasts less than half of a second.



Figure 2.17: The frequency response of the same room as in Figure 2.16. This is a

function of frequency. Rooms typically have resonances in the lower frequency range<br />

because of their larger dimensions compared <strong>to</strong> musical instruments.<br />

To calculate the frequency response, we simply take the Fourier<br />

transform of the impulse response, depicted in Figure 2.17.<br />
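The discrete Fourier transform itself is developed later in the book; purely as a sketch of the idea, the following C fragment computes the magnitude of a (slow, direct) DFT of an impulse-response array. The impulse response here is an invented decaying signal standing in for a measurement, and the array length and constants are my own choices.

    #include <math.h>
    #include <stdio.h>

    #define N 256
    static const double PI = 3.14159265358979;

    int main(void)
    {
        double h[N], mag[N];

        /* Stand-in impulse response: an exponentially decaying burst with a few
           periodic "reflections"; a real one would be recorded in the room. */
        for (int n = 0; n < N; n++)
            h[n] = exp(-n / 40.0) * ((n % 37 == 0) ? 1.0 : 0.2 * sin(0.7 * n));

        /* Direct DFT magnitude: |H[k]| = |sum_n h[n] * exp(-i 2 pi k n / N)|. */
        for (int k = 0; k < N; k++) {
            double re = 0.0, im = 0.0;
            for (int n = 0; n < N; n++) {
                re += h[n] * cos(2.0 * PI * k * n / N);
                im -= h[n] * sin(2.0 * PI * k * n / N);
            }
            mag[k] = sqrt(re * re + im * im);
        }

        /* Peaks in mag[] (up to k = N/2) mark resonant frequencies of the space. */
        for (int k = 0; k < 8; k++)
            printf("bin %d: %.3f\n", k, mag[k]);
        return 0;
    }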

Room acoustics is a constantly exp<strong>and</strong>ing field of research in the<br />

scientific study of sound. A st<strong>and</strong>ard way of measuring the reverberation<br />

of an acoustic space is by calculating the RT 60 , the time that a<br />

sound (typically wideband or narrowband noise) takes to decay by

60 decibels in that space. Architectural structures built for a musical<br />

purpose like audi<strong>to</strong>riums <strong>and</strong> studios use room acoustics <strong>to</strong> choose<br />

materials <strong>and</strong> dimensions that will best amplify or attenuate certain frequencies.<br />

The table and graph depicted in Figures 2.18 and 2.19 show the absorption of sound by various materials; notice that the harder the material, the less sound it absorbs.



Figure 2.18: The absorption coefficients of various materials for the frequencies 250<br />

Hz, 500 Hz, <strong>and</strong> 1000-2000 Hz. This table comes from Alex<strong>and</strong>er Wood’s The Physics<br />

of Music [10].<br />

Many of the results depicted in Figures 2.18 and 2.19 come from some of the first discoveries concerning room acoustics, made by Wallace Clement Sabine (1868-1919) of Harvard University, who found that

T = 0.161 V / (A S),

where T is the reverberation time, V is the volume of a reverberant space in cubic meters, A is the average absorption coefficient, and S

is the surface area of the material. A is strictly less than 1 because 100 percent absorptive material does not exist, but as a point of reference, one square meter of 100% absorptive material is called 1 sabin.

Figure 2.19: The absorption curves of various materials over the frequency range 64-4096 Hz [10].
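Sabine's formula is easy to evaluate numerically. The sketch below estimates the reverberation time of a single-material room; the room size and absorption coefficient are invented for the example.

    #include <stdio.h>

    /* Sabine reverberation time T = 0.161 * V / (A * S), with V in m^3,
       S in m^2, and A the (dimensionless) average absorption coefficient. */
    double sabine_rt(double volume, double avg_absorption, double surface_area)
    {
        return 0.161 * volume / (avg_absorption * surface_area);
    }

    int main(void)
    {
        /* Illustrative room: 8 m x 6 m x 3 m, total surface 180 m^2, A = 0.1. */
        printf("T = %.2f s\n", sabine_rt(8.0 * 6.0 * 3.0, 0.1, 180.0));
        return 0;
    }

With these assumed numbers the estimate is about 1.3 seconds, typical of a fairly live room.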

Interference<br />

Let us begin with a theorem from mathematics.<br />

Theorem: For any real numbers a, b ∈ R,

|a + b| ≤ |a| + |b|.

Proof: Let a, b ≥ 0. Then the result |a + b| = |a| + |b| is immediate. Likewise, if a, b ≤ 0, it is clear that |a + b| = |a| + |b|. Finally, let a and b have opposite signs. Then |a + b| < |a| + |b|. So, for all values of a, b in R, |a + b| ≤ |a| + |b|.



Constructive interference is only satisfied by the maximal case, |a + b| = |a| + |b|; otherwise destructive interference is occurring.

There are two types of interference in sound: Constructive and destructive. Constructive interference occurs when two waves, call them x₁(t) and x₂(t), interact such that

|x₁(t) + x₂(t)| = |x₁(t)| + |x₂(t)|.

In words, the magnitude of their sum is equal to the sum of the magnitudes of each wave. It can be shown that the signs of the two waves must be the same.

Destructive interference is the exact opposite of constructive interference. Destructive interference is such that

|x₁(t) + x₂(t)| < |x₁(t)| + |x₂(t)|.

For this to be true, the signs of the two waves must be opposite, as given in the above proof. Therefore, when these waves interact, they have a detrimental effect on the overall pressure of the air through which they propagate. Constructive or destructive interference can only occur when the waves intersect at the same location, whether at a single point or set of points, and at the same instant or same interval of time.

Cancelation is a result of completely destructive interference, wherein<br />

|x 1 (t)+x 2 (t)| =0. Consider two identical sinusoids, x 1 (t) =x 2 (t) =<br />

A sin(2πft + φ). Now imagine that you have two speakers facing each<br />

other, located exactly at an integer multiple of the wavelength of the

sinusoids (λ = v/f, remember) apart from one another, both connected<br />

<strong>to</strong> your CD player in stereo. Let one channel be x 1 (t) <strong>and</strong> the other<br />

be x 2 (t). When you press play, the waves travel from the speaker <strong>to</strong><br />

the opposite speaker at the same time. Because they are placed an<br />

integer multiple (4 times) of their wavelengths apart, the crests of one<br />

wave will occur exactly where the troughs of the second wave occur.<br />

These two waves are said to be completely out of phase with each other:



Their phases are different by π radians, or 180 ◦ . This is the most that<br />

two waves can be out of phase, even though a circle contains 360 ◦ .<br />

The waves become reflections of each other because they are exact<br />

opposites, flipped about the horizontal axis, <strong>and</strong> become in phase with<br />

each other when there is no difference (angle) between their respective<br />

phases.<br />

Figure 2.20: Two speakers exhibiting completely destructive interference. The superposition<br />

of their respective sound waves is shown by the dotted line. Since one wave

is moving <strong>to</strong> the left <strong>and</strong> the other <strong>to</strong> the right at the same speeds, the wave is 0 only<br />

2 times per period (much like a regular sine wave), but it does not travel—it st<strong>and</strong>s.<br />

Hence, the result is a st<strong>and</strong>ing wave, which causes acoustic feedback.<br />

So, their superposition is zero everywhere at certain instants—<br />

at their compressions <strong>and</strong> rarefactions, <strong>to</strong> be more precise. These<br />

two waves form what is called a st<strong>and</strong>ing wave. A st<strong>and</strong>ing wave<br />

occurs when two sine waves of equal frequency travel at the same<br />

velocity in opposite directions, so their velocities, v 1 <strong>and</strong> v 2 , sum <strong>to</strong><br />

zero where v 1 = −v 2 (i.e., the wave does not propagate—it st<strong>and</strong>s). This<br />

happens in musical instruments along a fixed string or in a column of<br />

air. A st<strong>and</strong>ing wave is perfectly stationary, but its amplitude changes<br />

periodically at the same frequency as the two waves.

St<strong>and</strong>ing waves cause acoustic feedback because they resonate in<br />

an acoustic space <strong>and</strong> thus have more sustain, causing a microphone<br />

<strong>and</strong> speaker <strong>to</strong> continuously receive <strong>and</strong> transmit them when recording.<br />

Say that a room is 27’ x 25’ x 12’. Then the frequencies with



wavelengths equal to 27 feet, 25 feet, or 12 feet (about 41.7 Hz, 45.0 Hz, and 93.8 Hz at 343 m/s) will stand in this room, and furthermore, integer multiples of

these frequencies will also cause feedback (though, <strong>to</strong> a lesser degree)<br />

because their reflections will mirror their propagations. This means<br />

that larger rooms will have very low-frequency resonances <strong>and</strong> small<br />

corridors (<strong>and</strong> musical instruments) will have higher frequency resonances,<br />

because a wavelength of 30 centimeters translates <strong>to</strong> about<br />

1143 Hz.<br />
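Following the rule of thumb above, that a frequency whose wavelength matches a room dimension (or an integer fraction of it) will stand, a small sketch can list candidate feedback frequencies for the 27' x 25' x 12' room. The foot-to-meter conversion and the loop limit are my own choices for illustration.

    #include <stdio.h>

    #define SPEED_OF_SOUND 343.0    /* m/s */
    #define FEET_TO_M      0.3048

    int main(void)
    {
        double dims_ft[3] = { 27.0, 25.0, 12.0 };

        for (int d = 0; d < 3; d++) {
            double L = dims_ft[d] * FEET_TO_M;
            printf("%2.0f ft dimension:", dims_ft[d]);
            /* Frequencies whose wavelengths are L, L/2, L/3, ... all stand. */
            for (int n = 1; n <= 3; n++)
                printf("  %6.1f Hz", n * SPEED_OF_SOUND / L);
            printf("\n");
        }
        return 0;
    }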

Resona<strong>to</strong>rs <strong>and</strong> noise-canceling headphones are effective in diminishing<br />

the power of undesirable frequencies. Resona<strong>to</strong>rs tuned <strong>to</strong> the<br />

undesired frequency will capture <strong>and</strong> reduce the frequency by creating<br />

a st<strong>and</strong>ing wave: The wave carrying the undesired frequency is<br />

attracted <strong>to</strong> the resona<strong>to</strong>r, <strong>and</strong> the resona<strong>to</strong>r absorbs <strong>and</strong> dissipates<br />

the wave by cancelation. Noise-canceling headphones detect noise via<br />

an exterior microphone near the ear. The noise is directed <strong>to</strong> an electric<br />

circuit that transforms it in<strong>to</strong> an antinoise signal—a signal exactly<br />

out of phase with the detected noise. This antinoise signal is played<br />

through the headphones <strong>to</strong> cancel the noise.<br />

The inverse square law<br />

All forms of radiation obey the inverse square law, which simply says<br />

that the farther you are away from a source of energy, the less intense<br />

the energy will be. In a uniform medium, a source propagates in all<br />

directions equally, so we model this motion in three dimensions as a<br />

sphere. The intensity I at a radius r from a sound source with original<br />

power P will be I =<br />

P , because the surface area of a sphere is given<br />

4πr 2<br />

by 4πr 2 .<br />
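As a quick numerical check of the inverse square law, the sketch below evaluates I = P/(4πr²) at a few distances; the 1-watt source is an arbitrary example.

    #include <stdio.h>

    static const double PI = 3.14159265358979;

    /* Intensity (W/m^2) at radius r from an omnidirectional source of power P. */
    double intensity(double P, double r)
    {
        return P / (4.0 * PI * r * r);
    }

    int main(void)
    {
        for (double r = 1.0; r <= 8.0; r *= 2.0)   /* each doubling quarters I */
            printf("r = %.0f m: I = %.4f W/m^2\n", r, intensity(1.0, r));
        return 0;
    }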

However, because we measure the intensity of sound with decibels<br />

(dB), which are a logarithmic unit, the inverse square law returns a<br />

different equation for sound waves than the one listed above. Intensity,<br />

as we will explore in more detail in Chapter 3, is proportional <strong>to</strong> the



Figure 2.21: Three-dimensional depiction of the inverse-square law<br />

square of sound pressure, P². Therefore, a source's intensity becomes proportional to I/r² at a distance r from it, and the pressure is then proportional to P/r, not P. 6 We say "proportional to" in mathematics when a ratio

exists between two quantities, but their relationship is not necessarily<br />

the same in every scenario, i.e., the ratio may fluctuate. The number of<br />

days that it rains in a year, for example, is proportional <strong>to</strong> the annual<br />

6 Note that only intensity <strong>and</strong> pressure diminish with distance, not frequency or<br />

wavelength. Red does not get any "less red" the farther we are away from it.



inches of rain accumulated in a year, but x-many days of rain does not<br />

necessarily mean y-many inches of rainfall.<br />

The brain treats the ears like two distinct microphones, not as a<br />

<strong>to</strong>tal or average [4]. It relies heavily upon the inverse square law <strong>to</strong><br />

detect the proximity of sources, while the distance between the ears<br />

<strong>and</strong> the physicality of the pinnae (the flaps of skin external <strong>to</strong> the skull)<br />

provide information about the sources’ directionality. The primary<br />

function of hearing, or of any sensation for that matter, is <strong>to</strong> alert the<br />

hearer of threat <strong>to</strong> its survival. The sensation of sound can quickly<br />

activate our adrenal gl<strong>and</strong>s, <strong>and</strong> can thereby serve <strong>to</strong> inform the proper<br />

fight or flight response. Music, <strong>to</strong>o, has the power <strong>to</strong> elicit very strong<br />

emotional reactions, including fear <strong>and</strong> anger.<br />

<strong>An</strong>other aspect of sound that requires a physical explanation is<br />

our ability <strong>to</strong> hear sounds from sources outside of the room. In my<br />

office, I can hear footsteps approaching from around the corner <strong>and</strong><br />

the eleva<strong>to</strong>r bell, even though the eleva<strong>to</strong>r is located far down the hall.<br />

But these sounds all seem <strong>to</strong> be coming from my doorway. This can be<br />

explained by Huygens’ principle, depicted in Figure 2.22.<br />

Huygens’ principle: Every point of a moving wave is also<br />

the center of a new source, each propagating a fresh set of<br />

waves in all directions.<br />

This is also known as diffraction, <strong>and</strong> explains why the sound from a<br />

loudspeaker can be heard at locations behind it, above it, <strong>and</strong> <strong>to</strong> its<br />

left <strong>and</strong> right. This is true of light as well, but because the wavelength<br />

of light is so short (about 390 <strong>to</strong> 750 nanometers) due <strong>to</strong> its very large<br />

frequency (400-790 Terahertz, where 1 Terahertz (THz) = 10¹² Hz), it

is not as easily perceived.<br />

The elasticity of air is what allows sound to propagate, so there is no sound in a vacuum (light, unlike sound, needs no medium). Other qualities of air and

the Earth’s atmosphere can have interesting effects on traveling waves.



Figure 2.22: Sounds originating on the other side of an open doorway will appear <strong>to</strong><br />

originate from the doorway itself, states Huygens’ Principle.<br />

The effects of temperature, humidity, velocity, <strong>and</strong> altitude<br />

Waves move differently depending on the media through which they<br />

travel, as stated in the discussion of refraction. Temperature, humidity,<br />

<strong>and</strong> altitude are all directly related <strong>to</strong> atmospheric pressure, which is<br />

itself a result of gravity. Sound waves vibrate easiest in high-pressure<br />

areas where there are fewer forces working against their energy. Since<br />

pressure decreases as elevation increases, sound waves tend <strong>to</strong>ward<br />

the ground. Sound actually travels faster in hotter air (see the velocity formula below), but outdoors on a hot day it can seem to fade more quickly: the warm air near the ground rises and refracts sound upward, carrying some of the sound waves' energy away from listeners at ground level.

Temperature’s effect on pitch is most noticeable in wind instruments,<br />

due to the expansion of their bores from heat. A flute, for

example, rises in pitch about 0.002 Hz for every 1 ◦ C (1.8 ◦ F) rise in<br />

temperature. The tuning of piano strings increases about 0.00001 Hz<br />

for each 1 ◦ C increase in temperature because hotter strings exp<strong>and</strong>.<br />

The velocity of a sound can be calculated as before, in the section<br />

on refraction, but also from the derivative of the pressure of a medium<br />

with respect <strong>to</strong> its density:<br />

v = √(∂p/∂ρ) = √(B/ρ),

where p is the pressure of a medium and ρ is once again its density. Therefore, the bulk modulus can be determined by

B = v²ρ = (∂p/∂ρ)ρ.

In 0% humidity (dry) air,

v = 331.3 √(1 + t/273.15) m/s

for temperature t in degrees Celsius.
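In C the dry-air formula reads as follows; this is only a sketch, and the sample temperatures are arbitrary.

    #include <math.h>
    #include <stdio.h>

    /* Speed of sound in dry air (m/s) at temperature t in degrees Celsius. */
    double speed_in_dry_air(double t)
    {
        return 331.3 * sqrt(1.0 + t / 273.15);
    }

    int main(void)
    {
        double temps[] = { 0.0, 20.0, 35.0 };
        for (int i = 0; i < 3; i++)
            printf("%5.1f C: %6.1f m/s\n", temps[i], speed_in_dry_air(temps[i]));
        return 0;
    }

At 20 °C this gives roughly 343 m/s, consistent with the figure used throughout the chapter.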

As mentioned in the discussion of refraction, a sound's propagation speed is dependent upon the medium through which the sound travels. A final way of working with these speeds is the Mach number, which expresses a speed as a multiple of the speed of sound in a given medium, so that Mach 1 is the speed of sound itself and a Mach number greater than 1 indicates supersonic travel. We can calculate the Mach number from pressure measurements with the equation

M = √( (2/(γ − 1)) [ (q_c/p + 1)^((γ−1)/γ) − 1 ] ),

where M is the Mach number, q_c is the impact pressure of the medium, p is the pressure of the medium, and γ is the ratio of specific heats (the adiabatic index introduced earlier). This equation comes from Bernoulli's principle in fluid dynamics.

Humidity has a small but detectable effect on sound propagation,<br />

due <strong>to</strong> the presence of lighter <strong>and</strong> more elastic water molecules in the<br />

air. As you may guess, the velocity of sound increases in humid air, up<br />

<strong>to</strong> 0.6%. Since the density of air is lower at higher altitudes than at sea<br />

level or below it, the speed of sound decreases as its altitude increases.<br />

The speed at which sound travels affects the sound's volume at a

given distance, <strong>and</strong> unsurprisingly, the faster sound travels, the better<br />

it maintains its original intensity. Wind will additively or subtractively<br />

affect the speed, working as you may suspect: When wind is blowing<br />

in the direction of the sound, it increases its velocity, <strong>and</strong> thus its<br />

loudness.



Finally, when an observer or sound source is moving, the Doppler<br />

effect causes the wavelength of sounds <strong>to</strong> change. As a source moves<br />

closer to an observer, each successive period of the sound wave is compressed, causing the period to get smaller and the frequency to get

larger. Conversely, waves moving away from an observer will have<br />

increasingly larger periods, causing frequency <strong>to</strong> decrease. Austrian<br />

physicist Christian Doppler witnessed <strong>and</strong> quantified this in 1842 with<br />

the mathematical formula<br />

f_o = ((c + v_o) / (c + v_s)) · f_s,

where f_o is the frequency heard by the observer, c is the speed of sound in the medium (343 m/s in air), v_o is the velocity of the observer, v_s is the velocity of the sound's source, and f_s is the frequency of the source. The observer's velocity v_o will be positive if the observer is moving towards the source, and v_s will be positive if the source is moving away from the observer.
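The Doppler formula, with the sign conventions just stated, translates directly into a short C sketch; the 440 Hz source and 20 m/s approach speed are example values of my own.

    #include <stdio.h>

    /* Observed frequency under the Doppler effect: v_observer > 0 when the
       observer moves toward the source, v_source > 0 when the source moves
       away from the observer. */
    double doppler(double f_source, double c, double v_observer, double v_source)
    {
        return f_source * (c + v_observer) / (c + v_source);
    }

    int main(void)
    {
        /* A 440 Hz source approaching a stationary observer at 20 m/s:
           under this convention the source velocity is -20 m/s. */
        printf("%.1f Hz\n", doppler(440.0, 343.0, 0.0, -20.0));  /* about 467 Hz */
        return 0;
    }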

2.6 Chapter summary<br />

This chapter investigated the four main properties of waves—amplitude,<br />

frequency, period, <strong>and</strong> phase—<strong>and</strong> their behavior in an ideal world <strong>and</strong><br />

in reality. Also distinguished were the terms signal <strong>and</strong> noise. These<br />

terms are important for underst<strong>and</strong>ing the compression of information,<br />

a <strong>to</strong>pic raised at the end of Chapter 5.<br />

Sound waves can be decomposed in<strong>to</strong> a sum of simple sine waves,<br />

says the principle of superposition. It is easy <strong>to</strong> realize the amplitude (A),<br />

frequency (f), period (1/f), <strong>and</strong> phase (φ) of a simple sine wave: It is<br />

of the form A sin(2πft + φ). When two waves are related in frequency<br />

by a small-number integer ratio, we say that the (musical) interval<br />

between them is harmonic. Amplitude explains the amount of pressure<br />

induced by a sound wave in a medium. Sound can only be heard when<br />

the pressure of a medium is varying, as changing pressure signifies



a disturbance in the eardrum. For example, when we go <strong>to</strong> places at<br />

high altitudes, there is a lower pressure, but no specific sounds are<br />

associated with this change.<br />

The phase of a wave in relation <strong>to</strong> the phase of another wave determines<br />

how two or more waves will interfere with each other when they<br />

interact in a medium. When the resultant wave is less in amplitude<br />

than the sum of the amplitudes of the original waves, it is interfering<br />

destructively. Otherwise, it is undergoing constructive interference.<br />

Waves traveling in exactly opposite directions <strong>and</strong> at identical frequencies<br />

create st<strong>and</strong>ing waves. For the same reason, st<strong>and</strong>ing waves also<br />

happen in musical instruments, on a fixed string, <strong>and</strong> in columns of<br />

air.<br />

Sound waves reflect off of objects, refract in different media, <strong>and</strong><br />

lose energy as they dissipate (the inverse square law). All of this is reverberation,<br />

which describes the behavior of sound after it originates from<br />

a source like a speaker or musical instrument. Every point through<br />

which a sound wave travels is also the source of a new set of waves,<br />

states Huygens’ principle, but this source is not thought of in the same<br />

way as a speaker or musical instrument.<br />

A hotter environment will increase the speed of the sound passing through it. Increasing the humidity or the stiffness of a medium increases the velocity of sound waves, while increasing its density decreases it. Sounds moving towards an

observer will have increasingly shorter wavelengths <strong>and</strong> thus higher<br />

frequencies, <strong>and</strong> the converse is true (the Doppler effect). Finally, sound<br />

waves are attracted <strong>to</strong> high-pressure areas where their energy will be<br />

most conserved, <strong>and</strong> since pressure decreases with altitude, all sound<br />

waves tend <strong>to</strong>wards the ground.


3. <strong>Musical</strong> sound<br />

What makes sound musical? You are already conscious of some of the

devices musicians use <strong>to</strong> transform sound in<strong>to</strong> music: Harmonizing<br />

with a melody, repeating sections like a chorus, changing the volume<br />

of a beat for emphasis or deemphasis, <strong>and</strong> using <strong>to</strong>ne-rich instruments<br />

such as violins. How can we scientifically describe these devices? Even

though all sound waves consist only of frequencies, amplitudes, <strong>and</strong><br />

phases, the way that their properties are organized in music can be<br />

unintuitive <strong>and</strong> even enigmatic. Indeed, the hard reality of digital<br />

music analysis <strong>and</strong> music information retrieval is that a computer still<br />

cannot classify <strong>and</strong> relate musical data as quickly <strong>and</strong> accurately as<br />

our ears can. A computer needs a whole song—not <strong>to</strong> mention a large<br />

database with which <strong>to</strong> compare it—<strong>to</strong> surmise high-level features<br />

about music such as genre, artist, time period, even time signature.<br />

Experienced listeners can do this in a few seconds. In this chapter, we<br />

will explore some of the wave behavior behind basic musical features.<br />

3.1 Rhythm<br />

The rhythm of a piece of music is a representation of its temporal<br />

structure. Within rhythm, we mainly talk about tempo <strong>and</strong> meter. The<br />

tempo describes the beat or pulse. It tells us how quickly or slowly

beats happen, <strong>and</strong> uses the unit of beats per minute. Therefore, tempo<br />

is directly tied <strong>to</strong> duration in seconds, <strong>and</strong> can be related <strong>to</strong> a frequency<br />

itself. For example, a song with a tempo of 120 bpm beats every 0.5<br />

seconds (2 beats per second), so it beats at 2 Hz. 1<br />

1 Of course, this is not the pitch of the beat. Our ears cannot perceive frequencies<br />

below about 20 Hz as pitched sound. To see this for yourself, take a quarter <strong>and</strong>



Much like a meter stick st<strong>and</strong>ardizes the size of a centimeter <strong>and</strong><br />

measures something’s length, meter (or metric structure) gives the st<strong>and</strong>ard<br />

size of the basic unit in a piece of music <strong>and</strong> describes the length<br />

of a musical measure. Some typical basic units are the eighth note,<br />

quarter note, <strong>and</strong> half note, <strong>and</strong> they are designated by the bot<strong>to</strong>m<br />

number in the following notation.<br />

Figure 3.1: Some examples of musical time signatures: 4-4 time means that a measure<br />

lasts the length of four quarter notes, where a quarter note is the basic unit of the<br />

rhythm; 3-4 time has measures of three quarter notes; 6-8 time has measures of six<br />

eighth notes; <strong>and</strong> 3-2 time means that the measures contain three half notes.<br />

If the bot<strong>to</strong>m number is 4, for example, the basic unit would be the<br />

quarter note. If it is 8, the unit would be the eighth note, <strong>and</strong> so on.<br />

The <strong>to</strong>p number simply says how many of these basic units are needed<br />

<strong>to</strong> fill one measure in a piece of music.<br />

Figure 3.2: Examples of some rhythmic music notation. In the time signature of 4-4,<br />

the quarter note is the unit note, i.e., the basic unit, <strong>and</strong> the first <strong>and</strong> third beats are<br />

usually accentuated more than the second or fourth beats.<br />

run your fingernail slowly against the ridges on its side, so that you hear a series of<br />

clicks. Then, do so more quickly, until you hear a gritty, pitched sound. We perceive<br />

frequencies below 20 Hz as individual events. More will be said about this when we<br />

discuss perceptual beats.



The meter indicates where the beats fall within a measure. Beats<br />

can be strong or weak, <strong>and</strong> typically a strong beat will fall on the<br />

first beat of a measure, <strong>and</strong> weak beats will fall on the second <strong>and</strong><br />

fourth. The third beat will be strong relative to the second and

fourth, but perhaps not as strong as the first beat, <strong>and</strong> not stronger<br />

unless syncopation is employed. We can see the difference between a<br />

strong <strong>and</strong> weak beat quite easily in a graph of a musical signal from a<br />

modern genre like dubstep or drum <strong>and</strong> bass: The power of the beat<br />

shows up as a peak in the amplitude.<br />

Figure 3.3: A signal that isn’t a simple sinusoid can still have periodic properties.<br />

This is a clip from an electronic (dance) piece of music with a highly defined rhythm.<br />

We can see 14 equally spaced beats in a span of 10 seconds, so the tempo would be<br />

somewhere around 84 bpm.<br />

Because music is a time-based art form, rhythm can tell us a lot<br />

about the way it is organized. We associate rhythm <strong>and</strong> complexity<br />

with each other <strong>to</strong> a high degree, because we base a lot of our expectations<br />

in music on its temporal structure. When music is polyrhythmic<br />

(contains many rhythms), syncopated (strong beats do not fall at the<br />

designated time), or even arhythmic (lacks rhythm al<strong>to</strong>gether), these<br />

expectations st<strong>and</strong> <strong>to</strong> be violated.<br />

3.2 Pitch<br />

Pitch is our perception of frequency. Often, frequencies will be present<br />

that we cannot perceive due to their low absolute loudness, their relative quietness compared to other frequencies, or other psychoacoustical reasons, such as masking. Human speech contains many frequencies with audible energy that we do not perceive as pitched sound because they are so complex. By contrast, most musical instruments are designed to

articulate pitches clearly <strong>and</strong> definitely.<br />

In music, pitch follows tuning st<strong>and</strong>ards <strong>to</strong> regulate instrument<br />

construction <strong>and</strong> eschew tedious debates that arise when playing with<br />

other musicians. Many tunings for the A above middle C were proposed<br />

before the A440 that we have now. <strong>An</strong> international conference<br />

was held in London in May 1939 that was perhaps the last international<br />

agreement of any kind made before the beginning of World War<br />

II. Though not officially st<strong>and</strong>ardized by the International Organization<br />

for St<strong>and</strong>ardization until 1955, A440 was adopted in nearly every<br />

country using Western temperament well before then [10].<br />

With pitch, we can define musical scales, intervals, chords, <strong>and</strong><br />

harmonies. Their construction is based on the consonant, small-integer<br />

ratios between frequencies <strong>and</strong> periods discussed in the previous chapter.<br />

3.3 Tuning <strong>and</strong> temperament<br />

In the musical systems of nearly every culture, musical scales are defined.<br />

These scales reflect the culture’s view of musical consonance<br />

<strong>and</strong> their compositional preferences, <strong>and</strong> make rules that st<strong>and</strong>ardize<br />

the tuning <strong>and</strong> build of their musical instruments. <strong>Musical</strong> systems<br />

are built around the musical intervals that perceptually present themselves,<br />

such as the highly consonant octave (a 2:1 frequency ratio) <strong>and</strong><br />

the highly dissonant tri<strong>to</strong>ne (a √ 2:1 frequency ratio).<br />

The Greek mathematician <strong>and</strong> philosopher Pythagoras was perhaps<br />

the first <strong>to</strong> propose a system of tuning for music. This system was<br />

only natural <strong>to</strong> the Greeks, who were obsessed with small-integer ratios<br />

<strong>and</strong> numerology. Much of their cultural aesthetics trace back <strong>to</strong> the<br />

number five: The five elements (earth, water, air, fire, and the universe)



were assigned <strong>to</strong> the five regular polyhedra, called the Pla<strong>to</strong>nic solids<br />

[2]. The Greeks believed so strongly that these geometric shapes

were related <strong>to</strong> the chemical structure of the quintessential elements<br />

that these solids were named "the a<strong>to</strong>ms of the universe" by Euclid.<br />

Pythagoras built his structure of musical temperament upon the circle<br />

of fifths, a sequence that moves upward by a perfect fifth 12 times<br />

<strong>and</strong> actually reaches all 12 notes of the scale in doing so. Movement<br />

through the first five notes of the circle of fifths (C-G-D-A-E) builds<br />

the penta<strong>to</strong>nic scale. 2<br />

As we will see, Pythagoras nearly hit it right on the head with<br />

respect <strong>to</strong> the scales of modern day, but there were problems with his<br />

method. Today, we use equal temperament <strong>to</strong> describe the relationships<br />

between pitches in music <strong>and</strong> from musical instruments. Equal temperament<br />

defines the relationships between any two pitches in the<br />

Western scale <strong>to</strong> be<br />

p₂ = 2^(k/12) · p₁,

where the pitch p₂ is k-many half steps above the pitch p₁. If p₂ is k-many half steps below p₁, then

p₂ = 2^(−k/12) · p₁,

i.e., k can be negative. Therefore, a pitch one octave higher than another pitch will be

p₂ = 2^(12/12) · p₁ = 2 · p₁,

and more generally, a pitch n-many octaves from another pitch will be

p₂ = 2^n · p₁,    n ∈ Z,

where Z is the set of positive and negative integers. When n = 0, i.e., when the second pitch is identical to the first pitch, then they are equal

2 The penta<strong>to</strong>nic scale is written C-D-E-G-A, in ascending pitch order.



Figure 3.4: The Circle of Fifths is shown here: Working clockwise from the center, we<br />

can build the circle by moving up a fifth twelve times until we reach the enharmonic<br />

equivalent of the original note (for C, a B♯). The radial lines show the enharmonic<br />

equivalents.<br />

because 2⁰ = 1. When n < 0, the second pitch lies below the first.

Figure 3.5: Two sinusoids related by an octave in frequency.

In Figure 3.5, we can see a harmonious relationship between sinusoids with frequencies in a 2:1 ratio.

The two waves in Figure 3.5 clearly have similar periodicities. They<br />

intersect with each other every other time that they cross the horizontal<br />

axis. Amazingly, their harmony can also be explained psychologically,<br />

a <strong>to</strong>pic of Chapter 5.<br />
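In code, the equal-temperament relation p₂ = 2^(k/12) · p₁ defined above is a single expression. The sketch below (the helper name is my own) tabulates one octave of half steps above A440.

    #include <math.h>
    #include <stdio.h>

    /* Frequency k equal-tempered half steps away from a reference pitch p1;
       k may be negative to descend. */
    double equal_tempered(double p1, int k)
    {
        return p1 * pow(2.0, k / 12.0);
    }

    int main(void)
    {
        for (int k = 0; k <= 12; k++)   /* one octave of half steps above A440 */
            printf("k = %2d: %8.2f Hz\n", k, equal_tempered(440.0, k));
        return 0;
    }

The k = 12 line prints 880 Hz, the octave, as the formula predicts.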

Equal temperament has advantages over all other temperings

of the 12-note Western scale because it enables transposition, allowing<br />

for songs <strong>to</strong> be played in multiple keys without retuning the instrument.<br />

To explain this, let us briefly explore the systems of Pythagorean<br />

temperament <strong>and</strong> just in<strong>to</strong>nation.<br />

Pythagorean<br />

The Greek mathematician Pythagoras constructed this system sometime<br />

in the 6th century BCE. A popular myth emerged that he did this<br />

immediately after hearing a pair of hammers pounding on an anvil<br />

make the interval of an octave when sounded <strong>to</strong>gether <strong>and</strong> realizing<br />

that their weights formed a 2:1 ratio. However, since the relative weights

between two objects do not imply a similar sonic relationship, the myth<br />

was debunked [77, 78]. Pythagoras may have been the first <strong>to</strong> discover



integer ratios with respect <strong>to</strong> string length using a monochord, a simple<br />

musical instrument similar <strong>to</strong> a guitar.<br />

Figure 3.6: The simplest design of a monochord is a resonating box with two bridges<br />

<strong>and</strong> a fixed string on <strong>to</strong>p. One of these bridges is fixed (shaded in dark gray) <strong>and</strong><br />

one slides, changing the effective length of the string. Suppose that the frequency of<br />

the open fixed string (no altering of the effective length) is 100 Hz. When the sliding<br />

bridge is relocated <strong>to</strong> half of the length of the string, the frequency doubles <strong>to</strong> 200<br />

Hz. At two-thirds the length, the frequency is 150 Hz, a perfect fifth above the open<br />

frequency.<br />

The main difference between a monochord <strong>and</strong> a guitar is the<br />

sliding bridge that can move up <strong>and</strong> down the fretboard <strong>to</strong> change<br />

the length of the string: The distance from the fixed end (depicted in dark

gray in Figure 3.6) <strong>to</strong> the sliding end (in dotted lines) determines the<br />

wavelength of the string’s fundamental frequency. Positioning the<br />

sliding bridge at one half of the string’s length produced a pitch one<br />

octave higher than the pitch produced from the open string (200 Hz<br />

versus 100 Hz), <strong>and</strong> moving the bridge <strong>to</strong> two-thirds of its length made<br />

a pitch a perfect fifth higher than the open pitch, i.e., 150 Hz.



The Greeks were huge fans of integer ratios found in nature, so<br />

Pythagoras’ discoveries were not taken lightly [2, 79]. The Pythagorean<br />

system of tuning is constructed entirely from the ratios of the octave<br />

<strong>and</strong> perfect fifth: Indeed, moving up <strong>and</strong> down these intervals in a<br />

particular order returns all 12 pitches of the Western scale.<br />

The frequency of the note C, call it f_C, need only be scaled by factors of 2 and 3/2 to attain all 12 pitches in one octave of the Pythagorean

scale. Temporarily ignoring our adjustments for octaves, the method<br />

is as follows: We move upwards 6 perfect fifths, ending at F♯, <strong>and</strong><br />

downwards 6 perfect fifths, ending at G♭—what should be the enharmonic<br />

equivalent of F♯. 3 However, F♯ <strong>and</strong> G♭ are only enharmonically<br />

equivalent in equal temperament. In Table 3.1, observe that we get two<br />

different values for the notes F♯ <strong>and</strong> G♭ from the Pythagorean method<br />

of tuning.<br />

These different ratios mean that this scale cannot be transposed to another key.⁴ Consider the ratio between two intervals: C and C♯ are in a 1.0535:1 ratio, but F and F♯ have a 1.0679:1 ratio between them. This might not seem like a significant difference in their actual values, but the difference is large enough for our ears to detect. It becomes very problematic to play intervals using F♯ as the root or fundamental pitch.

It can be said, however, that because all of the intervals can be described with integer ratios, Pythagorean temperament contains more consonance than equal temperament. Equal temperament is by design strictly irrational: 2^(k/12) is only a rational number when k is divisible by 12. But the ability to write music in different keys without retuning one's instrument (and possibly breaking it in the process) trumps the greater consonance of Pythagoras' system.

³ "Enharmonic equivalent" refers to two pitches that sound the same but function differently in transcribed music. In the key of D♭, for example, one does not write F♯ to represent the fourth note of its scale (G♭), even though F♯ is enharmonically a perfect fourth above D♭.

⁴ In order to change key, one needs to shift all of the notes or tones an equal number of half steps up or down, therefore retaining the relative pitches and intervals between the pitches but not their absolute pitches.


Note Name   Process               Ratio
C           (none)                1 f_C
G           ↑ P5                  (3/2) f_C
D           ↑ P5 2x, ↓ P8         (3/2)^2 (2)^-1 f_C = 9/8 f_C
A           ↑ P5 3x, ↓ P8         (3/2)^3 (2)^-1 f_C = 27/16 f_C
E           ↑ P5 4x, ↓ P8 2x      (3/2)^4 (2)^-2 f_C = 81/64 f_C
B           ↑ P5 5x, ↓ P8 2x      (3/2)^5 (2)^-2 f_C = 243/128 f_C
F♯          ↑ P5 6x, ↓ P8 3x      (3/2)^6 (2)^-3 f_C = 729/512 f_C
F           ↓ P5, ↑ P8            (3/2)^-1 (2) f_C = 4/3 f_C
B♭          ↓ P5 2x, ↑ P8 2x      (3/2)^-2 (2)^2 f_C = 16/9 f_C
E♭          ↓ P5 3x, ↑ P8 2x      (3/2)^-3 (2)^2 f_C = 32/27 f_C
A♭          ↓ P5 4x, ↑ P8 3x      (3/2)^-4 (2)^3 f_C = 128/81 f_C
D♭          ↓ P5 5x, ↑ P8 3x      (3/2)^-5 (2)^3 f_C = 256/243 f_C
G♭          ↓ P5 6x, ↑ P8 4x      (3/2)^-6 (2)^4 f_C = 1024/729 f_C

Table 3.1: Pythagorean tuning
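As a rough illustration of the method behind Table 3.1, the following C sketch (not from the book) stacks fifths of 3/2 upward and downward from C and folds each result back into a single octave; the note names are hard-coded for readability.

/* A small sketch (mine, not the book's code) of the Pythagorean method in
 * Table 3.1: stack perfect fifths (3/2) upward and downward from C, then
 * fold each result back into one octave by multiplying or dividing by 2. */
#include <stdio.h>

/* Fold a frequency ratio into the octave [1, 2). */
static double fold(double r) {
    while (r >= 2.0) r /= 2.0;
    while (r < 1.0)  r *= 2.0;
    return r;
}

int main(void) {
    const char *up[]   = { "C", "G", "D", "A", "E", "B", "F#" };
    const char *down[] = { "C", "F", "Bb", "Eb", "Ab", "Db", "Gb" };
    double r = 1.0;

    for (int k = 0; k <= 6; k++) {          /* upward fifths: C G D A E B F# */
        printf("%-2s  %.6f\n", up[k], fold(r));
        r *= 1.5;
    }
    r = 1.0;
    for (int k = 1; k <= 6; k++) {          /* downward fifths: F Bb Eb Ab Db Gb */
        r /= 1.5;
        printf("%-2s  %.6f\n", down[k], fold(r));
    }
    return 0;
}
/* F# comes out as 729/512 = 1.423828 while Gb is 1024/729 = 1.404664,
 * the two different values noted in Table 3.1. */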

Just intonation

Just intonation, also known as the just diatonic scale, is constructed on the principle that a major triad is in the ratio 4:5:6. A major triad is built with a major third (C to E, for example) and a perfect fifth (C to G). This scale was invented as a solution to the problematic Pythagorean scale, and it came before the equally tempered scale. Like the Pythagorean scale, intervals in just intonation are also related by whole-number ratios, the main difference being that in addition to the octave (2:1) and perfect fifth (3:2), the major third is in a 5:4 ratio to the frequency of the fundamental. Note once more that transposition to another key does not produce the original consonant intervals in the new key. We use the conventional abbreviations P8 to mean perfect octave, P5 for perfect fifth, and M3 for major third.


Note Name   Process                     Ratio
C           (none)                      f_C
C′          ↑ P8                        2 f_C
G           ↑ P5                        (3/2) f_C
F           ↓ P5, ↑ P8                  (3/2)^-1 (2) f_C = 4/3 f_C
A           ↓ P5, ↑ M3, ↑ P8            (3/2)^-1 (5/4)(2) f_C = 5/3 f_C
E           ↑ M3                        (5/4) f_C
E♭          ↑ P5, ↓ M3                  (3/2)(5/4)^-1 f_C = 6/5 f_C
A♭          ↓ M3, ↑ P8                  (5/4)^-1 (2) f_C = 8/5 f_C
D           ↑ P5 2x, ↓ P8               (3/2)^2 (2)^-1 f_C = 9/8 f_C
B           ↑ P5, ↑ M3                  (3/2)(5/4) f_C = 15/8 f_C
B♭          ↓ P5 2x, ↑ P8 2x            (3/2)^-2 (2)^2 f_C = 16/9 f_C
D♭          ↓ P5, ↓ M3, ↑ P8            (3/2)^-1 (5/4)^-1 (2) f_C = 16/15 f_C
F♯          ↑ P5 2x, ↑ M3, ↓ P8         (3/2)^2 (5/4)(2)^-1 f_C = 45/32 f_C
G♭          ↓ P5 2x, ↓ M3, ↑ P8 2x      (3/2)^-2 (5/4)^-1 (2)^2 f_C = 64/45 f_C

Table 3.2: The interval-wise derivation of just intonation, ordered by smallness of integer ratios. Note that the enharmonic equivalents F♯ and G♭ have different ratios: 45/32 = 1.40625 ≠ 64/45 ≈ 1.4222.

If consonance and dissonance really can be reduced to the smallness of the integer ratios between two notes, then Pythagorean tuning says that the M2 (a major second, D in the above tables) as well as the M6 (a major sixth, A) intervals are more consonant than a M3 (a major third, E). Apparently, the inventors of the just intonation tuning system disagreed.

Just intonation can be considered the most consonant scale of the three when played in C because of the smaller integer ratios: The ratio between the frequency of B and the frequency of C, for example, is 15:8 in just intonation versus 243:128 in Pythagorean tuning. But transposition is even further disabled, so its use was eclipsed by equal temperament.
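A small C sketch (my own; the 261.63 Hz middle C is an assumed reference, and the ratios are taken from Tables 3.1 and 3.2) compares a few scale degrees in the three systems discussed so far.

/* A rough comparison sketch of Pythagorean tuning, just intonation, and
 * equal temperament for the major third, sixth, and seventh above C. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double fC = 261.63;                    /* assumed reference C (middle C) */
    const char *names[] = { "E (M3)", "A (M6)", "B (M7)" };
    double pyth[] = { 81.0/64.0, 27.0/16.0, 243.0/128.0 };   /* Table 3.1 */
    double just[] = {  5.0/4.0,   5.0/3.0,   15.0/8.0  };    /* Table 3.2 */
    int semis[]   = { 4, 9, 11 };          /* semitone counts for equal temperament */

    printf("%-8s %12s %12s %12s\n", "note", "Pythagorean", "just", "equal");
    for (int i = 0; i < 3; i++) {
        double eq = fC * pow(2.0, semis[i] / 12.0);
        printf("%-8s %12.2f %12.2f %12.2f\n",
               names[i], fC * pyth[i], fC * just[i], eq);
    }
    return 0;
}
/* For B, 15/8 = 1.8750 (just) versus 243/128 = 1.8984 (Pythagorean),
 * the comparison made in the text. Compile with -lm. */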


Other tuning systems

In addition to equal, just, and Pythagorean temperaments, there exist many other systems. In the East, the Hindustani tuning system permits up to 72 divisions of the octave, with intervals like the quarter-tone (half of a semitone, or half step) [73]. For those composers and musicians who feel that 12 notes in an octave is simply not enough, there are microtonal tunings and instruments that use them. In microtonal systems such as those designed by Harry Partch, the octave is partitioned into more than 12 parts. Partch would frequently use a large prime number like 19 or 43. His book Genesis of a Music (published posthumously in 1979) gives an excellent account of microtonal systems.

Keys and scales

A musical scale is an ascending or descending sequence of notes which, in common practice, corresponds to the key of a piece of music [3]. The two most common examples in Western music are the major and minor scales. The major scale is often said to have a brighter, happier spirit than the minor scale, which has a darker, gloomier feel.

We name a key after the root of a scale, so if a scale begins on C and has a major intonation, it is called C Major. It is conventional to capitalize "Major" and leave "minor" uncapitalized for purposes of abbreviating the key "C Major" as "CM" and "C minor" as "Cm." The root is also called the tonic. In fact, every note in a scale has a name with respect to the tonic and a scale degree, in addition to the name of its note.

Because of their popularity, the major and minor scales will be the only ones considered in examples of feature extraction from music. The major scale was once called the Ionian mode, and the minor scale the Aeolian mode. Each of these is built upon a different scale degree of the diatonic scale, i.e., the natural notes C-D-E-F-G-A-B (not sharped or flattened). In order, the modes are: Ionian, Dorian (D-E-F-G-A-B-C-D), Phrygian (E-F-G-A-B-C-D-E), Lydian (F-G-A-...), Mixolydian (G-A-B-...), Aeolian (A-B-C-...), and Locrian (B-C-D-...).


Scale Degree   Name
1°             Tonic
2°             Supertonic
3°             Mediant
4°             Subdominant
5°             Dominant
6°             Submediant
7°             Leading tone
8°             Tonic

Table 3.3: Names for scale degrees

As we continue to relate the periodicity of superimposed sine waves to tuning and harmony, it might have occurred to you that a scale with the same properties as equal temperament could exist that contains a large amount of consonance, such as one using whole tones instead of semitones (six divisions of the octave), or even dividing the octave by a power of two into eight or 16 parts. Indeed, the first of these scales exists: Claude Debussy, among other composers of his era, made extensive use of the whole tone scale in his impressionistic music. The scale beginning on C would consist of the notes C-D-E-F♯-G♯-A♯-C. In fact, this scale could begin on any of those notes and contain exactly the same notes, and furthermore, there are only 2 unique whole tone scales (the other being C♯-D♯-F-G-A-B-C♯). This provides evidence that the scale lacks a tonal center, or root, and it is close to impossible in most circumstances to aurally establish the key of music written in whole tone scales.

John Pierce, Heinz Bohlen, and Kees van Prooijen developed their own hyper-consonant scale, the Bohlen–Pierce scale or Pierce 3579b scale, conceptually similar to just intonation [2]. The distances between intervals are given by strictly rational numbers, where both the numerator and denominator are odd. In fact, the note C′ in the key of C doesn't appear until we reach 3 times the frequency of C—i.e., we don't get an integer multiple of the tonic until an octave plus a fifth (a perfect twelfth) above it.⁵ The frequencies of the fourth, sixth, and ninth scale degrees above the fundamental note are in the ratios 5:3, 7:3, and 9:3 (3:1), and the ninth (C′), a twelfth above the fundamental (C), is the end of the scale.

Although the ratios of the whole tone and the Bohlen-Pierce scales have arguably greater consonance than equal temperament, the lack of instruments tuned to these scales makes their use rare.

⁵ I have seen this 3:1 interval called a tritave with reference to the Bohlen-Pierce scale, but I am not sure of the universality of this term.

Harmony

Briefly, when one adds a voice or multiple voices to a melody or phrase of music, one creates harmony. Intervals like the octave and perfect fifth are considered relatively harmonious when sounded together. Again, this is due to the relationship of their corresponding periodicities. Harmonic devices can provide the listener with a greater sense of expectation or suspense, particularly when compared to the melody. There are more rules defining the vocabulary and function of harmony than any other aspect of classical music composition [11]. Other genres utilize specific chord progressions that define their sound. Blues, for example, makes heavy use of the progression I-IV-V(7), and songs in other genres that use this progression automatically evoke the blues genre. Thus, harmony is a salient feature of music that speaks volumes about music as a language and can even place a piece of music chronologically and geographically [13].

3.4 Timbre

Timbre is the reason that a piano playing A440 sounds different from a violin playing A440. It is the quality of an audible tone. Nonmusical sound like speech has timbre, too: You can tell the difference between your friends' voices without seeing their faces. All sounds, whether from the acoustic instruments of the orchestra, an analog synthesizer, or the beep from your microwave, have a unique timbre. The more experienced a listener, the better his or her ability to distinguish between sounds and even musical instruments. In the examination of the frequency response of a large group of cellos, say, each cello will have a distinct timbre due to factors such as the materials used, the structural measurements, even the storage conditions—personalities not unlike humans'. The word timbre translates to "stamp" in French, so an accessible mnemonic is to think of timbre as the fingerprint or signature of a sound. The simple definition is "tone color." An instrument has a unique timbre, and it appears in the relative amplitudes of the frequencies activated by playing it. The Fourier transform, therefore, excels at determining the source of sounds, because its spectrum exposes the timbre.

Figure 3.7: The spectrum of a trumpet playing A3 (220 Hz). It is very rich in harmonics, and its loudest frequency is not its fundamental but rather its third partial, E5.


The graph in Figure 3.7 represents the spectrum of a signal. A spectrum is plotted on a frequency domain, whereas a signal is plotted on a time domain. Therefore, for a signal x(t) (a function of time), we could call its spectrum X(f) or X(ω), a function of frequency in cycles per second (more likely, in musical circumstances) or radians per second.

A spectrogram (or spectrograph) is another useful visualization of the frequencies contained in a signal, as shown in Figure 3.8. It incorporates time, in addition to frequency and amplitude. The domain of a spectrogram is now time, and the vertical axis is now frequency, not amplitude. The graph is darker where the amplitude of the frequency f at time t is closer to 1, and whiter where the amplitude is closer to 0. Much more will be said about the particular transform that computes the data behind a spectrogram and its mathematical definition later, but for now we have all of the tools to understand the basic meaning with our eyes. Both Figures 3.7 and 3.8 graphically convey the frequency components of a trumpet playing the pitch C5 (523.2 Hz—the fifth C on the piano).

When there appears to be a uniform, even distance between these spikes in the spectrum, we say that the overtone series is a harmonic series. Each spike represents a harmonic partial (or simply harmonic) of the overtone series of the fundamental, where the kth spike (counting left-to-right) is the kth harmonic.⁶ It is here that we uncover some compelling insight into the fact that there exist both physical and psychoacoustic foundations for harmony, in addition to an aesthetic one. As stated in Chapter 2, a major chord appears within the first four overtones of the fundamental frequency, and after that, a major-seventh chord. Experiments by Reinier Plomp show that humans can detect up to seven of the harmonic partials of complex tones, and musicians are better at this [14]. So, Western tonality is supported by the science behind it. With enough experience in reading spectra and spectrographs, your eyes in addition to your ears will be able to detect what kind of instrument is producing such sounds, and digitally, this information is in a matrix ready to be processed by your computer.

⁶ The harmonics are also called overtones. The kth harmonic will be the (k − 1)st overtone. Harmonic partials form a subset of the partials of a complex tone, where partials are all of the sine waves involved in a complex tone, and harmonic partials are those that can be calculated by integer multiplication of the fundamental frequency.


Figure 3.8: The spectrogram of the same audio signal used in Figure 3.7, that of a trumpet playing and holding the pitch C5. The steady horizontal lines imply that the frequencies and amplitudes are unchanging: Where black, the spectrogram shows which frequencies in the signal have the most power. The even spacing of these lines implies harmonicity: The frequencies are equally spaced, separated by a constant frequency, which is the fundamental frequency.


3.5 Chapter summary

This chapter examined the harmonic and periodic natures of rhythm, pitch, timbre, and temperament. Rhythm is temporally structured by meter, wherein a unit of time (like a quarter or eighth note) is specified as well as the quantity of them that can fit in one measure of music. Tuning systems help define how a culture organizes its alphabet of musical pitches into a harmonic vocabulary.

In general, the smaller the integers in the ratio between two frequencies, the more consonant their interval. Virtually every known tuning system contains the interval of an octave, which bears the ratio of 2:1 between its two frequencies (meaning the higher frequency is twice the frequency of the lower one). The ratio relating a perfect fifth, for example, in Pythagorean temperament and just intonation is exactly 3:2. However, because all songs are not written in the same key, small-integer ratios are not always ideal for instrument construction and playing, and are therefore approximated by some tuning systems such as equal temperament. Equal temperament uses the function f_2 = 2^(k/12) f_1 to calculate the frequency f_2 of a note k-many semitones (half steps) above a reference frequency f_1. If f_2 is below f_1, k is a negative integer. Therefore, in equal temperament, a perfect fifth is in the ratio of about 2.9966:2 instead of 3:2. This is close enough that our ears cannot detect a difference, and since it enables perfect transposition, equal temperament is the preferred Western tuning system, especially for instruments with a large frequency range and a high amount of tension in their build.
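The equal-temperament relation above is easy to compute; the following C sketch (mine, with A4 = 440 Hz as an assumed reference) evaluates the 2^(k/12) function and the tempered fifth of roughly 2.9966:2.

/* A minimal sketch of the equal-temperament relation f2 = 2^(k/12) * f1. */
#include <stdio.h>
#include <math.h>

/* Frequency of the note k semitones above (or below, for negative k) f1. */
static double equal_tempered(double f1, int k) {
    return f1 * pow(2.0, k / 12.0);
}

int main(void) {
    double a4 = 440.0;                              /* assumed reference pitch */
    printf("E5 (a fifth above A4): %.2f Hz\n", equal_tempered(a4, 7));
    printf("A3 (an octave below):  %.2f Hz\n", equal_tempered(a4, -12));
    printf("tempered fifth ratio:  %.4f (vs. 3/2 = 1.5)\n", pow(2.0, 7.0 / 12.0));
    return 0;
}
/* 2^(7/12) is about 1.4983, i.e., roughly 2.9966:2 rather than 3:2. Compile with -lm. */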

Timbre is unique to every acoustic instrument and references the mixture of frequencies and their amplitudes present in the instrument's tone. Every individual instrument is different, but classes and types of instruments behave similarly with respect to frequency and loudness of harmonic partials. This illuminates some of the need for the Fourier transform in music.


4. Musical instruments

Three common ways in which one may produce pitched sound are (1) striking a resonant cavity, such as a drum; (2) causing vibrations in the air in a tube, such as a horn or woodwind; and (3) plucking a string stretched across a resonant box, such as a violin. All of these involve the pressure of air in resonant bodies, and their features (dimensions, materials, and holes) completely specify the timbre of an instrument. Here I will explore the physical properties of some popular musical instruments: The piano, viols, winds, drums, and electric guitar. This is not a comprehensive list of instruments by any means, but paying attention to the individual spectra given for each instrument may aid the understanding of Fourier transforms of music, especially polyphonic music containing more than one instrument.

4.1 The piano

The piano produces sound by a hammer striking a string stretched over a soundboard, which amplifies the string's energy. It replaced the clavichord and harpsichord by combining their best qualities: The clavichord had the advantage of control over its volume, but did not get very loud, while the harpsichord lacked any sort of control over its dynamic range, but could produce loud sounds. Around the time of the piano's invention in the 18th century, use of the harpsichord and clavichord practically vanished.

The 1920s are considered the last great era of piano ownership; its end is attributed to the invention of the mass-produced automobile and the economic effects of the Great Depression. A piano in the household was a symbol of status, much like a car. A 1926 estimate claimed that half of city dwellers in America owned a piano. In 1927, 250,000 pianos were produced in America, whereas in 1932, just 25,000 were.

Some designs of pianos were modifications of the original, aimed at straying from the supremacy of the C major scale and the Western system of 12 tones. Pianos that employed different tonal systems virtually all divided the octave into more than 12 parts, so these were dubbed microtonal pianos. The first of these was invented in 1892, and quarter-tone pianos (the usual piano is semitonal, with keys separated by half steps) were actually somewhat popular in the 1920s. There was even a piano with 96 divisions of the octave and 97 keys in total, to span just one octave [71].

There are six main sections of the piano: (1) the metal frame necessary to support the large amount of tension imposed by the strings; (2) the soundboard and bridges; (3) the strings, usually made of steel; (4) the action, consisting of the keys, hammers, and levers; (5) the foot pedals; and (6) the wooden casing. The cheaper and more compact upright pianos have strings on a vertical plane, with a more complex action. Action is a fairly literal term for the region at which the sound is catalyzed, wherein the key triggers a series of levers, eventually triggering the hammer, which strikes the string. Altering the speed of the action manipulates the attack rate.

If all of the strings in a piano were identical in density, the piano would need to be unrealistically long to produce the lower end of its wide frequency range (27.5 to 4186 Hz). By thickening a string and relaxing some of its tension, one can increase its effective length and thereby lower its fundamental frequency. We calculate the fundamental frequency from the velocity of propagation along the string, v, but this is quite tricky to determine by analyzing the string directly. Since v = sqrt(T / (m/L)), where T is the tension in newtons¹ and m/L is the mass (in kilograms) of the string per unit length (one meter),² we can substitute for v and get

    f_0 = v / (2L) = sqrt(T / (m/L)) / (2L).

¹ One newton is equal to 1 kilogram-meter per squared second, kg·m/s².

Figure 4.1: The components of the piano. Image from [94].
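As a quick numerical check of the formula above, here is a C sketch; the tension, mass-per-length, and speaking-length values are illustrative guesses of mine, not measurements from the book.

/* A sketch of the string formula above, f0 = sqrt(T/(m/L)) / (2L). */
#include <stdio.h>
#include <math.h>

/* Fundamental frequency of an ideal fixed string:
 * tension T in newtons, mu = mass per unit length in kg/m, length L in meters. */
static double string_f0(double T, double mu, double L) {
    double v = sqrt(T / mu);     /* speed of the transverse wave on the string */
    return v / (2.0 * L);        /* the fundamental's wavelength is 2L */
}

int main(void) {
    /* roughly piano-like numbers (assumed): 700 N tension, ~4 g/m, 0.62 m speaking length */
    printf("f0 = %.1f Hz\n", string_f0(700.0, 0.004, 0.62));
    return 0;
}
/* Compile with -lm. Thickening the string (raising mu) or slackening it
 * (lowering T) lowers f0, as described in the text. */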

The force that acts on vibrating strings and other masses with tension to return them to their resting state is called the restoring force. The restoring force produces the overtone series of a string. J.W.S. Rayleigh observed that if the tension, and thus the restoring force, is relatively low in a string, then not all of the partials (especially higher ones) will be harmonic [12, 17]. Thus, the lower keys are less likely than higher keys to have exactly harmonic partials.

² In some cases you will see the mass per unit length (m/L) of a string written with the Greek letter mu, µ.


The frequency response of a piano varies according to the velocity of the player's fingers on the keys. The loudness of the fundamental and the overtone series is proportional to the amount of force used in striking its keys. Therefore, when the force is slow and soft, fewer overtones can be heard.

Figure 4.2: This graph depicts the fast Fourier transform of a Yamaha piano playing D4, approximately 311.1 Hz. The heights of the spikes indicate the relative amplitudes of the frequencies. Note the evenly spaced intervals between spikes: This means that the overtone series is harmonic.

The pedals of the piano manipulate the sustain of its sound, and the soundboard amplifies it, accelerating the sound waves as they travel through the dense wood and coupling the vibrations of the strings with the vibrations of the air. The soundboard has a fairly uniform response to all frequencies in its wide range [15].

The harmonics of piano strings, especially in the low-frequency range, are more widely spaced than simple whole-number ratios would suggest, due to the inharmonicity imposed by the thicker strings and their relatively relaxed tension. For this reason pianos are tuned with a (slightly) expanded octave called a stretched octave. This octave sounds out of tune with true integer-related harmonics (like those produced by infinitesimally thin and taut strings) but sounds in tune with strings that have unique restorative forces at work due to their different physical characteristics.

4.2 The viol family

In a way, the violin is truly one of the greatest enigmas in all of music. It is one of the few musical instruments whose paradigm was set early in its history, by Antonio Stradivari between 1680 and 1700 [8]. Today, Stradivarius violins from this time period can be worth hundreds of thousands or even millions of dollars. Though a sizable amount of the rationale behind this is their beautiful tone, Stradivarius violins have become so romanticized over the centuries that they are truly legendary.

Scientifically, Stradivarius violins have more evenly spaced resonant frequencies in the frequency response of their body than other violins, even when compared to others of very high quality like Guarneris. The frequency response is a graph showing frequency peaks that come from the resonance of the instrument, not the tones played on it. It can be found by playing all of the frequencies in the range of an instrument with equivalent input force and removing the harmonics that come from what is played.

The main resonances of interest in a violin are the main wood resonance (W) and the air resonance (A). The wood prime resonance (W′) is a result of the harmonics of the wood resonance, and is about an octave below W, sometimes sounding below the fundamental frequency of the played pitch. The main wood resonance is the resonance resulting solely from the wooden parts of the violin, and is typically around 440 Hz (A), the frequency of the second-highest (in pitch) open string. The air's resonant frequency is ideally a fifth below the main wood resonance (a 2:3 ratio), so it is approximately the frequency of the D-string [98]. The area of the f-holes directly corresponds to the air resonance of the violin: A larger f-hole area implies a higher resonant frequency, since the frequency of the air resonance rises with the area of the opening in the violin's body. These holes also act as band-pass filters, giving preference to the resonance of a small range of frequencies. The holes of woodwinds also have this property: Closing and opening them affects the pitch of the instrument in a localized way.

Figure 4.3: The frequency response of an average (poor) violin. Ideally, the resonant frequency of the cavity (A) is a fifth below the main wood resonance, and the wood prime resonance (W′) is an octave below the main wood resonance (W).

Figure 4.4: The ideal frequency response of a good violin. When these resonant pitches are played on the strings, their loudness will be enhanced by this curve.


Like any other instrument, each component of the violin contributes something to its tone. The type and quality of the materials is of the utmost importance because the violin is constructed of fewer individual parts than the piano. Modern violins have been tweaked from their original design to be fretless, heavier, and stiffer. Also, the original violin had a flat back and was called a viol, which is the name we give to the violin's family, consisting of the viola, cello, and bass [18].

Ernst Chladni was one of the first people to investigate the resonance of a violin. He invented a technique, published in 1787, that could show the resonance patterns of the front and back plates using sand. When the plate is excited with a bow, the sand drifts away from places of high vibration (the antinodes) and collects along the nodes, where there is little motion. These patterns are highly symmetrical and vary with frequency. Napoleon Bonaparte was so impressed with this research that he requested a personal demonstration of the technique and commissioned Chladni in 1809 to translate his publication into French for a handsome sum of 6000 francs (around 80,000 USD today) [10].

With these plates, Chladni showed why the body of a violin does not have a uniform response to all frequencies within its range. He was the first to map the resonance curve of the violin, thus turning the art of making instruments into a science.

The violin and cello have nearly identical construction to each other but differ in size, as do the viola and double bass. The open tunings of the strings within the viol family reflect their scaled size, but maintain an interval of a perfect fifth between neighboring strings (except for the double bass, with strings separated by perfect fourths). When the strings are set into motion, the sound is initiated. Because the string is fixed at both ends, the wavelengths of its modes of vibration are limited to integer divisions of twice the string length. At the nodes of a fixed string, one can produce harmonics by making light contact with the string at these points.

Figure 4.5: Several of the vibrational modes of a violin, as depicted by Chladni's plates. The plates show the response to the vertical and horizontal modes at their resonant frequencies. The white portion shows where the sand settled during the vibration, and the black shows where the sand fell off. The first mode is the circular mode, showing resonant modes parallel to both the vertical and horizontal axis. The second shows the resonant frequencies of the modes parallel to the horizontal axis. The third mode is the lateral mode, vibrating at the resonant frequency parallel to the diagonals of the violin's body. Finally, the fourth mode shows the resonance of the vertical mode.

Figure 4.6: The first four modes of a fixed string. Because a string is secured at both ends, its overtone series is defined according to its length, and the wavelengths of the frequencies it contains are restricted to integer divisions of twice that length, i.e., their wavelengths will be 2L, L, 2L/3, and so on. A non-integer ratio would mean that the string was loose at one end. The dots indicate the positions of nodes, where one may lightly press on a stringed instrument and produce harmonics.

The fingerboard, body, and sound post amplify the sound because the air molecules accelerate in the dense wood. The sound post is located inside the body, connecting the front plate to the back plate. Functionally, it is a fulcrum, and it has such an impact on the tone that the French call it l'âme, meaning the soul. Removal of the sound post gives the violin a timbre similar to that of a guitar, which lacks a sound post.

The sound resulting from a bowed violin string is much different from that of a plucked string. This is because the bowed string's waveform is not a sine wave, but closer to a sawtooth wave. The violin's timbre is one of the hardest to reproduce with synthesizers and software programs due to the complex harmonics of its unusual body and the behavior of the bowed string.

Figure 4.7: The pressure variations of a sawtooth wave. The sound has a similar timbre to that of scratching your fingernail along the edge of a quarter or other ribbed surface.
surface.<br />

Using a vibration microscope, Hermann von Helmholtz was the first to observe the jagged nature of the waveform depicted in Figure 4.7 [18]. The wave results from the friction of the bow sliding across the string. When the bow is taken off of the string, the energy of the string decreases, and the waveform becomes increasingly smooth until it dies away. Helmholtz also saw that, when a string is bowed, it exhibits an increased preference for exactly periodic behavior, meaning that the entire overtone series is better amplified. This is called mode-locking [8]. When a single pitch is played at a constant bowing speed, the modes of the fixed string are "locked" into integer multiples of the fundamental, regardless of the resonance of the violin's body. This also happens when air is blown at a constant rate into a wind instrument. The constant pressure locks the modes of vibration into the harmonic series, regardless of the natural resonant frequencies of the instrument.

Figure 4.8: The spectrogram of a solo, bowed violin from Shostakovich's Fifth Symphony. The bowed string produces a rich harmonic overtone series, depicted by the evenly spaced horizontal black lines as in Figure 2.7. The line wavers (between 1:54 and 1:55 and again at 1:56 and 1:57) when the musician plays with vibrato, increasing and decreasing the center frequency.

So, the modes of vibration of the piano and violin are somewhat similar because of the harmonic nature of the fixed string. However, the action of a piano gives the tone of its steel strings a much different amplitude envelope than the violin bow gives the violin's catgut strings. The friction of the bow on the string of a violin reshapes the resulting waves into the form of a harmonics-rich sawtooth wave. The harmonic series of sine waves contained in the timbre of sawtooth waves (i.e., the Fourier series of sawtooth waves) is given in Chapter 7.

Figure 4.9: The spectrogram of a plucked violin string, from Shostakovich's Fifth Symphony. The melody is the same as in the clip above of the bowed string, though shifted a little in time. The blackest regions show the onset and attack of the pluck, while in the spectrogram of the bowed string, the blackest regions corresponded to both the attack and the sustain. Not only do their attack envelopes differ: The harmonic overtone series of the plucked string is not as rich (i.e., there are not as many audible/visible overtones) as that of the bowed string.

4.3 Woodwinds and brasses

It is a bit easier to express the resonance of wind instruments because there are only three essential components to consider: The reed, the bore, and the side holes.
bore, <strong>and</strong> the side holes.


Figure 4.10: This graph depicts the fast Fourier transform of a viola playing C5, approximately 523.2 Hz. In the higher partials, the peaks are thicker, expressing the rich tone of the viola. But energy is really lacking at the third partial, which should be G6.

Figure 4.11: The parts of a simple woodwind instrument, the recorder.

The reed functions as a valve. The player's lungs, mouth, and lips interact with the rest of the instrument to produce a series of periodic puffs. The frequency of these puffs influences the fundamental pitch of the resulting sound. The intensity of the puff determines the amount of pressure in the bore of the instrument, and the changes in pressure, because they are periodic, cause the instrument to make pitched sound. Therefore, wind instruments resonate via pressure waves.³

³ This is not the case for the flute or piccolo. They vibrate by the periodic changes in the velocity of the stream of air played over the mouth hole, so they will often be an exception to the other wind instruments. However, this stream of air still acts as a valve, and is therefore still essentially a reed.


Figure 4.12: The basic parts of valveless cylindrical and conical brass horns.

Inside of a teapot, pressure increases with temperature. The air molecules inside move faster and change position more quickly. As these molecules rise upward to try to escape through the spout of the teapot, they bounce back and forth more and more rapidly as the temperature increases, and thus make the teapot whistle at an increasing pitch and intensity.

Analogously, the more that pressure is varied at the mouthpiece, the shorter the time between puffs becomes, i.e., the period T decreases and f increases. Woodwinds can have single reeds (saxophone, clarinet) or double reeds (oboe, bassoon). Brasses do not have a physical reed in their mouthpiece because the mouthpiece causes pressure variations in the player's lips. So, the lips act as the reed in brass instruments.

The bore is the shaft of a woodwind. Because it is greater in size than the stiff reed, it influences pitch to a much larger degree. The opposite is true of the brass instruments: Their reed (the lips, mouth, vocal cords, and lungs) is far more massive and strong than the bore, so the reed has greater control over pitch.


Figure 4.13: Pressure varies with temperature and density, both of which are changing in a boiling tea kettle. The air particles in the kettle are increasingly excited by the increasing pressure, and their motions become more rapid. Once the pressure becomes so high that the cavity of air is not large enough to support the expanded molecules inside of it, they head towards the exit, pressure still increasing. This creates a gliding, whistling noise. Its frequency reflects the motion of the molecules, as well as the temperature.

The length of the bore determines the fundamental, resonant frequency<br />

of a wind instrument, <strong>and</strong> consequently, its over<strong>to</strong>ne series.The<br />

length can be altered by opening <strong>and</strong> closing the side holes, so it is<br />

only when all are closed that the bore’s length is its physical length.<br />

The size <strong>and</strong> spacing of the holes in the bore also influence the bore’s<br />

effective length: Larger holes leak more air <strong>and</strong> the spacing determines<br />

the intervals between the pitches it can produce. These dimensions<br />

also have an effect on the timbre of the woodwind. 4 For brasses, the<br />

4 This is not the case for flutes, nor brasses: Their timbre is not nearly as sensitive<br />

<strong>to</strong> their side holes. Flutes, on the other h<strong>and</strong>, are constructed with nearly equally<br />

spaced <strong>and</strong> equally sized holes. Altering the fundamental frequency by changing the


Figure 4.14: The effective length of the bore in a wind instrument changes when the side holes are opened. The size of the holes influences the effective length. The shorter the effective length, the higher the frequency that is produced.

Woodwinds can have either closed-end (the mouthpiece is open and the end of the bore is closed) or open-end (both ends are open) bores. In open-end bores, the wavelength of the fundamental λ_0 is exactly equal to double the effective length of the bore, 2L, and the wavelengths of the overtones are 1/2, 1/3, 1/4, etc. times λ_0. Therefore, the fundamental frequency of open-end winds is given by f_0 = v/λ_0 = v/(2L), and the overtones are the full harmonic spectrum, f_0, 2f_0, 3f_0, 4f_0, and so on. The waves in a closed-end bore must have an antinode at the closed end of the bore, so the timbre skips the even-numbered partials (see Figure 4.15 below). Here, λ_0 = 4L, and the overtones have 1/3, 1/5, 1/7, etc. times the length of λ_0. The fundamental frequency of closed-end winds is f_0 = v/(4L), and its overtone spectrum consists of only the odd partials, f_0, 3f_0, 5f_0, and so on.
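The two bore formulas above can be illustrated with a short C sketch (mine; the 0.6 m effective length and the 343 m/s speed of sound in air are assumptions for illustration).

/* A small sketch of the open-end (f0 = v/2L) and closed-end (f0 = v/4L)
 * bore formulas, printing the first few partials of each overtone series. */
#include <stdio.h>

int main(void) {
    double v = 343.0;     /* speed of sound in air, m/s (room temperature) */
    double L = 0.6;       /* assumed effective bore length in meters */

    double f_open   = v / (2.0 * L);   /* open at both ends */
    double f_closed = v / (4.0 * L);   /* closed at the end opposite the mouthpiece */

    printf("open-end bore:   ");
    for (int n = 1; n <= 5; n++)                 /* full harmonic series */
        printf("%.0f ", n * f_open);
    printf("Hz\nclosed-end bore: ");
    for (int n = 1; n <= 9; n += 2)              /* odd partials only */
        printf("%.0f ", n * f_closed);
    printf("Hz\n");
    return 0;
}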

These nice, small-integer ratios between fundamental frequency and bore length relate to the nodes and antinodes of the pressure waves occurring inside of the bore. The pressure of the sound wave originates at the mouthpiece, so here we have a node. In an open-end bore, all of the nodes are contained within the column. At the open end, the wave is reflected back in an equally opposite way (a compression, versus a rarefaction), and so it forms a node. At the end of a closed bore, there is a node beyond the length of the air column, so there is an antinode.

Figure 4.15: The modes of vibration of three different types of columns of air: Those open at both ends, those closed at the end opposite the mouthpiece, and those with a horned end. These graphs depict the first four modes of vibration, where the curves depict the pressure at the location indicated on the horizontal axis.


Instrument     Type of wind instrument     Type of bore
Clarinet       Single-reed, woodwind       Closed, cylindrical
Saxophone      Single-reed, woodwind       Closed, conical
Flute          Flute, woodwind             Open, cylindrical
Piccolo        Flute, woodwind             Open, cylindrical
Oboe           Double-reed, woodwind       Open, conical
Bassoon        Double-reed, woodwind       Closed, conical
Recorder       Flute, woodwind             Closed, cylindrical
Cor anglais    Double-reed, woodwind       Closed, conical
Trombone       Brass                       Open, cylindrical
Cornet         Brass                       Open, conical
French horn    Brass                       Open, conical
Tuba           Brass                       Open, conical
Euphonium      Brass                       Open, conical
Trumpet        Brass                       Open, cylindrical

Table 4.1: The families of woodwind and brass instruments and the nature of their bores.

Nodes and antinodes are formed by the destructive and constructive interference of standing waves. In Figure 4.15, you can see that all of the bores have zero pressure, a node, and an amplitude of 0 at the mouthpiece (the left-most point on the graph). The pressure wave defines the locations of nodes (where amplitude is 0) and antinodes (where amplitude is extreme), so it follows that at the mouthpiece of all wind instruments there is a node. Since acting upon the mouthpiece initiates the pressure wave, the mouthpiece's behavior strongly influences the pressure within the bore.

The change in the movement of the traveling waves in a wind instrument can be modeled as 90° out of phase with the change in the pressure waves (again referring to Figure 4.15), wherein the variation of motion is greatest at the mouthpiece for all of the instruments while the change in pressure is zero. So, if the pressure is modeled by sin(ω) or −sin(ω), the rate of motion can be modeled by cosine waves, sin(ω + π/2) = cos(ω) and −sin(ω + π/2) = −cos(ω). Therefore, the pressure varies most when the motion varies the least, and vice versa. This defines input impedance: The ratio between the pressure amplitude set up in the mouthpiece and the excitatory flow that gives rise to it, in the words of Arthur Benade [19]. The standing wave inside of the bore periodically changes in amplitude as the pressure wave carrying the input impedance from the mouthpiece propagates towards the end of the bore, against the motion of the reflected wave traveling back toward the mouthpiece.

Now, each of the waves given in the first graph has the exact same frequency and wavelength (λ = 2L) because both are determined by the length of the bore. In the open-end column, the length is L, making the wavelength λ equal to 2L, and both the even- and odd-numbered modes have energy. In the closed cylinder, the length is L/2, so that λ is still equal to 4(L/2) = 2L, and only the odd-numbered modes are articulated. In the conical-bore column, all the modes are sounded, but their individual energies depend on the radius of the bore at a given location.⁵

⁵ Most brass instruments are conical with some percentage of cylindrical tubing, and all of them have a flared bell. For more on the bores of wind instruments, see Arthur H. Benade's Fundamentals of Musical Acoustics.

The bore of a wind instrument is either cylindrical or conical, and sometimes has a flared bell at its end (as in all of the brass instruments and the clarinet). The cross-sections of the bore at any point in both types can be determined by mathematical functions proportional to Bessel functions. The order-2 Bessel function closely resembles the shape of the flared horn, while the order-0 Bessel function describes cylindrical bores. Considering that the cylindrical bore has a constant cross-section, it is logical that it would be modeled by a constant function (i.e., a function of order 0). The cross-section of a conical bore increases linearly, so an order-1 Bessel function would suit its area. Finally, the cross-section of a flared bore is parabolic and is therefore modeled by quadratic functions (e.g., an order-2 function).

Bessel functions are a pretty advanced topic in mathematics, but their form is actually surprisingly similar to that of the continuous Fourier transform that we'll study in Chapter 7. The notation J_n(x) represents the nth Bessel function, and t is time:

    J_n(x) = (1/2π) ∫_{−π}^{π} e^{−i(nt − x sin t)} dt.

So the order 0 and order 2 functions are

    J_0(x) = (1/2π) ∫_{−π}^{π} e^{i x sin t} dt,
    J_2(x) = (1/2π) ∫_{−π}^{π} e^{i x sin t − i2t} dt.
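One way to evaluate the integral form above numerically is a simple trapezoidal sum; the following C sketch (mine, not the book's) integrates the real part of the integrand, cos(nt − x sin t), which equals J_n(x) because the imaginary part is an odd function of t and integrates to zero over [−π, π].

/* A sketch that evaluates the integral form of J_n(x) given above
 * with a trapezoidal sum over [-pi, pi]. */
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static double bessel_J(int n, double x) {
    const int N = 2000;                              /* number of integration steps */
    double sum = 0.0;
    for (int k = 0; k <= N; k++) {
        double t = -M_PI + 2.0 * M_PI * k / N;
        double w = (k == 0 || k == N) ? 0.5 : 1.0;   /* trapezoid end weights */
        sum += w * cos(n * t - x * sin(t));
    }
    return sum / N;   /* step size (2pi/N) times the 1/(2pi) prefactor is 1/N */
}

int main(void) {
    printf("J0(2.405) = %.6f (the first zero of J0 is near 2.405)\n", bessel_J(0, 2.405));
    printf("J2(1.0)   = %.6f\n", bessel_J(2, 1.0));
    return 0;
}
/* Compile with -lm. */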

Due to the inverse square law, the intensity of the pressure and motion decreases with larger surface area. The horn function U determines how much energy exits the horn and how much returns to the mouthpiece to produce standing waves. This is the degree of flare in the bell, calculated from r_int × r_ext in the horn equation:

    U ≈ 1 / (r_int × r_ext).

The variables r_int and r_ext are the interior and exterior radii at any given point on the horn, as shown in Figure 4.15.

Using the horn function at a corresponding point on the horn, we can calculate the acoustic wavelength λ:

    λ = v / sqrt(| f² − U · (v/(2π))² |).

The formula is similar to the usual λ = v/f, but here the denominator is the square root of the difference between the frequency squared (1/s²) and the horn function times the squared velocity (1/m² · m²/s² = 1/s²), so the wavelength varies based on the values of U and f. This is called the horn equation.
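Here is a short C sketch (mine; the radii and the 440 Hz frequency are made-up values purely for illustration) that plugs numbers into the horn equation as written above.

/* A sketch of the horn equation quoted above,
 * lambda = v / sqrt(|f^2 - U (v/2pi)^2|), with U = 1/(r_int * r_ext). */
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

int main(void) {
    double v = 343.0;                       /* speed of sound, m/s */
    double r_int = 0.02, r_ext = 0.06;      /* assumed radii at one point of the bell, m */
    double f = 440.0;                       /* assumed frequency, Hz */

    double U = 1.0 / (r_int * r_ext);                                /* horn function at that point */
    double lambda = v / sqrt(fabs(f * f - U * pow(v / (2.0 * M_PI), 2.0)));

    printf("U = %.1f 1/m^2, lambda = %.3f m (vs. v/f = %.3f m)\n", U, lambda, v / f);
    return 0;
}
/* Compile with -lm. The point is only how U shifts the wavelength away
 * from the simple v/f value. */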

Energy is lost <strong>to</strong> friction <strong>and</strong> heat proportional <strong>to</strong> frequency, so the<br />

high partials of higher frequencies have less energy than the high partials<br />

of lower frequencies. Impedance dominates harmonics in softly



played notes; playing more loudly increases the amount of energy first in the lower harmonics and then in the higher ones. Walter Worman quantified

this phenomenon <strong>and</strong> found the amplitude of the second harmonic<br />

grows so that doubling the strength of the fundamental quadruples<br />

the strength of the second harmonic [19]. Furthermore, the strength of<br />

the second harmonic is nearly proportional <strong>to</strong> the input impedance of<br />

the bore at this frequency. Likewise for the third <strong>and</strong> higher harmonics:<br />

The third harmonic grows eightfold for every doubling of the strength<br />

of the fundamental, the fourth grows sixteenfold for each doubling<br />

of f 0 , <strong>and</strong> so on. Thus, the nth harmonic grows as the nth power of<br />

the fundamental’s pressure amplitude. Otherwise, the response of<br />

wind instruments is fairly constant with respect <strong>to</strong> frequency, so their<br />

frequency spectrum merely shifts <strong>to</strong> the frequency of the fundamental.<br />

Worman’s observations of the input impedance only describe harmonics<br />

in the mouthpiece. The frequency spectrum of the sound<br />

exiting from a brass instrument is not nearly as refined or detailed<br />

as the measurements of the spectrum taken inside the mouthpiece<br />

and the bore. Because the bell leaks more energy at higher frequencies,

sound waves that exit the bell experience a "treble boost," wherein<br />

higher components are radiated in<strong>to</strong> the room. Furthermore, when<br />

a player places his or her h<strong>and</strong> in<strong>to</strong> the bell <strong>to</strong> modify its timbre, the<br />

over<strong>to</strong>ne series exp<strong>and</strong>s <strong>to</strong> articulate over<strong>to</strong>nes as high as 1500 Hz<br />

above the preexisting over<strong>to</strong>nes.<br />

The fast Fourier transforms of wind instruments in Figures 4.16–4.19

reveal their highly harmonic nature.



Figure 4.16: The FFT-derived frequency spectrum of an oboe playing A♭4 (415.3 Hz).<br />

This spectrum is particularly harmonic with nearly exactly 414.6 Hz separating each<br />

partial.<br />

Figure 4.17: The FFT-derived frequency spectrum of a bassoon playing E♭3 (155.6 Hz).<br />

The fundamental frequency is not the strongest; rather, the third harmonic is, much<br />

like the trumpet.<br />

Figure 4.18: The FFT-derived frequency spectrum of an alto saxophone playing E♭3 (155.6 Hz). Its third harmonic is also the strongest, like the trumpet and bassoon.

Figure 4.19: The FFT-derived frequency spectrum of an A-flute playing C5 (523.2 Hz).

Because pressure waves are invisible, it may be harder to believe that a column of air produces integer-ratio harmonics of the fundamental frequency than to believe that a fixed string produces harmonics. Wind instruments, amazingly, produce frequency spectra far more highly harmonic than those of viols or pianos. This is largely due to the simplicity of their design: There is no interaction of an external mechanism like a bow or hammer. It is just the reed and the bore that produce pitched sound in a wind instrument.



Mode    Relative freq.    No. of semitones above f_0
C_1     f_0               Unison
C_2     2.295 f_0         14.4
C_3     3.598 f_0         22.2
L_1     1.593 f_0         8.0
L_2     2.135 f_0         13.1
L_3     2.917 f_0         18.6

Table 4.2: The relative frequencies of the six simplest circular modes [20].
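The semitone column of Table 4.2 follows directly from the frequency ratios: the number of equal-tempered semitones between two frequencies is 12 log2 of their ratio. A quick check in MATLAB:

    % Convert the relative mode frequencies of Table 4.2 into semitones above f0.
    ratios    = [1, 2.295, 3.598, 1.593, 2.135, 2.917];   % C1 C2 C3 L1 L2 L3
    semitones = 12 * log2(ratios)   % approx. 0, 14.4, 22.2, 8.1, 13.1, 18.5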

4.4 Drums<br />

Drums resonate via excitation of a tense membrane in<strong>to</strong> the drum body.<br />

Because of drums’ highly complex frequency spectrum, they usually<br />

have ambiguous pitch. We describe their sounds with onoma<strong>to</strong>poetic<br />

words like "booms" or "taps." For this reason, we consider percussive sounds to be more complex than other instruments', but describing their vibrations isn't necessarily complicated.

A drum consists of three parts: (a) a drumhead, which is a stretched,<br />

tense membrane of some material; (b) a drum body, a hollow cavity;<br />

<strong>and</strong> (c) a means of affixing the drumhead <strong>to</strong> the body. The third<br />

element (c) can feature knobs that allow the tension of the drum <strong>to</strong> be<br />

tuned. These knobs can affect the quality of a drum’s sound by tuning,<br />

but otherwise they theoretically have no bearing on the timbre.<br />

The modes of vibration of the drum are determined by the membrane. As it is stretched and becomes thinner, the fundamental frequency and the magnitude of the attack rate rise. The six simplest modes of circular membranes are depicted in Figure 4.20. When the membrane is not circular, as with many drumheads from the 1980s, the timbre is even more complex: The modes are slightly different and more asymmetric, because the circle is the most symmetrical shape. The first three of these modes are circular modes, and the last three are lateral modes.



Figure 4.20: The six simplest modes on circular membranes: The highly symmetrical<br />

circular modes C 1, C 2, <strong>and</strong> C 3 only have nodes located at concentric circles on the<br />

membrane and at the membrane's outer edge. The lateral modes L 1, L 2, and L 3 have

both linear nodes crossing through the center of the membrane as well as circular<br />

nodes. The fixed outer edge is always a node. Images captured from [99].<br />

The circular modes above are completely symmetrical about the<br />

center of the drumhead, <strong>and</strong> the lateral modes are symmetrical about a<br />

diameter (or two) of the membrane. Nodes exist at points (or regions)<br />

where the membrane is stationary while the rest of the membrane<br />

is vibrating. The first circular mode, C 1 , produces the fundamental<br />

frequency of the drum <strong>and</strong> has greater energy than all of the other<br />

modes. It produces the "thumping" sound of the drum. Its node is<br />

located on the circular outer edge of the membrane, <strong>and</strong> it is a node<br />

for all the other modes because it is fixed.<br />

The lateral mode L 1 has a linear node extending the diameter of<br />

the membrane <strong>and</strong> a circular node identical <strong>to</strong> C 1 . It contributes the<br />

second most energy <strong>to</strong> the drum’s timbre with a frequency 1.593 times<br />

that of the fundamental frequency produced by C 1 . L 2 has diametrical<br />

nodes in an "X" pattern <strong>and</strong> the membrane’s perimeter, <strong>and</strong> it has the<br />

third greatest amount of energy. Its sound takes the longest time <strong>to</strong><br />

decay due <strong>to</strong> its poor sound radiating efficiency [91]. Finally, L 3 , C 2 ,<br />

<strong>and</strong> C 3 have circular nodes all centered at the center of the membrane,



but with different sizes (concentric circles). The circular nodes all add<br />

<strong>to</strong> the thumping sound of C 1 <strong>and</strong> have the shortest decay times.<br />

Just as the body of a violin interacts with its vibrating strings, the body of a drum serves to amplify and resonate the vibrations of its

membrane. Symbiotically, the body’s vibrations influence the vibrations<br />

of the drumhead because they are directly connected.<br />

Primarily, the resonant drum body shows preference for harmonic<br />

or close-<strong>to</strong>-harmonic partials over inharmonic ones. This doesn’t mean<br />

that the over<strong>to</strong>ne series is completely harmonic, but it does mean that<br />

the resonant body tends <strong>to</strong> amplify harmonic partials <strong>and</strong> attenuate<br />

inharmonic partials. This is true of all resonant bodies, <strong>and</strong> the nature<br />

of the source of excitation, whether by a fixed string, periodic pressure<br />

variations, or a mallet, determines the harmonicity of the frequency

spectrum.<br />

The location of the strike greatly influences the timbre of drums.<br />

Striking the drum is essentially like exciting it with an impulse. When<br />

the drum is struck at the location of one of the nodes, that mode will<br />

not produce sound, because nodes exist where the drumhead is static.<br />

Therefore, hitting the drum at the exact center will silence the circular<br />

modes, but hitting it just off-center will excite all the modes.



Figure 4.21: The frequency spectrum of a snare drum with chains engaged.<br />

Figure 4.22: The frequency spectrum of a <strong>to</strong>m drum.<br />

The drum's inharmonicity and prevalence in all types of musical styles tell us that musical instruments do not need a harmonic overtone series to sound pleasurable. The complexity of percussion instruments may allow us to maintain interest in percussion sounds repeated many times. Drums' relatively inharmonic nature allows for virtually any sort of harmonic and melodic material to be layered on top of them, without any perception of dissonance from the combination of frequencies.

Due <strong>to</strong> the relatively strong attack rate of drums <strong>and</strong> percussion<br />

instruments, it is fairly easy <strong>to</strong> detect their onset in sound files by looking<br />

at the signal and spectrogram. Drums' frequency response is quite

different from other instruments’ because of drums’ inharmonicity<br />

<strong>and</strong> fundamentals in the low-frequency range. In polyphonic music,<br />

the part of the Fourier transform output that corresponds <strong>to</strong> drums is<br />

one of the easiest <strong>to</strong> identify.<br />

4.5 Electric guitars <strong>and</strong> effects units<br />

Electric guitars are stringed instruments, <strong>and</strong> therefore their vibrational<br />

modes are those of the fixed string. But since their output signal is a<br />

varying voltage produced by alternating current (AC), it can be filtered<br />

in real-time by analog circuits found in guitar effects pedals.<br />

<strong>An</strong> electric guitar transmits the movements of its strings through<br />

magnetic pickups, which can be single-coil or humbuckers. These pickups<br />

act as "antennae" for the movement of the strings, inductively<br />

detecting electromagnetic radiation <strong>and</strong> producing a small electric<br />

current. Unless the guitar is powered by a battery, this is all done<br />

through passive electronics: Current is produced strictly by the laws<br />

of electromagnetic physics. Humbucking pickups eliminate the 60<br />

Hz <strong>to</strong>ne ("bucking the hum") produced by the frequency of AC from<br />

North American wall sockets 6 by generating two input signals: One<br />

180 ◦ out of phase with the other. When the difference of these signals is<br />

taken, the input signal is doubled because of their phase relationship.<br />

But the induced signal is canceled <strong>to</strong> zero, because it is induced nearly<br />

identically in both pickups. High quality microphone cables likewise<br />

have two wires, one 180 ◦ (or, any odd multiple of π radians) out of<br />

6 In Europe, this is 50 Hz.



phase with the other, to produce the same result. This is a common technique called a differential pair.
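As a minimal sketch of this differential idea in MATLAB (the frequencies and amplitudes below are illustrative assumptions): the string signal appears with opposite polarity in the two coils, while the hum is induced nearly identically in both, so taking the difference doubles the signal and cancels the hum.

    % Differential (humbucker-style) hum cancellation, sketched with sine waves.
    fs = 44100; t = (0:fs-1)/fs;      % one second of samples
    string = sin(2*pi*196*t);         % hypothetical string signal (196 Hz)
    hum    = 0.3*sin(2*pi*60*t);      % 60 Hz hum induced in both coils
    coil1  =  string + hum;           % first coil
    coil2  = -string + hum;           % second coil, reversed winding/polarity
    out    = coil1 - coil2;           % equals 2*string; the hum cancels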

This alternating current, multiplied by the resistance of the electronic<br />

circuits inside of the guitar, produces a voltage 7 that is passed<br />

on <strong>to</strong> an amplifier. This input signal gets added <strong>to</strong> any other signal<br />

present, like the electromagnetic radiation present in the air <strong>and</strong> the<br />

electronic signal powering the amplifier.<br />

Effects pedals are connected between the guitar <strong>and</strong> the amplifier <strong>to</strong><br />

color <strong>and</strong> shape the waveforms <strong>and</strong> timbre of an electric guitar. Guitar<br />

effects pedals <strong>and</strong> other electronic effects aren’t actually instruments<br />

per se: They do not have a resonant body, but they do have a resonant<br />

circuit board. Because of the overwhelming presence of the electric<br />

guitar in genres such as rock, metal, <strong>and</strong> pop, it seems reasonable <strong>to</strong><br />

include at least some graphs of their frequency response when altered<br />

by some common effects.<br />

Overdrive dis<strong>to</strong>rtion ("fuzz") is by <strong>and</strong> large the most popular<br />

effect. Dis<strong>to</strong>rtion clips the electrical signal, flattening (<strong>to</strong> some degree)<br />

the <strong>to</strong>p <strong>and</strong> bot<strong>to</strong>m of the wave. Aurally, the result is more textured,<br />

described by words like "gritty," "dirty," <strong>and</strong> "warm." Mathematically,<br />

clipping adds more overtones to a wave because the strong attack rate and flat top turn the wave into a square wave, which is described

by an infinite series of odd-numbered harmonics. Clipping confuses<br />

our ears <strong>and</strong> excites more frequency b<strong>and</strong>s than are actually present in<br />

a signal. Warm overdrive adds harmonic over<strong>to</strong>nes. The rougher or<br />

grittier the sound is, the more inharmonic the added over<strong>to</strong>nes are.<br />

7 Ohm’s Law states that voltage is equal <strong>to</strong> the product of current <strong>and</strong> resistance,<br />

V = IR. <strong>An</strong>alog circuits are at the core of classic signal processing <strong>and</strong> produce<br />

continuous signals. However, in this book, we focus primarily on discrete, digital<br />

signals like those from a file of music on your computer. See Appendix A for more<br />

about analog circuits <strong>and</strong> signal processing in the electrical engineering sense.



Figure 4.23: The effect of clipping is that of adding an infinite series of odd harmonics to a signal, approaching a square wave. Depicted here is an approximation for the first six odd harmonics, given by the function

x(t) = \frac{4}{\pi} \left[ \sin(\omega t) + \frac{1}{3}\sin(3\omega t) + \frac{1}{5}\sin(5\omega t) + \frac{1}{7}\sin(7\omega t) + \frac{1}{9}\sin(9\omega t) + \frac{1}{11}\sin(11\omega t) \right],

where our frequency is simply 1 Hz, so ω = 2π.
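The sum in the caption is easy to reproduce; a minimal MATLAB sketch (1 Hz fundamental, as in the caption, so the flattened, nearly square shape is visible over a couple of periods):

    % Six-odd-harmonic approximation of a square wave, as in Figure 4.23.
    t = linspace(0, 2, 2000);         % two periods of the 1 Hz fundamental
    w = 2*pi;                         % omega = 2*pi since f = 1 Hz
    x = (4/pi) * ( sin(w*t) + sin(3*w*t)/3 + sin(5*w*t)/5 ...
                 + sin(7*w*t)/7 + sin(9*w*t)/9 + sin(11*w*t)/11 );
    plot(t, x);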

There are more harmonics shown as peaks in the Fourier transform<br />

of the fuzzy electric guitar, which can be produced by amplifying a<br />

clean signal past the point of clipping. The more clipping, the "dirtier"<br />

the sound. The above dis<strong>to</strong>rtion was produced by amplifying the clean<br />

guitar’s signal by 15 dB in the free sound editing program Audacity.<br />

The dis<strong>to</strong>rtion effect also causes a kind of compression (explained<br />

in more detail in Chapter 6) <strong>and</strong> is often used <strong>to</strong> give the guitar long<br />

sustain, like that of a bowed cello.<br />

Ring modulation is achieved by multiplying a signal by a pure sine<br />

wave (the carrier frequency) and outputting the sum and difference

<strong>to</strong>nes. Ring modulation is named from the shape of its analog electronic<br />

circuit, which contains a "ring" of diodes. The effect is amplitude<br />

modulation, which for low modulation frequencies has the effect of<br />

tremolo.
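A minimal ring-modulation sketch in MATLAB (the input tone and carrier frequency are arbitrary choices): multiplying a 440 Hz tone by a 100 Hz carrier leaves energy at the difference and sum frequencies, 340 Hz and 540 Hz.

    % Ring modulation: multiply the input by a sinusoidal carrier.
    fs = 44100; t = (0:fs-1)/fs;      % one second of samples
    x  = sin(2*pi*440*t);             % hypothetical input: a 440 Hz tone
    fc = 100;                         % carrier frequency, Hz
    y  = x .* sin(2*pi*fc*t);         % spectrum of y peaks at 340 Hz and 540 Hz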



Figure 4.24: The FFT of a clean electric guitar.<br />

Figure 4.25: The FFT of a "fuzzy" electric<br />

guitar signal, playing the same chord as<br />

in Figure 4.24. The spectra are similar, but

the scale of the vertical axis differs.<br />

Figure 4.26: The same FFT as left, with<br />

the scale of the vertical axis identical <strong>to</strong><br />

the FFT of the clean signal <strong>to</strong> make the<br />

additional harmonics more visible.<br />

The spectrogram in Figure 4.28 of ring modulation shows us some<br />

interesting results: At first the harmonics of the original <strong>to</strong>ne are fairly



Figure 4.27: A ring-modulated audio signal, multiplied by a sine wave that increases<br />

in frequency from 0 Hz <strong>to</strong> 9 kHz. At the beginning the effect looks like tremolo in the<br />

spectrogram.<br />

Figure 4.28: The spectrogram of a ring-modulated signal. At about 1.5 seconds, the<br />

effect begins, <strong>and</strong> gradually increases its influence. The harmonics both symmetrically<br />

increase <strong>and</strong> decrease, with the increasing harmonics showing the "combination <strong>to</strong>nes"<br />

<strong>and</strong> the decreasing ones showing the "difference <strong>to</strong>nes." At its most influential at three<br />

<strong>to</strong> four seconds in, this particular example of ring modulation appears <strong>to</strong> diminish<br />

the presence of the harmonics in favor of lower frequencies.<br />

strong <strong>and</strong> defined. As time goes on, these harmonics (indicated by<br />

the dark parallel lines) become fuzzier <strong>and</strong> sweep downwards <strong>and</strong><br />

upwards, reflecting the influence of the carrier frequency.



A low frequency oscilla<strong>to</strong>r (LFO) produces sine waves below 20 Hz<br />

<strong>and</strong> can be used <strong>to</strong> control the pulse rate of other signals. Remember:<br />

Rhythm <strong>to</strong>o can be defined by a frequency, albeit a small one, like 2<br />

or 3 Hz. <strong>An</strong> LFO is useful in controlling the presence <strong>and</strong> intensity of<br />

effects. In Figure 4.30 an audio signal is passed through a vibra<strong>to</strong> effect<br />

pedal. The rate of the vibra<strong>to</strong> is controlled <strong>and</strong> changed (modulated) by<br />

an LFO, <strong>and</strong> goes from no vibra<strong>to</strong> (0 Hz) <strong>to</strong> 20 Hz of vibra<strong>to</strong>.<br />

The vibra<strong>to</strong> effect varies a frequency periodically. This is also called<br />

frequency modulation. A musician can create vibrato on a guitar by pushing and pulling its whammy bar to change the tension on the

string or moving one’s finger <strong>to</strong> the left <strong>and</strong> right <strong>to</strong> change its effective<br />

length. The vibra<strong>to</strong> a musician can produce is fairly limited, varying<br />

a center frequency by only a few hertz due <strong>to</strong> human limitations.<br />

Computers <strong>and</strong> synthesizers can perform more rapid vibra<strong>to</strong>, modulating<br />

the center frequency by greater rates. Frequency modulation<br />

(FM) synthesizers like the vintage Yamaha DX-7 synthesizer are able<br />

<strong>to</strong> generate a very wide range of timbral textures in music because of<br />

the phenomenon of sideb<strong>and</strong>s: When a 50 Hz sinusoid, for example, is<br />

modulated by 20 Hz, we see "sideb<strong>and</strong>s" in the frequency spectrum of<br />

the signal located at 30 Hz <strong>and</strong> 70 Hz with some of the energy of the<br />

50 Hz b<strong>and</strong> now distributed <strong>to</strong> these frequencies. Sideb<strong>and</strong>s appear<br />

when vibrato exceeds about 10 percent of the carrier frequency.

In vibra<strong>to</strong>, the frequency should be the only thing varying. But<br />

in reality, this is nearly impossible <strong>to</strong> do. Instead, vibra<strong>to</strong> almost<br />

always contains a detectable amount of tremolo, too. So, the depiction

of electronically produced vibra<strong>to</strong> varies only the frequency, but in<br />

reality, the amplitude is most likely being affected as well.<br />

For similar reasons, tremolo virtually always contains some vibra<strong>to</strong>.<br />

Tremolo is also called amplitude modulation (AM). It creates a "shaky"



Figure 4.29: The (theoretical) frequency spectra of three signals, from top to bottom: x_1(t) = 3 sin(100πt), x_2(t) = 3 sin[100πt + 20 sin(2πt)], and x_3(t) = 3 sin[100πt + 10 sin(2πt) + 20 sin(2πt)]. The first spectrum shows a peak only at 50 Hz with a magnitude of 3. The second has a smaller peak at 50 Hz with magnitude of 2 and two sidebands at 30 and 70 Hz with magnitudes of 0.5, the redistributed energy resulting from FM synthesis. The final spectrum shows an even further frequency-modulated signal with sine waves of both 10 and 20 Hz changing the original frequency of 50 Hz. The total energy still sums to 3.



Figure 4.30: The difference between a three-dimensional spectrogram and a two-dimensional one is that amplitude is conveyed in two ways: By color and by height.

This axis is in units of decibels (dB) where the greatest amplitude/height is 0 dB <strong>and</strong><br />

the others are all relatively less than that. This spectrogram shows a signal undergoing<br />

vibra<strong>to</strong> that is modulated by a low-frequency oscilla<strong>to</strong>r (LFO). You can see fluctuations<br />

in both frequency <strong>and</strong> amplitude.<br />

effect, where the amplitude oscillates between quiet <strong>and</strong> loud. The<br />

graph in Figure 4.32 depicts a 20 Hz sinusoid modulated by a 1 Hz cosine<br />

wave with an amplitude of 0.25, so the overall amplitude envelope varies periodically in magnitude from 0.5 to 1.
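The vibrato and tremolo signals plotted later in Figures 4.31 and 4.32 can be generated directly from their formulas; a short MATLAB sketch (the duration and plotting calls are just for illustration):

    % Vibrato (FM) and tremolo (AM) signals from Figures 4.31 and 4.32.
    t    = 0:1/44100:2;                           % two seconds of time
    vib  = sin(100*pi*t + 20*cos(2*pi*t));        % 50 Hz carrier with 20 Hz of vibrato
    trem = sin(40*pi*t) .* (cos(2*pi*t) + 3)/4;   % 20 Hz carrier with 1 Hz tremolo
    subplot(2,1,1); plot(t, vib);
    subplot(2,1,2); plot(t, trem);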

Phasers <strong>and</strong> flangers change the phase of an audio signal by passing<br />

it through a delay line containing a ladder of all-pass filters. 8 Both of<br />

8 All-pass filters are really what they sound like: They let all of the original frequencies<br />

of a signal "pass" through them, killing none of the frequency components <strong>and</strong>



Figure 4.31: This is the time signal x(t) = sin [100πt + 20 cos(2πt)], i.e., a 50 Hz sine<br />

wave undergoing 20 Hz of vibra<strong>to</strong>. The more densely spaced waves depict the 70 Hz<br />

regions <strong>and</strong> the more sparsely spaced areas depict the 30 Hz regions.<br />

Figure 4.32: The effect of tremolo is shown in the periodic changes in the amplitude<br />

envelope. Here, x(t) = sin(40πt) · [cos(2πt) + 3] /4.<br />

them produce a swirling, "galactic" effect like that heard in the first few seconds of The Beatles' "Back in the USSR," affecting not the fundamental frequency but rather the range and quality of its overtone series.

The all-pass filters used in a flanger are linearly spaced <strong>and</strong> so are their<br />

phase responses, <strong>and</strong> the result is that flange sounds harmonic. A phaser<br />

pedal passes signals through all-pass filters with more logarithmic<br />

phase responses. The over<strong>to</strong>nes are therefore not harmonic, but do<br />

tend <strong>to</strong> highlight specific notes because of the logarithmic nature of<br />

frequency. The response of each filter is simply added <strong>to</strong> the original<br />

signal, creating destructive <strong>and</strong> constructive interferences varying<br />

with frequency, <strong>to</strong> give the effect of changing phase. The more of these<br />

retaining their original magnitudes. However, filters are specified both by magnitude<br />

and phase responses, so the all-pass filter's "unchanging" nature does not necessarily apply to its phase response. See Appendix A for more on filtering.



filters, the more variation in the spectrogram <strong>and</strong> audible change in<br />

the signal.<br />

Figure 4.33: The above is a spectrogram of the phase effect. Phasing increases <strong>and</strong><br />

decreases the frequencies <strong>and</strong> harmonics of a signal. In this audio clip, a low-frequency<br />

oscilla<strong>to</strong>r modulates the phase downwards <strong>and</strong> upwards twice <strong>and</strong> then returns <strong>to</strong><br />

the steady signal. Note that the amplitude peaks (black parts) <strong>and</strong> notches (white<br />

parts) do not have a linear, even spacing.



Figure 4.34: This is the spectrogram of a flange effect. Flanging is a type of phasing<br />

wherein the phase responses of the all-pass filters are in a series with uniform spacing,<br />

achieving an harmonic series of constructive <strong>and</strong> destructive interferences with respect<br />

<strong>to</strong> frequency, as shown by the dark <strong>and</strong> light areas here.<br />

A flanger pedal is a type of phaser pedal. Flanging was originally<br />

an effect created by recording engineers using magnetic tape. They<br />

would play two tapes containing the same signal <strong>and</strong> introduce a<br />

small delay by pressing a finger against one of the tape reels where<br />

it wrapped around a flange (edge) on the machine. Hence, the term<br />

"flanging." A delay line creates a series of uniformly spaced filters<br />

through which the signal is passed. The filters’ responses are then<br />

added <strong>to</strong> the original signal creating the harmonic series of constructive<br />

<strong>and</strong> destructive interferences.<br />

So flanging is the special case of phasing in which the peaks <strong>and</strong><br />

notches of the over<strong>to</strong>ne series it produces are uniformly spaced <strong>and</strong><br />

harmonic, while with phasing, they can be nonlinearly spaced <strong>and</strong><br />

hence inharmonic.
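A minimal digital sketch of the tape-style version of this effect, assuming a single fixed delay rather than a ladder of all-pass filters: adding a slightly delayed copy of a signal to itself produces peaks and notches that are uniformly spaced in frequency, which is why flanging sounds harmonic.

    % Fixed-delay comb filter, the simplest digital analogue of tape flanging.
    fs = 44100;
    x  = randn(1, fs);                     % hypothetical input: one second of noise
    d  = 100;                              % delay in samples (about 2.3 ms)
    y  = x + [zeros(1, d), x(1:end-d)];    % original plus delayed copy
    % Peaks and notches occur every fs/d = 441 Hz across the spectrum of y.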



4.6 Chapter summary<br />

The second half of this chapter explored the resonance of four classes of musical

instruments: Pianos, viols, winds, <strong>and</strong> drums. The waveforms of<br />

several guitar effects pedals were also shown. <strong>An</strong> acoustic instrument<br />

requires two things <strong>to</strong> make <strong>and</strong> amplify sound: A resonant cavity like<br />

a box, <strong>and</strong> an activating, vibrating mechanism, like a fixed string or<br />

reed. Stiff objects with a high tension vibrate <strong>and</strong> amplify better than<br />

soft ones. Because pianos <strong>and</strong> violins use fixed strings, the wavelengths<br />

of their modes of vibration are integer divisions of the length of

the string. In winds, a harmonic spectrum results from the interaction<br />

between st<strong>and</strong>ing waves <strong>and</strong> a reed in a column of air. Drums do not<br />

have an harmonic spectrum, but some of their harmonics can be related<br />

by Bessel function ratios.<br />

Though all waves can be expressed as the sum of sine <strong>and</strong> cosine<br />

waves, most musical instruments <strong>and</strong> effects pedals produce shapes<br />

of waves that look <strong>and</strong> sound different from simple sine waves. The<br />

friction of the bow on the string produces saw<strong>to</strong>oth waves, characterized<br />

by a jagged waveform <strong>and</strong> high amount of attack. These waves have a<br />

dark, mysterious timbre, <strong>and</strong> many new composers of electronic music<br />

use them. Dis<strong>to</strong>rtion pedals produce square waves that are artificial<br />

in <strong>to</strong>ne, produced by clipping a smooth wave <strong>to</strong> make its <strong>to</strong>p <strong>and</strong><br />

bot<strong>to</strong>m flat. Saw<strong>to</strong>oth <strong>and</strong> square waves encountered in electronic<br />

music are usually produced by mathematical functions, so they are<br />

non-sinusoidal. However, these waveforms can still be expressed as<br />

an infinite sum of sine waves.<br />

It is useful <strong>to</strong> have a deep underst<strong>and</strong>ing of the Fourier representation<br />

of different timbres, because it makes Fourier analysis of<br />

polyphonic music a whole lot easier. In polyphonic music, two instruments<br />

will often play the same pitch or a harmonically related<br />

pitch such as an octave or perfect fifth above or below. Therefore,<br />

the frequency representation of their <strong>to</strong>tal signal will contain a lot of<br />

intersection, with peaks resulting from more than one instrument. The



only way <strong>to</strong> separate instruments with either your ear or computer is<br />

<strong>to</strong> be familiar with the shape of their timbres.<br />

The <strong>to</strong>ne of the flute contains one of the simplest over<strong>to</strong>ne series<br />

of any acoustic instrument. Its timbre is nearly pure, with most of<br />

the energy centered at the fundamental frequency. The other wind<br />

instruments (trumpet, trombone, <strong>and</strong> oboe) all have partials that are<br />

stronger than their fundamental frequency. The piano <strong>and</strong> violin have<br />

similarly shaped spectra, reflecting the modes of vibration of fixed<br />

strings. The drums have mostly inharmonic over<strong>to</strong>nes: The way they<br />

are played produces frequency b<strong>and</strong>s instead of single frequencies.<br />

A spectrogram conveys the frequency, time, <strong>and</strong> amplitude information<br />

of a musical signal in a powerful way. Below are two explained<br />

spectrograms. The Shostakovich piece contains violins, horns, <strong>and</strong> a<br />

flute in the segment shown, while the Beatles song has drums, bass<br />

guitar, electric guitar, acoustic guitar, <strong>and</strong> George Harrison’s voice.



Figure 4.35: This is the spectrogram of the first 20 seconds of Dmitri Shostakovich’s<br />

Symphony No. 5 in D minor, Op. 47: II. Allegretto. The piece begins with 12 seconds of

violins only. Note the relative amount of noise in the spectrum during this part. What<br />

we can see is mostly their onset, but their harmonics are represented by thicker, less<br />

powerful lines that are not as clearly defined, <strong>and</strong> there is a higher spread of power<br />

over the frequency range. Now, when the horns enter between 0:12 <strong>and</strong> 0:13, we<br />

see dark, horizontal lines located at their harmonics. These lines are closely spaced<br />

because the horns’ harmonics are exact integer multiples of the fundamental. There<br />

are multiple horns playing all at once but in different registers (pitch ranges). When<br />

the solo flute enters between 0:15 <strong>and</strong> 0:16, we see more distantly spaced harmonics<br />

from its fundamental frequency somewhere around 1200 Hz. Its harmonics extend<br />

quite far—12 of them above the fundamental—reflecting its clear <strong>to</strong>ne, <strong>and</strong> the fact<br />

that it is unaccompanied by any other flute. The remainder of the clip is largely the<br />

solo flute with some light accompaniment from the horns.



Figure 4.36: Above is a spectrogram from the first 14 seconds of The Beatles’ "I’m<br />

Happy Just <strong>to</strong> Dance with You" from A Hard Day’s Night (1964). The first <strong>and</strong> second<br />

measures are identical: Minor chords on the electric guitar <strong>and</strong> a cymbal-heavy drum<br />

line with snare rolls on the fourth beat. The vocals enter between 0:06 <strong>and</strong> 0:07.<br />

Writing them <strong>to</strong> better underst<strong>and</strong> the syllables of stronger emphasis, the lyrics are<br />

"before this DANCE is THROUGH, i think i’ll LOVE you TOO, i’m so HAPpy when<br />

you DANCE with ME." Using "S" as shorth<strong>and</strong> for a stressed syllable <strong>and</strong> "w" <strong>to</strong> mean<br />

a weak one, it goes w-w-w-S-w-S-w-w-w-S-w-S-w-w-S-w-w-w-S-w-S. Particularly<br />

for the first two strong syllables ("dance" <strong>and</strong> "through"), we can see especially dark<br />

markings in the spectrogram. The drums appear along the bot<strong>to</strong>m of the frequency<br />

range <strong>and</strong> have no clearly defined harmonics. The cymbals show up as faint vertical<br />

lines through the entire range of frequencies.


5. Audi<strong>to</strong>ry perception<br />

Both the ear <strong>and</strong> our brain’s perception of its movements are far<br />

broader <strong>to</strong>pics than we have space <strong>to</strong> cover here, but the aspects of<br />

hearing most relevant <strong>to</strong> the techniques that we use <strong>to</strong> process <strong>and</strong> digitize<br />

sound are fairly limited. The two things necessary <strong>to</strong> take away<br />

from this chapter before moving on <strong>to</strong> Chapter 6 are the concept of<br />

masking <strong>and</strong> the logarithmic nature of loudness <strong>and</strong> pitch. The rest consists<br />

of some interesting <strong>and</strong> fundamental facts about our physiology<br />

<strong>and</strong> perception.<br />

We have established that the amplitude of sound waves flowing<br />

through air is a measure of pressure. It is essential that air pressure changes in order for the ear to relay sonic information to the brain, and furthermore, sound must alternate between its minimum and maximum

values at least 20 times per second (20 Hz) <strong>to</strong> be considered pitched.<br />

In fact, if the ear picked up frequencies any lower than 20 Hz, the<br />

incredibly loud thermal noise of the world would be audible [4].<br />

5.1 Physiology of the ear<br />

We divide the ear in<strong>to</strong> three main sections: The outer ear, the middle<br />

ear, <strong>and</strong> the inner ear. The outer ear acts as a receiver for sound. The<br />

eardrum connects the outer <strong>and</strong> middle parts of the ear. The pressures<br />

on either side of the eardrum are compared <strong>to</strong> each other, <strong>and</strong> the<br />

bones <strong>and</strong> muscles in the middle ear transmit the effective difference<br />

<strong>to</strong> the inner ear. The inner ear is attached <strong>to</strong> the audi<strong>to</strong>ry nerve, which<br />

passes on the sonic information <strong>to</strong> the brain.<br />

The leading theory for the physiological response <strong>to</strong> an audi<strong>to</strong>ry<br />

stimulus is place theory, which states that different frequencies stimulate



different places along a basilar membrane that is organized much like<br />

a logarithmic frequency domain [80]. This means that the basilar<br />

membrane functions like a Fourier device.<br />

Figure 5.1: A diagram of the outer, middle, <strong>and</strong> inner parts of the ear.<br />

The outer ear consists of only three parts: The pinna (the skin and cartilage protruding from the head), the meatus or auditory canal,

<strong>and</strong> the tympanum, also known as the eardrum. The tympanum vibrates<br />

when disturbed by pressure fluctuations in the surrounding medium,<br />

thereby changing the pressure inside the chamber of the middle ear <strong>to</strong><br />

equal the exterior pressure.<br />

The amount of pressure is constrained by the impedance or resistance<br />

of a medium. The more impedance, the less a signal can get through.<br />

This is similar to a current through a resistor in a circuit or a car in traffic. Impedance matching refers to the equalization of the middle ear's pressure to the outer ear's. The ossicles inside the middle ear are three bones (the hammer, anvil, and stirrup) that perform this matching and send a signal to the inner ear; the middle ear also houses the eustachian tube. The



stapedius muscle is attached to the ossicles and is involuntarily flexed (a response called the stapedius reflex, or acoustic reflex) when an ongoing acoustic stimulus

is louder than about 90 dB. Furthermore, the tensor tympani muscle<br />

connected <strong>to</strong> the eardrum flexes during loud sensations <strong>to</strong> tighten the<br />

eardrum <strong>and</strong> increase its impedance. In this way, the middle <strong>and</strong> inner<br />

ears also protect our hearing.<br />

However, when an excessively loud sound persists, these muscles<br />

grow weary. <strong>An</strong> interesting phenomenon known as temporary threshold<br />

shifting works <strong>to</strong> prevent permanent hearing damage by shifting the<br />

ears’ dynamic range (the threshold of hearing <strong>to</strong> the limit of hearing)<br />

higher for a limited amount of time. When this time has run out <strong>and</strong><br />

the loud sound continues, permanent threshold shifting occurs, which<br />

results in hearing loss. Soft sounds will then be inaudible because the<br />

threshold levels have shifted upwards.<br />

The inner ear is the most complicated region, <strong>and</strong> consists of a<br />

coiled cavity called the cochlea. The stirrup ossicle is attached <strong>to</strong> the<br />

oval window entry <strong>to</strong> the cochlea, whose other end is the apex. This<br />

window is the opening <strong>to</strong> one of the two tubes in the cochlea: The scala<br />

vestibuli, which is filled with a fluid called perilymph. The other tube<br />

is called the scala tympani, connected <strong>to</strong> the middle ear by the round<br />

window <strong>and</strong> also filled with perilymph. The perilymph vibrates <strong>and</strong><br />

stimulates the scala media, a tube that separates the scala vestibuli <strong>and</strong><br />

scala tympani. It is filled with endolymph, a fluid with a complementary<br />

ionic composition <strong>to</strong> perilymph such that their interaction generates<br />

electrochemical impulses that are sent on <strong>to</strong> the brain.



Figure 5.2: The cross-section of the cochlea. Labeled are the places along the basilar<br />

membrane corresponding <strong>to</strong> the frequency regions they detect.<br />

Reissner’s membrane is an impermeable membrane on the scala<br />

vestibuli side of the scala media. Underneath Reissner’s membrane is<br />

the tec<strong>to</strong>rial membrane. The surface of the basilar membrane, named as<br />

such because it is considered <strong>to</strong> function as the "base" of our perception<br />

of sound, is covered in rows of hair cells (cilia) with one row on the<br />

inside <strong>and</strong> three rows on the outside. The inner hair cells of the basilar<br />

membrane are triggered by the motion of the tec<strong>to</strong>rial membrane, <strong>and</strong><br />

these send on phase, frequency, <strong>and</strong> amplitude information <strong>to</strong> the<br />

brain.<br />

The basic order of events in the mechanism of hearing is as follows:<br />

A stimulus causes pressure changes in the outer ear which changes<br />

the pressure inside of the middle ear. This makes the stirrup ossicle<br />

move in <strong>and</strong> out of the scala vestibuli, causing fluctuations in the<br />

volume of fluid inside of it. This leads <strong>to</strong> vertical displacement in<br />

the basilar membrane <strong>and</strong> longitudinal waves in its surrounding fluid,<br />

whose cumulative motion creates a similar surface wave along the<br />

basilar membrane moving from the stiff end of the cochlea (the base) <strong>to</strong><br />

the apical end (the apex). When this motion is great enough <strong>to</strong> trigger



the hair cells, information is sent on <strong>to</strong> the audi<strong>to</strong>ry nerve which is<br />

connected <strong>to</strong> each hair cell.<br />

Figure 5.3: The scala vestibuli, scala tympani, <strong>and</strong> scala media all contain fluid. This is<br />

a hydrodynamic surface wave, so gravity is the res<strong>to</strong>ring force upon this fluid. When<br />

excited by sound, the fluid propagates as shown by the arrows [81].<br />

The connection between our audi<strong>to</strong>ry perception <strong>and</strong> our physiology,<br />

along with the nature of the information that is sent <strong>to</strong> the<br />

auditory nerve, lies in the inner ear. Because f = 1/T, i.e., frequency

depends on time <strong>and</strong> vice versa, it is unknown <strong>to</strong> a degree how exactly<br />

the pulsating action within the ear is interpreted. Fortunately, sound is<br />

limited <strong>to</strong> basically three things: Frequency (periodicity), amplitude,<br />

<strong>and</strong> phase. So our perception of pitch, loudness, <strong>and</strong> phase is tied <strong>to</strong><br />

their physical manifestation in our hearing mechanism.<br />

The nature of the basilar membrane’s reaction <strong>to</strong> sound is highly<br />

analogous <strong>to</strong> the actual sound wave. Small cameras placed in the<br />

cochlea through video microscopy reveal that the hair cells along<br />

the basilar membrane are excited at locations corresponding <strong>to</strong> the<br />

frequency of the excitation. For sounds above the threshold of hearing,<br />

hairs along the apical end of the basilar membrane are excited by low<br />

frequencies, while those at the basal end respond to high frequencies.

Therefore, place theory describes the relationship between frequency<br />

<strong>and</strong> the placement of cilia on the basilar membrane as a <strong>to</strong>no<strong>to</strong>pic<br />

mapping—the mapping of frequency (<strong>to</strong>ne) <strong>to</strong> place. Furthermore,



these hairs act as b<strong>and</strong>-pass filters, selecting a small range of frequencies<br />

much like the holes of a wind instrument. When a single hair is excited<br />

by a single frequency, it also <strong>to</strong> a lesser degree excites the hairs around<br />

it.<br />

Figure 5.4: <strong>An</strong>other depiction of place theory in the basilar membrane, with the<br />

membrane uncoiled.<br />

The location of frequencies does not follow a linear scale. It follows<br />

a logarithmic scale, exactly like our perception of frequency. Octaves<br />

are spaced a constant distance from each other, so half the length of the<br />

basilar membrane detects 1500 Hz <strong>and</strong> less, <strong>and</strong> the other half detects<br />

the frequencies above 1500 Hz.<br />

Humans <strong>and</strong> other mammals localize low-frequency sounds by<br />

phase delay according <strong>to</strong> temporal theory <strong>and</strong> interaural time difference,<br />

which is defined by the space between our two ears. The ears are<br />

separated by about 21.5 cm, the wavelength of 1600 Hz, so beyond<br />

1600 Hz, the phase delay is no longer useful in detecting the location<br />

of sounds <strong>and</strong> instead group delay (the time difference between the<br />

amplitude envelopes of the sound in the left versus right ear) is the<br />

measure used [89]. At twice this wavelength (i.e., 800 Hz), the audi<strong>to</strong>ry<br />

system can unambiguously detect spatialization using the time



difference. Between 800 <strong>and</strong> 1600 Hz is a "transition zone" where both<br />

phase delays <strong>and</strong> amplitude envelopes are used in localization.<br />
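As a rough worked example (assuming a speed of sound of about 343 m/s), a sound arriving from directly to one side must travel the extra 21.5 cm between the ears, a delay of about 0.215/343 ≈ 0.6 ms; it is this sub-millisecond interaural delay that the phase-based mechanism exploits below 1600 Hz.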

Furthermore, once every period an action potential is fired from<br />

the audi<strong>to</strong>ry nerve. Below about 30 Hz, when sound is something<br />

that we detect as separate events rather than pitch, temporal theory is<br />

strongest, but physically, no more than 1500 action potentials can fire<br />

per second so the theory does not hold above 1500 Hz [87].<br />

If the set of frequencies is steady for some period of time, our ears’<br />

detection of these frequencies becomes increasingly fine-tuned. The<br />

physical result is that our audi<strong>to</strong>ry nerve actually charges up the action<br />

potentials at the corresponding locations along the basilar membrane.<br />

The music of La Monte Young attempts <strong>to</strong> exploit this fact. "Dream<br />

House," an installation in Lower Manhattan in New York City, features<br />

continuous pure <strong>to</strong>nes that, when heard for an extended period of time,<br />

are alleged <strong>to</strong> incite audi<strong>to</strong>ry hallucinations in the brain [86]. Exploding<br />

head syndrome is another form of audi<strong>to</strong>ry hallucination <strong>and</strong> typically<br />

happens during sleep: Extremely loud or ringing noises will appear<br />

<strong>to</strong> originate from inside the head, but these noises are not (usually)<br />

painful. The symp<strong>to</strong>ms are certainly dream-like but are not necessarily<br />

connected <strong>to</strong> dreaming [92].<br />

The intensity of sound waves’ pressure manifests in the disturbance<br />

of the eardrum, but this relationship is not quite as elegant<br />

as frequency’s. We find it more difficult <strong>to</strong> compare the loudness of<br />

sounds than frequency ratios, e.g., we almost never say that one sound<br />

is twice as loud as another. Perhaps the most important feature of the<br />

intensity of sounds in our ears is their onset time, earlier referred <strong>to</strong><br />

as the attack rate. Even soft sounds that surprise us can be extremely<br />

startling. The audi<strong>to</strong>ry reflex acts ahead of time only when the brain<br />

is expecting sound. Therefore, onset always excites the ear more than<br />

the actual sound’s pressure level.<br />

Now, what about phase? Our brains don't really seem to react to

phase information as they react <strong>to</strong> loudness <strong>and</strong> frequency. However,<br />

there is evidence that an action potential fires when the amplitude of



a given frequency is maximal—i.e., when the phase of a sinusoid is<br />

90 ◦ [1]. It is true that we use phase information <strong>to</strong> locate the source of<br />

sounds: A sound that reaches our left ear before our right ear naturally<br />

means that the source is more <strong>to</strong> the left. Furthermore, based on<br />

the loudness <strong>and</strong> other qualities <strong>to</strong> the sound, our ears use phase <strong>to</strong><br />

determine the approximate angle of orientation <strong>to</strong> sources of sound.<br />

Phase information is largely at play in the cocktail party effect, which is

our ability <strong>to</strong> focus on certain sound signals when there is a large<br />

amount of noise in the background, like having a conversation with a<br />

friend at a noisy party or concert. We focus better on desired signals<br />

when facing straight-on, such that the sound will reach both ears at the<br />

same time <strong>and</strong> the phases will be identical. Also at play is masking, the<br />

camouflaging of sound sources by other sounds of similar frequencies,<br />

which is a psychoacoustical feature.<br />

5.2 Psychoacoustics<br />

The field of psychophysics seeks <strong>to</strong> connect physical aspects of the<br />

world around us <strong>to</strong> the way our brain perceives them. Each of the<br />

sensations of sight, sound, <strong>to</strong>uch, taste, <strong>and</strong> smell have quantifiable<br />

threshold values <strong>and</strong> limits that define the range of intensities from<br />

barely detectable <strong>to</strong> permanently damaging. For no sensation are these<br />

ranges absolute, <strong>and</strong> even their average values contain some level of<br />

uncertainty due <strong>to</strong> noise in the sensory system [48].<br />

Psychoacoustics relates the physiological response <strong>to</strong> sound <strong>to</strong> the<br />

perceptual interpretation. The basilar membrane acts as a spectral<br />

analyzer according <strong>to</strong> place theory. Indeed, we seem <strong>to</strong> have a very<br />

easy time identifying the timbre of sounds, which is a frequency-based<br />

skill: Immediately, we can identify the sound of guitars, drums, <strong>and</strong><br />

the President’s voice. We can quickly tell the difference between five<br />

of our friends’ voices, even if they are all of the same age <strong>and</strong> gender.<br />

Since timbre is a set of frequencies <strong>and</strong> their amplitudes, we will



investigate how frequency <strong>and</strong> intensity translate perceptually <strong>to</strong> pitch<br />

<strong>and</strong> loudness.<br />

Pitch<br />

Pitch is our interpretation of frequency in sound. The assumption<br />

is that this is not an exact correspondence, but rather a rough one,<br />

especially when we consider how we perceive frequencies below 30<br />

Hz or so. In general, we consider a sound "pitched" when it has a<br />

repetitive nature <strong>and</strong> it is within the range of our thresholds <strong>and</strong> limits.<br />

Perhaps the most amazing part about pitch perception is our response<br />

<strong>to</strong> frequency ratios, like the octave. We perceive pitches an<br />

octave apart <strong>to</strong> be so closely related that we give them identical note<br />

names. Additionally, intervals other than the octave have distinct qualities<br />

that are not just limited by their absolute difference: A frequency<br />

19 half steps above a reference frequency (a "G" above a "C") has a<br />

very similar quality <strong>to</strong> a frequency 7 half steps above the reference (a<br />

different "G" above the same "C"), because the octave consists of 12<br />

half steps. The mere existence of such a thing as the Circle of Fifths<br />

suggests that we perceive pitch as a spiral, or Slinky, where pitches at<br />

identical angles on these surfaces are separated by an octave.<br />

Roger Shepard devised a schematic for <strong>to</strong>nes that maps the notes<br />

C, C♯, etc. <strong>to</strong> chroma <strong>and</strong> their octave placement <strong>to</strong> a height, where<br />

higher octaves have a greater height. Chroma can also be thought<br />

of as pitch classes where "C" is one class <strong>and</strong> "C♯" is another class, so<br />

there are 12 <strong>to</strong>tal classes in the Western scale. The Shepard <strong>to</strong>ne is an<br />

auditory illusion much like the optical illusion of a spinning barber's pole.

Its sound is the result of layered, identical sine sweeps moving from<br />

low <strong>to</strong> high frequencies where the highest frequency is some octave<br />

of the lowest. When one sweep reaches an octave higher than its<br />

starting frequency, another sweep begins. Though there is a maximum



Figure 5.5: The pitches shown as <strong>to</strong>ne chroma.<br />

frequency that each of the sweeps hit, this sound has the illusion of<br />

constantly increasing in pitch.<br />

We do not perceive all frequencies as equally loud. In fact, very<br />

loud sounds played near the limit of our hearing undergo a downward<br />

shift in pitch. The range of 1,000–5,000 Hz, spanning a little more than 2 octaves, is where our ears are most sensitive, and this is the range of

our speaking voice. Thus, loudness <strong>to</strong>o is perceptual.<br />

Loudness<br />

Loudness describes the brain’s perception of the intensity of a sound.<br />

Like frequency, intensity is modeled on a logarithmic scale in decibels<br />

(dB), where 0 dB corresponds to normal atmospheric pressure (101.325 kPa). The threshold of our hearing t_h is 10⁻¹² W/m² (watts per square meter), and the limit of our hearing l_h is 1 W/m².

However, loudness does not absolutely correspond <strong>to</strong> intensity.<br />

For one, we perceive frequencies in the range of 1,000 <strong>to</strong> 5,000 Hz<br />

better than frequencies outside of this range. Frequencies at the extreme<br />

points of our hearing range have <strong>to</strong> be very intense in order<br />

for us <strong>to</strong> perceive them. The Fletcher–Munson curve given in Figure<br />

5.6 describes the minimum intensity required <strong>to</strong> detect specific<br />

frequencies.<br />

Figure 5.6: This graph shows our ears’ <strong>to</strong>tal sensitivity from 20-20,000 Hz, limited by<br />

the threshold (minimum sound pressure level required for perception of sound, given<br />

by the familiar Fletcher-Munson curve) <strong>and</strong> limit (maximum sound pressure level<br />

beyond which our hearing is damaged). The average region of speech signals is also<br />

highlighted.<br />

Therefore, there are several different ways <strong>to</strong> think about loudness<br />

<strong>and</strong> also different measurements <strong>to</strong> quantify it. The decibel measure is



fairly meaningless when it comes <strong>to</strong> how we perceive intensity because<br />

it varies radically with frequency. Therefore, it is sometimes useful<br />

<strong>to</strong> measure loudness using the phon scale. The phon scale answers<br />

the question, "How loud does frequency B need <strong>to</strong> be in order <strong>to</strong><br />

be equally loud as frequency A?" It is a measure of equal loudness

based on the Fletcher–Munson function for the minimum threshold of<br />

hearing, given by<br />

T(f) = 3.64\left(\frac{f}{1000}\right)^{-0.8} - 6.5\,e^{-0.6\,(f/1000 - 3.3)^{2}} + 0.001\left(\frac{f}{1000}\right)^{4}.

This is the threshold for young, healthy ears. The phon is defined as<br />

the sound intensity level (SIL) in decibels of a sinusoid of 1,000 Hz, so at<br />

the threshold level 10 −12 W/m 2 , the loudness is 0 phon, <strong>and</strong> 10 phons<br />

equals the sound intensity level of the 1,000 Hz sinusoid at 10 dB.<br />

This means that a 10 phon sinusoid at x Hz will sound equally loud<br />

<strong>to</strong> a 10 phon sinusoid at y Hz. This is a nice scale <strong>to</strong> apply in analysis<br />

of the frequency spectrum because it normalizes the amplitudes with<br />

respect <strong>to</strong> our perception. A 3,000 Hz sound wave, for example, should<br />

be considered "more important" than an equally intense 50 Hz sound<br />

wave, because its intensity does not have <strong>to</strong> be nearly as great <strong>to</strong><br />

be audible. When we apply the phon scale <strong>to</strong> the resulting Fourier<br />

transform, we get a better idea of the actual sound that we perceive<br />

from a signal.<br />
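To make the threshold formula concrete, here is a minimal sketch in C (the function name threshold_db and the test frequencies are mine, not part of the text); it shows, for instance, that a 3,000 Hz tone needs far less intensity to be heard than a 50 Hz tone.

#include <stdio.h>
#include <math.h>

/* Threshold of hearing in dB SPL, after the formula above
   (valid roughly for young, healthy ears). */
double threshold_db(double f)
{
    double khz = f / 1000.0;
    return 3.64 * pow(khz, -0.8)
         - 6.5 * exp(-0.6 * (khz - 3.3) * (khz - 3.3))
         + 0.001 * pow(khz, 4.0);
}

int main(void)
{
    double freqs[] = { 50.0, 1000.0, 3000.0, 15000.0 };  /* example frequencies */
    for (int i = 0; i < 4; i++)
        printf("T(%.0f Hz) = %.1f dB SPL\n", freqs[i], threshold_db(freqs[i]));
    return 0;
}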

Another translation of sound intensity levels to a perceptual measure is given by the sone scale, which calculates loudness as a ratio. Loudness in sones (L_s) can be directly calculated from loudness in phons (L_p) as

L_s = 2^{(L_p - 40)/10},

so one sone is equal to 40 phons. A sound that is twice as loud as another sound will have twice as many sones. One sone is therefore equal to the loudness of a 1,000 Hz sinusoid at 40 phons.
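A one-line conversion in C follows directly from this relationship (the function name is mine):

#include <stdio.h>
#include <math.h>

/* Convert loudness in phons to loudness in sones: L_s = 2^((L_p - 40)/10). */
double phons_to_sones(double phons)
{
    return pow(2.0, (phons - 40.0) / 10.0);
}

int main(void)
{
    printf("40 phons = %.2f sones\n", phons_to_sones(40.0));  /* 1 sone  */
    printf("50 phons = %.2f sones\n", phons_to_sones(50.0));  /* 2 sones */
    printf("60 phons = %.2f sones\n", phons_to_sones(60.0));  /* 4 sones */
    return 0;
}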



Figure 5.7: These curves are derived from the Fletcher–Munson curve <strong>and</strong> depict the<br />

phon scale. Following a single curve tells you the sound intensity level perceived as<br />

equally loud across the entire frequency range. The curves are drawn in increments<br />

of 10 phons.<br />

It is common to see loudness written in dB SPL, which is the loudness relative to a reference pressure. The reference pressure is 20 µPa; a pressure of 1 Pascal corresponds to about 94 dB SPL, a level commonly used for calibration. This measure is so common that it is often (misleadingly) abbreviated as "dB," but decibels are not an absolute measure of intensity or pressure.

Loudness in dB SPL (also called the intensity level) gives us a better idea of the perceptual experience of volume, while amplitude describes the physical wave. The reference level for our threshold is usually 20 µPa RMS (root mean squared—a sort of normalization) at 1,000 Hz, where this would be 0 dB SPL. We can calculate the loudness in dB SPL (L_{dB SPL}) of a root mean squared pressure p_{rms} relative to a reference pressure p_{ref} with the formula

L_{dB\,SPL} = 10 \log_{10}\left(\frac{p_{rms}}{p_{ref}}\right)^{2} = 20 \log_{10}\left(\frac{p_{rms}}{p_{ref}}\right).

So this would be the number of decibels of a sound above some reference sound.
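As a small sketch in C, assuming the standard 20 µPa reference mentioned above (the function name is mine):

#include <stdio.h>
#include <math.h>

#define P_REF 20e-6  /* reference pressure: 20 micropascals RMS */

/* Sound pressure level in dB SPL from an RMS pressure in pascals. */
double db_spl(double p_rms)
{
    return 20.0 * log10(p_rms / P_REF);
}

int main(void)
{
    printf("20 uPa -> %.1f dB SPL\n", db_spl(20e-6)); /* 0 dB, threshold */
    printf("1 Pa   -> %.1f dB SPL\n", db_spl(1.0));   /* about 94 dB SPL */
    return 0;
}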



Source                          Intensity (W/m^2)   SPL (dB)   Magnitude of t_h
Threshold of hearing            1 × 10^-12           0          10^0
Rustling leaves                 1 × 10^-11           10         10^1
Whispering                      1 × 10^-10           20         10^2
Quiet library                   1 × 10^-8            40         10^4
Conversation, 1 m               1 × 10^-6            60         10^6
Vacuum cleaner, 1 m             1 × 10^-5            70         10^7
Heavy traffic, from sidewalk    1 × 10^-4            80         10^8
Rock concert, 1 m               1 × 10^-2            100        10^10
Threshold of pain               1 × 10^1             130        10^13
Jet engine                      1 × 10^2             140        10^14
Perforation of eardrum          1 × 10^4             160        10^16

Table 5.1: The sound pressure levels (SPL) of various sound sources of wideband noise, in which the sound's energy is spread across a large range of frequencies.

If this wasn’t already confusing enough, there is yet another measure<br />

of loudness that considers the normal hearing level, called dB HL.<br />

We typically encounter this scale during an audiogram, a test of one’s<br />

hearing. The quantity 0 dB HL describes the (average) normal hearing<br />

level at all frequencies, <strong>and</strong> scoring somewhere in the range of -10 dB<br />

HL <strong>to</strong> 20 dB HL is commonly considered the normal hearing range.<br />

Just-noticeable difference<br />

All of our sensations have a prescribed resolution of detectable detail,<br />

<strong>and</strong> both loudness <strong>and</strong> pitch have some interval of error within which<br />

we cannot detect a difference. This interval is defined as the just-noticeable difference (jnd) and is measured in limens. The process of

altering a frequency from some center frequency is called frequency<br />

modulation, <strong>and</strong> likewise, the alteration of amplitude from a center<br />

amplitude is amplitude modulation. Frequency modulation (FM) is<br />

perceived as vibra<strong>to</strong> when the modulating frequency is sufficiently<br />

large for our ears <strong>to</strong> detect change but small enough <strong>to</strong> not separate<br />

the maximum <strong>and</strong> minimum frequencies. Frequencies closer <strong>to</strong> the



minimum <strong>and</strong> maximum values of our hearing range must modulate<br />

more than frequencies between 1,000 <strong>and</strong> 5,000 Hz, where our ears are<br />

most sensitive <strong>to</strong> change. Within this area of heightened sensitivity,<br />

frequency changes greater than about 0.5% of the center or carrier<br />

frequency can be detected. The magnitude of the jnd varies especially<br />

for trained musicians who have spent lots of time tuning <strong>and</strong> listening<br />

<strong>to</strong> their instruments.<br />

For loudness, the jnd is roughly proportional <strong>to</strong> the intensity <strong>and</strong><br />

frequency of the sound, related <strong>to</strong> the Fletcher–Munson curve above.<br />

Amplitude modulation (AM) is also known as tremolo when the change<br />

in loudness is beyond the jnd. When two frequencies in close proximity<br />

<strong>to</strong> one another (differing by 10 Hz or less) are played simultaneously in<br />

a sound, the phenomenon of beating occurs. The phases of the two frequencies<br />

result in constructive <strong>and</strong> destructive interference at periodic<br />

intervals, and the effect is heard as amplitude modulation, beating at a frequency exactly equal to the difference of the two frequencies.

Figure 5.8: The sum of two closely related frequencies, 50 Hz and 51 Hz, over 3 seconds. The cosine function cos(πt) is also shown, representing the (perceptual and actual) modulation of amplitude.

The sound from a whistle produces beats heard as a lower tone. There are

two frequencies produced, separated in wavelength by the length of<br />

the gap in the whistle. So, say the first frequency is the result of the<br />

distance between the mouthpiece <strong>and</strong> the beginning of the gap (say 5<br />

centimeters, so 343/0.05 = 6860 Hz), <strong>and</strong> the second frequency has the<br />

wavelength of the distance between the mouthpiece <strong>and</strong> the end of the



gap (say 5.2 centimeters, so 343/0.052 ≈ 6596 Hz). Then blowing this whistle would produce a difference frequency of 6860 − 6596 = 264 Hz.

A difference frequency is commonly required to be less than 10 Hz before it is called beating, because such a low frequency has the sound of a metric rhythm, like beats from a drum; but difference tones are an artifact of beating no matter their frequency. Between 10 and 20 Hz, the modulation is considered dissonant or rough, and above 20 Hz, the difference frequency is heard as a pitch of its own. This is similar to our perception of a series of reflections as coming from the same source, where 0.1-0.2 seconds represents an interval of ambiguity.
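The whistle example can be checked with a few lines of C; the 343 m/s speed of sound and the two path lengths are the same hypothetical values used in the text.

#include <stdio.h>

int main(void)
{
    double c  = 343.0;          /* speed of sound in m/s           */
    double d1 = 0.050;          /* path to the start of the gap, m */
    double d2 = 0.052;          /* path to the end of the gap, m   */

    double f1 = c / d1;         /* 6860 Hz       */
    double f2 = c / d2;         /* about 6596 Hz */

    printf("f1 = %.0f Hz, f2 = %.0f Hz\n", f1, f2);
    printf("difference tone = %.0f Hz\n", f1 - f2);  /* roughly 264 Hz */
    return 0;
}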

Figure 5.9: The sum of the frequencies 50 <strong>and</strong> 62 Hz over 1.5 periods, with a 12<br />

Hz sinusoid on <strong>to</strong>p <strong>to</strong> show the relationship between their sum <strong>and</strong> the difference<br />

frequency. This signal is dissonant. Beyond 20 Hz, the 20 Hz difference <strong>to</strong>ne becomes<br />

a pitch of its own, so the dissonance begins <strong>to</strong> diminish.<br />

Our detection of difference tones is tied to the definition of critical bands.

Critical b<strong>and</strong>s <strong>and</strong> masking<br />

We define noise by a bandwidth and a center or carrier frequency: narrowband noise for small bandwidths and wideband noise for large bandwidths, where the bandwidth determines the width of the interval of frequencies contained in a signal (bounded by its minimum and maximum frequency bands).



Zwicker <strong>and</strong> Feldtkeller’s 1955 experiments with narrowb<strong>and</strong> noise<br />

showed that, beyond certain b<strong>and</strong>widths, we perceive the loudness of<br />

b<strong>and</strong>width-limited noise as disproportional <strong>to</strong> its <strong>to</strong>tal energy: When<br />

the b<strong>and</strong>width reaches <strong>and</strong> exceeds a critical value, the energy (loudness)<br />

of the noise has the illusion of increasing when it is actually<br />

constant. This value beyond which we perceive the b<strong>and</strong>width <strong>to</strong> have<br />

more energy than it physically does is given by<br />

\beta_c = 25 + 75\left[1 + 1.4\left(\frac{f_c}{1000}\right)^{2}\right]^{0.69},

<strong>and</strong> β c is called the critical b<strong>and</strong>width. The compression techniques<br />

behind MP3s <strong>and</strong> other lossy compressed files use this psychoacoustic<br />

phenomenon <strong>to</strong> discard sonic information that we wouldn’t perceive<br />

anyway.<br />
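A minimal C sketch of the critical bandwidth formula (the function name and the example center frequencies are mine):

#include <stdio.h>
#include <math.h>

/* Critical bandwidth in Hz around a center frequency fc in Hz,
   following the formula above. */
double critical_bandwidth(double fc)
{
    double khz = fc / 1000.0;
    return 25.0 + 75.0 * pow(1.0 + 1.4 * khz * khz, 0.69);
}

int main(void)
{
    double centers[] = { 100.0, 1000.0, 4000.0 };  /* example center frequencies */
    for (int i = 0; i < 3; i++)
        printf("beta_c(%.0f Hz) = %.0f Hz\n", centers[i], critical_bandwidth(centers[i]));
    return 0;
}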

Suppose a pure <strong>to</strong>ne with a frequency inside or very near <strong>to</strong> the<br />

range of the narrowb<strong>and</strong> noise is played at the same time as the noise.<br />

If the energy of the noise is the same or greater than the energy of<br />

the <strong>to</strong>ne, masking occurs. Masking is when a frequency within some<br />

threshold range of a frequency b<strong>and</strong> challenges the detection of that<br />

frequency. It is considered a failure of our audi<strong>to</strong>ry system in accurately<br />

detecting sounds, <strong>and</strong> is due <strong>to</strong> the localization of disturbances in the<br />

basilar membrane. Exciting the membrane at a frequency causes the<br />

membrane <strong>to</strong> vertically displace at that frequency, which naturally<br />

displaces the frequencies <strong>to</strong> the left <strong>and</strong> right of it, just like a b<strong>and</strong>-pass<br />

filter (see Appendix A). The depiction of masking in Figure 5.10 reflects<br />

the "lumpy" nature of the displacement along the membrane, as well<br />

as our perception of frequency differences.<br />

So, the narrowb<strong>and</strong> noise masks the sound of a pure <strong>to</strong>ne with a<br />

frequency within or near its b<strong>and</strong>width. As the b<strong>and</strong>width increases,<br />

the pure tone must be louder to be detectable.

Figure 5.10: An example of masking with the Fletcher–Munson curve for reference. When a masker exceeds the masking threshold, sounds beneath that threshold in both frequency and sound pressure level will be masked.

Figure 5.11: The displacement of the basilar membrane in response to a sound bearing the fundamental pitch p_0. When excited at this location, the wave moves upward and downward in addition to propagating towards the apical end of the membrane, where the auditory nerve is. The excitation is centered at the pitch, but also excites places immediately to the left and right of it. If a pitch slightly lower than p_0 were also present in the sound, and with a smaller amplitude, it is possible that the ear could fail to detect this second pitch due to masking.

The signal-to-noise ratio, the ratio of the intensity of the pure tone to that of the noise, is also quantifiable with respect to frequency and critical bandwidth, but this is true only up to a point. At this point, the bandwidth can continue to increase for the same intensity of the pure tone, and the pure tone is still detectable. Thus, the signal-to-noise ratio decreases, but the quality of our perception of the signal does not.

The results of these experiments support the proposition that our<br />

ear groups ranges of frequencies. Lossy compression algorithms that<br />

compress sound files <strong>to</strong> MP3 <strong>and</strong> AAC formats use these results <strong>to</strong><br />

simplify <strong>and</strong> reduce sonic data: Frequencies in close proximity <strong>to</strong> one<br />

another are mapped <strong>to</strong> a single frequency.<br />

Consonance <strong>and</strong> dissonance<br />

There have been several attempts <strong>to</strong> bridge the consonance of frequency<br />

ratios <strong>to</strong> psychoacoustics, <strong>and</strong> no theory is considered predominant<br />

or leading. The debate might continue forever, if not for the<br />

reason that sensitivities <strong>and</strong> quality preferences differ between ears<br />

<strong>and</strong> musical tastes, then for the reason that music is not only about<br />

consonance <strong>and</strong> dissonance. Many scientists <strong>and</strong> musicians, such as<br />

Hermann von Helmholtz, have tried <strong>to</strong> subjectively order intervals<br />

from consonant <strong>to</strong> dissonant. Helmholtz attributed qualities <strong>to</strong> different<br />

keys—D Major compared <strong>to</strong> G Major, for example—which might<br />

mean that his keyboard was not equally tempered, for the two should<br />

not be discernible as far as frequency ratios are concerned.<br />

My personal theory is that irrationally proportioned sounds (i.e.,<br />

those without whole integer ratios) are perceived as dissonant because<br />

frequencies related by integer ratios completely avoid the undesirable<br />

phenomenon of beating. Furthermore, some intervals are more<br />

dissonant or consonant than others. The degree of dissonance <strong>and</strong> consonance<br />

can be determined, in my opinion, by the amount of beating a<br />

given interval allows.<br />

Calculated below are the different degrees of beating that occur<br />

in the over<strong>to</strong>nes of equally tempered intervals. The fundamental<br />

frequency is 100 Hz. Reinier Plomp experimented with the audibility of<br />

partials in complex <strong>to</strong>nes <strong>and</strong> showed that humans are able <strong>to</strong> discern<br />

only up <strong>to</strong> the first five <strong>to</strong> eight over<strong>to</strong>nes in a harmonic over<strong>to</strong>ne series



Interval      f_0       f_1       f_2       f_3       f_4        f_5
P1            100       200       300       400       500        600
m2        105.946   211.893   317.839   423.785   529.732    635.678
M2        112.246   224.492   336.739   448.985   561.231    673.477
m3        118.921   237.841   356.762   475.683   594.604    713.524
M3        125.992   251.984   377.976   503.968   629.961    755.953
P4        133.484   266.968   400.452   533.936   667.420    800.904
TT        141.421   282.843   424.264   565.685   707.107    848.528
P5        149.831   299.661   449.492   599.323   749.154    898.984
m6        158.740   317.480   476.220   634.960   793.701    952.441
M6        168.179   336.359   504.538   672.717   840.896   1009.078
m7        178.180   356.359   534.539   712.719   890.899   1069.080
M7        188.775   377.550   566.325   755.099   943.874   1132.649
P8            200       400       600       800      1000       1200

Table 5.2: The first five overtones of all 13 intervals within one octave. Italicized are partials within one jnd of the partials of the fundamental frequency (100 Hz), meaning they are less than 0.5 percent off from the partial, representing consonance. Bolded are partials that lie between one jnd and 20 Hz from the partials of the fundamental, representing dissonance and roughness.

[80]. 1 In Table 5.2, we compute the first five harmonic partials (f 0 , f 1 ,<br />

. . ., f 5 ) for the twelve Western, equally tempered intervals above 100<br />

Hz. Instances when the over<strong>to</strong>nes of these intervals are within one<br />

jnd of the over<strong>to</strong>nes of the 100 Hz fundamental are highlighted by<br />

italics. Since they differ by less than 0.5 percent in frequency, I call

them consonant. If they differ by less than 20 Hz but are greater than<br />

one jnd apart (shown in bold), I call them dissonant.<br />

1 This was true for both of the two complex <strong>to</strong>nes used by Plomp in his experiment,<br />

one harmonic <strong>and</strong> one inharmonic, <strong>and</strong> each containing twelve over<strong>to</strong>nes.
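The entries of Table 5.2 can be regenerated with a short C program, assuming (as in the table) a 100 Hz fundamental, twelve-tone equal temperament, and the fundamental plus five overtones:

#include <stdio.h>
#include <math.h>

int main(void)
{
    const char *names[] = { "P1", "m2", "M2", "m3", "M3", "P4", "TT",
                            "P5", "m6", "M6", "m7", "M7", "P8" };
    double f_fund = 100.0;  /* fundamental of the lower tone, Hz */

    for (int k = 0; k <= 12; k++) {               /* interval in semitones    */
        double root = f_fund * pow(2.0, k / 12.0);
        printf("%-3s", names[k]);
        for (int n = 1; n <= 6; n++)              /* fundamental + 5 overtones */
            printf(" %9.3f", root * n);
        printf("\n");
    }
    return 0;
}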



Interval          Difference of most consonant partial to the partials of 100 Hz
Unison            0 Hz
Octave            0 Hz
Perfect fifth     0.339 Hz
Perfect fourth    0.452 Hz
Major third       3.968 Hz
Major sixth       4.538 Hz
Minor third       5.394 Hz
Minor second      5.946 Hz
Minor sixth       6.299 Hz
Tritone           7.107 Hz
Minor seventh     9.101 Hz
Major seventh     11.225 Hz
Major second      12.246 Hz

Table 5.3: Ranking of the most consonant intervals, as computed from their harmonic overtone series in Table 5.2.

As you can see, this method is not without its problems. The<br />

tri<strong>to</strong>ne, for example, is far from the most dissonant interval even<br />

though its frequency ratio is considered the "least rational" of all of<br />

the intervals (2^{6/12} = √2) in equal temperament (think: the chorus of West Side Story's "Maria"). The church even named the tritone diabolus in musica ("the devil in music"), no later than the early 18th century.

However, perhaps this notion st<strong>and</strong>s <strong>to</strong> be challenged.<br />

5.3 Perfect pitch<br />

Perfect pitch, also called absolute pitch, is a gift that a very tiny number<br />

of people possess. It is either inborn or learned at the same time as<br />

language acquisition, before the age of about four years old. People<br />

with perfect pitch, as the name might suggest, can name pitches played<br />

in isolation. So, if I were <strong>to</strong> go up <strong>to</strong> a piano <strong>and</strong> press only one key,<br />

someone with perfect pitch could tell me the note I played.



People who speak <strong>to</strong>nal languages like M<strong>and</strong>arin Chinese <strong>and</strong><br />

Vietnamese are more likely <strong>to</strong> have perfect pitch [21]. This is further<br />

evidence that note naming is similar <strong>to</strong> our cognition of language.<br />

Relative pitch is not the same thing as perfect pitch, though it might<br />

seem that way. Relative pitch is perfectly possible <strong>to</strong> acquire from<br />

playing or listening <strong>to</strong> a lot of music, <strong>and</strong> is employed in the context of<br />

more than one pitch. A person with good relative pitch can identify<br />

intervals <strong>and</strong> chords, but cannot name the pitches themselves unless<br />

given a reference point, such as key. Relative pitch is very useful for<br />

identifying the <strong>to</strong>nal center of a key, but not for naming it (as one could<br />

do if one had perfect pitch).<br />

It is estimated that less than 0.05 percent (1/2000) of the population<br />

have perfect pitch. However, in a study of 600 musicians, 40 percent<br />

of people who began learning music before the age of five possessed<br />

perfect pitch [21]. Songbirds also have perfect pitch; it is essential <strong>to</strong><br />

the success of their mating calls. In many instances, people with perfect

pitch don’t realize their own gift until someone makes it known <strong>to</strong><br />

them, or they study music theory.<br />

Some would argue that perfect pitch can be learned after the age of<br />

language acquisition. Evidence for this would be the very-close-to-perfect pitch of Mandarin and Vietnamese speakers, though short-term pitch memory is virtually universal. Native speakers of these languages

will often have an enlarged left side of the planum temporale<br />

in the brain, which is also an indica<strong>to</strong>r of musicianship. More support<br />

for this argument would be that awareness of one’s own vocal range<br />

in combination with excellent relative pitch is a way <strong>to</strong> acquire perfect<br />

pitch. However, I believe that any method of acquiring perfect pitch<br />

would require a never-ending amount of practice <strong>to</strong> keep pitches fresh<br />

in one’s memory.<br />

Perfect pitch is an optimal <strong>to</strong>ol for composers <strong>and</strong> performers—<br />

Beethoven, Mozart, Bach, H<strong>and</strong>el, Chopin, Toscanini, <strong>and</strong> <strong>An</strong><strong>to</strong>n Rubinstein<br />

all possessed it. Composition of new music <strong>and</strong> reproduction<br />

of old is easier when the specific pitches are already in your head.



Perhaps its most practical application is in tuning an instrument. But<br />

it is extremely rare, so be wary of those who say they have it.<br />

Synesthesia<br />

Sometimes, perfect pitch is accompanied by some form of synesthesia.<br />

Synesthesia (also spelled synaesthesia) is the activation of an unrelated<br />

sensation during a stimulus. It can come in weak <strong>and</strong> strong forms,<br />

<strong>and</strong> for some it can be overwhelming. I have a very common form of<br />

weak synesthesia: <strong>Numbers</strong> correspond <strong>to</strong> specific colors <strong>to</strong> me. Zero<br />

is black, one is white, two is blue, three is orange, <strong>and</strong> so on. I realized<br />

this only recently while researching musical synesthesia, though I have<br />

always been aware that I related numbers <strong>to</strong> colors. I always assumed<br />

that it was an artifact of some television show that I had watched as<br />

a child, but then I found a chart with the very same colors for these<br />

numbers on Wikipedia.<br />

<strong>Musical</strong> synesthesia is most often a visual type. People with musical<br />

synesthesia are more likely to have perfect pitch because they experience consistent phenomena when specific pitches are played. I observed six individuals with musical synesthesia.

Only three of these six subjects had perfect pitch. All of the subjects<br />

were musicians in some capacity. One reported that he saw the color

orange with A440, <strong>and</strong> the brightness of this orange depended on<br />

how loud the A was played. This relationship between intensity of<br />

the actual sensation <strong>and</strong> the intensity of the synesthetic sensation is<br />

quite common. Additionally, he compared a chord with simultaneous<br />

pitches <strong>to</strong> a Mark Rothko painting, primarily colored by the root, <strong>and</strong><br />

secondly by the third.<br />

<strong>An</strong>other of the subjects claimed <strong>to</strong> perceive colors that were literally<br />

"out of this world," nonexistent in our ROYGBIV spectrum. Yet another<br />

described some of the most tranquil, beautiful images, relating them <strong>to</strong><br />

timbres <strong>and</strong> the fullness of orchestration (the more voices in the music,<br />

the more cluttered these images). A single melody <strong>to</strong> her appeared



as "dabs of color over a dark field," while a large b<strong>and</strong> would evoke<br />

something she called a "color field" with rippling, moving currents.<br />

The subjects also seemed <strong>to</strong> have an easier time hearing over<strong>to</strong>nes <strong>and</strong><br />

identifying instruments than the general population.<br />

Amusia<br />

Amusia is defined by <strong>to</strong>ne deafness, which is the lack of relative pitch.<br />

This is most apparent when people try <strong>to</strong> sing along <strong>to</strong> a song <strong>and</strong><br />

fail <strong>to</strong> sing even remotely close <strong>to</strong> the actual melody. Diana Deutsch<br />

has done a lot of great work <strong>and</strong> research in the fields of synesthesia,<br />

perfect pitch, <strong>and</strong> amusia. Her research has shown that as many as five<br />

percent of people in the United States have amusia (four percent in the<br />

UK), defined by extremely poor relative pitch, wherein pitch changes<br />

cannot be detected <strong>and</strong> a pitch in isolation cannot be sung back [22].<br />

People with amusia, called amusics, can have normal rhythm detection,<br />

<strong>and</strong> surprisingly, there does not seem <strong>to</strong> be a correlation <strong>to</strong> language<br />

faculties. For instance, the Russian composer Vissarion Shebalin<br />

suffered from aphasia, the impairment of language ability. Shebalin<br />

wrote many genres of music ranging from operas <strong>to</strong> string quartets <strong>to</strong><br />

film scores <strong>and</strong> lived from 1902 <strong>to</strong> 1963. In 1953, he suffered a stroke<br />

that impaired his ability <strong>to</strong> communicate verbally, but not musically:<br />

He continued <strong>to</strong> compose one more symphony (his fifth) in his lifetime<br />

that was similar <strong>to</strong> his earlier compositions, <strong>and</strong> even received praise<br />

from contemporary Dmitri Shostakovich. Just as remarkably, Shebalin<br />

continued <strong>to</strong> give lectures at universities using only the language of<br />

music. So, there is no strong evidence that our brain’s musical <strong>and</strong><br />

linguistic faculties are connected.<br />

A good friend of mine from high school is now a practicing music<br />

therapist. Part of her background includes training for patients with<br />

aphasia. She described <strong>to</strong> me a technique used <strong>to</strong> res<strong>to</strong>re language<br />

abilities via music: A sentence is paired with a simple melody <strong>and</strong><br />

rhythm, <strong>to</strong> make it in<strong>to</strong> a song. The patient then attempts <strong>to</strong> repeat the



sentence back. To encourage his or her memory, the therapist can pat<br />

out the rhythm, as well as hum the tune. This technique has shown<br />

consistently positive results, <strong>and</strong> hints at a tie between language <strong>and</strong><br />

music, even though they are processed by many different parts of the<br />

brain.<br />

5.4 Chapter summary<br />

This chapter barely scratched the surface of the physiology <strong>and</strong> psychology<br />

of hearing, but our audi<strong>to</strong>ry perception is an important fac<strong>to</strong>r<br />

in the mathematical analysis of musical signals. The perceptual definitions<br />

<strong>and</strong> quantifiable behaviors of frequency resolution, temporal<br />

resolution, and hearing thresholds and limits all aid decision-making

in the compression <strong>and</strong> composition of musical sound. Applying<br />

these concepts <strong>to</strong> the output of Fourier transforms results in a more<br />

perceptually accurate model.<br />

The organ of the ear is subdivided in<strong>to</strong> three parts based on their<br />

general functions: The outer, middle, <strong>and</strong> inner ear. The outer ear<br />

receives sound, <strong>and</strong> the pressure of the air is translated on<strong>to</strong> the surface<br />

of the eardrum. This pressure is compared <strong>to</strong> the pressure on the<br />

eardrum’s interior, the chamber of the middle ear. This pressure is<br />

translated by the bones <strong>and</strong> muscles of the middle ear <strong>to</strong> the cochlea,<br />

<strong>and</strong> passed on <strong>to</strong> the audi<strong>to</strong>ry nerve which is directly connected <strong>to</strong> the<br />

brain.<br />

The basilar membrane inside of the cochlea vibrates at locations<br />

<strong>and</strong> amplitudes corresponding <strong>to</strong> the frequency spectrum of a given<br />

sound. The hair cells on the membrane are b<strong>and</strong>-pass filters, each<br />

of them tuned <strong>to</strong> a small range of frequencies. The membrane is excited<br />

by frequencies between about 20 <strong>and</strong> 20,000 Hz at minimum<br />

amplitudes defined by the Fletcher–Munson curve. Half of the membrane<br />

responds <strong>to</strong> frequencies below 1,500 Hz <strong>and</strong> the other half <strong>to</strong><br />

frequencies above 1,500 Hz, with octaves evenly spaced along it. In this



way, the basilar membrane is comparable <strong>to</strong> a logarithmic frequency<br />

domain.<br />

Because the displacement of the membrane at a given frequency<br />

naturally displaces the frequencies around it, when a sound contains<br />

closely related frequencies, the ear may not be able <strong>to</strong> distinguish<br />

between them. This is the frequency resolution of the ear, <strong>and</strong> results<br />

in the phenomena of masking <strong>and</strong> critical b<strong>and</strong>s. These phenomena<br />

are utilized in the lossy compression of digital audio.<br />

The field of psychoacoustics studies how the brain perceives sound.<br />

Pitch describes our perception of frequency, <strong>and</strong> loudness our perception<br />

of intensity. Frequencies within the range of speech are more<br />

easily detected by the ear, <strong>and</strong> frequencies near the minimum <strong>and</strong><br />

maximum of audibility require greater intensity to be audible. The

phon scale, based on the human ear’s sensitivity <strong>to</strong> 1,000 Hz <strong>and</strong> the<br />

Fletcher–Munson curve, describes intensity with respect <strong>to</strong> frequency<br />

in order for two frequencies <strong>to</strong> be equally loud. So, x-many phons<br />

is defined as the loudness of 1,000 Hz at x-many dB. The phon scale<br />

should be applied <strong>to</strong> the output of a Fourier transform when a perceptual<br />

representation of the amplitudes is desired. The sone scale<br />

describes loudness ratios, i.e., a sound with a loudness of two sones<br />

describes a sound twice as loud as one with a loudness of one sone.<br />

Perfect pitch, musical synesthesia, <strong>and</strong> amusia are neurological<br />

conditions concerning music. Perfect pitch is the ability <strong>to</strong> discern the<br />

pitch of a sound in isolation, while relative pitch is the ability <strong>to</strong> discern<br />

pitch with a known reference point, thereby computing the pitch from<br />

knowledge of pitch intervals. Children who develop music faculties at<br />

the same time as language faculties are far more likely <strong>to</strong> have perfect<br />

pitch. Synesthesia is the stimulation of multiple senses in response<br />

<strong>to</strong> a single sensory stimulus. <strong>Musical</strong> synesthesia is most often the<br />

stimulation of a visual response in addition <strong>to</strong> a sonic response, like a<br />

specific color appearing from a specific pitch. Synesthetes will often<br />

also have perfect pitch because of this relation. Finally, amusia is the<br />

impairment of musical faculties, such as memory of musical melodies



<strong>and</strong> <strong>to</strong>ne deafness. Surprisingly, aphasia (impairment of language<br />

faculties) <strong>and</strong> amusia have no definite relationship.


6. Digital audio basics<br />

The relationship between digital audio <strong>and</strong> analog audio is that of<br />

the finite <strong>and</strong> the infinite, respectively. A digital recording of music<br />

is derived from an analog signal, but it only represents it at a finite<br />

number of points in time. These points are said to be discrete, meaning they are separate, isolated instants rather than a continuum, and each is infinitesimally small: A point is zero-dimensional, while a line is one-dimensional.

Figure 6.1: A discrete function.<br />

Figure 6.2: A continuous function.<br />

In mathematics, the continuity <strong>and</strong> differentiability of a space are<br />

frequent concerns, <strong>and</strong> there are many properties that follow from<br />

functions that satisfy those conditions. Putting a song on<strong>to</strong> a computer<br />

or CD requires digitization, which produces a discrete, discontinuous representation of the signal. It is possible to perform Fourier transforms on continuous

audio signals like vinyl or live sound, but a computer can calculate the<br />

transform far more quickly <strong>and</strong> accurately than analog <strong>to</strong>ols.<br />

There are many advantages <strong>to</strong> the digital form versus the analog<br />

form. For one, material things deteriorate over time. Scratches, dirt,



<strong>and</strong> entropy eventually conquer physical records. Secondly, it is quick,<br />

cheap, <strong>and</strong> uncomplicated <strong>to</strong> replicate a digital version <strong>and</strong> s<strong>to</strong>re it in<br />

many places.<br />

<strong>An</strong> infinitely small interval of time (a point) is not perceptible by<br />

itself. However, when we provide tens of thous<strong>and</strong>s of these data<br />

points per second, our ears are fooled, <strong>and</strong> we hear something very<br />

much like the original analog signal.<br />

These data points are called samples, <strong>and</strong> the number of data points<br />

per second is the sampling rate, or sampling frequency. Sampling rate is<br />

usually given in hertz or samples per second because it periodically<br />

repeats like a frequency. However, though there are audible artifacts<br />

that result from the sampling rate, the sampling rate itself is never audible, as it is not a sine wave.

Sampling fundamentals <strong>and</strong> techniques are very important for the<br />

Fourier transform because this information must be known in order <strong>to</strong><br />

correctly specify its parameters.<br />

6.1 Sampling<br />

We encounter the word sampling in a few different places, but every<br />

usage means essentially the same thing: Sampling is the process of<br />

taking members from some set <strong>and</strong> using their identities <strong>to</strong> represent<br />

the whole set. We sample from a complete set because it is expensive,<br />

unnecessary, or even impossible <strong>to</strong> collect information about the whole<br />

set.<br />

When a sample set is a poor representation of a larger set, this will<br />

often mean that the size of the sample was <strong>to</strong>o small. It can also mean<br />

that sampling bias is at play, such as a sample of students from one<br />

high school taken <strong>to</strong> represent something about all high schools in the<br />

country. In dance music, DJs sample from music <strong>to</strong> point <strong>to</strong> a larger<br />

feature about the sampled music, like its artist, or beats per minute, or<br />

as contrast <strong>to</strong> the rest of the remix. Disk jockeys encounter sampling<br />

bias when they choose a sample <strong>to</strong>o short for listeners <strong>to</strong> recognize, or



a sample consisting of content that isn’t very characteristic of the song,<br />

like the bridge or coda.<br />

Figure 6.3: Sampling from a population (set A) produces a subset of A (set B).<br />

The sampling we do on continuous music signals is with the intention<br />

<strong>and</strong> purpose of representing the original form as closely as<br />

possible. The easiest way <strong>to</strong> do this is at a uniform, periodic rate, so<br />

that we don’t have <strong>to</strong> know anything about a song before sampling it<br />

in order <strong>to</strong> sample it well.<br />

Digital sampling transforms a continuous signal in<strong>to</strong> a discrete<br />

signal by using an impulse function.<br />

A single impulse is a vertical function with an infinitesimally small<br />

width, located at a single instant of time <strong>and</strong> zero everywhere besides<br />

at that instant. Its width is the width of a discrete point. The height of<br />

the impulse depends on the nature of the input signal: If the signal is<br />

continuous, we use the Dirac delta function <strong>to</strong> sample the signal, <strong>and</strong> if<br />

it is discrete, we use the Kronecker delta function.



Figure 6.4: The process of sampling: A signal gets multiplied by an impulse function<br />

<strong>to</strong> produce a sampled signal with sampling frequency equal <strong>to</strong> that of the impulse<br />

function.<br />

It is rare <strong>to</strong> find a (digital) signal processing text that explicitly<br />

states which delta function it is using at any given time, <strong>and</strong> unfortunately,<br />

they will sometimes be written identically with the Greek letter<br />

delta as δ(t). Some texts distinguish continuity from discreteness with<br />

parentheses for the domains of continuous functions <strong>and</strong> square brackets<br />

for discrete domains, i.e., δ(t) versus δ[t]. The Dirac delta function<br />

is a continuous function of time <strong>and</strong> it is used <strong>to</strong> sample continuous<br />

signals. It is defined<br />

\delta(t) = \begin{cases} \infty, & \text{if } t = 0 \\ 0, & \text{if } t \neq 0 \end{cases}

The Kronecker delta function is a discrete function of time used <strong>to</strong><br />

sample discrete functions, <strong>and</strong> has a similar definition:<br />

\delta[t] = \begin{cases} 1, & \text{if } t = 0 \\ 0, & \text{if } t \neq 0 \end{cases}

These functions can be placed anywhere along the time domain. For<br />

example, if we wanted an impulse located at t =2, then we specify the<br />

domain of the function a bit differently by shifting it <strong>to</strong> the right:<br />

\delta[t - 2] = \begin{cases} 1, & \text{if } t = 2 \\ 0, & \text{if } t \neq 2 \end{cases}



The Dirac delta function is continuous <strong>and</strong> integrable, <strong>and</strong><br />

\int_{-\infty}^{\infty} \delta(t - a)\,dt = 1

for any real number a. The Kronecker delta function is discrete <strong>and</strong><br />

therefore nonintegrable, <strong>and</strong> its global sum is equal <strong>to</strong> 1 because it is<br />

1 at only one point <strong>and</strong> 0 elsewhere. To sample a continuous input<br />

signal, we integrate its multiplication with a delta function centered at<br />

a desired point. Say that we want <strong>to</strong> do this in the input signal x(t) at<br />

t = a, <strong>and</strong> call the resulting sampled input function x s (t). Then,<br />

x_s(t) = \int_{-\infty}^{\infty} x(t)\,\delta(t - a)\,dt.

This outputs a continuous function that is 0 everywhere except at t = a,<br />

where its amplitude is exactly x(a). So the sampled function reflects<br />

the amplitude of the function x(t) at the points at which it is sampled.<br />

For music, we want <strong>to</strong> sample at a periodic rate because of the<br />

high volume of samples <strong>and</strong> audible range of frequencies. Applying<br />

a sampling rate <strong>to</strong> a musical signal outputs a function with uniformly<br />

spaced samples. To sample the function at multiple points,<br />

we simply sequence all of the individual integrals. For a sampling<br />

rate of 1/a, these would be the points centered at the sequence of times

{0, a, 2a, . . . , (N − 1)a}. In general, we compute a sampled function<br />

x s (t) by the sequence of integrals<br />

x_s(t) = \left\{ \int_{-\infty}^{\infty} x(t)\,\delta(t)\,dt,\ \int_{-\infty}^{\infty} x(t)\,\delta(t - a)\,dt,\ \int_{-\infty}^{\infty} x(t)\,\delta(t - 2a)\,dt,\ \ldots,\ \int_{-\infty}^{\infty} x(t)\,\delta(t - (N-1)a)\,dt \right\}.

In the discrete case, this is a very similar process. Instead of a sequence of integrals, we form a sequence of individual sums. A sampled



discrete function is then<br />

x_s[t] = \left\{ \sum_{t=-\infty}^{\infty} x[t]\,\delta[t],\ \sum_{t=-\infty}^{\infty} x[t]\,\delta[t - a],\ \ldots,\ \sum_{t=-\infty}^{\infty} x[t]\,\delta[t - (N-1)a] \right\}

where the kth term of x s [t] is equal <strong>to</strong><br />

x_s[k] = \sum_{t=-\infty}^{\infty} x[t]\,\delta[t - ka]

for k =0, 1, . . . , N − 1. The Kronecker delta is called a unit impulse,<br />

because "unit" implies unity, meaning 1 (like the unit circle). 1<br />

This<br />

method of sampling is called ideal sampling, <strong>and</strong> others include instantaneous<br />

<strong>and</strong> natural sampling.<br />

For example, let x(t) = cos(1.5πt) be our continuous signal, <strong>and</strong><br />

say that it only happens over the interval 0 ≤ t ≤ 2 seconds. Let

our sampling period T =0.5 seconds, so the sampling frequency is<br />

f s =1/T =2Hz. Then<br />

\int_{-\infty}^{\infty} x(t)\,\delta(t - 0)\,dt = \begin{cases} \cos(1.5 \cdot \pi \cdot 0) = 1, & \text{if } t = 0 \\ 0, & \text{otherwise} \end{cases}

\int_{-\infty}^{\infty} x(t)\,\delta(t - 0.5)\,dt = \begin{cases} \cos(1.5 \cdot \pi \cdot 0.5) = -\sqrt{2}/2, & \text{if } t = 0.5 \\ 0, & \text{otherwise} \end{cases}

\int_{-\infty}^{\infty} x(t)\,\delta(t - 1)\,dt = \begin{cases} \cos(1.5 \cdot \pi \cdot 1) = 0, & \text{if } t = 1 \\ 0, & \text{otherwise} \end{cases}

\int_{-\infty}^{\infty} x(t)\,\delta(t - 1.5)\,dt = \begin{cases} \cos(1.5 \cdot \pi \cdot 1.5) = \sqrt{2}/2, & \text{if } t = 1.5 \\ 0, & \text{otherwise} \end{cases}

1 This is also called a normalized or normal function. Audio for example is normalized<br />

<strong>to</strong> only take on amplitudes between −1 <strong>and</strong> 1.



Figure 6.5: Three different types of sampling of an analog signal, depicted in the <strong>to</strong>p<br />

left corner. Ideal sampling uses impulses, instantaneous sampling uses rectangles<br />

centered at the height of the analog signal, <strong>and</strong> natural sampling uses trapezoids with<br />

more coordinates of the curve. Pulse-code modulation (PCM), the type of sampling used<br />

by CDs, uses ideal sampling.<br />

so x_s(t) = {1, −√2/2, 0, √2/2}. Formally, this is the equation

i(t) = δ_D(t) + δ_D(t − 0.5) + δ_D(t − 1) + δ_D(t − 1.5).

The discrete version is calculated similarly by sums, with identical<br />

results. Therefore, the Kronecker delta <strong>and</strong> the Dirac delta are conceptually<br />

identical because they both effectively sample a function.<br />

From here on, we will just be using the notation δ(t), meaning Dirac in<br />

continuous cases (integrals) <strong>and</strong> Kronecker in discrete ones (sums).



t        x_s(t)
0         1
0.5      −0.707
1         0
1.5       0.707
2        −1
2.5       0.707
3         0
3.0001   does not exist
3.5      −0.707
4         1
4.5      −0.707
5         0

Table 6.1: The function x(t) = cos(1.5πt) sampled over the interval 0 ≤ t ≤ 5 with a sampling rate of 2 Hz.

Table 6.1 <strong>and</strong> the graph above it show the function cos(1.5πt)<br />

(whether continuous or discrete) sampled over the interval 0 ≤ t ≤ 5<br />

seconds with sampling rate f s =2Hz, i.e., T =0.5 seconds as above.<br />

We can also represent the multiplication by<br />

x_s(t) = x(t) \cdot \bigl(\delta(t) + \delta(t - T) + \delta(t - 2T) + \ldots\bigr) = x(t) \sum_{n=0}^{\infty} \delta(t - nT)

where x s (t) is the sampled signal. We do this <strong>to</strong> infinity <strong>to</strong> ensure that<br />

the entire signal, x(t), is sampled, but if the time domain’s endpoints<br />

are known it is sufficient <strong>to</strong> just sample over them. Sometimes you<br />

will see the impulse function expressed over both positive values <strong>and</strong>



negative values, expressed by<br />

i_s(t) = \ldots + \delta(t + T) + \delta(t) + \delta(t - T) + \ldots = \sum_{k=-\infty}^{\infty} \delta(t - kT),

but in general, a signal doesn’t begin before t =0, so these impulses<br />

over the negative time domain would return values of 0 from x(t).<br />

Their inclusion is unnecessary but doesn’t hurt, <strong>and</strong> leads <strong>to</strong> a more<br />

mathematically rigorous expression.<br />

Choosing a sampling rate<br />

A signal contains frequencies, <strong>and</strong> these frequencies are positively<br />

valued. We don’t care about frequencies that we cannot hear, <strong>and</strong><br />

therefore, we don’t need <strong>to</strong> sample for the frequencies outside the 20-<br />

20,000 Hz range. When we choose a sampling rate, we have <strong>to</strong> choose<br />

the maximum frequency that we want <strong>to</strong> detect through sampling.<br />

We call this f max , <strong>and</strong> we require the sampling frequency f s <strong>to</strong> be<br />

greater than twice the value of f max <strong>to</strong> effectively capture all audible<br />

frequencies present in a signal. So, a sine wave

must be sampled at least twice per period in order <strong>to</strong> be digitally<br />

represented. 2<br />

We talked above about how sampling has the ability <strong>to</strong> dis<strong>to</strong>rt<br />

signals, namely, when a low sampling rate is chosen. This is called undersampling.<br />

The resulting sampled signal misrepresents its frequency<br />

information by either failing to include certain frequencies at all, or by

misnaming them ("aliasing"), as shown later on in Figure 6.8.<br />

2 Technically, this should be more than twice per period: The only sine wave that<br />

could be represented with a sampling frequency equal <strong>to</strong> twice its frequency is a<br />

cosine wave with zero phase. Therefore, the probability of correctly sampling a signal<br />

with sampling frequency equal <strong>to</strong> two times the maximum frequency component<br />

approaches 0, because otherwise, the amplitudes do not accurately represent the true<br />

amplitudes of the original system. See the example in Table 6.2.



Fortunately for us, there is a theorem that explicitly states the minimum sampling rate above which a signal must be sampled to avoid undersampling.

The Nyquist-Shannon Sampling Theorem: If a signal<br />

x(t) contains no frequencies greater than f max cycles per<br />

second (Hz), then it is completely determined by a series of<br />

points spaced less than 1/(2 f_max) seconds apart. Then we can

choose the sampling frequency f s > 2f max <strong>and</strong> completely<br />

reconstruct the original signal.<br />

The minimum sampling frequency is often referred <strong>to</strong> as the Nyquist<br />

frequency, Nyquist rate, or Nyquist limit, <strong>and</strong> the period of the impulse<br />

function, given by 1/f_s, is the Nyquist period. The bandwidth β of a sampled

signal is greater than the difference between its highest frequency component<br />

<strong>and</strong> its lowest (which is 0 Hz), so β > f max . B<strong>and</strong>width refers<br />

<strong>to</strong> the frequency "b<strong>and</strong>s" that define the size of the range of frequencies<br />

in a signal or filter.<br />

This theorem says that a sampling rate of 1 Hz would not be able<br />

<strong>to</strong> detect frequencies equal <strong>to</strong> or greater than 0.5Hz. However, there is<br />

one scenario in which f s =2f max returns a sufficiently sampled signal:<br />

When the signal is a cosine wave with no phase shift, or a sine wave<br />

with a phase equal to π/2. Consider the signal x(t) = sin(0.5 · 2πt + π/2) = sin(πt + π/2) over the interval of time t = [0, 7].

So, x s (t) = (1, −1, 1, −1, 1, −1, 1, −1). From just this sampled signal,<br />

we can see that every odd-numbered sample is 1 and every even-numbered sample is −1,

so we deduce that it has a period of 2 seconds <strong>and</strong> a frequency of<br />

0.5 Hz. Because pressure changes are what cause vibrations within<br />

the cochlea, you should agree that this sampling rate could not detect<br />

a frequency higher than the one presented here, for the signal<br />

x(t) = sin(πt + π/2) sampled at 0.5 Hz returns the constant signal<br />

x s (t) = (1, 1, 1, 1, 1, 1, 1, 1) which is inaudible <strong>and</strong> frequency-less (0<br />

Hz).



t    x(t)
0     1
1    −1
2     1
3    −1
4     1
5    −1
6     1
7    −1

Table 6.2: The sine wave x(t) = sin(πt + π/2) [i.e., cos(πt)] sampled at 1 Hz. Any other phase shift of this sine wave would return incorrect amplitudes and indicate that the sine wave was multiplied by some constant less than 1. A sine wave with no phase shift, for example, would return samples of all zeros at this sampling rate.

In summary, the sampling rate is crucial <strong>to</strong> avoiding dis<strong>to</strong>rtion <strong>and</strong><br />

detecting all of the frequencies of a given signal. The sampling rate<br />

of CDs is 44,100 Hz, so the highest frequency that can be detected is<br />

22,050 Hz which is beyond our range of hearing.<br />

However, that is not <strong>to</strong> say that audio signals do not contain frequencies<br />

beyond 22,050 Hz—in fact, they absolutely do, we just can’t<br />

hear them. Professional audio is sampled at 48,000 Hz and high-definition audio at 96,000 Hz to include more frequencies and improve

the fidelity of the audio. At the cost of file size, these files can be<br />

processed with effects in digital audio workstations like Pro Tools <strong>and</strong><br />

still maintain a high amount of fidelity. Consider, for example, slowing<br />

the speed of a segment of audio sampled at 44.1 kHz by a fac<strong>to</strong>r of 2.<br />

This means that there will be only 22,050 samples per second retained<br />

from the original audio. At a higher sampling rate like 96 kHz, 48,000<br />

samples per second would be retained.<br />

In summary, when a signal is sampled at a frequency less than the<br />

Nyquist limit, undersampling happens <strong>and</strong> aliasing occurs.



Aliasing<br />

Aliasing refers <strong>to</strong> the incorrect mapping of a frequency component<br />

<strong>to</strong> another frequency component, specifically, mapping a frequency<br />

above the Nyquist limit <strong>to</strong> one below the Nyquist limit. The effect in<br />

audio can sound like a ringing or whistling, or as I call it, the Coke bottle<br />

effect, wherein the sound appears <strong>to</strong> be recorded in a large, reverberant<br />

room with low-frequency resonances. Aliasing can be easily avoided<br />

by filtering the audio file before sampling: A low-pass filter with a cut-off<br />

frequency set at f_s/2 will allow only frequencies less than the cutoff to

pass through, <strong>and</strong> frequencies above it will be diminished. This is also<br />

called an anti-aliasing filter.<br />

Figure 6.6: A low-pass filter with cu<strong>to</strong>ff frequency at 22,050 Hz can be used on audio<br />

data <strong>to</strong> reduce the possibility of aliasing. Amplitudes of frequencies greater than f c<br />

will be increasingly smaller.<br />

The graph in Figure 6.8 shows a sinusoid (call it x 1 ) aptly sampled<br />

by the given sampling frequency, <strong>and</strong> another sinusoid (x 2 ) of higher<br />

frequency that is inadequately sampled. Here, x 2 would be interpreted<br />

to have the same frequency as x_1 when in fact it has five times the frequency of x_1.

Here, the frequency of x 1 is 0.5 Hz, the frequency of x 2 is 2.5 Hz,<br />

<strong>and</strong> the sampling frequency is 2 Hz. The samples retrieved are identical



Figure 6.7: The above is a spectrogram of Amy Winehouse’s "Rehab." Note that most<br />

of the file is contained within a solid rectangle, <strong>and</strong> some parts leak above it. There<br />

are at least two different anti-aliasing filters applied here: One with a higher cu<strong>to</strong>ff<br />

frequency for Amy’s voice at 18,000 Hz, <strong>and</strong> the other applied <strong>to</strong> the rest at 16,000 Hz.<br />

Figure 6.8: At the specified sampling frequency, the samples retrieved from the two sinusoids are identical because the higher frequency component has been undersampled.

for both of the frequency components, so naturally, this is a source of<br />

confusion. The reconstructed frequency of x 2 would then be 0.5 Hz.<br />

Why is it called aliasing? When a signal is undersampled, the interval between samples T_s is too large to accurately detect frequencies greater than f_s/2, and will detect them incorrectly as shown above.

When the sample is reconstructed, it will reconstruct the 2.5 Hz sine<br />

wave as a 0.5 Hz sine wave. Hence, the identity of the sinusoid will be<br />

misrepresented (an "alias").
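The aliasing shown in Figure 6.8 can be verified numerically. The sketch below, in C, samples the 0.5 Hz and 2.5 Hz sinusoids at 2 Hz, the values from the example above, and prints identical sample values for both (variable names are mine).

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double PI = 3.14159265358979323846;
    double fs = 2.0;                           /* sampling frequency, Hz */
    for (int n = 0; n < 8; n++) {
        double t  = n / fs;
        double x1 = sin(2.0 * PI * 0.5 * t);   /* adequately sampled     */
        double x2 = sin(2.0 * PI * 2.5 * t);   /* undersampled: aliases  */
        printf("t = %.2f  x1 = %+.3f  x2 = %+.3f\n", t, x1, x2);
    }
    return 0;
}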



There is also such a thing as oversampling, which improves the fidelity of a sampled sound file but increases its file size. An anti-aliasing filter can only be so steep at its cutoff frequency, so the cutoff frequency must be somewhat less than the Nyquist limit in order to attenuate frequencies beyond it. Therefore, by setting the sampling frequency higher, we can raise the cutoff frequency of the filter. Furthermore, the resolution is improved, so it becomes easier to eliminate noise: increasing f_s increases the bandwidth, and the energy of the noise stays constant over all bandwidths. Increasing the size of the bandwidth therefore decreases the energy of the noise per unit division of the signal, and the signal-to-noise ratio improves when resolution is improved.

6.2 Compression<br />

To compress something means to make it more dense, reducing its volume by eliminating unused or unnecessary space.³ When you convert a WAV file to an MP3 file, the file size shrinks three-fold to twenty-fold depending on the fidelity of the result, and the resulting .mp3 sounds virtually identical. How can this be?

The compression of audio files is done via algorithms. For most audio purposes, algorithms are designed to take a file of data, discover trends in the data, and perform some redundancy removal to produce a smaller file.

A simple kind of algorithm—not necessarily applying to audio per se—is a sorting algorithm: For an input set of numbers, we put them in some designated order, like lowest to highest or vice versa. An inefficient way of doing this is to compare each number to all of the other numbers in the set, ordering the numbers accordingly.

³ The term is also used to mean dynamic range compression, an audio effect implementing sustain. This kind of compression is not related to data compression, which is the topic here.



For example,

{9, 5, 3, 6, 12, 1, −4, −112, 8} → {−112, −4, 1, 3, 5, 6, 8, 9, 12}.

So, for the first element of the set, 9, we would have eight questions to ask: Is 9 greater or less than 5? Is it greater or less than 3? And so on, until the whole set had been compared to 9. Then for the next element, 5, we would have to ask seven questions, because it has already been compared to the first entry, 9: Is it greater or less than 3? Is it greater or less than 6? The third element would require six questions, the fourth five questions, and so on, so a total of 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 36 questions are required.
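As a sketch of the counting argument above (an illustration only, not a recommended sorting algorithm), the following C program orders the same nine-number set by comparing each element with every later element, counting the questions asked along the way; it reports 36 comparisons.

```c
/* compare_count.c -- sort the example set by comparing every pair once
 * and count the comparisons: for N = 9 this is 8 + 7 + ... + 1 = 36.
 * Compile: cc compare_count.c -o compare_count                           */
#include <stdio.h>

int main(void) {
    int a[] = {9, 5, 3, 6, 12, 1, -4, -112, 8};
    int n = (int)(sizeof a / sizeof a[0]);
    long comparisons = 0;

    for (int i = 0; i < n - 1; i++) {
        for (int j = i + 1; j < n; j++) {
            comparisons++;            /* "is a[i] greater or less than a[j]?" */
            if (a[j] < a[i]) {        /* keep the smaller value in front      */
                int tmp = a[i];
                a[i] = a[j];
                a[j] = tmp;
            }
        }
    }

    printf("sorted:");
    for (int i = 0; i < n; i++)
        printf(" %d", a[i]);
    printf("\ncomparisons: %ld\n", comparisons);
    return 0;
}
```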

In the worst-case scenario in which we compare every number to every number, this requires roughly N^2 operations, where N is the size of the input set. The quick-sort algorithm improves upon this, and does it on average in about N log_2 N operations. This is a great improvement—just consider the difference between 256^2 = 65,536 and 256 log_2 256 = 2048, and 256 samples isn't even close to one second of audio.

Explaining general algorithm design is a whole other book—even the 13 most popular sorting algorithms would take dozens of pages to explain, and sorting is perhaps the simplest algorithm to explain conceptually. But alas, the discrete Fourier transform and the host of transforms that branch out from it are all algorithms. We will show that the discrete Fourier transform, performed literally from its definition, requires N^2 computations (N is the total number of samples of x_s(t)), while the fast Fourier transform requires only N log_2 N of them.

Uncompressed audio is ideal for editing and putting effects on music, and processing almost always lowers the fidelity (quality) of the data. In addition to the sampling frequency and the length of a song, there are several other variables that will determine the total size of the information contained in an uncompressed audio file.



1. The number of channels, C: This is 1 for mono and 2 for stereo. It is unlikely that an audio file will have more than 2 because sound systems typically consist of only 2 speakers, but Dolby Surround Sound used on DVDs, for example, has 6 channels (though it is notated 5.1: 5 channels and 1 subwoofer). Audio CDs can only have 2 channels.

2. The number of bits per sample, b: This is called the bit depth, and it defines how many bits are used to store each sample. Each sample is a binary number whose length in bits is the bit depth, and whose value expresses the instantaneous amplitude. The bit depth specifies the resolution of the dynamic range of a sampled audio file, i.e., the range of intensities that the sample can take on. Because of the nature of binary, increasing the bit depth by 1 bit doubles the resolution.⁴ We convert this to decibels using the equivalence 20 log_10 2 = 6.02 dB, so an increase of 1 bit adds about 6 decibels to the dynamic range. The dynamic range of the average human ear spans approximately 140 dB SPL, so a 24-bit depth that yields 144 dB in dynamic range is ideal. For ease of computation, a 16-bit depth (96 dB, the bit depth of audio CDs) is decent, but the greater the bit depth, the higher the resolution and hence fidelity of the resulting file, meaning less noise and less truncation of a song's samples.

The size of the file in bytes, where one byte is 8 bits, can then be calculated as

|file| = f_s · C · b · T / 8



where T is the duration in seconds of the song. There are a few common formats of uncompressed (lossless) audio files; they are ordered by their popularity in Table 6.3.

File format                      Extension   f_s         Bit depth
Waveform Audio File Format       .wav        44,100 Hz   16-bit
Audio Interchange File Format    .aiff       44,100 Hz   16-bit
Au                               .au         8,000 Hz    32-bit

Table 6.3: Uncompressed (lossless) audio file formats, ordered by frequency of usage.

So the size of a 180-second .wav or .aiff file would be

f_s · C · b · T / 8 = 44,100 · 2 · 16 · 180 / 8
                    = 31,752,000 bytes
                    = 31,752,000 / 1024^2 = 30.28 MB,

while the size of a 180-second .au file would be approximately 11 MB. We divide by 1024^2 because 1 kilobyte (kB) = 1024 bytes, and 1 megabyte (MB) = 1024 kB.
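The same arithmetic can be wrapped in a few lines of C. This is a minimal sketch of the formula |file| = f_s · C · b · T / 8; the stereo, 32-bit assumption for the .au example is mine, chosen because it reproduces the roughly 11 MB figure above.

```c
/* wav_size.c -- size of an uncompressed audio file from fs, C, b and T.
 * Compile: cc wav_size.c -o wav_size                                     */
#include <stdio.h>

/* |file| = fs * C * b * T / 8, in bytes */
static double file_size_bytes(double fs, double C, double b, double T) {
    return fs * C * b * T / 8.0;
}

int main(void) {
    double wav = file_size_bytes(44100.0, 2.0, 16.0, 180.0);
    double au  = file_size_bytes(8000.0, 2.0, 32.0, 180.0); /* assuming a stereo .au file */

    printf(".wav/.aiff: %12.0f bytes = %.2f MB\n", wav, wav / (1024.0 * 1024.0));
    printf(".au:        %12.0f bytes = %.2f MB\n", au,  au  / (1024.0 * 1024.0));
    return 0;
}
```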

Pulse-code modulation (PCM)

The standard method for digitally sampling analog signals is pulse-code modulation, abbreviated PCM. Pulse-code modulation applies a pulse to an analog signal at regular, uniform intervals, exactly as described previously in this chapter. This is also called quantization, which means the division of something into equally sized parts (the term is also used in the context of looping and metric beat creation).

In digital sampling, the range of values that an amplitude can take on is limited by the bit depth of the digital file. The amplitudes of the analog signal are rounded at each pulse to the nearest binary value.

⁴ In base-10, for example, increasing the number of places (hundreds, tens, ones, etc.) increases the number of possible values ten-fold, i.e., 0-99 (100 values) versus 0-999 (1000 values).



A bit depth of 16, for example, means that amplitudes can take on 2^16, or 65,536, different values. We call the difference between the amplitudes of an analog signal and its digital representation the quantization error, or quantization distortion.

PCM is the technique used by WAV and AIFF files, as well as in compressed formats like MP3, Ogg Vorbis, and WMA. Pulse-code modulation is applied before compression.
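The rounding step itself is easy to sketch in C. The mapping below (scaling a value in [−1, 1] by 2^15 − 1 and rounding to the nearest integer) is one common convention, not the only one; the printed error is the quantization error for each sample.

```c
/* quantize_demo.c -- round amplitudes in [-1, 1] to 16-bit codes and show
 * the quantization error introduced by the rounding.
 * Compile: cc quantize_demo.c -o quantize_demo -lm                        */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double scale = 32767.0;   /* 2^15 - 1: largest positive 16-bit code */
    double samples[] = {0.0, 0.707107, -0.333333, 0.999999, -1.0};
    int n = (int)(sizeof samples / sizeof samples[0]);

    for (int i = 0; i < n; i++) {
        double x = samples[i];
        long   code = lround(x * scale);   /* nearest 16-bit integer code   */
        double y    = code / scale;        /* amplitude the code represents */
        printf("x = %+9.6f  ->  code %+6ld  ->  %+9.6f   (error %+.2e)\n",
               x, code, y, x - y);
    }
    return 0;
}
```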

Resource Interchange File Format

The Resource Interchange File Format (RIFF) encompasses the WAVE file format. WAVE readers in MATLAB and Mathematica make inputting WAVE audio simple, but the files contain more than just audio data. At the very beginning (the header) of a RIFF file are 44 bytes of information about the file, such as its bit depth, sampling rate, and format. The information is organized in 2- and 4-byte fields. They are read according to their endianness: Little endian means that the data is written with the least significant byte first. So, the number 18 is expressed in little endian as "00010010 00000000 00000000 00000000" in binary (spaces separating the bytes) and "12 00 00 00" in hexadecimal (four bytes). In big endian, 18 would be written "00000000 00000000 00000000 00010010" in binary and "00 00 00 12" in hexadecimal [23]. Fields of the little endian type correspond to numerical, integer quantities, while fields of the big endian type relate to ASCII (short for "American Standard Code for Information Interchange," mapping characters of the English alphabet to numbers).

The header fields and the raw sound data of a RIFF file are specified byte by byte (one byte is 8 bits). In Tables 6.4 and 6.5, the fields are given in hexadecimal [23].



Type      Position (bytes)   Field name        Field size (bytes)
ASCII     0                  ChunkID           4
integer   4                  ChunkSize         4
ASCII     8                  Format            4
ASCII     12                 Subchunk1ID       4
integer   16                 Subchunk1Size     4
integer   20                 AudioFormat       2
integer   22                 NumChannels       2
integer   24                 SampleRate        4
integer   28                 ByteRate          4
integer   32                 BlockAlign        2
integer   34                 BitsPerSample     2
ASCII     36                 Subchunk2ID       4
integer   40                 Subchunk2Size     4
integer   44                 actual raw data   varies

Table 6.4: Positions of different fields in RIFF encoding.
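To see these fields in a real file, the little sketch below reads the first 44 bytes of a WAVE file and decodes the little-endian integer fields of Table 6.4. It assumes the canonical 44-byte header (a 16-byte "fmt " chunk followed immediately by the "data" chunk), which holds for plain PCM files but not for every WAVE file in the wild.

```c
/* wav_header.c -- print the fields of a canonical 44-byte RIFF/WAVE header.
 * Compile: cc wav_header.c -o wav_header      Run: ./wav_header song.wav  */
#include <stdio.h>

/* assemble a little-endian unsigned integer from n bytes starting at p */
static unsigned long le(const unsigned char *p, int n) {
    unsigned long v = 0;
    for (int i = n - 1; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

int main(int argc, char **argv) {
    unsigned char h[44];
    FILE *f;

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL) {
        fprintf(stderr, "usage: wav_header file.wav\n");
        return 1;
    }
    if (fread(h, 1, 44, f) != 44) {
        fprintf(stderr, "file too short for a 44-byte header\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("ChunkID        %.4s\n", (const char *)h);        /* "RIFF" */
    printf("ChunkSize      %lu\n",  le(h + 4, 4));
    printf("Format         %.4s\n", (const char *)(h + 8));  /* "WAVE" */
    printf("AudioFormat    %lu (1 = PCM)\n", le(h + 20, 2));
    printf("NumChannels    %lu\n",  le(h + 22, 2));
    printf("SampleRate     %lu Hz\n", le(h + 24, 4));
    printf("ByteRate       %lu\n",  le(h + 28, 4));
    printf("BlockAlign     %lu\n",  le(h + 32, 2));
    printf("BitsPerSample  %lu\n",  le(h + 34, 2));
    printf("Subchunk2Size  %lu\n",  le(h + 40, 4));
    return 0;
}
```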

There are two classes of efficient algorithms for the compression of audio files: Lossless and lossy compression algorithms. The first, when decompressed, returns the exact original file with none of the data removed, but the second cannot do this.

Lossless compression

Lossless compression retains enough data from an uncompressed file that the entire original file can be reproduced when the file is decompressed. Lossy compression, however, retains only a portion of the raw data and cannot reproduce the original file upon decompression. ZIP files (usually used for lossless compression of text or text-like data) are generated by lossless compression, and "unzipping" them gives back the uncompressed file, or folder of files. Lossless files are greater in fidelity and size than lossy files, and better for performing audio editing and effects upon.

The code or hardware that performs the compression and decompression is called a codec, short for coder–decoder (or compressor–decompressor).



Field name      Description                                        Example (WAVE)
ChunkID         The letters "RIFF" in ASCII form                   52 49 46 46
ChunkSize       The size of the entire file in bytes, minus 8      N + 36
Format          The type of the format in ASCII, like "WAVE"       57 41 56 45
Subchunk1ID     The letters "fmt " plus a space, in ASCII          66 6d 74 20
Subchunk1Size   The size of the rest of the subchunk               10 00 00 00 (16)
                (until Subchunk2ID); 16 for PCM
AudioFormat     The form of sampling; 1 for PCM                    01 00 (1)
NumChannels     1 for mono, 2 for stereo, and so on                02 00 (2, for stereo)
SampleRate      f_s                                                44 AC 00 00 (44,100 Hz)
ByteRate        SampleRate * NumChannels * BitsPerSample/8         10 B1 02 00 (176,400)
BlockAlign      NumChannels * BitsPerSample/8                      04 00 (4)
BitsPerSample   The bit depth                                      10 00 (16)
Subchunk2ID     The letters "data" in ASCII                        64 61 74 61
Subchunk2Size   NumSamples * NumChannels * BitsPerSample/8         N
data            The raw sound data                                 —

Table 6.5: The format of RIFF encoding.

The most popular lossless codec is probably the FLAC format, which stands for Free Lossless Audio Codec. Because the amount of compression achieved depends on the content of the file, there is no way to calculate the exact size of the resulting file in advance. FLAC files are typically 30-50 percent of the size of the original, where more repetitive songs fall in the low end of this range.

Lossy compression<br />

Lossy compression works by discarding some of the data from an<br />

uncompressed file <strong>to</strong> produce an encoded file that is 5-20 percent<br />

the size of the original file. One of the techniques it uses <strong>to</strong> evaluate<br />

what data can be discarded is analysis of frequencies. It uses the



Fletcher–Munson curve and the psychoacoustic notion of critical bands to reduce the bit depth of frequencies closer to the extremes of our hearing range, eliminate sounds too quiet to hear, and discard frequencies that would not be perceived because of masking by other frequencies within their critical bands. In this way, the Fletcher–Munson curve can be thought of as a probability density function that determines the probability of a given frequency in some file at a specific time. Similar to lossless compression, frequencies like those between 30 and 5000 Hz (approximately the range of the piano) will have a high probability and therefore a higher resolution, and frequencies outside of this range will have low probability and a corresponding lower resolution. Those with lower probabilities are then given a low bit depth, or even discarded altogether when given a bit depth of 0.

Therefore, the Fourier transform is a typical component of lossy compression algorithms, because frequency information from the music is easier to approximate than is, say, the actual signal data, and part of the reason is the way the human ear perceives frequencies. Video files can be compressed using lossy algorithms as well. Websites that stream video, like Hulu and YouTube, use lossy algorithms to play video in real- or better-than-real-time (i.e., the buffer fills ahead of time). Where squares of similar colors appear in images and video, compression is acting upon the detail and resolution: The more noticeable and artificial these squares are, the more excessive and extreme the compression is. As the Internet's bandwidth grows, the necessary amount of compression for live streaming diminishes.

The quantity of data discarded is inversely proportional to the specified bit rate: The higher the bit rate, the less data is removed and the smaller the quantization error. The bit rate is f_s · C · b, the same variables from above in our calculation of the size of uncompressed audio files. Therefore, bit rate corresponds to fidelity. Note that a bit rate is in bits, not bytes: A bit rate of 100 kbps (kilobits per second) corresponds to 100,000 bits per second, not bytes per second. The size of the resulting encoded file is proportional to the bit rate of the codec.
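Since the encoded size scales with the bit rate, a back-of-the-envelope estimate needs only the bit rate and the duration. Here is a small C sketch using the 180-second song from earlier in the chapter and an assumed 128 kbps encoding; real encoder output varies with content, VBR decisions, and container overhead.

```c
/* bitrate_size.c -- compare the uncompressed bit rate fs*C*b with a lossy
 * encoding bit rate, and the file sizes each implies for a 180 s song.
 * Compile: cc bitrate_size.c -o bitrate_size                              */
#include <stdio.h>

int main(void) {
    const double fs = 44100.0, C = 2.0, b = 16.0, T = 180.0;
    double pcm_bps = fs * C * b;      /* uncompressed bit rate (bits/second) */
    double mp3_bps = 128000.0;        /* an assumed 128 kbps MP3 encoding    */

    double pcm_bytes = pcm_bps * T / 8.0;
    double mp3_bytes = mp3_bps * T / 8.0;

    printf("uncompressed: %8.0f bps  ->  %6.2f MB\n",
           pcm_bps, pcm_bytes / (1024.0 * 1024.0));
    printf("128 kbps MP3: %8.0f bps  ->  %6.2f MB  (%.1f%% of the original)\n",
           mp3_bps, mp3_bytes / (1024.0 * 1024.0), 100.0 * mp3_bytes / pcm_bytes);
    return 0;
}
```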



Codec                        File extension   Bit rate       f_s
MPEG-1 Audio Layer III       .mp3             32-320 kbps    32-48 kHz
MPEG-2 Audio Layer III       .mp3             8-160 kbps     16-24 kHz
Advanced Audio Coding        .aac             8-320 kbps     8-96 kHz
Windows Media Audio Lossy    .wma             32-768 kbps    8-48 kHz
Ogg Vorbis                   .ogg             16-500 kbps    8-192 kHz

Table 6.6: The different bit rates and sampling rates of common lossy-compressed audio file formats.

An MP3 file with a common bit rate of 128 kbps will be around 11% of the size of the original file. You sometimes have the option to change the bit rate at which AAC files and MP3 files are imported to your computer using Variable Bit Rate (VBR) encoding. Some popular formats are given in Table 6.6.

Constant Bit Rate (CBR) encoding encodes an audio file at a specified, constant bit rate, while VBR encoding encodes an audio file at different bit rates depending on its content, using the probabilistic scheme explained above but with respect to the amplitude in dB SPL.

6.3 Chapter summary<br />

Sampling a piece of continuous audio gives a discrete representation of it, which is necessary for any kind of practical analysis. In order to adequately sample a piece of audio, we need to know its maximum frequency component. We sample a continuous signal by multiplying it with an impulse train to get a discrete set of points. We want to retrieve a sufficient number of these points to avoid distortion resulting from undersampling, or aliasing. We can avoid this by choosing a sampling frequency above the Nyquist limit—equal to 2 times the maximum frequency component of the original audio—and applying a low-pass, anti-aliasing filter to eliminate frequencies above half of that sampling frequency before sampling. In other words, a waveform can be accurately reconstructed only when more than two samples per period of each frequency component are taken. Because we cannot hear beyond about 20,000 Hz, a sampling rate or sampling frequency of 44,100 Hz is sufficient.

We can oversample audio to improve its resolution and signal-to-noise ratio, but the higher the sampling frequency, the larger the resulting file. Compression algorithms work to reduce file size while retaining either all (in lossless compression) or only part (in lossy compression) of the original data in the compressed file. They do this by making decisions derived from probabilities found in frequency and amplitude analysis of the original file.


7. The discrete Fourier transform<br />

The Fourier transform was born of the Fourier series, invented by Jean Baptiste Joseph Fourier (1768-1830). Fourier was foremost a scientist: Virtually all of his mathematical findings are results of scientific investigations. Amazingly, nearly all branches of the physical and even social sciences have some connection to and foundation upon Fourier analysis, as the Fourier transform can detect repetitive behavior in any sort of dataset. It was during an investigation of thermodynamics that Fourier began formulating the principle of superposition, also called the Fourier series.

7.1 The Fourier series<br />

The Fourier series decomposes any periodic function into a series of simple periodic functions. For a signal that is just a pure tone, representable by a single sinusoid A sin(ωt + φ) or A sin(2πft + φ), the series is just this sinusoid. But as we have seen by now, musical signals are virtually always more complicated than pure sine waves.

When a signal has aperiodic components or when it is finite in duration (as all signals are in reality), its Fourier series will be infinite, as will the domain of its Fourier transform. Conversely, when a signal is completely periodic and when it has (or we can assume that it has) an infinite time domain, its Fourier series will be finite, and its Fourier transform may be specified over a finite frequency domain.

A series in mathematics is defined as the sum of a sequence of terms, which is of the form

\sum_{k=\mathrm{start}}^{\mathrm{finish}} a_k.

The capitalized sigma Σ indicates a sum is to be taken, and the sequence is

{a_k} = {a_start, a_start+1, ..., a_finish−1, a_finish}.

The term start is the index of the initial value of {a_k}, often k = 0, 1, or −∞, and the term finish is the index of its final value, often ∞ or some function of N.

A straightforward complex and periodic function was looked at in Chapter 1: The sum of the two sinusoids x_1(t) = (1/2) sin(2πt) and x_2(t) = (1/2) sin(4πt). We write the Fourier series of their resultant wave x(t) as

x(t) = \frac{1}{2} \sum_{k=1}^{2} \sin(2\pi k t).

Simple, non-sinusoidal waveforms like the sawtooth wave, triangle wave, and square wave have Fourier series representations that use Σ. These are infinite Fourier series because their waveform is not smooth like a sine wave, so a finite sum of sine waves can only approximate their linear nature. The infinite Fourier series that represents a sawtooth wave is given by

x(t) = \frac{2}{\pi} \sum_{k=1}^{\infty} \frac{\sin(2\pi k f t)}{k}
     = \frac{2}{\pi} \left[ \sin(2\pi f t) + \frac{1}{2}\sin(4\pi f t) + \frac{1}{3}\sin(6\pi f t) + \ldots \right]
     = \frac{2}{\pi} \left[ \sin(\omega t) + \frac{1}{2}\sin(2\omega t) + \frac{1}{3}\sin(3\omega t) + \ldots \right].

Because our fundamental frequency ω is multiplied by every integer 1, 2, 3, ..., all integer multiples of the fundamental are represented in the harmonic overtone series. The following graphs let our frequency f equal 1 Hz.



Figure 7.1: x(t) = (2/π) sin(2πt)

Figure 7.2: x(t) = (2/π)[sin(2πt) + (1/2) sin(4πt)]

Figure 7.3: x(t) = (2/π)[sin(2πt) + (1/2) sin(4πt) + (1/3) sin(6πt)]

Figure 7.4: The first six terms of x(t), i.e., the sum (2/π) \sum_{k=1}^{6} \sin(2\pi k t)/k.

These graphs show that the longer the Fourier series extends, the<br />

better it approximates a given periodic function.
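The partial sums plotted in Figures 7.1–7.4 are easy to reproduce numerically. Below is a short C sketch (f = 1 Hz, as in the figures) that evaluates the sawtooth series truncated to 1, 2, 3, and 6 terms over one period; adding terms visibly sharpens the ramp.

```c
/* sawtooth_partial.c -- partial sums of the sawtooth Fourier series
 * x(t) = (2/pi) * sum_{k=1..K} sin(2*pi*k*t)/k  with f = 1 Hz.
 * Compile: cc sawtooth_partial.c -o sawtooth_partial -lm                  */
#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979323846

/* K-term partial sum of the sawtooth series at time t (f = 1 Hz) */
static double sawtooth_sum(double t, int K) {
    double x = 0.0;
    for (int k = 1; k <= K; k++)
        x += sin(2.0 * PI * k * t) / k;
    return (2.0 / PI) * x;
}

int main(void) {
    int terms[] = {1, 2, 3, 6};
    printf("   t       K=1        K=2        K=3        K=6\n");
    for (int i = 0; i <= 20; i++) {
        double t = i / 20.0;              /* one period: 0 <= t <= 1 */
        printf("%5.2f", t);
        for (int j = 0; j < 4; j++)
            printf("  %9.5f", sawtooth_sum(t, terms[j]));
        printf("\n");
    }
    return 0;
}
```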



The following infinite Fourier series represents a triangle wave:

x(t) = \frac{8}{\pi^2} \sum_{k=0}^{\infty} (-1)^k \frac{\sin[(2k+1)\, 2\pi f t]}{(2k+1)^2}
     = \frac{8}{\pi^2} \left[ \sin(2\pi f t) - \frac{1}{9}\sin(6\pi f t) + \frac{1}{25}\sin(10\pi f t) - \ldots \right].

Figure 7.5: The first six terms of the infinite Fourier series representation of a triangle wave. The result is an astonishingly close approximation, probably due to its similar symmetry to that of the sine wave, versus the asymmetry of sawtooth waves and phasors.

The infinite Fourier series representing a square wave is given by

x(t) = \frac{4}{\pi} \sum_{k=1}^{\infty} \frac{\sin[(2k-1)\, 2\pi f t]}{2k-1}
     = \frac{4}{\pi} \left[ \sin(2\pi f t) + \frac{1}{3}\sin(6\pi f t) + \frac{1}{5}\sin(10\pi f t) + \ldots \right]
     = \frac{4}{\pi} \left[ \sin(\omega t) + \frac{1}{3}\sin(3\omega t) + \frac{1}{5}\sin(5\omega t) + \ldots \right].

Here we see only odd-integer multiples of the fundamental ω represented in the harmonic overtone series. As discussed in Chapter 4, distortion pedals add harmonics to the sine waves present in a signal by amplifying the signal and cutting off (clipping, also dynamic compression) the tops and bottoms of its crests and troughs, shaping it into a square wave. Figure 4.23 of clipping on page 94 shows the graph of the first six terms of the Fourier series representing a square wave.



The fact that non-sinusoidal functions like these can be decomposed into a Fourier series demonstrates that (within reason—we avoid pathological functions that are not music-like) every function has a Fourier series representation. Non-sinusoidal functions simply have infinite series representations.

When we deal with real-world musical signals, we usually cannot use the sigma (Σ) in our formula. This is because the amplitudes of the individual sine waves are not solely dependent on the amplitude of the fundamental frequency, and therefore cannot be expressed in a perfectly recursive or algorithmic manner. Very generally, it is true of many musical instruments that the power of the overtones will be weaker than the fundamental, and their individual strengths decrease as their partial number increases [76], but there are instruments like the oboe and bassoon that do not obey this rule.

Suppose that an instrument's timbre is such that each of its harmonic overtones is half the strength of the previous overtone, i.e., x_0(t) = sin(2πft), x_1(t) = (1/2) sin(4πft), x_2(t) = (1/4) sin(6πft), and so on. Then the Fourier series representing the whole overtone series, where the signal x(t) contains the sinusoids x_i(t) in the sequence {x(t)} = {x_0(t), x_1(t), x_2(t), ...}, is

x(t) = \sin(2\pi f t) + \frac{1}{2}\sin(4\pi f t) + \frac{1}{4}\sin(6\pi f t) + \ldots
     = \sum_{k=0}^{\infty} \frac{1}{2^k} \sin[2\pi f t\, (k+1)]
     = \sum_{k=0}^{\infty} \frac{1}{2^k} \sin[\omega t\, (k+1)].

Since the physicality of instruments affects the timbre in many ways, no musical instrument in reality will have this exact overtone series. However, the general form of the series is not far off from the actual overtone series of instruments, especially the highly harmonic overtones of the wind instruments, with differences reflecting the nature of the physical constraints.



The trigonometric functions of cosine and sine can themselves be decomposed into mathematical series, known as Taylor and Maclaurin series. The sine function can be deconstructed by the series

\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \ldots = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{(2k+1)!}

and the series of the cosine function is

\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \ldots = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k}}{(2k)!}.

The syntax "!" in mathematics is the factorial operator, where n! = n · (n−1) · (n−2) · ... · 1. For n = 0, the operation is defined as 0! = 1.
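To see how quickly these power series converge, here is a minimal C sketch that sums the first few terms of the sine series at x = 1 radian and compares each partial sum against the library sin(); each term is built from the previous one using the factorials defined above.

```c
/* taylor_sin.c -- partial sums of sin(x) = sum (-1)^k x^(2k+1)/(2k+1)!
 * compared with the library sin().
 * Compile: cc taylor_sin.c -o taylor_sin -lm                              */
#include <stdio.h>
#include <math.h>

/* sum of terms k = 0 .. K; term k is the previous term times -x^2/((2k)(2k+1)) */
static double taylor_sin(double x, int K) {
    double term = x, sum = x;
    for (int k = 1; k <= K; k++) {
        term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
        sum  += term;
    }
    return sum;
}

int main(void) {
    double x = 1.0;   /* radians */
    for (int K = 0; K <= 5; K++)
        printf("K = %d:  %.10f   (error %+.2e)\n",
               K, taylor_sin(x, K), taylor_sin(x, K) - sin(x));
    return 0;
}
```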

The difference between these power series and the Fourier series is evident. Because pitched sound contains sinusoidal waves, and the human ear is a kind of frequency detector, it is more important to focus on the Fourier series here.

However, the power series of the sine and cosine functions do lead to an important identity in mathematics—Euler's identity.

7.2 Euler’s formula<br />

Leonhard Euler (1707-1783) is behind much of modern mathematics and physics, ranging from analysis to astronomy. Euler's number, e, is the base of the natural logarithm. Written ln(x), the natural log is equivalent to log_e(x), where e is the quantity 2.71828..., continuing forever. Euler's number is very special, because the derivative of the function e^x returns the same function, i.e.,

\frac{d}{dx} e^x = e^x.



The magic doesn't stop there: The function e^{iω}, where ω is an angular frequency (i.e., radians/second), describes positions along the unit circle in the complex plane, as given by Euler's identity.

Euler's identity: For Euler's number e = 2.71828..., the complex quantity i = √−1, and the ratio of the circumference of a circle to its diameter π = 3.14159...,

e^{iπ} + 1 = 0.

When we vary the exponent of e by ω, we are effectively moving along the unit circle ω-many radians, and e^{iω} will equal the sum of the coordinates corresponding to the position on the circle. When ω is positive, this is counter-clockwise motion, and when it is negative, we move clockwise.

Figure 7.6: The unit circle in the complex plane (also called the z-plane) is real-valued along the horizontal axis and imaginary-valued along the vertical axis, so points in this plane will be of the form (Re(ω), Im(ω)). To determine e^{iω}, simply move counterclockwise, starting at the right-most point of the circle, ω-many radians. For example, e^{iπ/4} corresponds to π/4 radians (45°), which is at the point (√2/2, i√2/2). Therefore, e^{iπ/4} = √2/2 + i√2/2. This is also the sum cos(π/4) + i sin(π/4).



Figure 7.7: To determine e^{−iω}, simply move clockwise, starting at the right-most point (0 radians) of the circle, ω-many radians. For example, e^{−iπ/4} corresponds to −π/4 radians (−45°), which has coordinate (√2/2, −i√2/2). Therefore, e^{−iπ/4} = √2/2 − i√2/2, which is equivalent to cos(−π/4) + i sin(−π/4).

As you can see, e^{iω} can be written as a function of sin(ω) and cos(ω), where the sin(ω) component is multiplied by the imaginary number i. The behavior of e^{iω} with respect to the horizontal axis can be modeled by a cosine wave, and its behavior with respect to the vertical axis by an imaginary sine wave. Because the inner product¹ of each root of unity e^{iω_k t} with any other root of unity e^{iω_l t} is zero, we say that the roots of unity form an orthogonal basis. Furthermore, because the magnitude of the roots of unity |e^{iωt}| is 1 for all ω, they are also normal, or normalized, functions, and hence they also define an orthonormal basis.

¹ The sum of the products of one vector's entries with another's complex conjugates. From [26], we write the inner product of two roots of unity W_N^k and W_N^l as

\langle W_N^k, W_N^l \rangle = \sum_{t=0}^{N-1} W_N^k \hat{W}_N^l = \sum_{t=0}^{N-1} e^{\frac{i 2\pi k t}{N}} e^{-\frac{i 2\pi l t}{N}} = \sum_{t=0}^{N-1} e^{\frac{i 2\pi (k-l) t}{N}} = \frac{1 - e^{i 2\pi (k-l)}}{1 - e^{i 2\pi (k-l)/N}},

which is equal to 0 when k ≠ l. From linear algebra, this means that the two vectors e^{iω_k t} and e^{iω_l t} are linearly independent and therefore form a basis, and furthermore, we can conclude that these two sinusoids are orthogonal.



The function e^{iω} is a complex function known as Euler's formula.

Euler's formula: For Euler's number e = 2.71828..., the complex quantity i = √−1, and any real number ω,

e^{iω} = cos(ω) + i sin(ω)   and
e^{−iω} = cos(ω) − i sin(ω).

When we vary ω, we proportionally vary the arguments of cosine and sine, so ω represents the angular frequency in radians per second. For example, doubling the exponent e^{iω} to e^{i2ω} doubles the frequency of the sine and cosine functions from cos(ω) + i sin(ω) to cos(2ω) + i sin(2ω). We use angular frequency instead of frequency in hertz because ω is equal to 2πf, and one period is therefore the time that it takes for e^{iω} to go around the unit circle once. However, when we state the discrete Fourier transform, the exponent of e will be written with the expanded notation 2πk to refer to the angular frequency ω_k.

Likewise, sin(ω) and cos(ω) can be rewritten in terms of e^{iω}, where ω is a real number:

\sin(\omega) = \frac{e^{i\omega} - e^{-i\omega}}{2i} = \frac{i}{2}\left(e^{-i\omega} - e^{i\omega}\right), \quad \text{because } \frac{1}{i} = \frac{-i^2}{i} = -i,

\cos(\omega) = \frac{e^{i\omega} + e^{-i\omega}}{2}.
2



Let us inspect some values of Euler's formula:

e^{iπ/2} = cos(π/2) + i sin(π/2) = 0 + i · 1 = i
e^{iπ}   = cos(π) + i sin(π) = −1 + i · 0 = −1
e^{iπ/4} = cos(π/4) + i sin(π/4) = √2/2 + i √2/2
e^{i2π}  = cos(2π) + i sin(2π) = 1 + i · 0 = 1 = e^{i·0} = e^{i2kπ}, k = 0, 1, 2, ...

All of these values are called roots of unity, because their magnitudes are all equal to 1.² The magnitude of a complex number is

|a + bi| = √(a² + b²),

so the magnitude of e^{iω} is 1 for all ω. This also results from the trigonometric identity stating that the magnitude |cos ω + i sin ω| = √(cos²ω + sin²ω) = 1 (and therefore, cos²ω + sin²ω = 1). Quickly, we will prove this by inspection of the above values:

|e^{iπ/2}| = √(0² + 1²) = 1
|e^{iπ}|   = √((−1)² + 0²) = 1
|e^{iπ/4}| = √((√2/2)² + (√2/2)²) = √(2/4 + 2/4) = 1
|e^{i2π}|  = √(1² + 0²) = 1.

The exponentials in every term of the discrete Fourier transform and its inverse are all roots of unity. We can order these roots of unity by using the following definition.

Roots of unity: We call the set W = {W_N^0, W_N^1, ..., W_N^{N−1}} the Nth roots of unity, corresponding to points on the unit circle in the complex plane, where

W_N^1 = e^{i2π/N} is the primitive Nth root of unity,
W_N^k = e^{i2πk/N} = (W_N^1)^k is the kth Nth root of unity,
W_N^0 = e^{i2π(0)/N} = 1 is the trivial Nth root of unity, and
W_N^N = e^{i2πN/N} = e^{i2π} = 1 = W_N^0.

² The words "unit," "unity," and "unitary" all refer to the number 1, and often the unit circle.

To help demystify the name "roots of unity," these can also be written

W_N^k = e^{i2πk/N} = \sqrt[N]{e^{i 2\pi k}} = \sqrt[N]{W^k},

the "Nth root of W^k," or the "kth power of the primitive root of unity W_N^1." We don't usually see exponents written this way, because the root symbol is typically only used when the exponent is equal to 1/2, so the exponent is inverted when we write roots this way, e.g., √x = x^{1/2} = ²√x.

The function e^{iω} describing the roots of unity in the Fourier transform serves to extract the periodic parts of a signal and attenuate the aperiodic parts by essentially multiplying aperiodicities by 0. The roots of unity analyze the periodicity of the signal because they are all orthogonal to one another. This is the crux of the Fourier transform.
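This orthogonality is easy to check numerically. Here is a small C sketch (using the C99 complex type; N = 8 is an arbitrary choice) that computes the inner product ⟨W_N^k, W_N^l⟩ for every pair of roots of unity: the result is N on the diagonal (k = l) and essentially zero everywhere else.

```c
/* roots_of_unity.c -- inner products of the Nth roots of unity: N when
 * k = l, and (to rounding error) 0 when k != l.
 * Compile: cc roots_of_unity.c -o roots_of_unity -lm                      */
#include <stdio.h>
#include <math.h>
#include <complex.h>

#define PI 3.14159265358979323846

int main(void) {
    const int N = 8;
    for (int k = 0; k < N; k++) {
        for (int l = 0; l < N; l++) {
            double complex sum = 0.0;
            for (int t = 0; t < N; t++) {
                double complex wk = cexp(I * 2.0 * PI * k * t / N);
                double complex wl = cexp(I * 2.0 * PI * l * t / N);
                sum += wk * conj(wl);      /* product with the complex conjugate */
            }
            printf("%6.2f ", creal(sum));  /* imaginary parts are ~0             */
        }
        printf("\n");
    }
    return 0;
}
```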

7.3 The discrete Fourier transform<br />

The discrete Fourier transform (DFT) and the inverse discrete Fourier transform (the IDFT) are applied to discrete time-domain signals to extract their sinusoidal frequency components. The DFT computes the frequency-domain spectrum of a signal, and the IDFT reconstructs the signal from the DFT. It is a rich algorithm with many variables acting at once, and it can be very intimidating even to seasoned mathematicians. We will first define it formally but then thoroughly inspect it with several examples.



The discrete Fourier transform is the discrete case of the Fourier transform, which requires a continuous (non-discrete) signal input and has a continuous output.

Fourier transform (continuous): The Fourier transform is an invertible, linear transformation accepting complex-valued inputs and outputting complex values. The Fourier transform of a continuous-time signal x(t) is represented by the symbol F, and is given by

\mathcal{F}\{x(t)\} := X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-i\omega t}\, dt,

where t is time and ω is angular frequency. The inverse Fourier transform F^{−1} reconstructs the original signal from the Fourier transform with the formula

\mathcal{F}^{-1}\{X(\omega)\} := x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, e^{i\omega t}\, d\omega.

We call X(ω) the spectrum of x(t). We multiply the inverse Fourier transform (IFT) by 1/2π only when the frequency is specified or desired in radians per second (ω) instead of hertz (f); otherwise, we leave it alone.

An integral of a function takes the area under the curve of the function over an interval whose endpoints are specified by the given limits, in this case −∞ to ∞. If we partition the function into tiny slices over the specified interval, we can sum together all of the slices and approximate its integral. The thinner the slices, the better the area of the slices will approximate the actual area underneath the curve.

The continuous Fourier transform of a simple sinusoid x(t) = sin(ω_c t) with frequency ω_c is a (scaled) pair of Dirac delta functions at ±ω_c, namely iπ[δ(ω + ω_c) − δ(ω − ω_c)]. Conversely, the Fourier transform of a Dirac delta function centered at time t_0, δ(t − t_0), is the exponential e^{−iωt_0}; when it is centered at −t_0, the Fourier transform is the exponential e^{iωt_0}.



Figure 7.8: The definite integral \int_0^1 \left(t - \frac{1}{2}\right)^2 dt gives the exact area under the given curve from 0 to 1, shown in gray. Using calculus, this area is computed as \left[\frac{x^3}{3} - \frac{x^2}{2} + \frac{x}{4}\right]_0^1 = \frac{1}{12} = 0.083333.

Figure 7.9: The Riemann sum partitions the function using evenly sized and spaced intervals to approximate the area under the curve. The more partitions, the closer the Riemann sum is to the integral. Here, with 28 partitions, the approximate area is 0.083227.

Example. Let x(t) = sin(220πt). Then

\mathcal{F}\{x(t)\} = \int_{-\infty}^{\infty} \sin(220\pi t)\, e^{-i\omega t}\, dt = i\pi\,\delta(\omega + 220\pi) - i\pi\,\delta(\omega - 220\pi) = X(\omega).

So the Fourier transform looks like two spikes centered at the frequencies 220π rad/s and −220π rad/s. The amplitude of the positive frequency is −iπ (its magnitude is simply π), and the amplitude of the negative frequency is iπ. Graphs of the Fourier transform typically depict the magnitude plot |X(ω)| and will sometimes include the phase plot φ(ω) of the transform's behavior. The magnitude plot |X(ω)| shows us the relative powers of the frequency components of a signal. We calculate this with the formula

|X(\omega)| = \sqrt{\mathrm{Re}\{X(\omega)\}^2 + \mathrm{Im}\{X(\omega)\}^2}.

The phase φ(ω) is calculated from the formula

\varphi[X(\omega)] = \tan^{-1}\left[\frac{\mathrm{Im}\{X(\omega)\}}{\mathrm{Re}\{X(\omega)\}}\right].



Since the magnitude of iπ is the same as that of −iπ, the magnitude plot of the transform (given in Figure 7.10) is symmetrical about the vertical axis (ω = 0).

Figure 7.10: The magnitude plot of the Fourier transform of x(t) = sin(220πt): Two vertical segments centered at −220π and 220π radians/second. We calculate the magnitude because the Fourier transform is a complex function. Since the magnitudes of its imaginary values are considered equally important as those of the real values, whether positive or negative, a magnitude plot is the simplest way to visually convey the spectrum of a signal.

If a time signal x(t) is infinite, then its Fourier transform may be finite. Otherwise, its Fourier transform will be infinite. The minimum and maximum frequency components (in Hz) of a spectrum define the bandwidth of a signal. For the function sin(220πt), the minimum frequency component is −110 Hz and the maximum is 110 Hz. Therefore, its bandwidth is 220 Hz. The bandwidth defines the minimum rate at which we must sample the function to accurately collect its frequency information.



The inverse of this is

\mathcal{F}^{-1}\{X(\omega)\} = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, e^{i\omega t}\, d\omega
 = \frac{i\pi}{2\pi}\left(e^{-i220\pi t} - e^{i220\pi t}\right)
 = \frac{i}{2}\left[\cos(220\pi t) - i\sin(220\pi t) - \cos(220\pi t) - i\sin(220\pi t)\right]
 = \frac{i}{2}\left[-2i\sin(220\pi t)\right]
 = \sin(220\pi t)
 = x(t).

Therefore, the Fourier transform and its inverse accurately reconstruct a given sinusoid.

As you can see, it is very important to remember the mathematical identities from Euler's formula,

e^{iω} = cos(ω) + i sin(ω)
e^{−iω} = cos(ω) − i sin(ω),

when computing continuous Fourier transforms. It is helpful to memorize some relationships between commonly encountered functions and their transforms.

Function x(t)             Transform X(ω)
Constant: x(t) = a        a · δ(ω)
δ(t − a)                  e^{−iaω}
δ(t + a)                  e^{iaω}
sin(at)                   iπ [δ(ω + a) − δ(ω − a)]
cos(at)                   π [δ(ω − a) + δ(ω + a)]
cos(at) + cos(bt)         π [δ(ω − a) + δ(ω + a) + δ(ω − b) + δ(ω + b)]

The constant case wherein x(t) = a produces a transform with a delta function centered at 0 rad/s and height a. This is the DC offset: Direct current (DC) supplies a constant voltage, as opposed to alternating current, which is associated with a sinusoid.³



The DC offset is thought of as the average value of a waveform.⁴

In the discrete case, we can simply sum over a sampled signal at intervals given by the period T_s of the sampling frequency f_s. Therefore, the DFT input is not x(t), but x_s(t). However, you rarely see that notation used—it is a given, because the discrete Fourier transform only accepts discrete inputs.

The DFT bears quite a resemblance to the continuous case:

X(\omega_k) = \sum_{t=0}^{N-1} x_s(t)\, e^{-i\omega_k t}, \quad k = 0, 1, 2, \ldots, N-1,

and the inverse DFT, or IDFT, is

x_s(t) = \frac{1}{N} \sum_{k=0}^{N-1} X(\omega_k)\, e^{i\omega_k t}, \quad t = 0, 1, 2, \ldots, N-1.

This way of writing the DFT and IDFT returns the exact frequency components of the frequency spectrum of x(t).

³ In many engineering books, current (and therefore voltage) is represented by phase vectors, or simply phasors. This implies that their frequency and overall amplitude (i.e., the constant A that multiplies the sine wave) is time-invariant. This is not to be confused with the phasors that are the vertical reflection of sawtooth waves.

⁴ In this text's representation of the Fourier transform, X(0) will be the sum (or integral, in the continuous case) of the signal, because we are not normalizing it by multiplying by 1/N or some other constant, as some specifications of the Fourier transform will. In Mathematica, for example, the continuous Fourier transform is defined as

F(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(t)\, e^{i\omega t}\, dt

and its inverse is

f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} F(\omega)\, e^{-i\omega t}\, d\omega.

The exponentials are different here—but how can that be? So long as the inverse transform's exponential is the additive inverse of the transform's exponential, it forms an orthonormal basis that can describe the frequency information of a time-domain signal. The constant 1/√(2π) simply acts to normalize the data in a different way from our specification.



Symbol     Definition
x_s(t)     the amplitude of the input sampled signal at sampling instant t
N          the total number of samples in x_s(t)
n          the index of the samples of x_s(t), numbered 0, 1, ..., N − 1
T_s        the chosen sampling interval (period), equal to 1/f_s
t          equal to n · T_s, the time of the nth sample in seconds
f_s        the sampling frequency in Hz, equal to ω/2π
ω_k        2πk/(N T_s) = 2πk f_s / N, the kth harmonic frequency in rad/s
X(ω_k)     the amplitude of frequency ω_k in all of x_s(t)

Table 7.1: Description and names of variables used in the discrete Fourier transform.

But notationally, there are several somewhat misleading conventions: The time component n will often be written as t, but this t is not in seconds as in the original signal; the t defined above (Table 7.1) is the sample index n multiplied by the sampling interval T_s, which does give the time in seconds. Secondly, the sampled signal x_s(t) is usually truncated to x(t), but keep in mind that the discrete Fourier transform only accepts discrete (sampled) signals. And furthermore, we don't typically talk about musical content with angular frequencies; we want frequency in hertz.

To clear up some of these confusions, we first note that ω_k · t = \frac{2\pi k}{N T_s} \cdot n T_s = \frac{2\pi k n}{N}. Then we can rewrite the discrete Fourier transform and its inverse as

X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-\frac{i 2\pi k n}{N}}, \quad k = 0, 1, 2, \ldots, N-1
x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{\frac{i 2\pi k n}{N}}, \quad n = 0, 1, 2, \ldots, N-1

or as

X(k) = \sum_{t=0}^{N-1} x(t)\, e^{-\frac{i 2\pi k t}{N}}, \quad k = 0, 1, 2, \ldots, N-1
x(t) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{\frac{i 2\pi k t}{N}}, \quad t = 0, 1, 2, \ldots, N-1.



This second way is a common way of writing it, and the form that I prefer, because I like to be reminded of the variable corresponding to time. Just keep in mind that x(t) in the DFT is the sampled version of the original signal and that t represents the sample number from 0 to N − 1, not time in seconds.

The signal x(t), a time-domain function, is transformed by the Fourier transform into X(k), the signal's frequency-domain spectrum. The magnitude plot of the DFT, |X(k)|, is calculated from the square root of the sum of the real part's coefficient squared and the imaginary part's coefficient squared:

|X(k)| = \sqrt{\mathrm{Re}(X(k))^2 + \mathrm{Im}(X(k))^2}.

Therefore, the magnitude is nonnegative. Remember, for a complex number c = a + bi, Re(c) = a and Im(c) = b. If there is no imaginary part to the DFT at frequency component k, then the amplitude is simply X(k). In digital implementations of the DFT, the absolute value is taken at some point so that only positive, real values are given in the spectrum.
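Here is the DFT "performed literally," as a minimal C sketch: two nested loops and N^2 complex multiplications, followed by the magnitude and phase of each bin. The test signal (my choice) is one cycle of a sine across N = 8 samples, so the magnitude spectrum peaks at bins k = 1 and k = N − 1 = 7 with value N/2 = 4.

```c
/* dft_naive.c -- the O(N^2) DFT straight from the definition
 * X(k) = sum_{t=0..N-1} x(t) e^(-i 2 pi k t / N).
 * Compile: cc dft_naive.c -o dft_naive -lm                                */
#include <stdio.h>
#include <math.h>
#include <complex.h>

#define PI 3.14159265358979323846
enum { N = 8 };

/* direct DFT of the N-point real signal x[] into the complex spectrum X[] */
static void dft(const double *x, double complex *X) {
    for (int k = 0; k < N; k++) {
        X[k] = 0.0;
        for (int t = 0; t < N; t++)
            X[k] += x[t] * cexp(-I * 2.0 * PI * k * t / N);
    }
}

int main(void) {
    double x[N];
    double complex X[N];

    for (int t = 0; t < N; t++)
        x[t] = sin(2.0 * PI * t / N);   /* one cycle across the N samples */

    dft(x, X);
    for (int k = 0; k < N; k++)
        printf("k = %d   |X(k)| = %8.4f   phase = %+8.4f rad\n",
               k, cabs(X[k]), carg(X[k]));
    return 0;
}
```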

By Euler's formula, e^{−i2πkt/N} is equal to cos(2πkt/N) − i sin(2πkt/N), and the kth Nth root of unity e^{i2πkt/N} equals cos(2πkt/N) + i sin(2πkt/N). Therefore, we can also write the DFT and its inverse trigonometrically as

X(k) = \sum_{t=0}^{N-1} x(t)\, e^{-\frac{i 2\pi k t}{N}}, \quad k = 0, 1, \ldots, N-1
     = \sum_{t=0}^{N-1} x(t)\cos(2\pi k t / N) - i \sum_{t=0}^{N-1} x(t)\sin(2\pi k t / N)

x(t) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{\frac{i 2\pi k t}{N}}, \quad t = 0, 1, \ldots, N-1
     = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\cos(2\pi k t / N) + \frac{i}{N} \sum_{k=0}^{N-1} X(k)\sin(2\pi k t / N).



Let us now verify that the IDFT is the inverse of the DFT.

X(k) = \sum_{t=0}^{N-1} x(t)\, e^{-\frac{i 2\pi k t}{N}}
     = \sum_{t=0}^{N-1} \left( \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{\frac{i 2\pi k t}{N}} \right) e^{-\frac{i 2\pi k t}{N}}.

Since the k in e^{−i2πkt/N} is not bounded by the k in the inner sum, we change the inside k to l:

     = \sum_{t=0}^{N-1} \left( \frac{1}{N} \sum_{l=0}^{N-1} X(l)\, e^{\frac{i 2\pi l t}{N}} \right) e^{-\frac{i 2\pi k t}{N}}
     = \frac{1}{N} \sum_{t=0}^{N-1} \sum_{l=0}^{N-1} X(l)\, e^{\frac{i 2\pi (l-k) t}{N}}
     = \frac{1}{N} \sum_{l=0}^{N-1} X(l) \sum_{t=0}^{N-1} e^{\frac{i 2\pi (l-k) t}{N}}
     = \frac{1}{N} \sum_{l=0}^{N-1} X(l) \cdot N \quad \text{when } l = k, \text{ 0 otherwise.}

Thus, our double sum becomes

\frac{1}{N} \sum_{l=0}^{N-1} X(l)\, N\, \delta(l-k),

where δ(l − k) is the delta function equal to 1 for l = k and 0 otherwise. So, when l = k, the sum is then

\frac{1}{N} \sum_{l=0}^{N-1} X(l)\, N\, \delta(l-k) = \sum_{l=0}^{N-1} X(l)\, \delta(l-k) = X(k).

This shows that the IDFT is indeed the inverse of the DFT.<br />

Properties of the Fourier transform<br />

The nature of complex numbers provides the Fourier transform with<br />

many nice properties. What follows here are the discrete representations<br />

of the properties, but they also apply <strong>to</strong> the continuous case.



The Fourier transform is linear as a result of the principle of superposition: Since every signal is the sum of complex sinusoids, the sum of two or more signals may be "broken apart" from a single sum into two or more sums. Additionally, they may be scaled by any number such that F{ax} = aF{x}. This also applies to the inverse Fourier transform.

Linearity: The signals x and y can be scaled by any real or complex constants a and b such that F{ax + by} = aF{x} + bF{y}. Similarly, the spectra X and Y can be multiplied by any real or complex constants a, b such that F^{−1}{aX + bY} = aF^{−1}{X} + bF^{−1}{Y}.

Proof:

\mathcal{F}\{a x(t) + b y(t)\} = \sum_{t=0}^{N-1} [a x(t) + b y(t)]\, e^{-\frac{i 2\pi k t}{N}}
 = \sum_{t=0}^{N-1} a x(t)\, e^{-\frac{i 2\pi k t}{N}} + \sum_{t=0}^{N-1} b y(t)\, e^{-\frac{i 2\pi k t}{N}}
 = a \sum_{t=0}^{N-1} x(t)\, e^{-\frac{i 2\pi k t}{N}} + b \sum_{t=0}^{N-1} y(t)\, e^{-\frac{i 2\pi k t}{N}}
 = a\,\mathcal{F}\{x(t)\} + b\,\mathcal{F}\{y(t)\}.

Likewise for the inverse transform:

\mathcal{F}^{-1}\{a X(k) + b Y(k)\} = \frac{1}{N} \sum_{k=0}^{N-1} [a X(k) + b Y(k)]\, e^{\frac{i 2\pi k t}{N}}
 = \frac{1}{N} \sum_{k=0}^{N-1} a X(k)\, e^{\frac{i 2\pi k t}{N}} + \frac{1}{N} \sum_{k=0}^{N-1} b Y(k)\, e^{\frac{i 2\pi k t}{N}}
 = \frac{a}{N} \sum_{k=0}^{N-1} X(k)\, e^{\frac{i 2\pi k t}{N}} + \frac{b}{N} \sum_{k=0}^{N-1} Y(k)\, e^{\frac{i 2\pi k t}{N}}
 = a\,\mathcal{F}^{-1}\{X(k)\} + b\,\mathcal{F}^{-1}\{Y(k)\}.



The Fourier transform is called a time-invariant linear filter because of this property and because, over time, the system that it is representing does not change.⁵

When we multiply a time-domain signal by the exponential function e^{−i2πlt/N} for any real or complex constant l, its Fourier transform is shifted by l, and vice versa, i.e., e^{−i2πlt/N} x(t) ⇔ X(k + l). Likewise, multiplying a Fourier transform by e^{i2πku/N} (note the sign change in the exponent) for some constant u shifts the time-domain signal by u, i.e., e^{i2πku/N} X(k) ⇔ x(t + u).
N X(k) ⇔ x(t − u).<br />

The shift theorem: Multiplying a time signal x(t) by e^{−i2πlt/N} shifts its spectrum X(k) to the left by l, i.e.,

F{x(t) · e^{−i2πlt/N}} = ∑_{t=0}^{N−1} x(t)e^{−i2πkt/N} e^{−i2πlt/N}
= ∑_{t=0}^{N−1} x(t)e^{−i2π(k+l)t/N}
= X(k + l).

Shifting in the frequency domain is transposition to a different key, musically.
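
For the discrete transform the shift is circular (indices are taken modulo N), and a short numerical check makes the theorem visible. The following C sketch is my own illustration, not from the original text: it modulates a small test signal by e^{−i2πlt/N} with l = 2 and compares the DFT of the result against the original spectrum shifted by l.

#include <stdio.h>
#include <math.h>
#include <complex.h>

#define N 8

/* Unnormalized DFT of a complex signal of length N. */
static void dft(const double complex *x, double complex *X)
{
    const double PI = 3.14159265358979323846;
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int t = 0; t < N; t++)
            X[k] += x[t] * cexp(-I * 2.0 * PI * k * t / N);
    }
}

int main(void)
{
    const double PI = 3.14159265358979323846;
    int l = 2;                                          /* amount of the shift */
    double complex x[N], y[N], X[N], Y[N];

    for (int t = 0; t < N; t++) {
        x[t] = cos(PI * t / 4.0);                       /* any test signal */
        y[t] = x[t] * cexp(-I * 2.0 * PI * l * t / N);  /* modulated signal */
    }
    dft(x, X);
    dft(y, Y);

    /* Y(k) should equal X(k + l), with the index wrapped modulo N. */
    for (int k = 0; k < N; k++)
        printf("Y(%d) = %6.3f %+6.3fi   X(%d) = %6.3f %+6.3fi\n",
               k, creal(Y[k]), cimag(Y[k]),
               (k + l) % N, creal(X[(k + l) % N]), cimag(X[(k + l) % N]));
    return 0;
}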

A time-domain signal is real-valued if and only if the complex conjugate X̂(k) of its Fourier transform is equal to X(−k).

Spectral symmetries of real signals: The Fourier transform X(k) of a real-valued time-domain signal x(t) possesses Hermitian symmetry, where X̂(k) = X(−k), and vice versa.

5 This is not to say that a song or voltage does not change over time—of course it does! Rather, time-invariance in filter design and electrical engineering means that the system does not change over time, meaning that we are not switching the input signal to a different one during our analysis, or removing components from a circuit while calculating its input and output voltage. Piecewise functions, for example, are not time-invariant.




Hermitian symmetry has the following properties:

Re{X(−k)} = Re{X(k)},
Im{X(−k)} = −Im{X(k)},
|X(−k)| = |X(k)|, and
∠X(−k) = −∠X(k).

This is an important property that helps convey why we only need half of the Fourier transform's output for audio signals in order to analyze their frequency content. For the continuous Fourier transform, we ignore the first half (the negative frequencies) of the magnitude spectrum, and for the discrete case we ignore the second half. Since all of the frequencies in the DFT are positive, we ignore frequencies beyond N/2 in the DFT.

The convolution theorem: The convolution of two discrete, time-domain signals x(t) and y(t) is

x(t) ∗ y(t) = ∑_{n=0}^{N−1} x(n) y(t − n),

where n is considered the amount of latency and ∗ denotes the operation of convolution.⁶ The Fourier transform of their convolution is the product of their spectra, i.e.,

x(t) ∗ y(t) ⇔ X(k) · Y(k).

Proof:

F{x(t) ∗ y(t)} = ∑_{t=0}^{N−1} ( ∑_{n=0}^{N−1} x(n) y(t − n) ) e^{−i2πkt/N}
= ∑_{n=0}^{N−1} x(n) ∑_{t=0}^{N−1} y(t − n) e^{−i2πkt/N}
= ∑_{n=0}^{N−1} x(n) e^{−i2πkn/N} Y(k), by the Shift Theorem
= ( ∑_{n=0}^{N−1} x(n) e^{−i2πkn/N} ) Y(k)
= X(k) Y(k).

6 The convolution of two continuous-time signals is

x(t) ∗ y(t) = ∫_0^t x(s) y(t − s) ds.

The convolution of two discrete-time signals is

x(t) ∗ y(t) = ∑_{n=0}^{N−1} x(n) y(t − n).

This property is an incredibly useful one. Consider what might happen<br />

when you multiply the spectra of two signals: Frequencies present<br />

in both signals will be present while frequencies lacking from one of<br />

the signals will be absent in the resulting signal. Say that you had the<br />

frequency response of some acoustic space, like an amphitheater or<br />

church, <strong>and</strong> you multiplied it with the spectrum (frequency response)<br />

of some musical signal. The result would be a simulation of that<br />

musical signal as if it were recorded in that room, <strong>and</strong> the same effect can<br />

be done by convolving the two time-domain signals. To collect the<br />

reverberant behavior of a room, we record the sound of an impulse<br />

played in it, <strong>and</strong> call this an impulse response. Since the frequency<br />

response of the room reflects the amount of the room’s reverberation,<br />

we call this process convolution reverb.<br />

To picture the operation of convolution, imagine two signals starting<br />

at time sample n =0, such as an impulse response <strong>and</strong> a musical<br />

signal’s ADSR envelope.



Figure 7.11: An impulse response IR(t) defined for 0 ≤ t ≤ 6, where t represents the time samples. This impulse response shows three reflections, the first with power 0.5, the second 0.25, and the third 0.1.

Figure 7.12: The ADSR envelope of a short musical signal x(t), defined for 0 ≤ t ≤ 15, again in time samples, not seconds. Imagine that this is a short puff on a flute, recorded in an anechoic (echoless) chamber.

Before performing the convolution, we flip the shorter signal (the<br />

impulse response) so that y(t) is now y(−t) <strong>and</strong> its rightmost point<br />

is t =0. Then, we shift it <strong>to</strong> the left any number of points (one will<br />

suffice) so that it does not intersect with the musical signal when they<br />

are plotted on the same axis.<br />

Figure 7.13: To perform convolution on both discrete and continuous signals, we flip one of the signals horizontally.

Figure 7.14: Then, we shift that signal so that the rightmost point will not intersect with the domain of the other signal.

Now we are ready to perform convolution. Again, convolution is computed on two signals x(t) and y(t) with the equation x(t) ∗ y(t) = ∑_{n=0}^{N−1} x(n) y(t − n). Here, the impulse response IR(t) is our y(t): It has been flipped and shifted. N is the length of x(t). Here,

x(t) = x(n) = [0, 1, 0.9, 0.8, 0.7, 0.5, 0.5, 0.5, 0.5, 0.4, 0.4, 0.3, 0.2, 0.1, 0.1, 0] and
y(t) = [1, 0, 0.5, 0, 0.25, 0, 0.1], so y(t − n) = [0.1, 0, 0.25, 0, 0.5, 0, 1],

and if there is an n such that y(t − n) is not defined, we call it zero. Then

x(t) ∗ y(t) = ∑_{n=0}^{N−1} x(n) y(t − n)
= [0, 1, 0.9, 1.3, 1.15, 1.15, 1.075, 0.95, 0.925, 0.855, 0.845, 0.675, 0.575, 0.4, 0.35, 0.165, 0.14, 0.055, 0.045, 0.01, 0.01, 0].

Figure 7.15: The shifted and flipped impulse response y(t − n) on the same plot as x(n), ready for convolution. We will move y(t − n) to the right sample by sample and multiply it with x(n) for each t, and then sum each of these multiplications to get the convolution.

Figure 7.16: The convolved ADSR. This signal shows how the given ADSR would behave in the room described by the impulse response.

So the signal in Figure 7.16 would sound like it were coming from the room with the impulse response depicted in Figure 7.11.
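
To make the arithmetic concrete, here is a minimal C sketch (my illustration, not part of the original text) that convolves the example x(n) and impulse response y(n) directly from the definition. The array values are the ones used above, and the printed sequence should match the convolved ADSR just listed.

#include <stdio.h>

/* Direct convolution: out[t] = sum_n x[n] * y[t - n]; out has length nx + ny - 1. */
static void convolve(const double *x, int nx, const double *y, int ny, double *out)
{
    for (int t = 0; t < nx + ny - 1; t++) {
        out[t] = 0.0;
        for (int n = 0; n < nx; n++) {
            int m = t - n;                /* index into y */
            if (m >= 0 && m < ny)         /* y outside its domain counts as zero */
                out[t] += x[n] * y[m];
        }
    }
}

int main(void)
{
    /* The ADSR envelope x(n) and impulse response y(n) from the example above. */
    double x[] = {0, 1, 0.9, 0.8, 0.7, 0.5, 0.5, 0.5, 0.5, 0.4, 0.4, 0.3, 0.2, 0.1, 0.1, 0};
    double y[] = {1, 0, 0.5, 0, 0.25, 0, 0.1};
    int nx = sizeof x / sizeof x[0], ny = sizeof y / sizeof y[0];
    double out[16 + 7 - 1];

    convolve(x, nx, y, ny, out);
    for (int t = 0; t < nx + ny - 1; t++)
        printf("%.3f ", out[t]);
    printf("\n");
    return 0;
}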

Parseval's identity states that the sum of the Fourier series coefficients c_n squared is equal to the integral of the function squared, i.e.,

∑_{n=−∞}^{∞} |c_n|^2 = (1/2π) ∫_{−π}^{π} |x(t)|^2 dt.


This identity gives rise to the relationship between the total power of a signal and the total power of its spectrum, given by Parseval's theorem.

Parseval's theorem: For any continuous time-domain signal x(t) and its normalized Fourier transform X(ω), where ω is in radians per second,

∫_{−∞}^{∞} |x(t)|^2 dt = (1/2π) ∫_{−∞}^{∞} |X(ω)|^2 dω,

where |x(t)|^2 is the inner product 〈x, x〉 (which is equivalent to ∑_{t=0}^{N−1} x(t)x̂(t), i.e., x(t) multiplied by its complex conjugate, summed over all of its terms), and likewise, |X(k)|^2 is taken to mean the inner product 〈X, X〉. If ω were instead in Hz, we would simply remove the multiplication by 1/2π.

For any discrete time-domain signal x(t) and its discrete, normalized Fourier transform X(k),

∑_{t=0}^{N−1} |x(t)|^2 = ∑_{k=0}^{N−1} |X(k)|^2.

If the DFT is not normalized, then

∑_{t=0}^{N−1} |x(t)|^2 = (1/N) ∑_{k=0}^{N−1} |X(k)|^2.

Proof: Let x(t) be a complex function of length N. Then

∑_{t=0}^{N−1} |x(t)|^2 = ∑_{t=0}^{N−1} x(t)x̂(t)
= x(t) ∗ x(t), convolution with zero latency, or n = 0,
= F^{−1}(X̂(k)X(k)), by the convolution theorem,
= (1/N) ∑_{k=0}^{N−1} |X(k)|^2.

This is also called the Rayleigh energy theorem, Plancherel theorem, Parseval's equality, Parseval's relation, or simply the power theorem.
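
The relationship is easy to check numerically. The C sketch below is my own illustration under the conventions used here, not code from the book: it computes an unnormalized DFT of a short real signal and compares ∑|x(t)|² with (1/N)∑|X(k)|²; the two printed numbers should agree up to rounding error.

#include <stdio.h>
#include <math.h>
#include <complex.h>

#define N 8

int main(void)
{
    const double PI = 3.14159265358979323846;
    double x[N];
    double complex X[N];

    /* Any short real test signal will do; here, cos(pi*t) sampled at 4 Hz. */
    for (int t = 0; t < N; t++)
        x[t] = cos(PI * t * 0.25);

    /* Unnormalized DFT: X(k) = sum_t x(t) e^{-i 2 pi k t / N}. */
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int t = 0; t < N; t++)
            X[k] += x[t] * cexp(-I * 2.0 * PI * k * t / N);
    }

    double time_energy = 0.0, freq_energy = 0.0;
    for (int t = 0; t < N; t++) time_energy += x[t] * x[t];
    for (int k = 0; k < N; k++) freq_energy += creal(X[k] * conj(X[k]));

    printf("sum |x(t)|^2       = %f\n", time_energy);
    printf("(1/N) sum |X(k)|^2 = %f\n", freq_energy / N);
    return 0;
}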



When unsure if the sampling frequency has been specified as great enough to eschew errors from aliasing, we can increase the length of a signal by a factor of L to increase the period of the maximum frequency component. We do this by inserting L − 1 zeros in between each pair of samples. Hence, f_max will become f_max/L, and f_s will have a better chance of correctly sampling all frequencies in the signal. This is known as up-sampling a signal, and the method of increasing the length of x(t) is called stretching. For continuous signals, we have the scaling theorem, and for discrete ones, we have the stretch theorem.
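
As a concrete illustration (a sketch of mine, not from the text), stretching by a factor L simply interleaves L − 1 zeros after every original sample:

#include <stdio.h>

/* Stretch x (length n) by factor L: insert L-1 zeros between samples.
   out must have room for n * L samples. */
static void stretch(const double *x, int n, int L, double *out)
{
    for (int i = 0; i < n * L; i++)
        out[i] = (i % L == 0) ? x[i / L] : 0.0;
}

int main(void)
{
    double x[] = {1.0, 0.5, -0.5, -1.0};
    double y[4 * 3];               /* stretched by L = 3 */
    stretch(x, 4, 3, y);
    for (int i = 0; i < 12; i++)
        printf("%.1f ", y[i]);     /* 1.0 0.0 0.0 0.5 0.0 0.0 -0.5 ... */
    printf("\n");
    return 0;
}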

Scaling theorem: Stretching the time domain of a continuous signal x(t) by a nonzero real number a shrinks its frequency domain by a factor of a, i.e.,

x(t/a) ⇔ |a| X(aω).

Proof: Let x(t) be a continuous, complex function and let a be a nonzero real number. Then the domain of x(t/a) is a-times as wide as x(t), i.e., it is "stretched" by a factor of a. Taking the Fourier transform of this stretched signal yields

F{x(t/a)} = ∫_{−∞}^{∞} x(t/a) e^{−iωt} dt
= ∫_{−∞}^{∞} x(t/a) e^{−iω(a · t/a)} dt
= |a| ∫_{−∞}^{∞} x(t/a) e^{−i(ωa)(t/a)} d(t/a)
= |a| X(aω).

Stretch theorem: Stretching the time domain of a discrete, time-domain signal x[t] by a nonzero real number a repeats the domain of its spectrum X[k] a-many times around the unit circle. So, the effect of stretching discrete signals has opposite implications on their frequency domain as stretching continuous signals: Here, the frequency domain increases by a factor of a (to go from 0 to aN) instead of 1/a. Furthermore, the frequency components of the stretched x[t/a], call them ω′_k, will be more densely spaced such that

ω′_k = 2πk/(aNT_s).

So in both the continuous and discrete cases, the resolution of the frequency domain is increased. Hence, in general, up-sampling in the time domain improves the accuracy of the frequency domain.

The process of up-sampling in the discrete case is done by adding zeros to a signal, called zero-padding, and the consequence of this in its spectrum is called spectral interpolation. Spectral interpolation (or simply interpolation) increases the resolution of the spectrum and shrinks the number of frequencies with energy because it reduces aliasing. Aliasing happens when a signal contains a frequency that is not one of the frequency bins, i.e., a frequency not in the set of the ω_k = 2πk/(NT_s). Say that a signal x(t) contained the frequencies 20 Hz, 30.2 Hz, and 32 Hz and we sampled it at 100 Hz over 1 second. Then the frequency bins of X(k), i.e., the set of frequency components {ω_k}, would be

{ω_k} = {2πk/(NT_s)} |_{N=100, T_s=0.01} = {0, 2π, 4π, 6π, . . . , 198π}.

This set includes the frequencies 20 and 32 Hz, but 30.2 Hz is not indexed by any of the frequency bins even though it is sampled at more than 2 samples per period. The result is that the energy resulting from 30.2 Hz "leaks" to nearby frequency bins, mostly to 30 Hz and second most to 31 Hz. This appears graphically as side lobes, and the spread of energy to nearby frequencies is called spectral leakage. If we were to increase the sampling rate, either by literally increasing the value of f_s or by zero-padding (stretching), we would reduce the effect of spectral leakage. Windowing, described in Chapter 8 on the short-time Fourier transform, also produces these spectral lobes, especially rectangular or "boxcar" windows.

Example 2 in Section 7.5 presents a scenario of spectral leakage, and Example 3 uses the scaling theorem to curtail the artifact.

Computational complexity of the DFT

Each X(k) is a DFT sum itself, requiring N-many computations. Since there are N-many X(k)'s to be calculated for the entire signal to be transformed into the frequency domain, a total of N × N = N^2 computations must be performed. Therefore, its computational complexity is O(N^2).

This is considered expensive and inefficient for an algorithm. Consider a three-minute-long song sampled at 44,100 Hz. The total number of samples is 3 × 60 × 44,100 = 7,938,000 = N. So, a whopping 7,938,000^2 = 63,011,844,000,000 (about 6.3 × 10^13) calculations are required to fully specify the frequency components of a 3-minute CD-quality piece of music. As we will see in Chapter 8, the fast Fourier transform reduces this complexity greatly.
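
For reference, here is a minimal C sketch of the DFT exactly as defined above (my own code, not from the original text): two nested loops of length N, hence the N^2 cost. The test signal is the cosine used in Example 1 of Section 7.5.

#include <stdio.h>
#include <math.h>
#include <complex.h>

/* Naive DFT: X[k] = sum_{t=0}^{N-1} x[t] e^{-i 2 pi k t / N}.  O(N^2). */
static void dft(const double *x, double complex *X, int N)
{
    const double PI = 3.14159265358979323846;
    for (int k = 0; k < N; k++) {          /* one pass per output bin ...        */
        X[k] = 0;
        for (int t = 0; t < N; t++)        /* ... each requiring N multiply-adds */
            X[k] += x[t] * cexp(-I * 2.0 * PI * k * t / N);
    }
}

int main(void)
{
    const double PI = 3.14159265358979323846;
    double x[8];
    double complex X[8];

    /* cos(pi t) sampled at 4 Hz for 2 seconds, as in Example 1. */
    for (int t = 0; t < 8; t++)
        x[t] = cos(PI * t * 0.25);

    dft(x, X, 8);
    for (int k = 0; k < 8; k++)
        printf("X(%d) = %.3f %+.3fi\n", k, creal(X[k]), cimag(X[k]));
    return 0;
}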

7.4 The DFT, simplified

In essence, the discrete Fourier transform extracts a frequency k from a signal x(t) by multiplying x(t) through by a root of unity e^{ikt} with identical frequency k and constructively interfering to return a nonzero value for X(k). In actuality, the root of unity will have a normalized frequency in the interval [0, 2π) to be later multiplied by kf_s/N. Since the maximum value of k is N − 1, the k effectively cancels the N. Therefore, the roots of unity are such that each of them maps to a unique angular frequency within the interval [0, 2πf_s).

The larger the sampling frequency, the larger N is, so the greater f_s is, the more roots of unity will be defined in the DFT.
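
A tiny C sketch (mine, not from the text) makes the picture concrete: for a chosen N it prints the N roots of unity e^{−i2πk/N}, the distinct normalized frequencies available to the DFT.

#include <stdio.h>
#include <math.h>
#include <complex.h>

int main(void)
{
    const double PI = 3.14159265358979323846;
    int N = 8;    /* try 4, 6, 8, 16, or 25 to mirror Figures 7.17 and 7.18 */

    for (int k = 0; k < N; k++) {
        /* kth root of unity: a point on the unit circle at angle -2*pi*k/N */
        double complex w = cexp(-I * 2.0 * PI * k / N);
        printf("W_%d^%d = %.3f %+.3fi\n", N, k, creal(w), cimag(w));
    }
    return 0;
}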


Figure 7.17: Roots of unity for N = 4, N = 6, and N = 8. These are all within the interval [0, 2π). When N = 4, for example, there are 4 different frequencies the DFT could detect.

Figure 7.18: Roots of unity for N = 16 and N = 25, again all within the interval [0, 2π), to be multiplied by f_s afterwards.

The exponentials in every iteration of the discrete Fourier transform and its inverse are roots of unity, representing unitary trigonometric functions in the complex functions of time. As k increases, the frequency of a given function increases—but only in the interval [0, 2π), the interval of frequencies on the unit circle. These small frequencies will be later multiplied by the sampling frequency, f_s, to give N-many, uniformly spaced frequencies in the interval [0, 2πf_s).⁷

7 The notation "[a, b)" refers to the interval with endpoints at a and b. A square bracket, "[" or "]", indicates that the given endpoint is included in the interval. A parenthesis indicates the endpoint is not included.

Suppose that the size of our input signal is N = 8. Then the DFT exponentials are given by

e^{−i2π(0)t/8} = cos(0) − i sin(0)
e^{−i2π(1)t/8} = cos((π/4)t) − i sin((π/4)t)
e^{−i2π(2)t/8} = cos((π/2)t) − i sin((π/2)t)
e^{−i2π(3)t/8} = cos((3π/4)t) − i sin((3π/4)t)
e^{−i2π(4)t/8} = cos(πt) − i sin(πt)
e^{−i2π(5)t/8} = cos((5π/4)t) − i sin((5π/4)t)
e^{−i2π(6)t/8} = cos((3π/2)t) − i sin((3π/2)t)
e^{−i2π(7)t/8} = cos((7π/4)t) − i sin((7π/4)t)

for k = 0, 1, 2, 3, 4, 5, 6, 7. So the real parts of the roots are cosine functions, and the imaginary parts are sine functions, both with identical frequency. Below, they are plotted, the imaginary parts with dashes.

The roots of unity over time for N = 8, f_s = 4

Figure 7.19: Re(e^{−i2π(0)t/8}) = cos(0) = 1, Im(e^{−i2π(0)t/8}) = −sin(0) = 0.

Figure 7.20: Re(e^{−i2π(1)t/8}) = cos((π/4)t), Im(e^{−i2π(1)t/8}) = −sin((π/4)t).

Figure 7.21: Re(e^{−i2π(2)t/8}) = cos((π/2)t), Im(e^{−i2π(2)t/8}) = −sin((π/2)t).

Figure 7.22: Re(e^{−i2π(3)t/8}) = cos((3π/4)t), Im(e^{−i2π(3)t/8}) = −sin((3π/4)t).

If X(k) is non-zero, then the frequency component ω_k = 2πkf_s/N (or kf_s/N in Hz) is in the signal. Roots of unity with frequencies that are identical to those found in a signal will constructively interfere in the Fourier transform to produce a non-zero result. Constructive interference can

Figure 7.23: Re(e^{−i2π(4)t/8}) = cos(πt), Im(e^{−i2π(4)t/8}) = −sin(πt).

Figure 7.24: Re(e^{−i2π(5)t/8}) = cos((5π/4)t), Im(e^{−i2π(5)t/8}) = −sin((5π/4)t).

Figure 7.25: Re(e^{−i2π(6)t/8}) = cos((3π/2)t), Im(e^{−i2π(6)t/8}) = −sin((3π/2)t).

Figure 7.26: Re(e^{−i2π(7)t/8}) = cos((7π/4)t), Im(e^{−i2π(7)t/8}) = −sin((7π/4)t).

happen either at a single point or over an interval of time, and if it happens over an interval, the frequency is identical. If the signals are out of phase, destructive interference occurs—but a sine and cosine wave are similar in that they are simply out of phase with one another, by π/2. This means that the degree of destructive interference exhibited by one of the waves will be the degree of constructive interference exhibited by the other wave: If one completely cancels with a frequency component of the signal, then the other will completely constructively interfere.


Suppose that, for some k, X(k) = a + bi for nonzero a and b. Then the real part a corresponds to the real part of the kth root of unity, i.e., a represents the constructive interference of x(t) with the cosine part of the sum cos(2πkt/N) − i sin(2πkt/N). The imaginary part b is the result of constructive interference with the sine function, which is multiplied by i. Hence, a nonzero a or b means that the root of unity is "matching" a frequency present in x(t).

The nature of the sampled signal is that its samples are all uniformly spaced over time, meaning that it has a periodic nature (its sampling frequency). The exponentials of the discrete Fourier transform only move through e^{i2πt(0)/N} to e^{i2πt(N−1)/N}, i.e., one exponential less than 1 period, because e^{i2πtN/N} is the same as e^{i2πt(0)/N}. Therefore, it is redundant to include any roots of unity beyond e^{i2πt(N−1)/N}, because that frequency is already represented. They serve to identify which frequencies are present in a time-domain signal only because the transform is "blind" to the sampling frequency: f_s may as well be thought of as 1 Hz because the indexing of the samples is in integers. The frequency components ω_k at which X(k) ≠ 0 are multiplied only later by f_s; the exponentials contain no mention of the sampling period or frequency. This is because we are looking at integer-ordered instants of time, t_n = 0, 1, 2, . . . , N − 1, the nth sample of x_s(t), not seconds of time, t. We just write the DFT using t to remind ourselves that it is related to time.

When we multiply the frequency components by f_s at the end, each of the non-zero values of X(k) (corresponding to the frequency component ω_k; sometimes the DFT is written X(ω_k), but usually this opens up a whole other can of worms) is scaled to an integer harmonic of the fundamental frequency (2π/N) · f_s. In other words, the sequence of frequency components is

{ω_k} = { 0, (2π/N)f_s, (4π/N)f_s, (6π/N)f_s, . . . , (2π(N−1)/N)f_s }.


So, the maximum frequency specified by the roots of unity is W_N^{N−1}. Over time, i.e., for the time samples t = 0, 1, . . . , N − 1, the exponential reaches e^{−i2π(N−1)(N−1)/N}. Therefore, the maximum frequency component is (N − 1)f_s/N Hz. But for real inputs (which is all sound files), the DFT has Hermitian symmetry, meaning that the kth term of the transform, X_k, is equal to the complex conjugate of its (N − k)th term, X̂_{N−k}.

Let us recall the Nyquist sampling theorem: A sinusoidal, periodic function must be sampled at least twice per period in order for the frequencies to be represented. For the same reason, the DFT can only detect frequencies up to f_s/2 Hz, and the spectrum's magnitude |X(k)| will be symmetrical about the vertical line k = N/2. The magnitude |X(N − 1)| will thus be equal to |X(1)|, |X(N − 2)| = |X(2)|, and so on. When N is even, the Nyquist frequency will be at the center of this symmetry, and when odd, |X[(N − 1)/2]| will equal |X[(N + 1)/2]|. This will be graphically explored in the upcoming examples.

The DFT can be calculated via matrix multiplication of x(t) with a matrix containing the exponentials of the roots of unity W_N^k down each column. Here is a quick introduction to matrix operations: We use the indices i and j to refer to row and column coordinates (respectively) in a matrix, not to be confused with imaginary numbers. In matrix multiplication, the (i, j)th term of the left matrix multiplies with the (j, k)th term of the right matrix. Therefore, the left matrix must have as many columns as the right matrix has rows. If the left matrix is of the size m × n, where m is the number of rows and n is the number of columns, and the right matrix is of the size n × p, their multiplication will produce a matrix of size m × p.

Multiplying x(t), a 1 × N matrix (a row matrix), with W_N^k, an N × N matrix, therefore produces a matrix of size 1 × N, and this is X(k).⁸ An orthonormal basis in general is an N × N matrix containing normal vectors (columns and rows) describing the dimensions and linear behavior of a function. So, W is considered N-dimensional. The matrix represents the set of roots of unity W = {W_N^0, W_N^1, . . . , W_N^{N−1}} by putting them sequentially in columns, where the top row contains the 0th root of unity (i.e., k = 0) and the bottom row contains the (N − 1)th root of unity (i.e., k = N − 1):

W = [ e^{−i2πk(0)/N}   e^{−i2πk(1)/N}   . . .   e^{−i2πk(N−1)/N} ]

8 In some texts, these will be rows of W, meaning also that the signal x will be a column matrix and the transform X will be a row matrix [24]. However, note that the matrix is symmetrical about its diagonal, so it doesn't matter if the values of the exponentials are given in the rows or the columns.

The jth column of W gives the values of all of the roots of unity at time t = j, cos(2πkj/N) − i sin(2πkj/N). The ith row contains the values of the ith root of unity over time. Explicitly, the general matrix representing the Nth roots of unity for any N is

W =
⎡ 1   1                  1                  1                  . . .   1
  1   e^{−i2π/N}         e^{−i4π/N}         e^{−i6π/N}         . . .   e^{−i2π(N−1)/N}
  1   e^{−i4π/N}         e^{−i8π/N}         e^{−i12π/N}        . . .   e^{−i4π(N−1)/N}
  1   e^{−i6π/N}         e^{−i12π/N}        e^{−i18π/N}        . . .   e^{−i6π(N−1)/N}
  ⋮   ⋮                  ⋮                  ⋮                          ⋮
  1   e^{−i2π(N−1)/N}    e^{−i4π(N−1)/N}    e^{−i6π(N−1)/N}    . . .   e^{−i2π(N−1)(N−1)/N} ⎦

The matrix is N × N in dimension because the cardinality (magnitude) of both k and t is N. Letting N = 8, the DFT can be wholly represented by the following matrix multiplication, in which each column of W(t, k) contains N-many evenly spaced values of a root of unity, whose real part is the corresponding cosine function:⁹

x(t) = [ x(0)  x(1)  x(2)  x(3)  x(4)  x(5)  x(6)  x(7) ]

9 The given matrix of the roots of unity is a decimal approximation, where 0.707 ≈ √2/2 = cos(π/4).

W(t, k) =
⎡ 1   1                1    1                1    1                1    1
  1   0.707 − 0.707i   −i   −0.707 − 0.707i  −1   −0.707 + 0.707i  i    0.707 + 0.707i
  1   −i               −1   i                1    −i               −1   i
  1   −0.707 − 0.707i  i    0.707 − 0.707i   −1   0.707 + 0.707i   −i   −0.707 + 0.707i
  1   −1               1    −1               1    −1               1    −1
  1   −0.707 + 0.707i  −i   0.707 + 0.707i   −1   0.707 − 0.707i   i    −0.707 − 0.707i
  1   i                −1   −i               1    i                −1   −i
  1   0.707 + 0.707i   i    −0.707 + 0.707i  −1   −0.707 − 0.707i  −i   0.707 − 0.707i ⎦

So the discrete Fourier transform X(k) can be computed by matrix multiplication, where X(k) and x(t) are both 1 × N matrices and W(t, k) is N × N in size, by the formula

X(k) = x(t) W(t, k).
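
The matrix view translates directly into code. The following C sketch (my illustration, not the book's) fills W(t, k) = e^{−i2πtk/N} for N = 8 and forms X = xW as a row-vector-times-matrix product; its output agrees with the nested-sum DFT shown earlier.

#include <stdio.h>
#include <math.h>
#include <complex.h>

#define N 8

int main(void)
{
    const double PI = 3.14159265358979323846;
    double complex W[N][N];
    double x[N];
    double complex X[N];

    /* Entry (t, k) of W holds the root of unity e^{-i 2 pi t k / N}. */
    for (int t = 0; t < N; t++)
        for (int k = 0; k < N; k++)
            W[t][k] = cexp(-I * 2.0 * PI * t * k / N);

    /* A test signal: cos(pi t) sampled at 4 Hz, as in Example 1. */
    for (int t = 0; t < N; t++)
        x[t] = cos(PI * t * 0.25);

    /* X = x W : the 1 x N row vector x times the N x N matrix W. */
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int t = 0; t < N; t++)
            X[k] += x[t] * W[t][k];
    }

    for (int k = 0; k < N; k++)
        printf("X(%d) = %.3f %+.3fi\n", k, creal(X[k]), cimag(X[k]));
    return 0;
}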

7.5 Examples

To recapitulate: When we compute a DFT, we multiply a signal x(t) by complex sine waves from the roots of unity, e^{iω_k t}. These roots of unity are only within the interval [0, 2π), and there are N-many of them. Hence, it may seem like the DFT can only detect frequencies between 0 and 2π rad/s, but these frequencies are just "placeholders" for the actual frequencies: We scale these ω_k by our sampling frequency f_s at the end. So, the set of ω_k can be thought of as normalized frequencies in the range of 0 to 1 hertz.

The DFT can detect N-many frequencies up to the frequency f_s/2, but beware: The highest frequency is almost never equal to N, because the length of x(t) is almost never one second. N, and hence f_s, specify the resolution of the DFT. This is visually depicted in Figures 7.17 and 7.18.


In this section, we will show the scratch work required to compute the DFT of a short, periodic signal by hand. The amount of room it takes up should show you that the analysis of more complex (i.e., larger) signals is best left to a computer, but for some, it aids comprehension to see the explicit math involved. Example 2 is the DFT in Mathematica, which specifies the Fourier transform differently, so we show how to compensate for that. The third example shows how the DFT can be estimated graphically.


Example 1: A simple sinusoid, by hand

Let us evaluate the DFT of a simple sinusoid: x(t) = cos(πt) sampled at f_s = 4 Hz for the first 2 seconds, i.e., 0 ≤ t < 2. Then N = 2 · 4 = 8 samples, T_s = 0.25 s, and the sampled signal is

x_s(t) = (1, √2/2, 0, −√2/2, −1, −√2/2, 0, √2/2).

For k = 0,

X(0) = ∑_{t=0}^{7} x(t)e^{−i2π(0)t/8}.

Because e^0 = 1, this is just the sum of each value of x_s(t) from t = 0 to t = 7:

X(0) = ∑_{t=0}^{7} x(t)
= x(0) + x(1) + x(2) + x(3) + x(4) + x(5) + x(6) + x(7)
= 1 + √2/2 + 0 − √2/2 − 1 − √2/2 + 0 + √2/2
= 0.

For k = 1:

X(1) = ∑_{t=0}^{7} x(t)e^{−i2π(1)t/8}
= x(0)e^{−i2π(1)(0)/8} + x(1)e^{−i2π(1)(1)/8} + x(2)e^{−i2π(1)(2)/8}
  + x(3)e^{−i2π(1)(3)/8} + x(4)e^{−i2π(1)(4)/8} + x(5)e^{−i2π(1)(5)/8}
  + x(6)e^{−i2π(1)(6)/8} + x(7)e^{−i2π(1)(7)/8}
= (1)e^0 + (√2/2)e^{−i2π/8} + 0 + (−√2/2)e^{−i6π/8}
  + (−1)e^{−i8π/8} + (−√2/2)e^{−i10π/8} + 0 + (√2/2)e^{−i14π/8}
= (1)(1) + (√2/2)(√2/2 − i√2/2) + 0 + (−√2/2)(−√2/2 − i√2/2)
  + (−1)(−1) + (−√2/2)(−√2/2 + i√2/2) + 0 + (√2/2)(√2/2 + i√2/2)
= 1 + (1/2 − i/2) + 0 + (1/2 + i/2) + 1 + (1/2 − i/2) + 0 + (1/2 + i/2)
= 4.


That takes up a lot of room, as you can see, so I will leave some of the math for you to verify from here on. For k = 2:

X(2) = ∑_{t=0}^{7} x(t)e^{−i2π(2)t/8}
= (1)(1) + (√2/2)(0) + (0)(−1) + (−√2/2)(0) + (−1)(1) + (−√2/2)(0) + (0)(−1) + (√2/2)(0)
= 1 − 1 = 0.

For k = 3:

X(3) = ∑_{t=0}^{7} x(t)e^{−i2π(3)t/8}
= 1 + (−1/2 − i/2) + 0 + (−1/2 + i/2) + 1 + (−1/2 − i/2) + 0 + (−1/2 + i/2)
= 0.

For k = 4:

X(4) = ∑_{t=0}^{7} x(t)e^{−i2π(4)t/8},

so the exponential will vary between −1 and 1 with no imaginary part. Therefore,

X(4) = (1)(1) + (√2/2)(−1) + (0)(1) + (−√2/2)(−1) + (−1)(1) + (−√2/2)(−1) + (0)(1) + (√2/2)(−1)
= 0.

For k = 5:

X(5) = ∑_{t=0}^{7} x(t)e^{−i2π(5)t/8}
= 1 + (−1/2 + i/2) + 0 + (−1/2 − i/2) + 1 + (−1/2 + i/2) + 0 + (−1/2 − i/2)
= 0.

For k = 6:

X(6) = ∑_{t=0}^{7} x(t)e^{−i2π(6)t/8}
= 1 + 0 + 0 + 0 + (−1) + 0 + 0 + 0
= 0.

For k = 7:

X(7) = ∑_{t=0}^{7} x(t)e^{−i2π(7)t/8}
= 1 + (1/2 + i/2) + 0 + (1/2 − i/2) + 1 + (1/2 + i/2) + 0 + (1/2 − i/2)
= 4.

So, X(k) = (0, 4, 0, 0, 0, 0, 0, 4), a function symmetric about the line k = 4, which is equal to N/2. The frequency of the kth frequency component is given by ω_k = 2πk/(NT_s), so ω_1 = 2π(1)/(8(0.25)) = 2π/2 = π rad/s or 0.5 Hz, which is indeed the frequency of our sampled x(t). The other nonzero term is just the Hermitian conjugate of X(1), so for that reason, the first 5 terms of X(k) (up to the (N/2)th term) are the only ones we care about.¹⁰ The inverse DFT of this signal is left for the reader to verify.

10 Because some texts will call the first term of time signals and frequency spectra x(1) or X(1) and the final term x(N) or X(N), sometimes this is specified as the (N − 1)/2 frequency component.
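
A quick way to do that verification is with a few lines of code. This C sketch is my own illustration, not the author's: it applies the inverse DFT formula to X(k) = (0, 4, 0, 0, 0, 0, 0, 4) and prints the recovered samples, which should be the cosine values 1, √2/2, 0, −√2/2, −1, −√2/2, 0, √2/2.

#include <stdio.h>
#include <math.h>
#include <complex.h>

#define N 8

int main(void)
{
    const double PI = 3.14159265358979323846;
    /* The spectrum computed by hand in Example 1. */
    double complex X[N] = {0, 4, 0, 0, 0, 0, 0, 4};
    double complex x[N];

    /* IDFT: x(t) = (1/N) sum_k X(k) e^{+i 2 pi k t / N}. */
    for (int t = 0; t < N; t++) {
        x[t] = 0;
        for (int k = 0; k < N; k++)
            x[t] += X[k] * cexp(I * 2.0 * PI * k * t / N);
        x[t] /= N;
    }

    for (int t = 0; t < N; t++)
        printf("x(%d) = %.3f %+.3fi\n", t, creal(x[t]), cimag(x[t]));
    return 0;
}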



When N is odd, X(0) will be nonzero, i.e., there is a DC offset in<br />

signals of odd length because their average value is not 0. The DC<br />

offset is equal <strong>to</strong> the average value of x(t) times the number of samples,<br />

or rather, the sum of every sample in x(t). This is so because direct<br />

current (DC) is constant <strong>and</strong> has a frequency of 0 Hz. X(0) will always<br />

be real-valued because the input signal is real-valued. A DC offset in<br />

practice is considered undesirable because speakers will not be at their<br />

resting state when the signal is on but silent due <strong>to</strong> the constant flow<br />

of current yet no change in voltage (amplitude).<br />

It is important to realize that though N looks like a variable, it is actually a constant: We derive it from the length of x(t) in seconds and the sampling frequency, but its value does not change. The DFT and IDFT can be normalized by scaling them both by 1/√N. N is not bounded by k in the sum, and so multiplying each term by 1/√N just scales the values and does not affect the relative powers of the terms. The normalized discrete Fourier transform (NDFT) and its inverse (the NIDFT) are written

F̂{x(t)} := X̂(k) = (1/√N) ∑_{t=0}^{N−1} x(t)e^{−i2πkt/N},

F̂^{−1}{X̂(k)} := x(t) = (1/√N) ∑_{k=0}^{N−1} X̂(k)e^{i2πkt/N}.

Mathematica, for whatever reason, reverses these equations such that the NDFT is their IDFT and the NIDFT is their DFT.

Example 2: The DFT of a complex sinusoid, in Mathematica

The built-in Fourier[] function in Mathematica is the normalized inverse discrete Fourier transform (NIDFT) as above, and similarly, the InverseFourier[] function takes the normalized discrete Fourier transform. Therefore, we need to use InverseFourier[] multiplied by √N to take a discrete Fourier transform in the way we have described thus far.

Let x(t) = cos(2πt) + cos(πt) + cos(1.2πt). Because f_max is 1 Hz, f_s must be greater than 2 Hz, so let there be 3 samples per second, i.e., f_s = 3 Hz. We'll sample this over 4 seconds:

x_s(t) = {1, −0.5, −0.5, 1, −0.5, −0.5, 1, −0.5, −0.5, 1, −0.5, −0.5}
       + {1, 0.5, −0.5, −1, −0.5, 0.5, 1, 0.5, −0.5, −1, −0.5, 0.5}
       + {1, 0.309, −0.809, −0.809, 0.309, 1, 0.309, −0.809, −0.809, 0.309, 1, 0.309}
       = {3, 0.309, −1.809, −0.809, −0.691, 1, 2.309, −0.809, −1.809, 0.309, 0, 0.309}

for N = 12 and T_s = 1/3 seconds. Explicitly, the function Fourier[x], where x is some discrete time-domain signal, is given by

Fourier[x] = (1/√N) ∑_{t=0}^{N−1} x(t)e^{i2πkt/N}

and its inverse InverseFourier[X] is

InverseFourier[X] = (1/√N) ∑_{k=0}^{N−1} X(k)e^{−i2πkt/N}.

Not only are these multiplied by 1/√N, but the signs of e's exponent are actually reversed. Therefore, to take the DFT of x_s(t) as above, we type into Mathematica the command

x={3,0.309,-1.809,-0.809,-0.691,1,2.309,-0.809,-1.809,0.309,0,0.309};

and then

Sqrt[Length[x]]*InverseFourier[x]

Now press "Shift+Enter" to execute these lines of code in Mathematica.¹¹ The output is

{1.309, 1.406+0.812i, 8.368+4.102i, -2.927i, 6.559-0.968i, 0.667-0.385i, 0.691, 0.667+0.385i, 6.559+0.968i, 2.927i, 8.368-4.102i, 1.406-0.812i}

The magnitude of this is

|X(k)| = (1.309, 1.624, 9.319, 2.927, 6.630, 0.770, 0.691, 0.770, 6.630, 2.927, 9.319, 1.624).

There is energy spread to every one of the frequency bins because no ω_k equals exactly 0.6 Hz, so its energy leaks to nearby bins, including those at ω_k = π rad/s and ω_k = 2π rad/s. The bins with the most energy are the k = 2 and k = 4 bins (ignoring the second half of the DFT), corresponding to the frequencies ω_2 = 2π(2)/(12(1/3)) = 4π/4 = π rad/s or 0.5 Hz, which is the frequency of cos(πt), and ω_4 = 2π(4)/(12(1/3)) = 8π/4 = 2π rad/s or 1 Hz, the frequency of cos(2πt). The bin k = 3 has the third greatest energy, showing the leakage from cos(1.2πt), because ω_3 is the frequency bin for 1.5π rad/s or 0.75 Hz. However, most of the energy from this sinusoid leaks to the k = 2 frequency bin because 0.5 Hz is closer to 0.6 Hz than is 0.75 Hz. This is the spectral leakage.

Example 3: The DFT of a complex sinusoid, graphically

To interpret the DFT by graphical inspection, we can choose one of two methods: (1) analyzing the product of x(t) when multiplied by each root of unity, or (2) stretching x(t) as described above in the scaling theorem to be twice its length and half of its frequency, i.e., double its period.¹¹ Both ways will return identical results and identical graphs, so we will only show method (1) here and leave the second for the reader to verify. Let our sinusoid contain the first three harmonics with diminishing energy as their frequency increases: x(t) = sin(2πt) + 0.3 sin(4πt) + 0.1 sin(6πt), defined over the interval of time [0, 2]. The length of x(t) is two seconds, and we must sample it at greater than two times the maximum frequency component, which is 3 Hz. So let f_s = 8 Hz, and N = length(x) · f_s = 2 · 8 = 16.

11 All of the functionality of Mathematica is online and free at http://wolframalpha.com/.

From method (2), the resulting DFT will be identical to the one derived from method (1), except that the duration of x(t) is increased to 4 seconds and f_max is now 1.5 Hz. Therefore, we can sample at 4 Hz instead of 8 Hz, and the following plots will look exactly the same except for the scale of the horizontal axis, which will be doubled.

Below is the graphical depiction of method (1). The dashed lines represent the multiplication of the original function with the imaginary sine component of the W_N^k, and the solid lines show the product of x(t) with the real cosine component. There are "O"s marking where the cosine wave multiplies with the samples of x(t), and "X"s where the imaginary sine wave multiplies with x(t), so there are 16 O's and 16 X's on each graph. Each graph depicts e^{−i2πkf_s t/16} over two seconds for 0 ≤ k ≤ 15. We multiply by f_s here because the scale of our horizontal axis shows seconds, not samples, and hence x(1) represents the signal at the time of one second, not the eighth sample; otherwise, we would leave it alone. Try to inspect where the sum of the amplitudes will not be zero, i.e., when the points favor one half of the horizontal axis over the other. When the points straddle the axis, this is a likely indication that the samples' total sum is zero.

Sampled sinusoids of the DFT

This method shows us that for k = 2, k = 4, k = 12, and k = 14, the real parts (the O's) are symmetric about the horizontal axis and probably sum to zero, while the imaginary parts (the X's) are nonzero.¹² For k = 2 and k = 4, the X's are more on the bottom (negative) half,

12 Note that since the sine waves (the dashed lines) are complex, their vertical axis is imaginary.

Figure 7.27: The signal x(t) = sin(2πt) + sin(4πt) multiplied by each of the roots of unity, e^{−i2πkf_s t/N}, equivalent to e^{−iπkt}, which is the sum cos(πkt) − i sin(πkt) for 0 ≤ k ≤ 15. The real samples resulting from x(t) cos(πkt) sum to zero in each of these graphs, while a few of the imaginary samples from x(t)[−i sin(πkt)] have nonzero sums.

and for k = 12 and k = 14, they favor the positive half. In fact, the spectrum is X(k) = {0, 0, −8i, 0, −8i, 0, 0, 0, 0, 0, 0, 0, 8i, 0, 8i, 0}.

Take caution when computing the DFT of actual signals: Oftentimes, its results will be jarringly different from what we perceive, especially in the case of tonal music. The DFT can be particularly unreliable when we want to analyze the frequency content of polyphonic music, e.g., music containing more than one instrument and pitch at any given time. Fundamental frequency detection, called f_0-tracking in the field of music information retrieval, is a largely unsolved problem because sometimes our brain will fill in the fundamental when it is weak or missing altogether from a signal, via difference tones or simply the complicated language of music. Consider, for example, letting a chord on a guitar ring out for a few seconds. The fundamental will disappear rather quickly, but our brains may still say that it is the essential frequency content of the sound when the DFT would show otherwise. In conclusion, do not be frustrated when the DFT fails to return the information you want (and you know that you have specified it correctly); rather, devise a way to correct its failures with respect to our auditory perception.¹³

7.6 Chapter summary

The continuous Fourier transform

F{x(t)} := X(ω) = ∫_{−∞}^{∞} x(t)e^{−iωt} dt

accepts a continuous, time-domain function x(t) and produces a continuous, frequency-domain spectrum X(ω) containing the frequency information of x(t). Its inverse,

F^{−1}{X(ω)} := x(t) = (1/2π) ∫_{−∞}^{∞} X(ω)e^{iωt} dω,

accepts a continuous, frequency-domain function X(ω) to reproduce the continuous time-domain signal x(t).

The exponentials e^{±i2πkt/N} are Euler's roots of unity that can be written as complex trigonometric functions by Euler's formula, which states that

e^{±iω} = cos(ω) ± i sin(ω).

Therefore, the real parts of X(ω) correspond to results of the cosine function's product with x(t), and the imaginary parts correspond to the product of the sine function with x(t).

13 Try smoothing your data.

The discrete Fourier transform,

F{x(t)} := X(k) = ∑_{t=0}^{N−1} x(t)e^{−i2πkt/N},

accepts only discrete, sampled functions x_s(t). Because of this, we can write the transform with simply x(t) and know it is a discrete, sampled signal. Its inverse,

F^{−1}{X(k)} := x(t) = (1/N) ∑_{k=0}^{N−1} X(k)e^{i2πkt/N},

reconstructs the sampled time-domain signal.

If a frequency ω_k (where ω_k = 2πk/(NT_s)) is present in the signal x(t), then the DFT will constructively interfere with the signal at ω_k, producing a non-zero value in the frequency bin X(k). The value of X(k) is exactly equal to the sum of the amplitudes of the signal multiplied by the roots of unity given by W_N^k. However, if the signal contains frequency components not expressed by the roots of unity, i.e., when a poor sampling frequency f_s is chosen or when the frequency is not in the set

{ω_k} = { 0, 2π/(NT_s), 4π/(NT_s), . . . , 2π(N−1)/(NT_s) },

then spectral leakage will occur and the unrepresented frequency components will alias with nearby ω_k, spreading their energy over 2 or more frequency bins.

The Fourier transform of a real signal possesses Hermitian symmetry about its (N/2)th frequency component, meaning that the magnitude |X(j)| is equal to the magnitude |X(N − j)| for some integer j ≤ N/2—i.e., the magnitude plot is symmetrical about the line k = N/2. This means that in the continuous case, the second half of the transform's magnitude, containing the positively valued frequency components, is the half of interest, so we may ignore the first half. In the discrete case, all of the frequencies are positively valued, and we may ignore the second half to understand the relative energies of the frequency components of a signal.

The amplitude X(0) of the zeroth frequency component, at 0 Hz, is a real number corresponding to the DC offset of a waveform, so it is significant. A spectrum will have a non-zero DC offset when the average value of the waveform is non-zero, which is often the case when N is odd.

8. Other Fourier transforms

To quickly review, the continuous Fourier transform is represented by the integral

X(ω) = ∫_{−∞}^{∞} x(t)e^{−iωt} dt,

where ω = 2πf. This returns the amplitude of the frequency component ω in the entire continuous, time-domain signal x(t). Its inverse is

x(t) = (1/2π) ∫_{−∞}^{∞} X(ω)e^{iωt} dω,

which tells us the amplitude of x(t) at time t by integrating over all of the frequency components.

The discrete Fourier transform of a sampled, time-domain signal x_s(t) is given by the sum

X(k) = ∑_{t=0}^{N−1} x_s(t)e^{−i2πkt/N}

for k = 0, 1, . . . , N − 1, where N is the total number of samples in the sampled signal x_s(t), t is the sample number, and k is the index of the frequency component ω_k. X(k) returns the amplitude of ω_k in the entire signal x_s(t). The inverse discrete Fourier transform (IDFT) is

x_s(t) = (1/N) ∑_{k=0}^{N−1} X(k)e^{i2πkt/N}

for t = 0, 1, . . . , N − 1. This reconstructs the amplitude of x_s(t).

In addition to the continuous and discrete Fourier transforms, there are several other transforms that give the frequency-domain spectrum of a time-domain signal. The Laplace transform is used in electrical engineering to compute transfer functions, which describe the transfer of voltage in a linear, continuous, time-invariant system. The Z-transform is the discrete version of the Laplace transform: It takes an infinite, discrete, time-domain input and outputs a complex, finite spectrum that is limited by some region of convergence. The discrete-time Fourier transform (DTFT) is a special case of the Z-transform whose region of convergence is the unit circle, i.e., the interval [0, 2π). A DTFT can be derived from a DFT via spectral interpolation. A fast Fourier transform (FFT) calculates a DFT with substantially fewer computations and is easily the most popular version of the Fourier transform. Finally, a short-term Fourier transform (STFT) computes the FFT at very small (0.1 seconds or less) intervals of time in a song to show how its frequencies change. Its results are typically conveyed in a spectrogram. We will discuss the Z- and Laplace transforms in Appendix A since they are not Fourier transforms.

8.1 Discrete-time Fourier transform (DTFT)

The discrete-time Fourier transform is a special case of the Z-transform that reduces the domain of a spectrum of frequencies to the continuous interval [0, 2π).¹ In the DTFT, the frequency components ω_k are normalized such that ω̂_k = 2πkT_s, unlike in the DFT, where ω_k equals 2πk/(NT_s). We typically see the DTFT in digital filter design (namely FIR filters), where a discrete transfer function H(z) may be computed, i.e., the input and output of some discrete, linear, time-invariant system is known. Like the DFT, the DTFT requires a discrete, sampled input, but there is no N; instead, the duration of the input must be infinite², so its time samples t are all of the integers from negative to positive infinity.

1 Some texts specify this interval as [−π, π).
2 If a system is time-invariant, sometimes we can assume the input is infinite for purposes of calculation. With repetitive, steady signals like pure tones or white noise, this is an assumption we may make, though the beginning and ending points should both be 0 to avoid clipping artifacts that arise in the frequency domain.

The DTFT is defined by the sum

X(ω̂) = ∑_{t=−∞}^{∞} x[t]e^{−iω̂t},

where ω̂ is in the interval [0, 2π). Because this forms a continuum, the inverse DTFT is the integral

x[t] = (1/2π) ∫_0^{2π} X(ω̂) · e^{iω̂t} dω̂.

We use square brackets to indicate which function is discrete and parentheses to indicate a continuity.

The critical difference between a DTFT and a DFT is that the DTFT frequency domain, [0, 2π), is a continuum. This is related to the fact that the Fourier transform of an infinite signal is finite, but stems from the property that the DTFT is a periodic function wherein

X(ω̂_k + 2π) = X(ω̂_k).

The frequency range of the DFT is not continuous because it is only defined for the frequencies ω_k = 2πk/(NT_s) where 0 ≤ k ≤ (N − 1), representing N-many uniformly spaced frequencies [26]. Therefore, we consider the DTFT to be more mathematically rigorous than the DFT, even though we rarely take a DTFT in practice because our input signals are not infinite. However, we can make the time-limited signals infinite by zero-padding in the time domain, i.e., appending zeros onto x[t] such that t is defined over all of the integers instead of only 0 to N − 1. Zero-padding in the time domain translates to spectral interpolation in the frequency domain, which effectively limits its frequency domain to some finite interval and results in a higher resolution in the frequency domain. So the more zeros are padded onto a signal, the "smoother" the resulting transform. Conversely, if we sample the DTFT by computing N-many samples per period of X, the DTFT will be equivalent to the DFT:

x[t] = (1/2π) ∫_0^{2π} X(ω̂)e^{iω_k t} dω = T_s ∮_{1/T_s} X(k)e^{i2πktT_s} dk,

where the syntax "∮" denotes the closed path integral. Here, the integral ∮_{1/T_s} is computed over any single period of X, i.e., it is of length 2π and does not necessarily begin at 0. Then

X(k/(NT_s)) = ∑_{t=−∞}^{∞} x[t]e^{−i2πkt/N}.

8.2 Fast Fourier transform (FFT)

The fast Fourier transform is an efficient algorithm that seeks to speed up the computation of the discrete Fourier transform by removing all of its redundant computations. At first glance (and second, and third, and . . .), the mathematical expressions involved are not as clear, nor do they reveal as much about its inner workings, as does that of the discrete Fourier transform. This is the main reason why we leave the FFT, and algorithms in general, to computers and other hardware devices.

This algorithm speeds up the processing time of the DFT by reducing the number of computations required from N^2-many to N log_2 N-many (on average), where N is again the total number of samples in the signal x(t). Because the FFT outputs exactly the same thing as the DFT, it is computationally the clear choice when N is large—which is always the case for audio because of its high sampling rate.

The FFT still requires its input <strong>to</strong> be discrete, <strong>and</strong> likewise it produces<br />

a discrete output. Furthermore, the <strong>to</strong>tal number of samples N<br />

must be a power of 2, i.e., 256, 512, 1024, <strong>and</strong> so on. If N is not equal <strong>to</strong><br />

a power of 2, an appropriate number of zeros can be added <strong>to</strong> the end



of the signal to make it so. This is the same zero-padding used above in the DTFT, though here it is a finite number of zeros.

The FFT was originally discovered by Carl Friedrich Gauss in 1805, but the results were not published until after his death, and the computational efficiency of the algorithm was realized by neither him nor his readers. In 1965, a paper published by J. W. Cooley (IBM) and John Tukey (Princeton University) described the same algorithm and its implementation on a computer, but did not cite Gauss, and the connection was not made until some time after [27].

The algorithm improves the computational efficiency of the DFT by recursively partitioning the entire signal into smaller parts, using the DFT's linearity to take multiple smaller DFTs on these parts, and combining the results. There are several versions of the FFT proposed by Cooley and Tukey, each using different algorithmic techniques to perform this task. Given here is the most popular version: the radix-2 decimation-in-time FFT.

Radix-2 decimation in time (DIT) of the FFT

Also called the Danielson-Lanczos lemma, the radix-2 decimation-in-time algorithm performs an FFT by first splitting the input signal into two parts: the even-numbered samples,

x_even = (x(0), x(2), x(4), . . . , x(2m)),

and the odd-numbered samples,

x_odd = (x(1), x(3), x(5), . . . , x(2m + 1)).

If N, the number of samples in x(t), is a power of 2, then N = 2m + 2 (the last term of x is x(2m + 1); remember, we include the term x(0) in the count). Otherwise, we pad the end of the signal with zeros until its length is a power of 2. So, if the length of x is 250, we would pad it with 6 zeros to make its length 256 = 2⁸. Since we begin at 0, we sum from 0 to N − 1. Dividing this by 2, we sum from 0 to N/2 − 1,
sum from 0 <strong>to</strong> N − 1. Dividing this by 2, we sum from 0 <strong>to</strong> N/2 − 1,



twice. Then the DFT of the input signal can be written as the sum of the DFTs of these split signals:

X(k) = \sum_{m=0}^{\frac{N}{2}-1} x(2m) \, e^{-i 2\pi (2m) k / N} + \sum_{m=0}^{\frac{N}{2}-1} x(2m+1) \, e^{-i 2\pi (2m+1) k / N}.

The zeros padded at the end of the signal do not contribute anything to this sum, so this form is sufficient. We can make both of the exponentials e^{-i 2\pi (2m) k / N} and e^{-i 2\pi (2m+1) k / N} identical to one another by factoring out e^{-i 2\pi k / N} from the second one, so we can rewrite this as

X(k) = \sum_{m=0}^{\frac{N}{2}-1} x(2m) \, e^{-i 2\pi (2m) k / N} + e^{-i 2\pi k / N} \sum_{m=0}^{\frac{N}{2}-1} x(2m+1) \, e^{-i 2\pi (2m) k / N}.

Each of these two sub-sums is a DFT of length N/2.³ The factored-out exponential e^{-i 2\pi k / N}, however, flips its sign when k is increased by N/2 (see the footnote). Therefore, we can use this result and the above sum for X(k) to compute the terms of the DFT for k ≥ N/2 from the same two sub-sums; only the sign in front of the second sum changes. For the sake of space, let F = e^{-i 2\pi (2m) k / N} and let G = e^{-i 2\pi (2m)(k - \frac{N}{2}) / N}. Then,

X(k) =
\begin{cases}
\sum_{m=0}^{\frac{N}{2}-1} x(2m) \cdot F + e^{-\frac{i 2\pi k}{N}} \sum_{m=0}^{\frac{N}{2}-1} x(2m+1) \cdot F, & k < N/2 \\
\sum_{m=0}^{\frac{N}{2}-1} x(2m) \cdot G - e^{-\frac{i 2\pi (k - \frac{N}{2})}{N}} \sum_{m=0}^{\frac{N}{2}-1} x(2m+1) \cdot G, & k \ge N/2.
\end{cases}

³ Again, the second half of the output of a DFT is symmetrical to the first half. To realize why this is true, consider the above factor, e^{-i 2\pi k / N}:

e^{-i 2\pi k / N} = -e^{-i\pi} \cdot e^{-i 2\pi k / N}   (1)
                 = -e^{-(i 2\pi \frac{N}{2}) / N} \cdot e^{-i 2\pi k / N}   (2)
                 = -e^{-i 2\pi (k + \frac{N}{2}) / N}.   (3)

We can switch the sign in the first step (1) because e^{-i\pi} = -1. Then in (2), we rewrite e^{-i\pi} as e^{-i 2\pi \frac{N}{2} / N}; the 2's and the N's cancel out to make the exponent -i\pi. Finally, (3) once again makes use of the rule x^a \cdot x^b = x^{a+b}: here,

-e^{-\frac{i 2\pi \frac{N}{2}}{N}} \cdot e^{-\frac{i 2\pi k}{N}} = -e^{-\frac{i 2\pi \frac{N}{2}}{N} - \frac{i 2\pi k}{N}},

and we can factor out -\frac{i 2\pi}{N} to reduce this to -e^{-\frac{i 2\pi}{N}(\frac{N}{2} + k)}, which gives us the same result seen in (3).



This may not seem like it actually reduces the number of computations involved: if we have to go through all the k's anyway, why does this result mean anything?

Well, for one, this is not actually the final specification of the algorithm: it is just the first reduction, wherein the number of computations is reduced from N² to N²/2. (N²/2 is equal to N log₂ N only when N = 2 or N = 4.) Further reductions require knowledge of the size of N, because then we will know when we can factor out equivalent exponentials. The roots of unity described by the exponent of e permit the Fourier transform to detect periodic waves, i.e., those with a specific frequency, and they also permit the acceleration of the discrete Fourier transform into the fast Fourier transform. By identifying what can be factored out, we can reduce the size of the individual sums to 2, no matter the N. The number of times we have to factor out exponentials is equal to log₂ N.

Let us do an example to show that indeed the computational complexity reduces to N log₂ N. Let N = 8. The normal DFT would then be written

X(k) = x(0) + x(1) e^{-\frac{i 2\pi k}{8}} + x(2) e^{-\frac{i 4\pi k}{8}} + x(3) e^{-\frac{i 6\pi k}{8}} + x(4) e^{-\frac{i 8\pi k}{8}} + x(5) e^{-\frac{i 10\pi k}{8}} + x(6) e^{-\frac{i 12\pi k}{8}} + x(7) e^{-\frac{i 14\pi k}{8}}.

Splitting this into the even and odd parts gives

X(k) = \left[ x(0) + x(2) e^{-\frac{i 4\pi k}{8}} + x(4) e^{-\frac{i 8\pi k}{8}} + x(6) e^{-\frac{i 12\pi k}{8}} \right] + \left[ x(1) e^{-\frac{i 2\pi k}{8}} + x(3) e^{-\frac{i 6\pi k}{8}} + x(5) e^{-\frac{i 10\pi k}{8}} + x(7) e^{-\frac{i 14\pi k}{8}} \right]

     = \left[ x(0) + x(2) e^{-\frac{i 4\pi k}{8}} + x(4) e^{-\frac{i 8\pi k}{8}} + x(6) e^{-\frac{i 12\pi k}{8}} \right] + e^{-\frac{i 2\pi k}{8}} \left[ x(1) + x(3) e^{-\frac{i 4\pi k}{8}} + x(5) e^{-\frac{i 8\pi k}{8}} + x(7) e^{-\frac{i 12\pi k}{8}} \right].

Further factoring out another exponential to make the sums half the length again (so our sum goes from 0 to N/4 − 1) gives



X(k) = \left\{ \left[ x(0) + x(4) e^{-\frac{i 8\pi k}{8}} \right] + e^{-\frac{i 4\pi k}{8}} \left[ x(2) + x(6) e^{-\frac{i 8\pi k}{8}} \right] \right\} + e^{-\frac{i 2\pi k}{8}} \left\{ \left[ x(1) + x(5) e^{-\frac{i 8\pi k}{8}} \right] + e^{-\frac{i 4\pi k}{8}} \left[ x(3) + x(7) e^{-\frac{i 8\pi k}{8}} \right] \right\}

     = \left\{ \left[ x(0) + x(4) e^{-i\pi k} \right] + e^{-\frac{i\pi k}{2}} \left[ x(2) + x(6) e^{-i\pi k} \right] \right\} + e^{-\frac{i\pi k}{4}} \left\{ \left[ x(1) + x(5) e^{-i\pi k} \right] + e^{-\frac{i\pi k}{2}} \left[ x(3) + x(7) e^{-i\pi k} \right] \right\}.

So the final result is 4 sums of size 2, and it took us 3 factorizations to get there. Each of these factorizations took 8 computations. So, the total number of computations required was 8 · 3 = 24 = 8 log₂ 8 = N log₂ N.

The organization of the FFT algorithm is usually visualized by a butterfly diagram, but I find them somewhat confusing due to their many arrows. The diagram in Figure 8.1 is a different graphical interpretation of the radix-2 FFT, showing the exponential factors at each step of its evaluation.

Figure 8.1: Diagram depicting the nested processes of the radix-2 DIT FFT.
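The same nesting can be expressed directly in code. Below is a minimal C sketch of the recursive radix-2 DIT FFT described above; it is not the book's own listing, it assumes N is a power of 2, and the array names, helper function, and test signal are illustrative.

#include <complex.h>
#include <math.h>
#include <stdio.h>

/* In-place recursive radix-2 decimation-in-time FFT.
 * x holds N complex samples; N must be a power of 2. */
static void fft(double complex *x, int N)
{
    if (N < 2) return;                      /* a length-1 DFT is the sample itself */

    double complex even[N / 2], odd[N / 2];
    for (int m = 0; m < N / 2; m++) {       /* decimate in time */
        even[m] = x[2 * m];
        odd[m]  = x[2 * m + 1];
    }
    fft(even, N / 2);                       /* recurse on each half */
    fft(odd,  N / 2);

    const double pi = acos(-1.0);
    for (int k = 0; k < N / 2; k++) {
        /* twiddle factor e^{-i 2 pi k / N} */
        double complex w = cexp(-I * 2.0 * pi * k / N);
        x[k]         = even[k] + w * odd[k];   /* k < N/2               */
        x[k + N / 2] = even[k] - w * odd[k];   /* k >= N/2: sign flips  */
    }
}

int main(void)
{
    double complex x[8] = {1, 1, 1, 1, 0, 0, 0, 0};   /* N = 8 test signal */
    fft(x, 8);
    for (int k = 0; k < 8; k++)
        printf("X(%d) = %6.3f %+6.3fi\n", k, creal(x[k]), cimag(x[k]));
    return 0;
}

The two lines inside the combining loop are exactly the two cases of the piecewise expression for X(k) above: the even and odd half-length DFTs are reused for k and k + N/2, with only the sign of the twiddle factor changing.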



8.3 Short-time Fourier transform (STFT)

Music is a time-based art, and we process it sequentially. We pay attention to changes and build expectations for these changes as we become experienced listeners. A (discrete) Fourier transform retrieves the frequency information of an entire signal, but for varied signals with multiple instruments and chords, this isn't very helpful. Instead, we want to know what happens at small intervals of time so we can get an idea of change in music. Hence, the STFT is a very useful version of the DFT.

The STFT computes the Fourier transform by partitioning the time signal into smaller, equally sized time frames, and taking the Fourier transform of each of them. The STFT has a continuous and a discrete form.

X(\tau_m, \omega) = \int_{-\infty}^{\infty} x(t) \, w(t - \tau_m) \, e^{-i\omega t} \, dt

The continuous-time STFT, as it is called, applies a windowing function w(t − τ_m) to a continuous signal x(t) and returns one Fourier transform for each window. The mth window begins at time t = τ_m, where τ_m = mH is the window index m multiplied by the hop size H. The size of the hop differs from the size of the transform, however. We call each short-term time-domain signal to be transformed a window, or frame, and its size the frame size. We step through these frames according to a designated hop size. So, the size of a single transform (the frame size) will be a fraction of N according to the number of frames. If the frame size is equal to the hop size, then there is zero overlap between the frames. If the hop size is less than the frame size, then adjacent frames overlap by their difference. Overlap is perfectly fine and actually improves the resolution of the STFT. Both hop size and frame size are intervals of time, so they are given in seconds.



Calling hop size H, frame size N′, and the number of frames M, we have N′ = N/M (equivalently, N = N′ · M). The STFT contains N/H Fourier transforms of size N′. Therefore, when H < N′, the STFT is a more costly algorithm than a single FFT of the entire signal, requiring (N/H) · N′ log₂ N′ computations versus N log₂ N-many.

However, it is usually sped up in practice by using fast Fourier transforms for the individual frames. The STFT is the most common implementation of the Fourier transform because it gives the most accurate representation of a signal: a 180-second-long song containing K-many frequency components most certainly does not contain identical frequency components at every instant of time. Music changes! Usually, the hop size and frame size are chosen somewhere around 50-100 ms to correspond to the time resolution of our perception.

To specify the STFT in terms of the FFT, a size N′ for the FFT must be chosen. Because the FFT length must be a power of 2, we choose N′ such that N′ ≥ H, the size of each hop, and N′ = 2^p for some p. This is called an N′-point FFT. N′ can be determined with a function such as nextpow2(H) in MATLAB, or simply as 2 raised to ⌈log₂ H⌉. We pad each x_m(t) with zeros, i.e.,

x_m(t - \tau_m) =
\begin{cases}
x_m(t - \tau_m), & |t - \tau_m| \le \frac{H-1}{2} \\
0, & |t - \tau_m| > \frac{H-1}{2}
\end{cases}


where

\sum_{m=-\infty}^{\infty} w(t - mH) = 1, \qquad t = -\infty, \ldots, -1, 0, 1, \ldots, \infty.

Therefore,

\sum_{m=-\infty}^{\infty} X(mH, \omega) = \sum_{m=-\infty}^{\infty} \sum_{t=-\infty}^{\infty} x(t) \, w(t - mH) \, e^{-i\omega t}
 = \sum_{t=-\infty}^{\infty} x(t) \, e^{-i\omega t} \sum_{m=-\infty}^{\infty} w(t - mH)
 = \sum_{t=-\infty}^{\infty} x(t) \, e^{-i\omega t}
 = X(\omega).

We can do these sums globally, i.e., for the interval (−∞, ∞) instead of from 0 to N′ − 1, because we zero-padded each of our frames.

Spectrograms are made using short-time fast Fourier transforms. The entire file is sliced evenly into partitions by a time interval (around 100 ms is usually sufficient), and then the discrete Fourier transform is taken of each slice. The resulting graph shows frequency, amplitude, and time, but is plotted in only two dimensions: the horizontal axis shows each time interval, the vertical axis is frequency, and the darkness of each point shows amplitude.

Different types of windows can be specified by a windowing function, w(t).

Windowing<br />

Windowing is a time-selective process that takes many equal size<br />

intervals of a signal by multiplying everything outside of that interval<br />

by zero. It is nearly identical conceptually <strong>to</strong> the impulse function, but<br />

its domain is not infinitesimally small.<br />

Most similar <strong>to</strong> an impulse function is a rectangular window that<br />

is constantly 1 over some interval centered about time τ m <strong>and</strong> 0 elsewhere.<br />

We define this interval as starting at the mth hop, <strong>and</strong> it is N ′



in size. Letting t be a real number, our rectangular window or boxcar<br />

window (because it moves along a function like a train of many boxcars)<br />

can be given by the function<br />

⎧<br />

⎨1, t ∈ [mH, mH + N ′ ]<br />

w(t) =<br />

⎩0, otherwise.<br />

This interval [mH, mH + N ′ ] can also be written [τ m ,τ m + N ′ ] because<br />

τ m = mH. It bears quite a resemblance <strong>to</strong> the Kronecker delta function,<br />

⎧<br />

⎨1, if t =0<br />

δ(t) =<br />

⎩0, otherwise.<br />

This is the simplest window, and it retains all of the amplitude information of a signal, but it also induces the most spectral leakage and side lobes in the frequency domain because of its infinite slope on either side. A windowing function with smoother ends minimizes spectral leakage and has smaller side lobes that decrease to zero almost immediately. However, a rectangular window produces the narrowest or "strongest" center lobe (typically one spike at the closest frequency versus several spikes) of any of the windowing functions, so there is a sort of trade-off for using nondifferentiable windows.

A triangle window neglects a relatively large amount of a signal because of its pointed top: only one sample's amplitude (the center sample of each window) will be the same as the original signal in the windowed function. Also called Bartlett windows, triangle windows are specified by the function

w(t) =
\begin{cases}
1 - \left| \frac{2}{N'}(t - \tau_m) - 1 \right|, & \text{if } t \in [\tau_m, \tau_m + N'] \\
0, & \text{otherwise.}
\end{cases}

Remember, τ_m = mH is the location in time of the beginning of the mth window. The size of the window is the frame size N′, so the



Figure 8.2: A triangle window, centered at 1 second with a size of 2 (N′). We cannot gather the hop size from this image; if there were multiple triangle windows shown in this graph, the hop size H would be the difference in time between the τ_m, i.e., the starting points of successive windows.

mth window interval is given by [mH, mH + N′] if we begin at m = 0 (the zeroth window, as opposed to the first window). The hop size H simply defines the spacing of many of these windows. Thus, we begin the mth window at the mth hop, and there are M-many windows (or frames).

Hanning windows are another popular choice in STFTs because of their gradual slope at their endpoints. This ensures a smooth attack in the windowed signal, and therefore minimal distortion in the spectrum due to discontinuities at the windowed ends of the input signal. The Hanning window is represented by the function

w(t) =
\begin{cases}
\frac{1}{2}\left[1 - \cos\left(\frac{2\pi t}{N' - 1}\right)\right], & \text{if } t \in [\tau_m, \tau_m + N'] \\
0, & \text{otherwise.}
\end{cases}



Figure 8.3: A Hanning window.<br />

The Hanning window is a variation of a cosine or sine window, the simplest form of which is given by

w(t) =
\begin{cases}
\sin\left(\frac{\pi t}{N' - 1}\right), & t \in [\tau_m, \tau_m + N'] \\
0, & \text{otherwise}
\end{cases}
 =
\begin{cases}
\cos\left(\frac{\pi t}{N' - 1} - \frac{\pi}{2}\right), & t \in [\tau_m, \tau_m + N'] \\
0, & \text{otherwise.}
\end{cases}

Figure 8.4: A cosine or sine window.



All of these graphs show windows with τ_m = 0, i.e., they are the 0th windows (m = 0), and the frame size is N′ = 2, so the interval is [0, N′] = [τ_0, τ_0 + 2].
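As a concrete illustration of how the pieces fit together (hop size, frame size, windowing, and a per-frame transform), here is a minimal C sketch of one column of a magnitude STFT. It is not the book's own listing: the frame/hop values, the toy input, the direct per-frame DFT (an FFT like the one sketched in Section 8.2 could be substituted), and the function names are all illustrative assumptions.

#include <math.h>
#include <stdio.h>

#define N_TOTAL 1024    /* total samples in the signal        */
#define FRAME   256     /* frame size N' (a power of 2)       */
#define HOP     128     /* hop size H; HOP < FRAME => overlap */

/* Magnitude of DFT bin k of one Hann-windowed frame (direct O(N'^2) DFT
 * for clarity; a radix-2 FFT would normally be used instead). */
static double frame_bin_magnitude(const double *frame, int k)
{
    const double pi = acos(-1.0);
    double re = 0.0, im = 0.0;
    for (int t = 0; t < FRAME; t++) {
        /* Hann window: w(t) = 0.5 * (1 - cos(2 pi t / (N' - 1))) */
        double w = 0.5 * (1.0 - cos(2.0 * pi * t / (FRAME - 1)));
        re += frame[t] * w * cos(2.0 * pi * k * t / FRAME);
        im -= frame[t] * w * sin(2.0 * pi * k * t / FRAME);
    }
    return sqrt(re * re + im * im);
}

int main(void)
{
    static double x[N_TOTAL];
    const double pi = acos(-1.0);
    for (int t = 0; t < N_TOTAL; t++)          /* toy input: a pure tone */
        x[t] = sin(2.0 * pi * 32.0 * t / N_TOTAL);

    /* One spectrogram column per hop: tau_m = m * HOP. */
    for (int m = 0; m * HOP + FRAME <= N_TOTAL; m++) {
        double mag = frame_bin_magnitude(&x[m * HOP], 8);   /* bin 8 of frame m */
        printf("frame %2d (tau = %4d samples): |X| = %f\n", m, m * HOP, mag);
    }
    return 0;
}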

8.4 Chapter summary<br />

A discrete-time Fourier transform (DTFT) is a fairly obsolete version of the DFT, taking an infinite time-domain discrete signal x[n] and transforming its frequency components to a continuum, the interval [0, 2π). It is given by the sum

X(\hat{\omega}) = \sum_{n=-\infty}^{\infty} x[n] \, e^{-i\hat{\omega}n}.

Because X(\hat{\omega}) is continuous, the inverse DTFT is the integral

x[n] = \frac{1}{2\pi} \int_0^{2\pi} X(\hat{\omega}) \, e^{i\hat{\omega}n} \, d\hat{\omega}.

The fast Fourier transform (FFT) and short-time (fast) Fourier transform (STFT) are efficient algorithms that reduce the computational complexity of the discrete Fourier transform (DFT) by sorting an input's terms according to the roots of unity, into log₂(N)-many groups. The most common version of the FFT is the radix-2 decimation in time (DIT) algorithm. Common roots of unity are factored out, as shown in the example where N = 8 above; because it recursively splits the problem into smaller DFTs in this way, the FFT is considered a "divide-and-conquer" algorithm. A requirement of the FFT is that N (the length of the sampled signal x_s(t)) must be a power of 2. If it is not, the signal is zero-padded, wherein zeros are tacked onto the end of the signal so that no new information is added but its length becomes a power of 2.

The STFT is the most useful Fourier transform to use when the frequency information of a signal changes over time. An STFT does not require its inputs to be discrete; it is specified in both continuous and discrete forms. It windows a signal with a windowing function like a rectangular window or Hanning window, and then takes a Fourier transform (continuous, discrete, or fast) of each window, indexing time. A windowing function is similar to an impulse function, but it is not instantaneous. When the interval of these windows is small, say 50 ms or 100 ms, the change in frequencies of a signal is best understood. A spectrogram can be produced from the results of the STFT, and can give graphical information about timbre/instrumentation, melody, and harmony in a piece of polyphonic music.


A. Frequency-selective circuits<br />

<strong>Signal</strong> processing is a class that every electrical engineering undergraduate<br />

takes, but rarely does it have a musical focus, or even mention.<br />

But as we saw in Chapter 4 on electrical guitar effect units, electric<br />

circuits certainly do have musical applications.<br />

However, it is important <strong>to</strong> note that digital signal processing—<br />

i.e., the frameworks within which we might compute a fast Fourier<br />

transform—is something different from signal processing in the electrical<br />

sense. Although discretized musical data concerns voltages, the<br />

signals passing through circuits are continuous <strong>and</strong> are therefore an<br />

exception <strong>to</strong> most of the techniques described in this book henceforth.<br />

That said, the construction of synthesizers <strong>and</strong> microcontrollers can be<br />

enlightening endeavors in<strong>to</strong> the science of sound, but they each require<br />

very distinct bodies of knowledge exclusive from the mechanics of the<br />

DFT.<br />

Digital filter design is outside of the scope of this book, but there are two important connections between analog and digital filters that this appendix will address. Assuming the systems are linear and time-invariant, the input and output voltages of analog filters may be transformed from the time domain to the frequency domain with the Laplace transform if the system is continuous, or the Z-transform if the system is discrete. These give us the transfer functions of a circuit, written

H(s) = \frac{V_o(s)}{V_i(s)}

for continuous complex frequencies s and

H(z) = \frac{Y(z)}{X(z)},



for discrete complex frequencies z. The functions V_o(s) and V_i(s) are the frequency responses of the continuous output and input voltage functions v_o(t) and v_i(t), respectively, and Y(z) and X(z) are the frequency responses of the discrete output and input voltage functions y(t) and x(t).

In electrical engineering, it is conventional to use the letter j for the imaginary number √−1 instead of i, to avoid confusion with current, which is written i(t). The variables s and z are both equal to jω, meaning they are defined on the complex plane, but again, s is continuous and z is discrete.

Digital filters are specified and analyzed using Z-transforms, while analog filters use Laplace transforms. A Z-transform is equivalent to a discrete-time Fourier transform (DTFT) when z = e^{jω}, i.e., the DTFT is a special case of the Z-transform. The Laplace transform maps an infinite, linear range of frequencies s, while the Z-transform maps a finite, circular range of frequencies z defined over an interval of size 2π. This circular range is sometimes thought of as a "wrapper" because the frequencies wrap around the unit circle over and over again.

Figure A.1: Fourier transforms (both continuous and discrete) and Z-transforms map frequencies of a finite domain, between −2π radians and 2π radians, while the Laplace transform has an infinite frequency domain from −∞ radians to ∞ radians, not confined to the unit circle. Positive frequencies are mapped in a counter-clockwise manner and negative frequencies in a clockwise manner.



A second connection exists between analog and digital domains with regard to filtering. Going from the infinite, linear s-plane to the finite, wrapped z-plane introduces distortions in the resulting transfer function that need to be mitigated. One technique that reduces these errors is blind deconvolution. This was the technique used by Soundstream in 1975 to remove the resonant frequencies of the gramophone from one of the first ever recordings, "Vesti la giubba" by the popular opera singer Enrico Caruso. Deconvolution is the inverse of convolution, and the process is "blind" because both sources (the resonance of the gramophone and the spectrum of the song) are unknown.

Before we get too far ahead of ourselves, we need to explore some of the fundamentals of electrical engineering. This appendix is meant for those curious to learn some of these basics, but the best way to do so is practice! Nilsson and Riedel's Electric Circuits is a nice text for those new to circuit analysis.
those new <strong>to</strong> circuit analysis.<br />

A.1 Ohm’s Law<br />

<strong>An</strong> electric circuit is defined as a closed loop that is connected <strong>to</strong> an<br />

energy source (like a battery) <strong>and</strong> a load (like a lamp). The overarching<br />

law in electrical engineering governing all of circuit design <strong>and</strong> their<br />

analysis is Ohm’s Law. It relates the resistance R of an electrified system<br />

<strong>to</strong> the voltage V applied <strong>to</strong> it (from a battery or other source of power)<br />

<strong>and</strong> the resulting current I running through it. Ohm’s law is expressed<br />

by the equation<br />

V = IR.<br />

Voltage (measured in volts, V) determines the flow of electricity, current<br />

(measured in amperes, A) is the amount of flow, <strong>and</strong> resistance<br />

(measured in ohms, Ω) increases or decreases flow. Voltage represents<br />

the amplitude of (musical) signals. Playing music loudly on a lap<strong>to</strong>p<br />

or MP3 player wears down the supply of the battery more quickly<br />

that playing it softly because it dem<strong>and</strong>s a higher flow of current.



Current can be either direct (DC), in which it travels in one direction, or alternating (AC), in which it flows in opposite directions in regular cycles, designated by the frequency of the AC. In North America this is typically 60 Hz, and in Europe, 50 Hz. Amusingly, a sort of "war" broke out between Nikola Tesla, who championed alternating current, and Thomas Edison, the champion of direct current, called the "Current Wars." Edison was protective of his success with direct current, so when Tesla introduced the idea of alternating current, Edison made the claim that AC was fatal. Today, we almost never use direct current for power distribution, because capacitors and inductors are such that they must have changing current to produce a voltage.

Now, frequency responses and Fourier-transformed spectra depict frequency versus amplitude, but the amplitude here is power, which is a function of voltage and current:

p = VI = \frac{V^2}{R} = I^2 R.

The terms power and energy are often used interchangeably because of their physical relationship. Power is actually the rate at which work is performed, the energy per unit time [26], but we hear it used in a sort of absolute way ("That is a powerful engine," "The president has a lot of power," etc.). Engineers seem to be very comfortable using them to refer to the same thing, but energy is the sum of power over time. The energy of a signal is given by

E = \sum_{t=0}^{\infty} |x(t)|^2 = \sum_{t=0}^{\infty} p(t).

This says that the total energy is equal to the sum of all of the power in the entire signal. Energy is measured in joules (J), where 1 joule is equal to 1 watt times 1 second, so one watt is equal to 1 J/s.

We compute the power p at a point in time t as

p(t) = |x(t)|^2.



This is the squared absolute value of x(t) because we do not want negative or imaginary components. The unit of power is the watt (W). The average power P is then

P = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} p(t).

Parseval's theorem from Chapter 7 uses these equalities to draw conclusions about the power and total energy of a spectrum.
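A minimal C sketch of these two formulas, using an illustrative test signal (the array and names are assumptions, not the book's code):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double pi = acos(-1.0);
    enum { T = 1000 };            /* number of samples considered */
    double x[T];

    for (int t = 0; t < T; t++)   /* toy signal: a unit-amplitude sine */
        x[t] = sin(2.0 * pi * 5.0 * t / T);

    double energy = 0.0;          /* E = sum of p(t) = sum of |x(t)|^2 */
    for (int t = 0; t < T; t++)
        energy += x[t] * x[t];

    double avg_power = energy / T;   /* P = (1/T) * sum of p(t) */

    printf("energy E  = %f\n", energy);      /* about T/2 = 500 for this sine */
    printf("average P = %f\n", avg_power);   /* about 0.5 for a unit sine     */
    return 0;
}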

Resistance is created by resistors (R), inductors (L), and capacitors (C) in RLC circuits. A circuit can be described in a time domain and a frequency domain, because we think of the applied voltage as a signal. The resistance of a resistor is simply R in both the time and frequency domains. The resistance of capacitors and inductors is similarly written C and L in the time domain, but in the frequency domain they become complex-valued. Complex resistance is called reactance X, where

X_C = \frac{1}{j\omega C} = -\frac{j}{\omega C}, \qquad X_L = j\omega L.

Let us consider the behavior of inductors and capacitors with respect to frequency. For large frequencies, i.e., ω → ∞, X_C (= 1/j∞C) approaches 0, while X_L (= j∞L) approaches infinity. For low frequencies, i.e., ω → 0, X_C (= 1/j0C) approaches infinity while X_L (= j0L) approaches 0. Now, current flows most easily through the parts of a circuit where the resistance is lowest, like a car in heavy traffic. When a device like a resistor, capacitor, or inductor has a nearly infinite amount of resistance, the current will approach 0 because it cannot flow through such high resistance. When current does not flow through a part of a circuit, we say that this part behaves like an open circuit, wherein the circuit is essentially broken, because the part might as well be physically removed if it does not allow current to flow.



Figure A.2: When the complex frequency s = jω goes to 0, the reactance 1/jωC goes to infinity, so no current flows through the capacitor (or the circuit, for that matter, because resistances in series are additive) and V_out = 0 volts. Therefore, only high frequencies produce an output voltage in this circuit, so we call it a high-pass filter.

Oppositely, when a device’s resistance is very small, it behaves like<br />

a short circuit, which is an electric wire with theoretically no resistance.<br />

Figure A.3: For the same circuit, when the complex frequency s = jω goes to positive infinity, the reactance 1/jωC goes to 0, so current flows freely as if there were no device at all, and V_out = V_in − V_R. Because V_R is the same regardless of frequency, we think of the ratio V_out/V_in as theoretically 1 for frequencies beyond some cutoff frequency, which is determined by the value of the capacitor.

In this circuit, signals containing low frequencies will pass minimal to no voltage through to the point designated by the + sign, but at high frequencies, the output voltage will be approximately equal to the input voltage. This is due to the behavior of the capacitor with respect to frequency. The circuit shown in Figures A.2 and A.3 is called a high-pass filter.
called high-pass filters.



A.2 Filtering<br />

We have already introduced the basic concept of filtering with respect to musical applications: filtering is a frequency-discriminating process by which some frequencies in a signal are kept and the others are attenuated. The holes of wind instruments act as bandpass filters because they only let a tiny range of frequencies pass, such that a configuration of opened and closed holes sounds like a single pitch. A mouth is a filter: no other instrument sounds exactly like a human voice. A room is a filter, producing standing waves corresponding to its resonant frequencies. Eventually, you may realize that every physical thing is a filter because it discriminates sound on a basis of frequency.

Filtering in electrical engineering is the process of frequency discrimination with respect to electrical circuits. Since both audio signals and AC circuits can be transformed to the frequency domain, the same filtering concepts can be applied to digital audio signals as to the signals involved in electric circuits. This is at the core of digital signal processing. Analog synthesizers do, however, make use of frequency-selective circuits like those in Figures A.2 and A.3, containing capacitors and inductors.

Transfer functions<br />

Transfer functions describe the ratio of the output voltage to the input voltage for a given frequency. Typically, the Laplace transform is used instead of the Fourier transform to convert a time-domain signal to a frequency-domain one. The Laplace transform is given by

\mathcal{L}\{x(t)\} = X(s) = \int_{-\infty}^{\infty} x(t) \, e^{-st} \, dt

where s is the complex frequency jω and \mathcal{L} denotes the Laplace transform. It is not unlike the continuous Fourier transform,

X(\omega) = \int_{-\infty}^{\infty} x(t) \, e^{-j\omega t} \, dt.

The Laplace transform is just as complicated an integral to compute as the Fourier transform, so most people prefer to memorize some of its general behavior for common functions. In the table below, the general Laplace transform is given for the most popular functions, where K is a constant, real value.

Function         x(t), t ≥ 0             X(s)
Impulse          Kδ(t)                   K
Step             K                       K/s
Ramp             Kt                      K/s²
Damped ramp      Kte^{−at}               K/(s+a)²
Exponential      Ke^{−at}                K/(s+a)
Sine             K sin(ωt)               Kω/(s²+ω²)
Damped sine      Ke^{−at} sin(ωt)        Kω/((s+a)²+ω²)
Cosine           K cos(ωt)               Ks/(s²+ω²)
Damped cosine    Ke^{−at} cos(ωt)        K(s+a)/((s+a)²+ω²)

Table A.1: Some common Laplace transforms.

A transfer function H is written

H(s) = \frac{V_{out}(s)}{V_{in}(s)},

i.e., the ratio of the output to the input frequency-domain voltages. We also call the transfer function the frequency response. It is the Laplace transform of an impulse response h(t), a system's output in response to the delta function. We use transfer functions, therefore, to describe the frequency response of musical instruments, electric circuits, and anything else that has a frequency-domain representation in addition to a time-domain one. We looked at the transfer functions of a violin and a trumpet in Chapter 4: the violin's transfer function had peaks at its air resonance and main wood resonance, and the trumpet's frequency response had a peak representing the length of its bore.



Figure A.4: The frequency response of an average (poor) violin.<br />

Figure A.5: The frequency response of a trumpet. The curve is smoothest where the<br />

energy leaks the most.<br />

A filter is designed to give preference to a selected range of frequencies so that the strength of those frequencies will be maintained when a signal passes through the filter, while all other frequencies will be attenuated to some degree. A transfer function, H(s) or H(jω), describes the behavior of a circuit with respect to complex frequency, s = jω. Therefore, all circuits are filters of some kind. A cutoff frequency ω_c defines where a filter changes from retaining a given frequency to attenuating it, or vice versa. Cutoff frequencies are located where



the magnitude of the transfer function equals 1/√2 of the maximum value of H, i.e., |H| = H_max/√2 (a drop of −3.01 dB).¹

Filtering of signals works just like the process of convolution reverb described in Chapter 5. We can get a filtered signal either by multiplying the two spectra or by convolving the two time-domain signals together. Convolving the signal of a simple sinusoid of frequency 1000 Hz with a filter that lets only low frequencies through, such as the one given in Figure A.6, would reduce the amplitude of this frequency by a factor of 1/√2, because at 1000 Hz this filter attenuates signals by −3.01 dB.

Figure A.6: The frequency response (magnitude and phase) of a low-pass filter.

¹ This is so because L_{dB SPL} = 20 \log_{10} L_{Intensity} = 20 \log_{10}\!\left(\frac{1}{\sqrt{2}}\right) = -3.01.



This filter is called a low-pass filter. The amplitude of the above plot is given by |H(ω)| to get rid of the complex components:

|H(\omega)| = \sqrt{\mathrm{Re}[H(j\omega)]^2 + \mathrm{Im}[H(j\omega)]^2},

so the amplitude is equal to the magnitude of the real (Re) and imaginary (Im) parts of H(s), making the substitution s = jω.

We can also calculate the phase φ of the frequencies in the transfer function, shown in the second graph of Figure A.6, as

\varphi(\omega) = 90^\circ - \tan^{-1}\!\left(\frac{\omega}{\omega_{pole}}\right).

We must know the locations of the poles (ω_pole) to do so; they are located where the transfer function's denominator is equal to zero.

A.3 The Z-transform<br />

When we want the frequency representation of a discrete, time-positive input, we compute a Z-transform instead of a Laplace transform. The Z-transform is used most often in signal processing when some discrete and infinite input signal x[t] and output signal y[t] of a system are given and we want to compute their frequency spectra, X(z) and Y(z). The transfer function H(z) is

H(z) = \frac{Y(z)}{X(z)}

where z is complex and H, Y, and X are all frequency responses. We can compute X(z) from a discrete input function x[t] by the Z-transform, which is defined as

\mathcal{Z}\{x[t]\} = X(z) = \sum_{t=0}^{\infty} x[t] \, z^{-t}

where the t are the time samples. Thus, z^{-k} can be thought of as the kth sampling instant, and it shifts a value with which it multiplies (like



x[t]) k-many samples to the right to get to its kth sample. When the input is infinite and when z = e^{jω}, we have the discrete-time Fourier transform,

X(z) = \sum_{t=0}^{\infty} x[t] \, z^{-t} = \sum_{t=0}^{\infty} x[t] \, e^{-j\omega t}.

So the DTFT is a special case of the Z-transform.

The inverse Z-transform is given by

\mathcal{Z}^{-1}\{X(z)\} = x[t] = \frac{1}{2\pi j} \oint X(z) \, z^{t-1} \, dz,

the closed path integral defined on some interval [a, a + 2π), where a is a constant. This a varies depending on the system.
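To see the "DTFT as a special case" relationship numerically, here is a minimal C sketch (illustrative, not from the text) that evaluates X(z) for a finite, time-positive signal both at a general point in the z-plane and at z = e^{jω} on the unit circle:

#include <complex.h>
#include <math.h>
#include <stdio.h>

/* X(z) = sum_{t=0}^{N-1} x[t] z^{-t} for a finite, time-positive signal. */
static double complex ztransform(const double *x, int N, double complex z)
{
    double complex X = 0.0;
    for (int t = 0; t < N; t++)
        X += x[t] * cpow(z, -t);
    return X;
}

int main(void)
{
    const double x[4] = {1.0, 0.5, 0.25, 0.125};   /* illustrative signal */
    const double pi = acos(-1.0);

    /* A general point in the z-plane... */
    double complex X1 = ztransform(x, 4, 2.0 + 0.0 * I);
    /* ...and a point on the unit circle, z = e^{j omega}: the DTFT at omega. */
    double omega = pi / 4.0;
    double complex X2 = ztransform(x, 4, cexp(I * omega));

    printf("X(2)          = %f %+fi\n", creal(X1), cimag(X1));
    printf("X(e^{j pi/4}) = %f %+fi\n", creal(X2), cimag(X2));
    return 0;
}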

Now let us look at the general form of filters and some musical applications that use them.

Low <strong>and</strong> high-pass filters<br />

A low-pass filter allows only frequencies less than a given cu<strong>to</strong>ff frequency<br />

from a signal <strong>to</strong> pass through it unaffectedly. Oppositely, a<br />

high-pass filter allows only the high frequencies above a given cu<strong>to</strong>ff<br />

frequency <strong>to</strong> pass. Therefore, the magnitude response of a low-pass<br />

filter with respect <strong>to</strong> frequency has the opposite shape of a high-pass<br />

filter: Its slope goes from constant <strong>to</strong> decreasing, while a high-pass<br />

filter goes from increasing <strong>to</strong> constant, as in Figure A.7. The magnitude<br />

response of a high-pass filter is graphically given in Figure A.8.<br />

The poles of a transfer function are given by the values that would<br />

make the denomina<strong>to</strong>r 0. In the above filters, the cu<strong>to</strong>ff frequency ω c<br />

is specified by the pole in the denomina<strong>to</strong>r where ω c equals the pole,<br />

so the cu<strong>to</strong>ff frequency is 1000 rad/s. The zeros of a transfer function<br />

specify where the numera<strong>to</strong>r is 0 <strong>and</strong> hence H(s) = 0. So, for a transfer<br />

function with m-many poles p <strong>and</strong> n-many zeros z,<br />

H(s) = (s + z 1) · (s + z 2 ) · . . . · (s + z n )<br />

(s + p 1 ) · (s + p 2 ) · . . . · (s + p m ) .



Figure A.7: The first-order low-pass filter given by the magnitude of the transfer function H(s) = 1000/(s + 1000). Its cutoff frequency is ω_c = 1000 rad/s.

Figure A.8: The first-order high-pass filter given by the magnitude of the transfer function H(s) = s/(s + 1000). Its cutoff frequency is also at 1000 rad/s.

The general form of a one-pole, one-zero low-pass filter is

H(s) = \frac{z_1}{s + p_1}.

Its magnitude is

|H(\omega)| = \frac{z_1}{\sqrt{\omega^2 + p_1^2}}.



The cutoff frequency is given by ω_c = p_1, where z_1 = p_1, the pole of H.² As ω goes to 0, |H(ω)| = z_1/p_1, so for low frequencies the amplitude of the transfer function is theoretically 1, meaning that the low frequencies are retained. As ω goes to infinity, |H(ω)| = z_1/\sqrt{\infty^2 + p_1^2} = 0, meaning that the amplitudes of high frequencies will be increasingly reduced.
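A minimal C sketch that evaluates these first-order magnitude responses and confirms the −3.01 dB point at the cutoff (the pole value and the frequency grid are illustrative assumptions):

#include <math.h>
#include <stdio.h>

/* |H(w)| for the one-pole low-pass  H(s) = p1 / (s + p1)  (with z1 = p1) */
static double lowpass_mag(double w, double p1)  { return p1 / sqrt(w * w + p1 * p1); }

/* |H(w)| for the one-pole high-pass H(s) = s / (s + p1) */
static double highpass_mag(double w, double p1) { return w  / sqrt(w * w + p1 * p1); }

int main(void)
{
    const double p1 = 1000.0;                 /* pole, and thus cutoff, in rad/s */
    const double w[] = {10.0, 100.0, 1000.0, 10000.0, 100000.0};

    for (int i = 0; i < 5; i++) {
        double lp = lowpass_mag(w[i], p1);
        double hp = highpass_mag(w[i], p1);
        printf("w = %8.0f rad/s  |H_lp| = %.3f (%6.2f dB)  |H_hp| = %.3f (%6.2f dB)\n",
               w[i], lp, 20.0 * log10(lp), hp, 20.0 * log10(hp));
    }
    /* At w = 1000 rad/s both magnitudes are 1/sqrt(2) = 0.707, i.e., -3.01 dB. */
    return 0;
}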

The most basic circuits representing low-pass filters are given in<br />

Figure A.9.<br />

Figure A.9: These series RL and RC circuits are the simplest representations of low-pass filters. The first has the transfer function H(s) = V_o(s)/V_i(s) = \frac{1/sC}{R + 1/sC} = \frac{1/RC}{s + 1/RC}. When |H(s)| = \frac{1}{\sqrt{2}} H_{max}, s is at the cutoff frequency, so this filter's cutoff frequency is ω_c = 1/RC. The second transfers its voltage by the equation H(s) = V_o(s)/V_i(s) = \frac{R}{sL + R} = \frac{R/L}{s + R/L}, so its cutoff frequency is ω_c = R/L.

The general form of a one-pole high-pass filter is

H(s) = \frac{s}{s + p_1}, \qquad |H(\omega)| = \frac{\omega}{\sqrt{\omega^2 + p_1^2}}.

Here, the cutoff frequency ω_c is once again where ω = p_1, the pole.³

The most basic circuit designs and corresponding transfer functions of high-pass filters are given in Figure A.10.

² To verify this, check that |H(p_1)| = 0.707: |H(p_1)| = \frac{p_1}{\sqrt{p_1^2 + p_1^2}} = \frac{p_1}{p_1\sqrt{2}} = \frac{1}{\sqrt{2}} = 0.707.

³ Check: |H(p_1)| = \frac{p_1}{\sqrt{p_1^2 + p_1^2}} = \frac{p_1}{\sqrt{2 p_1^2}} = \frac{1}{\sqrt{2}} = 0.707.



Figure A.10: The two simplest high-pass filters are, once again, series RL and RC circuits. The first has the transfer function H(s) = V_o(s)/V_i(s) = \frac{R}{R + 1/sC} = \frac{s}{s + 1/RC}; therefore, the cutoff frequency is ω_c = 1/RC. The second has the transfer function H(s) = V_o(s)/V_i(s) = \frac{sL}{sL + R} = \frac{s}{s + R/L}; therefore, its cutoff frequency is ω_c = R/L.

The <strong>to</strong>ne knobs on electric guitars logarithmically change the value<br />

of a capaci<strong>to</strong>r connected <strong>to</strong> each of the magnetic pickups. These capaci<strong>to</strong>rs<br />

act <strong>to</strong> transform the pickup in<strong>to</strong> a high-pass filter, reducing the<br />

treble in the guitar’s signal.<br />

Now, filters with two poles have the general form

H(s) = \frac{b_2 s^2 + b_1 s + b_0}{s^2 + a_1 s + a_0}.

This general transfer function describes second-order filters (two poles), while filters with one pole, as described above, are first-order filters. Different types of filters are determined by the b coefficients: when b_2 = b_1 = 0, we have a low-pass filter; when b_1 = b_0 = 0, we have a high-pass filter.

H(s) = \frac{b_0}{s^2 + a_1 s + a_0}, \quad \text{a low-pass filter}

H(s) = \frac{b_2 s^2}{s^2 + a_1 s + a_0}, \quad \text{a high-pass filter}

Their magnitude plots are very similar to the first-order ones. Pay attention to the values along the vertical axis to see the difference: the second-order filters have a steeper slope and thus greater attenuation of the undesired frequencies.



Figure A.11: The second-order low-pass filter given by the magnitude of the transfer function H(s) = 1000²/(s + 1000)².

Figure A.12: The second-order high-pass filter given by the magnitude of the transfer function H(s) = s²/(s + 1000)².

The cutoff angular frequencies are the same here, both 1000 rad/s. Now, when b_2 = b_0 = 0, we have a band-pass filter (Figs. A.13-15), and when only b_1 = 0, we have a band-stop filter (Figs. A.16-17).

B<strong>and</strong>-pass filtering <strong>and</strong> filter banks<br />

The transfer function of a b<strong>and</strong>-pass filter has one peak located at the<br />

center frequency ω 0 , <strong>and</strong> two cu<strong>to</strong>ff frequencies ω c1 <strong>and</strong> ω c2 that define<br />

the b<strong>and</strong>width β = ω c2 −ω c1 , again where |H(ω c1 )| = |H(ω c2 )| = Hmax √<br />

2<br />

.<br />

A b<strong>and</strong>-pass filter is au<strong>to</strong>matically a second-order filter, <strong>and</strong> its transfer


Appendix A 245<br />

function is generally given by<br />

H(s) =<br />

b 1 s<br />

s 2 + a 1 s + a 0<br />

i.e., b 0 <strong>and</strong> b 2 are both equal <strong>to</strong> zero. Here, the b<strong>and</strong>width β is equal<br />

<strong>to</strong> a 1 , <strong>and</strong> the center frequency ω 0 equals √ a 0 . A common way <strong>to</strong><br />

describe a b<strong>and</strong>-pass filter is by the quality Q, calculated from the ratio<br />

of the center frequency ω 0 <strong>to</strong> the b<strong>and</strong>width β,<br />

Q = ω 0<br />

β = ω 0<br />

ω c2 − ω c1<br />

.<br />

The magnitude plot of the frequency response of a b<strong>and</strong>-pass filter<br />

is given in Figure A.14. Note what happens at ω = 1000 rad/s: The<br />

graph peaks. At −3 dB, the two cu<strong>to</strong>ff frequencies can be found,<br />

because this is where |H(jω)| = |Hmax(jω) √<br />

2<br />

.<br />

Figure A.13: The magnitude plot of a band-pass filter with ω_0 = 1000 rad/s, given by |H(ω)| = 1000ω/(ω² + 1000ω + 1000²).
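A minimal C sketch (with illustrative coefficient values) that evaluates the magnitude of H(jω) for the band-pass form above on a frequency grid and locates the two −3 dB cutoff frequencies; it confirms numerically the claims that β = ω_c2 − ω_c1 = a_1 and ω_0 = √a_0:

#include <math.h>
#include <stdio.h>

/* |H(jw)| for H(s) = b1 s / (s^2 + a1 s + a0), evaluated at s = jw. */
static double bp_mag(double w, double b1, double a1, double a0)
{
    return b1 * w / sqrt((a0 - w * w) * (a0 - w * w) + a1 * a1 * w * w);
}

int main(void)
{
    const double b1 = 1000.0, a1 = 1000.0, a0 = 1000.0 * 1000.0;
    const double w0 = sqrt(a0);               /* center frequency, 1000 rad/s */
    const double hmax = bp_mag(w0, b1, a1, a0);
    const double target = hmax / sqrt(2.0);   /* the -3.01 dB level */

    double wc1 = 0.0, wc2 = 0.0;
    for (double w = 1.0; w <= 10000.0; w += 1.0) {   /* coarse 1 rad/s sweep */
        double m = bp_mag(w, b1, a1, a0);
        if (wc1 == 0.0 && m >= target) wc1 = w;      /* first point above the level */
        if (wc1 != 0.0 && m >= target) wc2 = w;      /* last point above the level  */
    }

    printf("peak |H| = %.3f at w0 = %.0f rad/s\n", hmax, w0);
    printf("wc1 = %.0f, wc2 = %.0f, bandwidth = %.0f rad/s, Q = %.2f\n",
           wc1, wc2, wc2 - wc1, w0 / (wc2 - wc1));
    return 0;
}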

The side holes in a wind instrument act as band-pass filters, moving the center frequency lower and higher as they lengthen and shorten the effective length of the bore, respectively. Each side hole's bandwidth is determined by the diameter of the hole, and a smaller hole means the bandwidth is smaller and the quality higher. A large Q means that the peak of the band-pass filter's transfer function is more intense.
intense.



Figure A.14: The magnitude plot of a band-pass filter with a high quality, given by the function |H(ω)| = 20ω/(ω² + 20ω + 1000²). The center frequency is 1000 rad/s and the bandwidth is 20 rad/s, so Q = 1000/20 = 50.

Figure A.15: The magnitude plot of a band-pass filter with a low quality, given by the function |H(ω)| = 5000ω/(ω² + 10,000ω + 1000²). The center frequency is 1000 rad/s and the bandwidth is 10,000 rad/s, so Q = 1000/10,000 = 0.1.

The opposite of a band-pass filter is a band-stop filter, also called a band-reject or notch filter. The general form of its transfer function is

H(s) = \frac{b_2 s^2 + b_0}{s^2 + a_1 s + a_0},

i.e., b_1 = 0. This kind of filter is used to reduce or eliminate the intensity of a given range of frequencies, specified by the bandwidth around the center frequency. The size of the bandwidth β is once again determined by Q, with the two cutoff frequencies ω_c1 and ω_c2 straddling the center frequency ω_0; furthermore, β is equal to the coefficient a_1 and ω_0 is \sqrt{a_0}.

Figure A.16: A band-stop filter with a high quality, given by the function H(ω) = (ω² + 1000²)/(ω² + 500ω + 1000²). The center frequency is 1000 rad/s and the bandwidth is 500 rad/s, so Q = 1000/500 = 2.

Figure A.17: A band-stop filter with a low quality, given by the function H(ω) = (ω² + 1000²)/(ω² + 10,000ω + 1000²). The center frequency is 1000 rad/s and the bandwidth is 10,000 rad/s, so Q = 1000/10,000 = 0.1.

The most basic circuits for designing band-pass and band-stop filters are given in Figure A.18. They are both called series RLC circuits. Including both inductors and capacitors in a circuit makes the frequency response behave similarly at the two frequency extremes, i.e., for band-pass filters, |H(j0)| = |H(j∞)| = 0 and |H(jω_0)| = 1, and for band-stop filters, |H(j0)| = |H(j∞)| = 1 and |H(jω_0)| = 0.
filters, |H(j0)| = |H(j∞)| =1<strong>and</strong> |H(jω 0 )| =0.



Figure A.18: The left circuit is a band-pass filter and the right a band-stop; both are series RLC circuits. The band-pass filter's transfer function is H(s) = V_o(s)/V_i(s) = \frac{R}{sL + R + 1/sC} = \frac{sR}{s^2 L + sR + 1/C} = \frac{s(R/L)}{s^2 + s(R/L) + 1/LC}, making its center frequency ω_0 = 1/\sqrt{LC}. The band-stop has the transfer function H(s) = V_o(s)/V_i(s) = \frac{sL + 1/sC}{sL + R + 1/sC} = \frac{s^2 L + 1/C}{s^2 L + sR + 1/C} = \frac{s^2 + 1/LC}{s^2 + s(R/L) + 1/LC}, so its center frequency is likewise ω_0 = 1/\sqrt{LC}.

In the description of phaser and flanger effects pedals in Chapter 4, the concept of filter banks was broached. Phasing (and flanging) is achieved by passing a signal through several filters simultaneously and summing each of their frequency responses with the original signal's frequency response. So, filters can be used individually as well as connected in series or parallel to achieve a great range of different sonic effects.

The <strong>to</strong>pic of filter banks also appears when we want <strong>to</strong> extract<br />

frequency-related information about music, especially for the purpose<br />

of music information retrieval (MIR). A common filter bank here would<br />

be one designed <strong>to</strong> extract the 12 notes of the scale <strong>to</strong> tell us when a<br />

specific note occurs in a song. These would be b<strong>and</strong>-pass filters with<br />

center frequencies scaled by 2 k/12 , where f 0 is the center frequency<br />

of the first b<strong>and</strong>-pass filter <strong>and</strong> 2 k/12 f 0 is the center frequency of the<br />

kth b<strong>and</strong>-pass filter. Other filter banks are useful for detecting instrumentation<br />

when they are designed <strong>to</strong> pick up an harmonic over<strong>to</strong>ne<br />

series, i.e., their center frequencies are spaced an octave apart from one<br />

another. Therefore, filter banks in music information retrieval can be<br />

used with respect <strong>to</strong> pitch, harmony, <strong>and</strong> timbre detection—wherever<br />

there is frequency information.
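As a small illustration of that semitone spacing, the Matlab fragment below lists the center frequencies of a 12-band filter bank; the starting frequency f_0 = 261.63 Hz (middle C) is only an example.

% Center frequencies of 12 band-pass filters spaced one semitone apart.
f0 = 261.63;            % example starting frequency (middle C), in Hz
k  = 0:11;              % filter index
fc = f0 * 2.^(k/12);    % center frequency of the kth filter
for i = 1:length(fc)
    fprintf('filter %2d: %7.2f Hz\n', k(i), fc(i));
end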


Figure A.19: 24 band-pass filters in parallel, forming a filter bank that spans 2 octaves.

A.4 Chapter summary

In this appendix, we reviewed the fundamentals of electrical engineering behind the behaviors of electrical and digital systems with respect to frequency. For continuous voltages, we can take a Laplace transform to compute the spectrum V(s) of a time-domain signal v(t). For discrete functions of voltage, we compute the Z-transform to see the frequency-domain representation X(z). In both cases, z and s are complex frequencies jω where j = √−1.

When we know the input and output voltages of a system over time, we can compute the transfer function H(s) or H(z). This is the proportion of the output voltage to the input voltage. The output voltage is the voltage over a load in a circuit and the input voltage is the voltage of the source like a battery (or the output of another, connected circuit). So, H(s) = V_o(s)/V_i(s).

Ohm's law is the overarching law of all electrical physics, and it states that V = IR, i.e., voltage is the product of current with resistance. Resistance generalizes to a complex-valued impedance, whose imaginary part is called the reactance. Reactance is a function of frequency, so circuits are frequency-selective and called filters.

A low-pass filter allows low frequencies to "pass through it" up to some cutoff frequency ω_0. A high-pass filter lets high frequencies pass while attenuating low ones. A band-pass filter is designated by a center frequency ω_c and a bandwidth β defining the range of frequencies that it allows to pass. A small bandwidth means a high quality of filter Q, where Q = ω_c/β. Finally, we defined a band-stop filter, the converse of a band-pass filter that allows everything but some range of frequencies to pass through it.

The zeros of the numerator of a transfer function are called simply zeros while the zeros of the denominator are poles. A first-order filter has one pole and a second-order filter has two. We can define a series of filters with a filter bank much like a time-domain windowing function in music information retrieval.


B. Using computers to do Fourier transforms

As you saw in Chapter 7, doing a discrete Fourier transform of size N = 8 is extremely cumbersome by hand. The software programs Matlab and Mathematica are great places to turn to do these tricky computations: Fourier transforms are built in to their function libraries. The following examples are all available for download from my website, http://numbersandnotes.com/.

B.1 Matlab

Matlab is a high-level language for technical computing, excelling at scientific computations like the fast Fourier transform. The Matlab syntax that performs an FFT is literally fft(), accepting an array of amplitude information with respect to time.¹ As we saw in Chapter 6, sound files have headers in addition to binary information, and this means that their data must be prepared at a low level (like in C). Fortunately, both Mathematica and Matlab have several built-in functions to prepare audio data for further analysis. In Matlab, I prefer to use wavread(), and in Mathematica, the function Import[]. These built-ins put the amplitude information in the correct format for analysis.

First, let's take a basic FFT. Then, we will give the code to produce the same spectrograms that we've seen before.

¹ The version of Matlab to which this information applies is Matlab 7.


Perform and plot the fast Fourier transform

%% Take the fast Fourier transform of a WAV file
%% and plot its power.
[x,fs]=wavread('Trumpet-01-mf-C5.wav');
N = length(x) %% number of points in an audio file is just its length
T = length(x)/fs %% define time of interval in seconds
t = [0:N-1]/N; %% define time instants
t = t*T; %% define time in seconds
p = abs(fft(x))/(N/2); %% absolute value of the fft;
%% we only need the first half of it, N/2
p = p(1:N/2).^2; %% take the power of first half of the freq's
freq = [0:N/2-1]/T; %% find the corresponding frequency in Hz
figure
plot(freq,p,'k')
axis([0 5000 0 0.012]) %% zoom in

This displays a plot of the frequency domain of x from 0 to 5000 Hz, and 0 to 0.012 relative power. Notice that the variable freq determines the frequency in Hz of the frequency components by dividing by T, which is set equal to N/f_s at the beginning. Therefore, the values [0:N/2-1] are our ω_k, k = 0, 1, ..., N/2 − 1. We only go up to the (N/2 − 1)th component because the second half of the output of the FFT is symmetric to the first half.
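To convince yourself of that symmetry, you can compare the top half of the FFT output with the complex conjugate of the bottom half. The check below is only a quick sketch; it assumes the number of samples N is even, and the difference it prints should be at the level of rounding error.

X = fft(x);
% For a real-valued signal, X(k) = conj(X(N-k+2)) for k = 2, ..., N/2
% (assuming N is even), so this maximum difference should be tiny.
err = max(abs(X(2:N/2) - conj(X(N:-1:N/2+2))))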

Display a spectrograph

Matlab offers a great amount of control for the visualization of data. Therefore, it is an ideal platform for displaying spectrographs (also called spectrograms). It actually has a built-in function for this (spectrogram()), but its results are often unreliable. Thus, I have provided the same code that was used to make all of the spectrographs in this book.

function [TF, freq, time] = spectro(x,secs,fftint,maxfreq)
%SPECTRO - Spectrogram of audio signal.
%   [TF, freq, time] = spectro(x, secs, fftint, maxfreq)
%   returns a series of short-time Fourier transforms
%   (STFT's), i.e., the frequency, amplitude, and time
%   for a given .wav file. The file must be in mono
%   (1-channel).
%   Inputs:
%     x:       a .wav file, entered as 'guitar.wav',
%              for example
%     secs:    the file's duration in seconds.
%              This can be calculated by length(file)
%              divided by its sampling frequency.
%     fftint:  the duration of the STFT in seconds.
%              This does not need to be a power of 2.
%              The file will be zero-padded.
%     maxfreq: the maximum frequency component of the
%              file. This should be 0.5*sampling
%              frequency to satisfy the Nyquist limit.
%
%   The signal x must be a .wav file, and the duration of x
%   specified must not be less than its actual duration.
%   Additionally, the duration of the short-time Fourier
%   transform interval (fftint) must be less than the total
%   duration. Finally, maxfreq must be no more than half
%   of the sampling frequency, typically 22050 Hz.
%
%   SPECTRO will return the spectrogram and surface of a given
%   real-valued signal. It works by partitioning the file
%   according to fftint, normalizing (zero-padding) the data
%   to fit the requirements for the FFT, and then taking the
%   short-time Fourier transform (STFT). A spectrogram is
%   a three-dimensional output wherein the abscissa is time
%   in seconds, the ordinate is frequency in Hz, and the
%   darkness of the color of a given point represents its
%   amplitude in decibels. So, a white point would not be
%   as loud as a gray or black point. Also outputted is the
%   surface of the spectrogram.

[m d]=wavfinfo(x); % this function reads the WAV header
[x,fs,nbits]=wavread(x);

if isempty(m)==1
    error('The specified file is not a .wav file.')
    x=0
end
if fs.*secs < length(x)
    error('The specified duration (secs) is less than the actual duration of the file.')
    secs=0
end
if fftint > secs
    error('The specified duration for the intervals of the FFT (fftint) is too large.')
    fftint=0
end
if maxfreq > fs/2
    error('The specified maximum frequency (maxfreq) is beyond the Nyquist sampling rate.')
    maxfreq=0
end

% partition the file to get windows of the signal for our
% spectrogram
partitionsize=fftint*fs;
% use a hanning window
window=hanning(partitionsize);
partitions=[1:partitionsize:length(x)-partitionsize];
Z=zeros(partitionsize,length(partitions)); % pad with zeros
for i=1:length(partitions)
    Z(1:partitionsize, i)= ...
        x(partitions(i):partitions(i)+partitionsize-1).*window;
end

% take the short-time Fourier transform (STFT)
STFT = fft(Z);

% take absolute value of each partition
if rem(partitionsize,2)==1
    k=(partitionsize+1)/2;
else
    k=partitionsize/2;
end
f=[0:k-1]*fs/partitionsize;
t=partitions/fs;

if nargout>0, TF=STFT; end
if nargout>1, freq=f; end
if nargout>2, time=t; end

% size of the STFT
maxSTFT=abs(STFT(2:partitionsize*maxfreq/fs,:));
% normalized so the max amplitude will be 0 dB
maxSTFT=maxSTFT/max(max(maxSTFT));

figure
% output the spectrogram with colors mapping the intensity
% in dB's
pcolor(t, f(2:partitionsize*maxfreq/fs), 20*log10(maxSTFT));
axis xy;
colormap(flipud(bone));
shading interp;
title('2D spectrogram of the signal')
xlabel('Time (seconds)')
ylabel('Frequency (Hz)')

figure
% surface function plots the output in 3D
surf(t, f(2:partitionsize*maxfreq/fs),20*log10(maxSTFT));
axis xy;
view(20,84);
colormap(flipud(bone));
shading interp;
title('3D spectrogram of the signal')
xlabel('Time (seconds)')
ylabel('Frequency (Hz)')
end

To use this function, simply enter the following information in a new script and run it. Be sure that both the WAVE file and spectro.m are in Matlab's file directory by going to File->Set Path... and adding their location to it. Don't forget to press "Save"!

[x,freqsamp]=wavread('guitar.wav');
t=0:1:length(x)-1;
figure
plot(t,x,'k') % plot the audio signal, in black
axis([0 length(x)-1 -1 1])
% automatically determine duration in seconds
seconds=length(x)/freqsamp+0.001;
% automatically determine nyquist frequency limit
nyquist=freqsamp/2;
spectro('guitar.wav',seconds,0.01,nyquist)

Running this script will result in three figures: the plot of the audio signal, the "2D" spectrogram of the signal (it is actually 3D, because there are three variables, but the graph is planar), and the 3D surface depiction of the spectrogram.

B.2 Mathematica

The following code is for executing the specification of the Fourier transform given in this text, in Mathematica 7. The Fourier[] and InverseFourier[] functions are inversely defined to how they have been given in this book: The exponents of e are positive in the built-in function Fourier[] and negative in the function InverseFourier[], i.e.,

X = Fourier[x] = (1/√N) ∑_{t=1}^{N} x(t) e^{i2π(k−1)(t−1)/N}

as opposed to the DFT we use:

F(x) = ∑_{t=0}^{N−1} x(t) e^{−i2πkt/N}.

Likewise, the IDFT given by Mathematica is

x = InverseFourier[X] = (1/√N) ∑_{k=1}^{N} X(k) e^{−i2π(k−1)(t−1)/N}

instead of

F^{−1}(X) = (1/N) ∑_{k=0}^{N−1} X(k) e^{i2πkt/N}.

Therefore, to compute the "engineer's DFT" (the version we've been using) and its inverse in Mathematica, we need to multiply by √N and use the functions oppositely:

X = Sqrt[Length[x]]*InverseFourier[x] = ∑_{t=1}^{N} x(t) e^{−i2π(k−1)(t−1)/N}

Note that the zeroth frequency component is given by X[1], not X[0]. The inverse engineer's DFT is then

x = 1/Sqrt[Length[X]]*Fourier[X] = (1/N) ∑_{k=1}^{N} X(k) e^{i2π(k−1)(t−1)/N}

Now let's import a song and perform an FFT on it in Mathematica. Type Directory[] to find Mathematica's current working directory (a folder on your computer), and put your song there—or use the command SetDirectory["dir"] to change the current directory. SetDirectory[$UserDocumentsDirectory], for example, sets the directory to your Documents folder on your computer.

Next, use the command Import["filename.wav"] to import a .wav file into Mathematica. For the sake of this example, we will use a two-channel .wav file of a short clip of rock music, imported into the variable rock.

Now enter

samples = rock[[1,1]];
left = samples[[1]];

This will give us a sampled sound list of the left channel for Fourier analysis. The semicolon at the end of the lines in both Mathematica and Matlab suppresses the output of the statement, which is ideal for the large arrays of audio data that we don't care to inspect.

Let us take the short-time discrete Fourier transform. First, we have to partition the data. Mathematica has the function Partition[] already built in. It is easy to define a function for reuse in Mathematica:

SoundPartition[x_, dftint_, fs_] :=
  Partition[x, Round[fs * dftint]]

So, this defines a function "SoundPartition[]" which accepts an array (x), short-time discrete Fourier transform size (dftint) in seconds, and sampling frequency (fs) and returns (length(x)/(dftint * fs))-many partitions. So, for a 2-second sound file and dftint = 0.1, we would get 20 partitions.

rockPart = SoundPartition[left, 0.1, 44100];
Table[ListPlot[Abs[Take[Sqrt[Length[rockPart[[k]]]]
  *InverseFourier[rockPart[[k]]], Length[rockPart[[k]]]/2]],
  Joined->True, PlotRange->All], {k, 1, Length[rockPart]}]

The first line uses our defined function to partition my array of the left channel, "left." Then, the STDFT is performed: For each partition k, we take the DFT of the partition, keep the first half of its output (the second half is redundant), find its absolute value, and graph it.

Figure B.1: The first 4 graphs of the STDFT. There are 19 0.1-second partitions made in total for the 1.98-second sample of music.


B.3 C

The languages of C and C++ have many downloadable libraries that work to accelerate the programming process. One of these libraries is the FFTW library, containing the FFT algorithm. The library aids the efficiency of the FFT, but working with it is tricky and not for beginners to C. To learn more, go to http://fftw.org/. Additionally, an approachable yet comprehensive resource for an introduction to music coding is The Audio Programming Book by Richard Boulanger and Victor Lazzarini.

Read in a WAVE file

This program will read in a file, check that it is a WAVE file, and store it in a matrix for further processing. To execute this file on a Mac, open Terminal (in Applications > Utilities) and type

pwd

This will give you your current directory, most likely your user folder. Drag the files wavefile.c and fft.c into this directory. Then type

gcc wavefile.c -o wavefile

This will create a "wavefile" application in the current directory. Finally, type into Terminal

./wavefile filename.wav

This line prints the header information of the file and translates the binary code to a float value between −1 and 1 (all real values), to represent the amplitude with respect to time. This is saved in a buffer file named realwave.dat. Here, filename.wav is the name of your WAVE file. Important: The file must be one-channel. It is easy to split stereo tracks and change them to "mono" in the free audio editing program, Audacity.
Audacity.


wavefile.c

/* This program will read in a WAV file for further
   manipulation. The file will save to the user's root
   folder in a file named "realwave.dat". */

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[])
{
    FILE * wavefile; /* Input wave file - .wav */
    FILE * outd;     /* Analyzed result in floats - realwave.dat */
    int i, fsize, sread, swrite, nbytes, rate, avgrate, csize,
        ibyte, smin, smax, savg, bad, nbread;
    short ccode, channels, blockalign, bps;
    char riff[4], data[4], sbyte, more[4], fmt[4], wave[4];

    printf("readwave.c executing \n");

    /* a .wav file name must be supplied on the command line */
    if(argc < 2)
    {
        printf("Usage: ./wavefile filename.wav \n");
        exit(1);
    }
    if((wavefile = fopen(argv[1], "rb")) == NULL)
    {
        printf("Cannot open input file %s \n", argv[1]);
        exit(1);
    }

    /* Read the first 44 bytes of the WAV file */
    printf("Reading WAVE Header information...\n");

    sread = fread(&riff[0], 1, 4, wavefile);
    printf("First 4 bytes of .wav file should say RIFF, File says: %c%c%c%c \n",
           riff[0],riff[1],riff[2],riff[3]);

    sread = fread(&fsize, 1, 4, wavefile);
    printf("File has %d +8 bytes \n", fsize);

    sread = fread(&wave[0], 1, 4, wavefile);
    printf("File should say WAVE, File says: %c%c%c%c \n",
           wave[0],wave[1],wave[2],wave[3]);

    sread = fread(&fmt[0], 1, 4, wavefile);
    printf("File should say fmt, File says: %c%c%c%c \n",
           fmt[0],fmt[1],fmt[2],fmt[3]);

    sread = fread(&nbytes, 1, 4, wavefile);
    printf("Block has %d bytes \n", nbytes);

    sread = fread(&ccode, 1, 2, wavefile);
    printf("Compression Code = %d \n", ccode);

    sread = fread(&channels, 1, 2, wavefile);
    printf("Number of Channels = %d \n", channels);

    sread = fread(&rate, 1, 4, wavefile);
    printf("Rate = %d \n", rate);

    sread = fread(&avgrate, 1, 4, wavefile);
    printf("Average Rate = %d \n", avgrate);

    sread = fread(&blockalign, 1, 2, wavefile);
    printf("Block Align = %d \n", blockalign);

    sread = fread(&bps, 1, 2, wavefile);
    printf("Bits per Sample = %d \n", bps);

    sread = fread(&data[0], 1, 4, wavefile);
    printf("File should say DATA, File says: %c%c%c%c \n",
           data[0],data[1],data[2],data[3]);

    sread = fread(&csize, 1, 4, wavefile);

    nbread = 44;
    bad = 0;
    savg = 0;
    printf("Begin analyzing sound file.\n");

    /* Convert the data chunk to floats between -1 and 1 and save
       them in realwave.dat (this loop assumes 16-bit mono samples). */
    outd = fopen("realwave.dat", "w");
    for(i=0; i<csize/2; i++)
    {
        short sample;
        sread = fread(&sample, 1, 2, wavefile);
        if(sread != 2) break;
        fprintf(outd, "%f\n", sample/32768.0);
    }
    nbread = nbread+csize;

    /* check for more chunks */
    while(1)
    {
        sread = fread(&more[0], 1, 4, wavefile);
        if(sread != 4) goto done; /* No more bytes to read */
        sread = fread(&csize, 1, 4, wavefile);
        if(sread != 4)
        {
            goto done;
        }
        for(i=0; i<csize; i++)
        {
            sread = fread(&sbyte, 1, 1, wavefile);
        }
        nbread = nbread+csize+8;
    }

done:
    fclose(outd);
    fclose(wavefile);
    return 0;
}

Perform an FFT on a WAVE file

After reading in a .wav file, you may perform an FFT on it with the following code. Note that an FFT of an entire file is fairly meaningless because music changes often; you may want to partition the file in Matlab or with an audio editor first.
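For example, the following Matlab sketch cuts a short segment out of a WAV file and writes it as a plain text column of floats, one value per line, which the fscanf() loop in fft.c will read; the file name segment.dat and the 0.1-second window starting at the beginning of the file are arbitrary choices.

% Cut a 0.1-second segment from a mono WAV file and save it as text
% so that the compiled FFT program below can read it.
[x,fs] = wavread('guitar.wav');   % any mono WAV file
seg = x(1:round(0.1*fs));         % first 0.1 seconds (arbitrary choice)
dlmwrite('segment.dat', seg);     % one float per line

Then, in Terminal, ./wavefft segment.dat prints the spectrum of just that segment.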

To execute the following program, go once again to Terminal in the Utilities folder, and type

gcc fft.c -o wavefft

This compiles fft.c into a "wavefft" application; running it on the data file produced by wavefile.c, i.e., ./wavefft realwave.dat, prints the results in the Terminal console. These results can be copied and pasted into Excel or R to be graphed (a line graph is recommended) [28], [63].

fft.c

/* Performs an FFT on a WAVE file that is saved in
   the root folder in the .dat format. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Definitions */
#define strchr index
#define length 32768 /* max pts in FFT: must be a power of 2 */
#define PI M_PI /* Pi defined to machine precision */
#define TWOPI (2.0*PI) /* 2 times Pi, used often */


void four1();
void realft();
double wsum;
char *pname;
FILE *ifile;
int m, n;
int cflag;
int decimation = 1;
int smooth = 1; /* Adjust this variable to scale FFT output */
static float *c;
double norm;

main(argc, argv)
int argc;
char *argv[];
{
    int i;
    char *prog_name();
    double atof();

    pname = prog_name(argv[0]);
    if (--argc < 1)
    {
        exit(1);
    }
    else if ((ifile = fopen(argv[argc], "rt")) == NULL)
    {
        fprintf(stderr, "%s: can't open %s\n", pname, argv[argc]);
        exit(2);
    }
    if ((c = (float *)calloc(length, sizeof(float))) == NULL)
    {
        fprintf(stderr, "%s: insufficient memory\n", pname);
        exit(2);
    }
    read_input();
    fft();
    fft_print();
    exit(0);
}

read_input()
{
    for (n = 0; n < length && fscanf(ifile, "%f", &c[n]) == 1; n++);
}

/* calculate forward FFT */
fft()
{
    int i;

    /* find the power of 2 (m) that holds the n input points */
    for (m = length; m >= n; m >>= 1);
    m <<= 1;
    for (i = n; i < m; i++) c[i] = 0.0; /* zero-pad up to m points */
    realft(c-1, m/2, 1);
    norm = 2.0/m; /* scale factor for the output magnitudes */
}

/* print the magnitude of each frequency component */
fft_print()
{
    int i, j;
    double pow;

    for (i = 0; i < m; i += 2*smooth*decimation)
    {
        for (j = 0, pow = 0.0; j < 2*smooth; j += 2)
        {
            pow += (c[i+j]*c[i+j] + c[i+j+1]*c[i+j+1])*norm*norm;
        }
        pow /= smooth/decimation;
        printf("%g", sqrt(pow)); /* Print FFT results */
        printf("\n");
    }
}

char *prog_name(s)
char *s;
{
    char *p = s + strlen(s);

    while (p >= s && *p != '/')
    {
        p--;
    }
    return (p+1);
}

void realft(data,n,isign)
float data[];
int n,isign;
{
    int i, i1, i2, i3, i4, n2p3;
    float c1 = 0.5, c2, h1r, h1i, h2r, h2i;
    double wr, wi, wpr, wpi, wtemp, theta;
    void four1();

    theta = PI/(double) n;
    if (isign == 1)
    {
        c2 = -0.5;
        four1(data, n, 1);
    }
    else
    {
        c2 = 0.5;
        theta = -theta;
    }
    wtemp = sin(0.5*theta);
    wpr = -2.0*wtemp*wtemp;
    wpi = sin(theta);
    wr = 1.0+wpr;
    wi = wpi;
    n2p3 = 2*n+3;
    for (i = 2; i <= n/2; i++) /* combine the two half-spectra */
    {
        i4 = 1 + (i3 = n2p3 - (i2 = 1 + (i1 = i + i - 1)));
        h1r = c1*(data[i1] + data[i3]);
        h1i = c1*(data[i2] - data[i4]);
        h2r = -c2*(data[i2] + data[i4]);
        h2i = c2*(data[i1] - data[i3]);
        data[i1] = h1r + wr*h2r - wi*h2i;
        data[i2] = h1i + wr*h2i + wi*h2r;
        data[i3] = h1r - wr*h2r + wi*h2i;
        data[i4] = -h1i + wr*h2i + wi*h2r;
        wr = (wtemp = wr)*wpr - wi*wpi + wr;
        wi = wi*wpr + wtemp*wpi + wi;
    }
    if (isign == 1)
    {
        data[1] = (h1r = data[1]) + data[2];
        data[2] = h1r - data[2];
    }
    else
    {
        data[1] = c1*((h1r = data[1]) + data[2]);
        data[2] = c1*(h1r - data[2]);
        four1(data, n, -1);
    }
}

void four1(data, nn, isign)
float data[];
int nn, isign;
{
    int n, mmax, m, j, istep, i;
    double wtemp, wr, wpr, wpi, wi, theta;
    float tempr, tempi;

    n = nn << 1;
    j = 1;
    for (i = 1; i < n; i += 2) /* bit-reversal reordering */
    {
        if (j > i)
        {
            tempr = data[j]; /* swap the two complex numbers */
            data[j] = data[i];
            data[i] = tempr;
            tempr = data[j+1];
            data[j+1] = data[i+1];
            data[i+1] = tempr;
        }
        m = n >> 1;
        while (m >= 2 && j > m)
        {
            j -= m;
            m >>= 1;
        }
        j += m;
    }
    mmax = 2;
    while (n > mmax) /* While loop executed log2 nn times */
    {
        istep = 2*mmax;
        theta = TWOPI/(isign*mmax); /* Trigonometric Recurrence */
        wtemp = sin(0.5*theta);
        wpr = -2.0*wtemp*wtemp;
        wpi = sin(theta);
        wr = 1.0;
        wi = 0.0;
        for (m = 1; m < mmax; m += 2)
        {
            for (i = m; i <= n; i += istep) /* Danielson-Lanczos formula */
            {
                j = i + mmax;
                tempr = wr*data[j] - wi*data[j+1];
                tempi = wr*data[j+1] + wi*data[j];
                data[j] = data[i] - tempr;
                data[j+1] = data[i+1] - tempi;
                data[i] += tempr;
                data[i+1] += tempi;
            }
            wr = (wtemp = wr)*wpr - wi*wpi + wr;
            wi = wi*wpr + wtemp*wpi + wi;
        }
        mmax = istep;
    }
}
for (i = m; i


References

[1] J. O. Pickles, An Introduction to the Physiology of Hearing. London: Academic Press, 2nd ed., 1988.

[2] J. R. Pierce, The Science of Musical Sound. New York: W. H. Freeman and Company, revised ed., 1996.

[3] D. J. Levitin, This Is Your Brain on Music. New York: Plume, 2006.

[4] G. Loy, Musimathics: The Mathematical Foundations of Music, Volume 1. Cambridge, MA: The MIT Press, 2006.

[5] D. R. Griffin, Listening in the dark: the acoustic orientation of bats and men. New Haven, CT: Yale University Press, 1958.

[6] U. of Salford, "Duck quack echo," accessed September 20, 2011.

[7] S. J. Jeans, Science & Music. New York: Dover, 1968.

[8] J. Beament, The Violin Explained. New York: Oxford University Press, USA, 2001.

[9] D. Halliday, R. Resnick, and J. Walker, Fundamentals of Physics. New York: Wiley, 9th ed., 2010.

[10] A. Wood, The Physics of Music. London: University Paperbacks, 1965.

[11] A. Schoenberg, Structural Functions of Harmony. London: Williams and Norgate Limited, 1954.

[12] J. Rayleigh and R. B. Lindsay, The Theory of Sound, Volume One. New York: Dover, unabridged second revised ed., 1945.


[13] R. Collecchia, The Entropy of Musical Classification. Portland, OR: Reed College, unpublished, May 2009.

[14] R. Plomp, "Timbre as a Multidimensional Attribute of Complex Tones," Frequency Analysis and Periodicity Detection in Hearing, ed. R. Plomp and G. Smoorenberg. Leiden: Sijthoff, 1970.

[15] T. D. Rossing, Science of String Instruments. Springer, 2010, pp. 130-132.

[16] C. M. Hutchins and D. Voskull,

[17] E. D. Blackham, "The physics of the piano," in Hutchins [29], pp. 24–33.

[18] C. M. Hutchins, "The physics of violins," in Hutchins [29], pp. 56–68.

[19] A. H. Benade, "The physics of brasses," in Hutchins [29], pp. 44–55.

[20] B. Hopkin, Musical Instrument Design: Practical Information for Instrument Design. Tucson, AZ: See Sharp Press, 1996.

[21] D. Deutsch, ed., The Psychology of Music. New York: Academic Press, 1982.

[22] J. Ayotte, I. Peretz, and K. Hyde, "Congenital amusia: A group study of adults afflicted with a music-specific disorder," Brain: A Journal of Neurology, vol. 125, January 2002.

[23] C. S. Sapp, "Wave pcm soundfile format," updated January 20, 2003; accessed September 13, 2011.

[24] A. E. Zonst, Understanding the FFT: A Tutorial on the Algorithm and Software for Laymen, Students, Technicians and Working Engineers. Titusville, FL: Citrus Press, 1995.

[25] L. R. Rabiner and C. M. Rader, eds., Digital Signal Processing. New York: The Institute of Electrical and Electronics Engineers, Inc., 1972.

[26] J. O. Smith, Mathematics of the Discrete Fourier Transform (DFT). http://www.w3k.org/books/: W3K Publishing, 2007.

[27] M. T. Heideman, D. H. Johnson, and C. S. Burrus, "Gauss and the history of the fast Fourier transform," IEEE ASSP Magazine, vol. 1, no. 4, pp. 14–21, 1984.

[28] J. W. Nilsson and S. A. Riedel, Electric Circuits. Boston: Prentice Hall, 9th ed., 2011.

[29] C. M. Hutchins, ed., The Physics of Music: Readings from Scientific American, (San Francisco, CA), W. H. Freeman and Company, 1978.

[30] E. Brattain-Morrin, Entropy, Computation, and Demons. Portland, OR: Reed College, unpublished, 2008.

[31] A. I. Khinchin, Mathematical Foundations of Information Theory. New York: Dover, 1957.

[32] S. Ross, A First Course in Probability. Upper Saddle River, NJ: Pearson Education, Inc., 7th ed., 2006.

[33] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press, 1998.

[34] C. Anderton, Electronic Projects for Musicians. New York: Amsco Publications, 1980.

[35] J. Johnson, Introduction to Digital Signal Processing. New Delhi, India: Prentice Hall of India, 1998.

[36] K. Lee, "Automatic Chord Recognition from Audio Using Enhanced Pitch Class Profile," Proceedings of International Computer Music Conference, 2006.


[37] C. Marven and G. Ewers, A Simple Approach to Digital Signal Processing. New York: John Wiley and Sons, Inc., 1996.

[38] L. R. Rabiner and B. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: PTR Prentice Hall, Inc., 1993.

[39] C. Roads, The Computer Music Tutorial. Cambridge, MA: The MIT Press, 1996.

[40] C. B. Rorabaugh, DSP Primer. New York: McGraw-Hill, 1999.

[41] J. O. Smith, Physical Audio Signal Processing. http://ccrma.stanford.edu/~jos/pasp/: online book, accessed 2011.

[42] J. O. Smith, Introduction to Digital Filters with Audio Applications. http://www.w3k.org/books/: W3K Publishing, 2007.

[43] F. A. Saunders, "Physics and Music," [29], pp. 6–15.

[44] A. H. Benade, "The Physics of Wood Winds," The Physics of Music: Readings from Scientific American, (San Francisco, CA), W. H. Freeman and Company, 1978, pp. 34–43.

[45] J. C. Schelleng, "The Physics of the Bowed String," The Physics of Music: Readings from Scientific American, (San Francisco, CA), W. H. Freeman and Company, 1978, pp. 69–77.

[46] V. O. Knudsen, "Architectural Acoustics," The Physics of Music: Readings from Scientific American, (San Francisco, CA), W. H. Freeman and Company, 1978, pp. 78–92.

[47] H. F. Olson, Music, physics and engineering. New York: Dover, 1967.

[48] G. A. Gescheider, Psychophysics: The Fundamentals. New York: Psychology Press, 1997.

[49] A. H. Benade, Fundamentals of Musical Acoustics. New York: Dover, Second Revised Ed., 1990.

[50] H. v. Helmholtz, On the Sensations of Tone. New York: Dover, 1954.

[51] F. Lerdahl and R. Jackendoff, A Generative Theory of Tonal Music. Cambridge, MA: The MIT Press, 1983.

[52] S. Isacoff, Temperament. New York: Alfred A. Knopf, 2001.

[53] D. Albright, Modernism and Music. Chicago: The University of Chicago Press, 2004.

[54] D. C. Miller, Anecdotal History of the Science of Sound: To the Beginning of the 20th Century. New York: The Macmillan Company, 1935.

[55] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA: The MIT Press, 2nd ed., 2002.

[56] P. A. Fuchs, A. Rees, C. Plack, and A. Palmer, The Oxford Handbook of Auditory Science: Hearing. New York: Oxford University Press, USA, 2010.

[57] G. Martino and L. E. Marks, "Synesthesia: Strong and Weak," Current Directions in Psychological Science, vol. 10, no. 2, April 2001.

[58] E. Zwicker and R. Feldtkeller, "On the Derivation of Critical Bands from the Loudness of Complex Sounds," Acustica 5, 1955, pp. 40-45.

[59] I. Peretz, L. Gagnon, S. Hébert and J. Macoir, "Singing in the Brain: Insights from Cognitive Neuropsychology," Music Perception: An Interdisciplinary Journal, vol. 21, no. 3, Spring 2004, pp. 373–390.

[60] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematical Computation, vol. 19, 1965, pp. 297–301.

[61] G. C. Danielson and C. Lanczos, "Some improvements in practical Fourier analysis and their application to X-ray scattering from liquids," J. Franklin Institute, vol. 233, 1942, pp. 365–380 and 435–452.

[62] C. F. Gauss, "Nachlass: Theoria interpolationis methodo nova tractata," Werke, vol. 3, 2011, pp. 265–327.

[63] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge, UK: Cambridge University Press, 2nd ed., 1992, pp. 504–510.

[64] F. Moavenzadeh, Concise Encyclopedia of Building and Construction Materials. Cambridge, MA: The MIT Press, 1990.

[65] C. R. Nave, "HyperPhysics Concepts: Sound and Hearing." http://hyperphysics.phy-astr.gsu.edu/hbase/sound/soucon.html#soucon: 2010, accessed September 13, 2011.

[66] J. O. Smith, Spectral Audio Signal Processing, October 2008 Draft. http://ccrma.stanford.edu/~jos/sasp/: online book, accessed September 13, 2011.

[67] C. Chen, Signals and Systems. New York: Oxford University Press, Third Ed., 2004.

[68] M. McLuhan and Q. Fiore, the medium is the MASSAGE: An Inventory of Effects. Corte Madera, CA: Gingko Press, 2001.

[69] B. D. Storey, "Computing Fourier Series and Power Spectrum with MATLAB." http://faculty.olin.edu/bstorey/Notes/Fourier.pdf: accessed September 13, 2011.

[70] D. H. Whalen, E. R. Wiley, P. E. Rubin, and F. S. Cooper, "The Haskins Laboratories' pulse code modulation (PCM) system," Behavior Research Methods, Instruments, & Computers, vol. 22, no. 5, 1990, pp. 550–559.


[71] P. Belt, The New Grove Musical Instrument Series: The Piano. W. W. Norton & Co., Inc., 1988.

[72] H. Partch, Genesis of a Music. New York: Da Capo Press, 1974.

[73] M. Enright, "A comparison of Western and Eastern music modes and tone production." http://www.kentuckybellydance.com/BabaYagaMusic/Makamsand-Cents.htm, accessed September 13, 2011.

[74] D. A. Russell, "Acoustics and Vibration Animations." http://www.kettering.edu/physics/drussell/demos.html, accessed November 21, 2011.

[75] J. Wolfe, "Chladni patterns for violin plates." http://www.phys.unsw.edu.au/jw/chladni.html, accessed September 13, 2011.

[76] W. E. Worman and A. H. Benade, "Oscillations in Clarinet-like Systems: A Status Report." https://ccrma.stanford.edu/marl/CASL/Files/benade/Benade-ClarinetSystems-1969.pdf: Preliminary report/unpublished, April 1969.

[77] P. Weiss and R. Taruskin, Music in the Western World: A History in Documents. Thomason Shirmer, 2nd ed., 1984.

[78] T. Christensen, The Cambridge history of Western music theory. Cambridge, UK: Cambridge University Press, 2002.

[79] Hesiod, S. Lombardo, and R. Lamberton, Works & Days and Theogony. Hackett Publishing Company, 1993.

[80] B. C. J. Moore, An Introduction to the Psychology of Hearing. Bingley, UK: Emerald Group Publishing Ltd., 5th ed., 2003.

[81] A. Lalwani, Current Diagnosis & Treatment in Otolaryngology—Head and Neck Surgery. McGraw-Hill Medical, 2nd ed., 2007.

[82] P. Marler and H. W. Slabbekoorn, Nature's music: The science of birdsong. Academic Press, vol. 1, 2004.

[83] H. S. Howe, Jr., Electronic Music Synthesis: Concepts, Facilities, Techniques. W. W. Norton & Company, Inc., 1975.

[84] J. Wolfe, "Physics in Speech." http://phys.unsw.edu.au/phys_about/PHYSICS!/SPEECH_HELIUM/speech.html: Published 2005, accessed November 21, 2011.

[85] G. P. Scavone, "Percussion Instruments." https://ccrma.stanford.edu/CCRMA/Courses/152/percussion.html: Published 1999, accessed November 21, 2011.

[86] N. Roe, "La Monte Young's Drugless Trip, Dream House, Is Back in Business." http://www.mapcidy.com/q=node/325: published September 23, 2009, accessed November 21, 2011.

[87] R. S. Heffner and H. E. Heffner, "Sound localization and use of binaural cues by the gerbil (Meriones unguiculatus)," Behavioral Neuroscience, vol. 102, no. 3, June 1988, pp. 422–428.

[88] J. Blauert and P. Laws, "Group Delay Distortions in Electroacoustical Systems," Journal of the Acoustical Society of America, vol. 63, no. 5, May 1978, pp. 1478–1483.

[89] J. Blauert, Spatial hearing: the psychophysics of human sound localization. Cambridge, MA: MIT Press, 1983.

[90] L. A. Jeffress, "A place theory of sound localization," Journal of Comparative and Physiological Psychology, vol. 41, 1948, pp. 35–39.

[91] "Vibrational Modes of Drums." http://www.soundphysics.com/Drum-Vibrational-Modes/: 2005, accessed December 29, 2011.

[92] J. M. Pearce, "Clinical features of the exploding head syndrome," Journal of Neurology, Neurosurgery, and Psychiatry, vol. 52, no. 7, 1989, pp. 907–910.

[93] "Synesthesia." http://en.wikipedia.org/wiki/Synesthesia: Accessed December 29, 2011.

[94] J. C. Thomas, "About the Piano." http://www.thomaspianotuning.com/AboutthePiano.html: accessed January 24, 2011.

[95] G. Loy, Musimathics: The Mathematical Foundations of Music, Volume 2. Cambridge, MA: The MIT Press, 2007.

[96] Numerical Recipes in C: The Art of Scientific Computing. Cambridge, UK: Cambridge University Press, 1992.

[97] R. Boulanger and V. Lazzarini, The Audio Programming Book. Cambridge, MA: The MIT Press, 2010.

[98] C. M. Hutchins and D. Voskuil, "Mode tuning for the violin maker," CAS Journal, vol. 2, no. 4, November 1993, pp. 5–9.

[99] D. Knight, "Drum Head Vibrations." http://www.snarescience.com/articles/drum-headvibration.php: Published 2011, accessed January 2, 2012.


Glossary

absolute value Distance from a point (a, b) to the point (0, 0), given by √(a^2 + b^2) where a and b are real numbers. Also referred to as the magnitude.

action (1) All of the mechanisms required to cause a system to vibrate. (2) In a guitar, the distance between the strings and the fretboard.

action potential An electrical firing in a neuron or other excitable cell with a sharp rise and fall, similar to an impulse.

ADSR envelope The shape of a signal's overall amplitude. This signal is usually something sudden like the strike of a drum or strum of guitar. These envelopes are frequently seen on analog synthesizers.

algorithm A series of instructions that execute some desired function.

aliasing The incorrect naming of a frequency due to undersampling.

all-pass filter A filter that allows all frequencies in a signal to pass through it but affects their phase, as found in phaser and flanger pedals.

amplitude (1) The height of a wave at a given time. (2) The overall strength of a given sinusoid, given by A in the expression A sin(ωt + φ).

amplitude modulation The periodic alteration of amplitude from a reference amplitude, also called tremolo.

angular frequency A number describing how often something (like a sine wave) makes one revolution around the unit circle, written with the Greek letter omega (ω) and equal to 2πf where f is the ordinary frequency. Its unit is radians per second (rad/s). See frequency.

antialiasing filter A low-pass filter used to avoid aliasing. Its cutoff frequency should be less than or equal to half the sampling frequency f_s, and it should be applied before sampling.


antiderivative The area under the curve of a function, also called the integral. A function must be continuous to have an antiderivative.

antinode Location in a mode of vibration where motion is maximal during vibration.

anvil (incus) Bone in the ear's ossicles that connects the hammer and the stirrup.

apex (apical end) Refers to the end of the basilar membrane farthest from the oval and round windows. The basilar membrane is widest and minimally stiff at the apex. Low frequencies stimulate the apical end.

attack The onset of a signal; the behavior of the amplitude envelope as it goes from 0 to some maximum value. In an ADSR envelope, attack is the "A."

attenuation Killing or multiplying an amplitude by some number between 0 and 1 (ideally 0) to reduce its amplitude.

auditory canal Tube that runs from the outer to middle ear, extending from the pinnae to the eardrums.

band-limiting Passing a signal through a [band-pass] filter of some bandwidth, hence restricting the frequencies in the signal to the frequencies of the filter.

band-pass filter A filter that allows only an interval of frequencies to pass through it and attenuates the rest, centered around some center frequency ω_0. The interval of frequencies [ω_c1, ω_c2] (called the passband) defines the bandwidth β of the filter.

band-stop filter A filter that does not allow an interval of frequencies to pass through it while letting frequencies outside of the interval pass; this interval is centered at ω_0 and called the stopband.

bandwidth The maximum frequency component of a signal or connection rate.

basal end The thin, stiff end of the basilar membrane that is suspended in fluid. Higher frequencies activate the basilar membrane towards its basal end.


basilar membrane A membrane inside of the cochlea that vibrates at locations according to frequency.

basis Series of vectors that are linearly independent of one another and each define a new dimension, i.e., a basis of size N defines an N-dimensional space.

beating A psychoacoustic, unpleasant phenomenon that occurs when two sine waves close in frequency sound simultaneously, and their small difference tone is heard.

binary A language containing only two symbols, 0 and 1, representing numbers in base-2. All digital information is in binary.

bit A binary digit.

bit depth The maximum length of a string of binary digits. A bit depth of 16, for example, would mean the largest value would be 2^16 − 1 = 65535, so each sample could take on 2^16 = 65536 different values.

bit rate The number of bits that are processed per unit of time. For audio, this is typically given in kilobits per second (kbps) and equal to the sampling frequency times the number of channels times the bit depth.

bulk modulus A substance's resistance to uniform compression, given in pascals (Pa).

byte A byte is an unstandardized power of 2 bits, but in this text, it is 8 bits.

cancelation The perfectly destructive interference of two or more waves. The waves must have identical frequencies and be 180° out of phase with one another in order for cancelation to occur.

carrier frequency The frequency f_c to be modulated in FM synthesis by a modulation frequency f_m.

cilia Hairs along the basilar membrane.

circular modes The modes of vibration that share the same center as the center of the instrument but may differ in size (radius). These may be circular or elliptical/oval in shape.


clipping (1) A discontinuity in a time-domain signal that causes a digital-analog converter and hence speaker to "clip." (2) Chopping off the tops and bottoms of a sine wave to become more like a square wave to produce the effect of distortion. (3) Specifying an amplitude too great for a digital-analog converter and producing undesired distortions.

closed path integral Denoted by the syntax "∮", a closed path integral is defined over some interval of a definite size but variable or unknown endpoints. Only used for complex functions. Not to be confused with a line integral. Also called a contour integral.

cochlea Spiral-shaped cavity in the inner ear filled with fluid and containing the basilar membrane and Organ of Corti.

cocktail effect The psychoacoustic phenomenon that allows an observer to receive a signal in a noisy environment if faced straight on towards that signal, such that both ears receive virtually identical phase at identical times in the signal's waveform, i.e., the signal received by the left ear is completely in phase with the signal received by the right ear.

codec Short for "coder-decoder," this is where the encoding and decoding of a signal takes place.

combination tone The psychoacoustic phenomenon of an audible sine wave with frequency equal to the sum of two or more other frequencies present in a signal. Opposite: Difference tone.

complex plane Two-dimensional plane with real numbers on the horizontal axis and imaginary numbers on the vertical axis.

compression (1) Region of high pressure and particle density, depicted by the crests in a waveform. (2) Process of reducing file size using algorithms. (3) Process of limiting the quantity of values that a musical signal can take on, affecting its volume (note that this does not say anything about the maximum and minimum values, i.e., the range, of the volume). Putting a digital file through a compressor typically makes the quiet parts louder and the loud parts quieter. Opposite: Decompression.


computational complexity The number of computations (additions, multiplications) involved in an algorithm, determining its theoretical execution time.

consonance The physical and psychophysical agreement of two or more pitches due to low integer ratios, considered pleasant or euphonious.

constructive interference The case when two waveforms combine to produce a waveform of greater amplitude than the amplitude of either of the original waveforms.

continuous Fourier transform (FT) Integral that transforms a continuous, time-domain signal into a continuous, frequency-domain spectrum, given by

X(ω) = ∫_{−∞}^{∞} x(t) e^{−iωt} dt.

convolution Binary operation denoted by the syntax "∗". The convolution of two continuous functions x(t) and y(t) is the integral

x(t) ∗ y(t) = ∫_0^t x(s) y(t − s) ds,

and the convolution of two discrete functions x[t] and y[t] of length N is the sum

x[t] ∗ y[t] = ∑_{s=0}^{N−1} x[s] y[t − s],

where s is a number such that x and y do not intersect on the same axis when y is shifted to the left by s and flipped vertically. Also called the cyclic convolution.

convolution reverb The convolution of an impulse response of a room with a musical signal to make the music sound as if it were recorded within that room.

crest Location in a wave where pressure is maximal. Opposite: Trough.

critical band The bandwidth β_c beyond which we perceive band-limited (by β_c) sound to have more energy than it physically does.


cut-off frequency The frequency or set of frequencies in a filter at which the magnitude response is −3 dB and attenuation begins.

DC offset The value of the spectrum at k = 0, i.e., X(0), which quantifies the amount of direct (constant) current in a signal.

decay In an ADSR envelope, decay is the "D," describing the overall magnitude of a signal as it decreases from some maximum value.

decibel (dB) Logarithmic unit of the ratio of power or intensity to a reference power or intensity; one tenth of a bel (B). Two signals differing in power by one decibel have a power ratio of 10^(1/10) ≈ 1.26 and an amplitude (intensity) ratio of (√10)^(1/10) ≈ 1.12. Not to be confused with sound pressure level (dB SPL).

delay line A filter that causes a feedback or feedforward loop in an electrical system, such as a comb filter.

derivative The rate of change of a function, defining the slope of the tangent line to the function at any time. A function must be continuous and have no sharp edges or turns (such as x(t) = |t|, the absolute value of t) to be differentiable.

destructive interference The case when two waveforms combine to produce a waveform of lesser amplitude than the amplitude of either of the original waveforms.

diatonic scale The seven notes of the major or minor scale.

difference tone Psychoacoustic phenomenon of an audible sine wave with frequency f_3 resulting from the difference of two simultaneous frequencies, f_1 and f_2, such that f_3 = |f_1 − f_2|. Heard as beating when f_3 is less than approximately 10 Hz. Will supplant the fundamental frequency when it is removed from a harmonic overtone series, i.e., if the frequencies 200, 300, and 400 Hz are sounded, a difference tone of 100 Hz will also be heard. Opposite: Combination tone.

diffraction The change in wave motion when an area of different impedance is encountered.


Appendix B 289<br />

digital signal processing Field of electrical engineering that represents and seeks to manipulate discrete-time inputs in linear, time-invariant systems. Also concerned with the measurement, filtering, and compression of analog signals.

Dirac delta A continuous impulse, defined conditionally as

    δ(t) = ∞ when t = 0, and 0 otherwise.

The global integral of the Dirac delta is exactly 1.

discrete Fourier transform (DFT) Sum that transforms a discrete, time-domain signal of length N into a discrete, frequency-domain spectrum also of length N, given by

    X(k) = \sum_{t=0}^{N-1} x(t) e^{-i 2πkt/N}.
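The sum translates directly into a short program. A minimal MATLAB sketch of the naive DFT follows (an illustration, not the book's own listing); note that MATLAB arrays are 1-indexed while k and t run from 0 to N − 1, and the example input is an assumption.

    % Naive O(N^2) DFT of a signal x, following the definition above.
    x = [1 2 3 4 5 6 7 8];                 % example input (assumed values)
    N = length(x);
    X = zeros(1, N);
    for k = 0:N-1
        for t = 0:N-1
            X(k+1) = X(k+1) + x(t+1) * exp(-1i*2*pi*k*t/N);
        end
    end
    % max(abs(X - fft(x))) is on the order of machine epsilon.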

discrete-time Fourier transform (DTFT) Sum that transforms a discrete, time-domain signal of infinite length into a continuous, frequency-domain spectrum of length 2π, given by

    X(ω̂) = \sum_{t=-∞}^{∞} x[t] e^{-iω̂t},

for normalized frequencies ω̂ in the interval [0, 2π).

dissonance The physical and psychophysical disagreement of two or more pitches due to high integer ratios, considered unpleasant.

domain The set of input values over which a function is defined. A domain can either be continuous or discrete. In the signal x(t), the set of t represents the domain of x.

Doppler effect The relationship between frequency and the movement of a sound source with respect to an observer, given by

    f_o = ((c + v_o) / (c + v_s)) f_s,



where f_o is the observed frequency, f_s is the frequency of the source, v_o is the speed of the observer (positive if moving towards the source), v_s is the speed of the sound source (positive if moving away from the observer), and c is the speed of sound.
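As a worked example (the numeric values are assumptions, not taken from the text), the MATLAB lines below evaluate the formula for a stationary 440 Hz source and an observer approaching it at 10 m/s.

    % Doppler shift for a stationary source and an approaching observer.
    c   = 343;    % speed of sound in air, m/s (assumed)
    f_s = 440;    % source frequency, Hz
    v_o = 10;     % observer speed, positive toward the source
    v_s = 0;      % source speed
    f_o = (c + v_o) / (c + v_s) * f_s    % about 452.8 Hz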

eardrum (tympanum) Membrane separating the outer and middle ear that is disturbed when the pressure in the outer ear changes. Disturbance in the eardrum is required for the perception of sound.

echolocation The technique used by animals such as bats to identify the distance to objects by measuring the time it takes for a signal to echo back towards the observer.

effective length The length that a vibrating mechanism such as a fixed string or column of air effectively has when we account for its physical nature, such as string density µ or holes in a bore. This is the wavelength of the fundamental produced on an ideal (massless, infinitesimally thin, infinitely tense) string.

endianness Adjective describing the direction in which the bytes of a binary value are stored into memory. Little endian means the least significant byte is stored first; big endian stores the most significant byte first, the order in which we typically write numbers.

endolymph Fluid inside of the scala media in the cochlea. Reissner's membrane and the basilar membrane separate it from the perilymph in the scala vestibuli and scala tympani. Its ionic composition is different from that of perilymph and they work together to create electrochemical impulses.

energy Integral of power over time; used to do work. Unit is the joule (J).

enharmonic equivalent Two notes with the same frequency but different note names, such as F♯ and G♭. They function differently in transcribed music but sound identical (in equal temperament only).

equal temperament Scale of 12 tones with equal spacing between tones. A tone (f_1) that is k half steps above another tone (f_0) will have frequency f_1 = 2^{k/12} f_0. When this tone is below f_0, k is negative. Equal temperament enables perfect transposition to other keys on an instrument, but



lacks the integer-based consonance of Pythagorean temperament and just intonation.
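A quick MATLAB check of the semitone formula (an illustrative sketch; the A4 reference pitch is an assumption):

    % Equal-tempered frequencies k half steps away from a reference f0.
    f0 = 440;                      % A4 as the reference (assumed)
    k  = [-12 -1 0 1 12];          % an octave down, a semitone down, etc.
    f1 = 2.^(k/12) * f0            % 220, 415.3, 440, 466.2, 880 Hz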

Euler's formula The formula e^{ix} = cos(x) + i sin(x). Also, e^{-ix} = cos(x) − i sin(x).

Euler's identity The equation e^{iπ} + 1 = 0.

factorial Designated by the syntax "!" in mathematics, the factorial of n is n! = n · (n − 1) · (n − 2) · . . . · 1.

filter A frequency-discriminating system. This can be virtually any physical object, and it can be modeled by an electric circuit and its corresponding transfer function.

filter bank A series of filters, usually linearly or logarithmically spaced and with the purpose of retrieving the notes of some scale.

FIR filter Shortening of "finite impulse response filter," meaning a filter with an impulse response of finite duration because it is completely zero after some point in time, as opposed to an infinite impulse response (IIR) filter.

Fourier series The method of approximating a periodic function by a sum of sine waves as devised by Jean Baptiste Joseph Fourier. The DFT is proportional to the coefficients of the Fourier series.

frame One "slice" of a windowing function. We call its size N′, equal to N/M for M-many frames. Also called a window.

frequency A number defining how often something (like a sine wave) repeats itself, inversely proportional to the time something takes to repeat itself, i.e., f = 1/T. Also called the ordinary frequency. Its unit is in hertz (Hz) or the inverted second (s^{-1}).

frequency bin The indexing system for the frequency components of the discrete Fourier transform, labeled by the integers k = 0, 1, 2, . . ., N − 1.

frequency component A frequency present in a signal as evidenced by its Fourier transform, named ω_0, ω_1, ω_2, and so on.



frequency modulation The periodic alteration of frequency from a reference frequency. For small differences, this is the effect of vibrato, but for larger differences, sidebands form and unusual timbres arise.

frequency response The spectrum of a time-domain signal, X(ω), usually used to refer to the transformed reaction of some instrument or filter to a sine sweep or impulse; the Fourier transform of the impulse response.

fundamental frequency Notated f_0, the fundamental frequency is a reference frequency to which other frequencies are compared. It is typically the root of a chord or the actual, single pitch played on an instrument whose Fourier transform is studied.

Gibbs phenomenon Observation by J. Willard Gibbs that the Fourier series of a piecewise continuous function (like a square or triangle wave) approximates it worst at "jump discontinuities," i.e., at the sharp edges of the waveform. The "tails" that appear in the graphs of such Fourier series are called Gibbs horns.

hammer (malleus) Attached to the eardrum and the anvil in the ossicles; communicates the vibrations of the eardrum to the inner ear.

harmonic (1) Short for harmonic partial, meaning a partial that is an integer multiple of a fundamental frequency. (2) An adjective describing a timbre that contains only integer multiples of the fundamental frequency.

Hermitian symmetry Symmetry of a complex function. For a spectrum X(k), the following properties hold:

    Re{X(−k)} = Re{X(k)}      (vertical symmetry of the real parts),
    Im{X(−k)} = −Im{X(k)}     (diametric symmetry of the imaginary parts),
    |X(−k)| = |X(k)|          (vertical symmetry of the magnitudes),
    ∠X(−k) = −∠X(k)           (diametric symmetry of the phase angles).

The Fourier transform possesses Hermitian symmetry.

high-pass filter A filter that allows frequencies above some cutoff frequency ω_c to pass through it and attenuates the rest.

hop The interval of time between the beginning times of the mth and (m + 1)th windows in a windowed signal, designated by a hop size H.



Huygens' principle Every point through which a wave propagates is itself the source of a new spherical wave.

ideal sampling Sampling using impulses.

impedance matching The matching of the impedance of some closed cavity (like the middle ear or bore of a wind instrument) to external impedance (like the impedance outside of the eardrum or the impedance input into the mouthpiece of a wind instrument).

impulse Theoretically, an impulse is the function δ(t), equal to 1 where t = 0 and 0 elsewhere for discrete time domains, and equal to positive infinity where t = 0 and 0 otherwise for continuous time domains. Its Fourier transform is constant, i.e., it has energy spread equally to all frequencies. Therefore, an impulse is a burst of white noise.
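The flat spectrum of the discrete impulse is easy to verify numerically; here is a one-line MATLAB check (an illustration, not the book's code).

    % The DFT of a discrete (Kronecker) impulse is constant across all bins.
    d = [1 zeros(1, 7)];    % length-8 impulse
    abs(fft(d))             % returns all ones: equal energy at every frequency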

impulse response The recorded reaction of a reverberant system to an impulse.

information The meaningful content of a message or signal. Opposite: Noise.

input impedance The amount of resistance induced in an instrument, such as the amount of pressure introduced by a player's lips and lungs in a wind instrument.

interpolation The insertion of L-many zeros between every point of a domain for the purpose of up-sampling (oversampling).

inverse continuous Fourier transform (IFT) Integral that transforms a continuous, frequency-domain spectrum into a continuous, time-domain signal, given by

    x(t) = (1/2π) \int_{-∞}^{∞} X(ω) e^{iωt} dω.

inverse discrete Fourier transform (IDFT) Sum that transforms a discrete, frequency-domain spectrum of length N into a discrete, time-domain signal also of length N, given by

    x(t) = (1/N) \sum_{k=0}^{N-1} X(k) e^{i 2πkt/N}.
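As a quick sanity check (using MATLAB's built-in fft and ifft rather than any listing from the text), applying the DFT and then the IDFT reproduces the original signal to within rounding error.

    % DFT followed by IDFT recovers the input up to floating-point error.
    x  = randn(1, 16);           % a random test signal (assumed)
    xr = ifft(fft(x));
    max(abs(x - xr))             % on the order of machine epsilon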



inverse discrete-time Fourier transform (IDTFT) Integral that transforms a continuous, frequency-domain spectrum of length 2π into a discrete, time-domain signal of infinite length, given by

    x[t] = (1/2π) \int_{0}^{2π} X(ω̂) e^{iω̂t} dω̂,

for normalized frequencies ω̂ that lie in the interval [0, 2π).

inverse square law Law governing the intensity I of waves as they propagate as a function of distance r and original power P, given by

    I = P / (4πr^2).

Sound pressure, by contrast, falls off in proportion to 1/r.
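A small numeric illustration in MATLAB (values assumed): doubling the distance from a point source quarters the intensity.

    % Intensity of a 1 W point source at 1 m and at 2 m.
    P   = 1;                     % acoustic power in watts (assumed)
    I1m = P / (4*pi*1^2)         % about 0.0796 W/m^2
    I2m = P / (4*pi*2^2)         % one quarter of I1m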

inverse Z-transform Closed path integral that transforms a continuous, finite frequency-domain spectrum into a discrete, infinite time-domain signal, given by the equation

    x[t] = (1/2πj) \oint_C X(z) z^{t-1} dz,

where C is a closed path lying within the region of convergence.

inversion The "flipping" of a musical interval with respect to the octave of 12 notes. The inversion of a perfect fifth, for example, is a perfect fourth. The inversion of an octave is still an octave.

just intonation System of tuning built on the intervals of the octave, perfect fifth, and major third.

just-noticeable difference (jnd) (1) The minimum difference in two frequencies for the sounds to be perceived as different. (2) The minimum difference in two decibel levels for the sounds to be perceived as different. Both (1) and (2) are measured in limens.

key In Western tonality, a key is a scale of notes (typically 7) designated by a root note name (like C) and a quality (like major or minor).



Kronecker delta A discrete impulse, defined conditionally as

    δ[t] = 1 when t = 0, and 0 otherwise.

The global sum of the Kronecker delta is exactly 1. It is nonintegrable because it is not continuous.

Laplace transform Integral that transforms a continuous, time-domain signal into a continuous, frequency-domain spectrum, given by

    X(s) = \int_{-∞}^{∞} x(t) e^{-st} dt.

Since time-domain signals are typically only defined for a positive domain, we can make our lives a lot easier by changing the limits of integration to [0, ∞).

latency The measure of time delay in a system, ideally zero.

lateral modes The modes of vibration that are along some diameter of an instrument.

limen See just-noticeable difference.

limit Value specifying the maximum strength a frequency can have before permanent damage is incurred.

linear filter Filters that are subject to the constraint of linearity, meaning that they satisfy two conditions: (1) the principle of superposition (additivity), and (2) scaling the input by a constant also scales the output by the same constant (e.g., if the input x(t) yields the output y(t), then the input a·x(t) yields a·y(t)). Every filter covered in Numbers & notes is a linear filter.

linear independence Mathematical condition satisfied when one linear expression (i.e., a vector) cannot be written in terms of another linear expression. The vectors (1, 0) and (0, 1), for example, are linearly independent, because (0, 1) cannot be written as any combination of (1, 0). The vectors (1, 0) and (2, 0), on the other hand, are linearly dependent: The second is two times the first.



load Resistive component of a circuit over which the output voltage is computed in order to calculate the circuit's transfer function.

lossless compression Compression algorithms that do retain all of the original, raw data in a signal, such that when the compressed file is decompressed, the original signal is returned.

lossy compression Compression algorithms that do not retain all of the original, raw data in a signal.

loudness The psychophysical perception of intensity.

low frequency oscillator (LFO) A sine wave with frequency less than about 30 Hz that is meant to control periodic changes in other elements, such as tremolo or vibrato.

low-pass filter A filter that allows frequencies below some cutoff frequency ω_c to pass through it and attenuates the rest.

Mach number The ratio of the speed of an object moving through air to the speed of sound, with Mach numbers greater than 1 indicating that the object is breaking the sound barrier.

magnitude The length of a vector; for a complex number a + bi, this vector is 〈a, b〉 and its magnitude is |a + bi| = \sqrt{a^2 + b^2}, just like its absolute value.

magnitude response The magnitude plot of a filter with respect to frequency, calculated by

    |H(jω)| = \sqrt{Re{H(jω)}^2 + Im{H(jω)}^2}.

The magnitude response is all we typically care to plot when we take a Fourier transform because visualizing its imaginary parts would require another axis.
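As an illustration, the MATLAB sketch below evaluates the magnitude response (in dB) of an assumed one-pole low-pass transfer function H(jω) = 1/(1 + jω/ω_c); neither this filter nor its 1 kHz cut-off comes from the text.

    % Magnitude response of an assumed one-pole low-pass filter.
    wc = 2*pi*1000;                  % cut-off frequency: 1 kHz
    w  = 2*pi*linspace(10, 20000, 500);
    H  = 1 ./ (1 + 1i*w/wc);
    magdB = 20*log10(abs(H));        % about -3 dB where w equals wc
    semilogx(w/(2*pi), magdB), xlabel('Hz'), ylabel('dB')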

masking The psychoacoustic phenomenon wherein the power of some frequency goes undetected due to the superior power of another nearby frequency that is within one critical band of the other.

meter The organization of rhythm in a piece of music, designating some amount of notes with a unit duration per measure.



modes of vibration The patterns describing the physical ways in which an instrument vibrates when set into motion by a mechanism such as a fixed string, reed, or mallet.

modulation Periodic change.

modulation frequency In FM (frequency modulation) synthesis, the modulation frequency f_m periodically changes some carrier frequency f_c to produce sidebands at f_c − f_m and f_c + f_m.

music information retrieval (MIR) The digital methods used to detect musical devices in sound files, such as instrumentation, emotion, fundamental frequency, and style.

narrowband (band-limited) noise Noise containing only a small interval of frequencies.

node Location in a mode of vibration where a vibrating mechanism remains stationary during vibration due to the cancelation of vibrating forces. Striking or otherwise causing vibration in the instrument at its nodes produces no sound.

noise The meaningless content of a message or signal. Opposite: Information.

normal atmospheric pressure The average pressure of the environment, in which no sound is perceived. Standardized to 101,325 pascals (Pa).

normalized discrete Fourier transform (NDFT) Specification of the DFT that maps to the interval [0, 1], given by

    X̂(k) = (1/\sqrt{N}) \sum_{t=0}^{N-1} x(t) e^{-i 2πkt/N}.

normalized inverse discrete Fourier transform (NIDFT) Specification of the IDFT that maps to the interval [−1, 1], given by

    x(t) = (1/\sqrt{N}) \sum_{k=0}^{N-1} X̂(k) e^{i 2πkt/N}.

normalization The mathematical process of translating a function's values to be unitary, i.e., to the closed interval [−1, 1].



Nyquist frequency The minimum sampling frequency required to avoid aliasing, equal to 2f_max. Also called the Nyquist rate and Nyquist limit.

Ohm's law Voltage is the product of current and resistance; V = IR.

open circuit A circuit that is or behaves as if its wires were disconnected due to infinite resistance. No current will flow in an open circuit.

organ of Corti Sensory organ of hearing in the cochlea, covered with cilia.

orthogonality (1) The relationship between two vectors whose dot (inner) product is zero; orthogonal vectors are linearly independent. (2) Mathematical condition satisfied when two vectors meet at an angle of 90°.

orthonormality Linearly independent (orthogonal) vectors that all have unitary magnitude.

ossicles The three smallest bones in the body, located in the middle ear. The ossicles consist of the hammer (malleus), anvil (incus), and stirrup (stapes) and they serve to pass on and amplify up to 20 times the vibrations of the eardrum to the inner ear.

oval window Membrane connecting the ossicles to the cochlea in the inner ear. Opening to the scala vestibuli.

oversampling Using a higher sampling frequency than the Nyquist frequency (2f_max) to ensure that all frequencies in a signal are captured during sampling. Results in a larger file size.

overtone A frequency produced in the timbre of an instrument that is above the fundamental frequency that is played. Usually this is ordered, i.e., the fundamental frequency is written f_0 and the third closest frequency in its overtone series is written f_3.

overtone series The collection of frequencies in the timbre of an instrument, ordered f_0, f_1, f_2, and so on. When these f_k are integer multiples of f_0, we call this a harmonic overtone series.

partial See overtone.



pentatonic scale Scale built on the first five notes from the circle of fifths, i.e., C-G-D-A-E (typically ordered C-D-E-G-A).

perfect fifth A consonant interval defined by two notes, in which one note is seven semitones (half steps) above the other. Their frequencies are in a 3:2 ratio.

perfect fourth A consonant interval defined by two notes, in which one note is five semitones (half steps) above the other, abbreviated P4. Their frequencies are in a 4:3 ratio. This is the interval at the beginning of "Here Comes the Bride."

perfect octave A consonant interval defined by two notes, in which one note is 12 semitones (half steps) above the other, abbreviated P8 or 8va. Their frequencies are in a 2:1 ratio.

perfect (absolute) pitch The ability to name the notes of pitches played in isolation or without a reference pitch or key.

perfect unison The most consonant interval defined by two notes of identical frequency, i.e., a ratio of 1:1.

perilymph Fluid inside of the scala tympani and the scala vestibuli.

period The interval of time in seconds (s) that something takes to repeat itself, denoted by the variable T and inversely proportional to ordinary frequency, i.e., T = 1/f.

permanent threshold shifting The permanent shifting of the thresholds of hearing to higher values; hearing loss.

phase relationship The functional difference in phase between two or more waves.

phase response The phase plot of a filter with respect to frequency, calculated by

    φ[H(jω)] = tan^{-1}( Im{H(jω)} / Re{H(jω)} ).

phase shifting Process used by Steve Reich on a tape reel, achieved by playing two identical tapes slightly out of sync with one another and hence changing the phase of one of the tapes with respect to the other tape.



phasor (1) In electrical engineering, the phasor refers to the initial phase of a voltage function. (2) In Max/MSP and other musical programming environments, a phasor is the mirror image of a sawtooth wave, whether horizontally or vertically.

phon Unit of perceived loudness that heeds equal loudness curves, the lowest of which is the Fletcher-Munson curve at zero phons. One phon is the loudness of 1000 Hz at 1 dB SPL; 10 phons is the loudness of 1000 Hz at 10 dB SPL.

pinnae The flaps of skin in the outer ear that stick out from the head.

pitch The psychophysical perception of frequency.

pitch class A note regardless of octave. The twelve notes C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, and B are each pitch classes (could also be written with flats, but sharp notes are more common). Also called pitch chroma.

place theory Leading theory in psychoacoustics supposing that the ear detects frequencies with respect to the location of excitation along the basilar membrane.

pole In signal processing, a pole exists where the denominator of a transfer function H is equal to zero.

power Rate of energy transfer (energy per unit time); voltage times current. Unit is the watt (W), equal to 1 joule per second.

pressure Force per unit area.

pressure wave Wave that periodically alternates between compressions (high pressure regions) and rarefactions (low pressure regions).

principle of superposition Every wave can be represented as a sum of simple sinusoids. This is evidenced in the Fourier series.

probability density function (pdf) Function that defines the probabilities of different events for a continuous random variable. Its total sum (or integral) is equal to 1. For a discrete random variable, we compute its probability mass function (pmf).



pulse code modulation (PCM) Sampling technique similar to ideal sampling but with some quantization error.

pure tone A sine wave with no harmonics or overtones, exhibited in simple harmonic motion.

Pythagorean temperament System of tuning built on the intervals of the octave and perfect fifth.

quality (1) In band-pass and band-stop filters, quality Q refers to the steepness of the passband or stopband in the magnitude plot of its transfer function. The higher the Q, the steeper the response. (2) The sonority of a key or interval, i.e., major or minor.

quantization Approximating the size at a given point of a measured signal to the closest value in a given set of values.

quantization error The difference between a signal x(t) and its sampled representation x_s(t), i.e., between raw data and compressed data. When the bit depth is low or there are other limits on the size of a file, the quantization error will probably be significant.

range Also called the codomain, the range is the set of output values to which a function maps. In the signal x(t), the values of x(t) represent its range. Range can also refer to the interval of values that a function maps (its image), like the bandwidth of a signal.

rarefaction Region of low pressure and particle density, depicted by the troughs in a waveform.

reactance Resistance in the complex frequency domain, of frequency-dependent components such as capacitors and inductors.

region of convergence The radius of the largest circle for which a series converges.

Reissner's membrane Membrane in the cochlea separating the scala media from the scala vestibuli.

relative pitch The ability to name the notes of pitches played with some reference pitch or key.



release In an ADSR envelope, release is the "R," describing the period after a signal sustains and dies out.

resolution The quality or fidelity of a representation set. High resolution implies a relatively low amount of error when quantifying the difference between an actual thing and its representation, like quantization error.

resonance The preference of a system for specific frequencies, evidenced by relatively strong peaks in its frequency domain. These resonant frequencies are determinable by the physical dimensions of the resonator.

restoring force The force acting upon a system in motion to return the system to equilibrium (stasis).

reverberation The persistence of sound in a space due to reflections and refraction.

rhythm A structured organization of strong and weak beats which repeat periodically in music.

RIFF Header information in encoded WAVE files.

room acoustics The study of the resonances of architectural spaces, particularly their effect on psychoacoustics and speech intelligibility.

roots of unity The complex exponentials e^{iωt} in the Fourier transform, defining different positions along the unit circle. For a discrete signal x(t) of size N, there will be N roots of unity in one revolution of the unit circle, each separated by the angle 2π/N (a factor of e^{i2π/N}). They define the orthonormal basis of the Fourier transform.

roughness Interval of frequencies that lies between beating frequencies and separable frequencies. Considered dissonant. Also called the interval of confusion.

round window The second opening to the inner ear, located below the oval window. As the oval window translates the movements of the stirrup into the cochlea, the round window pushes out of the cochlea to allow the fluid inside to vibrate. Opening to the scala tympani.

row matrix Matrix consisting of only one row; can have any number of columns.



sawtooth wave A waveform with similar appearance to the teeth on a saw. Bowing a violin produces a sawtooth wave.

scala media Endolymph-filled cavity in the cochlea lying between the scala tympani and the scala vestibuli.

scala tympani Perilymph-filled cavity in the cochlea that translates its vibrations to the scala media.

scala vestibuli Perilymph-filled cavity in the cochlea that translates its vibrations to the scala media. Reissner's membrane separates it and the scala media.

scattering junction An area of different impedance that causes waves to change orientation, intensity, and/or speed.

semitone A half step, such as C to C♯ or A to A♭.

series The sum of components in a sequence.

short-time Fourier transform (STFT) Theoretical variation of the fast Fourier transform that first divides an input signal into smaller chunks (less than 100 milliseconds in length) via a windowing function and then computes an FFT on each of those chunks. Particularly useful for musical signals whose frequency content changes over time. Generates a spectrogram.

short circuit A circuit that contains or behaves as if it contained a plain electric wire with no resistance that effectively shorts out other paths of the circuit with higher resistance. Current is infinite in a short circuit because it, like traffic, prefers paths with less resistance, so no current flows in the parts of a circuit that are shorted.

sidebands Frequencies that appear as the result of FM synthesis. See modulation frequency.

signal A time-domain message. In signal processing, we denote a signal by x(t).

sinusoid A sine or cosine wave of the form A sin(ωt + φ) or A cos(ωt + φ), where A is amplitude, ω is angular frequency, t is time, and φ is phase; a pure tone.



sone Unit of perceived loudness that measures the loudness of sound relative to some reference sound. For the loudness in phons L_p, the loudness in sones can be calculated by

    L_s = 2^{(L_p − 40)/10}.

The Fletcher-Munson curve also defines where frequencies are zero sones. A frequency f that is 50 phons is considered to be twice as loud as the same frequency at 40 phons by the sone scale.
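A one-line MATLAB check of the phon-to-sone conversion (illustrative only):

    % Sones for loudness levels of 40, 50, and 60 phons.
    Lp = [40 50 60];
    Ls = 2.^((Lp - 40)/10)    % returns 1, 2, 4: each 10 phons doubles loudness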

sound intensity level (SIL) Sound intensity is the sound power P per unit area A. Unit is watts per square meter (W/m^2). Sound intensity level is a measure of the ratio between two sound intensities I_0 and I_1, where I_0 is the reference intensity, given by

    L_SIL = 10 log_10(I_1 / I_0) dB.
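For instance (the reference value below is an assumption), an intensity ten times the reference corresponds to a 10 dB sound intensity level, as the MATLAB lines show.

    % Sound intensity level for an intensity ten times the reference.
    I0 = 1e-12;                   % W/m^2, a common reference (assumed)
    I1 = 1e-11;
    L_SIL = 10*log10(I1/I0)       % = 10 dB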

sound pressure level (SPL) Deviation in the ambient pressure level from normal atmospheric pressure or some other reference pressure level like the threshold of human hearing at 1000 Hz (20 µPa). Unit of sound pressure is the pascal (Pa) and of sound pressure level, the dB SPL.

source (1) A circuit component like a battery that introduces a voltage to the circuit, i.e., the input voltage. (2) Something that produces a sound.

spectrogram (spectrograph) Graph drawn on either two or three axes in which the horizontal axis is time, the vertical axis is frequency, and the color of each point indicates the power of the frequency at that time. Three-dimensional spectrograms add an axis for power but still convey it by color as well.

spectrum The frequency response of some time-domain signal, transformed by an algorithm such as the Fourier transform. The plural of spectrum is spectra.

standing waves Waves that do not propagate in a reverberant space because the physical dimensions of the space are an integer multiple of half their wavelength.

stapedius (acoustic) reflex Involuntary contraction of the stapedius muscle in the middle ear that protects the ear from loud sounds. Also activated during speaking to reduce sound by about 20 dB.



stirrup (stapes) Bone in the ossicles that is connected to the anvil and the inner ear's oval window.

stretched octave Found in the piano, a stretched octave above a fundamental frequency f_0 has a frequency that is slightly greater than 2f_0. This is to compensate for inharmonic overtones caused by the physical imperfections of an instrument.

sustain In an ADSR envelope, sustain is the "S," describing the duration of time that a signal stays at approximately the same amplitude after initial attack and decay.

syncopation The placement of a strong beat on a weak beat or at an unexpected time.

synesthesia The response of one of the senses to stimuli of a different type, such as visual responses to music.

tectorial membrane Membrane beneath Reissner's membrane in the cochlea whose motion triggers the inner hair cells of the basilar membrane. Its function is largely unknown, but it is hypothesized that it is largely responsible for passing on the phase information of waves to the brain.

temperament The system governing the ratios between pitches in musical tuning.

tempo The pace of a piece of music.

temporal theory Theory in psychoacoustics supposing that the ear detects frequencies with respect to phase and interaural time difference.

temporary threshold shifting The shifting of the thresholds of hearing during a persistently loud sound to a higher value in decibels, protecting against hearing loss.

tensor tympani Muscle in the middle ear with the purpose of damping sounds.

threshold Value specifying the minimum strength of a frequency required for it to be perceived.

timbre The tone color of an instrument.

tonal center The tonic or root of a key bearing the same name. Tonal center can also refer to the "best guess" for the name of this key. Keys defined without some asymmetry, such as the whole-tone scale, lack a tonal center.

tone deafness The inability to repeat a heard melody; amusia.

tonotopic mapping The mapping of place theory, of frequencies along the basilar membrane. This mapping is logarithmic.

transposition The shifting of a set of pitches by an equal amount, usually to put the music in a different key (does not affect the key's quality).

tremolo Periodic modulation of amplitude around some center or average amplitude, and the resulting aural effect.

triad Three notes defining a musical chord. With respect to a fundamental note, a major triad is made up of a major third and perfect fifth.

trough Location in a wave where pressure is minimal. Opposite: Crest.

tuning Adjusting the pitches of an instrument so that the intervals between them follow some standardized system (temperament).

undersampling Sampling at less than the Nyquist frequency (2f_max). Results in aliasing.

unit circle Circle defined on the complex plane centered at the point (0, 0) with a radius of 1.

unitary Of magnitude 1. The unit circle is unitary.

up-sampling See oversampling.

vibrato Periodic modulation of pitch around some center or average pitch, and the resulting aural effect.

wavelength The length in meters (m) that a wave of frequency f travels in one period T, denoted by the Greek letter lambda (λ) and given by

    λ = v/f = vT.
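For example (the speed of sound is an assumed value), the wavelength of a 440 Hz tone in air is a little under a meter; in MATLAB:

    % Wavelength of a 440 Hz tone in air.
    v = 343;             % speed of sound, m/s (assumed)
    f = 440;
    lambda = v / f       % about 0.78 m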



whammy bar The bar attached to the bridge of a guitar that allows a player to alter the tension of the strings.

wideband noise Noise with a large bandwidth of frequencies, such as white noise (infinite bandwidth).

windowing function A time-domain function w(t) that segments a signal with the use of windows to prepare it for the short-time Fourier transform (STFT).

Young's modulus The measure of stiffness in an elastic material; the ratio of stress to strain. Like the bulk modulus, Young's modulus is also given in pascals (Pa).

Z-transform Algorithm that converts a discrete, infinite, time-domain system into a continuous, finite, frequency-domain spectrum, given by

    X(z) = \sum_{t=-∞}^{∞} x[t] z^{-t},

where t is an integer representing time samples and z = Ae^{jω}.

zero In electrical engineering, a zero refers to where the numerator of a transfer function H is equal to zero.

zero-padding Appending a signal with zeros to make it some desired length.


Index

Z-transform, 214, 239
ADSR envelope, 29, 185
algorithm, 150
    fast Fourier transform, 216
    lossless compression, 155
    lossy compression, 156
    quick-sort, 151
aliasing, 148
amusia, 132
aphasia, 132
bandwidth, 124, 146, 175, 246
beating, 123
Bessel functions, 23, 84
bit depth, 152
bit rate, 157
cancelation, 39
Caruso, Enrico, 231
Chladni plates, 74
Chladni, Ernst, 73
clipping, 19, 94, 95, 165, 214
closed tube, 81
complex numbers, 1
complex plane, 168, 230
computational complexity
    FFT, 216
    of the DFT, 190
constructive interference, 39
    in the DFT, 193
continuous Fourier transform, 173, 176
convolution, 183, 185
convolution reverb, 184
convolution theorem, 183
cosine wave, 3
    in terms of e^{ix}, 170
critical bands, 124, 157
cross product, 169, 187
decibel, 41
    hearing level (dB HL), 122
    sound pressure level (dB SPL), 121, 238
deconvolution, 231
destructive interference, 39
Deutsch, Diana, 132
Dirac delta, 141
discrete Fourier transform, 177
    as matrix multiplication, 196
    symbols used in, 177
discrete-time Fourier transform, 214, 240
Doppler effect, 46
Doppler, Christian, 46



Euler's formula, 170
Euler's identity, 168
fast Fourier transform, 216
    radix-2 decimation in time, 217
filter
    all-pass, 100
    anti-aliasing, 148
    as an electrical circuit, 93, 233
    band-pass, 114, 244
    band-stop, 244
    bank, 101, 248
    digital, 214, 230
    high-pass, 234
    in musical instruments, 72, 125
    linear, 182
    low-pass, 148, 239
    transfer function, 229
flanger pedal, 100
Fourier series, 161
Fourier, Jean Baptiste Joseph, 161
frame, 221
frequency bin, 189
frequency response, 34
harmonic overtone series, 10
    partials, 26, 64, 85, 166
    reasons for inharmonicity, 70, 94
Helmholtz resonance, 24
Helmholtz, Hermann von, 24, 75, 127
Hermitian symmetry of the DFT, 182, 202
Huygens' principle, 43
impedance matching, 84, 110
impulse function, 139, 143
impulse response, 35, 184, 236
integers, 1
interpolation, 189, 214, 215
inverse discrete Fourier transform, 177
inverse square law, 41
just-noticeable difference (jnd), 122
Kronecker delta, 140
Laplace transform, 214, 229, 235
linearity of the DFT, 181
logarithms, 3
    as a perceptual measure of loudness, 41, 119
    as a perceptual measure of pitch, 101, 114
    nature of the basilar membrane, 134
Mach number, 45
magnitude response, 174, 240
masking, 124
modes, 13



    of a fixed string, 25
    of circular membranes (drums), 89
    of wind instruments, 82
monochord, 56
music information retrieval, 49, 209, 248
musical intervals, 22
    consonance and dissonance of, 128
    with respect to temperament, 23, 58, 59
musical synesthesia, 131
nodes, 23
    of a circular membrane (drum), 90
    of a fixed string, 75
    of wind instruments, 83
normalized discrete Fourier transform, 203
Nyquist rate, 146
Nyquist-Shannon sampling theorem, 146
open tube, 81
orthonormality of the DFT, 11, 170, 196
Parseval's theorem, 187, 233
Partch, Harry, 60
perfect pitch, 129
phase response, 101, 174, 239
phaser pedal, 100
Plomp, Reinier, 64, 127
power, 41, 174, 232
    average, 233
pressure wave, 78
principle of superposition, 22
quantization error, 154, 157
real numbers, 1
reflection, 30, 40, 124, 185
refraction, 31
resistance, 33, 110, 231
reverberation, 33, 184
ring modulation, 95
roots of unity, 169, 171, 191, 227
roughness, 30, 128
scaling theorem, 188
Shebalin, Vissarion, 132
shift theorem, 182
short-time Fourier transform, 221
side lobes, 189, 224
sine wave, 3
    in terms of e^{ix}, 170
sound intensity level, 120
sound pressure level, 121
spectral leakage, 189, 205, 211, 224
speed of sound in different media, 33
standing wave, 23, 40, 235
    in wind instruments, 84
stretch theorem, 188



time-invariance, 177, 182, 214, 229
transfer function, 214, 229
tremolo, 98
up-sampling, 188
vibrato, 98
voltage, 15, 203, 214, 231
windowing function, 221
    Hanning, 225
    rectangular window, 224
    sine and cosine, 226
    triangle (Bartlett) window, 224
Worman, Walter, 86
zero-padding, 189, 215
