
Numbers & notes:
An introduction to musical signal processing

Regina Collecchia


About this book

Digital analysis of music is typically difficult, with many variables and much mathematics in play. Numbers and notes: An introduction to musical signal processing seeks to illuminate in an accessible way the concepts behind audio compression, information retrieval, and acoustic design. At the core of such techniques lives the discrete Fourier transform (DFT), an historical construct that analyzes the frequencies contained in audio signals. The fast algorithm for the DFT—namely the celebrated FFT—is a special focus of the book. Given herein are actual code examples in C, MATLAB, and Mathematica.

About the author

Regina Collecchia has a B.A. degree (2009) from Reed College, where her interests focused on mathematics and music. Regina currently works at the University of Louisville's Heuser Hearing Research Laboratory in Louisville, Kentucky, where she maintains a strong research interest in digital methods as they apply to music.

"This book brings it all together for me. I wish it had been written when I first began studying digital audio and signal processing. Anyone new to these fields should make this the first book in a personal library."

—Evan Brooks, co-founder of Pro Tools and Digidesign

ISBN 1-935-63815-5



Numbers & notes:
An introduction to musical signal processing

Regina Collecchia


Perfectly Scientific Press
3754 SE Knight St.
Portland, OR 97202

Copyright © 2012 by Perfectly Scientific Press.

All Rights Reserved. No part of this book may be reproduced, used, scanned, or distributed in any printed or electronic form or in any manner without written permission, except for brief quotations embodied in critical articles and reviews.

First Perfectly Scientific Press paperback edition: February 2012.

Perfectly Scientific Press paperback ISBN: 978-1-935638-15-5.

Cover design by Julia Canright.

Cover image: Piano Orchestrion at the Musée Mécanique in San Francisco, California, a museum containing old mechanical arcade games and player pianos. The Piano Orchestrion has a spinning wheel of bumps that correspond to pitches on the piano, so the bumps notate the musical score. Its hammers are triggered when they encounter bumps, creating a binary system much like digital music. Photograph by Regina Collecchia.

Visit our website at www.perfscipress.com.

Printed in the United States of America.

9 8 7 6 5 4 3 2 1 0


Preface

A digital audio file like an MP3 or WAV file is a numerical representation of a song's frequency, timing, and loudness information. To fully understand its behavior, the physical, psychophysical, and musical properties underlying these components must be realized. Therefore, the best way to communicate about them for computational purposes is by using a mathematical tongue. Taking it one step further, we can make use of computers and algorithms to extract and analyze musical information.

The algorithm that is most frequently used to retrieve data from music is the fast Fourier transform (FFT), an expedited implementation of the discrete Fourier transform (DFT). A DFT and an FFT have identical output: the frequency spectrum of a discrete (digital) signal. A frequency spectrum tells us the relative loudness of frequencies throughout the sound file, similar to how the file itself tells us the amplitude at any instant of time. The transform converts a time-based domain into a frequency-based domain.

Fourier transforms are used not only to display frequency information, but also in the compression of digital audio, filter design, convolution, composition, and many other (digital) signal processing methods. It may be that, for your purposes, you can get by without fully understanding the physical meaning of the FFT; but if you require its explanation in explicit detail, particularly to achieve musical ends, this book is for you. The Fourier transform effectively detects periodic waveforms within signals, but how, and why?

The first chapter covers some of the basic mathematics required to understand all of the equations in this book, including logarithms and trigonometry. The next two chapters examine sound from different angles, the first from a physical perspective and the second from a musical one. The fourth chapter explores how these perspectives manifest in musical instruments and scales. In the fifth chapter, a quick overview of psychoacoustics is given, including a discussion of musical synesthesia and perfect pitch. Chapter 6 explores digital audio and begins to frame some of the parameters of the discrete Fourier transform and digital audio signal processing. The seventh chapter breaks down all components of the DFT and its inverse, finishing with several examples. The FFT and other variants of the Fourier transform are addressed in the final chapter.

The appendices explore traditional, analog signal processing with an overview of frequency-selective filter design, the Laplace and Z transforms, and explicit Mathematica, Matlab, and C code examples of the FFT. These topics are somewhat peripheral to the preceding chapters and are mere introductions to much broader fields. The concept of filtering (keeping certain frequencies in a signal and discarding others) is alluded to many times in the text, so Appendix A is there for the curious.

Only a high school level of calculus is required to evaluate all of the equations given here, with only a few instances of integrals and derivatives. What is required is experience listening to music and great curiosity about its nature. Music information retrieval, sound design, and compositional applications are just three possible directions this book can lead to. Beyond that is up to you.

Acknowledgements

This book would have been much, much different if not for the brutal honesty of my good friend Timothy Eshing. Timothy is an accomplished musician and the ideal reader of this book, so I was able to tailor much of its purpose and presentation to him.

Many thanks to the Zahorik Auditory Perception Laboratory at the Heuser Hearing Research Center for providing an environment for my research and an excellent resource for all things psychoacoustics.

Special thanks to Sarah Powers for many great images, Jeffrey Jackson for help with the C code, and Evan Brooks for elaborate edits.

Finally, thanks to my loving friends and family, especially Nada Zakaria, Meghan Mott, Zachary Thomas, Kate Eldridge, Susan Callander, Mom, Tony, and Dad.


Contents

Preface i
Contents iv

1 Review of mathematical notation and functions 1
1.1 Numbers 1
1.2 Functions 2
1.3 Calculus 6
1.4 Notation 7

2 Physical sound 9
2.1 What is sound? 9
2.2 Simple harmonic motion 13
2.3 Complex harmonic motion 19
2.4 Harmony, periodicity, and perfect intervals 22
2.5 Properties of waves 27
2.6 Chapter summary 46

3 Musical sound 49
3.1 Rhythm 49
3.2 Pitch 51
3.3 Tuning and temperament 52
3.4 Timbre 62
3.5 Chapter summary 66

4 Musical instruments 67
4.1 The piano 67
4.2 The viol family 71
4.3 Woodwinds and brasses 77
4.4 Drums 89
4.5 Electric guitars and effects units 93
4.6 Chapter summary 104

5 Auditory perception 109
5.1 Physiology of the ear 109
5.2 Psychoacoustics 116
5.3 Perfect pitch 129
5.4 Chapter summary 133

6 Digital audio basics 137
6.1 Sampling 138
6.2 Compression 150
6.3 Chapter summary 158

7 The discrete Fourier transform 161
7.1 The Fourier series 161
7.2 Euler's formula 167
7.3 The discrete Fourier transform 172
7.4 The DFT, simplified 190
7.5 Examples 198
7.6 Chapter summary 210

8 Other Fourier transforms 213
8.1 Discrete-time Fourier transform (DTFT) 214
8.2 Fast Fourier transform (FFT) 216
8.3 Short-time Fourier transform (STFT) 221
8.4 Chapter summary 227

A Frequency-selective circuits 229
A.1 Ohm's Law 231
A.2 Filtering 235
A.3 The Z-transform 239
A.4 Chapter summary 249

B Using computers to do Fourier transforms 251
B.1 Matlab 251
B.2 Mathematica 256
B.3 C 260

References 273
Glossary 283
Index 309


1. Review of mathematical notation and functions

The author's background is in mathematics, but yours doesn't need to be to work through this book. However, to get the most out of Numbers & notes, it is important to know some mathematical definitions. Most of these do not extend beyond high school algebra, though a knowledge of trigonometry and basic calculus will certainly help. This chapter will serve as a brief refresher course for these topics and can be skipped if the reader feels well versed in mathematical syntax and functions.

1.1 Numbers

The real numbers, denoted by the set R, are the numbers that do not have an imaginary component, i.e., a real number does not contain the quantity i, equal to √−1. They include the rational numbers, p/q, where p and q are integers like 1, 2, −3, etc. (with q nonzero), and the irrational numbers, like √5. The real numbers form a continuum because they can be infinitesimally small and there are an infinite number of them between any two numbers.

The complex numbers (the set C) also form a continuum. We describe a complex number c by the quantity

c = a + bi,

where a and b are real numbers. For example, 0.289576, 5, and −39.01 are all both real and complex numbers, while 0.289576 + 5i is only complex. So, the real numbers are a subset of the complex numbers, i.e., R ⊂ C.

The inverse of a number is the quantity that transforms that number into an identity value. The additive identity is 0 and the multiplicative identity is 1, so the additive inverse of 2 is −2 because 2 + (−2) = 0, and the multiplicative inverse of 2 is 1/2 because 2 · 1/2 = 1. Likewise, a function can have an inverse (in which case we call it invertible), and this is almost always the multiplicative inverse. The inverse of e^x is e^(−x) because e^x · e^(−x) = e^(x−x) = e^0 = 1. Note that the value of any quantity raised to the zeroth power is equal to 1—even 0^0, by the usual convention.

1.2 Functions

A function in mathematics accepts an input of a certain type (real or complex) and produces an output that is explicitly given by a mathematical expression that it equals. A function f with argument x is written f(x). The argument constitutes the domain of a function and f(x) constitutes the range. The function maps x to a unique, corresponding value, the point (x, f(x)).

Loosely, when an infinitesimally small change in x produces an infinitesimally small change in f(x) for all x, we say that the function is continuous. A continuous function is smooth and has no jumps or missing holes in its graph. The function f(x) = x is continuous when x ranges over all of the real numbers, for example. On the other hand, a discrete function does have jumps and gaps. Discrete functions are characterized by individual points and cases that determine where they exist. The function

f(x) = 1 if x = 5, −2 if x = 9, and 0 otherwise

is an example of a discrete function. A function is only continuous if its input is continuous.
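As a small added illustration (not code from the book), the discrete function above can be written in MATLAB as a one-line function handle and evaluated at individual points:

    % The discrete function f(x): 1 at x = 5, -2 at x = 9, and 0 elsewhere.
    f = @(x) 1*(x == 5) - 2*(x == 9);   % logical tests pick out the two special points

    f(5)      % returns 1
    f(9)      % returns -2
    f(7.3)    % returns 0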

Exponential functions (like f(x) = e^x), logarithmic functions (like f(x) = log(2x)), and trigonometric functions (like f(t) = sin(πt)) will be used frequently in this book, so we will examine some fundamentals of their behavior in this section.



Logarithms and exponents

The logarithm with base a of a value b returns the exponent x such that a^x = b, i.e., if

log_a(b) = x,

then

a^x = b,

and vice versa. For example, log_10(100) is equal to 2 because 10^2 = 10 · 10 = 100.

Three common bases for the logarithm function are 10, 2, and e. Most calculators are preprogrammed with 10 as the logarithmic base because this is the base of our counting system. In discussions of computational complexity, base 2 (binary) is the standard. When the base is the constant e, equal to approximately 2.71828183..., the logarithm of x can also be written ln(x). This is called the natural logarithm. It may be a bit surprising that log_10(x) wouldn't be considered the "natural" logarithm, given the way we write real-world quantities and regard our fingers as digits, but the number 10 is not as mathematically significant as e. The significance of the natural logarithm is supported by the property that both the derivative and the antiderivative (integral) of the exponential function e^x are themselves e^x. This means that both the slope of e^x at the point x = a and the area underneath the curve from negative infinity to a are exactly equal to e^a. See Figure 1.1.

A conceptual understanding of logarithmic functions is helpful to many aspects of the science of music. Pitch, loudness, and even the ear itself are all logarithmic in nature.
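For readers who want to check these relationships numerically, the following MATLAB lines (an added illustration, not part of the original text) evaluate logarithms in the three common bases. Note that in MATLAB, log is the natural logarithm, while log10 and log2 handle bases 10 and 2.

    % Logarithms undo exponentiation: log_a(b) = x exactly when a^x = b.
    log10(100)        % returns 2, because 10^2 = 100
    log2(8)           % returns 3, because 2^3 = 8
    log(exp(5))       % returns 5: log() is the natural logarithm, base e
    exp(1)            % the constant e, approximately 2.71828183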

Trigonometry

There are three important functions in trigonometry: sine, cosine, and tangent. Each of these functions treats its argument as an angle. This angle is compared to the unit circle, which has a radius of 1 and is centered at the origin (see Figure 1.3).



Figure 1.1: The base of the natural logarithm, e, is a constant equal to approximately 2.71828183... When e is raised to a variable x, the resulting function has the property that its derivative with respect to x, d[e^x]/dx, is equal to e^x. The derivative is defined as the slope of the curve at any point x. Likewise, its antiderivative or integral, ∫ e^x dx, is equal to e^x plus an undefined constant. The integral is defined as the area under the curve of the function. If we wanted to know the area under the curve between x = a and x = b, where a < b, we would evaluate the definite integral of e^x from a to b.



Figure 1.2: The fundamental trigonometric function is the sine function, written sin(x) where x is a varying angle. The other functions can be written as functions of sine. Cosine is the sine function shifted by 90°, i.e., cos(x) = sin(x + π/2), and tangent is the quotient of sine and cosine, written tan(x) = sin(x)/cos(x). The sine and cosine functions have a finite range of values, periodically falling in the interval [−1, 1], while tan(x) ranges between −∞ and ∞ with discontinuities twice per period, indicated by the vertical lines in the third graph.

For a right triangle with angle θ, as in Figure 1.3, sine is the ratio of the length of the opposite side to the length of the hypotenuse, and cosine is the ratio of the length of the adjacent side to the length of the hypotenuse. The tangent of an angle is given by the ratio sin(θ)/cos(θ), which is the ratio of the length of the opposite side to the length of the adjacent side. A nice mnemonic device arises here: SOH-CAH-TOA, wherein sin(θ) = opposite/hypotenuse, cos(θ) = adjacent/hypotenuse, and tan(θ) = opposite/adjacent.

The trigonometric functions sin(x), cos(x), and tan(x) treat x as an angle (like θ in Figure 1.3), but what is the unit of x? The variable x can be expressed in radians or degrees, and these functions move counterclockwise continuously along the unit circle, recording at any given point the height, the width, or the ratio of the height to the width.



Figure 1.3: The unit circle is defined as the circle centered about the origin (0, 0) with a radius of 1. An angle θ describes the angle between the right side of the horizontal axis and the hypotenuse of the right triangle. The hypotenuse is also a vector, and its coordinates are (cos(θ), sin(θ)), i.e., the width and height of the right triangle.

In mathematics, the arguments of trigonometric functions are virtually always in radians because of the association with the unit circle and its circumference of 2π. We will look at many graphs of sinusoidal functions later in the text. One important trigonometric identity to note is that cos(x) = sin(x + 90°) = sin(x + π/2), so cosine is the same as the sine function phase shifted by 90° (equivalent to π/2 radians). Another very common identity is cos²(x) + sin²(x) = 1, which implies that √(cos²(x) + sin²(x)) = 1.
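A quick numerical check of these identities, added here as an illustration, can be done in MATLAB, which, like most mathematics software, expects trigonometric arguments in radians.

    % Verify two basic identities at a handful of angles (in radians).
    x = [0, pi/6, pi/4, 1.3, 2.7];

    cos(x) - sin(x + pi/2)        % essentially zero: cos(x) = sin(x + pi/2)
    cos(x).^2 + sin(x).^2         % every entry is 1: the Pythagorean identity
    90 * pi/180                   % converting 90 degrees to radians gives pi/2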

1.3 Calculus

We examined the graphs of the derivative and antiderivative of e^x, but to recapitulate, a derivative is the rate of change of a function, and an antiderivative, or integral, is the area underneath the curve of a function.



You will only need calculus to compute continuous Fourier transforms, which are necessary when the system is continuous, i.e., when you are analyzing an electrical, analog system. In discrete systems, we use sums instead of integrals. Even in advanced applications of the Fourier transform, calculus is rarely needed, but linear algebra and numerical analysis methods typically are.

1.4 Notation

The magnitude and absolute value will be considered many times in this book. The terms are identical: both are real and nonnegative in value. We denote magnitude with double square brackets or with vertical bars, and the absolute value with vertical bars. The magnitude and absolute value of the complex quantity (a + bi) is written

|a + bi| = [[a + bi]] = √(a² + b²).

In the present text, I will use the notation |a + bi| to denote both the magnitude and the absolute value of complex arguments.
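In MATLAB, this magnitude is exactly what the built-in abs returns for a complex argument; the short check below is an added illustration of the definition rather than code from the book.

    % Magnitude (absolute value) of a complex number: |a + bi| = sqrt(a^2 + b^2).
    a = 3;  b = -4;
    c = a + b*1i;                 % the complex number 3 - 4i

    abs(c)                        % returns 5
    sqrt(a^2 + b^2)               % also 5, matching the definition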

In this book, functions written with capital letters denote Fourier transforms and other frequency-domain functions, like X(f), and functions written with lowercase letters denote time-domain signals, like x(t).

The Greek alphabet

It is nice to be able to sound out mathematical equations as you read along, so listed below are the letters from the Greek alphabet used in this book, alongside their English spellings and their functions in Numbers & notes.



Gk. letter   Pronunciation   Common scientific usage
β            beta            bandwidth (frequency)
γ            gamma           heat capacity ratio
δ            delta           the delta function
θ            theta           angle
λ            lambda          wavelength (m)
µ            mu              mass per unit length (kg/m)
π            pi              the constant 3.14159265...
ρ            rho             density (kg/m³)
Σ            Sigma (cap.)    the sum of a sequence; a series
τ            tau             time (s)
φ            phi             angle
ω            omega           angular frequency, 2πf

Other definitions

For definitions of many words used in Numbers & notes, please refer to the glossary at the end. If a word in the text is italicized, you will likely find it there.


2. Physical sound

2.1 What is sound?

Sound is the human ear's perceived effect of pressure changes in the ambient air. Sound can be modeled as a function of time.

Figure 2.1: A 0.56-second audio clip of an accordion playing C4 (middle C, 261.6 Hz).

When we hear music, we can evaluate its features almost immediately. We can recognize the instrumentation, modality, artist, genre, and perhaps the time and place it was recorded. Graphically, it is difficult to connect this image to what we actually hear: the above graph looks complicated, while the experience of this sound (the audio signal) is a single, sustained pitch on an accordion. But when we take a Fourier transform of this clip, we can actually view the frequencies present in a song.

Because pitch and timbre are made up exclusively of the change over time of frequencies and amplitudes, and they tell us so much information about musical features, the Fourier transform is an incredibly useful tool that translates time-domain signals like music onto an axis of frequencies, i.e., a frequency domain. The graph in Figure 2.2 lets our eyes verify what our ears already know.



Figure 2.2: The spectrum and listed frequencies attained by the discrete Fourier transform of the clip shown in Figure 2.1. Note the locations of the peaks with respect to frequency.

The graph describes the relative strength of the frequencies present in a signal. In this example, we see the frequency characteristics of an accordion playing C.

The frequencies themselves are not as important as the general shape of the spikes and the distance between them; most of us cannot distinguish between A and C in isolation, but we have a relatively easy time identifying the difference between a piano and a violin. This is because of the texture of the instrument's sound, called the timbre or tone color. When the frequencies are more or less equally spaced from one another, we say that the timbre is harmonic, or that we have a harmonic overtone series. Explicitly, the Fourier transform of the signal in Figure 2.1 and its graphical representation in Figure 2.2 tell us the signal contains the frequencies 263.2 Hz (C), 528.2 Hz (C), 787.9 Hz (G), 1051 Hz (C), and 1313 Hz (E). The height of each frequency's peak in the graph indicates its loudness; here, the peaks are decreasing in power.
graph indicates their loudness; hence, they are decreasing in power.



Middle C is 261.6 Hz, so apparently this accordion is slightly out of tune—but furthermore, its timbre is not perfectly harmonic: the spacing between its overtones should be 263.2 Hz, yet 528.2 − 263.2 = 265 and 787.9 − 528.2 = 259.7. There are several possible reasons why these spikes are not exactly equally spaced. Most likely, it is due to the imperfect physical proportions and construction of the instrument's metal reeds, but it could also be error introduced in the recording process or experimental error.
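As a minimal sketch of how such a spectrum can be computed, the MATLAB fragment below reads an audio clip, applies the built-in fft, and plots the magnitude against frequency in hertz. The file name accordion_C4.wav is only a placeholder for a recording like the one in Figure 2.1, and the fragment is an illustration rather than the code used to produce Figure 2.2.

    % Read a short audio clip (placeholder file name) and plot its spectrum.
    [x, fs] = audioread('accordion_C4.wav');   % x: samples, fs: sample rate in Hz
    x = x(:, 1);                               % keep one channel if the file is stereo

    N = length(x);
    X = fft(x);                                % discrete Fourier transform via the FFT
    f = (0:N-1) * fs / N;                      % frequency axis in Hz, one bin per sample

    half = 1:floor(N/2);                       % the spectrum is symmetric; keep 0 to fs/2
    plot(f(half), abs(X(half)));
    xlabel('Frequency (Hz)');
    ylabel('Magnitude');

For a clip like this one, peaks in the resulting plot would fall near the frequencies listed above, and their spacing reveals how nearly harmonic the timbre is.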

To interpret how exactly this translates to what our ears hear, we must take into account how certain frequencies are perceived by the brain. Young, healthy human ears can detect frequencies within a range of 20–20,000 Hz, where 20 Hz and 20,000 Hz are threshold and limit values, but our ears are not uniformly sensitive to these frequencies [1]. Within the range of 1000 to 5000 Hz, our ears are especially sensitive, meaning that sounds with frequencies within this range do not have to be as loud for our ears to detect them.

Mathematically, the Fourier transform constructs an orthonormal basis that takes a complicated sound wave and reduces it to its component waves, which are all simple sine and cosine waves, or sinusoids.¹ It shows us every frequency, and its amplitude, that is present in a complex sound over an interval of time. The connection between the graph of the transform and its mathematical properties is a giant step towards realizing the Fourier transform and its digital applications.

Because sight and sound deliver enormous amounts of information, we have to make decisions about what is important and what we can take for granted. Our brains are so excellent at processing information that we can give certain sensations finer resolution (like an important message from our friend), and others none at all (like the hum of the refrigerator).

¹ "Orthonormal" means orthogonal and normal. For a function to be orthogonal to another function, the two functions must be linearly independent. The condition of normality is satisfied when each function involved has, in some appropriate sense, energy 1. Finally, a basis is a set of functions such that an arbitrary function (within reason) may be written in terms of the basis. See Chapter 7.



Figure 2.3: We detect frequencies between about 20 and 20,000 Hz as pitched sound. Furthermore, each of these frequencies has a minimum threshold of loudness. This graph, the Fletcher–Munson curve, shows the minimal sound pressure level in decibels (dB) required for the frequency to be heard.

Consider seeing a relatively involved movie for the second or third time and noticing things you didn't notice before that now make sense. We seem to prefer movies like these. We may achieve a decent understanding of the plot on the first viewing because we extract salient parts of the dialogue and action and put them in order, but a complex plot can hide clues to outcomes and their rationale all over the film, clues that are more obvious when our brains can support them with familiar elements.

A complex piece of music can be a lot like a complex movie. We perceive both sound and light as signals. When a signal demands our attention, it is said to have a high amount of information. A signal with meaningless content that we don't need or want to listen to is called noise. The sound produced by white noise machines, for example, is random, unpitched, and trivial. It does not contain a message because it is formally disorganized, and it can even help some people sleep because of its uniform randomness. Noise is composed of so many periodic waves that we consider it aperiodic. We cannot extract individual frequencies from noise, as we can from a melody or a major chord in music. A signal can be half meaningful and half noise, and our brains are powerful enough to recognize the difference and attempt to separate the two.

Although sine waves are not fun to think about, they substantiate much of the mathematics and physics behind music. The mathematical and physical equations which produced the previous graphs form a basis for the sensation of sound. Many musical concepts are results of mathematical relationships. First, let us examine the basic mathematical structure of sound. Musical form will be addressed in Chapters 3 and 4.

2.2 Simple harmonic motion

Like light, sound is traveling energy, and we can model such energy mathematically with waves. The simplest wave is a sinusoid, a trigonometric function such as sin(ωt) or cos(ωt), where t denotes time and ω specifies how often the wave repeats itself—its angular frequency.² A sinusoidal wave represents the simple harmonic motion of an object because its frequency and extreme magnitudes do not change over time. Both a spring and a tuning fork exhibit simple harmonic motion.

Below, we see two states of a vibrating tuning fork called modes of vibration. Both of these modes produce sound that is near in tone to a sine wave (or pure tone).

² The frequency f in Hz (cycles per second) is related to ω by ω = 2πf.



Figure 2.4: A tuning fork and a weighted spring oscillate in simple harmonic motion.

As you might have experienced, though, the tone is more metallic and glassy than the electronic sound of a sine wave. When we strike the tuning fork, we experience the attack of the sound; then the sound sustains (decaying with time due to frictional forces) and eventually releases, leaving no sound. In an ideal physical world, i.e., one without the external forces of gravity, friction, and other resistive forces, a spring set into motion could oscillate forever at a uniform amplitude, as could a tuning fork. But that is not what happens in reality. The closest we can get to simple harmonic motion is represented by the curve in Figure 2.5.

Figure 2.5: A musical wave in reality begins at zero energy, climbs to a maximal energy, and fades to zero energy.



You can see that the amplitude of this wave varies, but the points at which it crosses the horizontal axis, i.e., when its amplitude equals 0, are evenly spaced over time. This means the frequency does not vary, but the intensity of its motion does. True simple harmonic motion can be generated by an oscillator, a computer, or a tuning fork with a driving motor attached to it [2].

Amplitude represents pressure as well as voltage: An audio function models the pressure in the air corresponding to the sound wave as a function of time, and when the signal is electrified, the amplitude represents the (relative) voltage. In acoustics as well as electrical engineering, we call this function a signal, and the amplitude tells us most of the information we need to determine how loud our ears will perceive it to be. Because air is elastic, when a sound wave travels in air, it excites the air molecules and varies the pressure. The amplitude of the graph of a sine wave describes this behavior (see Figure 2.6).

There are three fundamental aspects of a sinusoid of the form A sin(ωt + φ): its magnitude³ A, its frequency ω, and its phase φ. We have already considered amplitude: it is the pressure. A wave at amplitude 0 means that the system is at normal atmospheric pressure—the pressure of the environment to which our ears have adjusted; no extra pressure is affecting the eardrum at that instant. Frequency can be determined by the number of times per second that the signal has zero pressure, or the rate at which the signal crosses the graph's horizontal axis. Finally, we can determine the phase φ at any point t from the time of the next zero crossing, i.e., where the amplitude crosses the horizontal axis.

³ Unfortunately, there are quite a few terms that will be used somewhat interchangeably to mean magnitude: amplitude, height, displacement, energy, power, voltage, pressure, strength, and loudness. Loudness is a perceptual word, and since we do not perceive all frequencies as equal (or at all, as in Figure 2.3), this word will be used with caution. Voltage, power, and energy are typically encountered in electrical engineering texts to mean amplitude, though they absolutely do not have equivalent meanings (see Appendix A). Strength and displacement are words used here to denote the magnitude of a wave, i.e., the vertical distance from 0.



Figure 2.6: The compressions and rarefactions in air resulting from sound waves, shown two ways. The maximal points of the sine wave graph correspond to the most compressed areas of the particle graph, represented by the most densely spaced dots. The minimal points correspond to rarefactions, represented by the least dense spacings of dots. Where the amplitude of the sine wave is 0 represents normal atmospheric pressure, where the density of the dots is average.

For ease of computation, we only allow amplitude to vary between −1 and 1, so the average value of the amplitude of a simple sine wave is always zero.⁴ It may seem strange that pressure can take on negative values, but it simply means that the sound's pressure is dipping below normal atmospheric pressure. For purposes of standardization, this is defined as the pressure of air at sea level, 101,325 pascals (Pa); but in reality, it is the average atmospheric pressure of our present environment, to which our ears have adjusted. Hence, amplitudes higher than 0 imply that the pressure induced by a sound wave is greater than normal pressure (compression), and amplitudes below 0 imply that the pressure of a sound wave is less than normal pressure (rarefaction). Our ears detect sound by change over time in pressure, so a single, isolated amplitude tells us nothing about what we actually hear.

⁴ In electronic reality, sound signals can, however, have a nonzero average value due to things like DC offset, uncalibrated equipment, or postproduction changes.

Angular frequency is given by ω in radians per second (rad/s), and it is equal to 2πf, where f is ordinary frequency, given in hertz (Hz). Frequency f is inversely proportional to the time that the sine wave takes to complete one period T, as given by the following formula:

f = 1/T.

Therefore, ω = 2π/T.

Phase tells us where the wave is along the course of a single period, taking on angles between 0° and just less than 360°. We are especially interested in phase when we have two waves of identical frequency. Now let us examine the nature of a simple sinusoid where ω = 2π radians/second (so f = 1 Hz): x(t) = sin(2πt).

Figure 2.7: A simple sinusoid, x(t) = sin(2πt), with the phase φ marked.

Take note of the circular diagrams beneath the graph in Figure 2.7. These circles show different positions along the unit circle. The starting position, where φ = 0, is the rightmost point on this circle, situated at its intersection with the horizontal axis. When we move counterclockwise along the circumference of this circle, we increase the angle relative to this position. When we return to this position, we have moved 360°. In radians, 360° is equal to 2π. We can translate any angle in degrees to radians by multiplying the number of degrees by π/180; e.g., for a right angle φ = 90°, the equivalent angle in radians is calculated to be 90° · π/180 = π/2 radians.

Note that φ = 0 is positioned on the circle exactly where φ = 2π and φ = 4π are. This is true of any even-integer multiple of π (4π, 6π, 8π, ...). Similarly, the trigonometric function of any variable (such as a frequency ω) is the same as that of the variable phase shifted by an integer multiple of 2π, i.e.,

cos(ω) = cos(ω + 2πk), and
sin(ω) = sin(ω + 2πk), for k = 0, 1, 2, ...

Again, the height of a point along the unit circle at angle φ is given by sin(φ), and the width is modeled by cos(φ). The phase at the initial time, t = 0, is the angle by which the sinusoid is shifted relative to a sine wave with no phase offset. We represent phase with the Greek letter φ, so that a simple sinusoid is formally written

x(t) = A sin(2πft + φ) = A sin(ωt + φ).

We multiply frequency by the quantity 2π because this strengthens the connection between the unit circle and frequency. The angular frequency is commonly used in the Fourier transform and in physical science. However, in music, we connect pitch to frequency in hertz (like A440), so we will use 2πf instead of ω when considering sound musically.
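As a small illustration of these parameters (added here, with arbitrary values rather than ones from the text), the MATLAB sketch below builds a sampled version of x(t) = A sin(2πft + φ):

    % Generate a short sampled sinusoid x(t) = A*sin(2*pi*f*t + phi).
    fs  = 44100;               % sample rate in samples per second
    t   = 0:1/fs:0.01;         % 10 ms of time points
    A   = 0.8;                 % amplitude (kept between -1 and 1)
    f   = 440;                 % ordinary frequency in Hz, so the period T = 1/f
    phi = pi/2;                % phase offset in radians (pi/2 turns the sine into a cosine)

    x = A * sin(2*pi*f*t + phi);
    plot(t, x);
    xlabel('Time (s)');
    ylabel('Amplitude');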

It is impossible to identify the phase of a single sine wave without a reference point. However, it is important for a signal to begin and end at amplitude 0 in order to understand its behavior, because sounds beginning or ending at a nonzero amplitude surprise our ears to the point that the frequency undergoes distortion. Consider dropping a needle on a record and hearing a fuzzy click. We hear a burst of sound when there is any discontinuity in pressure. This is reflected not only on our basilar membrane, but in the Fourier transform, and it is one type of clipping.

2.3 Complex harmonic motion

Now let us examine more complex waves. In reality, virtually every sound is a complex wave. We really only encounter simple waves during hearing tests or in electronic music. In fact, listening to a sine wave for an extended period of time can cause headaches, extreme emotional responses, and hearing damage [3].

Figure 2.8: A clip of an audio signal.

The horizontal axis of Figure 2.8 is once again time and the vertical axis is amplitude or pressure. Clearly, this is complicated: it is close to impossible for our brains to detect any sort of pattern in this waveform because there is no clear repetition. Furthermore, there are no distinct frequencies we can pick out because there is no obvious repetition in the sound wave. However, this wave can be decomposed completely into sine waves solely by a Fourier transform. It may require hundreds, even infinitely many of them, but it can be done. Let us look at a simpler—but still complex—wave to illustrate the combination of simple sinusoids, as in Figure 2.9.

Figure 2.9: The combination of two simple sine waves, x_1(t) = sin(2πt) and x_2(t) = sin(4πt).

Our ears can identify a pattern because this wave is periodic. The wave is made up of two different sine waves, and they are harmonic relatives of each other: one is twice the frequency of the other. This wave repeats identically after every second. The first sinusoid x_1 has a frequency of 1 Hz (2π rad/s), and the second sinusoid x_2 has a frequency of 2 Hz (4π rad/s). These are called frequency components ω_k, where ω_1 = 2π and ω_2 = 4π.

We couldn't actually hear these frequencies as pitch in reality because they oscillate too slowly: our ears only translate frequencies above about 20 Hz into pitched sound. While using such low frequencies is preferable for ease of visualization and computation, any pair of sine waves whose frequencies have a 2:1 ratio is defined as having the interval of an octave. Say that the scale of the horizontal axis in Figure 2.9 were in milliseconds instead of seconds. Then these sine waves would have the audible frequencies of 1000 and 2000 Hz. Visually, we can see that these waves intersect half of the time that they cross the horizontal axis, and the ratio between these waves' frequencies (2:1) emphasizes the inversely proportional mathematical relationship between frequency and time (f = 1/T).
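The sum in Figure 2.9 is easy to reproduce numerically. The sketch below, a minimal added example rather than the book's own code, samples x_1(t) = sin(2πt) and x_2(t) = sin(4πt) over two seconds and adds them; the resulting wave repeats once per second, as described above.

    % Superpose two harmonically related sine waves (an octave apart).
    fs = 1000;                 % samples per second (plenty for 1 Hz and 2 Hz)
    t  = 0:1/fs:2;             % two seconds of time
    x1 = sin(2*pi*1*t);        % 1 Hz component (omega_1 = 2*pi rad/s)
    x2 = sin(2*pi*2*t);        % 2 Hz component (omega_2 = 4*pi rad/s)
    x  = x1 + x2;              % principle of superposition: the complex wave

    plot(t, x);                % the sum repeats identically every 1 s
    xlabel('Time (s)');
    ylabel('Amplitude');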

Figure 2.10: The combination of one sine wave of frequency f_1 with another sine wave of frequency f_2 = 2f_1 produces the interval of an octave. Note the periodic nature of the resultant wave: it repeats itself identically four times, just like the first wave.



The graph of the signal given in Figure 2.10 can be determined from the graphs of these two sine waves, but picking out even a small handful of simple sinusoids from a complex wave is not a task we want to leave to our senses. We need more advanced computational tools to analyze the frequencies contained in a signal that looks like a random string of numbers between −1 and 1. That is what is so very exciting about the power of the Fourier transform.

The physical law at work here is called the principle of superposition.

The principle of superposition: Every wave can be represented as a sum of simple sinusoids.

Note that this says nothing about whether the sum has finitely or infinitely many terms. Square and triangle waves, for example, have jagged corners that cannot be represented by a finite number of sinusoids. The principle of superposition is critical to understanding the concepts in the remainder of this book.

2.4 Harmony, periodicity, and perfect intervals

When two waves have frequencies that are related to each other by a small-number integer ratio like 2:1, we say that they are harmonic, that they have a harmonic relationship, or that they are harmonics of one another. The above example of f_1 = 1 Hz and f_2 = 2 Hz, i.e., f_2 = 2f_1, forms the interval of an octave. Likewise, the octave above any frequency is double that frequency, so we can calculate the frequency f_k that is k-many octaves (the kth octave) above a given frequency f_0 by the equation

f_k = 2^k · f_0.

Letting k = 0 returns f_0, so the frequency 0 octaves above a given frequency is the original or fundamental frequency. This is called unison, or perfect unison. A perfect interval is characterized by a small-number ratio between the two frequencies, restricted in Western music to 1:1 (P1, perfect unison), 2:1 (P8, perfect octave), 3:2 (P5, perfect fifth), and 4:3 (P4, perfect fourth) [4].
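As a quick numerical illustration (not taken from the book), the lines below use f_k = 2^k · f_0 and the perfect-interval ratios to compute a few frequencies above A440:

    % Octaves above a fundamental: f_k = 2^k * f0.
    f0 = 440;                          % A440 as the fundamental frequency in Hz
    k  = 0:3;
    octaves = 2.^k * f0;               % 440, 880, 1760, 3520 Hz

    % Perfect intervals above the same fundamental, by ratio.
    perfect_fifth  = 3/2 * f0;         % 660 Hz
    perfect_fourth = 4/3 * f0;         % about 586.7 Hz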

Perfect unison is trivially the smallest integer ratio. Two frequencies separated by an octave are in the ratio 2:1, the second-smallest integer ratio. The perfect fifth has a ratio very close to 3:2 in equal temperament, and it is exactly 3:2 in just intonation and Pythagorean tuning. A perfect fourth has a 4:3 ratio: it is the inversion of the perfect fifth, sounded by moving up an octave and down a perfect fifth. In fact, as we will see in the later discussion of musical temperament and tuning systems, every note of the 12-tone Pythagorean scale can be attained by moving in perfect fifths, but these intervals are related to f_0 by increasingly larger integer ratios.

Moreover, the smaller the integer ratio between two frequencies (and thereby between their two periods), the more pleasant or consonant we find their interval. In the introduction to this chapter, I mentioned the harmonic overtone series and its relationship to timbre in music: musical instruments are constructed to have tones containing integer-related frequencies. To back up a little bit: when we hear A at 440 Hz on a piano, we do not just hear the frequency 440 Hz. If this were the case, it would sound no different from an electronic beep caused by an oscillator or ideal tuning fork. When we hear A440 from a piano, we actually hear a whole spectrum of other frequencies resulting from the resonance of the piano and the nature of the fixed string. Musical, pitched instruments like the piano, with the exception of percussion instruments, generate overtone series that are very nearly harmonic, regardless of the pitch played. (The modes of vibration of circular membranes, by contrast, have Bessel-function ratios.)

We define nodes as the zero crossings of a wave (i.e., the points where x(t) is zero, crossing the horizontal axis) and antinodes as areas of maximal compression and rarefaction. A node exists on an instrument at a point or region that stays stationary while the rest of the instrument vibrates. Nodes and antinodes are used when we consider standing waves, which occur within all musical instruments and rooms. A standing wave is produced when a sound wave's forward velocity is the same as its backward velocity; hence it stands still with respect to position. For example, in a violin, waves move back and forth at the same velocity along a string fixed at both ends and therefore are only displaced up and down. Standing waves occur when the wavelengths of a given frequency are in integer proportion to the dimensions of a string, room, or column of air. They cause feedback in a recording studio because they do not die away as quickly as waves of other frequencies.

Helmholtz resonance can be witnessed when air is blown across a small opening of an otherwise closed cavity, like a bottle of water or the cracked window of a moving car. The frequency produced falls as the volume of the cavity grows and rises with the cross-sectional area of the opening, so the frequencies of larger cavities, like the interior of a car, are low. The formula for the Helmholtz resonance ω_H is

ω_H = √(γ A² P_0 / (m V_0)),

where γ is the adiabatic index of specific heats (1.4 for dry air), A is the area of the opening, P_0 is the initial pressure of the air inside of the cavity, m is the mass of air in the neck of the opening, and V_0 is the initial volume of air inside of the cavity. So, widening the crack of the car window will increase the angular frequency of the Helmholtz resonance, and reducing the amount of water in the bottle (hence increasing V_0) will reduce the characteristic Helmholtz resonant frequency ω_H.
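To make the formula concrete, here is a small MATLAB sketch that evaluates ω_H for a rough bottle-like geometry; the numbers are illustrative assumptions, not measurements from the text.

    % Helmholtz resonance: omega_H = sqrt(gamma * A^2 * P0 / (m * V0)).
    gamma = 1.4;                 % adiabatic index for dry air
    P0    = 101325;              % ambient pressure in Pa (sea level)
    rho   = 1.2;                 % approximate density of air in kg/m^3

    A  = 3e-4;                   % opening area in m^2 (roughly a 2 cm diameter neck)
    L  = 0.05;                   % neck length in m
    V0 = 1e-3;                   % cavity volume in m^3 (one liter)
    m  = rho * A * L;            % mass of the air plug in the neck, in kg

    omega_H = sqrt(gamma * A^2 * P0 / (m * V0));   % angular frequency in rad/s
    f_H     = omega_H / (2*pi)                     % ordinary frequency, roughly 130 Hz here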

Lightly placing one's finger at any of the nodes on a fixed string does noticeable things to its harmonics. Doing so at the halfway point on a guitar string, for example, causes the odd harmonics to drop out and the octaves above the fundamental to be very clear. Figure 2.11 depicts the first four modes of vibration of a fixed string, and the dots highlight their nodes.
highlight their nodes.



Figure 2.11: The first four modes of a fixed string. Because a string is secured at both ends, its overtone series is determined by its length: The string can only support wavelengths for which a whole number of half-wavelengths spans that length, so the frequencies it contains fall in the ratios f : 2f : 3f, and so on. A non-integer ratio would result in an impossible scenario: A string loose at one end.

By paying close attention to the tone of a musical instrument, you can pick out its individual overtones. If you have access to a piano, try this experiment. Find a way to depress every key on the piano except for the second-to-bottom A, using books or a friend's arms. Do this slowly so that the keys do not trigger sound. When the piano is silent, strike the A with considerable force and listen closely. You should be able to hear at least three higher notes ringing: the A an octave above, the E an octave and a fifth above, and the A two octaves above. These are the first three overtones of the fundamental frequency, 55 Hz, shown in the following table.



Frequency   Note name   Ratio to 55 Hz   Interval
55 Hz       A           1:1              Perfect Unison
110 Hz      A           2:1              Perfect Octave
165 Hz      E           3:1              Perfect Fifth
220 Hz      A           4:1              Perfect Octave
275 Hz      C♯          5:1              Major Third
330 Hz      E           6:1              Perfect Fifth
385 Hz      G           7:1              Minor Seventh
440 Hz      A           8:1              Perfect Octave

The intervals above are given in their simplest forms. As you can see, the interval between E3 at 165 Hz and A1 at 55 Hz actually spans a perfect octave plus a perfect fifth. Frequencies separated by octaves sound so similar that we call them all by the same note name, so in most cases this relaxed terminology of reducing intervals that span more than an octave is acceptable. 5

This series hypothetically continues forever <strong>to</strong> include the ratios<br />

9:1, 10:1, <strong>and</strong> so on. But the first few partials of any instrument have<br />

more energy than higher ones, so those are the ones we predominantly<br />

perceive—even though removing the higher ones would affect the<br />

perceived timbre.<br />

5 Particularly in the genre of jazz, the intervals of the ninth, eleventh, <strong>and</strong> thirteenth<br />

are used with some frequency. However, their sonority is similar <strong>to</strong> the interval minus<br />

an octave, i.e., a ninth has similar quality to a second, an eleventh to a fourth, and a

thirteenth <strong>to</strong> a sixth.



Harmonicity—<strong>and</strong> furthermore, the Western conceptualization of<br />

consonance—in music is manifested by simple mathematical relationships.<br />

We will say more about consonance <strong>and</strong> dissonance in the fourth<br />

chapter on audi<strong>to</strong>ry perception.<br />

2.5 Properties of waves<br />

When waves interact with other waves or with media like walls, water,<br />

<strong>and</strong> hot air, they exhibit <strong>to</strong> some degree the properties of reflection,<br />

refraction, interference, <strong>and</strong> damping. Some of these we observe on<br />

a daily basis, like echoes, but some are quite rare, like cancelation.<br />

Underst<strong>and</strong>ing the properties of waves helps avoid unwanted sounds<br />

(noise) <strong>and</strong> improve the desired message (signals), <strong>and</strong> all of the properties<br />

are direct consequences of the behavior of amplitude, phase, <strong>and</strong><br />

frequency in response <strong>to</strong> the physical world.<br />

Before we begin to explain these properties, three more features of waves useful to understand are wavelength, amplitude envelopes, and crests versus troughs. Wavelength λ is the distance in meters that a wave of frequency f travels away from its source in one period T. We calculate wavelength λ with the equation

λ = vT = v/f,

where v is the velocity of sound. In dry, room-temperature (68° Fahrenheit) air, the speed of sound is about 343 meters per second (m/s), and a 1000 Hz wave would therefore have a wavelength of

(343 m/s) / (1000 s⁻¹) = 0.343 m.

Note that frequency (f = 1/T) can be notated either as hertz or s⁻¹.
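The same arithmetic is easy to express in C; this small sketch (the function name is my own) simply evaluates λ = v/f.

    #include <stdio.h>

    /* Wavelength in meters of a tone of frequency f (Hz), for sound speed v (m/s). */
    double wavelength(double v, double f)
    {
        return v / f;
    }

    int main(void)
    {
        /* 1000 Hz in dry 20 C air, as in the worked example above. */
        printf("%.3f m\n", wavelength(343.0, 1000.0));  /* prints 0.343 */
        return 0;
    }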

<strong>An</strong> amplitude envelope describes the general shape of the amplitude<br />

over time for a given wave. Attack, decay, sustain, <strong>and</strong> release are<br />

the four general qualities of an amplitude envelope, <strong>and</strong> they are most



Figure 2.12: The wavelengths of two pure tones, 1 Hz and 10 Hz, calculated by the formula λ = v/f. Since hertz are measured in inverted seconds (s⁻¹), this is the same as multiplying the speed of sound by the duration of one period (1 and 0.1 seconds, respectively). Notice that λ_10Hz is one-tenth of the length of λ_1Hz.

often occur in exactly that order for acoustic instruments. They are

all notated one of three ways: As an instant <strong>to</strong> refer <strong>to</strong> the instant at<br />

which they begin (the onset time), as an interval <strong>to</strong> mean the interval<br />

over which they occur, or as a rate defining the speed at which they<br />

happen. It is easy <strong>to</strong> underst<strong>and</strong> them graphically, as in Figure 2.13.<br />

However, an amplitude envelope is rarely this simple, <strong>and</strong> the one<br />

defining the shape of the frequency domain is not described the same<br />

way. Attack time <strong>and</strong> attack rate are particularly meaningful <strong>to</strong> the<br />

mathematics of music, especially when attempting <strong>to</strong> extract features<br />

from music. Attack almost solely defines where onsets exist. Onsets<br />

help us identify the location of beats <strong>and</strong> important events like the<br />

beginning of choruses or verses in musical signals.<br />

Finally, crests <strong>and</strong> troughs occur at the antinodes of a wave or fixed<br />

string. These are simply synonyms for maximum <strong>and</strong> minimum values<br />

in pressure.



Figure 2.13: A general attack-decay-sustain-release envelope, or ADSR envelope: The<br />

first onset of a note is the attack; the movement from the peak of the attack <strong>to</strong> the<br />

sustain is the decay; the duration a note is held is shown in the sustain; <strong>and</strong> the final<br />

decrease is the release, where the note is no longer being played. This envelope also<br />

describes reverberation.<br />

Figure 2.14: The nodes, antinodes, crests, <strong>and</strong> troughs of a sine wave, shown with<br />

eight different amplitudes. The antinodes are located at the crests <strong>and</strong> troughs, i.e.,<br />

areas of extreme compression <strong>and</strong> rarefaction, <strong>and</strong> the nodes are located at normal<br />

atmospheric pressure. Along a string, the nodes are located where the string does not<br />

move. These positions depend on the string's length.

The nodes are located where the amplitude is 0. Both ends of the<br />

string are therefore nodes. <strong>An</strong>tinodes occur in exactly the opposite<br />

places: Where the magnitude (absolute value) of the amplitude is locally<br />

maximal—i.e., the magnitude is greater than the magnitudes immediately to its left and right. These extreme regions are also called compressions (where maximal) and rarefactions (where minimal). Skipping

a jump rope creates one antinode (we only count an antinode once per<br />

extrema), two nodes, one compression, <strong>and</strong> one rarefaction.<br />

Reflection<br />

The property of reflection can be readily observed when loud sounds originate in rooms with hard surfaces. We experience reflection when sound in a room reverberates or echoes. Bats use the reflection of sound to aid their night vision through echolocation, calculating their distance to objects to a high level of precision by emitting a chirp and measuring the time that it takes for its reflection to be heard [5]. The variables in echolocation are: The speed of sound v, equal to 343 m/s (also written c_v, though the notation c in physics is typically reserved for the speed of light); the round-trip time that the sound takes to hit the object and reflect back; and the speed at which the observer (the bat) is traveling. Since sound travels quickly relative to the speed of the bat, this third variable is reasonably negligible, and it is unlikely that the bat takes note of it at all. So, if the sound takes 5 seconds to reflect back to the bat's ears, the object is (343 m/s) · (5 s)/2 = 857.5 meters away. We divide by 2 because the 5 seconds is the round-trip time, so it took 2.5 seconds for the sound to travel to the object.
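In code, the round-trip arithmetic is a one-line function; this sketch mirrors the 5-second example above.

    #include <stdio.h>

    /* Distance to a reflecting object, given the round-trip echo time in seconds.
       The division by 2 accounts for the out-and-back path. */
    double echo_distance(double speed_of_sound, double round_trip_seconds)
    {
        return speed_of_sound * round_trip_seconds / 2.0;
    }

    int main(void)
    {
        printf("%.1f m\n", echo_distance(343.0, 5.0));  /* prints 857.5 */
        return 0;
    }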

The human brain perceives sonic events that begin less than about<br />

one-tenth of a second (0.1 s) apart <strong>to</strong> be part of the same sound [3]. So,<br />

reflections of sound over short distances are perceived as a single signal<br />

because they happen within 0.1 seconds of one another. The minimum distance that a sound must travel for an echo to be perceived is therefore about (343 m/s) · (0.1 s)/2 = 17.15 meters, again dividing by 2 because the sound has to make a round trip. Sounds beginning

greater than 0.2 seconds apart are separated by the brain, <strong>and</strong> between<br />

0.1 <strong>and</strong> 0.2 seconds is an interval of confusion or roughness.



The myth that a duck’s quack does not echo was only recently<br />

debunked by the Acoustics Research Centre at the University of Salford<br />

in 2003 [6]. Their best guess as <strong>to</strong> why this was ever a myth is that<br />

quacks may be difficult <strong>to</strong> detect because they do not have a sharp<br />

attack like lightning or h<strong>and</strong>clapping, <strong>and</strong> furthermore, ducks are<br />

usually in water or the air, not in tunnels where echoes are often<br />

observed.<br />

Reflection can be a useful property of sound when recording in<br />

noisy environments. During sporting events like football <strong>and</strong> basketball<br />

games, you may observe a few people wearing headphones<br />

on the sidelines holding a large, clear, circular object with a microphone<br />

at the center. This is a circular parabola, designed much like<br />

a satellite dish. The microphone is placed at the parabola’s focus, a<br />

point through which all waves that hit the parabola reflect <strong>and</strong> travel.<br />

At the Explora<strong>to</strong>rium Museum in San Francisco, there are two large<br />

parabolas about eight feet in diameter. They are installed vertically so<br />

that museum visi<strong>to</strong>rs can sit inside them on seats strategically placed<br />

so one’s ears are very close <strong>to</strong> the focal point. The parabolas face each<br />

other, but are about 50 feet apart, making it seem irrational that soft<br />

sounds could be effectively transmitted over such a distance in the<br />

popular, noisy museum. Surprisingly, speech barely louder than a<br />

whisper can be clearly heard at the other end. The same idea applies<br />

<strong>to</strong> satellite dishes, but their foci extend far beyond the rim of the dish<br />

<strong>to</strong> compensate for the great distance <strong>to</strong> their signals’ sources in outer<br />

space.<br />

Refraction<br />

When sound travels from one region to a region with a different density or stiffness, refraction and dispersion occur [4]. In waveguide synthesis, these regions are called scattering junctions. The denser a region, the less room its closely spaced particles have to move around. Sound waves can become more excited in stiffer media due to improved elasticity [7]. Therefore, sound travels more quickly in stiff, light solids than

in liquids or gases. Measurements taken with a contact mic on the<br />

metal interior of a brass horn, for example, are much richer (more<br />

partials are articulated) than measurements taken from the air outside<br />

of the horn. The type of wood used in the body of a violin influences<br />

how much the violin amplifies its sound, <strong>and</strong> the propagation of<br />

sound in spruce (a common wood in violins) is twice as fast along<br />

the grain (3000 m/s) as it is across it (1500 m/s) [8]. Refraction is an<br />

especially important property <strong>to</strong> consider for submariners, architects,<br />

<strong>and</strong> materials scientists.<br />

The speed of sound depends on the bulk modulus B of a medium (a number representing its elasticity or stiffness) and on its density ρ:

c_v = √(B/ρ).

So, the speed of sound increases with stiffness and decreases with density. Table 2.1 lists the speed of sound inside various media [9]. The bulk moduli of the woods are given parallel to (along) the grain. All of these numbers are variable, and values are averaged where a range was given.



Medium                      B (×10⁹ N/m²)   ρ (kg/m³)   c_v

Dry air (20 ◦ C) 0.000142 1.21 343 m/s<br />

Water (25 ◦ C) 2.15 965 1493 m/s<br />

Salt water (25 ◦ C) 2.34 1022 1533 m/s<br />

Ebony 13.8 1200 3391 m/s<br />

White oak 11 770 3780 m/s<br />

Honduras mahogany 10.4 650 4000 m/s<br />

Indian rosewood 12.0 740 4027 m/s<br />

White ash 12.2 750 4033 m/s<br />

Engelmann spruce 9.0 550 4036 m/s<br />

Red maple 11.3 675 4092 m/s<br />

Black cherry 12.2 630 4401 m/s<br />

Steel 200 7820 5057 m/s<br />

Glass 70 2600 5189 m/s<br />

Brazilian rosewood 16.0 830 5217 m/s<br />

Diamond 442 3500 11238 m/s<br />

Table 2.1: The speed of sound c_v in common acoustic materials is given by c_v = √(B/ρ), where B is the stiffness and ρ is the density.

The bulk modulus B describes volumetric (three-dimensional) elasticity, while Young's modulus describes tensile, or linear (one-dimensional), elasticity. For the different types of wood, B is actually Young's modulus. Both are ratios of stress to strain; the bulk modulus in particular measures the resistance of a material to uniform compression, and so it represents the inverse of compressibility.
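A short C sketch can reproduce the c_v column of Table 2.1 from the B and ρ columns. The material list here is abbreviated to three rows; the values are the table's.

    #include <math.h>
    #include <stdio.h>

    struct material {
        const char *name;
        double B;    /* bulk (or Young's) modulus, in Pa */
        double rho;  /* density, in kg/m^3 */
    };

    int main(void)
    {
        /* A few rows from Table 2.1; B is converted from x10^9 N/m^2 to Pa. */
        struct material m[] = {
            { "Dry air (20 C)", 0.000142e9, 1.21   },
            { "Water (25 C)",   2.15e9,     965.0  },
            { "Steel",          200.0e9,    7820.0 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-16s %7.0f m/s\n", m[i].name, sqrt(m[i].B / m[i].rho));
        return 0;
    }

Running this reproduces the table's values of roughly 343, 1493, and 5057 m/s.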

Reverberation<br />

Reverberation refers <strong>to</strong> sound reflecting against walls, refracting in<strong>to</strong><br />

absorbent material, <strong>and</strong> dissipating in air after its origination. We talk<br />

about reverberation especially in room acoustics, where a recording<br />

studio should ideally have no reverberation, but a cathedral may have<br />

a lot of reverberation. Its graphical representation is different: Once



again, the horizontal axis is time <strong>and</strong> the vertical axis is amplitude, but<br />

this is not the signal itself. Instead, Figure 2.15 shows us the amplitude<br />

of events over time.<br />

Figure 2.15: Reverberation is typically generalized as three main events: The source<br />

signal, its early reflections (with thicker vertical lines), <strong>and</strong> its late reflections (the<br />

thinner vertical segments). The timing of the early reflections with respect to the source sound determines whether the source seems near to or far from the observer.

The first event is the original sound—the source signal. As the<br />

source sound propagates in the room, it bounces off each of the walls.<br />

The first time it does this <strong>and</strong> returns <strong>to</strong> the receiver is depicted in<br />

the early reflections event. In the above graph, there are six early<br />

reflections representing six walls or surfaces, which is typical in a<br />

rectangular room with four walls, a floor, <strong>and</strong> a ceiling. The later the<br />

reflection, the farther the surface is from the receiver. The weaker the reflection, the longer the sound has traveled and/or the more absorbent the surface material. The late reflections event depicts the later bounces off of these surfaces with gradually less energy.

The frequency response of a reverberant space like a room or musical<br />

instrument is calculated by exciting the space with all frequencies in<br />

its range at a constant pressure <strong>and</strong> transforming this recording with<br />

a Fourier transform <strong>to</strong> deduce its resonant frequencies, which appear<br />

as peaks in the frequency response. This can be done by exciting the<br />

instrument with a sine sweep (a pure <strong>to</strong>ne that oscillates from low <strong>to</strong>



high frequencies) <strong>and</strong> recording the instrument’s vibration. In room<br />

acoustics, the frequency response is typically calculated by playing a<br />

burst of white noise, because it has equal energy at all frequencies. The<br />

Fourier transform of white noise is perfectly flat, reflecting the equal<br />

power of the frequencies, <strong>and</strong> the Fourier transform of the recording<br />

of the white noise in a room will have bumps where the room is<br />

resonating or attenuating sound on a frequency basis. This burst of<br />

white noise is also called an impulse, <strong>and</strong> a space’s reaction <strong>to</strong> it is called<br />

an impulse response. Impulses can also be taken with other loud, brief,<br />

noisy things like fireworks, balloon pops, <strong>and</strong> h<strong>and</strong>clapping, though<br />

their frequency response is naturally more variable than that of white<br />

noise.<br />

Figure 2.16: The impulse response of a room is a function of time. The general shape<br />

of the amplitude envelope is decreasing, but there are peaks where the sound is<br />

reflecting off of surfaces in the room. The horizontal axis is in samples, not seconds,<br />

so this impulse response lasts less than half of a second.



Figure 2.17: The frequency response of the same room as in Figure 2.16. This is a

function of frequency. Rooms typically have resonances in the lower frequency range<br />

because of their larger dimensions compared <strong>to</strong> musical instruments.<br />

To calculate the frequency response, we simply take the Fourier<br />

transform of the impulse response, depicted in Figure 2.17.<br />
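The discrete Fourier transform itself is developed later in the book; purely as a sketch of the idea, the following C fragment computes the magnitude of a (slow, direct) DFT of an impulse-response array. The impulse response here is an invented decaying signal standing in for a measurement, and the array length and constants are my own choices.

    #include <math.h>
    #include <stdio.h>

    #define N 256
    static const double PI = 3.14159265358979;

    int main(void)
    {
        double h[N], mag[N];

        /* Stand-in impulse response: an exponentially decaying burst with a few
           periodic "reflections"; a real one would be recorded in the room. */
        for (int n = 0; n < N; n++)
            h[n] = exp(-n / 40.0) * ((n % 37 == 0) ? 1.0 : 0.2 * sin(0.7 * n));

        /* Direct DFT magnitude: |H[k]| = |sum_n h[n] * exp(-i 2 pi k n / N)|. */
        for (int k = 0; k < N; k++) {
            double re = 0.0, im = 0.0;
            for (int n = 0; n < N; n++) {
                re += h[n] * cos(2.0 * PI * k * n / N);
                im -= h[n] * sin(2.0 * PI * k * n / N);
            }
            mag[k] = sqrt(re * re + im * im);
        }

        /* Peaks in mag[] (up to k = N/2) mark resonant frequencies of the space. */
        for (int k = 0; k < 8; k++)
            printf("bin %d: %.3f\n", k, mag[k]);
        return 0;
    }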

Room acoustics is a constantly exp<strong>and</strong>ing field of research in the<br />

scientific study of sound. A st<strong>and</strong>ard way of measuring the reverberation<br />

of an acoustic space is by calculating the RT 60 , the time that a<br />

sound (typically wideband or narrowband noise) takes to decay by

60 decibels in that space. Architectural structures built for a musical<br />

purpose like audi<strong>to</strong>riums <strong>and</strong> studios use room acoustics <strong>to</strong> choose<br />

materials <strong>and</strong> dimensions that will best amplify or attenuate certain frequencies.<br />

The table and graph depicted in Figures 2.18 and 2.19 show the absorption of sound by various materials; notice that the harder the material, the less sound it absorbs.



Figure 2.18: The absorption coefficients of various materials for the frequencies 250<br />

Hz, 500 Hz, <strong>and</strong> 1000-2000 Hz. This table comes from Alex<strong>and</strong>er Wood’s The Physics<br />

of Music [10].<br />

Many of the results depicted in Figures 2.18 and 2.19 come from some of the first discoveries concerning room acoustics, made by Wallace Clement Sabine (1868-1919) of Harvard University, who found that

T = 0.161 V / (A S),

where T is the reverberation time, V is the volume of a reverberant space in cubic meters, A is the average absorption coefficient, and S

is the surface area of the material. A is strictly less than 1 because 100 percent absorptive material does not exist, but as a point of reference, one square meter of 100% absorptive material is called 1 sabin.

Figure 2.19: The absorption curves of various materials over the frequency range 64-4096 Hz [10].
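Sabine's formula is easy to evaluate numerically. The sketch below estimates the reverberation time of a single-material room; the room size and absorption coefficient are invented for the example.

    #include <stdio.h>

    /* Sabine reverberation time T = 0.161 * V / (A * S), with V in m^3,
       S in m^2, and A the (dimensionless) average absorption coefficient. */
    double sabine_rt(double volume, double avg_absorption, double surface_area)
    {
        return 0.161 * volume / (avg_absorption * surface_area);
    }

    int main(void)
    {
        /* Illustrative room: 8 m x 6 m x 3 m, total surface 180 m^2, A = 0.1. */
        printf("T = %.2f s\n", sabine_rt(8.0 * 6.0 * 3.0, 0.1, 180.0));
        return 0;
    }

With these assumed numbers the estimate is about 1.3 seconds, typical of a fairly live room.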

Interference<br />

Let us begin with a theorem from mathematics.<br />

Theorem: For any real numbers a, b ∈ R,

|a + b| ≤ |a| + |b|.

Proof: Let a, b ≥ 0. Then the result |a + b| = |a| + |b| is immediate. Likewise, if a, b ≤ 0, it is clear that |a + b| = |a| + |b|. Finally, let a and b have opposite signs. Then |a + b| < |a| + |b|. So, for all values of a, b in R, |a + b| ≤ |a| + |b|.



Constructive interference is only satisfied by the maximal case, |a + b| = |a| + |b|; otherwise destructive interference is occurring.

There are two types of interference in sound: Constructive and destructive. Constructive interference occurs when two waves, call them x₁(t) and x₂(t), interact such that

|x₁(t) + x₂(t)| = |x₁(t)| + |x₂(t)|.

In words, the magnitude of their sum is equal to the sum of the magnitudes of each wave. It can be shown that the signs of the two waves must be the same.

Destructive interference is the exact opposite of constructive interference. Destructive interference is such that

|x₁(t) + x₂(t)| < |x₁(t)| + |x₂(t)|.

For this to be true, the signs of the two waves must be opposite, as given in the above proof. Therefore, when these waves interact, they have a detrimental effect on the overall pressure of the air through which they propagate. Constructive or destructive interference can only occur when the waves intersect at the same location, whether at a single point or set of points, and at the same instant or same interval of time.

Cancelation is a result of completely destructive interference, wherein<br />

|x 1 (t)+x 2 (t)| =0. Consider two identical sinusoids, x 1 (t) =x 2 (t) =<br />

A sin(2πft + φ). Now imagine that you have two speakers facing each<br />

other, located exactly at an integer multiple of the wavelength of the

sinusoids (λ = v/f, remember) apart from one another, both connected<br />

<strong>to</strong> your CD player in stereo. Let one channel be x 1 (t) <strong>and</strong> the other<br />

be x 2 (t). When you press play, the waves travel from the speaker <strong>to</strong><br />

the opposite speaker at the same time. Because they are placed an<br />

integer multiple (4 times) of their wavelengths apart, the crests of one<br />

wave will occur exactly where the troughs of the second wave occur.<br />

These two waves are said to be completely out of phase with each other:



Their phases are different by π radians, or 180 ◦ . This is the most that<br />

two waves can be out of phase, even though a circle contains 360 ◦ .<br />

The waves become reflections of each other because they are exact<br />

opposites, flipped about the horizontal axis, <strong>and</strong> become in phase with<br />

each other when there is no difference (angle) between their respective<br />

phases.<br />

Figure 2.20: Two speakers exhibiting completely destructive interference. The superposition<br />

of their respective sound waves is shown by the dotted line. Since one wave

is moving <strong>to</strong> the left <strong>and</strong> the other <strong>to</strong> the right at the same speeds, the wave is 0 only<br />

2 times per period (much like a regular sine wave), but it does not travel—it st<strong>and</strong>s.<br />

Hence, the result is a st<strong>and</strong>ing wave, which causes acoustic feedback.<br />

So, their superposition is zero everywhere at certain instants—<br />

at their compressions <strong>and</strong> rarefactions, <strong>to</strong> be more precise. These<br />

two waves form what is called a st<strong>and</strong>ing wave. A st<strong>and</strong>ing wave<br />

occurs when two sine waves of equal frequency travel at the same<br />

velocity in opposite directions, so their velocities, v 1 <strong>and</strong> v 2 , sum <strong>to</strong><br />

zero where v 1 = −v 2 (i.e., the wave does not propagate—it st<strong>and</strong>s). This<br />

happens in musical instruments along a fixed string or in a column of<br />

air. A st<strong>and</strong>ing wave is perfectly stationary, but its amplitude changes<br />

periodically at the same frequency as the two waves.

St<strong>and</strong>ing waves cause acoustic feedback because they resonate in<br />

an acoustic space <strong>and</strong> thus have more sustain, causing a microphone<br />

<strong>and</strong> speaker <strong>to</strong> continuously receive <strong>and</strong> transmit them when recording.<br />

Say that a room is 27’ x 25’ x 12’. Then the frequencies with



wavelengths equal to 27 feet, 25 feet, or 12 feet (about 41.7 Hz, 45.0 Hz, and 93.8 Hz at 343 m/s) will stand in this room, and furthermore, integer multiples of

these frequencies will also cause feedback (though, <strong>to</strong> a lesser degree)<br />

because their reflections will mirror their propagations. This means<br />

that larger rooms will have very low-frequency resonances <strong>and</strong> small<br />

corridors (<strong>and</strong> musical instruments) will have higher frequency resonances,<br />

because a wavelength of 30 centimeters translates <strong>to</strong> about<br />

1143 Hz.<br />
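Following the rule of thumb above, that a frequency whose wavelength matches a room dimension (or an integer fraction of it) will stand, a small sketch can list candidate feedback frequencies for the 27' x 25' x 12' room. The foot-to-meter conversion and the loop limit are my own choices for illustration.

    #include <stdio.h>

    #define SPEED_OF_SOUND 343.0    /* m/s */
    #define FEET_TO_M      0.3048

    int main(void)
    {
        double dims_ft[3] = { 27.0, 25.0, 12.0 };

        for (int d = 0; d < 3; d++) {
            double L = dims_ft[d] * FEET_TO_M;
            printf("%2.0f ft dimension:", dims_ft[d]);
            /* Frequencies whose wavelengths are L, L/2, L/3, ... all stand. */
            for (int n = 1; n <= 3; n++)
                printf("  %6.1f Hz", n * SPEED_OF_SOUND / L);
            printf("\n");
        }
        return 0;
    }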

Resona<strong>to</strong>rs <strong>and</strong> noise-canceling headphones are effective in diminishing<br />

the power of undesirable frequencies. Resona<strong>to</strong>rs tuned <strong>to</strong> the<br />

undesired frequency will capture <strong>and</strong> reduce the frequency by creating<br />

a st<strong>and</strong>ing wave: The wave carrying the undesired frequency is<br />

attracted <strong>to</strong> the resona<strong>to</strong>r, <strong>and</strong> the resona<strong>to</strong>r absorbs <strong>and</strong> dissipates<br />

the wave by cancelation. Noise-canceling headphones detect noise via<br />

an exterior microphone near the ear. The noise is directed <strong>to</strong> an electric<br />

circuit that transforms it in<strong>to</strong> an antinoise signal—a signal exactly<br />

out of phase with the detected noise. This antinoise signal is played<br />

through the headphones <strong>to</strong> cancel the noise.<br />

The inverse square law<br />

All forms of radiation obey the inverse square law, which simply says<br />

that the farther you are away from a source of energy, the less intense<br />

the energy will be. In a uniform medium, a source propagates in all<br />

directions equally, so we model this motion in three dimensions as a<br />

sphere. The intensity I at a radius r from a sound source with original<br />

power P will be I =<br />

P , because the surface area of a sphere is given<br />

4πr 2<br />

by 4πr 2 .<br />
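As a quick numerical check of the inverse square law, the sketch below evaluates I = P/(4πr²) at a few distances; the 1-watt source is an arbitrary example.

    #include <stdio.h>

    static const double PI = 3.14159265358979;

    /* Intensity (W/m^2) at radius r from an omnidirectional source of power P. */
    double intensity(double P, double r)
    {
        return P / (4.0 * PI * r * r);
    }

    int main(void)
    {
        for (double r = 1.0; r <= 8.0; r *= 2.0)   /* each doubling quarters I */
            printf("r = %.0f m: I = %.4f W/m^2\n", r, intensity(1.0, r));
        return 0;
    }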

However, because we measure the intensity of sound with decibels<br />

(dB), which are a logarithmic unit, the inverse square law returns a<br />

different equation for sound waves than the one listed above. Intensity,<br />

as we will explore in more detail in Chapter 3, is proportional <strong>to</strong> the



Figure 2.21: Three-dimensional depiction of the inverse-square law<br />

square of sound pressure, P². Therefore, a source's intensity becomes proportional to I/r² at a distance r from it, and the pressure is then proportional to P/r, not P. 6 We say "proportional to" in mathematics when a ratio

exists between two quantities, but their relationship is not necessarily<br />

the same in every scenario, i.e., the ratio may fluctuate. The number of<br />

days that it rains in a year, for example, is proportional <strong>to</strong> the annual<br />

6 Note that only intensity <strong>and</strong> pressure diminish with distance, not frequency or<br />

wavelength. Red does not get any "less red" the farther we are away from it.



inches of rain accumulated in a year, but x-many days of rain does not<br />

necessarily mean y-many inches of rainfall.<br />

The brain treats the ears like two distinct microphones, not as a<br />

<strong>to</strong>tal or average [4]. It relies heavily upon the inverse square law <strong>to</strong><br />

detect the proximity of sources, while the distance between the ears<br />

<strong>and</strong> the physicality of the pinnae (the flaps of skin external <strong>to</strong> the skull)<br />

provide information about the sources’ directionality. The primary<br />

function of hearing, or of any sensation for that matter, is <strong>to</strong> alert the<br />

hearer of threat <strong>to</strong> its survival. The sensation of sound can quickly<br />

activate our adrenal gl<strong>and</strong>s, <strong>and</strong> can thereby serve <strong>to</strong> inform the proper<br />

fight or flight response. Music, <strong>to</strong>o, has the power <strong>to</strong> elicit very strong<br />

emotional reactions, including fear <strong>and</strong> anger.<br />

<strong>An</strong>other aspect of sound that requires a physical explanation is<br />

our ability <strong>to</strong> hear sounds from sources outside of the room. In my<br />

office, I can hear footsteps approaching from around the corner <strong>and</strong><br />

the eleva<strong>to</strong>r bell, even though the eleva<strong>to</strong>r is located far down the hall.<br />

But these sounds all seem <strong>to</strong> be coming from my doorway. This can be<br />

explained by Huygens’ principle, depicted in Figure 2.22.<br />

Huygens’ principle: Every point of a moving wave is also<br />

the center of a new source, each propagating a fresh set of<br />

waves in all directions.<br />

This is also known as diffraction, <strong>and</strong> explains why the sound from a<br />

loudspeaker can be heard at locations behind it, above it, <strong>and</strong> <strong>to</strong> its<br />

left <strong>and</strong> right. This is true of light as well, but because the wavelength<br />

of light is so short (about 390 <strong>to</strong> 750 nanometers) due <strong>to</strong> its very large<br />

frequency (400-790 Terahertz, where 1 Terahertz (THz) = 10¹² Hz), it

is not as easily perceived.<br />

The elasticity of air is what allows sound to propagate, so there is no sound in a vacuum (light, unlike sound, needs no medium). Other qualities of air and

the Earth’s atmosphere can have interesting effects on traveling waves.



Figure 2.22: Sounds originating on the other side of an open doorway will appear <strong>to</strong><br />

originate from the doorway itself, states Huygens’ Principle.<br />

The effects of temperature, humidity, velocity, <strong>and</strong> altitude<br />

Waves move differently depending on the media through which they<br />

travel, as stated in the discussion of refraction. Temperature, humidity,<br />

<strong>and</strong> altitude are all directly related <strong>to</strong> atmospheric pressure, which is<br />

itself a result of gravity. Sound waves vibrate easiest in high-pressure<br />

areas where there are fewer forces working against their energy. Since<br />

pressure decreases as elevation increases, sound waves tend <strong>to</strong>ward<br />

the ground. Sound actually travels faster in hotter air (see the velocity formula below), but outdoors on a hot day it can seem to fade more quickly: the warm air near the ground rises and refracts sound upward, carrying some of the sound waves' energy away from listeners at ground level.

Temperature’s effect on pitch is most noticeable in wind instruments,<br />

due to the expansion of their bores from heat. A flute, for

example, rises in pitch about 0.002 Hz for every 1 ◦ C (1.8 ◦ F) rise in<br />

temperature. The tuning of piano strings increases about 0.00001 Hz<br />

for each 1 ◦ C increase in temperature because hotter strings exp<strong>and</strong>.<br />

The velocity of a sound can be calculated as before, in the section<br />

on refraction, but also from the derivative of the pressure of a medium<br />

with respect <strong>to</strong> its density:<br />

v = √(∂p/∂ρ) = √(B/ρ),

where p is the pressure of a medium and ρ is once again its density. Therefore, the bulk modulus can be determined by

B = v²ρ = (∂p/∂ρ)ρ.

In 0% humidity (dry) air,

v = 331.3 √(1 + t/273.15) m/s

for temperature t in degrees Celsius.
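In C the dry-air formula reads as follows; this is only a sketch, and the sample temperatures are arbitrary.

    #include <math.h>
    #include <stdio.h>

    /* Speed of sound in dry air (m/s) at temperature t in degrees Celsius. */
    double speed_in_dry_air(double t)
    {
        return 331.3 * sqrt(1.0 + t / 273.15);
    }

    int main(void)
    {
        double temps[] = { 0.0, 20.0, 35.0 };
        for (int i = 0; i < 3; i++)
            printf("%5.1f C: %6.1f m/s\n", temps[i], speed_in_dry_air(temps[i]));
        return 0;
    }

At 20 °C this gives roughly 343 m/s, consistent with the figure used throughout the chapter.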

As mentioned in the discussion of refraction, a sound's propagation speed is dependent upon the medium through which the sound travels. A final way of working with these speeds is the Mach number, which expresses a speed as a multiple of the speed of sound in a given medium, so that Mach 1 is the speed of sound itself and a Mach number greater than 1 indicates supersonic travel. We can calculate the Mach number from pressure measurements with the equation

M = √( (2/(γ − 1)) [ (q_c/p + 1)^((γ−1)/γ) − 1 ] ),

where M is the Mach number, q_c is the impact pressure of the medium, p is the pressure of the medium, and γ is the ratio of specific heats (the adiabatic index introduced earlier). This equation comes from Bernoulli's principle in fluid dynamics.

Humidity has a small but detectable effect on sound propagation,<br />

due <strong>to</strong> the presence of lighter <strong>and</strong> more elastic water molecules in the<br />

air. As you may guess, the velocity of sound increases in humid air, up<br />

<strong>to</strong> 0.6%. Since the density of air is lower at higher altitudes than at sea<br />

level or below it, the speed of sound decreases as its altitude increases.<br />

The speed at which sound travels affects the sound's volume at a

given distance, <strong>and</strong> unsurprisingly, the faster sound travels, the better<br />

it maintains its original intensity. Wind will additively or subtractively<br />

affect the speed, working as you may suspect: When wind is blowing<br />

in the direction of the sound, it increases its velocity, <strong>and</strong> thus its<br />

loudness.



Finally, when an observer or sound source is moving, the Doppler<br />

effect causes the wavelength of sounds <strong>to</strong> change. As a source moves<br />

closer to an observer, each successive period of the sound wave is compressed, causing the period to get smaller and the frequency to get

larger. Conversely, waves moving away from an observer will have<br />

increasingly larger periods, causing frequency <strong>to</strong> decrease. Austrian<br />

physicist Christian Doppler witnessed <strong>and</strong> quantified this in 1842 with<br />

the mathematical formula<br />

f_o = ((c + v_o) / (c + v_s)) · f_s,

where f_o is the frequency heard by the observer, c is the speed of sound in the medium (343 m/s in air), v_o is the velocity of the observer, v_s is the velocity of the sound's source, and f_s is the frequency of the source. The observer's velocity v_o will be positive if the observer is moving towards the source, and v_s will be positive if the source is moving away from the observer.
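The Doppler formula, with the sign conventions just stated, translates directly into a short C sketch; the 440 Hz source and 20 m/s approach speed are example values of my own.

    #include <stdio.h>

    /* Observed frequency under the Doppler effect: v_observer > 0 when the
       observer moves toward the source, v_source > 0 when the source moves
       away from the observer. */
    double doppler(double f_source, double c, double v_observer, double v_source)
    {
        return f_source * (c + v_observer) / (c + v_source);
    }

    int main(void)
    {
        /* A 440 Hz source approaching a stationary observer at 20 m/s:
           under this convention the source velocity is -20 m/s. */
        printf("%.1f Hz\n", doppler(440.0, 343.0, 0.0, -20.0));  /* about 467 Hz */
        return 0;
    }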

2.6 Chapter summary<br />

This chapter investigated the four main properties of waves—amplitude,<br />

frequency, period, <strong>and</strong> phase—<strong>and</strong> their behavior in an ideal world <strong>and</strong><br />

in reality. Also distinguished were the terms signal <strong>and</strong> noise. These<br />

terms are important for underst<strong>and</strong>ing the compression of information,<br />

a <strong>to</strong>pic raised at the end of Chapter 5.<br />

Sound waves can be decomposed in<strong>to</strong> a sum of simple sine waves,<br />

says the principle of superposition. It is easy <strong>to</strong> realize the amplitude (A),<br />

frequency (f), period (1/f), <strong>and</strong> phase (φ) of a simple sine wave: It is<br />

of the form A sin(2πft + φ). When two waves are related in frequency<br />

by a small-number integer ratio, we say that the (musical) interval<br />

between them is harmonic. Amplitude explains the amount of pressure<br />

induced by a sound wave in a medium. Sound can only be heard when<br />

the pressure of a medium is varying, as changing pressure signifies



a disturbance in the eardrum. For example, when we go <strong>to</strong> places at<br />

high altitudes, there is a lower pressure, but no specific sounds are<br />

associated with this change.<br />

The phase of a wave in relation <strong>to</strong> the phase of another wave determines<br />

how two or more waves will interfere with each other when they<br />

interact in a medium. When the resultant wave is less in amplitude<br />

than the sum of the amplitudes of the original waves, it is interfering<br />

destructively. Otherwise, it is undergoing constructive interference.<br />

Waves traveling in exactly opposite directions <strong>and</strong> at identical frequencies<br />

create st<strong>and</strong>ing waves. For the same reason, st<strong>and</strong>ing waves also<br />

happen in musical instruments, on a fixed string, <strong>and</strong> in columns of<br />

air.<br />

Sound waves reflect off of objects, refract in different media, <strong>and</strong><br />

lose energy as they dissipate (the inverse square law). All of this is reverberation,<br />

which describes the behavior of sound after it originates from<br />

a source like a speaker or musical instrument. Every point through<br />

which a sound wave travels is also the source of a new set of waves,<br />

states Huygens’ principle, but this source is not thought of in the same<br />

way as a speaker or musical instrument.<br />

A hotter environment will increase the speed of the sound passing through it. Increasing the humidity or the stiffness of a medium increases the velocity of sound waves, while increasing its density decreases it. Sounds moving towards an

observer will have increasingly shorter wavelengths <strong>and</strong> thus higher<br />

frequencies, <strong>and</strong> the converse is true (the Doppler effect). Finally, sound<br />

waves are attracted <strong>to</strong> high-pressure areas where their energy will be<br />

most conserved, <strong>and</strong> since pressure decreases with altitude, all sound<br />

waves tend <strong>to</strong>wards the ground.


3. <strong>Musical</strong> sound<br />

What makes sound musical? You are already conscious of some of the

devices musicians use <strong>to</strong> transform sound in<strong>to</strong> music: Harmonizing<br />

with a melody, repeating sections like a chorus, changing the volume<br />

of a beat for emphasis or deemphasis, <strong>and</strong> using <strong>to</strong>ne-rich instruments<br />

such as violins. How can we scientifically describe these devices? Even

though all sound waves consist only of frequencies, amplitudes, <strong>and</strong><br />

phases, the way that their properties are organized in music can be<br />

unintuitive <strong>and</strong> even enigmatic. Indeed, the hard reality of digital<br />

music analysis <strong>and</strong> music information retrieval is that a computer still<br />

cannot classify <strong>and</strong> relate musical data as quickly <strong>and</strong> accurately as<br />

our ears can. A computer needs a whole song—not <strong>to</strong> mention a large<br />

database with which <strong>to</strong> compare it—<strong>to</strong> surmise high-level features<br />

about music such as genre, artist, time period, even time signature.<br />

Experienced listeners can do this in a few seconds. In this chapter, we<br />

will explore some of the wave behavior behind basic musical features.<br />

3.1 Rhythm<br />

The rhythm of a piece of music is a representation of its temporal<br />

structure. Within rhythm, we mainly talk about tempo <strong>and</strong> meter. The<br />

tempo describes the beat or pulse. It tells us how quickly or slowly

beats happen, <strong>and</strong> uses the unit of beats per minute. Therefore, tempo<br />

is directly tied <strong>to</strong> duration in seconds, <strong>and</strong> can be related <strong>to</strong> a frequency<br />

itself. For example, a song with a tempo of 120 bpm beats every 0.5<br />

seconds (2 beats per second), so it beats at 2 Hz. 1<br />

1 Of course, this is not the pitch of the beat. Our ears cannot perceive frequencies<br />

below about 20 Hz as pitched sound. To see this for yourself, take a quarter <strong>and</strong>



Much like a meter stick st<strong>and</strong>ardizes the size of a centimeter <strong>and</strong><br />

measures something’s length, meter (or metric structure) gives the st<strong>and</strong>ard<br />

size of the basic unit in a piece of music <strong>and</strong> describes the length<br />

of a musical measure. Some typical basic units are the eighth note,<br />

quarter note, <strong>and</strong> half note, <strong>and</strong> they are designated by the bot<strong>to</strong>m<br />

number in the following notation.<br />

Figure 3.1: Some examples of musical time signatures: 4-4 time means that a measure<br />

lasts the length of four quarter notes, where a quarter note is the basic unit of the<br />

rhythm; 3-4 time has measures of three quarter notes; 6-8 time has measures of six<br />

eighth notes; <strong>and</strong> 3-2 time means that the measures contain three half notes.<br />

If the bot<strong>to</strong>m number is 4, for example, the basic unit would be the<br />

quarter note. If it is 8, the unit would be the eighth note, <strong>and</strong> so on.<br />

The <strong>to</strong>p number simply says how many of these basic units are needed<br />

<strong>to</strong> fill one measure in a piece of music.<br />

Figure 3.2: Examples of some rhythmic music notation. In the time signature of 4-4,<br />

the quarter note is the unit note, i.e., the basic unit, <strong>and</strong> the first <strong>and</strong> third beats are<br />

usually accentuated more than the second or fourth beats.<br />

run your fingernail slowly against the ridges on its side, so that you hear a series of<br />

clicks. Then, do so more quickly, until you hear a gritty, pitched sound. We perceive<br />

frequencies below 20 Hz as individual events. More will be said about this when we<br />

discuss perceptual beats.



The meter indicates where the beats fall within a measure. Beats<br />

can be strong or weak, <strong>and</strong> typically a strong beat will fall on the<br />

first beat of a measure, <strong>and</strong> weak beats will fall on the second <strong>and</strong><br />

fourth. The third beat will be strong relative to the second and

fourth, but perhaps not as strong as the first beat, <strong>and</strong> not stronger<br />

unless syncopation is employed. We can see the difference between a<br />

strong <strong>and</strong> weak beat quite easily in a graph of a musical signal from a<br />

modern genre like dubstep or drum <strong>and</strong> bass: The power of the beat<br />

shows up as a peak in the amplitude.<br />

Figure 3.3: A signal that isn’t a simple sinusoid can still have periodic properties.<br />

This is a clip from an electronic (dance) piece of music with a highly defined rhythm.<br />

We can see 14 equally spaced beats in a span of 10 seconds, so the tempo would be<br />

somewhere around 84 bpm.<br />

Because music is a time-based art form, rhythm can tell us a lot<br />

about the way it is organized. We associate rhythm <strong>and</strong> complexity<br />

with each other <strong>to</strong> a high degree, because we base a lot of our expectations<br />

in music on its temporal structure. When music is polyrhythmic<br />

(contains many rhythms), syncopated (strong beats do not fall at the<br />

designated time), or even arhythmic (lacks rhythm al<strong>to</strong>gether), these<br />

expectations st<strong>and</strong> <strong>to</strong> be violated.<br />

3.2 Pitch<br />

Pitch is our perception of frequency. Often, frequencies will be present<br />

that we cannot perceive due to their low absolute loudness, their relative quietness compared to other frequencies, or other psychoacoustical reasons, such as masking. Human speech contains many frequencies with audible energy that we do not perceive as pitched sound because they are so complex. By contrast, most musical instruments are designed to

articulate pitches clearly <strong>and</strong> definitely.<br />

In music, pitch follows tuning st<strong>and</strong>ards <strong>to</strong> regulate instrument<br />

construction <strong>and</strong> eschew tedious debates that arise when playing with<br />

other musicians. Many tunings for the A above middle C were proposed<br />

before the A440 that we have now. <strong>An</strong> international conference<br />

was held in London in May 1939 that was perhaps the last international<br />

agreement of any kind made before the beginning of World War<br />

II. Though not officially st<strong>and</strong>ardized by the International Organization<br />

for St<strong>and</strong>ardization until 1955, A440 was adopted in nearly every<br />

country using Western temperament well before then [10].<br />

With pitch, we can define musical scales, intervals, chords, <strong>and</strong><br />

harmonies. Their construction is based on the consonant, small-integer<br />

ratios between frequencies <strong>and</strong> periods discussed in the previous chapter.<br />

3.3 Tuning <strong>and</strong> temperament<br />

In the musical systems of nearly every culture, musical scales are defined.<br />

These scales reflect the culture’s view of musical consonance<br />

<strong>and</strong> their compositional preferences, <strong>and</strong> make rules that st<strong>and</strong>ardize<br />

the tuning <strong>and</strong> build of their musical instruments. <strong>Musical</strong> systems<br />

are built around the musical intervals that perceptually present themselves,<br />

such as the highly consonant octave (a 2:1 frequency ratio) <strong>and</strong><br />

the highly dissonant tri<strong>to</strong>ne (a √ 2:1 frequency ratio).<br />

The Greek mathematician <strong>and</strong> philosopher Pythagoras was perhaps<br />

the first <strong>to</strong> propose a system of tuning for music. This system was<br />

only natural <strong>to</strong> the Greeks, who were obsessed with small-integer ratios<br />

<strong>and</strong> numerology. Much of their cultural aesthetics trace back <strong>to</strong> the<br />

number five: The five elements (earth, water, air, fire, and the universe)



were assigned <strong>to</strong> the five regular polyhedra, called the Pla<strong>to</strong>nic solids<br />

[2]. The Greeks believed so strongly that these geometric shapes

were related <strong>to</strong> the chemical structure of the quintessential elements<br />

that these solids were named "the a<strong>to</strong>ms of the universe" by Euclid.<br />

Pythagoras built his structure of musical temperament upon the circle<br />

of fifths, a sequence that moves upward by a perfect fifth 12 times<br />

<strong>and</strong> actually reaches all 12 notes of the scale in doing so. Movement<br />

through the first five notes of the circle of fifths (C-G-D-A-E) builds<br />

the penta<strong>to</strong>nic scale. 2<br />

As we will see, Pythagoras nearly hit it right on the head with<br />

respect <strong>to</strong> the scales of modern day, but there were problems with his<br />

method. Today, we use equal temperament <strong>to</strong> describe the relationships<br />

between pitches in music <strong>and</strong> from musical instruments. Equal temperament<br />

defines the relationships between any two pitches in the<br />

Western scale <strong>to</strong> be<br />

p₂ = 2^(k/12) · p₁,

where the pitch p₂ is k-many half steps above the pitch p₁. If p₂ is k-many half steps below p₁, then

p₂ = 2^(−k/12) · p₁,

i.e., k can be negative. Therefore, a pitch one octave higher than another pitch will be

p₂ = 2^(12/12) · p₁ = 2 · p₁,

and more generally, a pitch n-many octaves from another pitch will be

p₂ = 2^n · p₁,    n ∈ Z,

where Z is the set of positive and negative integers. When n = 0, i.e., when the second pitch is identical to the first pitch, then they are equal

2 The penta<strong>to</strong>nic scale is written C-D-E-G-A, in ascending pitch order.



Figure 3.4: The Circle of Fifths is shown here: Working clockwise from the center, we<br />

can build the circle by moving up a fifth twelve times until we reach the enharmonic<br />

equivalent of the original note (for C, a B♯). The radial lines show the enharmonic<br />

equivalents.<br />

because 2⁰ = 1. When n < 0, the second pitch lies below the first.

Figure 3.5: Two sinusoids related by an octave in frequency.

In Figure 3.5, we can see a harmonious relationship between sinusoids with frequencies in a 2:1 ratio.

The two waves in Figure 3.5 clearly have similar periodicities. They<br />

intersect with each other every other time that they cross the horizontal<br />

axis. Amazingly, their harmony can also be explained psychologically,<br />

a <strong>to</strong>pic of Chapter 5.<br />
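In code, the equal-temperament relation p₂ = 2^(k/12) · p₁ defined above is a single expression. The sketch below (the helper name is my own) tabulates one octave of half steps above A440.

    #include <math.h>
    #include <stdio.h>

    /* Frequency k equal-tempered half steps away from a reference pitch p1;
       k may be negative to descend. */
    double equal_tempered(double p1, int k)
    {
        return p1 * pow(2.0, k / 12.0);
    }

    int main(void)
    {
        for (int k = 0; k <= 12; k++)   /* one octave of half steps above A440 */
            printf("k = %2d: %8.2f Hz\n", k, equal_tempered(440.0, k));
        return 0;
    }

The k = 12 line prints 880 Hz, the octave, as the formula predicts.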

Equal temperament has advantages over all other temperings

of the 12-note Western scale because it enables transposition, allowing<br />

for songs <strong>to</strong> be played in multiple keys without retuning the instrument.<br />

To explain this, let us briefly explore the systems of Pythagorean<br />

temperament <strong>and</strong> just in<strong>to</strong>nation.<br />

Pythagorean<br />

The Greek mathematician Pythagoras constructed this system sometime<br />

in the 6th century BCE. A popular myth emerged that he did this<br />

immediately after hearing a pair of hammers pounding on an anvil<br />

make the interval of an octave when sounded <strong>to</strong>gether <strong>and</strong> realizing<br />

that their weights formed a 2:1 ratio. However, since the relative weights

between two objects do not imply a similar sonic relationship, the myth<br />

was debunked [77, 78]. Pythagoras may have been the first <strong>to</strong> discover



integer ratios with respect <strong>to</strong> string length using a monochord, a simple<br />

musical instrument similar <strong>to</strong> a guitar.<br />

Figure 3.6: The simplest design of a monochord is a resonating box with two bridges<br />

<strong>and</strong> a fixed string on <strong>to</strong>p. One of these bridges is fixed (shaded in dark gray) <strong>and</strong><br />

one slides, changing the effective length of the string. Suppose that the frequency of<br />

the open fixed string (no altering of the effective length) is 100 Hz. When the sliding<br />

bridge is relocated <strong>to</strong> half of the length of the string, the frequency doubles <strong>to</strong> 200<br />

Hz. At two-thirds the length, the frequency is 150 Hz, a perfect fifth above the open<br />

frequency.<br />

The main difference between a monochord <strong>and</strong> a guitar is the<br />

sliding bridge that can move up <strong>and</strong> down the fretboard <strong>to</strong> change<br />

the length of the string: The distance from the fixed end (depicted in dark

gray in Figure 3.6) <strong>to</strong> the sliding end (in dotted lines) determines the<br />

wavelength of the string’s fundamental frequency. Positioning the<br />

sliding bridge at one half of the string’s length produced a pitch one<br />

octave higher than the pitch produced from the open string (200 Hz<br />

versus 100 Hz), <strong>and</strong> moving the bridge <strong>to</strong> two-thirds of its length made<br />

a pitch a perfect fifth higher than the open pitch, i.e., 150 Hz.



The Greeks were huge fans of integer ratios found in nature, so<br />

Pythagoras’ discoveries were not taken lightly [2, 79]. The Pythagorean<br />

system of tuning is constructed entirely from the ratios of the octave<br />

<strong>and</strong> perfect fifth: Indeed, moving up <strong>and</strong> down these intervals in a<br />

particular order returns all 12 pitches of the Western scale.<br />

The frequency of the note C, call it f_C, need only be scaled by factors of 2 and 3/2 to attain all 12 pitches in one octave of the Pythagorean

scale. Temporarily ignoring our adjustments for octaves, the method<br />

is as follows: We move upwards 6 perfect fifths, ending at F♯, <strong>and</strong><br />

downwards 6 perfect fifths, ending at G♭—what should be the enharmonic<br />

equivalent of F♯. 3 However, F♯ <strong>and</strong> G♭ are only enharmonically<br />

equivalent in equal temperament. In Table 3.1, observe that we get two<br />

different values for the notes F♯ <strong>and</strong> G♭ from the Pythagorean method<br />

of tuning.<br />

These different ratios mean that this scale cannot be transposed to another key.⁴ Consider the ratio between two intervals: C and C♯ are in a 1.0535:1 ratio, but F and F♯ have a 1.0679:1 ratio between them. This might not seem like a significant difference in their actual values, but the difference is large enough for our ears to detect. It becomes very problematic to play intervals using F♯ as the root or fundamental pitch.

It can be said, however, that because all of the intervals can be described with integer ratios, Pythagorean temperament contains more consonance than equal temperament. Equal temperament is by design strictly irrational: 2^(k/12) is only a rational number when k is divisible by 12. But the ability to write music in different keys without retuning one's instrument (and possibly breaking it in the process) trumps the greater consonance of Pythagoras' system.

³ "Enharmonic equivalent" refers to two pitches that sound the same but function differently in transcribed music. In the key of D♭, for example, one does not write F♯ to represent the fourth note of its scale (G♭), even though F♯ is enharmonically a perfect fourth above D♭.

⁴ In order to change key, one needs to shift all of the notes or tones an equal number of half steps up or down, therefore retaining the relative pitches and intervals between the pitches but not their absolute pitches.


Note Name   Process               Ratio
C           (none)                1 f_C
G           ↑ P5                  (3/2) f_C
D           ↑ P5 2x, ↓ P8         (3/2)^2 (2)^-1 f_C = 9/8 f_C
A           ↑ P5 3x, ↓ P8         (3/2)^3 (2)^-1 f_C = 27/16 f_C
E           ↑ P5 4x, ↓ P8 2x      (3/2)^4 (2)^-2 f_C = 81/64 f_C
B           ↑ P5 5x, ↓ P8 2x      (3/2)^5 (2)^-2 f_C = 243/128 f_C
F♯          ↑ P5 6x, ↓ P8 3x      (3/2)^6 (2)^-3 f_C = 729/512 f_C
F           ↓ P5, ↑ P8            (3/2)^-1 (2) f_C = 4/3 f_C
B♭          ↓ P5 2x, ↑ P8 2x      (3/2)^-2 (2)^2 f_C = 16/9 f_C
E♭          ↓ P5 3x, ↑ P8 2x      (3/2)^-3 (2)^2 f_C = 32/27 f_C
A♭          ↓ P5 4x, ↑ P8 3x      (3/2)^-4 (2)^3 f_C = 128/81 f_C
D♭          ↓ P5 5x, ↑ P8 3x      (3/2)^-5 (2)^3 f_C = 256/243 f_C
G♭          ↓ P5 6x, ↑ P8 4x      (3/2)^-6 (2)^4 f_C = 1024/729 f_C

Table 3.1: Pythagorean tuning
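As a rough illustration of the method behind Table 3.1, the following C sketch (not from the book) stacks fifths of 3/2 upward and downward from C and folds each result back into a single octave; the note names are hard-coded for readability.

/* A small sketch (mine, not the book's code) of the Pythagorean method in
 * Table 3.1: stack perfect fifths (3/2) upward and downward from C, then
 * fold each result back into one octave by multiplying or dividing by 2. */
#include <stdio.h>

/* Fold a frequency ratio into the octave [1, 2). */
static double fold(double r) {
    while (r >= 2.0) r /= 2.0;
    while (r < 1.0)  r *= 2.0;
    return r;
}

int main(void) {
    const char *up[]   = { "C", "G", "D", "A", "E", "B", "F#" };
    const char *down[] = { "C", "F", "Bb", "Eb", "Ab", "Db", "Gb" };
    double r = 1.0;

    for (int k = 0; k <= 6; k++) {          /* upward fifths: C G D A E B F# */
        printf("%-2s  %.6f\n", up[k], fold(r));
        r *= 1.5;
    }
    r = 1.0;
    for (int k = 1; k <= 6; k++) {          /* downward fifths: F Bb Eb Ab Db Gb */
        r /= 1.5;
        printf("%-2s  %.6f\n", down[k], fold(r));
    }
    return 0;
}
/* F# comes out as 729/512 = 1.423828 while Gb is 1024/729 = 1.404664,
 * the two different values noted in Table 3.1. */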

Just intonation

Just intonation, also known as the just diatonic scale, is constructed on the principle that a major triad is in the ratio 4:5:6. A major triad is built with a major third (C to E, for example) and a perfect fifth (C to G). This scale was invented as a solution to the problematic Pythagorean scale, and it came before the equally tempered scale. Like the Pythagorean scale, intervals in just intonation are also related by whole-number ratios, the main difference being that in addition to the octave (2:1) and perfect fifth (3:2), the major third is in a 5:4 ratio to the frequency of the fundamental. Note once more that transposition to another key does not produce the original consonant intervals in the new key. We use the conventional abbreviations P8 to mean perfect octave, P5 for perfect fifth, and M3 for major third.


Note Name   Process                     Ratio
C           (none)                      f_C
C′          ↑ P8                        2 f_C
G           ↑ P5                        (3/2) f_C
F           ↓ P5, ↑ P8                  (3/2)^-1 (2) f_C = 4/3 f_C
A           ↓ P5, ↑ M3, ↑ P8            (3/2)^-1 (5/4)(2) f_C = 5/3 f_C
E           ↑ M3                        (5/4) f_C
E♭          ↑ P5, ↓ M3                  (3/2)(5/4)^-1 f_C = 6/5 f_C
A♭          ↓ M3, ↑ P8                  (5/4)^-1 (2) f_C = 8/5 f_C
D           ↑ P5 2x, ↓ P8               (3/2)^2 (2)^-1 f_C = 9/8 f_C
B           ↑ P5, ↑ M3                  (3/2)(5/4) f_C = 15/8 f_C
B♭          ↓ P5 2x, ↑ P8 2x            (3/2)^-2 (2)^2 f_C = 16/9 f_C
D♭          ↓ P5, ↓ M3, ↑ P8            (3/2)^-1 (5/4)^-1 (2) f_C = 16/15 f_C
F♯          ↑ P5 2x, ↑ M3, ↓ P8         (3/2)^2 (5/4)(2)^-1 f_C = 45/32 f_C
G♭          ↓ P5 2x, ↓ M3, ↑ P8 2x      (3/2)^-2 (5/4)^-1 (2)^2 f_C = 64/45 f_C

Table 3.2: The interval-wise derivation of just intonation, ordered by smallness of integer ratios. Note that the enharmonic equivalents F♯ and G♭ have different ratios: 45/32 = 1.40625 ≠ 64/45 ≈ 1.4222.

If consonance and dissonance really can be reduced to the smallness of the integer ratios between two notes, then Pythagorean tuning says that the M2 (a major second, D in the above tables) as well as the M6 (a major sixth, A) intervals are more consonant than a M3 (a major third, E). Apparently, the inventors of the just intonation tuning system disagreed.

Just intonation can be considered the most consonant scale of the three when played in C because of the smaller integer ratios: The ratio between the frequency of B and the frequency of C, for example, is 15:8 in just intonation versus 243:128 in Pythagorean tuning. But transposition is even further disabled, so its use was eclipsed by equal temperament.
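A small C sketch (my own; the 261.63 Hz middle C is an assumed reference, and the ratios are taken from Tables 3.1 and 3.2) compares a few scale degrees in the three systems discussed so far.

/* A rough comparison sketch of Pythagorean tuning, just intonation, and
 * equal temperament for the major third, sixth, and seventh above C. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double fC = 261.63;                    /* assumed reference C (middle C) */
    const char *names[] = { "E (M3)", "A (M6)", "B (M7)" };
    double pyth[] = { 81.0/64.0, 27.0/16.0, 243.0/128.0 };   /* Table 3.1 */
    double just[] = {  5.0/4.0,   5.0/3.0,   15.0/8.0  };    /* Table 3.2 */
    int semis[]   = { 4, 9, 11 };          /* semitone counts for equal temperament */

    printf("%-8s %12s %12s %12s\n", "note", "Pythagorean", "just", "equal");
    for (int i = 0; i < 3; i++) {
        double eq = fC * pow(2.0, semis[i] / 12.0);
        printf("%-8s %12.2f %12.2f %12.2f\n",
               names[i], fC * pyth[i], fC * just[i], eq);
    }
    return 0;
}
/* For B, 15/8 = 1.8750 (just) versus 243/128 = 1.8984 (Pythagorean),
 * the comparison made in the text. Compile with -lm. */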


Other tuning systems

In addition to equal, just, and Pythagorean temperaments, there exist many other systems. In the East, the Hindustani tuning system permits up to 72 divisions of the octave, with intervals like the quarter-tone (half of a semitone, or half step) [73]. For those composers and musicians who feel that 12 notes in an octave is simply not enough, there are microtonal tunings and instruments that use them. In microtonal systems such as those designed by Harry Partch, the octave is partitioned into more than 12 parts. Partch would frequently use a large prime number like 19 or 43. His book Genesis of a Music (published posthumously in 1979) gives an excellent account of microtonal systems.

Keys and scales

A musical scale is an ascending or descending sequence of notes which, in common practice, corresponds to the key of a piece of music [3]. The two most common examples in Western music are the major and minor scales. The major scale is often said to have a brighter, happier spirit than the minor scale, which has a darker, gloomier feel.

We name a key after the root of a scale, so if a scale begins on C and has a major intonation, it is called C Major. It is conventional to capitalize "Major" and leave "minor" uncapitalized for purposes of abbreviating the key "C Major" as "CM" and "C minor" as "Cm." The root is also called the tonic. In fact, every note in a scale has a name with respect to the tonic and a scale degree, in addition to the name of its note.

Because of their popularity, the major and minor scales will be the only ones considered in examples of feature extraction from music. The major scale was once called the Ionian mode, and the minor scale the Aeolian mode. Each of these is built upon a different scale degree of the diatonic scale, i.e., the natural notes C-D-E-F-G-A-B (not sharped or flattened). In order, the modes are: Ionian, Dorian (D-E-F-G-A-B-C-D), Phrygian (E-F-G-A-B-C-D-E), Lydian (F-G-A-...), Mixolydian (G-A-B-...), Aeolian (A-B-C-...), and Locrian (B-C-D-...).


Scale Degree   Name
1°             Tonic
2°             Supertonic
3°             Mediant
4°             Subdominant
5°             Dominant
6°             Submediant
7°             Leading tone
8°             Tonic

Table 3.3: Names for scale degrees

As we continue to relate the periodicity of superimposed sine waves to tuning and harmony, it might have occurred to you that a scale with the same properties as equal temperament could exist that contains a large amount of consonance, such as one using whole tones instead of semitones (six divisions of the octave), or even dividing the octave by a power of two into eight or 16 parts. Indeed, the first of these scales exists: Claude Debussy, among other composers of his era, made extensive use of the whole tone scale in his impressionistic music. The scale beginning on C would consist of the notes C-D-E-F♯-G♯-A♯-C. In fact, this scale could begin on any of those notes and contain exactly the same notes, and furthermore, there are only 2 unique whole tone scales (the other being C♯-D♯-F-G-A-B-C♯). This provides evidence that the scale lacks a tonal center, or root, and it is close to impossible in most circumstances to aurally establish the key of music written in whole tone scales.

John Pierce, Heinz Bohlen, and Kees van Prooijen developed their own hyper-consonant scale, the Bohlen–Pierce scale or Pierce 3579b scale, conceptually similar to just intonation [2]. The distances between intervals are given by strictly rational numbers, where both the numerator and denominator are odd. In fact, the note C′ in the key of C doesn't appear until we reach 3 times the frequency of C—i.e., we don't get an integer multiple of the tonic until an octave plus a fifth (a perfect twelfth) above it.⁵ The frequencies of the fourth, sixth, and ninth scale degrees above the fundamental note are in the ratios 5:3, 7:3, and 9:3 (3:1), and the ninth (C′), a twelfth above the fundamental (C), is the end of the scale.

Although the ratios of the whole tone and the Bohlen-Pierce scales have arguably greater consonance than equal temperament, the lack of instruments tuned to these scales makes their use rare.

⁵ I have seen this 3:1 interval called a tritave with reference to the Bohlen-Pierce scale, but I am not sure of the universality of this term.

Harmony

Briefly, when one adds a voice or multiple voices to a melody or phrase of music, one creates harmony. Intervals like the octave and perfect fifth are considered relatively harmonious when sounded together. Again, this is due to the relationship of their corresponding periodicities. Harmonic devices can provide the listener with a greater sense of expectation or suspense, particularly when compared to the melody. There are more rules defining the vocabulary and function of harmony than any other aspect of classical music composition [11]. Other genres utilize specific chord progressions that define their sound. Blues, for example, makes heavy use of the progression I-IV-V(7), and songs in other genres that use this progression automatically evoke the blues genre. Thus, harmony is a salient feature of music that speaks volumes about music as a language and can even place a piece of music chronologically and geographically [13].

3.4 Timbre

Timbre is the reason that a piano playing A440 sounds different from a violin playing A440. It is the quality of an audible tone. Nonmusical sound like speech has timbre, too: You can tell the difference between your friends' voices without seeing their faces. All sounds, whether from the acoustic instruments of the orchestra, an analog synthesizer, or the beep from your microwave, have a unique timbre. The more experienced a listener, the better his or her ability to distinguish between sounds and even musical instruments. In the examination of the frequency response of a large group of cellos, say, each cello will have a distinct timbre due to factors such as the materials used, the structural measurements, even the storage conditions—personalities not unlike humans'. The word timbre translates to "stamp" in French, so an accessible mnemonic is to think of timbre as the fingerprint or signature of a sound. The simple definition is "tone color." An instrument has a unique timbre, and it appears in the relative amplitudes of the frequencies activated by playing it. The Fourier transform, therefore, excels at determining the source of sounds, because its spectrum exposes the timbre.

Figure 3.7: The spectrum of a trumpet playing A3 (220 Hz). It is very rich in harmonics, and its loudest frequency is not its fundamental but rather its third partial, E5.


The graph in Figure 3.7 represents the spectrum of a signal. A spectrum is plotted on a frequency domain, whereas a signal is plotted on a time domain. Therefore, for a signal x(t) (a function of time), we could call its spectrum X(f) or X(ω), a function of frequency in cycles per second (more likely, in musical circumstances) or radians per second.

A spectrogram (or spectrograph) is another useful visualization of the frequencies contained in a signal, as shown in Figure 3.8. It incorporates time, in addition to frequency and amplitude. The domain of a spectrogram is now time, and the vertical axis is now frequency, not amplitude. The graph is darker where the amplitude of the frequency f at time t is closer to 1, and whiter where the amplitude is closer to 0. Much more will be said about the particular transform that computes the data behind a spectrogram and its mathematical definition later, but for now we have all of the tools to understand the basic meaning with our eyes. Both Figures 3.7 and 3.8 graphically convey the frequency components of a trumpet playing the pitch C5 (523.2 Hz—the fifth C on the piano).

When there appears to be a uniform, even distance between these spikes in the spectrum, we say that the overtone series is a harmonic series. Each spike represents a harmonic partial (or simply harmonic) of the overtone series of the fundamental, where the kth spike (counting left-to-right) is the kth harmonic.⁶ It is here that we uncover some compelling insight into the fact that there exist both physical and psychoacoustic foundations for harmony, in addition to an aesthetic one. As stated in Chapter 2, a major chord appears within the first four overtones of the fundamental frequency, and after that, a major-seventh chord. Experiments by Reinier Plomp show that humans can detect up to seven of the harmonic partials of complex tones, and musicians are better at this [14]. So, Western tonality is supported by the science behind it. With enough experience in reading spectra and spectrographs, your eyes in addition to your ears will be able to detect what kind of instrument is producing such sounds, and digitally, this information is in a matrix ready to be processed by your computer.

⁶ The harmonics are also called overtones. The kth harmonic will be the (k − 1)st overtone. Harmonic partials form a subset of the partials of a complex tone, where partials are all of the sine waves involved in a complex tone, and harmonic partials are those that can be calculated by integer multiplication of the fundamental frequency.


Figure 3.8: The spectrogram of the same audio signal used in Figure 3.7, that of a trumpet playing and holding the pitch C5. The steady horizontal lines imply that the frequencies and amplitudes are unchanging: Where black, the spectrogram shows which frequencies in the signal have the most power. The even spacing of these lines implies harmonicity: The frequencies are equally spaced, separated by a constant frequency, which is the fundamental frequency.


3.5 Chapter summary

This chapter examined the harmonic and periodic natures of rhythm, pitch, timbre, and temperament. Rhythm is temporally structured by meter, wherein a unit of time (like a quarter or eighth note) is specified as well as the quantity of them that can fit in one measure of music. Tuning systems help define how a culture organizes its alphabet of musical pitches into a harmonic vocabulary.

In general, the smaller the integers in the ratio between two frequencies, the more consonant their interval. Virtually every known tuning system contains the interval of an octave, which bears the ratio of 2:1 between its two frequencies (meaning the higher frequency is twice the frequency of the lower one). The ratio relating a perfect fifth, for example, in Pythagorean temperament and just intonation is exactly 3:2. However, because all songs are not written in the same key, small-integer ratios are not always ideal for instrument construction and playing, and are therefore approximated by some tuning systems such as equal temperament. Equal temperament uses the function f_2 = 2^(k/12) f_1 to calculate the frequency f_2 of a note k-many semitones (half steps) above a reference frequency f_1. If f_2 is below f_1, k is a negative integer. Therefore, in equal temperament, a perfect fifth is in the ratio of about 2.9966:2 instead of 3:2. This is close enough that our ears cannot detect a difference, and since it enables perfect transposition, equal temperament is the preferred Western tuning system, especially for instruments with a large frequency range and a high amount of tension in their build.
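The equal-temperament relation above is easy to compute; the following C sketch (mine, with A4 = 440 Hz as an assumed reference) evaluates the 2^(k/12) function and the tempered fifth of roughly 2.9966:2.

/* A minimal sketch of the equal-temperament relation f2 = 2^(k/12) * f1. */
#include <stdio.h>
#include <math.h>

/* Frequency of the note k semitones above (or below, for negative k) f1. */
static double equal_tempered(double f1, int k) {
    return f1 * pow(2.0, k / 12.0);
}

int main(void) {
    double a4 = 440.0;                              /* assumed reference pitch */
    printf("E5 (a fifth above A4): %.2f Hz\n", equal_tempered(a4, 7));
    printf("A3 (an octave below):  %.2f Hz\n", equal_tempered(a4, -12));
    printf("tempered fifth ratio:  %.4f (vs. 3/2 = 1.5)\n", pow(2.0, 7.0 / 12.0));
    return 0;
}
/* 2^(7/12) is about 1.4983, i.e., roughly 2.9966:2 rather than 3:2. Compile with -lm. */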

Timbre is unique to every acoustic instrument and references the mixture of frequencies and their amplitudes present in the instrument's tone. Every individual instrument is different, but classes and types of instruments behave similarly with respect to frequency and loudness of harmonic partials. This illuminates some of the need for the Fourier transform in music.


4. Musical instruments

Three common ways in which one may produce pitched sound are (1) striking a resonant cavity, such as a drum; (2) causing vibrations in the air in a tube, such as a horn or woodwind; and (3) plucking a string stretched across a resonant box, such as a violin. All of these involve the pressure of air in resonant bodies, and their features (dimensions, materials, and holes) completely specify the timbre of an instrument. Here I will explore the physical properties of some popular musical instruments: The piano, viols, winds, drums, and electric guitar. This is not a comprehensive list of instruments by any means, but paying attention to the individual spectra given for each instrument may aid the understanding of Fourier transforms of music, especially polyphonic music containing more than one instrument.

4.1 The piano

The piano produces sound by a hammer striking a string stretched over a soundboard, which amplifies the string's energy. It replaced the clavichord and harpsichord by combining their best qualities: The clavichord had the advantage of control over its volume, but did not get very loud, while the harpsichord lacked any sort of control over its dynamic range, but could produce loud sounds. Around the time of the piano's invention in the 18th century, use of the harpsichord and clavichord practically vanished.

The 1920s are considered the last great era of piano ownership; its end is attributed to the invention of the mass-produced automobile and the economic effects of the Great Depression. A piano in the household was a symbol of status, much like a car. A 1926 estimate claimed that half of city dwellers in America owned a piano. In 1927, 250,000 pianos were produced in America, whereas in 1932, just 25,000 were.

Some designs of pianos were modifications of the original, aimed at straying from the supremacy of the C major scale and the Western system of 12 tones. Pianos that employed different tonal systems virtually all divided the octave into more than 12 parts, so these were dubbed microtonal pianos. The first of these was invented in 1892, and quarter-tone pianos (the usual piano is semitonal, with keys separated by half steps) were actually somewhat popular in the 1920s. There was even a piano with 96 divisions of the octave and 97 keys in total, to span just one octave [71].

There are six main sections of the piano: (1) the metal frame necessary to support the large amount of tension imposed by the strings; (2) the soundboard and bridges; (3) the strings, usually made of steel; (4) the action, consisting of the keys, hammers, and levers; (5) the foot pedals; and (6) the wooden casing. The cheaper and more compact upright pianos have strings on a vertical plane, with a more complex action. Action is a fairly literal term for the region at which the sound is catalyzed, wherein the key triggers a series of levers, eventually triggering the hammer, which strikes the string. Altering the speed of the action manipulates the attack rate.

If all of the strings in a piano were identical in density, the piano would need to be unrealistically long to produce the lower end of its wide frequency range (27.5 to 4186 Hz). By thickening a string and relaxing some of its tension, one can increase its effective length and thereby lower its fundamental frequency. We calculate the fundamental frequency from the velocity of propagation along the string, v, but this is quite tricky to determine by analyzing the string directly. Since v = sqrt(T / (m/L)), where T is the tension in newtons¹ and m/L is the mass (in kilograms) of the string per unit length (one meter),² we can substitute for v and get

    f_0 = v / (2L) = sqrt(T / (m/L)) / (2L).

¹ One newton is equal to 1 kilogram-meter per squared second, kg·m/s².

Figure 4.1: The components of the piano. Image from [94].
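As a quick numerical check of the formula above, here is a C sketch; the tension, mass-per-length, and speaking-length values are illustrative guesses of mine, not measurements from the book.

/* A sketch of the string formula above, f0 = sqrt(T/(m/L)) / (2L). */
#include <stdio.h>
#include <math.h>

/* Fundamental frequency of an ideal fixed string:
 * tension T in newtons, mu = mass per unit length in kg/m, length L in meters. */
static double string_f0(double T, double mu, double L) {
    double v = sqrt(T / mu);     /* speed of the transverse wave on the string */
    return v / (2.0 * L);        /* the fundamental's wavelength is 2L */
}

int main(void) {
    /* roughly piano-like numbers (assumed): 700 N tension, ~4 g/m, 0.62 m speaking length */
    printf("f0 = %.1f Hz\n", string_f0(700.0, 0.004, 0.62));
    return 0;
}
/* Compile with -lm. Thickening the string (raising mu) or slackening it
 * (lowering T) lowers f0, as described in the text. */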

The force that acts on vibrating strings and other masses with tension to return them to their resting state is called the restoring force. The restoring force produces the overtone series of a string. J.W.S. Rayleigh observed that if the tension, and thus the restoring force, is relatively low in a string, then not all of the partials (especially higher ones) will be harmonic [12, 17]. Thus, the lower keys are less likely than higher keys to have exactly harmonic partials.

² In some cases you will see the mass per unit length (m/L) of a string written with the Greek letter mu, µ.


The frequency response of a piano varies according to the velocity of the player's fingers on the keys. The loudness of the fundamental and the overtone series is proportional to the amount of force used in striking its keys. Therefore, when the force is slow and soft, fewer overtones can be heard.

Figure 4.2: This graph depicts the fast Fourier transform of a Yamaha piano playing D4, approximately 311.1 Hz. The heights of the spikes indicate the relative amplitudes of the frequencies. Note the evenly spaced intervals between spikes: This means that the overtone series is harmonic.

The pedals of the piano manipulate the sustain of its sound, and the soundboard amplifies it, accelerating the sound waves as they travel through the dense wood and coupling the vibrations of the strings with the vibrations of the air. The soundboard has a fairly uniform response to all frequencies in its wide range [15].

The harmonics of piano strings, especially in the low-frequency range, are more widely spaced than simple whole-number ratios would suggest, due to the inharmonicity imposed by the thicker strings and their relatively relaxed tension. For this reason pianos are tuned with a (slightly) expanded octave called a stretched octave. This octave sounds out of tune with true integer-related harmonics (like those produced by infinitesimally thin and taut strings) but sounds in tune with strings that have unique restorative forces at work due to their different physical characteristics.

4.2 The viol family

In a way, the violin is truly one of the greatest enigmas in all of music. It is one of the few musical instruments whose paradigm was set early in its history, by Antonio Stradivari between 1680 and 1700 [8]. Today, Stradivarius violins from this time period can be worth hundreds of thousands or even millions of dollars. Though a sizable amount of the rationale behind this is their beautiful tone, Stradivarius violins have become so romanticized over the centuries that they are truly legendary.

Scientifically, Stradivarius violins have more evenly spaced resonant frequencies in the frequency response of their body than other violins, even when compared to others of very high quality like Guarneris. The frequency response is a graph showing frequency peaks that come from the resonance of the instrument, not the tones played on it. It can be found by playing all of the frequencies in the range of an instrument with equivalent input force and removing the harmonics that come from what is played.

The main resonances of interest in a violin are the main wood resonance (W) and the air resonance (A). The wood prime resonance (W′) is a result of the harmonics of the wood resonance, and is about an octave below W, sometimes sounding below the fundamental frequency of the played pitch. The main wood resonance is the resonance resulting solely from the wooden parts of the violin, and is typically around 440 Hz (A), the frequency of the second-highest (in pitch) open string. The air's resonant frequency is ideally a fifth below the main wood resonance (a 2:3 ratio), so it is approximately the frequency of the D-string [98]. The area of the f-holes directly corresponds to the air resonance of the violin: A larger f-hole area implies a higher resonant frequency, since the frequency of the air resonance rises with the area of the opening in the violin's body. These holes also act as band-pass filters, giving preference to the resonance of a small range of frequencies. The holes of woodwinds also have this property: Closing and opening them affects the pitch of the instrument in a localized way.

Figure 4.3: The frequency response of an average (poor) violin. Ideally, the resonant frequency of the cavity (A) is a fifth below the main wood resonance, and the wood prime resonance (W′) is an octave below the main wood resonance (W).

Figure 4.4: The ideal frequency response of a good violin. When these resonant pitches are played on the strings, their loudness will be enhanced by this curve.


Like any other instrument, each component of the violin contributes something to its tone. The type and quality of the materials is of the utmost importance because the violin is constructed of fewer individual parts than the piano. Modern violins have been tweaked from their original design to be fretless, heavier, and stiffer. Also, the original violin had a flat back and was called a viol, which is the name we give to the violin's family, consisting of the viola, cello, and bass [18].

Ernst Chladni was one of the first people to investigate the resonance of a violin. He invented a technique, published in 1787, that could show the resonance patterns of the front and back plates using sand. When the plate is excited with a bow, the sand drifts away from places of high vibration (the antinodes) and collects along the nodes, where there is little motion. These patterns are highly symmetrical and vary with frequency. Napoleon Bonaparte was so impressed with this research that he requested a personal demonstration of the technique and commissioned Chladni in 1809 to translate his publication into French for a handsome sum of 6000 francs (around 80,000 USD today) [10].

With these plates, Chladni showed why the body of a violin does not have a uniform response to all frequencies within its range. He was the first to map the resonance curve of the violin, thus turning the art of making instruments into a science.

The violin and cello have nearly identical construction to each other but differ in size, as do the viola and double bass. The open tunings of the strings within the viol family reflect their scaled size, but maintain an interval of a perfect fifth between neighboring strings (except for the double bass, with strings separated by perfect fourths). When the strings are set into motion, the sound is initiated. Because the string is fixed at both ends, the wavelengths of its modes of vibration are limited to integer divisions of twice the string length. At the nodes of a fixed string, one can produce harmonics by making light contact with the string at these points.

Figure 4.5: Several of the vibrational modes of a violin, as depicted by Chladni's plates. The plates show the response to the vertical and horizontal modes at their resonant frequencies. The white portion shows where the sand settled during the vibration, and the black shows where the sand fell off. The first mode is the circular mode, showing resonant modes parallel to both the vertical and horizontal axis. The second shows the resonant frequencies of the modes parallel to the horizontal axis. The third mode is the lateral mode, vibrating at the resonant frequency parallel to the diagonals of the violin's body. Finally, the fourth mode shows the resonance of the vertical mode.

Figure 4.6: The first four modes of a fixed string. Because a string is secured at both ends, its overtone series is defined according to its length, and the wavelengths of the frequencies it contains are restricted to integer divisions of twice that length, i.e., their wavelengths will be 2L, L, 2L/3, and so on. A non-integer ratio would mean that the string was loose at one end. The dots indicate the positions of nodes, where one may lightly press on a stringed instrument and produce harmonics.

The fingerboard, body, and sound post amplify the sound because the air molecules accelerate in the dense wood. The sound post is located inside the body, connecting the front plate to the back plate. Functionally, it is a fulcrum, and it has such an impact on the tone that the French call it l'âme, meaning the soul. Removal of the sound post gives the violin a timbre similar to that of a guitar, which lacks a sound post.

The sound resulting from a bowed violin string is much different from that of a plucked string. This is because the bowed string's waveform is not a sine wave, but closer to a sawtooth wave. The violin's timbre is one of the hardest to reproduce with synthesizers and software programs due to the complex harmonics of its unusual body and the behavior of the bowed string.

Figure 4.7: The pressure variations of a sawtooth wave. The sound has a similar timbre to that of scratching your fingernail along the edge of a quarter or other ribbed surface.
surface.<br />

Using a vibration microscope, Hermann von Helmholtz was the first to observe the jagged nature of the waveform depicted in Figure 4.7 [18]. The wave results from the friction of the bow sliding across the string. When the bow is taken off of the string, the energy of the string decreases, and the waveform becomes increasingly smooth until it dies away. Helmholtz also saw that, when a string is bowed, it exhibits an increased preference for exactly periodic behavior, meaning that the entire overtone series is better amplified. This is called mode-locking [8]. When a single pitch is played at a constant bowing speed, the modes of the fixed string are "locked" into integer multiples of the fundamental, regardless of the resonance of the violin's body. This also happens when air is blown at a constant rate into a wind instrument. The constant pressure locks the modes of vibration into the harmonic series, regardless of the natural resonant frequencies of the instrument.

Figure 4.8: The spectrogram of a solo, bowed violin from Shostakovich's Fifth Symphony. The bowed string produces a rich harmonic overtone series, depicted by the evenly spaced horizontal black lines as in Figure 2.7. The line wavers (between 1:54 and 1:55 and again at 1:56 and 1:57) when the musician plays with vibrato, increasing and decreasing the center frequency.

So, the modes of vibration of the piano and violin are somewhat similar because of the harmonic nature of the fixed string. However, the action of a piano gives the tone of its steel strings a much different amplitude envelope than the violin bow gives the violin's catgut strings. The friction of the bow on the string of a violin reshapes the resulting waves into the form of a harmonics-rich sawtooth wave. The harmonic series of sine waves contained in the timbre of sawtooth waves (i.e., the Fourier series of sawtooth waves) is given in Chapter 7.

Figure 4.9: The spectrogram of a plucked violin string, from Shostakovich's Fifth Symphony. The melody is the same as in the clip above of the bowed string, though shifted a little in time. The blackest regions show the onset and attack of the pluck, while in the spectrogram of the bowed string, the blackest regions corresponded to both the attack and the sustain. Not only do their attack envelopes differ: The harmonic overtone series of the plucked string is not as rich (i.e., there are not as many audible/visible overtones) as that of the bowed string.

4.3 Woodwinds and brasses

It is a bit easier to express the resonance of wind instruments because there are only three essential components to consider: The reed, the bore, and the side holes.
bore, <strong>and</strong> the side holes.


Figure 4.10: This graph depicts the fast Fourier transform of a viola playing C5, approximately 523.2 Hz. In the higher partials, the peaks are thicker, expressing the rich tone of the viola. But energy is really lacking at the third partial, which should be G6.

Figure 4.11: The parts of a simple woodwind instrument, the recorder.

The reed functions as a valve. The player's lungs, mouth, and lips interact with the rest of the instrument to produce a series of periodic puffs. The frequency of these puffs influences the fundamental pitch of the resulting sound. The intensity of the puff determines the amount of pressure in the bore of the instrument, and the changes in pressure, because they are periodic, cause the instrument to make pitched sound. Therefore, wind instruments resonate via pressure waves.³

³ This is not the case for the flute or piccolo. They vibrate by the periodic changes in the velocity of the stream of air played over the mouth hole, so they will often be an exception to the other wind instruments. However, this stream of air still acts as a valve, and is therefore still essentially a reed.


Figure 4.12: The basic parts of valveless cylindrical and conical brass horns.

Inside of a teapot, pressure increases with temperature. The air molecules inside move faster and change position more quickly. As these molecules rise upward to try to escape through the spout of the teapot, they bounce back and forth more and more rapidly as the temperature increases, and thus make the teapot whistle at an increasing pitch and intensity.

Analogously, the more that pressure is varied at the mouthpiece, the shorter the time between puffs becomes, i.e., the period T decreases and f increases. Woodwinds can have single reeds (saxophone, clarinet) or double reeds (oboe, bassoon). Brasses do not have a physical reed in their mouthpiece because the mouthpiece causes pressure variations in the player's lips. So, the lips act as the reed in brass instruments.

The bore is the shaft of a woodwind. Because it is greater in size than the stiff reed, it influences pitch to a much larger degree. The opposite is true of the brass instruments: Their reed (the lips, mouth, vocal cords, and lungs) is far more massive and strong than the bore, so the reed has greater control over pitch.


Figure 4.13: Pressure varies with temperature and density, both of which are changing in a boiling tea kettle. The air particles in the kettle are increasingly excited by the increasing pressure, and their motions become more rapid. Once the pressure becomes so high that the cavity of air is not large enough to support the expanded molecules inside of it, they head towards the exit, pressure still increasing. This creates a gliding, whistling noise. Its frequency reflects the motion of the molecules, as well as the temperature.

The length of the bore determines the fundamental, resonant frequency<br />

of a wind instrument, <strong>and</strong> consequently, its over<strong>to</strong>ne series.The<br />

length can be altered by opening <strong>and</strong> closing the side holes, so it is<br />

only when all are closed that the bore’s length is its physical length.<br />

The size <strong>and</strong> spacing of the holes in the bore also influence the bore’s<br />

effective length: Larger holes leak more air <strong>and</strong> the spacing determines<br />

the intervals between the pitches it can produce. These dimensions<br />

also have an effect on the timbre of the woodwind. 4 For brasses, the<br />

4 This is not the case for flutes, nor brasses: Their timbre is not nearly as sensitive<br />

<strong>to</strong> their side holes. Flutes, on the other h<strong>and</strong>, are constructed with nearly equally<br />

spaced <strong>and</strong> equally sized holes. Altering the fundamental frequency by changing the


Figure 4.14: The effective length of the bore in a wind instrument changes when the side holes are opened. The size of the holes influences the effective length. The shorter the effective length, the higher the frequency that is produced.

Woodwinds can have either closed-end (the mouthpiece is open and the end of the bore is closed) or open-end (both ends are open) bores. In open-end bores, the wavelength of the fundamental λ_0 is exactly equal to double the effective length of the bore, 2L, and the wavelengths of the overtones are 1/2, 1/3, 1/4, etc. times λ_0. Therefore, the fundamental frequency of open-end winds is given by f_0 = v/λ_0 = v/(2L), and the overtones are the full harmonic spectrum, f_0, 2f_0, 3f_0, 4f_0, and so on. The waves in a closed-end bore must have an antinode at the closed end of the bore, so the timbre skips the even-numbered partials (see Figure 4.15 below). Here, λ_0 = 4L, and the overtones have 1/3, 1/5, 1/7, etc. times the length of λ_0. The fundamental frequency of closed-end winds is f_0 = v/(4L), and its overtone spectrum consists of only the odd partials, f_0, 3f_0, 5f_0, and so on.
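The two bore formulas above can be illustrated with a short C sketch (mine; the 0.6 m effective length and the 343 m/s speed of sound in air are assumptions for illustration).

/* A small sketch of the open-end (f0 = v/2L) and closed-end (f0 = v/4L)
 * bore formulas, printing the first few partials of each overtone series. */
#include <stdio.h>

int main(void) {
    double v = 343.0;     /* speed of sound in air, m/s (room temperature) */
    double L = 0.6;       /* assumed effective bore length in meters */

    double f_open   = v / (2.0 * L);   /* open at both ends */
    double f_closed = v / (4.0 * L);   /* closed at the end opposite the mouthpiece */

    printf("open-end bore:   ");
    for (int n = 1; n <= 5; n++)                 /* full harmonic series */
        printf("%.0f ", n * f_open);
    printf("Hz\nclosed-end bore: ");
    for (int n = 1; n <= 9; n += 2)              /* odd partials only */
        printf("%.0f ", n * f_closed);
    printf("Hz\n");
    return 0;
}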

These nice, small-integer ratios between fundamental frequency and bore length relate to the nodes and antinodes of the pressure waves occurring inside of the bore. The pressure of the sound wave originates at the mouthpiece, so here we have a node. In an open-end bore, all of the nodes are contained within the column. At the open end, the wave is reflected back in an equally opposite way (a compression, versus a rarefaction), and so it forms a node. At the end of a closed bore, there is a node beyond the length of the air column, so there is an antinode.

Figure 4.15: The modes of vibration of three different types of columns of air: Those open at both ends, those closed at the end opposite the mouthpiece, and those with a horned end. These graphs depict the first four modes of vibration, where the curves depict the pressure at the location indicated on the horizontal axis.


Instrument     Type of wind instrument     Type of bore
Clarinet       Single-reed, woodwind       Closed, cylindrical
Saxophone      Single-reed, woodwind       Closed, conical
Flute          Flute, woodwind             Open, cylindrical
Piccolo        Flute, woodwind             Open, cylindrical
Oboe           Double-reed, woodwind       Open, conical
Bassoon        Double-reed, woodwind       Closed, conical
Recorder       Flute, woodwind             Closed, cylindrical
Cor anglais    Double-reed, woodwind       Closed, conical
Trombone       Brass                       Open, cylindrical
Cornet         Brass                       Open, conical
French horn    Brass                       Open, conical
Tuba           Brass                       Open, conical
Euphonium      Brass                       Open, conical
Trumpet        Brass                       Open, cylindrical

Table 4.1: The families of woodwind and brass instruments and the nature of their bores.

Nodes and antinodes are formed by the destructive and constructive interference of standing waves. In Figure 4.15, you can see that all of the bores have zero pressure, a node, and an amplitude of 0 at the mouthpiece (the left-most point on the graph). The pressure wave defines the locations of nodes (where amplitude is 0) and antinodes (where amplitude is extreme), so it follows that at the mouthpiece of all wind instruments there is a node. Since acting upon the mouthpiece initiates the pressure wave, the mouthpiece's behavior strongly influences the pressure within the bore.

The change in the movement of the traveling waves in a wind instrument can be modeled as 90° out of phase with the change in the pressure waves (again referring to Figure 4.15), wherein the variation of motion is greatest at the mouthpiece for all of the instruments while the change in pressure is zero. So, if the pressure is modeled by sin(ω) or −sin(ω), the rate of motion can be modeled by cosine waves, sin(ω + π/2) = cos(ω) and −sin(ω + π/2) = −cos(ω). Therefore, the pressure varies most when the motion varies the least, and vice versa. This defines input impedance: The ratio between the pressure amplitude set up in the mouthpiece and the excitatory flow that gives rise to it, in the words of Arthur Benade [19]. The standing wave inside of the bore periodically changes in amplitude as the pressure wave carrying the input impedance from the mouthpiece propagates towards the end of the bore, against the motion of the reflected wave traveling back toward the mouthpiece.

Now, each of the waves given in the first graph has the exact same frequency and wavelength (λ = 2L) because both are determined by the length of the bore. In the open-end column, the length is L, making the wavelength λ equal to 2L, and both the even- and odd-numbered modes have energy. In the closed cylinder, the length is L/2, so that λ is still equal to 4(L/2) = 2L, and only the odd-numbered modes are articulated. In the conical-bore column, all the modes are sounded, but their individual energies depend on the radius of the bore at a given location.⁵

⁵ Most brass instruments are conical with some percentage of cylindrical tubing, and all of them have a flared bell. For more on the bores of wind instruments, see Arthur H. Benade's Fundamentals of Musical Acoustics.

The bore of a wind instrument is either cylindrical or conical, and sometimes has a flared bell at its end (as in all of the brass instruments and the clarinet). The cross-sections of the bore at any point in both types can be determined by mathematical functions proportional to Bessel functions. The order-2 Bessel function closely resembles the shape of the flared horn, while the order-0 Bessel function describes cylindrical bores. Considering that the cylindrical bore has a constant cross-section, it is logical that it would be modeled by a constant function (i.e., a function of order 0). The cross-section of a conical bore increases linearly, so an order-1 Bessel function would suit its area. Finally, the cross-section of a flared bore is parabolic and is therefore modeled by quadratic functions (e.g., an order-2 function).

Bessel functions are a pretty advanced topic in mathematics, but their form is actually surprisingly similar to that of the continuous Fourier transform that we'll study in Chapter 7. The notation J_n(x) represents the nth Bessel function, and t is time:

    J_n(x) = (1/2π) ∫_{−π}^{π} e^{−i(nt − x sin t)} dt.

So the order 0 and order 2 functions are

    J_0(x) = (1/2π) ∫_{−π}^{π} e^{i x sin t} dt,
    J_2(x) = (1/2π) ∫_{−π}^{π} e^{i x sin t − i2t} dt.
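One way to evaluate the integral form above numerically is a simple trapezoidal sum; the following C sketch (mine, not the book's) integrates the real part of the integrand, cos(nt − x sin t), which equals J_n(x) because the imaginary part is an odd function of t and integrates to zero over [−π, π].

/* A sketch that evaluates the integral form of J_n(x) given above
 * with a trapezoidal sum over [-pi, pi]. */
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static double bessel_J(int n, double x) {
    const int N = 2000;                              /* number of integration steps */
    double sum = 0.0;
    for (int k = 0; k <= N; k++) {
        double t = -M_PI + 2.0 * M_PI * k / N;
        double w = (k == 0 || k == N) ? 0.5 : 1.0;   /* trapezoid end weights */
        sum += w * cos(n * t - x * sin(t));
    }
    return sum / N;   /* step size (2pi/N) times the 1/(2pi) prefactor is 1/N */
}

int main(void) {
    printf("J0(2.405) = %.6f (the first zero of J0 is near 2.405)\n", bessel_J(0, 2.405));
    printf("J2(1.0)   = %.6f\n", bessel_J(2, 1.0));
    return 0;
}
/* Compile with -lm. */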

Due to the inverse square law, the intensity of the pressure and motion decreases with larger surface area. The horn function U determines how much energy exits the horn and how much returns to the mouthpiece to produce standing waves. This is the degree of flare in the bell, calculated from r_int × r_ext in the horn equation:

    U ≈ 1 / (r_int × r_ext).

The variables r_int and r_ext are the interior and exterior radii at any given point on the horn, as shown in Figure 4.15.

Using the horn function at a corresponding point on the horn, we can calculate the acoustic wavelength λ:

    λ = v / sqrt(| f² − U · (v/(2π))² |).

The formula is similar to the usual λ = v/f, but here the denominator is the square root of the difference between the frequency squared (1/s²) and the horn function times the squared velocity (1/m² · m²/s² = 1/s²), so the wavelength varies based on the values of U and f. This is called the horn equation.
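Here is a short C sketch (mine; the radii and the 440 Hz frequency are made-up values purely for illustration) that plugs numbers into the horn equation as written above.

/* A sketch of the horn equation quoted above,
 * lambda = v / sqrt(|f^2 - U (v/2pi)^2|), with U = 1/(r_int * r_ext). */
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

int main(void) {
    double v = 343.0;                       /* speed of sound, m/s */
    double r_int = 0.02, r_ext = 0.06;      /* assumed radii at one point of the bell, m */
    double f = 440.0;                       /* assumed frequency, Hz */

    double U = 1.0 / (r_int * r_ext);                                /* horn function at that point */
    double lambda = v / sqrt(fabs(f * f - U * pow(v / (2.0 * M_PI), 2.0)));

    printf("U = %.1f 1/m^2, lambda = %.3f m (vs. v/f = %.3f m)\n", U, lambda, v / f);
    return 0;
}
/* Compile with -lm. The point is only how U shifts the wavelength away
 * from the simple v/f value. */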

Energy is lost <strong>to</strong> friction <strong>and</strong> heat proportional <strong>to</strong> frequency, so the<br />

high partials of higher frequencies have less energy than the high partials<br />

of lower frequencies. Impedance dominates harmonics in softly



played notes; playing more loudly increases the amount of energy first in the lower harmonics and then in the higher ones. Walter Worman quantified

this phenomenon <strong>and</strong> found the amplitude of the second harmonic<br />

grows so that doubling the strength of the fundamental quadruples<br />

the strength of the second harmonic [19]. Furthermore, the strength of<br />

the second harmonic is nearly proportional <strong>to</strong> the input impedance of<br />

the bore at this frequency. Likewise for the third <strong>and</strong> higher harmonics:<br />

The third harmonic grows eightfold for every doubling of the strength<br />

of the fundamental, the fourth grows sixteenfold for each doubling<br />

of f 0 , <strong>and</strong> so on. Thus, the nth harmonic grows as the nth power of<br />

the fundamental’s pressure amplitude. Otherwise, the response of<br />

wind instruments is fairly constant with respect <strong>to</strong> frequency, so their<br />

frequency spectrum merely shifts <strong>to</strong> the frequency of the fundamental.<br />

Worman’s observations of the input impedance only describe harmonics<br />

in the mouthpiece. The frequency spectrum of the sound<br />

exiting from a brass instrument is not nearly as refined or detailed<br />

as the measurements of the spectrum taken inside the mouthpiece<br />

and the bore. Because the bell leaks more energy at higher frequencies,

sound waves that exit the bell experience a "treble boost," wherein<br />

higher components are radiated in<strong>to</strong> the room. Furthermore, when<br />

a player places his or her h<strong>and</strong> in<strong>to</strong> the bell <strong>to</strong> modify its timbre, the<br />

over<strong>to</strong>ne series exp<strong>and</strong>s <strong>to</strong> articulate over<strong>to</strong>nes as high as 1500 Hz<br />

above the preexisting over<strong>to</strong>nes.<br />

The fast Fourier transforms of wind instruments in Figures 4.16–4.19

reveal their highly harmonic nature.



Figure 4.16: The FFT-derived frequency spectrum of an oboe playing A♭4 (415.3 Hz).<br />

This spectrum is particularly harmonic with nearly exactly 414.6 Hz separating each<br />

partial.<br />

Figure 4.17: The FFT-derived frequency spectrum of a bassoon playing E♭3 (155.6 Hz).<br />

The fundamental frequency is not the strongest; rather, the third harmonic is, much<br />

like the trumpet.<br />

Figure 4.18: The FFT-derived frequency spectrum of an alto saxophone playing E♭3 (155.6 Hz). Its third harmonic is also the strongest, like the trumpet and bassoon.

Figure 4.19: The FFT-derived frequency spectrum of an A-flute playing C5 (523.2 Hz).

Because pressure waves are invisible, it may be harder to believe that a column of air produces integer-ratio harmonics of the fundamental frequency than to believe that a fixed string produces harmonics. Wind instruments, amazingly, produce frequency spectra far more highly harmonic than those of viols or pianos. This is largely due to the simplicity of their design: There is no interaction of an external mechanism like a bow or hammer. It is just the reed and the bore that produce pitched sound in a wind instrument.



Mode    Relative freq.    No. of semitones above f_0
C_1     f_0               Unison
C_2     2.295 f_0         14.4
C_3     3.598 f_0         22.2
L_1     1.593 f_0         8.0
L_2     2.135 f_0         13.1
L_3     2.917 f_0         18.6

Table 4.2: The relative frequencies of the six simplest circular modes [20].
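The semitone column of Table 4.2 follows directly from the frequency ratios: the number of equal-tempered semitones between two frequencies is 12 log2 of their ratio. A quick check in MATLAB:

    % Convert the relative mode frequencies of Table 4.2 into semitones above f0.
    ratios    = [1, 2.295, 3.598, 1.593, 2.135, 2.917];   % C1 C2 C3 L1 L2 L3
    semitones = 12 * log2(ratios)   % approx. 0, 14.4, 22.2, 8.1, 13.1, 18.5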

4.4 Drums<br />

Drums resonate via excitation of a tense membrane in<strong>to</strong> the drum body.<br />

Because of drums’ highly complex frequency spectrum, they usually<br />

have ambiguous pitch. We describe their sounds with onoma<strong>to</strong>poetic<br />

words like "booms" or "taps." For this reason, we consider percussive sounds to be more complex than other instruments', but describing their vibrations isn't necessarily complicated.

A drum consists of three parts: (a) a drumhead, which is a stretched,<br />

tense membrane of some material; (b) a drum body, a hollow cavity;<br />

<strong>and</strong> (c) a means of affixing the drumhead <strong>to</strong> the body. The third<br />

element (c) can feature knobs that allow the tension of the drum <strong>to</strong> be<br />

tuned. These knobs can affect the quality of a drum’s sound by tuning,<br />

but otherwise they theoretically have no bearing on the timbre.<br />

The modes of vibration of the drum are determined by the membrane. As it is stretched and becomes thinner, the fundamental frequency and the magnitude of the attack rate rise. The six simplest modes of circular membranes are depicted in Figure 4.20. When the membrane is not circular, as with many drumheads from the 1980s, the timbre is even more complex: The modes are slightly different and more asymmetric, because the circle is the most symmetrical shape. The first three of these modes are circular modes, and the last three are lateral modes.



Figure 4.20: The six simplest modes on circular membranes: The highly symmetrical<br />

circular modes C 1, C 2, <strong>and</strong> C 3 only have nodes located at concentric circles on the<br />

membrane and at the membrane's outer edge. The lateral modes L 1, L 2, and L 3 have

both linear nodes crossing through the center of the membrane as well as circular<br />

nodes. The fixed outer edge is always a node. Images captured from [99].<br />

The circular modes above are completely symmetrical about the<br />

center of the drumhead, <strong>and</strong> the lateral modes are symmetrical about a<br />

diameter (or two) of the membrane. Nodes exist at points (or regions)<br />

where the membrane is stationary while the rest of the membrane<br />

is vibrating. The first circular mode, C 1 , produces the fundamental<br />

frequency of the drum <strong>and</strong> has greater energy than all of the other<br />

modes. It produces the "thumping" sound of the drum. Its node is<br />

located on the circular outer edge of the membrane, <strong>and</strong> it is a node<br />

for all the other modes because it is fixed.<br />

The lateral mode L 1 has a linear node extending the diameter of<br />

the membrane <strong>and</strong> a circular node identical <strong>to</strong> C 1 . It contributes the<br />

second most energy <strong>to</strong> the drum’s timbre with a frequency 1.593 times<br />

that of the fundamental frequency produced by C 1 . L 2 has diametrical<br />

nodes in an "X" pattern <strong>and</strong> the membrane’s perimeter, <strong>and</strong> it has the<br />

third greatest amount of energy. Its sound takes the longest time <strong>to</strong><br />

decay due <strong>to</strong> its poor sound radiating efficiency [91]. Finally, L 3 , C 2 ,<br />

<strong>and</strong> C 3 have circular nodes all centered at the center of the membrane,



but with different sizes (concentric circles). The circular nodes all add<br />

<strong>to</strong> the thumping sound of C 1 <strong>and</strong> have the shortest decay times.<br />

Just as the body of a violin interacts with its vibrating strings, the body of a drum serves to amplify and resonate the vibrations of its

membrane. Symbiotically, the body’s vibrations influence the vibrations<br />

of the drumhead because they are directly connected.<br />

Primarily, the resonant drum body shows preference for harmonic<br />

or close-<strong>to</strong>-harmonic partials over inharmonic ones. This doesn’t mean<br />

that the over<strong>to</strong>ne series is completely harmonic, but it does mean that<br />

the resonant body tends <strong>to</strong> amplify harmonic partials <strong>and</strong> attenuate<br />

inharmonic partials. This is true of all resonant bodies, <strong>and</strong> the nature<br />

of the source of excitation, whether by a fixed string, periodic pressure<br />

variations, or a mallet, determines the harmonicity of the frequency

spectrum.<br />

The location of the strike greatly influences the timbre of drums.<br />

Striking the drum is essentially like exciting it with an impulse. When<br />

the drum is struck at the location of one of the nodes, that mode will<br />

not produce sound, because nodes exist where the drumhead is static.<br />

Therefore, hitting the drum at the exact center will silence the circular<br />

modes, but hitting it just off-center will excite all the modes.



Figure 4.21: The frequency spectrum of a snare drum with chains engaged.<br />

Figure 4.22: The frequency spectrum of a <strong>to</strong>m drum.<br />

The drum's inharmonicity and prevalence in all types of musical styles tell us that musical instruments do not need a harmonic overtone series to sound pleasurable. The complexity of percussion instruments may allow us to maintain interest in percussion sounds repeated many times. Drums' relatively inharmonic nature allows for virtually any sort of harmonic and melodic material to be layered on top of them, without any perception of dissonance from the combination of frequencies.

Due <strong>to</strong> the relatively strong attack rate of drums <strong>and</strong> percussion<br />

instruments, it is fairly easy <strong>to</strong> detect their onset in sound files by looking<br />

at the signal and spectrogram. Drums' frequency response is quite

different from other instruments’ because of drums’ inharmonicity<br />

<strong>and</strong> fundamentals in the low-frequency range. In polyphonic music,<br />

the part of the Fourier transform output that corresponds <strong>to</strong> drums is<br />

one of the easiest <strong>to</strong> identify.<br />

4.5 Electric guitars <strong>and</strong> effects units<br />

Electric guitars are stringed instruments, <strong>and</strong> therefore their vibrational<br />

modes are those of the fixed string. But since their output signal is a<br />

varying voltage produced by alternating current (AC), it can be filtered<br />

in real-time by analog circuits found in guitar effects pedals.<br />

<strong>An</strong> electric guitar transmits the movements of its strings through<br />

magnetic pickups, which can be single-coil or humbuckers. These pickups<br />

act as "antennae" for the movement of the strings, inductively<br />

detecting electromagnetic radiation <strong>and</strong> producing a small electric<br />

current. Unless the guitar is powered by a battery, this is all done<br />

through passive electronics: Current is produced strictly by the laws<br />

of electromagnetic physics. Humbucking pickups eliminate the 60<br />

Hz <strong>to</strong>ne ("bucking the hum") produced by the frequency of AC from<br />

North American wall sockets 6 by generating two input signals: One<br />

180 ◦ out of phase with the other. When the difference of these signals is<br />

taken, the input signal is doubled because of their phase relationship.<br />

But the induced signal is canceled <strong>to</strong> zero, because it is induced nearly<br />

identically in both pickups. High quality microphone cables likewise<br />

have two wires, one 180 ◦ (or, any odd multiple of π radians) out of<br />

6 In Europe, this is 50 Hz.



phase with the other, to produce the same result. This is a common technique called a differential pair.
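As a minimal sketch of this differential idea in MATLAB (the frequencies and amplitudes below are illustrative assumptions): the string signal appears with opposite polarity in the two coils, while the hum is induced nearly identically in both, so taking the difference doubles the signal and cancels the hum.

    % Differential (humbucker-style) hum cancellation, sketched with sine waves.
    fs = 44100; t = (0:fs-1)/fs;      % one second of samples
    string = sin(2*pi*196*t);         % hypothetical string signal (196 Hz)
    hum    = 0.3*sin(2*pi*60*t);      % 60 Hz hum induced in both coils
    coil1  =  string + hum;           % first coil
    coil2  = -string + hum;           % second coil, reversed winding/polarity
    out    = coil1 - coil2;           % equals 2*string; the hum cancels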

This alternating current, multiplied by the resistance of the electronic<br />

circuits inside of the guitar, produces a voltage 7 that is passed<br />

on <strong>to</strong> an amplifier. This input signal gets added <strong>to</strong> any other signal<br />

present, like the electromagnetic radiation present in the air <strong>and</strong> the<br />

electronic signal powering the amplifier.<br />

Effects pedals are connected between the guitar <strong>and</strong> the amplifier <strong>to</strong><br />

color <strong>and</strong> shape the waveforms <strong>and</strong> timbre of an electric guitar. Guitar<br />

effects pedals <strong>and</strong> other electronic effects aren’t actually instruments<br />

per se: They do not have a resonant body, but they do have a resonant<br />

circuit board. Because of the overwhelming presence of the electric<br />

guitar in genres such as rock, metal, <strong>and</strong> pop, it seems reasonable <strong>to</strong><br />

include at least some graphs of their frequency response when altered<br />

by some common effects.<br />

Overdrive dis<strong>to</strong>rtion ("fuzz") is by <strong>and</strong> large the most popular<br />

effect. Dis<strong>to</strong>rtion clips the electrical signal, flattening (<strong>to</strong> some degree)<br />

the <strong>to</strong>p <strong>and</strong> bot<strong>to</strong>m of the wave. Aurally, the result is more textured,<br />

described by words like "gritty," "dirty," <strong>and</strong> "warm." Mathematically,<br />

clipping adds more overtones to a wave because the strong attack rate and flat top turn the wave into a square wave, which is described

by an infinite series of odd-numbered harmonics. Clipping confuses<br />

our ears <strong>and</strong> excites more frequency b<strong>and</strong>s than are actually present in<br />

a signal. Warm overdrive adds harmonic over<strong>to</strong>nes. The rougher or<br />

grittier the sound is, the more inharmonic the added over<strong>to</strong>nes are.<br />

7 Ohm’s Law states that voltage is equal <strong>to</strong> the product of current <strong>and</strong> resistance,<br />

V = IR. <strong>An</strong>alog circuits are at the core of classic signal processing <strong>and</strong> produce<br />

continuous signals. However, in this book, we focus primarily on discrete, digital<br />

signals like those from a file of music on your computer. See Appendix A for more<br />

about analog circuits <strong>and</strong> signal processing in the electrical engineering sense.



Figure 4.23: The effect of clipping is that of adding an infinite series of odd harmonics to a signal, approaching a square wave. Depicted here is an approximation for the first six odd harmonics, given by the function

x(t) = \frac{4}{\pi} \left[ \sin(\omega t) + \frac{1}{3}\sin(3\omega t) + \frac{1}{5}\sin(5\omega t) + \frac{1}{7}\sin(7\omega t) + \frac{1}{9}\sin(9\omega t) + \frac{1}{11}\sin(11\omega t) \right],

where our frequency is simply 1 Hz, so ω = 2π.
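The sum in the caption is easy to reproduce; a minimal MATLAB sketch (1 Hz fundamental, as in the caption, so the flattened, nearly square shape is visible over a couple of periods):

    % Six-odd-harmonic approximation of a square wave, as in Figure 4.23.
    t = linspace(0, 2, 2000);         % two periods of the 1 Hz fundamental
    w = 2*pi;                         % omega = 2*pi since f = 1 Hz
    x = (4/pi) * ( sin(w*t) + sin(3*w*t)/3 + sin(5*w*t)/5 ...
                 + sin(7*w*t)/7 + sin(9*w*t)/9 + sin(11*w*t)/11 );
    plot(t, x);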

There are more harmonics shown as peaks in the Fourier transform<br />

of the fuzzy electric guitar, which can be produced by amplifying a<br />

clean signal past the point of clipping. The more clipping, the "dirtier"<br />

the sound. The above dis<strong>to</strong>rtion was produced by amplifying the clean<br />

guitar’s signal by 15 dB in the free sound editing program Audacity.<br />

The dis<strong>to</strong>rtion effect also causes a kind of compression (explained<br />

in more detail in Chapter 6) <strong>and</strong> is often used <strong>to</strong> give the guitar long<br />

sustain, like that of a bowed cello.<br />

Ring modulation is achieved by multiplying a signal by a pure sine<br />

wave (the carrier frequency) and outputting the sum and difference

<strong>to</strong>nes. Ring modulation is named from the shape of its analog electronic<br />

circuit, which contains a "ring" of diodes. The effect is amplitude<br />

modulation, which for low modulation frequencies has the effect of<br />

tremolo.
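A minimal ring-modulation sketch in MATLAB (the input tone and carrier frequency are arbitrary choices): multiplying a 440 Hz tone by a 100 Hz carrier leaves energy at the difference and sum frequencies, 340 Hz and 540 Hz.

    % Ring modulation: multiply the input by a sinusoidal carrier.
    fs = 44100; t = (0:fs-1)/fs;      % one second of samples
    x  = sin(2*pi*440*t);             % hypothetical input: a 440 Hz tone
    fc = 100;                         % carrier frequency, Hz
    y  = x .* sin(2*pi*fc*t);         % spectrum of y peaks at 340 Hz and 540 Hz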



Figure 4.24: The FFT of a clean electric guitar.<br />

Figure 4.25: The FFT of a "fuzzy" electric<br />

guitar signal, playing the same chord as<br />

in Figure 4.24. The spectra are similar, but

the scale of the vertical axis differs.<br />

Figure 4.26: The same FFT as left, with<br />

the scale of the vertical axis identical <strong>to</strong><br />

the FFT of the clean signal <strong>to</strong> make the<br />

additional harmonics more visible.<br />

The spectrogram in Figure 4.28 of ring modulation shows us some<br />

interesting results: At first the harmonics of the original <strong>to</strong>ne are fairly



Figure 4.27: A ring-modulated audio signal, multiplied by a sine wave that increases<br />

in frequency from 0 Hz <strong>to</strong> 9 kHz. At the beginning the effect looks like tremolo in the<br />

spectrogram.<br />

Figure 4.28: The spectrogram of a ring-modulated signal. At about 1.5 seconds, the<br />

effect begins, <strong>and</strong> gradually increases its influence. The harmonics both symmetrically<br />

increase <strong>and</strong> decrease, with the increasing harmonics showing the "combination <strong>to</strong>nes"<br />

<strong>and</strong> the decreasing ones showing the "difference <strong>to</strong>nes." At its most influential at three<br />

<strong>to</strong> four seconds in, this particular example of ring modulation appears <strong>to</strong> diminish<br />

the presence of the harmonics in favor of lower frequencies.<br />

strong <strong>and</strong> defined. As time goes on, these harmonics (indicated by<br />

the dark parallel lines) become fuzzier <strong>and</strong> sweep downwards <strong>and</strong><br />

upwards, reflecting the influence of the carrier frequency.



A low frequency oscilla<strong>to</strong>r (LFO) produces sine waves below 20 Hz<br />

<strong>and</strong> can be used <strong>to</strong> control the pulse rate of other signals. Remember:<br />

Rhythm <strong>to</strong>o can be defined by a frequency, albeit a small one, like 2<br />

or 3 Hz. <strong>An</strong> LFO is useful in controlling the presence <strong>and</strong> intensity of<br />

effects. In Figure 4.30 an audio signal is passed through a vibra<strong>to</strong> effect<br />

pedal. The rate of the vibra<strong>to</strong> is controlled <strong>and</strong> changed (modulated) by<br />

an LFO, <strong>and</strong> goes from no vibra<strong>to</strong> (0 Hz) <strong>to</strong> 20 Hz of vibra<strong>to</strong>.<br />

The vibra<strong>to</strong> effect varies a frequency periodically. This is also called<br />

frequency modulation. A musician can create vibrato on a guitar by pushing and pulling its whammy bar to change the tension on the

string or moving one’s finger <strong>to</strong> the left <strong>and</strong> right <strong>to</strong> change its effective<br />

length. The vibra<strong>to</strong> a musician can produce is fairly limited, varying<br />

a center frequency by only a few hertz due <strong>to</strong> human limitations.<br />

Computers <strong>and</strong> synthesizers can perform more rapid vibra<strong>to</strong>, modulating<br />

the center frequency by greater rates. Frequency modulation<br />

(FM) synthesizers like the vintage Yamaha DX-7 synthesizer are able<br />

<strong>to</strong> generate a very wide range of timbral textures in music because of<br />

the phenomenon of sideb<strong>and</strong>s: When a 50 Hz sinusoid, for example, is<br />

modulated by 20 Hz, we see "sideb<strong>and</strong>s" in the frequency spectrum of<br />

the signal located at 30 Hz <strong>and</strong> 70 Hz with some of the energy of the<br />

50 Hz b<strong>and</strong> now distributed <strong>to</strong> these frequencies. Sideb<strong>and</strong>s appear<br />

when vibrato exceeds about 10 percent of the carrier frequency.

In vibra<strong>to</strong>, the frequency should be the only thing varying. But<br />

in reality, this is nearly impossible <strong>to</strong> do. Instead, vibra<strong>to</strong> almost<br />

always contains a detectable amount of tremolo, too. So, the depiction

of electronically produced vibra<strong>to</strong> varies only the frequency, but in<br />

reality, the amplitude is most likely being affected as well.<br />

For similar reasons, tremolo virtually always contains some vibra<strong>to</strong>.<br />

Tremolo is also called amplitude modulation (AM). It creates a "shaky"



Figure 4.29: The (theoretical) frequency spectra of three signals, from top to bottom: x_1(t) = 3 sin(100πt), x_2(t) = 3 sin[100πt + 20 sin(2πt)], and x_3(t) = 3 sin[100πt + 10 sin(2πt) + 20 sin(2πt)]. The first spectrum shows a peak only at 50 Hz with a magnitude of 3. The second has a smaller peak at 50 Hz with magnitude of 2 and two sidebands at 30 and 70 Hz with magnitudes of 0.5, the redistributed energy resulting from FM synthesis. The final spectrum shows an even further frequency-modulated signal with sine waves of both 10 and 20 Hz changing the original frequency of 50 Hz. The total energy still sums to 3.



Figure 4.30: The difference between a three-dimensional spectrogram and a two-dimensional one is that amplitude is conveyed in two ways: By color and by height.

This axis is in units of decibels (dB) where the greatest amplitude/height is 0 dB <strong>and</strong><br />

the others are all relatively less than that. This spectrogram shows a signal undergoing<br />

vibra<strong>to</strong> that is modulated by a low-frequency oscilla<strong>to</strong>r (LFO). You can see fluctuations<br />

in both frequency <strong>and</strong> amplitude.<br />

effect, where the amplitude oscillates between quiet <strong>and</strong> loud. The<br />

graph in Figure 4.32 depicts a 20 Hz sinusoid modulated by a 1 Hz cosine<br />

wave with an amplitude of 0.25, so the overall amplitude envelope varies periodically in magnitude from 0.5 to 1.
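The vibrato and tremolo signals plotted later in Figures 4.31 and 4.32 can be generated directly from their formulas; a short MATLAB sketch (the duration and plotting calls are just for illustration):

    % Vibrato (FM) and tremolo (AM) signals from Figures 4.31 and 4.32.
    t    = 0:1/44100:2;                           % two seconds of time
    vib  = sin(100*pi*t + 20*cos(2*pi*t));        % 50 Hz carrier with 20 Hz of vibrato
    trem = sin(40*pi*t) .* (cos(2*pi*t) + 3)/4;   % 20 Hz carrier with 1 Hz tremolo
    subplot(2,1,1); plot(t, vib);
    subplot(2,1,2); plot(t, trem);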

Phasers <strong>and</strong> flangers change the phase of an audio signal by passing<br />

it through a delay line containing a ladder of all-pass filters. 8 Both of<br />

8 All-pass filters are really what they sound like: They let all of the original frequencies<br />

of a signal "pass" through them, killing none of the frequency components <strong>and</strong>



Figure 4.31: This is the time signal x(t) = sin [100πt + 20 cos(2πt)], i.e., a 50 Hz sine<br />

wave undergoing 20 Hz of vibra<strong>to</strong>. The more densely spaced waves depict the 70 Hz<br />

regions <strong>and</strong> the more sparsely spaced areas depict the 30 Hz regions.<br />

Figure 4.32: The effect of tremolo is shown in the periodic changes in the amplitude<br />

envelope. Here, x(t) = sin(40πt) · [cos(2πt) + 3] /4.<br />

them produce a swirling, "galactic" effect like that heard in the first few seconds of The Beatles' "Back in the USSR," affecting not the fundamental frequency but rather the range and quality of its overtone series.

The all-pass filters used in a flanger are linearly spaced <strong>and</strong> so are their<br />

phase responses, <strong>and</strong> the result is that flange sounds harmonic. A phaser<br />

pedal passes signals through all-pass filters with more logarithmic<br />

phase responses. The over<strong>to</strong>nes are therefore not harmonic, but do<br />

tend <strong>to</strong> highlight specific notes because of the logarithmic nature of<br />

frequency. The response of each filter is simply added <strong>to</strong> the original<br />

signal, creating destructive <strong>and</strong> constructive interferences varying<br />

with frequency, <strong>to</strong> give the effect of changing phase. The more of these<br />

retaining their original magnitudes. However, filters are specified both by magnitude<br />

and phase responses, so the all-pass filter's "unchanging" nature does not necessarily apply to its phase response. See Appendix A for more on filtering.



filters, the more variation in the spectrogram <strong>and</strong> audible change in<br />

the signal.<br />

Figure 4.33: The above is a spectrogram of the phase effect. Phasing increases <strong>and</strong><br />

decreases the frequencies <strong>and</strong> harmonics of a signal. In this audio clip, a low-frequency<br />

oscilla<strong>to</strong>r modulates the phase downwards <strong>and</strong> upwards twice <strong>and</strong> then returns <strong>to</strong><br />

the steady signal. Note that the amplitude peaks (black parts) <strong>and</strong> notches (white<br />

parts) do not have a linear, even spacing.



Figure 4.34: This is the spectrogram of a flange effect. Flanging is a type of phasing<br />

wherein the phase responses of the all-pass filters are in a series with uniform spacing,<br />

achieving an harmonic series of constructive <strong>and</strong> destructive interferences with respect<br />

<strong>to</strong> frequency, as shown by the dark <strong>and</strong> light areas here.<br />

A flanger pedal is a type of phaser pedal. Flanging was originally<br />

an effect created by recording engineers using magnetic tape. They<br />

would play two tapes containing the same signal <strong>and</strong> introduce a<br />

small delay by pressing a finger against one of the tape reels where<br />

it wrapped around a flange (edge) on the machine. Hence, the term<br />

"flanging." A delay line creates a series of uniformly spaced filters<br />

through which the signal is passed. The filters’ responses are then<br />

added <strong>to</strong> the original signal creating the harmonic series of constructive<br />

<strong>and</strong> destructive interferences.<br />

So flanging is the special case of phasing in which the peaks <strong>and</strong><br />

notches of the over<strong>to</strong>ne series it produces are uniformly spaced <strong>and</strong><br />

harmonic, while with phasing, they can be nonlinearly spaced <strong>and</strong><br />

hence inharmonic.
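A minimal digital sketch of the tape-style version of this effect, assuming a single fixed delay rather than a ladder of all-pass filters: adding a slightly delayed copy of a signal to itself produces peaks and notches that are uniformly spaced in frequency, which is why flanging sounds harmonic.

    % Fixed-delay comb filter, the simplest digital analogue of tape flanging.
    fs = 44100;
    x  = randn(1, fs);                     % hypothetical input: one second of noise
    d  = 100;                              % delay in samples (about 2.3 ms)
    y  = x + [zeros(1, d), x(1:end-d)];    % original plus delayed copy
    % Peaks and notches occur every fs/d = 441 Hz across the spectrum of y.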



4.6 Chapter summary<br />

The second half of this chapter explored the resonance of four classes of musical

instruments: Pianos, viols, winds, <strong>and</strong> drums. The waveforms of<br />

several guitar effects pedals were also shown. <strong>An</strong> acoustic instrument<br />

requires two things <strong>to</strong> make <strong>and</strong> amplify sound: A resonant cavity like<br />

a box, <strong>and</strong> an activating, vibrating mechanism, like a fixed string or<br />

reed. Stiff objects with a high tension vibrate <strong>and</strong> amplify better than<br />

soft ones. Because pianos <strong>and</strong> violins use fixed strings, the wavelengths<br />

of their modes of vibration are integer divisions of the length of

the string. In winds, a harmonic spectrum results from the interaction<br />

between st<strong>and</strong>ing waves <strong>and</strong> a reed in a column of air. Drums do not<br />

have an harmonic spectrum, but some of their harmonics can be related<br />

by Bessel function ratios.<br />

Though all waves can be expressed as the sum of sine <strong>and</strong> cosine<br />

waves, most musical instruments <strong>and</strong> effects pedals produce shapes<br />

of waves that look <strong>and</strong> sound different from simple sine waves. The<br />

friction of the bow on the string produces saw<strong>to</strong>oth waves, characterized<br />

by a jagged waveform <strong>and</strong> high amount of attack. These waves have a<br />

dark, mysterious timbre, <strong>and</strong> many new composers of electronic music<br />

use them. Dis<strong>to</strong>rtion pedals produce square waves that are artificial<br />

in <strong>to</strong>ne, produced by clipping a smooth wave <strong>to</strong> make its <strong>to</strong>p <strong>and</strong><br />

bot<strong>to</strong>m flat. Saw<strong>to</strong>oth <strong>and</strong> square waves encountered in electronic<br />

music are usually produced by mathematical functions, so they are<br />

non-sinusoidal. However, these waveforms can still be expressed as<br />

an infinite sum of sine waves.<br />

It is useful <strong>to</strong> have a deep underst<strong>and</strong>ing of the Fourier representation<br />

of different timbres, because it makes Fourier analysis of<br />

polyphonic music a whole lot easier. In polyphonic music, two instruments<br />

will often play the same pitch or a harmonically related<br />

pitch such as an octave or perfect fifth above or below. Therefore,<br />

the frequency representation of their <strong>to</strong>tal signal will contain a lot of<br />

intersection, with peaks resulting from more than one instrument. The



only way <strong>to</strong> separate instruments with either your ear or computer is<br />

<strong>to</strong> be familiar with the shape of their timbres.<br />

The <strong>to</strong>ne of the flute contains one of the simplest over<strong>to</strong>ne series<br />

of any acoustic instrument. Its timbre is nearly pure, with most of<br />

the energy centered at the fundamental frequency. The other wind<br />

instruments (trumpet, trombone, <strong>and</strong> oboe) all have partials that are<br />

stronger than their fundamental frequency. The piano <strong>and</strong> violin have<br />

similarly shaped spectra, reflecting the modes of vibration of fixed<br />

strings. The drums have mostly inharmonic over<strong>to</strong>nes: The way they<br />

are played produces frequency b<strong>and</strong>s instead of single frequencies.<br />

A spectrogram conveys the frequency, time, <strong>and</strong> amplitude information<br />

of a musical signal in a powerful way. Below are two explained<br />

spectrograms. The Shostakovich piece contains violins, horns, <strong>and</strong> a<br />

flute in the segment shown, while the Beatles song has drums, bass<br />

guitar, electric guitar, acoustic guitar, <strong>and</strong> George Harrison’s voice.



Figure 4.35: This is the spectrogram of the first 20 seconds of Dmitri Shostakovich’s<br />

Symphony No. 5 in D minor, Op. 47: II. Allegretto. The piece begins with 12 seconds of

violins only. Note the relative amount of noise in the spectrum during this part. What<br />

we can see is mostly their onset, but their harmonics are represented by thicker, less<br />

powerful lines that are not as clearly defined, <strong>and</strong> there is a higher spread of power<br />

over the frequency range. Now, when the horns enter between 0:12 <strong>and</strong> 0:13, we<br />

see dark, horizontal lines located at their harmonics. These lines are closely spaced<br />

because the horns’ harmonics are exact integer multiples of the fundamental. There<br />

are multiple horns playing all at once but in different registers (pitch ranges). When<br />

the solo flute enters between 0:15 <strong>and</strong> 0:16, we see more distantly spaced harmonics<br />

from its fundamental frequency somewhere around 1200 Hz. Its harmonics extend<br />

quite far—12 of them above the fundamental—reflecting its clear <strong>to</strong>ne, <strong>and</strong> the fact<br />

that it is unaccompanied by any other flute. The remainder of the clip is largely the<br />

solo flute with some light accompaniment from the horns.



Figure 4.36: Above is a spectrogram from the first 14 seconds of The Beatles’ "I’m<br />

Happy Just <strong>to</strong> Dance with You" from A Hard Day’s Night (1964). The first <strong>and</strong> second<br />

measures are identical: Minor chords on the electric guitar <strong>and</strong> a cymbal-heavy drum<br />

line with snare rolls on the fourth beat. The vocals enter between 0:06 <strong>and</strong> 0:07.<br />

Writing them <strong>to</strong> better underst<strong>and</strong> the syllables of stronger emphasis, the lyrics are<br />

"before this DANCE is THROUGH, i think i’ll LOVE you TOO, i’m so HAPpy when<br />

you DANCE with ME." Using "S" as shorth<strong>and</strong> for a stressed syllable <strong>and</strong> "w" <strong>to</strong> mean<br />

a weak one, it goes w-w-w-S-w-S-w-w-w-S-w-S-w-w-S-w-w-w-S-w-S. Particularly<br />

for the first two strong syllables ("dance" <strong>and</strong> "through"), we can see especially dark<br />

markings in the spectrogram. The drums appear along the bot<strong>to</strong>m of the frequency<br />

range <strong>and</strong> have no clearly defined harmonics. The cymbals show up as faint vertical<br />

lines through the entire range of frequencies.


5. Audi<strong>to</strong>ry perception<br />

Both the ear <strong>and</strong> our brain’s perception of its movements are far<br />

broader <strong>to</strong>pics than we have space <strong>to</strong> cover here, but the aspects of<br />

hearing most relevant <strong>to</strong> the techniques that we use <strong>to</strong> process <strong>and</strong> digitize<br />

sound are fairly limited. The two things necessary <strong>to</strong> take away<br />

from this chapter before moving on <strong>to</strong> Chapter 6 are the concept of<br />

masking <strong>and</strong> the logarithmic nature of loudness <strong>and</strong> pitch. The rest consists<br />

of some interesting <strong>and</strong> fundamental facts about our physiology<br />

<strong>and</strong> perception.<br />

We have established that the amplitude of sound waves flowing<br />

through air is a measure of pressure. It is essential that air pressure changes in order for the ear to relay sonic information to the brain, and furthermore, sound must alternate between its minimum and maximum

values at least 20 times per second (20 Hz) <strong>to</strong> be considered pitched.<br />

In fact, if the ear picked up frequencies any lower than 20 Hz, the<br />

incredibly loud thermal noise of the world would be audible [4].<br />

5.1 Physiology of the ear<br />

We divide the ear in<strong>to</strong> three main sections: The outer ear, the middle<br />

ear, <strong>and</strong> the inner ear. The outer ear acts as a receiver for sound. The<br />

eardrum connects the outer <strong>and</strong> middle parts of the ear. The pressures<br />

on either side of the eardrum are compared <strong>to</strong> each other, <strong>and</strong> the<br />

bones <strong>and</strong> muscles in the middle ear transmit the effective difference<br />

<strong>to</strong> the inner ear. The inner ear is attached <strong>to</strong> the audi<strong>to</strong>ry nerve, which<br />

passes on the sonic information <strong>to</strong> the brain.<br />

The leading theory for the physiological response <strong>to</strong> an audi<strong>to</strong>ry<br />

stimulus is place theory, which states that different frequencies stimulate



different places along a basilar membrane that is organized much like<br />

a logarithmic frequency domain [80]. This means that the basilar<br />

membrane functions like a Fourier device.<br />

Figure 5.1: A diagram of the outer, middle, <strong>and</strong> inner parts of the ear.<br />

The outer ear consists of only three parts: The pinna (the skin and cartilage protruding from the head), the meatus or auditory canal,

<strong>and</strong> the tympanum, also known as the eardrum. The tympanum vibrates<br />

when disturbed by pressure fluctuations in the surrounding medium,<br />

thereby changing the pressure inside the chamber of the middle ear <strong>to</strong><br />

equal the exterior pressure.<br />

The amount of pressure is constrained by the impedance or resistance<br />

of a medium. The more impedance, the less a signal can get through.<br />

This is similar to a current through a resistor in a circuit or a car in traffic. Impedance matching refers to the equalization of the middle ear's pressure to the outer ear's. The ossicles inside the middle ear are three bones (the hammer, anvil, and stirrup) that perform this matching and send a signal to the inner ear; the middle ear also houses the eustachian tube. The



stapedius muscle is attached to the ossicles and is involuntarily flexed (a response called the stapedius reflex, or acoustic reflex) when an ongoing acoustic stimulus

is louder than about 90 dB. Furthermore, the tensor tympani muscle<br />

connected <strong>to</strong> the eardrum flexes during loud sensations <strong>to</strong> tighten the<br />

eardrum <strong>and</strong> increase its impedance. In this way, the middle <strong>and</strong> inner<br />

ears also protect our hearing.<br />

However, when an excessively loud sound persists, these muscles<br />

grow weary. <strong>An</strong> interesting phenomenon known as temporary threshold<br />

shifting works <strong>to</strong> prevent permanent hearing damage by shifting the<br />

ears’ dynamic range (the threshold of hearing <strong>to</strong> the limit of hearing)<br />

higher for a limited amount of time. When this time has run out <strong>and</strong><br />

the loud sound continues, permanent threshold shifting occurs, which<br />

results in hearing loss. Soft sounds will then be inaudible because the<br />

threshold levels have shifted upwards.<br />

The inner ear is the most complicated region, <strong>and</strong> consists of a<br />

coiled cavity called the cochlea. The stirrup ossicle is attached <strong>to</strong> the<br />

oval window entry <strong>to</strong> the cochlea, whose other end is the apex. This<br />

window is the opening <strong>to</strong> one of the two tubes in the cochlea: The scala<br />

vestibuli, which is filled with a fluid called perilymph. The other tube<br />

is called the scala tympani, connected <strong>to</strong> the middle ear by the round<br />

window <strong>and</strong> also filled with perilymph. The perilymph vibrates <strong>and</strong><br />

stimulates the scala media, a tube that separates the scala vestibuli <strong>and</strong><br />

scala tympani. It is filled with endolymph, a fluid with a complementary<br />

ionic composition <strong>to</strong> perilymph such that their interaction generates<br />

electrochemical impulses that are sent on <strong>to</strong> the brain.



Figure 5.2: The cross-section of the cochlea. Labeled are the places along the basilar<br />

membrane corresponding <strong>to</strong> the frequency regions they detect.<br />

Reissner’s membrane is an impermeable membrane on the scala<br />

vestibuli side of the scala media. Underneath Reissner’s membrane is<br />

the tec<strong>to</strong>rial membrane. The surface of the basilar membrane, named as<br />

such because it is considered <strong>to</strong> function as the "base" of our perception<br />

of sound, is covered in rows of hair cells (cilia) with one row on the<br />

inside <strong>and</strong> three rows on the outside. The inner hair cells of the basilar<br />

membrane are triggered by the motion of the tec<strong>to</strong>rial membrane, <strong>and</strong><br />

these send on phase, frequency, <strong>and</strong> amplitude information <strong>to</strong> the<br />

brain.<br />

The basic order of events in the mechanism of hearing is as follows:<br />

A stimulus causes pressure changes in the outer ear which changes<br />

the pressure inside of the middle ear. This makes the stirrup ossicle<br />

move in <strong>and</strong> out of the scala vestibuli, causing fluctuations in the<br />

volume of fluid inside of it. This leads <strong>to</strong> vertical displacement in<br />

the basilar membrane <strong>and</strong> longitudinal waves in its surrounding fluid,<br />

whose cumulative motion creates a similar surface wave along the<br />

basilar membrane moving from the stiff end of the cochlea (the base) <strong>to</strong><br />

the apical end (the apex). When this motion is great enough <strong>to</strong> trigger



the hair cells, information is sent on <strong>to</strong> the audi<strong>to</strong>ry nerve which is<br />

connected <strong>to</strong> each hair cell.<br />

Figure 5.3: The scala vestibuli, scala tympani, <strong>and</strong> scala media all contain fluid. This is<br />

a hydrodynamic surface wave, so gravity is the res<strong>to</strong>ring force upon this fluid. When<br />

excited by sound, the fluid propagates as shown by the arrows [81].<br />

The connection between our audi<strong>to</strong>ry perception <strong>and</strong> our physiology,<br />

along with the nature of the information that is sent <strong>to</strong> the<br />

auditory nerve, lies in the inner ear. Because f = 1/T, i.e., frequency

depends on time <strong>and</strong> vice versa, it is unknown <strong>to</strong> a degree how exactly<br />

the pulsating action within the ear is interpreted. Fortunately, sound is<br />

limited <strong>to</strong> basically three things: Frequency (periodicity), amplitude,<br />

<strong>and</strong> phase. So our perception of pitch, loudness, <strong>and</strong> phase is tied <strong>to</strong><br />

their physical manifestation in our hearing mechanism.<br />

The nature of the basilar membrane’s reaction <strong>to</strong> sound is highly<br />

analogous <strong>to</strong> the actual sound wave. Small cameras placed in the<br />

cochlea through video microscopy reveal that the hair cells along<br />

the basilar membrane are excited at locations corresponding <strong>to</strong> the<br />

frequency of the excitation. For sounds above the threshold of hearing,<br />

hairs along the apical end of the basilar membrane are excited by low<br />

frequencies, while those at the basal end respond to high frequencies.

Therefore, place theory describes the relationship between frequency<br />

<strong>and</strong> the placement of cilia on the basilar membrane as a <strong>to</strong>no<strong>to</strong>pic<br />

mapping—the mapping of frequency (<strong>to</strong>ne) <strong>to</strong> place. Furthermore,



these hairs act as b<strong>and</strong>-pass filters, selecting a small range of frequencies<br />

much like the holes of a wind instrument. When a single hair is excited<br />

by a single frequency, it also <strong>to</strong> a lesser degree excites the hairs around<br />

it.<br />

Figure 5.4: <strong>An</strong>other depiction of place theory in the basilar membrane, with the<br />

membrane uncoiled.<br />

The location of frequencies does not follow a linear scale. It follows<br />

a logarithmic scale, exactly like our perception of frequency. Octaves<br />

are spaced a constant distance from each other, so half the length of the<br />

basilar membrane detects 1500 Hz <strong>and</strong> less, <strong>and</strong> the other half detects<br />

the frequencies above 1500 Hz.<br />

Humans <strong>and</strong> other mammals localize low-frequency sounds by<br />

phase delay according <strong>to</strong> temporal theory <strong>and</strong> interaural time difference,<br />

which is defined by the space between our two ears. The ears are<br />

separated by about 21.5 cm, the wavelength of 1600 Hz, so beyond<br />

1600 Hz, the phase delay is no longer useful in detecting the location<br />

of sounds <strong>and</strong> instead group delay (the time difference between the<br />

amplitude envelopes of the sound in the left versus right ear) is the<br />

measure used [89]. At twice this wavelength (i.e., 800 Hz), the audi<strong>to</strong>ry<br />

system can unambiguously detect spatialization using the time



difference. Between 800 <strong>and</strong> 1600 Hz is a "transition zone" where both<br />

phase delays <strong>and</strong> amplitude envelopes are used in localization.<br />
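As a rough worked example (assuming a speed of sound of about 343 m/s), a sound arriving from directly to one side must travel the extra 21.5 cm between the ears, a delay of about 0.215/343 ≈ 0.6 ms; it is this sub-millisecond interaural delay that the phase-based mechanism exploits below 1600 Hz.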

Furthermore, once every period an action potential is fired from<br />

the audi<strong>to</strong>ry nerve. Below about 30 Hz, when sound is something<br />

that we detect as separate events rather than pitch, temporal theory is<br />

strongest, but physically, no more than 1500 action potentials can fire<br />

per second so the theory does not hold above 1500 Hz [87].<br />

If the set of frequencies is steady for some period of time, our ears’<br />

detection of these frequencies becomes increasingly fine-tuned. The<br />

physical result is that our audi<strong>to</strong>ry nerve actually charges up the action<br />

potentials at the corresponding locations along the basilar membrane.<br />

The music of La Monte Young attempts <strong>to</strong> exploit this fact. "Dream<br />

House," an installation in Lower Manhattan in New York City, features<br />

continuous pure <strong>to</strong>nes that, when heard for an extended period of time,<br />

are alleged <strong>to</strong> incite audi<strong>to</strong>ry hallucinations in the brain [86]. Exploding<br />

head syndrome is another form of audi<strong>to</strong>ry hallucination <strong>and</strong> typically<br />

happens during sleep: Extremely loud or ringing noises will appear<br />

<strong>to</strong> originate from inside the head, but these noises are not (usually)<br />

painful. The symp<strong>to</strong>ms are certainly dream-like but are not necessarily<br />

connected <strong>to</strong> dreaming [92].<br />

The intensity of sound waves’ pressure manifests in the disturbance<br />

of the eardrum, but this relationship is not quite as elegant<br />

as frequency’s. We find it more difficult <strong>to</strong> compare the loudness of<br />

sounds than frequency ratios, e.g., we almost never say that one sound<br />

is twice as loud as another. Perhaps the most important feature of the<br />

intensity of sounds in our ears is their onset time, earlier referred <strong>to</strong><br />

as the attack rate. Even soft sounds that surprise us can be extremely<br />

startling. The audi<strong>to</strong>ry reflex acts ahead of time only when the brain<br />

is expecting sound. Therefore, onset always excites the ear more than<br />

the actual sound’s pressure level.<br />

Now, what about phase? Our brains don't really seem to react to

phase information as they react <strong>to</strong> loudness <strong>and</strong> frequency. However,<br />

there is evidence that an action potential fires when the amplitude of



a given frequency is maximal—i.e., when the phase of a sinusoid is<br />

90 ◦ [1]. It is true that we use phase information <strong>to</strong> locate the source of<br />

sounds: A sound that reaches our left ear before our right ear naturally<br />

means that the source is more <strong>to</strong> the left. Furthermore, based on<br />

the loudness <strong>and</strong> other qualities <strong>to</strong> the sound, our ears use phase <strong>to</strong><br />

determine the approximate angle of orientation <strong>to</strong> sources of sound.<br />

Phase information is largely at play in the cocktail party effect, which is

our ability <strong>to</strong> focus on certain sound signals when there is a large<br />

amount of noise in the background, like having a conversation with a<br />

friend at a noisy party or concert. We focus better on desired signals<br />

when facing straight-on, such that the sound will reach both ears at the<br />

same time <strong>and</strong> the phases will be identical. Also at play is masking, the<br />

camouflaging of sound sources by other sounds of similar frequencies,<br />

which is a psychoacoustical feature.<br />

5.2 Psychoacoustics<br />

The field of psychophysics seeks <strong>to</strong> connect physical aspects of the<br />

world around us <strong>to</strong> the way our brain perceives them. Each of the<br />

sensations of sight, sound, <strong>to</strong>uch, taste, <strong>and</strong> smell have quantifiable<br />

threshold values <strong>and</strong> limits that define the range of intensities from<br />

barely detectable <strong>to</strong> permanently damaging. For no sensation are these<br />

ranges absolute, <strong>and</strong> even their average values contain some level of<br />

uncertainty due <strong>to</strong> noise in the sensory system [48].<br />

Psychoacoustics relates the physiological response <strong>to</strong> sound <strong>to</strong> the<br />

perceptual interpretation. The basilar membrane acts as a spectral<br />

analyzer according <strong>to</strong> place theory. Indeed, we seem <strong>to</strong> have a very<br />

easy time identifying the timbre of sounds, which is a frequency-based<br />

skill: Immediately, we can identify the sound of guitars, drums, <strong>and</strong><br />

the President’s voice. We can quickly tell the difference between five<br />

of our friends’ voices, even if they are all of the same age <strong>and</strong> gender.<br />

Since timbre is a set of frequencies <strong>and</strong> their amplitudes, we will



investigate how frequency <strong>and</strong> intensity translate perceptually <strong>to</strong> pitch<br />

<strong>and</strong> loudness.<br />

Pitch<br />

Pitch is our interpretation of frequency in sound. The assumption<br />

is that this is not an exact correspondence, but rather a rough one,<br />

especially when we consider how we perceive frequencies below 30<br />

Hz or so. In general, we consider a sound "pitched" when it has a<br />

repetitive nature <strong>and</strong> it is within the range of our thresholds <strong>and</strong> limits.<br />

Perhaps the most amazing part about pitch perception is our response<br />

<strong>to</strong> frequency ratios, like the octave. We perceive pitches an<br />

octave apart <strong>to</strong> be so closely related that we give them identical note<br />

names. Additionally, intervals other than the octave have distinct qualities<br />

that are not just limited by their absolute difference: A frequency<br />

19 half steps above a reference frequency (a "G" above a "C") has a<br />

very similar quality <strong>to</strong> a frequency 7 half steps above the reference (a<br />

different "G" above the same "C"), because the octave consists of 12<br />

half steps. The mere existence of such a thing as the Circle of Fifths<br />

suggests that we perceive pitch as a spiral, or Slinky, where pitches at<br />

identical angles on these surfaces are separated by an octave.<br />

Roger Shepard devised a schematic for <strong>to</strong>nes that maps the notes<br />

C, C♯, etc. <strong>to</strong> chroma <strong>and</strong> their octave placement <strong>to</strong> a height, where<br />

higher octaves have a greater height. Chroma can also be thought<br />

of as pitch classes where "C" is one class <strong>and</strong> "C♯" is another class, so<br />

there are 12 <strong>to</strong>tal classes in the Western scale. The Shepard <strong>to</strong>ne is an<br />

auditory illusion much like the optical illusion of a spinning barber's pole.

Its sound is the result of layered, identical sine sweeps moving from<br />

low <strong>to</strong> high frequencies where the highest frequency is some octave<br />

of the lowest. When one sweep reaches an octave higher than its<br />

starting frequency, another sweep begins. Though there is a maximum



Figure 5.5: The pitches shown as <strong>to</strong>ne chroma.<br />

frequency that each of the sweeps hit, this sound has the illusion of<br />

constantly increasing in pitch.<br />

We do not perceive all frequencies as equally loud. In fact, very<br />

loud sounds played near the limit of our hearing undergo a downward<br />

shift in pitch. The range of 1,000–5,000 Hz, spanning a little more than 2 octaves, is where our ears are most sensitive, and this is the range of

our speaking voice. Thus, loudness <strong>to</strong>o is perceptual.<br />

Loudness<br />

Loudness describes the brain’s perception of the intensity of a sound.<br />

Like frequency, intensity is modeled on a logarithmic scale in decibels<br />

(dB), where 0 dB corresponds to normal atmospheric pressure (101.325 kPa). The threshold of our hearing t_h is 10⁻¹² W/m² (watts per square meter), and the limit of our hearing l_h is 1 W/m².

However, loudness does not absolutely correspond <strong>to</strong> intensity.<br />

For one, we perceive frequencies in the range of 1,000 <strong>to</strong> 5,000 Hz<br />

better than frequencies outside of this range. Frequencies at the extreme<br />

points of our hearing range have <strong>to</strong> be very intense in order<br />

for us <strong>to</strong> perceive them. The Fletcher–Munson curve given in Figure<br />

5.6 describes the minimum intensity required <strong>to</strong> detect specific<br />

frequencies.<br />

Figure 5.6: This graph shows our ears’ <strong>to</strong>tal sensitivity from 20-20,000 Hz, limited by<br />

the threshold (minimum sound pressure level required for perception of sound, given<br />

by the familiar Fletcher-Munson curve) <strong>and</strong> limit (maximum sound pressure level<br />

beyond which our hearing is damaged). The average region of speech signals is also<br />

highlighted.<br />

Therefore, there are several different ways <strong>to</strong> think about loudness<br />

<strong>and</strong> also different measurements <strong>to</strong> quantify it. The decibel measure is



fairly meaningless when it comes <strong>to</strong> how we perceive intensity because<br />

it varies radically with frequency. Therefore, it is sometimes useful<br />

<strong>to</strong> measure loudness using the phon scale. The phon scale answers<br />

the question, "How loud does frequency B need <strong>to</strong> be in order <strong>to</strong><br />

be equally loud as frequency A?" It is a measure of equal loudness

based on the Fletcher–Munson function for the minimum threshold of<br />

hearing, given by<br />

T(f) = 3.64\left(\frac{f}{1000}\right)^{-0.8} - 6.5\,e^{-0.6\,(f/1000 - 3.3)^{2}} + 0.001\left(\frac{f}{1000}\right)^{4}.

This is the threshold for young, healthy ears. The phon is defined as<br />

the sound intensity level (SIL) in decibels of a sinusoid of 1,000 Hz, so at<br />

the threshold level 10 −12 W/m 2 , the loudness is 0 phon, <strong>and</strong> 10 phons<br />

equals the sound intensity level of the 1,000 Hz sinusoid at 10 dB.<br />

This means that a 10 phon sinusoid at x Hz will sound equally loud<br />

<strong>to</strong> a 10 phon sinusoid at y Hz. This is a nice scale <strong>to</strong> apply in analysis<br />

of the frequency spectrum because it normalizes the amplitudes with<br />

respect <strong>to</strong> our perception. A 3,000 Hz sound wave, for example, should<br />

be considered "more important" than an equally intense 50 Hz sound<br />

wave, because its intensity does not have <strong>to</strong> be nearly as great <strong>to</strong><br />

be audible. When we apply the phon scale <strong>to</strong> the resulting Fourier<br />

transform, we get a better idea of the actual sound that we perceive<br />

from a signal.<br />
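To make the threshold formula concrete, here is a minimal sketch in C (the function name threshold_db and the test frequencies are mine, not part of the text); it shows, for instance, that a 3,000 Hz tone needs far less intensity to be heard than a 50 Hz tone.

#include <stdio.h>
#include <math.h>

/* Threshold of hearing in dB SPL, after the formula above
   (valid roughly for young, healthy ears). */
double threshold_db(double f)
{
    double khz = f / 1000.0;
    return 3.64 * pow(khz, -0.8)
         - 6.5 * exp(-0.6 * (khz - 3.3) * (khz - 3.3))
         + 0.001 * pow(khz, 4.0);
}

int main(void)
{
    double freqs[] = { 50.0, 1000.0, 3000.0, 15000.0 };  /* example frequencies */
    for (int i = 0; i < 4; i++)
        printf("T(%.0f Hz) = %.1f dB SPL\n", freqs[i], threshold_db(freqs[i]));
    return 0;
}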

Another translation of sound intensity levels to a perceptual measure is given by the sone scale, which calculates loudness as a ratio. Loudness in sones (L_s) can be directly calculated from loudness in phons (L_p) as

L_s = 2^{(L_p - 40)/10},

so one sone is equal to 40 phons. A sound that is twice as loud as another sound will have twice as many sones. One sone is therefore equal to the loudness of a 1,000 Hz sinusoid at 40 phons.
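A one-line conversion in C follows directly from this relationship (the function name is mine):

#include <stdio.h>
#include <math.h>

/* Convert loudness in phons to loudness in sones: L_s = 2^((L_p - 40)/10). */
double phons_to_sones(double phons)
{
    return pow(2.0, (phons - 40.0) / 10.0);
}

int main(void)
{
    printf("40 phons = %.2f sones\n", phons_to_sones(40.0));  /* 1 sone  */
    printf("50 phons = %.2f sones\n", phons_to_sones(50.0));  /* 2 sones */
    printf("60 phons = %.2f sones\n", phons_to_sones(60.0));  /* 4 sones */
    return 0;
}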



Figure 5.7: These curves are derived from the Fletcher–Munson curve <strong>and</strong> depict the<br />

phon scale. Following a single curve tells you the sound intensity level perceived as<br />

equally loud across the entire frequency range. The curves are drawn in increments<br />

of 10 phons.<br />

It is common to see loudness written in dB SPL, which is the loudness relative to a reference pressure. The reference pressure is 20 µPa; a pressure of 1 Pascal corresponds to about 94 dB SPL, a level commonly used for calibration. This measure is so common that it is often (misleadingly) abbreviated as "dB," but decibels are not an absolute measure of intensity or pressure.

Loudness in dB SPL (also called the intensity level) gives us a better idea of the perceptual experience of volume, while amplitude describes the physical wave. The reference level for our threshold is usually 20 µPa RMS (root mean squared—a sort of normalization) at 1,000 Hz, where this would be 0 dB SPL. We can calculate the loudness in dB SPL (L_{dB SPL}) of a root mean squared pressure p_{rms} relative to a reference pressure p_{ref} with the formula

L_{dB\,SPL} = 10 \log_{10}\left(\frac{p_{rms}}{p_{ref}}\right)^{2} = 20 \log_{10}\left(\frac{p_{rms}}{p_{ref}}\right).

So this would be the number of decibels of a sound above some reference sound.
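As a small sketch in C, assuming the standard 20 µPa reference mentioned above (the function name is mine):

#include <stdio.h>
#include <math.h>

#define P_REF 20e-6  /* reference pressure: 20 micropascals RMS */

/* Sound pressure level in dB SPL from an RMS pressure in pascals. */
double db_spl(double p_rms)
{
    return 20.0 * log10(p_rms / P_REF);
}

int main(void)
{
    printf("20 uPa -> %.1f dB SPL\n", db_spl(20e-6)); /* 0 dB, threshold */
    printf("1 Pa   -> %.1f dB SPL\n", db_spl(1.0));   /* about 94 dB SPL */
    return 0;
}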



Source                          Intensity (W/m^2)   SPL (dB)   Magnitude of t_h
Threshold of hearing            1 × 10^-12           0          10^0
Rustling leaves                 1 × 10^-11           10         10^1
Whispering                      1 × 10^-10           20         10^2
Quiet library                   1 × 10^-8            40         10^4
Conversation, 1 m               1 × 10^-6            60         10^6
Vacuum cleaner, 1 m             1 × 10^-5            70         10^7
Heavy traffic, from sidewalk    1 × 10^-4            80         10^8
Rock concert, 1 m               1 × 10^-2            100        10^10
Threshold of pain               1 × 10^1             130        10^13
Jet engine                      1 × 10^2             140        10^14
Perforation of eardrum          1 × 10^4             160        10^16

Table 5.1: The sound pressure levels (SPL) of various sound sources of wideband noise, in which the sound's energy is spread across a large range of frequencies.

If this wasn’t already confusing enough, there is yet another measure<br />

of loudness that considers the normal hearing level, called dB HL.<br />

We typically encounter this scale during an audiogram, a test of one’s<br />

hearing. The quantity 0 dB HL describes the (average) normal hearing<br />

level at all frequencies, <strong>and</strong> scoring somewhere in the range of -10 dB<br />

HL <strong>to</strong> 20 dB HL is commonly considered the normal hearing range.<br />

Just-noticeable difference<br />

All of our sensations have a prescribed resolution of detectable detail,<br />

<strong>and</strong> both loudness <strong>and</strong> pitch have some interval of error within which<br />

we cannot detect a difference. This interval is defined as the just-noticeable difference (jnd) and is measured in limens. The process of

altering a frequency from some center frequency is called frequency<br />

modulation, <strong>and</strong> likewise, the alteration of amplitude from a center<br />

amplitude is amplitude modulation. Frequency modulation (FM) is<br />

perceived as vibra<strong>to</strong> when the modulating frequency is sufficiently<br />

large for our ears <strong>to</strong> detect change but small enough <strong>to</strong> not separate<br />

the maximum <strong>and</strong> minimum frequencies. Frequencies closer <strong>to</strong> the



minimum <strong>and</strong> maximum values of our hearing range must modulate<br />

more than frequencies between 1,000 <strong>and</strong> 5,000 Hz, where our ears are<br />

most sensitive <strong>to</strong> change. Within this area of heightened sensitivity,<br />

frequency changes greater than about 0.5% of the center or carrier<br />

frequency can be detected. The magnitude of the jnd varies especially<br />

for trained musicians who have spent lots of time tuning <strong>and</strong> listening<br />

<strong>to</strong> their instruments.<br />

For loudness, the jnd is roughly proportional <strong>to</strong> the intensity <strong>and</strong><br />

frequency of the sound, related <strong>to</strong> the Fletcher–Munson curve above.<br />

Amplitude modulation (AM) is also known as tremolo when the change<br />

in loudness is beyond the jnd. When two frequencies in close proximity<br />

<strong>to</strong> one another (differing by 10 Hz or less) are played simultaneously in<br />

a sound, the phenomenon of beating occurs. The phases of the two frequencies<br />

result in constructive <strong>and</strong> destructive interference at periodic<br />

intervals, and the effect is heard as amplitude modulation, beating at a frequency exactly equal to the difference of the two frequencies.

Figure 5.8: The sum of two closely related frequencies, 50 Hz and 51 Hz, over 3 seconds. The cosine function cos(πt) is also shown, representing the (perceptual and actual) modulation of amplitude.

The sound from a whistle produces beats heard as a lower tone. There are

two frequencies produced, separated in wavelength by the length of<br />

the gap in the whistle. So, say the first frequency is the result of the<br />

distance between the mouthpiece <strong>and</strong> the beginning of the gap (say 5<br />

centimeters, so 343/0.05 = 6860 Hz), <strong>and</strong> the second frequency has the<br />

wavelength of the distance between the mouthpiece <strong>and</strong> the end of the



gap (say 5.2 centimeters, so 343/0.052 ≈ 6596 Hz). Then blowing this whistle would produce a difference frequency of 6860 − 6596 = 264 Hz.

A difference frequency is commonly required to be less than 10 Hz before it is called beating, because such a low frequency has the sound of a metric rhythm, like beats from a drum; but difference tones are an artifact of beating no matter their frequency. Between 10 and 20 Hz, the modulation is considered dissonant or rough, and above 20 Hz, the difference frequency is heard as a pitch of its own. This is similar to our perception of a series of reflections as coming from the same source, where 0.1-0.2 seconds represents an interval of ambiguity.
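The whistle example can be checked with a few lines of C; the 343 m/s speed of sound and the two path lengths are the same hypothetical values used in the text.

#include <stdio.h>

int main(void)
{
    double c  = 343.0;          /* speed of sound in m/s           */
    double d1 = 0.050;          /* path to the start of the gap, m */
    double d2 = 0.052;          /* path to the end of the gap, m   */

    double f1 = c / d1;         /* 6860 Hz       */
    double f2 = c / d2;         /* about 6596 Hz */

    printf("f1 = %.0f Hz, f2 = %.0f Hz\n", f1, f2);
    printf("difference tone = %.0f Hz\n", f1 - f2);  /* roughly 264 Hz */
    return 0;
}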

Figure 5.9: The sum of the frequencies 50 <strong>and</strong> 62 Hz over 1.5 periods, with a 12<br />

Hz sinusoid on <strong>to</strong>p <strong>to</strong> show the relationship between their sum <strong>and</strong> the difference<br />

frequency. This signal is dissonant. Beyond 20 Hz, the 20 Hz difference <strong>to</strong>ne becomes<br />

a pitch of its own, so the dissonance begins <strong>to</strong> diminish.<br />

Our detection of difference tones is tied to the definition of critical bands.

Critical b<strong>and</strong>s <strong>and</strong> masking<br />

We define noise by a bandwidth and a center or carrier frequency: narrowband noise for small bandwidths and wideband noise for large bandwidths, where the bandwidth determines the width of the interval of frequencies contained in a signal (bounded by its minimum and maximum frequency bands).



Zwicker <strong>and</strong> Feldtkeller’s 1955 experiments with narrowb<strong>and</strong> noise<br />

showed that, beyond certain b<strong>and</strong>widths, we perceive the loudness of<br />

b<strong>and</strong>width-limited noise as disproportional <strong>to</strong> its <strong>to</strong>tal energy: When<br />

the b<strong>and</strong>width reaches <strong>and</strong> exceeds a critical value, the energy (loudness)<br />

of the noise has the illusion of increasing when it is actually<br />

constant. This value beyond which we perceive the b<strong>and</strong>width <strong>to</strong> have<br />

more energy than it physically does is given by<br />

\beta_c = 25 + 75\left[1 + 1.4\left(\frac{f_c}{1000}\right)^{2}\right]^{0.69},

<strong>and</strong> β c is called the critical b<strong>and</strong>width. The compression techniques<br />

behind MP3s <strong>and</strong> other lossy compressed files use this psychoacoustic<br />

phenomenon <strong>to</strong> discard sonic information that we wouldn’t perceive<br />

anyway.<br />
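A minimal C sketch of the critical bandwidth formula (the function name and the example center frequencies are mine):

#include <stdio.h>
#include <math.h>

/* Critical bandwidth in Hz around a center frequency fc in Hz,
   following the formula above. */
double critical_bandwidth(double fc)
{
    double khz = fc / 1000.0;
    return 25.0 + 75.0 * pow(1.0 + 1.4 * khz * khz, 0.69);
}

int main(void)
{
    double centers[] = { 100.0, 1000.0, 4000.0 };  /* example center frequencies */
    for (int i = 0; i < 3; i++)
        printf("beta_c(%.0f Hz) = %.0f Hz\n", centers[i], critical_bandwidth(centers[i]));
    return 0;
}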

Suppose a pure <strong>to</strong>ne with a frequency inside or very near <strong>to</strong> the<br />

range of the narrowb<strong>and</strong> noise is played at the same time as the noise.<br />

If the energy of the noise is the same or greater than the energy of<br />

the <strong>to</strong>ne, masking occurs. Masking is when a frequency within some<br />

threshold range of a frequency b<strong>and</strong> challenges the detection of that<br />

frequency. It is considered a failure of our audi<strong>to</strong>ry system in accurately<br />

detecting sounds, <strong>and</strong> is due <strong>to</strong> the localization of disturbances in the<br />

basilar membrane. Exciting the membrane at a frequency causes the<br />

membrane <strong>to</strong> vertically displace at that frequency, which naturally<br />

displaces the frequencies <strong>to</strong> the left <strong>and</strong> right of it, just like a b<strong>and</strong>-pass<br />

filter (see Appendix A). The depiction of masking in Figure 5.10 reflects<br />

the "lumpy" nature of the displacement along the membrane, as well<br />

as our perception of frequency differences.<br />

So, the narrowb<strong>and</strong> noise masks the sound of a pure <strong>to</strong>ne with a<br />

frequency within or near its b<strong>and</strong>width. As the b<strong>and</strong>width increases,<br />

the pure tone must be louder to be detectable.

Figure 5.10: An example of masking with the Fletcher–Munson curve for reference. When a masker exceeds the masking threshold, sounds beneath that threshold in both frequency and sound pressure level will be masked.

Figure 5.11: The displacement of the basilar membrane in response to a sound bearing the fundamental pitch p_0. When excited at this location, the wave moves upward and downward in addition to propagating towards the apical end of the membrane, where the auditory nerve is. The excitation is centered at the pitch, but also excites places immediately to the left and right of it. If a pitch slightly lower than p_0 were also present in the sound, and with a smaller amplitude, it is possible that the ear could fail to detect this second pitch due to masking.

The signal-to-noise ratio, the ratio of the intensity of the pure tone to that of the noise, is also quantifiable with respect to frequency and critical bandwidth, but this is true only up to a point. At this point, the bandwidth can continue to increase for the same intensity of the pure tone, and the pure tone is still detectable. Thus, the signal-to-noise ratio decreases, but the quality of our perception of the signal does not.

The results of these experiments support the proposition that our<br />

ear groups ranges of frequencies. Lossy compression algorithms that<br />

compress sound files <strong>to</strong> MP3 <strong>and</strong> AAC formats use these results <strong>to</strong><br />

simplify <strong>and</strong> reduce sonic data: Frequencies in close proximity <strong>to</strong> one<br />

another are mapped <strong>to</strong> a single frequency.<br />

Consonance <strong>and</strong> dissonance<br />

There have been several attempts <strong>to</strong> bridge the consonance of frequency<br />

ratios <strong>to</strong> psychoacoustics, <strong>and</strong> no theory is considered predominant<br />

or leading. The debate might continue forever, if not for the<br />

reason that sensitivities <strong>and</strong> quality preferences differ between ears<br />

<strong>and</strong> musical tastes, then for the reason that music is not only about<br />

consonance <strong>and</strong> dissonance. Many scientists <strong>and</strong> musicians, such as<br />

Hermann von Helmholtz, have tried <strong>to</strong> subjectively order intervals<br />

from consonant <strong>to</strong> dissonant. Helmholtz attributed qualities <strong>to</strong> different<br />

keys—D Major compared <strong>to</strong> G Major, for example—which might<br />

mean that his keyboard was not equally tempered, for the two should<br />

not be discernible as far as frequency ratios are concerned.<br />

My personal theory is that irrationally proportioned sounds (i.e.,<br />

those without whole integer ratios) are perceived as dissonant because<br />

frequencies related by integer ratios completely avoid the undesirable<br />

phenomenon of beating. Furthermore, some intervals are more<br />

dissonant or consonant than others. The degree of dissonance <strong>and</strong> consonance<br />

can be determined, in my opinion, by the amount of beating a<br />

given interval allows.<br />

Calculated below are the different degrees of beating that occur<br />

in the over<strong>to</strong>nes of equally tempered intervals. The fundamental<br />

frequency is 100 Hz. Reinier Plomp experimented with the audibility of<br />

partials in complex <strong>to</strong>nes <strong>and</strong> showed that humans are able <strong>to</strong> discern<br />

only up <strong>to</strong> the first five <strong>to</strong> eight over<strong>to</strong>nes in a harmonic over<strong>to</strong>ne series



Interval      f_0       f_1       f_2       f_3       f_4        f_5
P1            100       200       300       400       500        600
m2        105.946   211.893   317.839   423.785   529.732    635.678
M2        112.246   224.492   336.739   448.985   561.231    673.477
m3        118.921   237.841   356.762   475.683   594.604    713.524
M3        125.992   251.984   377.976   503.968   629.961    755.953
P4        133.484   266.968   400.452   533.936   667.420    800.904
TT        141.421   282.843   424.264   565.685   707.107    848.528
P5        149.831   299.661   449.492   599.323   749.154    898.984
m6        158.740   317.480   476.220   634.960   793.701    952.441
M6        168.179   336.359   504.538   672.717   840.896   1009.078
m7        178.180   356.359   534.539   712.719   890.899   1069.080
M7        188.775   377.550   566.325   755.099   943.874   1132.649
P8            200       400       600       800      1000       1200

Table 5.2: The first five overtones of all 13 intervals within one octave. Italicized are partials within one jnd of the partials of the fundamental frequency (100 Hz), meaning they are less than 0.5 percent off from the partial, representing consonance. Bolded are partials that lie between one jnd and 20 Hz from the partials of the fundamental, representing dissonance and roughness.

[80]. 1 In Table 5.2, we compute the first five harmonic partials (f 0 , f 1 ,<br />

. . ., f 5 ) for the twelve Western, equally tempered intervals above 100<br />

Hz. Instances when the over<strong>to</strong>nes of these intervals are within one<br />

jnd of the over<strong>to</strong>nes of the 100 Hz fundamental are highlighted by<br />

italics. Since they differ by less than 0.5 percent in frequency, I call

them consonant. If they differ by less than 20 Hz but are greater than<br />

one jnd apart (shown in bold), I call them dissonant.<br />

1 This was true for both of the two complex <strong>to</strong>nes used by Plomp in his experiment,<br />

one harmonic <strong>and</strong> one inharmonic, <strong>and</strong> each containing twelve over<strong>to</strong>nes.
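The entries of Table 5.2 can be regenerated with a short C program, assuming (as in the table) a 100 Hz fundamental, twelve-tone equal temperament, and the fundamental plus five overtones:

#include <stdio.h>
#include <math.h>

int main(void)
{
    const char *names[] = { "P1", "m2", "M2", "m3", "M3", "P4", "TT",
                            "P5", "m6", "M6", "m7", "M7", "P8" };
    double f_fund = 100.0;  /* fundamental of the lower tone, Hz */

    for (int k = 0; k <= 12; k++) {               /* interval in semitones    */
        double root = f_fund * pow(2.0, k / 12.0);
        printf("%-3s", names[k]);
        for (int n = 1; n <= 6; n++)              /* fundamental + 5 overtones */
            printf(" %9.3f", root * n);
        printf("\n");
    }
    return 0;
}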



Interval          Difference of most consonant partial to the partials of 100 Hz
Unison            0 Hz
Octave            0 Hz
Perfect fifth     0.339 Hz
Perfect fourth    0.452 Hz
Major third       3.968 Hz
Major sixth       4.538 Hz
Minor third       5.394 Hz
Minor second      5.946 Hz
Minor sixth       6.299 Hz
Tritone           7.107 Hz
Minor seventh     9.101 Hz
Major seventh     11.225 Hz
Major second      12.246 Hz

Table 5.3: Ranking of the most consonant intervals, as computed from their harmonic overtone series in Table 5.2.

As you can see, this method is not without its problems. The<br />

tri<strong>to</strong>ne, for example, is far from the most dissonant interval even<br />

though its frequency ratio is considered the "least rational" of all of<br />

the intervals (2^{6/12} = √2) in equal temperament (think: the chorus of West Side Story's "Maria"). The church even named the tritone diabolus in musica ("the devil in music"), no later than the early 18th century.

However, perhaps this notion st<strong>and</strong>s <strong>to</strong> be challenged.<br />

5.3 Perfect pitch<br />

Perfect pitch, also called absolute pitch, is a gift that a very tiny number<br />

of people possess. It is either inborn or learned at the same time as<br />

language acquisition, before the age of about four years old. People<br />

with perfect pitch, as the name might suggest, can name pitches played<br />

in isolation. So, if I were <strong>to</strong> go up <strong>to</strong> a piano <strong>and</strong> press only one key,<br />

someone with perfect pitch could tell me the note I played.



People who speak <strong>to</strong>nal languages like M<strong>and</strong>arin Chinese <strong>and</strong><br />

Vietnamese are more likely <strong>to</strong> have perfect pitch [21]. This is further<br />

evidence that note naming is similar <strong>to</strong> our cognition of language.<br />

Relative pitch is not the same thing as perfect pitch, though it might<br />

seem that way. Relative pitch is perfectly possible <strong>to</strong> acquire from<br />

playing or listening <strong>to</strong> a lot of music, <strong>and</strong> is employed in the context of<br />

more than one pitch. A person with good relative pitch can identify<br />

intervals <strong>and</strong> chords, but cannot name the pitches themselves unless<br />

given a reference point, such as key. Relative pitch is very useful for<br />

identifying the <strong>to</strong>nal center of a key, but not for naming it (as one could<br />

do if one had perfect pitch).<br />

It is estimated that less than 0.05 percent (1/2000) of the population<br />

have perfect pitch. However, in a study of 600 musicians, 40 percent<br />

of people who began learning music before the age of five possessed<br />

perfect pitch [21]. Songbirds also have perfect pitch; it is essential <strong>to</strong><br />

the success of their mating calls. In many instances, people with perfect

pitch don’t realize their own gift until someone makes it known <strong>to</strong><br />

them, or they study music theory.<br />

Some would argue that perfect pitch can be learned after the age of<br />

language acquisition. Evidence for this would be the very-close-to-perfect pitch of Mandarin and Vietnamese speakers, though short-term pitch memory is virtually universal. Native speakers of these languages

will often have an enlarged left side of the planum temporale<br />

in the brain, which is also an indica<strong>to</strong>r of musicianship. More support<br />

for this argument would be that awareness of one’s own vocal range<br />

in combination with excellent relative pitch is a way <strong>to</strong> acquire perfect<br />

pitch. However, I believe that any method of acquiring perfect pitch<br />

would require a never-ending amount of practice <strong>to</strong> keep pitches fresh<br />

in one’s memory.<br />

Perfect pitch is an optimal <strong>to</strong>ol for composers <strong>and</strong> performers—<br />

Beethoven, Mozart, Bach, H<strong>and</strong>el, Chopin, Toscanini, <strong>and</strong> <strong>An</strong><strong>to</strong>n Rubinstein<br />

all possessed it. Composition of new music <strong>and</strong> reproduction<br />

of old is easier when the specific pitches are already in your head.



Perhaps its most practical application is in tuning an instrument. But<br />

it is extremely rare, so be wary of those who say they have it.<br />

Synesthesia<br />

Sometimes, perfect pitch is accompanied by some form of synesthesia.<br />

Synesthesia (also spelled synaesthesia) is the activation of an unrelated<br />

sensation during a stimulus. It can come in weak <strong>and</strong> strong forms,<br />

<strong>and</strong> for some it can be overwhelming. I have a very common form of<br />

weak synesthesia: <strong>Numbers</strong> correspond <strong>to</strong> specific colors <strong>to</strong> me. Zero<br />

is black, one is white, two is blue, three is orange, <strong>and</strong> so on. I realized<br />

this only recently while researching musical synesthesia, though I have<br />

always been aware that I related numbers <strong>to</strong> colors. I always assumed<br />

that it was an artifact of some television show that I had watched as<br />

a child, but then I found a chart with the very same colors for these<br />

numbers on Wikipedia.<br />

<strong>Musical</strong> synesthesia is most often a visual type. People with musical<br />

synesthesia are more likely to have perfect pitch because they experience consistent phenomena when specific pitches are played. I observed six individuals with musical synesthesia.

Only three of these six subjects had perfect pitch. All of the subjects<br />

were musicians in some capacity. One reported that he saw the color

orange with A440, <strong>and</strong> the brightness of this orange depended on<br />

how loud the A was played. This relationship between intensity of<br />

the actual sensation <strong>and</strong> the intensity of the synesthetic sensation is<br />

quite common. Additionally, he compared a chord with simultaneous<br />

pitches <strong>to</strong> a Mark Rothko painting, primarily colored by the root, <strong>and</strong><br />

secondly by the third.<br />

<strong>An</strong>other of the subjects claimed <strong>to</strong> perceive colors that were literally<br />

"out of this world," nonexistent in our ROYGBIV spectrum. Yet another<br />

described some of the most tranquil, beautiful images, relating them <strong>to</strong><br />

timbres <strong>and</strong> the fullness of orchestration (the more voices in the music,<br />

the more cluttered these images). A single melody <strong>to</strong> her appeared



as "dabs of color over a dark field," while a large b<strong>and</strong> would evoke<br />

something she called a "color field" with rippling, moving currents.<br />

The subjects also seemed <strong>to</strong> have an easier time hearing over<strong>to</strong>nes <strong>and</strong><br />

identifying instruments than the general population.<br />

Amusia<br />

Amusia is defined by <strong>to</strong>ne deafness, which is the lack of relative pitch.<br />

This is most apparent when people try <strong>to</strong> sing along <strong>to</strong> a song <strong>and</strong><br />

fail <strong>to</strong> sing even remotely close <strong>to</strong> the actual melody. Diana Deutsch<br />

has done a lot of great work <strong>and</strong> research in the fields of synesthesia,<br />

perfect pitch, <strong>and</strong> amusia. Her research has shown that as many as five<br />

percent of people in the United States have amusia (four percent in the<br />

UK), defined by extremely poor relative pitch, wherein pitch changes<br />

cannot be detected <strong>and</strong> a pitch in isolation cannot be sung back [22].<br />

People with amusia, called amusics, can have normal rhythm detection,<br />

<strong>and</strong> surprisingly, there does not seem <strong>to</strong> be a correlation <strong>to</strong> language<br />

faculties. For instance, the Russian composer Vissarion Shebalin<br />

suffered from aphasia, the impairment of language ability. Shebalin<br />

wrote many genres of music ranging from operas <strong>to</strong> string quartets <strong>to</strong><br />

film scores <strong>and</strong> lived from 1902 <strong>to</strong> 1963. In 1953, he suffered a stroke<br />

that impaired his ability <strong>to</strong> communicate verbally, but not musically:<br />

He continued <strong>to</strong> compose one more symphony (his fifth) in his lifetime<br />

that was similar <strong>to</strong> his earlier compositions, <strong>and</strong> even received praise<br />

from contemporary Dmitri Shostakovich. Just as remarkably, Shebalin<br />

continued <strong>to</strong> give lectures at universities using only the language of<br />

music. So, there is no strong evidence that our brain’s musical <strong>and</strong><br />

linguistic faculties are connected.<br />

A good friend of mine from high school is now a practicing music<br />

therapist. Part of her background includes training for patients with<br />

aphasia. She described <strong>to</strong> me a technique used <strong>to</strong> res<strong>to</strong>re language<br />

abilities via music: A sentence is paired with a simple melody <strong>and</strong><br />

rhythm, <strong>to</strong> make it in<strong>to</strong> a song. The patient then attempts <strong>to</strong> repeat the



sentence back. To encourage his or her memory, the therapist can pat<br />

out the rhythm, as well as hum the tune. This technique has shown<br />

consistently positive results, <strong>and</strong> hints at a tie between language <strong>and</strong><br />

music, even though they are processed by many different parts of the<br />

brain.<br />

5.4 Chapter summary<br />

This chapter barely scratched the surface of the physiology <strong>and</strong> psychology<br />

of hearing, but our audi<strong>to</strong>ry perception is an important fac<strong>to</strong>r<br />

in the mathematical analysis of musical signals. The perceptual definitions<br />

<strong>and</strong> quantifiable behaviors of frequency resolution, temporal<br />

resolution, and hearing thresholds and limits all aid decision-making

in the compression <strong>and</strong> composition of musical sound. Applying<br />

these concepts <strong>to</strong> the output of Fourier transforms results in a more<br />

perceptually accurate model.<br />

The organ of the ear is subdivided in<strong>to</strong> three parts based on their<br />

general functions: The outer, middle, <strong>and</strong> inner ear. The outer ear<br />

receives sound, <strong>and</strong> the pressure of the air is translated on<strong>to</strong> the surface<br />

of the eardrum. This pressure is compared <strong>to</strong> the pressure on the<br />

eardrum’s interior, the chamber of the middle ear. This pressure is<br />

translated by the bones <strong>and</strong> muscles of the middle ear <strong>to</strong> the cochlea,<br />

<strong>and</strong> passed on <strong>to</strong> the audi<strong>to</strong>ry nerve which is directly connected <strong>to</strong> the<br />

brain.<br />

The basilar membrane inside of the cochlea vibrates at locations<br />

<strong>and</strong> amplitudes corresponding <strong>to</strong> the frequency spectrum of a given<br />

sound. The hair cells on the membrane are b<strong>and</strong>-pass filters, each<br />

of them tuned <strong>to</strong> a small range of frequencies. The membrane is excited<br />

by frequencies between about 20 <strong>and</strong> 20,000 Hz at minimum<br />

amplitudes defined by the Fletcher–Munson curve. Half of the membrane<br />

responds <strong>to</strong> frequencies below 1,500 Hz <strong>and</strong> the other half <strong>to</strong><br />

frequencies above 1,500 Hz, with octaves evenly spaced along it. In this



way, the basilar membrane is comparable <strong>to</strong> a logarithmic frequency<br />

domain.<br />

Because the displacement of the membrane at a given frequency<br />

naturally displaces the frequencies around it, when a sound contains<br />

closely related frequencies, the ear may not be able <strong>to</strong> distinguish<br />

between them. This is the frequency resolution of the ear, <strong>and</strong> results<br />

in the phenomena of masking <strong>and</strong> critical b<strong>and</strong>s. These phenomena<br />

are utilized in the lossy compression of digital audio.<br />

The field of psychoacoustics studies how the brain perceives sound.<br />

Pitch describes our perception of frequency, <strong>and</strong> loudness our perception<br />

of intensity. Frequencies within the range of speech are more<br />

easily detected by the ear, <strong>and</strong> frequencies near the minimum <strong>and</strong><br />

maximum of audibility require greater intensity to be audible. The

phon scale, based on the human ear’s sensitivity <strong>to</strong> 1,000 Hz <strong>and</strong> the<br />

Fletcher–Munson curve, describes intensity with respect <strong>to</strong> frequency<br />

in order for two frequencies <strong>to</strong> be equally loud. So, x-many phons<br />

is defined as the loudness of 1,000 Hz at x-many dB. The phon scale<br />

should be applied <strong>to</strong> the output of a Fourier transform when a perceptual<br />

representation of the amplitudes is desired. The sone scale<br />

describes loudness ratios, i.e., a sound with a loudness of two sones<br />

describes a sound twice as loud as one with a loudness of one sone.<br />

Perfect pitch, musical synesthesia, <strong>and</strong> amusia are neurological<br />

conditions concerning music. Perfect pitch is the ability <strong>to</strong> discern the<br />

pitch of a sound in isolation, while relative pitch is the ability <strong>to</strong> discern<br />

pitch with a known reference point, thereby computing the pitch from<br />

knowledge of pitch intervals. Children who develop music faculties at<br />

the same time as language faculties are far more likely <strong>to</strong> have perfect<br />

pitch. Synesthesia is the stimulation of multiple senses in response<br />

<strong>to</strong> a single sensory stimulus. <strong>Musical</strong> synesthesia is most often the<br />

stimulation of a visual response in addition <strong>to</strong> a sonic response, like a<br />

specific color appearing from a specific pitch. Synesthetes will often<br />

also have perfect pitch because of this relation. Finally, amusia is the<br />

impairment of musical faculties, such as memory of musical melodies



<strong>and</strong> <strong>to</strong>ne deafness. Surprisingly, aphasia (impairment of language<br />

faculties) <strong>and</strong> amusia have no definite relationship.


6. Digital audio basics<br />

The relationship between digital audio <strong>and</strong> analog audio is that of<br />

the finite <strong>and</strong> the infinite, respectively. A digital recording of music<br />

is derived from an analog signal, but it only represents it at a finite<br />

number of points in time. These points are said to be discrete, meaning they are separate, isolated instants rather than a continuum, and each is infinitesimally small: A point is zero-dimensional, while a line is one-dimensional.

Figure 6.1: A discrete function.<br />

Figure 6.2: A continuous function.<br />

In mathematics, the continuity <strong>and</strong> differentiability of a space are<br />

frequent concerns, <strong>and</strong> there are many properties that follow from<br />

functions that satisfy those conditions. Putting a song on<strong>to</strong> a computer<br />

or CD requires digitization, which produces a discrete, discontinuous representation of the signal. It is possible to perform Fourier transforms on continuous

audio signals like vinyl or live sound, but a computer can calculate the<br />

transform far more quickly <strong>and</strong> accurately than analog <strong>to</strong>ols.<br />

There are many advantages <strong>to</strong> the digital form versus the analog<br />

form. For one, material things deteriorate over time. Scratches, dirt,



<strong>and</strong> entropy eventually conquer physical records. Secondly, it is quick,<br />

cheap, <strong>and</strong> uncomplicated <strong>to</strong> replicate a digital version <strong>and</strong> s<strong>to</strong>re it in<br />

many places.<br />

<strong>An</strong> infinitely small interval of time (a point) is not perceptible by<br />

itself. However, when we provide tens of thous<strong>and</strong>s of these data<br />

points per second, our ears are fooled, <strong>and</strong> we hear something very<br />

much like the original analog signal.<br />

These data points are called samples, <strong>and</strong> the number of data points<br />

per second is the sampling rate, or sampling frequency. Sampling rate is<br />

usually given in hertz or samples per second because it periodically<br />

repeats like a frequency. However, though there are audible artifacts<br />

that result from the sampling rate, the sampling rate itself is never audible, as it is not a sine wave.

Sampling fundamentals <strong>and</strong> techniques are very important for the<br />

Fourier transform because this information must be known in order <strong>to</strong><br />

correctly specify its parameters.<br />

6.1 Sampling<br />

We encounter the word sampling in a few different places, but every<br />

usage means essentially the same thing: Sampling is the process of<br />

taking members from some set <strong>and</strong> using their identities <strong>to</strong> represent<br />

the whole set. We sample from a complete set because it is expensive,<br />

unnecessary, or even impossible <strong>to</strong> collect information about the whole<br />

set.<br />

When a sample set is a poor representation of a larger set, this will<br />

often mean that the size of the sample was <strong>to</strong>o small. It can also mean<br />

that sampling bias is at play, such as a sample of students from one<br />

high school taken <strong>to</strong> represent something about all high schools in the<br />

country. In dance music, DJs sample from music <strong>to</strong> point <strong>to</strong> a larger<br />

feature about the sampled music, like its artist, or beats per minute, or<br />

as contrast <strong>to</strong> the rest of the remix. Disk jockeys encounter sampling<br />

bias when they choose a sample <strong>to</strong>o short for listeners <strong>to</strong> recognize, or



a sample consisting of content that isn’t very characteristic of the song,<br />

like the bridge or coda.<br />

Figure 6.3: Sampling from a population (set A) produces a subset of A (set B).<br />

The sampling we do on continuous music signals is with the intention<br />

<strong>and</strong> purpose of representing the original form as closely as<br />

possible. The easiest way <strong>to</strong> do this is at a uniform, periodic rate, so<br />

that we don’t have <strong>to</strong> know anything about a song before sampling it<br />

in order <strong>to</strong> sample it well.<br />

Digital sampling transforms a continuous signal in<strong>to</strong> a discrete<br />

signal by using an impulse function.<br />

A single impulse is a vertical function with an infinitesimally small<br />

width, located at a single instant of time <strong>and</strong> zero everywhere besides<br />

at that instant. Its width is the width of a discrete point. The height of<br />

the impulse depends on the nature of the input signal: If the signal is<br />

continuous, we use the Dirac delta function <strong>to</strong> sample the signal, <strong>and</strong> if<br />

it is discrete, we use the Kronecker delta function.



Figure 6.4: The process of sampling: A signal gets multiplied by an impulse function<br />

<strong>to</strong> produce a sampled signal with sampling frequency equal <strong>to</strong> that of the impulse<br />

function.<br />

It is rare <strong>to</strong> find a (digital) signal processing text that explicitly<br />

states which delta function it is using at any given time, <strong>and</strong> unfortunately,<br />

they will sometimes be written identically with the Greek letter<br />

delta as δ(t). Some texts distinguish continuity from discreteness with<br />

parentheses for the domains of continuous functions <strong>and</strong> square brackets<br />

for discrete domains, i.e., δ(t) versus δ[t]. The Dirac delta function<br />

is a continuous function of time <strong>and</strong> it is used <strong>to</strong> sample continuous<br />

signals. It is defined<br />

\delta(t) = \begin{cases} \infty, & \text{if } t = 0 \\ 0, & \text{if } t \neq 0 \end{cases}

The Kronecker delta function is a discrete function of time used <strong>to</strong><br />

sample discrete functions, <strong>and</strong> has a similar definition:<br />

\delta[t] = \begin{cases} 1, & \text{if } t = 0 \\ 0, & \text{if } t \neq 0 \end{cases}

These functions can be placed anywhere along the time domain. For<br />

example, if we wanted an impulse located at t =2, then we specify the<br />

domain of the function a bit differently by shifting it <strong>to</strong> the right:<br />

\delta[t - 2] = \begin{cases} 1, & \text{if } t = 2 \\ 0, & \text{if } t \neq 2 \end{cases}



The Dirac delta function is continuous <strong>and</strong> integrable, <strong>and</strong><br />

\int_{-\infty}^{\infty} \delta(t - a)\,dt = 1

for any real number a. The Kronecker delta function is discrete <strong>and</strong><br />

therefore nonintegrable, <strong>and</strong> its global sum is equal <strong>to</strong> 1 because it is<br />

1 at only one point <strong>and</strong> 0 elsewhere. To sample a continuous input<br />

signal, we integrate its multiplication with a delta function centered at<br />

a desired point. Say that we want <strong>to</strong> do this in the input signal x(t) at<br />

t = a, <strong>and</strong> call the resulting sampled input function x s (t). Then,<br />

x_s(t) = \int_{-\infty}^{\infty} x(t)\,\delta(t - a)\,dt.

This outputs a continuous function that is 0 everywhere except at t = a,<br />

where its amplitude is exactly x(a). So the sampled function reflects<br />

the amplitude of the function x(t) at the points at which it is sampled.<br />

For music, we want <strong>to</strong> sample at a periodic rate because of the<br />

high volume of samples <strong>and</strong> audible range of frequencies. Applying<br />

a sampling rate <strong>to</strong> a musical signal outputs a function with uniformly<br />

spaced samples. To sample the function at multiple points,<br />

we simply sequence all of the individual integrals. For a sampling<br />

rate of 1/a, these would be the points centered at the sequence of times

{0, a, 2a, . . . , (N − 1)a}. In general, we compute a sampled function<br />

x s (t) by the sequence of integrals<br />

x_s(t) = \left\{ \int_{-\infty}^{\infty} x(t)\,\delta(t)\,dt,\ \int_{-\infty}^{\infty} x(t)\,\delta(t - a)\,dt,\ \int_{-\infty}^{\infty} x(t)\,\delta(t - 2a)\,dt,\ \ldots,\ \int_{-\infty}^{\infty} x(t)\,\delta(t - (N-1)a)\,dt \right\}.

In the discrete case, this is a very similar process. Instead of a sequence of integrals, we form a sequence of individual sums. A sampled



discrete function is then<br />

x_s[t] = \left\{ \sum_{t=-\infty}^{\infty} x[t]\,\delta[t],\ \sum_{t=-\infty}^{\infty} x[t]\,\delta[t - a],\ \ldots,\ \sum_{t=-\infty}^{\infty} x[t]\,\delta[t - (N-1)a] \right\}

where the kth term of x s [t] is equal <strong>to</strong><br />

x_s[k] = \sum_{t=-\infty}^{\infty} x[t]\,\delta[t - ka]

for k =0, 1, . . . , N − 1. The Kronecker delta is called a unit impulse,<br />

because "unit" implies unity, meaning 1 (like the unit circle). 1<br />

This<br />

method of sampling is called ideal sampling, <strong>and</strong> others include instantaneous<br />

<strong>and</strong> natural sampling.<br />

For example, let x(t) = cos(1.5πt) be our continuous signal, <strong>and</strong><br />

say that it only happens over the interval 0 ≤ t ≤ 2 seconds. Let

our sampling period T =0.5 seconds, so the sampling frequency is<br />

f s =1/T =2Hz. Then<br />

\int_{-\infty}^{\infty} x(t)\,\delta(t - 0)\,dt = \begin{cases} \cos(1.5 \cdot \pi \cdot 0) = 1, & \text{if } t = 0 \\ 0, & \text{otherwise} \end{cases}

\int_{-\infty}^{\infty} x(t)\,\delta(t - 0.5)\,dt = \begin{cases} \cos(1.5 \cdot \pi \cdot 0.5) = -\sqrt{2}/2, & \text{if } t = 0.5 \\ 0, & \text{otherwise} \end{cases}

\int_{-\infty}^{\infty} x(t)\,\delta(t - 1)\,dt = \begin{cases} \cos(1.5 \cdot \pi \cdot 1) = 0, & \text{if } t = 1 \\ 0, & \text{otherwise} \end{cases}

\int_{-\infty}^{\infty} x(t)\,\delta(t - 1.5)\,dt = \begin{cases} \cos(1.5 \cdot \pi \cdot 1.5) = \sqrt{2}/2, & \text{if } t = 1.5 \\ 0, & \text{otherwise} \end{cases}

1 This is also called a normalized or normal function. Audio for example is normalized<br />

<strong>to</strong> only take on amplitudes between −1 <strong>and</strong> 1.



Figure 6.5: Three different types of sampling of an analog signal, depicted in the <strong>to</strong>p<br />

left corner. Ideal sampling uses impulses, instantaneous sampling uses rectangles<br />

centered at the height of the analog signal, <strong>and</strong> natural sampling uses trapezoids with<br />

more coordinates of the curve. Pulse-code modulation (PCM), the type of sampling used<br />

by CDs, uses ideal sampling.<br />

so x_s(t) = {1, −√2/2, 0, √2/2}. Formally, this is the equation

i(t) = δ_D(t) + δ_D(t − 0.5) + δ_D(t − 1) + δ_D(t − 1.5).

The discrete version is calculated similarly by sums, with identical<br />

results. Therefore, the Kronecker delta <strong>and</strong> the Dirac delta are conceptually<br />

identical because they both effectively sample a function.<br />

From here on, we will just be using the notation δ(t), meaning Dirac in<br />

continuous cases (integrals) <strong>and</strong> Kronecker in discrete ones (sums).



t        x_s(t)
0         1
0.5      −0.707
1         0
1.5       0.707
2        −1
2.5       0.707
3         0
3.0001   does not exist
3.5      −0.707
4         1
4.5      −0.707
5         0

Table 6.1: The function x(t) = cos(1.5πt) sampled over the interval 0 ≤ t ≤ 5 with a sampling rate of 2 Hz.

Table 6.1 <strong>and</strong> the graph above it show the function cos(1.5πt)<br />

(whether continuous or discrete) sampled over the interval 0 ≤ t ≤ 5<br />

seconds with sampling rate f s =2Hz, i.e., T =0.5 seconds as above.<br />

We can also represent the multiplication by<br />

x_s(t) = x(t) \cdot \bigl(\delta(t) + \delta(t - T) + \delta(t - 2T) + \ldots\bigr) = x(t) \sum_{n=0}^{\infty} \delta(t - nT)

where x s (t) is the sampled signal. We do this <strong>to</strong> infinity <strong>to</strong> ensure that<br />

the entire signal, x(t), is sampled, but if the time domain’s endpoints<br />

are known it is sufficient <strong>to</strong> just sample over them. Sometimes you<br />

will see the impulse function expressed over both positive values <strong>and</strong>



negative values, expressed by<br />

i_s(t) = \ldots + \delta(t + T) + \delta(t) + \delta(t - T) + \ldots = \sum_{k=-\infty}^{\infty} \delta(t - kT),

but in general, a signal doesn’t begin before t =0, so these impulses<br />

over the negative time domain would return values of 0 from x(t).<br />

Their inclusion is unnecessary but doesn’t hurt, <strong>and</strong> leads <strong>to</strong> a more<br />

mathematically rigorous expression.<br />

Choosing a sampling rate<br />

A signal contains frequencies, <strong>and</strong> these frequencies are positively<br />

valued. We don’t care about frequencies that we cannot hear, <strong>and</strong><br />

therefore, we don’t need <strong>to</strong> sample for the frequencies outside the 20-<br />

20,000 Hz range. When we choose a sampling rate, we have <strong>to</strong> choose<br />

the maximum frequency that we want <strong>to</strong> detect through sampling.<br />

We call this f max , <strong>and</strong> we require the sampling frequency f s <strong>to</strong> be<br />

greater than twice the value of f max <strong>to</strong> effectively capture all audible<br />

frequencies present in a signal. So, a sine wave

must be sampled at least twice per period in order <strong>to</strong> be digitally<br />

represented. 2<br />

We talked above about how sampling has the ability <strong>to</strong> dis<strong>to</strong>rt<br />

signals, namely, when a low sampling rate is chosen. This is called undersampling.<br />

The resulting sampled signal misrepresents its frequency<br />

information by either failing to include certain frequencies at all, or by

misnaming them ("aliasing"), as shown later on in Figure 6.8.<br />

2 Technically, this should be more than twice per period: The only sine wave that<br />

could be represented with a sampling frequency equal <strong>to</strong> twice its frequency is a<br />

cosine wave with zero phase. Therefore, the probability of correctly sampling a signal<br />

with sampling frequency equal <strong>to</strong> two times the maximum frequency component<br />

approaches 0, because otherwise, the amplitudes do not accurately represent the true<br />

amplitudes of the original system. See the example in Table 6.2.



Fortunately for us, there is a theorem that explicitly states the minimum sampling rate above which a signal must be sampled to avoid undersampling.

The Nyquist-Shannon Sampling Theorem: If a signal<br />

x(t) contains no frequencies greater than f max cycles per<br />

second (Hz), then it is completely determined by a series of<br />

points spaced less than 1/(2 f_max) seconds apart. Then we can

choose the sampling frequency f s > 2f max <strong>and</strong> completely<br />

reconstruct the original signal.<br />

The minimum sampling frequency is often referred <strong>to</strong> as the Nyquist<br />

frequency, Nyquist rate, or Nyquist limit, <strong>and</strong> the period of the impulse<br />

function, given by 1/f_s, is the Nyquist period. The bandwidth β of a sampled

signal is greater than the difference between its highest frequency component<br />

<strong>and</strong> its lowest (which is 0 Hz), so β > f max . B<strong>and</strong>width refers<br />

<strong>to</strong> the frequency "b<strong>and</strong>s" that define the size of the range of frequencies<br />

in a signal or filter.<br />

This theorem says that a sampling rate of 1 Hz would not be able<br />

<strong>to</strong> detect frequencies equal <strong>to</strong> or greater than 0.5Hz. However, there is<br />

one scenario in which f s =2f max returns a sufficiently sampled signal:<br />

When the signal is a cosine wave with no phase shift, or a sine wave<br />

with a phase equal to π/2. Consider the signal x(t) = sin(0.5 · 2πt + π/2) = sin(πt + π/2) over the interval of time t = [0, 7].

So, x s (t) = (1, −1, 1, −1, 1, −1, 1, −1). From just this sampled signal,<br />

we can see that every odd-numbered sample is 1 and every even-numbered sample is −1,

so we deduce that it has a period of 2 seconds <strong>and</strong> a frequency of<br />

0.5 Hz. Because pressure changes are what cause vibrations within<br />

the cochlea, you should agree that this sampling rate could not detect<br />

a frequency higher than the one presented here, for the signal<br />

x(t) = sin(πt + π/2) sampled at 0.5 Hz returns the constant signal<br />

x s (t) = (1, 1, 1, 1, 1, 1, 1, 1) which is inaudible <strong>and</strong> frequency-less (0<br />

Hz).



t    x(t)
0     1
1    −1
2     1
3    −1
4     1
5    −1
6     1
7    −1

Table 6.2: The sine wave x(t) = sin(πt + π/2) [i.e., cos(πt)] sampled at 1 Hz. Any other phase shift of this sine wave would return incorrect amplitudes and indicate that the sine wave was multiplied by some constant less than 1. A sine wave with no phase shift, for example, would return samples of all zeros at this sampling rate.

In summary, the sampling rate is crucial <strong>to</strong> avoiding dis<strong>to</strong>rtion <strong>and</strong><br />

detecting all of the frequencies of a given signal. The sampling rate<br />

of CDs is 44,100 Hz, so the highest frequency that can be detected is<br />

22,050 Hz which is beyond our range of hearing.<br />

However, that is not <strong>to</strong> say that audio signals do not contain frequencies<br />

beyond 22,050 Hz—in fact, they absolutely do, we just can’t<br />

hear them. Professional audio is sampled at 48,000 Hz and high-definition audio at 96,000 Hz to include more frequencies and improve

the fidelity of the audio. At the cost of file size, these files can be<br />

processed with effects in digital audio workstations like Pro Tools <strong>and</strong><br />

still maintain a high amount of fidelity. Consider, for example, slowing<br />

the speed of a segment of audio sampled at 44.1 kHz by a fac<strong>to</strong>r of 2.<br />

This means that there will be only 22,050 samples per second retained<br />

from the original audio. At a higher sampling rate like 96 kHz, 48,000<br />

samples per second would be retained.<br />

In summary, when a signal is sampled at a frequency less than the<br />

Nyquist limit, undersampling happens <strong>and</strong> aliasing occurs.



Aliasing<br />

Aliasing refers <strong>to</strong> the incorrect mapping of a frequency component<br />

<strong>to</strong> another frequency component, specifically, mapping a frequency<br />

above the Nyquist limit <strong>to</strong> one below the Nyquist limit. The effect in<br />

audio can sound like a ringing or whistling, or as I call it, the Coke bottle<br />

effect, wherein the sound appears <strong>to</strong> be recorded in a large, reverberant<br />

room with low-frequency resonances. Aliasing can be easily avoided<br />

by filtering the audio file before sampling: A low-pass filter with a cut-off<br />

frequency set at f_s/2 will allow only frequencies less than the cutoff to

pass through, <strong>and</strong> frequencies above it will be diminished. This is also<br />

called an anti-aliasing filter.<br />

Figure 6.6: A low-pass filter with cu<strong>to</strong>ff frequency at 22,050 Hz can be used on audio<br />

data <strong>to</strong> reduce the possibility of aliasing. Amplitudes of frequencies greater than f c<br />

will be increasingly smaller.<br />

The graph in Figure 6.8 shows a sinusoid (call it x 1 ) aptly sampled<br />

by the given sampling frequency, <strong>and</strong> another sinusoid (x 2 ) of higher<br />

frequency that is inadequately sampled. Here, x 2 would be interpreted<br />

to have the same frequency as x_1 when in fact it has five times the frequency of x_1.

Here, the frequency of x 1 is 0.5 Hz, the frequency of x 2 is 2.5 Hz,<br />

<strong>and</strong> the sampling frequency is 2 Hz. The samples retrieved are identical



Figure 6.7: The above is a spectrogram of Amy Winehouse’s "Rehab." Note that most<br />

of the file is contained within a solid rectangle, <strong>and</strong> some parts leak above it. There<br />

are at least two different anti-aliasing filters applied here: One with a higher cu<strong>to</strong>ff<br />

frequency for Amy’s voice at 18,000 Hz, <strong>and</strong> the other applied <strong>to</strong> the rest at 16,000 Hz.<br />

Figure 6.8: At the specified sampling frequency, the samples retrieved from the two sinusoids are identical because the higher frequency component has been undersampled.

for both of the frequency components, so naturally, this is a source of<br />

confusion. The reconstructed frequency of x 2 would then be 0.5 Hz.<br />

Why is it called aliasing? When a signal is undersampled, the interval between samples T_s is too large to accurately detect frequencies greater than f_s/2, and will detect them incorrectly as shown above.

When the sample is reconstructed, it will reconstruct the 2.5 Hz sine<br />

wave as a 0.5 Hz sine wave. Hence, the identity of the sinusoid will be<br />

misrepresented (an "alias").
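The aliasing shown in Figure 6.8 can be verified numerically. The sketch below, in C, samples the 0.5 Hz and 2.5 Hz sinusoids at 2 Hz, the values from the example above, and prints identical sample values for both (variable names are mine).

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double PI = 3.14159265358979323846;
    double fs = 2.0;                           /* sampling frequency, Hz */
    for (int n = 0; n < 8; n++) {
        double t  = n / fs;
        double x1 = sin(2.0 * PI * 0.5 * t);   /* adequately sampled     */
        double x2 = sin(2.0 * PI * 2.5 * t);   /* undersampled: aliases  */
        printf("t = %.2f  x1 = %+.3f  x2 = %+.3f\n", t, x1, x2);
    }
    return 0;
}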



There is also such a thing as oversampling, which improves the fidelity of a sampled sound file but increases its file size. An anti-aliasing filter can only be so steep at its cutoff frequency, so the cutoff frequency must be somewhat less than the Nyquist limit in order to attenuate frequencies beyond it. Therefore, by setting the sampling frequency higher, we can raise the cutoff frequency of the filter. Furthermore, the resolution is improved, so it becomes easier to eliminate noise: increasing f_s increases the bandwidth, and the energy of the noise stays constant over all bandwidths. Increasing the size of the bandwidth therefore decreases the energy of the noise per unit division of the signal, and the signal-to-noise ratio improves when resolution is improved.

6.2 Compression<br />

To compress something means to make it more dense, reducing its volume by eliminating unused or unnecessary space.³ When you convert a WAV file to an MP3 file, the file size shrinks three-fold to twenty-fold depending on the fidelity of the result, and the resulting .mp3 sounds virtually identical. How can this be?

The compression of audio files is done via algorithms. For most audio purposes, algorithms are designed to take a file of data, discover trends in the data, and perform some redundancy removal to produce a smaller file.

A simple kind of algorithm—not necessarily applying to audio per se—is a sorting algorithm: For an input set of numbers, we put them in some designated order, like lowest to highest or vice versa. An inefficient way of doing this is to compare each number to all of the other numbers in the set, ordering the numbers accordingly.

³ The term is also used to mean dynamic range compression, an audio effect implementing sustain. This kind of compression is not related to data compression, which is the topic here.



For example,

{9, 5, 3, 6, 12, 1, −4, −112, 8} → {−112, −4, 1, 3, 5, 6, 8, 9, 12}.

So, for the first element of the set, 9, we would have eight questions to ask: Is 9 greater or less than 5? Is it greater or less than 3? And so on, until the whole set had been compared to 9. Then for the next element, 5, we would have to ask seven questions, because it has already been compared to the first entry, 9: Is it greater or less than 3? Is it greater or less than 6? The third element would require six questions, the fourth five questions, and so on, so a total of 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 36 questions are required.
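As a sketch of the counting argument above (an illustration only, not a recommended sorting algorithm), the following C program orders the same nine-number set by comparing each element with every later element, counting the questions asked along the way; it reports 36 comparisons.

```c
/* compare_count.c -- sort the example set by comparing every pair once
 * and count the comparisons: for N = 9 this is 8 + 7 + ... + 1 = 36.
 * Compile: cc compare_count.c -o compare_count                           */
#include <stdio.h>

int main(void) {
    int a[] = {9, 5, 3, 6, 12, 1, -4, -112, 8};
    int n = (int)(sizeof a / sizeof a[0]);
    long comparisons = 0;

    for (int i = 0; i < n - 1; i++) {
        for (int j = i + 1; j < n; j++) {
            comparisons++;            /* "is a[i] greater or less than a[j]?" */
            if (a[j] < a[i]) {        /* keep the smaller value in front      */
                int tmp = a[i];
                a[i] = a[j];
                a[j] = tmp;
            }
        }
    }

    printf("sorted:");
    for (int i = 0; i < n; i++)
        printf(" %d", a[i]);
    printf("\ncomparisons: %ld\n", comparisons);
    return 0;
}
```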

In the worst-case scenario in which we compare every number to every number, this requires roughly N^2 operations, where N is the size of the input set. The quick-sort algorithm improves upon this, and does it on average in about N log_2 N operations. This is a great improvement—just consider the difference between 256^2 = 65,536 and 256 log_2 256 = 2048, and 256 samples isn't even close to one second of audio.

Explaining general algorithm design is a whole other book—even the 13 most popular sorting algorithms would take dozens of pages to explain, and sorting is perhaps the simplest algorithm to explain conceptually. But alas, the discrete Fourier transform and the host of transforms that branch out from it are all algorithms. We will show that the discrete Fourier transform, performed literally from its definition, requires N^2 computations (N is the total number of samples of x_s(t)), while the fast Fourier transform requires only N log_2 N of them.

Uncompressed audio is ideal for editing and putting effects on music, and processing almost always lowers the fidelity (quality) of the data. In addition to the sampling frequency and the length of a song, there are several other variables that will determine the total size of the information contained in an uncompressed audio file.



1. The number of channels, C: This is 1 for mono and 2 for stereo. It is unlikely that an audio file will have more than 2 because sound systems typically consist of only 2 speakers, but Dolby Surround Sound used on DVDs, for example, has 6 channels (though it is notated 5.1: 5 channels and 1 subwoofer). Audio CDs can only have 2 channels.

2. The number of bits per sample, b: This is called the bit depth, and it defines how many bits are used to store each sample. Each sample is a binary number whose length in bits is the bit depth, and whose value expresses the instantaneous amplitude. The bit depth specifies the resolution of the dynamic range of a sampled audio file, i.e., the range of intensities that the sample can take on. Because of the nature of binary, increasing the bit depth by 1 bit doubles the resolution.⁴ We convert this to decibels using the equivalence 20 log_10 2 = 6.02 dB, so an increase of 1 bit adds about 6 decibels to the dynamic range. The dynamic range of the average human ear spans approximately 140 dB SPL, so a 24-bit depth that yields 144 dB in dynamic range is ideal. For ease of computation, a 16-bit depth (96 dB, the bit depth of audio CDs) is decent, but the greater the bit depth, the higher the resolution and hence fidelity of the resulting file, meaning less noise and less truncation of a song's samples.

The size of the file in bytes, where one byte is 8 bits, can then be calculated as

|file| = f_s · C · b · T / 8



where T is the duration in seconds of the song. There are a few common formats of uncompressed (lossless) audio files; they are ordered by their popularity in Table 6.3.

File format                      Extension   f_s         Bit depth
Waveform Audio File Format       .wav        44,100 Hz   16-bit
Audio Interchange File Format    .aiff       44,100 Hz   16-bit
Au                               .au         8,000 Hz    32-bit

Table 6.3: Uncompressed (lossless) audio file formats, ordered by frequency of usage.

So the size of a 180-second .wav or .aiff file would be

f_s · C · b · T / 8 = 44,100 · 2 · 16 · 180 / 8
                    = 31,752,000 bytes
                    = 31,752,000 / 1024^2 = 30.28 MB,

while the size of a 180-second .au file would be approximately 11 MB. We divide by 1024^2 because 1 kilobyte (kB) = 1024 bytes, and 1 megabyte (MB) = 1024 kB.
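The same arithmetic can be wrapped in a few lines of C. This is a minimal sketch of the formula |file| = f_s · C · b · T / 8; the stereo, 32-bit assumption for the .au example is mine, chosen because it reproduces the roughly 11 MB figure above.

```c
/* wav_size.c -- size of an uncompressed audio file from fs, C, b and T.
 * Compile: cc wav_size.c -o wav_size                                     */
#include <stdio.h>

/* |file| = fs * C * b * T / 8, in bytes */
static double file_size_bytes(double fs, double C, double b, double T) {
    return fs * C * b * T / 8.0;
}

int main(void) {
    double wav = file_size_bytes(44100.0, 2.0, 16.0, 180.0);
    double au  = file_size_bytes(8000.0, 2.0, 32.0, 180.0); /* assuming a stereo .au file */

    printf(".wav/.aiff: %12.0f bytes = %.2f MB\n", wav, wav / (1024.0 * 1024.0));
    printf(".au:        %12.0f bytes = %.2f MB\n", au,  au  / (1024.0 * 1024.0));
    return 0;
}
```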

Pulse-code modulation (PCM)

The standard method for digitally sampling analog signals is pulse-code modulation, abbreviated PCM. Pulse-code modulation applies a pulse to an analog signal at regular, uniform intervals, exactly as described previously in this chapter. This is also called quantization, which means the division of something into equally sized parts (the term is also used in the context of looping and metric beat creation).

In digital sampling, the range of values that an amplitude can take on is limited by the bit depth of the digital file. The amplitudes of the analog signal are rounded at each pulse to the nearest binary value.

⁴ In base-10, for example, increasing the number of places (hundreds, tens, ones, etc.) increases the number of possible values ten-fold, i.e., 0-99 (100 values) versus 0-999 (1000 values).



A bit depth of 16, for example, means that amplitudes can take on 2^16, or 65,536, different values. We call the difference between the amplitudes of an analog signal and its digital representation the quantization error, or quantization distortion.

PCM is the technique used by WAV and AIFF files, as well as in compressed formats like MP3, Ogg Vorbis, and WMA. Pulse-code modulation is applied before compression.
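The rounding step itself is easy to sketch in C. The mapping below (scaling a value in [−1, 1] by 2^15 − 1 and rounding to the nearest integer) is one common convention, not the only one; the printed error is the quantization error for each sample.

```c
/* quantize_demo.c -- round amplitudes in [-1, 1] to 16-bit codes and show
 * the quantization error introduced by the rounding.
 * Compile: cc quantize_demo.c -o quantize_demo -lm                        */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double scale = 32767.0;   /* 2^15 - 1: largest positive 16-bit code */
    double samples[] = {0.0, 0.707107, -0.333333, 0.999999, -1.0};
    int n = (int)(sizeof samples / sizeof samples[0]);

    for (int i = 0; i < n; i++) {
        double x = samples[i];
        long   code = lround(x * scale);   /* nearest 16-bit integer code   */
        double y    = code / scale;        /* amplitude the code represents */
        printf("x = %+9.6f  ->  code %+6ld  ->  %+9.6f   (error %+.2e)\n",
               x, code, y, x - y);
    }
    return 0;
}
```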

Resource Interchange File Format

The Resource Interchange File Format (RIFF) encompasses the WAVE file format. WAVE readers in MATLAB and Mathematica make inputting WAVE audio simple, but the files contain more than just audio data. At the very beginning (the header) of a RIFF file are 44 bytes of information about the file, such as its bit depth, sampling rate, and format. The information is organized in 2- and 4-byte fields. They are read according to their endianness: Little endian means that the data is written with the least significant byte first. So, the number 18 is expressed in little endian as "00010010 00000000 00000000 00000000" in binary (spaces separating the bytes) and "12 00 00 00" in hexadecimal (four bytes). In big endian, 18 would be written "00000000 00000000 00000000 00010010" in binary and "00 00 00 12" in hexadecimal [23]. Fields of the little endian type correspond to numerical, integer quantities, while fields of the big endian type relate to ASCII (short for "American Standard Code for Information Interchange," mapping characters of the English alphabet to numbers).

The header fields and the raw sound data of a RIFF file are specified byte by byte (one byte is 8 bits). In Tables 6.4 and 6.5, the fields are given in hexadecimal [23].



Type      Position (bytes)   Field name        Field size (bytes)
ASCII     0                  ChunkID           4
integer   4                  ChunkSize         4
ASCII     8                  Format            4
ASCII     12                 Subchunk1ID       4
integer   16                 Subchunk1Size     4
integer   20                 AudioFormat       2
integer   22                 NumChannels       2
integer   24                 SampleRate        4
integer   28                 ByteRate          4
integer   32                 BlockAlign        2
integer   34                 BitsPerSample     2
ASCII     36                 Subchunk2ID       4
integer   40                 Subchunk2Size     4
integer   44                 actual raw data   varies

Table 6.4: Positions of different fields in RIFF encoding.
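To see these fields in a real file, the little sketch below reads the first 44 bytes of a WAVE file and decodes the little-endian integer fields of Table 6.4. It assumes the canonical 44-byte header (a 16-byte "fmt " chunk followed immediately by the "data" chunk), which holds for plain PCM files but not for every WAVE file in the wild.

```c
/* wav_header.c -- print the fields of a canonical 44-byte RIFF/WAVE header.
 * Compile: cc wav_header.c -o wav_header      Run: ./wav_header song.wav  */
#include <stdio.h>

/* assemble a little-endian unsigned integer from n bytes starting at p */
static unsigned long le(const unsigned char *p, int n) {
    unsigned long v = 0;
    for (int i = n - 1; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

int main(int argc, char **argv) {
    unsigned char h[44];
    FILE *f;

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL) {
        fprintf(stderr, "usage: wav_header file.wav\n");
        return 1;
    }
    if (fread(h, 1, 44, f) != 44) {
        fprintf(stderr, "file too short for a 44-byte header\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("ChunkID        %.4s\n", (const char *)h);        /* "RIFF" */
    printf("ChunkSize      %lu\n",  le(h + 4, 4));
    printf("Format         %.4s\n", (const char *)(h + 8));  /* "WAVE" */
    printf("AudioFormat    %lu (1 = PCM)\n", le(h + 20, 2));
    printf("NumChannels    %lu\n",  le(h + 22, 2));
    printf("SampleRate     %lu Hz\n", le(h + 24, 4));
    printf("ByteRate       %lu\n",  le(h + 28, 4));
    printf("BlockAlign     %lu\n",  le(h + 32, 2));
    printf("BitsPerSample  %lu\n",  le(h + 34, 2));
    printf("Subchunk2Size  %lu\n",  le(h + 40, 4));
    return 0;
}
```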

There are two classes of efficient algorithms for the compression of audio files: Lossless and lossy compression algorithms. The first, when decompressed, returns the exact original file with none of the data removed, but the second cannot do this.

Lossless compression

Lossless compression retains enough data from an uncompressed file that the entire original file can be reproduced when the file is decompressed. Lossy compression, however, retains only a portion of the raw data and cannot reproduce the original file upon decompression. ZIP files (usually used for lossless compression of text or text-like data) are generated by lossless compression, and "unzipping" them gives back the uncompressed file, or folder of files. Lossless files are greater in fidelity and size than lossy files, and better for performing audio editing and effects upon.

The code or hardware that performs the compression and decompression is called a codec, short for coder–decoder (or compressor–decompressor).



Field name      Description                                        Example (WAVE)
ChunkID         The letters "RIFF" in ASCII form                   52 49 46 46
ChunkSize       The size of the entire file in bytes, minus 8      N + 36
Format          The type of the format in ASCII, like "WAVE"       57 41 56 45
Subchunk1ID     The letters "fmt " plus a space, in ASCII          66 6d 74 20
Subchunk1Size   The size of the rest of the subchunk               10 00 00 00 (16)
                (until Subchunk2ID); 16 for PCM
AudioFormat     The form of sampling; 1 for PCM                    01 00 (1)
NumChannels     1 for mono, 2 for stereo, and so on                02 00 (2, for stereo)
SampleRate      f_s                                                44 AC 00 00 (44,100 Hz)
ByteRate        SampleRate * NumChannels * BitsPerSample/8         10 B1 02 00 (176,400)
BlockAlign      NumChannels * BitsPerSample/8                      04 00 (4)
BitsPerSample   The bit depth                                      10 00 (16)
Subchunk2ID     The letters "data" in ASCII                        64 61 74 61
Subchunk2Size   NumSamples * NumChannels * BitsPerSample/8         N
data            The raw sound data                                 —

Table 6.5: The format of RIFF encoding.

The most popular lossless codec is probably the FLAC format, which stands for Free Lossless Audio Codec. Because the amount of compression achieved depends on the content of the file, there is no way to calculate the exact size of the resulting file in advance. FLAC files are typically 30-50 percent of the size of the original, where more repetitive songs fall in the low end of this range.

Lossy compression<br />

Lossy compression works by discarding some of the data from an<br />

uncompressed file <strong>to</strong> produce an encoded file that is 5-20 percent<br />

the size of the original file. One of the techniques it uses <strong>to</strong> evaluate<br />

what data can be discarded is analysis of frequencies. It uses the



Fletcher–Munson curve and the psychoacoustic notion of critical bands to reduce the bit depth of frequencies closer to the extremes of our hearing range, eliminate sounds too quiet to hear, and discard frequencies that would not be perceived because of masking by other frequencies within their critical bands. In this way, the Fletcher–Munson curve can be thought of as a probability density function that determines the probability of a given frequency in some file at a specific time. Similar to lossless compression, frequencies like those between 30 and 5000 Hz (approximately the range of the piano) will have a high probability and therefore a higher resolution, and frequencies outside of this range will have low probability and a corresponding lower resolution. Those with lower probabilities are then given a low bit depth, or even discarded altogether when given a bit depth of 0.

Therefore, the Fourier transform is a typical component of lossy compression algorithms, because frequency information from the music is easier to approximate than is, say, the actual signal data, and part of the reason is the way the human ear perceives frequencies. Video files can be compressed using lossy algorithms as well. Websites that stream video, like Hulu and YouTube, use lossy algorithms to play video in real- or better-than-real-time (i.e., the buffer fills ahead of time). Where squares of similar colors appear in images and video, compression is acting upon the detail and resolution: The more noticeable and artificial these squares are, the more excessive and extreme the compression is. As the Internet's bandwidth grows, the necessary amount of compression for live streaming diminishes.

The quantity of data discarded is inversely proportional to the specified bit rate: The higher the bit rate, the less data is removed and the smaller the quantization error. The bit rate is f_s · C · b, the same variables from above in our calculation of the size of uncompressed audio files. Therefore, bit rate corresponds to fidelity. Note that a bit rate is in bits, not bytes: A bit rate of 100 kbps (kilobits per second) corresponds to 100,000 bits per second, not bytes per second. The size of the resulting encoded file is proportional to the bit rate of the codec.
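Since the encoded size scales with the bit rate, a back-of-the-envelope estimate needs only the bit rate and the duration. Here is a small C sketch using the 180-second song from earlier in the chapter and an assumed 128 kbps encoding; real encoder output varies with content, VBR decisions, and container overhead.

```c
/* bitrate_size.c -- compare the uncompressed bit rate fs*C*b with a lossy
 * encoding bit rate, and the file sizes each implies for a 180 s song.
 * Compile: cc bitrate_size.c -o bitrate_size                              */
#include <stdio.h>

int main(void) {
    const double fs = 44100.0, C = 2.0, b = 16.0, T = 180.0;
    double pcm_bps = fs * C * b;      /* uncompressed bit rate (bits/second) */
    double mp3_bps = 128000.0;        /* an assumed 128 kbps MP3 encoding    */

    double pcm_bytes = pcm_bps * T / 8.0;
    double mp3_bytes = mp3_bps * T / 8.0;

    printf("uncompressed: %8.0f bps  ->  %6.2f MB\n",
           pcm_bps, pcm_bytes / (1024.0 * 1024.0));
    printf("128 kbps MP3: %8.0f bps  ->  %6.2f MB  (%.1f%% of the original)\n",
           mp3_bps, mp3_bytes / (1024.0 * 1024.0), 100.0 * mp3_bytes / pcm_bytes);
    return 0;
}
```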



Codec                        File extension   Bit rate       f_s
MPEG-1 Audio Layer III       .mp3             32-320 kbps    32-48 kHz
MPEG-2 Audio Layer III       .mp3             8-160 kbps     16-24 kHz
Advanced Audio Coding        .aac             8-320 kbps     8-96 kHz
Windows Media Audio Lossy    .wma             32-768 kbps    8-48 kHz
Ogg Vorbis                   .ogg             16-500 kbps    8-192 kHz

Table 6.6: The different bit rates and sampling rates of common lossy-compressed audio file formats.

An MP3 file with a common bit rate of 128 kbps will be around 11% of the size of the original file. You sometimes have the option to change the bit rate at which AAC files and MP3 files are imported to your computer using Variable Bit Rate (VBR) encoding. Some popular formats are given in Table 6.6.

Constant Bit Rate (CBR) encoding encodes an audio file at a specified, constant bit rate, while VBR encoding encodes an audio file at different bit rates depending on its content, using the probabilistic scheme explained above but with respect to the amplitude in dB SPL.

6.3 Chapter summary<br />

Sampling a piece of continuous audio gives a discrete representation of it, which is necessary for any kind of practical analysis. In order to adequately sample a piece of audio, we need to know its maximum frequency component. We sample a continuous signal by multiplying it with an impulse train to get a discrete set of points. We want to retrieve a sufficient number of these points to avoid distortion resulting from undersampling, or aliasing. We can avoid this by choosing a sampling frequency above the Nyquist limit—equal to 2 times the maximum frequency component of the original audio—and applying a low-pass, anti-aliasing filter to eliminate frequencies above half of that sampling frequency before sampling. In other words, a waveform can be accurately reconstructed only when more than two samples per period of each frequency component are taken. Because we cannot hear beyond about 20,000 Hz, a sampling rate or sampling frequency of 44,100 Hz is sufficient.

We can oversample audio to improve its resolution and signal-to-noise ratio, but the higher the sampling frequency, the larger the resulting file. Compression algorithms work to reduce file size while retaining either all (in lossless compression) or only part (in lossy compression) of the original data in the compressed file. They do this by making decisions derived from probabilities found in frequency and amplitude analysis of the original file.


7. The discrete Fourier transform<br />

The Fourier transform was born of the Fourier series, invented by Jean Baptiste Joseph Fourier (1768-1830). Fourier was foremost a scientist: Virtually all of his mathematical findings are results of scientific investigations. Amazingly, nearly all branches of the physical and even social sciences have some connection to and foundation upon Fourier analysis, as the Fourier transform can detect repetitive behavior in any sort of dataset. It was during an investigation of thermodynamics that Fourier began formulating the principle of superposition, also called the Fourier series.

7.1 The Fourier series<br />

The Fourier series decomposes any periodic function into a series of simple periodic functions. For a signal that is just a pure tone, representable by a single sinusoid A sin(ωt + φ) or A sin(2πft + φ), the series is just this sinusoid. But as we have seen by now, musical signals are virtually always more complicated than pure sine waves.

When a signal has aperiodic components or when it is finite in duration (as all signals are in reality), its Fourier series will be infinite, as will the domain of its Fourier transform. Conversely, when a signal is completely periodic and when it has (or we can assume that it has) an infinite time domain, its Fourier series will be finite, and its Fourier transform may be specified over a finite frequency domain.

A series in mathematics is defined as the sum of a sequence of terms, which is of the form

\sum_{k=\mathrm{start}}^{\mathrm{finish}} a_k.

The capitalized sigma Σ indicates a sum is to be taken, and the sequence is

{a_k} = {a_start, a_start+1, ..., a_finish−1, a_finish}.

The term start is the index of the initial value of {a_k}, often k = 0, 1, or −∞, and the term finish is the index of its final value, often ∞ or some function of N.

A straightforward complex and periodic function was looked at in Chapter 1: The sum of the two sinusoids x_1(t) = (1/2) sin(2πt) and x_2(t) = (1/2) sin(4πt). We write the Fourier series of their resultant wave x(t) as

x(t) = \frac{1}{2} \sum_{k=1}^{2} \sin(2\pi k t).

Simple, non-sinusoidal waveforms like the sawtooth wave, triangle wave, and square wave have Fourier series representations that use Σ. These are infinite Fourier series because their waveform is not smooth like a sine wave, so a finite sum of sine waves can only approximate their linear nature. The infinite Fourier series that represents a sawtooth wave is given by

x(t) = \frac{2}{\pi} \sum_{k=1}^{\infty} \frac{\sin(2\pi k f t)}{k}
     = \frac{2}{\pi} \left[ \sin(2\pi f t) + \frac{1}{2}\sin(4\pi f t) + \frac{1}{3}\sin(6\pi f t) + \ldots \right]
     = \frac{2}{\pi} \left[ \sin(\omega t) + \frac{1}{2}\sin(2\omega t) + \frac{1}{3}\sin(3\omega t) + \ldots \right].

Because our fundamental frequency ω is multiplied by every integer 1, 2, 3, ..., all integer multiples of the fundamental are represented in the harmonic overtone series. The following graphs let our frequency f equal 1 Hz.



Figure 7.1: x(t) = (2/π) sin(2πt)

Figure 7.2: x(t) = (2/π)[sin(2πt) + (1/2) sin(4πt)]

Figure 7.3: x(t) = (2/π)[sin(2πt) + (1/2) sin(4πt) + (1/3) sin(6πt)]

Figure 7.4: The first six terms of x(t), i.e., the sum (2/π) \sum_{k=1}^{6} \sin(2\pi k t)/k.

These graphs show that the longer the Fourier series extends, the<br />

better it approximates a given periodic function.
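The partial sums plotted in Figures 7.1–7.4 are easy to reproduce numerically. Below is a short C sketch (f = 1 Hz, as in the figures) that evaluates the sawtooth series truncated to 1, 2, 3, and 6 terms over one period; adding terms visibly sharpens the ramp.

```c
/* sawtooth_partial.c -- partial sums of the sawtooth Fourier series
 * x(t) = (2/pi) * sum_{k=1..K} sin(2*pi*k*t)/k  with f = 1 Hz.
 * Compile: cc sawtooth_partial.c -o sawtooth_partial -lm                  */
#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979323846

/* K-term partial sum of the sawtooth series at time t (f = 1 Hz) */
static double sawtooth_sum(double t, int K) {
    double x = 0.0;
    for (int k = 1; k <= K; k++)
        x += sin(2.0 * PI * k * t) / k;
    return (2.0 / PI) * x;
}

int main(void) {
    int terms[] = {1, 2, 3, 6};
    printf("   t       K=1        K=2        K=3        K=6\n");
    for (int i = 0; i <= 20; i++) {
        double t = i / 20.0;              /* one period: 0 <= t <= 1 */
        printf("%5.2f", t);
        for (int j = 0; j < 4; j++)
            printf("  %9.5f", sawtooth_sum(t, terms[j]));
        printf("\n");
    }
    return 0;
}
```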



The following infinite Fourier series represents a triangle wave:

x(t) = \frac{8}{\pi^2} \sum_{k=0}^{\infty} (-1)^k \frac{\sin[(2k+1)\, 2\pi f t]}{(2k+1)^2}
     = \frac{8}{\pi^2} \left[ \sin(2\pi f t) - \frac{1}{9}\sin(6\pi f t) + \frac{1}{25}\sin(10\pi f t) - \ldots \right].

Figure 7.5: The first six terms of the infinite Fourier series representation of a triangle wave. The result is an astonishingly close approximation, probably due to its similar symmetry to that of the sine wave, versus the asymmetry of sawtooth waves and phasors.

The infinite Fourier series representing a square wave is given by

x(t) = \frac{4}{\pi} \sum_{k=1}^{\infty} \frac{\sin[(2k-1)\, 2\pi f t]}{2k-1}
     = \frac{4}{\pi} \left[ \sin(2\pi f t) + \frac{1}{3}\sin(6\pi f t) + \frac{1}{5}\sin(10\pi f t) + \ldots \right]
     = \frac{4}{\pi} \left[ \sin(\omega t) + \frac{1}{3}\sin(3\omega t) + \frac{1}{5}\sin(5\omega t) + \ldots \right].

Here we see only odd-integer multiples of the fundamental ω represented in the harmonic overtone series. As discussed in Chapter 4, distortion pedals add harmonics to the sine waves present in a signal by amplifying the signal and cutting off (clipping, also dynamic compression) the tops and bottoms of its crests and troughs, shaping it into a square wave. Figure 4.23 of clipping on page 94 shows the graph of the first six terms of the Fourier series representing a square wave.



The fact that non-sinusoidal functions like these can be decomposed into a Fourier series demonstrates that (within reason—we avoid pathological functions that are not music-like) every function has a Fourier series representation. Non-sinusoidal functions simply have infinite series representations.

When we deal with real-world musical signals, we usually cannot use the sigma (Σ) in our formula. This is because the amplitudes of the individual sine waves are not solely dependent on the amplitude of the fundamental frequency, and therefore cannot be expressed in a perfectly recursive or algorithmic manner. Very generally, it is true of many musical instruments that the power of the overtones will be weaker than the fundamental, and their individual strengths decrease as their partial number increases [76], but there are instruments like the oboe and bassoon that do not obey this rule.

Suppose that an instrument's timbre is such that each of its harmonic overtones is half the strength of the previous overtone, i.e., x_0(t) = sin(2πft), x_1(t) = (1/2) sin(4πft), x_2(t) = (1/4) sin(6πft), and so on. Then the Fourier series representing the whole overtone series, where the signal x(t) contains the sinusoids x_i(t) in the sequence {x(t)} = {x_0(t), x_1(t), x_2(t), ...}, is

x(t) = \sin(2\pi f t) + \frac{1}{2}\sin(4\pi f t) + \frac{1}{4}\sin(6\pi f t) + \ldots
     = \sum_{k=0}^{\infty} \frac{1}{2^k} \sin[2\pi f t\, (k+1)]
     = \sum_{k=0}^{\infty} \frac{1}{2^k} \sin[\omega t\, (k+1)].

Since the physicality of instruments affects the timbre in many ways, no musical instrument in reality will have this exact overtone series. However, the general form of the series is not far off from the actual overtone series of instruments, especially the highly harmonic overtones of the wind instruments, with differences reflecting the nature of the physical constraints.



The trigonometric functions of cosine and sine can themselves be decomposed into mathematical series, known as Taylor and Maclaurin series. The sine function can be deconstructed by the series

\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \ldots = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{(2k+1)!}

and the series of the cosine function is

\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \ldots = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k}}{(2k)!}.

The syntax "!" in mathematics is the factorial operator, where n! = n · (n−1) · (n−2) · ... · 1. For n = 0, the operation is defined as 0! = 1.
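To see how quickly these power series converge, here is a minimal C sketch that sums the first few terms of the sine series at x = 1 radian and compares each partial sum against the library sin(); each term is built from the previous one using the factorials defined above.

```c
/* taylor_sin.c -- partial sums of sin(x) = sum (-1)^k x^(2k+1)/(2k+1)!
 * compared with the library sin().
 * Compile: cc taylor_sin.c -o taylor_sin -lm                              */
#include <stdio.h>
#include <math.h>

/* sum of terms k = 0 .. K; term k is the previous term times -x^2/((2k)(2k+1)) */
static double taylor_sin(double x, int K) {
    double term = x, sum = x;
    for (int k = 1; k <= K; k++) {
        term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
        sum  += term;
    }
    return sum;
}

int main(void) {
    double x = 1.0;   /* radians */
    for (int K = 0; K <= 5; K++)
        printf("K = %d:  %.10f   (error %+.2e)\n",
               K, taylor_sin(x, K), taylor_sin(x, K) - sin(x));
    return 0;
}
```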

The difference between these power series and the Fourier series is evident. Because pitched sound contains sinusoidal waves, and the human ear is a kind of frequency detector, it is more important to focus on the Fourier series here.

However, the power series of the sine and cosine functions do lead to an important identity in mathematics—Euler's identity.

7.2 Euler’s formula<br />

Leonhard Euler (1707-1783) is behind much of modern mathematics and physics, ranging from analysis to astronomy. Euler's number, e, is the base of the natural logarithm. Written ln(x), the natural log is equivalent to log_e(x), where e is the quantity 2.71828..., continuing forever. Euler's number is very special, because the derivative of the function e^x returns the same function, i.e.,

\frac{d}{dx} e^x = e^x.



The magic doesn't stop there: The function e^{iω}, where ω is an angular frequency (i.e., radians/second), describes positions along the unit circle in the complex plane, as given by Euler's identity.

Euler's identity: For Euler's number e = 2.71828..., the complex quantity i = √−1, and the ratio of the circumference of a circle to its diameter π = 3.14159...,

e^{iπ} + 1 = 0.

When we vary the exponent of e by ω, we are effectively moving along the unit circle ω-many radians, and e^{iω} will equal the sum of the coordinates corresponding to the position on the circle. When ω is positive, this is counter-clockwise motion, and when it is negative, we move clockwise.

Figure 7.6: The unit circle in the complex plane (also called the z-plane) is real-valued along the horizontal axis and imaginary-valued along the vertical axis, so points in this plane will be of the form (Re(ω), Im(ω)). To determine e^{iω}, simply move counterclockwise, starting at the right-most point of the circle, ω-many radians. For example, e^{iπ/4} corresponds to π/4 radians (45°), which is at the point (√2/2, i√2/2). Therefore, e^{iπ/4} = √2/2 + i√2/2. This is also the sum cos(π/4) + i sin(π/4).



Figure 7.7: To determine e^{−iω}, simply move clockwise, starting at the right-most point (0 radians) of the circle, ω-many radians. For example, e^{−iπ/4} corresponds to −π/4 radians (−45°), which has coordinate (√2/2, −i√2/2). Therefore, e^{−iπ/4} = √2/2 − i√2/2, which is equivalent to cos(−π/4) + i sin(−π/4).

As you can see, e^{iω} can be written as a function of sin(ω) and cos(ω), where the sin(ω) component is multiplied by the imaginary number i. The behavior of e^{iω} with respect to the horizontal axis can be modeled by a cosine wave, and its behavior with respect to the vertical axis by an imaginary sine wave. Because the inner product¹ of each root of unity e^{iω_k t} with any other root of unity e^{iω_l t} is zero, we say that the roots of unity form an orthogonal basis. Furthermore, because the magnitude of the roots of unity |e^{iωt}| is 1 for all ω, they are also normal, or normalized, functions, and hence they also define an orthonormal basis.

¹ The sum of the products of one vector's entries with another's complex conjugates. From [26], we write the inner product of two roots of unity W_N^k and W_N^l as

\langle W_N^k, W_N^l \rangle = \sum_{t=0}^{N-1} W_N^k \hat{W}_N^l = \sum_{t=0}^{N-1} e^{\frac{i 2\pi k t}{N}} e^{-\frac{i 2\pi l t}{N}} = \sum_{t=0}^{N-1} e^{\frac{i 2\pi (k-l) t}{N}} = \frac{1 - e^{i 2\pi (k-l)}}{1 - e^{i 2\pi (k-l)/N}},

which is equal to 0 when k ≠ l. From linear algebra, this means that the two vectors e^{iω_k t} and e^{iω_l t} are linearly independent and therefore form a basis, and furthermore, we can conclude that these two sinusoids are orthogonal.



The function e^{iω} is a complex function known as Euler's formula.

Euler's formula: For Euler's number e = 2.71828..., the complex quantity i = √−1, and any real number ω,

e^{iω} = cos(ω) + i sin(ω)   and
e^{−iω} = cos(ω) − i sin(ω).

When we vary ω, we proportionally vary the arguments of cosine and sine, so ω represents the angular frequency in radians per second. For example, doubling the exponent e^{iω} to e^{i2ω} doubles the frequency of the sine and cosine functions from cos(ω) + i sin(ω) to cos(2ω) + i sin(2ω). We use angular frequency instead of frequency in hertz because ω is equal to 2πf, and one period is therefore the time that it takes for e^{iω} to go around the unit circle once. However, when we state the discrete Fourier transform, the exponent of e will be written with the expanded notation 2πk to refer to the angular frequency ω_k.

Likewise, sin(ω) and cos(ω) can be rewritten in terms of e^{iω}, where ω is a real number:

\sin(\omega) = \frac{e^{i\omega} - e^{-i\omega}}{2i} = \frac{i}{2}\left(e^{-i\omega} - e^{i\omega}\right), \quad \text{because } \frac{1}{i} = \frac{-i^2}{i} = -i,

\cos(\omega) = \frac{e^{i\omega} + e^{-i\omega}}{2}.
2



Let us inspect some values of Euler's formula:

e^{iπ/2} = cos(π/2) + i sin(π/2) = 0 + i · 1 = i
e^{iπ}   = cos(π) + i sin(π) = −1 + i · 0 = −1
e^{iπ/4} = cos(π/4) + i sin(π/4) = √2/2 + i √2/2
e^{i2π}  = cos(2π) + i sin(2π) = 1 + i · 0 = 1 = e^{i·0} = e^{i2kπ}, k = 0, 1, 2, ...

All of these values are called roots of unity, because their magnitudes are all equal to 1.² The magnitude of a complex number is

|a + bi| = √(a² + b²),

so the magnitude of e^{iω} is 1 for all ω. This also results from the trigonometric identity stating that the magnitude |cos ω + i sin ω| = √(cos²ω + sin²ω) = 1 (and therefore, cos²ω + sin²ω = 1). Quickly, we will prove this by inspection of the above values:

|e^{iπ/2}| = √(0² + 1²) = 1
|e^{iπ}|   = √((−1)² + 0²) = 1
|e^{iπ/4}| = √((√2/2)² + (√2/2)²) = √(2/4 + 2/4) = 1
|e^{i2π}|  = √(1² + 0²) = 1.

The exponentials in every term of the discrete Fourier transform and its inverse are all roots of unity. We can order these roots of unity by using the following definition.

Roots of unity: We call the set W = {W_N^0, W_N^1, ..., W_N^{N−1}} the Nth roots of unity, corresponding to points on the unit circle in the complex plane, where

W_N^1 = e^{i2π/N} is the primitive Nth root of unity,
W_N^k = e^{i2πk/N} = (W_N^1)^k is the kth Nth root of unity,
W_N^0 = e^{i2π(0)/N} = 1 is the trivial Nth root of unity, and
W_N^N = e^{i2πN/N} = e^{i2π} = 1 = W_N^0.

² The words "unit," "unity," and "unitary" all refer to the number 1, and often the unit circle.

To help demystify the name "roots of unity," these can also be written

W_N^k = e^{i2πk/N} = \sqrt[N]{e^{i 2\pi k}} = \sqrt[N]{W^k},

the "Nth root of W^k," or the "kth power of the primitive root of unity W_N^1." We don't usually see exponents written this way, because the root symbol is typically only used when the exponent is equal to 1/2, so the exponent is inverted when we write roots this way, e.g., √x = x^{1/2} = ²√x.

The function e^{iω} describing the roots of unity in the Fourier transform serves to extract the periodic parts of a signal and attenuate the aperiodic parts by essentially multiplying aperiodicities by 0. The roots of unity analyze the periodicity of the signal because they are all orthogonal to one another. This is the crux of the Fourier transform.
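This orthogonality is easy to check numerically. Here is a small C sketch (using the C99 complex type; N = 8 is an arbitrary choice) that computes the inner product ⟨W_N^k, W_N^l⟩ for every pair of roots of unity: the result is N on the diagonal (k = l) and essentially zero everywhere else.

```c
/* roots_of_unity.c -- inner products of the Nth roots of unity: N when
 * k = l, and (to rounding error) 0 when k != l.
 * Compile: cc roots_of_unity.c -o roots_of_unity -lm                      */
#include <stdio.h>
#include <math.h>
#include <complex.h>

#define PI 3.14159265358979323846

int main(void) {
    const int N = 8;
    for (int k = 0; k < N; k++) {
        for (int l = 0; l < N; l++) {
            double complex sum = 0.0;
            for (int t = 0; t < N; t++) {
                double complex wk = cexp(I * 2.0 * PI * k * t / N);
                double complex wl = cexp(I * 2.0 * PI * l * t / N);
                sum += wk * conj(wl);      /* product with the complex conjugate */
            }
            printf("%6.2f ", creal(sum));  /* imaginary parts are ~0             */
        }
        printf("\n");
    }
    return 0;
}
```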

7.3 The discrete Fourier transform<br />

The discrete Fourier transform (DFT) and the inverse discrete Fourier transform (the IDFT) are applied to discrete time-domain signals to extract their sinusoidal frequency components. The DFT computes the frequency-domain spectrum of a signal, and the IDFT reconstructs the signal from the DFT. It is a rich algorithm with many variables acting at once, and it can be very intimidating even to seasoned mathematicians. We will first define it formally but then thoroughly inspect it with several examples.



The discrete Fourier transform is the discrete case of the Fourier transform, which requires a continuous (non-discrete) signal input and has a continuous output.

Fourier transform (continuous): The Fourier transform is an invertible, linear transformation accepting complex-valued inputs and outputting complex values. The Fourier transform of a continuous-time signal x(t) is represented by the symbol F, and is given by

\mathcal{F}\{x(t)\} := X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-i\omega t}\, dt,

where t is time and ω is angular frequency. The inverse Fourier transform F^{−1} reconstructs the original signal from the Fourier transform with the formula

\mathcal{F}^{-1}\{X(\omega)\} := x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, e^{i\omega t}\, d\omega.

We call X(ω) the spectrum of x(t). We multiply the inverse Fourier transform (IFT) by 1/2π only when the frequency is specified or desired in radians per second (ω) instead of hertz (f); otherwise, we leave it alone.

An integral of a function takes the area under the curve of the function over an interval whose endpoints are specified by the given limits, in this case −∞ to ∞. If we partition the function into tiny slices over the specified interval, we can sum together all of the slices and approximate its integral. The thinner the slices, the better the area of the slices will approximate the actual area underneath the curve.

The continuous Fourier transform of a simple sinusoid x(t) = sin(ω_c t) with frequency ω_c is a (scaled) pair of Dirac delta functions at ±ω_c, namely iπ[δ(ω + ω_c) − δ(ω − ω_c)]. Conversely, the Fourier transform of a Dirac delta function centered at time t_0, δ(t − t_0), is the exponential e^{−iωt_0}; when it is centered at −t_0, the Fourier transform is the exponential e^{iωt_0}.



Figure 7.8: The definite integral \int_0^1 \left(t - \frac{1}{2}\right)^2 dt gives the exact area under the given curve from 0 to 1, shown in gray. Using calculus, this area is computed as \left[\frac{x^3}{3} - \frac{x^2}{2} + \frac{x}{4}\right]_0^1 = \frac{1}{12} = 0.083333.

Figure 7.9: The Riemann sum partitions the function using evenly sized and spaced intervals to approximate the area under the curve. The more partitions, the closer the Riemann sum is to the integral. Here, with 28 partitions, the approximate area is 0.083227.

Example. Let x(t) = sin(220πt). Then

\mathcal{F}\{x(t)\} = \int_{-\infty}^{\infty} \sin(220\pi t)\, e^{-i\omega t}\, dt = i\pi\,\delta(\omega + 220\pi) - i\pi\,\delta(\omega - 220\pi) = X(\omega).

So the Fourier transform looks like two spikes centered at the frequencies 220π rad/s and −220π rad/s. The amplitude of the positive frequency is −iπ (its magnitude is simply π), and the amplitude of the negative frequency is iπ. Graphs of the Fourier transform typically depict the magnitude plot |X(ω)| and will sometimes include the phase plot φ(ω) of the transform's behavior. The magnitude plot |X(ω)| shows us the relative powers of the frequency components of a signal. We calculate this with the formula

|X(\omega)| = \sqrt{\mathrm{Re}\{X(\omega)\}^2 + \mathrm{Im}\{X(\omega)\}^2}.

The phase φ(ω) is calculated from the formula

\varphi[X(\omega)] = \tan^{-1}\left[\frac{\mathrm{Im}\{X(\omega)\}}{\mathrm{Re}\{X(\omega)\}}\right].



Since the magnitude of iπ is the same as that of −iπ, the magnitude plot of the transform (given in Figure 7.10) is symmetrical about the vertical axis (ω = 0).

Figure 7.10: The magnitude plot of the Fourier transform of x(t) = sin(220πt): Two vertical segments centered at −220π and 220π radians/second. We calculate the magnitude because the Fourier transform is a complex function. Since the magnitudes of its imaginary values are considered equally important as those of the real values, whether positive or negative, a magnitude plot is the simplest way to visually convey the spectrum of a signal.

If a time signal x(t) is infinite, then its Fourier transform may be finite. Otherwise, its Fourier transform will be infinite. The minimum and maximum frequency components (in Hz) of a spectrum define the bandwidth of a signal. For the function sin(220πt), the minimum frequency component is −110 Hz and the maximum is 110 Hz. Therefore, its bandwidth is 220 Hz. The bandwidth defines the minimum rate at which we must sample the function to accurately collect its frequency information.



The inverse of this is

\mathcal{F}^{-1}\{X(\omega)\} = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, e^{i\omega t}\, d\omega
 = \frac{i\pi}{2\pi}\left(e^{-i220\pi t} - e^{i220\pi t}\right)
 = \frac{i}{2}\left[\cos(220\pi t) - i\sin(220\pi t) - \cos(220\pi t) - i\sin(220\pi t)\right]
 = \frac{i}{2}\left[-2i\sin(220\pi t)\right]
 = \sin(220\pi t)
 = x(t).

Therefore, the Fourier transform and its inverse accurately reconstruct a given sinusoid.

As you can see, it is very important to remember the mathematical identities from Euler's formula,

e^{iω} = cos(ω) + i sin(ω)
e^{−iω} = cos(ω) − i sin(ω),

when computing continuous Fourier transforms. It is helpful to memorize some relationships between commonly encountered functions and their transforms.

Function x(t)             Transform X(ω)
Constant: x(t) = a        a · δ(ω)
δ(t − a)                  e^{−iaω}
δ(t + a)                  e^{iaω}
sin(at)                   iπ [δ(ω + a) − δ(ω − a)]
cos(at)                   π [δ(ω − a) + δ(ω + a)]
cos(at) + cos(bt)         π [δ(ω − a) + δ(ω + a) + δ(ω − b) + δ(ω + b)]

The constant case wherein x(t) = a produces a transform with a delta function centered at 0 rad/s and height a. This is the DC offset: Direct current (DC) supplies a constant voltage, as opposed to alternating current, which is associated with a sinusoid.³



The DC offset is thought of as the average value of a waveform.⁴

In the discrete case, we can simply sum over a sampled signal at intervals given by the period T_s of the sampling frequency f_s. Therefore, the DFT input is not x(t), but x_s(t). However, you rarely see that notation used—it is a given, because the discrete Fourier transform only accepts discrete inputs.

The DFT bears quite a resemblance to the continuous case:

X(\omega_k) = \sum_{t=0}^{N-1} x_s(t)\, e^{-i\omega_k t}, \quad k = 0, 1, 2, \ldots, N-1,

and the inverse DFT, or IDFT, is

x_s(t) = \frac{1}{N} \sum_{k=0}^{N-1} X(\omega_k)\, e^{i\omega_k t}, \quad t = 0, 1, 2, \ldots, N-1.

This way of writing the DFT and IDFT returns the exact frequency components of the frequency spectrum of x(t).

³ In many engineering books, current (and therefore voltage) is represented by phase vectors, or simply phasors. This implies that their frequency and overall amplitude (i.e., the constant A that multiplies the sine wave) is time-invariant. This is not to be confused with the phasors that are the vertical reflection of sawtooth waves.

⁴ In this text's representation of the Fourier transform, X(0) will be the sum (or integral, in the continuous case) of the signal, because we are not normalizing it by multiplying by 1/N or some other constant, as some specifications of the Fourier transform will. In Mathematica, for example, the continuous Fourier transform is defined as

F(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(t)\, e^{i\omega t}\, dt

and its inverse is

f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} F(\omega)\, e^{-i\omega t}\, d\omega.

The exponentials are different here—but how can that be? So long as the inverse transform's exponential is the additive inverse of the transform's exponential, it forms an orthonormal basis that can describe the frequency information of a time-domain signal. The constant 1/√(2π) simply acts to normalize the data in a different way from our specification.



Symbol     Definition
x_s(t)     the amplitude of the input sampled signal at sampling instant t
N          the total number of samples in x_s(t)
n          the index of the samples of x_s(t), numbered 0, 1, ..., N − 1
T_s        the chosen sampling interval (period), equal to 1/f_s
t          equal to n · T_s, the time of the nth sample in seconds
f_s        the sampling frequency in Hz, equal to ω/2π
ω_k        2πk/(N T_s) = 2πk f_s / N, the kth harmonic frequency in rad/s
X(ω_k)     the amplitude of frequency ω_k in all of x_s(t)

Table 7.1: Description and names of variables used in the discrete Fourier transform.

But notationally, there are several somewhat misleading conventions: The time component n will often be written as t, but this t is not in seconds as in the original signal; the t defined above (Table 7.1) is the sample index n multiplied by the sampling interval T_s, which does give the time in seconds. Secondly, the sampled signal x_s(t) is usually truncated to x(t), but keep in mind that the discrete Fourier transform only accepts discrete (sampled) signals. And furthermore, we don't typically talk about musical content with angular frequencies; we want frequency in hertz.

To clear up some of these confusions, we first note that ω_k · t = \frac{2\pi k}{N T_s} \cdot n T_s = \frac{2\pi k n}{N}. Then we can rewrite the discrete Fourier transform and its inverse as

X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-\frac{i 2\pi k n}{N}}, \quad k = 0, 1, 2, \ldots, N-1
x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{\frac{i 2\pi k n}{N}}, \quad n = 0, 1, 2, \ldots, N-1

or as

X(k) = \sum_{t=0}^{N-1} x(t)\, e^{-\frac{i 2\pi k t}{N}}, \quad k = 0, 1, 2, \ldots, N-1
x(t) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{\frac{i 2\pi k t}{N}}, \quad t = 0, 1, 2, \ldots, N-1.



This second way is a common way of writing it, and the form that I prefer, because I like to be reminded of the variable corresponding to time. Just keep in mind that x(t) in the DFT is the sampled version of the original signal and that t represents the sample number from 0 to N − 1, not time in seconds.

The signal x(t), a time-domain function, is transformed by the Fourier transform into X(k), the signal's frequency-domain spectrum. The magnitude plot of the DFT, |X(k)|, is calculated from the square root of the sum of the real part's coefficient squared and the imaginary part's coefficient squared:

|X(k)| = \sqrt{\mathrm{Re}(X(k))^2 + \mathrm{Im}(X(k))^2}.

Therefore, the magnitude is nonnegative. Remember, for a complex number c = a + bi, Re(c) = a and Im(c) = b. If there is no imaginary part to the DFT at frequency component k, then the amplitude is simply X(k). In digital implementations of the DFT, the absolute value is taken at some point so that only positive, real values are given in the spectrum.
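Here is the DFT "performed literally," as a minimal C sketch: two nested loops and N^2 complex multiplications, followed by the magnitude and phase of each bin. The test signal (my choice) is one cycle of a sine across N = 8 samples, so the magnitude spectrum peaks at bins k = 1 and k = N − 1 = 7 with value N/2 = 4.

```c
/* dft_naive.c -- the O(N^2) DFT straight from the definition
 * X(k) = sum_{t=0..N-1} x(t) e^(-i 2 pi k t / N).
 * Compile: cc dft_naive.c -o dft_naive -lm                                */
#include <stdio.h>
#include <math.h>
#include <complex.h>

#define PI 3.14159265358979323846
enum { N = 8 };

/* direct DFT of the N-point real signal x[] into the complex spectrum X[] */
static void dft(const double *x, double complex *X) {
    for (int k = 0; k < N; k++) {
        X[k] = 0.0;
        for (int t = 0; t < N; t++)
            X[k] += x[t] * cexp(-I * 2.0 * PI * k * t / N);
    }
}

int main(void) {
    double x[N];
    double complex X[N];

    for (int t = 0; t < N; t++)
        x[t] = sin(2.0 * PI * t / N);   /* one cycle across the N samples */

    dft(x, X);
    for (int k = 0; k < N; k++)
        printf("k = %d   |X(k)| = %8.4f   phase = %+8.4f rad\n",
               k, cabs(X[k]), carg(X[k]));
    return 0;
}
```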

By Euler's formula, e^{−i2πkt/N} is equal to cos(2πkt/N) − i sin(2πkt/N), and the kth Nth root of unity e^{i2πkt/N} equals cos(2πkt/N) + i sin(2πkt/N). Therefore, we can also write the DFT and its inverse trigonometrically as

X(k) = \sum_{t=0}^{N-1} x(t)\, e^{-\frac{i 2\pi k t}{N}}, \quad k = 0, 1, \ldots, N-1
     = \sum_{t=0}^{N-1} x(t)\cos(2\pi k t / N) - i \sum_{t=0}^{N-1} x(t)\sin(2\pi k t / N)

x(t) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{\frac{i 2\pi k t}{N}}, \quad t = 0, 1, \ldots, N-1
     = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\cos(2\pi k t / N) + \frac{i}{N} \sum_{k=0}^{N-1} X(k)\sin(2\pi k t / N).



Let us now verify that the IDFT is the inverse of the DFT.

X(k) = \sum_{t=0}^{N-1} x(t)\, e^{-\frac{i 2\pi k t}{N}}
     = \sum_{t=0}^{N-1} \left( \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{\frac{i 2\pi k t}{N}} \right) e^{-\frac{i 2\pi k t}{N}}.

Since the k in e^{−i2πkt/N} is not bounded by the k in the inner sum, we change the inside k to l:

     = \sum_{t=0}^{N-1} \left( \frac{1}{N} \sum_{l=0}^{N-1} X(l)\, e^{\frac{i 2\pi l t}{N}} \right) e^{-\frac{i 2\pi k t}{N}}
     = \frac{1}{N} \sum_{t=0}^{N-1} \sum_{l=0}^{N-1} X(l)\, e^{\frac{i 2\pi (l-k) t}{N}}
     = \frac{1}{N} \sum_{l=0}^{N-1} X(l) \sum_{t=0}^{N-1} e^{\frac{i 2\pi (l-k) t}{N}}
     = \frac{1}{N} \sum_{l=0}^{N-1} X(l) \cdot N \quad \text{when } l = k, \text{ 0 otherwise.}

Thus, our double sum becomes

\frac{1}{N} \sum_{l=0}^{N-1} X(l)\, N\, \delta(l-k),

where δ(l − k) is the delta function equal to 1 for l = k and 0 otherwise. So, when l = k, the sum is then

\frac{1}{N} \sum_{l=0}^{N-1} X(l)\, N\, \delta(l-k) = \sum_{l=0}^{N-1} X(l)\, \delta(l-k) = X(k).

This shows that the IDFT is indeed the inverse of the DFT.<br />

Properties of the Fourier transform<br />

The nature of complex numbers provides the Fourier transform with<br />

many nice properties. What follows here are the discrete representations<br />

of the properties, but they also apply <strong>to</strong> the continuous case.



The Fourier transform is linear as a result of the principle of superposition: Since every signal is the sum of complex sinusoids, the sum of two or more signals may be "broken apart" from a single sum into two or more sums. Additionally, they may be scaled by any number such that F{ax} = aF{x}. This also applies to the inverse Fourier transform.

Linearity: The signals x and y can be scaled by any real or complex constants a and b such that F{ax + by} = aF{x} + bF{y}. Similarly, the spectra X and Y can be multiplied by any real or complex constants a, b such that F^{−1}{aX + bY} = aF^{−1}{X} + bF^{−1}{Y}.

Proof:

\mathcal{F}\{a x(t) + b y(t)\} = \sum_{t=0}^{N-1} [a x(t) + b y(t)]\, e^{-\frac{i 2\pi k t}{N}}
 = \sum_{t=0}^{N-1} a x(t)\, e^{-\frac{i 2\pi k t}{N}} + \sum_{t=0}^{N-1} b y(t)\, e^{-\frac{i 2\pi k t}{N}}
 = a \sum_{t=0}^{N-1} x(t)\, e^{-\frac{i 2\pi k t}{N}} + b \sum_{t=0}^{N-1} y(t)\, e^{-\frac{i 2\pi k t}{N}}
 = a\,\mathcal{F}\{x(t)\} + b\,\mathcal{F}\{y(t)\}.

Likewise for the inverse transform:

\mathcal{F}^{-1}\{a X(k) + b Y(k)\} = \frac{1}{N} \sum_{k=0}^{N-1} [a X(k) + b Y(k)]\, e^{\frac{i 2\pi k t}{N}}
 = \frac{1}{N} \sum_{k=0}^{N-1} a X(k)\, e^{\frac{i 2\pi k t}{N}} + \frac{1}{N} \sum_{k=0}^{N-1} b Y(k)\, e^{\frac{i 2\pi k t}{N}}
 = \frac{a}{N} \sum_{k=0}^{N-1} X(k)\, e^{\frac{i 2\pi k t}{N}} + \frac{b}{N} \sum_{k=0}^{N-1} Y(k)\, e^{\frac{i 2\pi k t}{N}}
 = a\,\mathcal{F}^{-1}\{X(k)\} + b\,\mathcal{F}^{-1}\{Y(k)\}.



The Fourier transform is called a time-invariant linear filter because of this property and because, over time, the system that it is representing does not change.⁵

When we multiply a time-domain signal by the exponential function e^{−i2πlt/N} for any real or complex constant l, its Fourier transform is shifted by l, and vice versa, i.e., e^{−i2πlt/N} x(t) ⇔ X(k + l). Likewise, multiplying a Fourier transform by e^{i2πku/N} (note the sign change in the exponent) for some constant u shifts the time-domain signal by u, i.e., e^{i2πku/N} X(k) ⇔ x(t + u).
N X(k) ⇔ x(t − u).<br />

The shift theorem: Multiplying a time signal x(t) by e^{−i2πlt/N} shifts its spectrum X(k) to the left by l, i.e.,

F{x(t) · e^{−i2πlt/N}} = ∑_{t=0}^{N−1} x(t)e^{−i2πkt/N} e^{−i2πlt/N}
= ∑_{t=0}^{N−1} x(t)e^{−i2π(k+l)t/N}
= X(k + l).

Shifting in the frequency domain is transposition to a different key, musically.
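
For the discrete transform the shift is circular (indices are taken modulo N), and a short numerical check makes the theorem visible. The following C sketch is my own illustration, not from the original text: it modulates a small test signal by e^{−i2πlt/N} with l = 2 and compares the DFT of the result against the original spectrum shifted by l.

#include <stdio.h>
#include <math.h>
#include <complex.h>

#define N 8

/* Unnormalized DFT of a complex signal of length N. */
static void dft(const double complex *x, double complex *X)
{
    const double PI = 3.14159265358979323846;
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int t = 0; t < N; t++)
            X[k] += x[t] * cexp(-I * 2.0 * PI * k * t / N);
    }
}

int main(void)
{
    const double PI = 3.14159265358979323846;
    int l = 2;                                          /* amount of the shift */
    double complex x[N], y[N], X[N], Y[N];

    for (int t = 0; t < N; t++) {
        x[t] = cos(PI * t / 4.0);                       /* any test signal */
        y[t] = x[t] * cexp(-I * 2.0 * PI * l * t / N);  /* modulated signal */
    }
    dft(x, X);
    dft(y, Y);

    /* Y(k) should equal X(k + l), with the index wrapped modulo N. */
    for (int k = 0; k < N; k++)
        printf("Y(%d) = %6.3f %+6.3fi   X(%d) = %6.3f %+6.3fi\n",
               k, creal(Y[k]), cimag(Y[k]),
               (k + l) % N, creal(X[(k + l) % N]), cimag(X[(k + l) % N]));
    return 0;
}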

A time-domain signal is real-valued if and only if the complex conjugate X̂(k) of its Fourier transform is equal to X(−k).

Spectral symmetries of real signals: The Fourier transform X(k) of a real-valued time-domain signal x(t) possesses Hermitian symmetry, where X̂(k) = X(−k), and vice versa.

5 This is not to say that a song or voltage does not change over time—of course it does! Rather, time-invariance in filter design and electrical engineering means that the system does not change over time, meaning that we are not switching the input signal to a different one during our analysis, or removing components from a circuit while calculating its input and output voltage. Piecewise functions, for example, are not time-invariant.




Hermitian symmetry has the following properties:

Re{X(−k)} = Re{X(k)},
Im{X(−k)} = −Im{X(k)},
|X(−k)| = |X(k)|, and
∠X(−k) = −∠X(k).

This is an important property that helps convey why we only need half of the Fourier transform's output for audio signals in order to analyze their frequency content. For the continuous Fourier transform, we ignore the first half (the negative frequencies) of the magnitude spectrum, and for the discrete case we ignore the second half. Since all of the frequencies in the DFT are positive, we ignore frequencies beyond N/2 in the DFT.

The convolution theorem: The convolution of two discrete, time-domain signals x(t) and y(t) is

x(t) ∗ y(t) = ∑_{n=0}^{N−1} x(n) y(t − n),

where n is considered the amount of latency and ∗ denotes the operation of convolution.⁶ The Fourier transform of their convolution is the product of their spectra, i.e.,

x(t) ∗ y(t) ⇔ X(k) · Y(k).

Proof:

F{x(t) ∗ y(t)} = ∑_{t=0}^{N−1} ( ∑_{n=0}^{N−1} x(n) y(t − n) ) e^{−i2πkt/N}
= ∑_{n=0}^{N−1} x(n) ∑_{t=0}^{N−1} y(t − n) e^{−i2πkt/N}
= ∑_{n=0}^{N−1} x(n) e^{−i2πkn/N} Y(k), by the Shift Theorem
= ( ∑_{n=0}^{N−1} x(n) e^{−i2πkn/N} ) Y(k)
= X(k) Y(k).

6 The convolution of two continuous-time signals is

x(t) ∗ y(t) = ∫_0^t x(s) y(t − s) ds.

The convolution of two discrete-time signals is

x(t) ∗ y(t) = ∑_{n=0}^{N−1} x(n) y(t − n).

This property is an incredibly useful one. Consider what might happen<br />

when you multiply the spectra of two signals: Frequencies present<br />

in both signals will be present while frequencies lacking from one of<br />

the signals will be absent in the resulting signal. Say that you had the<br />

frequency response of some acoustic space, like an amphitheater or<br />

church, <strong>and</strong> you multiplied it with the spectrum (frequency response)<br />

of some musical signal. The result would be a simulation of that<br />

musical signal as if it were recorded in that room, <strong>and</strong> the same effect can<br />

be done by convolving the two time-domain signals. To collect the<br />

reverberant behavior of a room, we record the sound of an impulse<br />

played in it, <strong>and</strong> call this an impulse response. Since the frequency<br />

response of the room reflects the amount of the room’s reverberation,<br />

we call this process convolution reverb.<br />

To picture the operation of convolution, imagine two signals starting<br />

at time sample n =0, such as an impulse response <strong>and</strong> a musical<br />

signal’s ADSR envelope.



Figure 7.11: An impulse response IR(t) defined for 0 ≤ t ≤ 6, where t represents the time samples. This impulse response shows three reflections, the first with power 0.5, the second 0.25, and the third 0.1.

Figure 7.12: The ADSR envelope of a short musical signal x(t), defined for 0 ≤ t ≤ 15, again in time samples, not seconds. Imagine that this is a short puff on a flute, recorded in an anechoic (echoless) chamber.

Before performing the convolution, we flip the shorter signal (the<br />

impulse response) so that y(t) is now y(−t) <strong>and</strong> its rightmost point<br />

is t =0. Then, we shift it <strong>to</strong> the left any number of points (one will<br />

suffice) so that it does not intersect with the musical signal when they<br />

are plotted on the same axis.<br />

Figure 7.13: To perform convolution on both discrete and continuous signals, we flip one of the signals horizontally.

Figure 7.14: Then, we shift that signal so that the rightmost point will not intersect with the domain of the other signal.

Now we are ready to perform convolution. Again, convolution is computed on two signals x(t) and y(t) with the equation x(t) ∗ y(t) = ∑_{n=0}^{N−1} x(n) y(t − n). Here, the impulse response IR(t) is our y(t): It has been flipped and shifted. N is the length of x(t). Here,

x(t) = x(n) = [0, 1, 0.9, 0.8, 0.7, 0.5, 0.5, 0.5, 0.5, 0.4, 0.4, 0.3, 0.2, 0.1, 0.1, 0] and
y(t) = [1, 0, 0.5, 0, 0.25, 0, 0.1], so y(t − n) = [0.1, 0, 0.25, 0, 0.5, 0, 1],

and if there is an n such that y(t − n) is not defined, we call it zero. Then

x(t) ∗ y(t) = ∑_{n=0}^{N−1} x(n) y(t − n)
= [0, 1, 0.9, 1.3, 1.15, 1.15, 1.075, 0.95, 0.925, 0.855, 0.845, 0.675, 0.575, 0.4, 0.35, 0.165, 0.14, 0.055, 0.045, 0.01, 0.01, 0].

Figure 7.15: The shifted and flipped impulse response y(t − n) on the same plot as x(n), ready for convolution. We will move y(t − n) to the right sample by sample and multiply it with x(n) for each t, and then sum each of these multiplications to get the convolution.

Figure 7.16: The convolved ADSR. This signal shows how the given ADSR would behave in the room described by the impulse response.

So the signal in Figure 7.16 would sound like it were coming from the room with the impulse response depicted in Figure 7.11.
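
To make the arithmetic concrete, here is a minimal C sketch (my illustration, not part of the original text) that convolves the example x(n) and impulse response y(n) directly from the definition. The array values are the ones used above, and the printed sequence should match the convolved ADSR just listed.

#include <stdio.h>

/* Direct convolution: out[t] = sum_n x[n] * y[t - n]; out has length nx + ny - 1. */
static void convolve(const double *x, int nx, const double *y, int ny, double *out)
{
    for (int t = 0; t < nx + ny - 1; t++) {
        out[t] = 0.0;
        for (int n = 0; n < nx; n++) {
            int m = t - n;                /* index into y */
            if (m >= 0 && m < ny)         /* y outside its domain counts as zero */
                out[t] += x[n] * y[m];
        }
    }
}

int main(void)
{
    /* The ADSR envelope x(n) and impulse response y(n) from the example above. */
    double x[] = {0, 1, 0.9, 0.8, 0.7, 0.5, 0.5, 0.5, 0.5, 0.4, 0.4, 0.3, 0.2, 0.1, 0.1, 0};
    double y[] = {1, 0, 0.5, 0, 0.25, 0, 0.1};
    int nx = sizeof x / sizeof x[0], ny = sizeof y / sizeof y[0];
    double out[16 + 7 - 1];

    convolve(x, nx, y, ny, out);
    for (int t = 0; t < nx + ny - 1; t++)
        printf("%.3f ", out[t]);
    printf("\n");
    return 0;
}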

Parseval's identity states that the sum of the Fourier series coefficients c_n squared is equal to the integral of the function squared, i.e.,

∑_{n=−∞}^{∞} |c_n|^2 = (1/2π) ∫_{−π}^{π} |x(t)|^2 dt.


This identity gives rise to the relationship between the total power of a signal and the total power of its spectrum, given by Parseval's theorem.

Parseval's theorem: For any continuous time-domain signal x(t) and its normalized Fourier transform X(ω), where ω is in radians per second,

∫_{−∞}^{∞} |x(t)|^2 dt = (1/2π) ∫_{−∞}^{∞} |X(ω)|^2 dω,

where |x(t)|^2 is the inner product 〈x, x〉 (which is equivalent to ∑_{t=0}^{N−1} x(t)x̂(t), i.e., x(t) multiplied by its complex conjugate, summed over all of its terms), and likewise, |X(k)|^2 is taken to mean the inner product 〈X, X〉. If ω were instead in Hz, we would simply remove the multiplication by 1/2π.

For any discrete time-domain signal x(t) and its discrete, normalized Fourier transform X(k),

∑_{t=0}^{N−1} |x(t)|^2 = ∑_{k=0}^{N−1} |X(k)|^2.

If the DFT is not normalized, then

∑_{t=0}^{N−1} |x(t)|^2 = (1/N) ∑_{k=0}^{N−1} |X(k)|^2.

Proof: Let x(t) be a complex function of length N. Then

∑_{t=0}^{N−1} |x(t)|^2 = ∑_{t=0}^{N−1} x(t)x̂(t)
= x(t) ∗ x(t), convolution with zero latency, or n = 0,
= F^{−1}(X̂(k)X(k)), by the convolution theorem,
= (1/N) ∑_{k=0}^{N−1} |X(k)|^2.

This is also called the Rayleigh energy theorem, Plancherel theorem, Parseval's equality, Parseval's relation, or simply the power theorem.
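
The relationship is easy to check numerically. The C sketch below is my own illustration under the conventions used here, not code from the book: it computes an unnormalized DFT of a short real signal and compares ∑|x(t)|² with (1/N)∑|X(k)|²; the two printed numbers should agree up to rounding error.

#include <stdio.h>
#include <math.h>
#include <complex.h>

#define N 8

int main(void)
{
    const double PI = 3.14159265358979323846;
    double x[N];
    double complex X[N];

    /* Any short real test signal will do; here, cos(pi*t) sampled at 4 Hz. */
    for (int t = 0; t < N; t++)
        x[t] = cos(PI * t * 0.25);

    /* Unnormalized DFT: X(k) = sum_t x(t) e^{-i 2 pi k t / N}. */
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int t = 0; t < N; t++)
            X[k] += x[t] * cexp(-I * 2.0 * PI * k * t / N);
    }

    double time_energy = 0.0, freq_energy = 0.0;
    for (int t = 0; t < N; t++) time_energy += x[t] * x[t];
    for (int k = 0; k < N; k++) freq_energy += creal(X[k] * conj(X[k]));

    printf("sum |x(t)|^2       = %f\n", time_energy);
    printf("(1/N) sum |X(k)|^2 = %f\n", freq_energy / N);
    return 0;
}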



When unsure if the sampling frequency has been specified as great enough to eschew errors from aliasing, we can increase the length of a signal by a factor of L to increase the period of the maximum frequency component. We do this by inserting L − 1 zeros in between each pair of samples. Hence, f_max will become f_max/L, and f_s will have a better chance of correctly sampling all frequencies in the signal. This is known as up-sampling a signal, and the method of increasing the length of x(t) is called stretching. For continuous signals, we have the scaling theorem, and for discrete ones, we have the stretch theorem.
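
As a concrete illustration (a sketch of mine, not from the text), stretching by a factor L simply interleaves L − 1 zeros after every original sample:

#include <stdio.h>

/* Stretch x (length n) by factor L: insert L-1 zeros between samples.
   out must have room for n * L samples. */
static void stretch(const double *x, int n, int L, double *out)
{
    for (int i = 0; i < n * L; i++)
        out[i] = (i % L == 0) ? x[i / L] : 0.0;
}

int main(void)
{
    double x[] = {1.0, 0.5, -0.5, -1.0};
    double y[4 * 3];               /* stretched by L = 3 */
    stretch(x, 4, 3, y);
    for (int i = 0; i < 12; i++)
        printf("%.1f ", y[i]);     /* 1.0 0.0 0.0 0.5 0.0 0.0 -0.5 ... */
    printf("\n");
    return 0;
}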

Scaling theorem: Stretching the time domain of a continuous signal x(t) by a nonzero real number a shrinks its frequency domain by a factor of a, i.e.,

x(t/a) ⇔ |a| X(aω).

Proof: Let x(t) be a continuous, complex function and let a be a nonzero real number. Then the domain of x(t/a) is a-times as wide as x(t), i.e., it is "stretched" by a factor of a. Taking the Fourier transform of this stretched signal yields

F{x(t/a)} = ∫_{−∞}^{∞} x(t/a) e^{−iωt} dt
= ∫_{−∞}^{∞} x(t/a) e^{−iω(a · t/a)} dt
= |a| ∫_{−∞}^{∞} x(t/a) e^{−i(ωa)(t/a)} d(t/a)
= |a| X(aω).

Stretch theorem: Stretching the time domain of a discrete, time-domain signal x[t] by a nonzero real number a repeats the domain of its spectrum X[k] a-many times around the unit circle. So, the effect of stretching discrete signals has opposite implications on their frequency domain as stretching continuous signals: Here, the frequency domain increases by a factor of a (to go from 0 to aN) instead of 1/a. Furthermore, the frequency components of the stretched x[t/a], call them ω′_k, will be more densely spaced such that

ω′_k = 2πk/(aNT_s).

So in both the continuous and discrete cases, the resolution of the frequency domain is increased. Hence, in general, up-sampling in the time domain improves the accuracy of the frequency domain.

The process of up-sampling in the discrete case is done by adding zeros to a signal, called zero-padding, and the consequence of this in its spectrum is called spectral interpolation. Spectral interpolation (or simply interpolation) increases the resolution of the spectrum and shrinks the number of frequencies with energy because it reduces aliasing. Aliasing happens when a signal contains a frequency that is not one of the frequency bins, i.e., a frequency not in the set of the ω_k = 2πk/(NT_s). Say that a signal x(t) contained the frequencies 20 Hz, 30.2 Hz, and 32 Hz and we sampled it at 100 Hz over 1 second. Then the frequency bins of X(k), i.e., the set of frequency components {ω_k}, would be

{ω_k} = {2πk/(NT_s)} |_{N=100, T_s=0.01} = {0, 2π, 4π, 6π, . . . , 198π}.

This set includes the frequencies 20 and 32 Hz, but 30.2 Hz is not indexed by any of the frequency bins even though it is sampled at more than 2 samples per period. The result is that the energy resulting from 30.2 Hz "leaks" to nearby frequency bins, mostly to 30 Hz and second most to 31 Hz. This appears graphically as side lobes, and the spread of energy to nearby frequencies is called spectral leakage. If we were to increase the sampling rate, either by literally increasing the value of f_s or by zero-padding (stretching), we would reduce the effect of spectral leakage. Windowing, described in Chapter 8 on the short-time Fourier transform, also produces these spectral lobes, especially rectangular or "boxcar" windows.

Example 2 in Section 7.5 presents a scenario of spectral leakage, and Example 3 uses the scaling theorem to curtail the artifact.

Computational complexity of the DFT

Each X(k) is a DFT sum itself, requiring N-many computations. Since there are N-many X(k)'s to be calculated for the entire signal to be transformed into the frequency domain, a total of N × N = N^2 computations must be performed. Therefore, its computational complexity is O(N^2).

This is considered expensive and inefficient for an algorithm. Consider a three-minute-long song sampled at 44,100 Hz. The total number of samples is 3 × 60 × 44,100 = 7,938,000 = N. So, a whopping 7,938,000^2 = 63,011,844,000,000 (about 6.3 × 10^13) calculations are required to fully specify the frequency components of a 3-minute CD-quality piece of music. As we will see in Chapter 8, the fast Fourier transform reduces this complexity greatly.
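
For reference, here is a minimal C sketch of the DFT exactly as defined above (my own code, not from the original text): two nested loops of length N, hence the N^2 cost. The test signal is the cosine used in Example 1 of Section 7.5.

#include <stdio.h>
#include <math.h>
#include <complex.h>

/* Naive DFT: X[k] = sum_{t=0}^{N-1} x[t] e^{-i 2 pi k t / N}.  O(N^2). */
static void dft(const double *x, double complex *X, int N)
{
    const double PI = 3.14159265358979323846;
    for (int k = 0; k < N; k++) {          /* one pass per output bin ...        */
        X[k] = 0;
        for (int t = 0; t < N; t++)        /* ... each requiring N multiply-adds */
            X[k] += x[t] * cexp(-I * 2.0 * PI * k * t / N);
    }
}

int main(void)
{
    const double PI = 3.14159265358979323846;
    double x[8];
    double complex X[8];

    /* cos(pi t) sampled at 4 Hz for 2 seconds, as in Example 1. */
    for (int t = 0; t < 8; t++)
        x[t] = cos(PI * t * 0.25);

    dft(x, X, 8);
    for (int k = 0; k < 8; k++)
        printf("X(%d) = %.3f %+.3fi\n", k, creal(X[k]), cimag(X[k]));
    return 0;
}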

7.4 The DFT, simplified

In essence, the discrete Fourier transform extracts a frequency k from a signal x(t) by multiplying x(t) through by a root of unity e^{ikt} with identical frequency k and constructively interfering to return a nonzero value for X(k). In actuality, the root of unity will have a normalized frequency in the interval [0, 2π) to be later multiplied by kf_s/N. Since the maximum value of k is N − 1, the k effectively cancels the N. Therefore, the roots of unity are such that each of them maps to a unique angular frequency within the interval [0, 2πf_s).

The larger the sampling frequency, the larger N is, so the greater f_s is, the more roots of unity will be defined in the DFT.
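
A tiny C sketch (mine, not from the text) makes the picture concrete: for a chosen N it prints the N roots of unity e^{−i2πk/N}, the distinct normalized frequencies available to the DFT.

#include <stdio.h>
#include <math.h>
#include <complex.h>

int main(void)
{
    const double PI = 3.14159265358979323846;
    int N = 8;    /* try 4, 6, 8, 16, or 25 to mirror Figures 7.17 and 7.18 */

    for (int k = 0; k < N; k++) {
        /* kth root of unity: a point on the unit circle at angle -2*pi*k/N */
        double complex w = cexp(-I * 2.0 * PI * k / N);
        printf("W_%d^%d = %.3f %+.3fi\n", N, k, creal(w), cimag(w));
    }
    return 0;
}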


Figure 7.17: Roots of unity for N = 4, N = 6, and N = 8. These are all within the interval [0, 2π). When N = 4, for example, there are 4 different frequencies the DFT could detect.

Figure 7.18: Roots of unity for N = 16 and N = 25, again all within the interval [0, 2π), to be multiplied by f_s afterwards.

The exponentials in every iteration of the discrete Fourier transform and its inverse are roots of unity, representing unitary trigonometric functions in the complex functions of time. As k increases, the frequency of a given function increases—but only in the interval [0, 2π), the interval of frequencies on the unit circle. These small frequencies will be later multiplied by the sampling frequency, f_s, to give N-many, uniformly spaced frequencies in the interval [0, 2πf_s).⁷

7 The notation "[a, b)" refers to the interval with endpoints at a and b. A square bracket, "[" or "]", indicates that the given endpoint is included in the interval. A parenthesis indicates the endpoint is not included.

Suppose that the size of our input signal is N = 8. Then the DFT exponentials are given by

e^{−i2π(0)t/8} = cos(0) − i sin(0)
e^{−i2π(1)t/8} = cos((π/4)t) − i sin((π/4)t)
e^{−i2π(2)t/8} = cos((π/2)t) − i sin((π/2)t)
e^{−i2π(3)t/8} = cos((3π/4)t) − i sin((3π/4)t)
e^{−i2π(4)t/8} = cos(πt) − i sin(πt)
e^{−i2π(5)t/8} = cos((5π/4)t) − i sin((5π/4)t)
e^{−i2π(6)t/8} = cos((3π/2)t) − i sin((3π/2)t)
e^{−i2π(7)t/8} = cos((7π/4)t) − i sin((7π/4)t)

for k = 0, 1, 2, 3, 4, 5, 6, 7. So the real parts of the roots are cosine functions, and the imaginary parts are sine functions, both with identical frequency. Below, they are plotted, the imaginary parts with dashes.

The roots of unity over time for N = 8, f_s = 4

Figure 7.19: Re(e^{−i2π(0)t/8}) = cos(0) = 1, Im(e^{−i2π(0)t/8}) = −sin(0) = 0.

Figure 7.20: Re(e^{−i2π(1)t/8}) = cos((π/4)t), Im(e^{−i2π(1)t/8}) = −sin((π/4)t).

Figure 7.21: Re(e^{−i2π(2)t/8}) = cos((π/2)t), Im(e^{−i2π(2)t/8}) = −sin((π/2)t).

Figure 7.22: Re(e^{−i2π(3)t/8}) = cos((3π/4)t), Im(e^{−i2π(3)t/8}) = −sin((3π/4)t).

If X(k) is non-zero, then the frequency component ω_k = 2πkf_s/N (or kf_s/N in Hz) is in the signal. Roots of unity with frequencies that are identical to those found in a signal will constructively interfere in the Fourier transform to produce a non-zero result. Constructive interference can

Figure 7.23: Re(e^{−i2π(4)t/8}) = cos(πt), Im(e^{−i2π(4)t/8}) = −sin(πt).

Figure 7.24: Re(e^{−i2π(5)t/8}) = cos((5π/4)t), Im(e^{−i2π(5)t/8}) = −sin((5π/4)t).

Figure 7.25: Re(e^{−i2π(6)t/8}) = cos((3π/2)t), Im(e^{−i2π(6)t/8}) = −sin((3π/2)t).

Figure 7.26: Re(e^{−i2π(7)t/8}) = cos((7π/4)t), Im(e^{−i2π(7)t/8}) = −sin((7π/4)t).

happen either at a single point or over an interval of time, and if it happens over an interval, the frequency is identical. If the signals are out of phase, destructive interference occurs—but a sine and cosine wave are similar in that they are simply out of phase with one another, by π/2. This means that the degree of destructive interference exhibited by one of the waves will be the degree of constructive interference exhibited by the other wave: If one completely cancels with a frequency component of the signal, then the other will completely constructively interfere.


Suppose that, for some k, X(k) = a + bi for nonzero a and b. Then the real part a corresponds to the real part of the kth root of unity, i.e., a represents the constructive interference of x(t) with the cosine part of the sum cos(2πkt/N) − i sin(2πkt/N). The imaginary part b is the result of constructive interference with the sine function, which is multiplied by i. Hence, a nonzero a or b means that the root of unity is "matching" a frequency present in x(t).

The nature of the sampled signal is that its samples are all uniformly spaced over time, meaning that it has a periodic nature (its sampling frequency). The exponentials of the discrete Fourier transform only move through e^{i2πt(0)/N} to e^{i2πt(N−1)/N}, i.e., one exponential less than 1 period, because e^{i2πtN/N} is the same as e^{i2πt(0)/N}. Therefore, it is redundant to include any roots of unity beyond e^{i2πt(N−1)/N}, because that frequency is already represented. They serve to identify which frequencies are present in a time-domain signal only because the transform is "blind" to the sampling frequency: f_s may as well be thought of as 1 Hz because the indexing of the samples is in integers. The frequency components ω_k at which X(k) ≠ 0 are multiplied only later by f_s; the exponentials contain no mention of the sampling period or frequency. This is because we are looking at integer-ordered instants of time, t_n = 0, 1, 2, . . . , N − 1, the nth sample of x_s(t), not seconds of time, t. We just write the DFT using t to remind ourselves that it is related to time.

When we multiply the frequency components by f_s at the end, each of the non-zero values of X(k) (corresponding to the frequency component ω_k; sometimes the DFT is written X(ω_k), but usually this opens up a whole other can of worms) is scaled to an integer harmonic of the fundamental frequency (2π/N) · f_s. In other words, the sequence of frequency components is

{ω_k} = { 0, (2π/N)f_s, (4π/N)f_s, (6π/N)f_s, . . . , (2π(N−1)/N)f_s }.


So, the maximum frequency specified by the roots of unity is W_N^{N−1}. Over time, i.e., for the time samples t = 0, 1, . . . , N − 1, the exponential reaches e^{−i2π(N−1)(N−1)/N}. Therefore, the maximum frequency component is (N − 1)f_s/N Hz. But for real inputs (which is all sound files), the DFT has Hermitian symmetry, meaning that the kth term of the transform, X_k, is equal to the complex conjugate of its (N − k)th term, X̂_{N−k}.

Let us recall the Nyquist sampling theorem: A sinusoidal, periodic function must be sampled at least twice per period in order for the frequencies to be represented. For the same reason, the DFT can only detect frequencies up to f_s/2 Hz, and the spectrum's magnitude |X(k)| will be symmetrical about the vertical line k = N/2. The magnitude |X(N − 1)| will thus be equal to |X(1)|, |X(N − 2)| = |X(2)|, and so on. When N is even, the Nyquist frequency will be at the center of this symmetry, and when odd, |X[(N − 1)/2]| will equal |X[(N + 1)/2]|. This will be graphically explored in the upcoming examples.

The DFT can be calculated via matrix multiplication of x(t) with a matrix containing the exponentials of the roots of unity W_N^k down each column. Here is a quick introduction to matrix operations: We use the indices i and j to refer to row and column coordinates (respectively) in a matrix, not to be confused with imaginary numbers. In matrix multiplication, the (i, j)th term of the left matrix multiplies with the (j, k)th term of the right matrix. Therefore, the left matrix must have as many columns as the right matrix has rows. If the left matrix is of the size m × n, where m is the number of rows and n is the number of columns, and the right matrix is of the size n × p, their multiplication will produce a matrix of size m × p.

Multiplying x(t), a 1 × N matrix (a row matrix), with W_N^k, an N × N matrix, therefore produces a matrix of size 1 × N, and this is X(k).⁸ An orthonormal basis in general is an N × N matrix containing normal vectors (columns and rows) describing the dimensions and linear behavior of a function. So, W is considered N-dimensional. The matrix represents the set of roots of unity W = {W_N^0, W_N^1, . . . , W_N^{N−1}} by putting them sequentially in columns, where the top row contains the 0th root of unity (i.e., k = 0) and the bottom row contains the (N − 1)th root of unity (i.e., k = N − 1):

W = [ e^{−i2πk(0)/N}   e^{−i2πk(1)/N}   . . .   e^{−i2πk(N−1)/N} ]

8 In some texts, these will be rows of W, meaning also that the signal x will be a column matrix and the transform X will be a row matrix [24]. However, note that the matrix is symmetrical about its diagonal, so it doesn't matter if the values of the exponentials are given in the rows or the columns.

The jth column of W gives the values of all of the roots of unity at time t = j, cos(2πkj/N) − i sin(2πkj/N). The ith row contains the values of the ith root of unity over time. Explicitly, the general matrix representing the Nth roots of unity for any N is

W =
⎡ 1   1                  1                  1                  . . .   1
  1   e^{−i2π/N}         e^{−i4π/N}         e^{−i6π/N}         . . .   e^{−i2π(N−1)/N}
  1   e^{−i4π/N}         e^{−i8π/N}         e^{−i12π/N}        . . .   e^{−i4π(N−1)/N}
  1   e^{−i6π/N}         e^{−i12π/N}        e^{−i18π/N}        . . .   e^{−i6π(N−1)/N}
  ⋮   ⋮                  ⋮                  ⋮                          ⋮
  1   e^{−i2π(N−1)/N}    e^{−i4π(N−1)/N}    e^{−i6π(N−1)/N}    . . .   e^{−i2π(N−1)(N−1)/N} ⎦

The matrix is N × N in dimension because the cardinality (magnitude) of both k and t is N. Letting N = 8, the DFT can be wholly represented by the following matrix multiplication, in which each column of W(t, k) contains N-many evenly spaced values of a root of unity, whose real part is the corresponding cosine function:⁹

x(t) = [ x(0)  x(1)  x(2)  x(3)  x(4)  x(5)  x(6)  x(7) ]

9 The given matrix of the roots of unity is a decimal approximation, where 0.707 ≈ √2/2 = cos(π/4).

W(t, k) =
⎡ 1   1                1    1                1    1                1    1
  1   0.707 − 0.707i   −i   −0.707 − 0.707i  −1   −0.707 + 0.707i  i    0.707 + 0.707i
  1   −i               −1   i                1    −i               −1   i
  1   −0.707 − 0.707i  i    0.707 − 0.707i   −1   0.707 + 0.707i   −i   −0.707 + 0.707i
  1   −1               1    −1               1    −1               1    −1
  1   −0.707 + 0.707i  −i   0.707 + 0.707i   −1   0.707 − 0.707i   i    −0.707 − 0.707i
  1   i                −1   −i               1    i                −1   −i
  1   0.707 + 0.707i   i    −0.707 + 0.707i  −1   −0.707 − 0.707i  −i   0.707 − 0.707i ⎦

So the discrete Fourier transform X(k) can be computed by matrix multiplication, where X(k) and x(t) are both 1 × N matrices and W(t, k) is N × N in size, by the formula

X(k) = x(t) W(t, k).
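
The matrix view translates directly into code. The following C sketch (my illustration, not the book's) fills W(t, k) = e^{−i2πtk/N} for N = 8 and forms X = xW as a row-vector-times-matrix product; its output agrees with the nested-sum DFT shown earlier.

#include <stdio.h>
#include <math.h>
#include <complex.h>

#define N 8

int main(void)
{
    const double PI = 3.14159265358979323846;
    double complex W[N][N];
    double x[N];
    double complex X[N];

    /* Entry (t, k) of W holds the root of unity e^{-i 2 pi t k / N}. */
    for (int t = 0; t < N; t++)
        for (int k = 0; k < N; k++)
            W[t][k] = cexp(-I * 2.0 * PI * t * k / N);

    /* A test signal: cos(pi t) sampled at 4 Hz, as in Example 1. */
    for (int t = 0; t < N; t++)
        x[t] = cos(PI * t * 0.25);

    /* X = x W : the 1 x N row vector x times the N x N matrix W. */
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int t = 0; t < N; t++)
            X[k] += x[t] * W[t][k];
    }

    for (int k = 0; k < N; k++)
        printf("X(%d) = %.3f %+.3fi\n", k, creal(X[k]), cimag(X[k]));
    return 0;
}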

7.5 Examples

To recapitulate: When we compute a DFT, we multiply a signal x(t) by complex sine waves from the roots of unity, e^{iω_k t}. These roots of unity are only within the interval [0, 2π), and there are N-many of them. Hence, it may seem like the DFT can only detect frequencies between 0 and 2π rad/s, but these frequencies are just "placeholders" for the actual frequencies: We scale these ω_k by our sampling frequency f_s at the end. So, the set of ω_k can be thought of as normalized frequencies in the range of 0 to 1 hertz.

The DFT can detect N-many frequencies up to the frequency f_s/2, but beware: The highest frequency is almost never equal to N, because the length of x(t) is almost never one second. N, and hence f_s, specify the resolution of the DFT. This is visually depicted in Figures 7.17 and 7.18.


In this section, we will show the scratch work required to compute the DFT of a short, periodic signal by hand. The amount of room it takes up should show you that the analysis of more complex (i.e., larger) signals is best left to a computer, but for some, it aids comprehension to see the explicit math involved. Example 2 is the DFT in Mathematica, which specifies the Fourier transform differently, so we show how to compensate for that. The third example shows how the DFT can be estimated graphically.


Example 1: A simple sinusoid, by hand

Let us evaluate the DFT of a simple sinusoid: x(t) = cos(πt) sampled at f_s = 4 Hz for the first 2 seconds, i.e., 0 ≤ t < 2. Then N = 2 · 4 = 8 samples, T_s = 0.25 s, and the sampled signal is

x_s(t) = (1, √2/2, 0, −√2/2, −1, −√2/2, 0, √2/2).

For k = 0,

X(0) = ∑_{t=0}^{7} x(t)e^{−i2π(0)t/8}.

Because e^0 = 1, this is just the sum of each value of x_s(t) from t = 0 to t = 7:

X(0) = ∑_{t=0}^{7} x(t)
= x(0) + x(1) + x(2) + x(3) + x(4) + x(5) + x(6) + x(7)
= 1 + √2/2 + 0 − √2/2 − 1 − √2/2 + 0 + √2/2
= 0.

For k = 1:

X(1) = ∑_{t=0}^{7} x(t)e^{−i2π(1)t/8}
= x(0)e^{−i2π(1)(0)/8} + x(1)e^{−i2π(1)(1)/8} + x(2)e^{−i2π(1)(2)/8}
  + x(3)e^{−i2π(1)(3)/8} + x(4)e^{−i2π(1)(4)/8} + x(5)e^{−i2π(1)(5)/8}
  + x(6)e^{−i2π(1)(6)/8} + x(7)e^{−i2π(1)(7)/8}
= (1)e^0 + (√2/2)e^{−i2π/8} + 0 + (−√2/2)e^{−i6π/8}
  + (−1)e^{−i8π/8} + (−√2/2)e^{−i10π/8} + 0 + (√2/2)e^{−i14π/8}
= (1)(1) + (√2/2)(√2/2 − i√2/2) + 0 + (−√2/2)(−√2/2 − i√2/2)
  + (−1)(−1) + (−√2/2)(−√2/2 + i√2/2) + 0 + (√2/2)(√2/2 + i√2/2)
= 1 + (1/2 − i/2) + 0 + (1/2 + i/2) + 1 + (1/2 − i/2) + 0 + (1/2 + i/2)
= 4.


That takes up a lot of room, as you can see, so I will leave some of the math for you to verify from here on. For k = 2:

X(2) = ∑_{t=0}^{7} x(t)e^{−i2π(2)t/8}
= (1)(1) + (√2/2)(0) + (0)(−1) + (−√2/2)(0) + (−1)(1) + (−√2/2)(0) + (0)(−1) + (√2/2)(0)
= 1 − 1 = 0.

For k = 3:

X(3) = ∑_{t=0}^{7} x(t)e^{−i2π(3)t/8}
= 1 + (−1/2 − i/2) + 0 + (−1/2 + i/2) + 1 + (−1/2 − i/2) + 0 + (−1/2 + i/2)
= 0.

For k = 4:

X(4) = ∑_{t=0}^{7} x(t)e^{−i2π(4)t/8},

so the exponential will vary between −1 and 1 with no imaginary part. Therefore,

X(4) = (1)(1) + (√2/2)(−1) + (0)(1) + (−√2/2)(−1) + (−1)(1) + (−√2/2)(−1) + (0)(1) + (√2/2)(−1)
= 0.

For k = 5:

X(5) = ∑_{t=0}^{7} x(t)e^{−i2π(5)t/8}
= 1 + (−1/2 + i/2) + 0 + (−1/2 − i/2) + 1 + (−1/2 + i/2) + 0 + (−1/2 − i/2)
= 0.

For k = 6:

X(6) = ∑_{t=0}^{7} x(t)e^{−i2π(6)t/8}
= 1 + 0 + 0 + 0 + (−1) + 0 + 0 + 0
= 0.

For k = 7:

X(7) = ∑_{t=0}^{7} x(t)e^{−i2π(7)t/8}
= 1 + (1/2 + i/2) + 0 + (1/2 − i/2) + 1 + (1/2 + i/2) + 0 + (1/2 − i/2)
= 4.

So, X(k) = (0, 4, 0, 0, 0, 0, 0, 4), a function symmetric about the line k = 4, which is equal to N/2. The frequency of the kth frequency component is given by ω_k = 2πk/(NT_s), so ω_1 = 2π(1)/(8(0.25)) = 2π/2 = π rad/s or 0.5 Hz, which is indeed the frequency of our sampled x(t). The other nonzero term is just the Hermitian conjugate of X(1), so for that reason, the first 5 terms of X(k) (up to the (N/2)th term) are the only ones we care about.¹⁰ The inverse DFT of this signal is left for the reader to verify.

10 Because some texts will call the first term of time signals and frequency spectra x(1) or X(1) and the final term x(N) or X(N), sometimes this is specified as the (N − 1)/2 frequency component.
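
A quick way to do that verification is with a few lines of code. This C sketch is my own illustration, not the author's: it applies the inverse DFT formula to X(k) = (0, 4, 0, 0, 0, 0, 0, 4) and prints the recovered samples, which should be the cosine values 1, √2/2, 0, −√2/2, −1, −√2/2, 0, √2/2.

#include <stdio.h>
#include <math.h>
#include <complex.h>

#define N 8

int main(void)
{
    const double PI = 3.14159265358979323846;
    /* The spectrum computed by hand in Example 1. */
    double complex X[N] = {0, 4, 0, 0, 0, 0, 0, 4};
    double complex x[N];

    /* IDFT: x(t) = (1/N) sum_k X(k) e^{+i 2 pi k t / N}. */
    for (int t = 0; t < N; t++) {
        x[t] = 0;
        for (int k = 0; k < N; k++)
            x[t] += X[k] * cexp(I * 2.0 * PI * k * t / N);
        x[t] /= N;
    }

    for (int t = 0; t < N; t++)
        printf("x(%d) = %.3f %+.3fi\n", t, creal(x[t]), cimag(x[t]));
    return 0;
}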



When N is odd, X(0) will be nonzero, i.e., there is a DC offset in<br />

signals of odd length because their average value is not 0. The DC<br />

offset is equal <strong>to</strong> the average value of x(t) times the number of samples,<br />

or rather, the sum of every sample in x(t). This is so because direct<br />

current (DC) is constant <strong>and</strong> has a frequency of 0 Hz. X(0) will always<br />

be real-valued because the input signal is real-valued. A DC offset in<br />

practice is considered undesirable because speakers will not be at their<br />

resting state when the signal is on but silent due <strong>to</strong> the constant flow<br />

of current yet no change in voltage (amplitude).<br />

It is important to realize that though N looks like a variable, it is actually a constant: We derive it from the length of x(t) in seconds and the sampling frequency, but its value does not change. The DFT and IDFT can be normalized by scaling them both by 1/√N. N is not bounded by k in the sum, and so multiplying each term by 1/√N just scales the values and does not affect the relative powers of the terms. The normalized discrete Fourier transform (NDFT) and its inverse (the NIDFT) are written

F̂{x(t)} := X̂(k) = (1/√N) ∑_{t=0}^{N−1} x(t)e^{−i2πkt/N},

F̂^{−1}{X̂(k)} := x(t) = (1/√N) ∑_{k=0}^{N−1} X̂(k)e^{i2πkt/N}.

Mathematica, for whatever reason, reverses these equations such that the NDFT is their IDFT and the NIDFT is their DFT.

Example 2: The DFT of a complex sinusoid, in Mathematica

The built-in Fourier[] function in Mathematica is the normalized inverse discrete Fourier transform (NIDFT) as above, and similarly, the InverseFourier[] function takes the normalized discrete Fourier transform. Therefore, we need to use InverseFourier[] multiplied by √N to take a discrete Fourier transform in the way we have described thus far.

Let x(t) = cos(2πt) + cos(πt) + cos(1.2πt). Because f_max is 1 Hz, f_s must be greater than 2 Hz, so let there be 3 samples per second, i.e., f_s = 3 Hz. We'll sample this over 4 seconds:

x_s(t) = {1, −0.5, −0.5, 1, −0.5, −0.5, 1, −0.5, −0.5, 1, −0.5, −0.5}
       + {1, 0.5, −0.5, −1, −0.5, 0.5, 1, 0.5, −0.5, −1, −0.5, 0.5}
       + {1, 0.309, −0.809, −0.809, 0.309, 1, 0.309, −0.809, −0.809, 0.309, 1, 0.309}
       = {3, 0.309, −1.809, −0.809, −0.691, 1, 2.309, −0.809, −1.809, 0.309, 0, 0.309}

for N = 12 and T_s = 1/3 seconds. Explicitly, the function Fourier[x], where x is some discrete time-domain signal, is given by

Fourier[x] = (1/√N) ∑_{t=0}^{N−1} x(t)e^{i2πkt/N}

and its inverse InverseFourier[X] is

InverseFourier[X] = (1/√N) ∑_{k=0}^{N−1} X(k)e^{−i2πkt/N}.

Not only are these multiplied by 1/√N, but the signs of e's exponent are actually reversed. Therefore, to take the DFT of x_s(t) as above, we type into Mathematica the command

x={3,0.309,-1.809,-0.809,-0.691,1,2.309,-0.809,-1.809,0.309,0,0.309};

and then

Sqrt[Length[x]]*InverseFourier[x]

Now press "Shift+Enter" to execute these lines of code in Mathematica.¹¹ The output is

{1.309, 1.406+0.812i, 8.368+4.102i, -2.927i, 6.559-0.968i, 0.667-0.385i, 0.691, 0.667+0.385i, 6.559+0.968i, 2.927i, 8.368-4.102i, 1.406-0.812i}

The magnitude of this is

|X(k)| = (1.309, 1.624, 9.319, 2.927, 6.630, 0.770, 0.691, 0.770, 6.630, 2.927, 9.319, 1.624).

There is energy spread to every one of the frequency bins because no ω_k equals exactly 0.6 Hz, so its energy leaks to nearby bins, including those at ω_k = π rad/s and ω_k = 2π rad/s. The bins with the most energy are the k = 2 and k = 4 bins (ignoring the second half of the DFT), corresponding to the frequencies ω_2 = 2π(2)/(12(1/3)) = 4π/4 = π rad/s or 0.5 Hz, which is the frequency of cos(πt), and ω_4 = 2π(4)/(12(1/3)) = 8π/4 = 2π rad/s or 1 Hz, the frequency of cos(2πt). The bin k = 3 has the third greatest energy, showing the leakage from cos(1.2πt), because ω_3 is the frequency bin for 1.5π rad/s or 0.75 Hz. However, most of the energy from this sinusoid leaks to the k = 2 frequency bin because 0.5 Hz is closer to 0.6 Hz than is 0.75 Hz. This is the spectral leakage.

Example 3: The DFT of a complex sinusoid, graphically

To interpret the DFT by graphical inspection, we can choose one of two methods: (1) analyzing the product of x(t) when multiplied by each root of unity, or (2) stretching x(t) as described above in the scaling theorem to be twice its length and half of its frequency, i.e., double its period.¹¹ Both ways will return identical results and identical graphs, so we will only show method (1) here and leave the second for the reader to verify. Let our sinusoid contain the first three harmonics with diminishing energy as their frequency increases: x(t) = sin(2πt) + 0.3 sin(4πt) + 0.1 sin(6πt), defined over the interval of time [0, 2]. The length of x(t) is two seconds, and we must sample it at greater than two times the maximum frequency component, which is 3 Hz. So let f_s = 8 Hz, and N = length(x) · f_s = 2 · 8 = 16.

11 All of the functionality of Mathematica is online and free at http://wolframalpha.com/.

From method (2), the resulting DFT will be identical to the one derived from method (1), except that the duration of x(t) is increased to 4 seconds and f_max is now 1.5 Hz. Therefore, we can sample at 4 Hz instead of 8 Hz, and the following plots will look exactly the same except for the scale of the horizontal axis, which will be doubled.

Below is the graphical depiction of method (1). The dashed lines represent the multiplication of the original function with the imaginary sine component of the W_N^k, and the solid lines show the product of x(t) with the real cosine component. There are "O"s marking where the cosine wave multiplies with the samples of x(t), and "X"s where the imaginary sine wave multiplies with x(t), so there are 16 O's and 16 X's on each graph. Each graph depicts e^{−i2πkf_s t/16} over two seconds for 0 ≤ k ≤ 15. We multiply by f_s here because the scale of our horizontal axis shows seconds, not samples, and hence x(1) represents the signal at the time of one second, not the eighth sample; otherwise, we would leave it alone. Try to inspect where the sum of the amplitudes will not be zero, i.e., when the points favor one half of the horizontal axis over the other. When the points straddle the axis, this is a likely indication that the samples' total sum is zero.

Sampled sinusoids of the DFT

This method shows us that for k = 2, k = 4, k = 12, and k = 14, the real parts (the O's) are symmetric about the horizontal axis and probably sum to zero, while the imaginary parts (the X's) are nonzero.¹² For k = 2 and k = 4, the X's are more on the bottom (negative) half,

12 Note that since the sine waves (the dashed lines) are complex, their vertical axis is imaginary.

Figure 7.27: The signal x(t) = sin(2πt) + sin(4πt) multiplied by each of the roots of unity, e^{−i2πkf_s t/N}, equivalent to e^{−iπkt}, which is the sum cos(πkt) − i sin(πkt) for 0 ≤ k ≤ 15. The real samples resulting from x(t) cos(πkt) sum to zero in each of these graphs, while a few of the imaginary samples from x(t)[−i sin(πkt)] have nonzero sums.

and for k = 12 and k = 14, they favor the positive half. In fact, the spectrum is X(k) = {0, 0, −8i, 0, −8i, 0, 0, 0, 0, 0, 0, 0, 8i, 0, 8i, 0}.

Take caution when computing the DFT of actual signals: Oftentimes, its results will be jarringly different from what we perceive, especially in the case of tonal music. The DFT can be particularly unreliable when we want to analyze the frequency content of polyphonic music, e.g., music containing more than one instrument and pitch at any given time. Fundamental frequency detection, called f_0-tracking in the field of music information retrieval, is a largely unsolved problem because sometimes our brain will fill in the fundamental when it is weak or missing altogether from a signal, via difference tones or simply the complicated language of music. Consider, for example, letting a chord on a guitar ring out for a few seconds. The fundamental will disappear rather quickly, but our brains may still say that it is the essential frequency content of the sound when the DFT would show otherwise. In conclusion, do not be frustrated when the DFT fails to return the information you want (and you know that you have specified it correctly); rather, devise a way to correct its failures with respect to our auditory perception.¹³

7.6 Chapter summary

The continuous Fourier transform

F{x(t)} := X(ω) = ∫_{−∞}^{∞} x(t)e^{−iωt} dt

accepts a continuous, time-domain function x(t) and produces a continuous, frequency-domain spectrum X(ω) containing the frequency information of x(t). Its inverse,

F^{−1}{X(ω)} := x(t) = (1/2π) ∫_{−∞}^{∞} X(ω)e^{iωt} dω,

accepts a continuous, frequency-domain function X(ω) to reproduce the continuous time-domain signal x(t).

The exponentials e^{±i2πkt/N} are Euler's roots of unity that can be written as complex trigonometric functions by Euler's formula, which states that

e^{±iω} = cos(ω) ± i sin(ω).

Therefore, the real parts of X(ω) correspond to results of the cosine function's product with x(t), and the imaginary parts correspond to the product of the sine function with x(t).

13 Try smoothing your data.

The discrete Fourier transform,

F{x(t)} := X(k) = ∑_{t=0}^{N−1} x(t)e^{−i2πkt/N},

accepts only discrete, sampled functions x_s(t). Because of this, we can write the transform with simply x(t) and know it is a discrete, sampled signal. Its inverse,

F^{−1}{X(k)} := x(t) = (1/N) ∑_{k=0}^{N−1} X(k)e^{i2πkt/N},

reconstructs the sampled time-domain signal.

If a frequency ω_k (where ω_k = 2πk/(NT_s)) is present in the signal x(t), then the DFT will constructively interfere with the signal at ω_k, producing a non-zero value in the frequency bin X(k). The value of X(k) is exactly equal to the sum of the amplitudes of the signal multiplied by the roots of unity given by W_N^k. However, if the signal contains frequency components not expressed by the roots of unity, i.e., when a poor sampling frequency f_s is chosen or when the frequency is not in the set

{ω_k} = { 0, 2π/(NT_s), 4π/(NT_s), . . . , 2π(N−1)/(NT_s) },

then spectral leakage will occur and the unrepresented frequency components will alias with nearby ω_k, spreading their energy over 2 or more frequency bins.

The Fourier transform of a real signal possesses Hermitian symmetry about its (N/2)th frequency component, meaning that the magnitude |X(j)| is equal to the magnitude |X(N − j)| for some integer j ≤ N/2—i.e., the magnitude plot is symmetrical about the line k = N/2. This means that in the continuous case, the second half of the transform's magnitude, containing the positively valued frequency components, is the half of interest, so we may ignore the first half. In the discrete case, all of the frequencies are positively valued, and we may ignore the second half to understand the relative energies of the frequency components of a signal.

The amplitude X(0) of the zeroth frequency component, at 0 Hz, is a real number corresponding to the DC offset of a waveform, so it is significant. A spectrum will have a non-zero DC offset when the average value of the waveform is non-zero, which is often the case when N is odd.

8. Other Fourier transforms

To quickly review, the continuous Fourier transform is represented by the integral

X(ω) = ∫_{−∞}^{∞} x(t)e^{−iωt} dt,

where ω = 2πf. This returns the amplitude of the frequency component ω in the entire continuous, time-domain signal x(t). Its inverse is

x(t) = (1/2π) ∫_{−∞}^{∞} X(ω)e^{iωt} dω,

which tells us the amplitude of x(t) at time t by integrating over all of the frequency components.

The discrete Fourier transform of a sampled, time-domain signal x_s(t) is given by the sum

X(k) = ∑_{t=0}^{N−1} x_s(t)e^{−i2πkt/N}

for k = 0, 1, . . . , N − 1, where N is the total number of samples in the sampled signal x_s(t), t is the sample number, and k is the index of the frequency component ω_k. X(k) returns the amplitude of ω_k in the entire signal x_s(t). The inverse discrete Fourier transform (IDFT) is

x_s(t) = (1/N) ∑_{k=0}^{N−1} X(k)e^{i2πkt/N}

for t = 0, 1, . . . , N − 1. This reconstructs the amplitude of x_s(t).

In addition to the continuous and discrete Fourier transforms, there are several other transforms that give the frequency-domain spectrum of a time-domain signal. The Laplace transform is used in electrical engineering to compute transfer functions, which describe the transfer of voltage in a linear, continuous, time-invariant system. The Z-transform is the discrete version of the Laplace transform: It takes an infinite, discrete, time-domain input and outputs a complex, finite spectrum that is limited by some region of convergence. The discrete-time Fourier transform (DTFT) is a special case of the Z-transform whose region of convergence is the unit circle, i.e., the interval [0, 2π). A DTFT can be derived from a DFT via spectral interpolation. A fast Fourier transform (FFT) calculates a DFT with substantially fewer computations and is easily the most popular version of the Fourier transform. Finally, a short-term Fourier transform (STFT) computes the FFT at very small (0.1 seconds or less) intervals of time in a song to show how its frequencies change. Its results are typically conveyed in a spectrogram. We will discuss the Z- and Laplace transforms in Appendix A since they are not Fourier transforms.

8.1 Discrete-time Fourier transform (DTFT)

The discrete-time Fourier transform is a special case of the Z-transform that reduces the domain of a spectrum of frequencies to the continuous interval [0, 2π).¹ In the DTFT, the frequency components ω_k are normalized such that ω̂_k = 2πkT_s, unlike in the DFT, where ω_k equals 2πk/(NT_s). We typically see the DTFT in digital filter design (namely FIR filters), where a discrete transfer function H(z) may be computed, i.e., the input and output of some discrete, linear, time-invariant system is known. Like the DFT, the DTFT requires a discrete, sampled input, but there is no N; instead, the duration of the input must be infinite², so its time samples t are all of the integers from negative to positive infinity.

1 Some texts specify this interval as [−π, π).
2 If a system is time-invariant, sometimes we can assume the input is infinite for purposes of calculation. With repetitive, steady signals like pure tones or white noise, this is an assumption we may make, though the beginning and ending points should both be 0 to avoid clipping artifacts that arise in the frequency domain.

The DTFT is defined by the sum

X(ω̂) = ∑_{t=−∞}^{∞} x[t]e^{−iω̂t},

where ω̂ is in the interval [0, 2π). Because this forms a continuum, the inverse DTFT is the integral

x[t] = (1/2π) ∫_0^{2π} X(ω̂) · e^{iω̂t} dω̂.

We use square brackets to indicate which function is discrete and parentheses to indicate a continuity.

The critical difference between a DTFT and a DFT is that the DTFT frequency domain, [0, 2π), is a continuum. This is related to the fact that the Fourier transform of an infinite signal is finite, but stems from the property that the DTFT is a periodic function wherein

X(ω̂_k + 2π) = X(ω̂_k).

The frequency range of the DFT is not continuous because it is only defined for the frequencies ω_k = 2πk/(NT_s) where 0 ≤ k ≤ (N − 1), representing N-many uniformly spaced frequencies [26]. Therefore, we consider the DTFT to be more mathematically rigorous than the DFT, even though we rarely take a DTFT in practice because our input signals are not infinite. However, we can make the time-limited signals infinite by zero-padding in the time domain, i.e., appending zeros onto x[t] such that t is defined over all of the integers instead of only 0 to N − 1. Zero-padding in the time domain translates to spectral interpolation in the frequency domain, which effectively limits its frequency domain to some finite interval and results in a higher resolution in the frequency domain. So the more zeros are padded onto a signal, the "smoother" the resulting transform. Conversely, if we sample the DTFT by computing N-many samples per period of X, the DTFT will be equivalent to the DFT:

x[t] = (1/2π) ∫_0^{2π} X(ω̂)e^{iω_k t} dω = T_s ∮_{1/T_s} X(k)e^{i2πktT_s} dk,

where the syntax "∮" denotes the closed path integral. Here, the integral ∮_{1/T_s} is computed over any single period of X, i.e., it is of length 2π and does not necessarily begin at 0. Then

X(k/(NT_s)) = ∑_{t=−∞}^{∞} x[t]e^{−i2πkt/N}.

8.2 Fast Fourier transform (FFT)

The fast Fourier transform is an efficient algorithm that seeks to speed up the computation of the discrete Fourier transform by removing all of its redundant computations. At first glance (and second, and third, and . . .), the mathematical expressions involved are not as clear, nor do they reveal as much about its inner workings, as does that of the discrete Fourier transform. This is the main reason why we leave the FFT, and algorithms in general, to computers and other hardware devices.

This algorithm speeds up the processing time of the DFT by reducing the number of computations required from N^2-many to N log_2 N-many (on average), where N is again the total number of samples in the signal x(t). Because the FFT outputs exactly the same thing as the DFT, it is computationally the clear choice when N is large—which is always the case for audio because of its high sampling rate.

The FFT still requires its input <strong>to</strong> be discrete, <strong>and</strong> likewise it produces<br />

a discrete output. Furthermore, the <strong>to</strong>tal number of samples N<br />

must be a power of 2, i.e., 256, 512, 1024, <strong>and</strong> so on. If N is not equal <strong>to</strong><br />

a power of 2, an appropriate number of zeros can be added <strong>to</strong> the end



of the signal to make it so. This is the same zero-padding used above in the DTFT, though here it is a finite number of zeros.

The FFT was originally discovered by Carl Friedrich Gauss in 1805, but the results were not published until after his death, and the computational efficiency of the algorithm was realized by neither him nor his readers. In 1965, a paper published by J. W. Cooley (IBM) and John Tukey (Princeton University) described the same algorithm and its implementation on a computer, but did not cite Gauss, and the connection was not made until some time after [27].

The algorithm improves the computational efficiency of the DFT by recursively partitioning the entire signal into smaller parts, using the DFT's linearity to take multiple smaller DFTs on these parts, and combining the results. There are several versions of the FFT proposed by Cooley and Tukey, each using different algorithmic techniques to perform this task. Given here is the most popular version: the radix-2 decimation-in-time FFT.

Radix-2 decimation in time (DIT) of the FFT

Also called the Danielson-Lanczos lemma, the radix-2 decimation-in-time algorithm performs an FFT by first splitting the input signal into two parts: the even-numbered samples,

x_even = (x(0), x(2), x(4), . . . , x(2m)),

and the odd-numbered samples,

x_odd = (x(1), x(3), x(5), . . . , x(2m + 1)).

If N, the number of samples in x(t), is a power of 2, then N = 2m + 2 (the last term of x is x(2m + 1); remember, we include the term x(0) in the count). Otherwise, we pad the end of the signal with zeros until its length is a power of 2. So, if the length of x is 250, we would pad it with 6 zeros to make its length 256 = 2⁸. Since we begin at 0, we sum from 0 to N − 1. Dividing this by 2, we sum from 0 to N/2 − 1,
sum from 0 <strong>to</strong> N − 1. Dividing this by 2, we sum from 0 <strong>to</strong> N/2 − 1,



twice. Then the DFT of the input signal can be written as the sum of the DFTs of these split signals:

X(k) = \sum_{m=0}^{\frac{N}{2}-1} x(2m) \, e^{-i 2\pi (2m) k / N} + \sum_{m=0}^{\frac{N}{2}-1} x(2m+1) \, e^{-i 2\pi (2m+1) k / N}.

The zeros padded at the end of the signal do not contribute anything to this sum, so this form is sufficient. We can make both of the exponentials e^{-i 2\pi (2m) k / N} and e^{-i 2\pi (2m+1) k / N} identical to one another by factoring out e^{-i 2\pi k / N} from the second one, so we can rewrite this as

X(k) = \sum_{m=0}^{\frac{N}{2}-1} x(2m) \, e^{-i 2\pi (2m) k / N} + e^{-i 2\pi k / N} \sum_{m=0}^{\frac{N}{2}-1} x(2m+1) \, e^{-i 2\pi (2m) k / N}.

Each of these two sub-sums is a DFT of length N/2.³ The factored-out exponential e^{-i 2\pi k / N}, however, flips its sign when k is increased by N/2 (see the footnote). Therefore, we can use this result and the above sum for X(k) to compute the terms of the DFT for k ≥ N/2 from the same two sub-sums; only the sign in front of the second sum changes. For the sake of space, let F = e^{-i 2\pi (2m) k / N} and let G = e^{-i 2\pi (2m)(k - \frac{N}{2}) / N}. Then,

X(k) =
\begin{cases}
\sum_{m=0}^{\frac{N}{2}-1} x(2m) \cdot F + e^{-\frac{i 2\pi k}{N}} \sum_{m=0}^{\frac{N}{2}-1} x(2m+1) \cdot F, & k < N/2 \\
\sum_{m=0}^{\frac{N}{2}-1} x(2m) \cdot G - e^{-\frac{i 2\pi (k - \frac{N}{2})}{N}} \sum_{m=0}^{\frac{N}{2}-1} x(2m+1) \cdot G, & k \ge N/2.
\end{cases}

³ Again, the second half of the output of a DFT is symmetrical to the first half. To realize why this is true, consider the above factor, e^{-i 2\pi k / N}:

e^{-i 2\pi k / N} = -e^{-i\pi} \cdot e^{-i 2\pi k / N}   (1)
                 = -e^{-(i 2\pi \frac{N}{2}) / N} \cdot e^{-i 2\pi k / N}   (2)
                 = -e^{-i 2\pi (k + \frac{N}{2}) / N}.   (3)

We can switch the sign in the first step (1) because e^{-i\pi} = -1. Then in (2), we rewrite e^{-i\pi} as e^{-i 2\pi \frac{N}{2} / N}; the 2's and the N's cancel out to make the exponent -i\pi. Finally, (3) once again makes use of the rule x^a \cdot x^b = x^{a+b}: here,

-e^{-\frac{i 2\pi \frac{N}{2}}{N}} \cdot e^{-\frac{i 2\pi k}{N}} = -e^{-\frac{i 2\pi \frac{N}{2}}{N} - \frac{i 2\pi k}{N}},

and we can factor out -\frac{i 2\pi}{N} to reduce this to -e^{-\frac{i 2\pi}{N}(\frac{N}{2} + k)}, which gives us the same result seen in (3).



This may not seem like it actually reduces the number of computations involved: if we have to go through all the k's anyway, why does this result mean anything?

Well, for one, this is not actually the final specification of the algorithm: it is just the first reduction, wherein the number of computations is reduced from N² to N²/2. (N²/2 is equal to N log₂ N only when N = 2 or N = 4.) Further reductions require knowledge of the size of N, because then we will know when we can factor out equivalent exponentials. The roots of unity described by the exponent of e permit the Fourier transform to detect periodic waves, i.e., those with a specific frequency, and they also permit the acceleration of the discrete Fourier transform into the fast Fourier transform. By identifying what can be factored out, we can reduce the size of the individual sums to 2, no matter the N. The number of times we have to factor out exponentials is equal to log₂ N.

Let us do an example to show that indeed the computational complexity reduces to N log₂ N. Let N = 8. The normal DFT would then be written

X(k) = x(0) + x(1) e^{-\frac{i 2\pi k}{8}} + x(2) e^{-\frac{i 4\pi k}{8}} + x(3) e^{-\frac{i 6\pi k}{8}} + x(4) e^{-\frac{i 8\pi k}{8}} + x(5) e^{-\frac{i 10\pi k}{8}} + x(6) e^{-\frac{i 12\pi k}{8}} + x(7) e^{-\frac{i 14\pi k}{8}}.

Splitting this into the even and odd parts gives

X(k) = \left[ x(0) + x(2) e^{-\frac{i 4\pi k}{8}} + x(4) e^{-\frac{i 8\pi k}{8}} + x(6) e^{-\frac{i 12\pi k}{8}} \right] + \left[ x(1) e^{-\frac{i 2\pi k}{8}} + x(3) e^{-\frac{i 6\pi k}{8}} + x(5) e^{-\frac{i 10\pi k}{8}} + x(7) e^{-\frac{i 14\pi k}{8}} \right]

     = \left[ x(0) + x(2) e^{-\frac{i 4\pi k}{8}} + x(4) e^{-\frac{i 8\pi k}{8}} + x(6) e^{-\frac{i 12\pi k}{8}} \right] + e^{-\frac{i 2\pi k}{8}} \left[ x(1) + x(3) e^{-\frac{i 4\pi k}{8}} + x(5) e^{-\frac{i 8\pi k}{8}} + x(7) e^{-\frac{i 12\pi k}{8}} \right].

Further factoring out another exponential to make the sums half the length again (so our sum goes from 0 to N/4 − 1) gives



X(k) = \left\{ \left[ x(0) + x(4) e^{-\frac{i 8\pi k}{8}} \right] + e^{-\frac{i 4\pi k}{8}} \left[ x(2) + x(6) e^{-\frac{i 8\pi k}{8}} \right] \right\} + e^{-\frac{i 2\pi k}{8}} \left\{ \left[ x(1) + x(5) e^{-\frac{i 8\pi k}{8}} \right] + e^{-\frac{i 4\pi k}{8}} \left[ x(3) + x(7) e^{-\frac{i 8\pi k}{8}} \right] \right\}

     = \left\{ \left[ x(0) + x(4) e^{-i\pi k} \right] + e^{-\frac{i\pi k}{2}} \left[ x(2) + x(6) e^{-i\pi k} \right] \right\} + e^{-\frac{i\pi k}{4}} \left\{ \left[ x(1) + x(5) e^{-i\pi k} \right] + e^{-\frac{i\pi k}{2}} \left[ x(3) + x(7) e^{-i\pi k} \right] \right\}.

So the final result is 4 sums of size 2, and it took us 3 factorizations to get there. Each of these factorizations took 8 computations. So, the total number of computations required was 8 · 3 = 24 = 8 log₂ 8 = N log₂ N.

The organization of the FFT algorithm is usually visualized by a butterfly diagram, but I find them somewhat confusing due to their many arrows. The diagram in Figure 8.1 is a different graphical interpretation of the radix-2 FFT, showing the exponential factors at each step of its evaluation.

Figure 8.1: Diagram depicting the nested processes of the radix-2 DIT FFT.
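The same nesting can be expressed directly in code. Below is a minimal C sketch of the recursive radix-2 DIT FFT described above; it is not the book's own listing, it assumes N is a power of 2, and the array names, helper function, and test signal are illustrative.

#include <complex.h>
#include <math.h>
#include <stdio.h>

/* In-place recursive radix-2 decimation-in-time FFT.
 * x holds N complex samples; N must be a power of 2. */
static void fft(double complex *x, int N)
{
    if (N < 2) return;                      /* a length-1 DFT is the sample itself */

    double complex even[N / 2], odd[N / 2];
    for (int m = 0; m < N / 2; m++) {       /* decimate in time */
        even[m] = x[2 * m];
        odd[m]  = x[2 * m + 1];
    }
    fft(even, N / 2);                       /* recurse on each half */
    fft(odd,  N / 2);

    const double pi = acos(-1.0);
    for (int k = 0; k < N / 2; k++) {
        /* twiddle factor e^{-i 2 pi k / N} */
        double complex w = cexp(-I * 2.0 * pi * k / N);
        x[k]         = even[k] + w * odd[k];   /* k < N/2               */
        x[k + N / 2] = even[k] - w * odd[k];   /* k >= N/2: sign flips  */
    }
}

int main(void)
{
    double complex x[8] = {1, 1, 1, 1, 0, 0, 0, 0};   /* N = 8 test signal */
    fft(x, 8);
    for (int k = 0; k < 8; k++)
        printf("X(%d) = %6.3f %+6.3fi\n", k, creal(x[k]), cimag(x[k]));
    return 0;
}

The two lines inside the combining loop are exactly the two cases of the piecewise expression for X(k) above: the even and odd half-length DFTs are reused for k and k + N/2, with only the sign of the twiddle factor changing.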



8.3 Short-time Fourier transform (STFT)

Music is a time-based art, and we process it sequentially. We pay attention to changes and build expectations for these changes as we become experienced listeners. A (discrete) Fourier transform retrieves the frequency information of an entire signal, but for varied signals with multiple instruments and chords, this isn't very helpful. Instead, we want to know what happens at small intervals of time so we can get an idea of change in music. Hence, the STFT is a very useful version of the DFT.

The STFT computes the Fourier transform by partitioning the time signal into smaller, equally sized time frames, and taking the Fourier transform of each of them. The STFT has a continuous and a discrete form.

X(\tau_m, \omega) = \int_{-\infty}^{\infty} x(t) \, w(t - \tau_m) \, e^{-i\omega t} \, dt

The continuous-time STFT, as it is called, applies a windowing function w(t − τ_m) to a continuous signal x(t) and returns one Fourier transform for each window. The mth window begins at time t = τ_m, where τ_m = mH is the window index m multiplied by the hop size H. The size of the hop differs from the size of the transform, however. We call each short-term time-domain signal to be transformed a window, or frame, and its size the frame size. We step through these frames according to a designated hop size. So, the size of a single transform (the frame size) will be a fraction of N according to the number of frames. If the frame size is equal to the hop size, then there is zero overlap between the frames. If the hop size is less than the frame size, then adjacent frames overlap by their difference. Overlap is perfectly fine and actually improves the resolution of the STFT. Both hop size and frame size are intervals of time, so they are given in seconds.



Calling hop size H, frame size N′, and the number of frames M, we have N′ = N/M (equivalently, N = N′ · M). The STFT contains N/H Fourier transforms of size N′. Therefore, when H < N′, the STFT is a more costly algorithm than a single FFT of the entire signal, requiring (N/H) · N′ log₂ N′ computations versus N log₂ N-many.

However, it is usually sped up in practice by using fast Fourier transforms for the individual frames. The STFT is the most common implementation of the Fourier transform because it gives the most accurate representation of a signal: a 180-second-long song containing K-many frequency components most certainly does not contain identical frequency components at every instant of time. Music changes! Usually, the hop size and frame size are chosen somewhere around 50-100 ms to correspond to the time resolution of our perception.

To specify the STFT in terms of the FFT, a size N′ for the FFT must be chosen. Because the FFT length must be a power of 2, we choose N′ such that N′ ≥ H, the size of each hop, and N′ = 2^p for some p. This is called an N′-point FFT. N′ can be determined with a function such as nextpow2(H) in MATLAB, or simply as 2 raised to ⌈log₂ H⌉. We pad each x_m(t) with zeros, i.e.,

x_m(t - \tau_m) =
\begin{cases}
x_m(t - \tau_m), & |t - \tau_m| \le \frac{H-1}{2} \\
0, & |t - \tau_m| > \frac{H-1}{2}
\end{cases}


where

\sum_{m=-\infty}^{\infty} w(t - mH) = 1, \qquad t = -\infty, \ldots, -1, 0, 1, \ldots, \infty.

Therefore,

\sum_{m=-\infty}^{\infty} X(mH, \omega) = \sum_{m=-\infty}^{\infty} \sum_{t=-\infty}^{\infty} x(t) \, w(t - mH) \, e^{-i\omega t}
 = \sum_{t=-\infty}^{\infty} x(t) \, e^{-i\omega t} \sum_{m=-\infty}^{\infty} w(t - mH)
 = \sum_{t=-\infty}^{\infty} x(t) \, e^{-i\omega t}
 = X(\omega).

We can do these sums globally, i.e., for the interval (−∞, ∞) instead of from 0 to N′ − 1, because we zero-padded each of our frames.

Spectrograms are made using short-time fast Fourier transforms. The entire file is sliced evenly into partitions by a time interval (around 100 ms is usually sufficient), and then the discrete Fourier transform is taken of each slice. The resulting graph shows frequency, amplitude, and time, but is plotted in only two dimensions: the horizontal axis shows each time interval, the vertical axis is frequency, and the darkness of each point shows amplitude.

Different types of windows can be specified by a windowing function, w(t).

Windowing<br />

Windowing is a time-selective process that takes many equal size<br />

intervals of a signal by multiplying everything outside of that interval<br />

by zero. It is nearly identical conceptually <strong>to</strong> the impulse function, but<br />

its domain is not infinitesimally small.<br />

Most similar <strong>to</strong> an impulse function is a rectangular window that<br />

is constantly 1 over some interval centered about time τ m <strong>and</strong> 0 elsewhere.<br />

We define this interval as starting at the mth hop, <strong>and</strong> it is N ′



in size. Letting t be a real number, our rectangular window or boxcar<br />

window (because it moves along a function like a train of many boxcars)<br />

can be given by the function<br />

⎧<br />

⎨1, t ∈ [mH, mH + N ′ ]<br />

w(t) =<br />

⎩0, otherwise.<br />

This interval [mH, mH + N ′ ] can also be written [τ m ,τ m + N ′ ] because<br />

τ m = mH. It bears quite a resemblance <strong>to</strong> the Kronecker delta function,<br />

⎧<br />

⎨1, if t =0<br />

δ(t) =<br />

⎩0, otherwise.<br />

This is the simplest window, and it retains all of the amplitude information of a signal, but it also induces the most spectral leakage and side lobes in the frequency domain because of its infinite slope on either side. A windowing function with smoother ends minimizes spectral leakage and has smaller side lobes that decrease to zero almost immediately. However, a rectangular window produces the narrowest or "strongest" center lobe (typically one spike at the closest frequency versus several spikes) of any of the windowing functions, so there is a sort of trade-off for using nondifferentiable windows.

A triangle window neglects a relatively large amount of a signal because of its pointed top: only one sample's amplitude (the center sample of each window) will be the same as the original signal in the windowed function. Also called Bartlett windows, triangle windows are specified by the function

w(t) =
\begin{cases}
1 - \left| \frac{2}{N'}(t - \tau_m) - 1 \right|, & \text{if } t \in [\tau_m, \tau_m + N'] \\
0, & \text{otherwise.}
\end{cases}

Remember, τ_m = mH is the location in time of the beginning of the mth window. The size of the window is the frame size N′, so the



Figure 8.2: A triangle window, centered at 1 second with a size of 2 (N′). We cannot gather the hop size from this image; if there were multiple triangle windows shown in this graph, the hop size H would be the difference in time between the τ_m, i.e., the starting points of successive windows.

mth window interval is given by [mH, mH + N′] if we begin at m = 0 (the zeroth window, as opposed to the first window). The hop size H simply defines the spacing of many of these windows. Thus, we begin the mth window at the mth hop, and there are M-many windows (or frames).

Hanning windows are another popular choice in STFTs because of their gradual slope at their endpoints. This ensures a smooth attack in the windowed signal, and therefore minimal distortion in the spectrum due to discontinuities at the windowed ends of the input signal. The Hanning window is represented by the function

w(t) =
\begin{cases}
\frac{1}{2}\left[1 - \cos\left(\frac{2\pi t}{N' - 1}\right)\right], & \text{if } t \in [\tau_m, \tau_m + N'] \\
0, & \text{otherwise.}
\end{cases}



Figure 8.3: A Hanning window.<br />

The Hanning window is a variation of a cosine or sine window, the simplest form of which is given by

w(t) =
\begin{cases}
\sin\left(\frac{\pi t}{N' - 1}\right), & t \in [\tau_m, \tau_m + N'] \\
0, & \text{otherwise}
\end{cases}
 =
\begin{cases}
\cos\left(\frac{\pi t}{N' - 1} - \frac{\pi}{2}\right), & t \in [\tau_m, \tau_m + N'] \\
0, & \text{otherwise.}
\end{cases}

Figure 8.4: A cosine or sine window.



All of these graphs show windows with τ_m = 0, i.e., they are the 0th windows (m = 0), and the frame size is N′ = 2, so the interval is [0, N′] = [τ_0, τ_0 + 2].
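As a concrete illustration of how the pieces fit together (hop size, frame size, windowing, and a per-frame transform), here is a minimal C sketch of one column of a magnitude STFT. It is not the book's own listing: the frame/hop values, the toy input, the direct per-frame DFT (an FFT like the one sketched in Section 8.2 could be substituted), and the function names are all illustrative assumptions.

#include <math.h>
#include <stdio.h>

#define N_TOTAL 1024    /* total samples in the signal        */
#define FRAME   256     /* frame size N' (a power of 2)       */
#define HOP     128     /* hop size H; HOP < FRAME => overlap */

/* Magnitude of DFT bin k of one Hann-windowed frame (direct O(N'^2) DFT
 * for clarity; a radix-2 FFT would normally be used instead). */
static double frame_bin_magnitude(const double *frame, int k)
{
    const double pi = acos(-1.0);
    double re = 0.0, im = 0.0;
    for (int t = 0; t < FRAME; t++) {
        /* Hann window: w(t) = 0.5 * (1 - cos(2 pi t / (N' - 1))) */
        double w = 0.5 * (1.0 - cos(2.0 * pi * t / (FRAME - 1)));
        re += frame[t] * w * cos(2.0 * pi * k * t / FRAME);
        im -= frame[t] * w * sin(2.0 * pi * k * t / FRAME);
    }
    return sqrt(re * re + im * im);
}

int main(void)
{
    static double x[N_TOTAL];
    const double pi = acos(-1.0);
    for (int t = 0; t < N_TOTAL; t++)          /* toy input: a pure tone */
        x[t] = sin(2.0 * pi * 32.0 * t / N_TOTAL);

    /* One spectrogram column per hop: tau_m = m * HOP. */
    for (int m = 0; m * HOP + FRAME <= N_TOTAL; m++) {
        double mag = frame_bin_magnitude(&x[m * HOP], 8);   /* bin 8 of frame m */
        printf("frame %2d (tau = %4d samples): |X| = %f\n", m, m * HOP, mag);
    }
    return 0;
}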

8.4 Chapter summary<br />

A discrete-time Fourier transform (DTFT) is a fairly obsolete version of the DFT, taking an infinite time-domain discrete signal x[n] and transforming its frequency components to a continuum, the interval [0, 2π). It is given by the sum

X(\hat{\omega}) = \sum_{n=-\infty}^{\infty} x[n] \, e^{-i\hat{\omega}n}.

Because X(\hat{\omega}) is continuous, the inverse DTFT is the integral

x[n] = \frac{1}{2\pi} \int_0^{2\pi} X(\hat{\omega}) \, e^{i\hat{\omega}n} \, d\hat{\omega}.

The fast Fourier transform (FFT) and short-time (fast) Fourier transform (STFT) are efficient algorithms that reduce the computational complexity of the discrete Fourier transform (DFT) by sorting an input's terms according to the roots of unity, into log₂(N)-many groups. The most common version of the FFT is the radix-2 decimation in time (DIT) algorithm. Common roots of unity are factored out, as shown in the example where N = 8 above; because it recursively splits the problem into smaller DFTs in this way, the FFT is considered a "divide-and-conquer" algorithm. A requirement of the FFT is that N (the length of the sampled signal x_s(t)) must be a power of 2. If it is not, the signal is zero-padded, wherein zeros are tacked onto the end of the signal so that no new information is added but its length becomes a power of 2.

The STFT is the most useful Fourier transform to use when the frequency information of a signal changes over time. An STFT does not require its inputs to be discrete; it is specified in both continuous and discrete forms. It windows a signal with a windowing function like a rectangular window or Hanning window, and then takes a Fourier transform (continuous, discrete, or fast) of each window, indexing time. A windowing function is similar to an impulse function, but it is not instantaneous. When the interval of these windows is small, say 50 ms or 100 ms, the change in frequencies of a signal is best understood. A spectrogram can be produced from the results of the STFT, and can give graphical information about timbre/instrumentation, melody, and harmony in a piece of polyphonic music.


A. Frequency-selective circuits<br />

<strong>Signal</strong> processing is a class that every electrical engineering undergraduate<br />

takes, but rarely does it have a musical focus, or even mention.<br />

But as we saw in Chapter 4 on electrical guitar effect units, electric<br />

circuits certainly do have musical applications.<br />

However, it is important <strong>to</strong> note that digital signal processing—<br />

i.e., the frameworks within which we might compute a fast Fourier<br />

transform—is something different from signal processing in the electrical<br />

sense. Although discretized musical data concerns voltages, the<br />

signals passing through circuits are continuous <strong>and</strong> are therefore an<br />

exception <strong>to</strong> most of the techniques described in this book henceforth.<br />

That said, the construction of synthesizers <strong>and</strong> microcontrollers can be<br />

enlightening endeavors in<strong>to</strong> the science of sound, but they each require<br />

very distinct bodies of knowledge exclusive from the mechanics of the<br />

DFT.<br />

Digital filter design is outside of the scope of this book, but there are two important connections between analog and digital filters that this appendix will address. Assuming the systems are linear and time-invariant, the input and output voltages of analog filters may be transformed from the time domain to the frequency domain with the Laplace transform if the system is continuous, or the Z-transform if the system is discrete. These give us the transfer functions of a circuit, written

H(s) = \frac{V_o(s)}{V_i(s)}

for continuous complex frequencies s and

H(z) = \frac{Y(z)}{X(z)},



for discrete complex frequencies z. The functions V_o(s) and V_i(s) are the frequency responses of the continuous output and input voltage functions v_o(t) and v_i(t), respectively, and Y(z) and X(z) are the frequency responses of the discrete output and input voltage functions y(t) and x(t).

In electrical engineering, it is conventional to use the letter j for the imaginary number √−1 instead of i, to avoid confusion with current, which is written i(t). The variables s and z are both equal to jω, meaning they are defined on the complex plane, but again, s is continuous and z is discrete.

Digital filters are specified and analyzed using Z-transforms, while analog filters use Laplace transforms. A Z-transform is equivalent to a discrete-time Fourier transform (DTFT) when z = e^{jω}, i.e., the DTFT is a special case of the Z-transform. The Laplace transform maps an infinite, linear range of frequencies s, while the Z-transform maps a finite, circular range of frequencies z defined over an interval of size 2π. This circular range is sometimes thought of as a "wrapper" because the frequencies wrap around the unit circle over and over again.

Figure A.1: Fourier transforms (both continuous and discrete) and Z-transforms map frequencies of a finite domain, between −2π radians and 2π radians, while the Laplace transform has an infinite frequency domain from −∞ radians to ∞ radians, not confined to the unit circle. Positive frequencies are mapped in a counter-clockwise manner and negative frequencies in a clockwise manner.



A second connection exists between analog and digital domains with regard to filtering. Going from the infinite, linear s-plane to the finite, wrapped z-plane introduces distortions in the resulting transfer function that need to be mitigated. One technique that reduces these errors is blind deconvolution. This was the technique used by Soundstream in 1975 to remove the resonant frequencies of the gramophone from one of the first ever recordings, "Vesti la giubba" by the popular opera singer Enrico Caruso. Deconvolution is the inverse of convolution, and the process is "blind" because both sources (the resonance of the gramophone and the spectrum of the song) are unknown.

Before we get too far ahead of ourselves, we need to explore some of the fundamentals of electrical engineering. This appendix is meant for those curious to learn some of these basics, but the best way to do so is practice! Nilsson and Riedel's Electric Circuits is a nice text for those new to circuit analysis.
those new <strong>to</strong> circuit analysis.<br />

A.1 Ohm’s Law<br />

<strong>An</strong> electric circuit is defined as a closed loop that is connected <strong>to</strong> an<br />

energy source (like a battery) <strong>and</strong> a load (like a lamp). The overarching<br />

law in electrical engineering governing all of circuit design <strong>and</strong> their<br />

analysis is Ohm’s Law. It relates the resistance R of an electrified system<br />

<strong>to</strong> the voltage V applied <strong>to</strong> it (from a battery or other source of power)<br />

<strong>and</strong> the resulting current I running through it. Ohm’s law is expressed<br />

by the equation<br />

V = IR.<br />

Voltage (measured in volts, V) determines the flow of electricity, current<br />

(measured in amperes, A) is the amount of flow, <strong>and</strong> resistance<br />

(measured in ohms, Ω) increases or decreases flow. Voltage represents<br />

the amplitude of (musical) signals. Playing music loudly on a lap<strong>to</strong>p<br />

or MP3 player wears down the supply of the battery more quickly<br />

that playing it softly because it dem<strong>and</strong>s a higher flow of current.



Current can be either direct (DC), in which it travels in one direction, or alternating (AC), in which it flows in opposite directions in regular cycles, designated by the frequency of the AC. In North America this is typically 60 Hz, and in Europe, 50 Hz. Amusingly, a sort of "war" broke out between Nikola Tesla, who championed alternating current, and Thomas Edison, the champion of direct current, called the "Current Wars." Edison was protective of his success with direct current, so when Tesla introduced the idea of alternating current, Edison made the claim that AC was fatal. Today, we almost never use direct current for power distribution, because capacitors and inductors are such that they must have changing current to produce a voltage.

Now, frequency responses and Fourier-transformed spectra depict frequency versus amplitude, but the amplitude here is power, which is a function of voltage and current:

p = VI = \frac{V^2}{R} = I^2 R.

The terms power and energy are often used interchangeably because of their physical relationship. Power is actually the rate at which work is performed, the energy per unit time [26], but we hear it used in a sort of absolute way ("That is a powerful engine," "The president has a lot of power," etc.). Engineers seem to be very comfortable using them to refer to the same thing, but energy is the sum of power over time. The energy of a signal is given by

E = \sum_{t=0}^{\infty} |x(t)|^2 = \sum_{t=0}^{\infty} p(t).

This says that the total energy is equal to the sum of all of the power in the entire signal. Energy is measured in joules (J), where 1 joule is equal to 1 watt times 1 second, so one watt is equal to 1 J/s.

We compute the power p at a point in time t as

p(t) = |x(t)|^2.



This is the squared absolute value of x(t) because we do not want negative or imaginary components. The unit of power is the watt (W). The average power P is then

P = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} p(t).

Parseval's theorem from Chapter 7 uses these equalities to draw conclusions about the power and total energy of a spectrum.
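A minimal C sketch of these two formulas, using an illustrative test signal (the array and names are assumptions, not the book's code):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double pi = acos(-1.0);
    enum { T = 1000 };            /* number of samples considered */
    double x[T];

    for (int t = 0; t < T; t++)   /* toy signal: a unit-amplitude sine */
        x[t] = sin(2.0 * pi * 5.0 * t / T);

    double energy = 0.0;          /* E = sum of p(t) = sum of |x(t)|^2 */
    for (int t = 0; t < T; t++)
        energy += x[t] * x[t];

    double avg_power = energy / T;   /* P = (1/T) * sum of p(t) */

    printf("energy E  = %f\n", energy);      /* about T/2 = 500 for this sine */
    printf("average P = %f\n", avg_power);   /* about 0.5 for a unit sine     */
    return 0;
}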

Resistance is created by resistors (R), inductors (L), and capacitors (C) in RLC circuits. A circuit can be described in a time domain and a frequency domain, because we think of the applied voltage as a signal. The resistance of a resistor is simply R in both the time and frequency domains. The resistance of capacitors and inductors is similarly written C and L in the time domain, but in the frequency domain they become complex-valued. Complex resistance is called reactance X, where

X_C = \frac{1}{j\omega C} = -\frac{j}{\omega C}, \qquad X_L = j\omega L.

Let us consider the behavior of inductors and capacitors with respect to frequency. For large frequencies, i.e., ω → ∞, X_C (= 1/j∞C) approaches 0, while X_L (= j∞L) approaches infinity. For low frequencies, i.e., ω → 0, X_C (= 1/j0C) approaches infinity while X_L (= j0L) approaches 0. Now, current flows most easily through the parts of a circuit where the resistance is lowest, like a car in heavy traffic. When a device like a resistor, capacitor, or inductor has a nearly infinite amount of resistance, the current will approach 0 because it cannot flow through such high resistance. When current does not flow through a part of a circuit, we say that this part behaves like an open circuit, wherein the circuit is essentially broken, because the part might as well be physically removed if it does not allow current to flow.



Figure A.2: When the complex frequency s = jω goes to 0, the reactance 1/jωC goes to infinity, so no current flows through the capacitor (or the circuit, for that matter, because resistances in series are additive) and V_out = 0 volts. Therefore, only high frequencies produce an output voltage in this circuit, so we call it a high-pass filter.

Oppositely, when a device’s resistance is very small, it behaves like<br />

a short circuit, which is an electric wire with theoretically no resistance.<br />

Figure A.3: For the same circuit, when the complex frequency s = jω goes to positive infinity, the reactance 1/jωC goes to 0, so current flows freely as if there were no device at all, and V_out = V_in − V_R. Because V_R is the same regardless of frequency, we think of the ratio V_out/V_in as theoretically 1 for frequencies beyond some cutoff frequency, which is determined by the value of the capacitor.

In this circuit, signals containing low frequencies will pass minimal to no voltage through to the point designated by the + sign, but at high frequencies, the output voltage will be approximately equal to the input voltage. This is due to the behavior of the capacitor with respect to frequency. The circuit shown in Figures A.2 and A.3 is called a high-pass filter.
called high-pass filters.



A.2 Filtering<br />

We have already introduced the basic concept of filtering with respect to musical applications: filtering is a frequency-discriminating process by which some frequencies in a signal are kept and the others are attenuated. The holes of wind instruments act as bandpass filters because they only let a tiny range of frequencies pass, such that a configuration of opened and closed holes sounds like a single pitch. A mouth is a filter: no other instrument sounds exactly like a human voice. A room is a filter, producing standing waves corresponding to its resonant frequencies. Eventually, you may realize that every physical thing is a filter because it discriminates sound on a basis of frequency.

Filtering in electrical engineering is the process of frequency discrimination with respect to electrical circuits. Since both audio signals and AC circuits can be transformed to the frequency domain, the same filtering concepts can be applied to digital audio signals as to the signals involved in electric circuits. This is at the core of digital signal processing. Analog synthesizers do, however, make use of frequency-selective circuits like those in Figures A.2 and A.3, containing capacitors and inductors.

Transfer functions<br />

Transfer functions describe the ratio of the output voltage to the input voltage for a given frequency. Typically, the Laplace transform is used instead of the Fourier transform to convert a time-domain signal to a frequency-domain one. The Laplace transform is given by

\mathcal{L}\{x(t)\} = X(s) = \int_{-\infty}^{\infty} x(t) \, e^{-st} \, dt

where s is the complex frequency jω and \mathcal{L} denotes the Laplace transform. It is not unlike the continuous Fourier transform,

X(\omega) = \int_{-\infty}^{\infty} x(t) \, e^{-j\omega t} \, dt.

The Laplace transform is just as complicated an integral to compute as the Fourier transform, so most people prefer to memorize some of its general behavior for common functions. In the table below, the general Laplace transform is given for the most popular functions, where K is a constant, real value.

Function         x(t), t ≥ 0             X(s)
Impulse          Kδ(t)                   K
Step             K                       K/s
Ramp             Kt                      K/s²
Damped ramp      Kte^{−at}               K/(s+a)²
Exponential      Ke^{−at}                K/(s+a)
Sine             K sin(ωt)               Kω/(s²+ω²)
Damped sine      Ke^{−at} sin(ωt)        Kω/((s+a)²+ω²)
Cosine           K cos(ωt)               Ks/(s²+ω²)
Damped cosine    Ke^{−at} cos(ωt)        K(s+a)/((s+a)²+ω²)

Table A.1: Some common Laplace transforms.

A transfer function H is written

H(s) = \frac{V_{out}(s)}{V_{in}(s)},

i.e., the ratio of the output to the input frequency-domain voltages. We also call the transfer function the frequency response. It is the Laplace transform of an impulse response h(t), a system's output in response to the delta function. We use transfer functions, therefore, to describe the frequency response of musical instruments, electric circuits, and anything else that has a frequency-domain representation in addition to a time-domain one. We looked at the transfer functions of a violin and a trumpet in Chapter 4: the violin's transfer function had peaks at its air resonance and main wood resonance, and the trumpet's frequency response had a peak representing the length of its bore.



Figure A.4: The frequency response of an average (poor) violin.<br />

Figure A.5: The frequency response of a trumpet. The curve is smoothest where the<br />

energy leaks the most.<br />

A filter is designed to give preference to a selected range of frequencies so that the strength of those frequencies will be maintained when a signal passes through the filter, while all other frequencies will be attenuated to some degree. A transfer function, H(s) or H(jω), describes the behavior of a circuit with respect to complex frequency, s = jω. Therefore, all circuits are filters of some kind. A cutoff frequency ω_c defines where a filter changes from retaining a given frequency to attenuating it, or vice versa. Cutoff frequencies are located where



the magnitude of the transfer function equals 1/√2 of the maximum value of H, i.e., |H| = H_max/√2 (a drop of −3.01 dB).¹

Filtering of signals works just like the process of convolution reverb described in Chapter 5. We can get a filtered signal either by multiplying the two spectra or by convolving the two time-domain signals together. Convolving the signal of a simple sinusoid of frequency 1000 Hz with a filter that lets only low frequencies through, such as the one given in Figure A.6, would reduce the amplitude of this frequency by a factor of 1/√2, because at 1000 Hz this filter attenuates signals by −3.01 dB.

Figure A.6: The frequency response (magnitude and phase) of a low-pass filter.

¹ This is so because L_{dB SPL} = 20 \log_{10} L_{Intensity} = 20 \log_{10}\!\left(\frac{1}{\sqrt{2}}\right) = -3.01.



This filter is called a low-pass filter. The amplitude of the above plot is given by |H(ω)| to get rid of the complex components:

|H(\omega)| = \sqrt{\mathrm{Re}[H(j\omega)]^2 + \mathrm{Im}[H(j\omega)]^2},

so the amplitude is equal to the magnitude of the real (Re) and imaginary (Im) parts of H(s), making the substitution s = jω.

We can also calculate the phase φ of the frequencies in the transfer function, shown in the second graph of Figure A.6, as

\varphi(\omega) = 90^\circ - \tan^{-1}\!\left(\frac{\omega}{\omega_{pole}}\right).

We must know the locations of the poles (ω_pole) to do so; they are located where the transfer function's denominator is equal to zero.

A.3 The Z-transform<br />

When we want the frequency representation of a discrete, time-positive input, we compute a Z-transform instead of a Laplace transform. The Z-transform is used most often in signal processing when some discrete and infinite input signal x[t] and output signal y[t] of a system are given and we want to compute their frequency spectra, X(z) and Y(z). The transfer function H(z) is

H(z) = \frac{Y(z)}{X(z)}

where z is complex and H, Y, and X are all frequency responses. We can compute X(z) from a discrete input function x[t] by the Z-transform, which is defined as

\mathcal{Z}\{x[t]\} = X(z) = \sum_{t=0}^{\infty} x[t] \, z^{-t}

where the t are the time samples. Thus, z^{-k} can be thought of as the kth sampling instant, and it shifts a value with which it multiplies (like



x[t]) k-many samples to the right to get to its kth sample. When the input is infinite and when z = e^{jω}, we have the discrete-time Fourier transform,

X(z) = \sum_{t=0}^{\infty} x[t] \, z^{-t} = \sum_{t=0}^{\infty} x[t] \, e^{-j\omega t}.

So the DTFT is a special case of the Z-transform.

The inverse Z-transform is given by

\mathcal{Z}^{-1}\{X(z)\} = x[t] = \frac{1}{2\pi j} \oint X(z) \, z^{t-1} \, dz,

the closed path integral defined on some interval [a, a + 2π), where a is a constant. This a varies depending on the system.
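To see the "DTFT as a special case" relationship numerically, here is a minimal C sketch (illustrative, not from the text) that evaluates X(z) for a finite, time-positive signal both at a general point in the z-plane and at z = e^{jω} on the unit circle:

#include <complex.h>
#include <math.h>
#include <stdio.h>

/* X(z) = sum_{t=0}^{N-1} x[t] z^{-t} for a finite, time-positive signal. */
static double complex ztransform(const double *x, int N, double complex z)
{
    double complex X = 0.0;
    for (int t = 0; t < N; t++)
        X += x[t] * cpow(z, -t);
    return X;
}

int main(void)
{
    const double x[4] = {1.0, 0.5, 0.25, 0.125};   /* illustrative signal */
    const double pi = acos(-1.0);

    /* A general point in the z-plane... */
    double complex X1 = ztransform(x, 4, 2.0 + 0.0 * I);
    /* ...and a point on the unit circle, z = e^{j omega}: the DTFT at omega. */
    double omega = pi / 4.0;
    double complex X2 = ztransform(x, 4, cexp(I * omega));

    printf("X(2)          = %f %+fi\n", creal(X1), cimag(X1));
    printf("X(e^{j pi/4}) = %f %+fi\n", creal(X2), cimag(X2));
    return 0;
}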

Now let us look at the general form of filters and some musical applications that use them.

Low <strong>and</strong> high-pass filters<br />

A low-pass filter allows only frequencies less than a given cu<strong>to</strong>ff frequency<br />

from a signal <strong>to</strong> pass through it unaffectedly. Oppositely, a<br />

high-pass filter allows only the high frequencies above a given cu<strong>to</strong>ff<br />

frequency <strong>to</strong> pass. Therefore, the magnitude response of a low-pass<br />

filter with respect <strong>to</strong> frequency has the opposite shape of a high-pass<br />

filter: Its slope goes from constant <strong>to</strong> decreasing, while a high-pass<br />

filter goes from increasing <strong>to</strong> constant, as in Figure A.7. The magnitude<br />

response of a high-pass filter is graphically given in Figure A.8.<br />

The poles of a transfer function are given by the values that would<br />

make the denomina<strong>to</strong>r 0. In the above filters, the cu<strong>to</strong>ff frequency ω c<br />

is specified by the pole in the denomina<strong>to</strong>r where ω c equals the pole,<br />

so the cu<strong>to</strong>ff frequency is 1000 rad/s. The zeros of a transfer function<br />

specify where the numera<strong>to</strong>r is 0 <strong>and</strong> hence H(s) = 0. So, for a transfer<br />

function with m-many poles p <strong>and</strong> n-many zeros z,<br />

H(s) = (s + z 1) · (s + z 2 ) · . . . · (s + z n )<br />

(s + p 1 ) · (s + p 2 ) · . . . · (s + p m ) .



Figure A.7: The first-order low-pass filter given by the magnitude of the transfer function H(s) = 1000/(s + 1000). Its cutoff frequency is ω_c = 1000 rad/s.

Figure A.8: The first-order high-pass filter given by the magnitude of the transfer function H(s) = s/(s + 1000). Its cutoff frequency is also at 1000 rad/s.

The general form of a one-pole, one-zero low-pass filter is

H(s) = \frac{z_1}{s + p_1}.

Its magnitude is

|H(\omega)| = \frac{z_1}{\sqrt{\omega^2 + p_1^2}}.



The cutoff frequency is given by ω_c = p_1, where z_1 = p_1, the pole of H.² As ω goes to 0, |H(ω)| = z_1/p_1, so for low frequencies the amplitude of the transfer function is theoretically 1, meaning that the low frequencies are retained. As ω goes to infinity, |H(ω)| = z_1/\sqrt{\infty^2 + p_1^2} = 0, meaning that the amplitudes of high frequencies will be increasingly reduced.
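A minimal C sketch that evaluates these first-order magnitude responses and confirms the −3.01 dB point at the cutoff (the pole value and the frequency grid are illustrative assumptions):

#include <math.h>
#include <stdio.h>

/* |H(w)| for the one-pole low-pass  H(s) = p1 / (s + p1)  (with z1 = p1) */
static double lowpass_mag(double w, double p1)  { return p1 / sqrt(w * w + p1 * p1); }

/* |H(w)| for the one-pole high-pass H(s) = s / (s + p1) */
static double highpass_mag(double w, double p1) { return w  / sqrt(w * w + p1 * p1); }

int main(void)
{
    const double p1 = 1000.0;                 /* pole, and thus cutoff, in rad/s */
    const double w[] = {10.0, 100.0, 1000.0, 10000.0, 100000.0};

    for (int i = 0; i < 5; i++) {
        double lp = lowpass_mag(w[i], p1);
        double hp = highpass_mag(w[i], p1);
        printf("w = %8.0f rad/s  |H_lp| = %.3f (%6.2f dB)  |H_hp| = %.3f (%6.2f dB)\n",
               w[i], lp, 20.0 * log10(lp), hp, 20.0 * log10(hp));
    }
    /* At w = 1000 rad/s both magnitudes are 1/sqrt(2) = 0.707, i.e., -3.01 dB. */
    return 0;
}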

The most basic circuits representing low-pass filters are given in<br />

Figure A.9.<br />

Figure A.9: These series RL and RC circuits are the simplest representations of low-pass filters. The first has the transfer function H(s) = V_o(s)/V_i(s) = \frac{1/sC}{R + 1/sC} = \frac{1/RC}{s + 1/RC}. When |H(s)| = \frac{1}{\sqrt{2}} H_{max}, s is at the cutoff frequency, so this filter's cutoff frequency is ω_c = 1/RC. The second transfers its voltage by the equation H(s) = V_o(s)/V_i(s) = \frac{R}{sL + R} = \frac{R/L}{s + R/L}, so its cutoff frequency is ω_c = R/L.

The general form of a one-pole high-pass filter is

H(s) = \frac{s}{s + p_1}, \qquad |H(\omega)| = \frac{\omega}{\sqrt{\omega^2 + p_1^2}}.

Here, the cutoff frequency ω_c is once again where ω = p_1, the pole.³

The most basic circuit designs and corresponding transfer functions of high-pass filters are given in Figure A.10.

² To verify this, check that |H(p_1)| = 0.707: |H(p_1)| = \frac{p_1}{\sqrt{p_1^2 + p_1^2}} = \frac{p_1}{p_1\sqrt{2}} = \frac{1}{\sqrt{2}} = 0.707.

³ Check: |H(p_1)| = \frac{p_1}{\sqrt{p_1^2 + p_1^2}} = \frac{p_1}{\sqrt{2 p_1^2}} = \frac{1}{\sqrt{2}} = 0.707.



Figure A.10: The two simplest high-pass filters are, once again, series RL and RC circuits. The first has the transfer function H(s) = V_o(s)/V_i(s) = \frac{R}{R + 1/sC} = \frac{s}{s + 1/RC}; therefore, the cutoff frequency is ω_c = 1/RC. The second has the transfer function H(s) = V_o(s)/V_i(s) = \frac{sL}{sL + R} = \frac{s}{s + R/L}; therefore, its cutoff frequency is ω_c = R/L.

The <strong>to</strong>ne knobs on electric guitars logarithmically change the value<br />

of a capaci<strong>to</strong>r connected <strong>to</strong> each of the magnetic pickups. These capaci<strong>to</strong>rs<br />

act <strong>to</strong> transform the pickup in<strong>to</strong> a high-pass filter, reducing the<br />

treble in the guitar’s signal.<br />

Now, filters with two poles have the general form

H(s) = \frac{b_2 s^2 + b_1 s + b_0}{s^2 + a_1 s + a_0}.

This general transfer function describes second-order filters (two poles), while filters with one pole, as described above, are first-order filters. Different types of filters are determined by the b coefficients: when b_2 = b_1 = 0, we have a low-pass filter; when b_1 = b_0 = 0, we have a high-pass filter.

H(s) = \frac{b_0}{s^2 + a_1 s + a_0}, \quad \text{a low-pass filter}

H(s) = \frac{b_2 s^2}{s^2 + a_1 s + a_0}, \quad \text{a high-pass filter}

Their magnitude plots are very similar to the first-order ones. Pay attention to the values along the vertical axis to see the difference: the second-order filters have a steeper slope and thus greater attenuation of the undesired frequencies.



Figure A.11: The second-order low-pass filter given by the magnitude of the transfer function H(s) = 1000²/(s + 1000)².

Figure A.12: The second-order high-pass filter given by the magnitude of the transfer function H(s) = s²/(s + 1000)².

The cutoff angular frequencies are the same here, both 1000 rad/s. Now, when b_2 = b_0 = 0, we have a band-pass filter (Figs. A.13-15), and when only b_1 = 0, we have a band-stop filter (Figs. A.16-17).

B<strong>and</strong>-pass filtering <strong>and</strong> filter banks<br />

The transfer function of a b<strong>and</strong>-pass filter has one peak located at the<br />

center frequency ω 0 , <strong>and</strong> two cu<strong>to</strong>ff frequencies ω c1 <strong>and</strong> ω c2 that define<br />

the b<strong>and</strong>width β = ω c2 −ω c1 , again where |H(ω c1 )| = |H(ω c2 )| = Hmax √<br />

2<br />

.<br />

A b<strong>and</strong>-pass filter is au<strong>to</strong>matically a second-order filter, <strong>and</strong> its transfer


Appendix A 245<br />

function is generally given by<br />

H(s) =<br />

b 1 s<br />

s 2 + a 1 s + a 0<br />

i.e., b 0 <strong>and</strong> b 2 are both equal <strong>to</strong> zero. Here, the b<strong>and</strong>width β is equal<br />

<strong>to</strong> a 1 , <strong>and</strong> the center frequency ω 0 equals √ a 0 . A common way <strong>to</strong><br />

describe a b<strong>and</strong>-pass filter is by the quality Q, calculated from the ratio<br />

of the center frequency ω 0 <strong>to</strong> the b<strong>and</strong>width β,<br />

Q = ω 0<br />

β = ω 0<br />

ω c2 − ω c1<br />

.<br />

The magnitude plot of the frequency response of a b<strong>and</strong>-pass filter<br />

is given in Figure A.14. Note what happens at ω = 1000 rad/s: The<br />

graph peaks. At −3 dB, the two cu<strong>to</strong>ff frequencies can be found,<br />

because this is where |H(jω)| = |Hmax(jω) √<br />

2<br />

.<br />

Figure A.13: The magnitude plot of a band-pass filter with ω_0 = 1000 rad/s, given by |H(ω)| = 1000ω/(ω² + 1000ω + 1000²).
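A minimal C sketch (with illustrative coefficient values) that evaluates the magnitude of H(jω) for the band-pass form above on a frequency grid and locates the two −3 dB cutoff frequencies; it confirms numerically the claims that β = ω_c2 − ω_c1 = a_1 and ω_0 = √a_0:

#include <math.h>
#include <stdio.h>

/* |H(jw)| for H(s) = b1 s / (s^2 + a1 s + a0), evaluated at s = jw. */
static double bp_mag(double w, double b1, double a1, double a0)
{
    return b1 * w / sqrt((a0 - w * w) * (a0 - w * w) + a1 * a1 * w * w);
}

int main(void)
{
    const double b1 = 1000.0, a1 = 1000.0, a0 = 1000.0 * 1000.0;
    const double w0 = sqrt(a0);               /* center frequency, 1000 rad/s */
    const double hmax = bp_mag(w0, b1, a1, a0);
    const double target = hmax / sqrt(2.0);   /* the -3.01 dB level */

    double wc1 = 0.0, wc2 = 0.0;
    for (double w = 1.0; w <= 10000.0; w += 1.0) {   /* coarse 1 rad/s sweep */
        double m = bp_mag(w, b1, a1, a0);
        if (wc1 == 0.0 && m >= target) wc1 = w;      /* first point above the level */
        if (wc1 != 0.0 && m >= target) wc2 = w;      /* last point above the level  */
    }

    printf("peak |H| = %.3f at w0 = %.0f rad/s\n", hmax, w0);
    printf("wc1 = %.0f, wc2 = %.0f, bandwidth = %.0f rad/s, Q = %.2f\n",
           wc1, wc2, wc2 - wc1, w0 / (wc2 - wc1));
    return 0;
}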

The side holes in a wind instrument act as band-pass filters, moving the center frequency lower and higher as they lengthen and shorten the effective length of the bore, respectively. Each side hole's bandwidth is determined by the diameter of the hole, and a smaller hole means the bandwidth is smaller and the quality higher. A large Q means that the peak of the band-pass filter's transfer function is more intense.
intense.



Figure A.14: The magnitude plot of a band-pass filter with a high quality, given by the function |H(ω)| = 20ω/(ω² + 20ω + 1000²). The center frequency is 1000 rad/s and the bandwidth is 20 rad/s, so Q = 1000/20 = 50.

Figure A.15: The magnitude plot of a band-pass filter with a low quality, given by the function |H(ω)| = 5000ω/(ω² + 10,000ω + 1000²). The center frequency is 1000 rad/s and the bandwidth is 10,000 rad/s, so Q = 1000/10,000 = 0.1.

The opposite of a band-pass filter is a band-stop filter, also called a band-reject or notch filter. The general form of its transfer function is

H(s) = \frac{b_2 s^2 + b_0}{s^2 + a_1 s + a_0},

i.e., b_1 = 0. This kind of filter is used to reduce or eliminate the intensity of a given range of frequencies, specified by the bandwidth around the center frequency. The size of the bandwidth β is once again determined by Q, with the two cutoff frequencies ω_c1 and ω_c2 straddling the center frequency ω_0; furthermore, β is equal to the coefficient a_1 and ω_0 is \sqrt{a_0}.

Figure A.16: A band-stop filter with a high quality, given by the function H(ω) = (ω² + 1000²)/(ω² + 500ω + 1000²). The center frequency is 1000 rad/s and the bandwidth is 500 rad/s, so Q = 1000/500 = 2.

Figure A.17: A band-stop filter with a low quality, given by the function H(ω) = (ω² + 1000²)/(ω² + 10,000ω + 1000²). The center frequency is 1000 rad/s and the bandwidth is 10,000 rad/s, so Q = 1000/10,000 = 0.1.

The most basic circuits for designing band-pass and band-stop filters are given in Figure A.18. They are both called series RLC circuits. Including both inductors and capacitors in a circuit makes the frequency response behave similarly at the two frequency extremes, i.e., for band-pass filters, |H(j0)| = |H(j∞)| = 0 and |H(jω_0)| = 1, and for band-stop filters, |H(j0)| = |H(j∞)| = 1 and |H(jω_0)| = 0.
filters, |H(j0)| = |H(j∞)| =1<strong>and</strong> |H(jω 0 )| =0.



Figure A.18: The left circuit is a band-pass filter and the right a band-stop; both are series RLC circuits. The band-pass filter's transfer function is H(s) = V_o(s)/V_i(s) = \frac{R}{sL + R + 1/sC} = \frac{sR}{s^2 L + sR + 1/C} = \frac{s(R/L)}{s^2 + s(R/L) + 1/LC}, making its center frequency ω_0 = 1/\sqrt{LC}. The band-stop has the transfer function H(s) = V_o(s)/V_i(s) = \frac{sL + 1/sC}{sL + R + 1/sC} = \frac{s^2 L + 1/C}{s^2 L + sR + 1/C} = \frac{s^2 + 1/LC}{s^2 + s(R/L) + 1/LC}, so its center frequency is likewise ω_0 = 1/\sqrt{LC}.

In the description of phaser and flanger effects pedals in Chapter 4, the concept of filter banks was broached. Phasing (and flanging) is achieved by passing a signal through several filters simultaneously and summing each of their frequency responses with the original signal's frequency response. So, filters can be used individually as well as connected in series or parallel to achieve a great range of different sonic effects.

The <strong>to</strong>pic of filter banks also appears when we want <strong>to</strong> extract<br />

frequency-related information about music, especially for the purpose<br />

of music information retrieval (MIR). A common filter bank here would<br />

be one designed <strong>to</strong> extract the 12 notes of the scale <strong>to</strong> tell us when a<br />

specific note occurs in a song. These would be b<strong>and</strong>-pass filters with<br />

center frequencies scaled by 2 k/12 , where f 0 is the center frequency<br />

of the first b<strong>and</strong>-pass filter <strong>and</strong> 2 k/12 f 0 is the center frequency of the<br />

kth b<strong>and</strong>-pass filter. Other filter banks are useful for detecting instrumentation<br />

when they are designed <strong>to</strong> pick up an harmonic over<strong>to</strong>ne<br />

series, i.e., their center frequencies are spaced an octave apart from one<br />

another. Therefore, filter banks in music information retrieval can be<br />

used with respect <strong>to</strong> pitch, harmony, <strong>and</strong> timbre detection—wherever<br />

there is frequency information.
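As a small illustration of that semitone spacing, the Matlab fragment below lists the center frequencies of a 12-band filter bank; the starting frequency f_0 = 261.63 Hz (middle C) is only an example.

% Center frequencies of 12 band-pass filters spaced one semitone apart.
f0 = 261.63;            % example starting frequency (middle C), in Hz
k  = 0:11;              % filter index
fc = f0 * 2.^(k/12);    % center frequency of the kth filter
for i = 1:length(fc)
    fprintf('filter %2d: %7.2f Hz\n', k(i), fc(i));
end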


Figure A.19: 24 band-pass filters in parallel, forming a filter bank that spans 2 octaves.

A.4 Chapter summary

In this appendix, we reviewed the fundamentals of electrical engineering behind the behaviors of electrical and digital systems with respect to frequency. For continuous voltages, we can take a Laplace transform to compute the spectrum V(s) of a time-domain signal v(t). For discrete functions of voltage, we compute the Z-transform to see the frequency-domain representation X(z). In both cases, z and s are complex frequencies jω where j = √−1.

When we know the input and output voltages of a system over time, we can compute the transfer function H(s) or H(z). This is the proportion of the output voltage to the input voltage. The output voltage is the voltage over a load in a circuit and the input voltage is the voltage of the source like a battery (or the output of another, connected circuit). So, H(s) = V_o(s)/V_i(s).

Ohm's law is the overarching law of all electrical physics, and it states that V = IR, i.e., voltage is the product of current with resistance. Resistance generalizes to a complex-valued impedance, whose imaginary part is called the reactance. Reactance is a function of frequency, so circuits are frequency-selective and called filters.

A low-pass filter allows low frequencies to "pass through it" up to some cutoff frequency ω_0. A high-pass filter lets high frequencies pass while attenuating low ones. A band-pass filter is designated by a center frequency ω_c and a bandwidth β defining the range of frequencies that it allows to pass. A small bandwidth means a high quality of filter Q, where Q = ω_c/β. Finally, we defined a band-stop filter, the converse of a band-pass filter that allows everything but some range of frequencies to pass through it.

The zeros of the numerator of a transfer function are called simply zeros while the zeros of the denominator are poles. A first-order filter has one pole and a second-order filter has two. We can define a series of filters with a filter bank much like a time-domain windowing function in music information retrieval.


B. Using computers to do Fourier transforms

As you saw in Chapter 7, doing a discrete Fourier transform of size N = 8 is extremely cumbersome by hand. The software programs Matlab and Mathematica are great places to turn to do these tricky computations: Fourier transforms are built in to their function libraries. The following examples are all available for download from my website, http://numbersandnotes.com/.

B.1 Matlab

Matlab is a high-level language for technical computing, excelling at scientific computations like the fast Fourier transform. The Matlab syntax that performs an FFT is literally fft(), accepting an array of amplitude information with respect to time.¹ As we saw in Chapter 6, sound files have headers in addition to binary information, and this means that their data must be prepared at a low level (like in C). Fortunately, both Mathematica and Matlab have several built-in functions to prepare audio data for further analysis. In Matlab, I prefer to use wavread(), and in Mathematica, the function Import[]. These built-ins put the amplitude information in the correct format for analysis.

First, let's take a basic FFT. Then, we will give the code to produce the same spectrograms that we've seen before.

¹ The version of Matlab to which this information applies is Matlab 7.


Perform and plot the fast Fourier transform

%% Take the fast Fourier transform of a WAV file
%% and plot its power.
[x,fs]=wavread('Trumpet-01-mf-C5.wav');
N = length(x) %% number of points in an audio file is just its length
T = length(x)/fs %% define time of interval in seconds
t = [0:N-1]/N; %% define time instants
t = t*T; %% define time in seconds
p = abs(fft(x))/(N/2); %% absolute value of the fft;
%% we only need the first half of it, N/2
p = p(1:N/2).^2; %% take the power of first half of the freq's
freq = [0:N/2-1]/T; %% find the corresponding frequency in Hz
figure
plot(freq,p,'k')
axis([0 5000 0 0.012]) %% zoom in

This displays a plot of the frequency domain of x from 0 to 5000 Hz, and 0 to 0.012 relative power. Notice that the variable freq determines the frequency in Hz of the frequency components by dividing by T, which is set equal to N/f_s at the beginning. Therefore, the values [0:N/2-1] are our ω_k, k = 0, 1, ..., N/2 − 1. We only go up to the (N/2 − 1)th component because the second half of the output of the FFT is symmetric to the first half.
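To convince yourself of that symmetry, you can compare the top half of the FFT output with the complex conjugate of the bottom half. The check below is only a quick sketch; it assumes the number of samples N is even, and the difference it prints should be at the level of rounding error.

X = fft(x);
% For a real-valued signal, X(k) = conj(X(N-k+2)) for k = 2, ..., N/2
% (assuming N is even), so this maximum difference should be tiny.
err = max(abs(X(2:N/2) - conj(X(N:-1:N/2+2))))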

Display a spectrograph

Matlab offers a great amount of control for the visualization of data. Therefore, it is an ideal platform for displaying spectrographs (also called spectrograms). It actually has a built-in function for this (spectrogram()), but its results are often unreliable. Thus, I have provided the same code that was used to make all of the spectrographs in this book.

function [TF, freq, time] = spectro(x,secs,fftint,maxfreq)
%SPECTRO - Spectrogram of audio signal.
%   [TF, freq, time] = spectro(x, secs, fftint, maxfreq)
%   returns a series of short-time Fourier transforms
%   (STFT's), i.e., the frequency, amplitude, and time
%   for a given .wav file. The file must be in mono
%   (1-channel).
%   Inputs:
%     x:       a .wav file, entered as 'guitar.wav',
%              for example
%     secs:    the file's duration in seconds.
%              This can be calculated by length(file)
%              divided by its sampling frequency.
%     fftint:  the duration of the STFT in seconds.
%              This does not need to be a power of 2.
%              The file will be zero-padded.
%     maxfreq: the maximum frequency component of the
%              file. This should be 0.5*sampling
%              frequency to satisfy the Nyquist limit.
%
%   The signal x must be a .wav file, and the duration of x
%   specified must not be less than its actual duration.
%   Additionally, the duration of the short-time Fourier
%   transform interval (fftint) must be less than the total
%   duration. Finally, maxfreq must be no more than half
%   of the sampling frequency, typically 22050 Hz.
%
%   SPECTRO will return the spectrogram and surface of a given
%   real-valued signal. It works by partitioning the file
%   according to fftint, normalizing (zero-padding) the data
%   to fit the requirements for the FFT, and then taking the
%   short-time Fourier transform (STFT). A spectrogram is
%   a three-dimensional output wherein the abscissa is time
%   in seconds, the ordinate is frequency in Hz, and the
%   darkness of the color of a given point represents its
%   amplitude in decibels. So, a white point would not be
%   as loud as a gray or black point. Also outputted is the
%   surface of the spectrogram.

[m d]=wavfinfo(x); % this function reads the WAV header
[x,fs,nbits]=wavread(x);

if isempty(m)==1
    error('The specified file is not a .wav file.')
    x=0
end
if fs.*secs < length(x)
    error('The specified duration (secs) is less than the actual duration of the file.')
    secs=0
end
if fftint > secs
    error('The specified duration for the intervals of the FFT (fftint) is too large.')
    fftint=0
end
if maxfreq > fs/2
    error('The specified maximum frequency (maxfreq) is beyond the Nyquist sampling rate.')
    maxfreq=0
end

% partition the file to get windows of the signal for our
% spectrogram
partitionsize=fftint*fs;
% use a hanning window
window=hanning(partitionsize);
partitions=[1:partitionsize:length(x)-partitionsize];
Z=zeros(partitionsize,length(partitions)); % pad with zeros
for i=1:length(partitions)
    Z(1:partitionsize, i)= ...
        x(partitions(i):partitions(i)+partitionsize-1).*window;
end

% take the short-time Fourier transform (STFT)
STFT = fft(Z);

% take absolute value of each partition
if rem(partitionsize,2)==1
    k=(partitionsize+1)/2;
else
    k=partitionsize/2;
end
f=[0:k-1]*fs/partitionsize;
t=partitions/fs;

if nargout>0, TF=STFT; end
if nargout>1, freq=f; end
if nargout>2, time=t; end

% size of the STFT
maxSTFT=abs(STFT(2:partitionsize*maxfreq/fs,:));
% normalized so the max amplitude will be 0 dB
maxSTFT=maxSTFT/max(max(maxSTFT));

figure
% output the spectrogram with colors mapping the intensity
% in dB's
pcolor(t, f(2:partitionsize*maxfreq/fs), 20*log10(maxSTFT));
axis xy;
colormap(flipud(bone));
shading interp;
title('2D spectrogram of the signal')
xlabel('Time (seconds)')
ylabel('Frequency (Hz)')

figure
% surface function plots the output in 3D
surf(t, f(2:partitionsize*maxfreq/fs),20*log10(maxSTFT));
axis xy;
view(20,84);
colormap(flipud(bone));
shading interp;
title('3D spectrogram of the signal')
xlabel('Time (seconds)')
ylabel('Frequency (Hz)')
end

To use this function, simply enter the following information in a new script and run it. Be sure that both the WAVE file and spectro.m are in Matlab's file directory by going to File->Set Path... and adding their location to it. Don't forget to press "Save"!

[x,freqsamp]=wavread('guitar.wav');
t=0:1:length(x)-1;
figure
plot(t,x,'k') % plot the audio signal, in black
axis([0 length(x)-1 -1 1])
% automatically determine duration in seconds
seconds=length(x)/freqsamp+0.001;
% automatically determine nyquist frequency limit
nyquist=freqsamp/2;
spectro('guitar.wav',seconds,0.01,nyquist)

Running this script will result in three figures: the plot of the audio signal, the "2D" spectrogram of the signal (it is actually 3D, because there are three variables, but the graph is planar), and the 3D surface depiction of the spectrogram.

B.2 Mathematica

The following code is for executing the specification of the Fourier transform given in this text, in Mathematica 7. The Fourier[] and InverseFourier[] functions are inversely defined to how they have been given in this book: The exponents of e are positive in the built-in function Fourier[] and negative in the function InverseFourier[], i.e.,

X = Fourier[x] = (1/√N) ∑_{t=1}^{N} x(t) e^{i2π(k−1)(t−1)/N}

as opposed to the DFT we use:

F(x) = ∑_{t=0}^{N−1} x(t) e^{−i2πkt/N}.

Likewise, the IDFT given by Mathematica is

x = InverseFourier[X] = (1/√N) ∑_{k=1}^{N} X(k) e^{−i2π(k−1)(t−1)/N}

instead of

F^{−1}(X) = (1/N) ∑_{k=0}^{N−1} X(k) e^{i2πkt/N}.

Therefore, to compute the "engineer's DFT" (the version we've been using) and its inverse in Mathematica, we need to multiply by √N and use the functions oppositely:

X = Sqrt[Length[x]]*InverseFourier[x] = ∑_{t=1}^{N} x(t) e^{−i2π(k−1)(t−1)/N}

Note that the zeroth frequency component is given by X[1], not X[0]. The inverse engineer's DFT is then

x = 1/Sqrt[Length[X]]*Fourier[X] = (1/N) ∑_{k=1}^{N} X(k) e^{i2π(k−1)(t−1)/N}

Now let's import a song and perform an FFT on it in Mathematica. Type Directory[] to find Mathematica's current working directory (a folder on your computer), and put your song there—or use the command SetDirectory["dir"] to change the current directory. SetDirectory[$UserDocumentsDirectory], for example, sets the directory to your Documents folder on your computer.

Next, use the command Import["filename.wav"] to import a .wav file into Mathematica. For the sake of this example, we will use a two-channel .wav file of a short clip of rock music, imported into the variable rock.

Now enter

samples = rock[[1,1]];
left = samples[[1]];

This will give us a sampled sound list of the left channel for Fourier analysis. The semicolon at the end of the lines in both Mathematica and Matlab suppresses the output of the statement, which is ideal for the large arrays of audio data that we don't care to inspect.

Let us take the short-time discrete Fourier transform. First, we have to partition the data. Mathematica has the function Partition[] already built in. It is easy to define a function for reuse in Mathematica:

SoundPartition[x_, dftint_, fs_] :=
  Partition[x, Round[fs * dftint]]

So, this defines a function "SoundPartition[]" which accepts an array (x), short-time discrete Fourier transform size (dftint) in seconds, and sampling frequency (fs) and returns (length(x)/(dftint * fs))-many partitions. So, for a 2-second sound file and dftint = 0.1, we would get 20 partitions.

rockPart = SoundPartition[left, 0.1, 44100];
Table[ListPlot[Abs[Take[Sqrt[Length[rockPart[[k]]]]
  *InverseFourier[rockPart[[k]]], Length[rockPart[[k]]]/2]],
  Joined->True, PlotRange->All], {k, 1, Length[rockPart]}]

The first line uses our defined function to partition my array of the left channel, "left." Then, the STDFT is performed: For each partition k, we take the DFT of the partition, keep the first half of its output (the second half is redundant), find its absolute value, and graph it.

Figure B.1: The first 4 graphs of the STDFT. There are 19 0.1-second partitions made in total for the 1.98-second sample of music.


B.3 C

The languages of C and C++ have many downloadable libraries that work to accelerate the programming process. One of these libraries is the FFTW library, containing the FFT algorithm. The library aids the efficiency of the FFT, but working with it is tricky and not for beginners to C. To learn more, go to http://fftw.org/. Additionally, an approachable yet comprehensive resource for an introduction to music coding is The Audio Programming Book by Richard Boulanger and Victor Lazzarini.

Read in a WAVE file

This program will read in a file, check that it is a WAVE file, and store it in a matrix for further processing. To execute this file on a Mac, open Terminal (in Applications > Utilities) and type

pwd

This will give you your current directory, most likely your user folder. Drag the files wavefile.c and fft.c into this directory. Then type

gcc wavefile.c -o wavefile

This will create a "wavefile" application in the current directory. Finally, type into Terminal

./wavefile filename.wav

This line prints the header information of the file and translates the binary code to a float value between −1 and 1 (all real values), to represent the amplitude with respect to time. This is saved in a buffer file named realwave.dat. Here, filename.wav is the name of your WAVE file. Important: The file must be one-channel. It is easy to split stereo tracks and change them to "mono" in the free audio editing program, Audacity.
Audacity.


wavefile.c

/* This program will read in a WAV file for further
   manipulation. The file will save to the user's root
   folder in a file named "realwave.dat". */

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[])
{
    FILE * wavefile; /* Input wave file - .wav */
    FILE * outd;     /* Analyzed result in floats - realwave.dat */
    int i, fsize, sread, swrite, nbytes, rate, avgrate, csize,
        ibyte, smin, smax, savg, bad, nbread;
    short ccode, channels, blockalign, bps;
    char riff[4], data[4], sbyte, more[4], fmt[4], wave[4];

    printf("readwave.c executing \n");

    /* a .wav file name must be supplied on the command line */
    if(argc < 2)
    {
        printf("Usage: ./wavefile filename.wav \n");
        exit(1);
    }
    if((wavefile = fopen(argv[1], "rb")) == NULL)
    {
        printf("Cannot open input file %s \n", argv[1]);
        exit(1);
    }

    /* Read the first 44 bytes of the WAV file */
    printf("Reading WAVE Header information...\n");

    sread = fread(&riff[0], 1, 4, wavefile);
    printf("First 4 bytes of .wav file should say RIFF, File says: %c%c%c%c \n",
           riff[0],riff[1],riff[2],riff[3]);

    sread = fread(&fsize, 1, 4, wavefile);
    printf("File has %d +8 bytes \n", fsize);

    sread = fread(&wave[0], 1, 4, wavefile);
    printf("File should say WAVE, File says: %c%c%c%c \n",
           wave[0],wave[1],wave[2],wave[3]);

    sread = fread(&fmt[0], 1, 4, wavefile);
    printf("File should say fmt, File says: %c%c%c%c \n",
           fmt[0],fmt[1],fmt[2],fmt[3]);

    sread = fread(&nbytes, 1, 4, wavefile);
    printf("Block has %d bytes \n", nbytes);

    sread = fread(&ccode, 1, 2, wavefile);
    printf("Compression Code = %d \n", ccode);

    sread = fread(&channels, 1, 2, wavefile);
    printf("Number of Channels = %d \n", channels);

    sread = fread(&rate, 1, 4, wavefile);
    printf("Rate = %d \n", rate);

    sread = fread(&avgrate, 1, 4, wavefile);
    printf("Average Rate = %d \n", avgrate);

    sread = fread(&blockalign, 1, 2, wavefile);
    printf("Block Align = %d \n", blockalign);

    sread = fread(&bps, 1, 2, wavefile);
    printf("Bits per Sample = %d \n", bps);

    sread = fread(&data[0], 1, 4, wavefile);
    printf("File should say DATA, File says: %c%c%c%c \n",
           data[0],data[1],data[2],data[3]);

    sread = fread(&csize, 1, 4, wavefile);

    nbread = 44;
    bad = 0;
    savg = 0;
    printf("Begin analyzing sound file.\n");

    /* Convert the data chunk to floats between -1 and 1 and save
       them in realwave.dat (this loop assumes 16-bit mono samples). */
    outd = fopen("realwave.dat", "w");
    for(i=0; i<csize/2; i++)
    {
        short sample;
        sread = fread(&sample, 1, 2, wavefile);
        if(sread != 2) break;
        fprintf(outd, "%f\n", sample/32768.0);
    }
    nbread = nbread+csize;

    /* check for more chunks */
    while(1)
    {
        sread = fread(&more[0], 1, 4, wavefile);
        if(sread != 4) goto done; /* No more bytes to read */
        sread = fread(&csize, 1, 4, wavefile);
        if(sread != 4)
        {
            goto done;
        }
        for(i=0; i<csize; i++)
        {
            sread = fread(&sbyte, 1, 1, wavefile);
        }
        nbread = nbread+csize+8;
    }

done:
    fclose(outd);
    fclose(wavefile);
    return 0;
}

Perform an FFT on a WAVE file

After reading in a .wav file, you may perform an FFT on it with the following code. Note that an FFT of an entire file is fairly meaningless because music changes often; you may want to partition the file in Matlab or with an audio editor first.
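For example, the following Matlab sketch cuts a short segment out of a WAV file and writes it as a plain text column of floats, one value per line, which the fscanf() loop in fft.c will read; the file name segment.dat and the 0.1-second window starting at the beginning of the file are arbitrary choices.

% Cut a 0.1-second segment from a mono WAV file and save it as text
% so that the compiled FFT program below can read it.
[x,fs] = wavread('guitar.wav');   % any mono WAV file
seg = x(1:round(0.1*fs));         % first 0.1 seconds (arbitrary choice)
dlmwrite('segment.dat', seg);     % one float per line

Then, in Terminal, ./wavefft segment.dat prints the spectrum of just that segment.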

To execute the following program, go once again to Terminal in the Utilities folder, and type

gcc fft.c -o wavefft

This compiles fft.c into a "wavefft" application; running it on the data file produced by wavefile.c, i.e., ./wavefft realwave.dat, prints the results in the Terminal console. These results can be copied and pasted into Excel or R to be graphed (a line graph is recommended) [28], [63].

fft.c

/* Performs an FFT on a WAVE file that is saved in
   the root folder in the .dat format. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Definitions */
#define strchr index
#define length 32768 /* max pts in FFT: must be a power of 2 */
#define PI M_PI /* Pi defined to machine precision */
#define TWOPI (2.0*PI) /* 2 times Pi, used often */


void four1();
void realft();
double wsum;
char *pname;
FILE *ifile;
int m, n;
int cflag;
int decimation = 1;
int smooth = 1; /* Adjust this variable to scale FFT output */
static float *c;
double norm;

main(argc, argv)
int argc;
char *argv[];
{
    int i;
    char *prog_name();
    double atof();

    pname = prog_name(argv[0]);
    if (--argc < 1)
    {
        exit(1);
    }
    else if ((ifile = fopen(argv[argc], "rt")) == NULL)
    {
        fprintf(stderr, "%s: can't open %s\n", pname, argv[argc]);
        exit(2);
    }
    if ((c = (float *)calloc(length, sizeof(float))) == NULL)
    {
        fprintf(stderr, "%s: insufficient memory\n", pname);
        exit(2);
    }
    read_input();
    fft();
    fft_print();
    exit(0);
}

read_input()
{
    for (n = 0; n < length && fscanf(ifile, "%f", &c[n]) == 1; n++);
}

/* calculate forward FFT */
fft()
{
    int i;

    /* find the power of 2 (m) that holds the n input points */
    for (m = length; m >= n; m >>= 1);
    m <<= 1;
    for (i = n; i < m; i++) c[i] = 0.0; /* zero-pad up to m points */
    realft(c-1, m/2, 1);
    norm = 2.0/m; /* scale factor for the output magnitudes */
}

/* print the magnitude of each frequency component */
fft_print()
{
    int i, j;
    double pow;

    for (i = 0; i < m; i += 2*smooth*decimation)
    {
        for (j = 0, pow = 0.0; j < 2*smooth; j += 2)
        {
            pow += (c[i+j]*c[i+j] + c[i+j+1]*c[i+j+1])*norm*norm;
        }
        pow /= smooth/decimation;
        printf("%g", sqrt(pow)); /* Print FFT results */
        printf("\n");
    }
}

char *prog_name(s)
char *s;
{
    char *p = s + strlen(s);

    while (p >= s && *p != '/')
    {
        p--;
    }
    return (p+1);
}

void realft(data,n,isign)
float data[];
int n,isign;
{
    int i, i1, i2, i3, i4, n2p3;
    float c1 = 0.5, c2, h1r, h1i, h2r, h2i;
    double wr, wi, wpr, wpi, wtemp, theta;
    void four1();

    theta = PI/(double) n;
    if (isign == 1)
    {
        c2 = -0.5;
        four1(data, n, 1);
    }
    else
    {
        c2 = 0.5;
        theta = -theta;
    }
    wtemp = sin(0.5*theta);
    wpr = -2.0*wtemp*wtemp;
    wpi = sin(theta);
    wr = 1.0+wpr;
    wi = wpi;
    n2p3 = 2*n+3;
    for (i = 2; i <= n/2; i++) /* combine the two half-spectra */
    {
        i4 = 1 + (i3 = n2p3 - (i2 = 1 + (i1 = i + i - 1)));
        h1r = c1*(data[i1] + data[i3]);
        h1i = c1*(data[i2] - data[i4]);
        h2r = -c2*(data[i2] + data[i4]);
        h2i = c2*(data[i1] - data[i3]);
        data[i1] = h1r + wr*h2r - wi*h2i;
        data[i2] = h1i + wr*h2i + wi*h2r;
        data[i3] = h1r - wr*h2r + wi*h2i;
        data[i4] = -h1i + wr*h2i + wi*h2r;
        wr = (wtemp = wr)*wpr - wi*wpi + wr;
        wi = wi*wpr + wtemp*wpi + wi;
    }
    if (isign == 1)
    {
        data[1] = (h1r = data[1]) + data[2];
        data[2] = h1r - data[2];
    }
    else
    {
        data[1] = c1*((h1r = data[1]) + data[2]);
        data[2] = c1*(h1r - data[2]);
        four1(data, n, -1);
    }
}

void four1(data, nn, isign)
float data[];
int nn, isign;
{
    int n, mmax, m, j, istep, i;
    double wtemp, wr, wpr, wpi, wi, theta;
    float tempr, tempi;

    n = nn << 1;
    j = 1;
    for (i = 1; i < n; i += 2) /* bit-reversal reordering */
    {
        if (j > i)
        {
            tempr = data[j]; /* swap the two complex numbers */
            data[j] = data[i];
            data[i] = tempr;
            tempr = data[j+1];
            data[j+1] = data[i+1];
            data[i+1] = tempr;
        }
        m = n >> 1;
        while (m >= 2 && j > m)
        {
            j -= m;
            m >>= 1;
        }
        j += m;
    }
    mmax = 2;
    while (n > mmax) /* While loop executed log2 nn times */
    {
        istep = 2*mmax;
        theta = TWOPI/(isign*mmax); /* Trigonometric Recurrence */
        wtemp = sin(0.5*theta);
        wpr = -2.0*wtemp*wtemp;
        wpi = sin(theta);
        wr = 1.0;
        wi = 0.0;
        for (m = 1; m < mmax; m += 2)
        {
            for (i = m; i <= n; i += istep) /* Danielson-Lanczos formula */
            {
                j = i + mmax;
                tempr = wr*data[j] - wi*data[j+1];
                tempi = wr*data[j+1] + wi*data[j];
                data[j] = data[i] - tempr;
                data[j+1] = data[i+1] - tempi;
                data[i] += tempr;
                data[i+1] += tempi;
            }
            wr = (wtemp = wr)*wpr - wi*wpi + wr;
            wi = wi*wpr + wtemp*wpi + wi;
        }
        mmax = istep;
    }
}
for (i = m; i


References

[1] J. O. Pickles, An Introduction to the Physiology of Hearing. London: Academic Press, 2nd ed., 1988.

[2] J. R. Pierce, The Science of Musical Sound. New York: W. H. Freeman and Company, revised ed., 1996.

[3] D. J. Levitin, This Is Your Brain on Music. New York: Plume, 2006.

[4] G. Loy, Musimathics: The Mathematical Foundations of Music, Volume 1. Cambridge, MA: The MIT Press, 2006.

[5] D. R. Griffin, Listening in the dark: the acoustic orientation of bats and men. New Haven, CT: Yale University Press, 1958.

[6] U. of Salford, "Duck quack echo," accessed September 20, 2011.

[7] S. J. Jeans, Science & Music. New York: Dover, 1968.

[8] J. Beament, The Violin Explained. New York: Oxford University Press, USA, 2001.

[9] D. Halliday, R. Resnick, and J. Walker, Fundamentals of Physics. New York: Wiley, 9th ed., 2010.

[10] A. Wood, The Physics of Music. London: University Paperbacks, 1965.

[11] A. Schoenberg, Structural Functions of Harmony. London: Williams and Norgate Limited, 1954.

[12] J. Rayleigh and R. B. Lindsay, The Theory of Sound, Volume One. New York: Dover, unabridged second revised ed., 1945.


[13] R. Collecchia, The Entropy of Musical Classification. Portland, OR: Reed College, unpublished, May 2009.

[14] R. Plomp, "Timbre as a Multidimensional Attribute of Complex Tones," Frequency Analysis and Periodicity Detection in Hearing, ed. R. Plomp and G. Smoorenberg. Leiden: Sijthoff, 1970.

[15] T. D. Rossing, Science of String Instruments. Springer, 2010, pp. 130-132.

[16] C. M. Hutchins and D. Voskull,

[17] E. D. Blackham, "The physics of the piano," in Hutchins [29], pp. 24–33.

[18] C. M. Hutchins, "The physics of violins," in Hutchins [29], pp. 56–68.

[19] A. H. Benade, "The physics of brasses," in Hutchins [29], pp. 44–55.

[20] B. Hopkin, Musical Instrument Design: Practical Information for Instrument Design. Tucson, AZ: See Sharp Press, 1996.

[21] D. Deutsch, ed., The Psychology of Music. New York: Academic Press, 1982.

[22] J. Ayotte, I. Peretz, and K. Hyde, "Congenital amusia: A group study of adults afflicted with a music-specific disorder," Brain: A Journal of Neurology, vol. 125, January 2002.

[23] C. S. Sapp, "Wave pcm soundfile format," updated January 20, 2003; accessed September 13, 2011.

[24] A. E. Zonst, Understanding the FFT: A Tutorial on the Algorithm and Software for Laymen, Students, Technicians and Working Engineers. Titusville, FL: Citrus Press, 1995.

[25] L. R. Rabiner and C. M. Rader, eds., Digital Signal Processing. New York: The Institute of Electrical and Electronics Engineers, Inc., 1972.

[26] J. O. Smith, Mathematics of the Discrete Fourier Transform (DFT). http://www.w3k.org/books/: W3K Publishing, 2007.

[27] M. T. Heideman, D. H. Johnson, and C. S. Burrus, "Gauss and the history of the fast Fourier transform," IEEE ASSP Magazine, vol. 1, no. 4, pp. 14–21, 1984.

[28] J. W. Nilsson and S. A. Riedel, Electric Circuits. Boston: Prentice Hall, 9th ed., 2011.

[29] C. M. Hutchins, ed., The Physics of Music: Readings from Scientific American, (San Francisco, CA), W. H. Freeman and Company, 1978.

[30] E. Brattain-Morrin, Entropy, Computation, and Demons. Portland, OR: Reed College, unpublished, 2008.

[31] A. I. Khinchin, Mathematical Foundations of Information Theory. New York: Dover, 1957.

[32] S. Ross, A First Course in Probability. Upper Saddle River, NJ: Pearson Education, Inc., 7th ed., 2006.

[33] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press, 1998.

[34] C. Anderton, Electronic Projects for Musicians. New York: Amsco Publications, 1980.

[35] J. Johnson, Introduction to Digital Signal Processing. New Delhi, India: Prentice Hall of India, 1998.

[36] K. Lee, "Automatic Chord Recognition from Audio Using Enhanced Pitch Class Profile," Proceedings of International Computer Music Conference, 2006.


[37] C. Marven and G. Ewers, A Simple Approach to Digital Signal Processing. New York: John Wiley and Sons, Inc., 1996.

[38] L. R. Rabiner and B. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: PTR Prentice Hall, Inc., 1993.

[39] C. Roads, The Computer Music Tutorial. Cambridge, MA: The MIT Press, 1996.

[40] C. B. Rorabaugh, DSP Primer. New York: McGraw-Hill, 1999.

[41] J. O. Smith, Physical Audio Signal Processing. http://ccrma.stanford.edu/~jos/pasp/: online book, accessed 2011.

[42] J. O. Smith, Introduction to Digital Filters with Audio Applications. http://www.w3k.org/books/: W3K Publishing, 2007.

[43] F. A. Saunders, "Physics and Music," [29], pp. 6–15.

[44] A. H. Benade, "The Physics of Wood Winds," The Physics of Music: Readings from Scientific American, (San Francisco, CA), W. H. Freeman and Company, 1978, pp. 34–43.

[45] J. C. Schelleng, "The Physics of the Bowed String," The Physics of Music: Readings from Scientific American, (San Francisco, CA), W. H. Freeman and Company, 1978, pp. 69–77.

[46] V. O. Knudsen, "Architectural Acoustics," The Physics of Music: Readings from Scientific American, (San Francisco, CA), W. H. Freeman and Company, 1978, pp. 78–92.

[47] H. F. Olson, Music, physics and engineering. New York: Dover, 1967.

[48] G. A. Gescheider, Psychophysics: The Fundamentals. New York: Psychology Press, 1997.

[49] A. H. Benade, Fundamentals of Musical Acoustics. New York: Dover, Second Revised Ed., 1990.

[50] H. v. Helmholtz, On the Sensations of Tone. New York: Dover, 1954.

[51] F. Lerdahl and R. Jackendoff, A Generative Theory of Tonal Music. Cambridge, MA: The MIT Press, 1983.

[52] S. Isacoff, Temperament. New York: Alfred A. Knopf, 2001.

[53] D. Albright, Modernism and Music. Chicago: The University of Chicago Press, 2004.

[54] D. C. Miller, Anecdotal History of the Science of Sound: To the Beginning of the 20th Century. New York: The Macmillan Company, 1935.

[55] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA: The MIT Press, 2nd ed., 2002.

[56] P. A. Fuchs, A. Rees, C. Plack, and A. Palmer, The Oxford Handbook of Auditory Science: Hearing. New York: Oxford University Press, USA, 2010.

[57] G. Martino and L. E. Marks, "Synesthesia: Strong and Weak," Current Directions in Psychological Science, vol. 10, no. 2, April 2001.

[58] E. Zwicker and R. Feldtkeller, "On the Derivation of Critical Bands from the Loudness of Complex Sounds," Acustica 5, 1955, pp. 40-45.

[59] I. Peretz, L. Gagnon, S. Hébert and J. Macoir, "Singing in the Brain: Insights from Cognitive Neuropsychology," Music Perception: An Interdisciplinary Journal, vol. 21, no. 3, Spring 2004, pp. 373–390.

[60] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematical Computation, vol. 19, 1965, pp. 297–301.

[61] G. C. Danielson and C. Lanczos, "Some improvements in practical Fourier analysis and their application to X-ray scattering from liquids," J. Franklin Institute, vol. 233, 1942, pp. 365–380 and 435–452.

[62] C. F. Gauss, "Nachlass: Theoria interpolationis methodo nova tractata," Werke, vol. 3, 2011, pp. 265–327.

[63] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge, UK: Cambridge University Press, 2nd ed., 1992, pp. 504–510.

[64] F. Moavenzadeh, Concise Encyclopedia of Building and Construction Materials. Cambridge, MA: The MIT Press, 1990.

[65] C. R. Nave, "HyperPhysics Concepts: Sound and Hearing." http://hyperphysics.phy-astr.gsu.edu/hbase/sound/soucon.html#soucon: 2010, accessed September 13, 2011.

[66] J. O. Smith, Spectral Audio Signal Processing, October 2008 Draft. http://ccrma.stanford.edu/~jos/sasp/: online book, accessed September 13, 2011.

[67] C. Chen, Signals and Systems. New York: Oxford University Press, Third Ed., 2004.

[68] M. McLuhan and Q. Fiore, the medium is the MASSAGE: An Inventory of Effects. Corte Madera, CA: Gingko Press, 2001.

[69] B. D. Storey, "Computing Fourier Series and Power Spectrum with MATLAB." http://faculty.olin.edu/bstorey/Notes/Fourier.pdf: accessed September 13, 2011.

[70] D. H. Whalen, E. R. Wiley, P. E. Rubin, and F. S. Cooper, "The Haskins Laboratories' pulse code modulation (PCM) system," Behavior Research Methods, Instruments, & Computers, vol. 22, no. 5, 1990, pp. 550–559.


[71] P. Belt, The New Grove Musical Instrument Series: The Piano. W. W. Norton & Co., Inc., 1988.

[72] H. Partch, Genesis of a Music. New York: Da Capo Press, 1974.

[73] M. Enright, "A comparison of Western and Eastern music modes and tone production." http://www.kentuckybellydance.com/BabaYagaMusic/Makamsand-Cents.htm, accessed September 13, 2011.

[74] D. A. Russell, "Acoustics and Vibration Animations." http://www.kettering.edu/physics/drussell/demos.html, accessed November 21, 2011.

[75] J. Wolfe, "Chladni patterns for violin plates." http://www.phys.unsw.edu.au/jw/chladni.html, accessed September 13, 2011.

[76] W. E. Worman and A. H. Benade, "Oscillations in Clarinet-like Systems: A Status Report." https://ccrma.stanford.edu/marl/CASL/Files/benade/Benade-ClarinetSystems-1969.pdf: Preliminary report/unpublished, April 1969.

[77] P. Weiss and R. Taruskin, Music in the Western World: A History in Documents. Thomason Shirmer, 2nd ed., 1984.

[78] T. Christensen, The Cambridge history of Western music theory. Cambridge, UK: Cambridge University Press, 2002.

[79] Hesiod, S. Lombardo, and R. Lamberton, Works & Days and Theogony. Hackett Publishing Company, 1993.

[80] B. C. J. Moore, An Introduction to the Psychology of Hearing. Bingley, UK: Emerald Group Publishing Ltd., 5th ed., 2003.

[81] A. Lalwani, Current Diagnosis & Treatment in Otolaryngology—Head and Neck Surgery. McGraw-Hill Medical, 2nd ed., 2007.

[82] P. Marler and H. W. Slabbekoorn, Nature's music: The science of birdsong. Academic Press, vol. 1, 2004.

[83] H. S. Howe, Jr., Electronic Music Synthesis: Concepts, Facilities, Techniques. W. W. Norton & Company, Inc., 1975.

[84] J. Wolfe, "Physics in Speech." http://phys.unsw.edu.au/phys_about/PHYSICS!/SPEECH_HELIUM/speech.html: Published 2005, accessed November 21, 2011.

[85] G. P. Scavone, "Percussion Instruments." https://ccrma.stanford.edu/CCRMA/Courses/152/percussion.html: Published 1999, accessed November 21, 2011.

[86] N. Roe, "La Monte Young's Drugless Trip, Dream House, Is Back in Business." http://www.mapcidy.com/q=node/325: published September 23, 2009, accessed November 21, 2011.

[87] R. S. Heffner and H. E. Heffner, "Sound localization and use of binaural cues by the gerbil (Meriones unguiculatus)," Behavioral Neuroscience, vol. 102, no. 3, June 1988, pp. 422–428.

[88] J. Blauert and P. Laws, "Group Delay Distortions in Electroacoustical Systems," Journal of the Acoustical Society of America, vol. 63, no. 5, May 1978, pp. 1478–1483.

[89] J. Blauert, Spatial hearing: the psychophysics of human sound localization. Cambridge, MA: MIT Press, 1983.

[90] L. A. Jeffress, "A place theory of sound localization," Journal of Comparative and Physiological Psychology, vol. 41, 1948, pp. 35–39.

[91] "Vibrational Modes of Drums." http://www.soundphysics.com/Drum-Vibrational-Modes/: 2005, accessed December 29, 2011.

[92] J. M. Pearce, "Clinical features of the exploding head syndrome," Journal of Neurology, Neurosurgery, and Psychiatry, vol. 52, no. 7, 1989, pp. 907–910.

[93] "Synesthesia." http://en.wikipedia.org/wiki/Synesthesia: Accessed December 29, 2011.

[94] J. C. Thomas, "About the Piano." http://www.thomaspianotuning.com/AboutthePiano.html: accessed January 24, 2011.

[95] G. Loy, Musimathics: The Mathematical Foundations of Music, Volume 2. Cambridge, MA: The MIT Press, 2007.

[96] Numerical Recipes in C: The Art of Scientific Computing. Cambridge, UK: Cambridge University Press, 1992.

[97] R. Boulanger and V. Lazzarini, The Audio Programming Book. Cambridge, MA: The MIT Press, 2010.

[98] C. M. Hutchins and D. Voskuil, "Mode tuning for the violin maker," CAS Journal, vol. 2, no. 4, November 1993, pp. 5–9.

[99] D. Knight, "Drum Head Vibrations." http://www.snarescience.com/articles/drum-headvibration.php: Published 2011, accessed January 2, 2012.


Glossary

absolute value Distance from a point (a, b) to the point (0, 0), given by √(a^2 + b^2) where a and b are real numbers. Also referred to as the magnitude.

action (1) All of the mechanisms required to cause a system to vibrate. (2) In a guitar, the distance between the strings and the fretboard.

action potential An electrical firing in a neuron or other excitable cell with a sharp rise and fall, similar to an impulse.

ADSR envelope The shape of a signal's overall amplitude. This signal is usually something sudden like the strike of a drum or strum of guitar. These envelopes are frequently seen on analog synthesizers.

algorithm A series of instructions that execute some desired function.

aliasing The incorrect naming of a frequency due to undersampling.

all-pass filter A filter that allows all frequencies in a signal to pass through it but affects their phase, as found in phaser and flanger pedals.

amplitude (1) The height of a wave at a given time. (2) The overall strength of a given sinusoid, given by A in the expression A sin(ωt + φ).

amplitude modulation The periodic alteration of amplitude from a reference amplitude, also called tremolo.

angular frequency A number describing how often something (like a sine wave) makes one revolution around the unit circle, written with the Greek letter omega (ω) and equal to 2πf where f is the ordinary frequency. Its unit is radians per second (rad/s). See frequency.

antialiasing filter A low-pass filter used to avoid aliasing. Its cutoff frequency should be less than or equal to half the sampling frequency f_s, and it should be applied before sampling.


antiderivative The area under the curve of a function, also called the integral. A function must be continuous to have an antiderivative.

antinode Location in a mode of vibration where motion is maximal during vibration.

anvil (incus) Bone in the ear's ossicles that connects the hammer and the stirrup.

apex (apical end) Refers to the end of the basilar membrane farthest from the oval and round windows. The basilar membrane is widest and minimally stiff at the apex. Low frequencies stimulate the apical end.

attack The onset of a signal; the behavior of the amplitude envelope as it goes from 0 to some maximum value. In an ADSR envelope, attack is the "A."

attenuation Killing or multiplying an amplitude by some number between 0 and 1 (ideally 0) to reduce its amplitude.

auditory canal Tube that runs from the outer to middle ear, extending from the pinnae to the eardrums.

band-limiting Passing a signal through a [band-pass] filter of some bandwidth, hence restricting the frequencies in the signal to the frequencies of the filter.

band-pass filter A filter that allows only an interval of frequencies to pass through it and attenuates the rest, centered around some center frequency ω_0. The interval of frequencies [ω_c1, ω_c2] (called the passband) defines the bandwidth β of the filter.

band-stop filter A filter that does not allow an interval of frequencies to pass through it while letting frequencies outside of the interval pass; this interval is centered at ω_0 and called the stopband.

bandwidth The maximum frequency component of a signal or connection rate.

basal end The thin, stiff end of the basilar membrane that is suspended in fluid. Higher frequencies activate the basilar membrane towards its basal end.


basilar membrane A membrane inside of the cochlea that vibrates at locations according to frequency.

basis Series of vectors that are linearly independent of one another and each define a new dimension, i.e., a basis of size N defines an N-dimensional space.

beating A psychoacoustic, unpleasant phenomenon that occurs when two sine waves close in frequency sound simultaneously, and their small difference tone is heard.

binary A language containing only two symbols, 0 and 1, representing numbers in base-2. All digital information is in binary.

bit A binary digit.

bit depth The maximum length of a string of binary digits. A bit depth of 16, for example, would mean the largest value would be 2^16 − 1 = 65535, so each sample could take on 2^16 = 65536 different values.

bit rate The number of bits that are processed per unit of time. For audio, this is typically given in kilobits per second (kbps) and equal to the sampling frequency times the number of channels times the bit depth.

bulk modulus A substance's resistance to uniform compression, given in pascals (Pa).

byte A byte is an unstandardized power of 2 bits, but in this text, it is 8 bits.

cancelation The perfectly destructive interference of two or more waves. The waves must have identical frequencies and be 180° out of phase with one another in order for cancelation to occur.

carrier frequency The frequency f_c to be modulated in FM synthesis by a modulation frequency f_m.

cilia Hairs along the basilar membrane.

circular modes The modes of vibration that share the same center as the center of the instrument but may differ in size (radius). These may be circular or elliptical/oval in shape.


clipping (1) A discontinuity in a time-domain signal that causes a digital-analog converter and hence speaker to "clip." (2) Chopping off the tops and bottoms of a sine wave to become more like a square wave to produce the effect of distortion. (3) Specifying an amplitude too great for a digital-analog converter and producing undesired distortions.

closed path integral Denoted by the syntax "∮", a closed path integral is defined over some interval of a definite size but variable or unknown endpoints. Only used for complex functions. Not to be confused with a line integral. Also called a contour integral.

cochlea Spiral-shaped cavity in the inner ear filled with fluid and containing the basilar membrane and Organ of Corti.

cocktail effect The psychoacoustic phenomenon that allows an observer to receive a signal in a noisy environment if faced straight on towards that signal, such that both ears receive virtually identical phase at identical times in the signal's waveform, i.e., the signal received by the left ear is completely in phase with the signal received by the right ear.

codec Short for "coder-decoder," this is where the encoding and decoding of a signal takes place.

combination tone The psychoacoustic phenomenon of an audible sine wave with frequency equal to the sum of two or more other frequencies present in a signal. Opposite: Difference tone.

complex plane Two-dimensional plane with real numbers on the horizontal axis and imaginary numbers on the vertical axis.

compression (1) Region of high pressure and particle density, depicted by the crests in a waveform. (2) Process of reducing file size using algorithms. (3) Process of limiting the quantity of values that a musical signal can take on, affecting its volume (note that this does not say anything about the maximum and minimum values, i.e., the range, of the volume). Putting a digital file through a compressor typically makes the quiet parts louder and the loud parts quieter. Opposite: Decompression.


computational complexity The number of computations (additions, multiplications) involved in an algorithm, determining its theoretical execution time.

consonance The physical and psychophysical agreement of two or more pitches due to low integer ratios, considered pleasant or euphonious.

constructive interference The case when two waveforms combine to produce a waveform of greater amplitude than the amplitude of either of the original waveforms.

continuous Fourier transform (FT) Integral that transforms a continuous, time-domain signal into a continuous, frequency-domain spectrum, given by

X(ω) = ∫_{−∞}^{∞} x(t) e^{−iωt} dt.

convolution Binary operation denoted by the syntax "∗". The convolution of two continuous functions x(t) and y(t) is the integral

x(t) ∗ y(t) = ∫_0^t x(s) y(t − s) ds,

and the convolution of two discrete functions x[t] and y[t] of length N is the sum

x[t] ∗ y[t] = ∑_{s=0}^{N−1} x[s] y[t − s],

where s is a number such that x and y do not intersect on the same axis when y is shifted to the left by s and flipped vertically. Also called the cyclic convolution.

convolution reverb The convolution of an impulse response of a room with a musical signal to make the music sound as if it were recorded within that room.

crest Location in a wave where pressure is maximal. Opposite: Trough.

critical band The bandwidth β_c beyond which we perceive band-limited (by β_c) sound to have more energy than it physically does.


cut-off frequency The frequency or set of frequencies in a filter at which the magnitude response is −3 dB and attenuation begins.

DC offset The value of the spectrum at k = 0, i.e., X(0), which quantifies the amount of direct (constant) current in a signal.

decay In an ADSR envelope, decay is the "D," describing the overall magnitude of a signal as it decreases from some maximum value.

decibel (dB) Logarithmic unit of the ratio of power or intensity to a reference power or intensity; one tenth of a bel (B). Two signals differing in power by one decibel have a power ratio of 10^(1/10) ≈ 1.26 and an amplitude (intensity) ratio of (√10)^(1/10) ≈ 1.12. Not to be confused with sound pressure level (dB SPL).

delay line A filter that causes a feedback or feedforward loop in an electrical system, such as a comb filter.

derivative The rate of change of a function, defining the slope of the tangent line to the function at any time. A function must be continuous and have no sharp edges or turns (such as x(t) = |t|, the absolute value of t) to be differentiable.

destructive interference The case when two waveforms combine to produce a waveform of lesser amplitude than the amplitude of either of the original waveforms.

diatonic scale The seven notes of the major or minor scale.

difference tone Psychoacoustic phenomenon of an audible sine wave with frequency f_3 resulting from the difference of two simultaneous frequencies, f_1 and f_2, such that f_3 = |f_1 − f_2|. Heard as beating when f_3 is less than approximately 10 Hz. Will supplant the fundamental frequency when it is removed from a harmonic overtone series, i.e., if the frequencies 200, 300, and 400 Hz are sounded, a difference tone of 100 Hz will also be heard. Opposite: Combination tone.

diffraction The change in wave motion when an area of different impedance is encountered.


Appendix B 289<br />

digital signal processing Field of electrical engineering that represents and seeks to manipulate discrete-time inputs in linear, time-invariant systems. Also concerned with the measurement, filtering, and compression of analog signals.

Dirac delta A continuous impulse, defined conditionally as

    δ(t) = ∞ when t = 0, and 0 otherwise.

The global integral of the Dirac delta is exactly 1.

discrete Fourier transform (DFT) Sum that transforms a discrete, time-domain signal of length N into a discrete, frequency-domain spectrum also of length N, given by

    X(k) = \sum_{t=0}^{N-1} x(t) e^{-i 2πkt/N}.
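The sum translates directly into a short program. A minimal MATLAB sketch of the naive DFT follows (an illustration, not the book's own listing); note that MATLAB arrays are 1-indexed while k and t run from 0 to N − 1, and the example input is an assumption.

    % Naive O(N^2) DFT of a signal x, following the definition above.
    x = [1 2 3 4 5 6 7 8];                 % example input (assumed values)
    N = length(x);
    X = zeros(1, N);
    for k = 0:N-1
        for t = 0:N-1
            X(k+1) = X(k+1) + x(t+1) * exp(-1i*2*pi*k*t/N);
        end
    end
    % max(abs(X - fft(x))) is on the order of machine epsilon.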

discrete-time Fourier transform (DTFT) Sum that transforms a discrete, time-domain signal of infinite length into a continuous, frequency-domain spectrum of length 2π, given by

    X(ω̂) = \sum_{t=-∞}^{∞} x[t] e^{-iω̂t},

for normalized frequencies ω̂ in the interval [0, 2π).

dissonance The physical and psychophysical disagreement of two or more pitches due to high integer ratios, considered unpleasant.

domain The set of input values over which a function is defined. A domain can either be continuous or discrete. In the signal x(t), the set of t represents the domain of x.

Doppler effect The relationship between frequency and the movement of a sound source with respect to an observer, given by

    f_o = ((c + v_o) / (c + v_s)) f_s,



where f_o is the observed frequency, f_s is the frequency of the source, v_o is the speed of the observer (positive if moving towards the source), v_s is the speed of the sound source (positive if moving away from the observer), and c is the speed of sound.
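As a worked example (the numeric values are assumptions, not taken from the text), the MATLAB lines below evaluate the formula for a stationary 440 Hz source and an observer approaching it at 10 m/s.

    % Doppler shift for a stationary source and an approaching observer.
    c   = 343;    % speed of sound in air, m/s (assumed)
    f_s = 440;    % source frequency, Hz
    v_o = 10;     % observer speed, positive toward the source
    v_s = 0;      % source speed
    f_o = (c + v_o) / (c + v_s) * f_s    % about 452.8 Hz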

eardrum (tympanum) Membrane separating the outer and middle ear that is disturbed when the pressure in the outer ear changes. Disturbance in the eardrum is required for the perception of sound.

echolocation The technique used by animals such as bats to identify the distance to objects by measuring the time it takes for a signal to echo back towards the observer.

effective length The length that a vibrating mechanism such as a fixed string or column of air effectively has when we account for its physical nature, such as string density µ or holes in a bore. This is the wavelength of the fundamental produced on an ideal (massless, infinitesimally thin, infinitely tense) string.

endianness Adjective describing the direction in which the bytes of a binary value are stored into memory. Little endian means the least significant byte is stored first; big endian stores the most significant byte first, the order in which we typically write numbers.

endolymph Fluid inside of the scala media in the cochlea. Reissner's membrane and the basilar membrane separate it from the perilymph in the scala vestibuli and scala tympani. Its ionic composition is different from that of perilymph and they work together to create electrochemical impulses.

energy Integral of power over time; used to do work. Unit is the joule (J).

enharmonic equivalent Two notes with the same frequency but different note names, such as F♯ and G♭. They function differently in transcribed music but sound identical (in equal temperament only).

equal temperament Scale of 12 tones with equal spacing between tones. A tone (f_1) that is k half steps above another tone (f_0) will have frequency f_1 = 2^{k/12} f_0. When this tone is below f_0, k is negative. Equal temperament enables perfect transposition to other keys on an instrument, but



lacks the integer-based consonance of Pythagorean temperament and just intonation.
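A quick MATLAB check of the semitone formula (an illustrative sketch; the A4 reference pitch is an assumption):

    % Equal-tempered frequencies k half steps away from a reference f0.
    f0 = 440;                      % A4 as the reference (assumed)
    k  = [-12 -1 0 1 12];          % an octave down, a semitone down, etc.
    f1 = 2.^(k/12) * f0            % 220, 415.3, 440, 466.2, 880 Hz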

Euler's formula The formula e^{ix} = cos(x) + i sin(x). Also, e^{-ix} = cos(x) − i sin(x).

Euler's identity The equation e^{iπ} + 1 = 0.

factorial Designated by the syntax "!" in mathematics, the factorial of n is n! = n · (n − 1) · (n − 2) · . . . · 1.

filter A frequency-discriminating system. This can be virtually any physical object, and it can be modeled by an electric circuit and its corresponding transfer function.

filter bank A series of filters, usually linearly or logarithmically spaced and with the purpose of retrieving the notes of some scale.

FIR filter Shortening of "finite impulse response filter," meaning a filter with an impulse response of finite duration because it is completely zero after some point in time, as opposed to an infinite impulse response (IIR) filter.

Fourier series The method of approximating a periodic function by a sum of sine waves as devised by Jean Baptiste Joseph Fourier. The DFT is proportional to the coefficients of the Fourier series.

frame One "slice" of a windowing function. We call its size N′, equal to N/M for M-many frames. Also called a window.

frequency A number defining how often something (like a sine wave) repeats itself, inversely proportional to the time something takes to repeat itself, i.e., f = 1/T. Also called the ordinary frequency. Its unit is in hertz (Hz) or the inverted second (s^{-1}).

frequency bin The indexing system for the frequency components of the discrete Fourier transform, labeled by the integers k = 0, 1, 2, . . ., N − 1.

frequency component A frequency present in a signal as evidenced by its Fourier transform, named ω_0, ω_1, ω_2, and so on.



frequency modulation The periodic alteration of frequency from a reference frequency. For small differences, this is the effect of vibrato, but for larger differences, sidebands form and unusual timbres arise.

frequency response The spectrum of a time-domain signal, X(ω), usually used to refer to the transformed reaction of some instrument or filter to a sine sweep or impulse; the Fourier transform of the impulse response.

fundamental frequency Notated f_0, the fundamental frequency is a reference frequency to which other frequencies are compared. It is typically the root of a chord or the actual, single pitch played on an instrument whose Fourier transform is studied.

Gibbs phenomenon Observation by J. Willard Gibbs that the Fourier series of a piecewise continuous function (like a square or triangle wave) approximates it worst at "jump discontinuities," i.e., at the sharp edges of the waveform. The "tails" that appear in the graphs of such Fourier series are called Gibbs horns.

hammer (malleus) Attached to the eardrum and the anvil in the ossicles; communicates the vibrations of the eardrum to the inner ear.

harmonic (1) Short for harmonic partial, meaning a partial that is an integer multiple of a fundamental frequency. (2) An adjective describing a timbre that contains only integer multiples of the fundamental frequency.

Hermitian symmetry Symmetry of a complex function. For a spectrum X(k), the following properties hold:

    Re{X(−k)} = Re{X(k)}      (vertical symmetry of the real parts),
    Im{X(−k)} = −Im{X(k)}     (diametric symmetry of the imaginary parts),
    |X(−k)| = |X(k)|          (vertical symmetry of the magnitudes),
    ∠X(−k) = −∠X(k)           (diametric symmetry of the phase angles).

The Fourier transform possesses Hermitian symmetry.

high-pass filter A filter that allows frequencies above some cutoff frequency ω_c to pass through it and attenuates the rest.

hop The interval of time between the beginning times of the mth and (m + 1)th windows in a windowed signal, designated by a hop size H.



Huygens' principle Every point through which a wave propagates is itself the source of a new spherical wave.

ideal sampling Sampling using impulses.

impedance matching The matching of the impedance of some closed cavity (like the middle ear or bore of a wind instrument) to external impedance (like the impedance outside of the eardrum or the impedance input into the mouthpiece of a wind instrument).

impulse Theoretically, an impulse is the function δ(t), equal to 1 where t = 0 and 0 elsewhere for discrete time domains, and equal to positive infinity where t = 0 and 0 otherwise for continuous time domains. Its Fourier transform is constant, i.e., it has energy spread equally to all frequencies. Therefore, an impulse is a burst of white noise.
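The flat spectrum of the discrete impulse is easy to verify numerically; here is a one-line MATLAB check (an illustration, not the book's code).

    % The DFT of a discrete (Kronecker) impulse is constant across all bins.
    d = [1 zeros(1, 7)];    % length-8 impulse
    abs(fft(d))             % returns all ones: equal energy at every frequency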

impulse response The recorded reaction of a reverberant system to an impulse.

information The meaningful content of a message or signal. Opposite: Noise.

input impedance The amount of resistance induced in an instrument, such as the amount of pressure introduced by a player's lips and lungs in a wind instrument.

interpolation The insertion of L-many zeros between every point of a domain for the purpose of up-sampling (oversampling).

inverse continuous Fourier transform (IFT) Integral that transforms a continuous, frequency-domain spectrum into a continuous, time-domain signal, given by

    x(t) = (1/2π) \int_{-∞}^{∞} X(ω) e^{iωt} dω.

inverse discrete Fourier transform (IDFT) Sum that transforms a discrete, frequency-domain spectrum of length N into a discrete, time-domain signal also of length N, given by

    x(t) = (1/N) \sum_{k=0}^{N-1} X(k) e^{i 2πkt/N}.
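As a quick sanity check (using MATLAB's built-in fft and ifft rather than any listing from the text), applying the DFT and then the IDFT reproduces the original signal to within rounding error.

    % DFT followed by IDFT recovers the input up to floating-point error.
    x  = randn(1, 16);           % a random test signal (assumed)
    xr = ifft(fft(x));
    max(abs(x - xr))             % on the order of machine epsilon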



inverse discrete-time Fourier transform (IDTFT) Integral that transforms a continuous, frequency-domain spectrum of length 2π into a discrete, time-domain signal of infinite length, given by

    x[t] = (1/2π) \int_{0}^{2π} X(ω̂) e^{iω̂t} dω̂,

for normalized frequencies ω̂ that lie in the interval [0, 2π).

inverse square law Law governing the intensity I of waves as they propagate as a function of distance r and original power P, given by

    I = P / (4πr^2).

Sound pressure, by contrast, falls off in proportion to 1/r.
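A small numeric illustration in MATLAB (values assumed): doubling the distance from a point source quarters the intensity.

    % Intensity of a 1 W point source at 1 m and at 2 m.
    P   = 1;                     % acoustic power in watts (assumed)
    I1m = P / (4*pi*1^2)         % about 0.0796 W/m^2
    I2m = P / (4*pi*2^2)         % one quarter of I1m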

inverse Z-transform Closed path integral that transforms a continuous, finite frequency-domain spectrum into a discrete, infinite time-domain signal, given by the equation

    x[t] = (1/2πj) \oint_C X(z) z^{t-1} dz,

where C is a closed path lying within the region of convergence.

inversion The "flipping" of a musical interval with respect to the octave of 12 notes. The inversion of a perfect fifth, for example, is a perfect fourth. The inversion of an octave is still an octave.

just intonation System of tuning built on the intervals of the octave, perfect fifth, and major third.

just-noticeable difference (jnd) (1) The minimum difference in two frequencies for the sounds to be perceived as different. (2) The minimum difference in two decibel levels for the sounds to be perceived as different. Both (1) and (2) are measured in limens.

key In Western tonality, a key is a scale of notes (typically 7) designated by a root note name (like C) and a quality (like major or minor).



Kronecker delta A discrete impulse, defined conditionally as

    δ[t] = 1 when t = 0, and 0 otherwise.

The global sum of the Kronecker delta is exactly 1. It is nonintegrable because it is not continuous.

Laplace transform Integral that transforms a continuous, time-domain signal into a continuous, frequency-domain spectrum, given by

    X(s) = \int_{-∞}^{∞} x(t) e^{-st} dt.

Since time-domain signals are typically only defined for a positive domain, we can make our lives a lot easier by changing the limits of integration to [0, ∞).

latency The measure of time delay in a system, ideally zero.

lateral modes The modes of vibration that are along some diameter of an instrument.

limen See just-noticeable difference.

limit Value specifying the maximum strength a frequency can have before permanent damage is incurred.

linear filter Filters that are subject to the constraint of linearity, meaning that they satisfy two conditions: (1) the principle of superposition (additivity), and (2) scaling the input by a constant also scales the output by the same constant (e.g., if the input x(t) yields the output y(t), then the input a·x(t) yields a·y(t)). Every filter covered in Numbers & notes is a linear filter.

linear independence Mathematical condition satisfied when one linear expression (i.e., a vector) cannot be written in terms of another linear expression. The vectors (1, 0) and (0, 1), for example, are linearly independent, because (0, 1) cannot be written as any combination of (1, 0). The vectors (1, 0) and (2, 0), on the other hand, are linearly dependent: The second is two times the first.



load Resistive component of a circuit over which the output voltage is computed in order to calculate the circuit's transfer function.

lossless compression Compression algorithms that do retain all of the original, raw data in a signal, such that when the compressed file is decompressed, the original signal is returned.

lossy compression Compression algorithms that do not retain all of the original, raw data in a signal.

loudness The psychophysical perception of intensity.

low frequency oscillator (LFO) A sine wave with frequency less than about 30 Hz that is meant to control periodic changes in other elements, such as tremolo or vibrato.

low-pass filter A filter that allows frequencies below some cutoff frequency ω_c to pass through it and attenuates the rest.

Mach number The ratio of the speed of an object moving through air to the speed of sound, with Mach numbers greater than 1 indicating that the object is breaking the sound barrier.

magnitude The length of a vector; for a complex number a + bi, this vector is 〈a, b〉 and its magnitude is |a + bi| = \sqrt{a^2 + b^2}, just like its absolute value.

magnitude response The magnitude plot of a filter with respect to frequency, calculated by

    |H(jω)| = \sqrt{Re{H(jω)}^2 + Im{H(jω)}^2}.

The magnitude response is all we typically care to plot when we take a Fourier transform because visualizing its imaginary parts would require another axis.
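As an illustration, the MATLAB sketch below evaluates the magnitude response (in dB) of an assumed one-pole low-pass transfer function H(jω) = 1/(1 + jω/ω_c); neither this filter nor its 1 kHz cut-off comes from the text.

    % Magnitude response of an assumed one-pole low-pass filter.
    wc = 2*pi*1000;                  % cut-off frequency: 1 kHz
    w  = 2*pi*linspace(10, 20000, 500);
    H  = 1 ./ (1 + 1i*w/wc);
    magdB = 20*log10(abs(H));        % about -3 dB where w equals wc
    semilogx(w/(2*pi), magdB), xlabel('Hz'), ylabel('dB')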

masking The psychoacoustic phenomenon wherein the power of some frequency goes undetected due to the superior power of another nearby frequency that is within one critical band of the other.

meter The organization of rhythm in a piece of music, designating some amount of notes with a unit duration per measure.



modes of vibration The patterns describing the physical ways in which an instrument vibrates when set into motion by a mechanism such as a fixed string, reed, or mallet.

modulation Periodic change.

modulation frequency In FM (frequency modulation) synthesis, the modulation frequency f_m periodically changes some carrier frequency f_c to produce sidebands at f_c − f_m and f_c + f_m.

music information retrieval (MIR) The digital methods used to detect musical devices in sound files, such as instrumentation, emotion, fundamental frequency, and style.

narrowband (band-limited) noise Noise containing only a small interval of frequencies.

node Location in a mode of vibration where a vibrating mechanism remains stationary during vibration due to the cancelation of vibrating forces. Striking or otherwise causing vibration in the instrument at its nodes produces no sound.

noise The meaningless content of a message or signal. Opposite: Information.

normal atmospheric pressure The average pressure of the environment, in which no sound is perceived. Standardized to 101,325 pascals (Pa).

normalized discrete Fourier transform (NDFT) Specification of the DFT that maps to the interval [0, 1], given by

    X̂(k) = (1/\sqrt{N}) \sum_{t=0}^{N-1} x(t) e^{-i 2πkt/N}.

normalized inverse discrete Fourier transform (NIDFT) Specification of the IDFT that maps to the interval [−1, 1], given by

    x(t) = (1/\sqrt{N}) \sum_{k=0}^{N-1} X̂(k) e^{i 2πkt/N}.

normalization The mathematical process of translating a function's values to be unitary, i.e., to the closed interval [−1, 1].



Nyquist frequency The minimum sampling frequency required to avoid aliasing, equal to 2f_max. Also called the Nyquist rate and Nyquist limit.

Ohm's law Voltage is the product of current and resistance; V = IR.

open circuit A circuit that is or behaves as if its wires were disconnected due to infinite resistance. No current will flow in an open circuit.

organ of Corti Sensory organ of hearing in the cochlea, covered with cilia.

orthogonality (1) The relationship between two vectors whose dot (inner) product is zero; orthogonal vectors are linearly independent. (2) Mathematical condition satisfied when two vectors meet at an angle of 90°.

orthonormality Linearly independent (orthogonal) vectors that all have unitary magnitude.

ossicles The three smallest bones in the body, located in the middle ear. The ossicles consist of the hammer (malleus), anvil (incus), and stirrup (stapes) and they serve to pass on and amplify up to 20 times the vibrations of the eardrum to the inner ear.

oval window Membrane connecting the ossicles to the cochlea in the inner ear. Opening to the scala vestibuli.

oversampling Using a higher sampling frequency than the Nyquist frequency (2f_max) to ensure that all frequencies in a signal are captured during sampling. Results in a larger file size.

overtone A frequency produced in the timbre of an instrument that is above the fundamental frequency that is played. Usually this is ordered, i.e., the fundamental frequency is written f_0 and the third closest frequency in its overtone series is written f_3.

overtone series The collection of frequencies in the timbre of an instrument, ordered f_0, f_1, f_2, and so on. When these f_k are integer multiples of f_0, we call this a harmonic overtone series.

partial See overtone.



pentatonic scale Scale built on the first five notes from the circle of fifths, i.e., C-G-D-A-E (typically ordered C-D-E-G-A).

perfect fifth A consonant interval defined by two notes, in which one note is seven semitones (half steps) above the other. Their frequencies are in a 3:2 ratio.

perfect fourth A consonant interval defined by two notes, in which one note is five semitones (half steps) above the other, abbreviated P4. Their frequencies are in a 4:3 ratio. This is the interval at the beginning of "Here Comes the Bride."

perfect octave A consonant interval defined by two notes, in which one note is 12 semitones (half steps) above the other, abbreviated P8 or 8va. Their frequencies are in a 2:1 ratio.

perfect (absolute) pitch The ability to name the notes of pitches played in isolation or without a reference pitch or key.

perfect unison The most consonant interval defined by two notes of identical frequency, i.e., a ratio of 1:1.

perilymph Fluid inside of the scala tympani and the scala vestibuli.

period The interval of time in seconds (s) that something takes to repeat itself, denoted by the variable T and inversely proportional to ordinary frequency, i.e., T = 1/f.

permanent threshold shifting The permanent shifting of the thresholds of hearing to higher values; hearing loss.

phase relationship The functional difference in phase between two or more waves.

phase response The phase plot of a filter with respect to frequency, calculated by

    φ[H(jω)] = tan^{-1}( Im{H(jω)} / Re{H(jω)} ).

phase shifting Process used by Steve Reich on a tape reel, achieved by playing two identical tapes slightly out of sync with one another and hence changing the phase of one of the tapes with respect to the other tape.



phasor (1) In electrical engineering, the phasor refers to the initial phase of a voltage function. (2) In Max/MSP and other musical programming environments, a phasor is the mirror image of a sawtooth wave, whether horizontally or vertically.

phon Unit of perceived loudness that heeds equal loudness curves, the lowest of which is the Fletcher-Munson curve at zero phons. One phon is the loudness of 1000 Hz at 1 dB SPL; 10 phons is the loudness of 1000 Hz at 10 dB SPL.

pinnae The flaps of skin in the outer ear that stick out from the head.

pitch The psychophysical perception of frequency.

pitch class A note regardless of octave. The twelve notes C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, and B are each pitch classes (could also be written with flats, but sharp notes are more common). Also called pitch chroma.

place theory Leading theory in psychoacoustics supposing that the ear detects frequencies with respect to the location of excitation along the basilar membrane.

pole In signal processing, a pole exists where the denominator of a transfer function H is equal to zero.

power Rate of energy transfer (energy per unit time); voltage times current. Unit is the watt (W), equal to 1 joule per second.

pressure Force per unit area.

pressure wave Wave that periodically alternates between compressions (high pressure regions) and rarefactions (low pressure regions).

principle of superposition Every wave can be represented as a sum of simple sinusoids. This is evidenced in the Fourier series.

probability density function (pdf) Function that defines the probabilities of different events for a continuous random variable. Its total sum (or integral) is equal to 1. For a discrete random variable, we compute its probability mass function (pmf).



pulse code modulation (PCM) Sampling technique similar to ideal sampling but with some quantization error.

pure tone A sine wave with no harmonics or overtones, exhibited in simple harmonic motion.

Pythagorean temperament System of tuning built on the intervals of the octave and perfect fifth.

quality (1) In band-pass and band-stop filters, quality Q refers to the steepness of the passband or stopband in the magnitude plot of its transfer function. The higher the Q, the steeper the response. (2) The sonority of a key or interval, i.e., major or minor.

quantization Approximating the size at a given point of a measured signal to the closest value in a given set of values.

quantization error The difference between a signal x(t) and its sampled representation x_s(t), i.e., between raw data and compressed data. When the bit depth is low or there are other limits on the size of a file, the quantization error will probably be significant.

range Also called the codomain, the range is the set of output values to which a function maps. In the signal x(t), the values of x(t) represent its range. Range can also refer to the interval of values that a function maps (its image), like the bandwidth of a signal.

rarefaction Region of low pressure and particle density, depicted by the troughs in a waveform.

reactance Resistance in the complex frequency domain, of frequency-dependent components such as capacitors and inductors.

region of convergence The radius of the largest circle for which a series converges.

Reissner's membrane Membrane in the cochlea separating the scala media from the scala vestibuli.

relative pitch The ability to name the notes of pitches played with some reference pitch or key.



release In an ADSR envelope, release is the "R," describing the period after a signal sustains and dies out.

resolution The quality or fidelity of a representation set. High resolution implies a relatively low amount of error when quantifying the difference between an actual thing and its representation, like quantization error.

resonance The preference of a system for specific frequencies, evidenced by relatively strong peaks in its frequency domain. These resonant frequencies are determinable by the physical dimensions of the resonator.

restoring force The force acting upon a system in motion to return the system to equilibrium (stasis).

reverberation The persistence of sound in a space due to reflections and refraction.

rhythm A structured organization of strong and weak beats which repeat periodically in music.

RIFF Header information in encoded WAVE files.

room acoustics The study of the resonances of architectural spaces, particularly their effect on psychoacoustics and speech intelligibility.

roots of unity The complex exponentials e^{iωt} in the Fourier transform, defining different positions along the unit circle. For a discrete signal x(t) of size N, there will be N roots of unity in one revolution of the unit circle, each separated by the angle 2π/N (a factor of e^{i2π/N}). They define the orthonormal basis of the Fourier transform.

roughness Interval of frequencies that lies between beating frequencies and separable frequencies. Considered dissonant. Also called the interval of confusion.

round window The second opening to the inner ear, located below the oval window. As the oval window translates the movements of the stirrup into the cochlea, the round window pushes out of the cochlea to allow the fluid inside to vibrate. Opening to the scala tympani.

row matrix Matrix consisting of only one row; can have any number of columns.



sawtooth wave A waveform with similar appearance to the teeth on a saw. Bowing a violin produces a sawtooth wave.

scala media Endolymph-filled cavity in the cochlea lying between the scala tympani and the scala vestibuli.

scala tympani Perilymph-filled cavity in the cochlea that translates its vibrations to the scala media.

scala vestibuli Perilymph-filled cavity in the cochlea that translates its vibrations to the scala media. Reissner's membrane separates it and the scala media.

scattering junction An area of different impedance that causes waves to change orientation, intensity, and/or speed.

semitone A half step, such as C to C♯ or A to A♭.

series The sum of components in a sequence.

short-time Fourier transform (STFT) Theoretical variation of the fast Fourier transform that first divides an input signal into smaller chunks (less than 100 milliseconds in length) via a windowing function and then computes an FFT on each of those chunks. Particularly useful for musical signals whose frequency content changes over time. Generates a spectrogram.

short circuit A circuit that contains or behaves as if it contained a plain electric wire with no resistance that effectively shorts out other paths of the circuit with higher resistance. Current is infinite in a short circuit because it, like traffic, prefers paths with less resistance, so no current flows in the parts of a circuit that are shorted.

sidebands Frequencies that appear as the result of FM synthesis. See modulation frequency.

signal A time-domain message. In signal processing, we denote a signal by x(t).

sinusoid A sine or cosine wave of the form A sin(ωt + φ) or A cos(ωt + φ), where A is amplitude, ω is angular frequency, t is time, and φ is phase; a pure tone.



sone Unit of perceived loudness that measures the loudness of sound relative to some reference sound. For the loudness in phons L_p, the loudness in sones can be calculated by

    L_s = 2^{(L_p − 40)/10}.

The Fletcher-Munson curve also defines where frequencies are zero sones. A frequency f that is 50 phons is considered to be twice as loud as the same frequency at 40 phons by the sone scale.
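A one-line MATLAB check of the phon-to-sone conversion (illustrative only):

    % Sones for loudness levels of 40, 50, and 60 phons.
    Lp = [40 50 60];
    Ls = 2.^((Lp - 40)/10)    % returns 1, 2, 4: each 10 phons doubles loudness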

sound intensity level (SIL) Sound intensity is the sound power P per unit area A. Unit is watts per square meter (W/m^2). Sound intensity level is a measure of the ratio between two sound intensities I_0 and I_1, where I_0 is the reference intensity, given by

    L_SIL = 10 log_10(I_1 / I_0) dB.
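For instance (the reference value below is an assumption), an intensity ten times the reference corresponds to a 10 dB sound intensity level, as the MATLAB lines show.

    % Sound intensity level for an intensity ten times the reference.
    I0 = 1e-12;                   % W/m^2, a common reference (assumed)
    I1 = 1e-11;
    L_SIL = 10*log10(I1/I0)       % = 10 dB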

sound pressure level (SPL) Deviation in the ambient pressure level from normal atmospheric pressure or some other reference pressure level like the threshold of human hearing at 1000 Hz (20 µPa). Unit of sound pressure is the pascal (Pa) and of sound pressure level, the dB SPL.

source (1) A circuit component like a battery that introduces a voltage to the circuit, i.e., the input voltage. (2) Something that produces a sound.

spectrogram (spectrograph) Graph drawn on either two or three axes in which the horizontal axis is time, the vertical axis is frequency, and the color of each point indicates the power of the frequency at that time. Three-dimensional spectrograms add an axis for power but still convey it by color as well.

spectrum The frequency response of some time-domain signal, transformed by an algorithm such as the Fourier transform. The plural of spectrum is spectra.

standing waves Waves that do not propagate in a reverberant space because the physical dimensions of the space are an integer multiple of half their wavelength.

stapedius (acoustic) reflex Involuntary contraction of the stapedius muscle in the middle ear that protects the ear from loud sounds. Also activated during speaking to reduce sound by about 20 dB.



stirrup (stapes) Bone in the ossicles that is connected to the anvil and the inner ear's oval window.

stretched octave Found in the piano, a stretched octave above a fundamental frequency f_0 has a frequency that is slightly greater than 2f_0. This is to compensate for inharmonic overtones caused by the physical imperfections of an instrument.

sustain In an ADSR envelope, sustain is the "S," describing the duration of time that a signal stays at approximately the same amplitude after initial attack and decay.

syncopation The placement of a strong beat on a weak beat or at an unexpected time.

synesthesia The response of one of the senses to stimuli of a different type, such as visual responses to music.

tectorial membrane Membrane beneath Reissner's membrane in the cochlea whose motion triggers the inner hair cells of the basilar membrane. Its function is largely unknown, but it is hypothesized that it is largely responsible for passing on the phase information of waves to the brain.

temperament The system governing the ratios between pitches in musical tuning.

tempo The pace of a piece of music.

temporal theory Theory in psychoacoustics supposing that the ear detects frequencies with respect to phase and interaural time difference.

temporary threshold shifting The shifting of the thresholds of hearing during a persistently loud sound to a higher value in decibels, protecting against hearing loss.

tensor tympani Muscle in the middle ear with the purpose of damping sounds.

threshold Value specifying the minimum strength of a frequency required for it to be perceived.

timbre The tone color of an instrument.

tonal center The tonic or root of a key bearing the same name. Tonal center can also refer to the "best guess" for the name of this key. Keys defined without some asymmetry, such as the whole-tone scale, lack a tonal center.

tone deafness The inability to repeat a heard melody; amusia.

tonotopic mapping The mapping of place theory, of frequencies along the basilar membrane. This mapping is logarithmic.

transposition The shifting of a set of pitches by an equal amount, usually to put the music in a different key (does not affect the key's quality).

tremolo Periodic modulation of amplitude around some center or average amplitude, and the resulting aural effect.

triad Three notes defining a musical chord. With respect to a fundamental note, a major triad is made up of a major third and perfect fifth.

trough Location in a wave where pressure is minimal. Opposite: Crest.

tuning Adjusting the pitches of an instrument so that the intervals between them follow some standardized system (temperament).

undersampling Sampling at less than the Nyquist frequency (2f_max). Results in aliasing.

unit circle Circle defined on the complex plane centered at the point (0, 0) with a radius of 1.

unitary Of magnitude 1. The unit circle is unitary.

up-sampling See oversampling.

vibrato Periodic modulation of pitch around some center or average pitch, and the resulting aural effect.

wavelength The length in meters (m) that a wave of frequency f travels in one period T, denoted by the Greek letter lambda (λ) and given by

    λ = v/f = vT.
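For example (the speed of sound is an assumed value), the wavelength of a 440 Hz tone in air is a little under a meter; in MATLAB:

    % Wavelength of a 440 Hz tone in air.
    v = 343;             % speed of sound, m/s (assumed)
    f = 440;
    lambda = v / f       % about 0.78 m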



whammy bar The bar attached to the bridge of a guitar that allows a player to alter the tension of the strings.

wideband noise Noise with a large bandwidth of frequencies, such as white noise (infinite bandwidth).

windowing function A time-domain function w(t) that segments a signal with the use of windows to prepare it for the short-time Fourier transform (STFT).

Young's modulus The measure of stiffness in an elastic material; the ratio of stress to strain. Like the bulk modulus, Young's modulus is also given in pascals (Pa).

Z-transform Algorithm that converts a discrete, infinite, time-domain system into a continuous, finite, frequency-domain spectrum, given by

    X(z) = \sum_{t=-∞}^{∞} x[t] z^{-t},

where t is an integer representing time samples and z = Ae^{jω}.

zero In electrical engineering, a zero refers to where the numerator of a transfer function H is equal to zero.

zero-padding Appending a signal with zeros to make it some desired length.


Index

Z-transform, 214, 239
ADSR envelope, 29, 185
algorithm, 150
    fast Fourier transform, 216
    lossless compression, 155
    lossy compression, 156
    quick-sort, 151
aliasing, 148
amusia, 132
aphasia, 132
bandwidth, 124, 146, 175, 246
beating, 123
Bessel functions, 23, 84
bit depth, 152
bit rate, 157
cancelation, 39
Caruso, Enrico, 231
Chladni plates, 74
Chladni, Ernst, 73
clipping, 19, 94, 95, 165, 214
closed tube, 81
complex numbers, 1
complex plane, 168, 230
computational complexity
    FFT, 216
    of the DFT, 190
constructive interference, 39
    in the DFT, 193
continuous Fourier transform, 173, 176
convolution, 183, 185
convolution reverb, 184
convolution theorem, 183
cosine wave, 3
    in terms of e^{ix}, 170
critical bands, 124, 157
cross product, 169, 187
decibel, 41
    hearing level (dB HL), 122
    sound pressure level (dB SPL), 121, 238
deconvolution, 231
destructive interference, 39
Deutsch, Diana, 132
Dirac delta, 141
discrete Fourier transform, 177
    as matrix multiplication, 196
    symbols used in, 177
discrete-time Fourier transform, 214, 240
Doppler effect, 46
Doppler, Christian, 46



Euler's formula, 170
Euler's identity, 168
fast Fourier transform, 216
    radix-2 decimation in time, 217
filter
    all-pass, 100
    anti-aliasing, 148
    as an electrical circuit, 93, 233
    band-pass, 114, 244
    band-stop, 244
    bank, 101, 248
    digital, 214, 230
    high-pass, 234
    in musical instruments, 72, 125
    linear, 182
    low-pass, 148, 239
    transfer function, 229
flanger pedal, 100
Fourier series, 161
Fourier, Jean Baptiste Joseph, 161
frame, 221
frequency bin, 189
frequency response, 34
harmonic overtone series, 10
    partials, 26, 64, 85, 166
    reasons for inharmonicity, 70, 94
Helmholtz resonance, 24
Helmholtz, Hermann von, 24, 75, 127
Hermitian symmetry of the DFT, 182, 202
Huygens' principle, 43
impedance matching, 84, 110
impulse function, 139, 143
impulse response, 35, 184, 236
integers, 1
interpolation, 189, 214, 215
inverse discrete Fourier transform, 177
inverse square law, 41
just-noticeable difference (jnd), 122
Kronecker delta, 140
Laplace transform, 214, 229, 235
linearity of the DFT, 181
logarithms, 3
    as a perceptual measure of loudness, 41, 119
    as a perceptual measure of pitch, 101, 114
    nature of the basilar membrane, 134
Mach number, 45
magnitude response, 174, 240
masking, 124
modes, 13



    of a fixed string, 25
    of circular membranes (drums), 89
    of wind instruments, 82
monochord, 56
music information retrieval, 49, 209, 248
musical intervals, 22
    consonance and dissonance of, 128
    with respect to temperament, 23, 58, 59
musical synesthesia, 131
nodes, 23
    of a circular membrane (drum), 90
    of a fixed string, 75
    of wind instruments, 83
normalized discrete Fourier transform, 203
Nyquist rate, 146
Nyquist-Shannon sampling theorem, 146
open tube, 81
orthonormality of the DFT, 11, 170, 196
Parseval's theorem, 187, 233
Partch, Harry, 60
perfect pitch, 129
phase response, 101, 174, 239
phaser pedal, 100
Plomp, Reinier, 64, 127
power, 41, 174, 232
    average, 233
pressure wave, 78
principle of superposition, 22
quantization error, 154, 157
real numbers, 1
reflection, 30, 40, 124, 185
refraction, 31
resistance, 33, 110, 231
reverberation, 33, 184
ring modulation, 95
roots of unity, 169, 171, 191, 227
roughness, 30, 128
scaling theorem, 188
Shebalin, Vissarion, 132
shift theorem, 182
short-time Fourier transform, 221
side lobes, 189, 224
sine wave, 3
    in terms of e^{ix}, 170
sound intensity level, 120
sound pressure level, 121
spectral leakage, 189, 205, 211, 224
speed of sound in different media, 33
standing wave, 23, 40, 235
    in wind instruments, 84
stretch theorem, 188



time-invariance, 177, 182, 214, 229
transfer function, 214, 229
tremolo, 98
up-sampling, 188
vibrato, 98
voltage, 15, 203, 214, 231
windowing function, 221
    Hanning, 225
    rectangular window, 224
    sine and cosine, 226
    triangle (Bartlett) window, 224
Worman, Walter, 86
zero-padding, 189, 215
