
Chapter 5

DeepMind is a company owned by Google (https://deepmind.com/).

What is even cooler is that DeepMind demonstrated that WaveNet can also be
used to teach computers how to generate the sound of musical instruments, such
as the piano.

Now some definitions. TTS systems are typically divided into two different classes:

Concatenative and Parametric.

Concatenative TTS is where single speech fragments are first memorized and
then recombined when the voice has to be reproduced. However, this approach
does not scale: only the memorized voice fragments can be reproduced, and it
is not possible to reproduce new speakers or different types of audio without
memorizing new fragments from scratch.

Parametric TTS is where a model is created that stores all the characteristic features
of the audio to be synthesized. Before WaveNet, the audio generated with parametric
TTS was less natural than that of concatenative TTS. WaveNet enabled a significant
improvement by directly modeling the production of audio samples, instead of using
intermediate signal processing algorithms as in the past.

In principle, WaveNet can be seen as a stack of 1D convolutional layers with a
constant stride of one and no pooling layers. Note that, by construction, the input
and the output have the same dimension, so CNNs are well suited to modeling
sequential data such as audio. However, it has been shown that in order to reach
a large receptive field in the output neurons, it is necessary either to use a massive
number of large filters or to increase the network depth prohibitively. For this
reason, pure CNNs are not so effective at learning how to synthesize audio.

Remember that the receptive field of a neuron in a layer is the cross-section
of the previous layer from which that neuron receives its inputs.
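To make the depth problem concrete, here is a minimal sketch (the function name and the numbers are illustrative, not from this chapter) showing that the receptive field of a plain stack of stride-1, size-3 convolutions grows only linearly with depth, so covering even one second of 16 kHz audio would require thousands of layers:

```python
# Receptive field of stacked 1D convolutions with stride 1 and no
# pooling: each layer adds dilation * (filter_size - 1) input samples.

def receptive_field(num_layers, filter_size=3, dilation=1):
    """Receptive field (in input samples) of a stack of 1D convs."""
    return 1 + num_layers * dilation * (filter_size - 1)

# With plain (undilated) size-3 filters the growth is linear:
print(receptive_field(10))    # only 21 samples after 10 layers
print(receptive_field(8000))  # ~8,000 layers to see 1 s of 16 kHz audio
```

This linear growth is exactly why WaveNet introduces dilation, described next, which lets the receptive field grow exponentially with depth instead.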

The key intuition behind WaveNet is the so-called dilated causal convolution [5]
(sometimes known as atrous convolution), which simply means that some input
values are skipped when the filter of a convolutional layer is applied. As an example,
in one dimension a filter w of size 3 with dilation 1 would compute the following sum:

w[0]x[0] + w[1]x[2] + w[2]x[4]
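As a quick sanity check, this sum can be evaluated directly in NumPy (the values of w and x below are made up for illustration; note that in tf.keras this skip-one pattern corresponds to Conv1D(..., dilation_rate=2)):

```python
import numpy as np

# A size-3 filter applied with one input value skipped between taps.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 1.0, 2.0])

# The taps land on x[0], x[2], x[4], matching the sum in the text:
y = w[0] * x[0] + w[1] * x[2] + w[2] * x[4]
print(y)  # 0.5*1.0 + 1.0*3.0 + 2.0*5.0 = 13.5
```

A size-3 filter with this dilation therefore covers 5 input samples instead of 3, which is how stacking dilated layers enlarges the receptive field without extra parameters.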

