
DeepMind is a company owned by Google (https://deepmind.com/). What is even cooler is that DeepMind demonstrated that WaveNet can also be used to teach computers how to generate the sound of musical instruments such as piano music.

Now some definitions. TTS systems are typically divided into two different classes: concatenative and parametric.

Concatenative TTS is where single speech voice fragments are first memorized and then recombined when the voice has to be reproduced. However, this approach does not scale, because only the memorized voice fragments can be reproduced; it is not possible to reproduce new speakers or different types of audio without memorizing the fragments from the beginning.

Parametric TTS is where a model is created for storing all the characteristic features of the audio to be synthesized. Before WaveNet, the audio generated with parametric TTS was less natural than that of concatenative TTS. WaveNet enabled a significant improvement by directly modeling the production of audio sounds, instead of using intermediate signal processing algorithms as in the past.

In principle, WaveNet can be seen as a stack of 1D convolutional layers with a constant stride of one and no pooling layers. Note that the input and the output have, by construction, the same dimension, so CNNs are well suited to modeling sequential data such as audio sounds. However, it has been shown that in order to reach a large receptive field for the output neurons, it is necessary to either use a massive number of large filters or increase the network depth prohibitively. For this reason, pure CNNs are not so effective at learning how to synthesize audio.

Remember that the receptive field of a neuron in a layer is the cross-section of the previous layer from which neurons provide inputs.

The key intuition behind WaveNet is the so-called Dilated Causal Convolution [5] (sometimes known as atrous convolution), which simply means that some input values are skipped when the filter of a convolutional layer is applied. As an example, in one dimension a filter w of size 3 with dilation 1 would compute the following sum:

w[0]x[0] + w[1]x[2] + w[2]x[4]
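To make the arithmetic above concrete, here is a minimal sketch in tf.keras (assuming TensorFlow 2.x). Note that a dilation (hole) size of 1 in the counting used here corresponds to dilation_rate=2 in the Keras API, which measures the step between filter taps rather than the number of skipped inputs:

import numpy as np
import tensorflow as tf

# Toy input x[0..4]: a batch of one sequence of length 5 with one channel.
x = np.arange(5, dtype="float32").reshape(1, 5, 1)

# A size-3 filter with one "hole" between taps (dilation_rate=2 in Keras).
conv = tf.keras.layers.Conv1D(filters=1, kernel_size=3, dilation_rate=2,
                              padding="valid", use_bias=False)

# The single output value is w[0]*x[0] + w[1]*x[2] + w[2]*x[4].
y = conv(x)
print(y.shape)  # (1, 1, 1)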


"Atrous" is the "bastardization" of the French expression "à trous,"

meaning "with holes." So an AtrousConvolution is a convolution

with holes.

In short, in a D-dilated convolution the stride is usually 1, but nothing prevents you from using other strides. An example is given in the following diagram, with dilation (hole) sizes increasing through 0, 1, and 2:

Thanks to this simple idea of introducing "holes," it is possible to stack multiple dilated convolutional layers with exponentially increasing dilation factors, and learn long-range input dependencies without having an excessively deep network.

A WaveNet is therefore a ConvNet where the convolutional layers have various dilation factors, allowing the receptive field to grow exponentially with depth and therefore efficiently cover thousands of audio timesteps.
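As a quick sanity check on this exponential growth, here is a small sketch; the kernel size of 2 and the doubling schedule are illustrative assumptions, not values prescribed by the text:

# Receptive field of stacked causal convolutions with kernel size 2:
# each layer with dilation rate d extends the receptive field by d.
def receptive_field(dilations, kernel_size=2):
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
print(receptive_field(dilations))        # 1024 timesteps from only 10 layers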

When we train, the inputs are sounds recorded from human speakers. The waveforms are quantized to a fixed integer range. A WaveNet defines an initial convolutional layer accessing only the current and previous inputs. Then, there is a stack of dilated convolutional layers, still accessing only the current and previous inputs. At the end, there is a series of dense layers combining the previous results, followed by a softmax activation function for categorical outputs.
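The following is a minimal sketch of such a stack in tf.keras, assuming 8-bit (256-level) quantization of the waveform; build_wavenet_like is a hypothetical helper for illustration, not DeepMind's full architecture (which also uses gated activations and skip connections):

import tensorflow as tf

QUANT_LEVELS = 256  # assumed 8-bit quantization of the waveform

def build_wavenet_like(num_layers=10, filters=32):
    inputs = tf.keras.Input(shape=(None, 1))
    # Initial causal layer: each output sees only current and past inputs.
    x = tf.keras.layers.Conv1D(filters, 2, padding="causal")(inputs)
    # Stack of dilated causal convolutions with doubling dilation rates.
    for i in range(num_layers):
        x = tf.keras.layers.Conv1D(filters, 2, padding="causal",
                                   dilation_rate=2 ** i,
                                   activation="relu")(x)
    # Dense (1x1 convolution) layers combining the results, then softmax.
    x = tf.keras.layers.Conv1D(filters, 1, activation="relu")(x)
    outputs = tf.keras.layers.Conv1D(QUANT_LEVELS, 1,
                                     activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_wavenet_like()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")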

At each step, a value is predicted by the network and fed back into the input. At the same time, a new prediction for the next step is computed. The loss function is the cross-entropy between the output for the current step and the input at the next step. The following image shows the visualization of a WaveNet stack and its receptive field, as introduced by Aaron van den Oord [9]. Note that generation can be slow, because the waveform has to be synthesized in a sequential fashion: x_t must be sampled first in order to obtain x_{>t}, where x is the input.
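A sketch of this sequential sampling loop, reusing the hypothetical model above (the mid-range seed value and the sampling scheme are illustrative assumptions):

import numpy as np

def generate(model, num_samples, quant_levels=256):
    # Seed with a single mid-range sample, then extend one step at a time.
    waveform = [quant_levels // 2]
    for _ in range(num_samples):
        x = np.array(waveform, dtype="float32").reshape(1, -1, 1)
        # Distribution over the next sample x_t, from the last timestep.
        probs = model.predict(x, verbose=0)[0, -1].astype("float64")
        probs /= probs.sum()  # renormalize to guard against float32 rounding
        sample = np.random.choice(quant_levels, p=probs)
        waveform.append(int(sample))  # feed x_t back as input
    return waveform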

