Chapter 5

DeepMind is a company owned by Google (https://deepmind.com/). What is even cooler is that DeepMind demonstrated that WaveNet can also be used to teach computers how to generate the sound of musical instruments, such as piano music.

Now some definitions. TTS systems are typically divided into two different classes: Concatenative and Parametric.

Concatenative TTS is where single speech voice fragments are first memorized and then recombined when the voice has to be reproduced. However, this approach does not scale, because it is possible to reproduce only the memorized voice fragments, and it is not possible to reproduce new speakers or different types of audio without memorizing the fragments from the beginning.

Parametric TTS is where a model is created for storing all the characteristic features of the audio to be synthesized. Before WaveNet, the audio generated with parametric TTS was less natural than that of concatenative TTS. WaveNet enabled a significant improvement by directly modeling the production of audio sounds, instead of using intermediate signal processing algorithms as in the past.

In principle, WaveNet can be seen as a stack of 1D convolutional layers with a constant stride of one and no pooling layers. Note that the input and the output have by construction the same dimension, so CNNs are well suited to modeling sequential data such as audio sounds. However, it has been shown that in order to reach a large size for the receptive field in the output neuron, it is necessary to either use a massive number of large filters or increase the network depth prohibitively.
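The claim that a plain convolutional stack needs prohibitive depth can be made concrete with a quick computation: with stride 1 and no pooling, the receptive field grows only linearly with depth. The numbers below (16 kHz audio, kernel size 3) are illustrative assumptions, not figures from the text:

```python
# Receptive field of a stack of plain 1D convolutions (stride 1, no pooling)
# grows only linearly with depth: rf = 1 + n_layers * (kernel_size - 1).

def plain_receptive_field(n_layers, kernel_size):
    return 1 + n_layers * (kernel_size - 1)

def layers_needed(target_rf, kernel_size):
    # Smallest depth whose receptive field covers target_rf samples.
    n = 0
    while plain_receptive_field(n, kernel_size) < target_rf:
        n += 1
    return n

# Covering one second of 16 kHz audio with kernel size 3:
print(layers_needed(16000, kernel_size=3))  # 8000 layers
```

Eight thousand layers just to see one second of context is clearly impractical, which is what motivates the dilated convolutions introduced next.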
For this reason, pure CNNs are not so effective in learning how to synthesize audio.

Remember that the receptive field of a neuron in a layer is the cross-section of the previous layer from which neurons provide inputs.

The key intuition behind WaveNet is the so-called Dilated Causal Convolution [5] (sometimes known as Atrous Convolution), which simply means that some input values are skipped when the filter of a convolutional layer is applied. As an example, in one dimension a filter w of size 3 with dilation 1 would compute the following sum:

w[0]x[0] + w[1]x[2] + w[2]x[4]
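The sum above can be checked with a few lines of NumPy. This is a minimal sketch of a single dilated convolution step, following the text's convention where "dilation 1" means one skipped sample between filter taps; the array values are made up for illustration:

```python
import numpy as np

def dilated_conv1d_step(x, w, dilation):
    # One output value of a 1D convolution where "dilation" input
    # samples are skipped between consecutive filter taps.
    step = dilation + 1
    taps = x[::step][:len(w)]
    return np.dot(w, taps)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 1.0, 2.0])

# With dilation 1 this computes w[0]*x[0] + w[1]*x[2] + w[2]*x[4]:
print(dilated_conv1d_step(x, w, dilation=1))  # 0.5*1 + 1.0*3 + 2.0*5 = 13.5
```

Note that Keras uses a slightly different convention: its `dilation_rate` parameter is the spacing between taps, so the text's "dilation 1" corresponds to `dilation_rate=2` in `tf.keras.layers.Conv1D`.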
Advanced Convolutional Neural Networks
"Atrous" is the "bastardization" of the French expression "à trous,"
meaning "with holes." So an Atrous Convolution is a convolution
with holes.
In short, in D-dilated convolution, usually the stride is 1, but nothing prevents
you from using other strides. An example is given in the following diagram with
increased dilation (hole) sizes = 0, 1, 2:
Thanks to this simple idea of introducing "holes," it is possible to stack multiple
dilated convolutional layers with exponentially increasing dilation factors, and learn long-range
input dependencies without having an excessively deep network.
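This exponential growth can be verified with a short computation. The sketch below assumes the schedule used in the WaveNet paper: kernel size 2, with dilation rates doubling from 1 to 512 (here "dilation rate" is the spacing between filter taps, the Keras convention):

```python
# Receptive field of a stack of dilated convolutions: each layer adds
# (kernel_size - 1) * dilation_rate samples to the receptive field.

def dilated_receptive_field(dilation_rates, kernel_size=2):
    return 1 + sum((kernel_size - 1) * d for d in dilation_rates)

# Dilation rates doubling at each layer: 1, 2, 4, ..., 512 (10 layers).
rates = [2 ** i for i in range(10)]
print(dilated_receptive_field(rates))  # 1024 samples from only 10 layers
```

Ten layers cover 1024 timesteps, where a plain stack of the same depth and kernel size would cover only 11; stacking several such blocks extends the coverage to many thousands of samples.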
A WaveNet is therefore a ConvNet where the convolutional layers have various
dilation factors, allowing the receptive field to grow exponentially with depth
and therefore efficiently cover thousands of audio timesteps.
When we train, the inputs are sounds recorded from human speakers. The
waveforms are quantized to a fixed integer range. A WaveNet defines an initial
convolutional layer accessing only the current and previous input. Then, there is
a stack of dilated convnet layers, still accessing only current and previous inputs.
At the end, there is a series of dense layers combining the previous results followed
by a softmax activation function for categorical outputs.
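The quantization step mentioned above can be sketched concretely. The WaveNet paper uses µ-law companding to map each waveform sample to one of 256 integer values; the code below is a minimal NumPy sketch of that transform (the toy sine-wave input is an assumption for illustration):

```python
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    # Map samples in [-1, 1] to integers in [0, quantization_channels - 1]
    # using mu-law companding, which allocates more levels to quiet sounds.
    mu = quantization_channels - 1
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)

# A toy waveform with samples in [-1, 1]:
audio = np.sin(np.linspace(0, 2 * np.pi, 8))
codes = mu_law_encode(audio)
print(codes)  # 8 integers, each in [0, 255]
```

These integer codes are what the final softmax layer predicts: a categorical distribution over the 256 possible values of the next sample.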
At each step, a value is predicted from the network and fed back into the input.
At the same time, a new prediction for the next step is computed. The loss function
is the cross-entropy between the output for the current step and the input at the
next step. The following image shows the visualization of a WaveNet stack and its
receptive field, as introduced by Aaron van den Oord [9]. Note that generation can be
slow because the waveform has to be synthesized in a sequential fashion, as x_t must
be sampled first in order to obtain x_{>t}, where x is the input:
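The sequential sampling loop described above can be sketched in a few lines. This is a toy illustration only: `predict_next` is a hypothetical stand-in for the real WaveNet forward pass (it returns a random softmax distribution instead of conditioning on the history through dilated convolutions), but the feedback structure of the loop is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_next(history, quantization_channels=256):
    # Stand-in for the WaveNet forward pass: returns a probability
    # distribution over the quantized audio values. A real model would
    # condition on `history` through its stack of dilated convolutions.
    logits = rng.normal(size=quantization_channels)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(n_samples, seed=128):
    # Autoregressive generation: sample x_t, append it to the input,
    # then predict x_{t+1} from the extended history. Each step depends
    # on the previous one, which is why generation is slow.
    history = [seed]
    for _ in range(n_samples):
        probs = predict_next(np.array(history))
        history.append(int(rng.choice(len(probs), p=probs)))
    return history[1:]

samples = generate(10)
print(len(samples))  # 10 generated quantized samples, each in [0, 255]
```

The strictly sequential dependency visible here, where each sample must be drawn before the next can be computed, is exactly what makes naive WaveNet synthesis slow.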