Advanced Deep Learning with Keras
Figure 1.3.8: Plot of a function with 2 minima, x = -1.51 and x = 1.66. Also shown is the derivative of the function.

Gradient descent is not typically used in deep neural networks since you'll often come upon millions of parameters that need to be trained. It is computationally inefficient to perform a full gradient descent. Instead, SGD is used. In SGD, a mini-batch of samples is chosen to compute an approximate value of the descent. The parameters (for example, weights and biases) are adjusted by the following equation:

θ ← θ − εg (Equation 1.3.7)

In this equation, θ and g = (1/m) ∇_θ ∑ L are the parameters and the gradient tensor of the loss function, respectively, where m is the mini-batch size and ε is the learning rate. g is computed from the partial derivatives of the loss function. The mini-batch size is recommended to be a power of 2 for GPU optimization purposes. In the proposed network, batch_size=128.

Equation 1.3.7 computes the last layer parameter updates. So, how do we adjust the parameters of the preceding layers? For this case, the chain rule of differentiation is applied to propagate the derivatives to the lower layers and compute the gradients accordingly. This algorithm is known as backpropagation in deep learning. The details of backpropagation are beyond the scope of this book. However, a good online reference can be found at http://neuralnetworksanddeeplearning.com.
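To make the update in Equation 1.3.7 concrete, here is a minimal NumPy sketch of a single mini-batch SGD step. The per-sample gradient function grad_loss, the toy linear model, and the data are hypothetical stand-ins; in Keras, the gradient g is computed automatically by backpropagation:

import numpy as np

def sgd_update(theta, x_batch, y_batch, grad_loss, lr=1e-3):
    # one SGD step: theta <- theta - lr * g   (Equation 1.3.7)
    # g = (1/m) * sum of per-sample loss gradients over the mini-batch
    m = len(x_batch)
    g = sum(grad_loss(theta, x, y) for x, y in zip(x_batch, y_batch)) / m
    return theta - lr * g

# toy example: linear model with squared-error loss L = 0.5 * (theta.x - y)^2
grad_loss = lambda theta, x, y: (theta @ x - y) * x
x_batch = np.random.randn(128, 3)                    # mini-batch of 128 samples
y_batch = x_batch @ np.array([1.0, -2.0, 0.5])       # targets from a known model
theta = sgd_update(np.zeros(3), x_batch, y_batch, grad_loss)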
Since optimization is based on differentiation, it follows that an important criterion
of the loss function is that it must be smooth or differentiable. This is an important
constraint to keep in mind when introducing a new loss function.
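As a concrete example, categorical cross-entropy, the loss used below, satisfies this constraint: L = −∑ y_i log(p_i) is smooth in the predicted probabilities p_i, whereas a metric such as accuracy is a step function and cannot drive gradient descent directly. The following is a minimal NumPy sketch of the loss, not the Keras implementation:

import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # L = -sum_i y_i * log(p_i) for a one-hot target; smooth in y_pred
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])     # one-hot label for digit 2
y_pred = np.array([0.05, 0.05, 0.60, 0.05, 0.05,
                   0.05, 0.05, 0.04, 0.03, 0.03])      # example softmax output
print(categorical_crossentropy(y_true, y_pred))        # ~0.51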
Given the training dataset, the choice of the loss function, the optimizer, and the
regularizer, the model can now be trained by calling the fit() function:
# loss function for one-hot vector
# use of adam optimizer
# accuracy is a good metric for classification tasks
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# train the network
model.fit(x_train, y_train, epochs=20, batch_size=batch_size)
This is another helpful feature of Keras. By just supplying the x and y data, the number of epochs to train, and the batch size, fit() does the rest. In other deep learning frameworks, this translates into multiple tasks, such as preparing the input and output data in the proper format, loading, monitoring, and so on, all of which must be performed inside a training loop. In Keras, everything is done in just one line.
In the fit() function, an epoch is one complete pass over the entire training data. The batch_size parameter is the number of input samples to process at each training step. To complete one epoch, fit() performs a number of steps equal to the size of the training dataset divided by the batch size, plus 1 to account for any fractional remainder.
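For example, with the 60,000 MNIST training images and batch_size=128, one epoch corresponds to 469 training steps:

import math

train_size = 60000                                    # MNIST training images
batch_size = 128
steps_per_epoch = math.ceil(train_size / batch_size)  # 468 full batches + 1 partial
print(steps_per_epoch)                                # 469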
Performance evaluation
At this point, the model for the MNIST digit classifier is complete. Performance evaluation is the next crucial step to determine whether the proposed model has come up with a satisfactory solution. Training the model for 20 epochs is sufficient to obtain comparable performance metrics.
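As a sketch of that step, the trained model can be scored on the held-out test set with evaluate(); x_test and y_test are assumed to have been prepared in the same way as the training data:

# returns [loss, accuracy] since metrics=['accuracy'] was passed to compile()
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("Test accuracy: %.1f%%" % (100.0 * acc))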
The following table, Table 1.3.2, shows the different network configurations and the corresponding performance measures. Under Layers, the number of units is shown for layers 1 to 3. For each optimizer, the default parameters in Keras are used. The effects of varying the regularizer, the optimizer, and the number of units per layer can be observed. Another important observation in Table 1.3.2 is that bigger networks do not necessarily translate to better performance.
Increasing the depth of this network shows no added benefit in terms of accuracy for either the training or the testing dataset. On the other hand, a smaller number of units per layer, such as 128, can lower both the training and test accuracy. The best training accuracy, 99.93%, is obtained when the regularizer is removed and 256 units per layer are used. The test accuracy, however, is much lower, at 98.0%, as a result of the network overfitting.
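The configurations in Table 1.3.2 can be explored with a small helper that varies the number of units per layer, the dropout regularizer, and the optimizer. The sketch below assumes the same MLP stack used earlier in the chapter (Dense-ReLU-Dropout blocks followed by a softmax output); treat the exact layer arrangement as an assumption rather than a listing from the book:

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

def build_mlp(hidden_units=(256, 256, 256), dropout=0.45,
              num_labels=10, input_size=784):
    # MLP for MNIST; hidden_units and dropout are the quantities varied in Table 1.3.2
    model = Sequential()
    for i, units in enumerate(hidden_units):
        if i == 0:
            model.add(Dense(units, input_dim=input_size))
        else:
            model.add(Dense(units))
        model.add(Activation('relu'))
        if dropout > 0:
            model.add(Dropout(dropout))
    model.add(Dense(num_labels))
    model.add(Activation('softmax'))
    return model

# e.g., 256 units per layer with the regularizer removed
model = build_mlp(hidden_units=(256, 256, 256), dropout=0.0)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])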