Advanced Deep Learning with Keras

Figure 1.3.8: Plot of a function with 2 minima, x = -1.51 and x = 1.66. Also shown is the derivative of the function.

Gradient descent is not typically used in deep neural networks since you'll often come upon millions of parameters that need to be trained. It is computationally inefficient to perform a full gradient descent. Instead, SGD is used. In SGD, a mini-batch of samples is chosen to compute an approximate value of the descent. The parameters (for example, weights and biases) are adjusted by the following equation:

θ ← θ − εg (Equation 1.3.7)

In this equation, θ and g = (1/m) ∇θ Σ L are the parameters and gradients tensor of the loss function, respectively. The g is computed from partial derivatives of the loss function. The mini-batch size is recommended to be a power of 2 for GPU optimization purposes. In the proposed network, batch_size=128.

Equation 1.3.7 computes the last layer parameter updates. So, how do we adjust the parameters of the preceding layers? For this case, the chain rule of differentiation is applied to propagate the derivatives to the lower layers and compute the gradients accordingly. This algorithm is known as backpropagation in deep learning. The details of backpropagation are beyond the scope of this book. However, a good online reference can be found at http://neuralnetworksanddeeplearning.com.
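To make Equation 1.3.7 concrete, here is a minimal NumPy sketch of a single SGD step. It is illustrative only: the names sgd_step and loss_gradient are hypothetical helpers introduced for this sketch and are not part of the Keras model built in this chapter.

import numpy as np

def sgd_step(theta, x_batch, y_batch, loss_gradient, lr=0.01):
    # One SGD update: theta <- theta - lr * g (Equation 1.3.7).
    # loss_gradient(theta, x, y) is assumed to return dL/dtheta for one sample.
    m = len(x_batch)
    # g = (1/m) * sum of per-sample gradients over the mini-batch
    g = sum(loss_gradient(theta, x, y) for x, y in zip(x_batch, y_batch)) / m
    return theta - lr * g

In practice, Keras performs this update (and backpropagation through all layers) automatically when fit() is called with a chosen optimizer.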

Since optimization is based on differentiation, it follows that an important criterion of the loss function is that it must be smooth or differentiable. This is an important constraint to keep in mind when introducing a new loss function.

Given the training dataset, the choice of the loss function, the optimizer, and the regularizer, the model can now be trained by calling the fit() function:

# loss function for one-hot vector
# use of adam optimizer
# accuracy is a good metric for classification tasks
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# train the network
model.fit(x_train, y_train, epochs=20, batch_size=batch_size)

This is another helpful feature of Keras. By just supplying both the x and y data, the number of epochs to train, and the batch size, fit() does the rest. In other deep learning frameworks, this translates to multiple tasks such as preparing the input and output data in the proper format, loading, monitoring, and so on, all of which must be done inside a for loop. In Keras, everything is done in just one line.
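For comparison, the one-line fit() call above is roughly equivalent to the manual mini-batch loop sketched below. This is only an illustration using Keras' train_on_batch(), assuming x_train and y_train are the NumPy arrays prepared earlier; the shuffling scheme shown is an assumption, not the exact procedure fit() uses internally.

import numpy as np

# number of mini-batches per epoch, rounding up for any partial batch
steps_per_epoch = int(np.ceil(len(x_train) / batch_size))
for epoch in range(20):
    # reshuffle the training data at the start of each epoch
    indexes = np.random.permutation(len(x_train))
    for step in range(steps_per_epoch):
        batch = indexes[step * batch_size:(step + 1) * batch_size]
        # one gradient update on a single mini-batch
        model.train_on_batch(x_train[batch], y_train[batch])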

In the fit() function, an epoch is one complete sampling of the entire training data. The batch_size parameter is the number of input samples to process at each training step. To complete one epoch, fit() requires a number of steps equal to the size of the training dataset divided by the batch size, plus 1 to compensate for any fractional part.
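As a concrete example, MNIST has 60,000 training images, so with batch_size=128 one epoch takes:

# 60000 / 128 = 468.75, so 468 full batches plus 1 partial batch
steps_per_epoch = 60000 // 128 + 1   # = 469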

Performance evaluation

At this point, the model for the MNIST digit classifier is now complete. Performance evaluation will be the next crucial step to determine if the proposed model has come up with a satisfactory solution. Training the model for 20 epochs will be sufficient to obtain comparable performance metrics.
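A minimal sketch of that evaluation step is shown below, assuming x_test and y_test were prepared in the same way as the training data. model.evaluate() returns the test loss followed by the metrics passed to compile(), here the accuracy.

# score = [test loss, test accuracy]
score = model.evaluate(x_test, y_test, batch_size=batch_size)
print("Test accuracy: %.1f%%" % (100.0 * score[1]))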

The following table, Table 1.3.2, shows the different network configurations and corresponding performance measures. Under Layers, the number of units is shown for layers 1 to 3. For each optimizer, the default parameters in Keras are used. The effects of varying the regularizer, the optimizer, and the number of units per layer can be observed. Another important observation in Table 1.3.2 is that bigger networks do not necessarily translate to better performance.

Increasing the depth of this network shows no added benefit in terms of accuracy for both the training and testing datasets. On the other hand, a smaller number of units per layer, such as 128, could also lower both the test and train accuracy. The best train accuracy, at 99.93%, is obtained when the regularizer is removed and 256 units per layer are used. The test accuracy, however, is much lower at 98.0%, as a result of the network overfitting.
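For instance, the unregularized, 256-unit configuration discussed above differs from a regularized variant only in the layer width and the presence of Dropout. The sketch below is purely illustrative: the tf.keras import path, the 784-dimensional flattened MNIST input, and the 0.45 dropout rate are assumptions for this example, not values quoted from Table 1.3.2.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

hidden_units = 256   # assumed width; the table also varies this (for example, 128)
use_dropout = False  # the best train accuracy case removes the regularizer
dropout = 0.45       # illustrative rate used only when use_dropout is True

model = Sequential()
model.add(Dense(hidden_units, input_dim=784, activation='relu'))
if use_dropout:
    model.add(Dropout(dropout))
model.add(Dense(hidden_units, activation='relu'))
if use_dropout:
    model.add(Dropout(dropout))
model.add(Dense(10, activation='softmax'))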

