

As an example, an l2 weight regularizer with a regularization factor of 0.001 can be implemented as:

from keras.layers import Dense
from keras.regularizers import l2

# apply an l2 penalty to the weights of this Dense layer
model.add(Dense(hidden_units,
                kernel_regularizer=l2(0.001),
                input_dim=input_size))

No additional layer is added when l1 or l2 regularization is used; the penalty is imposed internally within the Dense layer. For the proposed model, dropout still performs better than l2.
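As a point of comparison, the following is a minimal sketch of the dropout alternative mentioned above; the 0.45 rate is an illustrative assumption, and model, hidden_units, and input_size are assumed to exist as in the preceding snippet:

from keras.layers import Dense, Dropout

# dropout is a separate layer that randomly zeroes a fraction of its
# inputs during training; 0.45 here is only an assumed example rate
model.add(Dense(hidden_units, input_dim=input_size))
model.add(Dropout(0.45))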

Output activation and loss function

The output layer has 10 units followed by softmax activation. The 10 units correspond to the 10 possible labels, classes, or categories. The softmax activation can be expressed mathematically as shown in the following equation:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=0}^{N-1} e^{x_j}} \qquad \text{(Equation 1.3.5)}$$

The equation is applied to all N = 10 outputs, $x_i$ for i = 0, 1, ..., 9, for the final prediction.
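To make the normalization concrete, the following sketch evaluates Equation 1.3.5 directly with numpy; the logit values are made up purely for illustration:

import numpy as np

# arbitrary logits for N = 10 classes (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 0.5, -1.0, 0.0, 2.5, 4.0, 1.5, -0.5])
probs = np.exp(x) / np.sum(np.exp(x))

print(probs)        # each element lies in (0, 1)
print(probs.sum())  # the elements sum to 1.0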

The idea of softmax is surprisingly simple. It squashes the outputs into probabilities by normalizing the prediction. Here, each predicted output is the probability that the index is the correct label of the given input image. The sum of the probabilities for all outputs is 1.0. For example, when the softmax layer generates a prediction, it will be a 10-dim 1D tensor that may look like the following output:

[3.57351579e-11 7.08998016e-08 2.30154569e-07 6.35787558e-07
 5.57471187e-11 4.15353840e-09 3.55973775e-16 9.99995947e-01
 1.29531730e-09 3.06023480e-06]

The prediction output tensor suggests that the input image is a 7, since index 7 has the highest probability. The numpy.argmax() method can be used to determine the index of the element with the highest value.
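For instance, applied to the prediction tensor above (reproduced here as a numpy array), numpy.argmax() returns 7:

import numpy as np

prediction = np.array([3.57351579e-11, 7.08998016e-08, 2.30154569e-07,
                       6.35787558e-07, 5.57471187e-11, 4.15353840e-09,
                       3.55973775e-16, 9.99995947e-01, 1.29531730e-09,
                       3.06023480e-06])
print(np.argmax(prediction))  # prints 7, the index with the highest probability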

There are other choices of output activation layer, such as linear, sigmoid, and tanh. The linear activation is an identity function: it copies its input to its output. The sigmoid function is more specifically known as the logistic sigmoid. It is used when the elements of the prediction tensor should be mapped between 0.0 and 1.0 independently; unlike in softmax, the sum of all elements of the predicted tensor is not constrained to 1.0. For example, sigmoid is used as the last layer in sentiment prediction (0.0 is bad, 1.0 is good) or in image generation (0.0 maps to pixel value 0 and 1.0 maps to pixel value 255).
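This difference can be verified with a short numpy sketch; the three logits below are arbitrary illustrative values:

import numpy as np

x = np.array([2.0, -1.0, 0.5])
sigmoid = 1.0 / (1.0 + np.exp(-x))      # each element mapped to (0, 1) independently
softmax = np.exp(x) / np.exp(x).sum()   # elements jointly normalized

print(sigmoid, sigmoid.sum())  # the sum is generally not 1.0
print(softmax, softmax.sum())  # the sum is exactly 1.0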

