
Introducing Advanced Deep Learning with Keras

There are other non-linear functions that can be used, such as elu, selu, softplus, sigmoid, and tanh. However, relu is the most commonly used in industry and is computationally efficient due to its simplicity. sigmoid and tanh are generally used as activation functions in the output layer and are described later. Table 1.3.1 shows the equation for each of these activation functions:

relu       relu(x) = max(0, x)                                        1.3.1

softplus   softplus(x) = log(1 + e^x)                                 1.3.2

elu        elu(x, a) = x             if x ≥ 0
                       a(e^x - 1)    otherwise                        1.3.3
           where a ≥ 0 and is a tunable hyperparameter

selu       selu(x) = k × elu(x, a)                                    1.3.4
           where k = 1.0507009873554804934193349852946 and
           a = 1.6732632423543772848170429916717

Table 1.3.1: Definition of common non-linear activation functions
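
As a quick way to see these definitions in action, the elu and selu equations can be written directly in NumPy. This is an illustrative sketch, not code from the book; the function names elu and selu and the test array are chosen here only for demonstration, with k and a taken from Table 1.3.1:

import numpy as np

def elu(x, a=1.0):
    # Equation 1.3.3: identity for x >= 0, a * (e^x - 1) otherwise
    return np.where(x >= 0, x, a * (np.exp(x) - 1.0))

def selu(x):
    # Equation 1.3.4: elu scaled by k, using the fixed constants from Table 1.3.1
    k = 1.0507009873554804934193349852946
    a = 1.6732632423543772848170429916717
    return k * elu(x, a)

print(selu(np.array([-2.0, 0.0, 2.0])))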

Regularization

A neural network has a tendency to memorize its training data, especially if it has more than enough capacity. In such a case, the network fails catastrophically when subjected to the test data. This is the classic case of the network failing to generalize. To avoid this tendency, the model uses a regularizing layer or function. A commonly used regularizing layer is Dropout.

The idea of dropout is simple. Given a dropout rate (here, it is set to dropout=0.45), the Dropout layer randomly removes that fraction of units from participating in the next layer. For example, if the first layer has 256 units, after dropout=0.45 is applied, only (1 - 0.45) × 256 ≈ 140 units from layer 1 participate in layer 2. The Dropout layer makes neural networks robust to unforeseen input data because the network is trained to predict correctly even when some units are missing. It is worth noting that dropout is not used in the output layer and is only active during training; it is not present during prediction.
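
As a minimal sketch of this idea (the 256-unit hidden layers, the 784-dimensional input, the 10-class softmax output, and the tf.keras import path are assumptions used only for illustration), a Dropout layer with rate 0.45 is placed between Dense layers as follows:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

dropout = 0.45
model = Sequential()
model.add(Dense(256, activation='relu', input_dim=784))
# during training, 45% of the 256 units above are randomly dropped;
# during prediction, all units participate
model.add(Dropout(dropout))
model.add(Dense(256, activation='relu'))
model.add(Dropout(dropout))
model.add(Dense(10, activation='softmax'))  # no dropout on the output layer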

Regularizers other than dropout, such as l1 or l2, can also be used. In Keras, the bias, weight, and activation output can be regularized per layer. l1 and l2 favor smaller parameter values by adding a penalty function to the loss. Both l1 and l2 enforce the penalty using a fraction of the sum of the absolute (l1) or squared (l2) parameter values. In other words, the penalty function forces the optimizer to find parameter values that are small. Neural networks with small parameter values are less sensitive to the presence of noise within the input data.
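
A minimal sketch of per-layer regularization in Keras follows (the 0.001 penalty fraction, the layer size, and the tf.keras import path are illustrative assumptions); kernel_regularizer, bias_regularizer, and activity_regularizer correspond to the weight, bias, and activation output regularization mentioned above:

from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2

# each penalty adds a fraction of the sum of squared (l2) or absolute (l1)
# values of the corresponding tensor to the loss being minimized
layer = Dense(256,
              activation='relu',
              input_dim=784,
              kernel_regularizer=l2(0.001),    # weights
              bias_regularizer=l2(0.001),      # bias
              activity_regularizer=l1(0.001))  # activation output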
