Advanced Deep Learning with Keras



After building the network models, the next step is training. In Algorithms 10.2.1 to 10.5.1, we perform objective function maximization by gradient ascent. In Keras, we perform loss function minimization by gradient descent. The loss function is simply the negative of the objective function being maximized, and gradient descent is the negative of gradient ascent. Listing 10.6.3 shows the logp and value loss functions.

We can take advantage of the common structure of the loss functions to unify the loss functions in Algorithms 10.2.1 to 10.5.1. The performance and value gradients differ only in their constant factors. All performance gradients have the common term ∇_θ ln π(a_t | s_t, θ). This is represented by y_pred in the policy log probability loss function, logp_loss(). The factor multiplying the common term ∇_θ ln π(a_t | s_t, θ) depends on the algorithm and is implemented as y_true. Table 10.6.1 shows the values of y_true. The remaining term is the weighted gradient of entropy, β∇_θ H(π(a_t | s_t, θ)). It is implemented as the product of beta and entropy in the logp_loss() function. Only A2C uses this term, so by default beta=0.0. For A2C, beta=0.9.

Listing 10.6.3, policygradient-car-10.1.1.py: The loss functions of the logp and value networks.

# logp loss, the 3rd and 4th variables (entropy and beta) are needed
# by A2C so we have a different loss function structure
def logp_loss(self, entropy, beta=0.0):
    def loss(y_true, y_pred):
        return -K.mean((y_pred * y_true) + (beta * entropy), axis=-1)
    return loss

# typical loss function structure that accepts 2 arguments only
# this will be used by value loss of all methods except A2C
def value_loss(self, y_true, y_pred):
    return -K.mean(y_pred * y_true, axis=-1)

Algorithm                        | y_true of logp_loss  | y_true of value_loss
10.2.1 REINFORCE                 | γ^t R_t              | Not applicable
10.3.1 REINFORCE with baseline   | γ^t δ                | γ^t δ
10.4.1 Actor-Critic              | γ^t δ                | γ^t δ
10.5.1 A2C                       | (R_t − V(s, θ_v))    | R_t

Table 10.6.1: y_true values of logp_loss and value_loss
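Because the same multiplicative structure appears in both losses, they are easy to sanity-check on constants. The snippet below is illustrative only and not from the book's code; it reuses the same toy y_pred for both losses and assumes K is the Keras backend:

# illustrative only: evaluate the unified loss structure on constant tensors
from keras import backend as K

entropy = K.constant([[0.5], [0.5]])  # per-sample policy entropy (used by A2C only)
y_true = K.constant([[2.0], [-1.0]])  # algorithm-specific factors from Table 10.6.1
y_pred = K.constant([[0.3], [0.7]])   # log probabilities of the actions taken
beta = 0.9

logp = -K.mean((y_pred * y_true) + (beta * entropy), axis=-1)
value = -K.mean(y_pred * y_true, axis=-1)
print(K.eval(logp))   # approx [-1.05, 0.25]
print(K.eval(value))  # approx [-0.6, 0.7]: a positive y_true rewards increasing y_pred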

# compute the entropy of a Gaussian policy given its mean and stddev
def entropy(self, args):
    mean, stddev = args
    dist = tf.distributions.Normal(loc=mean, scale=stddev)
    entropy = dist.entropy()
    return entropy

The entropy model is only used by the A2C method:
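The code that builds this model is not reproduced in this excerpt. As a minimal sketch of the idea (the mean and stddev Dense heads and the Lambda wrapper shown here are assumptions, not the book's exact code), the entropy() function above can be wrapped as a Keras model on top of the frozen encoder:

# hedged sketch: an entropy model reusing the frozen encoder; the mean/stddev
# heads and layer names below are assumed for illustration only
from keras.layers import Input, Dense, Lambda
from keras.models import Model

inputs = Input(shape=(self.state_dim, ), name='state')
self.encoder.trainable = False
x = self.encoder(inputs)
mean = Dense(1, activation='linear', name='mean')(x)
stddev = Dense(1, activation='softplus', name='stddev')(x)  # softplus keeps stddev positive
entropy = Lambda(self.entropy, name='entropy')([mean, stddev])
self.entropy_model = Model(inputs, entropy, name='entropy')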


Figure 10.6.8: A value model

The preceding figure shows the value model. The model also uses the pre-trained encoder with frozen weights to implement the following equation, which is repeated here for convenience:

v_t = V(s_t, θ_v) = θ_v^T φ(s_t)    (Equation 10.3.2)

θ_v are the weights of the Dense(1) layer, the only layer that receives value gradient updates. Figure 10.6.8 represents V(s_t, θ_v) in Algorithms 10.3.1 to 10.5.1. The value model can be built in a few lines:

# value model: a linear Dense(1) head on top of the frozen encoder features
inputs = Input(shape=(self.state_dim, ), name='state')
self.encoder.trainable = False
x = self.encoder(inputs)
value = Dense(1,
              activation='linear',
              kernel_initializer='zero',
              name='value')(x)
self.value_model = Model(inputs, value, name='value')

These lines are also implemented in the method build_actor_critic(), which is shown in Listing 10.6.2.
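To connect this back to Table 10.6.1 above: the value model's predictions are what form the δ factors fed in as y_true. A minimal sketch for the one-step actor-critic case, with hypothetical variable names (state, next_state, reward, gamma) that do not appear in the excerpt:

# hypothetical sketch: one-step TD error computed from the value model;
# state, next_state, reward, and gamma are assumed variables
import numpy as np

v = self.value_model.predict(np.array([state]))[0][0]            # V(s_t, theta_v)
v_next = self.value_model.predict(np.array([next_state]))[0][0]  # V(s_t+1, theta_v)
delta = reward + gamma * v_next - v  # delta; scaled by gamma**t it becomes y_true (Table 10.6.1)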

