Advanced Deep Learning with Keras


logp = Lambda(self.logp,
              output_shape=(1,),
              name='logp')([mean, stddev, action])
self.logp_model = Model(inputs, logp, name='logp')
self.logp_model.summary()
plot_model(self.logp_model, to_file='logp_model.png', show_shapes=True)

entropy = Lambda(self.entropy,
                 output_shape=(1,),
                 name='entropy')([mean, stddev])
self.entropy_model = Model(inputs, entropy, name='entropy')
self.entropy_model.summary()
plot_model(self.entropy_model, to_file='entropy_model.png', show_shapes=True)

value = Dense(1,
              activation='linear',
              kernel_initializer='zero',
              name='value')(x)
self.value_model = Model(inputs, value, name='value')
self.value_model.summary()

Figure 10.6.6: Gaussian log probability model of the policy
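The listing above assumes that the state input (inputs), the shared feature tensor (x), and the Gaussian policy tensors (mean, stddev, and the sampled action) were built earlier in the same constructor. The following is a minimal, self-contained sketch of how such a shared base could look; the state dimension, hidden layer size, and layer names other than those reused from the listing are illustrative assumptions rather than the original implementation:

import tensorflow as tf
from keras.layers import Input, Dense, Lambda
from keras.models import Model

state_dim = 2        # illustrative; depends on the environment
hidden_units = 256   # illustrative

# state input and a small shared feature extractor
inputs = Input(shape=(state_dim,), name='state')
x = Dense(hidden_units, activation='relu', name='encoder')(inputs)

# Gaussian policy parameters: mean is unconstrained, stddev must stay positive
mean = Dense(1, activation='linear', name='mean')(x)
stddev = Dense(1, activation='softplus', name='stddev')(x)

# sample an action from N(mean, stddev) inside a Lambda layer
def sample_action(args):
    mean, stddev = args
    dist = tf.distributions.Normal(loc=mean, scale=stddev)
    return dist.sample()

action = Lambda(sample_action, output_shape=(1,), name='action')([mean, stddev])
actor_model = Model(inputs, action, name='action')

The softplus activation keeps stddev strictly positive, which the Normal distribution requires.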


Policy Gradient Methods

Figure 10.6.7: Entropy model

Apart from the policy network, π(a_t|s_t, θ), we must also have the action log probability (logp) network, ln π(a_t|s_t, θ), since this is what actually calculates the gradient. As shown in Figure 10.6.6, the logp network is simply the policy network with an additional Lambda(1) layer that computes the log probability of the Gaussian distribution given the action, mean, and standard deviation. The logp network and the actor (policy) model share the same set of parameters. The Lambda layer does not have any parameters. It is implemented by the following function:

# given mean, stddev, and action compute
# the log probability of the Gaussian distribution
def logp(self, args):
    mean, stddev, action = args
    dist = tf.distributions.Normal(loc=mean, scale=stddev)
    logp = dist.log_prob(action)
    return logp
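The Lambda layer above simply evaluates the Gaussian log-density at the chosen action. As an illustrative sanity check (not part of the original code), the same quantity can be computed in closed form with NumPy:

import numpy as np

# closed-form Gaussian log probability:
# log p(a) = -0.5 * ((a - mean) / stddev)**2 - log(stddev) - 0.5 * log(2 * pi)
def gaussian_logp(action, mean, stddev):
    return (-0.5 * ((action - mean) / stddev) ** 2
            - np.log(stddev)
            - 0.5 * np.log(2.0 * np.pi))

# example: log probability of action 0.5 under N(0, 1) is about -1.04
print(gaussian_logp(0.5, 0.0, 1.0))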

Training the logp network trains the actor model as well. In the training methods that are discussed in this section, only the logp network is trained.
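Why training the logp network alone is sufficient can be made concrete with a small sketch: because the logp model shares every layer with the actor, minimizing a surrogate loss of the form -(return * logp) updates the policy parameters directly. The loss function, optimizer, and learning rate below are illustrative assumptions, not the exact training code of this chapter:

import keras.backend as K
from keras.optimizers import RMSprop

# policy gradient surrogate loss: y_pred is log pi(a|s), y_true carries
# the discounted return (REINFORCE) or the advantage (actor-critic)
def policy_loss(y_true, y_pred):
    return -K.mean(y_true * y_pred)

# inside the agent class: compiling and fitting the logp model updates the
# shared mean/stddev layers, so the actor (policy) improves even though the
# actor model itself is never trained directly
self.logp_model.compile(loss=policy_loss, optimizer=RMSprop(lr=1e-3))
# self.logp_model.fit(state, discounted_return, batch_size=1, epochs=1)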

As shown in Figure 10.6.7, the entropy model also shares parameters with the policy network. The output Lambda(1) layer computes the entropy of the Gaussian distribution given the mean and standard deviation using the following function:

# given the mean and stddev compute the Gaussian dist entropy
def entropy(self, args):
    mean, stddev = args
    dist = tf.distributions.Normal(loc=mean, scale=stddev)
    entropy = dist.entropy()
    return entropy
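The entropy output is typically added to the policy loss as a bonus term that discourages the standard deviation from collapsing too early, keeping the agent exploring. The closure below is a hedged sketch of such an entropy-regularized loss; the factory pattern, the weight beta, and its value are illustrative assumptions rather than the chapter's exact implementation:

import keras.backend as K

# entropy-regularized policy gradient loss; `entropy` is the Keras tensor
# produced by the entropy model's Lambda layer, captured by the closure
def logp_loss(entropy, beta=0.01):
    def loss(y_true, y_pred):
        # maximize return-weighted logp plus a small entropy bonus
        return -K.mean(y_true * y_pred + beta * entropy)
    return loss

# usage (inside the agent class), replacing the plain policy loss:
# self.logp_model.compile(loss=logp_loss(entropy, beta=0.01),
#                         optimizer=RMSprop(lr=1e-3))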
