Advanced Deep Learning with Keras
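The following listing builds the logp, entropy, and value models: the logp and entropy models are Lambda heads on the policy network's mean and stddev outputs, while the value model is a linear Dense head on the same feature tensor x: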
logp = Lambda(self.logp,
              output_shape=(1,),
              name='logp')([mean, stddev, action])
self.logp_model = Model(inputs, logp, name='logp')
self.logp_model.summary()
plot_model(self.logp_model,
           to_file='logp_model.png',
           show_shapes=True)

entropy = Lambda(self.entropy,
                 output_shape=(1,),
                 name='entropy')([mean, stddev])
self.entropy_model = Model(inputs, entropy, name='entropy')
self.entropy_model.summary()
plot_model(self.entropy_model,
           to_file='entropy_model.png',
           show_shapes=True)

value = Dense(1,
              activation='linear',
              kernel_initializer='zero',
              name='value')(x)
self.value_model = Model(inputs, value, name='value')
self.value_model.summary()

Figure 10.6.6: Gaussian log probability model of the policy
Policy Gradient Methods
Figure 10.6.7: Entropy model
Apart from the policy network, π(a_t | s_t, θ), we must also have the action log
probability (logp) network, ln π(a_t | s_t, θ), since this is what actually computes
the gradient: the policy gradient is built from ∇_θ ln π(a_t | s_t, θ). As
shown in Figure 10.6.6, the logp network is simply the policy network where an
additional Lambda(1) layer computes the log probability of the Gaussian distribution
given action, mean, and standard deviation. The logp network and actor (policy)
model share the same set of parameters. The Lambda layer does not have any
parameters; it is implemented by the following function:
# given mean, stddev, and action compute
# the log probability of the Gaussian distribution
def logp(self, args):
    mean, stddev, action = args
    dist = tf.distributions.Normal(loc=mean, scale=stddev)
    logp = dist.log_prob(action)
    return logp
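For reference, dist.log_prob(action) evaluates the closed-form Gaussian log density, ln N(x; μ, σ) = -(x - μ)²/(2σ²) - ln σ - ½ ln 2π. A quick NumPy check of that formula, using hypothetical values (this is a standalone sketch, not part of the book's code):

import numpy as np

# closed-form log density of N(mean, stddev) evaluated at x
def gaussian_logp(x, mean, stddev):
    return -0.5 * ((x - mean) / stddev) ** 2 \
        - np.log(stddev) - 0.5 * np.log(2.0 * np.pi)

# example: log probability of action 0.5 under N(0, 1)
print(gaussian_logp(0.5, 0.0, 1.0))  # approx -1.0439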
Training the logp network trains the actor model as well, since the two share
parameters. In the training methods discussed in this section, only the logp
network is trained.
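As a minimal sketch of how this is wired up (the loss form, learning rate, and variable names below are illustrative assumptions; the book's own training code is not shown in this excerpt), the logp network can be compiled with a loss that weights the negative log probability by a performance signal such as the return:

from keras.optimizers import Adam
from keras import backend as K

# hypothetical REINFORCE-style loss: y_pred is the logp network
# output and y_true carries the return (or advantage), so
# minimizing -E[return * logp] performs policy gradient ascent
def logp_loss(y_true, y_pred):
    return -K.mean(y_true * y_pred, axis=-1)

self.logp_model.compile(loss=logp_loss, optimizer=Adam(lr=1e-3))
# one gradient step on a batch of states and observed returns:
# self.logp_model.train_on_batch(state_batch, return_batch)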
As shown in Figure 10.6.7, the entropy model also shares parameters with the
policy network. The output Lambda(1) layer computes the entropy of the Gaussian
distribution given the mean and standard deviation using the following function:
# given the mean and stddev compute the Gaussian dist entropy
def entropy(self, args):
    mean, stddev = args
    dist = tf.distributions.Normal(loc=mean, scale=stddev)
    entropy = dist.entropy()
    return entropy
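For a Gaussian, dist.entropy() evaluates the closed form ½ ln(2πeσ²), which depends only on the standard deviation. A quick check of that formula with a hypothetical value (again a standalone sketch, not the book's code):

import numpy as np

# closed-form entropy of N(mean, stddev)
def gaussian_entropy(stddev):
    return 0.5 * np.log(2.0 * np.pi * np.e * stddev ** 2)

print(gaussian_entropy(1.0))  # approx 1.4189

This entropy output is typically scaled by a small weight and added to the loss to encourage exploration, keeping the policy from collapsing too quickly to a deterministic action.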