Advanced Deep Learning with Keras
Figure 10.6.1: MountainCarContinuous-v0 OpenAI gym environment

Unlike Q-Learning, policy gradient methods are applicable to both discrete and continuous action spaces. In our example, we'll demonstrate the four policy gradient methods on a continuous action space case example, MountainCarContinuous-v0 of OpenAI gym, https://gym.openai.com. In case you are not familiar with OpenAI gym, please see Chapter 9, Deep Reinforcement Learning.

A snapshot of the MountainCarContinuous-v0 2D environment is shown in Figure 10.6.1. In this 2D environment, a car with a not-too-powerful engine sits between two mountains. In order to reach the yellow flag on top of the mountain on the right, it must drive back and forth to gain enough momentum. The more energy (that is, the greater the absolute value of the action) that is applied to the car, the smaller (or more negative) the reward is. The reward is negative at every time step and becomes positive only when the car reaches the flag, at which point it receives a reward of +100. However, every action is penalized by the following code:

reward -= math.pow(action[0], 2) * 0.1

The continuous range of valid action values is [-1.0, 1.0]. Beyond that range, the action is clipped to its minimum or maximum value. Therefore, it makes no sense to apply an action value with a magnitude greater than 1.0.

The MountainCarContinuous-v0 environment state has two elements:

• Car position
• Car velocity
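To make the environment mechanics concrete, here is a minimal sketch of an interaction loop using the classic gym API (env.reset() returning the initial state and env.step() returning a 4-tuple); the random agent is purely illustrative and is very unlikely to reach the flag:

import gym

# Create the continuous mountain car environment
env = gym.make("MountainCarContinuous-v0")
state = env.reset()   # state = [car position, car velocity]

done = False
total_reward = 0.0
while not done:
    # Sample a random action from the valid range [-1.0, 1.0];
    # the environment penalizes it by -0.1 * action^2 per step
    action = env.action_space.sample()
    state, reward, done, _ = env.step(action)
    total_reward += reward

print("Total reward:", total_reward)

Since every step is penalized, a random agent typically ends an episode with a negative total reward, which is why the policy must learn to spend energy efficiently.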
The state is converted to state features by an encoder. The predicted action is the
output of the policy model given the state. The output of the value function is the
predicted value of the state:
Figure 10.6.2: Autoencoder model

Figure 10.6.3: Encoder model
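As a rough sketch of the models shown in Figures 10.6.2 and 10.6.3, the following Keras code builds an autoencoder over the 2-element state; the layer widths and the 32-dimensional feature vector are illustrative assumptions, not the book's exact hyperparameters:

from keras.layers import Dense, Input
from keras.models import Model

state_dim = 2     # [car position, car velocity]
feature_dim = 32  # assumed size of the state feature vector

# Encoder: converts the raw state into state features
inputs = Input(shape=(state_dim,), name='state')
x = Dense(256, activation='relu')(inputs)
x = Dense(128, activation='relu')(x)
features = Dense(feature_dim, name='feature_vector')(x)
encoder = Model(inputs, features, name='encoder')

# Decoder: reconstructs the state from the feature vector
feature_inputs = Input(shape=(feature_dim,), name='decoder_input')
x = Dense(128, activation='relu')(feature_inputs)
x = Dense(256, activation='relu')(x)
outputs = Dense(state_dim, name='reconstructed_state')(x)
decoder = Model(feature_inputs, outputs, name='decoder')

# Autoencoder: trained with an MSE reconstruction loss;
# after training, only the encoder is reused downstream
autoencoder = Model(inputs, decoder(encoder(inputs)), name='autoencoder')
autoencoder.compile(loss='mse', optimizer='adam')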
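Once the autoencoder has been trained (for example, on randomly sampled states), the encoder output serves as the state feature vector for the downstream networks. The following is a hypothetical sketch of a Gaussian policy head and a value head on top of those features; the single action dimension matches MountainCarContinuous-v0, while the softplus activation for the standard deviation is an assumption made to keep it positive:

from keras.layers import Dense
from keras.models import Model

# Freeze the trained encoder so the state features stay stable
# while the policy and value heads are trained
encoder.trainable = False

# Policy model: predicts the mean and stddev of a Gaussian;
# the continuous action is sampled from this distribution and
# clipped to the valid range [-1.0, 1.0]
mean = Dense(1, activation='linear', name='action_mean')(encoder.output)
stddev = Dense(1, activation='softplus', name='action_stddev')(encoder.output)
policy_model = Model(encoder.input, [mean, stddev], name='policy')

# Value model: predicts the value of the given state
value = Dense(1, activation='linear', name='value')(encoder.output)
value_model = Model(encoder.input, value, name='value')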