Performance evaluation of policy gradient methods

The four policy gradient methods were evaluated by training the agent for 1,000 episodes. We define one training session as 1,000 episodes of training. The first performance metric is the number of times the car reached the flag, accumulated over the 1,000 episodes. Figures 10.7.1 to 10.7.4 show five training sessions per method.

On this metric, A2C reached the flag the greatest number of times, followed by REINFORCE with baseline, Actor-Critic, and REINFORCE. The use of a baseline or critic accelerates learning. Note that these are training sessions in which the agent continuously improved its performance; there were cases in the experiments where the agent's performance did not improve with time.

The second performance metric is based on the requirement that MountainCarContinuous-v0 is considered solved if the total reward per episode is at least 90.0. From the five training sessions per method, we selected the training session with the highest total reward over the last 100 episodes (episodes 900 to 999). Figures 10.7.5 to 10.7.8 show the results of the four policy gradient methods.

REINFORCE with baseline is the only method that was able to consistently achieve a total reward of about 90 after 1,000 episodes of training. A2C had the second-best performance, but could not consistently reach a total reward of at least 90.

Figure 10.7.1: The number of times the mountain car reached the flag using the REINFORCE method
Figure 10.7.2: The number of times the mountain car reached the flag using the REINFORCE with baseline method

Figure 10.7.3: The number of times the mountain car reached the flag using the Actor-Critic method
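As a concrete illustration of how the two metrics above can be computed, the following is a minimal sketch that aggregates them from per-episode logs. The episode_rewards and reached_flag arrays are hypothetical stand-ins (filled here with random values) for the logs an actual training session would produce; they are not part of the book's code.

import numpy as np

# Hypothetical per-episode logs from one 1,000-episode training session.
# episode_rewards[i] is the total reward of episode i; reached_flag[i] is
# True if the car reached the flag in episode i. Replace with real logs.
rng = np.random.default_rng(42)
episode_rewards = rng.uniform(-30.0, 95.0, size=1000)
reached_flag = episode_rewards > 0.0

# Metric 1: number of times the car reached the flag in 1,000 episodes.
flag_count = int(np.sum(reached_flag))
print("Flag reached %d times in %d episodes" % (flag_count, len(reached_flag)))

# Metric 2: MountainCarContinuous-v0 is considered solved when the total
# reward per episode is at least 90.0; here we check the mean total reward
# over the last 100 episodes (900 to 999).
last_100_mean = float(np.mean(episode_rewards[900:1000]))
print("Mean total reward, episodes 900 to 999: %.1f" % last_100_mean)
print("Solved" if last_100_mean >= 90.0 else "Not solved")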