Advanced Deep Learning with Keras

Chapter 10

Performance evaluation of policy gradient methods

The four policy gradient methods were evaluated by training the agent for 1,000 episodes. We define one training session as 1,000 episodes of training. The first performance metric is the number of times the car reached the flag, accumulated over the 1,000 episodes. Figures 10.7.1 to 10.7.4 show five training sessions per method.
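As a minimal sketch of how this count could be accumulated, the following code assumes a hypothetical agent.act(state) interface and the classic gym API (reset() returning the state and step() returning a 4-tuple); it is not the book's exact evaluation code:

    import gym

    def count_flag_reaches(agent, episodes=1000):
        # Count the episodes in which the car reaches the flag during
        # one 1,000-episode training session. agent.act(state) is a
        # hypothetical interface standing in for the policy network.
        env = gym.make("MountainCarContinuous-v0")
        reaches = 0
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action = agent.act(state)
                state, reward, done, _ = env.step(action)
            # In MountainCarContinuous-v0 the flag sits at position
            # 0.45; an episode that ends at or past it is a success,
            # as opposed to one cut off by the step limit.
            if state[0] >= 0.45:
                reaches += 1
        env.close()
        return reaches

Checking the final position rather than the reward distinguishes a genuine flag reach from an episode that merely ran out of steps.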

In this metric, A2C reached the flag the greatest number of times, followed by REINFORCE with baseline, Actor-Critic, and REINFORCE. The use of a baseline or critic accelerates learning. Note that these are training sessions in which the agent continuously improved its performance. There were cases in the experiments where the agent's performance did not improve over time.

The second performance metric is based on the requirement that MountainCarContinuous-v0 is considered solved if the total reward per episode is at least 90.0. From the five training sessions per method, we selected the training session with the highest total reward over the last 100 episodes (episodes 900 to 999). Figures 10.7.5 to 10.7.8 show the results of the four policy gradient methods.
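A short sketch of this selection and the solved check, assuming the per-episode total rewards of each session have been logged into a list (the names below are illustrative, not from the book):

    import numpy as np

    def best_session(session_rewards, window=100, threshold=90.0):
        # session_rewards: one list of per-episode total rewards per
        # training session (here, five lists of 1,000 entries each).
        scores = [np.mean(rewards[-window:]) for rewards in session_rewards]
        best = int(np.argmax(scores))
        # Solved criterion: mean total reward over the trailing
        # 100-episode window is at least 90.0.
        solved = scores[best] >= threshold
        return best, scores[best], solved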

REINFORCE with baseline is the only method that consistently achieved a total reward of about 90 after 1,000 episodes of training. A2C had the second-best performance but could not consistently reach a total reward of at least 90.

Figure 10.7.1: The number of times the mountain car reached the flag using the REINFORCE method

