Advanced Deep Learning with Keras

Chapter 10

The gradient updates are frequently overestimated. Furthermore, training policy-based methods is time-consuming. The training requires thousands of episodes (that is, it is not sample efficient), and each episode only provides a small number of samples. Typical training in the implementation provided at the end of the chapter would take about an hour for 1,000 episodes on a GTX 1060 GPU.

In the following sections, we discuss the four policy gradient methods. While the discussion focuses on continuous action spaces, the concept is generally applicable to discrete action spaces. Due to similarities in the implementation of the policy and value networks of the four policy gradient methods, we will wait until the end of this chapter to illustrate the implementation in Keras.

Monte Carlo policy gradient (REINFORCE) method

The simplest policy gradient method is REINFORCE [5], a Monte Carlo policy gradient method:

$$\nabla J(\theta) = \mathbb{E}_{\pi}\left[ R_t \nabla_{\theta} \ln \pi(a_t \mid s_t, \theta) \right] \quad \text{(Equation 10.2.1)}$$

where $R_t$ is the return as defined in Equation 9.1.2. $R_t$ is an unbiased sample of $Q^{\pi}(s_t, a_t)$ in the policy gradient theorem.
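In code, the expectation in Equation 10.2.1 is typically turned into a surrogate loss that an optimizer can minimize: the negative log-probability of the chosen action, weighted by the return. The sketch below illustrates this for a hypothetical discrete softmax policy network (the chapter's own implementation targets continuous actions with a Gaussian policy); `reinforce_loss` and its arguments are illustrative names, not code from the book.

```python
import tensorflow as tf

def reinforce_loss(policy_network, state, action, return_t):
    """Per-step surrogate loss for Equation 10.2.1: -R_t * ln pi(a_t | s_t, theta).

    Assumes policy_network maps a batch of states to action logits
    (discrete softmax policy) and action is an integer index.
    """
    logits = policy_network(state)        # shape: (1, n_actions)
    log_pi = tf.nn.log_softmax(logits)    # ln pi(. | s_t, theta)
    log_prob_a = log_pi[0, action]        # ln pi(a_t | s_t, theta)
    # Minimizing -R_t * ln pi(a_t | s_t, theta) ascends the policy gradient.
    return -return_t * log_prob_a
```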

Algorithm 10.2.1 summarizes the REINFORCE algorithm [2]. REINFORCE is a Monte Carlo algorithm. It does not require knowledge of the dynamics of the environment (that is, it is model-free). Only experience samples, $s_i a_i r_{i+1} s_{i+1}$, are needed to optimally tune the parameters of the policy network, $\pi(a_t \mid s_t, \theta)$. The discount factor, $\gamma$, takes into consideration that rewards decrease in value as the number of steps increases. The gradient is discounted by $\gamma^t$, so gradients taken at later steps have smaller contributions. The learning rate, $\alpha$, is a scaling factor of the gradient update. The parameters are updated by performing gradient ascent using the discounted gradient and the learning rate. As a Monte Carlo algorithm, REINFORCE requires that the agent completes an episode before processing the gradient updates. Due to its Monte Carlo nature, the gradient update of REINFORCE is characterized by high variance. At the end of this chapter, we will implement the REINFORCE algorithm in Keras.
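To make the update concrete, the rough sketch below (not the book's code) shows one such Monte Carlo update after an episode completes, again assuming a hypothetical discrete softmax policy; `reinforce_update` and its arguments are illustrative names, and the learning rate $\alpha$ lives inside the Keras optimizer.

```python
import numpy as np
import tensorflow as tf

def reinforce_update(policy_network, optimizer, states, actions, rewards, gamma=0.99):
    """One REINFORCE update from a completed episode (illustrative sketch).

    states  : list of 1-D numpy state vectors
    actions : list of integer action indices
    rewards : list of per-step rewards
    """
    # Return R_t at each step: sum of discounted future rewards.
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running

    with tf.GradientTape() as tape:
        loss = 0.0
        for t, (state, action) in enumerate(zip(states, actions)):
            logits = policy_network(state[np.newaxis, :])
            log_prob = tf.nn.log_softmax(logits)[0, action]
            # Weight by the return R_t, discount the contribution by gamma**t,
            # and negate so that gradient descent performs gradient ascent.
            loss += -(gamma ** t) * returns[t] * log_prob

    grads = tape.gradient(loss, policy_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_network.trainable_variables))
```

A typical use would be to collect `states`, `actions`, and `rewards` by running the current policy for one episode and then call `reinforce_update` once per episode.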

Algorithm 10.2.1 REINFORCE

Require: A differentiable parameterized target policy network, $\pi(a_t \mid s_t, \theta)$.
