
Advanced Deep Learning with Keras


Policy Gradient Methods

Require: Discount factor γ ∈ [0, 1], the learning rate α for the performance gradient, and the learning rate α_v for the value gradient.

Require: θ, initial policy network parameters (for example, θ_0 → 0); θ_v, initial value network parameters (for example, θ_v0 → 0).

1. Repeat
2. Generate an episode s_0 a_0 r_1 s_1, s_1 a_1 r_2 s_2, …, s_{T−1} a_{T−1} r_T s_T by following π(a_t|s_t, θ)
3. for steps t = 0, …, T − 1 do
4. Compute return, R_t = ∑_{k=0}^{T} γ^k r_{t+k}
5. Subtract baseline, δ = R_t − V(s_t, θ_v)
6. Compute discounted value gradient, ∇V(θ_v) = γ^t δ ∇_{θ_v} V(s_t, θ_v)
7. Perform gradient ascent, θ_v = θ_v + α_v ∇V(θ_v)
8. Compute discounted performance gradient, ∇J(θ) = γ^t δ ∇_θ ln π(a_t|s_t, θ)
9. Perform gradient ascent, θ = θ + α ∇J(θ)

Figure 10.3.1: Policy and value networks
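The listing above maps directly to a training loop: for each step of a recorded episode, compute the return, subtract the value-network baseline, and take a gradient-ascent step on both networks. Below is a minimal sketch of that loop, assuming TensorFlow/Keras as used throughout the book; the network shapes and the names policy_model, value_model, and train_on_episode are illustrative assumptions, not the book's own implementation.

```python
# A minimal sketch of steps 4-9 above for one recorded episode.
# Names, layer sizes, and hyperparameters are illustrative assumptions.
import tensorflow as tf
from tensorflow import keras

state_dim, n_actions = 4, 2
gamma = 0.99                    # discount factor
alpha, alpha_v = 1e-3, 1e-3     # policy and value learning rates

# Policy network pi(a|s, theta): outputs action probabilities.
policy_model = keras.Sequential([
    keras.Input(shape=(state_dim,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(n_actions, activation="softmax"),
])
# Value network V(s, theta_v): outputs a scalar baseline.
value_model = keras.Sequential([
    keras.Input(shape=(state_dim,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
policy_opt = keras.optimizers.SGD(learning_rate=alpha)
value_opt = keras.optimizers.SGD(learning_rate=alpha_v)


def train_on_episode(states, actions, rewards):
    """states: s_t, actions: a_t, rewards[t]: reward following a_t."""
    T = len(rewards)
    for t in range(T):
        # Step 4: return R_t = sum_k gamma^k r_{t+k}
        R_t = sum(gamma ** k * rewards[t + k] for k in range(T - t))
        s_t = tf.convert_to_tensor([states[t]], dtype=tf.float32)

        # Step 5: delta = R_t - V(s_t, theta_v), treated as a constant target
        delta = R_t - float(value_model(s_t)[0, 0])

        # Steps 6-7: value gradient of gamma^t * delta * V(s_t), then ascent
        with tf.GradientTape() as tape:
            v_obj = gamma ** t * delta * value_model(s_t)[0, 0]
        grads = tape.gradient(v_obj, value_model.trainable_variables)
        # apply_gradients descends, so negate the gradients to ascend
        value_opt.apply_gradients(
            zip([-g for g in grads], value_model.trainable_variables))

        # Steps 8-9: performance gradient of gamma^t * delta * ln pi(a_t|s_t)
        with tf.GradientTape() as tape:
            log_prob = tf.math.log(policy_model(s_t)[0, actions[t]] + 1e-8)
            j_obj = gamma ** t * delta * log_prob
        grads = tape.gradient(j_obj, policy_model.trainable_variables)
        policy_opt.apply_gradients(
            zip([-g for g in grads], policy_model.trainable_variables))
```

train_on_episode would be called once per generated episode (step 2), with the trajectory collected by sampling actions from policy_model. Gradient ascent is implemented by negating the gradients before apply_gradients, matching the plus signs in steps 7 and 9; minimizing the negated objectives would be equivalent.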
