Advanced Deep Learning with Keras

Policy Gradient Methods

Require: Discount factor, γ ∈ [0, 1], and learning rate α. For example, γ = 0.99 and α = 1e−3.

Require: θ_0, initial policy network parameters (for example, θ_0 → 0).

1. Repeat
2. Generate an episode (s_0 a_0 r_1 s_1, s_1 a_1 r_2 s_2, …, s_{T−1} a_{T−1} r_T s_T) by following π(a_t | s_t, θ)
3. for steps t = 0, …, T − 1 do
4. Compute return, R_t = Σ_{k=0}^{T} γ^k r_{t+k}
5. Compute discounted performance gradient, ∇J(θ) = γ^t R_t ∇_θ ln π(a_t | s_t, θ)
6. Perform gradient ascent, θ = θ + α∇J(θ)
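To make steps 4 and 5 concrete, here is a minimal NumPy sketch (not code from the book) that computes the return R_t and the discounted weight γ^t R_t that multiplies ∇_θ ln π(a_t | s_t, θ). The reward values and the backward recursion are illustrative assumptions.

```python
import numpy as np

gamma = 0.99                      # discount factor from the listing above
rewards = [1.0, 0.5, 2.0, -1.0]   # toy rewards for one episode (illustrative values only)

# Step 4: compute the return R_t = sum_k gamma^k * r_{t+k} for every step t,
# using the backward recursion R_t = r_t + gamma * R_{t+1}
returns = np.zeros(len(rewards))
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

# Step 5: the discounted weight gamma^t * R_t applied to grad_theta ln pi(a_t | s_t, theta)
weights = np.array([gamma ** t for t in range(len(rewards))]) * returns
print(returns)
print(weights)
```

In step 6, an optimizer would then use these weights to scale the log-probability gradients of the policy network's parameters.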

Figure 10.2.1: Policy network

In REINFORCE, the parameterized policy can be modeled by a neural network as shown in Figure 10.2.1. As discussed in the previous section, in the case of continuous action spaces, the state input is converted into features. The state features are the inputs of the policy network. The Gaussian distribution representing the policy function has a mean and standard deviation that are both functions of the state features. The policy network, π(θ), could be an MLP, CNN, or RNN depending on the nature of the state inputs. The predicted action is simply a sample from the policy function.
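As an illustration of this description, the following is a minimal Keras sketch of such a Gaussian policy network; it is not the book's implementation, and the feature dimension, layer sizes, and the softplus activation on the standard deviation head are assumptions.

```python
import numpy as np
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

feature_dim = 32    # assumed size of the state feature vector
action_dim = 1      # assumed 1-D continuous action

# State features are the inputs of the policy network
features = Input(shape=(feature_dim,), name="state_features")
x = Dense(64, activation="relu")(features)
# Mean and standard deviation of the Gaussian policy, both functions of the state features
mean = Dense(action_dim, name="action_mean")(x)
stddev = Dense(action_dim, activation="softplus", name="action_stddev")(x)  # softplus keeps stddev > 0
policy_network = Model(features, [mean, stddev], name="gaussian_policy")

# The predicted action is simply a sample from the policy function
state_features = np.random.randn(1, feature_dim).astype("float32")
mu, sigma = policy_network.predict(state_features, verbose=0)
action = np.random.normal(mu, sigma)
```

During training, the log-probability of the sampled action under this Gaussian is what enters the ∇_θ ln π(a_t | s_t, θ) term of the algorithm above.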
