
Policy Gradient Methods

Each algorithm processes its episode trajectory in a different way.

Algorithm                        y_true formula              y_true in Keras
10.2.1 REINFORCE                 γ^t R_t                     reward * discount_factor
10.3.1 REINFORCE with baseline   γ^t δ                       (reward - self.value(state)[0]) * discount_factor
10.4.1 Actor-Critic              γ^t δ                       (reward - self.value(state)[0] + gamma*next_value) * discount_factor
10.5.1 A2C                       (R_t − V(s, θ_v)) and R_t   (reward - self.value(state)[0]) and reward

Table 10.6.2: y_true value in Table 10.6.1
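
As a rough illustration of how these y_true values enter training, here is a minimal sketch of a custom Keras loss that weights the log-probability of the action taken by y_true; the name policy_loss and the assumption that the policy model outputs log π(a_t|s_t) are illustrative, not the book's exact code:

from tensorflow.keras import backend as K

def policy_loss(y_true, y_pred):
    # y_pred: log pi(a_t|s_t) of the action actually taken (assumed model output)
    # y_true: the per-algorithm weight from Table 10.6.2,
    #         e.g. reward * discount_factor for REINFORCE
    # minimizing the negative weighted log-probability pushes the policy
    # toward actions with higher (discounted, possibly baselined) returns
    return -K.mean(y_true * y_pred)

The policy model would then be compiled with this loss and fit one step at a time on (state, y_true) pairs.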

For the REINFORCE methods and A2C, the reward in the Keras expressions is actually the return as computed in train_by_episode(), and discount_factor = gamma**step.

Both REINFORCE methods compute the return, R_t = ∑_{k=0}^{T} γ^k r_{t+k}, by replacing the reward value in the memory as:

# only REINFORCE and REINFORCE with baseline
# use the following code
# (math and numpy, imported as np, are assumed available at module level)
# convert the rewards to returns
rewards = []
gamma = 0.99
for item in self.memory:
    [_, _, _, reward, _] = item
    rewards.append(reward)

# compute the return per step
# the return is the sum of discounted rewards from t until the end of the episode
# the return replaces the reward in the memory
for i in range(len(rewards)):
    reward = rewards[i:]
    horizon = len(reward)
    discount = [math.pow(gamma, t) for t in range(horizon)]
    return_ = np.dot(reward, discount)
    self.memory[i][3] = return_
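
To check the conversion numerically, the same logic can be packaged as a small standalone function; the function name and the toy rewards below are illustrative only:

import math
import numpy as np

def rewards_to_returns(rewards, gamma=0.99):
    # R_t = sum over k of gamma^k * r_{t+k}, for each step t of the episode
    returns = []
    for i in range(len(rewards)):
        tail = rewards[i:]
        discount = [math.pow(gamma, t) for t in range(len(tail))]
        returns.append(np.dot(tail, discount))
    return returns

# toy 3-step episode with unit rewards:
# R_2 = 1.0, R_1 = 1 + 0.99 = 1.99, R_0 = 1 + 0.99 + 0.9801 = 2.9701
print(rewards_to_returns([1.0, 1.0, 1.0]))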

Once the rewards in memory have been replaced by returns, the policy (actor) model, and the value model (for the baseline variant only), are trained for each step, beginning with the first step.
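
As a rough sketch of that per-step training, the loop below covers the plain REINFORCE case; the memory layout (state assumed at index 1), the model name logp_model, and the fit arguments are assumptions for illustration rather than the book's exact code:

import numpy as np

gamma = 0.99
for step, item in enumerate(self.memory):
    # after the conversion above, index 3 of each memory item holds the return R_t
    state, return_ = item[1], item[3]
    state = np.reshape(state, [1, -1])      # batch dimension expected by Keras
    discount_factor = gamma**step
    # plain REINFORCE row of Table 10.6.2: y_true = reward * discount_factor
    y_true = np.array([[return_ * discount_factor]])
    # the policy model is assumed compiled with the custom loss sketched earlier
    self.logp_model.fit(state, y_true, epochs=1, verbose=0)

The baseline, Actor-Critic, and A2C variants would substitute the corresponding y_true rows of Table 10.6.2 and additionally fit the value model.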
