
Advanced Deep Learning with Keras


Policy Gradient Methods

            [_, _, _, reward, _] = item
            rewards.append(reward)

        # compute return per step
        # return is the sum of rewards from t til end of episode
        # return replaces reward in the list
        for i in range(len(rewards)):
            reward = rewards[i:]
            horizon = len(reward)
            discount = [math.pow(gamma, t) for t in range(horizon)]
            return_ = np.dot(reward, discount)
            self.memory[i][3] = return_

        # train every step
        for item in self.memory:
            self.train(item, gamma=gamma)
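Each stored reward is thus replaced by the return, the discounted sum of the rewards from that step until the end of the episode. The short, self-contained sketch below traces the same computation on a made-up three-step episode; the rewards list and gamma value are illustrative only, not taken from the book's code:

import math
import numpy as np

# illustrative episode rewards and discount factor (not from the book's code)
rewards = [1.0, 0.0, 2.0]
gamma = 0.9

returns = []
for i in range(len(rewards)):
    tail = rewards[i:]                      # rewards from step i to end of episode
    discount = [math.pow(gamma, t) for t in range(len(tail))]
    returns.append(np.dot(tail, discount))  # discounted sum of future rewards

# returns == [1.0 + 0.9*0.0 + 0.81*2.0, 0.0 + 0.9*2.0, 2.0] == [2.62, 1.8, 2.0]
print(returns)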

Listing 10.6.5, policygradient-car-10.1.1.py, shows the main train routine used by all the policy gradient algorithms. Actor-Critic calls it on every experience sample, while the other methods call it from the train-per-episode routine in Listing 10.6.4:

    # main routine for training as used by all 4 policy gradient
    # methods
    def train(self, item, gamma=1.0):
        [step, state, next_state, reward, done] = item

        # must save state for entropy computation
        self.state = state

        discount_factor = gamma**step

        # reinforce-baseline: delta = return - value
        # actor-critic: delta = reward - value + discounted_next_value
        # a2c: delta = discounted_reward - value
        delta = reward - self.value(state)[0]

        # only REINFORCE does not use a critic (value network)
        critic = False
        if self.args.baseline:
            critic = True
        elif self.args.actor_critic:
            # since this function is called by Actor-Critic
            # directly, evaluate the value function here
            critic = True
            if not done:
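For reference, the per-method delta spelled out in the comments corresponds to the following summary in math notation (the notation is ours, not copied from the listing): R_t is the return computed in Listing 10.6.4, V(s, theta_v) is the value network, and the discounted next value is added by Actor-Critic only while the episode has not terminated, which is what the if not done branch above guards:

% summary of the delta used by each method (notation is an assumption, not from the listing)
\delta_{\text{REINFORCE-baseline}} = R_t - V(s_t, \theta_v)
\delta_{\text{Actor-Critic}} = r_t + \gamma V(s_{t+1}, \theta_v) - V(s_t, \theta_v)
\delta_{\text{A2C}} = R_t - V(s_t, \theta_v), \quad R_t \text{ being the discounted reward}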

