
Advanced Deep Learning with Keras


Chapter 10

        next_value = self.value(next_state)[0]
        # add the discounted next value
        delta += gamma*next_value
elif self.args.a2c:
    critic = True
else:
    delta = reward

# apply the discount factor as shown in Algorithms
# 10.2.1, 10.3.1 and 10.4.1
discounted_delta = delta * discount_factor
discounted_delta = np.reshape(discounted_delta, [-1, 1])
verbose = 1 if done else 0

# train the logp model (implies training of the actor model
# as well since they share exactly the same set of parameters)
self.logp_model.fit(np.array(state),
                    discounted_delta,
                    batch_size=1,
                    epochs=1,
                    verbose=verbose)

# in A2C, the target value is the return (the reward is
# replaced by the return in the train_by_episode function)
if self.args.a2c:
    discounted_delta = reward
    discounted_delta = np.reshape(discounted_delta, [-1, 1])

# train the value network (critic)
if critic:
    self.value_model.fit(np.array(state),
                         discounted_delta,
                         batch_size=1,
                         epochs=1,
                         verbose=verbose)
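The target passed to logp_model.fit() is what turns an ordinary fit() call into a policy-gradient step: in these algorithms the actor update scales the log probability of the sampled action by delta. The snippet below is a minimal sketch of a loss with that shape, not the chapter's exact listing; logp_loss is a hypothetical name, and it assumes the model output y_pred is log pi(a|s) while y_true is the discounted_delta target supplied to fit().

from tensorflow.keras import backend as K

# minimal sketch, not the chapter's exact loss: assumes y_pred is
# log pi(a|s) produced by logp_model and y_true is the discounted_delta
# passed to fit(); minimizing -delta * logp ascends the policy gradient
def logp_loss(y_true, y_pred):
    return -K.mean(y_true * y_pred)

With a loss of this form, a positive delta raises the probability of the action that was taken and a negative delta lowers it, which is exactly the behavior the discounted_delta target above is meant to drive.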

With all the network models and loss functions in place, the last part is the training strategy, which is different for each algorithm. Two train functions are used, as shown in Listings 10.6.4 and 10.6.5. Algorithms 10.2.1, 10.3.1, and 10.5.1 wait for a complete episode to finish before training, so they run both train_by_episode() and train(). The complete episode is saved in self.memory. The Actor-Critic Algorithm 10.4.1 trains per step and only runs train().
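For the per-episode algorithms, the driver can be as simple as the following sketch. This is an illustrative reconstruction, not Listing 10.6.4 itself: it assumes each entry of self.memory is a mutable list [step, state, next_state, reward, done], and that the discount factor is available as self.gamma (an assumed attribute name).

# illustrative sketch of a per-episode driver, not Listing 10.6.4;
# self.gamma is an assumed attribute name for the discount factor
def train_by_episode(self):
    if self.args.a2c:
        # A2C trains the critic on the return: walk the saved episode
        # backwards and replace each reward with the discounted sum
        # of future rewards
        ret = 0.0
        for item in self.memory[::-1]:
            _, _, _, reward, _ = item
            ret = reward + self.gamma * ret
            item[3] = ret

    # the per-step update is still done by train(), once per saved step
    for item in self.memory:
        self.train(item, gamma=self.gamma)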

