The training strategy of A2C is different in that it computes the gradients from the last step back to the first step. Hence, the return is accumulated starting from the reward of the last step, or from the value of the last next state when the episode did not terminate:
# the memory is visited in reverse as shown
# in Algorithm 10.5.1
for item in self.memory[::-1]:
    [step, state, next_state, reward, done] = item
    # compute the return
    r = reward + gamma*r
    item = [step, state, next_state, r, done]
    # train per step
    # a2c reward has been discounted
    self.train(item)
The reward variable in each experience unit is thus replaced by the computed return. The running return is initialized to zero if the terminal state is reached (that is, the car touches the flag), or to the predicted value of the next state for non-terminal states:
v = 0 if reward > 0 else agent.value(next_state)[0]
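To make the reverse accumulation concrete, here is a minimal, self-contained sketch. The reward values and the discount factor are illustrative assumptions, not values taken from the book's code:
# minimal sketch of the reverse return computation
# (illustrative values)
gamma = 0.99
# per-step rewards; the last step reaches the flag
rewards = [-1.0, -1.0, -1.0, 100.0]
# 0 for a terminal episode,
# agent.value(next_state)[0] otherwise
r = 0.0
returns = []
for reward in reversed(rewards):
    # R_t = r_t + gamma * R_{t+1}
    r = reward + gamma * r
    returns.append(r)
# restore time order
returns.reverse()
print(returns)  # approximately [94.06, 96.02, 98.0, 100.0]
Each step's return folds in the discounted returns of all the steps that follow it, which is exactly what the reverse traversal of self.memory achieves.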
In the Keras implementation, all the routines that we mentioned are implemented as methods of the PolicyAgent class. The role of PolicyAgent is to represent an agent implementing the policy gradient methods, including building and training the network models and predicting the action, the log probability, the entropy, and the state value.
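As an orientation, the interface of the agent can be sketched as follows. This is a condensed skeleton inferred from the method calls that appear in the listings of this chapter, not the book's full class definition:
class PolicyAgent:
    """Skeleton of the policy gradient agent (condensed)."""
    def act(self, state):
        # sample an action from the policy model
        ...
    def value(self, state):
        # predict the state value V(s)
        ...
    def remember(self, item):
        # append one experience unit to the memory
        ...
    def reset_memory(self):
        # clear the memory at the start of an episode
        ...
    def train(self, item, gamma=1.0):
        # per-step training, used by Actor-Critic
        ...
    def train_by_episode(self, last_value=0):
        # episodic training, used by REINFORCE,
        # REINFORCE with baseline, and A2C
        ...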
The following listing shows how one episode unfolds while the agent executes and trains the policy and value models. The for loop runs for 1000 episodes. An episode terminates upon reaching 1000 steps or when the car touches the flag. The agent executes the action predicted by the policy at every step. The training routine is called after every step for Actor-Critic, or after the episode completes for the other algorithms.
Listing 10.6.6, policygradient-car-10.1.1.py: The agent runs for 1000 episodes, executing the action predicted by the policy at every step and performing training:
# sampling and fitting
for episode in range(episode_count):
    state = env.reset()
    # state is car [position, speed]
    state = np.reshape(state, [1, state_dim])
    # reset all variables and memory before the start of
    # every episode
    step = 0
    total_reward = 0
    done = False
    agent.reset_memory()
    while not done:
        # [min, max] action = [-1.0, 1.0]
        # for baseline, random choice of action will not move
        # the car past the flagpole
        if args.random:
            action = env.action_space.sample()
        else:
            action = agent.act(state)
        env.render()
        # after executing the action, get s', r, done
        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, state_dim])
        # save the experience unit in memory for training
        # Actor-Critic does not need this but we keep it anyway
        item = [step, state, next_state, reward, done]
        agent.remember(item)
        if args.actor_critic and train:
            # only Actor-Critic performs online training
            # train at every step as it happens
            agent.train(item, gamma=0.99)
        elif not args.random and done and train:
            # for REINFORCE, REINFORCE with baseline, and A2C
            # we wait for the completion of the episode before
            # training the network(s)
            # last value as used by A2C
            v = 0 if reward > 0 else agent.value(next_state)[0]
            agent.train_by_episode(last_value=v)
        # accumulate reward
        total_reward += reward
        # next state is the new state
        state = next_state
        step += 1
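The listing assumes that an args namespace and a train flag were set up earlier in the script. A minimal sketch of how such a setup could look is shown below; the flag names mirror the attributes used in the listing, and the environment follows the chapter's mountain car example, but the exact names and defaults are assumptions rather than the book's actual code:
import argparse
import gym
import numpy as np

# hypothetical argument parsing; the flag names mirror
# the attributes used in Listing 10.6.6
parser = argparse.ArgumentParser()
parser.add_argument("--random", action="store_true",
                    help="baseline: sample random actions")
parser.add_argument("--actor_critic", action="store_true",
                    help="train online with Actor-Critic at every step")
args = parser.parse_args()

# continuous action mountain car environment
env = gym.make("MountainCarContinuous-v0")
# state is [position, speed], so state_dim is 2
state_dim = env.observation_space.shape[0]
episode_count = 1000
# set to False to run a trained policy without weight updates
train = True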