Advanced Deep Learning with Keras


The training strategy of A2C is different in that it computes gradients from the last step back to the first step. Hence, the return accumulates starting from the last step's reward or from the value of the last next state (a standalone sketch of this reverse pass is given after the listing below):

# the memory is visited in reverse as shown
# in Algorithm 10.5.1
for item in self.memory[::-1]:
    [step, state, next_state, reward, done] = item
    # compute the return
    r = reward + gamma*r
    item = [step, state, next_state, r, done]
    # train per step
    # a2c reward has been discounted
    self.train(item)

The reward variable in each memory item is also replaced by the return. The running return is initialized to zero if the terminal state is reached (that is, the car touches the flag), so that the accumulation starts from the last step's reward, or to the predicted value of the last next state for non-terminal episodes:

v = 0 if reward > 0 else agent.value(next_state)[0]

In the Keras implementation, all the routines that we mentioned are implemented as methods of the PolicyAgent class. The role of PolicyAgent is to represent the agent implementing policy gradient methods, including building and training the network models and predicting the action, log probability, entropy, and state value. The following listing shows how one episode unfolds as the agent executes and trains the policy and value models. The for loop runs for 1000 episodes. An episode terminates upon reaching 1000 steps or when the car touches the flag. The agent executes the action predicted by the policy at every step. After each step (for Actor-Critic) or after each episode (for the other methods), the training routine is called.

Listing 10.6.6, policygradient-car-10.1.1.py: The agent runs for 1000 episodes, executing the action predicted by the policy at every step and performing training:

# sampling and fitting
for episode in range(episode_count):
    state = env.reset()
    # state is car [position, speed]
    state = np.reshape(state, [1, state_dim])
    # reset all variables and memory before the start of
    # every episode
    step = 0
    total_reward = 0
    done = False
    agent.reset_memory()

    while not done:
        # [min, max] action = [-1.0, 1.0]
        # for baseline, random choice of action will not move
        # the car past the flag pole
        if args.random:
            action = env.action_space.sample()
        else:
            action = agent.act(state)
        env.render()
        # after executing the action, get s', r, done
        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, state_dim])
        # save the experience unit in memory for training
        # Actor-Critic does not need this but we keep it anyway.
        item = [step, state, next_state, reward, done]
        agent.remember(item)
        if args.actor_critic and train:
            # only Actor-Critic performs online training,
            # training at every step as it happens
            agent.train(item, gamma=0.99)
        elif not args.random and done and train:
            # for REINFORCE, REINFORCE with baseline, and A2C,
            # we wait for the completion of the episode before
            # training the network(s)
            # last value as used by A2C
            v = 0 if reward > 0 else agent.value(next_state)[0]
            agent.train_by_episode(last_value=v)
        # accumulate reward
        total_reward += reward
        # next state is the new state
        state = next_state
        step += 1
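To make the reverse pass concrete, here is a small self-contained sketch that mirrors the return computation described above, outside of the agent class. The function name compute_returns and the toy numbers are only for illustration; the recursion r = reward + gamma*r and the choice of the bootstrap value follow the snippets shown earlier.

def compute_returns(memory, last_value, gamma=0.99):
    """Replace the reward in each memory item with the return.

    memory items are [step, state, next_state, reward, done] as in
    Listing 10.6.6. last_value is 0 when the episode ended at the
    goal, otherwise the value predicted for the last next state.
    """
    r = last_value
    items = []
    # visit the memory from the last step back to the first step
    for step, state, next_state, reward, done in reversed(memory):
        r = reward + gamma * r
        items.append([step, state, next_state, r, done])
    # restore chronological order
    return items[::-1]

# toy check: 3 steps with reward -1 each, episode truncated with the
# critic valuing the last next state at 2.0
memory = [[t, None, None, -1.0, False] for t in range(3)]
returns = [item[3] for item in compute_returns(memory, last_value=2.0)]
# returns is approximately [-1.0295, -0.0298, 0.98]

The same recursion explains why a positive terminal reward (the car reaching the flag) leads to a bootstrap value of zero: there is no next state left to bootstrap from.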

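Listing 10.6.6 relies on a handful of PolicyAgent methods. As a reading aid, the following skeleton lists the interface the listing assumes; the class name PolicyAgentSketch, the placeholder bodies, and the default argument values are illustrative and are not the book's implementation.

class PolicyAgentSketch:
    """Interface assumed by Listing 10.6.6 (sketch only)."""

    def __init__(self):
        # per-episode buffer of [step, state, next_state, reward, done]
        self.memory = []

    def reset_memory(self):
        # clear the buffer at the start of every episode
        self.memory = []

    def remember(self, item):
        # store one experience unit
        self.memory.append(item)

    def act(self, state):
        # sample an action from the policy (actor) model
        raise NotImplementedError

    def value(self, state):
        # predict the state value with the value (critic) model
        raise NotImplementedError

    def train(self, item, gamma=1.0):
        # single-step update, used online by Actor-Critic and
        # per item by A2C after the reverse return pass
        raise NotImplementedError

    def train_by_episode(self, last_value=0):
        # end-of-episode update for REINFORCE, REINFORCE with
        # baseline, and A2C
        raise NotImplementedError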

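For reference, the random baseline mentioned in the listing's comments can be reproduced with a few lines of code that use the same classic Gym API as the listing. The environment id MountainCarContinuous-v0 and the 1000-step cap are assumptions consistent with the text (a car that must reach a flag, 1000 steps per episode); they are not taken from the listing itself.

import gym

# assumption: the continuous mountain car task
env = gym.make("MountainCarContinuous-v0")

state = env.reset()
total_reward = 0
done = False
step = 0
while not done and step < 1000:
    # random actions rarely push the car past the flag,
    # which is why a learned policy is needed
    action = env.action_space.sample()
    state, reward, done, _ = env.step(action)
    total_reward += reward
    step += 1
print("random baseline total reward:", total_reward)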