
self.replay(self.batch_size)   # train the Q-network on a minibatch sampled from the replay buffer

print('Did not solve after {} episodes :('.format(e))   # reached the episode limit without solving
return avg_scores

That's all – you can now train the agent to play Breakout. The complete code is available in this chapter's GitHub repository, in the file DQN_Atari.ipynb.

DQN variants

After the unprecedented success of DQN, interest in RL increased and many new RL algorithms came into being. Next, we look at some of the algorithms that are based on DQN. They all use DQN as the base and build modifications on top of it.

Double DQN

In DQN, the agent uses the same Q values to both select and evaluate an action. This can cause a maximization bias in learning. For example, consider a state S in which all possible actions have true Q values of zero. Our DQN estimates will have some values above and some below zero, and since we choose the action with the maximum Q value and later evaluate each action using the same (maximized) estimated value function, we overestimate Q – in other words, our agent is over-optimistic. This can lead to unstable training and a low-quality policy.
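The following small NumPy demo (not part of the chapter's code) illustrates the bias: even when every true Q value is zero, taking the maximum over a set of noisy estimates is positive on average, while selecting the action with one set of estimates and evaluating it with an independent second set is not.

import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 4, 100000

# True Q values are all zero; the "estimates" are zero-mean noise.
q_est = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# Select and evaluate with the same estimates: E[max_a Q_hat(s, a)] > 0.
single = q_est.max(axis=1).mean()

# Select with one set of estimates, evaluate with an independent second
# set (the idea behind Double Q-learning): the bias disappears.
q_est2 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
best = q_est.argmax(axis=1)
double = q_est2[np.arange(n_trials), best].mean()

print('single estimator:', single)   # roughly 1.0, clearly above zero
print('double estimator:', double)   # close to zero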

To deal with this issue, van Hasselt et al. from DeepMind proposed the Double DQN algorithm in their paper Deep Reinforcement Learning with Double Q-Learning. In Double DQN, we have two Q-networks with the same architecture but different weights. One Q-network is used to determine the action using the epsilon-greedy policy, and the other is used to determine its value (the Q-target).

If you recall, in DQN the Q-target was given by:

Q_{\text{target}} = R_{t+1} + \gamma \max_{A} Q(S_{t+1}, A)
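To make the difference concrete, here is a short sketch (in NumPy, with placeholder arrays that are not part of the chapter's implementation) of how the two targets would be computed for a batch of transitions: DQN lets the target network both select and evaluate the next action, while Double DQN lets the online network select it and the target network evaluate it.

import numpy as np

gamma = 0.99
batch_size, n_actions = 32, 4

# Placeholder arrays standing in for network outputs and sampled transitions.
q_next_online = np.random.rand(batch_size, n_actions)   # online network, next states
q_next_target = np.random.rand(batch_size, n_actions)   # target network, next states
rewards = np.random.rand(batch_size)
dones = np.zeros(batch_size)                            # 1.0 where the episode ended

# DQN target: the target network both selects and evaluates the action.
dqn_target = rewards + gamma * (1 - dones) * q_next_target.max(axis=1)

# Double DQN target: the online network selects the action,
# the target network evaluates it.
best_actions = q_next_online.argmax(axis=1)
ddqn_target = rewards + gamma * (1 - dones) * q_next_target[np.arange(batch_size), best_actions]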
