Here the action A was selected using the same DQN network Q(S, A; W), where W denotes the training parameters of the network; that is, we write the Q value function along with its training parameters to emphasize the difference between vanilla DQN and Double DQN:

Q_target = R_{t+1} + γ Q(S_{t+1}, argmax_A Q(S_{t+1}, A; W); W)

In Double DQN, the equation for the target changes. Now the DQN Q(S, A; W) is used to determine the action, and the DQN Q(S, A; W') is used to calculate the target (notice the different weights), so the preceding equation becomes:

Q_target = R_{t+1} + γ Q(S_{t+1}, argmax_A Q(S_{t+1}, A; W); W')

This simple change reduces the overestimation and helps us to train the agent faster and more reliably.
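
To make the two roles concrete, the following is a minimal sketch of the Double DQN target computation in TensorFlow 2. The names double_dqn_targets, online_net, and target_net are assumptions made for this example rather than code from the chapter; all that is required is that both models map a batch of states to per-action Q-values.

import tensorflow as tf

# A minimal sketch: `online_net` holds the weights W and `target_net`
# holds the weights W'. Both return Q-values of shape (batch, n_actions).
def double_dqn_targets(online_net, target_net, rewards, next_states,
                       dones, gamma=0.99):
    # Select the action with the online network (weights W).
    next_q_online = online_net(next_states)             # (B, n_actions)
    best_actions = tf.argmax(next_q_online, axis=1)     # (B,)

    # Evaluate that action with the target network (weights W').
    next_q_target = target_net(next_states)             # (B, n_actions)
    selected_q = tf.gather(next_q_target, best_actions,
                           axis=1, batch_dims=1)        # (B,)

    # Q_target = R_{t+1} + gamma * Q(S_{t+1}, argmax_A Q(S_{t+1}, A; W); W'),
    # with no bootstrapping on terminal transitions (dones is 0/1).
    return rewards + gamma * (1.0 - dones) * selected_q

The only difference from the vanilla DQN target is that the action is chosen with the online weights W but its value is read from the target weights W'; with a single network, the same noisy estimate would be both selected and evaluated, which is what causes the overestimation.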

Dueling DQN

This architecture was proposed by Wang et al. in their 2015 paper Dueling Network Architectures for Deep Reinforcement Learning. Like DQN and Double DQN, it is a model-free algorithm.

Dueling DQN decouples the Q-function into the value function and the advantage function. The value function, which we have discussed earlier, represents the value of the state independent of any action. The advantage function, on the other hand, provides a relative measure of the utility (advantage/goodness) of action A in state S. The Dueling DQN uses convolutional layers in the initial stages to extract features from the raw pixels. However, in the later stages the network is split into two separate networks, one approximating the value and the other approximating the advantage. This ensures that the network produces separate estimates for the value function and the advantage function:

Q(S, A) = A(S, A; θ, α) + V^π(S; θ, β)

Here, θ is an array of the training parameters of the shared convolutional network (it is shared by both V and A), while α and β are the training parameters of the advantage and value estimator networks, respectively. Later, the two networks are recombined using an aggregating layer to estimate the Q-value.
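
To illustrate, here is a minimal sketch of such a dueling architecture in TensorFlow 2/Keras. The input shape (84, 84, 4), the layer sizes, and the function name build_dueling_dqn are assumptions made for this example; the aggregating layer uses the mean-subtracted form from the Wang et al. paper, Q(S, A) = V(S) + (A(S, A) - mean_a A(S, a)), which keeps the value and advantage estimates identifiable.

import tensorflow as tf
from tensorflow.keras import layers

def build_dueling_dqn(n_actions, input_shape=(84, 84, 4)):
    inputs = layers.Input(shape=input_shape)

    # Shared convolutional feature extractor (parameters theta).
    x = layers.Conv2D(32, 8, strides=4, activation="relu")(inputs)
    x = layers.Conv2D(64, 4, strides=2, activation="relu")(x)
    x = layers.Conv2D(64, 3, strides=1, activation="relu")(x)
    x = layers.Flatten()(x)

    # Value stream (parameters beta): a single scalar V(S).
    v = layers.Dense(512, activation="relu")(x)
    v = layers.Dense(1)(v)

    # Advantage stream (parameters alpha): one output per action A(S, A).
    a = layers.Dense(512, activation="relu")(x)
    a = layers.Dense(n_actions)(a)

    # Aggregating layer: recombine the two streams into Q-values.
    q = layers.Lambda(
        lambda streams: streams[1] + (
            streams[0] - tf.reduce_mean(streams[0], axis=1, keepdims=True))
    )([a, v])

    return tf.keras.Model(inputs=inputs, outputs=q)

Apart from the head, such a model can be trained exactly like a regular DQN (or Double DQN); only the way the Q-values are produced changes.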
