Deep Reinforcement Learning

The most desirable action is simply the action with the biggest Q value:

Figure 9.6.1: A Deep Q-Network
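To make this concrete, here is a minimal sketch of reading the greedy action off the network's predictions. The model architecture, the state dimension of 8, and n_actions = 4 are illustrative assumptions, not the book's exact network:

import numpy as np
from tensorflow import keras

# Hypothetical Q-Network: state in, one Q value per action out,
# mirroring the right-hand network of Figure 9.6.1.
n_actions = 4
q_model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(8,)),
    keras.layers.Dense(n_actions, activation='linear'),
])

def greedy_action(state):
    """Return the action with the biggest predicted Q value."""
    q_values = q_model.predict(state[np.newaxis, :], verbose=0)[0]
    return int(np.argmax(q_values))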

The data required to train the Q-Network come from the agent's experiences:

\(\left(s_0 a_0 r_1 s_1,\, s_1 a_1 r_2 s_2,\, \ldots,\, s_{T-1} a_{T-1} r_T s_T\right)\). Each training sample is a unit of experience \(s_t a_t r_{t+1} s_{t+1}\). At a given state at timestep \(t\), \(s = s_t\), the action, \(a = a_t\), is determined using the Q-Learning algorithm similar to the previous section:

\[
\pi(s) =
\begin{cases}
\text{sample } a & \text{random} < \varepsilon \\
\underset{a}{\arg\max}\; Q(s, a) & \text{otherwise}
\end{cases}
\]

(Equation 9.6.1)
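As a sketch, Equation 9.6.1 translates into a few lines of Python, reusing the hypothetical q_model and n_actions defined above; the epsilon value is an arbitrary placeholder:

import numpy as np

def epsilon_greedy_action(state, epsilon=0.1):
    """Epsilon-greedy policy of Equation 9.6.1: explore with probability
    epsilon, otherwise exploit the action with the largest Q value."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)        # sample a random action
    q_values = q_model.predict(state[np.newaxis, :], verbose=0)[0]
    return int(np.argmax(q_values))                # argmax_a Q(s, a)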

For notational simplicity, we omit the subscript and the use of bold letters. Note that \(Q(s, a)\) is the Q-Network. Strictly speaking, it is \(Q(a|s)\) since the action is moved to the prediction, as shown on the right of Figure 9.6.1. The action with the highest Q value is the action that is applied to the environment to get the reward, \(r = r_{t+1}\), the next state, \(s' = s_{t+1}\), and a Boolean done indicating whether the next state is terminal. From Equation 9.5.1 on generalized Q-Learning, an MSE loss function can be determined by applying the chosen action:

\[
L = \left(r + \gamma\, \underset{a'}{\max}\; Q(s', a') - Q(s, a)\right)^2
\]

(Equation 9.6.2)
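A hedged sketch of one training step on Equation 9.6.2, again reusing the hypothetical q_model; gamma and the single-sample update are illustrative assumptions, and the same network is used to bootstrap the target (an actual implementation may instead bootstrap from a separate target network):

import numpy as np

gamma = 0.99  # discount factor, illustrative value
q_model.compile(loss='mse', optimizer='adam')  # MSE loss as in Equation 9.6.2

def train_on_experience(state, action, reward, next_state, done):
    """One gradient step on the MSE loss of Equation 9.6.2 for a single
    unit of experience s_t a_t r_{t+1} s_{t+1}."""
    # Current predictions Q(s, a) for all actions.
    q_values = q_model.predict(state[np.newaxis, :], verbose=0)[0]

    # Target r + gamma * max_a' Q(s', a'); just r when the next state is terminal.
    if done:
        target = reward
    else:
        next_q = q_model.predict(next_state[np.newaxis, :], verbose=0)[0]
        target = reward + gamma * np.max(next_q)

    # Only the chosen action's entry is changed, so the MSE over the output
    # vector reduces to Equation 9.6.2 for the pair (s, a).
    q_values[action] = target
    q_model.fit(state[np.newaxis, :], q_values[np.newaxis, :], verbose=0)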

