Advanced Deep Learning with Keras


Chapter 9: Deep Reinforcement Learning

Where all terms are familiar from the previous discussion on Q-Learning and $Q(a|s) \rightarrow Q(s, a)$. The term $\max_{a'} Q(s', a') \rightarrow \max_{a'} Q(a'|s')$. In other words, using the Q-Network, predict the Q value of each action given the next state and take the maximum among them. Note that at the terminal state $s'$, $\max_{a'} Q(a'|s') = \max_{a'} Q(s', a') = 0$.

Algorithm 9.6.1, DQN algorithm:

Require: Initialize replay memory $D$ to capacity $N$
Require: Initialize action-value function $Q$ with random weights $\theta$
Require: Initialize target action-value function $Q_{target}$ with weights $\theta^- = \theta$
Require: Exploration rate $\varepsilon$ and discount factor $\gamma$

1. for episode = 1, ..., M do:
2. Given initial state $s$
3. for step = 1, ..., T do:
4. Choose action $a = \begin{cases} \text{sample}(a) & \text{random} < \varepsilon \\ \underset{a}{\mathrm{argmax}}\, Q(s, a; \theta) & \text{otherwise} \end{cases}$
5. Execute action $a$, observe reward $r$ and next state $s'$
6. Store transition $(s, a, r, s')$ in $D$
7. Update the state, $s = s'$
8. // experience replay
9. Sample a mini-batch of episode experiences $(s_j, a_j, r_{j+1}, s_{j+1})$ from $D$
10. $Q_{max} = \begin{cases} r_{j+1} & \text{if episode terminates at } j+1 \\ r_{j+1} + \gamma \max_{a_{j+1}} Q_{target}(s_{j+1}, a_{j+1}; \theta^-) & \text{otherwise} \end{cases}$
11. Perform a gradient descent step on $(Q_{max} - Q(s_j, a_j; \theta))^2$ with respect to the parameters $\theta$
12. // periodic update of the target network
13. Every $C$ steps, set $Q_{target} = Q$, that is, set $\theta^- = \theta$
14. End

However, it turns out that training the Q-Network is unstable. There are two problems causing the instability:

1. A high correlation between samples
2. A non-stationary target
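As a concrete illustration of steps 4, 10, 11, and 13 of Algorithm 9.6.1, the following is a minimal sketch of one DQN update in Keras. It assumes two compiled Keras models with identical architectures, here called q_model and target_model; these names and helper functions are illustrative, not the book's listing.

```python
import numpy as np

# Minimal sketch of one DQN update (Algorithm 9.6.1, steps 4, 10, 11, and 13).
# q_model and target_model are assumed to be compiled Keras models with
# identical architectures; all names here are illustrative.

def act(q_model, state, epsilon, n_actions):
    """Step 4: epsilon-greedy action selection."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)              # explore
    q_values = q_model.predict(state[np.newaxis], verbose=0)
    return int(np.argmax(q_values[0]))                   # exploit

def replay_step(q_model, target_model, batch, gamma):
    """Steps 9 to 11: compute Q_max targets and take one gradient descent step."""
    states, actions, rewards, next_states, dones = batch
    q_values = q_model.predict(states, verbose=0)
    q_next = target_model.predict(next_states, verbose=0)
    # Step 10: Q_max = r if the episode terminates, else r + gamma * max Q_target
    q_max = rewards + (1.0 - dones) * gamma * np.amax(q_next, axis=1)
    # Move only the Q value of the action that was actually taken toward Q_max
    q_values[np.arange(len(actions)), actions.astype(int)] = q_max
    q_model.train_on_batch(states, q_values)             # step 11

def sync_target(q_model, target_model):
    """Step 13: every C training steps, copy the weights (theta^- = theta)."""
    target_model.set_weights(q_model.get_weights())
```

Here the mean squared error loss of the compiled q_model plays the role of the $(Q_{max} - Q(s_j, a_j; \theta))^2$ term in step 11.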

A high correlation is due to the sequential nature of sampling experiences. DQN addressed this issue by creating a buffer of experiences. The training data are randomly sampled from this buffer. This process is known as experience replay.

The issue of the non-stationary target is due to the target network $Q(s', a')$ that is modified after every mini-batch of training. A small change in the target network can create a significant change in the policy, the data distribution, and the correlation between the current Q value and the target Q value. This is resolved by freezing the weights of the target network for $C$ training steps. In other words, two identical Q-Networks are created. The target Q-Network parameters are copied from the Q-Network under training every $C$ training steps.

The DQN algorithm is summarized in Algorithm 9.6.1.

DQN on Keras

To illustrate DQN, the CartPole-v0 environment of the OpenAI Gym is used. CartPole-v0 is a pole balancing problem. The goal is to keep the pole from falling over. The environment is 2D. The action space is made of two discrete actions (left and right movements). However, the state space is continuous and is made of four variables:

1. Linear position
2. Linear velocity
3. Angle of rotation
4. Angular velocity

The CartPole-v0 environment is shown in Figure 9.6.1. Initially, the pole is upright. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole exceeds 15 degrees from the vertical or 2.4 units from the center. The CartPole-v0 problem is considered solved if the average reward is 195.0 in 100 consecutive trials:

Figure 9.6.1: The CartPole-v0 environment
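To make the setup above concrete, the following is a minimal sketch of the CartPole-v0 environment, a small Q-Network for the four-dimensional state and two actions, and the replay memory $D$. The layer sizes, buffer capacity, and batch size are illustrative assumptions, not the book's exact listing.

```python
from collections import deque
import random

import gym
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Sketch of the CartPole-v0 setup and the replay memory D described above;
# hyperparameters are illustrative choices, not the book's values.
env = gym.make("CartPole-v0")
state_dim = env.observation_space.shape[0]    # 4 continuous state variables
n_actions = env.action_space.n                # 2 discrete actions (left, right)

# Q-Network: maps the 4-dim state to one Q value per action
q_model = Sequential([
    Dense(64, activation="relu", input_shape=(state_dim,)),
    Dense(64, activation="relu"),
    Dense(n_actions, activation="linear"),
])
q_model.compile(optimizer="adam", loss="mse")

memory = deque(maxlen=100000)                 # replay memory D with capacity N

def remember(state, action, reward, next_state, done):
    """Store a transition (s, a, r, s') in D."""
    memory.append((state, action, reward, next_state, float(done)))

def sample_batch(batch_size=64):
    """Randomly sample a mini-batch to break the correlation between samples."""
    batch = random.sample(memory, batch_size)
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    return states, actions, rewards, next_states, dones
```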

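Tying the pieces together, a possible outer training loop following steps 1 to 7 of Algorithm 9.6.1 is sketched below. It reuses the illustrative helpers from the previous sketches (act, remember, sample_batch, replay_step, sync_target) and assumes the older Gym reset/step API used with CartPole-v0; the hyperparameters are placeholder values, not the book's.

```python
from tensorflow.keras.models import clone_model

# Illustrative DQN training loop (Algorithm 9.6.1, steps 1 to 7), reusing the
# helpers sketched earlier; hyperparameters are placeholders, not tuned values.
target_model = clone_model(q_model)            # identical target Q-Network
target_model.set_weights(q_model.get_weights())

epsilon, gamma = 1.0, 0.99                     # exploration rate and discount factor
batch_size, sync_every = 64, 100               # mini-batch size and C
step_count = 0

for episode in range(1000):                    # step 1
    state = env.reset()                        # step 2 (older Gym API)
    done = False
    while not done:                            # step 3
        action = act(q_model, state, epsilon, n_actions)      # step 4
        next_state, reward, done, _ = env.step(action)        # step 5 (older Gym API)
        remember(state, action, reward, next_state, done)     # step 6
        state = next_state                                    # step 7
        step_count += 1
        if len(memory) >= batch_size:                         # steps 9 to 11
            replay_step(q_model, target_model, sample_batch(batch_size), gamma)
        if step_count % sync_every == 0:                      # step 13
            sync_target(q_model, target_model)
    epsilon = max(0.01, epsilon * 0.995)       # gradually reduce exploration
```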
