Advanced Deep Learning with Keras
Where all the terms are familiar from the previous discussion on Q-Learning, and Q(a|s) → Q(s, a). The term max_{a′} Q(a′|s′) → max_{a′} Q(s′, a′). In other words, using the Q-Network, predict the Q value of each action given the next state and get the maximum among them. Note that at the terminal state s′, max_{a′} Q(a′|s′) = max_{a′} Q(s′, a′) = 0.

Algorithm 9.6.1, DQN algorithm:

Require: Initialize replay memory D to capacity N
Require: Initialize action-value function Q with random weights θ
Require: Initialize target action-value function Q_target with weights θ⁻ = θ
Require: Exploration rate ε and discount factor γ
1. for episode = 1, …, M do:
2.     Given initial state s
3.     for step = 1, …, T do:
4.         Choose action a = sample(a) if random < ε, otherwise a = argmax_a Q(s, a; θ)
5.         Execute action a, observe reward r and next state s′
6.         Store transition (s, a, r, s′) in D
7.         Update the state, s = s′
8.         // experience replay
9.         Sample a mini-batch of episode experiences (s_j, a_j, r_{j+1}, s_{j+1}) from D
10.        Q_max = r_{j+1} if the episode terminates at j+1, otherwise Q_max = r_{j+1} + γ max_{a_{j+1}} Q_target(s_{j+1}, a_{j+1}; θ⁻)
11.        Perform a gradient descent step on (Q_max − Q(s_j, a_j; θ))² with respect to parameters θ
12.        // periodic update of the target network
13.        Every C steps set Q_target = Q, that is, θ⁻ = θ
14. End
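The experience replay and target network update in steps 9 to 13 map almost directly onto Keras. The following is only a minimal sketch under assumptions, not the book's own listing: q_model and target_q_model are assumed to be compiled Keras models (MSE loss) that map a state of shape (1, state_dim) to one Q value per action, and memory is assumed to be a list of (state, action, reward, next_state, done) tuples.

import random
import numpy as np

def replay(q_model, target_q_model, memory, batch_size=64, gamma=0.9):
    # Step 9: sample a mini-batch of experiences from the replay memory D
    batch = random.sample(memory, batch_size)
    states, q_targets = [], []
    for state, action, reward, next_state, done in batch:
        # Step 10: Q_max = r if the episode terminated, otherwise
        # Q_max = r + gamma * max_a' Q_target(s', a'; theta^-)
        q_max = reward
        if not done:
            q_max += gamma * np.amax(target_q_model.predict(next_state)[0])
        # Only the chosen action's output is pushed towards Q_max; the other
        # outputs keep their current predictions so they contribute no error
        q_values = q_model.predict(state)[0]
        q_values[action] = q_max
        states.append(state[0])
        q_targets.append(q_values)
    # Step 11: one gradient descent step on (Q_max - Q(s_j, a_j; theta))^2
    q_model.fit(np.array(states), np.array(q_targets),
                batch_size=batch_size, epochs=1, verbose=0)

def update_target_network(q_model, target_q_model):
    # Step 13: every C training steps, copy theta into theta^-
    target_q_model.set_weights(q_model.get_weights())

Keeping the unchosen actions at their current predictions is what turns the per-sample scalar target Q_max into a full output vector that a standard MSE loss can consume.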
However, it turns out that training the Q-Network is unstable. There are two problems causing the instability:

1. A high correlation between samples
2. A non-stationary target

The high correlation is due to the sequential nature of sampling experiences. DQN addresses this issue by creating a buffer of experiences. The training data are randomly sampled from this buffer. This process is known as experience replay.

The issue of the non-stationary target is due to the target network Q(s′, a′) being modified after every mini-batch of training. A small change in the target network can create a significant change in the policy, the data distribution, and the correlation between the current Q value and the target Q value. This is resolved by freezing the weights of the target network for C training steps. In other words, two identical Q-Networks are created. The target Q-Network parameters are copied from the Q-Network under training every C training steps. The DQN algorithm is summarized in Algorithm 9.6.1 above.

DQN on Keras

To illustrate DQN, the CartPole-v0 environment of the OpenAI Gym is used. CartPole-v0 is a pole balancing problem; the goal is to keep the pole from falling over. The environment is 2D. The action space is made of two discrete actions (left and right movements), but the state space is continuous and is made of four variables:

1. Linear position
2. Linear velocity
3. Angle of rotation
4. Angular velocity

The CartPole-v0 environment is shown in Figure 9.6.1. Initially, the pole is upright. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole tilts more than 15 degrees from the vertical or the cart moves more than 2.4 units from the center. The CartPole-v0 problem is considered solved if the average reward is 195.0 in 100 consecutive trials.

Figure 9.6.1: The CartPole-v0 environment
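Before building the Q-Network, it helps to probe the environment itself. The snippet below is a small illustrative sketch, not taken from the book's listings; it assumes the classic gym API, in which reset() returns the state and step() returns a (state, reward, done, info) tuple, and it simply runs one episode with random actions:

import gym

env = gym.make('CartPole-v0')
print(env.observation_space)    # 4 continuous variables: position, velocity, angle, angular velocity
print(env.action_space)         # Discrete(2): push the cart left or right

state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()           # uniformly random action
    state, reward, done, _ = env.step(action)    # reward is +1 per upright timestep
    total_reward += reward
print("episode return:", total_reward)

A purely random policy typically ends the episode after only a few dozen timesteps, well below the average reward of 195.0 required to consider the problem solved, which is why the Q-Network and the training procedure of Algorithm 9.6.1 are needed.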