Advanced Deep Learning with Keras
Chapter 9

Initially, the agent assumes a policy that selects a random action 90% of the time and exploits the Q-Table 10% of the time. Suppose the first action is randomly chosen and indicates a move in the right direction. Figure 9.3.3 illustrates the computation of the new Q value of state (0, 0) for a move to the right action. The next state is (0, 1). The reward is 0, and the maximum of all the next state's Q values is zero. Therefore, the Q value of state (0, 0) for a move to the right action remains 0.

To easily track the initial state and the next state, we use different shades of gray on both the environment and the Q-Table: lighter gray for the initial state and darker gray for the next state. In choosing the next action for the next state, the candidate actions are shown with a thicker border:

Figure 9.3.3: Assuming the action taken by the agent is a move to the right, the update on the Q value of state (0, 0) is shown
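The same update can be written out in a few lines of NumPy. The snippet below is a minimal sketch of the epsilon-greedy choice and the Figure 9.3.3 computation, not the chapter's QWorld implementation; the 2 x 3 grid shape, the action encoding, and the discount factor value are assumptions made for the example.

import numpy as np

# Minimal sketch of the Figure 9.3.3 update (not the chapter's QWorld code).
# Assumed: a 2 x 3 grid of states and actions encoded as
# 0 = Left, 1 = Down, 2 = Right, 3 = Up.
n_rows, n_cols, n_actions = 2, 3, 4
q_table = np.zeros((n_rows, n_cols, n_actions))
gamma = 0.9     # discount factor (illustrative value)
epsilon = 0.9   # the agent explores 90% of the time at the start

def choose_action(state):
    """Epsilon-greedy: random action 90% of the time, exploit the Q-Table otherwise."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q_table[state]))

# Move to the right from the Start state (0, 0): the reward is 0 and the
# maximum Q value of the next state (0, 1) is 0, so the entry stays 0.
state, action, next_state, reward = (0, 0), 2, (0, 1), 0
q_table[state][action] = reward + gamma * np.max(q_table[next_state])
print(q_table[state][action])  # 0.0, matching Figure 9.3.3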
Figure 9.3.4: Assuming the action chosen by the agent is a move down, the update on the Q value of state (0, 1) is shown

Figure 9.3.5: Assuming the action chosen by the agent is a move to the right, the update on the Q value of state (1, 1) is shown

Let's suppose that the next randomly chosen action is a move down. Figure 9.3.4 shows no change in the Q value of state (0, 1) for the move down action. In Figure 9.3.5, the agent's third random action is a move to the right. It encounters the H state and receives a -100 reward. This time, the update is non-zero. The new Q value of state (1, 1) for the move to the right action is -100. One episode has just finished, and the agent returns to the Start state.
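This terminal update can be sketched in the same way. The snippet below is again an illustration rather than the chapter's QWorld code; the hole's position, the action encoding, and the explicit terminal check are assumptions, and in this example the bootstrapped max term is zero anyway, so the result is -100 either way.

import numpy as np

# Minimal, self-contained sketch of the Figure 9.3.5 update.
# Assumed: a 2 x 3 grid, actions 0 = Left, 1 = Down, 2 = Right, 3 = Up.
q_table = np.zeros((2, 3, 4))
gamma = 0.9

# The third random action moves right from (1, 1) into the hole H,
# a terminal state with a reward of -100.
state, action, reward, done = (1, 1), 2, -100, True
next_state = (1, 2)   # assumed position of H, inferred from the walkthrough
# With no future return to bootstrap from, the new Q value is just the reward.
target = reward if done else reward + gamma * np.max(q_table[next_state])
q_table[state][action] = target
print(q_table[state][action])  # -100.0, matching Figure 9.3.5

# The episode ends here; the agent returns to the Start state (0, 0)
# before the next episode begins.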