Figure 9.3.10: The value for each state from Figure 9.3.9 and Equation 9.2.2

Nondeterministic environment
In the event that the environment is nondeterministic, both the reward and action are probabilistic. The new system is a stochastic MDP. To reflect the nondeterministic reward, the new value function is:

V^{\pi}(s_t) = \mathbb{E}[R_t] = \mathbb{E}\left[\sum_{k=0}^{T} \gamma^k r_{t+k}\right] (Equation 9.4.1)

The Bellman equation is modified as:

Q(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q(s', a')\right] (Equation 9.4.2)

Temporal-difference learning
Q-Learning is a special case of a more generalized Temporal-Difference Learning, or TD-Learning, TD(\lambda). More specifically, it is a special case of one-step TD-Learning, TD(0):

Q(s, a) = Q(s, a) + \alpha\left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right) (Equation 9.5.1)

In the equation, \alpha is the learning rate. We should note that when \alpha = 1, Equation 9.5.1 is similar to the Bellman equation. For simplicity, we will refer to Equation 9.5.1 as Q-Learning or generalized Q-Learning.
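The update in Equation 9.5.1 is just a weighted move of the current estimate toward the TD target. The following is a minimal sketch of the rule as a tabular update; the state and action counts and the q_table name are placeholders for illustration, not the chapter's actual code:

import numpy as np

# Illustrative sizes only; a real environment defines these
n_states, n_actions = 16, 4
gamma = 0.9     # discount factor
alpha = 0.1     # learning rate
q_table = np.zeros((n_states, n_actions))

def td0_update(state, action, reward, next_state):
    """One-step TD (Equation 9.5.1): move Q(s, a) toward the TD target."""
    td_target = reward + gamma * np.max(q_table[next_state])
    td_error = td_target - q_table[state, action]
    q_table[state, action] += alpha * td_error

With alpha = 1.0, the current estimate is replaced by the target outright, which recovers the Bellman-style update used for the deterministic world earlier in the chapter.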
Previously, we referred to Q-Learning as an off-policy RL algorithm since it learns
the Q value function without directly using the policy that it is trying to optimize.
An example of an on-policy one-step TD-Learning algorithm is SARSA, which is similar to Equation 9.5.1:

Q(s, a) = Q(s, a) + \alpha\left(r + \gamma Q(s', a') - Q(s, a)\right) (Equation 9.5.2)
The main difference is the use of the policy that is being optimized to determine a'. The terms s, a, r, s', and a' (thus the name SARSA) must be known to update the Q value function at every iteration. Both Q-Learning and SARSA use existing estimates in the Q value iteration, a process known as bootstrapping. In bootstrapping, we update the current Q value estimate from the reward and the subsequent Q value estimate(s).
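The contrast between the two updates is easiest to see side by side. Below is a minimal sketch assuming a tabular q_table; the function names and default alpha and gamma values are illustrative only:

import numpy as np

def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy target (Equation 9.5.1): uses the greedy action, max over a'."""
    target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (target - q_table[s, a])

def sarsa_update(q_table, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy target (Equation 9.5.2): uses a', the action the policy actually takes next."""
    target = r + gamma * q_table[s_next, a_next]
    q_table[s, a] += alpha * (target - q_table[s, a])

Note that sarsa_update() cannot be called until the next action a' has been chosen (for example, epsilon-greedily from the policy being optimized), whereas q_learning_update() only needs the next state.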
Q-Learning on OpenAI gym
Before presenting another example, we need a suitable RL simulation environment. Otherwise, we can only run RL simulations on very simple problems like the one in the previous example. Fortunately, OpenAI created Gym, https://gym.openai.com.
The gym is a toolkit for developing and comparing RL algorithms. It works with
most deep learning libraries, including Keras. The gym can be installed by running
the following command:
$ sudo pip3 install gym
The gym has several environments against which an RL algorithm can be tested, such as toy text, classic control, algorithmic, Atari, and 2D/3D robots. For example, FrozenLake-v0 (Figure 9.5.1) is a toy text environment similar to the simple deterministic world used in the Q-Learning in Python example. FrozenLake-v0 has 16 states arranged in a 4 x 4 grid. The state marked S is the starting state, F is the frozen part of the lake, which is safe, H is the Hole state that should be avoided, and G is the Goal state where the frisbee is. The reward is +1 for transitioning to the Goal state. For all other states, the reward is zero.
In FrozenLake-v0, there are also four available actions (Left, Down, Right, Up), known as the action space. However, unlike the simple deterministic world earlier, the actual movement direction is only partially dependent on the chosen action. There are two variations of the FrozenLake-v0 environment, slippery and non-slippery. As expected, the slippery mode is more challenging.
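Before diving into the Q-Learning code, it is worth seeing the environment interface itself. The short script below is a sketch using the classic gym API, where env.reset() returns the initial state and env.step() returns a (state, reward, done, info) tuple; it simply runs one episode with random actions:

import gym

# Create the FrozenLake-v0 environment (slippery by default)
env = gym.make("FrozenLake-v0")

print(env.observation_space)    # Discrete(16), one state per grid cell
print(env.action_space)         # Discrete(4): Left, Down, Right, Up

state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()          # pick a random action
    state, reward, done, _ = env.step(action)   # apply it to the environment
    total_reward += reward

env.render()                    # print the grid and the agent's final position
print("episode reward:", total_reward)

In gym versions from around this time, a non-slippery variant can be obtained by registering a copy of the environment with is_slippery=False in its kwargs.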