Advanced Deep Learning with Keras


Figure 9.3.10: The value for each state from Figure 9.3.9 and Equation 9.2.2

Nondeterministic environment

In the event that the environment is nondeterministic, both the reward and the action are probabilistic. The new system is a stochastic MDP. To reflect the nondeterministic reward, the new value function is:

V^\pi(s) = \mathbb{E}[R_t] = \mathbb{E}\left[ \sum_{k=0}^{T} \gamma^k r_{t+k} \right] (Equation 9.4.1)

The Bellman equation is modified as:

Q(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q(s', a') \right] (Equation 9.4.2)

Temporal-difference learning

Q-Learning is a special case of a more generalized Temporal-Difference Learning, or TD-Learning, TD(\lambda). More specifically, it is a special case of one-step TD-Learning, TD(0):

Q(s, a) = Q(s, a) + \alpha\left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) (Equation 9.5.1)

In the equation, \alpha is the learning rate. We should note that when \alpha = 1, Equation 9.5.1 is similar to the Bellman equation. For simplicity, we'll refer to Equation 9.5.1 as Q-Learning or generalized Q-Learning.
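To make Equation 9.5.1 concrete, the tabular update can be written in a few lines of Python. This is a minimal sketch rather than the chapter's listing; the function name q_learning_update, the NumPy Q table indexed by [state, action], and the alpha and gamma values are illustrative assumptions.

import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9):
    # One tabular Q-Learning step following Equation 9.5.1 (illustrative sketch).
    # The TD target bootstraps on the best Q value of the next state.
    td_target = reward + gamma * np.max(q_table[next_state])
    # The TD error is scaled by the learning rate alpha.
    td_error = td_target - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return q_table

Note that with alpha = 1, the update reduces to assigning the Bellman target directly, which is why Equation 9.5.1 resembles the Bellman equation in that case.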



Previously, we referred to Q-Learning as an off-policy RL algorithm since it learns the Q value function without directly using the policy that it is trying to optimize. An example of an on-policy one-step TD-Learning algorithm is SARSA, which is similar to Equation 9.5.1:

Q(s, a) = Q(s, a) + \alpha\left( r + \gamma Q(s', a') - Q(s, a) \right) (Equation 9.5.2)

The main difference is the use of the policy that is being optimized to determine a'. The terms s, a, r, s', and a' (thus the name SARSA) must be known to update the Q value function at every iteration. Both Q-Learning and SARSA use existing estimates in the Q value iteration, a process known as bootstrapping. In bootstrapping, we update the current Q value estimate from the reward and the subsequent Q value estimate(s).
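To contrast the two updates, here is a SARSA step corresponding to Equation 9.5.2. This is a hedged sketch, not the book's code; sarsa_update and epsilon_greedy are hypothetical helper names, and the epsilon, alpha, and gamma values are illustrative. The key difference from Q-Learning is that the bootstrap term uses next_action, the action actually chosen in the next state by the policy being optimized, instead of the greedy max.

import numpy as np

def epsilon_greedy(q_table, state, epsilon=0.1):
    # Sample an action from the policy being optimized
    # (explore with probability epsilon, otherwise act greedily).
    if np.random.rand() < epsilon:
        return np.random.randint(q_table.shape[1])
    return int(np.argmax(q_table[state]))

def sarsa_update(q_table, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    # On-policy TD target: bootstrap on Q(s', a') for the action the policy actually takes.
    td_target = reward + gamma * q_table[next_state, next_action]
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table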

Q-Learning on OpenAI gym

Before presenting another example, we need a suitable RL simulation environment; otherwise, we can only run RL simulations on very simple problems like the one in the previous example. Fortunately, OpenAI created Gym, https://gym.openai.com.

The gym is a toolkit for developing and comparing RL algorithms. It works with most deep learning libraries, including Keras. The gym can be installed by running the following command:

$ sudo pip3 install gym
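As a quick sanity check (a minimal sketch, assuming the classic gym API in which reset() returns the observation and step() returns a 4-tuple), the installation can be verified from a Python shell:

import gym

print(gym.__version__)
env = gym.make("CartPole-v0")      # any built-in environment will do
state = env.reset()                # initial observation
state, reward, done, info = env.step(env.action_space.sample())  # one random action
env.close()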

The gym has several environments against which an RL algorithm can be tested, such as toy text, classic control, algorithmic, Atari, and 2D/3D robots. For example, FrozenLake-v0 (Figure 9.5.1) is a toy text environment similar to the simple deterministic world used in the Q-Learning in Python example. FrozenLake-v0 has 16 states. The state marked S is the starting state, F is the frozen part of the lake which is safe, H is the Hole state that should be avoided, and G is the Goal state where the frisbee is. The reward is +1 for transitioning to the Goal state. For all other states, the reward is zero.

In FrozenLake-v0, there are also four available actions (Left, Down, Right, Up), known as the action space. However, unlike the simple deterministic world earlier, the actual movement direction is only partially dependent on the chosen action. There are two variations of the FrozenLake-v0 environment, slippery and non-slippery. As expected, the slippery mode is more challenging.
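The environment described above can be instantiated and inspected as in the sketch below. This is illustrative rather than the chapter's listing; in particular, selecting the non-slippery variant by passing is_slippery to gym.make() assumes a gym version whose make() forwards keyword arguments to the environment.

import gym

# FrozenLake-v0: a 4x4 text grid with Discrete(16) states and Discrete(4) actions.
env = gym.make("FrozenLake-v0")
print(env.observation_space)    # Discrete(16)
print(env.action_space)         # Discrete(4): Left, Down, Right, Up

state = env.reset()
env.render()                    # prints the S/F/H/G map

# In the default slippery mode, the agent may not move in the chosen direction.
next_state, reward, done, info = env.step(env.action_space.sample())

# Non-slippery (deterministic) variant -- assumes keyword forwarding in gym.make().
env_det = gym.make("FrozenLake-v0", is_slippery=False)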

