

Figure 9.3.10: The value for each state from Figure 9.3.9 and Equation 9.2.2

Nondeterministic environment

If the environment is nondeterministic, both the reward and the action are probabilistic, and the system becomes a stochastic MDP. To reflect the nondeterministic reward, the value function is redefined as an expectation:

$$V^{\pi}(s) = \mathbb{E}\left[R_t\right] = \mathbb{E}\left[\sum_{k=0}^{T} \gamma^k r_{t+k}\right] \quad \text{(Equation 9.4.1)}$$
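To make Equation 9.4.1 concrete, the sketch below estimates $V^{\pi}(s)$ by Monte Carlo sampling: it averages sampled discounted returns over many episodes. This is a minimal illustration, not code from the book; the `env.reset(state)`/`env.step(action)` interface and the `policy` callable are hypothetical.

```python
import numpy as np

def estimate_value(env, policy, state, gamma=0.9, episodes=1000, horizon=100):
    """Monte Carlo estimate of V^pi(s) per Equation 9.4.1:
    average the discounted return sum_k gamma^k * r_{t+k}
    over sampled episodes. `env` and `policy` are hypothetical:
    env.reset(state) restarts from `state`, and env.step(action)
    returns (next_state, reward, done)."""
    returns = []
    for _ in range(episodes):
        s = env.reset(state)
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = env.step(a)
            total += discount * r
            discount *= gamma
            if done:
                break
        returns.append(total)
    # The expectation E[R_t] is approximated by the sample mean
    return float(np.mean(returns))
```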

The Bellman equation is modified as:

$$Q(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q(s', a')\right] \quad \text{(Equation 9.4.2)}$$
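When the transition model is known, the expectation over $s'$ in Equation 9.4.2 can be evaluated exactly as a probability-weighted sum. A minimal sketch, assuming tabular NumPy arrays `P[s, a, s']` (transition probabilities) and `R[s, a, s']` (rewards); these array layouts are assumptions for illustration:

```python
import numpy as np

def bellman_backup(Q, P, R, s, a, gamma=0.9):
    """Exact backup of Equation 9.4.2 for a tabular stochastic MDP.
    P[s, a, s'] is the transition probability and R[s, a, s'] the
    reward (assumed layouts). The expectation over s' becomes a
    probability-weighted sum over all next states."""
    best_next = np.max(Q, axis=1)  # max_{a'} Q(s', a') for every s'
    # E_{s'}[ r + gamma * max_{a'} Q(s', a') ]
    return np.sum(P[s, a] * (R[s, a] + gamma * best_next))
```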

Temporal-difference learning

Q-Learning is a special case of the more general Temporal-Difference Learning, or TD-Learning, TD($\lambda$). More specifically, it is a special case of one-step TD-Learning, TD(0):

$$Q(s, a) = Q(s, a) + \alpha\left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right) \quad \text{(Equation 9.5.1)}$$

In the equation, $\alpha$ is the learning rate. We should note that when $\alpha = 1$, Equation 9.5.1 is similar to the Bellman equation. For simplicity, we'll refer to Equation 9.5.1 as Q-Learning or generalized Q-Learning.
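Equation 9.5.1 maps directly onto a one-line tabular update. A minimal sketch, assuming a NumPy Q-table indexed by `(state, action)` and a sampled transition `(s, a, r, s_next)`; the function name and signature are illustrative, not from the book:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step of Equation 9.5.1 on a tabular Q-table.
    The TD error is the gap between the bootstrapped target
    and the current estimate; alpha sets how far Q(s, a)
    moves toward the target."""
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
```

With `alpha=1.0` the update overwrites $Q(s, a)$ with the target, recovering the Bellman form; a smaller $\alpha$ averages over the randomness in rewards and transitions, which is what makes the update usable in a stochastic MDP.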

