
Chapter 9

Where:

$V^*(s) = \max_{a} Q(s, a)$ (Equation 9.2.2)

In other words, instead of finding the policy that maximizes the value for all states,

Equation 9.2.1 looks for the action that maximizes the quality (Q) value for all states.

After finding the Q value function, $V^*$ and hence $\pi^*$ are determined by Equations 9.2.2 and 9.1.3, respectively.
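
As a concrete illustration, once the Q value function is available in tabular form, Equations 9.2.1 and 9.2.2 reduce to an argmax and a max over the action dimension. The sketch below assumes a hypothetical NumPy array q_table of shape (number of states, number of actions); the array contents and names are illustrative, not taken from the book's example code.

```python
import numpy as np

# Hypothetical converged Q table: rows are states, columns are actions.
q_table = np.array([[0.0, 0.9],
                    [0.5, 0.1],
                    [0.2, 0.0]])

# Equation 9.2.1: the optimal policy picks the action with the largest Q value.
optimal_policy = np.argmax(q_table, axis=1)  # pi*(s) for every state s

# Equation 9.2.2: the optimal state value is the largest Q value in each state.
optimal_value = np.max(q_table, axis=1)      # V*(s) for every state s

print("pi*:", optimal_policy)  # [1 0 0]
print("V*:", optimal_value)    # [0.9 0.5 0.2]
```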

If, for every action, the reward and the next state can be observed, we can formulate the following iterative, trial-and-error algorithm to learn the Q value:

$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$ (Equation 9.2.3)

For notational simplicity, $s'$ and $a'$ denote the next state and action, respectively.

Equation 9.2.3 is known as the Bellman Equation, which is the core of the Q-Learning algorithm. Q-Learning attempts to approximate the first-order expansion of the return or value (Equation 9.1.2) as a function of both the current state and action.

From zero knowledge of the dynamics of the environment, the agent tries an action a,

observes what happens in the form of a reward, r, and the next state, $s'$. The term $\max_{a'} Q(s', a')$ chooses

all terms in Equation 9.2.3 known, the Q value for that current state-action pair is

updated. Doing the update iteratively will eventually learn the Q value function.
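
As a minimal sketch of this iterative update, written directly from Equation 9.2.3 for a tabular Q function: the table size, the discount factor, and the sample transition below are assumed values for illustration, not part of the book's example.

```python
import numpy as np

n_states, n_actions = 16, 4                # assumed sizes for a small grid world
gamma = 0.9                                # discount factor from Equation 9.2.3
q_table = np.zeros((n_states, n_actions))  # Q(s, a) starts from zero knowledge

def q_update(s, a, r, s_next):
    """Apply Equation 9.2.3: Q(s, a) = r + gamma * max_a' Q(s', a')."""
    q_table[s, a] = r + gamma * np.max(q_table[s_next])

# One observed transition (s, a, r, s'); the numbers are illustrative only.
q_update(s=0, a=1, r=1.0, s_next=4)
print(q_table[0, 1])                       # 1.0, since Q(s', .) is still all zeros
```

A more general form of the update blends the old Q value with this target using a learning rate; Equation 9.2.3 as written corresponds to replacing the old value directly.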

Q-Learning is an off-policy RL algorithm. It learns to improve the policy without directly sampling experiences from that policy. In other words, the Q values are

learned independently of the underlying policy being used by the agent. When the

Q value function has converged, only then is the optimal policy determined using

Equation 9.2.1.

Before giving an example of how to use Q-Learning, we should note that the agent must continually explore its environment while gradually taking advantage of what it has learned so far. This is one of the issues in RL: finding the right balance between Exploration and Exploitation. Generally, at the start of learning, the action is random (exploration). As the learning progresses, the agent takes advantage of the Q value (exploitation). For example, at the start, 90% of the actions are random and 10% come from the Q value function; at the end of each episode, this is gradually decreased until, eventually, 10% of the actions are random and 90% come from the Q value function.
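
A common way to implement this schedule is epsilon-greedy action selection with a decaying epsilon. In the sketch below, the 0.9 and 0.1 bounds mirror the 90%/10% figures above, while the decay factor, function names, and table shape are assumptions made for illustration.

```python
import numpy as np

epsilon = 0.9        # start: 90% of actions are random (exploration)
epsilon_min = 0.1    # end: only 10% of actions remain random
decay = 0.99         # assumed per-episode decay factor

def select_action(q_table, state, epsilon, n_actions):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit Q."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # exploration: random action
    return int(np.argmax(q_table[state]))     # exploitation: greedy action

# Example usage with an (assumed) 16-state, 4-action Q table.
q_table = np.zeros((16, 4))
action = select_action(q_table, state=0, epsilon=epsilon, n_actions=4)

# At the end of each episode, shrink epsilon toward its minimum.
epsilon = max(epsilon_min, epsilon * decay)
```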

