
Deep Reinforcement Learning

Return can be interpreted as a measure of the value of a given state by following an arbitrary policy, π:

$V^{\pi}(s_t) = R_t = \sum_{k=0}^{T} \gamma^{k} r_{t+k}$ (Equation 9.1.2)
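As a concrete illustration of Equation 9.1.2, the following minimal sketch computes the discounted return from a list of per-step rewards. The discounted_return function name, the reward values, and the discount factor are made up for this example and are not from the book's code.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = sum over k of gamma**k * r_{t+k} for a finite horizon."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Made-up rewards for one rollout: the robot receives a reward of 1
# only at the final step, when the soda can is fetched.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 0.9**3 * 1.0 = 0.729
```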

To put the RL problem in another way, the goal of the agent is to learn the optimal policy that maximizes V^π for all states s:

$\pi^{*} = \arg\max_{\pi} V^{\pi}(s)$ (Equation 9.1.3)

The value function of the optimal policy is simply V^*. In Figure 9.1.1, the optimal policy is the one that generates the shortest sequence of actions that brings the robot closer and closer to the soda can until it has been fetched. The closer the state is to the goal state, the higher its value.

The sequence of events leading to the goal (or terminal state) can be modeled as the

trajectory or rollout of the policy:

$\text{Trajectory} = (s_0\,a_0\,r_1\,s_1,\; s_1\,a_1\,r_2\,s_2,\; \ldots,\; s_{T-1}\,a_{T-1}\,r_T\,s_T)$ (Equation 9.1.4)

If the MDP is episodic, when the agent reaches the terminal state, s_T, the state is reset to s_0. If T is finite, we have a finite horizon. Otherwise, the horizon is infinite. In Figure 9.1.1, if the MDP is episodic, after collecting the soda can, the robot may look for another soda can to pick up, and the RL problem repeats.
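To make the trajectory and the episodic reset concrete, here is a minimal sketch of how one rollout could be collected and stored as (s, a, r, s') tuples. It assumes a hypothetical Gym-style environment whose reset() returns the initial state and whose step(action) returns (next_state, reward, done) for simplicity, plus a placeholder policy() function; none of these names come from the book's code.

```python
def collect_trajectory(env, policy, max_steps=100):
    """Roll out one episode and return it as a list of (s, a, r, s') tuples."""
    trajectory = []
    state = env.reset()              # episodic MDP: start again from s_0
    for _ in range(max_steps):       # finite horizon T
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:                     # terminal state s_T reached
            break
    return trajectory
```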

The Q value

An important question arises: if the RL problem is to find π^*, how does the agent learn it by interacting with the environment? Equation 9.1.3 does not explicitly indicate which action to try and which succeeding state to use to compute the return. In RL, we find that it is easier to learn π^* by using the Q value:

$\pi^{*} = \arg\max_{a} Q(s, a)$ (Equation 9.2.1)
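As a sketch of how Equation 9.2.1 can be applied, the greedy action for a state can be read off a tabular Q function by taking the argmax over actions; the Q-table values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical Q-table with made-up values: rows are states, columns are actions.
Q = np.array([[0.1, 0.5, 0.2],   # state 0
              [0.7, 0.3, 0.0],   # state 1
              [0.2, 0.2, 0.9]])  # state 2

def greedy_policy(state):
    """pi*(s) = argmax_a Q(s, a): choose the action with the highest Q value."""
    return int(np.argmax(Q[state]))

print(greedy_policy(0))  # -> 1, because Q[0] is largest at action 1
```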

