Formally, the RL problem can be described as a Markov Decision Process (MDP). For simplicity, we'll assume a deterministic environment where a certain action in a given state will consistently result in a known next state and reward. In a later section of this chapter, we'll look at how to consider stochasticity. At timestep t:

• The environment is in a state s_t from the state space S, which may be discrete or continuous. The starting state is s_0, while the terminal state is s_T.
• The agent takes action a_t from the action space A by obeying the policy, π(a_t | s_t). A may be discrete or continuous.
• The environment transitions to a new state s_{t+1} using the state transition dynamics T(s_{t+1} | s_t, a_t). The next state is only dependent on the current state and action. T is not known to the agent.
• The agent receives a scalar reward using a reward function, r_{t+1} = R(s_t, a_t) with r : A × S → ℝ. The reward is only dependent on the current state and action. R is not known to the agent.
• Future rewards are discounted by γ^k, where γ ∈ [0, 1] and k is the future timestep.
• The horizon, H, is the number of timesteps, T, needed to complete one episode from s_0 to s_T.
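To make these definitions concrete, below is a minimal sketch, not taken from the book, of a deterministic MDP with discrete state and action spaces. The CorridorMDP class, the corridor layout, and the reward values (+1, -1, +100) are illustrative assumptions loosely echoing the soda-can example of Figure 9.1.1.

```python
import numpy as np

# Minimal sketch of a deterministic MDP (illustrative only, not the book's code):
# the agent moves along a 1-D corridor toward a goal cell, loosely echoing the
# soda-can example. The states, actions, T, and R below are all assumptions.
class CorridorMDP:
    def __init__(self, n_states=5):
        self.n_states = n_states      # discrete state space S = {0, ..., n_states - 1}
        self.actions = (-1, +1)       # discrete action space A: step left or right
        self.state = 0                # starting state s_0
        self.goal = n_states - 1      # terminal state s_T

    def step(self, action):
        """Apply the deterministic dynamics T(s_{t+1} | s_t, a_t) and reward R(s_t, a_t)."""
        next_state = int(np.clip(self.state + action, 0, self.n_states - 1))
        if next_state == self.goal:                                       # reached the goal
            reward = 100.0
        elif abs(self.goal - next_state) < abs(self.goal - self.state):   # moved closer
            reward = 1.0
        else:                                                             # moved away or stayed
            reward = -1.0
        done = next_state == self.goal
        self.state = next_state
        return next_state, reward, done

# One episode under a random policy pi(a_t | s_t)
env = CorridorMDP()
done = False
while not done:
    a = env.actions[np.random.randint(len(env.actions))]
    _, r, done = env.step(a)
```

Because the dynamics are deterministic, calling step with the same state and action always returns the same next state and reward, which is exactly the simplification assumed in this section.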

The environment may be fully or partially observable. The latter is also known as a partially observable MDP, or POMDP. Most of the time, it is unrealistic to fully observe the environment. To improve observability, past observations are taken into consideration together with the current observation. The state comprises the observations about the environment that are sufficient for the policy to decide which action to take. In Figure 9.1.1, this could be the 3D position of the soda can with respect to the robot gripper, as estimated by the robot camera.
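As a sketch of how past observations can be folded into the state, the snippet below stacks the most recent observations into a single array before handing it to the policy. The window length of 4, the make_state helper, and the 3D observation values are assumptions for illustration, not code from the book.

```python
from collections import deque

import numpy as np

# Illustrative only: combine the current observation with recent past observations
# so a partially observable environment looks closer to Markovian for the policy.
WINDOW = 4                                   # assumed number of stacked observations
history = deque(maxlen=WINDOW)

def make_state(observation):
    """Append the newest observation and return the stacked state array."""
    if not history:                          # pad at the start of an episode
        history.extend([observation] * WINDOW)
    else:
        history.append(observation)
    return np.stack(list(history), axis=0)   # shape: (WINDOW, *observation.shape)

# e.g., the estimated 3D position of the soda can relative to the gripper
state = make_state(np.array([0.2, -0.1, 0.5]))
```

This is the same idea as frame stacking in Atari-style agents, where several recent frames together form the state.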

Every time the environment transitions to a new state, the agent receives a scalar reward, r_{t+1}. In Figure 9.1.1, the reward could be +1 whenever the robot gets closer to the soda can, -1 whenever it gets farther, and +100 when it closes the gripper and successfully picks up the soda can. The goal of the agent is to learn an optimal policy π* that maximizes the return from all states:

π* = argmax_π R_t (Equation 9.1.1)

The return is defined as the discounted cumulative reward, R_t = Σ_{k=0}^{T} γ^k r_{t+k}. It can be observed from Equation 9.1.1 that future rewards have lower weights when compared to the immediate rewards since generally γ^k < 1.0, where γ ∈ [0, 1]. At the extremes, when γ = 0, only the immediate reward matters. When γ = 1, future rewards have the same weight as the immediate reward.
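The short sketch below, which is not the book's code and uses made-up reward values in the spirit of the soda-can example, computes R_t for a few settings of γ to show the extremes described above; discounted_return is a hypothetical helper name.

```python
# Compute the return R_t = sum_{k=0}^{T} gamma^k * r_{t+k} for a reward sequence.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, -1.0, 1.0, 100.0]        # assumed rewards r_t, ..., r_{t+3}

print(discounted_return(rewards, 0.0))   # 1.0   -> only the immediate reward matters
print(discounted_return(rewards, 1.0))   # 101.0 -> future rewards weigh the same as r_t
print(discounted_return(rewards, 0.9))   # in between: later rewards are down-weighted
```

With γ = 0.9, the large +100 reward still dominates the return, but its contribution shrinks the further into the future it occurs.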
