Reinforcement Learning

• Reward R(S,A,S'): The reward is a scalar value returned by the environment based on the agent's action(s); here S is the present state and S' is the state of the environment after action A is taken. The reward is determined by the goal: the agent gets a higher reward if the action brings it nearer the goal, and a low (or even negative) reward otherwise. How we define a reward is entirely up to us. In the case of the maze, we can define the reward in terms of the Euclidean distance between the agent's current position and the goal, so that the reward is higher the closer the agent gets. The SDC agent's reward can be positive when the car is on the road and negative when it is off the road (a small reward-function sketch in code follows this list).

• Policy π(S): The policy defines a mapping between each state and the action to take in that state. The policy can be deterministic, that is, for each state there is a well-defined action to take. In the case of the maze robot, a policy can be: if the block above is empty, move up. The policy can also be stochastic, that is, an action is taken with some probability. It can be implemented as a simple look-up table, or it can be a function of the present state; both variants are sketched in code after this list. The policy is the core of the RL agent. In this chapter, we'll learn about different algorithms that help the agent learn the policy.

• Return G_t: This is the discounted sum of all future rewards starting from the current time, mathematically defined as:

$G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$

Here $R_t$ is the reward at time t and $\gamma$ is the discount factor, whose value lies between 0 and 1. The discount factor determines how important future rewards are in deciding the policy. If it is near zero, the agent gives importance to immediate rewards. A high value of the discount factor, however, means the agent is looking far into the future; it may forgo an immediate reward in favor of high future rewards, just as in chess you may sacrifice a pawn to checkmate the opponent (a short numerical example of the discounted return follows this list).

• Value function V(S): This defines the "goodness" of a state in the long run. It can be thought of as the total amount of reward the agent can expect to accumulate over time, starting from the state S. You can think of it as a long-term good, as opposed to an immediate but short-lived good. What do you think is more important, maximizing the immediate reward or the value function? You probably guessed right: just as in chess, we sometimes lose a pawn to win the game a few steps later, so the agent should try to maximize the value function (a rough estimation sketch follows this list).
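The following is a minimal sketch of a distance-based reward for the maze example, assuming a simple grid world; the coordinates, the goal position, and the maze_reward function are illustrative assumptions, not code from any specific library.

import math

# Hypothetical maze: states are (x, y) grid positions and the goal is a
# fixed cell. The reward is positive when the move brings the agent closer
# to the goal (measured by Euclidean distance) and negative otherwise.
GOAL = (4, 4)

def maze_reward(state, action, next_state, goal=GOAL):
    """Scalar reward R(S, A, S') based on distance to the goal."""
    def dist(pos):
        return math.sqrt((pos[0] - goal[0]) ** 2 + (pos[1] - goal[1]) ** 2)
    if next_state == goal:
        return 10.0                         # bonus for reaching the goal
    return dist(state) - dist(next_state)   # > 0 if the agent moved closer

print(maze_reward((0, 0), "up", (0, 1)))    # moved closer, so a positive reward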
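Next is a small sketch of the two kinds of policy just described: a deterministic policy stored as a plain look-up table, and a stochastic one that samples an action according to per-state probabilities. The states, actions, and probabilities are made up for illustration.

import random

# Deterministic policy: a look-up table mapping each state to one action.
deterministic_policy = {
    (0, 0): "up",
    (0, 1): "right",
    (1, 1): "right",
}

# Stochastic policy: each state maps to a probability distribution over
# actions, and the action is sampled with those probabilities.
stochastic_policy = {
    (0, 0): {"up": 0.7, "right": 0.3},
    (0, 1): {"up": 0.5, "right": 0.5},
}

def act(state):
    probs = stochastic_policy[state]
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]

print(deterministic_policy[(0, 0)])  # always 'up'
print(act((0, 0)))                   # 'up' roughly 70% of the time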
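The return can be illustrated with a few lines of code: the function below computes G_t for a finite list of future rewards, and the two calls show how a small discount factor makes the agent effectively ignore a reward that arrives a few steps later. The reward sequence and the gamma values are assumptions chosen for illustration.

def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma**k * R_{t+k+1} for a finite reward list."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0, 0, 0, 1]                          # reward arrives only at the end
print(discounted_return(rewards, gamma=0.9))    # 0.9**3 * 1 = 0.729
print(discounted_return(rewards, gamma=0.1))    # 0.1**3 * 1 = 0.001, a myopic agent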
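Finally, a rough sketch of how the value function can be estimated in practice: V(S) is approximated as the average discounted return observed over many episodes starting from state S (a Monte Carlo estimate, not a method the book has introduced at this point). The sample_episode function is a hypothetical stand-in for interaction with a real environment.

import random

def sample_episode(start_state):
    """Hypothetical environment roll-out: returns the rewards R_{t+1}, R_{t+2}, ..."""
    return [random.uniform(-1, 1) for _ in range(10)]

def estimate_value(state, gamma=0.9, n_episodes=1000):
    """Average the discounted returns of many episodes that start in `state`."""
    returns = []
    for _ in range(n_episodes):
        rewards = sample_episode(state)
        g = sum((gamma ** k) * r for k, r in enumerate(rewards))
        returns.append(g)
    return sum(returns) / len(returns)

print(estimate_value((0, 0)))   # long-run 'goodness' of state (0, 0)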
