

Normally, the value is defined either as the state-value function $V^{\pi}(s)$ or the action-value function $Q^{\pi}(s, a)$, where $\pi$ is the policy followed. The state-value function is the expected return from the state $s$ after following policy $\pi$:

$$V^{\pi}(s) = E_{\pi}[G_t \mid S_t = s]$$

Here $E$ is the expectation, and $S_t = s$ is the state at time $t$. The action-value function is the expected return from the state $s$, taking an action $A_t = a$ and following the policy $\pi$:

$$Q^{\pi}(s, a) = E_{\pi}[G_t \mid S_t = s, A_t = a]$$

A minimal code sketch of estimating these values appears after the next bullet point.

• Model of the environment: This is an optional element. It mimics the behavior of the environment and contains its physics; in other words, it tells us how the environment will behave. The model of the environment is defined by the transition probability to the next state. Since the model is optional, we can also have model-free reinforcement learning, where the transition probability is not needed to define the RL process (a sketch of such a transition table follows below).
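
Returning to the value functions defined above, the following is a minimal sketch, assuming a discount factor gamma = 0.99 and made-up reward sequences, of computing the return $G_t$ and a Monte Carlo estimate of $V^{\pi}(s)$; it is illustrative, not the book's code:

import numpy as np

def discounted_return(rewards, gamma=0.99):
    # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Monte Carlo estimate of V(s): average the returns observed over
# several episodes that start from state s and follow the policy.
episode_rewards_from_s = [[1, 0, 2], [0, 1, 1], [2, 2, 0]]  # made-up data
returns = [discounted_return(r) for r in episode_rewards_from_s]
print("Estimated V(s):", np.mean(returns))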
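
As for the model of the environment, one simple way to represent it is a transition-probability table. The sketch below uses a hypothetical two-state, two-action environment; the state names and probabilities are invented for illustration:

# Hypothetical model: P[(state, action)] maps each possible
# next state to its transition probability.
P = {
    ("s0", "a0"): {"s0": 0.7, "s1": 0.3},
    ("s0", "a1"): {"s0": 0.1, "s1": 0.9},
    ("s1", "a0"): {"s0": 0.4, "s1": 0.6},
    ("s1", "a1"): {"s0": 0.8, "s1": 0.2},
}

# Probability of landing in s1 after taking action a1 in state s0:
print(P[("s0", "a1")]["s1"])  # 0.9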

In RL we assume that the state of the environment follows the Markov property; that is, each state depends solely on the preceding state, the action taken from the action space, and the corresponding reward. In other words, if $S_{t+1}$ is the state of the environment at time $t+1$, then it is a function only of the state $S_t$, the action $A_t$, and the reward $R_t$ at time $t$; no prior history is needed. If $P(S_{t+1} \mid S_t)$ is the transition probability, the Markov property can be written mathematically as:

$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, S_2, \ldots, S_t)$$

And thus, RL can be modeled as a Markov Decision Process (MDP).
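
The Markov property is easy to see in code: sampling the next state consults only the current state and action. Below is a minimal sketch reusing the hypothetical transition table from above (redefined here so the snippet runs on its own):

import random

# Same hypothetical two-state transition table as in the earlier sketch.
P = {
    ("s0", "a0"): {"s0": 0.7, "s1": 0.3},
    ("s0", "a1"): {"s0": 0.1, "s1": 0.9},
    ("s1", "a0"): {"s0": 0.4, "s1": 0.6},
    ("s1", "a1"): {"s0": 0.8, "s1": 0.2},
}

def step(state, action):
    # The next state is sampled from P(. | state, action) alone;
    # no earlier history is consulted -- this is the Markov property.
    next_states = list(P[(state, action)])
    weights = [P[(state, action)][s] for s in next_states]
    return random.choices(next_states, weights=weights)[0]

state = "s0"
for t in range(5):
    action = random.choice(["a0", "a1"])  # a uniformly random policy
    state = step(state, action)
    print(t, action, state)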

Deep reinforcement learning algorithms

The basic idea in Deep Reinforcement Learning (DRL) is that we can use a deep neural network to approximate either the policy function or the value function. In this chapter we will study some popular DRL algorithms. These algorithms can be classified into two classes, depending upon what they approximate:

• Value-based methods: In these methods, the algorithm takes the action that maximizes the value function. The agent learns to predict how good a given state or action will be. An example of a value-based method is the Deep Q-Network (DQN); a sketch of the Q-network it relies on follows below.
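
As a rough illustration, assuming TensorFlow 2 and made-up state/action dimensions (e.g. a CartPole-like task with 4 state variables and 2 actions), a Q-network maps a state to one Q-value per action; this is only the function approximator, not a full DQN training loop:

import numpy as np
import tensorflow as tf

state_dim, n_actions = 4, 2  # illustrative sizes, not from the book

# The network maps a state vector to one Q-value per action; acting
# greedily means choosing the action with the largest predicted Q-value.
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(state_dim,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_actions),  # linear outputs: Q(s, a) for each a
])

state = np.random.rand(1, state_dim).astype("float32")
q_values = q_net(state)                      # shape: (1, n_actions)
greedy_action = int(tf.argmax(q_values[0]))  # argmax over actions
print(q_values.numpy(), greedy_action)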

