

• The next question that arises is, how does the agent maintain a balance between exploration and exploitation? There are various strategies; one of the most commonly employed is the Epsilon-Greedy (ε-greedy) policy. Here, the agent explores continually: depending upon the value of ε ∈ [0, 1], at each step the agent selects a random action with probability ε, and with probability 1 − ε selects the action that maximizes the value function. Normally, the value of ε is decayed asymptotically toward zero over the course of training. In Python, the ε-greedy policy can be implemented as:

import random
import numpy as np

if np.random.rand() <= epsilon:
    a = random.randrange(action_size)    # explore: choose a random action
else:
    a = np.argmax(model.predict(s))      # exploit: choose the best predicted action

where model is the deep neural network approximating the value/policy function, a is the action chosen from the action space of size action_size, and s is the state. Another way to perform exploration is to add noise to the chosen actions; researchers have experimented with both Gaussian and Ornstein-Uhlenbeck noise with success.
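As a concrete illustration of exploration via noise, the following is a minimal sketch of a discrete-time Ornstein-Uhlenbeck process whose samples could be added to the actions proposed by a deterministic policy (as is common in DDPG-style agents). The OUNoise class and its parameter values (mu, theta, sigma) are illustrative assumptions, not code from this chapter:

import numpy as np

class OUNoise:
    # Temporally correlated noise, mean-reverting toward mu.
    # Illustrative helper, not part of the chapter's code.
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta      # speed of mean reversion
        self.sigma = sigma      # scale of the random perturbation
        self.state = np.copy(self.mu)

    def sample(self):
        # x_{t+1} = x_t + theta * (mu - x_t) + sigma * N(0, 1)
        dx = self.theta * (self.mu - self.state) \
             + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state

# Usage (continuous action space): perturb the policy's action before acting
# noise = OUNoise(action_dim)
# a = model.predict(s) + noise.sample()

Gaussian noise can be obtained even more simply, for example by adding sigma * np.random.randn(action_dim) to the action.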

• How to deal with the highly correlated input state space?

The input to our RL model is the present state of the environment. Each action results in some change in the environment; however, the correlation between two consecutive states is very high. Now, if we make our network learn from these sequential states, the high correlation between consecutive inputs leads to what is known in the literature as Catastrophic Forgetting. To mitigate the effect of Catastrophic Forgetting, in 2018 David Isele and Akansel Cosgun proposed using the Experience Replay [3] method.

In the simplest terms, the learning algorithm first stores the MDP tuple of state, action, reward, and next state, <S, A, R, S'>, in a buffer/memory. Once a significant amount of memory has been built up, a batch is selected at random from it to train the agent. The memory is continuously refreshed: new experiences are added and the oldest ones are deleted. The use of experience replay provides three-fold benefits (a minimal sketch of such a replay buffer follows the list below):

° First, it allows the same experience to be potentially used in many weight updates, hence increasing data efficiency.

° Second, the random selection of batches of experience removes the correlations between consecutive states presented to the network for training.

° Third, it stops any unwanted feedback loops that may arise and cause the network to get stuck in local minima or diverge.
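To make the mechanism concrete, here is a minimal sketch of such a replay buffer, assuming a fixed capacity and uniform random sampling; the ReplayBuffer class name and its parameters are illustrative rather than taken from the chapter:

import random
from collections import deque

class ReplayBuffer:
    # Fixed-size memory of <S, A, R, S'> (plus a done flag) tuples.
    # Illustrative sketch; uniform random sampling breaks the correlation
    # between consecutive states.
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)   # oldest entries are dropped first

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Draw a random batch once enough experience has accumulated
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

# Usage inside the training loop (sketch):
# buffer.add(s, a, r, s_next, done)
# if len(buffer) > batch_size:
#     batch = buffer.sample(batch_size)   # train the agent on this batch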

