Reinforcement Learning

A modified version of experience replay is Prioritized Experience Replay (PER). Introduced in 2015 by Tom Schaul et al. [4], it derives from the idea that not all experiences (or, you might say, attempts) are equally important. Some attempts are better lessons than others, so instead of selecting experiences randomly, it is much more efficient to give higher priority to the more educational experiences when selecting them for training. In the Schaul paper it was proposed that experiences in which the difference between the prediction and the target is high should be given priority, as the agent can learn a lot in these cases.
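As a rough sketch of this idea (not code from the paper; the original implementation uses a sum tree for efficient sampling, and the class and parameter names here, such as alpha and eps, are illustrative), a prioritized buffer in Python could look like this:

import numpy as np

class PrioritizedReplayBuffer:
    # Samples experiences with probability proportional to |TD error|**alpha.
    def __init__(self, capacity=10000, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha    # alpha=0 gives uniform sampling, alpha=1 full prioritization
        self.eps = eps        # keeps every priority strictly positive
        self.buffer, self.priorities = [], []

    def add(self, experience, td_error):
        # Store the experience with priority (|TD error| + eps)**alpha
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(experience)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        # More "educational" experiences (larger TD error) are drawn more often
        probs = np.array(self.priorities) / sum(self.priorities)
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        return [self.buffer[i] for i in indices], indices

    def update_priorities(self, indices, td_errors):
        # After a training step, refresh priorities with the new TD errors
        for i, err in zip(indices, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha

The full method in the paper also corrects the resulting sampling bias with importance-sampling weights, which are omitted here for brevity.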

• How to deal with the problem of moving targets?

Unlike supervised learning, in RL the target is not known in advance. With a moving target, the agent tries to maximize the expected return, but the maximum value keeps changing as the agent learns. In essence, this is like trying to catch a butterfly: each time you approach it, it moves to a new location. The major reason for the moving target is that the same network is used to estimate both the action values and the target values, and this can cause oscillations in learning.
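To see why, recall the form of the Q-learning target (standard DQN notation, not reproduced from this page): the network is trained to push its estimate Q(s, a; θ) toward the target

y = r + γ max_a' Q(s', a'; θ)

Because the same weights θ appear on both sides, every gradient step that updates the estimate also shifts the target.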

A solution to this was proposed by the DeepMind team in their 2015 paper, titled Human-level Control through Deep Reinforcement Learning, published in Nature. The solution is that, instead of a moving target, the agent now has short-term fixed targets. The agent maintains two networks with exactly the same architecture: the local network, which is used at each step to estimate the present action, and the target network, which is used to get the target value. However, each network has its own set of weights. At each time step, the local network learns in the direction that brings its estimate and the target closer together. After some number of time steps, the target network weights are updated. The update can be a hard update, in which the weights of the local network are copied completely to the target network after N time steps, or a soft update, in which the target network slowly moves its weights toward the local network by a factor of Tau, τ ∈ [0, 1].
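A minimal sketch of both update styles, assuming two Keras models with identical architectures (the layer sizes, update schedule, and tau value below are illustrative, not taken from the DeepMind paper):

import tensorflow as tf

def build_q_network(state_dim=4, n_actions=2):
    # Local and target networks share this architecture but not their weights
    return tf.keras.Sequential([
        tf.keras.layers.Dense(24, activation='relu', input_shape=(state_dim,)),
        tf.keras.layers.Dense(n_actions)
    ])

local_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(local_net.get_weights())   # start from identical weights

def hard_update(local, target):
    # Hard update: copy the local weights to the target every N steps
    target.set_weights(local.get_weights())

def soft_update(local, target, tau=0.01):
    # Soft update: target <- tau * local + (1 - tau) * target
    new_weights = [tau * lw + (1.0 - tau) * tw
                   for lw, tw in zip(local.get_weights(), target.get_weights())]
    target.set_weights(new_weights)

During training, the targets for the local network are computed with target_net, and either hard_update or soft_update is called on a fixed schedule.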

Reinforcement success in recent years

In the last few years, DRL has been successfully used in a variety of tasks, especially in game playing and robotics. Let us acquaint ourselves with some success stories of RL before learning its algorithms:

• AlphaGo Zero: Developed by Google's DeepMind team and described in the paper Mastering the game of Go without human knowledge, AlphaGo Zero starts from an absolutely blank slate (tabula rasa). It uses a single neural network to approximate both the move probabilities and the value.

