Advanced Deep Learning with Keras

REINFORCE with baseline method

The REINFORCE algorithm can be generalized by subtracting a baseline from the return, $\delta = R_t - B(s_t)$. The baseline function, $B(s_t)$, can be any function as long as it does not depend on $a_t$. The baseline does not alter the expectation of the performance gradient:

$$\nabla J(\theta) = \mathbb{E}_{\pi}\left[\left(R_t - B(s_t)\right) \nabla_{\theta} \ln \pi\left(a_t \mid s_t, \theta\right)\right] = \mathbb{E}_{\pi}\left[R_t \nabla_{\theta} \ln \pi\left(a_t \mid s_t, \theta\right)\right] \quad \text{(Equation 10.3.1)}$$

Equation 10.3.1 implies that $\mathbb{E}_{\pi}\left[B(s_t) \nabla_{\theta} \ln \pi\left(a_t \mid s_t, \theta\right)\right] = 0$ since $B(s_t)$ is not a function of $a_t$.
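
A quick numerical check of this claim, as a minimal sketch on a two-action softmax bandit; the logits, the action-dependent returns, and the constant baseline $B = \mathbb{E}[R]$ are illustrative assumptions, not from the text:

```python
import numpy as np

# Monte Carlo check of Equation 10.3.1 on a two-action softmax bandit.
rng = np.random.default_rng(42)
theta = np.array([0.5, -0.5])                 # policy logits (assumed)
probs = np.exp(theta) / np.exp(theta).sum()   # pi(a | theta), softmax
returns = np.array([1.0, 2.0])                # deterministic R per action

def sample_grad(baseline=0.0, n=100_000):
    a = rng.choice(2, size=n, p=probs)        # sample actions a_t ~ pi
    r = returns[a]                            # returns R_t
    glogp = np.eye(2)[a] - probs              # grad_theta ln pi(a_t) for softmax
    g = (r - baseline)[:, None] * glogp       # (R_t - B) * grad ln pi
    return g.mean(axis=0), g.var(axis=0)

mean_no_b, var_no_b = sample_grad(baseline=0.0)
mean_b, var_b = sample_grad(baseline=probs @ returns)  # B = E[R]
print(mean_no_b, mean_b)   # near-identical expected gradients
print(var_no_b, var_b)     # variance drops with the baseline
```

The two gradient estimates agree in expectation, but the baselined one has markedly lower variance, which is the point of the next paragraph.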

While the introduction of a baseline does not change the expectation, it reduces the variance of the gradient updates. The reduction in variance generally accelerates learning. In most cases, we use the value function, $B(s_t) = V(s_t)$, as the baseline. If the return is overestimated, the scaling factor is proportionally reduced by the value function, resulting in lower variance. The value function is also parameterized, $V(s_t) \rightarrow V(s_t, \theta_v)$, and is jointly trained with the policy network. In continuous action spaces, the state value can be a linear function of state features:

$$v_t = V\left(s_t, \theta_v\right) = \theta_v^{T} \phi\left(s_t\right) \quad \text{(Equation 10.3.2)}$$
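
As a tiny worked instance of Equation 10.3.2, where the feature map $\phi$ and the parameter values are made up for illustration:

```python
import numpy as np

# Equation 10.3.2 in code: v = theta_v^T phi(s_t).
theta_v = np.array([0.1, -0.3, 0.7])        # value parameters (assumed)
phi = lambda s: np.array([1.0, s, s ** 2])  # example state features phi(s)
v = theta_v @ phi(0.5)                      # scalar state-value estimate
print(v)                                    # 0.1 - 0.15 + 0.175 = 0.125
```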

Algorithm 10.3.1 summarizes the REINFORCE with baseline method [1]. This is similar to REINFORCE except that the return is replaced by $\delta$. The difference is that we are now training two neural networks. As shown in Figure 10.3.1, in addition to the policy network, $\pi(\theta)$, the value network, $V(\theta_v)$, is also trained at the same time. The policy network parameters are updated by the performance gradient, $\nabla J(\theta)$, while the value network parameters are adjusted by the value gradient, $\nabla V(\theta_v)$. Since REINFORCE is a Monte Carlo algorithm, it follows that the value function training is also a Monte Carlo algorithm.

The learning rates are not necessarily the same. Note that the value network is also performing gradient ascent. We illustrate how to implement REINFORCE with baseline using Keras at the end of this chapter.
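
Ahead of that full implementation, here is a minimal sketch of the two networks in Figure 10.3.1; the state dimension, layer sizes, and discrete softmax head are assumptions rather than the book's exact architecture:

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

# Two jointly trained networks, as in Figure 10.3.1 (sizes assumed).
state = Input(shape=(4,), name='state')

# policy network pi(a_t | s_t, theta): outputs action probabilities
x = Dense(256, activation='relu')(state)
action_probs = Dense(2, activation='softmax', name='action')(x)
policy_model = Model(state, action_probs)

# value network V(s_t, theta_v): outputs the scalar baseline estimate
y = Dense(256, activation='relu')(state)
value = Dense(1, activation='linear', name='value')(y)
value_model = Model(state, value)
```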

Algorithm 10.3.1 REINFORCE with baseline

Require: A differentiable parameterized target policy network, $\pi\left(a_t \mid s_t, \theta\right)$.

Require: A differentiable parameterized value network, $V\left(s_t, \theta_v\right)$.
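
The remaining steps of Algorithm 10.3.1 fall outside this excerpt, but the per-step Monte Carlo update it prescribes can be sketched as follows, reusing the `policy_model` and `value_model` above; the discount factor, learning rates, and the use of two Adam optimizers are assumptions:

```python
import numpy as np
import tensorflow as tf

# Sketch of the per-step Monte Carlo update for REINFORCE with baseline.
gamma = 0.99                                  # discount factor (assumed)
policy_opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
value_opt = tf.keras.optimizers.Adam(learning_rate=1e-3)

def update(states, actions, rewards):
    """One gradient step per time step of a completed episode."""
    for t in range(len(rewards)):
        # return R_t = sum_k gamma^k * r_{t+k}
        R = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
        s = np.asarray(states[t], dtype='float32')[None]
        delta = R - float(value_model(s)[0, 0])   # delta = R_t - V(s_t)
        with tf.GradientTape(persistent=True) as tape:
            logp = tf.math.log(policy_model(s)[0, actions[t]])
            v = value_model(s)[0, 0]
            # minimizing the negatives performs the gradient ascent the
            # text describes for both the policy and value networks
            policy_loss = -(gamma ** t) * delta * logp
            value_loss = -(gamma ** t) * delta * v
        policy_opt.apply_gradients(zip(
            tape.gradient(policy_loss, policy_model.trainable_variables),
            policy_model.trainable_variables))
        value_opt.apply_gradients(zip(
            tape.gradient(value_loss, value_model.trainable_variables),
            value_model.trainable_variables))
```

Treating $\delta$ as a constant inside the tape matters here: the update differentiates only $\ln \pi$ and $V$, not the return itself.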
