Advanced Deep Learning with Keras

Chapter 10

Advantage Actor-Critic (A2C) method

In the Actor-Critic method from the previous section, the objective is for the value function to evaluate the state value correctly. There are other techniques to train the value network. One obvious method is to use MSE (mean squared error) in the value function optimization, similar to the algorithm in Q-Learning. The new value gradient is equal to the partial derivative of the MSE between the return, $R_t$, and the state value:

$\nabla V(\theta_v) = \dfrac{\partial \left( R_t - V(s, \theta_v) \right)^2}{\partial \theta_v}$ (Equation 10.5.1)

As $\left( R_t - V(s, \theta_v) \right) \rightarrow 0$, the value network prediction becomes more accurate. We call this variation of the Actor-Critic algorithm A2C. A2C is the single-threaded, or synchronous, version of the Asynchronous Advantage Actor-Critic (A3C) of [2]. The quantity $R_t - V(s, \theta_v)$ is called the Advantage.
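In a tf.keras implementation, this value update can be sketched as a single gradient step on the MSE objective of Equation 10.5.1. The names value_model, value_optimizer, state, and R_t below are assumptions for illustration, not identifiers from the book's listings; the returned advantage, $R_t - V(s, \theta_v)$, is the quantity that the policy gradient reuses.

```python
import tensorflow as tf

def value_update(value_model, value_optimizer, state, R_t):
    """One gradient step on the value network using the MSE objective."""
    with tf.GradientTape() as tape:
        value = value_model(state)                      # V(s, theta_v)
        advantage = R_t - value                         # R_t - V(s, theta_v)
        loss = tf.reduce_mean(tf.square(advantage))     # (R_t - V(s, theta_v))^2
    grads = tape.gradient(loss, value_model.trainable_variables)
    value_optimizer.apply_gradients(zip(grads, value_model.trainable_variables))
    return advantage
```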

Algorithm 10.5.1 summarizes the A2C method. There are some differences between A2C and Actor-Critic. Actor-Critic is online, that is, it is trained on each experience sample. A2C is similar to the Monte Carlo algorithms REINFORCE and REINFORCE with baseline: it is trained only after one episode has been completed. Actor-Critic is trained from the first state to the last state, while A2C training starts from the last state and ends at the first state. In addition, the A2C policy and value gradients are no longer discounted by $\gamma^t$.
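To illustrate that reversed, episode-based training order, here is a minimal sketch that accumulates the return while walking the stored samples from the last step back to the first. The names memory and train_step are placeholders introduced here, not code from the book.

```python
def train_on_episode(memory, gamma, train_step):
    """memory: list of (state, action, reward) tuples from one finished episode."""
    R = 0.0
    # Walk the episode backwards: the return of the last step is just its
    # reward, and each earlier step adds its reward to the discounted tail.
    for state, action, reward in reversed(memory):
        R = reward + gamma * R
        # Note that the gradients are not additionally scaled by gamma**t.
        train_step(state, action, R)
```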

The corresponding network for A2C is similar to Figure 10.4.1 since we only changed the method of gradient computation. To encourage agent exploration during training, the A3C algorithm [2] suggests adding the gradient of the weighted entropy value of the policy function to the gradient function, $\beta \nabla_\theta H\left( \pi(a_t \mid s_t, \theta) \right)$. Recall that entropy is a measure of the information or uncertainty of an event.
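As a concrete, hedged sketch, the entropy bonus for a discrete softmax policy can be folded into the policy loss as follows; policy_model, advantage, and beta are assumed names, and the advantage is treated as a constant with respect to $\theta$.

```python
import tensorflow as tf

def policy_loss_with_entropy(policy_model, state, action, advantage, beta):
    """Policy loss including the weighted entropy bonus suggested by A3C [2]."""
    probs = policy_model(state)                              # pi(a|s, theta)
    log_probs = tf.math.log(probs + 1e-10)
    log_prob_action = tf.reduce_sum(
        tf.one_hot(action, depth=probs.shape[-1]) * log_probs, axis=-1)
    entropy = -tf.reduce_sum(probs * log_probs, axis=-1)     # H(pi(a|s, theta))
    # Minimizing the negative objective maximizes log pi * advantage + beta * H.
    return -tf.reduce_mean(log_prob_action * advantage + beta * entropy)
```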

Algorithm 10.5.1 Advantage Actor-Critic (A2C)

Require: A differentiable parameterized target policy network, $\pi(a_t \mid s_t, \theta)$.
Require: A differentiable parameterized value network, $V(s_t, \theta_v)$.
Require: Discount factor, $\gamma \in [0, 1]$, the learning rate $\alpha$ for the performance gradient, the learning rate $\alpha_v$ for the value gradient, and the entropy weight, $\beta$.
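For reference, a hedged sketch of how these required quantities could be declared with tf.keras follows; the layer sizes and hyperparameter values are placeholders chosen for illustration, not the book's actual settings.

```python
from tensorflow.keras import layers, models, optimizers

state_dim, n_actions = 4, 2          # e.g., CartPole-like sizes (assumed)

# Differentiable parameterized policy network, pi(a_t | s_t, theta)
policy_model = models.Sequential([
    layers.Input(shape=(state_dim,)),
    layers.Dense(256, activation='relu'),
    layers.Dense(n_actions, activation='softmax'),
])

# Differentiable parameterized value network, V(s_t, theta_v)
value_model = models.Sequential([
    layers.Input(shape=(state_dim,)),
    layers.Dense(256, activation='relu'),
    layers.Dense(1, activation='linear'),
])

gamma = 0.99      # discount factor, in [0, 1]
alpha = 1e-3      # learning rate for the performance (policy) gradient
alpha_v = 1e-3    # learning rate for the value gradient
beta = 0.01       # entropy weight

policy_optimizer = optimizers.Adam(learning_rate=alpha)
value_optimizer = optimizers.Adam(learning_rate=alpha_v)
```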

