
Advanced Deep Learning with Keras


Chapter 10

Actor-Critic method

In the REINFORCE with baseline method, the value is used as a baseline; it is not used to train the value function. In this section, we'll introduce a variation of REINFORCE with baseline called the Actor-Critic method. Here, the policy and value networks play the roles of the actor and critic networks. The policy network is the actor that decides which action to take given the state. Meanwhile, the value network evaluates the decision made by the actor (that is, the policy network). The value network acts as a critic that quantifies how good or bad the action chosen by the actor is.

The value network evaluates the state value, V(s, θ_v), by comparing it with the sum of the received reward, r, and the discounted value of the observed next state, γV(s′, θ_v). The difference, δ, is expressed as:

δ = r_{t+1} + γV(s_{t+1}, θ_v) − V(s_t, θ_v) = r + γV(s′, θ_v) − V(s, θ_v)    (Equation 10.4.1)

where we dropped the subscripts of r and s for simplicity. Equation 10.4.1 is similar to the temporal differencing in Q-Learning discussed in Chapter 9, Deep Reinforcement Learning. The next state value is discounted by γ ∈ [0, 1]. Estimating distant future rewards is difficult, so our estimate is based only on the immediate future, r + γV(s′, θ_v). This is known as the bootstrapping technique. The bootstrapping technique and the dependence on the state representation in Equation 10.4.1 often accelerate learning and reduce variance. From Equation 10.4.1, we notice that the value network evaluates the current state, s = s_t, which is due to the previous action, a_{t−1}, of the policy network. Meanwhile, the policy gradient is based on the current action, a_t. In a sense, the evaluation is delayed by one step.
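To make Equation 10.4.1 and the bootstrapping estimate concrete, here is a small Python sketch that computes δ for a single transition. The function name td_error, the default γ = 0.99, and the numeric values are illustrative assumptions, not taken from the book:

```python
# A minimal sketch of the one-step TD error in Equation 10.4.1:
#     delta = r + gamma * V(s') - V(s)
# The values below are made-up numbers standing in for a critic's estimates.

def td_error(reward, value_s, value_s_next, gamma=0.99, done=False):
    """Bootstrapped TD error for a single transition (s, a, r, s')."""
    # If s' is terminal, there is no future value to bootstrap from.
    target = reward if done else reward + gamma * value_s_next
    return target - value_s

# Hypothetical transition: the critic currently estimates V(s) = 2.0 and
# V(s') = 2.5, and the environment returned a reward of 1.0.
delta = td_error(reward=1.0, value_s=2.0, value_s_next=2.5, gamma=0.99)
print(delta)  # 1.0 + 0.99 * 2.5 - 2.0 = 1.475
```

A positive δ means the transition turned out better than the critic's current estimate of V(s), so the actor is nudged toward the chosen action; a negative δ nudges it away.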

Algorithm 10.4.1 summarizes the Actor-Critic method [1]. Apart from the evaluation of the state value, which is used to train both the policy and value networks, the training is done online. At every step, both networks are trained. This is unlike REINFORCE and REINFORCE with baseline, where the agent completes an episode before training is performed. The value network is consulted twice: first for the value estimate of the current state, and second for the value of the next state. Both values are used in the computation of the gradients. Figure 10.4.1 shows the Actor-Critic network. We will implement the Actor-Critic method in Keras at the end of this chapter.
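As a rough preview of that implementation, the sketch below shows what one online Actor-Critic step can look like under simple assumptions. The network sizes, learning rates, and the names actor, critic, and train_step are illustrative choices, not the book's code. The sketch highlights the two consultations of the value network, the TD error δ of Equation 10.4.1, and the per-step updates of both networks:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

state_dim, n_actions, gamma = 4, 2, 0.99   # toy sizes, assumed for illustration

# Hypothetical actor: pi(a|s, theta), a softmax over discrete actions.
actor = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(n_actions, activation="softmax"),
])

# Hypothetical critic: V(s, theta_v), a scalar state value.
critic = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),
])

actor_opt = tf.keras.optimizers.Adam(1e-3)
critic_opt = tf.keras.optimizers.Adam(1e-3)

def train_step(state, action, reward, next_state, done):
    """One online Actor-Critic update for a single transition (s, a, r, s')."""
    state = tf.convert_to_tensor(state[None, :], dtype=tf.float32)
    next_state = tf.convert_to_tensor(next_state[None, :], dtype=tf.float32)

    with tf.GradientTape(persistent=True) as tape:
        # The value network is consulted twice: for V(s) and for V(s').
        v_s = critic(state)[0, 0]
        v_next = critic(next_state)[0, 0]

        # TD error delta (Equation 10.4.1); no bootstrapping past a terminal state.
        target = reward + (0.0 if done else gamma * v_next)
        delta = target - v_s

        # Critic loss: squared TD error with the bootstrapped target held fixed,
        # so gradients flow only through V(s) (a semi-gradient update).
        critic_loss = tf.square(tf.stop_gradient(target) - v_s)

        # Actor loss: -delta * log pi(a|s), with delta treated as a constant.
        log_prob = tf.math.log(actor(state)[0, action] + 1e-8)
        actor_loss = -tf.stop_gradient(delta) * log_prob

    critic_grads = tape.gradient(critic_loss, critic.trainable_variables)
    actor_grads = tape.gradient(actor_loss, actor.trainable_variables)
    critic_opt.apply_gradients(zip(critic_grads, critic.trainable_variables))
    actor_opt.apply_gradients(zip(actor_grads, actor.trainable_variables))
    del tape
    return float(delta)

# Hypothetical usage with random data standing in for an environment step.
s = np.random.rand(state_dim).astype(np.float32)
s_next = np.random.rand(state_dim).astype(np.float32)
print(train_step(s, action=1, reward=1.0, next_state=s_next, done=False))
```

Treating δ and the bootstrapped target as constants via stop_gradient matches the semi-gradient form of the per-step updates; the book's implementation at the end of the chapter may package these losses differently, but the data flow, where δ both trains the critic and scales the actor's log-probability gradient, is the same idea.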

Algorithm 10.4.1 Actor-Critic

Require: A differentiable parameterized target policy network, π(a|s, θ).
Require: A differentiable parameterized value network, V(s, θ_v).

