Advanced Deep Learning with Keras


Actor-Critic method

In the REINFORCE with baseline method, the value is used as a baseline; it is not used to train the value function. In this section, we introduce a variation of REINFORCE with baseline called the Actor-Critic method. The policy and value networks play the roles of the actor and critic networks. The policy network is the actor that decides which action to take given the state. Meanwhile, the value network evaluates the decision made by the actor, or the policy network. The value network acts as a critic that quantifies how good or bad the action chosen by the actor is. The value network evaluates the state value, V(s, θ_v), by comparing it with the sum of the received reward, r, and the discounted value of the observed next state, γV(s′, θ_v). The difference, δ, is expressed as:

δ = r_{t+1} + γV(s_{t+1}, θ_v) − V(s_t, θ_v) = r + γV(s′, θ_v) − V(s, θ_v)   (Equation 10.4.1)

where we dropped the subscripts of r and s for simplicity. Equation 10.4.1 is similar to the temporal differencing in Q-Learning discussed in Chapter 9, Deep Reinforcement Learning. The next state value is discounted by γ ∈ [0, 1]. Estimating distant future rewards is difficult, so our estimate is based only on the immediate future, r + γV(s′, θ_v). This is known as the bootstrapping technique. The bootstrapping technique and the dependence on the state representation in Equation 10.4.1 often accelerate learning and reduce variance. From Equation 10.4.1, we notice that the value network evaluates the current state, s = s_t, which is due to the previous action, a_{t−1}, of the policy network. Meanwhile, the policy gradient is based on the current action, a_t. In a sense, the evaluation is delayed by one step.

Algorithm 10.4.1 summarizes the Actor-Critic method [1]. Apart from the evaluation of the state value, which is used to train both the policy and value networks, the training is done online. At every step, both networks are trained. This is unlike REINFORCE and REINFORCE with baseline, where the agent completes an episode before training is performed. The value network is consulted twice: first for the value estimate of the current state, and second for the value of the next state. Both values are used in the computation of the gradients. Figure 10.4.1 shows the Actor-Critic network. We will implement the Actor-Critic method in Keras at the end of this chapter.
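To make Equation 10.4.1 concrete, the following is a minimal sketch that computes the one-step TD error δ with a small Keras value network. The network architecture, the 4-dimensional state, and names such as value_model are assumptions chosen for illustration; they are not the implementation developed at the end of this chapter.

```python
import numpy as np
from tensorflow import keras

# Illustrative assumption: a tiny value network V(s, theta_v) for a
# 4-dimensional state (for example, CartPole-like observations).
value_model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="linear"),
])

gamma = 0.99                       # discount factor, gamma in [0, 1]
state = np.random.rand(1, 4)       # current state s_t (placeholder data)
next_state = np.random.rand(1, 4)  # observed next state s' (placeholder data)
reward = 1.0                       # received reward r

# Equation 10.4.1: delta = r + gamma * V(s', theta_v) - V(s, theta_v)
v_state = value_model.predict(state, verbose=0)[0, 0]
v_next = value_model.predict(next_state, verbose=0)[0, 0]
delta = reward + gamma * v_next - v_state
print("TD error delta:", delta)
```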

Algorithm 10.4.1 Actor-Critic

Require: A differentiable parameterized target policy network, π(a|s, θ).
Require: A differentiable parameterized value network, V(s, θ_v).
Require: Discount factor, γ ∈ [0, 1], the learning rate α for the performance gradient, and the learning rate for the value gradient, α_v.
Require: θ_0, initial policy network parameters (for example, θ_0 → 0). θ_{v0}, initial value network parameters (for example, θ_{v0} → 0).

1. Repeat
2.   for steps t = 0, ..., T − 1 do
3.     Sample an action a ~ π(a|s, θ)
4.     Execute the action and observe the reward r and the next state s′
5.     Evaluate the state value estimate, δ = r + γV(s′, θ_v) − V(s, θ_v)
6.     Compute the discounted value gradient, ∇V(θ_v) = γ^t δ ∇_{θ_v} V(s, θ_v)
7.     Perform gradient ascent, θ_v = θ_v + α_v ∇V(θ_v)
8.     Compute the discounted performance gradient, ∇J(θ) = γ^t δ ∇_θ ln π(a|s, θ)
9.     Perform gradient ascent, θ = θ + α ∇J(θ)
10.    s = s′

Figure 10.4.1: Actor-critic network
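As a bridge to the Keras implementation later in the chapter, below is a minimal sketch of one online update corresponding to steps 5 to 9 of Algorithm 10.4.1, written with tf.keras and tf.GradientTape. The two-layer networks, the sizes state_dim and n_actions, the learning rates, and the helper train_step are illustrative assumptions, not the chapter's code; in an actual loop, the action would be sampled from the actor (step 3), executed in the environment (step 4), and the state updated to s′ (step 10).

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Illustrative sizes and learning rates (assumptions, not the chapter's values).
state_dim, n_actions = 4, 2
gamma, alpha, alpha_v = 0.99, 1e-3, 1e-3

# Actor pi(a|s, theta) and critic V(s, theta_v) as small dense networks.
actor = keras.Sequential([
    keras.Input(shape=(state_dim,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(n_actions, activation="softmax"),
])
critic = keras.Sequential([
    keras.Input(shape=(state_dim,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="linear"),
])


def train_step(state, action, reward, next_state, discount):
    """One online Actor-Critic update, following steps 5-9 of Algorithm 10.4.1."""
    s = tf.convert_to_tensor(state[None], dtype=tf.float32)
    s_next = tf.convert_to_tensor(next_state[None], dtype=tf.float32)

    # Step 5: TD error, delta = r + gamma * V(s', theta_v) - V(s, theta_v).
    delta = reward + gamma * critic(s_next)[0, 0] - critic(s)[0, 0]
    delta = tf.stop_gradient(delta)  # delta acts as a constant weight below

    # Steps 6-7: gradient ascent on gamma^t * delta * V(s, theta_v).
    with tf.GradientTape() as tape:
        value_objective = discount * delta * critic(s)[0, 0]
    grads = tape.gradient(value_objective, critic.trainable_variables)
    for var, grad in zip(critic.trainable_variables, grads):
        var.assign_add(alpha_v * grad)

    # Steps 8-9: gradient ascent on gamma^t * delta * ln pi(a|s, theta).
    with tf.GradientTape() as tape:
        log_prob = tf.math.log(actor(s)[0, action] + 1e-8)
        policy_objective = discount * delta * log_prob
    grads = tape.gradient(policy_objective, actor.trainable_variables)
    for var, grad in zip(actor.trainable_variables, grads):
        var.assign_add(alpha * grad)


# Example call with random placeholder data (no environment attached).
t = 0
train_step(np.random.rand(state_dim), action=1, reward=1.0,
           next_state=np.random.rand(state_dim), discount=gamma ** t)
```

Applying the updates with assign_add directly mirrors the gradient ascent of steps 7 and 9; an equivalent and more common choice is to minimize the negated objectives with a Keras optimizer.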

