Advanced Deep Learning with Keras
Chapter 10

Actor-Critic method

In the REINFORCE with baseline method, the value is used only as a baseline. It is not used to train the value function. In this section, we'll introduce a variation of REINFORCE with baseline called the Actor-Critic method. The policy and value networks play the roles of actor and critic networks. The policy network is the actor deciding which action to take given the state. Meanwhile, the value network evaluates the decision made by the actor, or the policy network. The value network acts as a critic that quantifies how good or bad the chosen action made by the actor is. The value network evaluates the state value, V(s, θ_v), by comparing it with the sum of the received reward, r, and the discounted value of the observed next state, γV(s′, θ_v). The difference, δ, is expressed as:

δ = r_{t+1} + γV(s_{t+1}, θ_v) − V(s_t, θ_v) = r + γV(s′, θ_v) − V(s, θ_v)   (Equation 10.4.1)

where we dropped the subscripts of r and s for simplicity. Equation 10.4.1 is similar to the temporal differencing in Q-Learning discussed in Chapter 9, Deep Reinforcement Learning. The next state value is discounted by γ ∈ [0,1]. Estimating distant future rewards is difficult. Therefore, our estimate is based only on the immediate future, r + γV(s′, θ_v). This is known as the bootstrapping technique. The bootstrapping technique and the dependence on the state representation in Equation 10.4.1 often accelerate learning and reduce variance. From Equation 10.4.1, we notice that the value network evaluates the current state, s = s_t, which is due to the previous action, a_{t−1}, of the policy network. Meanwhile, the policy gradient is based on the current action, a_t. In a sense, the evaluation is delayed by one step.

Algorithm 10.4.1 summarizes the Actor-Critic method [1]. Apart from the evaluation of the state value, which is used to train both the policy and value networks, the training is done online. At every step, both networks are trained. This is unlike REINFORCE and REINFORCE with baseline, where the agent completes an episode before any training is performed. The value network is consulted twice: firstly, for the value estimate of the current state, and secondly, for the value of the next state. Both values are used in the computation of the gradients. Figure 10.4.1 shows the Actor-Critic network. We will implement the Actor-Critic method in Keras at the end of this chapter.
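Before presenting the full algorithm, the following minimal sketch illustrates how the TD error δ of Equation 10.4.1 can be bootstrapped from a value network. The value network architecture, the state dimension, and the sample numbers are illustrative assumptions only, not the implementation used later in this chapter.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical value network V(s, theta_v): a small MLP mapping a state
# vector to a scalar state value. The architecture is illustrative only.
state_dim = 4
value = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(state_dim,)),
    layers.Dense(1, activation="linear"),
])

gamma = 0.99                             # discount factor
s = np.random.randn(1, state_dim)        # current state s_t
s_next = np.random.randn(1, state_dim)   # observed next state s_{t+1}
r = 1.0                                  # reward received after the action

# Bootstrapped estimate of the return: r + gamma * V(s', theta_v)
v_s = value.predict(s, verbose=0)[0, 0]
v_s_next = value.predict(s_next, verbose=0)[0, 0]
delta = r + gamma * v_s_next - v_s       # Equation 10.4.1
print("TD error delta:", delta)

Note that only the immediate reward and the next-state value estimate are needed; no complete episode return is computed, which is what allows the training in Algorithm 10.4.1 to proceed online at every step.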
Algorithm 10.4.1 Actor-Critic

Require: A differentiable parameterized target policy network, π(a|s, θ).
Require: A differentiable parameterized value network, V(s, θ_v).
Require: Discount factor, γ ∈ [0,1], the learning rate α for the performance gradient, and the learning rate α_v for the value gradient.
Require: θ_0, initial policy network parameters (for example, θ_0 → 0). θ_v0, initial value network parameters (for example, θ_v0 → 0).

1. Repeat
2.    for steps t = 0, …, T−1 do
3.        Sample an action a ~ π(a|s, θ)
4.        Execute the action and observe the reward r and next state s′
5.        Evaluate the state value estimate, δ = r + γV(s′, θ_v) − V(s, θ_v)
6.        Compute the discounted value gradient, ∇V(θ_v) = γ^t δ ∇_θ_v V(s, θ_v)
7.        Perform gradient ascent, θ_v = θ_v + α_v ∇V(θ_v)
8.        Compute the discounted performance gradient, ∇J(θ) = γ^t δ ∇_θ ln π(a|s, θ)
9.        Perform gradient ascent, θ = θ + α ∇J(θ)
10.       s = s′

Figure 10.4.1: Actor-critic network
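The following is a minimal sketch of a single online training step from Algorithm 10.4.1, assuming a discrete action space, small fully connected actor and critic networks, and Adam optimizers standing in for the plain gradient ascent steps; the γ^t factor on the gradients in steps 6 and 8 is omitted for simplicity. All names, shapes, and hyperparameters here are assumptions for illustration, not the implementation presented at the end of this chapter.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical actor (policy) and critic (value) networks for a discrete
# action space. Sizes and learning rates are illustrative assumptions.
state_dim, n_actions, gamma = 4, 2, 0.99
actor = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(state_dim,)),
    layers.Dense(n_actions, activation="softmax"),
])
critic = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(state_dim,)),
    layers.Dense(1),
])
actor_opt = keras.optimizers.Adam(1e-3)    # alpha, performance gradient
critic_opt = keras.optimizers.Adam(5e-3)   # alpha_v, value gradient

def train_step(s, a, r, s_next, done):
    """One online Actor-Critic update from a single transition."""
    s = tf.convert_to_tensor([s], dtype=tf.float32)
    s_next = tf.convert_to_tensor([s_next], dtype=tf.float32)
    with tf.GradientTape(persistent=True) as tape:
        v_s = critic(s)[0, 0]
        v_s_next = critic(s_next)[0, 0]
        # Step 5: delta = r + gamma * V(s', theta_v) - V(s, theta_v);
        # the bootstrapped next-state value is zero at terminal states.
        target = r + (1.0 - float(done)) * gamma * v_s_next
        delta = target - v_s
        # Steps 6-7: move V(s) toward the bootstrapped target.
        critic_loss = tf.square(tf.stop_gradient(target) - v_s)
        # Steps 8-9: minimize -delta * ln pi(a|s, theta), which performs
        # gradient ascent on the performance objective weighted by delta.
        log_prob = tf.math.log(actor(s)[0, a] + 1e-8)
        actor_loss = -tf.stop_gradient(delta) * log_prob
    critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                                   critic.trainable_variables))
    actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                                  actor.trainable_variables))
    del tape
    return float(delta)

In an environment loop, train_step() would be called once per transition (s, a, r, s′), with the action a sampled from the actor's softmax output, mirroring steps 3 to 10 of the algorithm.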