Policy Gradient Methods

In the final chapter of this book, we're going to introduce algorithms that directly optimize the policy network in reinforcement learning. These algorithms are collectively referred to as policy gradient methods. Since the policy network is directly optimized during training, the policy gradient methods belong to the family of on-policy reinforcement learning algorithms. Like the value-based methods that we discussed in Chapter 9, Deep Reinforcement Learning, policy gradient methods can also be implemented as deep reinforcement learning algorithms.

A fundamental motivation for studying policy gradient methods is addressing the limitations of Q-Learning. Recall that Q-Learning is about selecting the action that maximizes the value of the state. With the Q function, we're able to determine the policy that enables the agent to decide which action to take for a given state. The chosen action is simply the one that gives the agent the maximum value. In this respect, Q-Learning is limited to a finite number of discrete actions; it's not able to deal with continuous action space environments. Furthermore, Q-Learning does not directly optimize the policy. In the end, reinforcement learning is about finding the optimal policy that the agent can use to decide which action to take in order to maximize the return.

In contrast, policy gradient methods are applicable to environments with discrete or continuous action spaces. In addition, the four policy gradient methods that we will be presenting in this chapter directly optimize the performance measure of the policy network. This results in a trained policy network that the agent can use to act optimally in its environment.

In summary, the goal of this chapter is to present:
• The policy gradient theorem
• Four policy gradient methods: REINFORCE, REINFORCE with baseline, Actor-Critic, and Advantage Actor-Critic (A2C)
• A guide on how to implement the policy gradient methods in Keras in a continuous action space environment
Policy gradient theorem

As discussed in Chapter 9, Deep Reinforcement Learning, in reinforcement learning the agent is situated in an environment that is in state s_t, an element of the state space S. The state space S may be discrete or continuous. The agent takes an action a_t from the action space A by obeying the policy π(a_t|s_t). A may be discrete or continuous. As a result of executing the action a_t, the agent receives a reward r_{t+1} and the environment transitions to a new state s_{t+1}. The new state depends only on the current state and action. The goal of the agent is to learn an optimal policy π* that maximizes the return from all states:

$\pi^{*} = \underset{\pi}{\arg\max}\, R_t$ (Equation 9.1.1)

The return, R_t, is defined as the discounted cumulative reward from time t until the end of the episode or until the terminal state is reached:

$V^{\pi}(s_t) = R_t = \sum_{k=0}^{T} \gamma^{k} r_{t+k}$ (Equation 9.1.2)

From Equation 9.1.2, the return can also be interpreted as the value of a given state when following the policy π. It can be observed from Equation 9.1.1 that future rewards have lower weights than immediate rewards, since generally γ^k < 1.0 where γ ∈ [0, 1].

So far, we have only considered learning the policy by optimizing a value-based function, Q(s, a). Our goal in this chapter is to learn the policy directly by parameterizing it, π(a_t|s_t) → π(a_t|s_t, θ). Through this parameterization, we can use a neural network to learn the policy function. Learning the policy means that we are going to maximize a certain objective function, J(θ), which is a performance measure with respect to the parameter θ. In episodic reinforcement learning, the performance measure is the value of the start state. In the continuous case, the objective function is the average reward rate.

Maximizing the objective function J(θ) is achieved by performing gradient ascent. In gradient ascent, the gradient update is in the direction of the derivative of the function being optimized. So far, all of our loss functions have been optimized by minimization, that is, by gradient descent. Later, in the Keras implementation, we will see that gradient ascent can be performed by simply negating the objective function and performing gradient descent.

The advantage of learning the policy directly is that it can be applied to both discrete and continuous action spaces. For discrete action spaces, the policy network outputs a probability for each possible action, for example through a softmax layer over the action logits.
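To make Equation 9.1.2 concrete, the discounted return can be computed in a few lines of Python. The following is a minimal illustrative sketch, not code from the book; the reward values and the discount factor of 0.99 are made-up assumptions:

def discounted_return(rewards, gamma=0.99):
    """R_t = sum over k of gamma**k * r_{t+k}, the discounted return of Equation 9.1.2."""
    ret = 0.0
    for k, reward in enumerate(rewards):
        ret += (gamma ** k) * reward
    return ret

# Rewards collected in one short episode (illustrative values only)
episode_rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
print(discounted_return(episode_rewards))  # 0.99**2 * 1.0 + 0.99**4 * 5.0, about 5.78

Because γ^k shrinks as k grows, the reward of 5.0 received four steps later contributes slightly less than its face value, which is exactly the weighting behavior noted above.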
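To illustrate the last two points, a parameterized policy for a discrete action space and gradient ascent performed by negating the objective, here is a minimal sketch. It is not the book's implementation: it uses tf.keras, the layer sizes, state_dim, and num_actions are illustrative assumptions, and the objective log π(a|s) · R is a generic REINFORCE-style choice.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

state_dim, num_actions = 4, 2  # illustrative sizes, not taken from the book

# Policy network: maps a state to action probabilities, that is, pi(a_t | s_t, theta)
state_input = keras.Input(shape=(state_dim,))
hidden = layers.Dense(64, activation="relu")(state_input)
logits = layers.Dense(num_actions)(hidden)
action_probs = layers.Softmax()(logits)
policy = keras.Model(state_input, action_probs)

optimizer = keras.optimizers.Adam(learning_rate=1e-3)

def policy_gradient_step(states, actions, returns):
    """One REINFORCE-style update: gradient ascent on E[log pi(a|s) * R],
    implemented as gradient descent on the negated objective."""
    actions = tf.cast(actions, tf.int32)
    returns = tf.cast(returns, tf.float32)
    with tf.GradientTape() as tape:
        probs = policy(states, training=True)
        # Probability of the action that was actually taken in each state
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        log_probs = tf.math.log(tf.gather_nd(probs, idx) + 1e-8)
        loss = -tf.reduce_mean(log_probs * returns)  # negate to turn ascent into descent
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
    return float(loss)

A full training loop would repeatedly sample an episode with the current policy, compute the return for each step (for example with discounted_return above), and call policy_gradient_step on the collected states, actions, and returns. The four methods presented in this chapter differ mainly in how the term that scales log π(a|s) is chosen.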