Advanced Deep Learning with Keras


Policy Gradient Methods

In the final chapter of this book, we're going to introduce algorithms that directly optimize the policy network in reinforcement learning. These algorithms are collectively referred to as policy gradient methods. Since the policy network is directly optimized during training, the policy gradient methods belong to the family of on-policy reinforcement learning algorithms. Like the value-based methods that we discussed in Chapter 9, Deep Reinforcement Learning, policy gradient methods can also be implemented as deep reinforcement learning algorithms.

A fundamental motivation for studying policy gradient methods is addressing the limitations of Q-Learning. Recall that Q-Learning is about selecting the action that maximizes the value of the state. With the Q function, we're able to determine the policy that enables the agent to decide which action to take for a given state. The chosen action is simply the one that gives the agent the maximum value. In this respect, Q-Learning is limited to a finite number of discrete actions; it's not able to deal with continuous action space environments (see the short sketch after the chapter outline below). Furthermore, Q-Learning does not directly optimize the policy. In the end, reinforcement learning is about finding the optimal policy that the agent can use to decide which action to take in order to maximize the return.

In contrast, policy gradient methods are applicable to environments with either discrete or continuous action spaces. In addition, the four policy gradient methods that we present in this chapter directly optimize the performance measure of the policy network. The result is a trained policy network that the agent can use to act optimally in its environment.

In summary, the goal of this chapter is to present:

• The policy gradient theorem
• Four policy gradient methods: REINFORCE, REINFORCE with baseline, Actor-Critic, and Advantage Actor-Critic (A2C)
• A guide on how to implement the policy gradient methods in Keras in a continuous action space environment
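As a quick, purely illustrative sketch of the Q-Learning limitation described above: with a tabular (or discrete-output) Q function, the greedy policy is an argmax over a finite, enumerable set of actions, and that argmax has no direct counterpart when the action space is continuous. The table sizes and values below are made-up assumptions, not part of the book's code.

import numpy as np

# Hypothetical tabular Q function: Q(s, a) for 5 discrete states and 3 discrete actions.
n_states, n_actions = 5, 3
rng = np.random.default_rng(0)
q_table = rng.random((n_states, n_actions))

def greedy_action(state):
    """Q-Learning's implicit policy: pick the action with the maximum Q value."""
    return int(np.argmax(q_table[state]))

print(greedy_action(0))
# With a continuous action space there is no finite set of actions to enumerate,
# so this argmax has no direct equivalent -- one motivation for optimizing the
# policy directly instead.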

Policy gradient theorem

As discussed in Chapter 9, Deep Reinforcement Learning, in reinforcement learning the agent is situated in an environment that is in state $s_t$, an element of the state space S. The state space S may be discrete or continuous. The agent takes an action $a_t$ from the action space A by obeying the policy $\pi(a_t|s_t)$. A may be discrete or continuous. As a result of executing the action $a_t$, the agent receives a reward $r_{t+1}$, and the environment transitions to a new state $s_{t+1}$. The new state is dependent only on the current state and action. The goal of the agent is to learn an optimal policy $\pi^*$ that maximizes the return from all states:

$\pi^* = \underset{\pi}{\arg\max}\, R_t$ (Equation 9.1.1)

The return, $R_t$, is defined as the discounted cumulative reward from time t until the end of the episode or until the terminal state is reached:

$V^{\pi}(s_t) = R_t = \sum_{k=0}^{T} \gamma^{k} r_{t+k}$ (Equation 9.1.2)

From Equation 9.1.2, the return can also be interpreted as the value of a given state when following the policy $\pi$. It can also be observed from Equation 9.1.2 that future rewards have lower weights compared to immediate rewards since generally $\gamma^k < 1.0$, where $\gamma \in [0, 1]$.

So far, we have only considered learning the policy by optimizing a value-based function, Q(s, a). Our goal in this chapter is to learn the policy directly by parameterizing it, $\pi(a_t|s_t) \rightarrow \pi(a_t|s_t, \theta)$. Through parameterization, we can use a neural network to learn the policy function. Learning the policy means that we are going to maximize a certain objective function, $J(\theta)$, which is a performance measure with respect to the parameter $\theta$. In episodic reinforcement learning, the performance measure is the value of the start state. In the continuing case, the objective function is the average reward rate.

Maximizing the objective function $J(\theta)$ is achieved by performing gradient ascent. In gradient ascent, the gradient update is in the direction of the derivative of the function being optimized. So far, all of our loss functions have been optimized by minimization, that is, by performing gradient descent. Later, in the Keras implementation, we'll see that gradient ascent can be performed by simply negating the objective function and performing gradient descent.

The advantage of learning the policy directly is that it can be applied to both discrete and continuous action spaces. For discrete action spaces:
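Two concrete points from the discussion above lend themselves to a short sketch: computing the discounted return of Equation 9.1.2 from a list of rewards, and performing gradient ascent on $J(\theta)$ by doing gradient descent on its negative. The following is a minimal illustration rather than the chapter's actual implementation: it assumes a small discrete-action policy network for simplicity (the chapter's own example uses a continuous action space), it borrows a REINFORCE-style objective (introduced later in this chapter) just to have something concrete to negate, and the layer sizes, learning rate, and function names are illustrative assumptions.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def discounted_returns(rewards, gamma=0.99):
    """Equation 9.1.2: R_t = sum_k gamma^k * r_{t+k}, computed for every time step t."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical dimensions for a small discrete-action environment.
state_dim, n_actions = 4, 2

# Policy network pi(a|s, theta): state in, action probabilities out.
policy = keras.Sequential([
    layers.Input(shape=(state_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_actions, activation="softmax"),
])
optimizer = keras.optimizers.Adam(learning_rate=1e-3)

def train_step(states, actions, returns):
    """One gradient-ascent step on J(theta), done as descent on -J(theta)."""
    with tf.GradientTape() as tape:
        probs = policy(states)                                  # pi(a|s, theta)
        taken = tf.reduce_sum(probs * tf.one_hot(actions, n_actions), axis=1)
        log_probs = tf.math.log(taken + 1e-8)
        # Negating the objective turns gradient ascent into gradient descent.
        loss = -tf.reduce_mean(log_probs * returns)
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
    return loss

As a quick check of the return computation, discounted_returns([1.0, 0.0, 1.0]) with the default gamma of 0.99 gives approximately [1.98, 0.99, 1.0], since R_0 = 1 + 0.99^2, R_1 = 0.99, and R_2 = 1.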

