Advanced Deep Learning with Keras


Policy Gradient Methods

In the final chapter of this book, we're going to introduce algorithms that directly optimize the policy network in reinforcement learning. These algorithms are collectively referred to as policy gradient methods. Since the policy network is directly optimized during training, policy gradient methods belong to the family of on-policy reinforcement learning algorithms. Like the value-based methods we discussed in Chapter 9, Deep Reinforcement Learning, policy gradient methods can also be implemented as deep reinforcement learning algorithms.

A fundamental motivation for studying policy gradient methods is to address the limitations of Q-Learning. Recall that Q-Learning is about selecting the action that maximizes the value of the state. With the Q function, we can determine the policy that enables the agent to decide which action to take in a given state: the chosen action is simply the one that gives the agent the maximum value. In this respect, Q-Learning is limited to a finite number of discrete actions; it is not able to deal with continuous action space environments. Furthermore, Q-Learning does not directly optimize the policy. In the end, reinforcement learning is about finding the optimal policy that the agent can use to decide which action to take in order to maximize the return.

In contrast, policy gradient methods are applicable to environments with discrete or continuous action spaces. In addition, the four policy gradient methods presented in this chapter directly optimize the performance measure of the policy network. This results in a trained policy network that the agent can use to act optimally in its environment.

In summary, the goal of this chapter is to present:

• The policy gradient theorem
• Four policy gradient methods: REINFORCE, REINFORCE with baseline, Actor-Critic, and Advantage Actor-Critic (A2C)
• A guide on how to implement the policy gradient methods in Keras in a continuous action space environment

Policy gradient theorem

As discussed in Chapter 9, Deep Reinforcement Learning, in reinforcement learning the agent is situated in an environment that is in state s_t, an element of the state space S. The state space S may be discrete or continuous. The agent takes an action a_t from the action space A by obeying the policy π(a_t | s_t). A may be discrete or continuous. As a result of executing the action a_t, the agent receives a reward r_{t+1} and the environment transitions to a new state s_{t+1}. The new state depends only on the current state and action. The goal of the agent is to learn an optimal policy π* that maximizes the return from all states:

π* = argmax_π R_t (Equation 9.1.1)

The return R_t is defined as the discounted cumulative reward from time t until the end of the episode, or until the terminal state is reached:

V^π(s_t) = R_t = Σ_{k=0}^{T} γ^k r_{t+k} (Equation 9.1.2)

From Equation 9.1.2, the return can also be interpreted as the value of a given state when following the policy π. It can be observed from Equation 9.1.2 that future rewards have lower weights than immediate rewards, since generally γ^k < 1.0 where γ ∈ [0, 1].

So far, we have only considered learning the policy by optimizing a value-based function, Q(s, a). Our goal in this chapter is to learn the policy directly by parameterizing it, π(a_t | s_t) → π(a_t | s_t, θ). Through parameterization, we can use a neural network to learn the policy function. Learning the policy means that we are going to maximize a certain objective function J(θ), which is a performance measure with respect to the parameter θ. In episodic reinforcement learning, the performance measure is the value of the start state. In the continuous case, the objective function is the average reward rate.

Maximizing the objective function J(θ) is achieved by performing gradient ascent. In gradient ascent, the gradient update is in the direction of the derivative of the function being optimized.
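As an aside, the return of Equation 9.1.2 can be computed recursively from a recorded sequence of rewards; a minimal sketch (the function name is mine, not from the book):

```python
# Compute the return R_t of Equation 9.1.2: the discounted cumulative
# reward from time t until the end of the episode.
def discounted_return(rewards, gamma=0.99):
    ret = 0.0
    # Iterate backwards so each earlier reward accumulates one more
    # factor of gamma: ret = r_t + gamma * (r_{t+1} + gamma * (...))
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # → 1.75
```

The backward recursion is equivalent to the forward sum Σ γ^k r_{t+k} but needs only one pass over the rewards.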
So far, all our loss functions have been optimized by minimization, that is, by performing gradient descent. Later, in the Keras implementation, we will see that gradient ascent can be performed by simply negating the objective function and performing gradient descent.

The advantage of learning the policy directly is that it can be applied to both discrete and continuous action spaces. For discrete action spaces:
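A common parameterization for a discrete action space is a softmax over the policy network's outputs, which turns the outputs into a categorical distribution π(a_t | s_t, θ) that the agent samples from. As an illustration only (the function names are mine), this can be sketched with plain NumPy:

```python
import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability;
    # the resulting probabilities are unchanged by this shift.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def sample_action(logits, rng):
    # pi(a_t | s_t, theta) = softmax over the policy network's outputs;
    # the agent samples an action index from this distribution.
    probs = softmax(logits)
    return rng.choice(len(probs), p=probs)

# Toy example: three actions; the first output is largest, so the
# first action gets the highest probability.
probs = softmax(np.array([2.0, 1.0, 0.1]))
action = sample_action(np.array([2.0, 1.0, 0.1]), np.random.default_rng(0))
```

In a real agent, the logits would come from the final dense layer of the policy network rather than a fixed array.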

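Returning to the earlier point that gradient ascent on J(θ) is equivalent to gradient descent on −J(θ): a toy sketch of this trick, using an illustrative objective of my own choosing rather than one from the book:

```python
# Maximize J(theta) = -(theta - 3)^2 by minimizing the negated
# objective L(theta) = -J(theta) = (theta - 3)^2 with gradient descent.
def grad_L(theta):
    # dL/dtheta = 2 * (theta - 3)
    return 2.0 * (theta - 3.0)

theta, lr = 0.0, 0.1
for _ in range(100):
    theta -= lr * grad_L(theta)  # descent on L is ascent on J

# theta converges to 3.0, the maximizer of J
```

This is exactly what happens in the Keras implementations later in the chapter: the objective is negated and handed to a minimizing optimizer.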
