
Policy Gradient Methods

Given a continuously differentiable policy function, $\pi(a_t \mid s_t, \theta)$, the policy gradient $\nabla J(\theta)$ can be computed as:

$$\nabla J(\theta) = \mathbb{E}_{\pi}\!\left[\frac{\nabla_{\theta}\,\pi(a_t \mid s_t, \theta)}{\pi(a_t \mid s_t, \theta)}\,Q^{\pi}(s_t, a_t)\right] = \mathbb{E}_{\pi}\!\left[\nabla_{\theta}\ln\pi(a_t \mid s_t, \theta)\,Q^{\pi}(s_t, a_t)\right] \qquad \text{(Equation 10.1.6)}$$

Equation 10.1.6 is also known as the policy gradient theorem. It is applicable to both

discrete and continuous action spaces. The gradient with respect to the parameter θ

is computed from the natural logarithm of the policy action sampling scaled by the

Q value. Equation 10.1.6 takes advantage of the property of the natural logarithm,

$\nabla \ln x = \dfrac{\nabla x}{x}$.
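As a concrete illustration, the following is a minimal sketch (not the book's listing) of one gradient ascent step following Equation 10.1.6 with a softmax policy network in Keras. The network architecture, the state dimensions, and the Q value are all stand-in assumptions; in practice the Q value is estimated from sampled returns or by a critic.

```python
import numpy as np
import tensorflow as tf

# Hypothetical softmax policy network; sizes are arbitrary for illustration.
n_states, n_actions = 4, 2
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(n_states,)),
    tf.keras.layers.Dense(n_actions, activation='softmax'),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

state = tf.constant(np.random.randn(1, n_states), dtype=tf.float32)
q_value = 1.5  # assumed Q^pi(s_t, a_t) for the sampled action

with tf.GradientTape() as tape:
    probs = policy(state)                          # pi(a|s, theta)
    action = int(tf.random.categorical(tf.math.log(probs), 1)[0, 0])
    log_prob = tf.math.log(probs[0, action])       # ln pi(a_t|s_t, theta)
    loss = -log_prob * q_value                     # minimizing the negative performs ascent

grads = tape.gradient(loss, policy.trainable_variables)
optimizer.apply_gradients(zip(grads, policy.trainable_variables))
```

Minimizing the negative of $\ln\pi(a_t \mid s_t, \theta)\,Q^{\pi}(s_t, a_t)$ with a standard optimizer is the usual way to perform the gradient ascent implied by Equation 10.1.6.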

The policy gradient theorem is intuitive in the sense that the performance gradient is estimated from target policy samples and is proportional to the policy gradient. The policy gradient is scaled by the Q value to encourage actions that positively contribute to the state value. The gradient is also inversely proportional to the action probability, which penalizes frequently occurring actions that do not contribute to an increase in the performance measure.
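The inverse proportionality can be seen directly from the log-derivative identity. The following small numerical illustration (not from the book) shows that for the same Q value, the weight $Q/\pi(a \mid s)$ applied to the raw policy gradient is larger for rarely chosen actions, so they receive bigger updates when they do pay off:

```python
# Assumed probabilities and Q value, purely for illustration.
q_value = 1.0
for prob in (0.9, 0.5, 0.1):
    weight = q_value / prob    # Q * grad(pi)/pi == Q * grad(ln pi)
    print(f"pi(a|s) = {prob:.1f} -> gradient scale = {weight:.2f}")
```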

In the next section, we will demonstrate the different methods of estimating the

policy gradient.

For the proof of the policy gradient theorem, please see [2] and the lecture notes from David Silver on Reinforcement Learning, http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf

There are subtle advantages to policy gradient methods. For example, in some card-based games, value-based methods have no straightforward procedure for handling stochasticity, unlike policy-based methods. In policy-based methods, the action probability changes smoothly with the parameters. Meanwhile, value-based methods may suffer from drastic changes in action selection with respect to small changes in parameters. Lastly, the dependence of policy-based methods on the parameters leads us to different formulations of how to perform gradient ascent on the performance measure. These formulations give rise to the four policy gradient methods presented in the succeeding sections.

Policy-based methods have their own disadvantages as well. They are generally harder to train because of their tendency to converge on a local optimum instead of the global optimum. In the experiments presented at the end of this chapter, it is easy for an agent to become comfortable and choose actions that do not necessarily give the highest value. Policy gradient estimates are also characterized by high variance.
