
A neural network is used to approximate the policy; in its simplest form, the network learns a policy for selecting actions that maximize the rewards by adjusting its weights using gradient ascent, hence the name: policy gradients.
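To make the idea concrete, the following is a minimal REINFORCE-style sketch in TensorFlow 2; the two-action softmax network, the `policy_net` name, and the pre-computed `returns` are illustrative assumptions, not the chapter's own code.

```python
import tensorflow as tf

# Hypothetical policy network: maps a 4-dimensional state to probabilities
# over two discrete actions (shapes are illustrative only).
policy_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

def policy_gradient_step(states, actions, returns):
    """One REINFORCE-style update: minimize -log pi(a|s) * G, which is
    gradient ascent on the expected return."""
    with tf.GradientTape() as tape:
        probs = policy_net(states)                           # pi(.|s) for all actions
        action_probs = tf.reduce_sum(
            probs * tf.one_hot(actions, depth=2), axis=1)    # pi(a_t|s_t)
        loss = -tf.reduce_mean(tf.math.log(action_probs) * returns)
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    return loss
```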

In this section, we will focus on the Deep Deterministic Policy Gradient (DDPG) algorithm, another successful RL algorithm introduced by Google's DeepMind in 2015. DDPG is implemented using two networks: one called the actor network and the other called the critic network.

The actor network approximates the optimal policy deterministically, that is, it outputs the most preferred action for any given input state. In essence, the actor is learning the optimal policy. The critic, on the other hand, evaluates the optimal action value function using the actor's most preferred action. Before going further, let us contrast this with the DQN algorithm that we discussed in the preceding section. In the following diagram, you can see the general architecture of DDPG:
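The diagram's two-network structure can also be sketched directly in code. The following minimal tf.keras definitions are an illustrative assumption; the layer sizes, `state_dim`, and `action_dim` are not taken from the chapter.

```python
import tensorflow as tf

state_dim, action_dim = 3, 1  # illustrative dimensions (e.g. a pendulum task)

# Actor: state -> most preferred (deterministic) action, bounded by tanh.
actor = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(action_dim, activation="tanh"),
])

# Critic: (state, action) -> scalar Q-value.
state_in = tf.keras.Input(shape=(state_dim,))
action_in = tf.keras.Input(shape=(action_dim,))
x = tf.keras.layers.Concatenate()([state_in, action_in])
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
q_value = tf.keras.layers.Dense(1)(x)
critic = tf.keras.Model([state_in, action_in], q_value)
```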

The actor network outputs the most preferred action; the critic takes as input both the input state and the action taken, and evaluates its Q-value. To train the critic network, we follow the same procedure as in DQN, that is, we try to minimize the difference between the estimated Q-value and the target Q-value. The gradient of the Q-value with respect to the action is then propagated back to train the actor network. So, if the critic is good enough, it will force the actor to choose actions with optimal Q-values.
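Putting the two updates together, here is a minimal sketch of one DDPG training step in TensorFlow 2, assuming slowly updated target copies of the actor and critic (as in DQN); the discount factor, learning rates, and the `ddpg_train_step` name are illustrative assumptions.

```python
import tensorflow as tf

gamma = 0.99  # discount factor (assumed value)
critic_opt = tf.keras.optimizers.Adam(1e-3)
actor_opt = tf.keras.optimizers.Adam(1e-4)

def ddpg_train_step(states, actions, rewards, next_states, dones,
                    actor, critic, target_actor, target_critic):
    # Critic update: minimize the squared TD error, as in DQN, with the
    # next action supplied by the (assumed) target actor network.
    target_actions = target_actor(next_states)
    target_q = rewards + gamma * (1.0 - dones) * tf.squeeze(
        target_critic([next_states, target_actions]), axis=1)
    with tf.GradientTape() as tape:
        q = tf.squeeze(critic([states, actions]), axis=1)
        critic_loss = tf.reduce_mean(tf.square(target_q - q))
    critic_grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(critic_grads, critic.trainable_variables))

    # Actor update: the gradient of the critic's Q-value with respect to the
    # action flows back through the actor's weights.
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([states, actor(states)]))
    actor_grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(actor_grads, actor.trainable_variables))
    return critic_loss, actor_loss
```

Note that the actor loss is simply the negated Q-value, so minimizing it performs gradient ascent on the critic's estimate, which is exactly how the critic "forces" the actor toward better actions.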
