
Deep Reinforcement Learning

A high correlation is due to the sequential nature of sampling experiences. DQN addressed this issue by creating a buffer of experiences. The training data are randomly sampled from this buffer. This process is known as experience replay.
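A minimal sketch of experience replay follows; the buffer capacity, field names, and batch size are illustrative assumptions rather than the book's own code. A fixed-length deque stores experience tuples and is sampled uniformly at random:

from collections import deque
import random

class ReplayBuffer:
    """Illustrative replay buffer; capacity and batch size are assumed values."""
    def __init__(self, capacity=100000):
        # oldest experiences are discarded first once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # append one experience tuple gathered while interacting with the environment
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # uniform random sampling breaks the temporal correlation
        # between consecutive experiences
        return random.sample(self.buffer, batch_size)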

The issue of the non-stationary target is due to the target network Q(s', a'), which is modified after every mini-batch of training. A small change in the target network can create a significant change in the policy, the data distribution, and the correlation between the current Q value and the target Q value. This is resolved by freezing the weights of the target network for C training steps. In other words, two identical Q-Networks are created. The target Q-Network parameters are copied from the Q-Network under training every C training steps.
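A minimal sketch of this periodic weight synchronization in Keras is shown below; the names q_model and target_model and the value of C are assumptions for illustration, not the book's listing:

from tensorflow.keras.models import clone_model

# q_model is the Q-Network under training (assumed to exist already).
# The target network is an identical copy whose weights stay frozen
# between synchronizations.
target_model = clone_model(q_model)
target_model.set_weights(q_model.get_weights())

C = 1000  # assumed number of training steps between weight copies

def maybe_update_target(step):
    # Copy the trained Q-Network weights into the target network
    # only every C training steps; otherwise the target stays fixed.
    if step % C == 0:
        target_model.set_weights(q_model.get_weights())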

The DQN algorithm is summarized in Algorithm 9.6.1.

DQN on Keras

To illustrate DQN, the CartPole-v0 environment of the OpenAI Gym is used. CartPole-v0 is a pole-balancing problem. The goal is to keep the pole from falling over. The environment is 2D. The action space is made of two discrete actions (left and right movements). However, the state space is continuous and is made of four variables:

1. Linear position (of the cart)

2. Linear velocity (of the cart)

3. Angle of rotation (of the pole)

4. Angular velocity (of the pole)

The CartPole-v0 is shown in Figure 9.6.1.

Initially, the pole is upright. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole exceeds 12 degrees from the vertical or the cart moves more than 2.4 units from the center. The CartPole-v0 problem is considered solved if the average reward is 195.0 over 100 consecutive trials:

Figure 9.6.1: The CartPole-v0 environment
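A short sketch of how the environment can be instantiated and inspected with the Gym API of that era (env.step returning four values) is given below; the random-action rollout is only illustrative and is not part of the DQN agent:

import gym

env = gym.make('CartPole-v0')
print(env.action_space)       # Discrete(2): move the cart left or right
print(env.observation_space)  # Box with 4 continuous state variables

# Roll out one episode with random actions to see the +1 reward per timestep.
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()            # random left/right action
    state, reward, done, info = env.step(action)  # reward is +1 each step
    total_reward += reward
print('Episode return:', total_reward)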

