Image source: https://arxiv.org/pdf/1511.06581.pdf

The aggregate layer is implemented in a manner that allows one to recover both V and A from the given Q. This is achieved by enforcing that the advantage function estimator has zero advantage at the chosen action:

$Q(S, A; \theta, \alpha, \beta) = A(S, A; \theta, \alpha) + V^{\pi}(S; \theta, \beta) - \max_{a' \in |\mathcal{A}|} A(S, A'; \theta, \alpha)$

In the paper, Wang et al. reported that the network is more stable if the max operation is replaced by the average operation, because the advantage estimates then change at the same speed as their average rather than being tied to the optimal (max) value.
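To make the aggregation concrete, here is a minimal sketch of a dueling Q-network head in tf.keras that uses the average form of the aggregation discussed above. The layer sizes, state dimension, and function name are illustrative assumptions, not the chapter's own code:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dueling_q_network(state_dim, n_actions):
    """Dueling head: Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))."""
    inputs = layers.Input(shape=(state_dim,))
    x = layers.Dense(64, activation="relu")(inputs)
    x = layers.Dense(64, activation="relu")(x)

    value = layers.Dense(1)(x)              # value stream V(s)
    advantage = layers.Dense(n_actions)(x)  # advantage stream A(s, a)

    # Aggregation with the average operator instead of max.
    q_values = layers.Lambda(
        lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
    )([value, advantage])

    return tf.keras.Model(inputs=inputs, outputs=q_values)

# Hypothetical usage: an 8-dimensional state and 4 discrete actions.
model = build_dueling_q_network(state_dim=8, n_actions=4)
```

Subtracting the mean advantage keeps the two streams identifiable while avoiding the abrupt shifts that the max operator can introduce.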

Rainbow

Rainbow is the current state-of-the-art DQN variant. Technically, calling it a DQN variant would be wrong; in essence, it is an ensemble of many DQN variants combined into a single algorithm. It modifies the distributional RL [6] loss to a multi-step loss and combines it with Double DQN using the greedy action. Quoting from the paper:

"The network architecture is a dueling network architecture adapted for use with return distributions. The network has a shared representation f_ξ(s), which is then fed into a value stream v_η with N_atoms outputs, and into an advantage stream a_ψ with N_atoms × N_actions outputs, where a_ψ^i(f_ξ(s), a) will denote the output corresponding to atom i and action a. For each atom z_i, the value and advantage streams are aggregated, as in Dueling DQN, and then passed through a softmax layer to obtain the normalised parametric distributions used to estimate the returns' distributions."

Rainbow combines six different RL algorithms:

• N-step returns
• Distributional state-action value learning
• Dueling networks
• Noisy networks
• Double DQN
• Prioritized Experience Replay

Deep deterministic policy gradient

DQN and its variants have been very successful in solving problems where the state space is continuous and the action space is discrete. For example, in Atari games, the input space consists of raw pixels, but the actions are discrete: [up, down, left, right, no-op]. How do we solve a problem with a continuous action space? For instance, an RL agent driving a car needs to turn its wheels, and this action has a continuous action space. One way to handle this situation is to discretize the action space and continue with DQN or one of its variants. However, a better solution is to use a policy gradient algorithm. In policy gradient methods, the policy π(A|S) is approximated directly.
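To illustrate what approximating the policy directly means for a continuous action space, here is a minimal sketch of an actor network that maps a state to a bounded continuous action. The layer sizes, dimensions, and names are illustrative assumptions rather than code from the chapter; a complete policy gradient method such as DDPG would additionally train a critic:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_actor(state_dim, action_dim, action_bound):
    inputs = layers.Input(shape=(state_dim,))
    x = layers.Dense(64, activation="relu")(inputs)
    x = layers.Dense(64, activation="relu")(x)

    # tanh keeps the raw output in [-1, 1]; scaling maps it onto the
    # environment's continuous action range (e.g. a steering angle).
    raw_action = layers.Dense(action_dim, activation="tanh")(x)
    action = layers.Lambda(lambda a: a * action_bound)(raw_action)

    return tf.keras.Model(inputs=inputs, outputs=action)

# Hypothetical usage: a 1-dimensional steering action bounded in [-1, 1].
actor = build_actor(state_dim=8, action_dim=1, action_bound=1.0)
```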
