(Image source: https://arxiv.org/pdf/1511.06581.pdf)

The aggregate layer is implemented in a manner that allows one to recover both V and A from the given Q. This is achieved by enforcing that the advantage function estimator has zero advantage at the chosen action:

$$Q(S, A; \theta, \alpha, \beta) = A(S, A; \theta, \alpha) + V^{\pi}(S; \theta, \beta) - \max_{a' \in |\mathcal{A}|} A(S, a'; \theta, \alpha)$$

In the paper, Wang et al. reported that the network is more stable if the max operation is replaced by an average over the actions:

$$Q(S, A; \theta, \alpha, \beta) = A(S, A; \theta, \alpha) + V^{\pi}(S; \theta, \beta) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(S, a'; \theta, \alpha)$$

With this change, the advantage estimates only need to change as fast as their mean, rather than having to track every change in the optimal (max) advantage.
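To make the aggregation concrete, here is a minimal sketch of a dueling head in tf.keras; the function name `build_dueling_head`, the layer sizes, and the toy dimensions are illustrative assumptions, not code from this chapter:

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Lambda
from tensorflow.keras.models import Model

def build_dueling_head(state_dim, n_actions):
    """Dueling head: separate V and A streams, aggregated with the
    mean-subtracted advantage (the more stable variant)."""
    state = Input(shape=(state_dim,))
    x = Dense(64, activation='relu')(state)

    value = Dense(1)(x)              # V(S; theta, beta), shape (batch, 1)
    advantage = Dense(n_actions)(x)  # A(S, a; theta, alpha), shape (batch, n_actions)

    # Q(S, a) = V(S) + (A(S, a) - mean over a' of A(S, a'))
    q_values = Lambda(
        lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
    )([value, advantage])

    return Model(inputs=state, outputs=q_values)

# Example: a toy environment with a 4-dimensional state and 2 actions
model = build_dueling_head(state_dim=4, n_actions=2)
```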
Rainbow

Rainbow is the current state-of-the-art DQN variant. Technically, calling it a DQN variant would be wrong; in essence, it is an ensemble of many DQN variants combined into a single algorithm. It modifies the distributional RL [6] loss to a multi-step loss and combines it with Double DQN using the greedy action. Quoting from the paper:

"The network architecture is a dueling network architecture adapted for use with return distributions. The network has a shared representation $f_\xi(s)$, which is then fed into a value stream $v_\eta$ with $N_{atoms}$ outputs, and into an advantage stream $a_\xi$ with $N_{atoms} \times N_{actions}$ outputs, where $a_\xi^i(f_\xi(s), a)$ will denote the output corresponding to atom $i$ and action $a$. For each atom $z_i$, the value and advantage streams are aggregated, as in Dueling DQN, and then passed through a softmax layer to obtain the normalised parametric distributions used to estimate the returns' distributions."

Rainbow combines six different DQN extensions (a sketch of the atom-wise aggregation described above follows this list):

• N-step returns
• Distributional state-action value learning
• Dueling networks
• Noisy networks
• Double DQN
• Prioritized Experience Replay
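Here is a minimal sketch of the quoted atom-wise aggregation in tf.keras; the function name, layer sizes, and the use of a Lambda layer are illustrative assumptions rather than the Rainbow authors' code:

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Lambda, Reshape, Softmax
from tensorflow.keras.models import Model

def build_distributional_dueling_head(feature_dim, n_actions, n_atoms):
    """Dueling aggregation over return distributions: value and advantage
    streams are combined per atom, then a softmax over atoms yields a
    normalised distribution for each action."""
    features = Input(shape=(feature_dim,))            # shared representation f(s)

    value = Dense(n_atoms)(features)                  # value stream: N_atoms outputs
    value = Reshape((1, n_atoms))(value)

    advantage = Dense(n_actions * n_atoms)(features)  # advantage stream: N_atoms x N_actions
    advantage = Reshape((n_actions, n_atoms))(advantage)

    # Per-atom dueling aggregation (mean-subtracted advantage over actions)
    logits = Lambda(
        lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
    )([value, advantage])

    # Softmax over the atom dimension -> one distribution per action
    dist = Softmax(axis=-1)(logits)                   # shape (batch, n_actions, n_atoms)

    return Model(inputs=features, outputs=dist)
```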
Deep deterministic policy gradient

DQN and its variants have been very successful in solving problems where the state space is continuous and the action space is discrete. For example, in Atari games, the input space consists of raw pixels, but the actions are discrete: [up, down, left, right, no-op]. How do we solve a problem with a continuous action space? For instance, an RL agent driving a car needs to turn its wheels by a continuous amount. One way to handle this situation is to discretize the action space and continue with DQN or one of its variants. However, a better solution is to use a policy gradient algorithm. In policy gradient methods, the policy π(A|S) is approximated directly.
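As a concrete illustration of a policy network for a continuous action space, here is a minimal sketch of a deterministic actor in the DDPG style, built with tf.keras; the names, layer sizes, and the [-1, 1] steering bound are illustrative assumptions, not the book's code:

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Lambda
from tensorflow.keras.models import Model

def build_actor(state_dim, action_dim, action_bound):
    """Deterministic policy network mapping a state directly to a
    continuous action. tanh bounds the output to [-1, 1]; scaling by
    action_bound maps it to the environment's action range."""
    state = Input(shape=(state_dim,))
    x = Dense(64, activation='relu')(state)
    x = Dense(64, activation='relu')(x)
    raw_action = Dense(action_dim, activation='tanh')(x)
    action = Lambda(lambda a: a * action_bound)(raw_action)
    return Model(inputs=state, outputs=action)

# Example: 8-dimensional state, one continuous steering action in [-1, 1]
actor = build_actor(state_dim=8, action_dim=1, action_bound=1.0)
```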