Reinforcement Learning with Long Short-Term Memory
The Elman-BPTT system never reached satisfactory solutions in 10 runs. It only learned to balance the pole for the first 50 timesteps, when the mode information is available, thus failing to learn the long-term dependency. However, RL-LSTM learned optimal performance in 2 out of 10 runs (after an average of 6,250,000 timesteps of learning). After learning, these two agents were able to balance the pole indefinitely in both modes of operation. In the other 8 runs, the agents still learned to balance the pole in both modes for hundreds or even thousands of timesteps (after an average of 8,095,000 timesteps of learning), thus showing that the mode information was remembered for long time lags. In most cases, such an agent learns optimal performance for one mode, while achieving good but suboptimal performance in the other.
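The long-lag memorization reported above rests on LSTM's constant error carousel: the cell state is a near-linear self-loop, so gradients through it are a product of forget-gate activations rather than repeatedly squashed weight terms. A minimal numerical sketch (not from the paper; gate values are illustrative assumptions) of why error can survive a lag on the order of the tasks used here:

```python
import numpy as np

# LSTM cell state follows c_t = f_t * c_{t-1} + i_t * g_t, so the
# gradient dc_T/dc_0 is simply the product of the forget gates f_t.
# With f_t held near 1, that product stays near 1 over long lags,
# whereas a simple RNN multiplies in a factor |w * tanh'(.)| that is
# typically < 1, shrinking the gradient geometrically.

T = 70  # a lag comparable to the longest T-maze dependency

f = np.full(T, 0.999)                 # forget gates held open
lstm_grad = np.prod(f)                # stays close to 1 after 70 steps

rnn_factor = np.full(T, 0.5)          # illustrative per-step RNN factor
rnn_grad = np.prod(rnn_factor)        # vanishes after 70 steps
```

With these (assumed) gate values, the LSTM gradient is still above 0.9 after 70 steps, while the RNN-style product has vanished to roughly 1e-21, which mirrors why Elman-BPTT failed on the long-lag dependency.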
5 Conclusions
The results presented in this paper suggest that reinforcement learning with Long Short-Term Memory (RL-LSTM) is a promising approach to solving non-Markovian RL tasks with long-term dependencies. This was demonstrated in a T-maze task with long time lag dependencies of up to 70 timesteps, as well as in a non-Markovian version of pole balancing where optimal performance requires remembering information indefinitely. RL-LSTM's main power is derived from LSTM's property of constant error flow, but for good performance in RL tasks, the combination with Advantage(λ) learning and directed exploration was crucial.
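The Advantage-learning component can be sketched in tabular form. This is a toy illustration only: the paper itself uses an LSTM network as function approximator with eligibility traces (the λ part), both of which are omitted here, and the state/action sizes and constants below are assumptions.

```python
import numpy as np

# Tabular Advantage-learning update in the style of Harmon & Baird:
#   A(s,a) <- A(s,a) + alpha * (target - A(s,a)), with
#   target = V(s) + (r + gamma * V(s') - V(s)) / k,  V(s) = max_a A(s,a)
# The factor 1/k (k < 1) magnifies differences between action values
# relative to plain Q-learning, which helps with function approximation.

def advantage_update(A, s, a, r, s_next, alpha=0.1, gamma=0.9, k=0.3):
    v_s = A[s].max()
    v_next = A[s_next].max()
    target = v_s + (r + gamma * v_next - v_s) / k
    A[s, a] += alpha * (target - A[s, a])
    return A

# Toy 2-state, 2-action example (hypothetical numbers):
A = np.zeros((2, 2))
A = advantage_update(A, s=0, a=1, r=1.0, s_next=1)
```

Starting from a zero table, a single rewarded transition moves A(0, 1) by alpha * r / k, i.e. three times farther than the corresponding Q-learning step with k = 0.3, which is the action-value scaling the method relies on.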
Acknowledgments
The author wishes to thank Edwin de Jong, Michiel de Jong, Gwendid van der Voort van der Kleij, Patrick Hudson, Felix Gers, and Jürgen Schmidhuber for valuable comments.