A Survey of Monte Carlo Tree Search Methods - Department of ...
IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 4, NO. 1, MARCH 2012
either revisited, added or sampled from previously seen children, depending on the number of visits. Double progressive widening worked well for toy problems for which standard UCT failed, but less so for complex real-world problems.
5.5.2 Absolute and Relative Pruning

Absolute pruning and relative pruning are two strategies proposed by Huang [106] to preserve the correctness of the UCB algorithm.

• Absolute pruning prunes all actions from a position except the most visited one, once it becomes clear that no other action could become more visited.
• Relative pruning uses an upper bound on the number of visits an action has received, to detect when the most visited choice will remain the most visited.

Relative pruning was found to increase the win rate of the Go program LINGO against GNU GO 3.8 by approximately 3% [106].
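To make the absolute pruning criterion concrete, the following is a minimal sketch of the safety check involved, not code from [106]; it assumes a known remaining simulation budget, and the function name and interface are illustrative:

```python
def can_prune_siblings(visit_counts, remaining_budget):
    """Return True when every action except the most visited one may be
    pruned: even if all remaining simulations visited the runner-up, it
    could not overtake the current leader's visit count."""
    ranked = sorted(visit_counts, reverse=True)
    leader, runner_up = ranked[0], ranked[1]
    # The runner-up can gain at most `remaining_budget` further visits.
    return runner_up + remaining_budget < leader
```

For example, with visit counts [120, 30, 15] and 50 simulations left, the runner-up can reach at most 80 visits, so pruning all siblings of the leader cannot change the final choice.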
5.5.3 Pruning with Domain Knowledge

Given knowledge about a domain, it is possible to prune actions known to lead to weaker positions. For example, Huang [106] used the concept of territory in Go to significantly increase the performance of the program LINGO against GNU GO 3.8. Domain knowledge related to predicting opponents' strategies was used by Suoju et al. for move pruning in the game Dead End for a 51.17% improvement over plain UCT [99].

Arneson et al. use domain knowledge to prune inferior cells from the search in their world champion Hex program MOHEX [8]. Since this analysis is computationally expensive, it is applied only to nodes that have been visited a certain number of times. An added benefit of this approach is that the analysis would sometimes solve the position outright, yielding its true game-theoretic value.
5.6 Expansion Enhancements

No enhancements specific to the expansion step of the tree policy were found in the literature. The particular expansion algorithm used for a problem tends to be more of an implementation choice – typically between single node expansion and full node set expansion – depending on the domain and computational budget.
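The two expansion choices mentioned above can be sketched as follows; the Node interface is an illustrative assumption, not code from any surveyed program:

```python
import random

class Node:
    # Minimal illustrative tree node; the field names are assumptions.
    def __init__(self, untried_actions):
        self.untried_actions = list(untried_actions)
        self.children = []

    def add_child(self, action):
        child = Node([])
        child.action = action
        self.children.append(child)
        return child

def expand_single(node):
    """Single node expansion: add exactly one untried child per iteration."""
    return node.add_child(node.untried_actions.pop())

def expand_all(node):
    """Full node set expansion: create every child at once, then return
    one of them for simulation."""
    while node.untried_actions:
        node.add_child(node.untried_actions.pop())
    return random.choice(node.children)
```

Single node expansion keeps the tree small under a tight memory budget, while full expansion avoids repeated bookkeeping for nodes whose children will all be visited anyway.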
6 OTHER ENHANCEMENTS

This section describes enhancements to aspects of the core MCTS algorithm other than its tree policy. This includes modifications to the default policy (which are typically domain dependent and involve heuristic knowledge of the problem being modelled) and other more general modifications related to the backpropagation step and parallelisation.
6.1 Simulation Enhancements

The default simulation policy for MCTS is to select randomly amongst the available actions. This has the advantage that it is simple, requires no domain knowledge and repeated trials will most likely cover different areas of the search space, but the games played are not likely to be realistic compared to games played by rational players. A popular class of enhancements makes the simulations more realistic by incorporating domain knowledge into the playouts. This knowledge may be gathered either offline (e.g. from databases of expert games) or online (e.g. through self-play and learning). Drake and Uurtamo describe such biased playouts as heavy playouts [77].
6.1.1 Rule-Based Simulation Policy

One approach to improving the simulation policy is to hand-code a domain specific policy. Such rule-based policies should be fast, so as not to unduly impede the simulation process; Silver discusses a number of factors which govern their effectiveness [203].
6.1.2 Contextual Monte Carlo Search

Contextual Monte Carlo Search [104], [167] is an approach to improving simulations that is independent of the domain. It works by combining simulations that reach the same areas of the tree into tiles and using statistics from previous simulations to guide the action selection in future simulations. This approach was used to good effect for the game Havannah (7.2), for which each tile described a particular pairing of consecutive moves.
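The tiling idea can be sketched as follows, with a tile taken to be a pair of consecutive simulation moves as in the Havannah application. The epsilon-greedy selection rule and all names here are assumptions for illustration, not the scheme of [104], [167]:

```python
import random
from collections import defaultdict

class TileStats:
    """Average the outcomes of past simulations per tile (a pair of
    consecutive moves) and bias later simulations toward tiles that
    scored well, with some residual exploration."""
    def __init__(self, epsilon=0.3):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)
        self.epsilon = epsilon

    def update(self, moves, reward):
        # Credit the simulation's reward to every consecutive-move pair.
        for tile in zip(moves, moves[1:]):
            self.sums[tile] += reward
            self.counts[tile] += 1

    def value(self, prev_move, move):
        tile = (prev_move, move)
        if self.counts[tile] == 0:
            return 0.0
        return self.sums[tile] / self.counts[tile]

    def select(self, prev_move, legal_moves, rng=random):
        if rng.random() < self.epsilon:   # keep some simulation diversity
            return rng.choice(legal_moves)
        return max(legal_moves, key=lambda m: self.value(prev_move, m))
```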
6.1.3 Fill the Board

Fill the Board is an enhancement described in [53], [55] designed to increase simulation diversity for the game of Go. At each step in the simulations, the Fill the Board algorithm picks N random intersections; if any of those intersections and their immediate neighbours are empty then it plays there, else it plays a random legal move. The simulation policy in this case can make use of patterns (6.1.9), and this enhancement fills up board space quickly, so these patterns can be applied earlier in the simulation.
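A single Fill the Board step might be sketched as below. The board interface (all_points, empty_points, neighbours, legal_moves) is an assumption, and the toy one-dimensional board exists only to make the sketch self-contained:

```python
import random

class TinyBoard:
    """Toy 1-D 'board' to exercise the policy; a real Go state is assumed
    in practice. Points are 0..size-1; stones is the set of occupied points."""
    def __init__(self, size, stones):
        self.size = size
        self.stones = set(stones)

    def all_points(self):
        return list(range(self.size))

    def empty_points(self):
        return {p for p in range(self.size) if p not in self.stones}

    def neighbours(self, p):
        return [q for q in (p - 1, p + 1) if 0 <= q < self.size]

    def legal_moves(self):
        return sorted(self.empty_points())

def fill_the_board_move(state, n, rng=random):
    # Pick N random intersections; play the first one that is empty along
    # with all its immediate neighbours, else fall back to a random legal move.
    for p in rng.sample(state.all_points(), n):
        if all(q in state.empty_points() for q in [p] + state.neighbours(p)):
            return p
    return rng.choice(state.legal_moves())
```

Preferring moves in wholly empty neighbourhoods spreads stones into open space, which is what lets patterns apply earlier in the playout.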
A similar approach to board filling can be used to good effect in games with complementary goals in which exactly one player is guaranteed to win, however the board is filled. Such games include the connection games Hex and Y, as discussed in Sections 6.1.9 and 7.2.
6.1.4 Learning a Simulation Policy

Given a new domain, it is possible to learn a new simulation policy using generic techniques. The relationship between MCTS and TD learning was mentioned in Section 4.3.1; other techniques that learn to adjust the simulation policy by direct consideration of the simulation statistics are listed below.
Move-Average Sampling Technique (MAST) is an