A Survey of Monte Carlo Tree Search Methods - Department of ...
A Survey of Monte Carlo Tree Search Methods - Department of ...
A Survey of Monte Carlo Tree Search Methods - Department of ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 4, NO. 1, MARCH 2012 24<br />
Fig. 6. Patterns for a cut move in Go [96].<br />
simulations by detecting pattern matches with the actual<br />
board position and applying associated moves.<br />
For example, Figure 6 from [96] shows a set <strong>of</strong> 3 × 3<br />
patterns for detecting cut moves in Go. The first pattern<br />
must be matched and the other two not matched for the<br />
move to be recognised. Drake and Uurtamo [77] suggest<br />
there may be more to gain from applying heuristics such<br />
as patterns to the simulation policy rather than the tree<br />
policy for Go.<br />
The 3 × 3 patterns described by Wang and Gelly<br />
[228] vary in complexity, and can be used to improve<br />
the simulation policy to make simulated games more<br />
realistic. Wang and Gelly matched patterns around the<br />
last move played to improve the playing strength <strong>of</strong><br />
their Go program MOGO [228]. Gelly and Silver [92]<br />
used a reinforcement learning approach to improve the<br />
simulation policy, specifically a function approximator<br />
QRLGO(s, a), which applied linear weights to a collection<br />
<strong>of</strong> binary features. 20 Several policies using this information<br />
were tested and all <strong>of</strong>fered an improvement over<br />
a random policy, although a weaker handcrafted policy<br />
was stronger when used with UCT.<br />
Coulom [71] searched for useful patterns in Go by<br />
computing Elo ratings for patterns, improving their Go<br />
program CRAZY STONE. Hoock and Teytaud investigate<br />
the use <strong>of</strong> Bandit-based Genetic Programming (BGP) to<br />
automatically find good patterns that should be more<br />
simulated and bad patterns that should be less simulated<br />
for their program MOGO, achieving success with 9 × 9<br />
Go but less so with 19 × 19 Go [105].<br />
Figure 7 shows a bridge pattern that occurs in connection<br />
games such as Hex, Y and Havannah (7.2). The<br />
two black pieces are virtually connected as an intrusion<br />
by white in either cell can be answered by black at<br />
the other cell to restore the connection. Such intrusions<br />
can be detected and completed during simulation to<br />
significantly improve playing strength, as this mimics<br />
moves that human players would typically perform.<br />
6.2 Backpropagation Enhancements<br />
Modifications to the backpropagation step typically involve<br />
special node updates required by other enhancement<br />
methods for forward planning, but some constitute<br />
enhancements in their own right. We describe those not<br />
explicitly covered in previous sections.<br />
20. These features were all 1 × 1 to 3 × 3 patterns on a Go board.<br />
a b<br />
Fig. 7. Bridge completion for connection games.<br />
6.2.1 Weighting Simulation Results<br />
b a<br />
Xie and Liu [237] observe that some simulations are<br />
more important than others. In particular, simulations<br />
performed later in the search tend to be more accurate<br />
than those performed earlier, and shorter simulations<br />
tend to be more accurate than longer ones. In light <strong>of</strong><br />
this, Xie and Liu propose the introduction <strong>of</strong> a weighting<br />
factor when backpropagating simulation results [237].<br />
Simulations are divided into segments and each assigned<br />
a positive integer weight. A simulation with weight w is<br />
backpropagated as if it were w simulations.<br />
6.2.2 Score Bonus<br />
In a normal implementation <strong>of</strong> UCT, the values backpropagated<br />
are in the interval [0, 1], and if the scheme<br />
only uses 0 for a loss and 1 for a win, then there is no<br />
way to distinguish between strong wins and weak wins.<br />
One way <strong>of</strong> introducing this is to backpropagate a value<br />
in the interval [0, γ] for a loss and [γ, 1] for a win with<br />
the strongest win scoring 1 and the weakest win scoring<br />
γ. This scheme was tested for Sums Of Switches (7.3) but<br />
did not improve playing strength [219].<br />
6.2.3 Decaying Reward<br />
Decaying reward is a modification to the backpropagation<br />
process in which the reward value is multiplied by<br />
some constant 0 < γ ≤ 1 between each node in order<br />
to weight early wins more heavily than later wins. This<br />
was proposed alongside UCT in [119], [120].<br />
6.2.4 Transposition Table Updates<br />
Childs et al. [63] discuss a variety <strong>of</strong> strategies – labelled<br />
UCT1, UCT2 and UCT3 – for handling transpositions<br />
(5.2.4), so that information can be shared between<br />
different nodes corresponding to the same state. Each<br />
variation showed improvements over its predecessors,<br />
although the computational cost <strong>of</strong> UCT3 was large.<br />
6.3 Parallelisation<br />
The independent nature <strong>of</strong> each simulation in MCTS<br />
means that the algorithm is a good target for parallelisation.<br />
Parallelisation has the advantage that more<br />
simulations can be performed in a given amount <strong>of</strong> time<br />
and the wide availability <strong>of</strong> multi-core processors can<br />
be exploited. However, parallelisation raises issues such<br />
as the combination <strong>of</strong> results from different sources in a<br />
single search tree, and the synchronisation <strong>of</strong> threads <strong>of</strong>