
A Survey of Monte Carlo Tree Search Methods
IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 4, NO. 1, MARCH 2012

Fig. 6. Patterns for a cut move in Go [96].

Patterns can be used to improve simulations by detecting pattern matches with the actual board position and applying associated moves.

For example, Figure 6 from [96] shows a set of 3 × 3 patterns for detecting cut moves in Go. The first pattern must be matched and the other two not matched for the move to be recognised. Drake and Uurtamo [77] suggest there may be more to gain from applying heuristics such as patterns to the simulation policy rather than the tree policy for Go.
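As an illustration, the following minimal sketch shows how a simulation policy might test such 3 × 3 patterns around a candidate point. The board encoding and the pattern shown are illustrative assumptions, not the actual cut patterns of [96]:

```python
# Hypothetical sketch: matching a 3x3 pattern around a candidate point.
# Assumed encoding: 'B' = black, 'W' = white, '.' = empty, '?' = wildcard;
# the pattern below is made up for illustration.

CUT_PATTERN = ("BW?",
               "W.?",
               "???")

def matches(board, row, col, pattern):
    """True if `pattern` matches the 3x3 area of `board` centred at (row, col)."""
    size = len(board)
    for dr in range(3):
        for dc in range(3):
            want = pattern[dr][dc]
            if want == '?':
                continue                      # wildcard matches anything
            r, c = row + dr - 1, col + dc - 1
            if not (0 <= r < size and 0 <= c < size):
                return False                  # off-board cells never match
            if board[r][c] != want:
                return False
    return True

def is_cut_move(board, row, col, must_match, must_not_match):
    """Recognise a cut as described above: the first pattern must match
    and none of the exclusion patterns may match."""
    return (matches(board, row, col, must_match) and
            not any(matches(board, row, col, p) for p in must_not_match))
```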

The 3 × 3 patterns described by Wang and Gelly [228] vary in complexity, and can be used to improve the simulation policy to make simulated games more realistic. Wang and Gelly matched patterns around the last move played to improve the playing strength of their Go program MOGO [228]. Gelly and Silver [92] used a reinforcement learning approach to improve the simulation policy, specifically a function approximator Q_RLGO(s, a), which applied linear weights to a collection of binary features.^20 Several policies using this information were tested and all offered an improvement over a random policy, although a weaker handcrafted policy was stronger when used with UCT.

20. These features were all 1 × 1 to 3 × 3 patterns on a Go board.
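The sketch below illustrates the general form of such a linear function approximator and a softmax simulation policy built on it; the `extract_features` argument is a hypothetical stand-in rather than the actual features of [92]:

```python
import math
import random

# Hypothetical sketch of a linear function approximator in the spirit of
# Q_RLGO(s, a): a weighted sum over active binary features, driving a
# softmax simulation policy. `extract_features` stands in for a real
# extractor over small board patterns.

def q_value(weights, active_features):
    """Q(s, a) as a linear combination of binary (0/1) features."""
    return sum(weights.get(f, 0.0) for f in active_features)

def softmax_policy(state, legal_moves, weights, extract_features, tau=1.0):
    """Choose a simulation move with probability proportional to exp(Q/tau)."""
    qs = [q_value(weights, extract_features(state, a)) for a in legal_moves]
    m = max(qs)                                   # subtract max for stability
    exps = [math.exp((q - m) / tau) for q in qs]
    r = random.uniform(0.0, sum(exps))
    for move, e in zip(legal_moves, exps):
        r -= e
        if r <= 0.0:
            return move
    return legal_moves[-1]                        # guard against rounding
```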

Coulom [71] searched for useful patterns in Go by computing Elo ratings for patterns, improving his Go program CRAZY STONE. Hoock and Teytaud investigated the use of Bandit-based Genetic Programming (BGP) to automatically find good patterns that should be simulated more often and bad patterns that should be simulated less often for their program MOGO, achieving success with 9 × 9 Go but less so with 19 × 19 Go [105].
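Coulom's full model in [71] is a generalised Bradley-Terry model over teams of features, fitted by minorisation-maximisation (MM). The sketch below shows only the basic MM update for single patterns, where each training position is a competition among candidate patterns won by the pattern of the move actually played:

```python
# Simplified sketch of the idea behind [71]: rate patterns with a
# Bradley-Terry model fitted by minorisation-maximisation (MM).
# Coulom's full model rates teams of features; this rates single patterns.

def fit_pattern_strengths(competitions, iterations=50):
    """competitions: list of (winner, candidates), winner in candidates.
    Returns an estimated strength (generalised Elo) per pattern id."""
    patterns = {p for _, cands in competitions for p in cands}
    strength = {p: 1.0 for p in patterns}
    wins = {p: 0 for p in patterns}
    for winner, _ in competitions:
        wins[winner] += 1
    for _ in range(iterations):
        denom = {p: 0.0 for p in patterns}
        for _, cands in competitions:
            total = sum(strength[p] for p in cands)
            for p in cands:
                denom[p] += 1.0 / total          # MM denominator term
        for p in patterns:
            strength[p] = wins[p] / denom[p]     # standard MM update
    return strength
```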

Figure 7 shows a bridge pattern that occurs in connection games such as Hex, Y and Havannah (7.2). The two black pieces are virtually connected, as an intrusion by white into either cell can be answered by black at the other cell to restore the connection. Such intrusions can be detected and completed during simulation to significantly improve playing strength, as this mimics moves that human players would typically perform.

Fig. 7. Bridge completion for connection games.
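A minimal sketch of this idea for a hex-grid connection game follows; the board helpers passed in (`neighbours`, `common_neighbours`, `colour_at`) are hypothetical stand-ins for a real board implementation:

```python
# Hypothetical sketch of bridge completion for a hex-grid connection game.
# colour_at returns None for an empty cell.

def bridge_reply(board, intrusion_cell, me,
                 neighbours, common_neighbours, colour_at):
    """If the opponent's last move (at intrusion_cell) took one carrier
    cell of a bridge between two of my stones, return the other carrier
    cell so the simulation can restore the connection; otherwise None."""
    mine = [c for c in neighbours(intrusion_cell) if colour_at(board, c) == me]
    for i, a in enumerate(mine):
        for b in mine[i + 1:]:
            if b in neighbours(a):
                continue            # adjacent stones are already connected
            # Stones a bridge apart share exactly two common neighbours:
            # the two carrier cells of the bridge.
            carriers = list(common_neighbours(a, b))
            if len(carriers) == 2 and intrusion_cell in carriers:
                other = carriers[0] if carriers[1] == intrusion_cell else carriers[1]
                if colour_at(board, other) is None:
                    return other    # complete the bridge here
    return None
```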

6.2 Backpropagation Enhancements<br />

Modifications to the backpropagation step typically involve special node updates required by other enhancement methods for forward planning, but some constitute enhancements in their own right. We describe those not explicitly covered in previous sections.


6.2.1 Weighting Simulation Results

Xie and Liu [237] observe that some simulations are more important than others. In particular, simulations performed later in the search tend to be more accurate than those performed earlier, and shorter simulations tend to be more accurate than longer ones. In light of this, Xie and Liu propose the introduction of a weighting factor when backpropagating simulation results [237]. Simulations are divided into segments and each assigned a positive integer weight. A simulation with weight w is backpropagated as if it were w simulations.
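A minimal sketch of this weighting scheme, assuming the usual UCT node statistics (visits, wins, parent) and rewards in {0, 1} from each node's own perspective:

```python
# Minimal sketch of weighted backpropagation: a simulation of weight w
# updates each node on the path as if it were w separate simulations.

def backpropagate_weighted(leaf, result, weight):
    """Propagate a simulation result (1 = win, 0 = loss) up the tree
    as if it had been played `weight` times (a positive integer)."""
    node = leaf
    while node is not None:
        node.visits += weight           # count w simulations
        node.wins += weight * result    # credit w results
        result = 1 - result             # flip perspective for the parent
        node = node.parent
```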

6.2.2 Score Bonus<br />

In a normal implementation of UCT, the values backpropagated are in the interval [0, 1]. If the scheme only uses 0 for a loss and 1 for a win, then there is no way to distinguish between strong wins and weak wins. One way of introducing such a distinction is to backpropagate a value in the interval [0, γ] for a loss and [γ, 1] for a win, with the strongest win scoring 1 and the weakest win scoring γ. This scheme was tested for Sums Of Switches (7.3) but did not improve playing strength [219].
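The mapping just described might be sketched as follows, where `margin` is an assumed normalised game score in [0, 1] (1 being the most decisive result):

```python
# Sketch of the score-bonus mapping: wins land in [gamma, 1] and losses
# in [0, gamma], scaled by the (assumed normalised) margin of the result.

def score_bonus_reward(won, margin, gamma=0.5):
    if won:
        return gamma + (1.0 - gamma) * margin   # weakest win -> gamma, strongest -> 1
    return gamma * (1.0 - margin)               # weakest loss -> gamma, strongest -> 0
```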

6.2.3 Decaying Reward<br />

Decaying reward is a modification to the backpropagation process in which the reward value is multiplied by some constant 0 < γ ≤ 1 between each node, in order to weight early wins more heavily than later wins. This was proposed alongside UCT in [119], [120].
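A corresponding sketch of decaying reward during backpropagation, with node fields assumed as before (perspective flipping is omitted for brevity):

```python
# Sketch of decaying reward: the value is multiplied by gamma once per
# step up the tree, so a win reached after fewer moves retains more of
# its value at the root.

def backpropagate_decaying(leaf, reward, gamma=0.95):
    node = leaf
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        reward *= gamma                 # decay toward the root
        node = node.parent
```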

6.2.4 Transposition Table Updates<br />

Childs et al. [63] discuss a variety of strategies – labelled UCT1, UCT2 and UCT3 – for handling transpositions (5.2.4), so that information can be shared between different nodes corresponding to the same state. Each variation showed improvements over its predecessors, although the computational cost of UCT3 was large.
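The precise update rules for UCT1–UCT3 are given in [63]; the sketch below shows only the underlying sharing idea, with all nodes for the same state reading and writing one record keyed by a state hash:

```python
# Generic sketch of the sharing idea only (not the UCT1-UCT3 rules of
# [63]): tree nodes representing the same state share one record.

class SharedStats:
    def __init__(self):
        self.visits = 0
        self.total_reward = 0.0

class TranspositionTable:
    def __init__(self):
        self.table = {}

    def stats_for(self, state_hash):
        """Return the single record shared by all transposed nodes."""
        if state_hash not in self.table:
            self.table[state_hash] = SharedStats()
        return self.table[state_hash]

# During backpropagation a node updates the shared record, so a simulation
# through one path also informs every transposed path:
#   stats = tt.stats_for(state_hash)
#   stats.visits += 1
#   stats.total_reward += reward
```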

6.3 Parallelisation<br />

The independent nature of each simulation in MCTS means that the algorithm is a good target for parallelisation. Parallelisation has the advantage that more simulations can be performed in a given amount of time, and the wide availability of multi-core processors can be exploited. However, parallelisation raises issues such as the combination of results from different sources in a single search tree, and the synchronisation of threads.
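One common scheme sidesteps tree synchronisation entirely: run several independent searches and merge only their root statistics (root parallelisation). A minimal sketch, where `run_mcts` is a hypothetical function returning root visit counts per move:

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

# Sketch of root parallelisation: independent searches, merged only at
# the root. `run_mcts(state, iterations)` is assumed to return a dict of
# {move: visit count}; it and the state must be picklable for a process pool.

def parallel_search(run_mcts, root_state, workers=4, iterations=10000):
    merged = Counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_mcts, root_state, iterations)
                   for _ in range(workers)]
        for f in futures:
            merged.update(f.result())     # sum visit counts per move
    return merged.most_common(1)[0][0]    # move with the most total visits
```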
