A Survey of Monte Carlo Tree Search Methods


Fig. 4. The All Moves As First (AMAF) heuristic [101].

The parameter V represents the number of visits a node must receive before the RAVE values cease to be used. RAVE is a softer approach than Cutoff AMAF, since heavily exploited areas of the tree rely on their own accurate statistics more than unexploited areas do.
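As an illustration, the weighting can be sketched as follows. This is a minimal sketch assuming a linear decay schedule controlled by V and hypothetical node fields (visits, rave_value, mc_value); it is not the exact rule of any particular program:

    def rave_weight(visits, V):
        # Illustrative linear schedule: full RAVE weight at zero visits,
        # decaying to zero once a node has been visited V times.
        return max(0.0, (V - visits) / V)

    def blended_value(node, V):
        # Mix AMAF/RAVE statistics with the node's own Monte Carlo value:
        # heavily visited nodes rely on their own, more accurate statistics.
        beta = rave_weight(node.visits, V)
        return beta * node.rave_value + (1.0 - beta) * node.mc_value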

5.3.6 Killer RAVE

Lorentz [133] describes the Killer RAVE variant, so named due to its similarity with the "Killer Move" heuristic in traditional game tree search, in which only the most important moves are used for the RAVE updates for each iteration. This was found to be more beneficial for the connection game Havannah (7.2) than plain RAVE.
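A minimal sketch of the filtering idea, assuming a per-iteration set of "killer" moves and a hypothetical per-move statistics table; how the killer set itself is chosen is left abstract:

    def killer_rave_update(node, playout_moves, reward, killers):
        # Killer-RAVE-style update: only moves judged most important
        # (here, membership in the 'killers' set) contribute to the
        # AMAF statistics; plain RAVE would update for every move.
        for move in playout_moves:
            if move not in killers:
                continue
            stats = node.rave_stats[move]
            stats.visits += 1
            stats.value += (reward - stats.value) / stats.visits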

5.3.7 RAVE-max

RAVE-max is an extension intended to make the RAVE heuristic more robust [218], [220]. The RAVE-max update rule and its stochastic variant δ-RAVE-max were found to improve performance in degenerate cases for the Sum of Switches game (7.3) but were less successful for Go.

5.3.8 PoolRAVE

Hoock et al. [104] describe the poolRAVE enhancement, which modifies the MCTS simulation step as follows (a sketch is given after the list):

• Build a pool of the k best moves according to RAVE.
• Choose one move m from the pool.
• With probability p, play m; otherwise play a move from the default policy.
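These steps translate directly into code. In the sketch below, the values of k and p and the helper names are placeholders rather than settings from [104]:

    import random

    def poolrave_move(state, rave_values, default_policy, k=10, p=0.5):
        # PoolRAVE simulation step: pool the k best moves by RAVE value,
        # pick one at random, and play it with probability p; otherwise
        # fall back to the default policy.
        pool = sorted(state.legal_moves(),
                      key=lambda m: rave_values.get(m, 0.0),
                      reverse=True)[:k]
        if pool and random.random() < p:
            return random.choice(pool)
        return default_policy(state)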

PoolRAVE has the advantages of being independent of the domain and simple to implement if a RAVE mechanism is already in place. It was found to yield improvements for Havannah and Go programs by Hoock et al. [104], especially when expert knowledge is small or absent, but not to solve a problem particular to Go known as semeai.

Helmbold and Parker-Wood [101] compare the main AMAF variants and conclude that:

• Random playouts provide more evidence about the goodness of moves made earlier in the playout than moves made later.
• AMAF updates are not just a way to quickly initialise counts; they are useful after every playout.
• Updates even more aggressive than AMAF can be even more beneficial.
• Combined heuristics can be more powerful than individual heuristics.

5.4 Game-Theoretic Enhancements

If the game-theoretic value of a state is known, this value may be backed up the tree to improve reward estimates for other non-terminal nodes. This section describes enhancements based on this property.

Figure 5, from [235], shows the backup of proven game-theoretic values during backpropagation. Wins, draws and losses in simulations are assigned rewards of +1, 0 and −1 respectively (as usual), but proven wins and losses are assigned rewards of +∞ and −∞.
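As a sketch, this reward assignment might look as follows; the state API (is_proven_win, outcome, and so on) is hypothetical:

    import math

    def terminal_reward(state):
        # Rewards as in Figure 5: ordinary simulation outcomes score
        # +1/0/-1, while states whose game-theoretic value is proven
        # score +/-infinity so that the proof itself is backed up.
        if state.is_proven_win():
            return math.inf
        if state.is_proven_loss():
            return -math.inf
        return {"win": 1, "draw": 0, "loss": -1}[state.outcome()]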

5.4.1 MCTS-Solver

Proof-number search (PNS) is a standard AI technique for proving game-theoretic values, typically used for endgame solvers, in which terminal states are considered to be proven wins or losses and deductions chained backwards from these [4]. A non-terminal state is a proven win if at least one of its children is a proven win, or a proven loss if all of its children are proven losses. When exploring the game tree, proof-number search prioritises those nodes whose values can be proven by evaluating the fewest children.
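This deduction rule translates directly into code. The sketch below keeps values from the root player's perspective, so the rule as stated applies at nodes where the root player moves and is mirrored at opponent nodes; the node fields are assumptions for illustration:

    def proven_value(node):
        # PNS deduction: at a node where the root player moves, the
        # state is a proven win if at least one child is a proven win,
        # and a proven loss if all children are proven losses; at
        # opponent nodes the rule is mirrored.
        if node.is_terminal():
            return node.outcome()  # "win" or "loss"
        children = [proven_value(c) for c in node.children]
        if node.root_player_to_move:
            if any(v == "win" for v in children):
                return "win"
            if all(v == "loss" for v in children):
                return "loss"
        else:
            if any(v == "loss" for v in children):
                return "loss"
            if all(v == "win" for v in children):
                return "win"
        return "unknown"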

Winands et al. [235], [234] propose a modification to MCTS based on PNS in which game-theoretic values (that is, known wins, draws or losses) are proven and backpropagated up the tree. If the parent node has been visited more than some threshold T times, normal UCB selection applies and a forced loss node is…
