
problem domains. They describe the application of ρUCT to create their MC-AIXI agent, which approximates the AIXI model (footnote 15). MC-AIXI was found to approach optimal performance for several problem domains (7.8).

4.10.7 Monte Carlo Random Walks (MRW)

Monte Carlo Random Walks (MRW) selectively build the search tree using random walks [238]. Xie et al. describe the Monte Carlo Random Walk-based Local Tree Search (MRW-LTS) method, which extends MRW to concentrate on local search more than standard MCTS methods, allowing good performance in some difficult planning problems [238].
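As an illustration only, the following is a minimal sketch of evaluating a state with bounded random walks in the spirit of MRW; the callables legal_actions, apply_action and the heuristic h are hypothetical stand-ins for domain-specific code and are not taken from [238].

```python
import random

def random_walk_probe(state, legal_actions, apply_action, h,
                      num_walks=100, walk_length=10):
    """Launch bounded random walks from `state` and return the endpoint
    with the best (lowest) heuristic value seen."""
    best_value, best_endpoint = float("inf"), state
    for _ in range(num_walks):
        s = state
        for _ in range(walk_length):
            actions = legal_actions(s)
            if not actions:                      # dead end: stop this walk
                break
            s = apply_action(s, random.choice(actions))
        value = h(s)
        if value < best_value:                   # keep the most promising endpoint
            best_value, best_endpoint = value, s
    return best_endpoint, best_value
```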

4.10.8 Mean-based Heuristic Search for Anytime Planning (MHSP)

Pellier et al. [158] propose an algorithm for planning problems, called Mean-based Heuristic Search for Anytime Planning (MHSP), based on MCTS. There are two key differences between MHSP and a conventional MCTS algorithm. First, MHSP entirely replaces the random simulations of MCTS with a heuristic evaluation function: Pellier et al. [158] argue that random exploration of the search space for a planning problem is inefficient, since the probability of a given simulation actually finding a solution is low. Second, MHSP's selection process simply uses average rewards with no exploration term, but initialises the nodes with "optimistic" average values. In contrast to many planning algorithms, MHSP can operate in a truly anytime fashion: even before a solution has been found, MHSP can yield a good partial plan.
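These two modifications can be illustrated with a minimal sketch (not Pellier et al.'s implementation): node statistics are initialised with an optimistic value, children are selected greedily on mean reward with no exploration term, and a domain-supplied heuristic function (hypothetical name) stands in for the random rollout.

```python
class MHSPNode:
    """Tree node with an optimistic prior value; a sketch of MHSP-style bookkeeping."""
    def __init__(self, state, optimistic_value):
        self.state = state
        self.children = []              # list of MHSPNode
        self.visits = 1                 # the optimistic prior counts as one visit
        self.total_reward = optimistic_value

    @property
    def mean(self):
        return self.total_reward / self.visits

def select_child(node):
    # Greedy selection on average reward only: no UCB exploration term.
    return max(node.children, key=lambda c: c.mean)

def evaluate(node, heuristic):
    # A heuristic evaluation of the expanded state replaces the random rollout.
    return heuristic(node.state)

def backup(path, reward):
    # Standard backpropagation of the heuristic value along the selected path.
    for node in path:
        node.visits += 1
        node.total_reward += reward
```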

5 TREE POLICY ENHANCEMENTS

This section describes modifications proposed for the tree policy of the core MCTS algorithm, in order to improve performance. Many approaches use ideas from traditional AI search such as α-β, while some have no existing context and were developed specifically for MCTS. These can generally be divided into two categories:

• Domain Independent: These are enhancements that could be applied to any domain without prior knowledge about it. These typically offer small improvements or are better suited to a particular type of domain.

• Domain Dependent: These are enhancements specific to particular domains. Such enhancements might use prior knowledge about a domain or otherwise exploit some unique aspect of it.

This section covers those enhancements specific to the tree policy, i.e. the selection and expansion steps.

15. AIXI is a mathematical approach based on a Bayesian optimality notion for general reinforcement learning agents.

5.1 Bandit-Based Enhancements

The bandit-based method used for node selection in the tree policy is central to the MCTS method being used. A wealth of different upper confidence bounds have been proposed, often improving bounds or performance in particular circumstances such as dynamic environments.

5.1.1 UCB1-Tuned

UCB1-Tuned is an enhancement suggested by Auer et al. [13] to tune the bounds of UCB1 more finely. It replaces the upper confidence bound $\sqrt{2 \ln n / n_j}$ with:

$$\sqrt{\frac{\ln n}{n_j} \min\left\{\frac{1}{4},\, V_j(n_j)\right\}}$$

where:

$$V_j(s) = \left(\frac{1}{s} \sum_{\tau=1}^{s} X_{j,\tau}^2\right) - \bar{X}_{j,s}^2 + \sqrt{\frac{2 \ln t}{s}}$$

which means that machine j, which has been played s times during the first t plays, has a variance that is at most the sample variance plus $\sqrt{2 \ln t / s}$ [13]. It should be noted that Auer et al. were unable to prove a regret bound for UCB1-Tuned, but found it performed better than UCB1 in their experiments. UCB1-Tuned has subsequently been used in a variety of MCTS implementations, including Go [95], Othello [103] and the real-time game Tron [184].
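As a concrete reading of the formula above, a minimal sketch of the UCB1-Tuned index for a single child follows. Here child_rewards holds the rewards X_{j,τ} observed for child j, and, as is commonly done in UCT-style implementations, the parent's visit count is used for both n and t; that choice is an implementation assumption, not part of the formula above.

```python
import math

def ucb1_tuned(child_rewards, parent_visits):
    """UCB1-Tuned index for one child j, following the formula above.

    child_rewards: list of rewards X_{j,1..n_j} observed for child j.
    parent_visits: visit count of the parent node, used here for both n and t.
    """
    n_j = len(child_rewards)
    mean = sum(child_rewards) / n_j
    mean_sq = sum(x * x for x in child_rewards) / n_j
    # V_j(n_j): sample variance plus the confidence term sqrt(2 ln t / n_j).
    v_j = mean_sq - mean ** 2 + math.sqrt(2 * math.log(parent_visits) / n_j)
    # min{1/4, V_j(n_j)} replaces the constant 2 of the plain UCB1 bound.
    return mean + math.sqrt((math.log(parent_visits) / n_j) * min(0.25, v_j))
```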

5.1.2 Bayesian UCT

Tesauro et al. [213] propose that the Bayesian framework potentially allows much more accurate estimation of node values and node uncertainties from limited numbers of simulation trials. Their Bayesian MCTS formalism introduces two tree policies:

$$\text{maximise } B_i = \mu_i + \sqrt{\frac{2 \ln N}{n_i}}$$

where $\mu_i$ replaces the average reward of the node with the mean of an extremum (minimax) distribution $P_i$ (assuming independent random variables) and:

$$\text{maximise } B_i = \mu_i + \sqrt{\frac{2 \ln N}{n_i}}\,\sigma_i$$

where $\sigma_i$ is the square root of the variance of $P_i$.

Tesauro et al. suggest that the first equation is a strict improvement over UCT if the independence assumption and leaf node priors are correct, while the second equation is motivated by the central limit theorem. They provide convergence proofs for both equations and carry out an empirical analysis in an artificial scenario based on an "idealized bandit-tree simulator". The results indicate that the second equation outperforms the first, and that both outperform the standard UCT approach (although UCT is considerably quicker). McInerney et al. [141] also describe a Bayesian approach to bandit selection, and argue that this, in principle, avoids the need to choose between exploration and exploitation.
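For illustration, a minimal sketch of the two selection indices follows; it assumes that the mean μ_i and standard deviation σ_i of the extremum distribution P_i have already been estimated by the Bayesian machinery, which is the substantive part of Tesauro et al.'s method and is not shown here.

```python
import math

def bayesian_uct_mean(mu_i, n_i, N):
    # First tree policy: mu_i (mean of the extremum distribution P_i)
    # replaces the sample-average reward used by plain UCT.
    return mu_i + math.sqrt(2 * math.log(N) / n_i)

def bayesian_uct_variance(mu_i, sigma_i, n_i, N):
    # Second tree policy: the exploration term is scaled by sigma_i,
    # the standard deviation of P_i (central-limit-theorem motivation).
    return mu_i + math.sqrt(2 * math.log(N) / n_i) * sigma_i
```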
