
On Combining Decisions from Multiple Expert Imitators for Performance

Jonathan Rubin and Ian Watson
University of Auckland, New Zealand


On Combining Decisions from Multiple Expert Imitators for Performance

• Agents that attempt to imitate the decisions of an expert (human or artificial)


On Combining Decisions from Multiple Expert Imitators for Performance

• Dealing with a collection of expert imitators


On Combining Decisions from Multiple Expert Imitators for Performance

• Combine the decisions from a collection of imitators
• Can we improve performance?


Domain

• Two-Player Texas Hold’em Poker
• Limit
• No Limit


Domain

• Create individual poker agents
• Assess performance
• Combine decisions
• Assess performance


How to combine decisions?

• Two existing approaches
• 1 - Ensemble Voting
• 2 - Dynamic Selection at Runtime
• Based on approach by (Johanson, 2007)


Why expert imitators?

• 1 - Train on artificial or real-world data
• 2 - Replace training data to create a new strategy
• 3 - Can easily create a diverse set of playing styles


Intransitive Performance Relationship

• Ability to imitate different styles of play should prove useful
• Intransitivity exists in the poker domain:
• A beats B
• B beats C
• C beats A


Lazy Learners

• Use Case-Based Reasoning
• Solve new problems by retrieving solutions to existing problems


Lazy Learners

• Particularly suited to expert imitation
• Observe and store game scenarios with solutions
• Retrieve when a decision is required


Expert Imitation

Expert → Imitators

• Ij - Expert Imitator
• Cj = {cj,1, cj,2, ..., cj,n}
• ∀c ∈ Cj, c = (x, a)
• x: feature vector, a: action vector


Expert Imitators

• cmax = arg max_{ck ∈ Cj} sim(ck, ct)
• sim: global similarity, ct: current game state
• cmax = (xmax, amax)
• amax: Ij’s action (see the retrieval sketch below)
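A minimal sketch of the retrieval step above, assuming a simple distance-based global similarity over numeric feature vectors; the feature encoding, weights, and helper names are illustrative, not the authors' actual representation:

```python
# Case retrieval for one expert imitator I_j.
# Each case is a (feature_vector, action_vector) pair, i.e. c = (x, a).

def global_similarity(x_stored, x_target):
    """Illustrative global similarity: inverse of absolute distance."""
    dist = sum(abs(a - b) for a, b in zip(x_stored, x_target))
    return 1.0 / (1.0 + dist)

def retrieve(case_base, x_target):
    """Return c_max = argmax over c_k in C_j of sim(c_k, c_t)."""
    return max(case_base, key=lambda case: global_similarity(case[0], x_target))

# Usage: the retrieved case's action vector becomes the imitator's decision basis.
C_j = [((0.2, 0.8, 1.0), (0.0, 0.7, 0.3)),   # (x, a) with a = (f, c, r)
       ((0.9, 0.1, 0.0), (0.6, 0.4, 0.0))]
x_max, a_max = retrieve(C_j, (0.25, 0.75, 1.0))
```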


Action Vectors

• Limit
• a = (f, c, r)
• No Limit
• a = (f, c, q, h, i, p, d, v, t, a)


Decision Policies

• Probabilistic
• Max Frequency (both policies are sketched below)
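A small sketch of the two decision policies applied to a retrieved action vector, assuming the vector holds observed frequencies for each legal action; the frequencies and action labels below are illustrative only:

```python
import random

# Limit Hold'em action vector a = (f, c, r): observed frequencies for
# fold, call, raise. Values are made up for illustration.
actions = ("fold", "call", "raise")
a = (0.1, 0.6, 0.3)

def max_frequency(action_vector):
    """Max Frequency policy: always play the expert's most frequent action."""
    return actions[max(range(len(action_vector)), key=lambda i: action_vector[i])]

def probabilistic(action_vector):
    """Probabilistic policy: sample an action in proportion to its frequency."""
    return random.choices(actions, weights=action_vector, k=1)[0]

print(max_frequency(a))   # "call"
print(probabilistic(a))   # usually "call", sometimes "raise" or "fold"
```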


1 - Ensemble

• Action Vector = (a1, a2, ..., an)
• Each imitator (I1, I2, ..., Im) applies max-frequency
• Derive Vote Vector = (v1, v2, ..., vn)
• Select action, aj, that corresponds to the maximum number of votes, vj
• If there is no strict maximum, use I1 (a sketch of the vote follows below)
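A minimal sketch of the ensemble vote described above; the tie handling follows the slide (fall back to I1 when no action has strictly the most votes), while the `max_frequency_action` method name and imitator objects are assumptions for illustration:

```python
def ensemble_decision(imitators, game_state, actions):
    """Each imitator I_1..I_m casts one vote via its max-frequency action;
    the action with the most votes is played. Ties fall back to I_1."""
    votes = [0] * len(actions)                      # vote vector (v_1, ..., v_n)
    for imitator in imitators:
        choice = imitator.max_frequency_action(game_state)
        votes[actions.index(choice)] += 1

    best = max(votes)
    if votes.count(best) > 1:                       # no strict maximum
        return imitators[0].max_frequency_action(game_state)
    return actions[votes.index(best)]
```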


Showdown DIVAT

• As in (Johanson, 2007), use DIVAT for variance reduction
• Ignorant Value Assessment Tool, developed by the Alberta CPRG
• Basic idea: evaluate a hand based on the EV of the player’s decisions, NOT the outcome
• DivatOutcome = EV(ActualActions) − EV(BaselineActions), as sketched below
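A minimal sketch of the DIVAT-style outcome from the formula above, assuming EV estimates for the actual and baseline action sequences are already available; the values and aggregation below are illustrative placeholders, not output of the real tool:

```python
def divat_outcome(ev_actual, ev_baseline):
    """DivatOutcome = EV(ActualActions) - EV(BaselineActions)."""
    return ev_actual - ev_baseline

# Averaging per-hand DIVAT outcomes gives a lower-variance estimate of a
# player's decision quality than averaging raw chip results (values made up).
hands = [(4.0, 1.0), (-2.0, -3.0), (0.5, 1.5)]
per_hand = [divat_outcome(actual, baseline) for actual, baseline in hands]
print(sum(per_hand) / len(per_hand))   # average decision value per hand
```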


Showdown DIVAT

• (Un)lucky occurrences affect both an imitator’s strategy and the baseline strategy
• What matters is the difference in EV
• Requires perfect information
• Can only apply at showdown
• When a fold occurs, actual outcome info is used
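The Dynamic player selects among imitators at runtime based on their assessed performance; the exact selection rule is not spelled out on these slides (it follows Johanson, 2007), so the following is only a plausible sketch, assuming a running DIVAT-adjusted average is kept per imitator:

```python
# Plausible sketch only: pick the imitator with the best running
# variance-reduced (DIVAT-style) average so far. The selection rule
# actually used in the paper may differ.

class DynamicSelector:
    def __init__(self, imitators):
        self.imitators = imitators
        self.totals = [0.0] * len(imitators)   # accumulated DIVAT outcomes
        self.hands = [0] * len(imitators)      # hands attributed to each imitator

    def choose(self):
        """Index of the imitator whose decisions have the highest average value."""
        averages = [t / h if h else 0.0 for t, h in zip(self.totals, self.hands)]
        return max(range(len(self.imitators)), key=lambda i: averages[i])

    def record(self, index, divat_outcome):
        """Update the chosen imitator's running DIVAT-adjusted total."""
        self.totals[index] += divat_outcome
        self.hands[index] += 1
```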


Experimental Results


Methodology

• Limit and No Limit Domains
• All imitators trained on data provided by the ACPC (Annual Computer Poker Competition)
• 2010 total bankroll division winner
• 2010 instant run-off division winner
• Own entry to 2010 competition


Methodology

• 6 original imitator agents
• 3 training sets × 2 policies
• Derive 2 decision combination players
• Ensemble
• Dynamic
• 8 imitators challenge 2 computerised agents


Match Info

• Duplicate Matches (sketched below)
• Both players receive the same sets of cards
• Reduces variance
• 3000 duplicate hands
• Each imitator plays 5 duplicate matches against each opponent
• 1/2 million hands played in each domain
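A short sketch of why duplicate matches reduce variance: the same deals are replayed with the seats swapped and the two results are combined, so card luck largely cancels; the function name and values below are illustrative assumptions:

```python
def duplicate_result(winnings_seat1, winnings_seat2):
    """Average a player's winnings over the two seatings of the same deal.
    Strong cards in one seating become the opponent's strong cards in the
    other, so much of the luck cancels and the skill difference remains."""
    return (winnings_seat1 + winnings_seat2) / 2.0

# Illustrative only: +12 small bets when dealt the strong cards,
# -4 when the opponent gets them, for a variance-reduced result of +4.
print(duplicate_result(12.0, -4.0))   # 4.0
```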


Limit Results

Table 1: Expert Imitator Results against Fell Omen 2 (Approx. Nash Equilibrium) and AlistairBot (Exploitive Monte-Carlo)

                   Fell Omen 2        AlistairBot        Average
Dynamic             0.01342 ±0.006    0.68848 ±0.009    0.35095 ±0.0075
Ensemble            0.00830 ±0.010    0.67348 ±0.010    0.34089 ±0.0100
Rockhopper-max     -0.01445 ±0.009    0.69504 ±0.009    0.34030 ±0.0090
PULPO-max          -0.00768 ±0.006    0.66053 ±0.024    0.32643 ±0.0150
Sartre-max          0.00825 ±0.014    0.63898 ±0.019    0.32362 ±0.0165
PULPO-prob         -0.00355 ±0.011    0.63385 ±0.018    0.31515 ±0.0145
Sartre-prob        -0.01816 ±0.006    0.64535 ±0.015    0.31360 ±0.0105
Rockhopper-prob    -0.02943 ±0.011    0.63870 ±0.020    0.30464 ±0.0155

* Agent names are Original Imitator - Decision Policy
* Measurements are in small bets per hand


Limit Results

• Dynamic does overall best, followed by Ensemble
• Some overlap in standard deviations


No Limit Results

Table 2: Expert Imitator Results (No Limit) against MCTSBot (MCTS) and SimpleBot (Rules)

                   MCTSBot            SimpleBot          Average
Hyperborean-max     1.7781 ±0.193     0.7935 ±0.084     1.2858 ±0.139
Dynamic             1.3332 ±0.146     0.5928 ±0.058     0.9630 ±0.102
Ensemble            1.3138 ±0.047     0.5453 ±0.058     0.9295 ±0.053
Hyperborean-prob    1.1036 ±0.165     0.5075 ±0.093     0.8055 ±0.129
Sartre-max          0.9313 ±0.117     0.5248 ±0.032     0.7281 ±0.075
Sartre-prob         0.4524 ±0.106     0.3933 ±0.073     0.4228 ±0.090
Tartanian-max       0.1033 ±0.451     0.3450 ±0.087     0.2242 ±0.269
Tartanian-prob     -0.0518 ±0.127     0.4221 ±0.032     0.1852 ±0.080

* Agent names are Original Imitator - Decision Policy
* Measurements are in big blinds per hand


No Limit Results

• Dynamic 2nd, Ensemble 3rd
• Decision combination not able to do better than the best single imitator: Hyperborean-max
• But, still better than most single imitators
• Some overlap in standard deviations


Discussion

• Hyperborean-max performs best against both no limit opponents
• As such, Dynamic and Ensemble are not able to improve upon this by considering the actions of another imitator
• More folds in no limit => less variance reduction


Conclusion

• Introduced our lazy expert imitators
• Described 2 approaches for combining their decisions
• In the no limit domain, decision combination outperformed most original imitators
• In the limit domain, decision combination outperformed all original imitators


The End.
