
On Combining Decisions from Multiple Expert Imitators for Performance

Jonathan Rubin and Ian Watson
University of Auckland, New Zealand


On Combining Decisions from Multiple Expert Imitators for Performance

• Agents that attempt to imitate the decisions of an expert (human or artificial)


On Combining Decisions from Multiple Expert Imitators for Performance

• Dealing with a collection of expert imitators


On Combining Decisions from Multiple Expert Imitators for Performance

• Combine the decisions from a collection of imitators
• Can we improve performance?


Domain

• Two-Player Texas Hold’em Poker
• Limit
• No Limit


Domain

• Create individual poker agents
• Assess performance
• Combine decisions
• Assess performance


How to combine decisions?

• Two existing approaches
• 1 - Ensemble Voting
• 2 - Dynamic Selection at Runtime
• Based on approach by (Johanson, 2007)


Why expert imitators?

• 1 - Train on artificial or real-world data
• 2 - Replace training data to create a new strategy
• 3 - Can easily create a diverse set of playing styles


Intransitive Performance Relationship

• Ability to imitate different styles of play should prove useful
• Intransitivity exists in the poker domain:
• A beats B
• B beats C
• C beats A


Lazy Learners

• Use Case-Based Reasoning
• Solve new problems by retrieving solutions to existing problems


Lazy Learners

• Particularly suited to expert imitation
• Observe and store game scenarios with solutions
• Retrieve when a decision is required


Expert Imitation

Expert → Imitators

• Ij - Expert Imitator
• Cj = {cj,1, cj,2, ..., cj,n}
• ∀c ∈ Cj, c = (x, a)
• x: feature vector, a: action vector


Expert Imitators

• cmax = arg max_{ck ∈ Cj} sim(ck, ct)
• sim: global similarity, ct: current game state
• cmax = (xmax, amax)
• amax: Ij’s action (see the retrieval sketch below)
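A minimal sketch of the retrieval step above, assuming a simple distance-based global similarity over numeric feature vectors; the feature encoding, weights, and helper names are illustrative, not the authors' actual representation:

```python
# Case retrieval for one expert imitator I_j.
# Each case is a (feature_vector, action_vector) pair, i.e. c = (x, a).

def global_similarity(x_stored, x_target):
    """Illustrative global similarity: inverse of absolute distance."""
    dist = sum(abs(a - b) for a, b in zip(x_stored, x_target))
    return 1.0 / (1.0 + dist)

def retrieve(case_base, x_target):
    """Return c_max = argmax over c_k in C_j of sim(c_k, c_t)."""
    return max(case_base, key=lambda case: global_similarity(case[0], x_target))

# Usage: the retrieved case's action vector becomes the imitator's decision basis.
C_j = [((0.2, 0.8, 1.0), (0.0, 0.7, 0.3)),   # (x, a) with a = (f, c, r)
       ((0.9, 0.1, 0.0), (0.6, 0.4, 0.0))]
x_max, a_max = retrieve(C_j, (0.25, 0.75, 1.0))
```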


Action Vectors

• Limit
• a = (f, c, r)
• No Limit
• a = (f, c, q, h, i, p, d, v, t, a)


Decision Policies

• Probabilistic
• Max Frequency (both policies are sketched below)
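A small sketch of the two decision policies applied to a retrieved action vector, assuming the vector holds observed frequencies for each legal action; the frequencies and action labels below are illustrative only:

```python
import random

# Limit Hold'em action vector a = (f, c, r): observed frequencies for
# fold, call, raise. Values are made up for illustration.
actions = ("fold", "call", "raise")
a = (0.1, 0.6, 0.3)

def max_frequency(action_vector):
    """Max Frequency policy: always play the expert's most frequent action."""
    return actions[max(range(len(action_vector)), key=lambda i: action_vector[i])]

def probabilistic(action_vector):
    """Probabilistic policy: sample an action in proportion to its frequency."""
    return random.choices(actions, weights=action_vector, k=1)[0]

print(max_frequency(a))   # "call"
print(probabilistic(a))   # usually "call", sometimes "raise" or "fold"
```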


1 - Ensemble

• Action Vector = (a1, a2, ..., an)
• Each imitator (I1, I2, ..., Im) applies max-frequency
• Derive Vote Vector = (v1, v2, ..., vn)
• Select action, aj, that corresponds to the maximum number of votes, vj
• If there is no strict maximum, use I1 (a sketch of the vote follows below)
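A minimal sketch of the ensemble vote described above; the tie handling follows the slide (fall back to I1 when no action has strictly the most votes), while the `max_frequency_action` method name and imitator objects are assumptions for illustration:

```python
def ensemble_decision(imitators, game_state, actions):
    """Each imitator I_1..I_m casts one vote via its max-frequency action;
    the action with the most votes is played. Ties fall back to I_1."""
    votes = [0] * len(actions)                      # vote vector (v_1, ..., v_n)
    for imitator in imitators:
        choice = imitator.max_frequency_action(game_state)
        votes[actions.index(choice)] += 1

    best = max(votes)
    if votes.count(best) > 1:                       # no strict maximum
        return imitators[0].max_frequency_action(game_state)
    return actions[votes.index(best)]
```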


Showdown DIVAT

• As in (Johanson, 2007), use DIVAT for variance reduction
• Ignorant Value Assessment Tool, developed by the Alberta CPRG
• Basic idea: evaluate a hand based on the EV of the player’s decisions, NOT the outcome
• DivatOutcome = EV(ActualActions) − EV(BaselineActions), as sketched below
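A minimal sketch of the DIVAT-style outcome from the formula above, assuming EV estimates for the actual and baseline action sequences are already available; the values and aggregation below are illustrative placeholders, not output of the real tool:

```python
def divat_outcome(ev_actual, ev_baseline):
    """DivatOutcome = EV(ActualActions) - EV(BaselineActions)."""
    return ev_actual - ev_baseline

# Averaging per-hand DIVAT outcomes gives a lower-variance estimate of a
# player's decision quality than averaging raw chip results (values made up).
hands = [(4.0, 1.0), (-2.0, -3.0), (0.5, 1.5)]
per_hand = [divat_outcome(actual, baseline) for actual, baseline in hands]
print(sum(per_hand) / len(per_hand))   # average decision value per hand
```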


Showdown DIVAT

• (Un)lucky occurrences affect both an imitator’s strategy and the baseline strategy
• What matters is the difference in EV
• Requires perfect information
• Can only apply at showdown
• When a fold occurs, actual outcome info is used
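The Dynamic player selects among imitators at runtime based on their assessed performance; the exact selection rule is not spelled out on these slides (it follows Johanson, 2007), so the following is only a plausible sketch, assuming a running DIVAT-adjusted average is kept per imitator:

```python
# Plausible sketch only: pick the imitator with the best running
# variance-reduced (DIVAT-style) average so far. The selection rule
# actually used in the paper may differ.

class DynamicSelector:
    def __init__(self, imitators):
        self.imitators = imitators
        self.totals = [0.0] * len(imitators)   # accumulated DIVAT outcomes
        self.hands = [0] * len(imitators)      # hands attributed to each imitator

    def choose(self):
        """Index of the imitator whose decisions have the highest average value."""
        averages = [t / h if h else 0.0 for t, h in zip(self.totals, self.hands)]
        return max(range(len(self.imitators)), key=lambda i: averages[i])

    def record(self, index, divat_outcome):
        """Update the chosen imitator's running DIVAT-adjusted total."""
        self.totals[index] += divat_outcome
        self.hands[index] += 1
```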


Experimental Results


Methodology

• Limit and No Limit Domains
• All imitators trained on data provided by the ACPC (Annual Computer Poker Competition)
• 2010 total bankroll division winner
• 2010 instant run-off division winner
• Own entry to 2010 competition


Methodology

• 6 original imitator agents
• 3 training sets × 2 policies
• Derive 2 decision combination players
• Ensemble
• Dynamic
• 8 imitators challenge 2 computerised agents


Match Info

• Duplicate Matches (sketched below)
• Both players receive the same sets of cards
• Reduces variance
• 3000 duplicate hands
• Each imitator plays 5 duplicate matches against each opponent
• 1/2 million hands played in each domain
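A short sketch of why duplicate matches reduce variance: the same deals are replayed with the seats swapped and the two results are combined, so card luck largely cancels; the function name and values below are illustrative assumptions:

```python
def duplicate_result(winnings_seat1, winnings_seat2):
    """Average a player's winnings over the two seatings of the same deal.
    Strong cards in one seating become the opponent's strong cards in the
    other, so much of the luck cancels and the skill difference remains."""
    return (winnings_seat1 + winnings_seat2) / 2.0

# Illustrative only: +12 small bets when dealt the strong cards,
# -4 when the opponent gets them, for a variance-reduced result of +4.
print(duplicate_result(12.0, -4.0))   # 4.0
```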


Limit Results

Table 1: Expert Imitator Results against Fell Omen 2 (Approx. Nash Equilibrium) and AlistairBot (Exploitive Monte-Carlo)

                   Fell Omen 2        AlistairBot        Average
Dynamic             0.01342 ±0.006    0.68848 ±0.009    0.35095 ±0.0075
Ensemble            0.00830 ±0.010    0.67348 ±0.010    0.34089 ±0.0100
Rockhopper-max     -0.01445 ±0.009    0.69504 ±0.009    0.34030 ±0.0090
PULPO-max          -0.00768 ±0.006    0.66053 ±0.024    0.32643 ±0.0150
Sartre-max          0.00825 ±0.014    0.63898 ±0.019    0.32362 ±0.0165
PULPO-prob         -0.00355 ±0.011    0.63385 ±0.018    0.31515 ±0.0145
Sartre-prob        -0.01816 ±0.006    0.64535 ±0.015    0.31360 ±0.0105
Rockhopper-prob    -0.02943 ±0.011    0.63870 ±0.020    0.30464 ±0.0155

* Agent names are Original Imitator - Decision Policy
* Measurements are in small bets per hand


Limit Results

• Dynamic does overall best, followed by Ensemble
• Some overlap in standard deviations


No Limit Results

Table 2: Expert Imitator Results (No Limit) against MCTSBot (MCTS) and SimpleBot (Rules)

                   MCTSBot            SimpleBot          Average
Hyperborean-max     1.7781 ±0.193     0.7935 ±0.084     1.2858 ±0.139
Dynamic             1.3332 ±0.146     0.5928 ±0.058     0.9630 ±0.102
Ensemble            1.3138 ±0.047     0.5453 ±0.058     0.9295 ±0.053
Hyperborean-prob    1.1036 ±0.165     0.5075 ±0.093     0.8055 ±0.129
Sartre-max          0.9313 ±0.117     0.5248 ±0.032     0.7281 ±0.075
Sartre-prob         0.4524 ±0.106     0.3933 ±0.073     0.4228 ±0.090
Tartanian-max       0.1033 ±0.451     0.3450 ±0.087     0.2242 ±0.269
Tartanian-prob     -0.0518 ±0.127     0.4221 ±0.032     0.1852 ±0.080

* Agent names are Original Imitator - Decision Policy
* Measurements are in big blinds per hand


No Limit Results

• Dynamic 2nd, Ensemble 3rd
• Decision combination not able to do better than the best single imitator: Hyperborean-max
• But, still better than most single imitators
• Some overlap in standard deviations


Discussion

• Hyperborean-max performs best against both no limit opponents
• As such, Dynamic and Ensemble are not able to improve upon this by considering the actions of another imitator
• More folds in no limit => less variance reduction


Conclusion

• Introduced our lazy expert imitators
• Described 2 approaches for combining their decisions
• In the no limit domain, decision combination outperformed most original imitators
• In the limit domain, decision combination outperformed all original imitators


The End.
