On Combining Decisions<br />
from Multiple Expert<br />
Imitators for Performance<br />
Jonathan Rubin and Ian Watson<br />
University of Auckland, New Zealand
On Combining Decisions<br />
from Multiple Expert<br />
Imitators for Performance<br />
• Agents that attempt to imitate the<br />
decisions of an expert (human or artificial)
• Dealing with a collection of expert<br />
imitators
• Combine the decisions from a collection of<br />
imitators<br />
• Can we improve performance?
Domain<br />
• Two-Player Texas Hold’em Poker<br />
• Limit<br />
• No Limit
Domain<br />
• Create individual poker agents<br />
• Assess performance<br />
• Combine decisions<br />
• Assess performance
How to combine<br />
decisions?<br />
• Two existing approaches<br />
• 1 - Ensemble Voting<br />
• 2 - Dynamic Selection at Runtime<br />
• Based on approach by (Johanson, 2007)
Why expert imitators?<br />
• 1 - Train on artificial or real-world data<br />
• 2 - Replace training data to create new<br />
strategy<br />
• 3 - Can easily create a diverse set of playing<br />
styles
Intransitive Performance<br />
Relationship<br />
• Ability to imitate different styles of play<br />
should prove useful<br />
• Intransitive relationships exist in the poker domain<br />
• A beats B<br />
• B beats C<br />
• C beats A
Lazy Learners<br />
• Use Case-Based Reasoning<br />
• Solve new problems by retrieving<br />
solutions to existing problems
Lazy Learners<br />
• Particularly suited to expert imitation<br />
• Observe and store game scenarios with<br />
solutions<br />
• Retrieve when a decision is required
Expert Imitation<br />
Jonathan Rubin, April 14, 2011<br />
• $I_j$: expert imitator<br />
• $C_j = \{c_{j,1}, c_{j,2}, \ldots, c_{j,n}\}$: the cases stored by $I_j$<br />
• $\forall c \in C_j,\ c = (x, a)$, where $x$ is the feature vector and $a$ is the action vector
Expert Imitators<br />
• $c_{max} = \arg\max_{c_k} sim(c_k, c_t),\ \forall c_k \in C_j$, where $sim$ is the global similarity and $c_t$ is the current game state<br />
• $c_{max} = (x_{max}, a_{max})$, where $a_{max}$ determines $I_j$'s action
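The retrieval step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Case` class, the `retrieve` helper, and in particular the `sim` function (an inverse Euclidean distance) are hypothetical stand-ins, since the slides do not say how global similarity is computed or which features are used.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Case:
    x: Tuple[float, ...]  # feature vector describing a stored game state
    a: Tuple[float, ...]  # action vector recorded for that state


def sim(x_k: Sequence[float], x_t: Sequence[float]) -> float:
    """Hypothetical global similarity: 1 / (1 + Euclidean distance)."""
    dist = sum((p - q) ** 2 for p, q in zip(x_k, x_t)) ** 0.5
    return 1.0 / (1.0 + dist)


def retrieve(case_base: Sequence[Case], x_t: Sequence[float]) -> Case:
    """Return c_max, the stored case most similar to the query state."""
    return max(case_base, key=lambda c: sim(c.x, x_t))


# Toy case base with made-up feature values and (f, c, r) action vectors.
case_base = [
    Case(x=(0.2, 1.0), a=(0.0, 0.7, 0.3)),
    Case(x=(0.9, 3.0), a=(0.1, 0.2, 0.7)),
]
c_max = retrieve(case_base, x_t=(0.8, 3.0))
print(c_max.a)  # (0.1, 0.2, 0.7)
```

The action vector of the retrieved case is what a decision policy would then turn into a single betting action.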
Action Vectors<br />
• Limit<br />
• a = (f, c, r)<br />
• No Limit<br />
• a = (f, c, q, h, i, p, d, v, t, a)
Decision Policies<br />
• Probabilistic<br />
• Max Frequency
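A minimal sketch of the two decision policies, assuming that each entry of the action vector holds the relative frequency with which the expert took that action (the slides imply this but do not state it). The function names and the example vector are illustrative only.

```python
import random
from typing import Sequence

ACTIONS = ("f", "c", "r")  # limit action vector a = (f, c, r)


def max_frequency(a: Sequence[float]) -> str:
    """Pick the action with the largest entry in the action vector."""
    best = max(range(len(a)), key=lambda i: a[i])
    return ACTIONS[best]


def probabilistic(a: Sequence[float], rng: random.Random) -> str:
    """Sample an action with probability proportional to its entry."""
    return rng.choices(ACTIONS, weights=a, k=1)[0]


a_max = (0.1, 0.6, 0.3)  # made-up frequencies for fold, call, raise
print(max_frequency(a_max))  # c
print(probabilistic(a_max, random.Random(0)))  # one of f, c, r
```

Max frequency is deterministic for a given action vector, whereas the probabilistic policy can return a different action each time the same case is retrieved.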
1 - Ensemble<br />
• Action Vector = (a1, a2, ..., an)<br />
• Each imitator (I1, I2, ..., Im) applies max frequency<br />
• Derive Vote Vector = (v1, v2, ..., vn)<br />
• Select the action, aj, that corresponds to the<br />
maximum number of votes, vj<br />
• If there is no strict maximum, use I1's action
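The voting procedure on this slide can be sketched as follows. The function name `ensemble_action` and the encoding of actions as indices are assumptions made for the example; each entry of `picks` is the action an imitator chose with max frequency, listed in the order I1, I2, ..., Im.

```python
from typing import Sequence


def ensemble_action(picks: Sequence[int], n_actions: int) -> int:
    """Return the index of the action with the most votes.

    picks[0] is I1's choice; it is used when no action has a
    strict maximum number of votes.
    """
    votes = [0] * n_actions  # vote vector (v1, ..., vn)
    for j in picks:
        votes[j] += 1
    top = max(votes)
    winners = [j for j, v in enumerate(votes) if v == top]
    if len(winners) == 1:
        return winners[0]
    return picks[0]  # tie: defer to I1


# Three imitators over (f, c, r) = indices (0, 1, 2).
print(ensemble_action([2, 1, 1], n_actions=3))  # 1: call wins 2 votes to 1
print(ensemble_action([2, 1, 0], n_actions=3))  # 2: three-way tie, use I1
```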
Showdown DIVAT<br />
• As in (Johanson, 2007), use DIVAT for variance reduction<br />
• Ignorant Value Assessment Tool, developed by Alberta CPRG<br />
• Basic idea: evaluate a hand based on the EV of the player’s decisions, NOT the outcome<br />
• $DivatOutcome = EV(ActualActions) - EV(BaselineActions)$
Showdown DIVAT<br />
• (Un)lucky occurrences affect both an<br />
imitator’s strategy and the baseline strategy<br />
• What matters is difference in EV<br />
• Requires perfect information<br />
• Can only apply at showdown<br />
• When a fold occurs, actual outcome info<br />
is used
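As a rough illustration of the scoring rule described on these two slides, the sketch below returns the EV difference when a hand reaches showdown and falls back to the actual outcome otherwise. The function names and the numbers are hypothetical, and the sketch takes the two EV values as given; how DIVAT computes them from a baseline strategy is not covered here.

```python
from typing import Optional


def divat_outcome(ev_actual: float, ev_baseline: float) -> float:
    """DivatOutcome = EV(ActualActions) - EV(BaselineActions)."""
    return ev_actual - ev_baseline


def hand_score(
    actual_outcome: float,
    ev_actual: Optional[float] = None,
    ev_baseline: Optional[float] = None,
) -> float:
    """Score one hand.

    If the hand reached showdown, both EVs are available and the
    DIVAT difference is used. If a fold occurred, the EVs are missing
    and the actual outcome is used instead.
    """
    if ev_actual is None or ev_baseline is None:
        return actual_outcome
    return divat_outcome(ev_actual, ev_baseline)


# Showdown hand: the player lost 4 bets but played better than baseline.
print(hand_score(actual_outcome=-4.0, ev_actual=1.5, ev_baseline=1.0))  # 0.5
# Hand that ended in a fold: only the actual outcome is available.
print(hand_score(actual_outcome=2.0))  # 2.0
```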
Experimental Results
Methodology<br />
• Limit and No Limit Domains<br />
• All imitators trained on data provided by<br />
ACPC<br />
• 2010 total bankroll division winner<br />
• 2010 instant run-off division winner<br />
• Own entry to 2010 competition
Methodology<br />
• 6 original imitator agents<br />
• 3 training sets x 2 policies<br />
• Derive 2 decision combination players<br />
• Ensemble<br />
• Dynamic<br />
• 8 imitators challenge 2 computerised agents
Match Info<br />
• Duplicate Matches<br />
• Both players receive the same sets of cards<br />
• Reduces variance<br />
• 3000 duplicate hands<br />
• Each imitator plays 5 duplicate matches<br />
against each opponent<br />
• Roughly half a million hands played in each domain
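The hand count on this slide can be checked with some quick arithmetic. The sketch assumes that each duplicate hand is dealt twice, once per seating (an assumption about how duplicate matches are counted here), and uses the 8 imitators and 2 opponents per domain from the methodology slides.

```python
imitators = 8            # 6 original imitators + Ensemble + Dynamic
opponents = 2            # computerised agents per domain
matches_per_pairing = 5  # duplicate matches against each opponent
duplicate_hands = 3000   # duplicate hands per match

# Assumption: every duplicate hand is played twice, with seats swapped.
hands_per_match = 2 * duplicate_hands

total_hands = imitators * opponents * matches_per_pairing * hands_per_match
print(total_hands)  # 480000
```

Under these assumptions the total is 480,000 hands per domain, which is consistent with the rounded figure of half a million.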
Limit Results<br />
Table 1: Expert Imitator Results against Fell Omen 2 (approx. Nash equilibrium) and AlistairBot (exploitive Monte-Carlo)<br />
| Imitator | Fell Omen 2 | AlistairBot | Average |
| --- | --- | --- | --- |
| Dynamic | 0.01342 ±0.006 | 0.68848 ±0.009 | 0.35095 ±0.0075 |
| Ensemble | 0.00830 ±0.010 | 0.67348 ±0.010 | 0.34089 ±0.0100 |
| Rockhopper-max | -0.01445 ±0.009 | 0.69504 ±0.009 | 0.34030 ±0.0090 |
| PULPO-max | -0.00768 ±0.006 | 0.66053 ±0.024 | 0.32643 ±0.0150 |
| Sartre-max | 0.00825 ±0.014 | 0.63898 ±0.019 | 0.32362 ±0.0165 |
| PULPO-prob | -0.00355 ±0.011 | 0.63385 ±0.018 | 0.31515 ±0.0145 |
| Sartre-prob | -0.01816 ±0.006 | 0.64535 ±0.015 | 0.31360 ±0.0105 |
| Rockhopper-prob | -0.02943 ±0.011 | 0.63870 ±0.020 | 0.30464 ±0.0155 |

* Rows are named "original imitator - decision policy" (max = max frequency, prob = probabilistic). Measurements are small bets per hand
Limit Results<br />
• Dynamic does overall best, followed by<br />
Ensemble<br />
• Some overlap in standard deviations
No Limit Results<br />
Table 2: Expert Imitator Results (No Limit) against MCTSBot (MCTS) and SimpleBot (rules)<br />
| Imitator | MCTSBot | SimpleBot | Average |
| --- | --- | --- | --- |
| Hyperborean-max | 1.7781 ±0.193 | 0.7935 ±0.084 | 1.2858 ±0.139 |
| Dynamic | 1.3332 ±0.146 | 0.5928 ±0.058 | 0.9630 ±0.102 |
| Ensemble | 1.3138 ±0.047 | 0.5453 ±0.058 | 0.9295 ±0.053 |
| Hyperborean-prob | 1.1036 ±0.165 | 0.5075 ±0.093 | 0.8055 ±0.129 |
| Sartre-max | 0.9313 ±0.117 | 0.5248 ±0.032 | 0.7281 ±0.075 |
| Sartre-prob | 0.4524 ±0.106 | 0.3933 ±0.073 | 0.4228 ±0.090 |
| Tartanian-max | 0.1033 ±0.451 | 0.3450 ±0.087 | 0.2242 ±0.269 |
| Tartanian-prob | -0.0518 ±0.127 | 0.4221 ±0.032 | 0.1852 ±0.080 |

* Rows are named "original imitator - decision policy" (max = max frequency, prob = probabilistic). Measurements are big blinds per hand
No Limit Results<br />
• Dynamic 2nd, Ensemble 3rd<br />
• Decision combination not able to do better<br />
than single imitator: Hyperborean-max<br />
• But, still better than most single imitators<br />
• Some overlap in standard deviations
Discussion<br />
• Hyperborean-max performs best against<br />
both no limit opponents<br />
• As such Dynamic and Ensemble are not<br />
able to improve upon this by considering<br />
the actions of another imitator<br />
• More folds in no limit => showdown DIVAT applies to fewer hands => less variance<br />
reduction
Conclusion<br />
• Introduced our lazy expert imitators<br />
• Described 2 approaches for combining<br />
their decisions<br />
• In no limit domain, decision combination<br />
outperformed most original imitators<br />
• In limit domain, decision combination<br />
outperformed all original imitators
The End.