DEPARTMENT OF STATISTICS
University of Wisconsin
1210 W. Dayton St.
Madison, WI 53706

TECHNICAL REPORT NO. 804
March 1987

Bernoulli bandits with covariates

Murray K. Clayton¹
University of Wisconsin

¹Research supported in part by U.S. Army Research Office Grant DAAG29-80-C-0041 and University of Wisconsin Graduate School Grant 160701.

AMS 1980 subject classifications. Primary 62L05; secondary 62L15.

Keywords and phrases: Sequential decisions, one-armed bandits, two-armed bandits, logit transformation.
SUMMARY

Sequential selections are to be made from two stochastic processes, or "arms", each yielding Bernoulli responses. At each stage the arm selected depends on previous observations. The objective is to maximize the expected number of successes in the first n selections. The probability of success for a given selection depends on a covariate through a logistic transformation. For one arm, this transformation is completely known; for the other, it depends on an unknown parameter. Optimal strategies are developed in terms of a break-even value for the covariate: it is optimal to observe the arm with unknown parameter if the covariate is less than the break-even value. Other properties of optimal strategies are related to those for non-covariate models.
1. Introduction

A bandit problem involves sequential selections or "pulls" from a number of stochastic processes (or "arms", machines, treatments, etc.). The available processes have unknown characteristics, so learning can take place as the processes are observed. As in Bradt, Johnson, and Karlin (1956), we shall restrict our attention to the class of finite horizon Bernoulli bandits, in which the responses are Bernoulli random variables and the goal is to maximize the expected sum of the first n observations. Berry and Fristedt (1985) discuss this and many other forms of bandit models.
The bandit model has been proposed as a model for a clinical trial. The arms represent treatments, and the goal is to allocate treatments sequentially so as to maximize the expected total number of successes. However, in a typical bandit problem, the observations made on a given arm are assumed to be exchangeable (see, for example, Berry and Fristedt, 1979). In a clinical trial, this implies that all subjects receiving the same treatment have the same (marginal) probability of success. In this paper we extend this notion by supposing that, for a given arm, the probability of success for a given subject depends both on the treatment being used and on a covariate that can encode relevant characteristics affecting the chance of success. This might include such things as the general health of the subject, the age and sex of the subject, and so on.
To describe our model formally, we begin by assuming that there are two arms. Let Xi and Yi denote the results from arms 1 and 2, respectively, at stage i; for i ≤ n exactly one of the pair (Xi, Yi) is actually observed. We also assume that prior to making the ith observation we can observe a covariate Si. We assume that functions ρ and λ exist such that P(Xi = 1 | ρ(Si)) = ρ(Si) and P(Yi = 1 | λ(Si)) = λ(Si). In what follows, we shall assume that after the (i−1)st pull and prior to the ith pull, only the covariate values S_1,...,S_i are known. Informally, we obtain no information about the (i+1)st and later subjects until after the ith subject has been treated.
Subjects are to be treated sequentially, and past information can be used in deciding how to proceed. In particular, the arm selected for observation at the ith selection depends on the previous selections, the previous results, the previous covariate values, and the value of the covariate for the subject about to be treated. A decision procedure or strategy specifies which arm to select based on this information. The worth of a strategy is defined in the usual way as the expectation of the sum of the first n observations over all possible histories resulting from that strategy. A strategy is optimal if it yields the maximal expected sum. An arm is said to be optimal if it is the first selection of some optimal strategy.
Many possibilities exist for describing the relationship between ρ, λ, and the covariate. We choose a linear-logistic model:

ρ(s) = exp{α+s}/(1 + exp{α+s})  and  λ(s) = exp{c+s}/(1 + exp{c+s}).

When necessary, we shall write these as ρ(α,s) and λ(c,s), making the dependence on α and c explicit. We shall assume that the characteristics of arm 2 are known in the sense that c is a known constant. We assume that α is unknown and, following a Bayesian approach, we suppose that prior information regarding α can be given by a probability distribution R. Finally, we assume that, prior to their observation, the covariate values are unknown, and that they are i.i.d. with a known distribution function G. This implies that, while the covariate value for a given subject is unknown until the subject arrives for treatment, the distribution of possible covariate values is known. We shall use lower case letters s, t, s_1, s_2, etc. to denote observed values of the covariate. In the situation where a subject's covariate value, s, is known, but the subject has not yet been treated, we shall refer to s as the "current" covariate value.
Although we have introduced our model in a clinical trial setting, one can describe industrial and other settings where it would be equally applicable. For convenience, we shall continue to use the clinical trial setting in describing our results.

A special case arises when G is degenerate at a point. In that case, the covariate values are the same for all subjects, and the situation presented here becomes equivalent to the bandit of Bradt et al. (1956). When comparing bandit models, we shall refer to the model set out in this paper as the "covariate" bandit model, and to the model of Bradt et al. (1956) as a "standard" bandit model.
The information about arm 1 is described by the probability measure on α. Initially, this is given by R. As we make observations on arm 1, we can describe the posterior distribution of α given those observations as follows: if j successes on arm 1 have been observed when the covariate values were s_1,...,s_j, and if k failures occurred when the covariate values were t_1,...,t_k, then the posterior measure on α is given by

σ_{s_1} ··· σ_{s_j} φ_{t_1} ··· φ_{t_k} R,

where

(σ_s R)(dα) = ρ(α,s) R(dα)/E[ρ(s)|R]  and  (φ_t R)(dα) = [1 − ρ(α,t)] R(dα)/[1 − E[ρ(t)|R]].

This is an extension of the notation of Berry and Fristedt (1979, p. 1087). Note that order is immaterial here: for example, σ_s φ_t R = φ_t σ_s R. For notational convenience we shall sometimes refer to σ_{s_1} ··· σ_{s_j} φ_{t_1} ··· φ_{t_k} R as hR, h denoting the previous history of successes, failures, and their corresponding covariate values. Throughout this paper we use the notation E(·|R) to denote expectation over α with respect to the distribution R. When there can be no confusion, we omit R from the notation.
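The operators σ_s and φ_t are simple reweightings of R, and the commutativity just noted is easy to verify numerically. A sketch (ours; the two-point prior is again an assumption made for illustration):

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigma(R, s):
        # sigma_s R: posterior on alpha after a success on arm 1 at covariate s.
        w = {a: p * logistic(a + s) for a, p in R.items()}
        total = sum(w.values())
        return {a: q / total for a, q in w.items()}

    def phi(R, t):
        # phi_t R: posterior on alpha after a failure on arm 1 at covariate t.
        w = {a: p * (1.0 - logistic(a + t)) for a, p in R.items()}
        total = sum(w.values())
        return {a: q / total for a, q in w.items()}

    R = {-1.0: 0.5, 1.0: 0.5}            # a two-point prior on alpha
    left = sigma(phi(R, 0.3), -0.7)      # sigma_{-0.7} phi_{0.3} R
    right = phi(sigma(R, -0.7), 0.3)     # phi_{0.3} sigma_{-0.7} R
    assert all(abs(left[a] - right[a]) < 1e-12 for a in R)   # order is immaterial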
By an " (s,n ,R,c) bandit" we mean a bandit for which the current covarf a te value<br />
is s, the number <strong>of</strong> subjects to treat is n, the distribution on a is R, and the<br />
parameter on arm 2 is c.<br />
(We shall regard G as fixed throughout, and suppress it<br />
from the notation.) In notation similar to Berry (1972) and Clayton and Berry (19853,<br />
let w,~(s,R,c)<br />
be the worth <strong>of</strong> selecting arm i initially in the (s,n,~,c) bandit.<br />
I 2<br />
The worth <strong>of</strong> proceeding optimally in the (s,n,~,c) bandlt is rnaxt~, (s,R,c),~, (s,~,c) 1,<br />
which we denote by Wn(s,R,c).<br />
The expected worth, before s is observed, is then<br />
(1.1) jWn(s,R,c)dG(s) = W,(R,c).<br />
Note that for n 3 1 we have the usual dynamic programming equations:<br />
and<br />
(1.3)<br />
2<br />
w, (s,R,c) = i(s) + Wn-l(R,c).<br />
Together <strong>with</strong> the evident condjtlon that W~(R,C) = 0, the above equations give<br />
a recursion for determining W,<br />
(s,R,c) and W,(R,c).<br />
One further quantity that we shall use in describing optimal strategies is the difference:

(1.4)  Δ_n(s,R,c) = W_n^1(s,R,c) − W_n^2(s,R,c).

The sign of Δ_n indicates the optimal arm at any stage: if Δ_n(s,R,c) > 0 then arm 1 is optimal initially; if Δ_n(s,R,c) < 0 then arm 2 is optimal initially; and if Δ_n(s,R,c) = 0 then either arm is optimal. Suppose an optimal pull has been made. If arm 2 has been observed, and if the current covariate value is t, then we are faced with a (t,n−1,R,c) bandit, and therefore observe arm 1 or arm 2 according to the sign of Δ_{n−1}(t,R,c). If, instead, arm 1 was observed on the first pull, then at the next stage we pull arm 1 or arm 2 according to the sign of Δ_{n−1}(t,σ_s R,c) or Δ_{n−1}(t,φ_s R,c), according as a success or a failure was observed.
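When both R and G have small finite support, equations (1.1)-(1.4) give a directly computable recursion. The following brute-force sketch (ours, for illustration only) evaluates W_n and Δ_n in that case; run on the two-point setting used for Figure 1 below, it reproduces the sign pattern of Δ_n described in Section 3:

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigma(R, s):      # posterior after a success at covariate s
        w = {a: p * logistic(a + s) for a, p in R.items()}
        t = sum(w.values())
        return {a: q / t for a, q in w.items()}

    def phi(R, s):        # posterior after a failure at covariate s
        w = {a: p * (1.0 - logistic(a + s)) for a, p in R.items()}
        t = sum(w.values())
        return {a: q / t for a, q in w.items()}

    def e_rho(R, s):      # E[rho(s)|R]
        return sum(p * logistic(a + s) for a, p in R.items())

    def W(n, R, c, G):
        # Equation (1.1): worth before the covariate is observed; W_0 = 0.
        if n == 0:
            return 0.0
        return sum(q * max(W1(n, s, R, c, G), W2(n, s, R, c, G)) for s, q in G.items())

    def W1(n, s, R, c, G):
        # Equation (1.2): pull arm 1; R is updated by sigma_s or phi_s.
        p = e_rho(R, s)
        return (p * (1.0 + W(n - 1, sigma(R, s), c, G))
                + (1.0 - p) * W(n - 1, phi(R, s), c, G))

    def W2(n, s, R, c, G):
        # Equation (1.3): pull arm 2; no information about alpha is gained.
        return logistic(c + s) + W(n - 1, R, c, G)

    def Delta(n, s, R, c, G):
        # Equation (1.4): arm 1 optimal if positive, arm 2 optimal if negative.
        return W1(n, s, R, c, G) - W2(n, s, R, c, G)

    # The setting of Figure 1: P(alpha=-1) = P(alpha=1) = 1/2 = P(S=-1) = P(S=1), c = 0.
    R = {-1.0: 0.5, 1.0: 0.5}
    G = {-1.0: 0.5, 1.0: 0.5}
    for n in range(1, 7):
        print(n, [round(Delta(n, s, R, 0.0, G), 5) for s in (-3.0, -1.0, 0.0, 1.0, 3.0)])

The recursion is exponential in n, which is adequate for the small horizons treated in the examples below; a memoized, state-indexed implementation would be the natural refinement for larger n.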
The problem described here is a form of two-armed bandit with one arm known. In the special case of a standard bandit, this problem has been described as a "one-armed" bandit since it is a stopping problem: for the standard bandit an optimal strategy can always be found for which any pull of arm 2 is optimally followed by another pull of arm 2. Examples of bandits satisfying such a condition are contained in Bradt et al. (1956), Berry and Fristedt (1979), Clayton and Berry (1985), Clayton and Witmer (1987), and others. As we shall see below, such a characterization cannot be applied in general to the covariate bandit.
Although the bandit model has been discussed extensively (see Berry and Fristedt, 1985), relatively little has been written on the incorporation of covariates in bandit models. A notable exception is that of Woodroofe (1979), who studied a bandit model that incorporated covariates and yielded normally distributed observations. For that bandit model Woodroofe derived second order approximations to optimal strategies and investigated their behavior. This contribution should be regarded as important twice over: first, for introducing a covariate bandit model, and second, for a further discussion of bandits other than Bernoulli. (See Clayton and Berry (1985) for another example of a non-Bernoulli bandit.) However, as Simons (1986) has commented, results derived for normal models are difficult to apply to the Bernoulli setting.

As we shall see below, the Bernoulli covariate model behaves in a more complicated fashion than either the standard bandit or Woodroofe's Gaussian covariate model. This is due in part to the fact that, if we view the Bernoulli covariate bandit as a Markov decision problem, then an important component of the state space is R, the distribution on α. Except when G has a finite support, this implies that the state space is effectively infinite dimensional. As a consequence, the explicit determination of optimal strategies is difficult, unless n is small or the support of G is on a small number of points. A similar issue arises when a Dirichlet process prior is used in a sequential decision problem (Clayton, 1985; Clayton and Berry, 1985).
In the remainder of this paper we shall focus on properties of optimal strategies in the covariate bandit and on the relationship between the standard bandit and the covariate bandit. In Section 2 we investigate some basic monotonicity properties of the covariate bandit and discuss stopping rules. In Section 3 we discuss further the properties of optimal strategies in terms of a break-even value. Section 4 contains some further comments.
2. Properties of optimal strategies

In this section we begin to describe some of the properties of optimal strategies. As with most bandits, the current model can be seen as an attempt to reconcile two conflicting goals: (1) to obtain information about R; and (2) to maximize the chance of success for each pull. Such a conflict arises, for example, when Eρ(s) < λ(s) at a particular stage. A pull on arm 2 will have a greater chance of success for the current subject, but a pull of arm 1 may yield information about R that will have a benefit when making future pulls. On the other hand, if R is such that E[ρ(s)|hR] ≥ λ(s) for all histories h, then arm 1 will always be optimal, and likewise, if E[ρ(s)|hR] ≤ λ(s) for all h, then arm 2 will always be optimal. Such will be the case if P(α>c) = 1 or P(α<c) = 1, respectively. In general, we have the bounds:

(2.1)  n ∫ E[max{ρ(α,s), λ(s)} | R] dG(s) ≥ W_n(R,c) ≥ n ∫ max{Eρ(s), λ(s)} dG(s) ≥ n max{∫ Eρ(s) dG(s), ∫ λ(s) dG(s)}.
In words, W_n(R,c) is bounded above by what would be the expected utility if α were known at the outset. W_n(R,c) is bounded below by the worth of a strategy that, for each subject, takes the covariate into account but ignores any posterior information gained about α during the trial. Finally, this latter worth and W_n(R,c) are both bounded below by the worth of a strategy that pulls the same arm throughout the trial.
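These bounds are easy to check numerically for a small example. The snippet below (ours; the particular R, G, and n are arbitrary choices) evaluates the three quantities in (2.1) alongside the recursion sketched in Section 1, repeating the helpers so it runs on its own:

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def update(R, s, success):
        # sigma_s R on a success, phi_s R on a failure.
        w = {a: p * (logistic(a + s) if success else 1.0 - logistic(a + s))
             for a, p in R.items()}
        t = sum(w.values())
        return {a: q / t for a, q in w.items()}

    def W(n, R, c, G):
        if n == 0:
            return 0.0
        out = 0.0
        for s, q in G.items():
            p = sum(pr * logistic(a + s) for a, pr in R.items())
            w1 = (p * (1.0 + W(n - 1, update(R, s, True), c, G))
                  + (1.0 - p) * W(n - 1, update(R, s, False), c, G))
            w2 = logistic(c + s) + W(n - 1, R, c, G)
            out += q * max(w1, w2)
        return out

    n, c = 4, 0.0
    R = {-1.0: 0.5, 1.0: 0.5}
    G = {-1.0: 0.5, 1.0: 0.5}
    # Upper bound: alpha known at the outset.
    upper = n * sum(q * sum(p * max(logistic(a + s), logistic(c + s))
                            for a, p in R.items()) for s, q in G.items())
    # Middle bound: use the covariate but ignore posterior learning.
    myopic = n * sum(q * max(sum(p * logistic(a + s) for a, p in R.items()),
                             logistic(c + s)) for s, q in G.items())
    # Lowest bound: pull the same arm throughout.
    same = n * max(sum(q * sum(p * logistic(a + s) for a, p in R.items())
                       for s, q in G.items()),
                   sum(q * logistic(c + s) for s, q in G.items()))
    assert upper >= W(n, R, c, G) >= myopic >= same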
As mentioned above, the standard bandit is a stopping problem, insofar as an optimal pull of arm 2 can be followed by another optimal pull of arm 2. This need not be the case for the covariate bandit, as the following example shows.
Example 2.1: Suppose c = 0, and P(α=−1) = P(α=1) = 1/2 = P(S=−5) = P(S=5). Then it is easy, but tedious, to calculate that Δ_2(5,R,0) < 0 while Δ_1(−5,R,0) > 0. Hence, arm 2 is optimal when n = 2 and S = 5. However, arm 1 is optimal when n = 1 and S = −5. This result reflects the intuition that when the current covariate value is sufficiently small, arm 1 is more attractive since there is some potential that α will be +1. However, when S is large, arm 1 becomes less attractive, since there is a risk that α will be −1.////
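A direct computation with the recursion of Section 1 confirms these inequalities; in our own calculation Δ_2(5,R,0) ≈ −0.0035 and Δ_1(−5,R,0) ≈ +0.0035 (the rounded values are ours, not the report's). A self-contained check:

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def update(R, s, success):
        w = {a: p * (logistic(a + s) if success else 1.0 - logistic(a + s))
             for a, p in R.items()}
        t = sum(w.values())
        return {a: q / t for a, q in w.items()}

    def W(n, R, c, G):
        if n == 0:
            return 0.0
        return sum(q * max(W1(n, s, R, c, G), W2(n, s, R, c, G)) for s, q in G.items())

    def W1(n, s, R, c, G):
        p = sum(pr * logistic(a + s) for a, pr in R.items())
        return (p * (1.0 + W(n - 1, update(R, s, True), c, G))
                + (1.0 - p) * W(n - 1, update(R, s, False), c, G))

    def W2(n, s, R, c, G):
        return logistic(c + s) + W(n - 1, R, c, G)

    def Delta(n, s, R, c, G):
        return W1(n, s, R, c, G) - W2(n, s, R, c, G)

    R = {-1.0: 0.5, 1.0: 0.5}
    G = {-5.0: 0.5, 5.0: 0.5}
    print(Delta(2, 5.0, R, 0.0, G))    # approx -0.0035: arm 2 optimal at n = 2, S = 5
    print(Delta(1, -5.0, R, 0.0, G))   # approx +0.0035: arm 1 optimal at n = 1, S = -5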
Although the covariate bandit is not a stopping problem in the usual sense, it does satisfy a weak stopping rule property, as follows:
Theorem 2.1: If, for the (s,n,R,c) bandit, arm 2 is uniquely optimal for all s, then there exists an optimal strategy for which it is optimal to pull arm 2 for the (s,m,R,c) bandit, m ≤ n, for any s.

Proof: The proof is a generalization of the proof of a similar result in Bradt et al. (1956). Suppose, to the contrary, that there exists an m' < n such that arm 1 is uniquely optimal under the strategy τ for the (s',m',R,c) bandit. Consider an (s',n,R,c) bandit that follows τ for the first m' pulls, and then pulls arm 2 for the remaining n−m' pulls. This has a worth that is no less than W_n(s',R,c). But this contradicts the unique optimality of arm 2 for the (s',n,R,c) bandit.////
Another property of the standard bandit is the "stay on a winner" property: if an optimal pull of arm 1 yields a success, then it is optimal to pull arm 1 again. This need not hold for the covariate bandit, as shown in the next example.
Example 2.2: Suppose c = 0, P(α=−1) = P(α=5) = 1/2, and P(S=1) = .99 = 1 − P(S=3). Then Δ_2(3,R) = .00125 but Δ_1(3,σ_3 R) = −.00859. That is, with S = 3 and two observations to take, arm 1 is optimal. However, if a pull of arm 1 under such circumstances yields a success, a subsequent pull of arm 1 when S = 3 will not be optimal.////
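The two displayed values can be recomputed with the recursion of Section 1. The snippet below assumes the helper functions (logistic, update, W1, W2, Delta) from the sketch following Example 2.1 are already defined in the session:

    # Assumes logistic, update, W1, W2, Delta from the sketch after Example 2.1.
    R = {-1.0: 0.5, 5.0: 0.5}
    G = {1.0: 0.99, 3.0: 0.01}
    print(round(Delta(2, 3.0, R, 0.0, G), 5))                     # 0.00125
    print(round(Delta(1, 3.0, update(R, 3.0, True), 0.0, G), 5))  # -0.00859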
Although no simple stay-on-a-winner rule exists, a weak form of stay-on-a-winner does exist, as follows.

Theorem 2.2: Suppose in the (s,n,R,c) bandit that an initial pull of arm 1 is uniquely optimal and that a success obtains. Then there exists an s' in the support of G such that a pull of arm 1 is optimal for the (s',n−1,σ_s R,c) bandit.
Proof: Suppose, to the contrary, that arm 1 is not optimal for any s' in the support of G for the (s',n−1,σ_s R,c) bandit. Then by Theorem 2.1 arm 2 is optimal for the remaining n−1 pulls, and thus

W_{n−1}(σ_s R,c) = (n−1) ∫ λ(t) dG(t).

It follows from Theorem 2.3 below and equation (2.1) that

W_{n−1}(φ_s R,c) = W_{n−1}(R,c) = (n−1) ∫ λ(t) dG(t).

Moreover, since a pull of arm 2 is optimal for all pulls after the first, it follows that E[ρ(s_1)|σ_s R] ≤ λ(s_1) for any s_1 in the support of G. Finally, by equation (1.2) this implies that Δ_n(s,R,c) ≤ 0, contradicting the fact that arm 1 is uniquely optimal for the first pull.////
While the covariate bandit and the standard bandit share the stopping rule and stay-on-a-winner properties in only a weakened sense, both bandits have several monotonicity properties in common. For example, it is easy to prove by induction, using (1.2) and (1.3), that W_n(s,R,c) is nondecreasing in c. We can also develop a monotonicity result for arm 1. This contains the finite horizon version of Theorem 3.1 of Berry and Fristedt (1979) as a special case.
Definition 2.1: For any two random variables X and X' with distribution functions F and F' respectively, we say that the distribution of X' is "to the right of" the distribution of X if F(b) ≥ F'(b) for all b. As noted in Marshall and Olkin (1979), this condition is equivalent to the condition that Eg(X') ≥ Eg(X) for any nondecreasing g such that the expectations exist.////
Note that if the distribution of α' is to the right of the distribution of α, then the distribution of ρ(α',s) is to the right of the distribution of ρ(α,s), for all s. In addition, if the distribution of ρ(α',s) is to the right of the distribution of ρ(α,s) for some s, then it is easy to show that this must hold for all s, and that the distribution of α' is to the right of the distribution of α.
Definition 2.2: In an extension of a notion of Berry and Fristedt (1979), if R' and R are measures for α, we define R' to be "strongly to the right" of R if hR' is to the right of hR for all histories h.////

Given these definitions, we have the following:
Theorem 2.3: If R' is strongly to the right of R, then W_n(R',c) ≥ W_n(R,c) and, for all s, W_n(s,R',c) ≥ W_n(s,R,c).

Proof: This is immediately true by induction for W_n^2. Consider W_n^1. From (1.2),

W_n^1(s,R',c) − W_n^1(s,R,c) = [E(ρ(s)|R') − E(ρ(s)|R)][1 + W_{n−1}(σ_s R',c) − W_{n−1}(φ_s R,c)]
  + E(ρ(s)|R)[W_{n−1}(σ_s R',c) − W_{n−1}(σ_s R,c)]
  + [1 − E(ρ(s)|R')][W_{n−1}(φ_s R',c) − W_{n−1}(φ_s R,c)].

Since R' is to the right of R, E(ρ(s)|R') ≥ E(ρ(s)|R). Note that σ_s R' is strongly to the right of σ_s R, and both of these are strongly to the right of φ_s R. Also, φ_s R' is strongly to the right of φ_s R. By induction, each of the quantities in square brackets is nonnegative. The rest of the proof follows by the definition of W_n(s,R,c) and equation (1.1).////
Lemma 2.1: If s_1 < s_2 then

(i) σ_{s_1} R is to the right of σ_{s_2} R, and

(ii) φ_{s_1} R is to the right of φ_{s_2} R.

Proof: We prove part (i); the proof of part (ii) is similar. It will suffice to show that, for all b, P(α ≤ b | σ_{s_1} R) ≤ P(α ≤ b | σ_{s_2} R), or equivalently, that

(2.2)  ∫ I_{(−∞,b]}(α) ρ(α,s_2) dR(α) ∫ ρ(α',s_1) dR(α') − ∫ I_{(−∞,b]}(α) ρ(α,s_1) dR(α) ∫ ρ(α',s_2) dR(α') ≥ 0.

Writing the dependence of ρ on α explicitly, (2.2) is equivalent to

(2.3)  ∫∫ I_{(−∞,b]}(α) [ρ(α,s_2)ρ(α',s_1) − ρ(α,s_1)ρ(α',s_2)] dR(α) dR(α') ≥ 0,

where I_A is the indicator function of the set A. By Tonelli's theorem and the antisymmetry of the bracketed factor, the left side of (2.3) is

(2.4)  ∫∫ I_{(−∞,b]}(α) I_{(b,∞)}(α') [ρ(α,s_2)ρ(α',s_1) − ρ(α,s_1)ρ(α',s_2)] dR(α) dR(α').

However, if α ≤ b < α', the integrand in (2.4) is nonnegative, and if α ≤ b < α' fails, the integrand is zero.////
An immediate consequence of Lemma 2.1 and Theorem 2.3 is the following.

Theorem 2.4: For all n, R, c, t, and i, each of W_n(σ_s R,c), W_n(t,σ_s R,c), and W_n^i(t,σ_s R,c), and likewise each of W_n(φ_s R,c), W_n(t,φ_s R,c), and W_n^i(t,φ_s R,c), is nonincreasing in s.
Theorem 2.4 and parts (i)-(iv) of Proposition 2.1 below are related to the "information" obtained by a particular pull. A success observed on arm 1 when s is large is relatively uninformative, since lim_{s→∞} ρ(α,s) = 1 for any α. However, a success observed when s is large in magnitude and negative is potentially quite informative: it suggests that α itself is large. The reverse situation arises when we observe a failure on arm 1. Bounds on the "information" available in a pull are given by parts (ii) and (iii) of the Proposition. Part (v) of Proposition 2.1 suggests that as the current covariate value grows large in magnitude, we become indifferent to the choice of arms for the next pull.
Proposition 2.1: Define σ_{−∞} R by d(σ_{−∞} R)/dR = e^α/E(e^α|R) and φ_{∞} R by d(φ_{∞} R)/dR = e^{−α}/E(e^{−α}|R). For all n, for i = 1 and 2, for all R, and for all c:

(i) lim_{s→∞} W_n(σ_s R,c) = lim_{s→−∞} W_n(φ_s R,c) = W_n(R,c);

(ii) lim_{s→−∞} W_n(σ_s R,c) = W_n(σ_{−∞} R,c);

(iii) lim_{s→∞} W_n(φ_s R,c) = W_n(φ_{∞} R,c);

(iv) lim_{s→∞} W_n^i(t,σ_s R,c) = lim_{s→−∞} W_n^i(t,φ_s R,c) = W_n^i(t,R,c), and similarly for W_n(t,·,c);

(v) lim_{t→∞} [W_n^1(t,R,c) − W_n^2(t,R,c)] = lim_{t→−∞} [W_n^1(t,R,c) − W_n^2(t,R,c)] = 0.
Proof: We prove some parts of the proposition; the remainder follow similarly. First we prove the first half of part (iv), using induction. The result is easy when n = 1. Note that

W_n^1(t,σ_s R,c) = E[ρ(t)|σ_s R] + E[ρ(t)|σ_s R] W_{n−1}(σ_t σ_s R,c) + [1 − E[ρ(t)|σ_s R]] W_{n−1}(φ_t σ_s R,c).

By the induction hypothesis, lim_{s→∞} W_{n−1}(σ_t σ_s R,c) = W_{n−1}(σ_t R,c). Also, it is easy to show that lim_{s→∞} E[ρ(t)|σ_s R] = Eρ(t). So lim_{s→∞} W_n^1(t,σ_s R,c) = W_n^1(t,R,c).

To prove part (i), use part (iv), noting that

lim_{s→∞} W_n(σ_s R,c) = lim_{s→∞} ∫ W_n(t,σ_s R,c) dG(t) = ∫ W_n(t,R,c) dG(t) = W_n(R,c).

The second equality above follows from the dominated convergence theorem and equation (2.1).

To prove part (v), note that

lim_{t→∞} W_n^1(t,R,c) = lim_{t→∞} {Eρ(t) + Eρ(t) W_{n−1}(σ_t R,c) + [1 − Eρ(t)] W_{n−1}(φ_t R,c)} = 1 + W_{n−1}(R,c) = lim_{t→∞} W_n^2(t,R,c),

using parts (i) and (iv); the limit as t → −∞ is handled similarly.////
3. The function Δ_n

As mentioned above, the function Δ_n can be used to determine optimal strategies. As must also be evident, the determination of Δ_n is nontrivial. In this section we explore certain properties of Δ_n and discuss their implications for determining properties of the optimal strategy.

Note that it is useful and proper to consider Δ_n as a function of s for all real s, even though the support of G might be on some proper subset of the reals. Of course, in using Δ_n to determine optimal strategies, attention will be restricted to those s in the support of G.

It is easy to show, by induction, using (1.2), (1.3), and the definition of Δ_n in (1.4), that Δ_n is a continuous function of s and c. From Proposition 2.1(v) it is evident that lim_{s→±∞} Δ_n(s,R,c) = 0. This fact is illustrated in Figure 1, where Δ_n is plotted as a function of s for n = 1,...,6 and for R and G such that P(α=−1) = P(α=1) = 1/2 = P(S=−1) = P(S=1). We note from Figure 1 that Δ_n has at most one root in s. We now set about a proof of that fact for the case n = 1, and derive a weaker result when n ≥ 2.
Theorem 3.1: If R is not degenerate at a point, then Δ_1(s,R,c) has at most one root in s.

Remark: If R is degenerate at c, then Δ_1(s,R,c) = 0 for all s. If R is degenerate at a point other than c, then Δ_1(s,R,c) has no roots in s.
Proof: Without loss of generality, we can assume c = 0. Let s_1 < s_2. We show that: (a) if Δ_1(s_2,R,0) ≥ 0 then Δ_1(s_1,R,0) > 0. A similar approach shows that: (b) if Δ_1(s_1,R,0) < 0 then Δ_1(s_2,R,0) < 0. Parts (a) and (b) complete the proof.

To proceed with (a), note first that Δ_1(s,R,0) = λ(s)d(s), where d(s) = E[(e^α − 1)/(1 + e^{α+s}) | R]. Since λ(s) > 0 for all s, it will suffice to show that Δ_1(s_2,R,0) ≥ 0 implies d(s_1) − d(s_2) > 0. Next, note that (e^α − 1)/(1 + e^{α+s_2}) is increasing in α and σ_{s_1} R is to the right of R; since R is not degenerate at a point, it follows that

E[(e^α − 1)/(1 + e^{α+s_2}) | σ_{s_1} R] − E[(e^α − 1)/(1 + e^{α+s_2}) | R] > 0.

Finally, some algebra shows that

d(s_1) − d(s_2) = (e^{s_2−s_1} − 1) E[ρ(s_1)|R] Δ_1(s_2, σ_{s_1} R, 0)/λ(s_2).

Consequently, Δ_1(s_2,R,0) ≥ 0 implies Δ_1(s_2,σ_{s_1} R,0) > 0, which in turn implies d(s_1) − d(s_2) > 0, as required.////
The result of Theorem 3.1 may be restated in an alternative form, which we note as a Corollary.

Corollary 3.1: In the (s_1,1,R,c) bandit, there exists a quantity Σ_1 = Σ_1(R,c) in [−∞,∞] such that a pull of arm 1 is optimal if s_1 < Σ_1; a pull of arm 2 is optimal if s_1 > Σ_1; and either is optimal if s_1 = Σ_1.

Remark: If Σ_1 = +∞, then a pull of arm 1 is optimal for all s, and if Σ_1 = −∞ then a pull of arm 2 is optimal for all s. So, for example, if P(α>c) = 1 then Σ_1 = +∞.
Theorem 3.2: If R' is strongly to the right of R, then Δ_1(s,R',c) ≥ Δ_1(s,R,c) and Σ_1(R',c) ≥ Σ_1(R,c). Also, Δ_1(s,R,c) is decreasing in c, and Σ_1(R,c) is nonincreasing in c.

Proof: This is an immediate consequence of Theorem 2.3 and the definitions of Δ_1 and Σ_1.////
We conjecture that, for all n, there exists a quantity with properties similar to Σ_1; namely, in the (s_n,n,R,c) bandit, it is optimal to pull arm 1 if s_n < Σ_n(R,c) and it is optimal to pull arm 2 if s_n > Σ_n(R,c). In this sense Σ_n(R,c) would be a "break-even value" for the covariate. If Σ_n were to exist for n ≥ 2, then a complete determination of an optimal strategy could be given in terms of Σ_n. Our next result and its corollary are partial results in that direction.
Theorem 3.3: For all n, Δ_n(s,R,c) ≥ Δ_1(s,R,c).

Proof: From (1.2), (1.3), and (1.4),

(3.1)  Δ_n(s,R,c) − Δ_1(s,R,c) = Eρ(s) W_{n−1}(σ_s R,c) + [1 − Eρ(s)] W_{n−1}(φ_s R,c) − W_{n−1}(R,c).

The right side of (3.1) may be written as

(3.2)  ∫ {Eρ(s) W_{n−1}(t,σ_s R,c) + [1 − Eρ(s)] W_{n−1}(t,φ_s R,c) − W_{n−1}(t,R,c)} dG(t).

We show that the integrand in (3.2) is always nonnegative by induction. For n−1 = 1 there are two cases. If W_1(t,R,c) = λ(t), then the result is clear, since W_1(t,σ_s R,c) ≥ λ(t) and W_1(t,φ_s R,c) ≥ λ(t). If W_1(t,R,c) = Eρ(t), then we note that W_1(t,σ_s R,c) ≥ E[ρ(t)|σ_s R] and W_1(t,φ_s R,c) ≥ E[ρ(t)|φ_s R], whence the integrand in (3.2) is bounded below by

Eρ(s) E[ρ(t)|σ_s R] + [1 − Eρ(s)] E[ρ(t)|φ_s R] − Eρ(t) = 0.

For the induction step we assume that the integrand in (3.2) is nonnegative for n−1 = m. Again, we distinguish two cases. First, when W_{m+1}(t,R,c) = W_{m+1}^2(t,R,c), then we have

Eρ(s) W_{m+1}(t,σ_s R,c) + [1 − Eρ(s)] W_{m+1}(t,φ_s R,c) ≥ λ(t) + Eρ(s) W_m(σ_s R,c) + [1 − Eρ(s)] W_m(φ_s R,c) ≥ λ(t) + W_m(R,c) = W_{m+1}(t,R,c).

The first inequality above follows by the definition of W_{m+1}. The second inequality follows by the induction hypothesis.

In the second case, if W_{m+1}(t,R,c) = W_{m+1}^1(t,R,c), then following (1.2), the integrand in (3.2) is bounded below by the sum of three components, A, B, and C, say, where

A = Eρ(s) E[ρ(t)|σ_s R] + [1 − Eρ(s)] E[ρ(t)|φ_s R] − Eρ(t),

B = Eρ(s) E[ρ(t)|σ_s R] W_m(σ_t σ_s R,c) + [1 − Eρ(s)] E[ρ(t)|φ_s R] W_m(σ_t φ_s R,c) − Eρ(t) W_m(σ_t R,c),

and

C = Eρ(s)[1 − E[ρ(t)|σ_s R]] W_m(φ_t σ_s R,c) + [1 − Eρ(s)][1 − E[ρ(t)|φ_s R]] W_m(φ_t φ_s R,c) − [1 − Eρ(t)] W_m(φ_t R,c).

Some algebra shows that A = 0. Consider the expression B. We may write

B = Eρ(t) {E[ρ(s)|σ_t R] W_m(σ_s σ_t R,c) + [1 − E[ρ(s)|σ_t R]] W_m(φ_s σ_t R,c) − W_m(σ_t R,c)},

which is nonnegative by the induction hypothesis. Likewise, the induction hypothesis may be used to prove that C ≥ 0, and the theorem now follows.////
Corollary 3.2: For any n, R, c, there exists a Σ'_n(R,c) such that if s_n < Σ'_n(R,c) then arm 1 is optimal in the (s_n,n,R,c) bandit. Moreover, we may take Σ'_n(R,c) = Σ_1(R,c), since by Theorem 3.3 and Corollary 3.1, s_n < Σ_1(R,c) implies Δ_n(s_n,R,c) ≥ Δ_1(s_n,R,c) > 0.

Let Σ*_n(R,c) be the largest Σ'_n(R,c) such that the property given in Corollary 3.2 holds. (In particular, Σ*_1(R,c) = Σ'_1(R,c) = Σ_1(R,c).) Then Σ*_n(R,c) is a weak form of "break-even value" for the covariate s_n, in the following sense: if s_n < Σ*_n(R,c) then arm 1 is optimal in the (s_n,n,R,c) bandit. Noting that ρ(s) and λ(s) are increasing in s, this says that, in terms of a clinical trial, when subjects have covariates such that the probability of success is small, then it is better to use the experimental, unknown treatment (arm 1). Our conjecture that Σ_n(R,c) exists for all n is equivalent to the conjecture that s_n > Σ*_n(R,c) implies that arm 2 is optimal in the (s_n,n,R,c) bandit. This certainly holds for the example in Figure 1, and we have shown it to hold in other examples not provided here. The proof of such a result in general seems particularly elusive, however. The following example represents a partial result regarding the existence of Σ_n(R,c) for n > 1.
Example 3.2: Suppose c = 0, P(α=1) = p = 1 − P(α=−1), and P(S=−1) = P(S=1) = 1/2. We demonstrate the existence of Σ_2 in this case. For brevity, we suppress c from the notation. It will suffice to show that if W_2(−1,R) = W_2^2(−1,R) then W_2(1,R) = W_2^2(1,R), and if W_2(1,R) = W_2^1(1,R) then W_2(−1,R) = W_2^1(−1,R). We prove these simultaneously by contradiction.

Specifically, suppose that W_2(−1,R) = W_2^2(−1,R) and that W_2(1,R) = W_2^1(1,R). If W_2(−1,R) = W_2^2(−1,R) then it follows that Δ_2(−1,R) ≤ 0, whence Δ_1(−1,R) ≤ 0 by Theorem 3.3. But, by Theorem 3.2, Δ_1(−1,R) ≤ 0 implies that Δ_1(−1,φ_{−1} R) < 0 and Δ_1(−1,φ_1 R) < 0. From Theorem 3.1, we thus have Δ_1(1,R) < 0, Δ_1(1,φ_{−1} R) < 0, and Δ_1(1,φ_1 R) < 0. Since W_2(1,R) = W_2^1(1,R) by assumption, it must be that Δ_1(−1,σ_1 R) > 0. There are now two cases: Δ_1(1,σ_1 R) ≥ 0 and Δ_1(1,σ_1 R) < 0. The second case can be shown to be impossible. Therefore,

W_2^1(1,R) = Eρ(1) + (1/2)[λ(1) + λ(−1)] + (1/2) E[ρ(1){ρ(1) − λ(1) + ρ(−1) − λ(−1)}].

On the other hand, W_2^2(1,R) = λ(1) + (1/2)λ(1) + (1/2)λ(−1), whence

Δ_2(1,R) = Δ_1(1,R) + (1/2) E[ρ(1){ρ(1) − λ(1) + ρ(−1) − λ(−1)}].

Similarly, W_2^2(−1,R) = λ(−1) + (1/2)λ(1) + (1/2)λ(−1) and

Δ_2(−1,R) = Δ_1(−1,R) + (1/2) E[ρ(−1){ρ(1) − λ(1) + ρ(−1) − λ(−1)}],

since, by Lemma 2.1 and Theorem 3.2, Δ_1(s,σ_{−1} R) ≥ Δ_1(s,σ_1 R) ≥ 0. Some tedious algebra shows that

Δ_2(1,R) > 0 iff p > .5883,

while

Δ_2(−1,R) > 0 iff p > .5694.

Since the supposition requires Δ_2(1,R) ≥ 0 and Δ_2(−1,R) ≤ 0 simultaneously, this leads to the sought-after contradiction.////
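The displayed thresholds come from the case expressions used in the argument. Independently of that case analysis, the conclusion can be checked by direct computation: sweeping p and verifying that no p makes arm 1 strictly optimal at s = 1 while arm 2 is strictly optimal at s = −1 when n = 2, which is exactly the pattern a break-even value Σ_2 rules out. A sketch (ours, repeating the recursion of Section 1 so it runs on its own):

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def update(R, s, success):
        w = {a: pr * (logistic(a + s) if success else 1.0 - logistic(a + s))
             for a, pr in R.items()}
        t = sum(w.values())
        return {a: q / t for a, q in w.items()}

    def W(n, R, c, G):
        if n == 0:
            return 0.0
        return sum(q * max(W1(n, s, R, c, G), W2(n, s, R, c, G)) for s, q in G.items())

    def W1(n, s, R, c, G):
        p = sum(pr * logistic(a + s) for a, pr in R.items())
        return (p * (1.0 + W(n - 1, update(R, s, True), c, G))
                + (1.0 - p) * W(n - 1, update(R, s, False), c, G))

    def W2(n, s, R, c, G):
        return logistic(c + s) + W(n - 1, R, c, G)

    G = {-1.0: 0.5, 1.0: 0.5}
    for k in range(1, 100):
        p = k / 100.0
        R = {1.0: p, -1.0: 1.0 - p}
        d_minus = W1(2, -1.0, R, 0.0, G) - W2(2, -1.0, R, 0.0, G)
        d_plus = W1(2, 1.0, R, 0.0, G) - W2(2, 1.0, R, 0.0, G)
        # Sigma_2 exists unless arm 1 is strictly optimal at s = 1 while
        # arm 2 is strictly optimal at s = -1.
        assert not (d_plus > 0 and d_minus < 0)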
4. Additional Comments

As mentioned, Corollaries 3.1 and 3.2 are directed toward the description of optimal strategies in terms of a break-even value for the current covariate value. In noncovariate bandit treatments, optimal policies have been described in terms of a break-even value for arm 2. (See, for example, Bradt et al., 1956, Berry and Fristedt, 1979, Clayton and Berry, 1985, and Berry and Fristedt, 1985. Gittins and Jones, 1974, have used such indexes in describing multiarmed bandits.) A comparable break-even value for arm 2 of the covariate bandit is a quantity C_n(s,R) which would characterize optimal strategies as follows: if c < C_n(s,R) then a pull of arm 1 is optimal, if c > C_n(s,R) then a pull of arm 2 is optimal, and if c = C_n(s,R) then either arm is optimal. We conjecture that such a quantity C_n(s,R) exists for all n, s, and R. Indeed, it is easy to see that C_1(s,R) exists for all s and R; it is the root in c of the equation Δ_1(s,R,c) = 0. The next result describes a situation in which C_2(s,R) exists:

Proposition 4.1: If Δ_1(s,R,c) − ∫ Δ_1(t,R,c) dG(t) is decreasing in c, then C_2(s,R) exists.
Remark: The hypothesis of the proposition is equivalent to the requirement that

∫ λ(t,c)[1 − λ(t,c)] dG(t) < λ(s,c)[1 − λ(s,c)].

Proof: We prove that, under the hypothesis of the proposition, Δ_2(s,R,c) is decreasing in c. The existence of C_2(s,R) as a root in c of Δ_2(s,R,c) = 0 then follows from the fact that lim_{c→−∞} Δ_2(s,R,c) = Eρ(s) and lim_{c→∞} Δ_2(s,R,c) = Eρ(s) − 1. If we let Δ_1^+ = max{Δ_1,0} and Δ_1^− = −min{Δ_1,0}, then (1.1), (1.2), (1.3), and (1.4) can be used to show that

Δ_2(s,R,c) = [Δ_1(s,R,c) − ∫ Δ_1(t,R,c) dG(t)] + Eρ(s) ∫ Δ_1^+(t,σ_s R,c) dG(t) + [1 − Eρ(s)] ∫ Δ_1^+(t,φ_s R,c) dG(t) − ∫ Δ_1^−(t,R,c) dG(t).

The desired result now follows from the fact that Δ_1(t,R,c) is decreasing in c for all R.////
The method of proof used for Proposition 4.1 can be generalized to show that if Δ_1(s,R,c) − n ∫ Δ_1(t,R,c) dG(t) is decreasing in c, and if Δ_m(s,R,c) is decreasing in c for all s, R, and m < n, then C_n(s,R) exists.
The following example shows that an index for s may not exist for other forms of covariate model.

Example 4.1: Suppose λ(s) = e^s/(1+e^s) and ρ(s) = e^{βs}/(1+e^{βs}), with P(β=0) = 1/2 = P(β=10). Then Eρ(s) − λ(s) is positive, and so arm 1 is favored, both for s sufficiently negative and for s in an interval just to the right of 0; it is negative for s in an interval just to the left of 0 and for all s sufficiently large. Thus Δ_1 changes sign more than once in s, and no break-even value for the covariate of the kind given in Corollary 3.1 can exist.////
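The sign changes in Example 4.1 are easy to exhibit; the following check (ours) evaluates Eρ(s) − λ(s) at four points and shows the pattern +, −, +, −, so that Δ_1 has more than one root in s:

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def myopic_gap(s):
        # E[rho(s)] - lambda(s) for Example 4.1: P(beta=0) = P(beta=10) = 1/2.
        return 0.5 * logistic(0.0) + 0.5 * logistic(10.0 * s) - logistic(s)

    for s in (-2.0, -0.5, 0.5, 2.0):
        print(s, round(myopic_gap(s), 4))   # signs: +, -, +, -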
REFERENCES

Berry, D.A. (1972). A Bernoulli two-armed bandit. Ann. Math. Statist. 43, 871-897.

Berry, D.A. and Fristedt, B. (1979). Bernoulli one-armed bandits - arbitrary discount sequences. Ann. Statist. 7, 1086-1105.

Berry, D.A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, New York.

Bradt, R.N., Johnson, S.M. and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Ann. Math. Statist. 27, 1060-1070.

Clayton, M.K. (1985). A Bayesian nonparametric sequential test for the mean of a population. Ann. Statist. 13, 1129-1139.

Clayton, M.K. and Berry, D.A. (1985). Bayesian nonparametric bandits. Ann. Statist. 13, 1523-1534.

Clayton, M.K. and Witmer, J.A. (1987). Two-stage bandits. University of Wisconsin-Madison, Department of Statistics Technical Report.

Gittins, J.C. and Jones, D.M. (1974). A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (eds. J. Gani et al.), pp. 241-266. North-Holland, Amsterdam.

Marshall, A.W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Academic Press, New York.

Simons, G. (1986). Bayes rules for a clinical-trials model with dichotomous responses. Ann. Statist. 14, 954-970.

Woodroofe, M. (1979). A one-armed bandit problem with a concomitant variable. J. Amer. Statist. Assoc. 74, 799-806.
Department of Statistics
University of Wisconsin
1210 W. Dayton St.
Madison, WI 53706
Figure 1. Δ_n(s,R,c) for various n and s. Here P(α=−1) = P(α=1) = 0.5 = P(S=−1) = P(S=1).