DEPARTMENT OF STATISTICS
University of Wisconsin
1210 W. Dayton St.
Madison, WI 53706

TECHNICAL REPORT NO. 804
March 1987

Bernoulli bandits with covariates

Murray K. Clayton¹
University of Wisconsin

¹Research supported in part by U.S. Army Research Office Grant DAAG29-80-C-0041 and University of Wisconsin Graduate School Grant 160701.

AMS 1980 subject classifications. Primary 62L05; secondary 62L15.

Keywords and phrases: Sequential decisions, one-armed bandits, two-armed bandits, logit transformation.
SUMMARY

Sequential selections are to be made from two stochastic processes, or "arms", each yielding Bernoulli responses. At each stage the arm selected depends on previous observations. The objective is to maximize the expected number of successes in the first n selections. The probability of success for a given selection depends on a covariate through a logistic transformation. For one arm, this transformation is completely known; for the other, it depends on an unknown parameter. Optimal strategies are developed in terms of a break-even value for the covariate: it is optimal to observe the arm with unknown parameter if the covariate is less than the break-even value. Other properties of optimal strategies are related to those for non-covariate models.
1. Introduction

A bandit problem involves sequential selections or "pulls" from a number of stochastic processes (or "arms", machines, treatments, etc.). The available processes have unknown characteristics, so learning can take place as the processes are observed. As in Bradt, Johnson, and Karlin (1956), we shall restrict our attention to the class of finite horizon Bernoulli bandits, in which the responses are Bernoulli random variables and the goal is to maximize the expected sum of the first n observations. Berry and Fristedt (1985) discuss this and many other forms of bandit models.
The bandit model has been proposed as a model for a clinical trial. The arms represent treatments, and the goal is to allocate treatments sequentially so as to maximize the expected total number of successes. However, in a typical bandit problem, the observations made on a given arm are assumed to be exchangeable (see, for example, Berry and Fristedt, 1979). In a clinical trial, this implies that all subjects receiving the same treatment have the same (marginal) probability of success. In this paper we extend this notion by supposing that, for a given arm, the probability of success for a given subject depends both on the treatment being used and on a covariate that can encode relevant characteristics affecting the chance of success. This might include such things as the general health of the subject, the age and sex of the subject, and so on.
To describe our model formally, we begin by assuming that there are two arms. Let Xi and Yi denote the results from arms 1 and 2, respectively, at stage i; for i ≤ n exactly one of the pair (Xi, Yi) is actually observed. We also assume that prior to making the ith observation we can observe a covariate Si. We assume that functions ρ and λ exist such that P(Xi = 1 | ρ(Si)) = ρ(Si) and P(Yi = 1 | λ(Si)) = λ(Si). In what follows, we shall assume that after the (i−1)st pull and prior to the ith pull, only the covariate values S_1,...,S_i are known. Informally, we obtain no information about the (i+1)st and later subjects until after the ith subject has been treated.
Subjects are to be treated sequentially, and past information can be used in deciding how to proceed. In particular, the arm selected for observation at the ith selection depends on the previous selections, the previous results, the previous covariate values, and the value of the covariate for the subject about to be treated. A decision procedure or strategy specifies which arm to select based on this information. The worth of a strategy is defined in the usual way as the expectation of the sum of the first n observations over all possible histories resulting from that strategy. A strategy is optimal if it yields the maximal expected sum. An arm is said to be optimal if it is the first selection of some optimal strategy.
Many possibilities exist for describing the relationship between ρ, λ, and the covariate. We choose a linear-logistic model:

ρ(s) = exp{α+s}/(1 + exp{α+s})  and  λ(s) = exp{c+s}/(1 + exp{c+s}).

When necessary, we shall write these as ρ(α,s) and λ(c,s), making the dependence on α and c explicit. We shall assume that the characteristics of arm 2 are known in the sense that c is a known constant. We assume that α is unknown and, following a Bayesian approach, we suppose that prior information regarding α can be given by a probability distribution R. Finally, we assume that, prior to their observation, the covariate values are unknown, and that they are i.i.d. with a known distribution function G. This implies that, while the covariate value for a given subject is unknown until the subject arrives for treatment, the distribution of possible covariate values is known. We shall use lower case letters s, t, s_1, s_2, etc. to denote observed values of the covariate. In the situation where a subject's covariate value, s, is known, but the subject has not yet been treated, we shall refer to s as the "current" covariate value.
Although we have introduced our model in a clinical trial setting, one can describe industrial and other settings where it would be equally applicable. For convenience, we shall continue to use the clinical trial setting in describing our results.

A special case arises when G is degenerate at a point. In that case, the covariate values are the same for all subjects, and the situation presented here becomes equivalent to the bandit of Bradt et al. (1956). When comparing bandit models, we shall refer to the model set out in this paper as the "covariate" bandit model, and to the model of Bradt et al. (1956) as a "standard" bandit model.
The information about arm 1 is described by the probability measure on α. Initially, this is given by R. As we make observations on arm 1, we can describe the posterior distribution of α given those observations as follows: if j successes on arm 1 have been observed when the covariate values were s_1,...,s_j, and if k failures occurred when the covariate values were t_1,...,t_k, then the posterior measure on α is given by

σ_{s_1} ··· σ_{s_j} φ_{t_1} ··· φ_{t_k} R,

where

(σ_s R)(dα) = ρ(α,s) R(dα)/E[ρ(s)|R]  and  (φ_t R)(dα) = [1 − ρ(α,t)] R(dα)/[1 − E[ρ(t)|R]].

This is an extension of the notation of Berry and Fristedt (1979, p. 1087). Note that order is immaterial here: for example, σ_s φ_t R = φ_t σ_s R. For notational convenience we shall sometimes refer to σ_{s_1} ··· σ_{s_j} φ_{t_1} ··· φ_{t_k} R as hR, h denoting the previous history of successes, failures, and their corresponding covariate values. Throughout this paper we use the notation E(·|R) to denote expectation over α with respect to the distribution R. When there can be no confusion, we omit R from the notation.
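The operators σ_s and φ_t are simple reweightings of R, and the commutativity just noted is easy to verify numerically. A sketch (ours; the two-point prior is again an assumption made for illustration):

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigma(R, s):
        # sigma_s R: posterior on alpha after a success on arm 1 at covariate s.
        w = {a: p * logistic(a + s) for a, p in R.items()}
        total = sum(w.values())
        return {a: q / total for a, q in w.items()}

    def phi(R, t):
        # phi_t R: posterior on alpha after a failure on arm 1 at covariate t.
        w = {a: p * (1.0 - logistic(a + t)) for a, p in R.items()}
        total = sum(w.values())
        return {a: q / total for a, q in w.items()}

    R = {-1.0: 0.5, 1.0: 0.5}            # a two-point prior on alpha
    left = sigma(phi(R, 0.3), -0.7)      # sigma_{-0.7} phi_{0.3} R
    right = phi(sigma(R, -0.7), 0.3)     # phi_{0.3} sigma_{-0.7} R
    assert all(abs(left[a] - right[a]) < 1e-12 for a in R)   # order is immaterial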
By an " (s,n ,R,c) bandit" we mean a bandit for which the current covarf a te value<br />
is s, the number <strong>of</strong> subjects to treat is n, the distribution on a is R, and the<br />
parameter on arm 2 is c.<br />
(We shall regard G as fixed throughout, and suppress it<br />
from the notation.) In notation similar to Berry (1972) and Clayton and Berry (19853,<br />
let w,~(s,R,c)<br />
be the worth <strong>of</strong> selecting arm i initially in the (s,n,~,c) bandit.<br />
I 2<br />
The worth <strong>of</strong> proceeding optimally in the (s,n,~,c) bandlt is rnaxt~, (s,R,c),~, (s,~,c) 1,<br />
which we denote by Wn(s,R,c).<br />
The expected worth, before s is observed, is then<br />
(1.1) jWn(s,R,c)dG(s) = W,(R,c).<br />
Note that for n 3 1 we have the usual dynamic programming equations:<br />
and<br />
(1.3)<br />
2<br />
w, (s,R,c) = i(s) + Wn-l(R,c).<br />
Together <strong>with</strong> the evident condjtlon that W~(R,C) = 0, the above equations give<br />
a recursion for determining W,<br />
(s,R,c) and W,(R,c).<br />
One further quantity that we shall use in describing optimal strategies is the difference:

(1.4)  Δ_n(s,R,c) = W_n^1(s,R,c) − W_n^2(s,R,c).

The sign of Δ_n indicates the optimal arm at any stage: if Δ_n(s,R,c) > 0 then arm 1 is optimal initially; if Δ_n(s,R,c) < 0 then arm 2 is optimal initially; and if Δ_n(s,R,c) = 0 then either arm is optimal. Suppose an optimal pull has been made. If arm 2 has been observed, and if the current covariate value is t, then we are faced with a (t,n−1,R,c) bandit, and therefore observe arm 1 or arm 2 according to the sign of Δ_{n−1}(t,R,c). If, instead, arm 1 was observed on the first pull, then at the next stage we pull arm 1 or arm 2 according to the sign of Δ_{n−1}(t,σ_s R,c) or Δ_{n−1}(t,φ_s R,c), according as a success or a failure was observed.
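When both R and G have small finite support, equations (1.1)-(1.4) give a directly computable recursion. The following brute-force sketch (ours, for illustration only) evaluates W_n and Δ_n in that case; run on the two-point setting used for Figure 1 below, it reproduces the sign pattern of Δ_n described in Section 3:

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigma(R, s):      # posterior after a success at covariate s
        w = {a: p * logistic(a + s) for a, p in R.items()}
        t = sum(w.values())
        return {a: q / t for a, q in w.items()}

    def phi(R, s):        # posterior after a failure at covariate s
        w = {a: p * (1.0 - logistic(a + s)) for a, p in R.items()}
        t = sum(w.values())
        return {a: q / t for a, q in w.items()}

    def e_rho(R, s):      # E[rho(s)|R]
        return sum(p * logistic(a + s) for a, p in R.items())

    def W(n, R, c, G):
        # Equation (1.1): worth before the covariate is observed; W_0 = 0.
        if n == 0:
            return 0.0
        return sum(q * max(W1(n, s, R, c, G), W2(n, s, R, c, G)) for s, q in G.items())

    def W1(n, s, R, c, G):
        # Equation (1.2): pull arm 1; R is updated by sigma_s or phi_s.
        p = e_rho(R, s)
        return (p * (1.0 + W(n - 1, sigma(R, s), c, G))
                + (1.0 - p) * W(n - 1, phi(R, s), c, G))

    def W2(n, s, R, c, G):
        # Equation (1.3): pull arm 2; no information about alpha is gained.
        return logistic(c + s) + W(n - 1, R, c, G)

    def Delta(n, s, R, c, G):
        # Equation (1.4): arm 1 optimal if positive, arm 2 optimal if negative.
        return W1(n, s, R, c, G) - W2(n, s, R, c, G)

    # The setting of Figure 1: P(alpha=-1) = P(alpha=1) = 1/2 = P(S=-1) = P(S=1), c = 0.
    R = {-1.0: 0.5, 1.0: 0.5}
    G = {-1.0: 0.5, 1.0: 0.5}
    for n in range(1, 7):
        print(n, [round(Delta(n, s, R, 0.0, G), 5) for s in (-3.0, -1.0, 0.0, 1.0, 3.0)])

The recursion is exponential in n, which is adequate for the small horizons treated in the examples below; a memoized, state-indexed implementation would be the natural refinement for larger n.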
The problem described here is a form of two-armed bandit with one arm known. In the special case of a standard bandit, this problem has been described as a "one-armed" bandit since it is a stopping problem: for the standard bandit an optimal strategy can always be found for which any pull of arm 2 is optimally followed by another pull of arm 2. Examples of bandits satisfying such a condition are contained in Bradt et al. (1956), Berry and Fristedt (1979), Clayton and Berry (1985), Clayton and Witmer (1987), and others. As we shall see below, such a characterization cannot be applied in general to the covariate bandit.
Although the bandit model has been discussed extensively (see Berry and Fristedt, 1985), relatively little has been written on the incorporation of covariates in bandit models. A notable exception is that of Woodroofe (1979), who studied a bandit model that incorporated covariates and yielded normally distributed observations. For that bandit model Woodroofe derived second order approximations to optimal strategies and investigated their behavior. This contribution should be regarded as important twice over: first, for introducing a covariate bandit model, and second, for a further discussion of bandits other than Bernoulli. (See Clayton and Berry (1985) for another example of a non-Bernoulli bandit.) However, as Simons (1986) has commented, results derived for normal models are difficult to apply to the Bernoulli setting.

As we shall see below, the Bernoulli covariate model behaves in a more complicated fashion than either the standard bandit or Woodroofe's Gaussian covariate model. This is due in part to the fact that, if we view the Bernoulli covariate bandit as a Markov decision problem, then an important component of the state space is R, the distribution on α. Except when G has a finite support, this implies that the state space is effectively infinite dimensional. As a consequence, the explicit determination of optimal strategies is difficult, unless n is small or the support of G is on a small number of points. A similar issue arises when a Dirichlet process prior is used in a sequential decision problem (Clayton, 1985; Clayton and Berry, 1985).
In the remainder of this paper we shall focus on properties of optimal strategies in the covariate bandit and on the relationship between the standard bandit and the covariate bandit. In Section 2 we investigate some basic monotonicity properties of the covariate bandit and discuss stopping rules. In Section 3 we discuss further the properties of optimal strategies in terms of a break-even value. Section 4 contains some further comments.
2. Properties of optimal strategies

In this section we begin to describe some of the properties of optimal strategies. As with most bandits, the current model can be seen as an attempt to reconcile two conflicting goals: (1) to obtain information about R; and (2) to maximize the chance of success for each pull. Such a conflict arises, for example, when Eρ(s) < λ(s) at a particular stage. A pull on arm 2 will have a greater chance of success for the current subject, but a pull of arm 1 may yield information about R that will have a benefit when making future pulls. On the other hand, if R is such that E[ρ(s)|hR] ≥ λ(s) for all histories h, then arm 1 will always be optimal, and likewise, if E[ρ(s)|hR] ≤ λ(s) for all h, then arm 2 will always be optimal. Such will be the case if P(α>c) = 1 or P(α<c) = 1, respectively. In general, we have the bounds:

(2.1)  n ∫ E[max{ρ(α,s), λ(s)} | R] dG(s) ≥ W_n(R,c) ≥ n ∫ max{Eρ(s), λ(s)} dG(s) ≥ n max{∫ Eρ(s) dG(s), ∫ λ(s) dG(s)}.
In words, W_n(R,c) is bounded above by what would be the expected utility if α were known at the outset. W_n(R,c) is bounded below by the worth of a strategy that, for each subject, takes the covariate into account but ignores any posterior information gained about α during the trial. Finally, this latter worth and W_n(R,c) are both bounded below by the worth of a strategy that pulls the same arm throughout the trial.
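These bounds are easy to check numerically for a small example. The snippet below (ours; the particular R, G, and n are arbitrary choices) evaluates the three quantities in (2.1) alongside the recursion sketched in Section 1, repeating the helpers so it runs on its own:

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def update(R, s, success):
        # sigma_s R on a success, phi_s R on a failure.
        w = {a: p * (logistic(a + s) if success else 1.0 - logistic(a + s))
             for a, p in R.items()}
        t = sum(w.values())
        return {a: q / t for a, q in w.items()}

    def W(n, R, c, G):
        if n == 0:
            return 0.0
        out = 0.0
        for s, q in G.items():
            p = sum(pr * logistic(a + s) for a, pr in R.items())
            w1 = (p * (1.0 + W(n - 1, update(R, s, True), c, G))
                  + (1.0 - p) * W(n - 1, update(R, s, False), c, G))
            w2 = logistic(c + s) + W(n - 1, R, c, G)
            out += q * max(w1, w2)
        return out

    n, c = 4, 0.0
    R = {-1.0: 0.5, 1.0: 0.5}
    G = {-1.0: 0.5, 1.0: 0.5}
    # Upper bound: alpha known at the outset.
    upper = n * sum(q * sum(p * max(logistic(a + s), logistic(c + s))
                            for a, p in R.items()) for s, q in G.items())
    # Middle bound: use the covariate but ignore posterior learning.
    myopic = n * sum(q * max(sum(p * logistic(a + s) for a, p in R.items()),
                             logistic(c + s)) for s, q in G.items())
    # Lowest bound: pull the same arm throughout.
    same = n * max(sum(q * sum(p * logistic(a + s) for a, p in R.items())
                       for s, q in G.items()),
                   sum(q * logistic(c + s) for s, q in G.items()))
    assert upper >= W(n, R, c, G) >= myopic >= same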
As mentioned above, the standard bandit is a stopping problem, insofar as an optimal pull of arm 2 can be followed by another optimal pull of arm 2. This need not be the case for the covariate bandit, as the following example shows.
Example 2.1: Suppose c = 0, and P(α=−1) = P(α=1) = 1/2 = P(S=−5) = P(S=5). Then it is easy, but tedious, to calculate that Δ_2(5,R,0) < 0 while Δ_1(−5,R,0) > 0. Hence, arm 2 is optimal when n = 2 and S = 5. However, arm 1 is optimal when n = 1 and S = −5. This result reflects the intuition that when the current covariate value is sufficiently small, arm 1 is more attractive since there is some potential that α will be +1. However, when S is large, arm 1 becomes less attractive, since there is a risk that α will be −1.////
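A direct computation with the recursion of Section 1 confirms these inequalities; in our own calculation Δ_2(5,R,0) ≈ −0.0035 and Δ_1(−5,R,0) ≈ +0.0035 (the rounded values are ours, not the report's). A self-contained check:

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def update(R, s, success):
        w = {a: p * (logistic(a + s) if success else 1.0 - logistic(a + s))
             for a, p in R.items()}
        t = sum(w.values())
        return {a: q / t for a, q in w.items()}

    def W(n, R, c, G):
        if n == 0:
            return 0.0
        return sum(q * max(W1(n, s, R, c, G), W2(n, s, R, c, G)) for s, q in G.items())

    def W1(n, s, R, c, G):
        p = sum(pr * logistic(a + s) for a, pr in R.items())
        return (p * (1.0 + W(n - 1, update(R, s, True), c, G))
                + (1.0 - p) * W(n - 1, update(R, s, False), c, G))

    def W2(n, s, R, c, G):
        return logistic(c + s) + W(n - 1, R, c, G)

    def Delta(n, s, R, c, G):
        return W1(n, s, R, c, G) - W2(n, s, R, c, G)

    R = {-1.0: 0.5, 1.0: 0.5}
    G = {-5.0: 0.5, 5.0: 0.5}
    print(Delta(2, 5.0, R, 0.0, G))    # approx -0.0035: arm 2 optimal at n = 2, S = 5
    print(Delta(1, -5.0, R, 0.0, G))   # approx +0.0035: arm 1 optimal at n = 1, S = -5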
Although the covariate bandit is not a stopping problem in the usual sense, it does satisfy a weak stopping rule property, as follows:
Theorem 2.1: If, for the (s,n,R,c) bandit, arm 2 is uniquely optimal for all s, then there exists an optimal strategy for which it is optimal to pull arm 2 for the (s,m,R,c) bandit, m ≤ n, for any s.

Proof: The proof is a generalization of the proof of a similar result in Bradt et al. (1956). Suppose, to the contrary, that there exists an m' < n such that arm 1 is uniquely optimal under the strategy τ for the (s',m',R,c) bandit. Consider an (s',n,R,c) bandit that follows τ for the first m' pulls, and then pulls arm 2 for the remaining n−m' pulls. This has a worth that is no less than W_n(s',R,c). But this contradicts the unique optimality of arm 2 for the (s',n,R,c) bandit.////
Another property of the standard bandit is the "stay on a winner" property: if an optimal pull of arm 1 yields a success, then it is optimal to pull arm 1 again. This need not hold for the covariate bandit, as shown in the next example.
Example 2.2: Suppose c = 0, P(α=−1) = P(α=5) = 1/2, and P(S=1) = .99 = 1 − P(S=3). Then Δ_2(3,R) = .00125 but Δ_1(3,σ_3 R) = −.00859. That is, with S = 3 and two observations to take, arm 1 is optimal. However, if a pull of arm 1 under such circumstances yields a success, a subsequent pull of arm 1 when S = 3 will not be optimal.////
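The two displayed values can be recomputed with the recursion of Section 1. The snippet below assumes the helper functions (logistic, update, W1, W2, Delta) from the sketch following Example 2.1 are already defined in the session:

    # Assumes logistic, update, W1, W2, Delta from the sketch after Example 2.1.
    R = {-1.0: 0.5, 5.0: 0.5}
    G = {1.0: 0.99, 3.0: 0.01}
    print(round(Delta(2, 3.0, R, 0.0, G), 5))                     # 0.00125
    print(round(Delta(1, 3.0, update(R, 3.0, True), 0.0, G), 5))  # -0.00859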
Although no simple stay-on-a-winner rule exists, a weak form of stay-on-a-winner does exist, as follows.

Theorem 2.2: Suppose in the (s,n,R,c) bandit that an initial pull of arm 1 is uniquely optimal and that a success obtains. Then there exists an s' in the support of G such that a pull of arm 1 is optimal for the (s',n−1,σ_s R,c) bandit.
Proof: Suppose, to the contrary, that arm 1 is not optimal for any s' in the support of G for the (s',n−1,σ_s R,c) bandit. Then by Theorem 2.1 arm 2 is optimal for the remaining n−1 pulls, and thus

W_{n−1}(σ_s R,c) = (n−1) ∫ λ(t) dG(t).

It follows from Theorem 2.3 below and equation (2.1) that

W_{n−1}(φ_s R,c) = W_{n−1}(R,c) = (n−1) ∫ λ(t) dG(t).

Moreover, since a pull of arm 2 is optimal for all pulls after the first, it follows that E[ρ(s_1)|σ_s R] ≤ λ(s_1) for any s_1 in the support of G. Finally, by equation (1.2) this implies that Δ_n(s,R,c) ≤ 0, contradicting the fact that arm 1 is uniquely optimal for the first pull.////
While the covariate bandit and the standard bandit share the stopping rule and stay-on-a-winner properties in only a weakened sense, both bandits have several monotonicity properties in common. For example, it is easy to prove by induction, using (1.2) and (1.3), that W_n(s,R,c) is nondecreasing in c. We can also develop a monotonicity result for arm 1. This contains the finite horizon version of Theorem 3.1 of Berry and Fristedt (1979) as a special case.
Definition 2.1: For any two random variables X and X' with distribution functions F and F' respectively, we say that the distribution of X' is "to the right of" the distribution of X if F(b) ≥ F'(b) for all b. As noted in Marshall and Olkin (1979), this condition is equivalent to the condition that Eg(X') ≥ Eg(X) for any nondecreasing g such that the expectations exist.////
Note that if the distribution of α' is to the right of the distribution of α, then the distribution of ρ(α',s) is to the right of the distribution of ρ(α,s), for all s. In addition, if the distribution of ρ(α',s) is to the right of the distribution of ρ(α,s) for some s, then it is easy to show that this must hold for all s, and that the distribution of α' is to the right of the distribution of α.
Definition 2.2: In an extension of a notion of Berry and Fristedt (1979), if R' and R are measures for α, we define R' to be "strongly to the right" of R if hR' is to the right of hR for all histories h.////

Given these definitions, we have the following:
Theorem 2.3: If R' is strongly to the right of R, then W_n(R',c) ≥ W_n(R,c) and, for all s, W_n(s,R',c) ≥ W_n(s,R,c).

Proof: This is immediately true by induction for W_n^2. Consider W_n^1. From (1.2),

W_n^1(s,R',c) − W_n^1(s,R,c) = [E(ρ(s)|R') − E(ρ(s)|R)][1 + W_{n−1}(σ_s R',c) − W_{n−1}(φ_s R,c)]
  + E(ρ(s)|R)[W_{n−1}(σ_s R',c) − W_{n−1}(σ_s R,c)]
  + [1 − E(ρ(s)|R')][W_{n−1}(φ_s R',c) − W_{n−1}(φ_s R,c)].

Since R' is to the right of R, E(ρ(s)|R') ≥ E(ρ(s)|R). Note that σ_s R' is strongly to the right of σ_s R, and both of these are strongly to the right of φ_s R. Also, φ_s R' is strongly to the right of φ_s R. By induction, each of the quantities in square brackets is nonnegative. The rest of the proof follows by the definition of W_n(s,R,c) and equation (1.1).////
Lemma 2.1: If s_1 < s_2 then

(i) σ_{s_1} R is to the right of σ_{s_2} R, and

(ii) φ_{s_1} R is to the right of φ_{s_2} R.

Proof: We prove part (i); the proof of part (ii) is similar. It will suffice to show that, for all b, P(α ≤ b | σ_{s_1} R) ≤ P(α ≤ b | σ_{s_2} R), or equivalently, that

(2.2)  ∫ I_{(−∞,b]}(α) ρ(α,s_2) dR(α) ∫ ρ(α',s_1) dR(α') − ∫ I_{(−∞,b]}(α) ρ(α,s_1) dR(α) ∫ ρ(α',s_2) dR(α') ≥ 0.

Writing the dependence of ρ on α explicitly, (2.2) is equivalent to

(2.3)  ∫∫ I_{(−∞,b]}(α) [ρ(α,s_2)ρ(α',s_1) − ρ(α,s_1)ρ(α',s_2)] dR(α) dR(α') ≥ 0,

where I_A is the indicator function of the set A. By Tonelli's theorem and the antisymmetry of the bracketed factor, the left side of (2.3) is

(2.4)  ∫∫ I_{(−∞,b]}(α) I_{(b,∞)}(α') [ρ(α,s_2)ρ(α',s_1) − ρ(α,s_1)ρ(α',s_2)] dR(α) dR(α').

However, if α ≤ b < α', the integrand in (2.4) is nonnegative, and if α ≤ b < α' fails, the integrand is zero.////
An immediate consequence of Lemma 2.1 and Theorem 2.3 is the following.

Theorem 2.4: For all n, R, c, t, and i, each of W_n(σ_s R,c), W_n(t,σ_s R,c), and W_n^i(t,σ_s R,c), and likewise each of W_n(φ_s R,c), W_n(t,φ_s R,c), and W_n^i(t,φ_s R,c), is nonincreasing in s.
Theorem 2.4 and parts (i)-(iv) of Proposition 2.1 below are related to the "information" obtained by a particular pull. A success observed on arm 1 when s is large is relatively uninformative, since lim_{s→∞} ρ(α,s) = 1 for any α. However, a success observed when s is large in magnitude and negative is potentially quite informative: it suggests that α itself is large. The reverse situation arises when we observe a failure on arm 1. Bounds on the "information" available in a pull are given by parts (ii) and (iii) of the Proposition. Part (v) of Proposition 2.1 suggests that as the current covariate value grows large in magnitude, we become indifferent to the choice of arms for the next pull.
Proposition 2.1: Define σ_{−∞} R by d(σ_{−∞} R)/dR = e^α/E(e^α|R) and φ_{∞} R by d(φ_{∞} R)/dR = e^{−α}/E(e^{−α}|R). For all n, for i = 1 and 2, for all R, and for all c:

(i) lim_{s→∞} W_n(σ_s R,c) = lim_{s→−∞} W_n(φ_s R,c) = W_n(R,c);

(ii) lim_{s→−∞} W_n(σ_s R,c) = W_n(σ_{−∞} R,c);

(iii) lim_{s→∞} W_n(φ_s R,c) = W_n(φ_{∞} R,c);

(iv) lim_{s→∞} W_n^i(t,σ_s R,c) = lim_{s→−∞} W_n^i(t,φ_s R,c) = W_n^i(t,R,c), and similarly for W_n(t,·,c);

(v) lim_{t→∞} [W_n^1(t,R,c) − W_n^2(t,R,c)] = lim_{t→−∞} [W_n^1(t,R,c) − W_n^2(t,R,c)] = 0.
Proof: We prove some parts of the proposition; the remainder follow similarly. First we prove the first half of part (iv), using induction. The result is easy when n = 1. Note that

W_n^1(t,σ_s R,c) = E[ρ(t)|σ_s R] + E[ρ(t)|σ_s R] W_{n−1}(σ_t σ_s R,c) + [1 − E[ρ(t)|σ_s R]] W_{n−1}(φ_t σ_s R,c).

By the induction hypothesis, lim_{s→∞} W_{n−1}(σ_t σ_s R,c) = W_{n−1}(σ_t R,c). Also, it is easy to show that lim_{s→∞} E[ρ(t)|σ_s R] = Eρ(t). So lim_{s→∞} W_n^1(t,σ_s R,c) = W_n^1(t,R,c).

To prove part (i), use part (iv), noting that

lim_{s→∞} W_n(σ_s R,c) = lim_{s→∞} ∫ W_n(t,σ_s R,c) dG(t) = ∫ W_n(t,R,c) dG(t) = W_n(R,c).

The second equality above follows from the dominated convergence theorem and equation (2.1).

To prove part (v), note that

lim_{t→∞} W_n^1(t,R,c) = lim_{t→∞} {Eρ(t) + Eρ(t) W_{n−1}(σ_t R,c) + [1 − Eρ(t)] W_{n−1}(φ_t R,c)} = 1 + W_{n−1}(R,c) = lim_{t→∞} W_n^2(t,R,c),

using parts (i) and (iv); the limit as t → −∞ is handled similarly.////
3. The function Δ_n

As mentioned above, the function Δ_n can be used to determine optimal strategies. As must also be evident, the determination of Δ_n is nontrivial. In this section we explore certain properties of Δ_n and discuss their implications for determining properties of the optimal strategy.

Note that it is useful and proper to consider Δ_n as a function of s for all real s, even though the support of G might be on some proper subset of the reals. Of course, in using Δ_n to determine optimal strategies, attention will be restricted to those s in the support of G.

It is easy to show, by induction, using (1.2), (1.3), and the definition of Δ_n in (1.4), that Δ_n is a continuous function of s and c. From Proposition 2.1(v) it is evident that lim_{s→±∞} Δ_n(s,R,c) = 0. This fact is illustrated in Figure 1, where Δ_n is plotted as a function of s for n = 1,...,6 and for R and G such that P(α=−1) = P(α=1) = 1/2 = P(S=−1) = P(S=1). We note from Figure 1 that Δ_n has at most one root in s. We now set about a proof of that fact for the case n = 1, and derive a weaker result when n ≥ 2.
Theorem 3.1: If R is not degenerate at a point, then Δ_1(s,R,c) has at most one root in s.

Remark: If R is degenerate at c, then Δ_1(s,R,c) = 0 for all s. If R is degenerate at a point other than c, then Δ_1(s,R,c) has no roots in s.
Proof: Without loss of generality, we can assume c = 0. Let s_1 < s_2. We show that: (a) if Δ_1(s_2,R,0) ≥ 0 then Δ_1(s_1,R,0) > 0. A similar approach shows that: (b) if Δ_1(s_1,R,0) < 0 then Δ_1(s_2,R,0) < 0. Parts (a) and (b) complete the proof.

To proceed with (a), note first that Δ_1(s,R,0) = λ(s)d(s), where d(s) = E[(e^α − 1)/(1 + e^{α+s}) | R]. Since λ(s) > 0 for all s, it will suffice to show that Δ_1(s_2,R,0) ≥ 0 implies d(s_1) − d(s_2) > 0. Next, note that (e^α − 1)/(1 + e^{α+s_2}) is increasing in α and σ_{s_1} R is to the right of R; since R is not degenerate at a point, it follows that

E[(e^α − 1)/(1 + e^{α+s_2}) | σ_{s_1} R] − E[(e^α − 1)/(1 + e^{α+s_2}) | R] > 0.

Finally, some algebra shows that

d(s_1) − d(s_2) = (e^{s_2−s_1} − 1) E[ρ(s_1)|R] Δ_1(s_2, σ_{s_1} R, 0)/λ(s_2).

Consequently, Δ_1(s_2,R,0) ≥ 0 implies Δ_1(s_2,σ_{s_1} R,0) > 0, which in turn implies d(s_1) − d(s_2) > 0, as required.////
The result of Theorem 3.1 may be restated in an alternative form, which we note as a Corollary.

Corollary 3.1: In the (s_1,1,R,c) bandit, there exists a quantity Σ_1 = Σ_1(R,c) in [−∞,∞] such that a pull of arm 1 is optimal if s_1 < Σ_1; a pull of arm 2 is optimal if s_1 > Σ_1; and either is optimal if s_1 = Σ_1.

Remark: If Σ_1 = +∞, then a pull of arm 1 is optimal for all s, and if Σ_1 = −∞ then a pull of arm 2 is optimal for all s. So, for example, if P(α>c) = 1 then Σ_1 = +∞.
Theorem 3.2: If R' is strongly to the right of R, then Δ_1(s,R',c) ≥ Δ_1(s,R,c) and Σ_1(R',c) ≥ Σ_1(R,c). Also, Δ_1(s,R,c) is decreasing in c, and Σ_1(R,c) is nonincreasing in c.

Proof: This is an immediate consequence of Theorem 2.3 and the definitions of Δ_1 and Σ_1.////
We conjecture that, for all n, there exists a quantity with properties similar to Σ_1; namely, in the (s_n,n,R,c) bandit, it is optimal to pull arm 1 if s_n < Σ_n(R,c) and it is optimal to pull arm 2 if s_n > Σ_n(R,c). In this sense Σ_n(R,c) would be a "break-even value" for the covariate. If Σ_n were to exist for n ≥ 2, then a complete determination of an optimal strategy could be given in terms of Σ_n. Our next result and its corollary are partial results in that direction.
Theorem 3.3: For all n, Δ_n(s,R,c) ≥ Δ_1(s,R,c).

Proof: From (1.2), (1.3), and (1.4),

(3.1)  Δ_n(s,R,c) − Δ_1(s,R,c) = Eρ(s) W_{n−1}(σ_s R,c) + [1 − Eρ(s)] W_{n−1}(φ_s R,c) − W_{n−1}(R,c).

The right side of (3.1) may be written as

(3.2)  ∫ {Eρ(s) W_{n−1}(t,σ_s R,c) + [1 − Eρ(s)] W_{n−1}(t,φ_s R,c) − W_{n−1}(t,R,c)} dG(t).

We show that the integrand in (3.2) is always nonnegative by induction. For n−1 = 1 there are two cases. If W_1(t,R,c) = λ(t), then the result is clear, since W_1(t,σ_s R,c) ≥ λ(t) and W_1(t,φ_s R,c) ≥ λ(t). If W_1(t,R,c) = Eρ(t), then we note that W_1(t,σ_s R,c) ≥ E[ρ(t)|σ_s R] and W_1(t,φ_s R,c) ≥ E[ρ(t)|φ_s R], whence the integrand in (3.2) is bounded below by

Eρ(s) E[ρ(t)|σ_s R] + [1 − Eρ(s)] E[ρ(t)|φ_s R] − Eρ(t) = 0.

For the induction step we assume that the integrand in (3.2) is nonnegative for n−1 = m. Again, we distinguish two cases. First, when W_{m+1}(t,R,c) = W_{m+1}^2(t,R,c), then we have

Eρ(s) W_{m+1}(t,σ_s R,c) + [1 − Eρ(s)] W_{m+1}(t,φ_s R,c) ≥ λ(t) + Eρ(s) W_m(σ_s R,c) + [1 − Eρ(s)] W_m(φ_s R,c) ≥ λ(t) + W_m(R,c) = W_{m+1}(t,R,c).

The first inequality above follows by the definition of W_{m+1}. The second inequality follows by the induction hypothesis.

In the second case, if W_{m+1}(t,R,c) = W_{m+1}^1(t,R,c), then following (1.2), the integrand in (3.2) is bounded below by the sum of three components, A, B, and C, say, where

A = Eρ(s) E[ρ(t)|σ_s R] + [1 − Eρ(s)] E[ρ(t)|φ_s R] − Eρ(t),

B = Eρ(s) E[ρ(t)|σ_s R] W_m(σ_t σ_s R,c) + [1 − Eρ(s)] E[ρ(t)|φ_s R] W_m(σ_t φ_s R,c) − Eρ(t) W_m(σ_t R,c),

and

C = Eρ(s)[1 − E[ρ(t)|σ_s R]] W_m(φ_t σ_s R,c) + [1 − Eρ(s)][1 − E[ρ(t)|φ_s R]] W_m(φ_t φ_s R,c) − [1 − Eρ(t)] W_m(φ_t R,c).

Some algebra shows that A = 0. Consider the expression B. We may write

B = Eρ(t) {E[ρ(s)|σ_t R] W_m(σ_s σ_t R,c) + [1 − E[ρ(s)|σ_t R]] W_m(φ_s σ_t R,c) − W_m(σ_t R,c)},

which is nonnegative by the induction hypothesis. Likewise, the induction hypothesis may be used to prove that C ≥ 0, and the theorem now follows.////
Corollary 3.2: For any n, R, c, there exists a Σ'_n(R,c) such that if s_n < Σ'_n(R,c) then arm 1 is optimal in the (s_n,n,R,c) bandit. Moreover, we may take Σ'_n(R,c) = Σ_1(R,c), since by Theorem 3.3 and Corollary 3.1, s_n < Σ_1(R,c) implies Δ_n(s_n,R,c) ≥ Δ_1(s_n,R,c) > 0.

Let Σ*_n(R,c) be the largest Σ'_n(R,c) such that the property given in Corollary 3.2 holds. (In particular, Σ*_1(R,c) = Σ'_1(R,c) = Σ_1(R,c).) Then Σ*_n(R,c) is a weak form of "break-even value" for the covariate s_n, in the following sense: if s_n < Σ*_n(R,c) then arm 1 is optimal in the (s_n,n,R,c) bandit. Noting that ρ(s) and λ(s) are increasing in s, this says that, in terms of a clinical trial, when subjects have covariates such that the probability of success is small, then it is better to use the experimental, unknown treatment (arm 1). Our conjecture that Σ_n(R,c) exists for all n is equivalent to the conjecture that s_n > Σ*_n(R,c) implies that arm 2 is optimal in the (s_n,n,R,c) bandit. This certainly holds for the example in Figure 1, and we have shown it to hold in other examples not provided here. The proof of such a result in general seems particularly elusive, however. The following example represents a partial result regarding the existence of Σ_n(R,c) for n > 1.
Example 3.2: Suppose c = 0, P(α=1) = p = 1 − P(α=−1), and P(S=−1) = P(S=1) = 1/2. We demonstrate the existence of Σ_2 in this case. For brevity, we suppress c from the notation. It will suffice to show that if W_2(−1,R) = W_2^2(−1,R) then W_2(1,R) = W_2^2(1,R), and if W_2(1,R) = W_2^1(1,R) then W_2(−1,R) = W_2^1(−1,R). We prove these simultaneously by contradiction.

Specifically, suppose that W_2(−1,R) = W_2^2(−1,R) and that W_2(1,R) = W_2^1(1,R). If W_2(−1,R) = W_2^2(−1,R) then it follows that Δ_2(−1,R) ≤ 0, whence Δ_1(−1,R) ≤ 0 by Theorem 3.3. But, by Theorem 3.2, Δ_1(−1,R) ≤ 0 implies that Δ_1(−1,φ_{−1} R) < 0 and Δ_1(−1,φ_1 R) < 0. From Theorem 3.1, we thus have Δ_1(1,R) < 0, Δ_1(1,φ_{−1} R) < 0, and Δ_1(1,φ_1 R) < 0. Since W_2(1,R) = W_2^1(1,R) by assumption, it must be that Δ_1(−1,σ_1 R) > 0. There are now two cases: Δ_1(1,σ_1 R) ≥ 0 and Δ_1(1,σ_1 R) < 0. The second case can be shown to be impossible. Therefore,

W_2^1(1,R) = Eρ(1) + (1/2)[λ(1) + λ(−1)] + (1/2) E[ρ(1){ρ(1) − λ(1) + ρ(−1) − λ(−1)}].

On the other hand, W_2^2(1,R) = λ(1) + (1/2)λ(1) + (1/2)λ(−1), whence

Δ_2(1,R) = Δ_1(1,R) + (1/2) E[ρ(1){ρ(1) − λ(1) + ρ(−1) − λ(−1)}].

Similarly, W_2^2(−1,R) = λ(−1) + (1/2)λ(1) + (1/2)λ(−1) and

Δ_2(−1,R) = Δ_1(−1,R) + (1/2) E[ρ(−1){ρ(1) − λ(1) + ρ(−1) − λ(−1)}],

since, by Lemma 2.1 and Theorem 3.2, Δ_1(s,σ_{−1} R) ≥ Δ_1(s,σ_1 R) ≥ 0. Some tedious algebra shows that

Δ_2(1,R) > 0 iff p > .5883,

while

Δ_2(−1,R) > 0 iff p > .5694.

Since the supposition requires Δ_2(1,R) ≥ 0 and Δ_2(−1,R) ≤ 0 simultaneously, this leads to the sought-after contradiction.////
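The displayed thresholds come from the case expressions used in the argument. Independently of that case analysis, the conclusion can be checked by direct computation: sweeping p and verifying that no p makes arm 1 strictly optimal at s = 1 while arm 2 is strictly optimal at s = −1 when n = 2, which is exactly the pattern a break-even value Σ_2 rules out. A sketch (ours, repeating the recursion of Section 1 so it runs on its own):

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def update(R, s, success):
        w = {a: pr * (logistic(a + s) if success else 1.0 - logistic(a + s))
             for a, pr in R.items()}
        t = sum(w.values())
        return {a: q / t for a, q in w.items()}

    def W(n, R, c, G):
        if n == 0:
            return 0.0
        return sum(q * max(W1(n, s, R, c, G), W2(n, s, R, c, G)) for s, q in G.items())

    def W1(n, s, R, c, G):
        p = sum(pr * logistic(a + s) for a, pr in R.items())
        return (p * (1.0 + W(n - 1, update(R, s, True), c, G))
                + (1.0 - p) * W(n - 1, update(R, s, False), c, G))

    def W2(n, s, R, c, G):
        return logistic(c + s) + W(n - 1, R, c, G)

    G = {-1.0: 0.5, 1.0: 0.5}
    for k in range(1, 100):
        p = k / 100.0
        R = {1.0: p, -1.0: 1.0 - p}
        d_minus = W1(2, -1.0, R, 0.0, G) - W2(2, -1.0, R, 0.0, G)
        d_plus = W1(2, 1.0, R, 0.0, G) - W2(2, 1.0, R, 0.0, G)
        # Sigma_2 exists unless arm 1 is strictly optimal at s = 1 while
        # arm 2 is strictly optimal at s = -1.
        assert not (d_plus > 0 and d_minus < 0)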
4. Additional Comments

As mentioned, Corollaries 3.1 and 3.2 are directed toward the description of optimal strategies in terms of a break-even value for the current covariate value. In noncovariate bandit treatments, optimal policies have been described in terms of a break-even value for arm 2. (See, for example, Bradt et al., 1956, Berry and Fristedt, 1979, Clayton and Berry, 1985, and Berry and Fristedt, 1985. Gittins and Jones, 1974, have used such indexes in describing multiarmed bandits.) A comparable break-even value for arm 2 of the covariate bandit is a quantity C_n(s,R) which would characterize optimal strategies as follows: if c < C_n(s,R) then a pull of arm 1 is optimal, if c > C_n(s,R) then a pull of arm 2 is optimal, and if c = C_n(s,R) then either arm is optimal. We conjecture that such a quantity C_n(s,R) exists for all n, s, and R. Indeed, it is easy to see that C_1(s,R) exists for all s and R; it is the root in c of the equation Δ_1(s,R,c) = 0. The next result describes a situation in which C_2(s,R) exists:

Proposition 4.1: If Δ_1(s,R,c) − ∫ Δ_1(t,R,c) dG(t) is decreasing in c, then C_2(s,R) exists.
Remark: The hypothesis of the proposition is equivalent to the requirement that

∫ λ(t,c)[1 − λ(t,c)] dG(t) < λ(s,c)[1 − λ(s,c)].

Proof: We prove that, under the hypothesis of the proposition, Δ_2(s,R,c) is decreasing in c. The existence of C_2(s,R) as a root in c of Δ_2(s,R,c) = 0 then follows from the fact that lim_{c→−∞} Δ_2(s,R,c) = Eρ(s) and lim_{c→∞} Δ_2(s,R,c) = Eρ(s) − 1. If we let Δ_1^+ = max{Δ_1,0} and Δ_1^− = −min{Δ_1,0}, then (1.1), (1.2), (1.3), and (1.4) can be used to show that

Δ_2(s,R,c) = [Δ_1(s,R,c) − ∫ Δ_1(t,R,c) dG(t)] + Eρ(s) ∫ Δ_1^+(t,σ_s R,c) dG(t) + [1 − Eρ(s)] ∫ Δ_1^+(t,φ_s R,c) dG(t) − ∫ Δ_1^−(t,R,c) dG(t).

The desired result now follows from the fact that Δ_1(t,R,c) is decreasing in c for all R.////
The method of proof used for Proposition 4.1 can be generalized to show that if Δ_1(s,R,c) − n ∫ Δ_1(t,R,c) dG(t) is decreasing in c, and if Δ_m(s,R,c) is decreasing in c for all s, R, and m < n, then C_n(s,R) exists.
The following example shows that an index for s may not exist for other forms of covariate model.

Example 4.1: Suppose λ(s) = e^s/(1+e^s) and ρ(s) = e^{βs}/(1+e^{βs}), with P(β=0) = 1/2 = P(β=10). Then Eρ(s) − λ(s) is positive, and so arm 1 is favored, both for s sufficiently negative and for s in an interval just to the right of 0; it is negative for s in an interval just to the left of 0 and for all s sufficiently large. Thus Δ_1 changes sign more than once in s, and no break-even value for the covariate of the kind given in Corollary 3.1 can exist.////
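The sign changes in Example 4.1 are easy to exhibit; the following check (ours) evaluates Eρ(s) − λ(s) at four points and shows the pattern +, −, +, −, so that Δ_1 has more than one root in s:

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def myopic_gap(s):
        # E[rho(s)] - lambda(s) for Example 4.1: P(beta=0) = P(beta=10) = 1/2.
        return 0.5 * logistic(0.0) + 0.5 * logistic(10.0 * s) - logistic(s)

    for s in (-2.0, -0.5, 0.5, 2.0):
        print(s, round(myopic_gap(s), 4))   # signs: +, -, +, -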
REFERENCES

Berry, D.A. (1972). A Bernoulli two-armed bandit. Ann. Math. Statist. 43, 871-897.

Berry, D.A. and Fristedt, B. (1979). Bernoulli one-armed bandits - arbitrary discount sequences. Ann. Statist. 7, 1086-1105.

Berry, D.A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, New York.

Bradt, R.N., Johnson, S.M. and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Ann. Math. Statist. 27, 1060-1070.

Clayton, M.K. (1985). A Bayesian nonparametric sequential test for the mean of a population. Ann. Statist. 13, 1129-1139.

Clayton, M.K. and Berry, D.A. (1985). Bayesian nonparametric bandits. Ann. Statist. 13, 1523-1534.

Clayton, M.K. and Witmer, J.A. (1987). Two-stage bandits. University of Wisconsin-Madison, Department of Statistics Technical Report.

Gittins, J.C. and Jones, D.M. (1974). A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (eds. J. Gani et al.), pp. 241-266. North-Holland, Amsterdam.

Marshall, A.W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Academic Press, New York.

Simons, G. (1986). Bayes rules for a clinical-trials model with dichotomous responses. Ann. Statist. 14, 954-970.

Woodroofe, M. (1979). A one-armed bandit problem with a concomitant variable. J. Amer. Statist. Assoc. 74, 799-806.
Department of Statistics
University of Wisconsin
1210 W. Dayton St.
Madison, WI 53706
Figure 1. Δ_n(s,R,c) for various n and s. Here P(α=−1) = P(α=1) = 0.5 = P(S=−1) = P(S=1).