
DEPARTMENT OF STATISTICS
University of Wisconsin
1210 W. Dayton St.
Madison, WI 53706

TECHNICAL REPORT NO. 804

March 1987

Bernoulli bandits with covariates

Murray K. Clayton¹

University of Wisconsin

¹Research supported in part by U.S. Army Research Office Grant DAAG29-80-C-0041 and University of Wisconsin Graduate School Grant 160701.

AMS 1980 subject classifications. Primary 62L05; secondary 62L15.

Keywords and phrases: Sequential decisions, one-armed bandits, two-armed bandits, logit transformation.


SUMMARY

Sequential selections are to be made from two stochastic processes, or "arms", each yielding Bernoulli responses. At each stage the arm selected depends on previous observations. The objective is to maximize the expected number of successes in the first n selections. The probability of success for a given selection depends on a covariate through a logistic transformation. For one arm, this transformation is completely known; for the other, it depends on an unknown parameter. Optimal strategies are developed in terms of a break-even value for the covariate: it is optimal to observe the arm with unknown parameter if the covariate is less than the break-even value. Other properties of optimal strategies are related to those for non-covariate models.


1. Introduction

A bandit problem involves sequential selections or "pulls" from a number of stochastic processes (or "arms", machines, treatments, etc.). The available processes have unknown characteristics, so learning can take place as the processes are observed. As in Bradt, Johnson, and Karlin (1956), we shall restrict our attention to the class of finite horizon Bernoulli bandits, in which the responses are Bernoulli random variables and the goal is to maximize the expected sum of the first n observations. Berry and Fristedt (1985) discuss this and many other forms of bandit models.

The bandit model has been proposed as a model for a clinical trial. The arms represent treatments, and the goal is to allocate treatments sequentially so as to maximize the expected total number of successes. However, in a typical bandit problem, the observations made on a given arm are assumed to be exchangeable (see, for example, Berry and Fristedt, 1979). In a clinical trial, this implies that all subjects receiving the same treatment have the same (marginal) probability of success. In this paper we extend this notion by supposing that, for a given arm, the probability of success for a given subject depends both on the treatment being used and on a covariate that can encode relevant characteristics affecting the chance of success. This might include such things as the general health of the subject, the age and sex of the subject, and so on.

To describe our model formally, we begin by assuming that there are two arms. Let X_i and Y_i denote the results from arms 1 and 2 respectively, at stage i; for i ≤ n exactly one of the pair (X_i, Y_i) is actually observed. We also assume that prior to making the ith observation we can observe a covariate S_i.


We assume that functions ρ and λ exist such that P(X_i = 1 | ρ(S_i)) = ρ(S_i) and P(Y_i = 1 | λ(S_i)) = λ(S_i). In what follows, we shall assume that after the (i-1)st pull and prior to the ith pull, only the covariate values S_1,...,S_i are known. Informally, we obtain no information about the (i+1)st and later subjects until after the ith subject has been treated. Subjects are to be treated sequentially, and past information can be used in deciding how to proceed.

In particular, the arm selected for observation at the ith selection depends on the previous selections, the previous results, the previous covariate values, and the value of the covariate for the subject about to be treated. A decision procedure or strategy specifies which arm to select based on this information. The worth of a strategy is defined in the usual way as the expectation of the sum of the first n observations over all possible histories resulting from that strategy. A strategy is optimal if it yields the maximal expected sum. An arm is said to be optimal if it is the first selection of some optimal strategy.

Many possibilities exist for describing the relationship between ρ, λ, and the covariate. We choose a linear-logistic model: ρ(s) = exp{a+s}/(1 + exp{a+s}) and λ(s) = exp{c+s}/(1 + exp{c+s}). When necessary, we shall write these as ρ(a,s) and λ(c,s), making the dependence on a and c explicit. We shall assume that the characteristics of arm 2 are known in the sense that c is a known constant. We assume that a is unknown and, following a Bayesian approach, we suppose that prior information regarding a can be given by a probability distribution R. Finally, we assume that, prior to their observation, the covariate values are unknown, and that they are i.i.d. with a known distribution function G. This implies that, while the covariate value for a given subject is unknown until the subject arrives for treatment, the distribution of possible covariate values is known. We shall use lower case letters s, t, s_1, s_2, etc. to denote observed values of the covariate. In the situation where a subject's covariate value, s, is known, but the subject has not yet been treated, we shall refer to s as the "current" covariate value.
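The two success curves are thus ordinary logistic functions of the covariate, shifted by a and c respectively. For concreteness, here is a minimal sketch of the model in Python (the function names rho and lam are ours, not the report's):

```python
import math

def rho(a, s):
    # Arm 1 success probability: exp{a+s} / (1 + exp{a+s}); a is unknown.
    return 1.0 / (1.0 + math.exp(-(a + s)))

def lam(c, s):
    # Arm 2 success probability: exp{c+s} / (1 + exp{c+s}); c is known.
    return 1.0 / (1.0 + math.exp(-(c + s)))
```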

Although we have introduced our model in a clinical trial setting, one can describe industrial and other settings where it would be equally applicable. For convenience, we shall continue to use the clinical trial setting in describing our results.

A special case arises when G is degenerate at a point. In that case, the covariate values are the same for all subjects, and the situation presented here becomes equivalent to the bandit of Bradt et al. (1956). When comparing bandit models, we shall refer to the model set out in this paper as the "covariate" bandit model, and to the model of Bradt et al. (1956) as a "standard" bandit model.

The information about arm 1 is described by the probability measure on a. Initially, this is given by R. As we make observations on arm 1, we can describe the posterior distribution of a given those observations as follows: if successes on arm 1 have been observed when the covariate values were s_1,...,s_j, and if failures occurred when the covariate values were t_1,...,t_k, then the posterior measure on a is given by σ_{s_1}···σ_{s_j} φ_{t_1}···φ_{t_k} R, where

    dσ_s R / dR (a) = ρ(a,s) / E[ρ(s)|R]   and   dφ_t R / dR (a) = [1 - ρ(a,t)] / [1 - E[ρ(t)|R]].

This is an extension of the notation of Berry and Fristedt (1979, p. 1087). Note that order is immaterial here: for example, σ_s φ_t R = φ_t σ_s R. For notational convenience we shall sometimes refer to σ_{s_1}···σ_{s_j} φ_{t_1}···φ_{t_k} R as hR, h denoting the previous history of successes, failures, and their corresponding covariate values.
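For a prior R supported on finitely many points, the updates σ_s and φ_t amount to a direct application of Bayes' rule. A minimal sketch (the dict-based representation and helper names are ours):

```python
def success_update(R, s):
    # sigma_s R: reweight the prior mass at each a by rho(a, s), then normalize.
    w = {a: m * rho(a, s) for a, m in R.items()}
    z = sum(w.values())                      # z = E[rho(s) | R]
    return {a: v / z for a, v in w.items()}

def failure_update(R, s):
    # phi_s R: reweight the prior mass at each a by 1 - rho(a, s), then normalize.
    w = {a: m * (1.0 - rho(a, s)) for a, m in R.items()}
    z = sum(w.values())                      # z = 1 - E[rho(s) | R]
    return {a: v / z for a, v in w.items()}
```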

Throughout this paper we use the notation E(·|R) to denote expectation over a with respect to the distribution R. When there can be no confusion, we omit R from the notation.

By an "(s,n,R,c) bandit" we mean a bandit for which the current covariate value is s, the number of subjects to treat is n, the distribution on a is R, and the parameter on arm 2 is c. (We shall regard G as fixed throughout, and suppress it from the notation.) In notation similar to Berry (1972) and Clayton and Berry (1985), let W_n^i(s,R,c) be the worth of selecting arm i initially in the (s,n,R,c) bandit. The worth of proceeding optimally in the (s,n,R,c) bandit is max{W_n^1(s,R,c), W_n^2(s,R,c)}, which we denote by W_n(s,R,c). The expected worth, before s is observed, is then

(1.1)    ∫ W_n(s,R,c) dG(s) = W_n(R,c).

Note that for n ≥ 1 we have the usual dynamic programming equations:

(1.2)    W_n^1(s,R,c) = Eρ(s) + Eρ(s) W_{n-1}(σ_s R, c) + [1 - Eρ(s)] W_{n-1}(φ_s R, c)

and

(1.3)    W_n^2(s,R,c) = λ(s) + W_{n-1}(R,c).

Together with the evident condition that W_0(R,c) = 0, the above equations give a recursion for determining W_n(s,R,c) and W_n(R,c).
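When G has finite support and R is supported on finitely many points, the recursion can be evaluated directly. The following sketch (ours, assuming the reconstructed equations above; G is a list of (s, probability) pairs) mirrors (1.1)-(1.3). The direct evaluation is exponential in n and is meant only to make the recursion concrete:

```python
def E_rho(R, s):
    # E[rho(s) | R] for a finitely supported prior R.
    return sum(m * rho(a, s) for a, m in R.items())

def W_bar(n, R, c, G):
    # W_n(R, c) of equation (1.1): average W_n(s, R, c) over the covariate.
    if n == 0:
        return 0.0
    return sum(p * W(n, s, R, c, G) for s, p in G)

def W1(n, s, R, c, G):
    # Equation (1.2): pull arm 1, collect E[rho(s)], update R by the outcome.
    q = E_rho(R, s)
    return q * (1.0 + W_bar(n - 1, success_update(R, s), c, G)) \
        + (1.0 - q) * W_bar(n - 1, failure_update(R, s), c, G)

def W2(n, s, R, c, G):
    # Equation (1.3): pull the known arm 2; R is unchanged by the outcome.
    return lam(c, s) + W_bar(n - 1, R, c, G)

def W(n, s, R, c, G):
    # W_n(s, R, c): proceed optimally from the current covariate value s.
    return max(W1(n, s, R, c, G), W2(n, s, R, c, G))
```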

One further quantity that we shall use in describing optimal strategies is the difference:

(1.4)    Δ_n(s,R,c) = W_n^1(s,R,c) - W_n^2(s,R,c).

The sign of Δ_n indicates the optimal arm at any stage: if Δ_n(s,R,c) > 0 then arm 1 is optimal initially; if Δ_n(s,R,c) < 0 then arm 2 is optimal initially; and if Δ_n(s,R,c) = 0 then either arm is optimal. Suppose an optimal pull has been made. If arm 2 has been observed, and if the current covariate value is t, then we are faced with a (t,n-1,R,c) bandit, and therefore observe arm 1 or arm 2 according to the sign of Δ_{n-1}(t,R,c). If, instead, arm 1 was observed on the first pull, then at the next stage we pull arm 1 or arm 2 according to the sign of Δ_{n-1}(t,σ_s R,c) or Δ_{n-1}(t,φ_s R,c), depending on whether a success or a failure was observed.
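In terms of the sketch above, this decision rule is immediate (again ours, for illustration):

```python
def Delta(n, s, R, c, G):
    # Delta_n(s, R, c) of (1.4): positive favors arm 1, negative favors arm 2.
    return W1(n, s, R, c, G) - W2(n, s, R, c, G)

def optimal_arm(n, s, R, c, G):
    d = Delta(n, s, R, c, G)
    return 1 if d > 0 else 2   # when d == 0 either arm is optimal
```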

The problem described here is a form of two-armed bandit with one arm known. In the special case of a standard bandit, this problem has been described as a "one-armed" bandit since it is a stopping problem: for the standard bandit an optimal strategy can always be found for which any pull of arm 2 is optimally followed by another pull of arm 2. Examples of bandits satisfying such a condition are contained in Bradt et al. (1956), Berry and Fristedt (1979), Clayton and Berry (1985), Clayton and Witmer (1987) and others. As we shall see below, such a characterization cannot be applied in general to the covariate bandit.

Although the bandit model has been discussed extensively (see Berry and Fristedt, 1985), relatively little has been written on the incorporation of covariates in bandit models. A notable exception is that of Woodroofe (1979), who studied a bandit model that incorporated covariates and yielded normally distributed observations. For that bandit model Woodroofe derived second order approximations to optimal strategies and investigated their behavior. This contribution should be regarded as important twice over: first, for introducing a covariate bandit model, and second, for a further discussion of bandits other than Bernoulli. (See Clayton and Berry (1985) for another example of a non-Bernoulli bandit.) However, as Simons (1986) has commented, results derived for normal models are difficult to apply to the Bernoulli setting.

As we shall see below, the Bernoulli covariate model behaves in a more complicated fashion than either the standard bandit or Woodroofe's Gaussian covariate model. This is due in part to the fact that, if we view the Bernoulli covariate bandit as a Markov decision problem, then an important component of the state space is R, the distribution on a. Except when G has a finite support, this implies that the state space is effectively infinite dimensional. As a consequence, the explicit determination of optimal strategies is difficult unless n is small or the support of G is on a small number of points. A similar issue arises when a Dirichlet process prior is used in a sequential decision problem (Clayton, 1985; Clayton and Berry, 1985).

In the remainder of this paper we shall focus on properties of optimal strategies in the covariate bandit and on the relationship between the standard bandit and the covariate bandit. In Section 2 we investigate some basic monotonicity properties of the covariate bandit and discuss stopping rules. In Section 3 we discuss further the properties of optimal strategies in terms of a break-even value. Section 4 contains some further comments.

2. Properties of optimal strategies

In this section we begin to describe some of the properties of optimal strategies. As with most bandits, the current model can be seen as an attempt to reconcile two conflicting goals: (1) to obtain information about R; and (2) to maximize the chance of success for each pull. Such a conflict arises, for example, when Eρ(s) < λ(s) at a particular stage. A pull on arm 2 will have a greater chance of success for the current subject, but a pull of arm 1 may yield information about R that will have a benefit when making future pulls. On the other hand, if R is such that E[ρ(s)|hR] ≥ λ(s) for all histories h, then arm 1 will always be optimal; likewise, if E[ρ(s)|hR] ≤ λ(s) for all h, then arm 2 will always be optimal. These cases arise, for example, when P(a > c) = 1 or P(a < c) = 1, respectively.


It is straightforward to verify the bounds

(2.1)    n ∫ E[max{ρ(a,t), λ(t)}] dG(t) ≥ W_n(R,c) ≥ n ∫ max{Eρ(t), λ(t)} dG(t) ≥ n max{∫ Eρ(t) dG(t), ∫ λ(t) dG(t)}.

In words, W_n(R,c) is bounded above by what would be the expected utility if a were known at the outset. W_n(R,c) is bounded below by the worth of a strategy that, for each subject, takes the covariate into account but ignores any posterior information gained about a during the trial. Finally, this latter worth and W_n(R,c) are both bounded below by the worth of a strategy that pulls the same arm throughout the trial.

As mentioned above, the standard bandit is a stopping problem, insofar as an optimal pull of arm 2 can be followed by another optimal pull of arm 2. This need not be the case for the covariate bandit, as the following example shows.

Example 2.1: Suppose c = 0, and P(a=-1) = P(a=1) = 1/2 = P(S=-5) = P(S=5). Then it is easy, but tedious, to calculate that Δ_2(5,R,0) < 0 while Δ_1(-5,R,0) > 0. Hence, arm 2 is optimal when n = 2 and S = +5. However, arm 1 is optimal when n = 1 and S = -5. This result reflects the intuition that when the current covariate value is sufficiently small, arm 1 is more attractive, since there is some potential that a will be +1. However, when S is large, arm 1 becomes less attractive, since there is a risk that a will be -1.////
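With the recursion sketched in Section 1, the two claims of Example 2.1 can be checked numerically (assuming our reconstruction of (1.2) and (1.3)):

```python
G = [(-5.0, 0.5), (5.0, 0.5)]
R = {-1.0: 0.5, 1.0: 0.5}
print(Delta(2, 5.0, R, 0.0, G))    # about -0.0035 < 0: arm 2 optimal, n = 2, S = 5
print(Delta(1, -5.0, R, 0.0, G))   # about +0.0035 > 0: arm 1 optimal, n = 1, S = -5
```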

Although the covariate bandit is not a stopping problem in the usual sense, it does satisfy a weak stopping rule property, as follows:

Theorem 2.1: If, for the (s,n,R,c) bandit, arm 2 is uniquely optimal for all s, then there exists an optimal strategy for which it is optimal to pull arm 2 for the (s,m,R,c) bandit, m ≤ n, for any s.

Proof: The proof is a generalization of the proof of a similar result in Bradt et al. (1956). Suppose, to the contrary, that there exists an m' < n such that arm 1 is uniquely optimal under a strategy τ for the (s',m',R,c) bandit. Consider an (s',n,R,c) bandit that follows τ for the first m' pulls, and then pulls arm 2 for the remaining n-m' pulls. This has a worth that is no less than W_n(s',R,c). But this contradicts the unique optimality of arm 2 for the (s',n,R,c) bandit.////

Another property of the standard bandit is the "stay on a winner" property: if an optimal pull of arm 1 yields a success, then it is optimal to pull arm 1 again. This need not hold for the covariate bandit, as shown in the next example.

Example 2.2: Suppose c = 0, P(a=-1) = P(a=5) = 1/2, and P(S=1) = .99 = 1 - P(S=3). Then Δ_2(3,R) = .00125 but Δ_1(3,σ_3 R) = -.00869. That is, with S = 3 and two observations to take, arm 1 is optimal. However, if a pull of arm 1 under such circumstances yields a success, a subsequent pull of arm 1 when S = 3 will not be optimal.////
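Example 2.2 can be checked the same way; the quoted values are the report's, and the harness assumes our reconstruction of the recursion:

```python
G = [(1.0, 0.99), (3.0, 0.01)]
R = {-1.0: 0.5, 5.0: 0.5}
print(Delta(2, 3.0, R, 0.0, G))                        # report: .00125
print(Delta(1, 3.0, success_update(R, 3.0), 0.0, G))   # report: -.00869
```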

Although no simple stay-on-a-winner rule exists, a weak form of stay-on-a-winner does exist, as follows.

Theorem 2.2: Suppose in the (s,n,R,c) bandit that an initial pull of arm 1 is uniquely optimal and that a success obtains. Then there exists an s' in the support of G such that a pull of arm 1 is optimal for the (s',n-1,σ_s R,c) bandit.


Proof: Suppose, to the contrary, that arm 1 is not optimal for any s' in the support of G for the (s',n-1,σ_s R,c) bandit. Then by Theorem 2.1 arm 2 is optimal for the remaining n-1 pulls, and thus W_{n-1}(σ_s R,c) = (n-1) ∫ λ(t) dG(t). It follows from Theorem 2.3 below and equation (2.1) that W_{n-1}(R,c) = (n-1) ∫ λ(t) dG(t) as well. Moreover, since a pull of arm 2 is optimal for all pulls after the first, it follows that Eρ(s_1) ≤ λ(s_1) for any s_1 in the support of G. Finally, by equation (1.2) this implies that W_n^1(s,R,c) ≤ λ(s) + W_{n-1}(R,c) = W_n^2(s,R,c), contradicting the fact that arm 1 is uniquely optimal for the first pull.////

While the covariate bandit and the standard bandit share the stopping rule and stay-on-a-winner properties in only a weakened sense, both bandits have several monotonicity properties in common. For example, it is easy to prove by induction, using (1.2) and (1.3), that W_n(s,R,c) is nondecreasing in c. We can also develop a monotonicity result for arm 1. This contains the finite horizon version of Theorem 3.1 of Berry and Fristedt (1979) as a special case.


Definition 2.1: For any two random variables X and X' with distribution functions F and F' respectively, we say that the distribution of X' is "to the right of" the distribution of X if F(b) ≥ F'(b) for all b. As noted in Marshall and Olkin (1979), this condition is equivalent to the condition that Eg(X') ≥ Eg(X) for any nondecreasing g such that the expectations exist.////

Note that if the distribution of a' is to the right of the distribution of a, then the distribution of ρ(a',s) is to the right of the distribution of ρ(a,s), for all s. In addition, if the distribution of ρ(a',s) is to the right of the distribution of ρ(a,s) for some s, then it is easy to show that this must hold for all s, and that the distribution of a' is to the right of the distribution of a.

Definition 2.2: In an extension of a notion of Berry and Fristedt (1979), if R' and R are measures for a, we define R' to be "strongly to the right" of R if hR' is to the right of hR for all histories h.////

Given these definitions, we have the following:

Theorem 2.3: If R' is strongly to the right of R, then W_n(R',c) ≥ W_n(R,c) and, for all s, W_n(s,R',c) ≥ W_n(s,R,c).

Proof: This is immediately true by induction for W_n^2. Consider W_n^1. Since R' is to the right of R, E(ρ(s)|R') ≥ E(ρ(s)|R). Note that σ_s R' is strongly to the right of σ_s R, and both of these are strongly to the right of φ_s R. Also, φ_s R' is strongly to the right of φ_s R. By induction, each of the quantities in square brackets in the resulting decomposition of W_n^1(s,R',c) - W_n^1(s,R,c) is nonnegative. The rest of the proof follows by the definition of W_n(s,R,c) and equation (1.1).////

Lemma 2.1: If s_1 < s_2 then (i) σ_{s_1} R is to the right of σ_{s_2} R, and (ii) φ_{s_1} R is to the right of φ_{s_2} R.

Proof: We prove part (i); the proof of part (ii) is similar. It will suffice to show that, for all b, P(a ≤ b | σ_{s_1} R) ≤ P(a ≤ b | σ_{s_2} R), or equivalently, writing the dependence of ρ on a explicitly, that

(2.2)    E[ρ(a,s_1) I_{a≤b}] E[ρ(a,s_2)] ≤ E[ρ(a,s_2) I_{a≤b}] E[ρ(a,s_1)],

where I_A is the indicator function of the set A. The difference of the two sides of (2.2) is

(2.3)    ∫∫ I_{a≤b} [ρ(a,s_2) ρ(a',s_1) - ρ(a,s_1) ρ(a',s_2)] dR(a) dR(a'),

and by Tonelli's theorem this may be symmetrized in (a,a') to give

(2.4)    (1/2) ∫∫ [I_{a≤b} - I_{a'≤b}] [ρ(a,s_2) ρ(a',s_1) - ρ(a,s_1) ρ(a',s_2)] dR(a) dR(a').

If a ≤ b < a' or a' ≤ b < a, the integrand in (2.4) is nonnegative; otherwise it is zero.////

An immediate consequence of Lemma 2.1 and Theorem 2.3 is the following.

Theorem 2.4: For all n, R, c, t, and i, each of W_n(σ_s R,c), W_n^i(t,σ_s R,c), and W_n(t,σ_s R,c) is nonincreasing in s, as is each of W_n(φ_s R,c), W_n^i(t,φ_s R,c), and W_n(t,φ_s R,c).

Theorem 2.4 and parts (i)-(iv) of Proposition 2.1 below are related to the "information" obtained by a particular pull. A success observed on arm 1 when s is large is relatively uninformative, since lim_{s→∞} ρ(s) = 1 for any a. However, a success observed when s is large in magnitude and negative is potentially quite informative: it suggests that a itself is large. The reverse situation arises when we observe a failure on arm 1. Bounds on the "information" available in a pull are given by parts (ii) and (iii) of the Proposition. Part (v) of Proposition 2.1 suggests that as the current covariate value grows large in magnitude, we become indifferent to the choice of arms for the next pull.

Proposition 2.1: Define σ_{-∞}R by dσ_{-∞}R/dR = e^a/E(e^a) and φ_∞R by dφ_∞R/dR = e^{-a}/E(e^{-a}). For all n, for i = 1 and 2, for all R, and for all c,

(i)   lim_{s→∞} W_n(σ_s R,c) = lim_{s→-∞} W_n(φ_s R,c) = W_n(R,c);

(ii)  lim_{s→-∞} W_n(σ_s R,c) = W_n(σ_{-∞}R,c);

(iii) lim_{s→∞} W_n(φ_s R,c) = W_n(φ_∞R,c);

(iv)  lim_{s→∞} W_n(t,σ_s R,c) = W_n(t,R,c) and lim_{s→-∞} W_n(t,φ_s R,c) = W_n(t,R,c), for all t;

(v)   lim_{t→∞} [W_n^1(t,R,c) - W_n^2(t,R,c)] = lim_{t→-∞} [W_n^1(t,R,c) - W_n^2(t,R,c)] = 0.

Proof: We prove some parts of the proposition; the remainder follow similarly. First we prove the first half of part (iv), using induction. The result is easy when n = 1. Note that, by (1.2), W_n(t,σ_s R,c) is determined by E[ρ(t)|σ_s R], W_{n-1}(σ_t σ_s R,c), and W_{n-1}(φ_t σ_s R,c). By the induction hypothesis, lim_{s→∞} W_{n-1}(σ_t σ_s R,c) = W_{n-1}(σ_t R,c). Also, it is easy to show that lim_{s→∞} E[ρ(t)|σ_s R] = Eρ(t), so the first half of part (iv) follows.

To prove part (i), use part (iv), noting that

    lim_{s→∞} W_n(σ_s R,c) = lim_{s→∞} ∫ W_n(t,σ_s R,c) dG(t) = ∫ W_n(t,R,c) dG(t) = W_n(R,c).

The second equality above follows from the dominated convergence theorem and equation (2.1).

To prove part (v), note that

    lim_{t→∞} W_n^1(t,R,c) = lim_{t→∞} {Eρ(t) + Eρ(t) W_{n-1}(σ_t R,c) + [1 - Eρ(t)] W_{n-1}(φ_t R,c)} = 1 + W_{n-1}(R,c) = lim_{t→∞} W_n^2(t,R,c),

using part (i) and the fact that Eρ(t) and λ(t) both tend to 1 as t → ∞. The case t → -∞ is similar.////

3. The function Δ_n

As mentioned above, the function Δ_n can be used to determine optimal strategies. As must also be evident, the determination of Δ_n is nontrivial. In this section we explore certain properties of Δ_n and discuss their implications for determining properties of the optimal strategy.

Note that it is useful and proper to consider Δ_n as a function of s for all real s, even though the support of G might be on some proper subset of the reals. Of course, in using Δ_n to determine optimal strategies, attention will be restricted to those s in the support of G.

It is easy to show, by induction, using (1.2), (1.3), and the definition of Δ_n (1.4), that Δ_n is a continuous function of s and c. From Proposition 2.1(v) it is evident that lim_{s→±∞} Δ_n(s,R,c) = 0. This fact is illustrated in Figure 1, where Δ_n is plotted as a function of s for n = 1,...,6 and for R and G such that P(a=-1) = P(a=1) = 1/2 = P(S=-1) = P(S=1). We note from Figure 1 that Δ_n has at most one root in s. We now set about a proof of that fact for the case n = 1, and derive a weaker result for n ≥ 2.

Theorem 3.1: If R is not degenerate at a point, then Δ_1(s,R,c) has at most one root in s.
one root in s.


Remark: If R is degenerate at c, then Δ_1(s,R,c) = 0 for all s. If R is degenerate at a point other than c, then Δ_1(s,R,c) has no roots in s.

Proof: Without loss of generality, we can assume c = 0. Let s_1 < s_2. We show that: (a) if Δ_1(s_2,R,0) ≥ 0 then Δ_1(s_1,R,0) > 0. A similar approach shows that: (b) if Δ_1(s_1,R,0) < 0 then Δ_1(s_2,R,0) < 0. Parts (a) and (b) complete the proof.

To proceed with (a), note first that Δ_1(s,R,0) = λ(s) d(s), where d(s) = E[(e^a - 1)/(1 + e^{a+s}) | R]. Since λ(s) > 0 for all s, it will suffice to show that Δ_1(s_2,R,0) ≥ 0 implies d(s_1) - d(s_2) > 0. Some algebra, using the fact that R is not degenerate at a point, shows that this is indeed the case, as required.////

The result of Theorem 3.1 may be restated in an alternative form, which we note as a corollary.

Corollary 3.1: In the (s_1,1,R,c) bandit, there exists a quantity Ξ_1 = Ξ_1(R,c) ∈ [-∞,∞] such that a pull of arm 1 is optimal if s_1 < Ξ_1; a pull of arm 2 is optimal if s_1 > Ξ_1; and either is optimal if s_1 = Ξ_1.

Remark: If Ξ_1 = +∞, then a pull of arm 1 is optimal for all s, and if Ξ_1 = -∞ then a pull of arm 2 is optimal for all s. So, for example, if P(a > c) = 1 then Ξ_1 = +∞.
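Since Δ_1(s,R,c) = Eρ(s) - λ(s) is available in closed form, Ξ_1 can be located numerically. A sketch under the same assumptions as the earlier code (the bracketing interval is arbitrary):

```python
def Delta1(s, R, c):
    # Delta_1(s, R, c) = E[rho(s)] - lambda(s): the one-pull advantage of arm 1.
    return E_rho(R, s) - lam(c, s)

def Xi1(R, c, lo=-50.0, hi=50.0, tol=1e-10):
    # Break-even covariate value of Corollary 3.1, located by bisection.
    # Theorem 3.1: Delta_1 has at most one root in s, so a sign change
    # on [lo, hi] brackets Xi_1.
    if Delta1(lo, R, c) < 0:
        return float("-inf")   # arm 2 optimal for all s
    if Delta1(hi, R, c) > 0:
        return float("inf")    # arm 1 optimal for all s
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Delta1(mid, R, c) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```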


Theorem 3.2: If R' is strongly to the right of R, then Δ_1(s,R',c) ≥ Δ_1(s,R,c) and Ξ_1(R',c) ≥ Ξ_1(R,c). Also, Δ_1(s,R,c) is decreasing in c; Ξ_1(R,c) is nonincreasing in c.

Proof: This is an immediate consequence of Theorem 2.3 and the definitions of Δ_1 and Ξ_1.////

We conjecture that, for all n, there exists a quantity with properties similar to Ξ_1; namely, in the (s_n,n,R,c) bandit, it is optimal to pull arm 1 if s_n < Ξ_n(R,c) and it is optimal to pull arm 2 if s_n > Ξ_n(R,c). In this sense Ξ_n(R,c) would be a "break-even value" for the covariate. If Ξ_n were to exist for n ≥ 2, then a complete determination of an optimal strategy could be given in terms of Ξ_n. Our next result and its corollary are partial results in that direction.

Theorem 3.3: For all n, Δ_n(s,R,c) ≥ Δ_1(s,R,c).

Proof: From (1.2), (1.3), and (1.4),

(3.1)    Δ_{n+1}(s,R,c) - Δ_1(s,R,c) = Eρ(s) W_n(σ_s R,c) + [1 - Eρ(s)] W_n(φ_s R,c) - W_n(R,c).

The right side of (3.1) may be written as

(3.2)    ∫ {Eρ(s) W_n(t,σ_s R,c) + [1 - Eρ(s)] W_n(t,φ_s R,c) - W_n(t,R,c)} dG(t).

We show that the integrand in (3.2) is always nonnegative by induction. For n = 1, there are two cases. If W_1(t,R,c) = λ(t), then the result is clear, since W_1(t,σ_s R,c) ≥ λ(t) and W_1(t,φ_s R,c) ≥ λ(t). If W_1(t,R,c) = Eρ(t), then we note that W_1(t,σ_s R,c) ≥ E[ρ(t)|σ_s R] and W_1(t,φ_s R,c) ≥ E[ρ(t)|φ_s R], whence the integrand in (3.2) is bounded below by

    Eρ(s) E[ρ(t)|σ_s R] + [1 - Eρ(s)] E[ρ(t)|φ_s R] - Eρ(t) = 0.

For the induction step we assume that the integrand in (3.2) is nonnegative for n = m. Again, we distinguish two cases. First, when W_{m+1}(t,R,c) = W_{m+1}^2(t,R,c), the integrand is bounded below, first by the definition of W_{m+1} and then by the induction hypothesis, by a nonnegative quantity. In the second case, if W_{m+1}(t,R,c) = W_{m+1}^1(t,R,c), then, following (1.2), the integrand in (3.2) can be written as the sum of three components, A, B, and C, say. Some algebra shows that A = 0. The expression B may be rewritten in a form which is nonnegative by the induction hypothesis. Likewise, the induction hypothesis may be used to prove that C ≥ 0, and the theorem now follows.////

Corollary 3.2: For any n, R, c, there exists a Ξ̃_n(R,c) such that if s_n < Ξ̃_n(R,c) then arm 1 is optimal in the (s_n,n,R,c) bandit. Moreover, we may take Ξ̃_n(R,c) = Ξ_1(R,c).

Let Ξ̂_n(R,c) be the largest Ξ̃_n(R,c) such that the property given in Corollary 3.2 holds. (In particular, Ξ̂_1(R,c) = Ξ̃_1(R,c) = Ξ_1(R,c).) Then Ξ̂_n(R,c) is a weak form of "break-even value" for the covariate s_n, in the following sense: if s_n < Ξ̂_n(R,c) then arm 1 is optimal in the (s_n,n,R,c) bandit. Noting that ρ(s) and λ(s) are increasing in s, this says that, in terms of a clinical trial, when subjects have covariates such that the probability of success is small, then it is better to use the experimental, unknown treatment (arm 1).

Our conjecture that Ξ_n(R,c) exists for all n is equivalent to the conjecture that s_n > Ξ̂_n(R,c) implies that arm 2 is optimal in the (s_n,n,R,c) bandit. This certainly holds for the example in Figure 1, and we have shown it to hold in other examples not provided here. The proof of such a result in general seems particularly elusive, however. The following example represents a partial result regarding the existence of Ξ_n(R,c) for n > 1.

Example 3.2: Suppose c = 0, P(a=1) = p = 1 - P(a=-1), and P(S=-1) = P(S=1) = 1/2. We demonstrate the existence of Ξ_2 in this case. For brevity, we suppress c from the notation. It will suffice to show that if W_2(-1,R) = W_2^2(-1,R), then W_2(1,R) = W_2^2(1,R), and that if W_2(1,R) = W_2^1(1,R), then W_2(-1,R) = W_2^1(-1,R). We prove these simultaneously by contradiction.

Specifically, suppose that W_2(-1,R) = W_2^2(-1,R) and that W_2(1,R) = W_2^1(1,R). If W_2(-1,R) = W_2^2(-1,R), then it follows that Δ_2(-1,R) ≤ 0, whence Δ_1(-1,R) ≤ 0 by Theorem 3.3. But, by Theorem 3.2, Δ_1(-1,R) ≤ 0 implies that Δ_1(-1,φ_{-1}R) < 0 and Δ_1(-1,φ_1 R) < 0. From Theorem 3.1, we thus have Δ_1(1,R) < 0, Δ_1(1,φ_{-1}R) < 0, and Δ_1(1,φ_1 R) < 0. Since W_2(1,R) = W_2^1(1,R) by assumption, it must be that Δ_1(-1,σ_1 R) > 0. There are now two cases: Δ_1(1,σ_1 R) ≥ 0 and Δ_1(1,σ_1 R) < 0. The second case can be shown to be impossible. Therefore, W_2^1(1,R) may be evaluated with arm 1 used at the second stage following a success. On the other hand, W_2^2(1,R) = λ(1) + (1/2)λ(1) + (1/2)λ(-1), whence Δ_2(1,R) may be computed explicitly. Similarly, W_2^2(-1,R) = λ(-1) + (1/2)λ(1) + (1/2)λ(-1) and

    Δ_2(-1,R) = Eρ(-1) - λ(-1) + (1/2) E[ρ(-1)(ρ(1) - λ(1) + ρ(-1) - λ(-1))],

since, by Lemma 2.1 and Theorem 3.2, Δ_1(s,σ_{-1}R) ≥ Δ_1(s,σ_1 R) ≥ 0. Some tedious algebra then shows that

    Δ_2(1,R) > 0 iff p > .5883,

while

    Δ_2(-1,R) > 0 iff p > .5694.

This leads to the sought-after contradiction: Δ_2(1,R) ≥ 0 forces p ≥ .5883 > .5694, whence Δ_2(-1,R) > 0.////
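Under our reconstructed recursion, the conclusion of Example 3.2 (existence of Ξ_2 for this family of priors) can also be scanned numerically:

```python
G = [(-1.0, 0.5), (1.0, 0.5)]
for k in range(1, 100):
    p = k / 100.0
    R = {-1.0: 1.0 - p, 1.0: p}
    # Xi_2 fails to exist only if arm 2 is strictly optimal at s = -1 while
    # arm 1 is strictly optimal at s = 1; verify this never happens.
    assert not (Delta(2, -1.0, R, 0.0, G) < 0 < Delta(2, 1.0, R, 0.0, G))
```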

4. Additional Comments

As mentioned, Corollaries 3.1 and 3.2 are directed toward the description of optimal strategies in terms of a break-even value for the current covariate value. In noncovariate bandit treatments, optimal policies have been described in terms of a break-even value for arm 2. (See, for example, Bradt et al., 1956, Berry and Fristedt, 1979, Clayton and Berry, 1985, and Berry and Fristedt, 1985. Gittins and Jones, 1974, have used such indexes in describing multiarmed bandits.) A comparable break-even value for arm 2 of the covariate bandit is a quantity C_n(s,R) which would characterize optimal strategies as follows: if c < C_n(s,R) then a pull of arm 1 is optimal, if c > C_n(s,R) then a pull of arm 2 is optimal, and if c = C_n(s,R) then either arm is optimal. We conjecture that such a quantity C_n(s,R) exists for all n, s, and R. Indeed, it is easy to see that C_1(s,R) exists for all s and R; it is the root in c of the equation Δ_1(s,R,c) = 0. The next result describes a situation in which C_2(s,R) exists:

Proposition 4.1: If Δ_1(s,R,c) - ∫ Δ_1(t,R,c) dG(t) is decreasing in c, then C_2(s,R) exists.


Remark: The hypothesis of the proposition is equivalent to the requirement that ∫ λ(t,c)(1 - λ(t,c)) dG(t) < λ(s,c)(1 - λ(s,c)).

Proof: We prove that, under the hypothesis of the proposition, Δ_2(s,R,c) is decreasing in c. The existence of C_2(s,R) as a root in c of Δ_2(s,R,c) = 0 then follows from the fact that lim_{c→-∞} Δ_2(s,R,c) = Eρ(s) and lim_{c→∞} Δ_2(s,R,c) = Eρ(s) - 1. If we let Δ_1^+ = max{Δ_1, 0} and Δ_1^- = -min{Δ_1, 0}, then (1.1), (1.2), (1.3), and (1.4) can be used to express Δ_2(s,R,c) in terms of Δ_1, Δ_1^+, and Δ_1^-. The desired result now follows from the fact that Δ_1(t,R,c) is decreasing in c for all R.////
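Since Δ_1(s,R,c) is decreasing in c (Theorem 3.2), with limits Eρ(s) > 0 as c → -∞ and Eρ(s) - 1 < 0 as c → ∞, the index C_1(s,R) can be computed by bisection. A sketch under the assumptions above (again, the bracketing interval is arbitrary):

```python
def C1(s, R, lo=-50.0, hi=50.0, tol=1e-10):
    # C_1(s, R): the root in c of Delta_1(s, R, c) = 0.  Delta_1 is
    # decreasing in c, so bisection on a bracketing interval suffices.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Delta1(s, R, mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```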

The method of proof used for Proposition 4.1 can be generalized to show that if Δ_1(s,R,c) - n ∫ Δ_1(t,R,c) dG(t) is decreasing in c, and if Δ_m(s,R,c) is decreasing in c for all s, R, and m < n, then Δ_n(s,R,c) is decreasing in c, so that C_n(s,R) exists.


One can, of course, contemplate covariate models other than the linear-logistic model adopted here; for example, the unknown parameter might enter as a slope rather than as an intercept. The following example shows that an index for s may not exist for such a covariate model.

Example 4.1: Suppose λ(s) = e^s/(1+e^s) and ρ(s) = e^{βs}/(1+e^{βs}), with P(β=0) = 1/2 = P(β=10). Then Eρ(s) - λ(s) is positive, and so arm 1 is favored, when s is sufficiently far below zero; it is negative for s slightly below zero, positive again for s slightly above zero, and negative for large s (the sign changes fall at approximately ±1.1 and at 0). Hence, even for n = 1, the set of covariate values for which arm 1 is favored is not an interval, and no break-even value for s exists.////
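A quick numerical check of the sign pattern described in Example 4.1 (the helper functions are ours):

```python
import math

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

def gap(s):
    # E[rho(s)] - lambda(s) with P(beta=0) = P(beta=10) = 1/2 and lambda = sig.
    return 0.5 * sig(0.0) + 0.5 * sig(10.0 * s) - sig(s)

for s in (-3.0, -0.5, 0.5, 3.0):
    print(s, gap(s))   # signs: +, -, +, -; no single cutoff in s exists
```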


REFERENCES

Berry, D.A. (1972). A Bernoulli two-armed bandit. Ann. Math. Statist. 43, 871-897.

Berry, D.A. and Fristedt, B. (1979). Bernoulli one-armed bandits - arbitrary discount sequences. Ann. Statist. 7, 1086-1105.

Berry, D.A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman-Hall, New York.

Bradt, R.N., Johnson, S.M. and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Ann. Math. Statist. 27, 1060-1070.

Clayton, M.K. (1985). A Bayesian nonparametric sequential test for the mean of a population. Ann. Statist. 13, 1129-1139.

Clayton, M.K. and Berry, D.A. (1985). Bayesian nonparametric bandits. Ann. Statist. 13, 1523-1534.

Clayton, M.K. and Witmer, J.A. (1987). Two-stage bandits. University of Wisconsin-Madison, Department of Statistics Technical Report.

Gittins, J.C. and Jones, D.M. (1974). A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (eds. J. Gani et al.), pp. 241-266. North-Holland, Amsterdam.

Marshall, A.W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Academic Press, New York.

Simons, G. (1986). Bayes rules for a clinical-trials model with dichotomous responses. Ann. Statist. 14, 954-970.

Woodroofe, M. (1979). A one-armed bandit problem with a concomitant variable. J. Amer. Statist. Assoc. 74, 799-806.

Department of Statistics
University of Wisconsin
1210 W. Dayton St.
Madison, WI 53706


Figure 1. Δ_n(s,R,c) for various n and s. Here P(a=-1) = P(a=1) = 0.5 = P(S=-1) = P(S=1).
