1 Introduction and Summary
Consider the problem of a controller sampling sequentially from a finite number $N \geq 2$ of populations or ‘bandits’, where the measurements from population $i$ are specified by a sequence of i.i.d. random variables $\{X^i_k\}_{k \geq 1}$, taken to be normal with finite mean $\mu_i$ and finite variance $\sigma_i^2$. The means $\{\mu_i\}$ and variances $\{\sigma_i^2\}$ are taken to be unknown to the controller. It is convenient to define the maximum mean, $\mu^* = \max_i \mu_i$, and the bandit discrepancies $\{\Delta_i\}$, where $\Delta_i = \mu^* - \mu_i \geq 0$. It is additionally convenient to define $\sigma_*^2$ as the minimal variance of any bandit that achieves $\mu^*$, that is, $\sigma_*^2 = \min_{i : \mu_i = \mu^*} \sigma_i^2$.
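In this paper, given $k$ samples from population $i$, we will take the estimators $\bar{X}^i_k = \frac{1}{k}\sum_{t=1}^{k} X^i_t$ and $S_i^2(k) = \frac{1}{k}\sum_{t=1}^{k} (X^i_t - \bar{X}^i_k)^2$ for $\mu_i$ and $\sigma_i^2$ respectively. Note that the use of the biased estimator for the variance, with the factor $1/k$ in place of $1/(k-1)$, is largely for aesthetic purposes; the results presented here adapt to the use of the unbiased estimator as well.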
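The two variance conventions mentioned above can be compared on a small sample. The following is a minimal sketch; the function names and the sample values are illustrative assumptions, not notation from the paper.

```python
def sample_mean(xs):
    """Sample mean of the observations from one bandit."""
    return sum(xs) / len(xs)


def biased_variance(xs):
    """Variance estimate with the 1/k factor (the convention adopted in the text)."""
    k = len(xs)
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / k


def unbiased_variance(xs):
    """Variance estimate with the 1/(k-1) factor; requires at least two samples."""
    k = len(xs)
    return biased_variance(xs) * k / (k - 1)


# Hypothetical observations from a single bandit.
samples = [1.0, 2.0, 4.0, 5.0]
print(sample_mean(samples), biased_variance(samples), unbiased_variance(samples))
```

The two estimates differ only by the factor $k/(k-1)$, which tends to 1 as the number of samples grows.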
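For any adaptive, non-anticipatory policy $\pi$, $\pi(t) = i$ indicates that the controller samples bandit $i$ at time $t$. Define $T^i_\pi(n) = \sum_{t=1}^{n} \mathbb{1}\{\pi(t) = i\}$, denoting the number of times bandit $i$ has been sampled during the periods $t = 1, \ldots, n$ under policy $\pi$; we take, as a convenience, $T^i_\pi(0) = 0$ for all $i, \pi$. The value of a policy $\pi$ is the expected sum of the first $n$ outcomes under $\pi$, which we define to be the function $V_\pi(n)$:
$$V_\pi(n) = \mathbb{E}\left[\sum_{i=1}^{N} \sum_{k=1}^{T^i_\pi(n)} X^i_k\right] = \sum_{i=1}^{N} \mu_i\, \mathbb{E}\left[T^i_\pi(n)\right], \tag{1}$$
where for simplicity the dependence of $V_\pi(n)$ on the true, unknown values of the parameters $\mu = (\mu_1, \ldots, \mu_N)$ and $\sigma^2 = (\sigma_1^2, \ldots, \sigma_N^2)$ is suppressed. The pseudo-regret, or simply regret, of a policy is taken to be the expected loss due to ignorance of the parameters $\mu$ and $\sigma^2$ by the controller. Had the controller complete information, she would at every round activate some bandit $i^*$ such that $\mu_{i^*} = \mu^*$. For a given policy $\pi$, we define the expected regret of that policy at time $n$ as
$$R_\pi(n) = n\mu^* - V_\pi(n) = \sum_{i=1}^{N} \Delta_i\, \mathbb{E}\left[T^i_\pi(n)\right]. \tag{2}$$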
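As a numerical illustration of how value and regret relate, the sketch below computes both from a vector of means and a vector of expected sample counts. The means, the counts, and the function names are hypothetical and chosen only for illustration.

```python
def policy_value(means, expected_counts):
    """Expected sum of outcomes: sum over bandits of mean times expected sample count."""
    return sum(mu * t for mu, t in zip(means, expected_counts))


def expected_regret(means, expected_counts):
    """Expected regret: sum over bandits of discrepancy times expected sample count."""
    mu_star = max(means)
    return sum((mu_star - mu) * t for mu, t in zip(means, expected_counts))


# Hypothetical three-bandit instance after n = 100 rounds.
means = [1.0, 0.5, 0.0]
expected_counts = [80.0, 15.0, 5.0]
n = sum(expected_counts)

value = policy_value(means, expected_counts)
regret = expected_regret(means, expected_counts)
print(value, regret, n * max(means) - value)
```

Since the expected counts sum to $n$, the regret equals $n$ times the maximum mean minus the value, so maximizing one is the same as minimizing the other.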
It follows from Eqs. (1) and (2) that maximization of $V_\pi(n)$ with respect to $\pi$ is equivalent to minimization of $R_\pi(n)$. This type of loss due to ignorance of the means (regret) was first introduced in the context of an $N = 2$ problem by Robbins (1952) as the ‘loss per trial’ $L_\pi(n)/n$ (for which $R_\pi(n) = \mathbb{E}[L_\pi(n)]$), constructing a modified (along two sparse sequences) ‘play the winner’ policy, $\pi_R$, such that $L_{\pi_R}(n) = o(n)$ (a.s.) and $R_{\pi_R}(n) = o(n)$, using for his derivation only the assumption of the Strong Law of Large Numbers. Following Burnetas and Katehakis (1996b), when $n \to \infty$, if $\pi$ is such that $R_\pi(n) = o(n)$ we say policy $\pi$ is uniformly convergent (UC) (since then $\lim_{n \to \infty} V_\pi(n)/n = \mu^*$). However, if under a policy $\pi$, $R_\pi(n)$ grew at a slower pace, such as $R_\pi(n) = o(n^{1/2})$, or better $R_\pi(n) = o(n^{1/100})$ etc., then the controller would be assured that $\pi$ is making an effective trade-off between exploration and exploitation. It turns out that it is possible to construct ‘uniformly fast convergent’ (UFC) policies, also known as consistent or strongly consistent, defined as the policies $\pi$ for which:
$$R_\pi(n) = o(n^\alpha) \quad \text{for all } \alpha > 0, \text{ for all } (\mu, \sigma^2). \tag{3}$$
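The existence of UFC policies in the case considered here is well established; e.g., Auer et al. (2002) (Fig. 4 therein) presented the following UFC policy $\pi_{ACF}$:
Policy $\pi_{ACF}$ (UCB1-NORMAL). At each $n = 1, 2, \ldots$:
Sample from any bandit $i$ for which $T^i_{\pi_{ACF}}(n) < \lceil 8 \ln n \rceil$.
If $T^i_{\pi_{ACF}}(n) \geq \lceil 8 \ln n \rceil$ for all $i = 1, \ldots, N$, sample from bandit $\pi_{ACF}(n+1)$ with
$$\pi_{ACF}(n+1) = \arg\max_i \left\{ \bar{X}^i_{T^i_{\pi_{ACF}}(n)} + 4\, S_i\!\left(T^i_{\pi_{ACF}}(n)\right) \sqrt{\frac{\ln n}{T^i_{\pi_{ACF}}(n)}} \right\}.$$
(Taking, in this case, $S_i^2(k)$ as the unbiased estimator.)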
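Additionally, Auer et al. (2002) (in Theorem 4 therein) gave the following bound:
$$R_{\pi_{ACF}}(n) \leq M_{ACF}(\mu, \sigma^2) \ln n + C_{ACF}(\mu), \tag{4}$$
with
$$M_{ACF}(\mu, \sigma^2) = 256 \sum_{i : \mu_i \neq \mu^*} \frac{\sigma_i^2}{\Delta_i} + 8 \sum_{i=1}^{N} \Delta_i, \tag{5}$$
$$C_{ACF}(\mu) = \left(1 + \frac{\pi^2}{2}\right) \sum_{i=1}^{N} \Delta_i. \tag{6}$$
Ineq. (4) readily implies that $\limsup_{n \to \infty} R_{\pi_{ACF}}(n)/\ln n \leq M_{ACF}(\mu, \sigma^2)$. Thus, since $\ln n = o(n^\alpha)$ for all $\alpha > 0$ and $M_{ACF}(\mu, \sigma^2) < \infty$, it follows that $\pi_{ACF}$ is uniformly fast convergent.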
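Given that UFC policies exist, the question immediately follows: just how fast can they be? The primary motivation of this paper is the following general result from Burnetas and Katehakis (1996b), who showed that for any UFC policy $\pi$, the following holds:
$$\liminf_{n \to \infty} \frac{R_\pi(n)}{\ln n} \geq M_{BK}(\mu, \sigma^2), \tag{7}$$
where the bound itself is determined by the specific distributions of the populations, in this case
$$M_{BK}(\mu, \sigma^2) = \sum_{i : \mu_i \neq \mu^*} \frac{2 \Delta_i}{\ln\left(1 + \Delta_i^2 / \sigma_i^2\right)}. \tag{8}$$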
For comparison, depending on the specifics of the bandit distributions, there is a considerable distance between the logarithmic term of the upper bound of Eq. (4) and the lower bound implied by Eq. (8).
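The size of this gap can be illustrated numerically. The sketch below assumes that the lower-bound constant has the form $\sum_{i:\mu_i \neq \mu^*} 2\Delta_i / \ln(1 + \Delta_i^2/\sigma_i^2)$ and that the logarithmic coefficient in the Auer et al. (2002) bound is $256 \sum_{i:\mu_i \neq \mu^*} \sigma_i^2/\Delta_i + 8 \sum_i \Delta_i$, as recalled from Theorem 4 there. The function names and the two-bandit instance are hypothetical.

```python
import math


def lower_bound_constant(means, variances):
    """Assumed lower-bound coefficient: sum of 2*Delta / ln(1 + Delta^2 / sigma^2)."""
    mu_star = max(means)
    total = 0.0
    for mu, var in zip(means, variances):
        delta = mu_star - mu
        if delta > 0:
            total += 2.0 * delta / math.log(1.0 + delta ** 2 / var)
    return total


def ucb1_normal_log_coefficient(means, variances):
    """Assumed coefficient of ln n in the UCB1-NORMAL upper bound."""
    mu_star = max(means)
    deltas = [mu_star - mu for mu in means]
    first = sum(256.0 * var / d for d, var in zip(deltas, variances) if d > 0)
    return first + 8.0 * sum(deltas)


# Hypothetical instance: two bandits, unit variances, discrepancy 1.
means = [1.0, 0.0]
variances = [1.0, 1.0]
print(lower_bound_constant(means, variances))         # about 2.885
print(ucb1_normal_log_coefficient(means, variances))  # 264.0
```

On this instance the two coefficients differ by roughly two orders of magnitude, which is the kind of distance the comparison above refers to.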
The derivation of Ineq. (7) implies that in order to guarantee that a policy is uniformly fast convergent, sub-optimal populations have to be sampled at least a logarithmic number of times. The above bound is a special case of a more general result derived in Burnetas and Katehakis (1996b) (part 1 of Theorem 1 therein) for distributions with multi-parameters being unknown (such as in the current problem of Normal populations with both the mean and the variance being unknown):
Previously, Lai and Robbins (1985) had obtained such lower bounds for distributions with one-parameter (such as in the current problem of Normal populations with unknown mean but known variance). Allocation policies that achieved the lower bounds were called asymptotically efficient or optimal in Lai and Robbins (1985).
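Ineq. (7) motivates the definition of a uniformly fast convergent policy $\pi$ as having a uniformly maximal convergence rate (UM), or simply being asymptotically optimal within the class of uniformly fast convergent policies, if $\limsup_{n \to \infty} R_\pi(n)/\ln n \leq M_{BK}(\mu, \sigma^2)$, since then $\lim_{n \to \infty} R_\pi(n)/\ln n = M_{BK}(\mu, \sigma^2)$.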
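Burnetas and Katehakis (1996b) proposed the following index policy as one that could achieve this lower bound:
Policy $\pi_{BK}$ (UCB-NORMAL)
For $n = 1, 2, \ldots, 2N$ sample each bandit twice, and
for $n \geq 2N$, sample from bandit $\pi_{BK}(n+1)$ with
$$\pi_{BK}(n+1) = \arg\max_i \left\{ \bar{X}^i_{T^i_{\pi_{BK}}(n)} + S_i\!\left(T^i_{\pi_{BK}}(n)\right) \sqrt{n^{2 / T^i_{\pi_{BK}}(n)} - 1} \right\}.$$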
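Burnetas and Katehakis (1996b) were not able to establish the asymptotic optimality of the $\pi_{BK}$ policy because they could not verify a sufficient condition (Condition A3 therein), which we express here as the following equivalent conjecture (the referenced open question in the subtitle). For each $i$, for every $\epsilon > 0$, and for $k \to \infty$, the following is true:
$$\mathbb{P}\left( \bar{X}^i_j + S_i(j) \sqrt{k^{2/j} - 1} < \mu_i - \epsilon \ \text{ for some } 2 \leq j \leq k \right) = o(1/k). \tag{10}$$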
We show that the above conjecture is false (cf. Proposition A in the Appendix). This does not imply that $\pi_{BK}$ fails to be UM (i.e., to be asymptotically optimal), but it does mean that the techniques established in Burnetas and Katehakis (1996b) are insufficient to verify its optimality. All is not lost, however. One of the central results of this paper is to establish that with a small change, the policy may be modified to one that is provably asymptotically optimal. We introduce in this paper the policy $\pi_{CHK}$ defined in the following way:
Policy $\pi_{CHK}$ (UCB-NORMAL)
For $n = 1, 2, \ldots, 3N$ sample each bandit three times, and
for $n \geq 3N$, sample from bandit $\pi_{CHK}(n+1)$ with
$$\pi_{CHK}(n+1) = \arg\max_i \left\{ \bar{X}^i_{T^i_{\pi_{CHK}}(n)} + S_i\!\left(T^i_{\pi_{CHK}}(n)\right) \sqrt{n^{2 / \left(T^i_{\pi_{CHK}}(n) - 2\right)} - 1} \right\}.$$
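The policy just described can be sketched as a short simulation. The sketch assumes the index of a bandit sampled $T$ times by round $n$ is $\bar{X} + S\sqrt{n^{2/(T-2)} - 1}$, with $S^2$ the biased variance estimate, which is how we read the description of the exponent in the remarks that follow. The function names, the seed, and the two-bandit instance are hypothetical.

```python
import math
import random


def chk_index(mean, biased_var, t_i, n):
    """Assumed index: mean + S * sqrt(n^(2/(t_i - 2)) - 1), defined for t_i >= 3."""
    return mean + math.sqrt(biased_var) * math.sqrt(n ** (2.0 / (t_i - 2)) - 1.0)


def run_chk(means, sigmas, horizon, seed=0):
    """Simulate the index policy on normal bandits; return per-bandit sample counts."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    sq_sums = [0.0] * n_arms

    def pull(i):
        x = rng.gauss(means[i], sigmas[i])
        counts[i] += 1
        sums[i] += x
        sq_sums[i] += x * x

    # Initial phase: sample each bandit three times (rounds 1, ..., 3N).
    for i in range(n_arms):
        for _ in range(3):
            pull(i)

    # Main phase: after n >= 3N rounds, sample the bandit with the largest index.
    for n in range(3 * n_arms, horizon):
        best, best_val = 0, -math.inf
        for i in range(n_arms):
            mean = sums[i] / counts[i]
            # Biased (1/k) variance estimate; clamp tiny negative rounding error.
            var = max(sq_sums[i] / counts[i] - mean * mean, 0.0)
            val = chk_index(mean, var, counts[i], n)
            if val > best_val:
                best, best_val = i, val
        pull(best)
    return counts


# Hypothetical instance: two bandits with unit variance and discrepancy 1.
counts = run_chk([1.0, 0.0], [1.0, 1.0], horizon=2000)
print(counts)
```

Keeping running sums instead of the full sample history makes each round cost $O(N)$. The clamp on the variance guards only against floating-point rounding, not against any property of the policy itself.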
1) Note that policy $\pi_{CHK}$ is only a slight modification of policy $\pi_{BK}$; the only difference between their indices is the $-2$ in the power on $n$ under the radical, i.e., $2/(T^i_\pi(n) - 2)$ in $\pi_{CHK}$ replacing $2/T^i_\pi(n)$ in $\pi_{BK}$. This change, while seemingly asymptotically negligible (as in practice $T^i_\pi(n) \to \infty$ (a.s.) with $n$), has a profound effect on what is provable about $\pi_{CHK}$.
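2) We note that the indices of policy $\pi_{CHK}$ are a significant modification of those of the optimal allocation policy $\pi_{KR}$ for the case of normal bandits with known variances, cf. Burnetas and Katehakis (1996b) and Katehakis and Robbins (1995), which are:
$$\bar{X}^i_{T^i_\pi(n)} + \sigma_i \sqrt{\frac{2 \ln n}{T^i_\pi(n)}},$$
the difference being replacing the term $\sigma_i \sqrt{2 \ln n / T^i_\pi(n)}$ in $\pi_{KR}$ by $S_i(T^i_\pi(n)) \sqrt{n^{2/(T^i_\pi(n) - 2)} - 1}$ in $\pi_{CHK}$. However, the indices of policy $\pi_{ACF}$ are a minor modification of those of the optimal policy $\pi_{KR}$, the difference being replacing the term $\sigma_i \sqrt{2 \ln n / T^i_\pi(n)}$ in $\pi_{KR}$ by $4 S_i(T^i_\pi(n)) \sqrt{\ln n / T^i_\pi(n)}$ in $\pi_{ACF}$.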
3) The $\pi_{CHK}$ and $\pi_{ACF}$ policies can be seen as connected in the following way, however, observing that $2 \ln n / (T^i_\pi(n) - 2)$ is a first-order approximation of $n^{2/(T^i_\pi(n) - 2)} - 1$.
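The first-order relationship can be written out explicitly. In the display below, $m$ is a placeholder introduced here for either $T^i_\pi(n)$ or $T^i_\pi(n) - 2$; the expansion is the standard one for the exponential and holds whenever $\ln n / m$ is small.

```latex
n^{2/m} - 1
  \;=\; \exp\!\left(\frac{2 \ln n}{m}\right) - 1
  \;=\; \frac{2 \ln n}{m} + O\!\left(\left(\frac{\ln n}{m}\right)^{2}\right),
  \qquad \frac{\ln n}{m} \to 0 .
```

Hence, for a bandit sampled far more than $\ln n$ times, an exploration term of the form $S\sqrt{n^{2/m} - 1}$ behaves like $S\sqrt{2 \ln n / m}$, which has the shape of the known-variance index with the true standard deviation replaced by its estimate.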
Following Robbins (1952), and additionally Gittins (1979), Lai and Robbins (1985) and Weber (1992) there is a large literature on versions of this problem, cf. Burnetas and Katehakis (2003), Burnetas and Katehakis (1997b) and references therein. For recent work in this area we refer to Audibert et al. (2009), Auer and Ortner (2010), Gittins et al. (2011), Bubeck and Slivkins (2012), Cappé et al. (2013), Kaufmann (2015), Li et al. (2014), Cowan and Katehakis (2015b), Cowan and Katehakis (2015c), and references therein. For more general dynamic programming extensions we refer to Burnetas and Katehakis (1997a), Butenko et al. (2003), Tewari and Bartlett (2008), Audibert et al. (2009), Littman (2012), Feinberg et al. (2014) and references therein. Other related work in this area includes: Burnetas and Katehakis (1993), Burnetas and Katehakis (1996a), Lagoudakis and Parr (2003), Bartlett and Tewari (2009), Tekin and Liu (2012), Jouini et al. (2009), Dayanik et al. (2013), Filippi et al. (2010), Osband and Van Roy (2014), Denardo et al. (2013).
To our knowledge, outside the work in Lai and Robbins (1985), Burnetas and Katehakis (1996b) and Burnetas and Katehakis (1997a), asymptotically optimal policies have only been developed in Honda and Takemura (2011), and in Honda and Takemura (2010) for the problem of finite known support, where optimal policies, cyclic and randomized, that are simpler to implement than those considered in Burnetas and Katehakis (1996b) were constructed. Recently, in Cowan and Katehakis (2015a), an asymptotically optimal policy for uniform bandits of unknown support was constructed. The question of whether asymptotically optimal policies exist in the case discussed herein, of normal bandits with unknown means and unknown variances, was recently resolved in the positive by Honda and Takemura (2013), who demonstrated that a form of Thompson sampling with certain priors on $(\mu, \sigma^2)$ achieves the asymptotic lower bound.
The structure of the rest of the paper is as follows. In Section 2, Theorem 1 establishes a finite horizon bound on the regret of $\pi_{CHK}$. From this bound, it follows that $\pi_{CHK}$ is asymptotically optimal (Theorem 2), and we provide a bound on the remainder term (Theorem 3). Additionally, in Section 3, the Thompson sampling policy of Honda and Takemura (2013) and $\pi_{CHK}$ are compared and discussed, as both achieve asymptotic optimality.
2 The Optimality Theorem and Finite Time Bounds
The main results of this paper, that Conjecture 1 is false (cf. Proposition A in the Appendix), the asymptotic optimality, and the bounds on the behavior of $R_{\pi_{CHK}}(n)$, all depend on the following probability bounds; we note that tighter bounds seem possible, but these are sufficient for this paper.
Let $Z$, $U$ be independent random variables, $Z \sim N(0, 1)$ a standard normal, and $U \sim \chi^2_d$ a chi-squared distribution with $d$ degrees of freedom, where $d \geq 2$.
For , the following holds for all :
Proof of Proposition 2. The proof is given in the Appendix.
For policy as defined above, the following bounds hold for all and all :
Before giving the proof of this bound, we present two results, the first demonstrating the asymptotic optimality of $\pi_{CHK}$, the second giving an $\epsilon$-free version of the above bound, which gives a bound on the sub-logarithmic remainder term. It is worth noting the following. The bounds of Theorem 1 can actually be improved, through the use of a modified version of Proposition 2, to eliminate the $\ln \ln n$ dependence, so the only dependence on $n$ is through the initial $\ln n$ term. The cost of this, however, is a dependence on a larger power of $1/\epsilon$. The particular form of the bound given in Eq. (13) was chosen to simplify the following two results, cf. Remark 4 in the proof of Proposition 2.
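For policy $\pi_{CHK}$ as defined above, $\pi_{CHK}$ is asymptotically optimal in the sense that
$$\lim_{n \to \infty} \frac{R_{\pi_{CHK}}(n)}{\ln n} = M_{BK}(\mu, \sigma^2).$$
Taking the infimum over all such $\epsilon$,
$$\limsup_{n \to \infty} \frac{R_{\pi_{CHK}}(n)}{\ln n} \leq M_{BK}(\mu, \sigma^2),$$
and observing the lower bound of Ineq. (7) completes the result.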
For a policy as defined above, , and more concretely
While the above bound admittedly has a more complex form than a bound such as that in Eq. (4), it demonstrates the asymptotic optimality of the dominating term, and bounds the sub-logarithmic remainder term.
This inequality is proven separately as Proposition A in the Appendix.
We make no claim that the results of Theorems 1 and 3 are the best achievable for this policy $\pi_{CHK}$. At several points in the proofs, choices of convenience were made in the bounding of terms, and different techniques may yield tighter bounds still. But they are sufficient to demonstrate the asymptotic optimality of $\pi_{CHK}$, and give useful bounds on the growth of $R_{\pi_{CHK}}(n)$.
Proof of Theorem 1. In this proof, we take $\pi = \pi_{CHK}$ as defined above. For notational convenience, we define the index function
$$u_i(k, j) = \bar{X}^i_j + S_i(j) \sqrt{k^{2/(j-2)} - 1}.$$
The structure of this proof will be to bound the expected value of for all sub-optimal bandits , and use this to bound the regret . The basic techniques follow those in Katehakis and Robbins (1995) for the known variance case, modified accordingly here for the unknown variance case and assisted by the probability bound of Proposition 2. For any such that , we define the following quantities: Let and define . For ,
Hence, we have the following relationship for , that
The proof proceeds by bounding, in expectation, each of the four terms.
Observe that, by the structure of the index function ,
The last inequality follows, observing that may be expressed as the sum of indicators, and seeing that the additional condition bounds the number of non-zero terms in the above sum. The additional simply accounts for the term and the term. Note, this bound is sample-path-wise.
For the second term,
The last inequality follows as, for fixed , may be true for at most one value of . Recall that has the distribution of a random variable. Letting , from the above we have
The penultimate step is a Chernoff bound on the terms,
To bound the third term, a similar rearrangement to Eq. (24) (using the sample mean instead of the sample variance) yields:
Recalling that for a standard normal,
The penultimate step is a Chernoff bound on the terms, .
To bound the term, observe that in the event , from the structure of the policy it must be true that . Thus, if is some bandit such that , . In particular, we take to be a bandit that not only achieves the maximal mean , but also the minimal variance among optimal bandits, . We have the following bound,
The last step follows as for in this range, . Hence
As an aside, this is essentially the point at which the conjectured Eq. (10) would have come into play for the proof of the optimality of , bounding the growth of the corresponding term for that policy. We will essentially prove a successful version of that conjecture here. Define the events