1 Main Model
Let $\mathcal{F}$ be a known family of probability densities on $\mathbb{R}$, each with finite mean. We define $\mu(f)$ to be the expected value under density $f$, and $S(f)$ to be the support of $f$. Consider the problem of sequentially sampling from a finite number $N \geq 2$ of populations or ‘bandits’, where measurements from population $i$ are specified by an i.i.d. sequence of random variables $\{X^i_k\}_{k \geq 1}$ with density $f_i \in \mathcal{F}$. We take each $f_i$ as unknown to the controller. It is convenient to define, for each $i$, $\mu_i = \mu(f_i)$ and $\mu^* = \max_j \mu_j$. Additionally, we take $\Delta_i = \mu^* - \mu_i \geq 0$, the discrepancy of bandit $i$. We note, but for simplicity will not consider explicitly, that both discrete and continuous distributions can be studied when one takes $\{X^i_k\}_{k \geq 1}$ to be i.i.d. with density $f_i$, with respect to some known measure $\nu_i$.
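For any adaptive, nonanticipatory policy $\pi$, $\pi(t) = i$ indicates that the controller samples bandit $i$ at time $t$. Define $T^i_\pi(n) = \sum_{t=1}^{n} \mathbb{1}\{\pi(t) = i\}$, denoting the number of times bandit $i$ has been sampled during the periods $t = 1, \ldots, n$ under policy $\pi$; we take, as a convenience, $T^i_\pi(0) = 0$ for all $i, \pi$. The value of a policy $\pi$ is the expected sum of the first $n$ outcomes under $\pi$, which we define to be the function
(1) $$V_\pi(n) = \mathbb{E}\left[ \sum_{i=1}^{N} \sum_{k=1}^{T^i_\pi(n)} X^i_k \right] = \sum_{i=1}^{N} \mu_i \, \mathbb{E}\left[ T^i_\pi(n) \right],$$
where for simplicity the dependence of $V_\pi(n)$ on the unknown densities is suppressed. The regret of a policy is taken to be the expected loss due to ignorance of the underlying distributions by the controller. Had the controller complete information, she would at every round activate some bandit $i^*$ such that $\mu_{i^*} = \mu^*$. For a given policy $\pi$, we define the expected regret of that policy at time $n$ as
(2) $$R_\pi(n) = n \mu^* - V_\pi(n) = \sum_{i=1}^{N} \Delta_i \, \mathbb{E}\left[ T^i_\pi(n) \right].$$
We are interested in policies for which $V_\pi(n)$ grows as fast as possible with $n$, or equivalently that $R_\pi(n)$ grows as slowly as possible with $n$.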
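As a small illustration of the regret just defined, the sketch below evaluates the standard decomposition of regret as the sum, over bandits, of the discrepancy times the number of pulls. The function name `regret_from_counts` and the numbers in the example are hypothetical and chosen only for illustration; the counts stand in for the expected sample counts under some policy.

```python
from typing import Sequence


def regret_from_counts(means: Sequence[float], counts: Sequence[float]) -> float:
    """Regret of an allocation: sum over bandits of (mu* - mu_i) * T_i."""
    if len(means) != len(counts):
        raise ValueError("means and counts must have the same length")
    best = max(means)  # mu*, the largest population mean
    return sum((best - m) * c for m, c in zip(means, counts))


# Hypothetical example: three bandits with means 5.0, 4.5 and 4.0,
# sampled 90, 7 and 3 times respectively over n = 100 rounds.
example_regret = regret_from_counts([5.0, 4.5, 4.0], [90, 7, 3])
```

Only pulls of suboptimal bandits contribute, so a policy keeps regret small exactly by keeping the sample counts of suboptimal bandits small.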
2 Preliminaries - Background
We restrict $\mathcal{F}$ in the following way:
Assumption 1. Given any set of bandit densities $\{f_i\}_{i=1}^{N} \subset \mathcal{F}$, for any suboptimal bandit $i$, i.e., $\mu_i \neq \mu^*$, there exists some $\tilde{f}_i \in \mathcal{F}$ such that $S(f_i) \subset S(\tilde{f}_i)$, and $\mu(\tilde{f}_i) > \mu^*$.
Effectively, this ensures that at any finite time, given a set of bandits under consideration, for any bandit there is a density in $\mathcal{F}$ that would both potentially explain the measurements from that bandit, and make it the unique optimal bandit of the set.
The focus of this paper is on $\mathcal{F}$ as the set of uniform densities over some unknown support.
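Let $I(f, g)$ denote the Kullback-Leibler divergence of density $g$ from $f$,
(3) $$I(f, g) = \int_{S(f)} \ln\left( \frac{f(x)}{g(x)} \right) f(x) \, dx = \mathbb{E}_f\left[ \ln\left( \frac{f(X)}{g(X)} \right) \right].$$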
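It is a simple generalization of a classical result (part 1 of Theorem 1) of Burnetas and Katehakis (1996b) that if a policy $\pi$ is uniformly fast (UF), i.e., $R_\pi(n) = o(n^\alpha)$ for all $\alpha > 0$ and for any choice of $\{f_i\} \subset \mathcal{F}$, then the following bound holds:
(4) $$\liminf_{n \to \infty} \frac{R_\pi(n)}{\ln n} \geq M_{\mathrm{BK}}(\{f_i\}),$$
where the bound itself is determined by the specific distributions of the populations:
(5) $$M_{\mathrm{BK}}(\{f_i\}) = \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i}{\inf_{g \in \mathcal{F}} \left\{ I(f_i, g) : \mu(g) > \mu^* \right\}}.$$
For a given set of densities $\mathcal{F}$, it is of interest to construct policies $\pi$ such that $\lim_{n \to \infty} R_\pi(n) / \ln n = M_{\mathrm{BK}}(\{f_i\})$ for every choice of $\{f_i\} \subset \mathcal{F}$.
Such policies achieve the slowest (maximum) regret (value) growth rate possible among UF policies. They have been called UM or asymptotically optimal or efficient, cf. Burnetas and Katehakis (1996b).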
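For a given $f_i \in \mathcal{F}$, let $\hat{f}^i_k$ be an estimator of $f_i$ based on the first $k$ samples from $f_i$. It was shown in Burnetas and Katehakis (1996b) that under sufficient conditions on $\{\hat{f}^i_k\}$, asymptotically optimal (UM) UCB-policies could be constructed by initially sampling each bandit some number $n_0$ of times, and then for $n \geq n_0 N$, following an index policy:
(6) $$\pi^0(n+1) = \arg\max_i \, u_i\left(n, T^i_{\pi^0}(n)\right),$$
where the indices $u_i(n, t)$ are ‘inflations of the current estimates for the means’ (ISM), specified as:
(7) $$u_i(n, t) = \sup_{g \in \mathcal{F}} \left\{ \mu(g) : I(\hat{f}^i_t, g) < \frac{\ln n}{t} \right\}.$$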
The sufficient conditions on the estimators are as follows:
Defining
for all choices of and all , , the following hold for each as
These conditions correspond to Conditions A1-A3 given in Burnetas and Katehakis (1996b). However, under the stated Assumption 1 on $\mathcal{F}$ given here, Condition A1 therein is automatically satisfied. Conditions A2 (see also Remark 4(b) in Burnetas and Katehakis (1996b)) and A3 are given as C1 and C2, above, respectively. Note, Condition (C1) is essentially satisfied as long as the estimated density converges to the true density (and hence so do the quantities derived from it) sufficiently quickly in the number of samples. This can often be verified easily with standard large deviation principles. The difficulty in proving the optimality of policy $\pi^0$ is often in verifying that Condition (C2) holds.
The above discussion is a parameter-free variation of that in Burnetas and Katehakis (1996b), where $\mathcal{F}$ was taken to be parametrizable, i.e., $\mathcal{F} = \{ f_\theta : \theta \in \Theta \}$, taking $\theta$ as a vector of parameters in some parameter space $\Theta$. Further, Burnetas and Katehakis (1996b) considered potentially different parameter spaces (and therefore potentially different parametric forms) for each bandit $i$. There, Conditions A1-A3 (hence C1, C2 herein) and the corresponding indices were stated in terms of estimates for the bandit parameters, with $\hat{\theta}^i_t$ an estimate of the parameters of bandit $i$, given $t$ samples. In particular, Eq. (7) appears essentially as
(8) $$u_i(n, t) = \sup_{\theta' \in \Theta} \left\{ \mu(f_{\theta'}) : I(f_{\hat{\theta}^i_t}, f_{\theta'}) < \frac{\ln n}{t} \right\}.$$
Previous work in this area includes Robbins (1952), and additionally Gittins (1979), Lai and Robbins (1985) and Weber (1992); there is a large literature on versions of this problem, cf. Burnetas and Katehakis (2003), Burnetas and Katehakis (1997b) and references therein. For recent work in this area we refer to Audibert et al. (2009), Auer and Ortner (2010), Gittins et al. (2011), Bubeck and Slivkins (2012), Cappé et al. (2013), Kaufmann (2015), Li et al. (2014), cowan15s, Cowan and Katehakis (2015), and references therein. For more general dynamic programming extensions we refer to Burnetas and Katehakis (1997a), Butenko et al. (2003), Tewari and Bartlett (2008), Audibert et al. (2009), Littman (2012), Feinberg et al. (2014) and references therein. To our knowledge, outside the work in Lai and Robbins (1985), Burnetas and Katehakis (1996b) and Burnetas and Katehakis (1997a), asymptotically optimal policies have only been developed in Honda and Takemura (2013) for the problem discussed herein, and in Honda and Takemura (2011) and Honda and Takemura (2010) for the problem of finite known support, where optimal policies, cyclic and randomized, that are simpler to implement than those considered in Burnetas and Katehakis (1996b) were constructed. Other related work in this area includes: Katehakis and Derman (1986), Katehakis and Veinott Jr (1987), Burnetas and Katehakis (1993), Burnetas and Katehakis (1996a), Lagoudakis and Parr (2003), Bartlett and Tewari (2009), Tekin and Liu (2012), Jouini et al. (2009), Dayanik et al. (2013), Filippi et al. (2010), Osband and Van Roy (2014), Burnetas and Katehakis (1997a), Androulakis and Dimitrakakis (2014), Dimitrakakis (2012).
3 The BK Lower Bounds and Inflation Factors
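In this section we take $\mathcal{F}$ as the set of probability densities on $\mathbb{R}$ uniform over some finite interval, taking $f_{a,b}$ as uniform over $[a, b]$. Note, as the family of densities is parametrizable, this largely falls under the scope of Burnetas and Katehakis (1996b). However, the results to follow seem to demonstrate a hole in that general treatment of the problem.
Note, some care with respect to support must be taken in applying Burnetas and Katehakis (1996b) to this case, to ensure that the integrals remain well defined. But for this $\mathcal{F}$, we have that for a given $f = f_{a,b} \in \mathcal{F}$, for any $g = f_{\tilde{a},\tilde{b}} \in \mathcal{F}$ such that $S(f) \subset S(g)$, i.e., $\tilde{a} \leq a$ and $b \leq \tilde{b}$,
(9) $$I(f, g) = \int_a^b \frac{1}{b - a} \ln\left( \frac{\tilde{b} - \tilde{a}}{b - a} \right) dx = \ln\left( \frac{\tilde{b} - \tilde{a}}{b - a} \right).$$
If $S(f)$ is not a subset of $S(g)$, we take $I(f, g)$ as infinite.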
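The divergence between two uniform densities can be sketched in a few lines. The following assumes the closed form for uniforms, namely the logarithm of the ratio of interval lengths when the support of the first density is contained in the support of the second, and infinity otherwise. The function name `kl_uniform` and its argument names are hypothetical.

```python
import math


def kl_uniform(a: float, b: float, a_tilde: float, b_tilde: float) -> float:
    """Divergence I(f, g) for f uniform on [a, b] and g uniform on [a_tilde, b_tilde]."""
    if not (a < b and a_tilde < b_tilde):
        raise ValueError("both intervals must have positive length")
    # Finite only when the support of f is contained in the support of g.
    if a_tilde <= a and b <= b_tilde:
        return math.log((b_tilde - a_tilde) / (b - a))
    return math.inf


# Doubling the interval length costs ln 2; a support violation costs infinity.
wider = kl_uniform(0.0, 1.0, 0.0, 2.0)
violated = kl_uniform(0.0, 2.0, 0.0, 1.0)
```

The asymmetry matters for what follows: any candidate density that could have generated the observed samples must cover their whole observed range.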
For notational convenience, given $\{f_i\} \subset \mathcal{F}$, for each $i$, we take $f_i$ as supported on some interval $[a_i, b_i]$. Note then, $\mu_i = (a_i + b_i)/2$.
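Given $t$ samples from bandit $i$, $X^i_1, \ldots, X^i_t$, we take
(10) $$\hat{a}^i_t = \min_{k \leq t} X^i_k, \qquad \hat{b}^i_t = \max_{k \leq t} X^i_k$$
as the maximum-likelihood estimators of $a_i$ and $b_i$ respectively. We may then define $\hat{f}^i_t$ as the uniform density over the interval $[\hat{a}^i_t, \hat{b}^i_t]$. Note, $\hat{f}^i_t$ is the maximum-likelihood estimate of $f_i$.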
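A minimal sketch of these estimators, assuming the samples from one bandit are available as a plain list: the maximum-likelihood estimates of the endpoints are the sample minimum and maximum, and the midpoint of the estimated interval is the plug-in estimate of the mean. The name `uniform_mle` and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def uniform_mle(samples: Sequence[float]) -> Tuple[float, float]:
    """Maximum-likelihood estimates (a_hat, b_hat) of a uniform support."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate an interval")
    return min(samples), max(samples)


# Hypothetical observations from a single bandit.
a_hat, b_hat = uniform_mle([3.2, 1.5, 4.8, 2.0])
mean_hat = 0.5 * (a_hat + b_hat)  # plug-in estimate of the mean
```

The estimated interval always lies inside the true support, so the estimated length never exceeds the true length; this is one reason an inflation of the estimated range is natural for this family.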
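We can now state and prove the following.
Under Assumption 1 the following are true.
(11) $$M_{\mathrm{BK}}(\{f_i\}) = \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i}{\ln\left( 1 + \frac{2 \Delta_i}{b_i - a_i} \right)},$$
(12) $$u_i(n, t) = \hat{a}^i_t + \frac{1}{2} \left( \hat{b}^i_t - \hat{a}^i_t \right) n^{1/t}.$$
Eq. (11) follows from Eq. (5) and the observation that in this case:
$$\inf_{g \in \mathcal{F}} \left\{ I(f_i, g) : \mu(g) > \mu^* \right\} = \inf_{\tilde{a} \leq a_i, \, \tilde{b} \geq b_i} \left\{ \ln\left( \frac{\tilde{b} - \tilde{a}}{b_i - a_i} \right) : \frac{\tilde{a} + \tilde{b}}{2} > \mu^* \right\} = \ln\left( \frac{2 \mu^* - 2 a_i}{b_i - a_i} \right) = \ln\left( 1 + \frac{2 \Delta_i}{b_i - a_i} \right).$$
For Eq. (12) we have:
(13) $$u_i(n, t) = \sup_{\tilde{a} \leq \hat{a}^i_t, \, \tilde{b} \geq \hat{b}^i_t} \left\{ \frac{\tilde{a} + \tilde{b}}{2} : \ln\left( \frac{\tilde{b} - \tilde{a}}{\hat{b}^i_t - \hat{a}^i_t} \right) < \frac{\ln n}{t} \right\} = \sup_{\tilde{a} \leq \hat{a}^i_t, \, \tilde{b} \geq \hat{b}^i_t} \left\{ \frac{\tilde{a} + \tilde{b}}{2} : \tilde{b} - \tilde{a} < \left( \hat{b}^i_t - \hat{a}^i_t \right) n^{1/t} \right\} = \hat{a}^i_t + \frac{1}{2} \left( \hat{b}^i_t - \hat{a}^i_t \right) n^{1/t}.$$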
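To make the lower-bound constant concrete, the sketch below evaluates it for a given set of uniform bandits. It assumes the closed form obtained by minimizing the uniform divergence $\ln((\tilde{b} - \tilde{a})/(b_i - a_i))$ over uniform densities whose support contains $[a_i, b_i]$ and whose mean exceeds $\mu^*$, which gives $\ln(1 + 2\Delta_i/(b_i - a_i))$ for each suboptimal bandit. The function name `bk_constant` and the two-bandit example are hypothetical.

```python
import math
from typing import Sequence, Tuple


def bk_constant(bandits: Sequence[Tuple[float, float]]) -> float:
    """Lower-bound constant: sum over suboptimal bandits of
    gap_i / ln(1 + 2 * gap_i / (b_i - a_i)), for supports (a_i, b_i)."""
    means = [0.5 * (a + b) for a, b in bandits]
    best = max(means)
    total = 0.0
    for (a, b), mean in zip(bandits, means):
        gap = best - mean
        if gap > 0:  # optimal bandits contribute nothing
            total += gap / math.log(1.0 + 2.0 * gap / (b - a))
    return total


# Hypothetical instance: uniform on [0, 1] versus uniform on [0, 2].
constant = bk_constant([(0.0, 1.0), (0.0, 2.0)])
```

In this instance the suboptimal bandit has gap 0.5 and interval length 1, so the constant is 0.5 / ln 2; any uniformly fast policy must therefore incur regret growing at least like that multiple of ln n.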
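We are interested in policies $\pi$ such that $R_\pi(n)$ achieves the lower bound indicated above, for every choice of $\{f_i\} \subset \mathcal{F}$. Following the prescription of Burnetas and Katehakis (1996b), i.e. Eq. (12), would lead to the following policy,
Policy BK-UCB: $\pi_{\mathrm{BK}}$. At each $n = 1, 2, \ldots$:
i) For $n = 1, \ldots, 2N$ sample each bandit twice, and
ii) for $n \geq 2N$, let $\pi_{\mathrm{BK}}(n+1)$ be equal to:
(14) $$\arg\max_i \left\{ \hat{a}^i_{T^i_{\pi_{\mathrm{BK}}}(n)} + \frac{1}{2} \left( \hat{b}^i_{T^i_{\pi_{\mathrm{BK}}}(n)} - \hat{a}^i_{T^i_{\pi_{\mathrm{BK}}}(n)} \right) n^{1 / T^i_{\pi_{\mathrm{BK}}}(n)} \right\},$$
breaking ties arbitrarily.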
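The index step of such a policy can be sketched as follows. The sketch assumes the index form that results from maximizing the mean over uniform densities within divergence $\ln n / t$ of the estimated density, i.e. the estimated lower endpoint plus half the estimated range inflated by $n^{1/t}$. The names `bk_index` and `choose_bandit` and the example values are hypothetical.

```python
from typing import Sequence, Tuple


def bk_index(a_hat: float, b_hat: float, n: int, t: int) -> float:
    """Inflated index a_hat + (b_hat - a_hat) / 2 * n**(1/t) after t samples at round n."""
    if n < 1 or t < 1 or b_hat < a_hat:
        raise ValueError("need n >= 1, t >= 1 and a_hat <= b_hat")
    return a_hat + 0.5 * (b_hat - a_hat) * n ** (1.0 / t)


def choose_bandit(estimates: Sequence[Tuple[float, float]],
                  counts: Sequence[int], n: int) -> int:
    """Return the position of the bandit with the largest index (first one on ties)."""
    indices = [bk_index(a, b, n, t) for (a, b), t in zip(estimates, counts)]
    return max(range(len(indices)), key=indices.__getitem__)


# Hypothetical state after n = 52 rounds: bandit 0 has 2 samples, bandit 1 has 50.
pick = choose_bandit([(0.1, 0.9), (0.0, 2.0)], [2, 50], 52)
```

Here the rarely sampled bandit is chosen even though its estimated mean is lower, because the factor $n^{1/t}$ is large when $t$ is small; as $t$ grows the factor tends to 1 and the index approaches the estimated mean.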
It is easy to demonstrate that the estimators $\hat{f}^i_t$ converge to $f_i$ in probability sufficiently quickly that Condition (C1) above is satisfied for $\pi_{\mathrm{BK}}$. Proving that Condition (C2) is satisfied, however, is much more difficult, and in fact we conjecture that (C2) does not hold for policy $\pi_{\mathrm{BK}}$. While this does not indicate that $\pi_{\mathrm{BK}}$ fails to achieve asymptotic optimality, it does imply that the standard techniques are insufficient to verify it. However, asymptotic optimality may provably be achieved by a (seemingly) negligible modification, via the following policy.
4 Asymptotically Optimal UCB Policy
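We propose the following policy:
Policy UCB-Uniform: $\pi_{\mathrm{CK}}$. At each $n = 1, 2, \ldots$:
i) For $n = 1, \ldots, 3N$ sample each bandit three times, and
ii) for $n \geq 3N$, let $\pi_{\mathrm{CK}}(n+1)$ be equal to:
(15) $$\arg\max_i \left\{ \hat{a}^i_{T^i_{\pi_{\mathrm{CK}}}(n)} + \frac{1}{2} \left( \hat{b}^i_{T^i_{\pi_{\mathrm{CK}}}(n)} - \hat{a}^i_{T^i_{\pi_{\mathrm{CK}}}(n)} \right) n^{1 / (T^i_{\pi_{\mathrm{CK}}}(n) - 2)} \right\},$$
breaking ties arbitrarily.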
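A simulation sketch of a policy of this type is given below. It is an illustration under stated assumptions, not the authors' implementation: the index is assumed to be the estimated lower endpoint plus half the estimated range inflated by $n^{1/(t - s)}$, where the shift `shift` ($s$) and the number of initial samples per bandit `init` are parameters of the sketch. The defaults `shift=2`, `init=3` are meant to mirror the three initial samples described above, while `shift=0`, `init=2` would mirror the unmodified prescription. All function and parameter names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def simulate_inflated_range_policy(bandits: Sequence[Tuple[float, float]],
                                   horizon: int, shift: int = 2,
                                   init: int = 3, seed: int = 0) -> List[int]:
    """Run the index policy on uniform bandits given as (a_i, b_i) pairs.

    Returns the number of times each bandit was sampled over `horizon` rounds.
    """
    if init <= shift:
        raise ValueError("init must exceed shift so the exponent is defined")
    n_arms = len(bandits)
    if horizon < init * n_arms:
        raise ValueError("horizon too short for the initial sampling phase")
    rng = random.Random(seed)
    low = [math.inf] * n_arms    # running minima (a_hat)
    high = [-math.inf] * n_arms  # running maxima (b_hat)
    counts = [0] * n_arms

    def pull(i: int) -> None:
        a, b = bandits[i]
        x = rng.uniform(a, b)
        low[i] = min(low[i], x)
        high[i] = max(high[i], x)
        counts[i] += 1

    # Initial phase: sample every bandit `init` times.
    for i in range(n_arms):
        for _ in range(init):
            pull(i)

    # Index phase: n samples have been taken; pick the bandit for round n + 1.
    for n in range(init * n_arms, horizon):
        def index(i: int) -> float:
            inflation = n ** (1.0 / (counts[i] - shift))
            return low[i] + 0.5 * (high[i] - low[i]) * inflation
        pull(max(range(n_arms), key=index))
    return counts


# Hypothetical instance: uniform on [0, 1] versus uniform on [0, 10].
pull_counts = simulate_inflated_range_policy([(0.0, 1.0), (0.0, 10.0)], horizon=2000)
```

Averaging the regret implied by such counts over many seeds gives an empirical regret curve that can be compared against the logarithmic lower bound.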
In the remainder of this paper, we verify the asymptotic optimality of $\pi_{\mathrm{CK}}$, and additionally give finite horizon bounds on the regret under this policy. Further, while the first of these finite horizon results bounds the order of the remainder term, that bound is refined somewhat in a subsequent theorem.
5 Simulation Comparisons of the Sampling
In order to obtain a picture of the benefits of the sampling policy, we compared it with the best known alternatives. In both figures below, curve () () is a plot of the average (over repetitions in Fig. 1 and repetitions in Fig. 2) regret of sampling using policies , , and , respectively; where policy , is based on the sampling policy in Katehakis and Robbins (1995), and is a recently shown, cf. Cowan et al. (2015)
, asymptotically optimal policy for the case in which the population outcomes distributions are normal with unknown means and unknown variances. Specifically, given
samples from bandit at round (global time) , and are maximum index based policies with indices , , and where the first is defined by Eq. (16) and the other two are given by: and where .

$i$      1     2     3     4     5     6
$a_i$    0     0     0     1     1     1
$b_i$    10    9     8     9.5   10    5
Table 1
These graphs clearly illustrate the benefit of using the optimal policy.
5.1 Acknowledgments.
PhD student Daniel Pirutinsky did the simulation work underlying Figures 1 and 2. Support for this project was provided by the National Science Foundation (NSF grant CMMI-1450743).
5.2 Additional Proofs
For , for all ,
(47) 
[Proof of Proposition 5.2] Let . We have
(48) 
Here we may make use of the following bounds, that for , ,
(49) 
Applying these to the above,
(50) 
Hence, taking ,
(51) 
At this point, taking and yields
(52) 
which, rounding up, completes the result.
For , and , the following bound holds:
(53) 
[Proof of Proposition 5.2] Let denote the RHS of the above, denote the left. We adopt the physicists’ convention of denoting the partial derivative of with respect to as .
Note, . Hence, it suffices to demonstrate that over this range or, since they are both positive,
(54) 
We take, for convenience, , and want to show that for :
(55) 
The above inequality holds when . Taking as the above simplified ratio, it suffices to show that . Simplifying this inequality and canceling the positive factors, it is equivalent to show that , or taking ,
(56) 
This is a fairly standard and easily verified inequality for . This completes the proof.
References
 Androulakis and Dimitrakakis (2014) Emmanouil G Androulakis and Christos Dimitrakakis. Generalised entropy mdps and minimax regret. arXiv preprint arXiv:1412.3276, 2014.
 Audibert et al. (2009) Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876 – 1902, 2009.
 Auer and Ortner (2010) Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55 – 65, 2010.

 Bartlett and Tewari (2009) Peter L Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35 – 42. AUAI Press, 2009.
 Bubeck and Slivkins (2012) Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. arXiv preprint arXiv:1202.4473, 2012.
 Burnetas and Katehakis (1993) Apostolos N Burnetas and Michael N Katehakis. On sequencing two types of tasks on a single processor under incomplete information. Probability in the Engineering and Informational Sciences, 7(1):85 – 119, 1993.
 Burnetas and Katehakis (1996a) Apostolos N Burnetas and Michael N Katehakis. On large deviations properties of sequential allocation problems. Stochastic Analysis and Applications, 14(1):23 – 31, 1996a.
 Burnetas and Katehakis (1996b) Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122 – 142, 1996b.

 Burnetas and Katehakis (1997a) Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222 – 255, 1997a.
 Burnetas and Katehakis (1997b) Apostolos N Burnetas and Michael N Katehakis. On the finite horizon one-armed bandit problem. Stochastic Analysis and Applications, 16(1):845 – 859, 1997b.
 Burnetas and Katehakis (2003) Apostolos N Burnetas and Michael N Katehakis. Asymptotic Bayes analysis for the finite-horizon one-armed-bandit problem. Probability in the Engineering and Informational Sciences, 17(01):53 – 82, 2003.
 Butenko et al. (2003) Sergiy Butenko, Panos M Pardalos, and Robert Murphey. Cooperative Control: Models, Applications, and Algorithms. Kluwer Academic Publishers, 2003.
 Cappé et al. (2013) Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516 – 1541, 2013.
 Cowan and Katehakis (2015) Wesley Cowan and Michael N Katehakis. Multi-armed bandits under general depreciation and commitment. Probability in the Engineering and Informational Sciences, 29(01):51 – 76, 2015.
 Cowan et al. (2015) Wesley Cowan, Junya Honda, and Michael N Katehakis. Asymptotic optimality, finite horizon regret bounds, and a solution to an open problem. Journal of Machine Learning Research, to appear; preprint arXiv:1504.05823, 2015.
 Dayanik et al. (2013) Savas Dayanik, Warren B Powell, and Kazutoshi Yamazaki. Asymptotically optimal Bayesian sequential change detection and identification rules. Annals of Operations Research, 208(1):337 – 370, 2013.
 Dimitrakakis (2012) Christos Dimitrakakis. Robust bayesian reinforcement learning through tight lower bounds. In Recent Advances in Reinforcement Learning, pages 177–188. Springer, 2012.
 Feinberg et al. (2014) Eugene A Feinberg, Pavlo O Kasyanov, and Michael Z Zgurovsky. Convergence of value iterations for total-cost mdps and pomdps with general state and action sets. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014 IEEE Symposium on, pages 1 – 8. IEEE, 2014.

 Filippi et al. (2010) Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning based on Kullback-Leibler divergence. In 48th Annual Allerton Conference on Communication, Control, and Computing, 2010.
 Gittins (1979) John C. Gittins. Bandit processes and dynamic allocation indices (with discussion). J. Roy. Stat. Soc. Ser. B, 41:335 – 340, 1979.
 Gittins et al. (2011) John C. Gittins, Kevin Glazebrook, and Richard R. Weber. Multi-armed Bandit Allocation Indices. John Wiley & Sons, West Sussex, U.K., 2011.
 Honda and Takemura (2010) Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In COLT, pages 67 – 79. Citeseer, 2010.
 Honda and Takemura (2011) Junya Honda and Akimichi Takemura. An asymptotically optimal policy for finite support models in the multi-armed bandit problem. Machine Learning, 85(3):361 – 391, 2011.
 Honda and Takemura (2013) Junya Honda and Akimichi Takemura. Optimality of Thompson sampling for Gaussian bandits depends on priors. arXiv preprint arXiv:1311.1894, 2013.
 Jouini et al. (2009) Wassim Jouini, Damien Ernst, Christophe Moy, and Jacques Palicot. Multi-armed bandit based policies for cognitive radio’s decision making issues. In 3rd international conference on Signals, Circuits and Systems (SCS), 2009.
 Katehakis and Derman (1986) Michael N Katehakis and Cyrus Derman. Computing optimal sequential allocation rules. In Clinical Trials, volume 8 of Lecture Note Series: Adaptive Statistical Procedures and Related Topics, pages 29 – 39. Institute of Math. Stats., 1986.
 Katehakis and Robbins (1995) Michael N Katehakis and Herbert Robbins. Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584, 1995.
 Katehakis and Veinott Jr (1987) Michael N Katehakis and Arthur F Veinott Jr. The multi-armed bandit problem: decomposition and computation. Math. Oper. Res., 12:262 – 268, 1987.
 Kaufmann (2015) Emilie Kaufmann. Analyse de stratégies Bayésiennes et fréquentistes pour l’allocation séquentielle de ressources. Doctorat, ParisTech., Jul. 31 2015.
 Lagoudakis and Parr (2003) Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107 – 1149, 2003.
 Lai and Robbins (1985) Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4 – 22, 1985.
 Li et al. (2014) Lihong Li, Remi Munos, and Csaba Szepesvári. On minimax optimal offline policy evaluation. arXiv preprint arXiv:1409.3653, 2014.
 Littman (2012) Michael L Littman. Inducing partially observable Markov decision processes. In ICGI, pages 145 – 148, 2012.
 Osband and Van Roy (2014) Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored mdps. In Advances in Neural Information Processing Systems, pages 604 – 612, 2014.
 Robbins (1952) Herbert Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58:527 – 536, 1952.
 Tekin and Liu (2012) Cem Tekin and Mingyan Liu. Approximately optimal adaptive learning in opportunistic spectrum access. In INFOCOM, 2012 Proceedings IEEE, pages 1548 – 1556. IEEE, 2012.

 Tewari and Bartlett (2008) Ambuj Tewari and Peter L Bartlett. Optimistic linear programming gives logarithmic regret for irreducible mdps. In Advances in Neural Information Processing Systems, pages 1505 – 1512, 2008.
 Weber (1992) Richard R Weber. On the Gittins index for multi-armed bandits. The Annals of Applied Probability, 2(4):1024 – 1033, 1992.