This paper presents new analyses of Bayesian-flavored strategies for sequential resource allocation in an unknown, stochastic environment modeled as a multi-armed bandit. A stochastic multi-armed bandit model is a set of $K$ probability distributions, $\nu_1, \dots, \nu_K$, called arms, with which an agent interacts in a sequential way. At round $t$, the agent, who does not know the arms’ distributions, chooses an arm $A_t$. The draw of this arm produces an independent sample $X_t$ from the associated probability distribution $\nu_{A_t}$, often interpreted as a reward. Indeed, the arms can be viewed as those of different slot machines, also called one-armed bandits, generating rewards according to some underlying probability distribution.
In several applications that range from the motivating example of clinical trials to the more modern motivation of online advertisement, the goal of the agent is to adjust his strategy $\mathcal{A}$, also called a bandit algorithm, in order to maximize the rewards accumulated during his interaction with the bandit model. The adopted strategy has to be sequential, in the sense that the next arm to play is chosen based on past observations: letting $\mathcal{F}_t$ be the $\sigma$-field generated by the observations up to round $t$, $A_t$ is $\sigma(\mathcal{F}_{t-1}, U_t)$-measurable, where $U_t$ is a uniform random variable independent from $\mathcal{F}_{t-1}$ (as algorithms may be randomized).
More precisely, the goal is to design a sequential strategy maximizing the expectation of the sum of rewards up to some horizon $T$. If $\mu_1, \dots, \mu_K$ denote the means of the arms, and $\mu^* = \max_a \mu_a$, this is equivalent to minimizing the regret, defined as the expected difference between the reward accumulated by an oracle strategy always playing the best arm, and the reward accumulated by a strategy $\mathcal{A}$:
$$R(T, \mathcal{A}) := \mathbb{E}\left[T\mu^* - \sum_{t=1}^{T} X_t\right]. \qquad (1)$$
The expectation is taken with respect to the randomness in the sequence of successive rewards from each arm $a$, denoted by $(Y_{a,s})_{s \in \mathbb{N}}$, and the possible randomization of the algorithm, $(U_t)_t$. We denote by $N_a(t)$ the number of draws from arm $a$ at the end of round $t$, so that $X_t = Y_{A_t, N_{A_t}(t)}$.
This paper focuses on good strategies in parametric bandit models, in which the distribution of arm $a$ depends on some parameter $\theta_a$: we write $\nu_a = \nu_{\theta_a}$. Like in every parametric model, two different views can be adopted. In the frequentist view, $\theta = (\theta_1, \dots, \theta_K)$ is an unknown parameter. In the Bayesian view, $\theta$ is a random variable, drawn from a prior distribution $\Pi$. More precisely, we define $\mathbb{P}_\theta$ (resp. $\mathbb{E}_\theta$) the probability (resp. expectation) under the probabilistic model in which for all $a$, $(Y_{a,s})_s$ is i.i.d. distributed under $\nu_{\theta_a}$, and $\mathbb{P}^\Pi$ (resp. $\mathbb{E}^\Pi$) the probability (resp. expectation) under the probabilistic model in which for all $a$, $(Y_{a,s})_s$ is i.i.d. conditionally on $\theta_a$ with conditional distribution $\nu_{\theta_a}$, and $\theta \sim \Pi$. The expectation in (1) can thus be taken under either of these two probabilistic models. In the first case this leads to the notion of frequentist regret, which depends on $\theta$:
In the second case, this leads to the notion of Bayesian regret, sometimes called Bayes risk in the literature, which depends on the prior distribution $\Pi$:
The first bandit strategy was introduced by Thompson in 1933 in a Bayesian framework, and a large part of the early work on bandit models adopted the same perspective [10, 7, 19, 8]. Indeed, as Bayes risk minimization has an exact, yet often intractable, solution, finding ways to efficiently compute this solution has been an important line of research. Since 1985 and the seminal work of Lai and Robbins, there has also been a precise characterization of good bandit algorithms in a frequentist sense. They show that for any uniformly efficient policy (i.e. such that for all $\theta$, the regret is $o(T^\alpha)$ for all $\alpha \in (0, 1]$), the number of draws of any sub-optimal arm $a$ ($\mu_a < \mu^*$) is asymptotically lower bounded as follows:
$$\liminf_{T \to \infty} \frac{\mathbb{E}_\theta[N_a(T)]}{\log T} \geq \frac{1}{\mathrm{KL}(\nu_{\theta_a}, \nu_{\theta^*})}, \qquad (4)$$
where $\mathrm{KL}(\nu_{\theta_a}, \nu_{\theta^*})$ denotes the Kullback-Leibler divergence between the distributions $\nu_{\theta_a}$ and $\nu_{\theta^*}$, the latter being the distribution of an optimal arm. From (2), this yields a lower bound on the regret.
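To give a feel for the size of this lower bound, the sketch below evaluates the constant in front of $\log T$ for Bernoulli arms, namely the sum over sub-optimal arms of $(\mu^* - \mu_a)$ divided by the KL divergence between arm $a$ and the best arm. The function names and the example means are illustrative assumptions, not taken from the paper.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(means):
    """Sum over sub-optimal arms of (mu* - mu_a) / KL(mu_a, mu*)."""
    best = max(means)
    return sum((best - mu) / kl_bernoulli(mu, best) for mu in means if mu < best)

# Illustrative Bernoulli means (an assumption made for this example only).
means = [0.1, 0.2, 0.5]
constant = lai_robbins_constant(means)
for horizon in (10**3, 10**4, 10**5):
    # Asymptotic regret lower bound: constant * log(T).
    print(horizon, constant * math.log(horizon))
```

The closer a sub-optimal mean is to the best one, the smaller the divergence and the more draws of that arm any uniformly efficient policy needs.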
This result holds for simple parametric bandit models, including exponential family bandit models presented in Section 2, that will be our main focus in this paper. It paved the way to a new line of research, aimed at building asymptotically optimal strategies, that is, strategies matching the lower bound (4) for some classes of distributions. Most of the algorithms proposed since then belong to the family of index policies, which compute at each round one index per arm, depending on the history of rewards observed from this arm only, and select the arm with largest index. More precisely, they are UCB-type algorithms, building confidence intervals for the means of the arms and choosing as an index for each arm the associated Upper Confidence Bound (UCB). The design of the confidence intervals has been successively improved [27, 1, 6, 5, 4, 21, 14] so as to obtain simple index policies for which non-asymptotic upper bounds on the regret can be given. Among them, the kl-UCB algorithm matches the lower bound (4) for exponential family bandit models. As they use confidence intervals on unknown parameters, all these index policies are based on frequentist tools. Nevertheless, it is interesting to note that the first index policy was introduced by Gittins in 1979 to solve a Bayesian multi-armed bandit problem and is based on Bayesian tools, i.e. on exploiting the posterior distribution on the parameter of each arm.
However, tools and objectives can be separated: one can compute the Bayes risk of an algorithm based on frequentist tools, or the (frequentist) regret of an algorithm based on Bayesian tools. In this paper, we focus on the latter and advocate the use of index policies inspired by Bayesian tools for minimizing regret, in particular the Bayes-UCB algorithm, which is based on quantiles of the posterior distributions on the means. Our main contribution is to prove that this algorithm is asymptotically optimal, i.e. that it matches the lower bound (4), for any exponential family bandit model and for a large class of prior distributions. Our analysis relies on two new ingredients: tight bounds on the tail of posterior distributions (Lemma 4), and a self-normalized deviation inequality featuring an exploration rate that decreases with the number of observations (Lemma 5). This last tool also allows us to prove the asymptotic optimality of two variants of kl-UCB, called kl-UCB$^+$ and kl-UCB-H$^+$, that display improved empirical performance. Interestingly, the alternative exploration rate used by these two algorithms is already suggested by asymptotic approximations of the Bayesian exact solution or the Finite-Horizon Gittins indices.
The paper is structured as follows. Section 2 introduces the class of exponential family bandit models that we consider in the rest of the paper, and the associated frequentist and Bayesian tools. In Section 3, we present the Bayes-UCB algorithm, and give a proof of its asymptotic optimality. We introduce kl-UCB$^+$ and kl-UCB-H$^+$ in Section 4, in which we prove their asymptotic optimality and also exhibit connections with existing Bayesian policies. In Section 5, we illustrate numerically the good performance of our three asymptotically optimal, Bayesian-flavored index policies in terms of regret. We also investigate their ability to attain an optimal rate in terms of Bayes risk. Some proofs are provided in the supplemental paper.
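Recall that $N_a(t)$ is the number of draws from arm $a$ at the end of round $t$. Letting $\hat{\mu}_{a,s}$ be the empirical mean of the first $s$ rewards from $a$, the empirical mean of arm $a$ after $t$ rounds of the bandit algorithm, $\hat{\mu}_a(t)$, satisfies $\hat{\mu}_a(t) = 0$ if $N_a(t) = 0$, $\hat{\mu}_a(t) = \hat{\mu}_{a, N_a(t)}$ otherwise.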
2 (Bayesian) exponential family bandit models
In the rest of the paper, we consider the important class of exponential family bandit models, in which the arms belong to a one-parameter canonical exponential family.
2.1 Exponential family bandit model
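A one-parameter canonical exponential family is a set $\mathcal{P}$ of probability distributions, indexed by a real parameter $\theta$ called the natural parameter, that is defined by
$$\mathcal{P} = \left\{ \nu_\theta,\ \theta \in \Theta : \nu_\theta \text{ has density } f_\theta(x) = \exp(\theta x - b(\theta)) \text{ with respect to } \lambda \right\},$$
where $\Theta \subseteq \mathbb{R}$ is an open interval, $b$ a twice-differentiable and convex function (called the log-partition function) and $\lambda$ a reference measure. Examples of such distributions include Bernoulli distributions, Gaussian distributions with known variance, Poisson distributions, or Gamma distributions with known shape parameter.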
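If $X \sim \nu_\theta$, it can be shown that $\mathbb{E}[X] = \dot{b}(\theta)$ and $\mathrm{Var}[X] = \ddot{b}(\theta) > 0$, where $\dot{b}$ (resp. $\ddot{b}$) is the derivative (resp. second derivative) of $b$ with respect to the natural parameter $\theta$. Thus there is a one-to-one mapping between the natural parameter $\theta$ and the mean $\mu = \dot{b}(\theta)$, and distributions in an exponential family can be alternatively parametrized by their mean. Letting $J := \dot{b}(\Theta)$, for $\mu \in J$ we denote by $\nu^\mu$ the distribution in $\mathcal{P}$ that has mean $\mu$: $\nu^\mu = \nu_{\dot{b}^{-1}(\mu)}$. The variance $V(\mu)$ of the distribution $\nu^\mu$ is related to its mean in the following way:
$$V(\mu) = \ddot{b}(\dot{b}^{-1}(\mu)). \qquad (5)$$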
In the sequel, we fix an exponential family $\mathcal{P}$ and consider a bandit model $\nu^{\mu} = (\nu^{\mu_1}, \dots, \nu^{\mu_K})$, where $\nu^{\mu_a}$ belongs to $\mathcal{P}$ and has mean $\mu_a$. When considering Bayesian bandit models, we restrict our attention to product prior distributions on $\mu = (\mu_1, \dots, \mu_K)$, such that $\mu_a$ is drawn from a prior distribution on the set of possible means that has density $f_a$ with respect to the Lebesgue measure. We let $\pi_a^t$ be the posterior distribution on $\mu_a$ after the first $t$ rounds of the bandit game. With a slight abuse of notation, we will identify $\pi_a^t$ with its density, for which a more precise expression is provided in Section 2.3.
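As a concrete check of the relations between the log-partition function, the mean and the variance, the sketch below takes the Bernoulli family, for which the log-partition function is $\log(1 + e^\theta)$, and compares numerical derivatives with the known mean and variance. The helper names and the value of the natural parameter are illustrative assumptions.

```python
import math

def b(theta):
    """Log-partition function of the Bernoulli family: log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def mean_from_natural(theta):
    """First derivative of b: the mean, here the logistic function."""
    return 1.0 / (1.0 + math.exp(-theta))

def natural_from_mean(mu):
    """Inverse of the mean map: log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def numerical_derivative(f, x, h=1e-5):
    """Central finite difference, accurate to O(h^2)."""
    return (f(x + h) - f(x - h)) / (2 * h)

theta = 0.7  # arbitrary natural parameter chosen for the example
mu = mean_from_natural(theta)
# The derivative of b should match the mean ...
print(numerical_derivative(b, theta), mu)
# ... and the second derivative should match the variance mu * (1 - mu).
print(numerical_derivative(mean_from_natural, theta), mu * (1 - mu))
```

The same check applies to any other one-parameter family once its log-partition function is substituted for `b`.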
2.2 Kullback-Leibler divergence and confidence intervals
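For distributions that belong to a one-parameter exponential family, the large deviation rate function has a simple and explicit form, featuring the Kullback-Leibler (KL) divergence, and one can build tight confidence intervals on their means. The KL-divergence between two distributions $\nu_\theta$ and $\nu_{\theta'}$ in an exponential family has a closed form expression as a function of the natural parameters $\theta$ and $\theta'$, given by
$$K(\theta, \theta') := \mathrm{KL}(\nu_\theta, \nu_{\theta'}) = \dot{b}(\theta)(\theta - \theta') - b(\theta) + b(\theta'). \qquad (6)$$
We also introduce $d(\mu, \mu')$ as the KL-divergence between the distributions of means $\mu$ and $\mu'$:
$$d(\mu, \mu') := \mathrm{KL}(\nu^\mu, \nu^{\mu'}) = K(\dot{b}^{-1}(\mu), \dot{b}^{-1}(\mu')).$$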
Applying the Cramér-Chernoff method in an exponential family yields an explicit deviation inequality featuring this divergence function: if $\hat{\mu}_s$ is the empirical mean of $s$ samples from $\nu^\mu$ and $x > \mu$, one has $\mathbb{P}(\hat{\mu}_s > x) \leq \exp(-s\, d(x, \mu))$. This inequality can be used to build a confidence interval for $\mu$ based on a fixed number of observations $s$. Inside a bandit algorithm, computing a confidence interval on the mean of an arm requires taking into account the random number of observations available at round $t$. Using a self-normalized deviation inequality, one can show that, at any round $t$ of a bandit game, the kl-UCB index, defined as
where is a real parameter, satisfies and is thus an upper confidence bound on . The exploration rate, which is here , controls the coverage probability of the interval.
Closed-form expressions for the divergence function $d$ in the most common examples of exponential families are available. Using the fact that $y \mapsto d(x, y)$ is increasing when $y > x$, an approximation of the kl-UCB index can then be obtained using, for example, binary search.
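The binary search mentioned above can be sketched as follows for Bernoulli rewards. The exploration rate $\log t + c \log\log t$ used here, the default $c = 0$ and all function names are assumptions made for illustration; the code also checks the natural-parameter form of the divergence against the usual Bernoulli expression.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Divergence between Bernoulli means p and q, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_natural(theta, theta_prime):
    """Same divergence written with natural parameters and the log-partition function."""
    b = lambda x: math.log1p(math.exp(x))      # Bernoulli log-partition function
    b_dot = 1.0 / (1.0 + math.exp(-theta))     # derivative of b at theta (the mean)
    return b_dot * (theta - theta_prime) - b(theta) + b(theta_prime)

def kl_ucb_index(mean, n_draws, t, c=0.0, tol=1e-9):
    """Largest q >= mean such that n_draws * d(mean, q) <= exploration rate.

    The rate log(t) + c * log(log(t)) is an assumed form; log(t) is floored
    at 1 so that the inner logarithm stays non-negative for small t.
    """
    if n_draws == 0:
        return 1.0  # no observation yet: use the upper end of the mean range
    rate = math.log(t) + c * math.log(max(math.log(t), 1.0))
    level = rate / n_draws
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # d(mean, .) is increasing to the right of mean, so bisection applies.
        if kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(0.3, 20, 100), kl_ucb_index(0.3, 200, 100))
```

As expected for an upper confidence bound, the index shrinks toward the empirical mean as the number of draws grows.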
2.3 Posterior distributions in Bayesian exponential family bandits
It is well-known that the posterior distribution on the mean of a distribution that belongs to an exponential family depends on two sufficient statistics: the number of observations and the empirical means of these observations. With the density of the prior distribution on , introducing
the density of the posterior distribution on after rounds of the bandit game can be written
While our analysis holds for any choice of prior distribution, in practice one may want to exploit the existence of families of conjugate priors (e.g. Beta distributions for Bernoulli rewards, Gaussian distributions for Gaussian rewards, Gamma distributions for Poisson rewards). With a prior distribution chosen in such a family, the associated posterior distribution is well-known and its quantiles are easy to compute, which is of particular interest for the Bayes-UCB algorithm, described in the next section.
Finally, we give below a rewriting of the posterior distribution that will be very useful in the sequel to obtain tight bounds on its tails.
Let . One has
using the closed form expression (6) and the fact that .
3 Bayes-UCB: a simple and optimal Bayesian index policy
3.1 Algorithm and main result
The Bayes-UCB algorithm is an index policy that was introduced in the context of parametric bandit models. Given a prior distribution on the parameters of the arms, the index used for each arm is a well-chosen quantile of the (marginal) posterior distribution of its mean. For exponential family bandit models, given a product prior distribution on the means, the Bayes-UCB index is
$$q_a(t) := Q\left(1 - \frac{1}{t (\log t)^c};\ \pi_a^{t-1}\right), \qquad (8)$$
where $Q(\alpha; \pi)$ is the quantile of order $\alpha$ of the distribution $\pi$ (that is, $\mathbb{P}_{X \sim \pi}(X \leq Q(\alpha; \pi)) = \alpha$) and $c$ is a real parameter. In the particular case of bandit models with Gaussian arms, a variant of Bayes-UCB with a slightly different tuning of the confidence level has been introduced under the name UCL (for Upper Credible Limit).
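For Bernoulli rewards with a uniform prior, the posterior on the mean of an arm is a Beta distribution, so a quantile-based index can be sketched with the standard library alone. The quantile order $1 - 1/(t (\log t)^c)$, the midpoint-rule integration and all function names below are assumptions made for this example; a production implementation would rather call a dedicated Beta quantile routine.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_cdf(x, a, b, steps=2000):
    """Midpoint-rule approximation of P(Beta(a, b) <= x), adequate for a, b >= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    h = x / steps
    return h * sum(beta_pdf((i + 0.5) * h, a, b) for i in range(steps))

def beta_quantile(level, a, b, tol=1e-8):
    """Quantile of order `level`, found by bisection on the approximate CDF."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bayes_ucb_index(successes, n_draws, t, c=0.0):
    """Posterior quantile of order 1 - 1/(t (log t)^c), uniform prior, t >= 2.

    log(t) is floored at 1 so that the level stays in (0, 1) for small t.
    """
    a = 1 + successes
    b = 1 + n_draws - successes
    level = 1.0 - 1.0 / (t * max(math.log(t), 1.0) ** c)
    return beta_quantile(level, a, b)

print(bayes_ucb_index(3, 10, 100), bayes_ucb_index(30, 100, 100))
```

With no observation the posterior is uniform, so the index simply equals the quantile order, which makes for an easy sanity check.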
While the efficiency of Bayes-UCB has been demonstrated even beyond bandit models with independent arms, regret bounds are available only in very limited cases. For Bernoulli bandit models, asymptotic optimality has been established when a uniform prior distribution on the mean of each arm is used. For Gaussian bandit models, a logarithmic regret bound is available when an uninformative prior is used. In this section, we provide new finite-time regret bounds that hold in general exponential family bandit models, showing that a slight variant of Bayes-UCB is asymptotically optimal for a large class of prior distributions.
We fix an exponential family, characterized by its log-partition function and the interval of possible natural parameters. We let and ( may be equal to and to ). We analyze Bayes-UCB for exponential bandit models satisfying the following assumption.
There exists and such that
For Poisson or Exponential distributions, this assumption requires that the means of all arms are different from zero, while they should be included in the open interval $(0, 1)$ for Bernoulli distributions. We now introduce a regularized version of the Bayes-UCB index that relies on the knowledge of and , as
where . Note that and can be chosen arbitrarily close to and respectively, in which case often coincides with the original Bayes-UCB index .
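From Theorem 3, taking the limit superior and letting $\epsilon$ go to zero shows that (this slight variant of) Bayes-UCB satisfies
$$\limsup_{T \to \infty} \frac{\mathbb{E}[N_a(T)]}{\log T} \leq \frac{1}{d(\mu_a, \mu^*)}$$
for every sub-optimal arm $a$, and thus matches the lower bound (4). From a practical point of view, Bayes-UCB outperforms kl-UCB and performs similarly to (sometimes slightly better, sometimes slightly worse than) Thompson Sampling, another popular Bayesian algorithm that we now discuss.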
3.2 Posterior quantiles versus posterior samples
Over the past few years, another Bayesian algorithm, Thompson Sampling, has become increasingly popular for its good empirical performance, and we explain how Bayes-UCB is related to this alternative, randomized, Bayesian approach.
The Thompson Sampling algorithm, which draws each arm according to its posterior probability of being optimal, was introduced in 1933 as the very first bandit algorithm and re-discovered recently for its good empirical performance [36, 16]. Thompson Sampling can be implemented in virtually any Bayesian bandit model in which one can sample the posterior distribution, by drawing one sample from the posterior on each arm and selecting the arm that yields the largest sample. In any such case, Bayes-UCB can be implemented as well and may appear as a more robust alternative, as the quantiles can be estimated based on several samples in case there is no efficient algorithm to compute them.
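The contrast between the two approaches can be sketched for Bernoulli arms with uniform priors: Thompson Sampling uses one posterior sample per arm, while the quantile-based rule below estimates an upper quantile from several posterior samples. The quantile order $1 - 1/t$, the number of samples, the arm means and all function names are assumptions chosen for illustration, and the sample-based quantile is a crude estimator rather than the paper's exact index.

```python
import random

def argmax(values):
    """Index of the largest entry (first one in case of ties)."""
    return max(range(len(values)), key=values.__getitem__)

def thompson_sampling(successes, failures, t, rng):
    """One sample from each Beta posterior; play the arm with the largest sample."""
    samples = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return argmax(samples)

def sampled_quantile(level, s, f, rng, n_samples=200):
    """Empirical quantile of the Beta posterior estimated from n_samples draws."""
    draws = sorted(rng.betavariate(1 + s, 1 + f) for _ in range(n_samples))
    return draws[min(int(level * n_samples), n_samples - 1)]

def bayes_ucb_sampled(successes, failures, t, rng):
    """Quantile-based index with assumed order 1 - 1/t, estimated by sampling."""
    level = 1.0 - 1.0 / t
    return argmax([sampled_quantile(level, s, f, rng)
                   for s, f in zip(successes, failures)])

def run(policy, means, horizon, seed=0):
    """Simulate a Bernoulli bandit and return the cumulated pseudo-regret."""
    rng = random.Random(seed)
    successes = [0] * len(means)
    failures = [0] * len(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = policy(successes, failures, t, rng)
        if rng.random() < means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        regret += max(means) - means[arm]
    return regret

means = [0.2, 0.8]  # illustrative arm means
print(run(thompson_sampling, means, 200), run(bayes_ucb_sampled, means, 200))
```

Both rules only require the ability to sample from the posterior, which is the point made in the paragraph above.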
Our experiments of Section 5 show that Bayes-UCB as well as the other Bayesian-flavored index policies presented in Section 4 are competitive with Thompson Sampling in general one-dimensional exponential families. Compared to Bayes-UCB, the theoretical understanding of Thompson Sampling is more limited: this algorithm is known to be asymptotically optimal in exponential family bandit models, yet only for specific choices of prior distributions [25, 3, 26].
In more complex bandit models, there are situations in which Bayes-UCB is indeed used over Thompson Sampling. When there is a potentially infinite number of arms and the mean reward function is assumed to be drawn from a Gaussian Process, the GP-UCB algorithm, which coincides with Bayes-UCB, is very popular in the Bayesian optimization community.
3.3 Tail bounds for posterior distributions
Just like the analysis of , the analysis of Bayes-UCB that we give in the next section relies on tight bounds on the tails of posterior distributions that permit to control quantiles. These bounds are expressed with the Kullback-Leibler divergence function . Therefore, an additional tool in the proof is the control of the deviations of the empirical mean rewards from the true mean reward, measured with this divergence function, which follows from the work of .
In the particular case of Bernoulli bandit models, Bayes-UCB uses quantiles of Beta posterior distributions. In that case a specific argument, namely the fact that $\mathrm{Beta}(a, b)$ is the distribution of the $a$-th order statistic among $a + b - 1$ uniform random variables, relates a Beta distribution (and its tails) to a Binomial distribution (and its tails). This ‘Beta-Binomial trick’ is also used extensively in the analysis of Thompson Sampling for Bernoulli bandits proposed by [2, 25, 3]. Note that this argument can only be used for Beta distributions with integer parameters, which rules out many possible prior distributions. The existing analysis in the Gaussian case also relies on specific tail bounds for the Gaussian posterior distributions. For exponential family bandit models, an upper bound on the tail of the posterior distribution was previously obtained using the Jeffreys prior.
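The Beta-Binomial relation can be checked numerically. For positive integers $a$ and $b$, the $a$-th smallest of $a + b - 1$ uniforms is at most $x$ exactly when at least $a$ of them fall below $x$, so the Beta CDF equals a Binomial upper tail. The sketch below compares this exact tail with two Monte Carlo estimates; the parameter values, sample size and function names are illustrative assumptions.

```python
import math
import random

def binomial_upper_tail(n, p, k):
    """P(Binomial(n, p) >= k)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def beta_cdf_via_binomial(x, a, b):
    """P(Beta(a, b) <= x) for positive integers a and b."""
    return binomial_upper_tail(a + b - 1, x, a)

def order_statistic_sample(a, b, rng):
    """a-th smallest value among a + b - 1 independent uniforms on [0, 1]."""
    return sorted(rng.random() for _ in range(a + b - 1))[a - 1]

rng = random.Random(0)
a, b, x, n = 3, 5, 0.4, 20000
exact = beta_cdf_via_binomial(x, a, b)
# Monte Carlo estimate from simulated order statistics.
via_order_stats = sum(order_statistic_sample(a, b, rng) <= x for _ in range(n)) / n
# Monte Carlo estimate from direct Beta draws.
via_beta_draws = sum(rng.betavariate(a, b) <= x for _ in range(n)) / n
print(exact, via_order_stats, via_beta_draws)
```

The identity breaks down for non-integer parameters, which is the limitation noted in the paragraph above.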
Lemma 4 below presents more general results that hold for any class of exponential family bandit models and any prior distribution with a density that is positive on . For such (proper) prior distributions, we give deterministic upper and lower bounds on the corresponding posterior probabilities . Compared to the result of , which is not presented in this deterministic way, Lemma 4 is based on a different rewriting of the posterior distribution, given in Lemma 1.
Let be defined in Assumption 2.
There exist two positive constants and such that for all that satisfy , for all , for all ,
There exists a constant such that for all that satisfy , for all , for all ,
The constants depend on ,, and the prior densities.
This result permits in particular to show that the quantile defined in (8) satisfies , with
Hence, despite their Bayesian nature, the indices used in Bayes-UCB are strongly related to frequentist kl-UCB type indices. However, compared to the index defined in (7), the exploration rate that appears in these two bounding indices also features the current number of draws $N_a(t)$. Lai gives an asymptotic analysis of any index strategy of the above form with an exploration function, where when goes to infinity. Yet neither of the two bounding indices is exactly of that form, and we propose below a finite-time analysis that relies on new, non-asymptotic, tools.
3.4 Finite-time analysis
We give here the proof of Theorem 3. To ease the notation, assume that arm 1 is an optimal arm, and let be a suboptimal arm.
We introduce a truncated version of the KL-divergence, and let be a decreasing sequence to be specified later.
Using that, by definition of the algorithm, if is played at round , it holds in particular that , one has
On the one hand, for ,
since by the lower bound in the second statement of Lemma 4,
Now using the lower bound in the first statement of Lemma 4,
On the other hand,
The third inequality follows from exchanging the sums over and and using that is smaller than 1 for all . The last inequality uses that if , and . Then by the Chernoff inequality,
Still using Chernoff inequality, the second sum in (9) is upper bounded by
Putting things together, we showed that there exists some constant such that
Term is shown below to be of order , as cannot be too far from . Note however that the deviation is expressed with in place of the traditional , which makes the proof of Lemma 5 more intricate. In particular, Lemma 5 applies to a specific sequence defined therein, and a similar result could not be obtained for the choice , unlike Lemma 6 below.
Let be such that . If , for all , if is larger than ,
From Lemma 5, one has
The following lemma yields an upper bound on Term T2.
Let be three functions such that
with and non-increasing for large enough.
For all there exists a (problem-dependent) constant such that for all ,
with , where the variance function is defined in (5).
4 A Bayesian insight on alternative exploration rates
The kl-UCB index of an arm, , introduced in (7), uses the exploration rate