For regret minimization in stochastic bandit problems, two notions of time-optimality coexist. On the one hand, one may consider a fixed model: the famous lower bound by Lai and Robbins (1985) showed that the regret of any consistent strategy should grow at least as $C(\mu)\log(T)(1-o(1))$ when the horizon $T$ goes to infinity. Here, $C(\mu)$ is a constant depending solely on the model. A strategy with a regret upper-bounded by $C(\mu)\log(T)(1+o(1))$ will be called in this paper asymptotically-optimal. Lai and Robbins provided a first example of such a strategy in their seminal work. Later, Garivier and Cappé (2011) and Maillard et al. (2011) provided finite-time analyses for variants of the UCB algorithm (see Agrawal (1995); Burnetas and Katehakis (1996); Auer et al. (2002a)) which imply asymptotic optimality. Since then, other algorithms like Bayes-UCB (Kaufmann et al., 2012) have also joined the family.
On the other hand, for a fixed horizon $T$ one may assess the quality of a strategy by the greatest regret suffered in all possible bandit models. If the regret of a bandit strategy is upper-bounded by $C'\sqrt{KT}$ (the optimal rate: see Auer et al. (2002b) and Cesa-Bianchi and Lugosi (2006)) for some numeric constant $C'$, where $K$ is the number of arms, this strategy is called minimax-optimal. The PolyINF and the MOSS strategies by Audibert and Bubeck (2009) were the first proved to be minimax-optimal.
Hitherto, as far as we know, no algorithm was proved to be at the same time asymptotically- and minimax-optimal. Two limited exceptions may be mentioned: the case of two Gaussian arms is treated in Garivier et al. (2016a); and the OC-UCB algorithm of Lattimore (2015) is proved to be minimax-optimal and almost problem-dependent optimal for Gaussian multi-armed bandit problems. Notably, the OC-UCB algorithm satisfies another worthwhile property of finite-time instance near-optimality, see Section 2 of Lattimore (2015) for a detailed discussion.
In this work, we put forward the kl-UCB++ algorithm, a slightly modified version of the kl-UCB+ algorithm discussed in Garivier et al. (2016a) as an empirical improvement of UCB, and analyzed in Kaufmann (2016). This bandit strategy is designed for some exponential distribution families, including for example Bernoulli and Gaussian laws. It borrows from the MOSS algorithm of Audibert and Bubeck (2009) the idea to divide the horizon by the number of arms in order to reach minimax optimality. We prove that it is at the same time asymptotically- and minimax-optimal. This work thus merges the progress which has been made in different directions towards the understanding of the optimism principle, finally reconciling the two notions of time-optimality.
In this respect, our contribution answers a very simple and natural question. So far, the need for simultaneous minimax- and problem-dependent optimality could only be addressed in very limited settings, by means that could not be generalized to the framework adopted in our paper. Indeed, for a given horizon $T$, the worst problem depends on $T$: it involves arms separated by a gap of order $\sqrt{K/T}$. Treating the $T$-dependent problems correctly for all $T$ appears as a quite different task than catching the optimal, problem-dependent speed of convergence for every fixed bandit model. We show in this paper that the two goals can indeed be achieved simultaneously.
Combining the two notions of optimality requires a modified exploration rate. We stick as much as possible to existing algorithms and methods, introducing just what is necessary to obtain the desired results. Starting from the exploration rate of kl-UCB (so as to have a tight asymptotic analysis), one has to completely cancel the exploration bonus of the arms that have been drawn roughly $T/K$ times. The consequence is very slight and harmless in the case where the best arm is much better than the others, but essential in order to minimize the regret in the worst case where the best arm is barely distinguishable from the others. Indeed, when the best arm is separated by a gap of order $\sqrt{K/T}$ from the suboptimal arms, we cannot afford to draw a suboptimal arm more than $T/K$ times if we want a regret of order $\sqrt{KT}$.
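As a rough check of these orders of magnitude, the following back-of-the-envelope computation ignores all constants. The notation is assumed here for illustration only: $T$ is the horizon, $K$ the number of arms, $\Delta$ the common gap of the suboptimal arms, and $N_a(T)$ the number of draws of arm $a$.

$$\Delta \approx \sqrt{\frac{K}{T}}, \qquad N_a(T) \lesssim \frac{T}{K} \quad\Longrightarrow\quad \sum_{a \neq a^\star} \Delta\, N_a(T) \;\lesssim\; (K-1)\,\sqrt{\frac{K}{T}}\;\frac{T}{K} \;\leq\; \sqrt{KT}.$$

Drawing each suboptimal arm substantially more than $T/K$ times would therefore push the worst-case regret above the $\sqrt{KT}$ rate.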
We present a general yet simple proof, combining the best elements of the above-cited sources which are simplified as much as possible and presented in a unified way. To this end, we develop new deviation inequalities, improving the analysis of the different terms contributing to the regret. This analysis is made in the framework which we believe is the best compromise between simplicity and generality (simple exponential families). This permits us to treat, among others, the Bernoulli and the Gaussian case at the same time. More fundamentally, this appears to us as the right, simple framework for the analysis, which emphasizes what is really required to have simple lower- and upper-bounds (the possibility to make adequate changes of measure, and Chernoff-type deviation bounds).
The paper is organized as follows. In Section 2, we introduce the setting and assumptions required for the main results, Theorems 1 and 2, which are presented in Section 3. We give the entire proofs of these results in Sections 4 and 5, with only a few technical lemmas proved in Appendix A. We conclude in Section 6 with some brief references to possible future prospects.
2 Notation and Setting
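We consider a simple stochastic bandit problem with $K$ arms indexed by $a \in \{1,\dots,K\}$, with $K \geq 2$. Each arm is assumed to be a probability distribution of some canonical one-dimensional exponential family $\nu_\theta$ indexed by $\theta \in \Theta$. The probability law $\nu_\theta$ is assumed to be absolutely continuous with respect to a dominating measure $\rho$ on $\mathbb{R}$, with a density given by
$$\exp\big(x\theta - b(\theta)\big),$$
where $b(\theta) = \log \int_{\mathbb{R}} e^{x\theta}\, d\rho(x)$ and $\Theta = \{\theta \in \mathbb{R} : b(\theta) < +\infty\}$.
It is well-known that $b$ is convex, twice differentiable on $\Theta$, that $b'(\theta) = \mathbb{E}(\nu_\theta)$ and $b''(\theta) = \mathbb{V}(\nu_\theta) > 0$ are respectively the mean and the variance of the distribution $\nu_\theta$. The family can thus be parametrized by the mean $\mu = b'(\theta)$, for $\mu \in I := b'(\Theta)$. The Kullback-Leibler divergence between two distributions is $\mathrm{KL}(\nu_\theta, \nu_{\theta'}) = b(\theta') - b(\theta) - b'(\theta)(\theta' - \theta)$. This permits us to define the following divergence on the set of arm expectations: for $\mu = \mathbb{E}(\nu_\theta)$ and $\mu' = \mathbb{E}(\nu_{\theta'})$, we write
$$\mathrm{kl}(\mu, \mu') = \mathrm{KL}(\nu_\theta, \nu_{\theta'}).$$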
For a minimax analysis, we need to restrict the set of means to a bounded interval: we suppose that each arm has a mean $\mu \in [\mu^-, \mu^+]$ for two fixed real numbers $\mu^- < \mu^+$. Our analysis requires a Pinsker-like inequality; we therefore assume that the variance is bounded in the exponential family: there exists $V > 0$ such that
$$\sup_{\theta \in \Theta} \mathbb{V}(\nu_\theta) \leq V < +\infty.$$
This implies that for all means $\mu, \mu'$ of distributions in the family,
$$\mathrm{kl}(\mu, \mu') \geq \frac{1}{2V}(\mu - \mu')^2. \tag{1}$$
In the sequel, we denote by $\mathcal{F}$ the set of bandit problems satisfying these assumptions. By the usual Pinsker inequality, this setting includes in particular Bernoulli bandits with $V = 1/4$. This also includes (bounded) Gaussian bandits with known variance $\sigma^2$, with the choice $V = \sigma^2$.
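To make the Pinsker-like assumption concrete, here is a minimal numerical sketch. It assumes the inequality takes the form kl(mu, mu') >= (mu - mu')^2 / (2V), with V = 1/4 for Bernoulli arms and V = sigma^2 for Gaussian arms. The function names are hypothetical and chosen only for this illustration.

```python
import math

def kl_bernoulli(p, q, eps=1e-15):
    """Bernoulli divergence kl(p, q); arguments are clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_gaussian(mu1, mu2, sigma2=1.0):
    """Divergence between two Gaussian laws with common known variance sigma2."""
    return (mu1 - mu2) ** 2 / (2 * sigma2)

def pinsker_lower_bound(mu1, mu2, V):
    """Right-hand side of the assumed Pinsker-like inequality."""
    return (mu1 - mu2) ** 2 / (2 * V)

# Check the inequality on a grid of Bernoulli means, with V = 1/4.
grid = [i / 20 for i in range(1, 20)]
bernoulli_ok = all(
    kl_bernoulli(p, q) >= pinsker_lower_bound(p, q, 0.25) - 1e-12
    for p in grid
    for q in grid
)
print("Pinsker-like inequality holds on the Bernoulli grid:", bernoulli_ok)
```

In the Gaussian case the inequality holds with equality when V = sigma^2, which is why that choice of V is the natural one.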
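The arms are denoted $\nu_{\theta_1}, \dots, \nu_{\theta_K}$, and the expectation of arm $a$ is denoted by $\mu_a$. At each round $1 \leq t \leq T$, the player pulls an arm $A_t$ and receives an independent draw $Y_t$ of the distribution $\nu_{\theta_{A_t}}$. This reward is the only piece of information available to the player. The best mean is $\mu^\star = \max_{a=1,\dots,K} \mu_a$. We denote by $N_a(t)$ the number of draws of arm $a$ up to and including time $t$. In this work, the goal is to minimize the expected regret
$$R_T = T\mu^\star - \mathbb{E}\Big[\sum_{t=1}^{T} Y_t\Big] = \sum_{a=1}^{K} (\mu^\star - \mu_a)\, \mathbb{E}[N_a(T)].$$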
Lai and Robbins (1985) proved that if a strategy is uniformly efficient, that is if it is such that under any bandit model of a sufficiently rich family (such as an exponential family described above) $R_T = o(T^\alpha)$ holds for every $\alpha > 0$, then it needs to draw any suboptimal arm $a$ at least as often as
$$\mathbb{E}[N_a(T)] \geq \frac{\log(T)}{\mathrm{kl}(\mu_a, \mu^\star)}\big(1 - o(1)\big).$$
In light of the previous equality, this directly implies an asymptotic lower bound on $R_T/\log(T)$.
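The resulting problem-dependent constant can be evaluated numerically. The sketch below assumes Bernoulli arms and the classical form of the Lai and Robbins constant, namely the sum over suboptimal arms of (mu* - mu_a) / kl(mu_a, mu*). The helper names and the example means are hypothetical.

```python
import math

def kl_bernoulli(p, q, eps=1e-15):
    """Bernoulli divergence kl(p, q); arguments are clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(means):
    """Sum over suboptimal arms of gap / kl(mu_a, mu_star) (assumed Bernoulli arms)."""
    best = max(means)
    return sum((best - m) / kl_bernoulli(m, best) for m in means if m < best)

means = [0.5, 0.4, 0.3]  # illustrative values only
constant = lai_robbins_constant(means)
for horizon in (10**3, 10**5):
    # Leading term of the asymptotic lower bound on the regret.
    print(horizon, constant * math.log(horizon))
```

The constant blows up as a suboptimal mean approaches the best one, which is the regime where the problem-dependent bound stops being informative and the minimax viewpoint takes over.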
On the other side, a straightforward adaptation of the proof of Theorem A.2 of Auer et al. (2002b) shows that there exists a constant $C > 0$ depending only on the considered family of distributions such that
$$\sup_{\nu \in \mathcal{F}} R_T \geq C \min\big(\sqrt{KT},\, T\big),$$
where the supremum is taken over all bandit problems $\nu$ in $\mathcal{F}$. Note that the notion of minimax-optimality is defined here up to a multiplicative constant, in contrast to the definition of (problem-dependent) asymptotic optimality. For a discussion on the minimax and asymptotic lower bounds, we refer to Garivier et al. (2016b) and references therein.
3 The kl-UCB++ Algorithm
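We denote by $\hat\mu_{a,n}$ the empirical mean of the first $n$ rewards from arm $a$. The empirical mean of arm $a$ after $t$ rounds is
$$\hat\mu_a(t) = \hat\mu_{a,N_a(t)} = \frac{1}{N_a(t)} \sum_{s=1}^{t} Y_s \mathbb{1}\{A_s = a\}.$$
Parameters: The horizon $T$ and an exploration function $g : \mathbb{N} \to \mathbb{R}^+$.
Initialization: Pull each arm of $\{1,\dots,K\}$ once.
For $t = K$ to $T-1$, do
Compute for each arm $a$ the quantity
$$U_a(t) = \sup\Big\{\mu \in I : \mathrm{kl}\big(\hat\mu_a(t), \mu\big) \leq \frac{g(N_a(t))}{N_a(t)}\Big\}.$$
Play $A_{t+1} \in \arg\max_{a \in \{1,\dots,K\}} U_a(t)$.
We use the exploration function $g$ given by
$$g(n) = \log_+\Big(\frac{T}{Kn}\Big(\log_+^2\Big(\frac{T}{Kn}\Big) + 1\Big)\Big),$$
where $\log_+(x) = \max(\log(x), 0)$. The exploration function borrows the general form with the extra exploration rate from the kl-UCB algorithm, the division by the number of draws from kl-UCB+, and the division by the number of arms from MOSS.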
The following results state that the kl-UCB++ algorithm is simultaneously minimax- and asymptotically-optimal.
Theorem 1 [Minimax optimality] For any family $\mathcal{F}$ satisfying the assumptions detailed in Section 2, and for any bandit model $\nu \in \mathcal{F}$, the expected regret of the kl-UCB++ algorithm is upper-bounded as
Theorem 2 [Asymptotic optimality] For any bandit model $\nu \in \mathcal{F}$, for any suboptimal arm $a$ and any $\delta$ such that ,
which implies the asymptotic optimality (see the end of the proof in Section 5 for an explicit bound). Theorems 1 and 2 are proved in Sections 4 and 5 respectively. The main differences between the two proofs are discussed at the beginning of Section 5. Note that the two regret bounds of Theorems 1 and 2 also apply to all $[0,1]$-valued bandit models, with the value $V = 1/4$, as the deviations of $[0,1]$-valued random variables are dominated by those of a Bernoulli distribution with the same mean (see Cappé et al. (2013)). However, the kl-UCB++ algorithm is not asymptotically optimal then: the regret bound in $\mathrm{kl}(\mu_a, \mu^\star)$ is not optimal in that case. Asymptotic optimality would require tight distribution-dependent, non-parametric upper confidence bounds (for example based on the empirical-likelihood method, as in the above cited paper). This is out of the scope of this work (and would require a lot more space).
4 Proof of Theorem 1 (minimax optimality)
This proof merges ideas presented in Bubeck and Liu (2013) for the analysis of the MOSS algorithm with ideas from the analysis of kl-UCB in Cappé et al. (2013) (see also Kaufmann (2016)). It is divided into the following steps:
Decomposition of the regret.
Let be the index of an optimal arm. Since by definition of the strategy for all , the regret can be decomposed as follows:
We define ; since the bound (4) is otherwise trivial, we assume in the sequel that . For the first term A, as in the analysis of the MOSS algorithm, we carefully upper bound the probability that appears inside the integral thanks to a 'peeling trick'. The second term B is easier to handle since we can reduce the index to a UCB-like index thanks to the Pinsker inequality (1) and proceed as in Bubeck and Liu (2013).
Step 1: Upper-bounding A.
Term A is concerned with the optimal arm $a^\star$ only. Two words of intuition: since the index of the optimal arm is meant to be an upper confidence bound for the best mean $\mu^\star$, this term should not be too large, at least as long as the confidence level controlled by the exploration function is large enough. When the confidence level is low, however, the number of draws is large and deviations are unlikely.
Upper-bounding term A boils down to controlling the probability that the best mean $\mu^\star$ is under-estimated at time $t$. Indeed,
and we need to upper bound the left-deviations of the mean of arm . On the event , we have that , and by definition of it holds that
For small values of , the dominant term is given by , whereas for large the event is quite unlikely. This is why we split the probability in two terms, proceeding as follows. Let be the function defined, for , by
Our choice of implies that , and thus
In particular, for it holds that
It appears that is the right place where to split the probability of Equation (8): defining , we write
Controlling terms and is a matter of deviation inequalities.
Step 1.1: Upper-bounding . The term , which involves self-normalized deviation probabilities, can be upper-bounded thanks to a 'peeling trick' as in the proof of Theorem 5 from Audibert and Bubeck (2009). We assume that , for otherwise . We use the grid , where the real will be chosen later. We write
But thanks to Equation (9),
It is now time to choose , so that and . Together with the definition of , this choice yields
as, for all ,
Step 1.2: Upper-bounding . The term is simpler to handle, as it does not involve self-normalized deviations. Thanks to the maximal inequality (recalled in Equation (33) of Appendix A) and thanks to the Pinsker-like inequality (1),
It remains only to conclude with some calculus:
and replacing by its value we obtain from Equation (16) the following relation:
Summing over from to , this yields:
Step 2: Upper-bounding B.
Term B is of a different nature, since typically . However, as for the term A, we first reduce the problem to the upper-bounding of a probability:
The event is typical if is small, and corresponds to a deviation of the sample mean otherwise. In order to handle this correctly, we first get rid of the randomness of by the pessimistic trajectorial upper bound from Bubeck and Liu (2013)
In addition, we simplify the upper bound thanks to our assumption (1) that some Pinsker type inequality is available:
Hence, can be upper-bounded as
Then, we need only to upper bound for each arm . We cut the sum at the critical sample size where the event becomes atypical: for , let be the integer such that
For it holds that
Indeed, as for all , we have
also observe that is such that for , and thus for and
Therefore, cutting the sum in (20) at , we obtain:
where . It remains to integrate Inequality (22) from to infinity. The first summand involves the same integral as we have already met in the upper bound of term :
For the remaining summand, Inequality (33) yields
Thus, as for all ,
Putting everything together starting from Inequality (22), we have proved that
By Equation (20), replacing by its value finally yields
Conclusion of the proof.
5 Proof of Theorem 2 (asymptotic optimality)
The analysis of asymptotic optimality shares many elements with the minimax analysis, with some differences however. The decomposition of the regret into two terms A and B is similar, but localized on a fixed sub-optimal arm $a$: we analyze the number of draws of $a$ and not directly the regret (and we do not need to integrate the deviations at the end). We proceed roughly as in the proof of Theorem 1 (minimax optimality) for term A, which involves the deviations of an optimal arm. For term B, which stands for the behavior of the sub-optimal arm $a$, a different (but classical) argument is used, as one cannot simply use the Pinsker-like Inequality (1) if one wants to obtain the correct constant (and thus asymptotic optimality).
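Decomposition of $\mathbb{E}[N_a(T)]$.
If arm $a$ is pulled at time $t+1$, then by definition of the strategy its index is at least as large as the index of $a^\star$ at time $t$, for any index $a^\star$ of an optimal arm. Thus, for any fixed $\delta$ to be chosen later,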
As a consequence,
and it remains to bound each of these terms.
Step 1: Upper-bounding A.
As in the proof of Theorem 1 (minimax optimality), we write
where we use the same function
Here, we used that for all , since the condition implies that ,
Step 2: Upper-bounding B.
Thanks to the definition of it holds that
Together with the following classical argument for regret analysis in bandit models, this yields:
as it holds . Now, let be the integer defined as
Then, for ,
We cut the sum in (28) at , so that
Recall that by assumption , using the inclusion
together with Inequality (33), we obtain that
and Equation (29) yields