Several recent and less recent analyses of bandit problems share a remarkable feature: an instance-dependent lower-bound analysis reveals the existence of an optimal proportion of draws, which every efficient strategy needs to match and which serves as a basis for the design of optimal algorithms. This is the case in Active Exploration bandit problems, see Chernoff, Soare et al., Russo and Garivier and Kaufmann, but also in Regret Minimization bandit problems, from the simplest multi-armed bandit setting, Garivier et al., to more complex settings, Lattimore and Szepesvari, Combes et al. To reach the asymptotic lower bounds one needs to sample asymptotically according to this optimal proportion of draws. A natural strategy is to sample according to the optimal proportion of draws associated with the current estimate of the true parameter, with some extra exploration. See for example Antos et al., Garivier and Kaufmann, Lattimore and Szepesvari and Combes et al. This strategy has a major drawback: computing the optimal proportion of draws requires solving an often involved concave optimization problem. It can therefore lead to rather computationally inefficient strategies, since a new concave optimization problem must be solved exactly at each step.
In this paper we propose instead to use a gradient ascent to solve the optimization problem in an online fashion, thus merging the Active Exploration problem and the computation of the optimal proportion of draws. Precisely, we perform an online lazy mirror ascent, see Shalev-Shwartz et al. and Bubeck, adding a new link between stochastic bandits and online convex optimization. Hence, it is sufficient to compute only a (sub-)gradient at each step, which greatly improves the computational complexity. As a byproduct, the obtained algorithm is quite generic and can be applied to various Active Exploration bandit problems, see Appendix A.
The paper is organized as follows. In Section 1.1 we define the framework. A general asymptotic lower bound is presented in Section 1.2. In Section 1.3 we motivate the introduction of the gradient ascent. The main result, namely the asymptotic optimality of Algorithm 1, and its proof make up Section 2. Appendix A gathers various examples that are described by the general setting introduced in Section 1.1. Section 3 reports the results of numerical experiments comparing Algorithm 1 to its competitors.
1.1 Problem description
For , we consider a Gaussian bandit problem
, which we unambiguously refer to by the vector of means. Without loss of generality, we set in the following . We denote by the set of Gaussian bandit problems. Let and be respectively the probability and the expectation under the bandit problem .
We fix a finite number of subsets of bandit problems for with and we assume that the subsets are pairwise disjoint, open and convex. We will explain later why we need these assumptions on the sets . For a certain bandit problem in our objective is to identify to which set it belongs, i.e. to find such that . Namely, we consider algorithms that output a subset index after pulls. This setting is quite general and encompasses several Active Exploration bandit problems, see Appendix A.
Two approaches for this problem have been proposed. First, one may consider a given budget and try to minimize the probability of predicting a wrong subset index; this is the Fixed Budget setting, see Bubeck et al., Audibert and Bubeck and Locatelli et al. The second approach is the Fixed Confidence setting, where we fix a confidence level and try to minimize the expected number of samples under the constraint that the predicted subset index is the right one with probability at least , see Chernoff, Even-Dar et al., Mannor and Tsitsiklis and Kaufmann et al. In this paper we consider the second approach.
The game goes as follows: at each round the agent chooses an arm and observes a sample conditionally independent from the past. Let be the information available to the agent at time . In order to respect the confidence constraint the agent must follow a -correct algorithm comprised of:
a sampling rule , where is -measurable,
a stopping rule , a stopping time for the filtration ,
a decision rule -measurable,
such that for all the fixed confidence condition is satisfied and the algorithm stops almost surely . In this paper we focus on the sampling rule, since stopping rules are now well understood and decision rules are straightforward to find.
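The three components above fit into a simple interaction loop. The following Python sketch is only illustrative: the names `run_fixed_confidence`, `sampling_rule`, `stopping_rule` and `decision_rule`, the `max_pulls` safeguard, the initial pull of each arm and the unit-variance Gaussian rewards are assumptions made here for the example, not specifications taken from the paper.

```python
import random
from typing import Callable, List, Tuple

Rule = Callable[[List[int], List[float]], int]


def run_fixed_confidence(
    means: List[float],
    sampling_rule: Rule,
    stopping_rule: Callable[[List[int], List[float]], bool],
    decision_rule: Rule,
    rng: random.Random,
    max_pulls: int = 100_000,
) -> Tuple[int, int]:
    """Play one game and return (predicted index, number of pulls)."""
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms

    def pull(arm: int) -> None:
        # Unit-variance Gaussian reward (assumption of this sketch).
        sums[arm] += rng.gauss(means[arm], 1.0)
        counts[arm] += 1

    # One initial pull per arm so that every empirical mean is defined.
    for arm in range(n_arms):
        pull(arm)

    # Alternate between the stopping test and the sampling rule.
    while sum(counts) < max_pulls:
        mu_hat = [s / n for s, n in zip(sums, counts)]
        if stopping_rule(counts, mu_hat):
            break
        pull(sampling_rule(counts, mu_hat))

    mu_hat = [s / n for s, n in zip(sums, counts)]
    return decision_rule(counts, mu_hat), sum(counts)
```

Each rule only sees the counts and the empirical means, which mirrors the measurability requirements listed above; any concrete sampling, stopping or decision rule can be plugged in as a callable.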
1.2 Lower Bound
The Kullback-Leibler divergence between two Gaussian distributions and is defined by
The set of alternatives of the problem is denoted by . One can prove the following generic asymptotic lower bound on the expected number of samples when the confidence level tends to zero, see Garivier and Kaufmann  and Garivier et al. .
For all , for all ,
where the characteristic time is defined by
In particular (1) implies that
As already explained by Chernoff , it is interesting to note that asymptotically we end up with a zero-sum game where the agent first plays a proportion of draws trying to minimize the sum in (2) then the "nature" plays an alternative trying to do the opposite. The value of this game is exactly . In the sequel we denote by
the function that the agent needs to maximize against a "nature" that plays optimally. An algorithm is thus asymptotically optimal if the reverse inequality of (3) holds with a limsup instead of a liminf.
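To give a feel for the scale of the lower bound, the sketch below evaluates it numerically. It assumes that the non-asymptotic bound (1) takes the usual form found in Garivier and Kaufmann, namely the characteristic time multiplied by the binary relative entropy between the confidence level and its complement, and that the arms have unit variance. The function names are hypothetical.

```python
import math


def gaussian_kl(mu: float, lam: float) -> float:
    """KL divergence between N(mu, 1) and N(lam, 1) (unit variance assumed)."""
    return (mu - lam) ** 2 / 2.0


def binary_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lower_bound(char_time: float, delta: float) -> float:
    """Assumed form of the bound: characteristic time times kl(delta, 1 - delta)."""
    return char_time * binary_kl(delta, 1.0 - delta)
```

Since the binary relative entropy between the confidence level and its complement is equivalent to the logarithm of the inverse confidence level when the latter tends to zero, this form is consistent with the asymptotic statement.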
1.3 Intuition: what is the idea behind the algorithm?
To get an asymptotically optimal algorithm the agent wants to play according to an optimal proportion of draws , defined by
in order to minimize the characteristic time in (2). But, of course, the agent does not have access to the true vector of means. One way to settle this problem is to track the optimal proportion of the current empirical means. Let be the vector of empirical means at time :
where denotes the number of draws of arm up to and including time . We will denote by the empirical proportion of draws at time . Following this idea, the sampling rule could be
This rule is equivalent to the direct tracking rule (without forced exploration, see below) by Garivier and Kaufmann. But this approach has a major drawback: at each time we need to solve exactly the concave optimization problem in (5), and in some cases it cannot be solved analytically, see for example Garivier et al. Even when the optimization problem can be solved efficiently by numerical means, as for example in the Best Arm Identification problem, some simpler and efficient algorithms give experimentally comparable results. We can cite for example Best Challenger type algorithms, see Garivier and Kaufmann and Russo.
The idea of our algorithm is best explained on the simple example of the Thresholding Bandit problem (see Section A.1), where the set of all arms larger than the threshold is to be identified. There exists a natural and efficient sampling rule (see Locatelli et al. ):
It turns out that this sampling rule leads to an asymptotically optimal algorithm. We are not aware of a reference for this fact. In order to give an interpretation of this sampling rule, let us take one step back. In this problem we want to maximize with respect to the first variable the following concave function (see Section A.1)
The sub-gradient of at , denoted by , is a convex combination of the vectors
where is some element in the sub-gradient . Then the update of the empirical proportion of draws follows the simple rule
Here we recognize, surprisingly, one step of the Frank-Wolfe algorithm [Frank and Wolfe, 1956] for maximizing the concave function on the simplex. The exact same analysis can be done with a variant of the Best Challenger sampling rule for the Best Arm Identification problem. This is described in Section A.2. It is not the first time that the Frank-Wolfe algorithm appears in the stochastic bandits field, see for example Berthet and Perchet. Precisely, in the aforementioned reference they interpret the classical UCB algorithm as an instance of this algorithm with an "optimistic" gradient. The main difficulty here, which does not appear in the Regret Minimization problem, is that the function is not smooth in general (as an infimum of linear functions). Thus we cannot directly leverage the analysis of the Frank-Wolfe algorithm in our setting, as Berthet and Perchet do. In particular it is not obvious that the sampling rule driven by the Frank-Wolfe algorithm will converge to the maximum of , for the general problem presented in Section 1, even in the absence of noise (i.e. ).
But we can keep the idea of using a concave optimizer in an online fashion instead of computing the optimal proportion of draws at each step. Indeed there is a candidate of choice for optimizing a non-smooth concave function, namely the sub-gradient ascent. Now the strategy is clear: at each step we perform one step of sub-gradient ascent for the function on the simplex. Nevertheless, the update of the proportion of draws will be more intricate than in (8): we will need to track the average of the weights proposed by the sub-gradient ascent and force some exploration, see the next section for details. Note that this greatly improves the computational complexity of the algorithm, since one just needs to compute an element of the sub-gradient of at each time step. In various settings this computation is straightforward, see Appendix A; in general it boils down to computing the projection of the vector of empirical means on the closure of the alternative sets, thanks to the particular form of the function , see (4). Since the sets are convex, if the weights are strictly positive (which will be the case in Algorithm 1) the projection always exists.
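The correspondence between the Thresholding Bandit sampling rule and a Frank-Wolfe step can be checked numerically. The sketch below assumes unit-variance Gaussian arms, reads the sampling rule as pulling the arm that minimizes the number of draws times the squared distance of its empirical mean to the threshold (the rule of Locatelli et al. without its precision parameter), and works in the noise-free case where the empirical means equal the true means. All function names are hypothetical.

```python
from typing import List


def pull_index(counts: List[int], mu_hat: List[float], tau: float) -> int:
    """Arm realizing the minimum of N_a * (mu_hat_a - tau)^2 / 2."""
    return min(
        range(len(counts)),
        key=lambda a: counts[a] * (mu_hat[a] - tau) ** 2 / 2.0,
    )


def frank_wolfe_step(w: List[float], t: int, arm: int) -> List[float]:
    """Move w toward the vertex e_arm of the simplex with step size 1 / (t + 1)."""
    return [
        (t * w_a + (1.0 if a == arm else 0.0)) / (t + 1)
        for a, w_a in enumerate(w)
    ]


def noise_free_proportions(
    means: List[float], tau: float, n_rounds: int
) -> List[float]:
    """Iterate the rule with exact means; w always equals counts / t."""
    n_arms = len(means)
    counts = [1] * n_arms
    w = [1.0 / n_arms] * n_arms
    for _ in range(n_rounds):
        t = sum(counts)
        # The sub-gradient has a single non-zero coordinate, at the pulled arm,
        # so the Frank-Wolfe vertex is exactly the indicator of that arm.
        arm = pull_index(counts, means, tau)
        w = frank_wolfe_step(w, t, arm)
        counts[arm] += 1
    return w
```

With means 0.5 and 1.0 and threshold 0, the proportions approach (0.8, 0.2), that is, weights inversely proportional to the squared gaps to the threshold.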
2 Gradient Ascent
Before presenting the algorithm we need to fix some notation. Since does not necessarily lie in the set , we first extend to the entire set , by setting
Then, will denote some element of the sub-gradient of at .
We perform an online lazy mirror ascent on the simplex, using the Kullback-Leibler divergence to the uniform distribution as mirror map:
where, for an arbitrary constant , we clipped the gradient . This is just a technical trick to handle the fact that the gradient may not be uniformly bounded in the very first steps. In practice, however, this technical trick seems useless and we recommend ignoring it (that is, take ). There is a closed formula for the weights , see Appendix F. Note that it is crucial here to use an anytime optimizer since we do not know in advance when the algorithm will stop. Then we skew the weights toward the uniform distribution to force exploration
This trick is quite usual, as for example in the EXP3.P algorithm, see Bubeck et al. In some particular settings this extra exploration is not necessary, for example in the Thresholding Bandits problem. We believe that there is a more intrinsic way to perform exploration but this is out of the scope of this paper. Since we use step sizes of order , we cannot use the same simple update rule of the empirical proportion of draws as in (8), where the step size is of order . But we can track the cumulative sum of weights as follows
It is important to track the cumulative sum of weights here because the analysis of the online mirror ascent provides only guarantees on the cumulative regret.
That is, we stop when the vector of empirical means is far enough from any alternative with respect to the empirical Kullback-Leibler divergence. Note that, here, the threshold does not depend directly on , but via the vector of counts . This allows us to use the maximal inequality of Proposition 1, which yields a very short and direct proof of -correctness: see Section 2.1.
Initialization Pull each arm once and set for all
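As an illustration, the stopping test can be sketched as follows for the Thresholding Bandit problem of Appendix A.1, where the infimum over alternatives reduces to a minimum over arms (unit variance assumed). The threshold `stand_in_threshold` is purely a placeholder chosen for the example: it is not the threshold of the paper and carries no correctness guarantee. The only property it mirrors is that it depends on time through the vector of counts.

```python
import math
from typing import Callable, List


def thresholding_statistic(
    counts: List[int], mu_hat: List[float], tau: float
) -> float:
    """Smallest weighted cost of moving one empirical mean to the threshold."""
    return min(n * (m - tau) ** 2 / 2.0 for n, m in zip(counts, mu_hat))


def stand_in_threshold(counts: List[int], delta: float) -> float:
    """Placeholder threshold (an assumption), function of the counts (all >= 1)."""
    return math.log(1.0 / delta) + sum(
        math.log(1.0 + math.log(n)) for n in counts
    )


def should_stop(
    counts: List[int],
    mu_hat: List[float],
    tau: float,
    delta: float,
    threshold: Callable[[List[int], float], float] = stand_in_threshold,
) -> bool:
    """Stop once the statistic exceeds the threshold."""
    return thresholding_statistic(counts, mu_hat, tau) >= threshold(counts, delta)
```

Passing the threshold as a callable keeps the test itself independent of the particular calibration used to guarantee correctness.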
Sampling rule, for
Update the weights (sub-gradient ascent)
Pull the arm (track the cumulative sum of weights)
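The sampling rule of Algorithm 1 can be sketched in a few lines. This is a minimal illustration under several assumptions that are not taken from the paper: the learning rate `1/sqrt(t)` and the exploration rate `1/(4*sqrt(t))` are hypothetical choices of the right order, the sub-gradient is evaluated at the current mirror-ascent iterate, the gradient clipping is omitted (as recommended in practice above), and the sub-gradient oracle is passed as a callable, which in a real run would also depend on the current empirical means.

```python
import math
from typing import Callable, List


def exp_weights(cum_grad: List[float], eta: float) -> List[float]:
    """Closed-form iterate: weights proportional to exp(eta * cumulative gradient)."""
    top = max(cum_grad)  # subtract the maximum for numerical stability
    raw = [math.exp(eta * (g - top)) for g in cum_grad]
    total = sum(raw)
    return [r / total for r in raw]


def run_sampling_rule(
    subgradient: Callable[[List[float]], List[float]],
    n_arms: int,
    n_rounds: int,
) -> List[int]:
    """Return the pull counts after initialization plus n_rounds tracked pulls."""
    counts = [1] * n_arms        # initialization: each arm pulled once
    cum_w = [1.0] * n_arms       # uniform weights during initialization
    cum_grad = [0.0] * n_arms
    w_tilde = [1.0 / n_arms] * n_arms
    for t in range(n_arms, n_arms + n_rounds):
        # Sub-gradient ascent step (lazy mirror ascent, entropic mirror map).
        grad = subgradient(w_tilde)
        cum_grad = [c + g for c, g in zip(cum_grad, grad)]
        eta = 1.0 / math.sqrt(t)              # hypothetical learning rate
        gamma = 1.0 / (4.0 * math.sqrt(t))    # hypothetical exploration rate
        w_tilde = exp_weights(cum_grad, eta)
        # Skew the weights toward the uniform distribution.
        w = [(1.0 - gamma) * x + gamma / n_arms for x in w_tilde]
        # Track the cumulative sum of weights.
        cum_w = [c + x for c, x in zip(cum_w, w)]
        arm = max(range(n_arms), key=lambda a: cum_w[a] - counts[a])
        counts[arm] += 1
    return counts
```

Each round costs one sub-gradient evaluation plus a number of elementary operations linear in the number of arms, which is the computational advantage discussed in Section 1.3.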
In order to perform a gradient ascent we need the sub-gradient of to be bounded in a neighborhood of . For the examples presented in Appendix A, or if the are bounded, this assertion holds, but for some pathological examples it can fail (see Appendix G.3). That is why we make the following assumption, where we denote by the ball of radius for the infinity norm centered at .
We assume that for all there exists that may depend on such that:
We can now state the main result of the paper.
2.1 -correctness of Algorithm 1
Proposition 1 (Maximal inequality).
2.2 Asymptotic Optimality of Algorithm 1
First we need some properties of regularity of the function around in order to prove a regret bound on the online lazy mirror ascent. In Appendix G we derive the following proposition.
Proposition 2 (Regularity).
For all and there exist constants that may depend on such that and it holds
Fix some real number and consider the typical event
where , for some horizon . We want to prove that for large enough, on the event , the difference between the maximum of for the true parameter, namely and its empirical counterpart at time , is small, precisely of order . To this aim we will use the following regret bound for the online lazy mirror ascent proved in Appendix F.
Proposition 3 (Regret bound for the online lazy mirror ascent).
We then need a consequence of the tracking and the forced exploration, proved in Appendix E, to relate to .
Proposition 4 (Tracking).
Hence if we rewrite the stopping rule (12)
since the algorithm will stop as soon as . Thus for such we have the inclusion . But thanks to the forced exploration, see Lemma 2, we know that . Therefore we obtain
Thus dividing the above inequality by and letting go to zero then go to zero allows us to conclude.
3 Numerical Experiments
For the experiments we consider the Best Arm Identification problem described in Section A.2. Precisely we restrict our attention to the simple, arbitrary, 4-armed bandit problem . The optimal proportion of draws is . The experiments compare several algorithms: the Lazy Mirror Ascent (LMA) described in Algorithm 1, the same algorithm but with a constant learning rate (LMAc), the Best Challenger (BC) algorithm given in Section A.2, the Direct Tracking (DT) algorithm by Garivier and Kaufmann, Top Two Thompson Sampling (TTTS) by Russo and finally uniform sampling (Unif) as a baseline. See Appendix B for details. Note in particular that all of them use the same Chernoff stopping rule (12) with the same threshold and the same decision rule (13). This allows a fair comparison between the sampling rules. Indeed it is known (see Garivier et al.) that the choice of the stopping rule is decisive to minimize the expected number of samples. We only investigate the effects of the sampling rule here because it is where the trade-off between uniform exploration and selective exploration takes place.
Table 1: Average execution time (in seconds) of one step of each algorithm.
Figure 1 displays the average number of draws of each of the aforementioned algorithms for two different confidence levels and . The associated theoretical expected number of draws is respectively for and for . Table 1 displays the average execution time of one step of these algorithms. Unsurprisingly all the algorithms perform better than the uniform sampling. LMA is comparable to the other algorithms, with slightly worse results. This may be due to the fact that lazy mirror ascent (with a learning rate of order ) is less aggressive than the Frank-Wolfe algorithm, for example. Indeed, using a constant learning rate (LMAc) we recover the same results as BC. But doing so we lose the guarantee of asymptotic optimality. The four mentioned algorithms share roughly the same (one step) execution time, which is expected since they have the same complexity, see Appendix B. The Direct Tracking of the optimal proportion of draws performs slightly better than the other algorithms but the execution time is much longer (approximately 100 times longer) due to the extra cost of computing the optimal weights. Note that TTTS also tends to be slow when the posteriors are well concentrated, since it is then hard to sample the challenger. But it is the only algorithm that does not explicitly force the exploration.
In this paper we developed a unified approach to Bandit Active Exploration problems. In particular we provided a general, computationally efficient, asymptotically optimal algorithm. To avoid obfuscating technicalities, we treated only the case of Gaussian arms with known variance and unknown mean, but the results can easily be extended to other one-parameter exponential families. For this, we just need to replace the maximal inequality of Proposition 1 by the one of Theorem 14 by Kaufmann and Koolen and to adapt the threshold accordingly.
Several questions remain open. It would be interesting to provide an analysis for the moderate-confidence regime, as argued by Simchowitz et al. Another direction for improvement could be to explore further the connection with the Frank-Wolfe algorithm. Nevertheless the main open question, from the author's point of view, is to find a natural way to explore instead of forcing the exploration. One possibility could be to use the principle of optimism in this setting. Indeed, even for Active Exploration problems there is a trade-off between uniformly exploring the distributions of the arms and selectively exploring the distributions of specific arms to find in which set the bandit problem lies.
- Abbasi-Yadkori et al.  Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- Antos et al.  András Antos, Varun Grover, and Csaba Szepesvári. Active learning in multi-armed bandits. In International Conference on Algorithmic Learning Theory, pages 287–302. Springer, 2008.
- Audibert and Bubeck  Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT-23th Conference on Learning Theory-2010, pages 13–p, 2010.
- Balsubramani  Akshay Balsubramani. Sharp finite-time iterated-logarithm martingale concentration. arXiv preprint arXiv:1405.2639, 2014.
- Berthet and Perchet  Quentin Berthet and Vianney Perchet. Fast rates for bandit optimization with upper-confidence frank-wolfe. In Advances in Neural Information Processing Systems, pages 2225–2234, 2017.
- Bubeck  Sébastien Bubeck. Introduction to online optimization. Lecture Notes, 2011.
- Bubeck et al.  Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Bubeck et al.  Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
- Chernoff  Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.
- Combes et al.  Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, pages 1763–1771, 2017.
- Degenne and Koolen  Rémy Degenne and Wouter M Koolen. Pure exploration with multiple correct answers. arXiv preprint arXiv:1902.03475, 2019.
- Even-Dar et al.  Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Pac bounds for multi-armed bandit and markov decision processes. In International Conference on Computational Learning Theory, pages 255–270. Springer, 2002.
- Finkelstein et al.  Helen Finkelstein et al. The law of the iterated logarithm for empirical distribution. The Annals of Mathematical Statistics, 42(2):607–615, 1971.
- Frank and Wolfe  Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
- Garivier and Kaufmann  Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998–1027, 2016.
- Garivier et al.  Aurélien Garivier, Pierre Ménard, and Laurent Rossi. Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv preprint arXiv:1711.04454, 2017.
- Garivier et al.  Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 2018.
- Kaufmann and Koolen  Emilie Kaufmann and Wouter Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv preprint arXiv:1811.11419, 2018.
- Kaufmann et al.  Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
- Lattimore and Szepesvari  Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737, 2017.
- Lattimore and Szepesvári  Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Preprint, 2019.
- Locatelli et al.  Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1690–1698, 2016.
- Mannor and Tsitsiklis  Shie Mannor and John N Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.
- Peña et al.  Victor H Peña, Tze Leung Lai, and Qi-Man Shao. Self-Normalized Processes. Springer Science & Business Media, 2008.
- Russo  Daniel Russo. Simple bayesian algorithms for best arm identification. In Conference on Learning Theory, pages 1417–1418, 2016.
- Shalev-Shwartz et al.  Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
- Simchowitz et al.  Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Conference on Learning Theory, pages 1794–1834, 2017.
- Soare et al.  Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836, 2014.
Appendix A Examples
In this appendix we present some classical and less classical active exploration bandit problems that can be described by the general framework presented in Section 1.1. Note that for all examples presented below Assumption 1 holds. For the first three examples it is a direct consequence of the expression of the sub-gradient. For the last one, one just needs to remark that the projection of a certain on an alternative set (for ) is such that belongs to the interval for all .
A.1 Thresholding Bandits
We fix a threshold . The objective here is to identify the set of arms above this threshold, . Therefore, to see this problem as a particular case of the one presented in Section 1.1 we choose the power set of and
For , it turns out that there is an explicit expression for and the characteristic time in this particular case,
In the function we recognize the minimum of the costs (with respect to the weights ) for moving the mean of one arm to the threshold. Thanks to this rewriting the computation of the sub-gradient is direct
for that realize the minimum in (18) (the non-zero coordinate is at position ).
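The explicit expressions of this example translate directly into code. The sketch below assumes unit-variance Gaussian arms and takes the function to be maximized as the minimum over arms of the weight times half the squared gap to the threshold, and the characteristic time as the sum over arms of twice the inverse squared gap, which is what the description above suggests. The function names are hypothetical and every mean is assumed to differ from the threshold.

```python
from typing import List


def arm_costs(w: List[float], mu: List[float], tau: float) -> List[float]:
    """Weighted cost of moving each mean to the threshold."""
    return [w_a * (m - tau) ** 2 / 2.0 for w_a, m in zip(w, mu)]


def value(w: List[float], mu: List[float], tau: float) -> float:
    """Minimum of the costs over the arms."""
    return min(arm_costs(w, mu, tau))


def subgradient(w: List[float], mu: List[float], tau: float) -> List[float]:
    """One element of the sub-gradient: non-zero only at a minimizing arm."""
    costs = arm_costs(w, mu, tau)
    arm = min(range(len(costs)), key=lambda a: costs[a])
    grad = [0.0] * len(mu)
    grad[arm] = (mu[arm] - tau) ** 2 / 2.0
    return grad


def characteristic_time(mu: List[float], tau: float) -> float:
    """Sum over arms of 2 / (mu_a - tau)^2."""
    return sum(2.0 / (m - tau) ** 2 for m in mu)


def optimal_weights(mu: List[float], tau: float) -> List[float]:
    """Weights inversely proportional to the squared gaps to the threshold."""
    inv = [1.0 / (m - tau) ** 2 for m in mu]
    total = sum(inv)
    return [x / total for x in inv]
```

At the optimal weights all the costs are equal, so the value coincides with the inverse of the characteristic time.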
A.2 Best Arm Identification
Here the objective is to identify the arm with the greatest mean. We set and
For , we can simplify a bit the expression of the characteristic time. Indeed, using well chosen alternatives, see Garivier and Kaufmann , we have
where is the mean between the optimal mean and the mean with respect to the weights :
We can see the weighted divergence that appears in (4) as the cost for moving the mean of arm above the optimal one and thus make the arm optimal. Precisely we move at the same time and to the weighted mean . The computation of the sub-gradient is also straightforward in this case
for active coordinates that realize the minimum in (19) (the non-zero coordinates are at positions and ). A variant of the Best Challenger sampling rule introduced by Garivier and Kaufmann , see also Russo , is given by
where we denote by the current optimal arm (the one with the greatest mean) at time . At a high level, we select the best challenger of the current best arm with respect to the cost that appears in (19). Then we greedily choose between and the one that increases this cost the most. Again, as in the previous example, this sampling rule rewrites as one step of the Frank-Wolfe algorithm for the function
A.3 Signed Bandits
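The cost, the sub-gradient and the Best Challenger variant described above can be sketched as follows for unit-variance Gaussian arms. The code reflects our reading of the text: the challenger is the arm with the smallest cost, and the arm pulled is whichever of the two active coordinates has the larger partial derivative. The function names are hypothetical.

```python
from typing import List, Tuple


def pair_cost(
    w_best: float, mu_best: float, w_a: float, mu_a: float
) -> Tuple[float, float, float]:
    """Cost of moving both means to their weighted mean, with both partial derivatives."""
    mid = (w_best * mu_best + w_a * mu_a) / (w_best + w_a)
    g_best = (mu_best - mid) ** 2 / 2.0
    g_a = (mu_a - mid) ** 2 / 2.0
    return w_best * g_best + w_a * g_a, g_best, g_a


def value_and_subgradient(
    w: List[float], mu: List[float]
) -> Tuple[float, List[float]]:
    """Minimum cost over challengers and one element of the sub-gradient."""
    best = max(range(len(mu)), key=lambda a: mu[a])
    value = float("inf")
    grad = [0.0] * len(mu)
    for a in range(len(mu)):
        if a == best:
            continue
        cost, g_best, g_a = pair_cost(w[best], mu[best], w[a], mu[a])
        if cost < value:
            value = cost
            grad = [0.0] * len(mu)
            grad[best], grad[a] = g_best, g_a
    return value, grad


def best_challenger_pull(counts: List[int], mu_hat: List[float]) -> int:
    """Pull the active coordinate with the largest partial derivative."""
    total = sum(counts)
    w = [n / total for n in counts]
    _, grad = value_and_subgradient(w, mu_hat)
    return max(range(len(grad)), key=lambda a: grad[a])
```

Because the sub-gradient is supported on the best arm and its challenger, maximizing it over all arms is the same as choosing greedily between these two, which is the Frank-Wolfe vertex.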
This is a variant of the Thresholding Bandits problem where we add the assumption that all the means lie above or under a certain threshold . Thus we choose and
It is easy to see, for , that the function and the characteristic time reduce to
In the function we recognize the cost (with respect to the weights ) for moving all the means to the threshold . The sub-gradient of at is
This example is interesting because if we follow a sampling rule based on the Frank-Wolfe algorithm, see (21) (which is equivalent to tracking the optimal proportion of draws in this case), it would boil down to a kind of Follow the Leader sampling rule. And it is well known that such a rule can fail to sample asymptotically according to the optimal proportion of draws, which is in this case:
where is the number of arms that attain the maximum that appears in the definition of the characteristic time, see (22). This highlights the necessity to force in some way the exploration.
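A short sketch makes the Follow the Leader behaviour visible. It assumes unit-variance Gaussian arms, a function equal to the weighted sum of half the squared gaps to the threshold, and an optimal proportion that is uniform over the arms attaining the largest gap, as the text above suggests. The function names and the tolerance `tol` are hypothetical.

```python
from typing import List


def signed_value(w: List[float], mu: List[float], tau: float) -> float:
    """Weighted cost of moving all the means to the threshold."""
    return sum(w_a * (m - tau) ** 2 / 2.0 for w_a, m in zip(w, mu))


def signed_subgradient(mu: List[float], tau: float) -> List[float]:
    """The function is linear in w, so its gradient does not depend on w."""
    return [(m - tau) ** 2 / 2.0 for m in mu]


def follow_the_leader_pull(mu_hat: List[float], tau: float) -> int:
    """Frank-Wolfe vertex: the arm whose empirical mean is farthest from tau."""
    grad = signed_subgradient(mu_hat, tau)
    return max(range(len(grad)), key=lambda a: grad[a])


def signed_optimal_weights(
    mu: List[float], tau: float, tol: float = 1e-12
) -> List[float]:
    """Uniform weights over the arms attaining the maximal gap."""
    grad = signed_subgradient(mu, tau)
    top = max(grad)
    leaders = [a for a, g in enumerate(grad) if top - g <= tol]
    return [1.0 / len(leaders) if a in leaders else 0.0 for a in range(len(mu))]
```

Since the pulled arm depends only on the empirical means and not on the current proportions, an unlucky early estimate is never corrected by the rule itself, which is why some exploration has to be forced.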
A.4 Monotonous thresholding bandit
It is again a variant of the Thresholding Bandit problem with some additional structure. We fix a threshold and assume that the sequence of means is increasing. The objective is to identify the arm with the closest mean to the threshold. Hence, we choose and
Unfortunately there is no explicit expression for , nor for the characteristic time, in this problem. But it is possible to efficiently compute an element of the sub-gradient of using isotonic regressions, see Garivier et al.
Appendix B Details on Numerical Experiments
As stated in Section 3, we consider the Best Arm Identification problem (see Appendix A) for . For all the algorithms we used the same stopping rule (12) with the threshold and decision rule (13). We consider the following sampling rules:
TTTS: this is essentially the sampling rule of Top Two Thompson Sampling by Russo. We use a Gaussian prior for each arm and we slightly alter the rule to choose between the best sampled arm and its re-sampled challenger . Inspired by (20), if we denote by the sample from the posterior where is optimal and by the re-sample where is optimal, we choose arm if , else. Here the complexity of one step is dominated by the sampling phase, in particular the sampling of the challenger, which can be costly if the posteriors are concentrated.
LMAc: Exactly the same as above but with a constant learning rate.
DT: this is the Direct Tracking (DT) algorithm by Garivier and Kaufmann. It tracks the optimal weights associated with the vector of empirical means, plus some forced exploration (same as BC). For the Best Arm Identification problem, to compute the optimal weights, one needs to find the root of an increasing function, e.g. by the bisection method, each evaluation of which requires the resolution of K scalar equations.
Unif: the arm is selected at random.
Appendix C Proof of Theorem 2
Fix some real number and consider the typical event
where , for some horizon such that and ( is sufficient). We also impose to be greater than the smallest integer such that . This condition allows us to get rid of the effects of clipping the gradient on .
similarly, we can replace by in the second sum
Hence, we deduce from (16), with , on the event
Now we need to compare the sum in (24) with the quantity . To this end we will use Proposition 3, which is a consequence of the tracking and the forced exploration, see (11) and (10). Thus, using the concavity of then Proposition 2 we have
Before applying Proposition 4 we need to handle the fact that the sum in the last inequality above begins at . But this is not harmful because is small enough; one can prove:
Indeed, using the triangle inequality we have
It remains to notice that
Thus, using the above inequality in (24) and dividing by we get