Non-Asymptotic Pure Exploration by Solving Games

06/25/2019 ∙ by Rémy Degenne, et al. ∙ Centrum Wiskunde & Informatica

Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment. Good algorithms make few mistakes and take few samples. Lower bounds (for multi-armed bandit models with arms in an exponential family) reveal that the sample complexity is determined by the solution to an optimisation problem. The existing state-of-the-art algorithms achieve asymptotic optimality by solving a plug-in estimate of that optimisation problem at each step. We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. We apply no-regret learners to obtain the first finite confidence guarantees that are adapted to the exponential family and which apply to any pure exploration query and bandit structure. Moreover, our algorithms only use a best response oracle instead of fully solving the optimisation problem.




1 Introduction

We study fundamental trade-offs arising in sequential interactive learning. We adopt the framework of Pure Exploration, in which the learning system interacts with its environment by performing a sequence of experiments, with the goal of maximising information gain. We aim to design general, efficient systems that can answer a given query with few experiments yet few mistakes.

As usual, we model the environment by a multi-armed bandit model with exponential family arms, and work in the fixed confidence ($\delta$-PAC) setting. Information-theoretic lower bounds [13] show that a certain number of samples is unavoidable to reach a certain confidence. Moreover, algorithms are developed in [13] that match these lower bounds asymptotically, in the small confidence regime $\delta \to 0$.

Our contribution is a framework for obtaining efficient algorithms with non-asymptotic guarantees. The main object of study is the “Pure Exploration Game” [9], a two-player zero-sum game that is central to lower bounds as well as to the widely used GLRT-based stopping rules. We develop iterative methods that provably converge to saddle-point behaviour. The game itself is not known to the learner, and has to be explored and estimated on the fly. Our methods are based on pairs of low-regret algorithms, combined with optimism and tracking. We prove sample complexity guarantees for several combinations of algorithms, and discuss their computational and statistical trade-offs.

The rest of the introduction provides more detail on pure exploration problems, the pure exploration game, the connection between them, and expands on our contribution. We also review related work.

Our model for the environment is a $K$-armed bandit, i.e. distributions $(\nu_1, \ldots, \nu_K)$ on $\mathbb{R}$. We assume throughout that these distributions come from a one-dimensional exponential family, and we denote by $d(\mu, \lambda)$ the relative entropy (Kullback-Leibler divergence) from the distribution with mean $\mu$ to that with mean $\lambda$. A pure exploration problem is parameterised by a set $\mathcal{M}$ of $K$-armed bandit models (the possible environments), a finite set $\mathcal{I}$ of candidate answers and a correct-answer function $i^* : \mathcal{M} \to \mathcal{I}$. We focus on Best Arm Identification, for which $i^*(\mu) = \arg\max_k \mu^k$, and the Minimum Threshold problem, which is defined for any fixed threshold $\gamma$ by $i^*(\mu) = \mathbb{1}\{\min_k \mu^k < \gamma\}$. The goal of the learner is to learn $i^*(\mu)$ confidently and efficiently by means of sequentially sampling from the arms of $\mu$, no matter which $\mu \in \mathcal{M}$ it faces. When an algorithm sequentially interacts with $\mu$, we denote by $N_t^k$ and $\hat\mu_t^k$ the sample count and empirical mean estimate (these form a sufficient statistic) for arm $k$ after $t$ rounds. We write $\tau_\delta$ for the time at which the algorithm stops and $\hat\imath$ for the answer it recommends. The algorithm is correct (on a particular run) if it recommends the correct answer for $\mu$. An algorithm is $\delta$-PAC (or $\delta$-correct) if $\mathbb{P}_\mu(\tau_\delta < \infty, \hat\imath \neq i^*(\mu)) \le \delta$ for each $\mu \in \mathcal{M}$. Among $\delta$-PAC algorithms, we are interested in those minimising the sample complexity $\mathbb{E}_\mu[\tau_\delta]$. As it turns out, what can be achieved, and how, is captured by a certain game.

For each $\mu \in \mathcal{M}$, [9] define the two-player zero-sum simultaneous-move Pure Exploration Game: MAX plays an arm $k \in [K]$, MIN plays an “alternative” bandit model $\lambda \in \mathcal{M}$ with a different correct answer $i^*(\lambda) \neq i^*(\mu)$. We denote the set of such alternatives to answer $i$ by $\neg i$. MAX then receives payoff $d(\mu^k, \lambda^k)$ from MIN. As the payoff is neither concave in $k$ (since discrete) nor convex in $\lambda$ (both domain and divergence are problematic), we will analyse the game by sequencing the moves and considering a mixed strategy for the player moving first. With MAX moving first and playing a mixed strategy $w \in \triangle_K$ (we identify distributions over $[K]$ and the simplex $\triangle_K$), the value of the game is

$$D_\mu := \sup_{w \in \triangle_K} D_\mu(w), \qquad \text{where} \qquad D_\mu(w) := \inf_{\lambda \in \neg i^*(\mu)} \sum_{k=1}^K w^k \, d(\mu^k, \lambda^k). \tag{1}$$

We denote a maximiser of $w \mapsto D_\mu(w)$ by $w^*(\mu)$ and call it an oracle allocation. The analogue where MIN plays first using a mixed strategy $q$ over $\neg i^*(\mu)$ (distributions over that set) is proposed and analysed in [9]. Despite the baroque domain of $\lambda$ in (1), there always exist minimax $q$ supported on $K$ points due to dimension constraints.
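To make the inner infimum of the game concrete, here is a minimal Python sketch for one special case that we assume purely for illustration: Best Arm Identification with Gaussian arms of known common variance, where $d(\mu,\lambda) = (\mu-\lambda)^2/(2\sigma^2)$. Given a mixed strategy `w` of MAX, the cheapest alternative lets a single challenger arm catch up with the best arm by moving both means to their `w`-weighted average. The function names and example values are hypothetical, and the infimum is attained on the boundary of the set of alternatives (a tie between two arms).

```python
import numpy as np

def gaussian_kl(mu, lam, sigma2=1.0):
    """KL divergence between Gaussians with means mu and lam, common variance sigma2."""
    return (mu - lam) ** 2 / (2.0 * sigma2)

def best_response_bai(w, mu, sigma2=1.0):
    """Best response of MIN to the mixed strategy w (Gaussian Best Arm, illustrative).

    Returns (value, lam) where lam minimises sum_k w[k] * d(mu[k], lam[k]) over the
    closure of the alternatives: models in which some arm j != i*(mu) ties with i*(mu).
    """
    w = np.asarray(w, dtype=float)
    mu = np.asarray(mu, dtype=float)
    i = int(np.argmax(mu))                      # correct answer i*(mu)
    best_val, best_lam = np.inf, None
    for j in range(len(mu)):
        if j == i:
            continue
        tot = w[i] + w[j]
        # Move both means to their w-weighted average; if neither arm carries
        # weight, any common value is free, so the midpoint is used.
        m = (w[i] * mu[i] + w[j] * mu[j]) / tot if tot > 0 else 0.5 * (mu[i] + mu[j])
        lam = mu.copy()
        lam[i] = lam[j] = m
        val = float(np.sum(w * gaussian_kl(mu, lam, sigma2)))
        if val < best_val:
            best_val, best_lam = val, lam
    return best_val, best_lam

# Hypothetical three-armed instance and allocation.
mu = np.array([1.0, 0.5, 0.0])
w = np.array([0.5, 0.25, 0.25])
value, lam = best_response_bai(w, mu)
print(value, lam)
```

In this special case the value for challenger $j$ reduces to $\frac{w^i w^j}{w^i + w^j} \cdot \frac{(\mu^i-\mu^j)^2}{2\sigma^2}$, so the whole oracle costs $O(K)$ arithmetic operations.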

The Pure Exploration Game is essential both to characterising the complexity of learning and to algorithm design. First, any $\delta$-correct algorithm has sample complexity for each bandit $\mu$ at least $T^*(\mu) \ln\frac{1}{\delta}$ to leading order, where $T^*(\mu) := 1/D_\mu$ is the inverse of the value of the game, and matching this rate requires sampling proportions converging to $w^*(\mu)$ (see [13]). Second, the general approach for obtaining $\delta$-correct algorithms is based on the Generalised Likelihood Ratio Test (GLRT) statistic $\inf_{\lambda \in \neg i^*(\hat\mu_t)} \sum_k N_t^k d(\hat\mu_t^k, \lambda^k) = t\, D_{\hat\mu_t}(N_t/t)$. There are universal thresholds $\beta(t,\delta)$ (see e.g. [12, 13, 19, 23]) such that $\mathbb{P}_\mu\big(\exists t : \sum_k N_t^k d(\hat\mu_t^k, \mu^k) \ge \beta(t,\delta)\big) \le \delta$ for any $\mu$. Hence stopping when $t\, D_{\hat\mu_t}(N_t/t) \ge \beta(t,\delta)$ and recommending $\hat\imath = i^*(\hat\mu_t)$ is $\delta$-correct for any sampling rule. Maximising the GLRT to stop as early as possible is achieved by the sampling proportions $N_t/t \approx w^*(\mu)$.

These considerations show that any successful Pure Exploration agent needs to (approximately) solve the Pure Exploration Game. The Track-and-Stop approach, pioneered by [13], ensures that $\hat\mu_t \to \mu$ using forced exploration, and $N_t/t \to w^*(\hat\mu_t)$ using tracking. Continuity of $D$ and $w^*$ then yields that $D_{\hat\mu_t}(N_t/t) \to D_\mu$. The GLRT stopping rule triggers when $t \approx T^*(\mu) \ln\frac{1}{\delta}$, meeting the lower bound in the asymptotic regime $\delta \to 0$.

Our contributions.

We explore methods to solve the Pure Exploration game associated with the unknown bandit model $\mu$, and discuss their statistical and computational trade-offs. We look at solving the game iteratively, by instantiating low-regret online learners for each player. In particular for the $k$-player we use a self-tuning instance of Exponentiated Gradient called AdaHedge [8]. The $\lambda$-player needs to play a distribution to deal with non-convexity; we consider Follow the Perturbed Leader as well as an ensemble of Online Gradient Descent experts. We show how optimistic gradient estimates, concentration of measure arguments and regret guarantees combine to deliver the first non-asymptotic sample complexity guarantees (which retain asymptotic optimality for $\delta \to 0$). The advantage of this approach is that it only requires a best response oracle (1, right) instead of a computationally more costly max-min oracle (1, left) employed by Track-and-Stop. Going to the other extreme, we also develop Optimistic Track-and-Stop based on a max-max-min oracle (the outer max implementing optimism over a confidence region for $\mu$), which trades increased computation for tighter sample complexity guarantees with simpler proofs.

Our cocktail sheds new light on the trade-offs involved in the design of pure exploration algorithms. We show how “big-hammer” forced exploration can be refined using problem-adapted optimism. We show how tracking is unnecessary when the $k$-player goes second. We show how computational complexity can be traded off using oracles of various sophistication. And finally, we validate our approach empirically in benchmark experiments at practical $\delta$, and find that our algorithms are either competitive with Track-and-Stop (dense $w^*$) or dominate it (sparse $w^*$).

Related work

Besides maximising information gain, there is a vast literature on maximising reward in multi-armed bandit models for which a good starting point is [21]. The canonical Pure Exploration problem is Best Arm Identification [10, 3], which is actively studied in the fixed confidence, fixed budget and simple regret settings [21, Ch. 33]. Its sample complexity as a function of the confidence level has been analysed very thoroughly in the (sub)-Gaussian case, where we have a rather complete picture, even including lower order terms [5]. [18] initiated the quest for correct instance-dependent constants for arms from any exponential family. [26] stresses the importance of the “moderate confidence” regime. Although it is not the focus here, we do believe that it is crucial to obtain the right problem dependence not only in $\delta$ but also in $K$ and other structural parameters, as the latter may in practice dominate the sample complexity.

Pure Exploration queries beyond Best Arm include Top-K [15], Thresholding [22], Minimum Threshold [20], Combinatorial Bandits [6], pure-strategy Nash equilibria [29] and Monte-Carlo Tree Search [27]. There is also significant interest in these problems in structured bandit models, including Rank-one [17], Lipschitz [23], Monotonic [14], Unimodal [7] and Unit-Sum [26]. Our framework applies to all these cases. Problems with multiple correct answers were recently considered by [9]. Existing learning strategies do not work unmodified; some fail and others need to be generalised.

Optimism is ubiquitous in bandit optimisation since [1], and was adapted to pure exploration by [16]. We are not aware of optimism being used to solve unknown min-max problems. Optimism was employed in the UCB Frank-Wolfe method by [2] for maximising an unknown smooth function faster. We do not currently know how to make use of such fast rate results, because for games the best response value is a non-smooth function of the action.

Using a pair of independent no-regret learners to solve a fixed and known game goes back to [11]. More recently game dynamics were used to explain (Nesterov) acceleration in offline optimisation [28]. Ensuring faster convergence with coordinating learners is an active area of research [25]. Unfortunately, we currently do not know how to obtain an advantage in this way, as our main learning overhead comes from concentration, not regret.

2 Algorithms with finite confidence sample complexity bounds

We introduce a family of algorithms, presented as Algorithm 1, with sample complexity bounds for non-asymptotic confidence. It uses the following ingredients: the GLRT stopping rule, a saddle point algorithm (possibly formed by two regret minimization algorithms) and optimistic loss estimates.

2.1 Model and assumption: sub-Gaussian exponential families.

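We suppose that the distributions belong to a known one-parameter exponential family. That is, there is a reference measure $\nu_0$ and parameters $\eta_1, \ldots, \eta_K$ such that the distribution of arm $k$ is defined by $\frac{d\nu_k}{d\nu_0}(x) \propto e^{\eta_k x}$. Examples include Gaussians with a given variance or Bernoulli with means in $(0,1)$. All results can be extended to arms each in a possibly different known exponential family. Let $\Theta$ be the open interval of possible means of such distributions. A distribution $\nu$ is said to be $\sigma^2$-sub-Gaussian if for all $u \in \mathbb{R}$, $\log \mathbb{E}_{X \sim \nu}\, e^{u(X - \mathbb{E}[X])} \le \frac{\sigma^2 u^2}{2}$. An exponential family has all distributions sub-Gaussian with constant $\sigma^2$ iff for all $\mu, \lambda \in \Theta$, it verifies $d(\mu, \lambda) \ge \frac{(\mu - \lambda)^2}{2\sigma^2}$.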
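As a small illustration of the two example families, the following Python sketch implements their divergences and numerically checks the quadratic lower bound $d(\mu,\lambda) \ge (\mu-\lambda)^2/(2\sigma^2)$ for Bernoulli arms with $\sigma^2 = 1/4$ (Pinsker's inequality). The function names, the clipping constant `eps` and the grid are our own illustrative choices, not part of the paper.

```python
import numpy as np

def kl_gaussian(mu, lam, sigma2=1.0):
    """d(mu, lam) for Gaussians with common variance sigma2."""
    return (mu - lam) ** 2 / (2.0 * sigma2)

def kl_bernoulli(mu, lam, eps=1e-12):
    """d(mu, lam) for Bernoulli means; eps clipping avoids log(0) at the boundary."""
    mu = np.clip(mu, eps, 1 - eps)
    lam = np.clip(lam, eps, 1 - eps)
    return mu * np.log(mu / lam) + (1 - mu) * np.log((1 - mu) / (1 - lam))

# Check d(mu, lam) >= (mu - lam)^2 / (2 * sigma2) with sigma2 = 1/4 on a grid.
grid = np.linspace(0.01, 0.99, 99)
M, L = np.meshgrid(grid, grid)
gap = kl_bernoulli(M, L) - (M - L) ** 2 / (2 * 0.25)
print(gap.min() >= -1e-12)
```

For Gaussian arms the bound holds with equality, so the sub-Gaussian constant is exactly the variance.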

Assumption 1.

The arm distributions belong to sub-Gaussian exponential families with constant $\sigma^2$.

Assumption 2.

There exists a closed interval $[\mu_{\min}, \mu_{\max}] \subset \Theta$ such that $\mathcal{M} \subseteq [\mu_{\min}, \mu_{\max}]^K$.

As a consequence of Assumption 2, there exist $L, D > 0$ such that for all $\lambda \in [\mu_{\min}, \mu_{\max}]$, the function $\mu \mapsto d(\mu, \lambda)$ is $L$-Lipschitz on $[\mu_{\min}, \mu_{\max}]$ and $d(\mu, \lambda) \le D$. Assumption 1 is implied by Assumption 2. Both are discussed in Appendix F. In particular, Assumption 2 can often be relaxed. The constants $L$ and $D$ will appear in the sample complexity bounds but none of our algorithms use them explicitly.

Everywhere below, $\hat\mu_t$ denotes the orthogonal projection of the empirical mean onto $[\mu_{\min}, \mu_{\max}]^K$, with one possible exception: the GLRT stopping rule may use it either projected or not, indifferently.

2.2 Algorithmic ingredients

1:Algorithms and , stopping threshold and exploration bonus .
2:Sample each arm once and form estimate .
3:for  do
4:     For , let . KL confidence intervals
5:     Let . if
6:     Let .
7:     Stop and output if . GLRT Stopping rule
8:     Get and from and .
9:     For , let . Optimism
10:     Feed the loss
11:     Feed the loss .
12:     Pick arm . Cumulative tracking
13:     Observe sample . Update .
14:end for
Algorithm 1 Pure exploration meta-algorithm.

Stopping and recommendation rules.

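The algorithm stops if any one of $|\mathcal{I}|$ GLRT tests succeeds [13]. Let $L_\mu(X_1, \ldots, X_t)$ denote the likelihood of the observations under the model parametrized by $\mu$. The generalized log-likelihood ratio between a set $\Lambda$ and the whole parameter space $\Theta^K$ is

$$\mathrm{GLR}_t^{\Lambda} := \log \frac{\sup_{\xi \in \Theta^K} L_\xi(X_1, \ldots, X_t)}{\sup_{\lambda \in \Lambda} L_\lambda(X_1, \ldots, X_t)}.$$

By concentration of measure arguments, we may find $\beta(t, \delta)$ such that with probability greater than $1 - \delta$, for all $t$, $\mathrm{GLR}_t^{\{\mu\}} \le \beta(t, \delta)$ (see [12, 13, 19, 23]). Test $i$ succeeds if $\mathrm{GLR}_t^{\neg i} > \beta(t, \delta)$. If the algorithm stops because of test $i$, recommend $\hat\imath = i$. If several tests succeed at the same time, choose arbitrarily among these.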

Theorem 1.

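Any algorithm using the GLRT stopping and recommendation rules with threshold $\beta(t, \delta)$ such that $\mathbb{P}_\mu\big(\exists t : \sum_k N_t^k d(\hat\mu_t^k, \mu^k) > \beta(t, \delta)\big) \le \delta$ is $\delta$-correct.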
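As an illustration of how the stopping rule is evaluated, the sketch below checks the GLRT condition for the simplest case we could pick: two Gaussian arms with known variance, where the statistic has a closed form. The threshold `stylised_threshold` is an assumption made for this example (a heuristic of the form $\ln\frac{1+\ln t}{\delta}$, not one certified by Theorem 1), and all function names are hypothetical.

```python
import math

def stylised_threshold(t, delta):
    """Assumed heuristic threshold log((1 + log t) / delta); not licensed by theory."""
    return math.log((1.0 + math.log(t)) / delta)

def glrt_two_gaussian_arms(n1, n2, m1, m2, sigma2=1.0):
    """GLRT statistic inf_lambda sum_k N_k d(mean_k, lambda_k) for two Gaussian arms.

    The cheapest alternative (the other arm being best) moves both empirical means
    to their count-weighted average, which gives this closed form.
    """
    return (n1 * n2 / (n1 + n2)) * (m1 - m2) ** 2 / (2.0 * sigma2)

def should_stop(n1, n2, m1, m2, delta):
    """Stop as soon as the statistic exceeds the threshold at time t = n1 + n2."""
    t = n1 + n2
    return glrt_two_gaussian_arms(n1, n2, m1, m2) > stylised_threshold(t, delta)

# Plenty of evidence: stop. Little evidence: keep sampling.
print(should_stop(50, 50, 1.0, 0.0, delta=0.1))
print(should_stop(5, 5, 0.5, 0.0, delta=0.1))
```

When the rule stops, the recommendation is the arm with the larger empirical mean, in line with the recommendation rule above.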

A game with two players

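An algorithm is unable to stop at time $t$ if the stopping condition is not met, i.e.

$$\beta(t, \delta) \ge \max_{i \in \mathcal{I}} \inf_{\lambda \in \neg i} \sum_k N_t^k d(\hat\mu_t^k, \lambda^k).$$

In order to stop early, the right hand side has to be maximized, i.e. made close to $t D_\mu$. Then with $\beta(t, \delta) \approx \ln\frac{1}{\delta}$ we obtain $t \lesssim T^*(\mu) \ln\frac{1}{\delta}$ up to lower order terms, i.e. the stopping time is close to optimality.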

We propose to approach that max-min saddle-point by implementing two iterative algorithms, $\mathcal{A}^k$ and $\mathcal{A}^\lambda$, for the $k$-player and the $\lambda$-player. Our sample complexity bound is a function of two quantities $R_t^k$ and $R_t^\lambda$, regret bounds of algorithms $\mathcal{A}^k$ and $\mathcal{A}^\lambda$ when used for $t$ steps on appropriate losses.

One player of our choice goes first. The second player can see the action of the first, see the corresponding loss function and use an algorithm with zero regret (e.g. Best-Response or Be-The-Leader). One of the players has to play distributions on its action set. We have one of the following:

  1. The $k$-player plays first and uses a distribution $w_t$ in $\triangle_K$. The $\lambda$-player plays $\lambda_t \in \neg i_t$.

  2. The $\lambda$-player plays first and uses $q_t$ (a distribution over $\neg i_t$). The $k$-player plays $k_t \in [K]$.

  3. Both players play distributions and go in any order, or concurrently.

Algorithm 1 presents two players playing concurrently but can be modified: if for example $\mathcal{A}^\lambda$ plays second, then it gets to see $w_t$ before computing $\lambda_t$.

The sampling rule at stage $t$ first computes the most likely answer $i_t$ for $\hat\mu_{t-1}$. If the set over which the algorithm optimizes at line 4 is empty, $i_t$ is arbitrary. The $k$-player plays $w_t$ coming from $\mathcal{A}^k_{i_t}$, an instance of $\mathcal{A}^k$ running only on the rounds on which the selected answer is that $i_t$. The $\lambda$-player similarly uses an instance $\mathcal{A}^\lambda_{i_t}$ of $\mathcal{A}^\lambda$.


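Since a single arm has to be pulled, if the $k$-player plays $w_t \in \triangle_K$ an additional procedure is needed to translate that play into a sampling rule. We use a so-called tracking procedure, $k_t \in \arg\min_k N_{t-1}^k - \sum_{s \le t} w_s^k$, which ensures that each count $N_t^k$ remains within a bounded distance of the cumulative weight $\sum_{s \le t} w_s^k$.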
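The following Python sketch simulates a cumulative tracking rule of this kind: at each round it pulls the arm whose count lags furthest behind its cumulative weight. It is a minimal illustration under our own assumptions (random Dirichlet weights stand in for the $k$-player's plays, and the initial round-robin of Algorithm 1 is omitted); the function name is hypothetical.

```python
import numpy as np

def cumulative_tracking(weights):
    """Turn a sequence of simplex vectors (rows of `weights`) into arm pulls.

    Pulls the arm minimising N_{t-1}^k - sum_{s<=t} w_s^k and records the largest
    absolute deviation between counts and cumulative weights seen so far.
    """
    T, K = weights.shape
    counts = np.zeros(K)
    cum_w = np.zeros(K)
    arms = np.empty(T, dtype=int)
    max_dev = 0.0
    for t in range(T):
        cum_w += weights[t]
        k = int(np.argmin(counts - cum_w))   # arm lagging furthest behind its target
        counts[k] += 1
        arms[t] = k
        max_dev = max(max_dev, float(np.max(np.abs(counts - cum_w))))
    return arms, counts, cum_w, max_dev

rng = np.random.default_rng(0)
K, T = 5, 2000
weights = rng.dirichlet(np.ones(K), size=T)   # stand-in for the k-player's plays
arms, counts, cum_w, max_dev = cumulative_tracking(weights)
print(max_dev)
```

A short induction shows why the deviation stays bounded here: the deviations sum to zero after every pull and the pulled arm lags by at least $1/K$, so no count exceeds its target by more than $1$, and hence none falls behind by more than $K-1$.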

Optimism in face of uncertainty.

Existing algorithms for general pure exploration use forced exploration to ensure convergence of $\hat\mu_t$ to $\mu$, making sure that every arm is sampled more than e.g. $\sqrt{t}$ times. We replace that method by the “optimism in face of uncertainty” principle, which gives a more adaptive exploration scheme. While that heuristic is widely used in the bandit literature, this work is its first successful implementation for general pure exploration. In Algorithm 1, the $k$-player algorithm gets an optimistic loss depending on $\hat\mu_{t-1}$ and $N_{t-1}$. The $\lambda$-player gets a non-optimistic loss.
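To illustrate what an optimistic loss can look like, the sketch below computes a KL confidence interval for a Bernoulli arm by bisection and then takes the largest divergence to a candidate alternative mean over the interval endpoints, floored at `bonus / n`. This particular form of the optimistic gain is our assumption about the "Optimism" step of Algorithm 1, whose formula did not survive in the text above; the function names and the numbers are hypothetical.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Bernoulli relative entropy d(p, q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_interval(mu_hat, n, bonus, iters=60):
    """Endpoints of {xi in [0, 1] : n * d(mu_hat, xi) <= bonus}, by bisection."""
    def solve(inside, outside):
        # Invariant: `inside` satisfies the constraint, `outside` does not
        # (or is the boundary of the domain).
        for _ in range(iters):
            mid = 0.5 * (inside + outside)
            if n * kl_bernoulli(mu_hat, mid) <= bonus:
                inside = mid
            else:
                outside = mid
        return inside
    return solve(mu_hat, 0.0), solve(mu_hat, 1.0)

def optimistic_gain(mu_hat, n, bonus, lam):
    """Assumed optimistic estimate of d(mu, lam): worst case over the interval ends."""
    lo, hi = kl_interval(mu_hat, n, bonus)
    return max(bonus / n, kl_bernoulli(lo, lam), kl_bernoulli(hi, lam))

n, bonus = 100, math.log(100)
a, b = kl_interval(0.5, n, bonus)
gain = optimistic_gain(0.5, n, bonus, lam=0.3)
print(a, b, gain)
```

Because $d(\cdot, \lambda)$ is convex in its first argument, the maximum over the two endpoints upper-bounds the divergence at every point of the interval, which is what makes the estimate optimistic whenever the true mean lies inside it.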

2.3 Proof scheme and sample complexity result

In order to bound the sample complexity, we introduce a sequence of concentration events for and . It verifies (see Appendix B for a proof). The concentration intervals used in Algorithm 1 are a function of for .

Lemma 1.

Let be an event and be such that for , . Then

We now present briefly the steps of the proof for the stopping time upper bound before stating our main theorem on the sample complexity of Algorithm 1. These steps are inexact and should be regarded as a guideline and not as rigorous computations. A full proof of our results can be found in the appendices (Appendix B for concentration results,  C for tracking and D for the main sample complexity proof). We simplify the presentation by supposing that throughout (the main proof will show this may fail only rounds). For , under concentration event ,

(stopping condition)

The first term is now the infimum of a sum of losses, . We use the regret property of the -player’s algorithm on those losses, then we introduce optimistic values such that for we have .

(regret )
(regret )

Finally, is itself the expectation of another distribution on . Hence

Putting these inequalities together, we get finally an inequality on such a . The exact result we obtain is the following Theorem, proved in Appendix D.

Theorem 2.

Under Assumption 2, the sample complexity of Algorithm 1 on model is

where depends on and and . See Appendix D for an exact definition.

The forms of and of depend on the particular algorithm but we now show how an inequality of that type translates into . The next lemma is a consequence of the concavity of .

Lemma 2.

Suppose that verifies the equation . Then for ,

3 Practical Implementations

Next we discuss instantiating no-regret learners. We consider a hierarchy of computational oracles:

  1. Min aka Best-Response oracle: obtain for any , and a minimizer in of .

  2. Max-min aka Game-Solving oracle: obtain for any and a vector such that there is a Nash equilibrium for the zero-sum game with reward with the $k$-player using the mixed strategy .

  3. Max-max-min oracle: for any confidence region , obtain with and a $k$-player strategy of a Nash equilibrium of the game with reward .

For Minimum Threshold all oracles can be evaluated in closed form in $O(K)$ time, and the same is true for Best Response in Best Arm Identification. Max-min for Best Arm requires binary search [13] and Max-max-min requires max-min calls. See [24] for run-time data on Track-and-Stop (max-min oracle) and gradient ascent (min oracle) for Best Arm. Our approach also extends naturally to min-max and max-min-max oracles, which we plan to incorporate in full detail in our future work.
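As an example of such closed forms, here is a minimal Python sketch of the best-response and max-min oracles for Minimum Threshold, assuming Gaussian arms with known variance and means different from the threshold. If the lowest mean is below $\gamma$, the cheapest alternative lifts every arm below $\gamma$ up to $\gamma$; otherwise it pushes a single arm down to $\gamma$. The function names and example values are hypothetical, and infima are taken over the closure of the alternatives.

```python
import numpy as np

def kl_gaussian(mu, lam, sigma2=1.0):
    return (mu - lam) ** 2 / (2.0 * sigma2)

def min_threshold_best_response(w, mu, gamma):
    """Return (value, lam) minimising sum_k w[k] * d(mu[k], lam[k]) over alternatives."""
    w = np.asarray(w, dtype=float)
    mu = np.asarray(mu, dtype=float)
    d = kl_gaussian(mu, gamma)
    if mu.min() < gamma:
        # Answer "some arm is below gamma": lift all such arms to the threshold.
        below = mu < gamma
        return float(np.sum(w[below] * d[below])), np.where(below, gamma, mu)
    # Answer "all arms are above gamma": lower the cheapest single arm.
    k = int(np.argmin(w * d))
    lam = mu.copy()
    lam[k] = gamma
    return float(w[k] * d[k]), lam

def min_threshold_oracle_weights(mu, gamma):
    """Max-min allocation: a maximiser of the best-response value over the simplex."""
    mu = np.asarray(mu, dtype=float)
    d = kl_gaussian(mu, gamma)
    if mu.min() < gamma:
        w = np.zeros_like(mu)
        w[int(np.argmin(mu))] = 1.0          # all weight on the lowest arm
        return w
    return (1.0 / d) / np.sum(1.0 / d)       # equalise w[k] * d[k] across arms

w_above = min_threshold_oracle_weights([0.5, 1.0], gamma=0.0)
v_above, _ = min_threshold_best_response(w_above, [0.5, 1.0], gamma=0.0)
w_below = min_threshold_oracle_weights([-1.0, 0.5], gamma=0.0)
v_below, _ = min_threshold_best_response(w_below, [-1.0, 0.5], gamma=0.0)
print(w_above, v_above, w_below, v_below)
```

The two cases show the dense versus sparse behaviour of the oracle allocation: when all arms are above the threshold every arm receives weight, whereas a single arm receives all the weight when some arm is below it.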

3.1 A Learning Algorithm for the $k$-Player vs Best-Response for the $\lambda$-Player

In this section the $k$-player plays first, employing a regret minimization algorithm for linear losses on the simplex to produce $w_t$ at time $t$. We pick AdaHedge of [8], which runs in $O(K)$ per round and adapts to the scale of the losses. The $\lambda$-player goes second and can use a zero-regret algorithm: Best-Response. It plays

$$\lambda_t \in \arg\min_{\lambda \in \neg i_t} \sum_k w_t^k \, d(\hat\mu_{t-1}^k, \lambda^k).$$
Lemma 3.

AdaHedge has regret where is the loss scale in round , so that . Best-Response has no regret, . The sample complexity is bounded per Theorem 2.

We expect that in practice the scale converges to after a transitory startup phase.

Computational complexity: one best-response oracle call per time step.
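For readers who want to experiment with the $k$-player, the following is a compact Python sketch of AdaHedge written from the published description in [8]: exponential weights whose learning rate is $\ln K$ divided by the cumulative mixability gap. It is an illustrative re-implementation rather than the authors' code; the class and method names are our own, and in the setting of this section the loss vector fed at each round would be minus the vector of optimistic gains.

```python
import numpy as np

class AdaHedge:
    """Exponential weights with the self-tuning AdaHedge learning rate (sketch)."""

    def __init__(self, n_actions):
        self.K = n_actions
        self.L = np.zeros(n_actions)   # cumulative loss of each action
        self.Delta = 0.0               # cumulative mixability gap

    def _eta(self):
        return np.inf if self.Delta == 0 else np.log(self.K) / self.Delta

    def _mix(self, eta, L):
        """Weights and cumulative mix loss for learning rate eta."""
        mn = L.min()
        if np.isinf(eta):
            w = (L == mn).astype(float)          # uniform over the current leaders
        else:
            w = np.exp(-eta * (L - mn))
        s = w.sum()
        M = mn if np.isinf(eta) else mn - np.log(s / self.K) / eta
        return w / s, M

    def act(self):
        """Distribution over actions to play in the next round."""
        return self._mix(self._eta(), self.L)[0]

    def update(self, loss):
        """Incorporate a loss vector; returns the learner's expected loss."""
        loss = np.asarray(loss, dtype=float)
        eta = self._eta()
        w, M_prev = self._mix(eta, self.L)
        h = float(w @ loss)                      # Hedge loss of this round
        self.L = self.L + loss
        _, M = self._mix(eta, self.L)
        # Mixability gap; clipped at 0 to guard against rounding errors.
        self.Delta += max(0.0, h - (M - M_prev))
        return h

ah = AdaHedge(2)
total = 0.0
for _ in range(100):
    total += ah.update([0.0, 1.0])               # action 0 is always better
regret = total - ah.L.min()
print(regret, ah.act())
```

No range of the losses is passed in: the learning rate is driven entirely by the observed mixability gaps, which is what allows the learner to adapt to the loss scale mentioned in Lemma 3.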

3.2 Learning Algorithms for the $\lambda$-Player vs Best Response for the $k$-Player

Using a learner for the $\lambda$-player removes the need for a tracking procedure. In this section the $k$-player goes second and uses Best-Response, with zero regret, i.e.  (see Algorithm 1). After playing , the $\lambda$-player suffers loss .

Most existing regret minimization algorithms do not apply since the function $\lambda \mapsto d(\mu, \lambda)$ is not convex in general and the action set $\neg i_t$ is also not convex. The challenge is to come up with an algorithm able to play distributions with only access to a best-response oracle.


Follow-The-Perturbed-Leader can sample points from a distribution on by only using best-response oracle calls on . The version we use here incorporates all the information available to the -player: the loss of will be where the only unknown quantity is . Let

be a random vector with independent exponentially distributed coordinates. The idea is that the distribution

played by the -player should be the distribution of

We show in Appendix E.2 that this argmin can be computed by a single best-response oracle call. However, the -player has to be able to compute the best response to . Since we cannot get the above distribution exactly, we instead take for an empirical distribution from samples. A regret bound for that algorithm is in Appendix E.2. The sample complexity is then bounded by Theorem 2.

Computational complexity: best-response oracle calls at time step .

Online Gradient Descent.

While the learning problem for the $\lambda$-player is hard in general, in several common cases the sets $\neg i$ have a simple structure. If these sets are unions of a finite number of convex sets and $\lambda \mapsto d(\mu, \lambda)$ is convex (e.g. for Gaussian or Bernoulli arm distributions), then we can use off-the-shelf regret algorithms. One gradient descent learner can be used on each convex set, and these experts are then aggregated by an exponential weights algorithm. This procedure would have $O(\sqrt{t})$ regret. The computational complexity is (convex) best-response oracle calls per time step.

3.3 Optimistic Track-and-Stop.

At stage , this algorithm computes where ranges over all points in in a confidence region around and . Then, the $k$-player plays such that there exists a Nash equilibrium of the game with reward . The proof of its sample complexity bound proceeds slightly differently from the sketch of Section 2.3, although the ingredients are still the GLRT, concentration, optimism and game-solving. The proof of the following lemma can be found in Appendix E.2.

Lemma 4.

Take in the definition of . Let . Then the expected sample complexity is at most , where is the maximal such that .

Note: the factors are due to the tracking. We conjecture that they should be instead.

Computational complexity: one max-max-min oracle call per time step.

This algorithm is the most computationally expensive but has the best sample complexity upper bound, has a simpler proof and works well in experiments where computing the max-max-min oracle is feasible, like the Best Arm and Minimum Threshold problems (see Section 4).

4 Experiments

The goal of our experiments is to empirically validate Algorithm 1 on benchmark problems for practical $\delta$. We use stylised stopping threshold $\beta(t, \delta) = \ln\frac{1 + \ln t}{\delta}$ and exploration bonus $f(t) = \ln t$. Both are unlicensed by theory yet conservative in practice (the error frequency is way below $\delta$). We use the following letter coding to designate sampling rules: D for AdaHedge vs Best-Response as advocated in Section 3.1, T for Track-and-Stop of [13], M for the Gradient Ascent algorithm of [24], O for Optimistic Track-and-Stop from Section 3.3, RR for uniform, and opt for following the oracle proportions $w^*(\mu)$. We also ran all our experiments on a simplification of D that uses a single learner instead of partitioning the rounds according to $i_t$. We omit it from the results, as it was always within a few percent of D. We append -C or -D to indicate whether cumulative or direct tracking [13] is employed. We finally note that we tune the learning rate of M in terms of the (unknown) problem scale.

We perform two series of experiments, one on Best Arm instances from [13, 24], and one on Minimum Threshold instances from [20]. Two selected experiments are shown in Figure 1, the others are included in Appendix G. We contrast the empirical sample complexity with the lower bound, and with a more “practical” version, which indicates the time $t$ for which $t D_\mu = \beta(t, \delta)$, which is approximately the first time at which the GLRT stopping rule crosses the threshold $\beta(t, \delta)$.

We see in Figures 1(a) and 1(b) that direct tracking -D has the advantage over cumulative tracking -C across the board, and that uniform sampling RR is sub-optimal as expected. In Figure 1(a) we see that T performs best, closely followed by M and O. Sampling from the oracle weights opt performs surprisingly poorly (as also observed in [26, Table 1]). The main message of Figure 1(b) is that T can be highly sub-optimal. We comment on the reason in Appendix G.2. Asymptotic optimality of T implies that this effect disappears as $\delta \to 0$. However, for this example this kicks in excruciatingly slowly. Figure 5(d) shows that T is still not competitive at . On the other hand, O performs best, closely followed by M and then D. Practically, we recommend using O if its computational cost is acceptable, M if an estimate of the problem scale is available for tuning, and D otherwise.

(a) Best Arm for Bernoulli bandit model . The oracle weights are .
(b) Minimum Threshold for Gaussian bandit model with threshold , . Note the excessive sample complexity of T-C/ T-D.
Figure 1: Selected experiments. In both cases . Plots based on and runs.

The gap between opt and T (or O) shows that Track-and-Stop outperforms its design motivation. It is an exciting open problem to understand exactly why, and to optimise for stopping early (moderate $\delta$) while ensuring optimality ($\delta \to 0$).

5 Conclusion

We leveraged the game point of view of the pure exploration problem, together with the use of the optimism principle, to derive algorithms with sample complexity guarantees for non-asymptotic confidence. Varying the flavours of optimism and saddle-point strategies leads to procedures with diverse tradeoffs between sample and computational complexities. Our sample complexity bounds attain asymptotic optimality while offering guarantees for moderate confidence and the obtained algorithms are empirically sound. Our bounds however most probably do not depend optimally on the problem parameters, like the number of arms $K$. For BAI and the Top-K arms problems, lower bounds with lower order terms as well as matching algorithms were derived by [26]. A generalization of such lower bounds to the general pure exploration problem could shed light upon the optimal complexity across the full confidence spectrum.

The richness of existing saddle-point iterative algorithms may bring improved performance over our relatively simple choices. A smart algorithm could possibly take advantage of the stochastic nature of the losses instead of treating them as completely adversarial.


We are grateful to Zakaria Mhammedi and Emilie Kaufmann for multiple generous discussions. Travel funding was provided by INRIA Associate Team PAC. The experiments were carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.


  • [1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
  • [2] Quentin Berthet and Vianney Perchet. Fast rates for bandit optimization with upper-confidence Frank-Wolfe. In Advances in Neural Information Processing Systems (NeurIPS), pages 2222–2231, 2017.
  • [3] S. Bubeck, R. Munos, and G. Stoltz. Pure Exploration in Finitely Armed and Continuous Armed Bandits. Theoretical Computer Science, 412:1832–1852, 2011.
  • [4] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • [5] Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 535–592, Amsterdam, Netherlands, July 2017. PMLR.
  • [6] S. Chen, T. Lin, I. King, M. Lyu, and W. Chen. Combinatorial Pure Exploration of Multi-Armed Bandits. In Advances in Neural Information Processing Systems, 2014.
  • [7] Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning, pages 521–529, 2014.
  • [8] Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, Hedge if you must. Journal of Machine Learning Research, 15:1281–1316, April 2014.
  • [9] Rémy Degenne and Wouter M. Koolen. Pure exploration with multiple correct answers. ArXiv, February 2019.
  • [10] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In 15th Annual Conference on Learning Theory (COLT), volume 2375 of Lecture Notes in Computer Science, pages 255–270. Springer, 2002.
  • [11] Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
  • [12] Aurélien Garivier. Informational confidence bounds for self-normalized averages and applications. In 2013 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2013.
  • [13] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998–1027, 2016.
  • [14] Aurélien Garivier, Pierre Ménard, and Laurent Rossi. Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv preprint arXiv:1711.04454, 2017.
  • [15] S. Kalyanakrishnan and P. Stone. Efficient Selection in Multiple Bandit Arms: Theory and Practice. In International Conference on Machine Learning (ICML), 2010.
  • [16] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning (ICML), 2012.
  • [17] Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Stochastic rank-1 bandits. In Aarti Singh and Xiaojin (Jerry) Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning Research, pages 392–401. PMLR, 2017.
  • [18] E. Kaufmann, O. Cappé, and A. Garivier. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. Journal of Machine Learning Research, 17(1):1–42, 2016.
  • [19] Emilie Kaufmann and Wouter M. Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. Preprint, October 2018.
  • [20] Emilie Kaufmann, Wouter M. Koolen, and Aurélien Garivier. Sequential test for the lowest mean: From Thompson to Murphy sampling. In Advances in Neural Information Processing Systems (NeurIPS) 31, pages 6333–6343, December 2018.
  • [21] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2019.
  • [22] Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1690–1698, 2016.
  • [23] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975–999, 2014.
  • [24] Pierre Ménard. Gradient ascent for active exploration in bandit problems. arXiv preprint arXiv:1905.08165, 2019.
  • [25] Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems (NeurIPS), pages 3066–3074, 2013.
  • [26] Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Conference on Learning Theory, pages 1794–1834, 2017.
  • [27] Kazuki Teraoka, Kohei Hatano, and Eiji Takimoto. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions, 97-D(3):392–398, 2014.
  • [28] Jun-Kun Wang and Jacob D. Abernethy. Acceleration through optimistic no-regret dynamics. In Advances in Neural Information Processing Systems (NeurIPS), pages 3828–3838, 2018.
  • [29] Yichi Zhou, Jialian Li, and Jun Zhu. Identify the Nash equilibrium in static games with random payoffs. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4160–4169, International Convention Centre, Sydney, Australia, August 2017. PMLR.

Appendix A Likelihood Ratio and Exponential Families

A.1 Canonical one-parameter exponential families

We suppose that all arms have distributions in a canonical one-parameter exponential family. That is, there is a reference measure and parameters such that the distribution of arm is defined by

Let be the convex conjugate of , i.e. . Let be the open interval on which the first derivative is defined. The Kullback-Leibler divergence between the distributions of the exponential family with means and in is

A distribution is said to be -sub-Gaussian if for all , . A canonical one-parameter exponential family has all distributions sub-Gaussian with constant iff for all , it verifies .

A.2 The Generalized log-likelihood ratio

The generalized log-likelihood ratio between the whole model space and a subset is

In the case of a canonical one-parameter exponential family, the likelihood of the model with means is

For two mean vectors,

The maximum likelihood estimator corresponding to the data is

The GLR for set is