On Regret with Multiple Best Arms

06/26/2020 ∙ by Yinglun Zhu, et al. ∙ University of Wisconsin-Madison

We study a regret minimization problem in the presence of multiple best/near-optimal arms in the multi-armed bandit setting. We consider the case where the number of arms/actions is comparable to or much larger than the time horizon, and make no assumptions about the structure of the bandit instance. Our goal is to design algorithms that can automatically adapt to the unknown hardness of the problem, i.e., the number of best arms. Our setting captures many modern applications of bandit algorithms where the action space is enormous and information about the underlying instance/structure is unavailable. We first propose an adaptive algorithm that is agnostic to the hardness level and theoretically derive its regret bound. We then prove a lower bound for our problem setting, which indicates: (1) no algorithm can be optimal simultaneously over all hardness levels; and (2) our algorithm achieves an adaptive rate function that is Pareto optimal. With additional knowledge of the expected reward of the best arm, we propose another adaptive algorithm that is minimax optimal, up to polylog factors, over all hardness levels. Experimental results confirm our theoretical guarantees and show advantages of our algorithms over the previous state-of-the-art.




1 Introduction

Multi-armed bandit problems describe exploration-exploitation trade-offs in sequential decision making. Most existing bandit algorithms tend to provide regret guarantees when the number of available arms/actions is smaller than the time horizon. In modern applications of bandit algorithms, however, the action space is usually comparable to or even much larger than the allowed time horizon, so that many existing bandit algorithms cannot even complete their initial exploration phases. Consider a problem of personalized recommendations, for example. For most users, the total number of movies, or even the number of sub-categories, far exceeds the number of times they visit a recommendation site. Similarly, the enormous amount of user-generated content on YouTube and Twitter makes it increasingly challenging to make optimal recommendations. The tension between a very large action space and the allowed time horizon poses a realistic problem in which deploying algorithms that converge to an optimal solution over an asymptotically long time horizon does not give satisfying results. There is a need to design algorithms that can exploit the highest possible reward within a limited time horizon.

Past work has partially addressed this challenge. The quantile regret proposed in chaudhuri2018quantile measures regret with respect to a satisfactory action rather than the best one. The discounted regret analyzed in ryzhov2012knowledge; russo2018satisficing is used to emphasize short-time-horizon performance. Other existing works consider the extreme case where the number of actions is indeed infinite, and tackle such problems with one of two main assumptions: (1) the discovery of a near-optimal/best arm follows some probability measure with known parameters berry1997bandit; wang2009algorithms; aziz2018pure; ghalme2020ballooning; (2) there exists a smooth function that represents the mean-payoff over a continuous subset agrawal1995continuum; kleinberg2005nearly; kleinberg2008multi; bubeck2011x; locatelli2018adaptivity; hadiji2019polynomial. However, in many situations, neither assumption may be realistic.

We make minimal assumptions in this paper. We study the regret minimization problem over a time horizon $T$, which might be unknown, with respect to a bandit instance with $n$ total arms, out of which $m$ are best/near-optimal arms. We emphasize that the allowed time horizon and the given bandit instance should be viewed as features of one problem, and together they indicate an intrinsic hardness level. We consider the case where $n$ is comparable to or larger than $T$, so that no standard algorithm provides a satisfying result. Our goal is to design algorithms that can adapt to the unknown $m$ and achieve optimal regret.

1.1 Contributions and paper organization

We make the following contributions. In Section 2, we formally define the regret minimization problem that represents the tension between a very large action space and a limited time horizon, and capture the hardness level in terms of the number of best arms. We provide an adaptive algorithm that is agnostic to the unknown number of best arms in Section 3, and theoretically derive its regret bound. In Section 4, we prove a lower bound for our problem setting which indicates that no algorithm can be optimal simultaneously over all hardness levels. Our lower bound also shows that the algorithm provided in Section 3 is Pareto optimal. With additional knowledge of the expected reward of the best arm, in Section 5 we provide an algorithm that achieves the non-adaptive minimax optimal regret, up to polylog factors, without knowledge of the number of best arms. Experiments conducted in Section 6 confirm our theoretical guarantees and show advantages of our algorithms over the previous state-of-the-art. We conclude our paper in Section 7. Most of the proofs are deferred to the Appendix due to lack of space.

1.2 Related work

Time sensitivity and large action space. As bandit models are getting much more complex, usually with large or infinite action spaces, researchers have begun to pay attention to tradeoffs between regret and time horizons when deploying such models. deshpande2012linear study a linear bandit problem with ultra-high dimension, and provide algorithms that, under various assumptions, can achieve good reward within a short time horizon. russo2018satisficing also take the time horizon into account and model time preference by analyzing a discounted regret. chaudhuri2018quantile consider a quantile regret minimization problem where they define their regret with respect to the expected reward ranked at a given quantile. One could easily transfer their problem to our setting; however, their regret guarantee is sub-optimal. katz2019true; aziz2018pure also consider the problem with multiple best/near-optimal arms and no other assumptions, but they focus on the pure exploration setting; aziz2018pure additionally requires knowledge of the number of best arms $m$. Another line of research considers the extreme case where the number of arms is infinite, but with some known regularities. berry1997bandit propose an algorithm with a minimax optimality guarantee in the situation where the reward of each arm strictly follows a Bernoulli distribution; teytaud:inria-00173263 provide an anytime algorithm that works under the same assumption. wang2009algorithms relax the assumption on the Bernoulli reward distribution; however, some other parameters are assumed to be known in their setting.

Continuum-armed bandit. Many papers also study bandit problems with continuous action spaces, where they embed each arm into a bounded subset and assume there exists a smooth function governing the mean-payoff for each arm. This setting was first introduced by agrawal1995continuum. When the smoothness parameters are known to the learner or under various assumptions, there exist algorithms kleinberg2005nearly; kleinberg2008multi; bubeck2011x with near-optimal regret guarantees. When the smoothness parameters are unknown, however, locatelli2018adaptivity prove a lower bound indicating that no strategy can be optimal simultaneously over all smoothness classes; under extra information, they provide adaptive algorithms with near-optimal regret guarantees. Although achieving optimal regret for all settings is impossible, hadiji2019polynomial design adaptive algorithms and prove that they are Pareto optimal. Our algorithms are mainly inspired by the ones in hadiji2019polynomial; locatelli2018adaptivity. A closely related line of work valko2013stochastic; grill2015black; bartlett2018simple; shang2019general aims at minimizing simple regret in the continuum-armed bandit setting.

Adaptivity to unknown parameters. bubeck2011lipschitz argue that assuming awareness of regularity is flawed and that one should design algorithms that can adapt to the unknown environment. In situations where the goal is pure exploration or simple regret minimization, katz2019true; valko2013stochastic; grill2015black; bartlett2018simple; shang2019general achieve near-optimal guarantees with unknown regularity because their objectives trade off exploitation in favor of exploration. In the case of cumulative regret minimization, however, locatelli2018adaptivity show that no strategy can be optimal simultaneously over all smoothness classes. In special situations or under extra information, bubeck2011lipschitz; bull2015adaptive; locatelli2018adaptivity provide algorithms that adapt in different ways. hadiji2019polynomial borrow the concept of Pareto optimality from economics and provide algorithms with adaptive rates that are Pareto optimal. Adaptivity is studied in statistics as well: in some cases, only additional logarithmic factors are required lepskii1991problem; birge1997model; in others, however, there exists an additional polynomial cost of adaptation cai2005adaptive.

2 Problem Statement and Notation

We consider the multi-armed bandit instance $\underline{\nu} = (\nu_1, \dots, \nu_n)$ with $n$ (could be infinite) probability distributions such that each $\nu_i$ is sub-Gaussian with mean $\mu_i$. Let $\mu_\star = \max_{1 \le i \le n} \mu_i$ be the highest mean and $S_\star = \{i : \mu_i = \mu_\star\}$ denote the subset of best arms. The cardinality $|S_\star| = m$ is unknown to the learner. We could also generalize our setting to $S'_\star = \{i : \mu_i \ge \mu_\star - \epsilon(T)\}$ with unknown $|S'_\star|$ (i.e., situations where there is an unknown number of near-optimal arms). Setting $\epsilon$ to be $T$-dependent, e.g., $\epsilon(T) = 1/\sqrt{T}$, is to avoid an additive term linear in $T$. All theoretical results and algorithms presented in this paper are applicable to this generalized setting with minor modifications. For ease of exposition, we focus on the case with multiple best arms throughout the paper. At each time step $t \in \{1, \dots, T\}$, the algorithm/learner selects an action $A_t \in \{1, \dots, n\}$, based on information collected already, and receives an independent reward $X_t \sim \nu_{A_t}$. We measure the success of an algorithm through the expected cumulative (pseudo) regret

$$R_T = T \mu_\star - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{A_t}\right].$$

We use $\mathcal{R}(T, n, m)$ to denote the set of regret minimization problems with allowed time horizon $T$ and any bandit instance with $n$ total arms and $m$ best arms. We emphasize that $T$ is part of the problem instance, a point largely neglected in previous work focusing on asymptotic results. We are particularly interested in the case where $n$ is comparable to or even larger than $T$, which captures many modern applications where the available action space far exceeds the allowed time horizon. Although learning algorithms may not be able to pull each arm once, one should notice that the true/intrinsic hardness level of the problem can be viewed as $n/m$: selecting a subset uniformly at random with cardinality $\Theta(n/m)$ guarantees, with constant probability, access to at least one best arm; but of course it is impossible to do this without knowing $m$. We quantify the intrinsic hardness level over a set of regret minimization problems $\mathcal{R}(T, n, m)$ as

where, with a slight abuse of notation, we set to avoid the trivial case with all best arms. is used here as it captures the minimax optimal regret over the set of regret minimization problem , as explained later in our review of the MOSS algorithm and the lower bound. As smaller indicates easier problems, we then define the family of regret minimization problems with hardness level at most as

with $\alpha \in [0, 1]$. Although $\mathcal{R}(T, n, m)$ is necessary to define a regret minimization problem, we actually encode the hardness level into a single parameter $\alpha$, which captures the tension between the complexity of the bandit instance at hand and the allowed time horizon $T$: problems with different time horizons but the same $\alpha$ are equally difficult in terms of the achievable minimax regret (the exponent of $T$). We thus mainly study problems with $T$ large enough so that we can focus on the polynomial terms of $T$. We are interested in designing algorithms with minimax guarantees over $\mathcal{H}_T(\alpha)$, but without knowledge of $\alpha$.

MOSS and upper bound. In the standard setting where $n \le T$, MOSS, designed by audibert2009minimax and improved in garivier2018kl in terms of constant factors, achieves the minimax optimal regret. In this paper, we will use MOSS as a subroutine with regret upper bound $O(\sqrt{nT})$. For any problem in $\mathcal{H}_T(\alpha)$ with known $\alpha$, one could run MOSS on a subset selected uniformly at random with cardinality $\tilde{O}(T^{\alpha})$ and achieve regret $\tilde{O}(T^{(1+\alpha)/2})$.
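The subroutine described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Bernoulli rewards, uses the index $\hat{\mu}_j + \sqrt{\max\{\log(T/(K N_j)), 0\}/N_j}$ (one common form of the MOSS index; constants vary across references), and the helper names `moss`, `bernoulli`, and `moss_on_random_subset` are hypothetical.

```python
import math
import random


def moss(arms, horizon, rng):
    """Run MOSS on `arms` (callables rng -> reward) for `horizon` rounds.

    Returns the number of pulls of each arm.
    """
    k = len(arms)
    counts = [0] * k
    sums = [0.0] * k

    def index(j):
        # Empirical mean plus the MOSS exploration bonus (assumed form).
        bonus = math.sqrt(max(math.log(horizon / (k * counts[j])), 0.0) / counts[j])
        return sums[j] / counts[j] + bonus

    for t in range(horizon):
        # Pull each arm once, then follow the largest index.
        i = t if t < k else max(range(k), key=index)
        reward = arms[i](rng)
        counts[i] += 1
        sums[i] += reward
    return counts


def bernoulli(mean):
    """Return a sampler for a Bernoulli arm with the given mean."""
    return lambda rng: 1.0 if rng.random() < mean else 0.0


def moss_on_random_subset(means, subset_size, horizon, seed=0):
    """Run MOSS on a uniformly random subset of arms; return the pseudo-regret."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(means)), min(subset_size, len(means)))
    counts = moss([bernoulli(means[i]) for i in chosen], horizon, rng)
    best = max(means)
    # Pseudo-regret is measured against the best arm of the full instance.
    return sum(c * (best - means[i]) for i, c in zip(chosen, counts))
```

The pseudo-regret is computed against the best arm of the whole instance, so a subset that misses every best arm pays the approximation error on every round.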

Lower bound. The lower bound $\Omega(\sqrt{nT})$ in the standard setting does not work for our setting, as its proof heavily relies on the existence of a single best arm lattimore2018bandit. However, for problems in $\mathcal{H}_T(\alpha)$, we do have a matching lower bound $\Omega(T^{(1+\alpha)/2})$, as one could always apply the standard lower bound on a bandit instance with $n = \lfloor T^{\alpha} \rfloor$ and $m = 1$.

Throughout the paper, we denote for any positive integer . Although may appear in our bounds, we focus on the case as otherwise the bound is trivial.

3 An adaptive algorithm

Algorithm 1 takes the time horizon $T$ and a user-specified $\beta \in [1/2, 1]$ as input, and it is mainly inspired by hadiji2019polynomial. Algorithm 1 operates in iterations with geometrically-increasing length $\Delta T_i = \min\{2^{p+i}, T\}$ with $p = \lceil \log_2 T^{\beta} \rceil$. At each iteration $i$, it restarts MOSS on a set $S_i$ consisting of $K_i = 2^{p+2-i}$ real arms selected uniformly at random plus a set of "virtual" mixture-arms (one from each of the $i-1$ previous iterations, none if $i = 1$). The mixture-arms are constructed as follows. After each iteration $i$, let $\hat{p}_i$ denote the vector of empirical sampling frequencies of the arms in that iteration (i.e., the $j$-th element of $\hat{p}_i$ is the number of times arm $j$, including all previously constructed mixture-arms, was sampled in iteration $i$ divided by the total number of samples $\Delta T_i$). The mixture-arm for iteration $i$ is the $\hat{p}_i$-mixture of the arms, denoted by $\tilde{\nu}_i$. When MOSS samples from $\tilde{\nu}_i$ it first draws $j_t \sim \hat{p}_i$, then draws a sample from the corresponding arm $\nu_{j_t}$ (or $\tilde{\nu}_{j_t}$). The mixture-arms provide a convenient summary of the information gained in the previous iterations, which is key to our theoretical analysis. Although our algorithm works on fewer regular arms in later iterations, the information summarized in the mixture-arms is good enough to provide guarantees. We name our algorithm MOSS++ as it restarts MOSS at each iteration with past information summarized in empirical measures. We provide an anytime version of Algorithm 1 in Section A.2 via the standard doubling trick.

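Algorithm 1 MOSS++
Input:  Time horizon $T$ and user-specified parameter $\beta \in [1/2, 1]$
1:  Set: $p = \lceil \log_2 T^{\beta} \rceil$, $K_i = 2^{p+2-i}$ and $\Delta T_i = \min\{2^{p+i}, T\}$
2:  for $i = 1, \dots, p$ do
3:     Run MOSS on a subset of arms $S_i$ for $\Delta T_i$ rounds. $S_i$ contains $K_i$ real arms selected uniformly at random and a virtual arm from each of the previous iterations, $\{\tilde{\nu}_j\}_{j < i}$.
4:     Construct a virtual mixture-arm $\tilde{\nu}_i$ based on the empirical sampling frequencies of MOSS above
5:  end for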
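The restart-with-mixture-arms idea can be sketched in Python as below. This is an illustrative sketch under explicit assumptions, not the authors' code: the schedule (halving the number of fresh arms while doubling the iteration length, starting from $p = \lceil \beta \log_2 T \rceil$) is an assumed parameter choice, each iteration is truncated so that exactly $T$ rounds are played, rewards are Bernoulli, and the names `moss`, `mixture`, and `moss_pp` are hypothetical.

```python
import math
import random


def moss(arms, horizon, rng):
    """Run MOSS on sampler callables for `horizon` rounds; return pull counts."""
    k = len(arms)
    counts, sums = [0] * k, [0.0] * k

    def index(j):
        bonus = math.sqrt(max(math.log(horizon / (k * counts[j])), 0.0) / counts[j])
        return sums[j] / counts[j] + bonus

    for t in range(horizon):
        i = t if t < k else max(range(k), key=index)
        reward = arms[i](rng)
        counts[i] += 1
        sums[i] += reward
    return counts


def bernoulli(mean):
    return lambda rng: 1.0 if rng.random() < mean else 0.0


def mixture(samplers, counts):
    """Virtual arm: draw an arm from the empirical frequencies, then sample it."""
    indices = range(len(samplers))
    return lambda rng: samplers[rng.choices(indices, weights=counts)[0]](rng)


def moss_pp(means, horizon, beta, seed=0):
    """Sketch of MOSS++; returns (pseudo-regret, number of rounds played)."""
    rng = random.Random(seed)
    p = math.ceil(beta * math.log2(horizon))       # assumed number of iterations
    best, regret, played = max(means), 0.0, 0
    virtual = []                                   # (mean, sampler) of mixture-arms
    for i in range(1, p + 1):
        if played >= horizon:
            break
        k_i = min(2 ** (p + 2 - i), len(means))    # assumed number of fresh real arms
        length = min(2 ** (p + i), horizon - played)  # assumed length, truncated at T
        fresh = rng.sample(range(len(means)), k_i)
        arms = [(means[j], bernoulli(means[j])) for j in fresh] + virtual
        counts = moss([sampler for _, sampler in arms], length, rng)
        regret += sum(c * (best - mu) for c, (mu, _) in zip(counts, arms))
        # Summarize this iteration as one mixture-arm for all later iterations.
        mix_mean = sum(c * mu for c, (mu, _) in zip(counts, arms)) / length
        virtual.append((mix_mean, mixture([s for _, s in arms], counts)))
        played += length
    return regret, played
```

Because the mean of a mixture-arm is the frequency-weighted average of the means it summarizes, the pseudo-regret of pulling it can be tracked exactly, which keeps the sketch easy to check.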

3.1 Analysis and discussion

We use $\mu_U$ to denote the highest expected reward over a set of distributions/arms $U$. For any algorithm that only works on $U$, we can decompose the regret into approximation error plus learning error, as follows.

$$R_T = \underbrace{T\,(\mu_\star - \mu_U)}_{\text{approximation error}} + \underbrace{T\,\mu_U - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{A_t}\right]}_{\text{learning error}} \qquad (1)$$

This type of regret decomposition was previously used in kleinberg2005nearly; auer2007improved; hadiji2019polynomial to deal with the continuum-armed bandit problem. We consider here a probabilistic version, with randomness in the selection of $U$, for the classical setting.

The main idea behind providing guarantees for MOSS++ is to decompose its regret at each iteration, using Eq. 1, and then bound the (expected) approximation error and learning error separately. The learning error at each iteration can always be controlled as $\tilde{O}(T^{\beta})$ thanks to the regret guarantees for MOSS and the specifically chosen parameters $p$, $K_i$, $\Delta T_i$. Let $i_\star$ be the largest integer such that $K_{i_\star}$ is still large enough for a uniformly random subset of that size to contain a best arm with high probability. The approximation error in iteration $i \le i_\star$ can then be upper bounded following an analysis of the hypergeometric distribution. As a result, the expected regret in iteration $i \le i_\star$ is $\tilde{O}(T^{\beta})$. Since the mixture-arm $\tilde{\nu}_{i_\star}$ is included in all following iterations, we can further bound the approximation error in iteration $i > i_\star$ by $\tilde{O}(T^{1+\alpha-\beta})$ after a careful analysis. This intuition is formally stated and proved in Theorem 1.

Theorem 1.

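Running MOSS++ with time horizon $T$ and a user-specified parameter $\beta \in [1/2, 1]$ leads to the following regret upper bound, where $\tilde{O}$ hides polylogarithmic factors in $T$:

$$R_T = \tilde{O}\left(T^{\min\{\max\{\beta,\, 1+\alpha-\beta\},\, 1\}}\right).$$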

Remark 1.

We primarily focus on the polynomial terms in $T$ when deriving the bound, and put no effort into optimizing the polylog terms. The exponent of $\log T$ might be tightened as well.

The theoretical guarantee is closely related to the user-specified parameter $\beta$: when $\alpha \le \beta$, we suffer a multiplicative cost of adaptation $\tilde{O}(T^{|(2\beta-\alpha-1)/2|})$ compared to the non-adaptive minimax regret, with $\beta = (1+\alpha)/2$ hitting the sweet spot; when $\alpha > \beta$, there are essentially no guarantees. One may hope to improve this result. However, our lower bound provided in Section 4 indicates: (1) achieving minimax optimal regret for all settings simultaneously is impossible; and (2) the adaptive rate achieved by MOSS++ is already Pareto optimal.

4 Lower bound and Pareto optimality

4.1 Lower bound

The intuition of the lower bound comes from the exploration-exploitation tradeoff among problems with different hardness levels. Consider an algorithm and any $\alpha' < \alpha$. If the algorithm achieves a regret of larger order than $T^{(1+\alpha')/2}$ over $\mathcal{H}_T(\alpha')$, it is then already not minimax optimal for $\mathcal{H}_T(\alpha')$. Now suppose the algorithm achieves a near-optimal regret over $\mathcal{H}_T(\alpha')$; then it should not explore too many arms extensively. This, on the other hand, indicates that it may not even be able to locate one best arm in a much harder problem in $\mathcal{H}_T(\alpha)$. Thus, it is not possible to construct an algorithm that is simultaneously optimal for all values of $\alpha$. This intuition is formalized in Theorem 2.

Theorem 2.

For any , assume and . If an algorithm is such that , then the regret of this algorithm on is lower bounded as follows


4.2 Pareto optimality

We capture the performance of any algorithm by its dependence on polynomial terms of $T$ in the asymptotic sense. Note that the hardness level of a problem is encoded in $\alpha$.

Definition 1.

Let denote a non-decreasing function. An algorithm achieves the adaptive rate if

As there may not always exist an ordering over adaptive rate functions, following hadiji2019polynomial, we consider the notion of Pareto optimality over rate functions achieved by some algorithms.

Definition 2.

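An adaptive rate function $\theta$ is Pareto optimal if there is no other rate function $\theta'$ such that $\theta'(\alpha) \le \theta(\alpha)$ for all $\alpha \in [0, 1]$ and $\theta'(\alpha_0) < \theta(\alpha_0)$ for at least one $\alpha_0 \in [0, 1]$.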

Figure 1: Pareto optimal rates

Combining the results in Theorem 1 and Theorem 2, we obtain the following Theorem 3, illustrated in Fig. 1.

Theorem 3.

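The adaptive rate achieved by MOSS++ with any $\beta \in [1/2, 1]$, i.e.,

$$\theta_\beta : \alpha \mapsto \min\{\max\{\beta,\, 1+\alpha-\beta\},\, 1\},$$

is Pareto optimal.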
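To make the trade-off concrete, the short sketch below tabulates an adaptive rate against the non-adaptive minimax exponent. It assumes the rate takes the form $\min\{\max\{\beta, 1+\alpha-\beta\}, 1\}$ and that the minimax exponent is $(1+\alpha)/2$, as suggested by the surrounding discussion; the function names are hypothetical.

```python
def moss_pp_rate(alpha, beta):
    """Assumed adaptive rate: exponent of T in the regret at hardness level alpha."""
    return min(max(beta, 1.0 + alpha - beta), 1.0)


def minimax_rate(alpha):
    """Assumed non-adaptive minimax exponent when alpha is known."""
    return (1.0 + alpha) / 2.0


if __name__ == "__main__":
    for beta in (0.5, 0.75, 1.0):
        row = [f"{moss_pp_rate(a / 10, beta):.2f}" for a in range(0, 11, 2)]
        print(f"beta={beta:.2f}:", " ".join(row))
    print("minimax:  ", " ".join(f"{minimax_rate(a / 10):.2f}" for a in range(0, 11, 2)))
```

Each choice of $\beta$ matches the minimax exponent only at $\alpha = 2\beta - 1$ and pays a polynomial cost elsewhere, which is why different values of $\beta$ give incomparable rate functions.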

Remark 2.

One should notice that the naive algorithm running MOSS on a subset selected uniformly at random with cardinality is not Pareto optimal, since running MOSS++ with serves as a Pareto improvement. The algorithm provided in chaudhuri2018quantile, if transferred to our setting and allowed a time-horizon-dependent quantile, is not Pareto optimal either, as it corresponds to the rate function .

5 Optimal strategy with extra information

Although no algorithm could adapt to all settings, one could actually design algorithms achieving near minimax optimal regret under extra information. We provide such an algorithm with expected reward of the best arm as the extra information; our algorithm is mainly inspired by locatelli2018adaptivity.

5.1 Algorithm

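We name our Algorithm 3 Parallel as it maintains $\lceil \log T \rceil$ instances of a subroutine, i.e., Algorithm 2, in parallel. Each subroutine $\mathrm{SR}_i$ is initialized with time horizon $T$ and hardness level $\alpha_i$. We use $T_{i,t}$ to denote the number of samples allocated to $\mathrm{SR}_i$ up to time $t$, and represent its empirical regret at time $t$ as $\hat{R}_{i,t} = T_{i,t}\,\mu_\star - \sum_{s=1}^{T_{i,t}} X_{i,s}$, with $X_{i,s} \sim \nu_{A_{i,s}}$ being the $s$-th empirical reward obtained by $\mathrm{SR}_i$ and $A_{i,s}$ being the index of the $s$-th arm pulled by $\mathrm{SR}_i$.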

Algorithm 2 MOSS Subroutine ()
Input:  Time horizon and hardness level
1:  Select a subset of arms uniformly at random with and run MOSS on .

Algorithm 3 Parallel
Input:  Time horizon and the optimal reward
1:  set: , and
2:  for  do
3:     Set , initialize with , ; set , and
4:  end for
5:  for  do
6:     Select and run for rounds
7:     Update
8:  end for

Parallel operates in iterations of length . At the beginning of each iteration, i.e., at time for , Parallel first selects the subroutine with the lowest empirical regret so far (breaking ties arbitrarily), i.e., ; it then resumes the learning process of , from where it halted, for another more pulls. All information is updated at the end of that iteration. An anytime version of Algorithm 3 is provided in Section C.3.
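The selection rule described above (always resume the subroutine with the lowest empirical regret) can be sketched as follows. This is a rough illustration under stated assumptions rather than the authors' implementation: the number of subroutines $\lceil \log T \rceil$, the block length $\lceil \sqrt{T} \rceil$, the grid $\alpha_i = i / \lceil \log T \rceil$, and the subset size $\lceil T^{\alpha_i} \rceil$ are all assumed choices, rewards are Bernoulli, and the names `MossRun` and `parallel` are hypothetical.

```python
import math
import random


class MossRun:
    """A MOSS instance on a fixed arm subset that can be paused and resumed."""

    def __init__(self, means, horizon, rng):
        self.means, self.horizon, self.rng = means, horizon, rng
        self.counts = [0] * len(means)
        self.sums = [0.0] * len(means)
        self.pulls = 0        # samples allocated to this subroutine so far
        self.reward = 0.0     # total reward collected by this subroutine

    def _index(self, j):
        k, n_j = len(self.means), self.counts[j]
        bonus = math.sqrt(max(math.log(self.horizon / (k * n_j)), 0.0) / n_j)
        return self.sums[j] / n_j + bonus

    def step(self):
        """Pull one arm and return its true mean (used for pseudo-regret)."""
        k = len(self.means)
        i = self.pulls if self.pulls < k else max(range(k), key=self._index)
        x = 1.0 if self.rng.random() < self.means[i] else 0.0
        self.counts[i] += 1
        self.sums[i] += x
        self.pulls += 1
        self.reward += x
        return self.means[i]


def parallel(means, horizon, mu_star, seed=0):
    """Sketch of Parallel; returns the pseudo-regret over `horizon` rounds."""
    rng = random.Random(seed)
    grid = math.ceil(math.log(horizon))        # assumed number of subroutines
    block = math.ceil(math.sqrt(horizon))      # assumed iteration length
    runs = []
    for i in range(1, grid + 1):
        size = min(len(means), math.ceil(horizon ** (i / grid)))  # assumed subset size
        subset = rng.sample(range(len(means)), size)
        runs.append(MossRun([means[j] for j in subset], horizon, rng))
    played, regret = 0, 0.0
    while played < horizon:
        # Empirical regret uses mu_star, the extra information given to the learner.
        run = min(runs, key=lambda r: r.pulls * mu_star - r.reward)
        for _ in range(min(block, horizon - played)):
            regret += mu_star - run.step()
            played += 1
    return regret
```

Knowing $\mu_\star$ is what makes the empirical regret computable online; without it the subroutines could only be compared through their raw rewards.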

5.2 Analysis

As Parallel discretizes the hardness parameter over a grid with interval , we first show that the best subroutine achieves regret .

Lemma 1.

Suppose is the true hardness parameter and . Then running Algorithm 2 with time horizon and leads to the following regret bound

Since Parallel always allocates new samples to subroutine with the lowest empirical regret so far, we know that the regret of every subroutine should be roughly in the same order at time ; particularly, all subroutines should achieve regret , as the best subroutine does. Parallel then achieves the non-adaptive minimax optimal regret, up to polylog factors, without knowing the true hardness level .

Theorem 4.

For any $\alpha \in [0, 1]$ unknown to the learner, running Parallel with time horizon $T$ and optimal expected reward $\mu_\star$ leads to the following regret upper bound, with a universal constant $C$,

6 Experiments

We conduct three experiments to compare our algorithms with baselines. In Section 6.1, we compare the performance of each algorithm on problems with varying hardness levels. We examine how the regret curve of each algorithm grows on synthetic and real-world datasets in Section 6.2 and Section 6.3, respectively.

We first introduce the nomenclature of the algorithms. We use MOSS to denote the standard MOSS algorithm, and MOSS Oracle to denote Algorithm 2 with known . Quantile represents the algorithm (QRM2) proposed by chaudhuri2018quantile to minimize the regret with respect to the mean at the -th quantile, without knowledge of . One could easily transfer Quantile to our setting with the top fraction of arms treated as best. As suggested in chaudhuri2018quantile, we reuse the statistics obtained in previous iterations of Quantile to improve its sample efficiency. We use MOSS++ to represent the vanilla version of Algorithm 1, and empMOSS++ to represent an empirical version such that: (1) empMOSS++ reuses statistics obtained in previous rounds, as done in Quantile; and (2) instead of selecting real arms uniformly at random at the -th iteration, empMOSS++ selects arms with the highest empirical mean for . We choose for MOSS++ and empMOSS++ in all experiments, even though better performance is anticipated by selecting . All results are averaged over 100 experiments. The shaded area represents 0.5 standard deviation for each algorithm.

6.1 Adaptivity to hardness level

Figure 2: (a) Comparison with varying hardness levels (b) Regret curve comparison with .

We compare our algorithms with baselines on regret minimization problems with different hardness levels. For this experiment, we generate best arms with expected reward 0.9 and sub-optimal arms with expected reward evenly distributed among . All arms follow Bernoulli distributions. We set the time horizon to and consider the total number of arms . We vary $\alpha$ from 0.1 to 0.9 to control the number of best arms and thus the hardness level. In Fig. 2(a), the regret of every algorithm gets larger as $\alpha$ increases, which is expected. MOSS does not provide satisfying performance due to the large action space and relatively small time horizon. Although implemented in an anytime fashion, Quantile could be roughly viewed as an algorithm that runs MOSS on a subset selected uniformly at random with cardinality . Quantile displays good performance when or , but suffers regret much worse than MOSS++ and empMOSS++ when $\alpha$ gets larger. Note that the flattening of the regret curve of Quantile at is expected: it simply learns the best sub-optimal arm and suffers a regret . Although Parallel enjoys near minimax optimal regret, the regret it suffers is the sum over 11 subroutines, which hurts its empirical performance. empMOSS++ achieves performance comparable to MOSS Oracle when $\alpha$ is small, and empirically the best performance for . When , MOSS Oracle needs to explore most/all of the arms to statistically guarantee finding at least one best arm, which hurts its empirical performance.

6.2 Regret curve Comparison

We compare how the regret curve of each algorithm grows in Fig. 2(b). We consider the same regret minimization configurations as described in Section 6.1 with . empMOSS++, MOSS++ and Parallel all outperform Quantile, with empMOSS++ achieving the performance closest to MOSS Oracle. MOSS Oracle, Parallel and empMOSS++ have flattened their regret curves, indicating they could confidently recommend the best arm. MOSS++ and Quantile do not flatten their regret curves, as the random-sampling component in each of their iterations encourages them to explore new arms. Compared to MOSS++, Quantile keeps increasing its regret at a much faster rate and with a much larger variance, which empirically confirms the sub-optimality of its regret guarantee on problems with relatively high hardness levels.

6.3 Real-world dataset

We also compare all algorithms in a realistic setting of recommending funny captions to website visitors. We use a real-world dataset from the New Yorker Magazine Cartoon Caption Contest (https://www.newyorker.com/cartoons/contest). The dataset of 1-3 star caption ratings/rewards for Contest 652 consists of captions (available online at https://nextml.github.io/caption-contest-data/). We use the ratings to compute Bernoulli reward distributions for each caption as follows. The mean of each caption/arm is calculated as the percentage of its ratings that were funny or somewhat funny (i.e., 2 or 3 stars). We normalize each with the best one and then threshold each: if , then put ; otherwise leave unaltered. This produces a set of best arms with rewards 1 and all other arms with rewards among . We set and this results in a hardness level around .
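The preprocessing just described can be sketched as a small helper. This is an illustration only: the function name `caption_means`, the `threshold` argument, and the toy ratings in the usage note are hypothetical, and the actual threshold value used in the experiment is not reproduced here.

```python
def caption_means(ratings, threshold):
    """Turn per-caption star ratings (1-3) into thresholded Bernoulli means.

    `ratings` is a list with one non-empty list of star ratings per caption.
    `threshold` is a hypothetical cutoff on the normalized mean.
    """
    # Fraction of ratings that were funny or somewhat funny (2 or 3 stars).
    raw = [sum(1 for r in stars if r >= 2) / len(stars) for stars in ratings]
    top = max(raw)  # assumes at least one caption received a 2- or 3-star rating
    normalized = [mu / top for mu in raw]
    # Captions at or above the cutoff are treated as best arms with mean 1.
    return [1.0 if mu >= threshold else mu for mu in normalized]
```

For example, with the made-up ratings `[[3, 3, 2, 1], [1, 1, 2, 1], [3, 2, 2, 2]]` and a cutoff of 0.7, the first and third captions become best arms while the second keeps its normalized mean.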

Figure 3: Regret comparison with real-world dataset

Using these Bernoulli reward models, we compare the performance of each algorithm in Fig. 3. MOSS, MOSS Oracle, Parallel and empMOSS++ have flattened their regret curves, indicating they could confidently recommend the funny captions (i.e., best arms). Although MOSS could eventually identify a best arm in this problem, its cumulative regret is more than 7x the regret achieved by empMOSS++ due to its initial exploration phase. The performance of Quantile is even worse, and its cumulative regret is more than 9x the regret achieved by empMOSS++. One surprising phenomenon is that empMOSS++ outperforms MOSS Oracle in this realistic setting. Our hypothesis is that MOSS Oracle is a little conservative and selects an initial set whose cardinality is too large. This experiment demonstrates the effectiveness of empMOSS++ and MOSS++ in modern applications of bandit algorithms with a large action space but limited time horizon.

7 Conclusion

We study a regret minimization problem with a large action space but limited time horizon, which captures many modern applications of bandit algorithms. Depending on the number of best/near-optimal arms, we encode the hardness level, in terms of the achievable minimax regret, of the given regret minimization problem into a single parameter $\alpha$, and we design algorithms that can adapt to the unknown hardness level. Our first algorithm MOSS++ takes a user-specified parameter $\beta$ as input and provides guarantees as long as $\alpha < \beta$; our lower bound further indicates that the adaptive rate achieved by MOSS++ is Pareto optimal. Although no algorithm can achieve minimax optimal regret at every $\alpha$, as demonstrated by our lower bound, we overcome this limitation with (often) easily obtained extra information and propose Parallel, which is near-optimal for all settings. Inspired by MOSS++, we also propose empMOSS++ with excellent empirical performance. Experiments on both synthetic and real-world datasets demonstrate the efficiency of our algorithms over the previous state-of-the-art.

Broader Impact

This paper provides efficient algorithms that work well in modern applications of bandit algorithms with a large action space but limited time horizon. We make minimal assumptions about the setting, and our algorithms can automatically adapt to unknown hardness levels. Worst-case regret guarantees are provided for our algorithms; we also show that MOSS++ is Pareto optimal and Parallel is minimax optimal, up to polylog factors. empMOSS++ is provided as a practical version of MOSS++ with excellent empirical performance. Our algorithms are particularly useful in areas such as e-commerce and movie/content recommendation, where the action space is enormous but possibly contains multiple best/satisfactory actions. If deployed, our algorithms could automatically adapt to the hardness level of the recommendation task and benefit both service providers and customers by efficiently delivering satisfactory content. One possible negative outcome is that items recommended to a specific user/customer might only come from a subset of the action space. However, this is unavoidable when the number of items/actions exceeds the allowed time horizon. In fact, one should notice that all items/actions will be selected with essentially the same probability, thanks to the incorporation of random selection processes in our algorithms. Our algorithms will not leverage/create biases for the same reason. Overall, we believe this paper's contribution will have a net positive impact.


Appendix A Omitted proofs for Section 3

Throughout all the appendix, we define the notation for any event . One should also notice that .

A.1 Proof of Theorem 1

Lemma 2.

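For an instance with $n$ total arms and $m$ best arms, and for a subset $S$ selected uniformly at random with cardinality $k$, the probability that none of the best arms are selected in $S$ is upper bounded by $\exp(-mk/n)$.

Consider selecting $k$ items out of $n$ items without replacement, and suppose there are $m$ target items. Let $E$ denote the event where none of the target items are selected. We then have

$$\mathbb{P}(E) = \frac{\binom{n-m}{k}}{\binom{n}{k}} = \prod_{i=0}^{k-1} \frac{n-m-i}{n-i} \le \left(\frac{n-m}{n}\right)^{k} \qquad (4)$$

$$\le \exp\left(-\frac{mk}{n}\right), \qquad (5)$$

where Eq. 4 comes from the fact that $\frac{n-m-i}{n-i}$ is decreasing in $i$; and Eq. 5 comes from the fact that $1 - x \le \exp(-x)$ for all $x \in \mathbb{R}$.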
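As a quick numerical sanity check of this kind of hypergeometric argument, the sketch below compares the exact probability of missing every best arm with the two standard upper bounds $(1 - m/n)^k$ and $\exp(-mk/n)$. The function names and the instance sizes in the usage note are hypothetical.

```python
import math


def miss_probability(n, m, k):
    """Exact probability that a uniform k-subset of n arms has none of the m best arms."""
    if k > n - m:
        return 0.0  # more draws than non-best arms: a best arm is always included
    return math.comb(n - m, k) / math.comb(n, k)


def miss_bounds(n, m, k):
    """Return (exact, power bound, exponential bound) for the miss probability."""
    exact = miss_probability(n, m, k)
    power = (1.0 - m / n) ** k
    exponential = math.exp(-m * k / n)
    return exact, power, exponential


if __name__ == "__main__":
    for n, m, k in [(1000, 10, 100), (20000, 200, 100), (50, 5, 45)]:
        exact, power, exponential = miss_bounds(n, m, k)
        print(f"n={n} m={m} k={k}: exact={exact:.3e} power={power:.3e} exp={exponential:.3e}")
```

Taking $k \approx n/m$ makes the exponential bound equal to $e^{-1}$, which is the constant-probability guarantee mentioned when the hardness level is introduced.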

See Theorem 1.


Let . We first notice that Algorithm 1 is a valid algorithm since for all . Now let represents the information available at the beginning of iteration , including the random selection process in generating . We denote the expected cumulative regret at iteration . Recall we use to represent the expected regret conditional on and have .

When is such that , one could see that Theorem 1 trivially holds as . In the following, we only consider the case where .

Recall . Applying Eq. 1 on leads to


where we use as our algorithm will eventually sample a real arm at time . Thanks to the expectation on , this leads to the same result as if we also bring into the analysis.

We first consider the learning error for any iteration . Although is random, it is fixed at time [hadiji2019polynomial]. Since MOSS restarts at each iteration, conditioning on the information available at the beginning of the -th iteration, i.e., , and apply the regret bound for MOSS [audibert2009minimax, garivier2018kl], we have :


where Eq. 7 comes from ; Eq. 8 comes from and for positive integers; Eq. 9 comes from the fact that and .

Taking expectation over all randomness on Eq. 6, we obtain


Now, we only need to consider the first term, i.e., the expected approximation error over the -th iteration. Let denote the event that none of the best arms, among regular arms, is selected in , according to Lemma 2, we further have


where we use the fact the in Eq. 11; and directly plug into Eq. 5 to get Eq. 12.

Let be the largest integer, if exists, such that , we then have that for any


Note that the choice of indicates .

If we have , we then set . Since , we then have


Combining Eq. 10 with Eq. 13 or Eq. 14, we then have for any and in particular for




where Eq. 16 comes from the fact that by definition of .

Now, let’s consider the expected approximation error for iteration . Since the sampling information during is summarized in the virtual mixture-arm , and being added to all for all . Let denote the expected reward of sampling according to the virtual mixture-arm . For any , we then have


where Eq. 17 comes from the fact that and some rewriting; Eq. 18 comes from the fact that

Combining Eq. 18 and Eq. 9, together with Eq. 15, we have for

where the constant simply comes from ; and thus

A.2 Anytime version

Algorithm 4 Anytime version of MOSS++
Input:  User-specified parameter $\beta$
1:  for $i = 0, 1, \dots$ do
2:     Run Algorithm 1 with $T = 2^i$ for $2^i$ rounds.
3:  end for

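The doubling trick behind the anytime version can be sketched as a schedule generator. This is a minimal sketch, assuming horizons $2, 4, 8, \dots$ with the final block truncated once the (unknown to the learner) total number of rounds is reached; the starting index and the names `doubling_schedule` and `run_anytime` are hypothetical.

```python
def doubling_schedule(total_rounds):
    """Yield block lengths 2, 4, 8, ... truncated so that they sum to total_rounds."""
    played, i = 0, 1
    while played < total_rounds:
        length = min(2 ** i, total_rounds - played)
        yield length
        played += length
        i += 1


def run_anytime(run_fixed_horizon, total_rounds):
    """Restart a fixed-horizon routine on each block and sum what it returns.

    `run_fixed_horizon(horizon)` is a hypothetical callable, e.g. one run of a
    fixed-horizon algorithm that returns the regret it incurred.
    """
    return sum(run_fixed_horizon(length) for length in doubling_schedule(total_rounds))
```

Since the block lengths grow geometrically, the total cost is dominated by the last few blocks, which is why the restart only affects the regret bound by constant or logarithmic factors.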
Corollary 1.

For any unknown time horizon $T$, running Algorithm 4 with a user-specified parameter $\beta$ leads to the following regret upper bound


Let be the smallest integer such that

We then only need to run Algorithm 1 for at most times. By the definition of , we also know that which leads to .

Let . From Theorem 1 we know that the regret at -th round, denoted as , could be upper bounded by

For , we have as well as long as .

Now for the unknown time horizon , we could upper bound the regret by


where Eq. 19 comes from upper bounding summation by integral; and Eq. 20 comes from the fact that . ∎

Appendix B Omitted proofs for Section 4

B.1 Proof of Theorem 2

See Theorem 2.

The proof of Theorem 2 is mainly inspired by the proof of the lower bound in [hadiji2019polynomial]. Before starting the proof, we first state a generalized version of Pinsker's inequality developed in [hadiji2019polynomial].

Lemma 3.

(Lemma 3 in [hadiji2019polynomial]) Let and

be two probability measures. For any random variable


We consider bandit instances such that each bandit instance is a collection of distributions where each

represents a Gaussian distribution

with . For any given and fixed time horizon , we choose such that the following three conditions are satisfied.

  1. (when , this condition is replaced with with );

Proposition 1.

Integers satisfying the above three conditions exist. For instance, we could first fix and set . One could then set and . Recall that, by a slight abuse of notation, we set with in the case of .


The first condition holds by construction. We now show that the second and third conditions hold.

For the second condition, we have

For the third condition, we have


where Eq. 21 holds as long as we have , which is obvious from our setting. ∎

Remark 3.

One could also try to design a lower bound in a continuous setting, i.e., each arm corresponds to a point in the domain and its expected reward is governed by a function. Note that this setting is not covered by the $\mathcal{X}$-armed bandit problem, as the function need not be smooth at all.

Now we group distribution into different groups based on their indices: and . We then define bandit instances for by assigning different values to their means .