Grooming a Single Bandit Arm

06/11/2020 ∙ by Eren Ozbay, et al. ∙ University of Illinois at Chicago

The stochastic multi-armed bandit problem captures the fundamental exploration vs. exploitation tradeoff inherent in online decision-making in uncertain settings. However, in several applications, the traditional objective of maximizing the expected sum of rewards obtained can be inappropriate. Motivated by the problem of optimizing job assignments to groom novice workers with unknown trainability in labor platforms, we consider a new objective in the classical setup. Instead of maximizing the expected total reward from T pulls, we consider the vector of cumulative rewards earned from each of the K arms at the end of T pulls, and aim to maximize the expected value of the highest cumulative reward. This corresponds to the objective of grooming a single, highly skilled worker using a limited supply of training jobs. For this new objective, we show that any policy must incur a regret of Ω(K^1/3T^2/3) in the worst case. We design an explore-then-commit policy featuring exploration based on finely tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and guarantees a regret of O(K^1/3T^2/3√(log K)) in the worst case. Our numerical experiments demonstrate that this policy improves upon several natural candidate policies for this setting.




1 Introduction

The stochastic multi-armed bandit (MAB) problem Lai and Robbins (1985); Auer et al. (2002) presents a basic formal framework to study the exploration vs. exploitation tradeoff fundamental to online decision-making in uncertain settings. Given a set of K arms, each of which yields independent and identically distributed (i.i.d.) rewards over successive pulls, the goal is to adaptively choose a sequence of arms to maximize the expected value of the total reward attained at the end of T pulls. The critical assumption here is that the reward distributions of the different arms are a priori unknown. Any good policy must hence, over time, optimize the tradeoff between choosing arms that are known to yield high rewards (exploitation) and choosing arms whose reward distributions are still relatively unknown (exploration). After several years of extensive theoretical and algorithmic analysis, this classical problem is now quite well understood (see Lattimore and Szepesvári (2018); Slivkins (2019); Bubeck and Cesa-Bianchi (2012) for surveys).

In this paper, we revisit this classical setup; however, we address a new objective. We consider the vector of cumulative rewards that have been earned from the different arms at the end of T pulls, and instead of maximizing the expectation of their sum, we aim to maximize the expected value of the maximum of these cumulative rewards across the arms. This problem is motivated by several practical settings, as we discuss below.


  1. Training workers in online labor platforms. Online labor platforms seek to develop and maintain a reliable pool of high-quality workers in steady state to satisfy the demand for jobs. This is a challenging problem since, a) workers continuously leave the platform and hence new talent must be groomed, and b) the number of “training” jobs available to groom the incoming talent is limited (this could, for instance, be because of a limit on the budget for the discounts offered to clients for choosing novice workers). At the core of this challenging operational question is the following problem. Given the limited availability of training jobs, the platform must determine a policy to allocate these jobs to a set of novice workers to maximize some appropriate functional of the distribution of their terminal skill levels. For a platform that seeks to offer robust service guarantees to its clients, simply maximizing the sum of the terminal skill levels across all workers may not be appropriate; a more natural functional to maximize is an appropriate percentile of the terminal skill levels of the workers, where the percentile is determined by the volume of demand for regular jobs.

    To address this problem, we can use the MAB framework: the set of arms is the set of novice workers, the reward of an arm is the random increment in the skill level of the worker after allocation of a job, and the number of training jobs available is T. Assuming the number of training jobs available per worker is not too large, the random increments may be assumed to be i.i.d. over time. The mean of these increments can be interpreted as the unknown learning rate or the “trainability” of a worker. Given K workers, the goal is to adaptively allocate the T jobs to these workers to maximize the smallest terminal skill level amongst the top m most terminally skilled workers (for some m ≤ K). Our objective corresponds to the case where m = 1, and is a step towards solving this general problem.

  2. Grooming an “attractor” product on e-commerce platforms. E-commerce platforms typically feature very similar substitutes within a product category. For instance, consider a product like a tablet cover (e.g., for an iPad). Once the utility of a new product of this type becomes established (e.g., the size specifications of a new version of the iPad become available), several brands offering close-to-identical products serving the same purpose proliferate in the marketplace. This proliferation is problematic for the platform for two reasons: a) customers are inundated by choices and may unnecessarily delay their purchase decision, thereby increasing the possibility of leaving the platform altogether Settle and Golden (1974); Gourville and Soman (2005), and b) the heterogeneity in the purchase behavior resulting from the lack of a clear choice may complicate the problem of effectively managing inventory and delivery logistics. Given a budget for incentivizing customers to pick different products in the early exploratory phase, where the qualities of the different products are being discovered, a natural objective for the platform is to “groom” a product to have the highest volume of positive ratings at the end of this phase. This product then becomes a clear choice for the customers. Our objective effectively captures this goal.

  3. Training for external competitions. The objective we consider is also relevant to the problem of developing advanced talent within a region for participation in external competitions like Science Olympiads, the Olympic games, etc., with limited training resources. In these settings, only the terminal skill levels of those finally chosen to represent the region matter. The resources spent on others, despite resulting in skill advancement, are effectively wasteful. This feature is not captured by the “sum” objective, while it is effectively captured by the “max” objective, particularly in situations where one individual will finally be chosen to represent the region.

A standard approach in MAB problems is to design a policy that minimizes regret, i.e., the loss relative to the optimal policy for the given objective over time. In the classical setting with the “sum” objective, it is well known that any policy must incur a regret of Ω(√(KT)) in the worst case over the set of possible bandit instances Auer et al. (2002). A key feature of our new objective is that the rewards earned from arms that do not eventually turn out to be the one yielding the highest cumulative reward are effectively a waste. Owing to this, we show that in our case, a regret of Ω(K^1/3T^2/3) is inevitable (Theorem 1).

For the traditional objective, well-performing policies are typically based on the principle of optimism in the face of uncertainty. A popular policy class is the Upper Confidence Bound (UCB) class of policies Agrawal (1995); Auer et al. (2002); Auer and Ortner (2010), in which a confidence interval is maintained for the mean reward of each arm and, at each time, the arm with the highest upper confidence bound is chosen. For a standard tuning of these intervals, this policy – termed UCB1 in the literature following Auer et al. (2002) – guarantees a regret of O(√(KT log T)) in the worst case. With a more refined tuning, O(√(KT)) can be achieved Audibert and Bubeck; Lattimore (2018).

For our objective, directly using one of the above UCB policies can prove to be disastrous. To see this, suppose that all K arms have an identical reward distribution with bounded support and a positive mean. Then a UCB policy will continue to switch between the arms throughout the T pulls, resulting in a highest terminal cumulative reward of order T/K; whereas a reward of order T is feasible by simply committing to an arbitrary arm from the start. Hence, the regret is Ω(T) in the worst case.
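This failure mode is easy to reproduce in simulation. The sketch below (hypothetical helper names; standard UCB1 indices) compares the highest per-arm cumulative reward under UCB1 against simply committing to one arm from the start, when all arms are identical Bernoulli(1/2):

```python
import math
import random

def ucb1_max_reward(T, K, pull):
    """Run standard UCB1 for T rounds and return the largest
    cumulative per-arm reward (the value of the "max" objective)."""
    counts = [0] * K
    sums = [0.0] * K
    for i in range(K):  # initialize: pull each arm once
        sums[i] += pull(i)
        counts[i] += 1
    for t in range(K, T):
        # standard UCB1 index: empirical mean + sqrt(2 ln t / n_i)
        arm = max(range(K),
                  key=lambda i: sums[i] / counts[i]
                  + math.sqrt(2.0 * math.log(t + 1) / counts[i]))
        sums[arm] += pull(arm)
        counts[arm] += 1
    return max(sums)

random.seed(1)
T, K = 3000, 10
bernoulli = lambda i: 1.0 if random.random() < 0.5 else 0.0

# UCB1 keeps switching among the identical arms, so the best single
# arm accumulates only about 0.5 * T / K reward ...
spread = ucb1_max_reward(T, K, bernoulli)
# ... while committing to an arbitrary arm from the start earns ~0.5 * T.
committed = sum(bernoulli(0) for _ in range(T))
```

Under the "max" objective, `committed` is roughly K/2 times larger than `spread`, illustrating the Ω(T) regret of perpetual exploration.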

This observation suggests that any good policy must, at some point, stop exploring and permanently commit to a single arm. A natural candidate is the basic explore-then-commit (ETC) strategy, which uniformly explores all arms until some time that is fixed in advance, and then commits to the empirically best arm Lattimore and Szepesvári (2018); Slivkins (2019). When each arm is chosen on the order of (T/K)^2/3 times in the exploration phase, this strategy can be shown to achieve a regret of order K^1/3T^2/3 (up to logarithmic factors) relative to the traditional objective Slivkins (2019). It is easy to argue that it achieves the same regret relative to our “max” objective. However, this policy is excessively optimized for the worst case, in which the means of all the arms are within roughly (K/T)^1/3 of each other. When the arms are easier to distinguish, this policy’s performance is quite poor due to excessive exploration. For example, consider a two-armed bandit problem with Bernoulli rewards and means μ1 > μ2, where the gap Δ = μ1 − μ2 is a fixed constant. For this fixed instance, ETC will pull both arms order T^2/3 times and hence incur a regret of Ω(T^2/3) as T → ∞ (relative to our “max” objective). However, it is well known that UCB1 will not pull the suboptimal arm more than O(log T/Δ^2) times with high probability Auer et al. (2002), and hence for this instance, UCB1 will incur a regret of only O(log T/Δ^2). Thus, although the worst-case regret of UCB1 is Ω(T) due to perpetual exploration, for a fixed bandit instance, its asymptotic performance is significantly better than that of ETC. This observation motivates us to seek a practical policy whose performance depends gracefully on the difficulty of the bandit instance, and which achieves both the worst-case bound of ETC and an instance-dependent asymptotic bound of O(log T).

We propose a new policy with an explore-then-commit structure, in which appropriately defined confidence bounds on the means of the arms are utilized to guide exploration, as well as to decide when to stop exploring. We call this policy Adaptive Explore-then-Commit (ADA-ETC). Compared to the classical UCB1 way of defining the confidence intervals, our policy’s confidence bounds are finely tuned to eliminate wasteful exploration and to encourage stopping early when appropriate. We derive rigorous instance-dependent as well as worst-case bounds on the regret guaranteed by this policy. Our bounds show that ADA-ETC adapts to the problem difficulty by exploring less when appropriate, while attaining the same worst-case regret guarantee of O(K^1/3T^2/3√(log K)) attained by vanilla ETC (Theorem 2). In particular, ADA-ETC also guarantees an instance-dependent asymptotic regret of O(log T) as T → ∞. Finally, our numerical experiments demonstrate that ADA-ETC results in significant improvements over the performance of vanilla ETC in easier settings, while never performing worse in difficult ones, thus corroborating our theoretical results. Our numerical results also demonstrate that naive ways of introducing adaptive exploration based on upper confidence bounds, e.g., simply using the upper confidence bounds of UCB1, may lead to no improvement over vanilla ETC.

We finally note that buried in our objective is the goal of quickly identifying the arm with approximately the highest mean reward, so that a substantial amount of time can be spent earning rewards from that arm (e.g., “training” a worker). This goal is related to the pure exploration problem in multi-armed bandits. Several variants of this problem have been studied, where the goal of the decision-maker is to either minimize the probability of misidentification of the optimal arm given a fixed budget of pulls Audibert and Bubeck (2010); Carpentier and Locatelli (2016); Kaufmann et al. (2016); or minimize the expected number of pulls to attain a fixed probability of misidentification, possibly within an approximation error Even-Dar et al. (2002, 2006); Mannor and Tsitsiklis (2004); Karnin et al. (2013); Jamieson et al. (2014); Vaidhiyan and Sundaresan (2017); Kaufmann et al. (2016); or minimize the expected suboptimality (called “simple regret”) of a recommended arm after a fixed budget of pulls Bubeck et al. (2009, 2011); Carpentier and Valko (2015). Extensions to settings where multiple good arms need to be identified have also been considered Bubeck et al. (2013); Kalyanakrishnan et al. (2012); Zhou et al. (2014); Kaufmann and Kalyanakrishnan (2013). The critical difference from these approaches is that in our scenario, the budget of pulls must not only be spent on identifying an approximately optimal arm but also on earning rewards on that arm. Hence any choice of apportionment of the budget to the identification problem, or a choice of a target for the approximation error or the probability of misidentification within a candidate policy, is a priori unclear and must arise endogenously from our primary objective.

2 Problem Setup

Consider the stochastic multi-armed bandit (MAB) problem parameterized by the number of arms, which we denote by K; the length of the decision-making horizon (the number of discrete times/stages), which we denote by T; and the probability distributions of the rewards of arms 1, …, K, denoted by ν1, …, νK, respectively. To achieve meaningful results, we assume that the rewards are non-negative and their distributions have a bounded support, assumed to be [0, 1] without loss of generality (although this latter assumption can be easily relaxed to allow, for instance, σ-sub-Gaussian distributions with bounded σ). We define the set of problem instances to be the set of all K-tuples of distributions for the arms having support in [0, 1]. Let μ1, …, μK be the means of the distributions ν1, …, νK. Without loss of generality, we assume that μ1 ≥ μ2 ≥ ⋯ ≥ μK for the remainder of the discussion. The distributions of the rewards from the arms are unknown to the decision-maker. We denote the suboptimality gaps Δk = μ1 − μk for each arm k.

At each time, the decision-maker chooses an arm to play and observes a reward. Let the arm played at time t be denoted as I_t and the reward be denoted as X_t, where X_t is drawn from the distribution ν_{I_t}, independent of the previous actions and observations. The history of actions and observations at any time t is denoted as H_t, and H_1 is defined to be the empty set ∅. A policy of the decision-maker is a sequence of mappings π = (π_1, π_2, …), where π_t maps every possible history H_t to an arm to be played at time t. Let Π denote the set of all such policies.

For an arm k, we denote N_k(t) to be the number of times this arm is played up to and including time t, i.e., N_k(t) = Σ_{s=1}^{t} 1(I_s = k). We also denote Y_k(j) to be the reward observed from the j-th pull of arm k. (Y_k(j))_{j≥1} is thus a sequence of i.i.d. random variables, each distributed as ν_k. Note that the definition of N_k(t) implies that we have Σ_{k=1}^{K} N_k(t) = t. We further define S_k(t) = Σ_{j=1}^{N_k(t)} Y_k(j) to be the cumulative reward obtained from arm k until time t.

Once a policy π is fixed, then for all t, the quantities I_t, X_t, and N_k(t), S_k(t) for all k, become well-defined random variables. We consider the following notion of reward for a policy π:

V_T(π) = E[ max_{k=1,…,K} S_k(T) ].

In words, the objective value attained by the policy is the expected value of the largest cumulative reward across all arms at the end of the decision-making horizon. When the reward distributions are known to the decision-maker, then for a large T, the best reward that the decision-maker can achieve is μ1 T.

A natural candidate for a “good” policy when the reward distributions are known is the one in which the decision-maker exclusively plays arm 1 (the arm with the highest mean), attaining an expected reward of μ1 T. One can show that, in fact, this is the best reward that one can achieve in our problem.

Proposition 1.

For any bandit instance and any policy π, the objective value satisfies V_T(π) ≤ μ1 T.

The proof is presented in Section A of the Appendix. This shows that the simple policy of always picking the arm with the highest mean is optimal for our problem. Next, we define the regret of any policy π to be R_T(π) = μ1 T − V_T(π). We consider the objective of finding a policy that achieves the smallest regret in the worst case over all instances, i.e., we wish to solve the following optimization problem:

inf_{π ∈ Π} sup_{ν} R_T(π).

Let R*_T denote the minmax (or the best worst-case) regret, i.e., R*_T = inf_{π ∈ Π} sup_{ν} R_T(π).

In the remainder of the paper, we will show that the worst-case regret R*_T is of order K^1/3T^2/3, up to a √(log K) factor.

3 Lower Bound

We now show that for our objective, a regret of Ω(K^1/3T^2/3) is inevitable in the worst case.

Theorem 1.

Suppose that T ≥ K. Then, R*_T = Ω(K^1/3T^2/3).

The proof is presented in Section B of the Appendix. Informally, the argument for the case of K = 2 arms is as follows. Consider two bandits with Bernoulli rewards, one with mean rewards (1/2 + ε, 1/2), and the other with mean rewards (1/2, 1/2 + ε), where ε = T^−1/3. Then until time ~ε^−2 = T^2/3, no algorithm can reliably distinguish between the two bandits. Hence, until this time, either Ω(T^2/3) pulls are spent on arm 1 irrespective of the underlying bandit, or Ω(T^2/3) pulls are spent on arm 2 irrespective of the underlying bandit. In both cases, the algorithm incurs a regret of Ω(T^2/3) on one of the two bandits, essentially because of wasting pulls on a suboptimal arm that could have been spent on earning reward on the optimal arm. This latter argument is not entirely complete, however, since it ignores the possibility of picking a suboptimal arm until time T, in which case spending time on the suboptimal arm in the first T^2/3 time periods was not wasteful. However, even in this case, one incurs a regret of εT = T^2/3. Thus a regret of Ω(T^2/3) is unavoidable. Our formal proof builds on this basic argument to additionally determine the optimal dependence on K.
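The balancing act in this two-arm sketch can be written out explicitly (with the conventional choice of ε; the constants are illustrative):

```latex
% Two Bernoulli instances: (1/2 + \epsilon, 1/2) and (1/2, 1/2 + \epsilon).
% Telling them apart requires on the order of \epsilon^{-2} samples, so
% roughly \epsilon^{-2} pulls are wasted on the suboptimal arm, while
% misidentifying the best arm for the whole horizon costs \epsilon T.
% Equating the two sources of regret:
\underbrace{\epsilon^{-2}}_{\text{wasted pulls}}
  \;=\; \underbrace{\epsilon T}_{\text{misidentification}}
\quad\Longrightarrow\quad \epsilon = T^{-1/3}
\quad\Longrightarrow\quad \text{regret} = \Omega\!\left(T^{2/3}\right).
```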

4 Adaptive Explore-then-Commit (ADA-ETC)

We now define an algorithm that we call Adaptive Explore-then-Commit (ADA-ETC), specifically designed for our problem. It is formally defined in Algorithm 1. The algorithm can be described simply as follows. After choosing each arm once, choose the arm with the highest upper confidence bound, until there is an arm such that (a) it has been played at least N̄ times, where N̄ is a per-arm exploration cap defined in Algorithm 1, and (b) its empirical mean is higher than the upper confidence bounds on the means of all other arms. Once such an arm is found, commit to this arm until the end of the decision horizon.

The upper confidence bound is defined in Equation 2. In contrast to its definition in UCB1, it is tuned to eliminate wasteful exploration and to allow stopping early if appropriate. We enforce the requirement that an arm is played at least N̄ times before committing to it by defining a trivial “lower confidence bound” (Equation 3), which takes the value 0 while the arm has been played fewer than N̄ times, after which both the upper and lower confidence bounds are defined to be the empirical mean of the arm. The stopping criterion can then be simply stated in terms of these upper and lower confidence bounds (Equation 4): stop and commit to an arm when its lower confidence bound is strictly higher than the upper confidence bounds of all other arms (since the rewards are non-negative, this can never happen before the arm has been pulled N̄ times).

Note that the collapse of the upper and lower confidence bounds to the empirical mean after N̄ pulls ensures that no arm is pulled more than N̄ times during the Explore phase. This is because choosing an arm to explore after N̄ pulls would imply that its upper confidence bound, which now equals its lower confidence bound, is higher than the upper confidence bounds of all other arms, which means that the stopping criterion has been met and the algorithm has committed to the arm.

Remark 1.

A heuristic rationale behind the choice of the upper confidence bound is as follows. Consider a suboptimal arm whose mean is smaller than the highest mean by Δ. Let δ be the probability that this arm is misidentified and committed to in the Commit phase. Then the expected regret resulting from this misidentification is approximately δΔT. Since we want to ensure that the regret is at most ~K^1/3T^2/3 in the worst case, we can tolerate a δ of at most ~K^1/3/(ΔT^1/3). Unfortunately, Δ is not known to the algorithm. However, a reasonable proxy for Δ is 1/√n, where n is the number of times the arm has been pulled. This is because it is right around n ≈ 1/Δ^2 that the distinction between this arm and the optimal arm is expected to occur. Thus a good (moving) target for the probability of misidentification is δ(n) ≈ K^1/3√n/T^1/3. This necessitates the scaling of the confidence interval in Equation 2. In contrast, our numerical experiments show that utilizing the traditional scaling of UCB1 results in significant performance deterioration. Our tuning is reminiscent of similar tuning of confidence bounds under the “sum” objective to improve the performance of UCB1; see Audibert and Bubeck; Lattimore (2018); Auer and Ortner (2010).
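The calibration in Remark 1 can be sketched as a short derivation; the constants and the exact K-dependence are our reading of the argument, not the paper's verbatim equations:

```latex
% Tolerable misidentification probability for a gap-\Delta arm:
\delta \cdot \Delta T \;\lesssim\; K^{1/3} T^{2/3}
\;\Longrightarrow\; \delta \;\lesssim\; \frac{K^{1/3}}{\Delta\, T^{1/3}}.
% With the proxy \Delta \approx 1/\sqrt{n} after n pulls, the moving target is
\delta(n) \;\approx\; \frac{K^{1/3}\sqrt{n}}{T^{1/3}},
% and a Hoeffding-type radius achieving it scales as
\sqrt{\frac{\log\big(1/\delta(n)\big)}{2n}}
\;\propto\; \sqrt{\frac{\log_{+}\!\big(T/(K\, n^{3/2})\big)}{n}}.
```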

Remark 2.

Instead of defining the lower confidence bound to be 0 until an arm is pulled N̄ times, one may define a non-trivial lower confidence bound to accelerate commitment, perhaps symmetric to the upper confidence bound. However, this does not lead to an improvement in the regret bound. The reason is that if an arm looks promising during exploration, then eagerness to commit to it is imprudent: if it is indeed optimal, then it is expected to be chosen frequently during exploration anyway; whereas if it is suboptimal, then we preserve the option of eliminating it by choosing not to commit until after N̄ pulls. To summarize, ADA-ETC eliminates wasteful exploration primarily by reducing the number of times suboptimal arms are pulled during exploration, through the choice of appropriately aggressive upper confidence bounds, rather than by being hasty in commitment.

Input: K arms, with horizon T.
Define: Let N̄ = ⌈(T/K)^2/3⌉. For each arm k and n ≥ 1, let μ̂_k(n) be the empirical average reward from arm k after n pulls, i.e., μ̂_k(n) = (1/n) Σ_{j=1}^{n} Y_k(j). Also, for each arm k, define the upper and lower confidence bounds

u_k(n) = μ̂_k(n) + √(log₊(T/(K n^3/2))/n) if n < N̄, and u_k(n) = μ̂_k(n) if n ≥ N̄; (2)

ℓ_k(n) = 0 if n < N̄, and ℓ_k(n) = μ̂_k(n) if n ≥ N̄; (3)

where log₊(x) = max(log x, 0). Also, for each arm k, let N_k(t) be the number of times arm k is pulled up to and including time t.

  • Explore Phase: From time 1 until K, pull each arm once. For t = K + 1, …, T:

    1. Identify k* ∈ argmax_k ℓ_k(N_k(t − 1)), breaking ties arbitrarily. If

       ℓ_{k*}(N_{k*}(t − 1)) > max_{k ≠ k*} u_k(N_k(t − 1)), (4)

      then define k′ = k*, break, and enter the Commit phase. Else, continue to Step 2.

    2. Identify k† ∈ argmax_k u_k(N_k(t − 1)), breaking ties arbitrarily. Pull arm k†.

  • Commit Phase: Pull arm k′ until time T.

Algorithm 1 Adaptive Explore-then-Commit (ADA-ETC)
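A compact Python sketch of the procedure follows. The confidence-radius tuning and the default per-arm cap `n_bar` are our assumptions standing in for the exact definitions in Algorithm 1, not the paper's verbatim choices:

```python
import math
import random

def ada_etc(T, K, pull, n_bar=None):
    """Sketch of ADA-ETC (assumes T >= K >= 2). `pull(i)` returns a
    reward in [0, 1] for arm i. The confidence radius and the default
    exploration cap `n_bar` are assumptions, not the paper's exact
    Equations 2-4."""
    if n_bar is None:
        n_bar = max(1, math.ceil((T / K) ** (2.0 / 3.0)))
    counts = [0] * K
    sums = [0.0] * K

    def mean(i):
        return sums[i] / counts[i]

    def ucb(i):
        if counts[i] >= n_bar:  # bounds collapse to the empirical mean
            return mean(i)
        # assumed tuning in the spirit of Remark 1
        radius = math.sqrt(
            max(math.log(T / (K * counts[i] ** 1.5)), 0.0) / counts[i])
        return mean(i) + radius

    def lcb(i):
        # trivial lower bound: 0 until the arm has n_bar pulls
        return mean(i) if counts[i] >= n_bar else 0.0

    for i in range(K):  # Explore phase: pull each arm once
        sums[i] += pull(i)
        counts[i] += 1
    t, commit = K, None
    while t < T:
        best = max(range(K), key=lcb)
        if lcb(best) > max(ucb(j) for j in range(K) if j != best):
            commit = best  # stopping criterion met
            break
        arm = max(range(K), key=ucb)  # explore the most optimistic arm
        sums[arm] += pull(arm)
        counts[arm] += 1
        t += 1
    if commit is None:  # horizon exhausted without committing
        commit = max(range(K), key=mean)
    while t < T:  # Commit phase
        sums[commit] += pull(commit)
        counts[commit] += 1
        t += 1
    return max(sums)  # terminal value of the "max" objective
```

On an instance with one clearly best arm, the stopping rule triggers after a modest number of exploration pulls, after which almost the entire horizon is spent grooming that arm.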

Let ADA-ETC(K, T) denote the implementation of ADA-ETC using K and T as the input for the number of arms and the time horizon, respectively. Also, recall the gaps Δ_k = μ1 − μ_k for k = 2, …, K. We characterize the regret guarantees achieved by ADA-ETC(K, T) in the following result.

Theorem 2 (ADA-ETC).

Suppose that T ≥ K. Then for any instance, the expected regret of ADA-ETC(K, T) admits an instance-dependent upper bound, stated in terms of the gaps Δ_k (see Section C of the Appendix for the precise expression). In the worst case, we have R_T(ADA-ETC) = O(K^1/3T^2/3√(log K)).

The proof is presented in Section C of the Appendix. Theorem 2 features an instance-dependent regret bound and a worst-case bound of O(K^1/3T^2/3√(log K)). The first two terms in the instance-dependent bound arise from the wasted pulls during the Explore phase. Under vanilla Explore-then-Commit, to obtain near-optimality in the worst case, every arm must be pulled on the order of (T/K)^2/3 times in the Explore phase Slivkins (2019). Hence, the expected regret from the Explore phase is of order K^1/3T^2/3 irrespective of the instance. On the other hand, our bound on this regret depends on the instance and can be significantly smaller than K^1/3T^2/3 if the arms are easier to distinguish. For example, if K and the instance are fixed (with Δ2 > 0), and T → ∞, then the regret from exploration (and the overall regret) is O(log T) under ADA-ETC as opposed to Ω(T^2/3) under ETC. The next two terms in our instance-dependent bound arise from the regret incurred due to committing to a suboptimal arm, which can be shown to be O(K^1/3T^2/3√(log K)) in the worst case, thus matching the guarantee of ETC. The first of these terms is not problematic, since it is the same as the regret arising under ETC. The second term arises due to the inevitably increased misidentifications caused by stopping early in adaptive versions of ETC. If the confidence bounds are aggressively small, then this term increases. In ADA-ETC, the upper confidence bounds used in exploration are tuned to be as small as possible while ensuring that this term is no larger than the worst-case guarantee. Thus, our tuning of the Explore phase ensures that the performance gains during exploration do not come at the cost of higher worst-case regret (in the leading order) due to misidentification.

5 Experiments

Benchmark Algorithms. We compare the performance of ADA-ETC with four algorithms described in Table 1. All algorithms, except UCB1 and ETC, have the same algorithmic structure as ADA-ETC: they explore based on upper confidence bounds and commit if the lower confidence bound of an arm rises above the upper confidence bounds of all other arms. They differ from ADA-ETC in how the upper and lower confidence bounds are defined; these definitions are presented in Table 1. UCB1 never stops exploring and pulls the arm maximizing the upper confidence bound at each time step, while ETC commits to the arm with the highest empirical mean after each arm has been pulled N̄ times. Both NADA-ETC and UCB1-s use UCB1’s upper confidence bound, but they differ in their lower confidence bounds.

Table 1: Benchmark Algorithms

Instances. We consider Bernoulli rewards, with the mean of each arm sampled uniformly at random in each instance; we sample three sets of instances, corresponding to three different ranges for the means. The regret of an algorithm on each instance is averaged over multiple independent runs to estimate the expected regret. We vary K and T. The average regret over the instances under different algorithms and settings is presented in Figure 1.

Discussion. ADA-ETC shows the best performance uniformly across all settings, although there are settings where its performance is similar to ETC. As anticipated, these are settings where either (a) the range from which the means are sampled is narrow, in which case the arms are expected to be close to each other and hence adaptivity in exploration has little benefit, or (b) T/K is relatively small, due to which the exploration cap N̄ is small. In these latter situations, the exploration budget of N̄ pulls is expected to be exhausted for almost all arms under ADA-ETC, yielding performance similar to ETC; e.g., when N̄ = 3, a maximum of only three pulls can be used per arm for exploring. When the arms are easier to distinguish, or when T/K is large, the performance of ADA-ETC is significantly better than that of ETC. This illustrates the gains from the refined definition of the upper confidence bounds used to guide exploration in ADA-ETC.

Furthermore, we observe that the performances of UCB1-s and NADA-ETC are essentially the same as ETC. This is an important observation since it shows that naively adding adaptivity to exploration based on UCB1’s upper confidence bounds may not improve the performance of ETC, and appropriate refinement of the confidence bounds is crucial to the gains of ADA-ETC. Finally, we note that UCB1 performs quite poorly, thus demonstrating the importance of introducing an appropriate stopping criterion for exploration.
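A skeleton of such an experiment pipeline might look as follows; the instance distribution, run counts, and function names are illustrative assumptions, not the paper's exact configuration:

```python
import random

def run_experiment(algos, K, T, n_instances=10, n_runs=25, seed=0):
    """Average the "max"-objective value of each algorithm over random
    Bernoulli instances whose means are drawn uniformly from [0, 1].
    `algos` maps a name to a function (T, K, pull) -> max cumulative
    reward. Instance ranges and run counts here are illustrative."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in algos}
    for _ in range(n_instances):
        means = [rng.random() for _ in range(K)]  # one random instance
        for name, algo in algos.items():
            for _ in range(n_runs):
                pull = lambda i: 1.0 if rng.random() < means[i] else 0.0
                totals[name] += algo(T, K, pull)
    # average over instances and runs
    return {name: v / (n_instances * n_runs) for name, v in totals.items()}
```

Plugging in implementations of ADA-ETC, ETC, UCB1, and the UCB1-based variants as entries of `algos` reproduces the shape of the comparison described above (a lower average value indicates worse grooming of the best arm).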

Figure 1: Performance comparison of ADA-ETC across the different choices of K and T (panels (a)–(l)). The performances of UCB1-s and NADA-ETC are essentially the same as ETC.

6 Conclusion and Future Directions

In this paper, we proposed and offered a near-tight analysis of a new objective in the classical MAB setting: optimizing the expected value of the maximum of the cumulative rewards across arms. From a theoretical perspective, although the current analysis of ADA-ETC is tight, it is unclear whether the extraneous √(log K) factor in the upper bound (compared to the lower bound) can be eliminated via a more refined algorithm design. Additionally, our assumption that the rewards are i.i.d. over time, while appropriate for the application of grooming an attractor product on e-commerce platforms, may be a limitation in the context of worker training. It would be interesting to study our objective in settings that allow rewards to decrease over time; such models, broadly termed rotting bandits Heidari et al.; Levine et al. (2017); Seznec et al. (2019), have attracted recent attention in the literature as part of the study of the more general class of MAB problems with non-stationary rewards Besbes et al. (2014, 2019). This literature has so far focused only on the traditional “sum” objective.

More importantly, our paper presents the possibility of studying a wide variety of new objectives under existing online learning setups motivated by training applications, where the traditional objective of maximizing the total reward is inappropriate. A natural generalization of our objective is the optimization of other functionals of the vector of cumulative rewards, e.g., maximizing the m-th highest cumulative reward for some m > 1, which is relevant to online labor platforms as mentioned in Section 1, or the optimization of the ℓp norm of the vector of cumulative rewards for intermediate values of p, which has natural fairness interpretations in the context of human training (the traditional objective corresponds to the ℓ1 norm, while our objective corresponds to the ℓ∞ norm). More generally, one may consider multiple skill dimensions, with job types that differ in their impact on these dimensions. In such settings, a similar variety of objectives may be considered, driven by considerations such as fairness, diversity, and focus.

7 Broader Impact

Developing a strong and diverse labor supply under limited resources is one of the oldest and most fundamental economic policy challenges. The advent of online labor platforms, which collect fine-grained data on job outcomes, presents an opportunity to tackle this challenge in a much more refined and data-driven fashion than before.

Training a workforce entails the classic exploration vs. exploitation tradeoff: one needs to learn the inherent “trainability” of the workers for different skills to determine the optimal allocation of training resources. The theory of multi-armed bandit problems presents a formal framework to analyze such tradeoffs and develop practical algorithms. However, this theory has so far mostly focused on the objective of maximizing the total reward of the decision-maker. In many training applications, this objective is inappropriate; instead, one may be interested in optimizing a variety of other objectives depending on the application. These objectives may be informed by considerations such as the nature and volume of demand for jobs, quality guarantees promised to clients, fairness in the allocation of training opportunities, and achieving diversity in skills.

The main technical contribution of the paper is the proposal and tight analysis of an algorithm that optimizes one such practically motivated objective, in which the goal of the decision-maker is to utilize the training resources to groom a single, highly trained worker. Perhaps more importantly, this paper proposes a framework to address various objectives stemming from training applications under the classical multi-armed bandit model, thus introducing a flurry of new, practically relevant problems in this domain.


  • [1] R. Agrawal (1995) Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 (4), pp. 1054–1078. Cited by: §1.
  • [2] J. Audibert and S. Bubeck Minimax policies for adversarial and stochastic bandits. Cited by: §1, Remark 1.
  • [3] J. Audibert and S. Bubeck (2010) Best arm identification in multi-armed bandits. Cited by: §1.
  • [4] P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine learning 47 (2-3), pp. 235–256. Cited by: §1, §1, §1, §1.
  • [5] P. Auer and R. Ortner (2010) UCB revisited: improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61 (1-2), pp. 55–65. Cited by: §1, Remark 1.
  • [6] O. Besbes, Y. Gur, and A. Zeevi (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in neural information processing systems, pp. 199–207. Cited by: §6.
  • [7] O. Besbes, Y. Gur, and A. Zeevi (2019) Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards. Stochastic Systems 9 (4), pp. 319–337. Cited by: §6.
  • [8] S. Bubeck, T. Wang, and N. Viswanathan (2013) Multiple identifications in multi-armed bandits. In International Conference on Machine Learning, pp. 258–265. Cited by: §1.
  • [9] S. Bubeck and N. Cesa-Bianchi (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721. Cited by: §1.
  • [10] S. Bubeck, R. Munos, and G. Stoltz (2009) Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pp. 23–37. Cited by: §1.
  • [11] S. Bubeck, R. Munos, and G. Stoltz (2011) Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science 412 (19), pp. 1832–1852. Cited by: §1.
  • [12] A. Carpentier and A. Locatelli (2016) Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pp. 590–604. Cited by: §1.
  • [13] A. Carpentier and M. Valko (2015) Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, pp. 1133–1141. Cited by: §1.
  • [14] E. Even-Dar, S. Mannor, and Y. Mansour (2002) PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pp. 255–270. Cited by: §1.
  • [15] E. Even-Dar, S. Mannor, and Y. Mansour (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research 7 (Jun), pp. 1079–1105. Cited by: §1.
  • [16] J. T. Gourville and D. Soman (2005) Overchoice and assortment type: when and why variety backfires. Marketing science 24 (3), pp. 382–395. Cited by: item 2.
  • [17] H. Heidari, M. J. Kearns, and A. Roth Tight policy regret bounds for improving and decaying bandits.. Cited by: §6.
  • [18] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck (2014) lil’UCB: an optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pp. 423–439. Cited by: §1.
  • [19] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone (2012) PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pp. 227–234. Cited by: §1.
  • [20] Z. Karnin, T. Koren, and O. Somekh (2013) Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 1238–1246. Cited by: §1.
  • [21] E. Kaufmann, O. Cappé, and A. Garivier (2016) On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research 17 (1), pp. 1–42. Cited by: §1.
  • [22] E. Kaufmann and S. Kalyanakrishnan (2013) Information complexity in bandit subset selection. In Conference on Learning Theory, pp. 228–251. Cited by: §1.
  • [23] T. L. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 (1), pp. 4–22. Cited by: §1.
  • [24] T. Lattimore and C. Szepesvári (2018) Bandit algorithms. preprint. Cited by: Appendix B, Appendix C, Appendix C, Appendix C, §1, §1, Lemma 4.
  • [25] T. Lattimore (2018) Refining the confidence level for optimistic bandit strategies. The Journal of Machine Learning Research 19 (1), pp. 765–796. Cited by: §1, Remark 1.
  • [26] N. Levine, K. Crammer, and S. Mannor (2017) Rotting bandits. In Advances in neural information processing systems, pp. 3074–3083. Cited by: §6.
  • [27] S. Mannor and J. N. Tsitsiklis (2004) The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research 5 (Jun), pp. 623–648. Cited by: §1.
  • [28] R. B. Settle and L. L. Golden (1974) Consumer perceptions: overchoice in the market place. ACR North American Advances. Cited by: item 2.
  • [29] J. Seznec, A. Locatelli, A. Carpentier, A. Lazaric, and M. Valko (2019) Rotting bandits are no harder than stochastic ones. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2564–2572. Cited by: §6.
  • [30] A. Slivkins (2019) Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272. Cited by: §1, §1, §4.
  • [31] N. K. Vaidhiyan and R. Sundaresan (2017) Learning to detect an oddball target. IEEE Transactions on Information Theory 64 (2), pp. 831–852. Cited by: §1.
  • [32] Y. Zhou, X. Chen, and J. Li (2014) Optimal PAC multiple arm identification with applications to crowdsourcing. In International Conference on Machine Learning, pp. 217–225. Cited by: §1.

Appendix A Proof of Proposition 1

Proof of Proposition 1.

For any policy, we have that


Here, (a) is obtained by pushing the max inside the sum; (b) holds because no arm's mean reward exceeds the highest mean reward; and (c) holds because the reward of an arm in a period is independent of the past history of play and observations. Thus, this value is the highest reward that one can obtain under any policy, and it can, in fact, be attained by the policy that always picks the arm with the highest mean reward. This shows that
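To build intuition for this benchmark, the following sketch compares the commit-to-the-best-arm policy against an even split under the objective of maximizing the expected highest cumulative reward. The mean rewards, horizon, and function names are hypothetical, chosen only for illustration.

```python
# Compare the single-best-arm policy against an even-split policy under
# the objective E[max_a (cumulative reward of arm a)]. Means and horizon
# are illustrative, not taken from the paper.
import random

def max_arm_reward(mu, pulls, rng):
    """Objective value for a fixed allocation `pulls` of the horizon."""
    totals = [sum(rng.random() < m for _ in range(n)) for m, n in zip(mu, pulls)]
    return max(totals)

def average_objective(mu, pulls, trials=2000, seed=0):
    """Monte-Carlo estimate of the expected highest cumulative reward."""
    rng = random.Random(seed)
    return sum(max_arm_reward(mu, pulls, rng) for _ in range(trials)) / trials

mu, T = [0.7, 0.5, 0.3], 300
best_only = average_objective(mu, [T, 0, 0])   # commit all pulls to the best arm
split = average_objective(mu, [T // 3] * 3)    # spread pulls evenly

assert best_only > split
```

Committing the entire budget to the highest-mean arm dominates any split, consistent with the benchmark used in Proposition 1.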

Appendix B Proof of Theorem 1

Proof of Theorem 1.

First, we fix an arbitrary policy. Let Δ > 0 be a gap parameter, to be chosen at the end of the proof. We construct two bandit environments with different reward distributions for each of the arms and show that the policy cannot perform well in both environments simultaneously.

We first specify the reward distributions of the arms in the base environment, denoted as the bandit ν. Assume that the rewards of all of the arms have Bernoulli distributions. We let arm 1 have a mean reward that exceeds the common mean reward of the remaining arms by Δ. We let P denote the probability distribution induced over events until time T under the policy in this first environment, i.e., in bandit ν. Let E denote the expectation under P.

Define N_a(T) as the (random) number of pulls spent on arm a until time T under the policy (note that these counts sum to T). In particular, N_1(T) is the total (random) number of pulls spent on the first arm under the policy until time T. Under the policy, let a* denote the arm among the arms other than arm 1 that is pulled the least in expectation until time T. Then clearly, we have that E[N_{a*}(T)] ≤ E[T − N_1(T)]/(K − 1).

Having defined a*, we can now define the second environment, denoted as the bandit ν′. Again, the rewards of all of the arms have Bernoulli distributions. The mean rewards are identical to those in ν, except that the mean reward of arm a* is raised so that it exceeds that of arm 1 by Δ (i.e., it is raised by 2Δ relative to ν). We let P′ denote the probability distribution induced over events until time T under the policy in this second environment, i.e., in bandit ν′. Let E′ denote the expectation under P′.
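The two environments can be sketched in code. The base mean of 1/2 below is an assumed value for illustration (the text specifies only the gap structure), with arm 1 raised by Δ in ν and arm a* raised by 2Δ in ν′; the function names are hypothetical.

```python
# Sketch of the two Bernoulli environments from the lower-bound proof.
# The base mean of 1/2 is an assumption for illustration. Arms are
# 0-indexed here, so "arm 1" of the proof is index 0.
import random

def make_environments(K, Delta, a_star):
    """Return the mean-reward vectors (mu_nu, mu_nu_prime) of the two bandits."""
    mu_nu = [0.5] * K
    mu_nu[0] = 0.5 + Delta                   # arm 1 is the best arm in bandit nu
    mu_nu_prime = list(mu_nu)
    mu_nu_prime[a_star] = 0.5 + 2 * Delta    # arm a* overtakes arm 1 in bandit nu'
    return mu_nu, mu_nu_prime

def pull(mu, arm, rng=random):
    """Draw one Bernoulli reward from the chosen arm."""
    return 1 if rng.random() < mu[arm] else 0

mu_nu, mu_nu_prime = make_environments(K=5, Delta=0.05, a_star=3)
```

The two mean vectors agree on every arm except a*, which is what makes the environments hard to distinguish when a* is rarely pulled.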

Suppose that the policy does not concentrate enough of its pulls on arm 1 in the first environment. Then we can argue that the regret is large, up to a lower-order approximation error. To see this, note that this regret is at least the regret of a policy that maximizes the objective in environment ν, subject to the stated constraint on the pulls of arm 1. This regret is, in turn, at least the regret of a policy that minimizes the regret in environment ν, subject to the same constraint. This latter regret can be lower bounded, up to a lower-order approximation error, using the following lemma.

Lemma 1.

Consider the K-armed bandit instance with Bernoulli rewards and the mean vector of bandit ν, in which arm 1 has the highest mean reward. Consider a policy that satisfies the stated constraint on the expected number of pulls of arm 1. Then the expected value of the highest cumulative reward under this policy is bounded away from the benchmark; hence, the regret is lower bounded accordingly.

The proof of Lemma 1 is presented below in this section. A similar argument shows that if the analogous condition holds in the second environment, then the analogous bound holds there, and hence the regret in the second environment admits a similar lower bound, again up to a lower-order approximation error.

Lemma 2.

Consider the K-armed bandit instance with Bernoulli rewards and the mean vector of bandit ν′, in which arm a* has the highest mean reward. Consider a policy that satisfies the analogous constraint on the expected number of pulls of arm a*. Then the analogous bounds on the objective and the regret hold.

The proof of Lemma 2 is omitted since it is almost identical to that of Lemma 1. These two facts result in the following two inequalities:


Now, using the Bretagnolle-Huber inequality (see Thm. 14.2 in [24]), we have,


Here, P (resp. P′) is the probability distribution induced by the policy on events until time T under bandit ν (resp. ν′). The first equality results from the fact that the two events depend only on the play until time T. The second inequality results from the Bretagnolle–Huber inequality, which states that P(A) + P′(A^c) ≥ (1/2) exp(−D(P, P′)) for any event A, where D(P, P′) is the relative entropy, or the Kullback–Leibler (KL) divergence, between the distributions P and P′. We can upper bound this divergence as follows:

where ν_a (resp. ν′_a) denotes the reward distribution of arm a in the first (resp. second) environment. The first equality results from the fact that no arm other than a* offers any distinguishability between ν and ν′. The next inequality follows from the fact that E[N_{a*}(T)] ≤ E[T − N_1(T)]/(K − 1), since, by definition, a* is the arm that is pulled the least in expectation until time T in bandit ν under the policy. Now the remaining factor is simply the relative entropy between the distributions ν_{a*} and ν′_{a*}, which, by elementary calculations, can be shown to be at most a constant multiple of Δ², resulting in the final inequality. Thus, we finally have,

Substituting the choice of Δ gives
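A choice of Δ consistent with the theorem's Ω(K^{1/3}T^{2/3}) bound (stated here as an assumption, with constants suppressed) is Δ proportional to (K/T)^{1/3}, for which the regret scale TΔ matches the claimed bound:

```latex
\Delta = \Bigl(\frac{K}{T}\Bigr)^{1/3}
\quad\Longrightarrow\quad
T\Delta = T\cdot\frac{K^{1/3}}{T^{1/3}} = K^{1/3}\,T^{2/3}.
```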

Finally, simplifying the resulting expression gives the desired lower bound on the regret. ∎
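The elementary calculation bounding the per-pull relative entropy can be sanity-checked numerically. The sketch below assumes the arm-a* means are 1/2 in ν and 1/2 + 2Δ in ν′ (an assumed parameterization), and the constant 16 in the c·Δ² bound is a loose illustrative choice, not the paper's constant.

```python
# KL divergence D(Ber(p) || Ber(q)): the per-pull distinguishability
# between the two environments at arm a*. For small Delta it scales as
# c * Delta^2 (c = 16 below is a loose illustrative constant).
import math

def bernoulli_kl(p, q):
    """Relative entropy between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for Delta in (0.01, 0.05, 0.1):
    d = bernoulli_kl(0.5, 0.5 + 2 * Delta)
    assert d <= 16 * Delta ** 2   # numerical check of the O(Delta^2) behavior
```

Multiplying this per-pull divergence by the expected number of pulls of arm a* then bounds the total divergence between P and P′.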

Proof of Lemma 1.

We first have that

Since the rewards are bounded in [0, 1], by Hoeffding’s inequality, we have that, for any ε > 0,

Hence, by the union bound, we have, for any ε > 0,

Thus, combining these bounds, we finally have,

Here, (a) follows from the preceding concentration bound. ∎
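The Hoeffding step above can be sanity-checked with a quick Monte-Carlo estimate; the parameters below are hypothetical, and the comparison is against the standard two-sided bound 2 exp(−2nε²).

```python
# Monte-Carlo sanity check of Hoeffding's inequality for Bernoulli
# samples: P(|empirical mean - mu| >= eps) <= 2 * exp(-2 * n * eps^2).
import math
import random

def deviation_prob(mu, n, eps, trials=5000, seed=0):
    """Estimate P(|mean of n Bernoulli(mu) samples - mu| >= eps)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < mu for _ in range(n)) / n
        if abs(mean - mu) >= eps:
            hits += 1
    return hits / trials

mu, n, eps = 0.5, 100, 0.1
empirical = deviation_prob(mu, n, eps)
hoeffding = 2 * math.exp(-2 * n * eps ** 2)   # two-sided Hoeffding bound
assert empirical <= hoeffding
```

As expected, the empirical deviation probability sits well below the Hoeffding bound, which is what lets the union bound over K arms in the proof remain meaningful.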

Appendix C Proof of Theorem 2

The proof of Theorem 2 utilizes two technical lemmas. The first one is the following.

Lemma 3.

Let n ∈ ℕ, and let X_1, X_2, …, X_n be a sequence of independent zero-mean 1-sub-Gaussian random variables. Let S_t = X_1 + ⋯ + X_t for t ≤ n. Then, for any ε > 0,

Its proof is similar to the proof of Lemma 9.3 in [24], which we present below for completeness.

Proof of Lemma 3.

We have,