# Thresholding Bandit with Optimal Aggregate Regret

We consider the thresholding bandit problem, whose goal is to find arms of mean rewards above a given threshold θ, with a fixed budget of T trials. We introduce LSA, a new, simple and anytime algorithm that aims to minimize the aggregate regret (or the expected number of mis-classified arms). We prove that our algorithm is instance-wise asymptotically optimal. We also provide comprehensive empirical results to demonstrate the algorithm's superior performance over existing algorithms under a variety of different scenarios.

## 1 Introduction

The stochastic Multi-Armed Bandit (MAB) problem has been extensively studied in the past decade Auer (2002); Audibert et al. (2010); Bubeck et al. (2009); Gabillon et al. (2012); Karnin et al. (2013); Jamieson et al. (2014); Garivier and Kaufmann (2016); Chen et al. (2017). In the classical framework, at each trial of the game, a learner faces a set of arms, pulls an arm and receives an unknown stochastic reward. Of particular interest is the fixed budget setting, in which the learner is only given a limited number of total pulls. Based on the received rewards, the learner will recommend the best arm, i.e., the arm with the highest mean reward. In this paper, we study a variant of the MAB problem, called the Thresholding Bandit Problem (TBP). In TBP, instead of finding the best arm, we expect the learner to identify all the arms whose mean rewards ($\theta_i$) are greater than or equal to a given threshold $\theta$. This is a very natural setting with direct real-world applications to active binary classification and anomaly detection Locatelli et al. (2016); Steinwart et al. (2005).

In this paper, we propose to study TBP under the notion of aggregate regret, which is defined as the expected number of errors after $T$ samples of the bandit game. Specifically, for a given algorithm and a TBP instance with $K$ arms, the aggregate regret is the sum, over all arms $i$, of the probability that the algorithm makes an incorrect decision corresponding to arm $i$.

A new challenge arising for the TBP with aggregate regret is how to balance the exploration for each arm given a fixed budget. Different from the exploration vs. exploitation trade-off in the classical MAB problems, where exploration is only aimed at finding the best arm, the TBP expects to maximize the accuracy of the classification of all arms. Let $\Delta_i \stackrel{\mathrm{def}}{=} |\theta_i - \theta|$ be the hardness parameter or gap for each arm $i$. An arm with smaller $\Delta_i$ would need more samples to achieve the same classification confidence. A TBP learner faces the following dilemma: whether to allocate samples to determine the classification of one hard arm, or to use them for improving the accuracy of another, easier arm.

##### Related Work.

Since we focus on the TBP problem in this paper, and due to space limitations, we do not survey the large body of work on other MAB variants.

In a previous work, Locatelli et al. (2016) introduced the APT (Anytime Parameter-free Thresholding) algorithm with the goal of simple regret minimization. In this algorithm, a precision parameter $\epsilon$ is used to determine the tolerance of errors (a.k.a. the indifference zone), and the APT algorithm only attempts to correctly classify the arms with hardness gap $\Delta_i > \epsilon$. This variant goal of simple regret partly alleviates the trivialization problem mentioned previously because of the extremely hard arms. In detail, at any time $t$, APT selects the arm that minimizes $\sqrt{T_i(t)}\,\hat{\Delta}_i(t)$, where $T_i(t)$ is the number of times arm $i$ has been pulled until time $t$, $\hat{\Delta}_i(t)$ is defined as $|\hat{\theta}_i(t) - \theta| + \epsilon$, and $\hat{\theta}_i(t)$ is the empirical mean reward of arm $i$ at time $t$. In their experiments, Locatelli et al. (2016) also adapted the UCBE algorithm from Audibert et al. (2010) for the TBP problem and showed that APT performs better than UCBE.
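The APT selection rule described above can be sketched in a few lines of Python. This is only an illustration, assuming the index takes the form $\sqrt{T_i(t)}\,(|\hat{\theta}_i(t) - \theta| + \epsilon)$ as in Locatelli et al. (2016); the function names `apt_index` and `apt_select` are hypothetical and do not come from the paper.

```python
import math


def apt_index(pulls, emp_mean, theta, eps):
    """APT index of one arm: sqrt(T_i) * (|theta_hat_i - theta| + eps)."""
    return math.sqrt(pulls) * (abs(emp_mean - theta) + eps)


def apt_select(pulls, emp_means, theta, eps):
    """Return the position of the arm with the smallest APT index.

    Assumes every arm has already been pulled at least once.
    """
    return min(
        range(len(pulls)),
        key=lambda i: apt_index(pulls[i], emp_means[i], theta, eps),
    )
```

With equal pull counts the rule favours the arm whose empirical mean is closest to the threshold; a heavily sampled arm is eventually passed over even if it is close to the threshold.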

When the goal is to minimize the aggregate regret, the APT algorithm also works better than UCBE. However, we notice that the choice of the precision parameter $\epsilon$ has a significant influence on the algorithm's performance. A large $\epsilon$ makes sure that, when the sample budget is limited, the APT algorithm is not lured by a hard arm into spending overwhelmingly many samples on it without achieving a confident label. However, when the sample budget is ample, a large $\epsilon$ would also prevent the algorithm from making enough samples for the arms with hardness gap $\Delta_i < \epsilon$. Theoretically, the optimal selection of this precision parameter may differ significantly across instances, and also depends on the budget $T$. In this work, we propose an algorithm that does not require such a precision parameter and demonstrates improved robustness in practice.

Another natural approach to TBP is the uniform sampling method, where the learner plays each arm the same number of times (about $T/K$ times). In Appendix C, we show that the uniform sampling approach may need a larger budget than the optimal algorithm, by a multiplicative factor, to achieve the same aggregate regret.
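As a point of reference, the uniform sampling baseline can be sketched as follows. The sketch assumes Bernoulli arms and a round-robin pulling order; the function name `uniform_sampling` and its arguments are hypothetical.

```python
import random


def uniform_sampling(arm_means, theta, budget, rng=None):
    """Round-robin over Bernoulli arms, then label each arm by its empirical mean.

    Returns (decisions, pulls); decisions[i] is 1 when arm i is labelled
    as having mean >= theta.
    """
    rng = rng or random.Random(0)
    k = len(arm_means)
    pulls = [0] * k
    sums = [0.0] * k
    for t in range(budget):
        i = t % k  # every arm ends up with about budget / k pulls
        sums[i] += 1.0 if rng.random() < arm_means[i] else 0.0
        pulls[i] += 1
    decisions = [
        1 if pulls[i] > 0 and sums[i] / pulls[i] >= theta else 0
        for i in range(k)
    ]
    return decisions, pulls
```

Because the allocation ignores the gaps, easy arms receive as many samples as hard ones, which is the source of the extra budget requirement mentioned above.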

Finally, Chen et al. (2015) proposed the optimistic knowledge gradient heuristic algorithm for budget allocation in crowdsourcing binary classification with Beta priors, which is closely related to the TBP problem in the Bayesian setting.

##### Our Results and Contributions.

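Let $R^{A}(I;T)$ denote the aggregate regret of an instance $I$ after $T$ time steps under algorithm $A$. Given a sequence of hardness parameters $\Delta_1, \dots, \Delta_K$, assume $\mathcal{I}_{\Delta_1,\dots,\Delta_K}$ is the class of all $K$-arm instances where the gap between $\theta_i$ of the $i$-th arm and the threshold $\theta$ is $\Delta_i$, and let

$$\mathrm{OPT}(\{\Delta_i\}_{i=1}^{K}, T) \stackrel{\mathrm{def}}{=} \inf_{A} \sup_{I \in \mathcal{I}_{\Delta_1,\dots,\Delta_K}} R^{A}(I;T) \tag{1}$$

be the minimum possible aggregate regret that any algorithm can achieve among all instances with the given set of gap parameters. We say an algorithm $A$ is instance-wise asymptotically optimal if for every $T$, any set of gap parameters $\{\Delta_i\}_{i=1}^{K}$, and any instance $I \in \mathcal{I}_{\Delta_1,\dots,\Delta_K}$, it holds that

$$R^{A}(I;T) \le O(1) \cdot \mathrm{OPT}(\{\Delta_i\}_{i=1}^{K}, \Omega(T)). \tag{2}$$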

While it may appear that a constant factor multiplied to $T$ can affect the regret if the optimal regret is an exponential function of $T$, we note that our definition aligns with major multi-armed bandit literature (e.g., fixed-budget best arm identification Gabillon et al. (2012); Carpentier and Locatelli (2016) and thresholding bandit with simple regret Locatelli et al. (2016)). Indeed, according to our definition, if the universal optimal algorithm requires a budget of $T$ to achieve a given regret, an asymptotically optimal algorithm requires a budget of only $T$ multiplied by some constant to achieve the same order of regret. On the other hand, if one wishes to pin down the optimal constant before $T$, even for the single arm case, it boils down to the question of the optimal (and distribution dependent) constant in the exponent of existing concentration bounds such as the Chernoff Bound, Hoeffding's Inequality, and Bernstein Inequalities, which is beyond the scope of this paper.

We address the challenges mentioned previously and introduce a simple and elegant algorithm, the Logarithmic-Sample Algorithm (LSA). LSA has a form very similar to the APT algorithm in Locatelli et al. (2016) but introduces an additive term that is proportional to the logarithm of the number of samples made to each arm in order to more carefully allocate the budget among the arms (see Line 4 of Algorithm 1). This logarithmic term arises from the optimal sample allocation scheme of an offline algorithm when the gap parameters are known beforehand. The log-sample additive term of LSA can be interpreted as an incentive to encourage the samples for arms with bigger gaps and/or less explored arms. This is similar in spirit to the incentive term in the famous Upper Confidence Bound (UCB) type of algorithms that date back to (Lai and Robbins, 1985; Agrawal, 1995; Auer, 2002), although the mathematical forms of the two incentive terms are very different.

As the main theoretical result of this paper, we analyze the aggregate regret upper bound of LSA in Theorem 4. We complement the upper bound result with a lower bound theorem (Theorem E) for any online algorithm. In Remark 4, we compare the upper and lower bounds and show that LSA is instance-wise asymptotically optimal.

We now highlight the technical contributions made in our regret upper bound analysis at a very high level. Please refer to Section 4 for more detailed explanations. In our proof of the upper bound theorem, we first define a global class of events $\{\mathcal{F}_C\}$ (in (14)) which serves as a measurement of how well the arms are explored. Our analysis then proceeds in two steps. In the first step, we show that $\mathcal{F}_C$ happens with high probability, which intuitively means that all arms are "well explored". In the second step, we show the quantitative upper bound on the mis-classification probability for each arm, when conditioned on $\mathcal{F}_C$. The final regret bound follows by summing up the mis-classification probability for each arm via linearity of expectation. Using this approach, we bypass the analysis that involves pairs of (or even more) arms, which usually brings in union bound arguments and extra terms. Indeed, such slack appears between the upper and lower bounds proved in Locatelli et al. (2016). In contrast, our LSA algorithm is asymptotically optimal, without any super-constant slack.

Another important technical ingredient that is crucial to the asymptotic optimality analysis is a new concentration inequality for the empirical mean of an arm that uniformly holds over all time periods, which we refer to as the Variable Confidence Level Bound. This new inequality helps to remove an extra doubly logarithmic factor in the upper bound. It is also a strict improvement of the celebrated Hoeffding's Maximal Inequality, and might be useful in many other problems.

Finally, we highlight that our LSA is anytime, i.e., it does not need to know the time horizon $T$ beforehand. LSA does use a universal tuning parameter $\alpha$. However, this parameter does not depend on the instances. As we will show in Section 5, the choice of the parameter is quite robust, and the natural parameter setting leads to superior performance of LSA among a set of very different instances, while APT may suffer from poor performance if the precision parameter is not chosen well for an instance.

##### Organization.

The rest of the paper is organized as follows. In Section 2 we provide the necessary notation and definitions. Then we present the details of the algorithm in Section 3 and upper bound its aggregate regret in Section 4. In Section 5, we present experiments establishing the empirical advantages of LSA over other algorithms. The instance-wise aggregate regret lower bound theorem is deferred to Appendix E.

## 2 Problem Formulation and Notation

Given an integer $K$, we let $S = \{1, 2, \dots, K\}$ be the set of arms in an instance $I$. Each arm $i \in S$ is associated with a distribution supported on $[0, 1]$ which has an unknown mean $\theta_i$. We are interested in the following dynamic game setting: at any round $t$, the learner chooses to pull an arm from $S$ and receives an i.i.d. reward sampled from the distribution of that arm.

We let $T$ be the time horizon, or the budget of the game, which is not necessarily known beforehand. We furthermore let $\theta$ be the threshold of the game. After $T$ rounds, the learner $A$ has to determine, for every arm $i \in S$, whether or not its mean reward is greater than or equal to $\theta$. So the learner outputs a binary vector with one entry per arm, where the entry for arm $i$ is $1$ if and only if $A$ decides that $\theta_i \ge \theta$. The goal of the Thresholding Bandit Problem (TBP) in this paper is to maximize the expected number of correct labels after $T$ rounds of the game.

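More specifically, for any algorithm $A$, we use $\mathcal{E}^{A}_i(T)$ to denote the event that $A$'s decision corresponding to arm $i$ is correct after $T$ rounds of the game. The goal of the TBP algorithm is to minimize the aggregate regret, which is the expected number of incorrect classifications for the $K$ arms, i.e.,

$$R^{A}(T) = R^{A}(I;T) \stackrel{\mathrm{def}}{=} \mathbb{E}\left[\sum_{i=1}^{K} \mathbb{I}\left\{\overline{\mathcal{E}^{A}_i(T)}\right\}\right], \tag{3}$$

where $\overline{\mathcal{E}}$ denotes the complement of event $\mathcal{E}$ and $\mathbb{I}\{\cdot\}$ denotes the indicator function.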
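In simulations, the expectation in (3) is typically estimated by averaging the number of misclassified arms over independent runs. The following minimal sketch illustrates this; the helper names and the encoding of decisions as 0/1 labels are assumptions made for the example.

```python
def misclassified_count(true_means, decisions, theta):
    """Number of arms whose 0/1 label disagrees with the truth (theta_i >= theta)."""
    return sum(
        (mean >= theta) != bool(label)
        for mean, label in zip(true_means, decisions)
    )


def empirical_aggregate_regret(true_means, decision_runs, theta):
    """Average misclassification count over a collection of independent runs."""
    runs = list(decision_runs)
    total = sum(misclassified_count(true_means, run, theta) for run in runs)
    return total / len(runs)
```

For example, with true means `[0.2, 0.6, 0.5]` and threshold `0.5`, the decision vector `[0, 1, 0]` contains exactly one error, because the third arm sits on the threshold and should be labelled 1.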

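Let $X_{i,t}$ denote the random variable representing the sample received by pulling arm $i$ for the $t$-th time. We further write

$$\hat{\theta}_{i,t} \stackrel{\mathrm{def}}{=} \frac{1}{t}\sum_{s=1}^{t} X_{i,s} \quad\text{and}\quad \hat{\Delta}_{i,t} \stackrel{\mathrm{def}}{=} |\hat{\theta}_{i,t} - \theta| \tag{4}$$

to denote the empirical mean and the empirical gap of arm $i$ after being pulled $t$ times, respectively. For a given algorithm $A$, let $T^{A}_i(t)$ and $\hat{\theta}^{A}_i(t)$ denote the number of times arm $i$ is pulled and the empirical mean reward of arm $i$ after $t$ rounds of the game, respectively. For each arm $i$, we use $\hat{\Delta}^{A}_i(t) \stackrel{\mathrm{def}}{=} |\hat{\theta}^{A}_i(t) - \theta|$ to denote the empirical gap after $t$ rounds of the game. We will omit the reference to $A$ when it is clear from the context.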
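The per-arm quantities in (4) only require a pull count and a running sum. A minimal sketch of such bookkeeping is given below; the class name `ArmStats` is hypothetical.

```python
class ArmStats:
    """Running statistics of one arm: pull count, empirical mean, empirical gap."""

    def __init__(self):
        self.pulls = 0
        self.total = 0.0

    def update(self, reward):
        """Record one observed reward in [0, 1]."""
        self.pulls += 1
        self.total += reward

    def mean(self):
        """Empirical mean after `pulls` samples (requires pulls >= 1)."""
        return self.total / self.pulls

    def gap(self, theta):
        """Empirical gap: |empirical mean - theta|."""
        return abs(self.mean() - theta)
```

For instance, after observing rewards 1, 0, 1, 1 the empirical mean is 0.75 and the empirical gap with respect to a threshold of 0.5 is 0.25.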

## 3 Our Algorithm

We now motivate our Logarithmic-Sample Algorithm by first designing an optimal but unrealistic algorithm $O$ under the assumption that the hardness gaps $\{\Delta_i\}_{i \in S}$ are known beforehand. Suppose the algorithm pulls arm $i$ a total of $x_i$ times and makes a decision based on the empirical mean $\hat{\theta}_{i,x_i}$: if $\hat{\theta}_{i,x_i} \ge \theta$, the algorithm decides that $\theta_i \ge \theta$, and decides $\theta_i < \theta$ otherwise. Note that this is all algorithm $O$ can do when the gaps are known. We upper bound the aggregate regret of the algorithm by

$$R^{O}(T) = \sum_{i=1}^{K} \Pr\left(\overline{\mathcal{E}^{O}_i(T)}\right) \le \sum_{i=1}^{K} \Pr\left(|\hat{\theta}_{i,x_i} - \theta_i| \ge \Delta_i\right) \le \sum_{i=1}^{K} 2\exp(-2x_i\Delta_i^2), \tag{5}$$

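where the last inequality follows from the Chernoff-Hoeffding Inequality (Proposition A). Now we would like to minimize the right-hand side (RHS) of (5), and upper bound the aggregate regret of the optimal algorithm by

$$2 \cdot \min_{\substack{x_1 + \dots + x_K = T \\ x_1, \dots, x_K \in \mathbb{N}}} \sum_{i=1}^{K} \exp(-2x_i\Delta_i^2) = 2P^{*}_2(\{\Delta_i\}_{i \in S}, T).$$

Here, for every $c > 0$, we define

$$P^{*}_c(\{\Delta_i\}_{i \in S}, T) \stackrel{\mathrm{def}}{=} \min_{\substack{x_1 + \dots + x_K = T \\ x_1, \dots, x_K \in \mathbb{N}}} \sum_{i=1}^{K} \exp(-cx_i\Delta_i^2). \tag{6}$$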

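We naturally introduce the following continuous relaxation of the program $P^{*}_c$, by defining

$$P_c(\{\Delta_i\}_{i \in S}, T) \stackrel{\mathrm{def}}{=} \min_{\substack{x_1 + \dots + x_K = T \\ x_1, \dots, x_K \ge 0}} \sum_{i=1}^{K} \exp(-cx_i\Delta_i^2). \tag{7}$$

$P_c$ approximates $P^{*}_c$ well, as it is straightforward to see that

$$P_c(\{\Delta_i\}_{i \in S}, T) \le P^{*}_c(\{\Delta_i\}_{i \in S}, T) \le P_c(\{\Delta_i\}_{i \in S}, T - K). \tag{8}$$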
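For very small instances, the integer program (6) can be evaluated by exhaustive search, which is a convenient sanity check on any allocation heuristic. The sketch below is such a brute-force evaluation; it is exponential in the number of arms and is meant for illustration only. The function names are hypothetical.

```python
import math
from itertools import product


def objective(alloc, gaps, c=2.0):
    """Sum over arms of exp(-c * x_i * Delta_i^2) for an allocation x."""
    return sum(math.exp(-c * x * d * d) for x, d in zip(alloc, gaps))


def best_integer_allocation(gaps, budget, c=2.0):
    """Brute-force minimiser of the objective over integer allocations summing to budget.

    Returns (optimal value, optimal allocation as a tuple).
    """
    k = len(gaps)
    best = None
    # Enumerate the first k-1 coordinates; the last one is fixed by the budget.
    for head in product(range(budget + 1), repeat=k - 1):
        rest = budget - sum(head)
        if rest < 0:
            continue
        alloc = head + (rest,)
        value = objective(alloc, gaps, c)
        if best is None or value < best[0]:
            best = (value, alloc)
    return best
```

When two arms have equal gaps the search returns the even split, and with unequal gaps the optimum is never worse than the uniform allocation.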

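We apply the Karush-Kuhn-Tucker (KKT) conditions to the optimization problem $P_2(\{\Delta_i\}_{i \in S}, T)$ and find that the optimal solution satisfies

$$x_i\Delta_i^2 + \ln\Delta_i^{-1} \ge \Phi, \quad \text{for } i \in S, \tag{9}$$

where $\Phi$ is independent of $i$. Furthermore, since the total allocation $\sum_{i \in S} \max\{(\Phi - \ln\Delta_i^{-1})/\Delta_i^2, 0\}$ is an increasing continuous function of $\Phi$ on $[\min_{i \in S} \ln\Delta_i^{-1}, \infty)$, $\Phi$ is indeed well-defined. Please refer to Lemma B of Appendix B for the details of the relevant calculations.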
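Condition (9) suggests a simple numerical procedure for the relaxed program: arms with a positive allocation share a common level $\Phi$, and $\Phi$ can be found by bisection on the budget constraint. The sketch below assumes the closed form $x_i = \max\{(\Phi - \ln\Delta_i^{-1})/(c\Delta_i^2/2), 0\}$ from Appendix B; the function names and the iteration count are illustrative choices.

```python
import math


def allocation_for_level(phi, gaps, c=2.0):
    """Allocation x_i = max((phi - ln(1/Delta_i)) / (c * Delta_i^2 / 2), 0)."""
    return [
        max((phi - math.log(1.0 / d)) / (c * d * d / 2.0), 0.0)
        for d in gaps
    ]


def solve_relaxation(gaps, budget, c=2.0, iters=200):
    """Find the level phi at which the allocation uses exactly `budget` samples."""
    lo = min(math.log(1.0 / d) for d in gaps)  # total allocation is 0 at this level
    hi = lo + 1.0
    while sum(allocation_for_level(hi, gaps, c)) < budget:
        hi = lo + 2.0 * (hi - lo)  # grow the bracket until it covers the budget
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(allocation_for_level(mid, gaps, c)) < budget:
            lo = mid
        else:
            hi = mid
    return hi, allocation_for_level(hi, gaps, c)
```

For gaps `[0.5, 0.1]` and a budget of 100, this places roughly 10 samples on the easy arm and 90 on the hard one.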

In light of (8) and (9), the following algorithm (still with the unrealistic assumption of the knowledge of the gaps $\{\Delta_i\}_{i \in S}$) incrementally solves $P_2$ and approximates the algorithm $O$: at each time $t$, the algorithm selects the arm $i$ that minimizes $T_i(t-1)\Delta_i^2 + \ln\Delta_i^{-1}$ and plays it.
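The incremental scheme can be sketched directly, assuming that the quantity being equalised is the left-hand side of (9) with $x_i$ replaced by the current pull count. The function name `greedy_known_gaps` is hypothetical.

```python
import math


def greedy_known_gaps(gaps, budget):
    """Allocate `budget` pulls one at a time, assuming the gaps are known.

    Each step pulls the arm with the smallest value of
    pulls_i * Delta_i^2 + ln(1 / Delta_i), i.e. the arm furthest behind in (9).
    """
    pulls = [0] * len(gaps)
    for _ in range(budget):
        i = min(
            range(len(gaps)),
            key=lambda j: pulls[j] * gaps[j] ** 2 + math.log(1.0 / gaps[j]),
        )
        pulls[i] += 1
    return pulls
```

On gaps `[0.5, 0.1]` with 100 pulls, the resulting integer allocation stays within one pull of the continuous solution of the relaxed program.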

Our proposed algorithm is very close to this incremental algorithm. Since in reality the algorithm does not have access to the precise gap quantities, we use the empirical estimates $\hat{\Delta}_i(t-1)$ instead of $\Delta_i$ in the $T_i(t-1)\Delta_i^2$ term. For the logarithmic term, if we also use $\ln\hat{\Delta}_i(t-1)^{-1}$ instead of $\ln\Delta_i^{-1}$, we may encounter extremely small empirical estimates when the arm is not sufficiently sampled, which would lead to an unbounded value of $\ln\hat{\Delta}_i(t-1)^{-1}$ and render the arm almost impossible to be sampled in the future. To solve this problem, we note that the incremental algorithm tries to keep $T_i(t-1)\Delta_i^2$ roughly the same across the arms (if ignoring the $\ln\Delta_i^{-1}$ term). In light of this, we use $\sqrt{T_i(t-1)}$ to roughly estimate the order of $\Delta_i^{-1}$. This encourages the exploration of both the arms with larger gaps and the ones with fewer trials.

To summarize, at each time $t$, our algorithm selects the arm $i$ that minimizes $\alpha T_i(t-1)\hat{\Delta}_i(t-1)^2 + 0.5\ln T_i(t-1)$, where $\alpha > 0$ is a universal tuning parameter, and plays the arm. The details of the algorithm are presented in Algorithm 1.
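Since the listing of Algorithm 1 is not reproduced in this copy, the following is a hedged sketch of the selection rule just described, not the authors' reference implementation. It assumes the index $\alpha T_i \hat{\Delta}_i^2 + 0.5 \ln T_i$ (consistent with the derivation above and with (10) and (11)), Bernoulli arms, and an initial round in which each arm is pulled once; the default `alpha=1.0` is a placeholder, not a recommended value.

```python
import math
import random


def lsa(arm_means, theta, budget, alpha=1.0, rng=None):
    """Sketch of a logarithmic-sample selection rule on Bernoulli arms.

    Returns (decisions, pulls); decisions[i] is 1 when arm i is labelled
    as having mean >= theta. Requires budget >= number of arms.
    """
    rng = rng or random.Random(0)
    k = len(arm_means)
    pulls = [0] * k
    sums = [0.0] * k

    def pull(i):
        sums[i] += 1.0 if rng.random() < arm_means[i] else 0.0
        pulls[i] += 1

    # Assumed initialisation: one pull per arm so that ln(T_i) is defined.
    for i in range(k):
        pull(i)

    def index(i):
        gap = abs(sums[i] / pulls[i] - theta)
        return alpha * pulls[i] * gap * gap + 0.5 * math.log(pulls[i])

    for _ in range(budget - k):
        pull(min(range(k), key=index))

    decisions = [1 if sums[i] / pulls[i] >= theta else 0 for i in range(k)]
    return decisions, pulls
```

The rule never uses the horizon inside the loop, which reflects the anytime property: stopping after any number of rounds yields a valid labelling.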

## 4 Regret Upper Bound for LSA

In this section, we show the upper bound of the aggregate regret of Algorithm 1.

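Let $\Lambda$ be the solution to the following equation

$$\sum_{i=1}^{K}\left(\mathbb{I}\{x \le \ln\Delta_i^{-1}\} \cdot \exp(2x) + \mathbb{I}\{x > \ln\Delta_i^{-1}\} \cdot \frac{x - \ln\Delta_i^{-1} + \alpha}{\alpha\Delta_i^2}\right) = \frac{T}{\max\{40/\alpha + 1, 40\}}. \tag{10}$$

Notice that the left-hand side of (10) is a strictly increasing, continuous function of $x \ge 0$ that becomes $K$ when $x = 0$ and goes to infinity when $x \to \infty$. Hence $\Lambda$ is guaranteed to exist and is uniquely defined when $T$ is large. Furthermore, for any $i \in S$, we let

$$\lambda_i \stackrel{\mathrm{def}}{=} \mathbb{I}\{\Lambda \le \ln\Delta_i^{-1}\} \cdot \exp(2\Lambda) + \mathbb{I}\{\Lambda > \ln\Delta_i^{-1}\} \cdot \frac{\Lambda - \ln\Delta_i^{-1} + \alpha}{\alpha\Delta_i^2}. \tag{11}$$

We note that $\{\lambda_i\}_{i \in S}$ is the optimal solution to $P_{2\alpha}(\{\max\{\Delta_i, \exp(-\Lambda)\}\}_{i \in S}, T/\max\{40/\alpha + 1, 40\})$. Please refer to Lemma B of Appendix B for the detailed calculations.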
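Because the left-hand side of (10) is continuous and increasing, $\Lambda$ can be computed numerically by bisection, after which (11) gives each $\lambda_i$. The sketch below assumes gaps in $(0, 1]$ and a budget large enough for a solution to exist; the function names are hypothetical.

```python
import math


def lam_i(x, delta, alpha):
    """Summand of (10) for one arm, evaluated at level x."""
    log_inv = math.log(1.0 / delta)
    if x <= log_inv:
        return math.exp(2.0 * x)
    return (x - log_inv + alpha) / (alpha * delta * delta)


def solve_big_lambda(gaps, budget, alpha, iters=200):
    """Bisection for the level x at which the left-hand side of (10) meets the right-hand side."""
    target = budget / max(40.0 / alpha + 1.0, 40.0)
    if target < len(gaps):
        raise ValueError("budget too small: (10) has no solution with x >= 0")
    lo, hi = 0.0, 1.0
    while sum(lam_i(hi, d, alpha) for d in gaps) < target:
        hi *= 2.0  # grow the bracket until the left-hand side exceeds the target
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(lam_i(mid, d, alpha) for d in gaps) < target:
            lo = mid
        else:
            hi = mid
    return hi
```

Given the returned level, `lam_i(level, delta, alpha)` evaluates (11) for each arm, and the values sum to the right-hand side of (10) up to numerical precision.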

The goal of this section is to prove the following theorem.

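Let $R^{\mathrm{LSA}}(T)$ be the aggregate regret incurred by Algorithm 1. When , and , we have

$$R^{\mathrm{LSA}}(T) \le \Phi(\alpha) \cdot \sum_{i \in S} \exp\left(-\frac{\lambda_i\Delta_i^2}{10}\right), \tag{12}$$

where $\Phi(\alpha)$ is a constant that only depends on the universal tuning parameter $\alpha$.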

If we set , then the right-hand side of (12) would be at most . One can verify that

$$\sum_{i \in S} \exp\left(-\frac{\lambda_i\Delta_i^2}{10}\right) \le O\left(P_{1/10}(\{\max\{\Delta_i, \exp(-\Lambda)\}\}_{i \in S}, T/801)\right) = O\left(P_{16}(\{\max\{\Delta_i, \exp(-\Lambda)\}\}_{i \in S}, T/128160)\right) \le O\left(P_{16}(\{\Delta_i\}_{i \in S}, T/128160)\right),$$

where the first inequality is due to Lemma B of Appendix B and the equality is because of Lemma B of Appendix B. This matches the lower bound demonstrated in Theorem E up to constant factors.

Footnote 1: While the constants may seem large, we emphasize that i) we make no effort in optimizing the constants in asymptotic bounds, ii) most of the constants come from the lower bound rather than from our upper bound, and iii) we believe that the actual constant of our algorithm is quite small, as the experimental evaluation in the later section demonstrates that our algorithm performs very well in practice.

The rest of this section is devoted to the proof of Theorem 4. Before proceeding, we note that the analysis of the APT algorithm Locatelli et al. (2016) crucially depends on a favorable event stating that the empirical mean of any arm at any time does not deviate too much from the true mean. This requires a union bound that introduces extra factors such as $\ln K$ and $\ln\ln T$. Our analysis adopts a novel approach that does not need a union bound over all arms, and hence avoids the extra $\ln K$ factor. In the second step of our analysis, we introduce the new Variable Confidence Level Bound to save the extra doubly logarithmic term in $T$.

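Now we dive into the details of the proof. Let $B \stackrel{\mathrm{def}}{=} \{i \in S : \Lambda > \ln\Delta_i^{-1}\}$. Intuitively, $B$ contains the arms that can be well classified by the ideal algorithm $O$ (described in Section 3), while even the ideal algorithm suffers $\Omega(1)$ regret for each arm in $S \setminus B$. In light of this, the key of the proof is to upper bound the regret incurred by the arms in $B$.

Let $R^{\mathrm{LSA}}_B(T)$ denote the regret incurred by arms in $B$. Note that $\lambda_i\Delta_i^2 \le 1$ for every arm $i \in S \setminus B$, and the regret incurred by each arm is at most $1$. Therefore, to establish (12), we only need to show that

$$R^{\mathrm{LSA}}_B(T) \le \Phi(\alpha) \cdot \sum_{i \in B} \exp\left(-\frac{\lambda_i\Delta_i^2}{10}\right). \tag{13}$$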

We set up some notation to facilitate the proof of (13). We define $\xi_i(t)$ to be the expression inside the $\arg\min$ operator in Line 4 of the algorithm, for arm $i$ and at time $t$. We also define .

Intuitively, when $T_i(t)$ is large, we usually have a larger value for $\xi_i(t)$, and arm $i$ is better explored. Therefore, $\xi_i(t)$ can be used as a measurement of how well arm $i$ is explored, which directly relates to the mis-classification probability for classifying the arm. We say that arm $i$ is $C$-well explored at time $T$ if there exists $T' \le T$ such that $\xi_i(T') > C$. For any $C > 0$, we also define the event $\mathcal{F}_C$ to be

$$\mathcal{F}_C \stackrel{\mathrm{def}}{=} \{\exists T' \le T : \forall i \in S,\ \xi_i(T') > C\}. \tag{14}$$

When $\mathcal{F}_C$ happens, we know that all arms are $C$-well explored.

At a high level, the proof of (13) proceeds in two steps. First, we show that for $C$ that is almost as large as $\Lambda$, $\mathcal{F}_C$ happens with high probability, which means that every arm is $C$-well explored. Second, we quantitatively relate being $C$-well explored to the mis-classification probability for each arm, which can be used to further deduce a regret upper bound given the event $\mathcal{F}_C$.

We start by revealing more details about the first step. The following Lemma 4 gives a lower bound on the probability of the event .

for .

We now introduce the high-level ideas for proving Lemma 4 and defer the formal proofs to Appendix D.2. For any arm and , let be the random variable representing the smallest positive integer such that (i.e., for all ). Intuitively, denotes the first time arm is -well explored. We first show that the distribution of has an exponential tail. Hence, the sum of them with the same also has an exponential tail. Next, we show that with high probability and the probability vanishes exponentially as increases. In the last step, thanks to the design of the algorithm, we are able to argue that implies .

We now proceed to the second step of the proof of (13). The following lemma (whose proof is deferred to Appendix D.3) gives an upper bound of regret incurred by arms in conditioned on .

If , then conditioned on ,

 RLSAB(T)≤9⋅8α√28α√2−1⋅∑i∈Bexp(−λiΔ2i10+k+α−lnα−0.54α).

As mentioned before, the key to proving Lemma 4 is to pin down the quantitative relation between the event $\mathcal{F}_C$ and the probability of mis-classifying an arm conditioned on $\mathcal{F}_C$. The expected regret upper bound can then be obtained by summing up the mis-classification probability over all arms in $B$.

A key technical challenge in our analysis is to design a concentration bound for the empirical mean of an arm (namely arm $i$) that uniformly holds over all time periods. A typical method is to let the length of the confidence band scale linearly with $\sqrt{1/t}$, where $t$ is the number of samples made for the arm. However, this would worsen the failure probability, and lead to an extra $\ln\ln T$ factor in the regret upper bound. To remove the iterated logarithmic factor, we introduce a novel uniform concentration bound where the ratio between the length of the confidence band and $\sqrt{1/t}$ is almost constant for large $t$, but becomes larger for smaller $t$. Since this ratio is related to the confidence level of the corresponding confidence band, we refer to this new concentration inequality as the Variable Confidence Level Bound. More specifically, in Appendix D.3.1, we prove the following lemma.

Lemma D.3.1 (Variable Confidence Level Bound, pre-stated) Let $X_1, \dots, X_L$ be i.i.d. random variables supported on $[0, 1]$ with mean $\mu$. For any $a > 0$ and $b > 0$, it holds that

$$\Pr\left(\forall t \in [1, L],\ \left|\frac{1}{t}\sum_{i=1}^{t} X_i - \mu\right| \le \sqrt{\frac{a + b\ln(L/t)}{t}}\right) \ge 1 - \frac{2^{b/2+2}}{2^{b/2} - 1}\exp(-a/2).$$

This new inequality greatly helps the analysis of our algorithm. The intuition is that, conditioned on the event $\mathcal{F}_C$, it is much less likely that only a small number of samples are made for arm $i$, and therefore we can afford a less accurate (i.e., bigger) confidence band for its mean value.
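To get a feel for the bound, the band width and the stated failure probability can be evaluated numerically and compared with a small Monte Carlo estimate. The sketch below assumes the lemma reads as a band of half-width $\sqrt{(a + b\ln(L/t))/t}$ with failure probability at most $\frac{2^{b/2+2}}{2^{b/2}-1}e^{-a/2}$, and uses Bernoulli samples; all function names are hypothetical.

```python
import math
import random


def vcl_band(t, horizon, a, b):
    """Half-width of the confidence band after t of `horizon` samples."""
    return math.sqrt((a + b * math.log(horizon / t)) / t)


def vcl_failure_bound(a, b):
    """Stated upper bound on the probability that the band is ever violated."""
    r = 2.0 ** (b / 2.0)
    return 4.0 * r / (r - 1.0) * math.exp(-a / 2.0)


def violation_rate(mu, horizon, a, b, trials, rng):
    """Fraction of simulated Bernoulli(mu) paths that leave the band at some t."""
    bad = 0
    for _ in range(trials):
        total = 0.0
        for t in range(1, horizon + 1):
            total += 1.0 if rng.random() < mu else 0.0
            if abs(total / t - mu) > vcl_band(t, horizon, a, b):
                bad += 1
                break
    return bad / trials
```

The band is wide for small `t` and shrinks towards $\sqrt{a/L}$ at `t = L`, which matches the description of a confidence level that varies with the number of samples.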

It is notable that a similar idea is also adopted in the analysis of the MOSS algorithm Audibert and Bubeck (2009), which gives the asymptotically optimal regret bound for the ordinary multi-armed bandits. However, our Variable Confidence Level Bound is more general and may be useful in other applications. We additionally remark that in the celebrated Hoeffding's Maximal Inequality, the confidence level also changes with time. However, the blow-up factor made to the confidence level in our inequality is only the logarithm of that of Hoeffding's Maximal Inequality. Therefore, if constant factors are ignored, our inequality strictly improves Hoeffding's Maximal Inequality.

The formal proof of Theorem 4 involves a few technical tricks to combine Lemma 4 and Lemma 4 to deduce the final regret bound, and is deferred to Appendix D.1. The lower bound theorem (Theorem E) that complements Theorem 4 is deferred to Appendix E due to space constraints.

## 5 Experiments

In our experiments, we assume that the arms follow independent Bernoulli distributions with different means. To guarantee a fair comparison, we vary the total number of samples $T$ and compare the empirical average aggregate regret on a logarithmic scale, averaged over independent runs. We consider three different setups:

1. (arithmetic progression I). ; , and (see Setup 1 in Figure 1).

2. (arithmetic progression II). ; (see Setup 2 in Figure 1).

3. (two-group setting). ; , and (see Setup 3 in Figure 1).

In our experiments, we fix . We notice that the choice of in our LSA is quite robust (see Appendix F.3 for experimental results). To illustrate the performance, we fix in LSA and compare it with four existing algorithms for the TBP problem under a variety of settings. Now we discuss these algorithms and their parameter settings in more details.

• Uniform: Given the budget $T$, this method pulls each arm sequentially from $1$ to $K$ until the budget is reached, such that each arm is sampled roughly $T/K$ times. Then it outputs $\theta_i \ge \theta$ when $\hat{\theta}_i \ge \theta$.

• APT(): Introduced and analyzed in Locatelli et al. (2016), this algorithm aims to output a set of arms () serving as an estimate of the set of arms with means over . The natural adaptation of the APT algorithm to our problem corresponds to changing the output: it outputs if and otherwise. In the experiments, we test the following choices of : , , , and .

• UCBE(): Introduced and analyzed in Audibert and Bubeck (2010), this algorithm aims to identify the best arm (the arm with the largest mean reward). A natural adaptation of this algorithm to TBP is for each time , it pulls where is a tuning parameter. In Audibert and Bubeck (2010), it has been proved optimal when where . Here we set and test three different choices of : , , and .

• Opt-KG($a$, $b$): Introduced in Chen et al. (2015), this algorithm also aims to minimize the aggregate regret. It models TBP as a Bayesian Markov decision process where $\theta_i$ is assumed to be drawn from a known Beta prior $\mathrm{Beta}(a, b)$. Here we choose two different priors: $\mathrm{Beta}(1, 1)$ (uniform prior) and $\mathrm{Beta}(0.5, 0.5)$ (Jeffreys prior).

##### Comparisons.

In Setup 1, which is a relatively easy setting, LSA works best among all choices of budget $T$. With the right choice of parameter, APT and Opt-KG also achieve satisfactory performance. Though the performance gaps appear to be small, two-tailed paired t-tests of aggregate regrets indicate that LSA is significantly better than most of the other methods, except APT(.05) and APT(.025) (see Table 1 in Appendix F.1).

In Setups 2 and 3, where ambiguous arms close to the threshold are present, the performance difference between LSA and other methods is more noticeable. LSA consistently outperforms other methods in both settings over almost all choices of budget $T$ with statistical significance. It is worth noting that, though APT also works reasonably well in Setup 2 when $T$ is small, the best parameter $\epsilon$ is different from that for bigger $T$ and other setups. On the other hand, the parameters chosen in LSA are fixed across all setups, indicating that our algorithm is more robust.

We perform additional experiments that due to space limitations are included in Appendix F.2. In all setups, LSA outperforms its competitors with various parameter choices.

## 6 Conclusion

In this paper we introduce an algorithm that minimizes the aggregate regret for the thresholding bandit problem. Our algorithm LSA makes use of a novel approach inspired by the optimal allocation scheme of the budget when the reward gaps are known ahead of time. When compared to APT, LSA uses an additional term, similar in spirit to the UCB-type algorithms though mathematically different, that encourages the exploration of arms that have bigger gaps and/or have not been sufficiently explored. Moreover, LSA is anytime and robust, while the precision parameter needed in the APT algorithm is highly sensitive and hard to choose. Besides showing empirically that LSA performs better than APT for different values of $\epsilon$ and than other algorithms in a variety of settings, we also employ novel proof ideas that eliminate the logarithmic terms usually brought in by the straightforward union bound argument, design the new Variable Confidence Level Bound that strictly improves the celebrated Hoeffding's Maximal Inequality, and prove that LSA achieves instance-wise asymptotically optimal aggregate regret.

## References

• Auer (2002) Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
• Audibert et al. (2010) Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), pages 41–53, 2010.
• Bubeck et al. (2009) Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory (ALT), pages 23–37, 2009.
• Gabillon et al. (2012) Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems (NIPS), pages 3212–3220, 2012.
• Karnin et al. (2013) Zohar Shay Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning (ICML), pages 1238–1246, 2013.
• Jamieson et al. (2014) Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory (COLT), pages 423–439, 2014.
• Garivier and Kaufmann (2016) Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory (COLT), pages 998–1027, 2016.
• Chen et al. (2017) Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory (COLT), pages 535–592, 2017.
• Locatelli et al. (2016) Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In International Conference on Machine Learning (ICML), pages 1690–1698, 2016.
• Steinwart et al. (2005) Ingo Steinwart, Don Hush, and Clint Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6(Feb):211–232, 2005.
• Chen et al. (2015) Xi Chen, Qihang Lin, and Dengyong Zhou. Statistical decision making for optimal budget allocation in crowd labeling. Journal of Machine Learning Research, 16(1):1–46, 2015.
• Carpentier and Locatelli (2016) Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pages 590–604, 2016.
• Lai and Robbins (1985) Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
• Agrawal (1995) Rajeev Agrawal. Sample mean based index policies by regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.
• Audibert and Bubeck (2009) Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory (COLT), 2009.
• Audibert and Bubeck (2010) Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), pages 13–p, 2010.
• Hoeffding (1963) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.
• Janson (2018) Svante Janson. Tail bounds for sums of geometric and exponential variables. Statistics & Probability Letters, 135:1–6, 2018.
• Tsybakov (2009) Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, Translated by Vladimir Zaiats.
• Lattimore and Szepesvári (2018) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, 2018.
• Matoušek and Gärtner (2006) Jiří Matoušek and Bernd Gärtner. Understanding and Using Linear Programming (Universitext). 2006.

## Appendix A Probability Tools

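[Chernoff-Hoeffding Inequality, Hoeffding (1963)] Let $X_1, \dots, X_t$ be a list of independent random variables supported on $[0,1]$ and set $X = \frac{1}{t}\sum_{i=1}^{t} X_i$. Then, for every $\epsilon > 0$, it holds that

$$\Pr\left(|X - \mathbb{E}[X]| \ge \epsilon\right) \le 2\exp\left(-2t\epsilon^2\right).$$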
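As a sanity check on the inequality above, the following minimal sketch compares the exact two-sided tail of a Bernoulli sample mean with the bound $2\exp(-2t\epsilon^2)$. It assumes the variables are i.i.d. Bernoulli($p$), which is only one special case of the statement; the function names and test parameters are hypothetical and not taken from the paper.

```python
import math

def binomial_two_sided_tail(t, p, eps):
    """Exact Pr(|X - p| >= eps), where X is the mean of t i.i.d. Bernoulli(p) draws.

    Assumption for illustration: Bernoulli rewards, so E[X] = p.
    """
    total = 0.0
    for k in range(t + 1):
        # Small slack so that boundary cases are not lost to rounding.
        if abs(k / t - p) >= eps - 1e-12:
            total += math.comb(t, k) * p**k * (1 - p) ** (t - k)
    return total

def hoeffding_bound(t, eps):
    """Right-hand side of the Chernoff-Hoeffding inequality: 2 exp(-2 t eps^2)."""
    return 2.0 * math.exp(-2.0 * t * eps**2)

# Hypothetical parameter choices (t, p, eps).
for t, p, eps in [(20, 0.5, 0.2), (50, 0.3, 0.1), (200, 0.7, 0.05)]:
    exact = binomial_two_sided_tail(t, p, eps)
    bound = hoeffding_bound(t, eps)
    print(f"t={t} p={p} eps={eps}: exact tail={exact:.4f}, bound={bound:.4f}")
```

In each case the exact tail should not exceed the bound; the gap illustrates that the inequality is distribution-free and therefore not tight for any particular $p$.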

[Restatement of Theorem 5.1(ii) in Janson (2018)] Let be a list of independent random variables such that for . And let . Then for any , it holds that

$$\Pr(X \ge \lambda\mu) \le \exp(1-\lambda).$$

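[Hoeffding's Maximal Inequality, Hoeffding (1963)] Let $X_1, \dots, X_t$ be a list of i.i.d. random variables supported on $[0,1]$ and set $\mu = \mathbb{E}[X_1]$. Then, for any $\epsilon > 0$, it holds that

$$\Pr\left(\forall i \in [t],\; X_1 + X_2 + \cdots + X_i \ge i\mu + \epsilon\right) \le \exp\left(-\frac{2\epsilon^2}{t}\right).$$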

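[Restatement of Lemma 2.6 in Tsybakov (2009)] Let $P$ and $Q$ be two probability distributions supported on some set $\mathcal{X}$. Then for every set $A \subseteq \mathcal{X}$, one has

$$\Pr_{X\sim P}(A) + \Pr_{X\sim Q}(\bar{A}) \ge \frac{1}{2}\exp\left(-D_{\mathrm{KL}}(P\|Q)\right),$$

where $\bar{A}$ denotes the complement of $A$ and $D_{\mathrm{KL}}(P\|Q)$ denotes the Kullback-Leibler divergence between $P$ and $Q$ given by

$$D_{\mathrm{KL}}(P\|Q) \stackrel{\mathrm{def}}{=} \sum_{x\in\mathcal{X}} P(x)\ln\left(\frac{P(x)}{Q(x)}\right).$$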
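The bound above can be checked by brute force on a small finite support. The sketch below enumerates every event $A$ for two hypothetical three-point distributions and compares the smallest value of $\Pr_P(A) + \Pr_Q(\bar{A})$ with $\frac{1}{2}\exp(-D_{\mathrm{KL}}(P\|Q))$. The distributions and function names are illustrative assumptions, not part of the paper.

```python
import math
from itertools import combinations

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def min_error_sum(p, q):
    """Minimum over all events A of P(A) + Q(complement of A)."""
    n = len(p)
    best = float("inf")
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            in_a = set(subset)
            p_a = sum(p[i] for i in in_a)
            q_not_a = sum(q[i] for i in range(n) if i not in in_a)
            best = min(best, p_a + q_not_a)
    return best

# Hypothetical example distributions on a three-point support.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

lhs = min_error_sum(P, Q)
rhs = 0.5 * math.exp(-kl_divergence(P, Q))
print(f"min over A of P(A) + Q(not A) = {lhs:.4f}, lower bound = {rhs:.4f}")
```

Since the inequality holds for every event, it is enough to check the minimizing event, which is what `min_error_sum` returns.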

[Restatement of Lemma 15.1 in Lattimore and Szepesvári (2018)] Let and be the reward distributions of two -armed bandits. Assuming for any arm . Fix some policy and let and be the two probability measures induced by the -round interconnection of and (respectively, and ). Then

where is the random variable denoting the number of times arm is pulled.

## Appendix B Properties of Pc

We first characterize the optimal solution to $P_c(\{\Delta_i\}_{i\in[K]}, T)$ by proving the following lemma.

If , then the optimal solution to can be expressed in the following form

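$$x_i = \max\left\{\frac{\Phi_c - \ln\Delta_i^{-1}}{c\Delta_i^2/2},\, 0\right\},$$

where $\Phi_c = \operatorname{argmax}_x \left\{x : \sum_{i=1}^{K} \max\left\{\frac{x - \ln\Delta_i^{-1}}{c\Delta_i^2/2},\, 0\right\} \le T\right\}$.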

###### Proof.

Since $\sum_{i=1}^{K} \max\left\{\frac{x - \ln\Delta_i^{-1}}{c\Delta_i^2/2},\, 0\right\}$ is an increasing continuous function of $x$, $\Phi_c$ is indeed well-defined.

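We apply the KKT conditions (see Proposition 8.7.2 in Matoušek and Gärtner (2006)) to solve the minimization problem $P_c(\{\Delta_i\}_{i\in[K]}, T)$. Concretely, applying the KKT conditions to this problem gives

$$\begin{aligned}
(-c\Delta_i^2)\exp(-c x_i \Delta_i^2) - u_i + v &= 0 && \text{for } i \in [K],\\
u_i x_i &= 0 && \text{for } i \in [K],\\
u_i &\ge 0 && \text{for } i \in [K],\\
x_i &\ge 0 && \text{for } i \in [K],\\
\sum_{i=1}^{K} x_i &= T,
\end{aligned}$$

where $u_i$ for $i \in [K]$ and $v$ are newly introduced variables. In particular, if $x_i > 0$, then $u_i = 0$ and it holds that

$$\frac{c}{2} x_i \Delta_i^2 + \ln\Delta_i^{-1} = \frac{1}{2}\ln\frac{c}{v}. \tag{15}$$

It is easy to see that the solution $x_i = \max\left\{\frac{\Phi_c - \ln\Delta_i^{-1}}{c\Delta_i^2/2},\, 0\right\}$ for $i \in [K]$ satisfies (15) whenever $x_i > 0$, with $\Phi_c = \frac{1}{2}\ln\frac{c}{v}$. Because the objective is convex and the constraints are linear, this solution is a minimum point. ∎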
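The closed form in this lemma is a water-filling rule: every arm with a positive allocation shares one level $\Phi_c$. A minimal numerical sketch follows, assuming that $P_c$ denotes the minimum of $\sum_i \exp(-c x_i \Delta_i^2)$ over $x_i \ge 0$ with $\sum_i x_i = T$ (as the KKT system suggests). The bisection routine, function names, and example gaps are hypothetical choices for illustration.

```python
import math

def allocation(phi, deltas, c):
    """x_i = max{(phi - ln(1/Delta_i)) / (c Delta_i^2 / 2), 0} for a given level phi."""
    return [max((phi - math.log(1.0 / d)) / (c * d * d / 2.0), 0.0) for d in deltas]

def solve_phi(deltas, c, T, iters=200):
    """Bisection for the largest level phi whose total allocation is at most T."""
    lo = min(math.log(1.0 / d) for d in deltas)  # total allocation is 0 at this level
    hi = lo + 1.0
    while sum(allocation(hi, deltas, c)) < T:    # widen the bracket until the budget is exceeded
        hi = lo + 2.0 * (hi - lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(allocation(mid, deltas, c)) <= T:
            lo = mid
        else:
            hi = mid
    return lo

def objective(xs, deltas, c):
    """Sum of exp(-c x_i Delta_i^2), the assumed objective of P_c."""
    return sum(math.exp(-c * x * d * d) for x, d in zip(xs, deltas))

# Hypothetical instance: four gaps, c = 1, budget T = 400.
deltas = [0.5, 0.2, 0.1, 0.05]
c, T = 1.0, 400.0

phi = solve_phi(deltas, c, T)
xs = allocation(phi, deltas, c)
uniform = [T / len(deltas)] * len(deltas)

print("Phi_c        :", round(phi, 4))
print("allocation   :", [round(x, 2) for x in xs])
print("optimal value:", objective(xs, deltas, c))
print("uniform value:", objective(uniform, deltas, c))
```

On this instance the water-filling allocation spends more pulls on arms with small gaps than a uniform split does, and its objective value should be no larger than the uniform one.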

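For any positive number $c$, let $\Psi_c$ be the solution to

$$\sum_{i=1}^{K}\left(\mathbb{I}\{x \le \ln\Delta_i^{-1}\}\cdot\exp(2x) + \mathbb{I}\{x > \ln\Delta_i^{-1}\}\cdot\frac{x - \ln\Delta_i^{-1} + c/2}{c\Delta_i^2/2}\right) = T.$$

Note that

$$\sum_{i=1}^{K}\left(\mathbb{I}\{x \le \ln\Delta_i^{-1}\}\cdot\exp(2x) + \mathbb{I}\{x > \ln\Delta_i^{-1}\}\cdot\frac{x - \ln\Delta_i^{-1} + c/2}{c\Delta_i^2/2}\right)$$

is a strictly increasing continuous function of $x$ on $[0,\infty)$ that equals $K$ when $x = 0$ and tends to infinity when $x \to \infty$. Hence $\Psi_c$ exists and is uniquely defined.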

Next, we derive the optimal solution to $P_c(\{\max\{\Delta_i, \exp(-\Psi_c)\}\}_{i\in[K]}, T)$ as follows.

If , then the optimal solution to can be expressed in the following form

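$$x_i = \mathbb{I}\{\Psi_c \le \ln\Delta_i^{-1}\}\cdot\exp(2\Psi_c) + \mathbb{I}\{\Psi_c > \ln\Delta_i^{-1}\}\cdot\frac{\Psi_c - \ln\Delta_i^{-1} + c/2}{c\Delta_i^2/2}.$$

###### Proof.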

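By Lemma B, the optimal solution to $P_c(\{\max\{\Delta_i, \exp(-\Psi_c)\}\}_{i\in[K]}, T)$ can be expressed as

$$\frac{c}{2} x_i \max\{\Delta_i, \exp(-\Psi_c)\}^2 + \ln \max\{\Delta_i, \exp(-\Psi_c)\}^{-1} = \Phi_c,$$

where

$$\Phi_c = \operatorname{argmax}_x \left\{x : \sum_{i=1}^{K} \max\left\{\frac{x - \ln \max\{\Delta_i, \exp(-\Psi_c)\}^{-1}}{c \max\{\Delta_i, \exp(-\Psi_c)\}^2/2},\, 0\right\} \le T\right\}.$$

It is easy to see that $\Phi_c = \Psi_c + c/2$, since this value makes the allocations sum to $T$ by the definition of $\Psi_c$. Therefore the optimal solution to $P_c(\{\max\{\Delta_i, \exp(-\Psi_c)\}\}_{i\in[K]}, T)$ is

$$x_i = \max\left\{\frac{\Phi_c - \ln \max\{\Delta_i, \exp(-\Psi_c)\}^{-1}}{\frac{c}{2}\max\{\Delta_i, \exp(-\Psi_c)\}^2},\, 0\right\} = \mathbb{I}\{\Psi_c \le \ln\Delta_i^{-1}\}\cdot\exp(2\Psi_c) + \mathbb{I}\{\Psi_c > \ln\Delta_i^{-1}\}\cdot\frac{\Psi_c - \ln\Delta_i^{-1} + c/2}{c\Delta_i^2/2},$$

proving this lemma. ∎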
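The two forms of the allocation in this proof can be compared numerically. The sketch below solves the defining equation of $\Psi_c$ by bisection, builds the allocation from the indicator form, and rebuilds it from the water-filling form on the capped gaps $\max\{\Delta_i, \exp(-\Psi_c)\}$ at level $\Psi_c + c/2$. It assumes $\Delta_i \le 1$ and $T \ge K$ so that a solution exists; all function names and the example instance are hypothetical.

```python
import math

def capped_total(x, deltas, c):
    """Left-hand side of the equation that defines Psi_c."""
    total = 0.0
    for d in deltas:
        gap_log = math.log(1.0 / d)
        if x <= gap_log:
            total += math.exp(2.0 * x)
        else:
            total += (x - gap_log + c / 2.0) / (c * d * d / 2.0)
    return total

def solve_psi(deltas, c, T, iters=200):
    """Bisection on [0, inf); assumes T >= K and every gap is at most 1."""
    lo, hi = 0.0, 1.0
    while capped_total(hi, deltas, c) < T:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if capped_total(mid, deltas, c) <= T:
            lo = mid
        else:
            hi = mid
    return lo

def indicator_form(psi, deltas, c):
    """Allocation written with the two indicator branches."""
    xs = []
    for d in deltas:
        gap_log = math.log(1.0 / d)
        if psi <= gap_log:
            xs.append(math.exp(2.0 * psi))
        else:
            xs.append((psi - gap_log + c / 2.0) / (c * d * d / 2.0))
    return xs

def water_filling_form(psi, deltas, c):
    """Same allocation via the capped gaps, at level Phi_c = Psi_c + c/2."""
    phi = psi + c / 2.0
    xs = []
    for d in deltas:
        d_cap = max(d, math.exp(-psi))
        xs.append(max((phi - math.log(1.0 / d_cap)) / (c * d_cap * d_cap / 2.0), 0.0))
    return xs

# Hypothetical instance: four gaps, c = 1, budget T = 400.
deltas = [0.5, 0.2, 0.1, 0.05]
c, T = 1.0, 400.0

psi = solve_psi(deltas, c, T)
xs_indicator = indicator_form(psi, deltas, c)
xs_water = water_filling_form(psi, deltas, c)
print("Psi_c:", round(psi, 4))
print("indicator form    :", [round(x, 2) for x in xs_indicator])
print("water-filling form:", [round(x, 2) for x in xs_water])
```

Arms whose gap is below $\exp(-\Psi_c)$ all receive the same allocation $\exp(2\Psi_c)$, which is the effect of capping the gaps from below.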

Using Lemma B, we derive the following useful inequality.

Suppose and let be the solution to . Then

$$\sum_{i\in S}\exp(-c x_i^* \Delta_i^2) \le \exp(c)\, P_c\left(\{\max\{\Delta_i, \exp(-\Psi_c)\}\}_{i\in S}, T\right).$$

###### Proof.

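By Lemma B, the optimal solution to $P_c(\{\max\{\Delta_i, \exp(-\Psi_c)\}\}_{i\in S}, T)$ can be expressed as

$$x_i^* = \mathbb{I}\{\Psi_c \le \ln\Delta_i^{-1}\}\cdot\exp(2\Psi_c) + \mathbb{I}\{\Psi_c > \ln\Delta_i^{-1}\}\cdot\frac{\Psi_c - \ln\Delta_i^{-1} + c/2}{c\Delta_i^2/2}. \tag{16}$$

For arms with $\Delta_i \ge \exp(-\Psi_c)$, the capped gap equals $\Delta_i$. For the remaining arms, (16) gives $x_i^* = \exp(2\Psi_c)$, so $\exp(-c x_i^* \Delta_i^2) \le 1 = \exp(c)\exp(-c x_i^* \exp(-2\Psi_c))$. Therefore, we obtain

$$\sum_{i\in S}\exp(-c x_i^* \Delta_i^2) \le \exp(c)\sum_{i\in S}\exp\left(-c x_i^* \max\{\Delta_i, \exp(-\Psi_c)\}^2\right) = \exp(c)\, P_c\left(\{\max\{\Delta_i, \exp(-\Psi_c)\}\}_{i\in S}, T\right),$$

and this lemma follows. ∎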

Finally, we show how the value of $P_c$ changes when $c$ is changed.

If , then

$$P_c(\{\Delta_i\}_{i\in S}, T) = P_{c'}(\{\Delta_i\}_{i\in S}, Tc/c').$$

###### Proof.

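We observe that for any sequence of positive numbers $x_1, \dots, x_K$,

$$\sum_{i=1}^{K}\exp(-c x_i \Delta_i^2) = \sum_{i=1}^{K}\exp\left(-c' \cdot (c x_i / c')\, \Delta_i^2\right).$$

Suppose $\{x_i^*\}$ is the optimal solution to $P_c(\{\Delta_i\}_{i\in S}, T)$. Then $\{c x_i^* / c'\}$ is a feasible solution to $P_{c'}(\{\Delta_i\}_{i\in S}, Tc/c')$, because its entries are nonnegative and sum to $Tc/c'$. Hence we obtain $P_{c'}(\{\Delta_i\}_{i\in S}, Tc/c') \le P_c(\{\Delta_i\}_{i\in S}, T)$. On the other hand, using a similar argument, we can also obtain $P_c(\{\Delta_i\}_{i\in S}, T) \le P_{c'}(\{\Delta_i\}_{i\in S}, Tc/c')$. Therefore, it holds that

$$P_c(\{\Delta_i\}_{i\in S}, T) = P_{c'}(\{\Delta_i\}_{i\in S}, Tc/c'),$$

and the lemma follows. ∎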
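The rescaling identity can be checked numerically. The sketch below assumes $P_c(\{\Delta_i\}, T)$ is the minimum of $\sum_i \exp(-c x_i \Delta_i^2)$ over nonnegative allocations summing to $T$, and evaluates it with a water-filling allocation found by bisection. The helper names and the instance are hypothetical.

```python
import math

def water_filling(deltas, c, T, iters=200):
    """Allocation x_i = max{(phi - ln(1/Delta_i)) / (c Delta_i^2 / 2), 0} summing to T."""
    def alloc(phi):
        return [max((phi - math.log(1.0 / d)) / (c * d * d / 2.0), 0.0) for d in deltas]

    lo = min(math.log(1.0 / d) for d in deltas)  # zero total allocation at this level
    hi = lo + 1.0
    while sum(alloc(hi)) < T:                    # widen the bracket until the budget is exceeded
        hi = lo + 2.0 * (hi - lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(alloc(mid)) <= T:
            lo = mid
        else:
            hi = mid
    return alloc(lo)

def p_value(deltas, c, T):
    """Value of the assumed problem P_c at its water-filling allocation."""
    xs = water_filling(deltas, c, T)
    return sum(math.exp(-c * x * d * d) for x, d in zip(xs, deltas))

# Hypothetical instance: compare c = 1 at budget T with c' = 4 at budget T c / c'.
deltas = [0.5, 0.2, 0.1, 0.05]
c, c_prime, T = 1.0, 4.0, 400.0

left = p_value(deltas, c, T)
right = p_value(deltas, c_prime, T * c / c_prime)
print("P_c(T)          =", left)
print("P_c'(T c / c')  =", right)
```

The two printed values should agree up to floating-point error, reflecting that multiplying $c$ by a factor is equivalent to multiplying the budget by the same factor.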

## Appendix C Hard Instances for the Uniform Sampling Approach

In this section, we describe a class of bad instances for the uniform sampling approach. In such instances, we show that, to achieve the same order of regret, the uniform sampling approach needs at least times more budget than the optimal policy.

We fix the threshold . For each , we construct two instances and . In , we set