1 Introduction
The stochastic Multi-Armed Bandit (MAB) problem has been extensively studied in the past decade Auer (2002); Audibert et al. (2010); Bubeck et al. (2009); Gabillon et al. (2012); Karnin et al. (2013); Jamieson et al. (2014); Garivier and Kaufmann (2016); Chen et al. (2017). In the classical framework, at each trial of the game, a learner faces a set of arms, pulls an arm, and receives an unknown stochastic reward. Of particular interest is the fixed budget setting, in which the learner is only given a limited number of total pulls. Based on the received rewards, the learner recommends the best arm, i.e., the arm with the highest mean reward. In this paper, we study a variant of the MAB problem, called the Thresholding Bandit Problem (TBP). In TBP, instead of finding the best arm, we expect the learner to identify all the arms whose mean rewards ($\theta_i$) are greater than or equal to a given threshold $\theta$. This is a very natural setting with direct real-world applications to active binary classification and anomaly detection Locatelli et al. (2016); Steinwart et al. (2005).
In this paper, we propose to study TBP under the notion of aggregate regret, which is defined as the expected number of errors after $T$ samples of the bandit game. Specifically, for a given algorithm $\mathbb{A}$ and a TBP instance $I$ with $K$ arms, if we use $e_i$ to denote the probability that the algorithm makes an incorrect decision corresponding to arm $i$ after $T$ rounds of samples, the aggregate regret is defined as $\mathcal{R}^{\mathbb{A}}(I;T) := \sum_{i=1}^{K} e_i$. In contrast, most previous works on TBP aim to minimize the simple regret, which is the probability that at least one of the arms is incorrectly labeled. Note that the definition of aggregate regret directly reflects the overall classification accuracy of the TBP algorithm, which is more meaningful than the simple regret in many real-world applications. For example, in the crowdsourced binary labeling problem, the learner faces $K$ binary classification tasks, where each task $i$ is associated with a latent true label $z_i \in \{0, 1\}$ and a latent soft-label $\theta_i \in [0, 1]$. The soft-label may be used to model the labeling difficulty/ambiguity of the task, in the sense that a $\theta_i$ fraction of the crowd will label task $i$ as $1$ and the rest will label task $i$ as $0$. The crowd is also assumed to be reliable, i.e., $z_i = 1$ if and only if $\theta_i \ge 1/2$. The goal of the crowdsourcing problem is to sequentially query a random worker from the large crowd about his/her label on a task $i$, for a budget of $T$ times, and then label the tasks with as high (expected) accuracy as possible. If we treat each binary classification task as a Bernoulli arm with mean reward $\theta_i$, then this crowdsourced problem becomes aggregate regret minimization in TBP with $\theta = 1/2$. If a few tasks are extremely ambiguous (i.e., $\theta_i \to 1/2$), the simple regret would trivially approach $1$ (i.e., every algorithm would almost always fail to correctly label all tasks). In such cases, however, a good learner may turn to accurately labeling the less ambiguous tasks and still achieve a meaningful aggregate regret.
A new challenge arising for TBP with aggregate regret is how to balance the exploration of each arm given a fixed budget. Unlike the exploration vs. exploitation trade-off in classical MAB problems, where exploration is aimed only at finding the best arm, TBP aims to maximize the classification accuracy over all arms. Let $\Delta_i := |\theta_i - \theta|$ be the hardness parameter or gap for each arm $i$. An arm with smaller $\Delta_i$ would need more samples to achieve the same classification confidence. A TBP learner faces the following dilemma: whether to allocate samples to determine the classification of one hard arm, or to use them to improve the accuracy of another, easier arm.
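The difference between the two regret notions can be illustrated numerically. The following is a minimal sketch, not part of the paper: the per-arm error probabilities are hypothetical, and the simple regret is computed under the additional assumption that per-arm errors are independent.

```python
from math import prod


def aggregate_regret(error_probs):
    """Expected number of misclassified arms: the sum of per-arm error probabilities."""
    return sum(error_probs)


def simple_regret_if_independent(error_probs):
    """Probability that at least one arm is misclassified.

    Assumption (illustration only): per-arm errors are independent.
    """
    return 1.0 - prod(1.0 - e for e in error_probs)


# Hypothetical error probabilities: nine easy arms and one extremely ambiguous arm.
errors = [0.01] * 9 + [0.5]
print("aggregate regret:", aggregate_regret(errors))
print("simple regret   :", simple_regret_if_independent(errors))
```

In this made-up instance the single ambiguous arm alone pushes the simple regret above one half, whereas the aggregate regret still rewards the nine arms that are labeled accurately.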
Related Work.
Because this paper focuses on TBP and space is limited, we do not survey the extensive literature on other MAB variants.
In a previous work Locatelli et al. (2016), the authors introduced the APT (Anytime Parameter-free Thresholding) algorithm with the goal of simple regret minimization. In this algorithm, a precision parameter $\epsilon$ is used to determine the tolerance of errors (a.k.a. the indifference zone), and the APT algorithm only attempts to correctly classify the arms with hardness gap $\Delta_i > \epsilon$. This variant goal of simple regret partly alleviates the trivialization problem mentioned previously, which is caused by the extremely hard arms. In detail, at any time $t$, APT selects the arm $i$ that minimizes $\sqrt{T_i(t)}\,\hat{\Delta}_i(t)$, where $T_i(t)$ is the number of times arm $i$ has been pulled until time $t$, $\hat{\Delta}_i(t)$ is defined as $|\hat{\theta}_i(t) - \theta| + \epsilon$, and $\hat{\theta}_i(t)$ is the empirical mean reward of arm $i$ at time $t$. In their experiments, Locatelli et al. (2016) also adapted the UCBE algorithm from Audibert et al. (2010) for the TBP problem and showed that APT performs better than UCBE.
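The APT selection rule described above can be sketched in a few lines. This is an illustrative sketch only: the function names `apt_index` and `apt_select` are hypothetical, and the index is taken to be the square root of the pull count times the empirical gap inflated by the precision parameter, as in Locatelli et al. (2016).

```python
import math


def apt_index(pulls, emp_mean, theta, eps):
    """APT index of one arm: sqrt(pulls) * (|empirical mean - threshold| + eps)."""
    return math.sqrt(pulls) * (abs(emp_mean - theta) + eps)


def apt_select(pulls, emp_means, theta, eps):
    """Return the position of the arm with the smallest APT index (ties: lowest position)."""
    indices = [apt_index(n, m, theta, eps) for n, m in zip(pulls, emp_means)]
    return min(range(len(indices)), key=indices.__getitem__)


# Two arms pulled equally often: the arm whose empirical mean is closer to the
# threshold has the smaller index and is sampled next.
next_arm = apt_select(pulls=[10, 10], emp_means=[0.9, 0.55], theta=0.5, eps=0.0)
print(next_arm)
```

A larger `eps` flattens the differences between arms that are close to the threshold, which is how the indifference zone enters the rule.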
When the goal is to minimize the aggregate regret, the APT algorithm also works better than UCBE. However, we notice that the choice of the precision parameter $\epsilon$ has a significant influence on the algorithm’s performance. A large $\epsilon$ ensures that, when the sample budget is limited, the APT algorithm is not lured by a hard arm into spending overwhelmingly many samples on it without achieving a confident label. However, when the sample budget is ample, a large $\epsilon$ would also prevent the algorithm from drawing enough samples for the arms with hardness gap $\Delta_i < \epsilon$. Theoretically, the optimal choice of this precision parameter may differ significantly across instances, and it also depends on the budget $T$. In this work, we propose an algorithm that does not require such a precision parameter and demonstrates improved robustness in practice.
Another natural approach to TBP is the uniform sampling method, where the learner plays each arm the same number of times (about $T/K$ times). In Appendix C, we show that the uniform sampling approach may need a budget that is larger by a multiplicative factor than that of the optimal algorithm to achieve the same aggregate regret.
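For concreteness, the uniform sampling baseline can be simulated as follows. This is a minimal sketch under stated assumptions: rewards are Bernoulli, arms are pulled in round-robin order, and the function name `uniform_sampling` is hypothetical.

```python
import random


def uniform_sampling(means, theta, budget, rng):
    """Pull arms in round-robin order, then label each arm by its empirical mean.

    Assumption: arm i yields Bernoulli rewards with mean means[i].
    Returns (labels, pulls), where labels[i] is True when arm i is declared >= theta.
    """
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    for t in range(budget):
        i = t % k  # round-robin: each arm receives about budget / k pulls
        sums[i] += 1.0 if rng.random() < means[i] else 0.0
        pulls[i] += 1
    labels = [pulls[i] > 0 and sums[i] / pulls[i] >= theta for i in range(k)]
    return labels, pulls


labels, pulls = uniform_sampling([1.0, 0.0, 0.55], theta=0.5, budget=300, rng=random.Random(0))
print(labels, pulls)
```

The sketch makes the weakness visible: the two easy arms receive exactly as many pulls as the ambiguous one, so much of the budget is spent where it cannot reduce the error.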
Our Results and Contributions.
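Let $\mathcal{R}^{\mathbb{A}}(I;T)$ denote the aggregate regret of an instance $I$ after $T$ time steps. Given a sequence of hardness parameters $\Delta_1, \dots, \Delta_K$, assume $\mathcal{I}_{\Delta_1,\dots,\Delta_K}$ is the class of all $K$-arm instances where the gap between $\theta_i$ of the $i$-th arm and the threshold $\theta$ is $\Delta_i$, and let
$$\mathcal{R}^{*}(\Delta_1,\dots,\Delta_K;T) := \inf_{\mathbb{A}} \sup_{I \in \mathcal{I}_{\Delta_1,\dots,\Delta_K}} \mathcal{R}^{\mathbb{A}}(I;T) \tag{1}$$
be the minimum possible aggregate regret that any algorithm can achieve among all instances with the given set of gap parameters. We say an algorithm $\mathbb{A}$ is instance-wise asymptotically optimal if for every $T$, any set of gap parameters $\Delta_1,\dots,\Delta_K$, and any instance $I \in \mathcal{I}_{\Delta_1,\dots,\Delta_K}$, it holds that
$$\mathcal{R}^{\mathbb{A}}(I;T) \le O(1) \cdot \mathcal{R}^{*}(\Delta_1,\dots,\Delta_K;\Omega(T)). \tag{2}$$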
While it may appear that a constant factor multiplied to $T$ can affect the regret if the optimal regret is an exponential function of $T$, we note that our definition aligns with the major multi-armed bandit literature (e.g., fixed-budget best arm identification Gabillon et al. (2012); Carpentier and Locatelli (2016) and thresholding bandit with simple regret Locatelli et al. (2016)). Indeed, according to our definition, if the universal optimal algorithm requires a budget of $T$ to achieve a given regret, an asymptotically optimal algorithm requires a budget of only $T$ multiplied by some constant to achieve the same order of regret. On the other hand, if one wishes to pin down the optimal constant before $T$, even for the single arm case, it boils down to the question of the optimal (and distribution dependent) constant in the exponent of existing concentration bounds such as the Chernoff Bound, Hoeffding’s Inequality, and Bernstein Inequalities, which is beyond the scope of this paper.
We address the challenges mentioned previously and introduce a simple and elegant algorithm, the Logarithmic-Sample Algorithm (LSA). LSA has a form very similar to that of the APT algorithm in Locatelli et al. (2016), but introduces an additive term proportional to the logarithm of the number of samples made to each arm in order to allocate the budget among the arms more carefully (see Line 4 of Algorithm 1). This logarithmic term arises from the optimal sample allocation scheme of an offline algorithm when the gap parameters are known beforehand. The log-sample additive term of LSA can be interpreted as an incentive that encourages sampling arms with bigger gaps and/or less explored arms. This is similar in spirit to the incentive term in the famous Upper Confidence Bound (UCB) type of algorithms that date back to (Lai and Robbins, 1985; Agrawal, 1995; Auer, 2002), although, interestingly, the mathematical forms of the two incentive terms are very different.
As the main theoretical result of this paper, we analyze the aggregate regret upper bound of LSA in Theorem 4. We complement the upper bound result with a lower bound theorem (Theorem E) for any online algorithm. In Remark 4, we compare the upper and lower bounds and show that LSA is instance-wise asymptotically optimal.
We now highlight the technical contributions made in our regret upper bound analysis at a very high level; Section 4 gives more detailed explanations. In our proof of the upper bound theorem, we first define a global class of events (in (14)) that measures how well the arms are explored. Our analysis then proceeds in two steps. In the first step, we show that such an event happens with high probability, which intuitively means that all arms are “well explored”. In the second step, we show a quantitative upper bound on the misclassification probability for each arm, conditioned on that event. The final regret bound follows by summing up the misclassification probabilities of the arms via linearity of expectation. Using this approach, we bypass any analysis that involves pairs of (or even more) arms, which usually brings in union bound arguments and extra logarithmic terms. Indeed, such slack appears between the upper and lower bounds proved in Locatelli et al. (2016). In contrast, our LSA algorithm is asymptotically optimal, without any super-constant slack.
Another technical ingredient that is crucial to the asymptotic optimality analysis is a new concentration inequality for the empirical mean of an arm that holds uniformly over all time periods, which we refer to as the Variable Confidence Level Bound. This new inequality helps to remove an extra iterated-logarithmic factor in the upper bound. Up to constant factors, it is also a strict improvement of the celebrated Hoeffding’s Maximal Inequality, and it might be useful in many other problems.
Finally, we highlight that LSA is anytime, i.e., it does not need to know the time horizon $T$ beforehand. LSA does use a universal tuning parameter. However, this parameter does not depend on the instances. As we will show in Section 5, the choice of the parameter is quite robust: a single natural parameter setting leads to superior performance of LSA across a set of very different instances, while APT may suffer from poor performance if the precision parameter is not chosen well for an instance.
Organization.
The rest of the paper is organized as follows. In Section 2 we provide the necessary notation and definitions. We then present the details of the LSA algorithm in Section 3 and upper bound its aggregate regret in Section 4. In Section 5, we present experiments establishing the empirical advantages of LSA over other algorithms. The instance-wise aggregate regret lower bound theorem is deferred to Appendix E.
2 Problem Formulation and Notation
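Given an integer $K$, we let $S = \{1, \dots, K\}$ be the set of arms in an instance $I$. Each arm $i \in S$ is associated with a distribution $\mathcal{D}_i$ supported on $[0, 1]$ which has an unknown mean $\theta_i$. We are interested in the following dynamic game setting: at any round $t$, the learner chooses to pull an arm $i_t$ from $S$ and receives an i.i.d. reward sampled from $\mathcal{D}_{i_t}$.
We let $T$, with $T \ge K$, be the time horizon, or the budget of the game, which is not necessarily known beforehand. We furthermore let $\theta$ be the threshold of the game. After $T$ rounds, the learner $\mathbb{A}$ has to determine, for every arm $i \in S$, whether or not its mean reward is greater than or equal to $\theta$. So the learner outputs a vector $(d_1, \dots, d_K) \in \{0, 1\}^K$, where $d_i = 1$ if and only if $\mathbb{A}$ decides that $\theta_i \ge \theta$. The goal of the Thresholding Bandit Problem (TBP) in this paper is to maximize the expected number of correct labels after $T$ rounds of the game.
More specifically, for any algorithm $\mathbb{A}$, we use $\mathcal{E}_i^{\mathbb{A}}(T)$ to denote the event that $\mathbb{A}$’s decision corresponding to arm $i$ is correct after $T$ rounds of the game. The goal of the TBP algorithm is to minimize the aggregate regret, which is the expected number of incorrect classifications for the $K$ arms, i.e.,
$$\mathcal{R}^{\mathbb{A}}(I;T) := \mathbb{E}\Big[\sum_{i=1}^{K} \mathbb{I}_{\{\overline{\mathcal{E}_i^{\mathbb{A}}(T)}\}}\Big] = \sum_{i=1}^{K} \Pr\big[\overline{\mathcal{E}_i^{\mathbb{A}}(T)}\big], \tag{3}$$
where $\overline{\mathcal{E}}$ denotes the complement of event $\mathcal{E}$ and $\mathbb{I}_{\{\cdot\}}$ denotes the indicator function.
Let $X_{i,j}$ denote the random variable representing the sample received by pulling arm $i$ for the $j$-th time. We further write
$$\hat{\theta}_{i,t} := \frac{1}{t}\sum_{j=1}^{t} X_{i,j} \quad\text{and}\quad \hat{\Delta}_{i,t} := |\hat{\theta}_{i,t} - \theta| \tag{4}$$
to denote the empirical mean and the empirical gap of arm $i$ after being pulled $t$ times, respectively. For a given algorithm $\mathbb{A}$, let $T_i^{\mathbb{A}}(t)$ and $\hat{\theta}_i^{\mathbb{A}}(t)$ denote the number of times arm $i$ is pulled and the empirical mean reward of arm $i$ after $t$ rounds of the game, respectively. For each arm $i \in S$, we use $\hat{\Delta}_i^{\mathbb{A}}(t) := |\hat{\theta}_i^{\mathbb{A}}(t) - \theta|$ to denote the empirical gap after $t$ rounds of the game. We will omit the reference to $\mathbb{A}$ when it is clear from the context.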
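The per-arm bookkeeping implied by this notation (pull count, empirical mean, empirical gap) can be kept in a small helper. The class below is a hypothetical illustration, not code from the paper; the name `ArmStats` and its attributes are assumptions made for this sketch.

```python
class ArmStats:
    """Running statistics for one arm: pull count, empirical mean, empirical gap."""

    def __init__(self, theta):
        self.theta = theta  # threshold of the game
        self.pulls = 0      # number of times this arm has been pulled
        self.total = 0.0    # sum of rewards received from this arm

    def update(self, reward):
        """Record one reward in [0, 1] obtained by pulling this arm."""
        self.pulls += 1
        self.total += reward

    @property
    def emp_mean(self):
        """Empirical mean reward (requires at least one pull)."""
        return self.total / self.pulls

    @property
    def emp_gap(self):
        """Empirical gap: distance between the empirical mean and the threshold."""
        return abs(self.emp_mean - self.theta)


stats = ArmStats(theta=0.5)
for reward in [1, 0, 1, 1]:
    stats.update(reward)
print(stats.pulls, stats.emp_mean, stats.emp_gap)
```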
3 Our Algorithm
We now motivate our Logarithmic-Sample Algorithm by first designing an optimal but unrealistic algorithm under the assumption that the hardness gaps $\Delta_1, \dots, \Delta_K$ are known beforehand. Suppose the algorithm pulls arm $i$ a total of $x_i$ times and makes a decision based on the empirical mean $\hat{\theta}_{i,x_i}$: if $\hat{\theta}_{i,x_i} \ge \theta$, the algorithm decides that $\theta_i \ge \theta$, and decides $\theta_i < \theta$ otherwise. Note that this is all the algorithm can do when the gaps are known. We upper bound the aggregate regret of the algorithm by
(5) 
where the last inequality follows from the Chernoff-Hoeffding Inequality (Proposition A). Now we would like to minimize the right-hand side of (5), and upper bound the aggregate regret of the optimal algorithm by
Here, for every , we define
(6) 
We naturally introduce the following continuous relaxation of the program , by defining
(7) 
well approximates , as it is straightforward to see that
(8) 
We apply the Karush-Kuhn-Tucker (KKT) conditions to the optimization problem and find that the optimal solution satisfies
(9) 
where is independent of . Furthermore, since is an increasing continuous function on , is indeed well-defined. Please refer to Lemma B of Appendix B for the details of the relevant calculations.
In light of (8) and (9), the following algorithm (still, with the unrealistic assumption of the knowledge of the gaps ) incrementally solves and approximates the algorithm – at each time , the algorithm selects the arm that minimizes and plays it.
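The offline allocation idea can be made concrete with a small numerical sketch. It is an illustration under explicit assumptions, not the paper's exact program: we assume the surrogate objective is a sum of Hoeffding-style terms $\exp(-c\,x_i\Delta_i^2)$ with a constant $c$ (the default `c=2.0` is our choice), relax the pull counts to nonnegative reals, and find the KKT multiplier by bisection. The function name `offline_allocation` is hypothetical.

```python
import math


def offline_allocation(gaps, budget, c=2.0, iters=200):
    """Minimize sum_i exp(-c * x_i * gap_i**2) s.t. sum_i x_i = budget, x_i >= 0.

    KKT stationarity gives x_i = max(0, (ln(c * gap_i**2) - log_lam) / (c * gap_i**2))
    for a multiplier shared by all arms; log_lam is found by bisection.
    Assumes all gaps are strictly positive.
    """
    a = [c * g * g for g in gaps]

    def alloc(log_lam):
        return [max(0.0, (math.log(ai) - log_lam) / ai) for ai in a]

    hi = max(math.log(ai) for ai in a)  # every arm gets 0 pulls at this multiplier
    lo = hi - budget * max(a)           # total allocation is at least budget here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(alloc(mid)) > budget:
            lo = mid                    # allocating too much: raise the multiplier
        else:
            hi = mid
    return alloc(0.5 * (lo + hi))


x = offline_allocation([0.1, 0.2, 0.4], budget=1000)
print([round(v, 1) for v in x])
```

With an ample budget the harder arms receive more pulls, while with a tiny budget an extremely hard arm can receive none at all, for example `offline_allocation([0.01, 0.5], budget=10)`. This mirrors the dilemma between hard and easy arms discussed in the introduction.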
Our proposed algorithm is very close to this idealized rule. Since in reality the algorithm does not have access to the precise gap quantities, we use the empirical estimates $\hat{\Delta}_i$ instead of $\Delta_i$ in the $T_i \Delta_i^2$ term. For the logarithmic term, if we also used $\hat{\Delta}_i$ instead of $\Delta_i$, we may encounter extremely small empirical estimates when the arm is not sufficiently sampled, which would lead to an unbounded value of the logarithmic term and render the arm almost impossible to sample in the future. To solve this problem, we note that the idealized rule tries to keep $T_i \Delta_i^2$ roughly the same across the arms (if ignoring the logarithmic term). In light of this, we use $T_i$ to roughly estimate the order of $\Delta_i^{-2}$. This encourages the exploration of both the arms with larger gaps and the ones with fewer trials.
To summarize, at each time $t$, our algorithm selects the arm $i$ that minimizes $\alpha\, T_i(t-1)\,(\hat{\Delta}_i(t-1))^2 + 0.5 \ln T_i(t-1)$, where $\alpha > 0$ is a universal tuning parameter, and plays the arm. The details of the algorithm are presented in Algorithm 1.
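A minimal simulation of this selection rule is sketched below. It is not the authors' implementation: the index is assumed to be `alpha * pulls * gap**2 + log_coef * ln(pulls)`, where the coefficient `log_coef=0.5` and the default `alpha=1.0` are placeholders chosen for illustration, rewards are assumed Bernoulli, and each arm is pulled once before the index is used so that the logarithm is defined.

```python
import math
import random


def lsa_index(pulls, emp_mean, theta, alpha, log_coef=0.5):
    """Assumed LSA-style index: alpha * pulls * gap^2 + log_coef * ln(pulls)."""
    gap = abs(emp_mean - theta)
    return alpha * pulls * gap * gap + log_coef * math.log(pulls)


def run_lsa(means, theta, budget, alpha=1.0, seed=0):
    """Simulate the rule on Bernoulli arms (assumes budget >= number of arms)."""
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k

    def pull(i):
        sums[i] += 1.0 if rng.random() < means[i] else 0.0
        pulls[i] += 1

    for i in range(k):  # initialization: one pull per arm
        pull(i)
    for _ in range(budget - k):
        # Play the arm with the smallest index (ties: lowest position).
        i = min(range(k),
                key=lambda j: lsa_index(pulls[j], sums[j] / pulls[j], theta, alpha))
        pull(i)
    labels = [sums[i] / pulls[i] >= theta for i in range(k)]
    return labels, pulls


labels, pulls = run_lsa([1.0, 0.0, 1.0], theta=0.5, budget=60)
print(labels, pulls)
```

Because the rule never refers to the horizon, the loop could be stopped at any round and the current labels returned, which is the anytime property mentioned in the introduction.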
4 Regret Upper Bound for LSA
In this section, we show the upper bound of the aggregate regret of Algorithm 1.
Let be the solution to the following equation
(10) 
Notice that is a strictly increasing, continuous function with that becomes when and goes to infinity when . Hence is guaranteed to exist and is uniquely defined when is large. Furthermore, for any , we let
(11) 
We note that is the optimal solution to . Please refer to Lemma B of Appendix B for the detailed calculations.
The goal of this section is to prove the following theorem.
Let be the aggregate regret incurred by Algorithm 1. When , and , we have
(12) 
where is a constant that only depends on the universal tuning parameter .
If we set , then the right-hand side of (12) would be at most . One can verify that
where the first inequality is due to Lemma B of Appendix B and the equality is because of Lemma B of Appendix B. This matches the lower bound demonstrated in Theorem E up to constant factors.
Footnote 1: While the constants may seem large, we emphasize that (i) we make no effort to optimize the constants in the asymptotic bounds, (ii) most of the constants come from the lower bound, while the constant factor in our upper bound is , and (iii) we believe that the actual constant of our algorithm is quite small, as the experimental evaluation in the later section demonstrates that our algorithm performs very well in practice.
The rest of this section is devoted to the proof of Theorem 4. Before proceeding, we note that the analysis of the APT algorithm Locatelli et al. (2016) crucially depends on a favorable event stating that the empirical mean of any arm at any time does not deviate too much from the true mean. This requires a union bound that introduces extra factors such as and . Our analysis adopts a novel approach that does not need a union bound over all arms, and hence avoids the extra factor. In the second step of our analysis, we introduce the new Variable Confidence Level Bound to save the extra doubly logarithmic term in .
Now we dive into details of the proof. Let . Intuitively, contains the arms that can be well classified by the ideal algorithm (described in Section 3), while even the ideal algorithm suffers regret for each arm in . In light of this, the key of the proof is to upper bound the regret incurred by the arms in .
Let denote the regret incurred by arms in . Note that for every arm , and the regret incurred by each arm is at most . Therefore, to establish (12), we only need to show that
(13) 
We set up a few notations to facilitate the proof of (13). We define to be the expression inside the operator in Line 4 of the algorithm, for arm and at time . We also define .
Intuitively, when is large, we usually have a larger value for , and arm is better explored. Therefore, can be used as a measurement of how well arm is explored, which directly relates to the misclassification probability for classifying the arm. We say that arm is well explored at time if there exists such that . For any , we also define the event to be
(14) 
When happens, we know that all arms are well explored.
At a higher level, the proof of (13) goes by two steps. First, we show that for that is almost as large as , happens with high probability, which means that every arm is well explored. Second, we quantitatively relate that being well explored and the misclassification probability for classifying each arm, which can be used to further deduce a regret upper bound given the event .
We start by revealing more details about the first step. The following Lemma 4 gives a lower bound on the probability of the event .
for .
We now introduce the high-level ideas for proving Lemma 4 and defer the formal proofs to Appendix D.2. For any arm and , let be the random variable representing the smallest positive integer such that (i.e., for all ). Intuitively, denotes the first time arm is well explored. We first show that the distribution of has an exponential tail. Hence, the sum of them with the same also has an exponential tail. Next, we show that with high probability and the probability vanishes exponentially as increases. In the last step, thanks to the design of the algorithm, we are able to argue that implies .
We now proceed to the second step of the proof of (13). The following lemma (whose proof is deferred to Appendix D.3) gives an upper bound of regret incurred by arms in conditioned on .
If , then conditioned on ,
As mentioned before, the key to proving Lemma 4 is to pin down the quantitative relation between the event and the probability of misclassifying an arm conditioned on , then the expected regret upper bound can be achieved by summing up the misclassifying probability for all arms in .
A key technical challenge in our analysis is to design a concentration bound for the empirical mean of an arm (namely arm ) that uniformly holds over all time periods. A typical method is to let the length of the confidence band scale linearly with , where is the number of samples made for the arm. However, this would worsen the failure probability, and lead to an extra factor in the regret upper bound. To reduce the iterated logarithmic factor, we introduce a novel uniform concentration bound where the ratio between the length of the confidence band and is almost constant for large , but becomes larger for smaller . Since this ratio is related to the confidence level of the corresponding confidence band, we refer to this new concentration inequality as the Variable Confidence Level Bound. More specifically, in Appendix D.3.1, we prove the following lemma.
Lemma D.3.1 (Variable Confidence Level Bound, prestated) Let be i.i.d. random variables supported on with mean . For any and , it holds that
This new inequality greatly helps the analysis of our algorithm, where the intuition is that when conditioned on the event , it is much less likely that fewer number of samples are conducted for arm , and therefore we can afford a less accurate (i.e. bigger) confidence band for its mean value.
It is notable that a similar idea is also adopted in the analysis of the MOSS algorithm Audibert and Bubeck (2009), which gives the asymptotically optimal regret bound for the ordinary multi-armed bandits. However, our Variable Confidence Level Bound is more general and may be useful in other applications. We additionally remark that in the celebrated Hoeffding’s Maximal Inequality, the confidence level also changes with time. However, the blow-up factor applied to the confidence level in our inequality is only the logarithm of that in Hoeffding’s Maximal Inequality. Therefore, if constant factors are ignored, our inequality strictly improves Hoeffding’s Maximal Inequality.
5 Experiments
In our experiments, we assume that the arms follow independent Bernoulli distributions with different means. To guarantee a fair comparison, we vary the total number of samples $T$ and compare the empirical average aggregate regret on a logarithmic scale, averaged over independent runs. We consider three different choices of the arm means, referred to below as Setups 1 to 3.
In our experiments, we fix . We notice that the choice of in our LSA is quite robust (see Appendix F.3 for experimental results). To illustrate the performance, we fix in LSA and compare it with four existing algorithms for the TBP problem under a variety of settings. We now discuss these algorithms and their parameter settings in more detail.

Uniform: Given the budget $T$, this method pulls each arm sequentially from $1$ to $K$ until the budget is reached, so that each arm is sampled roughly $T/K$ times. Then it outputs $\theta_i \ge \theta$ when $\hat{\theta}_i \ge \theta$.

APT(): Introduced and analyzed in Locatelli et al. (2016), this algorithm aims to output a set of arms () serving as an estimate of the set of arms with means over . The natural adaptation of the APT algorithm to our problem corresponds to changing the output: it outputs if and otherwise. In the experiments, we test the following choices of : , , , and .

UCBE(): Introduced and analyzed in Audibert and Bubeck (2010), this algorithm aims to identify the best arm (the arm with the largest mean reward). A natural adaptation of this algorithm to TBP is for each time , it pulls where is a tuning parameter. In Audibert and Bubeck (2010), it has been proved optimal when where . Here we set and test three different choices of : , , and .

OptKG($a$, $b$): Introduced in Chen et al. (2015), this algorithm also aims to minimize the aggregate regret. It models TBP as a Bayesian Markov decision process where $\theta_i$ is assumed to be drawn from a known Beta prior $\mathrm{Beta}(a, b)$. Here we choose two different priors: $\mathrm{Beta}(1, 1)$ (uniform prior) and $\mathrm{Beta}(1/2, 1/2)$ (Jeffreys prior).
Comparisons.
In Setup 1, which is a relatively easy setting, LSA works best among all choices of budget $T$. With the right choice of parameter, APT and OptKG also achieve satisfactory performance. Though the performance gaps appear to be small, two-tailed paired t-tests of aggregate regrets indicate that LSA is significantly better than most of the other methods, except APT(.05) and APT(.025) (see Table 1 in Appendix F.1).
In Setups 2 and 3, where ambiguous arms close to the threshold are present, the performance difference between LSA and other methods is more noticeable. LSA consistently outperforms other methods in both settings over almost all choices of budget with statistical significance. It is worth noting that, though APT also works reasonably well in Setup 2 when $T$ is small, the best parameter $\epsilon$ differs from that for bigger $T$ and for the other setups. On the other hand, the parameters chosen in LSA are fixed across all setups, indicating that our algorithm is more robust.
We perform additional experiments that due to space limitations are included in Appendix F.2. In all setups, LSA outperforms its competitors with various parameter choices.
6 Conclusion
In this paper we introduce an algorithm that minimizes the aggregate regret for the thresholding bandit problem. Our algorithm LSA makes use of a novel approach inspired by the optimal allocation scheme of the budget when the reward gaps are known ahead of time. Compared to APT, LSA uses an additional term, similar in spirit to the UCB-type algorithms though mathematically different, that encourages the exploration of arms that have bigger gaps and/or have not been sufficiently explored. Moreover, LSA is anytime and robust, while the precision parameter needed in the APT algorithm is highly sensitive and hard to choose. Besides showing empirically that LSA performs better than APT for different values of $\epsilon$, and better than other algorithms in a variety of settings, we also employ novel proof ideas that eliminate the logarithmic terms usually brought in by the straightforward union bound argument, design the new Variable Confidence Level Bound that strictly improves the celebrated Hoeffding’s Maximal Inequality, and prove that LSA achieves instance-wise asymptotically optimal aggregate regret.
References

 Auer (2002) Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
 Audibert et al. (2010) Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), pages 41–53, 2010.
 Bubeck et al. (2009) Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory (ALT), pages 23–37, 2009.
 Gabillon et al. (2012) Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems (NIPS), pages 3212–3220, 2012.
 Karnin et al. (2013) Zohar Shay Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning (ICML), pages 1238–1246, 2013.
 Jamieson et al. (2014) Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory (COLT), pages 423–439, 2014.
 Garivier and Kaufmann (2016) Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory (COLT), pages 998–1027, 2016.
 Chen et al. (2017) Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory (COLT), pages 535–592, 2017.
 Locatelli et al. (2016) Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In International Conference on Machine Learning (ICML), pages 1690–1698, 2016.
 Steinwart et al. (2005) Ingo Steinwart, Don Hush, and Clint Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6(Feb):211–232, 2005.
 Chen et al. (2015) Xi Chen, Qihang Lin, and Dengyong Zhou. Statistical decision making for optimal budget allocation in crowd labeling. Journal of Machine Learning Research, 16(1):1–46, 2015.
 Carpentier and Locatelli (2016) Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory (COLT), pages 590–604, 2016.
 Lai and Robbins (1985) Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
 Agrawal (1995) Rajeev Agrawal. Sample mean based index policies by $O(\log n)$ regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.
 Audibert and Bubeck (2009) Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory (COLT), 2009.
 Audibert and Bubeck (2010) Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), pages 13–p, 2010.
 Hoeffding (1963) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
 Janson (2018) Svante Janson. Tail bounds for sums of geometric and exponential variables. Statistics & Probability Letters, 135:1–6, 2018.
 Tsybakov (2009) Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.
 Lattimore and Szepesvári (2018) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Preprint, 2018.
 Matoušek and Gärtner (2006) Jiří Matoušek and Bernd Gärtner. Understanding and Using Linear Programming (Universitext). 2006.
Appendix A Probability Tools
[Chernoff-Hoeffding Inequality, Hoeffding (1963)] Let be a list of independent random variables supported on and set . Then, for every , it holds that
[Restatement of Theorem 5.1(ii) in Janson (2018)] Let be a list of independent random variables such that for . And let . Then for any , it holds that
[Hoeffding’s Maximal Inequality Hoeffding (1963)] Let be a list of i.i.d. random variables supported on and set . Then, for any , it holds that
[Restatement of Lemma 2.6 in Tsybakov (2009)] Let and be two probability distributions supported on some set . Then for every set , one has
where denotes the complement of and denotes the Kullback-Leibler divergence between and given by
[Restatement of Lemma 15.1 in Lattimore and Szepesvári (2018)] Let and be the reward distributions of two armed bandits. Assuming for any arm . Fix some policy and let and be the two probability measures induced by the round interconnection of and (respectively, and ). Then
where is the random variable denoting the number of times arm is pulled.
Appendix B Properties of
We first show the optimal solution to by proving the following lemma.
If , then the optimal solution to can be expressed in the following form
where .
Proof.
Since is an increasing continuous function on , is indeed well-defined.
We apply the KKT conditions (see Proposition 8.7.2 in Matoušek and Gärtner (2006)) to solve the minimization problem . Concretely, applying the KKT conditions to gives
where for and are newly introduced variables. In particular, if , then and it holds that
(15) 
It is easy to see the solution for satisfies (15) and is a minimum point. ∎
For any positive number , let be the solution to
Note that
is a strictly increasing continuous function on that equals when and tends to infinity when . Hence exists and is uniquely defined.
Then we derive the optimal solution to , as follows.
If , then the optimal solution to can be expressed in the following form
Proof.
By Lemma B, the optimal solution to can be expressed as
where
It is easy to see that . Therefore the optimal solution to is
proving this lemma. ∎
Using Lemma B, we derive the following useful inequality.
Suppose and let be the solution to . Then
Proof.
By Lemma B, the optimal solution to can be expressed as
(16) 
Therefore, we obtain
and this lemma follows. ∎
Finally, we will show how the value of will change when is changed.
If , then
Proof.
We observe that for any sequence of positive numbers ,
Suppose is the optimal solution to . Then is a feasible solution to . Hence we obtain
On the other hand, using a similar argument, we can also obtain
Therefore, it holds that
and the lemma follows. ∎
Appendix C Hard Instances for the Uniform Sampling Approach
In this section, we describe a class of bad instances for the uniform sampling approach. In such instances, we show that, to achieve the same order of regret, the uniform sampling approach needs at least times more budget than the optimal policy.
We fix the threshold . For each , we construct two instances and . In , we set