A Bandit Approach to Multiple Testing with False Discovery Control

by Kevin Jamieson, et al.
University of Washington

We propose an adaptive sampling approach for multiple testing which aims to maximize statistical power while ensuring anytime false discovery control. We consider n distributions whose means are partitioned by whether they are below or equal to a baseline (nulls), versus above the baseline (actual positives). In addition, each distribution can be sequentially and repeatedly sampled. Inspired by the multi-armed bandit literature, we provide an algorithm that takes as few samples as possible to exceed a target true positive proportion (i.e. proportion of actual positives discovered) while giving anytime control of the false discovery proportion (nulls predicted as actual positives). Our sample complexity results match known information theoretic lower bounds and through simulations we show a substantial performance improvement over uniform sampling and an adaptive elimination style algorithm. Given the simplicity of the approach, and its sample efficiency, the method has promise for wide adoption in the biological sciences, clinical testing for drug discovery, and online A/B/n testing problems.




1 Introduction

Consider $n$ possible treatments, say, drugs in a clinical trial, where each treatment either has a positive expected effect relative to a baseline (actual positive), or no difference (null), with a goal of identifying as many actual positive treatments as possible. If evaluating the $i$-th treatment results in a noisy outcome (e.g. due to variance in the actual measurement or just diversity in the population) then given a total measurement budget of $B$, it is standard practice to execute and average $B/n$ measurements of each treatment, and then output a set of predicted actual positives based on the measured effect sizes. False alarms (i.e. nulls predicted as actual positives) are controlled by either controlling the family-wise error rate (FWER), where one bounds the probability that at least one of the predictions is null, or the false discovery rate (FDR), where one bounds the expected proportion of the number of predicted nulls to the number of predictions. FDR is a weaker condition than FWER but is often used in favor of FWER because of its higher statistical power: more actual positives are output as predictions using the same measurements.

In the pursuit of even greater statistical power, there has recently been increased interest in the biological sciences in rejecting the uniform allocation of trials to treatments in favor of an adaptive allocation. Adaptive allocations partition the budget into sequential rounds of measurements in which the measurements taken at one round inform which measurements are taken in the next [1, 2]. Intuitively, if the effect size is relatively large for some treatment, fewer trials will be necessary to identify that treatment as an actual positive relative to the others, and those saved measurements can be allocated towards treatments with smaller effect sizes to boost the signal. However, both [1, 2] employed ad-hoc heuristics which may not only have sub-optimal statistical power, but may even result in more false alarms than expected. As another example, in the domain of A/B/n testing in online environments, the desire to understand and maximize click-through-rate across treatments (e.g., web-layouts, campaigns, etc.) has become ubiquitous across retail, social media, and headline optimization for the news. In this domain, the desire for statistically rigorous adaptive sampling methods with high statistical power is explicit.


In this paper we propose an adaptive measurement allocation scheme that achieves near-optimal statistical power subject to FWER or FDR false alarm control. Perhaps surprisingly, we show that even if the treatment effect sizes of the actual positives are identical, adaptive measurement allocation can still substantially improve statistical power. That is, more actual positives can be predicted using an adaptive allocation relative to the uniform allocation under the same false alarm control.

1.1 Problem Statement

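Consider $n$ distributions (or arms) and a game where at each time $t$, the player chooses an arm $i \in [n] := \{1, \dots, n\}$ and immediately observes a reward $X_{i,t} \sim \nu_i$, drawn i.i.d., where $X_{i,t} \in [0,1]$ and $\mathbb{E}[X_{i,t}] = \mu_i$. (Footnote 1: All results apply without modification to unbounded, sub-Gaussian random variables.) For a known threshold $\mu_0$, define the sets

$$H_1 = \{i \in [n] : \mu_i > \mu_0\} \quad \text{and} \quad H_0 = \{i \in [n] : \mu_i = \mu_0\} = [n] \setminus H_1.$$

(Footnote 2: All results generalize to the case when $H_0 = \{i \in [n] : \mu_i \le \mu_0\}$.)

The values of the means $\mu_i$ for $i \in [n]$ and the cardinality of $H_1$ are unknown. The arms (treatments) in $H_1$ have means greater than $\mu_0$ (positive effect) while those in $H_0$ have means equal to $\mu_0$ (no effect over baseline). At each time $t$, after the player plays an arm, she also outputs a set of indices $S_t \subseteq [n]$ that are interpreted as discoveries, or rejections of the null-hypothesis (that is, if $i \in S_t$ then the player believes $i \in H_1$). For as small a $\tau \in \mathbb{N}$ as possible, the goal is to have the number of true detections $|S_t \cap H_1|$ be approximately $|H_1|$ for all $t \ge \tau$, subject to the number of false alarms $|S_t \cap H_0|$ being small uniformly over all times $t \in \mathbb{N}$. We now formally define our notions of false alarm control and true discoveries.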

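Definition 1 (False Discovery Rate, FDR-$\delta$).

Fix some $\delta \in (0,1)$. We say an algorithm is FDR-$\delta$ if for all possible problem instances $(\{\mu_i\}_{i=1}^n, \mu_0)$ it satisfies $\mathbb{E}\left[\frac{|S_t \cap H_0|}{|S_t| \vee 1}\right] \le \delta$ for all $t \in \mathbb{N}$ simultaneously.

Definition 2 (Family-wise Error Rate, FWER-$\delta$).

Fix some $\delta \in (0,1)$. We say an algorithm is FWER-$\delta$ if for all possible problem instances $(\{\mu_i\}_{i=1}^n, \mu_0)$ it satisfies $\mathbb{P}\left(\bigcup_{t=1}^{\infty} \{S_t \cap H_0 \neq \emptyset\}\right) \le \delta$.

Note FWER-$\delta$ implies FDR-$\delta$, the former being a stronger condition than the latter. Allowing a relatively small number of false discoveries is natural, especially if $|H_1|$ is relatively large. Because $\mu_0$ is known, there exist schemes that guarantee FDR-$\delta$ or FWER-$\delta$ even if the arm means $\mu_i$ and the cardinality of $H_1$ are unknown (see Section 2.1). It is also natural to relax the goal of identifying all arms in $H_1$ to simply identifying a large proportion of them.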

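Definition 3 (True Positive Rate, TPR-$\delta,\tau$).

Fix some $\delta \in (0,1)$. We say an algorithm is TPR-$\delta,\tau$ on an instance $(\{\mu_i\}_{i=1}^n, \mu_0)$ if $\mathbb{E}\left[\frac{|S_t \cap H_1|}{|H_1|}\right] \ge 1 - \delta$ for all $t \ge \tau$.

Definition 4 (Family-wise Probability of Detection, FWPD-$\delta,\tau$).

Fix some $\delta \in (0,1)$. We say an algorithm is FWPD-$\delta,\tau$ on an instance $(\{\mu_i\}_{i=1}^n, \mu_0)$ if $\mathbb{P}(H_1 \subseteq S_t) \ge 1 - \delta$ for all $t \ge \tau$.

Note that FWPD-$\delta,\tau$ implies TPR-$\delta,\tau$, the former being a stronger condition than the latter. Also note that $\mathbb{P}(H_1 \subseteq S_t) \ge 1 - \delta$ and $\mathbb{P}(S_t \cap H_0 = \emptyset) \ge 1 - \delta$ together imply $\mathbb{P}(S_t = H_1) \ge 1 - 2\delta$. We will see that it is possible to control the number of false discoveries regardless of how the player selects arms to play. It is the rate at which $S_t$ includes $H_1$ that can be thought of as the statistical power of the algorithm, which we formalize as its sample complexity: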

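Definition 5 (Sample Complexity).

Fix some $\delta \in (0,1)$ and an algorithm $\mathcal{A}$ that is FDR-$\delta$ (or FWER-$\delta$) over all possible problem instances. Fix a particular problem instance $(\{\mu_i\}_{i=1}^n, \mu_0)$. At each time $t \in \mathbb{N}$, $\mathcal{A}$ chooses an arm $i \in [n]$ to obtain an observation from, and before proceeding to the next round outputs a set $S_t \subseteq [n]$. The sample complexity of $\mathcal{A}$ on this instance is the smallest time $\tau \in \mathbb{N}$ such that $\mathcal{A}$ is TPR-$\delta,\tau$ (or FWPD-$\delta,\tau$).

The sample complexity and value of $\tau$ of an algorithm will depend on the particular instance $(\{\mu_i\}_{i=1}^n, \mu_0)$. For example, if $H_1 = \{i \in [n] : \mu_i = \mu_0 + \Delta\}$ and $H_0 = \{i \in [n] : \mu_i = \mu_0\}$, then we expect the sample complexity to increase as $\Delta$ decreases, since on the order of $\Delta^{-2}$ samples are necessary to determine whether an arm has mean $\mu_0$ versus $\mu_0 + \Delta$. The next section will give explicit cases.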
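As a concrete reading of the quantities behind these definitions, the following minimal sketch computes the false discovery proportion and the true positive proportion of a single discovery set. The function names and the choice to represent the null set, the positive set, and the discovery set as Python sets of arm indices are illustrative assumptions, not notation from the paper; FDR and TPR are the expectations of these two proportions.

```python
def false_discovery_proportion(discoveries, nulls):
    """Fraction of discoveries that are nulls; taken to be 0 for an empty discovery set."""
    return len(discoveries & nulls) / max(len(discoveries), 1)


def true_positive_proportion(discoveries, positives):
    """Fraction of actual positives that were discovered (assumes positives is non-empty)."""
    return len(discoveries & positives) / len(positives)


# Hypothetical instance: arms 0-2 are actual positives, arms 3-4 are nulls.
positives, nulls = {0, 1, 2}, {3, 4}
discoveries = {0, 1, 3}
fdp = false_discovery_proportion(discoveries, nulls)    # 1 of the 3 discoveries is a null
tpp = true_positive_proportion(discoveries, positives)  # 2 of the 3 positives were found
```

Dividing by `max(len(discoveries), 1)` mirrors the usual convention that an empty discovery set makes no false discoveries.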

Remark 1 (Impossibility of stopping time).

We emphasize that just as in the non-adaptive setting, at no time can an algorithm stop and declare that it is TPR-$\delta,\tau$ or FWPD-$\delta,\tau$ for any finite $\tau$. This is because there may be an arm in $H_1$ with a mean infinitesimally close to $\mu_0$ but distinct, such that no algorithm can determine whether it is in $H_0$ or $H_1$. Thus, the algorithm must run indefinitely or until it is stopped externally. However, using an anytime confidence bound (see Section 2) one can always make statements like "either $H_1 \subseteq S_t$, or $\max_{i \in H_1 \setminus S_t} \mu_i - \mu_0 \le \epsilon$" where the $\epsilon$ will depend on the width of the confidence interval.

1.2 Contributions and Informal Summary of Main Results

                                            False alarm control
                                            FDR          FWER
Detection probability      TPR              Theorem 1    Theorem 4
                           FWPD             Theorem 2    Theorem 3
Table 1: Informal summary of sample complexity results proved in this paper for , constant (e.g., ) and . Uniform sampling across all settings requires at least samples, and in the FWER+FWPD setting requires . Constants and factors are ignored.

In Section 2 we propose an algorithm that handles all four combinations of {FDR-$\delta$, FWER-$\delta$} and {TPR-$\delta,\tau$, FWPD-$\delta,\tau$}. A reader familiar with the multi-armed bandit literature would expect an adaptive sampling algorithm to have a large advantage over uniform sampling when there is a large diversity in the means of $H_1$, since larger means can be distinguished from $\mu_0$ with fewer samples. However, one should note that to declare all of $H_1$ as discoveries, one must sample every arm in $H_0$ at least as many times as the most sampled arm in $H_1$; otherwise they are statistically indistinguishable. As discoveries typically uncover rare phenomena, it is common to assume $|H_1| = n^{\beta}$ for $\beta \in (0,1)$ [4, 5], or $|H_1| = o(n)$, but this implies that the number of samples taken from the arms in $H_1$, regardless of how samples are allocated to those arms, will almost always be dwarfed by the number of samples allocated to the arms in $H_0$ since there are $n - o(n)$ of them. This line of reasoning, in part, is what motivates us to give our sample complexity results in terms of the quantities that best describe the contributions from the arms in $H_0$, namely, the cardinality $|H_1|$, the confidence parameter $\delta$, and the gap $\Delta$ between the means of the arms in $H_0$ and the smallest mean in $H_1$. Reporting sample complexity results in terms of $\Delta$ also allows us to compare to known lower bounds in the literature [6, 4, 7, 8]. Nevertheless, we do address the case where the means of $H_1$ are varied in Theorem 1.

An informal summary of the sample complexity results proven in this work is found in Table 1. For the least strict setting of FDR+TPR, the upper-left quadrant of Table 1 matches the lower bound of [4]. In this FDR+TPR setting (which requires the fewest samples of the four settings), uniform sampling, which pulls each arm an equal number of times, has a sample complexity lower bound (see Theorem 6 in Appendix G) that exceeds all results in Table 1, demonstrating the statistical power gained by adaptive sampling. For the most strict setting of FWER+FWPD, the lower-right quadrant of Table 1 matches the lower bounds of [7, 9, 8]. Uniform sampling in the FWER+FWPD setting has its own sample complexity lower bound (see Theorem 7 in Appendix G). The settings of FDR+FWPD and FWER+TPR are sandwiched between these results, and we are unaware of existing lower bounds for these settings.

All the results in Table 1 are novel, and to the best of our knowledge are the first non-trivial sample complexity results for an adaptive algorithm in the fixed confidence setting, where a desired confidence is set and the algorithm attempts to minimize the number of samples taken to meet the desired conditions. We also derive tools that we believe may be useful outside this work: for always valid $p$-values (cf. [3, 10]) we show that FDR is controlled for all times using the Benjamini-Hochberg procedure [11] (see Lemma 1), and we also provide an anytime high probability bound on the false discovery proportion (see Lemma 2).

Finally, as a direct consequence of the theoretical guarantees proven in this work and the empirical performance of the FDR+TPR variant of the algorithm on real data, an algorithm faithful to the theory was implemented and is in use in production at a leading A/B testing platform.

1.3 Related work

Identifying arms with means above a threshold, or equivalently, multiple testing via rejecting null-hypotheses with small $p$-values, is a ubiquitous problem in the biological sciences. In the standard setup, each arm is given an equal number of measurements (i.e., a uniform sampling strategy), a $p$-value $P_i$ is produced for each arm where $\mathbb{P}(P_i \le x) \le x$ for all $x \in (0,1]$ and $i \in H_0$, and a procedure is then run on these $p$-values to declare small $p$-values as rejections of the null-hypothesis, or discoveries. For a set of $p$-values $P_1, \dots, P_n$, the so-called Bonferroni selection rule selects $S_{BF} = \{i : P_i \le \delta/n\}$. The fact that FWER control implies FDR control suggests that greater statistical power (i.e. more discoveries) could be achieved with procedures designed specifically for FDR. The BH procedure [11] is one such procedure to control FDR and is widely used in practice (with its many extensions [6] and performance investigations [5]). Recall that a uniform measurement strategy, where every arm is sampled the same number of times, has sample complexity lower bounds in both the FDR+TPR and the FWER+FWPD settings (Theorems 6 and 7 in Appendix G), which can be substantially worse than our adaptive procedure (see Table 1).
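The two classical selection rules mentioned above can be sketched in a few lines. This is a textbook rendering on a fixed list of $p$-values, not the anytime version used by the paper; the function names and the example $p$-values are hypothetical.

```python
def bonferroni_select(p_values, delta):
    """Select every index whose p-value is at most delta / n (controls FWER)."""
    n = len(p_values)
    return {i for i, p in enumerate(p_values) if p <= delta / n}


def benjamini_hochberg_select(p_values, delta):
    """Select the k_hat smallest p-values, where k_hat is the largest rank k
    whose k-th smallest p-value is at most delta * k / n (controls FDR)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k_hat = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= delta * rank / n:
            k_hat = rank
    return set(order[:k_hat])


# Hypothetical p-values for four arms.
p_values = [0.001, 0.02, 0.03, 0.8]
bonferroni = bonferroni_select(p_values, delta=0.05)      # threshold 0.0125
bh = benjamini_hochberg_select(p_values, delta=0.05)      # thresholds 0.0125, 0.025, 0.0375, 0.05
```

On this example Bonferroni keeps only the smallest $p$-value while BH keeps the three smallest, which illustrates the extra power of an FDR-targeted rule on the same measurements.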

Adaptive sequential testing has been previously addressed in the fixed budget setting: the procedure takes a sampling budget as input, and the guarantee states that if the given budget is larger than a problem dependent constant, the procedure drives the error probability to zero and the detection probability to one. One of the first methods called distilled sensing [12] assumed that arms from were Gaussian with mean at most , and successively discarded arms after repeated sampling by thresholding at –at most the median of the null distribution–thereby discarding about half the nulls at each round. The procedure made guarantees about FDR and TPR, which were later shown to be nearly optimal [4]. Specifically, [4, Corollary 4.2] implies that any procedure with requires a budget of at least , which is consistent with our work. Later, another thresholding algorithm for the fixed budget setting addressed the FWER and FWPD metrics [7]. In particular, if their procedure is given a budget exceeding then the FWER is driven to zero, and the FWPD is driven to one. By appealing to the optimality properties of the SPRT (which knows the distributions precisely) it was argued that this is optimal. These previous works mostly focused on the asymptotic regime as and .

Our paper, in contrast to these previous works considers the fixed confidence setting: the procedure takes a desired FDR (or FWER) and TPR (or FWPD) and aims to minimize the number of samples taken before these constraints are met. To the best of our knowledge, our paper is the first to propose a scheme for this problem in the fixed confidence regime with near-optimal sample complexity guarantees.

A related line of work is the threshold bandit problem, where all the means of $H_1$ are assumed to be strictly above a given threshold, and the means of $H_0$ are assumed to be strictly below the threshold [2016locatelli, 13]. To identify this partition, each arm must be pulled a number of times inversely proportional to the square of its deviation from the threshold. This contrasts with our work, where the majority of arms may have means equal to the threshold and the goal is to identify arms with means greater than the threshold subject to discovery constraints. If the arms in $H_0$ are assumed to be strictly below the threshold, it is possible to declare arms as being in $H_0$. In our setting we can only ever determine that an arm is in $H_1$ and not $H_0$; it is impossible to detect that an arm is in $H_0$ and not in $H_1$.

Note that the problem considered in this paper is closely related to the top-$k$ identification problem, where the objective is to identify the unique $k$ arms with the highest means with high probability [14, 9, 8]. Indeed, if we knew $|H_1|$, then our FWER+FWPD setting is equivalent to the top-$k$ problem with $k = |H_1|$. Lower bounds derived for the top-$k$ problem assume the algorithm has knowledge of the values of the means, just not their indices [14, 8]. Thus, these lower bounds also apply to our setting and are what are referenced in Section 1.2.

As pointed out by [2016locatelli], both our setting and the threshold bandit problem can be posed as a combinatorial bandits problem as studied in [15, 16], but such generality leads to unnecessary factors. The techniques used in this work aim to reduce extraneous factors, a topic of recent interest in the top-$1$ and top-$k$ arm identification problems [17, 18, 19, 20, 14, 8]. While these works are most similar to exact identification (FWER+FWPD), there also exist examples of approximate top-$k$ where the objective is to find any $k$ means that are each within $\epsilon$ of the best $k$ means [9]. Approximate recovery is also studied in a ranking context with a symmetric difference metric [21], which is more similar to the FDR and TPR setting, but neither that work nor this one subsumes the other.

Finally, maximizing the number of discoveries subject to an FDR constraint has been studied in a sequential setting in the context of A/B testing with uniform sampling [3]. This work popularized the concept of an always valid $p$-value that we employ here (see Section 2). The work of [10] controls FDR over a sequence of independent bandit problems that each output at most one discovery. While [10] shares much of the same vocabulary as this paper, the problem settings are very different.

2 Algorithm and Discussion

Input: Threshold , confidence , confidence interval
Initialize: Pull each arm once and let denote the number of times arm has been pulled up to time . Set , , and
      ,       and      
Else if FWPD
      ,       and      
      Pull arm ,
       Apply Benjamini-Hochberg [11] selection to obtain FDR-controlled set :
            (if set )
      If FWER and :
           Pull arm
            Apply Bonferroni-like selection to obtain FWER-controlled set :

Algorithm 1: An algorithm for identifying arms with means above a threshold $\mu_0$ using as few samples as possible subject to false alarm and true discovery conditions. The set $S_t$ is designed to control FDR at level $\delta$. The set $R_t$ is designed to control FWER at level $\delta$.

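Throughout, we will assume the existence of an anytime confidence interval. Namely, if $\hat{\mu}_{i,t}$ denotes the empirical mean of the first $t$ bounded i.i.d. rewards in $[0,1]$ from arm $i$, then for any $\delta \in (0,1)$ we assume the existence of a function $\phi$ such that $\mathbb{P}\left(\bigcap_{t=1}^{\infty} \{|\hat{\mu}_{i,t} - \mu_i| \le \phi(t,\delta)\}\right) \ge 1 - \delta$. We assume that $\phi(t,\delta)$ is non-increasing in its second argument and that there exists an absolute constant $c_\phi$ such that $\phi(t,\delta) \le \sqrt{\frac{c_\phi \log(\log_2(2t)/\delta)}{t}}$. It suffices to define $\phi$ with this upper bound with $c_\phi = 4$, but there are much sharper known bounds that should be used in practice (e.g., they may take empirical variance into account), see [19, 22, 23, 24]. Anytime bounds constructed with such a $\phi(t,\delta)$ are known to be tight in the sense that $\mathbb{P}\left(\bigcup_{t=1}^{\infty} \{|\hat{\mu}_{i,t} - \mu_i| \ge \phi(t,\delta)\}\right) \le \delta$ and that there exists an absolute constant $h \in (0,1)$ such that $\mathbb{P}\left(\bigcup_{t=1}^{\infty} \{|\hat{\mu}_{i,t} - \mu_i| \ge h\,\phi(t,\delta)\}\right) = 1$ for any $\delta \in (0,1)$, by the Law of the Iterated Logarithm [25].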
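For intuition about how such a width behaves, here is a minimal sketch of one possible anytime confidence width. The functional form, a law-of-the-iterated-logarithm-style bound $\sqrt{c \log(\log_2(2t)/\delta)/t}$, and the default constant `c = 4.0` are assumptions made for illustration; the sharper bounds cited above would be preferred in practice.

```python
import math


def phi(t, delta, c=4.0):
    """Assumed anytime confidence width after t pulls at confidence level delta.

    Shrinks roughly like sqrt(log(log(t)) / t) in t and gets wider as delta
    decreases, i.e. it is non-increasing in its second argument.
    """
    return math.sqrt(c * math.log(math.log2(2 * t) / delta) / t)


# The width narrows with more samples and widens with a stricter confidence level.
width_few = phi(10, 0.05)
width_many = phi(1000, 0.05)
width_strict = phi(10, 0.01)
```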

Consider Algorithm 1. Before entering the for loop, time-dependent variables $\xi_t$ and $\nu_t$ are defined that should be updated at each time $t$ for different settings. If just FDR control is desired, the algorithm merely loops over the three lines following the for loop, pulling the arm $I_t$ not in $S_t$ that has the highest upper confidence bound; such strategies are common for pure-exploration problems [19, 10]. But if FWER control is desired, then at most one additional arm $J_t$ is pulled per round to provide an extra layer of filtering and evidence before an arm is added to $R_t$. Below we describe the main elements of the algorithm and along the way sketch out the main arguments of the analysis, shedding light on the constants $\xi_t$ and $\nu_t$.

2.1 False alarm control

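$S_t$ is FDR-controlled. In addition to its use as a confidence bound, we can also use $\phi$ to construct anytime $p$-values:

$$P_{i,t} := \sup\left\{\alpha \in (0,1] : \hat{\mu}_{i,T_i(t)} - \phi(T_i(t), \alpha) \le \mu_0\right\}. \qquad (1)$$

Proposition 1 of [10] (and the proof of our Lemma 1) shows that if $i \in H_0$ so that $\mu_i = \mu_0$ then $P_{i,t}$ is an anytime, sub-uniformly distributed $p$-value in the sense that $\mathbb{P}\left(\bigcup_{t=1}^{\infty} \{P_{i,t} \le x\}\right) \le x$. Sequences that have this property are sometimes referred to as always-valid $p$-values [3]. Note that if $i \in H_1$ so that $\mu_i > \mu_0$, we would intuitively expect the sequence $\{P_{i,t}\}_{t}$ to be point-wise smaller than if $\mu_i = \mu_0$, by the property that $\phi$ is non-increasing in its second argument. This leads to the intuitive rule to reject the null-hypothesis (i.e., declare $i \notin H_0$) for those arms where $P_{i,t}$ is very small. The Benjamini-Hochberg (BH) procedure introduced in [11] proceeds by first sorting the $p$-values so that $P_{(1),t} \le P_{(2),t} \le \dots \le P_{(n),t}$, then defines $\hat{k} = \max\{k : P_{(k),t} \le \delta k / n\}$, and sets $S_{BH} = \{i : P_{i,t} \le \delta \hat{k} / n\}$. Note that this procedure is identical to defining sets

$$s(k) = \{i : P_{i,t} \le \delta k / n\} = \left\{i : \hat{\mu}_{i,T_i(t)} - \phi(T_i(t), \delta k / n) \ge \mu_0\right\},$$

setting $\hat{k} = \max\{k : |s(k)| \ge k\}$, and $S_{BH} = s(\hat{k})$, which is exactly the set $S_{t+1}$ in Algorithm 1. Thus, $S_{t+1}$ in Algorithm 1 is equivalent to applying the BH procedure to the anytime $p$-values of (1). Because the algorithm is pulling arms sequentially, some dependence between the $p$-values may be introduced, but we show in Lemma 1 that FDR is still controlled whenever anytime $p$-values are used: $\mathbb{E}\left[\frac{|S_t \cap H_0|}{|S_t| \vee 1}\right] \le \delta$ for all $t \in \mathbb{N}$.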
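The set form of BH selection lends itself to a direct implementation that never has to invert the confidence bound into a $p$-value. The sketch below assumes a particular confidence width `phi` (a law-of-the-iterated-logarithm-style bound with a placeholder constant) and hypothetical function names; it is meant only to show the mechanics of thresholding lower confidence bounds at level $\delta k / n$ and picking the largest consistent $k$.

```python
import math


def phi(t, delta, c=4.0):
    # Assumed anytime confidence width; any valid width could be substituted.
    return math.sqrt(c * math.log(math.log2(2 * t) / delta) / t)


def bh_discoveries(means, pulls, mu0, delta):
    """BH selection expressed through lower confidence bounds.

    For each k, s(k) holds the arms whose lower confidence bound at level
    delta * k / n is at least mu0. The output is s(k_hat) for the largest k
    with |s(k)| >= k, or the empty set if no such k exists.
    """
    n = len(means)
    selected = set()
    for k in range(1, n + 1):
        s_k = {i for i in range(n)
               if means[i] - phi(pulls[i], delta * k / n) >= mu0}
        if len(s_k) >= k:
            selected = s_k  # keep the set for the largest qualifying k
    return selected


# Hypothetical snapshot: arm 0 looks clearly above the threshold 0.5.
well_sampled = bh_discoveries([0.9, 0.5, 0.5], [1000, 1000, 1000], mu0=0.5, delta=0.05)
barely_sampled = bh_discoveries([0.9, 0.5, 0.5], [1, 1, 1], mu0=0.5, delta=0.05)
```

With only one pull per arm the confidence width is far larger than the gap, so nothing is selected; with many pulls the same empirical means yield a discovery.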

$R_t$ is FWER-controlled. A core obstacle in our analysis is the fact that we don't know the cardinality of $H_1$. If we did know $|H_0|$ (and equivalently know $|H_1|$) then a FWER+FWPD algorithm is equivalent to the so-called top-$k$ multi-armed bandit problem [9, 8] and controlling FWER would be relatively simple using a Bonferroni correction:

which implies FWER-. Comparing the first expression immediately above to the definition of in the algorithm, it is clear our strategy is to use as a surrogate for . Note that we could use the bound to guarantee FWER-, but this could be very loose and induce an sample complexity. Using as a surrogate for in is intuitive because by the FDR guarantee, we know , implying that which may be much tighter than if . Because we only know and not its expectation, the extra factors in the surrogate expression used in are used to ensure correctness with high-probability (see Lemma 7).

2.2 Sampling strategies to boost statistical power

The above discussion about controlling false alarms for $S_t$ and $R_t$ holds for any choice of arms $I_t$ and $J_t$ that may be pulled at time $t$. Thus, $I_t$ and $J_t$ are chosen in order to minimize the amount of time necessary to add arms into $S_t$ and $R_t$, respectively, and optimize the sample complexity.

TPR- setting implies . Define the random set . Because is an anytime confidence bound, . If , then and we claim that with probability at least (Section C)

Thus once this number of samples has been taken, either , or arms in will be repeatedly sampled until they are added to since each arm has its upper confidence bound larger than those arms in by definition. It is clear that an arm in that is repeatedly sampled will eventually be added to since its anytime -value of (1) approaches at an exponential rate as it is pulled, and BH selects for low -values. A similar argument holds for and adding arms to .
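To make the interplay between upper-confidence-bound sampling and BH selection concrete, here is a minimal sketch of the FDR-only loop: pull the undiscovered arm with the highest upper confidence bound, then recompute the discovery set. The confidence width `phi`, its constant, the function names, and the noise-free toy instance are all assumptions for illustration, and the sketch omits the FWER branch and the inflated constants of the FWPD setting.

```python
import math


def phi(t, delta, c=4.0):
    # Assumed anytime confidence width (law-of-the-iterated-logarithm form).
    return math.sqrt(c * math.log(math.log2(2 * t) / delta) / t)


def bh_discoveries(means, pulls, mu0, delta):
    # Largest k such that at least k arms have a lower confidence bound
    # (at level delta * k / n) that is at least mu0.
    n = len(means)
    selected = set()
    for k in range(1, n + 1):
        s_k = {i for i in range(n)
               if means[i] - phi(pulls[i], delta * k / n) >= mu0}
        if len(s_k) >= k:
            selected = s_k
    return selected


def ucb_with_bh(sample, n, mu0, delta, horizon):
    """Pull each arm once, then repeatedly pull the undiscovered arm with the
    highest upper confidence bound and refresh the BH discovery set."""
    sums = [sample(i) for i in range(n)]
    pulls = [1] * n
    discovered = set()
    for _ in range(horizon):
        candidates = [i for i in range(n) if i not in discovered]
        if not candidates:
            break  # every arm is already a discovery
        arm = max(candidates,
                  key=lambda i: sums[i] / pulls[i] + phi(pulls[i], delta))
        sums[arm] += sample(arm)
        pulls[arm] += 1
        means = [sums[i] / pulls[i] for i in range(n)]
        discovered = bh_discoveries(means, pulls, mu0, delta)
    return discovered, pulls


# Noise-free toy instance (hypothetical): arm 0 is a positive, arms 1-3 are nulls.
true_means = [0.9, 0.5, 0.5, 0.5]
found, counts = ucb_with_bh(lambda i: true_means[i], n=4, mu0=0.5,
                            delta=0.05, horizon=2000)
```

In this toy run the positive arm is sampled until its lower confidence bound clears the threshold, after which all remaining pulls go to the undiscovered null arms. A real experiment would replace the deterministic `sample` with noisy draws.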

FWPD- setting is more delicate and uses inflated values of and . This time, we must ensure that . Because then we could argue that either , or only arms in are sampled until they are added to (mirroring the TPR argument). As in the FWER setting above, if we knew the value of the we could set to observe that

which is less than , to guarantee such a condition. But we don’t know so we use as a surrogate, resulting in the inflated definitions of and relative to the TPR setting. The key argument is that either so that by the definition of (since ), or and with high probability which implies and the union bound of the display above holds.

3 Main Results

In what follows, we say $f \lesssim g$ if there exists a $c > 0$ that is independent of all problem parameters and $f \le c\,g$. The theorems provide an upper bound on the sample complexity $\tau$ as defined in Section 1.1 for TPR-$\delta,\tau$ or FWPD-$\delta,\tau$ that holds with probability at least $1 - c\delta$ for different values of $c$. (Footnote 3: Each theorem relies on different events holding with high probability, and consequently a different $c$ for each. To have $c = 1$ for each of the four settings, we would have had to define different constants in the algorithm for each setting. We hope the reader forgives us for this attempt at minimizing clutter.) We begin with the least restrictive setting, resulting in the smallest sample complexity of all the results presented in this work. Note the slight generalization in the theorem below, where the means of $H_0$ are assumed to be no greater than $\mu_0$.

Theorem 1 (FDR, TPR).

Let , . Define for , , and for . For all we have . Moreover, with probability at least there exists a such that

and for all . Neither argument of the minimum follows from the other.

If the means of are very diverse so that then the second argument of the min in Theorem 1 can be tighter than the first. But as discussed above, this advantage is inconsequential if . The remaining theorems are given in terms of just . The dependence is due to inverting the confidence interval and is unavoidable on at least one arm when is unknown a priori due to the law of the iterated logarithm [25, 19, 20].

Informally, Theorem 1 states that if just most true detections suffice while not making too many mistakes, then samples suffice. The first argument of the min is known to be tight in a minimax sense up to doubly logarithmic factors due to the lower bound of [4]. As a consequence of this work, an algorithm inspired by Algorithm 1 in this setting is now in production at one of the largest A/B testing platforms on the web. The full proof of Theorem 1 (and all others) is given in the Appendix due to space.

Theorem 2 (FDR, FWPD).

For all we have . Moreover, with probability at least , there exists a such that

and for all .

Here roughly scales like where the term comes from a high probability bound on the false discovery proportion for anytime -values (in contrast to just expectation) in Lemma 2 that may be of independent interest. While negligible for all practical purposes, it appears unnatural and we suspect that this is an artifact of our analysis. We note that if then the sample complexity sheds this awkwardness444In the asymptotic regime, it is common to study the case when for [4, 12]..

The next two theorems are concerned with controlling FWER on the set and determining how long it takes before the claimed detection conditions are satisfied on the set . Note we still have that FDR is controlled on the set but now this set feeds into .

Theorem 3 (FWER, FWPD).

For all we have . Moreover, with probability at least , we have for all and there exists a such that

and for all . Note, together this implies for all .

Theorem 3 has the strongest conditions, and therefore the largest sample complexity. Ignoring factors, roughly scales as . Inspecting the top-k lower bound of [8] where the arms’ means in are equal to , the arms’ means in are equal to , and the algorithm has knowledge of the cardinality of , a necessary sample complexity of is given. It is not clear whether this small difference of versus is an artifact of our analysis, or a fundamental limitation when the cardinality is unknown. We now state our final theorem.

Theorem 4 (FWER, TPR).

For all we have . Moreover, with probability at least we have for all and there exists a such that

and for all , where .

4 Experiments

The distribution of each arm equals where if , and if . We consider three algorithms: ) uniform allocation with anytime BH selection as done in Algorithm 1, ) successive elimination (SE) (see Appendix G)555Inspired by the best-arm identification literature [17]. that performs uniform allocation on only those arms that have not yet been selected by BH, and ) Algorithm 1 (UCB). Algorithm 1 and the BH selection rule for all algorithms use from [23, Theorem 8]. Here we present the sample complexity for TPR+FDR with and different parameterizations of , , .

The first panel shows an empirical estimate of

at each time for each algorithm, averaged over 1000 trials. The black dashed line on the first panel denotes the level , and corresponds to the dashed black line on the second panel. The right four panels show the number of samples each algorithm takes before the true positive rate exceeds , relative to the number of samples taken by UCB, for various parameterizations. Panels two, three, and four have for while panel five is a case where the ’s are linear for . While the differences are most clear on the second panel when , over all cases UCB uses at least times fewer samples than uniform and SE. For FDR+TPR, Appendix G shows uniform sampling roughly has a sample complexity that scales like while SE’s is upper bounded by . Comparing with Theorem 1 for the difference cases (i.e., ) provides insight into the relative difference between UCB, uniform, and SE on the different panels.


This work was informed and inspired by early discussions with Aaditya Ramdas on methods for controlling the false discovery rate (FDR) in multiple testing; we are grateful to have learned from a leader in the field. We also thank the leading experimentation and A/B testing platform on the web, Optimizely, for its support, insight into its customers’ needs, and for committing engineering time to implementing this research into their platform [26]. In particular, we thank Whelan Boyd, Jimmy Jin, Pete Koomen, Sammy Lee, Ajith Mascarenhas, Sonesh Surana, and Hao Xia at Optimizely for their efforts.


  • [1] Linhui Hao, Akira Sakurai, Tokiko Watanabe, Ericka Sorensen, Chairul A Nidom, Michael A Newton, Paul Ahlquist, and Yoshihiro Kawaoka. Drosophila rnai screen identifies host genes important for influenza virus replication. Nature, 454(7206):890, 2008.
  • [2] GJ Rocklin, TM Chidyausiku, I Goreshnik, A Ford, S Houliston, A Lemak, L Carter, R Ravichandran, VK Mulligan, A Chevalier, CH Arrowsmith, and D Baker. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science, 357:168–175, 2017.
  • [3] Ramesh Johari, Leo Pekelis, and David J Walsh. Always valid inference: Bringing sequential analysis to a/b testing. arXiv preprint arXiv:1512.04922, 2015.
  • [4] Rui M Castro. Adaptive sensing performance lower bounds for sparse signal detection and support estimation. Bernoulli, 20(4):2217–2246, 2014.
  • [5] Maxim Rabinovich, Aaditya Ramdas, Michael I Jordan, and Martin J Wainwright. Optimal rates and tradeoffs in multiple testing. arXiv preprint arXiv:1705.05391, 2017.
  • [6] A. Ramdas, R. Foygel Barber, M. J. Wainwright, and M. I. Jordan. A Unified Treatment of Multiple Testing with Prior Knowledge. ArXiv e-prints, March 2017.
  • [7] Matthew L Malloy and Robert D Nowak. Sequential testing for sparse recovery. IEEE Transactions on Information Theory, 60(12):7862–7873, 2014.
  • [8] Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Conference on Learning Theory, pages 1794–1834, 2017.
  • [9] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. Pac subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655–662, 2012.
  • [10] Fanny Yang, Aaditya Ramdas, Kevin G Jamieson, and Martin J Wainwright. A framework for multi-a(rmed)/b(andit) testing with online fdr control. In Advances in Neural Information Processing Systems, pages 5959–5968, 2017.
  • [11] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological), pages 289–300, 1995.
  • [12] Jarvis Haupt, Rui M Castro, and Robert Nowak. Distilled sensing: Adaptive sampling for sparse detection and estimation. IEEE Transactions on Information Theory, 57(9):6222–6235, 2011.
  • [13] Hideaki Kano, Junya Honda, Kentaro Sakamaki, Kentaro Matsuura, Atsuyoshi Nakamura, and Masashi Sugiyama. Good arm identification via bandit feedback. arXiv preprint arXiv:1710.06360, 2017.
  • [14] Lijie Chen, Jian Li, and Mingda Qiao. Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pages 101–110, 2017.
  • [15] Shouyuan Chen, Tian Lin, Irwin King, Michael R Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pages 379–387, 2014.
  • [16] Tongyi Cao and Akshay Krishnamurthy. Disagreement-based combinatorial pure exploration: Efficient algorithms and an analysis with localization. arXiv preprint arXiv:1711.08018, 2017.
  • [17] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
  • [18] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1238–1246, 2013.
  • [19] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
  • [20] Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory, pages 535–592, 2017.
  • [21] Reinhard Heckel, Max Simchowitz, Kannan Ramchandran, and Martin Wainwright. Approximate ranking from pairwise comparisons. In International Conference on Artificial Intelligence and Statistics, pages 1057–1066, 2018.
  • [22] A. Balsubramani. Sharp Finite-Time Iterated-Logarithm Martingale Concentration. ArXiv e-prints, May 2014.
  • [23] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1):1–42, 2016.
  • [24] Ervin Tanczos, Robert Nowak, and Bob Mankoff. A kl-lucb algorithm for large-scale crowdsourcing. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5896–5905. 2017.
  • [25] Philip Hartman and Aurel Wintner. On the law of the iterated logarithm. American Journal of Mathematics, 63(1):169–176, 1941.
  • [26] Optimizely. Accelerating experimentation through machine learning, https://blog.optimizely.com/2017/10/18/stats-accelerator, 2017.
  • [27] Bradley Efron. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge University Press, 2012.
  • [28] Maxim Raginsky and Alexander Rakhlin. Lower bounds for passive and active learning. In Advances in Neural Information Processing Systems, pages 1026–1034, 2011.
  • [29] Alexandre B Tsybakov. Introduction to nonparametric estimation, 2009.
  • [30] Pascal Massart et al. The tight constant in the dvoretzky-kiefer-wolfowitz inequality. The annals of Probability, 18(3):1269–1283, 1990.

Appendix A Analysis Preliminaries

Recall that for any we assume the existence of a function such that for any we have . We assume that is non-increasing in its second argument and that there exists an absolute constant such that . It suffices to take but there are much sharper known bounds that should be used in practice, see [19, 22, 23, 24]. Moreover, define its inverse . For the same constant , it can be shown that for a sufficiently large constant (and any ).

The technical challenges in this work revolve around arguments that avoid union bounds. By union bounding over all arms we have