A framework for Multi-A(rmed)/B(andit) testing with online FDR control

06/16/2017 ∙ by Fanny Yang, et al. ∙ 0

We propose an alternative framework to existing setups for controlling false alarms when multiple A/B tests are run over time. This setup arises in many practical applications, e.g. when pharmaceutical companies test new treatment options against control pills for different diseases, or when internet companies test their default webpages versus various alternatives over time. Our framework proposes to replace a sequence of A/B tests by a sequence of best-arm MAB instances, which can be continuously monitored by the data scientist. When interleaving the MAB tests with an online false discovery rate (FDR) algorithm, we can obtain the best of both worlds: low sample complexity and any time online FDR control. Our main contributions are: (i) to propose reasonable definitions of a null hypothesis for MAB instances; (ii) to demonstrate how one can derive an always-valid sequential p-value that allows continuous monitoring of each MAB test; and (iii) to show that using rejection thresholds of online-FDR algorithms as the confidence levels for the MAB algorithms results in sample-optimality, high power and low FDR at any point in time. We run extensive simulations to verify our claims, and also report results on real data collected from the New Yorker Cartoon Caption contest.



Code Repositories


Code for the MAB-FDR framework introduced in "A framework for Multi-A(rmed)/B(andit) Testing with Online FDR Control", NIPS 2017


1 Introduction

For most modern internet companies, wherever there is a metric that can be measured (e.g., time spent on a page, click-through rates, conversion of curiosity to a sale), there is almost always a randomized trial behind the scenes, with the goal of identifying an alternative website design that provides improvements over the default design. The use of such data-driven decisions for perpetual improvement is colloquially known as A/B testing in the case of two alternatives, or A/B/n testing for several alternatives. Given a default configuration and several alternatives (e.g., color schemes of a website), the standard practice is to divert a small amount of traffic to a randomized trial over these alternatives and record the desired metric for each of them. If an alternative appears to be significantly better, it is implemented; otherwise, the default setting is maintained.

At first glance, this procedure seems intuitive and simple. However, in cases where the aim is to optimize over one particular metric, this common tool suffers from several downsides. (1) First, whereas some alternatives may be clearly worse than the default, others may only have a slight edge. If one wishes to minimize the amount of time and resources spent on this randomized trial, the more promising alternatives should intuitively get a larger share of the traffic than the clearly-worse alternatives. Yet typical A/B/n testing frameworks allocate traffic uniformly over alternatives. (2) Second, companies often desire to continuously monitor an ongoing A/B test, as they may adjust their termination criteria as time goes by and possibly stop earlier or later than originally intended. However, just as a long string of heads is eventually inevitable if you flip a coin long enough, the practice of continuous monitoring (without mathematically correcting for it) can easily fool the tester into believing that a result is statistically significant when in reality it is not. This is one of the reasons for the lack of reproducibility of scientific results, an issue recently receiving increased attention from the public media. (3) Third, the lack of sufficient evidence or an insignificant improvement of the metric may make it undesirable from a practical or financial perspective to replace the default. Therefore, when a company runs hundreds to thousands of A/B tests within a year, ideally the number of statistically insignificant changes that it made should be small compared to the total number of changes made. Controlling the false alarm rate of each individual test at a desired level, however, does not achieve this type of control, which is known as controlling the false discovery rate. Of course, it is also desirable to detect better alternatives (when they exist), and to do so as quickly as possible.
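The effect of uncorrected continuous monitoring described in point (2) can be illustrated with a short simulation. The sketch below is not taken from the paper: the function name, the two-sided z-test, the threshold 1.96 and the sample sizes are all illustrative assumptions. Both arms are identical, so every rejection is a false alarm.

```python
import numpy as np

def false_alarm_rates(n_trials=2000, horizon=500, z_crit=1.96, seed=0):
    """Compare a z-test evaluated once at the horizon with one evaluated after every sample.

    The null hypothesis is true in every trial, so each rejection is a false alarm.
    """
    rng = np.random.default_rng(seed)
    # Differences between paired A and B observations: mean 0, variance 1 under the null.
    diffs = rng.standard_normal((n_trials, horizon))
    n = np.arange(1, horizon + 1)
    z = np.cumsum(diffs, axis=1) / np.sqrt(n)               # z-statistic after each sample
    fixed = np.mean(np.abs(z[:, -1]) > z_crit)              # look once, at the end
    peeking = np.mean(np.any(np.abs(z) > z_crit, axis=1))   # stop at the first "significant" look
    return fixed, peeking

fixed_rate, peeking_rate = false_alarm_rates()
print(f"fixed horizon: {fixed_rate:.3f}, continuous monitoring: {peeking_rate:.3f}")
```

Under these assumptions the single look stays near the nominal 5% level, while stopping at the first nominally significant look produces false alarms far more often.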

In this paper, we provide a novel framework that addresses the above shortcomings of A/B or A/B/n testing. The first concern is tackled by employing recent advances in adaptive sampling like the pure-exploration multi-armed bandit (MAB) algorithm. For the second concern, we adopt the notion of any-time $p$-values for guilt-free continuous monitoring, and we make the advantages and risks of early-stopping transparent. Finally, we handle the third issue using recent advances in online false discovery rate (FDR) control. Hence the combined framework can be described as doubly-sequential (sequences of MAB tests, each of which is itself sequential). Although each of those problems has been studied in hitherto disparate communities, how to leverage the best of all worlds, if at all possible, has remained an open problem. The main contributions of this paper are in merging these ideas in a combined framework and presenting the conditions under which it can be shown to yield near-optimal sample complexity, near-optimal best-alternative discovery rate, as well as FDR control.

While the above concerns about A/B/n testing were discussed using the example of modern internet companies, the same concerns carry forward qualitatively to other domains, like pharmaceutical companies running sequential clinical trials with a control (often placebo) and a few treatments (like different doses or drug substances). In a manufacturing or food production setting, one may be interested in identifying (perhaps cheaper) substitutes for individual materials without compromising the quality of a product too much. In a government setting, pilot programs are funded in search of improvements over current programs, and it is desirable from both a social welfare and a cost standpoint to limit the adoption of ineffective policies.

The remainder of this paper is organized as follows. In Section 2, we lay out the primary goals of the paper, and describe a meta-algorithm that combines adaptive sampling strategies with FDR control procedures. Section 3 is devoted to the description of a concrete procedure, along with some theoretical guarantees on its properties. In Section 4, we describe the results of our extensive experiments on both simulated and real-world data sets that are available to us, before we conclude with a discussion in Section 6.

2 Formal experimental setup and a meta-algorithm

In this section we first formalize the setup of a typical A/B/n test and provide a high-level overview of our proposed combined framework aimed at addressing the shortcomings mentioned in the introduction. A specific instantiation of this meta-algorithm, along with detailed theoretical guarantees, is specified in Section 3.

For concreteness, we refer to the system designer, whether a tech company or a pharmaceutical company, as a (data) scientist. We assume that the scientist may need to conduct a possibly infinite number of experiments sequentially, indexed by $j$. Each experiment has one default setting, referred to as the control, and $K$ alternative settings, called the treatments or alternatives. The scientist must return one of the $K + 1$ options that is the “best” according to some predefined metric, before the next experiment is started. Such a setup is a simple mathematical model both for clinical trials run by pharmaceutical labs, and A/B/n testing used at scale by tech companies.

One full experiment consists of steps of the following kind: In each step, the scientist assigns a new person (who arrives at the website or who enrolls in the clinical trial) to one of the $K + 1$ options and obtains a measurable outcome. In practice, the role of the scientist could be taken by an adaptive algorithm, which determines the assignment at time step $t$ by careful consideration of all previous outcomes. Borrowing terminology from the multi-armed bandit (MAB) literature, we refer to each of the $K + 1$ options as an arm, and each assignment to arm $i$ is termed “pulling arm $i$”. For concreteness, we assign the index $0$ to the default or control arm, and note that this index is known to the algorithm.

We assume that the observable metric from each pull of arm $i = 0, 1, \ldots, K$ corresponds to an independent draw from an unknown probability distribution with expectation $\mu_i$. Ideally, if the means were known, we would use them as scores to compare the arms, where higher is better. In the sequel we use $\mu_{i^\star}$ to denote the mean of the best arm. We refer the reader to Table 1 for a glossary of the notation used throughout this paper.

2.1 Some desiderata and difficulties

Given the setup above, how can we mathematically describe the guarantees that the companies might desire from an improved multiple-A/B/n testing framework? Which parts of the puzzle can be directly transferred from known results, and what challenges remain?

In order to answer the first question, let us adopt terminology from the hypothesis testing literature and view each experiment as a test of a null hypothesis. Any claim that an alternative arm is the best is called a discovery, and if such a claim is erroneous then it is called a false discovery. When multiple hypotheses need to be tested, the scientist needs to define the quantity it wants to control. While we may desire that the probability of even a single false discovery—called the family-wise error rate—is small, this is usually far too stringent for a large and unknown number of tests. For this reason, [1] proposed that it may be more interesting to control the expected ratio of false discoveries to the total number of discoveries (called the False Discovery Rate, or FDR for short) or ratio of expected number of false discoveries to the expected number of total discoveries (called the modified FDR or mFDR for short). Over the past decades, the FDR and its variants like mFDR have become standard quantities for multiple testing applications. In the following, if not otherwise specified, we use the term FDR to denote both measures in order to simplify the presentation. In Section 3, we show that both mFDR and FDR can be controlled for different choices of procedures.

2.1.1 Challenges in viewing an MAB instance as a hypothesis test

In our setup, we want to be able to control the FDR at any time in an online manner. Online FDR procedures were first introduced by Foster and Stine [2], and have since been studied by other authors (e.g., [3, 4]). A typical online FDR procedure is based on comparing a valid $p$-value $P^j$ with carefully-chosen levels $\alpha_j$ for each hypothesis test. (A valid $P^j$ must be stochastically dominated by a uniform distribution on $[0, 1]$, which we henceforth refer to as super-uniformly distributed.) We reject the null hypothesis, represented as $R_j = 1$, when $P^j \le \alpha_j$, and we set $R_j = 0$ otherwise.

As mentioned, we want to use adaptive MAB algorithms in each experiment to test each hypothesis, since they can find a best arm among $K + 1$ with near-optimal sample complexity. However, the traditional MAB setup does not account for the asymmetry between the arms that arises in a testing setup, with one being the default (control) and others being alternatives (treatments). This is the standard scenario in A/B/n testing applications: for example, a company might prefer wrong claims that the control is the best (false negative) over wrong claims that an alternative is the best (false positive), simply because new system-wide adoption of selected alternatives might involve high costs. What would be a suitable null hypothesis in this hybrid setting? To allow continuous monitoring, is it possible to define and compute always-valid $p$-values that are super-uniformly distributed under the null hypothesis when computed at any time $t$? (This could be especially challenging given that the number of samples from each arm is random, and different for each arm.)

In addition to asymmetry, the practical scientist might have a different incentive than the ideal outcome for MAB algorithms. In particular, he/she might not want to find the best alternative if it is not substantially better than the control. Indeed, if the net gain made by adopting a new alternative is small, it might be offset by the cost of implementing the change from the existing default choice. By similar reasoning, we may not require identifying the single best arm if there is a set of arms with similar means that are all larger than the rest.

We propose a sensible null-hypothesis for each experiment which incorporates the approximation and improvement notions as described above and provide an always valid $p$-value which can be easily calculated at each time step in the experiment. We show that a slight modification of the usual LUCB algorithm caters to this specific null-hypothesis while still maintaining near-optimal sample complexity.

2.1.2 Interaction between MAB and FDR

In order to take advantage of the sample efficiency of best-arm bandit algorithms, it is crucial to set the confidence levels close to what is needed. Given a user-defined level $\alpha$, at each hypothesis $j$, online FDR procedures automatically output the significance level $\alpha_j$ that is “needed” to guarantee FDR control, based on past decisions.

Figure 1: Diagram of the MAB-FDR meta algorithm designed to achieve online FDR control along with near-optimal sample complexity. The green arrows symbolize interaction between the MAB and FDR procedures via the FDR test levels $\alpha_j$ and rejection indicator variables $R_j$. Notice that the $p$-values are now dependent, as each $\alpha_j$ depends on the past rejections $R_1, \ldots, R_{j-1}$. The eyes represent possible continuous monitoring by the scientist.

Can we directly set the MAB confidence levels to the output levels $\alpha_j$ from the online FDR procedure? If we do, our $p$-values are not independent across different hypotheses anymore: $P^j$ directly depends on the FDR level $\alpha_j$, and each $\alpha_j$ in turn depends on past MAB rejections, thus on past MAB $p$-values (see Figure 1). Does the new interaction compromise FDR guarantees?

Although known online FDR procedures [2, 4] guarantee FDR control for independent $p$-values, this does not hold for dependent $p$-values in general. Hence FDR control guarantees cannot simply be obtained out of the box. In particular, it is not a priori obvious that the introduced dependence between the $p$-values does not cause problems, i.e., does not violate the conditions required by FDR control theorems. A key insight that emerges from our analysis is that an appropriate bandit algorithm actually shapes the $p$-value distribution under the null in a “good” way that allows us to control FDR.

2.2 A meta-algorithm

Procedure 1 summarizes our doubly-sequential procedure, with a corresponding flowchart in Figure 1. We will prove theoretical guarantees after instantiating the separate modules. Note that our framework allows the scientist to plug in their favorite best-arm MAB algorithm or online FDR procedure. The choice for each of them determines which guarantees can be proven for the entire setup. Any independent improvement in either of the two parts would immediately lead to a performance boost of the overall framework.

  1. The scientist sets a desired FDR control rate $\alpha$.

  2. For each $j = 1, 2, \ldots$:

    • Experiment $j$ receives a designated control arm and some number of alternative arms.

    • An online-FDR procedure returns an $\alpha_j$ that is some function of the past values $P^1, \ldots, P^{j-1}$.

    • An MAB procedure with inputs (a) the control arm and alternative arms, (b) confidence level $\alpha_j$, and (c) (optional) a precision $\epsilon \ge 0$, is executed and, if the procedure self-terminates, returns a recommended arm.

    • Throughout the MAB procedure, an always valid $p$-value $P^j_t$ is constructed continuously for each time $t$ using only the samples collected up to that time from the $j$-th experiment: for any $t$, it is a random variable $P^j_t \in [0, 1]$ that is super-uniformly distributed whenever the control arm is best.

    • When the MAB procedure is terminated at time $t$ (either by itself or by a user-defined stopping criterion that may depend on $P^j_t$), if the arm with the highest empirical mean is not the control arm and $P^j_t \le \alpha_j$, then we return $R_j = 1$, and the control arm is rejected in favor of this empirically best arm.

Procedure 1 MAB-FDR Meta algorithm skeleton
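The control flow of Procedure 1 can be sketched as a loop with two pluggable callbacks. This is a minimal illustration rather than the authors' implementation: the names `mab_fdr`, `next_level` and `run_mab` are hypothetical, the toy level rule `spending_level` is a simple alpha-spending stand-in (not LORD), and `replay_mab` just replays a precomputed (empirically best arm, p-value) pair instead of running a bandit algorithm.

```python
import math
from typing import Callable, List, Sequence, Tuple

def mab_fdr(
    experiments: Sequence[object],
    next_level: Callable[[List[int]], float],
    run_mab: Callable[[object, float], Tuple[int, float]],
) -> List[Tuple[int, int]]:
    """Return one (rejection indicator R_j, adopted arm) pair per experiment."""
    rejections: List[int] = []
    results: List[Tuple[int, int]] = []
    for experiment in experiments:
        alpha_j = next_level(rejections)                   # online FDR step: level from past decisions
        best_arm, p_value = run_mab(experiment, alpha_j)   # MAB step run at confidence alpha_j
        # Reject only if a non-control arm looks best AND the p-value clears the level.
        reject = int(best_arm != 0 and p_value <= alpha_j)
        rejections.append(reject)
        results.append((reject, best_arm if reject else 0))
    return results

def spending_level(past_rejections: List[int], alpha: float = 0.05) -> float:
    """Toy stand-in level rule: alpha_j = alpha * 6 / (pi^2 j^2), which sums to alpha."""
    j = len(past_rejections) + 1
    return alpha * 6.0 / (math.pi ** 2 * j ** 2)

def replay_mab(experiment: Tuple[int, float], alpha_j: float) -> Tuple[int, float]:
    """Toy stand-in MAB: the experiment already holds (empirically best arm, p-value)."""
    return experiment

toy_experiments = [(1, 1e-6), (0, 1.0), (2, 0.5), (3, 1e-9)]
print(mab_fdr(toy_experiments, spending_level, replay_mab))
```

Keeping the two modules behind callbacks mirrors the remark that any best-arm MAB algorithm or online FDR procedure can be plugged in; only the level $\alpha_j$ and the rejection indicator cross the boundary between them.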

3 A concrete procedure with guarantees

We now take the high-level road map given in Procedure 1, and show that we can obtain a concrete, practically implementable framework with FDR control and power guarantees. We first discuss the key modeling decisions we have to make in order to seamlessly embed MAB algorithms into an online FDR framework. We then outline a modified version of a commonly used best-arm algorithm, before we finally prove FDR and power guarantees for the concrete combined procedure.

3.1 Defining null hypotheses and constructing $p$-values

Our first task is to define a null hypothesis for each experiment. As mentioned before, the choice of the null is not immediately obvious, since we sample from multiple distributions adaptively instead of independently. In particular, we will generally not have the same number of samples for all arms. Given a distribution with default mean $\mu_0^j$ and alternative distributions with means $\mu_1^j, \ldots, \mu_K^j$, we propose that the null hypothesis for the $j$-th experiment should be defined as

$$H_0^j : \ \mu_0^j \ge \mu_i^j - \epsilon \quad \text{for all } i = 1, \ldots, K. \tag{1}$$

In words, the null corresponds to there being no alternative arm that is $\epsilon$-better than the control arm.

It remains to define a $p$-value for each experiment that is stochastically dominated by a uniform random variable under the null; such a $p$-value is said to be superuniform. In order to simplify notation below, we omit the index $j$ for the experiment and retain only the index $i$ for the choice of arms. In order to be able to use a $p$-value at arbitrary times in the testing procedure and to allow scientists to monitor the algorithm’s progress in real time, it is helpful to define an always valid $p$-value, as previously defined by Johari et al. [5]. An always valid $p$-value is a stochastic process $\{P_t\}_{t=1}^{\infty}$ such that for all fixed and random stopping times $T$, under any distribution $\mathbb{P}_0$ over the arm rewards such that the null hypothesis is true, we have

$$\mathbb{P}_0(P_T \le \alpha) \le \alpha \quad \text{for all } \alpha \in [0, 1]. \tag{2}$$

When all arms are drawn independently an equal number of times, by linearity of expectation one can regard the distance of each pair of samples as a random variable drawn i.i.d. from a distribution with mean $\tilde\mu := \mu_0 - \mu_i$. We can then view the problem as testing the standard hypothesis $H_0 : \tilde\mu \ge -\epsilon$. However, when the arms are pulled adaptively, a different solution needs to be found. Indeed, in this case, the sample means are not unbiased estimators of the true means, since the number of times an arm was pulled now depends on the empirical means of all the arms.
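The bias of sample means under adaptive sampling can be seen in a deliberately simple example that is not from the paper. Assume a single arm with standard normal rewards (true mean 0) and a hypothetical rule that pulls the arm a second time only if the first reward was positive. The function name and the rule are illustrative assumptions.

```python
import numpy as np

def adaptive_sample_mean(n_runs=200_000, seed=0):
    """Average the final sample mean of one arm under a data-dependent sampling rule.

    Rule (assumed for illustration): pull once; pull a second time only if the
    first reward was positive. The true mean of the arm is 0.
    """
    rng = np.random.default_rng(seed)
    first = rng.standard_normal(n_runs)
    second = rng.standard_normal(n_runs)
    pulled_twice = first > 0
    # Sample mean at the end of each run: one reward or the average of two.
    sample_mean = np.where(pulled_twice, (first + second) / 2.0, first)
    return sample_mean.mean()

bias = adaptive_sample_mean()
print(f"average sample mean: {bias:.3f} (true mean is 0)")
```

For this rule the expected sample mean is $-1/(2\sqrt{2\pi}) \approx -0.2$: unlucky first draws are never diluted by further samples, so the estimate is pulled downward even though every individual reward is unbiased.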

Our strategy is to construct always valid $p$-values by using the fact that $p$-values can be obtained by inverting confidence intervals. To construct always-valid confidence bounds, we resort to the fundamental concept of the law of the iterated logarithm (LIL), for which non-asymptotic versions have been recently derived and used for both bandits and testing problems (see [6], [7]).

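To elaborate, define the function

$$\varphi_n(\delta) := \sqrt{\frac{\log(1/\delta) + 3 \log(\log(1/\delta)) + \frac{3}{2} \log(\log(en))}{n}}. \tag{3}$$

If $\hat\mu_{i,n}$ is the empirical average of independent samples from a sub-Gaussian distribution, then it is known (see, for instance, [8, Theorem 8]) that for all $\delta \in (0, 1)$, we have

$$\max\left\{ \mathbb{P}\left( \bigcup_{n=1}^{\infty} \left\{ \hat\mu_{i,n} - \mu_i > \varphi_n(\delta \wedge 0.1) \right\} \right), \ \mathbb{P}\left( \bigcup_{n=1}^{\infty} \left\{ \hat\mu_{i,n} - \mu_i < -\varphi_n(\delta \wedge 0.1) \right\} \right) \right\} \le \delta, \tag{4}$$

where $\delta \wedge 0.1 := \min\{\delta, 0.1\}$.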

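We are now ready to propose single arm $p$-values of the form

$$P_{i,t} := \sup\left\{ \gamma \in [0, 1] \ \middle| \ \hat\mu_{i,n_i(t)} - \varphi_{n_i(t)}\!\left(\tfrac{\gamma}{2K}\right) \le \hat\mu_{0,n_0(t)} + \varphi_{n_0(t)}\!\left(\tfrac{\gamma}{2}\right) + \epsilon \right\}. \tag{5}$$

Here $n_i(t)$ denotes the number of times arm $i$ has been pulled up to time $t$, and we set $P_{i,t} = 1$ if the supremum is taken over an empty set. Given these single arm $p$-values, the always-valid $p$-value for the experiment is defined as

$$P_t := \min_{s \le t} \ \min_{i = 1, \ldots, K} P_{i,s}. \tag{6}$$

We claim that this procedure leads to an always valid $p$-value (with proof in Section 5.1).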

Proposition 1.

The sequence $\{P_t\}_{t=1}^{\infty}$ defined via equation (6) is an always valid $p$-value.

See Section 5.1 for the proof of this proposition.
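The idea of obtaining a $p$-value by inverting confidence bounds can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' implementation: the confidence radius `phi` is an assumed LIL-style form, the split of the level into `gamma / (2K)` for the alternative and `gamma / 2` for the control is an assumed union-bound allocation, and all function names are hypothetical. The supremum is found by bisection, using the fact that the confidence radii shrink as `gamma` grows.

```python
import math

def phi(n, delta):
    """Assumed LIL-style confidence radius for the mean of n sub-Gaussian samples."""
    delta = min(delta, 0.1)                      # keeps log(log(1/delta)) well defined
    log_inv = math.log(1.0 / delta)
    return math.sqrt((log_inv + 3.0 * math.log(log_inv)
                      + 1.5 * math.log(math.log(math.e * n))) / n)

def single_arm_p_value(mean_i, n_i, mean_0, n_0, num_alternatives, eps=0.0, tol=1e-10):
    """Largest gamma at which 'arm i is not eps-better than the control' is still plausible."""
    def still_plausible(gamma):
        lcb_i = mean_i - phi(n_i, gamma / (2 * num_alternatives))
        ucb_0 = mean_0 + phi(n_0, gamma / 2)
        return lcb_i <= ucb_0 + eps
    if still_plausible(1.0):
        return 1.0
    lo, hi = 0.0, 1.0            # plausible just above lo, not plausible at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if still_plausible(mid):
            lo = mid
        else:
            hi = mid
    return hi                    # slightly conservative upper end of the bracket

def always_valid_p_value(history, num_alternatives, eps=0.0):
    """Running minimum over time and over alternatives.

    history: list of snapshots (mean_0, n_0, [(mean_i, n_i), ...]), one per time step.
    """
    p = 1.0
    for mean_0, n_0, alternatives in history:
        for mean_i, n_i in alternatives:
            p = min(p, single_arm_p_value(mean_i, n_i, mean_0, n_0, num_alternatives, eps))
    return p

history = [(0.0, 10, [(0.1, 10)]), (0.0, 1000, [(1.0, 1000)])]
print(always_valid_p_value(history, num_alternatives=1))
```

Taking the running minimum means the value can only decrease as evidence accumulates, which is what makes it safe to inspect at any time; the price is that the confidence radii must hold uniformly over all sample sizes.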

3.2 Adaptive sampling for best-arm identification

In the traditional A/B testing setting described in the introduction, samples are allocated uniformly to the different alternatives. But by allocating different numbers of samples to the alternatives, decisions can be made with the same statistical significance using far fewer samples. Suppose moreover that there is a unique maximizer $i^\star$, so that $\Delta_i := \mu_{i^\star} - \mu_i > 0$ for all $i \ne i^\star$. Then for any $\delta \in (0, 1)$, best-arm identification algorithms for the multi-armed bandit problem can identify $i^\star$ with probability at least $1 - \delta$ based on at most $\sum_{i \ne i^\star} \Delta_i^{-2} \log(1/\delta)$ total samples, ignoring some doubly-logarithmic factors (see the paper [9] for a brief survey and [10] for an application to clinical trials). In contrast, if samples are allocated uniformly to the alternatives under the same conditions, then the most natural procedures require $K \max_{i \ne i^\star} \Delta_i^{-2} \log(K/\delta)$ samples before returning $i^\star$ with probability at least $1 - \delta$.
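To get a feel for the difference between adaptive and uniform allocation, the two orders of magnitude can be evaluated on a hypothetical instance. The sketch below assumes the standard forms $\sum_i \Delta_i^{-2} \log(1/\delta)$ for adaptive sampling and $K \max_i \Delta_i^{-2} \log(K/\delta)$ for uniform sampling, ignores constants and doubly-logarithmic factors, and uses made-up gaps; the function names are illustrative.

```python
import math

def adaptive_samples(gaps, delta):
    """Order of the adaptive sample complexity: sum of Delta_i^-2 times log(1/delta)."""
    return sum(g ** -2 for g in gaps) * math.log(1.0 / delta)

def uniform_samples(gaps, delta):
    """Order of the uniform-allocation sample complexity: K * max Delta_i^-2 * log(K/delta)."""
    k = len(gaps)
    return k * max(g ** -2 for g in gaps) * math.log(k / delta)

# Hypothetical instance: one close competitor and eight clearly worse arms.
gaps = [0.1] + [1.0] * 8
delta = 0.05
adaptive = adaptive_samples(gaps, delta)
uniform = uniform_samples(gaps, delta)
print(f"adaptive ~ {adaptive:.0f}, uniform ~ {uniform:.0f}, ratio ~ {uniform / adaptive:.1f}")
```

Uniform allocation pays the price of the hardest comparison for every arm, whereas adaptive sampling spends most of its budget only on the close competitor.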

However, standard best-arm bandit algorithms do not by default incorporate the asymmetry induced by null hypotheses as in definition (1). Furthermore, recall that a practical scientist might desire the ability to incorporate approximation and a minimum improvement requirement. More precisely, it is natural to consider the requirement that the returned arm $i_b$ satisfies the bounds $\mu_{i_b} \ge \mu_0 + \epsilon$ and $\mu_{i_b} \ge \mu_{i^\star} - \epsilon$ for some $\epsilon > 0$. For those readers unfamiliar with best-arm MAB algorithms, it is likely helpful to first grasp the entire framework in the special case $\epsilon = 0$ throughout, before understanding it in full generality with the complications introduced by setting $\epsilon > 0$. In the following we present a modified MAB algorithm based on the common LUCB algorithm (see [11, 12]).

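For all times $t$, let $n_i(t)$ be the number of times arm $i$ has been pulled up to time $t$. In addition, for each arm $i$ let $\hat\mu_i(t)$ be the empirical mean of its $n_i(t)$ rewards, and define

$$\mathrm{LCB}_i(t) := \hat\mu_i(t) - \varphi_{n_i(t)}\!\left(\tfrac{\delta}{2K}\right) \quad \text{and} \quad \mathrm{UCB}_i(t) := \hat\mu_i(t) + \varphi_{n_i(t)}\!\left(\tfrac{\delta}{2}\right).$$

  1. Set $t = 1$ and sample every arm once.

  2. Repeat: Compute $h_t = \arg\max_{i = 0, \ldots, K} \hat\mu_i(t)$, and $\ell_t = \arg\max_{i = 0, \ldots, K, \, i \ne h_t} \mathrm{UCB}_i(t)$

    1. If $\mathrm{LCB}_0(t) > \mathrm{UCB}_i(t) - \epsilon$, for all $i \ne 0$, then output $0$ and terminate.
      Else if $\mathrm{LCB}_{h_t}(t) > \mathrm{UCB}_{\ell_t}(t) - \epsilon$ and $\mathrm{LCB}_{h_t}(t) > \mathrm{UCB}_0(t) + \epsilon$, then output $h_t$ and terminate.

    2. If $\epsilon > 0$, let $u_t = \arg\max_{i \ne 0} \mathrm{UCB}_i(t)$ and pull all distinct arms in $\{0, u_t, h_t, \ell_t\}$ once.
      If $\epsilon = 0$, pull arms $h_t$ and $\ell_t$ and set $t = t + 1$.

Algorithm 1 Best-arm identification with a control arm for confidence $\delta$ and precision $\epsilon \ge 0$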

Inside the loop of Algorithm 1, we use $h_t$ to denote the current empirically-best arm, and $\ell_t$ to denote the most promising contender among the other arms that has not yet been sampled enough to be ruled out. The parameter $\epsilon$ is a slack variable, and the algorithm is easiest to first understand when $\epsilon = 0$. We provide a visualization of how $\epsilon$ affects the stopping condition in Figure 2. Step (a) checks if $h_t$ is within $\epsilon$ of the true highest mean, and if it is also at least $\epsilon$ greater than the true mean of the control arm (or is the control arm), terminates with this arm $h_t$. Step (b) ensures that the control arm is sufficiently sampled when $\epsilon > 0$. Step (c) pulls $h_t$ and $\ell_t$, reducing the overall uncertainty in the difference between their two means.

(a) (b)
Figure 2: (a) The means of arms are within of the best arm, but only arms are at least better than the control arm 0. Thus, returning any of arms would result in a false discovery when . (b) An example of the stopping condition being critically met and returning a non-control arm . While is satisfied with some slack, is just barely satisfied.
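An LUCB-style loop with a distinguished control arm, in the spirit of Algorithm 1, can be sketched as follows. This is a hedged illustration rather than the authors' code: the confidence radius `phi`, the levels `delta / (2K)` and `delta / 2` used for the lower and upper bounds, the `max_pulls` budget and all function names are assumptions made for the example, and rewards are simulated Gaussians.

```python
import math
import numpy as np

def phi(n, delta):
    """Assumed LIL-style confidence radius for the mean of n sub-Gaussian samples."""
    delta = min(delta, 0.1)
    log_inv = math.log(1.0 / delta)
    return math.sqrt((log_inv + 3.0 * math.log(log_inv)
                      + 1.5 * math.log(math.log(math.e * n))) / n)

def best_arm_with_control(pull, num_arms, delta, eps=0.0, max_pulls=200_000):
    """pull(i) returns one reward from arm i; arm 0 is the control.

    Returns (recommended arm, total pulls); the arm is None if the budget runs out.
    """
    K = num_arms - 1
    sums = np.array([pull(i) for i in range(num_arms)], dtype=float)  # sample every arm once
    counts = np.ones(num_arms)
    total = num_arms
    while total < max_pulls:
        means = sums / counts
        lcb = [means[i] - phi(counts[i], delta / (2 * K)) for i in range(num_arms)]
        ucb = [means[i] + phi(counts[i], delta / 2) for i in range(num_arms)]
        h = int(np.argmax(means))                                          # empirically best arm
        l = max((i for i in range(num_arms) if i != h), key=lambda i: ucb[i])  # best contender
        # Stopping rules: the control wins, or h beats both the contender and the control.
        if all(lcb[0] > ucb[i] - eps for i in range(1, num_arms)):
            return 0, total
        if lcb[h] > ucb[l] - eps and lcb[h] > ucb[0] + eps:
            return h, total
        # Sampling rule: with eps > 0 also keep sampling the control and the top alternative.
        if eps > 0:
            u = max(range(1, num_arms), key=lambda i: ucb[i])
            to_pull = {0, u, h, l}
        else:
            to_pull = {h, l}
        for i in to_pull:
            sums[i] += pull(i)
            counts[i] += 1
            total += 1
    return None, total

rng = np.random.default_rng(0)
true_means = [0.0, 1.0, 0.3]     # hypothetical instance; arm 0 is the control
arm, pulls = best_arm_with_control(lambda i: rng.normal(true_means[i], 0.5),
                                   len(true_means), delta=0.05)
print(f"recommended arm {arm} after {pulls} pulls")
```

Note the asymmetry in the second stopping rule: when the empirically best arm is the control, the condition `lcb[h] > ucb[0] + eps` can never hold, so the control is only ever returned through the first rule.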

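The following proposition applies to Algorithm 1 run with a control arm indexed by $i = 0$ with mean $\mu_0$ and alternative arms indexed by $i = 1, \ldots, K$ with means $\mu_1, \ldots, \mu_K$, respectively. Let $i_b$ denote the random arm returned by the algorithm assuming that it exits, and define the set

$$S^\star := \left\{ i \ne 0 \ \middle| \ \mu_i \ge \max_{k = 0, \ldots, K} \mu_k - \epsilon \ \text{ and } \ \mu_i > \mu_0 + \epsilon \right\}. \tag{7}$$

Note that the mean associated with any index $i \in S^\star$, assuming that the set is non-empty, is guaranteed to be $\epsilon$-superior to the control mean, and at most $\epsilon$-inferior to the maximum mean over all arms.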

Proposition 2.

Algorithm 1 terminates in finite time with probability one. Furthermore, suppose that the samples from each arm are independent and sub-Gaussian with scale $1$. Then for any $\delta \in (0, 1)$ and $\epsilon \ge 0$, Algorithm 1 has the following guarantees:

  1. Suppose that . Then with probability at least , the algorithm exits with after taking at most time steps with effective gaps

  2. Otherwise, suppose that the set as defined in equation (7) is non-empty. Then with probability at least , the algorithm exits with after taking at most
    time steps with effective gaps

See Section 5.2 for the proof of this claim. Part (a) of Proposition 2 guarantees that when no alternative arm is $\epsilon$-superior to the control arm (i.e. under the null hypothesis), the algorithm stops and returns the control arm after a certain number of samples with probability at least $1 - \delta$, where the sample complexity depends on $\epsilon$-modified gaps between the means $\mu_0$ and $\mu_i$. Part (b) guarantees that if there is in fact at least one alternative that is $\epsilon$-superior to the control arm (i.e. under the alternative), then the algorithm will find at least one of them that is at most $\epsilon$-inferior to the best of all possible arms with the same sample complexity and probability.

Note that the required number of samples in Proposition 2 is comparable, up to log factors, with the well-known results in [11, 12] for the case , with the modified gaps replacing . Indeed, the nearly optimal sample complexity result of [12] implies that the algorithm terminates under settings (a) and (b) after at most ) samples are taken.

In our development to follow, we now bring back the index for experiment $j$, in particular using $P^j_T$ to denote the quantity $P_T$ at any stopping time $T$. Here the stopping time can either be defined by the scientist, or in an algorithmic manner.

3.3 Best-arm MAB interacting with online FDR

After having established null hypotheses and $p$-values in the context of best-arm MAB algorithms, we are now ready to embed them into an online FDR procedure. In the following, we consider $p$-values $P^j := P^j_{T_j}$ for the $j$-th experiment, which is just the $p$-value as defined in equation (6) at the stopping time $T_j$, which depends on $\alpha_j$.

We denote the set of true null and false null hypotheses up to experiment $J$ as $\mathcal{H}_0(J)$ and $\mathcal{H}_1(J)$ respectively, where we drop the argument whenever it’s clear from the context. The variable $R_j$ indicates whether the null hypothesis of experiment $j$ has been rejected, where $R_j = 1$ denotes a claimed discovery that an alternative was better than the control. The false discovery rate (FDR) and modified FDR up to experiment $J$ are then defined as

$$\mathrm{FDR}(J) := \mathbb{E}\left[ \frac{\sum_{j \in \mathcal{H}_0} R_j}{\sum_{i=1}^{J} R_i \vee 1} \right] \quad \text{and} \quad \mathrm{mFDR}(J) := \frac{\mathbb{E}\left[ \sum_{j \in \mathcal{H}_0} R_j \right]}{\mathbb{E}\left[ \sum_{i=1}^{J} R_i \right] + 1}.$$

Here the expectations are taken with respect to distributions of the arm pulls and the respective sampling algorithm. In general, it is not true that control of one quantity implies control of the other. Nevertheless, in the long run (when the law of large numbers is a good approximation), one does not expect a major difference between the two quantities in practice.
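The difference between the two error measures, an expectation of a ratio versus a ratio of expectations, is easy to see on simulated rejection indicators. The helper below is an illustrative sketch: the function name is hypothetical, the expectations are replaced by averages over repeated runs, and the offset of 1 in the mFDR denominator is an assumed convention.

```python
import numpy as np

def fdr_and_mfdr(rejections, is_null):
    """Monte Carlo estimates of FDR and mFDR.

    rejections: (runs, J) array of 0/1 rejection indicators R_j, one row per repetition.
    is_null:    length-J boolean array, True where the null hypothesis is true.
    """
    rejections = np.asarray(rejections)
    is_null = np.asarray(is_null, dtype=bool)
    false_disc = rejections[:, is_null].sum(axis=1)   # false discoveries in each run
    total_disc = rejections.sum(axis=1)               # all discoveries in each run
    fdr = np.mean(false_disc / np.maximum(total_disc, 1))   # mean of per-run ratios
    mfdr = false_disc.mean() / (total_disc.mean() + 1.0)    # ratio of means, offset by 1
    return fdr, mfdr

toy_rejections = [[1, 1, 0, 0],
                  [0, 0, 0, 0],
                  [1, 0, 1, 1]]
toy_is_null = [True, False, False, True]
print(fdr_and_mfdr(toy_rejections, toy_is_null))
```

On this toy input the per-run false discovery proportions are 1/2, 0 and 2/3, so the two estimates differ slightly, which matches the remark that control of one quantity does not in general imply control of the other.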

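The set of true nulls $\mathcal{H}_0$ thus includes all experiments where $H_0^j$ is true, and the FDR and mFDR are well-defined for any number of experiments $J$, since we often desire to control $\mathrm{FDR}(J)$ or $\mathrm{mFDR}(J)$ for all $J \in \mathbb{N}$. In order to measure power, we define the $\epsilon$-best-arm discovery rate as

$$\epsilon\mathrm{BDR}(J) := \frac{\mathbb{E}\left[ \sum_{j \in \mathcal{H}_1} R_j \, \mathbf{1}\{\mu_{i_b} \ge \mu_{i^\star} - \epsilon\} \, \mathbf{1}\{\mu_{i_b} \ge \mu_0 + \epsilon\} \right]}{|\mathcal{H}_1(J)|}.$$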

We provide a concrete procedure, Procedure 2, for our doubly sequential framework, where we use a particular online FDR algorithm due to Javanmard and Montanari [4] known as LORD; the reader should note that other online FDR procedures could be used to obtain essentially the same set of guarantees. Given a desired level $\alpha$, the LORD procedure starts off with an initial “$\alpha$-wealth” of $W(0) < \alpha$. Based on an infinite sequence $\{\gamma_i\}_{i=1}^{\infty}$ that sums to one, and the time $\tau_j$ of the most recent discovery, it uses up a fraction $\gamma_{j - \tau_j}$ of the remaining $\alpha$-wealth to test. Whenever there is a rejection, we increase the $\alpha$-wealth by $\alpha - W(0)$. A feasible choice for a stopping time in practice is $T_j := \min\{T(\alpha_j), T_M\}$, where $T_M$ is a maximal number of samples the scientist wants to pull and $T(\alpha_j)$ is the stopping time of the best-arm MAB algorithm run at confidence $\alpha_j$.

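  1. Initialize $W(0) < \alpha$, set $\tau_0 = 0$, and choose a sequence $\{\gamma_i\}$ s.t. $\sum_{i=1}^{\infty} \gamma_i = 1$

  2. At each step $j$, compute $\alpha_j = \gamma_{j - \tau_j} W(\tau_j)$ and $W(j+1) = W(j) - \alpha_j + R_j (\alpha - W(0))$

  3. Output $\alpha_j$ and run Algorithm 1 using $\alpha_j$-confidence and stop at a stopping time $T_j$.

  4. Algorithm 1 returns $P^j_{T_j}$ and we reject the null hypothesis if $P^j_{T_j} \le \alpha_j$.

  5. Set $R_j = \mathbf{1}\{P^j_{T_j} \le \alpha_j\}$, $\tau_j = \tau_{j-1} \vee j R_j$, update $j = j + 1$ and go back to step 2.

Procedure 2 MAB-LORD: best-arm identification with online FDR control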
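The wealth bookkeeping behind a LORD-style rule can be sketched in a few lines. The code below is one assumed reading of the description above, not the authors' implementation: the level at step $j$ is a fraction of the wealth available right after the most recent discovery, each test costs its level, and each rejection pays back `alpha - w0`. The sequence `gamma(i) = 6 / (pi^2 i^2)` is chosen only because it sums to one, and the $p$-values are supplied in advance, whereas in the full procedure each $p$-value would come from a bandit run at confidence `alpha_j`.

```python
import math

def lord_run(p_values, alpha=0.05, w0=0.025):
    """Return (levels, rejections) for a LORD-style alpha-wealth rule (assumed variant)."""
    def gamma(i):
        return 6.0 / (math.pi ** 2 * i ** 2)   # assumed sequence; sums to one over i >= 1

    wealth = w0              # current alpha-wealth
    wealth_at_reset = w0     # wealth right after the most recent discovery (or at the start)
    last_discovery = 0       # index of the most recent discovery, 0 if none so far
    levels, rejections = [], []
    for j, p in enumerate(p_values, start=1):
        alpha_j = gamma(j - last_discovery) * wealth_at_reset   # spend a fraction of the wealth
        reject = int(p <= alpha_j)
        wealth = wealth - alpha_j + reject * (alpha - w0)       # pay for the test, earn on rejection
        if reject:
            last_discovery = j
            wealth_at_reset = wealth
        levels.append(alpha_j)
        rejections.append(reject)
    return levels, rejections

levels, rejections = lord_run([0.5, 1e-4, 0.5, 0.5])
print([round(a, 5) for a in levels], rejections)
```

Because the fractions sum to one, the levels spent between two discoveries never exceed the wealth available at the last reset, so the wealth stays non-negative; after a discovery the levels jump back up, which is visible in the third level of the toy run.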

The following theorem provides guarantees on mFDR and power for the MAB-LORD procedure.

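Theorem 1 (Online mFDR control for MAB-LORD).
  1. Procedure 2 achieves mFDR control at level $\alpha$ for stopping times $T_j = \min\{T(\alpha_j), T_M\}$.

  2. Furthermore, if we set $T_M = \infty$, Procedure 2 satisfies

$$\epsilon\mathrm{BDR}(J) \ge \frac{\sum_{j=1}^{J} \mathbf{1}\{j \in \mathcal{H}_1\} (1 - \alpha_j)}{|\mathcal{H}_1(J)|}.$$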

The proof of this theorem can be found in Section 5.3. Note that by the arguments in the proof of Theorem 1, mFDR control itself is actually guaranteed for any generalized $\alpha$-investing procedure [3] combined with any best-arm MAB algorithm. In fact we could use any adaptive stopping time $T_j$ which depends on the history only via the rejections $R_1, \ldots, R_{j-1}$. Furthermore, using a modified LORD proposed by Javanmard and Montanari [13], we can also guarantee FDR control, as shown in Appendix B.

It is noteworthy that small values of $\alpha$ not only guarantee smaller FDR error but also higher BDR. However, there is no free lunch: a smaller $\alpha$ implies a smaller $\alpha_j$ at each experiment, which in turn causes the best-arm MAB algorithm to employ a larger number of pulls in each experiment.

4 Experimental results

In the following, we describe the results of experiments on both simulated and real-world data sets to illustrate the properties and guarantees of our procedure described in Section 3. (The code for reproducing all experiments and plots in this paper is publicly available at https://github.com/fanny-yang/MABFDR.) In particular, we show that the mFDR is indeed controlled over time and that MAB-FDR (used interchangeably with MAB-LORD here) is highly advantageous in terms of sample complexity and power compared to a straightforward extension of A/B testing that is embedded in online FDR procedures. Unless otherwise noted, we set $\epsilon = 0$ in all of our simulations to focus on the main ideas and keep the discussion concise.

There are two natural frameworks to compare against MAB-FDR. The first, called AB-FDR or AB-LORD, swaps the MAB part for an A/B (i.e. A/B/n) test (uniformly sampling all alternatives until termination). The second comparator swaps the online FDR control for independent testing at level $\alpha$ for all hypotheses; we call this MAB-IND. Formally, AB-FDR swaps step 3 in Procedure 2 with “Output $\alpha_j$ and uniformly sample each arm until stopping time $T_j$”, while MAB-IND swaps step 4 in Procedure 2 with “The algorithm returns $P^j_{T_j}$ and we reject the null hypothesis if $P^j_{T_j} \le \alpha$”. In order to compare the performances of these procedures, we ran three sets of simulations using Procedure 2 with $\epsilon = 0$ and the sequence $\{\gamma_j\}$ as in [4]. The first two sets are on artificial data (Gaussian and Bernoulli draws from sets of randomly drawn means $\mu_i$), while the third is based on data from the New Yorker Cartoon Caption Contest (Bernoulli draws).

Our experiments are run on artificial data with Gaussian/Bernoulli draws and real-world Bernoulli draws from the New Yorker Cartoon Caption Contest. Recall that the sample complexity of the best-arm MAB algorithm is determined by the gaps $\Delta_i = \mu_{i^\star} - \mu_i$. One of the main relevant differences to consider between an experiment of artificial or real-world nature is thus the distribution of the means $\mu_i$ for $i = 1, \ldots, K$. The artificial data simulations are run with a fixed gap between the mean of the best arm $\mu_{i^\star}$ and second best arm $\mu_2$, which we denote by $\Delta = \mu_{i^\star} - \mu_2$. In each experiment (hypothesis), the means of the other arms are set uniformly in $[0, \mu_2]$. For our real-world simulations with the cartoon contest, the means for the arms in each experiment are not arbitrary but correspond to empirical means from the caption contest. In addition, the contests actually follow a natural chronological order (see details below), which makes this dataset highly relevant to our purposes. In all simulations, 60% of all the hypotheses are true nulls, and their indices are chosen uniformly.

4.1 Power and sample complexity

The first set of simulations compares MAB-FDR against AB-FDR. They confirm that the total number of necessary pulls to determine significance (which we refer to as sample complexity) is much smaller for MAB-FDR than for AB-FDR. In the MAB-FDR framework, this also effectively leads to higher power given a fixed truncation time.

Two types of plots are used to demonstrate the superiority of our procedure: for one we fix the number of arms and plot the with (which we call BDR for short) for both procedures over different choices of truncation times . For the other we fix and show how the sample complexity varies with the number of arms. Note that low BDR means that the bandit algorithm often reaches truncation time before it could stop.

4.1.1 Simulated Gaussian and Bernoulli trials

For the Gaussian draws, we set . The gap to the second best arm is so that all means are drawn uniformly between . The number of hypotheses is fixed to be . For Bernoulli draws we choose the maximum mean to be , so that all means are drawn uniformly between . The number of hypotheses is fixed at . We display the empirical average over runs where each run uses the same hypothesis sequence (indicating which hypotheses are true and false) and sequence of means for each hypothesis. The only randomness we average over comes from the random Gaussian/Bernoulli draws which cause different rejections and , so that the randomness in each draw propagates through the online FDR procedure. The results can be seen in Figures 3 and 4.

Figure 3: (a) Power vs. truncation time (per hypothesis) for arms and (b) Sample complexity vs. # arms for truncation time for Gaussian draws with fixed , over hypotheses with non-nulls, averaged over runs.

The power at any given truncation time is much higher for MAB-FDR than AB-FDR. This is because the best-arm MAB is more likely to satisfy the stopping criterion before any given truncation time than the uniform sampling algorithm. The plot in Fig. 3(a) suggests that the actual stopping time of the algorithm is concentrated between and while it is much more spread out for the uniform algorithm.
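The contrast between adaptive and uniform sampling under a truncation time can be illustrated with a small sketch. It uses plain successive elimination with a simple union-bound confidence radius, not the LIL-based Algorithm 1 of the paper, and it does not treat the control arm specially. The function names, the form of the radius, and the reading of the truncation as a number of rounds are assumptions made for this example.

```python
import math
import numpy as np

def confidence_radius(t, num_arms, delta):
    """Union-bound radius for a 1-sub-Gaussian mean after t samples (assumed form)."""
    return math.sqrt(2.0 * math.log(4.0 * num_arms * t * t / delta) / t)

def best_arm_with_truncation(means, delta, truncation, adaptive, seed=0):
    """Return (identified arm or None, total pulls) with at most `truncation` rounds."""
    rng = np.random.default_rng(seed)
    k = len(means)
    # Pre-draw the rewards so that both sampling schemes see identical data.
    rewards = rng.normal(loc=means, scale=1.0, size=(truncation, k))
    sums = np.zeros(k)
    active = np.ones(k, dtype=bool)
    pulls = 0
    for t in range(1, truncation + 1):
        sums += rewards[t - 1]
        # Adaptive sampling only pays for arms still in contention;
        # uniform sampling keeps pulling every arm in every round.
        pulls += int(active.sum()) if adaptive else k
        avg = sums / t
        rad = confidence_radius(t, k, delta)
        best_lower = np.max(avg[active] - rad)
        # Drop arms whose upper bound falls below the best lower bound.
        active &= (avg + rad >= best_lower)
        if active.sum() == 1:
            return int(np.argmax(active)), pulls
    return None, pulls  # truncated before the stopping criterion was met
```

Because both schemes are fed the same pre-drawn rewards, they stop in the same round and return the same arm; the adaptive variant simply spends fewer pulls on arms that were eliminated early. Under a budget on total pulls rather than rounds, this saving is what lets the adaptive scheme stop before truncation more often.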

Figure 4: (a) Power over truncation time (per hypothesis) for arms and (b) Sample complexity over number of arms for truncation time for Bernoulli draws with fixed , over hypotheses with non-nulls, averaged over runs.

The sample complexity plot in Fig. 3(b) qualitatively shows how the total number of necessary arm pulls for AB-FDR increases much faster with the number of arms than for the MAB-FDR, before it plateaus at the truncation time multiplied by the number of hypotheses. Recall that whenever the best-arm MAB stops before the truncation time in each hypothesis, the stopping criterion is met, i.e. the best arm is identified with probability at least , so that the power is bound to be close to one whenever .

Apart from these parameter choices, the experimental setup for Bernoulli draws is identical to the one described above for Gaussians. The plots for Bernoulli data can be found in Fig. 4.

The behavior for Gaussian and Bernoulli draws is comparable, which is not surprising given the choice of the sub-Gaussian LIL bound. However, one may notice that the choice of the gap of vs. drastically increases sample complexity, so that the phase transition for power is shifted to very large .

4.1.2 Application to New Yorker captions

In the simulations with real data we consider the crowd-sourced data collected for the New Yorker Magazine’s Cartoon Caption contest: for a fixed cartoon, captions are shown to individuals online one at a time and they are asked to rate them as ‘unfunny’, ‘somewhat funny’, or ‘funny’. We considered 30 contests (contest numbers 520-551, excluding 525 and 540 as they were not present; the full dataset and its description are available at https://github.com/nextml/NEXT-data/), where for each contest we computed the fraction of times each caption was rated as either ‘somewhat funny’ or ‘funny’. We treat each caption as an arm, but because each caption was only shown a finite number of times in the dataset, we simulate draws from a Bernoulli distribution with the observed empirical mean computed from the dataset. When considering subsets of the arms in any given experiment, we always use the captions with the highest empirical means (i.e. if then we use the captions that had the highest empirical means in that contest).
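The construction of arms from the caption ratings can be sketched as below. The function names and the input layout (one count per rating category and caption) are hypothetical and do not reflect the actual format of the NEXT dataset.

```python
import numpy as np

def funny_fraction(unfunny, somewhat_funny, funny):
    """Per-caption fraction of ratings that were 'somewhat funny' or 'funny'."""
    unfunny = np.asarray(unfunny, dtype=float)
    positive = np.asarray(somewhat_funny, dtype=float) + np.asarray(funny, dtype=float)
    return positive / (positive + unfunny)

def top_k_arms(empirical_means, k):
    """Indices of the k captions with the highest empirical means."""
    order = np.argsort(empirical_means)[::-1]
    return order[:k]

def pull(arm_mean, rng):
    """Simulate one Bernoulli rating for a caption with the given empirical mean."""
    return int(rng.random() < arm_mean)
```

For example, three captions with rating counts (8, 1, 1), (5, 3, 2) and (1, 4, 5) have empirical means 0.2, 0.5 and 0.9, and an experiment with two arms would use the last two.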

Although MAB-FDR still outperforms AB-FDR by a large margin, the plots in Figure 5 also show that the power and sample complexity differ notably from our toy simulation, where we seem to have chosen a rather benign distribution of means; in this setting, the gap is much lower, often around .

Figure 5: (a) BDR over number of arms, i.e. truncation time per hypothesis for arms and (b) Sample complexity over number of arms for truncation time for Bernoulli draws, hypotheses with non-nulls and averaged over runs.

4.2 mFDR and FDR control

In this section we use simulations to demonstrate the second part of our meta-algorithm, which deals with the control of the false discovery rate or its modified version. Since bandit algorithms have a very high best-arm discovery guarantee, which in practice even exceeds its theoretical guarantee of at least , plotting mFDR and FDR for MAB-FDR directly is not very insightful: the result is simply the constant line. However, we can demonstrate that even under adversarial conditions, i.e. when the p-value under the null is much less concentrated around one than the one obtained via the best-arm bandit algorithm, mFDR or the false discovery proportion (FDP) in each run is still controlled at any time, as Theorem 1 guarantees. Although these experiments do not exactly reflect mFDR control for MAB-FDR but rather an even harder setting, their results are valuable in their own right: they emphasize that Theorem 1 guarantees mFDR control independently of the adaptive sampling algorithm and of the specific choice of p-value, as long as it is always valid.

For Figure 6, we again consider Gaussian draws with the same settings as described in 4.1. This time, however, for each true null hypothesis we skip the bandit experiment and directly draw to compare with the significance levels from our online FDR Procedure 2. As mentioned above, by Theorem 1, mFDR should still be controlled, as it only requires the p-values to be super-uniform. In Figure 6(a) we plot the instantaneous false discovery proportion (number of false discoveries over total discoveries) over the hypothesis index for different runs with the same settings. Apart from fluctuations in the beginning due to the relatively small denominator, we can observe how the guarantee for the , with its empirical value depicted by the red line, transfers to the control of each individual run (blue lines).
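The instantaneous false discovery proportion plotted over the hypothesis index can be computed as in the following sketch. The function name and the convention of reporting an FDP of zero before the first discovery are assumptions made for this example.

```python
import numpy as np

def running_fdp(rejected, is_null):
    """FDP after each hypothesis: false discoveries so far over discoveries so far."""
    rejected = np.asarray(rejected, dtype=bool)
    is_null = np.asarray(is_null, dtype=bool)
    discoveries = np.cumsum(rejected)
    false_discoveries = np.cumsum(rejected & is_null)
    # Assumed convention: the FDP is 0 while no discovery has been made.
    return false_discoveries / np.maximum(discoveries, 1)
```

The small denominator in the first few entries is what produces the early fluctuations mentioned above: a single false discovery among two rejections already gives an FDP of 0.5.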

In Figure 6(b), we compare the mFDR (which in fact coincides with the FDR in this plot) of MAB-FDR using different multiple testing procedures, including MAB-IND and a Bonferroni-type correction. The latter uses a simple union bound and chooses such that and thus trivially allows for any-time FWER control, and hence FDR control. In our simulations we use . As expected, Bonferroni is too conservative and barely makes any rejections, whereas the naive MAB-IND approach does not control FDR. LORD avoids both extremes and controls FDR while having reasonable power.
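The two extremes in this comparison can be sketched with fixed threshold sequences. The specific sequence 6/(π² j²), which sums to one over all j, is an assumed choice for the Bonferroni-type levels and is not taken from the text; the function names are likewise hypothetical.

```python
import math

def bonferroni_levels(num_hypotheses, alpha):
    """Levels alpha * gamma_j with gamma_j = 6 / (pi^2 j^2), an assumed sequence."""
    return [alpha * 6.0 / (math.pi ** 2 * j ** 2) for j in range(1, num_hypotheses + 1)]

def independent_levels(num_hypotheses, alpha):
    """Every hypothesis is tested at the same level alpha, with no correction."""
    return [alpha] * num_hypotheses

def rejections(p_values, levels):
    """1-based indices of hypotheses whose p-value is at most its level."""
    return [j for j, (p, a) in enumerate(zip(p_values, levels), start=1) if p <= a]
```

Since the Bonferroni-type levels sum to at most alpha, a union bound controls the probability of any false rejection, but the levels shrink quickly with the hypothesis index. Independent testing rejects at least as often on the same p-values and offers no such guarantee.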

Figure 6: (a) Single runs of MAB-LORD (blue) and their average (red) with uniformly drawn p-values for null hypotheses and Gaussian draws for non-nulls with , and , hypotheses with true nulls and arms, the desired mFDR level is (b) mFDR over different proportions of non-nulls , with same settings, averaged over runs.

5 Proofs

In this section we provide the proofs of the main results in the paper.

5.1 Proof of Proposition 1

For any fixed , we have the equivalence

If , then we have

by equation (4). Thus, we have , which completes the proof.

5.2 Proof of Proposition 2

Here we prove that Algorithm 1 terminates in finite time. The technical proof of the sample complexity is deferred to Appendix C. It suffices to argue for and we discuss the other case at the end.

Proof of termination in finite time

First we prove by contradiction that the algorithm terminates in finite time with probability one for the case .

Assuming that there exist runs for which the algorithm does not terminate, the set of arms defined by

is necessarily non-empty for these runs. We now show that this assumption yields a contradiction so that


First take note that by definition of the algorithm, if an arm is drawn infinitely often (i.o.), then so is the control arm and we have as well as as . This follows by the law of large numbers combined with the fact that as , since as . Since for the null hypothesis we have , it follows that for all for some .

This argument implies that all arms can only be drawn a finite number of times, i.e. for all . However, the fact that they are not drawn i.o. implies that and i.o. for all , so that there exists such that i.o. By definition of we then obtain


However, since , inequality (12) cannot hold and equation (11) is proved.

A nearly identical argument to the above shows that the stopping condition is met in finite time.

5.3 Proof of Theorem 1

We now turn to the proof of Theorem 1, splitting our argument into parts (a) and (b), respectively.

5.3.1 Proof of part (a)

In order for generalized alpha-investing procedures such as LORD to successfully control the mFDR, it is sufficient that p-values under the null be conditionally super-uniform, meaning that for all , we have


where is the σ-field induced by . Note that as long as condition (13) is satisfied, and thus could potentially depend on , i.e. the rejection indicator variables and potentially . See Aharoni and Rosset [3] for further details.

It thus suffices to show that condition (13) holds for our definition of p-value in our framework. We know that by Proposition 1 we have for any random stopping time, thus any fixed truncation time , that . We now show that the same bound also holds for the (-dependent) bandit stopping time , i.e. that .

Under the null hypothesis, the best arm is at most better than the control arm, i.e. , so that by Proposition 2 we have that with probability , , i.e. for all . Hence, , and thus, by the definition of the p-values, for all with probability . It finally follows that .

Putting things together, under the true null hypothesis (omitting the index to simplify notation) we directly have that for any

for all fixed even when the stopping time is dependent on . This is equivalent to stating that for any sequence we have

and the proof is complete.

5.3.2 Proof of part (b)

It suffices to prove that for a single experiment and , we have where is the distribution of a non-null experiment . First observe that at stopping time of Algorithm 1, either or for all . The former event happens whenever the algorithm exits with , i.e. when holds. Then, by definition of the p-value in equation (6) and we must have that . As a consequence, by Proposition 2, we have

and the proof is complete.

6 Discussion

The recent focus in popular media about the lack of reproducibility of scientific results erodes the public’s confidence in published scientific research. To maintain high standards of published results and claimed discoveries, simply increasing the statistical significance standards of each individual experimental work (e.g., reject at level 0.001 rather than 0.05) would drastically hurt power. We take the alternative approach of controlling the ratio of false discoveries to claimed discoveries at some desired value (e.g., 0.05) over many sequential experiments. This means that the statistical significance for validating a discovery changes from experiment to experiment, and could be larger or smaller than 0.05, requiring less or more data to be collected. Unlike earlier works on online FDR control, our framework synchronously interacts with adaptive sampling methods like MABs over uniform sampling to make the overall sampling procedure as efficient as possible. We do not know of other works in the literature combining the benefits of adaptive sampling and FDR control. It should be clear that any improvement, theoretical or practical, to either online FDR algorithms or best-arm identification in MAB (or their variants), immediately results in a corresponding improvement for our MAB-FDR framework.

More general notions of FDR with corresponding online procedures have recently been developed by Ramdas et al. [14]. In particular, they incorporate the notion of memory and a priori importance of each hypothesis. This could prove to be a valuable extension for our setting, especially in cases when only the percentage of wrong rejections in the recent past matters. It would be useful to establish FDR control for these generalized notions of FDR as well.

There are several directions that could be explored in future work. First, it would be interesting to extend the MAB aspect (in which each arm is univariate) of our framework to more general settings. Balasubramani and Ramdas [7] show how to construct sequential tests for many multivariate nonparametric testing problems, using LIL confidence intervals, which can again be inverted to provide always valid p-values. It might be of interest to marry the ideas in our paper with theirs. For example, the null hypothesis might be that the control arm has the same (multivariate) mean as other arms (-sample testing), and under the alternative, we would like to pick the arm whose mean is furthest away from the control. A more complicated example could involve dependence, where we observe pairs of arms, and the null hypothesis is that the rewards in the control arm are independent of the alternatives, and if the null is false we may want to pick the most correlated arm. The work by Zhao et al. [15] on tightening LIL-bounds could be practically relevant. Recent work on sequential p-values by Malek et al. [16] also fits naturally into our framework. Lastly, in this work we treat samples or pulls from arms as identical from a statistical perspective; it might be of interest in subsequent work to extend our framework to the contextual bandit setting, in which the samples are associated with features to aid exploration.


This work was partially supported by Office of Naval Research MURI grant DOD-002888, Air Force Office of Scientific Research Grant AFOSR-FA9550-14-1-001, and National Science Foundation Grants CIF-31712-23800 and DMS-1309356.


Appendix A Notation

Notation Terminology and explanation
MAB (pure exploration for best-arm identification in) multi-armed bandits
the expected ratio of # false discoveries to # discoveries up to experiment
the ratio of expected # false discoveries to expected # discoveries
target for FDR or mFDR control after any number of experiments
the best arm discovery rate (generalization of test power)
the -best arm discovery rate (softer metric than BDR)
the lower and upper confidence bounds used in the best-arm algorithms
experiment counter (number of MAB instances)
stopping time for the -th experiment
always valid p-value after time (in experiment , explicit or implicit)
always valid p-value for experiment at its stopping time
threshold set by the online FDR algorithm for , using
stopping time for the -th experiment, when experiment uses
0 the control or default arm
alternatives or treatment arms (experiment implicit)
options or “all arms”
the best of all arms, and the arm returned by MAB
the mean of the -th arm, and the mean of the best arm
total number of pulls, number of times arm is pulled up to time
Table 1: Common notation used throughout the paper.

Appendix B Notes on FDR control

We can prove FDR control for our framework using the specific online FDR procedure called LORD ’15 introduced in [13]. When used in Procedure 2, the only adjustment that needs to be made is to reset to in step 2 after every rejection, yielding for any sequence such that . We call the adjusted procedure MAB-LORD’ for short.
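The reset mechanism can be illustrated with a short sketch. It reflects one reading of a LORD ’15-style rule, in which the level at step j depends only on the time elapsed since the last rejection through a nonnegative sequence that sums to one; the exact definition should be taken from [13]. The sequence 6/(π² m²), the function names, and the simplification that the p-values are given up front (in MAB-FDR each p-value is produced by a bandit run that itself uses the current level) are assumptions made for this example.

```python
import math

def gamma(m):
    """A nonnegative sequence that sums to one over m = 1, 2, ... (assumed choice)."""
    return 6.0 / (math.pi ** 2 * m ** 2)

def lord15_style(p_values, alpha):
    """Return (levels, rejected indices), restarting the sequence after each rejection."""
    last_rejection = 0  # index of the most recent rejection, 0 if there is none yet
    levels, rejected = [], []
    for j, p in enumerate(p_values, start=1):
        # The level depends only on the time since the last rejection.
        level = alpha * gamma(j - last_rejection)
        levels.append(level)
        if p <= level:
            rejected.append(j)
            last_rejection = j
    return levels, rejected
```

In this sketch every level is at most alpha times the first element of the sequence, which illustrates why such a rule can be less powerful than variants whose levels are allowed to grow after rejections.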

Theorem 2 (Online FDR control for MAB-LORD).
  1. MAB-LORD’ achieves mFDR and FDR control at a specified level for stopping times .

  2. Furthermore, if we set , MAB-LORD’ satisfies


Note that LORD as in [13] is less powerful than in [4] since the values in the former can be much smaller than those in [4], which could in fact exceed the level . Therefore, for FDR control we currently do have to sacrifice some power.


Proof. We leverage the following proposition, which can be obtained from a slightly more careful analysis of the procedure than in [13].

Proposition 3.

If , i.e. the distribution of the p-values under the null is super-uniform conditioned on the last rejection, then the online LORD ’15 procedure controls the FDR at each .

Note that this proposition allows online FDR control for any, possibly dependent, p-values which are conditionally super-uniform. This condition is not equivalent to (13) in general; it is in fact less restrictive, since the probability is conditioned only on a function of all past rejections. Formally, the sigma algebra induced by is contained in and hence by the tower property. Finally, utilizing the fact that our p-values are conditionally super-uniform as proven in Section 5.3.1, i.e. inequality (13) holds, the condition for Proposition 3 is fulfilled and the proof is complete. ∎

B.1 Proof of Proposition 3

Let denote the time of the -th rejection with (note that this is different from ), and define . Let be the th hypothesis that was rejected. We adjust an argument from [13].

First observe that and so that

Since for the LORD ’15 procedure, we have , and thus for all positive integers , the random variables with are conditionally independent of given . Additionally noting that for all by definition of and , using