Adaptive Monte Carlo Multiple Testing via Multi-Armed Bandits

02/01/2019 ∙ by Martin J. Zhang, et al.

Monte Carlo (MC) permutation testing is considered the gold standard for statistical hypothesis testing, especially when standard parametric assumptions are unclear or likely to fail. However, in modern data science settings where a large number of hypothesis tests need to be performed simultaneously, it is rarely used due to its prohibitive computational cost. In genome-wide association studies, for example, the number of hypothesis tests m is around 10^6 while the number of MC samples n for each test could be greater than 10^8, totaling more than nm=10^14 samples. In this paper, we propose Adaptive MC Testing (AMT) to estimate MC p-values and control the false discovery rate in multiple testing. The algorithm outputs the same result as the standard full MC approach with high probability while requiring only O(√(n)m) samples. This sample complexity is shown to be optimal. On a Parkinson's disease GWAS dataset, the algorithm reduces the running time from 2 months for full MC to an hour. The AMT algorithm is derived based on the theory of multi-armed bandits.




1 Introduction

Monte Carlo (MC) permutation testing is considered the gold standard for statistical hypothesis testing. It has the broad advantage of estimating significance non-parametrically, thereby safeguarding against inflated false positives (Dwass, 1957; Davison et al., 1997; Boos & Zhang, 2000; Lehmann & Romano, 2006; Phipson & Smyth, 2010). It is especially useful in cases where the distributional assumption of the data is not apparent or likely to be violated.

A good example is genome-wide association study (GWAS), whose goal is to identify associations between the genotypes (single nucleotide polymorphisms or SNPs) and the phenotypes (traits) (Visscher et al., 2017). For testing the association between a SNP and the phenotype, the p-value is often derived via closed-form methods like the analysis of variance (ANOVA) or the Pearson's Chi-square test (Purcell et al., 2007). However, these methods rely on certain assumptions on the null distribution, the violation of which can lead to a large number of false positives (Yang et al., 2014; Che et al., 2014). The MC permutation test does not require such distributional assumptions and is preferable in such cases from a statistical consideration (Gao et al., 2010). However, the main challenge of applying the MC permutation test to GWAS is computational.

The MC permutation test is a special type of MC test where the p-values are estimated by MC sampling from the null distribution — the permutation test computes such MC samples by evaluating the test statistic on the data points but with the responses (labels) randomly permuted. Let t be the observed test statistic and t_1, ..., t_n be n independently and identically distributed (i.i.d.) test statistics randomly generated under the null hypothesis. The MC p-value is written as

P^MC(n) = (1 + Σ_{i=1}^n 1{t_i ≥ t}) / (n + 1),

which conservatively^1 estimates the ideal p-value P*. In addition, P^MC(n) converges to the ideal p-value as the number of MC samples n → ∞.

^1 Under the null hypothesis, where the ideal p-value P* follows the uniform distribution, the corresponding MC p-value P^MC(n) is stochastically greater than the uniform distribution.

GWAS is an example of large-scale multiple testing: each SNP is tested for association with the phenotype, and there are many SNPs to test. For performing m such tests simultaneously, the data is collected and each of the m null hypotheses is associated with an ideal p-value P*_j (Fig.1a). A common practice, as visualized in Fig.1b, is to first compute an MC p-value for each test using n MC samples and then apply a multiple testing procedure to the set of MC p-values to control the false positives, e.g., using the Bonferroni procedure (Dunn, 1961) or the Benjamini-Hochberg procedure (BH) (Benjamini & Hochberg, 1995). Here, as folklore^2, the number of MC samples n is usually chosen to be at least 10 or 100 times the number of tests m. In GWAS, there are around 10^6 SNPs to be examined simultaneously via multiple testing and n is recommended to be at least 10^8 (Johnson et al., 2010). The total number of MC samples is nm = 10^14, infeasible to compute.

^2 As an intuition, note that the MC p-values are no less than 1/(n+1). Since the Bonferroni threshold is α/m, to have any potential rejections, n needs to be at least m/α; the minimum requirement for BH is slightly smaller. However, to account for the randomness of MC sampling, n is usually chosen to be a few orders of magnitude greater than the minimum requirement.

Figure 1: Workflow. (a) N data samples are collected; each data point has a response (label) and m inputs (features). There are m null hypotheses to test; the j-th null hypothesis corresponds to no association between the j-th input and the response. (b) The standard fMC workflow is to: 1) compute an MC p-value for each test using n MC samples; 2) apply the BH procedure on the set of MC p-values to control FDR. In this example, the test statistic is the correlation between the j-th input and a randomly permuted response. (c) The fMC result is to make a discovery (claim association) for a subset of inputs. (d) AMT directly estimates the fMC p-values by adaptive MC sampling and recovers the fMC testing result with high probability.

This work considers the standard full MC (fMC) workflow, as shown in Fig.1b, of first computing p-values with MC sampling and then controlling the false discovery rate (FDR) by applying the BH procedure to the set of MC p-values. The aim is to reduce the number of MC samples while obtaining the same fMC testing result. The focus of the present paper is on solving a computational problem, i.e., accelerating the standard fMC workflow, rather than a statistical problem, e.g., improving the power of the test. An alternative goal may be to recover the BH discoveries on the ideal p-values, which is an ill-posed problem that may take unrealistically many MC samples. Recovering the fMC result, however, takes at most nm samples, and any improvement over the nm complexity of uniform sampling represents an improvement over the standard workflow.

Contribution. We propose Adaptive MC multiple Testing (AMT) to compute the fMC testing result via adaptive MC sampling. While recovering the fMC result with high probability, it effectively improves the sample complexity^3 from nm to Õ(√(n)m) under mild assumptions that encompass virtually all practical cases. A matching lower bound is provided. On a GWAS dataset on Parkinson's disease, it improves the computational efficiency by 2-3 orders of magnitude, reducing the running time from 2 months to an hour. We note that AMT is not specific to the MC permutation test; it can be used for MC tests in general.

^3 Õ hides logarithmic factors.

The fMC procedure computes n MC samples for each of the m null hypotheses. For each null hypothesis, a randomly selected subset of the MC samples can provide an estimate of its fMC p-value, where the size of this subset determines the estimation accuracy. Intuitively, to recover the fMC result, we only need to estimate how each fMC p-value compares with the corresponding BH threshold; hypotheses with p-values far away from the threshold can be estimated less accurately, thus requiring fewer MC samples. AMT turns this purely computational fMC procedure into a statistical estimation problem, where adaptive sampling can be used.

The specific adaptive sampling procedure is developed via a connection to the pure exploration problem in multi-armed bandits (MAB) (Audibert & Bubeck, 2010; Jamieson et al., 2014). Specifically, the Top-k identification problem (Kalyanakrishnan et al., 2012; Chen et al., 2017; Simchowitz et al., 2017) aims to identify the k best arms via adaptive sampling. For AMT, we can think of the null hypotheses as arms, fMC p-values as arm parameters, and MC samples as observations for each arm. Then recovering the fMC result corresponds to identifying a subset of best arms with small p-values. The difference is that the size of this subset is not known ahead of time — it is a function of the fMC p-values that needs to be learned from data. Nonetheless, the techniques from MAB are borrowed to develop AMT.

1.1 Background

Permutation test. Consider testing the association between input and response using N data samples, i.e., the input vector x = (x_1, ..., x_N) and the response vector y = (y_1, ..., y_N). A reasonable test statistic can be the Pearson's correlation T(x, y). Let π be a permutation on [N] and Π be the set of all N! possible permutations. The permutation test statistic obtained by permuting the response with π can be written as T(x, π(y)). Under the null hypothesis that the response is exchangeable among samples, the rank of the observed test statistic among all N! permutation test statistics is uniformly distributed. Hence, the permutation p-value

P^perm = (1/N!) Σ_{π ∈ Π} 1{T(x, π(y)) ≥ T(x, y)}

follows a uniform distribution over the support {1/N!, 2/N!, ..., 1}. In most cases, the sample size N is too large for computing all N! possible permutations; the MC permutation test is used instead, where the permutations are uniformly sampled from Π.
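The MC permutation test just described can be sketched in a few lines. This is an illustrative helper (the function name and defaults are our own, not from the paper): the statistic is the absolute Pearson correlation, the null is simulated by permuting the response, and the "+1" makes the p-value conservative.

```python
import numpy as np

def mc_permutation_pvalue(x, y, n_mc=10_000, seed=0):
    """Conservative MC permutation p-value for a correlation-based
    association test; a sketch, not the paper's implementation."""
    rng = np.random.default_rng(seed)
    t_obs = abs(np.corrcoef(x, y)[0, 1])          # observed test statistic
    exceed = 0
    for _ in range(n_mc):
        # statistic under a random permutation of the response
        t_perm = abs(np.corrcoef(x, rng.permutation(y))[0, 1])
        exceed += t_perm >= t_obs                 # count as-extreme statistics
    # the "+1" keeps the p-value stochastically >= Unif[0,1] under the null
    return (1 + exceed) / (n_mc + 1)
```

Note that the smallest attainable p-value is 1/(n_mc + 1), which is why n must be large when testing at stringent thresholds.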

FDR control. For simultaneously testing m null hypotheses with p-values P_1, ..., P_m, a common goal is to control FDR, defined as the expected proportion of false discoveries:

FDR = E[ (# false discoveries) / ((# discoveries) ∨ 1) ].

The most widely-used FDR control algorithm is the BH procedure (Benjamini & Hochberg, 1995). Let P_(r) be the r-th smallest p-value. The BH procedure rejects the hypotheses with p-values no greater than P_(r*), where r* is the critical rank defined as

r* = max{ r : P_(r) ≤ rα/m }.

The BH procedure controls FDR under the assumption that the null p-values are independent and stochastically greater than the uniform distribution.
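The BH procedure above is short enough to sketch directly; the helper name is our own:

```python
import numpy as np

def bh_rejections(pvals, alpha=0.1):
    """Benjamini-Hochberg: reject the hypotheses with p-value <= P_(r*),
    where r* = max{r : P_(r) <= r * alpha / m}."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # indices r (0-based) where the sorted p-value is below the BH line
    below = np.nonzero(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    if below.size == 0:
        return np.zeros(m, dtype=bool)            # no discoveries
    r_star = below[-1] + 1                        # critical rank (1-indexed)
    reject = np.zeros(m, dtype=bool)
    reject[order[:r_star]] = True                 # reject the r* smallest
    return reject
```

Note that all p-values below P_(r*) are rejected even if some of them individually lie above the line rα/m; this is the "max" in the definition of the critical rank.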

1.2 Related works

The idea of algorithm acceleration by converting a computational problem into a statistical estimation problem and designing the adaptive sampling procedure via MAB has witnessed a few successes. An early example of such works is the Monte Carlo tree search method (Chang et al., 2005; Kocsis & Szepesvári, 2006) to solve large-scale Markov decision problems, a central component of modern game playing systems like AlphaZero (Silver et al., 2017). More recent examples include adaptive hyper-parameter tuning for deep neural networks (Jamieson & Talwalkar, 2016; Li et al., 2016) and medoid computation (Bagaria et al., 2018a). The latter work gives a clear illustration of the power of such an approach. The medoid of a set of N points is the point in the set with the smallest average distance to the other points. The work shows that by adaptively estimating instead of exactly computing the average distance for each point, the computational complexity can be improved from the O(N^2) of the naive method to almost linear in N. This idea is further generalized in AMO (Bagaria et al., 2018b), which considers optimizing an arbitrary objective function over a finite set of inputs. In all these works, the adaptive sampling is done by standard best-arm identification algorithms. The present work also accelerates the fMC procedure by turning it into a statistical estimation problem. However, no MAB algorithm is readily available for this particular problem.

Our work applies MAB to FDR control by building an efficient computational tool to run the BH procedure given the data. There are recent works that also apply MAB to FDR control, but in a statistical inference setting where the data collection process itself can be made adaptive over the different tests. In these works, each arm also corresponds to a test, but each arm parameter takes on a value that corresponds to either null or alternative. Fresh data can be adaptively sampled for each arm and the goal is to select a subset of arms while controlling FDR (Yang et al., 2017; Jamieson & Jain, 2018). In such settings, each observation is a new data point and the p-values for the alternative hypotheses can be driven to zero. This is different from AMT, where the arm observations are MC samples simulated from the data. As a result, the fMC p-values themselves are the arm parameters and the goal is to compute them efficiently to perform BH. In an application like GWAS, where all the SNP data are typically collected simultaneously via whole genome sequencing, adaptive data collection does not apply, but overcoming the computational bottleneck of the full MC procedure is an important problem addressed by the present work. See more details of bandit FDR in SubSec. 3.3.

The problem of multiple testing with MC p-values has been studied in the broader statistical literature. Interesting heuristic adaptive algorithms were proposed without formal FDR guarantee (Sandve et al., 2011; Gandy & Hahn, 2017); the latter was developed via modifying Thompson sampling, another MAB algorithm. Asymptotic results were provided showing that the output of the adaptive algorithms converges to the desired set of discoveries (Guo & Peddada, 2008; Gandy & Hahn, 2014, 2016). Specifically, the most recent work (Gandy & Hahn, 2016) provided a general result that incorporates virtually all popular multiple testing procedures. However, none of the above works provides a standard FDR control guarantee (e.g., FDR ≤ α) or an analysis of the MC sample complexity; the MC sample complexity was analyzed in another work only for the case of the Bonferroni procedure (Hahn, 2015). In the present work, a standard FDR control guarantee is provided, as well as both upper and lower bounds on the MC sample complexity, establishing the optimality of AMT.

There have been works on fast permutation testing for GWAS (Pahl & Schäfer, 2010; Kimmel & Shamir, 2006; Browning, 2008; Jiang & Salzman, 2012); they consider a different goal, which is to accelerate the process of separately computing each MC p-value. In contrast, AMT accelerates the entire workflow of both computing MC p-values and applying BH on them, where the decision for each hypothesis also depends globally on the others. The state-of-the-art method is the sequential Monte Carlo procedure (sMC) that is implemented in the popular GWAS package PLINK (Besag & Clifford, 1991; Purcell et al., 2007; Che et al., 2014). For each hypothesis, it keeps MC sampling until it has observed a prespecified number h of extreme events or hit the sampling cap n. Then BH is applied on the set of sMC p-values. Here we note that the sMC p-values are conservative, so this procedure controls FDR. sMC is discussed and thoroughly compared against in the rest of the paper.

(a) Initialization
(b) In progress
(c) Termination
Figure 2: Progression of AMT. AMT maintains upper and lower CBs for each hypothesis (vertical grey bar). (a) At initialization, the estimated BH threshold is set to be maximum while all CBs cross it. Thus, all hypotheses are uncertain and need to be further sampled (red triangle). (b) As the algorithm progresses, more MC samples narrow the confidence intervals and some hypotheses become certain to be greater (green circle) or less (blue diamond) than the estimated BH threshold. The estimated BH threshold also moves down accordingly. (c) At termination, there is no uncertain hypothesis.

2 Problem Formulation

Let m be the number of hypotheses and P*_1, ..., P*_m be the ideal p-values. We use the standard notation [m] = {1, ..., m}. For two numbers a and b, a ∨ b means max(a, b) while a ∧ b means min(a, b).

For each hypothesis j ∈ [m], we assume the MC samples are available of the form

X_{j,1}, X_{j,2}, ... i.i.d. ~ Bern(P*_j).

Note that one can think of X_{j,i} = 1{t_{j,i} ≥ t_j}, the indicator that the i-th permuted statistic for hypothesis j is at least as extreme as the observed statistic.

To contrast with adaptive MC sampling, we change the superscript from "MC(n)" to "fMC" for the fMC p-values. Specifically, the fMC procedure uniformly computes n MC samples for each hypothesis, yielding fMC p-values

P^fMC_j = (1 + Σ_{i=1}^n X_{j,i}) / (n + 1).

Here, the extra "1" in the brackets is to make the fMC p-value conservative under the null. We would like to point out that there are two sources of randomness. The first is from the data generation process corresponding to the ideal p-values while the second is from MC sampling; they correspond to panel a and panels b-c in Fig.1, respectively. The second source of randomness, corresponding to MC sampling, is of primary interest in the present paper.

Applying the BH procedure to the fMC p-values yields a set of discoveries D^fMC ⊆ [m]. Since the fMC p-values are stochastically greater than the uniform distribution under the null hypothesis (Phipson & Smyth, 2010), the set of fMC discoveries has an FDR controlled below the nominal level α. Here, let P^fMC_(r) represent the r-th smallest fMC p-value and define the critical rank r* as

r* = max{ r : P^fMC_(r) ≤ rα/m }.

The BH threshold can be written as r*α/m while the set of fMC discoveries is D^fMC = {j : P^fMC_j ≤ r*α/m}. The goal is to compute the fMC discoveries with high probability (say 1−δ for some small δ) while requiring a minimal number of MC samples.
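The fMC procedure being recovered can be sketched end-to-end. This is an illustrative setup (the function name and the binary-matrix interface are our own): given the m-by-n matrix of MC samples, compute the conservative fMC p-values and then the BH discovery set.

```python
import numpy as np

def fmc_discoveries(X, alpha=0.1):
    """Full MC workflow sketch: fMC p-values (1 + sum X) / (n + 1), then BH.

    X[j, i] is the i-th MC sample for hypothesis j (1 = permuted statistic
    at least as extreme as the observed one)."""
    m, n = X.shape
    p_fmc = (1 + X.sum(axis=1)) / (n + 1)         # conservative fMC p-values
    order = np.argsort(p_fmc)
    below = np.nonzero(p_fmc[order] <= alpha * np.arange(1, m + 1) / m)[0]
    r_star = 0 if below.size == 0 else below[-1] + 1   # critical rank
    return set(order[:r_star].tolist())           # indices of fMC discoveries
```

This is exactly the baseline whose nm samples AMT seeks to avoid: the matrix X is fully materialized here, whereas AMT reveals only as many entries per row as needed.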

3 Algorithm

AMT is described in Algorithm 1. It adopts a top-down procedure by starting with an initial critical rank estimate and gradually moving down until it reaches the true critical rank r*. Specifically, it maintains upper and lower confidence bounds (CBs) for each hypothesis, the critical rank estimate r̂, and the corresponding BH threshold estimate τ̂ = r̂α/m. Based on the current estimates, each hypothesis can be categorized as: certainly less than the estimated BH threshold (upper CB at most τ̂), certainly greater (lower CB above τ̂), or uncertain (CBs crossing τ̂); the uncertain hypotheses form the set U.

As shown in Fig.2a, at initialization the critical rank estimate is set to be the largest possible value m and all hypotheses are uncertain as compared to the estimated BH threshold; they will be further sampled. In Fig.2b, as more MC samples narrow the confidence intervals, some hypotheses become certain to be greater/less than the estimated threshold; they leave U and stop being sampled. At the same time, by the definition of the critical rank, the estimate r̂ cannot be the true critical rank if more than m − r̂ p-values are greater than the corresponding estimated BH threshold r̂α/m. Therefore, we can decrease r̂ until this is no longer the case. Note that the estimated BH threshold is reduced correspondingly. Finally, as shown in Fig.2c, the algorithm terminates when there is no hypothesis left in U.

For practical consideration, each time a batch of MC samples is obtained for each hypothesis in U instead of a single sample, with batch sizes prespecified as a geometrically increasing sequence, so that the number of batches per hypothesis is logarithmic in n.

3.1 Confidence bounds

Since the fMC p-values are themselves random, the CBs are defined conditional on the fMC p-values, where the MC samples for the j-th hypothesis are drawn uniformly and without replacement from the set of all n MC samples for that hypothesis. This gives finite-population CBs whose uncertainty shrinks on the order of 1/√T when T samples are obtained.

Specifically, for any j ∈ [m], let X_{j,1}, ..., X_{j,T} be random variables sampled uniformly and without replacement from the set of all n MC samples for hypothesis j. The δ'-CBs P^L_j(T) and P^U_j(T) satisfy

P( P^L_j(T) ≤ P^fMC_j ≤ P^U_j(T) ) ≥ 1 − δ'.

For the analysis, we assume that they take the form

P^U_j(T) − P^L_j(T) ≤ c √( log(1/δ') / T ),     (10)

where c is a constant. For example, the Clopper-Pearson confidence interval satisfies (10) (Thulin et al., 2014). It is also valid for the sampling-without-replacement case (Bardenet et al., 2015).

In the actual implementation, the Agresti-Coull confidence interval is used (Agresti & Coull, 1998), since the current Python implementation of the Clopper-Pearson confidence interval (Seabold & Perktold, 2010) has numerical issues when T is small.
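For concreteness, the standard Agresti-Coull interval can be sketched as follows (the helper name and the choice z ≈ 2.58, roughly a 99% two-sided interval, are our own illustrative assumptions, not taken from the paper's code):

```python
import math

def agresti_coull_cb(successes, trials, z=2.58):
    """Agresti-Coull interval for a binomial proportion: add z^2/2 pseudo-
    successes and z^2 pseudo-trials, then form a Wald interval around the
    adjusted estimate. Behaves well even when `successes` is small."""
    n_tilde = trials + z * z
    p_tilde = (successes + z * z / 2) / n_tilde
    half = z * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return max(0.0, p_tilde - half), min(1.0, p_tilde + half)
```

The pseudo-counts pull the estimate away from 0 and 1, which is what avoids the degenerate intervals that plain Wald CBs produce when few extreme events have been observed.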

  Input: failure probability δ, nominal FDR α.
  Initialization: δ'-CBs for each hypothesis, critical rank estimate r̂ = m, BH threshold estimate τ̂ = α, uncertain hypothesis set U = [m].
  While U is nonempty:
     Sample: obtain the next batch of MC samples for each hypothesis in U and update their δ'-CBs (SubSec. 3.1).
     Update: reduce r̂ one at a time, and correspondingly τ̂ = r̂α/m, until no more than m − r̂ lower CBs exceed τ̂; then update the estimated BH threshold and the hypothesis sets.
  Return: the hypotheses whose upper CBs are at most r̂α/m.
Algorithm 1 The AMT algorithm.
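A minimal, simplified sketch of the algorithm above (not the authors' implementation: the Hoeffding-type CBs with a crude union bound, the fixed batch size, and the function name are illustrative assumptions; the paper uses Agresti-Coull CBs and geometric batches):

```python
import numpy as np

def amt_sketch(X, alpha=0.1, delta=0.05, batch=100):
    """Sketch of AMT: reveal MC samples only for uncertain hypotheses,
    shrinking the critical-rank estimate top-down. X has shape (m, n)."""
    m, n = X.shape
    counts = np.zeros(m, dtype=int)       # MC samples revealed so far
    succ = np.zeros(m, dtype=int)         # extreme events among revealed
    lo, hi = np.zeros(m), np.ones(m)      # lower/upper CBs on fMC p-values
    r_hat = m                             # critical-rank estimate (top-down)
    uncertain = np.ones(m, dtype=bool)
    c = np.sqrt(np.log(2 * m * n / delta) / 2)   # crude union-bound constant
    while uncertain.any():
        for j in np.nonzero(uncertain)[0]:
            t = min(counts[j] + batch, n)
            succ[j] += X[j, counts[j]:t].sum()
            counts[j] = t
            p_hat = (1 + succ[j]) / (counts[j] + 1)
            w = c / np.sqrt(counts[j])    # Hoeffding-type CB width
            lo[j], hi[j] = max(p_hat - w, 0.0), min(p_hat + w, 1.0)
            if counts[j] == n:            # fully sampled: p-value exact
                lo[j] = hi[j] = p_hat
        # r_hat cannot be the critical rank while more than m - r_hat
        # p-values are certainly above the estimated threshold r_hat*alpha/m
        while r_hat > 0 and (lo > r_hat * alpha / m).sum() > m - r_hat:
            r_hat -= 1
        tau = r_hat * alpha / m
        uncertain = (lo <= tau) & (hi > tau) & (counts < n)
    return set(np.nonzero(hi <= r_hat * alpha / m)[0].tolist())
```

Hypotheses far from the estimated threshold exit U after a few batches, while the handful near the threshold may be sampled all the way to n, where their fMC p-values become exact.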

3.2 Comparison to sMC

Figure 3: sMC computes more MC samples for hypotheses with smaller p-values while AMT computes more MC samples for hypotheses with p-values closer to the BH threshold.

In sMC, for each hypothesis j, MC samples are obtained until either h extreme values (MC observations equal to 1) are observed, or n total permutations are computed with h_j ≤ h total successes. Let n_j be the number of MC samples obtained for the j-th hypothesis. The sMC p-value is defined as

P^sMC_j = h/n_j if h_j = h, and (h_j + 1)/(n + 1) otherwise.

After this, BH is applied on the set of sMC p-values to obtain the testing result.
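The stopping rule above can be sketched as follows (the helper and its callback interface are hypothetical; the early-stopping estimate h/n_j follows Besag & Clifford, 1991):

```python
def smc_pvalue(sample_null_extreme, h=100, n_max=10_000):
    """Sequential MC p-value sketch. `sample_null_extreme()` returns 1 if a
    freshly permuted statistic is at least as extreme as the observed one."""
    successes = 0
    for t in range(1, n_max + 1):
        successes += sample_null_extreme()
        if successes == h:
            return h / t                      # stopped early: h extreme events
    return (successes + 1) / (n_max + 1)      # hit the sampling cap n_max
```

Large p-values hit h extreme events quickly and stop early, while small p-values run to the cap — which is why sMC concentrates its samples on the hypotheses with the smallest p-values.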

As shown in Fig.3, sMC computes more MC samples for hypotheses with smaller p-values while AMT computes more MC samples for hypotheses with p-values closer to the BH threshold, effectively addressing the hardness of recovering the fMC result, i.e., deciding how each fMC p-value compares with the BH threshold.

For sMC, the parameter h needs to be chosen a priori. A back-of-the-envelope calculation shows that for a hypothesis test with ideal p-value p, the sMC p-value has accuracy (standard deviation) around p/√h while the fMC p-value has accuracy around √(p/n). Since it is desirable for the BH result on the MC p-values (sMC, fMC) to be close to the BH result on the ideal p-values, the accuracy of the MC p-values whose ideal p-values are close to the BH threshold can be thought of as the accuracy of the entire multiple testing problem. Matching such accuracy for sMC and fMC at the BH threshold gives that h should be of the order of n times the BH threshold; in particular, h should be fairly large if there are many discoveries on the ideal p-values. However, since we do not know the BH threshold before running the experiment, a larger value is preferred. It is noted that values h = 30-120 are recommended in a recent work (Thulin et al., 2014).

3.3 Comparison to Bandit FDR

In the bandit FDR setting (Jamieson & Jain, 2018), each arm has a parameter whose value indicates either null or alternative relative to a threshold μ_0 given before the experiment. For each arm, i.i.d. observations are available that are bounded and have expected value equal to the arm parameter. The goal is to select a subset of arms such that the selected set controls FDR while achieving a certain level of power.

Both bandit FDR and AMT aim to select a subset of "good arms" as defined by comparing the arm parameters to a threshold. In bandit FDR this threshold μ_0 is given ahead of time. In AMT, however, the threshold is the BH threshold, which is not known ahead of time and needs to be learned from the observed data. The two frameworks also differ in the error criterion. Bandit FDR considers FDR and power for the selected set, a novel criterion in the MAB literature. AMT, on the other hand, adopts the traditional PAC-learning criterion of recovering the fMC discoveries with high probability. These distinctions lead to different algorithms: bandit FDR uses an algorithm similar to thresholding MAB (Locatelli et al., 2016) but with carefully designed confidence bounds to control FDR; AMT devises a new LUCB (lower and upper confidence bound) algorithm that adaptively estimates two things simultaneously: the BH threshold and how each arm compares to that threshold.

4 Theoretical guarantee

We present the high probability recovery and FDR control result, the upper bound, and the lower bound in order. For the upper bound, we first state the result in Proposition 1, which is a direct consequence of the main instance-wise upper bound as stated in Theorem 2.

4.1 Correctness

Theorem 1.

(Correctness) AMT recovers the fMC result with probability at least 1−δ, i.e.,

P( D^AMT = D^fMC ) ≥ 1 − δ.     (12)

Moreover, AMT controls FDR at level π_0 α + δ, where π_0 is the null proportion.

Remark 1.

A stronger version is actually proved for (12): AMT recovers the fMC result with probability at least 1−δ conditional on any set of fMC p-values, i.e.,

P( D^AMT = D^fMC | P^fMC_1, ..., P^fMC_m ) ≥ 1 − δ.

This also corresponds to the δ-correctness definition in the lower bound Theorem 3.

For the FDR control argument, δ is negligible as compared to π_0 α; δ is set to be a vanishing term. Hence, the FDR is effectively controlled at level α in most cases.

4.2 Upper bound

Without loss of generality, let us assume that the ideal p-values, corresponding to the generation of the data, are drawn i.i.d. from an unknown distribution F, which can be understood as a mixture of the null distribution and the alternative distribution, i.e., F = π_0 Unif[0,1] + (1−π_0) F_1, where π_0 is the null proportion and F_1 is the alternative distribution. The following result shows that the sample complexity of AMT is Õ(√(n)m) under mild assumptions on F.

Proposition 1.

Assume that the ideal p-values are drawn i.i.d. from some unknown distribution with density f that is either constant (f ≡ 1) or continuous and monotonically decreasing. Then the total number of MC samples for AMT satisfies

W = Õ(√(n) m),

where Õ hides logarithmic factors with respect to n and m.

Remark 2.

The asymptotic regime is m → ∞ with n growing at least linearly in m. This is because the number of MC samples n should always be larger than the number of hypothesis tests m. A more complete result, covering other regimes of n, is provided in the proof.

For the assumption on the ideal p-value distribution, f ≡ 1 corresponds to the case where all hypotheses are true null, while f being continuous and monotonically decreasing essentially assumes that the alternative p-values are stochastically smaller than uniform. Such an assumption includes many common cases, e.g., when the p-value is calculated from a z-score with mean 0 under the null and a positive mean under the alternative (Hung et al., 1997).

A strictly weaker but less natural assumption is sufficient for the result: roughly, that the limiting BH threshold is bounded away from zero and that the p-value distribution crosses the BH rejection line only once near the threshold. As shown in the proof, this holds as long as the p-value density is strictly decreasing around the limiting BH threshold. Hence, this weaker assumption contains most practical cases and the result holds generally. However, this weaker assumption involves technical definitions. We therefore chose the stronger but more natural assumption in the statement of the proposition.

Proposition 1 is based on an instance-wise upper bound conditional on the fMC p-values, stated as follows.

Theorem 2.

Conditioning on any set of fMC p-values, let P^fMC_(r) be the r-th smallest p-value and let each hypothesis j have gap Δ_j = |P^fMC_j − r*α/m| to the BH threshold. For CBs satisfying (10), the total number of MC samples is bounded, up to logarithmic factors and with each per-hypothesis term capped at n, by a sum over hypotheses of terms inversely proportional to Δ_j^2.

Remark 3.

The bound consists of one term per hypothesis. The denominator, the squared gap between the fMC p-value and the BH threshold, represents the hardness of determining whether to reject each hypothesis, while the hypothesis-dependent numerator (the BH threshold for the rejected hypotheses and the p-value itself for the others) represents the natural scaling of the binomial proportion confidence bound. An additional term corresponds to the specific behavior of the top-down approach; it is easy to construct examples where this term is necessary. The logarithmic factor comes from the high probability bound. Overall, the bound is conjectured to be tight up to improvements in the logarithmic factors.

4.3 Lower bound

We provide a lower bound matching the upper bound. Here, we define a δ-correct algorithm to be one that, conditional on any set of fMC p-values, recovers the fMC result with probability at least 1−δ.

Theorem 3.

Assume that the ideal p-values are drawn i.i.d. from some unknown distribution with null proportion π_0. Then, for sufficiently large m and n, any δ-correct algorithm requires

Ω̃(π_0 √(n) m)

MC samples in expectation, where Ω̃ hides logarithmic factors with respect to n and m.

Remark 4.

In practical settings most hypotheses are true null and therefore π_0 ≈ 1.

5 Empirical results

Failure prob. δ | Avg. MC samples per hypothesis | Prop. of success recovery
0.001 | 112,873 | 100%
0.01 | 103,372 | 100%
0.1 | 93,070 | 100%
Table 1: Recovery of the fMC result.

Figure 4: Average number of MC samples per hypothesis test for different algorithms while increasing the number of hypothesis tests m and letting n = 10m.
(a) Nominal FDR α.
(b) Alternative proportion.
(c) Effect size μ.
Figure 5: Average number of MC samples per hypothesis test for different algorithms while varying different parameters.

5.1 Simulated data

Setting. In the default setting, we consider m = 1,000 hypothesis tests, out of which 200 are true alternatives. The p-values are generated from z-scores with effect size μ = 0 under the null and μ = 2.5 under the alternative. The number of fMC samples per hypothesis is set to be n = 10,000 while the nominal FDR is α = 0.1. We investigate the performance of AMT by varying different parameters. The performance of sMC is also reported for comparison, where we set h = 100 according to the discussion in SubSec. 3.2. We also tested h = 50, which shows a similar result and is hence omitted.

Reliability. We first investigate the reliability of AMT by varying the failure probability δ, with the failure probability of the CBs set accordingly. For each value of δ the experiment is repeated 10,000 times. In each repetition, a different set of data is generated and the AMT result is compared to the fMC result while fixing the random seed for MC sampling. As shown in Table 1, AMT recovers the fMC result in all cases while having a 10x gain in sample efficiency. We also found that AMT is rather stable, so in practice a larger value of δ may be used; empirically, AMT starts to fail to recover the fMC result only when δ is much larger than the values considered here.

Scaling. Next we investigate the asymptotic MC sample complexity by increasing m while fixing n = 10m. The experiment is repeated 5 times for each parameter and 95% confidence intervals are provided. The result is shown in Fig.4, where the number of MC samples per hypothesis scales sub-linearly with m. A simple linear fit shows that the empirical scaling for AMT is roughly √m, consistent with the Õ(√n) average MC samples per hypothesis test derived from Proposition 1 and Theorem 3. Empirically, sMC also scales sub-linearly but with a higher rate.

Varying other parameters. Finally we vary other parameters, including the nominal FDR α, the alternative proportion, and the effect size μ, where the experiment for each parameter setting is repeated 5 times and 95% confidence intervals are provided. The results are shown in Fig.5. Here, sMC processes each hypothesis separately and computes more MC samples for hypotheses with smaller p-values, as discussed in SubSec. 3.2. However, to obtain the BH result, the true difficulty is quantified by the closeness of the p-values to the BH threshold rather than to zero — in other words, if a hypothesis test has a very small p-value while the BH threshold is large, it should not be hard to infer that this null hypothesis should be rejected. AMT captures this by adaptively computing more MC samples for hypotheses with p-values closer to the BH threshold rather than to zero, effectively adapting to different parameter settings and outperforming sMC in terms of the MC sample complexity.

5.2 GWAS on Parkinson’s disease

We consider a publicly available GWAS dataset that aims to identify genetic variants associated with Parkinson’s disease (Fung et al., 2006), which is known to be a complex disease and is likely to be associated with many different SNPs (Chang et al., 2017); FDR control via BH may yield more new discoveries that are interesting to the community. The dataset comprises 267 cases and 271 controls that are unrelated and have matching ages (55-88). All samples were assayed with the Illumina Infinium I and Infinium HumanHap300 SNP chips, consisting of 448,001 different SNPs that are carefully designed to represent information about several million common genetic variants throughout the genome (Consortium et al., 2003).

The phenotype is binary disease/healthy while the genotype is categorical AA/Aa/aa/missing. SNPs with too high a proportion of missing values are removed to prevent discoveries due to the missing-value pattern; this leaves 404,164 SNPs. The MC samples are based on the permutation test using the Pearson's Chi-square test, where the phenotype is randomly permuted while keeping the same number of cases and controls. This experiment is run with 32 cores (AMD Processor 6378).

Small data. We first compare AMT with fMC on smaller-scale data that consists of all 23,915 SNPs on chromosome 4. The number of fMC samples per hypothesis is chosen to be 100 times the number of SNPs, yielding a total number of MC samples that takes 34 mins to compute with 32 cores (4th row in Table 3). As shown in Table 2, the fMC p-values (3rd column) are similar to the p-values reported in the original paper (2nd column) (Fung et al., 2006). The FDR level α = 0.1 yields 47 discoveries, including all discoveries on chromosome 4 reported in the original paper; a more stringent level yields 25 discoveries. The AMT result is identical to the fMC result; it takes 123s and an average of 1,241 MC samples per hypothesis, representing a 17x gain in running time and a 201x gain in MC sample efficiency. The same experiment is performed on the other chromosomes considered (chromosomes 1-3), which gives a similar result — AMT recovers the fMC result in all cases and, as shown in Table 3, AMT has a gain of 17-39x in running time and 201-314x in MC sample efficiency. See also Supp. Table 1 for the fMC p-values.

Full data. We next consider the full dataset with 404,164 SNPs and set the number of fMC samples to be n=40,416,400 per hypothesis, yielding roughly 1.6e13 MC samples in total. Since there is no internal adaptivity in the fMC procedure, it is reasonable to assume its running time to be proportional to the total number of MC samples, yielding an estimate of 2 months. Due to this computational cost, no full-scale permutation analysis had previously been performed on the dataset. The original paper performed permutation tests only on the subset of SNPs with theoretical p-values less than 0.05. However, such practice may cause overfitting since the same data is used for both hypothesis selection and testing.

We run AMT on this dataset with FDR level α=0.1. It takes 1.1hr to finish with an average of 13,723 MC samples per hypothesis, representing a gain of roughly 1500x in running time and 3000x in MC sample efficiency. We should expect even greater computational gains for larger-scale problems since AMT's sample complexity scales as O(√n m) while fMC's scales as O(nm). In addition, for larger-scale problems the MC samples are computed in larger batches, which is more efficient, effectively closing the gap between the gain in actual running time and the gain in MC sample efficiency.
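The reported efficiency figures can be sanity-checked with simple arithmetic on the numbers above (the 2-month fMC time is extrapolated, so only the sample counts are exact):

```python
m = 404_164          # number of hypotheses (SNPs)
n = 40_416_400       # fMC samples per hypothesis
amt_avg = 13_723     # AMT's reported average MC samples per hypothesis

fmc_total = m * n            # total fMC samples, about 1.6e13
amt_total = m * amt_avg      # total AMT samples
sample_gain = n / amt_avg    # per-hypothesis sample-efficiency gain, about 2.9e3
```

The ratio n / amt_avg recovers the ~3000x MC sample-efficiency gain quoted in the text.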

At FDR level α=0.1, AMT makes 304 discoveries, including 22 of the 25 SNPs reported in the original paper. Among the three SNPs that are missed, rs355477 (p=9.1e-5) and rs355464 (p=1.8e-4) are borderline, while rs11090762 (p=5.9e-2) is likely a false positive.

6 Discussion

We have shown that AMT improves the computational efficiency of the fMC workflow, i.e., applying BH on the fMC p-values. A direct extension is to the workflow of applying the Storey-BH procedure (Storey et al., 2004) on the fMC p-values. In addition, in many cases, especially in genetic research, additional covariate information is available for each null hypothesis, e.g., functional annotations of the SNPs in GWAS, where a covariate-dependent rejection threshold can be used to increase testing power (Xia et al., 2017; Zhang et al., 2018). Extending AMT to such cases would allow both efficient computation of MC p-values and increased power via covariate-adaptive thresholding. Last but not least, MC sampling is an important building block in some modern multiple testing approaches like the model-X knockoff (Candes et al., 2018) or the conditional permutation test (Berrett et al., 2018), where ideas in the present paper may be used to improve the computational efficiency.
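As a rough sketch of the Storey-BH extension mentioned above: estimate the null proportion π0 from the p-values above a tuning point λ, then run BH at the inflated level α/π0 (Storey et al., 2004). The λ=0.5 choice and the toy p-values here are our own assumptions, not the paper's.

```python
import numpy as np

def bh_count(pvals, alpha):
    """Number of BH rejections at level alpha."""
    p = np.sort(np.asarray(pvals))
    m = len(p)
    below = p <= alpha * np.arange(1, m + 1) / m
    return int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0

def storey_bh_count(pvals, alpha, lam=0.5):
    """Storey-BH: estimate pi0 = #{p > lam} / (m * (1 - lam)), capped at 1,
    then run BH at the inflated level alpha / pi0 (Storey et al., 2004)."""
    p = np.asarray(pvals)
    pi0 = min(1.0, (p > lam).mean() / (1 - lam))
    return bh_count(p, alpha / pi0)

# toy p-values: four clear signals, one borderline, five likely nulls
pvals = [0.001, 0.002, 0.003, 0.004, 0.03, 0.4, 0.6, 0.7, 0.8, 0.9]
```

In this example pi0 is estimated as 0.8, and the inflated level lets Storey-BH reject the borderline p-value 0.03 that plain BH at α=0.05 does not.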

Code availability. The software is available at

dbSNP ID     Original p-value   fMC p-value   Rej. at α=0.1   Rej. at α=0.05
rs2242330    1.7e-6             8.0e-6
rs6826751    2.1e-6             1.6e-5
rs4862792    3.5e-5             4.0e-6
rs3775866    4.6e-5             3.6e-5
rs355477     7.9e-5             8.0e-5
rs355461     8.3e-5             8.0e-5
rs355506     8.3e-5             8.0e-6
rs355464     8.9e-5             1.3e-4
rs1497430    9.7e-5             5.2e-5
rs11946612   9.7e-5             4.8e-5
Table 2: Small GWAS on chromosome 4.

Chromosome     Avg. MC samples            Running time (s)
(# of SNPs)    fMC        AMT             fMC      AMT
1 (31,164)     250,000    874 (286x)      3,148    100 (31x)
2 (32,356)     250,000    797 (314x)      3,505    90 (39x)
3 (27,386)     250,000    964 (259x)      2,505    89 (28x)
4 (23,915)     250,000    1,241 (201x)    2,031    123 (17x)
Table 3: Small GWAS data. Average MC samples per hypothesis and running time for fMC and AMT. The same experiment is performed on chromosomes 1-4 separately.


  • Agresti & Coull (1998) Agresti, A. and Coull, B. A. Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52(2):119–126, 1998.
  • Audibert & Bubeck (2010) Audibert, J.-Y. and Bubeck, S. Best arm identification in multi-armed bandits. In COLT-23th Conference on Learning Theory-2010, pp. 13–p, 2010.
  • Bagaria et al. (2018a) Bagaria, V., Kamath, G., Ntranos, V., Zhang, M., and Tse, D. Medoids in almost-linear time via multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pp. 500–509, 2018a.
  • Bagaria et al. (2018b) Bagaria, V., Kamath, G. M., and Tse, D. N. Adaptive monte-carlo optimization. arXiv preprint arXiv:1805.08321, 2018b.
  • Bardenet et al. (2015) Bardenet, R., Maillard, O.-A., et al. Concentration inequalities for sampling without replacement. Bernoulli, 21(3):1361–1385, 2015.
  • Benjamini & Hochberg (1995) Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological), pp. 289–300, 1995.
  • Berrett et al. (2018) Berrett, T. B., Wang, Y., Barber, R. F., and Samworth, R. J. The conditional permutation test. arXiv preprint arXiv:1807.05405, 2018.
  • Besag & Clifford (1991) Besag, J. and Clifford, P. Sequential monte carlo p-values. Biometrika, 78(2):301–304, 1991.
  • Boos & Zhang (2000) Boos, D. D. and Zhang, J. Monte carlo evaluation of resampling-based hypothesis tests. Journal of the American Statistical Association, 95(450):486–492, 2000.
  • Browning (2008) Browning, B. L. Presto: rapid calculation of order statistic distributions and multiple-testing adjusted p-values via permutation for one and two-stage genetic association studies. BMC bioinformatics, 9(1):309, 2008.
  • Candes et al. (2018) Candes, E., Fan, Y., Janson, L., and Lv, J. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
  • Chang et al. (2017) Chang, D., Nalls, M. A., Hallgrímsdóttir, I. B., Hunkapiller, J., van der Brug, M., Cai, F., Kerchner, G. A., Ayalon, G., Bingol, B., Sheng, M., et al. A meta-analysis of genome-wide association studies identifies 17 new parkinson’s disease risk loci. Nature genetics, 49(10):1511, 2017.
  • Chang et al. (2005) Chang, H. S., Fu, M. C., Hu, J., and Marcus, S. I. An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53(1):126–139, 2005.
  • Che et al. (2014) Che, R., Jack, J. R., Motsinger-Reif, A. A., and Brown, C. C. An adaptive permutation approach for genome-wide association study: evaluation and recommendations for use. BioData mining, 7(1):9, 2014.
  • Chen et al. (2017) Chen, L., Li, J., and Qiao, M. Nearly instance optimal sample complexity bounds for top-k arm selection. arXiv preprint arXiv:1702.03605, 2017.
  • Consortium et al. (2003) Consortium, I. H. et al. The international hapmap project. Nature, 426(6968):789, 2003.
  • Davison et al. (1997) Davison, A. C., Hinkley, D. V., et al. Bootstrap methods and their application, volume 1. Cambridge university press, 1997.
  • Dunn (1961) Dunn, O. J. Multiple comparisons among means. Journal of the American statistical association, 56(293):52–64, 1961.
  • Dwass (1957) Dwass, M. Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical Statistics, pp. 181–187, 1957.
  • Fung et al. (2006) Fung, H.-C., Scholz, S., Matarin, M., Simón-Sánchez, J., Hernandez, D., Britton, A., Gibbs, J. R., Langefeld, C., Stiegert, M. L., Schymick, J., et al. Genome-wide genotyping in parkinson’s disease and neurologically normal controls: first stage analysis and public release of data. The Lancet Neurology, 5(11):911–916, 2006.
  • Gandy & Hahn (2014) Gandy, A. and Hahn, G. Mmctest—a safe algorithm for implementing multiple monte carlo tests. Scandinavian Journal of Statistics, 41(4):1083–1101, 2014.
  • Gandy & Hahn (2016) Gandy, A. and Hahn, G. A framework for monte carlo based multiple testing. Scandinavian Journal of Statistics, 43(4):1046–1063, 2016.
  • Gandy & Hahn (2017) Gandy, A. and Hahn, G. Quickmmctest: quick multiple monte carlo testing. Statistics and Computing, 27(3):823–832, 2017.
  • Gao et al. (2010) Gao, X., Becker, L. C., Becker, D. M., Starmer, J. D., and Province, M. A. Avoiding the high bonferroni penalty in genome-wide association studies. Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society, 34(1):100–105, 2010.
  • Guo & Peddada (2008) Guo, W. and Peddada, S. Adaptive choice of the number of bootstrap samples in large scale multiple testing. Statistical applications in genetics and molecular biology, 7(1), 2008.
  • Hahn (2015) Hahn, G. Optimal allocation of samples to multiple hypothesis tests. arXiv preprint arXiv:1502.07864, 2015.
  • Hung et al. (1997) Hung, H. J., O’Neill, R. T., Bauer, P., and Kohne, K. The behavior of the p-value when the alternative hypothesis is true. Biometrics, pp. 11–22, 1997.
  • Jamieson & Jain (2018) Jamieson, K. and Jain, L. A bandit approach to multiple testing with false discovery control. arXiv preprint arXiv:1809.02235, 2018.
  • Jamieson & Talwalkar (2016) Jamieson, K. and Talwalkar, A. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pp. 240–248, 2016.
  • Jamieson et al. (2014) Jamieson, K., Malloy, M., Nowak, R., and Bubeck, S. lil’ucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pp. 423–439, 2014.
  • Jiang & Salzman (2012) Jiang, H. and Salzman, J. Statistical properties of an early stopping rule for resampling-based multiple testing. Biometrika, 99(4):973–980, 2012.
  • Johnson et al. (2010) Johnson, R. C., Nelson, G. W., Troyer, J. L., Lautenberger, J. A., Kessing, B. D., Winkler, C. A., and O’Brien, S. J. Accounting for multiple comparisons in a genome-wide association study (gwas). BMC genomics, 11(1):724, 2010.
  • Kalyanakrishnan et al. (2012) Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. Pac subset selection in stochastic multi-armed bandits. In ICML, volume 12, pp. 655–662, 2012.
  • Kimmel & Shamir (2006) Kimmel, G. and Shamir, R. A fast method for computing high-significance disease association in large population-based studies. The American Journal of Human Genetics, 79(3):481–492, 2006.
  • Kocsis & Szepesvári (2006) Kocsis, L. and Szepesvári, C. Bandit based monte-carlo planning. In European conference on machine learning, pp. 282–293. Springer, 2006.
  • Lehmann & Romano (2006) Lehmann, E. L. and Romano, J. P. Testing statistical hypotheses. Springer Science & Business Media, 2006.
  • Li et al. (2016) Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.
  • Locatelli et al. (2016) Locatelli, A., Gutzeit, M., and Carpentier, A. An optimal algorithm for the thresholding bandit problem. arXiv preprint arXiv:1605.08671, 2016.
  • Pahl & Schäfer (2010) Pahl, R. and Schäfer, H. Permory: an ld-exploiting permutation test algorithm for powerful genome-wide association testing. Bioinformatics, 26(17):2093–2100, 2010.
  • Phipson & Smyth (2010) Phipson, B. and Smyth, G. K. Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn. Statistical applications in genetics and molecular biology, 9(1), 2010.
  • Purcell et al. (2007) Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller, J., Sklar, P., De Bakker, P. I., Daly, M. J., et al. Plink: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81(3):559–575, 2007.
  • Sandve et al. (2011) Sandve, G. K., Ferkingstad, E., and Nygård, S. Sequential monte carlo multiple testing. Bioinformatics, 27(23):3235–3241, 2011.
  • Seabold & Perktold (2010) Seabold, S. and Perktold, J. Statsmodels: Econometric and statistical modeling with python. In 9th Python in Science Conference, 2010.
  • Serfling (1974) Serfling, R. J. Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, pp. 39–48, 1974.
  • Silver et al. (2017) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
  • Simchowitz et al. (2017) Simchowitz, M., Jamieson, K., and Recht, B. The simulator: Understanding adaptive sampling in the moderate-confidence regime. arXiv preprint arXiv:1702.05186, 2017.
  • Storey et al. (2004) Storey, J. D., Taylor, J. E., and Siegmund, D. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205, 2004.
  • Thulin et al. (2014) Thulin, M. et al. The cost of using exact confidence intervals for a binomial proportion. Electronic Journal of Statistics, 8(1):817–840, 2014.
  • Visscher et al. (2017) Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., and Yang, J. 10 years of gwas discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017.
  • Xia et al. (2017) Xia, F., Zhang, M. J., Zou, J. Y., and Tse, D. Neuralfdr: Learning discovery thresholds from hypothesis features. In Advances in Neural Information Processing Systems, pp. 1541–1550, 2017.
  • Yang et al. (2017) Yang, F., Ramdas, A., Jamieson, K. G., and Wainwright, M. J. A framework for multi-a (rmed)/b (andit) testing with online fdr control. In Advances in Neural Information Processing Systems, pp. 5957–5966, 2017.
  • Yang et al. (2014) Yang, G., Jiang, W., Yang, Q., and Yu, W. Pboost: a gpu-based tool for parallel permutation tests in genome-wide association studies. Bioinformatics, 31(9):1460–1462, 2014.
  • Zhang et al. (2018) Zhang, M. J., Xia, F., and Zou, J. Adafdr: a fast, powerful and covariate-adaptive approach to multiple hypothesis testing. bioRxiv, pp. 496372, 2018.

1 Proof of Theorem 1


(Proof of Theorem 1) To show (12), it suffices to show that conditional on any set of fMC p-values ,


Let denote the event that all CBs hold. Since the number of CBs is at most and each of them holds with probability at least conditional on the fMC p-values, by union bound,
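Written out (a reconstruction with assumed notation — E for the event that all confidence bounds hold, K for the number of confidence bounds, δ for the total error budget, the exact allocation being as in the paper):

```latex
\Pr\!\left(\mathcal{E}^{c}\mid P_{1},\dots,P_{m}\right)
\;\le\; \sum_{j=1}^{K}\Pr\!\left(\mathrm{CB}_{j}\text{ fails}\right)
\;\le\; K\cdot\frac{\delta}{K} \;=\; \delta .
```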

Next we show that implies , which further gives (16). Let be the total number of rounds, which is finite since at most MC samples will be computed. For any round , let “” represent the corresponding values before the MC sampling of the round, e.g., , , , , . Also, let represent the values at termination. For any ,

  1. if , by (7) more than fMC p-values are greater than whereas . Thus, there is at least one hypothesis that has fMC p-value greater than and is not in . On , it cannot be in . Hence, it is in , giving that . Thus, and the algorithm will not terminate.

  2. if , there are hypotheses in corresponding to those with fMC p-values greater than . Other hypotheses all have fMC p-values less than and hence, on , will not enter after further sampling. Therefore, will not further decrease.

Therefore, . Since , on , contains all hypotheses with fMC p-values less than , i.e., . Hence, we have shown (16).

Next we prove FDR control. Let and denote the false discovery proportion and FDR of the set , respectively. It is noted that . Let denote the event that and be the complement of . Then due to (12) that we have just proved. For AMT,


The first term of (18)

where the last inequality is because the fMC p-values are stochastically greater than the uniform distribution under the null hypothesis, and hence, applying BH on them controls FDR at level .

The second term of (18) is upper bounded by as FDP is always no greater than . Therefore,

2 Proof of Theorem 2


(Proof of Theorem 2) The entire analysis is conditional on the fMC p-values . Without loss of generality assume . Let be the total number of rounds, which is finite since at most MC samples will be computed. For any round , let “” represent the corresponding values before the MC sampling of the round. Note that “” represent the values at termination. The quantities useful to the analysis include

  1. : number of MC samples for arm .

  2. : lower and upper CBs for arm .

  3. Empirical mean .

  4. , , : hypothesis sets as defined in (8).

  5. : critical rank estimate and the corresponding BH threshold estimate.

Let denote the event that all CBs hold. Since the number of CBs is at most and each of them holds with probability at least conditional on the fMC p-values, by union bound,

Conditional on , when the algorithm terminates, . There are hypotheses in and hypotheses in . We next upper-bound the number of MC samples for hypotheses in these two sets separately.

Step 1. Hypotheses in . On , there are hypotheses in . For any , let be the th hypothesis entering . For two hypotheses entering in the same round, the one is considered entering earlier if it has a larger upper CB before the MC sampling in the entering round.

Consider any that enters after MC sampling in round and let be the first hypothesis entering in the same round. Here, we note that and the number of MC samples . In addition,


since the batch sizes form a geometric sequence with ratio . Now we focus on .
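The geometric-batch bound invoked here is standard; with assumed notation b_t for the batch size in round t and ratio β > 1:

```latex
\sum_{t=1}^{r} b_t \;=\; b_1\,\frac{\beta^{r}-1}{\beta-1}
\;\le\; b_1\,\frac{\beta^{r}}{\beta-1}
\;=\; \frac{\beta}{\beta-1}\, b_r ,
```

so the total number of samples drawn through round r is at most a constant multiple of the samples drawn in the final round.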

Since is sampled in round , we have that . This indicates that in round , the lower CB of should be no greater than the estimated threshold before MC sampling; otherwise would have entered before round . Hence,


Also, being the first to enter in round , its upper CB is the largest among all elements in , i.e.,


Subtracting (20) from (21) to have the width of the confidence interval


where the last inequality is conditional on . Since , we have that . Therefore (22) can be further written as


Since the CBs satisfy (10), equations (20) and (23) can be rewritten as


Note that . By Lemma 1,


Since , we have that . Therefore,


Step 2. Hypotheses in . On , and . Consider any hypothesis whose fMC p-value is . It will be sampled until its upper CB is no greater than . Let its last sample round be . Then,


Subtracting the third term from the first term yields


Since the CBs satisfy (10), the second term in (28) along with (29) can be rewritten as


Note that and , (30) can be further written as