1 Introduction
The explosion of available highthroughput technological pipelines in the biological and medical sciences over the past 20 years has opened up many new avenues of research that were previously unthinkable. The need to understand the role of power in the era of “omics” studies cannot be overstated. As a case in point, consider the promise and pitfalls of RNA expression microarrays. One of the take away themes is that new technologies such as this one enjoy initial exuberance and early victories (Alizadeh et al., 2000), followed by calls for caution from epidemiologists and statisticians (Ioannidis, 2005; Baggerly and Coombes, 2009). Some of the most constructive things gleaned from this journey, in hindsight, have been a thorough reevaluation of what should constitute the “bar for science”. More and more researchers are starting to realize that some of the blame for lack of reproducibility is owed to lack of power (Ioannidis, 2005). Central to these renewed calls for scientific vetting has been the concept of multiple testing. Very early in the history of “omics” researchers realized that correction for multiple testing should be done and that the most commonly used method, Bonferroni correction, was by far too conservative for tens of thousands of simultaneous tests. Somewhat prophetically, half a decade before, Benjamini and Hochberg (1995)
and colleagues introduced a new testing paradigm, that, in contrast to the Bonferroni procedure which controls the probability that one or more type I errors are committed, instead controls the proportion of false discoveries among the tests called significant. By now, use of the BenjaminiHochberg (BH) false discovery rate (FDR) procedure for making statistical significance calls in multiple testing scenarios is widespread.
2 Definitions and Notation
Before describing the model of the data distribution, we need the following definition.
Definition 2.1.
A family of pdf’s, , has the monotone likelihood ratio property if and only if
(1) 
A family of CDF’s has the monotone likelihood ratio property if its members are absolutely continuous and the corresponding family of pdf’s has the property.
Now, consider simultaneous tests of hypotheses , each a test whether a location parameter is 0, () or nonzero (). We start by supposing that an expected proportion,
, of the test statistics are distributed about nonzero location parameters. The test statistic distributions are modeled according to a mixture model, first introduced by
Storey (2002) and others Baldi, Pierre and Long, A. D. (2001); Ibrahim et al. (2002). First, let be an i.i.d. Bernouli sequence, with success probability,. Denote the binomially distributed sum,
, which is the number of test statistics belonging to the nonzero location parameter population. For a sample of size replicate outcomes resulting in simultaneous tests, let be the test statistic. We assume that conditional upon , that its CDF is of the form:(2) 
In the above, is the common distribution of all null distributed tests, and is the common distribution of all nonnull distributed tests. We assume that is the “minimal” element of a class of CDF’s, , satisfying the monotone likelihood ratio principle and that is a finite mixture of elements , including the possibility that the mixing proportions are degenerate at a single distribution. Note that since we will only consider only twosided tests, our scope is solely focused on nonnegative test statistics, , since they represent the the absolute value of some intermediate quantity. We consider to be nonnegative in the remainder of the paper. Let denote the complementary CDF (cCDF), so that and . We will for nearly the entire paper make the assumption of independent hypothesis tests. While the BenjaminiHochberg false discovery rate procedure is not imune to departures from the assumption of independent tests, the effect of departures from independence (i) do not affect the expected false discovery fraction or expected true positive fraction (defined precisely below) and (ii) discrepencies which do result from such departures are seen in the distribution of the FDF and TPF and are caused by a reduced effective number of simultaneous tests. For these reasons, results obtained under the assumption of independent hypothesis tests are still of great utility. We return to this discussion in the final section.
2.1 Mixed distribution of the Pvalues
For , let denote the twosided nominal pvalues corresponding to the test statistics, , and let denote their order statistics. Notice that the nominal pvalues, are i.i.d. having mixed CDF ,
(3) 
As we shall see in the proofs of Theorems 4.1, 4.3, and 4.6, below, the requirement that the family satisfies the monotone likelihood ratio principle guarantees that is concave. Next in the original unsorted list of nominal pvalues, , let be the subset of nominal pvalues corresponding to test statistics from the noncentral population, in the order that they occur in the original unsorted list, e.g., if counts the number of noncentrally located statistics among the first in the original unsorted list, then .
2.2 Numbers of significant calls, true postives and false positives
The Benjamini and Hochberg (BH) false discovery rate (FDR) procedure (Benjamini and Hochberg, 1995) provides a simultaneous test of all
null hypotheses, that controls for multiplicity in a less conservative way than Bonferroni adjustment by changing the paradigm. Instead of controlling the probability that one or more null hypotheses is erroneously rejected, it controls the expected proportion of null hypotheses rejected that were true, or equivalently, the posterior probability that a test statistic has null location parameter given it was called significant. The algorithm is implemented by specifying a tolerable false discovery rate,
, and then finding the largest row number, , for which the corresponding order statistic, , is less than . The total number of test statistics in the rejection region, , is given by the following expression:Definition 2.2.
We will refer to as the number of positive calls or discoveries which is consistent with the terminology of Benjamini and Hochberg, and we call the ratio the positive call fraction. Notice that expression 2.2 has the following alternate form:
(4) 
The number of positive calls partitions into true positve calls and false positive calls
Definition 2.3.
Let denote the number of true positive calls:
(5) 
Definition 2.4.
Let denote the number of false positive calls:
(6) 
There are several possible choices of normalizers for and
, depending upon the popultion value being estimated. Because power in the single hypothesis test scenario is the probability of rejection conditional upon the alternative distribution, it is natural to normalize by the number of nonnull distributed statistics,
:Definition 2.5.
We define the true positive fraction as the ratio .
Results concerning the false discovery rate will follow as corollaries to our other results. For this reason, we normalize the number of false positive calls, , by the number of positive callse, :
Definition 2.6.
We define the false discovery fraction as the ratio .
In general we will use the term fraction for a ratio that is a random quantity and rate for its expectation.
Table 1 shows rows partitioning the test statistics into those that are nonnull distributed, and those that are null distributed, numbering and , respectively, and columns partitioning the results of hypothesis testing into the positives and negatives calls, numbering , and , respectively.
2.3 False discovery rate
Let . Benjamini and Hochberg (1995) showed in their original paper that their procedure controls the expected false discovery fraction, which they called the false discovery rate:
(7) 
In keeping with pervasive terminology, the phrase “false discovery rate” is applied to both the expected false discovery fraction, , and in addition, the nominal value, which is used to set the threshold on the pvalue scale. We will use the symbol to denote the BHFDR procedure at nominal false discovery rate,
. In this paper, whenever a random variable occurs in the denominator, we tacitly define the indeterminate 0/0 to be 0, which has the effect that all such ratios are defined jointly with the event that the denominator is nonzero.
rej  acc  row Total  

is FALSE  
is TRUE  
col Total 
3 The distribution of and notions of power in multiple testing scenarios
In the single hypothesis test situation, , we consider probabilities of rejection given is true or false. Under the setup introduced here, when , the BHFDR, becomes the type I error probability, and the power as it is usually defined for a single hypothesis test, is the conditional expectation of given that . In the case of multiple tests, is distributed over values from zero to as high as so that naturally there are a multitude of avenues for conceptualizing the power. Consider first, that had we been using the Bonferroni procedure for multiple tests adjustment to threshhold the test statistics arriving at positives and true positives, the distribution of would have been binomial with common success probability equal to the pertest power. The fact that the distribution of is not binomial when the BHFDR criterion is used is what makes discussion of power more difficult. However, the common thread is that any discussion of power in the multiple testing scenario must be based upon some summary of the distribution of
, e.g. a right tail or a moment.
3.1 Various definitions of power in multiple testing scenarios
One of the first approaches was to use the probability that is nonzero: . Lee and Whitmore (2002) used a Poisson approximation to derive a closedform expression for the probability to observe one or more true positives. This kind of power, the familywise power, is arguably not a meaningful target of optimization for experiments built around a large number of simultaneous tests, especially when there are typically complex underlying hypotheses relying on positive calls for a sizable portion of those tests for which the alternative is true. For example, consider that in a microarray experiment in which there will be downstream pathways analysis, we would start by assuming that there are around 3% or more of the tests for which the alternate hypothesis is true, and hope to make significant calls at an FDR of 15% for at least 80% of these nonnull distributed statistics, so as to have a thresholded list of roughly genes to send to an analysis of pathways.
3.2 The Average Power and power
Average Power
In the BHFDR procedure for multiple testing, the role of the type I error played in the single testing scenario is assumed by the FDR which is an expected proportion. Therefore, it is natural in the multiple testing scenario to consider a power that is also defined as an expected proportion. One interpretation of power in the setting of multiple testing that falls along this line of reasoning is the “average power”.
Definition 3.1.
The average power is the expected true positive fraction, i.e. the expected proportion of all nonnull distributed statistics that are declared significant by the BH FDR procedure.
(8) 
Notice that here the dependence upon is made explicit, so that the average power depends upon the number of simultaneous tests in addition to quantities named above. Glueck et al. (2008) provided an explicit formula for the average power in a finite number, , of simultaneous test, but its complexity grows as the factorial of the number of simultaneous tests, and this is clearly intractable in the realm of microarray studies and GWAS where there are tens of thousands or even a million simultaneous tests in question.
3.3 The plugin estimate of the average power
Thinking heuristically for the moment, if
is very large as will be the case in many “omics” scenarios, then the positive call fraction could be considered very close to a limiting value, , if such a limiting value existed either in probability or almost surely. Continuing along heuristic lines then, we could replace the positive call fraction, , with the limiting constant, , in the right hand side of expression 4, as well as in expressions 5 and 6, which define and . If such a treatment were legitimate, the resulting analysis would be extremely easy as and would be sums of i.i.d’s and the usual L.L.N. holds, with limits given by the expected value of a single increment in the corresponding sum. This gives rise to the plugin estimate of the average power,(9) 
Several authors have discussed this plugin estimate of average power, for example Genovese and Wasserman (2004); Storey (2002). Independently, Jung (2005) and Liu and Hwang (2007)
discuss sample size and power in the setting of multiple testing based upon the BH FDR procedure. Without actually ever calling it the average power, they derive an expression very close to the above plugin estimate. Their derivation starts with the posterior probability that a statistic was drawn from the nulldistributed population, given that it was called significant. Bayes theorem is used to express this in terms of the prior,
and , and conditional, and . They mistakenly equate this to the nominal false discovery rate, , when in actuality it is the observed false discovery rate, . Not withstanding, their methodology is valid because the resulting power is, as we shall see below, the average power at the “oracle” threshold (Genovese and Wasserman (2004)) on the pvalue scale , . This is the largest cutoff that is still valid at the nominal false discovery rate, . Around the same time, Sun and Cai (2007), discussed a generalization of of the FDR procedure based upon the local FDR, and showed via decision theoretic techniques, that the false nondiscovery rate, a quantity related to the average power, has better performance characteristics than the FDR under many circumstances. The paper also provided a very good survey of the then currently available results. While the ramifications of that work are great, the results provided here have merit in that the results and methodology supplied in the form of second order asymptotics for the TDF and the FDF are entirely new and have important ramifications in of themselves.The Power
Use of the average power in designing studies or deriving operating characteristics of them makes sense only when the width of the distribution of is very narrow. To have more definitive control over the true positive fraction, some authors have introduced the “Kpower”. This was originally introduced in a model where the number of nonnull distributed tests was fixed and was defined as the probability that the number of true positives exceeded a given integral threshold, . Under the current setup in which the number of nonnull distributed tests is a binomial random variable this no longer makes sense. We introduce instead the power, which is the probability that the true positive fraction, exceeds a given threshold, :
Definition 3.2.
We define the power:
(10) 
We will also use the term “power” to denote , the power at threshold
. The associated quantile function is denoted:
(11) 
As mentioned above, the power becomes especially meaningful in experiments for which there are a small to intermediate number of simultaneous tests and for which the distribution of the TPF, , becomes nonnegligibly dispersed. We prove a CLT for the true positive fraction which we use to approximate the power. The accuracy of this approximation will be investigated in a simulation study.
Remark 3.3.
Because the distribution of the TPF is nearly symetric for even relatively small values of , the mean and median nearly coincide. Thus
(12) 
Because the power takes the values when and when and is continuous by assumption, there exists a quantile, , at which the power equals the average power:
(13) 
Because the power is a cCDF it is a nonincreasing function of ,
(14) 
Bounding the FDF
We conclude the section on various notions of power with a brief diversion. The fact that the TPF may be nonnegligibly dispersed at small to intermediate values of leads to concern that the FDF distribution is similarly dispersed at these small to intermediate values of . This concern is addressed in one of the CLT results and in the simulation study. We introduce some notation for its cCDF and quantile function.
Definition 3.4.
At BHFDR , denote the FDF tail probability:
(15) 
Denote its quantile function:
(16) 
At small and intermediate values of , the value of required to bound the FDF by with probability bounded by can be as much as 100% larger than the FDR. As remarked above, this will be discussed further in the context of our CLT results and simulation studies below.
The remainder of the paper proceeds according to the following plan. Section 4 is a presentation of the main theoretical results, and this is done two subsections. In subsection 4.1, almost sure limits of the positive call fraction, true positive fraction and false discovery fraction, as the number of simultaneous tests tends to infinity, are shown to exist and are fully characterized. Convergence of the corresponding expectations, the true positive rate or average power, and false positive rate, follow as a corollaries. Subsection 4.2 contains central limit theorems (CLT’s) for the positive call fraction, true positive fraction and false discovery fraction. We also provide a lower bound for the average power at a finite number, , of simultaneous tests. We show how these CLT results can be used to approximate the power allowing tighter control over the TPF in power and sample size calculations, as well as how the approximate distirbution of the FDF can be used to tighten down control over the FDF at both the design and analysis stage. Section 5 is devoted to a simulation study, in which we study the regions of the parameter space that are typical to small biomarker studies, microarray studies and GWAS studies. We also focus on characteristics of the distribution of the FDF as the number of simultaneous tests grows. Weak consistency of and to , and was proved in Storey (2002) and Genovese and Wasserman (2002). The paper Genovese and Wasserman (2004) is a study of consistency and convergence in distribution of the paths of the plugin estimator, considered a stochastic process in the pvalue plugin criterion, . The strong consistency results and weak convergence results for the observed positive call fraction, true positive fraction and false discovery fraction presented here are new. Note that almost sure convergence of the positive call fraction is necessary for almost sure convergence of the true positive and false discovery fractions.c
4 Theoretical Results
4.1 Law of Large Numbers
LLN for Positive Call Fraction,
Theorem 4.1.
If the family is absolutely continuous and has the monotone likelihood ratio property, then
(17) 
Proofs of this and all other results are contained in the accompanying supplemental material.
Remark 4.2.
When the family has the monotone likelihood ratio property, will be the unique nonzero solution of .
Once the almost sure convergence of is established we can apply a result of Taylor and Patterson (1985) to establish convergence results for the true positive fraction.
LLN for the True Positive Fraction,
Theorem 4.3.
Under the conditions of theorem 4.1,
(19)  
(20) 
Corresponding convergence results for the false discovery fraction and its expected value follow as a corollary.
LLN for the False Discovery Fraction,
Corollary 4.3.1.
Under the conditions of theorem 4.1,
(22)  
(23) 
Remark 4.4.
Remark 4.5.
If the nominal false discovery rate, , is replaced by the inflated value, , resulting in the procedure, note that the FDR is still controlled at the nominal value, , since in this case, due to cancellation. This threshold, , on the pvalue scale, has been called the oracle threshold by some authors, Genovese and Wasserman (2004), because it is the criterion resulting in the largest power which is still valid for a given FDR, . Call this the oracle average power, . The actual difference only begins to get appreciable as increases in size. In practice, as we will see in our simulation study, must be as large as 50% or more before this has a dramatic effect on the average power. Keep in mind that in practice when analyzing a given dataset, this increased power is only attainable at the stage of estimation if a reasonably good estimate of is possible. The fact that this is very problematic has also been a topic of much discussion.
Next, if we replace in the definition of the IST average power 19 by the expression 24, we arrive at a new equation which gives an implicit definition for the IST average power.
Corollary 4.5.1.
Under the conditions of theorem 4.1,
(25) 
The almost sure convergence results given in Theorems 4.1, 4.3.1 and 4.3 above each have corresponding central limit results which we state now. The first, a CLT for the centered and scaled positive call fraction, , is needed in the proof of the second and third results, CLT’s for centered and scaled versions of the false discovery fraction and the true positive fraction.
4.2 CLTs for the PCF, FDF, and TPF; Lower Bound for Average Power
Theorem 4.6.
Under the conditions of theorem 4.1,
(28) 
The proof, which uses results on convergence of stopped stochastic processes, is constructive in nature producing fully characterized limiting distributions yielding asymptotic variance formulae. We reiterate the practical implications of these results.
Approximating the power via the CLT for the TPF
Currently, multiple testing experiments are designed using the average power, , which is the mean of the distribution of . In cases in which the width of this distribution is nonnegligible, e.g. simultaneous tests or so, we recommend using the power instead of the average power. As we defined above, in equation 10, the power is the probability that the TPF exceeds a given . We will see in our simulation study that in the ranges of the parameter space investigated, this CLT approximation is quite good and can be used to approximate the power:
(29) 
where above is the squareroot of the asymptotic variance, , given in formula 56 in the proof of theorem 4.6.
Enhanced control of the FDF via its CLT
As defined above in expression 16 and the text leading up to it, is the quantile of the FDF distribution at upper tail probability, . The CLT for the FDF can be used to approximate it:
(30) 
Here, as above, is the square root of the asymptotic variance, given in formula 58 in the proof of 4.6, and
is the standard normal quantile function. This can be used in several different ways to bound the FDF with specified probability. Three possibilities are as follows. First, as a kind of loss function on lack of control inherent in the use of the
procedure, we could determine how large a threshhold is required so that the FDF is bounded by except for a tail probability of(31) 
A second way would to find the solution to the following equation.
(32) 
This would produce a reduced FDR, , at which the procedure would result in a FDF, of no more than with probability . The solution is
(33) 
A third way to do this, and the most conservative of the three, would be to determine the value of a reduced FDR, , at which the procedure would result in a FDF no more than with probability , by solving the following equation numerically:
(34) 
Remark 4.7.
The farther apart is from is an indication of the dispersion of the distribution of .
Remark 4.8.
The procedure summarized in equation 34, for finding a reduced FDR, , at which would produce an FDF of no more than with probability , can also be used at the analysis phase. Note that expression 58 in the proof of 4.6 for the asymptotic variance, , of depends only upon and . Thus we can replace with and estimate from the data using the plugin estimate, . This has important ramifications for the setting of small to intermediate number of simultaneous tests, .
Lower Bound for finite simultaneous tests average power
As we will see in the simulation study which follows, the IST average power is in fact extremely close to the finite simultaneous tests (FST) average power for the broad ranges of the parameters studied. Nevertheless, it is still useful to have bounds for the FST average power.
Theorem 4.9.
The FST average power, , is bounded below by the following quantity, , given below.
5 Simulation Study
We conducted four simulation studies. The first three of these had fixed and ranges of the other parameters chosen based upon relevance to subject matter areas. The first, with , was meant to model biomarker studies. In a second simulation study, we varied in order to study characteristics of the FDF distribution as grows. In the third, in which , was meant to model microarray studies for while the fourth, for which , was meant to model genome wide association (GWA) studies.
In all four cases, the test statistic distributions, and , were chosen to be tdistributions of degrees of freedom. The common noncentrality parameter was fixed at . This corresponds to a two group comparison as is often done. For each of these simulation studies, we chose subject matter relevant ranges for the four parameters, the expected proportion of nonnull tests, , the location parameter, , and the false discovery rate, . Except when set explicitly as in the fourth case, a range sample sizes, , in increments of 5, was chosen to result in powers between 60% and 95% at each setting of the other parameters. We conducted a total of four simulation studies. The first, with simultaneous tests, was meant to model biomarker studies. The second, with varying sizes of ranging from 1,000 to 20,000 was done in order to assess the width of the FDF distribution and the adequacy of the CLT approximation to it. The third, with 54,675 was meant to model human oligonucleotide microarray experiments, and the forth with 1,000,000 was meant to model GWA studies. We present the first two of these in the main text and the remainder in the supplementary material.
For each of simulation studies focused on the distribution of the TPF, we computed, at each combination of these four parameters, the IST average power, , from line 20 of Theorem 4.3, the oracle power, , mentioned in remark 4.5 above, and the lower bound, , from Theorem 4.9. We computed the approximate  and  powers using expression 29 based upon the CLT for the TPF (theorem 4.6) using the expression for the asymptotic variance, (expression 56 given in the proof). Also, at each combination of parameters, we conducted 1,000 simulation replicates. At each simulation replicate, we began by generating i.i.d. Bernoulli variables, with success probability, , to assign each of the test statistics to the null (0) or nonnull (1) populations, recording the number, , of nonnull distributed test statistics. This was followed next by drawing test statistics from or , being the central and noncentral (respectively) tdistribution of degrees of freedom, corresponding to the particular value of . Next, the BH FDR procedure was applied and the number of positive calls, , and number of true positives, were recorded. At the conclusion of the 1,000 simulation replicates, we recorded the simulated average power as the mean over simulation replicate of the TPF, . In addition, the simulated  and  powers were derived as the fraction of simulation replicates of the TPF that exceeded and , respectively. Finally we computed the sample size required for power.
In another simulation study, focused on the distribution of the FDF for increasing . At each combination of the parameters considered, we computed the reduced FDR, , required to bound the FDF with probability as the unique numerical solution to expression 34. We also computed the sample sizes and required for specified average power under and under , respectively. Sample sizes and at specified power under the corresponding procedure were also derived. A simulation, conducted in a fashion identical to that described above, under , was done at each combination of parameters, this time including the additional two parameters and specified power. From simulation replicates of the FDF, , we computed the probability in excess of .
All calculations were done in R, version 3.5.0 (R Core Team (2016)) using a R package, pwrFDR, written by the author Izmirlian (2018), available for download on cran. Simulation was conducted on the NIH Biowulf cluster (NIH High Performance Computing Staff (2017)), using the swarm facility, whereby each of 50 nodes was tasked with carrying out 20 simulation replicates resulting in 1,000 simulation replicates for each configuration of parameters.
5.1 Biomarker Studies
For the first simulation study we considered experiments typical of biomarker studies with simultaneous tests. We attempted to cover a broad spectrum of parameters spanning the domain of typical biomarker study designs. The false discovery rate, , was ranged over the values , and from 5% to 30% in increments of 5%. The expected number of tests with nonzero means, , was varied over the values 5, 10 and 20, and from 10 to 100 in increments of 10, representing values of ranging from 0.025 to 0.5. The effect size, , was allowed to vary from 0.6 to 1.5 in increments of 0.1. At each configuration, a range of sample sizes were chosen to result in powers between 50% and 98% as mentioned above. This resulted in 2,648 configurations of the parameters, , and (full set of parameter combinations). The job took roughly 7 minutes on the NIH Biowulf cluster.
Table 2 tabulates the IST average power, the oracle power and the simulated mean of the TPF at 28 different parameter settings excerpted from the full set of 2,648 parameter combinations. Over the full set of parameter settings, the both the IST , and oracle powers are very close to the simulated average power. The difference between the IST power, , and the simulated power was less than 0.15, 0.95 and 2.00 at 50%, 90% and 99% of the parameter settings, respectively. As remarked earlier, the oracle power is actually the average power at the oracle threshold. Since it borders on feasible, we allowed to take values as large as 0.5 for which the oracle threshold has a substantial gain in power. The oracle power differed from the IST power by 1.8%, 8.5% and 16% at 50%, 90% and 99% of the parameter settings, respectively, suggesting that the oracle threshold is worth considering. Recall that our IST power, can be set to the oracle threshold by setting the FDR to . However, careful consideration must be taken if using the oracle threshold to design a study, since when its time to actually threshold the data one needs a plugin estimate of and as discussed extensively in the literature, this can be problematic. The lower bound comes within roughly 10% of the simulated power, with differences with the simulated power less than 36%, 45%, 50%, 56% and 71% at 20%, 40%, 50%, 60% and 80% of the parameter settings, respectively.
Table 3 displays, at threshold 0.75 and at threshold 0.90, the power as derived from the CLT 4.6 and estimated from simulation replicates (hatted version), respectively, excerpted from the full set of 2,648 parameter combinatons as before. In the last column is the ratio of the sample size required for power to the original sample size. First, we note that when restricted to powers strictly between 50% and 100%, occurring at 1,488 parameter combinations, the CLT approximate and simulated
power were within the following relative error of one another (median over parameter conditions (lower quartile, upper quartile)): 2% (0.7%, 4.5%), with 23.1% over 5%. Corresponding results for the simulated and CLT approximate
power for powers strictly between 50% and 100% occurring at 1275 of the parameter values, were within the following relativer error of one another 3.3% (1.3%, 9.2%), with 37.6% over 5%. The greater discrepancy between CLT approximate powers and simulated values is due to the lack of accuracy of the CLT asymptotic approximation at such small sample sizes, . Note that for sample sizes in excess of the degree of accuracy starts to improved dramatically, especially at higher powers. Also noteworthy is corroboration in ordering of the average power and power based upon the size of relative to . All values of are less than 90%, but some are between 75% and 90%, and the ordering of average power and powers is in accordance with expression 14.Furthermore the discrepancy between the average power and the power is reflective difference between and . This trend is echoed in the magnitude of the sample size ratio, with magnitude increasing in the discrepancy between and . Note that in this case, as the number of simultaneous tests, , is relatively “small”, the distribution of the TPF, , is more dispersed and therefore, growth in the powers is more gradual with increasing sample size.
5.2 The false discovery fraction, intermediate number of simultaneous tests
The second simulation study was focused on the use of the CLT for the FDF to find a bound for the FDF with large probability. We varied the number of simultaneous tests, , over 1000, 2500, 5000, 7500, 10000 and 20000. The effect size, , was varied over 2/3, 5/6 and 1. The proportion of statistics drawn from the nonnull distributed population, , ranged over 0.025, 0.05 and 0.075 and the FDR, , ranged over the values 0.1, 0.15 and 0.2. At each set of values of these parameters, we used expression 34 based upon the CLT for the FDF to find a reduced FDR at which the BHFDR procedure would result in an FDF of no more than with large probability. We calculated the sample sizes required for specified average power under the original and reduced FDR. Sample sizes required for specified power under the original and reduced FDR were also calculated. Finally, the probability that the FDF exceeded under the reduced FDR was estimated from simulation replicates.
Table 4 tabulates the reduced FDR, , required to bound the FDF by with probability , and the sample sizes and , required for specified average power under and under , respectively. Also shown are the sample sizes, and , required for specified power under and under , respectively, as well as the simulated tail probability, , in excess of . The general trend for increasing as the distribution of collapses to a point mass at are a value of closer to , and sample sizes under that are less inflated relative to corresponding sample sizes under . The simulated right tail probability, should in theory have only simulation error about its theoretical value, . However, as is derived from an approximation based upon a CLT, we expect the accuracy of the approximation to improve with larger . Not surprisingly, the results are consistent with these observations. Over the full set of 324 parameter settings we obtained the following results (median, (lower quartile, upper quartile)). The ratio of the reduced FDR, , to the original FDR, : 0.79 (0.69, 0.86) when , and 0.91 (0.88, 0.93) when , showing that the reduced FDR gets closer in value to the original FDR with increasing . The ratio of sample size required for average power at to that at : 1.06 (1.04, 1.1) when and 1.03 (1.02, 1.04) when , showing that the inflation factor reduces with increasing . This is also the case when the sample sizes are derived for given power: 1.05 (1.03, 1.08) when and 1.02 (1.0025, 1.03) when , respectively. Finally, the ratio of the simulated tail probability, to the CLT approximated value: 1.02 (0.95, 1.11) when and 0.995 (0.94, 1.0675) when respectively, highlighting that the CLT approximation gets better with increasing .
In order to judge the relative impact changes in the parameters had, especially those unique to this setting of multiple testing, we computed numerical partial derivatives of the power function with respect to the proportion of test statistics distributed as the nonnull distribution , the effect size , and the FDR . The partials were then scaled to the range of the relevant parameter (max minus min) so that unit changes were comparable and corresponded to the ranges of the parameters considered. Numerical partial derivatives were computed at all 10,020 configurations of the parameters. These results were summarized separately for each of the three parameters by calculating quartiles of the respective numerical partial derivative at each given level of the respective parameter, over all configurations of all other parameters resulting in powers of 50%, 60%, 70%, 80% and 90%. The results corresponding to median values at 70% power are displayed in Figure 2.
6 Discussion
We proved LLNs for the PCF, FDF and TPF as well as CLTs for scaled versions of them. Our LLN result for the TPF allowed characterization of the large limit and this in turn allowed a proper interpretation of the power discussed in Jung (2005) and in Liu and Hwang (2007), being nearly identical to the average power. Our CLT result for the TPF allowed us to introduce the power, similar in nature to the power discussed by previous authors. The power allows tighter control over the TPF in the design of multiple testing experiments by bounding the distribution of the TPF by an acceptable threshold, rather than just its mean, as is the case with the average power. Our CLT result for the FDF provides a technique whereby an investigator can determine a reduced FDR at which the usual BHFDR procedure will result in a FDF no greater than a stipulated value with arbitrary large probability. This latter technique is useful both at the design phase as well as the analysis phase because the asymptotic variance depends only upon the limiting PCF, , and proportion belonging to the null distributed population, and one can use as an estimate of and consider when faced with a data analysis. Key to the proofs of the LLN results was, first, the LLN for the PCF, , which was proved directly via a simple argument. Prior results by Genovese and Wasserman (2004) obtained convergence only in probability. Once established we applied a result of Taylor and Patterson (1985) for a.s. convergence of triangular arrays of finite exchangeable sequences The proofs of the CLT results was made possible by building on the work of Genovese and Wasserman (2004) which considered values thresholded at a deterministic , treating them as stochastic processes. We applied a result of Silvestrov (2004) for weak convergence of stopped stochastic processes.
In a very large and thorough simulation study, we investigated three major domains of the space of operating characteristics typically encountered in the design of multiple testing experiments: two larger domains, typical to human RNA expression microarray studies, and typical to GWA studies and a smaller domain, which is typical of biomarker studies. In each case, we compared the average power derived from the TPF LLN limit to simulated values and observed at all ranges of sample sizes that the agreement was quite good. We also used the CLT result for the TPF to compute approximate powers and compared these with the simulation distribution. Agreement in this case was overall very good, but there was some breakdown in the level of accuracy in the asymptotic approximation at simultaneous tests . The last simulation study was focused upon the procedure for bounding the FDF with large probability and its behavior as the number of simultaneous tests, , grows from hundreds to tens of thousands. We noted that overall, the method is feasible, even when the asymptotic approximation begins to fail, as it always offers tighter control of the FDF than the BHFDR procedure alone.
We investigated departures from the assumption of independent hypothesis tests by conducting a simulation in which tests were correlated within blocks according to a compound symmetry structure under a multivariate normal, having marginal variances equal to 1. For the purposes of this investigation, we fixed the block size to 100, effect size 1.25, FDR at 15%, proportion of nonnull distributed tests, , at 5% for 2000 simultaneous tests. We varied sample sizes from 14 to 16 and block correlation from 0 to 80% in increments of 10%. The average power, power, empirical FDR and probability that the FDF exceeds 18% are tabulated in table 5 over the ranges of sample sizes and block correlations considered. As we can see, comparing the independent tests lines for each of the two sample sizes, 14 and 16, with corresponding values for correlated test statistics, a very important point can be made. From the standpoint of the mean, there is virtually no difference. This is to say that the empirical FDR and average power are virtually unaffected when there are correlated blocks of tests. Notable differences do occur in the distributions of the TPF and FDF as the power for independent test statistics is 30% in a sample of 14, and 87% in a sample of 16, respectively, while the values when there are correlated blocks of tests are substantially greater for a sample of 14 and substantially less in a sample of 16, respectively. Discrepancies between the independent tests versus correlated blocks of tests in the same direction are also observed in the probability that the FDF exceeds 18%. The reason for this is that correlated blocks of test statistics result in a reduced effective number of tests. Apparently, the observed effective number of test statistics is large enough that the empirical means are still very good estimates of their almost sure limiting values, but not great enough for stability in the distribution of empirical means at the ranges of parameters under consideration. The conclusion to be drawn is not that the BHFDR procedure is to be avoided because it is not completely imune to departures from the independent test statistics assumption. By analogy, is any limit theorem meaningless because it doesn’t apply to a sample size of 3? Quite not. The first conclusion to be drawn is that the empirical means appear to be unaffected in the ranges of parameters considered here. If one is truly comfortable controlling the false discovery rate and powering studies using the average power, then one can ignore the appropriateness of the independence assumption. Problems start to occur when one uses the tails of the distribution of the FDF and TPF, as we are making the case for use here. However, rather then give up on the use of the BHFDR procedure altogether, the phenomenon should be viewed from the lens of the effective number of simultaneous tests. So, ironically, the problem is solved if one can simply increase the number of simultaneous tests.
Drawing away from the specific discussion of correlated tests and widening the focus to the conclusions to be drawn from the paper as a whole, the point to be made is that the quantities arising in the BHFDR procedure, the expected FDF which is controlled, and the expected TPF which forms the basis of a power calculation, should be seen for what they are, location parameters. Because first and second order asymptotics in the FDF and TPF occur as the number of simultaneous tests tends to infinity, then within the scope of reasonable ranges of parameters, e.g. effect size no more than 1 or so, and sample sizes within the ranges seen for equipment that is either very expensive per replicate or just starting to get a bit cheaper, say a few tens of replicates, then the following generalizations can be made. For more than 20,000 simultaneous tests, the means and the distributions effectively coincide so that controlling the FDR and using the average power to derive sample sizes is well supported. This is great news for GWAS and RNAseq studies for example. However, for less than one or two thousand simultaneous tests, one must use second order asymptotics to control the type Ilike error and calculate sample sizes using the CLT’s for the FDF and TPF in the manner outlined here. For on the order of a hundred or so simultaneous tests, asymptotic approximation using the CLT’s may not be appropriate. In this case, simulation is advised. This cautionary note is of particular importance in many biomarker studies.
References
 Alizadeh et al. (2000) Alizadeh, A. A., M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, et al. (2000). Distinct types of diffuse large bcell lymphoma identified by gene expression profiling. Nature 403(6769), 503–511.
 Baggerly and Coombes (2009) Baggerly, K. A. and K. R. Coombes (2009, Dec). Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in highthroughput biology. Ann. of Appl. Statist. 3(4), 1309–1334.

Baldi,
Pierre and Long, A. D. (2001)
Baldi, Pierre and Long, A. D. (2001).
A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized tTest and Inference of Gene Changes.
Bioinformatics 17(6), 509–519.  Benjamini and Hochberg (1995) Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate  a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57(1), 289–300.
 Genovese and Wasserman (2002) Genovese, C. and L. Wasserman (2002). Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. Ser. B Stat. Methodol. 64(3), 499–517.
 Genovese and Wasserman (2004) Genovese, C. and L. Wasserman (2004, JUN). A stochastic process approach to false discovery control. Ann. Stat. 32(3), 1035–1061.
 Glueck et al. (2008) Glueck, D. H., J. Mandel, A. KarimpourFard, L. Hunter, and K. E. Muller (2008). Exact Calculations of Average Power for the BenjaminiHochberg Procedure. The International Journal of Biostatistics 4(1), 11–28.
 Ibrahim et al. (2002) Ibrahim, J. G., C. M. H., and R. J. Gray (2002, MAR). Bayesian models for gene expression with DNA microarray data. J. Amer. Statist. Assoc. 97(457), 88–99.
 Ioannidis (2005) Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med 2(8), e124.
 Izmirlian (2018) Izmirlian, G. (2018). pwrFDR: FDR Power. R package version 1.90.
 Jung (2005) Jung, S. H. (2005, Jul 15). Sample size for FDRcontrol in microarray data analysis. Bioinf. 21(14), 3097–3104.
 Lee and Whitmore (2002) Lee, M. L. T. and G. A. Whitmore (2002, Dec 15). Power and sample size for DNA microarray studies. Stat. in Med. 21(23), 3543–3570.
 Liu and Hwang (2007) Liu, P. and J. T. G. Hwang (2007, Mar 15). Quick calculation for sample size while controlling false discovery rate with application to microarray analysis. Bioinf. 23(6), 739–746.
 NIH High Performance Computing Staff (2017) NIH High Performance Computing Staff (2017). The NIH Biowulf cluster. http://hpc.nih.gov/. [Online; accessed 22December2017].
 R Core Team (2016) R Core Team (2016). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
 Silvestrov (2004) Silvestrov, D. (2004). Limit Theorems for Randomly Stopped Stochastic Processes. London: SpringerVerlag.
 Storey (2002) Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol. 64(3), 479–498.
 Sun and Cai (2007) Sun, W. and T. T. Cai (2007). Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc. 102(479), 901–912.

Taylor and
Patterson (1985)
Taylor, R. L. and R. F. Patterson (1985).
Strong laws of large number for arrays of rowwise exchangeable random elements.
Internat. J. Math. & Math. Sci. 8(1), 135–144.
7 Appendix: Further simulation studies
7.1 RNA Expression Microarray Studies
The third simulation study we considered experiments typical of human RNA expression microarray studies using the Affymetrix Hgu133plus2 oligonucleotide mRNA gene chip. In this case, there are simultaneous tests. We attempted to cover a broad spectrum of parameters spanning the domain typical of microarray study designs. The false discovery rate, , was ranged over the values 1%, and from 5% to 30% in increments of 5%. The expected number of tests with nonzero means, , was varied from 100 to 2500 in increments of 100 representing values of ranging from 0.0018 to 0.046. The effect size, , was allowed to vary from 0.6 to 1.5 in increments of 0.1. At each configuration, a range of sample sizes were chosen to result in powers between 60% and 95% as mentioned above. This resulted in 10,020 configurations of the parameters, , and (full set of parameter combinations). The job took roughly 12 hours on the NIH Biowulf cluster.
Table 6 tabulates the IST average power, the oracle power and the simulated mean of the TPF at 28 different parameter settings excerpted from the full set of 10,020 parameter combinations. Over the full set of parameter settings, the both the IST, , and oracle, , powers are very close to the simulated average power. The difference between the IST power, , and the simulated power was less than 0.022%, 0.077% and 0.19% at 50%, 90% and 99% of the parameter settings, respectively. The oracle power differed from the IST power by less than 0.15%, 0.5% and 0.92% at 50%, 90% and 99% of the parameter settings, respectively. As remarked earlier, the oracle power is actually the average power at the oracle threshold, but for such small values of there is not much gain in power to be had. The lower bound comes within roughly 10% of the simulated power, with differences with the simulated power less than 2.1%, 4.5%, 6.2%, 8.5% and 15% at 20%, 40%, 50%, 60% and 80% of the parameter settings, respectively.
Table 7 displays, at threshold 0.75 and at threshold 0.90, the power as derived from the CLT 4.6 and estimated from simulation replicates (hatted version), respectively, excerpted from the full set of 10,020 parameter combinatons as before. In the last column is the ratio of the sample size required for power to the original sample size. First, we note that when restricted to powers strictly between 50% and 100%, occurring at 1,066 parameter combinations, the CLT approximate and simulated power were within the following relative error of one another (median over parameter conditions (lower quartile, upper quartile)): 0.4% (0.2%, 1.2%), with 2.4% over 5%. Corresponding results for the simulated and CLT approximate power for powers strictly between 50% and 100% occurring at 1476 of the parameter values, were within the following relativer error of one another 0.5% (0.2%, 1.3%), with 1.8% over 5%. Also noteworthy is corroboration in ordering of the average power and power based upon the size of relative to . All values of are less than 90%, but some are between 75% and 90%, and the ordering of average power and powers is in accordance with expression 14. Furthermore the discrepancy between the average power and the power is reflective difference between and . This trend is echoed in the magnitude of the sample size ratio, with magnitude increasing in the discrepancy between and . The relatively rapid rise in sample size, , of all powers is an indication of the degree to which the distribution of the TPF, , is spiked.
7.2 GWA Studies
The last simulation study we considered experiments typical of GWA studies with simultaneous tests. We attempted to cover a broad spectrum of parameters spanning the domain typical of GWA study designs. The false discovery rate, , was ranged over the values 0.5%, 1%, 5% and 10%. The expected number of tests with nonzero means, , was varied from 400 to 1000 in increments of 200 representing values of ranging from 4e04 to 0.001. The effect size, , was allowed to vary from 0.08 to 0.68 in increments of 0.2. At each configuration, a range of sample sizes were chosen to result in powers between 50% and 98% as mentioned above. This resulted in 512 configurations of the parameters, , and (full set of parameter combinations). The job took roughly 5 hours on the NIH Biowulf cluster.
Table 8 tabulates the IST average power, the oracle power and the simulated mean of the TPF at 32 different parameter settings excerpted from the full set of 512 parameter combinations. Over the full set of parameter settings, the both the IST, , and oracle, , powers are very close to the simulated average power. The difference between the IST power, , and the simulated power was less than 0.031, 0.087 and 0.147 at 50%, 90% and 99% of the parameter settings, respectively. The oracle power differed from the IST power by 0.0038%, 0.0081% and 0.011% at 50%, 90% and 99% of the parameter settings, respectively. As remarked earlier, the oracle power is actually the average power at the oracle threshold, but for such small values of the gain in power is now less than 1%. The lower bound comes within roughly 10% of the simulated power, with differences with the simulated power less than 0.83%, 3.8%, 6.9%, 9.6% and 15% at 20%, 40%, 50%, 60% and 80% of the parameter settings, respectively.
Table 9 displays, at threshold 0.75 and at threshold 0.90, the power as derived from the CLT 4.6 and estimated from simulation replicates (hatted version), respectively, excerpted from the full set of 512 parameter combinatons as before. In the last column is the ratio of the sample size required for power to the original sample size. First, we note that when restricted to powers strictly between 50% and 100%, occurring at 68 parameter combinations, the CLT approximate and simulated power were within the following relative error of one another (median over parameter conditions (lower quartile, upper quartile)): 0.5% (0.2%, 1.3%), with 1.5% over 5%. Corresponding results for the simulated and CLT approximate power for powers strictly between 50% and 100% occurring at 87 of the parameter values, were within the following relativer error of one another 0.6% (0.2%, 1.2%), with 1.1% over 5%. Also noteworthy is corroboration in ordering of the average power and power based upon the size of relative to . All values of are less than 90%, but some are between 75% and 90%, and the ordering of average power and powers is in accordance with expression 14. Furthermore the discrepancy between the average power and the power is reflective difference between and . This trend is echoed in the magnitude of the sample size ratio, with magnitude increasing in the discrepancy between and . Notice that over values considered the ranges of are comparable among the the microarray, GWAS and biomarker simulation studies. Therefore, the “all” or “nothing” rapid rise in the powers with increasing sample size here must be solely due to the distribution of the TPF, , being even more dramatically spiked, since the number of simultaneous tests, , is in this case, considerably larger.
8 Appendix: Proofs
Proof of Theorem 4.1.
The author wishes to thank Professor Thomas G. Kurtz (Kurtz, 2016) for assistance with this proof. Recall the nominal pvalues, , their common CDF, , listed in expression 3 in the text and their order statistics . Let be the empirical C.D.F. of .
(35) 
By Kolmogorov’s theorem, at all continuity points, , of . By assumption the family is absolutely continuous and has the monotone likelihood ratio property. It follows that each of the ratios is monotone and hence the mixture of likelihood ratios, is monotone. It follows that G is concave and therefore has one nonzero solution which we will call . Let be the set of measure zero such that for all , and consider fixed in this set of measure 1 for the remainder of this proof. Substituting for in expression 35 shows that
(36) 
Let and . While contains only 0 and a unique nonzero solution, is a step function and therefore, can contain multiple nonzero solutions. Nonetheless, for each , is a finite set. By definition, is an element of the set . It follows that
where the second line follows because the set is finite. Thus, taking limsup with respect to on both sides above gives:
where the last equality follows because we can interchange the order of the limsup and maximum and because the limit exists. In the other direction, next note that because is a solution to , it also follows that for every such that . Thus,
(37) 
Because of the convexity of the limiting function, , it follows, for large enough, that . Therefore, upon taking taking liminf with respect to on both sides we have:
where the last equality follows because we can interchange the order of liminf and the maximum and because the limit exists. This completes the proof. ∎
Proof of Theorem 4.3.
First, we note that
(38) 
is the average of row in a triangular array of finite exchangeable sequences. We will apply Theorem 1 of Taylor and Patterson (1985). Let and . We must show that (i) the increments on the row, , each converge almost surely to respective elements of a sequence ; (ii) the increments have variances tending to a limit and (iii) for each and , the covariance of increments and tend to zero.
Remark 8.1.
Our condition (i), elementwise almost sure convergence, which on surface appears weaker than the corresponding first condition in the cited reference, almost sure monotone decreasing distances to the limit, is sufficient in the context of the other assumptions. See the remark following the proof of Theorem 3 in that reference.
Verification of condition (i) is trivial, as it follows by Theorem 4.1 that almost surely as for each . Let . Condition (ii) follows easily since is bounded, so that by the LDCT, for each , as . Note that the same argument verifies that . Next, to verify that condition (iii) is satisfied, note first, for , that is bounded above by , and converges almost surely to , by Theorem 4.1. Thus, condition (iii) follows by the LDCT. We may now apply Theorem 1 of Taylor and Patterson (1985) to conclude that almost surely as . Therefore,
and the last written expectation is equal to . Because almost surely as , it follows that almost surely as . Because is bounded by 1, the average power, its expectation, also converges to as by the LDCT. ∎