1 Introduction
Nowadays, scientific techniques in areas such as image processing, astronomy, and genomics routinely produce data on the scale of gigabytes, terabytes, or even petabytes, posing great challenges and opportunities to statisticians. At such a scale, statistical inferences, such as estimating parameters and testing hypotheses, are made simultaneously for thousands or even millions of parameters, where classical methodologies are no longer applicable. A tremendous burst of impressively creative statistical methodologies has been proposed to deal with various issues, including model selection, classification, and feature selection. In this paper, we focus on large-scale simultaneous hypothesis testing, or large-scale multiple comparison procedures (MCP), testing a collection of
hypotheses:
(1.1)  H_{0,i} versus H_{1,i},  i = 1, …, n.
Associated with these hypotheses is a collection of test statistics x_1, …, x_n. Here, "large-scale" means that the number of hypotheses n is very large.
1.1 Model
For i = 1, …, n, assume that the test statistic x_i has density f_0 under the null hypothesis and density f_1 under the alternative hypothesis, where f_0 and f_1 are two density functions. Let π be the proportion of alternative hypotheses. We therefore consider the two-group model (Efron (2008, 2010)), where
(1.2)  f(x) = (1 − π) f_0(x) + π f_1(x).
Similarly, let F_0 and F_1 be the cdfs of x_i under the null and alternative hypotheses, respectively. Then the cdf of x_i is F(x) = (1 − π) F_0(x) + π F_1(x). Assume that f_0 and F_0 are known, but π, f_1, and F_1 are unknown. With these test statistics available, two quantities, p-values and local fdr scores, are widely used for making decisions.
1.2 Revisiting the p-value
The p-value was defined by Ronald A. Fisher, in his research papers and various editions of his influential texts such as Fisher (1925) and Fisher (1935), as the probability of obtaining a test statistic at least as extreme as the one actually observed, given the null hypothesis. A small p-value indicates that "either an exceptionally rare chance has occurred or the theory is not true" (Fisher (1959), p. 39).
Inspired by this definition, a widely used p-value for the right-sided test is
(1.3)  p_i = 1 − F_0(x_i),
where F_0 is the null cdf. This is the starting point of many testing methods, including the famous one from Benjamini and Hochberg (1995), abbreviated as the BH method:
Equivalently, a threshold t is chosen and the i-th hypothesis is rejected if the observation exceeds this level. In other words, we reject H_{0,i} if
(1.4)  x_i > t.
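For concreteness, the BH step-up rule can be sketched as follows. This is a generic implementation of Benjamini and Hochberg (1995), not code from this paper, and the function name is ours:

```python
import numpy as np

def bh_rejections(pvalues, q):
    """Benjamini-Hochberg step-up procedure: find the largest k such that
    p_(k) <= k*q/n for the sorted p-values p_(1) <= ... <= p_(n), and
    reject the hypotheses with the k smallest p-values."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    below = p[order] <= np.arange(1, n + 1) * q / n
    if not below.any():
        return np.array([], dtype=int)      # nothing is rejected
    k = np.nonzero(below)[0].max() + 1      # largest k passing the test
    return order[:k]                        # indices of rejected hypotheses
```

Note that the rule is driven entirely by the null distribution through the p-values; the alternative distribution never enters, which is precisely the limitation discussed next.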
Such commonly used p-values, however, do not depend on f_1 (or F_1), the distribution of the test statistic under the alternative hypothesis. For cases where the likelihood ratio f_1(x)/f_0(x) is monotone increasing in x, known as the monotone likelihood ratio (MLR) property (Karlin and Rubin (1956a, b)), a small p-value, or equivalently a large observation, does imply stronger evidence against the null. However, when MLR does not hold, the belief that "the larger the observation, the stronger the evidence against the null" is shattered into pieces. For example, when the alternative density f_1 is concentrated on a bounded range while the null is standard normal, an extremely large observation should come from the null hypothesis. The resulting extremely small p-value is then attached to an observation that actually favors the null rather than the alternative. In other words, the intuition that "the larger the observation, the more likely it is from the alternative" is no longer true: those extremely large observations are more likely from the null. Thus, intuitively, the simple thresholding procedure defined in (1.4) is not good, no matter what threshold is picked; instead, it seems that we should use a two-ended rejection rule:
(1.5)  reject H_{0,i} if t_1 < x_i < t_2.
The non-monotonicity of the likelihood ratio is prevalent in both theory and applications. In Figure 1, we plot the likelihood ratio function for the following settings.

Let and where
is the density function of a standard normal random variable;

Generalized-Gaussian model: and , where is the family of generalized-Gaussian (Subbotin) distributions with density functions ;
1.3 Likelihood Ratio Test
Neyman and Pearson (1928a, b, 1933) introduced the famous Neyman-Pearson lemma, which provides the most powerful test based on the likelihood ratio statistic, fully using the distributions under both the null and alternative hypotheses. The Neyman-Pearson lemma has a Bayesian interpretation.
Consider a Bayesian classification problem where the goal is to classify an observation into two groups: one consisting of data from the null distribution and the other of data from an alternative distribution with density f_1. The "likelihood" that the observation is from the first group can be measured by the posterior probability of the null given the data,
which is also called the local fdr (Efron et al. (2001), Efron (2008, 2010), Sun and Cai (2007), Cao et al. (2013), He et al. (2015), Liu et al. (2016)). The Bayesian classification rule simply classifies the observation into the first group if and only if:
(1.6)  fdr(x) > 1/2.
Basic calculus shows that the above local fdr first decreases and then increases with respect to x. The Bayesian classification rule (1.6) therefore agrees with the threshold procedure in (1.5).
When assuming the two-group model (1.2), the local fdr is
(1.7)  fdr(x) = (1 − π) f_0(x) / f(x),
where f = (1 − π) f_0 + π f_1 is the mixture density and π the non-null proportion.
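When the densities are known, the local fdr of (1.7) is immediate to compute. The sketch below takes generic density arguments; the names local_fdr, f0, f1, and pi1 are ours:

```python
def local_fdr(x, pi1, f0, f1):
    """Local fdr under the two-group model:
    fdr(x) = (1 - pi1) * f0(x) / f(x), with f = (1 - pi1) f0 + pi1 f1."""
    f = (1.0 - pi1) * f0(x) + pi1 * f1(x)
    return (1.0 - pi1) * f0(x) / f
```

As a sanity check, if f0 and f1 coincide, the data carry no information and the local fdr equals the prior null probability 1 − pi1 everywhere.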
The local fdr based approach originates from the Bayesian classification rule and is optimal. However, these scores rely on the density f_1. There have been many attempts, including Efron et al. (2001), Efron (2008), Sun and Cai (2007), Sun and Cai (2009), and Cao et al. (2013), to derive data-driven or empirical Bayes versions by estimating these local fdrs using various nonparametric density estimators. However, these methods can suffer from the slow convergence rate of nonparametric density estimation (Wasserman (2006)), especially in the tails. To illustrate this, consider the following setting.
Let the parameters be as follows, with the desired FDR level chosen as 0.5 for demonstration. The ideal rejection region is plotted in Figure 2, represented by the red curve. We generate a random sample according to this mixture distribution, apply the kernel density estimator, and plot the estimated density in Figure 2 as the green curve. It is clearly seen that the kernel density estimate smooths out the area around the spike and fails to capture the spike around 0. Alternatively, we applied the locfdr package to the transformed z-values to obtain estimated local fdrs. The hypotheses with the smallest local fdrs are rejected, where the number of rejections is chosen as the maximum value such that the average of these smallest local fdrs is no greater than the nominal level. We replicate these steps 100 times to calculate the average number of true rejections (ET), the average number of false rejections (EV), and the FDR. For comparison, the results of the proposed method are reported in Table 1. It is clearly seen that the former method controls the FDR in a meaningless way, in the sense that it is powerless in finding true signals. In Figure 3, we plot the histograms of the false discovery proportions (FDP) of the two procedures over the 100 replications. The FDP of the former method, shown in the left panel, severely deviates from the nominal level; the proposed method, on the other hand, works well, as shown in the right panel.
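The oversmoothing phenomenon is easy to reproduce. The sketch below uses our own illustrative numbers, not the paper's exact setting: a narrow spike mixed into uniform noise, estimated with a Gaussian kernel and a standard global bandwidth, which drastically underestimates the density at the spike:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# hypothetical mixture: 5% of the mass in a very narrow spike at 0.5,
# the rest Uniform(0, 1); the true density at 0.5 is roughly 21
spike = rng.normal(0.5, 0.001, size=n // 20)
noise = rng.uniform(0.0, 1.0, size=n - n // 20)
data = np.concatenate([spike, noise])

def gaussian_kde_at(t, data, h):
    """Gaussian kernel density estimate evaluated at the point t."""
    z = (t - data) / h
    return np.exp(-0.5 * z ** 2).sum() / (len(data) * h * np.sqrt(2.0 * np.pi))

# Silverman's rule picks the bandwidth from the overall spread,
# which is far too wide for the spike
h = 1.06 * data.std() * len(data) ** (-0.2)
estimate = gaussian_kde_at(0.5, data, h)
```

The estimate at 0.5 comes out an order of magnitude below the true density there, which is why fdr estimates built on such a density estimate miss the signal entirely.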
In summary, the local fdr based approach provides the optimal rejection region when the alternative density is known. However, the difficulty of density estimation deteriorates the data-driven method, from poor performance to complete failure.
ET  EV  fdr  
Locfdr  1.29  2.89  0.278 
CLAT  476  440  0.473 
1.4 Compromise
In Section 1.2, we showed that traditional p-value based approaches are not optimal for cases with non-monotone likelihood ratios (non-MLR). In Section 1.3, we showed that local fdr based approaches are optimal but suffer from nonparametric density estimation. In this section, we introduce a new method which is optimal for many non-MLR cases and is free from density estimation.
Motivated by (1.5), we consider the following rejection interval,
(1.8)
The decision based on (1.8) is optimal for the cases listed at the end of Section 1.2, where the likelihood ratio is not monotone. To derive the data-driven version of the interval, one can replace the cdf F by the empirical cdf F_n. The data-driven rejection interval is thus obtained according to
(1.9)
A hypothesis is rejected if and only if its test statistic falls in this data-driven interval.
Unlike local fdr based approaches, which require estimating a density function, this decision rule relies on the empirical cdf, which converges to the true cdf uniformly at a fast rate, as guaranteed by the well-known DKW theorem (Dvoretzky et al. (1956)). Additionally, this estimate is free of tuning parameters. The new rule enjoys better theoretical properties and methodological performance. It successfully combines the advantages of both p-value and local fdr based approaches while avoiding the issues of the two. We call this method the "Cdf and Local fdr Assisted multiple Testing method (CLAT)".
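As a quick sanity check on this claim, the empirical cdf is one line of numpy, and its uniform error at n = 10,000 is already tiny. The data and grid below are hypothetical standard-normal choices of ours; the DKW inequality bounds the probability that the sup-error exceeds eps by 2·exp(−2·n·eps²):

```python
import math
import numpy as np

def ecdf(sample):
    """Return the empirical cdf F_n as a callable step function."""
    xs = np.sort(np.asarray(sample, dtype=float))
    return lambda t: np.searchsorted(xs, t, side="right") / len(xs)

def Phi(t):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

rng = np.random.default_rng(1)
sample = rng.normal(size=10_000)
Fn = ecdf(sample)

# uniform error on a grid; DKW guarantees this is O(n^{-1/2})
grid = np.linspace(-3.0, 3.0, 601)
sup_err = max(abs(Fn(t) - Phi(t)) for t in grid)
```

Here sup_err is on the order of 0.01, with no bandwidth or other tuning parameter anywhere, in contrast to kernel density estimation.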
1.5 Algorithm
There is an issue when implementing the method (1.9). It is known that the estimation error of the empirical cdf is of order n^{−1/2}. To avoid selecting a completely wrong interval, we put a restriction on the length of the candidate interval in the following algorithm. The choice of the constant is not critical, and it is thus chosen as 2.
Remark 1.1.
When we have reliable information about the proportion of non-nulls, we can replace it by its estimator. Alternatively, one can set this proportion to zero, and the resulting method is still valid, though slightly conservative.
In Step 3 of Algorithm 2, the computational complexity of a direct search over all pairs of interval endpoints is quadratic in the number of hypotheses, which is not feasible when that number is very large. We substitute it with the following novel algorithm, which, after sorting, requires only a single pass over the data.
Note that the key constraint can be rewritten as
(1.10)
Order the statistics increasingly. Then for any two integers i < j satisfying the rewritten constraint, the interval with endpoints at the i-th and j-th order statistics is admissible, and the problem simplifies to maximizing j − i over admissible pairs. For each j, we only need to find the smallest admissible i, which requires scanning the whole sequence once.
Based on this, we replace Step 3 of Algorithm 2 by the following:
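In code, one plausible rendering of this fast search is the following sketch. The notation is ours, not the paper's: endpoints are restricted to the order statistics, the empirical mfdr-type constraint for the interval (z_(i), z_(j)] is written as b_j ≤ b_i with b_k = F_0(z_(k)) − q·k/n, and for each j the smallest admissible i is located by a binary search over the (nondecreasing) prefix maxima of b, giving O(n log n) overall rather than the paper's stated single scan:

```python
import numpy as np

def clat_interval(z, F0, q):
    """Among intervals (z_(i), z_(j)] with endpoints at the order statistics,
    maximize the number of rejections j - i subject to the empirical
    mfdr-type constraint n * (F0(z_(j)) - F0(z_(i))) <= q * (j - i).
    With b_k = F0(z_(k)) - q * k / n, the constraint reads b_j <= b_i,
    so one pass over j against the prefix maxima of b suffices."""
    zs = np.sort(np.asarray(z, dtype=float))
    n = len(zs)
    b = F0(zs) - q * np.arange(1, n + 1) / n
    prefix_max = np.maximum.accumulate(b)
    best_len, best = 0, None
    for j in range(1, n):
        # first index i with prefix_max[i] >= b[j]; since prefix_max is
        # nondecreasing this is a binary search, and then b[i] >= b[j]
        i = int(np.searchsorted(prefix_max[:j], b[j], side="left"))
        if i < j and j - i > best_len:
            best_len, best = j - i, (zs[i], zs[j])
    return best, best_len
```

Compared with checking all endpoint pairs, which is O(n²), the sort plus binary searches cost O(n log n), and the loop body itself is constant time per j.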
Remark 1.2.
Algorithm 2 is designed for the right-sided test. For the left-sided test, we calculate the p-values from the left tail instead. When testing two-sided hypotheses, we apply the algorithm to the right- and left-sided p-values, each at half the nominal level, and the final rejection set is the union of the two. We do not use two-sided p-values, since doing so would force the rejection set to be a union of two intervals of the same length.
The remainder of the paper is organized as follows. In Section 2, we introduce the oracle and data-driven versions of the procedure and study the properties of the data-driven procedure. Sections 3 and 4 present simulations and data analyses, all showing that CLAT is powerful in detecting true significances while committing a small number of false significances. Technical proofs are deferred to Section 7.
2 Main Result
2.1 Oracle Procedure
When discussing multiple testing procedures that control the false discovery rate, the BH method (Benjamini and Hochberg (1995)) is the most important one to start with. Without loss of generality, consider the right-sided test. Given the test statistics, the p-values are calculated according to (1.3). Let G be the mixture cdf of the p-values. If G is known, the oracle BH procedure is equivalent to the method that rejects a hypothesis when its p-value falls below a cutoff chosen as
(2.1)
Equivalently, we reject a hypothesis if the corresponding test statistic is greater than the matching threshold on the original scale.
Here the cdfs of the test statistic under the null and alternative hypotheses are those given in (1.2). This rule does not depend on the non-null proportion and is called distribution-free (Genovese and Wasserman (2002)). If there exists reliable information about the non-null proportion, one can choose a less conservative cutoff.
We call the resulting interval the oracle BH rejection interval.
In the original paper (Benjamini and Hochberg (1995)), the BH method is shown to be valid in controlling the fdr, but there is no discussion of optimality. Later, a number of works addressed this issue, such as Genovese and Wasserman (2002); Storey (2003); Efron (2007); Sun and Cai (2007); He et al. (2015). In this paper, we aim at constructing an optimal testing procedure that minimizes the mfnr subject to control of the mfdr. Unfortunately, the oracle BH interval is not optimal in non-MLR cases, including those shown in Figure 1. Instead, we should consider a more general rejection set, whose mfdr can be expressed through the null and mixture cdfs. This motivates us to select an oracle rejection set as
(2.2) 
According to Sun and Cai (2007) and He et al. (2015), among all sets that control the mfdr at a given level, the one cut by the local fdrs maximizes the power. Note that the local fdr is decreasing with respect to the likelihood ratio, so the oracle set can also be determined by thresholding the likelihood ratio function.
As a special case, one can define an oracle rejection interval as in (1.8). We have the following obvious corollaries:
Corollary 2.1.

When the oracle rejection set is a finite interval, the interval based on (1.8) is optimal, whereas the oracle BH interval is not;

When the likelihood ratio is monotone increasing, the oracle rejection set, the ideal interval, and the ideal BH interval all coincide.
The proof is straightforward and is thus omitted.
In theory, the rejection set can be a union of multiple disjoint intervals, and Algorithm 2 can easily be extended to this case. However, this rarely happens in practice. We therefore focus on a single rejection interval in the following discussion.
When the true rejection set is a finite interval, applying the BH interval is problematic. To illustrate this, we consider the following example, with the non-null proportion and the distribution parameters varying as in Table 2. For each choice of parameters, we randomly generate a sequence of observations, with the prescribed fraction generated from the alternative distribution and the rest from the null distribution, and order them decreasingly.
We replicate this step 100 times, calculate the average rank of the largest observation generated from the alternative, and report this number in the fourth column of Table 2.
For instance, in one setting of Table 2, on average the largest observation generated from the alternative hypothesis appears only as the 47th largest value among all observations.
In other words, the 46 largest observations are all generated from the null hypothesis. Using the BH interval, which extends to infinity, results in too many false positives in the tail. It is obviously better to choose a rejection interval which ends at a finite number.
Most of the existing literature discusses how to find the rejection set; there is little discussion of whether such a set exists at all. Zhang et al. (2011) brought up a phenomenon called "lack of identification," which describes situations in which no nontrivial procedure controls the fdr at a given level. The next theorem gives a necessary condition for the existence of a nontrivial procedure.
Theorem 2.1.
Let be the local fdr and be the likelihood ratio. If or equivalently where , then for any set where are disjoint intervals,
Theorem 2.1 indicates that the maximum value of the likelihood ratio must be large enough for a nontrivial rejection set to exist. Otherwise, the only set whose mfdr is controlled at the given level is the empty set.
When the likelihood ratio is monotone increasing, one would intuitively conjecture that the mfdr can be made smaller than any arbitrarily chosen level by sending the threshold to infinity. Unfortunately, this intuition is no longer true. Indeed, when the likelihood ratio is bounded, no matter how large the threshold is, the mfdr of a nontrivial procedure can never be controlled below a certain level. One example is the case where the null and alternative densities are those of central and noncentral t random variables with the same degrees of freedom. The likelihood ratio is monotone increasing with an upper limit. Consequently, there is a lower limit on the mfdr level that one can possibly control; when the target fdr level is set below this limit, all nontrivial procedures fail.
On the other hand, if the likelihood ratio is unbounded, then under certain regularity conditions, the following theorem guarantees the existence of the ideal rejection interval.
Theorem 2.2.
Assume that or . Let and be the solutions of . Further assume that is monotone increasing in , monotone decreasing in with , and for all . Then the mfdr based on the rejection interval is less than or equal to .
Theorem 2.3.
If is monotone and , then exists.
Ave  

0.7  0.8  2.0  38.77 
0.7  0.5  2.5  32.5 
0.6  0.8  1.5  47.12 
In this theorem, we only require the monotonicity of on and , but impose no restriction when .
2.2 Convergence rate of the generalized BH procedure
In Section 2.1, we discussed the oracle interval under the assumption that the mixture cdf is known. When it is unknown, we can estimate it by the empirical cdf and obtain the data-driven version of the interval. The DKW inequality guarantees that the empirical cdf converges to the truth uniformly at rate n^{−1/2}. Therefore, we expect the empirical interval to mimic the ideal interval well for a large sample size.
Before stating the theorem, we introduce some notation. The condition below implies that the mfdr based on the rejection interval is less than or equal to the target level; the constants are those defined in Theorems 2.1 and 2.2. The oracle interval is then the longest rejection interval, starting from the given left endpoint, that controls the mfdr at the target level, and its rejection probability is defined accordingly. Similarly, define the empirical version of the interval and the proportion of hypotheses being rejected.
Theorem 2.4.
Assume that the conditions in Theorem 2.2 hold. Let be the ideal rejection interval. Assume that attains the maximum at , and that a hypothesis is rejected if the test statistic falls between and . Then there exists a constant such that
(2.3) 
3 Simulation
In this section, we use simulations to compare three approaches: the BH procedure, the SC procedure (Sun and Cai (2007)), which is local fdr based, and CLAT. Assume the two-group model, with a given total number of non-null hypotheses. We consider the following two cases.
Case I: Let the null density be that of a standard normal distribution. Under the alternative hypothesis,
Here, the parameters are chosen such that the oracle rejection interval exists. Namely, under the alternative hypothesis, the test statistics are generated from a mixture of two normal distributions, centered at two different locations. Among all the non-null hypotheses, a fixed fraction are on the right side and the rest on the left.
fdr(q)  BH  SC  CLAT  

0.1  2.6  0.1 / 20.9 / 2.34  0.11 / 32.7 / 4.14  0.11 / 31.4 / 3.7 
0.3  2.3  0.3 / 37.8 / 15.9  0.31 / 56.3 / 25.7  0.3 / 53.6 / 22.7 
0.5  2.1  0.48 / 53.1 / 49.1  0.52 / 76.8 / 81.6  0.49 / 72.3 / 69.6 
0.7  1.9  0.67 / 73.5 / 152  0.72 / 105 / 270  0.7 / 96.1 / 224 
fdr(q)  BH  SC  CLAT  

0.1  2.9  0.18 / 9.13 / 2.04  0.12 / 20.5 / 2.85  0.12 / 19.5 / 2.66 
0.3  2.7  0.34 / 17.7 / 9.09  0.35 / 35.8 / 19  0.33 / 34.1 / 17.1 
0.5  2.5  0.51 / 27.6 / 28.9  0.55 / 50.2 / 61.8  0.55 / 48.2 / 58.1 
0.7  2.4  0.71 / 41.3 / 99.6  0.75 / 70.7 / 214  0.76 / 62 / 198 
Case II: Let the null density be that of a uniform random variable. For a constant, define the alternative density as
The resulting density is shown as the red solid curve in Figure 2 for a specific setting of the constant.
After generating the data for each setting with the given parameters, we apply the various procedures and calculate the FDP, the number of true positives, and the number of false positives. We replicate these steps 100 times, calculate the averages of these three quantities, and report them in Tables 3, 4, and 5. In the simulation, we only consider the distribution-free procedures, without estimating the non-null proportion.
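The per-replication quantities reported in the tables can be computed with a small generic helper; the function name and argument layout are ours:

```python
import numpy as np

def fdp_tp_fp(rejected, is_null):
    """False discovery proportion, true positives, and false positives
    for a 0/1 rejection vector compared against the ground truth."""
    rejected = np.asarray(rejected, dtype=bool)
    is_null = np.asarray(is_null, dtype=bool)
    fp = int(np.sum(rejected & is_null))     # nulls wrongly rejected
    tp = int(np.sum(rejected & ~is_null))    # non-nulls correctly rejected
    r = max(tp + fp, 1)                      # avoid 0/0 when nothing rejected
    return fp / r, tp, fp
```

Averaging the three returned quantities over the 100 replications gives the entries of the tables, with the average FDP serving as the empirical FDR.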
q  BH  SC  CLAT  

0.1  0.41  0.15  0.07 / 0 / 0.07  0 / 0 / 0  0.083 / 5604 / 510 
0.3  0.35  0.2  0.23 / 0 / 0.48  0 / 0 / 0  0.27 / 4751 / 1762 
0.5  0.4  0.3  0.46 / 0 / 1.44  0 / 0 / 0  0.49 / 2885 / 2746 
0.7  0.4  0.4  0.69 / 0.01 / 6.36  0 / 0 / 0  0.71 / 516 / 1263 
In both tables of Case I, it is clearly seen that the BH procedure has much less power in finding non-nulls than the other two. For instance, in one setting, the BH procedure discovers 15.6 true positives on average, while the other two methods declare 29.8 and 28.6 true positives, respectively. The procedure of Sun and Cai (2007) declares more rejections than CLAT in Case I, but the majority of the difference is contributed by false positives. For instance, in the last row of Table 3, the average number of total rejections of Sun and Cai (2007) is 375, while that of ours is only 305.8. Among these roughly 70 additional rejections, 60 are falsely rejected, which is about 25% of the total number of non-nulls.
For Case II, it is clearly seen that CLAT works much better than its alternatives. It controls the fdr level well and is powerful in detecting true significance. The local fdr based approaches fail miserably because they rely on density estimation, which oversmooths the spike and thus fails to find any true signals. The BH method fails because the likelihood ratio function is not monotone: forcing the rejection interval to start from zero inevitably includes too many false positives. The number of false positives is so large that BH must accept all the hypotheses to protect itself from an inflated FDR level.
In summary, while guaranteeing control of the fdr at a given level, CLAT has more power in detecting true significance than the existing methods.
4 Data Analysis
In this section, we apply various procedures to two data sets to demonstrate the advantage of CLAT.
4.1 HIV data
This data set was considered before in Van't Wout et al. (2003), Efron (2007), and Sun and Cai (2007). It consists of 8 arrays on 7680 genes, 4 corresponding to HIV-positive patients and 4 to HIV-negative ones. The test statistics for all the genes can be found in the R package locfdr. In Table 6, we report the total number of rejections for each method under various FDR levels. The performance of CLAT and the SC method is quite similar, and both are superior to the BH method.
HIV Data
q  BH  SC  CLAT 

0.01  13  13  13 
0.05  18  19  19 
0.10  22  24  20 
0.15  23  30  29 
4.2 Golden spikein data
In this study, we analyze the golden spike-in data set, in which all the parameters are preset and known, so it can be used to assess the performance of different methods. We process the data according to Hwang et al. (2009). Let the test statistic be the t-statistic with degrees of freedom given by the Satterthwaite approximation, transformed through the cdf of the standard normal distribution. We then apply the three approaches to these statistics at several target fdr levels. The results are reported in Table 7; in each cell, we report the number of true rejections and the number of false rejections. It is clearly seen that the BH approach works significantly worse than the other two. This is not surprising because, as shown in the bottom right of Figure 1, the estimated likelihood ratio is clearly non-monotone. It is also seen that CLAT is better than the SC method, identifying a larger number of true positives with a smaller number of false positives.
Golden Spike-in Data
q  BH  SC  CLAT 

0.05  602/107  721/94  728/88 
0.10  760/249  848/243  859/200 
0.15  851/406  908/408  922/325 
0.20  910/596  938/608  974/478 
5 Conclusion
Testing multiple hypotheses has been an important problem over the last two decades. In this article, we study the limitations of p-value and local fdr based approaches, which are fundamental to many existing testing procedures. Consequently, we propose a new method, CLAT, with the following threefold advantages: (i) it is optimal for a broader family of distributions; (ii) it has a fast convergence rate because it relies on the empirical distribution function; (iii) it can be computed instantaneously. Extensive simulations and real data analysis have demonstrated its superiority over two popular existing methods. We thus strongly recommend it when testing a large number of hypotheses simultaneously. The code for CLAT is available at https://github.com/zhaozhg81/CLAT
6 Acknowledgment
This research is supported in part by NSF Grant DMS-1208735 and NSF Grant IIS-1633283. The author is grateful for initial discussions and helpful comments from Dr. Jiashun Jin.
7 Appendix
7.1 Proof of Theorem 2.1:
Recall the definition that . The condition is equivalent to where . For any interval , let . Then
Consequently, for any fixed , is increasing with respect to . Since , it follows that . This implies that for all . As a result, , which completes the proof. ∎
7.2 Proof of Theorem 2.2:
Consider . Then . According to the proof of Theorem 2.1, . Consequently, and exists. Furthermore, where satisfy . implies that . Namely ∎
7.3 Proof of Theorem 2.4:
According to the definition of and , we know that
Consequently, for any fixed , increases when or and decreases when . Similarly,
For any fixed , decreases when or and increases when . To demonstrate this pattern, we plot various curves of in Figure 4.
Since attains the maximum at , according to Theorem 2.2, and . Consequently, , and . Therefore, the function is a monotone increasing function of at a small neighborhood of . For a sufficiently small constant independent of , there exists a neighborhood of such that , where . Let where . The proof of Theorem 2.4 requires the following lemmas.
Lemma 7.1.
Let be the empirical cdf, then , if or and , then
If , then .
Lemma 7.2.
There exists a subinterval of , such that for all , provided that .
Lemma 7.3.
The function can not achieve the maximum at .
Lemma 7.4.
For any , .
Proof of Theorem 2.4: Assume that attains the maximum at , then according to Lemma 7.3, . According to Lemma 7.4,
Since , . In other words, Further, DKW’s inequality guarantees that . Consequently,
Next, we will prove that . According to the definition of ,
The marginal fdr can be written as
Note that