Where to find needles in a haystack?

by   Zhigen Zhao, et al.
Temple University

Many existing methods for multiple comparison start with either Fisher's p-values or local fdr scores. The former, usually defined as the tail probability beyond the observed test statistic under the null distribution, ignores information from the alternative hypothesis, and the targeted signal region can be completely wrong, especially when the likelihood ratio function is not monotone. Local fdr based approaches, which usually rely on density functions, are optimal in the oracle setting. However, the signal region targeted by their data-driven versions is problematic because of the slow convergence of nonparametric density estimation, especially near the boundaries. In this paper, we propose a new method, the Cdf and Local fdr Assisted multiple Testing method (CLAT), which is optimal in cases where p-value based methods are not. Additionally, the data-driven version relies only on estimating the cumulative distribution function and converges quickly to the oracle version. Both simulations and real data analyses demonstrate the superior performance of the proposed method over existing ones. Furthermore, a novel algorithm makes the computation essentially instantaneous and scalable to large data sets.




1 Introduction

Nowadays, scientific techniques in areas such as image processing, astronomy, and genomics routinely produce data of size gigabytes, terabytes, or even petabytes, posing great challenges and opportunities to statisticians. At such a scale, statistical inferences, such as estimating parameters and testing hypotheses, are made simultaneously for thousands or even millions of parameters, a setting in which classical methodologies are no longer applicable. A tremendous and impressively creative burst of statistical methodologies has been proposed to deal with various issues, including model selection, classification, and feature selection. In this paper, we focus on large-scale simultaneous hypothesis testing, or large-scale multiple comparison (MCP), testing a collection of



Associated with these hypotheses is a collection of test statistics . Here, “large-scale” means that is a big number, say , or even larger.

1.1 Model

For , assume that the test statistic

under the null hypothesis

and under the alternative hypothesis where and are two density functions. Let be the proportion of the alternative hypothesis. We therefore consider the two-group model (Efron (2008, 2010)) where


Similarly, let and be the cdf of under the null and alternative hypotheses respectively. The cdf of is then . Assume that and are known, but and are unknown. Given these test statistics, two quantities, p-values and local fdr scores, are widely used for making decisions.
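The two-group model can be sketched with a short simulation. The specific ingredients below are assumptions for illustration (the paper leaves them general): non-null proportion `pi = 0.1`, null density f0 = N(0,1), and a hypothetical alternative f1 = N(2,1).

```python
import numpy as np

# A minimal sketch of the two-group model: each test statistic comes
# from f0 with probability 1 - pi and from f1 with probability pi.
# f0 = N(0,1) and f1 = N(2,1) are illustrative choices, not the paper's.
rng = np.random.default_rng(0)

def draw_two_group(m, pi, mu1=2.0):
    """Draw m test statistics from (1 - pi) * N(0, 1) + pi * N(mu1, 1)."""
    is_alt = rng.random(m) < pi        # indicators of the alternative
    x = rng.standard_normal(m)         # null draws
    x[is_alt] += mu1                   # shift the non-nulls
    return x, is_alt

x, is_alt = draw_two_group(10_000, pi=0.1)
print(round(is_alt.mean(), 3))         # close to pi = 0.1
```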

1.2 Revisit the p-value

The p-value was defined by Ronald A. Fisher, in his research papers and various editions of his influential texts such as Fisher (1925) and Fisher (1935), as the probability of obtaining a test statistic at least as extreme as the one actually observed, given the null hypothesis. A small p-value indicates that “either an exceptionally rare chance has occurred or the theory is not true” (Fisher (1959), p. 39).

Inspired by this definition, a widely used p-value is given as


a starting point of many testing methods, including the famous one from Benjamini and Hochberg (1995), abbreviated as the BH method:

1:Order the p-values as ;
2:Let ;
3:Reject if and only if .
Algorithm 1 BH Method.
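The three steps of Algorithm 1 can be sketched directly (a minimal implementation of the standard BH rule, with the FDR level written as `q`):

```python
import numpy as np

def bh_reject(pvals, q):
    """Benjamini-Hochberg: reject the k smallest p-values, where
    k = max{ i : p_(i) <= i * q / m }, and k = 0 if no index qualifies."""
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    below = sorted_p <= q * np.arange(1, m + 1) / m   # Step 2 comparison
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                          # Step 3
    return reject

p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9])
print(bh_reject(p, q=0.05))   # only the two smallest survive the step-up rule
```

Note that the rule is a "step-up" comparison: every p-value at or below the largest qualifying order statistic is rejected, not only those passing their own threshold.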

Equivalently, a threshold is chosen and the -th hypothesis is rejected if the observation exceeds this level. In other words, we


Such commonly used p-values, however, do not depend on (), the distribution of the test statistic under the alternative hypothesis. When is monotone increasing with respect to , a property known as the monotone likelihood ratio (MLR) property (Karlin and Rubin (1956a, b)), a small p-value, or equivalently a large observation, implies stronger evidence against the null. However, when MLR does not hold, the belief that “the larger the , the stronger the evidence against the null” breaks down. For example, let and where

is the density function of the standard normal distribution. When

, an extremely large observation should come from the null hypothesis. The resulting extremely small p-value actually favors the null hypothesis rather than . In other words, the intuition that “the larger the , the more likely it is from ” is no longer true: those extremely large observations are more likely from . Thus the simple thresholding procedure defined in (1.4) is inadequate no matter which threshold is chosen; instead, it seems that we should use the following procedure:

Figure 1: Plot of the likelihood ratio for the different models. The panels correspond to (a) the normal model, (b) Cauchy model, (c) the generalized Gaussian model, and (d) the estimated likelihood ratio for the golden spike-in data.

The non-monotonicity of the likelihood ratio is prevalent in both theories and applications. In Figure 1, we have plotted the likelihood ratio function for the following settings.

  1. Let and where

    is the density function of a standard normal random variable;

  2. Let

    be the density function of a Cauchy distribution, and

    be the location transformation of ;

  3. Generalized-Gaussian model: and , where is the family of generalized-Gaussian (Subbotin) distribution with density functions ;

  4. In the golden spike-in data set of Choe et al. (2005), which will be revisited in Section 4, we plot the estimated likelihood ratio .
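The non-monotonicity can be checked numerically. The densities below are illustrative stand-ins (the exact parameters of the settings above are elided in this extraction): f0 = N(0,1) and f1 = N(2, 0.5²). Because the alternative has the smaller variance, f1(x)/f0(x) → 0 as x → ∞, so an extremely large observation favors the null even though its one-sided p-value is tiny.

```python
from math import exp, pi, sqrt

# Hypothetical densities for illustration: f0 = N(0, 1), f1 = N(2, 0.25).
def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def likelihood_ratio(x):
    """f1(x) / f0(x); large values favor the alternative."""
    return normal_pdf(x, 2.0, 0.5) / normal_pdf(x, 0.0, 1.0)

# Near the alternative mode the ratio is large; far in the right tail it
# collapses toward zero, so the ratio is not monotone in x.
print(likelihood_ratio(2.0) > 1.0 > likelihood_ratio(10.0))
```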

1.3 Likelihood Ratio Test

Neyman and Pearson (1928a, b, 1933) introduced the famous Neyman–Pearson lemma, which provides the most powerful test based on the likelihood ratio statistic, fully using the distributions under both the null and alternative hypotheses. The Neyman–Pearson lemma has a Bayesian interpretation.

Consider a Bayesian classification problem where the goal is to classify

into two groups: one consists of data from and the other from an alternative distribution with a density function as

The “likelihood” that

is from the first group can be measured by the following posterior probability:

which is also called the local fdr (Efron et al. (2001), Efron (2008, 2010), Sun and Cai (2007), Cao et al. (2013), He et al. (2015), Liu et al. (2016)). The Bayesian classification rule would simply classify into the first group if and only if:


Basic calculus shows that the above local fdr first decreases and then increases with respect to . The Bayesian classification rule (1.6) agrees with the thresholding procedure in (1.5).
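The local fdr under the two-group model can be computed directly. The densities and non-null proportion below are assumptions for illustration: f0 = N(0,1), f1 = N(2,1), and pi0 = 0.1 for the non-null proportion.

```python
from math import exp, pi, sqrt

# local fdr(x) = (1 - pi0) * f0(x) / f(x), where the marginal density is
# f = (1 - pi0) * f0 + pi0 * f1.  All ingredients here are illustrative.
def normal_pdf(x, mu, sigma=1.0):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def local_fdr(x, pi0=0.1):
    f0, f1 = normal_pdf(x, 0.0), normal_pdf(x, 2.0)
    f = (1 - pi0) * f0 + pi0 * f1
    return (1 - pi0) * f0 / f

# The score is smallest where the alternative dominates the mixture:
print(local_fdr(2.0) < local_fdr(0.0))
```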

When assuming the two-group model (1.2), then the local fdr is


The local fdr based approach originates from the Bayesian classification rule and is optimal. However, these scores rely on the density . There have been many attempts, including Efron et al. (2001), Efron (2008), Sun and Cai (2007), Sun and Cai (2009), and Cao et al. (2013), to derive data-driven or empirical Bayes versions by estimating the local fdrs using various nonparametric density estimators. However, these methods can suffer from the slow convergence rate of nonparametric density estimation (Wasserman (2006)), especially in the tails. To illustrate this, consider the following setting.

Let , , , , and a desired FDR level be . The ideal rejection region is . We plot in Figure 2, represented by the red curve. We generate a random sample

according to this mixture distribution. The kernel density estimator is then applied and the estimated value of

is plotted in Figure 2 as the green curve. It is clearly seen that the kernel density estimation smooths the area around the spike and fails to capture the spike around 0.
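The over-smoothing phenomenon is easy to reproduce in miniature. The mixture below is illustrative, not the paper's exact setting: f = 0.9·U(0,1) + 0.1·U(0, 0.01), so the true density near 0 is 0.9 + 0.1/0.01 = 10.9, while a fixed-bandwidth kernel estimate with bandwidth wider than the spike badly undershoots it.

```python
import numpy as np

# Illustrative spiked mixture: 90% uniform background, 10% packed in [0, 0.01].
rng = np.random.default_rng(1)
n = 10_000
u = rng.random(n)
x = np.where(rng.random(n) < 0.1, u * 0.01, u)

def kde(grid, data, h):
    """Gaussian kernel density estimate evaluated on a grid."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

grid = np.array([0.005, 0.5])            # inside the spike, and far from it
est = kde(grid, x, h=0.05)               # bandwidth much wider than the spike
true_density = np.array([10.9, 0.9])
print(est.round(2), true_density)        # the spike at 0.005 is smoothed away
```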

Alternatively, we applied the locfdr package to the transformed z-values to obtain estimated local fdrs. The hypotheses corresponding to the smallest local fdrs are rejected, where is chosen as the maximum value such that the average of these smallest local fdrs is no greater than , taken as 0.5 for demonstration. We replicate these steps 100 times to calculate the average number of true rejections (ET), the average number of false rejections (EV), and the FDR. For comparison, the results of the proposed method are reported in Table 1. It is clearly seen that the former method controls the FDR in a meaningless way, in the sense that it is powerless in finding true signals. In Figure 3, we plot the histogram of the False Discovery Proportions (FDP) of these two procedures among the 100 replications. The FDP of the former method, shown in the left panel, severely deviates from the nominal level . The proposed method, on the other hand, works well, as shown in the right panel.
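The adaptive local fdr rule just described can be sketched as follows: reject the k hypotheses with the smallest local fdr scores, where k is the largest integer whose running average of scores stays at or below the target level q. (This is a sketch of the rule as described in the text, not of the locfdr package internals.)

```python
import numpy as np

def reject_by_lfdr(lfdr, q):
    """Reject the k smallest local fdr scores, where k is the largest
    integer such that the average of the k smallest scores is <= q."""
    order = np.argsort(lfdr)
    running_avg = np.cumsum(lfdr[order]) / np.arange(1, len(lfdr) + 1)
    below = running_avg <= q
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(len(lfdr), dtype=bool)
    reject[order[:k]] = True
    return reject

scores = np.array([0.01, 0.02, 0.50, 0.90])
print(reject_by_lfdr(scores, q=0.10))   # only the two smallest are rejected
```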

In summary, the local fdr based approach provides the optimal rejection region when is known. However, the difficulty of density estimation degrades the data-driven method, ranging from poor performance to complete failure.

ET EV fdr
Locfdr 1.29 2.89 0.278
CLAT 476 440 0.473
Table 1: This table summarizes the average number of true rejections, the average number of false rejections, and the FDR of the CLAT and local fdr based methods.
Figure 2: Plot of the and its estimate.
Figure 3: Among 100 replications, we plot the histogram of the FDP based on the CLAT and local fdr based methods.

1.4 Compromise

In Section 1.2, it is shown that traditional p-value based approaches are not optimal for cases with a non-monotone likelihood ratio (Non-MLR). In Section 1.3, it is shown that local fdr based approaches are optimal but suffer from nonparametric density estimation. In this section, we introduce a new method which is optimal for many Non-MLR cases and is free of density estimation.

Motivated by (1.5), we consider the following rejection interval ,


The decision based on (1.8) is optimal for cases listed at the end of Section 1.2 where the likelihood ratio is not monotone. To derive the data driven version of , one can replace the cdf function by the empirical cdf . The data-driven rejection interval is thus obtained according to


A hypothesis is rejected if and only if the test statistic falls in .

Unlike local fdr based approaches, which require an estimate of the density function, this decision rule relies on the empirical cdf, which converges to the true cdf uniformly at a fast rate, as guaranteed by the well-known DKW theorem (Dvoretzky et al. (1956)). Additionally, this estimate is free of tuning parameters. The new rule yields better theoretical properties and methodological performance. It successfully combines the advantages of both the p-value and local fdr based approaches while avoiding the issues of the two. We call this method the “Cdf and Local fdr Assisted multiple Testing method (CLAT)”.
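The tuning-free, fast uniform convergence of the empirical cdf is easy to see numerically. For U(0,1) data the true cdf is F(t) = t, and the sup-norm (Kolmogorov–Smirnov) error shrinks at roughly the n^{-1/2} rate promised by the DKW inequality:

```python
import numpy as np

rng = np.random.default_rng(2)

def sup_error(n):
    """sup_t |F_n(t) - t| for a U(0,1) sample of size n."""
    x = np.sort(rng.random(n))
    i = np.arange(1, n + 1)
    # At the i-th sorted point the ecdf jumps from (i-1)/n to i/n, so the
    # supremum is attained at one of these jump points.
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

errs = [sup_error(n) for n in (100, 10_000)]
print([round(e, 4) for e in errs])   # the larger sample has the smaller error
```

No bandwidth or other tuning parameter appears anywhere, in contrast with kernel density estimation.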

1.5 Algorithm

There is an issue when implementing the method (1.9). It is known that the estimation error of the empirical cdf is of order . To avoid selecting a completely wrong interval, we impose a restriction on the length of in the following algorithm. The choice of the constant is not critical, and it is thus chosen as 2.

1:Calculate the p-values for each hypothesis as ;
2: Let and order the p-values increasingly as ;
3: Find and such that where
4: (a) If , then reject the -th hypothesis where and accept the rest; (b) If , accept all the hypotheses.
Algorithm 2 CLAT.
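Since the constraint in Step 3 was lost in this extraction, the following quadratic-time sketch uses one plausible reading: on the ordered p-values p_(1) ≤ … ≤ p_(m), choose indices s < t maximizing the number of rejections t − s subject to the estimated mFDR bound m·(1 − pi0)·(p_(t) − p_(s)) ≤ q·(t − s), with p_(0) = 0. The minimum-length restriction from the text is omitted because its exact form is not recoverable here.

```python
import numpy as np

# Hedged O(m^2) sketch of Step 3 of Algorithm 2 (constraint assumed, see
# lead-in). Returns (0, 0) when no interval qualifies, matching Step 4(b).
def clat_naive(pvals, q, pi0=0.0):
    p = np.sort(pvals)
    m = len(p)
    best = (0, 0)
    for s in range(m):
        ps = p[s - 1] if s > 0 else 0.0          # p_(s), with p_(0) = 0
        for t in range(s + 1, m + 1):
            est_false = m * (1 - pi0) * (p[t - 1] - ps)
            if est_false <= q * (t - s) and t - s > best[1] - best[0]:
                best = (s, t)
    return best   # reject the hypotheses ranked s+1, ..., t

# A dense cluster of p-values is picked up even away from zero:
rng = np.random.default_rng(5)
p = np.concatenate([rng.uniform(0.10, 0.102, 100), rng.uniform(0, 1, 100)])
s, t = clat_naive(p, q=0.2)
print(t - s)   # roughly the size of the 100-point cluster
```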
Remark 1.1.

When we have reliable information on , the proportion of non-nulls, we can replace it by its estimator. Alternatively, one can set it to zero; the resulting method is still valid, though slightly conservative.

In Step 3 of Algorithm 2, the computational complexity of directly searching for and is , which is not feasible when the number of hypotheses is very large. We replace it with the following novel algorithm, whose time complexity is .

Note that the key constraint is which can be rewritten as


Let and order increasingly as . Let be the index such that . Then for any two integers with , it ensures that

The problem can be simplified as finding the maximum value of where . For each , we only need to calculate the difference between and , requiring us to scan the whole sequence ’s once.

Based on this, we replace Step 3 of Algorithm 2 by the following:

1:Calculate , and ;
2:Let and . For in , (a) If , let ; (b) If and , let , and ; (c) If and , let , , and ;
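The single-scan idea can be made concrete. Under one plausible reading of the (garbled) constraint, m·(1 − pi0)·(p_(t) − p_(s)) ≤ q·(t − s), defining a_i = m·(1 − pi0)·p_(i) − q·i (with a_0 = 0 prepended) turns Step 3 into finding s < t maximizing t − s subject to a_t ≤ a_s. This is the classic "maximum-width ramp" problem, solvable with one forward pass and one backward pass; the sketch below is a stand-in for the text's exact bookkeeping, whose details are not recoverable here.

```python
def max_width_pair(a):
    """Widest pair (s, t) with s < t and a[t] <= a[s]; None if none exists."""
    # Candidates for s: strict prefix maxima. Any other index is dominated
    # by an earlier index with a value at least as large.
    stack = [0]
    for i in range(1, len(a)):
        if a[i] > a[stack[-1]]:
            stack.append(i)
    best, pair = 0, None
    # Scan t from the right; popped candidates can never help a smaller t.
    for t in range(len(a) - 1, 0, -1):
        while stack and a[stack[-1]] >= a[t]:
            s = stack.pop()
            if t - s > best:
                best, pair = t - s, (s, t)
    return pair

print(max_width_pair([5, 10, 4, 9]))   # (1, 3): a[3] = 9 <= a[1] = 10
```

Both passes touch each index at most twice, giving the linear complexity claimed for the substitute algorithm.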
Remark 1.2.

Algorithm 2 is designed for right-sided tests. For a left-sided test, we calculate the p-values as . When testing two-sided hypotheses, we apply the algorithm to the right- and left-sided p-values at level each, and the final rejection set is the union of the two. We do not use two-sided p-values, which would force the rejection set to be a union of two intervals of the same length.

The remaining part of the paper is organized as follows. In Section 2, we introduce the oracle and data-driven versions of the procedure and study the properties of the data-driven procedure. Sections 3 and 4 present simulations and data analyses, all showing that CLAT is powerful in detecting true significance while committing a small number of false discoveries. Technical proofs are deferred to Section 7.

2 Main Result

2.1 Oracle Procedure

When discussing multiple testing procedures that control the false discovery rate, the BH method (Benjamini and Hochberg (1995)) is the most important one to start with. Without loss of generality, consider the right-sided test. Given the test statistics, the p-values are calculated according to (1.3). Let be the mixture cdf of the p-values. If is known, the oracle BH procedure is equivalent to the method which rejects a hypothesis whose corresponding p-value falls below the threshold, with chosen as


Or equivalently, we reject a hypothesis if the corresponding test statistic is greater than where

Here , and are given in (1.2), representing the cdfs of the test statistic under the null and alternative hypotheses. This rule does not depend on the non-null proportion and is called distribution-free (Genovese and Wasserman (2002)). If reliable information on is available, one can choose a less conservative as

Let and we call it the oracle BH rejection interval.

In the original paper (Benjamini and Hochberg (1995)), it is shown that the BH method is valid in controlling the fdr, but there is no discussion of its optimality. Later, a number of works attempted to address this issue, such as Genovese and Wasserman (2002); Storey (2003); Efron (2007); Sun and Cai (2007); He et al. (2015). In this paper, we aim at constructing an optimal testing procedure which minimizes the mfnr subject to control of the mfdr. Unfortunately, is not optimal for Non-MLR cases, including those shown in Figure 1. Instead, we should consider a rejection set . Note that the corresponding mfdr of is . This motivates us to select an oracle rejection set as


According to Sun and Cai (2007); He et al. (2015), among all the sets which control the mfdr at a given level, the one cut by the local fdrs maximizes the power. Note that the local fdr is decreasing with respect to the likelihood ratio , so can also be determined by thresholding the likelihood ratio function .

As a special case, one can define an oracle rejection interval as in (1.8). We have the following obvious corollaries:

Corollary 2.1.
  1. When is a finite interval , the interval based on (1.8) is optimal; however, the oracle BH interval is not optimal;

  2. When is monotone increasing, then the rejection set , the ideal interval , and the ideal BH interval are the same.

The proof is straightforward and is thus omitted.

In theory, the rejection set can be a union of multiple disjoint intervals, and Algorithm 2 can be easily extended to this case. But this rarely happens in practice. We therefore focus on rejection intervals in the following discussion.

When the true rejection set is a finite interval, it is problematic to apply . To illustrate this, we consider the following example. Let and be the proportion of non-null hypothesis. Assume that and . For different choices of , we randomly generate a sequence with of them being generated from the alternative distribution and the rest from the null distribution. We then order them as . Let

We replicate this step 100 times, calculate the average of , and report this number in the fourth column of Table 2. For instance, when and , setting and , on average the largest observation generated from the alternative hypothesis appears as the 47th largest value. In other words, the first 46 largest observations are generated from the null hypothesis. Using results in too many false positives in the tail. It is clearly better to choose a rejection interval which ends at a finite number, ideally .
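This phenomenon can be reproduced in miniature with illustrative stand-in distributions (the exact ones in the example above are elided in this extraction): a heavy-tailed standard Cauchy null and a light-tailed N(3, 1) alternative. The rank, after sorting decreasingly, of the largest alternative observation is typically far from 1: the extreme tail belongs to the null.

```python
import numpy as np

# 9,000 Cauchy nulls vs 1,000 N(3,1) alternatives, 100 replications.
rng = np.random.default_rng(4)
ranks = []
for _ in range(100):
    nulls = rng.standard_cauchy(9_000)
    alts = rng.normal(3.0, 1.0, 1_000)
    x = np.concatenate([nulls, alts])
    is_alt = np.concatenate([np.zeros(9_000, bool), np.ones(1_000, bool)])
    order = np.argsort(-x)                            # decreasing order
    ranks.append(1 + int(np.argmax(is_alt[order])))   # rank of top alternative
print(np.mean(ranks))   # far above 1: the largest observations are nulls
```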

Most of the existing literature discusses how to find the rejection set; there is little discussion of whether such a set exists. Zhang et al. (2011) pointed out a phenomenon called “lack of identification”, which describes situations where no non-trivial procedure controls the fdr at a given level. The next theorem gives a necessary condition for the existence of a non-trivial procedure.

Theorem 2.1.

Let be the local fdr and be the likelihood ratio. If or equivalently where , then for any set where are disjoint intervals,

Theorem 2.1 indicates that the maximum value of must be large enough such that a non-trivial rejection set exists. Otherwise, the only set with the corresponding mfdr being controlled at the level is the empty set.

When is monotone increasing, one would intuitively conjecture that the mfdr can be made smaller than any arbitrarily chosen as goes to infinity. Unfortunately, this intuition is not true. Indeed, when , no matter how large is, the mfdr of any non-trivial procedure can never be controlled at the level. One example is the case when and are the density functions of central and non-central t random variables with degrees of freedom. The likelihood ratio is monotone increasing with an upper limit. Consequently, there is a lower limit on the mfdr level that one can possibly control. When the target fdr level is set below this limit, all non-trivial procedures fail.

On the other hand, if , then under certain regularity conditions, the following theorem guarantees the existence of the ideal rejection interval.

Theorem 2.2.

Assume that or . Let and be the solutions of . Further assume that is monotone increasing in , monotone decreasing in with , and for all . Then the mfdr based on the rejection interval is less than or equal to .

Theorem 2.3.

If is monotone and , then exists.

0.7 0.8 2.0 38.77
0.7 0.5 2.5 32.5
0.6 0.8 1.5 47.12
Table 2: Assume that where and . For each parameter setting, we generate the random sample and sort the data. This table reports the average rank of the largest observation generated from the alternative hypothesis after ordering the data decreasingly.

In this theorem, we only require the monotonicity of on and , but impose no restriction when .

2.2 Convergence rate of the generalized BH procedure

In Section 2.1, we discussed the oracle interval assuming is known. When it is unknown, we can estimate it by the empirical cdf and obtain the data-driven version of . The DKW inequality guarantees that . Therefore, we expect the empirical interval to mimic the ideal interval well for a large sample size .

Before stating the theorem, we introduce some notation. Let , . implies that the mfdr based on the rejection interval is less than or equal to . Let and be the constants defined in Theorems 2.1 and 2.2. Let . Then is the longest rejection interval starting from which controls the mfdr at level . Let be the probability of rejection. Similarly, define as the empirical version of and as the proportion of hypotheses being rejected.

Theorem 2.4.

Assume that and conditions in Theorem 2.2 hold and . Let be the ideal rejection interval. Assume that attains the maximum at and a hypothesis is rejected if the test statistic falls between and . Then and there exists a constant such that


Remark: According to this theorem, the proportion of hypotheses being rejected converges to the probability of the ideal rejection interval at a rate of . A similar result can be obtained if and satisfy the conditions of Theorem 2.3. The proof is similar to that of Theorem 2.4 and is omitted.

3 Simulation

In this section, we use simulations to compare three approaches: the BH procedure, the SC procedure (Sun and Cai (2007)), which is local fdr based, and CLAT. Assume that where and the total number of non-null hypotheses is . We consider the following two cases.

Case I: Let , the density function of a standard normal distribution. Under the alternative hypothesis ,

Here, is chosen such that the oracle rejection interval exists. Namely, under the alternative hypothesis, are generated from a mixture of two normal random variables, centered around and respectively. Among all the non-null hypotheses, of them lie on the right side and the rest on the left.

fdr(q) BH SC CLAT
0.1 2.6 0.1 / 20.9 / 2.34 0.11 / 32.7 / 4.14 0.11 / 31.4 / 3.7
0.3 2.3 0.3 / 37.8 / 15.9 0.31 / 56.3 / 25.7 0.3 / 53.6 / 22.7
0.5 2.1 0.48 / 53.1 / 49.1 0.52 / 76.8 / 81.6 0.49 / 72.3 / 69.6
0.7 1.9 0.67 / 73.5 / 152 0.72 / 105 / 270 0.7 / 96.1 / 224
Table 3: Case I:
fdr(q) BH SC CLAT
0.1 2.9 0.18 / 9.13 / 2.04 0.12 / 20.5 / 2.85 0.12 / 19.5 / 2.66
0.3 2.7 0.34 / 17.7 / 9.09 0.35 / 35.8 / 19 0.33 / 34.1 / 17.1
0.5 2.5 0.51 / 27.6 / 28.9 0.55 / 50.2 / 61.8 0.55 / 48.2 / 58.1
0.7 2.4 0.71 / 41.3 / 99.6 0.75 / 70.7 / 214 0.76 / 62 / 198
Table 4: Case I:

Case II: Let , the density function of a uniform random variable. For a constant , define as

The is shown as the red solid curve in Figure 2 for a specific setting of .

After generating the data for each setting with the given parameters, we apply the various procedures and calculate the FDP, the number of true positives, and the number of false positives. We replicate these steps 100 times, calculate the averages of these three quantities, and report them in Tables 3, 4 and 5. In the simulations, we only consider the distribution-free procedures, without estimating the non-null proportion.

0.1 0.41 0.15 0.07 / 0 / 0.07 0 / 0 / 0 0.083 / 5604 / 510
0.3 0.35 0.2 0.23 / 0 / 0.48 0 / 0 / 0 0.27 / 4751 / 1762
0.5 0.4 0.3 0.46 / 0 / 1.44 0 / 0 / 0 0.49 / 2885 / 2746
0.7 0.4 0.4 0.69 / 0.01 / 6.36 0 / 0 / 0 0.71 / 516 / 1263
Table 5: Case II:

In both tables of Case I, it is clearly seen that the BH procedure has much less power in finding non-nulls than the other two. For instance, when and , the BH procedure discovers 15.6 true positives on average, while the other two methods declare 29.8 and 28.6 true positives respectively. The procedure of Sun and Cai (2007) makes more rejections than CLAT in Case I; however, the majority of the difference is contributed by false positives. For instance, consider the last row of Table 3, where , , and . The average total number of rejections of Sun and Cai (2007) is 375, while ours is only 305.8. Among these roughly 70 additional rejections, 60 are falsely rejected, which is about 25% of the total number of non-nulls.

For Case II, it is clearly seen that CLAT works much better than all the alternatives. It controls the fdr level well and is powerful in detecting true significance. The local fdr based approaches fail miserably because they rely on density estimation, which over-smooths the spike and thus fails to find any true signals. The BH method fails because the likelihood ratio function is not monotone: forcing the rejection interval to start from zero inevitably includes too many false positives. The number of false positives is so large that BH must accept all the hypotheses to protect itself from an inflated FDR level.

In summary, while guaranteeing control of the fdr at a given level, CLAT has more power in detecting true significance than the existing methods.

4 Data Analysis

In this section, we apply various procedures to two data sets to demonstrate the advantage of CLAT.

4.1 HIV data

This data set was considered before in Van’t Wout et al. (2003), Efron (2007), and Sun and Cai (2007). It consists of 8 arrays on 7680 genes, 4 of which correspond to HIV-positive patients and 4 to HIV-negative patients. The test statistics for all the genes can be found in the R package locfdr. In Table 6, we report the total number of rejections for each method at FDR levels 0.01, 0.05, 0.10, and 0.15. The performances of CLAT and the SC method are quite similar, and both are superior to the BH method.

HIV Data

0.01 13 13 13
0.05 18 19 19
0.10 22 24 20
0.15 23 30 29
Table 6: This table summarizes the data analysis results of the three testing procedures for the HIV data set.

4.2 Golden spike-in data

In this study, we analyze the golden spike-in data set, in which all the parameters are preset and known, so it can be used to assess the performance of different methods. We process the data according to Hwang et al. (2009). Let be the statistic with the degrees of freedom taken to be the Satterthwaite approximation. Define the -statistic as , where is the cdf of the standard normal distribution. We then apply the three approaches to these ’s. The fdr levels we aim at controlling are set as , , and respectively. The results are reported in Table 7. In each cell, we report the number of true rejections and the number of false rejections. It is clearly seen that the BH approach works significantly worse than the other two. This is not surprising because, as shown in the bottom right panel of Figure 1, the estimated likelihood ratio is clearly non-monotone. It is also seen that CLAT is better than the SC method in terms of identifying a larger number of true positives and a smaller number of false positives.

Gold Spike-in Data

0.05 602/107 721/94 728/88
0.10 760/249 848/243 859/200
0.15 851/406 908/408 922/325
0.20 910/596 938/608 974/478
Table 7: This table summarizes the data analysis results of the three testing procedures. In each cell, the two numbers correspond to the number of true positives and false positives among all rejections.

5 Conclusion

Testing multiple hypotheses has been an important problem over the last two decades. In this article, we study the limitations of p-value and local fdr based approaches, which are fundamental to many existing testing procedures. We then propose a new method, CLAT, with the following three-fold advantages: (i) it is optimal for a broader family of distributions; (ii) it has a fast convergence rate because it relies on the empirical distribution function; (iii) it can be computed instantaneously. Extensive simulations and real data analyses have demonstrated its superiority over two popular existing methods. We thus strongly recommend it when testing a large number of hypotheses simultaneously. The code for CLAT is available at https://github.com/zhaozhg81/CLAT.

6 Acknowledgment

This research is supported in part by NSF Grant DMS-1208735 and NSF Grant IIS-1633283. The author is grateful for initial discussions and helpful comments from Dr. Jiashun Jin.

7 Appendix

7.1 Proof of Theorem 2.1:

Recall the definition that . The condition is equivalent to where . For any interval , let . Then

Consequently, for any fixed , is increasing with respect to . Since , therefore, . This implies that , for all . As a result, which completes the proof. ∎

7.2 Proof of Theorem 2.2:

Consider . Then . According to the proof of Theorem 2.1, . Consequently, and exists. Furthermore, where satisfy . implies that . Namely

7.3 Proof of Theorem 2.4:

According to the definition of and , we know that

Consequently, for any fixed , increases when or and decreases when . Similarly,

For any fixed , decreases when or and increases when . To demonstrate this pattern, we plot various curves of in Figure 4.

Figure 4: Curve of the function of . In the left panel, is a fixed constant and we plot as a function of . In the right panel, we plot it as a function of with a being fixed as a constant.

Since attains the maximum at , according to Theorem 2.2, and . Consequently, , and . Therefore, the function is a monotone increasing function of at a small neighborhood of . For a sufficiently small constant independent of , there exists a neighborhood of such that , where . Let where . The proof of Theorem 2.4 requires the following lemmas.

Lemma 7.1.

Let be the empirical cdf, then , if or and , then

If , then .

Lemma 7.2.

There exists a sub-interval of , such that for all , provided that .

Lemma 7.3.

The function cannot achieve the maximum at .

Lemma 7.4.

For any , .

Proof of Theorem 2.4: Assume that attains the maximum at , then according to Lemma 7.3, . According to Lemma 7.4,

Since , . In other words, Further, DKW’s inequality guarantees that . Consequently,

Next, we will prove that . According to the definition of ,

The marginal fdr can be written as

Note that