In modern statistical applications, we usually face the situation that there are a large number of parameters to be estimated simultaneously. The shrinkage type methods, initially introduced in Stein  and James and Stein , have been proved extremely useful in the last six decades. In Efron and Morris [11, 12, 13], it is shown that the James-Stein estimator can be interpreted beautifully as an empirical Bayes estimator. The shrinkage estimators have been widely used in various statistical inferences (Morris [22, 23], He , Hwang, Qiu and Zhao , Zhao and Hwang , Hwang and Zhao ) which can be used in many microarray dataset.
The previous literature focuses mostly on inference for mean parameters for the normal mean model. The estimation of unknown variances is an important step toward inferring the means. An easy approach is to estimate each variance by the corresponding sample variance, an unbiased estimator. However, in Big Data era, the heteroscedasticity among the variances commonly exists and the sample variances vary dramatically. For instance, in the golden spike-in data set (Choe et al. ), the smallest sample variance among all the genes is while the largest one is . Such an easy estimator will produce misleading results because testing the mean effects can be shadowed by extremely small or large values of the sample variances.
To address this issue, various variance estimators are proposed. For instance, Storey and Tibshirani  suggested adding a constant to each sample variance in order to stabilize the variance estimator. Lin et al.  suggested a robust estimator using rank order. Empirical Bayesian approaches for estimating the variances have been proposed in Lönnstedt and Speed , Kendziorski et al.  and Smyth 
. Approximating the sample variances by log-normal random variables,Cui et al.  proposed the exponential Lindley-James-Stein estimator and applied it to the test. It is demonstrated that the resultant testing method has a high power for capturing the differentially expressed genes. This method has been further investigated by Tong and Wang  theoretically. A similar approach has been taken in Zhao , Zhao and Hwang , Hwang and Zhao  to construct shrinkage confidence intervals for both unselected and the selected mean parameters.
Most of the existing popular empirical Bayes estimators for the variances assume a strong parametric assumption on the prior of the variances. In this paper, we assume an arbitrary prior distribution for the variances . When assuming an invariant loss function, the Bayesian decision rule depends on the data through the marginal cumulative distribution function of the sample variances only. As a result, the proposed Non-parametric Empirical Bayes estimator for the Variances (NEBV) relies on the empirical cumulative distribution function, which can reliably estimate the distribution function. Consequently, the corresponding empirical Bayes estimator enjoys good properties such as fast computation and being free from choosing tuning parameters.
In many applications such as the mircoarray experiments, an investigator might only select those parameter(s) with a large -statistics in magnitude for further studies. Estimating the corresponding variances post the selection is important toward the inference for the mean. In general, the Bayes estimator is immune to the selection (Dawid ) and the corresponding empirical Bayes estimator is a good approximation of the Bayes version. When focusing on the selected parameters, the approximation error becomes substantial and adjustment for the selection is often required. On the other hand, NEBV relies on the estimation of the cumulative distribution function and converges to the Bayes decision rule uniformly over a set which covers the selected data with a high probability. As a result, NEBV works well for the selected parameters without any adjustment, which is a striking but extremely useful feature.
The proposed estimator can be used as a step-stool to infer mean parameters. In this paper, we use it to construct confidence intervals for the mean parameters in a post selection setting, which is closely related to the Replication crisis in scientific investigations. Extensive simulations show that the intervals based on NEBV substantially improve the existing ones. We create a simulation using real data sets to demonstrate that the NEBV-based approach results in the smallest number of discordance, a term we introduced to denote the number of discoveries based on the current data which cannot be replicated using the newly collected data.
Here is how the article is organized. In Section 2, we recall various loss functions for the estimation of the variances. We introduce NEBV in Section 3 and study its theoretical properties in Section 4. In Section 5 and 6, we apply this estimator to construct confidence intervals for the mean parameters and demonstrate its superior performance via extensive simulations and real data analysis. We conclude the paper in Section 7 and leave all the technical proofs in Appendix A.
2 Loss Functions and Risks
Let be the parameter of interest and be the corresponding sample variances. Assume that
is the degrees of freedom.
2.1 Loss Functions
Let be an estimator of Consider the following loss function:
Other commonly used loss functions include
The difference between and is the position of and in the ratio. The loss function is commonly known as the Stein’s loss.
All these three loss functions are scale-invariant and equal zero when for all . The loss increases when the ratio of to moves away from one. Figure 1 displays these loss functions when and . It is shown in the graph that, when the estimator of is close to zero, the loss function and the Stein’s loss function assign much larger penalty than the square loss function
. Estimating the variance by an even smaller number will result in a inflated probability of committing type I error which should be avoided. Therefore, one should put more penalty on the underestimation (P. 332 ofCasella and Berger ). The loss function fails to do so because there is a fixed upper limit for the penalty at the lower end. On the other hand, by using the loss function , we can avoid such underestimation. Stein’s loss function has also been used in literature (Stein , Dey and Srinivasan , Tong and Wang ) and the penalty goes to when the estimation goes to zero. However, as Shown in Section 3, the corresponding Bayes rule depends on the marginal density of the sample variances which is difficult to estimate.
In many applications, a practitioner is only interested in a certain set of parameters, denoted as , after applying a selection rule. For instance, in microarray analysis, a scientist can select a set of genes based on the magnitude of -statistics and then zoom in on these parameters for further investigation. Upon selection, the loss function can be naturally modified as
3 Nonparametric Empirical Bayes Estimator for Variances
We consider the following non-parametric Bayesian hierarchical model
where is an arbitrary prior of .
Under Model (3.1), the marginal distribution of can be written as
and the posterior distribution of can be written as
We will now derive the Bayes decision rule based on the loss function (2.2). Given , then
Therefore, the Bayes decision rule which minimizes the Bayes risk is
Assume Model (3.1). Let be the marginal density of and be the corresponding cumulative distribution function. Assume that both
exist. Then, given , the Bayes decision rule of based on the loss function is
The Bayes decision rule (3.4) relies on the marginal cumulative distribution function which can be easily and reliably estimated using the empirical distribution function . When replacing by , we have the following Non-parametric Empirical Bayes estimator of the Variances (NEBV).
, and . The NEBV can be rewritten as
Note that if and only if , implying that when estimating , the estimator (3.6) only relies on the data ’s which are greater than or equal to . Second, decreases with respect to , implying that puts less weight on those observations which are farther away from .
For a given selection rule, let be the sample variance of the -th selected data. Then the corresponding variance, , is estimated by
where is defined similar to with being replaced by .
We would like to point out that due to the nature of the NEBV, it suffers when estimating the parameters with the largest sample variances. When implementing this method, we estimate those parameters corresponding to the top (say ) largest sample variances by these sample variances.
When assuming Stein’s loss function, basic calculation shows that
The empirical Bayes version of this requires the estimation of the marginal density and its derivative. However, the estimation of the density function is a challenging problem itself and the performance of various techniques for density estimation further deteriorates at the tail. Additionally, choosing a tuning parameter optimally in kernel density estimation is a daunting task, not even to mention that the optimal tuning parameters for estimating the density and its derivative do not even agree.
On the contrary, the NEBV relies on the sample variances via its marginal cumulative distribution function, which can be easily estimated.
4 Analytic Properties of NEBV
Let and where is an indicator function. Then the Bayes estimator and NEBV can be written as
respectively. First, we study the numerator and denominator separately.
Suppose that , , and is continuous, then
This theorem implies that both the numerator and the denominator of NEBV converge to those of the Bayes estimator uniformly in .
Though both the numerator and the denominator of NEBV estimator converge to the corresponding part of Bayes rule uniformly, unfortunately, it does not guarantee that the ratio converges uniformly. The reason is that the denominator can be arbitrarily small when goes to . For any arbitrary small number , let be a set defined as
Then for any , there exists a constant such that the denominator in the Bayes rule is greater than or equal to . We then have the following theorem:
Assume the same condition as Theorem 2, then
Note that the set where is decreasing with respect to . When , . We only show that converges to uniformly on the set . This result is sufficient in many applications because the sample variances for those parameters which are selected, for instance using the -statistic, are small with high probability.
An immediate corrolary follows.
Assume the same condition as Theorem 2, then
This striking results shows that NEBV, when applied to the post-selection setting without modification, converges to the Bayes estimator as long as . One of the key reason is that NEBV depends on the empirical distribution function and the empirical distribution converges to the distribution function uniformly according to the famous DKW’s inequality (Dvoretzky, Kiefer and Wolfowitz ).
5 Application to the Selected Intervals
In this section, we will apply the NEBV estimator to construct empirical Bayes intervals for the mean parameters assuming the following model:
Note that the posterior distribution of is
When ’s are known, the hyper-parameters
can be estimated using the method of moments as
where , known as the truncated value (Hwang, Qiu and Zhao ), is defined as, with ,
For unknown ’s, we replace them by various estimates to get the following empirical Bayesian confidence interval
In many applications, we might be only interested in for which the corresponding -statistic attains the maximum in magnitude among all the -statistic, . We then add the parentheses adjustment according to Hwang and Zhao .
In the comparison, we applied , (Cui et al. ) and (Tong and Wang ). We wanted to point out that these methods are not designed for estimating the variance post the selection. Actually, to the best of our knowledge, there is no methods designed for estimating the variances post the selection in the exising literature.
5.1 Independent Case
We generated ’s according to Model (5.1) with different choices of the prior for ’s, including inverse Gamma prior (), mixture of Gamma/inverse Gamma prior, and log-normal prior (). The dimension for the numerical study is 3,000. After obtaining the statistic , we made the inference for the parameters corresponding to the observation with the largest -statistic in magnitude. These steps were replicated 1,000 times.
We first compared the risk of different estimators for estimating the variance . The logarithm of the risks based on the loss function (2.5) are reported in Figure 2. It is clearly seen that the proposed NEBV has the smallest risk for all the parameter settings.
Next, we considered the confidence interval for . In addition to the three intervals by replacing with one of the three variance estimators above, we included the second-order corrected (SOC) interval (Hwang and Zhao ) and Bonferroni’s corrected -interval: . The SOC intervals are designed for the log-normal model where and . We set the desired coverage probability as 95%, namely .
We reported the simulated results corresponding to the log-normal prior in Figure 4. The coverage probabilities were reported in the first row. It is seen that the intervals based on NEBV, the Bonferroni’s corrected -interval, and SOC intervals all guarantee good coverage probabilities for all the hyper-parameters. The intervals based on the other two estimators, ELJS and TW, have substantially low coverage probabilities. For instance, for the inverse Gamma prior with and , the coverage probability of intervals based on TW can be as low as 35%. This is expected because these method are not designed for the setting post the selection. We also reported the average ratio of the length of all the confidence intervals to the length of -interval based on the Bonferroni’s correction in the second row of the figure. It is seen that among the three intervals which guarantee a desired coverage probability, the one based on NEBV is the shortest one.
In Figure 5, we reported the coverage probability and average length ratio of the various approaches for the inverse Gamma prior. The intervals based on NEBV and the Bonferroni’s corrected -interval still have good coverage probability. However, the coverage probability of SOC interval dropped below the nominal level. SOC intervals are designed to work well for the log-normal model and does not work well when this assumption does not hold.
In Figure 6, the variances
’s were generated from a mixture Gamma/inverse Gamma distribution. Although SOC intervals can be shorter than the NEBV-based intervals, but the coverage probabilities of SOC intervals were much lower than 95% for some parameter settings. The NEBV-based intervals are robust to the prior of’s and still work well. We have done extensive simulations and all of which demonstrated the patterns similar to the representative cases reported here.
5.2 Dependent Case
The previous estimators of the variances and the confidence intervals (5.2
) are based on the independence assumption. In this section, we compared their numerical performance when the independence assumption is violated. We generated the data from the multivariate normal distribution. Namely, for,
Here, the variance-covariance matrix is given as where with . Four types of correlation matrix were considered:
Banded Correlation matrix ( when and when ),
AR Correlation matrix (),
Sparse Correlation matrix, and
Correlation matrix from real data.
In the banded case, were set as , and respectively. For the AR case, was chosen as 0.5, 0.75, 0.9, 0.95, and 0.99 respectively. To get a sparse correlation matrix, we started with a matrix by setting 1.5% of randomly chosen entries as numbers from and the rest as zero. Let where is a diagonal matrix with the diagonal-line elements being the diagonal line of . The is a sparse correlation matrix and about 77% of the entries are zero under this setting. We considered the following choices of : and .
We have done extensive simulation studies and only reported representative results in Figure 7. Under all these dependence cases, the coverage probability of the NEBV-based intervals for guarantee the desired coverage probability. On the other hand, the intervals based on ELJS and TW fail to achieve the satisfactory coverage probability. In Figure 8, we reported the average length ratio of the confidence intervals.
In the last case, we considered a correlation matrix estimated from two microarray experiments : Leukemia (Golub et al. ) and Colon Cancer (Alon et al. ). We downloaded the data from http://www.bioconductor.org/packages/release/data/experiment/html/golubEsets.html and http://genomics-pubs.princeton.edu/oncology/affydata/index.html. The leukemia data includes the expressions of genes extracted from patients with two types of leukemia(): Acute Lymphoblastic Leukemia(ALL, 47 patients) and Acute Myeloid Leukemia (AML, 25 patients). The colon cancer data contains gene expressions of 2000 genes for 22 normal people and 40 patients. Since leukemia data and colon cancer data have 72 and 62 observations with 7128 and 2000 variables, respectively. We randomly selected 70 and 60 variables respectively from each dataset and calculated the sample covariance matrices which were treated as in the simulations.
The results of coverage probabilities, and length of the intervals for are shown in Figure 9. What we have observed is similar to that for the other settings. The intervals based on NEBV guarantees a desired coverage probability; while the coverage probabilities of the intervals based on ELJS and TW could be much lower than the nominal level.
6 Real data Analysis
According to a 2016 poll of 1,500 scientists reported in the journal Nature, 70% of them had failed to reproduce at least one other scientist’s experiment (Baker ). Replication crisis, which has long-standing roots, catches growing awareness in the last decades. A natural question is in what degree, the statistical conclusion of a certain method based on the data agrees with the conclusion when provided with different/new data set. In this section, we used the above two microarray data to examine the performance of various statistical methods.
For the Leukemia data set, we first randomly split the subjects into two subgroups such that both subgroups contain the similar numbers of subjects from ALL and AML groups. We then constructed confidence intervals based on one sub-group. We declared a gene to be significant if the corresponding interval does not enclose zero. We did the same for the other sub-group. We call a gene discordant if the interval based on the first subgroup does not enclose zero while the interval based on the second subgroup encloses zero. Namely, a gene is discordant implies that the conclusion that it is significant based on the first subgroup cannot be replicated based on the second one. We repeated these steps 1,000 times to calculate the average proportions of discordant genes. We did the same thing for the colon cancer data and reported the results in Table 1 and Figure 10.
In Figure 10, we plotted the box-plot based on four interval constructions: intervals based on the sample variance, ELJS, TW, and NEBV. In Table 1, we summarized the summary statistic based on 1,000 replications. It is seen that the intervals based on NEBV commit the smallest number of discordance and the difference is statistically significant.
In summary, as shown in both simulations and real data analysis, the performance of the intervals based on NEBV is better than its competitors and is thus recommended.
We develop a novel method of non-parametric empirical Bayes estimation of the variances assuming an arbitrary prior distribution. The Bayes estimator only relies on the marginal distribution function. The NEBV, the empirical Bayes version, relies on the empirical distribution function only, thus avoids any estimation of the density function, a known challenging problem.
We applied such an estimator to the construction of confidence intervals in the post selection setting. Extensive numerical evidences have shown that NEBV provides much better inference procedures.
The NEBV is developed under Model (3.1) assuming a distribution. However, the underlying model can be extended to some family of distributions such as Gamma or exponential family as an on-going research. In this work, the inference for the mean and variance parameters are splitted as two steps. Another natural yet challenging question is how to obtain a non-parametric estimator of the means assuming unknown and unequal variances. We leave all these in future research.
Appendix A Technical Details
Proof of Theorem 1. To ease the notation, we drop the subscript “” in the proof. Recall that and are the density function of and the prior of respectively. Define , and as
Note that is the marginal distribution of . Then
Take the derivatives of with respect to ,
for some constant .
Note that the left hand side can be expressed as
Now, we consider the first and second order derivatives of with respect to ,
for some constants and .
(Blum-DeHardt) Let F be a class of measurable functions such that for every . Then is P-Glivenko-Cantelli.
The bracket is the set of all the functions with . An -bracket is a bracket with . The bracketing number is the minimum number of -brackets with which can be covered.
We only prove the part for the numerator and the denominator can be similarly done. Let and It suffices to show that is a P-Glivenko-Cantelli class of functions. Since for any an infinite collection of real numbers can be found such that