Empirical Bayes estimation of normal means, accounting for uncertainty in estimated standard errors

by Mengyin Lu, et al.
The University of Chicago

We consider Empirical Bayes (EB) estimation in the normal means problem, when the standard deviations of the observations are not known precisely, but estimated with error -- which is almost always the case in practical applications. In classical statistics accounting for estimated standard errors usually involves replacing a normal distribution with a t distribution. This suggests approaching this problem by replacing the normal assumption with a t assumption, leading to an "EB t-means problem". Here we show that an approach along these lines can indeed work, but only with some care. Indeed, a naive application of this idea is flawed, and can perform poorly. We suggest how this flaw can be remedied by a two-stage procedure, which first performs EB shrinkage estimation of the standard errors and then solves an EB t-means problem. We give numerical results illustrating the effectiveness of this remedy.




1 Introduction

We consider Empirical Bayes (EB) estimation in applications where we have observed estimates β̂_j (j = 1, …, J) of a series of underlying “effects” β_j, with estimated standard errors ŝ_j. Our goal is to perform EB estimation for the effects β_j, from the observations β̂_j, ŝ_j, under standard normal theory assumptions, but taking account of uncertainty in the standard errors ŝ_j.

If the standard errors s_j of the β̂_j were known, rather than estimated, our problem would simply involve EB inference for the well-studied “normal means” problem (e.g. Johnstone and Silverman, 2004):

    β̂_j | β_j, s_j ~ N(β_j, s_j²),    (1)
    β_j | s_j ~ g ∈ G,    (2)

where s_j denotes the “true” standard error of β̂_j, and G is some specified family of distributions. (The conditioning on s_j in (2) makes explicit an assumption that the β_j are independent and identically distributed from g, independent of s_j, an assumption we relax later.) Fitting this EB model involves first obtaining an estimate ĝ for g (e.g. by marginal maximum likelihood), and then computing the posterior distributions p(β_j | β̂_j, s_j, ĝ). These posterior distributions can be used to obtain both point and interval estimates of β_j. And, if the family G involves sparse distributions (with a point mass on 0), then the posterior distributions can also be used to compute (local) false discovery rates (Efron, 2004), effectively providing an EB solution to the “multiple testing” problem. This EB normal means problem is well studied, and there exist flexible software implementations for solving it for a range of choices of G (e.g. Stephens, 2016). For example, methods in Stephens (2016) effectively solve this problem for

    G = U,    (3)

the set of all unimodal distributions (by exploiting the fact that any such distribution can be approximated, to arbitrary accuracy, by a mixture of sufficiently many uniform distributions).
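As a minimal illustration of the shrinkage computation involved (a sketch using a single normal prior rather than the unimodal family, and not the ashr implementation; all names are ours):

```python
import numpy as np

def ebnm_posterior(bhat, s, tau2):
    """Posterior mean and variance of each effect under the normal means model
    bhat_j | beta_j, s_j ~ N(beta_j, s_j^2), with normal prior beta_j ~ N(0, tau2)."""
    w = tau2 / (tau2 + s**2)   # shrinkage weight in [0, 1]
    return w * bhat, w * s**2

# Two effects with the same estimate but different standard errors:
# the noisier estimate is shrunk more strongly toward zero.
post_mean, post_var = ebnm_posterior(np.array([2.0, 2.0]), np.array([0.5, 2.0]), tau2=1.0)
print(post_mean)  # [1.6 0.4]
```

In practice tau2 (and, more generally, the prior g) would itself be estimated by marginal maximum likelihood before computing posteriors; here it is fixed purely for illustration.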

In classical statistics, the fact that standard errors are estimated is usually dealt with by replacing normal distributions with t distributions. Indeed, in the settings we consider here, we have

    (β̂_j − β_j)/ŝ_j | β_j ~ t_{ν_j},    (4)

where t_ν denotes the t distribution on ν degrees of freedom. Expression (4) is routinely used in classical statistics to obtain confidence intervals for β_j and p values testing β_j = 0.

From this it is tempting to replace the EB normal means problem (1)-(2) with what we call the “EB t-means problem” (EBTM):

    β̂_j | β_j, ŝ_j ~ T_{ν_j}(β_j, ŝ_j),    (5)
    β_j | ŝ_j ~ g ∈ G,    (6)

where T_ν(μ, s) denotes the generalized t distribution on ν degrees of freedom, with mean μ and scale parameter s (i.e. the distribution of μ + sX when X ~ t_ν). While this EBTM problem is much less studied than the EB normal means problem, Stephens (2016) also provides flexible software implementations solving the EBTM problem – estimating g and computing posterior distributions – for a range of choices of G.

Unfortunately, there is a problem with this tempting naive approach: while the EBTM problem (5)-(6) is well defined and solvable, the standard theory that leads to (4) does not imply (5). The reason is that ŝ_j in (4) is random, and not conditioned on, and the unconditional expression does not imply a corresponding conditional one:

    (β̂_j − β_j)/ŝ_j | β_j ~ t_{ν_j}   does not imply   (β̂_j − β_j)/ŝ_j | β_j, ŝ_j ~ t_{ν_j}.    (7)

To give a simple explicit example of this: if X ~ N(0, 1) and, independently, νY² ~ χ²_ν, then marginally X/Y ~ t_ν, but conditionally X/Y | Y ~ N(0, 1/Y²). Consequently (5) does not hold in general, and – as we show later – ignoring this can produce very unreliable inferences in practice.
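This distinction is easy to verify by simulation. The following Python sketch (ours, for illustration; all names are hypothetical) draws β̂ | s ~ N(0, s²) with true effect 0, and ŝ² ~ (s²/ν)χ²_ν independently: marginally β̂/ŝ matches the t_ν quantiles, but the spread of β̂ given ŝ does not depend on ŝ at all, so (5) fails:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, nu, sigma = 200_000, 4, 1.0

bhat = rng.normal(0.0, sigma, size=n)                    # bhat | s ~ N(0, s^2), true beta = 0
shat = sigma * np.sqrt(rng.chisquare(nu, size=n) / nu)   # nu * shat^2 / s^2 ~ chi^2_nu

# Marginally, bhat/shat follows t_nu:
T = bhat / shat
print(np.quantile(T, 0.975), stats.t.ppf(0.975, df=nu))  # both ~2.78

# ...but conditionally on shat, bhat is still N(0, sigma^2):
# its spread is the same whether shat is small or large.
small = bhat[shat < np.quantile(shat, 0.2)]
large = bhat[shat > np.quantile(shat, 0.8)]
print(small.std(), large.std())  # both ~1.0
```

If (5) held, the conditional spread of β̂ would scale with ŝ; the two stratified standard deviations show it does not.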

In this paper we describe a simple solution to this problem. Our solution involves EB analysis of the standard errors ŝ_j (Smyth, 2004), which is already widely used in genomics applications – indeed, currently much more widely used than EB analysis of the effect estimates β̂_j. Our approach effectively combines the methods from Smyth (2004) with the methods for the EBTM problem from Stephens (2016). We demonstrate empirically that, in contrast with the naive approach, this combined approach can provide reliable inference.

2 Methods

Assume that, independently for j = 1, …, J, we have observed estimates β̂_j and corresponding (estimated) standard errors ŝ_j, satisfying

    β̂_j | β_j, s_j ~ N(β_j, s_j²),    (8)
    β̂_j independent of ŝ_j, given β_j, s_j,    (9)
    ŝ_j² | s_j ~ (s_j²/ν_j) χ²_{ν_j}.    (10)

For example, (8)-(10) hold if β̂_j and ŝ_j are the usual estimate of β_j and its standard error in a simple linear regression, y_j = β_j x + e_j, where y_j and x are observed n-vectors and the residual errors e_j ~ N_n(0, σ_j² I_n), with ν_j the residual degrees of freedom.

Our goal is to perform EB estimation for the β_j under the assumption (6) that β_j | ŝ_j ~ g ∈ G. As noted in the Introduction, if (5) held then this would be solved by methods for the EBTM problem in Stephens (2016). However, unfortunately, (8)-(10) do not imply (5), and so (5) does not hold in general.

We now describe a simple solution to this problem, based on combining the EBTM methods in Stephens (2016) with EB estimation for the s_j using the methods in Smyth (2004). Specifically, Smyth (2004) combines the sampling distribution (10) for ŝ_j² with an assumption that the true variances s_j² come from an inverse-gamma distribution, which can be written:

    1/s_j² ~ (1/(ν_0 s_0²)) χ²_{ν_0},    (11)

where (ν_0, s_0²) are parameters to be estimated. The EB approach in Smyth (2004) estimates (ν_0, s_0²) from the observations ŝ_1², …, ŝ_J² (using a method of moments), and then bases inferences for each s_j² on its posterior distribution given these estimates, which is also an inverse-gamma distribution. Indeed, given ν_0, s_0² and ŝ_j², the posterior can be written

    1/s_j² | ŝ_j² ~ (1/(ν̃_j s̃_j²)) χ²_{ν̃_j},    (12)

where

    ν̃_j := ν_0 + ν_j,    (13)
    s̃_j² := (ν_0 s_0² + ν_j ŝ_j²)/(ν_0 + ν_j).    (14)

In particular, Smyth (2004) uses s̃_j² – which lies between s_0² and ŝ_j² – as a “moderated” estimate of s_j².

The key to our approach is the following simple lemma.

Lemma 1.

Assuming (8), (9) and (12), it follows that

    (β̂_j − β_j)/s̃_j | β_j, ŝ_j ~ t_{ν̃_j}.    (15)

Thus, although (5) does not hold in general, under the assumptions (8)-(11) (which imply (12)) an analogous expression (15) does hold. This analogous expression simply involves replacing the original standard errors and degrees of freedom with their moderated values s̃_j (14) and ν̃_j (13). Combining (15) with (6) then yields an EBTM problem that can be solved using methods from Stephens (2016).
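In code, the moderated quantities are a one-line computation once (ν_0, s_0²) are in hand (here supplied directly; in practice they are estimated from the observed ŝ_j², e.g. by limma's method of moments). A Python sketch of Smyth's (2004) moderation formulas, with our own variable names:

```python
def moderate(shat2, nu, nu0, s0_2):
    """Moderated degrees of freedom and variance, as in Smyth (2004):
    nu_tilde = nu0 + nu,  s_tilde^2 = (nu0 * s0^2 + nu * shat^2) / (nu0 + nu).
    The moderated variance is a degrees-of-freedom-weighted compromise
    between the prior value s0^2 and the observed shat^2."""
    nu_tilde = nu0 + nu
    s_tilde2 = (nu0 * s0_2 + nu * shat2) / (nu0 + nu)
    return nu_tilde, s_tilde2

# A gene whose variance estimate 0.2 is implausibly small (prior s0^2 = 1.0)
# is pulled up toward the prior:
print(moderate(shat2=0.2, nu=4, nu0=4.0, s0_2=1.0))  # (8.0, 0.6)
```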

2.1 A two-step strategy

Putting this all together, we suggest the following two-step strategy for fitting the EBNM model, accounting for uncertainty in estimated standard errors:

  1. Apply EB shrinkage methods to the estimated standard errors ŝ_j, using the likelihood (10) and prior (11), as in Smyth (2004). This yields estimates for (ν_0, s_0²), and subsequently moderated standard errors s̃_j (14) and degrees of freedom ν̃_j (13).

  2. Apply methods for the EBTM problem (e.g. Stephens, 2016) to the estimates β̂_j, moderated standard errors s̃_j and degrees of freedom ν̃_j. This yields an estimate ĝ for g and the posterior distributions p(β_j | β̂_j, s̃_j, ĝ).


  1. Like many two-step procedures, this two-step procedure is not fully efficient: in principle it would be more efficient to jointly estimate g, ν_0 and s_0² from all the data (β̂_j, ŝ_j), rather than first estimate (ν_0, s_0²) from the ŝ_j and then estimate g while fixing these estimates. However, in practice, because the number of tests J is typically large, (ν_0, s_0²) can already be accurately estimated from the ŝ_j alone, and in our view the convenience of the two-step procedure greatly outweighs any minor loss of efficiency.

  2. The distributional assumption (11), which leads to (12), may seem somewhat restrictive. However, the moderated t statistics from Smyth (2004) – which rely on the same assumption – have been found to be well behaved in practice and are widely used. See Lu and Stephens (2016); Phipson et al. (2016) for discussion and assessment of more flexible assumptions.

  3. Although assumptions (11) and (10) are the simplest way to obtain posterior distributions of the form (12), the form (12) holds more generally. For example, the voom framework (Law et al., 2014) adapts methods in Smyth (2004) to deal with the count nature of RNA sequencing data, and involves both accounting for mean-variance relationships and using weighted least squares rather than ordinary least squares. However, it ultimately yields conditional distributions of the form (12), which – by Lemma 1 – lead to an EBTM problem for the β_j.

2.2 Dependence of β_j on ŝ_j

Equation (6) assumes that the β_j are independent of ŝ_j. Methods in Stephens (2016) for the EBTM problem can deal with the more general assumption:

    β_j/s̃_j^α | s̃_j ~ g ∈ G,    (16)

for any choice of α ≥ 0. The choice α = 0 gives (6). The choice α = 1 corresponds to assuming that the moderated t statistics β̂_j/s̃_j from Smyth (2004) are identically distributed, independent of s̃_j, which in turn leads to the property that EB measures of significance (e.g. the local FDR) are monotonic in the moderated t statistics as they move away from 0 (and monotonic in the corresponding p values if g is symmetric about 0). Thus α = 1 can be thought of as corresponding to the implicit assumption made when ranking significance by p values from the moderated t statistics (Wakefield, 2009).

Although values of α other than 0 and 1 do not have a straightforward motivation or interpretation, it is straightforward to fit these models, and to estimate α by comparing likelihoods if desired.

2.3 An ad hoc strategy that avoids the EBTM problem

The framework outlined above has the advantage of being based on clear statistical principles. However, it has the disadvantage that the EBTM problem is often more complex to solve than the EBNM problem. In our numerical studies below we therefore also consider an alternative ad hoc strategy, which avoids solving the EBTM problem.

This ad hoc strategy starts by using the same ideas as above to obtain the estimates β̂_j, moderated standard errors s̃_j and degrees of freedom ν̃_j. However, rather than applying the EBTM methods to these data, we convert the problem into a “normal means” problem by changing the standard errors. Specifically, for each j we define the “adjusted standard error” s*_j to be the value for which the z-score β̂_j/s*_j results in the same p value (when compared with a standard normal distribution) as from the moderated t test (comparing β̂_j/s̃_j with a t distribution on ν̃_j degrees of freedom).¹ We then use (β̂_j, s*_j) as the inputs to an EBNM problem to obtain posterior distributions and shrinkage estimates for the β_j.

¹The following R function computes the adjusted standard error from an effect estimate bhat and corresponding p value p:

    pval2se = function(bhat, p) {
      z = qnorm(1 - p/2)
      s = abs(bhat/z)
      return(s)
    }

3 Numerical Studies

We illustrate our two-stage strategy, and compare it with the naive strategy, the ad hoc strategy, and other related methods, using simulations. To make our simulated standard errors and test statistics realistic, we base our simulations on real data from an RNA sequencing experiment (RNA-seq data). However, unlike real RNA-seq data, our simulations create data that are independent across genes. In practice RNA-seq data are often strongly correlated among genes, and these correlations can cause severe complications for many analysis methods (Leek and Storey, 2007), including the Empirical Bayes methods used here (Efron, 2010; Gerard and Stephens, 2018). By removing these correlations we compare methods under idealized conditions, and seek to show that even under idealized conditions the naive approach – which does not use EB shrinkage of the standard errors – performs poorly. For empirical comparisons of methods on correlated RNA-seq data see Gerard and Stephens (2018); Lu (2018).

We perform simulations for two groups, each containing N samples (N = 2, 4 or 10), with J = 10,000 genes. The non-zero effects β_j are simulated from g for various choices of distribution g (Figure 1; Table 1), and then divided by a scaling factor chosen so that power is similar across scenarios. For each combination of (g, N), we simulate 50 datasets with the null proportion π_0 drawn uniformly from [0, 1].

Table 1: Summary of simulation scenarios considered (scenario name and alternative effect distribution g).
Figure 1: Densities of non-zero effects, g, used in simulations.

We analyzed each simulated dataset with several methods based on the voom-limma (VL) pipeline (Law et al., 2014), which uses the voom and lmFit functions from the limma R package (Ritchie et al., 2015) to obtain estimates β̂_j and standard errors ŝ_j, along with degrees of freedom ν_j. Many of the pipelines also use the eBayes function to obtain moderated standard errors s̃_j and moderated degrees of freedom ν̃_j, which yield moderated t statistics β̂_j/s̃_j and corresponding p values.

  • VL+ash: this is the “naive” approach, which directly feeds the (β̂_j, ŝ_j, ν_j) (without variance moderation) into the EBTM solver in the ash function in the ashr software (Stephens, 2016). As noted above, this approach is flawed in principle, and our results show it can also perform poorly in practice.

  • VL+eBayes+ash and VL+eBayes+ash.alpha=1: these are our proposed pipelines, which feed the β̂_j and moderated standard errors s̃_j (and degrees of freedom ν̃_j) to the ash EBTM solver (with α = 0 for VL+eBayes+ash and α = 1 for VL+eBayes+ash.alpha=1).

  • VL+pval2se+ash: this is our “ad hoc” approach (Section 2.3), which converts the EBTM problem into an EBNM problem by computing “adjusted standard errors” s*_j, and then applies ash to solve the EBNM problem for (β̂_j, s*_j).

  • VL+eBayes+qvalue: this is a standard pipeline for controlling FDR in differential expression studies (not based on EB methods): it feeds the p values from the moderated t statistics to the qvalue software (Storey, 2002), which outputs an estimate for π_0 and a q-value for each test, which can be used to control FDR.

Estimation of the null proportion π_0

All of the above methods provide an estimate of the null proportion, π_0. Obtaining accurate estimates of π_0 is important for obtaining accurate estimates of FDR: underestimating π_0 will lead to anti-conservative estimates of FDR, whereas overestimating π_0 will lead to conservative (over-)estimates of FDR, effectively reducing statistical power.

Figure 2 compares the estimated π_0 with the true π_0 in our simulations. The first key observation is that the naive approach VL+ash can dramatically underestimate π_0, and cannot be recommended. All other approaches generally provide reasonable (conservative) estimates of π_0, with the ash-based approaches producing more accurate (less conservative) estimates than those from qvalue. This improved accuracy comes from the additional assumption made by the EB approach in ash, that the effects are unimodal (Stephens, 2016). The results are reasonably robust to this assumption, but estimates of π_0 can be anti-conservative in the bimodal scenario (just as in Stephens (2016)).

Assessment of FDR control and power

Figure 3 assesses how well each method controls FDR in our simulations (at nominal level 0.05), and Figure 4 shows the corresponding power (proportion of true effects declared significant). The naive method completely fails to control FDR for very small sample sizes (2 vs 2 or 4 vs 4). Other methods generally perform well at controlling FDR, although there is some lack of FDR control for the ash-based methods in the bimodal scenario. The ash-based methods are slightly more powerful than qvalue because of their less conservative estimates of π_0.

Figure 2: Comparison of true and estimated values of π_0 on simulated data. Generally VL+ash is anti-conservative, often substantially under-estimating π_0. When the unimodal assumption (UA) holds, the other three methods yield conservative (over-)estimates of π_0, with VL+eBayes+ash, VL+eBayes+ash.alpha=1 and VL+pval2se+ash being less conservative, and hence more accurate. When the UA does not hold (“bimodal” scenario), the VL+eBayes+ash estimates are slightly anti-conservative.
Figure 3: Comparison of empirical false discovery proportions (FDP), at q-value 0.05, on simulated data. Generally the naive method, VL+ash, is anti-conservative, failing to control FDP. In contrast, the other methods generally control FDP near or under 0.05, although VL+eBayes+ash is slightly anti-conservative in the “big-normal” scenario with small sample size (2 vs 2).
Figure 4: Comparison of proportion of discoveries (“power”), at q-value 0.05, on simulated data. Typically VL+eBayes+ash and VL+eBayes+ash.alpha=1 make more discoveries than VL+eBayes+qvalue, while controlling FDP (Figure 3).
Effect estimates

One attractive feature of the EB approach to multiple testing is that it provides not only estimates of FDR, but also shrinkage estimates of effect sizes. To compare the accuracy of the shrinkage estimates with the original (unshrunk VL+eBayes) estimates, we compute the relative root mean squared error (RRMSE) for each method as the ratio of the method's RMSE to the baseline RMSE of the original estimates. Here RMSE = √((1/J) Σ_j (β̂_j − β_j)²), computed with the relevant estimates β̂_j for each method.
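Concretely, the RRMSE computation is (a hypothetical sketch, not the paper's code):

```python
import numpy as np

def rmse(est, truth):
    """Root mean squared error of effect estimates against the true effects."""
    return float(np.sqrt(np.mean((np.asarray(est) - np.asarray(truth)) ** 2)))

def rrmse(est, baseline_est, truth):
    """Relative RMSE: a method's RMSE divided by the baseline's,
    so values below 1 mean the method improves on the baseline."""
    return rmse(est, truth) / rmse(baseline_est, truth)

# Shrinking two overestimates halfway toward the (zero) truth halves the RMSE:
print(rrmse([0.5, 0.5], [1.0, 1.0], [0.0, 0.0]))  # 0.5
```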

The results (Figure 5) demonstrate the expected benefits of shrinkage estimation: the shrunken estimates from VL+eBayes+ash (whether α = 0 or α = 1) are consistently better than the original unshrunk estimates. The improvement on the baseline RMSE is up to 90% in settings where most effects are null, where the benefits of shrinkage are strongest.

Calibration of posterior intervals

In addition to shrinkage point estimates of β_j, the EB approach also provides “shrunk” interval estimates. Stephens (2016) used simulations to show that, under idealized conditions (with known standard errors), these interval estimates not only have good coverage properties on average, but also “post-selection”: that is, even if we focus only on significant effects, the coverage of the EB credible intervals is good. This property is difficult to obtain in other ways.

Here we repeat this coverage assessment in the case of estimated standard errors. Table 2 shows the coverage rates of 95% lower credible bounds for the effects, split into all observations (a), significant negative discoveries (b) and significant positive discoveries (c). Note that for significant negative discoveries (b), the lower credible bound is bounding how “large” the effect is (in absolute value), whereas for positive discoveries (c) it is bounding how close to 0 it can be. In general, coverage rates are satisfactory, with the most prominent exception being the case N = 2 in (b), where coverage rates are often much lower than the nominal 95%. This says that the method is “over-shrinking” the significant effects towards zero in this case, probably due to underestimating the length of the tail of the effects. In low-signal situations some level of over-shrinkage may be inevitable if we want to maintain conservative behaviour (i.e. avoid under-shrinkage); thus it is unclear to what extent this behavior could be improved on.

Figure 5: Comparison of RRMSE (relative root mean squared error) of effect estimates on simulated data. To compute RRMSE we take the RMSE of VL as the baseline, and divide the RMSE of each method by this baseline; thus by definition the RRMSE of VL is 1. VL+eBayes+ash is more accurate (RRMSE < 1) in all scenarios, especially when π_0 is close to 1. The ad hoc approach VL+pval2se+ash performs less well for small sample sizes (although it is similar to VL+eBayes+ash for 10 vs 10, except in the “big-normal” scenario).
big-normal bimodal flat-top near-normal spiky
N=2 0.80 0.93 0.96 0.91 0.93
N=4 0.93 0.96 0.96 0.96 0.95
N=10 0.97 0.97 0.96 0.96 0.95
(a) All observations. Coverage rates are generally satisfactory, except for the big-normal scenario when N = 2.
big-normal bimodal flat-top near-normal spiky
N=2 0.23 0.74 0.94 0.65 0.74
N=4 0.76 0.95 0.94 0.95 0.94
N=10 0.95 0.94 0.94 0.94 0.93
(b) “Significant” negative discoveries. Coverage rates are generally satisfactory for N = 10, and for N = 4 (except for the big-normal scenario), but are generally poor for N = 2, suggesting over-shrinkage (underestimating the length of the tail of g) in this setting.
big-normal bimodal flat-top near-normal spiky
N=2 0.98 0.96 0.96 0.96 0.96
N=4 0.96 0.96 0.96 0.96 0.96
N=10 0.95 0.96 0.96 0.96 0.96
(c) “Significant” positive discoveries. Coverage rates are generally satisfactory.
Table 2: Empirical coverage of nominal 95% lower credible bounds for method VL+eBayes+ash on simulated data.

4 Discussion

In summary, we have shown how EB analysis of normal means with estimated standard errors can be satisfactorily solved by performing an EB analysis of the “t-means” problem (EBTM), but only after applying EB methods to the estimated standard errors themselves to obtain moderated estimates of the standard errors (and associated degrees of freedom).

Our numerical results also show that a simpler ad hoc approach, VL+pval2se+ash in the Figures, which avoids solving the more complex EBTM problem by instead adjusting the standard errors and solving an EBNM problem, can work adequately to control false discovery rates. However, it performs less well in estimation accuracy than the more principled approaches based on solving the EBTM problem.

Code used to obtain the numerical results presented here is available at http://doi.org/10.5281/zenodo.2547022.


  • Efron (2004) Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association 99(465), 96–104.
  • Efron (2010) Efron, B. (2010). Correlated z-values and the accuracy of large-scale statistical estimates. Journal of the American Statistical Association 105(491), 1042–1055.
  • Gerard and Stephens (2018) Gerard, D. and M. Stephens (2018). Empirical Bayes shrinkage and false discovery rate estimation, allowing for unwanted variation. Biostatistics, kxy029.
  • Johnstone and Silverman (2004) Johnstone, I. M. and B. W. Silverman (2004). Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics 32(4), 1594–1649.
  • Law et al. (2014) Law, C. W., Y. Chen, W. Shi, and G. K. Smyth (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15(2), R29.
  • Leek and Storey (2007) Leek, J. T. and J. D. Storey (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 3(9), e161.
  • Lonsdale et al. (2013) Lonsdale, J., J. Thomas, M. Salvatore, R. Phillips, E. Lo, S. Shad, R. Hasz, G. Walters, F. Garcia, N. Young, et al. (2013). The genotype-tissue expression (GTEx) project. Nature genetics 45(6), 580–585.
  • Lu (2018) Lu, M. (2018). Generalized Adaptive Shrinkage Methods and Applications in Genomics Studies. Ph. D. thesis, University of Chicago.
  • Lu and Stephens (2016) Lu, M. and M. Stephens (2016). Variance adaptive shrinkage (vash): flexible empirical Bayes estimation of variances. Bioinformatics, btw483.
  • Phipson et al. (2016) Phipson, B., S. Lee, I. J. Majewski, W. S. Alexander, and G. K. Smyth (2016). Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. The Annals of Applied Statistics 10(2), 946.
  • Ritchie et al. (2015) Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47.
  • Smyth (2004) Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3(1), Article 3.
  • Stephens (2016) Stephens, M. (2016). False discovery rates: a new deal. Biostatistics 18(2), 275–294.
  • Storey (2002) Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(3), 479–498.
  • Wakefield (2009) Wakefield, J. (2009). Bayes factors for genome-wide association studies: comparison with P-values. Genetic epidemiology 33(1), 79–86.

Appendix A Simulation details

The following simulation scheme is designed to create realistic count datasets that mimic the structure of real RNA-seq data, making distributional assumptions only sparingly.

  1. Start with a real RNA-seq dataset of read counts for J genes in a pool of samples.

  2. Create a “null” dataset containing two groups (A and B), of sizes N_A and N_B, by randomly sampling (without replacement) N_A samples for group A and N_B samples for group B. Because the assignment of samples to the two groups is random, this is a null dataset by construction. Let Y_jk denote the read count for gene j and sample k.

  3. Randomly select a subset of genes as “alternative genes”, and generate their effects (log2-fold changes between groups) β_j from a specified “effect distribution” g.

  4. For these alternative genes, if β_j > 0 (so group B should be more highly expressed), we use Poisson thinning to achieve the desired fold change, i.e. thin the read counts in group A as follows:

        Y_jk^new ~ Binomial(Y_jk, 2^(−β_j))  for samples k in group A.

    Similarly, if β_j < 0, thin the read counts in group B:

        Y_jk^new ~ Binomial(Y_jk, 2^(β_j))  for samples k in group B.

    Replacing Y_jk by Y_jk^new results in a new RNA-seq dataset in which the true effects follow g.
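The thinning step above can be sketched as follows (our illustrative Python, with hypothetical names; binomially thinning Poisson counts yields Poisson counts with the mean scaled by the keep-probability, which is what makes the induced fold change exact):

```python
import numpy as np

rng = np.random.default_rng(1)

def thin(counts, beta):
    """Keep each read independently with probability 2**(-|beta|):
    thinned Poisson(lam) counts are Poisson(lam * 2**(-|beta|)),
    inducing a 2**|beta| fold change relative to the un-thinned group."""
    return rng.binomial(counts, 2.0 ** (-abs(beta)))

# Thinning Poisson(1000) counts with beta = 1 roughly halves their mean:
y = rng.poisson(1000, size=5000)
print(y.mean(), thin(y, 1.0).mean())
```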


The above simulation scheme, which we developed during our work on this project, was used by Gerard and Stephens (2018) to generate realistic simulated RNA-seq datasets with a desired effect distribution, while still preserving most of the structure (correlation, magnitude, etc.) of the actual RNA-seq data. Unfortunately, correlations among genes create substantial complications for many analysis methods (Leek and Storey, 2007), including ours; see Gerard and Stephens (2018) for extensive discussion and further references. To avoid these complications here, we modify this scheme to remove correlations between genes. Specifically, we modify step 2 to randomly select the samples for groups A and B independently at each gene. This modification ensures that the simulated null data at each gene are independent.

While this modification makes the simulations unrepresentative of typical RNA-seq experiments (since real data are typically correlated across genes), it allows us to study the behaviour of methods under idealized situations, which is helpful for understanding the main conceptual contribution of our work here. Results of our methods on the more realistic simulations with correlations intact are given in Lu (2018).

Our simulations here used RNA-seq data from liver tissue samples distributed by the Genotype-Tissue Expression (GTEx) project (Lonsdale et al., 2013). These data (GTEx V6 dbGaP accession phs000424.v6.p1, release date: Oct 19, 2015, http://www.gtexportal.org/home/) contained data on 119 samples, and we restricted simulations to the 10,000 most highly expressed genes.