Solving the Empirical Bayes Normal Means Problem with Correlated Noise

by Lei Sun, et al.
The University of Chicago

The Normal Means problem plays a fundamental role in many areas of modern high-dimensional statistics, both in theory and practice. And the Empirical Bayes (EB) approach to solving this problem has been shown to be highly effective, again both in theory and practice. However, almost all EB treatments of the Normal Means problem assume that the observations are independent. In practice correlations are ubiquitous in real-world applications, and these correlations can grossly distort EB estimates. Here, exploiting theory from Schwartzman (2010), we develop new EB methods for solving the Normal Means problem that take account of unknown correlations among observations. We provide practical software implementations of these methods, and illustrate them in the context of large-scale multiple testing problems and False Discovery Rate (FDR) control. In realistic numerical experiments our methods compare favorably with other commonly-used multiple testing methods.





1 Introduction

We consider the Empirical Bayes (EB) approach to the Normal Means problem (Efron and Morris, 1973; Johnstone and Silverman, 2004):

$$x_j \mid \theta_j \sim N(\theta_j, s_j^2), \quad j = 1, \dots, n, \tag{1}$$

where $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$; $x_1, \dots, x_n$ are observations; $s_1, \dots, s_n$ are standard deviations that are assumed known; and $\theta_1, \dots, \theta_n$ are unknown means to be estimated. The EB approach assumes that $\theta_1, \dots, \theta_n$ are independent and identically distributed (iid) from some “prior” distribution $g$,

$$\theta_j \sim g, \quad j = 1, \dots, n,$$

and performs inference for $\theta = (\theta_1, \dots, \theta_n)$ in two steps: first obtain an estimate of $g$, $\hat g$ say, and second compute the posterior distributions $p(\theta_j \mid x_j, \hat g)$. We refer to this two-step process as “solving the Empirical Bayes Normal Means (EBNM) problem.” The first step, estimating $g$, is sometimes of direct interest in itself, and is an example of a “deconvolution” problem (e.g. Kiefer and Wolfowitz, 1956; Laird, 1978; Stefanski and Carroll, 1990; Fan, 1991; Cordy and Thomas, 1997; Bovy et al., 2011; Efron, 2016).
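To make the two-step recipe concrete, here is a minimal sketch in Python for the simplest special case, a single Gaussian prior $g = N(0, A)$ (this is an illustration only, not the scale-mixture prior used later in the paper): step one estimates the prior variance $A$ by marginal maximum likelihood, and step two computes posterior means, which shrink each observation toward zero.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def eb_normal_means(x, s):
    """Two-step EB for the normal means model with a N(0, A) prior.

    Step 1: estimate the prior variance A by marginal maximum likelihood
            (after integrating out theta_j, x_j ~ N(0, A + s_j^2)).
    Step 2: posterior means shrink each x_j toward zero by A/(A + s_j^2).
    """
    def neg_loglik(log_A):
        v = np.exp(log_A) + s**2            # marginal variances A + s_j^2
        return 0.5 * np.sum(np.log(v) + x**2 / v)

    A = np.exp(minimize_scalar(neg_loglik, bounds=(-10, 10),
                               method="bounded").x)
    shrink = A / (A + s**2)                 # shrinkage factors in [0, 1)
    return A, shrink * x                    # (estimated A, posterior means)

rng = np.random.default_rng(1)
n = 5000
theta = rng.normal(0, 2.0, n)               # true means: prior variance 4
s = np.full(n, 1.0)
x = theta + rng.normal(0, s)
A_hat, post_mean = eb_normal_means(x, s)
# A_hat should be near 4, and the shrinkage estimates should beat x itself
print(A_hat, np.mean((post_mean - theta)**2) < np.mean((x - theta)**2))
```

The “borrowing strength” benefit is visible here: the data themselves determine the shrinkage factor $A/(A+s_j^2)$, and the resulting posterior means have lower mean squared error than the raw observations.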

First named by Robbins (1956), EB methods have seen extensive theoretical study (e.g. Robbins, 1964; Morris, 1983; Efron, 1996; Jiang and Zhang, 2009; Brown and Greenshtein, 2009; Scott and Berger, 2010; Petrone et al., 2014; Rousseau and Szabo, 2017; Efron, 2018), and are becoming widely used in practice. Indeed, according to Efron and Hastie (2016), “large parallel data sets are a hallmark of twenty-first-century scientific investigation, promoting the popularity of empirical Bayes methods.”

The EB approach provides a particularly attractive solution to the Normal Means problem. For example, the posterior means of $\theta_j$ provide shrinkage point estimates, with all the accompanying risk-reduction benefits (Efron and Morris, 1972; Berger, 1985). And the posterior distributions for $\theta_j$ provide corresponding “shrinkage” interval estimates, which can have good coverage properties even “post-selection” (Dawid, 1994; Stephens, 2017). Further, by estimating $g$, EB methods “borrow strength” across observations, and automatically determine an appropriate amount of shrinkage from the data (Johnstone and Silverman, 2004). Because of these benefits, methods for solving the EBNM problem – and related extensions – are increasingly used in data applications (e.g. Clyde and George, 2000; Johnstone and Silverman, 2005b; Brown, 2008; Koenker and Mizera, 2014; Xing and Stephens, 2016; Urbut et al., 2018; Wang and Stephens, 2018; Dey and Stephens, 2018). One application of EBNM methods that we pay particular attention to later is large-scale multiple testing, and estimation/control of the False Discovery Rate (FDR; Benjamini and Hochberg, 1995; Efron, 2010b; Muralidharan, 2010; Stephens, 2017; Gerard and Stephens, 2018).

Almost all existing treatments of the EBNM problem assume that the observations in (1) are independent given $\theta$. However, this assumption can be grossly violated in practice: non-negligible correlations are common in real-world data sets. Further, as we discuss later, EB approaches to the Normal Means problem are particularly vulnerable to being misled by pervasive correlations. Specifically, when the average strength of pairwise correlations among observations is non-negligible, the estimate of $g$ can be very inaccurate, and this adversely affects inference for all $\theta_j$. Ironically then, the attractive “borrowing strength” property of the EB approach becomes, in the presence of pervasive correlations, its Achilles’ heel.

In this paper we introduce methods for solving the EBNM problem allowing for unknown correlations among the observations. More precisely, rewriting (1) as

$$x_j = \theta_j + s_j z_j, \tag{2}$$
$$z_j \sim N(0, 1), \tag{3}$$

we develop methods that allow for unknown correlations among the “noise” $z_1, \dots, z_n$. Our methods are built on elegant theory from Schwartzman (2010), who shows, in essence, that the limiting empirical distribution, $f$ say, of correlated $N(0,1)$ random variables can be represented using a basis of the standard Gaussian density and its derivatives of increasing order. We use this result, combined with an assumption that $z_1, \dots, z_n$ are exchangeable, to frame solving this “EBNM with correlated noise” problem as a two-step process: first jointly estimate $g$ and $f$ from all observations; and second compute the posterior distribution of each $\theta_j$ given the estimated $g$ (and $f$). Although many assumptions on $g$ are possible, here we assume $g$ to be a scale mixture of zero-mean Gaussians, following the flexible “adaptive shrinkage” approach in Stephens (2017). We have implemented these methods in an R package, cashr (“correlated adaptive shrinkage in R”), available online.

The rest of the paper is organized as follows. In Section 2, we illustrate how correlation can derail existing EBNM methods, and review Schwartzman's (2010) representation of the empirical distribution of correlated random variables. In Section 3 we introduce the exchangeable correlated noise (ECN) model, and describe methods to solve the EBNM with correlated noise problem. Section 4 provides numerical examples with realistic simulations and real data illustrations. Section 5 concludes and discusses future research directions.

2 Motivation and Background

2.1 Correlation distorts empirical distribution and misleads EBNM methods

In essence, the reason correlation can cause problems for EBNM methods is that, even with large samples, the empirical distribution of correlated variables can be quite different from their marginal distribution (e.g. Efron, 2007a). To illustrate this, we generated realistic correlated $z$-scores using a framework similar to Gerard and Stephens (2017, 2018) and Lu (2018). Specifically, we took RNA-seq gene expression data on the most highly expressed genes in 119 human liver tissues (The GTEx Consortium, 2015, 2017). In each simulation we randomly drew two groups of five samples (without replacement), and applied a standard RNA-seq analysis pipeline, using the software packages edgeR (Robinson et al., 2010), voom (Law et al., 2014), and limma (Ritchie et al., 2015), to compute, for each gene $j$, an estimate $\hat\beta_j$ of the log-fold difference in mean expression, $\beta_j$, and a corresponding two-sided $p$-value, $p_j$, testing the null hypothesis that the difference in mean is 0. We converted $p_j$ to a $z$-score $z_j = -\mathrm{sign}(\hat\beta_j)\,\Phi^{-1}(p_j/2)$, where $\Phi$ is the CDF of $N(0,1)$. We also computed an “effective” standard deviation $\hat s_j = \hat\beta_j / z_j$ for later use (Figure 2 and Section 4).

In these simulations, because samples are randomly assigned to the two groups, there are no genuine differences in mean expression. Therefore the $z$-scores should have marginal distribution $N(0,1)$. And, indeed, empirical checks confirm that the analysis pipeline produces marginally well-calibrated $z$-scores (Appendix A). However, the $z$-scores in each simulated data set are correlated, due to correlations among genes, and such correlations can distort the empirical distribution so that it is very different from $N(0,1)$ (Efron, 2007a, 2010a, 2010b). Figure 1 shows four examples, chosen to highlight some common patterns. Panels (a-c) all exhibit a feature we call pseudo-inflation, where the empirical distribution is more dispersed than $N(0,1)$. Conversely, panel (d) exhibits pseudo-deflation, where the empirical distribution is less dispersed than $N(0,1)$. Panel (b) also exhibits skew.

Figure 1: Illustration that the empirical distribution of a large number of correlated and marginally null $z$-scores can deviate substantially from $N(0,1)$. The red lines are fitted densities obtained using our “Exchangeable Correlated Noise” model (Section 3.1), which uses a linear combination of the standard Gaussian density and its standardized derivatives.

Such correlation-induced distortion of the empirical distribution, if not appropriately addressed, can have serious consequences for EBNM methods. To illustrate this we applied several EBNM methods to five data sets simulated according to (2)-(3) as follows:

  • The normal means $\theta_j$ are iid samples from the mixture $g = \pi_0\,\delta_0 + (1 - \pi_0)\,N(0, \sigma^2)$, where $\delta_0$ denotes a point mass on 0, whose coefficient $\pi_0$ is the null proportion, and $N(\mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$. The same $\theta_j$ are used in all five data sets.

  • In the first four data sets, the noise variables, $z_1, \dots, z_n$, are the correlated null $z$-scores from the four panels of Figure 1. In the fifth data set the $z_j$ are iid $N(0,1)$ samples.

  • The standard deviations $s_j$ are simulated from the RNA-seq gene expression data as described above, and are then rescaled.

We provide the simulated values $(x_j, s_j)$ to four existing EBNM methods – EbayesThresh (Johnstone and Silverman, 2004, 2005a), REBayes (Koenker and Mizera, 2014; Koenker and Gu, 2017), ashr (Stephens, 2017), and deconvolveR (Efron, 2016; Narasimhan and Efron, 2016) – all of which ignore correlation and assume independence among observations. (For deconvolveR we set $s_j \equiv 1$, as its current implementation supports only homoskedastic noise.)

The estimates of $g$ obtained by each method are shown in Figure 2. All methods do reasonably well in the fifth data set, where the $z_j$ are indeed independent (panel (e)). However, in the correlated data sets (panels (a-d)) the methods all misbehave in a similar way: over-estimating the dispersion of $g$ under pseudo-inflation, and under-estimating it under pseudo-deflation. Their estimates of the null proportion $\pi_0$ are particularly inaccurate. These adverse effects are visible even when the distortion appears mild (e.g. Figure 1(a)).

Figure 2: Illustration of how correlation can distort estimates of $g$ obtained by EBNM methods. Each panel compares the true $g$ with the estimates $\hat g$ from several EBNM methods applied to the same simulated data set (see main text for details). In panels (a-d) the noise $z_j$ are the correlated null $z$-scores from the corresponding panels of Figure 1. In panel (e) the $z_j$ are iid $N(0,1)$ samples. Existing EBNM methods (EbayesThresh, REBayes, ashr, deconvolveR), which ignore correlation among observations, do reasonably well with iid noise (e). However they do much less well in the correlated cases (a-d): over-estimating the dispersion of $g$ under pseudo-inflation (a-c) and under-estimating it under pseudo-deflation (d). In contrast, our new method cashr (Section 3) estimates $g$ consistently well.

As a taster for what is to come, Figure 2 also shows the results from our new method, cashr, described later. This new method can account for both pseudo-inflation and pseudo-deflation, and in these examples estimates $g$ consistently well.

2.2 Pseudo-inflation is non-Gaussian

In a series of pioneering papers (Efron, 2004, 2007a, 2007b, 2008, 2010a), Efron studied the impact of correlations among $z$-scores on EB approaches to multiple testing. To account for the effects of correlation (pseudo-inflation, pseudo-deflation, and skew) on the empirical distribution of null $z$-scores he introduced the concept of an “empirical null.” In his locfdr method (Efron, 2005), the empirical null is assumed to be Gaussian, $N(\mu, \sigma^2)$, with parameters estimated from the data. However, theory suggests that pseudo-inflation is not well modelled by a Gaussian distribution (Schwartzman, 2010, reviewed in Section 2.3), and a closer look at our empirical results here supports this conclusion.

To illustrate, Figure 3 shows a more detailed analysis of the empirical distribution of the Figure 1(c) $z$-scores. The central part of this $z$-score distribution could perhaps be modelled by a Gaussian distribution with inflated variance – it matches such an inflated Gaussian more closely than $N(0,1)$. However, the empirical distribution has much shorter tails than the inflated Gaussian. For example, iid samples from the inflated Gaussian would be expected to yield 43 $p$-values exceeding the Bonferroni threshold, whereas in fact we observe none here. In short, the effects of pseudo-inflation are primarily in the “shoulders” of the distribution, where $z$-scores are only moderately large, and not in the tails. (Incidentally, this behavior is far more evident in the histogram of $z$-scores than in the histogram of corresponding $p$-values, and we find the $z$-score histogram generally more helpful for diagnosing potential correlation-induced distortion.)

Figure 3: Illustration that the effects of pseudo-inflation are primarily in the “shoulders” of the distribution of null $z$-scores, and not in the tails. Panels (a-b): Histograms of correlated $z$-scores (from Figure 1(c)) and their corresponding $p$-values. Note that the “shoulder-but-not-tail” inflation is evident in the histogram of $z$-scores (a) but not in the oft-used histogram of $p$-values (b). Panels (c-e): Comparison of the empirical CDF of the correlated $z$-scores with the CDFs of $N(0,1)$ and an inflated Gaussian. The $z$-score distribution is closer to the inflated Gaussian in the center, but closer to $N(0,1)$ in the tails. Panels (f-h): Comparison of the correlated $p$-values with $p$-values obtained from iid $N(0,1)$ and inflated-Gaussian $z$-scores. The number of moderate correlated $p$-values is closer to that from the inflated Gaussian, but the number in the extreme tail (e.g. clearing the Bonferroni or universal thresholds) is closer to that from $N(0,1)$.

With hindsight, this shoulder-but-not-tail inflation pattern should perhaps be expected. If one views the effect of correlation as reducing the effective sample size, then the number of extreme values in a sample with a smaller effective sample size should indeed be smaller. There are also relevant discussions of “asymptotic independence” in extreme value theory (Sibuya, 1960; De Carvalho and Ramos, 2012). However, this property of pseudo-inflation does suggest that using a Gaussian to describe correlation-induced distortion, as in locfdr, is not ideal (more discussion in Section 4).

2.3 Empirical distribution of correlated random variables

We now summarize an elegant result of Schwartzman (2010), which characterizes the empirical distribution of a large number of correlated $z$-scores. This result plays a key role in our work.

On notation: let $\varphi$ denote the PDF of $N(0,1)$, and $\varphi^{(k)}$ denote the $k$-th derivative of $\varphi$. We refer to the collection of functions $\varphi_k := (-1)^k \varphi^{(k)} / \sqrt{k!}$, $k = 1, 2, \dots$, as the (standardized) Gaussian derivatives. (Here “standardized” means that they are scaled to be orthonormal with respect to the weight function $1/\varphi$.)
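The standardized Gaussian derivatives can be evaluated via the probabilists' Hermite polynomials, using the classical identity $\varphi^{(k)}(z) = (-1)^k \mathrm{He}_k(z)\varphi(z)$, so that $\varphi_k(z) = \mathrm{He}_k(z)\varphi(z)/\sqrt{k!}$. A small Python sketch, which also checks the orthonormality claim numerically:

```python
import numpy as np
from numpy.polynomial import hermite_e
from scipy.integrate import quad
from math import factorial

def phi(z):
    """Standard Gaussian density."""
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def phi_k(z, k):
    """k-th standardized Gaussian derivative: (-1)^k phi^{(k)}(z) / sqrt(k!).

    Uses phi^{(k)}(z) = (-1)^k He_k(z) phi(z), with He_k the probabilists'
    Hermite polynomial, so phi_k(z) = He_k(z) phi(z) / sqrt(k!).
    """
    coef = np.zeros(k + 1); coef[k] = 1.0   # coefficient vector selecting He_k
    return hermite_e.hermeval(z, coef) * phi(z) / np.sqrt(factorial(k))

# Orthonormality with respect to the weight 1/phi:
#   integral of phi_j(z) phi_k(z) / phi(z) dz = 1 if j == k, else 0
for j in range(1, 5):
    for k in range(1, 5):
        val, _ = quad(lambda z: phi_k(z, j) * phi_k(z, k) / phi(z), -10, 10)
        print(j, k, round(val, 6))
```

This orthonormality is what makes the $\varphi_k$ a convenient basis for expanding deviations of an empirical distribution from $\varphi$.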

Let $z_1, \dots, z_n$ be identically distributed $N(0,1)$, but not necessarily independent, random variables. Let $\hat F_n$ denote their empirical CDF:

$$\hat F_n(z) := \frac{1}{n} \sum_{j=1}^n \mathbb{1}[z_j \le z],$$

where the indicator function $\mathbb{1}[A]$ is 1 if $A$ holds and 0 otherwise. Since $z_1, \dots, z_n$ are random variables, $\hat F_n$ is a random function on $\mathbb{R}$. Also, because the $z_j$ are marginally $N(0,1)$, the expectation of $\hat F_n(z)$ is $\Phi(z)$, the CDF of $N(0,1)$.

Schwartzman (2010) studies the distribution of $\hat F_n$, and how its deviation from the expectation $\Phi$ depends on the correlations among $z_1, \dots, z_n$. Specifically, assuming that each pair $(z_i, z_j)$ is bivariate normal with correlation $\rho_{ij}$ (which is weaker than the common assumption that $z_1, \dots, z_n$ are jointly multivariate normal), Schwartzman (2010) provides the following representation of $\hat F_n$ when $n$ is large:

$$\hat F_n(z) \approx \Phi(z) + \sum_{k=1}^\infty W_k \, \frac{(-1)^k \varphi^{(k-1)}(z)}{\sqrt{k!}},$$

where $W_1, W_2, \dots$ are uncorrelated random variables with $E(W_k) = 0$, and

$$\mathrm{Var}(W_k) = \bar\alpha_k + O(1/n), \qquad \bar\alpha_k := \frac{2}{n(n-1)} \sum_{i < j} \rho_{ij}^k.$$

Although uncorrelated, $W_1, W_2, \dots$ are not independent; they must have higher-order dependence to guarantee that $\hat F_n$ is non-decreasing. Also, here we assume $\bar\alpha_k \ge 0$ for all odd $k$ (this holds automatically for even $k$). Note that this assumption should not be too demanding for large $n$ in practice (Schwartzman, 2010; also see Appendix E).

Since $\hat F_n$ is a CDF, its derivative defines a corresponding density:

$$f(z) = \varphi(z) + \sum_{k=1}^\infty W_k \, \varphi_k(z). \tag{7}$$

Intuitively, (7) characterizes how the (limiting) empirical distribution (histogram) of $z_1, \dots, z_n$ is likely to randomly deviate from the expectation $\varphi$, using the standardized Gaussian derivatives $\varphi_k$ as basis functions.

The representation (7) is crucial to our work here, and provides some remarkable insights. We highlight particularly the following:

  1. The expected deviations of $\hat F_n$ from $\Phi$ are determined by the variances of the coefficients $W_k$, which in turn are determined by the mean and higher moments of the pairwise correlations, $\bar\alpha_k$. In the special case where $z_1, \dots, z_n$ are independent, all these terms are $O(1/n)$, and $\hat F_n \to \Phi$.

  2. Following from 1, to create a tangible deviation from $\Phi$, $\bar\alpha_k$ must be non-negligible (for some $k$). This requires pervasive, but not necessarily strong, pairwise correlations. For example, pervasive correlations occur if there is an underlying low-rank structure in the data, where all $z_j$ are influenced by a small number of underlying random factors, and so are all correlated with one another. In this case $\bar\alpha_k$ will be non-negligible, and $\hat F_n$ may deviate from $\Phi$. In contrast, there can exist very strong pairwise correlations with negligible effect on $\hat F_n$. For example, suppose $n$ is even, and let $z_1, \dots, z_n$ be in pairs, with each pair having correlation one but different pairs being independent. The histogram of $z_1, \dots, z_n$ will look very much like $\varphi$, because $\bar\alpha_k = 1/(n-1) \approx 0$ for large $n$. In other words, not all kinds of correlations, even large ones, distort the empirical distribution of $z_1, \dots, z_n$.

  3. Barring special cases such as $\rho_{ij} \equiv 1$, the moments $\bar\alpha_k$, and hence the expected magnitudes of $W_k$, will decay quickly with $k$. Consequently the sum in (7) will typically be dominated by the first few terms, and the shapes of the first few basis functions will determine the typical deviation of $f$ from $\varphi$. Of the first four basis functions (Figure 4), the 2nd and 4th correspond to pseudo-inflation or pseudo-deflation in the shoulders of $\varphi$, depending on the signs of their coefficients, whereas the 1st and 3rd correspond to mean shift and skewness. This explains the empirical observation that correlation-induced pseudo-inflation tends to focus in the shoulders, and not the tails. (Also see Appendix B.)
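The contrast in point 2 — pervasive weak correlation versus strong but sparse correlation — is easy to reproduce in simulation. The following Python sketch (an illustration with made-up loadings, not the paper's simulation framework) draws marginally $N(0,1)$ samples under both designs and compares how much the empirical standard deviation fluctuates from draw to draw:

```python
import numpy as np

rng = np.random.default_rng(0)
n, ndraw = 10000, 200

def draw_factor(n):
    """Pervasive correlation: one latent factor hits every z_j, with
    random-sign loadings, so pairwise correlations are +/- 0.49 and the
    average squared correlation (alpha_2) is non-negligible."""
    f = rng.normal()                              # shared random factor
    a = rng.choice([-0.7, 0.7], size=n)           # loadings of +/- 0.7
    return a * f + np.sqrt(1 - 0.7**2) * rng.normal(size=n)

def draw_pairs(n):
    """Strong but sparse correlation: z's come in perfectly correlated
    pairs, so alpha_k = 1/(n-1) is negligible for large n."""
    return np.repeat(rng.normal(size=n // 2), 2)  # rho = 1 within pairs

sd_factor = np.array([draw_factor(n).std() for _ in range(ndraw)])
sd_pairs = np.array([draw_pairs(n).std() for _ in range(ndraw)])
# Paired draws have empirical SD ~= 1 every time; the factor draws swing
# between pseudo-deflation (|f| < 1) and pseudo-inflation (|f| > 1)
print(sd_pairs.std(), sd_factor.std())
```

Under the paired design the histogram looks like $\varphi$ in every draw, while under the low-rank design the empirical spread varies widely across draws, exactly as points 2-3 predict.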

Figure 4: Illustration of the standard Gaussian density $\varphi$ and its standardized derivatives. The left panel shows $\varphi$ and its first four standardized derivatives. The 2nd and 4th derivatives correspond to pseudo-inflation or pseudo-deflation in the shoulders; the 1st and 3rd derivatives correspond to mean shift or skewness. The right panel shows $\varphi$ and its 7th-10th standardized derivatives. Even for these higher-order derivatives, the tails are short, implying that correlation-induced distortion is unlikely to have long tails.

In discussing Efron (2010a), Schwartzman (2010) used this result to argue that “a wide unimodal histogram (of $z$-scores) may be indication of the presence of true signal, rather than an artifact of correlation.” Specifically, by discarding higher-order terms in (7), he found that the largest central spread (standard deviation) for which $f$ in (7) remains a unimodal density is approximately 1.3. Along similar lines, we can show (Appendix B) that the standard deviation for which $f$ can be a Gaussian density is similarly bounded. The key point here is that the effects of correlation are different from the effects of true signals, so the two can (often) be separated. Our methods here are designed to do exactly that.

3 Empirical Bayes Normal Means with Correlated Noise

3.1 The Exchangeable Correlated Noise model

As a first step towards allowing for correlated noise in the EBNM problem, we develop methods to fit the representation (7) to correlated null $z$-scores. We do this by treating $z_1, \dots, z_n$ as conditionally iid samples from $f$ in (7), parameterized by $w = (w_1, w_2, \dots)$, which are realizations of $(W_1, W_2, \dots)$:

$$z_1, \dots, z_n \mid w \; \overset{\text{iid}}{\sim} \; f_w, \qquad f_w(z) := \varphi(z) + \sum_k w_k \, \varphi_k(z). \tag{8}$$

It may seem perverse to model correlated random variables as conditionally iid. However, this treatment can be motivated by assuming $z_1, \dots, z_n$ are exchangeable and appealing to de Finetti’s representation theorem (De Finetti, 1937), which says that (infinitely) exchangeable random variables can be represented as being conditionally iid from their empirical distribution. We therefore refer to the model (8) as the exchangeable correlated noise (ECN) model. We also refer to $f_w$ as the correlated noise distribution.

To fit the ECN model (8) with observed $z_1, \dots, z_n$, we estimate $w$, essentially by maximum likelihood, but with a couple of additional complications that we now describe. First, since $f_w$ is a density, we must constrain the parameters $w$ to ensure that $f_w$ is non-negative (note that (8) integrates to one for any $w$, but is not guaranteed to be non-negative). Ideally $f_w$ should be non-negative on the whole real line, but this constraint is difficult to work with, so we approximate it using a discrete approximation: we constrain $f_w(z) \ge 0$ for $z$ on a fine grid spanning the range of the data, in addition to $f_w(z_j) \ge 0$ for all $j$.

Second, to incorporate the prior expectation that the absolute value of $w_k$ should decay quickly with $k$ (because $\mathrm{Var}(W_k) \approx \bar\alpha_k$ decays with $k$), we introduce a penalty on $w$,

$$h_\lambda(w) := \sum_k \lambda_k |w_k|, \tag{9}$$

where we take the penalty parameters to be

$$\lambda_k = \lambda / \bar\rho^{\,k}, \tag{10}$$

where $\lambda$ represents a common penalty, and $\bar\rho \in (0, 1)$ some notion of average pairwise correlation. For computational convenience we use only the first $K = 10$ Gaussian derivatives (see Figure 4 for the 7th-10th standardized Gaussian derivatives) and set $w_k = 0$ for $k > K$. (Recall that $\mathrm{Var}(W_k) \approx \bar\alpha_k$, so the realization of $W_k$ will generally be negligible in practice for $k > K$.) Of course a full Bayesian treatment would attempt to account for uncertainty in $w$; in ignoring that here we are making the usual EB compromise.

In numerical simulations we experimented with different combinations of $\lambda$ and $\bar\rho$, and found that moderate values performed well in a variety of situations, although results were not very sensitive to the choice of $\lambda$ and $\bar\rho$. All results in this paper were obtained with the same default choice.

In summary, we estimate $w$ by solving:

$$\hat w = \arg\max_w \; \sum_{j=1}^n \log f_w(z_j) \; - \; h_\lambda(w), \quad \text{subject to the non-negativity constraints on } f_w. \tag{11}$$

This is a convex optimization problem and can be solved efficiently and stably using an interior point method; we implemented this using the R package Rmosek to interface to the MOSEK commercial solver (MOSEK ApS, 2018). For the sample sizes considered in this paper, the problem is solved on average within 0.50 seconds on a personal computer (Apple iMac, 3.2 GHz, Intel Core i5).
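To show the structure of (11) concretely, here is a small Python sketch of the penalized fit. It is an illustration only: the penalty weights grow geometrically in $k$ as an assumed stand-in for the paper's penalty, $K = 6$ rather than 10 is used, and scipy's SLSQP solver (with the split $w = u - v$, $u, v \ge 0$, which keeps the $|w_k|$ penalty linear) replaces the interior point method used by cashr.

```python
import numpy as np
from numpy.polynomial import hermite_e
from math import factorial
from scipy.optimize import minimize

def fit_ecn(z, K=6, lam=1.0, decay=2.0):
    """Sketch of fitting the ECN model by penalized maximum likelihood.

    f_w(z) = phi(z) * (1 + sum_k w_k He_k(z)/sqrt(k!)).  Maximize the
    log-likelihood minus sum_k lam_k |w_k| (lam_k grows geometrically in k,
    an assumed stand-in), subject to f_w >= 0 on a fine grid.
    """
    def basis(x):                                  # column k-1 holds He_k(x)/sqrt(k!)
        return np.column_stack([
            hermite_e.hermeval(x, np.eye(K + 1)[k]) / np.sqrt(factorial(k))
            for k in range(1, K + 1)])

    B_obs, B_grid = basis(z), basis(np.linspace(-6, 6, 121))
    lam_k = lam * decay ** np.arange(1, K + 1)     # heavier penalty, higher orders

    def objective(uv):
        w = uv[:K] - uv[K:]
        ratio = np.maximum(1.0 + B_obs @ w, 1e-12)  # f_w(z_j) / phi(z_j)
        return -np.sum(np.log(ratio)) + lam_k @ (uv[:K] + uv[K:])

    cons = {"type": "ineq",
            "fun": lambda uv: 1.0 + B_grid @ (uv[:K] - uv[K:])}  # f_w >= 0 on grid
    res = minimize(objective, np.zeros(2 * K), method="SLSQP",
                   bounds=[(0, None)] * (2 * K), constraints=[cons],
                   options={"maxiter": 300, "ftol": 1e-9})
    return res.x[:K] - res.x[K:]

# Pseudo-inflated correlated null z-scores: one realized latent factor
rng = np.random.default_rng(2)
n = 10000
a = rng.choice([-0.7, 0.7], size=n)                # random-sign loadings
z = a * 1.5 + np.sqrt(1 - 0.7**2) * rng.normal(size=n)  # realized factor = 1.5
w_hat = fit_ecn(z)
print(np.round(w_hat, 3))
```

Because this realized data set is pseudo-inflated, the fitted 2nd-order coefficient comes out clearly positive, matching the interpretation of the even derivatives in Section 2.3.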

Figure 1 shows the fitted distributions $f_{\hat w}$ from the ECN model for the four illustrative sets of correlated null $z$-scores.

3.2 The EBNM model with correlated noise

To allow for correlated noise in the EBNM problem, we combine the standard EBNM model (2)-(3) with the ECN model (8):

$$x_j = \theta_j + s_j z_j, \quad j = 1, \dots, n, \tag{12}$$
$$\theta_1, \dots, \theta_n \mid g \; \overset{\text{iid}}{\sim} \; g, \tag{13}$$
$$z_1, \dots, z_n \mid w \; \overset{\text{iid}}{\sim} \; f_w. \tag{14}$$

Note that in this model the observations $x_1, \dots, x_n$ are conditionally independent given $g$ and $w$.

Following Stephens (2017) we model the prior distribution $g$ by a finite mixture of zero-mean Gaussians:

$$g(\cdot) = \pi_0 \, \delta_0(\cdot) + \sum_{l=1}^L \pi_l \, N(\cdot \,; 0, \sigma_l^2), \tag{15}$$

where $\delta_0$ denotes a point mass at zero and $\pi_0$ is the null proportion. Here the mixture proportions $\pi_0, \dots, \pi_L$ are non-negative and sum to 1, and are to be estimated, whereas the component standard deviations $\sigma_1 < \dots < \sigma_L$ are a fixed, pre-specified grid of values. By using a sufficiently wide and dense grid of standard deviations this finite mixture can approximate, to arbitrary accuracy, any scale mixture of zero-mean Gaussians.

The marginal log-likelihood for $(\pi, w)$, integrating out $\theta_1, \dots, \theta_n$ and $z_1, \dots, z_n$, is given by the following Theorem.

Theorem 1.

Combining (12)-(15), the marginal log-likelihood of $(\pi, w)$ is

$$\log L(\pi, w) = \sum_{j=1}^n \log \left[ \sum_{l=0}^L \frac{\pi_l}{\sqrt{s_j^2 + \sigma_l^2}} \sum_{k=0}^K w_k \left( \frac{s_j}{\sqrt{s_j^2 + \sigma_l^2}} \right)^{\!k} \varphi_k\!\left( \frac{x_j}{\sqrt{s_j^2 + \sigma_l^2}} \right) \right], \tag{16}$$

where $\sigma_0 := 0$, $w_0 := 1$, and $\varphi_0 := \varphi$.

Proof. See Appendix C. ∎
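The key identity behind the likelihood — that convolving a $N(0, \sigma_l^2)$ prior component with the scaled Gaussian-derivative noise terms again yields Gaussian-derivative terms at the combined scale $t = \sqrt{s_j^2 + \sigma_l^2}$, with each $k$-th order term damped by $(s_j/t)^k$ — can be checked numerically. A Python sketch with arbitrary illustrative values of $x$, $s$, $\sigma$, and $w$:

```python
import numpy as np
from numpy.polynomial import hermite_e
from math import factorial
from scipy.integrate import quad

def phi(z):
    return np.exp(-np.asarray(z)**2 / 2) / np.sqrt(2 * np.pi)

def phi_k(z, k):
    """Standardized Gaussian derivative phi_k = He_k * phi / sqrt(k!); phi_0 = phi."""
    c = np.zeros(k + 1); c[k] = 1.0
    return hermite_e.hermeval(z, c) * phi(z) / np.sqrt(factorial(k))

def closed_form(x, s, sigma, w):
    """Closed-form component likelihood: with t^2 = s^2 + sigma^2,
    (1/t) * sum_k w_k (s/t)^k phi_k(x/t), where w_0 = 1."""
    t = np.sqrt(s**2 + sigma**2)
    return sum(wk * (s / t)**k * phi_k(x / t, k) / t for k, wk in enumerate(w))

def numeric(x, s, sigma, w):
    """Same quantity by direct numerical integration over theta:
    integral of N(theta; 0, sigma^2) * (1/s) f_w((x - theta)/s) d theta."""
    def integrand(theta):
        fw = sum(wk * phi_k((x - theta) / s, k) for k, wk in enumerate(w))
        return phi(theta / sigma) / sigma * fw / s
    val, _ = quad(integrand, -30, 30)
    return val

w = [1.0, 0.0, 0.3, 0.0, -0.05]       # w_0 = 1, some 2nd/4th-order distortion
a, b = closed_form(1.3, 0.8, 1.1, w), numeric(1.3, 0.8, 1.1, w)
print(a, b)
```

With $w_k = 0$ for $k \ge 1$ the closed form reduces to the familiar $N(x; 0, s^2 + \sigma^2)$ marginal of the independent-noise EBNM model, which is a useful sanity check on (16).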

3.3 Fitting the model

Following the usual EB approach, we fit the model (12)-(15) in two steps: first estimating $(g, f_w)$ by estimating $(\pi, w)$, and then basing inference for $\theta$ on the (estimated) posterior distribution $p(\theta \mid x, \hat\pi, \hat w)$. Note that under the model (12)-(15), $\theta_1, \dots, \theta_n$ are conditionally independent given $(x, \pi, w)$, so this posterior distribution factorizes, and is determined by its marginal distributions $p(\theta_j \mid x_j, \hat\pi, \hat w)$. The intuition here is that, under the exchangeability assumption, the effects of correlation are captured entirely by the (realized) correlated noise distribution $f_w$. Once this distribution is estimated, the inferences for each $\theta_j$ become independent, just as in the standard EBNM problem.

The usual EBNM approach to estimating $(\pi, w)$ would be to maximize the likelihood (16). Here we modify this approach using maximum penalized likelihood. Specifically we use the penalty $h_\lambda(w)$ on $w$ as in (9), and the penalty on $\pi$ used by Stephens (2017) to encourage conservative (over-)estimation of the null proportion $\pi_0$ (to induce conservative estimation of false discovery rates). Thus, we solve

$$(\hat\pi, \hat w) = \arg\max_{\pi, w} \; \log L(\pi, w) + \sum_{l=0}^L (\lambda_l^\pi - 1) \log \pi_l - h_\lambda(w), \tag{18}$$

subject to the constraints

$$\pi_l \ge 0, \quad \sum_{l=0}^L \pi_l = 1, \quad f_w(z) \ge 0 \text{ for } z \text{ on a fine grid}. \tag{21}$$

In (21) we used the same device as in (11) to capture non-negativity of $f_w$. We set $\lambda_k$ as in (10), use only the first $K = 10$ Gaussian derivatives, and set $\lambda_0^\pi = 10$ and $\lambda_l^\pi = 1$ for $l \ge 1$, as in Stephens (2017).

Problem (18) is biconvex. That is, given a feasible $w$, the optimization over $\pi$ is convex; and given a feasible $\pi$, the optimization over $w$ is convex. The optimization over $\pi$ can be solved using the EM algorithm, or more efficiently using convex optimization methods (Koenker and Mizera, 2014; Koenker and Gu, 2017; Kim et al., 2018). To optimize over $w$ we use the same approach as in solving (11). To solve (18) we simply iterate between these two steps until convergence.

3.4 Posterior calculations

For each $j$, the posterior distribution $p(\theta_j \mid x_j, \hat\pi, \hat w)$ is, by Bayes Theorem, given by

$$p(\theta_j \mid x_j, \hat\pi, \hat w) = \frac{\hat g(\theta_j) \, \frac{1}{s_j} f_{\hat w}\!\left( \frac{x_j - \theta_j}{s_j} \right)}{\int \hat g(\theta) \, \frac{1}{s_j} f_{\hat w}\!\left( \frac{x_j - \theta}{s_j} \right) d\theta}.$$
Despite the somewhat complex form, some important functionals of this posterior distribution are analytically available.

  1. The posterior mean for $\theta_j$ is

$$E(\theta_j \mid x_j, \hat\pi, \hat w) = \int \theta \, p(\theta \mid x_j, \hat\pi, \hat w) \, d\theta,$$

    which is available in closed form as a finite sum over the mixture components $l$ and the Gaussian-derivative terms $k$.

  2. The local FDR (lfdr; Efron, 2008) is

$$\mathrm{lfdr}_j := \Pr(\theta_j = 0 \mid x_j, \hat\pi, \hat w).$$

    From this, the FDR of any discovery set $\Gamma \subseteq \{1, \dots, n\}$ can be estimated as

$$\widehat{\mathrm{FDR}}(\Gamma) = \frac{1}{|\Gamma|} \sum_{j \in \Gamma} \mathrm{lfdr}_j,$$

    where $|\Gamma|$ denotes the number of elements in $\Gamma$. Storey’s $q$-value (Storey, 2003) for each $j$ is defined as $q_j := \widehat{\mathrm{FDR}}(\Gamma_j)$, where $\Gamma_j := \{i : \mathrm{lfdr}_i \le \mathrm{lfdr}_j\}$.

  3. Stephens (2017) introduced the term “local false sign rate” (lfsr) to refer to the probability of getting the sign of an effect wrong, as well as the false sign rate (FSR) and the $s$-value, analogous to the FDR and the $q$-value, respectively. Making statistical inference about the sign of a parameter, rather than solely focusing on whether the parameter is zero or not, was also discussed in Tukey (1991) and Gelman et al. (2012). The value of $\mathrm{lfsr}_j$ is defined as

$$\mathrm{lfsr}_j := \min\left\{ \Pr(\theta_j \ge 0 \mid x_j, \hat\pi, \hat w), \; \Pr(\theta_j \le 0 \mid x_j, \hat\pi, \hat w) \right\},$$

    which is easily calculated from $\mathrm{lfdr}_j$ and the posterior probabilities of positive and negative $\theta_j$. The FSR and $s$-value are estimated and defined similarly to the FDR and $q$-value as

$$\widehat{\mathrm{FSR}}(\Gamma) = \frac{1}{|\Gamma|} \sum_{j \in \Gamma} \mathrm{lfsr}_j, \qquad s_j := \widehat{\mathrm{FSR}}\left( \{ i : \mathrm{lfsr}_i \le \mathrm{lfsr}_j \} \right).$$
3.5 Software

We implemented both the fitting procedure and the posterior calculations in an R package, cashr, which is available online. For the data sets considered in this paper, model fitting and posterior calculations together take on average about 6 seconds on a personal computer (Apple iMac, 3.2 GHz, Intel Core i5).

4 Numerical Results

We now empirically assess the performance of cashr on both simulated and real data. We focus our assessments on the “multiple testing” setting where $\theta$ is sparse and the main goal is to identify “significant” non-zero elements $\theta_j$. This problem can be tackled using EB methods (Thomas et al., 1985; Greenland and Robins, 1991) and here we compare cashr with both locfdr (Efron, 2005), which attempts to capture the effects of correlation through the empirical null strategy discussed in Section 2.2, and ashr (Stephens, 2017), which fits the same EBNM model as cashr but without allowing for correlation – i.e. ashr is equivalent to setting $w = 0$ in (14). Multiple testing can also be tackled by attempting to control the FDR in the frequentist sense, and so we also compare with the Benjamini-Hochberg procedure (BH; Benjamini and Hochberg, 1995) and qvalue (Storey, 2002, 2003). One advantage of the EBNM approach to multiple testing is that it can provide not only FDR assessments, but also point estimates and interval estimates for the effects (Stephens, 2017). However, to keep our comparisons simple we focus here only on FDR assessments.

4.1 Realistic simulation with gene expression data

We constructed synthetic data with realistic correlation structure using the simulation framework in Section 2.1. The data are simulated according to the EBNM with correlated noise model (2)-(3) as follows.

  • The normal means $\theta_j$ are iid samples from

$$g = \pi_0 \, \delta_0 + (1 - \pi_0) \, g_1,$$

    for six choices of the non-null distribution $g_1$ and three choices of the null proportion $\pi_0$ (Figure 5). The density functions of these six choices of $g_1$ and other simulation details are in Appendix D.

  • To make the correlation structure among the noise realistic, in each simulation $z_1, \dots, z_n$ are simulated from real gene expression data as in Section 2.1.

  • The standard deviations $s_j$ are also simulated from real gene expression data using the same pipeline, and are then rescaled.

  • The observations are constructed as $x_j = \theta_j + s_j z_j$, $j = 1, \dots, n$.

In each simulated data set, this framework generates $n$ correlated observations $x_j$ of respective normal means $\theta_j$ with corresponding standard deviations $s_j$. The data $(x_j, s_j)$ are made available to each method, while the effects $\theta_j$ are withheld. The analysis goal is to identify which $\theta_j$ are significantly different from 0. We applied each method to formulate a discovery set at nominal FDR 0.1, and calculated the empirical false discovery proportion (FDP) for each discovery set. We ran 1000 simulations for each choice of $g_1$, divided evenly among the three choices of $\pi_0$.
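The evaluation metrics used throughout this section — the empirical false discovery proportion (FDP) and the true discovery proportion (TDP, a power proxy) — are simple to compute once the true effects are revealed; a minimal sketch:

```python
import numpy as np

def fdp_tdp(selected, theta):
    """Empirical false/true discovery proportions for one discovery set.

    FDP: fraction of selected j with theta_j == 0 (0 if nothing selected).
    TDP: fraction of all non-zero theta_j that are selected.
    """
    selected = np.asarray(selected, bool)
    null = (np.asarray(theta) == 0)
    n_sel = selected.sum()
    fdp = (selected & null).sum() / max(n_sel, 1)
    tdp = (selected & ~null).sum() / max((~null).sum(), 1)
    return fdp, tdp

theta = np.array([0, 0, 2.5, -1.7, 0, 3.1])               # true effects
selected = np.array([True, False, True, True, False, False])
print(fdp_tdp(selected, theta))
```

Note that FDP and TDP are properties of a single realized data set; a method controls FDR in the frequentist sense when its FDP, averaged over many such data sets, stays at or below the nominal level.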

Figure 5: Illustration that cashr outperforms other methods in producing discovery sets whose FDP are consistently close to the nominal FDR, while maintaining good statistical power. Simulation results are shown for six different non-null distributions $g_1$ (panel (a)) and three different values of the null proportion $\pi_0$, stratified by method. Panel (b): Comparison of the distribution of FDP, summarized as boxplots on square-root scale. The boxplots show the mean (cross), median (line), inter-quartile range (box), and 5th and 95th percentiles (whiskers). Panel (c): Comparison of the root MSE of FDP from the nominal FDR of 0.1, defined as $\sqrt{\mathrm{mean}\left[(\mathrm{FDP} - 0.1)^2\right]}$. In all scenarios the distribution of FDP for cashr is more concentrated near the nominal level than for the other methods; in particular, the root MSE of FDP for cashr is uniformly lower. Panel (d): Comparison of the mean TDP, as an indication of statistical power. On average, cashr maintains good power, worse only than ashr in some scenarios – and ashr sometimes finds more true signals at the cost of severely losing control of FDP.

Figure 5 compares the performance of each method in these simulations. Our first result is that, despite the presence of correlation, most of the methods control FDR in the usual frequentist sense under most scenarios: that is, the mean FDP is usually below the nominal level of 0.1. Indeed, BH is notable in never showing a mean FDP exceeding the nominal level, even though, as far as we are aware, no known theory guarantees this under the realistic patterns of correlation used here (Benjamini and Yekutieli (2001) give relevant theoretical results under more restrictive assumptions on the correlation). The method most prone to lose control is ashr, but even its mean FDP is never above 0.2.

However, despite this frequentist control of FDR, for most methods the FDP for individual data sets can often lie far from the nominal level (see also, for example, Owen, 2005; Qiu et al., 2005; Blanchard and Roquain, 2009; Friguet et al., 2009). Arguably, then, frequentist control of FDR is insufficient in practice, since we desire – as far as possible – to make sound statistical inferences for each data set. That is, we might consider a method to perform well if its FDP is consistently close to the nominal level, rather than merely close on average. By this criterion, cashr consistently outperforms the other methods (Figure 5): it provides uniformly lower root MSE of FDP from the nominal FDR, and the whiskers in its boxplots (indicating 5th and 95th percentiles) are narrower. Along with FDP, Figure 5 also shows the empirical true discovery proportion (TDP), defined as the proportion of true discoveries out of all non-zero $\theta_j$, as an indication of statistical power. cashr maintains good power in that it produces higher TDP than most methods in most scenarios. In some scenarios, ashr finds more true signals than cashr, but at the cost of severely losing control of FDP.

We note that cashr performs well even in settings that do not fully satisfy its underlying assumptions (e.g. where the non-null effect distribution is asymmetric or multimodal). Note also that some of our simulation settings are highly sparse in signal, since a large portion of the non-zero effects are close to zero; in such settings only a small number of effects are expected to be large, and it is understandable that no method performs particularly well. But even in these difficult settings, although first impressions from the plot may be that cashr and BH perform similarly, closer inspection shows cashr to be better, in that its median FDP tends to be closer to the nominal level.

Figure 6: Illustration that cashr consistently produces reliable FDP under different types of correlation-induced distortion. Here we take the results from a single simulation scenario (Gaussian non-null effects) and stratify them into three groups of equal size according to the sample standard deviations of the realized correlated noise. Methods that ignore correlations among observations (BH, qvalue, ashr) are generally too conservative under pseudo-deflation and too anti-conservative under pseudo-inflation; locfdr tends to be too conservative under pseudo-inflation and consequently loses power; cashr maintains good FDR control in all settings. The boxplots show the mean (cross), median (line), inter-quartile range (box), and 5th and 95th percentiles (whiskers). FDP are plotted on a square-root scale. Other simulation settings give qualitatively similar results (not shown here).

The reason that cashr produces more consistently reliable FDP is that, by design, it adapts to the particular correlation-induced distortion present in each data set. As illustrated in Figure 1, correlation can lead to pseudo-inflation in some data sets and pseudo-deflation in others. cashr is able to recognize which pattern is present and modify its behavior accordingly, becoming more conservative in the former case and less conservative in the latter. This is illustrated in Figure 6, which stratifies the realized data sets according to the sample standard deviation of the realized correlated noise in each data set (for the scenario with Gaussian non-null effects). The data sets with the smallest sample standard deviations are categorized as pseudo-deflation, those with the largest as pseudo-inflation, and the rest as "in-between."

For data sets that show no strong distortion ("in-between"), all methods give similar and reasonable results, with cashr showing only a small improvement. However, when the noise is pseudo-inflated, methods ignoring correlation, such as BH, qvalue, and ashr, tend to be anti-conservative; that is, they form discovery sets whose FDP are often much larger than the nominal FDR. In contrast, cashr produces conservative FDP near the nominal value, while locfdr is too conservative and consequently loses substantial power (discussed further in Section 4.2). Conversely, under pseudo-deflation, methods ignoring correlation are too conservative, producing FDP much smaller than the nominal FDR and losing power compared with cashr and locfdr.
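The stratification used in Figure 6 can be sketched as follows. The one-factor noise construction here is ours, chosen only because it is marginally N(0,1) yet produces data sets whose sample standard deviations fluctuate (pseudo-inflation when the shared factor draw is large, pseudo-deflation when it is small); it is not the paper's simulation design:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, n_datasets = 5000, 0.3, 300
a = rng.choice([-1.0, 1.0], size=n)  # fixed +/-1 factor loadings

sds = []
for _ in range(n_datasets):
    u = rng.normal()  # shared factor draw inducing correlation
    # marginally N(0,1) noise with pairwise correlations +/-rho
    z = np.sqrt(1 - rho) * rng.normal(size=n) + np.sqrt(rho) * u * a
    sds.append(z.std())  # sample SD of the realized correlated noise

sds = np.asarray(sds)
lo, hi = np.quantile(sds, [1 / 3, 2 / 3])
group = np.where(sds < lo, "pseudo-deflation",
                 np.where(sds > hi, "pseudo-inflation", "in-between"))
```

Each simulated data set lands in one of the three strata according to how inflated or deflated its realized noise happens to be, mirroring the grouping in Figure 6.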

4.2 Real data illustrations

We now use two real data examples to illustrate some of the features of cashr (and other methods) that we observed in simulated data. The first example is a well-studied data set from a leukemia study (Golub et al., 1999), comparing gene expression in acute myeloid leukemia vs acute lymphoblastic leukemia samples, which was discussed extensively in Efron (2010a) as a prime example of how correlation can distort empirical distributions. The second example comes from a study on embryonic mouse hearts (Smemo, 2012), comparing gene expression in 2 left ventricle samples vs 2 right ventricle samples. (The number of samples is small, but each sample is a pool of ventricles from 150 mice – necessary to obtain sufficient tissue for the experiments to work well – and so this experiment involved dissection of 300 mouse hearts.)

For each data set, the effect of interest for each gene is the true log-fold change in gene expression between the two groups. We use a standard analysis protocol (based on Smyth (2004); see Appendix D for details) to obtain an effect estimate and a corresponding p-value for each gene. As in Section 2.1, we convert each p-value to the corresponding z-score and use it to compute an effective standard deviation.
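A minimal sketch of this conversion (the function name is ours): a two-sided p-value is mapped to a z-score carrying the sign of the effect estimate, and the effective standard deviation is then the estimate divided by the z-score, so that estimate/SD reproduces the z-score exactly:

```python
import numpy as np
from scipy.stats import norm

def p_to_z_se(beta_hat, p):
    """Convert a two-sided p-value into a signed z-score, then back out an
    effective standard error s such that beta_hat / s = z.
    Assumes beta_hat != 0 and 0 < p < 1."""
    beta_hat = np.asarray(beta_hat, float)
    p = np.asarray(p, float)
    z = -np.sign(beta_hat) * norm.ppf(p / 2)  # |z| = Phi^{-1}(1 - p/2)
    return z, beta_hat / z
```

For example, a positive estimate with p = 0.05 maps to z close to 1.96, and the implied standard error is the estimate divided by 1.96.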

Figure 7 shows the empirical distribution of the z-scores for each data set, together with the fitted correlated noise distribution from cashr and the fitted empirical null from locfdr. In both cases the histogram is substantially more dispersed than N(0,1). However, the two data sets otherwise show quite different patterns of inflation: the leukemia data show inflation in both the shoulders and tails of the distribution, whereas the mouse data show inflation only in the shoulders. This indicates the presence of some strong signals in the leukemia data, whereas the inflation in the mouse data may be primarily pseudo-inflation caused by correlation. Consistent with this, both locfdr and cashr identify hundreds of significant signals in the leukemia data (at the nominal FDR level), but no significant signals in the mouse data (Table 1).

Figure 7: Distribution of z-scores from analyses of differential gene expression in two real data sets. In both data sets, a z-score is computed for each gene, and is N(0,1) under the null hypothesis of no differential expression. We compare the histogram of z-scores with N(0,1), the fitted correlated noise distribution from cashr, and the fitted empirical null from locfdr, each scaled by its estimated null proportion. Both histograms are substantially more dispersed than N(0,1). The mouse data show inflation primarily in the shoulders of the distribution, and the fitted correlated noise distribution from cashr appears to be a much better fit than the fitted empirical null from locfdr, particularly in the tails. The leukemia data show inflation in both the shoulders and tails of the distribution, indicating the presence of some strong signals. Although otherwise similar, the fitted correlated noise distribution from cashr has a noticeably shorter right tail than the fitted empirical null from locfdr, improving power.
                Number of discoveries
Method      Leukemia data      Mouse data
cashr             385                 0
locfdr            282                 0
BH               1579              4130
qvalue           1972              6502
ashr             3346             17191
Table 1: Numbers of discoveries from different methods at the nominal FDR level. In both data sets the z-score distributions appear to show correlation-induced inflation, and the numbers of significant discoveries declared by the methods that account for correlation (cashr and locfdr) are much smaller than those from the methods that ignore it (BH, qvalue, ashr). For the leukemia data, cashr finds more significant genes than locfdr.

Although the conclusions from cashr and locfdr are here qualitatively similar, there are some notable differences in their results. First, in the mouse data, the cashr correlated noise distribution gives, visually, a much better fit than the locfdr empirical null, particularly in the tails (Figure 7). This is because the cashr correlated noise distribution is ideally suited to capturing the "shoulder-but-not-tail" inflation pattern that is symptomatic of correlation-induced inflation. The Gaussian empirical null distribution assumed by locfdr is simply inconsistent with these data. Indeed, this inconsistency is reflected in the null proportion estimated by locfdr (1.08), which exceeds the theoretical upper bound of 1.

Second, in the leukemia data, cashr identifies 37% more significant results than locfdr (385 vs 282). This is consistent with the greater power of cashr vs locfdr in our simulations. One reason that locfdr can lose power is that its Gaussian empirical null distribution tends to overestimate inflation in the tails when it tries to fit inflation in the shoulders. We see this feature in the mouse data, and, although less obvious, it also appears in the leukemia data: the estimated standard deviation of the empirical null is 1.58, which is almost certainly too large, since a pseudo-inflated Gaussian correlated noise distribution is unlikely to have standard deviation exceeding √2 (Appendix B). In comparison, the fitted correlated noise distribution from cashr has a noticeably shorter right tail, which leads it to categorize more z-scores in the right tail as significant (Figure 7). As a side note, cashr also inherits the benefits of ashr highlighted in Stephens (2017), which can further increase power. For example, the unimodal assumption on the effects, which allows that some of the z-scores around zero may correspond to true, albeit non-significant, signals, can help improve estimation of the effect distribution, and hence improve power.

Another feature of cashr, which distinguishes it from locfdr, is that, by accounting for correlation-induced distortion, it can provide an estimate of the effect size distribution. For the mouse data, cashr estimates that about 12% of genes may be differentially expressed to some extent, although it is not able to pin down any clear example of a differentially expressed gene: no gene has an estimated local FDR less than 0.80. One possible explanation for the lack of significant results in this case is lack of power. However, the estimated effect size distribution from cashr suggests that there may simply not exist any large effects to be discovered: most of its probability mass is on very small effect sizes, corresponding to modest fold changes in gene expression. Thus the signals here, if any, are too weak to be discerned from noise and pseudo-inflation.

We also applied the other methods – BH, qvalue, and ashr – to both data sets. All three methods find very large numbers of significant results in both data sets (Table 1). Although we do not know the truth in these real data, there is a serious concern that many of these results could be false positives, since these methods are all prone to erroneously viewing pseudo-inflation as true signal (Figure 6), and Figure 7 suggests that pseudo-inflation may be present in both data sets.

5 Discussion

We have presented a general approach to accounting for correlations among observations in the widely-used Empirical Bayes Normal Means model. Our strategy exploits theoretical results from Schwartzman (2010) to model the impact of correlation on the empirical distribution of correlated variables, and convex optimization techniques to produce an efficient implementation. We demonstrated through empirical examples that this strategy can both improve estimation of the underlying distribution of true effects (Figure 2) and – in the multiple testing setting – improve estimation of FDR compared with EB methods that ignore correlation (Figures 5, 6). To the best of our knowledge, cashr is the first EBNM methodology to deal with correlated noise in this way.

Our empirical results demonstrate some benefits of the EB approach to multiple testing compared with traditional methods. In particular, cashr provides, on average, more accurate estimates of the FDP than either BH or qvalue. However, although we find these empirical results encouraging, we do not have theoretical guarantees of (frequentist) FDR control. That said, theoretical guarantees of FDR control under arbitrary correlation structure are lacking even for the widely-studied BH method. BH has been shown to control FDR under certain correlation structures (e.g. "positive regression dependence on subsets"; Benjamini and Yekutieli, 2001). The Benjamini-Yekutieli procedure (Benjamini and Yekutieli, 2001) is proven to control FDR under arbitrary dependence, but at the cost of being excessively conservative, and is consequently rarely used in practice.
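For reference, the BH step-up procedure and its Benjamini-Yekutieli variant discussed above fit in a few lines; this is a minimal sketch (the function name is ours) rather than a production implementation:

```python
import numpy as np

def bh_reject(p, alpha=0.1, yekutieli=False):
    """Benjamini-Hochberg step-up procedure. With yekutieli=True, applies
    the Benjamini-Yekutieli correction, which is valid under arbitrary
    dependence at the cost of extra conservativeness."""
    p = np.asarray(p, float)
    n = len(p)
    if yekutieli:
        alpha /= np.sum(1.0 / np.arange(1, n + 1))  # harmonic-sum penalty
    order = np.argsort(p)
    # largest k with p_(k) <= alpha * k / n; reject the k smallest p-values
    below = p[order] <= alpha * np.arange(1, n + 1) / n
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(n, dtype=bool)
    rejected[order[:k]] = True
    return rejected
```

On the toy p-values `[0.01, 0.02, 0.03, 0.5, 0.6]` at level 0.1, BH rejects the three smallest, while the Benjamini-Yekutieli correction (dividing the level by the harmonic sum) rejects none, illustrating its conservativeness.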

A key feature of cashr is that it requires no information about the actual correlations among observations. This has the important advantage that it can be applied wherever EBNM methods that ignore correlation can be applied. On the other hand, when additional information on correlations is available it clearly may be helpful to incorporate it into analyses. Within our approach such information could be used to estimate the moments of the pairwise correlations, and thus inform estimates of the coefficients of the correlated noise distribution. Alternatively, one could take a more ambitious approach: explicitly model the whole correlation matrix, and use this to help inform inference (e.g. Benjamini and Heller, 2007; Wu, 2008; Sun and Cai, 2009; Friguet et al., 2009; Fan et al., 2012). Modeling correlation is likely to provide more efficient inferences when it can be accurately achieved (Hall and Jin, 2010). However, in many situations, particularly those involving small sample sizes, reliably modeling correlation may be impossible. Under what circumstances this more ambitious approach produces better inferences could be one area for future investigation.

The main assumptions underlying cashr are that the correlated noise is marginally N(0,1), and that the standard deviations are reliably computed. In the multiple testing setting this corresponds to assuming that the test statistics are (marginally) well calibrated. If these conditions do not hold (for example, due to failure of the asymptotic theory underlying test statistic computations, or due to confounding factors such as batch effects in gene expression studies), then cashr could give unreliable results. Of course cashr is not unique in this regard: methods like BH and qvalue similarly assume that test statistics are well calibrated. Dealing with confounders in gene expression studies is an active area of research, and several approaches exist, many of them based on factor analysis (e.g. Leek and Storey, 2007; Sun et al., 2012; Gagnon-Bartsch and Speed, 2012; Wang et al., 2017; Gerard and Stephens, 2017, 2018). Again, the possibility of combining these ideas with our methods could be a future research direction.


Appendix A The marginal distribution of the simulated null z-scores

Figures A.1 and A.2 offer support for the claim that the z-scores simulated in Section 2.1 are marginally N(0,1)-distributed.

Figure A.1 compares z-scores simulated as in Section 2.1 with z-scores simulated under a modified framework that removes gene-gene correlations, and with iid N(0,1) samples. The modified framework uses exactly the same simulation and analysis pipeline as the original framework of Section 2.1, with one important difference: in each simulation, for each gene independently, we randomly selected two groups of five samples without replacement, thereby removing gene-gene correlations.

The empirical CDFs of data sets simulated as in Section 2.1 show a huge amount of variability (panel (a)), presumably due to correlations among genes. In the modified framework, the correlation-induced distortion disappears: the empirical CDFs of all data sets are almost exactly N(0,1) (panel (b)), just as with the iid N(0,1) samples (panel (c)). This demonstrates that, without gene-gene correlations, the analysis pipeline used here produces uncorrelated, well-calibrated z-scores.
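A calibration check of this kind is easy to carry out numerically. The sketch below (function name ours) compares the empirical CDF of a set of z-scores against the N(0,1) CDF and reports the Dvoretzky-Kiefer-Wolfowitz bound sqrt(log(2/alpha)/(2n)), which the maximum deviation should respect with probability at least 1 - alpha when the z-scores really are iid N(0,1):

```python
import numpy as np
from scipy.stats import norm

def dkw_check(z, alpha=0.05):
    """Maximum deviation of the empirical CDF of z from the N(0,1) CDF,
    together with the (1 - alpha) Dvoretzky-Kiefer-Wolfowitz bound.
    (As a sketch, the deviation is evaluated only at the sample points,
    using the right-continuous value of the empirical CDF.)"""
    z = np.sort(np.asarray(z, float))
    n = len(z)
    ecdf = np.arange(1, n + 1) / n
    dev = np.max(np.abs(ecdf - norm.cdf(z)))
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))
    return dev, eps

z_null = np.random.default_rng(0).normal(size=2000)  # iid N(0,1) z-scores
dev_null, eps = dkw_check(z_null)    # small deviation, within the band
dev_infl, _ = dkw_check(2 * z_null)  # inflated scores: clear violation
```

Well-calibrated z-scores stay inside the band, whereas inflated scores (here, simply doubled) violate it decisively.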

Figure A.1: Comparison of empirical CDFs of z-scores obtained by applying the same analysis pipeline to data simulated under two different frameworks: the original framework of Section 2.1, which keeps gene-gene correlations (panel (a)), and the modified framework, which removes gene-gene correlations by randomizing samples for each gene (panel (b)). We also plot empirical CDFs of iid N(0,1) samples for comparison (panel (c)). The z-scores obtained under the original framework show clear correlation-induced distortion: the variability of the empirical CDFs is huge. In contrast, when gene-gene correlations are removed under the modified framework, the distortion disappears: the empirical CDFs are almost exactly N(0,1) and their variability is essentially invisible; indeed, they are indistinguishable from the empirical CDFs of iid N(0,1) samples. This is clear evidence that the analysis pipeline produces well-calibrated null z-scores in the absence of gene-gene correlations. Dotted lines are Dvoretzky-Kiefer-Wolfowitz bounds.

In addition, Figure A.2 shows that the mean empirical CDF of the data sets simulated from the original framework (the average of the empirical CDFs in Figure A.1(a)) is very close to N(0,1). Possible deviations occur only in the far tails, and are very small even on the logarithmic scale (panels (b-c)); they are probably caused by numerical constraints, as one or two z-scores in this region in a few data sets can make a visible difference.

Figure A.2: Illustration that the average empirical CDF of z-scores simulated as in Section 2.1 closely matches N(0,1), aggregated over data sets. Left: the average of all empirical CDFs in Figure A.1(a), which is extremely close to N(0,1). Center and right: the left and right tails of the average empirical CDF on a logarithmic scale. Shaded areas are confidence bands. Any deviation from N(0,1) is slight, even in the far tails.

Appendix B Decomposing a Gaussian PDF by standardized Gaussian derivatives

Proposition 1.

The PDF of $N(\mu, \sigma^2)$ can be decomposed by the standard Gaussian and its derivatives in the form of (7) if and only if $\sigma^2 \le 2$.


Let $h_k$ denote the $k$-th probabilists' Hermite polynomial. The orthogonality and completeness of the Hermite polynomials in $L^2(\mathbb{R}, \varphi(x)\,dx)$ (e.g. Szegő, 1975) lead to the following fact:

$$\varphi^{(k)}(x) = (-1)^k h_k(x)\,\varphi(x), \qquad \int h_j(x)\, h_k(x)\, \varphi(x)\,dx = k!\,\delta_{jk}, \tag{B.1}$$

where $\delta_{jk}$ is the Kronecker delta. Therefore, if a PDF $f$ can be decomposed in the form of (7), the coefficient of the $k$-th-order standardized Gaussian derivative has to be

$$w_k = \frac{(-1)^k}{\sqrt{k!}}\, \mathbb{E}_f[h_k(Z)], \tag{B.2}$$

where $\mathbb{E}_f[h_k(Z)]$, sometimes called a "Hermite moment," is the expected value of $h_k(Z)$ when the PDF of the random variable $Z$ is $f$. If $f$ is $N(\mu, \sigma^2)$, we can obtain analytic expressions for these Hermite moments:

$$\mathbb{E}[h_k(Z)] = \sum_{j=0}^{\lfloor k/2 \rfloor} \binom{k}{2j} (2j-1)!!\, (\sigma^2 - 1)^j \mu^{k-2j} = M_k(\mu, \sigma^2 - 1), \tag{B.3}$$

where $(2j-1)!!$ denotes the double factorial of $2j-1$, and $M_k(\mu, v)$ denotes the $k$-th-order moment of a Gaussian with mean $\mu$ and variance $v$. Putting (B.2)-(B.3) together, the coefficients in (7) become

$$w_k = \frac{(-1)^k}{\sqrt{k!}}\, M_k(\mu, \sigma^2 - 1). \tag{B.4}$$

Note that $w_k$ does not explode as $k \to \infty$ if and only if $|\sigma^2 - 1| \le 1$, or equivalently, $\sigma^2 \le 2$. ∎

This result suggests that a pseudo-inflated Gaussian correlated noise distribution is unlikely to have standard deviation greater than √2 ≈ 1.41.
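The boundary at σ² = 2 is easy to verify numerically. The sketch below (function name ours) computes the coefficients of the standardized Gaussian derivative expansion using the standard Gaussian moment recursion M_k = μM_{k-1} + (k-1)vM_{k-2}, applied formally with v = σ² - 1 even when v < 0:

```python
import math

def gd_coefficients(mu, sigma2, K=40):
    """Coefficients w_k of the N(mu, sigma2) density in the standardized
    Gaussian derivative expansion:
        w_k = (-1)^k * M_k(mu, sigma2 - 1) / sqrt(k!),
    where M_k(mu, v) is the k-th moment of a Gaussian with mean mu and
    variance v, computed via M_k = mu*M_{k-1} + (k-1)*v*M_{k-2}."""
    v = sigma2 - 1.0
    M = [1.0, mu]
    for k in range(2, K + 1):
        M.append(mu * M[k - 1] + (k - 1) * v * M[k - 2])
    return [(-1) ** k * M[k] / math.sqrt(math.factorial(k))
            for k in range(K + 1)]

w_ok = gd_coefficients(0.0, 1.5)   # sigma^2 = 1.5 <= 2: coefficients decay
w_bad = gd_coefficients(0.0, 2.5)  # sigma^2 = 2.5 > 2: coefficients explode
```

With σ² = 1.5 the high-order coefficients are negligible, while with σ² = 2.5 they grow without bound, matching the condition σ² ≤ 2 in Proposition 1.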

In the special case when σ² = 0, N(μ, σ²) becomes a point mass at μ. It is interesting to note that this point mass can also be decomposed in the form of (7).

Appendix C Proof of Theorem 1


The marginal distribution of each observation, obtained by integrating out the true effect against the prior g, is essentially the convolution of g with the correlated noise distribution, and has an analytic form (C.1); this form remains valid in the limiting cases. Following (C.1), the marginal log-likelihood of the data is the sum over observations of the logarithms of these marginal densities.

Appendix D Simulation details

D.1 Six choices of the non-null effect distribution

Table D.1 lists the details of the six choices of the non-null effect distribution used in Section 4.1. The table also shows the average signal strength,