# Solving the Empirical Bayes Normal Means Problem with Correlated Noise

The Normal Means problem plays a fundamental role in many areas of modern high-dimensional statistics, both in theory and practice. And the Empirical Bayes (EB) approach to solving this problem has been shown to be highly effective, again both in theory and practice. However, almost all EB treatments of the Normal Means problem assume that the observations are independent. In practice correlations are ubiquitous in real-world applications, and these correlations can grossly distort EB estimates. Here, exploiting theory from Schwartzman (2010), we develop new EB methods for solving the Normal Means problem that take account of unknown correlations among observations. We provide practical software implementations of these methods, and illustrate them in the context of large-scale multiple testing problems and False Discovery Rate (FDR) control. In realistic numerical experiments our methods compare favorably with other commonly-used multiple testing methods.

## Authors

• 39 publications
• 8 publications
10/01/2021

### ebnm: An R Package for Solving the Empirical Bayes Normal Means Problem Using a Variety of Prior Families

The empirical Bayes normal means (EBNM) model plays an important role in...
01/30/2019

### Empirical Bayes estimation of normal means, accounting for uncertainty in estimated standard errors

We consider Empirical Bayes (EB) estimation in the normal means problem,...
08/29/2018

### On spike and slab empirical Bayes multiple testing

This paper explores a connection between empirical Bayes posterior distr...
07/04/2018

### Large-Scale Multiple Hypothesis Testing with the Normal-Beta Prime Prior

We revisit the problem of simultaneously testing the means of n independ...
12/15/2019

### Testing Homogeneity for Normal Mixture Models: Variational Bayes Approach

The test of homogeneity for normal mixtures has been conducted in divers...
07/17/2018

### Knockoffs for the mass: new feature importance statistics with false discovery guarantees

An important problem in machine learning and statistics is to identify f...
06/16/2021

### Nonparametric Empirical Bayes Estimation and Testing for Sparse and Heteroscedastic Signals

Large-scale modern data often involves estimation and testing for high-d...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider the Empirical Bayes (EB) approach to the Normal Means problem (Efron and Morris, 1973; Johnstone and Silverman, 2004):

 Xj∣θj,sj∼N(θj,s2j) ,j=1,…,p . (1)

Here

denotes the normal distribution with mean

and variance

; are observations;

are standard deviations that are assumed known; and

are unknown means to be estimated. The EB approach assumes that are independent and identically distributed (iid) from some “prior” distribution,

 θjiid∼g(⋅) ,j=1,…,p ; (2)

and performs inference for in two steps: first obtain an estimate of , say, and second compute the posterior distributions . We refer to the two-step process as “solving the Empirical Bayes Normal Means (EBNM) problem.” The first step, estimating , is sometimes of direct interest in itself, and is an example of a “deconvolution” problem (e.g. Kiefer and Wolfowitz, 1956; Laird, 1978; Stefanski and Carroll, 1990; Fan, 1991; Cordy and Thomas, 1997; Bovy et al., 2011; Efron, 2016).

First named by Robbins (1956), EB methods have seen extensive theoretical study (e.g. Robbins, 1964; Morris, 1983; Efron, 1996; Jiang and Zhang, 2009; Brown and Greenshtein, 2009; Scott and Berger, 2010; Petrone et al., 2014; Rousseau and Szabo, 2017; Efron, 2018), and are becoming widely used in practice. Indeed, according to Efron and Hastie (2016)

, “large parallel data sets are a hallmark of twenty-first-century scientific investigation, promoting the popularity of empirical Bayes methods.”

The EB approach provides a particularly attractive solution to the Normal Means problem. For example, the posterior means of provide shrinkage point estimates, with all the accompanying risk-reduction benefits (Efron and Morris, 1972; Berger, 1985). And the posterior distributions for provide corresponding “shrinkage” interval estimates, which can have good coverage properties even “post-selection” (Dawid, 1994; Stephens, 2017). Further, by estimating , EB methods “borrow strength” across observations, and automatically determine an appropriate amount of shrinkage from the data (Johnstone and Silverman, 2004). Because of these benefits, methods for solving the EBNM problem – and related extensions – are increasingly used in data applications (e.g. Clyde and George, 2000; Johnstone and Silverman, 2005b; Brown, 2008; Koenker and Mizera, 2014; Xing and Stephens, 2016; Urbut et al., 2018; Wang and Stephens, 2018; Dey and Stephens, 2018). One application of EBNM methods that we pay particular attention to later is large-scale multiple testing, and estimation/control of the False Discovery Rate (FDR; Benjamini and Hochberg, 1995; Efron, 2010b; Muralidharan, 2010; Stephens, 2017; Gerard and Stephens, 2018).

Almost all existing treatments of the EBNM problem assume that the observations in (1) are independent given . However, this assumption can be grossly violated in practice. Non-negligible correlations are common in real world data sets. Further, as we discuss later, EB approaches to the Normal Means problem are particularly vulnerable to being misled by pervasive correlations. Specifically, when the average strength of pairwise correlations among observations is non-negligible, the estimate of can be very inaccurate, and this adversely affects inference for all . Ironically then, the attractive “borrowing strength” property of the EB approach becomes, in the presence of pervasive correlations, its Achilles’ heel.

In this paper we introduce methods for solving the EBNM problem allowing for unknown correlations among the observations. More precisely, rewriting (1) as

 Xj=θj+sjZjZj∼N(0,1) , (3)

we develop methods that allow for unknown correlations among the “noise” . Our methods are built on elegant theory from Schwartzman (2010), who shows, in essence, that the limiting empirical distribution, say, of correlated random variables can be represented using a basis of the standard Gaussian density and its derivatives of increasing order. We use this result, combined with an assumption that are exchangeable, to frame solving this “EBNM with correlated noise” problem as a two-step process: first jointly estimate and from all observations; and second compute the posterior distribution of given the estimated (and ). Although many possible assumptions on are possible, here we assume to be a scale mixture of zero-mean Gaussians, following the flexible “adaptive shrinkage” approach in Stephens (2017). We have implemented these methods in an R package, cashr (“correlated adaptive shrinkage in R”), available from https://github.com/LSun/cashr.

The rest of the paper is organized as follows. In Section 2, we illustrate how correlation can derail existing EBNM methods, and review Schwartzman (2010)’s representation of the empirical distribution of correlated random variables. In Section 3 we introduce the exchangeable correlated noise (ECN) model, and describe methods to solve the EBNM with correlated noise problem. Section 4 provides numeric examples with realistic simulations and real data illustrations. Section 5 concludes and discusses future research directions.

## 2 Motivation and Background

### 2.1 Correlation distorts empirical distribution and misleads EBNM methods

In essence, the reason correlation can cause problems for EBNM methods is that, even with large samples, the empirical distribution of correlated variables can be quite different from their marginal distribution (e.g. Efron, 2007a). To illustrate this, we generated realistic correlated -scores using a framework similar to Gerard and Stephens (2017, 2018); Lu (2018). Specifically, we took RNA-seq gene expression data on the most highly expressed genes in 119 human liver tissues (The GTEx Consortium, 2015, 2017). In each simulation we randomly drew two groups of five samples (without replacement), and applied a standard RNA-seq analysis pipeline, using the software packages edgeR (Robinson et al., 2010), voom (Law et al., 2014), and limma (Ritchie et al., 2015), to compute, for each gene , an estimate of the -fold difference in mean expression, , and a corresponding -value,

, testing the null hypothesis that the difference in mean is 0. We converted

to a -score , where is the CDF of . We also computed an “effective” standard deviation for later use (Figure 2 and Section 4).

In these simulations, because samples are randomly assigned to the two groups, there are no genuine differences in mean expression. Therefore the -scores should have marginal distribution . And, indeed, empirical checks confirm that the analysis pipeline produces well-calibrated marginally -scores (Appendix A). However, the scores in each simulated data set are correlated, due to correlations among genes, and such correlations can distort the empirical distribution so that it is very different from (Efron, 2007a, 2010a, 2010b). Figure 1 shows four examples, which were chosen to highlight some common patterns. Panels (a-c) all exhibit a feature we call pseudo-inflation, where the empirical distribution is more dispersed than . Conversely, panel (d) exhibits pseudo-deflation, where the empirical distribution is less dispersed than

. Panel (b) also exhibits skew.

Such correlation-induced distortion of the empirical distribution, if not appropriately addressed, can have serious consequence for EBNM methods. To illustrate this we applied several EBNM methods to five data sets simulated according to (2)-(3) as follows:

• The normal means are iid samples from the mixture , where denotes a point mass on whose coefficient () is the null proportion, and denotes the Gaussian density with mean and variance . The same are used in all five data sets.

• In the first four data sets, the noise variables, , are the correlated null -scores from the four panels of Figure 1. In the fifth data set are iid samples.

• The standard deviations are simulated from the RNA-seq gene expression data as described above, and are scaled to have .

We provide the simulated values to four existing EBNM methods – EbayesThresh (Johnstone and Silverman, 2004, 2005a), REBayes (Koenker and Mizera, 2014; Koenker and Gu, 2017), ashr (Stephens, 2017), and deconvolveR (Efron, 2016; Narasimhan and Efron, 2016) – that all ignore correlation and assume independence among observations. (For deconvolveR we set as its current implementation supports only homoskedastic noise.)

The estimates of obtained by each method are shown in Figure 2. All methods do reasonably well in the fifth data set where are indeed independent (panel (e)). However, in the correlated data sets (panels (a-d)) the methods all misbehave in a similar way: over-estimating the dispersion of under pseudo-inflation, and under-estimating it under pseudo-deflation. Their estimates of the null proportion are particularly inaccurate. These adverse effects are visible even when the distortion appears not severe (e.g. Figure 1(a)).

As a taster for what is to come, Figure 2 also shows the results from our new method, cashr, described later. This new method can account for both pseudo-inflation and pseudo-deflation, and in these examples estimates consistently well.

### 2.2 Pseudo-inflation is non-Gaussian

In a series of pioneering papers (Efron, 2004, 2007a, 2007b, 2008, 2010a), Efron studied the impact of correlations among -scores on EB approaches to multiple testing. To account for the effects of correlation (pseudo-inflation, pseudo-deflation, and skew) on the empirical distribution of null -scores he introduced the concept of an “empirical null.” In his locfdr method (Efron, 2005), the empirical null is assumed to be Gaussian

. However, theory suggests that pseudo-inflation is not well modelled by a Gaussian distribution

(Schwartzman, 2010, reviewed in Section 2.3), and a closer look at our empirical results here supports this conclusion.

To illustrate, Figure 3 shows more detailed analysis of the empirical distribution of Figure 1(c) -scores. The central part of this -score distribution could perhaps be modelled by a Gaussian distribution with inflated variance – for example, it matches more closely to a than to . However, in the tails, the empirical distribution has much shorter tails than . For example, iid samples would be expected to yield 43 -values exceeding the Bonferroni threshold of , whereas in fact we observe none here. In short, the effects of pseudo-inflation are primarily in the “shoulders” of the distribution, where -scores are only moderately large, and not in the tails. (Incidentally, this behavior is far more evident in the histogram of -scores than in the histogram of corresponding -values, and we find the -score histogram generally more helpful for diagnosing potential correlation-induced distortion.)

With hindsight this shoulder-but-not-tail inflation pattern should perhaps be expected. If one views the effect of correlation as to reduce the effective sample size, the number of extreme values of a sample with a smaller effective sample size should indeed be smaller. There are also relevant discussions on “asymptotic independence” in the extreme value theory (Sibuya, 1960; De Carvalho and Ramos, 2012). However, this property of pseudo-inflation does suggest that using a Gaussian to describe correlation-induced distortion, as in locfdr, is not ideal (more discussion in Section 4).

### 2.3 Empirical distribution of correlated N(0,1) random variables

We now summarize an elegant result of Schwartzman (2010), which characterizes the empirical distribution of a large number of correlated -scores. This result plays a key role in our work.

On notation: let denote the PDF of , and denote the derivative of . We refer to the collection of functions as the (standardized) Gaussian derivatives. (Here “standardized” means that they are scaled to be orthonormal with respect to the weight function .)

Let be identically distributed, but not necessarily independent, random variables. Let denote their empirical CDF:

 Fp(⋅):=1pp∑j=1I(Zj≤⋅) , (4)

where the indicator function . Since are random variables, is a random function on . Also, because are marginally , the expectation of is .

Schwartzman (2010) studies the distribution of , and how its deviation from the expectation depends on the correlations among . Specifically, assuming that each pair is bivariate normal with correlation (which is weaker than the common assumption that all are joint multivariate normal), Schwartzman (2010) provides the following representation of when is large:

 Fp(⋅)≈F(⋅):=Φ(⋅)+∞∑l=1Wl1√l!φ(l−1)(⋅) , (5)

where are uncorrelated random variables with , and

 var(Wl)=¯¯¯¯ρl:=1p(p−1)∑i,j:i≠jρlij . (6)

Although uncorrelated, are not independent; they must have higher-order dependence to guarantee that is non-decreasing. Also here we assume for all . Note that this assumption should not be too demanding for large in practice (Schwartzman, 2010, also see Appendix E).

Since is a CDF, its derivative defines a corresponding density:

 f(⋅):=F′(⋅)=φ(⋅)+∞∑l=1Wl1√l!φ(l)(⋅) . (7)

Intuitively, (7) characterizes how the (limiting) empirical distribution (histogram) of is likely to randomly deviate from the expectation , using standardized Gaussian derivatives as basis functions.

The representation (7) is crucial to our work here, and provides some remarkable insights. We highlight particularly the following:

1. The expected deviations of from are determined by the variances of the coefficients

, which are determined by the mean and higher moments of the pairwise correlations,

. In the special case where are independent all these terms are , and .

2. Following from 1, to create a tangible deviation from , must be non-negligible (for some ). This requires pervasive, but not necessarily strong, pairwise correlations. For example, pervasive correlations occur if there is an underlying low-rank structure in the data, where all are influenced by a small number of underlying random factors, and so are all correlated with one another. In this case will be non-negligible, and may deviate from . In contrast, there can exist very strong pairwise correlations with negligible effect on . For example, suppose is even, and let be in pairs, with each pair having correlation one but different pairs being independent. The histogram of will look very much like , because for large . In other words, not all kinds of correlations, even large ones, distort the empirical distribution of .

3. Barring special cases such as , the moments , and hence the expected magnitude of , will decay quickly with . Consequently the sum in (7) will typically be dominated by the first few terms, and the shape of the first few basis functions will determine the typical deviation of from . Of the first four basis functions (Figure 4), the 2 and 4 correspond to pseudo-inflation or pseudo-deflation in the shoulders of , depending on the signs of their coefficients, whereas the 1 and 3 correspond to mean shift and skewness. This explains the empirical observation that correlation-induced pseudo-inflation tends to focus in the shoulders, and not the tails. (Also see Appendix B for the special case .)

In discussing Efron (2010a), Schwartzman (2010) used this result to argue that “a wide unimodal histogram (of -scores) may be indication of the presence of true signal, rather than an artifact of correlation.” Specifically, by discarding terms for in (7), he found that the largest central spread (standard deviation) for in (7) being a unimodal density is approximately 1.3. Along similar lines, we can show (Appendix B) that the maximum standard deviation for being a Gaussian density is . The key point here is that the effects of correlation are different from the effects of true signals, so the two can (often) be separated. Our methods here are designed to do exactly that.

## 3 Empirical Bayes Normal Means with Correlated Noise

### 3.1 The Exchangeable Correlated Noise model

As a first step towards allowing for correlated noise in the EBNM problem, we develop methods to fit the representation (7) to correlated null -scores. We do this by treating as conditionally iid samples from in (7), parameterized by which are realizations of :

 Zj∣{W=ω}iid∼f(⋅;ω):=φ(⋅)+∞∑l=1ωl1√l!φ(l)(⋅) . (8)

It may seem perverse to model correlated random variables as conditionally iid. However, this treatment can be motivated by assuming are exchangeable and appealing to de Finetti’s representation theorem (De Finetti, 1937), which says that (infinitely) exchangeable random variables can be represented as being conditionally iid from their empirical distribution. We therefore refer to the model (8) as the exchangeable correlated noise (ECN) model. We also refer to as the correlated noise distribution.

To fit the ECN model (8) with observed , we estimate , essentially by maximum likelihood, but with a couple of additional complications that we now describe. First, since is a density, we must constrain the parameters to ensure that is non-negative (note that (8) integrates to one for any , but is not guaranteed to be non-negative). Ideally should be non-negative on the whole real line, but this constraint is difficult to work with, so we approximate it using a discrete approximation: we constrain on a fine grid such as , in addition to for all .

Second, to incorporate the prior expectation that the absolute value of should decay quickly with (because ) we introduce a penalty on ,

 h(ω):=∑lγl|ωl|, (9)

where we take the penalty parameters to be

 γl={0l is oddγ/ρl/2l is even , (10)

where represents a common penalty, and some notion of average pairwise correlation. For computational convenience we use only the first Gaussian derivatives (see Figure 4 for 7-10 standardized Gaussian derivatives) and set for . (Recall that , so ’s realization will generally be negligible in practice for .) Of course a full Bayesian treatment would attempt to account for uncertainty in ; in ignoring that here we are making the usual EB compromise.

In numerical simulations we experimented with different combinations of and , and found that performed well in a variety of situations, although results were not very sensitive to the choice of and . All results in this paper were obtained with .

In summary, we estimate by solving:

 maxω∑jlogf(Zj;ω)−h(ω)s.t.f(Zj;ω)≥0,j=1,…,p ,f(zi;ω)≥0,i=1,…,m . (11)

This is a convex optimization and can be solved efficiently and stably using an interior point method; we implemented this using the R package Rmosek to interface to the MOSEK commercial solver (MOSEK ApS, 2018). With , the problem is solved on average within 0.50 seconds on a personal computer (Apple iMac, 3.2 GHz, Intel Core i5).

Figure 1 shows the fitted distributions from the ECN model, , on the four illustrative sets of correlated null -scores.

### 3.2 The EBNM model with correlated noise

To allow for correlated noise in the EBNM problem, we combine the standard EBNM model (2)-(3) with the ECN model (8):

 Xj =θj+sjZj (12) θj ∼g(⋅) (13) Zj∣ω ∼f(⋅;ω)=φ(⋅)+L∑l=1ωl1√l!φ(l)(⋅) . (14)

Note that in this model the observations are conditionally independent given and .

Following Stephens (2017) we model the prior distribution by a finite mixture of zero-mean Gaussians:

 g(⋅;π)=π0δ0(⋅)+K∑k=1πkN(⋅;0,σ2k) , (15)

where is the null proportion. Here the mixture proportions are non-negative and sum to 1, and are to be estimated, whereas the component standard deviations are a fixed pre-specified grid of values. By using a sufficiently wide and dense grid of standard deviations this finite mixture can approximate, to arbitrary accuracy, any scale mixture of zero-mean Gaussians.

The marginal log-likelihood for , integrating out , , is given by the following Theorem.

###### Theorem 1.

Combining (12)-(15), the marginal log-likelihood of is

 L(π,ω):=log(n∏j=1p(Xj|π,ω))=n∑j=1log(K∑k=0πk(pjk0+L∑l=1ωlpjkl)), (16)

where

 pjkl=slj√σ2k+s2jl+11√l!φ(l)⎛⎜ ⎜⎝Xj√σ2k+s2j⎞⎟ ⎟⎠. (17)
###### Proof.

See Appendix C. ∎

### 3.3 Fitting the model

Following the usual EB approach, we fit the model (12)-(15) in two steps, first estimating by estimating and then basing inference for on the (estimated) posterior distribution . Note that under the model (12)-(15) are conditionally independent given , so this posterior distribution factorizes, and is determined by its marginal distributions . The intuition here is that, under the exchangeability assumption, the effects of correlation are captured entirely by the (realized) correlated noise distribution . Once this distribution is estimated, the inferences for each become independent, just as in the standard EBNM problem.

The usual EBNM approach to estimating would be to maximize the likelihood . Here we modify this approach using maximum penalized likelihood. Specifically we use the penalty on as in (9), and the penalty on used by Stephens (2017) to encourage conservative (over-)estimation of the null proportion (to induce conservative estimation of false discovery rates). Thus, we solve

 ^π,^ω=argmaxπ,ωn∑j=1log(K∑k=0πk(pjk0+L∑l=1ωlpjkl))+K∑k=0λklog(πk)−L∑l=1γl|ωl| (18)

subject to the constraints

 K∑k=0πk =1 (19) πk ≥0,  k=0,1,…,K (20) φ(zi)+L∑l=1ωl1√l!φ(l)(zi) ≥0,  i=1,…,m . (21)

In (21) we used the same device as in (11) to capture non-negativity of . We set as in (10), use only the first Gaussian derivatives, and set , as in Stephens (2017).

Problem (18) is biconvex. That is, given a feasible , the optimization over is convex; and given a feasible , the optimization over is convex. The optimization over can be solved using the EM algorithm, or more efficiently using convex optimization methods (Koenker and Mizera, 2014; Koenker and Gu, 2017; Kim et al., 2018). To optimize over we use the same approach as in solving (11). To solve (18) we simply iterate between these two steps until convergence.

### 3.4 Posterior calculations

For each , the posterior distribution

is, by Bayes Theorem, given by

 p(θj∣Xj,^π,^ω)=[^π0δ0+K∑k=1^πkN(θj;0,σ2k)][1sjφ(Xj−θjsj)+L∑l=1^ωl1sj1√l!φ(l)(Xj−θjsj)]K∑k=0^πk(pjk0+L∑l=1^ωlpjkl). (22)

Despite the somewhat complex form, some important functionals of this posterior distribution are analytically available.

1. The posterior mean for

 E[θj∣Xj,^π,^ω]=K∑k=0^πk(mjk0+L∑l=1^ωlmjkl)K∑k=0^πk(pjk0+L∑l=1^ωlpjkl) , (23)

where .

2. The local FDR (lfdr; Efron, 2008) is

 lfdrj:=Pr(θj=0∣Xj,^π,^ω)=^π01sjφ(Xjsj)+L∑l=1^ωl1sj1√l!φ(l)(Xjsj)K∑k=0^πk(pjk0+L∑l=1^ωlpjkl). (24)

From this, the FDR of any discovery set can be estimated as

 ˆFDR(Γ)=1|Γ|∑j∈Γ% lfdrj , (25)

where denotes the number of elements in . Storey’s -value (Storey, 2003) for each is defined as

 qj:=ˆFDR({k:lfdrk≤% lfdrj}) . (26)
3. Stephens (2017)

introduced the term “local false sign rate (lfsr)” to refer to the probability of getting the sign of an effect wrong, as well as the false sign rate (FSR) and the

-value, analogous to the FDR and the -value, respectively. Making statistical inference about the sign of a parameter, rather than solely focusing on whether the parameter being zero or not, was also discussed in Tukey (1991); Gelman et al. (2012). The value of is defined as

 lfsrj:=min{Pr(θj≥0∣Xj,^π,^ω), Pr(θj≤0∣Xj,^π,^ω)} , (27)

which is easily calculated from and

 Pr(θj>0∣Xj,^π,^ω)=K∑k=1^πk(^τjk0+L∑l=1^ωlτjkl)K∑k=0^πk(pjk0+L∑l=1^ωlpjkl) , (28)

where . The FSR and -value are estimated and defined similarly to the FDR and -value as

 ˆFSR(Γ)=1|Γ|∑j∈Γ% lfsrj ,sj:=ˆFSR({k:% lfsrk≤lfsrj}) . (29)

### 3.5 Software

We implemented both the fitting procedure and posterior calculations in an R package cashr which is available at https://github.com/LSun/cashr. For , it takes on average about 6 seconds for model fitting and posterior calculations on a personal computer (Apple iMac, 3.2 GHz, Intel Core i5).

## 4 Numerical Results

We now empirically assess the performance of cashr on both simulated and real data. We focus our assessments on the “multiple testing” setting where is sparse and the main goal is to identify “significant” non-zero elements . This problem can be tackled using EB methods (Thomas et al., 1985; Greenland and Robins, 1991) and here we compare cashr with both locfdr (Efron, 2005), which attempts to capture effects of correlation through an empirical null strategy discussed in Section 2.2, and ashr (Stephens, 2017), which fits the same EBNM model as cashr but without allowing for correlation – i.e. ashr is equivalent to setting in (14). Multiple testing can also be tackled by attempting to control the FDR in the frequentist sense, and so we also compare with the Benjamini-Hochberg procedure (BH; Benjamini and Hochberg, 1995) and qvalue (Storey, 2002, 2003). One advantage of the EBNM approach to multiple testing is that it can provide not only FDR assessments, but also point estimates and interval estimates for the effects (Stephens, 2017). However, to keep our comparisons simple we focus here only on FDR assessments.

### 4.1 Realistic simulation with gene expression data

We constructed synthetic data with realistic correlation structure using the simulation framework in Section 2.1. The data are simulated according to the EBNM with correlated noise model (2)-(3) as follows.

• The normal means are iid samples from

 g(⋅)=π0δ0(⋅)+(1−π0)g1(⋅) , (30)

for six choices of and three choices of (Figure 5). The density functions of these six choices of and other simulation details are in Appendix D.

• To make the correlation structure among noise realistic, in each simulation are simulated from real gene expression data as in Section 2.1.

• The standard deviations are also simulated from real gene expression data using the same pipeline, and are scaled to have .

• The observations are constructed as , .

In each simulated data set, this framework generates correlated observations of respective normal means with corresponding standard deviations . The data are made available to each method, while the effects are withheld. The analysis goal is to identify which are significantly different from 0. We applied each method to formulate a discovery set at nominal FDR , and calculated the empirical false discovery proportion (FDP) for each discovery set. We ran 1000 simulations for each , divided evenly among the three choices of .

Figure 5 compares the performance of each method in these simulations. Our first result is that, despite the presence of correlation, most of the methods control FDR in the usual frequentist sense under most scenarios: that is, the mean FDP is usually below the nominal level of 0.1. Indeed, BH is notable in never showing a mean FDP exceeding the nominal level, even though, as far as we are aware, no known theory guarantees this under the realistic patterns of correlation used here (Benjamini and Yekutieli (2001) gives relevant theoretical results under more restrictive assumptions on the correlation). The method most prone to lose control is ashr, but even its mean FDP is never above 0.2.

However, despite this frequentist control of FDR, for most methods the FDP for individual data sets can often lie far from the nominal level (see also Owen, 2005; Qiu et al., 2005; Blanchard and Roquain, 2009; Friguet et al., 2009, for example). Arguably, then, frequentist control of FDR is insufficient in practice, since we desire – as far as is possible – to make sound statistical inference for each data set. That is, we might consider a method to perform well if its FDP is consistently close to the nominal level, rather than close on average. By this criterion, cashr consistently outperforms other methods (Figure 5): it provides uniformly lower root MSE of FDP from the nominal FDR, , and the whiskers in the boxplots (indicating 5th and 95th percentiles) are narrower. Along with FDP, Figure 5 also shows the empirical true discovery proportion (TDP), defined as the proportion of true discoveries out of the number of all non-zero , as an indication of statistical power. cashr maintains good power in that it produces higher TDP than most methods in most scenarios. In some scenarios, ashr sometimes finds more true signals than cashr, but at the cost of severely losing control of FDP.

We note that cashr performs well even in settings that do not fully satisfy its underlying assumptions (e.g. where is asymmetric or multimodal). Note also that for our choices of , is a highly sparse setting, as a large portion of the non-zero are close to zero. For example, when is Gaussian, only about out of are expected to be larger than . Therefore, it is understandable that no methods perform particularly well in this difficult setting. But even for this setting, although first impressions from the plot may be that cashr and BH perform similarly, closer visual inspection shows cashr to be better, in that its median FDP tends to be closer to .

The reason that cashr produces more consistently reliable FDP is that, by design, it adapts itself to the particular correlation-induced distortion present in each data set. As illustrated in Figure 1, correlation can lead to pseudo-inflation in some data sets and pseudo-deflation in others. cashr is able to recognize which pattern is present, and correspondingly modify its behavior – becoming more conservative in the former case and less conservative in the latter. This is illustrated in Figure 6, which stratifies the realized data sets according to sample standard deviation of the realized correlated noise in each data set (for the setting where is Gaussian, ). The bottom are categorized as pseudo-deflation, top pseudo-inflation, and the others “in-between.”

For data sets where show no strong distortion (“in-between”) all methods give similar and reasonable results, with cashr showing only a small improvement. However, when are pseudo-inflated, methods ignoring correlation, such as BH, qvalue, ashr, tend to be anti-conservative; that is, they form discovery sets whose FDP are often much larger than the nominal FDR. In contrast, cashr produces conservative FDP near the nominal value; and locfdr is too conservative, consequently losing substantial power (discussed further in Section 4.2). Conversely, with pseudo-deflation, methods ignoring correlation are too conservative, producing FDP much smaller than the nominal FDR, losing power compared with cashr and locfdr.

### 4.2 Real data illustrations

We now use two real data examples to illustrate some of the features of cashr (and other methods) that we observed in simulated data. The first example is a well-studied data set from a leukemia study (Golub et al., 1999), comparing gene expression in acute myeloid leukemia vs acute lymphoblastic leukemia samples, which was discussed extensively in Efron (2010a) as a prime example of how correlation can distort empirical distributions. The second example comes from a study on embryonic mouse hearts (Smemo, 2012), comparing gene expression in 2 left ventricle samples vs 2 right ventricle samples. (The number of samples is small, but each sample is a pool of ventricles from 150 mice – necessary to obtain sufficient tissue for the experiments to work well – and so this experiment involved dissection of 300 mouse hearts.)

For each data set we let denote the true -fold change in gene expression between the two groups for gene . We use a standard analysis protocol (based on Smyth (2004); see Appendix D for details) to obtain an estimate for , and a corresponding -value . As in Section 2.1, we convert the -value to the corresponding -score and use this to compute an effective standard deviation .

Figure 7 shows the empirical distribution of the -scores for each data set, together with the fitted correlated noise distribution from cashr and the fitted empirical null from locfdr. In both cases the histogram is substantially more dispersed than . However the two data sets have otherwise quite different patterns of inflation: the leukemia data show inflation in both the shoulders and tails of the distribution, whereas the mouse data show inflation only in the shoulders. This indicates the presence of some strong signals in the leukemia data, whereas the inflation in the mouse data may be primarily pseudo-inflation caused by correlation. Consistent with this, both locfdr and cashr identify hundreds of significant signals in the leukemia data (at nominal FDR ), but no significant signals in the mouse data (Table 1).

Although the conclusions from cashr and locfdr are, here, qualitatively similar, there are some notable differences in their results. First, in the mouse data, the cashr correlated noise distribution gives, visually, a much better fit than the locfdr empirical null, particularly in the tails (Figure 7). This is because the cashr correlated noise distribution is ideally suited to capture this “shoulder-but-not-tail” inflation pattern that is symptomatic of correlation-induced inflation. The Gaussian empirical null distribution assumed by locfdr is simply inconsistent with these data. Indeed, this inconsistency is reflected in the null proportion estimated by locfdr (1.08) which exceeds the theoretical upper bound of 1.

Second, in the leukemia data, cashr identifies 37% more significant results than locfdr (385 vs 282). This is consistent with the greater power of cashr vs locfdr in our simulations. One reason that locfdr can lose power is that its Gaussian empirical null distribution tends to overestimate inflation in the tails when it tries to fit inflation in the shoulders. We see this feature in the mouse data, and although less obvious, this appears to also be the case for the leukemia data: the estimated standard deviation of the empirical null is 1.58, which is almost certainly too large: a pseudo-inflated Gaussian correlated noise distribution is unlikely to have standard deviation exceeding (Appendix B). In comparison the fitted correlated noise distribution from cashr has a noticeably shorter right tail (e.g. ) which leads it to categorize more -scores in the right tail as significant (Figure 7). On a side note, cashr also experiences the benefits of ashr highlighted in Stephens (2017), which can also help increase power. For example, the unimodal assumption on the effects – which allows that some of the -scores around zero may correspond to true, albeit non-significant, signals – can help improve estimates of , and hence improve power.

Another feature of cashr, which distinguishes it from locfdr, is that, by estimating while accounting for correlation-induced distortion, it can provide an estimate on the effect size distribution, . For the mouse data, cashr estimates , or 12% of genes may be differentially expressed to some extent, although it is not able to pin down any clear example of a differentially expressed gene: no gene has an estimated local FDR less than 0.80. One possible explanation for the lack of significant results in this case is lack of power. However, the estimated from cashr suggests that there may simply not exist any large effects to be discovered: of the probability mass of the estimated is on effect size , or a mere -fold change in gene expression. Thus the signals here, if any, are too weak to be discerned from noise and pseudo-inflation.

We also applied the other methods – BH, qvalue, and ashr – to both data sets. All three methods find very large numbers of significant results in both data sets (Table 1). Although we do not know the truth in these real data, there is a serious concern that many of these results could be false positives, since these methods are all prone to erroneously viewing pseudo-inflation as true signal (Figure 6), and Figure 7 suggests that pseudo-inflation may be present in both data sets.

## 5 Discussion

We have presented a general approach to accounting for correlations among observations in the widely-used Empirical Bayes Normal Means model. Our strategy exploits theoretical results from Schwartzman (2010) to model the impact of correlation on the empirical distribution of correlated variables, and convex optimization techniques to produce an efficient implementation. We demonstrated through empirical examples that this strategy can both improve estimation of the underlying distribution of true effects (Figure 2) and – in the multiple testing setting – improve estimation of FDR compared with EB methods that ignore correlation (Figures 5, 6). To the best of our knowledge, cashr is the first EBNM methodology to deal with correlated noise in this way.

Our empirical results demonstrate some benefits of the EB approach to multiple testing compared with traditional methods. In particular, cashr provides, on average, more accurate estimates of the FDP than either BH or qvalue. However, although we find these empirical results encouraging, we do not have theoretical guarantees of (frequentist) FDR control. That said, theoretical guarantees of FDR control under arbitrary correlation structure are lacking even for the widely-studied BH method. BH has been shown to control FDR under certain correlation stuctures (e.g. “positive regression dependence on subsets”; Benjamini and Yekutieli, 2001). The Benjamini-Yekutieli procedure (Benjamini and Yekutieli, 2001) is proved to control FDR under arbitrary dependence, but at the cost of being excessively conservative, and is consequently rarely used in practice.

A key feature of cashr is that it requires no information about the actual correlations among observations. This has the important advantage that it can be applied wherever EBNM methods that ignore correlation can be applied. On the other hand, when additional information on correlations is available it clearly may be helpful to incorporate it into analyses. Within our approach such information could be used to estimate the moments of the pairwise correlations, and thus inform estimates of in the correlated noise distribution . Alternatively, one could take a more ambitious approach: explicitly model the whole correlation matrix, and use this to help inform inference (e.g. Benjamini and Heller, 2007; Wu, 2008; Sun and Cai, 2009; Friguet et al., 2009; Fan et al., 2012). Modeling correlation is likely to provide more efficient inferences when it can be accurately achieved (Hall and Jin, 2010). However, in many situations – particularly involving small sample sizes – reliably modeling correlation may be impossible. Under what circumstances this more ambitious approach produces better inferences could be one area for future investigation.

The main assumptions underlying cashr are that the correlated noise is marginally

, and that the standard deviations are reliably computed. In the multiple testing setting this corresponds to assuming that the test statistics are (marginally) well calibrated. If these conditions do not hold – for example, due to failure of asymptotic theory underlying test statistic computations, or due to confounding factors (such as batch effects in gene expression studies), then

cashr could give unreliable results. Of course cashr is not unique in this regard – methods like BH and qvalue similarly assume that test statistics are well calibrated. Dealing with confounders in gene expression studies is an active area of research, and several approaches exist, many of them based on factor analysis (e.g. Leek and Storey, 2007; Sun et al., 2012; Gagnon-Bartsch and Speed, 2012; Wang et al., 2017; Gerard and Stephens, 2017, 2018). Again, the possibility of combining these ideas with our methods could be a future research direction.

## Appendix A The marginal distribution of the simulated null z-scores

Figures A.1 and A.2 offer support for the claim that the -scores simulated in Section 2.1 are marginally -distributed.

Figure A.1 compares -scores simulated as in Section 2.1 with -scores simulated under a modified framework that removes gene-gene correlations, and with iid samples. The modified framework uses exactly the same simulation and analysis pipeline as the original framework of Section 2.1, with one important difference: in each simulation, for each gene independently we randomly selected two groups of five samples without replacement, hence removing gene-gene correlations.

The empirical CDF of data sets simulated as in Section 2.1 show a huge amount of variability (panel (a)), presumably due to correlations among genes. In the modified framework, correlation-induced distortion disappears: the empirical CDF of all data sets are almost exactly the same as (panel (b)), just as with the iid samples (panel (c)). This demonstrates that without gene-gene correlations, the analysis pipeline used here produces uncorrelated -scores.

In addition, Figure A.2 shows that the mean empirical CDF of the data sets simulated from the original framework – the average of empirical CDF of Figure A.1(a) – is very close to . Possible deviation happens only in the far tails (). Compared with and , the deviation is very small even on the logarithmic scale (panels (b-c)), probably caused by numerical constraints as one or two -scores in this area in a few data sets can make a visible difference.

## Appendix B Decomposing Gaussian by standardized Gaussian derivatives

###### Proposition 1.

The PDF of can be decomposed by standard Gaussian and its derivatives in the form of (7) if and only if .

###### Proof.

Let denote the probabilists’ Hermite polynomial. The orthogonality and completeness of Hermite polynomials in (e.g. Szegõ, 1975) leads to the following fact

 ∫R1√m!hm(x)1√n!φ(n)(x)dx=(−1)nδmn ,∀m,n=0,1,2,…, (B.1)

where . Therefore, if any PDF can be decomposed in the form of (7), the coefficient of the -order standardized Gaussian derivative has to be

 wl=(−1)l∫R1√l!hl(x)f(x)dx=(−1)l√l!Ef[hl] , (B.2)

where , sometimes called “Hermite moment,” is the expected value of when the PDF of the random variable is . If is , we can obtain analytic expressions of these Hermite moments

 Ef[hl]=μl+⌊l/2⌋∑k=1(l2k)μl−2k(σ2−1)k(2k−1)!!:=Ml(μ,σ2−1) , (B.3)

where denotes the double factorial of , and denotes the function of -order moment of a Gaussian with mean and variance . Putting (B.2)-(B.3) together, the coefficients in (7) become

 wl=(−1)l√l!Ml(μ,σ2−1) . (B.4)

Note that is not exploding if and only if or equivalently, . ∎

This result suggests that a pseudo-inflated Gaussian correlated noise distribution is not likely to have standard deviation greater than .

In the special case when , becomes , a point mass on , with randomly sampled from . It is interesting to note that can be decomposed in the form of (7) as

 δz(⋅)=φ(⋅)+∞∑l=1[(−1)l√l!hl(z)][1√l!φ(l)(⋅)] ,∀z∈R .

## Appendix C Proof of Theorem 1

###### Proof.

The marginal distribution of , denoted as , is obtained by integrating out

 p(Xj) =∫Rg(θj)p(Xj|θj,sj)dθj=∫Rg(θj)1sjf(Xj−θjsj)dθj =∫R[π0δ0(θj)+K∑k=1πk1σkφ(θjσk)][1sjφ(Xj−θjsj)+1sjL∑l=1ωl1√l!φ(l)(Xj−θjsj)]dθj =K∑k=0πk(pjk0+L∑l=lωlpjkl) , (C.1)

where is essentially a convolution of and and has an analytic form

 pjkl=slj√σ2k+s2jl+11√l!φ(l)⎛⎜ ⎜⎝Xj√σ2k+s2j⎞⎟ ⎟⎠ .

This form is also valid for , . Following (C.1), the marginal log-likelihood of is given by

 log(n∏j=1p(Xj))=n∑j=1log(K∑k=0πk(pjk0+L∑l=lωlpjkl)) .

## Appendix D Simulation details

### d.1 Six choices of the non-null effect distribution

Table D.1 lists the details of the six choices of , the non-null effects in Section 4.1. The table also shows the average signal strength,