Log In Sign Up

A Robustified posterior for Bayesian inference on a large number of parallel effects

Many modern experiments, such as microarray gene expression and genome-wide association studies, present the problem of estimating a large number of parallel effects. Bayesian inference is a popular approach for analyzing such data by modeling the large number of unknown parameters as random effects from a common prior distribution. However, misspecification of the prior distribution, particularly in the tails of the distribution, can lead to erroneous estimates of the random effects, especially for the largest and most interesting effects. This paper proposes a robustified posterior distribution that eliminates the impact of a misspecified prior on one component of the standard posterior by replacing that component with an asymptotically correct form. The proposed posterior can be combined with a flexible working prior to achieve superior inference across different structures of the underlying effects of interest.


page 1

page 2

page 3

page 4


A Divide-and-Conquer Bayesian Approach to Large-Scale Kriging

Flexible hierarchical Bayesian modeling of massive data is challenging d...

Strong replica symmetry in high-dimensional optimal Bayesian inference

We consider generic optimal Bayesian inference, namely, models of signal...

Fisher-Rao Geometry and Jeffreys Prior for Pareto Distribution

In this paper, we investigate the Fisher-Rao geometry of the two-paramet...

Further Inference on Categorical Data – A Bayesian Approach

Three different inferential problems related to a two dimensional catego...

Variational Bayesian inference of hidden stochastic processes with unknown parameters

Estimating hidden processes from non-linear noisy observations is partic...

An Interpretable Neural Network for Parameter Inference

Adoption of deep neural networks in fields such as economics or finance ...

1 Introduction

In the past decades, new technologies such as gene microarray and genome-wide association studies have fundamentally changed the landscape of biomedical research. Instead of studying one gene or one single nucleotide polymorphism (SNP) at a time, these technologies allow us to study thousands of genes or SNPs simultaneously. Bayesian approaches have proven effective for analyzing such data by modeling a large number of parallel parameters for individual genes or SNPs as random effects from a common prior. Bayesian methods can improve inference by borrowing information from other genes and by incorporating useful structure such as modeling a large proportion of the genes or SNPs as having no effect on the outcome. This approach automatically adjusts for multiple comparisons and selection bias inherent in the large-scale data setting (Johnstone and Silverman, 2004; Efron, 2010).

A canonical model for this data structure is


where is the observed measurement, is the unknown true parameter of interest for the gene or SNP, and is an unobserved random error. For many applications, there is often little information to distinguish one from the others before the data is collected, and in such situations, we can consider the exchangeable (Gelman et al., 2014, Chapter 5). For this paper, we shall treat as random effects drawn from a density , and consider as a smooth or limiting form of the empirical distribution of the underlying to be estimated. Letting and , a suitable posterior distribution that facilitates the inference about is

We would like to approximate as closely as possible in our Bayesian inference.

In practice, however, is usually unknown. A standard way to take advantage of such data structure is through a parametric Bayesian hierarchical model (Gelman et al., 2014, Chapter 5) as follows:


where is defined by the desnsity of error in (1), are independent given , and are independent given . This approach can easily incorporate prior information about the structure of for improved inference about . For example, we can choose the working prior to be a Laplace family with scale parameter if we believe that is unimodal and long-tailed and a normal family if short-tailed. When the shape of is misspecified and severely deviated from , however, it can lead to inferior inference. For example, excessive shrinkage and therefore bias can occur if a working prior has much shorter tails than . This misspecification of working prior can be a serious concern because, while some information about may be available in a particular application, the information is typically insufficient to determine how heavy the tails should be, which can substantially influence the extremal effects of , usually the effects of most practical interest.

Alternatively, nonparametric and semi-parametric Bayesian methods (Muralidharan, 2010; Martin and Tokdar, 2012; Bogdan et al., 2008; Müller and Mitra, 2013; Do et al., 2005; Kim et al., 2009) can be used that impose minimum structure on . A popular approach is to specify the working prior for as generated from a Dirichlet process mixture, as described in more detail in Section 4. Although such a working prior generally performs reasonably well for a wide range of , it may not be optimal due to its weaker prior information. Additionally, its focus on flexibility of the model can make it difficult for a statistician to incorporate useful prior information (Hoff et al., 2013; O’Hagan, 2013; Carlin et al., 2013). However, to our knowledge, no systematic comparison of parametric and nonparametric approaches, via simulation study or real data-sets, is available in the setting of a large number of effects ( in the order of thousands) to give empirical guidance in real applications.

This paper has two aims. First, we propose a simple new method to robustify the posterior distribution of for parametric hierarchical model (2) by utilizing the asymptotic behavior of order statistics. A unique feature of the proposed method is that it can substantially improve the posterior distribution when is misspecified without affecting the posterior under a correctly specified working prior. Therefore, it can be broadly used. Second, we conduct a systematic performance comparison of the standard parametric posterior, the proposed robustified parametric posterior, and a nonparametric Bayesian posterior which uses a Dirichlet process mixture prior with a normal base distribution. Our study shows that while the nonparametric Bayesian method does provide reasonable performance under different forms of , it can perform poorly when is severely deviated from normal, possibly due to its normal base distribution in the Dirichlet process mixture prior. The proposed robustifed posterior when combined with a flexible parametric prior can be a superior alternative to nonparametric Bayesian methods.

2 The Robustified Posterior

We now present the key result of how to robustify the standard posterior distribution for a general working prior density given by


In order to develop the robustified version denoted by , we need the following assumption, which is reasonable if is to be a measurement of .

Assumption 1.

We assume the distribution of is strictly stochastically increasing in .


be the cumulative distribution function of the errors

in (1) and let be the corresponding density. We then have . Let


be the quantiles of error

. Here we allow the distribution of to depend on in order to be more general. It follows immediately that . For a given , is a strictly decreasing function of by Assumption 1 and therefore can be written as a function of : . We will reformulate the posterior of , given in (3), in terms of the quantiles of error . Then the change-of-variables formula rewrites the posterior (3) as


where , , and .

An interesting special case occurs when the distribution of does not depend on . In this case, and . Therefore the posterior distribution (5) has the particularly simple form


We now return to the general case of (5). Let denote the order statistics of . Then the posterior distribution of under the working prior has the decomposition


If the working prior is misspecified, both factors in decomposition (7) can be distorted from their corresponding distribution under the correctly specified prior . The key idea of our proposed robustified posterior distribution is to replace in (7) with its the asymptotic limit , which turns out not to depend on . In the rest of this section, we shall assume that is generated under model (1) with . In what follows, we show the asymptotic limit of is available without knowledge of the correct prior . More specifically,

converges to the discrete uniform distribution on

. The key insight is that when is large and under the correct , is well approximated by the quantiles of the uniform distribution on ; this is the same rationale as justifies the widely used QQ-plot for distribution checking. We formalize this in the following theorem (proof provided in the Appendix).

Theorem 1.

Let be generated under model (1) with . Let denote the order statistics of drawn from . Then

except on a small subset of

whose probability can be made as small as any


We therefore propose to fix the in in the right hand side in (7) to be its asymptotic limit derived under and define our robustified posterior as


In the robust posterior (8), the sample space of , to be denoted by , consists of the permutations of , where . This gives the following explicit form of (8)


on , where . Note that this approach of robustifying the standard posterior through truncation to the discrete space can be applied to the posterior for any general Bayesian model including hierarchical Bayes and empirical Bayes. Using relationship as defined at the beginning of this section, we can easily map back to as . In particular, if .

Based on the discussion above, the robustified posterior (9) is effective over the standard posterior when the misspecified causes to deviate from . This can happen when the working prior over-shrinks due to underspecification of the working prior or shrinks in the wrong direction due to a location shift. On the other hand, it does not improve inference if the misspecified primarily affects in (7), which happens when the working prior is too diffuse and therefore under-shrinks.

For hierarchical model (2), let


be the prior of after integrating out . We can use Theorem 1 to improve the inference of hierarchical model (2) by replacing , as defined in (3), by , which is, however, computationally prohibitive because the mixture density given in (10) is very expensive to evaluate. To get around this this computational limitations, note that the standard posterior can be simulated as the stationary distribution of Gibbs sampler

Now a robustified version of the component , , can be constructed as , where , which is much less computationally intensive because is usually a simple parametric distribution. We use it to construct a robustified Gibbs sampler


which can be simulated from rapidly. We shall show in Section 3 that this new Gibbs sampler has a stationary distribution. Our proposed robustified posterior is defined as the the stationary distribution of for this robustified Gibbs sampler, which can replace the standard posterior for improved Bayesian inference.

3 A Markov Chain Monte Carlo Algorithm

We now describe a random walk Metropolis-Hasting algorithm (e.g. Robert and Casella, 2009, Chapter 6) to sample from in (9), whose sample space consists of permutations of . Let

be the current position of the Markov chain. Randomly select

locations in and then randomly permute these elements at these locations. Let the resulting be the candidate . The parameter plays the role of a step size in a continuous random walk. It is easy to see that this algorithm is symmetric; the probability of generating from is the same as the probability of generate from . Therefore, the Metropolis-Hastings algorithm accepts with probability

Due to the enormous number of points in the discrete sample space, , an effective Metropolis-Hasting algorithm must start the chain from a well supported point and must be able to control the distance of the proposal from current position . To address this, we propose the following enhancement. Let be the -value of testing versus , which can serve as an approximation of the unknown . Note also that is the -value for testing versus . Now reorder by the value of . To simplify notation, the reordered sequence will still be denoted as but now satisfies

Since is an approximation of , the underlying for the reordered should approximately follow the same increasing pattern as increases. We propose to apply the above random walk Metropolis-Hasting algorithm to the reordered dataset with two benefits. first, we start the chain at a well-supported point . Second, in generating proposal from , we randomly select consecutive positions in and then randomly permute elements at these locations. Because the consecutive elements in now generally have similar values, we can easily control the probability of accepting to be around 25% as recommended in Robert and Casella (2009, Section 6.6) by selecting an appropriate . This enhancement drastically improves the sampling efficiency in our experience.

We can now easily implement robustified Gibbs sampler (11). Drawing from is usually straightforward. The above Metropolis-Hasting algorithm can be used to sample from and therefore . Note that has a finite sample space of and the chain is irreducible so long as for all under any given . It follows that this robustified Gibbs sampler has a unique stationary distribution, whose density can be written down explicitly from the two conditional distributions in the Gibbs sampler by the Hammersley-Clifford theorem (Besag, 1974).

4 Comparison of Performance

A comprehensive simulation study is perfromed to compare the performance of three sets of procedures: standard parametric posterior, the robustified parametric posterior, a nonparametric Bayesian using Dirichlet process mixture prior. The simulation study is conducted as follows.

Step 1:

For and , generate (for ), where and .

Step 2:

Reorder data by the values of as described in Section 3. Arrange the generated in the corresponding order.

Step 3:

For dataset , compute the posterior means and the estimation errors () using seven estimation methods described below.

Steps 1 through 3 are repeated 100 times for the results reported here. In this simulation study, three forms of are used. The first form,

, is a normal distribution

, which serves as an example of a light-tailed distribution. The second form, , is the scaled

-distribution with five degrees of freedom and a standard deviation of two, which represents a heavy-tailed distribution. The third form,

, is given by

where is truncated to interval and is truncated to . This hybrid distribution has the form of in the middle and the form of in the tails.

We now describe the seven estimation methods used in this simulation.

Method 1 (Laplace).

Standard posterior for hierarchical model (2) with Laplace working prior: , with scale parameter .

Method 2 (R Laplace).

Robustified posterior of Method 1.

Method 3 (Normal).

Standard posterior for hierarchical model (2) with normal working prior: with

. Note the variances of

with and with are the same.

Method 4 (R Normal).

Robustified posterior of Method 3.

Method 5 (Mixture).

Standard posterior for hierarchical model (2) with a mixture working prior:

with , and and distributed as in Methods 1 and 3.

Method 6 (R Mixture).

Robustified posterior of Method 5.

Method 7 (DP).

Nonparametric Bayes with Dirichlet process mixture prior:

In this formulation, is the scaling parameter and

is the base distribution of the Dirichlet process. The shape and scale parameters in the two inverse Gamma distributions are chosen so that

and have sufficient variability for a flexible model.

For , figures 1 presents the distribution of and over 100 replications under each of , , , and for each of the 7 estimation methods using boxplots. Figure 2 presents the same results for . Note that we can combine and into one boxplot to save space because they have the same distribution due to the symmetries of and the working priors. Concentration of the distribution around 0 in a boxplot represents a good estimation method while concentration below and above 0 represent under-shrinkage and over-shrinkage respectively. Note also that have been re-ordered in Step 2 above by the values. Therefore, and are the two most extremal effects usually of the greatest practical interest.

We now summarize the performance of the 7 estimation methods as reflected in figures 1 and 2. Method 1 under-shrinks considerably under as expected because the working Laplace prior has much heavier tails. It works well under heavier-tailed and . Method 2 offers small improvement over method 1 for . Method 3 works very well under but severely over-shrinks under and . Method 4 substantially improves over Method 3 under and but still performs poorly due to the restrictive short tails in the working prior. Method 5, with its flexible mixture working prior, has low bias across all three forms of but suffers from increased variation. Method 6 offers considerable improvement over Method 5 under and , in particular by reducing the extreme estimation errors. The nonparametric Method 7 performs well under , less well under , and poorly under , possibly due to its normal base-distribution. Changing the scaling parameter to 1/2 to increase the variability of the Dirichlet process does not improve its performance (data not shown). Nevertheless, Method 7 never performs very badly across all three forms of , demonstrating the relative robustness of the nonparametric method. Overall, Method 6, which combines the robustified posterior with a reasonably flexible parametric working prior, has the best performance. It is also shown that the improvement provided by robustification becomes more dramatic when is increased from 1000 to 2000.

There are a few cases in figures 1 and 2 in which the robustified posterior provides little improvement over the standard posterior. This can happen when the working prior is already optimal or close to optimal, such as Method 1 under and Method 3 under . It can also happen when the working prior has longer tails than as noted in Section 2, such as Method 1 under .

Finally, Tables 1 and 2 give, for and respectively, the mean square error of each estimation method as the average of and from the 100 replications for . The performance ranking of the 7 methods summarized above for in figures 1 and 2 is still generally valid for but the difference between different methods is much smaller.

The standard posterior estimates in Methods 1, 3 and 5 are computed using RStan (Carpenter et al., 2016). The robustified posterior estimates for Methods 2, 4, and 6 are obtained by our own R code. Function DPMmeta in the DPpackage (Jara et al., 2011) is used for the nonparametric Bayesian estimation in Method 7. Our complete R code for this simulation study is available at

Figure 1: =1000. Boxplots of the estimation error of the most extreme random effects; and .
Laplace R Laplace Normal R Normal Mixture R Mixture DP
1.25 1.29 0.85 0.84 0.90 0.84 0.88
1 1.02 1.02 4.29 3.25 1.31 1.03 1.61
1.09 1.08 3.57 2.81 1.59 1.36 1.53
1.31 1.34 0.79 0.79 0.89 0.80 0.80
2 1.13 1.12 2.30 2.05 1.25 1.14 1.23
1.26 1.25 2.28 2.09 1.46 1.39 1.41
1.12 1.14 0.85 0.84 0.88 0.83 0.84
3 1.09 1.09 1.90 1.75 1.14 1.10 1.17
1.08 1.08 1.65 1.56 1.14 1.10 1.12
Table 1: . Mean square error of and for the seven methods under , , and .
Figure 2: =2000. Boxplots of the estimation error of the most extreme random effects; and .
Laplace R Laplace Normal R Normal Mixture R Mixture DP
1.56 1.50 0.74 0.74 0.94 0.76 0.75
1 1.07 1.07 6.08 4.33 1.79 1.06 1.60
0.97 0.97 4.20 3.21 1.81 0.98 1.34
1.38 1.33 0.85 0.85 1.02 0.87 0.87
2 1.33 1.33 3.51 3.02 1.87 1.33 1.48
1.18 1.19 2.82 2.56 1.52 1.23 1.32
1.40 1.35 0.99 0.99 1.15 1.01 1.00
3 1.15 1.16 2.67 2.41 1.30 1.15 1.18
1.02 1.03 2.04 1.92 1.32 1.02 1.09
Table 2: =2000. Mean square error of and for the seven methods under , , and .

5 Discussion

This paper proposes a robusified posterior for improving inference on a large number of parallel effects. By providing significant protection against misspecified priors, our method encourages the use and specification of genuinely informative priors instead of defaulting to a weak and ineffective prior. For example, Method 6 in Section 4 can be an excellent choice if we believe that the tails of are between a short-tailed normal and a long-tailed t-distribution. Other approaches to enhance the robustness of Bayesian inference have been proposed in different contexts and models. For example, Lazar (2003) replaces the likelihood function in the Bayesian posterior by an empirical likelihood, which achieves improved robustness by reduced specification in the likelihood. Also, Hoff (2007) proposed to replace the likelihood of the complete data by the likelihood of the rank of the data to remove nuisance parameters in a semi-parametric copula estimation. The robustified posterior in this paper is specifically developed for estimating a large number of parallel effects. By utilizing asymptotic behavior of order statistics and the unique structure of parallel effects, our method has the distinctive advantage of improving robustness with little or no loss of inferential efficiency even when the working prior is correctly specified.

Finally, we have previously proposed a rank-based robustified posterior in which the posterior of is computed conditioned on the rank of among instead of the value of itself (Liao et al., 2014). The rank-based posterior has similar properties as the robustified posterior in this paper but works well only when error have similar variation across . In contrast, the robustified posterior in this paper only requires the error distribution in (1) to be continuous.

Appendix A Proof of Theorem 1


Formally, first consider the marginal distribution of order statistics :

where is the marginal distribution of . It follows from (4) that

is the joint distribution of the order statistics from uniform

(see, e.g., Shao, 1999, p. 72). Let be a draw from . The Glivenko-Cantelli theorem and the Berry-Esseen theorem state that, as , the empirical distribution of converges to the function uniformly on . Recent refinements to these theorems (Fresen, 2011, Lemma 2) are able to characterize the behavior of the order statistics directly:


in probability.

Now we show the asymptotics in (12), derived under the marginal distribution , can be extended to the conditional distribution . It follows from equation (12) that, for any given and as , we have

Now for any , define

where the right side is the conditional probability given . It follows that

where the expectation on the right is with respect to . Since for every and , we have, for any ,

as , where the probability is evaluated with respect to . In other words, except on a small set of whose probability goes to , we have

for every when is sufficiently large. ∎


  • (1)
  • Besag (1974) Besag, J. (1974), “Spatial interaction and the statistical analysis of lattice systems,” Journal of the Royal Statistical Society. Series B (Methodological), 36(2), 192–236.
  • Bogdan et al. (2008) Bogdan, M., Ghosh, J. K., Tokdar, S. T. et al. (2008), “A comparison of the Benjamini-Hochberg procedure with some Bayesian rules for multiple testing,” in Beyond parametrics in interdisciplinary research: Festschrift in honor of Professor Pranab K. Sen Institute of Mathematical Statistics, pp. 211–230.
  • Carlin et al. (2013) Carlin, B. P., Murray, T. A. et al. (2013), “Comment on Article by Müller and Mitra,” Bayesian Analysis, 8(2), 303–310.
  • Carpenter et al. (2016) Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., Riddell, A. et al. (2016), “Stan: A probabilistic programming language,” Journal of Statistical Software, 20(2), 1–37.
  • Do et al. (2005) Do, K., Mueller, P., and Tang, F. (2005), “A nonparametric Bayesian mixture model for gene expression,” JR Stat. Soc. Ser. C, 54, 1–18.
  • Fresen (2011) Fresen, D. (2011), Simultaneous concentration of order statistics,, Technical report, arXiv.
  • Gelman et al. (2014) Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2014), Bayesian Data Analysis, Vol. 3 Taylor & Francis.
  • Hoff (2007) Hoff, P. D. (2007), “Extending the rank likelihood for semiparametric copula estimation,” The Annals of Applied Statistics, 1(1), 265–283.
  • Hoff et al. (2013) Hoff, P. D. et al. (2013), “Comment on Article by Müller and Mitra,” Bayesian Analysis, 8(2), 311–318.
  • Jara et al. (2011) Jara, A., Hanson, T. E., Quintana, F. A., Müller, P., and Rosner, G. L. (2011), “DPpackage: Bayesian semi-and nonparametric modeling in R,” Journal of Statistical Software, 40(5), 1.
  • Kim et al. (2009) Kim, S., Dahl, D. B., and Vannucci, M. (2009), “Spiked Dirichlet process prior for Bayesian multiple hypothesis testing in random effects models,” Bayesian Analysis, 4(4), 707–732.
  • Lazar (2003) Lazar, N. A. (2003), “Bayesian empirical likelihood,” Biometrika, 90(2), 319–326.
  • Liao et al. (2014) Liao, J., Mcmurry, T., and Berg, A. (2014), “Prior robust empirical Bayes inference for large-scale data by conditioning on rank with application to microarray data,” Biostatistics, 15(1), 60–73.
  • Martin and Tokdar (2012) Martin, R., and Tokdar, S. T. (2012), “A nonparametric empirical Bayes framework for large-scale multiple testing,” Biostatistics, 13(3), 427–439.
  • Müller and Mitra (2013) Müller, P., and Mitra, R. (2013), “Bayesian nonparametric inference–why and how,” Bayesian Analysis, 8(2), 269–302.
  • Muralidharan (2010) Muralidharan, O. (2010), “An empirical Bayes mixture method for effect size and false discovery rate estimation,” The Annals of Applied Statistics, 4(1), 422–438.
  • O’Hagan (2013) O’Hagan, A. (2013), “Comment on Article by Müller and Mitra,” Bayesian Analysis, 8(2), 319–322.
  • Robert and Casella (2009) Robert, C., and Casella, G. (2009), Introducing Monte Carlo Methods with R Springer Science & Business Media.
  • Shao (1999) Shao, J. (1999), Mathematical Statistics Springer.