1 Introduction
In the past decades, new technologies such as gene microarrays and genome-wide association studies have fundamentally changed the landscape of biomedical research. Instead of studying one gene or one single nucleotide polymorphism (SNP) at a time, these technologies allow us to study thousands of genes or SNPs simultaneously. Bayesian approaches have proven effective for analyzing such data by modeling a large number of parallel parameters for individual genes or SNPs as random effects from a common prior. Bayesian methods can improve inference by borrowing information from other genes and by incorporating useful structure, such as modeling a large proportion of the genes or SNPs as having no effect on the outcome. This approach automatically adjusts for the multiple comparisons and selection bias inherent in the large-scale data setting (Johnstone and Silverman, 2004; Efron, 2010).
A canonical model for this data structure is
(1) $X_i = \theta_i + \epsilon_i, \qquad i = 1, \dots, p,$
where $X_i$ is the observed measurement, $\theta_i$ is the unknown true parameter of interest for the $i$th gene or SNP, and $\epsilon_i$ is an unobserved random error. For many applications, there is often little information to distinguish one $\theta_i$ from the others before the data are collected, and in such situations we can consider the $\theta_i$ exchangeable (Gelman et al., 2014, Chapter 5). For this paper, we shall treat the $\theta_i$ as random effects drawn from a density $g$, and consider $g$ as a smooth or limiting form of the empirical distribution of the underlying $\theta_1, \dots, \theta_p$ to be estimated. Letting $\mathbf{X} = (X_1, \dots, X_p)$ and $\boldsymbol{\theta} = (\theta_1, \dots, \theta_p)$, a suitable posterior distribution that facilitates the inference about $\boldsymbol{\theta}$ is
$\pi_g(\boldsymbol{\theta} \mid \mathbf{X}) \propto \prod_{i=1}^{p} f(X_i \mid \theta_i)\, g(\theta_i),$
where $f(\cdot \mid \theta)$ denotes the density of $X_i$ implied by (1). We would like to approximate $\pi_g(\boldsymbol{\theta} \mid \mathbf{X})$ as closely as possible in our Bayesian inference.
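To make the setup concrete, the following sketch simulates model (1) and computes the oracle posterior mean E(θ_i | X_i) that uses the true $g$, by numerical integration on a grid. The particular choices of $g$ (a scaled $t$ with five degrees of freedom) and of standard normal errors are illustrative assumptions, not part of the model itself.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
p = 1000

# Illustrative choices: true prior g = scaled t with 5 df (sd = 2),
# measurement error N(0, 1).
scale = 2.0 / math.sqrt(5.0 / 3.0)           # makes sd(theta_i) = 2
theta = rng.standard_t(5, size=p) * scale
X = theta + rng.standard_normal(p)           # model (1): X_i = theta_i + eps_i

# When g is known, the posterior for each theta_i is proportional to
# f(X_i - theta) g(theta); compute E(theta_i | X_i) on a grid.
grid = np.linspace(-20.0, 20.0, 4001)
c = math.gamma(3.0) / (math.gamma(2.5) * math.sqrt(5.0 * math.pi))
g_pdf = c * (1.0 + (grid / scale) ** 2 / 5.0) ** (-3.0) / scale   # density of g
lik = np.exp(-0.5 * (X[:, None] - grid[None, :]) ** 2)            # N(0,1) kernel
w = lik * g_pdf
post_mean = (w * grid).sum(axis=1) / w.sum(axis=1)                # E(theta_i | X_i)
```

The rest of the paper is about approximating this oracle behavior when $g$ itself is unknown.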
In practice, however, $g$ is usually unknown. A standard way to take advantage of such data structure is through a parametric Bayesian hierarchical model (Gelman et al., 2014, Chapter 5) as follows:
(2) $\lambda \sim \pi(\lambda), \qquad \theta_i \mid \lambda \sim \pi(\theta \mid \lambda), \qquad X_i \mid \theta_i \sim f(x \mid \theta_i),$
where $f(x \mid \theta)$ is determined by the density of the error in (1), the $X_i$ are independent given $\boldsymbol{\theta}$, and the $\theta_i$ are independent given $\lambda$. This approach can easily incorporate prior information about the structure of $g$ for improved inference about $\boldsymbol{\theta}$. For example, we can choose the working prior $\pi(\theta \mid \lambda)$ to be a Laplace family with scale parameter $\lambda$ if we believe that $g$ is unimodal and long-tailed, and a normal family if short-tailed. When the shape of the working prior is misspecified and deviates severely from $g$, however, it can lead to inferior inference. For example, excessive shrinkage, and therefore bias, can occur if the working prior has much shorter tails than $g$. This misspecification of the working prior can be a serious concern because, while some information about $g$ may be available in a particular application, the information is typically insufficient to determine how heavy the tails should be, which can substantially influence the estimates of the extremal effects, usually the effects of most practical interest.
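The over-shrinkage caused by a short-tailed working prior can be seen in a few lines. The sketch below applies conjugate normal-normal shrinkage to data whose true effects are heavy-tailed; all distributional choices here (a scaled $t$ truth, standard normal errors, a fixed working-prior variance of 4 rather than a hierarchical one) are hypothetical and serve only to illustrate the phenomenon.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 1000

# Heavy-tailed truth (scaled t with 5 df, sd = 2) but a short-tailed
# normal(0, tau2) working prior; errors are N(0, 1). All illustrative.
theta = rng.standard_t(5, size=p) * (2.0 / np.sqrt(5.0 / 3.0))
X = theta + rng.standard_normal(p)

tau2 = 4.0                     # working-prior variance matched to Var(theta_i)
shrink = tau2 / (tau2 + 1.0)   # conjugate normal-normal posterior-mean factor
theta_hat = shrink * X         # E(theta_i | X_i) under the normal working prior

# Linear shrinkage pulls every estimate toward 0 by the same factor, so the
# extreme effects that a t-tailed truth produces are biased the most.
i = int(np.argmax(np.abs(theta)))
extreme_bias = theta_hat[i] - theta[i]
```

Even though the working-prior variance is matched to the truth, the normal shape forces uniform linear shrinkage, which is exactly the excessive shrinkage of extreme effects described above.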
Alternatively, nonparametric and semiparametric Bayesian methods (Muralidharan, 2010; Martin and Tokdar, 2012; Bogdan et al., 2008; Müller and Mitra, 2013; Do et al., 2005; Kim et al., 2009) can be used that impose minimal structure on $g$. A popular approach is to specify the working prior for $g$ as generated from a Dirichlet process mixture, as described in more detail in Section 4. Although such a working prior generally performs reasonably well for a wide range of $g$, it may not be optimal due to its weaker prior information. Additionally, its focus on the flexibility of the model can make it difficult for a statistician to incorporate useful prior information (Hoff et al., 2013; O'Hagan, 2013; Carlin et al., 2013). However, to our knowledge, no systematic comparison of parametric and nonparametric approaches, via simulation study or real datasets, is available in the setting of a large number of effects ($p$ in the order of thousands) to give empirical guidance in real applications.
This paper has two aims. First, we propose a simple new method to robustify the posterior distribution of $\boldsymbol{\theta}$ for the parametric hierarchical model (2) by utilizing the asymptotic behavior of order statistics. A unique feature of the proposed method is that it can substantially improve the posterior distribution when the working prior is misspecified without affecting the posterior under a correctly specified working prior. Therefore, it can be broadly used. Second, we conduct a systematic performance comparison of the standard parametric posterior, the proposed robustified parametric posterior, and a nonparametric Bayesian posterior that uses a Dirichlet process mixture prior with a normal base distribution. Our study shows that while the nonparametric Bayesian method does provide reasonable performance under different forms of $g$, it can perform poorly when $g$ deviates severely from normal, possibly due to the normal base distribution in its Dirichlet process mixture prior. The proposed robustified posterior, when combined with a flexible parametric prior, can be a superior alternative to nonparametric Bayesian methods.
2 The Robustified Posterior
We now present the key result of how to robustify the standard posterior distribution for a general working prior density $\pi$, given by
(3) $\pi(\boldsymbol{\theta} \mid \mathbf{X}) \propto \prod_{i=1}^{p} f(X_i \mid \theta_i)\, \pi(\theta_i).$
In order to develop the robustified version, denoted by $\pi_R(\boldsymbol{\theta} \mid \mathbf{X})$, we need the following assumption, which is reasonable if $X_i$ is to be a measurement of $\theta_i$.
Assumption 1.
We assume the distribution of $X_i$ is strictly stochastically increasing in $\theta_i$.
Let $F_{\theta}$ be the cumulative distribution function of the errors $\epsilon_i$ in (1) and let $f_{\theta}$ be the corresponding density. We then have $f(x \mid \theta) = f_{\theta}(x - \theta)$. Let
(4) $U_i = F_{\theta_i}(X_i - \theta_i), \qquad i = 1, \dots, p,$
be the quantiles of the errors $\epsilon_i$. Here we allow the distribution of $\epsilon_i$ to depend on $\theta_i$ in order to be more general. It follows immediately that $U_1, \dots, U_p$ are independent uniform$(0, 1)$ random variables. For a given $X_i$, $U_i$ is a strictly decreasing function of $\theta_i$ by Assumption 1 and therefore $\theta_i$ can be written as a function of $U_i$: $\theta_i = \theta_i(U_i, X_i)$. We will reformulate the posterior of $\boldsymbol{\theta}$, given in (3), in terms of the quantiles of the errors. The change-of-variables formula rewrites the posterior (3) as
(5) $\pi(\mathbf{U} \mid \mathbf{X}) \propto \prod_{i=1}^{p} f_{\theta_i}(X_i - \theta_i)\, \pi(\theta_i) \left| \frac{\partial \theta_i}{\partial U_i} \right|,$
where $\mathbf{U} = (U_1, \dots, U_p)$, $\theta_i = \theta_i(U_i, X_i)$, and $|\partial \theta_i / \partial U_i|$ is the Jacobian of the transformation.
An interesting special case occurs when the distribution of $\epsilon_i$ does not depend on $\theta_i$. In this case, $F_{\theta} = F$, $\theta_i = X_i - F^{-1}(U_i)$, and $|\partial \theta_i / \partial U_i| = 1 / f(X_i - \theta_i)$. Therefore the posterior distribution (5) has the particularly simple form
(6) $\pi(\mathbf{U} \mid \mathbf{X}) \propto \prod_{i=1}^{p} \pi\!\left(X_i - F^{-1}(U_i)\right).$
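The reparameterization between $\theta_i$ and $U_i$ can be sketched in a few lines. Logistic errors are used purely for illustration because both $F$ and $F^{-1}$ are then closed-form; the specific numbers are arbitrary.

```python
import math

def F(e):
    """Logistic error CDF (illustrative choice of error distribution)."""
    return 1.0 / (1.0 + math.exp(-e))

def F_inv(u):
    """Its quantile function."""
    return math.log(u / (1.0 - u))

X, theta = 2.3, 1.1
U = F(X - theta)             # quantile of the error, as in (4)
theta_back = X - F_inv(U)    # mapping U back to theta recovers it exactly

# For fixed X, U is strictly decreasing in theta (Assumption 1), so the map
# between theta and U is one-to-one and the posterior can be written on U.
U_at_theta_2 = F(X - 2.0)    # larger theta ...
U_at_theta_05 = F(X - 0.5)   # ... gives the smaller U
```

The one-to-one map is what lets the posterior (6) be evaluated on the $U$ scale and then transported back to $\boldsymbol{\theta}$.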
We now return to the general case of (5). Let $U_{(1)} \le \cdots \le U_{(p)}$ denote the order statistics of $U_1, \dots, U_p$ and write $\mathbf{U}_{(\cdot)} = (U_{(1)}, \dots, U_{(p)})$. Then the posterior distribution of $\mathbf{U}$ under the working prior has the decomposition
(7) $\pi(\mathbf{U} \mid \mathbf{X}) = \pi(\mathbf{U} \mid \mathbf{U}_{(\cdot)}, \mathbf{X})\, \pi(\mathbf{U}_{(\cdot)} \mid \mathbf{X}).$
If the working prior is misspecified, both factors in decomposition (7) can be distorted from their corresponding distributions under the correctly specified prior $g$. The key idea of our proposed robustified posterior distribution is to replace $\pi(\mathbf{U}_{(\cdot)} \mid \mathbf{X})$ in (7) with its asymptotic limit, which turns out not to depend on $g$. In the rest of this section, we shall assume that $\mathbf{X}$ is generated under model (1) with $\theta_i \sim g$. In what follows, we show that the asymptotic limit of $\pi(\mathbf{U}_{(\cdot)} \mid \mathbf{X})$ is available without knowledge of the correct prior $g$: it concentrates on the single point $\mathbf{u}^* = (u_1^*, \dots, u_p^*)$ with $u_i^* = i/(p+1)$, the quantiles of the uniform distribution on $(0, 1)$. The key insight is that when $p$ is large and under the correct $g$, $\mathbf{U}_{(\cdot)}$ is well approximated by these uniform quantiles; this is the same rationale that justifies the widely used Q-Q plot for distribution checking. We formalize this in the following theorem (proof provided in the Appendix).
Theorem 1.
Let $\mathbf{X}$ be generated under model (1) with $\theta_i \sim g$. Let $U_{(1)} \le \cdots \le U_{(p)}$ denote the order statistics of $U_1, \dots, U_p$ as defined in (4). Then, for any $\delta > 0$,
$P\left( \max_{1 \le i \le p} \left| U_{(i)} - \tfrac{i}{p+1} \right| > \delta \,\middle|\, \mathbf{X} \right) \to 0 \quad \text{as } p \to \infty,$
except on a small subset of the sample space of $\mathbf{X}$ whose probability can be made as small as any $\epsilon > 0$.

We therefore propose to fix the factor $\pi(\mathbf{U}_{(\cdot)} \mid \mathbf{X})$ on the right-hand side of (7) at its asymptotic limit derived under $g$ and define our robustified posterior as
(8) $\pi_R(\mathbf{U} \mid \mathbf{X}) = \pi(\mathbf{U} \mid \mathbf{U}_{(\cdot)} = \mathbf{u}^*, \mathbf{X}), \qquad u_i^* = \frac{i}{p+1}.$
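The concentration of uniform order statistics around the quantiles $i/(p+1)$, which is what justifies fixing $\mathbf{U}_{(\cdot)}$ at $\mathbf{u}^*$, can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(2)

# Maximum deviation of sorted iid uniform(0,1) draws from i/(p+1):
# it shrinks on the order of 1/sqrt(p) as p grows.
max_dev = {}
for p in (100, 1000, 10000):
    U = np.sort(rng.uniform(size=p))
    max_dev[p] = float(np.max(np.abs(U - np.arange(1, p + 1) / (p + 1))))
```

This is the same numerical fact one relies on when reading a Q-Q plot against the uniform reference line.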
In the robust posterior (8), the sample space of $\mathbf{U}$, to be denoted by $\mathcal{S}$, consists of the $p!$ permutations of $(u_1^*, \dots, u_p^*)$, where $u_i^* = i/(p+1)$. This gives the following explicit form of (8):
(9) $\pi_R(\mathbf{U} \mid \mathbf{X}) \propto \prod_{i=1}^{p} f_{\theta_i}(X_i - \theta_i)\, \pi(\theta_i) \left| \frac{\partial \theta_i}{\partial U_i} \right|$
on $\mathcal{S}$, where $\theta_i = \theta_i(U_i, X_i)$. Note that this approach of robustifying the standard posterior through truncation to the discrete space $\mathcal{S}$ can be applied to the posterior for any general Bayesian model, including hierarchical Bayes and empirical Bayes. Using the relationship $\theta_i = \theta_i(U_i, X_i)$ defined at the beginning of this section, we can easily map $\mathbf{U}$ back to $\boldsymbol{\theta}$. In particular, $\theta_i = X_i - F^{-1}(U_i)$ if the distribution of $\epsilon_i$ does not depend on $\theta_i$.
Based on the discussion above, the robustified posterior (9) is effective over the standard posterior when the misspecified working prior causes $\pi(\mathbf{U}_{(\cdot)} \mid \mathbf{X})$ to deviate from its limit under $g$. This can happen when the working prior over-shrinks because it is too concentrated, or shrinks in the wrong direction because of a location shift. On the other hand, robustification does not improve inference if the misspecification primarily affects $\pi(\mathbf{U} \mid \mathbf{U}_{(\cdot)}, \mathbf{X})$ in (7), which happens when the working prior is too diffuse and therefore under-shrinks.
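For a toy problem the robustified posterior can be evaluated exactly by enumerating the $p!$ points of its support. The sketch below assumes standard normal errors and a normal working prior with a fixed scale $\tau$ (both illustrative choices, not prescribed by the method), so that the simple form (6) applies on each arrangement of the fixed quantiles.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(3)
p, tau = 5, 2.0
X = np.sort(rng.normal(0.0, 2.0, size=p))
u_star = np.arange(1, p + 1) / (p + 1)

def norm_ppf(u, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection (keeps the sketch stdlib+numpy only)."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

z_star = np.array([norm_ppf(u) for u in u_star])   # F^{-1}(u_i*)

# Unnormalized log posterior of each arrangement via (6): sum of log pi(theta_i)
# with theta_i = X_i - F^{-1}(U_i) and pi = normal(0, tau^2).
support = list(itertools.permutations(range(p)))
logw = np.array([-np.sum((X - z_star[list(s)]) ** 2) / (2.0 * tau ** 2)
                 for s in support])
w = np.exp(logw - logw.max())
w /= w.sum()

# Robustified posterior mean of theta, mapping each arrangement back to theta.
post_mean = sum(wi * (X - z_star[list(s)]) for wi, s in zip(w, support))
```

For realistic $p$ the enumeration is of course impossible, which is what motivates the Markov chain Monte Carlo algorithm of Section 3.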
For hierarchical model (2), let
(10) $\pi(\theta) = \int \pi(\theta \mid \lambda)\, \pi(\lambda)\, d\lambda$
be the prior of $\theta_i$ after integrating out $\lambda$. We can use Theorem 1 to improve the inference of hierarchical model (2) by replacing the standard posterior with the robustified posterior under working prior (10). This is, however, computationally prohibitive because the mixture density given in (10) is very expensive to evaluate. To get around this computational limitation, note that the standard posterior can be simulated as the stationary distribution of the Gibbs sampler
$\boldsymbol{\theta} \sim \pi(\boldsymbol{\theta} \mid \lambda, \mathbf{X}), \qquad \lambda \sim \pi(\lambda \mid \boldsymbol{\theta}).$
Now a robustified version of the component $\pi(\boldsymbol{\theta} \mid \lambda, \mathbf{X})$ can be constructed as $\pi_R(\boldsymbol{\theta} \mid \lambda, \mathbf{X})$ by applying the robustification above with working prior $\pi(\theta \mid \lambda)$, which is much less computationally intensive because $\pi(\theta \mid \lambda)$ is usually a simple parametric distribution. We use it to construct a robustified Gibbs sampler
(11) $\boldsymbol{\theta} \sim \pi_R(\boldsymbol{\theta} \mid \lambda, \mathbf{X}), \qquad \lambda \sim \pi(\lambda \mid \boldsymbol{\theta}),$
which can be simulated from rapidly. We shall show in Section 3 that this new Gibbs sampler has a stationary distribution. Our proposed robustified posterior is defined as the stationary distribution of $\boldsymbol{\theta}$ for this robustified Gibbs sampler, which can replace the standard posterior for improved Bayesian inference.
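A minimal runnable sketch of the robustified Gibbs sampler (11) follows, assuming a normal$(0, \lambda)$ working prior with a conjugate inverse-gamma hyperprior on $\lambda$ and standard normal errors (all illustrative choices). Because $p$ is tiny, the $\boldsymbol{\theta}$-step draws an arrangement exactly by enumeration; in practice the Metropolis-Hastings step of Section 3 replaces the enumeration.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(4)
p = 4
X = np.sort(rng.normal(0.0, 2.0, size=p))
u_star = np.arange(1, p + 1) / (p + 1)

def norm_ppf(u, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

z_star = np.array([norm_ppf(u) for u in u_star])
support = [np.array(s) for s in itertools.permutations(range(p))]
a0, b0 = 2.0, 2.0        # inverse-gamma(a0, b0) hyperprior on lam (illustrative)

lam, draws = 1.0, []
for it in range(200):
    # theta-step: draw an arrangement from pi_R(U | lam, X), then map to theta
    logw = np.array([-np.sum((X - z_star[s]) ** 2) / (2.0 * lam) for s in support])
    w = np.exp(logw - logw.max())
    s = support[rng.choice(len(support), p=w / w.sum())]
    theta = X - z_star[s]
    # lam-step: conjugate inverse-gamma draw from pi(lam | theta)
    lam = 1.0 / rng.gamma(a0 + p / 2.0, 1.0 / (b0 + np.sum(theta ** 2) / 2.0))
    draws.append(float(lam))
```

The two conditional draws alternate exactly as in (11); only the mechanics of the $\boldsymbol{\theta}$-step change with the problem size.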
3 A Markov Chain Monte Carlo Algorithm
We now describe a random-walk Metropolis-Hastings algorithm (e.g., Robert and Casella, 2009, Chapter 6) to sample from $\pi_R(\mathbf{U} \mid \mathbf{X})$ in (9), whose sample space $\mathcal{S}$ consists of the $p!$ permutations of $(u_1^*, \dots, u_p^*)$. Let $\mathbf{U}$ be the current position of the Markov chain. Randomly select $k$ locations in $\mathbf{U}$ and then randomly permute the elements at these locations. Let the resulting vector be the candidate $\mathbf{U}'$. The parameter $k$ plays the role of a step size in a continuous random walk. It is easy to see that this proposal is symmetric; the probability of generating $\mathbf{U}'$ from $\mathbf{U}$ is the same as the probability of generating $\mathbf{U}$ from $\mathbf{U}'$. Therefore, the Metropolis-Hastings algorithm accepts $\mathbf{U}'$ with probability
$\min\left\{ 1, \frac{\pi_R(\mathbf{U}' \mid \mathbf{X})}{\pi_R(\mathbf{U} \mid \mathbf{X})} \right\}.$
Due to the enormous number of points in the discrete sample space $\mathcal{S}$, an effective Metropolis-Hastings algorithm must start the chain from a well-supported point and must be able to control the distance of the proposal $\mathbf{U}'$ from the current position $\mathbf{U}$. To address this, we propose the following enhancement. Let $P_i$ be the $p$-value for testing $H_0: \theta_i = 0$ versus $H_a: \theta_i < 0$, which can serve as an approximation of the unknown $U_i$. Note also that $1 - P_i$ is the $p$-value for testing $H_0: \theta_i = 0$ versus $H_a: \theta_i > 0$. Now reorder the data by the values of $P_i$. To simplify notation, the reordered sequence will still be denoted as $X_1, \dots, X_p$ but now satisfies $P_1 \le P_2 \le \cdots \le P_p$.
Since $P_i$ is an approximation of $U_i$, the underlying $U_i$ for the reordered data should approximately follow the same increasing pattern as $i$ increases. We propose to apply the above random-walk Metropolis-Hastings algorithm to the reordered dataset, with two benefits. First, we can start the chain at the well-supported point $\mathbf{U} = (u_1^*, \dots, u_p^*)$. Second, in generating the proposal $\mathbf{U}'$ from $\mathbf{U}$, we randomly select $k$ consecutive positions in $\mathbf{U}$ and then randomly permute the elements at these locations. Because consecutive elements of $\mathbf{U}$ now generally have similar values, we can easily control the probability of accepting $\mathbf{U}'$ to be around 25%, as recommended in Robert and Casella (2009, Section 6.6), by selecting an appropriate $k$. This enhancement drastically improves the sampling efficiency in our experience.
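The enhanced random-walk sampler can be sketched as follows. Normal errors, a fixed normal working prior, and $k = 8$ are illustrative assumptions; sorted draws stand in for the $P_i$-reordered data, and the chain starts at the identity arrangement $\mathbf{U} = \mathbf{u}^*$.

```python
import math
import numpy as np

rng = np.random.default_rng(5)
p, tau, k = 200, 2.0, 8
X = np.sort(rng.normal(0.0, 2.0, size=p))   # stands in for the P_i-reordered data
u_star = np.arange(1, p + 1) / (p + 1)

def norm_ppf(u, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

z_star = np.array([norm_ppf(u) for u in u_star])

def log_target(perm):
    """log pi_R(U | X) up to a constant, via (6) with a normal(0, tau^2) prior."""
    th = X - z_star[perm]
    return -np.sum(th ** 2) / (2.0 * tau ** 2)

perm = np.arange(p)                 # start at the well-supported point U = u*
cur = log_target(perm)
accepted, n_iter = 0, 2000
for _ in range(n_iter):
    prop = perm.copy()
    start = rng.integers(0, p - k + 1)              # k consecutive positions
    prop[start:start + k] = rng.permutation(prop[start:start + k])
    new = log_target(prop)
    if rng.uniform() < math.exp(min(0.0, new - cur)):   # symmetric proposal
        perm, cur, accepted = prop, new, accepted + 1
rate = accepted / n_iter
```

In a tuned implementation $k$ would be adjusted until `rate` is near the recommended 25%; permuting only consecutive, similar-valued entries is what keeps the acceptance probability controllable.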
We can now easily implement the robustified Gibbs sampler (11). Drawing $\lambda$ from $\pi(\lambda \mid \boldsymbol{\theta})$ is usually straightforward. The above Metropolis-Hastings algorithm can be used to sample $\mathbf{U}$ from $\pi_R(\mathbf{U} \mid \lambda, \mathbf{X})$ and therefore $\boldsymbol{\theta}$ from $\pi_R(\boldsymbol{\theta} \mid \lambda, \mathbf{X})$. Note that $\mathbf{U}$ has the finite sample space $\mathcal{S}$ and the chain is irreducible so long as $\pi(\theta \mid \lambda) > 0$ for all $\theta$ under any given $\lambda$. It follows that this robustified Gibbs sampler has a unique stationary distribution, whose density can be written down explicitly from the two conditional distributions in the Gibbs sampler by the Hammersley-Clifford theorem (Besag, 1974).
4 Comparison of Performance
A comprehensive simulation study is performed to compare the performance of three sets of procedures: the standard parametric posterior, the robustified parametric posterior, and a nonparametric Bayesian posterior using a Dirichlet process mixture prior. The simulation study is conducted as follows.
 Step 1:
For $p = 1000$ and $p = 2000$, generate $X_i = \theta_i + \epsilon_i$ (for $i = 1, \dots, p$), where $\theta_i$ is drawn from $g$ and $\epsilon_i$ is drawn from the error distribution in (1).
 Step 2:
Reorder the data $X_1, \dots, X_p$ by the values of $P_i$ as described in Section 3. Arrange the generated $\theta_i$ in the corresponding order.
 Step 3:
For each dataset, compute the posterior means $\hat{\theta}_i$ and the estimation errors $\hat{\theta}_i - \theta_i$ ($i = 1, \dots, p$) using the seven estimation methods described below.
Steps 1 through 3 are repeated 100 times for the results reported here. In this simulation study, three forms of $g$ are used. The first form, $g_1$, is a normal distribution, which serves as an example of a light-tailed distribution. The second form, $g_2$, is the scaled $t$-distribution with five degrees of freedom and a standard deviation of two, which represents a heavy-tailed distribution. The third form, $g_3$, combines the two: $g_1$ truncated to a central interval and $g_2$ truncated to the complementary tail region. This hybrid distribution has the form of $g_1$ in the middle and the form of $g_2$ in the tails.
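The three forms of $g$ can be simulated as below. The scale factor for $g_2$ follows from $\mathrm{Var}(t_5) = 5/3$; the normal parameters for $g_1$ and the cut-off `cut` for $g_3$ are hypothetical choices (the exact values used in the study are not restated here), and $g_3$ is drawn by rejection sampling from the even mixture of $g_1$ and $g_2$.

```python
import math
import numpy as np

rng = np.random.default_rng(6)
scale = 2.0 / math.sqrt(5.0 / 3.0)   # t_5 has variance 5/3, so sd(scale * t_5) = 2

def g1(n):
    """Light-tailed form: normal with sd 2 (illustrative parameters)."""
    return rng.normal(0.0, 2.0, size=n)

def g2(n):
    """Heavy-tailed form: scaled t with 5 df and sd 2."""
    return scale * rng.standard_t(5, size=n)

def g1_pdf(x):
    return math.exp(-x * x / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))

def g2_pdf(x):
    c = math.gamma(3.0) / (math.gamma(2.5) * math.sqrt(5.0 * math.pi))
    return c * (1.0 + (x / scale) ** 2 / 5.0) ** (-3.0) / scale

def g3(n, cut=3.0):
    """Hybrid form: density proportional to g1 on [-cut, cut], to g2 outside.
    Rejection sampling from the even mixture q = (g1 + g2)/2 with bound 2q."""
    out = []
    while len(out) < n:
        x = float(g1(1)[0]) if rng.uniform() < 0.5 else float(g2(1)[0])
        target = g1_pdf(x) if abs(x) <= cut else g2_pdf(x)
        if rng.uniform() < target / (g1_pdf(x) + g2_pdf(x)):
            out.append(x)
    return np.array(out)

theta1, theta2, theta3 = g1(1000), g2(1000), g3(1000)
```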
We now describe the seven estimation methods used in this simulation.
 Method 1 (Laplace).
Standard posterior for hierarchical model (2) with a Laplace working prior with scale parameter $\lambda$.
 Method 2 (R Laplace).

Robustified posterior of Method 1.
 Method 3 (Normal).
Standard posterior for hierarchical model (2) with a normal working prior.
 Method 4 (R Normal).

Robustified posterior of Method 3.
 Method 5 (Mixture).
Standard posterior for hierarchical model (2) with a mixture working prior: a two-component mixture of the Laplace and normal families, with mixing weight $w$ and the two components distributed as in Methods 1 and 3.
 Method 6 (R Mixture).

Robustified posterior of Method 5.
 Method 7 (DP).
Nonparametric Bayes with a Dirichlet process mixture prior. In this formulation, $\alpha$ is the scaling parameter and $G_0$ is the base distribution of the Dirichlet process. The shape and scale parameters in the two inverse gamma distributions are chosen so that the variance components have sufficient variability for a flexible model.
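A truncated stick-breaking draw from a Dirichlet process mixture with a normal base distribution illustrates the kind of prior used in Method 7. Here $\alpha = 1$, $G_0 = N(0, 4)$, a fixed error variance, and the truncation level $K$ are illustrative simplifications; the actual method additionally places inverse gamma priors on the variance parameters rather than fixing them.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, K, p = 1.0, 200, 1000

# Stick-breaking construction of the DP weights, truncated at K atoms.
v = rng.beta(1.0, alpha, size=K)
sticks = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
w = v * sticks

atoms = rng.normal(0.0, 2.0, size=K)      # atom locations drawn from G0
z = rng.choice(K, size=p, p=w / w.sum())  # cluster assignment for each effect
theta = atoms[z]                          # random effects from the DP draw
X = theta + rng.standard_normal(p)        # observed data under model (1)
```

Because the DP draw is discrete, many effects share atoms, which is how the prior borrows strength across the $p$ parallel effects.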
For $p = 1000$, Figure 1 presents boxplots of the distributions of the extremal estimation errors $\hat{\theta}_{(1)} - \theta_{(1)}$ and $\hat{\theta}_{(p)} - \theta_{(p)}$ over 100 replications under each of $g_1$, $g_2$, and $g_3$ for each of the seven estimation methods. Figure 2 presents the same results for $p = 2000$. Note that we can combine the two errors into one boxplot to save space because they have the same distribution due to the symmetries of $g$ and the working priors. Concentration of the distribution around 0 in a boxplot represents a good estimation method, while concentration below and above 0 represents under-shrinkage and over-shrinkage, respectively. Note also that the data have been reordered in Step 2 above by the $P_i$ values. Therefore, $\theta_{(1)}$ and $\theta_{(p)}$ are the two most extremal effects, usually of the greatest practical interest.
We now summarize the performance of the seven estimation methods as reflected in Figures 1 and 2. Method 1 under-shrinks considerably under $g_1$, as expected, because the working Laplace prior has much heavier tails; it works well under the heavier-tailed $g_2$ and $g_3$. Method 2 offers a small improvement over Method 1 for $g_1$. Method 3 works very well under $g_1$ but severely over-shrinks under $g_2$ and $g_3$. Method 4 substantially improves over Method 3 under $g_2$ and $g_3$ but still performs poorly due to the restrictive short tails in the working prior. Method 5, with its flexible mixture working prior, has low bias across all three forms of $g$ but suffers from increased variation. Method 6 offers considerable improvement over Method 5 under $g_2$ and $g_3$, in particular by reducing the extreme estimation errors. The nonparametric Method 7 performs well under $g_1$, less well under $g_2$, and poorly under $g_3$, possibly due to its normal base distribution. Changing the scaling parameter $\alpha$ to 1/2 to increase the variability of the Dirichlet process does not improve its performance (data not shown). Nevertheless, Method 7 never performs very badly across all three forms of $g$, demonstrating the relative robustness of the nonparametric method. Overall, Method 6, which combines the robustified posterior with a reasonably flexible parametric working prior, has the best performance. The improvement provided by robustification also becomes more dramatic when $p$ is increased from 1000 to 2000.
There are a few cases in Figures 1 and 2 in which the robustified posterior provides little improvement over the standard posterior. This can happen when the working prior is already optimal or close to optimal, such as Method 1 under $g_2$ and Method 3 under $g_1$. It can also happen when the working prior has longer tails than $g$, as noted in Section 2, such as Method 1 under $g_1$.
Finally, Tables 1 and 2 give, for $p = 1000$ and $p = 2000$ respectively, the mean squared error of each estimation method, computed as the average of $(\hat{\theta}_{(k)} - \theta_{(k)})^2$ and $(\hat{\theta}_{(p+1-k)} - \theta_{(p+1-k)})^2$ over the 100 replications, for $k = 1, 2, 3$. The performance ranking of the seven methods summarized above for the most extremal effects ($k = 1$) in Figures 1 and 2 remains generally valid for $k = 2$ and $k = 3$, but the differences between the methods are much smaller.
The standard posterior estimates in Methods 1, 3, and 5 are computed using RStan (Carpenter et al., 2016). The robustified posterior estimates for Methods 2, 4, and 6 are obtained with our own R code. The function DPMmeta in the DPpackage (Jara et al., 2011) is used for the nonparametric Bayesian estimation in Method 7. Our complete R code for this simulation study is available at http://sites.google.com/site/jiangangliao.
Table 1: Mean squared errors, p = 1000.

k   g    Laplace  R Laplace  Normal  R Normal  Mixture  R Mixture  DP
1   g1   1.25     1.29       0.85    0.84      0.90     0.84       0.88
1   g2   1.02     1.02       4.29    3.25      1.31     1.03       1.61
1   g3   1.09     1.08       3.57    2.81      1.59     1.36       1.53
2   g1   1.31     1.34       0.79    0.79      0.89     0.80       0.80
2   g2   1.13     1.12       2.30    2.05      1.25     1.14       1.23
2   g3   1.26     1.25       2.28    2.09      1.46     1.39       1.41
3   g1   1.12     1.14       0.85    0.84      0.88     0.83       0.84
3   g2   1.09     1.09       1.90    1.75      1.14     1.10       1.17
3   g3   1.08     1.08       1.65    1.56      1.14     1.10       1.12
Table 2: Mean squared errors, p = 2000.

k   g    Laplace  R Laplace  Normal  R Normal  Mixture  R Mixture  DP
1   g1   1.56     1.50       0.74    0.74      0.94     0.76       0.75
1   g2   1.07     1.07       6.08    4.33      1.79     1.06       1.60
1   g3   0.97     0.97       4.20    3.21      1.81     0.98       1.34
2   g1   1.38     1.33       0.85    0.85      1.02     0.87       0.87
2   g2   1.33     1.33       3.51    3.02      1.87     1.33       1.48
2   g3   1.18     1.19       2.82    2.56      1.52     1.23       1.32
3   g1   1.40     1.35       0.99    0.99      1.15     1.01       1.00
3   g2   1.15     1.16       2.67    2.41      1.30     1.15       1.18
3   g3   1.02     1.03       2.04    1.92      1.32     1.02       1.09
5 Discussion
This paper proposes a robustified posterior for improving inference on a large number of parallel effects. By providing significant protection against misspecified priors, our method encourages the use and specification of genuinely informative priors instead of defaulting to a weak and ineffective prior. For example, Method 6 in Section 4 can be an excellent choice if we believe that the tails of $g$ are between a short-tailed normal and a long-tailed $t$-distribution. Other approaches to enhance the robustness of Bayesian inference have been proposed in different contexts and models. For example, Lazar (2003) replaces the likelihood function in the Bayesian posterior by an empirical likelihood, which achieves improved robustness by reduced specification in the likelihood. Also, Hoff (2007) proposed to replace the likelihood of the complete data by the likelihood of the ranks of the data to remove nuisance parameters in a semiparametric copula estimation. The robustified posterior in this paper is specifically developed for estimating a large number of parallel effects. By utilizing the asymptotic behavior of order statistics and the unique structure of parallel effects, our method has the distinctive advantage of improving robustness with little or no loss of inferential efficiency even when the working prior is correctly specified.
Finally, we have previously proposed a rank-based robustified posterior in which the posterior of $\theta_i$ is computed conditional on the rank of $X_i$ among $X_1, \dots, X_p$ instead of the value of $X_i$ itself (Liao et al., 2014). The rank-based posterior has properties similar to the robustified posterior in this paper but works well only when the errors $\epsilon_i$ have similar variation across $i$. In contrast, the robustified posterior in this paper only requires the error distribution in (1) to be continuous.
Appendix A Proof of Theorem 1
Proof.
Formally, first consider the marginal distribution of the order statistics $\mathbf{U}_{(\cdot)} = (U_{(1)}, \dots, U_{(p)})$:
$\pi(\mathbf{U}_{(\cdot)}) = \int \pi(\mathbf{U}_{(\cdot)} \mid \mathbf{X})\, m(\mathbf{X})\, d\mathbf{X},$
where $m(\mathbf{X})$ is the marginal distribution of $\mathbf{X}$. It follows from (4) that $\pi(\mathbf{U}_{(\cdot)})$ is the joint distribution of the order statistics from a uniform$(0, 1)$ sample (see, e.g., Shao, 1999, p. 72). Let $\mathbf{U}_{(\cdot)}$ be a draw from $\pi(\mathbf{U}_{(\cdot)})$. The Glivenko-Cantelli theorem and the Berry-Esseen theorem state that, as $p \to \infty$, the empirical distribution of $U_1, \dots, U_p$ converges to the uniform distribution function uniformly on $(0, 1)$. Recent refinements of these theorems (Fresen, 2011, Lemma 2) characterize the behavior of the order statistics directly:
(12) $\max_{1 \le i \le p} \left| U_{(i)} - \frac{i}{p+1} \right| \to 0$
in probability.
Now we show that the asymptotics in (12), derived under the marginal distribution $\pi(\mathbf{U}_{(\cdot)})$, can be extended to the conditional distribution $\pi(\mathbf{U}_{(\cdot)} \mid \mathbf{X})$. It follows from equation (12) that, for any given $\delta > 0$ and as $p \to \infty$, we have
$P\left( \max_{1 \le i \le p} \left| U_{(i)} - \frac{i}{p+1} \right| > \delta \right) \to 0.$
Now for any $\mathbf{X}$, define
$A_p(\mathbf{X}) = P\left( \max_{1 \le i \le p} \left| U_{(i)} - \frac{i}{p+1} \right| > \delta \,\middle|\, \mathbf{X} \right),$
where the right side is the conditional probability given $\mathbf{X}$. It follows that
$E\left[ A_p(\mathbf{X}) \right] = P\left( \max_{1 \le i \le p} \left| U_{(i)} - \frac{i}{p+1} \right| > \delta \right) \to 0,$
where the expectation on the left is with respect to the marginal distribution of $\mathbf{X}$. Since $A_p(\mathbf{X}) \ge 0$ for every $\mathbf{X}$ and $E[A_p(\mathbf{X})] \to 0$, we have, for any $\epsilon > 0$,
$P\left( A_p(\mathbf{X}) > \epsilon \right) \le \frac{E[A_p(\mathbf{X})]}{\epsilon} \to 0$
as $p \to \infty$ by Markov's inequality, where the probability is evaluated with respect to the marginal distribution of $\mathbf{X}$. In other words, except on a small set of $\mathbf{X}$ whose probability goes to 0, we have $A_p(\mathbf{X}) \le \epsilon$ for every $\epsilon > 0$ when $p$ is sufficiently large. ∎
References
 Besag (1974) Besag, J. (1974), “Spatial interaction and the statistical analysis of lattice systems,” Journal of the Royal Statistical Society. Series B (Methodological), 36(2), 192–236.
 Bogdan et al. (2008) Bogdan, M., Ghosh, J. K., Tokdar, S. T. et al. (2008), “A comparison of the Benjamini-Hochberg procedure with some Bayesian rules for multiple testing,” in Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, Institute of Mathematical Statistics, pp. 211–230.
 Carlin et al. (2013) Carlin, B. P., Murray, T. A. et al. (2013), “Comment on Article by Müller and Mitra,” Bayesian Analysis, 8(2), 303–310.
 Carpenter et al. (2016) Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., Riddell, A. et al. (2016), “Stan: A probabilistic programming language,” Journal of Statistical Software, 20(2), 1–37.
 Do et al. (2005) Do, K., Mueller, P., and Tang, F. (2005), “A nonparametric Bayesian mixture model for gene expression,” Journal of the Royal Statistical Society, Series C, 54, 1–18.
 Fresen (2011) Fresen, D. (2011), “Simultaneous concentration of order statistics,” Technical report, arXiv. https://arxiv.org/abs/1102.1128.
 Gelman et al. (2014) Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2014), Bayesian Data Analysis, 3rd ed., Taylor & Francis.
 Hoff (2007) Hoff, P. D. (2007), “Extending the rank likelihood for semiparametric copula estimation,” The Annals of Applied Statistics, 1(1), 265–283.
 Hoff et al. (2013) Hoff, P. D. et al. (2013), “Comment on Article by Müller and Mitra,” Bayesian Analysis, 8(2), 311–318.
 Jara et al. (2011) Jara, A., Hanson, T. E., Quintana, F. A., Müller, P., and Rosner, G. L. (2011), “DPpackage: Bayesian semi- and nonparametric modeling in R,” Journal of Statistical Software, 40(5), 1.
 Kim et al. (2009) Kim, S., Dahl, D. B., and Vannucci, M. (2009), “Spiked Dirichlet process prior for Bayesian multiple hypothesis testing in random effects models,” Bayesian Analysis, 4(4), 707–732.
 Lazar (2003) Lazar, N. A. (2003), “Bayesian empirical likelihood,” Biometrika, 90(2), 319–326.
 Liao et al. (2014) Liao, J., Mcmurry, T., and Berg, A. (2014), “Prior robust empirical Bayes inference for largescale data by conditioning on rank with application to microarray data,” Biostatistics, 15(1), 60–73.
 Martin and Tokdar (2012) Martin, R., and Tokdar, S. T. (2012), “A nonparametric empirical Bayes framework for largescale multiple testing,” Biostatistics, 13(3), 427–439.
 Müller and Mitra (2013) Müller, P., and Mitra, R. (2013), “Bayesian nonparametric inference–why and how,” Bayesian Analysis, 8(2), 269–302.
 Muralidharan (2010) Muralidharan, O. (2010), “An empirical Bayes mixture method for effect size and false discovery rate estimation,” The Annals of Applied Statistics, 4(1), 422–438.
 O’Hagan (2013) O’Hagan, A. (2013), “Comment on Article by Müller and Mitra,” Bayesian Analysis, 8(2), 319–322.
 Robert and Casella (2009) Robert, C., and Casella, G. (2009), Introducing Monte Carlo Methods with R, Springer Science & Business Media.
 Shao (1999) Shao, J. (1999), Mathematical Statistics, Springer.