Nonparametric Bayesian Deconvolution of a Symmetric Unimodal Density

02/17/2020, by Ya Su et al. (Johns Hopkins University, University of Kentucky, Texas A&M University)

We consider nonparametric measurement error density deconvolution subject to heteroscedastic measurement errors as well as symmetry about zero and shape constraints, in particular unimodality. The problem is motivated by applications where the observed data are estimated effect sizes from regressions on multiple factors, where the target is the distribution of the true effect sizes. We exploit the fact that any symmetric and unimodal density can be expressed as a mixture of symmetric uniform densities, and model the mixing density in a new way using a Dirichlet process location-mixture of Gamma distributions. We do the computations within a Bayesian context, describe a simple scalable implementation that is linear in the sample size, and show that the estimate of the unknown target density is consistent. Within our application context of regression effect sizes, the target density is likely to have a large probability near zero (the near-null effects) coupled with a heavy-tailed distribution (the actual effects). Simulations show that, unlike standard deconvolution methods, our Constrained Bayesian Deconvolution method does a much better job of reconstructing the target density. Applications to a genome-wide association study (GWAS) and microarray data reveal similar results.


1 Introduction

In important applied problems, one of which we discuss in Section 6 and another in the Supplementary Material, data come from a one-dimensional classical measurement error model in which an observed variable equals a latent true variable plus measurement error, and the true density of the latent variable is assumed to be unimodal and symmetric. We assume the error has a density with mean zero and a scale parameter, with details given in the main text, whence the density of the observed variable is the convolution of the latent density and the error density. Given the observations, our interest lies in estimating the distribution of the latent variable under the given constraints. As part of our applications, we additionally consider the case where the error scales are heteroscedastic, varying across observations.
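In placeholder notation (our own symbols, not necessarily those used in the paper), the heteroscedastic version of this model can be written as
\[ W_i = X_i + U_i, \qquad X_i \overset{iid}{\sim} f_X, \qquad U_i \sim f_U(\cdot\,;\sigma_i), \qquad i = 1,\dots,n, \]
with $f_X$ symmetric and unimodal; the goal is to estimate $f_X$ from $W_1,\dots,W_n$ when the scales $\sigma_i$ are treated as known.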

One of our motivations arises from genome-wide association studies (GWAS) containing a vast number of single nucleotide polymorphisms (SNPs) along with a response for a relatively small number of individuals, where the marginal effect sizes of the SNPs' associations with the response are of interest. The estimated marginal effect size of each SNP is obtained from a regression of the response on that centered and standardized SNP. It can be shown (see Section 6.2 for more details) that the estimated effect size is related to the true effect size through an additive error that is approximately normally distributed but heteroscedastic.

If we treat the true effect sizes as random effects, their sampling distribution has two key features. First, it makes sense that the effect sizes will be symmetric about zero and unimodal, and not biased towards being marginally skewed. This is the case in our two data applications, where the observed data have almost zero skewness and are unimodal. Second, in practice, we expect that most of the predictors have very small association with the response, with a handful possibly being practically significant. This suggests the density should have a sharp peak near zero while possibly being heavy-tailed; for an example of a density satisfying these two features, see the blue solid curve in Figure 5. The primary challenge then lies in characterizing the density of the true effect sizes while properly capturing its expected shape.

There is a rich literature on density estimation in the measurement error context when the measurement error is homoscedastic (Stefanski and Carroll, 1990; Carroll and Hall, 1988; Fan, 1991), among many others. Delaigle and Meister (2008) introduced a deconvoluting kernel technique for the heteroscedastic measurement error case; see also Sarkar et al. (2014) for a Bayesian approach. However, none of the existing approaches are designed to fulfill the specific constraints in our case. As a result, we are only able to compare our proposed approach with the general nonparametric kernel deconvolution estimator (Delaigle and Meister, 2008) in our simulations and real data examples.

In situations without any measurement error, there is some literature on modeling symmetric and unimodal densities. West (1987) studied scale mixtures of Normals, a class which notably includes the Student-t and Laplace families. However, this approach is not fully flexible, as there exist symmetric and unimodal densities for which the underlying mixing functions are not distributions (Chu, 1973). There are also methods based on Bernstein polynomial basis functions, where the shape constraints are preserved under constraints on the coefficients of the basis functions, e.g., Turnbull and Ghosh (2014). The disadvantages of using Bernstein polynomial bases are twofold. First, the distribution functions they can characterize exclude those with unbounded support. Second, the asymptotics of such shape-constrained estimators are not well studied in the literature, even without measurement error.

In this article, we propose a Bayesian approach for unimodal and symmetric density estimation in the measurement error context. The proposed method is easily adapted to a heteroscedastic error model, as we will exhibit. A key ingredient of the methodology is a representation theorem for symmetric and unimodal densities dating back at least to Feller (1971), where it was proved that any unimodal and symmetric density function can be represented as a mixture of uniform distributions.

Brunner and Lo (1989) adopted this approach and modeled the mixing distribution via a Dirichlet process, which does not yield smooth densities owing to the almost sure discreteness of the Dirichlet process. To yield a smooth density, we model the mixing distribution using a Dirichlet mixture of Gamma distributions, which has large support on the space of smooth densities, and is amenable to scalable posterior computation via an efficient Gibbs sampler we develop here.

We provide large-sample theoretical support for the proposed methodology by showing posterior consistency for the observed density and the latent density. For the observed density, we borrow results from recent work (Bochkina and Rousseau, 2017) in which posterior convergence rates for estimating a density on the positive half-line were established using Dirichlet location-mixtures of Gamma distributions. Their setup serves as a component of our hierarchical model for the observed density. Their theory, however, does not address the difficulties created by the hierarchical model we develop and by the intrinsic deconvolution problem; these difficulties are highlighted in the present work.

We derive a posterior consistency result for the unobserved latent density under a Wasserstein metric. The Wasserstein metric has its origins in the theory of optimal transportation (Villani, 2008) and has recently been found suitable for studying convergence of mixing measures in deconvolution problems (Nguyen, 2013; Gao and van der Vaart, 2016; Scricciolo, 2018). These papers consider Dirichlet process mixture-type models where the mixing distribution is discrete and needs to satisfy some conditions; see Section 3.3 for a discussion of these conditions. A key ingredient of our theory is the development of a new inversion inequality which relates the convergence of the observed/mixture density to that of the unobserved/mixing density. The idea of using inversion inequalities in the Bayesian literature is fairly new, with only a few instances of such results, e.g., Nguyen (2013) and Scricciolo (2018). However, existing inequalities cannot be applied directly to our case, necessitating a new inversion inequality that fits our needs.

Section 2 gives the Bayesian model leading to our methodology, while Section 3 states asymptotic results. Section 4 describes our algorithm and Section 5 presents some of the many simulations we have conducted. Section 6 presents an analysis of a genome-wide association study, and shows that our methodology is able to capture the mixture distribution we expect to see in the data, as described above. Section 7 gives concluding remarks. The Supplementary Material includes an additional data analysis of a microarray experiment.

2 Model Specification

Throughout our paper, a fixed symmetric unimodal density on the real line specifies our family of error distributions. We further use the corresponding scale family obtained by rescaling with a scale parameter, and the associated distribution functions where needed.

Since the observed variable is the latent variable plus measurement error, its true density is the convolution of the latent density with the error density,

(1)

where the true latent density has a unimodal and symmetric shape. If the latent density is continuous with a finite derivative everywhere, then it is well known (Feller, 1971) that there exists a mixing density on the positive half-line such that

(2)

In other words, any symmetric and unimodal density is a mixture of symmetric uniforms. Given our motivating application, it is natural to assume in addition that the latent density is finite at zero, which amounts to finiteness of the expectation of the reciprocal of the mixing variable; this can in turn be ensured by a condition on the mixing density near zero. Our parameter space for the mixing density thus consists of all densities on the positive half-line satisfying this condition.
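For reference, the two displays above can be written compactly in placeholder notation (our symbols): with $f_W$, $f_X$ and $f_U(\cdot\,;\sigma)$ the observed, latent and error densities, and $f_\Theta$ the mixing density on $(0,\infty)$,
\[ f_W(w) = \int f_U(w - x;\sigma)\, f_X(x)\, dx \quad \text{(cf. (1))}, \qquad f_X(x) = \int_0^{\infty} \frac{1\{|x| \le \theta\}}{2\theta}\, f_\Theta(\theta)\, d\theta \quad \text{(cf. (2))}. \]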

In the deconvolution literature, two types of error distributions, ordinary-smooth and super-smooth, are commonly studied. By definition, a density is ordinary-smooth or super-smooth if the tail of its Fourier transform decays to zero at a polynomial rate or an exponential rate, respectively. For our theoretical analysis and simulation studies, we pick one distribution from each class, namely the Normal and Laplace distributions. When presenting the theory we treat the Normal error case first, while the results for the Laplace error distribution are studied in a separate section; a similar strategy is taken with the proofs. Furthermore, in the more complicated situation where only the type (ordinary-smooth or super-smooth) is known, we point out the possibility of modeling the error distribution using a mixture-of-Normals or mixture-of-Laplaces prior; see Sarkar et al. (2017) for an instance of the former.

We build our Bayesian model with the same hierarchical structure as the true densities; that is, the candidate observed, latent, and mixing densities are related in the same way as in (1) and (2). In particular, given the representation (2), the problem of modeling the latent density reduces to creating a flexible model for the mixing density, which is supported on the positive half-line. We model the mixing density using a Dirichlet process location-mixture of Gamma distributions, which has large support (Bochkina and Rousseau, 2017) on densities supported on the positive half-line and is easy to implement in a Bayesian framework. Specifically, we reparameterize each Gamma density by its shape and mean rather than its shape and rate. We place a Dirichlet process prior (Ferguson, 1973) on the distribution of the component means and a separate prior on the common shape parameter. With these ingredients, our hierarchical Bayesian model specifies that each observation is the latent variable plus a scaled error, each latent variable is uniformly distributed on a symmetric interval determined by its mixing variable, the mixing variables are drawn from a location-mixture of Gamma densities, and the mixing distribution over component means follows a Dirichlet process with a concentration parameter and a base probability measure. The hyperparameters are the concentration parameter, the parameters of the base measure, and any additional parameters used to specify the prior on the shape parameter.

Using the stick-breaking representation (Sethuraman, 1994) of the Dirichlet process, the model-prior for the mixing density can be represented as a countably infinite mixture of Gamma densities with stick-breaking weights. For numerical computation, we use a finite Dirichlet approximation (Ishwaran and Zarepour, 2002) to the Dirichlet process in our simulations and data examples.
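To make the hierarchy concrete, the following R sketch simulates from a truncated (K-component) version of the model; all hyperparameter values, the base measure, and the error-scale pattern are illustrative placeholders rather than the paper's defaults.

# Generative sketch: finite Dirichlet approximation to the DP location-mixture
# of Gammas for the mixing density, then the uniform-mixture representation for
# the latent variable, then heteroscedastic Normal measurement error.
set.seed(1)
n  <- 1000                                               # sample size
K  <- 30                                                 # truncation level
a0 <- 1                                                  # total Dirichlet concentration (placeholder)
alpha <- 2                                               # common Gamma shape (> 1)
pi_k <- rgamma(K, a0 / K, 1); pi_k <- pi_k / sum(pi_k)   # Dirichlet(a0/K, ..., a0/K) weights
mu_k <- rgamma(K, shape = 2, rate = 1)                   # component means from a base measure (placeholder)
z     <- sample.int(K, n, replace = TRUE, prob = pi_k)   # component labels
theta <- rgamma(n, shape = alpha, rate = alpha / mu_k[z])# mixing variable with mean mu_k[z]
x     <- runif(n, min = -theta, max = theta)             # latent symmetric unimodal draw
sigma <- runif(n, 0.5, 1.5)                              # heteroscedastic error scales (placeholder)
w     <- x + rnorm(n, 0, sigma)                          # observed data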

3 Theoretical Analysis

3.1 Goal and Background

In this section, we provide theoretical support for our method in terms of posterior consistency for the observed and latent densities. Specifically, we show that the posterior distributions of the observed density and the latent density increasingly concentrate on arbitrarily small neighborhoods of the corresponding true densities as the sample size increases.

We follow the general approach of Ghosal et al. (2000) for establishing posterior contraction and make substantial modifications to adapt it to the hierarchical model considered in this paper. We begin with a basic model with no measurement error and then build the theory towards its measurement error counterpart, allowing multiple layers of mixing in the latter case. Another novelty of the current approach is its ability to work with a mixing distribution having a continuous density with unbounded support, as opposed to the discrete distribution with finite support considered in Nguyen (2013). This is achieved by a mixture model whose mixing distribution is itself modeled by a Dirichlet process mixture of Gamma distributions. We obtain some preliminary results on this layer from Bochkina and Rousseau (2017). An inversion inequality is derived that bridges our theory from the observed density to the latent density.

We list some key definitions and notation in this section. For two probability measures on a metric space with finite moments of a given order, the Wasserstein distance of that order between them (Villani, 2008) is defined as the infimum, over all joint measures with the two given marginals, of the expected value of the metric raised to that order, taken to the corresponding root. We work on the real line equipped with the Euclidean distance. For any two densities on the real line, the first Wasserstein distance between them equals the L1 distance between the corresponding cumulative distribution functions. Another distance between two probability densities is the Hellinger distance, which is widely used in the Bayesian asymptotics literature for quantifying posterior consistency and convergence of densities. We write posterior probabilities of events given the observations in the usual way.
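As a small numerical illustration of the identity just stated (our own sketch, not code from the paper), the first Wasserstein distance between two densities on the real line can be computed as the L1 distance between their cumulative distribution functions on a fine grid:

# W1 between a standard Normal and a t5 density via the CDF identity
dx   <- 0.001
grid <- seq(-10, 10, by = dx)
F1 <- cumsum(dnorm(grid)) * dx          # numerical CDF of N(0, 1)
F2 <- cumsum(dt(grid, df = 5)) * dx     # numerical CDF of t with 5 degrees of freedom
w1 <- sum(abs(F1 - F2)) * dx            # W_1 = integral of |F1 - F2|
w1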

To simplify notation, from now on we use an overall symbol for probability or expectation under the true distribution of the corresponding variable. We also use standard symbols for an inequality holding up to a positive multiplicative constant, for two-sided bounds of this kind, and for the smallest integer greater than or equal to a given number.

3.2 Posterior Consistency for the Observed Density

This section gives a theorem on the posterior convergence rate for the observed density. Our conditions are stated mainly at the layer of the mixing density, which is modeled as a Dirichlet location-mixture of Gamma distributions. We give the conditions, followed by some interpretation, and then state the theorem.

Condition 1.

We adopt a function space for the mixing density, consisting of density functions for which there exist constants such that, for all arguments in the relevant ranges,

Condition 2.

For some , .

Condition 3.

(i) The prior on is , where has a positive and continuous density on satisfying that for some and ,

(ii) The prior on , , has support (1, ). For constants , and ,

For notational simplicity, we drop the arguments and use a single symbol to denote the space of densities in Condition 1. Similar function spaces with additional smoothness assumptions have been used by Bochkina and Rousseau (2017); we do not make such smoothness assumptions here. The conditions are typical of the literature on Bayesian density estimation. A density satisfying Conditions 1 and 2 can be well approximated by a mixture of Gamma distributions, which facilitates finding a Kullback-Leibler divergence neighborhood around the true observed density. When the error distribution is Laplace, Condition 2 is slightly relaxed; see Condition 2′ in Section 3.4. Condition 3 (i) concerns the base measure of the Dirichlet process and agrees with that in Shen et al. (2013), except that the support is the positive half-line rather than the real line. Condition 3 mainly controls the prior thickness of the sieve space, upon which the inversion inequality in Section 3.3 can be derived. Bochkina and Rousseau (2017) showed that Condition 1 is satisfied by Weibull, folded Student-t and Fréchet-type densities. Condition 3 (ii) holds, for example, if the shape parameter has a Gamma prior.

The prior is hierarchical: Conditions 1 and 2 are imposed on the mixing density, which is free of shape constraints other than being a density on the positive half-line. It is generally difficult to go the other way around, that is, to impose conditions on the latent density and identify the corresponding properties of the mixing density. However, we can verify these conditions in some special cases. When the latent density is Normal with mean zero, the induced mixing density belongs to a Weibull family of distributions, so Condition 1 is met, and Condition 2 holds for arbitrarily large moment orders. When the latent density is a t-distribution, the induced mixing density is an Inverse Beta distribution. Condition 1 can then be verified by arguments similar to those in Bochkina and Rousseau (2017) for a folded Student-t density, since only the tail behavior of its derivatives matters, and Condition 2 holds under a restriction relating the moment order to the degrees of freedom.

Theorem 1.

Under Conditions 1–3, for any sufficiently large constant,

Proof.

To prove Theorem 1, we shall exhibit a rate sequence at which the posterior for the observed density contracts. By Ghosal et al. (2000), the desired result holds as long as there exist a sequence of compact subsets (sieves) of the space in which the observed density resides and a rate sequence tending to zero such that

(3)
(4)
(5)

for some positive constants, where the covering number in (3) is taken relative to the Hellinger distance. Equations (3) and (4) are entropy and prior mass conditions on the sieve space, and (5) is referred to as the prior concentration condition. Equation (5) is a slight variation of the original prior concentration condition in Ghosal et al. (2000); see Bochkina and Rousseau (2017).

In Appendix A.1, the details for deriving equations (3), (4) and (5) are provided for an appropriate rate sequence and sieve space. The constant in the rate is determined by the constants in Conditions 2 and 3 (i). ∎

3.3 Posterior Consistency for the Latent Density

We now establish that the posterior distribution for the latent density increasingly concentrates around the true latent density. To show such a result, we build an inversion inequality which harnesses the consistency of the observed density derived above to prove consistency for the latent density. A few previous instances of inversion inequalities can be found in the recent literature. Theorem 2 of Nguyen (2013) relates the Wasserstein distance between the mixing distributions to the total variation distance between the mixture densities, but it requires the mixing distribution to have finite support or a bounded moment. Scricciolo (2018) makes use of an inversion inequality to establish the convergence rate of the Bayes estimator of the mixing density; one of the key requirements on the mixing distribution is that its moment generating function be bounded on some interval containing the origin. However, no existing inversion inequality can be applied directly to our problem, where the mixing density has unbounded support and there is no way to bound the moment generating function on any interval containing the origin uniformly over a sieve space. In Appendix A.2, we prove the next Lemma, which relates the Wasserstein distance between the latent densities to the distance between the corresponding observed densities.

Lemma 1.

On the sieve in Theorem 1, when the constants in Conditions 2 and 3 are large enough,

Remark 1.

For any two densities , , . The conclusion of Lemma 1 can be equivalently stated as .

Theorem 2.

Under the conditions of Theorem 1 and Lemma 1, for any sufficiently large constant,

Proof.

Theorem 2 follows from Theorem 1 and Lemma 1. ∎

Remark 2.

Theorem 2 states posterior consistency of the latent density in the Wasserstein metric, as a consequence of the presence of that metric in the inversion inequality of Lemma 1. In fact, the proof of Lemma 1 can be extended to Wasserstein metrics of any order, which in turn would imply posterior consistency in any such metric. To the best of our knowledge, technical difficulties prevent deriving Lemma 1 for stronger metrics between the latent densities. The difficulties lie in finding a uniform upper bound for the distance between functions in the sieve space and their convolutions with the mollifier, whereas if the Wasserstein distance is used, such an upper bound is simply the second moment of the mollifier. This is probably the main hurdle for establishing posterior contraction theory in such stronger distances for the mixing density without restricting attention to special cases of the mixing density.

3.4 Theory when the error has a Laplace distribution

All theorems and lemmas in Sections 3.2 and 3.3 can be derived when the measurement error has a Laplace distribution, under a relaxation of Condition 2. We state the condition and results only where changes are needed.

Condition 2′.

For some , .

It can be inferred that Condition 2′ holds whenever Condition 2 does. The statement of Theorem 1 continues to hold under Conditions 1, 2′ and 3.

Lemma 2.

On the sieve in Theorem 1, when the constants in Conditions 2′ and 3 (i) are large enough, there exists a constant, depending on them, such that

Theorem 1 and Lemma 2 together imply that Theorem 2 holds.

The proofs parallel those for the Normal error case and are given in Appendix A.1, with only the differences presented.

4 Algorithm

To ease the computational burden, we follow standard practice by approximating the Dirichlet process mixture prior with a finite mixture of Gamma distributions with K components, where K is large, and a specific Dirichlet prior on the mixture probabilities (Ishwaran and Zarepour, 2002). It is straightforward to implement our procedure for the infinite mixture using the slice sampler of Kalli et al. (2011); however, we prefer the finite Dirichlet approximation because of its substantially better mixing behavior for our multi-layered hierarchical model. Our theoretical results in Section 3 were developed for the Dirichlet location-mixture of Gammas prior on the mixing density, where only the mean parameter is mixed over. For flexibility, we adopt a mixture over both the shape and rate parameters in our numerical implementation. The conditions on the priors for these parameters become less stringent because the number of such parameters is finite, and we select these priors among popular choices. Specifically, in our hierarchical Bayes model for the subsequent implementation, one index runs over subjects and another over the K mixture components, and a fixed constant is used as a truncation point.

The mixture probabilities receive a Dirichlet distribution, and the shape parameters receive exponential priors truncated at the fixed constant; the paragraph above Theorem 1 points out the reason for this truncation. The set of hyperparameters comprises the parameters of these priors together with those used to specify the priors on the remaining component parameters.

Denote the collection of all variables and hyperparameters given above by a single symbol and, for ease of notation, condition on all remaining variables when updating any one block. For each of the K components, let one quantity denote the number of individuals assigned to that component and another the sum of the latent mixing variables from that component. To sample from the posterior distribution, we use a Gibbs sampler for all parameters other than the shape parameters, combined with a Metropolis-Hastings step within Gibbs for the shapes, based on the posterior full-conditional distributions.

The full conditionals involve truncated distributions, including a truncated Gamma, reflecting the restrictions on the corresponding parameters. Since the posterior distribution of the shape parameters does not belong to a standard family, we implement a Metropolis-Hastings algorithm within the Gibbs sampler to update them, using a Gamma proposal distribution and accepting the proposed value or keeping the current one according to the general Metropolis-Hastings rule. The proposal distribution is truncated to reflect the prior assumption on the shape parameters.
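The following R sketch illustrates a Metropolis-Hastings update of this type for a single component's shape parameter; the target log-density, the exponential prior, and the tuning constant are placeholders standing in for the paper's exact full conditional and proposal.

# Illustrative MH update for a Gamma shape parameter restricted to (1, Inf),
# with a Gamma proposal whose mean is the current value.
log_target <- function(alpha, theta_k, mu_k) {
  if (alpha <= 1) return(-Inf)                       # prior support restricted to (1, Inf)
  sum(dgamma(theta_k, shape = alpha, rate = alpha / mu_k, log = TRUE)) +
    dexp(alpha, rate = 1, log = TRUE)                # placeholder truncated-exponential prior
}
mh_update_alpha <- function(alpha, theta_k, mu_k, tune = 10) {
  prop <- rgamma(1, shape = tune, rate = tune / alpha)      # proposal with mean alpha
  log_acc <- log_target(prop, theta_k, mu_k) - log_target(alpha, theta_k, mu_k) +
    dgamma(alpha, shape = tune, rate = tune / prop, log = TRUE) -
    dgamma(prop,  shape = tune, rate = tune / alpha, log = TRUE)
  if (is.finite(log_acc) && log(runif(1)) < log_acc) prop else alpha
}

Here theta_k denotes the latent mixing variables currently assigned to component k and mu_k its mean; both names are ours.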

For all of the simulations presented, we treat the error variances as known: this is reasonable in our examples and is a standard assumption in deconvolution theory. Default values were selected for the hyperparameters, and a sensitivity analysis showed little sensitivity to different choices. The marginal density of the latent variable, our estimator, is computed as the average of the marginal density evaluated at each MCMC iteration. We call the method the Bayes density deconvolution with shape constraint estimator (Constrained Bayes Deconvolution).
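A sketch of how such an estimate can be assembled from MCMC output (our own illustration; posterior_draws below is a hypothetical list of sampled parameter values, not an object defined in the paper):

# Evaluate the latent density implied by one draw of (weights, means, shape) on a grid,
# using f_X(x) = E[ 1{|x| <= theta} / (2*theta) ] under the sampled mixture for theta.
latent_density <- function(grid, pi_k, mu_k, alpha, n_mc = 2000) {
  z     <- sample.int(length(pi_k), n_mc, replace = TRUE, prob = pi_k)
  theta <- rgamma(n_mc, shape = alpha, rate = alpha / mu_k[z])
  sapply(grid, function(x) mean((abs(x) <= theta) / (2 * theta)))
}
# Final estimate: average the per-iteration densities across the MCMC draws.
# fhat <- Reduce(`+`, lapply(posterior_draws, function(d)
#           latent_density(grid, d$pi_k, d$mu_k, d$alpha))) / length(posterior_draws)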

Our Constrained Bayes Deconvolution method is easily seen to be scalable, in that its cost is linear in the sample size; indeed, in Section 6.2 it is shown to handle a sample size of nearly a million. It is written in R with use of the package Rcpp.

5 Simulations

5.1 Overview

We conducted simulations for two distinctly different problems. In the first, the target latent density is a standard t-distribution with 5 degrees of freedom. In the second, related to our examples, the latent variable has a density that is a mixture of (a) t random variables with 5 degrees of freedom and (b) values with mean zero and very small variability. In addition, for each problem, we consider both homoscedastic and heteroscedastic measurement errors generated from either the Normal or the Laplace distribution.

The second problem is the important one for us, given the type of data we want to analyze, while the first is simply meant to show that we are competitive with the standard method, namely the kernel density deconvolution estimator, in standard problems. The kernel estimator has two versions, depending on whether the measurement errors are homoscedastic or heteroscedastic. The plug-in bandwidth, which minimizes the asymptotic mean integrated squared error, is used for this estimator in the comparisons with our method; see Delaigle and Meister (2008). The R package deconvolve, published on GitHub, implements the kernel density deconvolution estimator.

In each simulation design we generated data at several sample sizes, with each configuration repeated over many simulated data sets.

We compute posterior samples of the density across the MCMC steps, and the estimated density is obtained as the mean of these posterior samples. For each simulated data set, the estimated and true densities are compared via the square root of the integrated squared error (ISE), the integrated absolute error (IAE) and the Wasserstein distance. An overall summary is given in Section 5.4.
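A sketch of these discrepancy measures on an equally spaced grid (our own illustration; f_hat and f_true are density values on the grid):

# Root integrated squared error, integrated absolute error, and first Wasserstein
# distance between an estimated and a true density evaluated on the same grid.
density_metrics <- function(f_hat, f_true, grid) {
  dx  <- diff(grid)[1]                                      # grid spacing
  ise <- sqrt(sum((f_hat - f_true)^2) * dx)                 # square root of ISE
  iae <- sum(abs(f_hat - f_true)) * dx                      # IAE
  w1  <- sum(abs(cumsum(f_hat) - cumsum(f_true))) * dx^2    # W1 via the CDF identity
  c(ISE = ise, IAE = iae, W1 = w1)
}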

5.2 When the Latent Variable has a t-Distribution With 5 Degrees of Freedom

We generated observations by adding measurement error to latent draws from a t-distribution with 5 degrees of freedom. In the homoscedastic case, the error variance is equal to the variance of the latent variable. In the heteroscedastic case, the error variances vary across observations, with their variance proportional to their mean. In all cases, the observations are subject to substantial measurement error. The estimated densities are displayed in Figures 1–4. The numerical comparisons between our Constrained Bayes Deconvolution method and the kernel method are given in Tables 3 and 4.
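A data-generation sketch for this design (our own illustration; the heteroscedasticity pattern and constants below are placeholders, not the paper's exact values):

# Latent t5 draws contaminated with Normal measurement error.
set.seed(2)
n <- 1000
x <- rt(n, df = 5)                                         # latent variable, variance 5/3
sig_hom <- rep(sqrt(5 / 3), n)                             # homoscedastic: error variance = var(X)
sig_het <- sqrt(rgamma(n, shape = 4, rate = 4 / (5 / 3)))  # heteroscedastic, mean 5/3 (placeholder)
w_hom <- x + rnorm(n, 0, sig_hom)                          # homoscedastic observations
w_het <- x + rnorm(n, 0, sig_het)                          # heteroscedastic observations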

5.3 When the Latent Density has a Tight Peak Around Zero

The setting in this section is designed for cases where the distribution of the latent variable has a large probability clustered near zero, as we expect in our examples. One way to do this is through a mixture structure, assuming that the latent density has one component that is tightly concentrated at zero and another component from a standard density. We implement a mixture of a tightly concentrated, small-scale component and a t-distribution with 5 degrees of freedom, with fixed mixing probabilities. The scale of the first component is chosen small so that the density has a very sharp peak around zero.

In this case, when the true density puts high concentration around zero, in addition to the usual global metrics IAE, ISE and the Wasserstein distance, it is interesting to study how well an estimated density captures the probability of exceeding, in absolute value, a multiple of the standard deviation of the “tight peak” component. With a small abuse of terminology, in what follows “Exceedance” is defined as the absolute difference between this exceedance probability under the estimated density and that under the true density.
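A sketch of the Exceedance metric on a grid (our own illustration; the threshold c plays the role of the chosen multiple of the tight-peak standard deviation):

# Absolute difference of exceedance probabilities beyond |x| > c under the
# estimated and true densities, both evaluated on the same equally spaced grid.
exceedance <- function(f_hat, f_true, grid, c) {
  dx  <- diff(grid)[1]
  out <- abs(grid) > c
  abs(sum(f_hat[out]) * dx - sum(f_true[out]) * dx)
}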

In the homoscedastic case, the error variance is chosen equal to the variance of the latent variable. We implement the heteroscedastic case by adjusting the form used in Section 5.2 so that the mean of the error variances exceeds the variance of the latent variable. Again, in all cases the observations are subject to substantial measurement error. The estimated densities are displayed in Figures 5–8. The numerical comparisons between our Constrained Bayes Deconvolution method and the kernel method are given in Tables 5 and 6.

5.4 Conclusions from the Simulations

For both the simulations in Section 5.2 and Section 5.3, with either homoscedastic or heteroscedastic error, we observe that under the global metrics ISE and IAE, large gains in efficiency are achieved with our Constrained Bayes Deconvolution estimator over the deconvoluting kernel estimator across all choices of sample size. Also, from the figures and tables of Section 5.3, with either homoscedastic or heteroscedastic error, the Constrained Bayes Deconvolution estimator performs much better in capturing the peak as well as the tail behavior, from both a visual check and the Exceedance metric. Lastly, the kernel deconvolution estimator gives a biased peak for our sample sizes when the errors are heteroscedastic.

6 Genome Wide Association Applications

6.1 Background

In this section, we describe the results of a genome-wide association study (GWAS) for which our method is particularly appropriate. In the Supplementary Material, we also describe results from a microarray experiment, which reach similar conclusions.

6.2 Height data

Our data come from a genome-wide association study for height (Allen et al., 2010). The study data we have involve 133,653 individuals, each with 941,389 measured SNPs. The goal of the study was to understand which SNPs are related to height, either positively or negatively. Because relatively few SNPs have an appreciable effect on height, the simulation of Section 5.3 is particularly relevant.

The data we have access to are regression coefficients of standardized height on individual centered and standardized SNPs, and are thus estimated marginal effect sizes. If we regress the standardized response on a single standardized SNP, the estimated effect size is, because of the large sample size involved, approximately normally distributed about the true effect size, with a measurement error variance equal to the residual variance of that regression divided by the sample size. Clearly, because of the large sample size and this division, each error variance is well estimated and thus essentially known, but heteroscedastic.
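In placeholder notation (our symbols), with $Y$ the standardized height, $G_j$ the $j$th centered and standardized SNP, $n$ the number of individuals, and $\sigma_{e,j}^2$ the residual variance of the regression of $Y$ on $G_j$, the marginal least-squares estimate satisfies, approximately,
\[ \widehat\beta_j \sim N\!\left(\beta_j, \sigma_j^2\right), \qquad \sigma_j^2 \approx \sigma_{e,j}^2 / n, \]
so the error variances $\sigma_j^2$ are essentially known but differ across SNPs.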

For our Constrained Bayes Deconvolution estimator, we ran 5000 MCMC iterations using the same hyperparameters as in the simulation section. There was a difficulty with the deconvoluting kernel density estimator, because its current implementation is exceedingly slow and resulted in a memory issue on a Linux machine with an Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz. As a result, we subsampled 1% of the SNPs (by taking every 100th SNP) to obtain results for this estimator, although such subsampling was unnecessary for our efficient implementation of the Constrained Bayes Deconvolution estimator. We confirmed that our Constrained Bayes Deconvolution estimator gave very similar results for the full data and the subsampled data. We also ran the R package KernSmooth to obtain the naive kernel density estimator that ignores measurement error: as expected, our Constrained Bayes Deconvolution estimator dominated it as well, for both the full and subsampled data.

The resulting density estimators are shown in Figure 9. Among the three, our Constrained Bayes Deconvolution method yields a density that has a much sharper peak. This is expected, as in the simulation of Section 5.3, because regular kernel methods, deconvolved or not, cannot handle well this type of very non-standard, but practically important, density.

In addition to the graphical comparison, quantitative comparisons were also made. We computed the estimated probability that the effect size exceeds, in absolute value, several choices of minimum effect size, displayed in Table 1 and Figure 10. As mentioned above, the effect sizes of all SNPs are used for our Constrained Bayes Deconvolution and naive kernel estimators, while those of every 100th SNP are used for the kernel deconvolution estimator.

Minimum effect size
Estimator 0.002 0.0025 0.003 0.0035 0.004 0.0045 0.005
Constrained Bayes 0.253 0.175 0.104 0.067 0.040 0.021 0.007
Kernel 0.426 0.346 0.286 0.226 0.191 0.159 0.130
Naive Kernel 0.561 0.466 0.382 0.310 0.248 0.196 0.133
Table 1: Estimated probability that the absolute value of the effect size on height exceeds the given minimum effect size, under our Constrained Bayes Deconvolution method (Constrained Bayes), the deconvoluting kernel density estimator (Kernel) and the naive ordinary kernel density estimator (Naive Kernel), for the GIANT height effect sizes.

A scientific question in GWAS is to predict the number of significant SNPs for a given sample size, i.e., a given number of individuals. Current scientific discoveries are based on the significance of p-values (with a Bonferroni-corrected significance level) for individual SNPs, followed by an “LD clumping” step which selects independent SNPs using their linkage disequilibrium. In recently published GWAS of height, Allen et al. (2010), Wood et al. (2014) and Yengo et al. (2018), the number of individuals increased from 133K to 253K to 700K, leading to 180, 697 and 3290 significant discoveries using the described method or more complicated methods involving joint SNP effects.

We now briefly discuss the relevance of our density estimation procedure to such sample size calculations; additional details are deferred to Section S.1.2 of the Supplementary Material. Suppose an observed effect size equals the corresponding true effect size, which has the density we estimate, plus a normal error whose variance is displayed as a constant here for notational simplicity. A standard approach (Chatterjee et al., 2013) for predicting the number of effect sizes achieving genome-wide significance at a given sample size is provided by the projection formula, which integrates the power of the significance test, as a function of the true effect size and the sample size, against the effect-size density; the power is expressed through the cumulative distribution function and the relevant quantile of a standard normal random variable.

We can obtain point and interval estimates for this quantity from our MCMC output. A Monte Carlo integration is performed to approximate the projection formula using the posterior samples of the effect-size density, leading to the desired point prediction. We can further quantify the posterior variability of the predicted number by repeating the calculation on posterior draws dispersed over the MCMC chain. Since scientists are generally interested in the number of independent SNPs that are discovered, we first selected a subset of independent SNPs, based on the linkage disequilibrium between the SNPs, before estimating the density of the effect sizes using our procedure. More details about these procedures can be found in Section S.1.2 of the Supplementary Material.
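The following R sketch illustrates one Monte Carlo approximation of the projection formula; the two-sided z-test power function and the Bonferroni level used below are standard choices that we assume for illustration, not necessarily the paper's exact formula.

# Expected number of genome-wide significant discoveries among M independent SNPs
# at a new sample size n_new, given posterior/predictive draws from the effect-size
# density and the residual regression variance.
project_discoveries <- function(beta_draws, resid_var, n_new, M, alpha = 5e-8) {
  z     <- qnorm(1 - alpha / 2)                   # two-sided significance threshold
  ncp   <- beta_draws * sqrt(n_new / resid_var)   # noncentrality at the new sample size
  power <- pnorm(-z + ncp) + pnorm(-z - ncp)      # detection probability per effect size
  M * mean(power)                                 # Monte Carlo projection
}

Repeating this calculation over draws of the effect-size density taken at different points of the MCMC chain yields a point prediction and an interval in the spirit of those reported in Table 2.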

We report in Table 2 the posterior mean of these predicted numbers as our estimator of the expected number of SNPs discovered, together with a credible interval for that number. Although we make the uncommon assumption that none of the effect sizes are exactly zero, our estimates in Table 2 are in the ballpark of the actual numbers from the three cited papers. A clear advantage of using a valid density estimator of the true effect sizes in conjunction with the projection formula is that it provides a cheap and simple calculation without carrying out any large-scale experiments. That is, we obtain the density estimator from the smallest height study and then quantify the number of significant SNPs, including its uncertainty, for larger studies, given no information except their sample sizes. Hence our method can be used to infer the sample size required for a given expected number of discoveries.

Number of individuals
133K 253K 700K
Exp. Disc. 134 375 2907
C.I. (125, 143) (357, 394) (2790, 3039)
Table 2: Estimated value (Exp. Disc.) and a credible interval (C.I.) for the expected number of SNPs discovered as the number of individuals varies. The predicted numbers are obtained by applying the projection formula to posterior samples of the effect-size distribution.

7 Discussion

We have considered the case of nonparametric density deconvolution with possibly heteroscedastic measurement errors, where the true densities are subject to shape constraints, in our case symmetry and unimodality. We are particularly interested in applications where there is a large probability near zero coupled with possibly heavy tailed distributions. We showed that our method, which we call Constrained Bayes Deconvolution, is nonparametrically consistent for estimating the true target density in general, and is particularly well-equipped for the mixture problem described immediately above. Computationally, it is linear in the sample size, and hence highly scalable.

Mixtures of uniforms are known to contain the Normal variance-mixture class (Wang and Pillai, 2013) described in Section 1, and have been utilized in various applications for modeling a symmetric unimodal density. However, the flexibility of such a model depends critically on the flexibility of the mixing distribution. Our carefully designed choice of a Dirichlet process mixture of Gammas for this mixing distribution has large support on the space of densities on the positive real line, leads to efficient computation, and is provably consistent. Different approaches, based instead on mixtures of Normals, include Stephens (2016); a very different approach, based on a computation in Yang et al. (2012), has been taken by Zhang et al. (2018), wherein they fit a regression to a large number of predictors, obtain the joint regression coefficients, and then use approximations and linear model calculations to reduce to the marginal effects, which in this context are our true effect sizes. Zhu and Stephens (2017) is a Bayesian approach similar to Zhang et al. (2018). This particular approach (Zhu and Stephens, 2017) seems to be limited to genome-wide association studies based on SNPs, where the linkage disequilibrium (correlation) between the SNPs is known.

While we are not limited to the effect size context, in that context it might be interesting to replace the idea of a large probability near zero with a point mass exactly at zero, which has been done in the mixtures-of-Normals setting by Stephens (2016) and Zhang et al. (2018). This is possible within our framework and will be reported upon elsewhere; the corresponding results for Table 1 are much the same.

Supplementary Material

The Supplementary Material includes a data analysis of a microarray experiment. The R code is available from the last author. Code for the simulations is provided at https://github.com/tamustatsy/Constrained_Deconvolution/.

Acknowledgments

Su and Carroll were supported by a grant from the National Cancer Institute (U01-CA057030). Bhattacharya was supported by National Science Foundation grant NSF DMS-1613156 and an NSF CAREER Award (DMS-1653404). Zhang and Chatterjee were partially funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1602-34530). The authors were also supported in part by a grant from the National Human Genome Research Institute (R01-HG010480). The statements and opinions in this article are solely the responsibility of the authors and do not necessarily represent the views of PCORI, its Board of Governors or Methodology Committee. The authors are grateful to Aurore Delaigle of the University of Melbourne and her collaborators for publishing the R package deconvolve for homoscedastic and heteroscedastic kernel density deconvolution on GitHub.

References

  • Allen et al. (2010) Allen, H. L., Estrada, K., Lettre, G., Berndt, S. I., Weedon, M. N., Rivadeneira, F., Willer, C. J., Jackson, A. U., Vedantam, S., Raychaudhuri, S., et al. (2010). Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 467, 832–838.
  • Bochkina and Rousseau (2017) Bochkina, N. and Rousseau, J. (2017). Adaptive density estimation based on a mixture of Gammas. Electronic Journal of Statistics, 11, 916–962.
  • Brunner and Lo (1989) Brunner, L. J. and Lo, A. Y. (1989). Bayes methods for a symmetric unimodal density and its mode. Annals of Statistics, 17, 1550–1566.
  • Carroll and Hall (1988) Carroll, R. J. and Hall, P. (1988). Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association, 83, 1184–1186.
  • Chatterjee et al. (2013) Chatterjee, N., Wheeler, B., Sampson, J., Hartge, P., Chanock, S. J., and Park, J.-H. (2013). Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nature Genetics, 45, 400.
  • Chu (1973) Chu, K. C. (1973). Estimation and decision for linear systems with elliptical random processes. IEEE Transactions on Automatic Control, 18, 499–505.
  • Davidson et al. (2004) Davidson, L. A., Nguyen, D. V., Hokanson, R. M., Callaway, E. S., Isett, R. B., Turner, N. D., Dougherty, E. R., Wang, N., Lupton, J. R., Carroll, R. J., et al. (2004). Chemopreventive n-3 polyunsaturated fatty acids reprogram genetic signatures during colon cancer initiation and progression in the rat. Cancer Research, 64, 6797–6804.
  • Delaigle and Meister (2008) Delaigle, A. and Meister, A. (2008). Density estimation with heteroscedastic error. Bernoulli, 14, 562–579.
  • Donnet et al. (2018) Donnet, S., Rivoirard, V., Rousseau, J., and Scricciolo, C. (2018). Posterior concentration rates for Empirical Bayes procedures, with applications to Dirichlet Process mixtures. Bernoulli, 24, 231–256.
  • Fan (1991) Fan, J. (1991). On the optimal rates of convergence for nonparametric deconvolution problems. Annals of Statistics, 19, 1257–1272.
  • Feller (1971) Feller, W. (1971). An Introduction to Probability Theory and its Applications, volume 2. John Wiley & Sons.
  • Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1, 209–230.
  • Gao and van der Vaart (2016) Gao, F. and van der Vaart, A. (2016). Posterior contraction rates for deconvolution of Dirichlet-Laplace mixtures. Electronic Journal of Statistics, 10, 608–627.
  • Ghosal et al. (2000) Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics, 28, 500–531.
  • Ishwaran and Zarepour (2002) Ishwaran, H. and Zarepour, M. (2002). Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30, 269–283.
  • Kalli et al. (2011) Kalli, M., Griffin, J. E., and Walker, S. G. (2011). Slice sampling mixture models. Statistics and Computing, 21, 93–105.
  • Nguyen (2013) Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41, 370–400.
  • Purcell et al. (2007) Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller, J., Sklar, P., De Bakker, P. I., Daly, M. J., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81, 559–575.
  • Sarkar et al. (2014) Sarkar, A., Mallick, B. K., Staudenmayer, J., Pati, D., and Carroll, R. J. (2014). Bayesian semiparametric density deconvolution in the presence of conditionally heteroscedastic measurement errors. Journal of Computational and Graphical Statistics, 23, 1101–1125.
  • Sarkar et al. (2017) Sarkar, A., Pati, D., Chakraborty, A., Mallick, B. K., and Carroll, R. J. (2017). Bayesian semiparametric multivariate density deconvolution. Journal of the American Statistical Association, 112.
  • Scricciolo (2018) Scricciolo, C. (2018). Bayes and maximum likelihood for L¹-Wasserstein deconvolution of Laplace mixtures. Statistical Methods & Applications, 27, 333–362.
  • Sethuraman (1994) Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–650.
  • Shen et al. (2013) Shen, W., Tokdar, S. T., and Ghosal, S. (2013). Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika, 100, 623–640.
  • Stefanski and Carroll (1990) Stefanski, L. and Carroll, R. J. (1990). Deconvoluting kernel density estimators. Statistics, 21, 165–184.
  • Stephens (2016) Stephens, M. (2016). False discovery rates: a new deal. Biostatistics, 18, 275–294.
  • Turnbull and Ghosh (2014) Turnbull, B. C. and Ghosh, S. K. (2014). Unimodal density estimation using Bernstein polynomials. Computational Statistics & Data Analysis, 72, 13–29.
  • Villani (2008) Villani, C. (2008). Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
  • Wang and Pillai (2013) Wang, H. and Pillai, N. S. (2013). On a class of shrinkage priors for covariance matrix estimation. Journal of Computational and Graphical Statistics, 22, 689–707.
  • West (1987) West, M. (1987). On scale mixtures of Normal distributions. Biometrika, 74, 646–648.
  • Wood et al. (2014) Wood, A. R., Esko, T., Yang, J., Vedantam, S., Pers, T. H., Gustafsson, S., Chu, A. Y., Estrada, K., Luan, J., Kutalik, Z., et al. (2014). Defining the role of common variation in the genomic and biological architecture of adult human height. Nature Genetics, 46, 1173.
  • Yang et al. (2012) Yang, J., Ferreira, T., Morris, A. P., Medland, S. E., Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., Weedon, M. N., Loos, R. J., et al. (2012). Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nature Genetics, 44, 369.
  • Yengo et al. (2018) Yengo, L., Sidorenko, J., Kemper, K. E., Zheng, Z., Wood, A. R., Weedon, M. N., Frayling, T. M., Hirschhorn, J., Yang, J., Visscher, P. M., et al. (2018). Meta-analysis of genome-wide association studies for height and body mass index in ∼700,000 individuals of European ancestry. Human Molecular Genetics, 27, 3641–3649.
  • Zhang et al. (2018) Zhang, Y., Qi, G., Park, J.-H., and Chatterjee, N. (2018). Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nature Genetics, 50, 1318.
  • Zhu and Stephens (2017) Zhu, X. and Stephens, M. (2017). Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Annals of Applied Statistics, 11, 1561–1592.

Appendix

A.1 Proof of Theorem 1

Below we provide details to verify (3), (4) and (5) from the proof of Theorem 1 in Section 3.2.

Bochkina and Rousseau (2017) derive the posterior convergence rate for Dirichlet location-mixtures of Gammas in the no-measurement-error case, and we obtain some preliminary results on the layer of the mixing density from their work. It is worth pointing out that, since our condition on the Dirichlet process base probability measure differs from theirs, only results that are not affected by the choice of prior can be inherited directly in this article. These results follow from Proposition 2.1 and Lemma B.2 of Bochkina and Rousseau (2017). Any density in the function space of Condition 1 can be approximated by convolving a Gamma kernel with a discrete probability measure supported on finitely many points, where the number of support points, the approximation accuracy and the range of the support grow with the sample size at suitable rates; the lower bound we impose on the component means is larger than that used in Bochkina and Rousseau (2017). Balls around the approximating discrete measures then cover the relevant space of mixing densities.

Under our Dirichlet location-mixture of Gammas model, , where the mixing measure follows . Define a prior set , while . The choice of will be specified later.

In Appendix A.1.1 below, we show that on this prior set the following bounds hold,

(A.1)

In Appendix A.1.2, the lower bound for the prior probability of this prior set is derived, namely that

(A.2)

where .

Take the rate sequence so that the bounds above are balanced. From (A.1) and (A.2), the prior set has the required lower bound on its prior probability, while on this set the Kullback-Leibler-type quantities in (A.1) are controlled. Therefore, the prior concentration inequality (5) holds.

Under the prior in Condition 3 and the sequence just defined, the sieve spaces appearing in (3) and (4) are defined as follows. Consider the following subspace of the space of mixing densities.

The sieve space for the mixing densities is built from this subspace and, because of the multi-layer relationship between the mixing, latent and observed densities, the sieve space for the observed density is defined naturally from it. Furthermore, the entropy and prior mass conditions (3) and (4) can be passed from one layer to the other, because the Hellinger distance between two densities at an inner layer is greater than or equal to that between the corresponding observed densities. It remains to show that the mixing-density sieve satisfies (3) and (4).

According to Lemma 4.2 in Bochkina and Rousseau (2017), (4) holds for if for some positive constant ,