Efficient testing and effect size estimation for set-based genetic association inference via semiparametric multilevel mixture modeling: Application to a genome-wide associatio

03/12/2020 ∙ by Shonosuke Sugasawa, et al. ∙ The University of Tokyo 0

In genetic association studies, rare variants with extremely small allele frequency play a crucial role in complex traits, and the set-based testing methods that jointly assess the effects of groups of single nucleotide polymorphisms (SNPs) were developed to improve powers for the association tests. However, the powers of these tests are still severely limited due to the extremely small allele frequency, and precise estimations for the effect sizes of individual SNPs are substantially impossible. In this article, we provide an efficient set-based inference framework that addresses the two important issues simultaneously based on a Bayesian semiparametric multilevel mixture model. We propose to use the multilevel hierarchical model that incorporate the variations in set-specific effects and variant-specific effects, and to apply the optimal discovery procedure (ODP) that achieves the largest overall power in multiple significance testing. In addition, we provide Bayesian optimal "set-based" estimator of the empirical distribution of effect sizes. Efficiency of the proposed methods is demonstrated through application to a genome-wide association study of coronary artery disease (CAD), and through simulation studies. These results suggested there could be a lot of rare variants with large effect sizes for CAD, and the number of significant sets detected by the ODP was much greater than those by existing methods.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The advent of high-throughput technologies have realized the whole genome sequencing, and the genome-wide association studies (GWAS) have successfully identified many common single nucleotide polymorphisms (SNPs) associated with complex traits in the last two decades (Manolio et al., 2008). However, a large portion of the heritability of diseases, behaviors, and other phenotypes cannot be accounted by these genetic variations, and it is the well-known the “missing heritability” problem (Cohen et al., 2004). To explain the missing heritability, recent studies have gradually revealed that “rare variants” have highly impacts to the complex traits, which are the minor allele frequency (MAF) are smaller than 1-5% in populations (e.g. Holm et al., 2011; Rivas et al., 2011). Besides, detections of the associations of the rare variants are challenging due to their extremely low frequency of rare variants and lacks of the statistical powers for the conventional association tests.

To overcome this underpowered problem, set-based testing methods that jointly assess the associations of a group of SNPs (e.g., a set on a gene, pathway, or network) with a phenotype have been developed in the past decade (Wu et al., 2011; Lee et al., 2014). Various set-based tests for the rare variant analyses based on statistical models were proposed, including C-alpha (Neale et al., 2011), SKAT (Wu et al., 2011), SKAT-O (Lee et al., 2014) and HMVD (Cheng et al., 2016). There have also been several methods summarizing rare variant information within a region into a single genetic score (e.g. Li and Leal, 2008; Madsen and Browning, 2009; Price et al., 2010)

. However, the statistical powers of these set-based tests are still usually insufficient. Also, another problem is the effect size estimation. Due to the extremely low frequency of these variants, the ordinary estimators of effect measures (e.g., odds-ratio) are quite unstable. Existing set-based inference frameworks only provided testing methods, and there have been no effective methods to quantifying the effect sizes of these variants.

In this article, we propose an efficient set-based inference framework that addresses the two important issues simultaneously based on the Bayesian semiparametric multilevel mixture models (e.g. Shen and Louis, 1999). We propose to use a multilevel hierarchical model that incorporate the variations in set-specific effects and variant-specific effects, and to apply the optimal discovery procedure (ODP) that achieves the largest overall power in multiple significance testing (Storey, 2007; Noma and Matsui, 2012) via the empirical Bayes framework. As shown in Section 3, the ODP gains the overall powers in the set-based test compared with existing methods such as SKAT-O and HMVD. The multiplicity of the set-based ODP can be adequately adjusted with controlling the false discovery rate (FDR; Benjamini and Hochberg, 1995; Storey, 2002). In addition, the Bayesian formulation enables accurate “set-based” effect size estimation. It should be noted that the effect size estimation for individual SNPs cannot be suitably implemented, even for the Bayesian shrinkage estimators because unstable ordinary estimators (e.g., maximum likelihood estimator) for the extremely low frequency variants are too shrunken and strongly biased. Thus, most of previous rare variant analyses did not discuss the effect size estimates due to these substantial limitations. However, using the proposed multilevel hierarchical model, we can obtain accurate estimates of effect size distributions of individual units of the set-based analysis alternatively via Bayesian optimal estimation of the empirical distribution function (Shen and Louis, 1998). The set-based distribution estimate would provide an additional new relevant information to the set-based inferences. After presenting these proposed methods, we evaluate their performance by simulation studies and assess their practical usefulness via application to the PennCATH study (Reilly et al., 2011), a large GWAS for coronary artery disease (CAD).

This article is organized as follows. We describe the proposed model model and the ODP in Section 2. In Section 3, we provide some simulation results, and we apply our approach to the PennCATH study of CAD in Section 4. Lastly, we conclude with some discussion in Section 5. Technical details are given in the Appendix.

2 Methods

2.1 Hierarchical mixture modeling for effect sizes of multiple rare variants

Suppose we are interested in group associations of regions and rare variants are included in the th region (). Let denote a

-dimensional vector of outcome for the

th variant in the th region. If there are no missing values, and . Correspondingly, let be a matrix of baseline covariates, and be a -dimensional vector of number of minor alleles, so that the elements of are or . We first consider the following generalized linear model:


where is a -dimensional vector of regression coefficient and is the (scalar) effect size of the th variant in the th region. The primary concern in this article is inference on the vector of true effect sizes in each region, for . When , the th region is not associated with the outcome. On the other hand, the th region is associated with the outcome when at least one element in is non-zero. To express such a structure, we consider the following mixture model for :



is the prior probability of being null, namely,

, and and are null and non-null distributions. It is reasonable to force be the -dimensional one-point distribution on . We propose estimating the non-null distribution in a nonparametric way by incorporating smooth-by-rouging approach (Shen and Louis, 1999), that is, we approximate by the following form:

where denotes a one-point distribution on , are fixed knots specified by the user and are probabilities of , satisfying . The probabilities ’s will be estimated later.

Combination of (1) and (2) gives the posterior distribution of given by

where is the likelihood function of (1), and is a vector of unknown parameters in (2). Since may have complicated forms, the computation of the integral appeared in the above distribution would be computationally intensive. Alternatively, we consider an approximated method for computing the posterior distribution of . Specifically, we approximate as a function of

by a normal distribution with mean and variance corresponding to the maximum likelihood estimate

and its asymptotic variance of based on the model (1). The approximation is valid under relatively large , which would be often satisfied in practice. Therefore, the approximated posterior distribution of is given by


This approximated posterior distribution can be derived from the following multilevel models:


independently for . In what follows, we call the model (4) semiparametric multilevel mixture model. Under the model (4), are mutually independent and the marginal distribution of is given by

where is the mixing probabilities in . The unknown model parameters can be estimated by maximizing the marginal likelihood function, the product of for , and the maximization can be efficiently carried out via EM-algorithm (Dempster and Rubin, 1977), where its details are provided in the Appendix.

2.2 Region-specific and variant-specific indices

Some region-specific indices are useful for screening regions. Let be the indicator variable for null/non-null status for the th region, such that if the th region is non-null and otherwise. The value of is unknown and has the prior probability . Let

. The following posterior probability of being non-null is useful for screening regions:

The effect sizes of each variant can be estimated by the posterior mean given by


The distribution of effect sizes, in each region would provide more valuable and interpretable information than the point estimates. In this case, the histogram of posterior means is not necessarily a good estimator of the true distribution (Shen and Louis, 1998), and the optimal estimator of the distribution function is given by


The posterior indices can be estimated by replacing with in (5).

2.3 Screening regions based on the optimal discovery procedure

For detecting and ranking multiple group associations, we follow the optimal discovery procedure (Storey, 2007; Storey et al., 2007) and select significant regions by controlling false discovery rate. To this end, we use the following statistics for each region induced from the estimated semiparametric multilevel mixture model:


This is a model-based version of the optimal discovery statistic (Cao et al., 2009; Noma and Matsui, 2012, 2013). For each , we define an index set in which ODP statistics are equal or greater than . For , we can evaluate the model-based false discovery rate (e.g. McLachlan et al., 2006),

We identify the optimal whose is maximum and smaller than pre-specified proportion of false discovery, e.g. . Due to the flexibility of the semiparametric modeling of the non-null distribution, the proposed method would adequately control the false discovery rate in a wide range of underlying true structures of effect sizes.

3 Simulation study

We assessed the performance of the proposed method through simulation studies. We considered individuals, regions. We randomly generated the number of variants in each region from , where

denotes the Gamma distribution and

is the round function. Then,

is the total number of rare variants, which were around 3000 in simulations. The minor allele frequency (MAF) of each variant was generated from the uniform distribution on

. For generating genotype data, we first generated two -dimensional binary vectors and using rmvbin function in R with a correlation matrix with . We then set the genotype vector to . Also we generated two clinical covariates, and

from the standard normal distribution and the Bernoulli distribution with success probability

. We considered two cases of response variables and used the following generating model:

where . For the effect sizes of in each region, we randomly sampled from the following distribution:

where we considered two cases of null probability , and .

For the simulated dataset, we applied the proposed ODP method with knots . As the alternative screening methods, we applied the optimal sequential kernel association test (SKAT-O; Lee et al., 2014)

and the association test based on hidden Markov model

(HMVD; Cheng et al., 2016) to each region to calculate -value, both of which are available as R packages. Then, significant regions are selected using the -value method (Storey and Tibshirani, 2003) with controlling the FDR. Based on 200 simulations, we calculated the average number of detected regions and true positives at FDR5, 10, 15, or 20% for the two types of responses, which are summarized in Table S1. Overall, larger numbers of significant regions and true positives are discovered by the proposed method, which indicate its efficiency, compared with the direct use of -values combined with the existing methods. It is also confirmed that the efficiency gain of the proposed method appeared for smaller values of and larger values of FDR.

# significant regions # true positive
FDR levels 5% 10% 15% 20% 5% 10% 15% 20%
Continuous ()
ODP (proposed) 107.8 131.2 149.9 167.4 101.2 116.3 124.9 131.0
SKAT-O (-value) 81.8 102.9 118.8 133.9 78.1 93.8 103.5 110.7
HMVD (-value) 89.7 110.9 128.1 144.8 84.6 98.9 107.9 114.5
Continuous ()
ODP (proposed) 166.2 214.4 251.6 285.0 155.7 188.8 208.7 221.9
SKAT-O (-value) 106.8 146.0 176.9 205.6 102.2 134.2 155.6 172.2
HMVD (-value) 122.8 162.7 193.6 222.1 116.5 146.7 166.0 180.8
Binary ()
ODP (proposed) 27.5 50.5 72.0 93.2 26.6 46.9 63.4 77.5
SKAT-O (-value) 16.0 30.3 44.6 59.4 15.3 27.8 39.1 49.5
HMVD (-value) 24.9 43.4 59.4 75.6 23.5 38.9 50.5 60.7
Binary ()
ODP (proposed) 53.6 109.5 161.5 211.1 51.6 99.6 138.0 168.5
SKAT-O (-value) 17.9 41.1 65.1 89.2 17.3 38.6 58.7 77.4
qHMVD (-value) 31.8 61.7 90.9 117.9 30.6 56.9 80.2 99.4
Table 1: Average numbers of significant biomarkers and true positives for the proposed and alternative methods based on 200 simulations.

4 Application

To illustrate the proposed methods in practice, we analyzed a GWAS dataset from the PennCATH study (Reilly et al., 2011). The PennCATH study is a hospital-based cohort study to evaluate genetic risk factors and biomarkers for CAD, and a nested case-control GWAS study of angiographic CAD was conducted (Reilly et al., 2011). After the same preprocessing with Reed et al. (2015), we involved data on cases and controls ( individuals in total) for the GWAS analyses. For the rare variant analyses, we selected the variants whose MAFs were smaller than and at least 3 individuals had the corresponding alleles. We defined the sets of analyses by the groups of SNPs on the same protein coding genes. Then, we adopted genetic regions who have at least 10 rare variants, which results in regions with total SNPs variants in the analyses.

We fitted the proposed multilevel model by the logistic regression model where the outcome variable was the disease status (case

, control ), and four covariates, age, sex (male , female ), high- and low-density lipoprotein cholesterol are involved as adjustment variables. Then, we applied the proposed multilevel mixture model with the knots . By the EM algorithm, the estimated null-probability was

. The histogram of z-scores

and estimated non-null distribution are shown in Figure 2, which shows that almost all the non-null variants have negative effect sizes. Also, a large fraction (about ) of the non-null variants was indicated to have log OR smaller than (for OR scale, about ). In addition, of non-null variants were estimated to have log ORs smaller than (for OR scale, about ), thus possibly have strong preventive factors of CAD.

For the set-based testing analyses, we first applied SKAT-O and HMVD methods. However, applying the -value method for controlling FDR (Storey and Tibshirani, 2003), no significant regions were detected under FDR control at . Even relaxing the threshold to FDR at 20, SKAT-O detected only 3 significant regions. On the other hand, applying the proposed method, we could detect 74 regions under FDR control at . The number of significant sets was much greater than those of the existing methods. The detailed descriptions of the detected regions were provided in the Supplementary Materials.

In addition, using the Bayesian optimal distribution function estimator given by (5), we estimated the effect size distributions for SNPs of each region. In Figure 2, we show the estimated distribution functions in four selected regions which corresponds to ones having the largest upper quantile, the smallest lower

quantile, the largest and smallest standard deviations of the estimated distribution functions, respectively. The estimated frequencies of the null component were different among these regions, and these estimates indicated certain small proportions of SNPs might have large preventive effects (

) for CAD. Also, in the region OLFM3, in spite of the large proportion of null variants, the proposed method could effectively detect such region. Conventional SNP-specific ML and Bayes estimates cannot provide useful snapshots of these effect size information, so these distribution estimates would provide useful summary for assessing the potential impacts of the detected variants.

Figure 1: Histgram of scores of all the variants (left) and estimated distribution of the non-null distribution of effect sizes (right).
Figure 2: The estimated region-wise distribution of effect sizes in selected four selected regions. The bold line on the origin represents the distribution of null effects of rare variants.

5 Discussion

In this article, we developed effective set-based testing and estimating methods for rare variant analyses. As show in the simulation studies of Section 3, we could show the overall powers of the set-based tests were improved by the theoretically most powerful testing method, ODP. In addition, using the multilevel hierarchical mixture model, the empirical Bayes inference could borrow the strengths of the among- and within-variants information. These advantages were also reflected to the effect size estimation method, the Bayesian estimator of empirical distribution function of the variant-specific effect sizes. Since one of the main purpose of the rare variant analyses is screening of candidate SNPs for further investigations, the effect size information would be relevant information for the prioritizing steps.

In the applications to PennCATH study, we could detect 74 significant regions under FDR at . The number of significant regions were much greater than those of the common rare variant analysis procedures with SKAT-O and HMVD. The discordance of these methods would reflect the relative performances of them in the simulation studies in Section 3. Along with the existing evidence for the performances of ODP (e.g. Storey, 2007; Matsui et al., 2018; Otani et al., 2018), the proposed method would have greater overall powers under practical situations.

Besides, the results of the effect size distribution estimation indicated there exists a large number of variants with strong preventive effects of CAD. There have no effective methods that provide effective summary of effect size information in rare variant analyses, thus the proposed new framework would provide relevant information for these studies. Also, although these results indicate the existence of a lot of preventive factors of CAD, the existing methods (i.e., SKAT-O and HMVD) could not detect these factors by the limitations of statistical powers. The proposed ODP also still has the limitation of power, but the effect size estimates provide an alternative effective information to the testing results how these rare variants potentially influences to the heritability of diseases, behaviors, and other phenotypes. The power analyses using this framework would be possible (Matsui and Noma, 2011). These new information might provide effective hints for the missing heritability problem, and might derive new insights via applying existing rare variant analyses. These practical investigations are subjects for further researches.


This research was supported by CREST (grant number: JPMJCR1412) from JST, JSPS KAKENHI (grant numbers: JP16H07406, JP18K12757, JP17H01557), and the Practical Research for Innovative Cancer Control (grant number: 17ck0106266) from the Japan Agency for Medical Research.

Appendix: EM algorithm for computing estimates of the model parameters

By introducing latent variables and , the multilevel model (4) can be expressed as

The complete log-likelihood given the latent variables ’s and ’s is given by

By taking the expectation with respect to the latent variables, we have the following objective function in the M-step:

where and . The maximization steps of reduces to the updating steps given by

Hence, the EM algorithm requires computing and in the E-step and update and ’s in the M-step until convergence.


  • Benjamini and Hochberg (1995) Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 289–300.
  • Cao et al. (2009) Cao, J.and Xie, X. J., S. Zhang, A. Whitehurst, and M. A. White (2009). Bayesian optimal discovery procedure for simultaneous significance testing. BMC Bioinformatics 10, 5.
  • Cheng et al. (2016) Cheng, Y., J. Y. Dai, and C. Kooperberg (2016). Group association test using a hidden markov model. Biostatistics 17, 221–234.
  • Cohen et al. (2004) Cohen, J. C., R. S. Kiss, A. Pertsemlidis, Y. L. Marcel, R. McPherson, and H. H. Hobbs (2004). Multiple rare alleles contribute to low plasma levels of hdl cholesterol. Science 305, 869–872.
  • Dempster and Rubin (1977) Dempster, A.P., L. N. and D. Rubin (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38.
  • Holm et al. (2011) Holm, H., D. F. Gudbjartsson, P. Sulem, G. Masson, H. T. Helgadottir, and C. e. a. Zanon (2011). A rare variant in myh6 is associated with high risk of sick sinus syndrome. Nature Genetics 43, 316–320.
  • Lee et al. (2014) Lee, S., M. C. Wu, and X. Lin (2014). Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13, 762–775.
  • Li and Leal (2008) Li, B. and S. Leal (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. American Journal of Human Genetics 83, 311–321.
  • Madsen and Browning (2009) Madsen, B. E. and S. R. Browning (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics 5.
  • Manolio et al. (2008) Manolio, T. A., F. S. Collins, N. J. Cox, D. B. Goldstein, L. A. Hindorff, D. J. Hunter, and M. I. McCarthy (2008). Finding the missing heritability of complex diseases. Nature 461, 747–753.
  • Matsui and Noma (2011) Matsui, S. and H. Noma (2011). Estimating effect sizes of differentially expressed genes for power and sample-size assessments in microarray experiments. Biometrics 67, 1225–1235.
  • Matsui et al. (2018) Matsui, S., H. Noma, P. Qu, Y. Sakai, K. Matsui, C. Heuck, and J. Crowley (2018). Multi-subgroup gene screening using semi-parametric hierarchical mixture models and the optimal discovery procedure: Application to a randomized clinical trial in multiple myeloma. Biometrics 74, 313–320.
  • McLachlan et al. (2006) McLachlan, G. J., R. W. Bean, and L. B. Jones (2006). A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 22, 1608–1615.
  • Neale et al. (2011) Neale, B., M. Rivas, B. Voight, D. Altshuler, B. Devlin, M. Orho-Melander, S. Kathiresan, S. Purcell, K. Roeder, and M. Daly (2011). Testing for an unusual distribution of rare variants. PLoS Genetics 7, e1001322.
  • Noma and Matsui (2012) Noma, H. and S. Matsui (2012). The optimal discovery procedure in multiple significance testing: an empirical Bayes approach. Statistics in Medicine 31, 165–176.
  • Noma and Matsui (2013) Noma, H. and S. Matsui (2013). Empirical Bayes ranking and selection methods via semiparametric hierarchical mixture models in microarray studies. Statistics in Medicine 32, 1904–1916.
  • Otani et al. (2018) Otani, T., H. Noma, J. Nishino, and S. Matsui (2018). Re-assessment of multiple testing strategies for more efficient genome-wide association studies. European Journal of Human Genetics 26, 1038–1048.
  • Price et al. (2010) Price, A., G. Kryukov, P. de Bakker, S. Purcell, J. Staples, L. Wei, and S. Sunyaev (2010). Pooled association tests for rare variants in exon-resequencing studies. American Journal of Human Genetics 86, 832–838.
  • Reed et al. (2015) Reed, E., S. Nunez, D. Kulp, J. Qian, M. P. Reilly, and A. S. Foulkes (2015). A guide to genome-wide association analysis and post-analytic interrogation. Statistics in Medicine 34, 3769–3792.
  • Reilly et al. (2011) Reilly, M. P., M. Li, J. He, J. Ferguson, I. Stylianou, and N. e. a. Mehta (2011). Identification of adamts7 as a novel locus for coronary atherosclerosis and association of abo with myocardial infarction in the presence of coronary atherosclpmid21378990erosis: two genome-wide association studies. Lancet 377, 383–392.
  • Rivas et al. (2011) Rivas, M. A., M. Beaudoin, A. Gardet, C. Stevens, Y. Sharma, and C. K. e. a. Zhang (2011). Deep resequencing of gwas loci identifies independent rare variants associated with inflammatory bowel disease. Nature Genetics 43, 1066–1073.
  • Shen and Louis (1998) Shen, W. and T. Louis (1998). Triple-goal estimates in two-stage hierarchical models. Journal of the Royal Statistical Society, Series B 60, 455–471.
  • Shen and Louis (1999) Shen, W. and T. Louis (1999). Empirical Bayes estimation via the smoothing by roughening approach. Journal of Computational and Graphical Statistics 8, 800–823.
  • Storey (2007) Storey, J. (2007). The optimal discovery procedure: a new approach to simultaneous significance testing. Journal of the Royal Statistical Society, Series B 69, 347–368.
  • Storey et al. (2007) Storey, J., J. Dai, and J. Leek (2007). The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics 8, 414–432.
  • Storey (2002) Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 64, 479–498.
  • Storey and Tibshirani (2003) Storey, J. D. and R. Tibshirani (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America 100, 9440–9445.
  • Wu et al. (2011) Wu, M. C., S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. American Journal of Human Genetics 89, 82–93.