A Bayesian Zero-Inflated Negative Binomial Regression Model for the Integrative Analysis of Microbiome Data

Microbiome `omics approaches can reveal intriguing relationships between the human microbiome and certain disease states. Along with identification of specific bacteria taxa associated with diseases, recent scientific advancements provide mounting evidence that metabolism, genetics and environmental factors can all modulate these microbial effects. However, the current methods for integrating microbiome data and other covariates are severely lacking. Hence, we present an integrative Bayesian zero-inflated negative binomial regression model that can both distinguish differential abundant taxa with distinct phenotypes and quantify covariate-taxa effects. Our model has good performance using simulated data. Furthermore, we successfully integrated microbiome taxonomies and metabolomics in two real microbiome datasets to provide biologically interpretable findings. In all, we proposed a novel integrative Bayesian regression model that features bacterial differential abundance analysis and microbiome-covariate effects quantifications which is suitable for general microbiome studies.


Zero-inflated Beta distribution regression modeling

A frequent challenge encountered with ecological data is how to interpre...

Comparing Spatial Regression to Random Forests for Large Environmental Data Sets

Environmental data may be "large" due to number of records, number of co...

Distribution-on-Distribution Regression via Optimal Transport Maps

We present a framework for performing regression when both covariate and...

Identifying microbial drivers in biological phenotypes with a Bayesian Network Regression model

In Bayesian Network Regression models, networks are considered the predi...

Generalized Bayesian Regression and Model Learning

We propose a generalized Bayesian regression and model learning tool bas...

Quantitative evaluation of regulatory policies for reducing deforestation using the bent-cable regression model

Reducing and redressing the effects of deforestation is a complex public...

Colombian Women's Life Patterns: A Multivariate Density Regression Approach

Women in Latin America and the Caribbean face difficulties related to th...

1 Introduction

The human microbiome is estimated to contain

bacteria (Sender and others, 2016) and microbial genes (Qin and others, 2010). Microbial communities have a profound impact on human health (Ursell and others, 2012). Recently, microbiome studies have identified disease-associated bacteria taxa in type 2 diabetes (Karlsson and others, 2013), liver cirrhosis (Qin and others, 2014), inflammatory bowel disease (Halfvarson and others, 2017), nonalcoholic fatty liver disease (Loomba and others, 2017), psoriasis (Tett and others, 2017), and melanoma patients responsive to cancer immunotheraphy (Frankel and others, 2017). An increasing number of research projects continue to systematically investigate the role of the microbiome in human diseases (Integrative, 2014).

While innovations in next-generation sequencing technology continue to shape the next steps in the microbiome field, the statistical methods used in microbiome research have not kept pace. For instance, metagenomic shotgun sequencing generates a massive amount of sequence reads that can provide species or isolate level taxonomic resolution (Segata and others, 2012). The subsequent comparative statistical analysis assesses whether specific species are associated with a phenotypic state or experimental condition.

Upon surveying commonly used statistical approaches, one method focuses on the comparison of multi-taxa (Chen and others, 2012; Chen and Li, 2013; Kelly and others, 2015; Zhao and others, 2015; Wu and others, 2016), frequently termed the microbiome community. However, those approaches do not aim to identify differentially abundant species—making clinical interpretation, mechanistic insights, and biological validations difficult. Another approach interrogates each individual bacteria taxa for different groups or conditions. For example, the earlier work of La Rosa and others (2015)

uses a Wilcoxon rank-sum test or Kruskal-Wallis test for groupwise comparisons on microbiome compositional data. Recently, methods developed for RNA-Seq data have been adapted to microbiome studies, e.g. the negative-binomial regression model in

DESeq2 (Love and others, 2014) and overdispersed Poisson model in edgeR (Robinson and others, 2010). These methods, however, are not optimized for microbiome datasets.

Microbial abundance can be affected by covariates, such as metabolites, antibiotics and host genetics. These confounding variables need to be adjusted for more accurate differential abundance analysis. Ultimately, there may be a clinical need to quantify the associations between microbiome and clinical confounders (Kinross and others, 2011; Zhu and others, 2018; Maier and others, 2018). One common approach is to calculate pairwise correlations between all taxa and covariates (Li and others, 2008), but this method may be significantly underpowered. Other model-based methods (Chen and Li, 2013; Wadsworth and others, 2017) have been proposed to detect covariate-taxa associations, but the taxon-outcome associations haven been ignored from these models.

Here, we propose a Bayesian integrative modeling approach to analyze microbiome count data. Our model jointly identifies differentially abundant taxa among multiple groups and simultaneously quantify the taxon-covariate associations, which is a unique strength of this method. Our modeling construction includes several advantages. First, it characterizes the over-dispersion and zero-inflation frequently observed in microbiome count data by introducing a zero-inflated negative binomial model (ZINB). Second, it models the heterogeneity from different sequencing depths, covariate effects, and group effects via a log-linear regression framework on the ZINB mean components. Last, we propose two feature selection processes to simultaneously detect differentially abundant taxa and estimate the covariate-taxa associations using the


priors. We compute Bayesian posterior probabilities for these correlated features and provide Bayesian false discovery rate (FDR). Extensive and thorough use of simulated data demonstrates that our model largely improved performance when compared with existing methods. We present two applications from real microbiome datasets with various covariate sets. Biological interpretations of our results confirm those of previous studies and offer a more comprehensive understanding of the underlying mechanism in disease etiology.

The paper is organized as follows. In Section 2, we introduce the integrative hierarchical mixture model and the prior formulations. Section 3 supplies a brief discussion of the MCMC algorithm and the resulting posterior inference. In Section 4, we evaluate model performance on simulated data through a comparison study. We investigate the covariate association in Section 5. Two real data analyses are shown in Section 6. Our conclusions are presented in Section 7.

2 Hierarchical Model

Our model starts with a high-dimensional count matrix where each entry represents the count of sequence reads belonging to a taxonomy such as bacterial species. Specifically, we denote (usually ) be a microbial abundance matrix, with representing the observed count of the -th sample and -th taxon out of the total samples and taxa (features). Note that the proposed model can also be applied to an operational taxonomic unit (OTU) count table obtained via 16S metagenomics approaches. For an OTU table, each feature would be a taxonomic unit of a bacteria species or genus depending on the sequence similarity threshold (e.g.

). The sample allocation vector

indicates the membership for each sample, where reveals that the -th sample belongs to the -th group. We also denote a covariate matrix where each entry represents the measurement of the -th covariate on the -th sample. The graphical formulation of the proposed model is summarized in Figure 1 and 2.

2.1 Count generating process

In practice, the microbial abundance matrix is characterized by an inflated amount of zeros, resulting from insufficient sampling depth. Meanwhile, the abundance matrix usually consists of extremely large counts. Based on these two facts, we assume that each count is sampled from a zero-inflated negative binomial (ZINB) distribution so as to simultaneously account for both zero-inflation and over-dispersion presented in :


where represents the weight of generating extra zeros, is an indicator function, and

denotes a negative binomial distribution for random variable

with the expectation and dispersion

. Under this parameterization, the variance of

is . A small value of allows modeling of extra variation. Note that increasing

towards infinity yields a Poisson distribution with both expectation and variance equal to

. We assume a Gamma prior for the dispersion parameter .

An equivalent way to model this count generating process is to introduce a latent binary variables

, such that



is from a Bernoulli distribution with parameter

, i.e. . We further impose , which leads to a Beta-Bernoulli prior for with expectation .

2.2 Integrative modeling with feature selection

Microbiome count data is characterized by high variability in the number of reads among samples from different groups (due to distinct biological conditions), or even the the same group (due to uneven sequencing depths). To accommodate this setting, we parameterize the mean parameter of the negative binomial distribution as the multiplicative effects of two positive random effects: 1) the size factor reflects how the sequencing depth affects counts across all taxa observed in the -th sample; 2) the relative abundance for the -th taxon in the -th sample once the sample-specific variability has been accounted for. Our goal is to find a subset of taxa that enables us to discriminate the samples from distinct groups. We introduce a binary latent vector , with indicating that the -th taxon has significantly differential abundance among the groups, and otherwise. Therefore, conditional on , we reparameterize the non-zero kernel of Equation (2) as follows:


As mentioned before, is the sample allocation indicator, is the size factor of the -th sample, which can be estimated from the data (see Section 2.3). We assume an independent Bernoulli prior for each taxon

, and further impose a beta hyperprior on

to formulate a Beta-Bernoulli prior, i.e. . The choice of and incorporates the prior belief that a certain percentage of taxa are discriminatory.

We further specify a log-link function to integrate the covariates into the modeling construction for each relative abundance:


where is a feature-specific baseline parameter for taxon . Note that ’s can be considered as scaling factors adjusting for feature-specific levels across all samples. The group-specific parameter captures the baseline shift between the -th group and the reference group. Note that we set if the -th group is the reference group to avoid identifiability problems arising from the sum of the components. , the -th row of covariate matrix , contains all the covariate measurements for sample . Here, is a -by- vector, with each element modeling the global effect of the -th covariate on the observed counts for the -th taxon. In practice, not all of the covariates are related to the abundance of a taxon. Therefore, we allow different sets of covariates to affect different taxa by specifying a spike-and-slab prior (Brown and others, 1998; Ishwaran and others, 2005) on each as


where indicates the -th covariate is associated with the latent relative abundance for the -th feature, and otherwise. This modeling approach allows us to identify significant covariate-taxa associations, via the selection of the non-zero coefficients, for all discriminatory and non-discriminatory taxonomic features. We complete the model by setting , , and . Letting for all yields a vague prior for the feature-specific baseline parameter. An inverse-gamma (IG) hyperprior is shared by and .

2.3 Size factor estimation

The parameterization of the negative binomial mean, as shown in Equation (3), is a product of the size factor and the relative abundance. It is typical to normalize the size factor first to ensure model identifiablity. Hence, the plug-in estimator (equivalent to a point-mass prior) of is adopted to facilitate the inference based on the relative abundance as shown in Equation (4). The plug-in estimators can be calculated from the observed count matrix . There have been a number of proposals to estimate ’s in the context of RNA-seq data analyses. For instance, the most commonly used approach is to set (Marioni and others, 2008; Mortazavi and others, 2008; Witten, 2011), where TSS is short for “Total Sum Scaling” (TSS). As a further example, Bullard and others (2010) proposed , where denotes the -th percentile of the counts in sample . Witten (2011) and Li and others (2017) present a comprehensive summary of different choices of size factor inference for count data. However, those assumptions are likely not appropriate for highly diverse microbial environments (Weiss and others, 2017). Recently, a so-called cumulative sum scaling (CSS) method has been developed by Paulson and others (2013) as


where we set for our model. CSS can be considered as an adaptive extension of Q75, which is better suited for microbiome data. Combining this with some constraint such that (i.e. ), we are able to obtain a set of identifiable values.

3 Model Fitting and Posterior Inference

Our model space consists of with the extra zero indicators , the dispersion parameters , the feature-specific baselines , the group-specific baselines , the covariate coefficients , the discriminating taxa indicators , and the association indicators

. We explore the joint posterior distribution via implementing a Markov chain Monte Carlo (MCMC) algorithm based on stochastic search variable selection with within-model updates

(Savitsky and Vannucci, 2010). Full details can be found in the Supplementary Material.

We are interested in distinguishing taxa that are differentially abundant among different groups, via , as well as their associations with covariates, via . One way to summarize the posterior distributions of these binary parameters is via the marginal posterior probability of inclusion (PPI). Suppose index the MCMC iterations after burn-in, then PPI of each and can be written as , respectively. Subsequently, important features and covariates can be selected based on a given PPI threshold. Following Newton and others (2004), we choose a threshold that controls the Bayesian FDR. Specifically, we solve the following equations to determine the thresholds:


A well-accepted setting is to set both and equal to , which corresponds to an expected FDR of .

4 Simulation Study

In this section, we evaluated the proposed model using simulated data and compared our model with other existing methods described in the prior microbiome studies. In order to mimic metagenome sequencing data from the real data applications (Section 6), we chose the parameters as follows: we set samples for groups with balanced group size . We chose a large number of candidate features by setting the number of taxa , and randomly selected 20 true discriminant features to evaluate our model performance. Each row of , denoted as , was generated from a Dirichlet-Multinomial distribution as described in Wadsworth and others (2017). For , we let with the row sum and . We further incorporated the feature and covariate effects through by setting with . Compared with Equation (3), this data generating process is different from the assumption of the proposed model. We set , for all and for all selected discriminating features and 0 otherwise. Then for the covariate effects, we first obtained the covariate matrix by sampling each row from the covariate matrix of the liver cirrhosis study in Section 6.1 (with and ). In particular, we sampled 30 covariate records from health and disease groups respectively. For each taxon , we then randomly selected out of covariates and let the corresponding while setting the rest . As for the choice of , we let to evaluate our model under 4 different scenarios. Lastly, we randomly set of the simulated observations to 0 to obtain the extra zeros observed in the real data. The following parameter values were adopted to fully mimic the microbiome data: .

The hyperparameters were specified using the following default settings. For the binary variables with Beta-Bernoulli priors

and , we set , which means that of the taxa are expected to be discriminant features, and of the covariate coefficients to be nonzero. We chose assuming that about half of the zeros are truly missing. For the dispersion parameter with Ga( prior, we set to obtain a vague gamma prior with mean of 100 and variance of 10,000. Next, we specified a flat prior for the variance term and . We refer to the sensitivity analysis reported in the Supplementary Material for more details on the choice of and . When implementing our model on a dataset, we ran four independent chains with different starting points where each feature or covariate was randomly initialized to have . We set iterations as the default and discarded the first half as burn-in. To assess the concordance between four chains, we looked at all the pairwise correlation coefficients between the marginal PPI of and . As mentioned by Stingo and others (2013), high values of correlation suggest that MCMC chains are run for a satisfactory number of iterations. After ensuring convergence, we assessed our model performance based on the averaged result over four chains.

Our goal was to identify the discriminating features (aka. taxa) and the significant feature-covariate associations, i.e. all nonzero and in our model. We thus obtained the PPI for all and , and visualized the accuracy in feature selection using the receiving operating characteristic (ROC) curve. We also compared the false positive rate when all feature-covariate associations were zero. Unlike our integrative Bayesian model, existing methods treat identifying discriminating features and feature-covariate associations as two separate problems. Therefore, we evaluated the model performances in two parts. In the first part, we compared it with Wilcoxon rank-sum test and two widely used differential expression analysis methods implemented by the R packages edgeR (Robinson and others, 2010) and limma (Ritchie and others, 2015). The former models count data using a negative binomial distribution and the latter adopts a linear model for the log-transformed count data. We also simplified our Bayesian model by excluding the covariate terms in Equation (4), in order to make a head to head comparison with existing method. In the second part, we started by implementing a Wilcoxon rank-sum test and edgeR

to remove the feature-outcome effect before the association analysis. In particular, for the Wilcoxon rank-sum test, we first normalized the observed counts using TSS to identify discriminating features under a type I error rate of 5%. Next, we centered the significantly differentiated features by group and the rest across all samples. When using

edgeR, we normalized the count data using the trimmed mean of M values (TMM) (Robinson and Oshlack, 2010) by default. Then we centered each feature either by group or across all samples according to the p-values provided in edgeR with a significance level of 0.05. Here, we did not consider limma

due to its relatively inferior performance in the first part. To detect feature-covariate associations, we implemented multiple linear regression, random forest (RF) and LASSO as follows. We fitted a linear regression model or a RF model between each feature and the covariate matrix

, which yielded p-values or variable importance measures to assess the model fitting. The corresponding ROC curves were calculated based on the results from all features. For LASSO, we also treated each feature as the response variable, and the covariate matrix

as the design matrix. Due to the feature heterogeneity, we computed the ROC curve of each feature and averaged them over all to evaluate the performance of LASSO. Notice that all the p-values generated using different methods were adjusted using the BH (Benjamini and Hochberg, 1995) method to control the FDR.

For each of the four scenarios, Figures (3) and (4) compares the model performance through the averaged ROC curve over 100 simulated datasets. We also included the area under curve (AUC) for each approach. For detecting discriminating features, the proposed method consistently showed high AUC () across all scenarios, and similar results for capturing the feature-covariate association (AUC ). Moreover, the proposed method maintained a low FDR even when all are 0. In addition, the proposed model achieved the highest true positive fraction under a fixed small value of FDR in all scenarios, whereas the commonly used methods tended to suffer from reporting too many false positive findings for estimating both and .

5 Feature-Covariate Association Analysis

In this section, we demonstrate that our model can estimate the association between a taxonomic feature and a covariate by adjusting for the remaining confounders. As a comparison, current approaches rely on correlation analysis between the pairwise microbiome and covariates. Specifically, those analyses converted each observed taxonomic count to a fraction (or termed percentages, intensities) using TSS by sample. Next the Pearson correlation coefficients were calculated between the log scaled fractions and the covariate measurements for each outcome group. Lastly, a Fisher z-transformation

(Fisher, 1915) was applied to obtain the p-values for testing the significance of correlation.

Our model constructs a regression framework to enable the quantification of the relationship between the latent relative abundance and covariate effects through the Equation (4). Based on Equation (4), given a feature and a covariate of interest, we first normalized the observed abundance using CSS and performed logarithmic transformation. Next, to calculate and the group shift , we subtracted the estimated feature-specific influence and other covariates’ impact from the transformed abundance. Lastly, we could evaluate whether our model provided a reasonable estimation () of the feature-covariate association between covariate and the normalized and adjusted observations of feature .

Here, we demonstrated the advantages of the proposed model in estimating the feature-covariate association over the correlation-based method through the simulation. For each feature, we randomly selected four out of seven covariates to have non-zero linear effects on the latent abundances, and generated a simulated dataset following the description in Section 4. We kept the same prior and algorithm settings to obtain the estimations for all parameters of interest. We chose a Bayesian FDR for estimating . Among all feature-covariate combinations, the proposed model achieved sensitivity and specificity rates of and respectively. We randomly chose several pairs of feature and covariate and compared the proposed method and the correlation-based method. Figure 5 displays the results of 2 examples, where the true values of were 1. The two dashed lines in Figure (5 a) or (5 b) have the same slope of as our estimated covariate effect. Both plots suggest that the proposed model is able to capture the feature-covariate relationship. Notice that we did not adjust for the group-specific effect. Hence the differences between two dashed lines represents the group-specific parameter . These results illustrated the advantages in simultaneously detecting the discriminating features and quantifying the feature-covariate associations. Furthermore, we also validated the proposed model had correctly captured the direction of covariate effects in both cases. Figure (5 b) and (5 d) show the results from the correlation-based model. The slope of each dashed line represents the Pearson correlation coefficient. However, there was no significant result as all p-values greater than . The correlation-based method failed to isolate the covariate of interest from the confounders, and it might suggest a wrong direction of covariate effect, as shown in Figure (5 d).

6 Real data analysis

We applied the proposed model on two real data sets. They have different sample sizes ranging from hundreds of samples to only 24 samples. Compared with the analysis methods proposed in the original publications, our model has better performance in detecting differentially abundant bacteria. In addition, our model supports adjusting for biologically meaningful covariates. When adjusting for the metabolic pathway quantities (or metabolites through metabolomics technology) as covariates, our model estimates the association between taxa and metabolism-related functions (or metabolites).

6.1 Liver cirrhosis dataset

Cirrhosis is a late-stage condition of scarring or fibrosis of the liver caused by liver disease such as hepatitis B, hepatitis C, and non-alcoholic fatty liver disease (Abubakar and others, 2015). The liver is connected to the gastrointestinal tract via the hepatic portal and bile secretion systems. Interestingly, distinct gut microbiota signatures have been associated with both early-stage liver diseases and end-stage liver cirrhosis (Garcia-Tsao and Wiest, 2004; Yan and others, 2011; Benten and Wiest, 2012). We applied our model on a gut microbiome dataset from a liver cirrhosis study carried out by Qin and others (2014). All metagenome sequenced samples were available from the NCBI Short Read Archive and the curated microbial abundance matrix was accessible from ExperimentHub (Pasolli and others, 2017).

The full dataset includes 237 samples (Qin and others (2014)). The observed microbial abundance matrix was profiled from the gut microbiome at the species taxonomic level. The study has two patient groups, including 114 health controls and 123 liver cirrhosis patients. We filtered out the phyla with extremely low abundance before the downstream analysis as suggested in Wadsworth and others (2017). We obtained 528 taxa that had at least 2 observed counts in both groups for further analysis. The covariates are metabolic pathway abundances summarized by the bacterial gene contents. We used MetaCyc, a collection of microbial pathways and enzymes involved in metabolism for an extensive amount of organisms (Caspi and others, 2007). We incorporated the 529 MetaCyc pathways as covariates for 237 individuals in the study, and reduced the high correlation among all the candidate pathways by average linkage clustering on their correlation matrix (Wadsworth and others, 2017)

. Specifically, we kept the pathway with the largest fold-change between two groups in each cluster, and decided the number of clusters such that the correlations between the resulting pathways were less than 0.5. Logarithmic transformation and normalizations (zero mean and unit variance) applied to the selected covariates to ensure the zero mean and unit standard deviation. After the pre-processing step, we had seven covariates representing metabolic functions.

In the original study (Qin and others, 2014), differential analyses were based on the Wilcoxon rank-sum test and the p-values were corrected by the BH method. Although a stringent threshold of significance level (0.0001) was used, the authors discovered 79 differentially abundant species and had to restrictively report the 30 top candidates in each group (indicated as blue dots in Figure (a)a). Further, as suggested by the simulation study, these results may reflect a high FDR as covariate effects were not factored in the analysis. In addition, the Wilcoxon rank-sum test cannot account for the pathway effects and thus the associations between bacteria and metabolic pathways were not identified.

We applied the proposed Bayesian ZINB model to jointly analyze the microbial abundance matrix of bacteria and their metabolic pathway abundance. We set a similar hyperparameter setting as discussed in Section 4 by first specifying and . Next, we set to be the same as their default values discussed in Section 4. We ran four independent Markov chains with different starting points. Each chain had 40,000 iterations with the first half discarded as burn-in. We checked the convergence visually and calculated the pairwise Pearson correlation, which ranged from 0.988 to 0.994 for ’s and from 0.982 to 0.989 for ’s. These conclude highly consistent results. The PPIs for all 528 taxa are shown in Figure (a)a, where the dashed line represents the threshold corresponding to an expected FDR of 0.05. We identified 19 differentially expressed taxa, the majority of which are more abundant in the liver cirrhosis group.

The posterior mean of for all identified discriminating taxa is shown in Figure (c)c, and Table 1 contains all the detailed parameter estimations for those taxa. Interestingly, two clear taxonomic branches are distinguished by our model (as indicated by red dots, Figure (a)a): the genera Veillonella and Streptococcus, both of which can originate from the oral cavity. Of note, oral commensal bacteria are able to colonize the distal intestinal tract in liver cirrhosis patients (Qin and others, 2014), probably due to bile acid changes. Veillonella spp. and Streptococcus spp. have been identified as more abundant in patients with primary biliary cholangitis patients (Tang and others, 2017), another hepatic disorder which shares pathophysiologic features with liver cirrhosis (Ridlon and others, 2013). Our joint analysis can also identify associations between microbiota and metabolic pathways (7). For example, the L-alanine biosynthesis (PWY0-1061) is positively correlated with Veillonella. Alanine is a gluconeogenesis precursors in liver metabolism, and increased alanine is thought to induce pyruvate kinase in Veillonella. Thus, this connection between alanine synthesis and Veillonella is intriguing and potentially novel, but biologic validation experiments would be for further clarification.

6.2 Metastatic melanoma dataset

The proposed Bayesian ZINB model can perform integrative analysis of microbiome taxonomic data and other ‘omics datasets. In this section, we applied this model to jointly analyze microbiome and metabolomics data from a study of advanced stage melanoma patients receiving immune checkpoint inhibitor therapy (ICT) in which we used metagenomic shotgun sequencing and unbiased shotgun metabolics with goal of identifying unique microbiome taxonomic and metabolomic signatures in those patients who responded favorably to ICT (Frankel and others, 2017).

A subset of patients in this study () were treated with ipilimumab and nivolumab(IN), a combination therapy that has been shown to be more efficacious than therapy with anti-PD1 or anti-CTLA4 therapy alone. 16 patients responded to treatment and 8 patients had progression. We performed quality control steps on metagenomic shotgun sequence reads and profiled them using MetaPhlAn Segata and others (2012) as described in (Frankel and others, 2017). We filtered out taxa with at most one observation in either patient group, which left taxa from species to kingdom level. For the same fecal samples, we performed metabolomics profiling and quantified 1,901 patients’ metabolite compounds as the covariate matrix . We are interested in statistically assessing how the biochemical volumes between patient groups are associated with bacteria burden or quantities. We adopted the same strategy as mentioned in section 6.1 to reduce the correlation between covariates, which resulted in a matrix as the covariate matrix of .

In the model fitting stage, for prior specification, we used to obtain a sparser covariate effect due to the small sample size, and it suggested about 10% of covariates were significant. We kept the same default setting for the rest of the hyperparameters. Next, we ran four independent chains with different starting points. After burning in the first half of 40,000 iterations for each chain. Although the small and unbalanced sample size poses challenges for parameter estimation, the second half iterations showed high pairwise Pearson correlations for (ranging from 0.989 to 0.992) and (ranging from 0.927 to 0.953). We calculated PPIs for all taxa in Figure (b)b, as well as posterior means for multiple parameters in our model (Figure (d)d and Table 2).

Our model jointly identified differentially abundant species and revealed the microbiome-metabolite associations. First, among all 7 taxa identified, it is of specific interest to investigate the responder-enriched taxon Bifidobacterium (genus level), Bifidobacteriaceae (family level). Bifidobacterium, nesting within Bifidobacteriaceae, is a genus of gram-positive, nonmotile, often branched anaerobic bacteria (Schell and others, 2002). Bifidobacteria are one of the major genera of bacteria that make up the gastrointestinal tract microbiota in mammals. This result about Bifidobacterium is supported by recent melanoma studies. Sivan and others (2015) compared melanoma growth in mice harboring specific microbiota, and used sequencing of the 16S ribosomal RNA to identify Bifidobacterium as associated with the antitumor effects. They also found that oral administration of Bifidobacterium augmented ICT efficacy. Moreover, Matson and others (2018) detected significant association between several species from Bifidobacterium with patients’ outcomes in an immunotherapy treatment study for metastatic melanoma. Both studies showed consistent direction of effect, as did our model. The responder-enriched taxon Bifidobacterium were estimated to negatively correlated with 2-oxoarginine and 2-hydroxypalmitate in 7(d). The suppression of these fatty-acid metabolites may be necessary for better cancer treatment as they formed the structure and have the oncogenic signaling role in cancer cells (Louie and others, 2013).

7 Conclusion

In this paper, we presented a Bayesian ZINB model for analysis of high-throughput sequencing microbiome data. Our method is novel in simultaneously incorporating the effect from measurable genetic covariates and identifying differentially abundant taxa for multiple patient groups in one statistical framework. This allows for integrative analysis of microbiome data and other omics data. Our method is flexible, as it allows identification and estimation of the association between covariates and each taxon’s abundance. These results could potentially guide clinical decisions on precision shaping of the microbiome, although results would need to be validated in preclinical models first. In addition, our method is computationally feasible in posterior inferences. We implemented the MCMC algorithm to analyze the data from two metagenomic shotgun sequencing studies and the computations took only minutes.

In real data analysis, our method identified differentially abundant species and these species are often cluttered in the same phylogenetic branch. These parsimonious results are achieved without imposing the phylogenetic structures in the model. This highlights that the results from our model are biologically interpretable and thus capable of guiding further biological mechanism studies. Our results on the metastatic melanoma study uncover novel relationships between taxa and metabolites which merit further experimental investigation.

The framework of our model allows for several extensions. For example, the current methods support two phenotype groups. If there are multiple groups (e.g. the intermediate phenotypes), the current model can incorporate group-specific parameters while holding the other parameters unchanged (e.g. the latent microbiome abundance can be inferred in the same way). Then the same posterior inferences can be applied. Another interesting extension would be to analyze correlated covariates such as longitudinal clinical measurements (Zhang and others, 2017).

8 Software

Software in the form R/C++ is available on GitHub https://github.com/shuangj00/IntegrativeBayes. All the simulated datasets analyzed in Section 4 and two real datasets presented in Section 6 are available on figshare https://figshare.com/projects/IntegrativeBayesZINB/57980.

9 Supplementary Material

Supplementary material is available online at http://biostatistics.oxfordjournals.org.


The authors thank Jiwoong Kim for processing the microbiome data and helpful discussions. We thank Jessie Norris for her comments on the manuscript. Conflict of Interest: Dr. Andrew Koh is a consultant for Merck and Saol Therapeutics.


  • Abubakar and others (2015) Abubakar, II, Tillmann, T and Banerjee, A. (2015). Global, regional, and national age-sex specific all-cause and cause-specific mortality for 240 causes of death, 1990-2013: a systematic analysis for the Global Burden of Disease study 2013. Lancet 385(9963), 117–171.
  • Benjamini and Hochberg (1995) Benjamini, Yoav and Hochberg, Yosef. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 289–300.
  • Benten and Wiest (2012) Benten, Daniel and Wiest, Reiner. (2012). Gut microbiome and intestinal barrier failure–The “Achilles heel” in hepatology? Journal of Hepatology 56(6), 1221–1223.
  • Brown and others (1998) Brown, Philip J, Vannucci, Marina and Fearn, Tom. (1998). Multivariate Bayesian variable selection and prediction. Journal of the Royal Statistical Society: Series B (Methodology) 60(3), 627–641.
  • Bullard and others (2010) Bullard, James H, Purdom, Elizabeth, Hansen, Kasper D and Dudoit, Sandrine. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11(1), 94.
  • Caspi and others (2007) Caspi, Ron, Foerster, Hartmut, Fulcher, Carol A, Kaipa, Pallavi, Krummenacker, Markus, Latendresse, Mario, Paley, Suzanne, Rhee, Seung Y, Shearer, Alexander G, Tissier, Christophe and others. (2007). The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Research 36(suppl_1), D623–D631.
  • Chen and others (2012) Chen, Jun, Bittinger, Kyle, Charlson, Emily S, Hoffmann, Christian, Lewis, James, Wu, Gary D, Collman, Ronald G, Bushman, Frederic D and Li, Hongzhe. (2012). Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics 28(16), 2106–2113.
  • Chen and Li (2013) Chen, Jun and Li, Hongzhe. (2013). Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. The Annals of Applied Statistics 7(1).
  • Fisher (1915) Fisher, Ronald A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10(4), 507–521.
  • Frankel and others (2017) Frankel, Arthur E, Coughlin, Laura A, Kim, Jiwoong, Froehlich, Thomas W, Xie, Yang, Frenkel, Eugene P and Koh, Andrew Y. (2017). Metagenomic shotgun sequencing and unbiased metabolomic profiling identify specific human gut microbiota and metabolites associated with immune checkpoint therapy efficacy in melanoma patients. Neoplasia 19(10), 848–855.
  • Garcia-Tsao and Wiest (2004) Garcia-Tsao, Guadalupe and Wiest, Reiner. (2004). Gut microflora in the pathogenesis of the complications of cirrhosis. Best Practice & Research Clinical Gastroenterology 18(2), 353–372.
  • Gelman and others (2006) Gelman, Andrew and others. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis 1(3), 515–534.
  • Halfvarson and others (2017) Halfvarson, Jonas, Brislawn, Colin J, Lamendella, Regina, Vázquez-Baeza, Yoshiki, Walters, William A, Bramer, Lisa M, D’Amato, Mauro, Bonfiglio, Ferdinando, McDonald, Daniel, Gonzalez, Antonio and others. (2017). Dynamics of the human gut microbiome in inflammatory bowel disease. Nature Microbiology 2(5), 17004.
  • Integrative (2014) Integrative, HMP. (2014). The Integrative Human Microbiome Project: dynamic analysis of microbiome-host omics profiles during periods of human health and disease. Cell Host & Microbe 16(3), 276.
  • Ishwaran and others (2005) Ishwaran, Hemant, Rao, J Sunil and others. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics 33(2), 730–773.
  • Karlsson and others (2013) Karlsson, Fredrik H, Tremaroli, Valentina, Nookaew, Intawat, Bergström, Göran, Behre, Carl Johan, Fagerberg, Björn, Nielsen, Jens and Bäckhed, Fredrik. (2013). Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498(7452), 99.
  • Kelly and others (2015) Kelly, Brendan J, Gross, Robert, Bittinger, Kyle, Sherrill-Mix, Scott, Lewis, James D, Collman, Ronald G, Bushman, Frederic D and Li, Hongzhe. (2015). Power and sample-size estimation for microbiome studies using pairwise distances and PERMANOVA. Bioinformatics 31(15), 2461–2468.
  • Kinross and others (2011) Kinross, James M, Darzi, Ara W and Nicholson, Jeremy K. (2011). Gut microbiome-host interactions in health and disease. Genome Medicine 3(3), 14.
  • La Rosa and others (2015) La Rosa, Patricio S, Zhou, Yanjiao, Sodergren, Erica, Weinstock, George and Shannon, William D. (2015). Hypothesis testing of metagenomic data. In: Metagenomics for Microbiology. Elsevier, pp. 81–96.
  • Li and others (2008) Li, Min, Wang, Baohong, Zhang, Menghui, Rantalainen, Mattias, Wang, Shengyue, Zhou, Haokui, Zhang, Yan, Shen, Jian, Pang, Xiaoyan, Zhang, Meiling and others. (2008). Symbiotic gut microbes modulate human metabolic phenotypes. Proceedings of the National Academy of Sciences 105(6), 2117–2122.
  • Li and others (2017) Li, Qiwei, Guindani, Michele, Reich, Brian J, Bondell, Howard D and Vannucci, Marina. (2017). A Bayesian mixture model for clustering and selection of feature occurrence rates under mean constraints.

    Statistical Analysis and Data Mining: The ASA Data Science Journal

     10(6), 393–409.
  • Loomba and others (2017) Loomba, Rohit, Seguritan, Victor, Li, Weizhong, Long, Tao, Klitgord, Niels, Bhatt, Archana, Dulai, Parambir Singh, Caussy, Cyrielle, Bettencourt, Richele, Highlander, Sarah K and others. (2017). Gut microbiome-based metagenomic signature for non-invasive detection of advanced fibrosis in human nonalcoholic fatty liver disease. Cell Metabolism 25(5), 1054–1062.
  • Louie and others (2013) Louie, Sharon M, Roberts, Lindsay S, Mulvihill, Melinda M, Luo, Kunxin and Nomura, Daniel K. (2013). Cancer cells incorporate and remodel exogenous palmitate into structural and oncogenic signaling lipids. Biochimica et Biophysica Acta (BBA)-Molecular and Cell Biology of Lipids 1831(10), 1566–1572.
  • Love and others (2014) Love, Michael I, Huber, Wolfgang and Anders, Simon. (2014). Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15(12), 550.
  • Maier and others (2018) Maier, Lisa, Pruteanu, Mihaela, Kuhn, Michael, Zeller, Georg, Telzerow, Anja, Anderson, Exene Erin, Brochado, Ana Rita, Fernandez, Keith Conrad, Dose, Hitomi, Mori, Hirotada and others. (2018). Extensive impact of non-antibiotic drugs on human gut bacteria. Nature 555(7698), 623.
  • Marioni and others (2008) Marioni, John C, Mason, Christopher E, Mane, Shrikant M, Stephens, Matthew and Gilad, Yoav. (2008). RNA-Seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 18(9), 1509–1517.
  • Matson and others (2018) Matson, Vyara, Fessler, Jessica, Bao, Riyue, Chongsuwat, Tara, Zha, Yuanyuan, Alegre, Maria-Luisa, Luke, Jason J and Gajewski, Thomas F. (2018). The commensal microbiome is associated with anti–PD-1 efficacy in metastatic melanoma patients. Science 359(6371), 104–108.
  • Matthews and others (1985) Matthews, DR, Hosker, JP, Rudenski, AS, Naylor, BA, Treacher, DF and Turner, RC. (1985). Homeostasis model assessment: insulin resistance and -cell function from fasting plasma glucose and insulin concentrations in man. Diabetologia 28(7), 412–419.
  • Mortazavi and others (2008) Mortazavi, Ali, Williams, Brian A, McCue, Kenneth, Schaeffer, Lorian and Wold, Barbara. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5(7), 621.
  • Newton and others (2004) Newton, Michael A, Noueiry, Amine, Sarkar, Deepayan and Ahlquist, Paul. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5(2), 155–176.
  • Pasolli and others (2017) Pasolli, Edoardo, Schiffer, Lucas, Manghi, Paolo, Renson, Audrey, Obenchain, Valerie, Truong, Duy Tin, Beghini, Francesco, Malik, Faizan, Ramos, Marcel, Dowd, Jennifer B and others. (2017). Accessible, curated metagenomic data through Experimenthub. Nature Methods 14(11), 1023.
  • Paulson and others (2013) Paulson, Joseph N, Stine, O Colin, Bravo, Héctor Corrada and Pop, Mihai. (2013). Differential abundance analysis for microbial marker-gene surveys. Nature Methods 10(12), 1200.
  • Qin and others (2010) Qin, Junjie, Li, Ruiqiang, Raes, Jeroen, Arumugam, Manimozhiyan, Burgdorf, Kristoffer Solvsten, Manichanh, Chaysavanh, Nielsen, Trine, Pons, Nicolas, Levenez, Florence, Yamada, Takuji and others. (2010). A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464(7285), 59.
  • Qin and others (2014) Qin, Nan, Yang, Fengling, Li, Ang, Prifti, Edi, Chen, Yanfei, Shao, Li, Guo, Jing, Le Chatelier, Emmanuelle, Yao, Jian, Wu, Lingjiao and others. (2014). Alterations of the human gut microbiome in liver cirrhosis. Nature 513(7516), 59.
  • Ridlon and others (2013) Ridlon, Jason M, Alves, Joao Marcelo, Hylemon, Phillip B and Bajaj, Jasmohan S. (2013). Cirrhosis, bile acids and gut microbiota: unraveling a complex relationship. Gut Microbes 4(5), 382–387.
  • Ritchie and others (2015) Ritchie, Matthew E, Phipson, Belinda, Wu, Di, Hu, Yifang, Law, Charity W, Shi, Wei and Smyth, Gordon K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47–e47.
  • Robinson and others (2010) Robinson, Mark D, McCarthy, Davis J and Smyth, Gordon K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1), 139–140.
  • Robinson and Oshlack (2010) Robinson, Mark D and Oshlack, Alicia. (2010). A scaling normalization method for differential expression analysis of RNA-Seq data. Genome Biology 11(3), R25.
  • Savitsky and Vannucci (2010) Savitsky, Terrance and Vannucci, Marina. (2010). Spiked Dirichlet process priors for Gaussian process models. Journal of Probability and Statistics 2010.
  • Schell and others (2002) Schell, Mark A, Karmirantzou, Maria, Snel, Berend, Vilanova, David, Berger, Bernard, Pessi, Gabriella, Zwahlen, Marie-Camille, Desiere, Frank, Bork, Peer, Delley, Michele and others. (2002). The genome sequence of Bifidobacterium longum reflects its adaptation to the human gastrointestinal tract. Proceedings of the National Academy of Sciences 99(22), 14422–14427.
  • Segata and others (2012) Segata, Nicola, Waldron, Levi, Ballarini, Annalisa, Narasimhan, Vagheesh, Jousson, Olivier and Huttenhower, Curtis. (2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Methods 9(8), 811.
  • Sender and others (2016) Sender, Ron, Fuchs, Shai and Milo, Ron. (2016). Revised estimates for the number of human and bacteria cells in the body. PLoS Biology 14(8), e1002533.
  • Sivan and others (2015) Sivan, Ayelet, Corrales, Leticia, Hubert, Nathaniel, Williams, Jason B, Aquino-Michaels, Keston, Earley, Zachary M, Benyamin, Franco W, Lei, Yuk Man, Jabri, Bana, Alegre, Maria-Luisa and others. (2015). Commensal Bifidobacterium promotes antitumor immunity and facilitates anti–PD-L1 efficacy. Science, aac4255.
  • Stingo and others (2013) Stingo, Francesco C, Guindani, Michele, Vannucci, Marina and Calhoun, Vince D. (2013). An integrative Bayesian modeling approach to imaging genetics. Journal of the American Statistical Association 108(503), 876–891.
  • Tang and others (2017) Tang, Ruqi, Wei, Yiran, Li, Yanmei, Chen, Weihua, Chen, Haoyan, Wang, Qixia, Yang, Fan, Miao, Qi, Xiao, Xiao, Zhang, Haiyan and others. (2017). Gut microbial profile is altered in primary biliary cholangitis and partially restored after UDCA therapy. Gut, gutjnl–2016.
  • Tett and others (2017) Tett, Adrian, Pasolli, Edoardo, Farina, Stefania, Truong, Duy Tin, Asnicar, Francesco, Zolfo, Moreno, Beghini, Francesco, Armanini, Federica, Jousson, Olivier, De Sanctis, Veronica and others. (2017). Unexplored diversity and strain-level structure of the skin microbiome associated with psoriasis. NPJ Biofilms and Microbiomes 3(1), 14.
  • Ursell and others (2012) Ursell, Luke K, Metcalf, Jessica L, Parfrey, Laura Wegener and Knight, Rob. (2012). Defining the human microbiome. Nutrition Reviews 70(suppl_1), S38–S44.
  • Wadsworth and others (2017) Wadsworth, W Duncan, Argiento, Raffaele, Guindani, Michele, Galloway-Pena, Jessica, Shelburne, Samuel A and Vannucci, Marina. (2017). An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data. BMC Bioinformatics 18(1), 94.
  • Weiss and others (2017) Weiss, Sophie, Xu, Zhenjiang Zech, Peddada, Shyamal, Amir, Amnon, Bittinger, Kyle, Gonzalez, Antonio, Lozupone, Catherine, Zaneveld, Jesse R, Vázquez-Baeza, Yoshiki, Birmingham, Amanda and others. (2017). Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 5(1), 27.
  • Witten (2011) Witten, Daniela M. (2011). Classification and clustering of sequencing data using a Poisson model. The Annals of Applied Statistics, 2493–2518.
  • Wu and others (2016) Wu, Chong, Chen, Jun, Kim, Junghi and Pan, Wei. (2016). An adaptive association test for microbiome data. Genome Medicine 8(1), 56.
  • Yan and others (2011) Yan, Arthur W, Fouts, Derrick E, Brandl, Johannes, Stärkel, Peter, Torralba, Manolito, Schott, Eckart, Tsukamoto, Hide, Nelson, Karen E, Brenner, David A and Schnabl, Bernd. (2011). Enteric dysbiosis associated with a mouse model of alcoholic liver disease. Hepatology 53(1), 96–105.
  • Zhang and others (2017) Zhang, Xinyan, Mallick, Himel, Tang, Zaixiang, Zhang, Lei, Cui, Xiangqin, Benson, Andrew K and Yi, Nengjun. (2017). Negative binomial mixed models for analyzing microbiome count data. BMC Bioinformatics 18(1), 4.
  • Zhao and others (2015) Zhao, Ni, Chen, Jun, Carroll, Ian M, Ringel-Kulka, Tamar, Epstein, Michael P, Zhou, Hua, Zhou, Jin J, Ringel, Yehuda, Li, Hongzhe and Wu, Michael C. (2015). Testing in microbiome-profiling studies with MiRKAT, the microbiome regression-based kernel association test. The American Journal of Human Genetics 96(5), 797–807.
  • Zhu and others (2018) Zhu, Wenhan, Winter, Maria G, Byndloss, Mariana X, Spiga, Luisella, Duerkop, Breck A, Hughes, Elizabeth R, Büttner, Lisa, de Lima Romão, Everton, Behrendt, Cassie L, Lopez, Christopher A and others. (2018). Precision editing of the gut microbiota ameliorates colitis. Nature 553(7687), 208.

Supplementary Material

S1: Sensitivity Analysis

We assess impacts of setting priors via sensitivity analysis. In our model, the choice of and in the IG prior for has an impact on the posterior probabilities of inclusion of . To investigate model performance with respect to the choice of these hyperparameters, we simulated 30 datasets as described in Section 4 in the main text, and benchmarked our model with varying values of from to and from to . Specifically, we set 4 out of 7 covariates to have nonzero contributions to the outcome of each feature. The choices of and are illustrated in Figure (9).

The results given by different values of () were compared based on the Matthews correlation coefficient (MCC) (Matthews and others, 1985) across 30 replicated datasets. In each replicate, we controlled a Bayesian false discovery rate and selected discriminating features. We then calculated the number of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) and MCC. Here MCC is defined as

MCC ranges from to , and larger values represents favorable prediction results. It is also demonstrated in the above formula that the MCC-based evaluation is suitable for classes with very different sizes, since it strikes a balance between TP and FP counts. In our scenario, the size of truly discriminating features were relatively small compared to the total number. Therefore, we adopted MCC as an appropriate performance metric to handle the imbalanced setting. As can be seen in Figure (9), given a small value of (), the MCC was undesirable with any value of displayed here. As shown by Gelman and others (2006), the IG() prior with small and would distort the posterior inferences. On the other hand, if we increase to have while fixing to be small, the corresponding prior distribution was strongly informative since IG() has the mean of and the variance of . Therefore, we chose and to be the default setting, since this ensured a flat prior and yielded a beneficial variable selection result as shown in Figure 9.

S2: Details of MCMC Algorithms

First, we write the likelihood function as follows:

Then, we update the parameters in each iteration following the steps below:

  1. Update of zero-inflation latent indicator : Notice that we only need to update the ’s that correspond to . We write the posterior as:

    Then it follows that

  2. Update of : We update each sequentially using an independent Metropolis-Hasting algorithm. We first propose a new from and then accept the proposed value with probability min(, where

  3. Joint Update of : A between-model step is implemented first to jointly update and . We use an add-delete algorithm, where we select a at random and change its value. For the add case, i.e. , we propose for each from . For the delete case, i.e. , we set for all . We finally accept the proposed values with probability min(), where

    Further update of when : A within-model step is followed to further update each that corresponds to in the current iteration. We first propose a new from and then accept the proposed value with probability min(), where

  4. Joint update of and : Very similar to the above, we perform a between-model step first using an add-delete algorithm. For each , we first select an at random and change its value. For the add case, i.e. , we propose from . For the delete case, i.e. , we set all . Then finally we accept the proposed values with probability min(), where

    Further update of when : A within-model step is followed to further update each that corresponds to . We first propose a new from and then accept the proposed value with probability min(), where

  5. Update of : We update each sequentially by using an independent Metropolis-Hasting algorithm. We first propose a new

    from the normal distribution

    that truncated at 0, and accept the proposed value with probability min(1,), where

Figure 9: Sensitivity Analysis: Relationship between Matthews correlation coefficient (MCC) for selecting discriminatory features () with the choice of () from the inverse-gamma prior on and . Each value of MCC was the averaged result of 30 simulated datasets.