1 Introduction
The human microbiome plays a critical role in maintaining health and causing both acute and chronic disease. Microbes live in communities in which multiple species establish synergistic and antagonistic relationships (Pflughoeft and Versalovic, 2012). These interactions allow some species to thrive and keep others in check. The complex biological dependencies among taxa demand statistical methods that account for and exploit this interdependence. There are valid and powerful methods for jointly analyzing microbiome sequence data as predictors of health outcomes, but there are fewer methodologic options for analyzing microbiome community data as a set of joint endpoints. We specifically address three challenges that commonly arise in analysis of microbiome sequencing data as responses (dependent variables): excess zeros, interdependence of the endpoints, and the need for outcomespecific covariate selection.
First, in most human microbiome studies, a large proportion of microbial taxa is absent in the majority of subjects, leading to many more zero counts for each taxon than expected on the basis of a Poisson, negative binomial, or Dirichletmultinomial distribution (e.g., see Supplementary Material A) (Chen and Li, 2013). Application of a conventional linear model that uses untransformed or logarithmictransformed counts is inappropriate for zeroinflated data (Loeys et al., 2012; Xu et al., 2015). An intuitive approach to analyzing zeroinflated count data is to view the data as arising from an underlying zeroinflated distribution, which is a mixture of a point mass at zero and a count distribution, such as Poisson (Lambert, 1992).
Second, as mentioned above, microbiome sequencing data are typically multivariate (joint response) count data sampled from communities of interdependent species. Naïve application of a univariate, taxonbytaxon approach implicitly assumes that counts of each taxon are uncorrelated. Although one could control for the type I error in this approach, this generally results in loss of power (Breiman and Friedman, 1997; La Rosa et al., 2012). Multivariate nonparametric methods are available compare bacterial community composition between two groups (Mantel, 1967; Mantel and Valand, 1970; Anderson, 2001); these are generally less powerful than regression methods and often do not quantify the magnitude of group differences. One approach for joint modeling of multivariate microbial sequence count data is Dirichletmultinomial regression (Holmes et al., 2012; La Rosa et al., 2012; Chen and Li, 2013; Wadsworth et al., 2017). However, the Dirichletmultinomial model imposes restrictions that may misrepresent features of multivariate taxa count data distributions. For example, despite that relationships among microbial species can be either positively or negatively correlated, the dependence between Dirichlet variates is always negatively correlated (Aitchison and Ho, 1989; Li, 2015).
Multivariate zeroinflated regression models can address both excess zeros as well as interdependent responses. Such methods that have been developed to date have been scaled to model only a small number of interdependent count endpoints, which include bivariate (Arab et al., 2012; Fox, 2013) and trivariate (Li et al., 1999)
zeroinflated Poisson models. In some cases, existing methods have incorporated a restrictive covariance structure among outcomes, which may not always be appropriate. Specific examples of such restrictions include zeroinflated models for longitudinal data only with variance components (with no covariance components)
(Lee et al., 2006; Hall, 2000; Long et al., 2015), models with dependence structures specific to spatialtemporal data (Earnest et al., 2007; Fernandes et al., 2009; Wang et al., 2015a), and models including latent factors that can induce only positive correlations among outcomes (Neelon and Chung, 2017).A third impediment to developing and applying a multivariate analysis technique to microbiome data is that due to having more than one endpoint, there is a large number of potential covariateendpoint associations to be modeled. It is well recognized that variable selection helps improve prediction accuracy and reduce the cost of measurement and storage of future data. The need for variable selection techniques is well appreciated for highdimensional covariate data and may be less well known in the context of multiple outcomes. Although there exist variable selection methods for multivariate normal (Brown et al., 1998; Lee et al., 2017) or multinomial responses (Wadsworth et al., 2017), we know of no such technique applied to methods for multivariate zeroinflated outcomes.
We have developed multivariate zeroinflated regression models by relaxing requirements regarding the covariance structure and incorporating a Bayesian variable selection approach. The proposed methods can be used to identify zeroinflated count outcomes associated with a set of covariates while accounting for the covariance structure of the outcomes. Since it is implausible that all outcomes are relevant to the same subset of covariates, we enable the proposed model to perform outcomespecific variable selection, i.e., to identify exposures or treatments associated with particular outcomes, in this case microbial taxa. Spikeandslab approaches have been widely used for Bayesian variable selection(George and McCulloch, 1993, 1997)
, including for multivariate linear regression problems
(Brown et al., 1998; Lee et al., 2017). In this work, we extend the spikeandslab approach to the context of multivariate zeroinflated data.We use the newly developed model to analyze data from the Pediatric HIV/AIDS Cohort Oral Health Study (PHACS). PHACS is an ongoing prospective cohort study at 15 US clinical sites, designed to assess the health effects of HIV infection and antiretroviral therapy (ART) on youth perinatally infected with HIV (PHIV) compared with exposed but uninfected (PHEU) youth (Alperen et al., 2014; Tassiopoulos et al., 2016; Starr et al., 2018). The data analyzed were from a crosssectional study focused on oral health and the oral microbiome (Moscicki et al., 2016; Ryder et al., 2017). All participants were exposed to HIV perinatally, the period when they became HIV infected if at all. Emerging from the womb, it is likely that they began acquiring their oral microbiota at birth and should have had oral microbiota similar to those of adults by 3 years of age (Mueller et al., 2015; PerezMuñoz et al., 2017). Thus, if there is a causal association, the oral bacterial sequences we measured at 1022 years of age more likely resulted from perinatal HIV infection rather than the reverse. This is why we treat taxa’s counts as endpoints and HIV as the exposure. The goals of this analysis are (i) to identify taxa associated with HIV infection; and (ii) to estimate and test the association of HIV infection with counts of the identified taxa.
The remainder of this paper is organized as follows. Section 2 describes the proposed Bayesian framework, including model formulation and specification of prior distributions. Section 3 describes results from simulation studies conducted to compare the operating performance of the proposed variable selection approach versus an existing univariate method. Section 4 describes results from the PHACS data analysis. In section 5 we further discuss the method and results.
2 Methods
In this section, we describe the proposed multivariate zeroinflated model, present prior distributions for model parameters and the variable selection strategy, discuss interpretation of the regression parameters, and summarize the computational scheme and implementation (see Supplementary Material B for a summary of model parameters and C for implementation details).
2.1 Model Formulations
Suppose that count outcomes are observed for taxon =1,…, and subject =1,…,. We use an approach that assumes that
follow a multivariate zeroinflated Poisson (MZIP) distribution, which is a mixture of a Poisson distribution and point mass distribution at zero (
):(1)  
where is an unobservable indicator for the excess zeros for taxon in subject , and
is the mean of the Poisson distribution. The model implies that some zeros occur through a Poisson process whereas others represent the impossibility for a given taxon to be observed in some subjects. In practice, regression analysis based on the MZIP model proceeds by placing structure on
and P[], specifically as a function of the covariates and random effects. Toward this, let andbe a vector of
and covariates for the subject that will be considered in the model for and P[], respectively. With this formulation it is not necessary for the presence and the count of a taxon to depend on the same set of covariates.For the count (Poisson) model part, we consider the following general modeling specification:
(2) 
where =(,…,) are the outcomespecific intercepts and =(,…,) are the outcomespecific vectors of fixedeffect regression parameters. The random effects characterize the unobserved characteristics that are associated with the mean count for taxon in subject and account for withinsubject correlations. The term is included as an offset variable for settings in which one is interested in the incidence density . For application to genetic sequence counts, setting to the total number of sequencing reads accounts for individual variation in sequencing depth.
To account for the dependency structure in the binary part of the model, we adopt a multivariate probit model (Ashford and Sowden, 1970). Letting =, with indicator function , we consider a multivariate normal (MVN) distribution for the latent variable =(,…,), with location vector and variancecovariance matrix . Here, =(,…,) are the intercepts and is the coefficient matrix whose columns are =(, …,), =1,,
. Then the probability density function of
is given by(3) 
is restricted to be a correlation matrix to ensure identifiability of all model parameters (Chib and Greenberg, 1998; Liu, 2001). measures the dependence between and by using the correlations among the elements of the vector . Let , , , and denote the collections of , , , and , respectively, across all subjects. Let be the coefficient matrix whose columns=(, …,), =1,,. Combining (1), (2), and (3), the augmented data likelihood function, as a function of the unknown parameters ={, , , , , }, is:
(4)  
2.2 Prior Specification and Covariance structure
We complete the Bayesian formulation of the proposed framework by specifying prior distributions for the unknown parameters. To facilitate outcomespecific variable selection, we adopt spikeandslab priors for the regression parameters in both parts of the proposed model. Such a prior has been widely used in the context of Bayesian stochastic search variable selection (George and McCulloch, 1993, 1997). Specifically, we assign the following priors for and :
(5) 
where and , for , , are vectors
of binary latent variables indicating the membership of each regression parameter to one of the mixture components in (5). The th covariate is considered to be associated with mean counts for the th outcome if the data support over . A similar interpretation holds for
. We use independent Bernoulli hyperpriors for
and with inclusion probability and , respectively.As outlined in Section 2.1, the dependence among mean bacterial counts of multiple taxa is accounted for by the distribution of random effects . Specifically, we structure this dependence through a hierarchical PoissonlogNormal model (Aitchison and Ho, 1989; Fox, 2013), which corresponds to a MVN prior, . We use an unstructured covariance pattern for and , thus imposing no specific structure for the dependence among outcomes. Under the unstructured model, we adopt a conjugate hyperprior, inverseWishart(, ), for . For , we use , which is the prior for a correlation matrix based on Jeffreys’ prior distribution for the variancecovariance matrix (Box and Tiao, 2011). In addition, we assign , where
is the hyperparameter to be specified. We discuss the desirable properties of the prior distributions for
and in Supplementary Material C (Chib and Greenberg, 1998; Liu and Sun, 2000). We assign MVN(, ) for the intercepts , where is the hyperparameter to be specified, and is the identity matrix. Lastly, we use the conjugate hyperpriors inverseGamma(, ), inverseGamma(, ), inverseGamma(, ) and inverseGamma(, ), for , , , and , respectively.2.3 Induced Marginal Incident Density Ratio
Because the MZIP model is a mixture model, the regression parameters, , have latent interpretations: represents the change in log mean count of taxon associated with a oneunit increase in the covariate , in a susceptible subpopulation (Preisser et al., 2012). The relationship between and the parameters from the proposed MZIP model is given by
where
is the cumulative distribution function of the standard normal distribution. For models with loglinks and normally distributed random effects such as the one we propose, it is straightforward to marginalize the conditional expectation over the random effects distribution
(Long et al., 2015). Under a model with , the ratio of means for a oneunit increase in covariate is given by(6)  
where and are the vectors and , respectively, with the th element removed. Although we obtain by marginalizing over the latent mixture distribution and the distribution of random effect , the quantity still depends on the values of other covariates , which can be addressed several ways (Preisser et al., 2012). For continuous covariates, one could obtain a covariateadjusted by either i) inserting mean values of covariates , or ii) assuming specific covariate distributions and marginalizing the quantity over these distributions. For discrete covariates, one could iii) empirically marginalize the over the observed distribution of covariates, or iv) present multiple different values for the , one for each category defined by unique covariate profiles.
As has been well described, estimating the variance of measures such as the in a frequentist framework would require an extra statistical technique such as bootstrap resampling (Albert et al., 2014). An advantage of the Bayesian paradigm is that estimation of uncertainty for follows directly from the variance of its posterior distribution, estimated by evaluating its expression at each scan of the Markov chain Monte Carlo (MCMC) scheme.
2.4 Markov Chain Monte Carlo
We perform estimation and inference for the proposed model by using a Gibbs sampling algorithm to generate samples from the posterior distribution. In the corresponding MCMC scheme, parameters are updated either by exploiting conjugacies inherent in the model structure or by using a MetropolisHastings step. However, MCMC is far from straightforward because the joint posterior distribution under the proposed framework involves i) the unobserved multivariate latent variables, ; ii) the augmented data likelihood function based upon the latent mixture distribution; iii) spikeandslab mixture priors for the regression parameters; iv) a highdimensional parameter space due to the unstructured pattern for and ; and v) restrictions on the correlation parameters in . Therefore, we develop an efficient MCMC sampling scheme based on a data augmentation algorithm (Tanner and Wong, 1987)
in which the computational challenge of highdimensionality is avoided by iterating between an “imputation step,” in which values of the unobserved latent variables
are imputed and updated, and a “posterior step,” in which the model parameters are updated. In the posterior step, we used a parameter expanded data augmentation method (Liu, 2001) to update for computational efficiency (see Supplementary Material C for detailed description of the proposed computation scheme). We have developed a series of core functions in C to improve the computation speed, for which we provide the algorithm in the mBvs package for R (https://cran.rproject.org/web/packages/mBvs). As an example of computational time, it takes approximately 3 minutes to generate 10,000 MCMC scans on a 2.5 GHz Intel Core i7 MacBook Pro for the analysis of the PHACS data (=254, =44, ==2).3 Simulation Studies
We evaluated the performance of the proposed method on simulated data. We generated data sets under six scenarios with varying outcome correlation structures and association patterns between a covariate and the vector of outcomes in the two model parts.
3.1 SetUp and Data Generation
We generated samples of size with outcomes and covariate under the proposed model given in (1), (2), and (3). The covariate was generated from Normal(0, 2) and the intercepts set to and . In Scenarios IIII and VI, we varied the scale and sign of the association between the covariate and outcomes in the two model parts by setting [0.05, 0.10, 0.15, 0.20, 0.25, 0.05, 0.10, 0.15, 0.20, 0.25, ]. In Scenario IV, the covariate was associated with outcomes in only one of the two model parts: [0.05, 0.10, 0.15, 0.20, 0.25, ] and [, 0.05, 0.10, 0.15, 0.20, 0.25, ]. We considered the null case in Scenario V by setting all elements of and to zero. We set each variancecovariance matrix in the count part of the model and in the binary part of the model to a correlation matrix with an exchangeable structure with correlation within the block of the first ten outcomes, an exchangeable structure with correlation within the block of the second 10 outcomes, and a common crossblock correlation of for pairs of outcomes from different blocks. In Scenarios I, IV, and V, the outcomes associated with the covariate were highly correlated and outcomes unassociated with the covariate only moderately correlated, (, , )=(0.70, 0.30, 0.20). In Scenario II, the outcomes associated with the covariate were moderately correlated and the remaining outcomes weakly correlated, (, , )=(0.40, 0.05, 0.10). In Scenario III, the outcomes associated with the covariate were weakly correlated and those unassociated with the covariate highly correlated, (, , )=(0.20, 0.70, 0.30). In Scenario VI, each outcome is assumed to follow a univariate zeroinflated Poisson (UZIP) distribution, indicating (, , )=(0, 0, 0) (independence), and diag()=diag(R)= (no overdispersion).
3.2 Analyses and Specification of Hyperparameters
We fit the proposed MZIP model to 600 simulated data sets, 100 under each of the six scenarios. For comparison purposes in each data set, we also implemented UZIP regression with and without the lasso penalty by using the pscl(Zeileis et al., 2008) and mpath(Wang et al., 2015b) R packages, respectively. As outlined in Section 2.2, the proposed Bayesian framework requires the specification of several hyperparameters. For the intercepts and and their variance components and , we assigned noninformative priors by setting (, , )=(, , )=(, 0.7, 0.7). For the regression parameters, and , and their variance components, and , the hyperparameters (, , )=(, , ), , , were set to (10, 0.7, 0.7) to make the corresponding priors fairly noninformative. The hyperparameters were set to either 0.1 or 0.5, implying 0.1 or 0.5 a priori probability, respectively, of each covariate to be selected as associated with each outcome. Finally, we set with , corresponding to a prior distribution of centered at and with variance of the diagonal elements equal to 2.0. We ran each MCMC chain for 1 million iterations with the first half taken as burnin.
For the proposed Bayesian MZIP model, we perform variable selection based on the marginal posterior distribution of variable selection indicators, and
. Here, we applied a marginal posterior probability cutoff of 0.5. Between the two univariate approaches implemented, in initial simulation studies the penalized UZIP model tended to select the covariate for all outcomes in both model parts when the outcomes were correlated. For this reason, and because it would be one typical practice, for the UZIP regression analyses we performed variable selection by applying 95% confidence intervals with a false discovery rate (FDR)controlling procedure
(Benjamini and Hochberg, 1995) to account for multiple comparisons.We assessed performance of the variable selection feature of the model by calculating four quantities based on the true positives (TP; the number of outcomes associated with the covariate and selected into the model) and false positives (FP; the number of outcomes unrelated to the covariate and mistakenly selected into the model), where is the number of outcomes that are truly associated with the covariate: the true positive rate, TPR=TP/, the false positive rate, FPR=FP/(), the positive predictive value, PPV=TP/(TP+FP), and the negative predictive value, NPV=()/().
UZIP  MZIP  
()  ()  
Scenario  Binary  Count  Binary  Count  Binary  Count  
Mean  (SD)  Mean  (SD)  Mean  (SD)  Mean  (SD)  Mean  (SD)  Mean  (SD)  
TPR  54.3  (9.7)  99.1  (2.9)  67.4  (8.4)  82.5  (5.5)  78.4  (8.1)  87.4  (5.5)  
FPR  0.1  (1.0)  86.0  (11.1)  0.3  (1.8)  0.8  (2.8)  4.6  (6.5)  3.1  (5.5)  
I  PPV  99.8  (1.7)  53.7  (3.4)  99.6  (2.4)  99.1  (3.0)  95.0  (6.8)  96.9  (5.3) 
NPV  68.9  (4.7)  94.4  (17.0)  75.6  (4.9)  85.2  (4.2)  81.9  (5.6)  88.7  (4.4)  
5.4  (1.0)  18.5  (1.2)  6.8  (0.8)  8.3  (0.6)  8.3  (1.1)  9.1  (0.8)  
TPR  53.7  (9.9)  99.3  (2.6)  60.8  (9.6)  75.2  (8.0)  73.0  (8.8)  84.1  (7.0)  
FPR  0.4  (1.9)  87.1  (10.0)  0.5  (2.1)  0.6  (2.4)  4.0  (5.8)  2.7  (5.1)  
II  PPV  99.5  (2.8)  53.4  (2.9)  99.4  (2.8)  99.3  (2.8)  95.3  (6.7)  97.2  (5.4) 
NPV  68.6  (4.6)  96.4  (14.2)  72.1  (4.9)  80.4  (5.3)  78.4  (5.6)  86.2  (5.5)  
5.4  (1.0)  18.6  (1.1)  6.1  (1.0)  7.6  (0.8)  7.7  (1.1)  8.7  (0.8)  
TPR  56.0  (10.2)  99.1  (2.8)  60.0  (10.0)  73.5  (8.3)  74.3  (9.2)  82.6  (7.9)  
FPR  0.2  (1.5)  88.7  (11.1)  0.2  (1.5)  0.1  (1.0)  3.9  (5.8)  1.8  (4.1)  
III  PPV  99.6  (2.5)  53.0  (3.3)  99.6  (2.4)  99.9  (1.2)  95.2  (7.1)  98.1  (4.3) 
NPV  69.8  (5.0)  92.3  (23.2)  71.7  (5.1)  79.4  (5.2)  79.2  (6.3)  85.4  (5.9)  
5.6  (1.0)  18.8  (1.2)  6.0  (1.0)  7.4  (0.8)  7.8  (0.9)  8.4  (1.0)  
TPR  56.8  (16.3)  98.8  (4.8)  63.0  (17.4)  82.6  (11.2)  79.0  (15.1)  88.6  (10.7)  
FPR  0.1  (0.7)  86.8  (8.6)  0.4  (1.6)  0.3  (1.3)  3.3  (4.6)  2.5  (4.2)  
IV  PPV  99.8  (2.0)  27.6  (2.5)  98.5  (6.0)  99.3  (3.6)  90.3  (12.8)  93.3  (10.8) 
NPV  87.6  (4.2)  95.9  (17.1)  89.2  (4.6)  94.6  (3.4)  93.4  (4.5)  96.3  (3.4)  
2.9  (0.8)  18.0  (1.3)  3.2  (0.9)  4.2  (0.6)  4.5  (1.0)  4.8  (0.8)  
TPR                          
FPR  0.1  (0.5)  86.8  (8.5)  0.1  (0.7)  0.1  (0.7)  1.9  (3.4)  1.1  (2.3)  
V  PPV                         
NPV  100.0  (0.0)  100.0  (0.0)  100.0  (0.0)  100.0  (0.0)  100.0  (0.0)  100.0  (0.0)  
0.0  (0.1)  17.4  (1.7)  0.0  (0.1)  0.0  (0.1)  0.4  (0.7)  0.2  (0.5)  
TPR  58.1  (10.7)  100.0  (0.0)  57.7  (10.8)  100.0  (0.0)  72.2  (9.6)  100.0  (0.0)  
FPR  0.0  (0.0)  0.0  (0.0)  0.1  (1.2)  0.0  (0.0)  4.7  (6.3)  0.1  (1.1)  
VI  PPV  100.0  (0.0)  100.0  (0.0)  99.8  (1.9)  100.0  (0.0)  94.4  (7.3)  99.9  (1.0) 
NPV  70.9  (5.5)  100.0  (0.0)  70.6  (5.4)  100.0  (0.0)  77.8  (6.1)  100.0  (0.0)  
5.8  (1.1)  10.0  (0.0)  5.8  (1.1)  10.0  (0.0)  7.7  (1.1)  10.0  (0.1)  
TPR=TP/, FPR=FP/(), PPV=TP/(TP+FP), NPV=()/(+FP), where TP is the number of outcomes  
associated with the covariate and selected into the model, FP is the number of outcomes unrelated to the covariate and mistakenly  
selected into the model, and is the number of outcomes that are truly associated with the covariate. Note, is in Scenarios IIII  
and VI, in Scenario IV, and in Scenario V.  
For variable selection in UZIP analyses, we used 95% confidence intervals with a false discovery rate controlling procedure.  
For variable selection in MZIP models, we applied a marginal posterior probability cutoff of 0.5 for both and . The hyperparameters  
= are set to either 0.1 or 0.5  
Since there is no outcomes that are truly associated with the covariate in Scenario V, TPR and PPV are not presented.  
NOTE: Throughout the mean and standard deviation (SD) values are based on results from 100 simulated datasets. 
UZIP  MZIP  
Binary  Count  Binary  Count  
True  
Est (SE)  Est (SE)  PM (SD)  PM  PM (SD)  PM  
1  0.05  0.05 (0.04)  0.02  0.042 ()  0.95  0.06 (0.04)  0.04  0.03 (0.02)  0.03 
2  0.10  0.11 (0.04)  0.14  0.088 ()  1.00  0.11 (0.04)  0.43  0.09 (0.02)  1.00 
3  0.15  0.15 (0.05)  0.63  0.142 ()  1.00  0.15 (0.04)  0.95  0.15 (0.02)  1.00 
4  0.20  0.20 (0.05)  0.91  0.191 ()  1.00  0.20 (0.04)  1.00  0.19 (0.02)  1.00 
5  0.25  0.25 (0.05)  0.99  0.233 ()  1.00  0.24 (0.04)  1.00  0.24 (0.02)  1.00 
6  0.05  0.04 (0.04)  0.01  0.046 ()  0.97  0.06 (0.04)  0.04  0.06 (0.02)  0.23 
7  0.10  0.11 (0.04)  0.25  0.104 ()  0.99  0.11 (0.04)  0.36  0.11 (0.02)  1.00 
8  0.15  0.14 (0.05)  0.55  0.143 ()  1.00  0.15 (0.04)  0.95  0.15 (0.02)  1.00 
9  0.20  0.20 (0.05)  0.95  0.194 ()  1.00  0.20 (0.04)  1.00  0.20 (0.02)  1.00 
10  0.25  0.25 (0.05)  0.99  0.251 ()  1.00  0.24 (0.04)  1.00  0.25 (0.02)  1.00 
11  0.00  0.01 (0.04)  0.00  0.001 ()  0.82  0.01 (0.04)  0.01  0.00 (0.01)  0.00 
12  0.00  0.02 (0.04)  0.00  0.001 ()  0.84  0.02 (0.04)  0.02  0.00 (0.01)  0.00 
13  0.00  0.01 (0.04)  0.00  0.004 ()  0.89  0.00 (0.04)  0.01  0.00 (0.01)  0.00 
14  0.00  0.00 (0.04)  0.00  0.002 ()  0.82  0.00 (0.04)  0.01  0.00 (0.01)  0.00 
15  0.00  0.00 (0.04)  0.00  0.000 ()  0.93  0.00 (0.04)  0.01  0.00 (0.01)  0.01 
16  0.00  0.00 (0.04)  0.00  0.004 ()  0.80  0.00 (0.04)  0.01  0.00 (0.01)  0.00 
17  0.00  0.00 (0.04)  0.00  0.005 ()  0.85  0.00 (0.04)  0.01  0.00 (0.01)  0.00 
18  0.00  0.01 (0.04)  0.00  0.007 ()  0.96  0.00 (0.04)  0.01  0.00 (0.01)  0.00 
19  0.00  0.00 (0.04)  0.01  0.003 ()  0.85  0.00 (0.04)  0.01  0.00 (0.01)  0.00 
20  0.00  0.01 (0.04)  0.00  0.002 ()  0.83  0.00 (0.04)  0.01  0.00 (0.01)  0.00 
The medians of the maximum likelihood estimate (Est) and standard error (SE) of and , the proportion that 

the covariate is selected for the j outcome (, ) are calculated.  
The empirical standard deviations of Est() range between 0.036 and 0.054. (not presented in the table)  
The medians of the posterior means (PM) and posterior standard deviation (SD) of and (conditioning on  
and , respectively), the medians of the posterior means of and (marginal posterior probability  
of inclusion) are computed. The hyperparameters are set to 0.1.  
NOTE: Throughout values are based on results from 100 simulated datasets. 
3.3 Results
We focus the presentation of results in this section on the MZIP model with ==. This is to demonstrate the improvement gained by the proposed multivariate approach over an analogous univariate method, while implementing the fairest comparison to the univariate method. When the values of the overall prior inclusion probabilities ( and ) increased from 0.1 to 0.5, the MZIP model tended to select one to two more variables on average, yielding higher TPR and NPV but also a bit higher FPR and lower PPV in both parts of the model (Table 1). When the outcome variables were uncorrelated (Scenario VI), the variable selection capability for the MZIP model with == was almost the same as that of UZIP model.
Across scenarios in which the outcomes were correlated (IIII), the binary part of the MZIP model was more sensitive than in the UZIP approach (Table 1), for example, with TPR=61% versus 54% in Scenario II, respectively. The TPR in UZIP models was insensitive to the strength of correlation among outcomes, whereas the MZIP TPR increased to 67% in Scenario I, in which there was a stronger correlation among outcomes associated with the covariate. Both the UZIP and the MZIP methods successfully identified the covariate associations for the four outcomes with the largest effect sizes (; Table 2). However, the MZIP model performed much better at detecting smallermagnitude associations: when (outcomes 2 and 7), associations were correctly included in 40% and 20% of MZIP and UZIP models, respectively; the corresponding inclusion rates were 95% and 60% when (outcomes 3 and 8).
The multivariate approach yielded much more substantial improvement for the count part of model when outcomes were correlated (Table 1). Even controlling the FDR, the univariate approach generally exhibited inflated type I error, as high as 86% across Scenarios IV. In contrast, across all scenarios the MZIP model had a low probability of false discovery (FPR%) while also exhibiting high TPR that ranged from 73% to 83% in Scenarios IIV. The relatively poor performance of the UZIP method is not due to bias, since the estimated association for outcomes unassociated with the covariate (, =11,…,20) were very close to zero (Table 2). However, the medians of the asymptotic standard errors (SE) for were 20 times smaller than the empirical standard deviations of (rang between 0.036 and 0.054) (Table 2), was also observed in Scenario IIV (Supplementary Material D). Thus, the univariate approach appears to perform poorly in estimation of the standard errors for count model parameters when outcomes are correlated. Consequently, the estimated confidence intervals are too narrow.
In the null case (Scenario V), both approaches successfully excluded the covariate for all outcomes for the binary part of the model even when outcomes were strongly correlated; the UZIP model exhibited a high false discovery rate (87%) for the count part of the model.
We ran extensive additional simulations to explore other factors (detailed in Table E.1 and E.2 in Supplementary Material E), including a larger number of outcomes (=50), a lower signal density (45%), a smaller sample size (=150) and negative correlations. Briefly, the results were similar to those described above, with the MZIP performing much better for variable selection and the UZIP exhibiting inflated type I error. We also compared the proposed MZIP to a univariate nonparametric method, the Wilcoxon rank sum test. Although type I error was well controlled in the univariate nonparametric method, substantially higher power was achieved by the MZIP.
To summarize, compared with the univariate approach, the proposed multivariate method improved upon the UZIP’s performance for the binary part of the model by maintaining type I error while boosting the ability to identify true associations under the simulated settings. For the count part of the model, there were some scenarios in which the power of UZIP was higher than with the MZIP approach. This higher power of UZIP was at a cost of a highly inflated false discovery rate, whereas the MZIP FPR was %. Performance of the MZIP model was enhanced by increasing the prior inclusion probabilities, . The TPR then exceeded 80% in all nonnull scenarios and FPR remained % across all scenarios.
4 Application to Pediatric HIV/AIDS Cohort Study Data
The proposed MZIP method was originally motivated by research into whether cariesassociated bacteria differ in PHIV and PHEU youth (Moscicki et al., 2016; Ryder et al., 2017). The 254 subjects were age 10 to 22 years at the time of an oral health examination done from September 2012 to January 2014. Subgingival dental plaque samples were collected at two preselected sites and excluded if participants had antibiotic exposure in the previous three months. DNA was isolated from plaque specimens and 16S rDNA sequenced (Caporaso et al., 2011; Gomes et al., 2015). The sequencing reads were trimmed, filtered, and grouped using the DADA2 pipeline, and reads matched to the curated Human Oral Microbiome Database (99.9% of reads matched to the species or genus level) (Dewhirst et al., 2010). Each subject had a count (number of sequencing reads) for each taxon identified in the study.
4.1 Analysis Details and Prior Specifications
We focused our analysis on taxa: 14 known cariesassociated species (Aas et al., 2008) and any additional species that were highly correlated with them in this dataset (Figure 2 (a)). HIV infection status (=; 0, uninfected; 1, infected) was the covariate of primary interest, with adjustment for participants’ age (=) without performing variable selection on it (i.e., it was “forced” into the model). To account for sequencing depth variation across samples, the total number of sequencing reads was included as an offset in the count model.
We fit the proposed MZIP model and the UZIP model to PHACS data. For the Bayesian MZIP approach, we set the hyperparameters, (, , , , , , , , , , , , , ), to the same values as in Section 3. Since performance of the MZIP was improved when the prior inclusion probabilities were increased from 0.1 to 0.5 in the simulation studies, we set =0.5. We ran two independent MCMC chains for 2 million iterations, each with the first half taken as burnin. We assessed convergence of the MCMC sampler by plotting traces of the MCMC scans for each parameter. A visual assessment of convergence to the stationary distribution was carried out by overlaying plots for the two MCMC chains.
We also calculated the posterior median with 95% credible intervals for the marginal IDR for MZIP (described in Section
2.3). We ageadjusted the estimate by using the mean value, 16 years.4.2 Results
Binary  Count  IDR  
IP  IP  
PM (95% CI)  PM (95% CI)  PM (95% CI)  
Actinomyces graevenitzi  0.39 (0.67, 0.09)  0.54  0.25 (0.22, 0.72)  0.11  0.96 (0.57, 1.08)  
Fusobacterium periodonticum  0.16 (0.60, 0.48)  0.07  0.43 (0.66, 0.14)  0.93  0.66 (0.53, 0.98)  
MZIP  Lachnoanaerobaculum orale  0.10 (0.17, 0.33)  0.03  0.67 (1.15, 0.25)  0.95  0.52 (0.24, 0.86) 
Leptotrichia sp oral taxon 215  0.44 (0.75, 0.17)  0.63  0.04 (0.19, 0.20)  0.02  0.86 (0.74, 1.00)  
Veillonella genus NOI  0.04 (0.60, 0.45)  0.05  0.37 (0.60, 0.12)  0.88  0.71 (0.57, 1.00)  
Est (95% CI)  Est (95% CI)  
Actinomyces graevenitzi  0.27 (0.89, 0.35)  0.46 (0.69, 0.23)  
Fusobacterium periodonticum  0.08 (0.93, 0.77)  0.80 (0.84, 0.75)  
UZIP  Lachnoanaerobaculum orale  0.31 (0.38, 0.99)  2.58 (2.79, 2.36)  
Leptotrichia sp oral taxon 215  0.41 (1.04, 0.21)  0.26 (0.41, 0.11)  
Veillonella genus NOI  0.20 (1.02, 1.41)  0.24 (0.27, 0.21)  
In the MZIP model, the posterior median (PM) and 95% credible interval (CI) of regression parameters and incidence density ratio (IDR) are  
computed.  
In the UZIP model, the maximum likelihood estimates (Est) and 95% confidence intervals (CI) of regression parameters are computed.  
Adjusted for individuals with age of 16 years.  
not otherwise identified 
With a marginal posterior probability cutoff of 0.5, the MZIP method identified three species (Fusobacterium periodonticum, Lachnoanaerobaculum orale, and Veillonella genus NOI) for which counts were associated with HIV infection among “susceptible” individuals, and it selected two species (Actinomyces graevenitzii and Leptotrichia sp oral taxon 215) for which the probability of participants being susceptible to these two species was associated with HIV infection (Figure 1). In contrast, the frequentist UZIP analysis with 95% confidence intervals identified 39 species whose levels were associated with HIV infection in the susceptible population, including the three associations identified by MZIP. The UZIP analysis identified no species for which the probability of being “susceptible” to that species was associated with HIV infection. Based on the methods’ relative performance in the simulation study, it is plausible that the UZIP’s lack of accounting for outcomes’ correlation patterns, which are complex (Figure 2 (a)), grossly inflated the type I error rate in the count model and may also have decreased sensitivity of the zero model.
Indeed, comparing the estimated associations between HIV infection and the five taxa selected based on the MZIP model (Table 3), the uncertainty associated with the count part of UZIP model was much smaller than that generated by the MZIP method. As with the simulation, this may have accounted for the detection of 39 associations, many of which are presumably false positive associations. Because of how we calculated the IDR estimate, it has an agespecific interpretation. For example, 16 yearold youth, PHIV youth had 34% (95% credible interval 2%, 47%) and 48% (95% credible interval 14%, 76%) lower abundance of Fusobacterium periodonticum and Lachnoanaerobaculum orale species, respectively, compared with PHEU youth.
The proposed MZIP model captures withinsubject dependence among multiple outcomes via two correlation components, and . The dependence patterns arising from the empirical correlations appear to reflect, strongly, the correlation structure predicted by the binary component of the proposed MZIP (Figure 2). This implies that in these data, the presence of taxa is more structured than their counts. This result is not attributable to smoothing of the empirical correlation from having added 1 to every count, because the results were not sensitive to changes of this value to 0.5, 0.1, and 0.01. The model also provides an opportunity to quantify and compare the contribution of zero inflation versus other sources of overdispersion to microbiome taxon abundance (see Supplementary Material F for posterior estimates of and ).
5 Discussion
We have described the development of a new Bayesian variable selection method that addresses challenges arising in the analysis of microbiome sequencing data: excess zero counts and highdimensional outcomes with a complex association structure. Applying the proposed multivariate approach led to the identification of two species for which the probability of being susceptible to those species was associated with HIV infection; these associations did not meet FDR thresholds when the existing univariate approach was applied. In addition, based on the estimated induced marginalized IDR under the proposed model, another two species were less abundant in HIVinfected youth aged 16 years compared with PHEU youth of the same age.
One might question how realistic are these analyses when they are adjusted for only one confounder, age. Some reassurance might be provided from the observation that in univariate analyses that included additional confounders (e.g., sex, race, and dental visit in previous year as a marker of oral hygiene), inference was not greatly altered compared with models including only HIV status and age. We are continuing to study performance in a range of datasets, including more complete confounder adjustment. We are also working to scale up the proposed method in the number of endpoints, as discussed further below, and also in the number of covariates, both of which are required for integrated omics analyses.
The simulation study demonstrated superior performance of the proposed MZIP approach over the existing UZIP method when outcomes were correlated. The sample size was small enough that asymptotic assumptions under the frequentistbased UZIP model did not hold. This affected estimation of the asymptotic variance of the regression parameters for the count model, which was not a limitation for the proposed multivariate Bayesian approach. This difference in performance is primarily because i) for small data settings, estimation is generally more stable with Bayesian approaches, which exploit information from both the observed data likelihood and prior distribution; and ii) the MZIP method uses information not only on the mean model but also from the structure of covariance among outcomes.
We used a multivariate probit model for the binary part of the MZIP mixture model. An alternative is to assume a multivariate logistic distribution for (O’Brien and Dunson, 2004), for which posterior computation can be facilitated based on a data augmentation algorithm (Albert and Chib, 1993). However, initial numerical studies using the latter approach resulted in prohibitively slow mixing of the MCMC algorithms due to sparseness of data, even for data with =10 and assuming an unstructured covariance pattern. This is because the multivariate logistic model specification requires the estimation of more latent parameters than does the multivariate probit model. Thus, the multivariate probit model in the MZIP proved to be much more computationally tractable.
We have presented the model in its most general form that allows the importance of each covariate, as well as the correlation structure among the multivariate outcomes, to vary across the binary and the count components of the model. This gives the user maximal flexibility and provides evidence on how a covariate is associated with each response, i.e. with more zeros or higher counts. The question arises whether the complexity of the model is necessary or whether simpler models should suffice. Analysis of the motivating data suggests that different correlation structures were needed in this case. It is difficult to provide a general answer to this question until we have had more opportunity to apply it a range of datasets and compare results with those obtained in simpler models. One would not be able to make this comparison if the most general model is not available as a basis for comparison. Yet, simpler models might well be useful in other datasets.
Thus, the software implementation of the proposed approach offers more parsimonious versions of the model, simplified by imposing additional restrictions regarding the model parameters. For example, the two model parts can be forced to have one common variable selection indicator by setting . In practice, such a restriction might facilitate implementation by providing a single vector of variable selection indicators, i.e. one list as to which species are associated with each covariate. A different assumption is that both model parts share the same covariance pattern (), which will greatly reduce the number of parameters to be estimated and thus the computational complexity in the MCMC algorithm, especially for data with large . In our initial analysis, fitting this restricted model to the PHACS data yielded unreliable estimates of , because the assumption that is violated in these data (Figure 2). Again, the restricted model may serve well in other datasets. Therefore, we made available the algorithms to implement both types of simpler MZIP models as options in the mBvs package.
There are several ways the proposed framework could be extended. First, marginalized zeroinflated models have recently been developed so that inference can be made on the marginal mean of the sampled population via a set of unified regression coefficients (Long et al., 2015; Tabb et al., 2016). The unified regression coefficients have better interpretability, as the marginalized models do not require the additional steps described in Section 2.3 to address the dependence of parameter values on . In some applications, it may be more appropriate to interpret the two sets of regression coefficients separately, yet there also may be other applications for which interpretability of regression parameters would be enhanced by adopting a marginalized model within the proposed multivariate Bayesian variable selection method. Marginalization may also provide more stable model fitting. Second, although we focused data analysis on a preselected subset of species in the application, often microbiologists’ goal is to perform wholecommunity oral microbiome analysis, which generally involves several hundred taxa. We are currently working to address the computational issues arising from an even higherdimensional parameter space. Because the complexity mainly results from the flexible unstructured covariance model, we propose to scale up the proposed MZIP method by adopting alternative correlation structures that can flexibly accommodate potentially complicated patterns among hundreds of taxa. A final possibility is to study the interaction between the marginal posterior probability cutoff and the prior inclusion probability in controlling the FDR at the desired level under the proposed model.
In conclusion, the proposed framework gives researchers valid and powerful statistical tools to overcome major methodological barriers in microbiome sequencing data analysis. Beyond the study of the human microbiome, the methods, software, and guidance from simulation studies in this work will be useful in any field requiring analysis of multivariate zeroinflated count data.
SUPPLEMENTARY MATERIALS
 Rpackage ‘mBvs’:

Rpackage mBvs contains codes to implement proposed Bayesian framework described in the article. The package is currently available in CRAN (https://cran.rproject.org/web/packages/mBvs).
References
 Aas et al. (2008) Aas, J., Griffen, A., Dardis, S., Lee, A., Olsen, I., Dewhirst, F., Leys, E., and Paster, B. (2008). Bacteria of dental caries in primary and permanent teeth in children and young adults. Journal of Clinical Microbiology, 46(4):1407–1417.

Aitchison and Ho (1989)
Aitchison, J. and Ho, C. (1989).
The multivariate Poissonlog normal distribution.
Biometrika, 76(4):643–653.  Albert and Chib (1993) Albert, J. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American statistical Association, 88(422):669–679.
 Albert et al. (2014) Albert, J., Wang, W., and Nelson, S. (2014). Estimating overall exposure effects for zeroinflated regression models with application to dental caries. Statistical Methods in Medical Research, 23(3):257–278.
 Alperen et al. (2014) Alperen, J., Brummel, S., Tassiopoulos, K., Mellins, C., Kacanek, D., Smith, R., Seage, G., and Moscicki, A. (2014). Prevalence of and risk factors for substance use among perinatally human immunodeficiency virus–infected and perinatally exposed but uninfected youth. Journal of Adolescent Health, 54(3):341–349.
 Anderson (2001) Anderson, M. J. (2001). A new method for nonparametric multivariate analysis of variance. Austral Ecology, 26(1):32–46.
 Arab et al. (2012) Arab, A., Holan, S., Wikle, C., and Wildhaber, M. (2012). Semiparametric bivariate zeroinflated Poisson models with application to studies of abundance for multiple species. Environmetrics, 23(2):183–196.
 Ashford and Sowden (1970) Ashford, J. and Sowden, R. (1970). Multivariate probit analysis. Biometrics, 26:535–546.
 Benjamini and Hochberg (1995) Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B B, 57(1):289–300.
 Box and Tiao (2011) Box, G. and Tiao, G. (2011). Bayesian Inference in Statistical Analysis. John Wiley & Sons.
 Breiman and Friedman (1997) Breiman, L. and Friedman, J. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B, 59(1):3–54.
 Brown et al. (1998) Brown, P., Vannucci, M., and Fearn, T. (1998). Multivariate Bayesian variable selection and prediction. Journal of the Royal Statistical Society: Series B, 60(3):627–641.
 Caporaso et al. (2011) Caporaso, J., Lauber, C., Walters, W., BergLyons, D., Lozupone, C., Turnbaugh, P., Fierer, N., and Knight, R. (2011). Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. PNAS, 108:4516–4522.
 Chen and Li (2013) Chen, J. and Li, H. (2013). Variable selection for sparse Dirichletmultinomial regression with an application to microbiome data analysis. The Annals of Applied Statistics, 7(1):418–442.
 Chib and Greenberg (1998) Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit models. Biometrika, 85(2):347–361.
 Dewhirst et al. (2010) Dewhirst, F. E., Chen, T., Izard, J., Paster, B. J., Tanner, A., Yu, W.H., Lakshmanan, A., and Wade, W. G. (2010). The human oral microbiome. Journal of bacteriology, 192(19):5002–5017.
 Earnest et al. (2007) Earnest, A., Morgan, G., Mengersen, K., Ryan, L., Summerhayes, R., and Beard, J. (2007). Evaluating the effect of neighbourhood weight matrices on smoothing properties of conditional autoregressive (car) models. International journal of health geographics, 6(1):54.
 Fernandes et al. (2009) Fernandes, M. V., Schmidt, A. M., and Migon, H. S. (2009). Modelling zeroinflated spatiotemporal processes. Statistical Modelling, 9(1):3–25.
 Fox (2013) Fox, J. (2013). Multivariate zeroinflated modeling with latent predictors: Modeling feedback behavior. Computational Statistics & Data Analysis, 68:361–374.
 George and McCulloch (1997) George, E. and McCulloch, R. (1997). Approaches for Bayesian variable selection. Statistica Sinica, 7:339–374.
 George and McCulloch (1993) George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889.
 Gomes et al. (2015) Gomes, B., Berber, V., Kokaras, A., Chen, T., and Paster, B. (2015). Microbiomes of endodonticperiodontal lesions before and after chemomechanical preparation. Journal of endodontics, 41(12):1975–1984.
 Hall (2000) Hall, D. B. (2000). Zeroinflated poisson and binomial regression with random effects: a case study. Biometrics, 56(4):1030–1039.
 Holmes et al. (2012) Holmes, I., Harris, K., and Quince, C. (2012). Dirichlet multinomial mixtures: generative models for microbial metagenomics. PLoS ONE, 7(2):e30126.
 La Rosa et al. (2012) La Rosa, P. S., Brooks, J. P., Deych, E., Boone, E. L., Edwards, D. J., Wang, Q., Sodergren, E., Weinstock, G., and Shannon, W. D. (2012). Hypothesis testing and power calculations for taxonomicbased human microbiome data. PloS one, 7(12):e52078.
 Lambert (1992) Lambert, D. (1992). Zeroinflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1):1–14.
 Lee et al. (2006) Lee, A. H., Wang, K., Scott, J. A., Yau, K. K. W., and McLachlan, G. J. (2006). Multilevel zeroinflated Poisson regression modelling of correlated count data with excess zeros. Statistical Methods in Medical Research, 15(1):47–61.
 Lee et al. (2017) Lee, K. H., Tadesse, M. G., Baccarelli, A. A., Schwartz, J., and Coull, B. A. (2017). Multivariate Bayesian variable selection exploiting dependence structure among outcomes: Application to air pollution effects on DNA methylation. Biometrics, 73(1):232–241.
 Li et al. (1999) Li, C., Lu, J., Park, J., Kim, K., Brinkley, P. A., and Peterson, J. P. (1999). Multivariate zeroinflated Poisson models and their applications. Technometrics, 41(1):29–38.
 Li (2015) Li, H. (2015). Microbiome, metagenomics, and highdimensional compositional data analysis. Annual Review of Statistics and Its Application, 2:73–94.
 Liu (2001) Liu, C. (2001). Bayesian analysis of multivariate probit models  discussion on the art of data augmentation. Journal of Computational and Graphical Statistics, 10(1):75–81.
 Liu and Sun (2000) Liu, C. and Sun, D. X. (2000). Analysis of intervalcensored data from fractionated experiments using covariance adjustment. Technometrics, 42(4):353–365.
 Loeys et al. (2012) Loeys, T., Moerkerke, B., De Smet, O., and Buysse, A. (2012). The analysis of zeroinflated count data: Beyond zeroinflated Poisson regression. British Journal of Mathematical and Statistical Psychology, 65(1):163–180.
 Long et al. (2015) Long, D. L., Preisser, J. S., Herring, A. H., and Golin, C. E. (2015). A marginalized zeroinflated Poisson regression model with random effects. Journal of the Royal Statistical Society: Series C, 64(5):815–830.
 Mantel (1967) Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer research, 27(2 Part 1):209–220.
 Mantel and Valand (1970) Mantel, N. and Valand, R. S. (1970). A technique of nonparametric multivariate analysis. Biometrics, 26(3):547–558.
 Moscicki et al. (2016) Moscicki, A.B., Yao, T.J., Ryder, M. I., Russell, J. S., Dominy, S. S., Patel, K., McKenna, M., Van Dyke, R. B., Seage III, G. R., Hazra, R., and Shiboski, C. H. (2016). The burden of oral disease among perinatally hivinfected and hivexposed uninfected youth. PloS one, 11(6):e0156459.
 Mueller et al. (2015) Mueller, N. T., Bakacs, E., Combellick, J., Grigoryan, Z., and DominguezBello, M. G. (2015). The infant microbiome development: mom matters. Trends in molecular medicine, 21(2):109–117.
 Neelon and Chung (2017) Neelon, B. and Chung, D. (2017). The LZIP: A Bayesian latent factor model for correlated zeroinflated counts. Biometrics, 73(1):185–196.

O’Brien and Dunson (2004)
O’Brien, S. M. and Dunson, D. B. (2004).
Bayesian multivariate logistic regression.
Biometrics, 60(3):739–746.  PerezMuñoz et al. (2017) PerezMuñoz, M. E., Arrieta, M.C., RamerTait, A. E., and Walter, J. (2017). A critical assessment of the “sterile womb” and “in utero colonization” hypotheses: implications for research on the pioneer infant microbiome. Microbiome, 5(1):48.
 Pflughoeft and Versalovic (2012) Pflughoeft, K. J. and Versalovic, J. (2012). Human microbiome in health and disease. Annual Review of Pathology: Mechanisms of Disease, 7:99–122.
 Preisser et al. (2012) Preisser, J. S., Stamm, J. W., Long, D. L., and Kincade, M. E. (2012). Review and recommendations for zeroinflated count regression modeling of dental caries indices in epidemiological studies. Caries research, 46(4):413–423.
 Ryder et al. (2017) Ryder, M. I., Yao, T.J., Russell, J. S., Moscicki, A.B., and Shiboski, C. H. (2017). Prevalence of periodontal diseases in a multicenter cohort of perinatally HIVinfected and HIVexposed and uninfected youth. Journal of clinical periodontology, 44(1):2–12.
 Starr et al. (2018) Starr, J., Huang, Y., Lee, K., Murphy, C., Moscicki, A., Shiboski, C., Ryder, M., Yao, T., Faller, L., Van Dyke, R., and Paster, B. (2018). Oral microbiota in youth with perinatally acquired HIV infection. Microbiome, page in press.
 Tabb et al. (2016) Tabb, L. P., Tchetgen, E. J., Wellenius, G. A., and Coull, B. A. (2016). Marginalized zeroaltered models for longitudinal count data. Statistics in biosciences, 8(2):181–203.
 Tanner and Wong (1987) Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540.
 Tassiopoulos et al. (2016) Tassiopoulos, K., Patel, K., Alperen, J., Kacanek, D., Ellis, A., Berman, C., Allison, S. M., Hazra, R., Barr, E., Cantos, K., et al. (2016). Following young people with perinatal HIV infection from adolescence into adulthood: the protocol for PHACS AMP Up, a prospective cohort study. BMJ open, 6(6):e011396.
 Wadsworth et al. (2017) Wadsworth, W. D., Argiento, R., Guindani, M., GallowayPena, J., Shelburne, S. A., and Vannucci, M. (2017). An integrative Bayesian Dirichletmultinomial regression model for the analysis of taxonomic abundances in microbiome data. BMC bioinformatics, 18(1):94.
 Wang et al. (2015a) Wang, X., Chen, M., Kuo, R. C., and Dey, D. K. (2015a). Bayesian spatialtemporal modeling of ecological zeroinflated count data. Statistica Sinica, 25(1):189–204.
 Wang et al. (2015b) Wang, Z., Ma, S., and Wang, C. (2015b). Variable selection for zeroinflated and overdispersed data with application to health care demand in Germany. Biometrical Journal, 57(5):867–884.
 Xu et al. (2015) Xu, L., Paterson, A. D., Turpin, W., and Xu, W. (2015). Assessment and selection of competing models for zeroinflated microbiome data. PloS one, 10(7):e0129606.
 Zeileis et al. (2008) Zeileis, A., Kleiber, C., and Jackman, S. (2008). Regression models for count data in R. Journal of statistical software, 27(8):1–25.
Comments
There are no comments yet.