Regression Analysis of Dependent Binary Data for Estimating Disease Etiology from Case-Control Studies

06/20/2019 ∙ by Zhenke Wu, et al. ∙ University of Michigan 0

In large-scale disease etiology studies, epidemiologists often need to use multiple binary measures of unobserved causes of disease that are not perfectly sensitive or specific to estimate cause-specific case fractions, referred to as "population etiologic fractions" (PEFs). Despite recent methodological advances, the scientific need of incorporating control data to estimate the effect of explanatory variables upon the PEFs, however, remains unmet. In this paper, we build on and extend nested partially-latent class model (npLCMs, Wu et al., 2017) to a general framework for etiology regression analysis in case-control studies. Data from controls provide requisite information about measurement specificities and covariations, which is used to correctly assign cause-specific probabilities for each case given her measurements. We estimate the distribution of the controls' diagnostic measures given the covariates via a separate regression model and a priori encourage simpler conditional dependence structures. We use Markov chain Monte Carlo for posterior inference of the PEF functions, cases' latent classes and the overall PEFs of policy interest. We illustrate the regression analysis with simulations and show less biased estimation and more valid inference of the overall PEFs than an npLCM analysis omitting covariates. A regression analysis of data from a childhood pneumonia study site reveals the dependence of pneumonia etiology upon season, age, disease severity and HIV status.



There are no comments yet.


page 16

page 41

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In epidemiologic studies of disease etiology, one important scientific goal is to assess the effect of explanatory variables upon disease etiology. Based on multiple binary non-gold-standard diagnostic measurements made on a list of putative causes with different error rates, this paper develops and demonstrates a regression analytic approach for drawing inference about the cause-specific fractions among the case population that depend on covariates. We illustrate the analytic needs raised by a study of pediatric pneumonia etiology.

Pneumonia is a clinical condition associated with infection of the lung tissue, which can be caused by more than 30 different species of microorganisms, including bacteria, viruses, mycobacteria and fungi (Scott et al., 2008). The Pneumonia Etiology Research for Child Health (PERCH) study is a seven-country case-control study of the etiology of severe and very severe pneumonia and has enrolled more than hospitalized children under five years of age and more than healthy controls (PERCH Study Group, 2019). The goal of the PERCH study is to estimate the population fractions of cases due to the pathogen causes, referred to as “population etiologic fractions” (PEFs) and to assign cause-specific probabilities for each pneumonia child given her measurements, termed as “individual etiologic fractions” (IEFs). The PERCH study also aims to understand the variation of the PEFs as a function of factors such as region, season, a child’s age, disease severity, nutrition status and human immunodeficiency virus (HIV) status.

The cause of lung infection cannot, except in rare cases, be directly observed (Hammitt et al., 2017). The PERCH study tests the presence or absence of a list of pathogens using specimens in peripheral compartments including the blood, sputum, pleural fluid and nasopharyngeal (NP) cavity (Crawley et al., 2017). In this paper, we focus on two sources of imperfect measurements: (a) NP Polymerase Chain Reaction (NPPCR) results from cases and controls that are not perfectly sensitive or specific, referred to as “bronze-standard” (BrS) data; and (b) blood culture (BCX) results from cases only that are perfectly specific but lack sensitivity, referred to as “silver-standard” (SS) data.

Valid inference about the population and individual etiologic fractions must address three salient characteristics of the measurements. First, tests lacking sensitivity such as NPPCR and BCX may miss true causative agent(s) which if unadjusted may underestimate the PEFs. Second, imperfect diagnostic specificities may result in the detection of multiple pathogens in NPPCR that may indicate asymptomatic carriage but not causes of pneumonia. Determining the primary causative agent(s) must use statistical controls. Third, multiple specimens are tested among the cases with only a subset available from the controls. Other large-scale disease etiology studies have raised similar analytic needs and challenges of integrating multiple sources of imperfect measurements of multiple pathogens to produce an accurate understanding of etiology (e.g., Saha et al., 2018; Kotloff et al., 2013).

To address the analytic needs, Wu et al. (2016) introduced a partially-latent class model (pLCM) as an extension to classical latent class models (LCMs Lazarsfeld, 1950; Goodman, 1974) that uses case-control data to estimate the PEFs. This prior work shows the capacity of the multivariate specimen measurements to inform the distribution of unobserved, or “latent” health status for an individual and the population. PLCM is a finite mixture model with components for multivariate binary data where a case observation is drawn from a mixture of components each representing a cause of disease, or “disease class”; Controls have no infection in the lung hence are assumed drawn from an observed class. The pLCM is a semi-supervised method for learning the unobserved classes, where the “label” (cause of disease) is observed for only a subset of subjects. Let represent case ’s disease class which is categorically distributed with probabilities equal to the PEFs in the -dimensional simplex . A case class can represent a single- or multiple-pathogen cause of pneumonia, or pathogen causes not targeted by the assays which we refer to as “Not Specified (NoS

)”. PLCM uses a vector of

response probabilities to specify the conditional distribution of the measurements in each class. PLCM is an example of restricted LCMs (RLCMs, Wu et al., 2019) which restrict how the response probabilities differ by class to reflect the scientific knowledge that causative pathogens are more likely to appear in the upper respiratory tract in a pneumonia child than a healthy control. In particular, each causative pathogen is assumed to be observed with a higher probability in case class (sensitivity or true positive rate, TPR) than among the controls; A non-causative pathogen is observed with the same probability as in the controls (1 - specificity or false positive rate). Under the pLCM, a higher observed marginal positive rate of pathogen among cases than controls indicates etiologic significance.

In a Bayesian framework, measurements of differing precisions can be optimally combined under a pLCM to generate stronger evidence about . The pLCM is partially-identified (Jones et al., 2010; Gu and Xu, 2019b). There exist two sets of values of a subset of model parameters (here the TPRs) that the likelihood function alone cannot distinguish even with infinite samples; Bounds on the parameters however are available (e.g., Wu et al., 2016, Equation 6). Informative prior distributions for the TPRs elicited from laboratory experts or estimated from vaccine probe studies for a subset of pathogens (Feikin et al., 2014) can be readily incorporated to improve inference (Gustafson, 2015).

The pLCM assumes “local independence” (LI) which means the BrS measurements are mutually independent given the class membership. This classical assumption is central to mixture models for multivariate data, because the estimation procedures essentially find the optimal partition of observations into subgroups so that the LI approximately holds in each subgroup. Deviations from LI, or “local dependence” (LD) are testable using the control BrS data, which can be accounted for by an extension of pLCM, called nested partially-latent class model (npLCM, Wu et al., 2017). In each class, the npLCM uses the classical LCM formulation that has the capacity to describe complex multivariate dependence among discrete data (Dunson and Xing, 2009). For example, it assumes the within-class correlations among NPPCR tests are induced by unobserved heterogeneity in subjects’ propensities for pathogens colonizing the nasal cavities. In particular, LD is induced in an npLCM by nesting latent subclasses within each class , where subclasses respond with distinct vectors of probabilities. In a Bayesian framework with a prior that encourages few important subclasses, the npLCM reduces the bias in estimating , retains estimation efficiency and offers more valid inference under substantial deviation from LI.

Extensions to incorporate covariates in an npLCM are critical for two reasons. Firstly, covariates such as season, age, disease severity and HIV status may directly influence . Secondly, in an npLCM without covariates, the relative probability of assigning a case subject to class versus class depends on the FPRs (Wu et al., 2016) which are estimable using the control data. However, the FPRs may vary by covariates which if not modeled will bias the assignment of cause-specific probabilities for each case subject. For example, pathogen A found in a case’s nasal cavity less likely indicates etiologic significance than a colonization during seasons with high asymptomatic carriage rates, and much more so when the same pathogen rarely appears in healthy subjects.

Adapting existing no-covariate methods to account for discrete covariates, one may perform a fully-stratified analysis by fitting an npLCM to the case-control data in each covariate stratum. Like pLCM, the npLCM is partially-identified in each stratum (Wu et al., 2017), necessitating multiple sets of independent informative priors across multiple strata. There are two primary issues with this approach. First, sparsely-populated strata defined by many discrete covariates may lead to unstable PEF estimates. Second, it is often of policy interest to quantify the overall cause-specific disease burdens in a population. Let the overall PEFs be the empirical average of the stratum-specific PEFs. Since the informative TPR priors are often elicited for a case population and rarely for each stratum, reusing independent prior distributions of the TPRs across all the strata will lead to overly-optimistic posterior uncertainty in , hampering policy decisions.

Estimating disease etiology across discrete and continuous epidemiologic factors needs new methods in a general modeling framework. In this paper, we extend the npLCM to perform regression analysis in case-control disease etiology studies that (a) incorporates controls to estimate the PEFs, (b) specifies parsimonious functional dependence of upon covariates such as additivity, and (c) correctly assesses the posterior uncertainty of the PEF functions and the overall PEFs by applying the TPR priors just once.

The rest of the paper is organized as follows. Section 2 overviews the npLCM without covariates. Section 3 builds on the npLCM and makes the regression extension. We demonstrate the estimation of disease etiology regression functions through simulations in Section 4; We also show superior inferential performance of the regression model in estimating the overall PEFs relative to an analysis omitting the covariates. In Section 5, we characterize the effect of seasonality, age, HIV status upon the PEFs by applying the proposed npLCM regression model to the PERCH data. The paper concludes with a discussion.

2 Overview of npLCMs without Covariates

Let binary BrS measurements indicate the presence or absence of pathogens for subject . Let indicate a case () or a control () subject. If , let represent case ’s unobserved disease class; Otherwise, let because a control subject’s class is known (in PERCH, no lung infection). In this paper, we simplify the presentation of models by focusing on single-pathogen causes (hence ). The npLCM readily extends to for including additional pre-specified multi-pathogen and/or “Not Specified” (NoS) causes (Wu et al., 2017).

The likelihood function for an npLCM has three components: (a) PEFs or cause-specific case fractions: ; (b) : a table of probabilities of making binary observations in a case class ; (c) : the same probability table but for controls. Since cases’ disease classes are unobserved, the distribution of cases’ measurements is a finite-mixture model with weights for the disease classes: .

Models in this section differ by how and {} are specified; Regression models in Section 3 further incorporate covariate into the specifications ( as well). More specifically, the likelihood of an npLCM (Wu et al., 2017) is a product of case () and control () likelihood functions


where and are sensitivity and specificity parameters necessary for modeling the imperfect measurements; The rest of parameters , . Existing methods for estimating

in the framework of npLCM can be classified by whether or not

and assumes local independence (LI) which means measurements are independent of one another given the class (). In Equation (1), LI results if and only if ; Otherwise, and account for deviations from LI given a control or disease class.

PLCM. under the original pLCM (Wu et al., 2016) satisfies LI and equals a product of probabilities: , where

is the probability mass function for a product Bernoulli distribution given the success probabilities

, and the parameters represent the positive rates absent disease, referred to as “false positive rates” (FPRs). For example, in the PERCH data, Respiratory Syncytial Virus (RSV) has a low observed FPR because of its rare appearance in controls’ NPs; Other pathogens such as Rhinovirus (RHINO) have higher observed FPRs.

For , the pLCM makes a key “non-interference” assumption that disease-causing pathogen(s) are more frequently detected among cases than controls and the non-causative pathogens are observed with the same rates among cases as in controls (Wu et al., 2017). The “non-interference” assumption says that in a case class is a product of the probabilities of measurements made (a) on the causative pathogen , , where and (b) on the non-causative pathogens , where represents all but the -th element in a vector . The parameter is termed “true positive rate” (TPR) and may be larger than the FPR ; Under the single-pathogen-cause assumption, pLCM uses TPRs for causes and FPRs .

NPLCM. To reduce estimation bias in under deviations from LI, the “nested pLCM” or npLCM extends the original pLCM to describe residual correlations among binary pathogen measurements in the controls () and in each case class (, ) (Wu et al., 2017). The extension is motivated by the ability of the classical LCM formulation (Lazarsfeld, 1950) to approximate any joint multivariate discrete distribution (Dunson and Xing, 2009).

For in the controls, the npLCM introduces subclasses; The original pLCM results if . Given a subclass , the probability of observing binary measurements among controls is , where is the -th column of a by FPR matrix . Since we do not observe controls’ subclasses, is a weighted average of according to the subclass probabilities : .

For in case class , the npLCM again introduces unobserved subclasses and assumes is a weighted average of according to the case subclass weights : . In particular, the npLCM assumes the probability of observing in subclass in disease class , , is a product of the probabilities of making an observation (a) on the causative pathogen : and (b) on non-causative pathogens , where is the -th column of excluding the -th row. We collect the TPRs in a by TPR matrix . We summarize the preceding specification by , where the vector represents the positive rates for measurements in subclass of disease class : which equals the TPR for a causative pathogen and the FPR otherwise; Here is an indicator function that equals if the statement A is true and otherwise.

The likelihood for npLCM results upon substituting and above into Equation (1): . Setting and , the special case of pLCM results.

Similar to the pLCM, the FPRs in the npLCM are shared among controls and case classes over non-causative pathogens (via ). Different from the pLCM, the subclass mixing weights may differ between cases () and controls (). The special case of , means the covariation patterns among the non-causative pathogens in a disease class is no different from the controls. However, relative to controls, diseased individuals may have different strength and direction of measurement dependence in each disease class. By allowing the subclass weights to differ between the cases and the controls, npLCM is more flexible than pLCM in referencing cases’ measurements against controls.

3 Regression Analysis via npLCM

We extend npLCM to perform regression analysis of data , where are covariates that may influence case ’s etiologic fractions and is a possibly different vector of covariates that may influence the subclass weights among the controls and the cases; Let the continuous covariates comprise the first and elements of and , respectively. A subset of may be available from the cases only. We let if so that all the covariates for a control subject are included in ; Let for a case subject. For example, healthy controls have no disease severity information. We let three sets of parameters in an npLCM (1) depend on the observed covariates: (a) the etiology regression function among cases, , which is of primary scientific interest, (b) the conditional probability of measurements given covariates in case classes: , , (c) and in the controls ; We keep the specifications for the TPRs and FPRs (, ) as in the original npLCM.

3.1 Disease Etiology Regression

is the primary target of inference. Recall that represents case ’s disease being caused by pathogen . We assume this event occurs with probability

that depends upon covariates. In our model, we use a multinomial logistic regression model

, , where

is the log odds of case

in disease class relative to : . Without specifying a baseline category, we treat all the disease classes symmetrically which simplifies prior specification. We further assume additive models for , where is the subvector of the predictors that enters the model for all disease classes as linear predictors and collects all the parameters. For covariates such as enrollment date that serves as a proxy for factors driven by seasonality, nonlinear functional dependence is expected. We use B-spline basis expansion to approximate and use P-spline for estimating smooth functions (Lang and Brezger, 2004). Finally, we specify the distribution of case measurements given disease class , covariates and . We extend the case likelihood in an npLCM (1) to let the subclass weights depend on covariates : . Integrating over unobserved disease classes, we obtain the likelihood function for the cases that incorporates covariates :


where and are the regression parameters; The form of is introduced in the model for controls.

3.2 Covariate-dependent reference distribution

Data from controls provide requisite information about the specificities and covariations at distinct covariate values, necessitating adjustment in an npLCM analysis. For example, factors such as enrollment date is a proxy for season and may influence the background colonization rates and interactions of some pathogens that circulate more during winter (Obando-Pacheco et al., 2018; Nair et al., 2011). We propose a novel approach to estimating the reference distribution of measurements that may depend on covariates using control data.

The regression model for a control subject is a mixture model with covariate-dependent mixing weights : where FPRs do not depend on covariates and the vector lies in a -simplex . We discuss the FPRs and the subclass weight functions in order.

Firstly, constant FPR profiles enable coherent interpretation across individuals with different covariate values (Erosheva et al., 2007). FPR profile receives a weight of for a control subject with covariates . The marginal FPRs in the controls , , also depend on . Consequently, observed marginal control positive curve for a pathogen informs how different the FPRs are across the subclasses. For example, if the NPPCR measure of pathogen A shows strong seasonal trends among the controls, the estimated FPRs will be more variable across the subclasses. And the subclass with a high FPR will receive a larger weight during seasons with higher carriage rates in controls. The control model reduces to special cases, with covariate-independent , , resulting in the in a -subclass npLCM without covariates; A further single-subclass constraint () gives the in the original pLCM.

Secondly, we parameterize the case and control subclass weight regressions and using the same regression form but different parameters.

Control subclass weight regression. We rewrite the subclass weights , using a stick-breaking parameterization. Let be a link function. Let be subject ’s linear predictor at stick-breaking step . Using the stick-breaking analogy, we begin with a unit-length stick, break a segment of length and continue breaking a fraction from the remaining and so on; At step , we break a fraction from what is left in the preceding steps resulting in the -th stick segment of length ; We stop until sticks of variable lengths result. In this paper, we use the logistic function

which is consistent with the multinomial logit regression for

so that the priors of the coefficients and can be similar (Supplementary Materials A.2). Generalization to other link functions such as the probit function is straightforward (e.g., Rodriguez and Dunson, 2011). We use this parameterization to introduce a novel shrinkage prior on a simplex for the subclass weights (see Supplementary Materials A.1) which encourages fewer than effective subclasses, or “-sparse” shrinkage prior on the simplex. This provides parsimonious approximation to the conditional distribution of control measurements using a few subclasses.

In our analysis, we use generalized additive models (Hastie and Tibshirani, 1986) for the -th linear predictor for . We have parameterized the possibly nonlinear using B-spline basis expansions with coefficients ; are the linear effects of a subset of predictors which can include an intercept and is a subvector of predictors ; Let collect all the regression parameters. Following Lang and Brezger (2004), we constrain to have zero means for statistical identifiability. Supplementary Materials A.2 provides the technical details about the parameterization of .

The subclass-specific intercepts globally control the magnitudes of the linear predictors. We hence propose priors on to a priori encourage few subclasses (see Supplementary Materials A.1). In particular, a large positive intercept makes and hence breaks nearly the entire remaining stick after the -th stick-breaking. Since the stick-breaking parameterization one-to-one maps to a classical latent class regression model formulation for the control data, the linear predictor and the sum are identifiable except in a Lebesgue zero set of parameter values, or “generic identifiability” (Huang and Bandeen-Roche, 2004). Consequently, even if the intercept is not statistically identified if includes an intercept , the MCMC samples of the statistically identifiable functions can provide valid posterior inferences (Carlin and Louis, 2009). We write the control likelihood with covariates as . Supplementary Materials B provides further remarks on the assumption for introducing covariates into the control model.

Case subclass weight regression. The subclass weight regression functions for cases are also specified via a logistic stick-breaking regression as in the controls but with different parameters: ,

. Since given the TPRs and the FPRs, the subclass weights fully determine the joint distribution

hence the measurement dependence in each class, we let and be different between cases and controls for any .

Let the -th linear predictor , where are the regression parameters that differ from the control counterpart (). In particular, we approximate , here using the same set of B-spline basis functions as in the controls but estimate a different set of basis coefficients . In addition, we have directly used the intercepts from the control model to ensure only important subclasses in the controls are used in the cases. For example, absent covariates , a large and positive effectively halts the stick breaking procedure at step for the controls (); Applying the same intercept to the cases makes .

Combining the case () and control likelihood () with covariates, we obtain the joint likelihood for the regression model .

Under an assumption (A1): the case subclass weights are constant over covariates: , , the regression model reduces to an npLCM model without covariates upon integration over a distribution of covariates . To see this, the case and control likelihood functions and integrate to and , respectively; Here and where and are probability or empirical distributions of and , respectively. The mathematical equivalence enables valid inference about the overall PEFs omitting and (see Supplementary Materials E.2 for an example). The no-covariate analysis becomes deficient under deviations from (A1); Section 4 provides examples.

3.2.1 Priors and Posterior Inference

The unknown parameters include the coefficients in the etiology regression (, the subclass mixing weight regression for the cases () and the controls (), the true and false positive rates . With typical samples sizes about controls and cases in each study site, the number of parameters in controls likelihood () easily exceeds the number of distinct binary measurement patterns observed. To overcome potential overfitting and increase model interpretability, we a priori place substantial probabilities on models with the following two features: (a) Few non-trivial subclasses via a novel additive half-Cauchy prior for the intercepts , and (b) for a continuous variable, smooth regression curves , and by Bayesian Penalized-splines (P-splines) (Lang and Brezger, 2004) combined with shrinkage priors on the spline coefficients (Ni et al., 2015) to encourage towards constant values, , which reduces to the original npLCM. Supplementary Materials A details the prior specifications.

We use the Markov chain Monte Carlo (MCMC) algorithm to draw samples of the unknowns to approximate their joint posterior distribution (Gelfand and Smith, 1990). Flexible posterior inferences about any functions of the model parameters and individual latent variables are available by plugging in the posterior samples of the unknowns. For example, the posterior samples of the case positive rate curve for pathogen help evaluate model fit. The red bands in Row 1 of Figure 1 are posterior credible bands obtained by substituting relevant parameters with their sampled values across MCMC iterations in . The npLCMs with or without covariates are fitted using a free and publicly available R package baker ( Baker calls an external automatic Bayesian model fitting software JAGS 4.2.0 (Plummer et al., 2003)

from within R and provides functions to visualize the posterior distributions of the unknowns (e.g., the PEFs and cases’ latent disease class indicators) and perform posterior predictive model checking

(Gelman et al., 1996). Supplementary Materials C details the convergence diagnostics.

4 Simulations

We simulate case-control bronze-standard (BrS) measurements along with observed continuous and/or discrete covariates under multiple combinations of true model parameter values and sample sizes that mimic the motivating PERCH study. In Simulation I, we illustrate flexible statistical inferences about the PEF functions . In Simulation II, we focus on the overall PEFs that quantify the overall cause-specific disease burdens in a population which are of policy interest. Let be an empirical average of , . We compare the frequentist properties of the posterior mean obtained from analyses with or without covariate (Little et al., 2011). Regression analyses reduce estimation bias, retain efficiency and provide more valid frequentist coverage of the CrIs. The relative advantage varies by the true data generating mechanism and sample sizes.

In all analyses here, we use a working number of subclasses, with independent Beta(7.13,1.32) TPR prior distributions that match 0.55 and 0.99 with the lower and upper quantiles, respectively; We specify Beta(1,1) for the identifiable FPRs. The priors for the regression coefficients follow the specifications in Supplementary Materials A.

Simulation I. We demonstrate that the inferential algorithm recovers the true PEF functions . We simulate cases and controls for each of two levels of (a discrete covariate) and uniformly sample the subjects’ enrollment dates over a period of days. Supplementary Materials D specifies the true data generating mechanism and the regression specifications. Based on the simulated data, pathogen A has a bimodal positive rate curve mimicking the trends observed of RSV in one PERCH site; other pathogens have overall increasing positive rate curves over enrollment dates. We set the simulation parameters in a way that the marginal control rate may be higher than cases for small ’s (impossible under the more restrictive pLCM). Row 2 of Figure 1 visualizes for the causes (by column), the posterior means (thin black line) and CrIs (gray bands) for the etiology regression curves are close to the simulation truths . Supplementary Materials E provides additional simulation results to assess the recovery of the true for a discrete covariate .

Figure 1: Row 2) For each of the causes (by column) in Simulation I, the posterior mean (thin black curves) and pointwise credible bands (gray bands) for the etiology regression curves are close to the simulation truths . In row 1), the fitted case (red) and control (blue) positive rate curves are shown with the posterior mean curves (solid black curves) and pointwise credible bands (shaded); The rug plots show the positive (top) and negative (bottom) measurements made on cases and controls on the enrollment dates. The solid horizontal lines in row 1 indicate the true TPRs.

Simulation II. We show the regression model accounts for population stratification by covariates hence reduces the bias of the posterior mean in estimating the overall PEFs () and produces more valid CrIs. We illustrate the advantage of the regression approach under simple scenarios with a single two-level covariate ; We let . We perform npLCM regression analysis with for each of replication data sets simulated under each of scenarios detailed in Supplementary Materials D that correspond to distinct numbers of causes, sample sizes, relative sizes of PEF functions (rare versus popular etiologies), signal strengths (more discrepant TPRs and FPRs indicate stronger signals, Wu et al. (2016)), and effects of on and .

In estimating , we evaluate the bias , where is the true overall PEF, and is an empirical average of the posterior mean PEFs at . We also evaluate the empirical coverage rates of the CrIs.

The regression model incorporates covariates and performs better in estimating than a model omitting covariates. For example, Figure 2(a) shows for that, relative to no-covariate npLCM analyses, regression analyses produce posterior means that on average have negligible relative biases (percent difference between the posterior mean and the truth relative to the truth) for each pathogen across simulation scenarios. As expected, we observe slight relative biases from the regression model in the bottom two rows of Figure 2(a), because the informative TPR prior Beta(7.13,1.32) has a mean value lower than the true TPR ; A more informative prior further reduces the relative bias; See additional simulations in Supplementary Materials E on the role of informative TPR priors. Figure 2(b) regression analyses also produce CrIs for that have more valid empirical coverage rates in all scenarios. Misspecified models without covariates concentrate the posterior distribution away from the true overall PEFs, resulting in large biases that dominate the posterior uncertainty of which is evident from the more severe undercoverages with higher TPRs and lower FPRs (row 3 and 4 versus row 1 and 2, Figure 2).

(a) Percent relative bias
(b) Empirical coverage rates
Figure 2: The regression analyses produce less biased posterior mean estimates and more valid empirical coverage rates for over replications in Simulation II with . Each panel corresponds to one of combinations of true parameter values and sample sizes. Top) Each boxplot (left: regression; right: no regression) shows the distribution of the percent relative bias of the posterior mean in estimating the overall PEF for six causes (A - F); The red horizontal dashed lines indicate zero bias. Bottom) Each dot or triangle indicates the empirical coverage rate of the CrIs produced by analyses with regression () or without regression (); The nominal rate is marked by horizontal red dashed lines. Since each coverage rate for is computed from binary observations, the truth being covered or not, a CI is also shown.

5 Regression Analysis of PERCH Data

We restrict attention in this regression analysis to cases and controls from one of the PERCH study sites in the Southern Hemisphere that collected information on enrollment date (, August 2011 to September 2013; standardized), age (dichotomized to younger or older than one year), disease severity for cases (severe or very severe), HIV status (positive or negative) and presence or absence of seven species of pathogens (five viruses and two bacteria, representing a subset of pathogens evaluated) in nasopharyngeal (NP) specimens tested with polymerase chain reaction (PCR), or NPPCR (bronze-standard, BrS); We also include in the analysis the blood culture (BCX, silver-standard, SS) results for two bacteria from cases only. Detailed analyses of the entire data are reported in PERCH Study Group (2019).

Table 1 shows the observed case and control frequencies by age, disease severity and HIV status. The two strata with the most subjects are severe pneumonia children who were HIV negative and under or above one year of age. Some low or zero cell counts preclude fitting npLCMs by stratum. Regression models with additive assumptions among the covariates can borrow information across strata and stabilize the PEF estimates. Supplemental Figure S5 shows summary statistics for the NPPCR (BrS) and BCX (SS) data including the positive rates in the cases and the controls and the conditional odds ratio (COR) contrasting the case and control rates adjusting for the presence or absence of other pathogens (NPPCR only).

For NPPCR, pathogens RSV and Haemophilus influenzae (HINF) are detected with the highest positive rates among cases: and , respectively, which are higher than the corresponding control rates ( and ). The CORs are large, for RSV and for HINF, indicating etiologic importance. Adenovirus (ADENO) also has a statistically significant COR of . Human metapneumovirus type A or B (HMPV_A_B) and Parainfluenza type 1 virus (PARA_1) have larger positive and statistically significant CORs of and . However, the two pathogens rarely appear in cases’ nasal cavities (HMPV_A_B: , PARA_1: ), which in light of high sensitivities means non-primary etiologic roles. For the rest of pathogens, we observed similar case and control positive rates as shown by the statistically non-significant CORs (RHINO (case: ; control: ) and Streptococcus pneumoniae (PNEU) (case: ; control: ). Similar to Wu et al. (2017), we integrate case-only SS measurements for HINF and PNEU by using informative priors of the sensitivities (e.g., from vaccine probe studies e.g., Feikin et al. (2014)) to adjust the PEF estimates in a coherent Bayesian framework. It is expected that the rare detection of the two bacteria, for HINF and for PNEU from SS data, will lower their PEF estimates relative to the ones obtained from an NPPCR-only analysis.

We include in the regression analysis a cause “Not Specified (NoS)” to account for true pathogen causes other than the seven pathogens. We incorporate the prior knowledge about the TPRs of the NPPCR measures from laboratory experts. We set the Beta priors for sensitivities by and , so that the and quantiles match the lower and upper ranges of plausible sensitivity values of and , respectively. We specify the Beta(7.59,58.97) prior for the two TPRs of SS measurements similarly but with a lower range of . We use a working number of subclasses . In the etiology regression model , we use 7 d.f. for B-spline expansion of the additive function for the standardized enrollment date at uniform knots along with three binary indicators for age older than one, very severe pneumonia, HIV positive; In the subclass weight regression model , we use 5 d.f. for the standardized enrollment date with uniform knots and two indicators for age older than one and HIV positive. The prior distributions for the etiology and subclass weight regression parameters follow the specification in Supplementary Materials A.

age very severe (VS) HIV positive cases () controls ()
(case-only) total: 524 (100) total: 964 (100)
0 0 0 208 (39.7) 545 (56.5)
1 0 0 72 (13.7) 278 (28.8)
0 1 0 116 (22.1) 0
1 1 0 33 (6.3) 0
0 0 1 37 (7.1) 85 (8.8)
1 0 1 24 (4.5) 51 (5.3)
0 1 1 25 (4.8) 0
1 1 1 3 (0.6) 0
control: -
Table 1: The observed count (frequency) of cases and controls by age, disease severity and HIV status (1: yes; 0: no). The marginal fractions among cases and controls for each covariate are shown at the bottom. Results from the regression analyses are shown in Figure 3 for the first two strata.

The regression analysis produces seasonal estimates of the PEF function for each cause that varies in trend and magnitude among the eight strata defined by age, disease severity and HIV status. Figure 3 shows among two age-HIV-severity strata the posterior mean curve and pointwise credible bands of the etiology regression functions as a function of . For example, among the younger, HIV negative and severe pneumonia children (Figure 3(a)), the PEF curve of RSV is estimated to have a prominent bimodal temporal pattern that peaked at two consecutive winters in the Southern Hemisphere (June 2012 and 2013). Other single-pathogen causes HINF, PNEU, ADENO, HMPV_A_B and PARA_1 have overall low and stable PEF curves across seasons. The estimated PEF curve of NoS shows a trend with a higher level of uncertainty that is complementary to RSV because given any enrollment date the PEFs of all the causes sum to one. In contrast, Figure 3(b) shows a lower degree of seasonal variation of RSV PEF curve among the older, HIV negative and severe pneumonia children.

(a) Age year, severe pneumonia, HIV negative
(b) Age year, severe pneumonia, HIV negative
Figure 3: Estimated seasonal PEF for two most prevalent age-severity-HIV strata: younger (a) or older (b) than one, with severe pneumonia, HIV negative; Here the results are obtained from a model assuming seven single-pathogen causes (HINF, PNEU, ADENO, HMPV.A.B, PARA.1, RHINO, RSV) and an “Not Specified” cause. In an age-severity-HIV stratum and for each cause :
Row 2) shows the temporal trend of which is enveloped by pointwise credible bands shown in gray. The estimated overall PEF averaged among cases in the present stratum is shown by a horizontal solid line, below and above which are two dashed black lines indicating the and posterior quantiles. The rug plot on the x-axis indicates cases’ enrollment dates.
Row 1) shows the fitted temporal case (red) and control (blue) positive rate curves enclosed by the pointwise CrIs; The two rug plots at the top (bottom) indicate the dates of the cases and controls being enrolled and tested positive (negative) for the pathogen.

The regression model accounts for stratification of etiology by the observed covariates and assigns cause-specific probabilities for two cases who have identical measurements but different covariate values. Supplemental Figure S6 shows for two cases with all negative NPPCR results (the most frequent pattern among cases), the older case has a lower posterior probability of her disease caused by

RSV and higher probability of being caused by NoS. Indeed, contrasting older and younger children while holding the enrollment date, HIV, severity constant, the estimated difference in the log odds (i.e., log odds ratio) of a child being caused by RSV versus NoS is negative: .

Given age, severity and HIV status, we quantify the overall cause-specific disease burdens by averaging the PEF function estimates by the empirical distribution of the enrollment dates. Contrasting the results in the two age-severity-HIV strata in Figure 3(a) and 3(b), since the case positive rate of RSV among the older children reduces from to but the control positive rates remain similar (from to ), the overall PEF of RSV () decreases from to and attributing a higher total fraction of cases to NoS () from to ; The overall PEFs for other causes remain similar.

6 Discussion

In disease etiology studies where gold­-standard data are infeasible to obtain, epidemiologists need to integrate multiple sources of data of distinct quality to draw inference about the population and individual etiologic fractions. While the existing methods based on npLCM account for imperfect diagnostic sensitivities and specificities, complex measurement dependence and missingness, they do not describe the relationship between covariates and the PEFs. This paper addresses this analytic need by extending npLCM to a general regression modeling framework using case-control multivariate binary data to estimate disease etiology.

The proposed approach has three distinguishing features: 1) It allows analysts to specify a model for the functional dependence of the PEFs upon important covariates. And with assumptions such as additivity, we can improve estimation stability for sparsely populated strata defined by many discrete covariates. 2) The model incorporate control data for the inference of PEF curve. The posterior inferential algorithm estimates a parsimonious covariate-dependent reference distribution of the diagnostic measurements from controls. Finally, 3) the model uses informative priors of the sensitivities (TPRs) only once in a population for which these priors were elicited. Relative to a fully-stratified npLCM analysis that reuses these priors, the proposed regression analysis avoids overly-optimistic etiology uncertainty estimates.

We have shown by simulations that the regression approach accounts for population stratification by important covariates and as expected reduces estimation biases and produces credible intervals that have more valid empirical coverage rates than an npLCM analysis omitting covariates. In addition, the proposed regression analysis can readily integrate multiple sources of diagnostic measurements of distinct levels of diagnostic sensitivities and specificities, a subset of which are only available from cases (SS data), to further reduce the posterior uncertainty of the etiology estimates. Our regression analysis integrates the BrS and SS data from one PERCH site and reveals prominent dependence of the PEFs upon seasonality and a pneumonia child’s age, HIV status and disease severity.

Future work may improve the proposed methods. First, flexible and parsimonious alternatives to the additive models may capture important interaction effects (e.g., Linero, 2018). Second, in the presence of many covariates, class-specific predictor selection methods for may provide further regularization and improve interpretability (Gustafson et al., 2008). Third, when the subsets of pathogens that have caused the diseases in the population is unknown, the proposed method can be combined with subset selection procedures (Wu et al., 2019; Gu and Xu, 2019a). Finally, scalable posterior inference for multinomial regression parameters (e.g., Zhang and Zhou, 2017) will likely improve the computational speed in the presence of a large number of disease classes and covariates.

Supplementary Materials

The supplementary materials contain the technical details on prior specifications, a remark, additional simulation results and supplemental figures referenced in Main Paper.


We thank the PERCH study team led by Kathernine O’Brien for providing the data and scientific advice, Scott Zeger, Maria Deloria-Knoll, Christine Prosperi and Qiyuan Shi for insightful comments and valuable feedback about baker and Jing Chu for preliminary simulations. The research was partly supported by the Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1408-20318, ZW), NIH grants P30CA046592 (National Cancer Institute Cancer Center Support Grant Development Funds, Rogel Cancer Center; ZW and IC), U01CA229437 (ZW) and an Investigator Award from Precision Health Initiative and an MCubed Award from University of Michigan (ZW).


  • Bandeen-Roche et al. (1997) Bandeen-Roche, K., Miglioretti, D. L., Zeger, S. L., and Rathouz, P. J. (1997). Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92(440):1375–1386.
  • Bartolucci and Forcina (2006) Bartolucci, F. and Forcina, A. (2006). A class of latent marginal models for capture–recapture data with continuous covariates. Journal of the American Statistical Association, 101(474):786–794.
  • Bedrick et al. (1996) Bedrick, E. J., Christensen, R., and Johnson, W. (1996). A new perspective on priors for generalized linear models. Journal of the American Statistical Association, 91(436):1450–1460.
  • Brooks and Gelman (1998) Brooks, S. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455.
  • Carlin and Louis (2009) Carlin, B. and Louis, T. (2009). Bayesian methods for data analysis, volume 78. Chapman & Hall/CRC.
  • Chung et al. (2006) Chung, H., Flaherty, B. P., and Schafer, J. L. (2006). Latent class logistic regression: application to marijuana use and attitudes among high school seniors. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(4):723–743.
  • Crawley et al. (2017) Crawley, J., Prosperi, C., Baggett, H. C., Brooks, W. A., Deloria Knoll, M., Hammitt, L. L., Howie, S. R., Kotloff, K. L., Levine, O. S., Madhi, S. A., et al. (2017). Standardization of clinical assessment and sample collection across all perch study sites. Clinical infectious diseases, 64(suppl_3):S228–S237.
  • Deloria Knoll et al. (2017) Deloria Knoll, M., Fu, W., Shi, Q., Prosperi, C., Wu, Z., Hammitt, L. L., Feikin, D. R., Baggett, H. C., Howie, S. R., Scott, J. A. G., et al. (2017). Bayesian estimation of pneumonia etiology: epidemiologic considerations and applications to the pneumonia etiology research for child health study. Clinical infectious diseases, 64(suppl_3):S213–S227.
  • Dunson and Xing (2009) Dunson, D. and Xing, C. (2009). Nonparametric bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104(487):1042–1051.
  • Erosheva et al. (2007) Erosheva, E. A., Fienberg, S. E., and Joutard, C. (2007). Describing disability through individual-level mixture models for multivariate binary data. The annals of applied statistics, 1(2):346.
  • Feikin et al. (2014) Feikin, D., Scott, J., and Gessner, B. (2014). Use of vaccines as probes to define disease burden. The Lancet, 383(9930):1762–1770.
  • Gelfand and Smith (1990) Gelfand, A. and Smith, A. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American statistical association, pages 398–409.
  • Gelman et al. (2008) Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, pages 1360–1383.
  • Gelman et al. (1996) Gelman, A., Meng, X.-L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4):733–760.
  • Geweke and Zhou (1996) Geweke, J. and Zhou, G. (1996). Measuring the pricing error of the arbitrage pricing theory. The review of financial studies, 9(2):557–587.
  • Goodman (1974) Goodman, L. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61(2):215–231.
  • Gryparis et al. (2007) Gryparis, A., Coull, B. A., Schwartz, J., and Suh, H. H. (2007). Semiparametric latent variable regression models for spatiotemporal modelling of mobile source particles in the greater boston area. Journal of the Royal Statistical Society: Series C (Applied Statistics), 56(2):183–209.
  • Gu and Xu (2019a) Gu, Y. and Xu, G. (2019a). Learning attribute patterns in high-dimensional structured latent attribute models.

    Journal of Machine Learning Research

    , page In press.
  • Gu and Xu (2019b) Gu, Y. and Xu, G. (2019b). Partial identifiability of restricted latent class models. Annals of Statistics, page In press.
  • Gustafson (2015) Gustafson, P. (2015). Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data, volume 140. CRC Press.
  • Gustafson et al. (2008) Gustafson, P., Lefebvre, G., et al. (2008). Bayesian multinomial regression with class-specific predictor selection. The Annals of Applied Statistics, 2(4):1478–1502.
  • Hammitt et al. (2017) Hammitt, L. L., Feikin, D. R., Scott, J. A. G., Zeger, S. L., Murdoch, D. R., O’brien, K. L., and Deloria Knoll, M. (2017). Addressing the analytic challenges of cross-sectional pediatric pneumonia etiology data. Clinical infectious diseases, 64(suppl_3):S197–S204.
  • Hastie and Tibshirani (1986) Hastie, T. and Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3):297–318.
  • Huang and Bandeen-Roche (2004) Huang, G.-H. and Bandeen-Roche, K. (2004). Building an identifiable latent class model with covariate effects on underlying and measured variables. Psychometrika, 69(1):5–32.
  • Jones et al. (2010) Jones, G., Johnson, W., Hanson, T., and Christensen, R. (2010). Identifiability of models for multiple diagnostic testing in the absence of a gold standard. Biometrics, 66(3):855–863.
  • Kotloff et al. (2013) Kotloff, K. L., Nataro, J. P., Blackwelder, W. C., Nasrin, D., Farag, T. H., Panchalingam, S., Wu, Y., Sow, S. O., Sur, D., Breiman, R. F., et al. (2013). Burden and aetiology of diarrhoeal disease in infants and young children in developing countries (the global enteric multicenter study, gems): a prospective, case-control study. The Lancet, 382(9888):209–222.
  • Lang and Brezger (2004) Lang, S. and Brezger, A. (2004). Bayesian p-splines. Journal of computational and graphical statistics, 13(1):183–212.
  • Lazarsfeld (1950) Lazarsfeld, P. F. (1950). The logical and mathematical foundations of latent structure analysis, volume IV, chapter The American Soldier: Studies in Social Psychology in World War II, pages 362–412. Princeton, NJ: Princeton University Press.
  • Linero (2018) Linero, A. R. (2018). Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association, 113(522):626–636.
  • Little et al. (2011) Little, R. et al. (2011). Calibrated bayes, for statistics in general, and missing data in particular. Statistical Science, 26(2):162–174.
  • Morrissey et al. (2011) Morrissey, E. R., Juárez, M. A., Denby, K. J., and Burroughs, N. J. (2011). Inferring the time-invariant topology of a nonlinear sparse gene regulatory network using fully bayesian spline autoregression. Biostatistics, 12(4):682–694.
  • Nair et al. (2011) Nair, H., Brooks, W. A., Katz, M., Roca, A., Berkley, J. A., Madhi, S. A., Simmerman, J. M., Gordon, A., Sato, M., Howie, S., et al. (2011). Global burden of respiratory infections due to seasonal influenza in young children: a systematic review and meta-analysis. The Lancet, 378(9807):1917–1930.
  • Ni et al. (2015) Ni, Y., Stingo, F. C., and Baladandayuthapani, V. (2015). Bayesian nonlinear model selection for gene regulatory networks. Biometrics.
  • Obando-Pacheco et al. (2018) Obando-Pacheco, P., Justicia-Grande, A. J., Rivero-Calle, I., Rodríguez-Tenreiro, C., Sly, P., Ramilo, O., Mejías, A., Baraldi, E., Papadopoulos, N. G., Nair, H., et al. (2018). Respiratory syncytial virus seasonality: a global overview. The Journal of infectious diseases, 217(9):1356–1364.
  • PERCH Study Group (2019) PERCH Study Group (2019). Aetiology of severe hospitalised pneumonia in hiv-uninfected children from africa and asia: the pneumonia aetiology research for child health (perch) case-control study. Lancet.
  • Plummer et al. (2003) Plummer, M. et al. (2003). Jags: A program for analysis of bayesian graphical models using gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, volume 124.
  • Rodriguez and Dunson (2011) Rodriguez, A. and Dunson, D. B. (2011). Nonparametric bayesian models through probit stick-breaking processes. Bayesian analysis (Online), 6(1).
  • Saha et al. (2018) Saha, S. K., Schrag, S. J., El Arifeen, S., Mullany, L. C., Islam, M. S., Shang, N., Qazi, S. A., Zaidi, A. K., Bhutta, Z. A., Bose, A., et al. (2018). Causes and incidence of community-acquired serious infections among young children in south asia (anisa): an observational cohort study. The Lancet, 392(10142):145–159.
  • Scott et al. (2008) Scott, J. A. G., Brooks, W. A., Peiris, J. M., Holtzman, D., and Mulhollan, E. K. (2008). Pneumonia research to reduce childhood mortality in the developing world. The Journal of clinical investigation, 118(4):1291.
  • Witte et al. (1998) Witte, J. S., Greenland, S., and Kim, L.-L. (1998). Software for hierarchical modeling of epidemiologic data. Epidemiology, 9(5):563–566.
  • Wu et al. (2019) Wu, Z., Casciola-Rosen, L., Rosen, A., and Zeger, S. L. (2019). A bayesian approach to restricted latent class models for scientifically-structured clustering of multivariate binary outcomes. arXiv preprint arXiv:1808.08326.
  • Wu et al. (2016) Wu, Z., Deloria-Knoll, M., Hammitt, L. L., Zeger, S. L., and for Child Health Core Team, P. E. R. (2016). Partially latent class models for case–control studies of childhood pneumonia aetiology. Journal of the Royal Statistical Society: Series C (Applied Statistics), 65(1):97–114.
  • Wu et al. (2017) Wu, Z., Deloria-Knoll, M., and Zeger, S. L. (2017). Nested partially latent class models for dependent binary data; estimating disease etiology. Biostatistics (Oxford, England), 18:200–213.
  • Zhang and Zhou (2017) Zhang, Q. and Zhou, M. (2017). Permuted and augmented stick-breaking bayesian multinomial regression. The Journal of Machine Learning Research, 18(1):7479–7511.

Appendix A Prior distributions

The unknown parameters include the regression coefficients in the etiology regression (, the parameters in the subclass weight regression for the cases () and the controls (), the true and false positive rates . To mitigate potential overfitting and increase model interpretability, we a priori place substantial probabilities on models with the following two features: (a) Few non-trivial subclasses via a novel additive half-Cauchy prior for the intercepts , and (b) for a continuous variable, smooth regression curves , and by Bayesian Penalized-splines (P-splines, Lang and Brezger, 2004) combined with shrinkage priors on the spline basis coefficients (Ni et al., 2015) to encourage towards constant values.

a.1 Subclass Weight Regression: Encourage Few Subclasses

We propose a novel prior to encourage a small number of subclasses of non-trivial weights in finite samples, or “simplex regression shrinkage prior”. We parameterize the intercepts so that a priori the higher-order subclasses are less likely to receive non-trivial weights. We let where is a pre-specified triangular array of positive values. Upon heavy-tailed priors on with positive supports, we will a priori make higher-order subclasses increasingly less likely to receive substantial weights. In this paper, we use ; Other choices such as or may be useful in other settings. We specify the prior distributions of

to be heavy-tailed. In this paper we use Cauchy distribution with scale

. Since our control model take a classical latent class regression model form (Bandeen-Roche et al., 1997) (the generic term “class” here corresponds to control “subclass” in an npLCM), the proposed prior for the subclass weight is also useful for a classical LCM regression analysis where the number of classes is unknown. Unlike a logistic stick-breaking specification without the intercepts , the proposed priors on the intercepts encourage few subclasses and well recovers the true subclass weights. Using the same data simulated in Simulation I, Section 3 of Main Paper, Figure 5 shows the proposed prior propagates into the posterior distribution and estimates 2 non-trivial subclasses from a working number of 7 subclasses.

At stick-breaking step , the prior allows taking away nearly the entire stick segment currently left. Our basic idea is to have one of close to one a posteriori by making the posterior mean of one of large. We accomplish this by designing a novel prior on the intercept where

The first level has a mean-zero Gaussian distribution truncated to the positive half. At the second-level, the precision (inverse variance) is Gamma distributed with shape

, and rate ; it has the interpretation of prior independent sample(s) with a mean sample variance of . Large values of help to stop stick-breaking at subclass forcing weights for ensuing subclasses , , while small values let the stick-breaking scheme continue to step . This type of prior sparsity, which we call “selective stopping” or shrinkage over a simplex uniformly over covariates, effectively encourages using a small number of subclasses to approximate the observed

probability contingency table for the control measurements in finite samples.

We accomplish selective stopping by the heavy right tail of ’s marginal prior. It has a truncated scaled-

distribution with degree of freedom

and scale , and consequently peaks at zero and admits large positive values. Given other parameters in , a near-zero intercept takes the stick-breaking procedure to the next step, while a large positive intercept effectively halts it. The tendency to stop at step is a priori modulated by the scale parameter . Because, given the degree-of-freedom

, the prior probability

approaches as the scale parameter increases.

In our simulations and applications, we choose hyperparameters

and for the intercept, and for the first B-spline coefficients in the prior (Equation 3, Section A.2). We have chosen our hyperparameters based on the interpretations on the probability (inverse-link) scale; see similar prior elicitations for regression coefficients in other applications (e.g., Bedrick et al., 1996; Witte et al., 1998) and for automatic, stabilized and weakly-informative fitting of generalized linear models (Gelman et al., 2008). We choose the hyperparameters for the intercepts that put most prior mass of within , because is sufficiently close to which means the stick-breaking is stopped at Step . In contrast, we choose the first B-spline coefficient’s hyperparameter that puts most prior mass of within , a range for the weight of a non-trivial subclass to break from the rest of the stick at Step . Figure 8 shows a sharp separation between the priors for and . The shapes of the priors again highlight the different roles played by the intercept and the B-spline coefficients: the former decides whether to continue the stick-breaking procedure to induce complex conditional dependence given covariates, and if so, the latter computes the fraction to break from the remaining length of the stick. The intercepts in the controls are shared with the case subclass weight regression ; We set the same prior distributions for other elements of , .

a.2 Encourage Smooth and

We use penalized B-splines to model the additive functions of a continuous variable in etiology regression (