1 Introduction
In epidemiologic studies of disease etiology, one important scientific goal is to assess the effect of explanatory variables upon disease etiology. Based on multiple binary nongoldstandard diagnostic measurements made on a list of putative causes with different error rates, this paper develops and demonstrates a regression analytic approach for drawing inference about the causespecific fractions among the case population that depend on covariates. We illustrate the analytic needs raised by a study of pediatric pneumonia etiology.
Pneumonia is a clinical condition associated with infection of the lung tissue, which can be caused by more than 30 different species of microorganisms, including bacteria, viruses, mycobacteria and fungi (Scott et al., 2008). The Pneumonia Etiology Research for Child Health (PERCH) study is a sevencountry casecontrol study of the etiology of severe and very severe pneumonia and has enrolled more than hospitalized children under five years of age and more than healthy controls (PERCH Study Group, 2019). The goal of the PERCH study is to estimate the population fractions of cases due to the pathogen causes, referred to as “population etiologic fractions” (PEFs) and to assign causespecific probabilities for each pneumonia child given her measurements, termed as “individual etiologic fractions” (IEFs). The PERCH study also aims to understand the variation of the PEFs as a function of factors such as region, season, a child’s age, disease severity, nutrition status and human immunodeficiency virus (HIV) status.
The cause of lung infection cannot, except in rare cases, be directly observed (Hammitt et al., 2017). The PERCH study tests the presence or absence of a list of pathogens using specimens in peripheral compartments including the blood, sputum, pleural fluid and nasopharyngeal (NP) cavity (Crawley et al., 2017). In this paper, we focus on two sources of imperfect measurements: (a) NP Polymerase Chain Reaction (NPPCR) results from cases and controls that are not perfectly sensitive or specific, referred to as “bronzestandard” (BrS) data; and (b) blood culture (BCX) results from cases only that are perfectly specific but lack sensitivity, referred to as “silverstandard” (SS) data.
Valid inference about the population and individual etiologic fractions must address three salient characteristics of the measurements. First, tests lacking sensitivity such as NPPCR and BCX may miss true causative agent(s) which if unadjusted may underestimate the PEFs. Second, imperfect diagnostic specificities may result in the detection of multiple pathogens in NPPCR that may indicate asymptomatic carriage but not causes of pneumonia. Determining the primary causative agent(s) must use statistical controls. Third, multiple specimens are tested among the cases with only a subset available from the controls. Other largescale disease etiology studies have raised similar analytic needs and challenges of integrating multiple sources of imperfect measurements of multiple pathogens to produce an accurate understanding of etiology (e.g., Saha et al., 2018; Kotloff et al., 2013).
To address the analytic needs, Wu et al. (2016) introduced a partiallylatent class model (pLCM) as an extension to classical latent class models (LCMs Lazarsfeld, 1950; Goodman, 1974) that uses casecontrol data to estimate the PEFs. This prior work shows the capacity of the multivariate specimen measurements to inform the distribution of unobserved, or “latent” health status for an individual and the population. PLCM is a finite mixture model with components for multivariate binary data where a case observation is drawn from a mixture of components each representing a cause of disease, or “disease class”; Controls have no infection in the lung hence are assumed drawn from an observed class. The pLCM is a semisupervised method for learning the unobserved classes, where the “label” (cause of disease) is observed for only a subset of subjects. Let represent case ’s disease class which is categorically distributed with probabilities equal to the PEFs in the dimensional simplex . A case class can represent a single or multiplepathogen cause of pneumonia, or pathogen causes not targeted by the assays which we refer to as “Not Specified (NoS
)”. PLCM uses a vector of
response probabilities to specify the conditional distribution of the measurements in each class. PLCM is an example of restricted LCMs (RLCMs, Wu et al., 2019) which restrict how the response probabilities differ by class to reflect the scientific knowledge that causative pathogens are more likely to appear in the upper respiratory tract in a pneumonia child than a healthy control. In particular, each causative pathogen is assumed to be observed with a higher probability in case class (sensitivity or true positive rate, TPR) than among the controls; A noncausative pathogen is observed with the same probability as in the controls (1  specificity or false positive rate). Under the pLCM, a higher observed marginal positive rate of pathogen among cases than controls indicates etiologic significance.In a Bayesian framework, measurements of differing precisions can be optimally combined under a pLCM to generate stronger evidence about . The pLCM is partiallyidentified (Jones et al., 2010; Gu and Xu, 2019b). There exist two sets of values of a subset of model parameters (here the TPRs) that the likelihood function alone cannot distinguish even with infinite samples; Bounds on the parameters however are available (e.g., Wu et al., 2016, Equation 6). Informative prior distributions for the TPRs elicited from laboratory experts or estimated from vaccine probe studies for a subset of pathogens (Feikin et al., 2014) can be readily incorporated to improve inference (Gustafson, 2015).
The pLCM assumes “local independence” (LI) which means the BrS measurements are mutually independent given the class membership. This classical assumption is central to mixture models for multivariate data, because the estimation procedures essentially find the optimal partition of observations into subgroups so that the LI approximately holds in each subgroup. Deviations from LI, or “local dependence” (LD) are testable using the control BrS data, which can be accounted for by an extension of pLCM, called nested partiallylatent class model (npLCM, Wu et al., 2017). In each class, the npLCM uses the classical LCM formulation that has the capacity to describe complex multivariate dependence among discrete data (Dunson and Xing, 2009). For example, it assumes the withinclass correlations among NPPCR tests are induced by unobserved heterogeneity in subjects’ propensities for pathogens colonizing the nasal cavities. In particular, LD is induced in an npLCM by nesting latent subclasses within each class , where subclasses respond with distinct vectors of probabilities. In a Bayesian framework with a prior that encourages few important subclasses, the npLCM reduces the bias in estimating , retains estimation efficiency and offers more valid inference under substantial deviation from LI.
Extensions to incorporate covariates in an npLCM are critical for two reasons. Firstly, covariates such as season, age, disease severity and HIV status may directly influence . Secondly, in an npLCM without covariates, the relative probability of assigning a case subject to class versus class depends on the FPRs (Wu et al., 2016) which are estimable using the control data. However, the FPRs may vary by covariates which if not modeled will bias the assignment of causespecific probabilities for each case subject. For example, pathogen A found in a case’s nasal cavity less likely indicates etiologic significance than a colonization during seasons with high asymptomatic carriage rates, and much more so when the same pathogen rarely appears in healthy subjects.
Adapting existing nocovariate methods to account for discrete covariates, one may perform a fullystratified analysis by fitting an npLCM to the casecontrol data in each covariate stratum. Like pLCM, the npLCM is partiallyidentified in each stratum (Wu et al., 2017), necessitating multiple sets of independent informative priors across multiple strata. There are two primary issues with this approach. First, sparselypopulated strata defined by many discrete covariates may lead to unstable PEF estimates. Second, it is often of policy interest to quantify the overall causespecific disease burdens in a population. Let the overall PEFs be the empirical average of the stratumspecific PEFs. Since the informative TPR priors are often elicited for a case population and rarely for each stratum, reusing independent prior distributions of the TPRs across all the strata will lead to overlyoptimistic posterior uncertainty in , hampering policy decisions.
Estimating disease etiology across discrete and continuous epidemiologic factors needs new methods in a general modeling framework. In this paper, we extend the npLCM to perform regression analysis in casecontrol disease etiology studies that (a) incorporates controls to estimate the PEFs, (b) specifies parsimonious functional dependence of upon covariates such as additivity, and (c) correctly assesses the posterior uncertainty of the PEF functions and the overall PEFs by applying the TPR priors just once.
The rest of the paper is organized as follows. Section 2 overviews the npLCM without covariates. Section 3 builds on the npLCM and makes the regression extension. We demonstrate the estimation of disease etiology regression functions through simulations in Section 4; We also show superior inferential performance of the regression model in estimating the overall PEFs relative to an analysis omitting the covariates. In Section 5, we characterize the effect of seasonality, age, HIV status upon the PEFs by applying the proposed npLCM regression model to the PERCH data. The paper concludes with a discussion.
2 Overview of npLCMs without Covariates
Let binary BrS measurements indicate the presence or absence of pathogens for subject . Let indicate a case () or a control () subject. If , let represent case ’s unobserved disease class; Otherwise, let because a control subject’s class is known (in PERCH, no lung infection). In this paper, we simplify the presentation of models by focusing on singlepathogen causes (hence ). The npLCM readily extends to for including additional prespecified multipathogen and/or “Not Specified” (NoS) causes (Wu et al., 2017).
The likelihood function for an npLCM has three components: (a) PEFs or causespecific case fractions: ; (b) : a table of probabilities of making binary observations in a case class ; (c) : the same probability table but for controls. Since cases’ disease classes are unobserved, the distribution of cases’ measurements is a finitemixture model with weights for the disease classes: .
Models in this section differ by how and {} are specified; Regression models in Section 3 further incorporate covariate into the specifications ( as well). More specifically, the likelihood of an npLCM (Wu et al., 2017) is a product of case () and control () likelihood functions
(1) 
where and are sensitivity and specificity parameters necessary for modeling the imperfect measurements; The rest of parameters , . Existing methods for estimating
in the framework of npLCM can be classified by whether or not
and assumes local independence (LI) which means measurements are independent of one another given the class (). In Equation (1), LI results if and only if ; Otherwise, and account for deviations from LI given a control or disease class.PLCM. under the original pLCM (Wu et al., 2016) satisfies LI and equals a product of probabilities: , where
is the probability mass function for a product Bernoulli distribution given the success probabilities
, and the parameters represent the positive rates absent disease, referred to as “false positive rates” (FPRs). For example, in the PERCH data, Respiratory Syncytial Virus (RSV) has a low observed FPR because of its rare appearance in controls’ NPs; Other pathogens such as Rhinovirus (RHINO) have higher observed FPRs.For , the pLCM makes a key “noninterference” assumption that diseasecausing pathogen(s) are more frequently detected among cases than controls and the noncausative pathogens are observed with the same rates among cases as in controls (Wu et al., 2017). The “noninterference” assumption says that in a case class is a product of the probabilities of measurements made (a) on the causative pathogen , , where and (b) on the noncausative pathogens , where represents all but the th element in a vector . The parameter is termed “true positive rate” (TPR) and may be larger than the FPR ; Under the singlepathogencause assumption, pLCM uses TPRs for causes and FPRs .
NPLCM. To reduce estimation bias in under deviations from LI, the “nested pLCM” or npLCM extends the original pLCM to describe residual correlations among binary pathogen measurements in the controls () and in each case class (, ) (Wu et al., 2017). The extension is motivated by the ability of the classical LCM formulation (Lazarsfeld, 1950) to approximate any joint multivariate discrete distribution (Dunson and Xing, 2009).
For in the controls, the npLCM introduces subclasses; The original pLCM results if . Given a subclass , the probability of observing binary measurements among controls is , where is the th column of a by FPR matrix . Since we do not observe controls’ subclasses, is a weighted average of according to the subclass probabilities : .
For in case class , the npLCM again introduces unobserved subclasses and assumes is a weighted average of according to the case subclass weights : . In particular, the npLCM assumes the probability of observing in subclass in disease class , , is a product of the probabilities of making an observation (a) on the causative pathogen : and (b) on noncausative pathogens , where is the th column of excluding the th row. We collect the TPRs in a by TPR matrix . We summarize the preceding specification by , where the vector represents the positive rates for measurements in subclass of disease class : which equals the TPR for a causative pathogen and the FPR otherwise; Here is an indicator function that equals if the statement A is true and otherwise.
The likelihood for npLCM results upon substituting and above into Equation (1): . Setting and , the special case of pLCM results.
Similar to the pLCM, the FPRs in the npLCM are shared among controls and case classes over noncausative pathogens (via ). Different from the pLCM, the subclass mixing weights may differ between cases () and controls (). The special case of , means the covariation patterns among the noncausative pathogens in a disease class is no different from the controls. However, relative to controls, diseased individuals may have different strength and direction of measurement dependence in each disease class. By allowing the subclass weights to differ between the cases and the controls, npLCM is more flexible than pLCM in referencing cases’ measurements against controls.
3 Regression Analysis via npLCM
We extend npLCM to perform regression analysis of data , where are covariates that may influence case ’s etiologic fractions and is a possibly different vector of covariates that may influence the subclass weights among the controls and the cases; Let the continuous covariates comprise the first and elements of and , respectively. A subset of may be available from the cases only. We let if so that all the covariates for a control subject are included in ; Let for a case subject. For example, healthy controls have no disease severity information. We let three sets of parameters in an npLCM (1) depend on the observed covariates: (a) the etiology regression function among cases, , which is of primary scientific interest, (b) the conditional probability of measurements given covariates in case classes: , , (c) and in the controls ; We keep the specifications for the TPRs and FPRs (, ) as in the original npLCM.
3.1 Disease Etiology Regression
is the primary target of inference. Recall that represents case ’s disease being caused by pathogen . We assume this event occurs with probability
that depends upon covariates. In our model, we use a multinomial logistic regression model
, , whereis the log odds of case
in disease class relative to : . Without specifying a baseline category, we treat all the disease classes symmetrically which simplifies prior specification. We further assume additive models for , where is the subvector of the predictors that enters the model for all disease classes as linear predictors and collects all the parameters. For covariates such as enrollment date that serves as a proxy for factors driven by seasonality, nonlinear functional dependence is expected. We use Bspline basis expansion to approximate and use Pspline for estimating smooth functions (Lang and Brezger, 2004). Finally, we specify the distribution of case measurements given disease class , covariates and . We extend the case likelihood in an npLCM (1) to let the subclass weights depend on covariates : . Integrating over unobserved disease classes, we obtain the likelihood function for the cases that incorporates covariates :(2) 
where and are the regression parameters; The form of is introduced in the model for controls.
3.2 Covariatedependent reference distribution
Data from controls provide requisite information about the specificities and covariations at distinct covariate values, necessitating adjustment in an npLCM analysis. For example, factors such as enrollment date is a proxy for season and may influence the background colonization rates and interactions of some pathogens that circulate more during winter (ObandoPacheco et al., 2018; Nair et al., 2011). We propose a novel approach to estimating the reference distribution of measurements that may depend on covariates using control data.
The regression model for a control subject is a mixture model with covariatedependent mixing weights : where FPRs do not depend on covariates and the vector lies in a simplex . We discuss the FPRs and the subclass weight functions in order.
Firstly, constant FPR profiles enable coherent interpretation across individuals with different covariate values (Erosheva et al., 2007). FPR profile receives a weight of for a control subject with covariates . The marginal FPRs in the controls , , also depend on . Consequently, observed marginal control positive curve for a pathogen informs how different the FPRs are across the subclasses. For example, if the NPPCR measure of pathogen A shows strong seasonal trends among the controls, the estimated FPRs will be more variable across the subclasses. And the subclass with a high FPR will receive a larger weight during seasons with higher carriage rates in controls. The control model reduces to special cases, with covariateindependent , , resulting in the in a subclass npLCM without covariates; A further singlesubclass constraint () gives the in the original pLCM.
Secondly, we parameterize the case and control subclass weight regressions and using the same regression form but different parameters.
Control subclass weight regression. We rewrite the subclass weights , using a stickbreaking parameterization. Let be a link function. Let be subject ’s linear predictor at stickbreaking step . Using the stickbreaking analogy, we begin with a unitlength stick, break a segment of length and continue breaking a fraction from the remaining and so on; At step , we break a fraction from what is left in the preceding steps resulting in the th stick segment of length ; We stop until sticks of variable lengths result. In this paper, we use the logistic function
which is consistent with the multinomial logit regression for
so that the priors of the coefficients and can be similar (Supplementary Materials A.2). Generalization to other link functions such as the probit function is straightforward (e.g., Rodriguez and Dunson, 2011). We use this parameterization to introduce a novel shrinkage prior on a simplex for the subclass weights (see Supplementary Materials A.1) which encourages fewer than effective subclasses, or “sparse” shrinkage prior on the simplex. This provides parsimonious approximation to the conditional distribution of control measurements using a few subclasses.In our analysis, we use generalized additive models (Hastie and Tibshirani, 1986) for the th linear predictor for . We have parameterized the possibly nonlinear using Bspline basis expansions with coefficients ; are the linear effects of a subset of predictors which can include an intercept and is a subvector of predictors ; Let collect all the regression parameters. Following Lang and Brezger (2004), we constrain to have zero means for statistical identifiability. Supplementary Materials A.2 provides the technical details about the parameterization of .
The subclassspecific intercepts globally control the magnitudes of the linear predictors. We hence propose priors on to a priori encourage few subclasses (see Supplementary Materials A.1). In particular, a large positive intercept makes and hence breaks nearly the entire remaining stick after the th stickbreaking. Since the stickbreaking parameterization onetoone maps to a classical latent class regression model formulation for the control data, the linear predictor and the sum are identifiable except in a Lebesgue zero set of parameter values, or “generic identifiability” (Huang and BandeenRoche, 2004). Consequently, even if the intercept is not statistically identified if includes an intercept , the MCMC samples of the statistically identifiable functions can provide valid posterior inferences (Carlin and Louis, 2009). We write the control likelihood with covariates as . Supplementary Materials B provides further remarks on the assumption for introducing covariates into the control model.
Case subclass weight regression. The subclass weight regression functions for cases are also specified via a logistic stickbreaking regression as in the controls but with different parameters: ,
. Since given the TPRs and the FPRs, the subclass weights fully determine the joint distribution
hence the measurement dependence in each class, we let and be different between cases and controls for any .Let the th linear predictor , where are the regression parameters that differ from the control counterpart (). In particular, we approximate , here using the same set of Bspline basis functions as in the controls but estimate a different set of basis coefficients . In addition, we have directly used the intercepts from the control model to ensure only important subclasses in the controls are used in the cases. For example, absent covariates , a large and positive effectively halts the stick breaking procedure at step for the controls (); Applying the same intercept to the cases makes .
Combining the case () and control likelihood () with covariates, we obtain the joint likelihood for the regression model .
Under an assumption (A1): the case subclass weights are constant over covariates: , , the regression model reduces to an npLCM model without covariates upon integration over a distribution of covariates . To see this, the case and control likelihood functions and integrate to and , respectively; Here and where and are probability or empirical distributions of and , respectively. The mathematical equivalence enables valid inference about the overall PEFs omitting and (see Supplementary Materials E.2 for an example). The nocovariate analysis becomes deficient under deviations from (A1); Section 4 provides examples.
3.2.1 Priors and Posterior Inference
The unknown parameters include the coefficients in the etiology regression (, the subclass mixing weight regression for the cases () and the controls (), the true and false positive rates . With typical samples sizes about controls and cases in each study site, the number of parameters in controls likelihood () easily exceeds the number of distinct binary measurement patterns observed. To overcome potential overfitting and increase model interpretability, we a priori place substantial probabilities on models with the following two features: (a) Few nontrivial subclasses via a novel additive halfCauchy prior for the intercepts , and (b) for a continuous variable, smooth regression curves , and by Bayesian Penalizedsplines (Psplines) (Lang and Brezger, 2004) combined with shrinkage priors on the spline coefficients (Ni et al., 2015) to encourage towards constant values, , which reduces to the original npLCM. Supplementary Materials A details the prior specifications.
We use the Markov chain Monte Carlo (MCMC) algorithm to draw samples of the unknowns to approximate their joint posterior distribution (Gelfand and Smith, 1990). Flexible posterior inferences about any functions of the model parameters and individual latent variables are available by plugging in the posterior samples of the unknowns. For example, the posterior samples of the case positive rate curve for pathogen help evaluate model fit. The red bands in Row 1 of Figure 1 are posterior credible bands obtained by substituting relevant parameters with their sampled values across MCMC iterations in . The npLCMs with or without covariates are fitted using a free and publicly available R package baker
(https://github.com/zhenkewu/baker). Baker
calls an external automatic Bayesian model fitting software JAGS 4.2.0
(Plummer et al., 2003)
from within R and provides functions to visualize the posterior distributions of the unknowns (e.g., the PEFs and cases’ latent disease class indicators) and perform posterior predictive model checking
(Gelman et al., 1996). Supplementary Materials C details the convergence diagnostics.4 Simulations
We simulate casecontrol bronzestandard (BrS) measurements along with observed continuous and/or discrete covariates under multiple combinations of true model parameter values and sample sizes that mimic the motivating PERCH study. In Simulation I, we illustrate flexible statistical inferences about the PEF functions . In Simulation II, we focus on the overall PEFs that quantify the overall causespecific disease burdens in a population which are of policy interest. Let be an empirical average of , . We compare the frequentist properties of the posterior mean obtained from analyses with or without covariate (Little et al., 2011). Regression analyses reduce estimation bias, retain efficiency and provide more valid frequentist coverage of the CrIs. The relative advantage varies by the true data generating mechanism and sample sizes.
In all analyses here, we use a working number of subclasses, with independent Beta(7.13,1.32) TPR prior distributions that match 0.55 and 0.99 with the lower and upper quantiles, respectively; We specify Beta(1,1) for the identifiable FPRs. The priors for the regression coefficients follow the specifications in Supplementary Materials A.
Simulation I. We demonstrate that the inferential algorithm recovers the true PEF functions . We simulate cases and controls for each of two levels of (a discrete covariate) and uniformly sample the subjects’ enrollment dates over a period of days. Supplementary Materials D specifies the true data generating mechanism and the regression specifications. Based on the simulated data, pathogen A has a bimodal positive rate curve mimicking the trends observed of RSV in one PERCH site; other pathogens have overall increasing positive rate curves over enrollment dates. We set the simulation parameters in a way that the marginal control rate may be higher than cases for small ’s (impossible under the more restrictive pLCM). Row 2 of Figure 1 visualizes for the causes (by column), the posterior means (thin black line) and CrIs (gray bands) for the etiology regression curves are close to the simulation truths . Supplementary Materials E provides additional simulation results to assess the recovery of the true for a discrete covariate .
Simulation II. We show the regression model accounts for population stratification by covariates hence reduces the bias of the posterior mean in estimating the overall PEFs () and produces more valid CrIs. We illustrate the advantage of the regression approach under simple scenarios with a single twolevel covariate ; We let . We perform npLCM regression analysis with for each of replication data sets simulated under each of scenarios detailed in Supplementary Materials D that correspond to distinct numbers of causes, sample sizes, relative sizes of PEF functions (rare versus popular etiologies), signal strengths (more discrepant TPRs and FPRs indicate stronger signals, Wu et al. (2016)), and effects of on and .
In estimating , we evaluate the bias , where is the true overall PEF, and is an empirical average of the posterior mean PEFs at . We also evaluate the empirical coverage rates of the CrIs.
The regression model incorporates covariates and performs better in estimating than a model omitting covariates. For example, Figure 2(a) shows for that, relative to nocovariate npLCM analyses, regression analyses produce posterior means that on average have negligible relative biases (percent difference between the posterior mean and the truth relative to the truth) for each pathogen across simulation scenarios. As expected, we observe slight relative biases from the regression model in the bottom two rows of Figure 2(a), because the informative TPR prior Beta(7.13,1.32) has a mean value lower than the true TPR ; A more informative prior further reduces the relative bias; See additional simulations in Supplementary Materials E on the role of informative TPR priors. Figure 2(b) regression analyses also produce CrIs for that have more valid empirical coverage rates in all scenarios. Misspecified models without covariates concentrate the posterior distribution away from the true overall PEFs, resulting in large biases that dominate the posterior uncertainty of which is evident from the more severe undercoverages with higher TPRs and lower FPRs (row 3 and 4 versus row 1 and 2, Figure 2).
5 Regression Analysis of PERCH Data
We restrict attention in this regression analysis to cases and controls from one of the PERCH study sites in the Southern Hemisphere that collected information on enrollment date (, August 2011 to September 2013; standardized), age (dichotomized to younger or older than one year), disease severity for cases (severe or very severe), HIV status (positive or negative) and presence or absence of seven species of pathogens (five viruses and two bacteria, representing a subset of pathogens evaluated) in nasopharyngeal (NP) specimens tested with polymerase chain reaction (PCR), or NPPCR (bronzestandard, BrS); We also include in the analysis the blood culture (BCX, silverstandard, SS) results for two bacteria from cases only. Detailed analyses of the entire data are reported in PERCH Study Group (2019).
Table 1 shows the observed case and control frequencies by age, disease severity and HIV status. The two strata with the most subjects are severe pneumonia children who were HIV negative and under or above one year of age. Some low or zero cell counts preclude fitting npLCMs by stratum. Regression models with additive assumptions among the covariates can borrow information across strata and stabilize the PEF estimates. Supplemental Figure S5 shows summary statistics for the NPPCR (BrS) and BCX (SS) data including the positive rates in the cases and the controls and the conditional odds ratio (COR) contrasting the case and control rates adjusting for the presence or absence of other pathogens (NPPCR only).
For NPPCR, pathogens RSV and Haemophilus influenzae (HINF) are detected with the highest positive rates among cases: and , respectively, which are higher than the corresponding control rates ( and ). The CORs are large, for RSV and for HINF, indicating etiologic importance. Adenovirus (ADENO) also has a statistically significant COR of . Human metapneumovirus type A or B (HMPV_A_B) and Parainfluenza type 1 virus (PARA_1) have larger positive and statistically significant CORs of and . However, the two pathogens rarely appear in cases’ nasal cavities (HMPV_A_B: , PARA_1: ), which in light of high sensitivities means nonprimary etiologic roles. For the rest of pathogens, we observed similar case and control positive rates as shown by the statistically nonsignificant CORs (RHINO (case: ; control: ) and Streptococcus pneumoniae (PNEU) (case: ; control: ). Similar to Wu et al. (2017), we integrate caseonly SS measurements for HINF and PNEU by using informative priors of the sensitivities (e.g., from vaccine probe studies e.g., Feikin et al. (2014)) to adjust the PEF estimates in a coherent Bayesian framework. It is expected that the rare detection of the two bacteria, for HINF and for PNEU from SS data, will lower their PEF estimates relative to the ones obtained from an NPPCRonly analysis.
We include in the regression analysis a cause “Not Specified (NoS)” to account for true pathogen causes other than the seven pathogens. We incorporate the prior knowledge about the TPRs of the NPPCR measures from laboratory experts. We set the Beta priors for sensitivities by and , so that the and quantiles match the lower and upper ranges of plausible sensitivity values of and , respectively. We specify the Beta(7.59,58.97) prior for the two TPRs of SS measurements similarly but with a lower range of . We use a working number of subclasses . In the etiology regression model , we use 7 d.f. for Bspline expansion of the additive function for the standardized enrollment date at uniform knots along with three binary indicators for age older than one, very severe pneumonia, HIV positive; In the subclass weight regression model , we use 5 d.f. for the standardized enrollment date with uniform knots and two indicators for age older than one and HIV positive. The prior distributions for the etiology and subclass weight regression parameters follow the specification in Supplementary Materials A.
age  very severe (VS)  HIV positive  cases ()  controls () 
(caseonly)  total: 524 (100)  total: 964 (100)  
0  0  0  208 (39.7)  545 (56.5) 
1  0  0  72 (13.7)  278 (28.8) 
0  1  0  116 (22.1)  0 
1  1  0  33 (6.3)  0 
0  0  1  37 (7.1)  85 (8.8) 
1  0  1  24 (4.5)  51 (5.3) 
0  1  1  25 (4.8)  0 
1  1  1  3 (0.6)  0 
case:  
control:   
The regression analysis produces seasonal estimates of the PEF function for each cause that varies in trend and magnitude among the eight strata defined by age, disease severity and HIV status. Figure 3 shows among two ageHIVseverity strata the posterior mean curve and pointwise credible bands of the etiology regression functions as a function of . For example, among the younger, HIV negative and severe pneumonia children (Figure 3(a)), the PEF curve of RSV is estimated to have a prominent bimodal temporal pattern that peaked at two consecutive winters in the Southern Hemisphere (June 2012 and 2013). Other singlepathogen causes HINF, PNEU, ADENO, HMPV_A_B and PARA_1 have overall low and stable PEF curves across seasons. The estimated PEF curve of NoS shows a trend with a higher level of uncertainty that is complementary to RSV because given any enrollment date the PEFs of all the causes sum to one. In contrast, Figure 3(b) shows a lower degree of seasonal variation of RSV PEF curve among the older, HIV negative and severe pneumonia children.
Row 2) shows the temporal trend of which is enveloped by pointwise credible bands shown in gray. The estimated overall PEF averaged among cases in the present stratum is shown by a horizontal solid line, below and above which are two dashed black lines indicating the and posterior quantiles. The rug plot on the xaxis indicates cases’ enrollment dates.
Row 1) shows the fitted temporal case (red) and control (blue) positive rate curves enclosed by the pointwise CrIs; The two rug plots at the top (bottom) indicate the dates of the cases and controls being enrolled and tested positive (negative) for the pathogen.
The regression model accounts for stratification of etiology by the observed covariates and assigns causespecific probabilities for two cases who have identical measurements but different covariate values. Supplemental Figure S6 shows for two cases with all negative NPPCR results (the most frequent pattern among cases), the older case has a lower posterior probability of her disease caused by
RSV and higher probability of being caused by NoS. Indeed, contrasting older and younger children while holding the enrollment date, HIV, severity constant, the estimated difference in the log odds (i.e., log odds ratio) of a child being caused by RSV versus NoS is negative: .Given age, severity and HIV status, we quantify the overall causespecific disease burdens by averaging the PEF function estimates by the empirical distribution of the enrollment dates. Contrasting the results in the two ageseverityHIV strata in Figure 3(a) and 3(b), since the case positive rate of RSV among the older children reduces from to but the control positive rates remain similar (from to ), the overall PEF of RSV () decreases from to and attributing a higher total fraction of cases to NoS () from to ; The overall PEFs for other causes remain similar.
6 Discussion
In disease etiology studies where goldstandard data are infeasible to obtain, epidemiologists need to integrate multiple sources of data of distinct quality to draw inference about the population and individual etiologic fractions. While the existing methods based on npLCM account for imperfect diagnostic sensitivities and specificities, complex measurement dependence and missingness, they do not describe the relationship between covariates and the PEFs. This paper addresses this analytic need by extending npLCM to a general regression modeling framework using casecontrol multivariate binary data to estimate disease etiology.
The proposed approach has three distinguishing features: 1) It allows analysts to specify a model for the functional dependence of the PEFs upon important covariates. And with assumptions such as additivity, we can improve estimation stability for sparsely populated strata defined by many discrete covariates. 2) The model incorporate control data for the inference of PEF curve. The posterior inferential algorithm estimates a parsimonious covariatedependent reference distribution of the diagnostic measurements from controls. Finally, 3) the model uses informative priors of the sensitivities (TPRs) only once in a population for which these priors were elicited. Relative to a fullystratified npLCM analysis that reuses these priors, the proposed regression analysis avoids overlyoptimistic etiology uncertainty estimates.
We have shown by simulations that the regression approach accounts for population stratification by important covariates and as expected reduces estimation biases and produces credible intervals that have more valid empirical coverage rates than an npLCM analysis omitting covariates. In addition, the proposed regression analysis can readily integrate multiple sources of diagnostic measurements of distinct levels of diagnostic sensitivities and specificities, a subset of which are only available from cases (SS data), to further reduce the posterior uncertainty of the etiology estimates. Our regression analysis integrates the BrS and SS data from one PERCH site and reveals prominent dependence of the PEFs upon seasonality and a pneumonia child’s age, HIV status and disease severity.
Future work may improve the proposed methods. First, flexible and parsimonious alternatives to the additive models may capture important interaction effects (e.g., Linero, 2018). Second, in the presence of many covariates, classspecific predictor selection methods for may provide further regularization and improve interpretability (Gustafson et al., 2008). Third, when the subsets of pathogens that have caused the diseases in the population is unknown, the proposed method can be combined with subset selection procedures (Wu et al., 2019; Gu and Xu, 2019a). Finally, scalable posterior inference for multinomial regression parameters (e.g., Zhang and Zhou, 2017) will likely improve the computational speed in the presence of a large number of disease classes and covariates.
Supplementary Materials
The supplementary materials contain the technical details on prior specifications, a remark, additional simulation results and supplemental figures referenced in Main Paper.
Acknowledgment
We thank the PERCH study team led by Kathernine O’Brien for providing the data and scientific advice, Scott Zeger, Maria DeloriaKnoll, Christine Prosperi and Qiyuan Shi for insightful comments and valuable feedback about baker
and Jing Chu for preliminary simulations. The research was partly supported by the PatientCentered Outcomes Research Institute (PCORI) Award (ME140820318, ZW), NIH grants P30CA046592 (National Cancer Institute Cancer Center Support Grant Development Funds, Rogel Cancer Center; ZW and IC), U01CA229437 (ZW) and an Investigator Award from Precision Health Initiative and an MCubed Award from University of Michigan (ZW).
References
 BandeenRoche et al. (1997) BandeenRoche, K., Miglioretti, D. L., Zeger, S. L., and Rathouz, P. J. (1997). Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92(440):1375–1386.
 Bartolucci and Forcina (2006) Bartolucci, F. and Forcina, A. (2006). A class of latent marginal models for capture–recapture data with continuous covariates. Journal of the American Statistical Association, 101(474):786–794.
 Bedrick et al. (1996) Bedrick, E. J., Christensen, R., and Johnson, W. (1996). A new perspective on priors for generalized linear models. Journal of the American Statistical Association, 91(436):1450–1460.
 Brooks and Gelman (1998) Brooks, S. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455.
 Carlin and Louis (2009) Carlin, B. and Louis, T. (2009). Bayesian methods for data analysis, volume 78. Chapman & Hall/CRC.
 Chung et al. (2006) Chung, H., Flaherty, B. P., and Schafer, J. L. (2006). Latent class logistic regression: application to marijuana use and attitudes among high school seniors. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(4):723–743.
 Crawley et al. (2017) Crawley, J., Prosperi, C., Baggett, H. C., Brooks, W. A., Deloria Knoll, M., Hammitt, L. L., Howie, S. R., Kotloff, K. L., Levine, O. S., Madhi, S. A., et al. (2017). Standardization of clinical assessment and sample collection across all perch study sites. Clinical infectious diseases, 64(suppl_3):S228–S237.
 Deloria Knoll et al. (2017) Deloria Knoll, M., Fu, W., Shi, Q., Prosperi, C., Wu, Z., Hammitt, L. L., Feikin, D. R., Baggett, H. C., Howie, S. R., Scott, J. A. G., et al. (2017). Bayesian estimation of pneumonia etiology: epidemiologic considerations and applications to the pneumonia etiology research for child health study. Clinical infectious diseases, 64(suppl_3):S213–S227.
 Dunson and Xing (2009) Dunson, D. and Xing, C. (2009). Nonparametric bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104(487):1042–1051.
 Erosheva et al. (2007) Erosheva, E. A., Fienberg, S. E., and Joutard, C. (2007). Describing disability through individuallevel mixture models for multivariate binary data. The annals of applied statistics, 1(2):346.
 Feikin et al. (2014) Feikin, D., Scott, J., and Gessner, B. (2014). Use of vaccines as probes to define disease burden. The Lancet, 383(9930):1762–1770.
 Gelfand and Smith (1990) Gelfand, A. and Smith, A. (1990). Samplingbased approaches to calculating marginal densities. Journal of the American statistical association, pages 398–409.
 Gelman et al. (2008) Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, pages 1360–1383.
 Gelman et al. (1996) Gelman, A., Meng, X.L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4):733–760.
 Geweke and Zhou (1996) Geweke, J. and Zhou, G. (1996). Measuring the pricing error of the arbitrage pricing theory. The review of financial studies, 9(2):557–587.
 Goodman (1974) Goodman, L. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61(2):215–231.
 Gryparis et al. (2007) Gryparis, A., Coull, B. A., Schwartz, J., and Suh, H. H. (2007). Semiparametric latent variable regression models for spatiotemporal modelling of mobile source particles in the greater boston area. Journal of the Royal Statistical Society: Series C (Applied Statistics), 56(2):183–209.

Gu and Xu (2019a)
Gu, Y. and Xu, G. (2019a).
Learning attribute patterns in highdimensional structured latent
attribute models.
Journal of Machine Learning Research
, page In press.  Gu and Xu (2019b) Gu, Y. and Xu, G. (2019b). Partial identifiability of restricted latent class models. Annals of Statistics, page In press.
 Gustafson (2015) Gustafson, P. (2015). Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data, volume 140. CRC Press.
 Gustafson et al. (2008) Gustafson, P., Lefebvre, G., et al. (2008). Bayesian multinomial regression with classspecific predictor selection. The Annals of Applied Statistics, 2(4):1478–1502.
 Hammitt et al. (2017) Hammitt, L. L., Feikin, D. R., Scott, J. A. G., Zeger, S. L., Murdoch, D. R., O’brien, K. L., and Deloria Knoll, M. (2017). Addressing the analytic challenges of crosssectional pediatric pneumonia etiology data. Clinical infectious diseases, 64(suppl_3):S197–S204.
 Hastie and Tibshirani (1986) Hastie, T. and Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3):297–318.
 Huang and BandeenRoche (2004) Huang, G.H. and BandeenRoche, K. (2004). Building an identifiable latent class model with covariate effects on underlying and measured variables. Psychometrika, 69(1):5–32.
 Jones et al. (2010) Jones, G., Johnson, W., Hanson, T., and Christensen, R. (2010). Identifiability of models for multiple diagnostic testing in the absence of a gold standard. Biometrics, 66(3):855–863.
 Kotloff et al. (2013) Kotloff, K. L., Nataro, J. P., Blackwelder, W. C., Nasrin, D., Farag, T. H., Panchalingam, S., Wu, Y., Sow, S. O., Sur, D., Breiman, R. F., et al. (2013). Burden and aetiology of diarrhoeal disease in infants and young children in developing countries (the global enteric multicenter study, gems): a prospective, casecontrol study. The Lancet, 382(9888):209–222.
 Lang and Brezger (2004) Lang, S. and Brezger, A. (2004). Bayesian psplines. Journal of computational and graphical statistics, 13(1):183–212.
 Lazarsfeld (1950) Lazarsfeld, P. F. (1950). The logical and mathematical foundations of latent structure analysis, volume IV, chapter The American Soldier: Studies in Social Psychology in World War II, pages 362–412. Princeton, NJ: Princeton University Press.
 Linero (2018) Linero, A. R. (2018). Bayesian regression trees for highdimensional prediction and variable selection. Journal of the American Statistical Association, 113(522):626–636.
 Little et al. (2011) Little, R. et al. (2011). Calibrated bayes, for statistics in general, and missing data in particular. Statistical Science, 26(2):162–174.
 Morrissey et al. (2011) Morrissey, E. R., Juárez, M. A., Denby, K. J., and Burroughs, N. J. (2011). Inferring the timeinvariant topology of a nonlinear sparse gene regulatory network using fully bayesian spline autoregression. Biostatistics, 12(4):682–694.
 Nair et al. (2011) Nair, H., Brooks, W. A., Katz, M., Roca, A., Berkley, J. A., Madhi, S. A., Simmerman, J. M., Gordon, A., Sato, M., Howie, S., et al. (2011). Global burden of respiratory infections due to seasonal influenza in young children: a systematic review and metaanalysis. The Lancet, 378(9807):1917–1930.
 Ni et al. (2015) Ni, Y., Stingo, F. C., and Baladandayuthapani, V. (2015). Bayesian nonlinear model selection for gene regulatory networks. Biometrics.
 ObandoPacheco et al. (2018) ObandoPacheco, P., JusticiaGrande, A. J., RiveroCalle, I., RodríguezTenreiro, C., Sly, P., Ramilo, O., Mejías, A., Baraldi, E., Papadopoulos, N. G., Nair, H., et al. (2018). Respiratory syncytial virus seasonality: a global overview. The Journal of infectious diseases, 217(9):1356–1364.
 PERCH Study Group (2019) PERCH Study Group (2019). Aetiology of severe hospitalised pneumonia in hivuninfected children from africa and asia: the pneumonia aetiology research for child health (perch) casecontrol study. Lancet.
 Plummer et al. (2003) Plummer, M. et al. (2003). Jags: A program for analysis of bayesian graphical models using gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, volume 124.
 Rodriguez and Dunson (2011) Rodriguez, A. and Dunson, D. B. (2011). Nonparametric bayesian models through probit stickbreaking processes. Bayesian analysis (Online), 6(1).
 Saha et al. (2018) Saha, S. K., Schrag, S. J., El Arifeen, S., Mullany, L. C., Islam, M. S., Shang, N., Qazi, S. A., Zaidi, A. K., Bhutta, Z. A., Bose, A., et al. (2018). Causes and incidence of communityacquired serious infections among young children in south asia (anisa): an observational cohort study. The Lancet, 392(10142):145–159.
 Scott et al. (2008) Scott, J. A. G., Brooks, W. A., Peiris, J. M., Holtzman, D., and Mulhollan, E. K. (2008). Pneumonia research to reduce childhood mortality in the developing world. The Journal of clinical investigation, 118(4):1291.
 Witte et al. (1998) Witte, J. S., Greenland, S., and Kim, L.L. (1998). Software for hierarchical modeling of epidemiologic data. Epidemiology, 9(5):563–566.
 Wu et al. (2019) Wu, Z., CasciolaRosen, L., Rosen, A., and Zeger, S. L. (2019). A bayesian approach to restricted latent class models for scientificallystructured clustering of multivariate binary outcomes. arXiv preprint arXiv:1808.08326.
 Wu et al. (2016) Wu, Z., DeloriaKnoll, M., Hammitt, L. L., Zeger, S. L., and for Child Health Core Team, P. E. R. (2016). Partially latent class models for case–control studies of childhood pneumonia aetiology. Journal of the Royal Statistical Society: Series C (Applied Statistics), 65(1):97–114.
 Wu et al. (2017) Wu, Z., DeloriaKnoll, M., and Zeger, S. L. (2017). Nested partially latent class models for dependent binary data; estimating disease etiology. Biostatistics (Oxford, England), 18:200–213.
 Zhang and Zhou (2017) Zhang, Q. and Zhou, M. (2017). Permuted and augmented stickbreaking bayesian multinomial regression. The Journal of Machine Learning Research, 18(1):7479–7511.
Appendix A Prior distributions
The unknown parameters include the regression coefficients in the etiology regression (, the parameters in the subclass weight regression for the cases () and the controls (), the true and false positive rates . To mitigate potential overfitting and increase model interpretability, we a priori place substantial probabilities on models with the following two features: (a) Few nontrivial subclasses via a novel additive halfCauchy prior for the intercepts , and (b) for a continuous variable, smooth regression curves , and by Bayesian Penalizedsplines (Psplines, Lang and Brezger, 2004) combined with shrinkage priors on the spline basis coefficients (Ni et al., 2015) to encourage towards constant values.
a.1 Subclass Weight Regression: Encourage Few Subclasses
We propose a novel prior to encourage a small number of subclasses of nontrivial weights in finite samples, or “simplex regression shrinkage prior”. We parameterize the intercepts so that a priori the higherorder subclasses are less likely to receive nontrivial weights. We let where is a prespecified triangular array of positive values. Upon heavytailed priors on with positive supports, we will a priori make higherorder subclasses increasingly less likely to receive substantial weights. In this paper, we use ; Other choices such as or may be useful in other settings. We specify the prior distributions of
to be heavytailed. In this paper we use Cauchy distribution with scale
. Since our control model take a classical latent class regression model form (BandeenRoche et al., 1997) (the generic term “class” here corresponds to control “subclass” in an npLCM), the proposed prior for the subclass weight is also useful for a classical LCM regression analysis where the number of classes is unknown. Unlike a logistic stickbreaking specification without the intercepts , the proposed priors on the intercepts encourage few subclasses and well recovers the true subclass weights. Using the same data simulated in Simulation I, Section 3 of Main Paper, Figure 5 shows the proposed prior propagates into the posterior distribution and estimates 2 nontrivial subclasses from a working number of 7 subclasses.At stickbreaking step , the prior allows taking away nearly the entire stick segment currently left. Our basic idea is to have one of close to one a posteriori by making the posterior mean of one of large. We accomplish this by designing a novel prior on the intercept where
The first level has a meanzero Gaussian distribution truncated to the positive half. At the secondlevel, the precision (inverse variance) is Gamma distributed with shape
, and rate ; it has the interpretation of prior independent sample(s) with a mean sample variance of . Large values of help to stop stickbreaking at subclass forcing weights for ensuing subclasses , , while small values let the stickbreaking scheme continue to step . This type of prior sparsity, which we call “selective stopping” or shrinkage over a simplex uniformly over covariates, effectively encourages using a small number of subclasses to approximate the observedprobability contingency table for the control measurements in finite samples.
We accomplish selective stopping by the heavy right tail of ’s marginal prior. It has a truncated scaled
distribution with degree of freedom
and scale , and consequently peaks at zero and admits large positive values. Given other parameters in , a nearzero intercept takes the stickbreaking procedure to the next step, while a large positive intercept effectively halts it. The tendency to stop at step is a priori modulated by the scale parameter . Because, given the degreeoffreedom, the prior probability
approaches as the scale parameter increases.In our simulations and applications, we choose hyperparameters
and for the intercept, and for the first Bspline coefficients in the prior (Equation 3, Section A.2). We have chosen our hyperparameters based on the interpretations on the probability (inverselink) scale; see similar prior elicitations for regression coefficients in other applications (e.g., Bedrick et al., 1996; Witte et al., 1998) and for automatic, stabilized and weaklyinformative fitting of generalized linear models (Gelman et al., 2008). We choose the hyperparameters for the intercepts that put most prior mass of within , because is sufficiently close to which means the stickbreaking is stopped at Step . In contrast, we choose the first Bspline coefficient’s hyperparameter that puts most prior mass of within , a range for the weight of a nontrivial subclass to break from the rest of the stick at Step . Figure 8 shows a sharp separation between the priors for and . The shapes of the priors again highlight the different roles played by the intercept and the Bspline coefficients: the former decides whether to continue the stickbreaking procedure to induce complex conditional dependence given covariates, and if so, the latter computes the fraction to break from the remaining length of the stick. The intercepts in the controls are shared with the case subclass weight regression ; We set the same prior distributions for other elements of , .a.2 Encourage Smooth and
We use penalized Bsplines to model the additive functions of a continuous variable in etiology regression (
Comments
There are no comments yet.