# A Bayesian Nonparametric Method for Estimating Causal Treatment Effects on Zero-Inflated Outcomes

We present a Bayesian nonparametric method for estimating causal effects on continuous, zero-inflated outcomes. This work is motivated by a need for estimates of causal treatment effects on medical costs; that is, estimates contrasting average total costs that would have accrued under one treatment versus another. Cost data tend to be zero-inflated, skewed, and multi-modal. This presents a significant statistical challenge, even if the usual causal identification assumptions hold. Our approach flexibly models expected cost conditional on treatment and covariates using an infinite mixture of zero-inflated regressions. This conditional mean model is incorporated into the Bayesian standardization formula to obtain nonparametric estimates of causal effects. Moreover, the estimation procedure predicts latent cluster membership for each patient - automatically identifying patients with different cost-covariate profiles. We present a generative model, an MCMC method for sampling from the posterior and posterior predictive, and a Monte Carlo standardization procedure for computing causal effects. Our simulation studies show the resulting causal effect estimates and credible interval estimates to have low bias and close to nominal coverage, respectively. These results hold even under highly irregular data distributions. Relative to a standard infinite mixture of regressions, our method yields interval estimates with better coverage probability. We apply the method to compare inpatient costs among endometrial cancer patients receiving either chemotherapy or radiation therapy in the SEER Medicare database.

## Authors

• 4 publications
• 10 publications
• 8 publications
• ### Targeted Smooth Bayesian Causal Forests: An analysis of heterogeneous treatment effects for simultaneous versus interval medical abortion regimens over gestation

05/22/2019 ∙ by Jennifer E. Starling, et al. ∙ 0

• ### Bayesian Nonparametric Cost-Effectiveness Analyses: Causal Estimation and Adaptive Subgroup Discovery

Cost-effectiveness analyses (CEAs) are at the center of health economic ...
02/11/2020 ∙ by Arman Oganisian, et al. ∙ 0

• ### Conditional separable effects

Researchers are often interested in treatment effects on outcomes that a...
06/28/2020 ∙ by Mats J. Stensrud, et al. ∙ 0

• ### Survivor average causal effects for continuous time: a principal stratification approach to causal inference with semicompeting risks

In semicompeting risks problems, nonterminal time-to-event outcomes such...
02/15/2019 ∙ by Leah Comment, et al. ∙ 0

• ### Bayesian graphical modelling for heterogeneous causal effects

Our motivation stems from current medical research aiming at personalize...
06/06/2021 ∙ by Federico Castelletti, et al. ∙ 0

• ### Policy Implications of Statistical Estimates: A General Bayesian Decision-Theoretic Model for Binary Outcomes

How should scholars evaluate the statistically estimated causal effect o...
08/25/2020 ∙ by Akisato Suzuki, et al. ∙ 0

• ### Adaptive Covariate Acquisition for Minimizing Total Cost of Classification

In some applications, acquiring covariates comes at a cost which is not ...
02/21/2020 ∙ by Daniel Andrade, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Rising treatment costs are a significant driver of overall healthcare costs in the U.S. economy. When a treatment comes to market, how can we assess whether it is overall more or less costly than an existing treatment with equivalent efficacy? What would the cumulative medical costs have been had all patients received one drug instead of the other? Such important policy questions are inherently causal in nature - requiring us to compare costs in hypothetical worlds where everything else was held constant except treatment assignment.

Though it is necessary for answering key policy questions, causal estimation is impeded by the complexities of medical cost data. Cost data tend to be highly irregular - exhibiting zero-inflation, skewness, and multi-modality. In this paper, we develop a Bayesian nonparametric (BNP) model for the conditional cost distribution and incorporate it into a Bayesian standardization procedure for computing causal treatment effects on costs.

BNP models are popular for several reasons. First, they can flexibly model complex data distributions. At the same time, all inference is done using the posterior - which greatly simplifies interval estimation even for such complex models. Finally, point and interval estimates can be constructed for any causal contrast of interest (e.g. differences, ratios, etc.) once the posterior is known or sampled.

It is perhaps for these reasons that BNP methods have been growing in popularity within causal inference. For example, Bayesian additive regression trees (BART)[1, 2] have been used to estimate average treatment effects. Dependent Dirichlet process (DDP) methods have been developed for estimating marginal structural models [3]. Dirichlet process (DP) mixture approaches for mediation analysis[4] and Enriched Dirichlet process (EDP) [5] mixture approaches to standardization have also been developed [6].

The BNP causal inference methods above, however, do not explicitly model zero-inflation - a defining feature of medical costs. Parametric [7] and nonparametric [8, 9] Bayesian models for zero-inflated outcomes have been developed. However, these studies do not evaluate the properties (e.g. bias, coverage) of causal effect estimates produced by these models.

This paper contributes to the existing literature by developing a BNP standardization procedure for causal effect estimation with zero-inflated outcomes. In particular, we propose a DP mixture of zero-inflated regressions. Each zero-inflated regression is a two-part model: a logistic model for the probability of the outcome being zero and a Gaussian regression for non-zero outcomes. We develop a post-estimation standardization procedure for computing both marginal and conditional causal effects. While this work is motivated by healthcare costs, it is applicable to any zero-inflated continuous outcome.

DP mixtures [10, 11]

are a class of BNP methods that partition the joint distribution of the outcome and covariates into homogenous clusters. In the simplest case, cluster-specific means and variances are estimated by assuming data within each cluster are Normally distributed. In our case, the cluster-specific conditional means are modeled using the two-part zero-inflated regression mentioned earlier. A BNP standardization procedure then ensembles the cluster-specific models to compute a marginal causal effect.

Unlike finite mixtures, DP mixtures allow for infinitely many clusters - removing the need to pre-specify the number of clusters. Rather, as many clusters are formed as the complexity of the data demands. The DP naturally limits the number of clusters. So that even though infinitely many possible clusters exist, only finitely many appear. This serves to prevent overfitting while remaining data-adaptive. In addition to flexible causal effect estimation, our DP approach can identify subpopulations of high, low, or moderate cost patients. These heterogeneous subpopulations may be interesting from clinical or policy perspectives.

In section 2, we provide a brief overview of Bayesian standardization. In section 3, we provide an overview of DP mixtures before developing our generative model in Section 4. We also present MCMC methods for sampling from the posterior and posterior predictive distributions.

In section 5, we describe a Monte Carlo standardization procedure which uses the posterior predictive cost distribution of our infinite mixture model to compute average treatment effects under standard causal identifying assumptions.

Section 6 presents simulation studies exploring bias and coverage of causal effect estimates from the proposed procedure. We compare our method to standardization methods using a parametric Bayesian zero-inflated regression and a DP mixture of Gaussian regressions that does not model zeros explicitly. We show that the infinite mixture of zero-inflated regressions yields low-bias causal effect estimates and credible intervals with close to nominal coverage even when the data distributions are quite complex. While the DP Gaussian regression mixture also has low bias, it underestimates variance - resulting in poor credible interval coverage. In settings where the data distributions are simple and parametric, our proposed model reduces to a parametric zero-inflated model - also yielding low-bias point estimates and interval estimates with close to nominal coverage.

In section 7, we apply our method to cost data from the SEER Medicare database. Specifically, we compute the causal effect of radiation therapy versus chemotherapy on inpatient costs among patients with endometrial cancer.

## 2 Review of Bayesian Standardization

In this section, we will introduce the causal estimand of interest as well as the method of standardization. For simplicity, consider a setting with independently sampled patients indexed by . Each patient is assigned a binary treatment . Let

represent some total cost accumulated over some fixed followup period after treatment. Finally, assume we observe a vector of confounders

, that are sufficient to achieve ignorability. We would like to estimate total cost that would have been accumulated had patients taken treatment versus . Using potential outcomes notation [12, 13]

, let the random variable

represent potential total cost under treatment . Formally, our target of inference is the causal estimand

 Ψ=E[YAi=1i−YAi=0i] (2.1)

Intuitively, represents the average difference in cost had all patients taken treatment versus . The so-called “fundamental problem of causal inference” is that for any given subject we only observe either or , but not both. The counterfactual remains unobserved, so that is not identifiable without further assumptions. In the causal literature, these assumptions are

• Ignorability: . Conditional on observed covariates, potential cost is independent of treatment assignment. In randomized control trials, this conditional independence holds due to randomization.

• Consistency: observed under the actual treatment is equal to . Specifically, .

• No interference: one patients treatment assignment does not impact another’s potential outcome - . A common example of interference is a setting in which represents someone’s infection status and represents someone’s vaccination status against the disease in question. It is reasonable to assume that one patient’s vaccination status may impact another’s infection status.

• Positivity: no patient has a deterministic treatment. That is, the probability is strictly between and . If this assumption did not hold, then one of subject ’s potential outcomes would be undefined.

These causal assumptions allow us to identify using observed data. If these assumptions hold, the method of standardization yields an expression for both expectations of in terms of the observed data . Specifically, standardization allows us to express as

 E[YAi=ai]=E[E[YAi=ai|Li]]=E[E[YAi=ai|Li,Ai=a]]=E[E[Yi|Li,Ai=a]]=∫E[Yi|Li,Ai=a]dF(L) (2.2)

The first equality follows from iterated expectation. The second follows from the ignorability assumption. The third follows from the consistency assumption. Recall that

is a vector, so that the integral may be very high-dimensional in practice. In some sense, standardization is an imputation formula for imputing the missing counterfactual cost. We plug in the value of

under which we wish to impute . Under the causal assumptions, standardization imputes it with . Thus, can be expressed in terms of observed data as

 Ψ=∫E[Yi|Li,Ai=1]dF(L)−∫E[Yi|Li,Ai=0]dF(L) (2.3)

A parametric approach to standardization follows from assuming that the conditional distribution of and the distribution of are members of some family of distributions indexed by finitely many parameters. For example, if is only 1-dimensional and binary, we may assume that conditional on , and parameters , is Normally distributed with mean

 E[Yi|Li,Ai,β]=β0+βAAi+βLLi

and variance . If , then each term of can be computed as

 E[YAi=ai]=E[Yi|Li=1,Ai=a,β]π+E[Yi|Li=0,Ai=a,β](1−π)

Keil et al. [14] developed a Bayesian approach to the g-formula, which is a generalization of standardization to settings with time-varying treatment/confounding. In this Bayesian approach, inference about the causal effect is conducted using the posterior predictive distribution. Let , , and denote a new cost value under intervention . Also, let and denote a predictive draw of new confounders and treatment, respectively. The posterior predictive distribution, , is given by

 p(~ya|Y,A,L)=∫π∫ϕ∫β∑~L∈{0,1}p(~ya,~L,β,ϕ,π|Y,A,L) dβ dϕ dπ=∫π∫ϕ∫β∑~L∈{0,1}p(~ya|~L,β,ϕ,π,Y,A,L)p(~L,β,ϕ,π|Y,A,L)  dβ dϕ dπ=∫π∫ϕ∫β∑~L∈{0,1}p(~ya|~L,β,ϕ,π,Y,A,L)p(~L|β,ϕ,π,Y,A,L)p(β,ϕ,π|Y,A,L)  dβ dϕ dπ (2.4)

Conditional on future , . Similarly, conditional on parameters, . Of course, the parameters governing the distribution of are conditionally independent of . Similarly, the parameters governing the distribution of are conditionally independent of . With these conditions,

 p(~ya|Y,A,L)=∫π∫ϕ∫β∑~L∈{0,1}p(~ya|~L,β,ϕ)p(~L|π)p(β,ϕ,π|Y,A,L) dβ dϕ dπ (2.5)

Now, since is sufficient to control for confounding, ignorability holds. Therfore, we can condition on on the right-hand side . By consistency, we can drop the superscript on to obtain an expression in terms of observed data:

 p(~ya|Y,A,L)=∫π∫ϕ∫β∑~L∈{0,1}p(~y|~A=a,~L,β,ϕ)p(~L|π)p(β,ϕ,π|Y,A,L) dβ dϕ dπ (2.6)

Notice that the last term, , is the posterior distribution of the paramters. From this formula we see that Bayesian standardization proceeds by averaging posterior predictive draws of the outcome over both the distribution of the confounders and the posterior distribution of the parameters. More generally, we would replace the summation over with integration for continuous components.

The mean of this posterior predictive distribution can serve as a Bayesian point estimate of . We can also use percentiles of the posterior predictive distribution to construct credible intervals for . Standard MCMC methods can be used to draw from the posterior distribution of the parameters. Keil et al. describe a Monte Carlo procedure for computing the integrals involved in Equation 2.6 given these posterior draws.

The Bayesian standardization approach just described is fully parametric: we require a parametric model for both

and . How well we impute the potential cost outcomes under various interventions is partially determined by how correctly we model these two distributions. Thus, in addition to the standard causal assumptions needed to identify causal effects with observed data, we are also required to meet statistical assumptions of correctly specified models when using standardization.

The next few sections, and this paper in general, focuses on methods for relaxing these parametric assumptions so that we can better estimate even when the outcome and covariates follows highly irregular distributions.

## 3 Review of Dirichlet Process Mixtures

Finite mixture models assume a fixed number of clusters, which is to be specified by the analyst [15]. DP mixtures, however, are a class of BNP models that allow for infinitely many clusters - so that the analyst need not specify the number of clusters. Instead, as many clusters are formed as are demanded by the data (and prior). This section will provide some intuition for why clustering occurs in Dirichlet process mixtures, as well as review the idea of DP mixtures of regression [16].

It is easiest to develop these ideas for finite mixtures before extending them to the infinite case. Consider we are given some continuous data that we believe follow some unknown distribution. We may wish to model this distribution flexibly as a mixture of Gaussian distribution, with density denoted as , having mean and variance parameters for . Then conditional on latent cluster membership , for subjects , the distribution of is Gaussian:

 yi | ci=k∼f(yi|μk,ϕk)ci | (π1,…,πk)∼Categorical(π1,…,πK)(π1,…,πk)∼Dirichlet(α1,…,αK) (3.1)

Above, and

are the concentration parameters of the Dirichlet distribution. Bayesian estimation follows by placing priors on the cluster-specific parameters. Typically, a conjugate Dirichlet distribution prior is placed on the vector of cluster membership probabilities, as done above, and conjugate priors are placed on the Gaussian parameters. With conjugate priors, a Gibbs sampler can be used to sample from the posterior. Such samplers start with some random initialization of cluster assignments among the

clusters. Conditional on these assignments, the cluster-specific parameters are sampled from conditional posteriors. Then, conditional on these posterior parameter draws, cluster assignment is drawn from their conditional posteriors - updating each subject’s cluster membership. This process is repeated - generating a Markov chain for dependent draws that, after a suitable burn-in period, converge to draws from the posterior of the latent cluster assignment indicators and cluster-specific parameters.

In this finite model, is pre-specified. The Dirichlet process mixture generalizes this model by taking to infinity. In this infinite mixture model, we view as having come from the mixture

 y | G∼∫f(y|μ,ϕ)dG(μ,ϕ)

Where is some joint distribution on the parameters . A Bayesian model is completed by specifying a prior for the distribution - which is infinite dimensional. An appropriate prior is the Dirichlet process - often referred to as a “distribution over distributions” [17] - which we will conventionally denote as . The parameter is some base distribution over the parameters . The parameter is the concentration parameter. High values of make the distributions drawn from more tightly concentrated around the base distribution . Lower values of have the opposite affect. In this model, each is drawn from an with it’s own mean and variance parameter, rather than cluster-specific parameters.

Completed with this prior, the Bayesian infinite Gaussian mixture model can be written as

 yi |μi,ϕi∼f(y|μi,ϕi)(μi,ϕi) | G∼GG∼DP(αG0) (3.2)

It is a well-known results that distributions, , drawn from a are discrete with probability one [15]. Since follows a discrete distribution, there is a positive probability of ties - i.e., some of the will end up sharing parameters. In this way, the DP prior induces clustering among the .

MCMC methods for sampling from the posterior distribution of these infinite mixture models have been developed by Neal [18]. Mueller et al [15] provide an overview of other methods such as slice sampling and a Truncated DP approximation.

Hannah et al. developed the DP mixture of generalized linear models [16]. Their approach is a natural extension of the DP mixture in Equation 3.2. The central idea is to model cluster-specific means in Equation 3.2 as a linear combination of covariates, , and regression parameters, . For example, e.g. . This allows for cluster-specific regression parameters. The model is “locally” parametric in the sense that a paramteric form is assumed conditional on cluster membership. However, the model is overall nonparametric in the sense that there are infinitely many possible clusters (and, therefore, parameters) possible.

We now return to our trivial example with observed cost , a binary treatment , and a single binary confounder . Assuming local Gaussian distributions, Hannah et al’s DP mixture model can be written generatively as

 yi |βi,ϕi∼N(β0,i+β1,iAi+β2,iLi,ϕi)Ai|Li∼Ber( expit(γ0,i+γ1,iLi) )Li∼Ber(πL,i)(βi,γ,πL,i,ϕi) | G∼GG∼DP(αG0) (3.3)

Due to the clustering, this approach models and - the distributions needed for standardization - more flexibly. Hannah et al. developed this approach generally for the exponential family case - making their method suitable for both discrete and continuous outcomes with various link functions.

Roy et al [6] used a similar model to 3.3 to estimate causal effects using standardization. However, they use an EDP prior instead of a the standard DP prior. This induces a nested clustering which improves performance relative to the standard DP mixture in high-dimensional settings. This method, however, does not have a zero-inflated component. So, applying it to zero-inflated outcomes would likely underestimate posterior variance of causal effects.

In the next section, we extend this nonparametric model to an infinite mixture of zero-inflated regressions.

## 4 Dirichlet Process Mixture of Zero-Inflated Regressions

In this section, we develop a DP mixture of zero-inflated regressions. Within each cluster, in addition to modeling non-zero costs with a Gaussian regression, we also model the probability of cost being zero using a logistic regression. Patients are clustered into groups with similar regression parameters, logistic regression parameters, as well as covariate parameters. The number of clusters is not specified by the user - it is instead driven by the data. Because it is a fully Bayesian approach, uncertainty at all levels of this model propogate through to the causal effect estimates of interest. Inference and interval estimation is done using draws from the relevant posteriors obtained via MCMC methods.

The next several subsections develop the joint distribution of the data as well as the posterior of our mixture model. We also develop the posterior predictive distribution of the outcome. For simplicity, we develop the model for a finite mixture with clusters, then take to infinity at the end. Obtaining draws from the posterior predictive is our primary interest as they will be used in a standardization procedure to compute causal treatment effects.

After developing the statistical model, we will discuss using this model for causal effect estimation in Section 5.

### 4.1 A Generative Model

Consider a setting where for the subject we have a scalar outcome . Perhaps this outcome exhibits a highly non-normal distribution with many zero values. For all subjects, we also observe a vector of covariates, , whose components can be either binary or continuous. The method can be easily extended to discrete covariates with more than two levels, but we consider only the case of binary discrete covariates for simplicity. Note that when we eventually return to the topic of doing causal inference with this model, the vector will include a treatment indicator. Finally, we observe independent patients, so that our observed data is the set .

Due to the zero inflation, we would like to explicitly model the . Define for each subject an indicator of the outcome being zero. So our observed data is augmented by .

Moreover, we assume that each subject belongs to one of clusters, with membership given by variables . This class membership is latent and remains unobserved. Thus, our augmented (but partially unobserved) data is given by .

Denoting the vector of cluster membership probabilities as and the vector of cluster membership indicators for all subjects as , the joint distribution of the complete data can be expressed hierarchically as

 yi | zi,xi,ci∼{δ(yi),zi=1N(μi,ci,ϕci),zi=0zi|xi,ci∼Ber(θi,ci)xij|ci∼gj,ci={N(λj,ci,τj,ci), x is continuousBer(pj,ci), x is binary,  j∈{1,2,…,d}c | π∼ Categorical(π) (4.1)

Above, , is a point mass at . We model and where and are -dimensional parameter vectors that include intercepts. We assume that the covariates are independent conditional cluster membership so that . For convenience, we will sometimes refer to the parameters governing the joint covariate distribution in cluster as . For example, if we have one continuous and one binary covariate, then with the continuous parameters ordered first, .

Intuitively, this finite mixture model assumes that patients belong to one of clusters. The parameters that govern the covariate distribution, outcome distribution, and the probability of the outcome being zero vary across clusters. This is indicated by the subscripts on these parameters. Note that the model explicitly places a positive probability on the event , that is .

To complete a Bayesian model, we must place priors on all cluster-level parameters as well as a prior on . For now, we will discuss the prior on and leave the discussion of cluster-level parameter priors to the next section. We follow the conventional approach of placing a symmetric Dirichlet prior on . That is, . Following Neal [18], we integrate out from the joint distribution . Then we find the distribution of each conditional the cluster indicators for all other subjects, denoted by . The resulting prior on a single subject’s cluster indicator is then

 p(ci=k|c−i)=ni,k+α/Ki−1+α,   k∈{1,2,…,K} (4.2)

This assigns each subject a prior probability of belonging to cluster

. The term is the number of subjects in cluster .

We now have the necessary components to write the joint distribution. Letting represent the Gaussian density evaluated at , the joint distribution of the subject’s (complete) data is

 p(yi,zi,ci,xi)=p(yi|zi,ci,xi)p(zi|ci,xi)p(xi|ci)p(ci|π)=N(yi;μi,ci,ϕci)1−zi⋅θzii,ci⋅(1−θi,ci)1−zip(xi|ci)p(ci|c−i) (4.3)

It should be noted that the distribution above is actually conditional on all cluster-specific parameters. The conditioning is omitted for clarity.

Now define vectors , , and . Denote the number of subjects in each cluster as for . The joint distribution of the complete data for all subjects conditional on there being clusters is given by

 p(y,z,x,c)=n∏i=1p(yi,zi,ci,xi)=n∏i=1N(yi;μi,ci,ϕci)1−zi⋅θzii,ci⋅(1−θi,ci)1−zip(xi|ci)⋅p(ci|c−i)=K∏k=1[nk∏i=1N(yi;μi,k,ϕk)1−zi⋅Ber(zi;θi,k)⋅pk(xi)⋅p(ci=k|c−i)] (4.4)

Above, the joint covariate distribution for subject , , is governed by cluster-specific parameters. The term represents the Bernoulli density evaluated at . Our model assumes that patients within a cluster are independent. We let denote the number of subjects in cluster with . Without loss of generality, we assume subjects with non-zero outcomes are ordered first. So we can further decompose the joint distribution as

 p(y,z,x,c)=K∏k=1[(nk−nz,k∏i=1N(yi;μi,k,ϕk))⋅(nk∏i=1Ber(zi;θi,k))⋅pk(x)nk∏i=1⋅p(ci=k|c−i)] (4.5)

Above, , where . From the expression above, we see that the joint distribution factors into separate components for each cluster’s parameters, conditional on cluster assignment. Within each cluster, the joint further factors into a component for non-zero (i.e. ) and a component for estimating the probability of having a zero outcome (i.e. ). Not only is this conceptually elegant, but it also gives us the potential for more computationally efficient sampling - allowing parameters of all clusters to be sampled in parallel within an MCMC iteration.

In the next section, we discuss choices of priors for the cluster-specific parameters. We will then take to infinity and discuss the resulting posterior distribution. As will be seen, the posterior cannot be written explicitly. Nevertheless, an MCMC procedure can be devised to sample from it.

### 4.2 Prior Distributions and Conditional Posteriors

In order to complete the Bayesian model, we must place prior distributions on each of the cluster-specific parameters in Equation 4.5. These include parameters governing the distribution of non-zero cost, , parameters governing the probability of zero cost, , as well as the parameters governing the covariate distributions ().

We detail our choices of priors for these parameters in the Appendix A. Generally, we choose priors that are conjugate conditional on clusters assignment where possible - a choice that is largely motivated by computational efficiency.

Recall from the previous section that a Dirichlet prior was placed on the cluster membership probabilities . We then integrated these probabilities out and found the implied conditional prior on the single subject’s cluster membership indicator, . This prior is given by Equation 4.2. Now, taking in Equation 4.2 yields the following conditional prior on a single subject’s cluster membership [18, 17]

 p(ci=k | c1,…,ci−1)→ni,ki−1+αp(ci≠cj, ∀j

Note that if the belong to one of clusters, then . The first equation is the prior probability of belonging to an existing cluster. The second equation is the prior probability of belonging to a new cluster. We have now generalized our finite mixture to an infinite Dirichlet process mixture.

This set of equations is an explicit statement of the Chinese restaurant process - one characterization of the Dirichlet Process. The characterization is as follows. Customers arrive sequentially to a Chinese restaurant with infinitely many tables. The first customer sits at the first table. The second customer sits at the first table with probability . The one in the numerator reflects the fact that there is one person at this table already. The second customer sits at a new table with probability . In general, the customer sits at the of the existing tables with probability proportional to - the number of patients who had arrived before and are seated at table . The customer sits at a new table proportional to .

The Chinese restaurant characterization clearly depicts the clustering inherent in the DP. Even though there are infinitely many tables at the restaurant, as more and more patients arrive, the probability of a customer being seated at a new table becomes less and less likely. That is, as , since increments up to some thing large. The parameter controls how many tables are likely to be occupied. High values of increase the relative probability of being seated at a new table - increasing the expected number of occupied tables. Low values of decrease the expected number of tables. It is possible to express our uncertainty about by placing a prior distribution on this parameter.

To explicitly link this Chinese restaurant analogy to our model, each table serves a unique set of parameters. Subjects in table/cluster have distinct , and covariate parameters that govern the distribution of their data.

We can combine the prior in Equation 4.6 with Equation 4.5

and drop all terms not involving cluster indicators to obtain the posterior probabilities of subject

’s membership to an existing cluster or to a new cluster. The posterior probability of belonging to an existing cluster is

 p(ci=k|c−i,β,ϕ,γ,θx,y,x,z)∝p(y,x,z|ci=k,c−i,β,ϕ,γ,)p(ci=k|c−i)∝N(yi;μi,k,ϕk)1−zi⋅Ber(zi;θi,k)⋅pk(xi)p(ci=k|c−i)∝N(yi;μi,k,ϕk)1−zi⋅Ber(zi;θi,k)⋅pk(xi)ni,ki−1+α (4.7)

For each patient, we can compute this for each of the existing clusters given that we have parameter values. Notice that, generally speaking, the posterior probability of belonging to cluster is proportional to the evaluation of the joint distribution for that subject’s data (as well as the prior, of course). Therefore, the probability of a subject belonging to a cluster is high if the likelihood of the subject’s data values under cluster ’s parameters is high.

The posterior probability of belonging to a new cluster is similar, except that we integrate out the cluster-specific parameters (since this cluster does not exist). For , where ,

 p(ci=k, k∉c−i|c−i,β,ϕ,γ,θx,y,x,z)∝∫N(yi;μi,k,ϕk)⋅Ber(zi;θi,k)⋅pk(xi)ni,ki−1+αdF(βk,γk,ϕk,θx,k) (4.8)

Above, is the joint prior over the parameters. The integral in Equation 4.8 cannot be evaluated analytically in this case since we do not have a conjugate prior on . However, we can draw parameters from the prior, then evaluate the likelihood of the data under these draws - essentially performing a Monte Carlo evaluation of the integral [17].

Conditional on cluster memberships, we can draw from the conditional posteriors of each parameter of interest - , , , and . For each parameter, this is done by combining Equation 4.5 with the corresponding prior, then dropping terms involving other parameters. The details are given in Appendix B.

### 4.3 Sampling from the Posterior

In this section, we outline a block Metropolis-in-Gibbs MCMC procedure for sampling from the posterior that is extended from Neal [18, 19]. The pseudocode for a function, , implementing this procedure is presented in Algorithm 1.

To summarize, the sampler is initialized by a choosing the number of initial clusters and initial parameter values. We start by randomly allocating patients among the initial clusters. We draw from the conditional posterior distribution of each cluster’s parameters. We use conjugacy and blocking to draw from posteriors of and . We use a Metropolis step to draw from the conditional posterior of . After drawing from the posteriors of these parameters, we compute the posterior probability of membership to each of the existing classes (Equation 4.7) and the posterior probability of membership to a new cluster (Equation 4.8

). We do this for each patient, then re-classify each patient based on these posterior probabilities. This completes a single Gibbs iteration.

Notice that in each iteration, a patient tends to be re-classified to the cluster with highest likelihood of observed data under the current iterations parameter draws. The classification improves with each iteration. As the classification improves, so do the cluster-specific parameter draws which use the patients in these clusters. In each iteration, new clusters can be born. Subjects are likely to be classified into this new cluster if the likelihood under existing cluster parameters are relatively low. In each iteration, existing clusters can also die as all patients in one cluster may be re-classified to other clusters that better fit their observed data.

Sampling from the posteriors of the clusters in each iteration can also be parallelized on a multicore computer for further efficiency gains - as alluded to in Section 4.1. After a suitable burn-in period, this MCMC sampler yields draws from the posterior of all cluster-specific paramaters. Conjugacy is used for all posterior draws except for . We use a Metropolis step within each Gibbs iteration to propose a block. Moreover we obtain posterior samples of - each subject’s latent class membership indicator. The posterior mode of these draws can be use to classify patients according to the parameters governing their outcome-covariate distribution.

In order to incorporate our model into the standardization procedure and use it to non-parametrically estimate causal effects, we must be able to sample from the posterior predictive distribution under different interventions. An algorithm for sampling from the posterior predictive distribution is presented in the next section.

### 4.4 Sampling from the Posterior Predictive Distribution

In this section, we describe a procedure for sampling from the posterior predictive distribution of . Letting represent a posterior predictive draw, we wish to sample from - the distribution of a new cost value conditional on the data we have observed. We propose a Monte Carlo procedure using the post-burnin samples from Algorithm 1.

Assuming we have such draws, indexed by , this integration procedure is given in Algorithm 2 in Appendix C. Intuitively, the algorithm takes the post-burnin parameter draws and runs them through the generative model in Equation 4.1 to draw outcome values for each subject.

Given the draw of and , we can compute the probability of a subject belonging to one of the occupied clusters or a new cluster. We can draw the posterior predictive cluster assignment using this information and call it . Then we can simulate a covariate vector for this pseudo-subject be using the posterior covariate parameter draws corresponding to . Similarly, we can simulate a zero-inflated cost given the current iteration’s posterior regression parameter draws corresponding to cluster .

We do this for all posterior draws, yielding draws from the posterior predictive distribution indicated by . Summary statistics such as the mean and percentiles of these draws can be used to compute point and interval estimates of the predicted outcome for each subject.

## 5 Estimating Causal Effects Using Standardization

Having proposed a Dirichlet process mixture of zero-inflated regression and outlined methods from sampling from the posterior and posterior predictive distributions, we circle back to our original goal: flexible causal effect estimation using standardization.

Recall, from Section 1, that our causal estimand of interest is the expectation

 Ψ=E[YA=1−YA=0]

Assume that our covariate vector includes the treatment assignment indicator, so that for each subject . We can define the covariate vector under intervention as . Here, represent a vector of confounders - factors affecting both treatment assignment and cost. It can contain a mix of binary and continuous components and the set is assumed to be sufficient to control for confounding.

Let , and represent draws from their respective posterior predictive distributions.Then the posterior predictive distribution of the potential outcome under intervention is given by

 p(~ya|y,x,z)=∞∑k=1∫θx,k∫βk∫ϕk∫γ∫~x∑l∈{0,1}p(~ya,βk,ϕk,γk,θx,k,~x,c=k,~z=l|y,x,z) d~xdγkdϕkdβkdθx,k=∞∑k=1∫θx,k∫βk∫ϕk∫γ∫~x∑l∈{0,1}p(~ya|βk,ϕk,c=k,~z,~x)⋅p(~z|c=k,~x,γk)    ⋅p(~x|θx,k,c=k)⋅p(βk,ϕk,γk,θx,k,c=k|y,x,z) d~xdγkdϕkdβkdθx,k (5.1)

Under the ignorability and consistency assumptions, we can substitute into the conditional distribution and drop the superscript from . This yields,

 p(~ya|y,x,z)=∞∑k=1∫θx,k∫βk∫ϕk∫γ∫~x∑l∈{0,1}p(~y|βk,ϕk,c=k,~z=l,~xa)⋅p(~z=l|c=k,~xa,γk)    ⋅p(~xa|θx,k,c=k)⋅p(βk,ϕk,γk,θx,k,c=k|y,x,z) d~xdγkdϕkdβkdθx,k (5.2)

This identifies the causal effect in terms of parameters, the observed data, and a desired intervention. There are no potential outcome random variables on the right-hand side of Equation 5.2.

Note that is the posterior distribution from which we sample using Algorithm 1. This Bayesian nonparametric standardization procedure works by averaging the conditional cost distribution over the distribution of the covariates and posterior. Note, however, that here the conditional distribution of and the conditional distribution of the covariates are infinite mixtures. This is a more flexible model than the parametric formula in Equation 2.6.

We can use the above to compute both terms of the causal estimand, ,

 ^Ψ=E[~yA=1|y,x,z]−E[~yA=0|y,x,z]=∫~y⋅(~y1|y,x,z) d~y−∫~y⋅p(~y0|y,x,z) d~y (5.3)

Interval estimates can be computed using percentiles of . Moreover, other causal contrasts such as ratios, , can also be computed in an identical fashion. Conditional causal effects may also be computed by applying the conditioning to Equation 5.2.

The central impediment is the intractability of the integral in Equation 5.2. We propose a Monte Carlo integration procedure to evaluate this integral. This procedure is outlined in Algorithm 3 in Appendix C. It is essentially the same as the posterior predictive algorithm, but with treatment held fixed at the desired intervention.

Using each posterior draw from Algorithm 1, the standardization algorithm generates a pseudo-population under intervention . Conceptually, we begin by first simulating a cluster assignment for each pseudo-subject, where the probability of cluster membership is proportional to the number of subjects in each occupied cluster at the current iteration. A pseudo-patient may also be assigned to a new cluster with probability proportional to . Then, conditional on this cluster assignment to, say, cluster , we simulate covariates using the covariate parameter draw of that cluster. Conditional on these simulated covariates, and the cluster’s posterior draw of , we simulate a . Conditional on these, we simulate a . Averaging the across these pseudo-subjects preforms a Monte Carlo integration over the distributions of and . We compute this average for each posterior draw, then take a grand average across posterior draws. This final averaging is a Monte Carlo integration over the posterior distribution.

## 6 Simulation Study

In this section we evaluate the bias of the BNP estimator given in Equation 5.3 as well as the coverage if the corresponding 95% credible interval under a variety of simulation settings.

For each of these settings, we simulate 1000 datasets with 900 subjects each. We assume a binary treatment, a single continuous covariate , and the existence of three clusters - each with separate parameters for the conditional cost distribution and covariate distributions.

For each simulated dataset, we obtain samples from the posterior predictive distribution of using the infinite mixture of zero-inflated models. We estimate using the mean of this distribution, as defined by Equation 5.3. We estimate a credible interval by using the and percentiles of this distribution. We compute the bias of and the coverage of this credible interval across the 1000 simulations.

For comparison, we compute an estimate and credible interval for using an infinite mixture of Gaussian regressions with no zero-inflated component. We also compare against a parametric Bayesian zero-inflated regression.

### 6.1 Simulation Setting 1: Random Zero-Inflation

In this relatively simple setting, we simulate data where zero-inflation is independent of the treatment and the confounder. First, we simulate data with three clusters - each with a different set of parameters governing the data generation process. In each of the three clusters we simulate, the proportion of zero outcomes are 73%, 50%, and 26% - fairly extreme zero-inflation in all clusters.

Figure 1 depicts clustering results for a single simulated dataset. The true cluster assignments are given in the left-most panel. The zero-inflated infinite mixture correctly identifies three clusters - attributing zero outcome values to each of the three clusters. By contrast, the Gaussian infinite mixture identifies four clusters - grouping the zero outcome values into a separate cluster. Note that the colors representing clusters are not meaningful and need not match the colors of the true cluster assignments.

Figure 2 depicts posterior predictive draws under each of the three models for a single simulated dataset. The left-most panel depicts the simulated dataset. The posterior predictive draws from both the zero-inflated and Gaussian mixtures look similar to the data - indicating adequate model fit. Notice that the treatment groups, indicated with red () and gray () points, are better predicted under the zero-inflated mixture. As expected, the posterior predictive draws from the parametric model are not representative of the data - indicating poor fit.

Table 1 summarizes the bias and coverage of the causal effect estimators under each of the three models across 1000 simulations. As expected based on the quality of the clustering and posterior predictive draws, both of the infinite mixture models yield low-bias estimates of , while the parametric model performs poorly.

Notably, the coverage of the 95% credible interval from the zero-inflated mixture is much closer to the nominal coverage probability. The credible intervals from the Gaussian mixture are much too narrow - with a median interval width of across 1000 simulations. Since the Gaussian mixture does not explicitly model the zero-inflation process, it underestimates variability.

We now consider the same simulation setting, but where data are simulated using only a single cluster. We initialize both the zero-inflated and Gaussian mixtures by randomly assigning subjects to one of three clusters. We expect that the zero-inflated model will reduce to a single cluster.

Figure 3 shows the expected result for a single simulated datasets. The zero-inflated infinite mixture correctly groups all subjects into a single cluster. The Gaussian mixture settles on two clusters - one cluster of non-zero outcomes and one cluster of zero outcomes.

For all three models, the posterior predictive draws for a single simulated dataset, depicted in Figure 4, are similar to the data - indicating adequate fit.

Table 2 summarizes results across 1000 simulated datasets. All procedures, including the parametric model, yield low-bias causal effect estimates with credible interval coverage close to the nominal 95% probability. As before, the Gaussian mixture has poor coverage relative to the zero-inflated mixture.

These results highlight the advantages of the zero-inflated infinite mixture. In the event that the data are generating according to a simple parametric model, the infinite mixture collapses to a single cluster. Thus, achieving bias and efficiency results similar to a parametric zero-inflated model.

However, as the data become more complex, the infinite mixture opens new clusters. This accommodates the complexity by grouping subjects into more homogeneous clusters and modeling each with separate parameters. This yields in low-bias causal effect estimates with good credible interval coverage relative to the Gaussian mixture and the parametric model even in complicated settings.

## References

• [1] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bart: Bayesian additive regression trees. Ann. Appl. Stat., 4(1):266–298, 03 2010.
• [2] Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
• [3] Jason Roy, Kirsten J. Lum, and Michael J. Daniels. A bayesian nonparametric approach to marginal structural models for point treatments and a continuous or survival outcome. Biostatistics, 18(1):32–47, 2017.
• [4] Kim Chanmin, Daniels Michael J., Marcus Bess H., and Roy Jason A. A framework for bayesian nonparametric inference for causal effects of mediation. Biometrics, 73(2):401–409.
• [5] Sara Wade, David B. Dunson, Sonia Petrone, and Lorenzo Trippa. Improving prediction from dirichlet process mixtures via enrichment. J. Mach. Learn. Res., 15(1):1041–1071, January 2014.
• [6] Roy Jason, Lum Kirsten J., Zeldow Bret, Dworkin Jordan D., Re Vincent Lo, and Daniels Michael J. Bayesian nonparametric generative models for causal inference with missing at random covariates. Biometrics, 0(0).
• [7] Sujit K. Ghosh, Pabak Mukhopadhyay, and Jye-Chyi(JC) Lu. Bayesian analysis of zero-inflated regression models. Journal of Statistical Planning and Inference, 136(4):1360 – 1375, 2006.
• [8] William Barcella, Maria De Iorio, Gianluca Baio, and James Malone-Lee. A bayesian nonparametric model for white blood cells in patients with lower urinary tract symptoms. Electron. J. Statist., 10(2):3287–3309, 2016.
• [9] A. R. Linero, D. Sinha, and S. R. Lipsitz. Semiparametric Mixed-Scale Models Using Shared Bayesian Forests. ArXiv e-prints, September 2018.
• [10] Thomas S. Ferguson. A bayesian analysis of some nonparametric problems. Ann. Statist., 1(2):209–230, 03 1973.
• [11] Thomas S. Ferguson. Bayesian density estimation by mixtures of normal distributions11this research was partially supported by the national science foundation under grant mcs77-2121. In M. Haseeb Rizvi, Jagdish S. Rustagi, and David Siegmund, editors, Recent Advances in Statistics, pages 287 – 302. Academic Press, 1983.
• [12] Donald B. Rubin. Bayesian inference for causal effects: The role of randomization. Ann. Statist., 6(1):34–58, 01 1978.
• [13] Donald B Rubin. Causal inference using potential outcomes. Journal of the American Statistical Association, 100(469):322–331, 2005.
• [14] Alexander P Keil, Eric J Daza, Stephanie M Engel, Jessie P Buckley, and Jessie K Edwards. A bayesian approach to the g-formula. Statistical Methods in Medical Research, 0(0):0962280217694665, 0. PMID: 29298607.
• [15] P. Müller, F.A. Quintana, A. Jara, and T. Hanson. Bayesian Nonparametric Data Analysis. Springer Series in Statistics. Springer International Publishing, 2015.
• [16] Lauren A Hannah, David M Blei, and Warren B Powell. Dirichlet process mixtures of generalized linear models.

Journal of Machine Learning Research

, 12(Jun):1923–1953, 2011.
• [17] Carl E. Rasmussen. The Infinite Gaussian Mixture Model. In In Advances in Neural Information Processing Systems 12, volume 12, pages 554–560, 2000.
• [18] Radford M. Neal. Markov chain sampling methods for dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
• [19] W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970.

## Appendix A Prior Distributions

In this appendix, we discuss prior distribution choices for the infinite mixture of zero-inflated regression. To complete a Bayesian model, we specify prior distributions for the parameters in Equation 4.1. We place the following conjugate normal prior on all positive cost parameters for each cluster:

 βk∼Nd+1(mβ,Ωβ),ϕk∼InvGam