Bayesian causal inference via probabilistic program synthesis

10/30/2019 ∙ by Sam Witty, et al. ∙ 53

Causal inference can be formalized as Bayesian inference that combines a prior distribution over causal models and likelihoods that account for both observations and interventions. We show that it is possible to implement this approach using a sufficiently expressive probabilistic programming language. Priors are represented using probabilistic programs that generate source code in a domain specific language. Interventions are represented using probabilistic programs that edit this source code to modify the original generative process. This approach makes it straightforward to incorporate data from atomic interventions, as well as shift interventions, variance-scaling interventions, and other interventions that modify causal structure. This approach also enables the use of general-purpose inference machinery for probabilistic programs to infer probable causal structures and parameters from data. This abstract describes a prototype of this approach in the Gen probabilistic programming language.




1 Introduction

Bayesian formulations of causal inference enable practitioners to explicitly reason about uncertainty when answering structural questions (e.g., “What is the probability that X causes Y?”) as well as questions about the effects of a specific intervention (e.g., “How much will intervening to make X = x increase the probability that Y = y?”). Bayesian formulations have been developed for both the potential outcomes framework McCandless et al. (2009) and the causal graphical models framework Friedman and Koller (2000); Griffiths and Tenenbaum (2009); Heckerman et al. (1995). In principle, Bayesian approaches make it possible to incorporate prior knowledge and to make efficient use of limited data Mansinghka et al. (2006); Murphy (2012).
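As a toy numerical illustration of this formulation (the prior and likelihood values below are invented for illustration, not taken from the paper): with a uniform prior over two candidate structures and a likelihood for some observed data under each, the posterior over structures follows directly from Bayes' rule.

```python
# Toy Bayes-rule computation over two candidate causal structures.
# The prior and likelihood values are illustrative, not from the paper.

def posterior_over_structures(prior, likelihood):
    """Normalize prior * likelihood over a dict of candidate structures."""
    unnorm = {s: prior[s] * likelihood[s] for s in prior}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

prior = {"edge": 0.5, "no_edge": 0.5}          # P(structure)
likelihood = {"edge": 0.02, "no_edge": 0.08}   # P(data | structure)

post = posterior_over_structures(prior, likelihood)
print(post)  # {'edge': 0.2, 'no_edge': 0.8}
```

The same normalization applies unchanged when the "structures" are entire synthesized programs rather than two hand-picked graphs.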

In this paper, we explore a new approach to implementing Bayesian causal inference based on probabilistic programming, inspired by Bayesian synthesis Saad et al. (2019). Probabilistic programming languages enable users to compactly specify probabilistic models in code. Some languages, like Stan Carpenter et al. (2017), have syntax that closely resembles the statistical notation often used in the literature to define probabilistic models: a list of sampling statements of the form x ~ dist(args). Others, like Gen Cusumano-Towner et al. (2019), allow users to include arbitrary program control flow in their models; a model is represented by a program that simulates stochastically from a distribution. In this paper, we represent hypothesized causal models explaining some phenomenon as programs in MiniStan, a simple probabilistic programming language designed to resemble Stan (Figure 1). Then, we use a more expressive probabilistic programming language, Gen, to encode a prior and likelihood over MiniStan programs, and to do inference. The Gen model (i) stochastically generates MiniStan programs to encode a prior distribution over causal model structures and parameters, (ii) programmatically edits the generated MiniStan programs to reflect interventions and experimental conditions, then (iii) interprets the MiniStan programs to generate observational and experimental data. We can then use Gen’s inference programming and conditioning features to condition the entire process on actual observational and experimental data, and to obtain posterior samples of the MiniStan code defining the original observational model; that is, to perform both structure learning and parameter estimation.
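The generate-edit-interpret pipeline can be sketched in plain Python, representing a causal "program" as a list of statements. All function names, the statement encoding, and the parameter values here are illustrative assumptions, not the paper's Gen or MiniStan code:

```python
import math
import random

# A hedged sketch of the generate/edit/interpret pipeline, with a causal
# "program" represented as a list of (variable, kind, params) statements.
# All names here are illustrative; this is not Gen's or MiniStan's API.

def generate_program(rng):
    """Prior: sample parameters and (possibly) an edge from belief to outcome."""
    w_so = rng.uniform(0, 1)                                  # skill -> outcome
    w_bo = rng.uniform(0, 1) if rng.random() < 0.5 else 0.0   # edge coin flip
    return [("s", "normal", (0.0, 1.0)),
            ("b", "normal_of", ("s", 0.5)),
            ("o", "logit", {"s": w_so, "b": w_bo})]

def edit_do(program, var, value):
    """Intervention as a program edit: replace var's statement with a constant."""
    return [(var, "const", value) if stmt[0] == var else stmt
            for stmt in program]

def interpret(program, rng):
    """Run the program forward to produce one joint sample of all variables."""
    env = {}
    for var, kind, params in program:
        if kind == "const":
            env[var] = params
        elif kind == "normal":
            env[var] = rng.gauss(*params)
        elif kind == "normal_of":
            parent, sd = params
            env[var] = rng.gauss(env[parent], sd)
        elif kind == "logit":
            logit = sum(w * env[p] for p, w in params.items())
            env[var] = rng.random() < 1 / (1 + math.exp(-logit))
    return env

rng = random.Random(0)
prog = generate_program(rng)
sample = interpret(edit_do(prog, "b", 5.0), rng)
print(sample["b"])  # prints 5.0: the intervened value is fixed
```

Because programs are ordinary data here, interventions become ordinary data transformations, which is the key property the paper exploits.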

Figure 1: Grammar of MiniStan

Causal models are typically structured as a set of autonomous components Aldrich (1989); Haavelmo (1944); Pearl (2000), such that interventions in the system can be accurately represented in the model as an alteration of a small number of model components, and all other model components (and the causal relationships among them) remain unchanged. In the formalism of causal graphical models, interventions are typically expressed using the do-operator Pearl (2000), which fixes the value of one random variable and removes the influence of its parents. However, many realistic interventions are not accurately represented by this particular variety of model alteration Eberhardt and Scheines (2007); Korb et al. (2004); Sherman and Shpitser (2019). For example, realistic interventions might best be represented by altering the functional form of a particular dependence, enabling or disabling specific causes, or enacting complex combinations of these interventions. This paper demonstrates interventions represented as modifications of probabilistic program source code and shows how this representation enables the Bayesian synthesis approach to handle a broad class of experimental data.

1.1 A conceptual example

(a) “Belief and skill matter” CGM.
(b) “Only skill matters” CGM.
s ~ normal(mu_s, sigma_s)
b ~ normal(s, sigma_b)
logit_o = s * lambda_so + b * lambda_bo
o ~ bernoulli(1/(1+exp(-logit_o)))
(c) “Belief and skill matter” model as source code.
s ~ normal(mu_s, sigma_s)
b ~ normal(s, sigma_b)
logit_o = s * lambda_so
o ~ bernoulli(1/(1+exp(-logit_o)))
(d) “Only skill matters” model as source code.
Figure 2: A conceptual example combining structure learning and parameter estimation.

Consider the task of inferring whether a student’s belief in her ability is causal for success at a research project. Observational data on student belief and student success alone are insufficient to answer this question, due to the confounding effect of skill (see Figures 2a & 2b).

We can imagine multiple types of experiments that would enable effective causal inference despite the confounding effect of skill. For example, an advisor could encourage a student, shifting her belief in her ability (but not increasing her skill). An advisor could also administer an assessment on the key skills needed for the project, before the student attempts it, and look at the results. Unfortunately, although this might reveal the true skill level to the advisor, this might also change the student’s belief in her own ability to succeed. Hypothetically, one can imagine a miracle pill that modifies one’s confidence to a fixed value, without changing anything else. Each of these experiments corresponds to a different modification to the source code from Figures 2c and 2d. Examples of these modifications are shown in Figures 3a-f.

This paper shows how to formalize this example, using probabilistic programs that generate, edit, and interpret the source code of causal models. It also presents results from an implementation in the Gen probabilistic programming language, demonstrating the utility of incorporating diverse sources of experimental data.

s ~ normal(mu_s, sigma_s)
b = 5
logit_o = s * lambda_so + b * lambda_bo
o ~ bernoulli(1/(1+exp(-logit_o)))
(a) “Belief and skill matter” with belief pill.
s ~ normal(mu_s, sigma_s)
b = 5
logit_o = s * lambda_so
o ~ bernoulli(1/(1+exp(-logit_o)))
(b) “Only skill matters” with belief pill.
s ~ normal(mu_s, sigma_s)
b ~ normal(s + 3, sigma_b)
logit_o = s * lambda_so + b * lambda_bo
o ~ bernoulli(1/(1+exp(-logit_o)))
(c) “Belief and skill matter” with encouragement design.
s ~ normal(mu_s, sigma_s)
b ~ normal(s + 3, sigma_b)
logit_o = s * lambda_so
o ~ bernoulli(1/(1+exp(-logit_o)))
(d) “Only skill matters” with encouragement design.
s ~ normal(mu_s + 2, sigma_s)
b ~ normal(s, sigma_b / 100)
logit_o = s * lambda_so + b * lambda_bo
o ~ bernoulli(1/(1+exp(-logit_o)))
(e) “Belief and skill matter” with assessment.
s ~ normal(mu_s + 2, sigma_s)
b ~ normal(s, sigma_b / 100)
logit_o = s * lambda_so
o ~ bernoulli(1/(1+exp(-logit_o)))
(f) “Only skill matters” with assessment.
Figure 3: Various interventions expressed as modifications of MiniStan source code.

2 Priors on Causal Models

Figure 4: Graphical meta-model for the Bayesian synthesis approach to causal structure and parameter learning. A set of global parameters determines the source code of the observational causal program, which is modified via code-editing intervention functions to induce experimental causal programs for the belief-pill, encouragement-design, and assessment interventions. The code for each program is run through an interpreter, which generates (observational or experimental) data. The likelihoods of the various kinds of data under the different interpreted programs can be used to infer the posterior distribution over the global parameters, and therefore over the observational causal program.

To compute the posterior distribution over the two candidate causal models, we first specify a prior distribution over a set of global latent variables. One of these variables, edge, determines whether belief (b) influences outcome (o).

1@gen function generate_causal_model()
2 mu_s = @trace(normal(0, 1), :mu_s)
3 sigma_s = @trace(uniform(0, 1), :sigma_s)
4 sigma_b = @trace(uniform(0, 1), :sigma_b)
5 lambda_so = @trace(uniform(0, 1), :so_weight)
6 lambda_bo = @trace(uniform(0, 1), :bo_weight)
7 edge = @trace(bernoulli(0.5), :edge)
9 if edge
10 logit_o_expr = quote s * $lambda_so + b * $lambda_bo end
11 else
12 logit_o_expr = quote s * $lambda_so end
13 end
15 causal_model = quote
16 s ~ normal($mu_s, $sigma_s)
17 b ~ normal(s, $sigma_b)
18 logit_o = $logit_o_expr
19 o ~ bernoulli(1/(1+exp(-logit_o)))
20 end
21 return causal_model
22 end
s ~ normal(0.237, 0.449)
b ~ normal(s, 0.913)
logit_o = s * 0.137 + b * 0.852
o ~ bernoulli(1/(1 + exp(-logit_o)))

s ~ normal(-0.592, 0.302)
b ~ normal(s, 0.724)
logit_o = s * 0.503 + b * 0.491
o ~ bernoulli(1/(1 + exp(-logit_o)))

s ~ normal(1.892, 0.108)
b ~ normal(s, 0.301)
logit_o = s * 0.542
o ~ bernoulli(1/(1 + exp(-logit_o)))
1@gen function generate_data(NObs, NBeliefPill, NEncouragement, NAssessment)
2 observational_model = @trace(generate_causal_model())
3 belief_pill_model = applyDoIntervention(observational_model, :b, 5)
4 encouragement_model = applyShiftIntervention(observational_model, :b, 3)
5 assessment_model = applyVarianceScalingIntervention(applyShiftIntervention(observational_model, :s, 2),
6 :b, 1/100)
8 observational_data = @trace(interpretMiniStan(observational_model, n_runs=NObs), :obs)
9 belief_pill_data = @trace(interpretMiniStan(belief_pill_model, n_runs=NBeliefPill), :belief_pill)
10 encouragement_data = @trace(interpretMiniStan(encouragement_model, n_runs=NEncouragement), :encouragement)
11 assessment_data = @trace(interpretMiniStan(assessment_model, n_runs=NAssessment), :assessment)
Figure 5: Gen implementation of causal inference via Bayesian synthesis. The generate_causal_model Gen program (a) encodes a prior distribution over MiniStan models; (b) shows three samples from this prior. The generate_data Gen program (c) encodes the likelihood: it samples a possible causal model from the prior (line 2), modifies it to obtain MiniStan code representing experimental conditions (lines 3-6), then simulates observational and experimental data by running the MiniStan programs (lines 8-11). The interpreter is itself a Gen probabilistic program.

In the Bayesian synthesis framework, a prior distribution over causal models is a stochastic procedure generating programs in a domain specific language (Figure 5). The grammar for our simple domain specific language, MiniStan, is presented in Figure 1.

3 Likelihoods for Experiments

To incorporate experimental evidence of various forms, the Bayesian synthesis approach requires an intervention library which consists of a set of code-editing functions that modify causal model programs in the domain specific language. For the conceptual example, our intervention library contains three interventions: (i) an atomic intervention, which applies the do-operator; (ii) a shift intervention, which changes the mean of a distribution by a fixed increment; and (iii) a variance-scaling intervention, which modifies the variance of a random variable assumed to be drawn from a normal distribution. In principle, an intervention library could contain arbitrary rules for modifying causal model source code, including changing the underlying distribution for a random variable or adding variables (latent or observed) that didn’t exist in the observational model.

These interventions can be freely composed to represent a diverse set of experimental scenarios. We demonstrate this compositionality in the “assessment” experiment, which is composed of a shift intervention (a student’s skill may improve if she has to take a test) and a variance-scaling intervention (a student’s belief in her ability has less noise after taking a test).
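This composition can be sketched in Python over MiniStan-like source text; the paper's implementation instead walks Julia expression trees, and the regex-based matching, helper names, and example model below are illustrative assumptions. Following Figure 3, "variance scaling" here multiplies the scale argument of a normal distribution by the given factor:

```python
import re

# Hedged sketch of shift and variance-scaling interventions as source-code
# edits on MiniStan-like text. The paper's implementation walks Julia ASTs;
# the regex matching and the example model here are illustrative.

def shift_mean(source, var, shift):
    """Rewrite `var ~ normal(mu, sigma)` so that mu is shifted by `shift`."""
    pattern = rf"^{var} ~ normal\((.+), (.+)\)$"
    repl = rf"{var} ~ normal((\1) + {shift}, \2)"
    return "\n".join(re.sub(pattern, repl, line) for line in source.splitlines())

def scale_variance(source, var, factor):
    """Rewrite `var ~ normal(mu, sigma)` so that sigma is scaled by `factor`."""
    pattern = rf"^{var} ~ normal\((.+), (.+)\)$"
    repl = rf"{var} ~ normal(\1, (\2) * {factor})"
    return "\n".join(re.sub(pattern, repl, line) for line in source.splitlines())

model = "s ~ normal(mu_s, sigma_s)\nb ~ normal(s, sigma_b)"

# The assessment experiment composes a shift on skill with a variance
# scaling on belief, as in the paper's conceptual example.
assessment = scale_variance(shift_mean(model, "s", 2), "b", 1 / 100)
print(assessment)
```

Because each intervention returns ordinary source text, any sequence of interventions composes by plain function composition.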

When interpreted, a causal program in MiniStan represents a likelihood function over observational data. To compute the likelihood of experimental data, we simply modify the causal program using the intervention library before interpreting the modified program.
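The likelihood view can be sketched as follows. The density below is hand-specialized to the running example's two-edge structure, and the parameter values and data records are illustrative assumptions, not the paper's interpreter:

```python
import math

# Hedged sketch: an interpreted causal program defines a likelihood over data.
# This density is hand-specialized to the example's two-edge model; parameter
# values and data records are illustrative, not the paper's interpreter.

def gauss_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_likelihood(row, params, do_b=None):
    """Log-density of one (s, b, o) record, optionally under do(b = value)."""
    s, b, o = row
    lp = gauss_logpdf(s, params["mu_s"], params["sigma_s"])
    if do_b is None:
        lp += gauss_logpdf(b, s, params["sigma_b"])   # observational model
    elif b != do_b:
        return float("-inf")   # under do(b), the record must match exactly
    logit = s * params["lambda_so"] + b * params["lambda_bo"]
    p = 1 / (1 + math.exp(-logit))
    return lp + math.log(p if o else 1 - p)

params = {"mu_s": 0.0, "sigma_s": 1.0, "sigma_b": 0.5,
          "lambda_so": 0.5, "lambda_bo": 0.8}

obs_lp = log_likelihood((0.1, 0.3, True), params)            # observational
exp_lp = log_likelihood((0.1, 5.0, True), params, do_b=5.0)  # belief pill
print(obs_lp, exp_lp)
```

Note that the intervened term simply drops out of the density, mirroring how the code edit removes the corresponding sampling statement.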

function applyDoIntervention(program, var, newValue)
    walk(program) do expr
        @match expr begin
            :($x = $val) && if x == var end => :($var = $newValue)
            :($x ~ $dist) && if x == var end => :($var = $newValue)
            _ => expr
        end
    end
end

function applyShiftIntervention(program, var, shiftValue)
    walk(program) do expr
        @match expr begin
            :($x ~ normal($mean, $std)) && if x == var end => :($x ~ normal($mean + $shiftValue, $std))
            :($x ~ uniform($a, $b)) && if x == var end => :($x ~ uniform($a + $shiftValue, $b + $shiftValue))
            :($x = $value) && if x == var end => :($x = $value + $shiftValue)
            _ => expr
        end
    end
end
Figure 6: Julia implementation of the atomic (“do”) intervention and the shift intervention. Rather than performing graph operations such as removing edges, the atomic intervention walks the program’s code and replaces any expression that assigns var with a new expression implementing the intervention (var = newValue). The shift intervention walks the program’s code and adds shiftValue to the mean argument of the normal distribution, to the lower and upper bound arguments of the uniform distribution, and to the value of any deterministic assignment.

4 Inference

We demonstrate the utility of this approach by performing approximate posterior inference over synthesized causal model programs from our conceptual example. In this example we: (i) generate a MiniStan program from the prior, (ii) generate a set of observational and experimental data from the interpreted MiniStan program, and (iii) perform approximate posterior inference over synthesized causal models using sequential Monte Carlo Doucet et al. (2000) with Metropolis-Hastings rejuvenation. We generated ten individuals’ skill, belief, and outcome for each of the four observational and experimental settings from a single ground-truth causal model in which the belief-to-outcome edge is present.

Figure 7: Posterior probability of the existence and strength of the causal dependence between a student’s belief and her subsequent outcome. The vertical gray line marks the actual value of lambda_bo.

Using only observational data, the posterior probability of the edge variable is low. This may be because the data can be explained by appealing to skill alone, and this simpler model can have a higher marginal likelihood than one that introduces a new parameter (lambda_bo). (This phenomenon is sometimes called “Bayesian Ockham’s razor.”) However, as we incorporate additional experimental evidence, the posterior probability of the edge increases. Similarly, the posterior distribution over lambda_bo, the effect of belief on outcome, concentrates around the true value as we leverage experimental evidence.
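The Bayesian Ockham’s razor effect can be reproduced with a small self-contained calculation; the Bernoulli model and data below are toy stand-ins, not the paper's experiment:

```python
import math

# Illustrative Bayesian Ockham's razor (a toy Bernoulli model, not the
# paper's): a model with an extra free parameter spreads its prior mass
# over that parameter, so its marginal likelihood can be lower even though
# its best-fit likelihood is at least as high.

def bernoulli_loglik(data, p):
    return sum(math.log(p if o else 1 - p) for o in data)

data = [True, False, True, True, False, True, False, True]  # 5 of 8 successes

# Simple model: success probability fixed at 0.5 (no free parameter).
log_ml_simple = bernoulli_loglik(data, 0.5)

# Complex model: success probability is a free parameter with a uniform
# prior, marginalized here on a midpoint grid.
grid = [(i + 0.5) / 100 for i in range(100)]
log_ml_complex = math.log(
    sum(math.exp(bernoulli_loglik(data, p)) for p in grid) / len(grid))

print(log_ml_simple, log_ml_complex)  # the simpler model scores higher here
```

With strongly informative (e.g., experimental) data the comparison reverses, which is the qualitative pattern seen in Figure 7.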

5 Discussion

The Bayesian synthesis approach we have outlined in this paper provides several advantages over alternative approaches to structure discovery and parameter estimation in causal modeling: (i) an explicit characterization of uncertainty over model structures; (ii) a principled way to model diverse interventions; and (iii) a formalization that can be re-used in diverse problems, with varying degrees of prior knowledge, without requiring practitioners to design custom inferences for each use case.

Although this example uses parametric causal models, it is conceptually straightforward to use Gaussian processes and/or Dirichlet process mixture models for the functional forms of causal relationships Saad et al. (2019). It may thus be fruitful to develop Bayesian variants of existing non-parametric techniques for causal inference Imbens (2004); Louizos et al. (2017).

The results reported here were obtained using vanilla sequential Monte Carlo over the joint space of model structure, parameters, and the latent variables in each observation or experiment. In order for this approach to scale to complex models, hierarchical priors over models, and large datasets, we expect more powerful techniques will be necessary. However, the Gen platform provides programmable inference constructs Cusumano-Towner et al. (2019), including hybrids of Hamiltonian Monte Carlo Duane et al. (1987) and Metropolis-Adjusted Langevin Roberts et al. (1996) approaches with sequential Monte Carlo Doucet et al. (2000), that could potentially address some of these scaling challenges.

6 Related Work

Probabilistic programs are often used to represent causal processes Goodman et al. (2012). Some languages, such as Omega Tavares et al. (2019), make this causal interpretation explicit, including a semantics for interventional and counterfactual reasoning. It would be interesting to consider whether the framework we present here, which considers interventions to be arbitrary code-editing procedures, could also be usefully applied to counterfactual reasoning problems.

Incorporating experimental evidence for structure learning and parameter estimation can be thought of as the inner loop of an optimal experimental design procedure. Probabilistic programs have been used to automate this search over experiments Ouyang et al. (2016), seeking to maximize the expected information gain over some query given new evidence. In that work, experiments are modeled as arguments to a probabilistic program. Our approach instead describes an experiment as a modification of MiniStan programs, enabling a clean abstraction between the specification of causal models (or distributions over causal models) and interventions that modify those models.

Improving methodology for combining observational and experimental evidence has far-reaching implications for a wide variety of scientific disciplines, and has received significant attention in the graph-based causal inference literature. For example, extensions of the do-calculus have been developed to incorporate experiments expressed as atomic interventions given a known causal graphical model structure Lee et al. (2019). Recent extensions of existing graph-based structure discovery algorithms have been made to incorporate atomic interventions Wang et al. (2017) and imperfect interventions Yang et al. (2018). Our work proposes characterizing imperfect interventions as code-editors acting on probabilistic programs; this representation enables us to perform posterior inference (with uncertainty estimates) over both structure and model parameters.


We thank Javier Burroni, Dan Garant, Zenna Tavares, and Reilly Grant for thoughtful discussion.


  • J. Aldrich (1989) Autonomy. Oxford Economic Papers 41 (1), pp. 15–34. Cited by: §1.
  • B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell (2017) Stan: a probabilistic programming language. Journal of statistical software 76 (1). Cited by: §1.
  • M. F. Cusumano-Towner, F. A. Saad, A. K. Lew, and V. K. Mansinghka (2019) Gen: a general-purpose probabilistic programming system with programmable inference. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, New York, NY, USA, pp. 221–236. External Links: ISBN 978-1-4503-6712-7, Link, Document Cited by: §1, §5.
  • A. Doucet, S. Godsill, and C. Andrieu (2000) On sequential monte carlo sampling methods for bayesian filtering. Statistics and computing 10 (3), pp. 197–208. Cited by: §4, §5.
  • S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987) Hybrid monte carlo. Physics letters B 195 (2), pp. 216–222. Cited by: §5.
  • F. Eberhardt and R. Scheines (2007) Interventions and causal inference. Philosophy of Science 74 (5), pp. 981–995. Cited by: §1.
  • N. Friedman and D. Koller (2000) Being bayesian about network structure. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 201–210. Cited by: §1.
  • N. Goodman, V. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum (2012) Church: a language for generative models. arXiv preprint arXiv:1206.3255. Cited by: §6.
  • T. L. Griffiths and J. B. Tenenbaum (2009) Theory-based causal induction.. Psychological review 116 (4), pp. 661. Cited by: §1.
  • T. Haavelmo (1944) The probability approach in econometrics. Econometrica: Journal of the Econometric Society, pp. iii–115. Cited by: §1.
  • D. Heckerman, D. Geiger, and D. M. Chickering (1995) Learning bayesian networks: the combination of knowledge and statistical data. Machine learning 20 (3), pp. 197–243. Cited by: §1.
  • G. W. Imbens (2004) Nonparametric estimation of average treatment effects under exogeneity: a review. Review of Economics and statistics 86 (1), pp. 4–29. Cited by: §5.
  • K. B. Korb, L. R. Hope, A. E. Nicholson, and K. Axnick (2004) Varieties of causal intervention. In Pacific Rim International Conference on Artificial Intelligence, pp. 322–331. Cited by: §1.
  • S. Lee, J. D. Correa, and E. Bareinboim (2019) General identifiability with arbitrary surrogate experiments. Cited by: §6.
  • C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling (2017) Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pp. 6446–6456. Cited by: §5.
  • V. Mansinghka, C. Kemp, J. Tenenbaum, and T. Griffiths (2006) Structured priors for structure learning.. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI 2006), Cited by: §1.
  • L. C. McCandless, P. Gustafson, and P. C. Austin (2009) Bayesian propensity score analysis for observational data. Statistics in medicine 28 (1), pp. 94–112. Cited by: §1.
  • K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT Press. Cited by: §1.
  • L. Ouyang, M. H. Tessler, D. Ly, and N. Goodman (2016) Practical optimal experiment design with probabilistic programs. arXiv preprint arXiv:1608.05046. Cited by: §6.
  • J. Pearl (2000) Causality: models, reasoning and inference. Vol. 29, Springer. Cited by: §1.
  • G. O. Roberts, R. L. Tweedie, et al. (1996) Exponential convergence of langevin distributions and their discrete approximations. Bernoulli 2 (4), pp. 341–363. Cited by: §5.
  • F. A. Saad, M. F. Cusumano-Towner, U. Schaechtle, M. C. Rinard, and V. K. Mansinghka (2019) Bayesian synthesis of probabilistic programs for automatic data modeling. Proceedings of the ACM on Programming Languages 3 (POPL), pp. 37. Cited by: §1, §5.
  • E. Sherman and I. Shpitser (2019) Intervening on network ties. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence. Cited by: §1.
  • Z. Tavares, X. Zhang, J. Koppel, and A. S. Lezama (2019) Soft constraints for inference with declarative knowledge. Cited by: §6.
  • Y. Wang, L. Solus, K. Yang, and C. Uhler (2017) Permutation-based causal inference algorithms with interventions. In Advances in Neural Information Processing Systems, pp. 5822–5831. Cited by: §6.
  • K. D. Yang, A. Katcoff, and C. Uhler (2018) Characterizing and learning equivalence classes of causal dags under interventions. arXiv preprint arXiv:1802.06310. Cited by: §6.