Causal inference with Bayes rule

by Finnian Lattimore, et al.

The concept of causality has a controversial history. The question of whether it is possible to represent and address causal problems with probability theory, or whether fundamentally new mathematics such as the do-calculus is required, has been hotly debated. In this paper we demonstrate that, while it is critical to explicitly model our assumptions on the impact of intervening in a system, provided we do so, estimating causal effects can be done entirely within the standard Bayesian paradigm. The invariance assumptions underlying causal graphical models can be encoded in ordinary probabilistic graphical models, allowing causal estimation with Bayesian statistics, equivalent to the do-calculus.





1 Introduction

Difficult causal questions, such as ‘does eating meat cause cancer?’ or ‘would increasing the minimum wage lead to a fall in employment?’, are fundamental to decisions around how our society is structured and to our understanding of the world. The development of Causal Graphical Models (CGMs) and the do-calculus Pearl (1995, 2009) has given us an extremely rich and powerful framework with which to formalise and approach such questions. This framework is presented as fundamentally extra-statistical: Pearl has argued forcefully that (Bayesian) probability theory alone is not sufficient for solving causal problems Pearl (2001).

The notion that causality fundamentally requires new mathematics, and that causal questions cannot be solved within existing paradigms for probabilistic inference, has led to extensive controversy and debate, e.g. Gelman (2009, 2019). This debate has been particularly intense between proponents of causal modelling and Bayesian modellers. This is perhaps not surprising, since the Bayesian approach to combining assumptions with data is typically presented as sufficiently general to tackle any probabilistic inference problem (although computational constraints may make it impractical).

In this paper, we demonstrate how the assumptions encoded by causal graphical models can be represented with a probabilistic graphical model (PGM). The advantage of doing so is mostly conceptual: it allows Bayesian practitioners to represent and reason about the modelling assumptions required for causal inference in a framework with which they are familiar. However, there may also be practical benefits in cases where causal queries are not identifiable via the do-calculus. In such cases, it is fundamentally impossible to infer the exact outcome of an intervention without additional assumptions, even given infinite pre-interventional data. Modelling such problems within a standard Bayesian inference setting allows us to leverage a vast body of existing research on combining assumptions with data to obtain finite sample estimates for distributions of interest. While the posterior distribution will always remain sensitive to the prior (unless we add assumptions about the functional form of the relationships between variables), we may still obtain useful bounds. The disadvantage of modelling causal questions explicitly as a single PGM is that it is more cumbersome and computationally expensive (unless we use the machinery of the do-calculus to identify appropriate re-parameterisations).

1.1 Representing a Causal Problem with a Probabilistic graphical model

In the following sections we show how a causal query can be represented with a PGM and how to do causal inference via this approach. For the necessary background on probabilistic and causal graphical models, we refer readers to the appendix.

To represent an intervention with an ordinary probabilistic graphical model, we must explicitly model the pre- and post-intervention systems and the relationship between them. Algorithm 1 constructs a probabilistic graphical model for a specific intervention in a causal graphical model.

Algorithm 1: CausalBayesConstruct

Input: Causal graph $G$ and intervention $do(T = t)$.  
Output: Probabilistic graphical model representing this intervention

  1. Draw the original causal graph inside a plate indexed from $1, \ldots, M$ to represent the data generating process.

  2. For each variable $V_i \in G$, parameterize $P(V_i \mid \mathrm{parents}(V_i))$ by adding a parameter $\theta_i$ with a link into $V_i$.

  3. Draw the graph after the intervention by setting $T = t$ and removing all links into it. Rename each of the variables to distinguish them from the variables in the original graph, e.g. $X$ becomes $X^*$.

  4. Connect the two graphs by linking $\theta_i$ to the corresponding variable $V_i^*$ in the post-interventional graph, for each $i$ excluding $T$.
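The four steps above can be sketched in code. The following is a minimal illustration rather than the authors' implementation: the function name, the `theta_X` naming for parameter nodes, the `X*` naming for post-intervention copies, and the example graph (Z -> T, Z -> Y, T -> Y) are all assumptions made for this sketch. The plate over observations is only noted in a comment, since it does not change the edge structure.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def causal_bayes_construct(parents: Dict[str, List[str]], treatment: str) -> Set[Edge]:
    """Return the edges of the combined pre/post-intervention PGM.

    `parents` maps each variable of the causal graph to its list of parents.
    Naming is an assumption of this sketch: `theta_X` is the parameter node
    for variable X and `X*` is the post-intervention copy of X.
    """
    if treatment not in parents:
        raise ValueError(f"unknown treatment variable: {treatment}")

    edges: Set[Edge] = set()
    for var, pars in parents.items():
        # Step 1: the original causal graph (conceptually inside a plate
        # over the observed data points).
        for p in pars:
            edges.add((p, var))
        # Step 2: one parameter node per conditional P(var | parents(var)).
        edges.add((f"theta_{var}", var))

        if var == treatment:
            # Step 3: the intervened variable keeps no incoming links, and
            # step 4 excludes it from parameter sharing.
            continue
        # Step 3: copy the remaining links between the renamed variables.
        for p in pars:
            edges.add((f"{p}*", f"{var}*"))
        # Step 4: share the parameter with the post-intervention copy.
        edges.add((f"theta_{var}", f"{var}*"))
    return edges


# Hypothetical confounded graph: Z -> T, Z -> Y, T -> Y.
graph = {"Z": [], "T": ["Z"], "Y": ["Z", "T"]}
pgm_edges = causal_bayes_construct(graph, treatment="T")
for edge in sorted(pgm_edges):
    print(edge)
```

In the output, `theta_Z` and `theta_Y` each point into both worlds, while `theta_T` points only into the pre-intervention `T`, which is how the sketch encodes that only the mechanism for the treatment changes.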

Figure 1: A CGM of Case 1: Left observational, Right: mutilated


Figure 2: A PGM of Case 1

A PGM constructed with Algorithm 1 represents exactly the same assumptions about a specific intervention as the corresponding CGM; see Figures 1 and 2 for an example. We have simply created an explicit joint model over the system pre- and post-intervention, which allows the direct application of standard statistical inference, rather than requiring additional notation and operations that map from one to the other, as the do-calculus does. The Bayesian model is specified by the parameterization of the conditional distribution of each variable given its parents, and priors may be placed on the parameters $\theta_i$. The fact that the parameters are shared for all pairs of variables $(V_i, V_i^*)$, excluding $T$, captures the assumption that all that is changed by the intervention is the way $T$ takes its value: the conditional distributions for all other variables given their parents are invariant.

Despite its simplicity, we are unaware of a direct statement of Algorithm 1. It is related to twin networks Pearl (2009) and augmented directed acyclic graphs Dawid (2015), but is distinct from both.

1.2 Causal Inference with Probabilistic graphical models

The result of Algorithm 1 is a probabilistic graphical model on which we can do inference with standard probability theory rather than the do-calculus, and which has properties such as arrow reversal (by the use of Bayes rule). To infer causal effects we compute a predictive distribution for the quantity of interest in the post-intervention graph using Bayes rule, integrating out all parameters, latent variables and any observed variables that are not of interest, for each setting of the treatment $t$.

To make this procedure clearer, let $V$ be the set of variables in the original causal graph $G$, excluding the variable we intervene on, $T$, and let $V^*$ be the corresponding variables in the post-interventional graph. We have:

  • $\theta$: the set of model parameters,
  • $\mathbf{V}$: a matrix of the observations of the variables $V$, collected pre-intervention,
  • $\mathbf{T}$: a vector of the observed values of the treatment variable $T$, with $(\mathbf{V}, \mathbf{T}) \sim P(V, T \mid \theta)$,
  • $V^*$: the variables of the system post-intervention,
  • $t$: the value that the intervened-on variable is set to,
  • $Y^* \in V^*$: the variable of interest post-intervention.

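The goal is to infer the unobserved post-interventional distribution over $Y^*$, given the observed data $\mathbf{V}$ and $\mathbf{T}$ and a selected treatment $t$. By construction, conditional on the parameters $\theta$, the post-interventional variables $V^*$ are independent of the data collected pre-intervention, $(\mathbf{V}, \mathbf{T})$. The value of the intervention $t$ is set exogenously. (Footnote 1: $t$ also has no marginal distribution, as it is a constant set by the intervention, so it is independent of both $\theta$ and $(\mathbf{V}, \mathbf{T})$.) This ensures that the joint distribution over $(\mathbf{V}, \mathbf{T}, V^*, \theta)$ factorizes into three terms: a prior over the parameters $P(\theta)$, the likelihood for the original system $P(\mathbf{V}, \mathbf{T} \mid \theta)$, and a predictive distribution for the post-interventional variables given parameters and intervention $P(V^* \mid \theta, t)$:

$$P(\theta, \mathbf{V}, \mathbf{T}, V^* \mid t) = P(\theta)\, P(\mathbf{V}, \mathbf{T} \mid \theta)\, P(V^* \mid \theta, t).$$

We then marginalize out $\theta$,

$$P(\mathbf{V}, \mathbf{T}, V^* \mid t) = \int P(\theta)\, P(\mathbf{V}, \mathbf{T} \mid \theta)\, P(V^* \mid \theta, t)\, d\theta,$$

and condition on the observed data $(\mathbf{V}, \mathbf{T})$,

$$P(V^* \mid \mathbf{V}, \mathbf{T}, t) = \frac{\int P(\theta)\, P(\mathbf{V}, \mathbf{T} \mid \theta)\, P(V^* \mid \theta, t)\, d\theta}{\int P(\theta)\, P(\mathbf{V}, \mathbf{T} \mid \theta)\, d\theta} = \int P(V^* \mid \theta, t)\, P(\theta \mid \mathbf{V}, \mathbf{T})\, d\theta.$$

Finally, if the goal is to infer mean treatment effects on a specific variable post-intervention, $Y^*$, we can marginalize out the remaining variables in $V^*$. (Footnote 2: we could also compute conditional treatment effects by first conditioning on selected variables in $V^*$.)

$$P(Y^* \mid \mathbf{V}, \mathbf{T}, t) = \int_{V^* \setminus Y^*} \int P(V^* \mid \theta, t)\, P(\theta \mid \mathbf{V}, \mathbf{T})\, d\theta\; d(V^* \setminus Y^*).$$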

If there are no latent variables in $G$, then, assuming positive density over the domain of $(V, T)$ and a well defined prior $P(\theta)$, the likelihood $P(\mathbf{V}, \mathbf{T} \mid \theta)$ will dominate, and the posterior over the parameters will become independent of the prior in the infinite data limit. The term $P(V^* \mid \theta, t)$ can be expanded into a product of terms of the form $P(V_i^* \mid \mathrm{parents}(V_i^*), \theta_i)$, following the factorization implied by the post-interventional graph. From step (3) of Algorithm 1, each of these terms is equal to the corresponding term $P(V_i \mid \mathrm{parents}(V_i), \theta_i)$, giving results equivalent to Pearl’s truncated product formula Pearl (2009). Authors (2019) demonstrate the equivalence of this approach with the do-calculus on a number of worked examples.
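As a concrete sketch of this procedure, consider a hypothetical fully observed binary model with a confounder Z, treatment T and outcome Y (graph Z -> T, Z -> Y, T -> Y). The example assumes independent Beta(1, 1) priors on P(Z = 1) and on each P(Y = 1 | t, z); these priors, the toy data and the function names are illustrative choices, not taken from the paper. Because the priors are independent and the likelihood factorizes, the posterior over the parameters factorizes too, so the post-intervention predictive reduces to a sum over z of products of posterior means.

```python
from typing import List, Tuple

Row = Tuple[int, int, int]  # one observation (z, t, y), all binary


def beta_posterior_mean(successes: int, trials: int, a: float = 1.0, b: float = 1.0) -> float:
    """Posterior mean of a Bernoulli parameter under a Beta(a, b) prior."""
    return (successes + a) / (trials + a + b)


def post_intervention_mean(data: List[Row], t: int) -> float:
    """P(Y* = 1 | data, t) for the assumed graph Z -> T, Z -> Y, T -> Y.

    The parameter of P(T | Z) is never used: the intervention removes the
    link Z -> T, so that parameter has no child in the post-intervention graph.
    """
    n = len(data)
    n_z1 = sum(z for z, _, _ in data)
    p_z1 = beta_posterior_mean(n_z1, n)  # E[P(Z = 1) | data]

    total = 0.0
    for z, p_z in ((0, 1.0 - p_z1), (1, p_z1)):
        ys = [y for (zi, ti, y) in data if zi == z and ti == t]
        p_y1 = beta_posterior_mean(sum(ys), len(ys))  # E[P(Y = 1 | t, z) | data]
        total += p_z * p_y1
    return total


# Toy pre-intervention data, made up for this illustration.
data: List[Row] = (
    [(0, 0, 1)] + [(0, 0, 0)] * 3
    + [(0, 1, 1)] + [(0, 1, 0)]
    + [(1, 0, 1)] + [(1, 0, 0)]
    + [(1, 1, 1)] * 3 + [(1, 1, 0)]
)

p_treated = post_intervention_mean(data, t=1)
p_control = post_intervention_mean(data, t=0)
print("P(Y*=1 | data, t=1) =", p_treated)
print("P(Y*=1 | data, t=0) =", p_control)
print("mean treatment effect =", p_treated - p_control)
```

With more data the Beta posterior means approach the empirical frequencies, and the sum approaches the adjustment formula that the truncated product gives for this graph.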

2 Conclusion

This paper shows that it is possible to arrive at the same solution for causal problems using either the do-calculus or Bayesian theory. The key insight required for the Bayesian formulation is that the probabilistic graphical model must jointly model both the pre-intervention and post-intervention worlds. Our conclusion is similar to that of Lindley et al. (1981); however, we provide an explicit mechanism by which we can encode the assumptions implied by a causal graphical model, formalising the notion of exchangeability in this context.


  • Authors (2019). Replacing the do-calculus with Bayes rule. arXiv preprint arXiv:1906.07125.
  • Dawid, A. P. (2015). Statistical causality from a decision-theoretic perspective. Annual Review of Statistics and Its Application, 2:273–303.
  • Gelman, A. (2009). Resolving disputes between J. Pearl and D. Rubin on causal inference.
  • Gelman, A. (2019). “The Book of Why” by Pearl and Mackenzie.
  • Jordan, M. I. (2004). Graphical models. Statistical Science, 19(1):140–155.
  • Lindley, D. V., Novick, M. R., et al. (1981). The role of exchangeability in inference. The Annals of Statistics, 9(1):45–58.
  • Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–688.
  • Pearl, J. (2001). Bayesianism and causality, or, why I am only a half-Bayesian. In Foundations of Bayesianism, pages 19–36. Springer.
  • Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press, New York.
  • Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.

3 Appendix

3.1 Background on Probabilistic and Causal graphical models

Probabilistic graphical models (PGMs) combine graph theory with probability theory in order to develop new algorithms and to present models in an intuitive framework Jordan (2004). A probabilistic graphical model is a directed acyclic graph over variables, which represents how the joint distribution over these variables may be factorized. In particular, any missing edge in the graph must correspond to a conditional independence relation in the joint distribution. There are multiple valid probabilistic graphical model representations for a given joint distribution. For example, any joint distribution over two variables $(X, Y)$ may be represented by both $X \to Y$ and $Y \to X$.
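The two-variable case can be checked numerically. The sketch below uses a made-up joint distribution over two binary variables, named X and Y here, and confirms that the factorization P(X) P(Y | X) (arrow from X to Y) and the factorization P(Y) P(X | Y) (arrow reversed via Bayes rule) both reproduce the same joint.

```python
from itertools import product

# Hypothetical joint distribution P(X = x, Y = y), keyed by (x, y).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
states = list(product((0, 1), repeat=2))

# Marginals P(X) and P(Y).
p_x = {x: sum(joint[(x, y)] for y in (0, 1)) for x in (0, 1)}
p_y = {y: sum(joint[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Conditional for the graph X -> Y: P(Y = y | X = x), keyed by (y, x).
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x, y in states}

# Conditional for the graph Y -> X, obtained by Bayes rule:
# P(X = x | Y = y) = P(Y = y | X = x) P(X = x) / P(Y = y), keyed by (x, y).
p_x_given_y = {(x, y): p_y_given_x[(y, x)] * p_x[x] / p_y[y] for x, y in states}

# Rebuild the joint from each factorization.
forward = {(x, y): p_x[x] * p_y_given_x[(y, x)] for x, y in states}
reverse = {(x, y): p_y[y] * p_x_given_y[(x, y)] for x, y in states}

for s in states:
    print(s, joint[s], forward[s], reverse[s])
```

Both graphs describe the same distribution. They differ only once the arrows are given a causal reading.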

A causal graphical model (CGM) is a probabilistic graphical model, with the additional assumption that a link $X \to Y$ means $X$ causes $Y$. Think of the data generating process for a CGM as sampling data first for the exogenous variables (those with no parents in the graph), and then in subsequent steps sampling values for the children of previously sampled nodes. An atomic intervention in such a system that sets the value of a specific variable $T$ to a fixed constant corresponds to removing all links into $T$, as it is now set exogenously rather than determined by its previous causes. It is assumed that everything else in the system remains unchanged, in particular the functions or conditional distributions that determine the value of a variable given its parents in the graph. In this way, a CGM encodes more than the factorization (or conditional independence structure) of the joint distribution over its variables; it additionally specifies how the system responds to atomic interventions.
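The sampling view of a CGM, and what an atomic intervention does to it, can be sketched as follows. The graph (Z -> T, Z -> Y, T -> Y), the Bernoulli mechanisms and their probabilities are hypothetical choices for illustration. The only difference between observational and interventional sampling is that the intervened variable is assigned its constant instead of being drawn from its mechanism; every other mechanism is reused unchanged.

```python
import random
from typing import Callable, Dict, List, Optional

Values = Dict[str, int]
Mechanism = Callable[[Values, random.Random], int]


def bernoulli(p: float, rng: random.Random) -> int:
    """Draw 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


# Hypothetical mechanisms: each variable is sampled given its parents' values.
MECHANISMS: Dict[str, Mechanism] = {
    "Z": lambda v, rng: bernoulli(0.5, rng),
    "T": lambda v, rng: bernoulli(0.8 if v["Z"] else 0.2, rng),
    "Y": lambda v, rng: bernoulli(0.1 + 0.3 * v["T"] + 0.4 * v["Z"], rng),
}
ORDER: List[str] = ["Z", "T", "Y"]  # parents always precede their children


def sample(rng: random.Random, do: Optional[Values] = None) -> Values:
    """Ancestral sampling; variables listed in `do` are set exogenously."""
    do = do or {}
    values: Values = {}
    for var in ORDER:
        if var in do:
            # Atomic intervention: ignore the parents, i.e. cut incoming links.
            values[var] = do[var]
        else:
            values[var] = MECHANISMS[var](values, rng)
    return values


rng = random.Random(0)
observational = [sample(rng) for _ in range(1000)]
interventional = [sample(rng, do={"T": 1}) for _ in range(1000)]

print("observed mean of Y given T=1:",
      sum(s["Y"] for s in observational if s["T"] == 1)
      / max(1, sum(s["T"] for s in observational)))
print("mean of Y under do(T=1):",
      sum(s["Y"] for s in interventional) / len(interventional))
```

Because Z influences both T and Y in this sketch, the two printed quantities generally differ, which is the gap between conditioning and intervening that the rest of the appendix formalises.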

A CGM describes how the structure of a system is modified by an intervention. However, answering causal queries such as "what would the distribution of cancer look like if we were able to prevent smoking?" requires inference about the distributions of variables in the post-interventional system. The do-notation is a short-hand for describing the distribution of variables post-intervention, and the do-calculus is a set of rules for identifying which (conditional) distributions are equivalent pre- and post-intervention. If it is possible to derive an expression for the desired post-interventional distribution purely in terms of the joint distribution over the original system via the do-calculus, then the causal query is identifiable, meaning that, assuming positive density and infinite data, we obtain a point estimate for it.

Here we present the do-calculus in a simplified form that applies to interventions on single variables Pearl (1995, 2009); Peters et al. (2017).

The do-calculus

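Let $G$ be a CGM, let $G_{\overline{T}}$ represent $G$ post-intervention (i.e. with all links into $T$ removed) and let $G_{\underline{T}}$ represent $G$ with all links out of $T$ removed. Let $do(t)$ represent intervening to set a single variable $T$ to $t$.

Rule 1: $P(y \mid do(t), z, w) = P(y \mid do(t), z)$ if $Y \perp W \mid (Z, T)$ in $G_{\overline{T}}$.

Rule 2: $P(y \mid do(t), z) = P(y \mid t, z)$ if $Y \perp T \mid Z$ in $G_{\underline{T}}$.

Rule 3: $P(y \mid do(t), z) = P(y \mid z)$ if $Y \perp T \mid Z$ in $G_{\overline{T}}$, and $Z$ is not a descendant of $T$.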
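As a sketch of how these rules are applied, assume a hypothetical graph $G$ with links $Z \to T$, $Z \to Y$ and $T \to Y$ (chosen for illustration; it is not necessarily the graph in Figure 1). Write $do(t)$ for the intervention setting $T$ to $t$, $G_{\overline{T}}$ for $G$ with all links into $T$ removed, and $G_{\underline{T}}$ for $G$ with all links out of $T$ removed. Rule 2 is applied to the outcome $Y$ with conditioning set $Z$. Rule 3 is applied with $Z$ in the role of the outcome and an empty conditioning set.

```latex
\begin{align*}
P(y \mid do(t))
  &= \sum_z P(y \mid do(t), z)\, P(z \mid do(t))
  && \text{law of total probability} \\
  &= \sum_z P(y \mid t, z)\, P(z \mid do(t))
  && \text{Rule 2: } Y \perp T \mid Z \text{ in } G_{\underline{T}} \\
  &= \sum_z P(y \mid t, z)\, P(z)
  && \text{Rule 3: } Z \perp T \text{ in } G_{\overline{T}}
\end{align*}
```

The independence for Rule 2 holds because, once the link $T \to Y$ is removed, the only path between $T$ and $Y$ runs through $Z$ and is blocked by conditioning on it. The independence for Rule 3 holds because, once the link $Z \to T$ is removed, the only path between $Z$ and $T$ passes through the collider $Y$. The final expression involves only pre-interventional quantities, so the query is identifiable in this assumed graph.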