Difficult causal questions, such as ‘does eating meat cause cancer’ or ‘would increasing the minimum wage lead to a fall in employment’, are fundamental to decisions around how our society is structured and to our understanding of the world. The development of Causal Graphical Models (CGMs) and the do-calculus (Pearl, 1995, 2009) has given us an extremely rich and powerful framework with which to formalise and approach such questions. This framework is presented as fundamentally extra-statistical: Pearl has argued forcefully that (Bayesian) probability theory alone is not sufficient for solving causal problems (Pearl, 2001).
The notion that causality fundamentally requires new mathematics, and that causal questions cannot be solved within existing paradigms for probabilistic inference, has led to extensive controversy and debate, e.g. Gelman (2009, 2019). This debate has been particularly intense between proponents of causal modelling and Bayesian modellers. This is perhaps not surprising, since the Bayesian approach to combining assumptions with data is typically presented as sufficiently general to tackle any probabilistic inference problem (although computational constraints may make it impractical).
In this paper, we demonstrate how the assumptions encoded by causal graphical models can be represented with a probabilistic graphical model (PGM). The advantage of doing so is mostly conceptual: it allows Bayesian practitioners to represent and reason about the modelling assumptions required for causal inference in a framework with which they are familiar. However, there may also be practical benefits in cases where causal queries are not identifiable via the do-calculus. In such cases, without additional assumptions, it is fundamentally impossible to infer the exact outcome of an intervention, even given infinite pre-interventional data. Modelling such problems within a standard Bayesian inference setting allows us to leverage a vast body of existing research on combining assumptions with data to obtain finite sample estimates for distributions of interest. While the posterior distribution will always remain sensitive to the prior (unless we add assumptions about the functional form of the relationships between variables), we may still obtain useful bounds. The disadvantage of modelling causal questions explicitly as a single PGM is that it is more cumbersome and computationally expensive (unless we use the machinery of the do-calculus to identify appropriate re-parameterisations).
1.1 Representing a Causal Problem with a Probabilistic Graphical Model
In the following sections we show how a causal query can be represented with a PGM and how to do causal inference via this approach. For the necessary background on probabilistic and causal graphical models, we refer readers to the appendix.
To represent an intervention with an ordinary probabilistic graphical model, we must explicitly model the pre- and post-intervention systems and the relationship between them. Algorithm 1 constructs a probabilistic graphical model for a specific intervention in a causal graphical model.
Algorithm 1: CausalBayesConstruct
Input: Causal graph $G$ and intervention $do(T = t)$.
Output: Probabilistic graphical model representing this intervention.
1. Draw the original causal graph inside a plate indexed from $1$ to $M$ to represent the data generating process.
2. For each variable $X_i \in G$, parameterize $P(X_i \mid \mathrm{parents}(X_i))$ by adding a parameter $\theta_i$ with a link into $X_i$.
3. Draw the graph after the intervention by setting $T = t$ and removing all links into it. Rename each of the variables to distinguish them from the variables in the original graph, e.g. $X_i$ becomes $X_i^*$.
4. Connect the two graphs, linking $\theta_i$ to the corresponding variable $X_i^*$ in the post-interventional graph, for each $X_i$ excluding $T$.
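The construction can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the graph is assumed to be given as a dictionary from each variable to its parents, the node names `theta_X` (parameter of `X`) and `X*` (post-intervention copy of `X`) are hypothetical conventions, and the data plate is left implicit.

```python
def causal_bayes_construct(parents, treatment):
    """Return the edges of a PGM for an intervention on `treatment`.

    `parents` maps each variable to a list of its parents in the causal graph.
    Naming convention (an assumption of this sketch): "theta_X" is the
    parameter node of X and "X*" is the post-intervention copy of X.
    """
    edges = []
    for child, pa in parents.items():
        # Original graph (conceptually inside the data plate), with one
        # parameter node per variable.
        edges += [(p, child) for p in pa]
        edges.append((f"theta_{child}", child))
        # Post-intervention copy. The treatment is skipped: it keeps no
        # incoming links and is not attached to any parameter.
        if child != treatment:
            edges += [(f"{p}*", f"{child}*") for p in pa]
            edges.append((f"theta_{child}", f"{child}*"))
    return edges


# Example: a confounded treatment, Z -> T, Z -> Y, T -> Y.
graph = {"Z": [], "T": ["Z"], "Y": ["Z", "T"]}
pgm_edges = causal_bayes_construct(graph, "T")
```

In the resulting edge list, `theta_Z` and `theta_Y` each point into both copies of their variable, while `theta_T` points only into the pre-intervention `T`, which is how the shared-parameter assumption appears in the graph.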
A PGM constructed with Algorithm 1 represents exactly the same assumptions about a specific intervention as the corresponding CGM; see Figures 1 and 2 for an example. We have just explicitly created a joint model over the system pre- and post-intervention, which allows the direct application of standard statistical inference, rather than requiring additional notation and operations that map from one to the other, as the do-calculus does. The Bayesian model is specified by the parameterization of the conditional distribution of variables given their parents, and priors may be placed on the parameters $\theta$. The fact that the parameters are shared for all pairs of variables $(X_i, X_i^*)$ excluding $T$ captures the assumption that all that is changed by the intervention is the way $T$ takes its value; the conditional distributions for all other variables given their parents are invariant.
1.2 Causal Inference with Probabilistic Graphical Models
The result of Algorithm 1 is a probabilistic graphical model on which we can do inference with standard probability theory rather than the do-calculus, and which has properties such as arrow reversal (by the use of Bayes rule). To infer causal effects we compute a predictive distribution for the quantity of interest in the post-intervention graph using Bayes rule, integrating out all parameters, latent variables and any observed variables that are not of interest, for each setting of the treatment $t$.
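To make this procedure clearer, let $V$ be the set of variables in the original causal graph $G$, excluding the variable we intervene on, $T$, and $V^*$ be the corresponding variables in the post-interventional graph. We have:
- $\theta$: the set of model parameters,
- $\mathbf{V}$: a matrix of the observations of variables $V$, collected pre-intervention,
- $\mathbf{T}$: a vector of the observed values of the treatment variable $T$,
- $V^*$: the variables of the system post-intervention,
- $t$: the value that the intervened on variable is set to,
- $Y^* \in V^*$: the variable of interest post-intervention.
The goal is to infer the unobserved post-interventional distribution over $V^*$, given the observed data $\mathbf{V}$ and $\mathbf{T}$ and a selected treatment $t$. By construction, conditional on the parameters $\theta$, the post-interventional variables $V^*$ are independent of data collected pre-intervention, $(\mathbf{V}, \mathbf{T})$. The value of the intervention $t$ is set exogenously. (Also, $t$ has no marginal distribution: it is a constant set by the intervention, so it is independent of both $\theta$ and $(\mathbf{V}, \mathbf{T})$.) This ensures that the joint distribution over $(\theta, \mathbf{V}, \mathbf{T}, V^*)$ factorizes into three terms: a prior over the parameters $P(\theta)$, the likelihood for the original system $P(\mathbf{V}, \mathbf{T} \mid \theta)$, and a predictive distribution for the post-interventional variables given parameters and intervention $P(V^* \mid \theta, t)$:
$$P(\theta, \mathbf{V}, \mathbf{T}, V^* \mid t) = P(\theta)\, P(\mathbf{V}, \mathbf{T} \mid \theta)\, P(V^* \mid \theta, t).$$
We then marginalize out $\theta$,
$$P(\mathbf{V}, \mathbf{T}, V^* \mid t) = \int P(\theta)\, P(\mathbf{V}, \mathbf{T} \mid \theta)\, P(V^* \mid \theta, t)\, d\theta,$$
and condition on the observed data $(\mathbf{V}, \mathbf{T})$,
$$P(V^* \mid \mathbf{V}, \mathbf{T}, t) = \frac{P(\mathbf{V}, \mathbf{T}, V^* \mid t)}{P(\mathbf{V}, \mathbf{T})} = \int P(\theta \mid \mathbf{V}, \mathbf{T})\, P(V^* \mid \theta, t)\, d\theta.$$
Finally, if the goal is to infer mean treatment effects on a specific variable post-intervention $Y^*$, we can marginalize out the remaining variables in $V^*$,
$$P(Y^* \mid \mathbf{V}, \mathbf{T}, t) = \int P(V^* \mid \mathbf{V}, \mathbf{T}, t)\, d(V^* \setminus Y^*).$$
(We could also compute conditional treatment effects by first conditioning on selected variables in $V^*$.)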
If there are no latent variables in $G$, assuming positive density over the domain of $(V, T)$ and a well defined prior $P(\theta)$, the likelihood $P(\mathbf{V}, \mathbf{T} \mid \theta)$ will dominate, and the posterior over the parameters $P(\theta \mid \mathbf{V}, \mathbf{T})$ will become independent of the prior at the infinite data limit. The term $P(V^* \mid \theta, t)$ can be expanded into a product of terms of the form $P(X_i^* \mid \mathrm{parents}(X_i^*), \theta_i)$ following the factorization implied by the post-interventional graph. From step (3) of Algorithm 1 each of these terms is equal to the corresponding term $P(X_i \mid \mathrm{parents}(X_i), \theta_i)$, giving results equivalent to Pearl’s truncated product formula (Pearl, 2009). Authors (2019) demonstrate the equivalence of this approach with the do-calculus on a number of worked examples.
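To make the procedure concrete, the following sketch applies it to a small fully observed binary system with a confounder, $Z \rightarrow T$, $Z \rightarrow Y$, $T \rightarrow Y$. Everything specific here is an assumption made for illustration: the ground-truth probabilities, the independent Beta(1, 1) priors on each conditional probability, and the sample size. With conjugate priors and complete data the posterior factorizes over parameters, so the predictive distribution for $Y^*$ reduces to a sum over $Z^*$ of products of posterior means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth for the binary system Z -> T, Z -> Y, T -> Y.
p_z = 0.4                                  # P(Z=1)
p_t = np.array([0.2, 0.8])                 # P(T=1 | Z=z)
p_y = np.array([[0.1, 0.5], [0.4, 0.9]])   # P(Y=1 | T=t, Z=z), indexed [t, z]

# Pre-intervention (observational) data.
n = 50_000
z = rng.binomial(1, p_z, n)
t = rng.binomial(1, p_t[z])
y = rng.binomial(1, p_y[t, z])


def beta_mean(successes, trials):
    """Posterior mean of a Bernoulli parameter under a Beta(1, 1) prior."""
    return (successes + 1) / (trials + 2)


def predictive_y_star(t_star):
    """P(Y*=1 | data, t*): integrate out the parameters and Z*.

    The parameters of Z and of Y given (T, Z) are shared between the pre- and
    post-intervention graphs; the parameters of T given Z play no role.
    """
    pz1 = beta_mean(z.sum(), n)
    total = 0.0
    for z_val, pz in ((0, 1 - pz1), (1, pz1)):
        cell = (t == t_star) & (z == z_val)
        total += pz * beta_mean(y[cell].sum(), cell.sum())
    return total


bayes_effect = predictive_y_star(1) - predictive_y_star(0)
true_effect = sum(w * (p_y[1, k] - p_y[0, k]) for k, w in enumerate((1 - p_z, p_z)))
naive_effect = y[t == 1].mean() - y[t == 0].mean()
```

Under these assumptions `bayes_effect` should lie close to `true_effect`, whereas `naive_effect`, which conditions on the observed treatment instead of intervening on it, is biased by the confounder.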
The paper shows that it is possible to arrive at the same solution for causal problems using both the do-calculus and Bayesian theory. The key insight required for the Bayesian formulation is that the probabilistic graphical model must jointly model both the pre-intervention and post-intervention worlds. Our conclusion is similar to that of Lindley et al. (1981); however, we provide an explicit mechanism by which we can encode the assumptions implied by a causal graphical model, formalising the notion of exchangeability in this context.
- Authors (2019). Replacing the do-calculus with Bayes rule. arXiv preprint arXiv:1906.07125.
- Dawid, A. P. (2015). Statistical causality from a decision-theoretic perspective. Annual Review of Statistics and Its Application, 2:273–303.
- Gelman, A. (2009). Resolving disputes between J. Pearl and D. Rubin on causal inference. https://statmodeling.stat.columbia.edu/2009/07/05/disputes_about/.
- Gelman, A. (2019). “The Book of Why” by Pearl and Mackenzie. https://statmodeling.stat.columbia.edu/2019/01/08/book-pearl-mackenzie/.
- Jordan, M. I. (2004). Graphical models. Statistical Science, 19(1):140–155.
- Lindley, D. V., Novick, M. R., et al. (1981). The role of exchangeability in inference. The Annals of Statistics, 9(1):45–58.
- Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–688.
- Pearl, J. (2001). Bayesianism and causality, or, why I am only a half-Bayesian. In Foundations of Bayesianism, pages 19–36. Springer.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press, New York.
- Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
3.1 Background on Probabilistic and Causal Graphical Models
Probabilistic graphical models (PGMs) combine graph theory with probability theory in order to develop new algorithms and to present models in an intuitive framework (Jordan, 2004). A probabilistic graphical model is a directed acyclic graph over variables, which represents how the joint distribution over these variables may be factorized. In particular, any missing edge in the graph must correspond to a conditional independence relation in the joint distribution. There are multiple valid probabilistic graphical model representations for a given joint distribution. For example, any joint distribution over two variables $P(X, Y)$ may be represented by both $X \rightarrow Y$ and $X \leftarrow Y$.
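As a small numerical check of the two-variable example, the sketch below factorizes an arbitrary joint distribution in both directions and recovers the same table each time. The particular probabilities are hypothetical.

```python
import numpy as np

# Hypothetical joint distribution over two binary variables, indexed [x, y].
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

# Factorization matching the graph X -> Y: P(X) P(Y | X).
p_x = joint.sum(axis=1)
p_y_given_x = joint / p_x[:, None]
joint_from_x_to_y = p_x[:, None] * p_y_given_x

# Factorization matching the graph Y -> X: P(Y) P(X | Y).
p_y = joint.sum(axis=0)
p_x_given_y = joint / p_y[None, :]
joint_from_y_to_x = p_y[None, :] * p_x_given_y
```

Both products reproduce `joint`, so as probabilistic models the two graphs are interchangeable; they differ only once a causal reading is attached to the arrow.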
A causal graphical model (CGM) is a probabilistic graphical model, with the additional assumption that a link $X \rightarrow Y$ means $X$ causes $Y$. Think of the data generating process for a CGM as sampling data first for the exogenous variables (those with no parents in the graph), and then in subsequent steps sampling values for the children of previously sampled nodes. An atomic intervention in such a system that sets the value of a specific variable $T$ to a fixed constant corresponds to removing all links into $T$, as it is now set exogenously rather than determined by its previous causes. It is assumed that everything else in the system remains unchanged, in particular the functions or conditional distributions that determine the value of a variable given its parents in the graph. In this way, a CGM encodes more than the factorization (or conditional independence structure) of the joint distribution over its variables; it additionally specifies how the system responds to atomic interventions.
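The sampling view of a CGM and the effect of an atomic intervention can be sketched as follows. The three mechanisms and their probabilities are hypothetical, and `sample` is an illustrative helper rather than part of any library.

```python
import random


def sample(mechanisms, order, do=None, rng=random):
    """Ancestral sampling: visit variables parents-first.

    `do` maps intervened variables to fixed values. An intervened variable
    ignores its mechanism, which is equivalent to removing all links into it;
    every other mechanism is left unchanged.
    """
    do = do or {}
    values = {}
    for name in order:
        values[name] = do[name] if name in do else mechanisms[name](values, rng)
    return values


# Hypothetical mechanisms for the graph Z -> T, Z -> Y, T -> Y.
mechanisms = {
    "Z": lambda v, r: int(r.random() < 0.4),
    "T": lambda v, r: int(r.random() < (0.8 if v["Z"] else 0.2)),
    "Y": lambda v, r: int(r.random() < 0.1 + 0.3 * v["T"] + 0.4 * v["Z"]),
}
order = ["Z", "T", "Y"]

rng = random.Random(0)
observational = [sample(mechanisms, order, rng=rng) for _ in range(20_000)]
interventional = [sample(mechanisms, order, do={"T": 1}, rng=rng) for _ in range(20_000)]

# Compare conditioning on T=1 with intervening to set T=1.
treated = [s["Y"] for s in observational if s["T"] == 1]
p_y_given_t1 = sum(treated) / len(treated)
p_y_do_t1 = sum(s["Y"] for s in interventional) / len(interventional)
```

Because $Z$ influences both $T$ and $Y$ in this example, `p_y_given_t1` and `p_y_do_t1` differ: conditioning selects units that tend to have $Z = 1$, while the intervention leaves the distribution of $Z$ untouched.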
A CGM describes how the structure of a system is modified by an intervention. However, answering causal queries such as "what would the distribution of cancer look like if we were able to prevent smoking?" requires inference about the distributions of variables in the post-interventional system. The do-notation is a short-hand for describing the distribution of variables post-intervention, and the do-calculus is a set of rules for identifying which (conditional) distributions are equivalent pre- and post-intervention. If it is possible to derive an expression for the desired post-interventional distribution purely in terms of the joint distribution over the original system via the do-calculus, then the causal query is identifiable, meaning that, assuming positive density and infinite data, we obtain a point estimate for it.
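Let $G$ be a CGM, $G_{\overline{T}}$ represent $G$ post-intervention (i.e. with all links into $T$ removed) and $G_{\underline{T}}$ represent $G$ with all links out of $T$ removed. Let $do(t)$ represent intervening to set a single variable $T$ to $t$,
Rule 1: $P(y \mid do(t), z, w) = P(y \mid do(t), z)$ if $Y \perp W \mid (Z, T)$ in $G_{\overline{T}}$,
Rule 2: $P(y \mid do(t), z) = P(y \mid t, z)$ if $Y \perp T \mid Z$ in $G_{\underline{T}}$,
Rule 3: $P(y \mid do(t), z) = P(y \mid z)$ if $Y \perp T \mid Z$ in $G_{\overline{T}}$, and $Z$ is not a descendant of $T$.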
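The graphical conditions of the do-calculus can be checked mechanically. The sketch below is an illustration under stated assumptions, not part of the paper: it reads the independence statements as d-separation, tests d-separation with the standard ancestral moral graph construction, and applies it to the condition for exchanging an action with an observation, $P(y \mid do(t), z) = P(y \mid t, z)$, which requires $Y$ and $T$ to be independent given $Z$ in the graph with all links out of $T$ removed. The function names and the example graph are hypothetical.

```python
def d_separated(parents, x, y, given):
    """Return True if x and y are d-separated by `given` in the DAG `parents`
    (a dict mapping each node to a list of its parents)."""
    # 1. Keep only x, y, the conditioning set and all of their ancestors.
    keep, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents[node])
    # 2. Moralize: undirected links between each child and its parents,
    #    and between every pair of parents of a common child.
    nbrs = {node: set() for node in keep}
    for child in keep:
        family = [child, *parents[child]]
        for a in family:
            for b in family:
                if a != b:
                    nbrs[a].add(b)
    # 3. x and y are d-separated iff removing `given` disconnects them.
    seen, stack = set(given), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node not in seen:
            seen.add(node)
            stack.extend(nbrs[node] - seen)
    return True


def remove_links_out_of(parents, node):
    """Graph surgery: drop every edge that starts at `node`."""
    return {child: [p for p in pa if p != node] for child, pa in parents.items()}


# Hypothetical example: Z -> T, Z -> Y, T -> Y.
graph = {"Z": [], "T": ["Z"], "Y": ["Z", "T"]}
g_links_out_removed = remove_links_out_of(graph, "T")

adjusting_for_z = d_separated(g_links_out_removed, "Y", "T", {"Z"})
no_adjustment = d_separated(g_links_out_removed, "Y", "T", set())
```

In this example the condition holds when conditioning on $Z$ and fails without it, which matches the usual intuition that the confounder must be adjusted for before the observed conditional can stand in for the interventional one.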