Replacing the do-calculus with Bayes rule

06/17/2019 ∙ by Finnian Lattimore, et al. ∙ Criteo

The concept of causality has a controversial history. Whether causal problems can be represented and addressed with probability theory, or whether fundamentally new mathematics such as the do-calculus is required, has been hotly debated. For example, Pearl (2001) states that "the building blocks of our scientific and everyday knowledge are elementary facts such as 'mud does not cause rain' and 'symptoms do not cause disease' and those facts, strangely enough, cannot be expressed in the vocabulary of probability calculus". This has led to a dichotomy between advocates of causal graphical modeling and the do-calculus, and researchers applying Bayesian methods. In this paper we demonstrate that, while it is critical to explicitly model our assumptions about the impact of intervening in a system, provided we do so, estimating causal effects can be done entirely within the standard Bayesian paradigm. The invariance assumptions underlying causal graphical models can be encoded in ordinary probabilistic graphical models, allowing causal estimation with Bayesian statistics that is equivalent to the do-calculus. Elucidating the connections between these approaches is a key step toward enabling the insights provided by each to be combined to solve real problems.




1 Introduction

The do-calculus [Pearl, 1995] is a powerful body of theory that provides three additional rules for probability theory, on the basis that probability theory alone is not sufficient for solving causal problems (an argument prosecuted forcefully by Pearl in several places, e.g. [Pearl, 2001]). In this paper we provide a side-by-side analysis of four classic causal problems: the two cases responsible for Simpson’s reversal, and two cases with unobserved confounders. These four analyses strongly suggest that the do-calculus and Bayesian inference can both be used to make causal estimates, although each approach has its own strengths and weaknesses.

A fully Bayesian approach leverages a vast body of existing research and is able to account for finite sample uncertainties. On the other hand, the do-calculus has simpler graphs and is sometimes a more direct approach. Some of the results pertaining to unobserved confounders were also discovered using the do-calculus (in particular results for front door adjustment and M-Bias), and while we show these results can be transferred to the Bayesian paradigm, the mechanism for systematically doing so in a tractable manner remains unclear.

While there is much to be discussed about the similarities and differences between the two approaches, we mostly leave this outside scope and simply provide the four examples side by side. The paper has the following structure. In Section 2 we outline the two methodologies in a sufficiently general framework that either could be used to solve causal problems. In Section 3 we outline the two fully observed problems, focusing on Simpson’s paradox. In Section 4 we outline two problems involving unobserved confounders: the non-identifiable case and the front door rule. Concluding remarks are made in Section 5.

2 Two schools of thought

2.1 Probabilistic graphical models

Probabilistic graphical models (PGMs) combine graph theory with probability theory in order to develop new algorithms and to present models in an intuitive framework [Jordan, 2004]. A probabilistic graphical model is a directed acyclic graph over variables, which represents how the joint distribution over these variables may be factorized. In particular, any missing edge in the graph must correspond to a conditional independence relation in the joint distribution. There are multiple valid probabilistic graphical model representations for a given joint distribution. For example, any joint distribution over two variables $X$ and $Y$ may be represented by both $X \rightarrow Y$ and $X \leftarrow Y$.

2.2 Causal Graphical Models And The Do-Calculus

A causal graphical model (CGM) is a probabilistic graphical model with the additional assumption that a link $X \rightarrow Y$ means $X$ causes $Y$. Think of the data generating process for a CGM as sampling data first for the exogenous variables (those with no parents in the graph), and then in subsequent steps sampling values for the children of previously sampled nodes. An atomic intervention in such a system that sets the value of a specific variable $T$ to a fixed constant corresponds to removing all links into $T$, as it is now set exogenously rather than determined by its previous causes. It is assumed that everything else in the system remains unchanged, in particular the functions or conditional distributions that determine the value of a variable given its parents in the graph. In this way, a CGM encodes more than the factorization (or conditional independence structure) of the joint distribution over its variables; it additionally specifies how the system responds to atomic interventions.
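The effect of removing the links into the intervened variable can be illustrated with a small exact calculation. The sketch below assumes a hypothetical binary CGM with links Z -> T, Z -> Y and T -> Y; the probability tables and all function names are illustrative assumptions and do not come from the paper.

```python
from itertools import product

# Hypothetical binary CGM with Z -> T, Z -> Y and T -> Y (numbers are illustrative only).
p_z1 = 0.4                                    # P(Z=1)
p_t1_given_z = {0: 0.3, 1: 0.8}               # P(T=1 | Z=z)
p_y1_given_tz = {(0, 0): 0.2, (0, 1): 0.7,    # P(Y=1 | T=t, Z=z), keyed by (t, z)
                 (1, 0): 0.4, (1, 1): 0.9}


def bern(p_one, value):
    """Probability that a Bernoulli(p_one) variable equals `value`."""
    return p_one if value == 1 else 1.0 - p_one


def joint(z, t, y, do_t=None):
    """Joint probability of (z, t, y); passing `do_t` cuts the link Z -> T."""
    if do_t is None:
        p_t = bern(p_t1_given_z[z], t)        # T still depends on its parent Z
    else:
        p_t = 1.0 if t == do_t else 0.0       # T is set exogenously
    # The conditionals for Z and Y are left unchanged by the intervention.
    return bern(p_z1, z) * p_t * bern(p_y1_given_tz[(t, z)], y)


def p_y1_given_t(t):
    """Observational P(Y=1 | T=t), obtained by conditioning."""
    num = sum(joint(z, t, 1) for z in (0, 1))
    den = sum(joint(z, t, y) for z, y in product((0, 1), repeat=2))
    return num / den


def p_y1_do_t(t):
    """Interventional P(Y=1 | do(T=t)), obtained from the mutilated graph."""
    return sum(joint(z, t, 1, do_t=t) for z in (0, 1))


for t in (0, 1):
    print(f"t={t}  conditioning: {p_y1_given_t(t):.3f}  intervening: {p_y1_do_t(t):.3f}")
```

In this toy model conditioning and intervening disagree because Z influences both T and Y, which is the situation analysed in Case 1 of Simpson's paradox.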

A CGM describes how the structure of a system is modified by an intervention. However, answering causal queries such as "what would the distribution of cancer look like if we were able to prevent smoking?" requires inference about the distributions of variables in the post-interventional system. The do-notation is a shorthand for describing the distribution of variables post-intervention, and the do-calculus is a set of rules for identifying which (conditional) distributions are equivalent pre- and post-intervention. If it is possible to derive an expression for the desired post-interventional distribution purely in terms of the joint distribution over the original system via the do-calculus, then the causal query is identifiable, meaning that, assuming positive density and infinite data, we obtain a point estimate for it. The do-calculus is complete: a query is identifiable if and only if it can be solved via the do-calculus [Shpitser and Pearl, 2006, Huang and Valtorta, 2006].

Here we present the do-calculus in a simplified form that applies to interventions on single variables - which is sufficient for the examples presented in this paper. The full form of the do-calculus applies to interventions on any subset of variables - see [Pearl, 1995, Pearl, 2009, Peters et al., 2017].

The do-calculus

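Let $G$ be a CGM, let $G_{\overline{T}}$ represent $G$ post-intervention (i.e. with all links into $T$ removed) and let $G_{\underline{T}}$ represent $G$ with all links out of $T$ removed. Let $do(t)$ represent intervening to set a single variable $T$ to $t$.

Rule 1:

$P(y \mid do(t), z, w) = P(y \mid do(t), z)$ if $Y \perp W \mid (Z, T)$ in $G_{\overline{T}}$

Rule 2:

$P(y \mid do(t), z) = P(y \mid t, z)$ if $Y \perp T \mid Z$ in $G_{\underline{T}}$

Rule 3:

$P(y \mid do(t), z) = P(y \mid z)$ if $Y \perp T \mid Z$ in $G_{\overline{T}}$, and $Z$ is not a descendant of $T$.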

2.3 Representing a Causal Problem with a Probabilistic graphical model

While PGMs and CGMs may appear similar, there are key differences between them, both in the information they represent and in how they are typically applied. CGMs are used to determine if a given query is identifiable and to obtain an expression for it in terms of the original joint distribution, with estimation of this expression a follow-up step; latent variables are introduced to capture dependence induced by unobserved variables that may complicate identification of causal effects, and links are not reversible. By contrast, in PGMs links can be reversed, model specific details for estimation (including plates and parameters) are included graphically, and latent variables (of a specific form) are introduced for computational reasons, usually to coerce the model into complete data exponential family form.

To represent an intervention with an ordinary Probabilistic graphical model, we must explicitly model the pre and post intervention systems and the relationship between them.

Algorithm 1: CausalBayesConstruct

Input: Causal graph $G$ and intervention $do(T = t)$.  
Output: Probabilistic graphical model representing this intervention

  1. Draw the original causal graph inside a plate indexed from $1$ to $N$ to represent the data generating process.

  2. For each variable $X_i \in G$, parameterize $P(X_i \mid \mathrm{parents}(X_i))$ by adding a parameter $\theta_i$ with a link into $X_i$.

  3. Draw the graph after the intervention by setting $T = t$ and removing all links into it. Rename each of the variables to distinguish them from the variables in the original graph, e.g. $X_i$ becomes $X_i^*$.

  4. Connect the two graphs by linking $\theta_i$ to the corresponding variable $X_i^*$ in the post-interventional graph, for each $X_i$ excluding $T$.
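The four steps above can be sketched as a transformation on a graph stored as a dictionary of parent lists. This is a minimal illustration under assumptions of our own: the function name, the `theta_X` naming of parameters and the `X*` naming of post-intervention copies are hypothetical, and the plate of step 1 is left implicit.

```python
def causal_bayes_construct(parents, treatment):
    """Sketch of Algorithm 1 for a graph given as {node: [parent, ...]}.

    Returns the parent lists of the combined pre/post-intervention PGM.
    Naming is hypothetical: 'theta_X' is the parameter node of X and
    'X*' is the post-intervention copy of X.
    """
    pgm = {}
    for node, pa in parents.items():
        # Steps 1-2: keep the original graph and give every node a parameter.
        pgm[f"theta_{node}"] = []
        pgm[node] = list(pa) + [f"theta_{node}"]
    for node, pa in parents.items():
        if node == treatment:
            # Step 3: the intervened variable is set exogenously, so it has no parents.
            pgm[f"{node}*"] = []
        else:
            # Steps 3-4: copy the node and share the original parameter with the copy.
            pgm[f"{node}*"] = [f"{p}*" for p in pa] + [f"theta_{node}"]
    return pgm


# Confounded graph with links Z -> T, Z -> Y and T -> Y.
case1 = {"Z": [], "T": ["Z"], "Y": ["T", "Z"]}
for node, pa in causal_bayes_construct(case1, "T").items():
    print(node, "<-", pa)
```

The only node whose parameter is not shared across the two copies of the graph is the treatment, which is how the construction encodes the invariance assumption.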

A PGM constructed with Algorithm 1 represents exactly the same assumptions about a specific intervention as the corresponding CGM; see Figures 1 and 3 for an example. We have just explicitly created a joint model over the system pre- and post-intervention, which allows the direct application of standard statistical inference, rather than requiring additional notation and operations that map from one to the other, as the do-calculus does. The Bayesian model is specified by the parameterization of the conditional distribution of variables given their parents, and priors may be placed on the parameters $\theta$. The fact that the parameters are shared for all pairs of variables $(X_i, X_i^*)$ excluding $T$ captures the assumption that all that is changed by the intervention is the way $T$ takes its value: the conditional distributions for all other variables given their parents are invariant.

Despite its simplicity, we are unaware of a direct statement of Algorithm 1. It is related to twin networks [Pearl, 2009] and augmented directed acyclic graphs [Dawid, 2015] but is distinct from both.

2.4 Causal Inference with Probabilistic graphical models

The result of Algorithm 1 is a probabilistic graphical model on which we can do inference with standard probability theory rather than the do-calculus, and which has properties such as arrow reversal (by the use of Bayes rule). To infer causal effects we compute a predictive distribution for the quantity of interest in the post-intervention graph using Bayes rule, integrating out all parameters, latent variables and any observed variables that are not of interest, for each setting of the treatment.

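To make this procedure clearer, let $V$ be the set of variables in the original causal graph $G$, excluding the variable we intervene on, $T$, and let $V^*$ be the corresponding variables in the post-interventional graph. We have:

  • $\theta$: the set of model parameters.

  • $\mathbf{V}$: a matrix of the observations of variables $V$, collected pre-intervention.

  • $\mathbf{T}$: a vector of the observed values of the treatment variable $T$, collected pre-intervention.

  • $V^*$: the variables of the system post-intervention.

  • $t^*$: the value that the intervened on variable is set to.

  • $Y^* \in V^*$: the variable of interest post-intervention.

The goal is to infer the unobserved post-interventional distribution over $Y^*$, given the observed data $\mathbf{V}$ and $\mathbf{T}$ and a selected treatment $t^*$. By construction, conditional on the parameters $\theta$, the post-interventional variables $V^*$ are independent of the data collected pre-intervention, $(\mathbf{V}, \mathbf{T})$. The value of the intervention $t^*$ is set exogenously, so it is independent of both $\theta$ and $(\mathbf{V}, \mathbf{T})$ (note also that $t^*$ has no marginal distribution: it is a constant set by the intervention). This ensures that the joint distribution over $(\theta, \mathbf{V}, \mathbf{T}, V^*)$ factorizes into three terms: a prior over the parameters $P(\theta)$, the likelihood for the original system $P(\mathbf{V}, \mathbf{T} \mid \theta)$, and a predictive distribution for the post-interventional variables given parameters and intervention $P(V^* \mid \theta, t^*)$:

$$P(\theta, \mathbf{V}, \mathbf{T}, V^* \mid t^*) = P(\theta)\, P(\mathbf{V}, \mathbf{T} \mid \theta)\, P(V^* \mid \theta, t^*)$$

We then marginalize out $\theta$,

$$P(\mathbf{V}, \mathbf{T}, V^* \mid t^*) = \int P(\theta)\, P(\mathbf{V}, \mathbf{T} \mid \theta)\, P(V^* \mid \theta, t^*)\, d\theta$$

and condition on the observed data $(\mathbf{V}, \mathbf{T})$,

$$P(V^* \mid \mathbf{V}, \mathbf{T}, t^*) = \frac{\int P(\theta)\, P(\mathbf{V}, \mathbf{T} \mid \theta)\, P(V^* \mid \theta, t^*)\, d\theta}{\int P(\theta)\, P(\mathbf{V}, \mathbf{T} \mid \theta)\, d\theta}$$

Finally, if the goal is to infer mean treatment effects on a specific variable post-intervention $Y^*$, we can marginalize out the remaining variables in $V^*$ (we could also compute conditional treatment effects by first conditioning on selected variables in $V^*$),

$$P(Y^* \mid \mathbf{V}, \mathbf{T}, t^*) = \int P(V^* \mid \mathbf{V}, \mathbf{T}, t^*)\, d(V^* \setminus Y^*)$$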


If there are no latent variables in $G$, assuming positive density over the domain of $(V, T)$ and a well defined prior $P(\theta)$, the likelihood will dominate, and the posterior over the parameters will become independent of the prior at the infinite data limit. The term $P(V^* \mid \theta, t^*)$ can be expanded into a product of terms of the form $P(X_i^* \mid \mathrm{parents}(X_i^*), \theta_i)$, following the factorization implied by the post-interventional graph. From step (3) of Algorithm 1, each of these terms is equal to the corresponding term $P(X_i \mid \mathrm{parents}(X_i), \theta_i)$, giving results equivalent to Pearl’s truncated product formula [Pearl, 2009].

The presence of latent variables in $G$ adds complications which we defer to Section 4.

3 Simpson’s Paradox (Fully Observed)

Simpson’s paradox provides an excellent case study for demonstrating that raw data cannot be used for inferring causality without further assumptions. In this section, we show how we can infer treatment effects and resolve the paradox with either the do-calculus or Bayesian inference, and that these approaches yield equivalent results. Assume we have a table of data on some outcome $Y$ for two different treatments ($T$) broken down by a third variable $Z$, as shown in Table 1.

Z  T  Y  count
0  0  0  150
0  0  1  50
0  1  0  180
0  1  1  180
1  0  0  50
1  0  1  200
1  1  0  4
1  1  1  36
Table 1: Example Data

By estimating probabilities as past frequencies we obtain the following conditional probabilities:

The paradox

Treatment $T=0$ seems best overall, but if we break the data down by $Z$ then, regardless of which value of $Z$ a patient has, treatment $T=1$ seems better. If we had to select a single treatment for everyone, which should it be? The key to resolving this question is to realize that what we care about in this setting is the expected outcome of intervening in the system to set $T$ (in Pearl’s notation, $P(Y \mid do(T))$) rather than either of the conditional distributions $P(Y \mid T)$ or $P(Y \mid T, Z)$. As a result, which treatment is preferred hinges on causal assumptions, which we may specify using a CGM or a PGM.
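The reversal can be checked directly from the counts in Table 1. The sketch below assumes the four columns of the table are, in order, $Z$, $T$, $Y$ and the number of observations; this reading is an assumption (it is the one under which the stratified frequencies come out as round numbers), and the helper name `p_y1` is ours.

```python
# Rows of Table 1, read as (Z, T, Y, count); this column order is an assumption.
rows = [
    (0, 0, 0, 150), (0, 0, 1, 50),
    (0, 1, 0, 180), (0, 1, 1, 180),
    (1, 0, 0, 50),  (1, 0, 1, 200),
    (1, 1, 0, 4),   (1, 1, 1, 36),
]


def p_y1(t, z=None):
    """Empirical P(Y=1 | T=t), or P(Y=1 | T=t, Z=z) when z is given."""
    keep = [(y, n) for (zz, tt, y, n) in rows
            if tt == t and (z is None or zz == z)]
    total = sum(n for _, n in keep)
    return sum(n for y, n in keep if y == 1) / total


for t in (0, 1):
    print(f"P(Y=1 | T={t}) = {p_y1(t):.3f}")
    for z in (0, 1):
        print(f"  P(Y=1 | T={t}, Z={z}) = {p_y1(t, z):.3f}")
```

Under this reading, the aggregated frequencies favour $T=0$ while both strata of $Z$ favour $T=1$.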

3.1 Simpson’s Paradox Case 1

Figure 1: A CGM of Case 1. Left: observational; Right: mutilated
Figure 2: The graphs $G$, $G_{\overline{T}}$ and $G_{\underline{T}}$ for case 1.


Figure 3: A PGM of Case 1

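Imagine that our observations are generated by the CGM given in Figure 1, where the covariate $Z$ is a cause of both $T$ and $Y$. Applying the rules of the do-calculus gives:

$$P(y \mid do(t)) = \sum_z P(y \mid do(t), z)\, P(z \mid do(t)) = \sum_z P(y \mid t, z)\, P(z \mid do(t)) = \sum_z P(y \mid t, z)\, P(z)$$

where the second equality uses Rule 2 ($Y \perp T \mid Z$ in $G_{\underline{T}}$) and the third uses Rule 3 ($Z \perp T$ in $G_{\overline{T}}$). See Figure 2 for $G$, $G_{\overline{T}}$ and $G_{\underline{T}}$ in case 1.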

To find the same solution using a Bayesian approach we first apply CausalBayesConstruct on CGM Figure 1 to produce the PGM in Figure 3. We then explicitly parameterize the model and write out the three model components, the post intervention predictive, the likelihood component and the prior component.

We use the following parameterization:

The post intervention predictive is:

The likelihood component is:

By de Finetti’s strong law of large numbers [De Finetti, 1980], as the number of observations grows the posterior concentrates on a single point and, consequently:

This demonstrates the agreement between the Bayesian solution and the solution found using the do-calculus at large samples.

This convergence is usually very fast, and good agreement will also be found for low sample sizes (where instead of using the point estimate we integrate over the posterior). For example, under uniform priors the posteriors of the parameters are Beta distributions, and the predictive distribution giving the causal inference can be computed using the “Laplace smoothing algorithm”, which involves adding one to the counts before normalizing.
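To make the three components and the Laplace smoothing step concrete, one possible Bernoulli parameterization of Case 1 can be sketched as follows. The symbols $\phi$, $\psi_z$ and $\theta_{tz}$, and the count notation $n_{\cdot}$, are labels chosen here for illustration and are assumptions rather than notation recovered from the figures; $t^*$ denotes the value the treatment is set to, $Y^*$ the post-intervention outcome, and $N$ the number of observations.

```latex
% Assumed parameterization of Case 1 (all symbols are illustrative labels).
\begin{align*}
Z &\sim \mathrm{Bernoulli}(\phi), \qquad
T \mid Z = z \sim \mathrm{Bernoulli}(\psi_z), \qquad
Y \mid T = t, Z = z \sim \mathrm{Bernoulli}(\theta_{tz}).
\end{align*}

% Post-intervention predictive: the link Z -> T is cut, so \psi does not appear.
\begin{align*}
P(Y^* = 1 \mid t^*, \phi, \theta) &= (1 - \phi)\,\theta_{t^* 0} + \phi\,\theta_{t^* 1}.
\end{align*}

% Likelihood of the N pre-intervention observations (z_n, t_n, y_n).
\begin{align*}
P(\mathbf{V}, \mathbf{T} \mid \phi, \psi, \theta) &= \prod_{n=1}^{N}
  \phi^{z_n} (1 - \phi)^{1 - z_n}\,
  \psi_{z_n}^{t_n} (1 - \psi_{z_n})^{1 - t_n}\,
  \theta_{t_n z_n}^{y_n} (1 - \theta_{t_n z_n})^{1 - y_n}.
\end{align*}

% Independent uniform priors give independent Beta posteriors, whose means
% are the Laplace-smoothed frequencies.
\begin{align*}
\mathbb{E}[\phi \mid \mathbf{V}, \mathbf{T}] &= \frac{n_{z=1} + 1}{N + 2}, &
\mathbb{E}[\theta_{tz} \mid \mathbf{V}, \mathbf{T}] &= \frac{n_{y=1,t,z} + 1}{n_{t,z} + 2}.
\end{align*}

% Posterior predictive for the outcome under the intervention.
\begin{align*}
P(Y^* = 1 \mid \mathbf{V}, \mathbf{T}, t^*) &=
  \sum_{z} \frac{n_{y=1,t^*,z} + 1}{n_{t^*,z} + 2} \cdot \frac{n_{z} + 1}{N + 2}.
\end{align*}
```

Because the posteriors over $\phi$ and $\theta$ are independent in this sketch, the expectation of their product is the product of their expectations, which is why the predictive reduces to smoothed counts.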

Returning to the numerical example applying the do-calculus using the maximum likelihood algorithm we obtain:


Assuming uniform priors and applying the Bayesian solution we obtain:

We see good agreement between the two methods, the only difference being the impact of the prior due to the finite sample. We also see that $T=1$ is the better treatment (assuming that $Y=1$ is the desired outcome).

3.2 Simpson’s Paradox Case 2

Figure 4: A CGM of Case 2: Left observational, Right: mutilated
Figure 5: The graphs $G$, $G_{\overline{T}}$ and $G_{\underline{T}}$ for case 2.

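Imagine that our observations are generated by the CGM given in Figure 4. Using the do-calculus we get the result in one step:

$$P(y \mid do(t)) = P(y \mid t)$$

by Rule 2, since $Y \perp T$ in $G_{\underline{T}}$; see Figure 5 for the meaning of $G_{\underline{T}}$.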


Figure 6: A PGM of Case 2

To find the same solution using a Bayesian approach we first apply CausalBayesConstruct on CGM Figure 4 to produce the PGM in Figure 6. We then explicitly parameterize the model and write out the three model components, the post intervention predictive, the likelihood component and the prior component.

We use the following parameterization:

The post intervention predictive is:

The likelihood component is:

Again by de Finetti’s strong law of large numbers, as the number of observations grows the posterior concentrates on a single point and, consequently:

This shows that again there is large sample agreement between the two methods. Similarly, convergence is usually very fast and there is close agreement for even small samples.

Returning to the numerical example applying the do-calculus using the maximum likelihood algorithm we obtain:

Assuming uniform priors and applying the Bayesian solution again using the Laplace smoothing result we obtain:

Again we see good agreement between the two methods, the only difference being the impact of the prior due to the finite sample. We also see that $T=0$ is the better treatment.

Note that the distribution over $(Z, T, Y)$ is identical for both Case 1 and Case 2, and yet the optimal treatment differs. The paradox is resolved by understanding that the difference is due to different model assumptions about the impact of intervening on $T$, and we have demonstrated that these assumptions can be expressed either with a CGM or an extended PGM.
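The comparison between the two cases can be reproduced numerically from Table 1. The sketch below rests on assumptions that should be checked against the figures: the table columns are read as ($Z$, $T$, $Y$, count), Case 1 is taken to have $Z$ as a common cause of $T$ and $Y$, and Case 2 is taken to have $Z$ downstream of $T$ (links T -> Z, T -> Y and Z -> Y). Setting `smooth=0` gives the maximum likelihood plug-in estimate, and `smooth=1` gives the Laplace-smoothed estimate that corresponds to uniform priors. All function names are ours.

```python
from collections import Counter

# Table 1 read as (Z, T, Y, count); the column order is an assumption.
rows = [
    (0, 0, 0, 150), (0, 0, 1, 50),
    (0, 1, 0, 180), (0, 1, 1, 180),
    (1, 0, 0, 50),  (1, 0, 1, 200),
    (1, 1, 0, 4),   (1, 1, 1, 36),
]
n = Counter()
for z, t, y, c in rows:
    n[(z, t, y)] += c


def count(z=None, t=None, y=None):
    """Number of observations matching the given values (None matches anything)."""
    return sum(c for (zz, tt, yy), c in n.items()
               if z in (None, zz) and t in (None, tt) and y in (None, yy))


def ratio(num, den, smooth):
    """Estimate of a binary probability; smooth=1 adds one to each of the two counts."""
    return (num + smooth) / (den + 2 * smooth)


def case1(t, smooth=0):
    """Assumed Case 1 (Z confounds T and Y): sum_z P(Y=1 | t, z) P(z)."""
    return sum(ratio(count(z=z, t=t, y=1), count(z=z, t=t), smooth)
               * ratio(count(z=z), count(), smooth) for z in (0, 1))


def case2(t, smooth=0):
    """Assumed Case 2 (Z downstream of T): sum_z P(Y=1 | t, z) P(z | t)."""
    return sum(ratio(count(z=z, t=t, y=1), count(z=z, t=t), smooth)
               * ratio(count(z=z, t=t), count(t=t), smooth) for z in (0, 1))


for name, estimate in (("case 1", case1), ("case 2", case2)):
    for t in (0, 1):
        print(f"{name}, do(T={t}): ML {estimate(t):.4f}, "
              f"Laplace {estimate(t, smooth=1):.4f}")
```

With these assumptions the preferred treatment flips between the two cases even though the counts are identical, and the smoothed and unsmoothed estimates differ only slightly.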

4 With Unobserved Confounders

Unobserved confounders (or latent variables) are hidden variables that at best complicate causal inference and at worst render it impossible. While a direct attack using the methodology specified above does allow Bayesian inference to solve these problems, this is achieved by marginalizing out a complex latent variable, the size of which grows with the data set. Usually the inclusion of the latent variable is not viable, and the model must be marginalized to remove it and then re-parameterized. Whether this is possible in a way that allows causation to be identified depends on the structure of the graph. If it isn’t possible to identify all parameters that have a causal impact, then prior distributions will have an impact even in the large data limit.

4.1 When Causality Cannot Be Identified

Figure 7: CGM Where Causality Is “not identifiable”: Left observational, Right: mutilated


Figure 8: PGM Where Causality Is “not identifiable”

The simplest graphical model where causal inference becomes impossible even with unlimited samples is shown in Figure 7. In the do-calculus this is demonstrated by the fact that there is no way to apply the three rules in order to obtain $P(y \mid do(t))$.

In this problem we consider $T$ to have two states and $Y$ to have two states, but the latent variable or unobserved confounder $U$ is of arbitrary complexity. This reflects many real life problems, e.g. $T$ could represent the presence of some substance in a person’s diet (so it is binary), $Y$ could represent some binary health outcome and $U$ could represent socio-economic circumstances of a person affecting both $T$ and $Y$.

Following the same prescription as before, to find the solution using a Bayesian approach we first apply CausalBayesConstruct on the CGM in Figure 7 to produce the PGM in Figure 8. We then explicitly parameterize the model and write out the three model components: the post intervention predictive component, the likelihood component and the prior component. The first step of parameterization is complicated by the fact that $U$ is both latent and high dimensional. This results in posteriors over the parameters that are non-identifiable and high dimensional, so we will also consider a re-parameterization which partially mitigates these difficulties.

We use the following parameterization:

The post intervention predictive is:

We introduce a low dimensional re-parameterization of the model parameters, as statistically identifying this quantity is sufficient for making causal inferences.

Unfortunately, when we write the likelihood we see that we cannot identify this parameter, but rather a different low dimensional function of the model parameters:

We introduce a second low dimensional re-parameterization of the model parameters, chosen because it is identifiable.

We can now see the difficulty in this problem: we need the causally relevant parameter but can only infer the identifiable one. Both are different low dimensional projections of the full set of model parameters. They are related because both are functions of the same underlying parameters, so it may be reasonable to specify a joint prior giving:

In specifying priors for this problem, we may reasonably use default priors (e.g. flat priors) for the identifiable parameter (or even take a point estimate), as it is identifiable with modest data sets. On the other hand, the conditional prior of the causally relevant parameter given the identifiable one will be completely unaffected by the data, so it is extremely important that any information concerning how the identifiable parameter affects knowledge of the causally relevant one is carefully assessed. In many instances this will be considered too difficult to reasonably attempt, e.g. it may be that the two are independent a priori, in which case the data adds no value to causal problems at all.
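The non-identifiability can be illustrated with two hypothetical latent variable models that produce exactly the same observational distribution over the treatment and outcome but different interventional distributions. The sketch assumes a binary confounder (written `u` below) and binary treatment `t` and outcome `y`; all numbers and function names are illustrative assumptions.

```python
def bern(p_one, value):
    """Probability that a Bernoulli(p_one) variable equals `value`."""
    return p_one if value == 1 else 1.0 - p_one


def observational(model):
    """Joint P(T=t, Y=y) with the latent confounder summed out."""
    p_u1, p_t1_u, p_y1_tu = model
    return {(t, y): sum(bern(p_u1, u) * bern(p_t1_u[u], t) * bern(p_y1_tu[(t, u)], y)
                        for u in (0, 1))
            for t in (0, 1) for y in (0, 1)}


def interventional(model, t):
    """P(Y=1 | do(T=t)): the confounder keeps its marginal, T is set exogenously."""
    p_u1, _, p_y1_tu = model
    return sum(bern(p_u1, u) * p_y1_tu[(t, u)] for u in (0, 1))


# Model A: the confounder is irrelevant and the treatment drives the outcome.
# Each model is (P(U=1), P(T=1 | U=u), P(Y=1 | T=t, U=u) keyed by (t, u)).
model_a = (0.5, {0: 0.5, 1: 0.5},
           {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.8, (1, 1): 0.8})

# Model B: the treatment has no effect; the confounder drives both T and Y.
model_b = (0.5, {0: 0.1, 1: 0.9},
           {(0, 0): 0.125, (1, 0): 0.125, (0, 1): 0.875, (1, 1): 0.875})

print(observational(model_a))
print(observational(model_b))
print([interventional(model_a, t) for t in (0, 1)])
print([interventional(model_b, t) for t in (0, 1)])
```

No amount of data on the treatment and outcome alone can separate these two models, so any conclusion about the intervention has to come from the prior.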

4.2 The Front Door Rule

Figure 9: A CGM for the Front Door Rule: Top the original graph, Bottom: the mutilated graph

Imagine that our observations are generated by the CGM given in Figure 9, which is the graph that requires the front door rule. The front door rule is remarkable in that it shows that a graph quite similar to Figure 7 does allow the causal effect to be identified, the only difference being another observed node between the treatment and the outcome.

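Using the do-calculus gives (detailed steps in [Pearl, 1995]), writing $Z$ for the observed node between the treatment $T$ and the outcome $Y$:

$$P(y \mid do(t)) = \sum_z P(z \mid t) \sum_{t'} P(y \mid t', z)\, P(t')$$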



Figure 10: A Bayesian Model for the Front Door Rule

To find the same solution using a Bayesian approach we first apply CausalBayesConstruct on the CGM in Figure 9 to produce the PGM in Figure 10. We then explicitly parameterize the model and write out the three model components. Again we are hampered by the presence of the unobserved confounder $U$, but for this problem we can effectively marginalize and re-parameterize the model to make causation identifiable.

We use the following parameterization:

The post intervention predictive is:

And the likelihood component is:

Unfortunately, the fact that $U$ is both latent and large means that the posterior over the parameters is both non-identifiable and high dimensional (although the marginal posterior over the parameters of $P(Z \mid T)$ is identifiable, since it depends only on $T$ and $Z$). A direct attack would require sophisticated Bayesian approximation methods to capture the complex structure within the posterior and is not within the scope of this paper. Instead, we note that $U$ can be eliminated from the 2nd term in the post-intervention predictive distribution: