1 Introduction
The do-calculus [Pearl, 1995] is a powerful body of theory that provides three additional rules for probability theory, on the basis that probability theory alone is not sufficient for solving causal problems (an argument prosecuted forcefully by Pearl in several places, e.g. [Pearl, 2001]). In this paper we provide a side-by-side analysis of four classic causal problems: the two cases responsible for Simpson’s reversal, and two cases with unobserved confounders. These four analyses strongly suggest that the do-calculus and Bayesian inference can both be used to make causal estimates, although each approach has its own strengths and weaknesses.
A fully Bayesian approach leverages a vast body of existing research and is able to account for finite-sample uncertainty. On the other hand, the do-calculus has simpler graphs and is sometimes more direct; moreover, some of the results pertaining to unobserved confounders were discovered using the do-calculus (in particular the results for front-door adjustment and M-bias), and while we show these results can be transferred to the Bayesian paradigm, the mechanism for doing so systematically and tractably remains unclear.
While there is much to be discussed about the similarities and differences between the two approaches, we mostly leave this out of scope and simply present the four examples side by side. The paper has the following structure. In Section 2 we outline the two methodologies in a framework sufficiently general that either could be used to solve causal problems. In Section 3 we treat the two fully observed problems, focusing on Simpson’s paradox. In Section 4 we treat two problems involving unobserved confounders: a case where the causal effect is non-identifiable, and the front door rule. Concluding remarks are made in Section 5.
2 Two schools of thought
2.1 Probabilistic graphical models
Probabilistic graphical models (PGMs) combine graph theory with probability theory in order to develop new algorithms and to present models in an intuitive framework [Jordan, 2004]. A PGM is a directed acyclic graph over variables, which represents how the joint distribution over these variables may be factorized. In particular, any missing edge in the graph must correspond to a conditional independence relation in the joint distribution. There are multiple valid PGM representations for a given joint distribution; for example, any joint distribution over two variables A and B may be represented by both A → B and B → A.

2.2 Causal Graphical Models and the Do-Calculus
A causal graphical model (CGM) is a PGM with the additional assumption that a link X → Y means X causes Y. Think of the data generating process for a CGM as sampling data first for the exogenous variables (those with no parents in the graph), and then in subsequent steps sampling values for the children of previously sampled nodes. An atomic intervention in such a system that sets the value of a specific variable X to a fixed constant corresponds to removing all links into X, as X is now set exogenously rather than determined by its previous causes. It is assumed that everything else in the system remains unchanged, in particular the functions or conditional distributions that determine the value of each other variable given its parents in the graph. In this way, a CGM encodes more than the factorization (or conditional independence structure) of the joint distribution over its variables; it additionally specifies how the system responds to atomic interventions.
A CGM describes how the structure of a system is modified by an intervention. However, answering causal queries such as "what would the distribution of cancer look like if we were able to prevent smoking?" requires inference about the distributions of variables in the post-interventional system. The do-notation is a shorthand for describing the distribution of variables post-intervention, and the do-calculus is a set of rules for identifying which (conditional) distributions are equivalent pre- and post-intervention. If it is possible to derive an expression for the desired post-interventional distribution purely in terms of the joint distribution over the original system via the do-calculus, then the causal query is identifiable, meaning that (assuming positive density and infinite data) we obtain a point estimate for it. The do-calculus is complete: a query is identifiable if and only if it can be solved via the do-calculus [Shpitser and Pearl, 2006, Huang and Valtorta, 2006].
Here we present the do-calculus in a simplified form that applies to interventions on a single variable X, which is sufficient for the examples presented in this paper. The full form of the do-calculus applies to interventions on any subset of variables; see [Pearl, 1995, Pearl, 2009, Peters et al., 2017].
The do-calculus
Let G be a CGM, let G_X̄ represent G post-intervention (i.e. with all links into X removed) and let G_X̲ represent G with all links out of X removed. Let do(x) represent intervening to set a single variable X to x.

Rule 1: P(y | z, w, do(x)) = P(y | w, do(x)) if Y ⊥ Z | X, W in G_X̄

Rule 2: P(y | w, do(x)) = P(y | x, w) if Y ⊥ X | W in G_X̲

Rule 3: P(y | w, do(x)) = P(y | w) if Y ⊥ X | W in G_X̄, and W is not a descendant of X.
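As a sanity check of Rule 2 (our own illustration, not from the paper), consider the confounded graph Z → X, Z → Y, X → Y with made-up CPTs: conditioning on Z blocks the back-door path, so P(y | z, do(x)) = P(y | z, x), while the unconditional quantities P(y | do(x)) and P(y | x) differ. A minimal exact-enumeration sketch:

```python
import itertools

# Illustrative (made-up) CPTs for the confounded CGM Z -> X, Z -> Y, X -> Y.
P_Z = {0: 0.6, 1: 0.4}
P_X1_Z = {0: 0.3, 1: 0.8}                      # P(X=1 | Z=z)
P_Y1_XZ = {(0, 0): 0.2, (0, 1): 0.5,
           (1, 0): 0.6, (1, 1): 0.9}           # P(Y=1 | X=x, Z=z)

def bern(p, v):  # P(V=v) for a Bernoulli(p) variable
    return p if v == 1 else 1 - p

# Observational joint P(z, x, y) = P(z) P(x|z) P(y|x,z).
obs = {(z, x, y): P_Z[z] * bern(P_X1_Z[z], x) * bern(P_Y1_XZ[(x, z)], y)
       for z, x, y in itertools.product([0, 1], repeat=3)}

# Interventional joint under do(X = x0): the factor P(x|z) is removed
# (truncated product) and X is clamped to x0.
def do_joint(x0):
    return {(z, x0, y): P_Z[z] * bern(P_Y1_XZ[(x0, z)], y)
            for z, y in itertools.product([0, 1], repeat=2)}

def cond(joint, query, given):
    """P(query | given); query/given map tuple index (0=z, 1=x, 2=y) to value."""
    match = lambda cell, fix: all(cell[i] == v for i, v in fix.items())
    den = sum(p for cell, p in joint.items() if match(cell, given))
    num = sum(p for cell, p in joint.items()
              if match(cell, given) and match(cell, query))
    return num / den

for x0 in (0, 1):
    post = do_joint(x0)
    # Rule 2 with W = Z: P(y | z, do(x)) equals the observational P(y | z, x).
    for z in (0, 1):
        assert abs(cond(post, {2: 1}, {0: z}) -
                   cond(obs, {2: 1}, {0: z, 1: x0})) < 1e-12
    # Without conditioning on Z the back-door path is open:
    # P(y | do(x)) differs from P(y | x), so Rule 2 does not apply.
    assert abs(cond(post, {2: 1}, {}) - cond(obs, {2: 1}, {1: x0})) > 1e-3
```

The loop verifies the rule exactly rather than by sampling, which keeps the example free of Monte Carlo noise.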
2.3 Representing a Causal Problem with a Probabilistic graphical model
While PGMs and CGMs may appear similar, there are key differences between them, both in the information they represent and in how they are typically applied. CGMs are used to determine whether a given query is identifiable and to obtain an expression for it in terms of the original joint distribution, with estimation of this expression a follow-up step; latent variables are introduced to capture dependence induced by unobserved variables that may complicate identification of causal effects, and links are not reversible. By contrast, in PGMs links can be reversed, model-specific details for estimation (including plates and parameters) are included graphically, and latent variables (of a specific form) are introduced for computational reasons, usually to coerce the model into complete-data exponential family form.
To represent an intervention with an ordinary PGM, we must explicitly model the pre- and post-intervention systems and the relationship between them.
Algorithm 1: CausalBayesConstruct
Input: Causal graph G and intervention do(X = x).
Output: Probabilistic graphical model representing this intervention.

1. Draw the original causal graph inside a plate indexed from 1 to N to represent the data generating process.

2. For each variable V_i, parameterize the conditional distribution P(V_i | parents(V_i)) by adding a parameter θ_i with a link into V_i.

3. Draw the graph after the intervention by setting X = x and removing all links into X. Rename each of the variables to distinguish them from the variables in the original graph, e.g. Y becomes Y′.

4. Connect the two graphs by linking each θ_i to the corresponding variable in the post-interventional graph, for each variable excluding X.
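The steps above can be sketched as graph surgery on an edge list. This is our own illustrative encoding (the `theta_v` and `v'` naming conventions are assumptions, and the pre-intervention plate is left implicit):

```python
def causal_bayes_construct(nodes, edges, intervene_on):
    """Build the edge list of the PGM representing an intervention.

    nodes: variable names in the causal graph G.
    edges: list of (parent, child) pairs of G.
    intervene_on: the variable X set by the intervention.
    Pre-intervention nodes are understood to sit inside a plate 1..N.
    """
    pgm_edges = list(edges)                    # step 1: original graph (in plate)
    for v in nodes:                            # step 2: parameter theta_v -> v
        pgm_edges.append((f"theta_{v}", v))
    post = {v: f"{v}'" for v in nodes}         # step 3: copy, rename v -> v'
    for parent, child in edges:
        if child != intervene_on:              # ... removing all links into X'
            pgm_edges.append((post[parent], post[child]))
    for v in nodes:                            # step 4: share the parameters
        if v != intervene_on:
            pgm_edges.append((f"theta_{v}", post[v]))
    return pgm_edges

# Case 1 graph: Z -> X, Z -> Y, X -> Y, intervening on X.
pgm = causal_bayes_construct(["Z", "X", "Y"],
                             [("Z", "X"), ("Z", "Y"), ("X", "Y")], "X")
```

Note that X′ ends up with no parents and no parameter, capturing that its value is set exogenously by the intervention.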
A PGM constructed with Algorithm 1 represents exactly the same assumptions about a specific intervention as the corresponding CGM; see Figures 1 and 3 for an example. We have simply created an explicit joint model over the system pre- and post-intervention, which allows the direct application of standard statistical inference, rather than requiring additional notation and operations that map from one system to the other, as the do-calculus does. The Bayesian model is specified by the parameterization of the conditional distribution of each variable given its parents, and priors may be placed on the parameters θ. The fact that the parameters θ_i are shared between each pre-intervention variable and its post-intervention counterpart (excluding X) captures the assumption that the only thing changed by the intervention is the way X takes its value; the conditional distributions of all other variables given their parents are invariant.
Despite its simplicity, we are unaware of a direct statement of Algorithm 1; it is related to twin networks [Pearl, 2009] and augmented directed acyclic graphs [Dawid, 2015] but is distinct from both.
2.4 Causal Inference with Probabilistic graphical models
The result of Algorithm 1 is a PGM on which we can do inference with standard probability theory rather than the do-calculus, and which has properties such as arrow reversal (by the use of Bayes’ rule). To infer causal effects we compute a predictive distribution for the quantity of interest in the post-intervention graph using Bayes’ rule, integrating out all parameters, latent variables and any observed variables that are not of interest, for each setting of the treatment x.
To make this procedure clearer, let V be the set of variables in the original causal graph G, excluding the variable X that we intervene on, and let V′ be the corresponding variables in the post-interventional graph. We have:

θ: the set of model parameters.

D: a matrix of the N observations of the variables (V, X), collected pre-intervention.

V′: the variables of the system post-intervention.

x: the value that the intervened-on variable is set to.

Y′ ∈ V′: the variable of interest post-intervention.
The goal is to infer the unobserved post-interventional distribution over Y′, given the observed data D and a selected treatment x. By construction, conditional on the parameters θ, the post-interventional variables V′ are independent of the data D collected pre-intervention. The value of the intervention x is set exogenously (it has no marginal distribution; it is a constant set by the intervention) and so is independent of both θ and D. This ensures the joint distribution over (θ, D, V′) factorizes into three terms: a prior over the parameters P(θ), the likelihood for the original system P(D | θ), and a predictive distribution for the post-interventional variables given parameters and intervention P(V′ | θ, x):

P(θ, D, V′ | x) = P(θ) P(D | θ) P(V′ | θ, x)
We then marginalize out θ,

P(D, V′ | x) = ∫ P(θ) P(D | θ) P(V′ | θ, x) dθ    (1)
and condition on the observed data D,

P(V′ | D, x) = ∫ P(V′ | θ, x) P(θ | D) dθ    (2)
Finally, if the goal is to infer mean treatment effects (conditional treatment effects could also be computed by first conditioning on selected variables in V′) on a specific variable Y′ post-intervention, we can marginalize out the remaining variables in V′,

P(Y′ | D, x) = Σ_{V′ \ Y′} P(V′ | D, x)    (3)
If there are no latent variables in V, assuming positive density over the domain of (V, X) and a well-defined prior P(θ), the likelihood will dominate, and the posterior over the parameters will become independent of the prior in the infinite data limit. The term P(V′ | θ, x) can be expanded into a product of terms of the form P(V_i′ | parents(V_i′), θ_i) following the factorization implied by the post-interventional graph. From steps (3) and (4) of Algorithm 1, each of these terms is equal to the corresponding term P(V_i | parents(V_i), θ_i) in the pre-interventional graph, giving results equivalent to Pearl’s truncated product formula [Pearl, 2009].
The presence of latent variables in V adds complications, which we defer to Section 4.
3 Simpson’s Paradox (Fully Observed)
Simpson’s paradox provides an excellent case study for demonstrating that raw data cannot be used to infer causality without further assumptions. In this section, we show how we can infer treatment effects and resolve the paradox with either the do-calculus or Bayesian inference, and that these approaches yield equivalent results. Assume we have a table of data on some outcome Y for two different treatments (X = 0 and X = 1), broken down by a third variable Z, as shown in Table 1.
Z  X  Y  Count
0  0  0  150 
0  0  1  50 
0  1  0  180 
0  1  1  180 
1  0  0  50 
1  0  1  200 
1  1  0  4 
1  1  1  36 
By estimating probabilities as past frequencies we obtain the following conditional probabilities:

P(Y = 1 | X = 0) = 250/450 ≈ 0.56,  P(Y = 1 | X = 1) = 216/400 = 0.54
P(Y = 1 | X = 0, Z = 0) = 50/200 = 0.25,  P(Y = 1 | X = 1, Z = 0) = 180/360 = 0.50
P(Y = 1 | X = 0, Z = 1) = 200/250 = 0.80,  P(Y = 1 | X = 1, Z = 1) = 36/40 = 0.90
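The reversal can be checked directly from the counts in Table 1 (keyed here as (Z, X, Y)); a short sketch:

```python
# Counts from Table 1, keyed by (z, x, y).
counts = {(0, 0, 0): 150, (0, 0, 1): 50, (0, 1, 0): 180, (0, 1, 1): 180,
          (1, 0, 0): 50, (1, 0, 1): 200, (1, 1, 0): 4, (1, 1, 1): 36}

def p_y1(x, z=None):
    """Empirical P(Y=1 | X=x), or P(Y=1 | X=x, Z=z) when z is given."""
    rows = [(k, n) for k, n in counts.items()
            if k[1] == x and (z is None or k[0] == z)]
    total = sum(n for _, n in rows)
    return sum(n for k, n in rows if k[2] == 1) / total

# Overall, treatment X=0 looks better ...
assert p_y1(0) > p_y1(1)            # 0.556 vs 0.540
# ... yet X=1 is better within every stratum of Z: Simpson's reversal.
assert p_y1(1, z=0) > p_y1(0, z=0)  # 0.50 vs 0.25
assert p_y1(1, z=1) > p_y1(0, z=1)  # 0.90 vs 0.80
```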
The paradox
Treatment X = 0 seems best overall, but if we break the data down by Z then, regardless of which value of Z a patient has, treatment X = 1 seems better. If we had to select a single treatment for everyone, which should it be? The key to resolving this question is to realize that what we care about in this setting is the expected outcome of intervening in the system to set X (in Pearl’s notation P(Y | do(X))), rather than either of the conditional distributions P(Y | X) or P(Y | X, Z). As a result, which treatment is preferred hinges on causal assumptions, which we may specify using a CGM or a PGM.
3.1 Simpson’s Paradox Case 1
Imagine that our observations are generated by the CGM given in Figure 1, where the covariate Z is a cause of both X and Y. Applying the rules of the do-calculus gives:

P(y | do(x)) = Σ_z P(y | x, z) P(z)

See Figure 2 for G, G_X̄ and G_X̲ in case 1.
To find the same solution using a Bayesian approach we first apply CausalBayesConstruct to the CGM of Figure 1 to produce the PGM in Figure 3. We then explicitly parameterize the model and write out the three model components: the post-intervention predictive, the likelihood component and the prior component.
We use the following parameterization: θ_Z = P(Z = 1), θ_X|z = P(X = 1 | Z = z) and θ_Y|x,z = P(Y = 1 | X = x, Z = z).

The post-intervention predictive is:

P(Y′ = 1 | θ, do(X = x)) = Σ_z P(Z′ = z | θ) P(Y′ = 1 | x, z, θ) = θ_Z θ_Y|x,1 + (1 − θ_Z) θ_Y|x,0

The likelihood component is:

P(D | θ) = Π_n P(z_n | θ_Z) P(x_n | z_n, θ_X|z) P(y_n | x_n, z_n, θ_Y|x,z)
By de Finetti’s strong law of large numbers [De Finetti, 1980], as N → ∞ the posterior concentrates on a single point, with θ_Z → P(z = 1) and θ_Y|x,z → P(Y = 1 | x, z); consequently:

P(Y′ = 1 | D, do(X = x)) → Σ_z P(Y = 1 | x, z) P(z)

which demonstrates the agreement between the Bayesian solution and the solution found using the do-calculus at large samples.
This convergence is usually very fast, and good agreement will also be found for small sample sizes (where instead of using the point estimate we integrate over the posterior); e.g. under uniform priors the posteriors of the parameters will be Beta distributions, and the predictive distribution giving the causal inference can be computed using Laplace smoothing, which amounts to adding one to the counts before normalizing.
Returning to the numerical example, applying the do-calculus result with maximum likelihood estimates we obtain:

P(Y = 1 | do(X = 0)) = 0.25 × 560/850 + 0.8 × 290/850 ≈ 0.44
P(Y = 1 | do(X = 1)) = 0.5 × 560/850 + 0.9 × 290/850 ≈ 0.64
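Under the (Z, X, Y, count) column convention of Table 1, the adjustment formula with maximum likelihood estimates can be computed from the raw counts; a sketch:

```python
# Counts from Table 1, keyed by (z, x, y).
counts = {(0, 0, 0): 150, (0, 0, 1): 50, (0, 1, 0): 180, (0, 1, 1): 180,
          (1, 0, 0): 50, (1, 0, 1): 200, (1, 1, 0): 4, (1, 1, 1): 36}
N = sum(counts.values())  # 850

def p_z(z):  # ML estimate of P(Z=z)
    return sum(n for k, n in counts.items() if k[0] == z) / N

def p_y1_xz(x, z):  # ML estimate of P(Y=1 | X=x, Z=z)
    nxz = sum(n for k, n in counts.items() if k[1] == x and k[0] == z)
    return counts[(z, x, 1)] / nxz

def p_y1_do_x(x):  # back-door adjustment: sum_z P(Y=1 | x, z) P(z)
    return sum(p_y1_xz(x, z) * p_z(z) for z in (0, 1))

print(f"{p_y1_do_x(0):.3f} {p_y1_do_x(1):.3f}")  # 0.438 0.636
```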
Assuming uniform priors and applying the Bayesian solution (via the Laplace smoothing result) we obtain:

P(Y′ = 1 | D, do(X = 0)) ≈ 0.439,  P(Y′ = 1 | D, do(X = 1)) ≈ 0.630
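A sketch of the corresponding Bayesian point predictions: under independent uniform (Beta(1,1)) priors the posterior parameters are independent, so the posterior-mean predictive reduces to the same adjustment formula with add-one (Laplace) smoothed estimates:

```python
# Counts from Table 1, keyed by (z, x, y).
counts = {(0, 0, 0): 150, (0, 0, 1): 50, (0, 1, 0): 180, (0, 1, 1): 180,
          (1, 0, 0): 50, (1, 0, 1): 200, (1, 1, 0): 4, (1, 1, 1): 36}
N = sum(counts.values())

def post_mean_z(z):  # E[P(Z=z) | D] under a uniform prior: (count + 1) / (N + 2)
    return (sum(n for k, n in counts.items() if k[0] == z) + 1) / (N + 2)

def post_mean_y1_xz(x, z):  # E[P(Y=1 | x, z) | D], add-one smoothed
    nxz = sum(n for k, n in counts.items() if k[1] == x and k[0] == z)
    return (counts[(z, x, 1)] + 1) / (nxz + 2)

def bayes_y1_do_x(x):
    # The posterior factors are independent, so the mean of the product
    # is the product of the means.
    return sum(post_mean_y1_xz(x, z) * post_mean_z(z) for z in (0, 1))

print(f"{bayes_y1_do_x(0):.3f} {bayes_y1_do_x(1):.3f}")  # 0.439 0.630
```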
We see good agreement between the two methods, with the only difference being the impact of the prior due to the finite sample. We also see that X = 1 is the better treatment (assuming that Y = 1 is the desired outcome).
3.2 Simpson’s Paradox Case 2
Imagine that our observations are generated by the CGM given in Figure 4. Using the do-calculus we get the result in one step:

P(y | do(x)) = P(y | x)    (Rule 2)

See Figure 5 for the meaning of G_X̲.
To find the same solution using a Bayesian approach we first apply CausalBayesConstruct to the CGM of Figure 4 to produce the PGM in Figure 6. We then explicitly parameterize the model and write out the three model components: the post-intervention predictive, the likelihood component and the prior component.
We use the following parameterization: θ_X = P(X = 1), θ_Z|x = P(Z = 1 | X = x) and θ_Y|x,z = P(Y = 1 | X = x, Z = z).

The post-intervention predictive is:

P(Y′ = 1 | θ, do(X = x)) = Σ_z P(Z′ = z | x, θ) P(Y′ = 1 | x, z, θ) = θ_Z|x θ_Y|x,1 + (1 − θ_Z|x) θ_Y|x,0

The likelihood component is:

P(D | θ) = Π_n P(x_n | θ_X) P(z_n | x_n, θ_Z|x) P(y_n | x_n, z_n, θ_Y|x,z)
Again, by de Finetti’s strong law of large numbers, as N → ∞ the posterior concentrates on a single point, with θ_Z|x → P(z = 1 | x) and θ_Y|x,z → P(Y = 1 | x, z); consequently:

P(Y′ = 1 | D, do(X = x)) → Σ_z P(Y = 1 | x, z) P(z | x) = P(Y = 1 | X = x)

showing again that there is large-sample agreement between the two methods. Similarly, convergence is usually very fast and there is close agreement even for small samples.
Returning to the numerical example, applying the do-calculus result with maximum likelihood estimates we obtain:

P(Y = 1 | do(X = 0)) = P(Y = 1 | X = 0) = 250/450 ≈ 0.56
P(Y = 1 | do(X = 1)) = P(Y = 1 | X = 1) = 216/400 = 0.54
Assuming uniform priors and applying the Bayesian solution, again using the Laplace smoothing result, we obtain:

P(Y′ = 1 | D, do(X = 0)) ≈ 0.555,  P(Y′ = 1 | D, do(X = 1)) ≈ 0.539
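Both estimates for case 2 follow from the counts (same (Z, X, Y, count) convention as Table 1); a sketch:

```python
# Counts from Table 1, keyed by (z, x, y).
counts = {(0, 0, 0): 150, (0, 0, 1): 50, (0, 1, 0): 180, (0, 1, 1): 180,
          (1, 0, 0): 50, (1, 0, 1): 200, (1, 1, 0): 4, (1, 1, 1): 36}

def n(**fix):  # number of rows matching e.g. n(x=0), n(x=0, z=1, y=1)
    idx = {"z": 0, "x": 1, "y": 2}
    return sum(c for k, c in counts.items()
               if all(k[idx[v]] == val for v, val in fix.items()))

def ml_y1_do_x(x):   # do-calculus result: P(Y=1 | do(x)) = P(Y=1 | x)
    return n(x=x, y=1) / n(x=x)

def bayes_y1_do_x(x):  # posterior means under uniform priors (add-one counts)
    ez1 = (n(x=x, z=1) + 1) / (n(x=x) + 2)  # E[P(Z=1 | x)]
    ey = {z: (n(x=x, z=z, y=1) + 1) / (n(x=x, z=z) + 2) for z in (0, 1)}
    return ey[1] * ez1 + ey[0] * (1 - ez1)

print(f"{ml_y1_do_x(0):.3f} {ml_y1_do_x(1):.3f}")        # 0.556 0.540
print(f"{bayes_y1_do_x(0):.3f} {bayes_y1_do_x(1):.3f}")  # 0.555 0.539
```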
Again we see good agreement between the two methods, with the only difference being the impact of the prior due to the finite sample. We also see that X = 0 is the better treatment.
Note that the distribution over (Z, X, Y) is identical for both Case 1 and Case 2, and yet the optimal treatment differs. The paradox is resolved by understanding that the difference is due to different model assumptions about the impact of intervening on X, and we have demonstrated that these assumptions can be expressed either with a CGM or with an extended PGM.
4 With Unobserved Confounders
Unobserved confounders (or latent variables) are hidden variables that at best complicate causal inference and at worst render it impossible. While a direct attack using the methodology specified above does allow Bayesian inference to solve these problems, this is achieved by marginalizing out a complex latent variable whose size grows with the data set. Usually the inclusion of the latent variable is not viable, and the model must be marginalized to remove it and then reparameterized. Whether this is possible in a way that allows the causal effect to be identified depends on the structure of the graph. If it is not possible to identify all parameters that have a causal impact, then prior distributions will have an impact even in the large data limit.
4.1 When Causality Cannot Be Identified
The simplest graphical model in which causal inference becomes impossible, even with unlimited samples, is shown in Figure 7. This is demonstrated in the do-calculus by the fact that there is no way to apply the three rules in order to obtain an expression for P(y | do(x)).
In this problem we consider X to have two states and Y to have two states, but the latent variable (or unobserved confounder) U is of arbitrary complexity. This reflects many real-life problems: e.g. X could represent the presence of some substance in a person’s diet (so it is binary), Y could represent some binary health outcome, and U could represent the socioeconomic circumstances of a person, affecting both X and Y.
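The non-identifiability can be made concrete with a toy example of our own (with U restricted to two states for simplicity): two models of the Figure 7 form that agree exactly on the observational distribution P(x, y) yet disagree on P(y | do(x)):

```python
import itertools

def observational(p_u, p_x1_u, p_y1_xu):
    """Observable joint P(x, y) = sum_u P(u) P(x|u) P(y|x,u)."""
    joint = {}
    for x, y in itertools.product([0, 1], repeat=2):
        joint[(x, y)] = sum(
            p_u[u]
            * (p_x1_u[u] if x else 1 - p_x1_u[u])
            * (p_y1_xu[(x, u)] if y else 1 - p_y1_xu[(x, u)])
            for u in (0, 1))
    return joint

def do_x(p_u, p_y1_xu, x):
    """P(Y=1 | do(X=x)) = sum_u P(u) P(Y=1|x,u): U is unaffected by do(x)."""
    return sum(p_u[u] * p_y1_xu[(x, u)] for u in (0, 1))

# Model A: no confounding; Y simply copies X.
A = dict(p_u={0: 0.5, 1: 0.5},
         p_x1_u={0: 0.5, 1: 0.5},
         p_y1_xu={(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 1.0})
# Model B: X and Y both copy U; X has no effect on Y at all.
B = dict(p_u={0: 0.5, 1: 0.5},
         p_x1_u={0: 0.0, 1: 1.0},
         p_y1_xu={(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 1.0})

obs_A, obs_B = observational(**A), observational(**B)
assert all(abs(obs_A[k] - obs_B[k]) < 1e-12 for k in obs_A)  # same data ...
# ... but opposite causal conclusions: full effect vs no effect.
assert do_x(A["p_u"], A["p_y1_xu"], 1) - do_x(A["p_u"], A["p_y1_xu"], 0) == 1.0
assert do_x(B["p_u"], B["p_y1_xu"], 1) - do_x(B["p_u"], B["p_y1_xu"], 0) == 0.0
```

No amount of (X, Y) data can distinguish these two models, which is exactly why the prior over the causally relevant parameters never washes out.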
Following the same prescription as before, to find the solution using a Bayesian approach we first apply CausalBayesConstruct to the CGM of Figure 7 to produce the PGM in Figure 8. We then explicitly parameterize the model and write out the three model components: the post-intervention predictive component, the likelihood component and the prior component. The first step of parameterization is complicated by the fact that U is both latent and high dimensional; this results in posteriors over the parameters that are non-identifiable and high dimensional. We will also consider a reparameterization which partially mitigates these difficulties.
We use the following parameterization: θ_U parameterizes P(u), θ_X|u = P(X = 1 | u) and θ_Y|x,u = P(Y = 1 | x, u).

The post-intervention predictive is:

P(Y′ = 1 | θ, do(X = x)) = Σ_u P(u | θ_U) P(Y′ = 1 | x, u, θ_Y|x,u)
We introduce the low-dimensional parameter φ_x = Σ_u P(u) P(Y = 1 | x, u) as a reparameterization of θ, as statistically identifying this parameter is sufficient for making causal inferences.
Unfortunately, when we write out the likelihood we see that we cannot identify this parameter, but rather a different low-dimensional function of θ:

P(Y = 1 | x) = Σ_u P(u | x) P(Y = 1 | x, u)
We introduce the low-dimensional parameter ψ_x = P(Y = 1 | X = x) as a reparameterization, as this parameter is identifiable.
We can now see the difficulty in this problem: we need φ but can only infer ψ. Both φ and ψ are different low-dimensional projections of θ; ψ is identifiable and φ is causally relevant. They are related due to the fact that they are both functions of P(u), P(x | u) and P(y | x, u), so it may be reasonable to specify a joint prior P(φ, ψ), giving:

P(φ | D) = ∫ P(φ | ψ) P(ψ | D) dψ
In specifying priors for this problem, we may reasonably use default priors (e.g. flat priors) for ψ (or even take a point estimate), as it is identifiable with modest data sets. On the other hand, P(φ | ψ) will be completely unaffected by the data, so it is extremely important that any information concerning how knowledge of ψ affects knowledge of φ is carefully assessed. In many instances this will be considered too difficult to reasonably attempt, and it may be that P(φ | ψ) = P(φ), in which case the data adds no value to the causal problem at all.
4.2 The Front Door Rule
Imagine that our observations are generated by the CGM given in Figure 9, which is the graph that requires the front door rule. The front door rule is remarkable in that it shows that a graph quite similar to Figure 7 does allow the causal effect to be identified, the only difference being another observed node Z between the treatment X and the outcome Y.
Using the do-calculus gives (detailed steps in [Pearl, 1995]):

P(y | do(x)) = Σ_z P(z | x) Σ_x′ P(y | x′, z) P(x′)    (4)
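Equation (4) can be checked numerically. The sketch below uses our own made-up CPTs (with U binary for tractability) for a graph of the Figure 9 form, computes the front-door expression from the observational joint alone, and compares it with the interventional distribution obtained directly from the mutilated graph:

```python
import itertools

# Illustrative CGM in the front-door form: U -> X, U -> Y, X -> Z, Z -> Y
# (U latent; no direct X -> Y edge).
P_U1 = 0.5
P_X1_U = {0: 0.2, 1: 0.8}             # P(X=1 | u)
P_Z1_X = {0: 0.1, 1: 0.9}             # P(Z=1 | x)
P_Y1_ZU = {(0, 0): 0.1, (0, 1): 0.5,  # P(Y=1 | z, u)
           (1, 0): 0.5, (1, 1): 0.9}

def bern(p, v):
    return p if v == 1 else 1 - p

# Observational joint over (u, x, z, y).
joint = {(u, x, z, y): bern(P_U1, u) * bern(P_X1_U[u], x)
         * bern(P_Z1_X[x], z) * bern(P_Y1_ZU[(z, u)], y)
         for u, x, z, y in itertools.product([0, 1], repeat=4)}

def marg(fix):  # probability of the event {tuple index i == value v}
    return sum(p for k, p in joint.items()
               if all(k[i] == v for i, v in fix.items()))

def front_door(x):
    """Eq. (4): sum_z P(z|x) sum_x' P(y=1|x',z) P(x'), observational only."""
    total = 0.0
    for z in (0, 1):
        pz_x = marg({1: x, 2: z}) / marg({1: x})
        inner = sum(marg({1: xp, 2: z, 3: 1}) / marg({1: xp, 2: z})
                    * marg({1: xp}) for xp in (0, 1))
        total += pz_x * inner
    return total

def truth(x):
    """P(Y=1 | do(x)) from the mutilated graph: U keeps its prior."""
    return sum(bern(P_U1, u) * bern(P_Z1_X[x], z) * P_Y1_ZU[(z, u)]
               for u, z in itertools.product([0, 1], repeat=2))

for x in (0, 1):
    assert abs(front_door(x) - truth(x)) < 1e-12
```

Although U appears in the simulator, the function `front_door` never touches it, illustrating that the causal effect is recoverable from the observed (X, Z, Y) distribution alone.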
To find the same solution using a Bayesian approach we first apply CausalBayesConstruct to the CGM of Figure 9 to produce the PGM in Figure 10. We then explicitly parameterize the model and write out the three model components. Again we are hampered by the presence of U, but for this problem we can effectively marginalize out U and reparameterize the model to make the causal effect identifiable.
We use the following parameterization: θ_U parameterizes P(u), θ_X|u = P(X = 1 | u), θ_Z|x = P(Z = 1 | x) and θ_Y|z,u = P(Y = 1 | z, u).

The post-intervention predictive is:

P(Y′ = 1 | θ, do(X = x)) = Σ_z P(Z′ = z | x, θ_Z|x) Σ_u P(u | θ_U) P(Y′ = 1 | z, u, θ_Y|z,u)

And the likelihood component is:

P(D | θ) = Π_n Σ_{u_n} P(u_n | θ_U) P(x_n | u_n, θ_X|u) P(z_n | x_n, θ_Z|x) P(y_n | z_n, u_n, θ_Y|z,u)
Unfortunately, the fact that U is both latent and large means that the posterior over θ is both non-identifiable and high dimensional (although the marginal posterior over θ_Z|x is identifiable, since it depends only on the observed Z and X). A direct attack would require sophisticated Bayesian approximation methods to capture the complex structure within the posterior and is not within the scope of this paper. Instead, we note that U can be eliminated from the second term in the post-intervention predictive distribution: