A formal framework for causal inference is provided by the probabilistic causal model (Pearl, 2009) that encodes our knowledge of the variables of interest and their mutual relationships. In observational studies, experimentation is not available, but through the causal model framework we can still symbolically intervene on variables, forcing them to take certain values as if an experiment had taken place. The question is whether we can make inferences about the effect of the intervention in the post-intervention model using only the observed probability distribution of the variables in the model before the intervention took place. This question is formally defined as identifiability of causal effects, and it has received considerable attention in the literature, including a number of algorithmic solutions (Huang and Valtorta, 2006; Shpitser and Pearl, 2006; Tian and Pearl, 2002).
A causal model can be associated with a directed acyclic graph (DAG) that represents the functional relationships of the variables included in the model. The graphical representation provides us with the concept of d-separation (Geiger et al., 1990), which can be used to infer conditional independences between variables from the graph. If the distribution of the variables implies no conditional independence statements other than those already encoded in the graph, we say that the distribution is faithful (Spirtes et al., 2000).
The use of d-separation in the post-intervention model is the basis of do-calculus (Pearl, 1995), which consists of a set of inference rules for manipulating interventional distributions. The purpose of do-calculus is to derive formulas for causal effects and other causal queries, and it has been shown to be complete with respect to the identifiability of causal effects (Huang and Valtorta, 2006; Shpitser and Pearl, 2006). The derived formulas provide recipes for estimating the causal effects from observational data.
When computing causal effect formulas, we often apply an identifiability algorithm, such as the ID algorithm by Shpitser and Pearl (2006). Criteria for identifiability such as the back-door criterion and front-door criterion are available for manual derivations (Pearl, 2009) but the ID algorithm is more general and thus more suitable for automated processing. The ID algorithm splits the original problem into smaller subproblems which are then solved and aggregated as the final expression for the causal effect.
Complicated expressions are likely to arise in situations where we have included variables in our model that do not provide further benefit for the identification of the causal effect of interest. It is often the case that these variables nonetheless appear in the resulting formula, and deriving a simpler expression with the variable eliminated can be non-trivial. It is hard to specify what makes one expression simpler than another, but we can consider a number of criteria to evaluate simplicity. For example, we can compare the number of sums and fractions and the number of variables present in the expression.
In this paper we propose a number of graphical criteria to infer which variables in our causal model are in fact not necessary for identification. These criteria allow us to prune the graph, which in practice means removing specific vertices and considering identification in a latent projection. A significantly simpler expression can be obtained by pruning alone, but we may also combine pruning with simplification procedures that operate symbolically on the interventional distribution, as presented by Tikka and Karvanen (2017b). Applying these methods in conjunction often provides additional benefits.
We present an identifiability algorithm that recognizes and eliminates unnecessary variables from the graph based on our criteria, resulting in a simpler expression. Simpler expressions have several benefits. First, when a large number of graphs and identifiability queries are processed, it is more efficient to evaluate a simpler expression repeatedly, especially when some variables have been completely removed, which further reduces the complexity of the task. Second, in practical applications that involve real-world data, variables often contain missing data or are affected by bias. Obtaining expressions that do not involve such variables can be of great benefit in estimation. Third, a simpler expression is easier to communicate.
An introductory example motivates the use of the improved algorithm. We are interested in the causal effect of on in graph of Figure 1(fig:intro_start). Here, open circles denote unobserved variables. A more in-depth overview of graph theoretic concepts used in this paper is provided in Section 2.
The causal effect is identifiable and the output of the ID algorithm is
This expression is very cumbersome and complicated. However, it turns out that a simpler expression exists for the causal effect. By exploiting the structure of the graph and using standard probability calculus the following expression can be obtained
This expression is simpler in every regard compared to the original output. It contains fewer terms and no fractions. Also, we have completely removed the variables and from the expression. It can be shown that identifying the causal effect in the original graph is equivalent to identifying it in the graph depicted in Figure 1(fig:intro_pruned). By running our improved algorithm we are able to prune the original graph and obtain this simpler expression directly. The algorithm works recursively and the pruning is carried out at each stage of the recursion. The recursive pruning provides significant benefits over pruning as a pre-processing step as demonstrated later.
The paper is structured as follows. In Section 2 we review crucial definitions and concepts related to graph theory and causal models. In Section 3 we focus on semi-Markovian causal models and present the original formulation of the ID algorithm. Our main results are presented in Section 4 and they are implemented into an improved identifiability algorithm in Section 5. Examples of the benefits of recursive pruning are provided in Section 6. Section 7 concludes with a discussion.
We assume the reader to be familiar with a number of graph theoretic concepts and refer them to works such as Koller and Friedman (2009). We use capital letters to denote vertices and the respective variables, and small letters to denote their values. Bold letters are used to denote sets. A directed graph with a vertex set $\mathbf{V}$ and an edge set $\mathbf{E}$ is denoted by $G = \langle \mathbf{V}, \mathbf{E} \rangle$. For a graph $G = \langle \mathbf{V}, \mathbf{E} \rangle$ and a set of vertices $\mathbf{W} \subseteq \mathbf{V}$ the sets $Pa(\mathbf{W})_G$, $Ch(\mathbf{W})_G$, $An(\mathbf{W})_G$ and $De(\mathbf{W})_G$ denote a set that contains $\mathbf{W}$ in addition to its parents, children, ancestors and descendants in $G$, respectively. We also define the set $Co(\mathbf{W})_G$ to denote the set of vertices that are connected to $\mathbf{W}$ in $G$ via paths where the directionality of the edges is ignored, including $\mathbf{W}$. The root set of a graph $G$ is the set of vertices without any descendants $\{X \in \mathbf{V} \mid De(X)_G \setminus \{X\} = \emptyset\}$, where $\setminus$ denotes the set difference. A subgraph of a graph $G$ induced by a set of vertices $\mathbf{W}$ is denoted by $G[\mathbf{W}]$. This subgraph retains all edges $V_i \rightarrow V_j$ of $G$ such that $V_i, V_j \in \mathbf{W}$. The graph obtained from $G$ by removing all incoming edges of $\mathbf{X}$ and all outgoing edges of $\mathbf{Z}$ is written as $G_{\overline{\mathbf{X}}\underline{\mathbf{Z}}}$. To facilitate analysis of causal effects we must first define the probabilistic causal model (Pearl, 2009).
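Before turning to that definition, the graph notation of this section can be made concrete with a short sketch. The class `Digraph` and all of its method names are hypothetical helpers introduced here for illustration only; they are not part of the paper or of any library. Following the convention stated above, the ancestor, descendant and connected sets include the argument set itself.

```python
from collections import deque


def reach(start, step):
    """Return `start` plus every vertex reachable by repeatedly applying `step`."""
    seen = set(start)
    queue = deque(seen)
    while queue:
        v = queue.popleft()
        for w in step(v):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


class Digraph:
    """Hypothetical helper: a directed graph stored as (tail, head) pairs."""

    def __init__(self, vertices, edges):
        self.vertices = set(vertices)
        self.edges = set(edges)

    def _parents(self, v):
        return {a for a, b in self.edges if b == v}

    def _children(self, v):
        return {b for a, b in self.edges if a == v}

    def ancestors(self, ws):
        # The set ws together with all of its ancestors.
        return reach(ws, self._parents)

    def descendants(self, ws):
        # The set ws together with all of its descendants.
        return reach(ws, self._children)

    def connected(self, ws):
        # Vertices joined to ws by a path when edge directions are ignored.
        return reach(ws, lambda v: self._parents(v) | self._children(v))

    def root_set(self):
        # Vertices that have no descendants other than themselves.
        return {v for v in self.vertices if not self._children(v)}

    def induced(self, ws):
        # Subgraph keeping only the edges with both endpoints in ws.
        ws = set(ws)
        return Digraph(ws, {(a, b) for a, b in self.edges if a in ws and b in ws})

    def cut(self, incoming=(), outgoing=()):
        # Remove edges pointing into `incoming` and edges leaving `outgoing`.
        incoming, outgoing = set(incoming), set(outgoing)
        kept = {(a, b) for a, b in self.edges
                if b not in incoming and a not in outgoing}
        return Digraph(self.vertices, kept)


# Example chain W -> X -> Z -> Y.
g = Digraph("WXZY", [("W", "X"), ("X", "Z"), ("Z", "Y")])
```

In this example `g.cut(incoming={"X"})` no longer contains the edge from W to X, which is the kind of edge removal used later when interventions are considered.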
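[Probabilistic Causal Model] A probabilistic causal model is a quadruple $M = \langle \mathbf{U}, \mathbf{V}, \mathbf{F}, P(\mathbf{u}) \rangle$, where
$\mathbf{U}$ is a set of unobserved (exogenous) variables that are determined by factors outside the model.
$\mathbf{V}$ is a set of observed (endogenous) variables that are determined by variables in $\mathbf{U} \cup \mathbf{V}$.
$\mathbf{F}$ is a set of functions such that each $f_i \in \mathbf{F}$ is a mapping from (the respective domains of) $\mathbf{U} \cup (\mathbf{V} \setminus \{V_i\})$ to $V_i$, and such that the entire set $\mathbf{F}$ forms a mapping from $\mathbf{U}$ to $\mathbf{V}$.
$P(\mathbf{u})$ is a joint probability distribution of the variables in the set $\mathbf{U}$.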
Each causal model induces a causal diagram which is a directed graph that provides a graphical means to convey our assumptions of the causal mechanisms involved. The induced graph is constructed by adding a vertex for each variable in and a directed edge from into whenever is defined in terms of .
Causal inference often focuses on a sub-class of models that satisfy additional assumptions: each appears in at most two functions of , the variables in are mutually independent and the induced graph of the model is acyclic. Models that satisfy these additional assumptions are called semi-Markovian causal models. A graph associated with a semi-Markovian model is called a semi-Markovian graph (SMG). In SMGs every has at most two children. When semi-Markovian models are considered it is common not to depict background variables in the induced graph explicitly. Unobserved variables with exactly two children are not denoted as but as a bidirected edge instead. Furthermore, unobserved variables with only one or no children are omitted entirely. We also adopt these abbreviations. For SMGs the sets and contain only observed vertices. Additionally, a subgraph of an SMG will also retain any bidirected edges between vertices in .
Any DAG can be associated with an SMG by constructing its latent projection (Verma, 1993).
[latent projection] Let be a DAG such that the vertices in are observed and the vertices in are latent. The latent projection is a DAG , where for every pair of distinct vertices it holds that:
contains an edge if there exists a directed path in on which every vertex except and is in .
contains an edge if there exists a path from to in that does not contain the pattern (a collider) and on which every vertex except and is in and the first edge has an arrowhead pointing into and the last edge has an arrowhead pointing into .
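The two conditions of the definition can be sketched in code. The function names `latent_reach` and `latent_projection` are assumptions made for this illustration. The sketch relies on the observation that a collider-free path with arrowheads at both ends and only latent interior vertices must consist of two directed paths that start from a common latent vertex.

```python
from itertools import combinations


def latent_reach(source, children, latent):
    """Observed vertices reachable from `source` along directed paths
    whose interior vertices are all latent."""
    found, seen, stack = set(), set(), [source]
    while stack:
        v = stack.pop()
        for w in children.get(v, ()):
            if w in latent:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
            else:
                found.add(w)
    return found


def latent_projection(edges, observed):
    """Return (directed, bidirected) edge sets of the latent projection.

    `edges` are (tail, head) pairs of a DAG; every vertex that is not
    listed in `observed` is treated as latent.
    """
    observed = set(observed)
    children, vertices = {}, set(observed)
    for a, b in edges:
        children.setdefault(a, set()).add(b)
        vertices |= {a, b}
    latent = vertices - observed

    # First condition: directed path whose interior is entirely latent.
    directed = {(z, w) for z in observed
                for w in latent_reach(z, children, latent)}

    # Second condition: a latent vertex that reaches both endpoints
    # through latent-only directed paths.
    bidirected = set()
    for u in latent:
        hit = latent_reach(u, children, latent)
        bidirected |= {frozenset(pair) for pair in combinations(hit, 2)}
    return directed, bidirected


# Example: U and L are latent; U confounds X and Y, L mediates X -> Y.
dag = [("U", "X"), ("U", "Y"), ("X", "L"), ("L", "Y"), ("X", "Z")]
directed, bidirected = latent_projection(dag, {"X", "Y", "Z"})
```

Here the mediating latent vertex collapses into a direct edge from X to Y, while the latent common parent becomes a bidirected edge between X and Y.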
From the construction it is easy to see that a latent projection is in fact an SMG. The induced graph of a probabilistic causal model can also be used to derive conditional independences among the variables in the model using a concept known as d-separation. We provide a definition for d-separation (Shpitser and Pearl, 2008) which takes into account the presence of bidirected edges and is thus suitable for SMGs.
[d-separation] A path in an SMG is said to be d-separated by a set if and only if either
contains one of the following three patterns of edges: , or , such that , or
contains one of the following three patterns of edges: , , , such that .
Disjoint sets and are said to be d-separated by in if every path from to is d-separated by in .
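A d-separation query of this kind can be checked mechanically. The sketch below does not enumerate paths as in the definition. Instead it uses the classical ancestral moral graph criterion for DAGs, after replacing every bidirected edge by an explicit latent parent of its two endpoints, which is assumed here to give the same answers as the path-based definition. The function name `d_separated` and the edge encoding are assumptions made for illustration.

```python
from itertools import combinations


def d_separated(directed, bidirected, xs, ys, zs):
    """Check whether xs and ys are d-separated by zs.

    `directed` holds (tail, head) pairs and `bidirected` holds pairs of
    vertices joined by a bidirected edge.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # Replace each bidirected edge by an explicit latent common parent.
    edges = set(directed)
    for i, (a, b) in enumerate(bidirected):
        u = ("latent", i)
        edges |= {(u, a), (u, b)}

    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)

    # Ancestral set of xs, ys and zs.
    anc = xs | ys | zs
    stack = list(anc)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in anc:
                anc.add(p)
                stack.append(p)

    # Moralize: link each vertex to its parents and marry the parents.
    nbrs = {v: set() for v in anc}
    for v in anc:
        ps = parents.get(v, set())
        for p in ps:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p, q in combinations(ps, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)

    # Undirected search from xs that never enters zs.
    seen = xs - zs
    stack = list(seen)
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        for w in nbrs[v] - zs:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return True


chain = [("X", "Z"), ("Z", "Y")]
collider = [("X", "Z"), ("Y", "Z")]
```

For the chain, conditioning on Z separates X from Y, whereas for the collider conditioning on Z connects them. Adding a bidirected edge between X and Y to the chain leaves them connected regardless of Z.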
Whenever we can decompose the joint distribution of the observed variables and the unobserved variables as , where also contains the unobserved parents but not the argument itself, we say that is an I-map of (Pearl, 2009). If sets and are d-separated by in G, then is independent of given in every for which is an I-map (Pearl, 1988). We use the notation of Dawid (1979) to denote this d-separation and conditional independence statement as . It is clear that the graph induced by any semi-Markovian causal model is an I-map for the joint distribution induced by the model.
Our interest lies in the effects of actions imposing changes to the model. An action that forces to take a specific value is called an intervention and it is denoted by (Pearl, 2009). An intervention on a model creates a new sub-model, denoted by , where the functions in that determine the value of have been replaced with constant functions. The interventional distribution of a set of variables in the model is denoted by . This distribution is also known as the causal effect of on .
Multiple causal models can share the same graph, and thus the same sub-model resulting from an intervention. The question is whether the assumptions encoded in the causal model are sufficient to uniquely specify an interventional distribution of interest. This notion is captured by the following definition (Shpitser and Pearl, 2006).
[identifiability] Let be an SMG and let and be disjoint sets of variables such that . The causal effect of on is said to be identifiable from in if is uniquely computable from in any causal model that induces .
In order to show the identifiability of a given effect we have to express the interventional distribution in terms of observed probabilities only. The link between observed probabilities and interventional distributions is provided by three inference rules known as do-calculus (Pearl, 1995):
Insertion and deletion of observations:
Exchanging actions and observations:
Insertion and deletion of actions:
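For reference, the three rules are commonly stated as follows (Pearl, 1995). The notation in this block is an assumption and may differ from the notation used elsewhere in the paper: $P(\mathbf{y} \mid do(\mathbf{x}), \mathbf{w})$ denotes the interventional distribution of $\mathbf{Y}$ conditional on $\mathbf{W}$, an overline denotes removal of incoming edges and an underline denotes removal of outgoing edges.

```latex
\begin{align*}
\text{Rule 1:}\quad
  & P(\mathbf{y} \mid do(\mathbf{x}), \mathbf{z}, \mathbf{w})
    = P(\mathbf{y} \mid do(\mathbf{x}), \mathbf{w})
  && \text{if } (\mathbf{Y} \perp\!\!\!\perp \mathbf{Z} \mid \mathbf{X}, \mathbf{W})
     \text{ in } G_{\overline{\mathbf{X}}}, \\
\text{Rule 2:}\quad
  & P(\mathbf{y} \mid do(\mathbf{x}), do(\mathbf{z}), \mathbf{w})
    = P(\mathbf{y} \mid do(\mathbf{x}), \mathbf{z}, \mathbf{w})
  && \text{if } (\mathbf{Y} \perp\!\!\!\perp \mathbf{Z} \mid \mathbf{X}, \mathbf{W})
     \text{ in } G_{\overline{\mathbf{X}}\underline{\mathbf{Z}}}, \\
\text{Rule 3:}\quad
  & P(\mathbf{y} \mid do(\mathbf{x}), do(\mathbf{z}), \mathbf{w})
    = P(\mathbf{y} \mid do(\mathbf{x}), \mathbf{w})
  && \text{if } (\mathbf{Y} \perp\!\!\!\perp \mathbf{Z} \mid \mathbf{X}, \mathbf{W})
     \text{ in } G_{\overline{\mathbf{X}}\,\overline{\mathbf{Z}(\mathbf{W})}},
\end{align*}
% Z(W) is the set of vertices of Z that are not ancestors of any vertex
% of W in the graph with the incoming edges of X removed.
```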
Completeness of do-calculus was established independently by Huang and Valtorta (2006) and Shpitser and Pearl (2006). In this paper we focus on the solution provided by Shpitser and Pearl (2006). They constructed an identifiability algorithm called ID, which in essence applies the rules of do-calculus and breaks the problem into smaller sub-problems repeatedly.
3 ID Algorithm
In order to present the ID algorithm, we first need some additional definitions that are used to construct the graphical criterion for non-identifiability (Shpitser and Pearl, 2006).
[C-component] Let be an SMG and let . If every pair of vertices in is connected by a bidirected path, that is a path consisting entirely of bidirected edges, then is a C-component (confounded component). Furthermore, is a maximal C-component if contains every vertex connected to via bidirected paths in and is an induced subgraph of .
No restrictions are imposed on the directed edges of a C-component. The same is not true for the maximal C-components (also known as districts) of an SMG , which are assumed to be induced subgraphs of . This requirement guarantees the uniqueness of the maximal C-components.
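Because membership in a maximal C-component depends only on bidirected paths, the vertex sets of the maximal C-components are the connected components of the graph formed by the bidirected edges alone. A minimal sketch, in which the function name `c_components` and the edge encoding are assumptions made for illustration:

```python
def c_components(vertices, bidirected):
    """Return the vertex sets of the maximal C-components.

    `bidirected` holds pairs of vertices joined by a bidirected edge;
    directed edges play no role in the partition and are not needed.
    """
    nbrs = {v: set() for v in vertices}
    for a, b in bidirected:
        nbrs[a].add(b)
        nbrs[b].add(a)

    components, seen = [], set()
    for v in sorted(vertices):
        if v in seen:
            continue
        # Collect everything reachable from v via bidirected edges.
        comp, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for w in nbrs[u] - comp:
                comp.add(w)
                stack.append(w)
        seen |= comp
        components.append(comp)
    return components


# Example: X -> Z -> Y with a bidirected edge between X and Y.
parts = c_components({"X", "Y", "Z"}, [("X", "Y")])
```

The maximal C-component itself is then the subgraph induced by each returned vertex set, including both its directed and bidirected edges.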
Maximal C-components are an important tool for identifying causal effects. The set of maximal C-components of a semi-Markovian graph is denoted by . A result by Tian (2002) states that if is a maximal C-component and then the causal effect is identifiable from in . A distribution of a semi-Markovian model also factorizes with respect to the maximal C-components of the induced graph such that (Shpitser and Pearl, 2006). It is precisely this factorization that the ID algorithm takes advantage of. A specific type of C-component is used to characterize problematic structures for identifiability.
[C-forest] Let be an SMG and let be the root set of . If is a C-component and all observed vertices have at most one child, then is a rooted C-forest.
The complete criterion for non-identifiability uses a structure formed by two C-forests:
[hedge] Let be disjoint sets of variables and let be an SMG. Let and be -rooted C-forests in such that , , , and . Then and form a hedge for in .
Hedges are not an intuitive concept. Whenever a hedge is present, there exist two causal models with the same probability distribution over but their interventional distributions do not agree. Observational data cannot be used to estimate causal effects in this scenario. We are now ready to present the ID algorithm.
Shpitser and Pearl (2006) showed that whenever Algorithm 1 returns an expression for a causal effect, it is correct. Additionally, whenever line 5 is triggered, there exists a hedge for the causal effect currently being identified. This result establishes the completeness of the algorithm and also the completeness of do-calculus, since the soundness of each line of the algorithm can be shown with do-calculus and standard probability calculus alone.
4 Pruning of Variables
In this section we present a number of results that deal with variables that are not necessary for identification either by removing them from the graph or by considering them latent. When the causal effect is considered in an SMG we can present an outline of the pruning process:
Removal of non-ancestors of .
Removal of ancestors of that are connected to only via under certain conditions.
Removal of vertices connected to other vertices only through a single vertex.
Identification in a latent projection under certain conditions.
Steps 2–4 are new and they are based on the results of this section. Step 1 is derived from a useful result by Shpitser and Pearl (2006) which states that for a causal effect we can always ignore non-ancestors of .
Let . Then obtained from in is equal to obtained from in .
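The graph operation behind this result can be sketched as follows: only the outcome set and its ancestors are kept, together with the edges among them, and the intervention set is restricted to the vertices that remain. The function name `prune_non_ancestors` and the edge encoding are assumptions made for illustration. The sketch handles the graph only; the observed distribution would additionally have to be marginalized over the removed variables.

```python
def prune_non_ancestors(directed, bidirected, xs, ys):
    """Restrict an SMG to the ancestors of ys (including ys itself).

    `directed` holds (tail, head) pairs, `bidirected` holds frozensets of
    two vertices. Returns the kept vertices, the kept directed edges, the
    kept bidirected edges and the part of xs that survives.
    """
    parents = {}
    for a, b in directed:
        parents.setdefault(b, set()).add(a)

    # Ancestors are found along directed edges only.
    keep = set(ys)
    stack = list(keep)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in keep:
                keep.add(p)
                stack.append(p)

    kept_directed = {(a, b) for a, b in directed if a in keep and b in keep}
    kept_bidirected = {e for e in bidirected if set(e) <= keep}
    return keep, kept_directed, kept_bidirected, set(xs) & keep


# Example: X -> Z -> Y -> W with bidirected edges X <-> Y and W <-> Z.
directed = {("X", "Z"), ("Z", "Y"), ("Y", "W")}
bidirected = {frozenset({"X", "Y"}), frozenset({"W", "Z"})}
pruned = prune_non_ancestors(directed, bidirected, {"X"}, {"Y"})
```

In the example the vertex W is a descendant of Y and is therefore dropped, along with the bidirected edge that touched it.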
Lemma 4 is implemented on line 2 of Algorithm 1. Not all ancestors of are always necessary for identification. The next result states that we may sometimes remove ancestors of that are connected to only through .
Let be an SMG and let be the set of all vertices such that intercepts all paths from to . Then the causal effect obtained from in is equal to obtained from in if contains no members of and if .
Proof. Let and assume that . Let , and be sets of unobserved variables such that for all it holds that , for all it holds that and for all it holds that . The sets , and partition because intercepts all paths from to . According to the third rule of do-calculus because the condition holds as removing the edges incoming to separates from its ancestors. Applying the truncated factorization formula (Pearl, 2009) we have that
Since variables in can only be parents of variables in or in , we can sum them out from the previous expression and obtain
Similarly, variables in can only be parents of variables in in , so we can also sum them out of the expression to obtain
We let . Verma (1993) showed that a graph and its latent projection have the same set of conditional independence relations among the observed variables. Because we have assumed that every conditional independence between variables in and applies in both and . We have that for all it holds that and for all it holds that . Finally we obtain
Theorem 4 can also be applied in a more general setting where a subset of intercepts all paths from a set to .
Let be an SMG and let be the set of all vertices such that a set intercepts all paths from to and no member of is a descendant of . Then the causal effect obtained from in is equal to obtained from in if contains no members of and if .
Proof. Since intercepts all paths from to and no member of is a descendant of , it follows that no member of is in . According to the third rule of do-calculus we have that and . The claim now follows by applying Theorem 4 to .
When the causal effect is considered in graph , a set of vertices can be removed from if . The set contains in addition to the ancestors of , and the set contains and all vertices that are connected to via a path that does not contain edges incoming to . Therefore, contains such ancestors of that all paths from to contain . The removal of from is now licensed by Corollary 4.
Corollary 4 provides a constructive criterion for the set described in Corollary 4 when consists only of and its ancestors. If a vertex is a member of then it must be connected to only through paths containing some . We can always choose the sets in such a way that the union over the members of has no descendants in . The set intercepts all paths from to . Conversely, if is a vertex such that a set intercepts all paths from to , then cannot be connected to in . If we assume that it follows that is a member of .
Applying the ID algorithm results in the following expression for the causal effect
Applying Corollary 4 in this case would result in the removal of the vertices and from the graph, since they are ancestors of in but not connected to in and the corresponding latent projection is the subgraph of Figure 2(fig:cor1graph_sub). Running the ID algorithm in this subgraph provides us the following expression
We may consider this expression simpler compared to the previous output by noting that and do not appear in the expression and it has fewer unique terms. The same expression can also be obtained manually by applying the front-door criterion (Pearl, 2009).
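For reference, the front-door criterion yields an adjustment formula of the following general form (Pearl, 2009). The symbols $X$, $Y$ and $Z$ are generic placeholders assumed for this illustration and do not necessarily correspond to the vertex labels of the figure: $Z$ intercepts all directed paths from $X$ to $Y$, there is no unblocked back-door path from $X$ to $Z$, and all back-door paths from $Z$ to $Y$ are blocked by $X$.

```latex
\[
  P(y \mid do(x)) \;=\; \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\, P(x')
\]
```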
Often the question of identifiability cannot be answered directly by either the back-door or the front-door criterion, which leads us to more general methods, such as the ID algorithm. We are interested in the causal effect of , and on and in the graph of Figure 3(fig:cor1graph2_start).
Direct application of the ID algorithm provides us with the following expression
In this graph is connected to only through and , but the corresponding latent projection does not match the subgraph with removed as seen in Figures 4(fig:necessary_sub) and 4(fig:necessary_latent). In the causal effect is identifiable, but it is not identifiable in the latent projection . In this latent projection a bidirected edge exists between and and a hedge is formed by the C-forests and .
We may also remove sets of vertices that are connected to the rest of the graph only through a single vertex even when no intervention on the corresponding variable has taken place.
Let be an SMG such that for a set of vertices and let be a vertex of . If there exists a set such that and is connected to only through , then the causal effect obtained from in is equal to obtained from in .
Proof. Let and let and be sets of unobserved variables such that for all it holds that and for all it holds that . Sets and partition because is connected to only through . Applying the truncated factorization formula yields
Since variables in and can be connected to the other vertices of only through we can complete the marginalization over and
Because we have assumed that is disconnected from in we have that . Therefore, just as in the proof of Theorem 4, we have that for all and for all . Additionally, we have . Finally we obtain
Let be a vertex of an SMG and let . When the causal effect is considered in graph , the set of vertices can be removed from the graph if .
Proof. No descendant of can be removed via Theorem 4 since they are connected to and the set to be removed cannot itself contain . By removing descendants of and itself, and assuming that , we have that . Thus it remains to remove those vertices from that are connected to through a path that does not contain . Removal of the resulting set from the graph is now licensed by Theorem 4.