1 Introduction
A causal effect is defined as an interventional distribution in which some variables are observed, some are intervened upon (forced to values irrespective of their natural causes) and some are conditioned on. Instead of placing various parametric restrictions based on background knowledge, in this paper we are interested in the question of identifiability: can the causal effect be uniquely determined from the distributions (data) we have and a graph representing our structural knowledge of the generating causal system?
In the most basic setting we identify causal effects from a single observational input distribution, corresponding to passively observed data. To solve such problems more generally than is possible with the back-door adjustment (Spirtes et al., 1993; Pearl, 2009; Greenland et al., 1999), Pearl (1995) introduced do-calculus, a set of three rules that together with probability theory enable the manipulation of interventional distributions.
Shpitser and Pearl (2006a) and Huang and Valtorta (2006) showed that do-calculus is complete by presenting polynomial-time algorithms, each step of which can be seen as a rule of do-calculus or as an operation based on basic probability theory. The algorithms have high practical value because the rules of do-calculus do not by themselves indicate the order in which they should be applied. The algorithms save us from the manual application of do-calculus, which is a tedious task in all but the simplest problems.

Since then many extensions of the basic identifiability problem have appeared. In identifiability using surrogate experiments (Bareinboim and Pearl, 2012b), or z-identifiability, an experimental distribution is available in addition to the observed probability distribution. For data observed in the presence of selection bias, both algorithmic and graphical identifiability results have been derived (Bareinboim and Tian, 2015; Correa et al., 2018). More generally, the presence of missing data necessitates the representation of the missingness mechanism, which poses additional challenges (Mohan et al., 2013; Shpitser et al., 2015). Another dimension of complexity is the number of available data sources. Identification from a mixture of observational and interventional distributions that originate from multiple conceptual domains is known as transportability, for which complete solutions exist in a specific setting (Bareinboim and Pearl, 2014). Most of these algorithms are implemented in the R package causaleffect (Tikka and Karvanen, 2017a).
While completeness has been established for a number of basic identifiability problems, many challenging but important extensions of the identifiability problem have not been studied so far. Table 1 recaps the current state-of-the-art identifiability results; it also describes the generalizations that we aim to investigate in this paper. To find solutions to the more complicated identifiability problems, we present a unified approach to the identification of observational and interventional causal queries by constructing a search algorithm that directly applies the rules of do-calculus. We impose no restrictions on the number or type of known input distributions: we thus provide a solution to problems for which no algorithmic solutions exist (row 7 in Table 1). We also extend to identifiability under missing data together with mechanisms related to selection bias and transportability (row 10 in Table 1).
To combat the inherent computational complexity of such a search-based approach, we derive rules and techniques that avoid unnecessary steps. We also present a search heuristic that considerably speeds up the search in cases where the effect is indeed identifiable. The approach, called do-search, is provably sound and retains completeness in the cases previously proven to be solvable by the rules of do-calculus. We can easily scale up to the problem sizes commonly reported in the literature. An R package (R Core Team, 2018) implementing do-search is also available on CRAN.
Problem (Reference) | Target | Input data (assumptions) | Missing pattern | Solution (complete)
1 Causal effect identifiability (Shpitser and Pearl, 2006a) | | | None | ID (Yes)
2 Causal effect identifiability (Shpitser and Pearl, 2006b) | | | None | IDC (Yes)
3 z-identifiability (Bareinboim and Pearl, 2012b) | | (NE, ED) | None | zID (Yes)
4 Transportability (Bareinboim and Pearl, 2014) | | (NEDD, ED) | None | TR (Yes)
5 Surrogate outcome identifiability (Tikka and Karvanen, 2018b) | | (NE, SO) | None | TRSO (No)
6 Selection bias recoverability (Bareinboim and Tian, 2015) | | | Selection | RC (?)
7 Generalized identifiability | | | None | None
8 Missing data recoverability (Mohan et al., 2013) | | | Restricted | Thm. 2 (Yes)
9 Missing data recoverability (Shpitser et al., 2015) | | | Arbitrary | MID (?)
10 Generalized identifiability with missing data | | | Arbitrary | None
stands for the passively observed joint distribution of all variables. Input is the joint distribution with missing data (see Section 5). Input means the joint distribution under selection bias. Input stands for an experiment where all variables are measured and input stands for an experiment where only a subset of the variables is measured. Notation denotes a set of inputs enumerated by the index . The variable sets present in the same distribution are disjoint. The assumptions of nested experiments (NE), entire distributions (ED) and nested experiments in different domains (NEDD) are explained in Section 2. Assumptions related to surrogate outcomes (SO) can be found in (Tikka and Karvanen, 2018b). The last column gives the algorithm or result that can be used to solve the problem and states whether it provides a complete solution to the problem or whether the completeness status is not known (?). An algorithm is complete if it returns a formula whenever the target query is identifiable. Problems 1–6 are special cases of problem 7 and problems 1–9 are special cases of problem 10.

The paper is structured as follows. Section 2 formulates our general search problem and explains the scenarios in Table 1 and previous research in detail. Section 3 presents the search algorithm, including the rules we use, search space reduction techniques, heuristics, theoretical properties, and finally simulations that demonstrate the efficacy of the search. Section 4 shows a number of new problems for which we can find solutions by using the search. These problems include combined transportability and selection bias, multiple sources of selection bias, and causal effect identification from arbitrary (experimental) distributions. Section 5 shows how the search can be extended to problems that involve missing data. This section also includes a systematic analysis of missing data problems and case-control designs. Section 6 discusses the merits and limitations of the approach. Section 7 offers concluding remarks.
2 The General Causal Effect Identification Problem
Our presentation is based on Structural Causal Models (SCMs) and the language of directed graphs. We assume the reader to be familiar with these concepts and refer to detailed works on these topics, such as (Pearl, 2009) and (Koller and Friedman, 2009), for extended discussion. Following the standard setup of do-calculus (Pearl, 1995), we assume that the causal structure can be represented by a semi-Markovian causal graph over a set of vertices (see Figure 1 for an example). The directed edges correspond to direct causal relations between the variables (relative to ); directed edges do not form any cycles. Confounding of any two observed variables in by some unobserved common cause is represented by a bidirected edge between the variables.
In a nonparametric setting, the problem of expressing a causal quantity of interest in terms of available information has been described in various ways depending on the context. When the available data are affected by selection bias or missing data, a typical goal is to “recover” some joint or marginal distribution. If data are available from multiple conceptual domains, a distribution is “transported” from the source domains, from which a combination of observational and experimental data are available, to a target domain. These problems can be expressed in the SCM framework by equipping the graph of the model with special vertices. However, on a fundamental level they are simply variations of the original identifiability problem of causal effects and, as such, our goal is to represent them as a single generalized identifiability problem. Formally, identifiability can be defined as follows (Pearl, 2009; Shpitser and Pearl, 2008).
Definition 1 (Identifiability).
Let be a set of models with a description and two objects and computable from each model. Then is identifiable from in if is uniquely computable from in any model . In other words, all models in which agree on also agree on .
In the simplest case, the description refers to the graph induced by the causal model, is the joint distribution of the observed variables and the query is a causal effect . Conversely, non-identifiability of from can be shown by describing two models and such that is the same in and , but the object differs between and .
The general form for a causal identifiability problem that we consider in this paper is formulated as follows.
 Input:

A set of input distributions of the form , a query and a semi-Markovian causal graph over .
 Task:

Output a formula for the query over the input distributions, or decide that it is not identifiable.
Here are disjoint subsets of for all , and are disjoint subsets of . The causal graph may contain vertices that describe mechanisms related to transportability and selection bias. In the following subsections we explain several important special cases of this problem definition, some of which have been considered in the literature and some of which have not.
2.1 Previously Considered Scenarios as Special Cases
We restate the concepts of transportability and selection bias under the causal inference framework, and show that identifiability in the scenarios of rows 1–6 of Table 1 falls under the general form on row 7. We return to problems that involve missing data on rows 8–10 later in Section 5.
Causal Effect Identification
z-identifiability
Similarly to ordinary causal effect identification, the input consists of the passive observational distribution, but also of experimental distributions known as surrogate experiments on a set (Bareinboim and Pearl, 2012b). Two restricting assumptions, called here nested experiments and entire distributions, apply to surrogate experiments. Experiments are called nested experiments (NE) when for each experiment intervening on a set of variables , experiments intervening on all subsets of are available as well. Entire distributions (ED) denotes the assumption that the union of observed and intervened variables is always the set of all variables .
Surrogate Outcome Identifiability
Surrogate outcomes generalize the notion of surrogate experiments from z-identifiability. For surrogate outcomes, the assumption of nested experiments still holds, but the assumption of entire distributions can be dropped. Some less strict assumptions (SO) still apply (Tikka and Karvanen, 2018b). The idea of surrogate outcomes is that data from previous experiments are available, but the target was at most only partially measured in these experiments, and the experiments do not have to be disjoint from .
Transportability
The problem of incorporating data from multiple causal domains is known as transportability (Bareinboim and Pearl, 2013). Formally, the goal is to identify a query in a target domain using data from source domains . The domains are represented in the causal graph using a special set of transportability nodes, which is partitioned into disjoint subsets corresponding to each domain . The causal graph contains an extra edge whenever a functional discrepancy in or in exists between the target domain and the source domain . The discrepancy is active if and inactive otherwise. A distribution associated with a domain is of the form . In other words, only the discrepancies between and are active. A distribution corresponding to the target domain has no active discrepancies, meaning that it is of the form . Any variable is conditionally independent of the inactive transportability nodes since their respective edges vanish. Furthermore, since inactive transportability nodes vanish from the distribution, we can assume any transportability node that is present to have the value . Thus an input distribution from a domain takes the form . In the specific case of transportability, the assumptions of entire distributions (ED) and nested experiments in different domains (NEDD) apply, which means that is available for every subset of in each domain .
Selection Bias Recoverability
Selection bias can be seen as a special case of missing data, where the mechanism responsible for the preferential selection is represented in the causal graph by a special sink vertex (Bareinboim and Pearl, 2012a). A typical input for the recoverability problem is , the joint distribution observed under selection bias. Just as in the case of transportability nodes, selection bias nodes only appear when the mechanism has been enabled. Thus we may assume that the input is of the form . More generally, we can consider input distributions of the form .
2.2 New Scenarios as Special Cases
The following settings are special cases of the general identifiability problem of row 7 in Table 1 that do not fall under any of the problems of rows 1–6. They serve as interesting additions to the cases considered in the literature. Concrete examples of these new scenarios are presented in Section 4. Section 5 extends the general problem of row 7 in Table 1 to the general problem with missing data on row 10 while also showcasing the special cases of rows 8 and 9.
Multiple Data Sources with Partially Overlapping Variable Sets
The scenario where only subsets of variables are ever observed together has been extensively considered in the causal discovery literature (Danks et al., 2009; Tillman and Spirtes, 2011; Triantafillou et al., 2010), but not in the context of causal effect identification. In the basic setting the input consists of passively observed distributions such that . We may also observe experimental distributions (Hyttinen et al., 2012; Triantafillou and Tsamardinos, 2015) or even conditionals . Our approach sets no limitations on the number and types of input distributions.
Combining Transportability and Selection Bias
To the best of our knowledge, the frameworks of transportability and selection bias have not been considered simultaneously. The combination of these scenarios fits into the general problem formulation. For example, we may have access to two observational distributions originating from different source domains, but affected by the same biasing mechanism: and , where and are the transportability nodes corresponding to the two source domains and is the selection bias node.
Recovering from Multiple Sources of Selection Bias
In the recent literature on selection bias as a causal inference problem, the focus has been on settings where only a single selection bias node is present (e.g., Bareinboim et al., 2014; Correa and Bareinboim, 2017; Correa et al., 2018). However, multiple sources of selection bias are typical in longitudinal studies where dropout occurs at different stages of the study. Our approach is applicable to an arbitrary number of selection bias mechanisms and input distributions affected by arbitrary combinations of these mechanisms. In other words, if is the set of all selection bias nodes present in the graph, the inputs can take the form , where is an arbitrary subset of .

3 A Search-Based Approach for Causal Effect Identification
The key to the identification of causal effects is that interventional expressions can be manipulated using the rules of do-calculus. We present these rules for augmented DAGs, where an additional intervention variable such that is added to the induced graph for each variable (Spirtes et al., 1993; Pearl, 2009; Lauritzen, 2000) (see Figure 1). Now a d-separation condition of the form means that and are d-separated by and in a graph where edges incoming to (intervened) have been removed (Hyttinen et al., 2015; Dawid, 2002). The three rules of do-calculus (Pearl, 1995) can be expressed as follows:
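For reference, in Pearl's (1995) original notation, writing $G_{\overline{X}}$ for the graph with edges into $X$ removed and $G_{\underline{X}}$ for the graph with edges out of $X$ removed, the three rules read:

```latex
\begin{align*}
&\text{Rule 1:} && P(y \mid \mathrm{do}(x), z, w) = P(y \mid \mathrm{do}(x), w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}}},\\
&\text{Rule 2:} && P(y \mid \mathrm{do}(x), \mathrm{do}(z), w) = P(y \mid \mathrm{do}(x), z, w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\,\underline{Z}}},\\
&\text{Rule 3:} && P(y \mid \mathrm{do}(x), \mathrm{do}(z), w) = P(y \mid \mathrm{do}(x), w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\,\overline{Z(W)}}},
\end{align*}
```

where $Z(W)$ denotes the set of nodes in $Z$ that are not ancestors of any node in $W$ in $G_{\overline{X}}$. In the augmented-DAG formulation used here, the graph mutilations are replaced by conditioning on the corresponding intervention variables.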
The rules are often referred to as insertion/deletion of observations, exchange of actions and observations, and insertion/deletion of actions, respectively. Each rule of do-calculus is only applicable if the accompanying d-separation criterion (on the right-hand side) holds in the underlying graph. In addition to these rules, most derivations require basic probability calculus.
Do-calculus directly motivates a forward search over its rules. The outline of this type of search is given in Algorithm 1. The algorithm derives new identifiable distributions based on what has been given as input or identified in previous steps. For each identified distribution, every rule of do-calculus and the standard probability manipulations of marginalization and conditioning are applied in succession, until the target distribution is found or no new distributions can be found to be identifiable. A preliminary version of this kind of search was used by Hyttinen et al. (2015) as a part of an algorithmic solution to causal effect identifiability when the underlying graph is unavailable.
The formulas produced by Algorithm 1 correspond to short derivations, and unnecessarily complicated expressions are avoided. Also, only distributions guaranteed to be identifiable are derived and used during the search. Formulas for intermediary queries that were identified during the search are also available as a result. Alternatively, one could start with the target and search towards the input distributions; a search in this direction will spend time deriving a number of expressions that are non-identifiable given the input in any case. A depth-first search would produce unnecessarily complicated expressions.
The search can easily derive, for example, the back-door criterion in the graph of Figure 1, as shown by the derivation in Figure 1. The target is and the input is . From the input, the search first derives the marginal and the conditional . Then is derived by the third rule of do-calculus because . The second rule derives from as . The two terms can be combined via the product rule of probability calculus to get , and finally the target is just a marginalization of this. The familiar back-door adjustment formula is thus obtained.
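The back-door adjustment that this derivation produces can be checked numerically. The following sketch uses a hypothetical discrete SCM with the back-door structure Z → X, Z → Y, X → Y; all variable names and probabilities are illustrative, not from the paper.

```python
from itertools import product

# Hypothetical discrete SCM with back-door structure Z -> X, Z -> Y, X -> Y.
def p_z(z):
    return 0.6 if z == 0 else 0.4

def p_x_given_z(x, z):
    table = {0: 0.2, 1: 0.7}           # P(X = 1 | z)
    return table[z] if x == 1 else 1 - table[z]

def p_y_given_xz(y, x, z):
    p1 = 0.2 + 0.5 * x + 0.2 * z       # P(Y = 1 | x, z)
    return p1 if y == 1 else 1 - p1

# Observational joint P(x, y, z) from the factorization of the model.
def joint(x, y, z):
    return p_z(z) * p_x_given_z(x, z) * p_y_given_xz(y, x, z)

# Ground-truth P(y | do(x)) via the truncated factorization.
def p_do(y, x):
    return sum(p_z(z) * p_y_given_xz(y, x, z) for z in (0, 1))

# Back-door adjustment computed from the observational joint alone.
def p_backdoor(y, x):
    total = 0.0
    for z in (0, 1):
        pz = sum(joint(xx, yy, z) for xx, yy in product((0, 1), repeat=2))
        pxz = sum(joint(x, yy, z) for yy in (0, 1))
        total += pz * (joint(x, y, z) / pxz)   # P(z) * P(y | x, z)
    return total

# Naive conditioning P(y | x) is biased by the confounder Z.
def p_naive(y, x):
    px = sum(joint(x, yy, z) for yy, z in product((0, 1), repeat=2))
    return sum(joint(x, y, z) for z in (0, 1)) / px

print(p_do(1, 1))        # 0.78
print(p_backdoor(1, 1))  # 0.78, matches the interventional distribution
print(p_naive(1, 1))     # 0.84, confounded
```

The back-door formula recovers the interventional distribution exactly, while naive conditioning does not, illustrating why the derivation above is needed.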
However, it is not straightforward to make a search over do-calculus computationally feasible. The search space in Figure 1 shows only the parts that resulted in the identifying formula: for example, all passively observed marginals and conditionals over can be derived from the input . Especially in a non-identifiable case, a naive search may go through a huge space before it can return the non-identifiable verdict. The choice of rules is also not obvious: a redundant rule may make the search faster or slower; false non-identifiability may be concluded if a necessary rule is missing. Also, the order in which the rules are applied can have a large impact on the performance of the search. In the following sections we provide highly non-trivial solutions to these challenges.
3.1 Rules
Table 2 lists the full set of rules used to manipulate distributions during the search, generalizing Hyttinen et al. (2015).
Rule | Additional Input | Output | Description
 | | | Insertion of observations
 | | | Deletion of observations
 | | | Observation to action exchange
 | | | Action to observation exchange
 | | | Insertion of actions
 | | | Deletion of actions
 | | | Marginalization
 | | | Conditioning
 | | | Chain rule multiplication
 | | | Chain rule multiplication
Docalculus
Rules and correspond to the rules of do-calculus such that rules are used to add conditioning variables and interventions and rules are used to remove them. Each rule is only valid if the corresponding d-separation criterion given in the beginning of Section 3 holds.
Probability theory
Rule performs marginalization over , and produces a summation at the formula level:
Similarly, rule conditions on a subset to obtain the following formula:
Rules and perform multiplication using the chain rule of probability which requires two known distributions. When rule is applied, the distribution is known and we check whether is known as well. For rule , the roles of the distributions are reversed. In the case of rule , is a subset of and we obtain
Both versions of the chain rule are needed: it may be the case that when expanding with rule , the additional input is only identified later in the search. Then, is identified when rule is applied to .
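As a concrete instance of the chain-rule manipulation (with symbols chosen here purely for illustration), combining two known distributions yields their product:

```latex
P(y, w \mid \mathrm{do}(x), z)
  \;=\; P(y \mid w, \mathrm{do}(x), z)\, P(w \mid \mathrm{do}(x), z).
```

The two chain-rule rules differ only in which of the two factors plays the role of the term being expanded and which the role of the additional input.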
3.2 Improving the Efficacy of the Search
In this section, we present various techniques that improve the efficiency of the search. These findings are implemented in a search algorithm in Section 3.3.
3.2.1 Term Expansion
Term expansion refers to the process of deriving new distributions from an input distribution using the rules of Table 2. By a term we mean a single identified distribution. A term is considered expanded once the rules of Table 2 have been applied to it in every possible way with the term in the role of the input. Note that an expanded distribution may still take the role of an additional input when another term is being expanded. Consider the step of expanding the input term in Table 2 to all possible outputs with any rule. This can be done by enumerating every non-empty subset of and applying the rule with respect to it.
Rule  Validity condition  Termination condition 

Table 3 outlines the requirements for each rule of the search. It tells us that when an observation is added using rule , it cannot be contained in any of the sets or , since they are already present in the term. Only observations that are present can be removed, which is why has to be a subset of when applying rule . We may skip the application of this rule if the set of observations is empty for the current term. The exchange of observations for experiments using rule has similar requirements for the set as rule . Exchanging experiments for observations using rule works in a similar fashion. Only experiments that are present can be exchanged, which means that . This rule can be skipped if the set of experiments is empty. New experiments are added using rule with similar requirements as rule . Well-defined subsets for rule are the same as for rule . For rules and , the only requirement is that is a proper subset of . When the chain rule is applied with rule , we require that the variables of the second product term are observed in the first term. When applied in reverse with rule , the variables of the second term must not be present in the first term.
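The subset enumeration underlying term expansion can be sketched as follows. The representation of a term as three variable-name sets and the marginalization example are illustrative assumptions; the constraint shown (a non-empty proper subset of the response variables) follows the requirement stated above.

```python
from itertools import combinations

def nonempty_subsets(variables):
    """All non-empty subsets of a set of variable names."""
    vs = sorted(variables)
    return [frozenset(c) for r in range(1, len(vs) + 1)
            for c in combinations(vs, r)]

# A term P(y | do(x), z) is encoded as three disjoint variable sets (y, x, z).
# Hypothetical expansion step for marginalization: sum out any non-empty
# proper subset of the response variables.
def marginalization_outputs(y, x, z):
    outputs = []
    for w in nonempty_subsets(y):
        if w == frozenset(y):          # must leave a non-empty response set
            continue
        outputs.append((frozenset(y) - w, frozenset(x), frozenset(z)))
    return outputs

outs = marginalization_outputs({"y1", "y2"}, {"x"}, set())
# sums out either y1 or y2, giving P(y2 | do(x)) and P(y1 | do(x))
```

The same enumeration pattern applies to the other rules, with the validity conditions of Table 3 pruning the candidate subsets.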
3.2.2 Termination Conditions
Additionally, Table 3 lists the termination conditions: if one is satisfied by the current term to be expanded, we know that the corresponding rule cannot be applied to it. The following simple lemma shows that when any of the termination conditions holds, no new distributions can be derived from the term using the respective rule, which allows the search to proceed directly to the next rule.
Lemma 1.
Let be a semiMarkovian graph and let and be disjoint subsets of . Then all of the following are true:
Proof.
For (i), the set is empty so the application of rule using any subset would result in which is already identified. For (ii), the set is empty so no observation can be exchanged for an action using the second rule of docalculus. For (iii), the set is empty so no action can be exchanged for an observation using the second rule of docalculus. For (iv), the set is empty so the application of rule using any subset would result in which is already identified. For (v) and (vi), the set only has a single vertex, so it cannot have a nonempty subset. For (vii), the set is empty so no subset can exist for the second input. ∎
3.2.3 Rule Necessity
Rule 1 of do-calculus can be omitted, as shown by Huang and Valtorta (2006, Lemma 4). Instead of inserting an observation using rule 1, we can insert an intervention and then exchange it for an observation. Similarly, an observation can be removed by first exchanging it for an intervention and then deleting the intervention. It follows that rules and of Table 2 are unnecessary for the search.
The following example shows that the remaining rules of Table 2 are all necessary. In the graph of Figure 2, the causal effect can be identified from the inputs , , , and when all rules are available, but not when any individual rule is omitted. This can be verified by running the search algorithm presented at the beginning of Section 3 or the more advanced algorithm of Section 3.3 with each rule turned off individually.
3.2.4 Early Detection of Nonidentifiable Instances
The worst-case performance of the search can be improved by detecting non-identifiable quantities directly based on the set of inputs before launching the search. The following theorem provides a sufficient criterion for non-identifiability.
Theorem 1.
Let be a semiMarkovian graph, let and let
Then is not identifiable from in if
Proof.
Since , there exists a variable such that none of the sets contain it. We construct two models, and , such that where is a constant. For any child of , we define the structural equations so that . For all other variables, the structural equations are the same for the models and . We have that while all inputs are the same for the models and . It follows that is not identifiable. ∎
In other words, Theorem 1 can be used to verify that the entire variable set of the target distribution cannot be constructed from the inputs. If this is the case, the target quantity is not identifiable.
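A deliberately simplified sketch of a pre-search check in this spirit is given below. The reduction of the check to plain set coverage is an assumption made for illustration; the actual condition of Theorem 1 involves sets whose exact definitions are given in the theorem statement.

```python
# Simplified sketch of a pre-search non-identifiability check in the spirit
# of Theorem 1. A distribution P(y | do(x), z) is encoded as a triple of
# variable-name sets (y, x, z); reducing the check to simple set coverage
# is an illustrative assumption, not the paper's exact criterion.
def trivially_nonidentifiable(target, inputs):
    y, x, z = target
    needed = set(y) | set(x) | set(z)
    covered = set()
    for yi, xi, zi in inputs:
        covered |= set(yi) | set(xi) | set(zi)
    # If some variable of the target appears in no input, no chain of rules
    # can introduce it, so the search can return NA immediately.
    return not needed <= covered

# Example: the target mentions "w" but no input does.
print(trivially_nonidentifiable(
    ({"y"}, {"x"}, {"w"}),
    [({"x", "y", "z"}, set(), set())]))   # True
```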
3.2.5 Heuristics
During the search, we always expand one term at a time through the rules and store the newly identified distributions. For the search to perform well, we need to decide which branches are the most promising and should therefore be expanded first. We do this by defining a proximity function relating the source terms to the target query, and by always expanding the closest term first. Our suggestion here is motivated by the way an educated person might apply do-calculus in a manual derivation. Our chosen proximity function links the target distribution and a source distribution in the following way:
Each input distribution and the terms derived during the search are assigned to a priority queue, where the priority is determined by the value given by . Distributions closer to the target are prioritized over other terms. The weight 10 for the term indicates that having the correct response variables is considered the first priority. Having the correct interventions is considered the second priority (weight 5) and having the correct conditioning variables the third priority (weight 3). The remaining terms in penalize variables that are in the target distribution but not in the source distribution or vice versa. Again, variables that are intervened on are considered more important than conditioning variables.

3.3 The Search Algorithm
We take Algorithm 1 as our starting point and compile the results of Section 3.2 into a new search algorithm called do-search. This algorithm is capable of solving generalized identifiability problems (row 7 in Table 1) while streamlining the search process through a heuristic search order and the elimination of redundant rules and subsets. The pseudo-code for do-search is shown in Algorithm 2.
The algorithm begins by checking whether the query can be solved trivially without performing the search. This can happen if the target is a member of the set of inputs or if Theorem 1 applies. Next, we note that each input distribution in the set is marked as unexpanded at the beginning of the search. Distributions in are expanded one at a time by applying every rule of Table 2 in every possible way.
The iteration over the unexpanded distributions proceeds as follows (lines 4–5). Each input distribution and the terms derived from it during the search are assigned to a priority queue, where the priority is determined by the value given by the proximity function . Distributions closer to the target are expanded first. In the implementation, only the actual memory addresses of the distribution objects are placed into the queue. The set is implemented as a hash table that serves as a container for all input distributions and those derived from them. Each new distribution is assigned a unique index that also serves as the hash function for this table. The distribution objects contained in the table are represented uniquely by three integers corresponding to the sets and of the general form . The distribution objects also contain additional auxiliary information, such as which rule was used to derive the distribution, whether it has been expanded, and from which distribution it was obtained. This information is used to construct the derivation if the target is found to be identifiable.
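The proximity function driving this priority queue can be sketched as follows. The weights 10, 5 and 3 follow the description in the previous section; the penalty weights for mismatched variables are assumptions chosen only to respect the stated ordering (interventions matter more than conditioning variables).

```python
# Sketch of the proximity heuristic. A term P(y | do(x), z) is a triple of
# variable sets; weights 10/5/3 follow the text, the penalty weights 2/1
# are illustrative assumptions.
def proximity(source, target):
    (ys, xs, zs), (yt, xt, zt) = source, target
    score = 0
    if set(ys) == set(yt):
        score += 10            # correct response variables: first priority
    if set(xs) == set(xt):
        score += 5             # correct interventions: second priority
    if set(zs) == set(zt):
        score += 3             # correct conditioning set: third priority
    score -= 2 * len(set(xs) ^ set(xt))   # penalize intervention mismatches
    score -= 1 * len(set(zs) ^ set(zt))   # penalize conditioning mismatches
    return score

target = ({"y"}, {"x"}, set())
print(proximity(({"y"}, {"x"}, set()), target))   # 18: exact match
print(proximity(({"y"}, set(), {"x"}), target))   # 10 - 2 - 1 = 7
```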
Multiple distributions can share the same value of the proximity function . In the case that multiple candidates share the maximal value, the one that was derived earliest takes precedence. When the unexpanded distribution currently closest to the target has been determined, the rules of Table 2 are applied sequentially for all valid subsets dictated by Table 3. When rules one, two and three of do-calculus are considered, the necessary d-separation criteria are checked in (line 14). For the chain rule, the presence of the required second input is also verified. The reverse lookup is implemented using another hash table, where the hash function is based on the unique representation of each distribution object. The values contained in this table are the indices of the derived distributions. The same hash table is also used to verify that we do not derive again distributions that have previously been found to be identifiable from the inputs.
We construct a set of applicable rules for each unexpanded distribution using the termination criteria of Table 3 (lines 6–8). If all the necessary criteria hold for an applicable rule and a subset, the newly derived distribution is added to the set of known distributions and placed into the priority queue as an unexpanded distribution. When the applicable rules and subsets have been exhausted for the current distribution , it is marked as expanded and removed from the queue (line 19). If the target distribution is found at any point (line 15), a formula is returned for it in terms of the original inputs. Alternatively, we can continue deriving distributions to obtain different search paths to the target that can possibly produce different formulas for it. If instead we exhaust the set of unexpanded distributions by emptying the queue, the target is deemed non-identifiable by the search (line 20).
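A minimal skeleton of this best-first loop is sketched below. The rule applications are abstracted into a user-supplied `expand` function and the heuristic into `score`; both names are placeholders, not the paper's actual implementation, and derivation bookkeeping is omitted.

```python
import heapq

# Minimal skeleton of the forward search: best-first expansion with a
# max-priority queue (negated scores), a hash set of identified terms,
# and FIFO tie-breaking via an insertion counter.
def forward_search(inputs, target, expand, score):
    known = set(inputs)                        # hash set of identified terms
    counter = 0                                # FIFO tie-breaking
    queue = []
    for d in inputs:
        heapq.heappush(queue, (-score(d, target), counter, d))
        counter += 1
    if target in known:
        return True                            # trivially identifiable
    while queue:
        _, _, current = heapq.heappop(queue)   # closest unexpanded term
        for derived in expand(current, known):
            if derived in known:
                continue                       # never re-derive a term
            known.add(derived)
            if derived == target:
                return True
            heapq.heappush(queue, (-score(derived, target), counter, derived))
            counter += 1
    return False                               # queue empty: non-identifiable

# Toy usage: terms are integers, expansion increments up to a bound.
expand = lambda n, known: [n + 1] if n < 5 else []
score = lambda d, t: -abs(d - t)
print(forward_search([0], 3, expand, score))    # True
print(forward_search([0], 10, expand, score))   # False
```

The hash set plays the role of the hash table of identified distributions, and exhausting the queue corresponds to the non-identifiability verdict on line 20.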
We keep track of the rules that were used to derive each new distribution in the search. This allows us to construct a graph of the derivation, where each root node is a member of the original input set and their descendants are the distributions derived from them during the search. Each edge represents a manipulation of the parent node(s) to obtain the child node. For an identifiable target quantity, the formula is obtained by backtracking the chain of manipulations recursively until the roots are reached (line 16). The derivation of the example at the beginning of Section 3, depicted in Figure 1, can be found efficiently by applying this procedure.
3.4 Soundness and Completeness Properties
We are ready to establish some key theoretical properties of dosearch. The first theorem considers the correctness of the search.
Theorem 2 (Soundness).
do-search always terminates: if it returns an expression for the target, the expression is correct; if it returns NA, then the target is not identifiable with respect to the rules of do-calculus and the standard probability manipulations of Table 2.
Proof.
Each new distribution is derived using only well-defined manipulations, as outlined by Table 3, and by ensuring that the required d-separation criteria hold in the graph whenever rules of do-calculus are concerned. It follows that if the search terminates and returns a formula for the target distribution, the target was reached from the set of input distributions through a chain of valid manipulations. If do-search terminates as a result of Theorem 1, we are done. Suppose now that Theorem 1 does not apply. By definition, do-search enumerates every rule of Table 2 for every well-defined subset of Table 3. By Lemma 1, no distributions are left out by applying the termination criteria of Table 3. If some rules of Table 3 are omitted, the distributions generated by those rules can be obtained through combinations of the remaining rules. Furthermore, the order in which the distributions are expanded has no effect, as every possible manipulation is still carried out. The search will eventually terminate, since distributions that have already been derived are not added again to the set of unexpanded distributions and there are only finitely many ways to apply the rules of Table 2. ∎
The following theorem provides a completeness result in connection to existing identifiability results. Since do-calculus has been shown to be complete with respect to (conditional) causal effect identifiability, identifiability using surrogate experiments, and transportability, it follows that do-search is complete for these problems as well.
Theorem 3 (Completeness).
If do-search returns NA in the settings of rows 1–4 in Table 1, then the query is non-identifiable.
Proof.
Do-calculus has been shown complete in these settings. The rules of probability calculus encode what is used in the corresponding algorithms, as can be seen, for example, from the proofs of Theorem 7 and Lemmas 4–8 of Shpitser and Pearl (2006a). ∎
It is not known whether the rules implemented in do-search are sufficient for other, more general identifiability problems, since it is conceivable that additional rules might be required to achieve completeness. One such generalization is the inclusion of missing data in the causal model, which we present in Section 5. However, if one were to show that do-calculus (or any other set of rules included in do-search) is complete for some special case of the generalized identifiability problem, then do-search would be complete for this problem as well. In the following sections we use the term “identifiable by do-search” to refer to causal queries that can be identified by do-search.
3.5 Simulations
We implemented do-search (Algorithm 2) in C++. Here we report the findings of a simulation study assessing the running-time performance of do-search and the impact of the improvements outlined in Section 3.2 as well as the search heuristic described in Section 3.2.5.
Our synthetic simulation scenario consisted of semi-Markovian causal graphs whose vertices were generated at random by first drawing a random topological order of the vertices and then random lower-triangular adjacency matrices for both directed and bidirected edges. Graphs without a directed path from the treatment variable to the outcome variable were discarded. We sequentially sampled input distributions at random by generating disjoint subsets of variables such that the output set of each distribution is always non-empty. This was continued until the target quantity was found to be identifiable by the search. For each graph, we then recorded the search times for the first set of inputs that resulted in the query being identified and for the last set for which the target was non-identifiable. In other words, each graph generates two simulation instances: one for an identifiable query and one for a non-identifiable query. This setting corresponds directly to the setting of partially overlapping experimental data sets discussed in Section 2.2, for which no other algorithmic solutions exist.
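The graph-generation step can be sketched as follows. This is a minimal illustration under the assumption that vertices are labeled in topological order, so that strictly lower-triangular adjacency matrices guarantee acyclicity; the edge probabilities are arbitrary example values, not those used in the study.

```python
import numpy as np

def random_semi_markovian(n, p_dir=0.3, p_bidir=0.2, seed=None):
    """Draw strictly lower-triangular adjacency matrices for directed and
    bidirected edges. Entry [i, j] = True (i > j) encodes an edge j -> i
    (directed matrix) or j <-> i (bidirected matrix)."""
    rng = np.random.default_rng(seed)
    tril = np.tril(np.ones((n, n), dtype=bool), k=-1)  # strictly below diagonal
    directed = tril & (rng.random((n, n)) < p_dir)
    bidirected = tril & (rng.random((n, n)) < p_bidir)
    return directed, bidirected

def has_directed_path(directed, src, dst):
    """Depth-first search for a directed path, used to discard graphs in
    which the outcome is unreachable from the treatment."""
    stack, seen = [src], {src}
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        for w in np.flatnonzero(directed[:, v]):  # children of v
            if int(w) not in seen:
                seen.add(int(w))
                stack.append(int(w))
    return False
```

Rejection sampling with `has_directed_path` then mirrors the discarding step of the simulation scenario.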
To understand the impact of the search heuristic and the various improvements, we compare four different search configurations: the basic do-search without the search heuristic or improvements (in this configuration, terms are expanded in the order they were identified and the conditions of Table 3 are not checked), one that uses only the search heuristic, one that uses only the improvements of Section 3.2, and one that uses both.
Figure 3 shows the search times of the configurations compared to the basic configuration for identifiable instances. Most importantly, the vast majority of instances (93%) are solved faster than with the basic configuration when both the heuristic and the improvements are used. The average search time with both enabled was 32.7 seconds, compared to 75.2 seconds for the basic configuration. The search heuristic provides the greatest benefit for these instances, as can be seen from Figure 3. Using a heuristic can also hinder performance by leading the search astray and by incurring additional computation in the evaluation of the proximity function. For example, in a small number of instances the search is over ten times slower than the basic configuration when the heuristic is used. Fortunately, there are several instances in the opposite direction, where the heuristic provides an over hundredfold reduction in search time. Curiously, even using the improvements sometimes results in slower search times. This is most likely due to the elimination of rule 1 of do-calculus, since the basic search may be able to use this rule to reach the target distribution faster. More importantly, Figure 3 shows that the improvements clearly benefit the search, and the benefit tends to increase as the instances get harder.
Figure 4 shows the search times of the configurations for non-identifiable instances. Relying only on the search heuristic provides no benefit here, as expected. The improvements to the search are most valuable for these instances: every non-identifiable instance was solved faster than the baseline when the improvements were used, and when the improvements were applied together with the heuristic, only three instances were slower than the baseline. The average search time with both the heuristic and the improvements enabled was 105.2 seconds, compared to 139.7 seconds for the basic configuration. The near-zero search times are a result of Theorem 1, where no search has to be performed to determine that the instance is non-identifiable. For these instances as well, the benefit of the improvements tends to increase as the instances get harder.
Finally, we examined the average run-time performance of do-search with all improvements and heuristics enabled. We replicated the previously described simulation scenario with the same number of instances (1071) for graphs of up to 10 vertices. Figure 5 shows boxplots of search times on a log scale for graphs of different sizes, including both identifiable and non-identifiable instances. Note that for every graph size there are a number of easily solvable instances that show up as outliers in the plot. Instances with 10 nodes are routinely solved in under 100 seconds. The running times increase exponentially with graph size, that is, with the number of variables.
4 New Causal Effect Identification Results
We present a number of results for various identifiability problems to showcase the versatility of do-search.
4.1 Multiple Data Sources with Partially Overlapping Variable Sets
Earlier generalizations of the identifiability problem assume nested experiments or entire distributions, with the exception of surrogate outcome identifiability (Tikka and Karvanen, 2018b), which has its own intricate assumptions regarding the available distributions. None of these assumptions are needed by do-search, which can be used to solve identifiability problems from completely arbitrary collections of input distributions.
We showcase identifiability from multiple experimental distributions with two examples. In the first example, we consider identifiability of the target quantity in the graph of Figure 6(a) from four input distributions. The target quantity is identifiable, and do-search produces the following formula for it:
In the second example, we consider identifiability of the target quantity in the graph of Figure 6(b) from seven input distributions. Again, the target quantity is identifiable, and do-search outputs the following formula:
This example shows that a heuristic approach can also help us find shorter formulas. If we run do-search again without the heuristic in this instance, the output formula is instead:
4.2 Combining Transportability and Selection Bias
Input distributions that originate from multiple source domains while simultaneously being affected by selection bias can also be handled by do-search. This kind of problem cannot be solved with the algorithms RC or TR of Table 1. As an example, we consider one source domain and a target domain with two input data sets: a biased distribution from the target domain and an unbiased experimental distribution from the source domain. We evaluate the query in the graph of Figure 7 using these inputs. In the figure, the transportability node is depicted as a gray square and the selection bias node as an open double circle.
The query is identifiable, and do-search outputs the following formula for it:
4.3 Recovering from Multiple Sources of Selection Bias
We present an example where bias originates from two sources, with two input data sets: a distribution affected by both biasing mechanisms and a distribution affected by only a single bias source. We evaluate the query in the graph of Figure 8 using these inputs.
The query is identifiable, and the following formula is obtained by do-search:
5 Extension to Missing Data Problems
The SCM framework can be extended to describe missing data mechanisms. For each variable $X$ that is not fully observed, two special vertices are added to the causal graph. The vertex $X^*$ is the observed proxy variable, which is linked to the true variable $X$ via the missingness mechanism (Little and Rubin, 1986; Mohan et al., 2013):
$$
x^* = \begin{cases} x, & \text{if } r_X = 1, \\ \text{NA}, & \text{if } r_X = 0, \end{cases} \qquad (1)
$$
where NA denotes a missing value and $R_X$ is called the response indicator (of $X$). In other words, the variable that is actually observed matches the true value when it is not missing ($R_X = 1$). Figure 10 depicts some examples of graphs containing missing data mechanisms.
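The proxy-variable mechanism of Equation (1) can be illustrated on data arrays. This is a hedged sketch: the function name and the use of `np.nan` as the NA value are choices made for the example, not part of the framework.

```python
import numpy as np

def proxy(x, r):
    """Observed proxy per the missingness mechanism: the proxy equals the
    true value where the response indicator is 1, and NA (here np.nan)
    where the indicator is 0."""
    x = np.asarray(x, dtype=float)
    r = np.asarray(r)
    return np.where(r == 1, x, np.nan)
```

Applying `proxy` to a sample of true values and response indicators yields exactly the partially observed column that an analyst would see in practice.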
The vertices of the causal diagram are partitioned into four categories: the set of fully observed variables, the set of partially observed variables, the set of all proxy variables, and the set of response indicators. Our method is also capable of processing queries when the causal graph contains missing data mechanisms; in this case the variable sets of the input distributions are restricted to contain observed variables. An active response indicator is one that takes the value indicating an observed response, and we denote such indicators with a special notation. Proxy variables are not explicitly shown in the graphs of this section for clarity.
Determining identifiability is more challenging under missing data. As evidence of this, even some non-interventional queries require the application of do-calculus (Mohan and Pearl, 2018). Furthermore, the rules of Table 2 used in the search are no longer sufficient, and deriving the desired quantity necessitates additional rules that stem from the definitions of the proxy variables and the response indicators. Each new partially observed variable also has a higher impact on computational complexity, since the corresponding response indicator and proxy variable are always added to the graph as well.
Table 4 extends the set of rules of Table 2 to missing data problems by providing manipulations related to the missingness mechanism. The missing-data column lists the extended requirements for the valid subsets when missing data mechanisms are present in the graph. The following notation is used in the table: the set of active response indicators of the current term, the set of partially observed variables corresponding to the proxy variables present in the current term, and the set of proxy variables corresponding to the partially observed variables present in the current term. The analogous sets with respect to the target are defined accordingly.
[Table 4, excerpt. Columns: Rule | Input | Additional Input | Output | Description. The first listed manipulation is the insertion of observations.]