Joint Causal Inference from Observational and Experimental Datasets

11/30/2016 ∙ by Sara Magliacane, et al. ∙ Radboud Universiteit ∙ University of Amsterdam

We introduce Joint Causal Inference (JCI), a powerful formulation of causal discovery from multiple datasets that allows both the causal structure and the targets of interventions to be learned jointly from statistical independences in pooled data. Compared with existing constraint-based approaches for causal discovery from multiple datasets, JCI offers several advantages: it handles several different types of interventions in a unified fashion, it can learn intervention targets, it systematically pools data across different datasets, which improves the statistical power of independence tests, and, most importantly, it improves the accuracy and identifiability of the predicted causal relations. A technical complication that arises in JCI is the occurrence of faithfulness violations due to deterministic relations. We propose a simple but effective strategy for dealing with this type of faithfulness violation. We implement it in ACID, a determinism-tolerant extension of Ancestral Causal Inference (ACI) (Magliacane et al., 2016), a recently proposed logic-based causal discovery method that improves the reliability of its output by exploiting redundant information in the data. We illustrate the benefits of JCI with ACID in an evaluation on a simulated dataset.


1 Introduction

Discovering causal relations from data is at the foundation of the scientific method. Traditionally, causal relations are either recovered from experimental data in which the variable of interest is perturbed, or from purely observational data, e.g., using the seminal PC and FCI algorithms (Spirtes2000; Zhang:2008:COR:1414091.1414237).

In recent years, several methods for combining observational and experimental data to discover causal relations have been proposed, showing that this combination can improve greatly on the accuracy and identifiability of the predicted causal relations. Some of the proposed methods are score-based (e.g., CooperYoo1999; TianPearl2001; EatonMurphy07; hauser2012characterization; MooijHeskes_UAI_13), i.e., they evaluate models using a penalized likelihood score, while others (e.g., TianPearl2001; Claassen++_NIPS2010; antti; triantafillou2015constraint; ICP; ACI; borboudakistowards) are constraint-based, i.e., they use statistical independences to express constraints over possible models.

In this work we propose Joint Causal Inference (JCI), a formulation of causal discovery over multiple datasets in which both the causal structure and targets of interventions are jointly learnt from independence test results in pooled data. A related approach was already proposed for score-based methods by EatonMurphy07, but here we extend it so that constraint-based methods can be employed. Our goal is to combine the idea of joint inference from observational and experimental data from EatonMurphy07 with the advantages that constraint-based methods have over score-based methods, namely, the ability to handle latent confounders and selection bias naturally in a nonparametric approach, and, especially in the case of logic-based methods, an easy integration of background knowledge.

Existing constraint-based methods for multiple datasets typically learn the causal structure on each dataset separately and then merge the learnt structures (e.g., Claassen++_NIPS2010; antti; triantafillou2015constraint; borboudakistowards). The merging process depends on the type of interventions, and most existing methods support only interventions on known targets. Instead, JCI: (1) allows for several different types of interventions; (2) can learn the intervention targets; (3) systematically pools data across different datasets, which improves the statistical power of independence tests; and (4) improves the identifiability and accuracy of the predicted causal relations.

On the other hand, JCI poses challenges for current constraint-based methods because of their susceptibility to violations of the Causal Faithfulness assumption. Specifically, JCI induces faithfulness violations due to deterministic relations, which would typically result in erroneous inferences with standard constraint-based methods. We propose a simple but effective strategy for dealing with this type of faithfulness violation. The strategy can be applied to any constraint-based causal discovery method for observational data that can handle partial inputs, i.e., missing results for certain independence tests, thus extending it to a JCI method that can handle a combination of observational and experimental data. We implement the strategy in ACID (Ancestral Causal Inference with Determinism), a determinism-tolerant extension of ACI (Ancestral Causal Inference) (ACI), a recently proposed logic-based causal discovery method that improves the reliability of its output by exploiting redundant information in its input. In our evaluation on synthetic data, we show that JCI with ACID improves the accuracy of the causal predictions with respect to simply merging separately learned causal graphs, illustrating the advantage of joint causal discovery.

2 Preliminaries

In this section we review a few useful concepts from the related work and introduce the notation we use in the rest of the paper. Most of the concepts described here are explained in detail in the seminal books by Pearl2009 and Spirtes2000. In the following, we represent variables with uppercase letters, while sets of variables are denoted by boldface.

2.1 Graph terminology

Throughout the paper we assume that the data generating process can be modeled by a causal Directed Acyclic Graph (DAG) that may contain latent variables. For simplicity, we do not consider selection bias. A directed edge X → Y in the causal DAG represents a direct causal relationship of cause X on effect Y. We then say that X is a parent of Y, and denote the set of parents of Y as pa(Y). A sequence of directed edges X1 → X2 → … → Xn is a directed path. If there is a directed path from X to Y (or X = Y), then X is an ancestor of Y (denoted as X ⇝ Y). We denote the set of ancestors of Y as an(Y). If there is no directed path from X to Y (and X ≠ Y), we denote this as X ⇝̸ Y. A sequence of unique nodes X1, …, Xn such that each pair (Xi, Xi+1) is connected by an edge is called a path. A collider is a node Xi on a path that has two incoming arrowheads from both its neighboring nodes on the path, i.e., of the form → Xi ←. Any other node on a path (including the end nodes X1 and Xn) is called a non-collider.

For a set of variables X, we extend the definition of parents to pa(X), the union of the parents of all variables in X. We similarly define an(X) as the set of ancestors of all variables in X. For sets X and Y, we write X ⇝ Y if there exists at least one y ∈ Y that has some x ∈ X as an ancestor, i.e., x ⇝ y. We write X ⇝̸ Y if x ⇝̸ y for all x ∈ X and y ∈ Y.

2.2 Independences and d-separation

For disjoint sets of random variables X, Y and Z, distributed according to a joint probability distribution P, we denote the conditional independence of X and Y given Z in P as X ⊥⊥ Y | Z [P], and conditional dependence as X ⊥̸⊥ Y | Z [P]. We often omit P when it is obvious from the context which probability distribution we are referring to. We call the cardinality |Z| the order of the conditional (in)dependence relation.

A well-known graphical criterion for DAGs, with many implications for causal discovery, is d-separation (Pearl2009; Spirtes2000): for disjoint sets of variables X, Y and Z in a DAG G, we say that X is d-separated from Y by Z in G, written X ⊥ Y | Z [G], iff every path p in G that connects any x ∈ X with any y ∈ Y is blocked by Z, i.e., at least one of the following holds:

  • p contains a collider that is not an ancestor of Z, or

  • p contains a non-collider that is in Z.

The opposite, i.e., d-connection, is denoted as X ⊥̸ Y | Z [G]. We often omit G from the notation when it is obvious which DAG we are referring to.
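As a concrete illustration, d-separation can be checked algorithmically via the well-known moralization criterion: restrict the DAG to the ancestors of X ∪ Y ∪ Z, "marry" parents that share a child, drop edge directions, remove Z, and test whether X and Y are still connected. The following sketch is our own minimal implementation (the child-to-parents dictionary encoding is an assumption of this example, not notation from the paper):

```python
from itertools import combinations

def ancestors(parents, nodes):
    """All ancestors of `nodes` in the DAG, including the nodes themselves."""
    result, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in result:
            result.add(v)
            stack.extend(parents.get(v, ()))
    return result

def d_separated(parents, X, Y, Z):
    """Check X ⊥ Y | Z in a DAG given as {child: [parents]} via moralization."""
    X, Y, Z = set(X), set(Y), set(Z)
    keep = ancestors(parents, X | Y | Z)
    # Moral graph of the ancestral subgraph: undirect edges, marry co-parents.
    adj = {v: set() for v in keep}
    for child in keep:
        pa = [p for p in parents.get(child, ()) if p in keep]
        for p in pa:
            adj[child].add(p); adj[p].add(child)
        for p, q in combinations(pa, 2):
            adj[p].add(q); adj[q].add(p)
    # Remove the conditioning set and test reachability from X to Y.
    frontier, seen = list(X - Z), set(X - Z)
    while frontier:
        v = frontier.pop()
        if v in Y:
            return False  # still connected => d-connected
        for w in adj[v] - Z:
            if w not in seen:
                seen.add(w); frontier.append(w)
    return True
```

For instance, in the chain A → B → C, conditioning on B blocks the path, while in the collider A → C ← B, conditioning on C opens it.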

The following two key assumptions for constraint-based causal discovery have been thoroughly discussed in the literature (see for example Spirtes2000). They connect conditional independences in the observational distribution P with d-separations in the underlying causal DAG G.

  • Causal Markov Assumption: d-separation in the causal DAG G implies conditional independence in the observational distribution P. For all disjoint sets of variables X, Y and Z:

    X ⊥ Y | Z [G] ⟹ X ⊥⊥ Y | Z [P],

    which can also be expressed contrapositively as:

    X ⊥̸⊥ Y | Z [P] ⟹ X ⊥̸ Y | Z [G].

  • Causal Faithfulness Assumption: the inverse, i.e., for all disjoint sets of variables X, Y and Z:

    X ⊥⊥ Y | Z [P] ⟹ X ⊥ Y | Z [G].

If we assume both the Causal Markov and Causal Faithfulness assumptions to hold, the conditional independences of the observational distribution correspond one-to-one with the d-separations in the causal DAG. This setting is very favourable for causal discovery, thus both assumptions are usually made in constraint-based approaches.

2.3 Deterministic relations and faithfulness violations

Figure 1: Examples of faithfulness violations (left: a functionally determined relation; right: a non-functionally determined relation). In the left example, the faithfulness violations are due to a functionally determined relation, in which the parent fully determines the child. In the right example, two of the variables are binary and the third is their XOR; conditioning on two of them fully determines the remaining one, even though the conditioned variables are not its ancestors. This creates a series of non-trivial faithfulness violations. In both graphs, we represent the variables resulting from structural equations without noise terms with a double-circled node.

Although often reasonable, the Causal Faithfulness assumption is violated in some cases, notably in the common case of deterministic relations among variables, e.g., for Structural Causal Model equations (Pearl2009) in which there is no noise term. Some of the faithfulness violations related to determinism are captured by an extension of the d-separation criterion, the D-separation criterion, first introduced by Geiger1990 and later extended by Spirtes2000. Under the Causal Markov assumption, the formulation of D-separation presented by Geiger1990 is proven to be complete for the restricted setting where determinism arises only due to functionally determined relations, defined recursively as variables that are fully determined by their parents. More precisely: a variable X is functionally determined by a set of variables D for a given DAG G if X ∈ D, or if X is a deterministic function of its parents and all parents of X are functionally determined by D.

We show an example of faithfulness violations due to a functionally determined relation in Figure 1 (left). Following Geiger1990, we represent the variables resulting from structural equations without noise terms with a double circled node.

In Section 3.8 of their book, Spirtes2000 extend D-separation to also model some deterministic relations that are due to variables being determined when conditioning on their non-ancestors. In Figure 1 (right) we show an example in which the definition of D-separation from Geiger1990 fails to capture the faithfulness violation due to the deterministic relation between non-ancestors, but the extended notion of D-separation by Spirtes2000 correctly captures it.

Although the version of D-separation by Spirtes2000 retains completeness for the restricted case of functionally determined relations, it is not proven to be complete in general. Nevertheless, Spirtes2000 introduce several useful concepts for handling general deterministic relations, so we summarize their findings here, adapting them to our notation. We start with the assumption that we have complete knowledge of all deterministic relations in the system.

Assumption 1

D is a complete set of all the deterministic relations among variables, where each entry (X, D) in the set indicates that variable X is a deterministic function of the set of variables D, but not of any strict subset of D.

For example, in Figure 1 (left) the set contains the single relation in which the child is determined by its parent, while in Figure 1 (right) it contains the three relations induced by the XOR function (each variable is determined by the other two). This assumption is not as restrictive as it may seem at first, because in practice one can easily reconstruct deterministic relations in the data using several standard methods. We use D to define a function that maps a given set of variables to the set of all variables that are determined by it: given a set of variables W and a complete set of deterministic relations D, we define Det_D(W) as the set of variables determined according to D by (a subset of) W. We omit D, writing only Det(W), if it is obvious from the context which set we are referring to. Note that Det(W) trivially includes W itself. Also, any variable with constant value is by definition in Det(W) for all W, as it is determined by the empty set. Following Spirtes2000, we can use Det to extend d-separation for deterministic relations. Given a DAG G, three disjoint sets of variables X, Y and Z, and the complete set of deterministic relations D, we define X and Y to be D-separated by Z w.r.t. G and D, denoted as X ⊥_D Y | Z [G, D], iff for every path p in G between any x ∈ X and any y ∈ Y, at least one of the following holds:

  • p contains a collider that is not an ancestor of Det(Z), or

  • p contains a non-collider in Det(Z). (Note that we also refer to the end nodes of a path as non-colliders.)

If D = ∅, D-separation reduces to standard d-separation. We omit D and G if it is obvious from the context which set and graph we are referring to. Under the Causal Markov Assumption, this formulation of D-separation is proven to imply independence (Spirtes2000). More precisely: X ⊥_D Y | Z [G, D] ⟹ X ⊥⊥ Y | Z [P] if P is Markov with respect to G and D is the complete set of deterministic relations that hold in P. For the case of functionally determined relations, this version of D-separation is complete, i.e., it identifies all additional independences due to functionally determined relations, because in that setting it reduces to the version of D-separation by Geiger1990, which was shown to be complete in that work.
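The function Det(W) can be computed as a simple fixed-point closure over the set of deterministic relations; by the definition above, D-separation by Z then essentially amounts to ordinary d-separation by the enlarged set Det(Z). A minimal sketch of the closure (the variable names for the XOR example are our own labels, not the paper's):

```python
def det_closure(W, det_relations):
    """Det(W): all variables determined by W, given relations (X, D)
    meaning that X is a deterministic function of the set D."""
    result = set(W)
    changed = True
    while changed:
        changed = False
        for x, dom in det_relations:
            # x becomes determined once its whole domain is determined.
            if x not in result and set(dom) <= result:
                result.add(x)
                changed = True
    return result

# Relations induced by X3 = XOR(X1, X2): each variable is a function
# of the other two (hypothetical labels for the Figure 1 right example).
xor_relations = [("X3", {"X1", "X2"}),
                 ("X2", {"X1", "X3"}),
                 ("X1", {"X2", "X3"})]
```

For instance, `det_closure({"X1", "X3"}, xor_relations)` returns all three variables, matching the faithfulness violation discussed above, while `det_closure({"X1"}, xor_relations)` adds nothing.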

3 Related work

Given a set of observational and interventional datasets, most constraint-based methods that handle multiple datasets learn the causal structure from each dataset separately and then merge the learned structures (e.g., Claassen++_NIPS2010; antti; triantafillou2015constraint; borboudakistowards). Some of these methods, e.g., COMBINE (triantafillou2015constraint), perform these two steps sequentially, applying a greedy procedure to resolve any potential conflict from the causal structures learnt in the first step. Others, e.g., HEJ (antti), combine learning and merging in a single procedure, solving potential conflicts by formulating an optimization problem. A recent approach, ETIO (borboudakistowards), combines the aspects of the previous approaches by learning and merging in a single procedure, but using a greedy algorithm for resolving conflicts.

Merging causal structures learnt on each dataset separately has several drawbacks compared with a method that can use all datasets jointly, such as the score-based method of EatonMurphy07. First, merging approaches require known targets and cannot learn the targets of the interventions, since this type of information is only available when considering multiple datasets jointly. Moreover, they cannot take advantage of certain interventional datasets, e.g., when there is a single data point per interventional setting, as happens for example in a popular genomics dataset (Kemmeren2014). Two other important drawbacks are a loss of statistical power because of the separation into smaller isolated datasets and, as we will show with some examples in Section 4.1, fewer identifiable relations with respect to a joint causal inference method.

There is some related work on special cases in which to apply constraint-based methods with mixtures of observational and experimental datasets (e.g., TianPearl2001; Eberhardt2008; Lagani2012; ICP; borboudakistowards), but the problem has not been systematically discussed and formalized in a general framework yet. In particular, to the best of our knowledge, no existing work addresses learning the intervention targets, possibly jointly with the causal graph, from independence tests.

The approach described by TianPearl2001 learns the causal structure by combining a standard constraint-based method on observational data with information extracted from changes in the marginal probability of each variable. This information can be extracted from a sequence of (interventional) datasets by comparing each pair of datasets in the sequence, under the assumption that the only difference between the two datasets is a mechanism change on a single known variable. This is a quite restrictive assumption in practice.

Other approaches describe sufficient, although restrictive, conditions under which pooling data does not change the conditional distribution of the variables under consideration. In particular, Eberhardt2008 describes how naively pooling data from different experimental settings, while discarding the information of which experimental setting a sample was taken, may give wrong results. Thus Eberhardt2008 proposes a sufficient condition that allows one to pool data for a given independence test when the conditional distribution of the tested variables is the same in all experimental conditions. Lagani2012 present two other approaches: (i) perform the conditional independence tests separately in each dataset, then define the pooled dependence as a disjunction of the single dependences; (ii) pool experimental conditions that differ only in the value of at most one intervened variable. Both of these approaches describe restrictive conditions in which one can pool datasets, while in this paper we argue that, when done systematically, e.g., as we will show in the next section, one can always pool all available datasets.

Other approaches, like Invariant Causal Prediction (ICP) (ICP), focus on certain specific combinations of independence tests that are performed jointly on all datasets. ICP is a causal discovery method that looks for invariance across different experimental settings, returning a conservative subset of ancestors (or parents, if one assumes there are no latent confounders) for a given target variable Y. The main assumption is that the conditional distribution of Y given its parents does not change in the different interventional settings (in particular, that Y is not directly intervened upon). This assumption is also referred to as invariance or modularity (Pearl2009; Spirtes2000). Since the method searches for patterns that are invariant across different settings, it can safely pool together a subset of settings into a new virtual “experimental” setting to increase the statistical power for settings with few data. On the other hand, as we show with some examples in Section 4.2, the conservativeness of the ICP estimates sometimes significantly reduces the causal information that can be inferred. Like ICP, JCI makes an invariance assumption that allows it to combine different datasets. The invariance assumption made by JCI is that the causal structure is invariant across experimental settings (but model parameters are allowed to change).

4 Joint Causal Inference (JCI)

We propose to jointly model n observational or experimental datasets D_1, …, D_n with a single causal graph. We assume that there is a unique causal DAG underlying all of these datasets, defined over the same set of variables, which we call the system variables X_1, …, X_p, some of which are possibly hidden.

Each dataset D_r has an associated joint probability distribution P_r and represents the data points collected after a set of interventions on possibly unknown intervention targets. In the context of this paper, observational data are simply datasets with an empty set of interventions. We assume each distribution P_r (r = 1, …, n) to be Markov and faithful with respect to the causal DAG. This assumption precludes certain types of interventions, notably perfect interventions (Pearl2009). On the other hand, it allows for many other types of interventions, e.g., soft interventions (Markowetz++2005), mechanism changes (TianPearl2001), fat-hand interventions (EatonMurphy07), activity interventions (MooijHeskes_UAI_13), etc., as long as they do not induce new (in)dependences that would amount to a modification of the underlying DAG.
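To make this concrete, here is a toy simulation (our own construction, with hypothetical variable names X1 and X2) of regimes sharing the same DAG X1 → X2, where a soft intervention shifts the mean of X1 while leaving the mechanism for X2 untouched, so each regime stays Markov and faithful with respect to the same DAG:

```python
import random

def sample_regime(n, slope, noise, shift=0.0):
    """Sample n points from the DAG X1 -> X2 with linear-Gaussian mechanisms.
    A soft intervention is modeled by shifting the mean of X1 via `shift`;
    the causal mechanism X2 = slope * X1 + noise is shared across regimes."""
    data = []
    for _ in range(n):
        x1 = random.gauss(shift, 1.0)
        x2 = slope * x1 + random.gauss(0.0, noise)
        data.append({"X1": x1, "X2": x2})
    return data

# Observational regime vs. a soft intervention on X1:
observational = sample_regime(100, slope=2.0, noise=0.1)
intervened = sample_regime(100, slope=2.0, noise=0.1, shift=3.0)
```

The marginal of X1 differs across the two regimes, but the conditional of X2 given X1 is invariant, which is exactly the kind of mechanism-preserving intervention that the JCI assumptions allow.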

Figure 2: Prototypical example of the JCI setting: a set of four experimental datasets in raw form (left), in pooled tabular form with the addition of dummy variables (center), and as a causal DAG (right) representing the causal structure of the system variables, the regime variable R and the intervention variables. One intervention variable represents the temperature at which each experiment was performed, while another represents the addition of a drug in some of the experiments.

Using the terminology of Dawid2002, we call the different distributions in the datasets regimes. In related work, different names have been used, e.g., experimental conditions or environments (MooijHeskes_UAI_13; ICP). We introduce two types of dummy variables in the data:

  • a regime variable R, representing which dataset a data point is from, i.e., R = r for data from D_r;

  • intervention variables I_1, …, I_m, which are deterministic functions of the regime R. Intervention variables represent the interventions performed in each dataset. In the absence of any information on the interventions performed in the datasets, we can use as intervention variables the indicator variables of each of the datasets.
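Constructing the pooled data with these dummy variables is mechanical: append R and the intervention columns (looked up in the experimental design) to every record. A minimal sketch under our own data layout (lists of dict-records and a per-regime design mapping, both assumptions of this example):

```python
def pool_datasets(datasets, design):
    """Pool several datasets into one table, adding the regime variable R
    and the intervention variables specified by the experimental design.

    datasets: list of datasets, each a list of {variable: value} records.
    design:   {regime: {intervention_variable: value}}."""
    pooled = []
    for r, data in enumerate(datasets, start=1):
        for row in data:
            rec = dict(row)
            rec["R"] = r                      # regime dummy variable
            rec.update(design[r])             # intervention dummy variables
            pooled.append(rec)
    return pooled
```

For example, pooling an observational dataset (I1 = 0) with an interventional one (I1 = 1) yields a single table over the system variables plus R and I1, on which independence tests can be run jointly.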

We can now state the main assumption of JCI.

Assumption 2

We assume that the causal relations between the system variables X_1, …, X_p and the introduced dummy variables R and I_1, …, I_m can be represented as an acyclic Structural Causal Model (SCM) with jointly independent exogenous variables E_1, …, E_p:

    R = E_R,
    I_i = f_i(R)                            for i = 1, …, m,
    X_j = g_j(X_{pa(j)}, I_{ipa(j)}, E_j)   for j = 1, …, p.

Here, X_{pa(j)} are the system variable parents of X_j, while I_{ipa(j)} denote its intervention parents and E_j is the exogenous parent of X_j. The distribution P_r corresponding to dataset D_r is given by P(X_1, …, X_p | R = r).

We denote the corresponding causal DAG as G. The Causal Markov assumption then holds by construction. We show an example in Figure 2, where we model four datasets with the same underlying causal structure.

The JCI assumptions are applicable when the values of the regime/intervention variables are determined by the experimenter before the system variables are measured. More generally, the assumption is that the system variables cannot cause the regime/intervention variables. In addition, we assume that the values set by the experimenter are chosen independently of any other possible cause of the system variables. In other words, we assume there to be no confounders between the regime variable and the system variables, nor between the intervention variables and the system variables. For the purposes of causal discovery as intended in this paper, nothing else really distinguishes the regime/intervention variables from the system variables: both can be considered random variables, where the distribution of the regime/intervention variables simply reflects the empirical distribution of the experimental design chosen by the experimenter. Moreover, there is no distinction between observational and experimental datasets, allowing for several observational datasets, possibly from different contexts.

Intervention variables are functions of the regime variable and do not have any associated noise; this means that they are determined by the regime. We represent these functions as a matrix: we define the experimental design matrix as the matrix representing the functional relations between R and each intervention variable I_i, together with the corresponding probabilities P(R = r) = N_r / N of the regime variable, where N_r is the number of data points in dataset D_r and N is the total number of data points. We assume that the intervention variables are complete in the sense that every effect of the regime variable is mediated through an intervention variable. In other words, we assume that there are no direct effects of R on any of the system variables.

In general, other deterministic relations between dummy variables may arise. For example, consider the experimental design matrix in Table 1 (left), in which one intervention variable represents a drug that was added when the regime is an odd number, while another intervention variable indicates a different drug that was added when the regime is an even number. These two variables determine each other. Even though this is clear from the experimental design matrix, it is not visible in the causal influence diagram.

    R   I1   I2   I3   P(R)        R   I12  I3   P(R)
    1   0    1    0    0.375       1   0    0    0.375
    2   1    0    0    0.125       2   1    0    0.125
    3   0    1    1    0.2         3   0    1    0.2
    4   1    0    1    0.3         4   1    1    0.3

Table 1: Example of an experimental design matrix with additional deterministic relations beyond the ones allowed in JCI (left) and a reduced version with only allowed deterministic relations (right). In the right version, we joined I1 and I2 into a single intervention variable I12 representing the addition of a single drug (U0126 when the value is 0, Akt-Inh when it is 1).

In this paper, we focus on a special case by allowing only certain types of deterministic relations. We assume that the regime R determines each of the intervention variables I_i. Optionally, we allow one additional deterministic relation, namely that all intervention variables together determine the regime R. We assume that there are no other deterministic relations.

Assumption 3

The deterministic relations that hold in the joint distribution are (I_i, {R}) for all i = 1, …, m, and optionally, (R, {I_1, …, I_m}). No other deterministic relations hold in the joint distribution.

In practice, one can often “normalize” a system that does not satisfy this assumption. For example, the experimental design matrix in Table 1 (left) also contains deterministic relations that are not allowed in JCI, but these arise from “redundant” intervention variables. Table 1 (right) shows how joining two intervention variables can yield a “normalized” system that satisfies the JCI assumptions.
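One simple way to perform this normalization automatically is to scan the design matrix for intervention columns that mutually determine an already-kept column, and drop them. The sketch below uses our own column-based encoding of the design matrix (a dict of value lists, one entry per regime) and is only a stand-in for the manual merging described above:

```python
def is_function(xs, ys):
    """True if the column ys is a function of the column xs across regimes."""
    mapping = {}
    for x, y in zip(xs, ys):
        if mapping.setdefault(x, y) != y:
            return False
    return True

def normalize_design(columns):
    """Drop intervention columns that mutually determine a kept column,
    keeping the first member of each such group (cf. Table 1, where I2 is
    redundant given I1)."""
    kept = []
    for name in columns:
        redundant = any(
            is_function(columns[k], columns[name]) and
            is_function(columns[name], columns[k])
            for k in kept)
        if not redundant:
            kept.append(name)
    return {k: columns[k] for k in kept}
```

On the Table 1 (left) columns, I2 is the complement of I1 in every regime, so the two determine each other and only one is kept, while I3 survives because it is not in a one-to-one relation with either.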

D-separation has been shown to be sound (Spirtes2000), but was only conjectured to be complete. We prove that in our restricted setting, D-separation is actually complete: under Assumptions 1–3, it gives all conditional independences that are entailed by the assumptions. First, we consider the case in which the regime variable is not determined by the intervention variables. Then all deterministic relations are functionally determined relations (see Section 2.3), as they correspond to each intervention variable being a function of the regime variable only. The notion of D-separation introduced by Geiger1990 is proven to be sound and complete under the Causal Markov assumption for functionally determined relations. If all deterministic relations are functionally determined relations, then their notion of D-separation is equivalent to the one by Spirtes2000 that we use here. Thus the statement follows.

For the other case, we show that the D-separations do not change when removing the deterministic relation by which the regime variable R is determined by all intervention variables I_1, …, I_m. Let D denote the complete set of deterministic relations according to Assumption 3 and assume (R, {I_1, …, I_m}) ∈ D. Let D' = D ∖ {(R, {I_1, …, I_m})}. Let G be the DAG associated with the SCM. We claim that for disjoint sets of variables X, Y and Z:

    X ⊥_D Y | Z [G, D] ⟺ X ⊥_D' Y | Z [G, D'].

Assume X ⊥_D Y | Z and X ⊥̸_D' Y | Z. This can only happen when R ∈ Det_D(Z) ∖ Det_D'(Z), because otherwise Det_D(Z) = Det_D'(Z) and then the two D-separations are identical by definition. So there must exist a path p in G that is D-open w.r.t. D' but D-closed w.r.t. D. This means that p must contain R as a non-collider. Also, R ∉ Z, otherwise the path would be closed w.r.t. D' as well. Since R lies on p and its only adjacencies in G are the intervention variables, p must contain at least one other node that is adjacent to R, which must be one of the intervention variables I_i. Since the intervention variables can only be non-colliders on p by the JCI assumptions, and R ∈ Det_D(Z) implies {I_1, …, I_m} ⊆ Det_D'(Z), they must d-block p. Hence we have arrived at a contradiction: p cannot be D-open w.r.t. D'.

Moreover, X ⊥_D' Y | Z and X ⊥̸_D Y | Z cannot happen, because the graph G is the same in both cases and Det_D'(Z) ⊆ Det_D(Z), so there cannot be additional D-separations w.r.t. D' compared to the D-separations w.r.t. D. Therefore, removing this particular deterministic relation does not change the D-separation statements, and hence completeness follows also for this case. We conjecture that for the more general case of arbitrary deterministic relations between dummy variables, e.g., Table 1 (left), D-separation as defined by Spirtes2000 is still complete, but we leave the proof for future work.

The completeness of D-separation is important in the JCI context because it motivates a relaxation of the standard Causal Faithfulness assumption. In our setting, the standard assumption is too restrictive, so we relax it to allow for violations due to deterministic relations between the regime and the intervention variables. We define our relaxed version, which we call the D-Faithfulness assumption, as follows:

Assumption 4 (D-Faithfulness)

For three disjoint sets of variables X, Y and Z and a probability distribution P that satisfies both the Causal Markov assumption for G and the set of deterministic relations D, we assume that X ⊥⊥ Y | Z [P] ⟹ X ⊥_D Y | Z [G, D].

This assumption, in conjunction with the previous ones, implies that in JCI independences correspond one-to-one with D-separations, which paves the road for constraint-based causal discovery. (A consequence of D-Faithfulness is that intervention variables should be pairwise dependent, also when conditioning on a subset of them. This excludes some experimental design matrices, e.g., a matrix similar to Table 1 but with regime probabilities chosen such that some intervention variables become independent. In practice, we can often alleviate this problem, e.g., by dropping some data points.) The completeness of D-separation suggests that this relaxation is “tight”, i.e., we only relax the standard Causal Faithfulness assumption to allow for the extra independences that are due to the deterministic relations in Assumption 3, but no more.
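The pairwise-dependence requirement on intervention variables can be checked directly on the experimental design matrix by comparing each pair's joint distribution under the regime probabilities with the product of its marginals. A sketch using our own column-based encoding of the design matrix:

```python
from itertools import combinations

def pairwise_dependent(columns, probs):
    """Check that every pair of intervention variables is marginally
    dependent under the regime distribution `probs` (one P(R=r) per row)."""
    for a, b in combinations(columns, 2):
        joint, pa, pb = {}, {}, {}
        for va, vb, p in zip(columns[a], columns[b], probs):
            joint[(va, vb)] = joint.get((va, vb), 0.0) + p
            pa[va] = pa.get(va, 0.0) + p
            pb[vb] = pb.get(vb, 0.0) + p
        independent = all(
            abs(joint.get((x, y), 0.0) - pa[x] * pb[y]) < 1e-9
            for x in pa for y in pb)
        if independent:
            return False  # this pair violates the D-Faithfulness requirement
    return True
```

With the regime probabilities of Table 1 (right) the two intervention variables are dependent, whereas uniform regime probabilities would make them independent and hence exclude that design.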

Under Assumptions 1–4, we define Joint Causal Inference (JCI) as the problem of inferring the causal DAG G from the distributions P_1, …, P_n, or from finite samples of those. Moreover, we call any causal discovery method that can solve a JCI instance a JCI method. We will show in Section 4.2 that the ideas behind some previous approaches, e.g., ICP, can be seen as special cases of JCI.

4.1 Joint Causal Inference improves on the identifiability w.r.t. merging learnt structures

As already mentioned, formulating causal inference on multiple datasets as JCI offers several advantages with respect to the approaches in which the causal graphs are learnt separately from each dataset and then merged. One of the advantages is the improved identifiability. In this section, we show a few examples, where, for simplicity, we assume oracle inputs and model only the regime variable (rather than the regime variable and a single intervention variable).

In Figure 3 we show a simple example in which JCI improves identifiability. In the absence of information on the intervention targets, we cannot identify the causal direction between the two system variables when we learn the structures separately and then combine them. In the same case, a JCI method is able to correctly reconstruct the causal structure by using additional conditional independence tests involving the regime variable, e.g., testing the (conditional) (in)dependence of the regime with each system variable. Using the background knowledge from JCI that system variables cannot cause the regime variable R, and that there are no latent confounders between R and the system variables, we can then recover the direction of the causal relation and the absence of confounders with any JCI method supporting direct causal relations. Note that in this example, since there are only two datasets, the regime variable and the single intervention variable coincide, so for simplicity we represent only R.

Figure 3: A simple example in which JCI improves identifiability. Consider two datasets with the same underlying DAG, one observational and one with a soft intervention (left). If we learn the causal graphs of the two system variables in each dataset separately and then merge them, e.g., as described by triantafillou2015constraint, we cannot learn the causal direction but only that the variables are dependent (middle). JCI adds more variables and thus more conditional independence tests, which allow the direction to be distinguished (right). Here, since the regime variable and the single intervention variable coincide, w.l.o.g. we represent only R. See details in the main text.
(Figure 4 panels, left to right: the underlying DAGs in the observational dataset and in the dataset with a soft intervention on two known targets; the ancestral relations and PAGs learnt from each dataset separately, and their merge; the output of a JCI method.)
Figure 4: A more complex example in which JCI improves identifiability: If we have background knowledge on the intervention targets, e.g., we know that in one of the datasets two of the variables are intervened upon (left), we can use this information to extract some extra background knowledge in the form of ancestral relations. Merging separately learnt causal structures and this extra knowledge (as done, e.g., by ACI) still does not provide enough information to recover the causal structure (middle). Instead, a JCI method can identify the true causal structure, more precisely the ADMG (right). Since there are only two datasets, we can represent only one of the dummy variables. See details in the main text.

If for each dataset the targets of the interventions are known, then it is possible to retrieve their descendants (and non-descendants) by checking which variables change in each interventional dataset with respect to the observational case. This technique was successfully applied in (ACI) to retrieve a list of weighted ancestral relations that could be used as background knowledge. For example, if one were to know the intervention target in the interventional dataset in the example in Figure 3, the change in the other variable with respect to the observational dataset would imply an ancestral relation from the target to that variable.
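The detection of changed variables can be sketched as follows. The standardized mean-difference statistic and the threshold are illustrative assumptions, not the weighting procedure used by ACI:

```python
import numpy as np

def changed_variables(obs, interv, threshold=4.0):
    """Sketch: mark a variable (column) as a descendant of the known
    intervention targets when its distribution changes between the
    interventional dataset and the observational one. Change is measured
    here by a crude standardized mean difference; the threshold is an
    illustrative choice."""
    diff = interv.mean(axis=0) - obs.mean(axis=0)
    # standard error of the mean difference, per column
    se = np.sqrt(obs.var(axis=0) / len(obs) + interv.var(axis=0) / len(interv))
    return {j for j in range(obs.shape[1]) if abs(diff[j]) > threshold * se[j]}
```

Variables flagged this way are descendants of some intervention target; the remaining variables provide the complementary non-descendant statements.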

Although (weighted) ancestral relations help in simple cases such as the previous example, in general they cannot reproduce the same results as JCI. An example is given in Figure 4. In this case, knowing the two intervention targets in the interventional dataset, and observing that a third variable changes significantly, allows us only to conclude that one of these targets causes that variable, which is not enough to reconstruct the causal graph by merging the PAGs learnt from each dataset separately. Instead, a JCI method that supports direct causal relations can take advantage of the additional conditional independence tests with the regime variable: in the example in Figure 4, such a method can correctly reconstruct the underlying causal graph, more precisely the acyclic directed mixed graph (ADMG), from oracle independence test results. For readability, we provide the proof in the Appendix.

4.2 Reformulation of related work as special cases of JCI

Local Causal Discovery (LCD) (Cooper1997) is a simple algorithm that searches for triples of variables satisfying the following pattern: an exogenous variable (one not caused by any other variable under consideration) is dependent on a second variable, the second variable is dependent on a third, and the exogenous variable is independent of the third given the second. We can apply LCD to multiple observational and experimental datasets with soft interventions by using the regime variable as the exogenous variable, since it is by assumption not caused by any other variable. Then LCD can be summarized as: if the regime variable is dependent on X, X is dependent on Y, and the regime variable is independent of Y given X, then X causes Y.

This rule can be seen as a restricted case of the JCI setting, in which we can iteratively pick pairs of variables and apply the above rule to detect a subset of the causal graph.
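The LCD pattern with the regime variable as the exogenous variable can be sketched as follows; `indep` stands for a hypothetical conditional-independence oracle, and the variable names are illustrative:

```python
from itertools import permutations

def lcd(variables, indep, exo="R"):
    """Sketch of the LCD pattern: for each ordered pair (x, y), conclude
    that x causes y when the exogenous variable (here the regime variable
    R) is dependent on x, x is dependent on y, and the exogenous variable
    is independent of y given x. `indep(a, b, cond)` returns True iff a
    and b are independent given the conditioning set cond."""
    causes = set()
    for x, y in permutations(variables, 2):
        if (not indep(exo, x, frozenset())        # R and x dependent
                and not indep(x, y, frozenset())  # x and y dependent
                and indep(exo, y, frozenset({x}))):  # R independent of y given x
            causes.add((x, y))
    return causes
```

On an oracle for the chain R -> X -> Y, the only triple matching the pattern is (R, X, Y), so the rule recovers that X causes Y.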

An extension of this approach is used in Invariant Causal Prediction (ICP) (ICP). ICP also considers the regime variable (which is called the discrete environment variable in that work), but it does not model the intervention variables (see ICP, Appendix). Given a target variable that is not directly intervened upon, we can reformulate the main idea behind ICP as the search for the intersection of all sets of variables that render the target independent of the regime variable when conditioned upon:

(1)

In the absence of confounders, this intersection is a conservative estimate of a subset of the parents of the target, even when the Causal Faithfulness assumption is violated. If we cannot exclude the presence of confounders, as in the JCI setting, then ICP requires the Causal Faithfulness assumption in order to provide an estimate of a subset of the ancestors of the target. We can see this reformulation of ICP as a special case of JCI that extends LCD with a more conservative estimate. In principle, one could easily integrate the conservative estimate (1) into a JCI method to provide more accurate estimates for the top predictions, but we leave this for future work.
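This reformulation can be sketched as follows; `indep` is again a hypothetical independence oracle, and the brute-force subset enumeration is only for illustration:

```python
from itertools import combinations

def icp_ancestors(target, candidates, indep, exo="R"):
    """Sketch of the ICP-style conservative estimate: intersect all
    candidate sets S for which the regime variable is independent of the
    target given S. `indep(exo, target, S)` is a hypothetical
    conditional-independence oracle."""
    accepted = []
    # enumerate all subsets of the candidate variables
    for r in range(len(candidates) + 1):
        for s in combinations(candidates, r):
            if indep(exo, target, frozenset(s)):
                accepted.append(set(s))
    if not accepted:
        return set()
    result = set(candidates)
    for s in accepted:
        result &= s
    return result
```

Note that on an oracle for a chain in which the intervened variable is two hops away from the target, several sets are accepted but their intersection is empty, so the estimate can be overly conservative even though the target does have ancestors among the candidates.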

ICP:     JCI method:

ICP:     JCI method:
Figure 5: Two examples in which ICP is overly conservative in the JCI setting: In the left example, for the target variable ICP finds several sets that render it independent of the regime variable, but their intersection is empty. Instead, a JCI method that supports direct causal relations will correctly infer the single parent of the target. In the right example, naively adding intervention variables does not allow us to estimate the ancestors of a variable that is not directly intervened upon, which would otherwise be estimated correctly.

On the other hand, depending on the set of interventions in the available datasets, ICP may be overly conservative compared to JCI. Besides the restriction that the target variable must not be directly intervened upon in any dataset, which is not necessary in JCI, there are some other cases in which ICP provides an overly conservative estimate of the set of ancestors. We show two examples in Figure 5. Specifically, in the left example the estimated set of ancestors for a variable that is two hops away from the intervened variable is empty, while a JCI method that supports direct causal relations can find the correct parent set. Similarly, as shown in the right example, naively adding the intervention variables restricts ICP to estimating the ancestors of variables that are directly intervened upon. This naive addition would allow ICP to learn the intervention targets, but not the structure of the causal graph.

5 A strategy for extending constraint-based methods for JCI

Joint Causal Inference provides some challenges for current constraint-based methods:

  • faithfulness violations due to deterministic relations between the dummy variables,

  • the availability of complex background knowledge (i.e., not limited to the presence/absence of edges or ancestral relations) on the dummy variables that can improve structure learning and recover from some of the faithfulness violations (e.g., the regime variable can only cause a system variable through an intervention variable).

There is some work on dealing with faithfulness violations in the PC algorithm (Lemeire2012), but it assumes causal sufficiency (in our context, no hidden variables) and cannot handle background knowledge. Logic-based constraint-based algorithms (e.g., antti; triantafillou2015constraint; ACI) can handle complex background knowledge and causal insufficiency, but the existing implementations cannot deal with faithfulness violations due to deterministic relations.

Here we propose a simple but effective strategy for dealing with faithfulness violations due to deterministic relations. We rephrase the constraints of a constraint-based algorithm in terms of d-separations and d-connections, instead of independence test results. At testing time we decide for each independence test result which d-separation or d-connection can be soundly derived from it and provide these d-separations and d-connections as input to the modified constraint-based algorithm.

Before introducing the rules that we use to derive sound d-separations and d-connections from input independence test results, we first summarise the basic properties of conditional independence originally introduced by Dawid1979, which we will use to prove an intermediate lemma. We follow the notation and ordering from a more recent publication (ECIpreprint):

Let X, Y, Z, W be random variables. We write W = f(Y) to denote that W is a function of Y for a measurable function f. Then the following properties hold:

  1. X ⊥⊥ Y | Z ⇒ Y ⊥⊥ X | Z,

  2. X ⊥⊥ Y | Y,

  3. X ⊥⊥ Y | Z and W = f(Y) ⇒ X ⊥⊥ W | Z,

  4. X ⊥⊥ Y | Z and W = f(Y) ⇒ X ⊥⊥ Y | (W, Z),

  5. X ⊥⊥ Y | Z and X ⊥⊥ W | (Y, Z) ⇒ X ⊥⊥ (Y, W) | Z.

For disjoint (sets of) random variables X, Y, Z and a random variable W with W = f(Z):

X ⊥⊥ Y | Z ⇔ X ⊥⊥ Y | (W, Z).

This is a simple consequence of the properties of conditional independence that we reviewed in Proposition 5: each direction of the equivalence follows by applying those properties to the conditioning set extended with W.

We can now use Lemma 5 to prove a sound conversion from D-separation statements to d-separation statements. For some set of variables W, let Det(W) denote the variables determined by (a subset of) W (see Definition 2.3). Let X, Y be two different variables that are disjoint from Det(W). Under the Causal Markov and D-Faithfulness assumptions, the following holds:

X ⊥⊥ Y | W ⇔ X is d-separated from Y given Det(W).

The proof is a chain of equivalences: the first follows from the Causal Markov and D-Faithfulness assumptions, the second from Lemma 5, and the third again from the Causal Markov and D-Faithfulness assumptions, while the last is based on the definition of D-separation, which reduces to d-separation when the conditioning set contains all the variables it determines.

Using the result from Theorem 5, we can now introduce our strategy for dealing with faithfulness violations due to deterministic relations. First we rephrase a constraint-based algorithm in terms of d-separations and d-connections, which is usually a trivial change, as shown in Section 6. Then we can convert the problem of possibly unfaithful independences into the problem of possibly incomplete input. Specifically, we can derive a subset of sound d-separations and d-connections from independence test results as follows: let X, Y, W be disjoint (sets of) variables, and let Det(W) be the variables determined by (a subset of) W (see Definition 2.3). Under the Causal Markov and D-Faithfulness assumptions, the following holds:

  • if X and Y are dependent given W, then X is d-connected to Y given W;

  • if X and Y are independent given W, and neither X nor Y is in Det(W), then X is d-separated from Y given Det(W).

The first implication follows from the Causal Markov assumption, while the second follows from the Causal Markov and D-Faithfulness assumptions and Theorem 5. Note that this procedure outputs d-separations only for a subset of the independence test results, ignoring independences for which X or Y is determined by the conditioning set.
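The conversion in Corollary 5 can be sketched as follows; the `det` interface (a function returning the variables determined by a conditioning set) is an assumed abstraction for illustration, not the paper's notation:

```python
def sound_graphical_constraints(test_results, det):
    """Sketch of the conversion in Corollary 5: every dependence yields a
    d-connection, while an independence yields a d-separation only when
    neither endpoint is determined by the conditioning set; the remaining
    independences are dropped, since they may be unfaithful.
    test_results: list of (x, y, w, independent) tuples, where w is the
    conditioning set and independent is the boolean test outcome."""
    constraints = []
    for x, y, w, independent in test_results:
        if not independent:
            constraints.append((x, y, w, "d-connected"))
        elif x not in det(w) and y not in det(w):
            constraints.append((x, y, w, "d-separated"))
        # otherwise: drop the statement entirely
    return constraints
```

The output is a partial list of graphical constraints, which is exactly the kind of incomplete input that logic-based methods can consume.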

The simple strategy in Corollary 5 can be applied to any constraint-based method, provided that it can deal with partial inputs, i.e., missing results for certain independence tests. Logic-based methods (e.g., antti; ACI) can be run out-of-the-box with partial inputs, while other standard algorithms like FCI (Zhang:2008:COR:1414091.1414237) would require possibly non-trivial extensions. Anytime FCI (Colombo++2012) allows one to ignore (in)dependences above a certain order, but up to that order they are all required to be available, so that algorithm would also require possibly non-trivial extensions.

6 Ancestral Causal Inference With Determinism (ACID)

We implement the strategy in Corollary 5 in Ancestral Causal Inference with Determinism (ACID) as a determinism-tolerant extension of Ancestral Causal Inference (ACI), a recently introduced logic-based method (ACI). Before describing how ACID differs from ACI, we will briefly describe ACI itself.

6.1 Ancestral Causal Inference (ACI)

ACI reconstructs ancestral structures (combinations of “indirect” causal relations), also in the presence of latent variables and statistical errors. Ancestral structures are formally defined as follows: an ancestral structure is any relation ⇝ on the observed variables that satisfies the non-strict partial order axioms:

(2) reflexivity: X ⇝ X,
(3) transitivity: (X ⇝ Y) ∧ (Y ⇝ Z) ⇒ X ⇝ Z,
(4) antisymmetry: (X ⇝ Y) ∧ (Y ⇝ X) ⇒ X = Y.

The underlying causal DAG induces a unique ancestral structure on the observed variables: the transitive closure of the direct causal relations (directed edges) in the DAG.
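The induced ancestral structure can be computed as a plain transitive closure. This sketch represents the relation as a set of ordered pairs and leaves the reflexive pairs implicit:

```python
def ancestral_structure(edges):
    """Sketch: the ancestral structure induced by a DAG is the transitive
    closure of its directed-edge relation, i.e., (x, y) is in the result
    iff there is a directed path from x to y in the DAG."""
    anc = set(edges)
    changed = True
    while changed:
        changed = False
        # repeatedly add (a, d) whenever (a, b) and (b, d) are present
        for (a, b) in list(anc):
            for (c, d) in list(anc):
                if b == c and (a, d) not in anc:
                    anc.add((a, d))
                    changed = True
    return anc
```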

ACI encodes the ancestral structure definition and five other causal reasoning rules, for disjoint (sets of) variables:

  1. ,

  2. ,

  3. ,

  4. ,

  5. .

These rules are shown to be sound assuming the Causal Markov and Causal Faithfulness assumptions. Causal discovery is then reformulated as an optimization problem where a loss function is optimized over possible ancestral structures. Given a list of weighted inputs, e.g. a set of conditional independences weighted by their confidence, the loss function sums the weights of all the inputs that are violated in a candidate ancestral structure. In addition, ACI provides a method for scoring causal predictions, which roughly approximates their marginal probability.
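The scoring step can be sketched as follows; `violated` is a hypothetical predicate standing in for the logic-based check of the ACI rules, which we do not reproduce here:

```python
def aci_loss(candidate, weighted_inputs, violated):
    """Sketch of the ACI loss: sum the confidence weights of all inputs
    that are violated by a candidate ancestral structure.
    weighted_inputs: list of (statement, weight) pairs;
    violated(candidate, statement): hypothetical predicate encoding
    whether the statement conflicts with the candidate under the rules."""
    return sum(w for (statement, w) in weighted_inputs
               if violated(candidate, statement))

def best_structure(candidates, weighted_inputs, violated):
    """Pick the candidate ancestral structure minimizing the loss."""
    return min(candidates, key=lambda c: aci_loss(c, weighted_inputs, violated))
```

In the actual method the minimization is performed by an answer set programming solver rather than by explicit enumeration; the enumeration here is only for illustration.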

6.2 ACID

Out of the box, ACI is not able to deal with the faithfulness violations due to deterministic relations, and thus cannot be used for JCI. Therefore, we propose Ancestral Causal Inference with Determinism (ACID), which extends ACI following the strategy discussed in Section 5. We reformulate the logical rules of ACI in terms of d-separation, completely decoupling them from any assumption on the relation between (in)dependences and d-separations/connections, e.g., Causal Faithfulness. These new rules, which we call the ACID rules, are almost identical to the original ACI rules, except that independences are substituted by d-separations and dependences by d-connections, as we show in the following, for disjoint (sets of) variables:

  1. ,

  2. ,

  3. ,

  4. ,

  5. .

The proofs of these rules are slight modifications of the proofs of the ACI rules. For completeness, we provide them in the Appendix. So far, this only changes the interpretation of the implemented rules, but no change of the code is required. What changes are the inputs: only the sound d-separations and d-connections that can be derived with Corollary 5 are used as inputs for ACID. Similarly to other logic-based methods, the ACID rules are sound also with partial inputs (i.e. when some d-separation information may not be available). On the other hand, using partial inputs may reduce the completeness of causal discovery. We consider this a minor issue, since our focus is on prediction accuracy, and ACI is already known not to be complete in the general case, but has nevertheless been shown to obtain state-of-the-art accuracies (ACI).

6.3 ACID-JCI

To improve the identifiability and accuracy of the predictions, we also add as background knowledge a series of logical rules describing the causal structure of the regime and intervention variables that apply in the JCI setting. Under the JCI assumptions, for any set of variables, the following hold:

  1. ,

  2. .

  3. ,

  4. ,

  5. ,

  6. ,

  7. .

The propositions follow directly from the JCI assumptions and background knowledge:

  1. The regime variable causes the intervention variables directly;

  2. The regime variable cannot cause system variables directly, but only through intervention variables;

  3. The intervention variables suffice to block any path between the regime variable and any other variable;

  4. System variables cannot cause the regime variable or any intervention variable;

  5. There are no confounders between the regime variable and the system variables;

  6. There are no possible confounders between the intervention variables and system variables other than the regime variable;

  7. Adding the regime variable to the separating set cannot open paths;

  8. Adding an intervention variable to the separating set cannot open paths.

Adding this background knowledge provides a simple means of ruling out several spurious candidate causal structures that do not satisfy the JCI modeling assumptions. The integration of such complex knowledge is one of the main advantages of logic-based methods. We will refer to the combination of ACID with these rules as ACID-JCI.

If some of the intervention targets are known, we can also use this information as background knowledge. For example, if we know that an inhibitor targets a certain protein, we can simply add the corresponding direct causal relation as background knowledge. Given the flexibility of logic-based approaches, we can also include more complex cases, for example background knowledge stating that an intervention has two mutually exclusive targets.

(a) PR ancestral, p = 4, i = 1
(b) PR non-ancestral, p = 4, i = 1
(c) PR ancestral, p = 4, i = 3
(d) PR non-ancestral, p = 4, i = 3
(e) PR ancestral, p = 6, i = 1
(f) PR non-ancestral, p = 6, i = 1
Figure 6: Precision-recall curves on synthetic data for varying numbers of system variables (p) and interventions (i): ancestral predictions (left column) and non-ancestral predictions (right column). Using JCI substantially improves the accuracy.
(a) PR ancestral, p = 3, i = 4
(b) PR non-ancestral, p = 3, i = 4
Figure 7: Precision-recall curves on synthetic data for p = 3 system variables and i = 4 interventions: ancestral predictions (left) and non-ancestral predictions (right). Also in this setting, JCI substantially improves the accuracy.

7 Evaluation

We evaluate ACID on simulated data in the JCI setting. The simulator builds on a simulator used in related work (antti; ACI) and implements soft interventions on unknown targets. For each combination of the number of system variables p and the number of interventions i, we randomly generate 1000 linear acyclic models with latent variables and Gaussian noise, and simulate soft interventions on random targets. We then sample data points for each model, randomly distributed between the experimental datasets and the observational dataset, perform independence tests, and weight the (in)dependence statements using the weighting schemes described by ACI.
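A minimal sketch of this kind of simulation follows; it omits latent variables, and the constants for edge density and intervention strength are illustrative, so it is not the actual simulator:

```python
import numpy as np

def simulate_jci_dataset(p, i, n, rng):
    """Sketch: a random linear acyclic model over p system variables with
    Gaussian noise, plus soft (mean-shift) interventions on random unknown
    targets in i experimental regimes; regime 0 is observational."""
    # random lower-triangular weight matrix -> acyclic linear model
    B = np.tril(rng.normal(size=(p, p)), k=-1) * (rng.random((p, p)) < 0.5)
    datasets = []
    for regime in range(i + 1):
        shift = np.zeros(p)
        if regime > 0:
            target = rng.integers(p)  # unknown intervention target
            shift[target] = 2.0       # soft intervention: mean shift
        X = np.zeros((n, p))
        for j in range(p):  # variable order is a topological order
            X[:, j] = X @ B[j] + shift[j] + rng.normal(size=n)
        datasets.append(X)
    return datasets
```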

In our setting, we evaluate the causal discovery methods applied to datasets with unknown-target soft interventions. Other existing constraint-based methods (e.g., antti; triantafillou2015constraint) do not apply to this setting as they assume perfect interventions with known targets. Moreover, we wish to evaluate the net effect of using JCI with respect to merging separately learnt causal structures, while factoring out the effect due to different algorithms.

Therefore, we choose to compare the ancestral structure predicted by ACID-JCI with a naive baseline based on ACI, “Merged ACI”. In this baseline we merge the ancestral structures learnt on each dataset separately with ACI by averaging the scores of the causal predictions over the datasets. As inputs to both algorithms we provide the same weighted independence test results (up to a maximum order), computed with a test based on partial correlation and Fisher’s z-transform, and weighted with the frequentist weighting scheme described by ACI.
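The independence test can be sketched as follows; this is a standard partial-correlation test with Fisher's z-transform, not the exact implementation used in the evaluation:

```python
import numpy as np
from math import erfc, sqrt, log

def fisher_z_pvalue(data, x, y, cond):
    """Sketch of a partial-correlation independence test with Fisher's
    z-transform: x, y and the list cond index columns of data (n-by-p).
    Returns a two-sided p-value; small values indicate dependence."""
    cols = [x, y] + list(cond)
    corr = np.corrcoef(data[:, cols], rowvar=False)
    prec = np.linalg.inv(corr)  # precision matrix of the selected columns
    # partial correlation of x and y given cond, from the precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    n = data.shape[0]
    z = 0.5 * log((1 + r) / (1 - r)) * sqrt(n - len(cond) - 3)
    return erfc(abs(z) / sqrt(2))  # standard normal tail probability
```

A weighting scheme then converts each p-value (compared against a significance threshold) into a confidence weight for the corresponding (in)dependence statement.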

In Figures 6 and 7 we report precision-recall (PR) curves for predicting ancestral relations (“indirect” causal relations) and non-ancestral relations (the absence of such a causal relation) for different numbers of system variables and interventions. We can see from these figures that ACID-JCI significantly improves the accuracy of the predictions with respect to the baseline. As expected, the more interventional datasets are available, the better the accuracy of ACID-JCI, as we can see by comparing the p = 4, i = 1 and p = 4, i = 3 cases in Figure 6. When there are only few datasets, e.g., i = 1, the performance of both methods improves as the number of system variables increases, and the gap between the methods gets smaller, as we can see by comparing the p = 4, i = 1 and p = 6, i = 1 cases in Figure 6. On the other hand, when there are only few variables, e.g., p = 3 in Figure 7, ACID-JCI is able to correctly predict several ancestral relations that are not identifiable otherwise. In particular, in this case there are no predicted ancestral relations for standard methods like “Merged ACI”, which explains why the PR curve for ancestral relations in Figure 7(a) starts at 0.5 recall. Instead, ACID-JCI is able to accurately predict these relations, illustrating that the Joint Causal Inference framework not only leads to statistical advantages but also enlarges the set of identifiable ancestral relations compared to methods that deal with each dataset separately.

8 Conclusions and discussion

In this paper, we presented Joint Causal Inference (JCI), a powerful formulation of causal discovery over multiple datasets that has been unexploited so far by constraint-based methods. Current constraint-based methods cannot be applied out-of-the-box to JCI because of faithfulness violations, so we proposed a simple strategy for dealing with this type of faithfulness violations. We implemented this strategy in ACID, a determinism-tolerant extension of a recently proposed causal discovery method, and applied ACID to JCI, showing its benefits in an evaluation on simulated data.

A limitation of JCI is that the assumption of a unique underlying causal DAG precludes certain types of interventions. There are several techniques to extend our formulation of the problem to perfect interventions, or other interventions that induce new independences. For example, given an observational dataset, we could identify the datasets with perfect interventions by noticing the additional independences, perform causal inference on each of them separately and merge the predictions in a similar way to the methods presented by antti and by triantafillou2015constraint. In future work, we plan to investigate these techniques, as well as techniques to include fat-hand interventions that induce new dependences between the intervention targets.

Moreover, we plan to investigate other possible strategies or extensions to existing algorithms for dealing with faithfulness violations due to deterministic relations. Finally, although very accurate and flexible, logic-based methods such as HEJ (antti) and ACI (ACI) are limited in the number of variables they can handle. JCI introduces additional variables, reducing their scalability even further. We plan to investigate improvements to the execution times of methods like ACID.

Acknowledgments

SM, JMM and TC were supported by NWO, the Netherlands Organization for Scientific Research (VIDI grant 639.072.410). SM was also supported by the Dutch programme COMMIT/ under the Data2Semantics project. TC was also supported by NWO grant 612.001.202 (MoCoCaDi), and EU-FP7 grant agreement n.603016 (MATRICS).

References

Appendix A Proofs

In the example in Figure 4, a JCI method that supports direct causal relations can reconstruct correctly the underlying causal graph, more precisely the acyclic directed mixed graph (ADMG), from oracle independence test results. In this proof, we proceed in three steps: first we reconstruct the ancestral relations, then we show that a subset of these relations are actually direct causal relations, and finally we show the absence of confounders. Given their simplicity, for the first step we use the rules from ACI, which we show in detail in Lemma 6.1 in Section 6.

From , the background knowledge that there are no latent confounders between and the system variables and that , we can infer that . Reasoning in a similar way, we can also infer that and . From and we can use rule (3) in Lemma 6.1, which implies that . Given the background knowledge we can infer that . Similarly, from , we can infer that . Since and (from the acyclicity of ancestral relations), then this must imply that . Since ancestral relations are transitive, from and we get that . Moreover, because of the acyclicity of ancestral relations, these statements imply that , , .

We now go through the ancestral relations we found and check which of them are direct causal relations:

  • : since and , cannot cause indirectly through or (from the acyclicity of causal relations), so directly;

  • : since and , cannot cause indirectly through or (acyclicity), so directly;

  • : this relation is not direct, because , so they are not adjacent in the causal graph;

  • : since and , cannot cause indirectly through or (from the acyclicity of causal relations), so directly;

  • : this relation is not direct, because , so they are not adjacent in the causal graph;

  • : given all the previous causal relations, the background knowledge that there are no confounders between and the other variables, and the only possible graph includes a direct causal relation