Leveraging directed causal discovery to detect latent common causes

10/22/2019
by   Ciarán M. Lee, et al.

The discovery of causal relationships is a fundamental problem in science and medicine. In recent years, many elegant approaches to discovering causal relationships between two variables from uncontrolled data have been proposed. However, most of these deal only with purely directed causal relationships and cannot detect latent common causes. Here, we devise a general method which takes a purely directed causal discovery algorithm and modifies it so that it can also detect latent common causes. The identifiability of the modified algorithm depends on the identifiability of the original, as well as an assumption that the strength of noise be relatively small. We apply our method to two directed causal discovery algorithms, the Information Geometric Causal Inference of (Daniusis et al., 2010) and the Kernel Conditional Deviance for Causal Inference of (Mitrovic, Sejdinovic, and Teh, 2018), and extensively test on synthetic data—detecting latent common causes in additive, multiplicative and complex noise regimes—and on real data, where we are able to detect known common causes. In addition to detecting latent common causes, our experiments demonstrate that both modified algorithms preserve the performance of the original directed algorithm in distinguishing directed causal relations.

I Introduction

Causal knowledge is crucial in medicine, as causal relations—unlike correlations—allow one to reason about the consequences of possible treatments P; richens2019; perov2019. Indeed, determining whether a particular treatment causes a reduction in the severity of a disease is an incredibly important problem in healthcare. However, learning from medical records that a treatment is correlated with recovery is not sufficient to conclude that the treatment will cause a new patient to recover, due to the presence of latent confounding. More expensive treatment could be given to wealthier patients, who are more likely to recover regardless of treatment due to better lifestyle and diet. Here, the patient's wealth acts as a latent confounder, or common cause, between receiving treatment and recovering. In reality, the treatment could be detrimental. Thus, learning the underlying causal structure is vital. This difficult problem is compounded further when no high-grade controlled trial evidence exists, meaning the standard approach to learning causal structure is not available. Hence, methods for discovering causal structure from uncontrolled, or observational, data are paramount to making correct treatment decisions.

In recent years, a number of elegant approaches to discovering causal relations between two variables have been proposed Noise; shimizu2006linear; janzing2012information; mitrovic2018causal; Confounders; goudet2017learning; zhang2009identifiability; fonollosa2016conditional; lopez2015towards; lee2017causal. However, most of these deal only with purely directed causal relationships and cannot detect latent common causes. That is, given two variables $X$ and $Y$, these algorithms can only distinguish $X$ causes $Y$ ($X \to Y$), graphically depicted in Figure 1(a), from $Y$ causes $X$ ($Y \to X$), depicted in Figure 1(b). Those algorithms that can detect latent common causes impose strong restrictions on the underlying causal model, such as enforcing linearity shimizu2006linear or demanding that noise be additive Confounders. As the presence of latent common causes is a significant problem that confounds many statistical analyses, surmounting these shortcomings and developing general methods for detecting latent common causes that do not rely on strong parametric assumptions, such as additive noise or linearity, is an important problem in causal discovery.

In this work, we devise a heuristic method which takes a purely directed causal discovery algorithm and modifies it so that it can also discover latent common causes. That is, our method takes in a causal discovery algorithm which can only distinguish between the causal structures in Figure 1(a) and Figure 1(b), and outputs a causal discovery algorithm which can distinguish between all three structures in Figure 1. The identifiability of the modified algorithm depends on the identifiability of the original. If the original directed algorithm requires a parametric assumption, that noise be additive for example, then so too does the modified algorithm. However, if the directed algorithm does not rely on any parametric assumptions for its identifiability, then neither does the modified algorithm resulting from applying our method. Additionally, identifiability of our method requires a constraint on the strength of latent noise.

We apply our method to two state-of-the-art directed causal discovery algorithms, the Information Geometric Causal Inference (IGCI) algorithm of daniusis2012inferring and the Kernel Conditional Deviance for Causal Inference (KCDC) algorithm of mitrovic2018causal, and extensively test on synthetic data—detecting latent common causes in additive, multiplicative and complex noise regimes—as well as on real data, where we are able to detect known common causes. Importantly, and in addition to detecting latent common causes with high accuracy, our experiments demonstrate that both modified algorithms do not sacrifice the performance of the original directed algorithm in distinguishing directed causal relations.

The paper is organised as follows. In Section II we discuss previous and related work. In Section III we introduce and describe our method and discuss identifiability results. Section IV outlines our experiments and Section V contains a discussion of possible future work.

II Related work

Methods for discovering the causal structure underlying a data-generating process largely fall into two categories. The first, which we term global causal discovery, attempts to reconstruct a (partially) directed version of the underlying DAG. This approach is broadly split into constraint-based and score-based methods. The constraint-based approach employs conditional independence tests between the variables in question to determine which should share an edge in the causal structure. Examples include the PC algorithm Sprite, the IC algorithm P, as well as algorithms which allow for latent variables silva2006learning and selection bias spirtes1995causal. There are also examples employing kernel-based conditional independence tests zhang2012kernel. The score-based approach introduces a scoring function, such as Minimum Description Length, that evaluates each network with respect to some training data, and searches for the best network according to this function. Hybrid approaches employing both constraint- and score-based techniques appear to outperform either alone tsamardinos2006max.

The main drawback of constraint-based algorithms is that, as they can only recover Markov equivalence classes, they are not always able to orient edges between dependent variables. Given correlated variables $X$ and $Y$, these methods are unable to distinguish the structures depicted in Figure 1. The second category of causal discovery algorithm, which we term local or bivariate causal discovery, aims to address this by exploiting notions of asymmetry between cause and effect. These methods specify different assumptions that aim to make such asymmetries testable at an observational level.

The first type of assumption proposed in the literature, known as the functional causal model approach, specifies that each effect is a deterministic function of its cause together with some latent, independent noise term peters2017elements. The first such algorithm, the Linear Non-Gaussian Acyclic Model (LiNGAM) shimizu2006linear, assumes the functions are linear and the latent noise variables are non-Gaussian. Given these assumptions, this method can distinguish between the causal structures in Figure 1(a) and (b); a separate hidden-variables extension hoyer2008lingam extends this result to all structures in Figure 1. The other prototypical example, the Additive Noise Model (ANM) Noise, allows the effect to be an arbitrary function of the cause, but assumes the effect only depends on the latent noise term additively. The ANM algorithm can distinguish between the two structures in Figure 1(a)-(b) as long as the additivity assumption holds. ANM has been extended to allow the effect to depend on the cause and noise in a non-linear fashion zhang2009identifiability. Finally, the Confounding Additive Noise (CAN) algorithm Confounders extends the ANM algorithm to deal with latent common causes, allowing all structures in Figure 1 to be distinguished. We review the CAN model in Section II.1.
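To make the ANM decision rule above concrete, the following is a minimal sketch, not a reference implementation: it fits the regression with scikit-learn's KernelRidge and scores residual dependence with a simple (biased) sample distance correlation in place of the HSIC test used in the literature.

```python
# Minimal ANM sketch: regress each variable on the other and prefer the
# direction whose residuals look more independent of the putative cause.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def dist_corr(a, b):
    """Biased sample estimate of distance correlation (the population value
    is zero if and only if a and b are independent)."""
    A = np.abs(a[:, None] - a[None, :])
    B = np.abs(b[:, None] - b[None, :])
    A = A - A.mean(axis=0) - A.mean(axis=1)[:, None] + A.mean()
    B = B - B.mean(axis=0) - B.mean(axis=1)[:, None] + B.mean()
    return np.sqrt((A * B).mean()) / np.sqrt(
        np.sqrt((A * A).mean()) * np.sqrt((B * B).mean()))

def residuals(cause, effect):
    """Nonparametric regression of effect on cause; return the residuals."""
    model = KernelRidge(kernel="rbf", alpha=0.1)
    model.fit(cause.reshape(-1, 1), effect)
    return effect - model.predict(cause.reshape(-1, 1))

def anm_direction(x, y):
    dep_xy = dist_corr(x, residuals(x, y))   # model x -> y
    dep_yx = dist_corr(y, residuals(y, x))   # model y -> x
    return "x -> y" if dep_xy < dep_yx else "y -> x"

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 300)
y = x ** 3 + x + rng.normal(0, 1, 300)       # ground truth: x -> y
print(anm_direction(x, y))                   # expected output: "x -> y"
```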

The second type of assumption stipulates that the distribution of the cause, $P(X)$, be independent of the causal mechanism, $P(Y|X)$. An information geometric approach to measuring such independence has been proposed daniusis2012inferring; janzing2012information, known as the Information Geometric Causal Inference (IGCI) algorithm. This method can only distinguish the two structures in Figure 1(a)-(b). IGCI does not require specific parametric functional relationships to hold.
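As an illustration, here is a minimal sketch of the slope-based empirical IGCI estimator with a uniform reference measure; this is one standard estimator of the IGCI criterion, not necessarily the authors' exact implementation.

```python
# Slope-based IGCI sketch (uniform reference measure): estimate
# C_{X->Y} ~ E[log |f'(X)|] from finite differences of x-sorted samples,
# and infer the direction with the smaller score.
import numpy as np

def igci_score(x, y):
    x = (x - x.min()) / (x.max() - x.min())   # normalise to [0, 1]
    y = (y - y.min()) / (y.max() - y.min())
    order = np.argsort(x)
    dx, dy = np.diff(x[order]), np.diff(y[order])
    keep = (dx > 0) & (dy != 0)               # avoid log(0) / division by zero
    return float(np.mean(np.log(np.abs(dy[keep]) / dx[keep])))

def igci_direction(x, y):
    return "x -> y" if igci_score(x, y) < igci_score(y, x) else "y -> x"
```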

The last type of assumption entails that the causal mechanism $P(Y|X)$ should be "simpler", given some quantifiable notion of "simple", than the acausal mechanism $P(X|Y)$. The Kernel Conditional Deviance for Causal Inference (KCDC) algorithm mitrovic2018causal uses conditional kernel mean embeddings to establish a norm on conditional probabilities and uses the variance of these norms as a simplicity measure.

The IGCI and KCDC algorithms constitute the current state-of-the-art at distinguishing between the two causal structures in Figure 1(a)-(b).

While the ANM algorithm has been extended to deal with latent common causes via the CAN algorithm, its generalisations, such as KCDC and IGCI, have not. The current paper introduces a heuristic method that takes as input any purely directed causal discovery algorithm and outputs a new algorithm that can distinguish all causal structures in Figure 1.

Figure 1: In all cases, blue nodes represent observed variables and green nodes denote latent noise terms. (a) $X$ causes $Y$. (b) $Y$ causes $X$. (c) Latent common cause $Z$ between $X$ & $Y$.

II.1 Identifying latent common causes using additive noise models

As a prelude to our general method for detecting latent common causes, we review a current state-of-the-art algorithm for distinguishing all structures in Figure 1, the Confounding Additive Noise model (CAN) of Confounders. In CAN, one assumes that the functions underlying the causal structure from Figure 1(c) are $X = u(Z) + N_X$ and $Y = v(Z) + N_Y$, for arbitrary $u, v$. The first step of CAN is to fit such a model using samples from $(X, Y)$. To do this, manifold learning is employed to learn a representation of $Z$ from the observed samples of $(X, Y)$. Next, functions are fitted to map the learned representation of $Z$ to $X$ and $Y$ using e.g. Gaussian Process regression. The residuals resulting from this regression correspond to representations of the noise terms, $N_X$ and $N_Y$. In fitting this model, one must ensure that $Z$, $N_X$, and $N_Y$ are all mutually independent, as implied by the graphical structure of Figure 1(c). To ensure independence, CAN employs a computationally expensive optimisation procedure. See Section IV for further discussion.

With such a model fit using samples of $(X, Y)$, the next step is to use it to distinguish all structures in Figure 1. Note that the additive noise assumption for Figure 1(a) (the case of Figure 1(b) follows by interchanging $X$ and $Y$) means $X = u(Z) + N_X$ and $Y = v(Z) + N_Y$, for arbitrary $u, v$. CAN decides for Figure 1(a) over Figure 1(c) if the variance of the learned $N_X$ is small relative to the variance of $N_Y$, and prefers Figure 1(c) if the variances are approximately equal. The case of Figure 1(b) follows by switching $X$ and $Y$. The authors justify this as follows. Suppose $Z$ causes $Y$, and, by a slight measurement error, we observe $X$, which differs from $Z$ by a small additive term. Here, we would expect the functions to be as described above for Figure 1(c). But we should not distinguish $Z$ from its measurement result $X$ if both variables almost coincide. In this case, after normalising to unit variance, the variance of $N_X$ would be small compared to that of $N_Y$, as $X$ is close to $Z$. In general, the CAN algorithm distinguishes between all structures in Figure 1 using the following decision criteria: i) $X \to Y$ if $\mathrm{Var}(N_X) \ll \mathrm{Var}(N_Y)$, ii) $Y \to X$ if $\mathrm{Var}(N_Y) \ll \mathrm{Var}(N_X)$, and iii) $X \leftarrow Z \to Y$ if $\mathrm{Var}(N_X) \approx \mathrm{Var}(N_Y)$. Determining reasonable practical thresholds for the above criteria is a tricky problem, with values being decided on a case-by-case basis; see Confounders for more information.
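A minimal sketch of CAN's final decision step, assuming the regression stage has already produced residuals n_x and n_y for observed variables normalised to unit variance; the threshold values here are hypothetical placeholders, since, as noted above, they are set case by case.

```python
import numpy as np

def can_decision(n_x, n_y, low=0.2, high=5.0):
    """Variance-ratio decision rule; `low` and `high` are hypothetical thresholds."""
    r = np.var(n_x) / np.var(n_y)
    if r < low:
        return "X -> Y"        # X nearly coincides with the learned confounder
    if r > high:
        return "Y -> X"        # the symmetric case
    return "X <- Z -> Y"       # comparable variances: genuine common cause
```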

III Methods

We work in the functional causal model framework (also known as the structural equation model framework) peters2017elements. Here, a causal structure corresponds to a directed acyclic graph (DAG) between observed and latent variables, with each variable a deterministic function of its parents and some latent, independent noise term. A causal model corresponds to a DAG, the functions, and a specification of the priors over the latent noise terms.

The following two Lemmas and Definition play an important role in our method for distinguishing directed from latent common causes. We call a causal model with no directed arrows between observed variables purely common cause. We call two causal models observationally equivalent if they are characterised by the same set of distributions over observed variables.

Lemma 1.

Any causal model wherein there is a directed arrow between $X$ & $Y$ ($X \to Y$ or $Y \to X$) is observationally equivalent to one that is purely common cause.

This result has been proved before, in lee2017causal for example, but to see that it holds for the case of Figure 1(a), consider the following. A model for this causal structure corresponds to specifying the functional dependencies between $X, Y$ and their parents: $X = f(E_X)$ and $Y = g(X, E_Y)$. Substituting $f(E_X)$ for $X$ in $g$, one obtains $Y = g(f(E_X), E_Y)$. Defining $Z := E_X$ results in the relations $X = f(Z)$ and $Y = \tilde{g}(Z, E_Y)$, where $\tilde{g}(Z, E_Y) := g(f(Z), E_Y)$. The directed arrow from $X$ to $Y$ has been replaced by the latent "common cause" $Z$, and Figure 1(a) has been reduced to a purely common cause model. This is outlined graphically in the first two causal structures from Figure 2, where the symbol $\equiv$ denotes observational equivalence.
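The reduction in Lemma 1 can be checked numerically; the sketch below uses hypothetical choices of f and g and confirms that the directed model and its canonical purely common cause rewriting generate identical samples.

```python
# Numerical check of Lemma 1 with hypothetical f and g: the directed model
# X = f(E_X), Y = g(X, E_Y) and its canonical rewriting X = f(Z),
# Y = g~(Z, E_Y) with Z := E_X generate identical samples.
import numpy as np

f = lambda e_x: np.tanh(e_x)
g = lambda x, e_y: x ** 2 + 0.1 * e_y

rng = np.random.default_rng(0)
e_x, e_y = rng.normal(size=1000), rng.normal(size=1000)

x_dir, y_dir = f(e_x), g(f(e_x), e_y)    # directed model of Figure 1(a)

z = e_x                                  # relabel the noise as a "common cause"
g_tilde = lambda z_, e_: g(f(z_), e_)    # absorb f into g
x_can, y_can = f(z), g_tilde(z, e_y)     # canonical purely common cause model

assert np.allclose(x_dir, x_can) and np.allclose(y_dir, y_can)
```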

Figure 2: Depiction of three observationally equivalent DAGs discussed in Lemma 2. The symbol $\equiv$ denotes observational equivalence.

We say a causal model is in canonical form when it is purely common cause. Note that Figure 1(c) is already in canonical form. The following Lemma provides another key observation underlying our approach.

Lemma 2.

Given the causal model in Figure 1(a) (the case of Figure 1(b) follows by interchanging $X$ and $Y$), the canonical model is consistent with two causal structures: one where $Z$ is a common cause between $X$ and $Y$, and the other where $Z$ is the mediator of the causal arrow from $X$ to $Y$. This is depicted in the last two structures of Figure 2.

Proof.

Without loss of generality, the canonical model of Figure 1(a) is given by $X = f(Z)$ and $Y = \tilde{g}(Z, E_Y)$. These functions are consistent with the second causal structure in Figure 2, in which a causal arrow points from $Z$ to $X$. But the equality $X = f(Z)$ is also consistent with the third causal structure in Figure 2, where the causal arrow points from $X$ to $Z$. That is, the relationship $X = f(Z)$ is consistent with both an arrow from $Z$ to $X$ and an arrow from $X$ to $Z$. Note that due to the independence of the noise terms $Z$ and $E_Y$, this is not true of the other arrows in the causal structure. A causal arrow pointing from $Y$ to $Z$ in Figure 2 would induce correlations between $Z$ and $E_Y$, in direct contradiction with the independence of noise terms. ∎

Lemma 2 tells us that in the canonical model of Figure 1(a), the directionality of the causal link between $Z$ and $X$ is underdetermined, as illustrated in the last two structures of Figure 2. All other causal links are fully determined and directed, however. If, on the other hand, the original structure had been Figure 1(b), then the causal link between the canonical "common cause" $Z$ and $Y$ would be underdetermined, with all remaining links determined and directed. Note that if the structure had originally been that of Figure 1(c), then neither arrow from the true common cause $Z$ to the observed variables is underdetermined.

Here, we focus on directed causal discovery algorithms $\mathcal{A}$ that decide the causal direction using some quantifiable notion of asymmetry, such as IGCI daniusis2012inferring and KCDC mitrovic2018causal, both described in Section II. In general, such algorithms assign a real scalar to each causal direction, $S_{X \to Y}$ and $S_{Y \to X}$, returning the causal direction with smallest value. Due to the pervasiveness of statistical noise, such algorithms have a decision threshold $\delta$, such that if $|S_{X \to Y} - S_{Y \to X}| \le \delta$ the algorithm fails to detect a causal direction; see mitrovic2018causal for further details.

For instance, the KCDC algorithm of mitrovic2018causal uses conditional kernel mean embeddings to establish a norm on the conditional probabilities $P(Y|X = x)$, for each value $x$ of $X$. If the variance of this collection of norms is smaller than the variance of the corresponding norms of $P(X|Y = y)$, their causal discovery algorithm returns Figure 1(a), otherwise it returns Figure 1(b).
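To illustrate the idea, here is a deliberately simplified sketch of the KCDC criterion: it approximates conditioning on $X = x$ by quantile bins and uses the empirical RKHS norm of each bin's mean embedding. The actual algorithm of mitrovic2018causal embeds the conditionals directly, without binning.

```python
# Simplified KCDC-style sketch: variability, across conditioning values, of
# the squared RKHS norm of empirical conditional mean embeddings.
import numpy as np

def rbf_gram(a, b, gamma=1.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def kcdc_score(x, y, n_bins=10):
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    norms = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ys = y[(x >= lo) & (x <= hi)]
        if len(ys) > 1:
            norms.append(rbf_gram(ys, ys).mean())   # ||mu_{Y|bin}||^2 estimate
    return float(np.var(norms))

def kcdc_direction(x, y):
    # Prefer the direction whose conditional embeddings deviate less.
    return "x -> y" if kcdc_score(x, y) < kcdc_score(y, x) else "y -> x"
```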

The following definition formalises the notion of a causal arrow being underdetermined relative to a particular directed causal discovery algorithm.

Definition 1.

A causal link between two variables $A$ and $B$ is underdetermined relative to causal discovery algorithm $\mathcal{A}$ if $|S_{A \to B} - S_{B \to A}| \le \delta$, for pre-specified decision threshold $\delta$.

III.1 Detecting latent common causes

Lemmas 1 and 2 and Definition 1 suggest a method for distinguishing the three causal structures in Figure 1. To employ the method, however, the canonical common cause needs to be identified from samples of the observed variables $(X, Y)$.

Lemma 3.

Under the assumption that the influence of the latent noise term is small, the canonical common cause $Z$ can be identified from samples of $(X, Y)$.

Proof.

Consider the causal model of Figure 1(a) (all other cases follow by a similar argument), which, following Lemma 1, we can write as $X = f(Z)$ and $Y = \tilde{g}(Z, E_Y)$. Now, a Taylor expansion yields $Y = \tilde{g}(Z, 0) + E_Y \, \partial_{E_Y}\tilde{g}(Z, E_Y)\big|_{E_Y = 0} + \cdots$, where $\partial_{E_Y}\tilde{g}$ is the partial derivative of $\tilde{g}$ with respect to $E_Y$. The assumption that the influence of the noise term on $Y$ is small corresponds to the fact that we can drop higher order terms in $E_Y$ in the above Taylor expansion. Hence, we can write $Y \approx \tilde{g}(Z, 0) + E_Y \, \partial_{E_Y}\tilde{g}(Z, E_Y)\big|_{E_Y = 0}$. Without loss of generality, we can assume the noise term $E_Y$ is a Gaussian random variable with mean $0$. This implies that the expected value of $Y$ given $Z$ is $\tilde{g}(Z, 0)$. Thus the expected value of samples $(X, Y)$ given $Z$ is $(f(Z), \tilde{g}(Z, 0))$. That is, they are both functions of the canonical common cause $Z$ alone and can hence be identified from samples. ∎

The assumption underlying Lemma 3 intuitively corresponds to the requirement that fluctuations, due to latent noise, around the manifold parameterised by the canonical common cause are small.

Lemma 3 implies that, given the canonical causal model associated with any of the structures from Figure 1, manifold learning can be utilised—as is done in the CAN algorithm (janzing2012information, Section 3), described in Section II.1 of the current paper—to determine the approximate parameterisation (up to rescaling) of the canonical "common cause" $Z$ from samples of the observed $(X, Y)$.
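Concretely, the manifold-learning step can be sketched with scikit-learn's Isomap, mirroring the choice made in the CAN implementation; the neighbourhood size below is a hypothetical default.

```python
# Recover a 1-D parameterisation z_hat of the canonical common cause from the
# observed (x, y) point cloud; z_hat is identified only up to rescaling.
import numpy as np
from sklearn.manifold import Isomap

def learn_common_cause(x, y, n_neighbors=10):
    samples = np.column_stack([x, y])
    z_hat = Isomap(n_neighbors=n_neighbors, n_components=1).fit_transform(samples)
    return z_hat.ravel()
```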

Under the hypothesis that such an algorithm will learn the most likely latent representation of the data, we determine the most likely causal explanation as follows. First, note that in the case of Figure 1(a) and Figure 1(b), the "common cause" corresponds to the noise variables $E_X$ and $E_Y$ respectively, while in the case of Figure 1(c) it is the actual common cause, $Z$. Second, given the parameterisation of the "common cause", one can determine the original pre-canonical causal structure by using a purely directed causal discovery algorithm to determine which causal link to the observed variables is underdetermined (if any). If the link to $X$ is underdetermined then the causal structure is Figure 1(a); if the link to $Y$ is underdetermined the structure is Figure 1(b); if no link is underdetermined, the structure is Figure 1(c). The causal discovery algorithm described above is summarised in Algorithm 1.

Input: samples $(x_i, y_i)$ of $(X, Y)$, manifold learning algorithm $\mathcal{M}$, directed causal discovery algorithm $\mathcal{A}$.
Output: Single causal structure from Figure 1.

1: Run $\mathcal{M}$ on $(x_i, y_i)$ to obtain parameterisation $\hat{z}$ (up to rescaling, etc.) of the canonical common cause that best fits the data.
2: Implement $\mathcal{A}$ between $\hat{z}$ & $X$ and $\hat{z}$ & $Y$.
3: if $\mathcal{A}$ outputs $\hat{z} \to Y$ & the causal link between $\hat{z}$ & $X$ is underdetermined relative to $\mathcal{A}$ do:
4:   Output DAG from Figure 1(a)
5: else if $\mathcal{A}$ outputs $\hat{z} \to X$ & the causal link between $\hat{z}$ & $Y$ is underdetermined relative to $\mathcal{A}$ do:
6:   Output DAG from Figure 1(b)
7: else if $\mathcal{A}$ outputs $\hat{z} \to X$ & $\hat{z} \to Y$ do:
8:   Conclude $\hat{z}$ is a common cause of $X, Y$ and output DAG from Figure 1(c)
9: return DAG output from above
Algorithm 1
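The following sketch wires Algorithm 1 together, assuming a directed scorer A(a, b) that returns the pair of asymmetry scores (S_{a->b}, S_{b->a}) (for instance, built from the IGCI or KCDC sketches above) and the learn_common_cause routine sketched earlier.

```python
def algorithm_1(x, y, scorer, delta):
    """Sketch of Algorithm 1. `scorer(a, b)` returns (S_{a->b}, S_{b->a})."""
    z_hat = learn_common_cause(x, y)           # step 1: manifold learning

    s_zx, s_xz = scorer(z_hat, x)              # step 2: run A on both links
    s_zy, s_yz = scorer(z_hat, y)
    zx_open = abs(s_zx - s_xz) <= delta        # Definition 1: underdetermined
    zy_open = abs(s_zy - s_yz) <= delta

    if zx_open and not zy_open and s_zy < s_yz:
        return "X -> Y"                        # Figure 1(a): z_hat plays E_X
    if zy_open and not zx_open and s_zx < s_xz:
        return "Y -> X"                        # Figure 1(b): z_hat plays E_Y
    if not zx_open and not zy_open and s_zx < s_xz and s_zy < s_yz:
        return "X <- Z -> Y"                   # Figure 1(c): genuine confounder
    return "failure"                           # no structure confidently output
```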

III.2 Decision criteria

We now introduce a heuristic decision criterion for checking the conditions of Algorithm 1. Consider step 2 of Algorithm 1, where the learned canonical common cause is denoted $\hat{z}$. Write $\Delta_{\hat{z}X} := |S_{\hat{z} \to X} - S_{X \to \hat{z}}|$ and $\Delta_{\hat{z}Y} := |S_{\hat{z} \to Y} - S_{Y \to \hat{z}}|$ for the asymmetry gaps of the two links, and consider the statistic
$$R = \frac{|\Delta_{\hat{z}X} - \Delta_{\hat{z}Y}|}{\Delta_{\hat{z}X} + \Delta_{\hat{z}Y}}.$$
To simultaneously check whether the causal link between $\hat{z}$ and $X$ is underdetermined, and also whether $\hat{z} \to Y$ is consistent with the data, one can apply the following heuristic: check whether $R$ lies in the region $[\tau, 1]$ for some pre-specified decision threshold $\tau$ close to $1$. This follows as the conditions in step 3 imply $\Delta_{\hat{z}X} \le \delta$ and $\Delta_{\hat{z}Y} > \delta$, for small $\delta$. That is, the first term in the numerator of $R$ is generally close to zero, while the second is not, so $R$ is close to $1$.

Alternatively, if $R$ is close to $0$, then the data is consistent with $\hat{z} \to X$ & $\hat{z} \to Y$, as in this case $\Delta_{\hat{z}X} > \delta$ and $\Delta_{\hat{z}Y} > \delta$. Hence, both terms in the numerator can be roughly thought of as being closer in magnitude than in the case of steps 3 and 5. This allows us to check whether step 7 of Algorithm 1 is true.

Note that the first criterion is also consistent with step 5 of Algorithm 1, as steps 3 and 5 are the same under interchange of $X$ & $Y$. Hence if $R \in [\tau, 1]$, then either step 3 or step 5 is true, and step 7 is false, as per the above. As the two cases in steps 3 and 5 are purely directed causal structures, they can be distinguished using the original directed causal discovery algorithm.

To assign confidence to the outcome of the above heuristic, one can take a bootstrapped approach and calculate the mean $\mu_R$ of the $R_i$'s output by running the algorithm on subsamples of the input data, where $R_i$ is the value computed on the $i$th subsample. We suppress the index $i$ below for notational ease. As reasoned above, $\mu_R$ close to $1$ is consistent with a directed causal structure and $\mu_R$ close to $0$ with a common cause. Additionally, the variance $\sigma^2_R$ of the $R$'s calculated in this manner encodes information about the correct DAG. For instance, small variance $\sigma^2_R \le v$—where $v$ is again a pre-specified decision threshold less than $1$—is consistent with a directed causal structure and large variance is consistent with a common cause. This follows because for a directed causal structure only one of the terms in the numerator of each $R$ must lie in the region $[0, \delta]$, for small $\delta$, and hence cannot vary much. For a common cause, both terms are outside this region and thus can vary more.

These two implications, high $\mu_R$ being indicative of a directed cause and high $\sigma^2_R$ of a common cause, can be used in conjunction to determine the causal structure. For example, it could be the case that a particular common cause structure has a moderately high $\mu_R$, but such a high $\sigma^2_R$ that a directed cause is ruled out, and a common cause is returned by the algorithm. Similar reasoning can be applied to the case of a directed cause whose $\mu_R$ is lower than expected, but which has such a low $\sigma^2_R$ that a common cause is ruled out.

To summarise, the above decision criteria are as follows. The mean $\mu_R$ and variance $\sigma^2_R$ of the $R$'s output by running the algorithm on subsamples of the input data are computed. Recall the possible range of $R$ is $[0, 1]$. This range is split into a number of regions. We fix region 1 to be the subinterval closest to $1$, region 2 to be the next subinterval, and so on. Each region $i$ is associated with a threshold value $v_i$ for $\sigma^2_R$. If $\mu_R$ lies in region $i$, then if $\sigma^2_R \le v_i$ the algorithm outputs a directed causal structure, whose direction is then determined by the original directed causal discovery algorithm. But if $\sigma^2_R > v_i$, a common cause structure is output. Note that as the regions get closer to $1$, that is, to $R = 1$, the variance threshold gets increasingly small. Hence, $v_1 < v_2 < \cdots$. This encodes our previous reasoning of how mean and variance imply different causal structures. Additionally, there should be a failure mode for very low variance and mean, and for very high mean and variance, as these are indicative of neither a directed nor a common cause.

As with the decision criteria for the CAN algorithm, described in Section II.1, and for KCDC mitrovic2018causal, setting such thresholds for causal discovery presents a challenge due to the varying impact of noise within experimental data sets. The approach we take in this work is to use data of similar provenance with known causal relations to determine appropriate thresholds for that domain, which are then fixed.
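A sketch of the bootstrapped criterion, under the reconstruction of $R$ given above; the 25 bootstraps and 95% subsampling match the setup of Section IV, while the region boundary and variance threshold below are hypothetical stand-ins for the values reported in the Supplementary Material.

```python
# Bootstrapped decision statistics: high mean of R suggests a directed cause,
# high variance of R suggests a common cause.
import numpy as np

def r_statistic(x, y, scorer):
    z_hat = learn_common_cause(x, y)
    d_zx = abs(np.subtract(*scorer(z_hat, x)))   # asymmetry gap of the z-X link
    d_zy = abs(np.subtract(*scorer(z_hat, y)))   # asymmetry gap of the z-Y link
    return abs(d_zx - d_zy) / (d_zx + d_zy)

def bootstrap_moments(x, y, scorer, n_boot=25, frac=0.95, seed=0):
    rng = np.random.default_rng(seed)
    rs = []
    for _ in range(n_boot):
        idx = rng.choice(len(x), size=int(frac * len(x)), replace=False)
        rs.append(r_statistic(x[idx], y[idx], scorer))
    return float(np.mean(rs)), float(np.var(rs))

def decide(mu_r, var_r, region_edge=0.7, v_threshold=0.01):
    # Single-region simplification with hypothetical threshold values.
    if mu_r >= region_edge:
        return "directed" if var_r <= v_threshold else "common cause"
    return "common cause" if var_r > v_threshold else "failure"
```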

III.3 Identifiability

Lemmas 1, 2, and 3 establish the identifiability of our method. That is, given the assumption that the strength of latent noise is relatively small, a representation of the canonical common cause can be extracted from samples of the observed $(X, Y)$. The discussion in Section III.1 and Lemmas 1 and 2 then show how this representation can be used to distinguish all three causal structures in Figure 1.

The sensitivity of our algorithm to the assumption that the strength of latent noise be relatively small is explored empirically in Section IV.3.

Additionally, as our algorithm depends on the original directed causal discovery algorithm, their identifiability properties are closely related. It may sometimes be difficult to demarcate underdetermination from the unidentifiability of the original algorithm: a directed algorithm could conclude that a causal link is underdetermined when it is not, because the model underlying said link makes it unidentifiable. However, this does not seem to happen often in practice, as we see when conducting extensive experiments in many different noise and functional regimes in Section IV.

IV Experimental results

We test our method by showing it can turn both the KCDC algorithm of mitrovic2018causal and the IGCI algorithm of daniusis2012inferring, which can each only distinguish Figure 1(a) and (b), into algorithms that can distinguish all causal structures in Figure 1. We refer to the modified KCDC and IGCI algorithms output by our method as modKCDC and modIGCI respectively. To benchmark the performance of modKCDC and modIGCI, we compare against CAN Confounders, the only previous causal discovery algorithm explicitly designed to distinguish all structures in Figure 1, which we outlined in Section II.1. The CAN algorithm proved very computationally intensive to run compared to modKCDC and modIGCI. To reduce computational cost, we only enforced two of the three pairwise independence constraints among $\hat{z}$, $N_X$, and $N_Y$ discussed in Section II.1.

For KCDC, we make use of a radial basis function kernel with hyperparameters fixed in all experiments; there is scope to extend this to a multiple-kernel majority vote following mitrovic2018causal, though performance with a single basic kernel proved effective here. We use the implementation of CAN outlined in Confounders, using Gaussian Process regression and HSIC to test independence, and employing up to five iterations of the independence optimisation loops, capped at 5000 iterations of the solver each time. For the manifold learning subroutine of modKCDC, modIGCI, and CAN, we employ Isomap tenenbaum2000global. This is the same implementation of manifold learning as originally used in the CAN algorithm Confounders.

We followed the bootstrapping procedure outlined in Section III.2, using 25 bootstraps, each randomly sampling 95% of the data. To set the thresholds, we generated synthetic data from directed causal models with additive, multiplicative, and complex noise functions—similar to the models we test on in this section—and determined the threshold values which approximately reproduced the performance of the original directed causal discovery algorithms on these different models. We did this for modKCDC and modIGCI, as well as for CAN, whose decision criteria were discussed in Section II.1. This is similar to the way the CAN thresholds were set in janzing2012information, and it worked well despite the fact that our experiments were quite diverse in nature. We report the thresholds set in this manner in the Supplementary Material. These thresholds are used in all experiments.

IV.1 Synthetic data

We first test on synthetic data for both purely directed and purely common causes. The functions used to generate this synthetic data were adapted from the work of mitrovic2018causal, who used them to test KCDC and compare its performance to IGCI. As such, we believe using similar synthetic functions and the same range of synthetic noise distributions provides a fair and robust benchmark of KCDC, IGCI, and their modifications.

On the directed synthetic datasets, we compare the modified algorithms to the originals to determine whether the modification noticeably reduces their performance in detecting directed causal structure. Our experiments show that modKCDC and modIGCI preserve the high level of performance of the original KCDC and IGCI algorithms on directed data.

IV.1.1 Directed causal structures

In the below experiments, we sample 100 datasets, each of 250 observations, and test modKCDC and modIGCI. The CAN algorithm proved very computationally intensive, taking 300-500 times as long as modKCDC and modIGCI to run on a dataset of size 250. Hence we only tested it on 10 datasets, each of 250 observations. For our three algorithms, we sample the cause variable and test across three different noise regimes: Normal, Uniform, and Exponential. We record the accuracy over all datasets, that is, the percentage of cases where the algorithm output the correct ground-truth directed structure. This is the same metric employed by mitrovic2018causal in their experiments. The CAN algorithm was only able to fit a model that satisfied the independence criteria discussed in Sections II.1 and IV in the case of additive noise, as used in experiments (1) and (2) below. This is to be expected, as non-additive noise, as in experiments (3)-(6), violates the identifiability requirements of CAN. In these cases we could not report CAN's performance. All results are presented in Table 1.

Additive noise: experiments (1) and (2).

Multiplicative noise: experiments (3) and (4).

Complex noise: experiments (5) and (6).

In all synthetic directed-cause experiments, KCDC and IGCI perform extremely well, achieving 100% accuracy in all situations with the exception of IGCI in experiment (6) with normally distributed noise. In all cases bar one, modKCDC and modIGCI either match this performance or come within 1-2% of it. It is also interesting to note that in the one case in which IGCI performed badly, so did modIGCI. modIGCI failed to match the performance of IGCI in one case, however: that of experiment (6) with uniform noise. As IGCI itself performed badly on this experiment with normal noise, it seems to present a challenging dataset for this algorithm, so it is perhaps not overly surprising that modIGCI yielded poor performance in this case.

The CAN algorithm did not perform well on any of these synthetic datasets, failing to fit a model that complied with the independence criteria in all cases beyond additive noise. We only reported accuracy for CAN in cases where it successfully fit a model, which in principle gives CAN an added advantage. In (1), this happened in 5/10, 9/10, and 7/10 cases for the three different noise distributions. In (2), this happened in 7/10, 3/10, and 5/10 cases. CAN's poor performance shows the difficulty of developing an algorithm that can distinguish purely directed and common causes.

Table 1: Accuracy on the directed-cause experiments (1)-(6) under Normal, Uniform, and Exponential noise, for modKCDC, KCDC, modIGCI, IGCI, and CAN (CAN is reported only for the additive-noise experiments (1) and (2)).

IV.1.2 Common cause

The common cause setup closely follows that of Section IV.1.1. For these synthetic experiments, the latent common cause $Z$ was sampled from a fixed distribution, independently of the noise terms. Here, CAN was only able to fit a model for experiment (2) below. In all other cases we could not report CAN's performance. Results are presented in Table 2.

Additive noise: experiments (1) and (2).

Multiplicative noise: experiments (3) and (4).

Additive and multiplicative noise: experiments (5) and (6).

Table 2: Accuracy on the common-cause experiments (1)-(6) under Normal, Uniform, and Exponential noise, for modKCDC, modIGCI, and CAN (CAN is reported only for experiment (2)).

Here, modKCDC and modIGCI achieve impressive accuracy across a range of different synthetic noise functions—including additive and multiplicative noise—and diverse distributions over these noise terms. Despite the fact that neither KCDC nor IGCI could detect latent common causes, our method turned them into algorithms that can. CAN performed well on the only experiment where it managed to fit a model, (2). Again, accuracy was only reported when it fit a model, in principle giving CAN an added advantage. In (2), this happened in 5/10, 3/10, and 1/10 cases for the different noise distributions. CAN failed to fit a model in (1) and (3)-(6) above, again demonstrating the difficulty of distinguishing common and directed causes.

IV.2 Common cause robustness tests

In the above experiments we tested the robustness of our algorithms in different noise regimes. We now test robustness to more complex functional relationships. First, we test the accuracy of the algorithms in a regime beyond additive and multiplicative noise. We then test on functions drawn from Gaussian Processes. Results are presented in Table 3.

IV.2.1 Complex noise

IV.2.2 Gaussian Process generators

  • Let $X = f(Z)$ & $Y = g(Z)$, with $f$, $g$ drawn from the same Gaussian Process whose kernel is a sum of a polynomial kernel (with a restricted set of terms) and a periodic exponential kernel.

  • $f$ is drawn from a GP with a polynomial kernel (with a restricted set of terms) and $g$ is drawn from a GP with a sum of a polynomial kernel (with a restricted set of terms) and a periodic exponential kernel.

Table 3: Accuracy of modKCDC and modIGCI on the complex-noise and Gaussian Process common-cause experiments (1)-(4).

Both modKCDC and modIGCI perform well on these datasets beyond additive and multiplicative noise. Experiments (3) and (4) provide an analysis of robustness to changes in the parameters of the underlying functions.

IV.3 Sensitivity to the assumption of Lemma 3

As stated in Lemma 3, the identifiability of our method is based on the assumption that the influence of latent noise is small. We now empirically test the sensitivity of our method to this assumption using modIGCI. We synthetically generate both directed and common cause data from models which violate the assumption. Directed data is generated by sampling $X$ and $E_Y$ and inputting them to a structural equation in which the noise enters through a double exponential function scaled by a parameter $\beta$. Common cause data is generated by sampling $Z$, $E_X$ & $E_Y$ and inserting them into structural equations of the same form. Here, the latent noise terms influence the observed variables via a double exponential function, which violates the assumption underlying Lemma 3. In both cases, the smaller $\beta$ is, the smaller the influence of the noise terms. We would expect that the smaller $\beta$, the higher the accuracy of our method, and the larger $\beta$, the lower the accuracy. This intuition is borne out in the experiments. Results are plotted in Figure 3.
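The exact structural equations were not recoverable from this description, so the generators below are only an assumed illustration of the described setup: latent noise entering through a double exponential scaled by $\beta$.

```python
# Hypothetical generators in the spirit of the sensitivity test: noise enters
# through a double exponential scaled by beta, violating Lemma 3 as beta grows.
import numpy as np

def directed_data(n, beta, rng):
    x = rng.normal(size=n)
    e_y = rng.normal(size=n)
    y = x ** 3 + beta * np.exp(np.exp(e_y))      # assumed functional form
    return x, y

def common_cause_data(n, beta, rng):
    z = rng.normal(size=n)
    e_x, e_y = rng.normal(size=n), rng.normal(size=n)
    x = np.tanh(z) + beta * np.exp(np.exp(e_x))  # assumed functional forms
    y = z ** 2 + beta * np.exp(np.exp(e_y))
    return x, y
```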

Despite the fact that the noise influences the observed variables via a double exponential function, the accuracy of modIGCI is quite resilient. For instance, there are values of $\beta$ violating the assumption at which the accuracy on directed data is still 46% and the accuracy on common cause data 62%. Full results are in Figure 3.

Figure 3: Effect of increasing the noise strength $\beta$ on algorithm accuracy.

IV.4 Real data

We now test on real world data. We first consider the NLSchools data from the Tübingen cause-effect pairs (https://webdav.tuebingen.mpg.de/cause-effect/). Here 'socioeconomic status' is expected to be a direct cause of 'language score'. However, some researchers have argued that the correlation between these two attributes could be genetic in nature, implying the underlying structure is purely common cause. When applied to this data, modKCDC returns a high mean $\mu_R$ and a low variance $\sigma^2_R$, clearly indicative of a directed causal structure. modIGCI returns a slightly low $\mu_R$, but its $\sigma^2_R$ is extremely low. Following the discussion at the end of Section III.2, we should conclude that modIGCI also returns a directed causal structure in this instance. Hence, both modKCDC and modIGCI return a direct cause from 'socioeconomic status' to 'language score', ruling out a purely common cause explanation of this data.

Next, we consider the Breast Cancer Wisconsin (Diagnostic) Data UCI. Here the diagnosis (malignant or benign) is the common cause of two attributes, 'perimeter' and 'compactness', as they are conditionally independent given diagnosis. When run on this 'perimeter' and 'compactness' data, both modKCDC and modIGCI return a relatively small $\mu_R$ and a relatively large $\sigma^2_R$. From the discussion in Section III.2, we should conclude that both algorithms are consistent with a common cause. In addition, all values are within the thresholds fixed in Section IV, reported in the Supplementary Material, for a common cause. We hence conclude that both algorithms return a common cause between 'perimeter' and 'compactness', agreeing with the ground truth.

Next we consider the AutoMPG data set from UCI. Here, the attributes 'acceleration' and 'MPG' (miles per gallon) are correlated but have a common cause, the model year. This is checked by performing a conditional independence test between 'acceleration' and 'MPG' given model year. As per the thresholds reported in the Supplementary Material, modKCDC returns a common cause between 'acceleration' and 'MPG'. While modIGCI returned a low $\mu_R$—which is indicative of a common cause—its $\sigma^2_R$ is also low, which is not in general consistent with one. Given how low $\mu_R$ is, following the discussion at the end of Section III.2, one could conclude that modIGCI's output is also consistent with a common cause between 'acceleration' and 'MPG'.

V Conclusion

We devised a method to turn a purely directed causal discovery algorithm into one that can also detect latent common causes. In our experiments we took the KCDC mitrovic2018causal and IGCI daniusis2012inferring algorithms, which each could distinguish directed causes but could not detect latent common causes, and showed our method enabled detection of latent common causes while preserving accuracy in distinguishing directed causes.

Setting the thresholds for the decision criteria from Section III.2 is a challenging problem. The domain-based heuristic employed here follows both the approach of CAN Confounders, described in Section II.1, and that of KCDC mitrovic2018causal, outlined just before Definition 1. Despite its heuristic nature, it worked quite well across our varied experiments. Ongoing research will investigate the use of statistical tests to set and analyse the thresholds.

In future work, applications to the burgeoning field of quantum causal models allen2017quantum; lee2018towards; lee2018certification; chaves2015information; wolfe2016inflation will be explored.

References

Appendix A Decision thresholds for experimental section of main paper

In Section IV of the main paper, we use synthetic directed data to set the thresholds for the decision criteria—discussed in Section III.2 of the main paper—of our modifications of KCDC mitrovic2018causal and IGCI daniusis2012inferring, termed modKCDC and modIGCI respectively. To set the thresholds, we generated synthetic data from directed causal models with additive, multiplicative, and complex noise functions—similar to the ones used in the Experiments section—and determined the threshold values which approximately reproduced the performance of the original directed causal discovery algorithms on these three different models. These thresholds are then used in all experiments. While this is somewhat ad hoc, it is similar to the way CAN's thresholds were set in janzing2012information, and it worked well despite the fact that our experiments were quite diverse in nature. The thresholds were set following the bootstrapping approach detailed in Section III.2, using 25 bootstraps, each randomly sampling 95% of the data.

The thresholds we set for modKCDC are as follows:

  1. If $\mu_R$ lies in region 1 and $\sigma^2_R \le v_1$, then we output a directed causal structure, oriented using the original causal discovery algorithm.

    • The regime where both $\mu_R$ and $\sigma^2_R$ are very high is a failure mode of our algorithm.

  2. If $\mu_R$ lies in region 1 and $\sigma^2_R > v_1$, a common cause structure is output.

    • The regime where both $\mu_R$ and $\sigma^2_R$ are very low is a failure mode of our algorithm.

  3. If $\mu_R$ lies in region 2 and $\sigma^2_R \le v_2$, then a directed causal structure is output.

  4. But if $\mu_R$ lies in region 2 and $\sigma^2_R > v_2$, then a common cause structure is output.

  5. If $\mu_R$ lies in region 3 and $\sigma^2_R \le v_3$, then a directed causal structure is output.

  6. But if $\mu_R$ lies in region 3 and $\sigma^2_R > v_3$, then a common cause structure is output.

The thresholds we set for modIGCI are as follows:

  1. If $\mu_R$ lies in region 1 and $\sigma^2_R \le v_1$, then we output a directed causal structure, oriented using the original causal discovery algorithm.

    • The regime where both $\mu_R$ and $\sigma^2_R$ are very high is a failure mode of our algorithm.

  2. If $\mu_R$ lies in region 1 and $\sigma^2_R > v_1$, a common cause structure is output.

    • The regime where both $\mu_R$ and $\sigma^2_R$ are very low is a failure mode of our algorithm.

  3. If $\mu_R$ lies in region 2 and $\sigma^2_R \le v_2$, then a directed causal structure is output.

  4. But if $\mu_R$ lies in region 2 and $\sigma^2_R > v_2$, then a common cause structure is output.

  5. If $\mu_R$ lies in region 3 and $\sigma^2_R \le v_3$, then a directed causal structure is output.

  6. But if $\mu_R$ lies in region 3 and $\sigma^2_R > v_3$, then a common cause structure is output.

The thresholds set for CAN, described in Section II.1 of the main paper, are as follows. Let $r = \mathrm{Var}(N_X)/\mathrm{Var}(N_Y)$, computed after normalising the observed variables to unit variance. If $r$ is below a lower threshold, then $X \to Y$ is returned. If $r$ is above an upper threshold, then $Y \to X$ is returned. If $r$ lies between the two thresholds, then a common cause structure is returned.