Discovering causal structures from data is a challenging inference problem of fundamental importance in all areas of science. The appealing scaling properties of neural networks have recently led to a surge of interest in differentiable neural network-based methods for learning causal structures from data. So far, differentiable causal discovery has focused on static datasets of observational or interventional origin. In this work, we introduce an active intervention-targeting mechanism which enables quick identification of the underlying causal structure of the data-generating process. Our method significantly reduces the required number of interactions compared with random intervention targeting and is applicable to both discrete and continuous optimization formulations of learning the underlying directed acyclic graph (DAG) from data. We examine the proposed method across a wide range of settings and demonstrate superior performance on multiple benchmarks from simulated to real-world data.
Learning causal structure from data is a challenging but important task that lies at the heart of scientific reasoning and of accompanying progress in many disciplines (sachs2005causal; hill2016inferring; lauritzen1988local; korb2010bayesian). While there exists a plethora of methods for the task, computationally and statistically more efficient algorithms are highly desired (heinze2018causal). As a result, there has been a surge of interest in differentiable structure learning and the combination of deep learning and causal inference (scholkopf2021toward). Such methods define a structural causal model with smoothly differentiable parameters that are adjusted to fit observational data (zheng2018dags; yu2019dag; zheng2020learning; bengio2019meta; lorch21dibs; annadani2021variational); some methods can additionally accept interventional data, thereby significantly improving the identification of the underlying data-generating process (ke2021dependency; brouillard2020differentiable; lippe2021efficient). However, the improvement critically depends on the experiments and interventions available. Despite advances in high-throughput methods for interventional data in specific fields (dixit2016perturb), the acquisition of interventional samples in general settings tends to be costly, technically impossible, or even unethical for specific interventions. There is, therefore, a need for efficient usage of the available interventional samples and for efficient experimental design that keeps the number of interventions to a minimum.
A significant amount of prior work in causal structure learning leverages active learning and experimental design to improve identifiability in a sequential manner. These approaches are either graph-theoretic (he2008active; eberhardt2012almost; hyttinen2013experiment; hauser2014two; shanmugam2015learning; kocaoglu2017experimental; kocaoglu2017cost; lindgren2018experimental; ghassami2018budgeted; ghassami2019interventional; greenewald2019sample; squires2020active), Bayesian (murphy2001active; tong2001active; masegosa2013interactive; cho2016reconstructing; ness2017bayesian; agrawal2019abcd; zemplenyi2021bayesian), or rely on Invariant Causal Prediction (gamella2020active). These methods are typically computationally very expensive and do not scale well with respect to the number of variables or the dataset size (heinze2018causal). A promising alternative is the use of active learning in a continuous optimization framework for causal structure learning from joint data. However, since existing scores and heuristics for selecting intervention targets are of limited applicability in such frameworks (see §4), current approaches rely on random and independent interventions and do not leverage the evidence acquired from the processed experiments.

To this end, we propose a novel method for the active selection of intervention targets that can easily be incorporated into many differentiable causal discovery algorithms. Since most differentiable causal discovery algorithms treat the adjacency matrix of the causal graph as a learned soft-adjacency, it is readily available for parametrized sampling of different hypothesis graphs. Our method finds an intervention target that maximizes the disagreement between post-interventional sample distributions under these hypothesis graphs; we conjecture that interventions on such nodes carry more information about the causal structure and hence enable more efficient learning. To the best of our knowledge, ours is the first approach to combine a continuous optimization framework with active causal structure learning from observational and interventional data. We summarize our contributions as follows:
We propose a novel approach for selecting interventions (single- and multi-target) which identifies the underlying graph efficiently and can be used with any differentiable causal discovery method.
We introduce a novel scalable, two-phase DAG sampling procedure which efficiently generates hypothesis DAGs based on a soft-adjacency matrix.
We examine the proposed intervention-targeting method across a wide range of settings and demonstrate superior performance against established competitive baselines on multiple benchmarks from simulated to real-world data.
We provide empirical insights on the distribution of selected intervention targets and its connection to the (causal) topological order of the variables in the underlying system. Further, we show how the underlying causal graph is identified through the interventions.
Structural Causal Model. An SCM (peters2017elements) is defined over a set of random variables $X = \{X_1, \dots, X_N\}$ (or just $X$ for short) and a directed acyclic graph (DAG) $G$ over the variables. The random variables are connected by functions $f_i$ and jointly independent noise variables $N_i$ via
$$X_i = f_i(\mathrm{PA}_i, N_i),$$
where $\mathrm{PA}_i$ are $X_i$'s parents, and directed edges in the graph represent direct causation. The conditionals $p(x_i \mid \mathrm{pa}_i)$ define the conditional distribution for $X_i$ given its parents $\mathrm{PA}_i$.
Interventions. An intervention on $X_i$ changes the conditional distribution of $X_i$, hence affecting the outcome of $X_i$. Interventions can be perfect/hard or imperfect/soft. Hard interventions entirely remove the dependencies of a variable on its parents, hence defining the conditional probability distribution of $X_i$ by some $p(x_i)$ rather than $p(x_i \mid \mathrm{pa}_i)$. A more general form of intervention is the soft intervention, where the intervention changes the effect of the parents of $X_i$ by modifying the conditional distribution from $p(x_i \mid \mathrm{pa}_i)$ to an alternative $\tilde{p}(x_i \mid \mathrm{pa}_i)$.

We evaluate the proposed method under multiple continuous-optimization causal learning frameworks operating on fused data (bareinboim2016causal), one of them being DSDI (ke2021dependency). DSDI reformulates the problem of causal discovery from discrete data as a continuous optimization problem using neural networks. The framework learns the causal adjacency matrix as a matrix parameter of a neural network and is trained using a three-stage iterative procedure. The first stage samples graphs under the model's current belief in the graph structure and trains the conditionals of the sampled graphs on observational data. The second stage evaluates the sampled graphs on interventional data and scores them accordingly. The final stage updates the learned adjacency matrix with the scores from stage two. This method performs competitively compared to many other methods; however, all intervention targets in stage two of DSDI are random and independent, a strategy that scales poorly to larger graphs. A better approach is active intervention targeting.
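To make the sampling machinery these frameworks rely on concrete, here is a minimal sketch of ancestral sampling from a toy SCM with an optional hard intervention. The function, the linear-Gaussian mechanisms, and the chain X0 → X1 → X2 are illustrative assumptions, not part of DSDI or DCDI:

```python
import numpy as np

def ancestral_sample(order, parents, mechanisms, n,
                     intervened=None, new_dist=None, rng=None):
    """Draw n samples from an SCM by ancestral sampling.

    order:      a topological ordering of the variable indices
    parents:    dict mapping each variable to its list of parent indices
    mechanisms: dict mapping each variable to f(parent_values, noise) -> values
    intervened: optional index for a hard intervention do(X_i ~ new_dist),
                which severs X_i's dependence on its parents
    """
    rng = np.random.default_rng(0) if rng is None else rng
    X = np.zeros((n, len(order)))
    for i in order:
        if i == intervened:
            X[:, i] = new_dist(n)                          # hard intervention
        else:
            noise = rng.standard_normal(n)
            X[:, i] = mechanisms[i](X[:, parents[i]], noise)
    return X

# Toy chain X0 -> X1 -> X2 with linear-Gaussian mechanisms (illustrative only)
mechs = {
    0: lambda pa, nz: nz,
    1: lambda pa, nz: 2.0 * pa[:, 0] + 0.1 * nz,
    2: lambda pa, nz: -pa[:, 0] + 0.1 * nz,
}
parents = {0: [], 1: [0], 2: [1]}
obs = ancestral_sample([0, 1, 2], parents, mechs, 1000)
do1 = ancestral_sample([0, 1, 2], parents, mechs, 1000,
                       intervened=1, new_dist=lambda n: np.full(n, 5.0))
```

In the interventional run, `do1[:, 1]` is fixed at 5.0 regardless of X0: the hard intervention replaces the conditional of X1 given its parent with a new marginal, which is exactly the property that makes interventional data more informative than observational data.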
We also consider DCDI (brouillard2020differentiable), which addresses causal discovery from continuous data as a continuous-constrained optimization problem, using neural networks to parametrize Gaussian distributions or normalizing flows (rezende2015variational) that represent the conditional distributions. Unlike DSDI's iterative training of the structural and functional parameters, DCDI optimizes the causal adjacency matrix and the functional parameters jointly over the fused (observational and interventional) data. Like DSDI, however, DCDI uses random and independent interventions.
Figure 1: (a) The discrepancy score is constructed as a ratio of variance-between-graphs (VBG) and variance-within-graphs (VWG). (b) The landscape of the discrepancy score, visualized on a logarithmic scale, where red denotes high values and blue low values.

We present a score-based intervention design strategy, called Active Intervention Targeting (AIT), which is applicable to many discrete and continuous optimization formulations of causal structure learning algorithms. Furthermore, we show how our proposed method can be integrated into recent differentiable causal discovery frameworks for guided exploration using interventional data.
Assumptions. The proposed method assumes access to a belief state over the graph structure (e.g., in the form of a distribution over graphs, a probabilistic adjacency matrix, or a set of hypothesis graphs) and to functional parameters characterizing the conditional relationships between variables. The proposed method does not have to assume causal sufficiency per se; however, it inherits the assumptions of the selected base framework, which may include causal sufficiency depending on the base algorithm of choice. If the underlying framework can handle unobserved variables and offers a generative method for interventional samples, then our method is also applicable.
Given a graph belief state $B$ with its corresponding functional parameters $\theta$, and a set $\mathcal{I}$ of possible intervention targets (single-node and multi-node), we wish to select the most informative intervention target(s) with respect to the identifiability of the underlying structure. Such targets presumably yield relatively high discrepancies between samples drawn under different hypothesis graphs, indicating uncertainty about the target's relation to its parents and/or children.
We thus construct an F-test-inspired score to determine the target $I \in \mathcal{I}$ exhibiting the highest discrepancies between post-interventional sample distributions generated by likely graph structures under the fixed functional parameters $\theta$.
In order to compare sample distributions over different graphs, we distinguish between two sources of variation: variance between graphs (VBG) and variance within graphs (VWG). While VBG characterizes the variance of the sample means across the sampled graphs, VWG accounts for the sample variance within a specific graph. We mask the contribution of the intervened variables to VBG and VWG, and construct our discrepancy score as the ratio $\mathrm{VBG}/\mathrm{VWG}$.
This discrepancy score attains high values for intervention targets of particular interest (see Fig. 1b for a visualization of the landscape). While VBG alone indicates which intervention targets the model is unsettled about, extending it to the proposed variance ratio gives more control over the region of interest. Given a fixed set of graphs and a fixed interventional sample size across all graphs, assume a scenario in which multiple intervention targets attain high VBG. Assessing VWG allows us to distinguish between two extreme cases: (a) targets whose sample populations exhibit large VWG and (b) targets whose sample populations exhibit low VWG. While the high VBG in (a) might be induced by an insufficient sample size due to high variance in the interventional distribution itself, (b) clearly indicates high discrepancy between graphs and should be preferred in the causal discovery process.
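As a minimal sketch of the score just described (the function name and array layout are our own; like the paper, we drop the F-statistic's normalization constants since group sizes are equal):

```python
import numpy as np

def discrepancy_score(samples, intervened):
    """VBG/VWG discrepancy score for one intervention target.

    samples:    array of shape (G, S, d) -- S interventional samples drawn
                under each of G hypothesis graphs (shared functional params).
    intervened: indices of the intervened variables; their contribution to
                both variance terms is masked out by zeroing.
    """
    X = samples.astype(float).copy()
    X[:, :, intervened] = 0.0                       # mask intervened variables
    graph_means = X.mean(axis=1)                    # (G, d) per-graph sample means
    overall_mean = X.mean(axis=(0, 1))              # (d,)  pooled sample mean
    vbg = ((graph_means - overall_mean) ** 2).sum()        # variance between graphs
    vwg = ((X - graph_means[:, None, :]) ** 2).sum()       # variance within graphs
    return vbg / vwg
```

A target whose hypothesis graphs produce well-separated, low-variance sample populations receives a high score and would be selected for the next intervention.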
Input: Functional parameters $\theta$, graph belief state $B$, interventional target space $\mathcal{I}$
Output: Intervention target $I^* \in \mathcal{I}$
We begin by sampling a set of $G$ graphs from our graph structure belief state $B$, however parametrized; this set remains fixed for all considered interventions.
Then, we fix an intervention target $I \in \mathcal{I}$ and apply the corresponding intervention to $\theta$, resulting in partially altered functional parameters $\theta_I$ in which some conditionals have been changed. Next, we draw $S$ interventional samples from $\theta_I$ on every graph and set the variables in $I$ to zero to mask out their contribution to the variance. Having collected all samples over the considered graphs for the specific intervention target $I$, we compute VBG and VWG as
$$\mathrm{VBG} = \sum_{g=1}^{G} \left\lVert \bar{x}^{(g)} - \bar{x} \right\rVert^{2}, \qquad \mathrm{VWG} = \sum_{g=1}^{G} \sum_{s=1}^{S} \left\lVert x^{(g,s)} - \bar{x}^{(g)} \right\rVert^{2},$$
where $\bar{x}$ is a vector of the same dimension as any sample and denotes the overall sample mean of the interventional setting, $\bar{x}^{(g)}$ is the corresponding mean for graph $g$, and $x^{(g,s)}$ is the $s$-th sample of the $g$-th graph configuration. Finally, we construct the discrepancy score of $I$ as
$$\mathcal{S}(I) = \frac{\mathrm{VBG}}{\mathrm{VWG}}.$$
In contrast to the original definition of the F-score, we can ignore the normalization constants due to the equal group sizes and degrees of freedom. Although the variables are dependent through the connected causal structure, we approximate the variance of the multidimensional samples as the trace of the covariance matrix, i.e., we treat the variables as independent. An outline of the method is provided in Algorithm
1.

Embedding AIT into recent differentiable causal discovery frameworks requires a graph sampler that generates a set of likely graph configurations under the current graph belief state. However, drawing samples of unconstrained graphs (e.g., partially undirected graphs or cyclic directed graphs) is an expensive multi-pass process, and we thus constrain our graph sampling space to DAGs for the present work. Since most differentiable causal structure learning algorithms learn edge beliefs in the form of a soft-adjacency matrix, we present a scalable, two-stage DAG sampling procedure that exploits structural information of the soft-adjacency beyond independent edge confidences (see Figure 2 for a visual illustration). More precisely, we start by sampling topological node orderings from an iteratively refined score and then construct DAGs in the constrained space by independent Bernoulli draws over the admissible edges. We thereby guarantee DAGness by construction and do not have to rely on expensive, non-scalable techniques such as rejection sampling or Gibbs sampling. The overall method is inspired by topological sorting algorithms for DAGs, where we iteratively identify nodes with no incoming edges, remove them from the graph, and repeat until all nodes have been processed.
Soft-Adjacency. Given learnable structural parameters $\gamma$ of a graph over $N$ variables, the soft-adjacency matrix is given as $A = \sigma(\gamma) \in [0,1]^{N \times N}$, such that $A_{ij}$ encodes the probabilistic belief in random variable $X_j$ being a direct cause of $X_i$, where $\sigma$ denotes the sigmoid function. For ease of notation, we define $A^{(0)} = A$ and use $A^{(k)}$ to denote the considered soft-adjacency at iteration $k$. Note that the shape of $A^{(k)}$ shrinks through the iterations.

Sample node orderings. For the iterative root sampling procedure, we start at iteration $k = 0$ with the initial soft-adjacency $A^{(0)}$ and apply the following routine for $N$ iterations. We take the maximum over the rows of $A^{(k)}$ and arrive at a vector of independent probabilities $c^{(k)}$, where $c^{(k)}_i$ denotes the maximal probability of variable $X_i$ being a child of any other variable under the current belief state. Taking the complement, we arrive at $r^{(k)} = 1 - c^{(k)}$, where $r^{(k)}_i$ denotes the approximated probability of variable $X_i$ being a root node in the current state. In order to arrive at a normalized distribution from which to sample a root node, we apply a temperature-scaled softmax:
$$p^{(k)} = \operatorname{softmax}\!\left(r^{(k)} / \tau\right),$$
where $\tau$ denotes the temperature. Temperature-scaling allows us to control the distribution over nodes and to account for the entropy of the structural belief. We proceed by sampling a (root) node from $p^{(k)}$, delete the corresponding row and column from $A^{(k)}$, and arrive at a shrunk soft-adjacency $A^{(k+1)}$ over the remaining variables. We repeat the procedure until all nodes have been processed, yielding a topological node ordering $\pi$.
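A minimal sketch of this root-sampling stage, assuming the convention that entry (i, j) of the soft-adjacency holds the belief that variable j is a direct cause of variable i (function and variable names are illustrative):

```python
import numpy as np

def sample_node_ordering(A, tau=1.0, rng=None):
    """Sample one topological node ordering from a soft-adjacency matrix.

    A[i, j] encodes the belief that X_j is a direct cause of X_i.
    At each step, the max over each row approximates the probability of
    that node being a child; its complement scores the node as a root.
    A temperature-scaled softmax turns the scores into a distribution,
    a root is drawn and removed, and the procedure repeats.
    """
    rng = np.random.default_rng() if rng is None else rng
    A = np.array(A, dtype=float)
    np.fill_diagonal(A, 0.0)                 # ignore self-loop beliefs
    remaining = list(range(A.shape[0]))
    order = []
    while remaining:
        c = A.max(axis=1)                    # max prob. of being a child
        r = 1.0 - c                          # approx. root probabilities
        logits = r / tau                     # temperature-scaled softmax
        p = np.exp(logits - logits.max())
        p /= p.sum()
        k = rng.choice(len(remaining), p=p)  # draw a root node
        order.append(remaining.pop(k))
        A = np.delete(np.delete(A, k, axis=0), k, axis=1)
    return order
```

With a small temperature the draw concentrates on the most root-like node; a large temperature flattens the distribution and explores alternative orderings.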
Sample DAGs based on node orderings. Given a node ordering $\pi$, we permute the soft-adjacency accordingly and constrain the upper triangular part by setting its values to zero, ensuring DAGness by construction (as shown in Figure 2). Finally, we sample a DAG by independent Bernoulli draws over the remaining edge beliefs, as proposed in ke2021dependency.
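Under the same assumed convention (row index = effect, so ordering-consistent edges occupy the strict lower triangle of the permuted matrix), the second stage can be sketched as:

```python
import numpy as np

def sample_dag(A, order, rng=None):
    """Sample a DAG consistent with a given node ordering.

    Permutes the soft-adjacency by the ordering, zeroes the upper
    triangle (edges that would violate the ordering), and samples the
    remaining edges by independent Bernoulli draws, so the result is
    acyclic by construction.
    """
    rng = np.random.default_rng() if rng is None else rng
    P = np.array(A, dtype=float)[np.ix_(order, order)]  # permute rows/cols
    P = np.tril(P, k=-1)                  # zero upper triangle and diagonal
    G = (rng.random(P.shape) < P).astype(int)           # Bernoulli edge draws
    inv = np.argsort(order)               # undo the permutation
    return G[np.ix_(inv, inv)]
```

The returned matrix is indexed by the original node labels; G[i, j] = 1 encodes an edge X_j → X_i that respects the sampled ordering, so every sampled graph is a DAG without any rejection step.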
Connection to the Plackett-Luce distribution (Luce59; plackett1975analysis):
Our proposed node-ordering sampling routine can be regarded as an extension of the Plackett-Luce distribution over node permutations. In contrast to the standard formulation, we refine the scores in an iterative fashion rather than fixing them a priori, as we account for previously drawn nodes when estimating the probability of a node being the root node in the current iteration.
Before integrating our method into the DSDI framework, we must choose/design a graph sampler based on DSDI’s graph belief characterization and define a sampling routine to generate interventional samples under a given state of the structural and functional parameters.
DSDI offers learnable structural parameters $\gamma$ of a graph over $N$ variables such that $\sigma(\gamma)$ encodes the soft-adjacency matrix. This formulation naturally admits the introduced two-phase DAG sampling to generate hypothetical DAGs under the current beliefs. Under these hypothetical, acyclic graph configurations, one may then apply an intervention to DSDI's functional parameters and sample data using ancestral sampling.
We again start by defining sampling routines for the generation of hypothetical graphs and interventional samples under a given state. DCDI also offers access to a soft-adjacency matrix, which allows the same setup as with DSDI. In order to generate interventional samples under a hypothetical graph configuration, we alter the conditionals of the intervened variables and perform ancestral sampling based on the model's learned conditional densities.
Embedding AIT into DCDI allows us to predict an interventional target space instead of relying on random interventional samples chosen from the full target space. In contrast to the unconstrained target space of the original formulation, we estimate a target space of constrained size using AIT and reevaluate it after a fixed number of gradient steps (see §A.6.1 for technical details).
Causal induction can use observational and/or interventional data. With purely observational data, the causal graph is only identifiable up to a Markov equivalence class (MEC) (spirtes2000causation); interventions are needed in order to identify the underlying causal graph (eberhardt2007interventions). Our work focuses on causal induction from interventional data.
Causal Structure Learning. There exist several approaches to causal induction from interventional data: score-based, constraint-based, conditional-independence-test-based, and continuous-optimization methods. We refer to (heinze2018causal; vowels2021d) for recent overviews. While most algorithms perform heuristic, guided searches through the discrete space of DAGs, zheng2018dags reformulates the task as a continuous optimization problem constrained to the zero level set of the matrix exponential of the adjacency matrix. This important result has driven recent work in the field and shown promising results (kalainathan2018sam; yu2019dag; ng2019graph; lachapelle2020gradient-based; zheng2020learning; Zhu2020Causal). Due to the limitations of purely observational data, ke2021dependency and brouillard2020differentiable extend the continuous optimization framework to make use of interventional data. In work concurrent with ours, lippe2021efficient scales ke2021dependency to higher dimensions by splitting the structural edge parameters into separate orientation and likelihood parameters and exploiting this split in an adapted, lower-variance gradient formulation. In contrast to (brouillard2020differentiable; ke2021dependency) and our work, they require interventional data on every variable.
Active Causal Structure Learning. Interventions are usually hard to perform and in some cases even impossible (peters2017elements). Minimizing the number of interventions performed is desirable. Active causal structure learning addresses this problem, and a number of approaches have been proposed in the literature. These approaches can be divided into those that select intervention targets using graph-theoretic frameworks, and those using Bayesian methods and information gain.
Graph-theoretic frameworks usually proceed from a pre-specified MEC or CPDAG (completed partially directed acyclic graph) and either investigate special graph substructures (he2008active) such as cliques (eberhardt2012almost; squires2020active), trees (greenewald2019sample), or they prune and orient edges until a satisfactory solution is reached (ghassami2018budgeted; ghassami2019interventional; hyttinen2013experiment), perhaps under a cost budget (kocaoglu2017cost; lindgren2018experimental). Their chief limitation is that an incorrect starting CPDAG can prevent reaching the correct graph structure even with an optimal choice of interventions.
The other popular set of techniques involve sampling graphs from the posterior distribution in a Bayesian framework using MCMC and then selecting the interventions which maximize the information gain on discrete (murphy2001active; tong2001active) or Gaussian (cho2016reconstructing) variables. The drawbacks of these techniques are poor scaling and the difficulty of integrating them with non-Bayesian methods, except perhaps by bootstrapping (agrawal2019abcd).
In contrast to existing work, our base frameworks do not start from a pre-specified MEC or CPDAG, and existing graph-theoretic approaches are hence not applicable unless we pre-initialize them with a known skeleton. Moreover, even given access to a predefined structure in the form of a MEC or CPDAG, a previously directed edge is likely to be inverted during the ongoing process, which contradicts the underlying assumptions of existing approaches. Further, we build atop non-Bayesian frameworks and are therefore limited in applying methods based on information gain, which require access to a posterior distribution over graph structures. While bootstrapping would allow us to approximate such a posterior in our non-Bayesian setting, it is not guaranteed to achieve full support over all graphs, since the support is limited to graphs estimated in the bootstrap procedure (agrawal2019abcd). Furthermore, the computational burden of bootstrapping would prevent us from scaling to larger graphs.
We evaluate the proposed active intervention targeting mechanism on single-target interventions under two different settings: DSDI (ke2021dependency) and DCDI (brouillard2020differentiable). We investigate the impact of AIT under both settings on identifiability, sample complexity, and convergence behaviour, compared to random targeting, where the next intervention target is chosen independently of the current evidence. In a further line of experiments, we analyze the targeting dynamics with respect to convergence behaviour and the distribution of target node selections. This section highlights our results on DSDI while also including key results for DCDI (structural discovery and identifiability); further ablation studies and analyses of the DCDI results are deferred to the appendix.
A huge variety of SCMs and their induced DAGs exist, each of which can stress causal structure discovery algorithms in different ways. We perform a systematic evaluation over a selected set of synthetic and non-synthetic SCMs (and datasets). We distinguish between synthetic structured graphs and random graphs, the latter generated from the Erdős–Rényi (ER) model with varying edge densities (see §A.2 for a detailed description of the setup). For conciseness, in this section we only report results on 15-node graphs in the noise-free synthetic setting for AIT on DSDI and on 10-node graphs in the noisy setting for AIT on DSDI (discrete data). In addition, we report key results on 10-node graphs for AIT on DCDI (continuous data) in the main text and provide further results and ablation studies in the appendix. We complete the setup with the Sachs flow cytometry dataset (sachs2005causal) and the Asia network (lauritzen1988local) to evaluate the proposed method on well-known real-world datasets for causal structure discovery.¹
[¹ The real-world datasets are available under a Creative Commons Attribution-Share Alike license in the bnlearn R package, and most baseline implementations are available for Python in the Causal Discovery Toolbox (kalainathan2019causal) under an MIT license. A-ICP is provided by the authors at https://github.com/juangamella/aicp but without a license.]
(a) We report strong results for active-targeted structure discovery on both discrete and continuous-valued datasets and outperform random targeting in all experiments. (b) The proposed intervention-targeting mechanism significantly reduces sample complexity, with strong benefits for graphs of increasing size and density. (c) The distribution of target selections during graph exploration is strongly connected to the topology of the underlying graph. (d) Undesirable interventions are drastically reduced. (e) When monitoring structural Hamming distance (SHD) throughout the procedure, an “elbow” point appears approximately when the Markov equivalence class (MEC) has been isolated. (f) Active targeting introduces desirable properties such as improved recovery of erroneously converging edges. (g) AIT significantly improves robustness in noise-perturbed environments.
We evaluate accuracy in terms of structural Hamming distance (SHD) (acid2003searching) on a diverse set of synthetic non-linear datasets under both DSDI and DCDI, adopting their respective evaluation setups.
The results of DSDI with AIT are reported in Table 1. DSDI with active intervention targeting outperforms all baselines and DSDI with random intervention targeting over all presented datasets. It enables almost perfect identifiability on all structured graphs of size 15 except for the full15 graph, and significantly improves structure discovery of random graphs with varying densities. As the size or density of the underlying causal graphs increases, the benefit of the selection policy becomes more apparent (see Figure 4).
We also examine the effectiveness of our proposed method for DCDI (brouillard2020differentiable) on non-linear data from random graphs of size 10. Active intervention targeting improves identification in terms of sample complexity and structural identifiability compared to random exploration (see Figure 6 and §A.6 for further analyses). We observe that the targeting mechanism, which controls the order and frequency of interventional targets presented to the model, has a clear impact. Further experimental results for DCDI can be found in the appendix.
Table 1: SHD on structured graphs (Chain–Full) and random graphs (ER-1–ER-4); best results in bold.

| Method | Chain | Collider | Tree | Bidiag | Jungle | Full | ER-1 | ER-2 | ER-4 |
|---|---|---|---|---|---|---|---|---|---|
| GES (chickering2002optimal) | 13 | 1 | 12 | 14 | 14 | 69 | 8.3 () | 17.6 () | 39.4 () |
| GIES (hauser2012characterization) | 13 | 6 | 10 | 17 | 23 | 60 | 10.9 () | 18.1 () | 39.3 () |
| ICP (peters2016causal) | 14 | 14 | 14 | 27 | 26 | 105 | 16.2 () | 31.1 () | 60.1 () |
| A-ICP (gamella2020active) | 14 | 14 | 14 | 27 | 26 | 105 | 16.2 () | 31.1 () | 60.1 () |
| NOTEARS (zheng2018dags) | 22 | 21 | 26 | 33 | 35 | 93 | 23.7 () | 35.8 () | 59.5 () |
| DAG-GNN (yu2019dag) | 11 | 14 | 15 | 27 | 25 | 97 | 16.0 () | 30.6 () | 59.7 () |
| DSDI (Random) (ke2021dependency) | **0** | **0** | 2 | 3 | 7 | 24 | 1.4 () | 2.1 () | 7.2 () |
| DSDI (AIT) | **0** | **0** | **0** | **0** | **0** | **7** | **0.0** () | **0.0** () | **0.0** () |
| Method | Sachs | Asia |
|---|---|---|
| GES (chickering2002optimal) | 19 | 4 |
| GIES (hauser2012characterization) | 16 | 11 |
| ICP (peters2016causal) | 17 | 8 |
| A-ICP (gamella2020active) | 17 | 8 |
| NOTEARS (zheng2018dags) | 22 | 14 |
| DAG-GNN (yu2019dag) | 19 | 10 |
| DSDI (Random) (ke2021dependency) | **6** | **0** |
| DSDI (AIT) | **6** | **0** |
While the synthetic datasets systematically probe the strengths and weaknesses of causal structure discovery methods, we further evaluate the methods on the real-world flow cytometry dataset (also known as the Sachs network) (sachs2005causal) and the Asia network (lauritzen1988local) from the BnLearn repository. DSDI with active intervention targeting outperforms all measured baselines and achieves the same SHD as random targeting, but with reduced sample complexity. Although AIT deviates by only 6 undirected edges from the (consensus) ground-truth structure of Sachs et al. (sachs2005causal), there are concerns about the correctness of this graph and about the assumptions associated with the dataset (mooij2020jointcausal; zemplenyi2021bayesian). Perfect identification may therefore not be achievable by any method in the Sachs setting.
Aside from the significantly improved identification of underlying causal structures, our method allows for a substantial reduction in interventional sample complexity. After reaching the “elbow” point in terms of structural Hamming distance, random intervention targeting requires a fairly long time to converge to a solution within the MEC. In contrast, our proposed technique continues to select informative intervention targets beyond the elbow point and converges to the correct graph within the MEC more quickly. This continued effectiveness translates directly into increased sample efficiency and convergence speed, and is apparent for all examined datasets (see Figure 4).
The careful study of the behaviour of the proposed method on our chosen synthetic graphs enables us to reason about the method's underlying dynamics. Analysing the dynamics of intervention targeting reveals that the distribution of target node selections is linked to the topology of the underlying graph. More specifically, the number of selections of a given target node strongly correlates with its out-degree and its number of descendants in the underlying ground-truth graph structure (see Figure 5). That our method prefers interventions on nodes with greater (downstream) impact on the overall system is most clearly visible in the distribution of target selections for the chain and jungle synthetic graphs in Figure 7.
An intervention destroys the original causal influence of other variables on the intervened target variable , so its samples cannot be used to determine the causal parents of in the undisturbed system. Therefore, if a variable without children is detected, interventions upon it should be avoided since they effectively result in redundant observational samples of the remaining variables that are of no benefit for causal structure discovery. Active intervention targeting leads to the desirable property that interventions on such variables are drastically reduced (see Figure 6 and 7).
Investigating the evolution of the target distribution over time reveals that the discovery process is divided into two phases of exploration: Phase 1 lasts until the elbow point in terms of SHD, and Phase 2 from the elbow point until convergence (see Figure 4). We observed over multiple experiments that Phase 1 tends to quickly discover the underlying skeleton (removing superfluous connections while keeping some edges undirected) until a belief state is reached that represents a MEC, or a class of graphs very close to a MEC. Phase 2 predominantly operates on the partially directed skeleton and directs the remaining edges.
(Figure panels: (a) Graph: chain15; (b) Graph: jungle15.)
Recovery of incorrectly converging edges critically depends on adapting the order of interventions, which a random intervention policy cannot do. In sharp contrast, active intervention targeting significantly promotes early recovery (see Figure 8). The observed edge dynamics and the corresponding graph belief states indicate that the random policy can lock itself into unfavorable belief states from which recovery is extremely difficult, while AIT provides an escape hatch throughout learning.
Since noise significantly impairs the performance of causal discovery, we examine active intervention targeting in noise-perturbed environments with respect to SHD and convergence speed and compare it with random intervention targeting. We conduct experiments under different noise levels in the setting of binary data generated from structured and random graphs of varying density. A noise level denotes the probability of flipping a random variable and is applied to all measured variables of observational and interventional samples.
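The flipping perturbation described above can be sketched as follows (array shapes and the 5% level are illustrative):

```python
import numpy as np

def perturb(samples, noise_level, rng):
    """Flip each binary measurement independently with probability `noise_level`."""
    flips = rng.random(samples.shape) < noise_level
    return np.where(flips, 1 - samples, samples)

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(1000, 10))  # 1000 samples of 10 binary variables
noisy = perturb(data, 0.05, rng)
print((noisy != data).mean())  # empirical flip rate, close to 0.05
```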
Across all examined settings, we observe that active intervention targeting significantly improves identifiability compared to random targeting (see Table 3). Active intervention targeting perfectly identifies all structured graphs, except for the collider and full graph, up to a noise level of 5%, i.e. where on average every 20th variable is flipped.
The observed performance boost is even more noticeable in the convergence speed, as shown in Fig. 9 for ER-4 graphs spanning 10 variables. While the convergence gap widens with increasing noise level, random targeting does not converge to the ground-truth graphs for a noise level higher than . In contrast, AIT still converges to the correct graph and even shows a tendency to converge for . These findings support our observations from other experiments that active intervention targeting leads to a more controlled and robust graph discovery. Further experimental results in noise-perturbed environments can be found in the appendix.
Method | Chain10 | Collider10 | Tree10 | Bidiag10 | Jungle10 | Full10 | ER-1 | ER-2 | ER-4
---|---|---|---|---|---|---|---|---|---
Random | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 () | 0.0 () | 0.0 () | |
AIT | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 () | 0.0 () | 0.0 () | |
Random | 0 | 0 | 0 | 0 | 0 | 3 | 0.0 () | 0.0 () | 0.6 () | |
AIT | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 () | 0.0 () | 0.0 () | |
Random | 0 | 4 | 0 | 0 | 0 | 12 | 0.0 () | 0.0 () | 6.0 () | |
AIT | 0 | 0 | 0 | 0 | 0 | 3 | 0.0 () | 0.0 () | 0.0 () | |
Random | 1 | 9 | 0 | 2 | 1 | 33 | 1.3 () | 8.0 () | 27.0 () | |
AIT | 0 | 7 | 0 | 0 | 0 | 23 | 0.0 () | 1.3 () | 18.7 () | |
Random | 9 | 9 | 9 | 16 | 16 | 45 | 11.0 () | 20.7 () | 40.0 () | |
AIT | 7 | 9 | 6 | 16 | 15 | 44 | 10.3 () | 20.0 () | 39.3 () |
Promising results have driven a recent surge of interest in continuous optimization methods for Bayesian network structure learning from observational and interventional data. In this work, we proposed an active learning method that chooses interventions to identify the underlying graph efficiently in the setting of differentiable causal discovery. We showed that active intervention targeting improves not only sample efficiency but also the identification of the underlying causal structures compared to random targeting of interventions.
While our method shows significant improvements with respect to sample efficiency and graph recovery over existing methods across multiple noise-free and noise-perturbed datasets, the number of interventions is not yet optimal (atkinson1975optimal; eberhardt2012number) and can potentially be reduced further in future work. Furthermore, the interventional samples were presented to the evaluated frameworks according to a fixed learning schema (e.g. a fixed number of samples per evaluated intervention in graph scoring). It would be interesting to see whether the information discovered by AIT could be used for a more adaptive learning procedure to further improve sample efficiency.
We present an outline of the proposed two-stage DAG sampling procedure, which exploits structural information of the soft adjacency beyond independent edge confidences. The routine is based on a graph belief state in the form of a soft-adjacency characterization. We start by sampling topological node orderings from an iteratively refined score and then construct DAGs in the constrained space by independent Bernoulli draws over the edges permitted by the ordering. DAGness is therefore guaranteed by construction.
The temperature parameter of the temperature-scaled softmax can be used to account for the entropy of the graph belief state. However, in the general setting we suggest initializing the parameter to . Note that a temperature approaching zero results in always picking the maximizing argument, whereas a large temperature results in a uniform distribution.
Input: Graph Belief State in the form of a soft-adjacency matrix
Output: DAG Adjacency Matrix
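A minimal sketch of the two-stage sampler, assuming scalar per-node ordering scores and a temperature-scaled softmax over the remaining nodes (a simplification of the iteratively refined score):

```python
import numpy as np

def sample_dag(soft_adj, order_scores, tau=1.0, rng=None):
    """Two-stage DAG sampling: draw a topological order, then order-consistent edges.

    soft_adj[i, j] is the belief that edge i -> j exists; order_scores assigns each
    node a scalar used in a temperature-scaled softmax over the remaining nodes.
    """
    if rng is None:
        rng = np.random.default_rng()
    d = len(order_scores)
    remaining = list(range(d))
    order = []
    while remaining:  # stage 1: iteratively sample the next node of the ordering
        logits = np.array([order_scores[i] for i in remaining]) / tau
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        order.append(remaining.pop(rng.choice(len(remaining), p=probs)))
    adj = np.zeros((d, d), dtype=int)
    for pos, i in enumerate(order):  # stage 2: Bernoulli draws over permitted edges
        for j in order[pos + 1:]:
            adj[i, j] = rng.random() < soft_adj[i, j]
    return adj  # acyclic by construction

soft = np.random.default_rng(0).random((4, 4))
dag = sample_dag(soft, order_scores=np.zeros(4), tau=1.0,
                 rng=np.random.default_rng(1))
```

Since edges are only drawn from earlier to later positions in the sampled ordering, every sample is acyclic; the temperature trades off between greedily following the scores and a uniform ordering.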
A huge variety of SCMs and their induced DAGs exist, each of which can stress causal structure discovery algorithms in different ways. In this work, we perform a systematic evaluation over a selected set of synthetic and non-synthetic SCMs. We distinguish between discrete-valued (based on DSDI (ke2021dependency)) and continuous-valued (based on DCDI (brouillard2020differentiable)) random variables. Throughout all experiments, we limit ourselves to 1000 samples per intervention.
Graph Structure. We adopt the structured graphs (see Fig. 10) proposed in the work of DSDI (ke2021dependency) as they adequately represent the topological diversity of possible DAGs in a compact fashion. They can be split into a group of graphs without cycles in the undirected skeleton and a group with such cycles. Extending the setup with random graphs of varying edge density, generated from the Erdős–Rényi (ER) model, allows us to assess the generalized performance of the proposed method from sparse to dense DAGs.
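ER DAGs of a given density can be generated by sampling undirected Erdős–Rényi edges and orienting them along a random permutation; a minimal sketch, assuming the ER-k convention of roughly k·d expected edges over d nodes:

```python
import numpy as np

def sample_er_dag(d, k, rng):
    """Sample a DAG over d nodes with ~k*d expected edges (ER-k convention assumed)."""
    p = min(1.0, 2 * k / (d - 1))  # pair-inclusion probability: p * d(d-1)/2 = k*d
    perm = rng.permutation(d)      # random topological order
    adj = np.zeros((d, d), dtype=int)
    for a in range(d):
        for b in range(a + 1, d):
            if rng.random() < p:   # orient every sampled edge along the order
                adj[perm[a], perm[b]] = 1
    return adj

rng = np.random.default_rng(0)
dag = sample_er_dag(10, 4, rng)    # an ER-4 graph over 10 variables
```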
Discrete Data Generation. We adopt the generative setup of DSDI (ke2021dependency) and model the SCMs using two-layer MLPs with Leaky ReLU activations between layers. For every variable, a separate MLP models its conditional distribution given its parents. The MLP parameters are initialized orthogonally within the range of and biases uniformly in the range of .

Continuous Data Generation. For the evaluation of the adapted DCDI framework, we adopt the generative setup described in (brouillard2020differentiable) and use the existing non-linear datasets.
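The discrete generative setup can be sketched as follows; the hidden width of 16 and the uniform bias range are illustrative assumptions, since the exact initialization ranges are not reproduced here:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def make_mlp(n_in, n_hidden, n_out, rng):
    """Two-layer MLP with orthogonally initialized weights (ranges are assumptions)."""
    def ortho(m, n):
        # Orthogonal init via QR decomposition of a Gaussian matrix.
        q, _ = np.linalg.qr(rng.normal(size=(max(m, n), min(m, n))))
        return q[:m, :n] if m >= n else q[:n, :m].T
    w1, b1 = ortho(n_hidden, n_in), rng.uniform(-0.1, 0.1, n_hidden)
    w2, b2 = ortho(n_out, n_hidden), rng.uniform(-0.1, 0.1, n_out)
    return lambda x: w2 @ leaky_relu(w1 @ x + b1) + b2

rng = np.random.default_rng(0)
# One MLP per variable, mapping parent values to logits over that variable's states.
mlp = make_mlp(n_in=3, n_hidden=16, n_out=2, rng=rng)
logits = mlp(rng.normal(size=3))
```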
Graphs with acyclic skeletons:
Graphs with cyclic skeletons:
Besides the synthetic graphs, we evaluate our method on real-world datasets provided by the BnLearn data repository, namely the Asia (lauritzen1988local) and Sachs (sachs2005causal) datasets (see Fig. 11 for a visualization of their underlying ground-truth structures). Sachs is a systems biology dataset that exhibits non-linearity, confounding and complex structure.
Base Frameworks.
DSDI (ke2021dependency): https://github.com/nke001/causal_learning_unknown_interventions
DCDI (brouillard2020differentiable): https://github.com/slachapelle/dcdi
Baseline Methods.
GES (chickering2002optimal) and GIES (hauser2012characterization): www.github.com/FenTechSolutions/CausalDiscoveryToolbox (kalainathan2019causal)
ICP (peters2016causal): https://github.com/juangamella/aicp
A-ICP (gamella2020active): https://github.com/juangamella/aicp
NOTEARS (zheng2018dags): https://github.com/xunzheng/notears
DAG-GNN (yu2019dag): https://github.com/fishmoon1234/DAG-GNN
Datasets.
BnLearn Data Repository: https://www.bnlearn.com/bnrepository/
We used a similar set of hyperparameters for our AIT + DSDI and AIT + DCDI models as those used in the original papers (ke2021dependency; brouillard2020differentiable). The specific hyperparameters we used are stated below.

DSDI.
Number of iterations | 1000 |
Batch size | 256 |
Sparsity Regularizer | 0.1 |
DAG Regularizer | 0.5 |
Functional parameter training iterations | 10000 |
Number of interventions per phase 2 | 25 |
Number of data batches for scoring | 10 |
Number of graph configurations for scoring | |
- Graph Size 5: | 10 |
- Graph Size 10: | 20 |
- Graph Size 15 | 40 |
AIT: | |
- Number of graph configurations | 100 |
- Number of interventional samples per graph & target | 256 |
DCDI.
0 | |
---|---|
2 | |
0.9 | |
Augmented Lagrangian Thresh | |
Learning rate | |
Nr. of hidden units | 16 |
Nr. of hidden layers | 2 |
AIT: | |
- Number of graph configurations | 100 |
- Number of interventional samples per graph & target | 256 |
In this section, we show further results and visualizations of experiments on discrete data and single-target interventions in various settings (such as graphs of varying size, noise-free vs. noise-perturbed, limited intervention targets). All experiments are based on the framework DSDI.
Structured Graphs | Random Graphs | ||||||||
Chain | Collider | Tree | Bidiag | Jungle | Full | ER-1 | ER-2 | ER-4 | |
GES (chickering2002optimal) | 3 | 0 | 4 | 6 | 4 | 9 | 4.3 () | † | † |
GIES (hauser2012characterization) | 3 | 4 | 2 | 6 | 5 | 10 | 4.7 () | † | † |
ICP (peters2016causal) | 4 | 4 | 4 | 7 | 6 | 10 | 5.4 () | † | † |
A-ICP (gamella2020active) | 4 | 4 | 4 | 7 | 6 | 10 | 5.4 () | † | † |
NOTEARS (zheng2018dags) | 5 | 3 | 6 | 5 | 7 | 9 | 6.1 () | † | † |
DAG-GNN (yu2019dag) | 4 | 4 | 3 | 4 | 6 | 9 | 5.1 () | † | † |
DSDI (Random) (ke2021dependency) | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 () | † | †
DSDI (AIT) | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 () | † | †
Structured Graphs | Random Graphs | ||||||||
Chain | Collider | Tree | Bidiag | Jungle | Full | ER-1 | ER-2 | ER-4 | |
GES (chickering2002optimal) | 9 | 2 | 6 | 8 | 10 | 35 | 7.0 () | 10.7 () | 26.7 () |
GIES (hauser2012characterization) | 12 | 6 | 13 | 16 | 9 | 20 | 12.2 () | 14.1 () | 26.1 () |
ICP (peters2016causal) | 9 | 9 | 9 | 17 | 16 | 45 | 10.6 () | 20.7 () | 39.8 () |
A-ICP (gamella2020active) | 9 | 9 | 9 | 17 | 16 | 45 | 10.6 () | 20.7 () | 39.8 () |
NOTEARS (zheng2018dags) | 13 | 16 | 12 | 21 | 21 | 42 | 16.4 () | 22.9 () | 36.6 () |
DAG-GNN (yu2019dag) | 8 | 7 | 6 | 15 | 13 | 38 | 10.3 () | 20.1 () | 38.4 () |
DSDI (Random) (ke2021dependency) | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 () | 0.0 () | 0.0 ()
DSDI (AIT) | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 () | 0.0 () | 0.0 ()
Structured Graphs | Random Graphs | ||||||||
Chain | Collider | Tree | Bidiag | Jungle | Full | ER-1 | ER-2 | ER-4 | |
GES (chickering2002optimal) | 13 | 1 | 12 | 14 | 14 | 69 | 8.3 () | 17.6 () | 39.4 () |
GIES (hauser2012characterization) | 13 | 6 | 10 | 17 | 23 | 60 | 10.9 () | 18.1 () | 39.3 () |
ICP (peters2016causal) | 14 | 14 | 14 | 27 | 26 | 105 | 16.2 () | 31.1 () | 60.1 () |
A-ICP (gamella2020active) | 14 | 14 | 14 | 27 | 26 | 105 | 16.2 () | 31.1 () | 60.1 () |
NOTEARS (zheng2018dags) | 22 | 21 | 26 | 33 | 35 | 93 | 23.7 () | 35.8 () | 59.5 () |
DAG-GNN (yu2019dag) | 11 | 14 | 15 | 27 | 25 | 97 | 16.0 () | 30.6 () | 59.7 () |
DSDI (Random) (ke2021dependency) | 0 | 0 | 2 | 3 | 7 | 24 | 1.4 () | 2.1 () | 7.2 ()
DSDI (AIT) | 0 | 0 | 0 | 0 | 0 | 8 | 0.0 () | 0.0 () | 0.0 ()
While we have shown the effectiveness of AIT on random ER graphs of size in §5.3, we observe similar effects on ER graphs of size (see Figure 12). Overall, the results indicate a greater impact of our proposed targeting mechanism on larger graphs, where random intervention targeting scales poorly.
We evaluate the distribution of target node selections over multiple DAGs of varying size to investigate the behaviour of our proposed method. Over all performed experiments, our method prefers interventions on nodes with greater (downstream) impact on the overall system, i.e. nodes of higher topological rank in the underlying DAG.
We show all edge dynamics of all structured graphs over 15 variables and compare the dynamics of random targeting to active intervention targeting in a noise-free setting where we have access to all possible single-target interventions.
(a) Graph: Chain
(b) Graph: Tree
(c) Graph: Collider
(d) Graph: Bidiag
(e) Graph: Jungle
(f) Graph: Full
While §5.8 highlights our key findings in noise-perturbed systems, in this section we examine the impact of AIT in noise-perturbed environments more thoroughly. To this end, we systematically analyze experiments under different noise levels in the setting of binary data generated from random graphs of varying density. A noise level denotes the probability of flipping a random variable and is applied to all measured variables of observational and interventional samples.
Evaluating convergence on ER graphs of varying density over 10 variables under different noise levels reveals that the impact of AIT grows as the density of the graph and the noise level increase.
(i) ER-1:
(ii) ER-2:
(iii) ER-4:
While we allow access to all possible single-target interventions in all other experiments, real-world settings are usually more restrictive. Specific interventions might be technically impossible or even unethical, or experimenters might want to avoid interventions on specific target nodes due to increased experimental costs. To test the capability of AIT under such constraints, we limit the set of possible intervention targets in the following experiments and analyze the resulting behaviour. Using DSDI with AIT and single-target interventions, we examine the speed of convergence and the effect on the target distribution under different scenarios on structured graphs.
Scenario 1: We perform five experiments on a Chain5 graph, in each of which we block interventions on a different node, and additionally run one experiment with access to all targets as a comparison.
Throughout the experiments, we observe that blocking interventions on nodes of a higher topological level degrades the convergence speed more than blocking interventions on lower levels (see Figure 19). Furthermore, the distribution of selected targets indicates that in the restricted setting our method preferentially chooses the neighbors of a blocked target node.
Scenario 2: We perform multiple experiments on a Tree5 graph where we restrict access to different subsets of nodes (e.g. root node, set of all sink nodes) for single-target interventions.
Similar to the experiments on Chain5, we observe a clear impact of the available intervention targets on the convergence speed and the identifiability of the underlying structure (see Figure 20). While preventing interventions on all sink nodes (nodes 2, 3 and 4) results in improved convergence towards the underlying structure, restricted access to the set of nodes that act as causes of other nodes (nodes 0 and 1) prevents us from identifying the correct underlying structure.
While the original DCDI framework (brouillard2020differentiable) proposes a joint optimization over the observational and interventional sample space, selecting samples at random, we adapt the framework to the setting of active causal discovery, where interventional samples are acquired in an adaptive manner. We hypothesize that a controlled selection of informative intervention targets allows a more rapid and controlled discovery of the underlying causal structure.
Instead of demanding the full interventional target space during the complete optimization as in the original approach, we split the optimization procedure into episodes, where AIT is used to estimate a target space of size for each episode. This is done by computing the discrepancy scores over all possible intervention targets and selecting the highest-scoring targets. During an episode, we continue by performing gradient steps using the fixed target space and re-evaluate it afterwards for the next episode. We visualize the adaptation in the following high-level outline of the individual methodologies.
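The episodic procedure can be sketched as follows; `score_fn` and `step_fn` are hypothetical placeholders for the framework-specific discrepancy scoring and gradient update:

```python
import numpy as np

def ait_episode_schedule(score_fn, step_fn, n_nodes, k, n_episodes, steps_per_episode):
    """Episodic target selection: per episode, fix the k highest-scoring targets.

    score_fn(target) -> discrepancy score; step_fn(targets) performs one gradient
    step restricted to the given interventional target space.
    """
    history = []
    for _ in range(n_episodes):
        # Re-evaluate discrepancy scores over all possible intervention targets.
        scores = np.array([score_fn(t) for t in range(n_nodes)])
        targets = np.argsort(scores)[-k:]  # top-k discrepancy targets
        for _ in range(steps_per_episode):
            step_fn(targets)               # gradient steps on the fixed target space
        history.append(targets)
    return history

# Toy usage with a fixed score; real scores come from the graph belief state.
hist = ait_episode_schedule(lambda t: float(t), lambda ts: None,
                            n_nodes=5, k=2, n_episodes=3, steps_per_episode=1)
```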
We evaluate the effectiveness of AIT in the base framework of DCDI in the setting of non-linear, continuous data generated from random graphs over variables and show the potential of our proposed method.
Structural Identification / Convergence: Although the joint optimization formulation is not a priori designed for the setting of experimental design, the AIT-guided version shows superior or competitive performance in terms of structural identification and sample complexity compared to the original formulation (see Figure 21).
Distribution of Intervention Targets: As in DSDI, we observe strong correlation of the number of target selections with the measured topological properties of the specific nodes. This indicates a controlled discovery of the underlying causal structure through preferential targeting of nodes with greater (downstream) impact on the overall system. In addition, interventions on variables without children are drastically reduced (see also §5.5 for equivalent observations in DSDI).
Effect of Target Space Size : While the original formulation assumes for the complete optimization procedure (i.e. ) and relies on random samples out of the full target space, our adapted AIT-guided version of DCDI constrains the target space to a subset of targets for each episode. An ablation study on the size of the target space shows that for all choices of , our approach outperforms the original formulation in terms of sample complexity while achieving the same or better performance in terms of SHD.