Learning Causal Graphs with Small Interventions

We consider the problem of learning causal networks with interventions, when each intervention is limited in size under Pearl's Structural Equation Model with independent errors (SEM-IE). The objective is to minimize the number of experiments to discover the causal directions of all the edges in a causal graph. Previous work has focused on the use of separating systems for complete graphs for this task. We prove that any deterministic adaptive algorithm needs to be a separating system in order to learn complete graphs in the worst case. In addition, we present a novel separating system construction, whose size is close to optimal and is arguably simpler than previous work in combinatorics. We also develop a novel information theoretic lower bound on the number of interventions that applies in full generality, including for randomized adaptive learning algorithms. For general chordal graphs, we derive worst case lower bounds on the number of interventions. Building on observations about induced trees, we give a new deterministic adaptive algorithm to learn directions on any chordal skeleton completely. In the worst case, our achievable scheme is an α-approximation algorithm where α is the independence number of the graph. We also show that there exist graph classes for which the sufficient number of experiments is close to the lower bound. In the other extreme, there are graph classes for which the required number of experiments is multiplicatively α away from our lower bound. In simulations, our algorithm almost always performs very close to the lower bound, while the approach based on separating systems for complete graphs is significantly worse for random chordal graphs.


1 Introduction

Causality is a fundamental concept in the sciences and philosophy. The mathematical formulation of a theory of causality in a probabilistic sense has received significant attention recently (e.g. [14, 6, 2, 9, 8]). A formulation advocated by Pearl considers structural equation models: in this framework, X is a cause of Y if Y can be written as Y = f(X, E), for some deterministic function f and some latent random variable E. Given two causally related variables X and Y, it is not possible to infer whether X causes Y or Y causes X from random samples, unless certain assumptions are made on the distribution of E and/or on f [15, 7]. For more than two random variables, directed acyclic graphs (DAGs) are the most common tool used for representing causal relations. For a given DAG D, the directed edge (X, Y) shows that X is a cause of Y.

If we make no assumptions on the data generating process, the standard way of inferring the causal directions is by performing experiments, the so-called interventions. An intervention requires modifying the process that generates the random variables: the experimenter has to enforce values on some of the random variables. This process is different from conditioning, as explained in detail in [14].

The natural problem to consider is therefore minimizing the number of interventions required to learn a causal DAG. Hauser and Bühlmann [6] developed an efficient algorithm that minimizes this number in the worst case. The algorithm is based on optimal coloring of chordal graphs and requires at most ⌈log₂ χ⌉ interventions to learn any causal graph, where χ is the chromatic number of the chordal skeleton.

However, one important open problem appears when one also considers the size of the used interventions: each intervention is an experiment where the scientist must force a set of variables to take random values. Unfortunately, the interventions obtained in [6] can involve up to n/2 variables. The simultaneous enforcing of many variables can be quite challenging in many applications: for example, in biology, some variables may not be enforceable at all, or may require complicated genomic interventions for each parameter.

In this paper, we consider the problem of learning a causal graph when intervention sizes are bounded by some parameter k. The first work we are aware of on this problem is by Eberhardt et al. [2], who provided an achievable scheme. Furthermore, [3] shows that the set of interventions needed to fully identify a causal DAG must satisfy a specific combinatorial condition, namely that of a separating system (a separating system is a 0–1 matrix with distinct columns in which each row has at most k ones), when the intervention size is unconstrained or equal to 1. In [9], under the assumption that the same holds true for any intervention size, Hyttinen et al. draw connections between causality and known separating system constructions. One open problem is: if the learning algorithm is adaptive after each intervention, is a separating system still needed, or can one do better? It was believed that adaptivity does not help in the worst case [3], i.e., that one still needs a separating system.

Our Contributions: We obtain several novel results for learning causal graphs with interventions bounded by size k. The problem separates into the special case where the underlying undirected graph (the skeleton) is the complete graph, and the more general case where the underlying undirected graph is chordal.

  1. For complete graph skeletons, we show that any adaptive deterministic algorithm needs a separating system. This implies that lower bounds for separating systems also hold for adaptive algorithms and resolves the previously mentioned open problem.

  2. We present a novel combinatorial construction of a separating system that is close to the previous lower bound. This simple construction may be of more general interest in combinatorics.

  3. Recently, [8] showed that randomized adaptive algorithms need only O(log log n) interventions with high probability in the unbounded case. We extend this result and show that O((n/k) log log k) interventions of size bounded by k suffice with high probability.

  4. We present a more general information theoretic lower bound of n/(2k) interventions that captures the performance of such randomized algorithms.

  5. We extend the lower bound for adaptive algorithms to general chordal graphs. We show that, over all orientations, a number of experiments at least the size of a separating system on χ elements is needed in the worst case, where χ is the chromatic number of the skeleton graph.

  6. We exhibit two extremal classes of graphs. For one of them, intervening according to a separating system on χ elements is sufficient. For the other class, the number of experiments needed in the worst case is multiplicatively α away from this lower bound.

  7. We exploit the structural properties of chordal graphs to design a new deterministic adaptive algorithm that uses the idea of separating systems together with adaptive application of the Meek rules. We simulate our new algorithm and empirically observe that it performs quite close to the separating system lower bound for the maximum clique, requiring far fewer interventions than a separating system on all n vertices.

2 Background and Terminology

2.1 Essential graphs

A causal DAG D = (V, E) is a directed acyclic graph in which V is a set of random variables and (X, Y) ∈ E is a directed edge if and only if X is a direct cause of Y. We adopt Pearl’s structural equation model with independent errors (SEM-IE) in this work (see [14] for more details). The variables in a set S ⊆ V cause X_i if X_i = f({X_j}_{X_j ∈ S}, e_i), where e_i is a random variable independent of all other variables.

The causal relations encoded by D imply a set of conditional independence (CI) relations between the variables. A conditional independence relation is of the following form: given a set Z, the sets X and Y are conditionally independent, for some disjoint subsets of variables X, Y, Z ⊆ V. Due to this, causal DAGs are also called causal Bayesian networks. A set of variables V is Bayesian with respect to a DAG D if the joint probability distribution of V can be factorized as a product of the marginals of every variable conditioned on its parents.

All the CI relations that can be learned statistically through observations can also be inferred from the Bayesian network using a graphical criterion called d-separation [16], assuming that the distribution is faithful to the graph. (Given a Bayesian network, any CI relation implied by d-separation holds true; conversely, all the CIs implied by the distribution can be found using d-separation if the distribution is faithful. Faithfulness is a widely accepted assumption, since it is known that only a measure-zero set of distributions is not faithful [13].) Two causal DAGs are said to be Markov equivalent if they encode the same set of CIs. Two causal DAGs are Markov equivalent if and only if they have the same skeleton (the undirected graph obtained when directed edges are made undirected) and the same immoralities (an induced subgraph X → Z ← Y is an immorality if X and Y are nonadjacent). The class of causal DAGs that encode the same set of CIs is called the Markov equivalence class; we denote the Markov equivalence class of a DAG D by [D]. The graph union of all DAGs in [D] is called the essential graph of D, denoted E(D). (The graph union of two DAGs D₁ and D₂ with the same skeleton is the partially directed graph on that skeleton in which an edge is directed as X → Y if it is directed as X → Y in both, and undirected if its directions in D₁ and D₂ disagree.) E(D) is always a chain graph with chordal chain components; an undirected graph is chordal if it has no induced cycle of length greater than 3, and the chain structure means that E(D) can be decomposed into a sequence of undirected chordal graphs G₁, G₂, ... (the chain components) such that there is a directed edge from a vertex of G_i to a vertex of G_j only if i < j [1].
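As a concrete, standard illustration (a textbook example, not one taken from this paper): consider three variables with skeleton X − Y − Z, where X and Z are nonadjacent. The three orientations

X → Y → Z,   X ← Y ← Z,   X ← Y → Z

all encode exactly the single CI relation “X is independent of Z given Y”, so they form one Markov equivalence class, and its essential graph is the fully undirected path X − Y − Z. The fourth orientation, X → Y ← Z, is an immorality: it encodes the marginal independence of X and Z instead, so it forms an equivalence class of its own, and its essential graph is fully directed.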

The d-separation criterion can be used to identify the skeleton and all the immoralities of the underlying causal DAG [16]. Additional edges can be oriented using the fact that the underlying DAG is acyclic and has no immoralities beyond the identified ones. Meek derived local rules (the Meek rules), introduced in [17], to be recursively applied to identify every such additional edge (see Theorem 3 of [12]). The repeated application of the Meek rules on this partially directed graph with identified immoralities, until they can no longer be applied, yields the essential graph.

2.2 Interventions and Active Learning

Given a set of variables V, an intervention on a set S ⊆ V of the variables is an experiment where the performer forces each variable X_i ∈ S to take the value of another variable U_i that is independent of all other variables, i.e., X_i = U_i. This operation, and how it affects the joint distribution, is formalized by the do operator of Pearl [14]. An intervention modifies the causal DAG D as follows: the post-intervention DAG D′ is obtained by removing the edges from the parents of the nodes in S to those nodes. The size of an intervention S is the number of intervened variables, i.e., |S|. Let S^c denote the complement of the set S.

CI-based learning algorithms can be applied to D′ to identify the set of removed edges, i.e., the parents of S [16]; the remaining adjacent vertices in the original skeleton are declared to be the children. Hence,

(R0) The orientations of the edges of the cut between S and S^c in the original DAG can be inferred.

Then, the local Meek rules (introduced in [17]) are repeatedly applied to the original DAG together with the new directions learned from the cut, until no more directed edges can be identified. Further application of CI-based algorithms will reveal no more information. The Meek rules are given below:

  • (a − b) is oriented as (a → b) if ∃ c s.t. (c → a) and c ∉ N(b).

  • (a − b) is oriented as (a → b) if ∃ c s.t. (a → c) and (c → b).

  • (a − b) is oriented as (a → b) if ∃ c, d s.t. (a − c), (a − d), (c → b), (d → b) and c ∉ N(d).

  • (a − b) is oriented as (a → b) if ∃ c, d s.t. (a − c), (c → d), (d → b) and c ∉ N(b).

Here N(x) denotes the set of vertices adjacent to x in the skeleton.
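To make the rules concrete, the following is a minimal sketch (our illustration, not the authors’ code) of the Meek-rule fixed point computation on a partially directed graph, implementing the four rules exactly as stated above:

from itertools import combinations

def meek_closure(n, directed, undirected):
    # Vertices are 0..n-1. `directed` holds pairs (a, b) meaning a -> b;
    # `undirected` holds frozenset({a, b}) edges. Assumes the skeleton and
    # all immoralities were already identified, so the rules are sound.
    directed = set(directed)
    undirected = {frozenset(e) for e in undirected}

    def adjacent(x, y):
        return (x, y) in directed or (y, x) in directed \
            or frozenset((x, y)) in undirected

    def forced(a, b):
        # True if some Meek rule orients a - b as a -> b.
        others = [c for c in range(n) if c not in (a, b)]
        # R1: some c -> a with c not adjacent to b.
        if any((c, a) in directed and not adjacent(c, b) for c in others):
            return True
        # R2: a directed path a -> c -> b.
        if any((a, c) in directed and (c, b) in directed for c in others):
            return True
        # R3: nonadjacent c, d with a - c, a - d, c -> b, d -> b.
        for c, d in combinations(others, 2):
            if (frozenset((a, c)) in undirected and frozenset((a, d)) in undirected
                    and (c, b) in directed and (d, b) in directed
                    and not adjacent(c, d)):
                return True
        # R4: a - c, c -> d, d -> b with c not adjacent to b.
        for c in others:
            if frozenset((a, c)) in undirected and not adjacent(c, b):
                if any((c, d) in directed and (d, b) in directed
                       for d in others if d != c):
                    return True
        return False

    changed = True
    while changed:
        changed = False
        for e in list(undirected):
            a, b = tuple(e)
            for x, y in ((a, b), (b, a)):
                if e in undirected and forced(x, y):
                    undirected.remove(e)
                    directed.add((x, y))
                    changed = True
    return directed, undirected

# Example: with 0 -> 1 known, 1 - 2 undirected and 0, 2 nonadjacent,
# rule R1 orients 1 -> 2:
# meek_closure(3, {(0, 1)}, {frozenset({1, 2})})  ->  ({(0, 1), (1, 2)}, set())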

The concepts of essential graphs and Markov equivalence classes are extended in [4] to incorporate the role of interventions. Let I = {I₁, ..., I_m}, with each I_j ⊆ V, be a set of interventions, and let the above process be followed after each intervention. The interventional Markov equivalence class (I-equivalence class) of a DAG D is the set of DAGs that represent the same set of probability distributions obtained when the above process is applied after every intervention in I. It is denoted by [D]_I. Similar to the observational case, the I-essential graph of a DAG D is the graph union of all DAGs in the same I-equivalence class; it is denoted by E_I(D). We have the following sequence:

D ⟶ (CI tests, d-separation, Meek rules) ⟶ E(D) ⟶ (I₁, R0, Meek rules) ⟶ E_{I₁}(D) ⟶ (I₂, R0, Meek rules) ⟶ E_{{I₁,I₂}}(D) ⟶ ⋯ ⟶ E_I(D).   (1)

Therefore, after a set of interventions I has been performed, the I-essential graph E_I(D) is a graph with some oriented edges that captures all the causal relations we have discovered so far using I. Before any intervention has happened, E(D) = E_∅(D) captures the initially known causal directions. It is known that E_I(D) is a chain graph with chordal chain components; therefore, when all its directed edges are removed, the graph becomes a set of disjoint chordal graphs.

2.3 Problem Definition

We are interested in the following question:

Problem 1.

Given that all interventions in I are of size at most k variables, i.e., |I_j| ≤ k for each intervention I_j ∈ I, minimize the number of interventions m = |I| such that the partially directed graph containing all the directions learned so far satisfies E_I(D) = D.

The question is therefore the design of an algorithm that computes a small set of interventions I given E(D). Note, of course, that the unknown directions of the edges are not available to the algorithm. One can view the design of I as an active learning process for finding D from the essential graph E(D). E(D) is a chain graph with undirected chordal chain components, and it is known that interventions on one chain component do not affect the discovery process of directed edges in the other components [5]. So we will assume that E(D) is undirected and a chordal graph to start with. Our notion of algorithm does not consider the time complexity of the statistical steps (CI testing, R0 and the Meek rules) in (1): given the interventions so far, we only consider efficiently computing the next intervention using (possibly) the current graph E_I(D). We consider the following three classes of algorithms:

  1. Non-adaptive algorithm: The choice of I is fixed prior to the discovery process.

  2. Adaptive algorithm: At every step i, the choice of I_i is a deterministic function of E_{{I₁,...,I_{i−1}}}(D).

  3. Randomized adaptive algorithm: At every step i, the choice of I_i is a random function of E_{{I₁,...,I_{i−1}}}(D).

The problem is different for complete graphs versus more general chordal graphs, since rule R1 becomes applicable when the graph is not complete. Thus we give a separate treatment of each case. First, we provide algorithms for all three classes for learning the directions of a complete graph skeleton (the undirected complete graph) on n vertices. Then, we generalize to chordal graph skeletons and provide a novel adaptive algorithm with upper and lower bounds on its performance.

The missing proofs of the results that follow can be found in the Appendix.

3 Complete Graphs

In this section, we consider the case where the skeleton we start with, i.e., E(D), is the undirected complete graph on n vertices (denoted K_n). It is known that at any stage of (1) starting from K_n, rules R1, R3 and R4 do not apply. Further, the underlying DAG is a directed clique. The directed clique is characterized by an ordering σ on [n] = {1, ..., n}: every edge points from the vertex appearing earlier in σ to the vertex appearing later, so the first vertex of σ has no incoming edges, the second has none in the subgraph excluding the first, and so on. Let this directed clique be denoted K_n^σ. We need the following results on separating systems for our first results regarding adaptive and non-adaptive algorithms for a complete graph.

3.1 Separating System

Definition 1.

[10, 18] An (n, k)-separating system on an n-element set [n] is a collection of subsets S = {S₁, S₂, ..., S_m}, S_i ⊆ [n], such that |S_i| ≤ k for all i and, for every pair of distinct elements u, v ∈ [n], there is a subset S_i such that either u ∈ S_i, v ∉ S_i, or u ∉ S_i, v ∈ S_i. If a pair u, v satisfies this condition with respect to S_i, then S_i is said to separate the pair. Here, we consider the case k ≤ n/2.

In [10], Katona gave an (n, k)-separating system together with a lower bound on its size. In [18], Wegener gave a simpler argument for the lower bound and also provided a tighter upper bound than the one in [10]. In this work, we give a different construction below, whose size is marginally larger than the construction of Wegener but which has a much simpler description.

Lemma 1.

For any integer a ≥ 2, there is a labeling procedure that produces distinct labels of length ℓ = ⌈log_a n⌉ for all elements in [n], using letters from the integer alphabet {0, 1, ..., a − 1}. Further, in every digit (or position), any integer letter is used at most ⌈n/a⌉ times.

Once we have a set of string labels as in Lemma 1, our separating system construction is straightforward.

Theorem 1.

Consider an alphabet A = {0, 1, ..., a − 1} of size a = ⌈n/k⌉, where k ≤ n/2. Label every element of the n-element set [n] with a distinct string of letters from A of length ℓ = ⌈log_a n⌉, using the procedure in Lemma 1. For every position 1 ≤ i ≤ ℓ and every nonzero letter j ∈ A, choose the subset S_{i,j} of elements whose string’s i-th letter is j. The set of all such subsets is an (n, k)-separating system on [n] with at most (⌈n/k⌉ − 1)⌈log_{⌈n/k⌉} n⌉ subsets.
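To illustrate the construction, here is a sketch under stated assumptions: we substitute a simpler balanced labeling (add the least significant digit to every other digit, modulo a) for the Lemma 1 procedure; it satisfies the same two properties the proof needs — distinct labels, and each letter used at most ⌈n/a⌉ ≤ k times per position — and the code verifies both explicitly.

import math

def separating_system(n, k):
    # Sketch of the Theorem 1 construction. The balanced base-a labeling
    # below is our simplification, not the paper's Lemma 1 procedure.
    assert 2 <= 2 * k <= n
    a = math.ceil(n / k)              # alphabet size, a >= 2
    ell = 1
    while a ** ell < n:               # label length: ceil(log_a n)
        ell += 1

    def label(x):
        digits = [(x // a ** i) % a for i in range(ell)]
        return [digits[0]] + [(d + digits[0]) % a for d in digits[1:]]

    labels = [label(x) for x in range(n)]
    assert len(set(map(tuple, labels))) == n      # labels are distinct
    system = []
    for i in range(ell):
        for j in range(1, a):         # letter 0 is skipped: two differing
            S = frozenset(x for x in range(n) if labels[x][i] == j)
            if S:                     # letters cannot both be 0
                system.append(S)
    return system

def is_separating(system, n, k):
    sizes_ok = all(len(S) <= k for S in system)
    pairs_ok = all(any((x in S) != (y in S) for S in system)
                   for x in range(n) for y in range(x + 1, n))
    return sizes_ok and pairs_ok

# e.g. is_separating(separating_system(100, 10), 100, 10) -> True,
# using (a - 1) * ell = 9 * 2 = 18 subsets of size at most k = 10.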

3.2 Adaptive algorithms: Equivalence to a Separating System

Consider any non-adaptive algorithm that designs a set of interventions I, each of size at most k, to discover K_n^σ. It is already known that I has to be a separating system in the worst case over all σ. Now, we prove the necessity of a separating system for deterministic adaptive algorithms in the worst case.

Theorem 2.

Let A be an adaptive deterministic algorithm that designs a set of interventions I such that the final learned graph satisfies E_I(K_n^σ) = K_n^σ for every ground truth ordering σ, starting from the initial skeleton K_n. Then there exists a σ for which the I designed by A is a separating system.

The theorem above is independent of the individual intervention sizes. Therefore, we have the following theorem, which is a direct corollary of Theorem 2:

Theorem 3.

In the worst case over σ, any adaptive or non-adaptive deterministic algorithm on the DAG K_n^σ has to use at least as many interventions as the smallest (n, k)-separating system. There is a feasible I, given by the construction of Theorem 1, with |I| ≤ (⌈n/k⌉ − 1)⌈log_{⌈n/k⌉} n⌉.

Proof.

By Theorem 2, we need a separating system in the worst case, and the lower and upper bounds on its size are from [18, 10].∎

3.3 Randomized Adaptive Algorithms

In this section, we show that the total number of variable accesses (i.e., the sum of the intervention sizes) needed to fully identify the complete causal DAG is at least n/2, and that a randomized adaptive algorithm can match this up to a multiplicative O(log log k) factor.

Theorem 4.

To fully identify a complete causal DAG on n variables using size-k interventions, n/(2k) interventions are necessary in the worst case. Moreover, the total number of variables accessed is at least n/2.

The lower bound in Theorem 4 is information theoretic. We now give a randomized algorithm that requires O((n/k) log log k) experiments in expectation. We provide a straightforward generalization of [8], where the authors gave a randomized algorithm for unbounded intervention size.

Theorem 5.

Suppose every experiment is of size at most k ≤ n/2. Then there exists a randomized adaptive algorithm which designs an I such that E_I(D) = D with probability at least 1 − 1/poly(n), with |I| = O((n/k) log log k) in expectation.

4 General Chordal Graphs

In this section, we turn to interventions on a general DAG D. After the initial stages of (1), E(D) is a chain graph with chordal chain components, and there are no further immoralities throughout the graph. In this work, we focus on one chordal chain component. Thus the DAG we work on is assumed to be a directed graph with no immoralities whose skeleton G is chordal. We are interested in recovering D from E(D) = G using interventions of size at most k, following (1).

4.1 Bounds for Chordal skeletons

We provide a lower bound for both adaptive and non-adaptive deterministic schemes on a chordal skeleton G. Let χ(G) be the coloring (chromatic) number of the given chordal graph; since chordal graphs are perfect, it equals the clique number.

Theorem 6.

Given a chordal skeleton G, in the worst case over all DAGs D with skeleton G and no immoralities, any adaptive or non-adaptive algorithm using interventions of size at most k requires at least as many interventions as the smallest (χ(G), k)-separating system.

Upper bound: Clearly, the separating-system-based algorithm of Section 3 can be applied to the n vertices of the chordal skeleton, and it finds all the directions. Thus, roughly (n/k) log_{n/k} n interventions suffice. Combined with the lower bound, this implies an α-approximation algorithm, where α is the independence number of the graph (since n ≤ αχ(G) for any graph, under a mild assumption relating the logarithmic factors).
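Explicitly, comparing the two bounds (a back-of-the-envelope computation; the logarithmic factors are where the mild assumption enters):

m_upper / m_lower ≈ [(n/k) log_{n/k} n] / [(χ/k) log_{χ/k} χ] ≈ n/χ ≤ α,

where n ≤ αχ holds for every graph because the χ color classes partition the vertex set and each class, being an independent set, has size at most α.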

Remark: The separating system on n nodes gives an α-approximation. However, the new algorithm in Section 4.3 exploits chordality and performs much better empirically. It is possible to show that our heuristic also has an α-approximation guarantee, but we omit the argument.

4.2 Two extreme counterexamples

We provide two classes of chordal skeletons: one for which a number of interventions close to the lower bound is sufficient, and one for which the number of interventions needed is very close to the upper bound.

Theorem 7.

There exist chordal skeletons such that, for any algorithm with intervention size constraint k, the number of interventions required is at least α(χ − 1)/(2k), where α and χ are the independence and chromatic numbers respectively. There also exist chordal graph classes for which a (χ, k)-separating system is sufficient.

4.3 An Improved Algorithm using Meek Rules

In this section, we design an adaptive deterministic algorithm that anticipates Meek rule R1 applications and combines them with the idea of a separating system. We evaluate it experimentally on random chordal graphs. First, we make a few observations about learning directed trees with no immoralities from their skeletons (undirected trees are chordal) using Meek rule R1, where every intervention is a single node. Because a tree has no cycles, Meek rules R2–R4 do not apply.

Lemma 2.

Every node in a directed tree with no immoralities has at most one incoming edge. Hence there is a root node with no incoming edges, and intervening on that node alone identifies the whole tree through repeated application of rule R1.

Lemma 3.

Learning all the directions of a directed tree with no immoralities can be done adaptively with at most ⌈log₂ n_t⌉ interventions of size 1, where n_t is the number of vertices in the tree. The algorithm runs in time polynomial in n_t.

Lemma 4.

Given any chordal graph and a valid coloring, the subgraph induced by the union of any two color classes is a forest.

In the next section, we combine this adaptive single-node-intervention algorithm for directed trees, which uses the Meek rules, with the non-adaptive separating system approach; a code sketch of the tree procedure follows.
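The following sketch (our illustration, with the hidden root serving only as the R0 oracle) implements the strategy from the proof of Lemma 3: repeatedly intervene on a balanced vertex separator of the still-unoriented tree; R0 orients the edges at that node, R1 sweeps every component except the one containing the root, and the search recurses into that component.

def learn_tree_directions(adj, root):
    # adj: undirected adjacency dict of a tree. root: the hidden source;
    # every edge of the true DAG points away from it (Lemma 2). The learner
    # uses `root` only through the R0 oracle, i.e. to see which neighbor of
    # the intervened node is its parent. Returns (#interventions, edges).

    def components(nodes, removed):
        comps, seen = [], {removed}
        for s in nodes:
            if s not in seen:
                comp, stack = set(), [s]
                seen.add(s)
                while stack:
                    x = stack.pop()
                    comp.add(x)
                    for y in adj[x]:
                        if y in nodes and y not in seen:
                            seen.add(y)
                            stack.append(y)
                comps.append(comp)
        return comps

    oriented, nodes, m = set(), set(adj), 0
    while len(nodes) > 1:
        # Balanced vertex separator: removing it leaves components of size
        # at most |nodes|/2 (every tree has one, cf. [11]).
        v = next(u for u in nodes
                 if all(len(c) <= len(nodes) / 2 for c in components(nodes, u)))
        m += 1                         # intervene on {v}: R0 orients v's edges
        comps = components(nodes, v)
        root_comp = next((c for c in comps if root in c), None)
        for c in comps:
            if c is root_comp:         # root side: only v's parent edge learned
                u = next(w for w in adj[v] if w in c)
                oriented.add((u, v))
            else:                      # R1 sweeps the whole component away from v
                stack, seen = [v], {v}
                while stack:
                    x = stack.pop()
                    for y in adj[x]:
                        if y in c and y not in seen:
                            oriented.add((x, y))
                            seen.add(y)
                            stack.append(y)
        nodes = root_comp if root_comp is not None else set()
    return m, oriented

Since the unresolved component at least halves in every round, the loop performs at most ⌈log₂ n_t⌉ interventions, matching Lemma 3.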

4.3.1 Description of the algorithm

The key motivation behind the algorithm is that a pair of color classes induces a forest (Lemma 4). Choosing the right node to intervene on leaves only a small subtree unlearned, as in the proof of Lemma 3. In subsequent steps, suitable nodes in the remaining subtrees are chosen until all edges are learned. We give a brief description of the algorithm below.

Let G denote the initial undirected chordal skeleton and let χ be its coloring number. Consider a (χ, k)-separating system S on the color set {1, ..., χ}. To intervene on the actual graph, an intervention set I corresponding to a set S ∈ S is chosen: for every color c ∈ S, we would like to intervene on one node of color c.

Consider a node v of color c. Now, we attach a score to v as follows. For any other color c′, consider the induced forest on the color classes c and c′ in G (Lemma 4), and let T_{v,c′} be the tree containing v in this forest. Let d_{v,c′} be the degree of v in T_{v,c′}, and let T₁, ..., T_{d_{v,c′}} be the disjoint trees that result after node v is removed from T_{v,c′}. If v is intervened on, then, according to the proof of Lemma 3: (a) all edge directions in all these trees except possibly one (the one containing the unknown root) would be learned by applying the Meek rules and rule R0; and (b) all the directions from v to its neighbors would be found.

The score is taken to be the total number of edge directions guaranteed to be learned in the worst case; therefore,

score(v) = Σ_{c′ ≠ c} [ (|T_{v,c′}| − 1) − max_j (|T_j| − 1) ],

i.e., for every other color class, all edges of v’s tree except those of the largest subtree left when v is removed. The node with the highest score in color class c is used for the intervention I. After intervening on I, all the edges whose directions become known through R0 and the repeated application of the Meek rules (until nothing more can be learned) are deleted from G. Once every S ∈ S has been processed, we recolor the sparser graph G, form a new separating system with the new chromatic number, and repeat the above procedure. The exact hybrid algorithm is described in Algorithm 1.
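A sketch of the scoring step (our reading of the description above; the exact tie-breaking used in Algorithm 1 may differ):

def score(v, adj, color, num_colors):
    # For each color class c2 other than v's, the classes of v and c2
    # induce a forest (Lemma 4); in v's tree T there, an intervention on v
    # is guaranteed to reveal every edge except those of the largest
    # component of T - v, which may be the one hiding the root.
    cv = color[v]
    total = 0
    for c2 in range(num_colors):
        if c2 == cv:
            continue
        keep = {u for u in adj if color[u] in (cv, c2)}
        tree, stack = {v}, [v]          # v's tree in the induced forest
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y in keep and y not in tree:
                    tree.add(y)
                    stack.append(y)
        comp_edges = []                 # edge counts of components of T - v
        seen = {v}
        for u in adj[v]:
            if u in tree and u not in seen:
                size, st = 0, [u]
                seen.add(u)
                while st:
                    x = st.pop()
                    size += 1
                    for y in adj[x]:
                        if y in tree and y not in seen:
                            seen.add(y)
                            st.append(y)
                comp_edges.append(size - 1)
        total += (len(tree) - 1) - max(comp_edges, default=0)
    return total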

Theorem 8.

Given an undirected chordal skeleton of an underlying directed graph with no immoralities, Algorithm 1 terminates in finite time and returns the correct underlying directed graph. The algorithm has runtime complexity polynomial in n.

1:Input: Chordal graph skeleton G with no immoralities.
2:Initialize G′ with the nodes of G and no directed edges. Initialize time t = 1.
3:while G has at least one edge do
4:      Color the chordal graph G with χ(G) colors. Standard algorithms exist to do this in linear time.
5:     Initialize the color set C = {1, 2, ..., χ(G)}. Form a (χ(G), k)-separating system S on C.
6:     for each S ∈ S do
7:         Initialize intervention I = ∅.
8:         for every color c ∈ S and every node v in color class c do
9:              Consider T_{v,c′} and d_{v,c′} for all colors c′ ≠ c (as per the definitions in Sec. 4.3.1).
10:              Compute score(v).
11:         end for
12:         if the maximum score within color class c is positive then
13:              I ← I ∪ {argmax_{v ∈ c} score(v)}, for each such color c ∈ S.
14:         else
15:              do not add a node of color c to I.
16:         end if
17:         t ← t + 1.
18:         Apply R0 and the Meek rules using G′ and G after intervention I. Add the newly learned directed edges to G′ and delete them from G.
19:     end for
20:     Remove all nodes which have degree 0 in G.
21:end while
22:return G′.
Algorithm 1 Hybrid Algorithm using Meek rules with separating system

5 Simulations

We simulate our new heuristic, namely Algorithm 1, on randomly generated chordal graphs and compare it with a naive algorithm that follows the intervention sets given by our separating system in Theorem 1. Both algorithms apply R0 and the Meek rules after each intervention, according to (1). We plot the following lower bounds: (a) the information theoretic lower bound n/(2k) of Theorem 4 (Information Theoretic LB); and (b) Max. Clique Sep. Sys. Entropic LB, the chromatic-number-based lower bound of Theorem 6. Moreover, we use two known separating system constructions for the maximum clique size as references: the best known separating system is shown by the label Max. Clique Sep. Sys. Achievable LB, and our new, simpler separating system construction (Theorem 1) is shown by Our Construction Clique Sep. Sys. LB. As an upper bound, we use the size of the best known separating system on all n vertices (without any Meek rules), denoted Separating System UB.

Random generation of chordal graphs: Start with a random ordering σ on the vertices and consider every vertex in this order. For each vertex v_{σ(i)}, an edge to each earlier vertex v_{σ(j)}, j < i, is included with a probability that decays with i; the proportionality constant is changed to adjust the sparsity of the graph. After all such edges are considered, each vertex’s earlier neighborhood (its neighbors among v_{σ(1)}, ..., v_{σ(i−1)}) is completed into a clique by adding the missing edges, respecting the ordering. Orienting every edge from σ-lower to σ-higher then yields a DAG with no immoralities whose skeleton is chordal, and σ is a perfect elimination ordering in the sense of Definition 2.
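A sketch of one way to realize this generator (the exact edge-probability schedule is our assumption, tuned by the constant c; the fill-in and orientation steps follow the description above):

import random
from itertools import combinations

def random_chordal_dag(n, c=2.0, seed=None):
    # Returns (adj, parents): a chordal skeleton and an orientation with
    # no immoralities, for which the sampled order is a perfect elimination
    # ordering in the sense of Definition 2.
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)                       # order[i] = i-th vertex in sigma
    pos = {v: i for i, v in enumerate(order)}
    adj = {v: set() for v in range(n)}
    for i in range(1, n):
        for j in range(i):
            if rng.random() < min(1.0, c / i):   # assumed decay schedule
                adj[order[i]].add(order[j])
                adj[order[j]].add(order[i])
    # Fill-in: make each vertex's earlier neighborhood a clique, processing
    # the order from the back, so that sigma becomes a perfect elimination
    # ordering (processing later vertices first keeps earlier ones valid).
    for i in reversed(range(n)):
        earlier = [u for u in adj[order[i]] if pos[u] < i]
        for u, w in combinations(earlier, 2):
            adj[u].add(w)
            adj[w].add(u)
    # Orient every edge from sigma-lower to sigma-higher: each vertex's
    # parents are its earlier neighbors, which form a clique, so the DAG
    # has no immoralities.
    parents = {v: {u for u in adj[v] if pos[u] < pos[v]} for v in range(n)}
    return adj, parents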

Figure 1 (panels (a) and (b)): n: number of vertices, k: intervention size bound. The number of experiments is compared between our heuristic and the naive algorithm based on the separating system, on random chordal graphs. The red markers represent the sizes of the separating system. Green circle markers and cyan square markers for the same n correspond to the number of experiments required by our heuristic and by the algorithm based on the (n, k)-separating system (Theorem 1), respectively, on the same set of chordal graphs. The naive algorithm requires close to n/k experiments on average, while our algorithm stays much closer to the clique-based lower bound.

Results: We are interested in comparing our algorithm and the naive one, which depends on the separating system on all n vertices, against the size of the separating system for the maximum clique. Our algorithm performs very close to that quantity in both plotted settings, while the average performance of the naive algorithm is close to n/k. The results point to the following: for random chordal graphs, the structured tree search allows us to learn the edges in a number of experiments quite close to the lower bound based only on the maximum clique size, and not on n. The plots for additional values of n and k are given in the Appendix.

6 Conclusions

We have considered the problem of adaptively designing interventions of bounded size to learn a causal graph under Pearl’s SEM-IE model. We proposed lower and upper bounds on the number of interventions needed in the worst case for various classes of algorithms when the causal graph skeleton is complete, and lower and upper bounds on the minimum number of interventions required in the worst case for general chordal graphs. We characterized two extremal graph classes such that the minimum number of interventions in one class is close to the lower bound and in the other class is close to the upper bound. For chordal skeletons, we proposed an algorithm that combines the ideas for complete graphs with those for forest skeletons through the application of the Meek rules. Empirically, on randomly generated chordal graphs, our algorithm performs close to the lower bound and outperforms the previous state of the art. Possible future work includes obtaining a tighter lower bound for chordal graphs, which would possibly establish a tighter approximation guarantee for our algorithm.

Acknowledgments

The authors acknowledge the support from grants: NSF CCF 1344179, 1344364, 1407278, 1422549 and an ARO YIP award (W911NF-14-1-0258). We also thank Frederick Eberhardt for helpful discussions.

References

  • [1] Steen A. Andersson, David Madigan, and Michael D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505–541, 1997.
  • [2] Frederick Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI), pages 178–184, 2005.
  • [3] Frederick Eberhardt. Causation and Intervention (Ph.D. Thesis), 2007.
  • [4] Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13(1):2409–2464, 2012.
  • [5] Alain Hauser and Peter Bühlmann. Two optimal strategies for active learning of causal networks from interventional data. In Proceedings of Sixth European Workshop on Probabilistic Graphical Models, 2012.
  • [6] Alain Hauser and Peter Bühlmann. Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55(4):926–939, 2014.
  • [7] Patrik O Hoyer, Dominik Janzing, Joris Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Proceedings of NIPS 2008, 2008.
  • [8] Huining Hu, Zhentao Li, and Adrian Vetta. Randomized experimental design for causal graph discovery. In Proceedings of NIPS 2014, Montreal, CA, December 2014.
  • [9] Antti Hyttinen, Frederick Eberhardt, and Patrik Hoyer. Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071, 2013.
  • [10] Gyula Katona. On separating systems of a finite set. Journal of Combinatorial Theory, 1(2):174–194, 1966.
  • [11] Richard J Lipton and Robert Endre Tarjan. A separator theorem for planar graphs. SIAM Journal on Applied Mathematics, 36(2):177–189, 1979.
  • [12] Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the eleventh international conference on uncertainty in artificial intelligence, 1995.
  • [13] Christopher Meek. Strong completeness and faithfulness in Bayesian networks. In Proceedings of the eleventh international conference on uncertainty in artificial intelligence, 1995.
  • [14] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.
  • [15] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. J. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.
  • [16] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. A Bradford Book, 2001.
  • [17] Thomas Verma and Judea Pearl. An algorithm for deciding if a set of observed independencies has a causal explanation. In Proceedings of the Eighth international conference on uncertainty in artificial intelligence, 1992.
  • [18] Ingo Wegener. On separating systems whose elements are sets of at most k elements. Discrete Mathematics, 28(2):219–222, 1979.

Appendix

6.1 Proof of Lemma 1

We describe a string labeling procedure as follows to label the elements of the set [n].

String Labelling: Let a ≥ 2 be a positive integer. Let ℓ be the integer such that a^{ℓ−1} < n ≤ a^ℓ, i.e., ℓ = ⌈log_a n⌉. Every element is given a label which is a string of integers of length ℓ drawn from the alphabet {0, 1, ..., a − 1} of size a. Write n = qa + r for some integers q and r, where 0 ≤ r < a. Now, we describe the sequence of the i-th digit across the string labels of all elements from 1 to n:

  1. Repeat the integer 0 a^{i−1} times, repeat the next integer a^{i−1} times and so on circularly (circular means that after a − 1 is completed, we start with 0 again), from position 1 till position qa.

  2. After that, repeat the next integer q times, followed by the one after it q times, till we reach the n-th position. Clearly, the integer at the n-th position does not exceed a − 1.

  3. Every integer occurring after the qa-th position is increased by 1.

From the three steps used to generate every digit, a straightforward calculation shows that every integer letter is repeated at most ⌈n/a⌉ times in every digit position. Now, we would like to prove inductively that the labels are distinct for all elements. Let us assume the induction hypothesis: for all m < n, the procedure produces distinct labels on m elements. The base case of a single element is easy to see. We then show that for n elements the labels are distinct.

Another way of looking at the labeling procedure is as follows. Let n = qa + r with 0 ≤ r < a. Divide the label matrix M (of dimensions ℓ × n, whose columns are the labels) into two parts, one consisting of the first qa columns and the other consisting of the remaining r columns. For any row of M, till the end of the first qa columns, the labeling procedure is still in Step 1; after that, one can take r to be the new size of the set of elements to be labelled and restart the procedure with this r. Therefore, we have the following key observation: the matrix consisting of the last r columns of M is, up to the shift of Step 3, the label matrix produced by the above procedure for r distinct elements.

Since r < n, by the induction hypothesis, these r columns are distinct, and hence any two columns in the second part of M are distinct. Next, suppose the first ℓ − 1 rows of two columns x and y of the first part are identical. These rows correspond to base-a expansions, so x and y must be far apart; but then the last row of columns x and y has to be distinct, because by Step 1 and Step 2 of the labeling procedure every integer in that row occurs only in contiguous runs of bounded length, and a given run can meet each such pair at most once. Therefore, any two columns in the first part are distinct. Finally, the last-row entries of the second part differ from those of the first part because of the addition in Step 3. Therefore, all columns of M are distinct and, by induction, the result is shown.

6.2 Proof of Theorem 1

By Lemma 1, the i-th position has at most ⌈n/a⌉ ≤ k occurrences of each symbol j, so every subset S_{i,j} has size at most k. Now, consider a pair of distinct elements x, y ∈ [n]. Since they are labelled distinctly (Lemma 1), there is at least one position i where their string labels differ. Suppose the distinct i-th letters are p and q, and let us say p ≠ 0 without loss of generality. Then, clearly, the separation criterion is met by the subset S_{i,p}. This proves the claim.

6.3 Proof of Theorem 2

We construct a worst case σ inductively. Before every step i, the adaptive algorithm deterministically chooses I_i based on E_{{I₁,...,I_{i−1}}}(K_n^σ). Therefore, we will reveal a partial order on the vertices consistent with the observations so far. Inductively, for every i, we will make sure that after I_i is chosen by the algorithm, further details about σ can be revealed to form a partial order σ_i such that, after intervening on I_i and then applying R0, there is no opportunity to apply rule R2 (recall that R1, R3 and R4 never apply on a complete skeleton). This will ensure that I is a separating system on [n].

Before the intervention at any step i, let us ‘tag’ every vertex v with the subset T(v) ⊆ {1, ..., i − 1} of indices of all those interventions that contain v before step i. Let 𝒯 contain the distinct elements of the multi-set {T(v) : v ∈ [n]}. We will construct σ partially such that it always satisfies the following criterion:

Inductive Hypothesis: The partial order σ_{i−1} is such that any two vertices u and v are incomparable if T(u) = T(v), and comparable otherwise. This means the edges between vertices carrying the same tag have not been revealed, and thus the corresponding directed edges are not known to the algorithm.

Now, we briefly digress to argue that if we can construct σ satisfying this property throughout, then at termination all vertices must be tagged differently; otherwise, the directions among the identically tagged vertices have not been learned and the algorithm has not succeeded in its task. And if all vertices are tagged differently, then I is by definition a separating system.

Construction of σ_i: We now construct a σ_i that satisfies the induction hypothesis before step i + 1. Before step i, consider the vertices carrying a common tag, and let the current intervention I_i be chosen by the deterministic algorithm. We modify the partial order as follows: within every tag class, the vertices that belong to I_i are placed before the vertices that do not (the vertices inside either set remain unordered amongst themselves); the directions between these two sets are exactly the ones revealed by R0. By the induction hypothesis for step i and with the new tagging of the vertices after I_i, it is easy to see that only directions between distinct tag classes have been revealed, that no direction within a tag class is revealed, and that all vertices of a tag class are contiguous in the partial order constructed so far. We only need to show that rule R2 cannot reveal any more edges amongst the vertices of a tag class after the new σ_i and the intervention I_i. Suppose there are two vertices a and b that are tagged identically just after intervention I_i and the modification of the partial order, and suppose an application of R2 reveals the direction between a and b before the next intervention. Then there has to be a vertex c, tagged differently from a and b, such that a → c and c → b are both known. But this implies that a and b are comparable in σ_i, leading to a contradiction. This implies the hypothesis holds for step i + 1.

Base case: Trivially, the induction hypothesis holds at step 1, where σ₀ leaves the entire vertex set unordered.

6.4 Proof of Lemma 2

The proof is a direct consequence of acyclicity, the non-existence of immoralities, and the definition of rule R1.

6.5 Proof of Lemma 3

By Lemma 2, it is sufficient for the algorithm to identify the root node of the tree. Suppose the root node is unknown to the algorithm. Every tree has a single-vertex separator whose removal partitions the tree into components, each of size at most half the total [11]. Choose that vertex separator v (it can be found in polynomial time by removing every node in turn and determining the components left). If v is the root node, we stop here. Otherwise, its parent is identified after the intervention on v through rule R0. Let T₁, ..., T_d be the component trees that result from removing node v, and let T_p be the one containing the parent of v. All directions in all the other trees become known after repeated application of R1 once R0 is applied, since their edges at v point away from v. The directions inside T_p will not all be known. For the next step, T_p is the new skeleton, which again has no immoralities. Again, we find the best vertex separator and the process continues. This procedure terminates at some step when only one node is left, which must be the root by Lemma 2. Since the number of remaining nodes at least halves each time, and is initially at most n_t, the procedure terminates after at most ⌈log₂ n_t⌉ interventions.

6.6 Proof of Lemma 4

The subgraph induced by two color classes in any graph is bipartite, and bipartite graphs have no odd cycles; in particular, no triangles. Since the graph and every induced subgraph of it are chordal, a shortest cycle in the induced subgraph, which is necessarily chordless, would have to be a triangle. Hence the induced graph on a pair of color classes has no cycle at all. This proves the lemma.

6.7 Proof of Theorem 4

Assume n is even for simplicity. We define a family of partial orders as follows. Group the vertices of [n] into n/2 pairs (v₁, v₂), (v₃, v₄), ..., (v_{n−1}, v_n). The ordering within each pair is not revealed, but all the edges between different pairs i and j with i < j are directed from pair i to pair j. Now, one has to design a set of interventions such that at least one node of every pair is intervened on at least once. This is because, if neither node of a pair is intervened on, the direction between them cannot be deduced by applying rule R0 or the Meek rules to any other set of directions in the rest of the graph. Since the size of every intervention is at most k and at least n/2 nodes need to be covered by the intervention sets, the number of interventions required is at least n/(2k).
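Spelling out the count (nothing beyond the argument above): the n/2 pairs are disjoint and each must contribute at least one intervened node, so

m · k ≥ |I₁| + ⋯ + |I_m| ≥ n/2,   and hence   m ≥ n/(2k).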

6.8 Proof of Theorem 5

Proof.

Separate the n vertices arbitrarily into n/k disjoint subsets (blocks) of size k. Let the first n/k interventions be the blocks themselves, so that every vertex is intervened on exactly once; by R0 this reveals the directions of all edges between different blocks. This divides the problem of learning a clique of size n into learning n/k cliques of size k. Then, we can apply the clique learning algorithm in [8] as a black box to each of the blocks: each block is learned with probability at least 1 − δ after O(log log k) experiments in expectation. Choosing δ inverse-polynomial in n, a union bound over the n/k blocks yields an overall success probability that is inverse-polynomially close to 1. Since each block takes O(log log k) experiments, we need O((n/k) log log k) experiments. ∎

6.9 Proof of Theorem 6

We need the following definitions and some results before proving the theorem.

Definition 2.

A perfect elimination ordering σ on the vertices of an undirected chordal graph G is an ordering such that, for all i, the neighborhood of σ(i) within the subgraph induced by {σ(1), ..., σ(i)} is a clique.

Lemma 5.

([6]) If all directions in the chordal graph are according to a perfect elimination ordering (edges go only from vertices lower in the order to vertices higher in the order), then there are no immoralities, since the parents of every vertex form a clique.

We make the following observation. Let the directions in a graph be oriented according to an ordering on the vertices. If a clique comes first in the ordering, then the knowledge of edge directions in the rest of the graph, excluding those of the clique, cannot help at any stage of the intervention process on the clique: all the edges between the clique and the rest are directed outwards from the clique, and hence none of the Meek rules apply. Indeed, if a direction a → b is to be inferred by the Meek rules from other known directions, then before the inference step there must either be a known edge directed into a or into b (as in Meek rules R1, R3 and R4), or a known directed path from a to b through a third vertex (R2). When a and b both belong to the clique, either case would require an edge directed into the clique from outside, which is impossible.

Lemma 6.

([6]) Let K be a maximum clique of an undirected chordal graph G. Then there is an underlying DAG on the chordal skeleton that is oriented according to a perfect elimination ordering (implying no immoralities) in which the vertices of K occur first.

By Lemmas 5 and 6 and the observation above, given a chordal skeleton, we can construct a DAG on the skeleton with no immoralities such that the directions within the maximum clique cannot be learned by using knowledge of the directions outside it. This means that only the intervention sets restricted to the clique matter for learning the directions on this clique; inference on the clique is isolated. Hence, all the lower bounds for the complete-graph case transfer to this case, and since the size of the largest clique is exactly the coloring number of the chordal skeleton, the theorem follows.

6.10 Proof of Theorem 7

Example with a feasible solution close to the lower bound: Consider a graph that can be partitioned into a clique of size χ and an independent set on the remaining n − χ vertices. Such graphs are called split graphs, and as n → ∞ the fraction of chordal graphs that are split graphs tends to 1. For a split-graph skeleton, it is enough to intervene only on the nodes of the clique, and therefore the number of interventions needed is that for the clique, i.e., a (χ, k)-separating system. It is certainly possible to orient the edges in such a way as to avoid immoralities, since the graph is chordal.

Example which needs a number of interventions close to the upper bound: We construct a connected chordal skeleton with independence number α and clique number (also coloring number) χ that requires at least α(χ − 1)/(2k) interventions, for any algorithm, over a class of orientations.

Consider a path L on α vertices v₁, ..., v_α, where every node v_i is connected to v_{i−1} and v_{i+1}. For each i, attach a clique C_i of size χ that contains node v_i and no other node of the path L. Now assume that the actual orientation of L is v₁ → v₂ → ⋯ → v_α. In every clique, the orientation is partially specified as follows: in clique C_i, all edges incident to node v_i are outgoing from v_i. It is easy to check that this partial orientation excludes all immoralities. Further, each clique can carry any of the possible orientations of its remaining χ − 1 vertices in the actual DAG. Now, even if all the specified directions are revealed to the algorithm, it still has to intervene inside all α disjoint sub-cliques, each on χ − 1 vertices, and the directions in one clique will not force directions in the others through R0 or any of the Meek rules. Therefore, a lower bound of α(χ − 1)/2 on the total number of node accesses (total number of nodes intervened) is implied by the argument of Theorem 4 applied to each sub-clique. Given that every intervention is of size at most k, these chordal skeletons with the revealed partial orientation need at least α(χ − 1)/(2k) experiments.

6.11 Performance Comparison of Our Algorithm vs. the Naive Scheme for Additional Values of n and k

Figure 2 (panels (a) and (b)): n: number of vertices, k: intervention size bound. The number of experiments is compared between our heuristic and the naive algorithm based on the separating system, on random chordal graphs. The red markers represent the sizes of the separating system. Green circle markers and cyan square markers for the same n correspond to the number of experiments required by our heuristic and by the algorithm based on the (n, k)-separating system (Theorem 1), respectively, on the same set of chordal graphs. All four plots (including the ones in the main text) indicate that our algorithm requires a number of experiments proportional to the clique number χ, whereas the naive separating-system-based algorithm requires experiments on the order of the number of variables n.

6.12 Proof of Theorem 8

We provide the following justifications for the correctness of Algorithm 1.

  1. At line 4 of the algorithm, when the Meek rules and R0 have been applied after every intervention, the intermediate graph G of unlearned edges is always a disjoint union of chordal components (refer to (1) and the comments below it), and hence a chordal graph.

  2. The number of unlearned edges strictly decreases across every iteration of the main while loop in Algorithm 1. Every edge in G is incident to two color classes, and at least one of the two colors is always picked for processing, because we use a separating system on the colors. Therefore, some node belonging to some unlearned edge has a positive score and is intervened on, and that edge’s direction is then learned through rule R0. Therefore, the algorithm terminates.

  3. It identifies the correct DAG because every edge direction is inferred after some intervention by applying rule R0 and the Meek rules as in (1), both of which are sound.

  4. The algorithm has polynomial run time complexity because each iteration of the main while loop takes time polynomial in n and, by the previous argument, there are at most |E(G)| iterations.