1 Introduction
Causal inference is important for many applications including, among others, biology, econometrics and medicine [1, 9, 26]. Randomized trials are the gold standard for causal inference since they lead to reliable conclusions with minimal assumptions. The problem is that enforcing randomization on different variables in a causal inference problem can have significant and varying costs. A causal discovery algorithm should take these costs into account and optimize experiments accordingly.
In this paper we formulate the problem of learning a causal graph when there is a cost for intervening on each variable. We follow the structural equation modeling framework [21, 29] and use interventions, i.e., experiments. To perform each intervention, a scientist randomizes a set of variables and collects new data from the perturbed system. For example, suppose the scientist wants to observe the causal effect of the variable smoking on the variable cancer. Suppose she decides to perform an intervention on the smoking variable. This entails forcing a random subset of participants to smoke, irrespective of whether they are smokers or nonsmokers. This intervention would clearly be hard to perform, so the cost of intervening on the variable smoking should be set very high. An intervention on the second variable, cancer, would be physically impossible, since there is no mechanism for the scientist to force this variable to take a given value for a participant. We should therefore set the cost of intervening on the cancer variable to infinity.
In this paper we study the following problem: We want to learn a causal graph, and for each variable we are given a cost. For each intervention set, the cost is the sum of the costs of all the variables in the set. The total cost is the sum of the costs of the performed interventions. We would like to learn a causal graph with the minimum possible total cost.
Our Contributions: This is a natural problem that, to the best of our knowledge, has not been previously studied except for some special cases as we explain in the related work section. Our results are as follows:

We show that the problem of designing the minimum cost interventions to learn a causal graph can be solved in polynomial time.

We study the minimum cost intervention design problem when the number of interventions is limited. We formulate the cost-optimum intervention design problem as an integer linear program. This formulation allows us to identify two causal graph families for which the problem can be solved in polynomial time.

For general graphs, we develop an efficient greedy algorithm. We also propose an improved variant of this algorithm, which runs in polynomial time when the causal graph is an interval graph.
Our machinery is graph theoretic. We rely on the connection between graph separating systems and proper colorings. Although this connection was previously discovered, it does not seem to be widely known in the literature.
2 Background and Notation
In this section, we present a brief overview of Pearl’s causality framework and illustrate how interventions are useful in identifying causal relations. We also present the requisite graph theory background. Finally, we explain separating systems: Separating systems are the central mathematical objects for nonadaptive intervention design.
2.1 Causal Graphs, Interventions and Learning
A causal graph is a directed acyclic graph (DAG), where each vertex represents a random variable of the causal system. Consider a set of random variables $V = \{X_1, X_2, \ldots, X_n\}$. A directed acyclic graph $D = (V, E)$ is a causal graph if the arrows in the edge set $E$ encode direct causal relations between the variables: A directed edge $X_i \rightarrow X_j$ represents a direct causal relation between $X_i$ and $X_j$. $X_i$ is said to be a direct cause of $X_j$. In the structural causal modeling framework (also called structural equations with independent errors), every variable $X_i$ can be written as a deterministic function of its parent set in the causal graph $D$ and some unobserved random variable $E_i$. $E_i$ is called an exogenous variable and it is statistically independent from the non-descendants of $X_i$. Thus $X_i = f_i(\mathrm{Pa}_i, E_i)$, where $\mathrm{Pa}_i$ is the set of the parents of $X_i$ in $D$ and $f_i$ is some deterministic function. We assume that the graph is acyclic (DAG) and all the variables are observable (causal sufficiency). [Footnote 1: Treatment of cyclic graphs requires mechanics different from independent exogenous variables, or a time-varying system, and is out of the scope of this paper.]
The functional relations between the observed variables and the exogenous variables induce a joint probability distribution over the observed variables. It can be shown that the underlying causal graph $D$ is a valid Bayesian network for the joint distribution induced over the observed variables by the causal model. To identify the causal graph, we can check the conditional independence relations between the observed variables. Under the faithfulness assumption [29], every conditional independence relation is equivalent to a graphical criterion called d-separation. [Footnote 2: The set of unfaithful distributions is shown to have measure 0. This makes faithfulness a widely employed assumption, even though it was recently shown that almost-faithful distributions may have significant measure [30].]
In general, there is no unique Bayesian network that corresponds to a given joint distribution: There exist multiple Bayesian networks for a given set of conditional independence relations. Thus, it is not possible in general to uniquely identify the underlying causal graph using only these tests. However, conditional independence tests allow us to identify certain induced subgraphs: immoralities, i.e., induced subgraphs on three nodes of the form $X \rightarrow Z \leftarrow Y$. An undirected graph $G$ is called the skeleton of a causal directed graph $D$ if every edge of $G$ corresponds to a directed edge of $D$, and every non-edge of $G$ corresponds to a non-edge of $D$. The PC algorithm [29] and its variants use conditional independence tests: They first identify the graph skeleton, and then determine all the immoralities. The runtime is polynomial if the underlying graph has constant vertex degree.
The set of invariant causal edges is not limited to those that belong to an immorality. For example, one can identify additional causal edges based on the fact that the graph is acyclic. Meek developed a complete set of rules in [19, 20] to identify every invariant edge direction, given a set of causal edges and the skeleton. Meek rules can be iteratively applied to the output of the PC algorithm to identify every invariant arrow. The graph that contains every invariant causal arrow as a directed edge, and the others as undirected edges, is called the essential graph of $D$. Essential graphs are shown to contain undirected components which are always chordal [29, 10]. [Footnote 3: A graph is chordal if every cycle of length 4 or more contains a chord.]
Performing experiments is the most definitive way to learn the causal direction between variables. Randomized clinical trials, which aim to measure the causal effect of a drug, are examples of such experiments. In Pearl’s causality framework, an experiment is captured through the do operator: The do operator refers to the process of assigning a particular value to a set of variables. An intervention is an experiment where the scientist collects data after performing the do operation on a subset of variables. This process is fundamentally different from conditioning, and requires the scientist to have the power to change the underlying causal system: For example, by forcing a patient not to smoke, the scientist removes the causal effect of the patient’s urge to smoke, which may be caused by a gene. An intervention is called perfect if it does not change any other mechanism of the causal system and only assigns the desired value to the intervened variable. A stochastic intervention assigns the value of the variable of interest to the realizations of another variable instead of a fixed value. The assigned variable is independent from the other variables in the system. This is represented as $do(X_i = U)$ for some independent random variable $U$.
Due to the change of the causal mechanism, an intervention on $X_i$ removes the causal arrows from the parents $\mathrm{Pa}_i$ to $X_i$. This change in the graph skeleton can be detected by checking the conditional independences in the post-interventional distribution: The edges still adjacent to $X_i$ must have been directed away from $X_i$ before the experiment. The edges that are missing must have come from the parents of $X_i$. Thus, an intervention on $X_i$ enables us to learn the direction of every edge adjacent to $X_i$. Similarly, intervening on a set of nodes $S \subseteq V$ concurrently enables us to learn the causal edges across the cut $(S, S^c)$.
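As a small illustration of the cut argument above, the following Python sketch lists the skeleton edges whose direction would be revealed by a single intervention set. The function name `edges_cut_by` and the four-vertex path skeleton are hypothetical and chosen only for this example; the sketch ignores any additional edges that Meek rules might orient afterwards.

```python
def edges_cut_by(skeleton_edges, intervened):
    """Return the skeleton edges with exactly one endpoint in `intervened`.

    These are the edges across the cut (S, S^c), i.e., the edges whose
    direction a single intervention on S can reveal.
    """
    intervened = set(intervened)
    return [(u, v) for (u, v) in skeleton_edges
            if (u in intervened) != (v in intervened)]


# Hypothetical skeleton: the path a - b - c - d.
skeleton = [("a", "b"), ("b", "c"), ("c", "d")]
cut = edges_cut_by(skeleton, {"b", "c"})
```

Intervening on both endpoints of an edge (here `b` and `c`) leaves that edge uncut, which is why a single intervention set is generally not enough.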
Given sufficient data and computation power, we can apply the PC algorithm and Meek rules to identify the essential graph. To discover the rest of the graph we need to use interventions on the undirected components. We assume that we work on a single undirected component after this preprocessing step. Hence, the graphs we consider are chordal without loss of generality, since these components are shown to always be chordal [10]. After each intervention, we also assume that the scientist can apply the PC algorithm and Meek rules to uncover more edges. A set of interventions $\mathcal{I}$ is said to learn a causal graph given skeleton $G$ if every causal edge of any causal graph $D$ with skeleton $G$ can be identified through this procedure. A set of $m$ interventions is called an intervention design and is denoted by $\mathcal{I} = \{I_1, I_2, \ldots, I_m\}$, where $I_i \subseteq V$ is the set of nodes intervened on in the $i$th experiment.
An intervention design algorithm is called nonadaptive if the choice of an intervention set does not depend on the outcome of the previous interventions. Yet, we can make use of the Meek rules over the hypothetical outcomes of each experiment. Adaptive algorithms design the next experiment based on the outcome of the previous interventions. Adaptive algorithms are in general hard to design and analyze and sometimes impractical when the scientist needs to design the interventions before the experiment starts.
In this paper we are interested in the problem of learning a causal graph given its skeleton, where each variable is associated with a cost. The objective is to nonadaptively design the set of interventions that minimizes the total interventional cost. We prove that any set of interventions that can learn every causal graph with a given skeleton needs to be a graph separating system for the skeleton. To the best of our knowledge, this is the first formal proof of this statement.
2.2 Separating systems, Graphs, Colorings
A separating system on a set of elements is a collection of subsets with the following property: For every pair of elements from the set, there exists at least one subset which contains exactly one element from the pair:
Definition 1.
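For set $[n] = \{1, 2, \ldots, n\}$, a collection of subsets of $[n]$, $\mathcal{I} = \{I_1, I_2, \ldots, I_m\}$, is called a separating system if for every pair $a, b \in [n]$, $\exists i \in [m]$ such that either $a \in I_i$ and $b \notin I_i$, or $a \notin I_i$ and $b \in I_i$.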
The subset that contains exactly one element from the pair is said to separate the pair. The number of subsets in the separating system is called the size of the separating system. We can represent a separating system with a binary matrix:
Definition 2.
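Consider a separating system $\mathcal{I} = \{I_1, I_2, \ldots, I_m\}$ for the set $[n]$. A binary matrix $M \in \{0,1\}^{n \times m}$ is called the separating system matrix for $\mathcal{I}$ if for any element $j \in [n]$, $M(j, i) = 1$ if $j \in I_i$ and 0 otherwise.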
Thus, each set element has a corresponding row coordinate, and the rows represent the set membership of these elements. Each column of $M$ is a 0-1 vector that indicates which elements belong to the set corresponding to that column. See Figure 1(b) for two examples. The definition of every pair being separated by some set then translates to every row of $M$ being different.
Given an undirected graph, a graph separating system is a separating system that separates every edge of the graph.
Definition 3.
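Given an undirected graph $G = (V, E)$, a set of subsets of $V$, $\mathcal{I} = \{I_1, I_2, \ldots, I_m\}$, is a graph separating system if for every pair $u, v \in V$ for which $(u, v) \in E$, $\exists i \in [m]$ such that either $u \in I_i$ and $v \notin I_i$, or $u \notin I_i$ and $v \in I_i$.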
Thus, graph separating systems only need to separate pairs of elements adjacent in the graph. Graph separating systems are considered in [18]. It was shown that the size of the minimum graph separating system is $\lceil \log_2 \chi \rceil$, where $\chi$ is the chromatic number of $G$. Based on this, we can trivially extend the definition of separating system matrices to include graph separating systems.
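The definitions above translate directly into a short check. The following Python sketch builds the rows of a separating system matrix and tests the graph separating property; the function names and the triangle example are illustrative assumptions, not part of the original text.

```python
def separating_matrix(vertices, subsets):
    """Map each vertex to its row: entry i is 1 if the vertex is in subsets[i]."""
    return {v: tuple(int(v in s) for s in subsets) for v in vertices}


def is_graph_separating_system(edges, subsets):
    """True if every edge has some subset containing exactly one endpoint."""
    return all(any((u in s) != (v in s) for s in subsets) for (u, v) in edges)


# Hypothetical example: a triangle, whose chromatic number is 3,
# so ceil(log2(3)) = 2 subsets are needed.
vertices = ["a", "b", "c"]
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
subsets = [{"a"}, {"b"}]
rows = separating_matrix(vertices, subsets)
```

Here the three rows are pairwise distinct, so every edge of the triangle is separated, whereas a single subset cannot separate all three edges.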
A coloring of an undirected graph is an assignment of a label (color) from a given set to every vertex. A coloring is called proper if every pair of adjacent vertices is assigned different colors. A proper coloring for a graph is optimal if it uses the minimum number of colors among all proper colorings. The number of colors used by an optimal coloring is the chromatic number of the graph. An optimal coloring is hard to find in general graphs; however, the problem is in P for perfect graphs. Since chordal graphs are perfect, the graphs we are interested in in this paper can be efficiently colored using the minimum number of colors. For a given undirected graph $G = (V, E)$, the vertex induced subgraph on $S \subseteq V$ is denoted by $G_S$.
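One standard way to color a chordal graph with the minimum number of colors is greedy coloring along a maximum cardinality search order. The sketch below is a minimal illustration of that well-known technique, not necessarily the procedure used by the authors; the function names and the example graph are assumptions, and the adjacency structure is assumed to be a dict mapping each vertex to the set of its neighbors.

```python
def mcs_order(adj):
    """Maximum cardinality search: repeatedly visit the unvisited vertex
    that has the most already-visited neighbors."""
    weight = {v: 0 for v in adj}
    order = []
    while weight:
        v = max(weight, key=weight.get)
        order.append(v)
        del weight[v]
        for u in adj[v]:
            if u in weight:
                weight[u] += 1
    return order


def color_chordal(adj):
    """Greedy coloring in MCS visit order (optimal when the graph is chordal)."""
    coloring = {}
    for v in mcs_order(adj):
        used = {coloring[u] for u in adj[v] if u in coloring}
        coloring[v] = next(c for c in range(len(adj)) if c not in used)
    return coloring


# Hypothetical chordal graph: two triangles sharing the edge b - c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
coloring = color_chordal(adj)
```

On a chordal graph, the already-colored neighbors of each vertex form a clique in this order, so the number of colors used never exceeds the size of the largest clique.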
3 Related Work
The framework of learning causal relations from data has been extensively studied under different assumptions on the causal model. The additive noise assumption asserts that the effect of the exogenous variables is additive in the structural equations. Under the additional assumptions that the data is Gaussian and that the exogenous variables have equal variances, [22] shows that the causal graph is identifiable. Recently, under the additive linear model with jointly Gaussian variables, [23] proposed using the invariance of the causal relations to combine a given set of interventional data.
For the case of two-variable causal graphs, there is a rich set of theoretical results for data-driven learning: [12] and [28] show that we can learn a two-variable causal graph under different assumptions on the function or the noise term under the additive noise model. Alternatively, an information geometric approach that is based on the independence of cause and effect is suggested by [14]. [17] recently proposed using a classifier on the datasets to label each dataset either as $X$ causes $Y$ or $Y$ causes $X$. The lack of large real causal datasets forced the authors to generate artificial causal data, which makes this approach dependent on the data generation process.
Information theoretic causal inference approaches have gained increased attention recently [32, 7]. For time-series data, along with Granger causality, directed information is used [8, 5, 24, 16, 25]. An entropic causal inference framework was recently proposed for two-variable causal graphs by [15].
The literature on learning causal graphs using interventions without assumptions on the causal model is more limited. For the objective of minimizing the number of experiments, [11] proposes a coloring-based algorithm to construct the optimum set of interventions. [3] introduced the constraint on the number of variables intervened on in each experiment. It was proved in [4] that, when all causal graphs are considered, the set of interventions to fully identify the causal DAG needs to be a separating system for the set of variables. For example, for complete graphs, separating systems are necessary. [13] draws connections between the combinatorics literature and causality via known separating system constructions. [27] illustrates several theoretical findings: They show that separating systems are necessary even under a constraint on the maximum size of each intervention, identify an information theoretic lower bound on the necessary number of experiments, propose a new separating system construction, and develop an adaptive algorithm that exploits the Meek rules. To the best of our knowledge, the fact that a graph separating system is necessary for a given causal graph skeleton was unknown until this work.
4 Graph Separating Systems, Proper Colorings and Intervention Design
In this section, we illustrate the relation between graph colorings and graph separating systems, and show how they are useful for nonadaptive intervention design algorithms.
Given a graph separating system $\mathcal{I} = \{I_1, I_2, \ldots, I_m\}$ for the skeleton $G$ of a causal graph, we can construct the set of interventions as follows: For experiment $i$, intervene on the set of variables in the set $I_i$. Since $\mathcal{I}$ is a graph separating system, for every edge in the skeleton, there is some $i$ for which $I_i$ intervenes on only one of the variables adjacent to that edge. Since the edge is cut, it can be learned by learning the skeleton of the post-interventional graph, as explained in Section 2. Since every edge is cut at least once, an intervention design based on a graph separating system identifies any causal graph with skeleton $G$.
Graph separating systems provide a structured way of designing interventions that can learn any causal graph. Their necessity, however, is more subtle: One might suspect that using the Meek rules in between every intervention may eliminate the need for the set of interventions to correspond to a graph separating system. Suppose we have designed the first few experiments. Applying the Meek rules over all possible outcomes of these experiments on $G$ may enable us to design the next experiment in an informed manner, even though we do not get to see the outcome of our experiments. Eventually it might be possible to uncover the whole graph without having to separate every edge. In the following we show that Meek rules are not powerful enough to accomplish this, and we actually need a graph separating system. This fact seems to be known [4, 11]; however, we could not locate a proof. We provide our own proof:
Theorem 1.
Consider an undirected graph $G$. A set of interventions $\mathcal{I}$ learns every causal graph $D$ with skeleton $G$ if and only if $\mathcal{I}$ is a graph separating system for $G$.
Proof.
See the appendix. ∎
4.1 Any Graph Separating System is Some Coloring
In this section, we explain the relation between graph separating systems and proper graph colorings. This relation, which is already known [11], is important for us in reformulating the intervention design problem in the later sections.
Let $c: V \rightarrow \{0,1\}^m$ be a proper graph coloring for graph $G = (V, E)$ which uses at most $2^m$ colors in total. The colors are labeled by length-$m$ binary vectors. First construct matrix $M$ as follows: Let the $i$th row of $M$ be the label corresponding to the color of vertex $i$, i.e., $M(i,:) = c(i)$. Then $M$ is a separating system matrix: Let $I_j$ be the set of row indices of $M$ for which the corresponding entries in the $j$th column are 1. Let $\mathcal{I}$ be the set of subsets constructed in this manner from the $m$ columns of $M$. Then $\mathcal{I}$ is a graph separating system for $G$. To see this, consider any pair of vertices that are adjacent in $G$: $(u, v) \in E$. Since the coloring is proper, the color labels of these vertices are different, which implies that the corresponding rows of $M$, $M(u,:)$ and $M(v,:)$, are different. Hence, there is some column of $M$ which is 1 in exactly one of the $u$th and $v$th rows. Thus, the subset constructed from this column separates the pair of vertices $u, v$.
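The construction in this paragraph can be sketched in a few lines of Python. As an assumption made only for this illustration, colors are given as integers and the binary label of a color is taken to be its binary expansion; any assignment of distinct labels to distinct colors would work equally well. The function name is hypothetical.

```python
def separating_system_from_coloring(coloring):
    """Build subsets from a proper coloring (vertex -> color index 0..t-1).

    The label of color c is the binary expansion of c; subset i collects the
    vertices whose label has a 1 in bit position i.
    """
    t = max(coloring.values()) + 1
    m = (t - 1).bit_length()  # equals ceil(log2(t)) for t >= 1
    return [{v for v, c in coloring.items() if (c >> i) & 1} for i in range(m)]


# Hypothetical example: a triangle with a proper 3-coloring.
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
coloring = {"a": 0, "b": 1, "c": 2}
subsets = separating_system_from_coloring(coloring)
separated = all(any((u in s) != (v in s) for s in subsets) for (u, v) in triangle)
```

Three colors need two bits, so the triangle is separated by two subsets, matching the $\lceil \log_2 \chi \rceil$ bound quoted from [18].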
Therefore any proper graph coloring can be used to construct a graph separating system. It turns out that the converse is also true: Any graph separating system can be used to construct a proper graph coloring. This is shown by Cai in [18] within his proof that the minimum size of a graph separating system is $\lceil \log_2 \chi \rceil$, where $\chi$ is the chromatic number. We repeat this result for completeness [Footnote 4: Note that this lemma is not formally stated in [18] but rather verbally argued within a proof of another statement.]:
Lemma 1 ([18]).
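Let $\mathcal{I} = \{I_1, I_2, \ldots, I_m\}$ be a graph separating system for the graph $G = (V, E)$. Let $M$ be the separating system matrix for $\mathcal{I}$: the $i$th column of $M$ is the binary vector of length $|V|$ which is 1 in the rows that are contained in $I_i$. Then the coloring $c(j) = M(j,:)$ is a proper coloring for $G$.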
This connection between graph colorings and graph separating systems is important: Ultimately, we want to use graph colorings as a tool for searching over all sets of interventions, and find the one that minimizes a cost function. This is possible due to the characterization in Lemma 1 and the fact that the set of interventions has to correspond to a graph separating system in order to identify any causal graph by Theorem 1.
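The converse direction in Lemma 1 can be illustrated with a similarly small Python sketch: each vertex is colored by its row of the separating system matrix. The function names and the path example are hypothetical.

```python
def coloring_from_separating_system(vertices, subsets):
    """Color each vertex by its row of the separating system matrix."""
    return {v: tuple(int(v in s) for s in subsets) for v in vertices}


def is_proper(edges, coloring):
    """True if the endpoints of every edge receive different colors."""
    return all(coloring[u] != coloring[v] for (u, v) in edges)


# Hypothetical example: the path a - b - c, separated by the single subset {b}.
path = [("a", "b"), ("b", "c")]
coloring = coloring_from_separating_system(["a", "b", "c"], [{"b"}])
```

Non-adjacent vertices such as `a` and `c` may share a row, which is exactly the freedom a graph separating system has over an ordinary separating system.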
Along this direction, we have the following simple yet important observation: A minimum graph separating system does not have to correspond to an optimum coloring. We illustrate this with a simple example:
Proposition 1.
Proof.
Notice that the chromatic number of the given graph is 3. Hence the minimum separating system size is $\lceil \log_2 3 \rceil = 2$. Thus the given graph separating system is a minimum graph separating system. In any proper 3-coloring, two particular vertices of the graph must have different colors. Hence, any separating system constructed from a 3-coloring separates these two vertices. However, the rows of the given graph separating system which correspond to these two vertices are the same. In other words, any 3-coloring based graph separating system separates this pair, whereas the graph separating system given in Fig. 1(a) does not. ∎
This problem can be solved by assigning both of these vertices a new color, hence coloring the graph with 4 colors. We can conclude the following: Suppose we consider the cost-optimum intervention design problem with at most $m$ interventions. When we formulate it as a search problem over the graph colorings, we need to consider the colorings with at most $2^m$ colors instead of $\chi$ colors.
5 Cost-Optimal Intervention Design
In this section, we first define the cost-optimal intervention design problem. Later we show that this problem can be solved in polynomial time.
Suppose each variable $v \in V$ has an associated cost $w(v)$ of being intervened on. We consider a modular cost function: The cost of intervening on a set of nodes $I$ is $\sum_{v \in I} w(v)$. Our objective is to find the set of interventions with minimum total cost that can identify any causal graph with the given skeleton: Given the causal graph skeleton $G$, find the set of interventions $\mathcal{I} = \{I_1, I_2, \ldots, I_m\}$ that can identify any causal graph with the skeleton $G$, with minimum total cost $\sum_{i=1}^{m} \sum_{v \in I_i} w(v)$. In this section, we do not assume that the number of experiments is bounded and we are only interested in minimizing the total cost. We have the following theorem:
Theorem 2.
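Let $G = (V, E)$ be a chordal graph, and $w: V \rightarrow \mathbb{R}^{+}$ be a cost function on its vertices. Let an intervention on set $I$ have cost $\sum_{v \in I} w(v)$. Then the optimal set of interventions with minimum total cost that can learn any causal graph $D$ with skeleton $G$ is given by $\mathcal{I} = \{I_1, I_2, \ldots, I_t\}$, where $I_k = \{v \in V \setminus S : c(v) = k\}$ and $c: V \setminus S \rightarrow [t]$ is any proper coloring of the graph $G_{V \setminus S}$, where $S$ is the maximum weighted independent set of $G$.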
Proof.
See the supplementary material. ∎
In other words, the optimum strategy is to color the vertex induced subgraph obtained by removing the maximum weighted independent set and to intervene on each color class individually. After removing the maximum weighted independent set, the remaining graph can always be colored by at most $\chi$ colors, i.e., the chromatic number of $G$. The remaining graph is still chordal. Since an optimum coloring and a maximum weighted independent set can be found in polynomial time for chordal graphs, $\mathcal{I}$ can be constructed in polynomial time.
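The strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions: the maximum weighted independent set is found by brute force (suitable only for tiny graphs, and a stand-in for the polynomial-time chordal algorithms the text refers to), and the remaining vertices are grouped by a simple first-fit coloring, since the total cost does not depend on which proper coloring is used. All function names and the example costs are hypothetical.

```python
from itertools import combinations


def max_weight_independent_set(adj, cost):
    """Brute-force maximum weighted independent set; tiny graphs only."""
    best, best_w = set(), 0
    vs = list(adj)
    for r in range(1, len(vs) + 1):
        for cand in combinations(vs, r):
            if all(v not in adj[u] for u, v in combinations(cand, 2)):
                w = sum(cost[v] for v in cand)
                if w > best_w:
                    best, best_w = set(cand), w
    return best


def cost_optimal_design(adj, cost):
    """Never intervene on the max-weight independent set; intervene once on
    each color class of a proper coloring of the remaining vertices."""
    keep = max_weight_independent_set(adj, cost)
    classes = []  # each class is an independent set of the remaining graph
    for v in adj:
        if v in keep:
            continue
        for cls in classes:
            if not (adj[v] & cls):
                cls.add(v)
                break
        else:
            classes.append({v})
    total = sum(cost[v] for cls in classes for v in cls)
    return classes, total


# Hypothetical example: the path a - b - c where b is expensive to intervene on.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
cost = {"a": 1, "b": 5, "c": 1}
classes, total = cost_optimal_design(adj, cost)
```

In this example the single intervention on `a` and `c` cuts both edges, and the total cost equals the sum of all costs minus the weight of the independent set that is left untouched.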
6 Intervention Design with Bounded Number of Interventions
In this section, we consider the cost-optimum intervention design problem for a given number of experiments. We construct a linear integer program formulation of this problem and identify the conditions under which it can be efficiently solved. As a corollary we show that when the causal graph skeleton is a tree or a clique tree, the cost-optimal intervention design problem can be solved in polynomial time. Later, we present two greedy algorithms for more general graph classes.
To be able to uniquely identify any causal graph, we need a graph separating system by Theorem 1. Hence, we need $m \geq \lceil \log_2 \chi \rceil$, since the minimum graph separating system has size $\lceil \log_2 \chi \rceil$ due to [18].
6.1 Coloring formulation of Cost-Optimum Intervention Design
One common approach to tackle combinatorial optimization problems is to write them as linear integer programs: Often binary variables are used with a linear objective function and a set of linear constraints. The constraints determine the set of feasible points. One can construct a convex object (a convex polytope) based on the set of feasible points by simply taking their convex hull. However, this object cannot always be described efficiently. If it can, then the linear program over this convex object can be efficiently solved and the result is the optimal solution of the original combinatorial optimization problem. We develop an integer linear program formulation for finding the cost-optimum intervention design using its connection to proper graph colorings.
From Theorem 1, we know that we need the set of interventions to correspond to a graph separating system for the skeleton. From Lemma 1, we know that any graph separating system can be constructed from some proper coloring. Based on these, we have the following key observation: To solve the cost-optimal intervention design problem given a skeleton graph, it is sufficient to search over all proper colorings, and find the coloring that gives the graph separating system with the minimum cost. We use the following (standard) coloring formulation: Suppose we are given an undirected graph $G = (V, E)$ with $n$ vertices and $t$ colors are available. Assign a binary variable $x_{v,k}$ to every vertex-color pair $(v, k)$: $x_{v,k} = 1$ if vertex $v$ is colored with color $k$, and 0 otherwise. Each vertex is assigned a single color, which can be captured by the equality $\sum_{k=1}^{t} x_{v,k} = 1$. Since the coloring is proper, every pair of adjacent vertices is assigned different colors, which can be captured by $x_{u,k} + x_{v,k} \leq 1$ for all $(u, v) \in E$ and all $k \in [t]$. Based on our linear integer program formulation given in the supplementary material, we have the following theorem:
Theorem 3.
Consider the cost-optimal nonadaptive intervention design problem given the skeleton of the causal graph: Let each node be associated with an intervention cost, and the cost of intervening on a set of variables be the sum of the costs of each variable. Then, the nonadaptive intervention design that can learn any causal graph with the given skeleton in at most $m$ interventions with the minimum total cost can be identified in polynomial time, if the following polytope can be described using polynomially many linear inequalities:
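$$\mathcal{C} = \mathrm{conv}\Big\{ x \in \{0,1\}^{n \times t} : \sum_{k=1}^{t} x_{v,k} = 1 \ \forall v \in V, \quad x_{u,k} + x_{v,k} \leq 1 \ \forall (u,v) \in E, \ \forall k \in [t] \Big\} \qquad (1)$$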
Proof.
See the supplementary material. ∎
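The standard coloring constraints described in this section (one color per vertex, no shared color across an edge) can be checked on a candidate 0/1 assignment as follows. This is only a feasibility check written for illustration, not the authors' integer program or a solver; the function name and the dictionary encoding of the variables are assumptions.

```python
def satisfies_coloring_constraints(x, vertices, edges, num_colors):
    """Check the standard coloring constraints on a 0/1 assignment.

    x[(v, k)] is 1 if vertex v gets color k. Each vertex must get exactly one
    color, and the two endpoints of an edge may not share any color k.
    """
    one_color = all(
        sum(x[(v, k)] for k in range(num_colors)) == 1 for v in vertices
    )
    proper = all(
        x[(u, k)] + x[(v, k)] <= 1
        for (u, v) in edges
        for k in range(num_colors)
    )
    return one_color and proper


# Hypothetical example: a single edge a - b with two available colors.
vertices, edges = ["a", "b"], [("a", "b")]
good = {("a", 0): 1, ("a", 1): 0, ("b", 0): 0, ("b", 1): 1}
bad = {("a", 0): 1, ("a", 1): 0, ("b", 0): 1, ("b", 1): 0}
```

The assignment `good` is a vertex of the coloring polytope for this tiny graph, while `bad` violates the edge constraint for color 0.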
Delle Donne and Marenco [2] show that when the graph is a tree, one can replace some of the constraints with an alternative set of constraints without changing the polytope in (1). They also show that when the graph is a clique-tree (a graph that can be obtained from a tree by replacing the vertices of the tree with cliques), a simple alternative characterization based on the constraints on the maximum cliques of the graph exists, which can be efficiently described. Based on this and Theorem 3, we have the following corollary:
Corollary 1.
The cost-optimal nonadaptive intervention design problem can be solved in polynomial time if the given skeleton of the causal graph is a tree or a clique tree.
We can identify two other special cases of the cost-optimum intervention design problem that can be solved through a combinatorial argument: when the maximum number of interventions $m$ is greater than or equal to the chromatic number $\chi$, and when the graph is uniquely colorable. See the supplementary material for the corresponding results and details.
6.2 Greedy algorithms
In this section, we present two greedy algorithms for the minimum cost intervention design problem for more general graph classes.
We have the following observation: Consider a coloring $c$, which uses up to $2^m$ colors. Consider the graph separating system matrix $M$ constructed using this coloring, as described in Section 4.1. Recall that the $i$th row of $M$ is a binary vector which represents the label for the color of vertex $i$, and the $j$th column is the indicator vector for the set of variables included in intervention $j$. We call the vector used for color $k$ the coloring label for color $k$. The separating property does not depend on the color labels: Using different labels for different colors is sufficient for the graph separating property to hold. However, the number of 1s of a coloring label determines how many times the corresponding variable is intervened on using the corresponding intervention design. Hence, given the choice, we can choose the coloring labels from the binary vectors with small weight. Moreover, the column index of a 1 in a certain row does not affect the cost since in a nonadaptive design, every intervention counts towards the total cost (we cannot stop the experiments earlier since we do not get feedback from the causal graph, unlike adaptive algorithms).
Based on this observation, we can try to greedily color the graph as follows: Suppose we are allowed to use up to $m$ interventions. Thus the corresponding graph separating system matrix $M$ can have up to $m$ columns, which allows up to $2^m$ distinct coloring labels. We can greedily color the graph by choosing labels with small weight first: Choose the color label with the smallest weight from the available labels. Find the maximum weighted independent set of the graph. Assign the coloring label to the rows associated with the vertices in this independent set. Remove the used coloring label from the available labels, update the graph by removing the colored vertices, and iterate.
However, this type of greedy coloring could end up using many more colors than allowed. Indeed, one can show that greedily coloring a chordal graph using maximum independent sets at each step cannot approximate the chromatic number within an additive gap for all graphs. Thus, this vanilla greedy algorithm may use up all available colors and still have uncolored vertices, even though $\chi \leq 2^m$. To avoid this, we use the following modified greedy algorithm: For the first $2^m - \chi$ steps, greedily color the graph using maximum weighted independent sets. Use the last $\chi$ colors to color the remaining uncolored vertices. Since the graph obtained by removing colored vertices has at most the same chromatic number as the original graph, $\chi$ colors are sufficient. The remaining graph is also chordal since removing vertices does not change the chordal property; hence finding a coloring that uses $\chi$ colors can be done efficiently. This algorithm is given in Algorithm 1.
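The modified greedy procedure described above can be sketched as follows. This is not the authors' Algorithm 1 verbatim but a minimal illustration under stated assumptions: the maximum weighted independent set is found by brute force (tiny graphs only), the chromatic number `chi` is supplied by the caller, and the leftover vertices are colored greedily in maximum cardinality search order, which uses at most `chi` colors when the graph is chordal. All function names and the example are hypothetical.

```python
from itertools import combinations


def labels_by_weight(m):
    """All length-m binary labels, lowest Hamming weight first."""
    return [tuple(int(i in ones) for i in range(m))
            for k in range(m + 1) for ones in combinations(range(m), k)]


def brute_force_mwis(adj, cost, alive):
    """Maximum weighted independent set among `alive`; tiny graphs only."""
    best, best_w = set(), 0
    vs = sorted(alive)
    for r in range(1, len(vs) + 1):
        for cand in combinations(vs, r):
            if all(v not in adj[u] for u, v in combinations(cand, 2)):
                w = sum(cost[v] for v in cand)
                if w > best_w:
                    best, best_w = set(cand), w
    return best


def mcs_order(adj, alive):
    """Maximum cardinality search restricted to the vertices in `alive`."""
    weight = {v: 0 for v in sorted(alive)}
    order = []
    while weight:
        v = max(weight, key=weight.get)
        order.append(v)
        del weight[v]
        for u in adj[v]:
            if u in weight:
                weight[u] += 1
    return order


def greedy_design(adj, cost, m, chi):
    """Return (label per vertex, total cost) for at most m interventions."""
    labels = labels_by_weight(m)
    assert len(labels) >= chi, "need 2**m >= chi"
    n_greedy = len(labels) - chi          # keep chi labels in reserve
    alive, assigned = set(adj), {}
    for lab in labels[:n_greedy]:         # cheapest labels go to heavy sets
        if not alive:
            break
        ind = brute_force_mwis(adj, cost, alive)
        for v in ind:
            assigned[v] = lab
        alive -= ind
    reserve = labels[n_greedy:]
    for v in mcs_order(adj, alive):       # color the rest with reserved labels
        used = {assigned[u] for u in adj[v] if u in assigned}
        assigned[v] = next(l for l in reserve if l not in used)
    # A vertex is intervened on once per 1 in its label.
    total = sum(cost[v] * sum(assigned[v]) for v in assigned)
    return assigned, total


# Hypothetical example: triangle a, b, c with a pendant vertex d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cost = {"a": 1, "b": 2, "c": 3, "d": 4}
assigned, total = greedy_design(adj, cost, m=2, chi=3)
```

In this example the all-zero label goes to the heaviest independent set, so those vertices are never intervened on, and the remaining two vertices are each intervened on once.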
We can improve our greedy algorithm when the graph is an interval graph; interval graphs form a strict subclass of the chordal graphs. Note that there are $\binom{m}{k}$ binary labels of length $m$ with weight $k$. When we use these vectors as the coloring labels, the corresponding intervention design requires every variable with these colors to be intervened on exactly $k$ times in total. Then, rather than finding the maximum independent set at iteration $k$, we can find the maximum weighted $\binom{m}{k}$-colorable subgraph, and use all the coloring labels of weight $k$. The cost of the colored vertices in the intervention design is $k$ times their total cost. We expect this to create a better coloring in terms of the total cost, since it colors a larger portion of the graph at each step. Finding the maximum weighted $k$-colorable subgraph is hard for non-constant $k$ in chordal graphs; however, it can be solved in polynomial time if the graph is an interval graph [31]. This modified algorithm is given in Algorithm 2. Notice that when $m = \omega(\log n)$, the number of possible coloring labels $2^m$ is super-polynomial in $n$, which seems to make the algorithms run in super-polynomial time. However, when $2^m > n$, we only need the first $n$ color labels with the lowest weight, since a proper coloring on a graph with $n$ vertices can use at most $n$ colors in total.
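The counting argument in this paragraph can be illustrated with a short Python sketch that enumerates labels lazily in order of increasing weight, so that only the cheapest labels are ever materialized. The function names are hypothetical and the parameter values are chosen only for illustration.

```python
from itertools import combinations, islice
from math import comb


def labels_by_weight(m):
    """Lazily yield length-m binary labels in order of increasing weight."""
    for k in range(m + 1):
        for ones in combinations(range(m), k):
            ones = set(ones)
            yield tuple(int(i in ones) for i in range(m))


def cheapest_labels(m, n):
    """First n labels by weight; never materializes all 2**m labels."""
    return list(islice(labels_by_weight(m), n))


# With m = 40 there are 2**40 labels, but a 5-vertex graph needs only 5.
labels = cheapest_labels(40, 5)
weights = [sum(label) for label in labels]

# For small m, the number of labels of each weight matches comb(m, k).
counts = [sum(1 for l in labels_by_weight(4) if sum(l) == k) for k in range(5)]
```

Because a label of weight $k$ costs $k$ interventions per vertex, taking labels in this order assigns the cheapest labels first.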
7 Experiments
In this section, we test our greedy algorithm for constructing intervention designs over randomly sampled chordal graphs. We follow the sampling scheme proposed by [27] (see the supplementary material for details). The costs of the vertices of the graph are selected as i.i.d. samples of an exponential random variable with mean 1. The total cost of all variables is then, in expectation, the same as the number of variables $n$. We normalize the cost incurred by our algorithm by $n$ and compare this normalized cost for different regimes. A sparsity parameter determines the sparsity of the graph: Graphs with a larger value of this parameter are expected to have more edges. See the supplementary material for the details of how the parameter affects the probability of an edge. We limit the simulation to at most 10 experiments (x-axis) and observe the effect of changing the number of variables and the sparsity parameter.
Our intervention design algorithm, Algorithm 1, requires a subroutine that finds the maximum weighted independent set of a given chordal graph. For this purpose, we implement the linear-time algorithm by Frank [6]. For the details of our implementation of Frank’s algorithm, see the supplementary material.
We observe that the main factor that determines the average incurred cost is the sparsity of the graph, i.e., the number of edges compared to the number of nodes. For a fixed number of variables, reducing the sparsity parameter results in a smaller average cost by increasing the sparsity of the graph. For a fixed sparsity parameter, increasing the number of variables reduces the sparsity, which is also shown to reduce the average cost incurred by the greedy intervention design. See the supplementary material for additional simulations where the costs are chosen as i.i.d. samples of a uniform random variable over a bounded interval.
References
 [1] Krzysztof Chalupka, Tobias Bischoff, Pietro Perona, and Frederick Eberhardt. Unsupervised discovery of el nino using causal feature learning on microlevel climate data. In Proc. of UAI’16, 2016.
 [2] Diego Delle Donne and Javier Marenco. Polyhedral studies of vertex coloring problems: The standard formulation. Discrete Optimization, 21:1–13, 2016.

 [3] Frederick Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI), pages 178–184, 2005.
 [4] Frederick Eberhardt. Causation and Intervention. Ph.D. thesis, 2007.
 [5] Jalal Etesami and Negar Kiyavash. Discovering influence structure. In IEEE ISIT, 2016.
 [6] András Frank. Some polynomial algorithms for certain graphs and hypergraphs. In Proc. of the Fifth British Combinatorial Conference, Congressus Numerantium XV, 1975.

 [7] Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Causal strength via shannon capacity: Axioms, estimators and applications. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
 [8] Clive W. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.
 [9] Moritz Grosse-Wentrup, Dominik Janzing, Markus Siegel, and Bernhard Schölkopf. Identification of causal relations in neuroimaging data with latent confounders: An instrumental variable approach. NeuroImage (Elsevier), 125:825–833, 2016.
 [10] Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13(1):2409–2464, 2012.

 [11] Alain Hauser and Peter Bühlmann. Two optimal strategies for active learning of causal networks from interventional data. In Proceedings of Sixth European Workshop on Probabilistic Graphical Models, 2012.
 [12] Patrik O Hoyer, Dominik Janzing, Joris Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Proceedings of NIPS 2008, 2008.
 [13] Antti Hyttinen, Frederick Eberhardt, and Patrik Hoyer. Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071, 2013.
 [14] Dominik Janzing, Joris Mooij, Kun Zhang, Jan Lemeire, Jakob Zscheischler, Povilas Daniušis, Bastian Steudel, and Bernhard Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182-183:1–31, 2012.
 [15] Murat Kocaoglu, Alexandros G. Dimakis, Sriram Vishwanath, and Babak Hassibi. Entropic causal inference. In AAAI’17, 2017.
 [16] Ioannis Kontoyiannis and Maria Skoularidou. Estimating the directed information and testing for causality. IEEE Trans. Inf. Theory, 62:6053–6067, Aug. 2016.
 [17] David Lopez-Paz, Krikamol Muandet, Bernhard Schölkopf, and Ilya Tolstikhin. Towards a learning theory of cause-effect inference. In Proceedings of ICML 2015, 2015.
 [18] Cai Mao-Cheng. On separating systems of graphs. Discrete Mathematics, 49:15–20, 1984.
 [19] Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the eleventh international conference on uncertainty in artificial intelligence, 1995.
 [20] Christopher Meek. Strong completeness and faithfulness in bayesian networks. In Proceedings of the eleventh international conference on uncertainty in artificial intelligence, 1995.
 [21] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.
 [22] Jonas Peters and Peter Bühlmann. Identifiability of gaussian structural equation models with equal error variances. Biometrika, 101:219–228, 2014.

 [23] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78:947–1012, 2016.
 [24] Christopher Quinn, Negar Kiyavash, and Todd Coleman. Directed information graphs. IEEE Trans. Inf. Theory, 61:6887–6909, Dec. 2015.
 [25] Maxim Raginsky. Directed information and pearl’s causal calculus. In Proc. 49th Annual Allerton Conf. on Communication, Control and Computing, 2011.
 [26] Joseph D. Ramsey, Stephen José Hanson, Catherine Hanson, Yaroslav O. Halchenko, Russell Poldrack, and Clark Glymour. Six problems for causal inference from fmri. NeuroImage (Elsevier), 49:1545–1558, 2010.
 [27] Karthikeyan Shanmugam, Murat Kocaoglu, Alex Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. In NIPS 2015, 2015.
 [28] S Shimizu, P. O Hoyer, A Hyvarinen, and A. J Kerminen. A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.
 [29] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. A Bradford Book, 2001.
 [30] Caroline Uhler, Garvesh Raskutti, Peter Bühlmann, and Bin Yu. Geometry of the faithfulness assumption in causal inference. Annals of Statistics, 41:436–463, 2013.
 [31] Mihalis Yannakakis and Fanica Gavril. The maximum k-colorable subgraph problem for chordal graphs. Information Processing Letters, 24:133–137, 1987.
 [32] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. The principle of maximum causal entropy for estimating interacting processes. IEEE Transactions on Information Theory, 59:1966 – 1980, 2013.
8 Appendix
Proof of Theorem 1
One direction is trivial: Consider a graph separating system. For every edge, there is an intervention in which exactly one endpoint is intervened on. This edge is therefore in the cut and is learned. Constraints are over the subsets of the graph separating system, which directly correspond to interventions. Hence the interventions also obey the same constraint. For the other direction, we use the following observation from [27], which is implicit in the proof of Theorem 6 in [27].
Lemma 2.
Let $G$ be an undirected chordal graph. Consider any clique $C$ of $G$. There is a directed graph $D$ with skeleton $G$ and no immoralities such that the vertices of $C$ come before any other vertices in the partial order that determines $D$. If $D$ is the underlying causal graph, knowing the causal edges outside this clique does not help identify any edges within the clique.
The lemma essentially states that Meek rules do not aid in identifying the edges within a clique if the clique vertices come before any other vertex in the partial order of the underlying causal DAG.
Assume that there is an edge that is not separated by the set of interventions. If the underlying causal DAG has a partial order that starts with the nodes at the endpoints of this edge, then, by Lemma 2, knowing every other edge does not help learn the direction of this edge (notice that an edge is a clique of size 2). Thus this set of interventions cannot learn every causal graph with the given skeleton.
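The separation condition used in this proof can be checked mechanically. The following is a minimal sketch, assuming a graph given as a list of edges and interventions given as Python sets of vertices; the function name `unseparated_edges` is hypothetical and not part of the paper.

```python
from typing import Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def unseparated_edges(edges: Iterable[Edge],
                      interventions: List[Set[Hashable]]) -> List[Edge]:
    """Return the edges that no intervention separates.

    An intervention separates an edge (u, v) when it contains exactly one
    of the two endpoints. The interventions form a graph separating system
    exactly when the returned list is empty.
    """
    missing = []
    for u, v in edges:
        if not any((u in s) != (v in s) for s in interventions):
            missing.append((u, v))
    return missing


# Triangle on {0, 1, 2}: intervening on {0} alone leaves edge (1, 2) unseparated.
triangle = [(0, 1), (1, 2), (0, 2)]
print(unseparated_edges(triangle, [{0}]))       # [(1, 2)]
print(unseparated_edges(triangle, [{0}, {1}]))  # []
```

By the argument above, any edge reported by such a check is one whose direction cannot be guaranteed to be learned for every causal graph with the given skeleton.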
Proof of Theorem 2
Consider the graph separating system matrix : Let be a 0-1 matrix, where . Since every set of interventions must be a graph separating system by Theorem 1, we can work with the corresponding graph separating system matrices. Notice that any graph separating system corresponds to some proper coloring due to 1. Thus, any set of vertices that has identical rows in should be within the same color class. We know that in any proper coloring, each color class is an independent set. Then, over all proper colorings, the color class with maximum weight is given by the maximum weighted independent set. Since each row of is either the all-zero vector or contains at least a single 1, the total cost is minimized by assigning the all-zero vector to the vertices belonging to the maximum weighted independent set, and using distinct weight-1 vectors for the remaining rows. The induced graph on the vertices outside the maximum weighted independent set is still chordal and has the chromatic number at most . Thus, we need an matrix , hence experiments in total to minimize the total intervention cost.
Proof of Theorem 3
In this section, we show that the total cost of the interventions constructed from a given graph coloring can be written as a linear objective in terms of the coloring variables.
First, we illustrate the cost incurred by a given separating system. Consider the color separating system in Figure 1(b). Notice that the rows of that correspond to vertices within a fixed color class are the same. For example is a color class, and both rows are . Recall that the columns where a particular row is 1 indicate the interventions which contain that variable. The cost incurred by any vertex is the number of times the vertex is intervened on times the cost of intervening on that vertex. The cost incurred by a set of vertices is the sum of the cost incurred by each vertex within the set. Vertices within a color class are intervened on the same number of times since they have the same rows in the separating system matrix . Thus, the cost of a color class is given by , where is the row of any node from color class in , and is the number of 1s in .
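The cost accounting in this paragraph can be illustrated with a small sketch. The vertex names, costs, coloring, and labels below are made-up example data, and the helper names are hypothetical; the sketch only shows that summing costs per intervention agrees with summing (number of 1s in the label) times (cost of the color class).

```python
def interventions_from_labels(coloring, labels, num_interventions):
    """Intervention j contains every vertex whose color label has a 1 in position j."""
    return [{v for v, c in coloring.items() if labels[c][j] == 1}
            for j in range(num_interventions)]


def design_cost(costs, interventions):
    """Each vertex pays its cost once for every intervention that contains it."""
    return sum(costs[v] for s in interventions for v in s)


def class_cost(costs, coloring, labels):
    """Same total, computed as label weight times vertex cost, summed over vertices."""
    return sum(sum(labels[c]) * costs[v] for v, c in coloring.items())


# Hypothetical example: four vertices, three colors, two interventions.
costs = {"a": 2.0, "b": 1.0, "c": 4.0, "d": 0.5}
coloring = {"a": 0, "b": 1, "c": 0, "d": 2}
labels = {0: (0, 0), 1: (1, 0), 2: (0, 1)}

interventions = interventions_from_labels(coloring, labels, 2)
print(interventions)                        # [{'b'}, {'d'}]
print(design_cost(costs, interventions))    # 1.5
print(class_cost(costs, coloring, labels))  # 1.5
```

In this example the most expensive color class receives the all-zero label, so its vertices are never intervened on and contribute nothing to the total.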
Notice that the exact labeling of rows does not matter for the separating system: We only need vertices with different colors to correspond to different rows. Since the cost of a color class is proportional to the number of 1s in its row vector, an optimum graph separating system for a given coloring should assign vectors with smaller weight whenever possible, in order to minimize the total cost. Hence, in Figure 1(b), instead of assigning as the characteristic vector of , we can assign without affecting the separating system property. Since 3 colors are sufficient, we do not need to use vector.
In general, given a number of interventions , we need to construct a set of coloring labels to assign to each color. Suppose the causal graph has variables. If , then every length binary vector should be available, since the number of colors can be up to . If , using all labels gives more colors than we can use when searching over all proper colorings. Hence, in this case, we choose the labels with the smallest weight until we have coloring labels. This ensures that the integer programming formulation does not have exponentially many variables, even when the number of interventions is allowed to be . Thus we construct a vector, to be used as the weights of the color labels, as follows:
(2) 
where is such that and . appears times if and times if . For notational convenience, let .
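The construction of the label weights described above can be sketched in a few lines. This is only an illustration under assumed notation: `num_interventions` is the label length, `num_vertices` caps the number of labels, and the function name `label_weights` is hypothetical. It lists label weights in nondecreasing order, using that there are `comb(m, t)` binary labels of length `m` with exactly `t` ones.

```python
from math import comb


def label_weights(num_interventions: int, num_vertices: int) -> list:
    """Weights of the lowest-weight binary labels, at most num_vertices of them.

    Weight t is repeated comb(num_interventions, t) times, except that the
    list is truncated once num_vertices labels have been collected.
    """
    weights = []
    for t in range(num_interventions + 1):
        remaining = num_vertices - len(weights)
        if remaining <= 0:
            break
        weights.extend([t] * min(comb(num_interventions, t), remaining))
    return weights


print(label_weights(3, 5))   # [0, 1, 1, 1, 2]
print(label_weights(2, 10))  # [0, 1, 1, 2]
```

In the first call only five of the eight length-3 labels are kept, so the list stops inside the weight-2 group; in the second call all four length-2 labels are available because the graph has more vertices than labels.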
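The standard coloring formulation assigns a binary variable $x_{v,c}$ to every node $v$ and color $c$: $x_{v,c} = 1$ if node $v$ is colored with color $c$, and 0 otherwise. Each vertex is assigned a single color, i.e., $\sum_{c} x_{v,c} = 1$. Every pair of adjacent vertices is assigned different colors, which can be captured by $x_{u,c} + x_{v,c} \le 1$ for every edge $(u,v)$ and every color $c$. Then, using this standard coloring formulation, and writing $w_v$ for the cost of intervening on vertex $v$ and $t_c$ for the weight of the $c$-th coloring label in (2), we can write our optimization problem as follows:
$$
\begin{aligned}
\min_{x} \quad & \sum_{v \in V} \sum_{c} w_v \, t_c \, x_{v,c} \qquad\qquad (3) \\
\text{s.t.} \quad & \sum_{c} x_{v,c} = 1 \quad \forall v \in V, \\
& x_{u,c} + x_{v,c} \le 1 \quad \forall (u,v) \in E, \ \forall c, \\
& x_{v,c} \in \{0, 1\} \quad \forall v \in V, \ \forall c.
\end{aligned}
$$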
Uniquely Colorable Graphs
Next, we give a special case, which admits a simple solution without restricting the graph class. Suppose is uniquely colorable, where is the maximum number of interventions we are allowed to use. Then there is only a single coloring up to permutations of colors. Hence the costs of color classes are fixed. Now we can simply sort the color classes in the order of decreasing cost, and assign row vectors of to these color classes in the order of increasing number of 1s. This assures that the total cost of interventions is minimized.
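The sorting argument for the uniquely colorable case can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the color class costs are made-up data, and `lowest_weight_labels` and `assign_labels` are hypothetical helper names. Labels are enumerated by increasing number of 1s and handed to color classes in order of decreasing cost.

```python
from itertools import combinations


def lowest_weight_labels(m: int, count: int) -> list:
    """The `count` binary labels of length m with the fewest 1s (fewer if 2**m < count)."""
    labels = []
    for t in range(m + 1):
        for ones in combinations(range(m), t):
            if len(labels) == count:
                return labels
            labels.append(tuple(1 if j in ones else 0 for j in range(m)))
    return labels


def assign_labels(class_costs: dict, m: int) -> dict:
    """Give the costliest color class the lightest label, and so on."""
    order = sorted(class_costs, key=class_costs.get, reverse=True)
    labels = lowest_weight_labels(m, len(order))
    if len(labels) < len(order):
        raise ValueError("not enough distinct labels for the number of colors")
    return dict(zip(order, labels))


# Hypothetical costs of three color classes, with m = 2 interventions.
class_costs = {"red": 1.0, "green": 5.0, "blue": 2.0}
assignment = assign_labels(class_costs, 2)
total = sum(sum(assignment[c]) * class_costs[c] for c in class_costs)
print(assignment)  # {'green': (0, 0), 'blue': (1, 0), 'red': (0, 1)}
print(total)       # 3.0
```

Pairing the largest class cost with the smallest label weight is the usual rearrangement argument: swapping any two labels in this assignment cannot decrease the total.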
Implementation Details
First, we need to define a perfect elimination ordering:
Definition 4.
A perfect elimination ordering (PEO) on the vertices of an undirected chordal graph is such that for all , the induced neighborhood of on the subgraph formed by is a clique.
It is known that an undirected graph is chordal if and only if it has a perfect elimination ordering. We use this fact to generate chordal graphs based on a randomly chosen perfect elimination ordering: First we choose a random permutation to be the perfect elimination ordering for the chordal graph. Then the vertex is connected to each node in with respect to the PEO independently randomly with probability . A random vertex from is chosen to be a parent of with probability 1 to keep the graph connected. The parent set are connected to each other to assure the ordering is a PEO.
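A sampling scheme in the spirit of the description above can be sketched as follows. Several details are assumptions made for this sketch: the edge probability is a plain argument `p`, each vertex is connected to vertices that come later in the ordering, and the later neighborhood is completed to a clique at the moment the vertex is processed. The function names are hypothetical.

```python
import random


def random_chordal_graph(n: int, p: float, seed=None):
    """Sample a connected chordal graph together with a perfect elimination ordering.

    Returns (order, adj), where adj maps each vertex to its set of neighbors.
    """
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    adj = {v: set() for v in range(n)}
    for i, v in enumerate(order[:-1]):
        later = order[i + 1:]
        chosen = {u for u in later if rng.random() < p}
        chosen.add(rng.choice(later))  # one forced neighbor keeps the graph connected
        # Later neighbors of v: newly chosen ones plus those added by earlier steps.
        nbrs = chosen | (adj[v] & set(later))
        for u in nbrs:
            adj[v].add(u)
            adj[u].add(v)
        # Complete the later neighborhood to a clique so the ordering stays a PEO.
        for a in nbrs:
            for b in nbrs:
                if a != b:
                    adj[a].add(b)
    return order, adj


def is_peo(order, adj) -> bool:
    """Check that every vertex's later neighbors form a clique."""
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        later = [u for u in adj[v] if pos[u] > pos[v]]
        if any(b not in adj[a] for a in later for b in later if a != b):
            return False
    return True


order, adj = random_chordal_graph(8, 0.3, seed=0)
print(is_peo(order, adj))  # True
```

Processing the vertices in elimination order matters here: once a vertex has been handled, later steps only add edges among vertices that come after it, so its later neighborhood remains a clique.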
Frank’s Algorithm
Consider a PEO . At step , skip the vertex if it has weight . Otherwise, mark it red and reduce the weight of all its neighbors that are before in the PEO by , and set . After steps, we have a set of vertices colored red. Parse this set in the order of and convert a red vertex to blue if it does not have any neighbor in which is already colored blue. [6] proves that this algorithm outputs the maximum weighted independent set.
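The two passes of the procedure above can be sketched as follows. The sketch assumes an ordering in which each vertex's neighbors that come later in the list form a clique, so the weight reduction is applied to later neighbors and the second pass runs in reverse; under the opposite ordering convention the directions flip. The function name and the example graph are hypothetical, and a vertex is skipped when its remaining weight is not positive.

```python
def max_weight_independent_set(order, adj, weights):
    """Two-pass red/blue procedure for a chordal graph with a perfect elimination ordering.

    order: list of vertices; adj: dict vertex -> set of neighbors;
    weights: dict vertex -> nonnegative weight.
    """
    pos = {v: i for i, v in enumerate(order)}
    residual = dict(weights)
    red = []
    # First pass: mark vertices red and charge their weight to later neighbors.
    for v in order:
        if residual[v] <= 0:
            continue
        red.append(v)
        for u in adj[v]:
            if pos[u] > pos[v]:
                residual[u] -= residual[v]
        residual[v] = 0
    # Second pass: scan red vertices in reverse, keeping those with no blue neighbor.
    blue = set()
    for v in reversed(red):
        if not (adj[v] & blue):
            blue.add(v)
    return blue


# Path 0 - 1 - 2 is chordal, and [0, 1, 2] is a perfect elimination ordering of it.
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(sorted(max_weight_independent_set([0, 1, 2], path, {0: 2, 1: 3, 2: 2})))  # [0, 2]
print(sorted(max_weight_independent_set([0, 1, 2], path, {0: 1, 1: 3, 2: 1})))  # [1]
```

With weights 2, 3, 2 the two endpoints together outweigh the middle vertex, while with weights 1, 3, 1 the middle vertex alone is heavier, and the procedure returns the corresponding set in each case.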
Additional Simulations
In this section we provide additional simulations for the case where the graph weights are uniformly distributed. The results are given in Figure 3. Similar to the case of exponentially distributed weights, the main factor determining the cost is the graph sparsity, which is captured by the sparsity parameter.