1 Introduction
Directed acyclic graphs (DAGs) are popular models for capturing causal relationships among a set of variables. This approach has found important applications in various areas including biology, epidemiology and sociology (Gangl, 2010; Lagani et al., 2016). A central problem in these applications is to learn the causal DAG from observations on the nodes. A popular approach is to infer missing edges based on conditional independence information that is learned from the data (Spirtes et al., 2000; Kalisch and Bühlmann, 2007). However, multiple DAGs can encode the same set of conditional independences. Hence in general the causal DAG can only be learned up to a Markov equivalence class (MEC) and interventional data is needed in order to identify the causal DAG.
While an MEC may contain a super-exponential number of candidate DAGs, Gillispie and Perlman (2001) showed, by enumerating all MECs on up to 10 nodes, that for such small graphs an MEC contains about four DAGs on average and about a quarter of all MECs consist of a unique DAG. Generalizing these results to larger graphs is critical for estimating the average number of interventional experiments needed for identifying the underlying causal DAG. More generally, given the recent rise in interventional data in genomics enabled by genome editing technologies (Xiao et al., 2015), it is of great interest to understand the average reduction in the size of MECs through the availability of interventional data, i.e., to characterize the average size of an interventional Markov equivalence class (IMEC). Further, such an analysis would also shed light on the number of additional interventions needed to uniquely identify the underlying causal DAG, moving away from worst-case bounds.
The problem of characterizing the size of an MEC or IMEC is not only of interest for experimental design of interventions but also from an algorithmic perspective. A popular approach to causal inference is given by score-based methods that assign a score such as the Bayesian Information Criterion (BIC) to each DAG or MEC and greedily optimize over the space of DAGs (Castelo and Kocka, 2003), a combination of permutations and undirected graphs (Teyssier and Koller, 2012; Raskutti and Uhler, 2018; Solus et al., 2017; Mohammadi et al., 2018) or MECs (Meek, 1997; Brenner and Sontag, 2013). Similar score-based approaches have also been developed in the interventional setting (Hauser and Bühlmann, 2012a; Wang et al., 2017; Yang et al., 2018). While a greedy step in the space of graphs can easily be defined (addition, removal or flipping of an edge), a greedy step in the space of Markov equivalence classes is complicated (Meek, 1997). Hence performing a greedy algorithm in the space of MECs only makes sense if the space of MECs is significantly smaller than the space of DAGs. For instance, showing that typically occurring MECs or IMECs are small would imply that graph-based search procedures operate on a similar search space as the ones that use MECs, but can do so using simpler moves.
Motivated by these considerations, in this work, we initiate the study of interventional and observational MECs for random DAG models. We focus on random order DAGs, where the skeleton is a random Erdős-Rényi graph with constant density and the ordering is a random permutation. We derive tight bounds for the asymptotic versions of various metrics on the IMECs. More specifically, our contributions are as follows:


We derive tight upper and lower bounds on (a) the asymptotic expected number of unoriented edges in an IMEC given data from interventions; (b) the asymptotic probability that the IMEC is a unique DAG given data from interventions; (c) the asymptotic number of additional interventions needed to fully discover the DAG given data from interventions; and (d) the asymptotic expected size of the IMEC given data from interventions.
We also provide tight bounds for the number of unoriented edges in the IMEC when interventions have been performed using different algorithms for choosing the interventions given the observational MEC as input.

If is the metric of interest of a random order DAG of size and interventions, then our bounds are of the following form: . Here, is the limiting asymptotic metric, which we show is well defined and exists. We also show that decays exponentially fast in for constant density .

We numerically compute through Monte Carlo simulations for as large as at which point is a small constant for various parameter regimes.

One of the surprising results is that for constant density random order DAGs, all the above metrics tend asymptotically to a constant. Through a combination of analysis of our bounds and numerical computation, we can characterize these constants precisely.

As an example of the nature of our results, quite surprisingly, the asymptotic (as ) expected observational MEC size of a random order DAG with density is at most with probability at least (see Theorem 14).
All omitted proofs can be found in the supplemental material.
Related Work: There is currently only limited work available on counting and characterizing MECs. In (Gillispie and Perlman, 2001), the authors enumerated all MECs on DAGs with nodes and analyzed the total number of MECs, the average size of an MEC, and the proportion of MECs of size one on nodes. Motivated by this work, Gillispie (2006), Steinsky (2003), and Wagner (2013) provided formulas for counting MECs of a specific size. Supplementing this line of work, He and Yu (2016) developed various methods for counting the size of a given MEC. Finally, Radhakrishnan et al. (2017) addressed these enumerative questions using a pair of generating functions that encode the number and size of MECs for DAGs with a fixed skeleton (i.e. underlying undirected graph) and also applied these results to derive bounds on the MECs for various families of DAGs on trees (Radhakrishnan et al., 2018).
Another line of work (Hu et al., 2014; Hauser and Bühlmann, 2012b; Shanmugam et al., 2015; Eberhardt et al., 2012; Hyttinen et al., 2013; Kocaoglu et al., 2017) aims at characterizing the number of interventions required to learn a causal DAG completely. While some of these works deal with the active learning setting (Shanmugam et al., 2015; Hauser and Bühlmann, 2012b), others choose interventions non-adaptively given the observational MEC (Hu et al., 2014; Eberhardt et al., 2012; Hyttinen et al., 2013; Kocaoglu et al., 2017; Bello and Honorio, 2017) and hence are concerned with the worst-case scenario.
2 Preliminaries and Definitions
In this work, we characterize the asymptotic behavior of different metrics that capture the amount of “causal relationships” which can be inferred from observational and interventional data on random DAG models. In this section, we describe the random order DAG model, briefly review causal DAG models and Markov equivalence, and introduce the metrics that we will analyze in this work.
2.1 Random Order DAG Model
Let be a directed acyclic graph (DAG) with vertices and directed edges . A random order DAG with density on vertices is a DAG whose skeleton (i.e., underlying undirected graph) is given by an Erdős-Rényi graph on vertices with edge probability and whose edges are oriented according to a total ordering which is uniformly sampled among all permutations of vertices. We denote a graph sampled from this model by .
Remark: Our sampling procedure is a standard one used for testing causal inference algorithms. It is for example used in the well-known pcalg R package (https://rdrr.io/rforge/pcalg/man/randomDAG.html). A different sampling scheme would be to sample DAGs uniformly at random from all DAGs in which isomorphic DAGs would not be double counted. However, such a sampling scheme is difficult to perform in practice, while ours has a generative model that is easy and intuitive. Limited prior computational evidence in the observational setting suggests that the two sampling schemes behave similarly (Gillispie and Perlman, 2001).
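The sampling procedure of Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the pcalg implementation; the function name `sample_random_order_dag` and the parameter name `rho` for the edge probability are assumptions made for this example.

```python
import random


def sample_random_order_dag(n, rho, seed=None):
    """Sample a random order DAG on vertices 0..n-1.

    Illustrative sketch: `rho` stands for the edge probability
    (density) of the Erdős-Rényi skeleton.
    """
    rng = random.Random(seed)
    # Uniformly random total ordering of the vertices.
    order = list(range(n))
    rng.shuffle(order)
    position = {v: i for i, v in enumerate(order)}

    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            # Skeleton: keep each pair independently with probability rho.
            if rng.random() < rho:
                # Orient from the earlier to the later vertex in the ordering.
                if position[u] < position[v]:
                    edges.add((u, v))
                else:
                    edges.add((v, u))
    return order, edges
```

For example, `sample_random_order_dag(10, 0.5, seed=0)` returns an ordering of 10 vertices together with a set of directed edges that all point forward in that ordering, so the result is acyclic by construction.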
2.2 Markov Equivalence
The MEC of a DAG can be uniquely represented by a partially directed graph known as the essential graph of . The skeleton of is the same as the skeleton of and the directed edges in are precisely those edges in that have the same orientation in all members of the MEC of . All other edges in are unoriented (Hauser and Bühlmann, 2012a). The following procedure provides all directed edges in :


For every triple of nodes if and are disconnected in
and the ordered pairs
, then both edges and are also oriented in .
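The triple rule above can be sketched as follows, assuming its standard reading: two non-adjacent vertices that both point into a common child form a v-structure, and both of those edges are oriented in the essential graph. The function name `v_structure_edges` and the edge-set representation are hypothetical. The sketch finds only the v-structure edges and does not apply any further orientation rules such as the Meek rules.

```python
def v_structure_edges(n, edges):
    """Return the edges of a DAG that take part in a v-structure.

    `edges` is a set of pairs (u, v) meaning u -> v on vertices 0..n-1.
    A v-structure is i -> j <- k where i and k are not adjacent.
    """
    parents = {v: set() for v in range(n)}
    for u, v in edges:
        parents[v].add(u)
    # Adjacency in the skeleton, ignoring direction.
    adjacent = {frozenset(e) for e in edges}

    oriented = set()
    for j in range(n):
        ps = sorted(parents[j])
        for a in range(len(ps)):
            for b in range(a + 1, len(ps)):
                i, k = ps[a], ps[b]
                if frozenset((i, k)) not in adjacent:
                    oriented.add((i, j))
                    oriented.add((k, j))
    return oriented
```

On the collider `{(0, 2), (1, 2)}` both edges are returned, whereas adding the edge `(0, 1)` shields the collider and nothing is returned.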
2.3 Interventional Markov Equivalence
Let and consider the set of single node interventional distributions , where node is set to some constant. Since in , node (a constant) is independent of its parents , the intervention introduces conditional independences in addition to those present in . Let denote the intervened DAG obtained by deleting the edges from to . If is Markov with respect to , then is Markov with respect to . Two DAGs and are in the same interventional Markov equivalence class (IMEC) if and only if and are in the same MEC for all (Hauser and Bühlmann, 2012a).
As in the purely observational setting, an IMEC can be uniquely represented by an interventional essential graph denoted by . The skeleton of is the same as the skeleton of and the directed edges in are precisely those edges in that have the same orientation in all members of the IMEC of . All other edges in are unoriented. The following procedure provides all directed edges in :


For every triple of nodes with and if and are disconnected in , then both edges and are also oriented in .

For every edge such that either or , then is oriented.
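The second rule can be illustrated with a short sketch, under the assumption that it reads: every edge with at least one endpoint in the intervention set is oriented. The function name is hypothetical, and the sketch covers only this rule, not the triple rule or any subsequent propagation of orientations.

```python
def edges_oriented_by_interventions(edges, interventions):
    """Return the edges that touch at least one intervened vertex.

    `edges` is a set of pairs (u, v) meaning u -> v, and
    `interventions` is an iterable of single-node intervention targets.
    """
    targets = set(interventions)
    return {(u, v) for (u, v) in edges if u in targets or v in targets}
```

For the chain `{(0, 1), (1, 2), (2, 3)}`, intervening on vertex 1 directly orients `(0, 1)` and `(1, 2)` and leaves `(2, 3)` to the remaining rules.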
2.4 Metrics of Interest
Suppose that the causal Bayesian network that generates data (both interventional and observational) is an order DAG . Let be an associated family of interventional distributions compatible with . In this setting, our work asymptotically characterizes some metrics that reflect identifiable portions of from an observational distribution and possibly also interventional distributions.
We denote by uEss an essential graph that is also a DAG, i.e., an essential graph representing an MEC consisting of a unique DAG. Such DAGs are of particular interest since they are identifiable from purely observational data.
In the following, we will measure the degree of identifiability of a random DAG using the following metrics:


Let be the number of unoriented edges in . We show that exists.

Let be an indicator variable that is 1 only if is a DAG. Similarly, the limit is denoted .

Let be the number of single node interventions required to fully orient . Similarly, the limit is denoted .

Let be the size of the (observational) MEC of . The limit is denoted .

Let be the minimum number of unoriented edges in optimized over all . The limit is denoted .

Let be an indicator variable that is 1 when . The limit is denoted .

Let be the size of the interventional Markov equivalence class when the interventions in the set are performed on , where minimizes the number of unoriented edges in optimized over all . This limit is denoted .
3 Main Results
We first describe the nature of our results and the approach taken for obtaining these results for . The results for all other metrics follow using a similar approach, although the technical details differ depending on the metric of interest. We show that and we provide an explicit expression for . As a consequence, tight upper and lower bounds can be constructed on the quantities of interest by numerically computing using Monte Carlo simulations by generating random order DAGs for large and averaging.
Formally, we state the main result in our work about the asymptotic quantities of various metrics.
Theorem 1.
We have the following inequalities satisfied by various metrics:
for all . Here, is defined as follows:
(1) 
where and is the edge probability when sampling an order DAG.
We establish the main result on upper and lower bounds through intermediate results as follows (explained taking the example of ): a) We first exhibit a coupling between and such that their respective marginal distributions are preserved. This is done in Section 3.1. b) Using the properties of this specific coupling, we first show that is a monotonic sequence in in Section 3.1.1. c) The expression for is obtained by upper bounding the successive differences again using the properties of order DAG sampling and the coupling. This is explained in Sections 3.1.2 and 3.1.3. Other sections provide additional results on IMECs obtained through other interventional design algorithms along with numerical and simulation results.
3.1 Probability coupling
In this section, we provide a coupling argument between the distribution of and such that ‘unorientability’ properties of certain edges are preserved.
For all , let be a binary random variable that is 1 with probability . Let be the DAG with nodes and directed edges between if and only if .
Observation 1.
with permutation , has the distribution of a random order DAG on vertices with density .
Remark: Observation 1 says that randomly sampling a symmetric adjacency matrix (undirected graph with edge probability ), permuting rows and columns with a random permutation, and then taking the upper triangular part (orienting the graph according to the permutation) is the same as fixing the permutation to 1, 2, ..., n and populating the upper triangular part randomly.
Coupling: Motivated by the above observation, we couple and as follows. We first generate for as above and use that to orient . Then, we generate additional random variables for all and orient the edges incident to accordingly.
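The coupling can be sketched as a sequence of nested DAGs grown one vertex at a time. This is a minimal illustration under two assumptions: the ordering is fixed to the identity, as in Observation 1, and the new vertex is placed last, so its edges are incoming. The names `extend_by_one_vertex`, `coupled_dags` and `rho` are hypothetical.

```python
import random


def extend_by_one_vertex(n, edges, rho, rng):
    """Add vertex n to a DAG on vertices 0..n-1.

    Each potential edge i -> n is included independently with
    probability rho; existing edges are left untouched.
    """
    new_edges = set(edges)
    for i in range(n):
        if rng.random() < rho:
            new_edges.add((i, n))
    return new_edges


def coupled_dags(n_max, rho, seed=None):
    """Return a list whose k-th entry is the edge set on vertices 0..k-1.

    Consecutive entries share all edges among the first k vertices,
    which is the property the coupling relies on.
    """
    rng = random.Random(seed)
    graphs = [set()]
    edges = set()
    for n in range(n_max):
        edges = extend_by_one_vertex(n, edges, rho, rng)
        graphs.append(edges)
    return graphs
```

Because every graph in the list is generated from the same random variables, restricting the graph on k + 1 vertices to its first k vertices recovers the graph on k vertices exactly, while each graph taken on its own has the marginal distribution described in Observation 1.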
The above coupling along with certain structural properties of Meek Rules (given in Appendix A) leads to the following results on orientability of certain edges in and under the coupling.
Lemma 1.
Under the above coupling, if an edge is unorientable in , it is also unorientable in .
Lemma 2.
Under the above coupling, if after a set of interventions on the edge is unorientable in , then it is also unorientable in after the same set of interventions on together with an intervention on .
3.1.1 Monotonicity Lemmas
We prove that expected values of all metrics of interest are monotonic in using the properties of the coupling demonstrated above. First, we show this for observational quantities by appealing to Lemma 1.
Theorem 2.
The following statements hold with probability for the coupling between and :
a) .
b) .
c) .
Therefore, , and .
Similar monotonicity properties for interventional quantities are obtained by appealing to Lemma 2. However, note that these proofs are not a straightforward application of Lemma 2. Often, additional arguments need to be made to show the following results.
Theorem 3.
with probability according to the coupling between and . Hence, .
The previous two theorems directly provide the following result.
Theorem 4.
for all best interventions with probability under the coupling between and . Hence, .
Theorem 5.
with probability under the coupling between and . Hence, .
The established monotonicity results help prove that the asymptotic versions of these metrics exist.
Theorem 6.
exists and .
Remark: Theorem 6 extends to all metrics that have been shown to be monotonic nondecreasing, i.e. metrics in the set , by analogous arguments. Note that monotonically nonincreasing sequences like are bounded below and above and hence the results can be shown again by the same theorem applied to shifted negatives of these variables.
3.1.2 Gap Bounds on Observational Metrics
Using properties of the coupling between and we can show that the expected difference in the observational metrics for and the asymptotic version is bounded.
Theorem 7.
.
Theorem 8.
.
3.1.3 Gap Bounds on Interventional Metrics
In the following, we show that the expected difference in the interventional metrics for and the asymptotic version is bounded again using the properties of the coupling described before.
Theorem 9.
.
Theorem 10.
.
All these results together allow us to prove the main result (Theorem 1).
3.1.4 Lower Bound on Successive Differences
The above gap bounds depend on upper bounding successive differences of . In the following, we provide a lower bound on the successive differences which implies that gap bounds that are faster than exponential cannot exist.
Theorem 11.
4 Results on IMECs obtained by Interventional Design Algorithms
In the following, we provide asymptotic convergence rates for the number of undirected edges after interventions, when the interventions are chosen by an algorithm that has a property that we call downstream-independence. Greedy algorithms that choose interventions sequentially based on the essential graph at the observational stage are downstream-independent. Note that, in this section, we do not consider , which is the minimum number of edges left unoriented when interventions are chosen based on the DAG structure. We are therefore interested in algorithms that optimize the interventions based on the essential graph, which can be inferred from purely observational datasets.
Notation 1.
Let be a set of interventions. We say that when is the essential graph that results from performing the interventions on the underlying causal DAG . Note that if is a subgraph of , then is obtained by skipping the interventions on nodes outside of .
Lemma 3.
Let be a DAG and a vertex of with no outgoing or undirected edges. Then, . In other words, interventions do not affect vertices that have no outgoing or undirected edges.
Lemma 4.
Let be an induced subgraph of consisting of all vertices such that neither nor any descendants of have adjacent undirected edges. Then .
Proof.
The proof follows by applying Lemma 3 recursively to . ∎
Definition 1.
We say that an algorithm for performing interventions on an essential graph is downstream-independent if the interventions it performs on are identical to the ones it performs on .
Note that is the result of the following process: starting with , recursively remove vertices that have no undirected or outgoing edges.
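The removal process described in the note above can be sketched as follows. The function name `peel_settled_vertices` and the representation of the essential graph (a set of directed pairs plus a set of undirected pairs) are assumptions made for this example.

```python
def peel_settled_vertices(nodes, directed, undirected):
    """Recursively remove vertices with no undirected or outgoing edges.

    `directed` holds pairs (u, v) meaning u -> v; `undirected` holds
    pairs (u, v) for unoriented edges. Only edges between vertices that
    are still present count. Returns the set of remaining vertices.
    """
    remaining = set(nodes)
    changed = True
    while changed:
        changed = False
        for v in sorted(remaining):
            has_outgoing = any(
                u == v and w in remaining for (u, w) in directed
            )
            has_undirected = any(
                v in e and set(e) <= remaining for e in undirected
            )
            if not has_outgoing and not has_undirected:
                remaining.discard(v)
                changed = True
    return remaining
```

A vertex survives exactly when it, or something reachable from it along directed edges, still touches an unoriented edge, so a downstream-independent algorithm only ever needs to look at the surviving vertices.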
Theorem 12.
Let be a downstream-independent algorithm. Let be the expected number of undirected edges in the essential graph of the random order DAG after performing interventions according to algorithm . Then
(2) 
Remark: Suppose an algorithm optimizes a score function, based on the essential graphs alone, that serves as a proxy for minimizing the expected number of unoriented edges after interventions. Such an algorithm is likely to make decisions independent of in general, due to Lemma 4. An example is the algorithm that greedily picks the intervention that reduces the expected number of unoriented edges, where the expectation is over the uniform distribution of DAGs compatible with the essential graph.
Theorem 13.
Let be an algorithm that is downstream-independent and chooses interventions based on . Let be the number of undirected edges after interventions made by the algorithm . Then,
Here, and this limit exists.
Proof.
This is a direct corollary from the previous results in this section together with analogous arguments regarding monotonicity and existence of limits similar to those for . ∎
5 Discussion of the Results
Theorems 1 and 13 provide upper bounds in terms of quantities computable by Monte Carlo simulation at finite from random order DAGs and constants such as that are exponentially small in . If the empirical means of the finite quantities appearing in these upper bounds can be characterized with very high precision, then we can characterize the constant by which these asymptotic quantities are upper bounded.
In the following section, we plot the empirical means of these finite quantities or upper bounds to these finite quantities for very large and show that when combined with the above bounds, the asymptotic quantities tend to a constant.
5.1 Precise Calculation of High Confidence Upper Bounds on Asymptotic MEC Size for Random Order DAGs of Density
We demonstrate how to obtain confidence intervals on the expected asymptotic mean and using our bounds and Monte Carlo simulations.
Details of the Numerical Experiment: We sampled times for random order DAGs with . The sample variance we observed was while the empirical mean was .
We use an empirical Bernstein bound for and show the following bound on the expected value of :
Theorem 14.
With probability at least over the randomness in our numerical experiments over samples, we have: .
This is an illustration of how our upper bounds, empirical Bernstein bounds and Monte Carlo simulation can be combined to give highly precise guarantees for all the considered metrics.
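The empirical Bernstein step can be sketched as below. This is a generic illustration of the one-sided bound of Maurer and Pontil (2009) for i.i.d. samples taking values in an interval of known width, not the exact computation behind Theorem 14; the function name and the parameters `value_range` and `delta` are assumptions made for this example.

```python
import math
import statistics


def empirical_bernstein_upper(samples, value_range, delta):
    """High-probability upper bound on the mean of a bounded variable.

    Assumes i.i.d. samples taking values in an interval of width
    `value_range`. With probability at least 1 - delta the true mean is
    at most the returned value.
    """
    n = len(samples)
    mean = statistics.fmean(samples)
    var = statistics.variance(samples)  # unbiased sample variance
    log_term = math.log(2.0 / delta)
    return (
        mean
        + math.sqrt(2.0 * var * log_term / n)
        + 7.0 * value_range * log_term / (3.0 * (n - 1))
    )
```

The variance term shrinks like the inverse square root of the number of samples and the range term like its inverse, which is why a low observed sample variance leads to a tight bound even when the range of the metric is large.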
6 Numerical Results
We compute and plot the empirical means of the following observational metrics: a) , b) , c) , and d) . We also plot the empirical mean of the following interventional metrics: a) , b) , c) , and d) . These interventional metrics are obtained on the essential graph obtained by the greedy algorithm that operates as follows: first pick the node that orients the most edges, then for each consecutive , pick that orients the most edges in given the interventional essential graph.
Graph Generation: We generated 2,000 random order DAGs with nodes and densities . For each DAG, we used the open-source causaldag package in Python to compute the number of DAGs in the (I)MEC and the number of undirected edges in the (interventional) essential graph obtained by applying algorithm on .
Results Established: The plots serve two purposes: a) The empirical mean plots (Figs. (a)-(b)) and the box plots (Figs. (a)-(c)) of all the estimated quantities provide an idea of what values the asymptotic quantities are bounded by given the formula for in Theorem 1. For a more refined high-confidence upper bound, for large enough , an analysis similar to Theorem 14 can be done. b) They help corroborate the monotonicity results we have derived analytically.
Bounding Interventional Metrics: We observe that the interventional metrics plotted above provide an upper bound on and , which are based on the set of optimal interventions for that minimize the number of unoriented edges given . Therefore, by Theorem 1 they certainly provide valid upper bounds together with . The shaded regions in each plot are the estimates of the 95% confidence intervals as given by the scipy.stats function bayes_mvs.
Figure (a) plots the empirical mean of and . We observe that increases sharply for and plateaus near , while increases more gradually for , with a higher limit for sparse graphs. For all densities, the empirical mean of increases more gradually than the observational .
Figure (b) plots the empirical mean of and . We again observe sharper increases and lower plateaus for the higher densities, and , compared to more gradual rises and higher plateaus for the lower densities. Whereas in Figure (a), stabilizes at similar values for and , in Figure (b), the empirical mean of is greater for than for . This indicates that each unoriented edge contributes to more MECs when the density is low.
Figure (a) demonstrates the monotonicity of the empirical mean of and . We observe that the empirical mean of drops sharply for all densities, with appearing to have the highest limit. The difference in behavior of the empirical mean of and for different densities is noteworthy. For sparser graphs, 1 or 2 interventions do not significantly increase the expected ability to identify the DAG; for instance, when , the expected number of fully identified DAGs barely changes from the observational case after . However, for denser graphs, such as for and , even 1 intervention is sufficient to learn roughly 50% and 60% of the sampled graphs, respectively, and 2 interventions are sufficient to learn nearly all of them, even when . This result can be explained by the fact that sparse graphs often consist of multiple connected components and interventions in one component have no effect on other components. Finally, Figure (b) demonstrates the monotonicity of the empirical mean of . Surprisingly, it takes very few interventions to orient even large, sparse graphs.
7 Conclusion
We provided sharp upper and lower bounds for asymptotic expected MEC size and the number of interventions needed to fully orient a random order DAG after (constant) number of initial interventions. There are various other metrics associated with MECs of random order DAGs that we precisely quantify in this work. Our methods relied on analytical bounds on the asymptotic quantities based on coupling arguments and exploiting the properties of Meek rules. This together with Monte Carlo simulations at finite sizes establishes quantifiable and precise bounds.
Our results mean that a walk over the space of graphs (larger search space but simpler moves) would not be more time consuming than a walk over the space of Markov equivalence classes (more complicated moves) when implementing greedy search for structure learning. This is because the asymptotic log MEC size goes to a constant for dense graphs. In addition, our results imply that in general relatively few interventions are needed to identify dense causal networks. Investigations like this for random graphs considering various levels of sparsity and relaxing the causal sufficiency assumptions are interesting directions for future work.
Acknowledgements
C. Uhler was partially supported by NSF (DMS1651995), ONR (N000141712147 and N000141812765), IBM, and a Sloan Fellowship.
References
 Bello and Honorio (2017) K. Bello and J. Honorio. Learning causal Bayes networks using interventional path queries in polynomial time and sample complexity. arXiv preprint arXiv:1706.00754, 2017.
 Brenner and Sontag (2013) E. Brenner and D. Sontag. SparsityBoost: A new scoring function for learning Bayesian network structure. arXiv preprint arXiv:1309.6820, 2013.

 Castelo and Kocka (2003) R. Castelo and T. Kocka. On inclusion-driven learning of Bayesian networks. Journal of Machine Learning Research, 4(Sep):527–574, 2003.
 Eberhardt et al. (2012) F. Eberhardt, C. Glymour, and R. Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. arXiv preprint arXiv:1207.1389, 2012.
 Gangl (2010) M. Gangl. Causal inference in sociological research. Annual review of sociology, 36:21–47, 2010.
 Gillispie (2006) S. B. Gillispie. Formulas for counting acyclic digraph Markov equivalence classes. Journal of Statistical Planning and Inference, 136(4):1410–1432, 2006.

 Gillispie and Perlman (2001) S. B. Gillispie and M. D. Perlman. Enumerating Markov equivalence classes of acyclic digraph models. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 171–177. Morgan Kaufmann Publishers Inc., 2001.
 Hauser and Bühlmann (2012a) A. Hauser and P. Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13(Aug):2409–2464, 2012a.
 Hauser and Bühlmann (2012b) A. Hauser and P. Bühlmann. Two optimal strategies for active learning of causal networks from interventional data. In Proceedings of Sixth European Workshop on Probabilistic Graphical Models, volume 119, page 5, 2012b.
 He and Yu (2016) Y. He and B. Yu. Formulas for counting the sizes of Markov equivalence classes of directed acyclic graphs. arXiv preprint arXiv:1610.07921, 2016.
 Hu et al. (2014) H. Hu, Z. Li, and A. R. Vetta. Randomized experimental design for causal graph discovery. In Advances in Neural Information Processing Systems, pages 2339–2347, 2014.
 Hyttinen et al. (2013) A. Hyttinen, F. Eberhardt, and P. O. Hoyer. Experiment selection for causal discovery. The Journal of Machine Learning Research, 14(1):3041–3071, 2013.
 Kalisch and Bühlmann (2007) M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(Mar):613–636, 2007.
 Kocaoglu et al. (2017) M. Kocaoglu, A. G. Dimakis, and S. Vishwanath. Cost-optimal learning of causal graphs. arXiv preprint arXiv:1703.02645, 2017.
 Lagani et al. (2016) V. Lagani, S. Triantafillou, G. Ball, J. Tegner, and I. Tsamardinos. Probabilistic computational causal discovery for systems biology. In Uncertainty in Biology, pages 33–73. Springer, 2016.
 Maurer and Pontil (2009) A. Maurer and M. Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
 Meek (1995) C. Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pages 403–410. Morgan Kaufmann Publishers Inc., 1995.
 Meek (1997) C. Meek. Graphical Models: Selecting causal and statistical models. PhD thesis, Carnegie Mellon University, 1997.
 Mohammadi et al. (2018) F. Mohammadi, C. Uhler, C. Wang, and J. Yu. Generalized permutohedra from probabilistic graphical models. SIAM Journal on Discrete Mathematics, 32:64–93, 2018.
 Radhakrishnan et al. (2017) A. Radhakrishnan, L. Solus, and C. Uhler. Counting Markov equivalence classes by number of immoralities. Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017.
 Radhakrishnan et al. (2018) A. Radhakrishnan, L. Solus, and C. Uhler. Counting Markov equivalence classes for dag models on trees. Discrete Applied Mathematics, 244:170–185, 2018.
 Raskutti and Uhler (2018) G. Raskutti and C. Uhler. Learning directed acyclic graphs based on sparsest permutations. Stat, 7:e183, 2018.
 Shanmugam et al. (2015) K. Shanmugam, M. Kocaoglu, A. G. Dimakis, and S. Vishwanath. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pages 3195–3203, 2015.
 Solus et al. (2017) L. Solus, Y. Wang, L. Matejovicova, and C. Uhler. Consistency guarantees for permutation-based causal inference algorithms. arXiv preprint arXiv:1702.03530, 2017.
 Spirtes et al. (2000) P. Spirtes, C. N. Glymour, R. Scheines, D. Heckerman, C. Meek, G. Cooper, and T. Richardson. Causation, prediction, and search. MIT press, 2000.
 Steinsky (2003) B. Steinsky. Enumeration of labelled chain graphs and labelled essential directed acyclic graphs. Discrete Mathematics, 270(1-3):267–278, 2003.
 Teyssier and Koller (2012) M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. arXiv preprint arXiv:1207.1429, 2012.
 Wagner (2013) S. Wagner. Asymptotic enumeration of extensional acyclic digraphs. Algorithmica, 66(4):829–847, 2013.
 Wang et al. (2017) Y. Wang, L. Solus, K. Yang, and C. Uhler. Permutation-based causal inference algorithms with interventions. In Advances in Neural Information Processing Systems, pages 5822–5831, 2017.
 Xiao et al. (2015) Y. Xiao, Y. Gong, Y. Lv, Y. Lan, J. Hu, F. Li, J. Xu, J. Bai, Y. Deng, L. Liu, et al. Gene perturbation atlas (GPA): a singlegene perturbation repository for characterizing functional mechanisms of coding and noncoding genes. Scientific Reports, 5:10889, 2015.
 Yang et al. (2018) K. D. Yang, A. Katcoff, and C. Uhler. Characterizing and learning equivalence classes of causal DAGs under interventions. Proceedings of Machine Learning Research, 80:5537–5546, 2018.
References
 Bello and Honorio (2017) K. Bello and J. Honorio. Learning causal Bayes networks using interventional path queries in polynomial time and sample complexity. arXiv preprint arXiv:1706.00754, 2017.
 Brenner and Sontag (2013) E. Brenner and D. Sontag. SparsityBoost: A new scoring function for learning Bayesian network structure. arXiv preprint arXiv:1309.6820, 2013.
 Castelo and Kocka (2003) R. Castelo and T. Kocka. On inclusion-driven learning of Bayesian networks. Journal of Machine Learning Research, 4(Sep):527–574, 2003.
 Eberhardt et al. (2012) F. Eberhardt, C. Glymour, and R. Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. arXiv preprint arXiv:1207.1389, 2012.
 Gangl (2010) M. Gangl. Causal inference in sociological research. Annual Review of Sociology, 36:21–47, 2010.
 Gillispie (2006) S. B. Gillispie. Formulas for counting acyclic digraph Markov equivalence classes. Journal of Statistical Planning and Inference, 136(4):1410–1432, 2006.
 Gillispie and Perlman (2001) S. B. Gillispie and M. D. Perlman. Enumerating Markov equivalence classes of acyclic digraph models. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 171–177. Morgan Kaufmann Publishers Inc., 2001.
 Hauser and Bühlmann (2012a) A. Hauser and P. Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13(Aug):2409–2464, 2012a.
 Hauser and Bühlmann (2012b) A. Hauser and P. Bühlmann. Two optimal strategies for active learning of causal networks from interventional data. In Proceedings of Sixth European Workshop on Probabilistic Graphical Models, volume 119, page 5, 2012b.
 He and Yu (2016) Y. He and B. Yu. Formulas for counting the sizes of Markov equivalence classes of directed acyclic graphs. arXiv preprint arXiv:1610.07921, 2016.
 Hu et al. (2014) H. Hu, Z. Li, and A. R. Vetta. Randomized experimental design for causal graph discovery. In Advances in Neural Information Processing Systems, pages 2339–2347, 2014.
 Hyttinen et al. (2013) A. Hyttinen, F. Eberhardt, and P. O. Hoyer. Experiment selection for causal discovery. The Journal of Machine Learning Research, 14(1):3041–3071, 2013.
 Kalisch and Bühlmann (2007) M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(Mar):613–636, 2007.
 Kocaoglu et al. (2017) M. Kocaoglu, A. G. Dimakis, and S. Vishwanath. Cost-optimal learning of causal graphs. arXiv preprint arXiv:1703.02645, 2017.
 Lagani et al. (2016) V. Lagani, S. Triantafillou, G. Ball, J. Tegner, and I. Tsamardinos. Probabilistic computational causal discovery for systems biology. In Uncertainty in Biology, pages 33–73. Springer, 2016.
 Maurer and Pontil (2009) A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
 Meek (1995) C. Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 403–410. Morgan Kaufmann Publishers Inc., 1995.
 Meek (1997) C. Meek. Graphical Models: Selecting causal and statistical models. PhD thesis, Carnegie Mellon University, 1997.
 Mohammadi et al. (2018) F. Mohammadi, C. Uhler, C. Wang, and J. Yu. Generalized permutohedra from probabilistic graphical models. SIAM Journal on Discrete Mathematics, 32:64–93, 2018.
 Radhakrishnan et al. (2017) A. Radhakrishnan, L. Solus, and C. Uhler. Counting Markov equivalence classes by number of immoralities. Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017.
 Radhakrishnan et al. (2018) A. Radhakrishnan, L. Solus, and C. Uhler. Counting Markov equivalence classes for DAG models on trees. Discrete Applied Mathematics, 244:170–185, 2018.
 Raskutti and Uhler (2018) G. Raskutti and C. Uhler. Learning directed acyclic graphs based on sparsest permutations. Stat, 7:e183, 2018.
 Shanmugam et al. (2015) K. Shanmugam, M. Kocaoglu, A. G. Dimakis, and S. Vishwanath. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pages 3195–3203, 2015.
 Solus et al. (2017) L. Solus, Y. Wang, L. Matejovicova, and C. Uhler. Consistency guarantees for permutation-based causal inference algorithms. arXiv preprint arXiv:1702.03530, 2017.
 Spirtes et al. (2000) P. Spirtes, C. N. Glymour, R. Scheines, D. Heckerman, C. Meek, G. Cooper, and T. Richardson. Causation, prediction, and search. MIT Press, 2000.
 Steinsky (2003) B. Steinsky. Enumeration of labelled chain graphs and labelled essential directed acyclic graphs. Discrete Mathematics, 270(1–3):267–278, 2003.
 Teyssier and Koller (2012) M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. arXiv preprint arXiv:1207.1429, 2012.
 Wagner (2013) S. Wagner. Asymptotic enumeration of extensional acyclic digraphs. Algorithmica, 66(4):829–847, 2013.
 Wang et al. (2017) Y. Wang, L. Solus, K. Yang, and C. Uhler. Permutation-based causal inference algorithms with interventions. In Advances in Neural Information Processing Systems, pages 5822–5831, 2017.
 Xiao et al. (2015) Y. Xiao, Y. Gong, Y. Lv, Y. Lan, J. Hu, F. Li, J. Xu, J. Bai, Y. Deng, L. Liu, et al. Gene perturbation atlas (GPA): a single-gene perturbation repository for characterizing functional mechanisms of coding and noncoding genes. Scientific Reports, 5:10889, 2015.
 Yang et al. (2018) K. D. Yang, A. Katcoff, and C. Uhler. Characterizing and learning equivalence classes of causal DAGs under interventions. Proceedings of Machine Learning Research, 80:5537–5546, 2018.
Appendix A Meek Orientation Rules
In Figure 3, we provide the four Meek orientation rules that are used in the definition of the essential graph.
The following two observable properties play an important role in various results in the main paper.
Property 1.
If a node v is involved in any of the four Meek rules and v does not have an outgoing edge in the original causal DAG, then the oriented edge (in the right-hand-side motif of any of the four rules in Figure 3) is incident to v.
Property 2.
If a node v is involved in a motif for any of the four rules, then either v has an outgoing edge or it has an adjacent undirected edge (in the left-hand-side motif appearing in that rule).
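The four rules referred to at the start of this appendix can be sketched in code. The following is a minimal illustration of the standard formulation of Meek's rules (Meek, 1995), not a reproduction of Figure 3. The function name `apply_meek_rules` and the edge representation (a set of ordered pairs for directed edges, a set of two-element frozensets for undirected edges) are assumptions made for this example.

```python
from itertools import permutations


def apply_meek_rules(directed, undirected):
    """Repeatedly apply the four Meek rules until no edge can be oriented.

    directed:   iterable of (u, v) pairs meaning u -> v
    undirected: iterable of 2-element collections {u, v} meaning u - v
    Returns the updated (directed, undirected) sets.
    """
    directed = set(directed)
    undirected = {frozenset(e) for e in undirected}
    nodes = {x for e in directed for x in e} | {x for e in undirected for x in e}

    def und(u, v):
        return frozenset((u, v)) in undirected

    def adj(u, v):
        return und(u, v) or (u, v) in directed or (v, u) in directed

    changed = True
    while changed:
        changed = False
        for a, b in permutations(nodes, 2):
            if not und(a, b):
                continue
            others = nodes - {a, b}
            # Rule 1: c -> a, a - b, c and b non-adjacent  =>  a -> b
            r1 = any((c, a) in directed and not adj(c, b) for c in others)
            # Rule 2: a -> c -> b, a - b  =>  a -> b
            r2 = any((a, c) in directed and (c, b) in directed for c in others)
            # Rule 3: a - c, a - d, c -> b, d -> b, c and d non-adjacent  =>  a -> b
            r3 = any(
                und(a, c) and und(a, d)
                and (c, b) in directed and (d, b) in directed
                and not adj(c, d)
                for c, d in permutations(others, 2)
            )
            # Rule 4: a - c, c -> d, d -> b, a adjacent to d,
            #         c and b non-adjacent  =>  a -> b
            r4 = any(
                und(a, c) and (c, d) in directed and (d, b) in directed
                and adj(a, d) and not adj(c, b)
                for c, d in permutations(others, 2)
            )
            if r1 or r2 or r3 or r4:
                undirected.discard(frozenset((a, b)))
                directed.add((a, b))
                changed = True
    return directed, undirected
```

For instance, starting from the directed edge 1 -> 2 and the undirected chain 2 - 3 - 4, Rule 1 fires twice and the sketch returns the fully directed chain 1 -> 2 -> 3 -> 4.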
Appendix B Additional Proofs
B.1 Proof of Lemma 1
Observe that all edges between and are directed to and that does not have any outgoing edges. Suppose that is involved in one of the four Meek rules in Appendix A. Then by Property 1 in Appendix A, the discovered edge has to be incident to . On the other hand, if is not part of any Meek rule, then the rules must have already been applied in to orient edges maximally, which completes the proof.
B.2 Proof of Lemma 2
B.3 Proof of Equation 1
We can simplify this sum as follows:
Substituting for , we obtain
B.4 Proof of Lemma 3
Suppose , then by Property 2, it cannot be part of any of the Meek rules in Appendix A. Therefore, it cannot aid in any of the rule applications after new edges have been discovered by interventions in . This means that removing it before or after applying the Meek rules is irrelevant. Hence . If , then the intervention on gives no additional information as all its adjacent edges have already been discovered and hence it is equivalent to using . Hence, this reduces to the previous case with replaced by and thereby completes the proof.
B.5 Proof of Theorem 2
a) This follows directly from Lemma 1.
b) Suppose the MEC of is given by the DAG set . For each , let be the DAG extended by adding the vertex , with the same incoming edges as it has in . From Lemma 1, it follows that the DAGs are contained in the MEC of .
c) Due to the coupling, if a set of interventions orients , it also orients . The result follows.
The results for the expected values follow from the almost sure results.
B.6 Proof of Theorem 3
Let be the set of interventions that achieves the minimum number of unoriented edges in and let . We apply the same set of interventions to barring the possible intervention on node . By Lemma 2, all edges unorientable in after these ‘copied’ interventions are also unorientable in , with or without the possible additional intervention on . Since we can (possibly) add the extra intervention in to bring the total number to , this means that after these r interventions on we have at most unorientable edges in , which completes the proof.
B.7 Proof of Theorem 5
Let be an optimal set of interventions on that orients the maximum number of edges in the . Therefore, . Let . Suppose the is given by the DAG set . For each , let be the DAG extended by adding the vertex , with the same incoming edges as in . From Lemma 2, it follows that the DAGs are contained in . This means that
B.8 Proof of Theorem 6
For all monotonic sequences ,
Further, when a sequence of measurable functions converges pointwise to a function , then is also measurable. Here, is a measurable function of the random variables . Hence, follows from the Lebesgue Monotone Convergence Theorem.
B.9 Proof of Theorem 7
We first prove the following lemma, from which the theorem follows.
Lemma 5.
.
Proof.
Observe that the left hand side equals the expected number of unorientable edges incident to in by Lemma 1. We will upper bound this number as follows:
For each vertex i the edge is unoriented if it is present (probability ), and not part of an uncovered collider. The probability that and form an uncovered collider given is present is , and such probabilities for different are independent. Thus the probability that is not part of an uncovered collider given that it is present is . Multiplying by for the probability that is present, and by for the total number of such potential edges leads to the desired bound. ∎
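The counting argument above can be checked numerically on small cases. The sketch below is illustrative only: it assumes that the newest vertex has k potential parents and that every edge (into the newest vertex and among the earlier vertices) is present independently with probability p. Under that reading, the argument gives the count k·p·(1 − p(1 − p))^(k−1); this expression is our transcription of the verbal argument and may differ in indexing from the bound stated in Lemma 5. The symbols `k` and `p` and both function names are assumptions for this example.

```python
from itertools import combinations, product


def expected_count_by_enumeration(k, p):
    """Exact expected number of edges i -> v that are present and are not
    part of an uncovered collider i -> v <- j (i and j non-adjacent),
    computed by enumerating every edge configuration."""
    parents = range(k)
    pairs = list(combinations(parents, 2))  # possible edges among earlier vertices
    total = 0.0
    for into_v in product([0, 1], repeat=k):
        for among in product([0, 1], repeat=len(pairs)):
            adjacent = {pair for pair, bit in zip(pairs, among) if bit}
            n_edges = sum(into_v) + sum(among)
            weight = p ** n_edges * (1 - p) ** (k + len(pairs) - n_edges)
            count = 0
            for i in parents:
                if not into_v[i]:
                    continue
                in_collider = any(
                    into_v[j] and (min(i, j), max(i, j)) not in adjacent
                    for j in parents
                    if j != i
                )
                if not in_collider:
                    count += 1
            total += weight * count
    return total


def closed_form(k, p):
    """k potential edges, each present with probability p and, given that,
    avoiding a collider with each of the other k - 1 vertices independently."""
    return k * p * (1 - p * (1 - p)) ** (k - 1)
```

The two functions agree up to floating-point error for small k, which reflects the independence step in the proof: for a fixed i, the events for different j involve disjoint sets of edges.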
B.10 Proof of Theorem 8
Since the unoriented edges incident to can be oriented with at most one intervention each, we have that . This, combined with Theorem 7 results in the desired bound.
B.11 Proof of Theorem 11
If , and for all it holds that , then the edge is isolated and thus unorientable. That happens with probability for each . Note that there are such potential edges that are adjacent to vertex and therefore figure into , which completes the proof.
B.12 Proof of Theorem 9
We provide a lemma and its proof regarding successive differences of the interventional metric . The result in the theorem follows immediately from this.
Lemma 6.
.
B.13 Proof of Theorem 10
B.14 Proof of Theorem 12
It follows from the proof of Lemma 5 that the probability that has an undirected edge is less than . If does not have an undirected edge, then, by the fact that is a downstream independent algorithm it follows that , and therefore that = 0. If is adjacent to undirected edges, then , the number of possible edges in . The bound follows since this happens with probability .
B.15 Proof of Theorem 14
We consider the following bound for from Theorem 1:
(3) 
(by direct calculation) based on Lemma LABEL:RHS.
Now, we apply the following theorem from Maurer and Pontil (2009) (quoted with appropriate modifications for a random variable taking values in ).
Theorem 15.
(Maurer and Pontil, 2009) Let be i.i.d. random variables, each bounded in . Let be the empirical mean and be the empirical variance of the samples. Then, with probability ,
(4) 
Now, for , . Substituting in the above theorem and using (3) we have the bound in the theorem.
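To make the concentration step concrete, the following is a minimal sketch of the empirical Bernstein deviation term. It assumes the [0, 1] form of the bound in Maurer and Pontil (2009), in which the true mean exceeds the empirical mean by at most sqrt(2 V ln(2/δ) / n) + 7 ln(2/δ) / (3(n − 1)) with probability at least 1 − δ, where V is the unbiased sample variance. The `upper` parameter, which rescales the bound to variables in [0, upper], and the function name are assumptions for this example and need not match the modification used in Theorem 15.

```python
import math
from statistics import variance


def empirical_bernstein_radius(samples, delta, upper=1.0):
    """Deviation term of the empirical Bernstein bound for i.i.d. samples
    taking values in [0, upper], at confidence level 1 - delta."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are needed for a sample variance")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    log_term = math.log(2.0 / delta)
    sample_var = variance(samples)  # unbiased estimate, 1 / (n - 1) normalization
    variance_part = math.sqrt(2.0 * sample_var * log_term / n)
    range_part = 7.0 * upper * log_term / (3.0 * (n - 1))
    return variance_part + range_part
```

When the sample variance is small, the first term is small and the radius decays at rate 1/n instead of 1/sqrt(n), which is the reason for preferring this bound over a Hoeffding-type bound here.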
Appendix C Additional Figures
Number of unoriented edges of the 2,000 orderDAG samples. The middle line of the box is the median, the upper and lower edges are the upper and lower quartiles, and the circles are outliers.