1 Introduction
Can the temperature of a city alter its altitude (Peters et al., 2017)? Can a light bulb change the state of a switch? Can the brakes of a car be activated by their indicator light (de Haan et al., 2019)? Chances are, you did not need to think very hard to answer these questions, since you intuitively understand the implausibility of causal relationships between certain types of entities. This form of prior knowledge has been shown to play a key role in causal reasoning (Griffiths et al., 2011; Schulz & Gopnik, 2004; Gopnik & Sobel, 2000). In fact, in the absence of evidence (e.g., data), humans tend to reason inductively and use domain knowledge to generalize known causal relationships to new, similar entities (Kemp et al., 2010).
Nonetheless, the elucidation of causal relationships often goes beyond human intuition. The abundance of large-scale scientific endeavours to understand the causes of diseases (1KGP, 2010) or natural phenomena (Runge et al., 2019) are good examples. In such cases, computational methods for causal discovery may help reveal causal relationships based on patterns of association in data (see Heinze-Deml et al. (2018) for a review). The most common setting consists of representing causal relationships as a directed acyclic graph, where vertices correspond to variables of interest and edges indicate causal relationships. Additional assumptions, like the faithfulness condition, are then made to enable reasoning about graph structures based on conditional independences in the data. While these enable data-driven causal discovery, the underlying causal graph can only be identified up to its Markov equivalence class (Peters et al., 2017), which can often be very large (He et al., 2015), thus leaving many edges unoriented.
Inspired by how humans use types to reason about causal relationships, this work explores if prior knowledge about the nature of the variables can help reduce the size of such equivalence classes. Building on the theoretical foundations of causal discovery in directed acyclic graphs, we propose a new theoretical framework for the case where variables are labeled by type. Such types can be attributed based on prior knowledge, e.g., via a domain expert. We then impose assumptions on how types can interact with each other, which constrains the space of possible graphs and leads to reduced equivalence classes. We show, both theoretically and empirically, that when such assumptions hold in the data, significant gains in the identification of causal relationships can be made.
Contributions:

We propose a new set of assumptions for causal discovery, based on variable types (Section 4).

We present a simple algorithm to incorporate these assumptions in causal discovery (Section 5).

We prove theoretical results that guarantee convergence of the equivalence class to a singleton, i.e., identification, when the number of vertices tends to infinity and the number of types is fixed (Section 6).

We present an empirical study that supports our theoretical results (Section 7).
2 Problem formulation
Causal graphical models.
In this work, we adopt the framework of causal graphical models (CGM) (Peters et al., 2017). Let
$X = (X_1, \dots, X_d)$ be a random vector with distribution
$P_X$. Let $G = (V, E)$ be a directed acyclic graph (DAG) with vertices $V = \{v_1, \dots, v_d\}$. Each vertex $v_i$ is associated to the variable $X_i$ in $X$, and a directed edge $(v_i, v_j) \in E$ represents a direct causal relationship from $X_i$ to $X_j$. We assume that $P_X$ can be factorized according to $G$, that is, $P_X(X_1, \dots, X_d) = \prod_{i=1}^{d} P(X_i \mid \pi_i^G)$, where $\pi_i^G$ denotes the parents of $X_i$ in $G$ (i.e., the variables associated to the parent vertices of $v_i$ in $G$). From this graph, it is possible to answer causal questions (e.g., via do-calculus (Pearl, 1995)). However, in many situations, the structure of $G$ is unknown and must be inferred from data.
Causal discovery.
The task of causal discovery consists of learning the structure of $G$ based on observations from $P_X$. Some assumptions are required to make this possible. By adopting the CGM framework, we assume: (i) causal sufficiency, which states that there is no unobserved variable that causes more than one variable in $X$; and (ii) the causal Markov property, which states that $v_i \perp_G v_j \mid S \implies X_i \perp\!\!\!\perp X_j \mid X_S$, where $S \subseteq V \setminus \{v_i, v_j\}$, $v_i \perp_G v_j \mid S$ indicates that $v_i$ and $v_j$ are d-separated by $S$ in $G$, and $X_i \perp\!\!\!\perp X_j \mid X_S$ indicates that $X_i$ and $X_j$ are independent conditioned on the variables $X_S$ associated to $S$. Additionally, we assume (iii) faithfulness, which states that $X_i \perp\!\!\!\perp X_j \mid X_S \implies v_i \perp_G v_j \mid S$. Hence, conditional independences in the data can be used to learn about separations in $G$.
Equivalence classes.
Even with these assumptions, $G$ can only be recovered up to a Markov equivalence class (MEC), which is the set containing all the DAGs that can represent the same distributions as $G$. The MEC is often characterized graphically using an essential graph or Completed Partially Directed Acyclic Graph (CPDAG), which corresponds to the union of all Markov equivalent DAGs (Andersson et al., 1997). In some cases, e.g., for sparse graphs, the size of the MEC can be huge (He & Yu, 2016; He et al., 2015), significantly limiting inference about the direction of edges in $G$. Hence, it is a problem of key importance to find new, realistic assumptions that shrink the equivalence class.
There has been a wealth of approaches to this problem. For instance, some have made progress by including data collected under intervention (Hauser & Bühlmann, 2012), making assumptions about the functional form of causal relationships (e.g., Peters et al. (2014); Peters & Bühlmann (2014); Shimizu et al. (2006)), or including background knowledge on the direction of edges (Meek, 1995). In this work, we propose an alternative approach, based on background knowledge, where types are attributed to variables and the interactions between types are constrained.
3 Related works
The inclusion of background knowledge in causal discovery aims to reduce the size of the solution space by adding or ruling out causal relationships based on expert knowledge. Several forms of background knowledge have been proposed, which place various levels of burden on the expert. Below, we review those that are most relevant to our work (see Constantinou et al. (2021) for a review).
Hard background knowledge. The background knowledge that we consider “hard” must be respected in the inferred graph structures. Previous works have considered: sets of forbidden and known edges (Meek, 1995), a known ordering of the variables (Cooper & Herskovits, 1992), partial orderings of the variables (Andrews, 2020; Scheines et al., 1998), and ancestral constraints (Li & Beek, 2018; Chen et al., 2016). Among these, partial orderings (or tiered background knowledge) are the most similar to our contribution. In this setting, it is assumed that an expert partitions the variables into sets called tiers, and orders the tiers such that variables in a later tier cannot cause variables in an earlier tier. In contrast, while we require the expert to partition variables into sets (by type), we do not assume that an ordering is known a priori. In our setting, the possible inter-type interactions are learned along with the graph structure.
Soft background knowledge. A setting similar to ours, where the type of each variable dictates its possible causal relations, was presented in Mansinghka et al. (2012). They propose a Bayesian method to leverage this prior knowledge in causal discovery. Their work highlights the benefits of such priors, but they do not investigate this space of graphs and its properties w.r.t. structure identifiability.
Practical applications. One may wonder whether the form of background knowledge that we consider has practical interest. As a concrete example, we refer the reader to the work of Flores et al. (2011), which outlines an application of tiered background knowledge in a medical case study. Since our type-based background knowledge puts strictly less burden on the expert, we expect that it could also be applied in this setting.
4 Typed directed acyclic graphs
Our work builds on two fundamental structures: typed directed acyclic graphs (t-DAGs), which are essentially DAGs with typed vertices; and t-edges, which are sets of edges relating vertices of given types. Formal definitions follow.
Definition 1 (t-DAG).
A t-DAG is a tuple $(V, E, T)$ with a mapping $T: V \to \mathcal{T}$ such that $(V, E)$ forms a DAG and $T(v)$ is the type of vertex $v \in V$, with $|\mathcal{T}| = k$.
Definition 2 (t-edge).
A t-edge is a set of edges between vertices of given types in a t-DAG. More formally, $E_{ts} = \{(u, v) \in E : T(u) = t,\ T(v) = s\}$ for any pair of types $t, s \in \mathcal{T}$.
For example, the graphs illustrated in Fig. 1 (b) and (c) are t-DAGs where colors represent types and where, e.g., the set of all edges from vertices of the purple type to vertices of the white type forms a t-edge between these two types.
4.1 Assumptions on type interactions
We now introduce the type consistency assumption, which constrains the possible interactions between variables of different types.
Definition 3 (Consistent t-DAG).
A consistent t-DAG is a t-DAG where, for every t-edge $E_{ts}$, we have that $E_{ts} \neq \emptyset \implies E_{st} = \emptyset$. We refer to this additional constraint as type consistency.
For conciseness, we write $t \rightarrow s$ when $E_{ts} \neq \emptyset$. Note that this definition implies that there are no t-edges between vertices of the same type in a consistent t-DAG.
In Fig. 1 (b), we present an example of a consistent t-DAG. In contrast, the t-DAG shown in Fig. 1 (c) is not consistent: the t-edge from the purple type to the white type contains an edge, while the reverse t-edge, from white to purple, is also non-empty. Notice how the orientation of all t-edges, summarized in Fig. 1 (a), fully determines the orientation of the edges in a consistent t-DAG.
Note that alternative assumptions could have been considered. For instance, we could have assumed that the t-edges form a DAG over the types (i.e., that the types have a partial ordering). However, the assumptions considered here are less restrictive and, as we demonstrate later, lead to interesting results.
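The consistency condition can be checked mechanically. Below is a minimal sketch (function and variable names are our own, not from the paper) that verifies Definition 3 on an explicit edge list:

```python
def is_consistent(edges, types):
    """Check Definition 3 (type consistency): all edges between a given pair
    of types must share one direction, and edges between two vertices of the
    same type are forbidden."""
    seen = {}  # unordered type pair -> observed direction
    for u, v in edges:  # a directed edge u -> v
        tu, tv = types[u], types[v]
        if tu == tv:
            return False  # same-type edge: forbidden in a consistent t-DAG
        pair = frozenset((tu, tv))
        if seen.setdefault(pair, (tu, tv)) != (tu, tv):
            return False  # both directions observed between the same types
    return True
```

For instance, two edges pointing from type-1 vertices into a type-2 vertex are consistent, whereas a chain that traverses the same type pair in both directions is not.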
4.2 Equivalence classes for consistent t-DAGs
We define the equivalence classes MEC and t-MEC as the set of DAGs and the set of consistent t-DAGs that are Markov equivalent, respectively.
Definition 4 (MEC).
The MEC of a t-DAG $D$ is $\text{MEC}(D) = \{D' : D' \sim D\}$, where “$\sim$” denotes Markov equivalence.
Definition 5 (t-MEC).
The t-MEC of a consistent t-DAG $D$ is $\text{t-MEC}(D) = \{D' : D' \sim_t D\}$, where “$\sim_t$” denotes Markov equivalence limited to consistent t-DAGs with the same mapping $T$ as $D$.
To represent an equivalence class, we can use an essential graph, which corresponds to the union of the equivalent DAGs.
Definition 6 (Essential graph).
The essential graph associated to the consistent t-DAG $D$ is: $\mathcal{E}(D) = \bigcup_{D' \in \text{MEC}(D)} D'$.
The union over graphs is defined as the union of their vertices and edges: $G_1 \cup G_2 = (V_1 \cup V_2,\ E_1 \cup E_2)$. Also, if an edge $(u, v)$ and its opposite $(v, u)$ both appear in the union, then the edge is considered to be undirected.
Definition 7 (t-essential graph).
The t-essential graph associated to the consistent t-DAG $D$ is: $\mathcal{E}_t(D) = \bigcup_{D' \in \text{t-MEC}(D)} D'$.
4.3 Space of consistent t-DAGs and size of the t-MEC
We consider some statements that can directly be made about the space of consistent t-DAGs and the size of the t-MEC with respect to their non-typed counterparts.
By definition, the space of consistent t-DAGs with $d$ vertices and $k$ types, denoted $\mathcal{D}^{c}_{d,k}$, is included in the space $\mathcal{D}_d$ of DAGs with the same number of vertices (a t-DAG is a DAG once types are ignored). Moreover, as stated in Proposition 1, for a number of types smaller than the number of vertices, the space of consistent t-DAGs is strictly smaller than that of DAGs.
Proposition 1.
$\mathcal{D}^{c}_{d,k} \subsetneq \mathcal{D}_d$ for $k < d$, where $d$ is the number of vertices and $k$ is the number of types.
Notice that, in the limit case $k = d$, i.e., every vertex has a different type, any t-DAG is consistent and $\mathcal{D}^{c}_{d,d} = \mathcal{D}_d$.
The t-essential graph is identical to the essential graph, except for the additional edges that can be oriented thanks to the type consistency assumption. This is summarized in Proposition 2.
Proposition 2.
Let $\mathcal{E}_t(D)$ and $\mathcal{E}(D)$ be, respectively, the t-essential and essential graphs of an arbitrary consistent t-DAG $D$. Then, $\mathcal{E}_t(D)$ and $\mathcal{E}(D)$ have the same skeleton, and every edge directed in $\mathcal{E}(D)$ is also directed in $\mathcal{E}_t(D)$.
From the t-essential graph, we can easily derive an upper bound on the size of the t-MEC based on the number of undirected t-edges.
Proposition 3 (Upper bound on the size of the t-MEC).
For any consistent t-DAG $D$, we have $|\text{t-MEC}(D)| \leq 2^u$, where $u$ is the number of undirected t-edges in the t-essential graph of $D$.
From this bound, we can also directly conclude that if the t-essential graph contains no undirected t-edges, then $|\text{t-MEC}(D)| = 1$. In other words, $D$ is identified.
5 An algorithm to find the t-essential graph
To make practical use of the above definitions, we seek an algorithm that can recover the t-essential graph based on a set of observational data and variable types attributed by a domain expert. Following previous work, it would be intuitive to make use of Meek (1995)'s rules as follows:

Recover the essential graph by applying an algorithm like PC (Spirtes et al., 2000) to the data.

Enforce type consistency: if there exists an oriented edge $u \rightarrow v$ between a pair of variables with types $T(u) = t$ and $T(v) = s$ in the graph of the previous step, then conclude $t \rightarrow s$, i.e., orient all edges between these two types accordingly.

Apply Meek (1995)’s rules to propagate the edge orientations derived in Step (2) (see their Section 2.1.2).

Repeat from Step (2) until the graph is unchanged.
This algorithm is sound, i.e., all edges that it orients are also oriented in the tessential graph. However, it is not complete — some edges that should be oriented in the tessential graph will remain unoriented.
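The type-consistency step (2) above can be sketched as follows; this is a minimal illustration with names of our own choosing, omitting the PC step (1) and Meek's rules (3):

```python
def enforce_type_consistency(directed, undirected, types):
    """Step (2) sketch: whenever some oriented edge fixes the direction
    between a pair of types, orient every remaining undirected edge between
    those two types the same way. Iterate until no edge changes."""
    directed, undirected = set(directed), set(undirected)
    changed = True
    while changed:
        changed = False
        # type pairs whose direction is already known
        known = {(types[u], types[v]) for u, v in directed}
        for u, v in list(undirected):
            if (types[u], types[v]) in known:
                undirected.discard((u, v)); directed.add((u, v)); changed = True
            elif (types[v], types[u]) in known:
                undirected.discard((u, v)); directed.add((v, u)); changed = True
    return directed, undirected
```

In the full algorithm, this propagation would alternate with Meek's orientation rules until a fixed point is reached.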
To see this, consider the following simple example of a consistent three-vertex t-DAG that we call the two-type fork (illustrated in Fig. 2 a). The essential graph for this t-DAG, obtained at Step (1), would be completely undirected since it contains no immoralities. This would result in a case where none of Meek (1995)'s rules are applicable and thus Step (2) would not orient any edges. The algorithm would therefore stop and return a fully undirected graph. However, according to the following proposition, the two-type fork should have been oriented.
Proposition 4.
If a consistent t-DAG contains vertices $v_1, v_3$ of type $t$ and $v_2$ of type $s \neq t$, with edges $v_1 \leftarrow v_2 \rightarrow v_3$, then the t-edge $s \rightarrow t$ is directed in the t-essential graph, i.e., the direction of causation between types $s$ and $t$ is known.
The proof follows the argument that if the true orientation were $t \rightarrow s$, this would create the immorality $v_1 \rightarrow v_2 \leftarrow v_3$, which in turn would orient the edges in the essential graph. Because we assume type consistency (Definition 3), the only other possible orientation is $s \rightarrow t$.
Nevertheless, adding a rule to orient such structures in Step (2), as illustrated in Fig. 2 b), is not sufficient to obtain a complete algorithm. In Appendix B, we show more complex cases (non-local and involving multiple t-edges) that are not covered, even with this additional orientation rule.
It remains an open question whether one could design a polynomial-time algorithm to find the t-essential graph, as Meek (1995) and Andersson et al. (1997) did for essential graphs. For now, we propose the following algorithm, whose time complexity is exponential in the number of types:

Run the previously described (incomplete) algorithm, adding the two-type fork case to the orientation rules in Step (2).

Enumerate all t-DAGs that can be produced by orienting edges in the graph resulting from the previous step. Reject any inconsistent t-DAG.

Take the union of all these t-DAGs (see Definition 7) to obtain the t-essential graph.
In practice, this non-polynomial time complexity was found to be non-prohibitive in our experiments.
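The enumeration in Steps (2)–(3) can be sketched as follows. This is an illustrative brute force with names of our own; for brevity it only rejects cyclic or type-inconsistent completions, while a faithful implementation would also compare immoralities before taking the union (Definition 7):

```python
from itertools import product

def acyclic(edges, vertices):
    """Kahn's algorithm: True iff the directed graph has no cycle."""
    indeg = {v: 0 for v in vertices}
    for _, w in edges:
        indeg[w] += 1
    queue = [v for v, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for a, b in edges:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(indeg)

def t_essential_by_enumeration(directed, undirected, types):
    """Enumerate the 2^u orientations of the undirected t-edges, keep the
    consistent acyclic completions, and return (directed, undirected) edges
    of their union."""
    pairs = sorted({tuple(sorted((types[u], types[v]))) for u, v in undirected})
    completions = []
    for choice in product((0, 1), repeat=len(pairs)):
        direction = {p: (p if c == 0 else p[::-1]) for p, c in zip(pairs, choice)}
        # reject orientations that contradict an already-directed edge
        if any(direction.get(tuple(sorted((types[u], types[v]))))
               not in (None, (types[u], types[v])) for u, v in directed):
            continue
        edges = set(directed)
        for u, v in undirected:
            lo_hi = direction[tuple(sorted((types[u], types[v])))]
            edges.add((u, v) if (types[u], types[v]) == lo_hi else (v, u))
        if acyclic(edges, set(types)):
            completions.append(edges)
    union = set.union(*completions) if completions else set(directed)
    return ({e for e in union if e[::-1] not in union},
            {e for e in union if e[::-1] in union})
```

For example, if one edge between types 1 and 2 is already oriented, every remaining 1–2 edge becomes directed in the union, since the opposite orientation is rejected as inconsistent.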
6 Identification for random graphs
In this section, we explore the benefits in identification that result from using variable typing in a class of graphs generated at random based on a process inspired by the Erdős–Rényi random graph model (Erdős & Rényi, 1959).
Assume we are given a set of types $\mathcal{T} = \{t_1, \dots, t_k\}$, a vector of probabilities $(p_{t_1}, \dots, p_{t_k})$ of observing each type s.t. $\sum_{j=1}^{k} p_{t_j} = 1$, and a t-interaction matrix $P \in [0, 1]^{k \times k}$, where each cell $P_{ts}$ defines the probability that a variable of type $t$ is a direct cause of a variable of type $s$. As per Definition 3 (type consistency), we impose that if $P_{ts} > 0$, then $P_{st} = 0$.
Definition 8 (Random sequence of growing t-DAGs).
We define a random sequence of t-DAGs $(D_i)_{i \geq 1}$ with $D_1 = (\{v_1\}, \emptyset, T)$, such that $D_i$ has $i$ vertices. Each new t-DAG $D_{i+1}$ in the sequence is obtained from $D_i = (V_i, E_i, T)$ as follows: create a new vertex $v_{i+1}$ and sample its type $T(v_{i+1})$ from a categorical distribution with probabilities $(p_{t_1}, \dots, p_{t_k})$; let $V_{i+1} = V_i \cup \{v_{i+1}\}$; to obtain $E_{i+1}$, for every vertex $v \in V_i$, add the edge $(v, v_{i+1})$ to $E_i$ with probability $P_{T(v), T(v_{i+1})}$.
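The growth process of Definition 8 can be sketched as follows. This is a minimal sketch under our reading of the definition (edges are added from existing vertices to the new vertex, which keeps the growing graph acyclic); parameter names are our own:

```python
import random

def grow_tdag(n, type_probs, interaction, seed=0):
    """Grow a random consistent t-DAG one vertex at a time (Definition 8
    sketch). `interaction[s][t]` is the probability that a type-s vertex is a
    direct cause of a type-t vertex; type consistency requires that
    interaction[s][t] > 0 implies interaction[t][s] == 0."""
    rng = random.Random(seed)
    types, edges = [], []
    for i in range(n):
        # sample the type of the new vertex i
        t = rng.choices(range(len(type_probs)), weights=type_probs)[0]
        for v, tv in enumerate(types):
            if rng.random() < interaction[tv][t]:
                edges.append((v, i))  # existing vertex v causes new vertex i
        types.append(t)
    return types, edges
```

Because every edge points from an earlier vertex to a later one, arrival order is a valid topological ordering and no cycle can arise.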
Our main theorem below states that, as we add more vertices to a random sequence of t-DAGs, we get closer to identifiability. We defer the proof to Section A.5.
Theorem 1.
Let $(D_i)_{i \geq 1}$ be a random sequence of growing t-DAGs as defined in Definition 8. Then the size of the t-MEC converges to $1$ exponentially fast, i.e., there exist $a > 1$ and $N \in \mathbb{N}$ such that for all $i \geq N$,
$\Pr\big(|\text{t-MEC}(D_i)| > 1\big) \leq a^{-i},$
where the probability is taken over the generative process of Definition 8.
To give an intuition for the proof, recall that the t-MEC shrinks every time we orient a t-edge (Proposition 3). In addition, Proposition 4 tells us that we can orient a t-edge if we observe a two-type fork structure. We thus argue that, as we add more vertices, the probability of observing a two-type fork for arbitrary type pairs converges to $1$. This argument relies on the fact that the number of types remains constant as the random t-DAG grows.
7 Experiments
We perform multiple experiments to understand how the sizes of MECs and t-MECs compare. (The code used to perform these experiments is available at https://github.com/ElementAI/typeddag.) For a given t-DAG, the size of the MEC is obtained by finding its CPDAG (without considering types) and enumerating all DAGs that do not introduce a cycle or add an immorality. For the same t-DAG, the size of the t-MEC is obtained by applying the algorithm described in Section 5.
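The MEC-size computation described above (orient the undirected CPDAG edges in all ways, rejecting completions that create a cycle or change the immoralities) can be sketched as follows; names and representation are our own:

```python
from itertools import product

def acyclic(edges, vertices):
    """Kahn's algorithm: True iff the directed graph has no cycle."""
    indeg = {v: 0 for v in vertices}
    for _, w in edges:
        indeg[w] += 1
    queue = [v for v, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for a, b in edges:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(indeg)

def immoralities(edges, skeleton):
    """All v-structures a -> b <- c with a and c non-adjacent."""
    return {(frozenset((a, c)), b)
            for a, b in edges for c, b2 in edges
            if b2 == b and a != c and frozenset((a, c)) not in skeleton}

def mec_size(directed, undirected, vertices):
    """Count the DAGs in the MEC represented by a CPDAG: orient each
    undirected edge both ways, keeping completions that are acyclic and
    preserve exactly the CPDAG's immoralities."""
    und = sorted(undirected)
    skeleton = {frozenset(e) for e in list(directed) + und}
    base = immoralities(set(directed), skeleton)
    count = 0
    for choice in product((0, 1), repeat=len(und)):
        edges = set(directed) | {e if c == 0 else e[::-1]
                                 for e, c in zip(und, choice)}
        if acyclic(edges, vertices) and immoralities(edges, skeleton) == base:
            count += 1
    return count
```

The t-MEC size can be counted the same way by additionally rejecting completions that violate type consistency.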
The t-DAGs that we consider are randomly generated according to the process described in Definition 8. Let $k$ be the number of types in the t-DAG. We attribute uniform probability to each type, i.e., $p_{t_j} = 1/k$ for all $j$. The t-interaction matrix is defined as follows. For each pair of types $(t, s)$ s.t. $t \neq s$, the direction of the t-edge is sampled uniformly at random and we use a fixed probability of interaction $p$, which controls the graph density. For example, if the direction $t \rightarrow s$ is sampled, then $P_{ts} = p$ and $P_{st} = 0$. In what follows, unless otherwise specified, the number of vertices, the number of types, and the probability of interaction are held at fixed default values.
In Fig. 3, 4, and 5, the size of the equivalence classes is compared with respect to the number of vertices, the number of types, and the density, respectively. All boxplots are calculated over 100 random consistent t-DAGs. First, in Fig. 3, we see that as the number of vertices increases (and the number of types remains constant), the size of the t-MEC converges to $1$, as demonstrated in Section 6. In contrast, the size of the MEC first increases and then remains nearly constant. Notice how the sizes of the MEC and the t-MEC are identical when the number of vertices equals the number of types; in this case, type consistency does not constrain the graph structure. Second, in Fig. 4, as the number of types increases, the size of the t-MEC increases. This is expected because, as the number of types approaches the number of vertices, type consistency imposes fewer structural constraints. Further, notice that the size of the MEC changes with the number of types, even though it is agnostic to type consistency. This is because t-DAGs with fewer types (e.g., 2) are more likely to contain immoralities (see Proposition 1), leading to smaller MECs. Third, in Fig. 5, as the density increases, the sizes of the MEC and the t-MEC both decrease. This is in line with the observations of He et al. (2015).
In summary, all our experiments indicate that the size of the t-MEC is smaller than or equal to that of the MEC for random t-DAGs. The difference is particularly striking when the number of types is low and the number of vertices is high. Of particular interest are the results shown in Fig. 3, as they provide empirical evidence for the correctness of Theorem 1.
8 Discussion
In this work, we addressed an important problem in causal discovery: it is often impossible to identify the causal graph precisely, due to the size of its Markov equivalence class. This is particularly true for sparse graphs, where the size of the MEC grows super-exponentially with the number of vertices (He et al., 2015). To address this, we proposed a new type of assumption based on variable types, which we formalized as typed directed acyclic graphs (t-DAGs). Our theoretical and empirical results clearly demonstrate that there exist conditions in which our variable-typing assumptions greatly shrink the size of the equivalence class. Hence, when such assumptions hold in the data, gains in identification are to be expected.
We note that the new assumptions that we introduce can be used in conjunction with other strategies to shrink the size of the equivalence class, such as considering interventions (Hauser & Bühlmann, 2012), hard background knowledge on the presence/absence of edges (Meek, 1995), or functional-form assumptions (Peters et al., 2014; Peters & Bühlmann, 2014; Shimizu et al., 2006). Moreover, our assumptions could be used with methods that estimate treatment effects from equivalence classes, such as IDA (Perkovic et al., 2017), to improve their accuracy.
We believe that this work may stimulate new advances at the intersection of machine learning and causality (Schölkopf et al., 2021; Schölkopf, 2019). In fact, machine learning algorithms excel at classification, and thus it may be interesting to explore a setting where the variable types are learned based on some features. Type assignments could be learned in parallel with causal discovery, using recent methods for differentiable causal discovery (Brouillard et al., 2020; Zheng et al., 2018). This may further reduce the burden on the human expert in cases where types are hard to assign. As an example, consider the task of learning causal models of gene regulatory networks. One could train a model to assign types to genes based on their categorization in the Gene Ontology (Gene Ontology Consortium, 2004) or on features of their DNA sequence.
Additionally, an interesting future direction would be to use our typing assumptions to perform causal discovery on multiple graphs at once, i.e., multi-task causal discovery. In fact, assume that we are given data for multiple groups of variables that correspond to disjoint systems (no interactions across groups) but that share similar types. It would then be possible to use type consistency (Definition 3) to propagate t-edge orientations across graphs.
In conclusion, the results reported in this work show that considering typing assumptions has the potential to improve identification in causal discovery. However, we have barely scratched the surface of what is possible. Future work will include extensive experiments to put our theoretical work into practice on real-world datasets and will further explore the aforementioned directions.
Acknowledgements
The authors are grateful to Assya Trofimov, David Berger, and Jean-Philippe Reid for helpful comments and suggestions.
References
 1000 Genomes Project Consortium (1KGP) and others (2010) 1000 Genomes Project Consortium (1KGP) and others. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061, 2010.
 Andersson et al. (1997) Andersson, S. A., Madigan, D., Perlman, M. D., et al. A characterization of markov equivalence classes for acyclic digraphs. Annals of statistics, 25(2):505–541, 1997.

Andrews (2020) Andrews, B. On the completeness of causal discovery in the presence of latent confounding with tiered background knowledge. In The 23rd International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 4002–4011. PMLR, 2020.
 Brouillard et al. (2020) Brouillard, P., Lachapelle, S., Lacoste, A., Lacoste-Julien, S., and Drouin, A. Differentiable causal discovery from interventional data. In Advances in Neural Information Processing Systems 33, 2020.
 Chen et al. (2016) Chen, E. Y., Shen, Y., Choi, A., and Darwiche, A. Learning bayesian networks with ancestral constraints. In Advances in Neural Information Processing Systems 29, pp. 2325–2333, 2016.
 Constantinou et al. (2021) Constantinou, A. C., Guo, Z., and Kitson, N. K. Information fusion between knowledge and data in bayesian network structure learning. arXiv preprint arXiv:2102.00473, 2021.
 Cooper & Herskovits (1992) Cooper, G. F. and Herskovits, E. A bayesian method for the induction of probabilistic networks from data. Machine learning, 9(4):309–347, 1992.

de Haan et al. (2019) de Haan, P., Jayaraman, D., and Levine, S. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems 32, pp. 11693–11704, 2019.
 Erdős & Rényi (1959) Erdős, P. and Rényi, A. On random graphs. Publicationes Mathematicae Debrecen, 6:290–297, 1959.
 Flores et al. (2011) Flores, M. J., Nicholson, A. E., Brunskill, A., Korb, K. B., and Mascaro, S. Incorporating expert knowledge when learning bayesian network structure: a medical case study. Artificial intelligence in medicine, 53(3):181–204, 2011.
 Gene Ontology Consortium (2004) Gene Ontology Consortium. The gene ontology (go) database and informatics resource. Nucleic acids research, 32(suppl_1):D258–D261, 2004.
 Gopnik & Sobel (2000) Gopnik, A. and Sobel, D. M. Detecting blickets: How young children use information about novel causal powers in categorization and induction. Child development, 71(5):1205–1222, 2000.
 Griffiths et al. (2011) Griffiths, T. L., Sobel, D. M., Tenenbaum, J. B., and Gopnik, A. Bayes and blickets: Effects of knowledge on causal induction in children and adults. Cognitive Science, 35(8):1407–1455, 2011.
 Hauser & Bühlmann (2012) Hauser, A. and Bühlmann, P. Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 13(1):2409–2464, 2012.
 He & Yu (2016) He, Y. and Yu, B. Formulas for counting the sizes of markov equivalence classes of directed acyclic graphs. arXiv preprint arXiv:1610.07921, 2016.
 He et al. (2015) He, Y., Jia, J., and Yu, B. Counting and exploring sizes of markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 16(1):2589–2609, 2015.
 Heinze-Deml et al. (2018) Heinze-Deml, C., Maathuis, M. H., and Meinshausen, N. Causal structure learning. Annual Review of Statistics and Its Application, 5:371–391, 2018.
 Kemp et al. (2010) Kemp, C., Goodman, N. D., and Tenenbaum, J. B. Learning to learn causal models. Cognitive Science, 34(7):1185–1243, 2010.
 Li & Beek (2018) Li, A. and Beek, P. Bayesian network structure learning with side constraints. In International Conference on Probabilistic Graphical Models, pp. 225–236. PMLR, 2018.
 Mansinghka et al. (2012) Mansinghka, V., Kemp, C., Griffiths, T., and Tenenbaum, J. Structured priors for structure learning. arXiv preprint arXiv:1206.6852, 2012.
 Meek (1995) Meek, C. Causal inference and causal explanation with background knowledge. arXiv preprint arXiv:1302.4972, 1995.
 Pearl (1995) Pearl, J. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
 Perkovic et al. (2017) Perkovic, E., Kalisch, M., and Maathuis, M. H. Interpreting and using CPDAGs with background knowledge. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2017.

Peters & Bühlmann (2014) Peters, J. and Bühlmann, P. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2014.
 Peters et al. (2014) Peters, J., Mooij, J. M., Janzing, D., and Schölkopf, B. Causal discovery with continuous additive noise models. The Journal of Machine Learning Research, 15(1):2009–2053, 2014.
 Peters et al. (2017) Peters, J., Janzing, D., and Schölkopf, B. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
 Runge et al. (2019) Runge, J., Bathiany, S., Bollt, E., Camps-Valls, G., Coumou, D., Deyle, E., Glymour, C., Kretschmer, M., Mahecha, M. D., Muñoz-Marí, J., et al. Inferring causation from time series in earth system sciences. Nature communications, 10(1):1–13, 2019.
 Scheines et al. (1998) Scheines, R., Spirtes, P., Glymour, C., Meek, C., and Richardson, T. The tetrad project: Constraint based aids to causal model specification. Multivariate Behavioral Research, 33(1):65–117, 1998.
 Schölkopf (2019) Schölkopf, B. Causality for machine learning. arXiv preprint arXiv:1911.10500, 2019.
 Schölkopf et al. (2021) Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.
 Schulz & Gopnik (2004) Schulz, L. E. and Gopnik, A. Causal learning across domains. Developmental psychology, 40(2):162, 2004.
 Shimizu et al. (2006) Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. A linear nongaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
 Spirtes et al. (2000) Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. Causation, prediction, and search. MIT press, 2000.
 Verma & Pearl (1990) Verma, T. S. and Pearl, J. On the equivalence of causal models. arXiv preprint arXiv:1304.1108, 1990.
 Zheng et al. (2018) Zheng, X., Aragam, B., Ravikumar, P., and Xing, E. P. Dags with NO TEARS: continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31, pp. 9492–9503, 2018.
Appendix A Proofs
A.1 Proof of Proposition 1
Proposition 1. $\mathcal{D}^{c}_{d,k} \subsetneq \mathcal{D}_d$ for $k < d$, where $d$ is the number of vertices and $k$ is the number of types.
Proof.
By definition, a t-DAG is a DAG (once types are ignored), so we know that $\mathcal{D}^{c}_{d,k} \subseteq \mathcal{D}_d$.
Since $k < d$, by the pigeonhole principle, at least two vertices have the same type. We claim that two vertices of the same type cannot be linked by an edge. To show the claim is true, let $D$ be a consistent t-DAG, and assume without loss of generality that $T(u) = T(v) = t$ for two of its vertices $u$ and $v$. For the sake of contradiction, assume further that $(u, v) \in E$. Then we have that $(u, v) \in E_{tt}$, so $E_{tt} \neq \emptyset$; but type consistency (Definition 3) applied to the pair $(t, t)$ gives $E_{tt} \neq \emptyset \implies E_{tt} = \emptyset$, contradicting the fact that $D$ is consistent.
Therefore, when $k < d$, the space of consistent t-DAGs excludes at least some DAGs that are in the space of DAGs (e.g., any DAG with an edge between two same-typed vertices), giving us the strict subset $\mathcal{D}^{c}_{d,k} \subsetneq \mathcal{D}_d$. ∎
A.2 Proof of Proposition 2
Proposition 2. Let $\mathcal{E}_t(D)$ and $\mathcal{E}(D)$ be, respectively, the t-essential and essential graphs of an arbitrary consistent t-DAG $D$. Then, $\mathcal{E}_t(D)$ and $\mathcal{E}(D)$ have the same skeleton, and every edge directed in $\mathcal{E}(D)$ is also directed in $\mathcal{E}_t(D)$.
Proof.
It is clear that $\mathcal{E}_t(D)$ and $\mathcal{E}(D)$ have the same skeleton as $D$, since both are obtained by undirecting some edges of $D$. Also, since enforcing type consistency can only orient more edges ($\text{t-MEC}(D) \subseteq \text{MEC}(D)$, so the union in Definition 7 is taken over fewer graphs), we have that every edge directed in $\mathcal{E}(D)$ is also directed in $\mathcal{E}_t(D)$. ∎
A.3 Proof of Proposition 3
Proposition 3. For any consistent t-DAG $D$, we have $|\text{t-MEC}(D)| \leq 2^u$, where $u$ is the number of undirected t-edges in $\mathcal{E}_t(D)$, the t-essential graph of $D$.
Proof.
A t-essential graph is the union of consistent t-DAGs. First, note that we do not have to consider every edge of a consistent t-DAG independently since, by consistency, all the edges included in a t-edge of $\mathcal{E}_t(D)$ always take the same orientation. Thus, if a t-edge is undirected in $\mathcal{E}_t(D)$, it means that there exists at least one consistent t-DAG in $\text{t-MEC}(D)$ for each orientation of the t-edge. Since each of the $u$ undirected t-edges can take on two directions, there are at most $2^u$ possible combinations. Note that this is only an upper bound: some of these orientations are not part of the equivalence class, since they create either a cycle or new immoralities not present in $D$. ∎
A.4 Proof of Proposition 4
Proposition 4. If a consistent t-DAG contains vertices $v_1, v_3$ of type $t$ and $v_2$ of type $s \neq t$, with edges $v_1 \leftarrow v_2 \rightarrow v_3$, then the t-edge $s \rightarrow t$ is directed in the t-essential graph, i.e., the direction of causation between types $s$ and $t$ is known.
Proof.
To prove the statement, we show that among the possible states $s \rightarrow t$, $t \rightarrow s$, and undirected of the t-edge, the last two are not valid.
For the sake of contradiction, first assume $t \rightarrow s$ is directed in the t-essential graph of $D$. This means that there exists a consistent t-DAG $D'$, having t-edge $t \rightarrow s$, that is Markov equivalent to $D$. Recall that two graphs are Markov equivalent if and only if they have the same skeleton and the same immoralities (Verma & Pearl, 1990). Given that $T(v_1) = T(v_3) = t$, $D'$ has the structure $v_1 \rightarrow v_2 \leftarrow v_3$, which forms an immorality ($v_1$ and $v_3$ cannot be adjacent, since they share a type). But since $D$ does not contain this immorality, this contradicts the fact that $D'$ is Markov equivalent to $D$.
Now, suppose that the t-edge is not directed in the t-essential graph of $D$. This means that there exist two consistent t-DAGs $D'$ and $D''$ that are Markov equivalent to $D$, having the t-edge orientations $s \rightarrow t$ and $t \rightarrow s$, respectively. As per the argument in the previous case, the existence of $D''$ leads to a contradiction.
Therefore, the only possible orientation for the t-edge between types $s$ and $t$ is $s \rightarrow t$. ∎
A.5 Proof of Theorem 1
Let $(\mathcal{D}_n)_{n \in \mathbb{N}}$ be a random sequence of growing t-DAGs as defined in Definition 8. Then the size of the t-MEC converges to $1$ exponentially fast, i.e., there exist $a > 1$ and $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$,
$$\Pr\big(|\text{t-MEC}(\mathcal{D}_n)| > 1\big) \le c\, a^{-n},$$
where $c > 0$ is a constant that does not depend on $n$.
Proof.
To demonstrate this, first recall that the t-MEC shrinks every time we orient a t-edge (Proposition 3). In addition, Proposition 4 tells us that we can orient a t-edge whenever we observe a two-type fork. Therefore, proving the theorem is equivalent to showing that, as $n$ goes to infinity, and for an arbitrary pair of types $t_1$ and $t_2$ with $t_1 \ne t_2$, the probability of observing a two-type fork over $t_1$ and $t_2$ converges to $1$ exponentially fast. Our proof relies on the fact that, as $n$ grows, the number of types remains constant.
Case 1: a single type.
Let $\mathcal{D}_n$ be a random t-DAG. If $\mathcal{D}_n$ has vertices of only one type, then $\mathcal{D}_n$ is a set of disconnected vertices; hence $|\text{t-MEC}(\mathcal{D}_n)| = 1$ and the theorem holds.
Case 2: at least two types.
Let $F_{t_1 t_2}$ represent the event of observing a two-type fork over types $t_1$ and $t_2$. Our aim is to bound the probability of this event from below and show that it converges exponentially fast to $1$ as $n \to \infty$. We split our search into two steps. Let $N_{t_2}$ be the number of vertices of type $t_2$ after observing the first $\lfloor n/2 \rfloor$ vertices in the topological ordering induced by $\mathcal{D}_n$. Then, we have:
$$\Pr(N_{t_2} \ge 1) = 1 - B(0; \lfloor n/2 \rfloor, p_{t_2}), \quad (1)$$
where $B(x; m, p)$ is the Binomial probability of $x$ successes out of $m$ trials with success probability $p$, and $p_{t_2}$ is the probability of sampling a vertex of type $t_2$. In the second step, we search for at least two vertices of type $t_1$ that are direct effects of $u$, the first vertex of type $t_2$ found in the first step. Let $M$ be the count of such vertices among the last $\lceil n/2 \rceil$ vertices, and let $q$ be the probability of sampling a vertex of type $t_1$ and connecting it to $u$. Then, we have:
$$\Pr(M \ge 2) = 1 - B(0; \lceil n/2 \rceil, q) - B(1; \lceil n/2 \rceil, q). \quad (2)$$
Combining these two steps, we have:
$$\Pr(F_{t_1 t_2}) \ge \Pr(N_{t_2} \ge 1)\,\Pr(M \ge 2) \ge 1 - B(0; \lfloor n/2 \rfloor, p_{t_2}) - B(0; \lceil n/2 \rceil, q) - B(1; \lceil n/2 \rceil, q). \quad (3)$$
The last inequality arises from the fact that the remaining terms of the expanded product are strictly positive. Now, we look at the probability of this event not happening:
$$\Pr(\bar{F}_{t_1 t_2}) \le B(0; \lfloor n/2 \rfloor, p_{t_2}) + B(0; \lceil n/2 \rceil, q) + B(1; \lceil n/2 \rceil, q). \quad (4)$$
Using $B(0; m, p) = (1 - p)^m$ and $B(1; m, p) = m p (1 - p)^{m - 1}$, we can rewrite this as follows:
$$\Pr(\bar{F}_{t_1 t_2}) \le (1 - p_{t_2})^{\lfloor n/2 \rfloor} + (1 - q)^{\lceil n/2 \rceil} + \lceil n/2 \rceil\, q (1 - q)^{\lceil n/2 \rceil - 1}. \quad (5)$$
Since $0 < p_{t_2} < 1$ and $0 < q < 1$, this converges to $0$ as $n \to \infty$. Now, we look at the two different convergence rates, one for the first term and one for the last two terms.
Rate 1.
For the first term, there exist constants $a_1 > 1$ and $n_1$ such that for all $n \ge n_1$:
$$(1 - p_{t_2})^{\lfloor n/2 \rfloor} \le a_1^{-n}. \quad (6)$$
Since $0 < p_{t_2} < 1$, this is true for any $1 < a_1 < (1 - p_{t_2})^{-1/2}$ and $n$ sufficiently large.
Rate 2.
Similarly, for the last two terms, there exist constants $a_2 > 1$ and $n_2$ such that for all $n \ge n_2$:
$$(1 - q)^{\lceil n/2 \rceil} + \lceil n/2 \rceil\, q (1 - q)^{\lceil n/2 \rceil - 1} \le a_2^{-n}.$$
Since $0 < q < 1$ and the polynomial factor $\lceil n/2 \rceil$ grows slower than any exponential, this is true for any $1 < a_2 < (1 - q)^{-1/2}$ and $n$ sufficiently large.
Combining the two rates.
By combining the two rates, and the fact that the number of pairs of types remains constant as $n$ grows, we have that $\Pr(\bar{F}_{t_1 t_2})$ converges exponentially fast to $0$ with rate $a = \min(a_1, a_2)$:
$$\Pr(\bar{F}_{t_1 t_2}) \le a_1^{-n} + a_2^{-n} \le 2\, a^{-n} \quad \text{for all } n \ge \max(n_1, n_2).$$
∎
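The exponential decay of the Binomial tails driving the proof can be sanity-checked numerically. The parameters below are illustrative, not taken from the paper: the probability that fewer than two suitable vertices appear among $m$ draws shrinks exponentially as $m$ grows.

```python
from math import comb

def binom_below(m, p, k):
    """P(X < k) for X ~ Binomial(m, p), computed from the pmf."""
    return sum(comb(m, i) * (p ** i) * ((1 - p) ** (m - i)) for i in range(k))

# Failure probability of one search step: fewer than 2 suitable vertices
# among m draws. Doubling m roughly squares the tail probability.
p = 0.3  # assumed success probability per draw
probs = [binom_below(m, p, 2) for m in (10, 20, 40, 80)]
print(probs)
```

The printed tail probabilities decrease by orders of magnitude at each doubling of $m$, mirroring the $a^{-n}$ bound obtained by combining the two rates.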
Appendix B Additional counterexamples
In this section, we give two additional examples where the algorithm presented in Section 5, without the enumeration step, would not orient some edges that are oriented in the t-essential graph.
Our first example is interesting because it shows that, in order to decide the orientation of a t-edge, several t-edges (possibly not local) must sometimes be considered simultaneously. The second counterexample shows that looking only at the direct parent or child of a t-edge is not always sufficient.
The first example is presented in Fig. 6. Note that vertices denoted by the same letter have the same type. The algorithm orients one t-edge, since one of its edges in the t-DAG is part of an immorality. All other edges are left unoriented because they are not covered by any rule. However, in the t-essential graph (see Fig. 6c) a further t-edge is oriented. To see why this is the case, consider the four possible joint orientations of the two undirected t-edges (recall that an orientation can create neither a cycle nor a new immorality):
1. First combination: possible.
2. Second combination: impossible (it creates an immorality that is not present in the original t-DAG).
3. Third combination: possible.
4. Fourth combination: impossible (it creates a cycle).
In the two configurations that are possible, the t-edge in question is always oriented the same way. Thus, it is an essential edge that should have been recovered by the algorithm.
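The enumerate-and-filter reasoning used above can be sketched in code. The graph below is a deliberately simpler stand-in (the structure of Fig. 6 is not reproduced here), and `forced_t_edges` is a hypothetical helper: it keeps only the orientations that are acyclic and add no immorality beyond those of the already-oriented edges, then reports which orientations of each t-edge survive.

```python
from itertools import product
from collections import defaultdict

def immoralities(edges):
    """Colliders a -> c <- b whose parents a, b are non-adjacent."""
    parents = defaultdict(set)
    for a, b in edges:
        parents[b].add(a)
    adjacent = {frozenset(e) for e in edges}
    return {(a, c, b)
            for c, ps in parents.items()
            for a in ps for b in ps
            if a < b and frozenset((a, b)) not in adjacent}

def is_acyclic(edges, nodes):
    """Kahn's algorithm: True iff the directed edges form a DAG."""
    indeg = {v: 0 for v in nodes}
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1
    stack = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while stack:
        u = stack.pop()
        seen += 1
        for w in out[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                stack.append(w)
    return seen == len(nodes)

def forced_t_edges(nodes, fixed, free_t_edges):
    """For each free t-edge, collect the orientations (0 or 1) that
    survive the acyclicity and no-new-immorality checks."""
    base = immoralities(fixed)
    surviving = defaultdict(set)
    for choice in product([0, 1], repeat=len(free_t_edges)):
        edges = list(fixed)
        for flip, pairs in zip(choice, free_t_edges):
            edges += [(u, v) if flip == 0 else (v, u) for u, v in pairs]
        if is_acyclic(edges, nodes) and immoralities(edges) <= base:
            for i, flip in enumerate(choice):
                surviving[i].add(flip)
    return dict(surviving)

# Assumed stand-in graph: x -> y is already oriented; the single free
# t-edge covers y - w. Reversing it would create the new immorality
# x -> y <- w, so only the orientation y -> w survives.
nodes = {"x", "y", "w"}
result = forced_t_edges(nodes, [("x", "y")], [[("y", "w")]])
print(result)  # t-edge 0 survives only with orientation 0, i.e. y -> w
```

A t-edge whose surviving set is a singleton is essential, which is exactly the information the enumeration of Section 5 recovers and the rule-based pass misses in the two examples above.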
The second example is presented in Fig. 7. The dashed line between the two endpoint vertices represents a path that does not contain oriented edges in the t-essential graph; thus, the t-DAG does not contain any immorality. Without loss of generality, let us treat the dashed line as a chain. The algorithm does not orient any t-edges because none are covered by any rule. However, in the t-essential graph (see Fig. 7c) one t-edge is oriented. Consider its impossible orientation. By construction, the t-DAG contains no immorality, so the edges of the chain must be oriented in one of the two possible ways; in both cases, a new immorality is created, leading to a contradiction. Thus, the t-edge has to be oriented in the remaining direction.