1. Introduction
Chain graphs (CGs) are graphs with possibly directed and undirected edges but without semidirected cycles. They have been extensively studied as a formalism to represent probabilistic independence models, because they can model symmetric and asymmetric relationships between random variables. Moreover, they are much more expressive than directed acyclic graphs (DAGs) and undirected graphs (UGs) (Sonntag and Peña, 2016). There are three different interpretations of CGs as independence models: the Lauritzen-Wermuth-Frydenberg (LWF) interpretation (Lauritzen, 1996), the multivariate regression (MVR) interpretation (Cox and Wermuth, 1996), and the Andersson-Madigan-Perlman (AMP) interpretation (Andersson et al., 2001). No interpretation subsumes another (Andersson et al., 2001; Sonntag and Peña, 2015). Moreover, AMP and MVR CGs are coherent with data generation by block-recursive normal linear regressions (Andersson et al., 2001).
Richardson (2003) extends MVR CGs by (i) relaxing the semidirected acyclicity constraint so that only directed cycles are forbidden, and (ii) allowing up to two edges between any pair of nodes. The resulting models are called acyclic directed mixed graphs (ADMGs). These are the models on which Pearl’s do-calculus operates to determine whether the causal effect of an intervention is identifiable from observed quantities (Pearl, 2009). In this paper, we make the same two extensions to AMP CGs. We call our ADMGs alternative as opposed to the ones proposed by Richardson, which we call original. It is worth mentioning that neither the original ADMGs nor any other family of mixed graphical models that we know of (e.g. summary graphs (Cox and Wermuth, 1996), ancestral graphs (Richardson and Spirtes, 2002), MC graphs (Koster, 2002) or loopless mixed graphs (Sadeghi and Lauritzen, 2014)) subsumes AMP CGs and hence our alternative ADMGs. To see this, we refer the reader to the works by Richardson and Spirtes (2002, p. 1025) and Sadeghi and Lauritzen (2014, Section 4.1). Therefore, our work complements the existing works.
The rest of the paper is organized as follows. Section 2 introduces some preliminaries. Sections 3 and 4 introduce global, and ordered local and pairwise Markov properties for our ADMGs, and prove their equivalence. When the random variables are continuous, Section 5 offers an intuitive interpretation of our ADMGs as systems of structural equations with correlated errors, so that Pearl’s do-calculus can easily be adapted to them. Section 6 describes an exact algorithm for learning our ADMGs from observational and interventional data via answer set programming (Gelfond, 1988; Niemelä, 1999; Simons et al., 2002). We close the paper with some discussion in Section 7. Formal proofs of the claims made in this paper can be found in the appendix.
2. Preliminaries
In this section, we introduce some concepts about graphical models. Unless otherwise stated, all the graphs and probability distributions in this paper are defined over a finite set $V$. The elements of $V$ are not distinguished from singletons. An ADMG is a graph with possibly directed and undirected edges but without directed cycles. There may be up to two edges between any pair of nodes, but in that case the edges must be different and one of them must be undirected to avoid directed cycles. Edges between a node and itself are not allowed. See Figure 1 for two examples of ADMGs.
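Given an ADMG $G$, we represent with $A \multimap B$ that $A \rightarrow B$ or $A - B$ (or both) is in $G$. The parents of $X \subseteq V$ in $G$ are $pa_G(X) = \{A \mid A \rightarrow B \text{ is in } G \text{ with } B \in X\}$. The children of $X$ in $G$ are $ch_G(X) = \{A \mid A \leftarrow B \text{ is in } G \text{ with } B \in X\}$. The neighbours of $X$ in $G$ are $ne_G(X) = \{A \mid A - B \text{ is in } G \text{ with } B \in X\}$. The ancestors of $X$ in $G$ are $an_G(X) = \{A \mid A \rightarrow \cdots \rightarrow B \text{ is in } G \text{ with } B \in X \text{ or } A \in X\}$. The descendants of $X$ in $G$ are $de_G(X) = \{A \mid A \leftarrow \cdots \leftarrow B \text{ is in } G \text{ with } B \in X \text{ or } A \in X\}$. The semidescendants of $X$ in $G$ are $sde_G(X) = \{A \mid B \multimap \cdots \multimap A \text{ is in } G \text{ with } B \in X \text{ or } A \in X\}$. The nonsemidescendants of $X$ in $G$ are $nsde_G(X) = V \setminus sde_G(X)$. The connectivity component of $X$ in $G$ is $cc_G(X) = \{A \mid A - \cdots - B \text{ is in } G \text{ with } B \in X \text{ or } A \in X\}$. The connectivity components in $G$ are denoted as $cc(G)$. A route between a node $V_1$ and a node $V_n$ on $G$ is a sequence of (not necessarily distinct) nodes $V_1, \ldots, V_n$ such that $V_i$ and $V_{i+1}$ are adjacent in $G$ for all $1 \leq i < n$. If the nodes in the route are all distinct, then the route is called a path. Finally, the subgraph of $G$ induced by $X \subseteq V$, denoted as $G_X$, is the graph over $X$ that has all and only the edges in $G$ with both ends in $X$.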
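The set-valued relations above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `ADMG`, its method names, and the choice of storing directed edges as (tail, head) pairs and undirected edges as two-element frozensets are assumptions made here, not part of the paper.

```python
from collections import deque


class ADMG:
    """Hypothetical minimal container for an ADMG (names are illustrative)."""

    def __init__(self, nodes, directed=(), undirected=()):
        self.nodes = set(nodes)
        self.directed = {(a, b) for a, b in directed}        # a -> b
        self.undirected = {frozenset(e) for e in undirected}  # a - b

    def parents(self, xs):
        return {a for a, b in self.directed if b in xs}

    def children(self, xs):
        return {b for a, b in self.directed if a in xs}

    def neighbours(self, xs):
        out = set()
        for edge in self.undirected:
            a, b = tuple(edge)
            if b in xs:
                out.add(a)
            if a in xs:
                out.add(b)
        return out

    def _closure(self, xs, step):
        # Repeatedly apply `step` (parents, children or neighbours) from xs.
        seen, queue = set(xs), deque(xs)
        while queue:
            node = queue.popleft()
            for nxt in step({node}) - seen:
                seen.add(nxt)
                queue.append(nxt)
        return seen

    def ancestors(self, xs):      # contains xs itself
        return self._closure(xs, self.parents)

    def descendants(self, xs):    # contains xs itself
        return self._closure(xs, self.children)

    def component(self, xs):      # connectivity component of xs
        return self._closure(xs, self.neighbours)

    def has_no_directed_cycle(self):
        # A directed cycle through a exists iff a descends from one of its children.
        return all(a not in self.descendants(self.children({a})) for a in self.nodes)


g = ADMG("ABCD", directed=[("A", "B"), ("B", "C")], undirected=[("A", "B"), ("B", "D")])
cyclic = ADMG("AB", directed=[("A", "B"), ("B", "A")])
```

Note that `g` contains both a directed and an undirected edge between A and B, which the definition of ADMGs allows.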
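Every probability distribution $p$ satisfies the following four properties, where $X$, $Y$, $W$ and $Z$ are disjoint subsets of $V$: symmetry $X \perp_p Y \mid Z \Rightarrow Y \perp_p X \mid Z$, decomposition $X \perp_p Y \cup W \mid Z \Rightarrow X \perp_p Y \mid Z$, weak union $X \perp_p Y \cup W \mid Z \Rightarrow X \perp_p Y \mid Z \cup W$, and contraction $X \perp_p Y \mid Z \cup W \land X \perp_p W \mid Z \Rightarrow X \perp_p Y \cup W \mid Z$. If $p$ is strictly positive, then it also satisfies the intersection property $X \perp_p Y \mid Z \cup W \land X \perp_p W \mid Z \cup Y \Rightarrow X \perp_p Y \cup W \mid Z$. Some (not yet characterized) probability distributions also satisfy the composition property $X \perp_p Y \mid Z \land X \perp_p W \mid Z \Rightarrow X \perp_p Y \cup W \mid Z$.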
3. Global Markov Property for ADMGs
In this section, we introduce four separation criteria for ADMGs. Moreover, we show that they are all equivalent for strictly positive probability distributions. A probability distribution is said to satisfy the global Markov property with respect to an ADMG if every separation in the graph can be interpreted as an independence in the distribution.
Criterion 1. A node on a path in an ADMG is said to be a collider on the path if is a subpath. Moreover, the path is said to be connecting given when

every collider on the path is in , and

every noncollider on the path is outside unless is a subpath and .
Let , and denote three disjoint subsets of . When there is no path in connecting a node in and a node in given , we say that is separated from given in and denote it as .
Criterion 2. A node on a route in an ADMG is said to be a collider on the route if is a subroute. Note that maybe . Moreover, the route is said to be connecting given when

every collider on the route is in , and

every noncollider on the route is outside .
Let , and denote three disjoint subsets of . When there is no route in connecting a node in and a node in given , we say that is separated from given in and denote it as .
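Criterion 2 lends itself to a simple search, because routes may repeat nodes and so only the current node and the end of the last traversed edge need to be remembered. The sketch below is illustrative only. Since the collider configurations are not legible in this copy of the text, it assumes the AMP-style definition (a node is a collider if the two route edges meeting at it are arrowhead/arrowhead or arrowhead/undirected); the function names and the edge encoding are also assumptions.

```python
from collections import deque


def edge_steps(directed, undirected):
    """List (a, b, mark_at_a, mark_at_b) for both traversal directions.
    Marks: '>' arrowhead at that end, 't' tail, '-' undirected end."""
    steps = []
    for a, b in directed:      # a -> b
        steps.append((a, b, "t", ">"))
        steps.append((b, a, ">", "t"))
    for a, b in undirected:    # a - b
        steps.append((a, b, "-", "-"))
        steps.append((b, a, "-", "-"))
    return steps


def is_collider(mark_in, mark_out):
    # ASSUMED rule: A -> C <- B, A -> C - B, or A - C <- B.
    return {mark_in, mark_out} in ({">"}, {">", "-"})


def route_connected(directed, undirected, x, y, z):
    """True if some route between x and y is connecting given the set z."""
    z = set(z)
    steps = edge_steps(directed, undirected)
    # State: (current node, mark at that node of the edge used to reach it).
    start = {(b, mark_b) for a, b, _, mark_b in steps if a == x}
    seen, queue = set(start), deque(start)
    while queue:
        node, mark_in = queue.popleft()
        if node == y:
            return True
        for a, b, mark_out, mark_next in steps:
            if a != node:
                continue
            # Colliders must be in z; non-colliders must be outside z.
            if is_collider(mark_in, mark_out) != (node in z):
                continue
            if (b, mark_next) not in seen:
                seen.add((b, mark_next))
                queue.append((b, mark_next))
    return False


# A - B - C with D -> B: conditioning on B alone does not separate A and C,
# because the route A - B <- D -> B - C visits B twice as a collider.
chain = [("A", "B"), ("B", "C")]
```

The state space has at most three marks per node, so the search is linear in the number of (node, mark, edge) combinations.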
Criterion 3. Let denote the UG over that contains all and only the undirected edges in . The extended subgraph with is defined as . Two nodes and in are said to be collider connected if there is a path between them such that every nonendpoint node is a collider, i.e. or . Such a path is called a collider path. Note that a single edge forms a collider path. The augmented graph is the UG over such that is in if and only if and are collider connected in . The edge is called augmented if it is in but and are not adjacent in . A path in is said to be connecting given if no node on the path is in . Let , and denote three disjoint subsets of . When there is no path in connecting a node in and a node in given , we say that is separated from given in and denote it as .
Criterion 4. Given an UG over and , we define the marginal graph as the UG over such that is in if and only if is in or is with . We define the marginal extended subgraph as . Let , and denote three disjoint subsets of . When there is no path in connecting a node in and a node in given , we say that is separated from given in and denote it as .
The first three separation criteria introduced above coincide with those introduced by Andersson et al. (2001) and Levitz et al. (2001) for AMP CGs. The equivalence for AMP CGs of these three criteria has been proven by Levitz et al. (2001, Theorem 4.1). The following theorems prove the equivalence for ADMGs of the four separation criteria introduced above.
Theorem 1.
There is a path in an ADMG connecting a node in and a node in given if and only if there is a path in connecting a node in and a node in given .
Theorem 2.
There is a path in an ADMG connecting and given if and only if there is a route in connecting and given .
Theorem 3.
Given an ADMG , there is a path in connecting a node in and a node in given if and only if there is a path in connecting a node in and a node in given .
Unlike in AMP CGs, two nonadjacent nodes in an ADMG are not necessarily separated. For example, does not hold for any in the ADMGs in Figure 1. This drawback is shared by the original ADMGs (Evans and Richardson, 2013, p. 752), summary graphs and MC graphs (Richardson and Spirtes, 2002, p. 1023), and ancestral graphs (Richardson and Spirtes, 2002, Section 3.7). For ancestral graphs, the problem can be solved by adding edges to the graph without altering the separations represented until every missing edge corresponds to a separation (Richardson and Spirtes, 2002, Section 5.1). A similar solution does not exist for our ADMGs (we omit the details).
4. Ordered Local and Pairwise Markov Properties for ADMGs
In this section, we introduce ordered local and pairwise Markov properties for ADMGs. Given an ADMG , the directed acyclicity of implies that we can specify a total ordering () of the nodes of such that only if . Such an ordering is said to be consistent with . Let the predecessors of with respect to be defined as or . Given , we define the Markov blanket of with respect to as . We say that a probability distribution satisfies the ordered local Markov property with respect to and if for any and such that
for all .
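A consistent ordering can be obtained by topologically sorting the nodes with respect to the directed edges only, since the undirected edges impose no ordering constraint. The following is a minimal sketch using Python's standard `graphlib` module; the function name `consistent_ordering` and the (tail, head) edge encoding are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def consistent_ordering(nodes, directed):
    """Return a total ordering in which every node appears after its ancestors.
    Raises graphlib.CycleError if the directed edges contain a cycle."""
    predecessors = {n: set() for n in nodes}
    for tail, head in directed:   # tail -> head
        predecessors[head].add(tail)
    return list(TopologicalSorter(predecessors).static_order())


order = consistent_ordering(["C", "B", "A"], [("A", "B"), ("B", "C")])
```

In general several consistent orderings exist; the sketch returns just one of them.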
Theorem 4.
Given a probability distribution satisfying the intersection property, satisfies the global Markov property with respect to an ADMG if and only if it satisfies the ordered local Markov property with respect to the ADMG and a consistent ordering of its nodes.
Similarly, we say that a probability distribution satisfies the ordered pairwise Markov property with respect to and if for any and such that
for all nodes that are not adjacent in , and where denotes the nodes in .
Theorem 5.
Given a probability distribution satisfying the intersection property, satisfies the global Markov property with respect to an ADMG if and only if it satisfies the ordered pairwise Markov property with respect to the ADMG and a consistent ordering of its nodes.
For each and such that , the ordered local Markov property specifies an independence for each . The number of independences to specify can be reduced by noting that and, thus, we do not need to consider every set but only those that are ancestral, i.e. those such that . The number of independences to specify can be further reduced by considering only maximal ancestral sets, i.e. those sets such that for every ancestral set such that . The independences for the nonmaximal ancestral sets follow from the independences for the maximal ancestral sets by decomposition. A characterization of the maximal ancestral sets is possible but notationally cumbersome (we omit the details). All in all, for each node and maximal ancestral set, the ordered local Markov property specifies an independence for each node in the set. This number is greater than for the original ADMGs, where a single independence is specified for each node and maximal ancestral set (Richardson, 2003, Section 3.1). Even fewer independences are needed for the original ADMGs when interpreted as linear causal models (Kang and Tian, 2009, Section 4). All in all, our ordered local Markov property serves its purpose, namely to identify a subset of the independences in the global Markov property that implies the rest.
Note that Andersson et al. (2001, Theorem 3) describe local and pairwise Markov properties for AMP CGs that are equivalent to the global one under the assumption of the intersection and composition properties. Our ordered local and pairwise Markov properties above only require assuming the intersection property. Note that this assumption is also needed to prove similar results for much simpler models such as UGs (Lauritzen, 1996, Theorem 3.7). For AMP CGs, however, we can do better than just using the ordered local and pairwise Markov properties for ADMGs above. Specifically, we introduce in the next section neater local and pairwise Markov properties for AMP CGs under the intersection property assumption. Later on, we will also use them to prove some results for ADMGs.
4.1. Local and Pairwise Markov Properties for AMP CGs
Andersson et al. (2001, Theorem 2) introduce the following block-recursive Markov property. A probability distribution satisfies the global Markov property with respect to an AMP CG if and only if the following three properties hold for all :

C1: .

C2: satisfies the global Markov property with respect to .

C3: for all .
We simplify the block-recursive Markov property as follows.
Theorem 6.
C1, C2 and C3 hold if and only if the following two properties hold:

C1: for all .

C2: satisfies the global Markov property with respect to .
Andersson et al. (2001, Theorem 3) also introduce the following local Markov property. A probability distribution satisfying the intersection and composition properties satisfies the global Markov property with respect to an AMP CG if and only if the following two properties hold for all :

L1: for all .

L2: for all .
We introduce below a local Markov property that is equivalent to the global one under the assumption of the intersection property only.
Theorem 7.
A probability distribution satisfying the intersection property satisfies the global Markov property with respect to an AMP CG if and only if the following two properties hold for all :

L1: for all .

L2: for all and .
Finally, Andersson et al. (2001, Theorem 3) also introduce the following pairwise Markov property. A probability distribution satisfying the intersection and composition properties satisfies the global Markov property with respect to an AMP CG if and only if the following two properties hold for all :

P1: for all and .

P2: for all and .
We introduce below a pairwise Markov property that is equivalent to the global one under the assumption of the intersection property only.
Theorem 8.
A probability distribution satisfying the intersection property satisfies the global Markov property with respect to an AMP CG if and only if the following two properties hold for all :

P1: for all and .

P2: for all , and .
5. Causal Interpretation of ADMGs
Let us assume that $V$ is normally distributed. In this section, we show that an ADMG $G$ can be interpreted as a system of structural equations with correlated errors. Specifically, the system includes an equation for each $A \in V$, which is of the form
$$A = \beta_A \, pa_G(A) + \epsilon_A$$
where $\epsilon_A$ denotes the error term. The error terms are represented implicitly in $G$. They can be represented explicitly by magnifying $G$ into the ADMG $G'$ as follows:
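Set $G' = G$
For each node $A$ in $G$
    Add the node $\epsilon_A$ and the edge $\epsilon_A \rightarrow A$ to $G'$
For each edge $A - B$ in $G$
    Replace $A - B$ with the edge $\epsilon_A - \epsilon_B$ in $G'$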
The magnification above basically consists in adding the error nodes to $G$ and connecting them appropriately. Figure 2 shows an example. Note that every node is determined by and that is determined by . Let denote all the error nodes in . Formally, we say that is determined by when or is a function of . We use to denote all the nodes that are determined by . From the point of view of the separations, that a node outside the conditioning set of a separation is determined by the conditioning set has the same effect as if the node were actually in the conditioning set. Bearing this in mind, it is not difficult to see that, as desired, and represent the same separations over . The following theorem formalizes this result.
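The magnification procedure can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `magnify`, the representation of the error node of a node `a` as the tuple `("eps", a)`, and the edge encoding (directed edges as (tail, head) pairs, undirected edges as pairs) are choices made here, not taken from the paper.

```python
def magnify(nodes, directed, undirected):
    """Return (nodes, directed, undirected) of the magnified graph:
    every node gets an error node pointing into it, and every undirected
    edge is moved to the corresponding pair of error nodes."""
    err = {a: ("eps", a) for a in nodes}   # hypothetical naming of error nodes
    new_nodes = set(nodes) | set(err.values())
    new_directed = set(directed) | {(err[a], a) for a in nodes}
    new_undirected = {frozenset((err[a], err[b])) for a, b in undirected}
    return new_nodes, new_directed, new_undirected


m_nodes, m_directed, m_undirected = magnify(
    {"A", "B"}, directed={("A", "B")}, undirected={("A", "B")}
)
```

After magnification, undirected edges connect error nodes only, so the original nodes are related to each other exclusively through directed edges.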
Theorem 9.
Let , and denote three disjoint subsets of . Then, if and only if .
Finally, let such that if is not in . Then, can be interpreted as a system of structural equations with correlated errors as follows. For any
(1) 
and for any other
(2) 
The following two theorems confirm that the interpretation above works as intended. A similar result to the second theorem exists for the original ADMGs (Koster, 1999, Theorem 1).
Theorem 11.
The equations above specify each node as a linear function of its parents with additive normal noise. The equations can be generalized to nonlinear or nonparametric functions as long as the noise remains additive normal. That is, for all , with such that if is not in . That the noise is additive normal ensures that is determined by , which is needed for Theorem 9 to remain valid which, in turn, is needed for Theorem 11 to remain valid.
A less formal but more intuitive alternative interpretation of ADMGs is as follows. We can interpret the parents of each node in an ADMG as its observed causes. Its unobserved causes are grouped into an error node that is represented implicitly in the ADMG. We can interpret the undirected edges in the ADMG as the correlation relationships between the different error nodes. The causal structure is constrained to be a DAG, but the correlation structure can be any UG. This causal interpretation of our ADMGs parallels that of the original ADMGs (Pearl, 2009). There are however two main differences. First, the noise in the original ADMGs is not necessarily additive normal. Second, the correlation structure of the error nodes in the original ADMGs is represented by a covariance graph, i.e. a graph with only bidirected edges (Pearl and Wermuth, 1993). Therefore, whereas a missing edge between two error nodes in the original ADMGs represents marginal independence, in our ADMGs it represents conditional independence given the rest of the error nodes. This means that the original and our ADMGs represent complementary causal models. Consequently, there are scenarios where the identification of the causal effect of an intervention is not possible with the original ADMGs but is possible with ours, and vice versa. We elaborate on this in the next section.
5.1. do-calculus for ADMGs
We start by adapting Pearl’s do-calculus, which operates on the original ADMGs, to our ADMGs. The original do-calculus consists of the following three rules, whose repeated application makes it possible in some cases to identify (i.e. compute) the causal effect of an intervention from observed quantities:

Rule 1 (insertion/deletion of observations):
if .

Rule 2 (action/observation exchange):
if .

Rule 3 (insertion/deletion of actions):
if .
where , , and are disjoint subsets of , is the original ADMG augmented with an intervention random variable and an edge for every , and “” denotes an intervention on X in , i.e. any edge with an arrowhead into any node in is removed. See Pearl (1995, p. 686) for further details and the proof that the rules are sound. Fortunately, the rules also apply to our ADMGs by simply redefining “” appropriately. The proof that the rules are still sound is essentially the same as before. Specifically, “” should now be implemented as follows:

Delete all the directed edges pointing to nodes in ,

for every path with and , add the edge , and

delete all the undirected edges with an endnode in .
The first step is the same as an intervention in an original ADMG. The second and third steps of the intervention are best understood in terms of the magnified ADMG : They correspond to marginalizing the error nodes associated to the nodes in out of , the UG that represents the correlation structure of the error nodes. In other words, they replace with , the marginal graph of over . This makes sense since is no longer associated to due to the intervention and, thus, we may want to marginalize it out because it is unobserved. This is exactly what the second and third steps of the intervention imply. To see it, note that the ADMG after the intervention and the magnified ADMG after the intervention represent the same separations over , by Theorem 9.
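The three steps of the intervention can be sketched as a small graph transformation. The sketch is illustrative and rests on an assumption: the path pattern in the second step is not legible in this copy of the text, so the code assumes it refers to undirected paths between two non-intervened nodes whose intermediate nodes are all intervened, which matches the marginalization reading given above. The function name `intervene` and the edge encoding are also assumptions.

```python
def intervene(nodes, directed, undirected, targets):
    """Return (directed, undirected) edges after intervening on `targets`."""
    targets = set(targets)
    und = {frozenset(e) for e in undirected}

    # Step 1: delete directed edges pointing to intervened nodes.
    new_directed = {(a, b) for a, b in directed if b not in targets}

    # Step 2 (ASSUMED reading): connect non-intervened nodes that are joined
    # by an undirected path whose intermediate nodes are all intervened.
    adj = {n: set() for n in nodes}
    for edge in und:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)
    added = set()
    for a in set(nodes) - targets:
        stack = [n for n in adj[a] if n in targets]
        seen = set(stack)
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w in targets:
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
                elif w != a:
                    added.add(frozenset((a, w)))

    # Step 3: delete undirected edges with an end node among the targets.
    kept = {e for e in und if not (e & targets)}
    return new_directed, kept | added


d1, u1 = intervene("ABC", [("A", "B")], [("A", "B"), ("B", "C")], {"B"})
d2, u2 = intervene("APQB", [], [("A", "P"), ("P", "Q"), ("Q", "B")], {"P", "Q"})
```

In the first example the intervened node B loses its incoming arrow and its undirected edges, while A and C become directly linked.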
Now, we show that the original and our ADMGs allow for complementary causal reasoning. Specifically, we show an example where our ADMGs allow for the identification of the causal effect of an intervention whereas the original ADMGs do not, and vice versa. Consider the DAG in Figure 3, which represents the causal relationships among all the random variables in the domain at hand (footnote 1). However, only , and are observed. Moreover, represents selection bias. Although other definitions may exist, we say that selection bias is present if two unobserved causes have a common effect that is omitted from the study but influences the selection of the samples in the study (Pearl, 2009, p. 163). Therefore, the corresponding unobserved causes are correlated in every sample selected. Note that this definition excludes the possibility of an intervention affecting the selection because, in a causal model, unobserved causes do not have observed causes. Note also that our goal is not the identification of the causal effect of an intervention in the whole population but in the subpopulation that satisfies the selection bias criterion (footnote 2). For causal effect identification in the whole population, see Bareinboim and Tian (2015).
Footnote 1. For instance, the DAG may correspond to the following fictitious domain: = Smoking, = Lung cancer, = Drinking, = Parents’ smoking, = Parents’ lung cancer, = Parents’ drinking, = Parents’ genotype that causes smoking and drinking, = Parents’ hospitalization.
Footnote 2. For instance, in the fictitious domain in the previous footnote, we are interested in the causal effect that smoking may have on the development of lung cancer for the patients with hospitalized parents.
[Figure 3. Panels: DAG, Our ADMG, Original ADMG.]

The ADMGs in Figure 3 represent the causal model represented by the DAG when only the observed random variables are modeled. According to our interpretation of ADMGs above, our ADMG is derived from the DAG by keeping the directed edges between observed random variables, and adding an undirected edge between two observed random variables if and only if their unobserved causes are not separated in the DAG given the unobserved causes of the rest of the observed random variables. In other words, holds in the DAG but and do not and, thus, the edges and are added to the ADMG but is not. Deriving the original ADMG is less straightforward. The bidirected edges in an original ADMG represent potential marginal dependence due to a common unobserved cause, also known as confounding. Thus, the original ADMGs are not meant to model selection bias. The best we can do is then to use bidirected edges to represent potential marginal dependences regardless of their origin. This implies that we can derive the original ADMG from the DAG by keeping the directed edges between observed random variables, and adding a bidirected edge between two observed random variables if and only if their unobserved causes are not separated in the DAG given the empty set. Clearly, is not identifiable with the original ADMG but is identifiable with our ADMG. Specifically,
where the first equality is due to marginalization, the second due to Rule 3, and the third due to Rule 2.
The original ADMGs assume that confounding is always the source of correlation between unobserved causes. In the example above, we consider selection bias as an additional source. However, this is not the only possibility. For instance, and may be tied by a physical law of the form devoid of causal meaning, much like Boyle’s law relates the pressure and volume of a gas as if the temperature and amount of gas remain unchanged within a closed system. In such a case, the discussion above still applies and our ADMG allows for causal effect identification but the original does not. For an example where the original ADMGs allow for causal effect identification whereas ours do not, simply replace the subgraph in Figure 3 with where is an unobserved random variable. Then, our ADMG will contain the same edges as before plus the edge , making the causal effect nonidentifiable. The original ADMG will contain the same edges as before with the exception of the edge , making the causal effect identifiable.
In summary, the bidirected edges of the original ADMGs have clear semantics: They represent potential marginal dependence due to a common unobserved cause. This means that we have to know the causal relationships between the unobserved random variables to derive the ADMG. Or at least, we have to know that there is no selection bias or tying laws so that marginal dependence can be attributed to a common unobserved cause due to Reichenbach’s principle (Pearl, 2009, p. 30). This knowledge may not be available in some cases. Moreover, the original ADMGs are not meant to represent selection bias or tying laws. To solve these two problems, we may be willing to use the bidirected edges to represent potential marginal dependences regardless of their origin. Our ADMGs are somewhat dual to the original ADMGs, since the undirected edges represent potential saturated conditional dependence between unobserved causes. This implies that in some cases, such as in the example above, our ADMGs may allow for causal effect identification whereas the original may not.
6. Learning ADMGs Via ASP
In this section, we introduce an exact algorithm for learning ADMGs via answer set programming (ASP), which is a declarative constraint satisfaction paradigm that is well-suited for solving computationally hard combinatorial problems (Gelfond, 1988; Niemelä, 1999; Simons et al., 2002). ASP represents constraints in terms of first-order logical rules. Therefore, when using ASP, the first task is to model the problem at hand in terms of rules so that the set of solutions implicitly represented by the rules corresponds to the solutions of the original problem. One or multiple solutions of the original problem can then be obtained by invoking an off-the-shelf ASP solver on the constraint declaration. The algorithms underlying the ASP solver clingo (Gebser et al., 2011), which we use in this work, are based on state-of-the-art Boolean satisfiability solving techniques (Biere et al., 2009).
Figure 4 shows the ASP encoding of our learning algorithm. The predicate node(X) in rule 1 represents that X is a node. The predicates line(X,Y,I) and arrow(X,Y,I) represent that there is an undirected and directed edge from X to Y after having intervened on the node I. The observational regime corresponds to . Rules 2-3 encode a nondeterministic guess of the edges for the observational regime, which means that the ASP solver will implicitly consider all possible graphs during the search, hence the exactness of the search. The edges under the observational regime are used in rules 4-5 to define the edges in the graph after having intervened on I, following the description in Section 5.1. Therefore, the algorithm assumes continuous random variables and additive normal noise when the input contains interventions. It makes no such assumption, though, when the input consists of just observations. Rules 6-7 enforce the fact that undirected edges are symmetric and that there is at most one directed edge between two nodes. The predicate ancestor(X,Y) represents that X is an ancestor of Y. Rules 8-10 enforce that the graph has no directed cycles. The predicates in rules 11-12 represent whether a node X is or is not in a set of nodes C. Rules 13-24 encode the separation criterion 2 in Section 3. The predicate con(X,Y,C,I) in rules 25-28 represents that there is a connecting route between X and Y given C after having intervened on I. Rule 29 enforces that each dependence in the input must correspond to a connecting route. Rule 30 represents that each independence in the input that is not represented implies a penalty of W units. In our case, . Rules 31-33 represent a penalty of 1 unit per edge. Other penalty rules can be added similarly.
Figure 5 shows the ASP encoding of all the (in)dependences in the probability distribution at hand, e.g. as determined by some available data. Specifically, the predicate nodes(3) represents that there are three nodes in the domain at hand, and the predicate set(0..7) represents that there are eight sets of nodes, indexed from 0 (empty set) to 7 (full set). The predicate indep(X,Y,C,I,W) (respectively dep(X,Y,C,I,W)) represents that the nodes X and Y are conditionally independent (respectively dependent) given the set index C after having intervened on the node I. Observations correspond to . The penalty for failing to represent an (in)dependence is W. Note that it suffices to specify all the (in)dependences between pairs of nodes, because these identify uniquely the rest of the independences in the probability distribution (Studený, 2005, Lemma 2.2). Note also that we do not assume that the probability distribution at hand is faithful to some ADMG or satisfies the composition property, as is the case in most heuristic learning algorithms.
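The indexing of node sets by the integers 0 to 7 can be illustrated with a short Python sketch. Note the assumption: the text only states that index 0 is the empty set and index 7 the full set over three nodes; the sketch assumes the usual binary encoding in which node x contributes the bit 2^(x-1). The function names are hypothetical.

```python
def set_members(index, n_nodes):
    """Decode a set index into node labels 1..n_nodes (ASSUMED binary encoding)."""
    return {x for x in range(1, n_nodes + 1) if (index >> (x - 1)) & 1}


def set_index(members):
    """Encode a set of node labels as an integer index (inverse of set_members)."""
    return sum(1 << (x - 1) for x in members)


all_sets = [set_members(i, 3) for i in range(8)]
```

Under this encoding, the conditioning set {1, 3} would be written as the index 5 in the input facts.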
By calling the ASP solver with the encodings of the learning algorithm and the (in)dependences in the domain, the solver will essentially perform an exhaustive search over the space of graphs, and will output the graphs with the smallest penalty. Specifically, when only the observations are used (i.e. the last six lines of Figure 5 are removed), the learning algorithm finds 37 optimal models. Among them, we have UGs such as line(1,2) line(2,3) line(3,1), DAGs such as arrow(3,1) arrow(1,2) arrow(3,2), AMP CGs such as line(1,2) arrow(3,1) arrow(3,2), and ADMGs such as line(1,2) line(2,3) arrow(1,2) or line(1,2) line(1,3) arrow(2,3). When all the observations and interventions available are used, the learning algorithm finds 18 optimal models. These are the models out of the 37 models found before that have no directed edge coming out of the node 3. This is the expected result given the last four lines in Figure 5. Note that the output still includes the ADMGs mentioned before.
Finally, the ASP code easily accommodates prior knowledge in the form of a node ordering, or forbidden and compelled edges. For instance, we can constrain the graphs to be consistent with the ordering by simply adding the rules in Figure 6 (top). Moreover, the ASP code can easily be extended as shown in Figure 6 (bottom) to learn not only our ADMGs but also original ADMGs. Note that the second line forbids graphs with both undirected and bidirected edges. This results in 34 optimal models: The 18 previously found plus 16 original ADMGs, e.g. biarrow(1,2) biarrow(1,3) arrow(1,2) or biarrow(1,2) biarrow(1,3) arrow(2,3).
7. Discussion
In this work, we have introduced ADMGs as an extension of AMP CGs by (i) relaxing the semidirected acyclicity constraint so that only directed cycles are forbidden, and (ii) allowing up to two edges between any pair of nodes. We have introduced and proved the equivalence of global, and ordered local and pairwise Markov properties for the new models. We have also shown that when the random variables are continuous, the new models can be interpreted as systems of structural equations with correlated errors. This has enabled us to adapt Pearl’s do-calculus to them. We have shown that our models complement those used in Pearl’s do-calculus, as there are cases where the identification of the causal effect of an intervention is not possible with the latter but is possible with the former, and vice versa. Finally, we have described an exact algorithm for learning the new models from observational and interventional data.
In the future, we plan to unify the original and our ADMGs by allowing directed, undirected and bidirected edges.
Appendix: Proofs
Lemma 1.
If there is a path in an ADMG between and such that (i) no noncollider on is in unless is a subpath of and , and (ii) every collider on is in , then there is a path in connecting a node in and a node in given .
Proof.
Suppose that has a collider such that with , or with . Assume without loss of generality that with because, otherwise, a symmetric argument applies. Then, replace the subpath of between and with . Note that the resulting path (i) has no noncollider in unless is a subpath of and , and (ii) has every collider in . Note also that the resulting path has fewer colliders than that are not in . Continuing with this process until no such collider exists produces the desired result. ∎
Lemma 2.
Given an ADMG , let denote a shortest path in connecting two nodes and given . Then, a path in between and can be obtained as follows. First, replace every augmented edge on with an associated collider path in . Second, replace every nonaugmented edge on with an associated edge in . Third, replace any configuration produced in the previous steps with .
Proof.
We start by proving that the collider paths added in the first step of the lemma either do not have any node in common except possibly one of the endpoints, or the third step of the lemma removes the repeated nodes. Suppose for a contradiction that and are two augmented edges on such that their associated collider paths have in common a node which is not an endpoint of these paths. Consider the following two cases.
 Case 1:

Suppose that . Then, one of the following configurations must exist in .
However, the first case implies that is in , which implies that replacing the subpath of between and with results in a path in connecting and given that is shorter than . This is a contradiction. Similarly for the fourth, sixth, seventh, eighth, ninth and tenth cases. And similarly for the rest of the cases by replacing the subpath of between and with .
 Case 2:

Suppose that . Then, one of the following configurations must exist in .
However, the first case implies that is in , which implies that replacing the subpath of between and with results in a path in connecting and given that is shorter than . This is a contradiction. Similarly for the second, fourth and seventh cases. For the third, fifth and sixth cases, the third step of the lemma removes the repeated nodes. Specifically, it replaces with in the third case, with in the fifth case, and with in the sixth case.
It only remains to prove that the collider paths added in the first step of the lemma have no nodes in common with except the endpoints. Suppose that has an augmented edge . Then, one of the following configurations must exist in .
Consider the first case and suppose for a contradiction that occurs on . Note that because, otherwise, would not be connecting. Assume without loss of generality that occurs on before and because, otherwise, a symmetric argument applies. Then, replacing the subpath of between and with results in a path in connecting and given that is shorter than . This is a contradiction. Similarly for the second case. Specifically, assume without loss of generality that occurs on because, otherwise, a symmetric argument with applies. Note that because, otherwise, would not be connecting. If occurs on after and , then replace the subpath of between and with . This results in a path in connecting and given that is shorter than , which is a contradiction. If occurs on before and , then replace the subpath of between and with . This results in a path in connecting and given that is shorter than , which is a contradiction. ∎
Lemma 3.
Let denote a path in an ADMG connecting two nodes and given . The sequence of noncolliders on forms a path in between and .
Proof.
Consider the maximal undirected subpaths of . Note that each endpoint of each subpath is an ancestor of a collider or endpoint of , because is connecting. Thus, all the nodes on are in . Suppose that and are two successive noncolliders on . Then, the subpath of between and consists entirely of colliders. Specifically, the subpath is of the form , , or . Then, and are adjacent in . ∎
Proof of Theorem 1.
We start by proving the only if part. Let denote a path in connecting and given . By Lemma 3 the noncolliders on form a path between and in . Since is connecting, every noncollider on is outside unless is a subpath of and . In the latter case, has a subpath where unless D is a collider on , i.e. is on . Similarly for and . Therefore, we replace the subpath of with where . Then, is connecting given . Note that is in , because or is in and is in with . Similarly for .
To prove the if part, let denote a shortest path in connecting and given . We can transform into a path in as described in Lemma 2. Since is connecting, no node on is in and, thus, no noncollider on is in . Finally, since all the nodes on are in , it follows that every collider on is in . To see it, note that if is an augmented edge in then the colliders on any collider path associated with are in . Thus, by Lemma 1 there exist a node in and a node in which are connected given in . ∎
Proof of Theorem 2.
We prove the theorem for the following separation criterion, which is equivalent to criterion 2: A route is said to be connecting given when

every collider on the route is in , and

every noncollider on the route is outside unless is a subroute and .
The only if part is trivial. To prove the if part, let denote a route in connecting and given . Let denote a node that occurs more than once in . Consider the following cases.
 Case 1:

Assume that is of the form . Then, for to be connecting given . Then, removing the subroute between the two occurrences of from results in the route , which is connecting given .
 Case 2:

Assume that is of the form . Then, for to be connecting given . Then, removing the subroute between the two occurrences of from results in the route , which is connecting given .
 Case 3:

Assume that is of the form . Then, for to be connecting given . Then, removing the subroute between the two occurrences of from results in the route , which is connecting given .
 Case 4:

Assume that is of the form . Then, for to be connecting given . Then, removing the subroute between the two occurrences of from results in the route , which is connecting given .
 Case 5:

Assume that is of the form . Then, for to be connecting given . Then, removing the subroute between the two occurrences of from results in the route , which is connecting given .
 Case 6:

Assume that is of the form and . Then, removing the subroute between the two occurrences of from results in the route , which is connecting given .
 Case 7:

Assume that is of the form and . Then, must actually be of the form or . Note that in the former case for to be connecting given . For the same reason, in the latter case. Then, in either case. Then, removing the subroute between the two occurrences of from results in the route , which is connecting given .
Repeating the process above until no such node exists produces the desired path. ∎
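The shortcutting step used repeatedly in the proof above (remove the subroute between two occurrences of a node, and repeat until no node occurs twice) can be sketched at the level of node sequences. This is only an illustration under simplifying assumptions: the function name `route_to_path` and the representation of a route as a plain list of node labels are hypothetical, and the sketch ignores edge types and does not check that the result is still connecting, which is what the case analysis in the proof establishes.

```python
def route_to_path(route):
    """Turn a route (node sequence, repeats allowed) into a path (no repeats).

    Whenever a node is seen a second time, the subroute between its two
    occurrences is dropped. Edge types are NOT tracked here (assumption):
    this only mirrors the node-level bookkeeping of the shortcutting step.
    """
    path = []        # nodes kept so far, in order
    position = {}    # node -> its index in `path`
    for node in route:
        if node in position:
            # Second occurrence: discard everything after the first occurrence.
            cut = position[node]
            for dropped in path[cut + 1:]:
                del position[dropped]
            del path[cut + 1:]
        else:
            position[node] = len(path)
            path.append(node)
    return path


if __name__ == "__main__":
    # The route A, B, C, B, D visits B twice; the detour through C is removed.
    print(route_to_path(["A", "B", "C", "B", "D"]))
```

Because each removal strictly shortens the sequence, the process terminates, which matches the closing remark of the proof.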
Proof of Theorem 3.
We start by proving the only if part. Suppose that there is a path in connecting a node in and a node in given . We can then obtain a path in connecting and given as shown in the proof of Theorem 1. In this path, replace with every subpath such that and . The result is a path in . Moreover, the path connects and given . To see it, note that the resulting and original paths have the same colliders, and the noncolliders on the resulting path are a subset of the noncolliders on the original path. Then, there is a path in