1 Introduction
Constraint-based methods for causal structure learning use conditional independence tests to obtain information about the underlying causal structure. We start by discussing several prominent examples of such algorithms, designed for different settings.
The PC-algorithm (Spirtes et al., 1993, 2000) was designed for learning directed acyclic graphs (DAGs) under the assumption of causal sufficiency, i.e., no unmeasured common causes and no selection variables. It learns a Markov equivalence class of DAGs that can be uniquely described by a so-called completed partially directed acyclic graph (CPDAG) (see Section 2 for a precise definition). The PC-algorithm is widely used in high-dimensional settings (e.g., Kalisch et al. (2010); Nagarajan et al. (2010); Stekhoven et al. (2012); Zhang et al. (2011)), since it is computationally feasible for sparse graphs with up to thousands of variables, and open-source software is available (e.g., pcalg (Kalisch et al., 2012) and TETRAD IV (Spirtes et al., 2000)). Moreover, the PC-algorithm has been shown to be consistent for high-dimensional sparse graphs (Kalisch and Bühlmann, 2007; Harris and Drton, 2012).
The FCI- and RFCI-algorithms and their modifications (Spirtes et al., 1993, 1999, 2000; Spirtes, 2001; Zhang, 2008; Colombo et al., 2012; Claassen et al., 2013) were designed for learning directed acyclic graphs when allowing for latent and selection variables. Thus, these algorithms learn a Markov equivalence class of DAGs with latent and selection variables, which can be uniquely represented by a partial ancestral graph (PAG). These algorithms first employ the PC-algorithm, and then perform additional conditional independence tests because of the latent variables.
Finally, the CCD-algorithm (Richardson, 1996) was designed for learning Markov equivalence classes of directed (not necessarily acyclic) graphs under the assumption of causal sufficiency. Again, the first step of this algorithm consists of the PC-algorithm.
Hence, all these algorithms share the PC-algorithm as a common first step. We will therefore focus our analysis on this algorithm, since any improvements to the PC-algorithm can be directly carried over to the other algorithms. When the PC-algorithm is applied to data, it is generally order-dependent, in the sense that its output depends on the order in which the variables are given. Dash and Druzdzel (1999) exploit the order-dependence to obtain candidate graphs for a score-based approach. Cano et al. (2008) resolve the order-dependence via a rather involved method based on measuring edge strengths. Spirtes et al. (2000) (Section 5.4.2.4) propose a method that removes the "weakest" edges as early as possible. Overall, however, the order-dependence of the PC-algorithm has received relatively little attention in the literature, suggesting that it is regarded as a minor issue. We found, however, that the order-dependence can become very problematic for high-dimensional data, leading to highly variable results and conclusions for different variable orderings.
In particular, we analyzed a yeast gene expression data set (Hughes et al. (2000); see Section 6 for more detailed information) containing gene expression levels of 5361 genes for 63 wild-type yeast organisms. First, we considered estimating the skeleton of the CPDAG, that is, the undirected graph obtained by discarding all arrowheads in the CPDAG. Figure
1(a) shows the large variability in the estimated skeletons for 25 random orderings of the variables. Each estimated skeleton consists of roughly 5000 edges, which can be divided into three groups: about 1500 are highly stable and occur in all orderings, about 1500 are moderately stable and occur in a majority of the orderings, and about 2000 are unstable and occur in only a minority of the orderings. Since the FCI- and CCD-algorithms employ the PC-algorithm as a first step, their resulting skeletons for these data are also highly order-dependent.

An important motivation for learning DAGs lies in their causal interpretation. We therefore also investigated the effect of different variable orderings on causal inference that is based on the PC-algorithm. In particular, we applied the IDA algorithm (Maathuis et al., 2009, 2010) to the yeast gene expression data discussed above. The IDA algorithm conceptually consists of two steps: one first estimates the Markov equivalence class of DAGs using the PC-algorithm, and one then applies Pearl's do-calculus (Pearl, 2000)
to each DAG in the Markov equivalence class. (The algorithm uses a fast local implementation that does not require listing all DAGs in the equivalence class.) One can then obtain estimated lower bounds on the sizes of the causal effects between all pairs of genes. For each of the 25 random variable orderings, we ranked the gene pairs according to these lower bounds, and compared these rankings to a gold standard set of large causal effects computed from gene knockout data. Figure
1(b) shows the large variability in the resulting receiver operating characteristic (ROC) curves. The ROC curve that was published in Maathuis et al. (2010) was significantly better than random guessing, and lies somewhere in the middle. Some of the other curves are much better, while there are also curves that are indistinguishable from random guessing.

The remainder of the paper is organized as follows. In Section 2 we discuss some background and terminology. Section 3 explains the original PC-algorithm. Section 4 introduces modifications of the PC-algorithm (and hence also of the (R)FCI- and CCD-algorithms) that remove part or all of the order-dependence. These modifications are identical to their original counterparts when perfect conditional independence information is used. When applied to data, the modified algorithms are partly or fully order-independent. Moreover, they are consistent in high-dimensional settings under the same conditions as the original algorithms. Section 5 compares all algorithms in simulations, and Section 6 compares them on the yeast gene expression data discussed above. We close with a discussion in Section 7.
2 Preliminaries
2.1 Graph terminology
A graph G = (V, E) consists of a vertex set V and an edge set E. The vertices represent random variables and the edges represent relationships between pairs of variables.
A graph containing only directed edges (→) is directed, one containing only undirected edges (−) is undirected, and one containing directed and/or undirected edges is partially directed. The skeleton of a partially directed graph is the undirected graph that results when all directed edges are replaced by undirected edges.
All graphs we consider are simple, meaning that there is at most one edge between any pair of vertices. If an edge is present, the vertices are said to be adjacent. If all pairs of vertices in a graph are adjacent, the graph is called complete. The adjacency set of a vertex Xi in a graph G, denoted by adj(G, Xi), is the set of all vertices in G that are adjacent to Xi in G. A vertex Xj in adj(G, Xi) is called a parent of Xi if Xj → Xi. The corresponding set of parents is denoted by pa(G, Xi).
A path is a sequence of distinct adjacent vertices. A directed path is a path along directed edges that follows the direction of the arrowheads. A directed cycle is formed by a directed path from Xi to Xj together with the edge Xj → Xi. A (partially) directed graph is called a (partially) directed acyclic graph if it does not contain directed cycles.
A triple (Xi, Xj, Xk) in a graph G is unshielded if Xi and Xj as well as Xj and Xk are adjacent, but Xi and Xk are not adjacent in G. A v-structure is an unshielded triple (Xi, Xj, Xk) in a graph G where the edges are oriented as Xi → Xj ← Xk.
2.2 Probabilistic and causal interpretation of DAGs
We use the notation Xi ⊥ Xj | S to indicate that Xi is independent of Xj given S, where S is a set of variables not containing Xi and Xj (Dawid, 1980). If S is the empty set, we simply write Xi ⊥ Xj. If Xi ⊥ Xj | S, we refer to S as a separating set for (Xi, Xj). A separating set S for (Xi, Xj) is called minimal if there is no proper subset S′ of S such that Xi ⊥ Xj | S′.
A distribution Q is said to factorize according to a DAG D if the joint density of V = (X1, …, Xp) can be written as the product of the conditional densities of each variable given its parents in D: f(x1, …, xp) = ∏_{i=1}^{p} f(xi | pa(D, Xi)).
A DAG entails conditional independence relationships via a graphical criterion called d-separation (Pearl, 2000). If two vertices Xi and Xj are not adjacent in a DAG D, then they are d-separated in D by a subset S of the remaining vertices. If Xi and Xj are d-separated by S, then Xi ⊥ Xj | S in any distribution Q that factorizes according to D. A distribution Q is said to be faithful to a DAG D if the reverse implication also holds, that is, if the conditional independence relationships in Q are exactly the same as those that can be inferred from D using d-separation.
Several DAGs can describe exactly the same conditional independence information. Such DAGs are called Markov equivalent and form a Markov equivalence class. Markov equivalent DAGs have the same skeleton and the same v-structures, and a Markov equivalence class can be described uniquely by a completed partially directed acyclic graph (CPDAG) (Andersson et al., 1997; Chickering, 2002). A CPDAG is a partially directed acyclic graph with the following properties: every directed edge Xi → Xj exists in every DAG in the Markov equivalence class, and for every undirected edge Xi − Xj there exists a DAG with Xi → Xj and a DAG with Xi ← Xj in the Markov equivalence class. A CPDAG C is said to represent a DAG D if D belongs to the Markov equivalence class described by C.
3 The PC-algorithm
We now describe the PC-algorithm in detail. In Section 3.1, we discuss the algorithm under the assumption that we have perfect conditional independence information between all variables in V. We refer to this as the oracle version. In Section 3.2 we discuss the more realistic situation where conditional independence relationships have to be estimated from data. We refer to this as the sample version.
3.1 Oracle version
A sketch of the PC-algorithm is given in Algorithm 3.1. We see that the algorithm consists of three steps. Step 1 finds the skeleton and separation sets, while Steps 2 and 3 determine the orientations of the edges.
Step 1 is given in pseudocode in Algorithm 3.2. We start with a complete undirected graph C. This graph is subsequently thinned out in the loop on lines 3-15 in Algorithm 3.2, where an edge Xi − Xj is deleted if Xi ⊥ Xj | S for some subset S of the remaining variables. These conditional independence queries are organized in a way that makes the algorithm computationally efficient for high-dimensional sparse graphs.
First, when ℓ = 0, all pairs of vertices are tested for marginal independence. If Xi ⊥ Xj, then the edge Xi − Xj is deleted and the empty set is saved as separation set in sepset(Xi, Xj) and sepset(Xj, Xi). After all pairs of vertices have been considered (and many edges might have been deleted), the algorithm proceeds to the next step with ℓ = 1.
When ℓ = 1, the algorithm chooses an ordered pair of vertices (Xi, Xj) still adjacent in C, and checks Xi ⊥ Xj | S for subsets S of size ℓ = 1 of adj(C, Xi) \ {Xj}. If such a conditional independence is found, the edge Xi − Xj is removed, and the corresponding conditioning set is saved in sepset(Xi, Xj) and sepset(Xj, Xi). If all ordered pairs of adjacent vertices have been considered for conditional independence given all subsets of size ℓ of their adjacency sets, the algorithm again increases ℓ by one. This process continues until all adjacency sets in the current graph are smaller than ℓ. At this point the skeleton and the separation sets have been determined.
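The level-wise search described above can be sketched in a few lines of Python. This is an illustrative sketch, not the pcalg implementation; `pc_skeleton` is a hypothetical function name, and `indep(i, j, S)` is a conditional-independence oracle (or statistical test) supplied by the caller.

```python
from itertools import combinations

def pc_skeleton(nodes, indep):
    """Sketch of Step 1 of the PC-algorithm.

    indep(i, j, S) returns True if X_i is judged independent of X_j given S.
    Returns the adjacency sets and the recorded separating sets."""
    adj = {i: set(nodes) - {i} for i in nodes}  # start with a complete graph
    sepset = {}
    ell = 0
    # Continue while some adjacency set can still supply conditioning sets
    # of size ell for an adjacent pair.
    while any(len(adj[i] - {j}) >= ell for i in nodes for j in adj[i]):
        for i in nodes:
            for j in sorted(adj[i]):          # snapshot; adj[i] may shrink
                if j not in adj[i]:
                    continue
                # Condition only on subsets of the CURRENT adjacency set.
                for S in combinations(sorted(adj[i] - {j}), ell):
                    if indep(i, j, set(S)):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[(i, j)] = sepset[(j, i)] = set(S)
                        break
        ell += 1
    return adj, sepset
```

Note that the adjacency sets are updated immediately after each edge removal, which is exactly the source of the order-dependence discussed in Section 3.3.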
Step 2 determines the v-structures. In particular, it considers all unshielded triples in C, and orients an unshielded triple (Xi, Xj, Xk) as a v-structure Xi → Xj ← Xk if and only if Xj ∉ sepset(Xi, Xk).
Finally, Step 3 orients as many of the remaining undirected edges as possible by repeated application of the following three rules:

R1: orient Xj − Xk into Xj → Xk whenever there is a directed edge Xi → Xj such that Xi and Xk are not adjacent (otherwise a new v-structure is created);

R2: orient Xi − Xj into Xi → Xj whenever there is a chain Xi → Xk → Xj (otherwise a directed cycle is created);

R3: orient Xi − Xj into Xi → Xj whenever there are two chains Xi − Xk → Xj and Xi − Xl → Xj such that Xk and Xl are not adjacent (otherwise a new v-structure or a directed cycle is created).
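The repeated application of rules R1-R3 can be sketched as follows. This is a simplified illustration under the assumption that the partially directed graph is stored as a set of undirected pairs and a set of directed pairs; the function name is hypothetical and this is not the pcalg implementation.

```python
def apply_rules(undirected, directed):
    """Repeatedly apply orientation rules R1-R3 (sketch).

    undirected: set of frozensets {a, b} (edges a - b)
    directed:   set of pairs (a, b) meaning a -> b"""
    def adjacent(a, b):
        return (frozenset((a, b)) in undirected
                or (a, b) in directed or (b, a) in directed)

    vertices = {v for e in undirected for v in e} | {v for d in directed for v in d}
    changed = True
    while changed:
        changed = False
        for e in list(undirected):
            for a, b in [tuple(e), tuple(e)[::-1]]:
                # R1: i -> a and a - b with i, b non-adjacent  =>  a -> b
                r1 = any(j == a and i != b and not adjacent(i, b)
                         for (i, j) in directed)
                # R2: a -> k -> b and a - b  =>  a -> b
                r2 = any((a, k) in directed and (k, b) in directed
                         for k in vertices)
                # R3: a - k -> b and a - l -> b with k, l non-adjacent  =>  a -> b
                ks = [i for (i, j) in directed
                      if j == b and frozenset((a, i)) in undirected]
                r3 = any(not adjacent(k, l)
                         for n, k in enumerate(ks) for l in ks[n + 1:])
                if r1 or r2 or r3:
                    undirected.discard(e)
                    directed.add((a, b))
                    changed = True
                    break
    return undirected, directed
```

The outer loop iterates until no rule fires anymore, mirroring the "repeated application" in the text.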
The PC-algorithm was shown to be sound and complete (Theorem 5.1 on p. 410 of Spirtes et al. (2000)): Let the distribution of X be faithful to a DAG D, and assume that we are given perfect conditional independence information about all pairs of variables (Xi, Xj) in V given subsets S ⊆ V \ {Xi, Xj}. Then the output of the PC-algorithm is the CPDAG that represents D.
We briefly discuss the main ingredients of the proof, as these will be useful for understanding our modifications in Section 4. The faithfulness assumption implies that conditional independence in the distribution of X is equivalent to d-separation in the graph D. The skeleton of D can then be determined as follows: Xi and Xj are adjacent in D if and only if they are conditionally dependent given any subset of the remaining nodes. Naively, one could therefore check all these conditional dependencies, which is known as the SGS-algorithm (Spirtes et al., 2000). The PC-algorithm obtains the same result with fewer tests, by using the following fact about DAGs: two variables Xi and Xj in a DAG D are d-separated by some subset of the remaining variables if and only if they are d-separated by pa(D, Xi) or pa(D, Xj). The PC-algorithm is guaranteed to check these conditional independencies: at all stages of the algorithm, the graph is a supergraph of the true CPDAG, and the algorithm checks conditional dependencies given all subsets of the adjacency sets, which obviously include the parent sets.
3.2 Sample version
In applications, we of course do not have perfect conditional independence information. Instead, we assume that we have an i.i.d. sample of size n of X. A sample version of the PC-algorithm can then be obtained by replacing all steps where conditional independence decisions were taken by statistical tests for conditional independence at some pre-specified significance level α. For example, if the distribution of X is multivariate Gaussian, one can test for zero partial correlation, see, e.g., Kalisch and Bühlmann (2007). We note that the significance level α is used for many tests, and plays the role of a tuning parameter, where smaller values of α tend to lead to sparser graphs.
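In the Gaussian case, the standard test for zero partial correlation is based on Fisher's z-transform. The following minimal sketch illustrates the decision rule; the function name is hypothetical, and the partial correlation r of Xi and Xj given a set S of size k is assumed to be computed by the caller (e.g., via the relevant inverse correlation matrix).

```python
import math

def gauss_ci_test(r, n, k, alpha=0.05):
    """Fisher-z test for zero partial correlation (minimal sketch).

    r: estimated partial correlation of X_i and X_j given S
    n: sample size; k: |S|; alpha: significance level.
    Returns True if the conditional independence is NOT rejected."""
    r = min(max(r, -1 + 1e-12), 1 - 1e-12)   # guard against |r| = 1
    z = 0.5 * math.log((1 + r) / (1 - r))    # Fisher's z-transform
    stat = math.sqrt(n - k - 3) * abs(z)     # approximately N(0,1) under H0
    # Two-sided p-value from the standard normal CDF.
    p = 2 * (1 - 0.5 * (1 + math.erf(stat / math.sqrt(2))))
    return p > alpha
```

In the sample version of the algorithm, every independence query Xi ⊥ Xj | S is answered by such a test, so the level α directly controls how readily edges are deleted.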
3.3 Orderdependence in the sample version
Let order(V) denote an ordering on the variables in V. We now consider the role of order(V) in every step of the algorithm. Throughout, we assume that all tasks are performed according to the lexicographical ordering of order(V), which is the standard implementation in pcalg (Kalisch et al., 2012) and TETRAD IV (Spirtes et al., 2000), and is called "PC1" in Spirtes et al. (2000) (Section 5.4.2.4).
In Step 1, order(V) affects the estimation of the skeleton and the separating sets. In particular, at each level of ℓ, order(V) determines the order in which pairs of adjacent vertices and subsets S of their adjacency sets are considered (see lines 6 and 8 in Algorithm 3.2). The skeleton C is updated after each edge removal. Hence, the adjacency sets typically change within one level of ℓ, and this affects which other conditional independencies are checked, since the algorithm only conditions on subsets of the adjacency sets. In the oracle version, we have perfect conditional independence information, and all orderings on the variables lead to the same output. In the sample version, however, we typically make mistakes in keeping or removing edges. In such cases, the resulting changes in the adjacency sets can lead to different skeletons, as illustrated in Example 3.3.
Moreover, different variable orderings can lead to different separating sets in Step 1. In the oracle version, this is not important, because any valid separating set leads to the correct v-structure decision in Step 2. In the sample version, however, different separating sets in Step 1 of the algorithm may yield different decisions about v-structures in Step 2. This is illustrated in Example 3.3.
Finally, we consider the role of order(V) on the orientation rules in Steps 2 and 3 of the sample version of the PC-algorithm. Example 3.3 illustrates that different variable orderings can lead to different orientations, even if the skeleton and separating sets are order-independent.
(Order-dependent skeleton in the sample PC-algorithm.) Suppose that the distribution of X is faithful to the DAG D in Figure 2(a). This DAG encodes two conditional independencies with minimal separating sets.
Suppose that we have an i.i.d. sample of X, and that three conditional independencies with minimal separating sets are judged to hold at some level α. The first two of these are correct, while the third is false.
We now apply the PC-algorithm with two different orderings of the variables. The resulting skeletons are shown in Figures 2(b) and 2(c), respectively. We see that the skeletons are different, and that both are incorrect, as an edge is missing. The skeleton in Figure 2(c) contains an additional error, as it retains an extra edge.
We now go through Algorithm 3.2 to see what happened. We start with a complete undirected graph on V. When ℓ = 0, all pairs of variables are tested for marginal independence, and the algorithm correctly removes one edge. No other conditional independencies are found when ℓ = 0 or ℓ = 1. When ℓ = 2, there are two pairs of vertices that are thought to be conditionally independent given a subset of size 2.
Under the first ordering, the correctly judged pair is considered first. The corresponding edge is removed, as the variables are judged conditionally independent and the separating set is a subset of the relevant adjacency set. Next, the second pair is considered and the corresponding edge is erroneously removed, because of the wrong conditional independence decision and the fact that the separating set is still a subset of the relevant adjacency set.
Under the second ordering, the erroneously judged pair is considered first, and the corresponding edge is removed. Next, the algorithm considers a second pair; the corresponding separating set is no longer a subset of the reduced adjacency set, so that the edge remains. The same happens for a third pair. In other words, since the erroneous removal was performed first in the second ordering, the adjacency set of one of the vertices was affected and no longer contained the variable needed to form the correct separating set, so that the algorithm "forgot" to check the relevant conditional independence.
(Order-dependent separating sets and v-structures in the sample PC-algorithm.) Suppose that the distribution of X is faithful to the DAG D in Figure 3(a). This DAG encodes seven conditional independencies with minimal separating sets.
We consider the oracle PC-algorithm with two different orderings on the variables. The two orderings lead to different separating sets for one of the pairs; thus, the separating sets are order-dependent. However, we obtain the same v-structure for both orderings, since the middle vertex of the triple is not in the separating set, regardless of the ordering. In fact, this holds in general, since in the oracle version of the PC-algorithm, a vertex is either in all possible separating sets or in none of them (cf. Spirtes et al., 2000, Lemma 5.1.3).
Now suppose that we have an i.i.d. sample of X. Suppose that at some level α, all true conditional independencies are judged to hold, and one additional conditional independence is thought to hold by mistake. We again consider two different orderings. With the first ordering we obtain an incorrect separating set, which leads to an incorrect v-structure in Step 2 of the algorithm. With the second ordering, we obtain the correct separating set, and hence correctly find that the triple is not a v-structure in Step 2. This illustrates that order-dependent separating sets in Step 1 of the sample version of the PC-algorithm can lead to order-dependent v-structures in Step 2 of the algorithm.
(Order-dependent orientation rules in Steps 2 and 3 of the sample PC-algorithm.) Consider the graph in Figure 4(a) with two unshielded triples, and assume this is the skeleton after Step 1 of the sample version of the PC-algorithm. Moreover, assume that the separating sets are such that, in Step 2 of the algorithm, we obtain two v-structures with conflicting orientations for their shared edge. Of course this means that at least one of the statistical tests is wrong, but this can happen in the sample version. We now have conflicting information about the orientation of the shared edge. In the current implementation of pcalg, where conflicting edges are simply overwritten, this means that the orientation of the shared edge is determined by the v-structure that is considered last. Thus, different variable orderings can lead to different orientations of this edge.
Next, consider the graph in Figure 4(b), and assume that this is the output of the sample version of the PC-algorithm after Step 2. Thus, we have two v-structures and four unshielded triples. We then apply the orientation rules in Step 3 of the algorithm, starting with rule R1. If one of the first two unshielded triples is considered first, the remaining undirected edge is oriented in one direction; if one of the other two unshielded triples is considered first, it is oriented in the opposite direction. Note that we have no issues with overwriting of edges here, since as soon as this edge is oriented, all edges are oriented and no further orientation rules are applied.
These examples illustrate that Steps 2 and 3 of the PC-algorithm can be order-dependent regardless of the output of the previous steps.
4 Modified algorithms
We now propose several modifications of the PC-algorithm (and hence also of the related algorithms) that remove the order-dependence in the various stages of the algorithm. Sections 4.1, 4.2, and 4.3 discuss the skeleton, the v-structures and the orientation rules, respectively. In each of these sections, we first describe the oracle version of the modifications, and then results and examples about order-dependence in the corresponding sample version (obtained by replacing conditional independence queries by conditional independence tests, as in Section 3.3). Finally, Section 4.4 discusses order-independent versions of related algorithms like RFCI and FCI, and Section 4.5 presents high-dimensional consistency results for the sample versions of all modifications.
4.1 The skeleton
We first consider estimation of the skeleton in Step 1 of the PC-algorithm. The pseudocode for our modification is given in Algorithm 4.1. The resulting PC-algorithm, where Step 1 in Algorithm 3.1 is replaced by Algorithm 4.1, is called "PC-stable".
The main difference between Algorithms 3.2 and 4.1 is given by the for-loop on lines 6-8 in the latter, which computes and stores the adjacency sets of all variables after each new size ℓ of the conditioning sets. These stored adjacency sets are used whenever we search for conditioning sets of this given size ℓ. Consequently, an edge deletion on line 13 no longer affects which conditional independencies are checked for other pairs of variables at this level of ℓ.
In other words, at each level of ℓ, Algorithm 4.1 records which edges should be removed, but for the purpose of the adjacency sets it removes these edges only when it goes to the next value of ℓ. Besides resolving the order-dependence in the estimation of the skeleton, our algorithm has the advantage that it is easily parallelizable at each level of ℓ.
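A minimal Python sketch may make the difference concrete: the adjacency sets are frozen at the start of each level ℓ, so edge removals within the level cannot affect which tests are performed. The function name and the `indep` oracle are hypothetical; this is an illustration, not the pcalg implementation.

```python
from itertools import combinations

def pc_stable_skeleton(nodes, indep):
    """Sketch of Step 1 of PC-stable: adjacency sets frozen per level."""
    adj = {i: set(nodes) - {i} for i in nodes}  # start with a complete graph
    sepset = {}
    ell = 0
    while any(len(adj[i] - {j}) >= ell for i in nodes for j in adj[i]):
        # Freeze the adjacency sets a(X_i) for this level of ell.
        a = {i: set(adj[i]) for i in nodes}
        for i in nodes:
            for j in sorted(a[i]):
                if j not in adj[i]:
                    continue  # edge already marked for removal at this level
                # Conditioning sets come from the FROZEN sets, so the order
                # in which pairs are visited no longer matters.
                for S in combinations(sorted(a[i] - {j}), ell):
                    if indep(i, j, set(S)):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[(i, j)] = sepset[(j, i)] = set(S)
                        break
        ell += 1
    return adj, sepset
```

With a fixed conditional-independence test, the returned skeleton is the same for every ordering of `nodes`.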
The PC-stable algorithm is sound and complete in the oracle version (Theorem 4.1), and yields order-independent skeletons in the sample version (Theorem 4.1). We illustrate the algorithm in Example 4.1.
Let the distribution of X be faithful to a DAG D, and assume that we are given perfect conditional independence information about all pairs of variables (Xi, Xj) in V given subsets S ⊆ V \ {Xi, Xj}. Then the output of the PC-stable algorithm is the CPDAG that represents D. The proof of Theorem 4.1 is completely analogous to the proof of Theorem 3.1 for the original PC-algorithm, as discussed in Section 3.1.
The skeleton resulting from the sample version of the PC-stable algorithm is order-independent. To see this, consider the removal or retention of an arbitrary edge Xi − Xj at some level ℓ. The ordering of the variables determines the order in which the edges (line 9 of Algorithm 4.1) and the subsets S of a(Xi) and a(Xj) (line 11 of Algorithm 4.1) are considered. By construction, however, the order in which edges are considered does not affect the sets a(Xi) and a(Xj).
If there is at least one subset S of a(Xi) or a(Xj) such that Xi ⊥ Xj | S, then any ordering of the variables will find a separating set for Xi and Xj (but different orderings may lead to different separating sets, as illustrated in Example 3.3). Conversely, if there is no subset S′ of a(Xi) or a(Xj) such that Xi ⊥ Xj | S′, then no ordering will find a separating set.
Hence, any ordering of the variables leads to the same edge deletions, and therefore to the same skeleton.
(Order-independent skeletons) We go back to Example 3.3, and consider the sample version of Algorithm 4.1. The algorithm now outputs the skeleton shown in Figure 2(b) for both orderings considered there.
We again go through the algorithm step by step. We start with a complete undirected graph on V. The only conditional independence found when ℓ = 0 or ℓ = 1 is the marginal one, and the corresponding edge is removed. When ℓ = 2, the algorithm first computes and stores the new adjacency sets a(Xi) for all vertices. There are two pairs of variables that are thought to be conditionally independent given a subset of size 2. Since the sets a(Xi) are not updated after edge removals, it does not matter in which order we consider these pairs: any ordering leads to the removal of both edges, as each separating set is contained in the stored adjacency set of one of the vertices in the pair.
4.2 Determination of the v-structures
We propose two methods to resolve the order-dependence in the determination of the v-structures, using the conservative PC-algorithm (CPC) of Ramsey et al. (2006) and a variation thereof.
The CPC-algorithm works as follows. Let C be the graph resulting from Step 1 of the PC-algorithm (Algorithm 3.1). For all unshielded triples (Xi, Xj, Xk) in C, determine all subsets Y of adj(C, Xi) and of adj(C, Xk) that make Xi and Xk conditionally independent, i.e., that satisfy Xi ⊥ Xk | Y. We refer to such sets as separating sets. The triple (Xi, Xj, Xk) is labelled as unambiguous if at least one such separating set is found and either Xj is in all separating sets or in none of them; otherwise it is labelled as ambiguous. If the triple is unambiguous, it is oriented as v-structure Xi → Xj ← Xk if and only if Xj is in none of the separating sets. Moreover, in Step 3 of the PC-algorithm (Algorithm 3.1), the orientation rules are adapted so that only unambiguous triples are oriented. We refer to the combination of PC-stable and CPC as the CPC-stable algorithm.
We found that the CPC-algorithm can be very conservative, in the sense that very few unshielded triples are unambiguous in the sample version. We therefore propose a minor modification of this approach, called majority rule PC-algorithm (MPC). As in CPC, we first determine all subsets Y of adj(C, Xi) and of adj(C, Xk) satisfying Xi ⊥ Xk | Y. We then label the triple (Xi, Xj, Xk) as unambiguous if at least one such separating set is found and Xj is not in exactly 50% of the separating sets. Otherwise it is labelled as ambiguous. If the triple is unambiguous, it is oriented as v-structure Xi → Xj ← Xk if and only if Xj is in less than half of the separating sets. As in CPC, the orientation rules in Step 3 are adapted so that only unambiguous triples are oriented. We refer to the combination of PC-stable and MPC as the MPC-stable algorithm.
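The two labelling rules can be sketched as follows. This hypothetical helper classifies a single unshielded triple (Xi, Xj, Xk) under both the conservative (CPC) and the majority (MPC) rule, given the adjacency sets of the skeleton and a conditional-independence test `indep`:

```python
from itertools import chain, combinations

def classify_triple(i, j, k, adj, indep):
    """Classify an unshielded triple (i, j, k) as in CPC/MPC (sketch).

    Searches all subsets of adj[i] and adj[k] that separate i and k,
    then applies both the conservative and the majority rule."""
    def subsets(s):
        s = sorted(s)
        return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

    # Candidate separating sets from both adjacency sets (deduplicated).
    candidates = set(subsets(adj[i] - {k})) | set(subsets(adj[k] - {i}))
    seps = [set(S) for S in candidates if indep(i, k, set(S))]
    if not seps:
        return "ambiguous", "ambiguous"
    frac = sum(j in S for S in seps) / len(seps)  # fraction containing X_j
    # CPC: unambiguous only if X_j is in all separating sets or in none.
    cpc = ("v-structure" if frac == 0
           else "no v-structure" if frac == 1 else "ambiguous")
    # MPC: ambiguous only if X_j is in exactly 50% of the separating sets.
    mpc = ("ambiguous" if frac == 0.5
           else "v-structure" if frac < 0.5 else "no v-structure")
    return cpc, mpc
```

Note that both decisions depend only on the (order-independent) adjacency sets, not on the separating sets stored in Step 1.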
Theorem 4.2 states that the oracle CPC- and MPC-stable algorithms are sound and complete. When looking at the sample versions of the algorithms, we note that any unshielded triple that is judged to be unambiguous in CPC-stable is also unambiguous in MPC-stable, and any unambiguous v-structure in CPC-stable is an unambiguous v-structure in MPC-stable. In this sense, CPC-stable is more conservative than MPC-stable, although the difference appears to be small in simulations and for the yeast data (see Sections 5 and 6). Both CPC-stable and MPC-stable share the property that the determination of v-structures no longer depends on the (order-dependent) separating sets that were found in Step 1 of the algorithm. Therefore, both CPC-stable and MPC-stable yield order-independent decisions about v-structures in the sample version, as stated in Theorem 4.2. Example 4.2 illustrates both algorithms.
We note that the CPC/MPC-stable algorithms may yield a lot fewer directed edges than PC-stable. On the other hand, we can put more trust in those edges that were oriented.
Let the distribution of X be faithful to a DAG D, and assume that we are given perfect conditional independence information about all pairs of variables (Xi, Xj) in V given subsets S ⊆ V \ {Xi, Xj}. Then the output of the CPC/MPC(-stable) algorithms is the CPDAG that represents D. The skeleton of the CPDAG is correct by Theorems 3.1 and 4.1. The unshielded triples are all unambiguous (in the conservative and the majority rule versions), since for any unshielded triple (Xi, Xj, Xk) in a DAG, Xj is either in all sets that d-separate Xi and Xk or in none of them (Spirtes et al., 2000, Lemma 5.1.3). In particular, this also means that all v-structures are determined correctly. Finally, since all unshielded triples are unambiguous, the orientation rules are as in the original oracle PC-algorithm, and soundness and completeness of these rules follows from Meek (1995) and Andersson et al. (1997).
The decisions about v-structures in the sample versions of the CPC/MPC-stable algorithms are order-independent. The CPC/MPC-stable algorithms have order-independent skeletons in Step 1, by Theorem 4.1. In particular, this means that their unshielded triples and adjacency sets are order-independent. The decision about whether an unshielded triple is unambiguous and/or a v-structure is based on the adjacency sets of nodes in the triple, which are order-independent.
(Order-independent decisions about v-structures) We consider the sample versions of the CPC/MPC-stable algorithms, using the same input as in Example 3.3. In particular, we assume that all conditional independencies induced by the DAG in Figure 3(a) are judged to hold, plus one additional (erroneous) conditional independency.
Denote the skeleton after Step 1 by C. We consider an unshielded triple (Xi, Xj, Xk) in C. First, we compute adj(C, Xi) and adj(C, Xk). We then consider all subsets Y of these adjacency sets, and check whether Xi ⊥ Xk | Y. Three separating sets are found.
Since Xj is in some but not all of these separating sets, CPC-stable determines that the triple is ambiguous, and no orientations are performed. Since Xj is in more than half of the separating sets, MPC-stable determines that the triple is unambiguous and not a v-structure. The output of both algorithms is given in Figure 3(b).
4.3 Orientation rules
Even when the skeleton and the determination of the v-structures are order-independent, Example 3.3 showed that there might be some order-dependence left in the sample version. This can be resolved by allowing bi-directed edges (↔) and working with lists containing the candidate edges for the v-structures in Step 2 and the orientation rules R1-R3 in Step 3.
In particular, in Step 2 we generate a list of all (unambiguous) v-structures, and then orient all of these, creating a bi-directed edge in case of a conflict between two v-structures. In Step 3, we first generate a list of all edges that can be oriented by rule R1. We orient all these edges, again creating bi-directed edges if there are conflicts. We do the same for rules R2 and R3, and iterate this procedure until no more edges can be oriented.
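For the v-structures in Step 2, the list-based procedure can be sketched as follows (a simplified illustration with a hypothetical function name; a triple (i, j, k) denotes a candidate v-structure i → j ← k):

```python
def orient_v_structures(v_structs):
    """Orient all v-structures from a precomputed list at once (sketch).

    Conflicting orientations yield a bi-directed edge instead of being
    overwritten, so the result does not depend on the list order."""
    marks = {}  # unordered pair -> set of directions claimed for it
    for (i, j, k) in v_structs:           # triple oriented as i -> j <- k
        for (a, b) in [(i, j), (k, j)]:
            marks.setdefault(frozenset((a, b)), set()).add((a, b))
    out = {}
    for pair, dirs in marks.items():
        # Two opposite directions for the same pair: conflict -> bi-directed.
        out[pair] = "bidirected" if len(dirs) == 2 else dirs.pop()
    return out
```

Because all v-structures are collected before any edge is oriented, reordering the input list leaves the output unchanged.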
When using this procedure, we add the letter L (standing for lists), e.g., LCPC-stable and LMPC-stable. The LCPC-stable and LMPC-stable algorithms are correct in the oracle version (Theorem 4.3) and fully order-independent in the sample versions (Theorem 4.3). The procedure is illustrated in Example 4.3.
We note that the bi-directed edges cannot be interpreted causally. They simply indicate that there was some conflicting information in the algorithm.
Let the distribution of X be faithful to a DAG D, and assume that we are given perfect conditional independence information about all pairs of variables (Xi, Xj) in V given subsets S ⊆ V \ {Xi, Xj}. Then the (L)CPC(-stable) and (L)MPC(-stable) algorithms output the CPDAG that represents D.
By Theorem 4.2, we know that the CPC(-stable) and MPC(-stable) algorithms are correct. With perfect conditional independence information, there are no conflicts between v-structures in Step 2 of the algorithms, nor between orientation rules in Step 3 of the algorithms. Therefore, the (L)CPC(-stable) and (L)MPC(-stable) algorithms are identical to the CPC(-stable) and MPC(-stable) algorithms.
The sample versions of LCPC-stable and LMPC-stable are fully order-independent. This follows straightforwardly from Theorems 4.1 and 4.2 and the procedure with lists and bi-directed edges discussed above.
Table 1 summarizes the three order-dependence issues explained above and the corresponding modifications of the PC-algorithm that remove them.
                    skeleton            v-structure decisions   edge orientations
PC                  order-dependent     order-dependent         order-dependent
PC-stable           order-independent   order-dependent         order-dependent
CPC/MPC-stable      order-independent   order-independent       order-dependent
LCPC/LMPC-stable    order-independent   order-independent       order-independent
First, we consider the two conflicting unshielded triples shown in Figure 4(a). The version of the algorithm that uses lists for the orientation rules resolves the conflict by creating a bidirected edge, and it does so regardless of the ordering of the variables.
Next, we consider the structure shown in Figure 4(b). As a first step, we construct a list containing all candidate structures eligible for orientation rule R1 in Step 3; here the list contains four unshielded triples. We then go through each element in the list and orient the edges accordingly, allowing bidirected edges in case of conflicts. This yields the same edge orientations regardless of the ordering of the variables.
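A single list-based sweep of rule R1 (if a → b, b and c are adjacent, and a and c are not adjacent, then orient b - c as b → c) might be sketched as follows. The code is an illustrative Python rendering, not the pcalg implementation; marks are stored per ordered pair, with marks[(x, y)] giving the mark at y.

```python
def apply_R1_with_lists(adj, marks):
    """One list-based sweep of rule R1.

    For every a -> b - c with a and c non-adjacent, demand the arrowhead
    b -> c; all demands are collected first and applied at once, so that
    conflicting demands produce a bidirected edge instead of an
    order-dependent orientation. Returns True if anything changed.
    """
    demanded = set()
    for b in adj:                      # b is the middle node of the triple
        for a in adj[b]:
            for c in adj[b]:
                if a == c or c in adj[a]:
                    continue           # triple must be unshielded
                # a -> b (arrowhead at b, tail at a) and b - c undirected
                if (marks[(a, b)] == ">" and marks[(b, a)] == "-"
                        and marks[(b, c)] == "-" and marks[(c, b)] == "-"):
                    demanded.add((b, c))
    for (x, y) in demanded:
        marks[(x, y)] = ">"
    return bool(demanded)

# a -> b - c <- d: R1 fires on b - c from both sides, giving b <-> c.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
marks = {("a", "b"): ">", ("b", "a"): "-",
         ("b", "c"): "-", ("c", "b"): "-",
         ("d", "c"): ">", ("c", "d"): "-"}
apply_R1_with_lists(adj, marks)
print(marks[("b", "c")], marks[("c", "b")])  # > >  (bidirected)
```

Iterating such sweeps until no rule fires anymore, and treating R2 and R3 analogously, gives the list-based Step 3 described above.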
4.4 Related algorithms
The FCI-algorithm (Spirtes et al., 1999, 2000) first runs Steps 1 and 2 of the PC-algorithm (Algorithm 3.1). Based on the resulting graph, it then computes certain sets, called "Possible-D-SEP" sets, and conducts additional conditional independence tests given subsets of the Possible-D-SEP sets. This can lead to additional edge removals and corresponding separating sets. After this, the v-structures are newly determined. Finally, there are ten orientation rules as defined by Zhang (2008).
From our results, it immediately follows that FCI with any of our modifications of the PC-algorithm is sound and complete in the oracle version. Moreover, we can easily construct partially or fully order-independent sample versions as follows. To remove the order-dependence in the skeleton, we can use a three-step approach. First, we use PC-stable to find an initial order-independent skeleton. Next, since the Possible-D-SEP sets are determined from the orientations of the v-structures, we need order-independent v-structures; hence, in Step 2 we determine the v-structures using CPC. Finally, we compute the Possible-D-SEP sets for all pairs of nodes at once, and do not update them after possible edge removals. The modification that uses these three steps returns an order-independent skeleton, and we call it FCI-stable. To obtain order-independent v-structures in the final output, one should again use an order-independent procedure, such as CPC or MPC, the second time that the v-structures are determined. We call these modifications CFCI-stable and MFCI-stable, respectively. Regarding the orientation rules, the FCI-algorithm does not suffer from conflicting v-structures (as in Figure 4(a) for the PC-algorithm), because it orients edge marks and because bidirected edges are allowed. However, the ten orientation rules still suffer from order-dependence issues as in the PC-algorithm (see Figure 4(b)). To solve this problem, we can again use lists of candidate edges for each orientation rule, as explained in the previous section for the PC-algorithm. However, since these ten orientation rules are more involved than the three for PC, using lists can be very slow for some rules, for example the one for discriminating paths. We refer to these modifications as LCFCI-stable and LMFCI-stable; they are fully order-independent in the sample version.
Table 2 summarizes the three order-dependence issues for FCI and the corresponding modifications that remove them.
                    skeleton            v-structure decisions   edge orientations
FCI                 order-dependent     order-dependent         order-dependent
FCI-stable          order-independent   order-dependent         order-dependent
CFCI/MFCI-stable    order-independent   order-independent       order-dependent
LCFCI/LMFCI-stable  order-independent   order-independent       order-independent
The RFCI-algorithm (Colombo et al., 2012) can be viewed as an algorithm that lies in between PC and FCI, in the sense that its computational complexity is of the same order as that of PC, but its output can be interpreted causally without assuming causal sufficiency (although it is slightly less informative than the output of FCI).
RFCI works as follows. It runs the first step of PC. It then has a more involved Step 2 to determine the v-structures (Colombo et al., 2012, Lemma 3.1). In particular, for any unshielded triple (X_i, X_j, X_k), it conducts additional tests to check whether both X_i and X_j and X_j and X_k are conditionally dependent given the separating set of X_i and X_k found in Step 1. If a conditional independence relationship is detected, the corresponding edge is removed and a minimal separating set is stored. The removal of an edge can create new unshielded triples or destroy existing ones, and the algorithm therefore works with lists to make sure that these actions are order-independent. On the other hand, if both conditional dependencies hold and X_j is not in the separating set of X_i and X_k, the triple is oriented as the v-structure X_i → X_j ← X_k. Finally, in Step 3 it uses the ten orientation rules of Zhang (2008) with a modified orientation rule for the discriminating paths that also involves some additional conditional independence tests.
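The extra check for a single unshielded triple might be sketched as follows. This is hypothetical Python code: `indep_test` stands in for the conditional independence test, and the list bookkeeping of Colombo et al. (2012) for newly created or destroyed triples is omitted.

```python
def rfci_triple_check(a, b, c, sepset, indep_test, edges, v_structures):
    """Sketch of RFCI's extra check for an unshielded triple a - b - c.

    `sepset` maps an unordered pair to its separating set from Step 1;
    `indep_test(x, y, S)` returns True if x and y are judged conditionally
    independent given S; `edges` is the current set of undirected edges.
    """
    S = sepset[frozenset((a, c))]
    if indep_test(a, b, S):
        edges.discard(frozenset((a, b)))      # extra edge removal
    elif indep_test(b, c, S):
        edges.discard(frozenset((b, c)))      # extra edge removal
    elif b not in S:
        v_structures.append((a, b, c))        # orient a -> b <- c

# Toy oracle: no extra independencies, and b is not in sepset(a, c),
# so the triple is oriented as a v-structure.
edges = {frozenset(("a", "b")), frozenset(("b", "c"))}
vstr = []
rfci_triple_check("a", "b", "c", {frozenset(("a", "c")): set()},
                  lambda x, y, S: False, edges, vstr)
print(vstr)  # [('a', 'b', 'c')]
```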
From our results, it immediately follows that RFCI with any of our modifications of the PC-algorithm is correct in the oracle version. Because of its more involved rules for v-structures and discriminating paths, several adaptations are needed to create a fully order-independent algorithm. For example, the additional conditional independence tests conducted for the v-structures are based on the separating sets found in Step 1. As mentioned before (see Example 3.3), these separating sets are order-dependent, and therefore the possible edge deletions based on them are order-dependent as well, leading to an order-dependent skeleton. To produce a fully order-independent skeleton, one could use an approach similar to the conservative one for the v-structures to make the additional edge removals order-independent. Nevertheless, we can already remove a large amount of the order-dependence in the skeleton by using the stable version of the skeleton step as a first step; we refer to this modification as RFCI-stable. This procedure does not produce a fully order-independent skeleton, but as shown in Section 5.2, it reduces the order-dependence considerably. Moreover, we can combine this modification with CPC or MPC on the final skeleton to reduce the order-dependence of the v-structures; we refer to these modifications as CRFCI-stable and MRFCI-stable. Finally, we can again use lists for the orientation rules, as in the FCI-algorithm, to reduce the order-dependence caused by the orientation rules.
The CCD-algorithm (Richardson, 1996) can also be made order-independent using similar approaches.
4.5 High-dimensional consistency
The original PC-algorithm has been shown to be consistent for certain sparse high-dimensional graphs. In particular, Kalisch and Bühlmann (2007) proved consistency for multivariate Gaussian distributions. More recently, Harris and Drton (2012) showed consistency for the broader class of Gaussian copulas when using rank correlations, under slightly different conditions.

These high-dimensional consistency results allow the DAG G = G_n and the number of observed variables p = p_n to grow as a function of the sample size n. The corresponding CPDAGs that represent G_n are denoted by C_n, and the estimated CPDAGs using tuning parameter α_n are denoted by Ĉ_n(α_n). Then the consistency results say that, under some conditions, there exists a sequence α_n such that P(Ĉ_n(α_n) = C_n) → 1 as n → ∞.
These consistency results rely on the fact that the PC-algorithm only performs conditional independence tests between pairs of variables given subsets whose size is at most the degree of the graph (when no errors are made). We made sure that our modifications still obey this property, and therefore the consistency results of Kalisch and Bühlmann (2007) and Harris and Drton (2012) remain valid for the (L)CPC(-stable) and (L)MPC(-stable) algorithms, under exactly the same conditions as for the original PC-algorithm.
Finally, the consistency results of Colombo et al. (2012) for the FCI- and RFCI-algorithms remain valid for the (L)CFCI(-stable), (L)MFCI(-stable), CRFCI(-stable), and MRFCI(-stable) algorithms, under exactly the same conditions as for the original FCI- and RFCI-algorithms.
5 Simulations
We compared all algorithms using simulated data from low-dimensional and high-dimensional systems with and without latent variables. In the low-dimensional setting, we compared the modifications of PC, FCI and RFCI. All algorithms performed similarly in this setting, and the results are presented in Appendix A.1. The remainder of this section therefore focuses on the high-dimensional setting, where we compared (L)PC(-stable), (L)CPC(-stable) and (L)MPC(-stable) in systems without latent variables, and RFCI(-stable), CRFCI(-stable) and MRFCI(-stable) in systems with latent variables. We omitted the FCI-algorithm and the modifications with lists for the orientation rules of RFCI because of their computational complexity. Our results show that our modified algorithms perform better than the original algorithms in the high-dimensional settings we considered.
In Section 5.1 we describe the simulation setup. Section 5.2 evaluates the estimation of the skeleton of the CPDAG or PAG (i.e., only looking at the presence or absence of edges), and Section 5.3 evaluates the estimation of the CPDAG or PAG (i.e., also including the edge marks). Appendix A.2 compares the computing time and the number of conditional independence tests performed by PC and PC-stable, showing that PC-stable generally performs more conditional independence tests, and is slightly slower than PC. Finally, Appendix A.3 compares the modifications of FCI and RFCI in two medium-dimensional settings with latent variables, where the number of nodes in the graph is roughly equal to the sample size and we allow somewhat denser graphs. The results indicate that also in this setting our modified versions perform better than the original ones.
5.1 Simulation setup
We used the following procedure to generate a random weighted DAG with a given number of vertices p and an expected neighborhood size E[N]. First, we generated a random adjacency matrix A with independent realizations of Bernoulli random variables in the lower triangle of the matrix and zeroes in the remaining entries. Next, we replaced the ones in A by independent realizations of a Uniform([0.1, 1]) random variable, where a nonzero entry A_ij can be interpreted as an edge from X_j to X_i with weight A_ij. (We bounded the edge weights away from zero to avoid problems with near-unfaithfulness.)
We related a multivariate Gaussian distribution to each DAG by letting X_1 = ε_1 and X_i = Σ_{r ∈ pa(X_i)} A_ir X_r + ε_i for i = 2, ..., p, where ε_1, ..., ε_p are mutually independent standard normal random variables. The variables X_1, ..., X_p then have a multivariate Gaussian distribution with mean zero and covariance matrix Σ = (I − A)^{-1} (I − A)^{-T}, where I is the p × p identity matrix.
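In Python, this data-generating mechanism can be sketched as follows; the values of p, n, and the Bernoulli probability below are illustrative placeholders rather than the values used in the simulations.

```python
import numpy as np

# Illustrative sketch of the data-generating process: a strictly lower-
# triangular weight matrix A (a nonzero A[i, j] is an edge X_j -> X_i)
# and the linear Gaussian SEM X_i = sum_j A[i, j] X_j + eps_i.
rng = np.random.default_rng(0)
p, n, prob = 10, 200, 0.2   # placeholder dimensions and edge probability

A = np.tril(rng.binomial(1, prob, size=(p, p)), k=-1).astype(float)
A[A == 1] = rng.uniform(0.1, 1.0, size=int(A.sum()))  # weights away from 0

# X = (I - A)^{-1} eps, so Cov(X) = (I - A)^{-1} (I - A)^{-T}.
eps = rng.standard_normal((n, p))
X = np.linalg.solve(np.eye(p) - A, eps.T).T
print(X.shape)  # (200, 10)
```

Since A is strictly lower triangular, I − A is invertible and the linear system can be solved directly, giving an i.i.d. sample of size n from the SEM.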
We generated 250 random weighted DAGs with p = 1000 and E[N] = 2, and for each weighted DAG we generated an i.i.d. sample of size n. In the setting without latents, we simply used all variables. In the setting with latents, we removed half of the variables that had no parents and at least two children, chosen at random.
We estimated each graph for 20 random orderings of the variables, using the sample versions of (L)PC(-stable), (L)CPC(-stable), and (L)MPC(-stable) in the setting without latents, and the sample versions of RFCI(-stable), CRFCI(-stable), and MRFCI(-stable) in the setting with latents, at several significance levels α for the partial correlation tests. Thus, from each randomly generated DAG, we obtained 20 estimated CPDAGs or PAGs from each algorithm, for each value of α.
5.2 Estimation of the skeleton
Figure 5 shows the number of edges, the number of errors, and the true discovery rate for the estimated skeletons. The figure only compares PC and PC-stable in the setting without latent variables, and RFCI and RFCI-stable in the setting with latent variables, since the modifications for the v-structures and the orientation rules do not affect the estimation of the skeleton.
We first consider the number of edges in the estimated skeletons, shown in the first row of Figure 5. We see that PC-stable and RFCI-stable return estimated skeletons with fewer edges than PC and RFCI, for all values of α. This can be explained by the fact that PC-stable and RFCI-stable tend to perform more tests than PC and RFCI (see also Appendix A.2). Moreover, for both algorithms smaller values of α lead to sparser outputs, as expected. When interpreting these plots, it is useful to know that the average numbers of edges in the true CPDAGs and PAGs are 1000 and 919, respectively. Thus, for both algorithms and almost all values of α, the estimated graphs are too sparse.
The second row of Figure 5 shows that PC-stable and RFCI-stable make fewer errors in the estimation of the skeletons than PC and RFCI, for all values of α. This may be somewhat surprising given the observations above: for most values of α the output of PC and RFCI is already too sparse, and the output of PC-stable and RFCI-stable is even sparser. Hence, PC-stable and RFCI-stable must yield a decrease in the number of false positive edges that outweighs any increase in the number of false negative edges.
This conclusion is also supported by the last row of Figure 5, which shows that PC-stable and RFCI-stable have a better true discovery rate (TDR) for all values of α, where the TDR is defined as the proportion of edges in the estimated skeleton that are also present in the true skeleton.
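For concreteness, the TDR can be computed from edge sets as follows (illustrative Python; the convention for an empty estimated skeleton is our own):

```python
def true_discovery_rate(est_edges, true_edges):
    """TDR = fraction of estimated skeleton edges also in the true skeleton.

    Edges are unordered pairs, e.g. frozenset({'a', 'b'}).
    """
    if not est_edges:
        return 1.0  # convention: no estimated edges, no false discoveries
    return len(est_edges & true_edges) / len(est_edges)

est = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("a", "d")]}
true = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("c", "d")]}
print(true_discovery_rate(est, true))  # 2 of 3 estimated edges are true
```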
Figure 6 shows more detailed results for the estimated skeletons of PC and PC-stable for one of the 250 graphs (randomly chosen), for four different values of α. For each value of α shown, PC yielded a certain number of stable edges that were present for all 20 variable orderings, but also a large number of extra edges that pop in and out seemingly at random for different orderings. The PC-stable algorithm yielded far fewer edges (shown in red), and roughly captured the edges that were stable among the different variable orderings for PC. The results for RFCI and RFCI-stable show a similar picture.
5.3 Estimation of the CPDAGs and PAGs
We now consider estimation of the CPDAG or PAG, that is, we also take the edge orientations into account. For CPDAGs, we summarize the number of estimation errors using the structural Hamming distance (SHD), which is defined as the minimum number of edge insertions, deletions, and flips that are needed in order to transform the estimated graph into the true one. For PAGs, we summarize the number of estimation errors by counting the number of errors in the edge marks, which we call "SHD edge marks". For example, if the estimated PAG contains an edge between two variables that differs from the corresponding edge in the true PAG in one of its two edge marks, this counts as one error, while it counts as two errors if both edge marks differ or if the two variables are not adjacent in the true PAG.
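A simplified sketch of such a distance, counting one error per node pair whose edge status or orientation differs, is given below (illustrative Python; the SHD implementation in the R-package pcalg may differ in details):

```python
from itertools import combinations

def shd(g1, g2, nodes):
    """Simplified SHD sketch.

    A graph g maps ordered pairs to True for a directed edge x -> y; an
    undirected edge is recorded in both directions. Each node pair whose
    edge status differs (missing/extra edge, or differing orientation)
    contributes one unit, matching one insertion, deletion, or flip.
    """
    d = 0
    for x, y in combinations(nodes, 2):
        e1 = (g1.get((x, y), False), g1.get((y, x), False))
        e2 = (g2.get((x, y), False), g2.get((y, x), False))
        if e1 != e2:
            d += 1
    return d

truth = {("a", "b"): True, ("b", "c"): True, ("c", "b"): True}  # a -> b, b - c
est   = {("a", "b"): True, ("b", "a"): True,                    # a - b
         ("b", "c"): True, ("c", "b"): True,                    # b - c
         ("a", "c"): True}                                      # extra a -> c
print(shd(truth, est, ["a", "b", "c"]))  # 2
```

Here the estimated graph makes two errors: the edge a - b should be oriented as a → b, and the edge a → c should not be present at all.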
Figure 7 shows that the PC-stable and RFCI-stable versions have significantly better estimation performance than the versions with the original skeleton, for all values of α. Moreover, MPC(-stable) and CPC(-stable) perform better than PC(-stable), as do MRFCI(-stable) and CRFCI(-stable) with respect to RFCI(-stable). Finally, for PC, introducing bidirected edges and lists in LCPC(-stable) and LMPC(-stable) seems to make little difference.
Figure 8 shows the variance in SHD for the CPDAGs (Figure 8(a)) and the variance in SHD edge marks for the PAGs (Figure 8(b)), both computed over the 20 random variable orderings per graph, and then plotted as averages over the 250 randomly generated graphs for the different values of α. The PC-stable and RFCI-stable versions yield significantly smaller variances than their counterparts with unstabilized skeletons. Moreover, the variance is further reduced for (L)CPC-stable and (L)MPC-stable, as well as for CRFCI-stable and MRFCI-stable, as expected.

Figure 9 shows receiver operating characteristic (ROC) curves for the directed edges in the estimated CPDAGs (Figure 9(a)) and PAGs (Figure 9(b)). We see that finding directed edges is much harder in settings that allow for hidden variables, as shown by the lower true positive rates (TPR) and higher false positive rates (FPR) in Figure 9(b). Within each figure, the different versions of the algorithms perform roughly similarly, with MPC-stable and MRFCI-stable yielding the best ROC curves.

6 Yeast gene expression data
We also compared the PC and PC-stable algorithms on the yeast gene expression data (Hughes et al., 2000) that were briefly discussed in Section 1. In Section 6.1 we consider estimation of the skeleton of the CPDAG, and in Section 6.2 we consider estimation of bounds on causal effects.
We used the same preprocessed data as in Maathuis et al. (2010). These contain: (1) expression measurements of 5361 genes for 63 wild-type cultures (observational data of sample size 63), and (2) expression measurements of the same 5361 genes for 234 single-gene deletion mutant strains (interventional data of sample size 234).
6.1 Estimation of the skeleton
We applied PC and PC-stable to the observational data. We saw in Section 1 that the PC-algorithm yielded estimated skeletons that were highly dependent on the variable ordering, shown in black in Figure 10 for 26 variable orderings (the original ordering and 25 random orderings of the variables). The PC-stable algorithm does not suffer from this order-dependence, and consequently all 26 variable orderings produce the same skeleton, shown in red in the figure. We see that the PC-stable algorithm yielded a far sparser skeleton (2086 edges for PC-stable versus 5015-5159 edges for the PC-algorithm, depending on the variable ordering). Just as in the simulations in Section 5, the order-independent skeleton from the PC-stable algorithm roughly captured the edges that were stable among the different order-dependent skeletons estimated for different variable orderings with the original PC-algorithm.
To make "captured the edges that were stable" somewhat more precise, we defined the following two sets: Set 1 contained all edges (directed edges) that were present for all 26 variable orderings using the original PC-algorithm, and Set 2 contained all edges (directed edges) that were present for a large majority of the 26 variable orderings using the original PC-algorithm. Set 1 contained 1478 edges (7 directed edges), while Set 2 contained 1700 edges (20 directed edges).
Table 3 shows how well the PC and PC-stable algorithms recover these stable edges, in terms of the number of edges in the estimated graphs that are present in Sets 1 and 2 (IN) and the number of edges in the estimated graphs that are not present in Sets 1 and 2 (OUT). We see that the number of estimated edges present in Sets 1 and 2 is about the same for both algorithms, while the output of the PC-stable algorithm has far fewer edges that are not present in the two sets.
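The construction of such sets of stable edges can be sketched as follows (illustrative Python; min_fraction = 1.0 corresponds to a Set 1-style construction, and lower thresholds to a Set 2-style construction):

```python
from collections import Counter

def stable_edges(skeletons, min_fraction=1.0):
    """Edges present in at least `min_fraction` of the estimated skeletons.

    Each skeleton (one per variable ordering) is a set of frozenset edges.
    """
    counts = Counter(e for sk in skeletons for e in sk)
    k = len(skeletons)
    return {e for e, c in counts.items() if c >= min_fraction * k}

# Three toy skeletons from three variable orderings:
runs = [
    {frozenset(("a", "b")), frozenset(("b", "c"))},
    {frozenset(("a", "b")), frozenset(("a", "c"))},
    {frozenset(("a", "b")), frozenset(("b", "c"))},
]
print(len(stable_edges(runs, 1.0)))   # 1: only a - b survives all orderings
```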
                 Edges                          Directed edges
                 PC-algorithm   PC-stable       PC-algorithm   PC-stable
Set 1   IN       1478 (0)       1478 (0)        7 (0)          7 (0)
        OUT      3606 (38)      607 (0)         4786 (47)      1433 (7)
Set 2   IN       1688 (3)       1688 (0)        19 (1)         20 (0)
        OUT      3396 (39)      397 (0)         4774 (47)      1420 (7)
6.2 Estimation of causal effects
We used the interventional data as the gold standard for estimating the total causal effects of the 234 deleted genes on the remaining genes (see Maathuis et al. (2010)). We then defined the largest effects in absolute value as the target set of effects, and we evaluated how well IDA (Maathuis et al., 2009, 2010) identified these effects from the observational data.
We saw in Figure 1(b) that IDA with the original PC-algorithm is highly order-dependent. Figure 11 shows the same analysis with PC-stable (solid black lines). We see that using PC-stable generally yielded better and more stable results than the original PC-algorithm. Note that some of the curves for PC-stable are worse than the reference curve of Maathuis et al. (2010) towards the beginning of the curves. This can be explained by the fact that the original variable ordering seems to be especially "lucky" for this part of the curve (cf. Figure 1(b)). There is still variability in the ROC curves in Figure 11, due to the order-dependent v-structures (caused by order-dependent separating sets) and orientations in the PC-stable algorithm, but this variability is less prominent than in Figure 1(b). Finally, we see that there are three curves that produce a very poor fit.
Using CPC-stable and MPC-stable helps to stabilize the outputs, and in fact all 25 random variable orderings produce almost the same CPDAGs for both modifications. Unfortunately, these estimated CPDAGs are almost entirely undirected (around 90 directed edges among the 2086 edges), which leads to a large equivalence class and consequently to a poor performance of IDA; see the dashed black line in Figure 11, which corresponds to the 25 random variable orderings for both the CPC-stable and MPC-stable algorithms.
Another possible solution for the order-dependence of the orientations would be to use stability selection (Meinshausen and Bühlmann, 2010) to find the most stable orientations among the runs. In fact, Stekhoven et al. (2012) already proposed a combination of IDA and stability selection, which led to improved performance compared to IDA alone, but they used the original PC-algorithm and did not permute the variable ordering. We present here a more extensive analysis, where we consider the PC-algorithm (black lines), the PC-stable algorithm (red lines), and the MPC-stable algorithm (blue lines). Moreover, for each of these algorithms we consider three different methods to estimate the CPDAGs and the causal effects: (1) use the original ordering of the variables (solid lines); (2) use the methodology of Stekhoven et al. (2012) with 100 stability selection runs, without permuting the variable orderings (labelled + SS; dashed lines); and (3) use the methodology of Stekhoven et al. (2012) with 100 stability selection runs, permuting the variable ordering in each run (labelled + SSP; dotted lines). The results are shown in Figure 12, where we investigate the performance for the top 20000 effects instead of the top 5000 as in Figures 1(b) and 11.
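The "+ SSP" scheme can be sketched as follows. This is a generic illustrative sketch, not the code of Stekhoven et al. (2012): `fit` is a placeholder for one structure-learning run on a subsample, and all names are our own.

```python
import random

def stability_selection_ssp(indices, fit, n_vars, n_runs=100,
                            threshold=0.6, seed=0):
    """Stability selection with permuted variable orderings.

    Each run uses a random half of the samples and a freshly permuted
    variable ordering; edges selected in at least `threshold` of the runs
    are kept. `fit(subsample, ordering)` returns a set of selected edges.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_runs):
        sub = rng.sample(indices, len(indices) // 2)      # random half
        ordering = rng.sample(range(n_vars), n_vars)      # permuted ordering
        for e in fit(sub, ordering):
            counts[e] = counts.get(e, 0) + 1
    return {e for e, c in counts.items() if c / n_runs >= threshold}

# Toy fit: one edge found in every run, one only under some orderings.
fit = lambda sub, order: ({("a", "b")}
                          | ({("b", "c")} if order[0] == 0 else set()))
sel = stability_selection_ssp(list(range(20)), fit, n_vars=5)
print(("a", "b") in sel)  # True: the stable edge survives the threshold
```

The "+ SS" variant corresponds to dropping the permutation and passing the original ordering to every run.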
We see that PC with stability selection and permuted variable orderings (PC + SSP) loses some performance at the beginning of the curve compared to PC with standard stability selection (PC + SS), but performs much better afterwards. The PC-stable algorithm with the original variable ordering performs very similarly to PC plus stability selection (PC + SS) along the whole curve. Moreover, PC-stable plus stability selection (PC-stable + SS and PC-stable + SSP) loses a bit at the beginning of the curves, but picks up much more signal later on. It is interesting to note that for PC-stable with stability selection, it makes little difference whether the variable orderings are permuted or not, even though PC-stable is not fully order-independent (see Figure 11). In fact, PC-stable plus stability selection (with or without permuted variable orderings) produces the best fit of all results.
Acknowledgments
We thank Richard Fox, Markus Kalisch, and Thomas Richardson for their valuable comments.
7 Discussion
Due to their computational efficiency, constraint-based causal structure learning algorithms are often used in sparse high-dimensional settings. We have seen, however, that especially in these settings the order-dependence of these algorithms is highly problematic.
In this paper, we investigated this issue systematically and resolved the various sources of order-dependence. There are of course many ways in which the order-dependence issues could be resolved, and we designed our modifications to be as simple as possible. Moreover, we made sure that existing high-dimensional consistency results for the PC-, FCI- and RFCI-algorithms remain valid for their modifications under the same conditions. We showed that our proposed modifications yield improved and more stable estimation in sparse high-dimensional settings for simulated data, while their performance is similar to that of the original algorithms in low-dimensional settings.
In addition to the order-dependence discussed in this paper, there is another minor type of order-dependence: the output of these algorithms also depends on the order in which the final orientation rules for the edges are applied. The reason is that an edge (mark) can be eligible for orientation by several orientation rules, and may be oriented differently depending on which rule is applied first. In our analyses, we always used the original orderings in which the rules were given.
Compared to the adaptation of Cano et al. (2008), the modifications we propose are much simpler, and we made sure that they preserve existing soundness, completeness, and high-dimensional consistency results. Finally, our modifications can easily be combined with other adaptations of constraint-based algorithms, for example hybrid versions of PC with score-based methods (Singh and Valtorta, 1993; Spirtes and Meek, 1995; van Dijk et al., 2003) or the PC algorithm described in Spirtes et al. (2000, Section 5.4.2.3).
All software is implemented in the R-package pcalg (Kalisch et al., 2012).
Appendix A Additional simulation results
We now present additional simulation results for low-dimensional settings (Appendix A.1), high-dimensional settings (Appendix A.2), and medium-dimensional settings (Appendix A.3).
A.1 Estimation performance in low-dimensional settings
We considered the estimation performance in low-dimensional settings with less sparse graphs.
For the scenario without latent variables, we generated 250 random weighted DAGs as described in Section 5.1. For each weighted DAG we generated an i.i.d. sample of size n. We then estimated each graph for 50 random orderings of the variables, using the sample versions of (L)PC(-stable), (L)CPC(-stable), and (L)MPC(-stable) at several significance levels α for the partial correlation tests. Thus, for each randomly generated graph, we obtained 50 estimated CPDAGs from each algorithm, for each value of α. Figure 13 shows the estimation performance of PC (circles; black line) and PC-stable (triangles; red line) for the skeleton. Figure 14 shows the estimation performance of all modifications of PC and PC-stable with respect to the CPDAGs, in terms of the SHD and of the variance of the SHD over the 50 random variable orderings per graph.
For the scenario with latent variables, we generated 120 random weighted DAGs as described in Section 5.1. For each DAG we generated an i.i.d. sample of size n. To assess the impact of latent variables, we randomly defined half of the variables with no parents and at least two children in each DAG to be latent. We then estimated each graph for 20 random orderings of the observed variables, using the sample versions of FCI(-stable), CFCI(-stable), MFCI(-stable), RFCI(-stable), CRFCI(-stable), and MRFCI(-stable) at several significance levels α for the partial correlation tests. Thus, for each randomly generated graph, we obtained 20 estimated PAGs from each algorithm, for each value of α. Figure 15 shows the estimation performance of FCI (circles; black dashed line), FCI-stable (triangles; red dashed line), RFCI (circles; black solid line), and RFCI-stable (triangles; red solid line) for the skeleton. Figure 16 shows the estimation performance for the PAGs in terms of the SHD edge marks and of the variance of the SHD edge marks over the 20 random variable orderings per graph.
Regarding the skeletons of the CPDAGs and PAGs, the estimation performances of PC and PC-stable, as well as of (R)FCI and (R)FCI-stable, are basically indistinguishable for all values of α. However, Figure 15 shows that FCI(-stable) returns graphs with slightly fewer edges than RFCI(-stable), for all values of α. This is related to the fact that FCI(-stable) tends to perform more tests than RFCI(-stable).
Regarding the CPDAGs and PAGs, the performance of the modifications of PC and (R)FCI (black lines) is very close to that of PC-stable and (R)FCI-stable (red lines). Moreover, CPC(-stable) and MPC(-stable), as well as C(R)FCI(-stable) and M(R)FCI(-stable), perform better in particular in reducing the variance of the SHD and the SHD edge marks, respectively. This indicates that most of the order-dependence in the low-dimensional setting lies in the orientation of the edges.
We also note that in all considered measures there are only small differences between the modifications of FCI and those of RFCI.
A.2 Number of tests and computing time
We consider the number of tests and the computing time of PC and PC-stable in the high-dimensional setting described in Section 5.1.
One can easily deduce that Step 1 of the PC and PC-stable algorithms performs the same number of tests at level ℓ = 0, because the adjacency sets do not play a role at this stage. Moreover, for ℓ ≥ 1, PC-stable performs at least as many tests as PC, since the adjacency sets