1 Introduction
It has been proved that, under the Causal Markov, Causal Faithfulness, and Causal Sufficiency Assumptions, there are no uniformly consistent estimators of Markov equivalence classes of causal structures represented by directed acyclic graphs (DAGs) (Robins et al., 2003). Kalisch and Bühlmann (2007) showed that for linear Gaussian models, under the Causal Markov Assumption, the Strong Causal Faithfulness Assumption, and the assumption of causal sufficiency, the PC algorithm is a uniformly consistent estimator of the Markov equivalence class of the true causal DAG; it follows that there are also uniformly consistent estimators of the causal effects that are identifiable from the Markov equivalence class. The kTriangleFaithfulness Assumption is a strictly weaker assumption that avoids some implausible implications of the Strong Causal Faithfulness Assumption and still allows for uniformly consistent estimates of Markov equivalence classes (in a weakened sense) and of identifiable causal effects.
However, both of these assumptions are restricted to linear Gaussian models. We propose the Generalized kTriangleFaithfulness Assumption, which can be applied to any smooth distribution. In addition, under the Generalized kTriangleFaithfulness Assumption, we describe the Edge Estimation Algorithm, which provides uniformly consistent estimates of causal effects in some cases (and otherwise outputs “can’t tell”), and the Very Conservative SGS Algorithm, which (in a slightly weaker sense) is a uniformly consistent estimator of the Markov equivalence class of the true DAG.
2 The Basic Assumptions for Causal Discovery
2.1 DAG and Causal Markov Condition
We use directed acyclic graphs to represent causal relations between variables. A directed graph $G = \langle V, E\rangle$ consists of a set of nodes $V$ and a set of edges $E \subseteq V \times V$. If there is an edge $\langle A, B\rangle \in E$, we write $A \to B$. $A$ is a parent of $B$, and $B$ is a child of $A$; the edge is out of $A$ and into $B$, and $A$ is the source and $B$ is the target. A directed path from $X$ to $Y$ is a sequence of ordered edges where the source of the first edge is $X$, the target of the last edge is $Y$, and, if there are $n$ edges in the sequence, for $1 \le i < n$ the target of the $i$th edge is the source of the $(i+1)$th edge; $X$ is then an ancestor of $Y$, and $Y$ is a descendant of $X$.
If a variable $Y$ is in a structure $X \to Y \leftarrow Z$, and there is no edge between $X$ and $Z$, we call $\langle X, Y, Z\rangle$ an unshielded collider; if there is also an edge between $X$ and $Z$, then $\langle X, Y, Z\rangle$ is a triangle and we call it a shielded collider. If $\langle X, Y, Z\rangle$ is a triangle but $Y$ is not a child of both $X$ and $Z$, we call $\langle X, Y, Z\rangle$ a shielded noncollider; if there is no edge between $X$ and $Z$, then $\langle X, Y, Z\rangle$ is an unshielded noncollider.
A model $M$ is an ordered pair $\langle P, G\rangle$, where $P$ is a probability distribution over the set of variables $V$ in $G$. A distribution $P$ over a set of variables $V$ satisfies the (local) Markov condition for $G$ if and only if each variable in $V$ is independent of its nonparental nondescendants, conditional on its parents. Given $M$, $P_M$ denotes $P$ and $G_M$ denotes $G$. Two DAGs $G_1$ and $G_2$ are Markov equivalent if the conditional independence relations entailed by the Markov condition in $G_1$ are the same as in $G_2$. It has been proven that two DAGs are Markov equivalent if and only if they have the same adjacencies and the same unshielded colliders (Verma and Pearl, 1990). A pattern $O$ is a partially directed graph that represents a set $\mathcal{M}$ of Markov equivalent DAGs: an edge $X \to Y$ is in $O$ if it is in every DAG in $\mathcal{M}$; if $X \to Y$ is in some DAG and $Y \to X$ is in some other DAG in $\mathcal{M}$, then $X - Y$ is in $O$ (Spirtes and Zhang, 2016). We assume causal sufficiency, which means that $V$ contains all direct common causes of variables in $V$.
We want a model that represents the causal relations in a given population, so we adopt the Causal Markov Assumption (Spirtes et al., 2001), which is stated in Section 2.2.
2.2 Faithfulness, linear Gaussian case and kTriangleFaithfulness
Given a model $\langle P, G\rangle$ that satisfies the Markov condition, we say that $P$ is faithful to $G$ if every conditional independence relation that holds in $P$ is entailed by $G$ via the Markov condition. We further make the Causal Markov and Causal Faithfulness assumptions:
Causal Markov Assumption: If the true causal model $M$ of a population is causally sufficient, each variable in $V$ is independent of its nonparental nondescendants, conditional on its parents in the causal DAG of $M$ (Spirtes and Zhang, 2016).
Causal Faithfulness Assumption: all conditional independence relations that hold in the population are consequences of the Markov condition from the underlying true causal DAG.
In this paper we talk about cases where $P$ over $V$ for $M = \langle P, G\rangle$ respects the Causal Markov Assumption. Suppose $P$ is faithful to $G$, all variables in $V$ are Gaussian, and all causal relations are linear, that is, any $Y \in V$ can be written as
\[ Y = \sum_{X \in \mathbf{Pa}(Y)} a_{X,Y}\, X + \varepsilon_Y, \]
where $\mathbf{Pa}(Y)$ denotes the set of parents of $Y$ in $G$, the set of variables $V$ is jointly standard Gaussian, each $a_{X,Y}$ is a real-valued coefficient, and the $\varepsilon_Y$ are jointly Gaussian and jointly independent. Then a zero conditional correlation $\rho(X, Y \mid W) = 0$ between any two variables $X, Y \in V$, where $W \subseteq V \setminus \{X, Y\}$, implies that there is no edge between $X$ and $Y$. Based on the equation above, we define the edge strength $e(X \to Y)$ in the linear Gaussian case as the corresponding coefficient $a_{X,Y}$.
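The link between missing edges and vanishing partial correlations in the linear Gaussian case can be checked numerically. The following is a minimal sketch, not part of the paper: the function names `implied_covariance` and `partial_corr`, the three-variable chain, and its coefficients are all hypothetical, and for simplicity the variables are not rescaled to unit variance (partial correlations do not depend on scale).

```python
import numpy as np

def implied_covariance(B, noise_var):
    """Covariance of X when X_j = sum_i B[i, j] * X_i + eps_j (edge i -> j)."""
    d = B.shape[0]
    A = np.linalg.inv(np.eye(d) - B.T)  # solves X = B^T X + eps, so X = A @ eps
    return A @ np.diag(noise_var) @ A.T

def partial_corr(cov, i, j, cond=()):
    """Partial correlation of variables i and j given the indices in cond."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(cov[np.ix_(idx, idx)])  # precision of the sub-vector
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

# Hypothetical chain X -> Y -> Z with edge coefficients 0.8 and 0.5.
B = np.zeros((3, 3))
B[0, 1] = 0.8
B[1, 2] = 0.5
cov = implied_covariance(B, np.ones(3))

rho_xz_given_y = partial_corr(cov, 0, 2, [1])  # no edge X - Z: vanishes
rho_xy = partial_corr(cov, 0, 1)               # edge X -> Y: does not vanish
```

In this sketch the coefficient on an edge plays the role of the edge strength, so shrinking `B[0, 1]` toward zero shrinks `rho_xy` with it.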
It has been proved that under the Causal Markov and Faithfulness assumptions, there are no uniformly consistent estimators of Markov equivalence classes of causal structures represented by DAGs (Robins et al., 2003). Kalisch and Bühlmann (2007) showed that such uniform consistency is achieved by the PC algorithm if the underlying DAG is sparse relative to the sample size, under a strengthened version of the Faithfulness Assumption. In the linear Gaussian case, this Strong Causal Faithfulness Assumption bounds the absolute value of any partial correlation not entailed to be zero by the true causal DAG away from zero by some positive constant. It has the implausible consequence that it puts a lower bound on the strength of edges, since a very weak edge entails a very weak partial correlation. However, the Strong Causal Faithfulness Assumption can be weakened to the strictly weaker (for some values of $K$) kFaithfulness Assumption while still achieving uniform consistency. Furthermore, at the cost of having a smaller subset of edges oriented, the kFaithfulness Assumption can be weakened to the kTriangleFaithfulness Assumption while preserving uniform consistency: the kTriangleFaithfulness Assumption (Spirtes and Zhang, 2014) only bounds the conditional correlation between variables in a triangle structure from below by some function of the corresponding edge strength:
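kTriangleFaithfulness Assumption: Given a set of variables $V$, suppose the true causal model over $V$ is $M = \langle P, G\rangle$, where $P$ is a Gaussian distribution over $V$, and $G$ is a DAG with vertices $V$. For any variables $X$, $Y$, $Z$ that form a triangle in $G$:
if $Y$ is a noncollider on the path $\langle X, Y, Z\rangle$, then $|\rho(X, Z \mid W)| \ge K\,|e(X - Z)|$ for all $W \subseteq V \setminus \{X, Z\}$ that do not contain $Y$; and
if $Y$ is a collider on the path $\langle X, Y, Z\rangle$, then $|\rho(X, Z \mid W)| \ge K\,|e(X - Z)|$ for all $W \subseteq V \setminus \{X, Z\}$ that do contain $Y$,
where $X - Z$ represents the edge between $X$ and $Z$ whose direction is not determined (Spirtes and Zhang, 2014).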
The kTriangleFaithfulness Assumption is strictly weaker than the Strong Faithfulness Assumption in several respects: the Strong Faithfulness Assumption does not allow edges to be weak anywhere in a graph, while the kTriangleFaithfulness Assumption only excludes conditional correlations $\rho(X, Z \mid W)$ from being too small if $X$ and $Z$ are in some triangle structure and $X - Z$ is not a weak edge; and for every $\epsilon$ used in the Strong Faithfulness Assumption as the lower bound for any partial correlation, there is a $K$ for the kTriangleFaithfulness Assumption that gives a lower bound smaller than $\epsilon$.
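To make the noncollider clause of the assumption concrete, the sketch below checks a bound of the form $|\rho(X, Z \mid W)| \ge K\,|e(X - Z)|$ over all conditioning sets that exclude $Y$. It is only an illustration under stated assumptions: the inequality is taken in the form given by Spirtes and Zhang (2014), and the function names, the value of `K`, and the two hand-computed covariance matrices are hypothetical.

```python
from itertools import combinations

import numpy as np

def partial_corr(cov, i, j, cond=()):
    """Partial correlation of variables i and j given the indices in cond."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(cov[np.ix_(idx, idx)])
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

def noncollider_bound_holds(cov, x, y, z, edge_strength, K):
    """Check |rho(x, z | W)| >= K * |edge_strength| for every W without y."""
    others = [v for v in range(cov.shape[0]) if v not in (x, y, z)]
    for r in range(len(others) + 1):
        for W in combinations(others, r):
            if abs(partial_corr(cov, x, z, W)) < K * abs(edge_strength):
                return False
    return True

# Triangle X -> Y -> Z, X -> Z with unit noise variances (hypothetical values).
# Case 1: coefficients (1, 1, -1); the two paths from X to Z cancel exactly.
cov_cancel = np.array([[1.0, 1.0, 0.0],
                       [1.0, 2.0, 1.0],
                       [0.0, 1.0, 2.0]])
# Case 2: coefficients (1, 1, 0.5); no cancellation.
cov_ok = np.array([[1.0, 1.0, 1.5],
                   [1.0, 2.0, 2.5],
                   [1.5, 2.5, 4.25]])

violated = not noncollider_bound_holds(cov_cancel, 0, 1, 2, edge_strength=-1.0, K=0.1)
satisfied = noncollider_bound_holds(cov_ok, 0, 1, 2, edge_strength=0.5, K=0.1)
```

The first case is the kind of exact cancellation inside a triangle that the assumption rules out when the edge $X - Z$ is not weak.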
2.3 VCSGS Algorithm
The algorithm we use to infer the structure of the underlying true causal graph is the Very Conservative SGS (VCSGS) algorithm, which takes uniformly consistent tests of conditional independence as input:
VCSGS Algorithm
1. Form the complete undirected graph $H$ on the given set of variables $V$.
2. For each pair of variables $X$ and $Y$ in $V$, search for a subset $S$ of $V \setminus \{X, Y\}$ such that $X$ and $Y$ are independent conditional on $S$. Remove the edge between $X$ and $Y$ in $H$ if and only if such a set is found.
3. Let $H_1$ be the graph resulting from Step 2. For each unshielded triple $\langle X, Y, Z\rangle$ (the only two variables that are not adjacent are $X$ and $Z$):
   (i) If $X$ and $Z$ are not independent conditional on any subset of $V \setminus \{X, Z\}$ that contains $Y$, then orient the triple as a collider: $X \to Y \leftarrow Z$.
   (ii) If $X$ and $Z$ are not independent conditional on any subset of $V \setminus \{X, Z\}$ that does not contain $Y$, then mark the triple as a noncollider.
   (iii) Otherwise, mark the triple as ambiguous.
4. Execute the following orientation rules until none of them applies:
   (i) If $X \to Y - Z$, and the triple $\langle X, Y, Z\rangle$ is marked as a noncollider, then orient $Y - Z$ as $Y \to Z$.
   (ii) If $X \to Y \to Z$ and $X - Z$, then orient $X - Z$ as $X \to Z$.
   (iii) If $X \to Y \leftarrow Z$, another triple $\langle X, W, Z\rangle$ is marked as a noncollider, and $W - Y$, then orient $W - Y$ as $W \to Y$.
5. Let $H_2$ be the graph resulting from Step 4. For each consistent disambiguation of the ambiguous triples in $H_2$ (i.e., each disambiguation that leads to a pattern), test whether the resulting pattern satisfies the Markov condition. If every pattern does, then mark all the ‘apparently nonadjacent’ pairs as ‘definitely nonadjacent’.
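The adjacency search and the classification of unshielded triples described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it covers only the first three steps, it assumes a user-supplied oracle `indep(x, y, S)` in place of a statistical test, and all function names are hypothetical.

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def vcsgs_skeleton_and_triples(variables, indep):
    """indep(x, y, S) -> True if x and y are judged independent given S."""
    variables = list(variables)
    # Step 1: start from the complete undirected graph.
    adj = {frozenset(pair) for pair in combinations(variables, 2)}
    # Step 2: remove an edge iff some conditioning set separates the pair.
    for x, y in combinations(variables, 2):
        rest = [v for v in variables if v not in (x, y)]
        if any(indep(x, y, s) for s in subsets(rest)):
            adj.discard(frozenset((x, y)))
    # Step 3: classify each unshielded triple <x, y, z>.
    marks = {}
    for y in variables:
        nbrs = [v for v in variables if frozenset((v, y)) in adj]
        for x, z in combinations(nbrs, 2):
            if frozenset((x, z)) in adj:
                continue  # shielded triple, skipped here
            rest = [v for v in variables if v not in (x, z)]
            with_y = [s for s in subsets(rest) if y in s]
            without_y = [s for s in subsets(rest) if y not in s]
            if not any(indep(x, z, s) for s in with_y):
                marks[(x, y, z)] = "collider"
            elif not any(indep(x, z, s) for s in without_y):
                marks[(x, y, z)] = "noncollider"
            else:
                marks[(x, y, z)] = "ambiguous"
    return adj, marks

# Oracle for a hypothetical chain X -> Y -> Z: X and Z are independent given Y only.
def chain_oracle(x, y, s):
    return {x, y} == {"X", "Z"} and "Y" in s

adj, marks = vcsgs_skeleton_and_triples(["X", "Y", "Z"], chain_oracle)
```

The exhaustive search over conditioning sets mirrors the SGS style of the algorithm and is exponential in the number of variables, so the sketch is only practical for very small graphs.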
It has been proved that under the kTriangleFaithfulness Assumption, the VCSGS algorithm is uniformly consistent in the inference of graph structure. Furthermore, a follow-up algorithm that estimates edge strength given the output of VCSGS is also uniformly consistent. We are going to prove that the uniform consistency of the estimation of causal influences under the kTriangleFaithfulness Assumption can be extended to discrete and nonparametric cases as long as there are uniformly consistent tests of conditional independence (which in the general case requires a smoothness assumption), by showing that edges missed in the inference of causal structure are so weak that the estimates of the causal influences are still uniformly consistent.
3 Nonparametric Case
For the nonparametric case, we consider variables supported on $[0,1]$. We first define the strength of the edge $X \to Y$ as the maximum change in the $L_1$ norm of the probability of $Y$ when we condition on different values of $X$ while holding everything else constant:
where denotes the set of values that takes, the set of values that parents of Y take, the probability distribution of and the density of for . Since we are conditioning on the set of parents, the conditional probability is equal to the manipulated probability.
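For discrete variables, the verbal definition above can be turned into a small computation. The sketch below is an assumption-laden illustration: it takes the norm to be $L_1$, represents the conditional distribution of $Y$ as a table indexed by the value of $X$ and the joint value of the other parents, and uses a hypothetical function name `edge_strength_l1`.

```python
import numpy as np

def edge_strength_l1(cpt):
    """Largest L1 change in P(Y | X, other parents) when only X changes.

    cpt[x, o, y] = P(Y = y | X = x, other parents = o).
    """
    n_x = cpt.shape[0]
    best = 0.0
    for x1 in range(n_x):
        for x2 in range(x1 + 1, n_x):
            # L1 distance over the values of Y, one entry per setting o.
            diff = np.abs(cpt[x1] - cpt[x2]).sum(axis=-1)
            best = max(best, float(diff.max()))
    return best

# Hypothetical tables: X binary, one other binary parent, Y binary.
no_effect = np.array([[[0.3, 0.7], [0.6, 0.4]],
                      [[0.3, 0.7], [0.6, 0.4]]])
some_effect = np.array([[[0.9, 0.1], [0.5, 0.5]],
                        [[0.4, 0.6], [0.5, 0.5]]])

strength_none = edge_strength_l1(no_effect)
strength_some = edge_strength_l1(some_effect)
```

Taking the maximum over the settings of the other parents means that an edge counts as strong if changing $X$ matters in at least one context.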
Then we can make the kTriangleFaithfulness Assumption: given a set of variables V, where the true causal model over V is , is a distribution over V, and G is a DAG with vertices V, for any variables X, Y, Z that form a triangle in G:

if Z is a noncollider on the path , given any subset ,
for some 
if Z is a collider on the path , then for every , given any subset , for some
In order to have uniformly consistent tests of conditional independence, we make a smoothness assumption for continuous variables with support on $[0,1]$:
TV Smoothness(L): Let be the collection of distributions , such that for all , we have:
Given the TV smoothness(L), is continuous. Furthermore, since () is compact, for any (the set of all variables in the true causal graph) , attains its max and min on its support. Since is finite, we can further assume conditional densities are nonzero (NZ(T)):
for any , , for some .
Notice that by TV Smoothness(L) and that variables have support on , we can derive an upper bound for probability of any variable given its parents:
Although the discrete probability case does not have support on , and its probability is not continuous, it still satisfies the TV smoothness(L) assumption: for instance, if the discrete variables have support only on integers, we can set . By replacing the density with the probability in NZ(T) assumption, we have a NZ(T) assumption for the discrete case. Therefore the proof of uniform consistency for the nonparametric case in the rest of the paper also works for the discrete case.
3.1 Uniform Consistency in the inference of structure
We use norm to characterize dependence :
.We want a test of versus . is a family of functions: one for each sample size, that takes an i.i.d sample
from the joint distribution over
. Then the test is uniformly consistent w.r.t. a set of distributions for :

for every ,
With the TV Smoothness(L) assumption, there are uniformly consistent tests of conditional independence, such as the minimax optimal conditional independence test proposed by Neykov et al. (2020).
Given any causal model $M = \langle P, G\rangle$ over $V$, let $C^n(M)$ denote the (random) output of the VCSGS algorithm given an i.i.d. sample of size $n$ from the distribution $P$. There are three types of errors that $C^n(M)$ can contain that will mislead the estimation of causal influences:
1. $C^n(M)$ errs in kind I if it has an adjacency not in $G$;
2. $C^n(M)$ errs in kind II if every adjacency it has is in $G$, but it has a marked noncollider not in $G$;
3. $C^n(M)$ errs in kind III if every adjacency and marked noncollider it has is in $G$, but it has an orientation not in $G$.
If $C^n(M)$ errs in any of these three ways, there will be variables $X$ and $Y$ in $V$ such that $X$ is treated as a parent of $Y$ but is not a parent of $Y$ in the true graph $G$; if there is no undirected edge connecting $Y$ in this $C^n(M)$, the algorithm will estimate the causal influence of the “parents” of $Y$ on $Y$, but such an estimate does not bear useful information, since intervening on $X$ does not really influence $Y$. Notice that missing an edge is not listed as a mistake here; we are going to prove later that the estimate of causal influence can still be used to correctly predict the effect of an intervention even if the algorithm misses edges.
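The three kinds of error are nested, which the following sketch makes explicit. It is an illustration only: the representation of an output as three sets (adjacencies, marked noncolliders, oriented edges) and the function name `error_kind` are hypothetical choices, not the paper's data structures.

```python
def error_kind(out_adj, out_noncolliders, out_arrows,
               true_adj, true_noncolliders, true_arrows):
    """Return 'I', 'II', 'III', or None for an output compared with the truth.

    Adjacencies are frozensets {a, b}; noncolliders are triples (x, y, z);
    arrows are (tail, head) pairs. Missing edges are deliberately not errors.
    """
    if not out_adj <= true_adj:
        return "I"      # some adjacency in the output is not in the true graph
    if not out_noncolliders <= true_noncolliders:
        return "II"     # adjacencies fine, but a marked noncollider is wrong
    if not out_arrows <= true_arrows:
        return "III"    # adjacencies and noncolliders fine, orientation wrong
    return None

# Hypothetical true graph: X -> Y -> Z plus X -> Z.
true_adj = {frozenset("XY"), frozenset("YZ"), frozenset("XZ")}
true_noncolliders = {("X", "Y", "Z")}
true_arrows = {("X", "Y"), ("Y", "Z"), ("X", "Z")}

# An output that misses the edge X - Z but is otherwise correct.
missed_edge = error_kind({frozenset("XY"), frozenset("YZ")},
                         {("X", "Y", "Z")}, {("X", "Y")},
                         true_adj, true_noncolliders, true_arrows)
# An output that reverses Y -> Z.
wrong_arrow = error_kind(true_adj, set(), {("Z", "Y")},
                         true_adj, true_noncolliders, true_arrows)
```

An output that only misses edges is reported as error-free, matching the remark that missed edges are handled separately in the analysis of the estimated causal influences.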
Let be the set of causal models over V under TriangleFaithfulness Assumption, TV smoothness(L) and the assumptions of NZ(T).
We will prove that under the causal sufficiency of the measured variables V, causal Markov assumption, kTriangleFaithfulness, TV smoothness(L) assumption and NZ(T) assumption,
We begin by proving a useful lemma that bounds with strengths of the edge :
Lemma 3.1.
Given an ancestral set that contains the parents of but not :
If is a parent of :
Proof.
The last step is derived using a direct conclusion from NZ(T):
∎
We are going to prove for each case that the probability for to make each of the three kinds of mistakes uniformly converges to zero. Since the proofs for the kind I and kind III errors are almost the same as the proof for the linear Gaussian case (Spirtes and Zhang, 2014), we are only going to prove kind II here.
Lemma 3.2.
Given causal sufficiency of the measured variables , the Causal Markov, kTriangleFaithfulness, TV smoothness(L) assumption and NZ(T) assumption:
Proof.
For any , if it errs in kind II then it contains a marked noncollider that is not in . Since it’s been proved (Spirtes and Zhang, 2014):
the errors of kind II can be one of two cases:
(I) the marked noncollider is an unshielded collider in the true graph;
(II) the marked noncollider is a shielded collider in the true graph.
The proof for case (I) is the same as the proof for errors of kind I (Spirtes and Zhang, 2014), so we prove here that the probability of case (II) uniformly converges to zero as the sample size increases.
We are going to prove by contradiction. Suppose that the probability that making a mistake of kind II does not uniformly converge to zero. Then there exists , such that for every sample size , there is a model such that the probability of containing an unshielded noncollider that is a shielded collider in is greater than . Let that triangle be with being the parent of in .
The algorithm will identify the triple as an unshielded noncollider only if:
there is a set containing , such that the test of returns 0, call this test ;
there is an ancestral set that contains and but not , such that for set , the test returns 1, call this test .
If the claim we want to prove does not hold for the algorithm, then for every sample size there is a model such that:
(1)
(2)
(1) tells us that there is some such that and as since the test is uniformly consistent. So we have:
The last step is by kTriangleFaithfulness
So .
By Lemma 3.1, .
Therefore, as ∎
Theorem 3.3.
Given causal sufficiency of the measured variables , the Causal Markov, kTriangleFaithfulness, TV smoothness(L) and NZ(T) assumptions:
Proof.
Since we have proved that the probability for to make any of the three kinds of mistakes uniformly converges to 0, the theorem directly follows. ∎
3.2 Uniform consistency in the inference of causal effects
Edge Estimation Algorithm:
For each vertex $Y$ such that all of the edges containing $Y$ are oriented in the output of VCSGS, let $\mathbf{Pa}$ be the parent set of $Y$ in that output; we use a histogram to estimate the conditional density of $Y$ given $\mathbf{Pa}$ (in this section the density is written in a form chosen to match the result of the estimation). For any of the remaining edges, return ‘Unknown’. If any estimate of a conditional probability violates TV smoothness(L), abandon the output and return ‘Unknown’.
Defining the distance between and
The method for estimation for is: we first get and by histogram, then we get:
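A histogram estimate of a conditional density of this kind can be sketched as the ratio of a joint histogram to a marginal histogram. The code below is a minimal illustration with a single parent; reading the sentence above as a ratio estimator is an assumption, and the function name, the equal-width bins on $[0,1]$, and the toy sample are hypothetical.

```python
import numpy as np

def conditional_density_histogram(x, y, m):
    """Histogram estimate of p(y | x) on [0, 1]^2 with m bins per axis.

    Returns an (m, m) array; row i holds the estimated density of Y over its
    bins given that X falls in bin i (NaN where no X sample fell in bin i).
    """
    joint, _, _ = np.histogram2d(x, y, bins=m,
                                 range=[[0.0, 1.0], [0.0, 1.0]], density=True)
    marg, _ = np.histogram(x, bins=m, range=(0.0, 1.0), density=True)
    cond = np.full_like(joint, np.nan)
    # Divide the joint density by the marginal density where the latter is > 0.
    np.divide(joint, marg[:, None], out=cond, where=marg[:, None] > 0)
    return cond

# Tiny hypothetical sample, small enough to check by hand.
x = np.array([0.1, 0.15, 0.6, 0.9])
y = np.array([0.2, 0.8, 0.5, 0.5])
cond = conditional_density_histogram(x, y, m=2)
```

Each row with data integrates to one over the bins of $Y$, which is a quick sanity check that the ratio behaves like a conditional density.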
Let be the output of the Edge Estimation Algorithm, and be a causal model, we define the conditional probability distance, , between and to be:
where denotes the parent set of in causal model . By convention if is “Unknown”.
Now we want to show, the edge estimation algorithm is uniformly consistent in the sense that for every ,
Here is any causal model satisfying causal sufficiency of the measured variables , the Causal Markov, kTriangleFaithfulness, TV smoothness(L) and NZ(T) assumptions and is the output of the algorithm given an iid sample from .
Proof.
Let be the set of possible graphs of . Since given , there are only finitely many outputs in , it suffices to prove that for each output ,
Now partition all the into three sets given :

all adjacencies, nonadjacencies and orientations in O are true in;

only some adjacencies, or orientations in O are not true in ;

only some nonadjacencies in O are not true in .
It suffices to show that for each ,
:
For any , if the conditional probabilities of a vertex in can be estimated (so not “Unknown”), it means that . Recall that the histogram estimator is close to the true density with high probability:
for any ,
where is continuous and monotonically decreasing with respect to and and (the number of bins), where is the dimensionality of the . For instance, when estimating .
Given a , entails that
. By monotonicity of , when s.t. ,
.
Therefore the histogram estimators of
and are uniformly consistent. Next we are going to use the lemma below:
Lemma 3.4.
If and are uniformly consistent
estimators of and
,
then
is a uniformly consistent estimator for
Proof.
Recall that:
for any ,
where is continuous and monotonically decreasing with respect to and and (the number of bins), where is the dimensionality of the . For instance, when estimating .
For any , entails that
. By monotonicity of , when s.t. , .
Let , notice that for any , with probability at least ,
(Footnote 2: here we use instead of because is dependent on the dimension of .)
(By and the fact that the estimation result can only be valid if it satisfies TV smoothness(L). Footnote 3: recall that denotes the set of variables in the true graph.)
The second to the last step is derived because is upper bounded by by TV smoothness.
We have:
So
is a uniformly consistent estimator for
∎
By Lemma 3.4, we conclude that is a uniformly consistent estimator for
,
So:
the proof is exactly the same as for the discrete case.
Let be the population version of . Since the histogram estimator is uniformly consistent over and there are finitely many parentchild combinations, for every there is a sample size , such that for , and all ,
Since only some nonadjacencies in are not true in , we know that for any vertex that have some estimated conditional probabilities given its parents in , where denotes the parent set of in the when the underlying probability is (i.e., is the true causal model). Since , for any and , is a marginalization of . Therefore, the distance between and is:
Given the corresponding to the equation above, let and . Since is the marginalization of all