, whether a group of events causes another group of events is a central problem in a wide range of fields from natural to social sciences. A common approach to inferring causality is Randomized controlled trials (RCT). Here the experimenter intervenes on a system of variables (often called stimulus variables) such that it is not affected by any confounders with the variables of interest (often called response variables) and observes the probability distributions on the response variables. Unfortunately, in many cases of interest performing RCT is either costly or impossible due to practical or ethical or legal reasons. A common example is the age-old debate on whether smoking causes cancer. In such scenarios RCT is completely out of the question due to ethical issues. This necessitates new inference techniques.
The causal inference problem has been extensively studied in statistics and mathematics (e.g., [19, 21, 25, 26]) where decades of research has led to rich mathematical theories and a framework for conceptualizing and analyzing causal inference. One such line of work is the Linear Structural Equation Model (or in short) for formalizing causal inference (see the monograph  for a survey of classical results). In fact, this is among the most commonly used models of causality in social sciences [3, 5] and some natural sciences . In this model, we are given a mixed graph on (observable) variables111In this paper we are interested in properties for large . of the system containing both directed and bi-directed edges (see Figure 1 for an example). We will assume that the directed edges in the mixed graph form a directed acyclic graph (DAG). A directed edge from vertex to vertex represents the presence of causal effect of variable on variable , while the bi-directed edges represent the presence of confounding effect (modeled as noise) which we next explain.222We also interchangeably refer to the directed edges as observable edges since they denote the direct causal effects and the bi-directed edges as unobservable edges since they indicate the unobserved common causes. In the , the following extra assumption is made (see Equation (1.1)): the value of a variable is determined by a (weighted) linear combination of its parents’ (in the directed graph) values added with a zero-mean Gaussian noise term (). The bi-directed graph indicates dependencies between the noise variables (lack of an edge between variables and implies that the covariance between and is ). We use to represent the matrix of edge weights of the DAG, to represent the covariance matrix of the Gaussian noise and to represent the covariance matrix of the observation data (henceforth called data covariance matrix). Letdenote the vector of corresponding noises whose covariance matrix is given by . Formally, the assumes the following relationship between the random variables in :
From the Gaussian assumption on the noise random variable , it follows that is also a multi-variate Gaussian with covariance matrix given by,
In a typical setting, the experimenter estimates the joint-distribution by estimating the covariance matrixover the observable variables obtained from finitely many samples. The experimenter also has a causal hypothesis (represented as a mixed graph for the causal effects and the covariance among the noise variables, which in turn determines which entries of and are required to be , referred to as the zero-patterns of and ). One then wants to solve the inverse problem of recovering and given . This problem is solvable for some special types of mixed graphs using parameter recovery algorithms, such as the one in [13, 17].
Thus a central question in the study of is for which mixed graphs (specified by their zero patterns) and which values of the parameters ( and ) is the inverse problem above solvable; in other words, which parameters are identifiable. Ideally one would like all values of the parameters to be identifiable. However, identifiability is often too strong a property to expect to hold for all parameters and instead we are satisfied with a slightly weaker property, namely generic identifiability (GI): here we require that identifiability holds for all parameter values except for a measure set (according to some reasonable measure). (The issue of identifiability is a very general one that arises in solving inverse problems in statistics and many other areas of science.) A series of works (e.g., [7, 15, 14, 17, 23]) have made progress on this question by providing various classes of mixed graphs that do allow generic identifiability. However, the general problem of generic identifiability has not been fully resolved. This problem is important since it is a version of a central question in science: what kind of causal hypotheses can be validated purely from observational data as opposed to the situations where one can do RCT. Much of the prior work has focused on designing algorithms with the assumption that the exact joint distribution over the variables is available. However, in practice, the data is noisy and inaccurate and the joint distribution is generated via finite samples from this noisy data.
While theoretical advances assuming exact joint distribution have been useful it is imperative to understand the effect of violation of these assumptions rigorously. Algorithms for identification in s are numerical algorithms that solve the inverse problem of recovering the underlying parameters constructed from noisy and limited data. Such models and associated algorithms are useful only if they solve the inverse problem in a stable fashion: if the data is perturbed by a small amount then the recovered parameters change only a small amount. If the recovery is unstable then the recovered parameters are unlikely to tell us much about the underlying causal model as they are inordinately affected by the noise. We say that robust identifiability (RI) holds for parameter values and if even after perturbing the corresponding to (by a small noise), the recovered parameters values and are close to the original ones. It follows from the preceding discussion that to consider an inverse problem solved it is not sufficient for generic identifiability to hold; instead, we would like the stronger property of robust identifiability to hold for almost all parameter values (we call this generic robust identifiability; for now we leave the notions of “almost everywhere” and “close” informal). In addition, the problem should also be solvable by an efficient algorithm. The mixed graphs we consider all admit efficient parameter recovery algorithms. Note that GI and RI are properties of the problem and not of the algorithm used to solve the problem.
In the general context of inverse problems, the difference between GI and RI is quite common and important: e.g., in solving a system of linear equations in variables given by an matrix . For any reasonable distribution on matrices, the set of singular matrices has measure (an algebraic set of lower dimension given by ). Hence, is invertible with probability
and GI holds. This however does not imply that generic robust identifiability holds: for that one needs to say that the set of ill-conditioned matrices has small measure. To understand RI for this problem, one needs to resort to analyses of the minimum singular value ofwhich are non-trivial in comparison to GI (e.g., see [8, Sec 2.4]
). In general, proving RI almost everywhere turns out to be harder and remains an open problem in many cases even though GI results are known. One recent example is rank-1 decomposition of tensors (forrandom tensors of rank up to , GI is known, whereas generic robust identifiability is known only up to rank ); see, e.g., . The problem of tensor decomposition is just one example of a general recent concern for RI results in the theoretical computer science literature and there are many more examples. Generic robust identifiability remains an open problem for semi-Markovian models for which efficient GI is known, e.g., .
In the context of causality, the study of robust identifiability was initiated in  where the authors construct a family of examples in the so-called Pearl’s notion of causality on
semi-Markovian graphs333Unlike , this is a non-parametric model. The functional dependence of a variable on the
parents’ variables is allowed to be fully general, in particular it need not be linear. This of course comes at the price of making
the inference problem computationally and statistically harder.
, this is a non-parametric model. The functional dependence of a variable on the parents’ variables is allowed to be fully general, in particular it need not be linear. This of course comes at the price of making the inference problem computationally and statistically harder.(see e.g., ) and show that for this family there exists an adversarial perturbation of the input which causes the recovered parameters to be drastically different (under an appropriate metric described later) from the actual set of parameters. However this result has the following limitation. Their adversarial perturbations are carefully crafted and this worst case scenario can be alleviated by modifying just a few edges (in fact just deleting some edges randomly suffices which can also be achieved without changing the zero-pattern by choosing the parameters appropriately). This leads us to ask: how prevalent are such perturbations? Since there is no canonical notion of what a typical LSEM model is (i.e. the graph structure and the associated parameters), we will assume that the graph structure is given and the parameters are randomly chosen according to some reasonable distribution. Thus we would like to answer the following question444The question of understanding the condition number (a measure of stability) of LSEMs was raised in , though presumably in the worst-case sense of whether there are identifiable instances for which recovery is unstable. As mentioned in their work, when the model is not uniquely identifiable, the authors in  show an example where uncertainty in estimating the parameters can be unbounded..
[Informal] For the class of s that are uniquely identifiable, does robust identifiability hold for most choices of parameters? The question above is informal mainly because “most choices of parameters” is not a formally defined notion. We can quantify this notion by putting a probability measure on the space of all parameters. This is what we will do.
Notation. Throughout this paper, we will use the following notation. Bold fonts represent matrices and vectors. We use the following shorthand in proofs.
Given matrices , we define the relative distance, denoted by as the following quantity.
In this paper we use the notion of condition number (see  for a detailed survey on condition numbers in numerical algorithms) to quantitatively measure the effect of perturbations on data in the parameter recovery problem. The specific definition of condition number we use is a natural extension of the -condition number studied in  to matrices.
[Relative -condition number] Let be a given data covariance matrix and be the corresponding parameter matrix. Let a -perturbed family of matrices be denoted by (i.e., set of matrices such that ). For any let the corresponding recovered parameter matrix be denoted by . Then the relative -condition number is defined as,
We confine our attention to the stability of recovery of as once is recovered approximately, can be easily approximated in a stable manner. In this paper, we restrict our theoretical analyses to causal models specified by bow-free paths which are inspired by bow-free graphs 555See footnote 4 in  for the significance of bow-free graphs in .. In  the authors show that the bow-free property666Bow-free (mixed) graphs are those where for any pair of vertices, both directed and undirected edges are never present simultaneously. In other words, any pair of variables with direct causal relationship (implied by a directed edge connecting the two variables) do not have correlated errors (not connected by a bi-directed edge) of the mixed graph underlying the causal model is a sufficient condition for unique identifiability. We now define bow-free paths; Figure 2 has an example.
[Bow-free Paths] A causal model is called a bow-free path if the underlying DAG forms a directed path and the mixed graph is bow-free . Consider vertices indexed . The (observable) DAG forms a path starting at with directed edges going from vertex to vertex . The bi-directed edges can exist between pairs of vertices only if . Thus the only potential non-zero entries in are in the diagonal immediately above the principal diagonal. Similarly, the diagonal immediately above and below the principal diagonal in is always zero.777We want to emphasize that bow-free paths allow all pairs, which do not have a causal directed edge, to have bi-directed edges.
We further assume the following conditions on the parameters of the bow-free paths for studying the numerical stability of any parameter recovery algorithm. Later in Section 4 we show that a natural random process of generating and satisfies these assumptions with high-probability. Informally, the model can be viewed as a generalization of symmetric diagonally dominant (SDD) matrices.
Our model satisfies the following conditions on the data-covariance matrix and the perturbation with the underlying graph structure given by bow-free paths. It takes parameters and .
Data properties. is an symmetric matrix satisfying the following conditions888We note that the important part is that is bounded by a constant independent of . The precise constant is of secondary importance. for some .
Additionally, the true parameters , corresponding to , satisfy the following. For every such that there is a directed edge from to , we have , where is a parameter.
Perturbation properties. For each and a fixed , let be arbitrary numbers satisfying for all such that that there exists a pair with . Note that we eventually consider in the limit going to , hence some small but fixed suffices. Let for every pair .
The covariance matrix has errors arising from three sources: (1) measurement errors, (2) numerical errors, and (3) error due to finitely many samples. The combined effect is modeled by Item 2 (perturbation properties) above of Model 1.4 which is very general as we only care about the max error.
With the definitions and model in place, we can now pose Question 1 formally:
[Formal] For the class of s represented by bow-free paths and Model 1, can we characterize the behavior of the -condition number?
1.1. Our Contributions
Our key contribution in this paper is a step towards understanding Question 1 by providing an answer to Question 2. In particular, we first prove that when Model assumptions 1 hold, the -condition number of the parameter recovery problem is upper-bounded by a polynomial in . Formally, we prove Theorem 1.1. This implies that the loss in precision scales logarithmically and hence we need at most additional bits to get a precision of bits999Note we already need bits to represent the graph and the associated matrices.. See  and  for further discussion on the relation between condition number and the number of bits needed for -bit precision in computation as well as other implications of small condition number.
[Stability Result] Under the assumptions in Model 1, we have the following bound on the condition number for bow-free paths with vertices.
More specifically, the condition number is upper-bounded by and for the parameters chosen in this model, we have the bound in Theorem 1.1.
Furthermore, in Section 4 we show that a natural generative process to construct matrices and satisfies the Model assumptions 1 101010We in fact study various regimes in the parameter space of the generative process and show in which of those regimes the model assumptions hold.. Hence this implies that a large class of instances are well-conditioned and hence ill-conditioned models are not prevalent under our generative assumption. Moreover, as described in the model preliminaries, all data covariance matrices that are SDD and have the property that every row has at least entries that are not arbitrarily close to , satisfy the assumptions in Model 1. Thus an important corollary of this theorem is that when the data covariance matrix is in this class of SDD matrices, it is well-conditioned.
Next we show that there exist examples for s with arbitrarily high condition number. This implies that on such examples it is unreasonable to expect any form of accurate computation. Formally, we prove Theorem 1.1. It shows that the recovery problem itself has a bad condition number and does not depend on any particular algorithm that is used for parameter recovery. This theorem follows easily using the techniques of  and we include it for completeness.
[Instability Result] There exists a bow-free path of length four and data covariance matrix such that the parameter recovery problem of obtaining from has an arbitrarily large condition number.
We perform numerical simulations to corroborate the theoretical results in this paper. We verify our theorems using numerical simulations. Furthermore, we consider general graphs (e.g., clique of paths, layered graphs) and show that a similar behavior on the condition number holds. Finally, we consider a real-world dataset used in  and perform condition number analysis experimentally.
We conclude the paper by giving a general heuristic for practitioners to determine the condition number of any given instance. This heuristic can detect bad instances with high-probability. This procedure might be of independent interest.
1.2. Related work
A standard reference on Structural Causal Models is Bollen . Identifiability and robust identifiability of s has been studied from various viewpoints in the literature: Robustness to model misspecification, to measurement errors, etc. (e.g., Chapter 5 of Bollen  and [20, 27, 22]). These works are not directly related to this paper, since they focus on the identifiability problem under erroneous model specification and/or measurement. See the sub-section on measurement errors below for further details.
Identifiability. The problem of (generic) identification in s is well-studied though still not fully understood. We give a brief overview of the known results. All works in this section assume that the data-covariance matrix is exact and do not consider robustness aspects of the problem. These works do not directly relate to the problem studied in this paper. This problem has a rich history (see  for some classical results) and we only give an overview of recent results. Brito and Pearl  gave a sufficient condition called G-criterion which implies linear independence on a certain set of equations. The variables involved in this set is called the auxiliary variables. The main theorem in their paper is that if for every variable there is a corresponding “auxiliary” variable, then the model is identifiable. Followed by this, Foygel  introduced the notion of half-trek criterion which gives a graphical criterion for identifiability in . This criterion strictly subsumes the G-criterion framework. Both [7, 17] supply efficient algorithms for identification. Chen  tackle the problem of identifying “overidentifying constraints”, which leads to identifiability on a class of graphs strictly larger than those in  and . Ernest  consider a variant of the problem called “Partially Linear Additive SEM” (PLSEM) with gaussian noise. In this variant the value for a random variable is determined both by an additive linear factor of some of its parents (called linear parents) and additive factor of the other parents (called non-linear parents) via a non-linear function. They give characterization for identifiability in terms of graphical conditions for this more general model. Chen  extends the half-trek criterion of  and give a -component decomposition (Tian ) based algorithm to identify a broader class of models. Drton and Weihs  also extend the half-trek criterion () to include the ancestor decomposition technique of Tian . Chen  approach the parameter identification problem for via the notion of auxiliary variables. This method subsumes all the previous works above and identifies a strictly larger set of models.
Measurement errors. Another recent line of work [28, 34] studies the problem of identification under measurement errors. Both our work and these works share the same motivation: causal inference in real-world is usually performed on noisy data and hence it is important to understand how noise affects the causal identification. However their focus differs significantly from the question we tackle in this paper. They pose and answer the problem of causal identification when the variables used are not identical to the ones that were intended to be measured. This leads to different conditional independences and hence a different causal graph; they study the identification problem in this setting and characterize when such identification is possible. Recently  looked at sample complexity of identification in . They consider the special case when
is the Identity matrix and give a new algorithm for identifiability with optimal sample complexity.
In this paper we are interested in the question of robust identifiability, along the lines of aforementioned work of Schulman and Srivastava . While the work of  was direct inspiration for the present paper, since we work with LSEMs and  work with the semi-Markovian causal models of Pearl, the techniques involved are completely different. Moreover, as previously remarked, another important difference between  and our work is that our main result is positive: the parameter recovery problem is well-conditioned for most choices of the parameters for a well-defined notion of most choices.
2. Bow-free Paths — Stable Instances
In this section, we show that for bow-free paths under Model 1, the condition number is small. Later in Section 4 we show that a natural random process of generating and satisfies these assumptions. Together this implies that for a large class of inputs, the stability of the map is well-behaved. More precisely, we prove Theorem 1.1.
To prove the main theorem, we set-up some helper lemmas. Using Foygel , we obtain the following recurrence to compute the values of from the correlation matrix . The base case is . For every we have,
We first show that the model assumptions 1 imply that the parameters recovered from the perturbed version are not too large.
For each , we have .
For , we have
The last inequality used the fact that when . Indeed, this is true since . Thus, when . Therefore, . Thus, we have
|(2.2)||(From recurrence (2.1))|
|(2.3)||(From Inductive Hypothesis)|
We will now prove a lower-bound on the denominator . For every we know that . Thus, it follows that .
We will now show that . Thus, we have .
Using the data properties of the model and the inductive hypothesis, we have , and . Thus, we have
Since , this implies that for . Therefore, Equation (2.3) can be upper bounded by,
|(From Inductive Hypothesis)|
|(2.4)||(From data properties)|
We will now show that Equation (2.4) can be upper-bounded by . This will complete the induction.
Note that Equation (2.4) can be written as
Recall that we have , and . Consider . This can be written as
Thus, we have . Therefore, Eq. (2.5) can be upper-bounded by,
Consider . When we have,
Thus, Eq. (2.6) can be upper-bounded by,
Thus, as long as the induction holds. Rearranging, this inequality holds if and only if
Thus the inequality holds as long as . From the definition we have . From the inductive hypothesis we have . Thus, holds. This completes the inductive step.
The above induction implies that . ∎
Next, we show that the relative distance between the real parameter and the recovered parameter from the perturbed instance is not too large. Let . Define . Then for each we have that,
We will prove this via induction on . The base case is when . Note that and . The expression evaluates to,
This can be upper-bounded by,
The second inequality uses the perturbation properties in Model 1. Therefore, when , we have that .
Note that . Consider . Since and we have . This completes the base case.
We will now consider the case when with the inductive hypothesis that for all , the statement in the lemma is true.
Consider the expression . Expanding out the terms based on Equation (2.1) we get the absolute value of the following in the numerator.
First consider the terms without as a multiplicative factor in the above expression. After cancellations, they evaluate to
Using the inductive hypothesis, we have that . Therefore, we have,
Among the terms involving consider those which are multiplied by a . The terms and can be upper-bounded by and respectively. Among the remaining terms, three of them, namely , and , can each be upper-bounded by and the remaining three, namely , and , can each be upper-bounded by . Therefore, we have that the absolute value of the numerator can be upper-bounded by,
Since , we have that . Therefore, the numerator can be upper-bounded by
We want to lower-bound the absolute value of the denominator, which can be written as,
Using the model properties, we can lower-bound the above expression by,
Since , this can be lower-bounded by,
First, we show that . Recall that . Re-arranging we want to show that holds. Note that when and . Consider . This is at least when and . The RHS evaluates to which is larger than and thus larger than .
Re-arranging, if and only if,
Thus, for we want,
Rearranging Equation (2.9) we want . When , the RHS is at least . On the other hand, . Thus Eq. (2.9) holds and we have that is a constant greater than . Note that where the maximum is achieved when .
This completes the inductive step. Therefore, from induction the theorem holds for all . ∎
2.1. Proof of Theorem 1.1
3. Instability Example
In this section, we show that there exist simple bow-free path examples where the condition number can be arbitrarily large. We consider a point on the measure set of unidentifiable parameters and apply a small perturbation (of magnitude ). This instance in the parameter space is identifiable (by using ). We then apply a perturbation (of magnitude ) to this instance and compute the condition number. By making arbitrarily close to , we obtain as large a condition number as desired. Any point on the singular variety could be used for this purpose and the proof of Theorem 1.1 provides a concrete instance. The example we construct is as follows. Consider a path of four vertices. Fix a small value . Define the matrices and as follows.
Thus, we have the following.
Multiplying with we get is,
Therefore, the resulting matrix , obtained by is,