1 Introduction
Understanding and using causal relations among variables of interest has been a fundamental problem in various fields, including biology, neuroscience, and the social sciences. Since interventions or controlled randomized experiments are usually expensive or even impossible to conduct, discovering causal information from observational data, known as causal discovery (Spirtes et al., 2001; Pearl, 2000), has been an important task and received much attention in computer science, statistics, and philosophy. Roughly speaking, methods for causal discovery are categorized into constraint-based ones, such as the PC algorithm (Spirtes et al., 2001), and score-based ones, such as Greedy Equivalence Search (GES) (Chickering, 2002).
Causal discovery algorithms aim to find the causal relations among the observed variables. However, in many cases the measured variables are not identical to the variables we intend to measure. For instance, measured brain signals may contain error introduced by the instruments, and in the social sciences many variables are not directly measurable, so one usually resorts to proxies (e.g., for "regional security" in a particular area). In this paper, we assume that the observed variables $X_1, \ldots, X_n$ are generated from the underlying measurement-noise-free variables $\tilde{X}_1, \ldots, \tilde{X}_n$ with additional random measurement errors $E_1, \ldots, E_n$:
$X_i = \tilde{X}_i + E_i, \quad i = 1, \ldots, n.$ (1)
Here we assume that the measurement errors $E_i$ are independent from the $\tilde{X}_i$ and have nonzero variances. We call this model the CAusal Model with Measurement Error (CAMME). Generally speaking, because of the presence of measurement errors, the d-separation patterns among the observed variables $X_i$ are different from those among the underlying variables $\tilde{X}_i$. This generating process has been called the random measurement error model in (Scheines & Ramsey, 2017). According to the causal Markov condition (Spirtes et al., 2001; Pearl, 2000), the observed variables $X_i$ and the underlying variables $\tilde{X}_i$ may then have different conditional independence/dependence relations and, as a consequence, the output of constraint-based approaches to causal discovery is sensitive to such error, as demonstrated in (Scheines & Ramsey, 2017). Furthermore, because of the measurement error, the structural equation models according to which the measurement-error-free variables $\tilde{X}_i$ are generated usually do not hold for the observed variables $X_i$. (In fact, the $X_i$ follow errors-in-variables models, for which the identifiability of the underlying causal relation is not clear.) Hence, approaches based on structural equation models, such as the linear, non-Gaussian, acyclic model (LiNGAM (Shimizu et al., 2006)), will generally fail to find the correct causal direction and causal model.
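To make the setting concrete, here is a minimal simulation of a CAMME, assuming a hypothetical three-variable chain $\tilde{X}_1 \to \tilde{X}_2 \to \tilde{X}_3$; the coefficients and noise scales are made up for illustration and do not come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Measurement-error-free variables, generated by a linear SEM over a
# hypothetical chain X~1 -> X~2 -> X~3 (coefficients are illustrative).
t1 = rng.normal(size=n)
t2 = 0.8 * t1 + rng.normal(size=n)
t3 = 0.7 * t2 + rng.normal(size=n)

# Observations per Eq. (1): each X_i adds an independent measurement error.
x1 = t1 + rng.normal(scale=0.5, size=n)
x2 = t2 + rng.normal(scale=1.5, size=n)
x3 = t3 + rng.normal(scale=0.5, size=n)
```

Even in this toy case, the pairwise correlations among the $X_i$ are attenuated relative to those among the $\tilde{X}_i$, which is the root of the problems discussed below.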
In this paper, we aim to estimate the causal model underlying the measurement-error-free variables $\tilde{X}_i$ from their observed values contaminated by random measurement error. We assume linearity of the causal model and causal sufficiency relative to the $\tilde{X}_i$. We particularly focus on the case where the causal structure for the $\tilde{X}_i$ is represented by a Directed Acyclic Graph (DAG), although this condition can be weakened. In order to develop principled causal discovery methods to recover the causal model for the $\tilde{X}_i$ from observed values of the $X_i$, we have to address the following theoretical issues:
whether the causal model of interest is completely or partially identifiable from the contaminated observations,

what are the precise identifiability conditions, and

what information in the measured data is essential for estimating the identifiable causal knowledge.
We attempt to answer the above questions from both theoretical and methodological perspectives.
One of the main difficulties in dealing with causal discovery in the presence of measurement error is that the variances of the measurement errors are unknown. If they were known, one could readily calculate the covariance matrix of the measurement-error-free variables $\tilde{X}_i$ and apply traditional causal discovery methods such as the PC (Spirtes et al., 2001) or GES (Chickering, 2002) algorithm. It is worth noting that there exist causal discovery methods that deal with confounders, i.e., hidden direct common causes, such as the Fast Causal Inference (FCI) algorithm (Spirtes et al., 2001). However, they cannot estimate the causal structure over the latent variables, which is what we aim to recover in this paper. (Silva et al., 2006) and (Kummerfeld et al., ) have provided algorithms for recovering latent variables and their causal relations when each latent variable has multiple measured effects. Their problem is different from the measurement error setting we consider, where clustering for latent common causes is not required and each measured variable is the direct effect of a single "true" variable. Furthermore, as shown in the next section, their models can be seen as special cases of our setting.
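If the error variances were known, the correction mentioned above would be a one-line covariance adjustment; the sketch below is a hypothetical helper illustrating that point (the whole difficulty addressed in this paper is that `error_vars` is not available):

```python
import numpy as np

def denoise_covariance(cov_x, error_vars):
    """Hypothetical helper: with known, independent measurement-error
    variances, cov(X~) is just cov(X) with those variances removed from
    the diagonal, since cov(X) = cov(X~) + diag(error variances)."""
    return cov_x - np.diag(np.asarray(error_vars, dtype=float))
```

One could then hand the corrected matrix to PC or GES as if the data were clean.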
2 Effect of Measurement Error on Conditional Independence / Dependence
We use an example to demonstrate how measurement error changes the (conditional) independence and dependence relationships in the data. More precisely, we will see how the (conditional) independence and dependence relations between the observed variables differ from those between the measurement-error-free variables. Suppose we observe $X_1$, $X_2$, and $X_3$, which are generated from the measurement-error-free variables $\tilde{X}_1$, $\tilde{X}_2$, and $\tilde{X}_3$ according to the structure given in Figure 1. Clearly $\tilde{X}_2$ is dependent on $\tilde{X}_1$, while $\tilde{X}_1$ and $\tilde{X}_3$ are conditionally independent given $\tilde{X}_2$. One may consider general settings for the variances of the measurement errors. For simplicity, here let us assume that there is only measurement error in $X_2$, i.e., $X_2 = \tilde{X}_2 + E_2$, $X_1 = \tilde{X}_1$, and $X_3 = \tilde{X}_3$.
Let $\tilde{\rho}_{12}$ be the correlation coefficient between $\tilde{X}_1$ and $\tilde{X}_2$ and $\tilde{\rho}_{13\cdot 2}$ be the partial correlation coefficient between $\tilde{X}_1$ and $\tilde{X}_3$ given $\tilde{X}_2$, which is zero. Let $\rho_{12}$ and $\rho_{13\cdot 2}$ be the corresponding correlation coefficient and partial correlation coefficient in the presence of measurement error. We also let the $\tilde{X}_i$ be standardized and $\tilde{\rho}_{12} = \tilde{\rho}_{23} = \rho$ to make the result simpler. So we have $\tilde{\rho}_{13} = \rho^2$. Let $\gamma = \mathrm{Var}(E_2)$. For the data with measurement error,

$\rho_{12} = \frac{\rho}{\sqrt{1+\gamma}}, \qquad \rho_{13\cdot 2} = \frac{\rho^2 \gamma}{1 + \gamma - \rho^2}.$
As the variance of the measurement error in $X_2$ increases, $\rho_{12}$ decreases and finally goes to zero; in contrast, $\rho_{13\cdot 2}$, which is zero for the measurement-error-free variables, increases and finally converges to $\tilde{\rho}_{13}$. See Figure 2 for an illustration. In other words, in this example, as the variance of the measurement error in $X_2$ increases, $X_1$ and $X_2$ become more and more independent, while $X_1$ and $X_3$ become more and more conditionally dependent given $X_2$. For the measurement-error-free variables, however, $\tilde{X}_1$ and $\tilde{X}_2$ are dependent and $\tilde{X}_1$ and $\tilde{X}_3$ are conditionally independent given $\tilde{X}_2$. Hence, the structure given by constraint-based approaches to causal discovery on the observed variables can be very different from the causal structure over the measurement-error-free variables.
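The qualitative behavior can be checked numerically. The snippet below simulates such a chain with hypothetical coefficients (not taken from the paper) and contaminates only $X_2$; the sample correlation between $X_1$ and $X_2$ shrinks, while the sample partial correlation of $X_1$ and $X_3$ given $X_2$ grows away from zero as the error variance increases:

```python
import numpy as np

def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation rho_{xy.z} from the three pairwise correlations."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(1)
n = 200_000
t1 = rng.normal(size=n)
t2 = 0.8 * t1 + 0.6 * rng.normal(size=n)   # chain X~1 -> X~2 -> X~3
t3 = 0.8 * t2 + 0.6 * rng.normal(size=n)

results = {}
for err_sd in (0.0, 1.0, 3.0):
    x2 = t2 + rng.normal(scale=err_sd, size=n)   # error only in X2
    r12 = np.corrcoef(t1, x2)[0, 1]              # corr(X1, X2)
    r23 = np.corrcoef(x2, t3)[0, 1]              # corr(X2, X3)
    r13 = np.corrcoef(t1, t3)[0, 1]              # corr(X1, X3), unaffected
    results[err_sd] = (r12, partial_corr(r13, r12, r23))
```

With no error the estimated partial correlation is essentially zero; with large error it approaches the raw correlation between $X_1$ and $X_3$.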
One might apply other types of methods instead of constraint-based ones for causal discovery from data with measurement error. In fact, as the measurement-error-free variables are not observable, $\tilde{X}_2$ in Figure 1 is actually a confounder for the observed variables. As a consequence, generally speaking, due to the effect of confounders, the independent noise assumption underlying functional causal model-based approaches, such as the method based on the linear, non-Gaussian, acyclic model (Shimizu et al., 2006), will not hold for the observed variables any more. Figure 3 gives an illustration of this. Figure 3(a) shows the scatter plot of $X_2$ vs. $X_3$ and the regression line from $X_2$ to $X_3$, where the noise terms and the measurement error in $X_2$ are all uniformly distributed. As seen from Figure 3(b), the residual of regressing $X_3$ on $X_2$ is not independent from $X_2$, although the residual of regressing $\tilde{X}_3$ on $\tilde{X}_2$ is independent from $\tilde{X}_2$. As a result, functional causal model-based approaches to causal discovery may also fail to find the causal structure of the measurement-error-free variables from their contaminated observations.

3 Canonical Representation of Causal Models with Measurement Error
Let $\tilde{G}$ be the acyclic causal model over the $\tilde{X}_i$; we call it the measurement-error-free causal model. Let $B$ be the corresponding causal adjacency matrix for $\tilde{\mathbf{X}} = (\tilde{X}_1, \ldots, \tilde{X}_n)^\top$, in which $b_{ij}$ is the coefficient of the direct causal influence from $\tilde{X}_j$ to $\tilde{X}_i$ and $b_{ii} = 0$. We have

$\tilde{\mathbf{X}} = B\tilde{\mathbf{X}} + \tilde{\mathbf{E}},$ (2)

where the components of $\tilde{\mathbf{E}}$, the noise terms $\tilde{E}_i$, have nonzero, finite variances. Then $\tilde{\mathbf{X}}$ is actually a linear transformation of the error terms in $\tilde{\mathbf{E}}$, because (2) implies

$\tilde{\mathbf{X}} = (\mathbf{I} - B)^{-1}\tilde{\mathbf{E}}.$ (3)
Now let us consider two types of nodes of $\tilde{G}$, namely, leaf nodes (i.e., those that do not influence any other node) and non-leaf nodes. Accordingly, the noise terms in their structural equation models also have distinct behaviors: if $\tilde{X}_i$ is a leaf node, then $\tilde{E}_i$ influences only $\tilde{X}_i$, not any other variable; otherwise $\tilde{E}_i$ influences $\tilde{X}_i$ and at least one other variable. Consequently, we can decompose the noise vector $\tilde{\mathbf{E}}$ into two groups: $\tilde{\mathbf{E}}^{L}$ consists of the $l$ noise terms that influence only leaf nodes, and $\tilde{\mathbf{E}}^{NL}$ contains the remaining $n - l$ noise terms. Equation (3) can be rewritten as

$\tilde{\mathbf{X}} = \mathbf{A}^{NL}\tilde{\mathbf{E}}^{NL} + \mathbf{A}^{L}\tilde{\mathbf{E}}^{L},$ (4)

where $\mathbf{A}^{NL}$ and $\mathbf{A}^{L}$ are $n \times (n-l)$ and $n \times l$ matrices, respectively. Here both $\mathbf{A}^{NL}$ and $\mathbf{A}^{L}$ have specific structures. All entries of $\mathbf{A}^{L}$ are 0 or 1, and each column of $\mathbf{A}^{L}$ has only one nonzero entry. In contrast, each column of $\mathbf{A}^{NL}$ has at least two nonzero entries, representing the influences from the corresponding non-leaf noise term.
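The decomposition in (3)-(4) is mechanical once $B$ is given. A small sketch with a hypothetical chain (coefficients made up for illustration) that also checks the claimed column structure of $\mathbf{A}^{L}$ and $\mathbf{A}^{NL}$:

```python
import numpy as np

# Hypothetical chain X~1 -> X~2 -> X~3; X~3 is the only leaf node.
# B[j, i] is the coefficient of the direct influence of variable i on j.
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.7, 0.0]])
A = np.linalg.inv(np.eye(3) - B)       # X~ = A E~, cf. Eq. (3)

leaf = [i for i in range(3) if not B[:, i].any()]       # no outgoing edge
nonleaf = [i for i in range(3) if B[:, i].any()]
A_L, A_NL = A[:, leaf], A[:, nonleaf]                   # cf. Eq. (4)
```

Here the single column of `A_L` is a unit vector, while every column of `A_NL` has at least two nonzero entries, as stated above.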
Further consider the generating process of the observed variables $X_i$. Combining (1) and (4) gives

$\mathbf{X} = \tilde{\mathbf{X}} + \mathbf{E} = \mathbf{A}^{NL}\tilde{\mathbf{E}}^{NL} + (\mathbf{A}^{L}\tilde{\mathbf{E}}^{L} + \mathbf{E})$ (5)
$= [\mathbf{A}^{NL} \;\; \mathbf{I}]\,[\tilde{\mathbf{E}}^{NL\top},\ \mathbf{E}^{*\top}]^{\top},$ (6)

where $\mathbf{E}^{*} = \mathbf{A}^{L}\tilde{\mathbf{E}}^{L} + \mathbf{E}$ and $\mathbf{I}$ denotes the $n \times n$ identity matrix. To make it more explicit, we give how $\mathbf{X}^{*} = \mathbf{A}^{NL}\tilde{\mathbf{E}}^{NL}$ and $\mathbf{E}^{*}$ are related to the original CAMME process:

$X^{*}_i = \begin{cases}\tilde{X}_i, & \text{if } \tilde{X}_i \text{ is not a leaf node,}\\ \tilde{X}_i - \tilde{E}_i, & \text{otherwise;}\end{cases} \qquad E^{*}_i = \begin{cases}E_i, & \text{if } \tilde{X}_i \text{ is not a leaf node,}\\ E_i + \tilde{E}_i, & \text{otherwise.}\end{cases}$ (7)

Clearly the $E^{*}_i$ are independent across $i$ and, as we shall see in Section 4, the information shared by different $X_i$ is still captured by $\mathbf{X}^{*}$.
Proposition 1.
The proof is actually given by the construction procedure of representation (5) or (6) from the original CAMME. We call representation (5) or (6) the canonical representation of the underlying CAMME (CR-CAMME).
Example Set 1
Consider the following example with three observed variables , , for which , with causal relations . That is,
and according to (3),
Therefore,
In causal discovery from observations in the presence of measurement error, we aim to recover information of the measurement-error-free causal model $\tilde{G}$. Let us define a new graphical model, $G^{*}$. It is obtained by replacing the variables $\tilde{X}_i$ in $\tilde{G}$ with the variables $X^{*}_i$. In other words, it has the same causal structure and causal parameters (given by the matrix $B$) as $\tilde{G}$, but its nodes correspond to the variables $X^{*}_i$. If we manage to estimate the structure of $G^{*}$ and the causal parameters involved in it, then $\tilde{G}$, the causal model of interest, is recovered. Compared with $\tilde{G}$, $G^{*}$ involves some deterministic causal relations, because each leaf node is a deterministic function of its parents (the noise in leaf nodes has been removed; see (7)). We defined the graphical model $G^{*}$ because we cannot fully estimate the distribution of the measurement-error-free variables $\tilde{X}_i$, but might be able to estimate that of the $X^{*}_i$ under proper assumptions.
In what follows, most of the time we assume

The causal Markov condition holds for $G^{*}$, and the distribution of the $X^{*}_i$ is non-deterministically faithful w.r.t. $G^{*}$, in the sense that if there exists $\mathbf{Z}$, a subset of $\{X^{*}_k\}$, such that neither of $X^{*}_i$ and $X^{*}_j$ is a deterministic function of $\mathbf{Z}$ and $X^{*}_i \perp X^{*}_j \mid \mathbf{Z}$ holds, then $X^{*}_i$ and $X^{*}_j$ are d-separated by $\mathbf{Z}$ in $G^{*}$.
This non-deterministic faithfulness assumption excludes a particular type of parameter coupling in the causal model for the $X^{*}_i$. In Figure 4 we give a causal model in which the causal coefficients are carefully chosen so that this assumption is violated: certain conditional independence relations hold which are not given by the causal Markov condition on $G^{*}$. We note that this non-deterministic faithfulness is defined for the distribution of the constructed variables $X^{*}_i$, not the measurement-error-free variables $\tilde{X}_i$. (Bear in mind their relationship given in (7).) This assumption is generally stronger than the faithfulness assumption for the distribution of the $\tilde{X}_i$. In particular, in the causal model given in Figure 4, the distribution of the $\tilde{X}_i$ is still faithful w.r.t. $\tilde{G}$. Below we call the conditional independence relationship between $X^{*}_i$ and $X^{*}_j$ given $\mathbf{Z}$, where neither of $X^{*}_i$ and $X^{*}_j$ is a deterministic function of $\mathbf{Z}$, non-deterministic conditional independence.
Now we have two concerns. One is whether essential information of the CR-CAMME is identifiable from observed values of $\mathbf{X}$. We are interested in finding the causal model for (or a particular type of dependence structure in) $\mathbf{X}^{*}$. The CR-CAMME of $\mathbf{X}$, given by (5) or (6), has two terms, $\mathbf{X}^{*}$ and $\mathbf{E}^{*}$. The latter is independent across all variables, and the former preserves the major information of the dependence structure in $\tilde{\mathbf{X}}$. Such essential information of the CR-CAMME may be the covariance matrix of $\mathbf{X}^{*}$ or the matrix $\mathbf{A}^{NL}$, as discussed in the next sections. In the extreme case, if such information is not identifiable at all, then it is hopeless to find the underlying causal structure of $\tilde{\mathbf{X}}$. The other concern is what information of the original CAMME, in particular the causal model over the measurement-error-free variables, can be estimated from the above identifiable information of the CR-CAMME. Although the transformation from the original CAMME to a CR-CAMME is straightforward, without further knowledge there does not necessarily exist a unique CAMME corresponding to a given CR-CAMME: first, the CR-CAMME does not tell us which nodes are leaf nodes in $\tilde{G}$; second, even if $\tilde{X}_i$ is known to be a leaf node, it is impossible to separate the measurement error $E_i$ from the noise term $\tilde{E}_i$. Fortunately, we are not interested in everything about the original CAMME, but only the causal graph $\tilde{G}$ and the corresponding causal influences $B$.
Accordingly, in the next sections we will explore what information of the CR-CAMME is identifiable from the observations of $\mathbf{X}$ and how to further reconstruct the necessary information of the original CAMME. In the measurement error model (1) we assumed that each observed variable is generated from its own latent variable. We note that in case multiple observed variables are generated from a single latent variable, or a single observed variable is generated by multiple latent variables (see, e.g., (Silva et al., 2006)), we can still use the CR-CAMME to represent the process. In the former case, certain rows of $\mathbf{A}^{NL}$ are identical. For instance, if $X_1$ and $X_2$ are generated as noisy observations of the same latent variable, then in (5) the first two rows of $\mathbf{A}^{NL}$ are identical. (More generally, if one allows different coefficients to generate them from the latent variable, the two rows are proportional to each other.) Then let us consider an example in the latter case. Suppose $X_1$ is generated by two latent variables, for each of which there is also an observable counterpart. Write the causal model for $X_1$ as a linear combination of those latent variables plus measurement error, introduce the latent variable $\tilde{X}_1$ as that linear combination, and then we have $X_1 = \tilde{X}_1 + E_1$. The CR-CAMME formulation then follows.
4 Identifiability with Second Order Statistics
The CR-CAMME (5) has the form of a factor analysis (FA) model (Everitt, 1984), which has been a fundamental tool in data analysis. In its general form, FA assumes that the observable random vector $\mathbf{X}$ was generated by

$\mathbf{X} = \Lambda \mathbf{Z} + \mathbf{U},$ (8)

where the vector of factors $\mathbf{Z}$ satisfies $\mathbb{E}[\mathbf{Z}\mathbf{Z}^{\top}] = \mathbf{I}$, and the noise terms, the components of $\mathbf{U}$, are mutually independent and also independent from $\mathbf{Z}$. Denote by $\Psi$ the covariance matrix of $\mathbf{U}$, which is diagonal. The unknowns in (8) are the loading matrix $\Lambda$ and the covariance matrix $\Psi$.
Factor analysis exploits only the second-order statistics, i.e., it in effect assumes that all variables are jointly Gaussian. Clearly $\Lambda$ in FA is not identifiable; it suffers from at least a right orthogonal transformation indeterminacy. However, under suitable conditions, some essential information of FA is generically identifiable, as given in the following lemma.
Lemma 2.
For the factor analysis model, when the number of factors $q$ is smaller than the Ledermann bound $\phi(n) = \frac{2n + 1 - \sqrt{8n + 1}}{2}$, the model is generically globally identifiable, in the sense that for randomly generated $(\Lambda, \Psi)$ in (8), it is only with measure 0 that there exists another representation $(\Lambda', \Psi')$ such that $(\Lambda, \Psi)$ and $(\Lambda', \Psi')$ generate the same covariance matrix for $\mathbf{X}$ and $\Psi' \neq \Psi$.
This was formulated as a conjecture by (Shapiro, 1985), and was later proven by (Bekker & ten Berge, 1997). This lemma immediately gives rise to the following generic identifiability of the variances of the measurement errors.¹

¹We note that this "generic identifiability" is slightly weaker than what we want: we want to show that for certain models the relevant quantities are necessarily identifiable. Giving this proof is nontrivial and is a line of our future research.
Proposition 3.
The variances of the error terms $E^{*}_i$ and the covariance matrix of $\mathbf{X}^{*}$ in the CR-CAMME (5) are generically identifiable when the sample size $N \to \infty$ and the following assumption on the number of leaf nodes holds:

The number $l$ of leaf variables satisfies

$l \geq \frac{\sqrt{8n+1} - 1}{2}.$ (9)
Clearly the ratio of the required number of leaf nodes to $n$ is decreasing in $n$ and goes to 0 as $n \to \infty$. To give a sense of how restrictive the above condition is, Fig. 5 shows how the bound changes with $n$. In particular, when $n = 5$, condition (9) implies that the number of leaf nodes is at least 3; when $n = 10$, condition (9) implies $l \geq 4$. Roughly speaking, as $n$ increases, it is more likely for condition (9) to hold. Note that the condition given in Proposition 3 is sufficient but not necessary for the identifiability of the noise variances and the covariance matrix of the non-leaf hidden variables (Bekker & ten Berge, 1997).
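Assuming condition (9) is the Ledermann-type bound discussed above (our reconstruction; the exact constant should be checked against the original statement), the minimal number of leaf nodes can be computed as follows:

```python
import math

def min_leaf_nodes(n):
    """Smallest integer l satisfying the (reconstructed) condition (9):
    l >= (sqrt(8n + 1) - 1) / 2, i.e., the number of non-leaf variables
    n - l must not exceed the Ledermann bound on the number of factors."""
    return math.ceil((math.sqrt(8 * n + 1) - 1) / 2)
```

The required fraction of leaf nodes shrinks as the number of variables grows, which is why the condition tends to be mild for large graphs.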
Now we know that under certain conditions the covariance matrices of $\mathbf{X}^{*}$ and $\mathbf{E}^{*}$ in the CR-CAMME (5) are (asymptotically) identifiable from data observed with measurement error. Can we recover the measurement-error-free causal model from them?
4.1 Gaussian CAMME with the Same Variance For Measurement Errors
In many problems the variances of the measurement errors in different variables are roughly the same, because the same instrument is used and the variables are measured in similar ways. For instance, this might approximately be the case for functional magnetic resonance imaging (fMRI) recordings. In fact, if we make the following assumption on the measurement error, the underlying causal graph can be estimated at least up to its equivalence class, as shown in the following result.

The measurement errors in all observed variables have the same variance.
Proposition 4.
Suppose assumptions A0, A1, and A2 hold. Then as $N \to \infty$, $G^{*}$ can be estimated up to its equivalence class and, moreover, the leaf nodes of $G^{*}$ are identifiable.
Proofs are given in the Appendix. The proof of this result inspires a procedure, denoted FA+EquVar, to estimate the information of $\tilde{G}$ from contaminated observations in this case. It consists of four steps. (1) Apply FA to the data with a given number of leaf nodes and estimate the variances of the $E^{*}_i$ as well as the covariance matrix of $\mathbf{X}^{*}$.² (2) The smallest values among the variances of the $E^{*}_i$ correspond to non-leaf nodes, and the remaining nodes are taken as leaf nodes. (3) Apply a causal discovery method, such as the PC algorithm, to the submatrix of the estimated covariance matrix of $\mathbf{X}^{*}$ corresponding to the non-leaf nodes and find the causal structure over the non-leaf nodes. (4) For each leaf node, find the subset of non-leaf nodes that determines it, draw directed edges from those nodes to it, and further perform orientation propagation.

²Here we suppose the number of leaf nodes is given. In practice one may use model selection methods, such as BIC, to find this number.
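Step (2) is the only non-standard part of this pipeline: under assumption A2, the leaf nodes are exactly the ones with the largest estimated FA noise variances, because their noise term absorbs both the measurement error and the leaf SEM noise (cf. Eq. (7)). A sketch of that step, taking the FA output as given:

```python
import numpy as np

def split_by_noise_variance(noise_vars, n_leaf):
    """Under equal measurement-error variances (A2), the n_leaf largest
    estimated noise variances belong to leaf nodes (E*_i = E_i + E~_i),
    the rest to non-leaf nodes (E*_i = E_i). Returns index lists."""
    order = np.argsort(np.asarray(noise_vars))       # ascending
    cut = len(noise_vars) - n_leaf
    nonleaf = sorted(order[:cut].tolist())
    leaf = sorted(order[cut:].tolist())
    return nonleaf, leaf
```

The non-leaf indices then select the submatrix of the estimated covariance of $\mathbf{X}^{*}$ that is handed to PC in step (3).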
4.2 Gaussian CAMME: General Case
Now let us consider the general case, where we do not have constraint A2 on the measurement error. Generally speaking, after performing FA on the data, the task is to discover the causal relations among the $X^{*}_i$ by analyzing their estimated covariance matrix, which is, unfortunately, singular, with rank $n - l$. There must then exist deterministic relations among the $X^{*}_i$, and we have to deal with such relations in causal discovery. Here suppose we simply apply the Deterministic PC (DPC) algorithm (Glymour, 2007; Luo, 2006) to tackle this problem. DPC is almost identical to PC; the only difference is that when testing for the conditional independence relationship $X^{*}_i \perp X^{*}_j \mid \mathbf{Z}$, if $X^{*}_i$ or $X^{*}_j$ is a deterministic function of $\mathbf{Z}$, one ignores this test (or equivalently does not remove the edge between $X^{*}_i$ and $X^{*}_j$). We denote by FA+DPC this procedure for causal discovery from data with measurement error.
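The gating rule that distinguishes DPC from PC only needs a check for determinism, which, in this linear setting, can be read off the (possibly singular) covariance matrix: the conditional variance of the candidate variable given the conditioning set is zero. A sketch under that linearity assumption:

```python
import numpy as np

def is_deterministic(cov, target, given, tol=1e-8):
    """True iff `target` is a linear-deterministic function of the
    variables in `given`, judged from a (possibly singular) covariance
    matrix: its conditional variance given them is (numerically) zero."""
    if not given:
        return False
    S = cov[np.ix_(given, given)]
    c = cov[np.ix_(given, [target])]
    cond_var = cov[target, target] - float(c.T @ np.linalg.pinv(S) @ c)
    return abs(cond_var) < tol
```

Inside DPC, a conditional independence test for $X^{*}_i \perp X^{*}_j \mid \mathbf{Z}$ would simply be skipped whenever `is_deterministic` fires for either endpoint.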
Under some conditions on the underlying causal model, $G^{*}$ can be estimated up to its equivalence class, as given in the following proposition. Here we use $PA^{*}_i$ to denote the set of parents (direct causes) of $X^{*}_i$ in $G^{*}$.
Proposition 5.
Suppose Assumptions A0 and A1 hold. As $N \to \infty$, compared to $G^{*}$, the graph produced by the above DPC procedure does not contain any missing edge. In particular, the edges between all non-leaf nodes are correctly identified. Furthermore, the whole graph of $G^{*}$ is identifiable up to its equivalence class if the following assumption further holds:

For each pair of leaf nodes $X^{*}_i$ and $X^{*}_j$, there exist $P \in PA^{*}_i$ and $Q \in PA^{*}_j$ that are d-separated in $G^{*}$ by a variable set, which may be the empty set. Moreover, for each leaf node $X^{*}_i$ and each non-leaf node $X^{*}_j$ which are not adjacent, there exists $P \in PA^{*}_i$ which is d-separated from $X^{*}_j$ in $G^{*}$ by a variable set, which may be the empty set.
Example Set 2 and Discussion
Suppose assumption A0 holds.

Assumptions A0, A1, and A3 are sufficient conditions for $G^{*}$ to be recovered up to its equivalence class, and they, especially A3, may not be necessary. For instance, consider the causal graph given in Figure 6(b), for which assumption A3 does not hold. If assumption A2 holds, $G^{*}$ can still be uniquely estimated from the contaminated data. Other constraints may also guarantee the identifiability of the underlying graph. For example, if all coefficients in the causal model are smaller than one in absolute value, then $G^{*}$ can also be uniquely estimated from the noisy data. Relaxation of assumption A3 in a way that still guarantees that $G^{*}$ is identifiable up to its equivalence class is a future line of research.
5 Identifiability with Higher Order Statistics
The method based on second-order statistics exploits FA and deterministic causal discovery, both of which are computationally relatively efficient. However, if the number of leaf nodes is so small that the condition in Proposition 3 is violated (roughly speaking, this usually does not happen when $n$ is big, say bigger than 50, but is likely to be the case when $n$ is very small, say smaller than 10), the underlying causal model is not guaranteed to be identifiable from contaminated observations. Another issue is that with second-order statistics, the causal model for $\mathbf{X}^{*}$ is usually not uniquely identifiable; in the best case it can be recovered up to its equivalence class (and leaf nodes). To tackle these issues, below we show that we can benefit from higher-order statistics of the noise terms.
In this section we further make the following assumption on the distribution of $\tilde{\mathbf{E}}^{NL}$:

All components of $\tilde{\mathbf{E}}^{NL}$ are non-Gaussian.
We note that under the above assumption, $\mathbf{A}^{NL}$ in (6) can be estimated up to the permutation and scaling indeterminacies (including the sign indeterminacy) of its columns, as given in the following lemma.
Lemma 6.
Suppose assumption A4 holds. Given $\mathbf{X}$ generated according to (6), $\mathbf{A}^{NL}$ is identifiable up to permutation and scaling of its columns as the sample size $N \to \infty$.
Proof.
5.1 Non-Gaussian CAMME with the Same Variance For Measurement Errors
We first note that under certain assumptions the underlying graph is fully identifiable, as shown in the following proposition.
Proposition 7.
Suppose the assumptions in Proposition 4 hold, and further suppose assumption A4 holds. Then as $N \to \infty$, the underlying causal graph $\tilde{G}$ is fully identifiable from the observed values of $\mathbf{X}$.
5.2 Non-Gaussian CAMME: More General Cases
In the general case, what information of the causal structure can we recover? Can we apply existing methods for causal discovery based on LiNGAM, such as ICA-LiNGAM (Shimizu et al., 2006) and DirectLiNGAM (Shimizu et al., 2011), to recover it? LiNGAM assumes that the system is non-deterministic: each variable is generated as a linear combination of its direct causes plus a non-degenerate noise term. As a consequence, the linear transformation from the vector of observed variables to the vector of independent noise terms is a square matrix; ICA-LiNGAM applies certain operations to this matrix to find the causal model, and DirectLiNGAM estimates the causal ordering by enforcing the property that the residual of regressing the effect on the root cause is always independent from the root cause.
In our case, $\mathbf{A}^{NL}$, the essential part of the mixing matrix in (6), is $n \times (n - l)$, with $l \geq 1$. In other words, for some of the variables $X^{*}_i$, the causal relations are deterministic. (In fact, if $\tilde{X}_i$ is a leaf node in $\tilde{G}$, $X^{*}_i$ is a deterministic function of its direct causes.) As a consequence, unfortunately, the above causal analysis methods based on LiNGAM, including ICA-LiNGAM and DirectLiNGAM, do not apply. We will see how to recover information of $G^{*}$ by analyzing the estimated $\mathbf{A}^{NL}$.
We will show that a certain group structure and the group-wise causal ordering in $G^{*}$ can always be recovered. Before presenting the results, let us define the following recursive group decomposition according to the causal structure $G^{*}$.
Definition 8 (Recursive group decomposition).
Consider the causal model $G^{*}$. Put all leaf nodes which share the same direct-and-only-direct node in the same group, and further incorporate the corresponding direct-and-only-direct node in that group. Here we say a node $V$ is the "direct-and-only-direct" node of $Y$ if and only if $V$ is a direct cause of $Y$ and there is no other directed path from $V$ to $Y$. Each non-leaf node which is not a direct-and-only-direct node of any leaf node forms a separate group. We call the set of all such groups, ordered according to the causal ordering of the non-leaf nodes in the DAG, a recursive group decomposition of $G^{*}$.
Example Set 3
As seen from the process of recursive group decomposition, each non-leaf node is in one and only one recursive group, and it is possible for multiple leaf nodes to be in the same group. Therefore, in total there are $n - l$ recursive groups. For example, for $\tilde{G}$ given in Figure 6(a), a corresponding group structure for the corresponding $G^{*}$ is , and for $\tilde{G}$ in Figure 6(b), there is only one group: . For both graphs given in Figure 6(c), a recursive group decomposition is .
Note that the causal ordering and the recursive group decomposition of given variables according to the graphical model may not be unique. For instance, if $G^{*}$ has only two variables $X^{*}_1$ and $X^{*}_2$ which are not adjacent, both decompositions $(\{X^{*}_1\}, \{X^{*}_2\})$ and $(\{X^{*}_2\}, \{X^{*}_1\})$ are correct. Consider $G^{*}$ over three variables, $X^{*}_1$, $X^{*}_2$, and $X^{*}_3$, where $X^{*}_1$ and $X^{*}_2$ are not adjacent and are both causes of $X^{*}_3$; then both $(\{X^{*}_1\}, \{X^{*}_2\}, \{X^{*}_3\})$ and $(\{X^{*}_2\}, \{X^{*}_1\}, \{X^{*}_3\})$ are valid recursive group decompositions.
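Definition 8 can be implemented directly on the adjacency matrix. The sketch below is our own illustration (the chain used as a test case is hypothetical); it groups each leaf with its direct-and-only-direct node:

```python
import numpy as np

def reachable(B):
    """reach[i, j] is True iff there is a directed path from i to j."""
    n = B.shape[0]
    adj = (B != 0).astype(int).T        # adj[i, j] = 1 iff edge i -> j
    reach = adj.copy()
    for _ in range(n):
        reach = ((reach + reach @ adj) > 0).astype(int)
    return reach.astype(bool)

def recursive_groups(B, nonleaf_order):
    """Group each leaf with its direct-and-only-direct node (Definition 8).
    B[j, i] != 0 encodes the edge i -> j; `nonleaf_order` is a causal
    ordering of the non-leaf nodes."""
    n = B.shape[0]
    leaves = [i for i in range(n) if not np.any(B[:, i])]
    reach = reachable(B)
    groups = {i: {i} for i in nonleaf_order}
    for leaf in leaves:
        for i in nonleaf_order:
            if B[leaf, i] == 0:
                continue                # i is not a direct cause of the leaf
            # is there another directed path i -> j -> ... -> leaf?
            other = any(B[j, i] != 0 and j != leaf and reach[j, leaf]
                        for j in range(n))
            if not other:
                groups[i].add(leaf)     # i is direct-and-only-direct
                break
    return [sorted(groups[i]) for i in nonleaf_order]
```

For a chain whose last node is a leaf, the leaf is merged into the group of its single parent, matching the definition above.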
We first present a procedure to construct the recursive group decomposition and the causal ordering among the groups from the estimated $\mathbf{A}^{NL}$. We will further show that the recovered recursive group decomposition is always asymptotically correct under assumption A4.
5.2.1 Construction and Identifiability of Recursive Group Decomposition
First of all, Lemma 6 tells us that $\mathbf{A}^{NL}$ in (6) is identifiable up to permutation and scaling of its columns. Let us start with the asymptotic case, where the columns of the $\mathbf{A}^{NL}$ estimated from values of $\mathbf{X}$ are a permuted and rescaled version of the columns of the true $\mathbf{A}^{NL}$. In what follows the permutation and rescaling of the columns of $\mathbf{A}^{NL}$ do not change the result, so below we just work with the true $\mathbf{A}^{NL}$ instead of its estimate.
$\mathbf{X}^{*}$ and $\tilde{\mathbf{X}}$ follow the same causal DAG and are causally sufficient, although some variables among the $X^{*}_i$ (corresponding to leaf nodes in $\tilde{G}$) are determined by their direct causes. Let us find the causal ordering of the $X^{*}_i$. If there were no deterministic relations and the values of the $X^{*}_i$ were given, the causal ordering could be estimated by recursively performing regression and checking independence between the regression residual and the predictor (Shimizu et al., 2011). Specifically, if one regresses all the remaining variables on the root cause, the residuals are always independent from the predictor (the root cause). After detecting a root cause, the residuals of regressing all the other variables on the discovered root cause are still causally sufficient and follow a DAG. One can repeat the above procedure to find a new root cause over such regression residuals, until no variable is left.
However, in our case we have access to $\mathbf{A}^{NL}$ but not the values of $\mathbf{X}^{*}$. Fortunately, the independence between regression residuals and the predictor can still be checked by analyzing $\mathbf{A}^{NL}$. Recall that $\mathbf{X}^{*} = \mathbf{A}^{NL}\tilde{\mathbf{E}}^{NL}$, where the components of $\tilde{\mathbf{E}}^{NL}$ are independent. Without loss of generality, here we assume that all components of $\tilde{\mathbf{E}}^{NL}$ are standardized, i.e., they have a zero mean and unit variance. Denote by $\mathbf{a}_i$ the $i$th row of $\mathbf{A}^{NL}$. We have $X^{*}_i = \mathbf{a}_i\tilde{\mathbf{E}}^{NL}$ and $\mathbb{E}[X^{*}_i X^{*}_j] = \mathbf{a}_i\mathbf{a}_j^{\top}$. The regression model for $X^{*}_j$ on $X^{*}_i$ is

$X^{*}_j = \frac{\mathbf{a}_j\mathbf{a}_i^{\top}}{\mathbf{a}_i\mathbf{a}_i^{\top}}X^{*}_i + R_{j\cdot i}.$

Here the residual $R_{j\cdot i}$ can be written as

$R_{j\cdot i} = X^{*}_j - \frac{\mathbf{a}_j\mathbf{a}_i^{\top}}{\mathbf{a}_i\mathbf{a}_i^{\top}}X^{*}_i = \Big(\mathbf{a}_j - \frac{\mathbf{a}_j\mathbf{a}_i^{\top}}{\mathbf{a}_i\mathbf{a}_i^{\top}}\mathbf{a}_i\Big)\tilde{\mathbf{E}}^{NL}.$ (10)
If for every $j \neq i$ the residual $R_{j\cdot i}$ is either zero or independent from $X^{*}_i$, we consider $X^{*}_i$ the current root cause and put it, together with all the other variables which are deterministically related to it, in the first group, which is a root-cause group. Now the problem is whether we can check for independence between the nonzero residuals and the predictor $X^{*}_i$. Interestingly, the answer is yes, as stated in the following proposition.
Proposition 9.
So we can check for independence as if the values of $\mathbf{X}^{*}$ were given. Consequently, we can find the root-cause group.
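When the noise terms are non-Gaussian, the independence of two linear mixtures can be decided from the mixing coefficients alone: by the Darmois-Skitovich theorem, the residual in (10) is independent of the predictor iff the two coefficient vectors share no nonzero coordinate. A sketch of the resulting root-cause test on the rows of $\mathbf{A}^{NL}$ (the matrix in the test case is made up):

```python
import numpy as np

def residual_mixing(A, j, i):
    """Mixing coefficients (w.r.t. the independent noise terms) of the
    residual of regressing X*_j on X*_i, cf. Eq. (10)."""
    a_i, a_j = A[i], A[j]
    return a_j - (a_j @ a_i) / (a_i @ a_i) * a_i

def residual_independent(A, j, i, tol=1e-9):
    """Darmois-Skitovich: two linear mixtures of independent non-Gaussian
    variables are independent iff no noise term appears in both."""
    c = residual_mixing(A, j, i)
    return bool(np.all(np.abs(c * A[i]) < tol))

def is_root_group_head(A, i):
    """X*_i heads a root-cause group if every other residual is zero or
    independent of X*_i."""
    return all(residual_independent(A, j, i)
               for j in range(A.shape[0]) if j != i)
```

For a two-variable chain the test accepts the true root and rejects the effect, mirroring the DirectLiNGAM-style search described above.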
We then consider the residuals of regressing all the remaining variables on the discovered root cause as a new set of variables. Note that like the variables $X^{*}_i$, these variables are also linear mixtures of the components of $\tilde{\mathbf{E}}^{NL}$. Repeating the above procedure on this new set of variables will give the second root cause and its recursive group. Applying this procedure repeatedly until no variable is left finally discovers all recursive groups following the causal ordering. The constructed recursive group decomposition is asymptotically correct, as stated in the following proposition.
Proposition 10.
(Identifiable recursive group decomposition) Let $\mathbf{X}$ be generated by a CAMME whose measurement-error-free variables are generated by the causal DAG $\tilde{G}$, and suppose assumptions A0 and A4 hold. The recursive group decomposition constructed by the above procedure is asymptotically correct, in the sense that as the sample size $N \to \infty$, if non-leaf node $\tilde{X}_i$ is a cause of non-leaf node $\tilde{X}_j$, then the recursive group which $X^{*}_i$ is in precedes the group which $X^{*}_j$ belongs to. However, the causal ordering among the nodes within the same recursive group may not be identifiable.
The result of Proposition 10 applies to any DAG structure of $\tilde{G}$. Clearly, the identifiability can be naturally improved if additional assumptions on the causal structure hold. In particular, to recover information of $\tilde{G}$, it is essential to answer the following questions.

Can we determine which nodes in a recursive group are leaf nodes?

Can we find the causal edges into a particular node as well as their causal coefficients?
Below we will show that under rather mild assumptions, the answers to both questions are yes.
5.2.2 Identifying Leaf Nodes and Individual Causal Edges
If for each recursive group we can determine which variable is the non-leaf node, the causal ordering among the variables is then fully known. The causal structure of $G^{*}$ as well as the causal model can then be readily estimated by regression: for a leaf node, its direct causes are those non-leaf nodes that determine it; for a non-leaf node, we can regress it on all non-leaf nodes that precede it according to the causal ordering, and those predictors with nonzero linear coefficients are its parents. (Equivalently, its parents are the nodes that causally precede it and are in its Markov blanket.)
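Once the full causal ordering is known, the coefficients follow from ordinary regressions on the preceding variables; a sketch on a population covariance matrix (the chain used for checking is hypothetical):

```python
import numpy as np

def recover_coeffs(cov, order):
    """Regress each variable on its predecessors in the causal ordering;
    for a linear SEM with independent noise, the resulting coefficients
    equal the direct structural coefficients (a sketch; assumes the
    ordering is correct and the covariance is exact)."""
    B = np.zeros_like(cov)
    for k, i in enumerate(order):
        pred = order[:k]
        if pred:
            S = cov[np.ix_(pred, pred)]
            c = cov[np.ix_(pred, [i])]
            B[i, pred] = np.linalg.solve(S, c).ravel()
    return B
```

Predictors whose recovered coefficient is zero are non-parents, matching the Markov-blanket remark above.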
Now the problem is whether it is possible to find out which variable in a given recursive group is a leaf node; if all leaf nodes are found, then the remaining one is the (only) nonleaf node. We may find leaf nodes by “looking backward" and “looking forward"; the former makes use of the parents of the variables in the considered group, and the latter exploits the fact that leaf nodes do not have any child.
Proposition 11.
(Leaf node determination by "looking backward") Suppose the observed data were generated by a CAMME for which assumptions A0 and A4 hold.³ Let the sample size $N \to \infty$. Then if assumption A5 holds, leaf node $X^{*}_i$ is correctly identified from the values of $\mathbf{X}$ (more specifically, from the estimated $\mathbf{A}^{NL}$ or the distribution of $\mathbf{X}^{*}$); alternatively, if assumption A6 holds, leaf nodes $X^{*}_i$ and $X^{*}_j$ are correctly identified from the values of $\mathbf{X}$.

³In this non-Gaussian case (implied by assumption A4), the result reported in this proposition may still hold if one avoids the non-deterministic faithfulness assumption and assumes a weaker condition; however, for simplicity of the proof we currently still assume non-deterministic faithfulness.

According to $G^{*}$, leaf node $X^{*}_i$ in the considered recursive group has a parent which is not a parent of the non-leaf node in the group.

According to $G^{*}$, leaf nodes $X^{*}_i$ and $X^{*}_j$ in the considered recursive group are non-deterministically conditionally independent given some subset of the nodes in $G^{*}$.
Example Set 4
Suppose assumptions A0 and A4 hold.

For $\tilde{G}$ in Figure 6(a), assumption A6 holds for the two leaf nodes in the recursive group : they are non-deterministically conditionally independent given ; so both of them are identified as leaf nodes from the estimated $\mathbf{A}^{NL}$ or the distribution of $\mathbf{X}^{*}$, and the remaining node can be determined as the non-leaf node. (In addition, assumption A5 holds for one of them, allowing us to identify this leaf node even if the other is absent in the graph.)

For both