Anchored Causal Inference in the Presence of Measurement Error

by Basil Saeed et al.

We consider the problem of learning a causal graph in the presence of measurement error. This setting is for example common in genomics, where gene expression is corrupted through the measurement process. We develop a provably consistent procedure for estimating the causal structure in a linear Gaussian structural equation model from corrupted observations on its nodes, under a variety of measurement error models. We provide an estimator based on the method-of-moments, which can be used in conjunction with constraint-based causal structure discovery algorithms. We prove asymptotic consistency of the procedure and also discuss finite-sample considerations. We demonstrate our method's performance through simulations and on real data, where we recover the underlying gene regulatory network from zero-inflated single-cell RNA-seq data.






1 Introduction

Determining causal relationships between a set of variables is a central task in causal inference with applications in many scientific fields including economics, biology and social sciences Friedman et al. (2000); Pearl (2003); Robins et al. (2000). Directed acyclic graph (DAG) models, also known as Bayesian networks, are commonly used to represent the causal structure among variables. Learning a DAG from observations on the nodes is intrinsically hard Chickering et al. (2004), and in general a DAG is only identifiable up to its Markov equivalence class Verma and Pearl (1990). In addition, in many applications there may be latent variables. While various algorithms have been developed to learn DAGs with latent variables Colombo et al. (2012); Spirtes (2001); Spirtes et al. (2000); Zhang (2008), without restrictions on the latent variables there may be infinitely many DAGs that can explain the data Richardson et al. (2002), in which case the model is of little use for further interpretation and analysis.

Restrictions on the latent variables can improve model identifiability. In this paper, we consider the problem of causal discovery with measurement error, where each latent variable has exactly one corresponding observed variable (a corrupted observation of the latent variable), which serves as its anchor, and the goal is to infer the causal relationships among the latent variables; see Figure 1a. For instance, in the social sciences the beliefs of people cannot be directly measured, but surveys can provide a noisy version of the latent variables, and we may be interested in inferring the causal structure among the latent beliefs. Similarly, in biological applications measurement error needs to be taken into account, e.g. when measuring brain signals using functional magnetic resonance imaging (fMRI) or gene expression using RNA sequencing.

While the method developed in this paper can be applied generally to causal inference in the presence of measurement noise, we will showcase its use on learning the underlying gene regulatory network from single-cell RNA-seq data Klein et al. (2015). Such data is known to suffer from dropout Ziegenhain et al. (2017), which manifests itself as false zeros due to too little starting RNA or technical noise. In single-cell RNA-seq experiments, it is estimated that such false zeros occur with a probability of 24-76% across current state-of-the-art technologies Ziegenhain et al. (2017). Applying causal inference methods directly to such data may lead to biased estimates, where negatively correlated variables may appear positively correlated due to dropout; see Figure 1b. Currently, the typical approach for dealing with dropout is to first impute gene expression data Van Dijk et al. (2018). However, this may introduce artificial dependencies (Figure 1c) that may be detrimental for learning the correct causal model. It is therefore of great interest to develop algorithms that directly learn the causal structure among the latent variables from the corrupted data.

Figure 1: (a) Anchored Causal Model with latent variables and their corrupted observed counterparts (anchors). (b) Simulated Gaussian random variables before (top) and after dropout with probability 0.5 (bottom). (c) Imputed RNA-seq data (top) and raw data with dropout (bottom).
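The bias introduced by dropout can be illustrated with a small simulation. The sketch below (hypothetical parameter values, assuming each entry is zeroed independently with probability 0.5) shows the sample correlation of the corrupted data shrinking far toward zero relative to the latent correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent bivariate Gaussian with a strong negative correlation.
mean = np.array([3.0, 3.0])
cov = np.array([[1.0, -0.8],
                [-0.8, 1.0]])
X = rng.multivariate_normal(mean, cov, size=n)

# Independent dropout: each entry is replaced by 0 with probability 0.5.
Y = X * (rng.random(X.shape) > 0.5)

r_clean = np.corrcoef(X.T)[0, 1]
r_corrupt = np.corrcoef(Y.T)[0, 1]
print(r_clean, r_corrupt)  # the corrupted correlation is strongly attenuated
```

Under this corruption the naive sample correlation is biased toward zero (here roughly from -0.8 to -0.07), which is precisely the kind of distortion a moment-corrected estimator is designed to undo.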

Zhang et al. (2017) considered the problem of learning a causal DAG model under measurement error as in Figure 1a, but restricted the measurement error to be independent from the latent variables. For many applications, including modeling dropout in single-cell RNA-seq data, this assumption is too restrictive. Halpern et al. (2015) considered more general anchored causal models than in Figure 1a, but only in the binary setting. Silva et al. (2006) considered a similar model for continuous distributions, but under the assumption that the dependence between latent and observed variables is linear, an assumption that is too restrictive for many applications. Inspired by topic modeling, the authors in Anandkumar et al. (2013) proposed a causal discovery method for DAGs with various levels of latent variables, but under the assumption that the latent variables are non-Gaussian and have sufficient outgoing edges for identifiability of the model.

The main contributions of this paper are as follows:

  • We introduce anchored causal inference to model causal relationships among latent variables from observations with measurement error in the Gaussian setting.

  • We develop a provably consistent algorithm based on the method-of-moments to identify the causal structure (up to Markov equivalence) among the latent variables for a flexible class of measurement error models, including non-linear dependence on the latent variables.

  • In particular, we derive a consistent estimator for partial correlations and a consistent conditional independence test when the measurement noise is dropout.

  • We present experimental results on both simulated and single-cell RNA-seq data, showing that our estimator, which takes into account measurement error, outperforms standard causal inference algorithms applied directly to the corrupted data.

2 Preliminaries and Related Work

Let $\mathcal{G} = ([p], E)$ be a directed acyclic graph (DAG) with nodes $[p] = \{1, \dots, p\}$ and directed edges $E$. We associate a random variable $X_i$ to each node $i \in [p]$. We denote the joint distribution of $X = (X_1, \dots, X_p)$ by $\mathbb{P}_X$ and assume that $X$ is generated by a linear Gaussian structural equation model:

$$X = A^{\top} X + \varepsilon, \qquad (1)$$

where $A$ is the weighted adjacency matrix of $\mathcal{G}$ and $\varepsilon \sim \mathcal{N}(0, \Omega)$ with $\Omega$ diagonal. We consider the problem where only a noise-corrupted version of $X$ is observed. We define $Y = (Y_1, \dots, Y_p)$ to be the observed variables (anchors) generated from the latent variables by a noise process $Y_i = f_i(X_i, \epsilon_i)$, where the noise $\epsilon_i$ has non-zero variance; see Figure 1a. We aim to learn the DAG $\mathcal{G}$ associated with the latent variables $X$.

The majority of the literature in causal inference assumes causal sufficiency, i.e., no latent variables. A standard approach for causal structure discovery in this setting is to first infer the conditional independence (CI) relations among the observed variables and then use the CI relations to learn the DAG structure Spirtes et al. (2000). However, since multiple DAGs can encode the same CI relations, a DAG can only be identified up to its Markov equivalence class (MEC). An MEC can be represented by a CPDAG, a partially directed graph whose skeleton (underlying undirected graph) is the skeleton of every DAG in the MEC and in which an edge is directed if it has the same direction for all DAGs in the MEC Andersson et al. (1997); Verma and Pearl (1990). Various algorithms have been developed for learning a CPDAG under causal sufficiency Chickering (2002); Solus et al. (2017); Spirtes et al. (2000), most prominently the PC algorithm Spirtes et al. (2000), which treats causal inference as a constraint satisfaction problem with the constraints being CI relations. The PC algorithm is provably consistent, meaning that it outputs the correct MEC as the sample size $n \to \infty$, under the so-called faithfulness assumption, which asserts that the CI relations entailed by the distribution are exactly the relations implied by d-separation in the underlying DAG Spirtes et al. (2000).
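The constraint-based approach can be illustrated with a minimal sketch of the PC skeleton phase (simplified; the CI oracle for the chain 1 → 2 → 3 is hypothetical):

```python
from itertools import combinations

def pc_skeleton(nodes, ci):
    """Simplified skeleton phase of the PC algorithm: start from the complete
    graph and delete edge i-j as soon as some conditioning set S (drawn from
    the current neighbors of i) renders i and j conditionally independent."""
    edges = {frozenset(e) for e in combinations(nodes, 2)}
    for size in range(len(nodes) - 1):
        for e in list(edges):
            if e not in edges:
                continue
            i, j = tuple(e)
            nbrs = sorted({k for f in edges if i in f for k in f} - {i, j})
            for S in combinations(nbrs, size):
                if ci(i, j, set(S)):
                    edges.discard(e)
                    break
    return edges

# Hypothetical CI oracle for the chain 1 -> 2 -> 3: the only CI relation is 1 and 3 given 2.
chain_ci = lambda i, j, S: {i, j} == {1, 3} and 2 in S

skel = pc_skeleton([1, 2, 3], chain_ci)
print(skel)  # the edge between 1 and 3 is removed
```

Replacing the oracle with a finite-sample CI test turns this into a practical skeleton-search procedure; the quality of the output then hinges entirely on the CI test, which is the component this paper corrects for measurement error.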

In the presence of latent variables, identifiability is further weakened (only the so-called PAG is identifiable) and various algorithms have been developed for learning a PAG Colombo et al. (2012); Spirtes (2001); Spirtes et al. (2000); Zhang (2008). However, these algorithms cannot estimate causal relations among the latent variables, which is our problem of interest. Leung et al. (2016) study identifiability of directed Gaussian graphical models in the presence of a single latent variable. Zhang et al. (2017), Silva et al. (2006), Halpern et al. (2015) and Anandkumar et al. (2013) all consider the problem of learning causal edges among latent variables from the observed variables, i.e. models as in Figure 1a or generalizations thereof, but under assumptions that may not hold for our applications of interest, namely that the measurement error is independent of the latent variables Zhang et al. (2017), that the observed variables are a linear function of the latent variables Silva et al. (2006), that the observed variables are binary Halpern et al. (2015), or that each latent variable is non-Gaussian with sufficient outgoing edges to guarantee identifiability Anandkumar et al. (2013).

3 Anchored Causal Inference

In the following, we first describe the assumptions of our Anchored Causal Model, then motivate the model by the application to learning the underlying gene regulatory network from zero-inflated single-cell RNA-seq data, and finally provide an algorithm for anchored causal inference and prove its consistency under the model assumptions.

Model Assumptions (Anchored Causal Model).

(A1) Given a DAG $\mathcal{G}$, the latent variables $X$ are generated by a linear Gaussian structural equation model (see (1)) that is faithful to $\mathcal{G}$.

(A2) The observed random vector $Y$ satisfies the CI relations $Y_i \perp \{X_k, Y_k : k \neq i\} \mid X_i$ for all $i \in [p]$. Furthermore, for all $i, j \in [p]$ there exist a finite-dimensional vector of monomials in $Y_i$ and a finite-dimensional vector of monomials in $Y_i$ and $Y_j$ such that their means can be mapped to the moments of the latent variables by known continuously differentiable functions $g_i$ and $g_{ij}$, i.e., $(\mathbb{E}[X_i], \mathbb{E}[X_i^2])$ is obtained by applying $g_i$ to the mean of the first vector and $\mathbb{E}[X_i X_j]$ by applying $g_{ij}$ to the mean of the second, and their covariance is finite.

While Assumption (A1) fixes the structural and functional relationship between the latent variables by a linear Gaussian structural equation model, Assumption (A2) fixes the structural relationship between latent and observed variables by each $Y_i$ having exactly one parent $X_i$, and ensures that the first- and second-order moments of $X$ can be obtained from moments of $Y$ without the restriction to a specific measurement error model. This allows for more general noise models than in Silva et al. (2006); Zhang et al. (2017).

Example 3.1 (Modeling single-cell RNA-seq data).

Let $X$ represent the true latent RNA values (log-transformed) and $Y$ the observed RNA values after dropout. The authors in Pierson and Yau (2015) considered a simple model of gene regulation represented by a linear Gaussian structural equation model among the latent variables and modeled dropout for single-cell RNA-seq data by zero-inflation:

$$Y_i = \begin{cases} X_i & \text{with probability } 1 - p_i, \\ 0 & \text{with probability } p_i. \end{cases}$$

Assuming the dropout probabilities $p_i$ are known, this model satisfies Assumptions (A1) and (A2), but does not satisfy the assumptions in Silva et al. (2006); Zhang et al. (2017). ∎
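Under a zero-inflation model of this type, where $Y_i = X_i$ with probability $q_i = 1 - p_i$ and $Y_i = 0$ otherwise (independently of $X$), the moment maps take a particularly simple form: $\mathbb{E}[Y_i] = q_i\,\mathbb{E}[X_i]$, $\mathbb{E}[Y_i^2] = q_i\,\mathbb{E}[X_i^2]$, and $\mathbb{E}[Y_i Y_j] = q_i q_j\,\mathbb{E}[X_i X_j]$ for $i \neq j$. The following sketch (hypothetical parameter values, not the paper's implementation) inverts these maps to recover the latent covariance from corrupted samples:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Latent Gaussian pair with known dropout probabilities (hypothetical values).
mean = np.array([3.0, 3.0])
cov = np.array([[1.0, -0.8],
                [-0.8, 1.0]])
p = np.array([0.4, 0.6])            # dropout probabilities
q = 1.0 - p                         # keep probabilities

X = rng.multivariate_normal(mean, cov, size=n)
Y = X * (rng.random(X.shape) > p)   # zero-inflate column i with probability p[i]

# Method-of-moments correction: divide each observed moment by its keep factor.
EX = Y.mean(axis=0) / q                           # E[X_i] = E[Y_i] / q_i
EXX = (Y.T @ Y) / n / np.outer(q, q)              # off-diagonal: E[X_i X_j]
np.fill_diagonal(EXX, (Y ** 2).mean(axis=0) / q)  # diagonal: E[X_i^2] = E[Y_i^2] / q_i
cov_hat = EXX - np.outer(EX, EX)

print(cov_hat[0, 1], np.cov(Y.T)[0, 1])  # corrected vs. naive covariance
```

The corrected estimate recovers the true covariance of about -0.8, while the naive sample covariance of the zero-inflated data is attenuated by the factor $q_i q_j$.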

Algorithm 1 describes our Anchored Causal Inference procedure. The procedure works as follows: Given $n$ i.i.d. samples of $Y$, compute the required empirical moments of the observed monomial vectors. Given a particular measurement error model defined by the moment maps $g_i$ and $g_{ij}$, compute the first- and second-order moments of $X$ to obtain its covariance matrix $\hat\Sigma$. If we can obtain the set of CI relations involving $X$, we can use causal structure discovery algorithms to learn $\mathcal{G}$ (up to its Markov equivalence class). Since $X$ follows a Gaussian distribution, conditional independence corresponds to zero partial correlation. The sample partial correlations can be computed recursively for increasing conditioning set sizes by

$$\hat\rho_{ij \mid S} = \frac{\hat\rho_{ij \mid S \setminus k} - \hat\rho_{ik \mid S \setminus k}\,\hat\rho_{jk \mid S \setminus k}}{\sqrt{\bigl(1 - \hat\rho_{ik \mid S \setminus k}^2\bigr)\bigl(1 - \hat\rho_{jk \mid S \setminus k}^2\bigr)}}, \qquad k \in S, \qquad (2)$$

where in the base case the correlations are obtained from $\hat\Sigma$. The main difficulty lies in developing test statistics based on the estimated partial correlations such that the inferred CI relations correspond, as $n \to \infty$, to the set of CI relations implied by the underlying causal DAG $\mathcal{G}$. Such test statistics are developed in Corollaries 1 and 2. The inferred CI relations can then be fed into a constraint-based causal discovery algorithm such as the PC algorithm Spirtes et al. (2000) to obtain the CPDAG of $\mathcal{G}$.
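The recursive computation of the partial correlations can be sanity-checked against the usual inverse-covariance formula. A minimal sketch with hypothetical correlation values (three variables, conditioning on the one remaining variable):

```python
import numpy as np

def partial_corr_step(r_ij, r_ik, r_jk):
    """One step of the standard recursion: rho_{ij|k} from marginal correlations."""
    return (r_ij - r_ik * r_jk) / np.sqrt((1 - r_ik**2) * (1 - r_jk**2))

# A valid 3x3 correlation matrix (hypothetical values).
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])

rho_12_3 = partial_corr_step(R[0, 1], R[0, 2], R[1, 2])

# Same quantity from the precision matrix: -P_ij / sqrt(P_ii * P_jj).
P = np.linalg.inv(R)
rho_12_3_inv = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

print(rho_12_3, rho_12_3_inv)  # identical up to floating point
```

Both expressions compute the same quantity; the recursion is convenient in constraint-based algorithms because it reuses lower-order partial correlations across conditioning sets.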

The first step in asserting consistency of Algorithm 1 is the following lemma, which follows from the law of large numbers; the proof is provided in the appendix.

Input: $n$ samples of the random vector $Y$.
Output: CPDAG representing the Markov equivalence class of the DAG $\mathcal{G}$ of the latent variables $X$.
1. For each node pair compute the sample moment vectors of the observed monomials from the samples of $Y$.
2. Estimate the first- and second-order sample moments of $X$ via the known moment maps of the measurement error model.
3. Estimate the covariance matrix $\hat\Sigma$ of $X$ from these moments, i.e., $\hat\Sigma_{ij} = \widehat{\mathbb{E}}[X_i X_j] - \widehat{\mathbb{E}}[X_i]\,\widehat{\mathbb{E}}[X_j]$ for all $i, j$.
4. Estimate the partial correlations of $X$ from $\hat\Sigma$ using (2).
5. Calculate the test statistics defined in Corollaries 1 or 2 to infer the CI relations among the latent variables $X$.
6. Use a consistent causal discovery algorithm (e.g. the PC algorithm) based on the inferred CI relations.

Algorithm 1 Anchored Causal Inference
Lemma 1.

Under assumptions (A1) and (A2), the estimator in (2) is asymptotically consistent.

Next, we design a consistent hypothesis test for obtaining CI relations based on the estimated partial correlations of $X$ in (2), similar in principle to Gaussian CI tests based on Fisher's z-transform used by many causal inference algorithms Chickering (2002); Spirtes et al. (2000); Wang et al. (2017). Under causal sufficiency, i.e., when $f_i$ is the identity function for all $i \in [p]$, it can be shown Lehmann (1998) that the estimated partial correlations in (2) satisfy

$$\sqrt{n}\,\bigl(\hat\rho_{ij \mid S} - \rho_{ij \mid S}\bigr) \xrightarrow{d} \mathcal{N}\bigl(0, (1 - \rho_{ij \mid S}^2)^2\bigr). \qquad (3)$$

Hence, applying the delta method to Fisher's z-transform $z(\rho) = \tanh^{-1}(\rho)$ of the estimated partial correlations yields

$$\sqrt{n}\,\bigl(z(\hat\rho_{ij \mid S}) - z(\rho_{ij \mid S})\bigr) \xrightarrow{d} \mathcal{N}(0, 1). \qquad (4)$$

Hence Fisher's z-transform can be used in the test statistic to test conditional independence by declaring $X_i \perp X_j \mid X_S$ at significance level $\alpha$ if and only if

$$\sqrt{n}\,\bigl|z(\hat\rho_{ij \mid S})\bigr| \le \Phi^{-1}(1 - \alpha/2), \qquad (5)$$

where $\Phi^{-1}$ denotes the inverse CDF of $\mathcal{N}(0, 1)$. The following theorem generalizes (3) to our anchored causal model class, where the partial correlations of $X$ are estimated from the observed moments of $Y$.
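For reference, the classical Gaussian CI test described above can be sketched as follows (a standard implementation using the common finite-sample degrees-of-freedom correction $\sqrt{n - |S| - 3}$ in place of $\sqrt{n}$; the sample values are hypothetical):

```python
import math
from statistics import NormalDist

def gaussian_ci_test(rho_hat, n, cond_size, alpha=0.05):
    """Classical Fisher z-test: declare conditional independence iff the
    standardized z-transformed partial correlation is below the normal quantile."""
    z = math.atanh(rho_hat)                       # Fisher's z-transform
    stat = math.sqrt(n - cond_size - 3) * abs(z)  # approximately N(0,1) under H0
    return stat <= NormalDist().inv_cdf(1 - alpha / 2)

print(gaussian_ci_test(0.01, n=100, cond_size=0))  # small correlation: independent
print(gaussian_ci_test(0.50, n=100, cond_size=0))  # large correlation: dependent
```

This is the baseline test that the corrected statistics of Theorem 1 and its corollaries generalize; applied directly to dropout-corrupted data it uses the wrong null distribution, which is the failure mode the paper addresses.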

Theorem 1.

Let $s(Y)$ denote the vector of monomials of $Y$ required to compute the first- and second-order moments of $X$, and let $\mu$ denote the vector of first- and second-order moments of $X$. Then under assumptions (A1) and (A2), for any $i, j \in [p]$ and $S \subseteq [p] \setminus \{i, j\}$, the estimated partial correlation in (2) satisfies

$$\sqrt{n}\,\bigl(\hat\rho_{ij \mid S} - \rho_{ij \mid S}\bigr) \xrightarrow{d} \mathcal{N}\bigl(0, \sigma_{ij \mid S}^2\bigr),$$

where $\sigma_{ij \mid S}^2$ is a continuous function of the moments of $s(Y)$.

The proof of Theorem 1 can be found in appendix B, where we provide a procedure for computing the function $\sigma_{ij \mid S}^2$ for any $i, j$ and $S$. The idea of the proof is as follows: First apply the Central Limit Theorem to the vector of sample moments of $s(Y)$. Under assumption (A2), the correlations of $X$ are continuously differentiable functions of these moments. Furthermore, for any $i, j$ and $S$, the partial correlation $\rho_{ij \mid S}$ is defined recursively for increasing conditioning set sizes as a continuously differentiable function of lower-order partial correlations. Hence, one can iteratively apply the delta method, starting from the statement of the Central Limit Theorem applied to the sample moments, to obtain the asymptotic distribution of $\hat\rho_{ij \mid S}$.

In the following two corollaries to Theorem 1, we provide different test statistics for CI testing based on the estimated partial correlations of the latent vector . We start by generalizing Fisher’s transform and its asymptotic distribution given in (4).

Corollary 1.

If the asymptotic variance can be written purely as a function of the partial correlation, i.e., there exists $\sigma(\cdot)$ such that $\sigma_{ij \mid S}^2 = \sigma^2(\rho_{ij \mid S})$, and there exists a variance stabilizing transformation

$$\zeta(\rho) = \int^{\rho} \frac{dr}{\sigma(r)} \qquad (6)$$

with $\zeta$ chosen such that $\zeta'(\rho) = 1/\sigma(\rho)$, then under (A1) and (A2)

$$\sqrt{n}\,\bigl(\zeta(\hat\rho_{ij \mid S}) - \zeta(\rho_{ij \mid S})\bigr) \xrightarrow{d} \mathcal{N}(0, 1).$$

The proof of Corollary 1 follows by applying the delta method to Theorem 1 (appendix C). Whether the conditions of Corollary 1 are satisfied depends on the measurement error model. We show in appendix F.1 that the conditions of Corollary 1 hold for the dropout model for conditioning sets of size at most one, and derive the corresponding variance stabilizing transformation. Note that it is sufficient if we can compute the integral in (6) numerically; a closed-form solution is not required. Corollary 1 implies that the test statistic in (5) can be used to consistently estimate the CI relations among the latent variables $X$. When the assumptions of Corollary 1 are not met, we can obtain a different test statistic that is asymptotically normal, as shown in the following result.
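The construction in Corollary 1 is the classical variance-stabilization argument; a short derivation in standard notation (included as a reading aid, not taken verbatim from the paper):

```latex
\sqrt{n}\,\bigl(\hat\rho - \rho\bigr) \xrightarrow{d} \mathcal{N}\bigl(0, \sigma^2(\rho)\bigr),
\qquad
\zeta'(\rho) = \frac{1}{\sigma(\rho)}
\;\Longrightarrow\;
\sqrt{n}\,\bigl(\zeta(\hat\rho) - \zeta(\rho)\bigr) \xrightarrow{d}
\mathcal{N}\bigl(0, \zeta'(\rho)^2\,\sigma^2(\rho)\bigr) = \mathcal{N}(0, 1).
```

Fisher's z-transform is the special case $\sigma(\rho) = 1 - \rho^2$, for which $\zeta(\rho) = \tanh^{-1}(\rho)$.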

Corollary 2.

Define the test statistic $T_{ij \mid S} = \sqrt{n}\,\hat\rho_{ij \mid S} / \hat\sigma_{ij \mid S}$, where $\hat\sigma_{ij \mid S}$ is the plug-in estimate of the asymptotic standard deviation from Theorem 1. Then under (A1) and (A2), if $\rho_{ij \mid S} = 0$,

$$T_{ij \mid S} \xrightarrow{d} \mathcal{N}(0, 1).$$

Corollary 2 follows from Theorem 1: since $\sigma_{ij \mid S}$ is continuous in the moments, $\hat\sigma_{ij \mid S}$ converges to $\sigma_{ij \mid S}$ by the law of large numbers, and Slutsky's theorem implies that $T_{ij \mid S} \xrightarrow{d} \mathcal{N}(0, 1)$ as $n \to \infty$ (appendix D). Hence the test statistic $T_{ij \mid S}$ can be used in (5) to obtain the CI relations among the latent variables $X$.

With respect to finite-sample considerations, note that the statistic in Corollary 2 is a function of the estimated variance $\hat\sigma_{ij \mid S}^2$, and thus its convergence to its asymptotic distribution requires the convergence of $\hat\sigma_{ij \mid S}^2$. Hence, we expect the convergence in distribution of Corollary 1 to be faster than that of Corollary 2, and as a result the test statistic in Corollary 1 to perform better in the finite-sample regime.

We end this section with the main result of this paper, namely the consistency of Algorithm 1.

Theorem 2.

Under assumptions (A1) and (A2), Algorithm 1 is consistent, i.e., as $n \to \infty$ it returns a CPDAG that represents the Markov equivalence class of the true DAG $\mathcal{G}$.

The proof can be found in appendix E. In particular, we show that under the faithfulness assumption, the set of CI relations inferred from the observed data in Algorithm 1 converges to the set of CI statements implied by the underlying DAG $\mathcal{G}$. Hence using any consistent causal structure discovery algorithm on these CI relations results in the correct Markov equivalence class.

4 Implementation

Next, we discuss an important aspect of implementation and show how the results in Section 3 can be applied to the dropout model in Example 3.1.

In the finite-sample setting, the estimated covariance matrix $\hat\Sigma$ of the latent variables is not guaranteed to be positive semidefinite. In this case, shrinkage towards a positive definite matrix can be used as a form of regularization. When $n < p$, a standard approach is to use Ledoit-Wolf shrinkage towards the identity matrix Ledoit and Wolf (2004). When $n > p$, the sample covariance matrix $\hat\Sigma_Y$ based on the samples of $Y$ is positive definite with probability 1, and hence $\hat\Sigma$ can be shrunk towards $\hat\Sigma_Y$ by

$$\hat\Sigma^{\lambda} = (1 - \lambda)\,\hat\Sigma + \lambda\,\hat\Sigma_Y, \qquad \lambda \in [0, 1]. \qquad (7)$$

The form of shrinkage that provides better results depends on whether $\hat\Sigma_Y$ or the identity matrix is a better approximation of the true underlying covariance matrix. In our experiments in Section 5, we applied shrinkage towards $\hat\Sigma_Y$ as in (7). Both types of shrinkage result in consistent estimates: consistency of Ledoit-Wolf shrinkage is proven in Ledoit and Wolf (2004), and the consistency of shrinkage towards the sample covariance matrix in (7) follows from Theorem 1, since convergence of $\hat\Sigma$ to the true covariance matrix as $n \to \infty$ implies that $\hat\Sigma$ becomes positive semidefinite with large enough sample size, and therefore $\lambda \to 0$ as $n \to \infty$, which shows that shrinkage reduces to the consistent case without shrinkage.
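A minimal sketch of this shrinkage step (assuming the $n > p$ case, with hypothetical matrices): choose the smallest $\lambda \in [0, 1]$ for which the convex combination of the possibly indefinite latent estimate and the positive definite observed covariance is positive semidefinite.

```python
import numpy as np

def shrink_to_psd(sigma_x, sigma_y, tol=1e-10):
    """Smallest lambda in [0,1] such that (1-l)*sigma_x + l*sigma_y is PSD,
    found by bisection on the minimum eigenvalue (sigma_y assumed PD)."""
    def min_eig(lam):
        return np.linalg.eigvalsh((1 - lam) * sigma_x + lam * sigma_y).min()
    if min_eig(0.0) >= -tol:
        return 0.0, sigma_x
    lo, hi = 0.0, 1.0
    while hi - lo > 1e-8:
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if min_eig(mid) >= -tol else (mid, hi)
    return hi, (1 - hi) * sigma_x + hi * sigma_y

# Hypothetical indefinite latent estimate and PD observed covariance.
sigma_x = np.array([[1.0, 1.5],
                    [1.5, 1.0]])    # eigenvalues 2.5 and -0.5
sigma_y = 2.0 * np.eye(2)

lam, shrunk = shrink_to_psd(sigma_x, sigma_y)
print(lam)  # around 0.2 for these matrices
```

Minimizing the shrinkage coefficient keeps the result as close as possible to the moment-corrected estimate while restoring positive semidefiniteness.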

4.1 Application: The Dropout Model

Under the dropout model in Example 3.1, the assumptions of Corollary 1 are not satisfied in general, since the asymptotic variance cannot be expressed as a function of the partial correlation only. This is shown in appendix F.2 by plotting the asymptotic variance as a function of the partial correlation for fixed values of the remaining model parameters. In the special case of conditioning sets of size at most one, the conditions are satisfied for all pairs of variables, i.e., a variance stabilizing transformation can be found for the correlations and first-order partial correlations. This dropout stabilizing transform is provided in appendix F.1. The transform and the resulting CI test can also be used as a heuristic for larger conditioning sets, and we analyze its performance in Section 5.

In appendix F.3, we provide a recursive formula for computing the test statistic from Corollary 2 for the dropout model, which we refer to as the dropout normalizing transform. Computing it requires determining the asymptotic variance of the estimated partial correlations. Also note that applying shrinkage will in general change the asymptotic variance of the partial correlations. We show how to correct this variance as a function of the shrinkage coefficient in appendix F.4. This adjustment is applied in all of our experiments in Section 5.

In Section 5.2, we apply our estimation procedure based on the dropout model to single-cell RNA-seq data in order to infer the structure of the underlying gene regulatory network. For the theoretical analysis in Section 3 we assumed that the dropout probabilities $p_i$ are known. However, in general these parameters need to be estimated, and it was proposed in Pierson (2015) to model them by $p_i = e^{-\lambda \mu_i^2}$ for $i \in [p]$, where $\mu_i = \mathbb{E}[X_i]$ and $\lambda$ depends on the single-cell RNA-seq assay. Using this model for the dropout probabilities, we can jointly estimate the parameters $\mu_i$ and $\lambda$ as follows. The model implies

$$\mathbb{E}[Y_i] = \bigl(1 - e^{-\lambda \mu_i^2}\bigr)\,\mu_i.$$

Since $\mu_i$ corresponds to an average of (log-transformed) gene expression counts, we can assume that $\mu_i > 0$. Under this assumption, the equation has a unique solution for $\mu_i$ given $\lambda$. With respect to the parameter $\lambda$, for some single-cell RNA-seq assays it is possible to obtain an estimate by including molecules with known expression as controls. However, since this estimate is often unreliable Grün and van Oudenaarden (2015) and not always available, we selected $\lambda$ so as to minimize the amount of shrinkage required to obtain a positive semidefinite covariance matrix.
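Given $\lambda$, each latent mean can then be recovered numerically. A minimal sketch assuming a ZIFA-style dropout probability $p_i = e^{-\lambda \mu_i^2}$, under which the observed mean $(1 - e^{-\lambda \mu_i^2})\,\mu_i$ is strictly increasing in $\mu_i > 0$, so bisection applies:

```python
import math

def recover_mu(y_mean, lam, hi=1e6, iters=200):
    """Recover the latent mean mu > 0 from the observed (zero-inflated) mean,
    assuming dropout probability exp(-lam * mu^2): solve
    (1 - exp(-lam * mu^2)) * mu = y_mean by bisection (the LHS is increasing)."""
    f = lambda mu: (1.0 - math.exp(-lam * mu * mu)) * mu - y_mean
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return hi

lam, mu_true = 0.1, 2.5
y_mean = (1.0 - math.exp(-lam * mu_true**2)) * mu_true
print(recover_mu(y_mean, lam))  # recovers approximately 2.5
```

Monotonicity guarantees a unique root, so a simple bracketing search suffices; in practice this solve is repeated per gene for each candidate value of $\lambda$.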

5 Experiments

In this section, we analyze the performance of Algorithm 1 based on the dropout model both on simulated data and on single-cell RNA-seq data.

5.1 Simulations

Figure 2: Performance of the dropout stabilizing transform, dropout normalizing transform and Gaussian CI test in simulations: ROC and SHD curves evaluating the accuracy of estimating the skeleton of the true DAG, shown for two parameter settings in panels (a)-(d) and (e)-(h).

The data was generated from the dropout model described in Example 3.1. The structure of the matrix $A$ in the linear Gaussian structural equation model (1) was generated by an Erdős-Rényi model with a fixed expected degree and number of nodes. The weights of $A$ were drawn uniformly from a range bounded away from zero. The mean parameters and the dropout probabilities were also drawn uniformly, with ranges chosen to match the expected ranges in the gene expression data analyzed in Section 5.2. We generated $n$ observations of $Y$ from this generating model, for varying sample sizes $n$.
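The generating process described above can be sketched as follows (simplified, with hypothetical parameter ranges rather than the ones used in the paper's simulations):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_dropout_sem(p_nodes, n, expected_degree=1.5):
    """Draw a random DAG (upper-triangular adjacency), sample the linear
    Gaussian SEM X = A^T X + eps, then zero-inflate each column."""
    # Random upper-triangular weighted adjacency (Erdos-Renyi edges).
    prob_edge = min(1.0, expected_degree / (p_nodes - 1))
    A = np.triu(rng.random((p_nodes, p_nodes)) < prob_edge, k=1).astype(float)
    A *= rng.uniform(0.25, 1.0, size=A.shape) * rng.choice([-1, 1], size=A.shape)

    eps = rng.normal(size=(n, p_nodes))
    X = eps @ np.linalg.inv(np.eye(p_nodes) - A)   # solves X = X A + eps rowwise
    X += rng.uniform(1.0, 3.0, size=p_nodes)       # shift to positive latent means

    drop = rng.uniform(0.2, 0.6, size=p_nodes)     # per-node dropout probabilities
    Y = X * (rng.random(X.shape) > drop)
    return X, Y, drop

X, Y, drop = simulate_dropout_sem(p_nodes=5, n=10_000)
zero_frac = (Y == 0).mean(axis=0)
print(zero_frac, drop)  # empirical zero rates track the dropout probabilities
```

Because the latent variables are continuous, essentially every observed zero is a dropout event, so the empirical zero fraction per column estimates the dropout probability.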

Figure 2 shows the ROC curves and the Structural Hamming Distance (SHD) plots evaluating the skeleton of the CPDAG output by Algorithm 1. Each point in the plots is an average over multiple simulations. The CI relations were obtained using the dropout stabilizing transform (green curve), the dropout normalizing transform (red curve), and the Gaussian CI test applied directly to the perturbed observations (blue curve), and the PC algorithm was used to estimate the CPDAG. In appendix G, we provide additional figures for further parameter settings, as well as the ROC curves and SHD plots evaluating the inferred CPDAG.

The simulation results show that the dropout stabilizing transform outperforms, or performs at least as well as, the Gaussian CI test in all settings we tested, even though it was derived only for conditioning sets of size at most one. The performance of both dropout transforms improves over the Gaussian CI test with increasing sample size. A large sample size is especially important for the dropout normalizing transform because it relies on the estimation of more parameters. Since the dropout stabilizing transform is preferable to the dropout normalizing transform from a computational point of view while performing similarly well, we concentrate on this transform in Figure 2(e-h).

The simulation results show that our estimators significantly outperform the naive Gaussian CI test applied directly to the corrupted data. This has important implications for the development of new single-cell RNA-seq technologies, indicating that an increased sample size is preferable to minimizing dropout. Current single-cell technologies have been heading exactly in this direction, accepting an increased dropout rate in exchange for increased sample sizes (with studies containing up to a million samples) [1]. The simulation results suggest that our estimators are well-suited for such data.

5.2 Single-cell RNA-seq Data

Figure 3: (a) ROC curve for predicting causal effects of interventions in Perturb-seq data. (b) Gene regulatory network estimated from Perturb-seq data (blue edges indicate previously known interactions Dixit et al. (2016) that were also detected by our method). (c) SHD between CPDAGs estimated from pancreas data sets collected with low versus high dropout rates.

Perturb-seq. We tested our method on gene expression data collected via single-cell Perturb-seq Dixit et al. (2016) from bone marrow-derived dendritic cells (BMDCs). As in most single-cell studies, the gene expression observations are affected by dropout. The data consists of 933 observational samples (after standard pre-processing), which we used for learning the gene regulatory network. The Perturb-seq data set also contains interventional samples, which we used to evaluate the estimated CPDAG and construct an ROC curve. As in Dixit et al. (2016), we focused our analysis on 24 genes, which are important transcription factors known to regulate each other as well as a variety of other genes Garber et al. (2012). We used the dropout stabilizing transform to obtain the CI relations among the latent variables and compared the resulting CPDAG to the graph obtained using the standard Gaussian CI test applied directly to the observed corrupted data. In both settings we used the PC algorithm to infer the CPDAG from the CI relations. Figure 3a shows the resulting ROC curve, which quantifies, for varying tuning parameters, the accuracy of each of the learned CPDAGs in predicting the effect of each of the eight interventions, as described in Wang et al. (2017). Our algorithm with the dropout stabilizing transform outperforms the Gaussian CI test. The inferred gene regulatory network is shown in Figure 3b.

Pancreas - Type II Diabetes. We also tested our method on two gene expression data sets collected from human pancreatic cells Baron et al. (2016); Segerstolpe et al. (2016) via different single-cell assays, one with low dropout rate (Smart-seq2) and the other with high dropout rate (inDrop). The Smart-seq2 data set consists of 3514 cells and the inDrop data set of 8569 cells. We focused our analysis on a gene regulatory network of 20 genes, which is known to be involved in Type II Diabetes Sharma et al. (2018). Since no interventional data is available for this application, we evaluated our estimator based on how consistent the estimated CPDAG is across the two data sets. Figure 3c shows the SHD between the CPDAGs estimated from the data set with low versus high dropout using the dropout stabilizing transform as compared to the Gaussian CI test applied directly on the observed data. The inferred gene regulatory network is provided in appendix G. Since the SHD is lower for the dropout stabilizing transform than the Gaussian CI test, the CPDAG estimates produced by our method are more consistent across different dropout levels, thereby suggesting that our method is more robust to dropout.

6 Discussion

In this paper, we proposed a procedure for learning causal relationships in the presence of measurement error. For this purpose, we considered the anchored causal model, where each corrupted observed variable is generated from a latent uncorrupted variable and the aim is to learn the causal relationships among the latent variables using the observed variables as anchors. We introduced an algorithm that learns the Markov equivalence class of the causal DAG among the latent variables based on the empirical moments of the observed variables and proved its consistency. One of the main motivations for developing this algorithm was to address the problem of dropout in single-cell RNA-seq experiments. We showed how to apply our algorithm for learning the underlying gene regulatory network under a standard dropout model and analyzed its performance on synthetic data and on single-cell RNA-seq data, thereby showing that taking into account dropout allows identifying the causal relationships between genes in a more accurate and robust manner.


  • [1] 10x Genomics: Transcriptional profiling of 1.3 million brain cells with the Chromium single cell 3' solution. Application note. Accessed May 2019.
  • Anandkumar et al. [2013] A. Anandkumar, D. Hsu, A. Javanmard, and S. Kakade. Learning linear Bayesian networks with latent variables. In International Conference on Machine Learning, pages 249–257, 2013.

Appendix A Proof of Lemma 1


By the strong law of large numbers, for all ,

The functions are continuous, since by (A2) they are continuously differentiable; therefore, for all ,

and similarly, for all


and the correlations

Recursively applying a similar argument to equation (2) proves that is consistent. ∎
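The two steps of this argument, almost-sure convergence of the sample moments and the continuous mapping theorem, can be illustrated with a small numerical sketch. The bivariate Gaussian pair and the `plugin_correlation` helper below are hypothetical and are not the paper's estimator; the sketch only shows that a continuous function of consistent moment estimates is itself consistent.

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_correlation(x, y):
    # Method-of-moments plug-in: estimate the first- and second-order
    # moments by sample averages, then apply the (continuous)
    # correlation formula to the estimates.
    ex, ey = np.mean(x), np.mean(y)
    ex2, ey2, exy = np.mean(x**2), np.mean(y**2), np.mean(x * y)
    return (exy - ex * ey) / np.sqrt((ex2 - ex**2) * (ey2 - ey**2))

# Toy bivariate Gaussian pair with true correlation 0.6.
rho = 0.6
for n in [10**2, 10**4, 10**6]:
    z = rng.standard_normal(n)
    x = z
    y = rho * z + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    print(n, plugin_correlation(x, y))  # approaches 0.6 as n grows
```

Each sample average converges almost surely by the strong law of large numbers, and since the correlation formula is continuous wherever the variances are non-zero, the plug-in estimate inherits consistency.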

Appendix B Proof of Theorem 1

We start by defining the vectors of all correlations estimated from Algorithm 1 and all true correlations of as


respectively. We use to denote the vector obtained from concatenating all monomials in and that appear in and for in Assumption (A2). That is,

We let

be the analogous concatenated vector of sample monomials in and calculated from the data .

The following lemma is concerned with the asymptotic distribution of the correlation vector .

Lemma 2.

Under Assumptions (A1) and (A2),

where is the vector of all first and second order moments of and is a continuous function of .


Assumption (A2) asserts that the covariance of is finite. Hence, we can apply the Central Limit Theorem to obtain


where is the covariance matrix of . The elements of the covariance matrix can be written as a continuous function of the first- and second-order moments of , i.e., as a continuous function of .
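As a numerical sanity check of this step (a toy sketch, not the paper's model), the covariance of the scaled moment deviations can be compared with the covariance of the per-sample monomial contributions; the bivariate Gaussian pair and the `sample_moments` helper below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_moments(x, y):
    # Vector of first- and second-order sample moments of (X, Y).
    return np.array([x.mean(), y.mean(), (x**2).mean(), (y**2).mean(), (x * y).mean()])

n, reps = 500, 4000
true_m = np.array([0.0, 0.0, 1.0, 1.0, 0.5])  # standard bivariate normal, rho = 0.5
devs = np.empty((reps, 5))
for r in range(reps):
    z = rng.standard_normal(n)
    x = z
    y = 0.5 * z + np.sqrt(1 - 0.5**2) * rng.standard_normal(n)
    devs[r] = np.sqrt(n) * (sample_moments(x, y) - true_m)

# By the CLT, the covariance of these scaled deviations approaches the
# covariance of the per-sample monomial vector (x, y, x^2, y^2, x*y).
print(np.cov(devs.T).round(2))
```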

Assumptions (A1) and (A2) imply that we can write for all ,


We compute the sample correlation in our algorithm as

Based on equation (10) we can define a function such that and . Applying the delta method to equation (9) with the function , we get

where , since the elements of the mean vector are elements of . Note that, by the assumption in Section 2 that the variance of is non-zero, the denominator in (10) is non-zero; therefore is continuously differentiable in , which in turn is continuously differentiable in by Assumption (A2). Hence is continuous in , and therefore continuous in . Since is a matrix product of functions continuous in , it is also continuous in . ∎
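The delta-method step can be checked numerically on a toy example (a hypothetical bivariate Gaussian, not Algorithm 1): writing the plug-in correlation as a smooth function g of the moment vector, the Monte Carlo variance of the scaled estimation error should match the sandwich variance grad(g)' Sigma grad(g), which for a bivariate Gaussian equals the classical (1 - rho^2)^2.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n, reps = 0.5, 2000, 2000

def g(m):
    # Correlation as a smooth function of the moment vector
    # m = (E[X], E[Y], E[X^2], E[Y^2], E[XY]).
    ex, ey, ex2, ey2, exy = m
    return (exy - ex * ey) / np.sqrt((ex2 - ex**2) * (ey2 - ey**2))

def draw(size):
    z = rng.standard_normal(size)
    w = rho * z + np.sqrt(1 - rho**2) * rng.standard_normal(size)
    return z, w

def moments(x, y):
    return np.array([x.mean(), y.mean(), (x**2).mean(), (y**2).mean(), (x * y).mean()])

# Monte Carlo estimate of Var[ sqrt(n) * (g(m_hat) - rho) ].
ests = [g(moments(*draw(n))) for _ in range(reps)]
mc_var = n * np.var(ests)

# Delta-method variance grad(g)' Sigma grad(g): Sigma is the covariance of
# the per-sample monomial contributions, estimated from one large sample.
x, y = draw(10**6)
contrib = np.stack([x, y, x**2, y**2, x * y])
sigma = np.cov(contrib)
m0 = contrib.mean(axis=1)
eps = 1e-5
grad = np.array([(g(m0 + eps * e) - g(m0 - eps * e)) / (2 * eps) for e in np.eye(5)])
delta_var = grad @ sigma @ grad
print(mc_var, delta_var, (1 - rho**2) ** 2)
```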

Lemma 3.


where is the vector of all first- and second-order moments of , and is continuous in , then under Assumptions (A1) and (A2), for any and ,

for some that is continuous in , where are the partial correlations estimated by Algorithm 1.


We fix an arbitrary and subset , and prove the lemma for . Let , where is the size of the conditioning set . We begin by relabeling the variables of interest for clarity: we relabel to , to , and the elements of to . Furthermore, we define the sets

Note that , and therefore the partial correlation of interest is . Now, for , we define the vectors

It follows from the definition of that and . To prove the lemma, we proceed by induction on , starting with the base case , and show that for all such that ,


for some that is continuous in . The base case is given by the hypothesis of the lemma, and the statement of the lemma is that the above holds for ; therefore, completing the inductive step proves the lemma.
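The induction grows the conditioning set one variable at a time, mirroring the standard recursion for partial correlations. A minimal sketch of that recursion, with hypothetical variable names and assuming the usual one-variable-at-a-time formula, checked against the precision-matrix formula:

```python
import numpy as np

def partial_corr(corr, i, j, s):
    """Partial correlation rho_{ij.s} from a correlation matrix, via the
    standard recursion that removes one conditioning variable k at a time:
      rho_{ij.S+k} = (rho_{ij.S} - rho_{ik.S} * rho_{jk.S})
                     / sqrt((1 - rho_{ik.S}**2) * (1 - rho_{jk.S}**2))
    """
    s = list(s)
    if not s:
        return corr[i, j]
    k, rest = s[-1], s[:-1]
    r_ij = partial_corr(corr, i, j, rest)
    r_ik = partial_corr(corr, i, k, rest)
    r_jk = partial_corr(corr, j, k, rest)
    return (r_ij - r_ik * r_jk) / np.sqrt((1 - r_ik**2) * (1 - r_jk**2))

# Check against the precision-matrix formula on a random 4-variable
# correlation matrix: rho_{01.{2,3}} = -P_{01} / sqrt(P_{00} * P_{11}).
rng = np.random.default_rng(3)
a = rng.standard_normal((4, 4))
cov = a @ a.T + 4 * np.eye(4)
d = np.sqrt(np.diag(cov))
corr = cov / np.outer(d, d)
prec = np.linalg.inv(corr)
full_pc = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
print(partial_corr(corr, 0, 1, [2, 3]), full_pc)  # the two agree
```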

To complete the inductive step, assume that for such that , we have


Note that for any , the recursive formula for the partial correlations


implies that the vector can be written as a function of . Let be this function; then we have

Note that this implies since our procedure uses this recursive formula to estimate the partial correlations. Applying the delta method to with the function gives

where . The matrix can be computed to be the following:



To simplify indexing, we define the index function

Then, the element will be in the row and column of the Jacobian . We can now compute the elements of the matrix in terms of the elements of . Namely, defining and using the notation to denote the entry in the row and column of , we can compute the element in the row and column of to be