Causal Discovery from Incomplete Data: A Deep Learning Approach

by   Yuhao Wang, et al.
TU Eindhoven

As systems are getting more autonomous with the development of artificial intelligence, it is important to discover the causal knowledge from observational sensory inputs. By encoding a series of cause-effect relations between events, causal networks can facilitate the prediction of effects from a given action and analyze their underlying data generation mechanism. However, missing data are ubiquitous in practical scenarios. Directly performing existing casual discovery algorithms on partially observed data may lead to the incorrect inference. To alleviate this issue, we proposed a deep learning framework, dubbed Imputated Causal Learning (ICL), to perform iterative missing data imputation and causal structure discovery. Through extensive simulations on both synthetic and real data, we show that ICL can outperform state-of-the-art methods under different missing data mechanisms.


Causal discovery in the presence of missing data

Missing data are ubiquitous in many domains such as healthcare. Dependin...

Causal Discovery from Incomplete Data using An Encoder and Reinforcement Learning

Discovering causal structure among a set of variables is a fundamental p...

MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models

State-of-the-art causal discovery methods usually assume that the observ...

Causal Discovery of a River Network from its Extremes

Causal inference for extremes aims to discover cause and effect relation...

A practical guide to causal discovery with cohort data

In this guide, we present how to perform constraint-based causal discove...

Missing Data Estimation in High-Dimensional Datasets: A Swarm Intelligence-Deep Neural Network Approach

In this paper, we examine the problem of missing data in high-dimensiona...

Causal Generative Neural Networks

We introduce CGNN, a framework to learn functional causal models as gene...

1 Introduction

Analyzing causality is a fundamental problem to infer the causal mechanism from observed data. Usually causal relations among variables are described using a Directed Acyclic Graph (DAG), with the nodes representing variables and the edges indicating probabilistic relations among them. Learning such causal networks has proven useful in various applications, ranging from smart cities to health care. For example, knowledge of the causal structure is (1) helpful for analyzing relations among different business entities and supporting business decisions [borb], (2) necessary in learning gene regulatory network and analyzing complex disease traits [wang2017permutation], (3) important for visualizing causal attentions of self-driving cars, where a highlighted region would causally influence the vehicular steering control [lopez2017discovering]. In short, the discovered causal networks enable accurate decision making [sulik2017encoding], robust uncertainty inference [nakamura2007information], reliable fault diagnose [cai2017bayesian], and efficient redundancy elimination [xie2017anomaly].

Previous works on causal discovery mainly focus on the complete-data setting. They either try to learn the Bayesian network structure to estimate Markov properties or use the addictive noise model for causal inference. However, causal discovery under the missing-data setting is still relatively under-explored 

[Gain2018a]. In practice, missing data is a common issue. The underlying missingness mechanisms can be categorized into three basic types: Missing At Random (MAR), Missing Completely At Random (MCAR), and Missing Not At Random (MNAR). For example, sensors on the road intersection can record the traffic density, and traffic related information will be transmitted to the Road Side Units (RSUs) for traffic management in real time. In MAR, missingness is caused by fully observed variables. For example, when the vehicle density is above a threshold, RSUs will get overloaded and fail to collect traffic data. Missing traffic data depends on the traffic density recorded by the traffic sensor. MCAR is a special case of MAR, the cause of missingness is purely random and does not depend on the variables of interest, such as the lost of traffic information happens by chance. In MNAR, missingness depends on either unobserved attributes or the missing attribute itself. For example, the missingness of RSUs depends on the traffic density detected by the sensor. Additionally, the sensor itself also introduces missing values.

Some of the previous approaches handling missing data by directly deleting data entries with missing values, resulting in a complete observation for the problem at hand [carter2006solutions, van2015efficient]. This data processing way may be satisfactory with a small proportion of missing values (e.g., less than about 5% [graham2009missing]), but could result in a biased model in the presence of larger missing proportions. In theory, MCAR and MAR conditions ensure the recoverability of the underlying distributions from the measured value alone [nakagawa2015missing], and do not require the prior assumption of how data are missing. Therefore, a feasible solution can be first performing imputation to recover the missing entries, then followed by a causal discovery algorithm for knowledge representation from the recovered data [strobl2018fast]. However, as will be discussed further, directly perform imputation could introduce incorrect causal relations.

In this paper, we focus on causal discovery from observational data (as opposed to intervention experiments). Note that estimating the causal graph as a DAG is an NP-complete problem [chickering1996learning]

, and the task becomes even more challenging under the missing data condition. Causal discovery is an unsupervised learning problem and the goal is to discover the data generation process in the form of causal graphs. Inspired by

[yu2019dag] and motivated by the recent success of Generatvie Adversarial Networks (GAN) [goodfellow2014generative]

and Variational Autoencoder (VAE)

[kingma2013auto] in learning high-dimensional distributions, in this work, we use GAN and VAE to decompose this problem into two sub-problems, namely, iterative imputation with causal skeleton learning, and identify individual pairs of causal directions. In general, causal skeleton learning returns a reasonable network structure and offers a global view of how variables are dependent on each other, while causal direction identification provides a more accurate local view between the matched variable pairs. These complimentary local and global view helps approximate the data generating process among all observed variables.

Our contribution is three-fold:

  • We propose a deep learning framework, called Imputed Causal Learning (ICL), for iterative missing data imputation and causal structure discovery, producing both imputed data and causal skeletons.

  • We leverage the extra asymmetry cause-effect information within dependent pair sets in the causal skeleton . The causal directions in then being enumerated in a pair-wise way to uncover the underlying causal graph .

  • Through extensive simulations on both synthetic and publicly-used real data, we show that under MCAR and MAR conditions, our proposed algorithm outperforms state-of-the-art baseline methods.

2 Related Work

Causal Discovery from Complete Data

Methods for identifying causal relations from complete observation data usually fall into two categories: the first one exploits Markov properties of DAGs [chickering2002], and the second one tries to leverage asymmetries between variable pairs of the Functional Causal Model (FCM) [shimizu2006, mooij2016]. For methods in the first category, they may not be able to orient the causal direction of , since and are Markov equivalent. However, the causal direction can be further identified using methods in the second category by leveraging the asymmetry between causes and effects. Methods in the first category typically include constraint-based approaches, score-based approaches, and hybrid approaches. They can discover the dependence relations and identify the Markov equivalence class. Constraint-based approach discovers conditional independence between variables of DAGs. Typical algorithms under this category include the PC algorithm, Fast Causal Inference (FCI) [spirtes2000], and Really Fast Causal Inference (RFCI) [colombo2011learning]. Greedy Equivalence Search (GES) [nandy2018high] is a Score-based approach, it performs structure learning with a scoring criteria over the search space of the Markov Equivalence class. The recent breakthrough [zheng2018dags] makes the score-based method amenable with the existing black-box solvers. DAG-GNN [yu2019dag]

learns the DAG structure using a graph neural network. Besides,

hybrid approaches, such as the the Adaptively Restricted Greedy Equivalence Search (ARGES) [nandy2018high], Causal Generative Neural Network [goudet2018learning], which combine ideas of constraint and score-based approach. They restricts the score-based search space with the help of the conditional independence graph for either the computational efficiency or performance accuracy. Meanwhile, methods in the second category can be used to identify the causal directions, include linear non-Gaussian acyclic model (LiNGAM) [shimizu2006], Addictive Noise Model (ANM) [peters2014causal], Post-nonlinear model (PNL) [zhang2016estimation].

Figure 1: System architecture of our proposed ICL network, including three modules. We train Module A and B in an end-to-end manner for simultaneously imputation and causal skeleton learning, and the results is used as the input of Module C for causal direction identification.

Causal Discovery from Incomplete Data

Works related to causal discovery from incomplete data can be classified into two categories: one category attempts to discover causal structure using only available partial observations and the other aims at imputing all missing entries to recover the whole observation. Typical algorithms with partial observations perform (1) list-wise deletion on all entries (rows) with missing values before causal discovery

[pmlrv72gain18a]. (2) Test-wise deletion effectively ignores only the variables containing missing values involved in the conditional independence (CI) test [strobl2018fast, tu2019causal]

. These methods are suitable when the missingness mechanism can not be ignored and the underlying distribution is less likely to be recovered. Another category attempts to impute the missing values before performing causal discovery. Previous works use Expectation Maximization (EM) or Gibbs sampling to perform imputation. However, these approaches require prior knowledge of the underlying structure and are therefore not practical 

[singh1997learning]. On the other hand, imputation strategies for handling missing data is also very important. Works related to this category include the Multivariate Imputation by Chained Equations (MICE) [white2011multiple], MissForest (MF) [stekhoven2011missforest], and deep-learning-based approaches, such as using GAN for more powerful imputation [li2018learning, luo2018multivariate, yoon2018gain]. In this context, recovering the full distributions from missing data through imputation and performing causal discovery on the recovered data is the most straightforward solution [adel2017learning].

3 Imputed Causal Learning

On a high level, our model first takes incomplete observational data as input and then simultaneously performs missing data imputation and structural learning to estimate both the causal skeleton (as an undirected graph) and the recovered data (Module A and B of Figure 1). After that, pair-wise causal direction identification is performed to orient the causal edges and uncover the final underlying causal graph (Module C of Figure 1). Figure 1 shows an overview of our framework. The following subsections explain these two steps in detail.

Notation and Preliminaries

A causal graph consists of nodes and edges

. We denote a random variable set

with to represent i.i.d. observations into an data matrix. Node set corresponds to vertices, whereby each node in represents a random variable in a causal DAG. Within the edge set , an edge from two adjacent nodes to exists if and only if and , leading to a cause-effect pair of . A causal skeleton can be represented as . Besides, linear causal relationship in a form of graph can be equivalently represented as a linear Structural Equation Model (SEM):


And the relations between variables in rows are equivalent to . is a strictly upper triangular adjacency matrix with for all , and represent an edge between and in . is an

noise matrix with noise vectors

. Furthermore, a generalized nonlinear SEM model can be formulated as [yu2019dag].

can be treated as an autoregression matrix of the DAG. The joint probability distribution

is defined over the graphical model with a finite set of vertices on random variables .

3.1 Causal Skeleton Discovery from Incomplete Data

Problem Formulation and Method Overview

Under the missing data condition, we assume confounders (unobserved direct common cause of two variables) do not exist in the input data. This means that we can observe all variables but some samples may be missed. We define an incomplete version of as , where in Equation (2) is the corresponding masks. is a binary random variable and used to denote which entries in are missing. Specifically:


where means ‘missing’.

In this paper, causal skeleton discovery from incomplete data refers to the problem of inferring from incomplete observations . We do this by iteratively imputing and updating .

Imputing : Note that unlike previous causal discovery approaches dealing with missing data by either list-wise or test-wise deletion, we aim to generate full observations and yield an optimistic estimation from by imputation. Therefore, with only, we then need to first recover the underlying joint probability distribution from , and representing with a structured dependency among variables in with , where denotes the set of parents of node . We denote the recovered data by , and then formulate our task as minimizing the distribution difference of and by imputing all missing values of into .

Updating : In each iteration after imputing , we infer (and update) the autoregression parameter with by mapping samples from into a linear-separable hyperspace of with a neural network.

Iterative Update: The imputation (Module A of Figure 1) and learning of (Module B of Figure 1) are performed jointly. This is important since the data imputation and learning of can adjust and improve each other.

Figure 2: Incomplete data causal structure discovery (a) Imputation first, then structure discovery; (b) Simultaneous imputation and structure learning;

Proposed Method

Built on GAN and VAE, we generalize the work on Bayesian structure learning from [yu2019dag] and propose a deep learning algorithm to simultaneously perform missing data imputation and causal skeleton discovery. Our algorithm consists of four components: a generator (), a discriminator(), a structure encoder (), and a structure decoder (). Given incomplete observational data , and learn to obtain the estimated complete data , based on which and will try to learn the causal structure

. These four components are trained jointly using backpropagation (BP).

Note that a naive approach would be to perform imputation first and then follow by causal discovery (Figure 1(a)). This is sub-optimal because the estimated causal discovery cannot improve the data imputation process in turn. Empirically we find that its performance is very similar to performing causal discovery after directly deleting all data entries with missing values, meaning that imputation does not introduce any additional value into the causal discovery process. We address this issue by alternating between imputation and causal discovery, which is made possible through the use of differentiable neural networks (Figure 1(b)). Such an iterative process can do better in terms of performing multiple imputation passes to take into account the variability while preserving the underlying statistical causal relationship between variables.

Concretely, in each iteration of our algorithm, and take the incomplete data as input and impute the missing values to form . The causal structure is involved as parameters of both and . We encode into a latent code through , and decode into with . The above procedure can be seen as two neural network modules, GAN and VAE, jointly learning together. The former recovers missing data while the later discovers the causal structure.

Missing Data Imputation

Similar to [luo2018multivariate, yoon2018gain], we use and together to approach the underlying data distribution of for imputation. Since GAN can not take values as the input, to initialize the imputation process, we use a d-dimensional noise variable

sampled from the standard normal distribution

. And we replace the entries in with , where represent element-wise multiplication. will be served as the input of GAN to generate . With as the input of the structure learning neural network of and to discover the autoregression parameter through each iteration (details of the structure learning method will we covered in the next subsection). Specifically, the generator is responsible for observing the real data and imputing missing components conditioned on what is actually observed according to . The generator takes , , and as input:


Therefore, the recovered data can be obtained by replacing data on missing entries in with the generated corresponding values from as


Besides, the discriminator is introduced as an adversary training module accompanying the generator . Due to the incomplete observations, the initialized data inherently contains both real and fake values, which makes from standard GAN not feasible for our task. In this context, instead of counting real/fake from , the mapping of attempts to determine whether the components are actually observed or not. Specifically, we set as the input to , while is trying to fool in an adversarial way. In summary, and

together learn a desired joint distribution of

and then perform imputation given . Note that the difference between ICL and previous GAN-based imputation methods is that our imputation is also related to the recovered causal skeleton.

Causal Skeleton Learning

Then we perform structure discovery to find the underlying causal mechanism from the variable set in . Using the structure discovery method in [yu2019dag], with the scoring function , this concatenate task then turns into a continuous optimization problem of finding a that satisfies:


where is a function to remove directions in , leading to predicted causal skeleton . The adjacency matrix space represents the set of all DAGs. is the smooth function over real matrices, and ensures that is acyclic.

is a hyperparameter. Following

[yu2019dag], takes

as the input of a multilayer perceptron (MLP). The output, denoted as

, is then multiplied by and transformed into in Equation (6), where is the noise vector mentioned at the start of Section 3. The decoder in Equation (7) performs an inverse operation on the encoded to recover , where is a parameter of and to incorporate the causal structure during the learning process.

denotes the identity matrix.

and are parameters in corresponding layers.


The parameter plays an important role during the learning phase, stands for the dependence relationship between and in .

By extracting from the learning process described in Equations (6) and (7), we can have the knowledge of the marginal or conditional distribution of a random variable in . This is also how we discover a causal skeleton from .

Joint Training

The overall procedure can be seen as simultaneously recovering all missing entries in by and , and optimizing the structure learning performance of by and .

The loss function is formed into two parts as the imputation loss and structure learning loss. Since the missing values in real-scene are not known, it would not make sense to use their reconstruction error as a stopping criterion in the imputation loss part. The training objective can be formulated as a

problem of while measuring the degree of imputation fitness, as it is usually done when using the standard GANs. In our work, we optimize the data generation performance of a GAN with the loss function as follows


The generator generates samples conditioned on the partial observation of , the missingness indicator , and the noise . We train to generate and minimize the prediction probability of , while we train to maximize the prediction accuracy of . Then we follow the evidence lower bound (ELBO) from [yu2019dag], given below, for causal skeleton learning.

We denote and as parameter sets in GANs and VAEs separately. The overall learning problem can be formed as:


The stopping criteria is either the error is sufficiently small or the number of iterations is large enough. With the best fitting in Equation (9), the causal skeleton is generated by keeping edges in but remove their directions. The pseudo code is summarized in Algorithm 1.

Initialize : , , ,
, minibatch .
Input : Observational incomplete data .
Output  : Causal skeleton and imputed data .
while Loss has not converged do
        for  do
               Step 1: Missing data imputation:
               Missing entries:
               Step 2:Structure discovery:
               Step 3: Extract from :
               Let with
               Step 4: Update parameters of G and D in GAN using SGD according to Equation (8).
               Step 5: Update parameters of SE and SD in VAE and using SGD according to Equation (9).
        end for
end while
Algorithm 1 Causal Skeleton Discovery

3.2 Causal Direction Identification

The above procedure can identify the conditional probability, but may not truly represent the underlying causal mechanism. For example, given two variables and , their joint distribution can be decomposed equally as either or . These two decompositions relate to different causal mechanisms. With the additive noise model [mooij2016] , however, we can represent asymmetries between and , leading to a unique causal direction from purely observational data. In detail, let the joint distribution of with ground truth be . Then the effect of conditioned on the cause can be represented by:


where the second equality assumes and . Note that due to the asymmetry, Equation (10) does not hold in the reverse direction . This property makes it possible to determine the causal direction from observational data under proper conditions.

Therefore, given from the above section, our goal is to utilize such pair-wise asymmetry and orient the edges of , consequently uncovering the final causal DAG . This can be achieved by calculating the maximum evidences of the marginal log-likelihood over two models and . The model that shows the larger evidence is selected. In this work, we use the Cascade Additive Noise Model (CANM) proposed by [ijcai2019-223]. Specifically, to enumerate causal direction from variables pairs in , we use variable pairs from as input, then the log-marginal likelihood on variable and is computed with:

and are the parameters of the CANM model, which encode and into a latent code . The evidence score of the log marginal likelihood with can be calculated in the following way in both directions.

And the causal direction can be identified by:


Given the bivariate identifiable condition in Equation (10), causal discovery from more than two variables can be achieved if each of the causal pairs follows the ANM class [peters2011identifiability]. To uncover the underlying causal graph , we then independently orient each pair-wise edge using the bivariate identification method in Equation (11). Besides, note that a combination of causal structure learning and bi-variate direction identification requires a final verification to ensure that the DAG is acyclic. In the final stage, by checking if cycles in exist, we enumerate the related edges with the calculated score , then simply remove the edge which holds the lowest score. We will consider more sophisticated algorithms in future work.

30 Var MCAR (Nonlinear 1) (Ideal SHD=7) 50 Var MAR (Nonlinear 2) (Ideal SHD=17)
10% 30% 50% 10% 30% 50%
ICL (Ours) 9.8 3.9 7.4 3.8 8.4 4.9 19.0 4.2 25.5 3.8 27.3 5.5
Table 1:

Performance comparison (mean and standard deviation) using Structural Hamming Distance, lower is better.

4 Experiment Results

In this section, we will demonstrate how ICL performs on two synthetic datasets and one real-world dataset compared to state-of-the-art baselines.

4.1 Baseline Algorithms

Algorithms for data imputation include list-wise deletion (LD), multivariate imputation by chained equations (MICE) [white2011multiple], MissForest (MF) [stekhoven2011missforest], and GAN from as shown in Figure 1(a). Algorithms for the causal structure discovery include constraint-based approaches such as PC [spirtes2000], linear non-Gaussian acyclic model (LiNGAM) [shimizu2006], really fast causal inference (RFCI) [colombo2011learning], score-based approaches such as greedy equivalence search (GES) [chickering2002], hybrid approaches such as max-min parents-children-addictive noise model (MMPC-ANM) [cai2018], and a deep-learning approach based on DAG-GNN [yu2019dag]. For DAG-GNN we consider two variants: GAN-DAG first performs imputation first and then use the imputation results for structure discovery; LD-DAG first delete all entries with missing values and then perform causal discovery. Each baseline consists of one data imputation algorithm and one causal discovery algorithm. Therefore we have the following combinations: LD-PC, LD-LiNGAM, LD-RFCI, LD-MMPC, LD-GES; MF-PC, MF-LiNGAM, MF-RFCI, MF-MMPC, MF-GES; MC-PC, MC-LiNGAM, MC-RFCI, MC-MMPC, MC-GES; GAN-PC, GAN-LiNGAM, GAN-RFCI, GAN-MMPC, GAN-GES. All the baseline algorithms above are implemented using R-packages such as bnlearn [scutari2009], CompareCausalNetworks [heinze2017], pcalg [kalisch2012], and SELF [cai2018]. We use rpy2 [gautier2012rpy2] to make the above R-packages accessible from Python and ensure that all algorithms can be compared in the same environment. Following [tu2019causal, strobl2018fast]

, we use Structural Hamming Distance (SHD) as the evaluation metric.

4.2 Quantitative Results

In this subsection, we first provide on synthetic and real-world datasets in terms of both the causal graphs and the missing machanisms. We then compare ICL with the baselines above on these datasets.

Synthetic Data Generation

The synthetic ground truth graph with nodes is generated randomly using the Erdős Rényi (ER) model with an expected neighbor size . The edge weights of are uniformly drawn from to ensure that they are non-zero. Once is generated, the observational i.i.d. data is generated with a sample size and a variable size . For linear cases, the i.i.d. data is generated by sampling the model , where is a strictly upper triangular matrix; similarly for nonlinear cases, the sampled model is described by . Here the noise follows either the Exponential or the Gumbel distribution. In our work, two different mechanisms are considered in nonlinear cases:

In order to achieve a more general comparison in our experiments, the missing data proportions over all the synthetic data are set to be 10%, 30%, and 50%.

Missingness Mechanisms

In this paper we generate synthetic incomplete data using one of the two missingness mechanisms, namely MCAR and MAR, leaving MNAR as future work. For MCAR, the missingness mask is formed by selecting the missing entries from the observational data corresponding to with the same probability. Here

is a uniformly distributed random matrix which has the same dimensions as the observational data matrix. A threshold

is used as the missingness selection criterion. For MAR, the missingness mask is generated based on both the randomly generated graph (more details later) and . Specifically, we first randomly sample parent-child pairs, denoted as , from . is then set to if there exists an such that and . This is to simulate the setting where the missingness of the child node is determined by the (randomly generated) values of its parent nodes.

Quantitative Experiment Results

Table 1 reports SHD of our proposed ICL and other baeslines. The results are averaged over twenty random repetitions, with the missing proportion and under both MCAR and MAR conditions. We cover two nonlinear mechanisms as mentioned above. In our experiments, the linear results is consistent with the nonlinear results, and are not included due to space constraints. As shown in Table 1, ICL shows superior performance compared with all other baselines. ’Ideal SHD’ refers to ICL’s performance using complete data (no missing values). Recall that GAN-DAG performs data imputaion first and then follow by data causal discovery without the iterative process. As shown in Table 1, GAN-DAG’s performance is worse than ICL since its causal module cannot improve the data imputation process in turn. Interestingly, comparing LD-DAG and ICL, we can see that ignoring entries with missing values may have a negative effect on the performance of causal discovery. Furthermore, GES-based algorithms achieve the worst performance even with only 10% missing values. We can also see that MMPC-based algorithms are suitable for nonlinear data, while LiNGAM-based algorithms are suitable for linear data. As expected, directly removing the missing entries leads to worse performance, since it not only reduces the sample size (and consequently throwing away useful information in the observational data), but also introduced a form of selection bias, leading to incorrect causal discovery [pmlrv72gain18a]. Furthermore, it is also worth mentioning that by performing missing data imputation and causal discovery separately (like GAN-DAG), the results could be even worse than deletion-based methods. As we discussed, imputation could be helpful for recovering the joint distribution of , but sub-optimal when we want to perform a further step of the distribution decomposition to discover the underlying causal graph. In contrast, our ICL model does not have the issues above and can therefore achieve better performance.

Case Study on AutoMPG

As a case study we also show ICL’s results on a real-world dataset, AutoMPG [lichman2013uci], which is a city-cycle fuel consumption dataset with 398 instances. We discard the attributes of the car-name and the origin, and use the left 7 attributes: miles per gallon consumption (MPG), the release date of vehicles (AGE), vehicle weight (WEI), engine displacement (DIS), cylinder number (CYL), horsepower (HP), and vehicle’s acceleration capability (ACC). We simulate missing data under MAR and compare the performance of ICL and GAN-DAG (best baseline). Their learned causal networks are shown in Figure 3, where the SHD for ICL and GAN-DAG is 9 and 11, respectively.

Figure 3: AutoMPG results. (a) Our ICL algorithm (SHD=9). (b) GAN-DAG (SHD=11).

5 Conclusion

In this work, we addressed the problem of incomplete data causal discovery, and we proposed a deep learning model of ICL to handle this issue. Specifically, our ICL model contains a global view of iterative missing data imputation and causal skeleton discovery, and a local view of enumerating causal directions to uncover the underlying causal . In the end, we evaluated the effectiveness of our method on both synthetic and real data. As future work, we will generalize our method under more complex conditions such as the existence of confounders.