1 Introduction
Causal analysis is the fundamental problem of inferring causal mechanisms from observed data. Usually, causal relations among variables are described using a Directed Acyclic Graph (DAG), with nodes representing variables and edges indicating probabilistic relations among them. Learning such causal networks has proven useful in various applications, ranging from smart cities to health care. For example, knowledge of the causal structure is (1) helpful for analyzing relations among different business entities and supporting business decisions [borb], (2) necessary for learning gene regulatory networks and analyzing complex disease traits [wang2017permutation], and (3) important for visualizing the causal attention of self-driving cars, where a highlighted region causally influences the vehicular steering control [lopez2017discovering]. In short, the discovered causal networks enable accurate decision making [sulik2017encoding], robust uncertainty inference [nakamura2007information], reliable fault diagnosis [cai2017bayesian], and efficient redundancy elimination [xie2017anomaly].
Previous works on causal discovery mainly focus on the complete-data setting. They either try to learn the Bayesian network structure to estimate Markov properties or use the additive noise model for causal inference. However, causal discovery under the missing-data setting is still relatively underexplored [Gain2018a]. In practice, missing data is a common issue. The underlying missingness mechanisms can be categorized into three basic types: Missing At Random (MAR), Missing Completely At Random (MCAR), and Missing Not At Random (MNAR). Consider, for example, sensors at a road intersection that record traffic density and transmit traffic-related information to Road Side Units (RSUs) for real-time traffic management. In MAR, missingness is caused by fully observed variables: when the vehicle density is above a threshold, the RSUs get overloaded and fail to collect traffic data, so the missing traffic data depends on the (observed) density recorded by the traffic sensor. MCAR is a special case of MAR in which the cause of missingness is purely random and does not depend on the variables of interest, such as when traffic information is lost by chance. In MNAR, missingness depends on either unobserved attributes or the missing attribute itself; for example, the missingness at the RSUs depends on the traffic density detected by a sensor that itself also introduces missing values.

Some previous approaches handle missing data by directly deleting data entries with missing values, resulting in a complete observation for the problem at hand [carter2006solutions, van2015efficient]. This way of processing data may be satisfactory with a small proportion of missing values (e.g., less than about 5% [graham2009missing]), but can result in a biased model in the presence of larger missing proportions. In theory, the MCAR and MAR conditions ensure the recoverability of the underlying distributions from the measured values alone [nakagawa2015missing] and do not require prior assumptions about how data are missing. Therefore, a feasible solution is to first perform imputation to recover the missing entries, followed by a causal discovery algorithm applied to the recovered data [strobl2018fast]. However, as will be discussed further, directly performing imputation can introduce incorrect causal relations.
In this paper, we focus on causal discovery from observational data (as opposed to interventional experiments). Note that estimating the causal graph as a DAG is an NP-complete problem [chickering1996learning], and the task becomes even more challenging under the missing-data condition. Causal discovery is an unsupervised learning problem whose goal is to discover the data generation process in the form of causal graphs. Inspired by [yu2019dag], and motivated by the recent success of Generative Adversarial Networks (GAN) [goodfellow2014generative] and Variational Autoencoders (VAE) [kingma2013auto] in learning high-dimensional distributions, we use GAN and VAE to decompose this problem into two subproblems: iterative imputation with causal skeleton learning, and identification of individual pairs of causal directions. In general, causal skeleton learning returns a reasonable network structure and offers a global view of how variables depend on each other, while causal direction identification provides a more accurate local view of the matched variable pairs. These complementary local and global views help approximate the data generating process among all observed variables. Our contribution is threefold:

We propose a deep learning framework, called Imputed Causal Learning (ICL), for iterative missing data imputation and causal structure discovery, producing both imputed data and causal skeletons.

We leverage the extra asymmetric cause-effect information within the dependent pairs of the causal skeleton. The causal directions are then enumerated in a pairwise manner to uncover the underlying causal graph.

Through extensive simulations on both synthetic and publicly used real data, we show that under MCAR and MAR conditions, our proposed algorithm outperforms state-of-the-art baseline methods.
2 Related Work
Causal Discovery from Complete Data
Methods for identifying causal relations from complete observational data usually fall into two categories: the first exploits Markov properties of DAGs [chickering2002], and the second leverages asymmetries between variable pairs in a Functional Causal Model (FCM) [shimizu2006, mooij2016]. Methods in the first category may not be able to orient the causal direction of X → Y, since X → Y and Y → X are Markov equivalent; the direction can, however, be further identified using methods in the second category by leveraging the asymmetry between causes and effects. The first category includes constraint-based, score-based, and hybrid approaches, which can discover dependence relations and identify the Markov equivalence class. Constraint-based approaches discover conditional independence between variables of DAGs; typical algorithms in this category include the PC algorithm, Fast Causal Inference (FCI) [spirtes2000], and Really Fast Causal Inference (RFCI) [colombo2011learning]. Greedy Equivalence Search (GES) [nandy2018high] is a score-based approach that performs structure learning with a scoring criterion over the search space of Markov equivalence classes. The recent breakthrough of [zheng2018dags] makes score-based methods amenable to existing black-box solvers, and DAG-GNN [yu2019dag] learns the DAG structure using a graph neural network. Besides, hybrid approaches, such as Adaptively Restricted Greedy Equivalence Search (ARGES) [nandy2018high] and the Causal Generative Neural Network [goudet2018learning], combine ideas from the constraint- and score-based approaches: they restrict the score-based search space with the help of the conditional independence graph, for either computational efficiency or accuracy. Meanwhile, methods in the second category can be used to identify causal directions; they include the linear non-Gaussian acyclic model (LiNGAM) [shimizu2006], the Additive Noise Model (ANM) [peters2014causal], and the Post-nonlinear model (PNL) [zhang2016estimation].

Causal Discovery from Incomplete Data
Works related to causal discovery from incomplete data fall into two categories: one attempts to discover causal structure using only the available partial observations, and the other aims at imputing all missing entries to recover the full observation. Typical algorithms using partial observations perform (1) list-wise deletion of all entries (rows) with missing values before causal discovery [pmlrv72gain18a], or (2) test-wise deletion, which ignores only the variables containing missing values involved in the conditional independence (CI) test [strobl2018fast, tu2019causal]. These methods are suitable when the missingness mechanism cannot be ignored and the underlying distribution is unlikely to be recoverable. The other category attempts to impute the missing values before performing causal discovery. Previous works use Expectation Maximization (EM) or Gibbs sampling to perform imputation; however, these approaches require prior knowledge of the underlying structure and are therefore not practical [singh1997learning]. On the other hand, imputation strategies for handling missing data are also very important. Works in this category include Multivariate Imputation by Chained Equations (MICE) [white2011multiple], MissForest (MF) [stekhoven2011missforest], and deep-learning-based approaches, such as using a GAN for more powerful imputation [li2018learning, luo2018multivariate, yoon2018gain]. In this context, recovering the full distribution from missing data through imputation and performing causal discovery on the recovered data is the most straightforward solution [adel2017learning].

3 Imputed Causal Learning
On a high level, our model first takes incomplete observational data as input and simultaneously performs missing data imputation and structural learning to estimate both the causal skeleton (as an undirected graph) and the recovered data (Modules A and B of Figure 1). After that, pairwise causal direction identification is performed to orient the causal edges and uncover the final underlying causal graph (Module C of Figure 1). Figure 1 shows an overview of our framework. The following subsections explain these two steps in detail.
Notation and Preliminaries
A causal graph G = (V, E) consists of a node set V and an edge set E. We denote by X = (X_1, ..., X_d) a set of d random variables, and collect n i.i.d. observations into an n × d data matrix X. The node set V corresponds to the d vertices, whereby each node in V represents a random variable in a causal DAG. Within the edge set E, an edge between two adjacent nodes X_i and X_j exists if and only if X_i is a direct cause of X_j, leading to a cause-effect pair (X_i, X_j). The causal skeleton G_sk is the undirected version of G. Besides, a linear causal relationship in the form of graph G can be equivalently represented as a linear Structural Equation Model (SEM):

x = A^T x + z,    (1)

and the relations between the sample rows of X are equivalently X = XA + Z. Here A is a strictly upper triangular adjacency matrix with A_ij = 0 for all i ≥ j, and A_ij ≠ 0 represents an edge between X_i and X_j in G. Z is an n × d noise matrix with noise vectors z_1, ..., z_n. Furthermore, a generalized nonlinear SEM can be formulated as X = f_2((I − A^T)^{−1} f_1(Z)) [yu2019dag]. A can be treated as an autoregression matrix of the DAG.
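As a concrete illustration, data satisfying the linear SEM above can be sampled in closed form. The graph density, weight range, and Gaussian noise below are illustrative choices, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000

# Strictly upper-triangular adjacency matrix A: entry A[i, j] != 0
# means an edge X_i -> X_j in the DAG.
A = np.triu(rng.uniform(0.5, 2.0, size=(d, d)), k=1)
A *= rng.integers(0, 2, size=(d, d))          # keep a random subset of edges

# Noise matrix Z with one noise vector per sample (rows are i.i.d. samples).
Z = rng.normal(size=(n, d))

# Row-wise form of the linear SEM x = A^T x + z  <=>  X = X A + Z,
# solvable in closed form because I - A is invertible when A is a DAG.
X = Z @ np.linalg.inv(np.eye(d) - A)
```

Every sampled row satisfies the SEM exactly, which is what the assertion below verifies.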
The joint probability distribution P(X) is defined over the graphical model G with a finite set of vertices V on the random variables X.

3.1 Causal Skeleton Discovery from Incomplete Data
Problem Formulation and Method Overview
Under the missing data condition, we assume that confounders (unobserved direct common causes of two variables) do not exist in the input data. This means that all variables can be observed, but some samples may be missing. We define an incomplete version of X as X*, where R in Equation (2) is the corresponding mask. Each R_ij is a binary random variable used to denote which entries of X are missing. Specifically:

X*_ij = X_ij if R_ij = 1,  and  X*_ij = * if R_ij = 0,    (2)

where * means 'missing'.
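A minimal sketch of applying the mask in Equation (2), using NaN as the 'missing' marker (an implementation choice for illustration, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))                         # complete data (unknown in practice)
R = (rng.uniform(size=X.shape) > 0.3).astype(int)   # mask: 1 = observed, 0 = missing

# Incomplete observation X*: keep observed entries, mark missing ones as NaN.
X_star = np.where(R == 1, X, np.nan)
```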
In this paper, causal skeleton discovery from incomplete data refers to the problem of inferring G_sk from the incomplete observations X*. We do this by iteratively imputing X̂ and updating A.
Imputing X̂: Note that unlike previous causal discovery approaches, which deal with missing data by either list-wise or test-wise deletion, we aim to generate full observations and yield an optimistic estimate of X from X* by imputation. With only X* available, we first need to recover the underlying joint probability distribution P(X) from X*, representing it through a structured dependency among the variables as the product of P(X_j | Pa(X_j)), where Pa(X_j) denotes the set of parents of node X_j. We denote the recovered data by X̂, and formulate our task as minimizing the distributional difference between X and X̂ by imputing all missing values of X* into X̂.
Updating A: In each iteration, after imputing X̂, we infer (and update) the autoregression parameter A by mapping samples from X̂ into a linearly separable hyperspace with a neural network.
Iterative Update: The imputation (Module A of Figure 1) and the learning of A (Module B of Figure 1) are performed jointly. This is important because the data imputation and the learning of A can adjust and improve each other.
Proposed Method
Built on GAN and VAE, we generalize the work on Bayesian structure learning from [yu2019dag] and propose a deep learning algorithm that simultaneously performs missing data imputation and causal skeleton discovery. Our algorithm consists of four components: a generator (G), a discriminator (D), a structure encoder (SE), and a structure decoder (SD). Given incomplete observational data X*, G and D learn to obtain the estimated complete data X̂, based on which SE and SD learn the causal structure A. These four components are trained jointly using backpropagation (BP).
Note that a naive approach would be to perform imputation first, followed by causal discovery (Figure 1(a)). This is suboptimal because the estimated causal structure cannot in turn improve the data imputation process. Empirically, we find that its performance is very similar to performing causal discovery after directly deleting all data entries with missing values, meaning that imputation does not add any value to the causal discovery process. We address this issue by alternating between imputation and causal discovery, which is made possible through the use of differentiable neural networks (Figure 1(b)). Such an iterative process performs multiple imputation passes that account for variability while preserving the underlying statistical causal relationships between variables.
Concretely, in each iteration of our algorithm, G and D take the incomplete data X* as input and impute the missing values to form X̂. The causal structure A is involved as a parameter of both the imputation and structure-learning modules. We encode X̂ into a latent code Z through SE, and decode Z back into X̂ with SD. The above procedure can be seen as two neural network modules, a GAN and a VAE, jointly learning together: the former recovers missing data while the latter discovers the causal structure.
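To make the alternation concrete, here is a toy analogue of the loop, with the GAN and VAE modules replaced by plain least-squares regression purely for illustration. It shows that iterating between imputation and model fitting improves on one-shot mean imputation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 2000

# Ground-truth linear SEM (strictly upper-triangular A) and complete data.
A_true = np.triu(rng.uniform(0.8, 1.5, size=(d, d)) * rng.integers(0, 2, size=(d, d)), k=1)
X = rng.normal(size=(n, d)) @ np.linalg.inv(np.eye(d) - A_true)

# MCAR mask with ~20% missing entries; initialize by column-mean imputation.
R = rng.uniform(size=X.shape) > 0.2
X_hat = np.where(R, X, np.nanmean(np.where(R, X, np.nan), axis=0))

def impute_err(X_hat):
    return np.sqrt(np.mean((X_hat[~R] - X[~R]) ** 2))

err0 = impute_err(X_hat)
for _ in range(5):
    # "Structure" step: regress each variable on all the others
    # (a crude stand-in for the structure-learning module).
    W = np.zeros((d, d))
    for j in range(d):
        others = [k for k in range(d) if k != j]
        w, *_ = np.linalg.lstsq(X_hat[:, others], X_hat[:, j], rcond=None)
        W[others, j] = w
    # "Imputation" step: refill missing entries with model predictions.
    X_hat = np.where(R, X, X_hat @ W)
err1 = impute_err(X_hat)
```

The alternation reduces the imputation error relative to the mean-imputed starting point, which is the point of the joint scheme.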
Missing Data Imputation
Similar to [luo2018multivariate, yoon2018gain], we use G and D together to approximate the underlying data distribution of X for imputation. Since a GAN cannot take missing values as input, to initialize the imputation process we use a d-dimensional noise variable Z sampled from the standard normal distribution N(0, I). We replace the missing entries in X* with noise, forming X̃ = R ⊙ X* + (1 − R) ⊙ Z, where ⊙ represents element-wise multiplication. X̃ serves as the input of the GAN to generate X̄, and as the input of the structure-learning network of SE and SD to discover the autoregression parameter A in each iteration (details of the structure-learning method will be covered in the next subsection). Specifically, the generator G is responsible for observing the real data and imputing the missing components conditioned on what is actually observed according to R. The generator takes X̃, R, and A as input:

X̄ = G(X̃, R, A).    (3)

Therefore, the recovered data X̂ can be obtained by replacing the missing entries in X* with the corresponding generated values from X̄:

X̂ = R ⊙ X* + (1 − R) ⊙ X̄.    (4)
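The initialization and recombination steps (Equations (3) and (4)) reduce to element-wise masking. In this sketch a zero matrix stands in for the generator output, which in ICL is a neural network:

```python
import numpy as np

rng = np.random.default_rng(2)
X_star = np.array([[1.0, np.nan], [np.nan, 4.0], [5.0, 6.0]])
R = (~np.isnan(X_star)).astype(float)

# Initialization: replace missing entries with Gaussian noise before the GAN.
Z = rng.standard_normal(X_star.shape)
X_tilde = np.nan_to_num(X_star) * R + (1 - R) * Z

# Stand-in for the generator output G(X_tilde, R, A); a real G is a neural net.
X_bar = np.zeros_like(X_star)

# Equation (4): keep observed entries, take generated values only where missing.
X_hat = R * np.nan_to_num(X_star) + (1 - R) * X_bar
```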
Besides, the discriminator D is introduced as an adversarial training module accompanying the generator G. Due to the incomplete observations, the initialized data X̃ inherently contains both real and fake values, which makes the real/fake discrimination of a standard GAN infeasible for our task. In this context, instead of classifying X̂ as real or fake as a whole, the mapping of D attempts to determine, component-wise, whether each entry was actually observed or not. Specifically, we set X̂ as the input to D, while G tries to fool D in an adversarial way. In summary, G and D together learn the desired joint distribution of X and then perform imputation given X*. Note that the difference between ICL and previous GAN-based imputation methods is that our imputation is also informed by the recovered causal skeleton.

Causal Skeleton Learning
Then we perform structure discovery to find the underlying causal mechanism over the variable set in X̂. Using the structure discovery method in [yu2019dag] with a scoring function S(A), this concatenated task turns into a continuous optimization problem of finding an A that satisfies:

min_A S(A)  subject to  h(A) = tr(e^{A ⊙ A}) − d = 0,    (5)

where the predicted causal skeleton Ĝ_sk is obtained by removing the edge directions from the graph induced by A. The constraint restricts the adjacency matrix space to the set of all DAGs: h is a smooth function over real matrices, and h(A) = 0 ensures that A is acyclic.
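The acyclicity function h(A) = tr(e^{A ⊙ A}) − d from [zheng2018dags] can be checked numerically. This is a minimal sketch, not the training-time implementation:

```python
import numpy as np
from scipy.linalg import expm

def h(A):
    """NOTEARS-style acyclicity measure: zero iff A encodes a DAG."""
    d = A.shape[0]
    return np.trace(expm(A * A)) - d   # A * A is the elementwise (Hadamard) square

A_dag = np.array([[0.0, 1.0], [0.0, 0.0]])   # 0 -> 1 only: acyclic
A_cyc = np.array([[0.0, 1.0], [1.0, 0.0]])   # 0 -> 1 -> 0: a cycle
```

For the acyclic matrix h evaluates to zero; the cycle yields a strictly positive value, which is what makes the constraint usable with gradient-based solvers.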
λ is a hyperparameter. Following [yu2019dag], the encoder SE takes X̂ as the input of a multilayer perceptron (MLP). The output is then multiplied by (I − A^T) and transformed into the latent code Z in Equation (6), where Z corresponds to the noise vector mentioned at the start of Section 3 and I denotes the identity matrix. The decoder SD in Equation (7) performs an inverse operation on the encoded Z to recover X̂, where A is a parameter of both SE and SD that incorporates the causal structure during the learning process. W^{(1)} and W^{(2)} are the parameters of the corresponding MLP layers:

Z = (I − A^T) MLP(X̂; W^{(1)}),    (6)

X̂ = MLP((I − A^T)^{−1} Z; W^{(2)}).    (7)
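With the MLPs replaced by identity maps (an illustrative simplification), the encoder and decoder are exact inverses of each other, as the following sketch verifies using the row-sample convention Z = X̂(I − A):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 8
A = np.triu(rng.uniform(0.5, 1.0, size=(d, d)), k=1)   # a weighted DAG adjacency
X_hat = rng.normal(size=(n, d))                         # imputed data (row samples)

I = np.eye(d)
# Encoder (Eq. 6) with identity MLPs: map samples into the latent noise space.
Z = X_hat @ (I - A)
# Decoder (Eq. 7): invert the transformation to reconstruct the data.
X_rec = Z @ np.linalg.inv(I - A)
```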
The parameter A plays an important role during the learning phase: A_ij stands for the dependence relationship between X_i and X_j in G.
Joint Training
The overall procedure can be seen as simultaneously recovering all missing entries in X* through G and D, and optimizing the structure-learning performance of A through SE and SD.
The loss function comprises two parts: the imputation loss and the structure-learning loss. Since the missing values are not known in real scenarios, it would not make sense to use their reconstruction error as a stopping criterion for the imputation loss. Instead, the training objective can be formulated as a min-max problem between G and D that measures the degree of imputation fitness, as is usually done with standard GANs. In our work, we optimize the data generation performance of the GAN with the following loss function:

min_G max_D V(D, G) = E[R^T log D(X̂) + (1 − R)^T log(1 − D(X̂))].    (8)

The generator G generates samples conditioned on the partial observation X*, the missingness indicator R, and the noise Z. We train G to generate X̄ and minimize the probability of D correctly predicting R, while we train D to maximize its accuracy in predicting R. Then we follow the evidence lower bound (ELBO) from [yu2019dag] for causal skeleton learning:

L_ELBO = E_{q(Z|X̂)}[log p(X̂ | Z)] − KL(q(Z | X̂) || p(Z)).
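A minimal NumPy rendering of the loss in Equation (8), where D outputs a per-entry probability that the component was observed; the probability matrices below are hand-picked for illustration:

```python
import numpy as np

def discriminator_loss(D_prob, R, eps=1e-8):
    """Cross-entropy of Eq. (8): D predicts, per entry, whether the
    component of X_hat was actually observed (R = 1) or imputed (R = 0)."""
    return -np.mean(R * np.log(D_prob + eps) + (1 - R) * np.log(1 - D_prob + eps))

def generator_loss(D_prob, R, eps=1e-8):
    """G is trained so that imputed entries look observed to D."""
    return -np.mean((1 - R) * np.log(D_prob + eps))

R = np.array([[1.0, 0.0], [0.0, 1.0]])
D_good = np.array([[0.9, 0.1], [0.1, 0.9]])   # D separates observed/imputed well
D_bad = np.array([[0.5, 0.5], [0.5, 0.5]])    # D is completely fooled
```

A sharp discriminator achieves a lower discriminator loss, while a fooled discriminator yields a lower generator loss, reflecting the two sides of the min-max game.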
We denote Θ_G and Θ_V as the parameter sets of the GAN and the VAE, respectively. The overall learning problem can be formed as:

min_{Θ_G, Θ_V} V(D, G) − L_ELBO  subject to  h(A) = 0.    (9)

The stopping criterion is either that the error is sufficiently small or that the number of iterations is large enough. With the best-fitting A in Equation (9), the causal skeleton Ĝ_sk is generated by keeping the edges of the induced graph but removing their directions. The pseudo-code is summarized in Algorithm 1.
3.2 Causal Direction Identification
The above procedure can identify the conditional probabilities, but may not truly represent the underlying causal mechanism. For example, given two variables X and Y, their joint distribution can be decomposed equally well as either P(X)P(Y|X) or P(Y)P(X|Y), and these two decompositions relate to different causal mechanisms. With the additive noise model [mooij2016] Y = f(X) + N, however, we can represent asymmetries between X and Y, leading to a unique causal direction from purely observational data. In detail, let the joint distribution of (X, Y) with ground truth X → Y be P(X, Y). Then the effect Y conditioned on the cause X can be represented by:

P(Y | X) = P(f(X) + N | X) = P_N(Y − f(X)),    (10)

where the second equality assumes N ⊥ X and Y = f(X) + N. Note that due to the asymmetry, Equation (10) does not hold in the reverse direction Y → X. This property makes it possible to determine the causal direction from observational data under proper conditions.
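The asymmetry behind Equation (10) can be demonstrated with a simple residual-dependence test; the polynomial regression and squared-correlation score below are illustrative stand-ins for a proper independence test:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000
x = rng.uniform(-2, 2, size=n)
noise = rng.uniform(-0.5, 0.5, size=n)
y = x ** 3 + noise                      # ground-truth ANM: y = f(x) + n, n independent of x

def dependence_score(cause, effect, deg=3):
    """Fit a polynomial regression effect ~ cause and measure how strongly
    the residuals still depend on the cause (near 0 means independence)."""
    coeffs = np.polyfit(cause, effect, deg)
    resid = effect - np.polyval(coeffs, cause)
    return abs(np.corrcoef(resid ** 2, cause ** 2)[0, 1])

forward = dependence_score(x, y)   # correct direction: residuals ~ independent of x
reverse = dependence_score(y, x)   # wrong direction: residuals depend on y
```

In the causal direction the residuals recover the independent noise, while in the anti-causal direction no additive-noise fit exists, so the residuals stay dependent on the input.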
Therefore, given Ĝ_sk from the above section, our goal is to utilize such pairwise asymmetry to orient the edges of Ĝ_sk, consequently uncovering the final causal DAG Ĝ. This can be achieved by calculating the maximum evidence of the marginal log-likelihood over the two models X → Y and Y → X; the model that shows the larger evidence is selected. In this work, we use the Cascade Additive Noise Model (CANM) proposed by [ijcai2019223]. Specifically, to enumerate the causal direction of each variable pair in Ĝ_sk, we take the pair (X, Y) from X̂ as input, and the log-marginal likelihood on X and Y is lower-bounded by:

L_{X→Y} = log p_θ(X, Y) ≥ E_{q_φ(Z|X,Y)}[log p_θ(X, Y | Z)] − KL(q_φ(Z | X, Y) || p(Z)),

where θ and φ are the parameters of the CANM model, which encodes X and Y into a latent code Z. The evidence score of the log-marginal likelihood L can be calculated in this way in both directions, and the causal direction can be identified by:

X → Y if L_{X→Y} > L_{Y→X},  and  Y → X if L_{X→Y} < L_{Y→X}.    (11)
Given the bivariate identifiability condition of Equation (10), causal discovery over more than two variables can be achieved if each causal pair follows the ANM class [peters2011identifiability]. To uncover the underlying causal graph Ĝ, we independently orient each pairwise edge using the bivariate identification method of Equation (11). Besides, note that the combination of causal structure learning and bivariate direction identification requires a final verification that the resulting graph is acyclic. In the final stage, if cycles exist in Ĝ, we enumerate the involved edges with their calculated scores L and simply remove the edge with the lowest score. We will consider more sophisticated algorithms in future work.
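The final cycle-removal step can be sketched as follows, using a depth-first search to find a cycle and the pairwise scores to choose which edge to drop; this is an illustrative implementation of the strategy described above:

```python
import numpy as np

def find_cycle(G):
    """Return one directed cycle in adjacency matrix G as a list of edges,
    or None if G is acyclic (DFS with a recursion stack)."""
    d = G.shape[0]
    color = [0] * d                       # 0 unvisited, 1 on stack, 2 done
    stack = []

    def dfs(u):
        color[u] = 1
        stack.append(u)
        for v in range(d):
            if G[u, v]:
                if color[v] == 1:         # back edge: cycle found
                    path = stack[stack.index(v):] + [v]
                    return list(zip(path[:-1], path[1:]))
                if color[v] == 0:
                    c = dfs(v)
                    if c:
                        return c
        color[u] = 2
        stack.pop()
        return None

    for u in range(d):
        if color[u] == 0:
            c = dfs(u)
            if c:
                return c
    return None

def break_cycles(G, score):
    """Repeatedly drop the lowest-scoring edge of any remaining cycle."""
    G = G.copy()
    while (cycle := find_cycle(G)) is not None:
        i, j = min(cycle, key=lambda e: score[e])
        G[i, j] = 0
    return G

# Example: a 3-cycle 0 -> 1 -> 2 -> 0 where edge 2 -> 0 has the lowest score.
G_demo = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
scores = np.array([[0.0, 5.0, 0.0], [0.0, 0.0, 4.0], [1.0, 0.0, 0.0]])
G_fixed = break_cycles(G_demo, scores)
```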
Table 1: Performance comparison (mean ± standard deviation) using Structural Hamming Distance (SHD); lower is better. Columns report 10%, 30%, and 50% missingness for 30-variable MCAR data (Nonlinear 1, Ideal SHD = 7) and 50-variable MAR data (Nonlinear 2, Ideal SHD = 17). Rows pair each causal discovery method (GES, RFCI, LiNGAM, PC, MMPC, DAG) with each imputation strategy (LD, GAN, MF, MC). ICL (Ours) obtains 9.8 ± 3.9, 7.4 ± 3.8, and 8.4 ± 4.9 in the MCAR setting and 19.0 ± 4.2, 25.5 ± 3.8, and 27.3 ± 5.5 in the MAR setting.
4 Experiment Results
In this section, we demonstrate how ICL performs on two synthetic datasets and one real-world dataset compared to state-of-the-art baselines.
4.1 Baseline Algorithms
Algorithms for data imputation include list-wise deletion (LD), Multivariate Imputation by Chained Equations (MICE; 'MC' in the combinations below) [white2011multiple], MissForest (MF) [stekhoven2011missforest], and the GAN-based imputation shown in Figure 1(a). Algorithms for causal structure discovery include constraint-based approaches such as PC [spirtes2000], the linear non-Gaussian acyclic model (LiNGAM) [shimizu2006], and Really Fast Causal Inference (RFCI) [colombo2011learning]; score-based approaches such as Greedy Equivalence Search (GES) [chickering2002]; hybrid approaches such as the max-min parents-children additive noise model (MMPC-ANM) [cai2018]; and a deep-learning approach based on DAG-GNN [yu2019dag]. For DAG-GNN we consider two variants: GAN-DAG first performs imputation and then uses the imputed results for structure discovery, while LD-DAG first deletes all entries with missing values and then performs causal discovery. Each baseline consists of one data imputation algorithm and one causal discovery algorithm, giving the following combinations: LD-PC, LD-LiNGAM, LD-RFCI, LD-MMPC, LD-GES; MF-PC, MF-LiNGAM, MF-RFCI, MF-MMPC, MF-GES; MC-PC, MC-LiNGAM, MC-RFCI, MC-MMPC, MC-GES; GAN-PC, GAN-LiNGAM, GAN-RFCI, GAN-MMPC, GAN-GES. All baseline algorithms are implemented using R packages such as bnlearn [scutari2009], CompareCausalNetworks [heinze2017], pcalg [kalisch2012], and SELF [cai2018]. We use rpy2 [gautier2012rpy2] to make these R packages accessible from Python and to ensure that all algorithms are compared in the same environment. Following [tu2019causal, strobl2018fast], we use Structural Hamming Distance (SHD) as the evaluation metric.
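SHD counts the edge insertions, deletions, and reversals needed to turn the estimated graph into the ground truth. A minimal sketch for binary adjacency matrices, counting a reversed edge once (a common convention; the cited works may differ in details):

```python
import numpy as np

def shd(G_true, G_pred):
    """Structural Hamming Distance between two DAG adjacency matrices
    (binary, G[i, j] = 1 for an edge i -> j). Missing or extra edges
    count 1 each; a reversed edge counts 1, not 2."""
    s_true = (G_true + G_true.T) > 0          # undirected skeletons
    s_pred = (G_pred + G_pred.T) > 0
    skeleton_diff = np.triu(s_true != s_pred).sum()
    diff_dir = G_true.astype(bool) != G_pred.astype(bool)
    reversed_edges = (np.triu(s_true & s_pred) & diff_dir).sum()
    return int(skeleton_diff + reversed_edges)
```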
4.2 Quantitative Results
In this subsection, we first provide details of the synthetic and real-world datasets, in terms of both the causal graphs and the missingness mechanisms. We then compare ICL with the baselines above on these datasets.
Synthetic Data Generation
The synthetic ground-truth graph G with d nodes is generated randomly using the Erdős–Rényi (ER) model with a fixed expected neighborhood size. The edge weights of G are drawn uniformly so as to be nonzero. Once G is generated, the observational i.i.d. data is generated with sample size n and variable size d. For linear cases, the i.i.d. data is generated by sampling the model x = A^T x + z, where A is a strictly upper triangular matrix; similarly, for nonlinear cases, the sampled model is described by x = f(A^T x) + z. Here the noise z follows either the Exponential or the Gumbel distribution. Two different mechanisms f (Nonlinear 1 and Nonlinear 2) are considered in the nonlinear cases. To achieve a more general comparison in our experiments, the missing-data proportions over all synthetic data are set to 10%, 30%, and 50%.
Missingness Mechanisms
In this paper we generate synthetic incomplete data using one of two missingness mechanisms, namely MCAR and MAR, leaving MNAR as future work. For MCAR, the missingness mask R is formed by selecting missing entries from the observational data with equal probability: a uniformly distributed random matrix with the same dimensions as the observational data matrix is compared against a threshold, which serves as the missingness selection criterion. For MAR, the missingness mask is generated based on both the randomly generated graph G (more details later) and X. Specifically, we first randomly sample parent-child pairs from G; an entry of a child variable is then marked as missing if the corresponding value of its parent exceeds a threshold. This simulates the setting where the missingness of the child node is determined by the (randomly generated) values of its parent nodes.

Quantitative Experiment Results
Table 1 reports the SHD of our proposed ICL and the other baselines. The results are averaged over twenty random repetitions, with varying missing proportions under both MCAR and MAR conditions. We cover the two nonlinear mechanisms mentioned above. In our experiments, the linear results are consistent with the nonlinear results and are not included due to space constraints. As shown in Table 1, ICL outperforms all other baselines. 'Ideal SHD' refers to ICL's performance using complete data (no missing values). Recall that GAN-DAG performs data imputation first, followed by causal discovery, without the iterative process. As shown in Table 1, GAN-DAG performs worse than ICL since its causal module cannot in turn improve the data imputation process. Interestingly, comparing LD-DAG and ICL, we can see that ignoring entries with missing values can have a negative effect on the performance of causal discovery. Furthermore, GES-based algorithms achieve the worst performance even with only 10% missing values. We can also see that MMPC-based algorithms are suitable for nonlinear data, while LiNGAM-based algorithms are suitable for linear data. As expected, directly removing the missing entries leads to worse performance, since it not only reduces the sample size (consequently throwing away useful information in the observational data) but also introduces a form of selection bias, leading to incorrect causal discovery [pmlrv72gain18a]. It is also worth mentioning that performing missing data imputation and causal discovery separately (as in GAN-DAG) can yield results even worse than deletion-based methods. As discussed, imputation helps recover the joint distribution of X, but is suboptimal when we take the further step of decomposing the distribution to discover the underlying causal graph. In contrast, our ICL model does not suffer from these issues and can therefore achieve better performance.
Case Study on AutoMPG
As a case study, we also show ICL's results on a real-world dataset, AutoMPG [lichman2013uci], a city-cycle fuel consumption dataset with 398 instances. We discard the car-name and origin attributes and use the remaining 7 attributes: miles-per-gallon consumption (MPG), release date of the vehicle (AGE), vehicle weight (WEI), engine displacement (DIS), cylinder number (CYL), horsepower (HP), and the vehicle's acceleration capability (ACC). We simulate missing data under MAR and compare the performance of ICL and GAN-DAG (the best baseline). The learned causal networks are shown in Figure 3; the SHD for ICL and GAN-DAG is 9 and 11, respectively.
5 Conclusion
In this work, we addressed the problem of causal discovery from incomplete data, and we proposed a deep learning model, ICL, to handle this issue. Specifically, our ICL model combines a global view of iterative missing data imputation and causal skeleton discovery with a local view of enumerating causal directions to uncover the underlying causal graph. Finally, we evaluated the effectiveness of our method on both synthetic and real data. As future work, we will generalize our method to more complex conditions, such as the existence of confounders.