1 Introduction
Missing data is an unavoidable byproduct of collecting data in most practical domains. In medicine, for example, doctors may choose to omit what they deem to be irrelevant information (e.g., some patients may be asked to get comprehensive blood tests while others don’t), data may be explicitly omitted by the patient (e.g., avoiding questions on smoking status precisely because of their smoking habit) or simply misrecorded in electronic health systems (see e.g., Bell et al. (2014); Jakobsen et al. (2017); Sterne et al. (2009)).
Imputation algorithms can be used to estimate missing values based on data that was recorded, but their correctness depends on the type of missingness. For instance, expanding on the example above, younger patients may also be more likely to omit their smoking status. As illustrated in Figure
1, the challenge is that implicitly conditioning inference on observed data introduces a spurious path of correlation between age and the prevalence of smoking that wouldn’t exist with complete data.Missing data creates a shift between the available missing data distribution and the target complete data distribution. It is a shift that may be explicitly modeled as missingness indicators in an underlying causal model (i.e., a missingness graph as proposed by Mohan et al. Mohan et al. (2013)) as shown in Figure 1. The learning problem is one of extrapolation, learning with access to a missing data distribution for prediction and inference on the complete data distribution – that is, generated from a model where all missingness indicators have been intervened on (interventions interpreted in the sense of Pearl Pearl (2009)) thus graphically removing the dependence between missingness and its causes, and any spurious correlations among its ancestors.
With this causal interpretation, imputation of missing data on a given variable from other observed variables is formulated as a problem of robust optimization,
(1) 
simultaneously optimizing over the set of distributions arising from interventions on missingness indicators. Causal solutions – i.e. imputation using functions of causal parents of each missing variable in the underlying causal graph – are a closelyrelated version of this problem with an uncertainty set defined as any distribution arising from interventions on observed variables and variable indicators (see e.g. sections 3.2 and 3.3 in Meinshausen (2018)),
(2) 
since . Our premise is that causal solutions, i.e. minimizing the righthandside of (2), are expected to correct for spurious correlations introduced by distribution shift due to missing data and preserve the dependencies of the complete data for downstream analysis.
1.1 Contributions
In this paper, we propose to impute while preserving the causal structure of the data. Missing values in a given variable are replaced with their conditional expectation given the realization of its causal parents instead of the more common conditional expectation given all other observed variables, which absorbs spurious correlations.
We propose a novel imputation method called Missing data Imputation Refinement And Causal LEarning (MIRACLE). MIRACLE is a general framework for imputation that operates on any baseline (existing) imputation method. A visual description is given in Figure 2: given some initial imputation from a baseline method, MIRACLE refines its imputations iteratively by learning a missingness graph (graph) Mohan et al. (2013) and regularizing the imputation function such that it is consistent with the causal graph generating the data, substantially improving performance. In experiments, we apply MIRACLE to improve six popular imputation methods as baselines. We present detailed simulations to demonstrate on synthetic and a variety of publicly available datasets from the UCI Machine Learning Repository Dua and Graff (2020) that MIRACLE can improve imputation in almost every scenario and never degrades performance across all imputation methods.
1.2 Related work
The literature on imputation is large and varied. Still, most imputation algorithms work with the prior assumption that the missing data mechanism is ignorable, in the sense that imputation does not require explicitly modeling the distribution of missing values for imputation Seaman et al. (2013)
. Accordingly, classical imputation methods impute using a joint distribution over the incomplete data that is either explicit or implicitly defined through a set of conditional distributions. For example, explicit joint modeling methods include generative methods based on Generative Adversarial Networks
Yoon et al. (2018); Yoon and Sull (2020); Allen and Li (2016), matrix completion methods Mazumder et al. (2010), and parametric models for the joint distribution of the data. Missing values are then imputed by drawing from their predictive distribution. The conditional modeling approach
van Buuren (2018)consists of specifying one model for each variable and iteratively imputing using estimated conditional distributions. Examples of discriminative methods are random forests
Stekhoven and Bühlmann (2012)Gong et al. (2020); Gondara and Wang (2017); Mattei and Frellsen (2019), graph neural networks
You et al. (2020), distribution matching via optimal transport Muzellec et al. (2020), and multiple imputation using chained equations Buuren and GroothuisOudshoorn (2011).In a different line of research, Mohan et al., in a series of papers, see e.g. Mohan et al. (2013); Mohan and Pearl (2014), explicitly considered missing data within the underlying causal mechanisms of the data. Subsequently, a range of related problems has been studied, including identifiability of distributions and causal effects in the presence of missing data, see e.g. Bhattacharya et al. (2019); Shpitser et al. (2015); Nabi et al. (2020), testable implications relating to the causal structure using missing data Mohan and Pearl (2014), and causal discovery in the presence of missing data Gain and Shpitser (2018); Tu et al. (2019). Our focus, in contrast, is algorithmic in nature. We aim to develop an algorithm that improves imputation quality by leveraging causal insights represented as an estimated missingness graph learned from data.
2 Background
The basic semantic framework of our analysis rests on structural causal models (SCMs) (see e.g. Chapter 7 in Pearl (2009) for more details) explicitly introducing missingness indicators and their functional relationship with other variables, using in part the notation of Mohan et al. (2013). We define an SCM as a tuple where
is a vector of
endogenous variables and is a vector of exogenous variables.^{1}^{1}1Essentially, is the groundtruth features; is the random noise in the data generating process. is the vector of missingness indicators that represent the status of missingness of the endogenous variables . Precisely, is responsible for the value of a proxy variable of , i.e., the observed version of . For example, is equal to if the corresponding record is observed (), otherwise is missing (). is a set of functions where each decide the values of an endogenous variable and a missingness indicator variable , respectively. The function takes two separate arguments as parts of (except itself) and , termed as and . That is, and .The randomness in SCMs comes from the exogenous distribution where the exogenous variables in are generated independently and are mutually independent. Naturally, through the functions in , the SCM induces a joint distribution over the endogenous variables
, called the endogenous distribution. An intervention on some arbitrary random variables
in and , denoted by , is an operation which sets the value of to be , regardless of how they are ordinarily determined. For an SCM , let denote a submodel of induced by intervention . The interventional distribution induced by is defined as the distribution over in the submodel , namely, .Each SCM in the context of missingness is associated with a graph (e.g., Fig. 1a), which is a directed acyclic graph (DAG) where nodes represent endogenous variables and missingness indicators , and arrows represent the arguments and of each function and respectively. By convention, exogenous variables are often not shown explicitly in the graph.
Assumption 1 (Missingness indicators are not causes)
No missingness indicator in can be the cause of the endogenous variables , i.e., the arguments of the functions generating .
Assumption 2 (Causal sufficiency)
Exogeneous variables are mutually independent, i.e., all common parents of the endogenous variables are included in .
Assumption 3 (No selfmasking missingness)
Selfmasking missingness refers to missingness in a variable that is caused by itself. In the graph this is depicted by an edge from to (as shown in Figure 3 (d)). We assume that there is no such edges in the graph.
Assumption 4 (Observed root nodes)
The endogenous variables such that (i.e., the root nodes) in the graph are always observed (
with probability 1).
We make the four assumptions above throughout the following sections. Assumption 1 and 2 are employed in most related works using graphs (see e.g. Mohan and Pearl (2014); Mohan et al. (2013)). Assumption 1 is valid, for example, if is generated in the data collection process after the variable values are assigned. Consequently, under this assumption, if two endogenous variables of interest and are not separated by some variable , they are not separated by together with their missingness indicators and . We denote an independent relation in a data distribution by “ ” and separation in a graph by “”. We assume the data distribution is faithful to a graph, meaning that the two independencies are equivalent. As shown in Figure 3, data is missing completely at random (MCAR) if holds in the graph, missing at random (MAR) if for any endogenous variable , holds, and missing not at random (MNAR) otherwise, as stated in Mohan et al. (2013). If Assumption 3 is violated, we are unable to learn the missingness for selfmasked variables. Assumption 4 is necessary for imputing all the missing variables from their causal parents. These assumptions are imperative for MIRACLE to provide improved imputations by leveraging the causal structure of the underlying data generating process. In our experiments (Section 4), we apply MIRACLE to realworld datasets where these assumptions are not guaranteed.
2.1 Why is imputation prone to bias?
The reason for considering the causal structure of the underlying system is that when learning an imputation model from observed data, implicitly conditioning on some missingness indicators in induces spurious dependencies that would not otherwise exist. For example, in a graph , conditioning on induces a dependence between and . In general, the distributions and differ unless missingness occurs completely at random, and motivates an interpretation of the problem as domain generalization, training on data from one distribution ultimately to be applied on data from a different distribution that, in our case, arises from missing data (i.e., interventions on missingness indicators). This shift is not addressed in the imputation methods that only use the feature correlations.
3 Miracle
In this section, we propose to correct for the shift in distribution due to missing data by searching for causal solutions and explicitly refining imputation methods using a penalty on the induced causal structure. In practice, we have realizations of the observed version of , concatenated as a incomplete data matrix , together with missingness indicators concatenated in a matrix . We use here the same bold uppercase notation for sets of variables and matrices of observations but their meaning should be clear from the context. Our goal is to impute the unobserved values in using each variable’s causal parents. We define the imputed data ,
where is the elementwise product of matrices and is an estimate of the complete data matrix.
3.1 Network architecture
In this section, we describe our approach for estimating . Let be the number of partially observed features, i.e., missing for at least one realization. is the set of missing features indices. The imputation network is defined as a function that takes an initially imputed dataset (using an existing baseline imputation method), and returns two quantities:

A refined imputation .

An estimation of the probabilities of features being missing, and .
A depiction of the network architecture and optimization algorithm is shown in Figure 4. The architecture is constructed with respect to the assumptions shown in Section 2. Our model is decomposed into two subnetworks, , responsible for imputing the unobserved data and estimate the probabilities of missingness, respectively. The imputation network has components, , one for each variable, and the missingness network has components, . Each component, for both networks, has separate input and output layers but shared hidden layers (of size ). Let and denote the weight matrix (we omit biases for clarity) in the input layer of and respectively. The th column of and is set to . Let , for , denote the weight matrix of each hidden layer and let and , be the dimensional output layers of each subnetwork. The imputation network prediction is given by,
for . And similarly, the missingness network prediction is given by,
for , where
is the ELU activation function and
is the sigmoid function. Our network is optimized with respect to three objectives. First, to accurately predict missing values, second, to faithfully encode the causal relationships given by the underlying
graph, and third to satisfy a moment constraint of the missing data mechanism on the imputed values.
3.2 Reconstruction loss
The first objective is to train to correctly reconstruct each feature from the observed data using a reconstruction loss,
where and are the realized and imputed feature vector of the th instance, are the components of that are missing for at least one instance. The first loss term is for reconstructing the observed features, and the second loss term is for estimating the probabilities of missingness.
3.3 Causal regularizer
The second objective is to ensure that the dependencies defined by correspond to a DAG over the features and the missing indicators , which enforces that the learned functional dependencies recover a DAG in the equivalence class of causal graphs over the observed data. Enforcing the acyclicity of the dependencies induced by a continuous function is originally proposed in Zheng et al. (2018, 2020). Define a binary adjacency matrix ; (i.e., the norm of the th column of the matrix or is ) is a realistic and sufficient condition for achieving . The adjacency matrix of the graph induced by the learned is acyclic if and only if,
(3) 
is equal to zero, where and is the matrix exponential.
Remark 1. Existing imputation methods based on feature correlations essentially assume an undirected (noncausal) graph between the features. Further, acyclicity is a realistic and practical assumption to make on the static datasets collected by human experts. In nature, most data distributions generate their features in some order. In a directed graph, a cycle means a path starts and ends at the same node. This is unlikely to happen in the data generating process if not considering variables over time, i.e., timeseries data. By enforcing acyclicity, MIRACLE only uses the causal parents for imputation, which is less biased by spurious dependencies that only exist in the observed data.
3.4 Moment regularizer
The third objective leverages a set of moment constraints in the missingness pattern to improve imputation. Assume , for some . The following derivation holds for MAR or MCAR missingness patterns only. It holds that,
(4) 
where the third equality follows from the MAR assumption (). Under the MCAR assumption, this derivation holds trivially since in that case .
We can use the missingness and imputation networks to enforce the above equality algorithmically, ensuring the left hand side equals the right hand side in the empirical version of (4) as follows,
where , and is the th element of , i.e., the index of the th missing feature. Minimizing forces the two estimators of to match, the stabilized inverse propensity score weighting (SIPW) estimator Robins et al. (2007) using the missingness network (in ) and the mean estimator using the imputation network .
Remark 2. We hypothesize this mechanism can improve performance for two reasons. First, the missing data mechanism can be a simpler function that takes less samples to learn than the function that generates the feature , . Then the SIPW estimator based on will converge to the true mean faster than the estimator based on . Second, in , the mean estimator using is based on all the samples; is trained to produce predictions on the samples with missing feature for the sake of matching the SIPW estimator. By contrast, without the regularizer , is solely trained on the samples with observed feature , and its performance may fail to generalize to data with missing feature .
3.5 Bootstrap Imputation
Discovering a causal graph requires complete data. However, this is not the case for missing data problems. Because of this, we require that MIRACLE be seeded by another imputation method. Imputed values are iteratively refined by MIRACLE, hence “bootstrapping”, to potentially converge to a new imputation that minimizes MIRACLE’s objective (including causal and moment regularizers). MIRACLE’s objective for optimization is,
(5) 
where and
are hyperparameters that define the strength of regularization. We iteratively update the baseline matrix
with a new imputed matrixgiven by MIRACLE every ten epochs in training. With increasing epochs, stochastic optimization minimizes the loss for the imputed matrices that respect the causal and moment regularization. In theory, this is analogous to supervised training a denoising autoencoder (DAE)
Vincent et al. (2010); Bengio et al. (2013); Pretorius et al. (2018), but differs only by the fact that “noise” comes from prior or previous imputations. In training DAE, the input samples are corrupted by independent noise with each epoch, yet convergence is still guaranteed Arora et al. (2014). In our experiments, we demonstrate that bootstrap imputation indeed converges on multiple datasets and baseline methods. In Appendix E, we provide experiments on the computational complexity of MIRACLE.. Note that we show the average error over a variety of DAG instantiations and target variables, thus the magnitude and standard deviation of errors vary significantly between runs. See Appendix
B for MCAR results.4 Experiments
In this section, we validate the performance of MIRACLE using both synthetic and a variety of realworld datasets.

In the first set of experiments, we quantitatively analyze MIRACLE on synthetic data generated from a known causal structure.

In the second set of experiments, we quantitatively evaluate the imputation performance of MIRACLE using various publicly available UCI datasets Dua and Graff (2020).
General setup. We conduct each experiment five times under random instantiations of missingness. We report the RMSE along with standard deviation across each of the five experiments. Unless otherwise specified, missingness is applied at a rate of 30% per feature. For MCAR, this is applied uniformly across all features. For MAR, we randomly select 30% of the features to have missingness caused by another disjoint and randomly chosen set of features. Similarly, we randomly select 30% of features to be MNAR. We induce MAR and MNAR missingness using the methods outlined in Yoon et al. (2018), and we provide more details in Appendix A.2.
We use an 8020 traintest split. We performed a hyperparameter sweep (logbased) for and with ranges between 1e3 and 100. By default we have and set to 0.1 and 1, respectively.
Evaluating imputation. For each subsection below, we present three model evaluations in terms of missingness imputation performance, label prediction performance of a prediction algorithm trained on imputed data and the congeniality of imputation models.

Missingness imputation performance is evaluated with the root mean squared error comparing the imputed missing values with their actual unobserved values.

Label prediction performance of an imputation model is its ability to improve the postimputation prediction. By postimputation, we refer to using the imputed data to perform a downstream prediction task. To be fair to all benchmark methods, we use the same model (support vector regression) in all cases.

The congeniality of an imputation model is its ability to impute values that respect the featurelabel relationship post imputation. Specifically, we compare, support vector parameters, , learned from the complete dataset with the parameters , learned from the imputed dataset. We report root mean square error for each method. Lower values imply better congeniality Yoon et al. (2018).
Baseline imputation methods.
We apply MIRACLE imputation over a variety of six commonly used imputation baseline methods: (1) mean imputation using the featurewise mean, (2) a deep generative adversarial network for imputation using GAIN Yoon et al. (2018) (3)
nearest neighbors (KNN)
Troyanskaya et al. (2001) using the Euclidean distance as a distance metric of each missing sample to observed samples, (4) a treebased algorithm using MissForest (MF) Stekhoven and Bühlmann (2012), (5) a deep neural distribution matching method based on optimal transport (OT) Muzellec et al. (2020), and (6) Multivariate Imputation by Chained Equations (MICE) Buuren and GroothuisOudshoorn (2011). For each of the baseline imputation methods with tunable hyperparameters, we used the published values. We implement MIRACLE using the tensorflow^{2}^{2}2Source code at https://github.com/vanderschaarlab/MIRACLE. library. Complete training details and hyperparameters are provided in Appendix D.Additional experiments. In Appendix B and C, we also evaluate MIRACLE in terms of providing imputations that improve predictive performance Yoon et al. (2018); You et al. (2020), as well as, in its ability to impute values that respect the featurelabel relationship, i.e., congeniality Meng (1994); Deng et al. (2016). We also investigate the impact of the graphical missingness location on imputation performance on synthetic data in Appendix G.
4.1 Synthetic data
In this subsection, we evaluate MIRACLE on synthetic data. In doing so, we can control aspects of the underlying data generating process and possess oracle knowledge of the DAG structure.
Data generating process.
We generate random ErdosRenyi graphs with functional relationships from parent to children nodes. At each node, we add Gaussian noise with mean 0 and variance 1. For a complete description of the underlying data generating process, see Appendix
A.Synthetic results. In Figure 5, we show experiments of MIRACLE on synthetic MAR data in terms of RMSE. Our experiments show that MIRACLE is able to significantly improve imputation over each of the baseline imputation methods. Figure 5 shows MIRACLE improves performance over each baseline method across various dataset sizes, missingness ratios, and feature sizes (DAG sizes). We show results for MCAR in Appendix B.
4.2 Experiments on real data
In Figure 6 we show experiments of MIRACLE on real data. We perform experiments on several UCI datasets used in Yoon et al. (2018); You et al. (2020); Muzellec et al. (2020): Autism, Life expectancy, Energy, Abalone, Protein Structure, Communities and Crime, Yeast, Mammographic Masses, Wine Quality, and Facebook Metrics. In Figure 6, the improvements of MIRACLE are minimal for MCAR (except for mean imputation). This agrees with our discussion in Section 2.1, because the baseline imputations are not biased in the MCAR setting where holds in the graph. Conversely for the MAR and MNAR settings, as expected, we observe MIRACLE has an significant improvement on some of the datasets, such as Abalone, Autism, Energy and Protein Structure. As discussed in Section 2, MIRACLE can improve the baseline imputation under Assumptions 14, which may not hold in these realworld datasets. Nevertheless, we observe that MIRACLE never degrades performance relative to its baseline imputation on any dataset. Furthermore, no baseline imputer is optimal across the datasets. In almost all cases, applying MIRACLE to any baseline results in the lowest error. We show a similar gain for this experiment with MCAR and MNAR in Appendix C. An ablation study on our overall loss is provided in Appendix F.
4.3 Causal discovery and imputation performance
In our experiments, we observe a positive correlation between the quality of learned DAGs (and causal parents) with imputation performance. Consider the leftmost plot in Figure 7
using OT as a baseline imputer under MAR on our real data sets. Here, we do not have oracle knowledge but assume that the sparseness of the learned DAG implies a coherent DAG. We observe that MIRACLE has the most performance gain when fewer causal parents are identified for the missing variable in the learned DAGs. When MIRACLE is less able to isolate causal parents for prediction, the learned DAGs contains many spurious edges, and MIRACLE only has marginal improvements over the baseline imputer. We note that the gain of MIRACLE is not reproducible via feature selection methods, which are still prone to the spurious correlations in the observed data, as discussed in Section
2.1.4.4 MIRACLE Convergence
In this subsection, we investigate two dimensions of MIRACLE refinement: (1) baseline imputation quality and (2) sample or instancewise refinement. Regarding baseline imputation quality, we are interested in understanding the impact of MIRACLE refinement on various baseline imputers that may have disparate performances. In the middle plot of Figure 7, we show MIRACLE applied to various baseline imputers on the Abalone dataset. Similar plots for other datasets can be found in Appendix C. We observe that even though mean imputation starts off with the worst error, after refinement by MIRACLE, we see that all methods converge to similar RMSEs. For the second experiment, we investigate the samplewise improvement of MIRACLE on the abalone dataset using MissForest as a baseline imputer. On the rightmost plot of Figure 7, we observe that a vast majority of the samples are improved by MIRACLE. Note that every point below the diagonal is considered an improvement on an instance over the baseline imputation method. We can see MIRACLE improves the imputation almost universally except for the instances with small errors in the baseline imputation; on these instances, MIRACLE does not inflate their errors by a large margin. Furthermore, we observe that MIRACLE iteratively improves imputation as training progresses (over each epoch) by the observation that the slope of each line decreases with each epoch.
5 Discussion
In conclusion, motivated by the minimax optimization problem (1) arising from interventions on missingness indicators in the graph that encode the conditional independencies in the data distribution, we proposed MIRACLE, an iterative framework to refine any baseline missing data imputation to use the causal parents embodied in the estimated graphs. MIRACLE learns the causal graph as an adjacency matrix embedded in its input layers. We proposed a twopart regularizer based on the causal graph and a moment regularizer based on the missing variable indicators. We demonstrated that MIRACLE significantly improved the imputations of six baseline imputation methods over a variety of synthetic and real datasets. MIRACLE never hurts performance in the worstcase, and we envision MIRACLE becoming a de facto standard in refining missing data imputation.
There are several limitations we would like to identify as paths for future work. First, any violation of the assumptions in Section 2
may adversely impact the performance of MIRACLE in practice. Second, causal discovery under missing data is an ongoing research area, and therefore MIRACLE may be discovering DAGs with bias introduced from the baseline methods. However, in experiments, MIRACLE still performs well even if it starts with mean imputation. We expect MIRACLE to improve as causal discovery methods under missingness improve. Third, in its current form, MIRACLE is not extensible to scenarios where causality may not be applicable, such as computer vision. Fourth, because of the causal discovery regularizer and network architecture, MIRACLE may have difficulty scaling to very high dimensional data. Lastly, a more general and detailed discussion is needed between our work and the merits of causality and robustness.
Acknowledgements
This work was supported by GlaxoSmithKline (GSK), the National Science Foundation (NSF) under grant number 1722516, the Office of Naval Research (ONR), and The Alan Turning Institute (ATI). We thank all reviewers for their invaluable comments and suggestions.
References
 [1] (2016) Generative adversarial denoising autoencoder for face completion,. Note: https://www.cc.gatech.edu/~hays/7476/projects/Avery_Wenchen/ Cited by: §1.2.
 [2] (2014) Provable bounds for learning some deep representations. In International conference on machine learning, pp. 584–592. Cited by: §3.5.
 [3] (2014) Handling missing data in rcts; a review of the top medical journals. BMC medical research methodology 14 (1), pp. 1–8. Cited by: §1.
 [4] (2013) Generalized denoising autoencoders as generative models. In Proceedings of the 26th International Conference on Neural Information Processing Systems  Volume 1, NIPS’13, Red Hook, NY, USA, pp. 899–907. Cited by: §3.5.

[5]
(201906)
Identification in missing data models represented by directed acyclic graphs.
In
Proc. of Conference on Uncertainty in Artificial Intelligence (UAI)
, pp. . Cited by: §1.2.  [6] (201112) MICE: multivariate imputation by chained equations in r. Journal of Statistical Software 45, pp. . External Links: Document Cited by: §1.2, §4.
 [7] (201602) Multiple imputation for general missing data patterns in the presence of highdimensional data. Scientific Reports 6, pp. 21689. External Links: Document Cited by: §4.
 [8] (2020) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §1.1, item 2, item 3a.
 [9] (2018) Structure learning under missing data. Proceedings of machine learning research 72, pp. 121. Cited by: §1.2.
 [10] (201705) Multiple imputation using deep denoising autoencoders. pp. . Cited by: §1.2.

[11]
(202008 Dec)
Variational selective autoencoder.
In
Proceedings of The 2nd Symposium on Advances in Approximate Bayesian Inference
, C. Zhang, F. Ruiz, T. Bui, A. B. Dieng, and D. Liang (Eds.), Proceedings of Machine Learning Research, Vol. 118, pp. 1–17. External Links: Link Cited by: §1.2.  [12] (2017) When and how should multiple imputation be used for handling missing data in randomised clinical trials–a practical guide with flowcharts. BMC medical research methodology 17 (1), pp. 1–10. Cited by: §1.
 [13] (201909–15 Jun) MIWAE: deep generative modelling and imputation of incomplete data sets. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 4413–4423. External Links: Link Cited by: §1.2.
 [14] (2010) Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research 11 (80), pp. 2287–2322. Cited by: §1.2.

[15]
(2018)
Causality from a distributional robustness point of view.
In
2018 IEEE Data Science Workshop (DSW)
, pp. 6–10. Cited by: §1.  [16] (1994) Multipleimputation inferences with uncongenial sources of input. Statistical Science 9 (4), pp. 538–558. External Links: ISSN 08834237 Cited by: §4.
 [17] (2013) Graphical models for inference with missing data. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26, pp. 1277–1285. Cited by: §1.1, §1.2, §1, §2, §2.
 [18] (2014) On the testability of models with missing data. In Artificial Intelligence and Statistics, pp. 643–650. Cited by: §1.2, §2.
 [19] (202013–18 Jul) Missing data imputation using optimal transport. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, Virtual, pp. 7130–7140. Cited by: §1.2, §4, §4.2.
 [20] (2020) Full law identification in graphical models of missing data: completeness results. In Proc. of International Conference on Machine Learning (ICML), pp. . Cited by: §1.2.
 [21] (2009) Causality. Causality: Models, Reasoning, and Inference, Cambridge Univ. Press. External Links: ISBN 9780521895606, LCCN 99042108 Cited by: §1, §2.
 [22] (201810–15 Jul) Learning dynamics of linear denoising autoencoders. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 4141–4150. External Links: Link Cited by: §3.5.
 [23] (2007) Comment: performance of doublerobust estimators when” inverse probability” weights are highly variable. Statistical Science 22 (4), pp. 544–559. Cited by: §3.4.
 [24] (2013) What is meant by” missing at random”?. Statistical Science, pp. 257–268. Cited by: §1.2.
 [25] (2015) Missing data as a causal and probabilistic problem. In Proc. of Conference on Uncertainty in Artificial Intelligence (UAI), Cited by: §1.2.
 [26] (2012) MissForest  nonparametric missing value imputation for mixedtype data. Bioinformatics 28 1, pp. 112–8. Cited by: §1.2, §4.
 [27] (2009) Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. Bmj 338. Cited by: §1.
 [28] (200107) Missing value estimation methods for dna microarrays. Bioinformatics 17, pp. 520–525. External Links: Document Cited by: §4.
 [29] (2019) Causal discovery in the presence of missing data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1762–1770. Cited by: §1.2.
 [30] (2018) Flexible imputation of missing data. Chapman & Hall/CRC Interdisciplinary Statistics, CRC Press LLC. External Links: ISBN 9781138588318, LCCN 2018017122 Cited by: §1.2.
 [31] (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, pp. 3371–3408. Cited by: §3.5.
 [32] (2018) GAIN: missing data imputation using generative adversarial nets. In ICML, Cited by: 2nd item, §A.2, §1.2, 3rd item, §4, §4, §4.2, §4.

[33]
(202006)
GAMIN: generative adversarial multiple imputation network for highly missing data.
In
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
, Cited by: §1.2.  [34] (2020) Handling missing data with graph representation learning. In Advances in Neural Information Processing Systems, Cited by: §1.2, §4, §4.2.
 [35] (2018) DAGs with NO TEARS: Continuous Optimization for Structure Learning. In Advances in Neural Information Processing Systems, Cited by: §3.3.
 [36] (202026–28 Aug) Learning sparse nonparametric dags. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 3414–3425. Cited by: §3.3.
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Did you describe the limitations of your work? See the Discussion in Section 5.

Did you discuss any potential negative societal impacts of your work? We do not foresee any negative societal impacts of our work.

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results?

Did you include complete proofs of all theoretical results?


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? The code is available at https://anonymous.4open.science/r/MIRACLE7702/. The data is all publicly available at [8]. The instructions to reproduce our work can be found in the experimental Section 4 and Appendix A.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Section 4.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? See Section 4.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Appendix D.


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators?

Did you mention the license of the assets?

Did you include any new assets either in the supplemental material or as a URL?

Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable?

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix
This appendix is outlined as follows:

Section A details our synthetic data generating process and how we generated missingness.

Section B contains additional experiments on synthetic data testing for imputation performance, testing for prediction performance of a downstream machine learning algorithm on imputed data, and testing for congeniality. In each case we consider MCAR, MAR and MNAR missingness patterns. We used the same experimental setup in [32] for testing prediction performance of a downstream machine learning algorithm and for congeniality.

Section C contains additional experiments on real datasets testing for prediction performance of a downstream machine learning algorithm on imputed data, and testing for congeniality over all types of missingness.

Section D provides an overview of our model and training details.

Section E provides a computational complexity analysis of MIRACLE.

Section F
provides an ablation study for the MIRACLE loss function.

Section G provides experiments regarding missingness location.
Appendix A Supplementary experimental details
a.1 Synthetic data generation
In each synthetic experiment, we generated a dimensional random graph from a Erdös–Rényi random graph model with edges on average. Given , we assigned uniformly random edge weights to obtain a weighted adjacency matrix . Given , we sampled repeatedly from a Gaussian noise model for (each dimension sampled independently) to generate independent observations from this system.
a.2 Generating missingness.
The following explains how we constructed synthetic datasets that satisfy MCAR, MAR and MNAR patterns of missingness. We apply a modification to the missingness generation from [32].

MCAR. Missing completely at random was introduced by randomly removing of the observations in each feature.

MAR. We sequentially define the probability that the th component of the th sample is observed conditional on the missingness and values (if observed) of the previous components to be,
(6) where corresponds to the average missing rate of the th feature, and , are sampled from (but are only sampled once for the entire dataset).

MNAR. Missing not at random was introduced by defining the probability of the th component of the th sample to be observed by,
(7) with the same notation as above. Here, the missingness of a data point is directly dependent on its value (with dependence determined by the weight , sampled from ).
Appendix B Additional synthetic results
In this section, we provide supplementary results for synthetic experiments. Note that the error bars are large for some of the plots with predictive error and congeniality. This is because the yaxis of these plots are minmax normalized between 0 and 1, so the high variance (large error bars) shows that the improvement by MIRACLE may be minimal for the mentioned datasets. Additionally, this could be caused by the fact that the missing features aren’t predictive of a target variable, i.e., better imputation does not necessarily lead to any performance gain for the predicting the target variable.
b.1 MCAR Results
Using our synthetic experimental setup used in the main paper, we show the performance of MIRACLE in terms of RMSE, predictive error, and congeniality in Figure 8 for each of our baseline methods with MCAR.
b.2 MAR Results
Using our synthetic experimental setup used in the main paper, we show the performance of MIRACLE in terms of RMSE, predictive error, and congeniality in Figure 9 for each of our baseline methods with MAR.
b.3 MNAR Results
Using our synthetic experimental setup used in the main paper, we show the performance of MIRACLE in terms of RMSE, predictive error, and congeniality in Figure 10 for each of our baseline methods with MNAR.
Appendix C Additional real datasets
Prediction error and congeniality.
Additional convergence plots.
We include additional convergence plots on real datasets in Figure 11. We use the same experimental setup used in Figure 7 in Section 4. We observe that MIRACLE is able to converge regardless of baseline imputation used.
Appendix D Model and training details
We used the following network architecture for MIRACLE. Our proposed architecture consists of subnetworks with shared hidden layers, as shown in Figure 4. Each network is constructed with two hidden layers of neurons with ELU activation. Each benchmark method is initialized and seeded identically with the same random weights. For dataset preprocessing, all continuous variables are standardized with a mean of 0 and a variance of 1. We train each model using the Adam optimizer with a learning rate of 0.0005 for up to a maximum of 300 epochs.
Computational hardware.
All models were trained on an Ubuntu 18.04 OS with 64GB of RAM (Intel Core i76850K CPU @ 3.60GHz) and 2 NVidia 1080 Ti GPUs.
Appendix E Computational Complexity
Pseudocode for MIRACLE is provided in Algorithm 1. We perform an analysis of the MIRACLE scalability. Using our synthetic data generation, we created datasets of 1000 samples. Using our the synthetic experimental setup presented in the main paper, we present the computational timing results for MIRACLE as we increase the number of input features on inference and training time in Figure 14. Computational time scales linearly with increasing the number of input samples. As expected, we observe that the time to train 1000 samples grows exponentially with the feature size; however, the inference time remains linear. Inference time on 1000 samples with 400 features takes approximately 1.1 seconds, while training time takes nearly 85 seconds. Experiments were conducted on an Ubuntu 18.04 OS using 6 Intel i76850K CPUs.
Appendix F Ablation study
We provide an ablation study on our MIRACLE loss function in Eq. 5 to understand the sources of gain of MIRACLE. Here we execute this experiment on our real datasets using the same experimental details highlighted in the main manuscript. We show the results of our ablation on MIRACLE using MissForest as baseline imputation with MAR missingness to highlight our sources of gain in Table 1. Here, we observe that MIRACLE (rightmost column) has the most gain over all datasets. Additionally, we observe that has the most gain when MIRACLE has the most performance improvement over the baseline (see Fig. 7 in the manuscript).
Dataset  

abalone  0.321 0.108  0.521 0.199  0.312 0.082  0.222 0.062 
autism  0.093 0.005  0.094 0.004  0.091 0.004  0.073 0.004 
energy  0.106 0.011  0.147 0.077  0.132 0.050  0.065 0.061 
protein  0.134 0.016  0.129 0.008  0.119 0.010  0.080 0.008 
life expectancy  0.239 0.007  0.223 0.019  0.216 0.014  0.208 0.015 
community  0.490 0.015  0.516 0.020  0.479 0.023  0.463 0.010 
yeast  0.984 0.013  0.984 0.006  0.988 0.004  0.950 0.014 
mammo masses  1.105 0.010  1.150 0.009  1.103 0.013  1.040 0.013 
wine quality  0.797 0.004  0.745 0.013  0.724 0.008  0.722 0.003 
1.056 0.005  1.032 0.044  1.034 0.056  0.983 0.002  

Appendix G Understanding missingness location
An important consideration is how well does predicting with the causal parents work when downselecting features. Consider missingness in in the DAG in Figure 15. The first column with the causal parents Pa mean that only the parents of features were used for imputation. represents a variable that is not causally linked to anything.
Using our synthetic data generating process, we synthesized a dataset according to Figure 15. The goal here was to impute the missing values in , using each variable in Figure 15 to induce the missingness. Each of the missingness causes is categorized as MAR, except for , which is MNAR (since missingness caused by itself), and for , which is MCAR, since it is an external noise variable. The results provide several interesting findings.

[leftmargin=*,itemsep=0pt]

Using MissForest as a baseline imputer, the results in Table 2 show that MIRACLE performs as well as Pa, and has better performance than the baseline imputer. Moreover, the two rightmost columns of Table 2 give the average estimated functional dependence of (our target for prediction) and its parents and nonparents. We see that MIRACLE recovers true parents consistently.

We see that using causal parents (Pa and MIRACLE) for missingness caused by itself, , and a noise variable, , leads to the least amount of improvement.

We see that MIRACLE has the most gain when the missingness is caused by a causal parent ( or ).

Interestingly, for this example, we observe comparable performance when using the Markov blanket features versus all features in our baseline algorithm (MissForest). This suggests that the Markov blanket features are likely used for imputation by the baseline method.
In Appendix B, we show that the baseline performs similarly to the Markov blanket features colored in blue in Figure 15.
imputed error (RMSE)  edge weights (no threshold)  
Cause 
Pa  MB  Baseline  MIRACLE  Pa  nonPa 
0.11 .06  0.15 .03  0.27 .05  0.12 .07  0.44 0.14  0.02 0.01  
0.98 .08  1.34 .05  1.31 .06  0.49 .06  0.64 0.09  0.01 0.01  
1.20 .04  1.49 .04  1.45 .09  0.50 .06  0.62 0.13  0.02 0.01  
0.69 .05  1.20 .07  1.23 .05  1.04 .05  0.29 0.11  0.13 0.05  
1.51 .03  1.75 .08  1.76 .06  1.59 .07  0.37 0.18  0.03 0.02  
0.13 .08  0.17 .04  0.18 .07  0.14 .05  0.34 0.15  0.05 0.02  
1.04 .05  1.47 .04  1.47 .06  1.01 .06  0.39 0.05  0.04 0.01  
0.21 .04  0.28 .05  0.23 .03  0.20 .03  0.46 0.10  0.02 0.01  
0.15 .03  0.18 .04  0.17 .07  0.14 .05  0.31 0.15  0.02 0.01  

Appendix H Convergence
Using the same experimental setup, in Appendix G, we examine how the estimated adjacency matrix converges to the truth as sample size increases. Table 3, shows that as sample size increases we see that the quality of the learned mgraph improves.
Dataset size  

100 
0.135 0.04 
500  0.242 0.03 
1000  0.252 0.03 
5000  0.910 0.01 
10000  0.916 0.01 