Log In Sign Up

MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms

Missing data is an important problem in machine learning practice. Starting from the premise that imputation methods should preserve the causal structure of the data, we develop a regularization scheme that encourages any baseline imputation method to be causally consistent with the underlying data generating mechanism. Our proposal is a causally-aware imputation algorithm (MIRACLE). MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism, encouraging imputation to be consistent with the causal structure of the data. We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation over a variety of benchmark methods across all three missingness scenarios: at random, completely at random, and not at random.


page 9

page 20


Dimensional Data KNN-Based Imputation

Data Warehouses (DWs) are core components of Business Intelligence (BI)....

Missing Value Estimation using Clustering and Deep Learning within Multiple Imputation Framework

Missing values in tabular data restrict the use and performance of machi...

Proper Scoring Rules for Missing Value Imputation

Given the prevalence of missing data in modern statistical research, a b...

GEDI: A Graph-based End-to-end Data Imputation Framework

Data imputation is an effective way to handle missing data, which is com...

Towards a methodology for addressing missingness in datasets, with an application to demographic health datasets

Missing data is a common concern in health datasets, and its impact on g...

Multiple Imputation: A Review of Practical and Theoretical Findings

Multiple imputation is a straightforward method for handling missing dat...

PulseImpute: A Novel Benchmark Task for Pulsative Physiological Signal Imputation

The promise of Mobile Health (mHealth) is the ability to use wearable se...

1 Introduction

Missing data is an unavoidable byproduct of collecting data in most practical domains. In medicine, for example, doctors may choose to omit what they deem to be irrelevant information (e.g., some patients may be asked to get comprehensive blood tests while others don’t), data may be explicitly omitted by the patient (e.g., avoiding questions on smoking status precisely because of their smoking habit) or simply misrecorded in electronic health systems (see e.g., Bell et al. (2014); Jakobsen et al. (2017); Sterne et al. (2009)).

Imputation algorithms can be used to estimate missing values based on data that was recorded, but their correctness depends on the type of missingness. For instance, expanding on the example above, younger patients may also be more likely to omit their smoking status. As illustrated in Figure

1, the challenge is that implicitly conditioning inference on observed data introduces a spurious path of correlation between age and the prevalence of smoking that wouldn’t exist with complete data.

Figure 1: Missingness may introduce spurious dependencies.

Missing data creates a shift between the available missing data distribution and the target complete data distribution. It is a shift that may be explicitly modeled as missingness indicators in an underlying causal model (i.e., a missingness graph as proposed by Mohan et al. Mohan et al. (2013)) as shown in Figure 1. The learning problem is one of extrapolation, learning with access to a missing data distribution for prediction and inference on the complete data distribution – that is, generated from a model where all missingness indicators have been intervened on (interventions interpreted in the sense of Pearl Pearl (2009)) thus graphically removing the dependence between missingness and its causes, and any spurious correlations among its ancestors.

With this causal interpretation, imputation of missing data on a given variable from other observed variables is formulated as a problem of robust optimization,


simultaneously optimizing over the set of distributions arising from interventions on missingness indicators. Causal solutions – i.e. imputation using functions of causal parents of each missing variable in the underlying causal graph – are a closely-related version of this problem with an uncertainty set defined as any distribution arising from interventions on observed variables and variable indicators (see e.g. sections 3.2 and 3.3 in Meinshausen (2018)),


since . Our premise is that causal solutions, i.e. minimizing the right-hand-side of (2), are expected to correct for spurious correlations introduced by distribution shift due to missing data and preserve the dependencies of the complete data for downstream analysis.

1.1 Contributions

In this paper, we propose to impute while preserving the causal structure of the data. Missing values in a given variable are replaced with their conditional expectation given the realization of its causal parents instead of the more common conditional expectation given all other observed variables, which absorbs spurious correlations.

We propose a novel imputation method called Missing data Imputation Refinement And Causal LEarning (MIRACLE). MIRACLE is a general framework for imputation that operates on any baseline (existing) imputation method. A visual description is given in Figure 2: given some initial imputation from a baseline method, MIRACLE refines its imputations iteratively by learning a missingness graph (-graph) Mohan et al. (2013) and regularizing the imputation function such that it is consistent with the causal graph generating the data, substantially improving performance. In experiments, we apply MIRACLE to improve six popular imputation methods as baselines. We present detailed simulations to demonstrate on synthetic and a variety of publicly available datasets from the UCI Machine Learning Repository Dua and Graff (2020) that MIRACLE can improve imputation in almost every scenario and never degrades performance across all imputation methods.

Figure 2: MIRACLE refines baseline imputation by simultaneously learning an

-graph using a bootstrap imputation loop that serves to incrementally regularize predictions with a learned causal graph. We plot average testing error and estimated causal graph as a function of training epochs on a synthetic data experiment described in Section 

4. The true causal structure (as an adjacency matrix) and imputation improvements for each missing value separately (each missing value with a corresponding dot) is shown in the right-most panel.

1.2 Related work

The literature on imputation is large and varied. Still, most imputation algorithms work with the prior assumption that the missing data mechanism is ignorable, in the sense that imputation does not require explicitly modeling the distribution of missing values for imputation Seaman et al. (2013)

. Accordingly, classical imputation methods impute using a joint distribution over the incomplete data that is either explicit or implicitly defined through a set of conditional distributions. For example, explicit joint modeling methods include generative methods based on Generative Adversarial Networks

Yoon et al. (2018); Yoon and Sull (2020); Allen and Li (2016), matrix completion methods Mazumder et al. (2010)

, and parametric models for the joint distribution of the data. Missing values are then imputed by drawing from their predictive distribution. The conditional modeling approach

van Buuren (2018)

consists of specifying one model for each variable and iteratively imputing using estimated conditional distributions. Examples of discriminative methods are random forests

Stekhoven and Bühlmann (2012)

, autoencoders

Gong et al. (2020); Gondara and Wang (2017); Mattei and Frellsen (2019)

, graph neural networks

You et al. (2020), distribution matching via optimal transport Muzellec et al. (2020), and multiple imputation using chained equations Buuren and Groothuis-Oudshoorn (2011).

In a different line of research, Mohan et al., in a series of papers, see e.g. Mohan et al. (2013); Mohan and Pearl (2014), explicitly considered missing data within the underlying causal mechanisms of the data. Subsequently, a range of related problems has been studied, including identifiability of distributions and causal effects in the presence of missing data, see e.g. Bhattacharya et al. (2019); Shpitser et al. (2015); Nabi et al. (2020), testable implications relating to the causal structure using missing data Mohan and Pearl (2014), and causal discovery in the presence of missing data Gain and Shpitser (2018); Tu et al. (2019). Our focus, in contrast, is algorithmic in nature. We aim to develop an algorithm that improves imputation quality by leveraging causal insights represented as an estimated missingness graph learned from data.

2 Background

The basic semantic framework of our analysis rests on structural causal models (SCMs) (see e.g. Chapter 7 in Pearl (2009) for more details) explicitly introducing missingness indicators and their functional relationship with other variables, using in part the notation of Mohan et al. (2013). We define an SCM as a tuple where

is a vector of

endogenous variables and is a vector of exogenous variables.111Essentially, is the ground-truth features; is the random noise in the data generating process. is the vector of missingness indicators that represent the status of missingness of the endogenous variables . Precisely, is responsible for the value of a proxy variable of , i.e., the observed version of . For example, is equal to if the corresponding record is observed (), otherwise is missing (). is a set of functions where each decide the values of an endogenous variable and a missingness indicator variable , respectively. The function takes two separate arguments as parts of (except itself) and , termed as and . That is, and .

The randomness in SCMs comes from the exogenous distribution where the exogenous variables in are generated independently and are mutually independent. Naturally, through the functions in , the SCM induces a joint distribution over the endogenous variables

, called the endogenous distribution. An intervention on some arbitrary random variables

in and , denoted by , is an operation which sets the value of to be , regardless of how they are ordinarily determined. For an SCM , let denote a submodel of induced by intervention . The interventional distribution induced by is defined as the distribution over in the submodel , namely, .

Each SCM in the context of missingness is associated with a -graph (e.g., Fig. 1a), which is a directed acyclic graph (DAG) where nodes represent endogenous variables and missingness indicators , and arrows represent the arguments and of each function and respectively. By convention, exogenous variables are often not shown explicitly in the graph.

Assumption 1 (Missingness indicators are not causes)

No missingness indicator in can be the cause of the endogenous variables , i.e., the arguments of the functions generating .

Assumption 2 (Causal sufficiency)

Exogeneous variables are mutually independent, i.e., all common parents of the endogenous variables are included in .

Assumption 3 (No self-masking missingness)

Self-masking missingness refers to missingness in a variable that is caused by itself. In the -graph this is depicted by an edge from to (as shown in Figure 3 (d)). We assume that there is no such edges in the -graph.

Assumption 4 (Observed root nodes)

The endogenous variables such that (i.e., the root nodes) in the -graph are always observed (

with probability 1).

We make the four assumptions above throughout the following sections. Assumption 1 and 2 are employed in most related works using -graphs (see e.g. Mohan and Pearl (2014); Mohan et al. (2013)). Assumption 1 is valid, for example, if is generated in the data collection process after the variable values are assigned. Consequently, under this assumption, if two endogenous variables of interest and are not -separated by some variable , they are not -separated by together with their missingness indicators and . We denote an independent relation in a data distribution by “ ” and -separation in a -graph by “”. We assume the data distribution is faithful to a -graph, meaning that the two independencies are equivalent. As shown in Figure 3, data is missing completely at random (MCAR) if holds in the -graph, missing at random (MAR) if for any endogenous variable , holds, and missing not at random (MNAR) otherwise, as stated in Mohan et al. (2013). If Assumption 3 is violated, we are unable to learn the missingness for self-masked variables. Assumption 4 is necessary for imputing all the missing variables from their causal parents. These assumptions are imperative for MIRACLE to provide improved imputations by leveraging the causal structure of the underlying data generating process. In our experiments (Section 4), we apply MIRACLE to real-world datasets where these assumptions are not guaranteed.

[A MCAR graph.]  [A MAR graph.]  [A MNAR graph.]  [Self-masking missingness.]

Figure 3: Example graphs. are endogenous variables and are missing data indicators. Red shaded variables are not always observed while white shaded variables are always observed.

2.1 Why is imputation prone to bias?

The reason for considering the causal structure of the underlying system is that when learning an imputation model from observed data, implicitly conditioning on some missingness indicators in induces spurious dependencies that would not otherwise exist. For example, in a graph , conditioning on induces a dependence between and . In general, the distributions and differ unless missingness occurs completely at random, and motivates an interpretation of the problem as domain generalization, training on data from one distribution ultimately to be applied on data from a different distribution that, in our case, arises from missing data (i.e., interventions on missingness indicators). This shift is not addressed in the imputation methods that only use the feature correlations.

3 Miracle

In this section, we propose to correct for the shift in distribution due to missing data by searching for causal solutions and explicitly refining imputation methods using a penalty on the induced causal structure. In practice, we have realizations of the observed version of , concatenated as a incomplete data matrix , together with missingness indicators concatenated in a matrix . We use here the same bold uppercase notation for sets of variables and matrices of observations but their meaning should be clear from the context. Our goal is to impute the unobserved values in using each variable’s causal parents. We define the imputed data ,

where is the element-wise product of matrices and is an estimate of the complete data matrix.

3.1 Network architecture

In this section, we describe our approach for estimating . Let be the number of partially observed features, i.e., missing for at least one realization. is the set of missing features indices. The imputation network is defined as a function that takes an initially imputed dataset (using an existing baseline imputation method), and returns two quantities:

  1. A refined imputation .

  2. An estimation of the probabilities of features being missing, and .

A depiction of the network architecture and optimization algorithm is shown in Figure 4. The architecture is constructed with respect to the assumptions shown in Section 2. Our model is decomposed into two sub-networks, , responsible for imputing the unobserved data and estimate the probabilities of missingness, respectively. The imputation network has components, , one for each variable, and the missingness network has components, . Each component, for both networks, has separate input and output layers but shared hidden layers (of size ). Let and denote the weight matrix (we omit biases for clarity) in the input layer of and respectively. The -th column of and is set to . Let , for , denote the weight matrix of each hidden layer and let and , be the dimensional output layers of each sub-network. The imputation network prediction is given by,

for . And similarly, the missingness network prediction is given by,

for , where

is the ELU activation function and

is the sigmoid function. Our network is optimized with respect to three objectives. First, to accurately predict missing values, second, to faithfully encode the causal relationships given by the underlying

-graph, and third to satisfy a moment constraint of the missing data mechanism on the imputed values.

Figure 4: Network and optimization diagram for MIRACLE.

3.2 Reconstruction loss

The first objective is to train to correctly reconstruct each feature from the observed data using a reconstruction loss,

where and are the realized and imputed feature vector of the -th instance, are the components of that are missing for at least one instance. The first loss term is for reconstructing the observed features, and the second loss term is for estimating the probabilities of missingness.

3.3 Causal regularizer

The second objective is to ensure that the dependencies defined by correspond to a DAG over the features and the missing indicators , which enforces that the learned functional dependencies recover a DAG in the equivalence class of causal graphs over the observed data. Enforcing the acyclicity of the dependencies induced by a continuous function is originally proposed in Zheng et al. (2018, 2020). Define a binary adjacency matrix ; (i.e., the -norm of the -th column of the matrix or is ) is a realistic and sufficient condition for achieving . The adjacency matrix of the graph induced by the learned is acyclic if and only if,


is equal to zero, where and is the matrix exponential.

Remark 1. Existing imputation methods based on feature correlations essentially assume an undirected (non-causal) graph between the features. Further, acyclicity is a realistic and practical assumption to make on the static datasets collected by human experts. In nature, most data distributions generate their features in some order. In a directed graph, a cycle means a path starts and ends at the same node. This is unlikely to happen in the data generating process if not considering variables over time, i.e., time-series data. By enforcing acyclicity, MIRACLE only uses the causal parents for imputation, which is less biased by spurious dependencies that only exist in the observed data.

3.4 Moment regularizer

The third objective leverages a set of moment constraints in the missingness pattern to improve imputation. Assume , for some . The following derivation holds for MAR or MCAR missingness patterns only. It holds that,


where the third equality follows from the MAR assumption (). Under the MCAR assumption, this derivation holds trivially since in that case .

We can use the missingness and imputation networks to enforce the above equality algorithmically, ensuring the left hand side equals the right hand side in the empirical version of (4) as follows,

where , and is the -th element of , i.e., the index of the -th missing feature. Minimizing forces the two estimators of to match, the stabilized inverse propensity score weighting (SIPW) estimator Robins et al. (2007) using the missingness network (in ) and the mean estimator using the imputation network .

Remark 2. We hypothesize this mechanism can improve performance for two reasons. First, the missing data mechanism can be a simpler function that takes less samples to learn than the function that generates the feature , . Then the SIPW estimator based on will converge to the true mean faster than the estimator based on . Second, in , the mean estimator using is based on all the samples; is trained to produce predictions on the samples with missing feature for the sake of matching the SIPW estimator. By contrast, without the regularizer , is solely trained on the samples with observed feature , and its performance may fail to generalize to data with missing feature .

3.5 Bootstrap Imputation

Discovering a causal graph requires complete data. However, this is not the case for missing data problems. Because of this, we require that MIRACLE be seeded by another imputation method. Imputed values are iteratively refined by MIRACLE, hence “bootstrapping”, to potentially converge to a new imputation that minimizes MIRACLE’s objective (including causal and moment regularizers). MIRACLE’s objective for optimization is,


where and

are hyperparameters that define the strength of regularization. We iteratively update the baseline matrix

with a new imputed matrix

given by MIRACLE every ten epochs in training. With increasing epochs, stochastic optimization minimizes the loss for the imputed matrices that respect the causal and moment regularization. In theory, this is analogous to supervised training a denoising autoencoder (DAE)

Vincent et al. (2010); Bengio et al. (2013); Pretorius et al. (2018), but differs only by the fact that “noise” comes from prior or previous imputations. In training DAE, the input samples are corrupted by independent noise with each epoch, yet convergence is still guaranteed Arora et al. (2014). In our experiments, we demonstrate that bootstrap imputation indeed converges on multiple datasets and baseline methods. In Appendix E, we provide experiments on the computational complexity of MIRACLE.

(a) MAR
(b) MNAR
Figure 5: Experiments on MAR (left) and MNAR (right) synthetic data in terms of RMSE over varying dataset sizes (top), missingness rates (middle), and feature sizes (bottom)

. Note that we show the average error over a variety of DAG instantiations and target variables, thus the magnitude and standard deviation of errors vary significantly between runs. See Appendix 

B for MCAR results.

4 Experiments

In this section, we validate the performance of MIRACLE using both synthetic and a variety of real-world datasets.

  1. In the first set of experiments, we quantitatively analyze MIRACLE on synthetic data generated from a known causal structure.

  2. In the second set of experiments, we quantitatively evaluate the imputation performance of MIRACLE using various publicly available UCI datasets Dua and Graff (2020).

General set-up. We conduct each experiment five times under random instantiations of missingness. We report the RMSE along with standard deviation across each of the five experiments. Unless otherwise specified, missingness is applied at a rate of 30% per feature. For MCAR, this is applied uniformly across all features. For MAR, we randomly select 30% of the features to have missingness caused by another disjoint and randomly chosen set of features. Similarly, we randomly select 30% of features to be MNAR. We induce MAR and MNAR missingness using the methods outlined in Yoon et al. (2018), and we provide more details in Appendix A.2.

We use an 80-20 train-test split. We performed a hyperparameter sweep (log-based) for and with ranges between 1e-3 and 100. By default we have and set to 0.1 and 1, respectively.

Evaluating imputation. For each subsection below, we present three model evaluations in terms of missingness imputation performance, label prediction performance of a prediction algorithm trained on imputed data and the congeniality of imputation models.

  • Missingness imputation performance is evaluated with the root mean squared error comparing the imputed missing values with their actual unobserved values.

  • Label prediction performance of an imputation model is its ability to improve the post-imputation prediction. By post-imputation, we refer to using the imputed data to perform a downstream prediction task. To be fair to all benchmark methods, we use the same model (support vector regression) in all cases.

  • The congeniality of an imputation model is its ability to impute values that respect the feature-label relationship post imputation. Specifically, we compare, support vector parameters, , learned from the complete dataset with the parameters , learned from the imputed dataset. We report root mean square error for each method. Lower values imply better congeniality Yoon et al. (2018).

Baseline imputation methods.

We apply MIRACLE imputation over a variety of six commonly used imputation baseline methods: (1) mean imputation using the feature-wise mean, (2) a deep generative adversarial network for imputation using GAIN Yoon et al. (2018) (3)

-nearest neighbors (KNN)

Troyanskaya et al. (2001) using the Euclidean distance as a distance metric of each missing sample to observed samples, (4) a tree-based algorithm using MissForest (MF) Stekhoven and Bühlmann (2012), (5) a deep neural distribution matching method based on optimal transport (OT) Muzellec et al. (2020), and (6) Multivariate Imputation by Chained Equations (MICE) Buuren and Groothuis-Oudshoorn (2011). For each of the baseline imputation methods with tunable hyperparameters, we used the published values. We implement MIRACLE using the tensorflow222Source code at library. Complete training details and hyperparameters are provided in Appendix D.

Additional experiments. In Appendix B and  C, we also evaluate MIRACLE in terms of providing imputations that improve predictive performance Yoon et al. (2018); You et al. (2020), as well as, in its ability to impute values that respect the feature-label relationship, i.e., congeniality Meng (1994); Deng et al. (2016). We also investigate the impact of the graphical missingness location on imputation performance on synthetic data in Appendix G.

4.1 Synthetic data

In this subsection, we evaluate MIRACLE on synthetic data. In doing so, we can control aspects of the underlying data generating process and possess oracle knowledge of the DAG structure.

Data generating process.

We generate random Erdos-Renyi graphs with functional relationships from parent to children nodes. At each node, we add Gaussian noise with mean 0 and variance 1. For a complete description of the underlying data generating process, see Appendix 


Synthetic results. In Figure 5, we show experiments of MIRACLE on synthetic MAR data in terms of RMSE. Our experiments show that MIRACLE is able to significantly improve imputation over each of the baseline imputation methods. Figure 5 shows MIRACLE improves performance over each baseline method across various dataset sizes, missingness ratios, and feature sizes (DAG sizes). We show results for MCAR in Appendix B.

4.2 Experiments on real data

Figure 6: MIRACLE on real data. MIRACLE improves all baselines across MCAR (top), MAR (middle), and MNAR (bottom). Results for predictive error and congeniality can be found in the Appendix C.

In Figure 6 we show experiments of MIRACLE on real data. We perform experiments on several UCI datasets used in Yoon et al. (2018); You et al. (2020); Muzellec et al. (2020): Autism, Life expectancy, Energy, Abalone, Protein Structure, Communities and Crime, Yeast, Mammographic Masses, Wine Quality, and Facebook Metrics. In Figure 6, the improvements of MIRACLE are minimal for MCAR (except for mean imputation). This agrees with our discussion in Section 2.1, because the baseline imputations are not biased in the MCAR setting where holds in the -graph. Conversely for the MAR and MNAR settings, as expected, we observe MIRACLE has an significant improvement on some of the datasets, such as Abalone, Autism, Energy and Protein Structure. As discussed in Section 2, MIRACLE can improve the baseline imputation under Assumptions 1-4, which may not hold in these real-world datasets. Nevertheless, we observe that MIRACLE never degrades performance relative to its baseline imputation on any dataset. Furthermore, no baseline imputer is optimal across the datasets. In almost all cases, applying MIRACLE to any baseline results in the lowest error. We show a similar gain for this experiment with MCAR and MNAR in Appendix C. An ablation study on our overall loss is provided in Appendix F.

4.3 Causal discovery and imputation performance

In our experiments, we observe a positive correlation between the quality of learned DAGs (and causal parents) with imputation performance. Consider the left-most plot in Figure 7

using OT as a baseline imputer under MAR on our real data sets. Here, we do not have oracle knowledge but assume that the sparseness of the learned DAG implies a coherent DAG. We observe that MIRACLE has the most performance gain when fewer causal parents are identified for the missing variable in the learned DAGs. When MIRACLE is less able to isolate causal parents for prediction, the learned DAGs contains many spurious edges, and MIRACLE only has marginal improvements over the baseline imputer. We note that the gain of MIRACLE is not reproducible via feature selection methods, which are still prone to the spurious correlations in the observed data, as discussed in Section


4.4 MIRACLE Convergence

In this subsection, we investigate two dimensions of MIRACLE refinement: (1) baseline imputation quality and (2) sample or instance-wise refinement. Regarding baseline imputation quality, we are interested in understanding the impact of MIRACLE refinement on various baseline imputers that may have disparate performances. In the middle plot of Figure 7, we show MIRACLE applied to various baseline imputers on the Abalone dataset. Similar plots for other datasets can be found in Appendix C. We observe that even though mean imputation starts off with the worst error, after refinement by MIRACLE, we see that all methods converge to similar RMSEs. For the second experiment, we investigate the sample-wise improvement of MIRACLE on the abalone dataset using MissForest as a baseline imputer. On the right-most plot of Figure 7, we observe that a vast majority of the samples are improved by MIRACLE. Note that every point below the diagonal is considered an improvement on an instance over the baseline imputation method. We can see MIRACLE improves the imputation almost universally except for the instances with small errors in the baseline imputation; on these instances, MIRACLE does not inflate their errors by a large margin. Furthermore, we observe that MIRACLE iteratively improves imputation as training progresses (over each epoch) by the observation that the slope of each line decreases with each epoch.

Figure 7: (Left) Analysis of MIRACLE w.r.t. causal parents on real data. MIRACLE has the most gain when we have identified a sparse set of causal parents. When many features are identified as causal parents, imputation performance degrades. (Mid) Convergence of MIRACLE across various baseline imputers. On the abalone dataset, we show that MIRACLE converges to consistent RMSE regardless of baseline imputation. (Right) Sample-wise RMSE for MIRACLE across various epochs. MIRACLE is applied to refine MissForest imputations, demonstrating that error is reduced in a sample-wise basis. Note: anything below the diagonal, is an improvement over the baseline imputations.

5 Discussion

In conclusion, motivated by the minimax optimization problem (1) arising from interventions on missingness indicators in the -graph that encode the conditional independencies in the data distribution, we proposed MIRACLE, an iterative framework to refine any baseline missing data imputation to use the causal parents embodied in the estimated -graphs. MIRACLE learns the causal -graph as an adjacency matrix embedded in its input layers. We proposed a two-part regularizer based on the causal graph and a moment regularizer based on the missing variable indicators. We demonstrated that MIRACLE significantly improved the imputations of six baseline imputation methods over a variety of synthetic and real datasets. MIRACLE never hurts performance in the worst-case, and we envision MIRACLE becoming a de facto standard in refining missing data imputation.

There are several limitations we would like to identify as paths for future work. First, any violation of the assumptions in Section 2

may adversely impact the performance of MIRACLE in practice. Second, causal discovery under missing data is an ongoing research area, and therefore MIRACLE may be discovering DAGs with bias introduced from the baseline methods. However, in experiments, MIRACLE still performs well even if it starts with mean imputation. We expect MIRACLE to improve as causal discovery methods under missingness improve. Third, in its current form, MIRACLE is not extensible to scenarios where causality may not be applicable, such as computer vision. Fourth, because of the causal discovery regularizer and network architecture, MIRACLE may have difficulty scaling to very high dimensional data. Lastly, a more general and detailed discussion is needed between our work and the merits of causality and robustness.


This work was supported by GlaxoSmithKline (GSK), the National Science Foundation (NSF) under grant number 1722516, the Office of Naval Research (ONR), and The Alan Turning Institute (ATI). We thank all reviewers for their invaluable comments and suggestions.


  • [1] A. Allen and W. Li (2016) Generative adversarial denoising autoencoder for face completion,. Note: Cited by: §1.2.
  • [2] S. Arora, A. Bhaskara, R. Ge, and T. Ma (2014) Provable bounds for learning some deep representations. In International conference on machine learning, pp. 584–592. Cited by: §3.5.
  • [3] M. L. Bell, M. Fiero, N. J. Horton, and C. Hsu (2014) Handling missing data in rcts; a review of the top medical journals. BMC medical research methodology 14 (1), pp. 1–8. Cited by: §1.
  • [4] Y. Bengio, L. Yao, G. Alain, and P. Vincent (2013) Generalized denoising auto-encoders as generative models. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS’13, Red Hook, NY, USA, pp. 899–907. Cited by: §3.5.
  • [5] R. Bhattacharya, R. Nabi, I. Shpitser, and J. Robins (2019-06) Identification in missing data models represented by directed acyclic graphs. In

    Proc. of Conference on Uncertainty in Artificial Intelligence (UAI)

    pp. . Cited by: §1.2.
  • [6] S. Buuren and C. Groothuis-Oudshoorn (2011-12) MICE: multivariate imputation by chained equations in r. Journal of Statistical Software 45, pp. . External Links: Document Cited by: §1.2, §4.
  • [7] Y. Deng, C. Chang, M. Ido, and Q. Long (2016-02) Multiple imputation for general missing data patterns in the presence of high-dimensional data. Scientific Reports 6, pp. 21689. External Links: Document Cited by: §4.
  • [8] D. Dua and C. Graff (2020) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §1.1, item 2, item 3a.
  • [9] A. Gain and I. Shpitser (2018) Structure learning under missing data. Proceedings of machine learning research 72, pp. 121. Cited by: §1.2.
  • [10] L. Gondara and K. Wang (2017-05) Multiple imputation using deep denoising autoencoders. pp. . Cited by: §1.2.
  • [11] Y. Gong, H. Hajimirsadeghi, J. He, M. Nawhal, T. Durand, and G. Mori (2020-08 Dec) Variational selective autoencoder. In

    Proceedings of The 2nd Symposium on Advances in Approximate Bayesian Inference

    , C. Zhang, F. Ruiz, T. Bui, A. B. Dieng, and D. Liang (Eds.),
    Proceedings of Machine Learning Research, Vol. 118, pp. 1–17. External Links: Link Cited by: §1.2.
  • [12] J. C. Jakobsen, C. Gluud, J. Wetterslev, and P. Winkel (2017) When and how should multiple imputation be used for handling missing data in randomised clinical trials–a practical guide with flowcharts. BMC medical research methodology 17 (1), pp. 1–10. Cited by: §1.
  • [13] P. Mattei and J. Frellsen (2019-09–15 Jun) MIWAE: deep generative modelling and imputation of incomplete data sets. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 4413–4423. External Links: Link Cited by: §1.2.
  • [14] R. Mazumder, T. Hastie, and R. Tibshirani (2010) Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research 11 (80), pp. 2287–2322. Cited by: §1.2.
  • [15] N. Meinshausen (2018) Causality from a distributional robustness point of view. In

    2018 IEEE Data Science Workshop (DSW)

    pp. 6–10. Cited by: §1.
  • [16] X. Meng (1994) Multiple-imputation inferences with uncongenial sources of input. Statistical Science 9 (4), pp. 538–558. External Links: ISSN 08834237 Cited by: §4.
  • [17] K. Mohan, J. Pearl, and J. Tian (2013) Graphical models for inference with missing data. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26, pp. 1277–1285. Cited by: §1.1, §1.2, §1, §2, §2.
  • [18] K. Mohan and J. Pearl (2014) On the testability of models with missing data. In Artificial Intelligence and Statistics, pp. 643–650. Cited by: §1.2, §2.
  • [19] B. Muzellec, J. Josse, C. Boyer, and M. Cuturi (2020-13–18 Jul) Missing data imputation using optimal transport. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, Virtual, pp. 7130–7140. Cited by: §1.2, §4, §4.2.
  • [20] R. Nabi, R. Bhattacharya, and I. Shpitser (2020) Full law identification in graphical models of missing data: completeness results. In Proc. of International Conference on Machine Learning (ICML), pp. . Cited by: §1.2.
  • [21] J. Pearl (2009) Causality. Causality: Models, Reasoning, and Inference, Cambridge Univ. Press. External Links: ISBN 9780521895606, LCCN 99042108 Cited by: §1, §2.
  • [22] A. Pretorius, S. Kroon, and H. Kamper (2018-10–15 Jul) Learning dynamics of linear denoising autoencoders. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 4141–4150. External Links: Link Cited by: §3.5.
  • [23] J. Robins, M. Sued, Q. Lei-Gomez, and A. Rotnitzky (2007) Comment: performance of double-robust estimators when” inverse probability” weights are highly variable. Statistical Science 22 (4), pp. 544–559. Cited by: §3.4.
  • [24] S. Seaman, J. Galati, D. Jackson, and J. Carlin (2013) What is meant by” missing at random”?. Statistical Science, pp. 257–268. Cited by: §1.2.
  • [25] I. Shpitser, K. Mohan, and J. Pearl (2015) Missing data as a causal and probabilistic problem. In Proc. of Conference on Uncertainty in Artificial Intelligence (UAI), Cited by: §1.2.
  • [26] D. J. Stekhoven and P. Bühlmann (2012) MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics 28 1, pp. 112–8. Cited by: §1.2, §4.
  • [27] J. A. Sterne, I. R. White, J. B. Carlin, M. Spratt, P. Royston, M. G. Kenward, A. M. Wood, and J. R. Carpenter (2009) Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. Bmj 338. Cited by: §1.
  • [28] O. Troyanskaya, M. Cantor, G. Sherlock, T. Hastie, R. Tibshirani, D. Botstein, and R. Altman (2001-07) Missing value estimation methods for dna microarrays. Bioinformatics 17, pp. 520–525. External Links: Document Cited by: §4.
  • [29] R. Tu, C. Zhang, P. Ackermann, K. Mohan, H. Kjellström, and K. Zhang (2019) Causal discovery in the presence of missing data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1762–1770. Cited by: §1.2.
  • [30] S. van Buuren (2018) Flexible imputation of missing data. Chapman & Hall/CRC Interdisciplinary Statistics, CRC Press LLC. External Links: ISBN 9781138588318, LCCN 2018017122 Cited by: §1.2.
  • [31] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, pp. 3371–3408. Cited by: §3.5.
  • [32] J. Yoon, J. Jordon, and M. van der Schaar (2018) GAIN: missing data imputation using generative adversarial nets. In ICML, Cited by: 2nd item, §A.2, §1.2, 3rd item, §4, §4, §4.2, §4.
  • [33] S. Yoon and S. Sull (2020-06) GAMIN: generative adversarial multiple imputation network for highly missing data. In

    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Cited by: §1.2.
  • [34] J. You, X. Ma, D. Y. Ding, M. Kochenderfer, and J. Leskovec (2020) Handling missing data with graph representation learning. In Advances in Neural Information Processing Systems, Cited by: §1.2, §4, §4.2.
  • [35] X. Zheng, B. Aragam, P. Ravikumar, and E. P. Xing (2018) DAGs with NO TEARS: Continuous Optimization for Structure Learning. In Advances in Neural Information Processing Systems, Cited by: §3.3.
  • [36] X. Zheng, C. Dan, B. Aragam, P. Ravikumar, and E. Xing (2020-26–28 Aug) Learning sparse nonparametric dags. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 3414–3425. Cited by: §3.3.


  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work? See the Discussion in Section 5.

    3. Did you discuss any potential negative societal impacts of your work? We do not foresee any negative societal impacts of our work.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? The code is available at The data is all publicly available at [8]. The instructions to reproduce our work can be found in the experimental Section 4 and Appendix A.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Section 4.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? See Section 4.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Appendix D.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets?

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?


This appendix is outlined as follows:

  • Section A details our synthetic data generating process and how we generated missingness.

  • Section B contains additional experiments on synthetic data testing for imputation performance, testing for prediction performance of a downstream machine learning algorithm on imputed data, and testing for congeniality. In each case we consider MCAR, MAR and MNAR missingness patterns. We used the same experimental setup in [32] for testing prediction performance of a downstream machine learning algorithm and for congeniality.

  • Section C contains additional experiments on real datasets testing for prediction performance of a downstream machine learning algorithm on imputed data, and testing for congeniality over all types of missingness.

  • Section D provides an overview of our model and training details.

  • Section E provides a computational complexity analysis of MIRACLE.

  • Section F

    provides an ablation study for the MIRACLE loss function.

  • Section G provides experiments regarding missingness location.

Appendix A Supplementary experimental details

a.1 Synthetic data generation

In each synthetic experiment, we generated a -dimensional random graph from a Erdös–Rényi random graph model with edges on average. Given , we assigned uniformly random edge weights to obtain a weighted adjacency matrix . Given , we sampled repeatedly from a Gaussian noise model for (each dimension sampled independently) to generate independent observations from this system.

a.2 Generating missingness.

The following explains how we constructed synthetic datasets that satisfy MCAR, MAR and MNAR patterns of missingness. We apply a modification to the missingness generation from [32].

  • MCAR. Missing completely at random was introduced by randomly removing of the observations in each feature.

  • MAR. We sequentially define the probability that the -th component of the -th sample is observed conditional on the missingness and values (if observed) of the previous components to be,


    where corresponds to the average missing rate of the -th feature, and , are sampled from (but are only sampled once for the entire dataset).

  • MNAR. Missing not at random was introduced by defining the probability of the -th component of the -th sample to be observed by,


    with the same notation as above. Here, the missingness of a data point is directly dependent on its value (with dependence determined by the weight , sampled from ).

Appendix B Additional synthetic results

In this section, we provide supplementary results for synthetic experiments. Note that the error bars are large for some of the plots with predictive error and congeniality. This is because the y-axis of these plots are min-max normalized between 0 and 1, so the high variance (large error bars) shows that the improvement by MIRACLE may be minimal for the mentioned datasets. Additionally, this could be caused by the fact that the missing features aren’t predictive of a target variable, i.e., better imputation does not necessarily lead to any performance gain for the predicting the target variable.

b.1 MCAR Results

Using our synthetic experimental setup used in the main paper, we show the performance of MIRACLE in terms of RMSE, predictive error, and congeniality in Figure 8 for each of our baseline methods with MCAR.

[Performance in terms of RMSE.]   [Performance in terms of prediction RMSE.]   [MCAR congeniality (in terms of RMSE).]

Figure 8: Experiments on MCAR synthetic data as a function of dataset sizes (top), missingness rates (middle), and feature sizes (bottom) of each subfigure: (a) RMSE, (b) machine learning predictive error of a random variable, and (c) congeniality.

b.2 MAR Results

Using our synthetic experimental setup used in the main paper, we show the performance of MIRACLE in terms of RMSE, predictive error, and congeniality in Figure 9 for each of our baseline methods with MAR.

[Performance in terms of RMSE.]   [Performance in terms of prediction RMSE.]   [Congeniality (in terms of RMSE).]

Figure 9: Experiments on MAR synthetic data as a function of dataset sizes (top), missingness rates (middle), and feature sizes (bottom) of each subfigure: (a) RMSE, (b) machine learning predictive error of a random variable, and (c) congeniality.

b.3 MNAR Results

Using our synthetic experimental setup used in the main paper, we show the performance of MIRACLE in terms of RMSE, predictive error, and congeniality in Figure 10 for each of our baseline methods with MNAR.

[Performance in terms of RMSE.]   [Performance in terms of prediction RMSE.]   [Congeniality (in terms of RMSE).]

Figure 10: Experiments on MNAR synthetic data as a function of dataset sizes (top), missingness rates (middle), and feature sizes (bottom) of each subfigure: (a) RMSE, (b) machine learning predictive error of a random variable, and (c) congeniality.

Appendix C Additional real datasets

Prediction error and congeniality.

We include additional plots for the real data experiments for prediction error and congeniality in Figures 12 and 13, respectively.

Additional convergence plots.

We include additional convergence plots on real datasets in Figure 11. We use the same experimental setup used in Figure 7 in Section 4. We observe that MIRACLE is able to converge regardless of baseline imputation used.

(a) Autism
(b) Energy
(c) Protein
(d) Life expectancy
(e) Community
(f) Yeast
(g) Mammo. masses
(h) Wine quality
(i) Facebook
Figure 11: Convergence plots for real datasets.
Figure 12: MIRACLE on real datasets in terms of predictive error. MIRACLE improves over all baselines across all types of missingness: MCAR (top), MAR (middle), and MNAR (bottom).
Figure 13: MIRACLE on real datasets in terms of congeniality. MIRACLE improves over all baselines across all types of missingness: MCAR (top), MAR (middle), and MNAR (bottom).

Appendix D Model and training details

We used the following network architecture for MIRACLE. Our proposed architecture consists of sub-networks with shared hidden layers, as shown in Figure 4. Each network is constructed with two hidden layers of neurons with ELU activation. Each benchmark method is initialized and seeded identically with the same random weights. For dataset preprocessing, all continuous variables are standardized with a mean of 0 and a variance of 1. We train each model using the Adam optimizer with a learning rate of 0.0005 for up to a maximum of 300 epochs.

Computational hardware.

All models were trained on an Ubuntu 18.04 OS with 64GB of RAM (Intel Core i7-6850K CPU @ 3.60GHz) and 2 NVidia 1080 Ti GPUs.

Figure 14: MIRACLE scalability analysis

Appendix E Computational Complexity

  Input: An incomplete dataset with missing values, a missing indicator matrix (with 1 indicating observed), an imputed matrix by some baseline method,
  Output: Imputed dataset with no missing values.
  Initialization: Imputation network , with maximum size for saving imputed matrices over epochs
     Train for one epoch by optimizing the objective function with as input.
     if  is full then
        Remove the first element from
     end if
      average all the elements of .
  until MIRACLE converges (i.e., change of is small)
Algorithm 1 Train MIRACLE

Pseudocode for MIRACLE is provided in Algorithm 1. We perform an analysis of the MIRACLE scalability. Using our synthetic data generation, we created datasets of 1000 samples. Using our the synthetic experimental setup presented in the main paper, we present the computational timing results for MIRACLE as we increase the number of input features on inference and training time in Figure 14. Computational time scales linearly with increasing the number of input samples. As expected, we observe that the time to train 1000 samples grows exponentially with the feature size; however, the inference time remains linear. Inference time on 1000 samples with 400 features takes approximately 1.1 seconds, while training time takes nearly 85 seconds. Experiments were conducted on an Ubuntu 18.04 OS using 6 Intel i7-6850K CPUs.

Appendix F Ablation study

We provide an ablation study on our MIRACLE loss function in Eq. 5 to understand the sources of gain of MIRACLE. Here we execute this experiment on our real datasets using the same experimental details highlighted in the main manuscript. We show the results of our ablation on MIRACLE using MissForest as baseline imputation with MAR missingness to highlight our sources of gain in Table 1. Here, we observe that MIRACLE (rightmost column) has the most gain over all datasets. Additionally, we observe that has the most gain when MIRACLE has the most performance improvement over the baseline (see Fig. 7 in the manuscript).

abalone 0.321 0.108 0.521 0.199 0.312 0.082 0.222 0.062
autism 0.093 0.005 0.094 0.004 0.091 0.004 0.073 0.004
energy 0.106 0.011 0.147 0.077 0.132 0.050 0.065 0.061
protein 0.134 0.016 0.129 0.008 0.119 0.010 0.080 0.008
life expectancy 0.239 0.007 0.223 0.019 0.216 0.014 0.208 0.015
community 0.490 0.015 0.516 0.020 0.479 0.023 0.463 0.010
yeast 0.984 0.013 0.984 0.006 0.988 0.004 0.950 0.014
mammo masses 1.105 0.010 1.150 0.009 1.103 0.013 1.040 0.013
wine quality 0.797 0.004 0.745 0.013 0.724 0.008 0.722 0.003
facebook 1.056 0.005 1.032 0.044 1.034 0.056 0.983 0.002

Table 1: Ablation study of MIRACLE on real datasets to highlight sources of gain.

Appendix G Understanding missingness location

An important consideration is how well does predicting with the causal parents work when down-selecting features. Consider missingness in in the DAG in Figure 15. The first column with the causal parents Pa mean that only the parents of features were used for imputation. represents a variable that is not causally linked to anything.

Using our synthetic data generating process, we synthesized a dataset according to Figure 15. The goal here was to impute the missing values in , using each variable in Figure 15 to induce the missingness. Each of the missingness causes is categorized as MAR, except for , which is MNAR (since missingness caused by itself), and for , which is MCAR, since it is an external noise variable. The results provide several interesting findings.

  1. [leftmargin=*,itemsep=0pt]

  2. Using MissForest as a baseline imputer, the results in Table 2 show that MIRACLE performs as well as Pa, and has better performance than the baseline imputer. Moreover, the two right-most columns of Table 2 give the average estimated functional dependence of (our target for prediction) and its parents and non-parents. We see that MIRACLE recovers true parents consistently.

  3. We see that using causal parents (Pa and MIRACLE) for missingness caused by itself, , and a noise variable, , leads to the least amount of improvement.

  4. We see that MIRACLE has the most gain when the missingness is caused by a causal parent ( or ).

  5. Interestingly, for this example, we observe comparable performance when using the Markov blanket features versus all features in our baseline algorithm (MissForest). This suggests that the Markov blanket features are likely used for imputation by the baseline method.

In Appendix B, we show that the baseline performs similarly to the Markov blanket features colored in blue in Figure 15.

Figure 15: A sample DAG. is the incomplete variable in red. The Markov Blanket MB is shown in blue, and the causal parents Pa are shown with dashed borders. represents a variable that is not causally linked to anything.
imputed error (RMSE) edge weights (no threshold)

Pa MB Baseline MIRACLE Pa non-Pa
0.11 .06 0.15 .03 0.27 .05 0.12 .07 0.44 0.14 0.02 0.01
0.98 .08 1.34 .05 1.31 .06 0.49 .06 0.64 0.09 0.01 0.01
1.20 .04 1.49 .04 1.45 .09 0.50 .06 0.62 0.13 0.02 0.01
0.69 .05 1.20 .07 1.23 .05 1.04 .05 0.29 0.11 0.13 0.05
1.51 .03 1.75 .08 1.76 .06 1.59 .07 0.37 0.18 0.03 0.02
0.13 .08 0.17 .04 0.18 .07 0.14 .05 0.34 0.15 0.05 0.02
1.04 .05 1.47 .04 1.47 .06 1.01 .06 0.39 0.05 0.04 0.01
0.21 .04 0.28 .05 0.23 .03 0.20 .03 0.46 0.10 0.02 0.01
0.15 .03 0.18 .04 0.17 .07 0.14 .05 0.31 0.15 0.02 0.01

Table 2: Understanding the location of missingness. We predict when its missingness is caused by each variable in the DAG. and represent MNAR and MCAR, respectively. All other causes are MAR. The two right-most columns show the learned edge weights into for the parental and non-parental variables.

Appendix H Convergence

Using the same experimental setup, in Appendix G, we examine how the estimated adjacency matrix converges to the truth as sample size increases. Table 3, shows that as sample size increases we see that the quality of the learned m-graph improves.

Dataset size

0.135 0.04
500 0.242 0.03
1000 0.252 0.03
5000 0.910 0.01
10000 0.916 0.01
Table 3: Convergence of DAG weights as dataset size increases. is the element-wise product and is the predicted adjacency matrix weights