MCFlow: Monte Carlo Flow Models for Data Imputation

03/27/2020 ∙ by Trevor W. Richardson, et al. ∙ University of Rochester 0

We consider the topic of data imputation, a foundational task in machine learning that addresses issues with missing data. To that end, we propose MCFlow, a deep framework for imputation that leverages normalizing flow generative models and Monte Carlo sampling. We address the causality dilemma that arises when training models with incomplete data by introducing an iterative learning scheme which alternately updates the density estimate and the values of the missing entries in the training data. We provide extensive empirical validation of the effectiveness of the proposed method on standard multivariate and image datasets, and benchmark its performance against state-of-the-art alternatives. We demonstrate that MCFlow is superior to competing methods in terms of the quality of the imputed data, as well as with regards to its ability to preserve the semantic structure of the data.



There are no comments yet.


page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Missing data is a widespread problem in real-life machine learning problems. Because most of the existing data analysis frameworks require complete datasets, imputation methods are indispensable to practitioners in the field. As a consequence, data imputation has been the focus of extensive research in recent decades [30, 45]. Moreover, much of the recent research in data imputation has leveraged advanced machine learning techniques. This has resulted in the development of numerous shallow [44, 46, 42]

and deep learning frameworks 

[40, 49, 29, 33] which have continuously pushed the state-of-the-art envelope. A pervasive shortcoming of traditional learning-based imputation methods is that their training relies on fully observed data [15, 40]. A more reasonable assumption, however, is that the training data itself may have missing entries, a limitation addressed by recently proposed methods [49, 29, 33]. In this paper, we propose a data imputation framework that leverages deep generative models and incorporates mechanisms that provide more accurate estimates of the missing data as compared to existing frameworks. In contrast to competing methods that are based on Generative Adversarial Networks (GANs) [49, 29] and on Deep Latent Variable Models (DLVM) [33], our framework leverages normalizing flow models [10, 11, 24]

. In particular, we take advantage of the exact log-likelihood evaluation, latent variable inference, and data sample reconstruction afforded by normalizing flow models in order to explicitly learn complex, high-dimensional data distributions.

We address the causality dilemma that arises when attempting to construct probabilistic models of incomplete data by adopting an alternating learning strategy inspired by model-based multiple imputation approaches [31]

such as those based on Expectation Maximization 


and Monte Carlo Markov Chain 

[43] techniques. Recent examples of related alternating techniques include applications to training generator networks [19, 14]

. The overarching idea behind the proposed framework is to alternately sample the data and update the density estimate until a good approximation of the true empirical distribution is attained. This iterative process is enabled by the exact sampling and density evaluation aspects afforded by the normalizing flow model framework referenced above. We note that although accurate density estimation is central to our framework, and autoregressive models achieve state-of-the-art performance in that field 

[47, 5, 38]

, their slow sampling time would be a significance hindrance to our algorithm. In order to address the problem of missing data, we introduce a new imputation algorithm called MCFlow. The proposed iterative learning scheme calls for alternately optimizing two different objective functions: (i) the traditional log-likelihood loss involved in training flow models, required to update the density estimate based on the complete data including imputed values; and, (ii) a maximum-likelihood criterion involved in sampling the latent flow space in order to find the optimal values of the missing data given the current estimate of the data distribution. While the former can be achieved by implementing known backpropagation techniques, we introduce a novel non-iterative approach to address the latter, which relies on a neural network trained to identify points in the latent space that maximize a density value while minimizing the reconstruction error as computed on the observed entries of the data. This approach can be viewed as an instance of an algorithm that learns to optimize 

[28, 1] and, although not as effective as iterative [22] or sampling-based methods [43, 20], it is significantly more computationally efficient.

The primary contributions of this work can be summarized as follows:

  • a framework for data imputation based on deep generative normalizing flow models;

  • an alternating learning algorithm that enables accurate density estimation of incomplete, complex and high-dimensional data by leveraging the efficient sampling and density evaluation attributes of flow models;

  • a neural network that learns to optimize in the embedding flow space; and,

  • extensive empirical validation of the proposed framework on standard multivariate and image datasets, including benchmarking against state-of-the-art imputation methods.

2 Related Work

An imputation method is classified as a single or multiple imputation method depending upon whether it estimates one or more than one value for each missing entry, respectively 

[30]. Even though a multitude of single imputation methods have been proposed in the literature [44, 42, 50], multiple imputation methods are often preferred as they provide an assessment of uncertainty [30, 45, 35]

. Multiple imputation frameworks often rely on building statistical models for the data, and then drawing samples from them in order to perform the imputation. Early attempts at building multiple imputation frameworks relied on simple parametric models such as Bayesian models 

[3, 41] as well as mixtures of Gaussians [30, 9]. More recently, with the advent of sophisticated deep generative models, the focus has shifted to investigating more effective ways to leverage the expressivity of the models to address data imputation scenarios.

Initial efforts exploiting deep models for imputation relied on the availability of fully observable training data [40]. More recent work has overcome this shortcoming. Two such publications rely on modifications to the Generative Adversarial Network (GAN) architecture [18]. The Generative Adversarial Imputation Network (GAIN) framework [49] employs an adversarially trained imputer optimized at discriminating between fake and real imputations. While the GAN for Missing Data (MisGAN) approach [29] also implements an adversarially trained imputer, it additionally includes an explicit model for the missing data. Unfortunately, being close GAN relatives, these models can be difficult to train [39, 2]

. Furthermore, we believe there is an advantage to explicitly learning a probability density model of the data in imputation tasks, a feature that GAN-based frameworks generally don’t afford.

Models that enable approximate density learning rely on estimating the variational bound [23, 40, 4]

. While building such models requires observations of complete data, the Missing Data Importance-Weighted Autoencoder (MIWAE) framework 

[33] extends the variational lower bound principle to scenarios where only partial observations are available. Unfortunately, such a family of methods is still intrinsically limited to learning an approximation of the density of the data and can therefore be challenging to optimize [25].

In general, models that enable tractable density learning can be classified into frameworks based on fully visible belief networks (FVBNs) and those leveraging nonlinear independent components analysis (ICA). FVBNs rely on the general product rule of probability which enables calculation of the joint distribution of a set of random variables using the product of conditional probabilities 

[13, 37]. However, given the sequential nature of the operations involved, sampling the density estimate is not efficient. Nonlinear ICA methods define a set of continuous and invertible nonlinear mappings between two spaces [7, 6, 10, 11, 24]. Their main limitation stems from the fact that the mappings need to be invertible, which may limit their expressivity.

3 Framework

Throughout the paper, we consider the scenario in which data is Missing Completely at Random (MCAR) [31]. Formally speaking, assume that fully observable data points are distributed according to , the set of binary masks indicates the location of the missing data entries, and the mask entries are distributed according to . Data is MCAR when the missingness is independent of the data, that is, when . Suppose that we observe data point with corresponding mask , which means that the -th entry in , denoted , is observed if and missing if . Often, it will be convenient to work with the complement of mask , which we denote . We use the notation and , respectively, to denote the set of observed and unobserved entries in given mask . Note that it is always possible to determine which entries of are missing and which are present, which means that can be uniquely determined from . Assuming is a partial observation of unknown data point with entries observed according to mask , we formulate the data imputation task of recovering as a maximum likelihood problem, namely:


Because is a monotonic function, Eq. 1 is equivalent to


Note that, in general, is unknown, but even if that were not the case, solving the optimization task from Eqs. 1 and 2 would be challenging given that the distribution of any data of interest is likely to be highly non-convex and high-dimensional. One of the goals of this work is to make this optimization task feasible. To that end, and for the sake of argument, assume a tractable, explicit density model for exists. Let this model be in the form of a generative network mapping a sample of interest lying in space to an embedding representation in space . Further, assume that network effects a continuous, differentiable and invertible mapping such that and


This model is tractable if is tractable and if the determinant of the Jacobian of is tractable [17]. Because is invertible, exact sample generation from density is possible by drawing a sample and computing . Additionally, computing the density on

involves computing the density of its embedding vectors, and then scaling the result by the Jacobian determinant as in Eq. 

3 [11]. Transformations that take the form of Eq. 3 fall under the category of nonlinear independent component analysis. A variety of methods of this type have been proposed in the literature [10, 11, 24]. In this paper, we implement network in the form of a normalizing flow model, and modify existing architectures to support learning from data with missing entries. It is often computationally convenient to work with the log-likelihood function. Taking the of both sides of Eq. 3 results in


Given a complete set of training data (that is, data without missing entries), and assuming is parameterized by a set of parameters , learning network corresponds to finding an optimal set of parameters such that


If network were available, then the task of data imputation as formulated in Eq. 2 would become feasible. For instance, standard gradient-based optimization techniques could be applied. Because missing data affects the training set in the scenarios under consideration, however, direct estimation of the distribution of the data according to Eqs. 3 or 4 is not possible.

This scenario constitutes a causality dilemma. Drawing inspiration from alternating algorithms such as Expectation Maximization (EM) [8], MCFlow iteratively fills in the missing data according to Eq. 2 (i.e., based on the current estimate of the density), and updates the generative network parameters according to Eq. 5 (i.e., based on the generated samples). The intuition behind this approach is that finding the values of the missing entries in the data according to the maximum likelihood condition requires knowledge of the data distribution, and learning a model for the data distribution requires complete knowledge of the data. As will become evident later, filling in for missing values in this iterative process involves drawing samples from the current model for the data distribution. In this sense, the MCFlow framework is more closely related to Monte Carlo versions of the EM algorithm [48, 36] than to its vanilla version. In MCEM implementations, the E step consists in generating samples of the latent variables based on the current estimate of the predictive conditional distribution, and the M step estimates the parameters that maximize the observed posterior. This similarity becomes more apparent if we interpret the imputation tasks from Eqs. 1 and 2 as generating samples from the current approximation of the conditional predictive distribution of the missing data given the observed data and the current model parameters, (E step). Furthermore, the optimization task from Eq. 5 can be interpreted as updating the current approximation of the posterior of the model parameters given the observed and imputed values, (M step).

4 Model Architecture

The MCFlow architecture utilizes a hybrid framework comprised of a normalizing flow network (referred to herein as flow network or model) that is trained in an unsupervised manner, and a feedforward neural network that is trained in a supervised manner. The flow network provides an invertible mapping , where are the tunable parameters of the network, between data space and embedding space and vice versa. The feedforward network operates in the embedding space by mapping input embedding vectors to output embedding vectors via function , where are the tunable parameters of the network. In general, the role of the flow model is to learn the distribution of the data. The role of the feedforward network is to find the embedding vector with the largest possible density estimate (i.e., the likeliest embedding vector) that maps to a data vector whose entries match the observed values (i.e., values at locations indexed by the complement of the mask). A high-level overview of the model is illustrated in Fig. 1.

Figure 1: High-level view of the MCFlow architecture.

As with traditional flow implementations, training the generative portion of the framework involves finding the set of parameters that optimizes the mapping pursuant to Eq. 5. Because the optimization task from Eq. 5 does not natively support incomplete data, a specific imputation scheme is implemented at initialization. This scheme involves sampling the marginal observed density of the variable that contains the missing data in the case of multivariate, tabular data, and nearest-neighbor sampling in the case of image data. After preprocessing the data as described above, an initial density estimate exists, and the model is now capable of performing data imputation. The estimates for missing values in are updated during defined subsequent iterations of training with the output of the model, . This is illustrated by the dotted arrow in Fig. 1. The alternating nature of the training process is more clearly illustrated in Fig. 2

, which shows how the trained model from the previous epoch is used to update the missing value estimates for use in the current epoch.

Figure 2: Unrolled view of the architecture and backpropagation process.

The training stage of the generative portion of the framework can be formalized as follows: assume training samples, with corresponding masks are available. For every incomplete sample , a full training data sample is computed by combining the observed values with imputed values from imputed sample according to . Here, denotes the Hadamard, or element-wise product between two vectors. With a full training set constructed, learning the optimal set of parameters in is accomplished by minimizing the following cost function:


or equivalently, using Eq. 4


which is the batch version of Eq. 5. In other words, the optimal parameters of network are the ones that maximize the log-likelihood of the imputed data, with the imputation mechanism being iteratively updated. The black arrow in Fig. 2 illustrates the direction in which the gradients are backpropagated when training the generative model.

Once an initial density estimate is available, learning the optimal mapping function in the embedding space is performed by training feedforward network . The input to is the embedding of the imputed training set, namely , where . Feedforward network , which maps inputs to outputs , is trained by finding the set of parameters in that minimizes the following objective function:


where , , , and denotes the mean-squared error operator between vectors and . The first cost term in Eq. 8 encourages network to output an embedding whose reconstruction matches the training sample at the observed entries. The second cost term encourages network to output the vector with the highest density value according to the current density estimate. Both terms combined yield estimates in the form of the likeliest embedding vector that matches the observed values, effectively solving the maximum likelihood objective from Eqs. 1 and 2. The red arrows in Fig. 2 illustrate the computation of the different terms in Eq. 8. The solid segments of the arrows indicate the sections of the framework that contain weights affected by the computations, while the dotted segments indicate where the computations take place and how they are propagated through the framework without affecting any parameters. More specifically, the term is computed in the data space but it only affects weights in neural network as it is backpropagated. On the other hand, the log-likelihood term is computed in the embedding space of the flow model and is used to update the weights in .

We note that optimizing the cost function from Eq. 8 requires repeated log-likelihood evaluation (e.g., in the computation of ), latent variable inference (e.g., in the computation of from ), and data sample reconstruction (e.g., in the computation of from ), all of which can be effectively computed with flow network . While other generative models may outperform flow approaches at one task or another, we found flow models to constitute the best compromise for the needs of our framework. Pseudocode for the training procedure is provided in Algorithm 1. Note that we concurrently update the parameters of and in order to achieve some computational savings: updating first and then would require two separate forward passes through (one to update itself and one to compute the mappings involved in updating ).

Data: data points with corresponding masks indicating location of observed and missing entries
Naively impute missing data in to produce :
for  to  do
     Forward Pass:
     Compute loss according to Eq. 5
     Compute loss according to Eq. 8
     Update and by backpropagating loss
Algorithm 1 Training Procedure

Once the model is trained, MCFlow imputes missing values according to Algorithm 2. The procedure involves encoding the naively imputed data sample into , forward propagating through the feedforward neural network to obtain , and reconstructing data sample by decoding . The final imputed sample is obtained by filling in the missing entries with corresponding entries in

Data: Test data point with corresponding mask
Naively impute missing data in to produce :
Algorithm 2 Imputation Procedure

4.1 Implementation Details

The MCFlow architecture, comprised of various neural network layers, multiple optimizers and competing loss functions, can be implemented in the manner described in this section. A Pytorch implementation of the MCFlow algorithm is available online.

111 Preprocessing each dataset was done in two steps: 1) initializing , and 2) scaling the data for training. Each data point requires the construction of initial estimates for . This initial step can be performed easily using naive strategies such as zero imputation. We, however, employed two different initialization strategies, both of which estimate initial values for based on observable values in the dataset. For numerical multivariate datasets, each missing element in is replaced by sampling the marginal observed density of the variable containing the missing data. For the image datasets, each missing pixel was selected by randomly choosing one entry from the set of nearest observable neighbors of the missing pixel. After completing the selection of values for the missing data points, data was scaled to be in the interval via min-max normalization. For multivariate datasets, only observed values for each variable were used to determine the maximum and minimum values for that variable. For image datasets, the possible maximum of and minimum of were used for each pixel. These steps are required to construct the initial instantiation of .

Default Credit Card Online News Popularity Letter Recognition
MICE .1763 .0007 .2585 .0010 .1537 .0006
MissForest .1623 .0120 .1976 .0015 .1605 .0004
Matrix .2282 .0005 .2602 .0073 .1442 .0006
Auto-Encoder .1667 .0014 .2388 .0005 .1351 .0009
EM .1912 .0011 .2604 .0015 .1563 .0012
GAIN .1441 .0007 .1858 .0010 .1198 .0005
MCFlow .1233.0012 .1760.0032 .1033.0017
Table 1: Imputation Results on UCI Datasets - RMSE (lower is better, 0.2 missing rate)

For all datasets and experiments, the Adam optimizer was used. For multivariate datasets, we used a learning rate of and a batch size of . Normalizing flow network was built using affine coupling layers as introduced by the Real NVP framework [11]. Our implementation of uses six affine coupling layers and a random masking strategy in contrast to the deterministic masking strategy used in Real NVP. The operations involved in the forward pass of each affine coupling transformation are depicted in Eq. 9:


In Eq. 9, () represents the set of randomly selected indices which will not (will) be scaled or translated in the current affine transformation. Indices in

were initialized using a binomial distribution with a

success rate. These indices remain constant after initialization. In our implementation of , the and functions of every coupling layer are defined by 4-layer fully connected neural networks. Both and

networks use Leaky ReLu as the activation function in the hidden layers. The final output layer for

and utilizes the activation functions and linear, respectively.


has five linear layers with the same number of neurons in each layer as the dimensionality of the data. Leaky ReLu was chosen as the activation function between layers. For the image datasets, the following changes were made as a result of resource constraints: First, the batch sizes used for MNIST, CIFAR-10 and CelebA were

, and respectively; second, for missing data rates above , a learning rate of was employed. A detailed description of the training procedure for MCFlow can be seen in Algorithm 1.

MCFlow periodically updates the missing entries in the training data and resets the parameters in the flow model function, . To that end, we use an exponential scheduling mechanism, where the data update and parameter reset occurs at every epoch that is a power of 2. This implies that in order to perform inference on arbitrary data points, it is necessary to save the model parameters for both and after updating according to Eq. 10 but before resetting the model parameters in :

Missing Rate .1 .2 .3 .4 .5 .6 .7 .8 .9
MNIST GAIN .11508 .12441 .13988 .14745 .16281 .18233 .20734 .24179 .27258
MisGAN .11740 .10997 .11377 .11297 .12174 .13393 .15445 .19455 .27806
MCFlow .07464 .07929 .08508 .09187 .10045 .11255 .12996 .15806 .20801
CIFAR-10 GAIN .10053 .12700 .13248 .11785 .12451 .13130 .13832 .18728 .53728
MisGAN .18923 .15223 .14746 .12947 .13027 .14746 .17335 .24060 .31722
MCFlow .06012 .06232 .06686 .07215 .08311 .10048 .13132 .15015 .16939
CelebA GAIN .06752 .07493 .08367 .08479 .09292 .10608 .11720 .14042 .52050
MCFlow .05733 .06243 .06266 .06946 .07261 .07890 .08487 .11073 .12225
Table 2: Imputation Results on Image Datasets - RMSE (lower is better)
Missing Rate .1 .2 .3 .4 .5 .6 .7 .8 .9
GAIN .0696 .4035 1.24 3.277 6.337 12.44 22.91 43.69 92.74
MisGAN .0529 .1015 .2085 .2691 .3634 .8870 1.324 2.334 6.325
MCFlow .0521 .0779 .2295 .6097 .8366 .9082 1.951 6.765 15.11
Table 3: FID Results on Imputed MNIST Data (lower is better)
Missing Rate .1 .2 .3 .4 .5 .6 .7 .8 .9
GAIN .989 .988 .985 .978 .969 .931 .852 .629 .261
MisGAN .989 .988 .986 .980 .968 .945 .872 .690 .334
MCFlow .991 .990 .990 .988 .985 .979 .963 .905 .705
Table 4: Classification Accuracy on Imputed MNIST Data (higher is better)

Based on this exponential scheduling scheme, inference in the MCFlow architecture, as described in Algorithm 2, requires all of the models saved during training. The number of saved models and passes through the architecture is a function of the number of epochs trained, . More specifically, training MCFlow involves saving the parameters for and at every epoch that is a power of 2, ultimately requiring saved models in order to properly impute missing data for new samples. The longest the model took to converge was epochs, which required nine saved models to properly inference testing datapoints. Imputing missing data in the test dataset involves performing initial naive imputation followed by full passes through each saved architecture. At the end of this process, the final prediction from MCFlow of in is returned and performance metrics are recorded.

5 Experimental Results

5.1 Datasets

We evaluate the performance of MCFlow as well as competing methods on three standard, multivariate datasets from the UCI repository [12] and three image datasets. The UCI datasets considered are Default of Credit Card Clients, Letter Recognition and Online News Popularity. The image datasets used in this research are MNIST[27], CIFAR-10 [32] and CelebA [26]. The MNIST dataset contains -pixel grayscale images of handwritten digits; we used the standard 60,000/10,000 training/test set partition. CIFAR-10 contains -pixel RGB images from ten classes; we used the standard 50,000/10,000 training/test set partition. CelebA contains -pixel RGB images of celebrity faces. It was split using the first 162,770 images as the training set and the last 19,962 images for testing. The CelebA images were center cropped and resized to

pixels using bicubic interpolation.

5.2 Experimental Setup

Imputation performance on multivariate and image data was measured using using root mean squared error (RMSE). We report performance numbers on each test set from models obtained upon convergence of the training process, where convergence is determined based on stabilization of the training losses for both (Eq. 7) and (Eq. 8). We emphasize that the training set itself has missing data and that the training loss is computed only on observed data points in order to mimic real-world scenarios as faithfully as possible. Imputation performance on the UCI datasets was evaluated with a data missingness rate of

using five-fold cross validation. We report the mean and standard deviation of the imputation accuracy across all folds for six competing methods: MICE 

[46], MissForest [42], Matrix Completion (Matrix) [34], Auto-Encoder [16], Expectation-Maximization (EM) [15] and GAIN [49]. For MNIST, CIFAR-10 and CelebA, the testing imputation accuracies are reported across missing rates ranging from to in steps of . Competing methods considered include GAIN [49] and MisGAN [29]. All reported numbers are measured with respect to imputation accuracy over the missing data, namely . Despite considerable efforts to make MisGAN achieve reasonable performance numbers on CelebA, we weren’t able to manage it based on the currently published code. No other versions of the code were available directly from the authors. We also measured the quality of the imputed MNIST images with the Fréchet Inception Distance (FID) [21], which has been shown to correlate well with human perception. Lastly, we measured the ability of the algorithms to preserve semantic content by performing a classification task on imputed imagery with networks pretrained on fully observed data.

5.3 Quantitative Results

Tables 1 and 2 depict MCFlow’s performance at MCAR imputation against competing methods. For all datasets and missing rates, the MCFlow framework outperforms all other methods at imputation accuracy from the perspective of the RMSE between the predictions of the model in question and the ground truth values. For the UCI datasets, MCFlow produces an average reduction in square error of as compared to the existing state of the art, GAIN. In addition, MCFlow produces an average reduction in RMSE of , and as compared to the next best performing method on MNIST, CIFAR-10 and CelebA respectively. Sec. 1 in the Supplementary Material contains imputation results on the training dataset and in terms of PSNR. The results in Table 2 illustrate the pixel-level quality of the imputed imagery; in contrast, Table 3 includes FID performance which is meant to be indicative of quality as perceived by humans. It can be seen that MisGAN outperforms competing methods across most of the missingness range. As will be illustrated in Sec. 5.4, while this means that imputed images more closely resemble attributes of the target image population, they do not necessarily preserve the original semantic content of the partially observed inputs.

In order to illustrate the performance of the methods beyond image quality metrics, we tested their imputation abilities within the context of a larger data processing pipeline where incomplete data is imputed before being processed through a classifier trained on complete data. To that end, we measured the classification performance of a LeNET-based handwritten digit classifier on imputed MNIST data with various degrees of missingness. The LeNET network was pre-trained on MNIST data without missing values. Table 4 contains these results. It can be seen that the imputation results produced by MCFlow have the smallest impact on the semantic content of the imagery, as classification results are consistently higher for images imputed with our method. This phenomenon becomes more evident as the missing data rate increases: the classifier operating on MCFlow-imputed imagery is able to achieve good classification accuracy up to the highest rate of missingness tested, and performs acceptably even in this extreme scenario.

5.4 Qualitative Results

Fig. 3 illustrates the imputation performance of competing methods for the missing data rate on the MNIST dataset. Images along columns (a) and (b) contain the original and observed images, respectively. Pixels that are not observed are assigned a value of 0 for visualization purposes. The imputation models only see the observed images; complete images are included for reference only. Images along columns (c), (d) and (e) include the imputed results after using GAIN, MisGAN and MCFlow, respectively. It can be seen that MCFlow does the best job among the competing methods at preserving and recovering the semantic content of the intended image, which further supports the results from Table 4. Numbers in GAIN-imputed images are, for the most part, illegible. In contrast, MisGAN-imputed images are visually impressive, which is in line with the results from Table 3. One shortcoming of MisGAN, however, is that the imputations produced often fail to represent the digit contained in the original image. Sec. 2 in the Supplementary Material contains additional qualitative results on MNIST and CIFAR-10.

(a)          (b)         (c)          (d)         (e)

Figure 3: Sample imputation results for MNIST at 0.9 missingess rate: (a) Original, (b) observed, (c) GAIN-imputed, (d) MisGAN-imputed, and (e) MCFlow-imputed images.

6 Conclusions

We proposed MCFlow, a method for data imputation that leverages a normalizing flow model as the underlying density estimator. We augmented the traditional generative framework with an alternating learning scheme that enables it to accurately learn distributions from incomplete data sets with various degrees of missingness. We empirically demonstrated the superiority of the proposed method relative to state-of-the-art alternatives in terms of RMSE between the imputed and the original data. Experimental results further show that MCFlow outperforms competing methods with regards to preservation of the semantic structure of the data. This is evidenced by the superior classification performance on imputed data achieved by a classifier trained on complete data, which holds across the full range of missing data ratios evaluated. The ability of the method to make sense of and recover the semantic content of the data at every tested missing rate indicates that it is effectively learning the underlying statistical properties of the data, even in extreme paucity scenarios.

Acknowledgements. The authors would like to thank Charles Venuto and Monica Javidnia from the Center for Health and Technology (CHET) at the University of Rochester for their fruitful insight and support.


  • [1] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas (2016) Learning to learn by gradient descent by gradient descent. abs/1606.04474. External Links: Link, 1606.04474 Cited by: §1.
  • [2] M. Arjovsky and L. Bottou (2017) Towards principled methods for training generative adversarial networks. abs/1701.04862. Cited by: §2.
  • [3] V. Audigier, F. Husson, and J. Josse (2016)

    Multiple imputation for continuous variables using a bayesian principal component analysis

    86 (11), pp. 2140–2156. External Links: Document, Link, Cited by: §2.
  • [4] Y. Burda, R. B. Grosse, and R. Salakhutdinov (2015) Importance weighted autoencoders. abs/1509.00519. Cited by: §2.
  • [5] X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel (2018-10–15 Jul) PixelSNAIL: an improved autoregressive generative model. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 864–872. External Links: Link Cited by: §1.
  • [6] P. Comon and C. Jutten (2010) Handbook of blind source separation: independent component analysis and applications. 1st edition, Academic Press, Inc., Orlando, FL, USA. External Links: ISBN 0123747260, 9780123747266 Cited by: §2.
  • [7] G. Deco and W. Brauer (1995) Higher order statistical decorrelation without information loss. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen (Eds.), pp. 247–254. External Links: Link Cited by: §2.
  • [8] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the em algorithm. 39 (1), pp. 1–38. Cited by: §1, §3.
  • [9] M. Di Zio, U. Guarnera, and O. Luzi (2007-07)

    Imputation through finite gaussian mixture models

    51 (11), pp. 5305–5316. External Links: ISSN 0167-9473, Link, Document Cited by: §2.
  • [10] L. Dinh, D. Krueger, and Y. Bengio (2014) NICE: non-linear independent components estimation. abs/1410.8516. Cited by: §1, §2, §3.
  • [11] L. Dinh, J. Sohl-Dickstein, and S. Bengio (2017) Density estimation using Real NVP. External Links: Link Cited by: §1, §2, §3, §4.1.
  • [12] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §5.1.
  • [13] B. J. Frey (1998) Graphical models for machine learning and digital communication. MIT Press, Cambridge, MA, USA. External Links: ISBN 0-262-06202-X Cited by: §2.
  • [14] R. Gao, Y. Lu, J. Zhou, S. Zhu, and Y. N. Wu (2018) Learning generative convnets via multi-grid modeling and sampling. In

    2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018

    pp. 9155–9164. External Links: Link, Document Cited by: §1.
  • [15] P. J. Garcia-Laencina, J. Sancho-Gomez, and A. R. Figueiras-Vidal (2010-03) Pattern classification with missing data: a review. 19 (2), pp. 263–282. External Links: ISSN 0941-0643, Link, Document Cited by: §1, §5.2.
  • [16] L. Gondara and K. Wang (2017)

    Multiple imputation using deep denoising autoencoders

    abs/1705.02737. Cited by: §5.2.
  • [17] I. J. Goodfellow (2017) NIPS 2016 tutorial: generative adversarial networks. abs/1701.00160. External Links: Link, 1701.00160 Cited by: §3.
  • [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680. External Links: Link Cited by: §2.
  • [19] T. Han, Y. Lu, S. Zhu, and Y. N. Wu (2017) Alternating back-propagation for generator network. In AAAI, Cited by: §1.
  • [20] W. K. Hastings (1970) Monte carlo sampling methods using markov chains and their applications. 57 (1), pp. 97–109. External Links: Document,, Link Cited by: §1.
  • [21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §5.2.
  • [22] C.T. Kelley (1999) Iterative methods for optimization. Frontiers in Applied Mathematics, Society for Industrial and Applied Mathematics. External Links: ISBN 9780898714333, LCCN lc99011141, Link Cited by: §1.
  • [23] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. Note: cite arxiv:1312.6114 External Links: Link Cited by: §2.
  • [24] D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 10215–10224. External Links: Link Cited by: §1, §2, §3.
  • [25] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4743–4751. External Links: Link Cited by: §2.
  • [26] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §5.1.
  • [27] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Note: External Links: Link Cited by: §5.1.
  • [28] K. Li and J. Malik (2016) Learning to optimize. abs/1606.01885. External Links: Link, 1606.01885 Cited by: §1.
  • [29] S. C. Li, B. Jiang, and B. Marlin (2019) MisGAN: learning from incomplete data with generative adversarial networks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2, §5.2.
  • [30] R.J.A. Little and D.B. Rubin (2002) Statistical analysis with missing data. Wiley series in probability and mathematical statistics. Probability and mathematical statistics, Wiley. External Links: ISBN 9780471183860, LCCN 2002027006, Link Cited by: §1, §2.
  • [31] R. J. A. Little and D. B. Rubin (2002) Statistical analysis with missing data. John Wiley & Sons, Inc., New York, NY, USA. External Links: ISBN 9780471183860 Cited by: §1, §3.
  • [32] Z. Liu, P. Luo, X. Wang, and X. Tang (2015-12) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §5.1.
  • [33] P. Mattei and J. Frellsen (2019-09–15 Jun) MIWAE: deep generative modelling and imputation of incomplete data sets. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 4413–4423. External Links: Link Cited by: §1, §2.
  • [34] R. Mazumder, T. Hastie, and R. Tibshirani (2010-08) Spectral regularization algorithms for learning large incomplete matrices. 11, pp. 2287–2322. External Links: ISSN 1532-4435, Link Cited by: §5.2.
  • [35] J. Murray (2018) Multiple imputation: a review of practical and theoretical findings. Cited by: §2.
  • [36] R. C. Neath (2013) On convergence properties of the Monte Carlo EM algorithm. Cited by: §3.
  • [37] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) WaveNet: a generative model for raw audio. Note: cite arxiv:1609.03499 External Links: Link Cited by: §2.
  • [38] N. J. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, and A. Ku (2018) Image transformer. External Links: Link Cited by: §1.
  • [39] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. External Links: Link Cited by: §2.
  • [40] D. J. Rezende, S. Mohamed, and D. Wierstra (2014-22–24 Jun) Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara (Eds.), Proceedings of Machine Learning Research, Vol. 32, Bejing, China, pp. 1278–1286. External Links: Link Cited by: §1, §2, §2.
  • [41] Y. Si and J. P. Reiter (2013)

    Nonparametric bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys

    38 (5), pp. 499–521. External Links: Document, Link, Cited by: §2.
  • [42] D. J. Stekhoven and P. Bühlmann (2011-10) MissForest—non-parametric missing value imputation for mixed-type data. 28 (1), pp. 112–118. Cited by: §1, §2, §5.2.
  • [43] M. Takahashi (2017) Statistical inference in missing data by mcmc and non-mcmc multiple imputation algorithms: assessing the effects of between-imputation iterations. 16 (37), pp. 1–17. Cited by: §1.
  • [44] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman (2001-06) Missing value estimation methods for DNA microarrays . 17 (6), pp. 520–525. External Links: ISSN 1367-4803, Document, Link, Cited by: §1, §2.
  • [45] S. van Buuren (2018) Flexible imputation of missing data. second edition. Vol. , Chapman and Hall, CRC Press, Boca Raton, FL. Cited by: §1, §2.
  • [46] S. van Buuren and K. Groothuis-Oudshoorn (2011) Multivariate imputation by chained equations in r. Journal of Statistical Software 45 (3), pp. 1–67. External Links: Link Cited by: §1, §5.2.
  • [47] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016)

    Pixel recurrent neural networks

    In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1747–1756. External Links: Link Cited by: §1.
  • [48] G. C. G. Wei and M. A. Tanner (1990) A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. 85 (411), pp. 699–704. External Links: Document, Link, Cited by: §3.
  • [49] J. Yoon, J. Jordon, and M. van der Schaar (2018-10–15 Jul) GAIN: missing data imputation using generative adversarial nets. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 5689–5698. External Links: Link Cited by: §1, §2, §5.2.
  • [50] Z. Zhang (2016) Missing data imputation: focusing on single imputation. 4 (1). External Links: ISSN 2305-5847, Link Cited by: §2.