In real world datasets, missing values are pervasive, caused by incomplete measurement, data corruption, privacy concerns, and so on. For example, consider lab test values in a clinical setting: it is typical that at a certain time point only a small subset of laboratory tests are examined, leaving the values of most tests missed. Such missingness incurs substantial difficulty for analyzing data and distilling insights therefrom.
To address this issue, it is important to perform missing value imputation (MVI): guessing the missing entries based on the observed ones. By filling the missing entries correctly, one can obtain a complete picture of the data which facilitates the extraction of more informed patterns. In addition, since most machine learning (ML) models require the inputs to be fixed-length feature vectors, MVI makes incomplete data ML-ready. In general, MVI is a challenging task, especially when the the missing rate is high. This is due to the following reasons. First, the relationship between observed values and missing values, which is the foundation of designing MVI methods, is unknown in most cases. Previous methods[6, 14] make simplistic assumptions such as the relationship is linear, which may not hold in practice. Second, the pattern of missingness is typically irregular: different data instances have different number of missing entries and the positions of these entries are different as well. Such irregularity hinders the design of MVI models. For example, suppose one wants to build a predictive model that maps the observed entries to missing entries. Mathematically, the inputs and outputs of such a model is difficult to be formulated since the missing entries of different examples have different positions.
use autoencoders for MVI. While empirically effective, they are heuristics-based and lack a principled foundation. Specifically, it is unclear what mathematical objective is defined w.r.t the missing values. Some methods perform MVI based on matrix completion[3, 10, 7], by assuming the underlying complete data matrix is low-rank. Such an assumption essentially posits that the relationship between observed entries and missing entries is linear. Hence these methods are unable to explore the nonlinear relationship, which are more common in practice. In many methods such as mean imputation, the missing entries are imputed at a population-level: for each feature, the same value (e.g., mean of all observed values of this feature) is used to impute all missing instances of this feature across different data examples. This is problematic since two examples that may be completely different are given the same value.
In this paper, we aim at developing an MVI method that addresses the limitations of existing approaches. First, we formulate MVI as estimating the conditional distributionof missing entries based on the observed values , and show that this distribution can be inferred by solving a fixed-point problem (FPP). Such a formulation enables our method to perform example-level imputation where the missing values of a data example is imputed based on the observed values of this example. This FPP involves two conditional distributions bridged by a latent variable. To capture the nonlinear relationship between and
, we parameterized the two conditional distributions using deep neural networks (DNNs). DNNs are nonlinear functions that possess great approximation power. Hence our method can well capture the nonlinear relationship between observed and missing values. The design of these two distributions enables our method to adapt to any irregularity of positions and numbers of the missing values. The weight parameters of DNNs are estimated by maximizing the approximated log-likelihood of all observed data, via an expectation maximization (EM) algorithm. In the E step, MVI is performed by fixing the model parameters. In the M step, model parameters are updated based on the imputed missing values. Compared with previous approaches, our method possesses the following merits: (1) a well-grounded theoretical foundation; (2) ability of capturing nonlinear relationship between missing and observed values; (3) ability of performing instance-level imputation. We demonstrate the empirical effectiveness of our method on 13 datasets by comparing with 11 baselines.
The rest of this paper is organized as follows. In Section 2, we introduce related works. Section 3 and 4 present the methods and experimental results. Section 5 concludes the paper.
2 Related Work
Dempster et al  propose an EM algorithm for maximum likelihood estimation from incomplete data. Our method differs from this work in two-fold. First, we estimate the conditional distribution by solving a fixed-point problem. Second, the imputation models of our methods are parameterized by deep neural networks. Several works [3, 10, 7] have studied missing value imputation from a matrix completion perspective, either via nuclear norm regularization [3, 10] or matrix factorization . Buuren and Groothuis-Oudshoorn 
propose to iteratively impute the values of each feature through regression on the values of remaining features. All these methods are inherently linear, which may not be sufficient to capture the nonlinear patterns between the observed and missing values. And these methods need to perform singular value decomposition in each iteration, which is computationally heavy. Deep learning models have been explored for MVI. Gondara and Wang
propose a multiple imputation model based on overcomplete deep denoising autoencoders. Tran el al.  propose a cascaded residual autoencoder (CRA) to impute missing modalities. By stacking residual autoencoders, CRA grows iteratively to model the residual between the current prediction and original data. While showing empirical effectiveness, these methods lack a theoretical foundation. Stekhoven and Bühlmann  develop a nonparametric method to cope with the imputation of different types of variables simultaneously. Multiple imputation by chained equations (MICE)  generates imputations based on a set of imputation models, one for each variable with missing values. Similarly, these methods are heuristics-based. A theoretical understanding is missing.
In this section, we study how to perform missing value imputation based on deep generative models. Overall there are three types of missing mechanism: (1) missing completely at random (MCAR), where the propensity for a data item to be missing is completely random; (2) missing at random (MAR), where the propensity for a data item to be missing is not related to the missing data, but it is related to some of the observed data; (3) missing not at random (MNAR), where the data items are missing due to certain underlying mechanisms. In this work, we mainly focus on MNAR.
We assume there are features at the population level. For each individual data example , it has () observed features and the rest features
are missing. For different data examples, the missing features are different. We formulate the missing value imputation problem in a probabilistic framework: (1) treat observed and missing features as random variables; (2) given the observed features, infer the conditional distribution of the missing ones . We introduce a latent variable to learn the latent representation of this data example and assume and are conditionally independent given :
In the sequel, we drop from the notations for simplicity. Based on , the conditional distribution can be calculated in the following way:
where the last step is first-order Taylor approximation of . An intuitive interpretation of this equation is that: given the observed features, we first infer the latent representation , then generate the missing features by inferring . To calculate , we need to infer , which can be done in the following way:
Similarly, the last step is obtained using first order Taylor approximation. As can be seen from Eq.(2) and Eq.(3), and can be estimated using fixed-point iteration method which iteratively performs the following two steps until convergence: (1) given , substitute it into as an approximation of ; (2) given , substitute it into as an approximation of . Figure 1 illustrates this process. We initialize the missing values with , concatenate it with the observed values , and feed the concatenation to , which is called an encoder. Then we calculate and feed it to , which is called a decoder to get a refined estimation of the missing entries. This two steps iterates until convergence.
Next, we discuss how to parameterize and . Similar to variational autoencoder , we use deep neural networks to perform such parameterization. To parameterize , we first generate two dimensional vectors using deep networks (represented by and ): and . Then for each missing feature , let denote its index. If is a continuous variable, the distribution defined on is a univariate Gaussian where and are the -th dimension of and . If. Similarly, we use deep networks (represented by and ) to parameterize
as a multivariate Gaussian distributionand denotes the concatenation of the observed features and imputed features (in the order of their indexes).
Next, we discuss how to learn the parameters of the deep networks. First, we show that the log-likelihood of the observed values can be approximated by a function of , , , and , where and are parameterized by the same neural networks and defined before.
We then perform parameter learning by maximizing the approximation of using an expectation maximization (EM) algorithm, which iteratively performs the following two steps until convergence. First, in the E(expectation) step, we fix the weight parameters of neural networks and infer . This is basically the missing value imputation task, which can be accomplished by executing the fixed-point iteration method between Eq.(2) and Eq.(3). The (fixed) weight parameters of neural networks are needed to calculate these two equations. In the (M)aximization step, we fix (and therefore ), and estimate the model parameters. The estimation can be conducted using the stochastic gradient variational Bayes algorithm .
We evaluated our method on various publicly available datasets, with various percentages of missing values and compared with 11 baseline methods both quantitatively and qualitatively.
We use 13 datasets from the UCI repository: 10 for classification tasks and 3 for regression tasks. Table 1 shows the statistics of these datasets. The numbers of data examples vary from about 100 to over 3000 while the numbers of features vary from 4 to 60. These features are of various types, including ordinal, categorical, and continuous. Min-max normalization was used for feature normalization.
|# Training||# Testing||# Features||Data types|
|bc (breast cancer)||489||105||9||Categorical|
|housevote (house voting)||304||66||16||Categorical|
|boston (Boston housing)||354||76||13||Categorical, ordinal, continuous|
Most of the original datasets have no missing entries. We simulated the missingness by using Algorithm 1 to generate two types of missing data: missing not at random with uniform missingness (mnar-uniform) and missing not at random with random missingness (mnar-random). In uniform setting, data records are cropped entirely, i.e. all features are not presenting in missing records. In random setting, data records only miss part of their features. In reality, it won’t make much sense to predict a record if there is no available information of it. As a result, most of our work focus on mnar-random case, where we own partial knowledge of the missing records. Fixing
will produce consistent missing rates for different datasets. All the missing entries are filled in by the mean of remained training data on a feature basis, except in last experiment where we investigated the influence of various initialization methods. This means the cropped entries will not be accessed during the whole training and inference process, so the models will not know the ground truth. The final evaluation metric that we use is the sum of root mean squared error given in following formwhere suppose p is the imputation result and t is true data matrix, with n sample rows and m feature columns, we compute the root mean squared errors feature-wise and then sum them up.
4.2 Experimental Settings
We used the Adam  optimizer in our experiment. The learning rate was set to 1e-3 and the batch size was set to 50 for all datasets. The dimension of the hidden variable
was set to 100. We parameterized with feedforward networks having two hidden layers where the number of units is 128 and 64 respectively. For, the networks have two hidden layers as well, with 64 and 128 units respectively. The activation function was set to tanh. In the stochastic gradient variational Bayes algorWe sample from
Gaussian distribution in sample layer. The variance of the Gaussian distribution has an impact on the performance. From the experiment finding it appears that usually the higher the variance the worse performance the model will achieve. We have two extra hyperparameters which are the number of inference operation and number of training step before we perform inference update on training data. They are important for the model performance. Too much inference in early training stage will not help the imputation as the model parameters are still not well updated. In most of our experiments the inference time is set to 2 and the training interval is set to 5. And we run 100 epoch to convergence.
. For Mean/Median imputation, the missing entries are filled by the feature-wise mean/median value computed from observed data, and Zero imputation fills missing entries by zero. SoftImpute uses iterative soft-thresholded singular value decomposition to impute the missing value. MICE iteratively imputes each feature by regression on the rest features. MissForest is a non-parametric approach that directly predicts missing values by training a random forest model on observed data. KNN approximates the missing value in each data sample by the corresponding value of the closest data example. PCA fits on the incomplete data with some initialization on the missing entries and estimates them with calculated principal components. AE, DAE, and RAE work mostly in the same manner. They all have an encoder to map the input to some encoded feature and use a decoder to map the feature to some output. For AE the input and output are the same. For DAE the input is the incomplete data with added noise on observed entries and the output is the original incomplete data. For RAE the input is the same as DAE one but the output is the residual between the corrupted and original incomplete data. All three models share the same model structure as our method uses. Encoded dimension can be tuned to achieve better solution for each model. On each dataset, each model is run multiple times to get multiple imputation result for analysis so we take advantage of Multiple Imputation (MI).
Besides these deep models, we also compare our method with MICE, MissForest, KNN, PCA and SoftImpute. For these methods we use existing packages to record the final performance. The error analysis uses as evaluation metric performed on scaled data in order to remove influence from diverse ranges of features.
4.3 Experiment Results
Table 2 shows the imputation performance on datasets with a 50% missing rate. Our method outperforms all the baselines on 11 out of 13 datasets by reducing error up to ~7%. This shows the generative model with EM algorithm can lead the generated missing value closer to the ground truth, i.e. closer to true
, in various situations. The Mean/Median/Zero imputation only use fixed numbers to replace the missing values which cannot fully represent the intrinsic correlation among features and only serve as a general and blur estimation, although mean and median remain strong competitors in our compared baselines, which is due to the reason that when the data is not skewed, i.e. there doesn’t exist much outliers in the dataset, the mean/median can represent the most common information of the data. MICE, KNN and SoftImpute are inherently linear and cannot capture nonlinear feature patterns so they perform poorer than our method. Among MICE, MissForest, KNN, PCA and SoftImpute, MissForest takes longer time to train and estimate the missing value since it requires quite amount of trees to make decision on complicated dataset. MissForest and KNN can easily overfit on training data. And since they are based on heuristic rules, they don’t capture inter-correlation among features and perform poorer than our method. Such similar performance is also shown in our qualitative analysis, which will be discussed later.
Table 3 shows the imputation performance on datasets with various missing rates, 25% and 75% in our setting to represent low and high missing situation. When the rate is low, our method outperforms baselines on 8/13 datasets while most of the methods tend to achieve comparable results. It is intuitive as when the missing rate is small, the observed data can provide enough information to infer the missing entries using different models. When the rate is super high, it is hard for models to learn inter-correlation among features. In such situation the imputation raises a higher demand on model to capture those relations. The performance is more diverged compared to previous results while our method still outperforms other baseline methods on over a half cases. In reality, when the dataset is largely missing, it is not wise to directly impute the missing data. It is better to use techniques such as heuristics or correlation analysis to remove unnecessary features in order to reduce the missing rate and then perform the missing imputation task. And although not a main point of study, we still report the performance on a setting of 50% uniformly missing data, as show in table 4, in which our method still achieves better results on most datasets. All above results demonstrate the power of our model to learn data pattern and generate more real data for missing values.
The main goal of imputing missing value is to generate real-like data to help with analysis of the dataset, for example, train a machine learning model to classify an unseen data point. Apparently in general, with more information in data, the classification will be more accurate. So besides the analysis of how close the imputed data is to the true data, we also care about how meaningful the imputed data is to the analytics, i.e. how much intrinsic correlation among features we catch in the imputed data. We fit multi-layer perceptron (MLP) model for all datasets and change the loss between cross entropy loss and squared error loss based on the task. For the classification task we use cross entropy loss and report test accuracy and for the regression task we use squared error loss and report final root mean squared error (RMSE) as evaluation metric.
Table 5 shows the results performance of classification/regression tasks on datasets with 50% missing rate using mnar-random missing generation strategy. The Mean remains a strong competitor besides RAE model. The result, in a qualitative way, demonstrates that our method better captures the intrinsic data pattern on a wide variation of datasets. And the imputed data possess higher quality in different data analysis tasks.
Lastly we investigate the model performance given different initialization. We compare using mean/zero/median and random value to initialize the dataset. The random value is generated with a Gaussian distribution. The result is shown in table 6. We see initializing with mean value yields the best performance on most datasets. This makes sense as we iteratively use imputed data to train the model and update the missing entries with trained weights, it is better to start from a good indicator. A deviated starting point will make the initial learning of our method hard to capture real correlation among features and thus make the entire process not able to achieve best imputation result.
5 Conclusions and Future Works
In this paper, we study the imputation of missing values. We formulate this problem as a generative process: give the observed values , generate the missing ones , by estimating a conditional distribution . We show that this distribution can be estimated by solving a fixed-point problem defined on two conditional distributions that are parameterized by deep neural networks. In experiments on 10 datasets, our method outperforms 11 baseline methods.
For future works, we will extend our method to missing value imputation in time-series data. The general probabilistic framework developed in Section 3 can be largely re-used. The two conditional distributions need to be redesigned to accommodate the time-series structure. Possible candidates are the variational recurrent network .
Another direction to explore is task-aware missing value imputation. Instead of performing the MVI and the target task sequentially: first filling the missing data, then feed the imputed data to the downstream task, we can perform them jointly to better explore the synergistic relationship between them. On one hand, more accurate imputation of missing values prepares better inputs for the target task, which boosts the task’ performance. On the other hand, during training, the performance of the target task provides guidance to MVI such that the imputed values are more suitable for this task.
-  Naomi S Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
-  S van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in r. Journal of statistical software, pages 1–68, 2010.
-  Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
-  Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
-  Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
-  Lovedeep Gondara and Ke Wang. Multiple imputation using deep denoising autoencoders. CoRR, abs/1705.02737, 2017.
-  Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
-  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of machine learning research, 11(Aug):2287–2322, 2010.
-  David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
-  Joseph L Schafer. Multiple imputation: a primer. Statistical methods in medical research, 8(1):3–15, 1999.
-  Daniel J Stekhoven and Peter Bühlmann. Missforest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118, 2011.
-  Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin. Missing modalities imputation via cascaded residual autoencoder. , pages 4971–4980, 2017.
-  Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.