Missing Value Imputation With Unsupervised Backpropagation

Many data mining and data analysis techniques operate on dense matrices or complete tables of data. Real-world data sets, however, often contain unknown values. Even many classification algorithms that are designed to operate with missing values still exhibit deteriorated accuracy. One approach to handling missing values is to fill in (impute) the missing values. In this paper, we present a technique for unsupervised learning called Unsupervised Backpropagation (UBP), which trains a multi-layer perceptron to fit to the manifold sampled by a set of observed point-vectors. We evaluate UBP with the task of imputing missing values in datasets, and show that UBP is able to predict missing values with significantly lower sum-squared error than other collaborative filtering and imputation techniques. We also demonstrate with 24 datasets and 9 supervised learning algorithms that classification accuracy is usually higher when randomly-withheld values are imputed using UBP, rather than with other methods.


page 1

page 2

page 3

page 4


Using Association Rules for Better Treatment of Missing Values

The quality of training data for knowledge discovery in databases (KDD) ...

Missing Value Estimation using Clustering and Deep Learning within Multiple Imputation Framework

Missing values in tabular data restrict the use and performance of machi...

Fast Imbalanced Classification of Healthcare Data with Missing Values

In medical domain, data features often contain missing values. This can ...

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values

Handling missing values in training datasets for constructing learning m...

Internal Data Imputation in Data Warehouse Dimensions

Missing values occur commonly in the multidimensional data warehouses. T...

Missing Features Reconstruction and Its Impact on Classification Accuracy

In real-world applications, we can encounter situations when a well-trai...

Tomographic Auto-Encoder: Unsupervised Bayesian Recovery of Corrupted Data

We propose a new probabilistic method for unsupervised recovery of corru...

Code Repositories


Matrix Imputation using Unsupervised Backpropagation

view repo

1 Introduction

Many effective machine learning techniques are designed to operate on dense matrices or complete tables of data. Unfortunately, real-world datasets often include only samples of observed values mixed with many missing or unknown elements. Missing values may occur due to human impatience, human error during data entry, data loss, faulty sensory equipment, changes in data collection methods, inability to decipher handwriting, privacy issues, legal requirements, and a variety of other practical factors. Thus, improvements to methods for imputing missing values can have far-reaching impact on improving the effectiveness of existing learning algorithms for operating on real-world data. We present a method for imputation called

Unsupervised Backpropagation (UBP), which trains a multi-layer perceptron (MLP) to fit to the manifold represented by the known features in a dataset. We demonstrate this algorithm with the task of imputing missing values, and we show that it is significantly more effective than other methods for imputation.

Backpropagation has long been a popular method for training neural networks (Rumelhart et al., 1986; Werbos, 1990). A typical supervised approach trains the weights,

, of a multilayer perceptron (MLP) to fit to a set of training examples, consisting of a set of

feature vectors , and corresponding label vectors

. With many interesting problems, however, training data is not available in this form. In this paper, we consider the significantly different problem of training an MLP to estimate, or impute, the missing attribute values of

. Here is represented as an matrix where each of the attributes may be continuous or categorical. Because the missing elements in must be predicted, becomes the output of the MLP, rather than the input. A new set of latent vectors, , will be fed as inputs into the MLP. However, no examples from are given in the training data. Thus, both and must be trained using only the known elements in . After training, each may be fed into the MLP to predict all of the elements in .

Training in this manner causes the MLP to fit a surface to the (typically non-linear) manifold sampled by . After training, may be considered as a reduced-dimensional representation of . That is, will be an matrix, where is typically much smaller than , and the MLP maps .

UBP accomplishes the task of training an MLP using only the known attribute values in with on-line backpropagation. For each presentation of a known value of the attribute from the instance (), UBP simultaneously computes a gradient vector to update the weights , and a gradient vector to update the input vector . ( is the element in row , column of .)

In this paper, we demonstrate UBP as a method for imputing missing values, and show that it outperforms other approaches at this task. We compare UBP against 5 other imputation methods on a set of 24 data sets. 10% to 90% of the values are removed from the data sets completely at random. We show that UBP predicts the missing values with signficantly lower error (as measured by sum-squared difference with normalized values) than other approaches. We also evaluated 9 learning algorithms to compare classification accuracy using imputed data sets. Learning algorithms using imputed data from UBP usually achieve higher classification accuracy than with any of the other methods. The increase is most significant when 30% to 70% of the data is missing.

The remainder of this paper is organized as follows. Section 2 reviews related work to UBP and missing value imputation. UBP is described in Section 3. Section 4 presents the results of comparing UBP with other imputation methods. We provide conclusions and a discussion of future directions for UBP in Section 5.

2 Related Work

As an algorithm, UBP falls at the intersection of several different paradigms: neural networks, collaborative filtering, data imputation, and manifold learning. In neural networks, UBP is an extension of generative backpropagation (Hinton, 1988). Generative backpropagation adjusts the inputs in a neural network while holding the weights constant. UBP, by contrast, computes both the weights and the input values simultaneously. Related approaches have been used to generate labels for images (Coheh and Shawe-Taylor, 1990), and for natural language (Bengio et al., 2006). Although these techniques have been used for labeling images and documents, to our knowledge, they have not been used for the application of imputing missing values. UBP differs from generative backpropagation in that it trains the weights simultaneously with the inputs, instead of training them as a pre-processing step.

UBP may also be classified as a manifold learning algorithm. Like common non-linear dimensionality reduction (NLDR) algorithms, such as Isomap

(Tenenbaum et al., 2000), MLLE (Zhang and Wang, 2007), or Manifold Sculpting (Gashler et al., 2008), it reduces a set of high-dimensional vectors, , to a corresponding set of low-dimensional vectors, . Unlike these algorithms, however, UBP also learns a model of the manifold. Also unlike these algorithms, UBP is designed to operate with incomplete observations.

In collaborative filtering, UBP may be viewed as a non-linear generalization of matrix factorization (MF). MF is a linear dimensionality reduction technique that can be effective for collaborative filtering (Adomavicius and Tuzhilin, 2005) as well as imputation. This method has become a popular technique, in part due to its effectiveness with the data used in the NetFlix competition (Koren et al., 2009)

. MF involves factoring the data matrix into two much-smaller matrices. These smaller matrices can then be combined to predict all of the missing values in the original dataset. It is equivalent to using linear regression to project the data onto its first few principal components. Unfortunately, MF is not well-suited for data that exhibits non-linearities. It was previously shown that matrix factorization could be represented with a neural network model involving one hidden layer and linear activation functions

(Takács et al., 2009). In comparison with this approach, UBP uses a standard MLP with an arbitrary number of hidden layers and non-linear activation functions, instead of the network structure previously proposed for matrix factorization. MF produces very good results at the task of imputation, but we demonstrate that UBP does better.

As an imputation technique, UBP is a refinement of Nonlinear PCA (Scholz et al., 2005) (NLPCA), which has been shown to be effective for imputation. This approach also uses gradient descent to train an MLP to map from low to high-dimensional space. After training, the weights of the MLP can be used to represent non-linear components within the data. If these components are extracted one-at-a-time from the data, then they are the principal components, and NLPCA becomes a non-linear generalization of PCA. Typically, however, these components are all learned together, so it would more properly be termed a non-linear generalization of MF. NLPCA was evaluated with the task of missing value imputation (Scholz et al., 2005), but its relationship to MF was not yet recognized at the time, so it was not compared against MF. One of the contributions of this paper is that we show NLPCA to be a significant improvement over MF at the task of imputation. We also demonstrate that UBP achieves even better results than NLPCA at the same task, and is the best algorithm for imputation of which we are aware. The primary difference between NLPCA and UBP is that UBP utilizes a three-phase training approach (described in Section 3) which makes it more robust against falling into a local optimum during training.

UBP is comparable with the latter-half of an autoencoder

(Hinton and Salakhutdinov, 2006). Autoencoders create a low dimensional representation of a training set by using the training examples as input features as well as the target values. The first half of the encoder reduces the input features into low dimensional space by only have nodes in the middle layer where is less than the number of input features. The latter-half of the autoencoder then maps the low dimensional representation of the training set back to the original input features. However, to capture non-linear dependencies in the data, autoencoders require deep architectures to allow for layers between the inputs and low dimensional representation of the data and between the low dimensional representation of the data and the output. The deep architecture makes training an autoencoder difficult and computationally expensive, generally requiring unsupervised layer-wise training (Bengio et al., 2007; Erhan et al., 2009). Because UBP trains a network with half the depth of a corresponding autoencoder, UBP is practical for many problems for which autoencoders are too computationally expensive.

Since we demonstrate UBP with the application of imputing missing values in data, it is also relevant to consider other approaches that are classically used for this task. Simple methods, such as dropping patterns that contain missing values or randomly drawing values to replace the missing values, are often used based on simplicity for implementation. These methods, however, have significant obvious disadvantages when data is scarce. Another common approach is to treat missing elements as having a unique value. This approach, however has been shown to bias the parameter estimates for multiple linear regression models (Jones, 1996) and to cause problems for inference with many models (Shafer, 1997). We take it for granted that better accuracy is desirable, so these methods should generally not be used, as better methods do exist.

A simple improvement over BL is to compute a separate centroid for each output class. The disadvantages of this method are that it is not suitable for regression problems, and it cannot generalize to unlabeled data since it depends on labels to impute. Methods based on maximum likelihood (Little and Rubin, 2002) have long been studied in statistics, but these also depend on pattern labels. Since it is common to have more unlabeled data than labeled data, we restrict our analysis to unsupervised methods that do not rely on labels to impute missing values.

Another well-studied approach involves training a supervised learning algorithm to predict missing values using the non-missing values as inputs (Quinlan, 1989; Lakshminarayan et al., 1996; Alireza Farhangfar, 2008). Unfortunately, the case where multiple values are missing in one pattern present a difficulty for these approaches. Either a learning algorithm must be used that implicitly handles missing values in some manner, or an exponential number of models must be trained to handle each combination of missing values. Further, it has also been shown that results with these methods tend to be poor when there are high percentages (more than about 15%) of missing values (Acuña and Rodriguez, 2004).

One very effective collaborative filtering method for imputation is to cluster the data, and then make predictions according to the centroid of the cluster in which each point falls (Adomavicius and Tuzhilin, 2005). Luengo compared several imputation methods by evaluating their effect on classification accuracy (Luengo, 2011). He found cluster-based imputation with Fuzzy -Means (FKM) (Li et al., 2004) using Manhattan distance to outperform other methods, including those involving state of the art machine learning methods and other methods traditionally used for imputation. Our analysis, however, finds that most of the methods we compared outperform FKM.

A related imputation method called instance-based imputation (IBI) is to combine the non-missing values of the -nearest neighbors of a point to replace its missing values. To evaluate the similarity between points, cosine correlation is often used because it tends to be effective in the presence of missing values (Adomavicius and Tuzhilin, 2005; Li et al., 2008; Sarwar et al., 2001).

UBP, as well as the aforementioned imputation techniques, are considered single imputation techniques because only one imputation for each missing value is made. Single imputation has the disadvantage of introducing large amounts of bias since the imputed values do not reflect the added uncertainty from the fact that values are missing. To overcome this, Rubin (1987)

proposed multiple imputation that estimates the added variance by combining the outcomes of

imputed data sets. Similarly, ensemble techniques have also been shown to be effective for imputing missing values (Schafer and Graham, 2002). In this paper, we do not compare against ensemble methods because UBP involves a single model, and it may be included in an ensemble as well as any other imputation method.

3 Unsupervised Backpropagation

Figure 1: UBP trains an MLP to fit to high-dimensional observations, . For each known , UBP uses backpropagation to compute the gradient vectors and , which are used to update the weights, , and the input vector .

In order to formally describe the UBP algorithm, we define the following terms. The relationships between these terms are illustrated graphically in Figure 1.

  • Let be a given matrix, which may have many missing elements. We seek to impute values for these elements. is the number of instances. is the number of attributes.

  • Let be a latent matrix, where .

  • If is the element at row , column in , then is the value predicted by the MLP for this element when is fed forward into the MLP.

  • Let be the weight that feeds from unit to unit in the MLP.

  • For each network unit on hidden layer , let be the net input into the unit, let be the output or activation value of the unit, and let be an error term associated with the unit.

  • Let be the number of hidden layers in the MLP.

  • Let be a vector representing the gradient with respect to the weights of an MLP, such that is the component of the gradient that is used to refine .

  • Let be a vector representing the gradient with respect to the inputs of an MLP, such that is the component of the gradient that is used to refine .

Using backpropagation to compute , the gradient with respect to the weights, is a common operation for training MLPs (Rumelhart et al., 1986; Werbos, 1990). Using backpropagation to compute , the gradient with respect to the inputs, however, is much less common, so we provide a derivation of it here. In this deriviation, we compute each from the presentation of a single element . It could also be derived from the presentation of a full row (which is typically called “on-line training”), or from the presentation of all of (“batch training”), but since we assume that is high-dimensional and is missing many values, it is significantly more efficient to train with the presentation of each known element individually. We begin by defining an error signal, , and then express the gradient as the partial derivative of this error signal with respect to the inputs:


The intrinsic input affects the value of through the net value of a unit and further through the output of a unit

. Using the chain rule, Equation

1 becomes:


The backpropagation algorithm calculates (which is for a network unit) as the error term associated with a network unit. Thus, to calculate the only additional calculation that needs to be made is . For a network with 0 hidden layers:

which is non-zero only when equals and is equal to . When there are no hidden layers ():


If there is at least one hidden layer (), then,

where the and represent the output values and the net values for the units in the hidden layer. As part of the error term for the units in the layer, backpropagation calculates as the error term associated with each network unit. Thus, the only additional calculation for is:

As before, is non-zero only when equals . For network with at least one hidden layer:


Equation 4 is a strict generalization of Equation 3. Equation 3 only considers the one output unit, , for which a known target value is being presented, whereas Equation 4 sums over each unit, , into which the intrinsic value feeds.

3.1 3-phase Training

UBP trains and in three phases: 1) the first phase computes an initial estimate for the intrinsic vectors, , 2) the second phase computes an initial estimate for the network weights,

, and 3) the third phase refines them both together. All three phases train using stochastic gradient descent, which we derive from the classic backpropagation algorithm. We now briefly give an intuitive justification for this approach. In our initial experimentation, we used the simpler approach of training in a single phase. With several problems, we observed that early during training, the intrinsic point vectors,

, tended to separate into clusters. The points in each cluster appeared to be unrelated, as if they were arbitrarily assigned to one of the clusters by their random initialization. As training continued, the MLP effectively created a separate mapping for each cluster in the intrinsic representation to the corresponding values in . This effectively places an unnecessary burden on the MLP, because it must learn a separate mapping from each cluster that forms in to the expected target values. In phase 1, we give the intrinsic vectors a chance to self-organize while there are no hidden layers to form nonlinear separations among them. Likewise, phase 2 gives the weights a chance to organize without having to train against moving inputs. These two preprocessing phases initialize the system (consisting of both intrinsic vectors and weights) to a good initial starting point, such that gradient descent is more likely to find a local optimum of higher quality. Our empirical results validate this theory by showing that UBP produces more accurate imputation results than NLPCA, which only refines and together.

1:  Initialize each element in with small random values.
2:  Let be the weights of a single-layer perceptron
3:  Initialize each element in with small random values.
4:  ; ; ;
5:  ;
6:  while  
7:      train_epoch()
8:     if then
10:  Let be the weights of a multi-layer perceptron with hidden layers,
11:  Initialize each element in with small random values.
12:  ;
13:  while  
14:      train_epoch()
15:     if then
17:  ;
18:  while  
19:      train_epoch()
20:     if then
22:  return
Algorithm 1 UBP()

Pseudo-code for the UBP algorithm, which trains and in three phases, is given in Algorithm 1. This algorithm calls Algorithm 2

, which performs a single epoch of training. A detailed description of Algorithm 

1 follows.

A matrix containing the known data values, , is passed in to UBP (See Algorithm 1). UBP returns and . is a matrix such that each row, , is a low-dimensional representation of the corresponding row, . is a set or ragged matrix containing weight values for an MLP that maps from each to an approximation of . may be forward-propagated into this MLP to estimate values for any missing elements in .

Lines 1-9 perform the first phase of training, which computes an initial estimate for .

Line 1 of Algorithm 1 initializes the elements in

with small random values. Our implementation draws values from a Normal distribution with a mean of 0 and a deviation of 0.01.

Lines 2-3 initialize the weights, , of a single-layer perceptron using the same mechanism. This single-layer perceptron is a temporary model that is only used in phase 1 to assist the initial training of .

Line 4

sets some heuristic values that we used to detect convergence. We note that many other techniques could be used to detect convergence. Our implementation, used the simple approach of dividing half of the training data for a validation set. We decay the learning rate whenever predictions fail to improve by a sufficient amount on the validation data. This simple approch always stops, and it yielded better empirical results than a few other variations that we tried.

specifies an initial learning rate. Convergence is detected when the learning rate falls below . specifies the amount of improvement that is expected after each epoch, or else the learning rate is decayed. is a regularization term that is used during the first two phases to ensure that the weights do not become excessively saturated before the final phase of training. No regularization is used in the final phase of training because we want the MLP to ultimately fit the data as closely as possible. (Overfit can still be mitigated by limiting the number of hidden units used in the MLP.) We used the default heuristic values specified on this line in all of our experiments because tuning them seemed to have little impact on the final results. We believe that these values are well-suited for most problems, but could possibly be tuned if necessary.

Line 5 sets the learning rate, , to the initial value. The value is used to store the previous error score. As no error has yet been measured, it is initialized to .

Lines 6-9 train and until convergence is detected. may then be discarded.

Lines 10-16 perform the second phase of training. This phase differs from the first phase in two ways: 1) an MLP is used instead of a temporary single-layer perceptron, and 2) is held constant during this phase.

Lines 17-21 perform the third phase of training. In this phase, the same MLP that is used in phase 2 is used again, but and are both refined together. Also, no regularization is used in the third phase.

3.2 Stochastic gradient descent

1:  for each known in random order 
2:     Compute by forward-propagating into an MLP with weights .
4:     for each hidden unit feeding into output unit  
6:     for each hidden unit in an earlier hidden layer (in backward order) 
8:     for each  
11:     if  then
12:        for  from to  
13:           if then else
15:   measure RMSE with
16:  return
Algorithm 2 train_epoch()

Next, we describe Algorithm 2, which performs a single epoch of training by stochastic gradient descent. This algorithm is very similar to an epoch of traditional backpropagation, except that it presents each element individually, instead of presenting each vector, and it conditionally refines the intrinsic vectors, , as well as the weights, .

Line 1 presents each known element in random order.

Line 2 computes a predicted value for the presented element given the current . Note that efficient implementations of line 2 should only propagate values into output unit .

Lines 3-7 compute an error term for output unit , and each hidden unit in the network. The activation of the other output units is not computed, so the error on those units is 0.

Lines 8-10 refine by gradient descent.

Line 11 specifies that should only be refined during phases 1 and 3.

Lines 12-14 refine by gradient descent.

Line 15 computes the root-mean-squared-error of the MLP for each known element in .

In order to enable UBP to process nominal (categorical) attributes, we convert such values to a vector representing a membership weight in each category. For example, a given value of “cat” from the value set {“mouse”,“cat”,“dog”} is represented with the vector in . Unknown values in this attribute are converted to 3 unknown real values, requiring the algorithm to make 3 predictions. After missing values are imputed, we convert the data back to its original form by finding the mode of each categorical distribution. For example, the predicted vector would be converted to a prediction of “mouse”.

3.3 Complexity

Because UBP uses a heuristic technique to detect convergence, a full analysis of the computational complexity of UBP is not possible. However, it is sufficiently similar to other well-known techniques that comparisons can be made. Matrix factorization is generally considered to be a very efficient imputation technique (Koren et al., 2009). Nonlinear PCA is a nonlinear generalization of matrix factorization. If a linear activation function is used, then it is equivalent to matrix factorization, with the additional (very small) computational overhead of computing the activation function. When hidden layers are added, computational complexity is increased, but remains proportional to the number of weights in the network. UBP adds two additional phases of training to Nonlinear PCA. Thus, in the worst case, UBP is 3 times slower than Nonlinear PCA. In practice, however, the first two phases tend to be very fast (because they optimize fewer values), and these preprocessing phases may even cause the third phase to be much faster (by initializing the weights and intrinsic vectors to start at a position much closer to a local optimum).

4 Empirical Validation

Because running time was not a significant issue with UBP, our empirical validation focuses on imputation accuracy. We compared UBP with 5 other imputation algorithms. The imputation methods that we examined as well as their algorithmic parameter values (including UBP) are:

  • Mean/Mode or Baseline (BL). To establish a “baseline” for comparison, we compare with the method of replacing missing values in continuous attributes with the mean of the non-missing values in that attribute, and replacing missing values in nominal (or categorical) attributes with the most common value in the non-missing values of that attribute. There are no parameters for BL. It is expected that any reasonable algorithm should outperform this baseline algorithm with most problems.

  • Fuzzy K-Means (FKM).

    We varied (the number of clusters) over the set , we varied the -norm value for computing distance over the set (Manhattan distance to Euclidean distance), and the fuzzification factor over the set which were reported to be the most effective values (Li et al., 2004).

  • Instance-Based Imputation (IBI). We used cosine correlation to evaluate similarity, and we varied (the number of neighbors) over the set

    . These values were selected because they were all odd, and spanned the range of intuitively suitable values.

  • Matrix Factorization (MF). We varied the number of intrinsic values over the set , and the regularization term over the set . Again, these values were selected to span the range of intuitively suitable values.

  • Nonlinear PCA (NLPCA). We varied the number of hidden units over the set , and the number of intrinsic values over the set . In the case of 0 hidden units, only a single layer of sigmoid units was used.

  • Unuspervised Backpropagation (UBP). The parameters were varied over the same ranges as those of NLPCA.

For NLPCA and UBP, we used the logistic function as the activation function and summed squared error as the objective function. Thus, we imputed missing values in a total of 66000 dataset scenarios. For each algorithm, we found the set of parameters that yielded the best results, and we compared only these best results for each algorithm averaged over the ten runs of differing random seeds. To facilitate reproduction of our results, and to assist with related research efforts, we have integrated our implementation of UBP and all of the competitor algorithms into the Waffles machine learning toolkit (Gashler, 2011)

Figure 2: A comparison of the average sum-squared error in each pattern by 5 imputation techniques over a range of sparsity values with two representative datasets. (Lower is better.)

In order to evaluate the effectiveness of UBP and related imputation techniques, we gathered a set of 24 datasets from the UCI repository (Frank and Asuncion, 2010), the Promise repository (Sayyad Shirabad and Menzies, 2005), and mldata.org: abalone, arrhythmia, bupa, colic, credit-g, desharnais, diabetes, ecoli, eucalyptus, glass, hypothyroid, ionosphere, iris, nursery, ozone, pasture, sonar, spambase, spectrometer, teaching_assistant, vote, vowel, waveform-500, and yeast. To ensure an objective evaluation, this collection was determined before evaluation was performed, and was not modified to emphasize favorable results. To ensure that our results would be applicable for tasks that require generalization, we removed the class labels from each dataset so that only the input features could be used for imputing missing values. We normalized all real-valued attributes to fall within a range from 0 to 1 so that every attribute would carry approximately equal weight in our evaluation. We then removed completely at randomOther categories of “missingness”, besides missing completely at random (MCAR), have been studied (Little and Rubin, 2002), but we restrict our analysis to the imputation of MCAR values. % of the values from each dataset, where .

For each dataset, and for each , we generated 10 datasets with missing values, each using a different random number seed, to make a total of 1200 tasks for evaluation. The task for each imputation algorithm was to restore these missing values. We measured error by comparing each predicted (imputed) value with the corresponding original normalized value, summed over all attributes in the dataset. For nominal (categorical) values, we used Hamming distance, and for real values, we used the squared difference between the original and predicted values. The average error was computed over all of the patterns in each dataset.

Figure 2 shows two representative comparisons of the error scores obtained by each algorithm at varying levels of sparsity. Comparisons with other datasets generally exhibited similar trends. MF, NLPCA, and UBP did much better than other algorithms when 50% or 70% of the values were missing. No algorithm was best in every case, but UBP achieved the best score in more cases than any other algorithm. Table 1 summarizes the results of these comparisons. UBP achieved lower error than the other algorithm in 20 out of 25 pair-wise comparisons, each comparing imputation scores with 24 datasets averaged over 10 runs with different random seeds. In 15 pair-wise comparisons, UBP did better with a sufficient number of datasets to establish statistical significance according to the Wilcoxon Signed Ranks test. These cases are indicated with a “” symbol.

Sparsity Algorithm UBP comparative P-value
0.1 BL 20,0,4 0.001
FKM 19,0,5 0.004
IBI 15,0,9 0.250
MF 12,0,12 0.348
NLPCA 11,7,6 0.176
0.3 BL 20,0,4 0.001
FKM 22,0,2 0.000
IBI 19,0,5 0.022
MF 11,0,13 0.437
NLPCA 10,11,3 0.036
0.5 BL 20,0,4 0.003
FKM 20,0,4 0.000
IBI 17,0,7 0.018
MF 12,0,12 0.394
NLPCA 10,11,3 0.046
0.7 BL 16,0,8 0.022
FKM 17,0,7 0.004
IBI 15,0,9 0.040
MF 13,0,11 0.312
NLPCA 7,14,3 0.038
0.9 BL 7,0,17 0.437
FKM 17,0,7 0.006
IBI 13,0,11 0.147
MF 16,0,8 0.123
NLPCA 2,10,12 0.495
Table 1: A high-level summary of comparisons between UBP and five other imputation techniques. Results are shown for each of the 5 sparsity values. Each row in this table summarizes a comparison between UBP and a competitor imputation algorithm for predicting missing values.

To help us further understand how UBP might lead to better classification results, we also performed experiments involving classification with the imputed datasets. We restored class labels to the datasets in which we had imputed feature values. We then used 5 repetitions of 10-fold cross validation with each of 9 learning algorithms from the WEKA (Witten and Frank, 2005) machine learning toolkit: backpropagation (BP), C4.5 (Quinlan, 1993), 5-nearest neighbor (IB5), locally weighted learning (LWL) (Atkeson et al., 1997), naïve Bayes (NB), nearest neighbor with generalization (NNGE) (Martin, 1995)

, Random Forest (RandF)

(Breiman, 2001), RIpple DOwn Rule Learner (RIDOR) (Gaines and Compton, 1995), and RIPPER (Repeated Incremental Pruning to Produce Error Reduction) (Cohen, 1995). The learning algorithms were chosen with the intent of being diverse from one another, where diversity is determined using unsupervised meta-learning (Lee and Giraud-Carrier, 2011). We evaluated each learning algorithm at each of the previously mentioned sparsity levels using each imputation algorithm with the parameters that resulted in the lowest SSE error score for imputation. The results from this experiment are summarized for each of the 5 sparsity levels in Tables 2 through 6. Each cell in these tables summarizes a set of experiments with an imputation algorithm and classification algorithm pair. The three upper numbers in each cell indicate the number of times that UBP lead to higher, equal, and lower classification accuracy. When the left-most of these three numbers is biggest, this indicates that classification accuracy was higher in more cases when UBP was used as the imputation algorithm. The lower number in each cell indicates the P-value obtained from the Wilcoxon signed ranks test by performing a pair-wise comparison between UBP and the imputation algorithm of that column. When this value is below 0.05, UBP did better in a sufficient number of pair-wise comparisons to demonstrate that it is better with statistical significance. These cases are indicated with a “” symbol.

BP 16,1,7 14,0,10 12,0,12 9,0,15 6,7,11
0.013 0.080 0.835 0.874 0.915
C4.5 16,0,8 18,0,6 12,0,12 14,1,9 9,7,8
0.039 0.011 0.727 0.219 0.715
IB5 17,1,6 17,0,7 15,0,9 11,0,13 9,8,7
0.011 0.005 0.039 0.616 0.688
LWL 16,2,6 12,5,7 7,4,13 9,2,13 7,9,8
0.015 0.134 0.731 0.652 0.601
NB 12,0,12 13,0,11 12,0,12 13,2,9 9,7,8
0.374 0.302 0.594 0.487 0.538
Nnge 17,1,6 19,1,4 17,1,6 16,0,8 10,7,7
0.001 0.000 0.121 0.109 0.353
RandF 17,0,7 16,1,7 14,0,10 16,0,8 8,8,8
0.003 0.006 0.151 0.054 0.572
Ridor 20,0,4 20,1,3 12,0,12 15,0,9 6,8,10
0.000 0.000 0.254 0.220 0.757
RIPPER 16,1,7 12,1,11 12,0,12 15,0,9 10,7,7
0.009 0.134 0.517 0.057 0.388
Table 2: Results of classification tests at a sparsity level of 0.1., (meaning 10% of data values were removed and then imputed prior to classification). The 3 upper numbers in each cell indicate the wins, ties, and losses for UBP in a pair-wise comparison. The lower number in each cell indicates the Wilcoxon signed ranks test P-value evaluated over the 24 datasets. (When the value is smaller than 0.05, indicated with a “” symbol, the performance of UBP was better with statistical significance.)
BP 18,0,6 18,0,6 15,0,9 12,0,12 6,10,8
0.008 0.002 0.048 0.550 0.884
C4.5 15,0,9 15,1,8 15,0,9 13,0,11 7,10,7
0.024 0.017 0.025 0.461 0.735
IB5 16,0,8 18,0,6 10,0,14 10,0,14 4,10,10
0.016 0.004 0.539 0.658 0.970
LWL 17,1,6 16,2,6 15,2,7 12,2,10 7,12,5
0.005 0.008 0.109 0.218 0.484
NB 11,0,13 13,0,11 12,0,12 13,0,11 9,10,5
0.689 0.342 0.374 0.561 0.226
Nnge 14,0,10 16,0,8 15,0,9 14,0,10 7,10,7
0.132 0.020 0.138 0.245 0.774
RandF 14,0,10 17,0,7 15,0,9 13,1,10 6,10,8
0.132 0.021 0.084 0.083 0.842
Ridor 16,1,7 17,1,6 16,0,8 10,0,14 4,10,10
0.004 0.001 0.018 0.583 0.942
RIPPER 14,0,10 16,0,8 17,0,7 9,0,15 7,10,7
0.051 0.042 0.026 0.868 0.500
Table 3: Results of classification tests at a sparsity level of 0.3.
BP 13,1,10 14,1,9 20,1,3 14,0,10 6,11,7
0.055 0.033 0.002 0.195 0.472
C4.5 14,0,10 14,0,10 14,0,10 15,0,9 9,11,4
0.245 0.076 0.080 0.115 0.104
IB5 15,0,9 17,0,7 16,1,7 13,1,10 8,11,5
0.015 0.001 0.002 0.093 0.338
LWL 14,1,9 14,2,8 13,2,9 14,1,9 4,13,7
0.016 0.009 0.077 0.193 0.825
NB 13,0,11 13,0,11 12,1,11 15,0,9 8,11,5
0.302 0.322 0.458 0.363 0.132
Nnge 13,0,11 15,0,9 18,0,6 15,0,9 7,11,6
0.245 0.048 0.021 0.084 0.338
RandF 14,0,10 17,0,7 19,0,5 14,1,9 7,10,7
0.109 0.015 0.002 0.202 0.401
Ridor 15,1,8 16,0,8 19,0,5 13,0,11 6,11,7
0.022 0.005 0.000 0.395 0.758
RIPPER 15,0,9 16,0,8 19,0,5 14,0,10 7,11,6
0.030 0.028 0.001 0.384 0.338
Table 4: Results of classification tests at a sparsity level of 0.5.
BP 13,0,11 17,0,7 17,1,6 16,0,8 8,14,2
0.264 0.008 0.001 0.051 0.026
C4.5 9,0,15 9,2,13 16,0,8 13,0,11 5,14,5
0.763 0.474 0.064 0.332 0.658
IB5 14,0,10 17,0,7 15,0,9 17,0,7 8,14,2
0.165 0.005 0.032 0.008 0.042
LWL 14,1,9 16,1,7 18,2,4 14,1,9 6,16,2
0.013 0.002 0.000 0.121 0.147
NB 16,0,8 15,0,9 14,0,10 12,0,12 6,14,4
0.104 0.203 0.539 0.506 0.305
Nnge 12,1,11 15,0,9 14,0,10 14,0,10 8,14,2
0.169 0.011 0.068 0.018 0.010
RandF 16,0,8 18,0,6 18,0,6 19,0,5 10,12,2
0.017 0.001 0.001 0.002 0.003
Ridor 13,1,10 16,0,8 19,0,5 16,0,8 4,15,5
0.052 0.017 0.000 0.042 0.453
RIPPER 13,0,11 15,0,9 18,0,6 14,0,10 8,15,1
0.211 0.060 0.015 0.195 0.038
Table 5: Results of classification tests at a sparsity level of 0.7.
BP 12,0,12 14,0,10 16,3,5 16,0,8 8,10,6
0.658 0.180 0.033 0.057 0.265
C4.5 8,0,16 10,0,14 13,0,11 15,0,9 6,10,8
0.936 0.689 0.282 0.151 0.827
IB5 10,0,14 11,0,13 11,0,13 17,0,7 5,11,8
0.880 0.637 0.658 0.032 0.712
LWL 12,1,11 14,1,9 15,1,8 14,1,9 4,10,10
0.458 0.070 0.140 0.109 0.949
NB 21,0,3 18,0,6 14,0,10 13,0,11 10,10,4
0.000 0.001 0.172 0.228 0.012
Nnge 14,0,10 17,0,7 14,0,10 18,0,6 8,10,6
0.180 0.004 0.010 0.008 0.265
RandF 12,0,12 15,0,9 17,0,7 20,0,4 10,10,4
0.439 0.145 0.008 0.001 0.074
Ridor 7,0,17 14,0,10 15,0,9 20,0,4 7,10,7
0.880 0.187 0.084 0.002 0.599
RIPPER 10,1,13 12,0,12 18,0,6 19,0,5 7,10,7
0.811 0.363 0.004 0.010 0.450
Table 6: Results of classification tests at a sparsity level of 0.9.

Some classification algorithms were more responsive to the improved imputation accuracy that UBP offered than others. For example, UBP appears to have a more beneficial effect on classification results when used with Random Forest than with naïve Bayes. Overall, UBP was demonstrated to be more beneficial as an imputation algorithm than any of the other imputation algorithms. As expected, NLPCA was the closest competitor. UBP did better than NLPCA in a larger number of cases, but we were not able to demonstrate that it was better with statistical significance until sparsity level 0.7. It appears from these results that the improvements of UBP over NLPCA generally have a bigger impact when there are more missing values in the data. This is significant because the difficulty of effective imputation increases with the sparsity of the data.

To demonstrate the difference between UBP and NLPCA (and, hence, the advantage of 3-phase training), we briefly examine UBP as a manifold learning (or non-linear dimensionality reduction) technique. We trained both NLPCA and UBP using data from the MNIST dataset of handwritten digits. We used a multilayer perceptron with a topology. 2 of the 4 inputs were treated as latent values, and the other 2 inputs we used to specify coordinates in each image. The one output was trained to predict the grayscale pixel value at the specified coordinates. We trained a separate network for each of the 10 digits in the dataset. After training these multilayer perceptrons in this manner, we uniformly sampled the two-dimensional latent inputs in order to visualize how each multilayer perceptron organized the digits that it had learned. Matrix factorization was not suitable for this application because it is linear, so it would predict only a linear gradient instead of an image. The other imputation algorithms are not suitable for this task because they are not designed to be used in a generative manner. Figure 3 shows a sample of ’4’s generated by NLPCA and UBP. (Note that the images shown are not part of the original training data. Rather, each was generated by the multilayer perceptron after it was trained. Because the multilayer perceptron is a continuous function, these digits could be generated with arbitrary resolution, independent of the original training data.)

Both algorithms did approximately equally well at generating digits that appear natural over a range of styles. The primary difference between these results is how they ultimately formed order in the intrinsic values that represent the various styles. In the case of UBP, somewhat better intrinsic organization is formed. Three distinct styles can be observed in three horizontal stripes in Figure 3b: boxey digits in the top two rows, slanted digits in the middle rows, and digits with a closed top in the bottom two rows. The height of the horizontal bar in the digit varies clearly from left-to-right. Along the left, the horizontal bar crosses at a low point, and along the right, the horizontal bar crosses at a high point. In the case of NLPCA, similar styles were organized into clusters, but they do not exhibit the same degree of organization that is apparent with UBP. For example, it does not appear to be able to vary the height of the horizontal bar in each of the three main styles of this digit. This occurs because NLPCA does not have an extra step designed explicitly to promote organization in the intrinsic values.

This demonstration also serves to show that techniques for imputation and techniques for manifold learning are merging. In the case of these handwritten digits, the multilayer perceptron was flexible enough to generate visibly appealing digits, even when the intrinsic organization was poor. However, better generalization can be expected when the mapping is simplest, which occurs when the intrinsic values are better organized. Hence, as imputation and manifold learning are applied to increasingly complex problems, the importance of finding good organization in the intrinsic values will increase. Future work in this area will explore other techniques for promoting organization within the intrinsic values, and contrast them with the simple approach proposed by UBP.

a b
Figure 3: An MLP trained with NLPCA (a) and UBP(b) on digits from the MNIST dataset of handwritten digits. After training, the digits shown here were generated by uniformly sampling over the space of intrinsic values, and generating an image at each sampled intrinsic point.Although both NLPCA and UBP were able to generate high-quality digits, UBP was able to organize the various styles more effectively. Note that the height of the horizontal bar varies from left-to-right. The results from NLPCA do not provide the ability to vary the height of this bar in each of the possible styles of the digit.

to generate high-quality digits, UBP was able to organize the various styles more effectively. Note that the height of the horizontal bar varies from left-to-right. The results from NLPCA do not provide the ability to vary the height of this bar in each of the possible styles of the digit.

5 Conclusions

Since the NetFlix competition, matrix factorization and related linear techniques have generally been considered to be the state-of-the-art in imputation. Unfortunately, the focus on imputing with a few large datasets has caused the community to largely overlook the value of nonlinear imputation techniques. One of the contributions of this paper is the observation that Nonlinear PCA, which was presented almost a decade ago, consistently outperforms matrix factorization across a diversity of datasets. Perhaps this has not yet been recognized because a thorough comparison between the two techniques across a diversity of datasets was not previously performed.

The primary contribution of this paper, however, is an improvement to Nonlinear PCA, which improves its accuracy even more. This contribution consists of a 3-phase training process that initializes both weights and intrinsic vectors to better starting positions. We theorize that this leads to better results because it bypasses many local optima in which the network could otherwise settle, and places it closer to the global optimum.

We empirically compared results from UBP with 5 other imputation techniques, including baseline, fuzzy -means, instance-based imputation, matrix factorization, and Nonlinear PCA, with 24 datasets across a range of parameters for each algorithm. UBP predicts missing values with lower error than any of these other methods in the majority of cases. We also demonstrated that using UBP to impute missing values leads to better classification accuracy than any of the other imputation techniques over all, especially at higher levels of sparsity, which is the most important case for imputation. We demonstrated that UBP is also better-suited for manifold learning than NLPCA with a problem involving handwritten digits. Ongoing research seeks to demonstrate that better organization in the intrinsic variables naturally leads to better generalization. This has significant potential application in unsupervised learning tasks such as automation.


  • Acuña and Rodriguez (2004) Acuña, Edgar, and Caroline Rodriguez. 2004. The treatment of missing values and its effect on classifier accuracy. In Classification, Clustering, and Data Mining Applications. Springer Berlin / Heidelberg, pp. 639–647.
  • Adomavicius and Tuzhilin (2005) Adomavicius, G., and A. Tuzhilin. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE transactions on knowledge and data engineering, 17(6):734–749. ISSN 1041-4347.
  • Alireza Farhangfar (2008) Alireza Farhangfar, Lukasz Kurgan Jennifer Dy. 2008. Impact of imputation of missing values on classification error for discrete data. Pattern Recognition41:3692–3705.
  • Atkeson et al. (1997) Atkeson, Christopher G., Andrew W. Moore, and Stefan Schaal. 1997. Locally weighted learning. Artificial Intelligence Review, 11:11–73. ISSN 0269-2821.
  • Bengio et al. (2007) Bengio, Yoshua, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. In NIPS, pp. 153–160.
  • Bengio et al. (2006) Bengio, Y., H. Schwenk, J.S. Senécal, F. Morin, and J.L. Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning. Springer, pp. 137–186.
  • Breiman (2001) Breiman, L. 2001. Random forests. Machine Learning, 45(1):5–32.
  • Coheh and Shawe-Taylor (1990) Coheh, D., and J. Shawe-Taylor. 1990. Daugman’s gabor transform as a simple generative back propagation network. Electronics Letters, 26(16):1241–1243.
  • Cohen (1995) Cohen, William W. 1995. Fast effective rule induction. In In Proceedings of the Twelfth International Conference on Machine Learning, Morgan Kaufmann, pp. 115–123.
  • Erhan et al. (2009) Erhan, Dumitru, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. 2009. The difficulty of training deep architectures and the effect of unsupervised pre-training. Journal of Machine Learning Research - Proceedings Track, 5:153–160.
  • Frank and Asuncion (2010) Frank, A., and A. Asuncion. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml.
  • Gaines and Compton (1995) Gaines, Brian R., and Paul Compton. 1995. Induction of ripple-down rules applied to modeling large databases. Journal of Intelligent Information Systems, 5:211–228. ISSN 0925-9902.
  • Gashler et al. (2008) Gashler, Michael, Dan Ventura, and Tony Martinez. 2008. Iterative non-linear dimensionality reduction with manifold sculpting. In Advances in Neural Information Processing Systems 20. MIT Press, pp. 513–520.
  • Gashler (2011) Gashler, Michael S. 2011. Waffles: A machine learning toolkit. Journal of Machine Learning Research, MLOSS 12:2383–2387.
  • Hinton (1988) Hinton, G. E. 1988. Generative back-propagation. In Abstracts 1st INNS.
  • Hinton and Salakhutdinov (2006) Hinton, G. E., and R. R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 28(313 (5786)):504–507.
  • Jones (1996) Jones, Michael P. 1996. Indicator and stratificaion methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91:222–230.
  • Koren et al. (2009) Koren, Y., R. Bell, and C. Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37. ISSN 0018-9162.
  • Lakshminarayan et al. (1996) Lakshminarayan, K., S.A. Harp, R. Goldman, T. Samad, and others. 1996. Imputation of missing data using machine learning techniques. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 140–145.
  • Lee and Giraud-Carrier (2011) Lee, Jun, and Christophe Giraud-Carrier. 2011. A metric for unsupervised metalearning. Intelligent Data Analysis, 15(6):827–841.
  • Li et al. (2004) Li, Dan, Jitender Deogun, William Spaulding, and Bill Shuart. 2004. Towards missing data imputation: A study of fuzzy k-means clustering method. In Rough Sets and Current Trends in Computing, Volume 3066 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, pp. 573–579.
  • Li et al. (2008) Li, Q., Y.P. Chen, and Z. Lin. 2008. Filtering Techniques For Selection of Contents and Products. In Personalization of Interactive Multimedia Services: A Research and development Perspective, Nova Science Publishers. ISBN 978-1-60456-680-2.
  • Little and Rubin (2002) Little, R.J.A., and D.B. Rubin. 2002. Statistical analysis with missing data.

    Wiley series in probability and mathematical statistics. Probability and mathematical statistics. Wiley.

    ISBN 9780471183860.
  • Luengo (2011) Luengo, J. 2011. Soft Computing based learning and Data Analysis: Missing Values and Data Complexity. Ph. D. thesis, Department of Computer Science and Artificial Intelligence, University of Granada.
  • Martin (1995) Martin, Brent. 1995. Instance-based learning: nearest neighbour with generalisation. Technical Report 95/18, University of Waikato, Department of Computer Science.
  • Quinlan (1989) Quinlan, J. R. 1989. Unknown attribute values in induction. In Proceedings of the sixth international workshop on Machine learning, pp. 164–168.
  • Quinlan (1993) Quinlan, J. Ross. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, USA.
  • Rubin (1987) Rubin, D. B. 1987. Multiple Imputation for Nonresponse in Surveys. Wiley.
  • Rumelhart et al. (1986) Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986. Learning representations by back-propagating errors. Nature, 323:9.
  • Sarwar et al. (2001) Sarwar, B., G. Karypis, J. Konstan, and J. Reidl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, ACM. ISBN 1581133480. pp. 285–295.
  • Sayyad Shirabad and Menzies (2005) Sayyad Shirabad, J., and T.J. Menzies. 2005. The PROMISE Repository of Software Engineering Databases. School of Information Technology and Engineering, University of Ottawa, Canada. http://promise.site.uottawa.ca/SERepository.
  • Schafer and Graham (2002) Schafer, J.L., and J.W. Graham. 2002. Missing data: Our view of the state of the art. Psychological methods, 7(2):147. ISSN 1939-1463.
  • Scholz et al. (2005) Scholz, M., F. Kaplan, C. L. Guy, J. Kopka, and J. Selbig. 2005. Non-linear pca: a missing data approach. Bioinformatics, 21(20):3887–3895.
  • Shafer (1997) Shafer, J. L. 1997. Analysis of Incomplete Multivariate Data. Chapman and Hall, London.
  • Takács et al. (2009) Takács, G., I. Pilászy, B. Németh, and D. Tikk. 2009. Scalable collaborative filtering approaches for large recommender systems. The Journal of Machine Learning Research, 10:623–656.
  • Tenenbaum et al. (2000) Tenenbaum, Joshua B., Vin de Silva, and John C. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323.
  • Werbos (1990) Werbos, P.J. 1990. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560. ISSN 0018-9219.
  • Witten and Frank (2005) Witten, Ian H., and Eibe Frank. 2005. Data Mining: Practical machine learning tools and techniques (2nd ed.). Morgan Kaufmann, San Fransisco.
  • Zhang and Wang (2007) Zhang, Z., and J. Wang. 2007. MLLE: Modified locally linear embedding using multiple weights. Advances in Neural Information Processing Systems, 19:1593. ISSN 1049-5258.