Autoencoders and Probabilistic Inference with Missing Data: An Exact Solution for The Factor Analysis Case

01/11/2018 ∙ by Christopher K. I. Williams, et al.

Latent variable models can be used to probabilistically "fill-in" missing data entries. The variational autoencoder architecture (Kingma and Welling, 2014; Rezende et al., 2014) includes a "recognition" or "encoder" network that infers the latent variables given the data variables. However, it is not clear how to handle missing data variables in this network. We show how to calculate exactly the latent posterior distribution for the factor analysis (FA) model in the presence of missing data, and note that this solution exhibits a non-trivial dependence on the pattern of missingness. Experiments compare the effectiveness of various approaches to filling in the missing data.




1 Introduction

Latent variable models, like factor analysis and “deeper” versions such as the variational autoencoder (VAE, Kingma and Welling 2014; Rezende et al. 2014) and generative adversarial networks (GANs, Goodfellow et al. 2014), are a compelling approach to modelling structure in complex high-dimensional data such as images.

One important use case of such models is when some part of the observable data is missing—such as when some part of an image is not observed. In this case we would like to use the latent variable model to “inpaint” the missing data. Another example is modelling 3-D volumetric data—we may have observations of an object from one viewpoint, and wish to make inferences about the whole 3-D object.

The VAE is composed of two parts, an encoder (or recognition) network that predicts the distribution of the latent variables given the data, and a decoder (or generative) model that maps from the latent variables to the visible variables. However, if some part of the visible data is missing, how can we make use of the encoder network—what can be done about the missing entries? One simple fix is to replace the missing value with a constant such as zero (see e.g. Pathak et al. 2016, Nazabal et al. 2018) or “mean imputation”, i.e. filling in the missing value with the unconditional mean of that variable. However, these are not principled solutions, and do not maintain uncertainty about the missing values.

Rezende et al. (2014) have shown that one can construct a Markov chain Monte Carlo (MCMC) method to sample from the posterior over the missing data for a VAE, but this is an iterative technique that has the obvious disadvantage of taking a long time (and much computation) for the chain to converge.

This paper is concerned with inference for the latent variables given a particular pre-trained generative model. The issue of learning a model in the presence of missing data is a different issue; for PCA a survey of methods is given e.g. in Dray and Josse (2015).

The non-linear encoder and decoder networks of the VAE make an exact analysis of missing-data inference techniques challenging. As such we focus on the factor analysis (FA) model as an example autoencoder (with linear encoder and decoder networks) for which analytic inference techniques can be derived. In this paper we show that for the FA model one can carry out exact (non-iterative) inference for the latent variables in a single feedforward pass. However, the parameters of the required linear feedforward model depend non-trivially on the missingness pattern of the data, requiring a different matrix inversion for each pattern. Below we consider three approximations to exact inference, including a “denoising encoder” approach to the problem, inspired by the denoising autoencoder of Vincent et al. (2008). Experiments compare the effectiveness of various approaches to filling in the missing data.

2 Theory

We focus on generative models which have latent variables associated with each datapoint. We assume that such a model has been trained on fully visible data, and that the task is to perform inference of the latent variables in the presence of missing data, and reconstruction of the missing input data.

Consider a generative latent-variable model that consists of a prior distribution $p(z)$ over latent variables $z$ of dimension $K$, and a conditional distribution $p(x|z)$ over data variables $x$ given latents $z$. When there is missing data, the data variables can be separated into the visible variables $x_v$ and missing variables $x_m$. $m$ is a missing-data indicator function such that $m_j = 1$ if $x_j$ has been observed, and $m_j = 0$ if $x_j$ is missing. Let $x$ have dimension $D$, with $D_v$ visible and $D_m$ missing variables.

Given a datapoint $x$ we would like to be able to obtain a latent posterior $p(z|x_v)$ conditioned on the visible variables $x_v$. We assume below that the missing data is missing at random (MAR, see Little and Rubin 1987 for further details), i.e. that the missingness mechanism does not depend on $x_m$.

One can carry out exact inference for linear subspace models such as factor analysis and its special case probabilistic principal components analysis (PPCA, Tipping and Bishop 1999). Let $p(z) = N(0, I_K)$ and $p(x|z) = N(Wz + \mu, \Psi)$, where $W$ is the $D \times K$ factor loadings matrix, $\mu$ is the mean (offset) vector in the data space, and $\Psi$ is a $D$-dimensional diagonal covariance matrix. In these models the general form of the posterior is $p(z|x_v) = N(\mu_{z|x_v}, \Sigma_{z|x_v})$, where

$$\Sigma_{z|x_v} = (I_K + W_v^T \Psi_v^{-1} W_v)^{-1}, \qquad (1)$$

$$\mu_{z|x_v} = \Sigma_{z|x_v} W_v^T \Psi_v^{-1} (x_v - \mu_v), \qquad (2)$$

see e.g. Bishop (2006, sec. 12.2), where $W_v$ denotes the submatrix of $W$ relating to the visible variables, and similarly for $\Psi_v$. These equations are simply the standard form for a FA model when the missing variables have been marginalized out. The expression for $\Sigma_{z|x_v}$ can be re-written as

$$\Sigma_{z|x_v} = (I_K + W^T M \Psi^{-1} W)^{-1}, \qquad (3)$$

where $M$ is a diagonal matrix with $j$th entry $m_j$ (footnote 1). This means that the diagonal elements of $M \Psi^{-1}$ will be $m_j / \psi_j$: if $m_j = 1$ (so $x_j$ is visible) then $x_j$ is observed with variance $\psi_j$, but for missing data dimensions $m_j = 0$ implies an effective infinite variance for $x_j$, meaning that any data value for this missing dimension will be ignored. Note also that an information-theoretic argument that conditioning on additional variables never increases entropy shows that $\Sigma_{z|x}$ will have a determinant no larger than that of $\Sigma_{z|x_v}$ (footnote 2).

Footnote 1: Given that $M$ is idempotent, i.e. $M^2 = M$, one does not need to make use of the usual construction $M \Psi^{-1} M$ with $M$ on either side of $\Psi^{-1}$.

Footnote 2: The entropy argument strictly applies when taking expectations over $x_v$ or $x$, but for Gaussian distributions the predictive uncertainty is independent of the value of $x_v$ or $x$, so it does imply the desired conclusion.
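As a rough numerical check of the two equivalent forms of the posterior (visible submatrices versus the full $W$ with a 0/1 mask on the noise precisions), the following NumPy sketch may help. The function names, toy dimensions and random parameters are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def posterior_submatrix(W, mu, psi, x, m):
    """Posterior over z using only the rows of W for visible variables."""
    v = m.astype(bool)
    Wv, psi_v = W[v], psi[v]
    precision = np.eye(W.shape[1]) + Wv.T @ (Wv / psi_v[:, None])
    cov = np.linalg.inv(precision)
    mean = cov @ (Wv.T @ ((x[v] - mu[v]) / psi_v))
    return mean, cov

def posterior_masked(W, mu, psi, x, m):
    """Same posterior using the full W and the diagonal of M Psi^{-1}."""
    weights = m / psi                       # m_j / psi_j
    precision = np.eye(W.shape[1]) + W.T @ (W * weights[:, None])
    cov = np.linalg.inv(precision)
    resid = np.where(m == 1, x - mu, 0.0)   # missing slots are ignored
    mean = cov @ (W.T @ (weights * resid))
    return mean, cov

# Toy model (assumed values, for illustration only).
rng = np.random.default_rng(0)
D, K = 6, 2
W = rng.normal(size=(D, K))
mu = rng.normal(size=D)
psi = rng.uniform(0.5, 1.5, size=D)
x = rng.normal(size=D)
m = np.array([1, 0, 1, 1, 0, 1])            # 1 = observed, 0 = missing

mean_a, cov_a = posterior_submatrix(W, mu, psi, x, m)
mean_b, cov_b = posterior_masked(W, mu, psi, x, m)
_, cov_full = posterior_masked(W, mu, psi, x, np.ones(D, dtype=int))
```

Both functions should agree up to floating-point error, and the determinant of `cov_full` should not exceed that of `cov_a`, in line with the entropy argument.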

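Equation 3 can be rewritten in an interesting way. Let $w_j^T$ denote the $j$th row of $W$. Then

$$\Sigma_{z|x_v}^{-1} = I_K + \sum_{j=1}^{D} m_j \psi_j^{-1} w_j w_j^T, \qquad (4)$$

where the second term on the RHS is a sum of rank-1 matrices. This arises from the fact that $p(z|x_v) \propto p(z) \prod_{j: m_j = 1} p(x_j|z)$, and can be related to the product of Gaussian experts construction (Williams and Agakov, 2002). Similarly the mean has the form

$$\mu_{z|x_v} = \Sigma_{z|x_v} \sum_{j=1}^{D} m_j \psi_j^{-1} (x_j - \mu_j) w_j. \qquad (5)$$

Again note how each observed dimension contributes one non-zero term to the sum on the RHS.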
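The rank-1 view can be sketched as a loop that starts from the prior precision and adds one term per observed dimension. This is a minimal illustration under assumed toy parameters; the function name is hypothetical.

```python
import numpy as np

def posterior_by_rank1_updates(W, mu, psi, x, m):
    """Accumulate one rank-1 precision update per observed dimension."""
    K = W.shape[1]
    precision = np.eye(K)        # prior precision I_K
    h = np.zeros(K)              # precision-weighted mean
    for j in range(W.shape[0]):
        if m[j] == 1:
            w_j = W[j]
            precision += np.outer(w_j, w_j) / psi[j]
            h += w_j * (x[j] - mu[j]) / psi[j]
    cov = np.linalg.inv(precision)
    return cov @ h, cov

# Toy model (assumed values, for illustration only).
rng = np.random.default_rng(1)
D, K = 5, 3
W = rng.normal(size=(D, K))
mu = rng.normal(size=D)
psi = rng.uniform(0.5, 1.5, size=D)
x = rng.normal(size=D)
m = np.array([1, 1, 0, 1, 0])

mean_r1, cov_r1 = posterior_by_rank1_updates(W, mu, psi, x, m)

# Batch form on the visible submatrices, for comparison.
v = m.astype(bool)
cov_batch = np.linalg.inv(np.eye(K) + W[v].T @ (W[v] / psi[v][:, None]))
mean_batch = cov_batch @ (W[v].T @ ((x[v] - mu[v]) / psi[v]))
```

The loop makes explicit that missing dimensions contribute nothing to either sum.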

The non-identifiability of the FA model due to “rotation of factors” (see e.g. Mardia et al. 1979, sec. 9.6) means one can transform $z$ with an orthogonal matrix $U$ so that $p(z|x)$ is a diagonal Gaussian. If $W^T \Psi^{-1} W = U \Lambda U^T$, then setting $\tilde{W} = W U$ makes the corresponding $\Sigma_{z|x}$ be diagonal. Vedantam et al. (2018, sec. 2) suggest that for a VAE with missing data, one can combine together diagonal Gaussians in $z$-space from each observed data dimension to obtain a posterior using a product of Gaussian experts construction. However, our analysis above shows that this cannot be exact: even if we are in the basis where $p(z|x)$ is a diagonal Gaussian, we note from eq. 4 that the contribution of each observed dimension is a rank-1 update to $\Sigma_{z|x_v}^{-1}$, and thus there will in general be no basis in which all of these updates will be diagonal. However, note that a rank-1 update involves the same number of entries ($K$) as a diagonal update.
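The following sketch illustrates the rotation argument numerically: after rotating the loadings by the eigenvectors of $W^T \Psi^{-1} W$, the complete-data posterior precision is diagonal, but the precision for a pattern with missing entries generally is not. The toy parameters and helper names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 8, 3
W = rng.normal(size=(D, K))
psi = rng.uniform(0.5, 1.5, size=D)

# Eigendecomposition W^T Psi^{-1} W = U Lambda U^T (symmetric, so eigh applies).
lam, U = np.linalg.eigh(W.T @ (W / psi[:, None]))
W_rot = W @ U                                # rotated loadings

def posterior_precision(W, psi, m):
    """I_K + W^T M Psi^{-1} W for a 0/1 mask m."""
    return np.eye(W.shape[1]) + W.T @ (W * (m / psi)[:, None])

def off_diagonal_norm(A):
    return np.linalg.norm(A - np.diag(np.diag(A)))

full = posterior_precision(W_rot, psi, np.ones(D))
partial = posterior_precision(W_rot, psi, np.array([1, 0, 1, 1, 0, 0, 1, 1]))
off_full = off_diagonal_norm(full)           # numerically zero
off_partial = off_diagonal_norm(partial)     # clearly non-zero
```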

Figure 1: Illustration of the five reconstruction methods on a specific image (leftmost) with the top left quarter missing. Top row: second column shows the original quarter, columns 3-7 show the corresponding imputations. The bottom row shows the pixelwise squared error.
Figure 2: Illustration of how the posterior means and standard deviations for unobserved pixels change as the amount of missing data decreases. Data above the red line is present, below it is missing and is imputed with exact inference.

If the data does not contain missing values, it is straightforward to build a “recognition network” that predicts $\mu_{z|x}$ and $\Sigma_{z|x}$ from $x$ using only matrix-vector multiplies, as per equations 1 and 2 without the $v$ subscripts, as the matrix inverse can be computed once and stored. However, if there is missing data then there is a different $\Sigma_{z|x_v}$ for each of the non-trivial patterns of missingness (of order $2^D$ of them), and thus exact inference cannot be carried out efficiently in a single feedforward inference network. This leads to a number of approximations:
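When only a few distinct patterns occur in practice (for example, four missing quarters), one pragmatic option is to cache one inverse per pattern seen so far. The class below is a hypothetical sketch of that idea, not something described in the paper; all names and toy values are assumptions.

```python
import numpy as np

class ExactEncoder:
    """Exact FA encoder that caches one posterior covariance per mask."""

    def __init__(self, W, mu, psi):
        self.W, self.mu, self.psi = W, mu, psi
        self._cov_cache = {}

    def _cov(self, m):
        key = m.astype(np.uint8).tobytes()   # hashable key for the pattern
        if key not in self._cov_cache:
            K = self.W.shape[1]
            precision = np.eye(K) + self.W.T @ (self.W * (m / self.psi)[:, None])
            self._cov_cache[key] = np.linalg.inv(precision)
        return self._cov_cache[key]

    def encode(self, x, m):
        cov = self._cov(m)
        resid = np.where(m == 1, x - self.mu, 0.0)
        return cov @ (self.W.T @ (resid / self.psi)), cov

    @property
    def n_inversions(self):
        return len(self._cov_cache)

# Toy usage: 20 examples drawn from only four missingness patterns.
rng = np.random.default_rng(0)
D, K = 8, 2
enc = ExactEncoder(rng.normal(size=(D, K)), np.zeros(D), np.ones(D))
patterns = []
for q in range(4):
    m = np.ones(D)
    m[2 * q: 2 * q + 2] = 0
    patterns.append(m)
for i in range(20):
    z_mean, z_cov = enc.encode(rng.normal(size=D), patterns[i % 4])
```

With fully random missingness the cache offers little benefit, since almost every example has its own pattern.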

Approximation 1: Full Covariance Approximation (FCA)

If we carry out mean imputation for $x_m$ and replace those entries with $\mu_m$ to give $\tilde{x}$, then we can rewrite eq. 2 as

$$\mu_{z|x_v} = \Sigma_{z|x_v} W^T \Psi^{-1} (\tilde{x} - \mu), \qquad (6)$$

as the missing data slots in $\tilde{x} - \mu$ will be filled in by zeros and have no effect on the calculation. However, note that $\Sigma_{z|x_v}$ does depend on the pattern of missingness as per eq. 3. A simple approximation is to replace it with $\Sigma_{z|x}$ which would apply when $x$ is fully observed; we term this the full covariance approximation, as it uses the covariance relating to a fully observed $x$. We denote this approximation as $\mu_{FCA}$, as written in eq. 7:

$$\mu_{FCA} = \Sigma_{z|x} W^T \Psi^{-1} (\tilde{x} - \mu). \qquad (7)$$
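A minimal sketch of FCA as a fixed linear encoder: the complete-data covariance is inverted once, and every input is mean-imputed before a single matrix-vector product. The helper name and the toy parameters are assumptions for illustration.

```python
import numpy as np

def make_fca_encoder(W, mu, psi):
    """Build a pattern-independent linear encoder from the complete-data covariance."""
    K = W.shape[1]
    cov_full = np.linalg.inv(np.eye(K) + W.T @ (W / psi[:, None]))
    A = cov_full @ (W.T / psi)               # K x D matrix, computed once

    def encode(x, m):
        x_tilde = np.where(m == 1, x, mu)    # mean imputation of missing slots
        return A @ (x_tilde - mu)

    return encode, cov_full

# Toy model (assumed values, for illustration only).
rng = np.random.default_rng(0)
D, K = 6, 2
W = rng.normal(size=(D, K))
mu = rng.normal(size=D)
psi = rng.uniform(0.5, 1.5, size=D)
x = rng.normal(size=D)

encode_fca, cov_full = make_fca_encoder(W, mu, psi)
mu_fca = encode_fca(x, np.array([1, 0, 1, 1, 0, 1]))
mu_fca_complete = encode_fca(x, np.ones(D, dtype=int))
```

With no missing entries the output coincides with the exact posterior mean; with missing entries it is only an approximation, since the covariance factor ignores the pattern.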


Approximation 2: Scaled Covariance Approximation (SCA)

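When no data is observed, we have that the posterior covariance is equal to the prior, $I_K$. When there is complete data, then

$$\Sigma_{z|x}^{-1} = I_K + \sum_{j=1}^{D} P_j, \qquad (8)$$

where $P_j = \psi_j^{-1} w_j w_j^T$, making use of eq. 4. Although as we have seen the $P_j$'s are all different, a simple approximation would be to assume they are all equal, and thus rearrange eq. 8 to obtain $P_j \simeq (\Sigma_{z|x}^{-1} - I_K)/D$ for $j = 1, \ldots, D$ (footnote 3). This would give the approximation

$$\Sigma_{SCA}^{-1}(m) = I_K + \frac{D_v}{D} (\Sigma_{z|x}^{-1} - I_K) = \frac{D_m}{D} I_K + \frac{D_v}{D} \Sigma_{z|x}^{-1}, \qquad (9)$$

which linearly interpolates between $I_K$ and $\Sigma_{z|x}^{-1}$ depending on the amount of missing data. Using this approximation instead of $\Sigma_{z|x}$ in eq. 7 yields the scaled covariance approximation (SCA). The inversion of the RHS of eq. 9 can be carried out simply and analytically if $\Sigma_{z|x}$ is diagonal, as per the rotation-of-factors discussion above.

Footnote 3: A slightly more general decomposition would be to set $P_j = \alpha_j (\Sigma_{z|x}^{-1} - I_K)$ with $\sum_j \alpha_j = 1$, but then it would be more difficult to decide how to set the $\alpha_j$s; this could e.g. be learned for a given missingness mechanism.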
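The analytic inversion mentioned above can be sketched with a single eigendecomposition of $W^T \Psi^{-1} W$: the interpolated precision shares its eigenvectors, so only the eigenvalues need rescaling by the visible fraction $D_v / D$. The function name and toy parameters are assumptions for illustration.

```python
import numpy as np

def make_sca_encoder(W, mu, psi):
    """Scaled covariance approximation with one up-front eigendecomposition."""
    D, K = W.shape
    # Sigma_{z|x}^{-1} - I_K = W^T Psi^{-1} W = U diag(lam) U^T
    lam, U = np.linalg.eigh(W.T @ (W / psi[:, None]))
    B = W.T / psi                                  # W^T Psi^{-1}

    def encode(x, m):
        frac = m.sum() / D                         # D_v / D
        cov = (U / (1.0 + frac * lam)) @ U.T       # inverse of I + frac * U diag(lam) U^T
        x_tilde = np.where(m == 1, x, mu)          # mean imputation
        return cov @ (B @ (x_tilde - mu)), cov

    return encode

# Toy model (assumed values, for illustration only).
rng = np.random.default_rng(0)
D, K = 8, 3
W = rng.normal(size=(D, K))
mu = rng.normal(size=D)
psi = rng.uniform(0.5, 1.5, size=D)
x = rng.normal(size=D)
m = np.array([1, 0, 1, 1, 0, 1, 1, 0])

encode_sca = make_sca_encoder(W, mu, psi)
mean_sca, cov_sca = encode_sca(x, m)
_, cov_none = encode_sca(x, np.zeros(D, dtype=int))
_, cov_all = encode_sca(x, np.ones(D, dtype=int))
```

The two extremes recover the prior covariance (nothing observed) and the complete-data covariance (everything observed).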

Approximation 3: Denoising Encoder (DE)

Vincent et al. (2008) introduced the denoising autoencoder (DAE), where the goal is to train an autoencoder to take a corrupted data point as input, and predict the original, uncorrupted data point as its output. In Vincent et al. (2008) the stochastic corruption process sets a randomly-chosen set of the inputs to zero; the goal is then to reconstruct the corrupted inputs from the non-corrupted ones. As we have seen above, if the data is centered, then setting input values to zero is equivalent to mean imputation. Despite this, the motivation for denoising autoencoders was not so much to handle missing data, but to more robustly learn about the structure of the underlying data manifold. Also, as seen from our results above, a simple feedforward net that ignores the pattern of missingness (as in the DAE) cannot perform exact inference even for the factor analysis case.

However, we can develop ideas relating to the DAE to create a denoising encoder (DE). In our situation we have complete training data (i.e. without missingness). Thus for each $x$ we can obtain the corresponding posterior for $z$ (as per eqs. 1 and 2 without any missing data). We then create a pattern of missingness and apply it to $x$ to obtain $\tilde{x}$. One can then train a regression model from $\tilde{x}$ to the corresponding posterior mean of $z$. Note that this averages over the patterns of missingness, rather than handling each one separately. The decoder for the DE is the exact solution $p(x|z) = N(Wz + \mu, \Psi)$. A limitation of the DE is that to train it we need examples of the patterns of missingness that occur in the data, which is not the case for the exact or approximate methods described above.
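The training recipe above can be sketched on synthetic data as follows. This is a minimal illustration with a linear regression and no intercept, evaluated on the training set itself; the model sizes, the noise level and the drop probability are assumed values, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
D, K, N = 8, 2, 2000
W = rng.normal(size=(D, K))
mu = rng.normal(size=D)
psi = np.full(D, 0.1)

# Complete training data sampled from the FA model.
Z = rng.normal(size=(N, K))
X = Z @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(psi)

# Regression targets: posterior means of z under complete data.
cov_full = np.linalg.inv(np.eye(K) + W.T @ (W / psi[:, None]))
Z_post = (X - mu) @ (W / psi[:, None]) @ cov_full

# One random missingness pattern per example; mean-impute and centre.
drop_prob = 0.3                                   # assumed for illustration
Mask = (rng.random((N, D)) > drop_prob).astype(float)
X_in = Mask * (X - mu)                            # zeros in the missing slots

# Least-squares fit from corrupted inputs to posterior means.
B, *_ = np.linalg.lstsq(X_in, Z_post, rcond=None)

# Encode with the regression, decode with W z + mu, score missing entries only.
X_hat = (X_in @ B) @ W.T + mu
missing = Mask == 0
de_mse = np.mean((X_hat - X)[missing] ** 2)
mean_mse = np.mean((mu - X)[missing] ** 2)        # mean-imputation baseline
```

A single weight matrix `B` serves all patterns, which is what makes the encoder cheap and also what prevents it from being exact.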

3 Experiments

Table 1: Mean squared imputation error with standard errors for various imputation methods on PPCA generated data with random (R) and quarters (Q) patterns of missingness. Columns: Mean imputation, Full-covariance approx., Scaled-covariance approx., Denoising encoder, Denoising encoder*, Exact inference. Rows: R, Q. For a particular data example the squared imputation error is the average error over imputed values for that example. Denoising encoder* is trained on random patterns of missingness, whereas Denoising encoder is trained on the same type of missingness patterns as it is tested on. For the Denoising encoder and Denoising encoder* on the Random data, note that the performance is identical as the training and testing data are the same in each case. This is indicated with brackets.

Table 2: Mean squared imputation error with standard errors for various imputation methods on the real Frey faces dataset. Columns as in Table 1. Rows: R (Train, Test), Q (Train, Test). The details are as in Table 1.

We trained a PPCA model on the Frey faces dataset, which consists of 1965 frames of a greyscale video sequence with resolution $20 \times 28$ pixels. Pixel intensities were rescaled to lie between -1 and 1. We used 43 latent components for the model, explaining of the data variability. of the data frames were randomly selected as a training set, and the remainder were held out as a test set.

We estimate the parameters of the model using maximum likelihood, and obtain the analytic solutions as in Tipping and Bishop (1999). In the case of the denoising encoder, we create one missingness-corrupted pattern for each training example when estimating the regression model.
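For reference, the closed-form maximum-likelihood PPCA fit of Tipping and Bishop (1999) can be sketched as below, taking the arbitrary rotation to be the identity. The function name and the synthetic data are assumptions for illustration; the noise model is $\Psi = \sigma^2 I$.

```python
import numpy as np

def fit_ppca(X, K):
    """Closed-form ML estimates of W, mu and sigma^2 for PPCA."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)       # ML sample covariance
    evals, evecs = np.linalg.eigh(S)             # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder to descending
    sigma2 = evals[K:].mean()                    # average discarded variance
    W = evecs[:, :K] * np.sqrt(evals[:K] - sigma2)
    return W, mu, sigma2

# Synthetic data (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
N, D, K = 500, 10, 3
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D)) + 0.3 * rng.normal(size=(N, D))

W_ml, mu_ml, sigma2_ml = fit_ppca(X, K)
model_cov = W_ml @ W_ml.T + sigma2_ml * np.eye(D)
```

The fitted model covariance reproduces the top $K$ eigenvalues of the sample covariance and replaces the remaining ones by their average.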

Given a trained PPCA model we investigate various approaches for handling missing data with a data imputation task. Given a partially observed data example, the task is to predict the values of the missing data variables using the trained model. We consider two patterns of missing data. In the first setting each data variable is independently dropped out with a fixed probability. In the second setting one of the four image quarters is dropped out in each example (see Fig. 1).
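The two missingness settings can be generated along the following lines. The drop probability used here and the layout of the image as 28 rows by 20 columns are assumptions for illustration, as are the function names.

```python
import numpy as np

def random_mask(rng, shape, drop_prob):
    """1 = observed, 0 = missing; each pixel is dropped independently."""
    return (rng.random(shape) >= drop_prob).astype(int)

def quarter_mask(height, width, quarter):
    """Drop one quarter: 0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right."""
    mask = np.ones((height, width), dtype=int)
    top = quarter in (0, 1)
    left = quarter in (0, 2)
    r0, r1 = (0, height // 2) if top else (height // 2, height)
    c0, c1 = (0, width // 2) if left else (width // 2, width)
    mask[r0:r1, c0:c1] = 0
    return mask

rng = np.random.default_rng(0)
height, width = 28, 20                      # assumed row/column layout
m_random = random_mask(rng, (height, width), drop_prob=0.5)
m_quarters = [quarter_mask(height, width, q) for q in range(4)]
m_flat = m_quarters[0].reshape(-1)          # flattened mask for a D-vector x
```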

We consider five approaches to handling missing data. The first method is simply to inpaint the missing variables with their training set means, without using the latent variable model at all. This is denoted by “mean imputation”. The second approach is to use FCA, and the third SCA. For the fourth method we train a denoising encoder as described above, using a linear regression model. The final method is to use exact inference as per eqs. 1 and 2.
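A sketch of the imputation task and its error measure, under assumed toy parameters: missing entries are filled with the decoded exact posterior mean, the squared error is averaged over the imputed entries of each example, and the mean and standard error are then taken across examples. Function names are hypothetical.

```python
import numpy as np

def impute_exact(W, mu, psi, x, m):
    """Fill missing entries with W mu_{z|x_v} + mu; observed entries are kept."""
    precision = np.eye(W.shape[1]) + W.T @ (W * (m / psi)[:, None])
    resid = np.where(m == 1, x - mu, 0.0)
    z_mean = np.linalg.solve(precision, W.T @ (resid / psi))
    return np.where(m == 1, x, W @ z_mean + mu)

def imputation_error(X_true, X_imputed, Mask):
    """Per-example MSE over imputed entries, then mean and standard error."""
    missing = Mask == 0
    n_missing = missing.sum(axis=1)
    keep = n_missing > 0                     # skip fully observed examples
    sq = ((X_true - X_imputed) ** 2) * missing
    per_example = sq[keep].sum(axis=1) / n_missing[keep]
    return per_example.mean(), per_example.std(ddof=1) / np.sqrt(per_example.size)

# Synthetic FA data (assumed values, for illustration only).
rng = np.random.default_rng(4)
D, K, N = 10, 2, 400
W = rng.normal(size=(D, K))
mu = rng.normal(size=D)
psi = np.full(D, 0.1)
X = rng.normal(size=(N, K)) @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(psi)
Mask = (rng.random((N, D)) >= 0.5).astype(int)

X_exact = np.array([impute_exact(W, mu, psi, x, m) for x, m in zip(X, Mask)])
X_mean = np.where(Mask == 1, X, mu)          # mean-imputation baseline
exact_mse, exact_se = imputation_error(X, X_exact, Mask)
mean_mse, mean_se = imputation_error(X, X_mean, Mask)
```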

3.1 Results

We present two sets of results, firstly for data generated from the PPCA model fitted to the Frey data (denoted PPCA-Frey data), and secondly for the Frey data itself.

For the PPCA-Frey data, the mean squared imputation error results are shown in Table 1. In this case 400 test cases are generated from the PPCA model. For the DE case, 1600 training cases are also generated to train the DE model. For the denoising encoder on the quarters data, we train it either on data with missing quarters, or on data with pixels with a random pattern of missingness; the latter is denoted as Denoising encoder* in the tables.

In Table 1 we see that FCA improves over mean imputation, and that SCA improves over FCA. Training the denoising encoder (which uses the same input as FCA and SCA) gives a further improvement for the quarters data, but is very similar to SCA for the random missingness pattern. Note that on the quarters data, the Denoising encoder* results are considerably worse than those of the DE trained on missing-quarters training data. For both the random and quarters data, the exact inference method is the best, as expected.

For the Frey data the mean-squared imputation error results are shown in Table 2, and a comparison on a specific image is shown in Figure 1. For the random missingness data, mean imputation is the worst and exact inference the best, with FCA, DE and SCA falling in between (from worst to best). Exact inference significantly outperforms the denoising encoder on the test set. For the quarters dataset, we see a similar pattern of performance, except for slightly better performance of the denoising encoder than exact inference on the test set. Notice also that when using the Denoising encoder* (trained using random patterns of missingness) the performance is significantly worse. The slight performance gain of the DE on the quarters data over exact inference may be due to the fact that exact inference applies to the PPCA model, but as the Frey data was not generated from this model there can be some room for improvement over the “exact” method. Also note that for the quarters data only four patterns of missingness are used, so the DE network is likely to be able to learn about this more easily than random missingness. There are noticeable differences between the training and test set errors for the DE model. This can be at least partially explained by noting that on the training set the DE model has access to the posterior for $z$ based on the complete data, so it is not surprising that it does better here.

Figure 2 illustrates how the posterior means and standard deviations of unobserved pixels change as the amount of missing data decreases, using exact inference. As would be expected the uncertainty decreases as the amount of visible data increases, but notice how some regions like the mouth retain higher uncertainty until observed.
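The shrinking uncertainty can be reproduced in a toy setting: the predictive variance of an unobserved $x_j$ is $w_j^T \Sigma_{z|x_v} w_j + \psi_j$, and it cannot increase as more entries become visible. The sketch below reveals entries from the top down; the model parameters and function name are assumptions for illustration.

```python
import numpy as np

def predictive_std(W, psi, m):
    """Predictive std of every x_j given the entries flagged as visible in m."""
    precision = np.eye(W.shape[1]) + W.T @ (W * (m / psi)[:, None])
    cov_z = np.linalg.inv(precision)
    var_x = np.einsum("jk,kl,jl->j", W, cov_z, W) + psi
    return np.sqrt(var_x)

# Toy model (assumed values, for illustration only).
rng = np.random.default_rng(5)
D, K = 12, 3
W = rng.normal(size=(D, K))
psi = np.full(D, 0.2)

# Track the last entry, which stays unobserved while the others are revealed.
last = D - 1
stds = []
for n_visible in range(D):
    m = np.zeros(D)
    m[:n_visible] = 1
    stds.append(predictive_std(W, psi, m)[last])
stds = np.array(stds)
```

The first value equals the prior predictive standard deviation $\sqrt{\|w_j\|^2 + \psi_j}$, and the sequence is non-increasing thereafter.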

4 Conclusions

Above we have discussed various approaches to handling missing data with probabilistic autoencoders. We have shown how to calculate exactly the latent posterior distribution for the factor analysis (FA) model in the presence of missing data. This solution exhibits a non-trivial dependence on the pattern of missingness, requiring a different matrix inversion for each pattern. We have also described three approximate inference methods (FCA, SCA and DE). Our experimental results show the relative effectiveness of these methods on the Frey faces dataset with random or structured missingness. A limitation of the DE method is that the structure of the missingness needs to be known at training time—our results demonstrate that performance deteriorates markedly when the missingness assumptions at train and test time differ.

Possible future directions include the use of nonlinear DE models, and the development of methods to handle missing data in nonlinear autoencoders.


CW and AN would like to acknowledge the funding provided by the UK Government’s Defence & Security Programme in support of the Alan Turing Institute. The work of CW is supported in part by EPSRC grant EP/N510129/1 to the Alan Turing Institute. CN is supported by a PhD studentship from the EPSRC CDT in Data Science EP/L016427/1. CW thanks Kevin Murphy for a useful email discussion.


  • Bishop (2006) Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  • Dray and Josse (2015) Dray, S. and Josse, J. (2015). Principal components analysis with missing values: a comparative survey of methods. Plant Ecology, 216:657–667.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680.
  • Kingma and Welling (2014) Kingma, D. P. and Welling, M. (2014). Auto-Encoding Variational Bayes. In ICLR.
  • Little and Rubin (1987) Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley.
  • Mardia et al. (1979) Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, London.
  • Nazabal et al. (2018) Nazabal, A., Olmos, P. M., Ghahramani, Z., and Valera, I. (2018). Handling Incomplete Heterogeneous Data using VAEs. arXiv:1807.03653.
  • Pathak et al. (2016) Pathak, D., Krähenbühl, P., Donahue, J., and Darrell, T. (2016). Context Encoders: Feature Learning by Inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML.
  • Tipping and Bishop (1999) Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis. J. Roy. Statistical Society B, 61(3):611–622.
  • Vedantam et al. (2018) Vedantam, R., Fischer, I., Huang, J., and Murphy, K. P. (2018). Generative Models of Visually Grounded Imagination. In ICLR.
  • Vincent et al. (2008) Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pages 1096–1103.
  • Williams and Agakov (2002) Williams, C. K. I. and Agakov, F. V. (2002). Products of Gaussians and Probabilistic Minor Component Analysis. Neural Computation, 14(5):1169–1182.