1 Introduction
AEVB (Auto-Encoding Variational Bayes) is one of the most widely used algorithms for learning generative models kingma2013auto. Approximate posterior inference is an important step within AEVB which allows for sampling from the latent space conditioned on the input. The VAE (Variational Auto-Encoder kingma2013auto), the most well-known implementation of AEVB, relies on an encoder to perform approximate posterior inference. In machine learning, incomplete data, i.e. data with missing values, is widespread. For incomplete data, encoder-based approaches such as the VAE pose a cyclic dependency with generative modeling: to learn a generative model, an encoder needs to be used, and to use an encoder, the missing dimensions need to be replaced, ideally using the underlying data distribution (i.e. the generative model). Commonly, this cycle is broken by suboptimally imputing the missing values (mostly with zeros, even in the most recently proposed VAE imputation models ivanov2018variational). This in turn causes volatility in the input space of the VAE, yet the model is assumed to be able to handle this volatility and reliably perform posterior approximation.
Within the AEVB algorithm, the approximate posterior parameters need not necessarily be inferred using an encoder. As opposed to using an encoder to infer the parameters of the approximate posterior distribution, we propose to optimize the parameters of the approximate posterior directly, without an encoder. This results in an alternative implementation of AEVB, called Variational Auto-Decoder (VAD). Using the reparameterization trick kingma2013auto for well-known distributions, the approximate posterior parameters can then be learned end-to-end by maximizing the variational lower bound, which is differentiable w.r.t. the parameters of the well-known distribution (hence gradient-based approaches can easily be used). Within the AEVB learning framework, we specifically study which of VAD or VAE can maximize the variational lower bound more efficiently and learn a more accurate generative model in the presence of missing data. We study this question over multiple datasets from different domains and over the following scenarios: 1) similar train- and test-time missingness, as well as 2) test-only missingness and train-only missingness.
2 Related Work
Learning from incomplete data is a fundamental research area in machine learning. Notable related works fall into the categories outlined below.
In a neural framework, Variational Auto-Encoders have been commonly used for learning from incomplete data mccoy2018variational; williams2018autoencoders; nazabal2018handling. A particular implementation based on Conditional Variational Auto-Encoders (CVAE) has been shown to achieve superior performance over existing methods for learning from incomplete data ivanov2018variational.
Generative Adversarial Networks (GANs) have been used for missing data imputation yoon2018gain. Aside from GANs being particularly hard to train salimans2016improved, VAE-based approaches have been shown to perform better in practice ivanov2018variational. This implementation of the VAE is the baseline we compare against in this paper.
Previously proposed Markov-chain based approaches require computationally heavy sampling and the full data to be observable during training rezende2014stochastic; sohl2015deep; bordes2017learning. One appeal of these models is that they can directly maximize the evidence (as opposed to the lower bound), however at a heavy computational cost.
Inpainting approaches exist in computer vision; these are particularly engineered for visual tasks and sometimes require similar train- and test-time missingness for best performance pathak2016context; yang2017high. Other approaches have relied on simpler learning techniques such as Gaussian Mixture Models delalleau2012efficient; pelckmans2005handling or Principal Component Analysis dray2015principal. Such models have fallen short in recent years because they lack the capacity to deal with the increasingly nonlinear nature of many real-world datasets.

3 Model
Auto-Encoding Variational Bayes (AEVB) kingma2013auto is among the most successful methods for learning generative models. Using a reparameterization trick on a set of known distributions, AEVB allows learning to be carried out with Stochastic Variational Inference (SVI) rezende2014stochastic. A particularly important step within AEVB is learning an approximate posterior distribution. In a VAE (Variational Auto-Encoder kingma2013auto), this approximate posterior is commonly parameterized by a neural network: the encoder outputs the parameters of the approximate posterior.

In this section, we outline an alternative implementation of the AEVB algorithm for the case of incomplete data. We call this approach Variational Auto-Decoder (VAD) since it does not use an encoder to infer the parameters of the approximate posterior. VAD initializes the parameters of the approximate posterior randomly and updates them during training using gradient-based methods. We first outline the problem formulation, and subsequently the training and inference procedures for VAD.
3.1 Problem Formulation
We assume a ground-truth random variable $\bar{\mathbf{x}} \in \mathbb{R}^{d}$, sampled from a ground-truth distribution $p(\bar{\mathbf{x}})$, with $d$ being the dimension of the input space. Unfortunately, the space of $\bar{\mathbf{x}}$ is considered to not be fully observable. The part that is observable we denote via the random variable $\mathbf{x}$, regarded hereon as the incomplete input. We assume that a random variable $\mathbf{m} \in \{0,1\}^{d}$ denotes whether or not the data is observable through an indicator in each dimension, with value $1$ being observable and $0$ being missing. We formalize the process of generating the random variable $\mathbf{x}$ as first drawing a ground-truth data sample $\bar{\mathbf{x}}$ from $p(\bar{\mathbf{x}})$ and a missingness pattern sample $\mathbf{m}$ from $p(\mathbf{m})$, and subsequently removing information from $\bar{\mathbf{x}}$ using $\mathbf{m}$. We draw $n$ i.i.d. samples from the above process to build a dataset (notably, each datapoint can have a distinct missingness pattern). For the rest of this paper, the incomplete dataset is regarded as $X$ and the missingness patterns as $M$. The ground-truth dataset is regarded as $\bar{X}$; it can never directly be a part of training, validation or testing since it is considered strictly unknown.
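The generative process above (draw a ground-truth sample, draw a missingness pattern, remove information) can be sketched as follows. The standard-normal data distribution, the dimensions, and the roughly 30% missing rate are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 8  # hypothetical sample count and input dimension

# Ground-truth samples; a standard normal stands in for the (unknown in
# practice) data distribution.
X_true = rng.normal(size=(n, d))

# Per-datapoint missingness patterns: 1 = observable, 0 = missing.
M = (rng.random(size=(n, d)) < 0.7).astype(float)  # roughly 30% missing

# Incomplete dataset: observed entries are kept, missing ones marked as NaN.
X = np.where(M == 1.0, X_true, np.nan)
```

Each row of `M` is drawn independently, so every datapoint carries its own missingness pattern, as the text notes.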
3.2 Training
Assuming that the data distribution can be approximated by a parametric family of distributions $p_\theta$ with parameters $\theta$, learning can be done by maximizing the likelihood $p_\theta(\mathbf{x})$ w.r.t. $\theta$. In practice, the log of the likelihood is often calculated and used. In a latent variable modeling framework, the evidence can often be defined by marginalizing over a latent variable $\mathbf{z}$ as follows:
$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z} \qquad (1)$$
In practice, calculating the marginal integral over $\mathbf{z}$ is either expensive or intractable. Subsequently, direct latent posterior inference using $p_\theta(\mathbf{z} \mid \mathbf{x})$, which is an essential step in latent variable modeling, becomes impractical.
For any given $\mathbf{x}$ and any conditional density $q_\phi(\mathbf{z} \mid \mathbf{x})$, with $\mathbf{z}$ as an unobserved random variable and $\phi$ as the parameters of $q$, we can rewrite the evidence in Equation 1 as follows:
$$\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})}\right] + D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\right) \qquad (2)$$
This holds with the condition that $q_\phi(\mathbf{z} \mid \mathbf{x}) = 0$ whenever $p_\theta(\mathbf{z} \mid \mathbf{x}) = 0$. To simplify notation, we refer to the true posterior as $p_\theta(\mathbf{z} \mid \mathbf{x})$ and the approximate posterior as $q_\phi(\mathbf{z} \mid \mathbf{x})$. More simply, the likelihood in Equation 2 can be written as:
$$\log p_\theta(\mathbf{x}) = \mathcal{L}(\theta, \phi; \mathbf{x}) + D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\right) \qquad (3)$$
In the above equation, $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. One could directly minimize this asymmetric divergence to approximate the true posterior with an approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$. However, doing so requires samples to be drawn from the true posterior. Markov Chain Monte Carlo (MCMC) approaches can be used to draw such samples, but they are usually very costly. $\mathcal{L}(\theta, \phi; \mathbf{x})$ is referred to as the Evidence Lower Bound (ELBO), or simply the variational lower bound. It is equal to the sum of the expected value of the log joint density under the distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$ and the entropy of $q_\phi(\mathbf{z} \mid \mathbf{x})$:
$$\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x}, \mathbf{z})\right] + \mathbb{H}\!\left[q_\phi(\mathbf{z} \mid \mathbf{x})\right] \qquad (4)$$
Through the above formulation, rather than learning the model parameters through the exact likelihood of the data, variational Bayes methods approximate the true posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$ with a simpler distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$. Equation 4 can be rewritten as:

$$\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right) \qquad (5)$$
In the above equation, the first term encourages the latent samples to show high expected likelihood (through reconstruction of $\mathbf{x}$) under the approximate posterior distribution, and the second term encourages the latent samples to simultaneously follow the latent prior $p(\mathbf{z})$.
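The two terms of the lower bound in Equation 5 can be estimated by Monte Carlo. Below is a minimal sketch assuming a diagonal-Gaussian approximate posterior, a standard-normal prior, a unit-variance Gaussian likelihood, and a hypothetical identity decoder; these specific choices are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def elbo_estimate(x, mu, log_var, decode, n_samples=64):
    """Monte-Carlo estimate of E_q[log p(x|z)] - KL(q(z) || N(0, I))."""
    std = np.exp(0.5 * log_var)
    # Reparameterized samples: z = mu + std * eps, eps ~ N(0, I).
    eps = rng.normal(size=(n_samples, mu.size))
    z = mu + std * eps
    # Expected log-likelihood under a unit-variance Gaussian likelihood
    # (dropping the additive constant -0.5 * d * log(2 * pi)).
    recon = np.array([decode(zi) for zi in z])
    log_lik = -0.5 * np.sum((recon - x) ** 2, axis=1)
    # KL(N(mu, diag(exp(log_var))) || N(0, I)) in closed form.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return log_lik.mean() - kl

# With mu = 0 and log_var = 0, q equals the prior, so the KL term is zero
# and only the (negative) expected reconstruction term remains.
x = np.zeros(3)
val = elbo_estimate(x, mu=np.zeros(3), log_var=np.zeros(3), decode=lambda z: z)
```

The closed-form KL is available because both $q$ and the prior are Gaussian; for other posterior families it would itself need a Monte-Carlo estimate.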
In a Variational Auto-Decoder framework, the approximate posterior distribution is not parameterized by a neural network, but rather by a well-known distribution directly. Therefore, as opposed to randomly initializing the weights of an encoder, we simply randomly initialize the intrinsic parameters of the approximate posterior. We focus on the family of multivariate Gaussian distributions for this purpose; however, other distributions can also be used, as long as the reparameterization trick kingma2013auto can be defined for them. We define a multivariate Gaussian approximate posterior as:

$$q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}\!\left(\mathbf{z};\, \boldsymbol{\mu},\, \boldsymbol{\Sigma}\right), \qquad \phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}\} \qquad (6)$$
Note that $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are learnable parameters of this approximate posterior distribution. The reparameterization of this posterior is essentially defined as $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\Sigma}^{1/2} \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Using this reparameterization, the gradient of the lower bound $\mathcal{L}$ can be directly backpropagated to the mean $\boldsymbol{\mu}$ and variance $\boldsymbol{\Sigma}$.

The likelihood is similarly defined as a multivariate Gaussian, with the missing dimensions of $\mathbf{x}$ marginalized:
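The claim that gradients can flow to the mean and variance through reparameterized samples can be checked numerically. The sketch below uses an arbitrary test function f(z) = z² of our own choosing and compares the reparameterized gradient estimates against the known closed forms:

```python
import numpy as np

rng = np.random.default_rng(2)

# z = mu + sigma * eps with eps ~ N(0, 1): the sampling noise is separated
# from the learnable parameters, so d/dmu and d/dsigma pass through z.
mu, sigma = 1.5, 0.5
eps = rng.normal(size=200_000)
z = mu + sigma * eps

# For f(z) = z^2: E[f(z)] = mu^2 + sigma^2, so the true gradients are
# d/dmu = 2 * mu and d/dsigma = 2 * sigma.
grad_mu_est = np.mean(2.0 * z)           # chain rule with dz/dmu = 1
grad_sigma_est = np.mean(2.0 * z * eps)  # chain rule with dz/dsigma = eps
```

With enough samples both estimates land close to the analytic gradients, which is exactly what allows the lower bound to be maximized by backpropagation.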
$$p_\theta(\mathbf{x} \mid \mathbf{z}) = \prod_{j \,:\, \mathbf{m}_j = 1} \mathcal{N}\!\left(\mathbf{x}_j;\, \mathrm{dec}_\theta(\mathbf{z})_j,\, \sigma_j^2\right) \qquad (7)$$
This density is centered around $\mathrm{dec}_\theta(\mathbf{z})$ as its mean. The covariance is defined as a diagonal positive semi-definite matrix; since it is diagonal, marginalizing the missing dimensions amounts to dropping the terms with $\mathbf{m}_j = 0$ from the product. $\mathrm{dec}_\theta$ is a neural decoder which takes as input the samples drawn from the posterior in Equation 6.
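The marginalized likelihood in Equation 7 is straightforward to implement for a diagonal Gaussian: missing dimensions simply drop out of the sum of per-dimension log-densities. A sketch, with illustrative numbers of our own:

```python
import numpy as np

def masked_gaussian_loglik(x, mean, log_var, mask):
    """Log-density of a diagonal Gaussian over the observed dimensions only.

    Because the covariance is diagonal, marginalizing a missing dimension
    (mask == 0) amounts to dropping its term from the sum.
    """
    per_dim = -0.5 * (np.log(2.0 * np.pi) + log_var
                      + (x - mean) ** 2 / np.exp(log_var))
    return float(np.sum(mask * per_dim))

x = np.array([0.2, 0.0, -1.0])
mask = np.array([1.0, 0.0, 1.0])  # the middle dimension is missing
ll = masked_gaussian_loglik(x, mean=np.zeros(3), log_var=np.zeros(3), mask=mask)
```

A useful sanity check is that the value stored in a missing slot cannot affect the result, which is exactly what marginalization requires.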
The optimization subsequently follows AEVB: first sampling from the approximate posterior to calculate a Monte-Carlo estimate of the lower bound $\mathcal{L}$, and subsequently maximizing $\mathcal{L}$ w.r.t. $\phi$ and $\theta$ (Equations 6 and 7). Algorithm 1 summarizes the training (and also the inference described in the next section).

3.3 Inference
Typically, once a generative model is learned, it is used to sample data that belongs to the underlying learned distribution. Sampling can be done by sampling from the latent space and subsequently using the decoder to generate the data, with no other steps required.
In certain cases a new datapoint is given and the goal is to sample the posterior. Calculating the evidence in Equation 1 is still infeasible, even after training is done. Therefore, for the new datapoint, the same variational lower bound needs to be maximized. Using the same process as in the training framework of Equation 5, the parameters of the approximate posterior are initialized randomly and iteratively updated until convergence. Thus, inference is similar to training, except that the learned parameters ($\theta$) of the decoder are not updated during inference. Once $\mathcal{L}$ is maximized, samples of the approximate posterior can be used to generate instances similar to the given datapoint.
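Inference for a new datapoint can thus be sketched as gradient ascent on the observed-dimension likelihood with the decoder frozen. For clarity, the sketch below uses a fixed linear map as a stand-in for a trained decoder and optimizes only the posterior mean; both simplifications are our own, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(3)

d_z, d_x = 2, 4
W = rng.normal(size=(d_x, d_z))  # frozen "decoder": never updated below

x_obs = np.array([0.5, -0.3, 0.1, 0.8])
mask = np.array([1.0, 1.0, 0.0, 1.0])  # third dimension missing at test time

def masked_sq_err(mu):
    # Squared error on observed dimensions: the negative log-likelihood of a
    # unit-variance masked Gaussian (as in Equation 7), up to constants.
    return float(np.sum(mask * (x_obs - W @ mu) ** 2))

mu = rng.normal(size=d_z)  # randomly initialized posterior mean
err_before = masked_sq_err(mu)
for _ in range(2000):
    grad = W.T @ (mask * (x_obs - W @ mu))  # gradient of the log-likelihood
    mu += 0.01 * grad                       # ascent step; W stays fixed
err_after = masked_sq_err(mu)
```

Only `mu` changes during the loop, mirroring the paper's point that test-time inference reuses the training objective while the decoder parameters remain frozen.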
4 Experiments
Based on Equation 5, the variational lower bound depends on the expectation of the log-likelihood under the approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$. This term, which relies on the incomplete input $\mathbf{x}$, indicates how well samples drawn from the approximate posterior are able to recreate $\mathbf{x}$ (using the likelihood in Equation 7). If, due to the parameterization of the approximate posterior, this term cannot be maximized efficiently for incomplete data during training, then maximizing the lower bound will subsequently be impacted. (In simple terms, if the approximate posterior and decoder cannot reproduce the data efficiently in the best case, then generative modeling will not be successful, regardless of the second term of Equation 5, which is in any case the same for both models.) For both VAD and VAE, we aim to address whether missing data can cause issues for maximizing this expectation. Therefore, we specifically study the lower bound with only the first term, to compare whether either of the two models inherently falls short in the presence of missing data. (The implementation of the VAE in this paper, using Equation 7 and the missing mask, is identical to a model published during the preparation of this paper, called VAEAC ivanov2018variational, in which missing values are replaced by zeros before being input to the encoder. Its authors performed substantial experiments and found the model comparable to or better than previous approaches in most of the data imputation and posterior approximation tasks studied in their paper.)
In the experiments, both models are trained on identical data and maximize the same lower bound as described in the previous paragraph. (Code and data are available through https://github.com/A2Zadeh/VariationalAutodecoder.) The only difference between the two models is therefore the parameterization of the approximate posterior distribution: for VAD it is the parameters of the distribution, and for VAE it is the weights of the neural encoder. A validation set is used to choose the best-performing hyperparameter setup, based exactly on the lower bound $\mathcal{L}$. (Both models undergo substantial hyperparameter search as described in Appendix A, with exact values; hyperparameters include, but are not limited to, the number of layers in the decoder (and encoder for VAE), the number of neurons in each layer, and the latent space dimensionality.) Subsequently, the best trained model is used on the test data. The ground truth is never used during training, validation or testing (unless required by the experiment, as described later in this section). Only for evaluation purposes, after inference is done on the test set, is the ground truth revealed. To report a measure that is easy to compare, we report the MSE (Mean Squared Error; calculated per dimension, and therefore independent of the missing rate) between the decoded mean of the approximate posterior and the data, in the following categories: 1) Incomplete: we report the MSE between the incomplete data (available dimensions) and the output of the decoder. Since the incomplete data is the basis of the likelihood in Equation 7, we expect models to show low MSE for the incomplete data. 2) Missing: once inference is done over the incomplete data, missing values are revealed to evaluate the imputation performance of both models. 3) Full: after revealing the missing values, we calculate the performance over the full ground-truth data.

Specifically, the following two scenarios are studied in the form of experiments in this paper:
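The three evaluation categories translate directly into per-dimension MSE computations; a small sketch with made-up numbers:

```python
import numpy as np

def category_mse(x_true, x_hat, mask):
    """Per-dimension MSE for the three evaluation categories:
    incomplete (observed dims), missing (revealed dims), full (all dims)."""
    sq = (x_true - x_hat) ** 2
    return {
        "incomplete": float(sq[mask == 1].mean()),
        "missing": float(sq[mask == 0].mean()),
        "full": float(sq.mean()),
    }

x_true = np.array([1.0, 2.0, 3.0, 4.0])   # revealed only for evaluation
x_hat = np.array([1.0, 2.5, 3.0, 3.0])    # decoded posterior mean
mask = np.array([1, 0, 1, 0])
scores = category_mse(x_true, x_hat, mask)
```

Because each category averages over its own dimensions, the reported numbers do not grow mechanically with the missing rate, matching the text's note that MSE is computed per dimension.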
Experiment 1: We study the case where, during train and test time, the data follows a similar missing rate. The distribution of missingness is also the same in these cases.
Experiment 2: In real-world situations it is very unlikely that the data will follow the same missing rate during train and test time. Therefore, we also compare the two models for different train- and test-time missing rates. Since there may be many combinations of missing rates for train and test time, we only study the extreme cases: only test-time missingness (train on ground truth), and only train-time missingness (test on ground truth).
4.1 Datasets
We experiment with a variety of datasets from different areas within machine learning. To better understand the ranges of MSE for each dataset, we report a baseline obtained by taking the mean of the ground-truth training data as the prediction during test time. This baseline indicates the limit beyond which models are performing worse than simply projecting the mean of the ground-truth data regardless of the input (with a very minimal deviation across experiments for each dataset, this threshold also applies to the missing and incomplete components).
Toy Synthetic Dataset: We study a case of synthetic data where we control the distributional properties of the data. In the generation process, we first acquire a set of independent dimensions randomly sampled from 5 univariate distributions with uniform random parameters: {Normal, Uniform, Beta, Logistic, Gumbel}. Often in realistic scenarios there are interdependencies among the dimensions; hence we proceed to generate interdependent dimensions by picking random subsets of the independent components and combining them using random operations such as weighted multiplication, affine addition, and activation. Using this method, we generate a dataset containing datapoints with ground-truth dimension . Further details of the generation and exact ranges are given in the Supplementary Material. The threshold MSE for this dataset is .
Menpo Facial Landmark Dataset: Menpo2D contains facial images with various subjects, expressions, and poses zafeiriou2017menpo. Due to these variations, the nature of the dataset is complex. Since the Menpo dataset has ground-truth annotations for landmarks regardless of self-occlusions in the natural image, it allows for creating real-life ground-truth data for our experiments. The purpose of using this dataset is to compare the two models on how well they recreate the structure of an object given only a subset of available keypoints. The threshold MSE for this dataset is .
CMU-MOSEI Dataset: CMU Multimodal Sentiment and Emotion Intensity (CMU-MOSEI) is an in-the-wild dataset for multimodal sentiment and emotion recognition zadeh2018multimodal. The dataset consists of 23,500 sentence utterances from online YouTube monologue videos, with three modalities: text (words), vision (gestures) and acoustic (sound). For the text modality, the dataset contains GloVe word embeddings pennington2014glove. For the visual modality, the dataset contains facial action units, facial landmarks, and head pose information. For the acoustic modality, the dataset contains high- and low-level descriptors following COVAREP degottex2014covarep. We use the expected multimodal context for each sentence, similar to unordered compositional approaches in NLP iyyer2015deep. The threshold MSE for this dataset is .
Fashion-MNIST: Fashion-MNIST (https://github.com/zalandoresearch/fashionmnist) is a variant of the MNIST dataset. It is considered more challenging than MNIST, since variations within fashion items are usually more complex than those within written digits. The dataset consists of 28×28 grayscale images from 10 fashion-item classes. The threshold MSE for this dataset is .
We base our experiments on Missing Completely At Random (MCAR) missingness, which is a severe case of missingness. For each datapoint, we sample a missing pattern $\mathbf{m}$ with missing ratio ranging from to with increments of . This form of missingness essentially allows each dimension to unexpectedly go missing.
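Sampling an MCAR pattern reduces to independent Bernoulli draws per dimension. A sketch; the 0.1 to 0.9 grid and the array sizes below are illustrative stand-ins, not the exact values elided in the text:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_mcar_mask(shape, missing_ratio, rng):
    """MCAR: each dimension of each datapoint is dropped independently
    with probability `missing_ratio` (1 = observed, 0 = missing)."""
    return (rng.random(shape) >= missing_ratio).astype(float)

# One mask per missing ratio on an illustrative grid.
masks = {r: sample_mcar_mask((1000, 20), r, rng)
         for r in (0.1, 0.3, 0.5, 0.7, 0.9)}
```

Because every entry is dropped independently of the data and of the other entries, any dimension can "unexpectedly go missing", which is the defining property of MCAR used in the experiments.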
4.2 Experiment 1
In this experiment, both the train and test data follow the same missing ratio. For each ratio, models are trained using the likelihood in Equation 7 and maximize the lower bound in Equation 5. Figure 1 shows the results of this experiment for the best validated models of both VAD and VAE. In all three categories, incomplete, missing (imputation) and full (ground truth), VAD shows superior performance to VAE. As the missingness increases, the gap between the two models widens in all three categories (except for CMU-MOSEI, where the performance gap is large in the incomplete and missing categories even for small missing ratios). This essentially indicates that VAE becomes increasingly unstable in the presence of missing data. Specifically, for incomplete data VAE is not able to perform reliable posterior inference, since the output of the decoder deviates increasingly far from the available input. VAD, on the other hand, shows steady performance in the incomplete category. The performance of both models is naturally affected in the missing category as the missing ratio increases (in reality, some of the missing data may not be imputable given the available input). However, the increasing gap between the two models also appears in the missing category. Finally, the comparison in the full category shows that VAD is able to regenerate the ground truth better than VAE. Overall, Figure 1 suggests that an approximate posterior inferred by an encoder conditioned on a volatile input becomes increasingly unstable as missingness becomes more severe.
4.3 Experiment 2
While in the previous experiment both the train and test stages followed the same missing rate, realistic scenarios are often more complex. In the most extreme cases, we study two possible scenarios: test-only missingness and train-only missingness. For this experiment, we choose the best-performing models from the previous experiment based on their performance on the incomplete data in the validation set.
In the test-time missingness scenario, models are trained on the ground-truth data without any expectation that, during test time, an arbitrary subset of the data may go missing. During test time this assumption proves wrong, and the data indeed goes missing, following a missing ratio ranging from to . Figure 2 shows the results of this experiment for the synthetic and Menpo2D datasets in the full category (due to space constraints, we only report 2 datasets in the full category). In both cases the performance of VAE is substantially affected by the missing dimensions during test time, achieving far inferior performance to the case in Experiment 1 (being trained on the same missingness). The performance of VAD remains almost unchanged for both the synthetic and Menpo2D datasets, and relatively similar to Experiment 1. We also demonstrate this visually in an inpainting scenario. We compare the models when they are trained on ground truth and tested on incomplete data, against when they are trained on a similar missing ratio. Both models are trained on the Fashion-MNIST ground-truth train set, and subsequently the data may go missing during testing. Figure 3 shows the test-time performance of both models for different missing ratios, as well as block-sized missingness. Visually, it can be seen that VAE suffers heavily if it is trained on ground truth but the data goes missing during test time. In fact, at high missing rates VAE simply blurs out around the available datapoints, while VAD is able to recover the missing areas of the image. Compared to the case where the missing ratio is the same, we observe that it is crucial for VAE to train on the same missing rate as at test time, while VAD does not suffer from this.
The train-time missingness scenario is the opposite of the above (not to be confused with denoising methods or dropout, which map noisy input to the ground truth during train time). In this scenario the ground-truth training set can never be fully observed during training. Models are trained on a train set with a missing ratio ranging from to . During testing, they perform inference at a different missing rate, in the extreme case on the ground-truth test set. The right side of Figure 2 shows the results of this experiment for both VAD and VAE. We observe a similar trend between the two models, with VAD remaining consistent while the performance of VAE deteriorates as the missing rate increases.
5 Conclusion
In this paper, we proposed an alternative implementation of AEVB for the case of incomplete data, called Variational Auto-Decoder (VAD). We studied the effect of missing data on an approximate posterior conditioned directly on the incomplete input via an encoder (i.e. VAE). We showed that such conditioning may not allow the variational lower bound to be maximized efficiently, due to poor performance in maximizing the expected likelihood (under the approximate posterior) of the incomplete data. We showed that VAD is better suited to this case, since it does not take the volatile data as input. The approximate posterior in VAD is parameterized by a known distribution, whose parameters are directly optimized in a variational learning framework. For VAD, as for VAE, the parameters of the approximate posterior can be learned using gradient-based approaches. When train and test followed similar missing ratios, our experimental results showed superior performance for VAD. This extended to the cases of only test-time missingness and only train-time missingness, where VAD also showed superior performance.
References
 [1] Florian Bordes, Sina Honari, and Pascal Vincent. Learning to generate samples from noise through infusion training. arXiv preprint arXiv:1703.06975, 2017.
 [2] Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. Covarep: A collaborative voice analysis repository for speech technologies. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 960–964. IEEE, 2014.
 [3] Olivier Delalleau, Aaron Courville, and Yoshua Bengio. Efficient em training of gaussian mixtures with missing data. arXiv preprint arXiv:1209.0521, 2012.
 [4] Stéphane Dray and Julie Josse. Principal component analysis with missing values: a comparative survey of methods. Plant Ecology, 216(5):657–667, 2015.

 [5] Oleg Ivanov, Michael Figurnov, and Dmitry Vetrov. Variational autoencoder with arbitrary conditioning. 2018.
 [6] Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1681–1691, 2015.
 [7] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [8] Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [9] John T McCoy, Steve Kroon, and Lidia Auret. Variational autoencoders for missing data imputation with application to a simulated milling circuit. IFAC-PapersOnLine, 51(21):141–146, 2018.
 [10] Alfredo Nazabal, Pablo M Olmos, Zoubin Ghahramani, and Isabel Valera. Handling incomplete heterogeneous data using vaes. arXiv preprint arXiv:1807.03653, 2018.

 [11] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
 [12] Kristiaan Pelckmans, Jos De Brabanter, Johan AK Suykens, and Bart De Moor. Handling missing values in support vector machine classifiers. Neural Networks, 18(5-6):684–692, 2005.
 [13] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. 2014.
 [14] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [15] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
 [16] Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.
 [17] Christopher KI Williams, Charlie Nash, and Alfredo Nazábal. Autoencoders and probabilistic inference with missing data: An exact solution for the factor analysis case. arXiv preprint arXiv:1801.03851, 2018.

 [18] Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multiscale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6721–6729, 2017.
 [19] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. Gain: Missing data imputation using generative adversarial nets. arXiv preprint arXiv:1806.02920, 2018.
 [20] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and LouisPhilippe Morency. Multimodal language analysis in the wild: Cmumosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2236–2246, 2018.
 [21] Stefanos Zafeiriou, George Trigeorgis, Grigorios Chrysos, Jiankang Deng, and Jie Shen. The menpo facial landmark localisation challenge: A step towards the solution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, volume 1, page 2, 2017.
Appendix A Implementation Remarks
Here we detail the hyperparameter space used for the experiments. Where possible, we tried to establish a fair comparison between the VAD and VAE models by using similar hyperparameters; however, both models underwent a substantial grid search. We varied three main hyperparameters: the dimensionality of the posterior space, the number of feedforward hidden layers in the decoder and the encoder (encoder only for VAE), and the number of hidden units in each hidden layer. A summary is shown in Table 1.
During inference with VAD models, we simply stopped once the model reached a plateau. Since VAD models have a high degree of freedom in the approximate posterior, we observed that it is crucial to use Adam [7] for learning the parameters of the approximate distribution. For both models, learning rates of for the latent variables and hidden units were used.
During inference, both models use an approximate posterior with a learnable variance. In practice, however, when using incomplete data, learning the variance was a very unstable process for VAE. VAE models showed very high sensitivity to even small variances, quickly degrading to performance similar to projecting the mean in each dimension. We believe the combination of the noise from the imputed values and the noise added through the approximate posterior may cause too much uncertainty for VAE. While VAD models suffered from the same instability when learning the variance, their performance remained better than VAE's.
Therefore, during our experiments we treated the approximate posterior variance as a hyperparameter and trained the models with different variances. In this way, results improved substantially. The best-performing variance may change depending on the problem and the range of the input space; in general, however, we observed that very large variances did not converge well, while small variances did not yield the best results.
A.1 Synthetic Data Generation
The parameters of the synthetic data are outlined in Table 2.
Table 1: Hyperparameter search space.
Hyperparameter  Group  Values 
# of latent variables  A  25, 50, 100 
B  50, 100, 400  
# of hidden units per layer  A  50, 100, 200 
B  100, 200, 400  
# of hidden layers  A, B  2, 4, 6 
LR of network parameters and latent variables  A, B 
Table 2: Parameter ranges for synthetic data generation.
Distribution  Parameter  Range of values 
Normal  
Uniform  
Beta  
Logistic  
Gumbel  