Variational Auto-Decoder: Neural Generative Modeling from Partial Data

03/03/2019 · Amir Zadeh et al., Carnegie Mellon University

Learning a generative model from partial data (data with missingness) is a challenging area of machine learning research. We study a specific implementation of the Auto-Encoding Variational Bayes (AEVB) algorithm, named in this paper the Variational Auto-Decoder (VAD). VAD is a generic framework which uses variational Bayes and Markov Chain Monte Carlo (MCMC) methods to learn a generative model from partial data. The main distinction between VAD and the Variational Auto-Encoder (VAE) is the encoder component, which VAD does not have. Using a proposed efficient inference method based on a multivariate Gaussian approximate posterior, VAD models allow inference to be performed via simple gradient ascent rather than MCMC sampling from a probabilistic decoder. This technique reduces the computational cost of inference, allows more complex optimization techniques to be used during latent-space inference (which we show to be crucial due to the high degree of freedom of the VAD latent space), and keeps the framework simple to implement. Through extensive experiments over several datasets and different missing ratios, we show that encoders cannot efficiently marginalize the input volatility caused by imputed missing values. We also study multimodal datasets, which are a particular area of impact for VAD models.


1 Introduction

AEVB (Auto-Encoding Variational Bayes) is one of the most widely used algorithms for learning generative models kingma2013auto . Approximate posterior inference is an important step within AEVB which allows for sampling from the latent space conditioned on the input. VAE (Variational Auto-Encoder kingma2013auto ), the most well-known implementation of AEVB, relies on an encoder to perform approximate posterior inference. In machine learning, incomplete data, i.e. data with missing values, is widespread. In the realm of incomplete data, encoder-based approaches such as VAE pose a cyclic dependency with generative modeling: to learn a generative model, an encoder needs to be used, and to use an encoder, the missing dimensions need to be replaced, ideally using the underlying data distribution (i.e. the generative model). Commonly, this chain is broken by sub-optimally imputing the missing values (mostly with zeros, even in the most recently proposed VAE imputation models ivanov2018variational ). This in turn causes volatility in the input space of the VAE; nevertheless, the encoder is assumed to be able to handle this volatility and reliably perform posterior approximation.

Within the AEVB algorithm, the approximate posterior parameters need not necessarily be inferred using an encoder. As opposed to using an encoder to infer the parameters of the approximate posterior distribution, we propose to optimize the parameters of the approximate posterior directly, without an encoder. This results in an alternative implementation of AEVB, called the Variational Auto-Decoder (VAD). Using the reparameterization trick kingma2013auto for well-known distributions, the approximate posterior parameters can then be learned end-to-end by maximizing the variational lower bound, which is differentiable w.r.t. the parameters of the well-known distribution (hence gradient-based approaches can be easily used). Within the AEVB learning framework, we specifically study which of VAD or VAE can maximize the variational lower bound more efficiently and learn a more accurate generative model in the presence of missing data. We study this question over multiple datasets from different domains and under the following scenarios: 1) similar train- and test-time missingness, and 2) test-only missingness and train-only missingness.

2 Related Work

Learning from incomplete data is a fundamental research area in machine learning. Notable related works fall into several categories as denoted below.

In a neural framework, Variational Auto-Encoders have commonly been used for learning from incomplete data mccoy2018variational ; williams2018autoencoders ; nazabal2018handling . A particular implementation based on Conditional Variational Auto-Encoders (C-VAE) has been shown to achieve superior performance over existing methods for learning from incomplete data ivanov2018variational .

Generative Adversarial Networks (GANs) have also been used for missing-data imputation yoon2018gain . Aside from GANs being particularly hard to train salimans2016improved , VAE-based approaches have been shown to perform better in practice ivanov2018variational . This VAE implementation is the baseline we compare against in this paper.

Previously proposed Markov-chain based approaches require computationally heavy sampling and require the full data to be observable during training rezende2014stochastic ; sohl2015deep ; bordes2017learning . One appeal of these models is that they can directly maximize the evidence (as opposed to the lower bound), however at a heavy computational cost.

Inpainting approaches exist in computer vision which are particularly engineered for visual tasks and sometimes require similar train- and test-time missingness for best performance pathak2016context ; yang2017high .

Other approaches have relied on simpler learning techniques such as Gaussian Mixture Models delalleau2012efficient , Support Vector Machines pelckmans2005handling , or Principal Component Analysis dray2015principal . Such models have fallen short in recent years because they lack the capacity needed to deal with the increasingly non-linear nature of many real-world datasets.

3 Model

Auto-Encoding Variational Bayes (AEVB) kingma2013auto is among the most successful methods for learning generative models. Using a reparameterization trick on a set of known distributions, AEVB allows learning to be done using Stochastic Variational Inference (SVI) rezende2014stochastic . A particularly important step within AEVB is learning an approximate posterior distribution. This approximate posterior is commonly parameterized by a neural network in a VAE (Variational Auto-Encoder kingma2013auto ); the encoder essentially outputs the parameters of the approximate posterior.

In this section, we outline an alternative implementation of the AEVB algorithm for the case of incomplete data. We call this approach the Variational Auto-Decoder (VAD), since it does not utilize an encoder to infer the parameters of the approximate posterior. VAD initializes the parameters of the approximate posterior randomly and updates those parameters during the training process using gradient-based methods. We first outline the problem formulation, and subsequently the training and inference procedure for VAD.

3.1 Problem Formulation

We assume a ground-truth random variable $\mathbf{x}^\star \in \mathbb{R}^d$, sampled from a ground-truth distribution $p(\mathbf{x}^\star)$, with $d$ being the dimension of the input space. Unfortunately, the space of $\mathbf{x}^\star$ is considered not fully observable. The part that is observable we denote via the random variable $\mathbf{x}$, regarded hereon as the incomplete input. We assume that a random variable $\mathbf{m} \in \{0,1\}^d$ denotes whether or not the data is observable through an indicator in each dimension, with the value $1$ being observable and $0$ being missing.

We formalize the process of generating the random variable $\mathbf{x}$ as first drawing a ground-truth data sample $\mathbf{x}^\star$ from $p(\mathbf{x}^\star)$ and a missingness pattern sample $\mathbf{m}$ from $p(\mathbf{m})$, and subsequently removing information from $\mathbf{x}^\star$ using $\mathbf{m}$. We draw $N$ i.i.d. samples from the above process to build a dataset (notably, each datapoint can have a distinct missingness pattern). For the rest of this paper, the incomplete dataset is regarded as $X = \{\mathbf{x}_i\}_{i=1}^N$ and the missingness patterns as $M = \{\mathbf{m}_i\}_{i=1}^N$. The ground-truth dataset is regarded as $X^\star = \{\mathbf{x}^\star_i\}_{i=1}^N$; it can never directly be a part of training, validation or testing since it is considered strictly unknown.
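To make the data-generation process concrete, the following is a small NumPy sketch of drawing a datapoint, a missingness pattern, and the resulting incomplete input; the distribution choices, missing ratio, and variable names are illustrative assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10  # hypothetical input dimensionality

# Draw a ground-truth sample x_star (here from a stand-in Gaussian) and an
# MCAR missingness pattern m with, e.g., a 30% missing ratio.
x_star = rng.normal(size=d)
m = (rng.random(d) > 0.3).astype(float)   # 1 = observed, 0 = missing

# The incomplete input keeps only the observed dimensions; missing entries
# are marked (here with NaN) and are never seen by the model as real values.
x = np.where(m == 1, x_star, np.nan)
```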

3.2 Training

Assuming that the data distribution $p(\mathbf{x}^\star)$ can be approximated using a parametric family of distributions $p_\theta(\mathbf{x})$ with parameters $\theta$, learning can be done by maximizing the likelihood $p_\theta(X)$ w.r.t. $\theta$. In practice, the log of the likelihood is often calculated and used. In a latent-variable modeling framework, the evidence can often be defined by marginalizing a latent variable $\mathbf{z}$ as follows:

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z} \qquad (1)$$

In practice, calculating the marginal integral over $\mathbf{z}$ is either expensive or intractable. Subsequently, direct latent posterior inference using $p_\theta(\mathbf{z} \mid \mathbf{x})$, which is an essential step in latent-variable modeling, becomes impractical.

For any given $\mathbf{x}$ and any conditional density $q_\lambda(\mathbf{z} \mid \mathbf{x})$, with $\mathbf{z}$ as an unobserved random variable and $\lambda$ as the parameters of $q$, we can rewrite the evidence in Equation 1 as follows:

$$\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\lambda(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\lambda(\mathbf{z} \mid \mathbf{x})}\right] + D_{KL}\big(q_\lambda(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\big) \qquad (2)$$

with the condition that $p_\theta(\mathbf{z} \mid \mathbf{x}) \neq 0$ wherever $q_\lambda(\mathbf{z} \mid \mathbf{x}) \neq 0$. To simplify notation, we refer to the true posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$ as $p$ and the approximate posterior $q_\lambda(\mathbf{z} \mid \mathbf{x})$ as $q$. More simply, the likelihood in Equation 2 can be written as:

$$\log p_\theta(\mathbf{x}) = \mathcal{L}(\theta, \lambda; \mathbf{x}) + D_{KL}(q \,\|\, p) \qquad (3)$$

In the above equation, $D_{KL}(q \,\|\, p)$ is the Kullback-Leibler divergence. One could directly minimize this asymmetric divergence to approximate the true posterior $p$ with the approximate posterior $q$. However, doing so requires samples to be drawn from the true posterior. Markov Chain Monte Carlo (MCMC) approaches can be used to draw such samples, but they are usually very costly.

$\mathcal{L}(\theta, \lambda; \mathbf{x})$ is referred to as the Evidence Lower Bound (ELBO), or simply the variational lower bound. It is equal to the sum of the expected value of the log of the joint density under $q$ and the entropy of $q$:

$$\mathcal{L}(\theta, \lambda; \mathbf{x}) = \mathbb{E}_{q_\lambda(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x}, \mathbf{z})\big] + \mathbb{H}\big[q_\lambda(\mathbf{z} \mid \mathbf{x})\big] \qquad (4)$$

Through the above formulation, rather than learning the model parameters directly through the likelihood of the data, variational Bayes methods approximate the posterior probability $p_\theta(\mathbf{z} \mid \mathbf{x})$ with a simpler distribution $q_\lambda(\mathbf{z} \mid \mathbf{x})$. Equation 4 can be rewritten as:

$$\mathcal{L}(\theta, \lambda; \mathbf{x}) = \mathbb{E}_{q_\lambda(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big] - D_{KL}\big(q_\lambda(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big) \qquad (5)$$

In the above equation, the first term encourages the latent samples to show high expected likelihood (through reconstruction of $\mathbf{x}$) under the approximate posterior distribution, and the second term encourages the latent samples to simultaneously follow the latent prior $p(\mathbf{z})$.

In a Variational Auto-Decoder framework, the approximate posterior distribution is not parameterized by a neural network, but rather by a well-known distribution directly. Therefore, as opposed to randomly initializing the weights of an encoder, we simply randomly initialize the intrinsic parameters of the approximate posterior. We focus on the family of multivariate Gaussian distributions for this purpose; however, other distributions can also be used, as long as the reparameterization trick kingma2013auto can be defined for them. We define a multivariate Gaussian approximate posterior as:

$$q_\lambda(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}\big(\mathbf{z};\, \boldsymbol{\mu},\, \boldsymbol{\Sigma}\big), \qquad \lambda = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}\} \qquad (6)$$

Note that $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are the learnable parameters of this approximate posterior distribution. The reparameterization of this posterior is defined as $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\Sigma}^{1/2}\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Using this reparameterization, the gradient of the lower bound $\mathcal{L}$ can be directly backpropagated to the mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$.
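As a concrete illustration of Equation 6 and the reparameterization above, the following PyTorch sketch declares per-datapoint posterior parameters, draws a reparameterized sample, and evaluates the closed-form KL term of Equation 5 against a standard normal prior; the diagonal-covariance parameterization via log_var is our own assumption:

```python
import torch

latent_dim = 50  # hypothetical latent dimensionality

# Intrinsic parameters of the approximate posterior (Equation 6),
# initialized randomly instead of being produced by an encoder.
mu = torch.nn.Parameter(0.01 * torch.randn(latent_dim))
log_var = torch.nn.Parameter(torch.zeros(latent_dim))  # log of the diagonal of Sigma

def sample_posterior():
    """Reparameterized draw: z = mu + Sigma^(1/2) * eps, with eps ~ N(0, I)."""
    eps = torch.randn(latent_dim)
    return mu + torch.exp(0.5 * log_var) * eps

def kl_to_standard_normal():
    """Closed-form KL(q || N(0, I)) for a diagonal Gaussian (second term of Equation 5)."""
    return 0.5 * torch.sum(torch.exp(log_var) + mu ** 2 - 1.0 - log_var)
```

Because mu and log_var are leaf parameters, gradients of the lower bound flow to them directly through the sampled z.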

The likelihood $p_\theta(\mathbf{x} \mid \mathbf{z})$ is similarly defined as a multivariate Gaussian, with the missing dimensions of $\mathbf{x}$ marginalized out:

$$p_\theta(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}\big(\mathbf{x};\, \mathrm{dec}_\theta(\mathbf{z}),\, \hat{\boldsymbol{\Sigma}}\big) \qquad (7)$$

This density is centered around $\mathrm{dec}_\theta(\mathbf{z})$ as its mean. The covariance $\hat{\boldsymbol{\Sigma}}$ is defined as a diagonal positive semi-definite matrix whose main diagonal is nonzero only in the dimensions where $\mathbf{m} = 1$, so that the missing dimensions do not contribute to the likelihood. $\mathrm{dec}_\theta$ is a neural decoder which takes in the samples drawn from the posterior in Equation 6. The optimization is subsequently defined within AEVB: first sampling from the approximate posterior to calculate a Monte Carlo estimate of the lower bound $\mathcal{L}$, and subsequently maximizing $\mathcal{L}$ w.r.t. $\theta$ and $\lambda$ (Equations 6 and 7). Algorithm 1 summarizes the training procedure (and also the inference procedure described in the next section).
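A minimal sketch of a masked Gaussian log-likelihood in the spirit of Equation 7 is shown below; the fixed observation variance sigma2 (treated as a hyperparameter, see Appendix A) and the function name are our own assumptions rather than the authors' implementation:

```python
import math
import torch

def masked_gaussian_log_likelihood(x, x_hat, mask, sigma2=0.1):
    """log N(x; x_hat, sigma2 * I) evaluated only on the observed dimensions.

    x, x_hat, mask: tensors of shape (batch, d); mask is 1 where observed, 0 where missing.
    Missing dimensions are excluded from the sum, which marginalizes them out
    for a diagonal Gaussian.
    """
    per_dim = -0.5 * (math.log(2 * math.pi * sigma2) + (x - x_hat) ** 2 / sigma2)
    return (per_dim * mask).sum(dim=-1)
```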

3.3 Inference

Typically, once a generative model is learned, it is used to sample data which belong to the underlying learned distribution. Sampling can be done by sampling from the latent space and subsequently using the decoder to generate the data, with no other steps required.

In certain cases a new datapoint is given and the goal is to sample from its posterior. Calculating the evidence in Equation 1 is still infeasible, even after training is done. Therefore, for the new datapoint, the same variational lower bound needs to be maximized. Using the same process as in the training framework of Equation 5, the parameters of the approximate posterior are initialized randomly and iteratively updated until convergence. Thus, inference is similar to training, except that the learned parameters $\theta$ of the decoder are not updated during inference. Once $\mathcal{L}$ is maximized, samples from the approximate posterior can be used to generate instances similar to the given datapoint.

1: θ ← initialize decoder parameters                ▷ Only for training; θ stays fixed during inference
2: λ = {μ, Σ} ← random initialization
3: repeat:
4:     z ∼ q_λ(z | x)                               ▷ Sampling from the approximate posterior, Equation 6
5:     L ← E_q[log p_θ(x | z)] − D_KL(q_λ(z | x) ∥ p(z))   ▷ Likelihood from Equation 7, bound from Equation 5
6:     θ ← grad_step(L, θ)                          ▷ No grad_step w.r.t. θ during inference
7:     λ ← grad_step(L, λ)
8: until maximization convergence on L

Algorithm 1: Training (and inference) process for VAD models with a multivariate normal distribution as the approximate posterior.
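To make Algorithm 1 concrete, here is a minimal PyTorch sketch of the loop under our own assumptions (a small MLP decoder, a fixed observation variance, and a fixed iteration budget standing in for the convergence check); it is an illustration rather than the authors' released implementation:

```python
import math
import torch
from torch import nn

def run_vad(x, mask, decoder, latent_dim=50, steps=2000, lr=1e-3, sigma2=0.1, train=True):
    """Optimize the posterior parameters lambda = (mu, log_var); also theta if train=True."""
    n = x.shape[0]
    mu = torch.nn.Parameter(0.01 * torch.randn(n, latent_dim))
    log_var = torch.nn.Parameter(torch.zeros(n, latent_dim))
    params = [mu, log_var] + (list(decoder.parameters()) if train else [])
    opt = torch.optim.Adam(params, lr=lr)

    for _ in range(steps):                                   # stand-in for "until convergence"
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps              # Equation 6, reparameterized
        x_hat = decoder(z)
        ll = -0.5 * ((math.log(2 * math.pi * sigma2)
                      + (x - x_hat) ** 2 / sigma2) * mask).sum(-1)   # Equation 7 (masked)
        kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(-1)
        loss = -(ll - kl).mean()                             # negative lower bound, Equation 5
        opt.zero_grad()
        loss.backward()
        opt.step()                # decoder weights are only updated when train=True,
                                  # since they are excluded from the optimizer otherwise
    return mu, log_var

# Example usage (hypothetical shapes): a decoder mapping 50-d latents to 784-d outputs.
# decoder = nn.Sequential(nn.Linear(50, 256), nn.ReLU(), nn.Linear(256, 784))
# mu, log_var = run_vad(x_batch, mask_batch, decoder, train=True)
```

For inference on new datapoints (Section 3.3), the same loop would be called with train=False so that only the posterior parameters are optimized.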

4 Experiments

Based on Equation 5, the variational lower bound depends on the expectation of the log-likelihood under the approximate posterior $q_\lambda(\mathbf{z} \mid \mathbf{x})$. This term, which relies on the incomplete input $\mathbf{x}$, indicates how well samples drawn from the approximate posterior are able to recreate $\mathbf{x}$ (using the likelihood in Equation 7). If, due to the parameterization of the approximate posterior, this term cannot be maximized efficiently for incomplete data during training, then maximizing the lower bound will subsequently be impacted. (In simple terms, if the approximate posterior and decoder cannot reproduce the data efficiently in the best case, then generative modeling will not be successful, regardless of the second term of Equation 5, which is in any case the same for both models.) For both VAD and VAE, we aim to address whether missing data can cause issues for maximizing this expectation. Therefore, we specifically study the lower bound with only the first term, to compare whether either of the two models inherently falls short in the presence of missing data. (The implementation of VAE in this paper, using Equation 7 and the missing mask, is identical to a model published during the preparation of this paper, called VAEAC ivanov2018variational ; missing values are replaced by zeros before being input to the encoder. Its authors' extensive experiments found that this model is comparable to or better than previous approaches in most data-imputation and posterior-approximation tasks studied in their paper.)

In the experiments, both models are trained on identical data and maximize the same lower bound as described in the previous paragraph (code and data are available through https://github.com/A2Zadeh/Variational-Autodecoder). The only difference between the two models is therefore the parameterization of the approximate posterior distribution: for VAD it is the parameters of the distribution, and for VAE it is the weights of the neural encoder. A validation set is used to choose the best-performing hyperparameter setup based on the lower bound $\mathcal{L}$. Both models undergo a substantial hyperparameter search, described with exact values in Appendix A; the hyperparameters include (but are not limited to) the number of layers in the decoder (and encoder for VAE), the number of neurons in each layer, and the latent space dimensionality. Subsequently, the best trained model is used on the test data. The ground truth is never used during training, validation or testing (unless required by the experiment, as described later in this section). Only for evaluation purposes, after inference is done on the test set, the ground truth is revealed. To report a measure that is easy to compare, we report the MSE (Mean Squared Error) between the decoded mean of the approximate posterior and the data, in the following categories (MSE is calculated per dimension and is therefore independent of the missing rate): 1) Incomplete: the MSE between the incomplete data (available dimensions) and the output of the decoder; since the incomplete data is the basis of the likelihood in Equation 7, we expect models to show low MSE here. 2) Missing: once inference is done over the incomplete data, the missing values are revealed to evaluate the imputation performance of both models. 3) Full: after revealing the missing values, we calculate the performance over the full ground-truth data.
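For clarity, the three MSE categories can be computed as in the following NumPy sketch, where x_hat is the decoded mean of the approximate posterior and mask marks the observed dimensions; the function names are ours:

```python
import numpy as np

def per_dimension_mse(x_true, x_hat, weights):
    """Mean squared error averaged over the selected dimensions only."""
    return float((weights * (x_true - x_hat) ** 2).sum() / weights.sum())

def evaluate(x_ground_truth, x_hat, mask):
    return {
        "incomplete": per_dimension_mse(x_ground_truth, x_hat, mask),        # observed dims
        "missing":    per_dimension_mse(x_ground_truth, x_hat, 1.0 - mask),  # revealed dims
        "full":       per_dimension_mse(x_ground_truth, x_hat, np.ones_like(mask)),
    }
```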

Specifically, the following two scenarios are studied in the form of experiments in this paper:

Experiment 1: We study the case where train and test data follow the same missing rate; essentially, the distribution of missingness is also the same in both phases.

Experiment 2: In real-world situations it is very unlikely that the data will follow the same missing rate during train and test time. Therefore, we also compare the two models for different train and test time missing rates. Since there may be many combinations of missing rates for train and test time, we only study the extreme cases: only test-time missingness (train on ground-truth), and only train-time missingness (test on ground-truth).

4.1 Datasets

We experiment with a variety of datasets from different areas of machine learning. To better understand the ranges of MSE for each dataset, we report a baseline obtained by taking the mean of the ground-truth training data as the prediction during test time. This baseline indicates the limit beyond which models perform worse than simply predicting the mean of the ground-truth data regardless of the input (with a very minimal deviation across experiments for each dataset, this threshold also applies to the missing and incomplete categories).
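The mean-prediction baseline can be computed directly; a sketch, assuming NumPy arrays for the ground-truth train and test sets:

```python
import numpy as np

def mean_baseline_mse(x_train_ground_truth, x_test_ground_truth):
    """MSE threshold: predict the per-dimension training mean for every test point."""
    mean_prediction = x_train_ground_truth.mean(axis=0)
    return float(((x_test_ground_truth - mean_prediction) ** 2).mean())
```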

Toy Synthetic Dataset: We study a case of synthetic data where we control the distributional properties of the data. In the generation process, we first acquire a set of independent dimensions randomly sampled from 5 univariate distributions with uniformly random parameters: {Normal, Uniform, Beta, Logistic, Gumbel}. Often in realistic scenarios there are inter-dependencies among the dimensions. Hence we proceed to generate interdependent dimensions by picking random subsets of the independent components and combining them using random operations such as weighted multiplication, affine addition, and activation. Using this method, we generate a dataset with a fixed ground-truth dimensionality $d$; further details of the generation process and exact ranges are given in the Supplementary Material. The threshold MSE for this dataset is indicated in Figure 1.
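A simplified sketch of this generation process is shown below; the specific parameter ranges, subset size, mixing weights, and the tanh activation standing in for the "random operations" are our own illustrative choices, not the paper's exact settings:

```python
import numpy as np

def sample_independent_dim(rng, n):
    """Draw one dimension from a randomly chosen family with random parameters."""
    family = rng.integers(5)
    if family == 0:
        return rng.normal(rng.uniform(-1, 1), rng.uniform(0.5, 2.0), size=n)
    if family == 1:
        return rng.uniform(-rng.uniform(1, 3), rng.uniform(1, 3), size=n)
    if family == 2:
        return rng.beta(rng.uniform(0.5, 3), rng.uniform(0.5, 3), size=n)
    if family == 3:
        return rng.logistic(rng.uniform(-1, 1), rng.uniform(0.5, 2.0), size=n)
    return rng.gumbel(rng.uniform(-1, 1), rng.uniform(0.5, 2.0), size=n)

def generate_synthetic(n=10000, n_independent=10, n_dependent=10, seed=0):
    rng = np.random.default_rng(seed)
    independent = np.stack([sample_independent_dim(rng, n) for _ in range(n_independent)], axis=1)
    dependent = []
    for _ in range(n_dependent):
        idx = rng.choice(n_independent, size=3, replace=False)   # random subset of dimensions
        w = rng.normal(size=3)                                   # weighted combination
        dependent.append(np.tanh(independent[:, idx] @ w + rng.normal()))  # affine + activation
    return np.concatenate([independent, np.stack(dependent, axis=1)], axis=1)
```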

Figure 1: Best viewed zoomed in and in color. Results of Experiment 1 (Section 4.2) for the incomplete, missing and full categories. The blue curve shows VAD results and the orange curve shows VAE results. The x axis denotes the missing rate and the y axis is the reconstruction Mean Squared Error (MSE, lower is better). The standard deviation is calculated over repeated test runs of the best performing model on the validation set. The gap between the two models becomes larger as the missing ratio increases. In the full category, the point at zero missing rate shows the performance when there is no missing data (train or test). The performance thresholds indicate the MSE beyond which models predict worse than the average of each dimension regardless of the input.

Menpo Facial Landmark Dataset: Menpo2D contains facial images of various subjects, expressions, and poses zafeiriou2017menpo . Due to these variations, the nature of the dataset is complex. Since the Menpo dataset has ground-truth annotations for landmarks regardless of self-occlusions in the natural image, it allows for creating real-life ground-truth data for our experiments. The purpose of using this dataset is to compare the two models on how well they recreate the structure of an object given only a subset of the available keypoints. The threshold MSE for this dataset is indicated in Figure 1.

CMU-MOSEI Dataset: CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) is an in-the-wild dataset for multimodal sentiment and emotion recognition zadeh2018multimodal . The dataset consists of sentence utterances from online YouTube monologue videos: 23,500 such sentences with three modalities of text (words), vision (gestures) and acoustic (sound). For the text modality, the dataset contains GloVe word embeddings pennington2014glove . For the visual modality, the dataset contains facial action units, facial landmarks, and head-pose information. For the acoustic modality, the dataset contains high- and low-level descriptors following COVAREP degottex2014covarep . We use the expected multimodal context for each sentence, similar to unordered compositional approaches in NLP iyyer2015deep . The threshold MSE for this dataset is indicated in Figure 1.

Fashion-MNIST: Fashion-MNIST (https://github.com/zalandoresearch/fashion-mnist) is a variant of the MNIST dataset. It is considered more challenging than MNIST since variations within fashion items are usually more complex than within written digits. The dataset consists of 28x28 grayscale images of 10 fashion items. The threshold MSE for this dataset is indicated in Figure 1.

We base our experiments on Missing Completely at Random (MCAR) missingness, which is a severe case of missingness. For each datapoint, we sample a missing pattern $\mathbf{m}$ with a missing ratio swept over a fixed grid of increasing values. This form of missingness essentially allows each dimension to unexpectedly go missing.
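Sampling such MCAR masks is straightforward; a NumPy sketch follows, where the 0.1 to 0.9 grid is an illustrative choice of missing ratios rather than the paper's exact values:

```python
import numpy as np

def sample_mcar_masks(n, d, missing_ratio, rng):
    """Each dimension of each datapoint goes missing independently with probability missing_ratio."""
    return (rng.random((n, d)) > missing_ratio).astype(np.float32)  # 1 = observed, 0 = missing

rng = np.random.default_rng(0)
masks = {round(r, 1): sample_mcar_masks(n=1000, d=20, missing_ratio=r, rng=rng)
         for r in np.arange(0.1, 1.0, 0.1)}  # illustrative grid of missing ratios
```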

4.2 Experiment 1

In this experiment, both train and test data follow the same missing ratio. For each missing ratio, models are trained using the likelihood in Equation 7 and maximize the lower bound in Equation 5. Figure 1 shows the results of this experiment for the best validated models of both VAD and VAE. In all three categories, incomplete, missing (imputation) and full (ground truth), VAD shows superior performance to VAE. As the missingness increases, the gap between the two models widens in all three categories (except for CMU-MOSEI, where the performance gap is large in the incomplete and missing categories even for small missing ratios). This essentially indicates that VAE becomes increasingly unstable in the presence of missing data. Specifically, in the incomplete category, VAE is not able to perform reliable posterior inference since the output of the decoder deviates increasingly far from the available input. VAD, on the other hand, shows steady performance in the incomplete category. The performance of both models naturally degrades in the missing category as the missing ratio increases (in reality, some of the missing data may not be imputable given the available input). However, the increasing gap between the two models also appears in the missing category. Finally, the comparison in the full category shows that VAD is able to regenerate the ground truth better than VAE. Overall, Figure 1 suggests that an approximate posterior conditioned through an encoder on a volatile input becomes increasingly unstable as missingness becomes more severe.

Figure 2: Best viewed zoomed in and in color. Results (in the full category) of Experiment 2 (Section 4.3) for (a) test-time missingness and (b) train-time missingness. The blue curve shows VAD results and the orange curve shows VAE results. The x axis denotes the missing rate and the y axis is the reconstruction Mean Squared Error (MSE). In both experiments VAD shows superior performance to VAE; the performance of VAE is significantly affected in both scenarios.

4.3 Experiment 2

While in the previous experiment both the train and test stages followed the same missing rate, realistic scenarios are often more complex. As the most extreme cases, we study two possible scenarios: test-only missingness and train-only missingness. For this experiment, we choose the best performing models from the previous experiment based on their performance on incomplete data in the validation set.

In the test-time missingness scenario, models are trained on the ground-truth data without any expectation that during test time an arbitrary subset of the data may go missing. During test time this assumption proves to be wrong, and the data indeed goes missing at a range of missing ratios. Figure 2 shows the results of this experiment for the synthetic and Menpo2D datasets in the full category (due to space constraints we only report two datasets in this category). In both cases the performance of VAE is substantially affected by the missing dimensions during test time, achieving far inferior performance compared to Experiment 1 (where it was trained on the same missingness). The performance of VAD remains almost unchanged for both the synthetic and Menpo2D datasets, and relatively similar to Experiment 1. We also demonstrate this visually in an inpainting scenario. We compare the models when they are trained on ground truth and tested on incomplete data against when they are trained on a similar missing ratio. Both models are trained on the Fashion-MNIST ground-truth train set, and during testing the data may go missing. Figure 3 shows the test-time performance of both models for different missing ratios as well as block-wise missingness. Visually, it can be seen that VAE suffers heavily if it is trained on ground truth but the data goes missing during test time. In fact, at high missing rates, VAE simply blurs out the area around the available datapoints, while VAD is able to recover the missing areas of the image. Compared to the case where the missing ratio is the same at train and test time, we observe that it is crucial for VAE to train on the same missing rate as seen at test time, while VAD does not suffer from this.

The train-time missingness scenario is the opposite of the above scenario (not to be confused with denoising methods or dropout, which map noisy input to the ground truth during training). In this scenario, the ground-truth training set can never be fully observed during training. Models are trained on a train set with a range of missing ratios. During testing, they perform inference at a different missing rate, in the extreme case on the ground-truth test set. The right side of Figure 2 shows the results of this experiment for both VAD and VAE. We observe a similar trend, with VAD remaining consistent while the performance of VAE deteriorates as the missing rate increases.

Figure 3: Visualization of inpainting experiments on Fashion-MNIST. The top row (Data) shows the data given to both VAD and VAE. Ground-truth is the real image from Fashion-MNIST. For (a), training is done on ground-truth data and testing on incomplete data. For (b), training and testing are done with a similar missing rate. For the case of MCAR, (a) shows that VAE deteriorates significantly when trained on ground-truth data and given incomplete data. This trend is also visible, but at a much slower rate, for case (b), where only at the highest missing ratios does VAE show visually perceivable failure. We also compare the performance of both models when the missingness changes from MCAR to random missing blocks; training for case (a) is still done on ground truth and for case (b) in the presence of random blocks. VAD maintains better performance in both (a) and (b) for random blocks. Overall, (a) shows that VAE mostly focuses on the given areas of the image and mostly reconstructs those areas; this is not the case for VAD, which recreates the other areas of the image as well.

5 Conclusion

In this paper, we proposed an alternative implementation of the AEVB algorithm for the case of incomplete data, called the Variational Auto-Decoder (VAD). We studied the effect of missing data on an approximate posterior conditioned directly on the incomplete input via an encoder (i.e. VAE). We showed that such conditioning may not allow the variational lower bound to be maximized efficiently, due to poor performance in maximizing the expected likelihood (under the approximate posterior) of the incomplete data. We showed that VAD is better suited for this case since it does not take the volatile data as input. The approximate posterior in VAD is parameterized by a known distribution, the parameters of which are directly optimized in a variational learning framework. For VAD, similar to VAE, the parameters of the approximate posterior can be learned using gradient-based approaches. When train and test followed similar missing ratios, our experimental results showed superior performance of VAD. This extended to the cases of only test-time missingness and only train-time missingness, where VAD also showed superior performance.

References

Appendix A Implementation Remarks

Here we detail the hyperparameter space used for the experiments. Where possible, we tried to establish a fair comparison between the VAD and VAE models by using similar hyperparameters; however, both models underwent a substantial grid search. We varied three main hyperparameters: the dimensionality of the latent space, the number of feedforward hidden layers in the decoder (and, for VAE, the encoder), and the number of hidden units in each hidden layer. A summary is shown in Table 1.
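Such a grid search can be sketched as below; the candidate values mirror group A of Table 1, and train_and_validate is a hypothetical placeholder for training a model and scoring it on the validation lower bound:

```python
import itertools

grid = {
    "latent_dim": [25, 50, 100],      # group A values from Table 1
    "hidden_units": [50, 100, 200],
    "hidden_layers": [2, 4, 6],
}

def grid_search(train_and_validate):
    """Return the hyperparameter setup with the best validation lower bound."""
    best_score, best_config = float("-inf"), None
    for values in itertools.product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        score = train_and_validate(**config)   # hypothetical: returns the validation ELBO
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```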

During inference with VAD models, we simply stopped once the model reached a plateau. Since VAD models have a high degree of freedom in the approximate posterior $q_\lambda$, we observed that it is crucial to use Adam [7] for learning the parameters of the approximate distribution. For both models, the learning rates used for the latent variables and the hidden units are listed in Table 1.

During inference, both models use an approximate posterior with a learnable variance. However, in practice, when using incomplete data, learning the variance was a very unstable process for VAE. VAE models showed very high sensitivity to even small variances, quickly degrading to performance similar to projecting the mean in each dimension. We believe the combination of noise from the imputed values and the noise added through the approximate posterior may have caused too much uncertainty for VAE. While VAD models suffered from a similar instability when learning the variance, their performance remained better than VAE's.

Therefore, during our experiments we treated the approximate posterior variance as a hyperparameter and trained the models with different fixed variances. With this approach, results improved substantially. The best-performing variance may change depending on the problem and the range of the input space; in general, we observed that very large variances did not converge well, while very small variances did not yield the best results.

A.1 Synthetic Data Generation

The parameters of the synthetic data are outlined in Table 2.

Hyperparameter                                 | Group | Values
# of latent variables                          | A     | 25, 50, 100
# of latent variables                          | B     | 50, 100, 400
# of hidden units per layer                    | A     | 50, 100, 200
# of hidden units per layer                    | B     | 100, 200, 400
# of hidden layers                             | A, B  | 2, 4, 6
LR of network parameters and latent variables  | A, B  |
Table 1: Hyperparameters used for the experiments on VAD and VAE across different datasets. Group B refers to Fashion-MNIST, while group A refers to all other datasets.
Distribution Parameter Range of values
Normal
Uniform
Beta
Logistic
Gumbel
Table 2: Parameters used during generation of synthetic data.