Variational Autoencoder with Implicit Optimal Priors

09/14/2018 · Hiroshi Takahashi et al.

The variational autoencoder (VAE) is a powerful generative model that can estimate the probability of a data point by using latent variables. In the VAE, the posterior of the latent variable given the data point is regularized by the prior of the latent variable using the Kullback-Leibler (KL) divergence. Although the standard Gaussian distribution is usually used for the prior, this simple prior incurs over-regularization. As a sophisticated prior, the aggregated posterior has been introduced, which is the expectation of the posterior over the data distribution. This prior is optimal for the VAE in the sense of maximizing the training objective function. However, the KL divergence with the aggregated posterior cannot be calculated in a closed form, which prevents us from using this optimal prior. In the proposed method, we introduce the density ratio trick to estimate this KL divergence without modeling the aggregated posterior explicitly. Since the density ratio trick does not work well in high dimensions, we rewrite this KL divergence, which contains a high-dimensional density ratio, into the sum of an analytically calculable term and a low-dimensional density ratio term, to which the density ratio trick is applied. Experiments on various datasets show that the VAE with this implicit optimal prior achieves high density estimation performance.


1 Introduction

Estimating data distributions is one of the important challenges of machine learning. The variational autoencoder (VAE) [Kingma and Welling 2013, Rezende, Mohamed, and Wierstra 2014] was presented as a powerful generative model that can learn distributions by using latent variables and neural networks. Since the VAE can capture complicated high-dimensional data distributions, it is widely applied to various data, such as images [Gulrajani et al. 2016], videos [Gregor et al. 2015], and audio and speech [Hsu, Zhang, and Glass 2017, van den Oord, Vinyals, and Kavukcuoglu 2017].

The VAE is composed of three distributions: the encoder, the decoder, and the prior of the latent variable. The encoder and the decoder are conditional distributions modeled by neural networks. The encoder defines the posterior of the latent variable given the data point, whereas the decoder defines the distribution of the data point given the latent variable. The parameters of the encoder and decoder neural networks are optimized by maximizing the sum over the data of the evidence lower bound on the log marginal likelihood. In the training of the VAE, the prior regularizes the encoder through the Kullback-Leibler (KL) divergence. The standard Gaussian distribution is usually used for the prior since the KL divergence can then be calculated in a closed form.

Recent research shows that the prior plays an important role in density estimation [Hoffman and Johnson 2016]. Although the standard Gaussian prior is usually used, this simple prior incurs over-regularization, which is one of the causes of poor density estimation performance. This over-regularization is also known as the posterior-collapse phenomenon [van den Oord, Vinyals, and Kavukcuoglu 2017]. To improve the density estimation performance, the aggregated posterior prior has been introduced, which is the expectation of the encoder over the data distribution [Hoffman and Johnson 2016]. The aggregated posterior is an optimal prior in the sense of maximizing the training objective function of the VAE. However, the KL divergence with the aggregated posterior cannot be calculated in a closed form, which prevents us from using this optimal prior. In previous work [Tomczak and Welling 2018], the aggregated posterior is modeled by a finite mixture of encoders so that the KL divergence can be evaluated. Nevertheless, this model has sensitive hyperparameters, such as the number of mixture components, which are difficult to tune.

In this paper, we propose the VAE with implicit optimal priors, where the aggregated posterior is used as the prior but the KL divergence is estimated directly, without modeling the aggregated posterior explicitly. This implicit modeling enables us to avoid the difficult hyperparameter tuning for an aggregated posterior model. Since the KL divergence is the expectation of the density ratio between the encoder and the aggregated posterior, we use the density ratio trick, which can estimate the density ratio between two distributions without modeling each distribution explicitly. Although the density ratio trick is powerful, it has been experimentally shown to work poorly in high dimensions [Sugiyama, Suzuki, and Kanamori 2012, Rosca, Lakshminarayanan, and Mohamed 2018]. Unfortunately, with high-dimensional datasets, the density ratio between the encoder and the aggregated posterior also becomes high-dimensional. To avoid density ratio estimation in high dimensions, we rewrite the KL divergence with the aggregated posterior as the sum of two terms. The first term is the KL divergence between the encoder and the standard Gaussian prior, which can be calculated in a closed form. The other term is a low-dimensional density ratio between the aggregated posterior and the standard Gaussian distribution, to which the density ratio trick is applied.

2 Preliminaries

2.1 Variational Autoencoder

First, we review the variational autoencoder (VAE) [Kingma and Welling 2013, Rezende, Mohamed, and Wierstra 2014]. The VAE is a probabilistic latent variable model that relates an observed variable vector x to a low-dimensional latent variable vector z by a conditional distribution. The VAE models the probability of a data point x by

p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz,  (1)

where p(z) is a prior of the latent variable vector, and p_\theta(x \mid z) is the conditional distribution of x given z, which is modeled by neural networks with parameter \theta. For example, if x is binary, this distribution is modeled by a Bernoulli distribution p_\theta(x \mid z) = \mathrm{Bern}(x \mid \mu_\theta(z)), where \mu_\theta(z) is a neural network with parameter \theta and input z. These neural networks are called the decoder.
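As a concrete illustration, the Bernoulli decoder can be sketched in a few lines. The linear map standing in for the decoder network, and the names `decoder_mean` and `bernoulli_log_lik`, are hypothetical stand-ins for this exposition, not the paper's implementation:

```python
import numpy as np

# Hypothetical decoder sketch: a Bernoulli likelihood over a binary x with
# mean mu_theta(z). A tiny linear map stands in for the real decoder network.
def decoder_mean(z, W, c):
    return 1.0 / (1.0 + np.exp(-(W @ z + c)))   # sigmoid keeps means in (0, 1)

def bernoulli_log_lik(x, mu):
    # log Bern(x | mu), summed over the dimensions of x
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

rng = np.random.default_rng(0)
W, c = rng.standard_normal((4, 2)), np.zeros(4)  # toy "decoder" parameters
x, z = np.array([1., 0., 0., 1.]), np.array([0.3, -0.5])
print(bernoulli_log_lik(x, decoder_mean(z, W, c)))  # log p_theta(x|z)
```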

The log marginal likelihood is bounded below by the evidence lower bound (ELBO), which is derived from Jensen's inequality, as follows:

\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[ \log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)} \right] \equiv \mathcal{L}(x; \theta, \phi),  (2)

where \mathbb{E} represents the expectation, and q_\phi(z \mid x) is the posterior of z given x, which is modeled by neural networks with parameter \phi. q_\phi(z \mid x) is usually modeled by a Gaussian distribution \mathcal{N}(z \mid \mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x))), where \mu_\phi(x) and \sigma^2_\phi(x) are neural networks with parameter \phi and input x. These neural networks are called the encoder.

The ELBO (Eq. (2)) can also be written as

\mathcal{L}(x; \theta, \phi) = -D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)],  (3)

where D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z)) is the Kullback-Leibler (KL) divergence between q_\phi(z \mid x) and p(z). The second expectation term in Eq. (3) is called the reconstruction term, which is also known as the negative reconstruction error.

The parameters of the encoder and decoder neural networks are optimized by maximizing the following expectation of the lower bound on the log marginal likelihood:

\max_{\theta, \phi}\; \mathbb{E}_{p_d(x)}[\mathcal{L}(x; \theta, \phi)],  (4)

where p_d(x) is the data distribution.
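The KL term in Eq. (3) has a well-known closed form when the encoder is a diagonal Gaussian and the prior is standard Gaussian [Kingma and Welling 2013]. A minimal numpy sketch (the function name and the toy parameters are our own) checks the closed form against a Monte Carlo estimate:

```python
import numpy as np

def kl_gaussian_to_standard(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])        # toy encoder mean
log_var = np.array([0.2, -0.3])   # toy encoder log-variance

# Monte Carlo estimate of the same KL for comparison.
z = mu + np.exp(0.5 * log_var) * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum((z - mu)**2 / np.exp(log_var) + log_var + np.log(2 * np.pi), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
mc = np.mean(log_q - log_p)

closed = kl_gaussian_to_standard(mu, log_var)
print(closed, mc)  # the two estimates agree closely
```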

2.2 Aggregated Posterior

Training the VAE maximizes the reconstruction term under regularization by the KL divergence between the encoder and the prior. The prior is usually modeled by a standard Gaussian distribution [Kingma and Welling 2013]. However, this is not an optimal prior for the VAE: this simple prior incurs over-regularization, which is one of the causes of poor density estimation performance [Hoffman and Johnson 2016]. This phenomenon is called posterior collapse [van den Oord, Vinyals, and Kavukcuoglu 2017].

The optimal prior that maximizes the objective function of the VAE (Eq. (4)) can be derived analytically. The maximization of Eq. (4) with respect to the prior is written as follows:

\max_{p(z)}\; \mathbb{E}_{p_d(x)}[\mathcal{L}(x; \theta, \phi)] = \max_{p(z)}\; \mathbb{E}_{p_d(x)} \mathbb{E}_{q_\phi(z \mid x)}[\log p(z)] + \mathrm{const.} = \max_{p(z)}\; \bigl( -H(q_\phi(z), p(z)) \bigr) + \mathrm{const.},  (5)

where -H(q_\phi(z), p(z)) is the negative cross entropy between q_\phi(z) and p(z). Since -H(q_\phi(z), p(z)) takes its maximum value when p(z) is equal to q_\phi(z), the optimal prior that maximizes Eq. (4) is

q_\phi(z) = \mathbb{E}_{p_d(x)}[q_\phi(z \mid x)] = \int q_\phi(z \mid x)\, p_d(x)\, dx.  (6)

This distribution is called the aggregated posterior.

When we use the standard Gaussian prior p(z) = \mathcal{N}(z \mid 0, I), the KL divergence D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z)) can be calculated in a closed form [Kingma and Welling 2013]. However, when we use the aggregated posterior as the prior, the KL divergence

D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, q_\phi(z)) = \mathbb{E}_{q_\phi(z \mid x)}\left[ \log \frac{q_\phi(z \mid x)}{q_\phi(z)} \right]  (7)

cannot be calculated in a closed form, which prevents us from using the aggregated posterior as the prior.
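To see what Eq. (7) involves, a toy sketch helps: with two data points and explicit 1-D Gaussian "encoder" posteriors (our own toy setup, not the paper's model), the aggregated posterior is an explicit two-component mixture, sampling from it by ancestral sampling is straightforward, and Eq. (7) can be estimated by Monte Carlo. With N real data points and a neural encoder, the mixture density is intractable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: two data points whose "encoder" posteriors are 1-D Gaussians.
# The aggregated posterior q(z) is then an explicit two-component mixture;
# in a real VAE it has N neural-network components and no closed form.
mus, sigma = np.array([-1.0, 1.0]), 0.5

def log_normal(z, mu, s):
    return -0.5 * ((z - mu) / s)**2 - np.log(s * np.sqrt(2 * np.pi))

def log_agg(z):  # log of the aggregated posterior (mixture over data points)
    return np.logaddexp(log_normal(z, mus[0], sigma),
                        log_normal(z, mus[1], sigma)) - np.log(2)

# Ancestral sampling from q(z): pick a data point, then sample its posterior.
idx = rng.integers(0, 2, size=5)
z_agg = mus[idx] + sigma * rng.standard_normal(5)
print(z_agg)

# Monte Carlo estimate of Eq. (7) for the first data point.
z = mus[0] + sigma * rng.standard_normal(100_000)
kl = np.mean(log_normal(z, mus[0], sigma) - log_agg(z))
print(kl)  # strictly positive, bounded above by log 2 for this 2-point mixture
```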

2.3 Previous work: VampPrior

In previous work, the aggregated posterior is modeled by a finite mixture of encoders to calculate the KL divergence. Given a dataset X = \{x_n\}_{n=1}^{N}, the aggregated posterior can be simply modeled by an empirical distribution:

q_\phi(z) \approx \frac{1}{N} \sum_{n=1}^{N} q_\phi(z \mid x_n).  (8)

Nevertheless, this empirical distribution incurs over-fitting [Tomczak and Welling 2018]. Thus, the VampPrior [Tomczak and Welling 2018] models the aggregated posterior by

p_\lambda(z) = \frac{1}{K} \sum_{k=1}^{K} q_\phi(z \mid u_k),  (9)

where K is the number of mixtures, and u_k is a vector with the same dimensionality as a data point. u_k is regarded as a pseudo input for the encoder, and is optimized during the training of the VAE through stochastic gradient descent (SGD). If K \ll N, the VampPrior can avoid over-fitting [Tomczak and Welling 2018]. The KL divergence with the VampPrior can be calculated by Monte Carlo approximation. The VAE with the VampPrior achieves better density estimation performance than the VAE with the standard Gaussian prior and the VAE with a Gaussian mixture prior [Dilokthanakul et al. 2016]. However, this approach has a major drawback: it has sensitive hyperparameters, such as the number of mixtures K, which are difficult to tune.
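The mixture in Eq. (9) can be sketched directly. Here a hypothetical scalar "encoder" with unit variance stands in for the real encoder network, and the pseudo inputs are fixed rather than learned by SGD; only the mixture structure matches Eq. (9):

```python
import numpy as np

# A hypothetical toy "encoder": maps an input u to the mean of a unit-variance
# 1-D Gaussian posterior (a real encoder is a neural network over data space).
def encoder_mu(u):
    return np.tanh(u)

def log_vampprior(z, pseudo_inputs):
    """log p(z) under Eq. (9): a uniform mixture of the encoder's posteriors
    evaluated at K pseudo inputs u_k."""
    K = len(pseudo_inputs)
    comp = np.array([-0.5 * (z - encoder_mu(u))**2 - 0.5 * np.log(2 * np.pi)
                     for u in pseudo_inputs])
    return np.logaddexp.reduce(comp, axis=0) - np.log(K)

pseudo = np.array([-2.0, 0.0, 2.0])  # u_k would be optimized by SGD in the VAE
print(log_vampprior(0.0, pseudo))
```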

From the above discussion, the aggregated posterior seems to be difficult to model explicitly. In this paper, we estimate the KL divergence with the aggregated posterior without modeling the aggregated posterior explicitly.

3 Proposed Method

In this section, we propose an approximation method for the KL divergence with the aggregated posterior and describe the optimization procedure of our approach.

3.1 Estimating the KL Divergence

As shown in Eq. (7), the KL divergence with the aggregated posterior is the expectation of the logarithm of the density ratio q_\phi(z \mid x) / q_\phi(z). In this paper, we introduce the density ratio trick [Sugiyama, Suzuki, and Kanamori 2012, Goodfellow et al. 2014], which can estimate the ratio of two distributions without modeling each distribution explicitly. Hence, there is no need to model the aggregated posterior explicitly. By using the density ratio trick, q_\phi(z \mid x) / q_\phi(z) can be estimated by using a probabilistic binary classifier.

However, the density ratio trick has a serious drawback: it has been experimentally shown to work poorly in high dimensions [Sugiyama, Suzuki, and Kanamori 2012, Rosca, Lakshminarayanan, and Mohamed 2018]. Unfortunately, if x is high-dimensional, q_\phi(z \mid x) / q_\phi(z) also becomes a high-dimensional density ratio. The reason is as follows. Since q_\phi(z \mid x) is a conditional distribution of z given x, the density ratio trick has to use a probabilistic binary classifier D(x, z) that takes x and z jointly as input. In fact, D(x, z) estimates the density ratio between joint distributions over x and z, which is a high-dimensional density ratio when x is high-dimensional [Mescheder, Nowozin, and Geiger 2017].

To avoid density ratio estimation in high dimensions, we rewrite the KL divergence as follows:

D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, q_\phi(z)) = D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z)) - \mathbb{E}_{q_\phi(z \mid x)}\left[ \log \frac{q_\phi(z)}{p(z)} \right],  (10)

where p(z) = \mathcal{N}(z \mid 0, I) is the standard Gaussian distribution. The first term in Eq. (10) is the KL divergence between the encoder and the standard Gaussian distribution, which can be calculated in a closed form. The second term is the expectation of the logarithm of the density ratio q_\phi(z) / p(z). We estimate q_\phi(z) / p(z) with the density ratio trick. Since the latent variable vector z is low-dimensional, the density ratio trick works well.
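The decomposition in Eq. (10) is an algebraic identity, which a toy numerical check makes concrete. All densities below are our own 1-D stand-ins (a wide Gaussian in place of the true aggregated posterior), chosen only so that every term is computable:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D densities where every term of Eq. (10) is computable, to check the
# identity KL(q(z|x) || q(z)) = KL(q(z|x) || p(z)) - E_{q(z|x)}[log q(z)/p(z)].
mu_x, s_x = 0.7, 0.6          # encoder posterior q(z|x) = N(0.7, 0.6^2)
mu_a, s_a = 0.0, 1.3          # stand-in aggregated posterior q(z) = N(0, 1.3^2)

def log_n(z, mu, s):
    return -0.5 * ((z - mu) / s)**2 - np.log(s * np.sqrt(2 * np.pi))

z = mu_x + s_x * rng.standard_normal(200_000)       # z ~ q(z|x)
lhs = np.mean(log_n(z, mu_x, s_x) - log_n(z, mu_a, s_a))       # KL(q(z|x)||q(z))
kl_to_std = np.mean(log_n(z, mu_x, s_x) - log_n(z, 0.0, 1.0))  # KL(q(z|x)||p(z))
ratio_term = np.mean(log_n(z, mu_a, s_a) - log_n(z, 0.0, 1.0)) # E[log q(z)/p(z)]
print(lhs, kl_to_std - ratio_term)  # the two sides coincide
```

Because the identity holds pointwise for every sample z, the two sides agree up to floating-point error, not just in expectation.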

We can estimate the density ratio q_\phi(z) / p(z) as follows. First, we prepare samples from q_\phi(z) and samples from p(z). We can sample from p(z) since it is a Gaussian, and we can also sample from the aggregated posterior q_\phi(z) by using ancestral sampling: we choose a data point x from the dataset at random and sample z from the encoder q_\phi(z \mid x) given this data point. Second, we assign label y = 1 to samples from q_\phi(z) and y = 0 to samples from p(z). Then, we define the class-conditional distributions as follows:

p(z \mid y = 1) = q_\phi(z), \qquad p(z \mid y = 0) = p(z).  (11)

Third, we introduce a probabilistic binary classifier D(z) = p(y = 1 \mid z) that discriminates between the samples from q_\phi(z) and the samples from p(z). If the classifier discriminates these samples optimally, we can rewrite the density ratio by using Bayes' theorem and Eq. (11) as follows:

\frac{q_\phi(z)}{p(z)} = \frac{p(z \mid y = 1)}{p(z \mid y = 0)} = \frac{p(y = 1 \mid z)\, p(y = 0)}{p(y = 0 \mid z)\, p(y = 1)} = \frac{D(z)}{1 - D(z)},  (12)

where p(y = 1) equals p(y = 0) since the number of samples is the same. We model D(z) by D(z) = \sigma(T_\psi(z)), where T_\psi(z) is a neural network with parameter \psi and input z, and \sigma(\cdot) is the sigmoid function. We train T_\psi to maximize the following objective function:

\max_{\psi}\; \mathbb{E}_{q_\phi(z)}\bigl[\log \sigma(T_\psi(z))\bigr] + \mathbb{E}_{p(z)}\bigl[\log (1 - \sigma(T_\psi(z)))\bigr].  (13)

By using D(z) = \sigma(T_\psi(z)), we can estimate the log density ratio as follows:

\log \frac{q_\phi(z)}{p(z)} \approx \log \frac{D(z)}{1 - D(z)} = T_\psi(z).  (14)

Therefore, we can estimate the KL divergence with the aggregated posterior by

D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, q_\phi(z)) \approx D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z)) - \mathbb{E}_{q_\phi(z \mid x)}[T_\psi(z)].  (15)
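This classifier-based estimator can be exercised end to end on a toy problem where the true log density ratio is known. The linear logit below is a stand-in for the neural network T_\psi, and the two Gaussians are our own choices, not the paper's distributions; for N(1, 1) versus N(0, 1) the true log ratio is exactly z - 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
# Density ratio trick on a toy pair of 1-D Gaussians where the true log ratio
# is known: log N(z; 1, 1) / N(z; 0, 1) = z - 0.5. A linear logit
# T(z) = w*z + b stands in for the neural network T_psi.
n = 20_000
z_q = rng.normal(1.0, 1.0, n)   # stand-in for samples from q_phi(z), label 1
z_p = rng.normal(0.0, 1.0, n)   # samples from the standard Gaussian, label 0

z = np.concatenate([z_q, z_p])
y = np.concatenate([np.ones(n), np.zeros(n)])

w, b = 0.0, 0.0
for _ in range(2000):            # full-batch gradient ascent on the objective
    p = 1.0 / (1.0 + np.exp(-(w * z + b)))   # D(z) = sigma(T(z))
    g = y - p                    # gradient of the Bernoulli log-likelihood
    w += 0.1 * np.mean(g * z)
    b += 0.1 * np.mean(g)

print(w, b)  # approaches the true log-ratio coefficients (1.0, -0.5)
```

The learned logit T(z) = w z + b then serves directly as the log density ratio estimate, as in Eq. (14).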

3.2 Optimization Procedure

From the above discussion, we obtain the training objective function of the VAE with our implicit optimal prior:

\max_{\theta, \phi}\; \mathbb{E}_{p_d(x)}\Bigl[ -D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z \mid x)}[T_\psi(z)] + \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] \Bigr],  (16)

where T_\psi maximizes Eq. (13). Given a dataset X = \{x_n\}_{n=1}^{N}, we optimize the Monte Carlo approximation of this objective:

\max_{\theta, \phi}\; \frac{1}{N} \sum_{n=1}^{N} \Bigl[ -D_{\mathrm{KL}}(q_\phi(z \mid x_n) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z \mid x_n)}\bigl[ T_\psi(z) + \log p_\theta(x_n \mid z) \bigr] \Bigr],  (17)

and we approximate the expectation term by the reparameterization trick [Kingma and Welling 2013]:

\mathbb{E}_{q_\phi(z \mid x_n)}[f(z)] \approx \frac{1}{L} \sum_{l=1}^{L} f\bigl(z_n^{(l)}\bigr), \qquad z_n^{(l)} = \mu_\phi(x_n) + \sigma_\phi(x_n) \odot \epsilon^{(l)},  (18)

where \epsilon^{(l)} is a sample drawn from \mathcal{N}(0, I), \odot is the element-wise product, and L is the sample size of the reparameterization trick. Then, the resulting objective function is

\max_{\theta, \phi}\; \frac{1}{N} \sum_{n=1}^{N} \Bigl[ -D_{\mathrm{KL}}(q_\phi(z \mid x_n) \,\|\, p(z)) + \frac{1}{L} \sum_{l=1}^{L} \bigl( T_\psi(z_n^{(l)}) + \log p_\theta(x_n \mid z_n^{(l)}) \bigr) \Bigr].  (19)

We optimize this model with stochastic gradient descent (SGD) [Duchi, Hazan, and Singer 2011, Zeiler 2012, Tieleman and Hinton 2012, Kingma and Ba 2014] by iterating a two-step procedure: we first update \theta and \phi to maximize Eq. (19) with \psi fixed, and next update \psi to maximize the Monte Carlo approximation of Eq. (13) with \theta and \phi fixed, as follows:

\max_{\psi}\; \frac{1}{M} \sum_{m=1}^{M} \Bigl[ \log \sigma\bigl(T_\psi(z^{(m)})\bigr) + \log \bigl(1 - \sigma(T_\psi(\hat{z}^{(m)}))\bigr) \Bigr],  (20)

where z^{(m)} is a sample drawn from q_\phi(z), \hat{z}^{(m)} is a sample drawn from p(z), and M is the sampling size of the Monte Carlo approximation. Note that we would need to compute the gradient of T_\psi with respect to \phi in the optimization of Eq. (19), since T_\psi models \log q_\phi(z)/p(z). However, when T_\psi(z) equals \log q_\phi(z)/p(z), the expectation of this gradient becomes zero, as follows:

\mathbb{E}_{p_d(x)} \mathbb{E}_{q_\phi(z \mid x)}\bigl[\nabla_\phi \log q_\phi(z)\bigr] = \mathbb{E}_{q_\phi(z)}\bigl[\nabla_\phi \log q_\phi(z)\bigr] = \int \nabla_\phi q_\phi(z)\, dz = \nabla_\phi \int q_\phi(z)\, dz = 0.  (21)

Therefore, we ignore this gradient in the optimization. (Almost the same discussion appears in [Mescheder, Nowozin, and Geiger 2017].) We also note that T_\psi is likely to overfit to the log density ratio between the empirical aggregated posterior (Eq. (8)) and the standard Gaussian distribution. As mentioned in Section 2.3, this over-fitting also incurs over-fitting of the VAE [Tomczak and Welling 2018]. Therefore, we use regularization techniques such as dropout [Srivastava et al. 2014] for T_\psi, which prevent it from over-fitting. We train T_\psi more often than \theta and \phi: if we update \theta and \phi for n_1 steps, we update \psi for n_2 steps, where n_2 is larger than n_1. Algorithm 1 shows the pseudo code of the optimization procedure of this model.

1:  while not converged do
2:      for n_1 steps do
3:          Sample a minibatch \{x\} from the dataset X
4:          Compute the gradients of Eq. (19) w.r.t. \theta and \phi
5:          Update \theta and \phi with their gradients
6:      end for
7:      for n_2 steps do
8:          Sample a minibatch \{z\} from q_\phi(z)
9:          Sample a minibatch \{\hat{z}\} from p(z)
10:         Compute the gradient of Eq. (20) w.r.t. \psi
11:         Update \psi with its gradient
12:     end for
13: end while
Algorithm 1: VAE with Implicit Optimal Priors
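The justification for ignoring the gradient in Eq. (21), that the expected score of q_\phi under itself vanishes, can be checked numerically. The Gaussian q and its score below are our own toy stand-ins for the aggregated posterior:

```python
import numpy as np

rng = np.random.default_rng(0)
# Sanity check for Eq. (21): the expected score of a density under itself is
# zero, so the gradient through T_psi's dependence on phi can be ignored.
# Here q is N(mu, 1), whose score with respect to mu is (z - mu).
mu = 0.8
z = rng.normal(mu, 1.0, 500_000)   # z ~ q
score = z - mu                     # d/dmu log N(z; mu, 1)
print(np.mean(score))  # close to zero
```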

4 Related Work

For improving the density estimation performance of the VAE, numerous works have focused on the regularization effect of the KL divergence between the encoder and the prior. These works improve either the encoder or the prior.

First, we focus on work on the prior. Although the optimal prior for the VAE is the aggregated posterior, the KL divergence with the aggregated posterior cannot be calculated in a closed form. As described in Section 2.3, the VampPrior [Tomczak and Welling 2018] has been presented to solve this problem. However, it has sensitive hyperparameters, such as the number of mixtures K. Since the VampPrior incurs a heavy computational cost, these hyperparameters are difficult to tune. In contrast, our approach can estimate the KL divergence more easily and robustly than the VampPrior since it does not need to model the aggregated posterior explicitly. In addition, since the computational cost of our approach is much lighter than that of the VampPrior, its hyperparameters are easier to tune.

There are also approaches to improving the prior other than the aggregated posterior. For example, a non-parametric Bayesian distribution [Nalisnick and Smyth 2017] and a hyperspherical distribution [Davidson et al. 2018] have been used for the prior. These approaches aim to obtain useful and interpretable latent representations rather than to improve density estimation performance, which differs from our purpose. We should mention a disadvantage of our approach compared with these approaches: since our prior is implicit, we cannot sample from it directly. Instead, we can sample from the aggregated posterior, which our implicit prior models, by using ancestral sampling. That is, when we sample from the prior, we need to prepare a data point.

Next, we focus on the works about the encoder. To improve the density estimation performance, these works increase the flexibility of the encoder. The normalizing flow [Rezende and Mohamed2015, Kingma et al.2016, Huang et al.2018] is one of the main approaches, which applies a sequence of invertible transformations to the latent variable vector until a desired level of flexibility is attained. Our approach is orthogonal to the normalizing flow and can be used together with it.

The approaches most similar to ours are adversarial variational Bayes (AVB) [Mescheder, Nowozin, and Geiger 2017] and the adversarial autoencoder (AAE) [Makhzani et al. 2015, Tolstikhin et al. 2017]. These approaches use an implicit encoder network, which takes as input a data point and Gaussian random noise and produces a latent variable vector z. Since the implicit encoder does not assume a distribution type, it can represent very flexible distributions. In these approaches, the standard Gaussian distribution is used for the prior. Although the KL divergence between the implicit encoder and the standard Gaussian prior cannot be calculated in a closed form, the AVB estimates this KL divergence by using the density ratio trick. However, this estimation does not work well with high-dimensional datasets, since this KL divergence involves a high-dimensional density ratio [Rosca, Lakshminarayanan, and Mohamed 2018]. Our approach avoids this problem since we apply the density ratio trick in a low dimension. The AAE is an extension of the autoencoder rather than of the VAE. The AAE regularizes the aggregated posterior to be close to the standard Gaussian prior by minimizing the KL divergence between them. The AAE also uses the density ratio trick to estimate this KL divergence, and this works well since it is a low-dimensional density ratio. However, the AAE cannot estimate the probability of a data point, whereas our approach is based on the VAE and can.

5 Experiments

In this section, we experimentally evaluate the density estimation performance of our approach.

5.1 Data

We used five datasets: OneHot [Mescheder, Nowozin, and Geiger 2017], MNIST [Salakhutdinov and Murray 2008], OMNIGLOT [Burda, Grosse, and Salakhutdinov 2015], FreyFaces (available at https://cs.nyu.edu/~roweis/data/frey_rawface.mat), and Histopathology [Tomczak and Welling 2016]. OneHot consists of only the four-dimensional one-hot vectors (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1). This simple dataset is useful for observing the posterior of the latent variable and was used in [Mescheder, Nowozin, and Geiger 2017]. MNIST and OMNIGLOT are binary image datasets, and FreyFaces and Histopathology are grayscale image datasets. These image datasets are useful for measuring density estimation performance and were used in [Tomczak and Welling 2018]. The number and dimensions of data points of the five datasets are listed in Table 1.

Dataset         Dimension  Train size  Valid size  Test size
OneHot          4          1,000       100         1,000
MNIST           784        50,000      10,000      10,000
OMNIGLOT        784        23,000      1,345       8,070
FreyFaces       560        1,565       200         200
Histopathology  784        6,800       2,000       2,000

Table 1: Number and dimensions of data points in each dataset.
Figure 1: Comparison of posteriors of the latent variable on OneHot. We plotted samples drawn from q_\phi(z \mid x), where x is one of the one-hot vectors (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1). We used test data for this sampling. Samples in each color correspond to the latent representation of one of the one-hot vectors. (a) Standard VAE (VAE with standard Gaussian prior). (b) AVB. (c) VAE with VampPrior. (d) Proposed method.

Figure 2: Comparison of the evidence lower bound (ELBO) with validation data on OneHot. We plotted the ELBO from 100 to 1,000 epochs since we used warm-up for the first 100 epochs. The optimal log-likelihood on this dataset is -\ln 4 \approx -1.39, since the four one-hot vectors are equiprobable; we plotted this value as a dashed line for comparison. (a) Standard VAE (VAE with standard Gaussian prior). (b) AVB. (c) VAE with VampPrior. (d) Proposed method.
                     MNIST           OMNIGLOT        FreyFaces        Histopathology
Standard VAE         -85.84 ± 0.07   -111.39 ± 0.11  1382.53 ± 3.57   1081.53 ± 0.70
VAE with VampPrior   -83.90 ± 0.08   -110.53 ± 0.09  1392.62 ± 6.25   1083.11 ± 2.10
Proposed method      -83.21 ± 0.13   -108.48 ± 0.16  1396.27 ± 2.75   1087.42 ± 0.60

Table 2: Comparison of test log-likelihoods on four image datasets.
Figure 3: Relationship between the test log-likelihoods and the number of pseudo inputs K of the VampPrior on Histopathology. We plotted the test log-likelihood of our approach as a dashed line for comparison. The semi-transparent area and error bars represent standard deviations.

5.2 Setup

We compared our implicit optimal prior with the standard Gaussian prior and the VampPrior. We set the dimensionality of the latent variable vector to 2 for OneHot and 40 for the other datasets. We used two-layer neural networks (500 hidden units per layer) for the encoder, the decoder, and the density ratio estimator. We used the gating mechanism [Dauphin et al. 2016] for the encoder and the decoder, and a hyperbolic tangent as the activation function for the density ratio estimator. We initialized the weights of these neural networks in accordance with the method of [Glorot and Bengio 2010]. We used a Gaussian distribution as the encoder. As the decoder, we used a Bernoulli distribution for OneHot, MNIST, and OMNIGLOT and a Gaussian distribution for FreyFaces and Histopathology, whose means were constrained to the interval (0, 1) by using a sigmoid function. We trained all methods by using Adam [Kingma and Ba 2014] with a mini-batch size of 100 and a tuned learning rate. We set the maximum number of epochs to 1,000 and used early-stopping [Goodfellow, Bengio, and Courville 2016] on the basis of validation data. The sample size L of the reparameterization trick was fixed across all experiments. In addition, we used warm-up [Bowman et al. 2015] for the first 100 epochs of Adam. For MNIST and OMNIGLOT, we used dynamic binarization [Salakhutdinov and Murray 2008] during the training of the VAE to avoid over-fitting. For the image datasets, we calculated the log marginal likelihood of the test data by using importance sampling [Burda, Grosse, and Salakhutdinov 2015] with a sample size of 10. We ran all experiments eight times each.

With the VampPrior, we set the number of mixtures K to 50 for OneHot, 500 for MNIST, FreyFaces, and Histopathology, and 1,000 for OMNIGLOT. In addition, for the image datasets, we used a clipped ReLU function, min(max(x, 0), 1), to scale the pseudo inputs into [0, 1], since the data points of these datasets lie in [0, 1]. (We referred to https://github.com/jmtomczak/vae_vampprior.)

With our approach, we used dropout [Srivastava et al. 2014] in the training of the density ratio estimator since it is likely to over-fit, with a keep probability of 50%. We updated the parameter \psi of the density ratio estimator for 10 epochs during each epoch of updating the VAE parameters \theta and \phi. The sampling size M of the Monte Carlo approximation in Eq. (20) was fixed across all experiments.

In addition, we compared our approach with adversarial variational Bayes (AVB) on OneHot. We set the dimension of the Gaussian random noise input of the AVB to 10; the other settings were almost the same as those for our approach.

5.3 Results

Figure 1 shows the posteriors of the latent variable for each approach on OneHot, and Figure 2 shows the evidence lower bound of each approach on OneHot.

These results show the differences between the approaches. The evidence lower bound (ELBO) of the standard VAE (VAE with standard Gaussian prior) on OneHot was worse than the optimal log-likelihood on this dataset, -\ln 4 \approx -1.39. The over-regularization incurred by the standard Gaussian prior is a likely reason: the posteriors overlapped, making it difficult to discriminate between samples from them, so the decoder became confused when reconstructing. This caused the poor density estimation performance.

On the other hand, the ELBOs of the AVB, the VAE with the VampPrior, and our approach are much closer to the optimal log-likelihood than that of the standard VAE. We note that the ELBOs of the AVB and our approach are estimated values, and that these approaches may overestimate the ELBO on OneHot since the training data and validation data of OneHot are the same. First, we focus on the AVB. Although there is still strong regularization by the standard Gaussian prior, the posteriors barely overlapped, and the data point was easy to reconstruct from the latent representation. The reason is that the implicit encoder network of the AVB can learn complex posterior distributions. Next, we focus on the VAE with the VampPrior and our approach. The VampPrior and our implicit optimal prior model the aggregated posterior, which is the optimal prior for the VAE. These priors kept the posteriors of different data points distinct from each other, and the data point was easy to reconstruct from the latent representation.

Table 2 compares the test log-likelihoods on the four image datasets. We used bold to highlight the best result and the results that are not statistically different from the best according to a pair-wise t-test with a significance level of 5%. We did not compare with the AVB since the estimated log marginal likelihood of the AVB on high-dimensional datasets such as images is not accurate [Rosca, Lakshminarayanan, and Mohamed 2018].

First, we focus on the VampPrior. The test log-likelihoods of the VampPrior are better than those of the standard VAE. However, we found two drawbacks of the VampPrior. One is that the pseudo inputs are difficult to optimize; for example, they have an initial-value dependence. Although warm-up helps with this problem, it seems difficult to solve completely. The other is that the number of mixtures K is a sensitive hyperparameter. Figure 3 shows the test log-likelihoods with various K on Histopathology. The high standard deviation of the VampPrior indicates its strong dependence on the pseudo-input initial values. In addition, even when we choose the optimal K, the test log-likelihood of the VampPrior is worse than that of our approach.

Next, we focus on our approach. Our approach obtained density estimation performance equal to or better than that of the VampPrior. Since our approach models the aggregated posterior implicitly, it can estimate the KL divergence more easily and robustly than the VampPrior. In addition, it has a much lighter computational cost: in the training phase on MNIST, our approach was substantially faster than the VampPrior. Therefore, although our approach has hyperparameters of its own, such as the neural architecture of the density ratio estimator, these hyperparameters are easier to tune than those of the VampPrior.

These results indicate that our implicit optimal prior is a good alternative to the VampPrior: our implicit optimal prior can be optimized easily and robustly, and its density estimation performance is equal to or better than that of the VAE with the VampPrior.

6 Conclusion

In this paper, we proposed the variational autoencoder (VAE) with implicit optimal priors. Although the standard Gaussian distribution is usually used for the prior, this simple prior incurs over-regularization, which is one of the causes of poor density estimation performance. To improve the density estimation performance, the aggregated posterior has been introduced as a sophisticated prior, which is optimal in the sense of maximizing the training objective function of the VAE. However, the Kullback-Leibler (KL) divergence between the encoder and the aggregated posterior cannot be calculated in a closed form, which prevents us from using this optimal prior. Even though explicit modeling of the aggregated posterior has been tried, this optimal prior is difficult to model explicitly.

With the proposed method, we introduced the density ratio trick for estimating this KL divergence directly. Since the density ratio trick can estimate the density ratio between two distributions without modeling each distribution explicitly, there is no need to model the aggregated posterior explicitly. Although the density ratio trick is useful, it does not work well in high dimensions, and the KL divergence between the encoder and the aggregated posterior involves a high-dimensional density ratio. Hence, we rewrote the KL divergence as the sum of two terms: the KL divergence between the encoder and the standard Gaussian distribution, which can be calculated in a closed form, and a low-dimensional density ratio between the aggregated posterior and the standard Gaussian distribution, to which the density ratio trick is applied. We experimentally showed the high density estimation performance of the VAE with this implicit optimal prior.

References