
Variational Inference: A Unified Framework of Generative Models and Some Revelations

by   Jianlin Su, et al.

We reinterpret variational inference from a new perspective. In this way, we can easily prove that the EM algorithm, VAE, GAN, AAE, and ALI (BiGAN) are all special cases of variational inference. The proof also reveals that the loss of the standard GAN is incomplete, which explains why GANs must be trained cautiously. From that, we derive a regularization term that improves the stability of GAN training.


Variational Inference

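Suppose $x$ is an explicit variable, $z$ is a latent variable, and $\tilde{p}(x)$ is the evidence distribution of $x$. We let

$$q(x) = q_\theta(x) = \int q_\theta(x, z)\,dz$$

and we hope $q_\theta(x)$ will be a good approximation of $\tilde{p}(x)$. In general, we want to maximize the log-likelihood function

$$\theta = \arg\max_\theta \int \tilde{p}(x) \log q_\theta(x)\,dx,$$

which is equivalent to minimizing $KL(\tilde{p}(x)\,\|\,q(x))$:

$$KL(\tilde{p}(x)\,\|\,q(x)) = \int \tilde{p}(x) \log \frac{\tilde{p}(x)}{q(x)}\,dx.$$

But if we cannot calculate the integral analytically, we cannot maximize the log-likelihood or minimize the KL divergence directly.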

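Variational inference changes the objective function: rather than minimizing the KL divergence of the marginal distributions, $KL(\tilde{p}(x)\,\|\,q(x))$, we can minimize the KL divergence of the joint distributions, $KL(p(x,z)\,\|\,q(x,z))$ or $KL(q(x,z)\,\|\,p(x,z))$, where $p(x,z) = \tilde{p}(x)\,p(z|x)$ is a joint distribution whose marginal over $z$ is $\tilde{p}(x)$. We have

$$KL(p(x,z)\,\|\,q(x,z)) = KL(\tilde{p}(x)\,\|\,q(x)) + \int \tilde{p}(x)\, KL(p(z|x)\,\|\,q(z|x))\,dx \;\geq\; KL(\tilde{p}(x)\,\|\,q(x)),$$

which shows that $KL(p(x,z)\,\|\,q(x,z))$ is an upper bound of $KL(\tilde{p}(x)\,\|\,q(x))$. In many cases, the joint KL divergence is easier to calculate than the marginal KL divergence. Therefore, variational inference provides a computable solution. If it works, we have $q(x,z) \to p(x,z)$, which means $q(x) = \int q(x,z)\,dz \to \int p(x,z)\,dz = \tilde{p}(x)$. Namely, $q(x)$ becomes an approximation of the real distribution $\tilde{p}(x)$.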
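The bound can be checked numerically on a small discrete example. The sketch below is only an illustration: the two 2x2 joint tables are arbitrary, hypothetical numbers, and the helper names are not from the paper. It computes the joint KL divergence, the marginal KL divergence, and the expected conditional KL divergence that separates them.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as flat lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joint tables indexed as joint[x][z]; the values are arbitrary.
p_joint = [[0.30, 0.10], [0.20, 0.40]]
q_joint = [[0.25, 0.25], [0.15, 0.35]]

def marginal_x(joint):
    """Sum out z to get the marginal over x."""
    return [sum(row) for row in joint]

def flatten(joint):
    return [v for row in joint for v in row]

kl_joint = kl(flatten(p_joint), flatten(q_joint))
kl_marginal = kl(marginal_x(p_joint), marginal_x(q_joint))

# Gap term: E_{x ~ p(x)} [ KL(p(z|x) || q(z|x)) ]
gap = 0.0
for p_row, q_row in zip(p_joint, q_joint):
    px, qx = sum(p_row), sum(q_row)
    gap += px * kl([v / px for v in p_row], [v / qx for v in q_row])

print(kl_joint, kl_marginal, gap)
```

The joint divergence equals the marginal divergence plus the non-negative gap term, so driving the joint divergence toward zero also drives the marginal one toward zero.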

VAE and EM algorithm

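With this new view of variational inference, VAE and the EM algorithm can be derived in a very simple way.

In VAE, we let $q(x,z) = q(x|z)\,q(z)$ and $p(x,z) = \tilde{p}(x)\,p(z|x)$, where $q(x|z)$ and $p(z|x)$ are Gaussian distributions with unknown parameters and $q(z)$ is the standard Gaussian distribution. The loss we need to minimize is

$$KL(p(x,z)\,\|\,q(x,z)) = \iint \tilde{p}(x)\,p(z|x) \log \frac{\tilde{p}(x)\,p(z|x)}{q(x|z)\,q(z)}\,dx\,dz.$$

Since $\log \tilde{p}(x)$ does not contain any parameters, it does not change the final result. So the loss can be transformed into

$$\mathbb{E}_{x\sim\tilde{p}(x)}\left[-\int p(z|x) \log q(x|z)\,dz + KL(p(z|x)\,\|\,q(z))\right].$$

Because $p(z|x)$ and $q(z)$ are both Gaussian, we can get the analytic expression of $KL(p(z|x)\,\|\,q(z))$. With the reparametrization trick, the first term can be approximated by $-\log q(x|z)$ with a single sample $z \sim p(z|x)$. Consequently, the final loss for VAE is

$$\mathbb{E}_{x\sim\tilde{p}(x)}\left[-\log q(x|z) + KL(p(z|x)\,\|\,q(z))\right], \quad z \sim p(z|x).$$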
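As an illustration of the VAE loss $\mathbb{E}_{x\sim\tilde{p}(x)}[-\log q(x|z) + KL(p(z|x)\,\|\,q(z))]$, the following is a minimal sketch for a single data point. It assumes a diagonal Gaussian encoder and a unit-variance Gaussian decoder; `toy_encoder` and `toy_decoder` are hypothetical stand-ins for neural networks, not part of the paper.

```python
import math
import random

def kl_diag_gaussian_to_standard(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def neg_log_gaussian(x, mean, var=1.0):
    """-log N(x; mean, var * I) with a fixed scalar variance (an assumption)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return 0.5 * sq / var + 0.5 * len(x) * math.log(2.0 * math.pi * var)

def vae_loss_single(x, encoder, decoder, rng):
    """One-sample estimate of -log q(x|z) + KL(p(z|x) || q(z))."""
    mu, log_var = encoder(x)
    z = reparameterize(mu, log_var, rng)
    return neg_log_gaussian(x, decoder(z)) + kl_diag_gaussian_to_standard(mu, log_var)

# Hypothetical one-dimensional encoder and decoder, for illustration only.
def toy_encoder(x):
    return [0.5 * x[0]], [-1.0]   # mean and log-variance of p(z|x)

def toy_decoder(z):
    return [2.0 * z[0]]           # mean of q(x|z)

rng = random.Random(0)
loss = vae_loss_single([1.0], toy_encoder, toy_decoder, rng)
print(loss)
```

In practice the loss would be averaged over a mini-batch and minimized with respect to the encoder and decoder parameters by a gradient-based optimizer.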


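The assumptions of the EM algorithm are like those of VAE, except that $p(z|x)$ is not assumed to be Gaussian. In the EM algorithm, the loss is still $KL(p(x,z)\,\|\,q(x,z))$, but we treat the entire $p(z|x)$ as a training parameter. Rather than minimizing the loss directly, we train alternately. First, we fix $p(z|x)$ and optimize only $q(x|z)$. Removing the "constant" terms, the objective for $q(x|z)$ is

$$q(x|z) = \arg\max_{q(x|z)} \mathbb{E}_{x\sim\tilde{p}(x)}\left[\int p(z|x) \log q(x|z)\,dz\right].$$

Second, we fix $q(x|z)$ and optimize $p(z|x)$. We define $q(z|x)$ by

$$q(z|x) = \frac{q(x|z)\,q(z)}{q(x)}, \qquad q(x) = \int q(x|z)\,q(z)\,dz;$$

now we have

$$KL(p(x,z)\,\|\,q(x,z)) = \mathbb{E}_{x\sim\tilde{p}(x)}\left[KL(p(z|x)\,\|\,q(z|x))\right] + KL(\tilde{p}(x)\,\|\,q(x)).$$

Because we make no assumptions about the form of $p(z|x)$, we can let $p(z|x) = q(z|x)$, which makes the first term (the only one that depends on $p(z|x)$) equal to zero and is therefore the optimal solution for $p(z|x)$. In other words, the optimal $p(z|x)$ is

$$p(z|x) = \frac{q(x|z)\,q(z)}{\int q(x|z)\,q(z)\,dz}.$$

The EM algorithm simply performs these two steps alternately.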
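The alternation can be sketched on a toy model. The example below assumes a one-dimensional mixture of two unit-variance Gaussians with a fixed uniform $q(z)$, so that only the component means of $q(x|z)$ are learned; the data, the initial means, and all function names are hypothetical choices for illustration. The E-step sets $p(z|x)$ to the posterior of the current model, and the M-step maximizes $\mathbb{E}_x[\sum_z p(z|x)\log q(x|z)]$.

```python
import math
import random

def gaussian_pdf(x, mean, var=1.0):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(data, means, prior):
    """Set p(z|x) = q(x|z) q(z) / sum_z q(x|z) q(z) for every data point."""
    posteriors = []
    for x in data:
        w = [prior[k] * gaussian_pdf(x, means[k]) for k in range(len(means))]
        total = sum(w)
        posteriors.append([wk / total for wk in w])
    return posteriors

def m_step(data, posteriors, n_components):
    """Maximize the expected log q(x|z): posterior-weighted means."""
    means = []
    for k in range(n_components):
        weight = sum(r[k] for r in posteriors)
        means.append(sum(r[k] * x for r, x in zip(posteriors, data)) / weight)
    return means

def log_likelihood(data, means, prior):
    return sum(
        math.log(sum(prior[k] * gaussian_pdf(x, means[k]) for k in range(len(means))))
        for x in data
    )

rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
prior = [0.5, 0.5]      # fixed q(z)
means = [-1.0, 1.0]     # initial guess for the means of q(x|z)
history = []
for _ in range(30):
    posteriors = e_step(data, means, prior)
    means = m_step(data, posteriors, len(prior))
    history.append(log_likelihood(data, means, prior))

print(means, history[-1])
```

Each full iteration cannot decrease the log-likelihood, which is the usual monotonicity property of EM.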

GAN within Variational Inference

In this section, we describe a general approach to incorporating GAN into variational inference, which leads to a new insight into GAN and results in an effective regularization for GAN.

General Framework

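Like VAE, GAN also wants to obtain a generative model $q(x|z)$ that transforms $q(z) = N(0, I)$ into the evidence distribution $\tilde{p}(x)$. Instead of the Gaussian assumption in VAE, GAN lets $q(x|z)$ be a Dirac delta function

$$q(x|z) = \delta(x - G(z)), \qquad q(x) = \int q(x|z)\,q(z)\,dz,$$

where $G(z)$ is the neural network of the generative model, called the generator.

Generally, we consider $z$ a random latent variable in a generative model. However, it is well known that the Dirac delta function is non-zero at only one point, so the mapping from $z$ to $x$ in GAN is almost one-to-one. The variable $z$ is not "random" enough, so we do not treat it as a latent variable (which means we need not consider the posterior distribution $p(z|x)$). In fact, we consider only the binary random variable $y$ as a random latent variable in GAN:

$$q(x,y) = \begin{cases} \tilde{p}(x)\,p_1, & y = 1 \\ q(x)\,p_0, & y = 0 \end{cases}$$

where $p_1 = 1 - p_0$ describes a Bernoulli distribution. For simplicity we set $p_1 = p_0 = 1/2$.

On the other hand, we let $p(x,y) = p(y|x)\,\tilde{p}(x)$, where $p(y|x)$ is a conditional Bernoulli distribution. Unlike VAE, GAN chooses the other direction of the KL divergence as its optimization objective:

$$KL(q(x,y)\,\|\,p(x,y)) = \int \tilde{p}(x)\,p_1 \log \frac{\tilde{p}(x)\,p_1}{p(1|x)\,\tilde{p}(x)}\,dx + \int q(x)\,p_0 \log \frac{q(x)\,p_0}{p(0|x)\,\tilde{p}(x)}\,dx.$$

Once this succeeds, we have $q(x,y) \to p(x,y)$, which means

$$p_1\,\tilde{p}(x) + p_0\,q(x) = \sum_y q(x,y) \to \sum_y p(x,y) = \tilde{p}(x),$$

and consequently $q(x) \to \tilde{p}(x)$.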

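Now we have to solve for $p(y|x)$ and $G(z)$. For simplicity we set $p(1|x) = D(x)$, called the discriminator. As in the EM algorithm, we use an alternating training strategy. First, we fix $G(z)$, so $q(x)$ is fixed as well. Ignoring terms that are constant with respect to $D$, we get:

$$D = \arg\min_D\; -\mathbb{E}_{x\sim\tilde{p}(x)}\left[\log D(x)\right] - \mathbb{E}_{x\sim q(x)}\left[\log(1 - D(x))\right].$$

Then we fix $D(x)$ and optimize $G(z)$. Ignoring terms that are constant with respect to $G$, we get the pure loss:

$$G = \arg\min_G \int q(x) \log \frac{q(x)}{(1 - D(x))\,\tilde{p}(x)}\,dx.$$

Minimizing this loss requires the formula of $\tilde{p}(x)$, which is generally unavailable. By the same argument as for the optimal $p(z|x)$ in the EM algorithm, if $D(x)$ has enough fitting ability, the optimal $D(x)$ is

$$D(x) = \frac{\tilde{p}(x)}{\tilde{p}(x) + q^o(x)},$$

where $q^o(x)$ is $q(x)$ at the previous stage. We can solve for $\tilde{p}(x)$ from it and substitute into the loss:

$$\int q(x) \log \frac{q(x)}{D(x)\,q^o(x)}\,dx = -\mathbb{E}_{x\sim q(x)}\left[\log D(x)\right] + KL(q(x)\,\|\,q^o(x)).$$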
Basic Analysis

It is obvious that the first term is one of the standard GAN losses:

$$-\mathbb{E}_{x\sim q(x)}\left[\log D(x)\right] = -\mathbb{E}_{z\sim q(z)}\left[\log D(G(z))\right].$$

The second, extra term describes the distance between the new distribution and the old one. The two terms are adversarial: $KL(q(x)\,\|\,q^o(x))$ tries to keep the two distributions similar, while $-\log D(x)$ will be very large because $D(x)$ will be very small for $x \sim q^o(x)$ if the discriminator is fully trained (all such samples are considered negative), and vice versa. Thus, minimizing the entire loss requires the model both to inherit the old distribution $q^o(x)$ and to explore the new regions where $D(x) = p(1|x)$ is large.
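The split of the generator loss into an adversarial term and a KL term can be verified numerically. This sketch assumes four discrete points and arbitrary, hypothetical distributions `p_data`, `q_old`, and `q_new`; it builds the optimal discriminator $D(x) = \tilde{p}(x) / (\tilde{p}(x) + q^o(x))$ and compares $\int q \log \frac{q}{(1-D)\tilde{p}}$ with $-\mathbb{E}_{q}[\log D] + KL(q\,\|\,q^o)$.

```python
import math

# Hypothetical discrete distributions over four points (arbitrary values).
p_data = [0.1, 0.2, 0.3, 0.4]   # evidence distribution
q_old = [0.4, 0.3, 0.2, 0.1]    # generator distribution at the previous stage
q_new = [0.3, 0.3, 0.2, 0.2]    # candidate generator distribution

# Optimal discriminator against the old generator.
D = [p / (p + qo) for p, qo in zip(p_data, q_old)]

# Pure generator loss, written with the (normally unknown) evidence distribution.
pure_loss = sum(q * math.log(q / ((1.0 - d) * p)) for q, d, p in zip(q_new, D, p_data))

# The same loss after substituting the optimal discriminator.
adversarial_term = -sum(q * math.log(d) for q, d in zip(q_new, D))
kl_term = sum(q * math.log(q / qo) for q, qo in zip(q_new, q_old))

print(pure_loss, adversarial_term + kl_term)
```

The two printed values agree up to floating-point error, so the evidence distribution never has to be evaluated explicitly once the discriminator is available.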

As we know, the generator's loss in the current standard GAN lacks the second term, so it is actually an incomplete loss. Suppose there is an omnipotent optimizer that can find the global optimum in a very short time, and $G(z)$ has enough fitting ability; then $G(z)$ can generate only the single sample that makes $D(x)$ largest. In other words, the global optimal solution for $G(z)$ is $G(z) \equiv x_0$, where $x_0 = \arg\max_x D(x)$. This is called mode collapse, and in theory it will certainly occur.
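A small discrete example makes the collapse argument concrete. The discriminator scores and the old generator distribution below are hypothetical numbers. The closed form used for the minimizer of the complete loss, $q^* \propto D(x)\,q^o(x)$, is a standard consequence of Gibbs' inequality rather than a formula taken from the text.

```python
import math

# Hypothetical discriminator scores on four points and the old generator distribution.
D = [0.2, 0.4, 0.6, 0.8]
q_old = [0.4, 0.3, 0.2, 0.1]

def incomplete_loss(q):
    """-E_{x~q}[log D(x)], the usual generator loss."""
    return -sum(qi * math.log(di) for qi, di in zip(q, D) if qi > 0)

def complete_loss(q):
    """-E_{x~q}[log D(x)] + KL(q || q_old)."""
    kl = sum(qi * math.log(qi / qo) for qi, qo in zip(q, q_old) if qi > 0)
    return incomplete_loss(q) + kl

# Collapsed generator: all mass on the point where D is largest.
x0 = max(range(len(D)), key=lambda i: D[i])
collapsed = [1.0 if i == x0 else 0.0 for i in range(len(D))]

# Minimizer of the complete loss over all distributions: q* proportional to D * q_old.
weights = [d * qo for d, qo in zip(D, q_old)]
q_star = [w / sum(weights) for w in weights]

print(incomplete_loss(collapsed), incomplete_loss(q_star))
print(complete_loss(collapsed), complete_loss(q_star))
```

The incomplete loss is lowest for the collapsed distribution, while the complete loss prefers a distribution that still spreads its mass over every point.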

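So, what insight does $KL(q(x)\,\|\,q^o(x))$ give us? We let

$$q^o(x) = q_{\theta - \Delta\theta}(x), \qquad q(x) = q_\theta(x),$$

which means that the update of the parameters of $G$ in this iteration is $\Delta\theta$. Expanding $q_{\theta - \Delta\theta}(x)$ in a Taylor series to second order, we get

$$KL(q(x)\,\|\,q^o(x)) \approx \int \frac{\left(\Delta\theta \cdot \nabla_\theta q_\theta(x)\right)^2}{2\,q_\theta(x)}\,dx \approx (\Delta\theta \cdot c)^2.$$

We have already indicated that a complete loss should contain $KL(q(x)\,\|\,q^o(x))$. If it does not, we should keep this term small during training by other means. The approximation above shows that the extra loss is about $(\Delta\theta \cdot c)^2$, which must not be too large; this means $\Delta\theta$ must not be too large, because $c$ can be regarded as a constant during one iteration.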
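The quadratic behavior of $KL(q_\theta\,\|\,q_{\theta-\Delta\theta})$ can be checked on a one-parameter family. The sketch below assumes a Bernoulli distribution with $q_\theta(1) = \mathrm{sigmoid}(\theta)$, which is a hypothetical stand-in for the generator distribution, and compares the exact divergence with the second-order expression $\sum_x (\Delta\theta\,\partial_\theta q_\theta(x))^2 / (2 q_\theta(x))$.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def bernoulli(theta):
    """Return [q(0), q(1)] with q(1) = sigmoid(theta)."""
    p = sigmoid(theta)
    return [1.0 - p, p]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def quadratic_approx(theta, delta):
    """sum_x (delta * dq_theta(x)/dtheta)^2 / (2 * q_theta(x))."""
    q = bernoulli(theta)
    p = q[1]
    grad = [-p * (1.0 - p), p * (1.0 - p)]   # derivatives of q(0) and q(1)
    return sum((delta * g) ** 2 / (2.0 * qx) for g, qx in zip(grad, q))

theta = 0.3
ratios = {}
for delta in (0.1, 0.05, 0.01):
    exact = kl(bernoulli(theta), bernoulli(theta - delta))
    ratios[delta] = exact / quadratic_approx(theta, delta)

print(ratios)
```

The ratio of the exact divergence to the quadratic expression approaches 1 as the step shrinks, so halving the step roughly quarters the extra loss.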

Now we can explain why hyperparameters must be adjusted carefully to make GAN converge [Salimans2016Improved]. The most common optimizers are all based on gradient descent, so $\Delta\theta$ is proportional to the gradients. We need to keep $\Delta\theta$ small, which is equivalent to keeping the gradients small. Consequently, we apply gradient clipping and Batch Normalization in GAN because they keep the gradients steady. For the same reason, Adam is usually preferred over SGD with momentum. Meanwhile, $G(z)$ must not be updated for too many iterations per period, because $\Delta\theta$ will be large if $G(z)$ updates a lot.

Regularization Term

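Here we focus on getting something really useful and practical. We try to calculate $KL(q(x)\,\|\,q^o(x))$ directly, to obtain a regularization term to add to the generator's loss. Because it is difficult to calculate directly, we estimate it by calculating $KL(q(x,z)\,\|\,q^o(x,z))$ (in the spirit of variational inference):

$$KL(q(x,z)\,\|\,q^o(x,z)) = \iint q(x|z)\,q(z) \log \frac{q(x|z)\,q(z)}{q^o(x|z)\,q(z)}\,dx\,dz = \iint \delta(x - G(z))\,q(z) \log \frac{\delta(x - G(z))}{\delta(x - G^o(z))}\,dx\,dz = \int q(z) \log \frac{\delta(0)}{\delta(G(z) - G^o(z))}\,dz.$$

We have the limit

$$\delta(x) = \lim_{\sigma \to 0} \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right),$$

where $d$ is the dimension of $x$, which means $\delta(x)$ can be replaced with a Gaussian distribution of small variance. So we have

$$KL(q(x)\,\|\,q^o(x)) \sim \lambda \int q(z)\,\|G(z) - G^o(z)\|^2\,dz,$$

and the complete generator loss becomes

$$\mathbb{E}_{z\sim q(z)}\left[-\log D(G(z)) + \lambda\,\|G(z) - G^o(z)\|^2\right].$$

In other words, we can use the distance between samples from the old and the new generator as a regularization term, to guarantee that the new generator deviates little from the old one.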
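A minimal sketch of this regularized generator loss, $\mathbb{E}_{z}[-\log D(G(z)) + \lambda\|G(z) - G^o(z)\|^2]$, is shown below. The linear generators and the sigmoid discriminator are hypothetical toy stand-ins for neural networks, and the weight `lam` is an arbitrary choice; only the structure of the loss follows the text.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def regularized_generator_loss(z_batch, G, G_old, D, lam):
    """Monte Carlo estimate of E_z[-log D(G(z)) + lam * ||G(z) - G_old(z)||^2]."""
    total = 0.0
    for z in z_batch:
        x_new, x_old = G(z), G_old(z)
        adversarial = -math.log(D(x_new))
        drift = sum((a - b) ** 2 for a, b in zip(x_new, x_old))
        total += adversarial + lam * drift
    return total / len(z_batch)

# Hypothetical toy models, for illustration only.
def make_linear_generator(scale, shift):
    return lambda z: [scale * zi + shift for zi in z]

def toy_discriminator(x):
    return sigmoid(sum(x))

rng = random.Random(0)
z_batch = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(64)]
G_old = make_linear_generator(1.0, 0.0)   # frozen copy from before the update
G_new = make_linear_generator(1.1, 0.2)   # generator currently being trained

loss_plain = regularized_generator_loss(z_batch, G_new, G_old, toy_discriminator, lam=0.0)
loss_reg = regularized_generator_loss(z_batch, G_new, G_old, toy_discriminator, lam=1.0)
loss_same = regularized_generator_loss(z_batch, G_old, G_old, toy_discriminator, lam=1.0)
print(loss_plain, loss_reg, loss_same)
```

The penalty vanishes when the new generator equals the old one and grows as its samples drift away for the same latent inputs. In a real training loop, one plausible design is to keep `G_old` as a frozen snapshot of the generator taken at the start of each generator update period.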

Experiments 1 and 2 on the CelebA dataset (Figures 1 and 2) show that this regularization works well. The code is modified from an existing implementation and is publicly available.

(a) With our regularization, the model stays stable.
(b) Without regularization, the model collapses after 25 epochs.
Figure 1: An ordinary DCGAN model on CelebA, training the discriminator and the generator for one iteration each per period.

(a) With our regularization, the model converges faster.
(b) Without regularization, the model needs more iterations to converge.
Figure 2: An ordinary DCGAN model with BN removed, training the discriminator and the generator for 5 iterations each per period.

Models Related to GAN

Adversarial Autoencoders (AAE) [Makhzani2015Adversarial] and Adversarially Learned Inference (ALI) [Dumoulin2016Adversarially] are two variants of GAN that can also be incorporated into variational inference. With the preparation above, deriving them is just like solving two homework questions.

AAE under GAN framework

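Actually, to obtain AAE, the only thing we need to do is exchange $x$ and $z$ in the standard GAN. In detail, AAE wants to train an encoder $p(z|x)$ that maps the distribution of real data $\tilde{p}(x)$ to the standard Gaussian distribution $q(z) = N(0, I)$, where

$$p(z|x) = \delta(z - E(x)), \qquad p(z) = \int p(z|x)\,\tilde{p}(x)\,dx,$$

and $E(x)$ is the neural network of the encoder.

Like GAN, AAE needs a binary random latent variable $y$, and

$$p(z,y) = \begin{cases} p(z)\,p_1, & y = 1 \\ q(z)\,p_0, & y = 0 \end{cases}$$

We also let $p_1 = p_0 = 1/2$. On the other hand, we set $q(z,y) = q(y|z)\,q(z)$, whose posterior distribution $q(y|z)$ is a conditional Bernoulli distribution taking $z$ as input. Now we minimize $KL(p(z,y)\,\|\,q(z,y))$:

$$KL(p(z,y)\,\|\,q(z,y)) = \int p(z)\,p_1 \log \frac{p(z)\,p_1}{q(1|z)\,q(z)}\,dz + \int q(z)\,p_0 \log \frac{q(z)\,p_0}{q(0|z)\,q(z)}\,dz.$$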
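Now we have to solve for $q(y|z)$ and $E(x)$. We set $q(0|z) = D(z)$ and again train alternately. First, we fix $E(x)$, so $p(z)$ is fixed as well. Ignoring terms that are constant with respect to $D$, we get:

$$D = \arg\min_D\; -\mathbb{E}_{z\sim p(z)}\left[\log(1 - D(z))\right] - \mathbb{E}_{z\sim q(z)}\left[\log D(z)\right].$$

Then we fix $D(z)$ and optimize $E(x)$. Ignoring terms that are constant with respect to $E$, we get the pure loss:

$$E = \arg\min_E \int p(z) \log \frac{p(z)}{(1 - D(z))\,q(z)}\,dz.$$

Using the theoretical solution $D(z) = q(z) / (p^o(z) + q(z))$, where $p^o(z)$ is $p(z)$ at the previous stage, to replace $q(z)$:

$$\int p(z) \log \frac{p(z)}{D(z)\,p^o(z)}\,dz = \mathbb{E}_{x\sim\tilde{p}(x)}\left[-\log D(E(x))\right] + KL(p(z)\,\|\,p^o(z)).$$

On the one hand, as in the standard GAN, if we train carefully, we may remove the second term and have

$$E = \arg\min_E\; \mathbb{E}_{x\sim\tilde{p}(x)}\left[-\log D(E(x))\right];$$

on the other hand, we can train a decoder $G(z)$ after finishing the adversarial training. However, if our $G(E(x))$ has strong enough modeling ability, we can add a reconstruction error to the encoder's loss, which will not interfere with the original adversarial optimization of the encoder. Therefore, we get a joint loss:

$$G, E = \arg\min_{G, E}\; \mathbb{E}_{x\sim\tilde{p}(x)}\left[-\log D(E(x)) + \lambda\,\|x - G(E(x))\|^2\right].$$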
Our Version of ALI

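ALI is like a fusion of GAN and AAE, and there is an almost identical version called Bidirectional GAN (BiGAN) [Donahue2017Adversarial]. Compared with GAN, they treat $z$ as a latent variable, so a posterior distribution $p(z|x)$ is needed. Concretely, in ALI we have

$$q(x,z,y) = \begin{cases} p(z|x)\,\tilde{p}(x)\,p_1, & y = 1 \\ q(x|z)\,q(z)\,p_0, & y = 0 \end{cases}$$

and $p(x,z,y) = p(y|x,z)\,p(z|x)\,\tilde{p}(x)$; then we minimize $KL(q(x,z,y)\,\|\,p(x,z,y))$:

$$\iint p(z|x)\,\tilde{p}(x)\,p_1 \log \frac{p(z|x)\,\tilde{p}(x)\,p_1}{p(1|x,z)\,p(z|x)\,\tilde{p}(x)}\,dx\,dz + \iint q(x|z)\,q(z)\,p_0 \log \frac{q(x|z)\,q(z)\,p_0}{p(0|x,z)\,p(z|x)\,\tilde{p}(x)}\,dx\,dz,$$

which, with $p_1 = p_0 = 1/2$ as before, is equivalent to minimizing

$$\iint p(z|x)\,\tilde{p}(x) \log \frac{1}{p(1|x,z)}\,dx\,dz + \iint q(x|z)\,q(z) \log \frac{q(x|z)\,q(z)}{p(0|x,z)\,p(z|x)\,\tilde{p}(x)}\,dx\,dz.$$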
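Now we have to solve for $p(y|x,z)$, $p(z|x)$, and $q(x|z)$. We set $p(1|x,z) = D(x,z)$, while $p(z|x)$ is a Gaussian distribution containing an encoder $E$ and $q(x|z)$ is another Gaussian distribution containing a generator $G$. We again train alternately. First we fix $E$ and $G$; the loss related to $D$ is

$$D = \arg\min_D\; -\mathbb{E}_{x\sim\tilde{p}(x),\,z\sim p(z|x)}\left[\log D(x,z)\right] - \mathbb{E}_{z\sim q(z),\,x\sim q(x|z)}\left[\log(1 - D(x,z))\right].$$

As in VAE, the expectations over $z \sim p(z|x)$ and $x \sim q(x|z)$ can be computed using the reparametrization trick. Now we fix $D$ and optimize $G$ and $E$; because $E$ and $G$ occur together, the loss cannot be simplified directly. But using the theoretical solution for $D$,

$$D(x,z) = \frac{p^o(z|x)\,\tilde{p}(x)}{p^o(z|x)\,\tilde{p}(x) + q^o(x|z)\,q(z)},$$

where $p^o(z|x)$ and $q^o(x|z)$ are the distributions at the previous stage, we can transform it into

$$-\iint p(z|x)\,\tilde{p}(x) \log D(x,z)\,dx\,dz - \iint q(x|z)\,q(z) \log D(x,z)\,dx\,dz + \int q(z)\,KL(q(x|z)\,\|\,q^o(x|z))\,dz + \iint q(x|z)\,q(z) \log \frac{p^o(z|x)}{p(z|x)}\,dx\,dz.$$

Because $q(x|z)$ and $p(z|x)$ are Gaussian, we can calculate the last two terms analytically, or ignore them while optimizing carefully, leading to

$$G, E = \arg\min_{G, E}\; -\mathbb{E}_{x\sim\tilde{p}(x),\,z\sim p(z|x)}\left[\log D(x,z)\right] - \mathbb{E}_{z\sim q(z),\,x\sim q(x|z)}\left[\log D(x,z)\right].$$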
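That is our version of ALI, which differs slightly from the standard ALI. The currently popular view treats ALI (as well as GAN) as a min-max problem. From that view, the loss of the encoder and generator is

$$G, E = \arg\max_{G, E}\; -\mathbb{E}_{x\sim\tilde{p}(x),\,z\sim p(z|x)}\left[\log D(x,z)\right] - \mathbb{E}_{z\sim q(z),\,x\sim q(x|z)}\left[\log(1 - D(x,z))\right]$$

or

$$G, E = \arg\min_{G, E}\; -\mathbb{E}_{x\sim\tilde{p}(x),\,z\sim p(z|x)}\left[\log(1 - D(x,z))\right] - \mathbb{E}_{z\sim q(z),\,x\sim q(x|z)}\left[\log D(x,z)\right],$$

neither of which has the form of our loss. Our experiments show that our loss performs the same as these two. This means that treating adversarial networks as a min-max problem is not the only approach. Variational inference can sometimes give us new insight.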


Conclusion

Our results show that variational inference is a general framework for deriving and explaining many generative models, including VAE and GAN. We also describe how variational inference does this by introducing a new interpretation of it. This interpretation is powerful: it leads to VAE and the EM algorithm in several lines and derives GAN clearly.

A related work is [Hu2018On], which also attempts to link VAE and GAN with variational inference. However, its treatment is not clear enough. We make up for this deficiency and try to give a simpler view of GAN under variational inference.

It seems that the potential of variational inference is still waiting to be mined.