Suppose $x$ is an explicit variable, $z$ is a latent variable, and $\tilde{p}(x)$ is the evidence distribution of $x$. We let
$$q(x)=\int q(x,z)\,dz$$
and we hope $q(x)$ will be a good approximation of $\tilde{p}(x)$. In the general case, we want to maximize the log-likelihood
$$\mathbb{E}_{x\sim\tilde{p}(x)}[\ln q(x)],$$
which is equivalent to minimizing $KL(\tilde{p}(x)\Vert q(x))$:
$$KL\big(\tilde{p}(x)\Vert q(x)\big)=\int \tilde{p}(x)\ln\frac{\tilde{p}(x)}{q(x)}\,dx.$$
But if we cannot calculate the integral $q(x)=\int q(x,z)\,dz$ analytically, we can neither maximize the log-likelihood nor minimize the KL-divergence directly.
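The equivalence between maximizing likelihood and minimizing KL-divergence is easy to check numerically on a discrete toy example (the distributions below are our own illustrative choices): since $\mathbb{E}_{x\sim\tilde p(x)}[\ln q(x)]=-KL(\tilde p(x)\Vert q(x))-H(\tilde p)$, ranking candidate models by likelihood is the same as ranking them by KL-divergence.

```python
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])     # toy evidence distribution p~(x)
q_good = np.array([0.45, 0.35, 0.20])  # candidate model close to p~
q_bad = np.array([0.10, 0.10, 0.80])   # candidate model far from p~

def loglik(q):
    """E_{x~p~}[ln q(x)] for a discrete model q."""
    return float(np.sum(p_data * np.log(q)))

def kl(a, b):
    """KL(a || b) for discrete distributions."""
    return float(np.sum(a * np.log(a / b)))

# loglik(q) + KL(p~ || q) = -H(p~) is a constant independent of q,
# so maximizing the likelihood is exactly minimizing the KL-divergence.
const_good = loglik(q_good) + kl(p_data, q_good)
const_bad = loglik(q_bad) + kl(p_data, q_bad)
```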
Variational inference changes the objective function: rather than minimizing the KL-divergence of the marginal distributions, $KL(\tilde{p}(x)\Vert q(x))$, we minimize the KL-divergence of the joint distributions, $KL(p(x,z)\Vert q(x,z))$ or $KL(q(x,z)\Vert p(x,z))$, where $p(x,z)=\tilde{p}(x)p(z|x)$. We have
$$KL\big(p(x,z)\Vert q(x,z)\big)=KL\big(\tilde{p}(x)\Vert q(x)\big)+\mathbb{E}_{x\sim\tilde{p}(x)}\Big[KL\big(p(z|x)\Vert q(z|x)\big)\Big]\geq KL\big(\tilde{p}(x)\Vert q(x)\big),$$
which shows that $KL(p(x,z)\Vert q(x,z))$ is an upper bound of $KL(\tilde{p}(x)\Vert q(x))$. In many cases the joint KL-divergence is easier to calculate than the marginal one, so variational inference provides a computable solution. If the optimization succeeds, we have $p(x,z)=q(x,z)$, which implies $q(x)=\tilde{p}(x)$. Namely, $q(x)$ becomes an approximation of the real distribution $\tilde{p}(x)$.
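The decomposition above can be verified numerically. The sketch below (our own toy example) builds two random discrete joints $p(x,z)$ and $q(x,z)$ and checks that the joint KL equals the marginal KL plus the expected conditional KL, and hence upper-bounds the marginal KL.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((4, 5)); P /= P.sum()  # joint p(x,z) = p~(x) p(z|x)
Q = rng.random((4, 5)); Q /= Q.sum()  # joint q(x,z) = q(x) q(z|x)

Px = P.sum(axis=1)  # marginal p~(x)
Qx = Q.sum(axis=1)  # marginal q(x)

kl_joint = float(np.sum(P * np.log(P / Q)))
kl_marginal = float(np.sum(Px * np.log(Px / Qx)))
# E_{x~p~}[ KL(p(z|x) || q(z|x)) ], with p(z|x) = P / p~(x)
kl_conditional = float(np.sum(P * np.log((P / Px[:, None]) / (Q / Qx[:, None]))))
```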
VAE and EM algorithm
With this new view of variational inference, both VAE and the EM algorithm can be derived in a very simple way.
In VAE, we let $p(x,z)=\tilde{p}(x)p(z|x)$ and $q(x,z)=q(x|z)q(z)$, where $p(z|x)$ and $q(x|z)$ are Gaussian distributions with unknown parameters and $q(z)$ is the standard Gaussian distribution. The loss we need to minimize is
$$KL\big(p(x,z)\Vert q(x,z)\big)=\iint \tilde{p}(x)p(z|x)\ln\frac{\tilde{p}(x)p(z|x)}{q(x|z)q(z)}\,dz\,dx.$$
Since $\tilde{p}(x)$ does not contain any trainable parameters, the term $\ln\tilde{p}(x)$ does not change the final result, so the loss can be transformed into
$$L=\mathbb{E}_{x\sim\tilde{p}(x)}\Big[\mathbb{E}_{z\sim p(z|x)}\big[-\ln q(x|z)\big]+KL\big(p(z|x)\Vert q(z)\big)\Big].$$
Because $p(z|x)$ and $q(z)$ are both Gaussian, we can get an analytic expression for $KL(p(z|x)\Vert q(z))$. And with the reparametrization trick $z=\mu(x)+\sigma(x)\otimes\varepsilon$, $\varepsilon\sim N(0,I)$, the first term can be approximated by a single sample as $-\ln q(x|z)$. Consequently, the final loss for VAE is
$$L=\mathbb{E}_{x\sim\tilde{p}(x)}\Big[-\ln q(x|z)+KL\big(p(z|x)\Vert q(z)\big)\Big],\qquad z=\mu(x)+\sigma(x)\otimes\varepsilon.$$
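As a concrete toy instance of this loss, the sketch below uses made-up linear maps for the encoder statistics $\mu(x),\sigma(x)$ and for the decoder mean; the names `W_mu`, `W_logsig`, `W_dec` are our own illustrative parameters, and $q(x|z)$ is taken as a unit-variance Gaussian so that $-\ln q(x|z)$ is a squared error up to a constant.

```python
import numpy as np

rng = np.random.default_rng(1)
W_mu = 0.1 * rng.normal(size=(2, 3))      # encoder mean map (toy)
W_logsig = 0.1 * rng.normal(size=(2, 3))  # encoder log-std map (toy)
W_dec = 0.1 * rng.normal(size=(3, 2))     # decoder mean map (toy)

def vae_loss(x):
    mu = W_mu @ x
    log_sigma = W_logsig @ x
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=mu.shape)
    z = mu + sigma * eps  # reparametrization trick: one sample z ~ p(z|x)
    recon = 0.5 * np.sum((x - W_dec @ z) ** 2)  # -ln q(x|z) up to a constant
    # closed-form KL( N(mu, sigma^2) || N(0, I) )
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * log_sigma)
    return recon + kl, kl

loss, kl_term = vae_loss(np.array([1.0, 0.0, -1.0]))
```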
The assumptions of the EM algorithm are like those of VAE, except that $p(z|x)$ is no longer supposed to be Gaussian. In the EM algorithm the loss is still $KL(p(x,z)\Vert q(x,z))$, but we treat the entire distribution $p(z|x)$ as a training parameter. Rather than minimizing the loss directly, we use an alternating training scheme. Firstly, we fix $p(z|x)$ and optimize only $q(x|z)$. Removing the "constant" terms, the loss for $q(x|z)$ is
$$q(x|z)=\mathop{\arg\min}_{q(x|z)}\ \mathbb{E}_{x\sim\tilde{p}(x)}\mathbb{E}_{z\sim p(z|x)}\big[-\ln q(x|z)q(z)\big].$$
Secondly, we fix $q(x|z)$ and optimize $p(z|x)$. We define the posterior $q(z|x)$ by
$$q(z|x)=\frac{q(x|z)q(z)}{q(x)},\qquad q(x)=\int q(x|z)q(z)\,dz;$$
now we have
$$KL\big(p(x,z)\Vert q(x,z)\big)=KL\big(\tilde{p}(x)\Vert q(x)\big)+\mathbb{E}_{x\sim\tilde{p}(x)}\Big[KL\big(p(z|x)\Vert q(z|x)\big)\Big].$$
Because we make no assumption about the form of $p(z|x)$, we can let $p(z|x)=q(z|x)$, which makes the second term equal zero and is therefore an optimal solution for $p(z|x)$. In other words, the optimal $p(z|x)$ is
$$p(z|x)=q(z|x)=\frac{q(x|z)q(z)}{q(x)}.$$
The EM algorithm simply performs these two steps alternately.
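The classical instance of these two steps is a Gaussian mixture. The toy numpy sketch below (our own example, with component variances fixed to 1) alternates exactly as described: the E-step sets $p(z|x)$ to the posterior responsibilities $q(z|x)$, and the M-step refits the mixture weights and means; the log-likelihood then never decreases.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

mu = np.array([-1.0, 1.0])  # initial component means
pi = np.array([0.5, 0.5])   # initial mixture weights

def log_likelihood(mu, pi):
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    return float(np.sum(np.log(dens.sum(axis=1))))

history = []
for _ in range(30):
    # E-step: p(z|x) <- q(z|x), the posterior responsibilities
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: refit q's parameters under the fixed responsibilities
    pi = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    history.append(log_likelihood(mu, pi))
```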
GAN within Variational Inference
In this section, we describe a general approach to incorporating GAN into variational inference, which leads to a new insight into GAN and yields an effective regularization for GAN.
Like VAE, GAN also wants to obtain a generative model $q(x)=\int q(x|z)q(z)\,dz$ that transforms $q(z)$ into the evidence distribution $\tilde{p}(x)$. Different from the Gaussian assumption of VAE, GAN lets $q(x|z)$ be a Dirac delta function
$$q(x|z)=\delta\big(x-G(z)\big),$$
where $G(z)$ is a neural network, the generator of the generative model.
Generally, $z$ is considered a random latent variable of the generative model. However, it is well known that the Dirac delta function is non-zero at only one point, so the mapping from $z$ to $x$ in GAN is almost one-to-one. The variable $z$ is not "random" enough, so we do not treat it as a latent variable (which means we need not consider the posterior distribution $p(z|x)$). In fact, we just consider the binary random variable $y$ as the random latent variable of GAN:
$$p(x,y)=p(x|y)p(y),\qquad p(x|y)=\begin{cases}\tilde{p}(x), & y=1\\ q(x), & y=0,\end{cases}$$
where $p(y)$ describes a Bernoulli distribution. For simplicity we set $p(y=1)=p(y=0)=1/2$.
On the other hand, we let $q(x,y)=q(y|x)q(x)$, where $q(y|x)$ is a conditional Bernoulli distribution. Distinct from VAE, GAN chooses the other direction of the KL-divergence as its objective:
$$KL\big(q(x,y)\Vert p(x,y)\big)=\sum_{y}\int q(y|x)q(x)\ln\frac{q(y|x)q(x)}{p(x|y)p(y)}\,dx.$$
Once it succeeds, we have $q(x,y)=p(x,y)$, which means
$$q(x)=\sum_{y}p(x|y)p(y)=\frac{1}{2}\big(\tilde{p}(x)+q(x)\big),$$
i.e., $q(x)=\tilde{p}(x)$.
Now we have to solve for $q(y|x)$ and $q(x)$. For simplicity we set $D(x)=q(y=1|x)$, called the discriminator. Like the EM algorithm, we use an alternating training strategy. Firstly, we fix the generator, so $q(x)$ is fixed too. Ignoring the terms that are constant for $q(y|x)$, we get:
$$q(y|x)=\mathop{\arg\min}_{q(y|x)}\int q(x)\left[D(x)\ln D(x)+\big(1-D(x)\big)\ln\big(1-D(x)\big)+D(x)\ln\frac{q(x)}{\tilde{p}(x)}\right]dx,$$
whose optimum is $D(x)=\tilde{p}(x)/\big(\tilde{p}(x)+q(x)\big)$; in practice this optimum is approached by training a binary classifier that separates real samples from generated ones.
Then we fix $q(y|x)$ to optimize $q(x)$. Keeping only the terms that involve $q(x)$ (and holding $p(x|y=0)$ at the previous-stage generator distribution $q^{\circ}(x)$), we get the pure loss:
$$L_{G}=\mathbb{E}_{x\sim q(x)}\left[D(x)\ln\frac{D(x)\,q(x)}{\tilde{p}(x)}+\big(1-D(x)\big)\ln\frac{\big(1-D(x)\big)\,q(x)}{q^{\circ}(x)}\right].$$
To minimize this loss we would need the explicit formula of $\tilde{p}(x)$, which is always impossible. For the same reason as in the EM algorithm, if $q(y|x)$ has enough fitting ability, its optimum is
$$D(x)=\frac{\tilde{p}(x)}{\tilde{p}(x)+q^{\circ}(x)},$$
where $q^{\circ}(x)$ is $q(x)$ at the previous stage. We can solve $\tilde{p}(x)=q^{\circ}(x)D(x)/\big(1-D(x)\big)$ from it and replace $\tilde{p}(x)$ in $L_{G}$:
$$L_{G}=\mathbb{E}_{x\sim q(x)}\big[\ln\big(1-D(x)\big)\big]+KL\big(q(x)\,\big\Vert\, q^{\circ}(x)\big).$$
Obviously, the first term is one of the standard generator losses of GAN:
$$\mathbb{E}_{x\sim q(x)}\big[\ln\big(1-D(x)\big)\big]=\mathbb{E}_{z\sim q(z)}\big[\ln\big(1-D(G(z))\big)\big].$$
The second, extra term describes the distance between the new distribution $q(x)$ and the old distribution $q^{\circ}(x)$. The two terms are adversarial. $KL(q(x)\Vert q^{\circ}(x))$ tries to make the two distributions more similar, while the first term pushes them apart: if the discriminator is trained fully, $D(x)$ will be very small for $x\sim q^{\circ}(x)$ (all of its samples are considered negative), so $\mathbb{E}_{x\sim q(x)}[\ln(1-D(x))]$ can only decrease by moving $q(x)$ away from $q^{\circ}(x)$, and vice versa. Thus, minimizing the entire loss requires the model to inherit the old distribution $q^{\circ}(x)$ while exploring the new world where $D(x)$ is large.
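The substitution that produces this two-term loss can be checked numerically on discrete distributions (a toy verification of the algebra, with our own random $\tilde p$, $q^{\circ}$, $q$): plugging the previous-stage optimal discriminator into the joint KL-divergence reproduces $\mathbb{E}_{x\sim q}[\ln(1-D)]+KL(q\Vert q^{\circ})$ up to the constant $\ln 2$.

```python
import numpy as np

rng = np.random.default_rng(3)
def simplex(n):
    v = rng.random(n)
    return v / v.sum()

p_tilde = simplex(6)  # evidence distribution p~(x)
q_old = simplex(6)    # previous-stage generator distribution q°(x)
q_new = simplex(6)    # current generator distribution q(x)

D = p_tilde / (p_tilde + q_old)  # optimal discriminator of the previous stage

# joint KL(q(x,y) || p(x,y)) with p(x|1)=p~, p(x|0)=q°, p(y)=1/2, q(1|x)=D
joint_kl = float(np.sum(q_new * (D * np.log(2 * D * q_new / p_tilde)
                                 + (1 - D) * np.log(2 * (1 - D) * q_new / q_old))))
# the derived two-term generator loss plus the constant ln 2
two_term = float(np.sum(q_new * np.log(1 - D))
                 + np.sum(q_new * np.log(q_new / q_old)) + np.log(2))
```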
As we know, the generator loss in the current standard GAN lacks this second term, so it is actually an incomplete loss. Suppose there were an omnipotent optimizer that could identify the global optimum in a very short time, and a generator with enough fitting ability; then $q(x)$ would generate only the single sample that makes $D(x)$ largest. In other words, the global optimal solution of the first term alone is $q(x)=\delta(x-x^{*})$, where $x^{*}=\mathop{\arg\max}_{x}D(x)$. That is mode collapse, which in theory is certain to occur.
So, what enlightenment does this give us? We let
$$\Delta\theta=\theta-\theta^{\circ},$$
meaning that the update of the generator parameters $\theta$ in this iteration is $\Delta\theta$. Using a Taylor series to expand $KL(q_{\theta}(x)\Vert q_{\theta^{\circ}}(x))$ to second order, we get
$$KL\big(q_{\theta}(x)\Vert q_{\theta^{\circ}}(x)\big)\approx\frac{1}{2}\Delta\theta^{\top}F\,\Delta\theta=O\big(\Vert\Delta\theta\Vert^{2}\big),$$
where $F$ is the Fisher information matrix (the zeroth- and first-order terms vanish because the KL-divergence attains its minimum of zero at $\theta=\theta^{\circ}$). We have already indicated that a complete loss should contain $KL(q(x)\Vert q^{\circ}(x))$. If it does not, we should keep that term small during training in some other way. The above approximation shows the extra loss is about $C\Vert\Delta\theta\Vert^{2}$, which cannot be too large; that means $\Vert\Delta\theta\Vert$ cannot be too large, because $C$ can be regarded as a constant during one iteration.
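A one-parameter sanity check of this quadratic behaviour (our own toy family, not from the paper's experiments): for Bernoulli models $q_\theta$, the Fisher information is $F(\theta)=1/(\theta(1-\theta))$, and $KL(q_{\theta^{\circ}+\Delta\theta}\Vert q_{\theta^{\circ}})\approx\tfrac12 F(\theta^{\circ})\Delta\theta^{2}$ for small $\Delta\theta$.

```python
import numpy as np

def kl_bernoulli(a, b):
    """KL( Bernoulli(a) || Bernoulli(b) )."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta_old = 0.3
fisher = 1.0 / (theta_old * (1 - theta_old))  # Fisher information of Bernoulli

delta = 1e-3
exact = kl_bernoulli(theta_old + delta, theta_old)
quadratic = 0.5 * fisher * delta ** 2  # second-order Taylor approximation
ratio = exact / quadratic              # approaches 1 as delta -> 0
```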
Now we can explain why we need to adjust hyperparameters carefully to make GAN converge [Salimans2016Improved]. The most common optimizers are all based on gradient descent, so $\Delta\theta$ is proportional to the gradients. We need to keep $\Vert\Delta\theta\Vert$ small, which is equivalent to keeping the gradients small. Consequently, we apply gradient clipping and Batch Normalization in GAN, because they make the gradients steady. For the same reason, we usually use Adam rather than SGD+Momentum. Meanwhile, the number of generator iterations per stage cannot be too many, since $\Vert\Delta\theta\Vert$ will be large if the generator updates a lot.
Here we focus on getting something really useful and practical. We try to calculate $KL(q(x)\Vert q^{\circ}(x))$ directly, to obtain a regularization term to add to the generator's loss. Because directly calculating it is difficult, we estimate it via $KL(q(x,z)\Vert q^{\circ}(x,z))$ (maybe inspired by variational inference):
$$KL\big(q(x)\Vert q^{\circ}(x)\big)\leq KL\big(q(x,z)\Vert q^{\circ}(x,z)\big)=\mathbb{E}_{z\sim q(z)}\Big[KL\big(q(x|z)\Vert q^{\circ}(x|z)\big)\Big].$$
Since we have the limit
$$\delta\big(x-G(z)\big)=\lim_{\sigma\to 0}N\big(x;G(z),\sigma^{2}I\big),$$
$q(x|z)=\delta(x-G(z))$ can be replaced with a Gaussian distribution of small variance. So we have
$$KL\big(q(x|z)\Vert q^{\circ}(x|z)\big)=\frac{1}{2\sigma^{2}}\big\Vert G(z)-G^{\circ}(z)\big\Vert^{2},$$
giving the regularization term $\lambda\,\mathbb{E}_{z\sim q(z)}\big[\Vert G(z)-G^{\circ}(z)\Vert^{2}\big]$.
In other words, we can use the distance between samples generated from the same $z$ by the old and new generators as a regularization term, guaranteeing that the new generator deviates little from the old one.
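In code, the regularizer only requires feeding the same batch of $z$ through the current generator and a frozen copy of it. A minimal numpy sketch, assuming toy linear generators with weight matrices `W_new`/`W_old` and a hypothetical regularization weight `lam`:

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=(64, 2))  # shared batch of latent samples z ~ q(z)

W_old = rng.normal(size=(3, 2))                 # frozen generator G° (toy)
W_new = W_old + 0.01 * rng.normal(size=(3, 2))  # current generator G after a small update

g_old = z @ W_old.T  # G°(z)
g_new = z @ W_new.T  # G(z)

lam = 0.1  # regularization weight (hypothetical)
reg = lam * np.mean(np.sum((g_new - g_old) ** 2, axis=1))
# the total generator loss would be: adversarial_loss + reg
```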
Experiments 1 and 2 on the CelebA dataset show that this regularization works well. (The code is modified from https://github.com/LynnHo/DCGAN-LSGAN-WGAN-WGAN-GP-Tensorflow and is now available at https://github.com/bojone/gan/tree/master/vgan.)
Models Related to GAN
Adversarial Autoencoders (AAE) [Makhzani2015Adversarial] and Adversarially Learned Inference (ALI) [Dumoulin2016Adversarially] are two variants of GAN; they can also be incorporated into variational inference. Indeed, with the preparation above, doing so is just like two homework questions.
AAE under GAN framework
Actually, to obtain AAE, the only thing we need to do is exchange the roles of $x$ and $z$ in standard GAN. In detail, AAE wants to train an encoder that maps the distribution of real data $\tilde{p}(x)$ to the standard Gaussian distribution $q(z)$; the encoder-induced distribution is $p(z)=\int\tilde{p}(x)p(z|x)\,dx$, with
$$p(z|x)=\delta\big(z-E(x)\big),$$
where $E(x)$ is a neural network, the encoder.
Like GAN, AAE needs a binary random latent variable $y$, with
$$q(z,y)=q(z|y)q(y),\qquad q(z|y)=\begin{cases}q(z), & y=1\\ p(z), & y=0,\end{cases}$$
and we also let $q(y=1)=q(y=0)=1/2$. On the other hand, we set $p(z,y)=p(y|z)p(z)$, whose $p(y|z)$ is a conditional Bernoulli distribution taking $z$ as input. Now we minimize $KL(p(z,y)\Vert q(z,y))$:
$$KL\big(p(z,y)\Vert q(z,y)\big)=\sum_{y}\int p(y|z)p(z)\ln\frac{p(y|z)p(z)}{q(z|y)q(y)}\,dz.$$
Now we have to solve for $p(y|z)$ and $p(z)$. We set $D(z)=p(y=1|z)$ and still train alternately. Firstly, we fix the encoder, so $p(z)$ is fixed too. Ignoring the terms that are constant for $p(y|z)$, we get:
$$p(y|z)=\mathop{\arg\min}_{p(y|z)}\int p(z)\left[D(z)\ln D(z)+\big(1-D(z)\big)\ln\big(1-D(z)\big)+D(z)\ln\frac{p(z)}{q(z)}\right]dz,$$
whose optimum is $D(z)=q(z)/\big(q(z)+p(z)\big)$. Then we fix $p(y|z)$ to optimize $p(z)$. Keeping only the terms that involve $p(z)$ (with $q(z|y=0)$ held at the previous-stage encoder distribution $p^{\circ}(z)$), we get the pure loss:
$$L_{E}=\mathbb{E}_{z\sim p(z)}\left[D(z)\ln\frac{D(z)\,p(z)}{q(z)}+\big(1-D(z)\big)\ln\frac{\big(1-D(z)\big)\,p(z)}{p^{\circ}(z)}\right].$$
Using the theoretical solution $D(z)=q(z)/\big(q(z)+p^{\circ}(z)\big)$ of the previous stage to replace $q(z)$:
$$L_{E}=\mathbb{E}_{x\sim\tilde{p}(x)}\big[\ln\big(1-D(E(x))\big)\big]+KL\big(p(z)\,\big\Vert\, p^{\circ}(z)\big).$$
On the one hand, like standard GAN, if we train carefully we may remove the second term and keep
$$L_{E}=\mathbb{E}_{x\sim\tilde{p}(x)}\big[\ln\big(1-D(E(x))\big)\big];$$
on the other hand, we can train a decoder $q(x|z)$ after finishing the adversarial training. However, if our $q(x|z)$ has strong enough modeling ability, we can add a reconstruction error to the encoder's loss without interfering with the encoder's original adversarial optimization. Therefore, we get a joint loss:
$$L=\mathbb{E}_{x\sim\tilde{p}(x)}\Big[\ln\big(1-D(E(x))\big)-\lambda\ln q\big(x|E(x)\big)\Big].$$
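A minimal numpy sketch of this joint loss, assuming a toy linear encoder/decoder and a frozen logistic discriminator (the weights `W_enc`, `W_dec`, `w_disc` and the factor `lam` are our own illustrative choices; with a unit-variance Gaussian decoder, $-\ln q(x|E(x))$ is a squared error up to a constant):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=(32, 4))  # a batch of "real" data (toy)

W_enc = 0.5 * rng.normal(size=(2, 4))  # encoder E(x) = W_enc x (Dirac encoder)
W_dec = 0.5 * rng.normal(size=(4, 2))  # decoder mean of q(x|z)
w_disc = rng.normal(size=2)            # frozen discriminator weights (toy logistic model)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

z = x @ W_enc.T                                          # z = E(x)
adv = np.mean(np.log(1.0 - sigmoid(z @ w_disc) + 1e-9))  # ln(1 - D(E(x)))
rec = np.mean(np.sum((x - z @ W_dec.T) ** 2, axis=1))    # -ln q(x|E(x)) up to a constant
lam = 1.0
joint_loss = adv + lam * rec
```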
Our Version of ALI
ALI is like a fusion of GAN and AAE, and there is an almost identical version called Bidirectional GAN (BiGAN) [Donahue2017Adversarial]. Compared with GAN, they also treat $z$ as a latent variable, so a posterior distribution $p(z|x)$ is needed. Concretely, in ALI we have
$$p(x,z|y)=\begin{cases}\tilde{p}(x)p(z|x), & y=1\\ q(z)q(x|z), & y=0,\end{cases}\qquad p(y=1)=p(y=0)=\frac{1}{2},$$
and $q(x,z,y)=q(y|x,z)\,q(x,z)$ with the mixture $q(x,z)=\frac{1}{2}\big(\tilde{p}(x)p(z|x)+q(z)q(x|z)\big)$; then we minimize $KL(q(x,z,y)\Vert p(x,z,y))$, which is equivalent to minimizing
$$\sum_{y}\iint q(y|x,z)q(x,z)\ln\frac{q(y|x,z)q(x,z)}{p(x,z|y)p(y)}\,dx\,dz.$$
Now we have to solve for $q(y|x,z)$, $p(z|x)$ and $q(x|z)$. We set $D(x,z)=q(y=1|x,z)$; $p(z|x)$ is a Gaussian distribution containing an encoder network $E(x)$, and $q(x|z)$ is another Gaussian distribution containing a generator network $G(z)$. We still train alternately. Firstly we fix the encoder and generator; the loss related to $q(y|x,z)$ is the usual classification loss
$$\frac{1}{2}\mathbb{E}_{x\sim\tilde{p}(x),z\sim p(z|x)}\big[-\ln D(x,z)\big]+\frac{1}{2}\mathbb{E}_{z\sim q(z),x\sim q(x|z)}\big[-\ln\big(1-D(x,z)\big)\big],$$
whose optimum is $D(x,z)=\tilde{p}(x)p(z|x)\big/\big(\tilde{p}(x)p(z|x)+q(z)q(x|z)\big)$. As in VAE, the expectations over $p(z|x)$ and $q(x|z)$ can be estimated with the reparametrization trick. Now we fix $q(y|x,z)$ to optimize $p(z|x)$ and $q(x|z)$; because $p(z|x)$ and $q(x|z)$ co-occur in the mixture $q(x,z)$, the loss cannot be simplified directly. But using the theoretical solution of $D$ at the previous stage, we can transform it to
$$L_{G,E}=\mathbb{E}_{(x,z)\sim q(x,z)}\big[\ln\big(1-D(x,z)\big)\big]+KL\big(q(x,z)\,\big\Vert\, q(z)q^{\circ}(x|z)\big).$$
Due to the Gaussianity of $p(z|x)$ and $q(x|z)$, the last term can be handled analytically, or ignored if we optimize carefully with small updates, leading to
$$L_{G,E}=\frac{1}{2}\mathbb{E}_{x\sim\tilde{p}(x),z\sim p(z|x)}\big[\ln\big(1-D(x,z)\big)\big]+\frac{1}{2}\mathbb{E}_{z\sim q(z),x\sim q(x|z)}\big[\ln\big(1-D(x,z)\big)\big].$$
That is our version of ALI, which differs a little from the standard ALI. The currently popular view treats ALI (and GAN) as a min-max problem; from that view, the loss of the encoder and generator is
$$L_{1}=\mathbb{E}_{x\sim\tilde{p}(x),z\sim p(z|x)}\big[\ln D(x,z)\big]+\mathbb{E}_{z\sim q(z),x\sim q(x|z)}\big[\ln\big(1-D(x,z)\big)\big]$$
or
$$L_{2}=-\mathbb{E}_{x\sim\tilde{p}(x),z\sim p(z|x)}\big[\ln\big(1-D(x,z)\big)\big]-\mathbb{E}_{z\sim q(z),x\sim q(x|z)}\big[\ln D(x,z)\big],$$
neither of which looks like $L_{G,E}$. Our experiments show that $L_{G,E}$ gives the same performance as $L_{1}$ and $L_{2}$. That means treating adversarial networks as a min-max problem is not the only approach; variational inference can sometimes give us new insight.
Our results show that variational inference is a general framework with which to derive and explain many generative models, including VAE and GAN. We also describe how variational inference does this, by introducing a new interpretation of it. This interpretation is powerful: it leads to VAE and the EM algorithm in several lines, and deduces GAN clearly.
A related work is [Hu2018On], which also attempts to link VAE and GAN with variational inference. However, their treatment is not clear enough; we make up for this deficiency, giving a simpler view of GAN under variational inference.
It seems the potential of variational inference is still waiting to be mined.