## Variational Inference

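Suppose $x$ is an explicit (observed) variable, $z$ is a latent variable, and $\tilde{p}(x)$ is the evidence distribution of $x$. We let

$$q(x) = q_{\theta}(x) = \int q_{\theta}(x, z)\,dz \tag{1}$$

and we hope $q_{\theta}(x)$ will be a good approximation of $\tilde{p}(x)$. In general, we want to maximize the log-likelihood function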

$$\theta = \mathop{\arg\max}_{\theta} \int \tilde{p}(x) \log q(x)\,dx \tag{2}$$

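which is equivalent to minimizing $KL(\tilde{p}(x)\,\|\,q(x))$:

$$KL(\tilde{p}(x)\,\|\,q(x)) = \int \tilde{p}(x) \log \frac{\tilde{p}(x)}{q(x)}\,dx \tag{3}$$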

But if we cannot calculate the integral over $z$ analytically, we can neither maximize the log-likelihood nor minimize the KL-divergence directly.

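Variational inference changes the objective function: rather than minimizing the KL-divergence of the marginal distributions, $KL(\tilde{p}(x)\,\|\,q(x))$, we minimize the KL-divergence of the joint distributions, $KL(p(x,z)\,\|\,q(x,z))$ or $KL(q(x,z)\,\|\,p(x,z))$, where $p(x,z) = \tilde{p}(x)\,p(z|x)$. We have

$$\begin{aligned} KL(p(x,z)\,\|\,q(x,z)) &= KL(\tilde{p}(x)\,\|\,q(x)) + \int \tilde{p}(x)\, KL(p(z|x)\,\|\,q(z|x))\,dx \\ &\geq KL(\tilde{p}(x)\,\|\,q(x)) \end{aligned} \tag{4}$$

which shows that $KL(p(x,z)\,\|\,q(x,z))$ is an upper bound of $KL(\tilde{p}(x)\,\|\,q(x))$. In many cases, the joint KL-divergence is easier to calculate than the marginal KL-divergence, so variational inference provides a computable solution. If it works, we have $q(x,z) \to p(x,z)$, which means $q(x) = \int q(x,z)\,dz \to \int p(x,z)\,dz = \tilde{p}(x)$. Namely, $q(x)$ becomes an approximation of the real distribution $\tilde{p}(x)$.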
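The bound in (4) can be checked numerically on small discrete distributions. The sketch below is only an illustration: it assumes finite state spaces for $x$ and $z$ (so integrals become sums), and the helper names `kl` and `random_joint` are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions of equal shape."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)

def random_joint(n_x, n_z):
    """Random strictly positive table over (x, z), normalized to sum to 1."""
    table = rng.random((n_x, n_z)) + 0.1
    return table / table.sum()

p_joint = random_joint(4, 3)   # plays the role of p(x, z)
q_joint = random_joint(4, 3)   # plays the role of q(x, z)

joint_kl = kl(p_joint, q_joint)
marginal_kl = kl(p_joint.sum(axis=1), q_joint.sum(axis=1))

# Gap term of (4): sum_x p(x) KL(p(z|x) || q(z|x))
p_x = p_joint.sum(axis=1, keepdims=True)
q_x = q_joint.sum(axis=1, keepdims=True)
gap = float(np.sum(p_joint * np.log((p_joint / p_x) / (q_joint / q_x))))

print(joint_kl, marginal_kl, gap)
```

The joint divergence should equal the marginal divergence plus the non-negative gap term, which is exactly the decomposition in (4).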

## VAE and EM algorithm

With this view of variational inference, VAE and the EM algorithm can be derived in a very simple way.

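In VAE, we let $q(x,z) = q(x|z)\,q(z)$ and $p(x,z) = \tilde{p}(x)\,p(z|x)$, where $q(x|z)$ and $p(z|x)$ are Gaussian distributions with unknown parameters and $q(z)$ is the standard Gaussian distribution. The loss we need to minimize is

$$KL(p(x,z)\,\|\,q(x,z)) = \iint \tilde{p}(x)\,p(z|x) \log \frac{\tilde{p}(x)\,p(z|x)}{q(x|z)\,q(z)}\,dx\,dz \tag{5}$$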

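Since $\log \tilde{p}(x)$ does not contain any parameters, it does not change the final result. So the loss can be transformed into

$$\mathbb{E}_{x\sim\tilde{p}(x)}\left[-\int p(z|x) \log q(x|z)\,dz + KL(p(z|x)\,\|\,q(z))\right] \tag{6}$$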

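Because $p(z|x)$ and $q(z)$ are both Gaussian, we can get the analytic expression of $KL(p(z|x)\,\|\,q(z))$. With the reparametrization trick, the first term can be approximated by $-\log q(x|z)$ evaluated at a single sample $z \sim p(z|x)$. Consequently, the final loss for VAE is

$$\mathbb{E}_{x\sim\tilde{p}(x)}\Big[-\log q(x|z) + KL(p(z|x)\,\|\,q(z))\Big], \quad z \sim p(z|x) \tag{7}$$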
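A minimal NumPy sketch of loss (7) follows. It is not the author's implementation: it assumes a diagonal-Gaussian encoder, a unit-variance Gaussian decoder (so the reconstruction term is a squared error up to a constant), and hypothetical linear maps `encode` and `decode` that exist only to make the example runnable.

```python
import numpy as np

def gaussian_kl_to_standard(mu, log_var):
    """Analytic KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per sample."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def reparametrize(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, encode, decode, rng):
    """Monte Carlo estimate of loss (7) using one z sample per x.

    Assumption: q(x|z) is Gaussian with unit variance, so -log q(x|z)
    equals 0.5 * ||x - decode(z)||^2 up to an additive constant.
    """
    mu, log_var = encode(x)
    z = reparametrize(mu, log_var, rng)
    recon = 0.5 * np.sum((x - decode(z)) ** 2, axis=-1)
    return float(np.mean(recon + gaussian_kl_to_standard(mu, log_var)))

# Hypothetical linear encoder/decoder, used only for illustration.
rng = np.random.default_rng(0)
W_enc = rng.standard_normal((5, 2)) * 0.1
W_dec = rng.standard_normal((2, 5)) * 0.1
encode = lambda x: (x @ W_enc, np.zeros((x.shape[0], 2)))
decode = lambda z: z @ W_dec

x = rng.standard_normal((8, 5))
loss = vae_loss(x, encode, decode, rng)
```

In a real model the encoder and decoder would be neural networks and the loss would be differentiated with an autodiff framework; the structure of the two terms stays the same.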

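The assumptions of the EM algorithm are like those of VAE, except that $p(z|x)$ is not assumed to be Gaussian. In the EM algorithm, the loss is still $KL(p(x,z)\,\|\,q(x,z))$, but we treat the entire $p(z|x)$ as a training parameter. Rather than minimizing the loss directly, we train alternately. First, we fix $p(z|x)$ and optimize only $q(x|z)$. Removing the "constant" term, the loss for $q(x|z)$ gives

$$q(x|z) = \mathop{\arg\min}_{q(x|z)} \mathbb{E}_{x\sim\tilde{p}(x)}\left[-\int p(z|x) \log q(x,z)\,dz\right] \tag{8}$$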

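Second, we fix $q(x|z)$ and optimize $p(z|x)$. We define $q(z|x)$ by

$$q(z|x) = \frac{q(x|z)\,q(z)}{q(x)}, \quad q(x) = \int q(x|z)\,q(z)\,dz \tag{9}$$

Now we have

$$\begin{aligned} KL(p(x,z)\,\|\,q(x,z)) &= \mathbb{E}_{x\sim\tilde{p}(x)}\left[\int p(z|x) \log \frac{\tilde{p}(x)\,p(z|x)}{q(z|x)\,q(x)}\,dz\right] \\ &= \mathbb{E}_{x\sim\tilde{p}(x)}\Big[KL(p(z|x)\,\|\,q(z|x))\Big] + KL(\tilde{p}(x)\,\|\,q(x)) \end{aligned} \tag{10}$$

where the last term is a constant with respect to $p(z|x)$.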

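Because we make no assumption about the form of $p(z|x)$, we can let $p(z|x) = q(z|x)$, which makes the part of the loss (10) that depends on $p(z|x)$ equal to zero and is therefore an optimal solution for $p(z|x)$. In other words, the optimal $p(z|x)$ is

$$p(z|x) = \frac{q(x|z)\,q(z)}{\int q(x|z)\,q(z)\,dz} \tag{11}$$

The EM algorithm simply performs (8) and (11) alternately.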
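The alternation between (8) and (11) can be sketched on a toy model. The example below assumes a one-dimensional mixture of two unit-variance Gaussians with a discrete latent $z$ (so the integral over $z$ becomes a sum); the model, the synthetic data, and all function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def log_joint(x, means, weights):
    """log q(x, z) = log q(x|z) + log q(z) for unit-variance Gaussian components.

    Returns an array of shape (n_samples, n_components).
    """
    diff = x[:, None] - means[None, :]
    return -0.5 * diff**2 - 0.5 * np.log(2 * np.pi) + np.log(weights)[None, :]

def e_step(x, means, weights):
    """Step (11): set p(z|x) to the exact posterior q(x|z) q(z) / sum_z q(x|z) q(z)."""
    lj = log_joint(x, means, weights)
    lj = lj - lj.max(axis=1, keepdims=True)   # stabilize the exponentials
    post = np.exp(lj)
    return post / post.sum(axis=1, keepdims=True)

def m_step(x, post):
    """Step (8): with p(z|x) fixed, minimize E_x[-sum_z p(z|x) log q(x, z)]."""
    mass = post.sum(axis=0)
    means = (post * x[:, None]).sum(axis=0) / mass
    weights = mass / len(x)
    return means, weights

def log_likelihood(x, means, weights):
    """sum_x log q(x), computed with a log-sum-exp over components."""
    lj = log_joint(x, means, weights)
    m = lj.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))))

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

means, weights = np.array([-1.0, 1.0]), np.array([0.5, 0.5])
history = []
for _ in range(30):
    post = e_step(x, means, weights)
    means, weights = m_step(x, post)
    history.append(log_likelihood(x, means, weights))
```

Each pass lowers the joint KL-divergence, so the recorded log-likelihood should never decrease from one iteration to the next.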

## GAN within Variational Inference

In this section, we describe a general approach to incorporating GAN into variational inference, which leads to a new insight into GAN and yields an effective regularization for GAN.

### General Framework

Like VAE, GAN also wants to obtain a generative model $q(x|z)$ that transforms $q(z) = N(z; 0, I)$ into the evidence distribution $\tilde{p}(x)$. Unlike the Gaussian assumption in VAE, GAN lets $q(x|z)$ be a Dirac delta function

$$q(x|z) = \delta(x - G(z)), \quad q(x) = \int q(x|z)\,q(z)\,dz \tag{12}$$

where $G(z)$ is the neural network of the generative model, called the generator.

Generally, we consider $z$ to be a random latent variable in a generative model. However, the Dirac delta function is non-zero at only one point, so the mapping from $z$ to $x$ in GAN is almost one-to-one. The variable $z$ is not "random" enough, so we do not treat it as a latent variable (which means we need not consider the posterior distribution $p(z|x)$). Instead, we consider only a binary random variable $y$ as the random latent variable in GAN:

$$q(x, y) = \begin{cases} \tilde{p}(x)\,p_1, & y = 1 \\ q(x)\,p_0, & y = 0 \end{cases} \tag{13}$$

Here $p_1 = 1 - p_0$ describes a Bernoulli distribution. For simplicity we set $p_1 = p_0 = 1/2$.

On the other hand, we let $p(x, y) = p(y|x)\,\tilde{p}(x)$, where $p(y|x)$ is a conditional Bernoulli distribution. Unlike VAE, GAN chooses the other direction of the KL-divergence as its optimization objective:

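$$\begin{aligned} KL(q(x,y)\,\|\,p(x,y)) = &\int \tilde{p}(x)\,p_1 \log \frac{\tilde{p}(x)\,p_1}{p(1|x)\,\tilde{p}(x)}\,dx \\ &+ \int q(x)\,p_0 \log \frac{q(x)\,p_0}{p(0|x)\,\tilde{p}(x)}\,dx \end{aligned} \tag{14}$$

Once this succeeds, we have $q(x,y) \to p(x,y)$, which means

$$p_1\,\tilde{p}(x) + p_0\,q(x) = \sum_y q(x,y) \to \sum_y p(x,y) = \tilde{p}(x) \tag{15}$$

and consequently $q(x) \to \tilde{p}(x)$.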

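Now we have to solve for $p(y|x)$ and $G(z)$. For simplicity we set $p(1|x) = D(x)$, called the discriminator. As in the EM algorithm, we use an alternating training strategy. First, we fix $G(z)$, and hence $q(x)$. Ignoring terms that are constant with respect to $D$, we get:

$$D = \mathop{\arg\min}_{D} -\mathbb{E}_{x\sim\tilde{p}(x)}\big[\log D(x)\big] - \mathbb{E}_{x\sim q(x)}\big[\log(1 - D(x))\big] \tag{16}$$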

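Then we fix $D(x)$ and optimize $G(z)$. Ignoring terms that are constant with respect to $G$, we get the pure loss:

$$G = \mathop{\arg\min}_{G} \int q(x) \log \frac{q(x)}{(1 - D(x))\,\tilde{p}(x)}\,dx \tag{17}$$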

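Minimizing this loss requires the explicit form of $\tilde{p}(x)$, which is generally unavailable. For the same reason as in (11), if $D(x)$ has enough fitting ability, the optimal $D(x)$ is

$$D(x) = \frac{\tilde{p}(x)}{\tilde{p}(x) + q^o(x)} \tag{18}$$

where $q^o(x)$ is $q(x)$ at the previous stage. We can solve for $\tilde{p}(x)$ from it and substitute the result into (17):

$$\begin{aligned} \int q(x) \log \frac{q(x)}{D(x)\,q^o(x)}\,dx &= -\mathbb{E}_{x\sim q(x)}\big[\log D(x)\big] + KL(q(x)\,\|\,q^o(x)) \\ &= -\mathbb{E}_{z\sim q(z)}\big[\log D(G(z))\big] + KL(q(x)\,\|\,q^o(x)) \end{aligned} \tag{19}$$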
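The substitution that turns (17) into (19) can be verified numerically. The sketch below assumes a finite sample space and random placeholder distributions; `p_data`, `q_old`, and `q_new` are hypothetical stand-ins for the evidence distribution and for the generator distribution before and after the update.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_dist(n):
    """Random strictly positive distribution on n points."""
    w = rng.random(n) + 0.1
    return w / w.sum()

p_data = random_dist(6)   # stands in for the evidence distribution
q_old = random_dist(6)    # generator distribution at the previous stage
q_new = random_dist(6)    # candidate generator distribution being optimized

# Optimal discriminator (18), computed against the old generator.
D = p_data / (p_data + q_old)

# Loss (17): needs the data distribution explicitly.
loss_17 = float(np.sum(q_new * np.log(q_new / ((1.0 - D) * p_data))))

# Loss (19): needs only D and the old generator distribution.
adversarial = float(-np.sum(q_new * np.log(D)))
kl_new_old = float(np.sum(q_new * np.log(q_new / q_old)))
loss_19 = adversarial + kl_new_old
```

The two values agree because (18) implies $(1 - D(x))\,\tilde{p}(x) = D(x)\,q^o(x)$ pointwise.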

### Basic Analysis

Obviously, the first term is one of the standard generator losses of GAN:

$$-\mathbb{E}_{z\sim q(z)}\big[\log D(G(z))\big] \tag{20}$$

The second, extra term describes the distance between the new distribution and the old distribution. The two terms are adversarial: $KL(q(x)\,\|\,q^o(x))$ tries to make the two distributions more similar, while $-\log D(x)$ will be very large, because $D(x)$ will be very small for $x \sim q^o(x)$ if the discriminator is fully trained (all such samples will be considered negative), and vice versa. Thus, minimizing the entire loss requires the model both to inherit the old distribution $q^o(x)$ and to explore the new world $p(1|x)$.

As we know, the generator's loss in the current standard GAN lacks the second term, so it is actually an incomplete loss. Suppose there were an omnipotent optimizer that could find the global optimum in a very short time, and suppose $G(z)$ had enough fitting ability. Then $G(z)$ would generate only the single sample that makes $D(x)$ largest. In other words, the global optimum of $G(z)$ is $G(z) \equiv x_0$, where $x_0 = \mathop{\arg\max}_x D(x)$. This is called mode collapse, which in theory is certain to occur.
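The collapse argument can be illustrated on a finite sample space, where the incomplete loss is linear in the generator distribution and is therefore minimized by a point mass. The discriminator scores below are random placeholders, and the comparison against sampled distributions is only a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.uniform(0.05, 0.95, size=10)   # hypothetical discriminator scores on 10 points

def incomplete_loss(q):
    """-E_{x~q}[log D(x)]: the generator loss without the KL term."""
    return float(-np.sum(q * np.log(D)))

collapsed = np.zeros(10)
collapsed[np.argmax(D)] = 1.0          # all mass on x_0 = argmax_x D(x)

# Compare against many random generator distributions on the same points.
candidates = rng.dirichlet(np.ones(10), size=1000)
best_candidate = min(incomplete_loss(q) for q in candidates)
```

No sampled distribution can beat the collapsed one, since an average of $-\log D(x)$ values is never below their minimum.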

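So, what can (19) teach us? We let

$$q^o(x) = q_{\theta - \Delta\theta}(x), \quad q(x) = q_{\theta}(x) \tag{21}$$

which means that the update of the parameters of $G$ in this iteration is $\Delta\theta$. Expanding $q^o(x)$ in a Taylor series to second order, we get

$$KL(q(x)\,\|\,q^o(x)) \approx \int \frac{\big(\Delta\theta \cdot \nabla_{\theta}\, q_{\theta}(x)\big)^2}{2\,q_{\theta}(x)}\,dx \approx (\Delta\theta \cdot c)^2 \tag{22}$$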

We have already indicated that a complete loss should contain $KL(q(x)\,\|\,q^o(x))$. If it does not, we should keep this term small during training by other means. The approximation above shows that the extra loss is about $(\Delta\theta \cdot c)^2$, which cannot be too large; hence $\Delta\theta$ cannot be too large, because $c$ can be regarded as a constant during one iteration.
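The quadratic dependence on the parameter update can be checked on a toy parametric family. The sketch assumes a softmax-parametrized categorical distribution as a stand-in for $q_\theta(x)$; this family, the random `direction`, and the helper names are illustrative assumptions only.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(3)
theta = rng.standard_normal(5)       # current parameters
direction = rng.standard_normal(5)   # direction of the update delta-theta

def kl_for_step(step):
    """KL(q_theta || q_{theta - step * direction})."""
    return kl(softmax(theta), softmax(theta - step * direction))

# Halving the step size should divide the divergence by roughly four.
ratio = kl_for_step(1e-2) / kl_for_step(5e-3)
print(ratio)
```

A ratio close to 4 is what a leading term of order $\|\Delta\theta\|^2$ predicts; higher-order terms account for the small deviation.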

Now we can explain why hyperparameters must be adjusted carefully to make GAN converge [Salimans2016Improved]. The most common optimizers are all based on gradient descent, so $\Delta\theta$ is proportional to the gradients. We need to keep $\Delta\theta$ small, which is equivalent to keeping the gradients small. Consequently, we apply gradient clipping and Batch Normalization in GAN because they keep gradients steady. For the same reason, we usually use Adam rather than SGD with momentum. Meanwhile, the number of consecutive update iterations of $G(z)$ cannot be too large, because $\Delta\theta$ will be large if $G(z)$ updates a lot.

### Regularization Term

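Here we focus on getting something really useful and practical. We try to calculate $KL(q(x)\,\|\,q^o(x))$ directly, in order to obtain a regularization term to add to the generator's loss. Because it is difficult to calculate directly, we estimate it via $KL(q(x,z)\,\|\,q^o(x,z))$ (in the spirit of variational inference):

$$\begin{aligned} KL(q(x,z)\,\|\,q^o(x,z)) &= \iint q(x|z)\,q(z) \log \frac{q(x|z)\,q(z)}{q^o(x|z)\,q(z)}\,dx\,dz \\ &= \iint \delta(x - G(z))\,q(z) \log \frac{\delta(x - G(z))}{\delta(x - G^o(z))}\,dx\,dz \\ &= \int q(z) \log \frac{\delta(0)}{\delta(G(z) - G^o(z))}\,dz \end{aligned} \tag{23}$$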

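We have the limit

$$\delta(x) = \lim_{\sigma\to 0} \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right) \tag{24}$$

which means $\delta(x)$ can be replaced with a Gaussian distribution of small variance. So we have

$$KL(q(x)\,\|\,q^o(x)) \sim \lambda \int q(z)\,\|G(z) - G^o(z)\|^2\,dz \tag{25}$$

with $\lambda = 1/(2\sigma^2)$, and (19) becomes

$$\mathbb{E}_{z\sim q(z)}\Big[-\log D(G(z)) + \lambda\,\|G(z) - G^o(z)\|^2\Big] \tag{26}$$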

In other words, we can use the distance between samples from the old and new generators as a regularization term, which keeps the new generator from deviating much from the old one.
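A minimal batch estimate of the regularized generator loss (26) might look like the sketch below. It assumes that discriminator outputs and generator samples are already available as arrays; the function name, the default weight `lam=0.1`, and the random inputs are hypothetical placeholders rather than values from the paper or its released code.

```python
import numpy as np

def regularized_generator_loss(d_fake, g_new, g_old, lam=0.1):
    """Monte Carlo estimate of (26) over a batch of latent samples z.

    d_fake : D(G(z)) in (0, 1), shape (batch,)
    g_new  : G(z) from the generator being updated, shape (batch, dim)
    g_old  : G^o(z) from a frozen copy of the previous generator, same z
    lam    : regularization weight (0.1 is an arbitrary placeholder)
    """
    adversarial = -np.log(d_fake)
    drift = np.sum((g_new - g_old) ** 2, axis=-1)
    return float(np.mean(adversarial + lam * drift))

# Random placeholders standing in for real network outputs.
rng = np.random.default_rng(5)
d_fake = rng.uniform(0.1, 0.9, size=32)
g_old = rng.standard_normal((32, 4))
g_new = g_old + 0.05 * rng.standard_normal((32, 4))

loss = regularized_generator_loss(d_fake, g_new, g_old)
```

In training, `g_old` would come from a copy of the generator taken before the current update and evaluated on the same latent batch, with no gradient flowing through it.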

Experiments 1 and 2 on the CelebA dataset show that this regularization works well. (The code is modified from https://github.com/LynnHo/DCGAN-LSGAN-WGAN-WGAN-GP-Tensorflow and is now available at https://github.com/bojone/gan/tree/master/vgan.)

## Models Related to GAN

Adversarial Autoencoders (AAE) [Makhzani2015Adversarial] and Adversarially Learned Inference (ALI) [Dumoulin2016Adversarially] are two variants of GAN that can also be incorporated into variational inference. With the preparation above, deriving them is like solving two homework problems.

### AAE under GAN framework

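Actually, to obtain AAE, the only thing we need to do is to exchange $x$ and $z$ in the standard GAN. In detail, AAE wants to train an encoder $p(z|x)$ to map the distribution of real data $\tilde{p}(x)$ to the standard Gaussian distribution $q(z) = N(z; 0, I)$, with

$$p(z|x) = \delta(z - E(x)), \quad p(z) = \int p(z|x)\,\tilde{p}(x)\,dx \tag{27}$$

where $E(x)$ is the neural network of the encoder.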

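Like GAN, AAE needs a binary random latent variable $y$, and

$$p(z, y) = \begin{cases} p(z)\,p_1, & y = 1 \\ q(z)\,p_0, & y = 0 \end{cases} \tag{28}$$

We also let $p_1 = p_0 = 1/2$. On the other hand, we set $q(z, y) = q(y|z)\,q(z)$, whose posterior distribution $q(y|z)$ is a conditional Bernoulli distribution taking $z$ as input. Now we minimize $KL(p(z,y)\,\|\,q(z,y))$:

$$\begin{aligned} KL(p(z,y)\,\|\,q(z,y)) = &\int p(z)\,p_1 \log \frac{p(z)\,p_1}{q(1|z)\,q(z)}\,dz \\ &+ \int q(z)\,p_0 \log \frac{q(z)\,p_0}{q(0|z)\,q(z)}\,dz \end{aligned} \tag{29}$$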

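Now we have to solve for $q(y|z)$ and $E(x)$. We set $q(0|z) = D(z)$ and again train alternately. First, we fix $E(x)$, and hence $p(z)$. Ignoring terms that are constant with respect to $D$, we get:

$$D = \mathop{\arg\min}_{D} -\mathbb{E}_{z\sim p(z)}\big[\log(1 - D(z))\big] - \mathbb{E}_{z\sim q(z)}\big[\log D(z)\big] \tag{30}$$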

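Then we fix $D(z)$ and optimize $E(x)$. Ignoring terms that are constant with respect to $E$, we get the pure loss:

$$E = \mathop{\arg\min}_{E} \int p(z) \log \frac{p(z)}{(1 - D(z))\,q(z)}\,dz \tag{31}$$

Using the theoretical solution $D(z) = q(z) / (p^o(z) + q(z))$ to replace $q(z)$, where $p^o(z)$ is $p(z)$ at the previous stage, we get:

$$-\mathbb{E}_{x\sim\tilde{p}(x)}\big[\log D(E(x))\big] + KL(p(z)\,\|\,p^o(z)) \tag{32}$$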

On the one hand, as in the standard GAN, if we train carefully, we may drop the second term and have

$$-\mathbb{E}_{x\sim\tilde{p}(x)}\big[\log D(E(x))\big] \tag{33}$$

On the other hand, we can train a decoder $G(z)$ after finishing the adversarial training. However, if $E(x)$ has strong enough modeling ability, we can add a reconstruction error to the encoder's loss, which will not interfere with the original adversarial optimization of the encoder. Therefore, we get a joint loss:

$$G, E = \mathop{\arg\min}_{G, E} \mathbb{E}_{x\sim\tilde{p}(x)}\Big[-\log D(E(x)) + \lambda\,\|x - G(E(x))\|^2\Big] \tag{34}$$
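The joint loss (34) can be sketched as a single batch estimate. The encoder, decoder, and discriminator below are hypothetical linear or logistic stand-ins chosen only so the example runs; the weight `lam=1.0` is likewise an arbitrary placeholder.

```python
import numpy as np

def aae_encoder_loss(x, encode, decode, discriminate, lam=1.0):
    """Batch estimate of (34): adversarial term plus reconstruction error.

    discriminate(z) returns D(z) in (0, 1), the probability that z
    was drawn from the prior q(z) rather than produced by the encoder.
    """
    z = encode(x)
    adversarial = -np.log(discriminate(z))
    reconstruction = np.sum((x - decode(z)) ** 2, axis=-1)
    return float(np.mean(adversarial + lam * reconstruction))

# Hypothetical stand-ins for the three networks.
rng = np.random.default_rng(4)
W = rng.standard_normal((6, 2)) * 0.3
encode = lambda x: x @ W
decode = lambda z: z @ W.T
discriminate = lambda z: 1.0 / (1.0 + np.exp(-z.sum(axis=-1)))

x = rng.standard_normal((16, 6))
loss = aae_encoder_loss(x, encode, decode, discriminate)
```

With a perfect reconstruction and an undecided discriminator ($D \equiv 1/2$), the value reduces to $\log 2$, which gives a convenient sanity check.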

### Our Version of ALI

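ALI is like a fusion of GAN and AAE, and there is an almost identical model called Bidirectional GAN (BiGAN) [Donahue2017Adversarial]. Compared with GAN, ALI treats $z$ as a latent variable, so it needs a posterior distribution $p(z|x)$. Concretely, in ALI we have

$$q(x, z, y) = \begin{cases} p(z|x)\,\tilde{p}(x)\,p_1, & y = 1 \\ q(x|z)\,q(z)\,p_0, & y = 0 \end{cases} \tag{35}$$

and $p(x, z, y) = p(y|x,z)\,p(z|x)\,\tilde{p}(x)$. Then we minimize $KL(q(x,z,y)\,\|\,p(x,z,y))$:

$$\begin{aligned} &\iint p(z|x)\,\tilde{p}(x)\,p_1 \log \frac{p(z|x)\,\tilde{p}(x)\,p_1}{p(1|x,z)\,p(z|x)\,\tilde{p}(x)}\,dx\,dz \\ &+ \iint q(x|z)\,q(z)\,p_0 \log \frac{q(x|z)\,q(z)\,p_0}{p(0|x,z)\,p(z|x)\,\tilde{p}(x)}\,dx\,dz \end{aligned} \tag{36}$$

which is equivalent to minimizing

$$\begin{aligned} &\iint p(z|x)\,\tilde{p}(x) \log \frac{1}{p(1|x,z)}\,dx\,dz \\ &+ \iint q(x|z)\,q(z) \log \frac{q(x|z)\,q(z)}{p(0|x,z)\,p(z|x)\,\tilde{p}(x)}\,dx\,dz \end{aligned} \tag{37}$$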

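Now we have to solve for $p(y|x,z)$, $p(z|x)$, and $q(x|z)$. We set $p(1|x,z) = D(x,z)$, while $p(z|x)$ is a Gaussian distribution parametrized by an encoder $E$ and $q(x|z)$ is another Gaussian distribution parametrized by a generator $G$. We again train alternately. First, we fix $E$ and $G$; the loss related to $D$ is

$$D = \mathop{\arg\min}_{D} -\mathbb{E}_{x\sim\tilde{p}(x),\, z\sim p(z|x)}\big[\log D(x,z)\big] - \mathbb{E}_{z\sim q(z),\, x\sim q(x|z)}\big[\log(1 - D(x,z))\big] \tag{38}$$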

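As in VAE, the expectations over $p(z|x)$ and $q(x|z)$ can be computed using the reparametrization trick. Now we fix $D$ and optimize $G$ and $E$. Because $E$ and $G$ occur together, the loss (37) cannot be simplified directly. But using the theoretical solution of $D$,

$$D(x,z) = \frac{p^o(z|x)\,\tilde{p}(x)}{p^o(z|x)\,\tilde{p}(x) + q^o(x|z)\,q(z)} \tag{39}$$

we can transform it into

$$\begin{aligned} &-\iint p(z|x)\,\tilde{p}(x) \log D(x,z)\,dx\,dz - \iint q(x|z)\,q(z) \log D(x,z)\,dx\,dz \\ &+ \int q(z)\,KL(q(x|z)\,\|\,q^o(x|z))\,dz + \iint q(x|z)\,q(z) \log \frac{p^o(z|x)}{p(z|x)}\,dx\,dz \end{aligned} \tag{40}$$

Because $q(x|z)$ and $p(z|x)$ are Gaussian, we can calculate the last two terms analytically, or ignore them while optimizing carefully, which leads to

$$-\iint p(z|x)\,\tilde{p}(x) \log D(x,z)\,dx\,dz - \iint q(x|z)\,q(z) \log D(x,z)\,dx\,dz \tag{41}$$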

That is our version of ALI, which differs slightly from the standard ALI. The currently popular view treats ALI (and GAN) as a min-max problem. From that view, the loss of the encoder and generator is

$$\iint p(z|x)\,\tilde{p}(x) \log D(x,z)\,dx\,dz + \iint q(x|z)\,q(z) \log(1 - D(x,z))\,dx\,dz \tag{42}$$

or

$$-\iint p(z|x)\,\tilde{p}(x) \log(1 - D(x,z))\,dx\,dz - \iint q(x|z)\,q(z) \log D(x,z)\,dx\,dz \tag{43}$$

neither of which has the form of (41). Our experiment shows that (41) performs as well as (42) and (43). This means that treating adversarial networks as a min-max problem is not the only possible approach. Variational inference can sometimes give us new insight.
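The three candidate losses (41), (42), and (43) for the encoder and generator differ only in how the discriminator outputs enter. A small sketch, assuming the outputs on encoder pairs and generator pairs are already given as arrays (the function and argument names are hypothetical):

```python
import numpy as np

def ali_losses(d_real, d_fake):
    """Batch estimates of the encoder/generator losses (41), (42), (43).

    d_real : D(x, z) on encoder pairs, x ~ data and z ~ p(z|x)
    d_fake : D(x, z) on generator pairs, z ~ q(z) and x ~ q(x|z)
    Both arrays hold probabilities strictly between 0 and 1.
    """
    loss_41 = -np.mean(np.log(d_real)) - np.mean(np.log(d_fake))
    loss_42 = np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
    loss_43 = -np.mean(np.log(1.0 - d_real)) - np.mean(np.log(d_fake))
    return float(loss_41), float(loss_42), float(loss_43)

# Random placeholders standing in for discriminator outputs.
rng = np.random.default_rng(6)
d_real = rng.uniform(0.1, 0.9, size=64)
d_fake = rng.uniform(0.1, 0.9, size=64)
loss_41, loss_42, loss_43 = ali_losses(d_real, d_fake)
```

All three share the generator-side goal of raising $D$ on generated pairs; they differ on the encoder side, where (41) raises $D$ on encoder pairs while (42) and (43) lower it.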

## Conclusion

Our results show that variational inference is a general framework for deriving and explaining many generative models, including VAE and GAN. We also describe how variational inference does so by introducing a new interpretation of it. This interpretation is powerful: it leads to VAE and the EM algorithm in several lines and yields a clear derivation of GAN.

A related work is [Hu2018On], which also attempts to link VAE and GAN with variational inference. However, its treatment is not clear enough. We make up for this deficiency and try to give a simpler view of GAN under variational inference.

It seems the potential of variational inference is still waiting to be mined.