f-VAEs: Improve VAEs with Conditional Flows

09/16/2018 ∙ by Jianlin Su, et al. ∙ 0

In this paper, we integrate VAEs and flow-based generative models successfully and obtain f-VAEs. Compared with VAEs, f-VAEs generate more vivid images, solving the blurred-image problem of VAEs. Compared with flow-based models such as Glow, f-VAEs are more lightweight and converge faster, achieving the same performance with a smaller architecture.




Let $\tilde{p}(x)$ be the evidence distribution of the dataset. The basic idea of generative models is to fit the dataset with a distribution of the form

$$q(x) = \int q(z)\, q(x|z)\, dz,$$

whose prior $q(z)$ is usually a standard Gaussian distribution.

$q(x|z)$ describes a generative procedure; it is a conditional Gaussian distribution (in VAEs) or a Delta distribution (in GANs and flow-based models). Ideally, we maximize the log-likelihood $\mathbb{E}_{x\sim\tilde{p}(x)}[\log q(x)]$, or equivalently, minimize the KL divergence $KL(\tilde{p}(x)\,\|\,q(x))$, to find the best model.

However, the integral usually has no closed form, so we need workarounds. One of them is VAEs, which introduce a posterior distribution $p(z|x)$ and instead minimize the joint KL divergence $KL(\tilde{p}(x)p(z|x)\,\|\,q(z)q(x|z))$, which is an upper bound of $KL(\tilde{p}(x)\,\|\,q(x))$ and always easy to compute [Su2018Variational].

Another solution is flow-based models, which let $q(x|z) = \delta(x - G(z))$ and compute the integral explicitly with a well-designed invertible $G$ (a stack of flows). The major component is the Coupling layer: split $x$ into two parts $x_1, x_2$, and then

$$y_1 = x_1,\qquad y_2 = s(x_1)\otimes x_2 + t(x_1),$$

whose inverse is

$$x_1 = y_1,\qquad x_2 = \big(y_2 - t(y_1)\big) / s(y_1),$$

and whose Jacobian determinant is $\prod_i s_i(x_1)$. We call this transformation Affine Coupling (if $s \equiv 1$, it is also called Additive Coupling), denoted $x \xrightarrow{\text{Coupling}} y$. Many Coupling layers can be composed into a complex invertible function $F = f_1 \circ f_2 \circ \cdots \circ f_n$, called an (unconditional) flow.

Therefore flow-based models can maximize the log-likelihood directly. Recently, the Glow model [Kingma2018Glow] demonstrated very realistic generation with flow-based models, arousing renewed interest in them. But flow-based models are usually very heavyweight; for example, the Glow model for 256x256 images needs to train for a week on 40 GPUs (details at https://github.com/openai/glow/issues/14#issuecomment-406650950 and https://github.com/openai/glow/issues/37#issuecomment-410019221). So flow-based models are not friendly enough.


As we know, images generated by VAEs are always blurry. Some attribute this to the MSE loss, others to an inherent problem of the KL divergence. However, note that images reconstructed by ordinary AutoEncoders are always blurry too. So blurring may be an inherent problem of AutoEncoders, which must reconstruct high-dimensional data from a low-dimensional latent variable.

What if the size of the latent variable equals the input image size? Still not enough. Because of the Gaussian assumption on the posterior $p(z|x)$, the fitting ability of VAEs is limited: Gaussian distributions are just a tiny subset of all possible posterior distributions.

So what is the problem of flow-based models? They design an invertible (and strongly nonlinear) transformation to encode the input distribution into a Gaussian. For this transformation, we must guarantee not only invertibility but also the computability of the Jacobian determinant, which leads to the Coupling layer design. However, each Coupling layer provides only very weak nonlinearity, so we need to stack many of them to achieve strong nonlinearity, which makes flow-based models very heavy.


Our new model introduces flow into VAEs, using a flow-based model to construct a more powerful posterior distribution instead of a Gaussian one. We call it the Flow-based Variational Autoencoder (f-VAE). It generates clearer images (compared with VAEs) and is more lightweight (compared with Glow).


We start from the original loss of VAEs:

$$KL\big(\tilde{p}(x)p(z|x)\,\big\|\,q(z)q(x|z)\big) = \iint \tilde{p}(x)\,p(z|x)\log\frac{\tilde{p}(x)\,p(z|x)}{q(z)\,q(x|z)}\,dz\,dx.$$

Different from standard VAEs, we no longer assume that $p(z|x)$ is Gaussian; instead we use a flow-based formulation

$$p(z|x) = \int \delta\big(z - F_x(u)\big)\, q(u)\, du,$$

where $q(u)$ is a standard Gaussian distribution and $F_x(u)$ is a bivariate function of $x$ and $u$ that is invertible with respect to $u$. We can regard it as a flow-based model of $u$ whose parameters are functions of $x$; we call it a conditional flow. Replacing $p(z|x)$ in the loss above with it, we get

$$\iint \tilde{p}(x)\,q(u)\left[-\log q\big(x|F_x(u)\big) + \log\frac{q(u)}{q\big(F_x(u)\big)} - \log\left|\det\frac{\partial F_x(u)}{\partial u}\right|\right]du\,dx,$$

which is the general loss of f-VAEs. Details can be found in Appendix A.

Two Cases

Now we work through two simple special cases to see how the loss behaves. First, we let

$$F_x(u) = \sigma(x)\otimes u + \mu(x),$$

so we have

$$-\log\left|\det\frac{\partial F_x(u)}{\partial u}\right| = -\sum_i \log \sigma_i(x),$$

$$\log\frac{q(u)}{q\big(F_x(u)\big)} = \frac{1}{2}\big\|\sigma(x)\otimes u + \mu(x)\big\|^2 - \frac{1}{2}\|u\|^2.$$

In expectation over $u$, the sum of these two terms equals $KL\big(N(\mu(x),\sigma^2(x))\,\big\|\,N(0,I)\big)$. Plugging it into the general loss, we get the loss of the standard VAE, with the reparameterization trick appearing automatically.
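The claim that these two terms average to a Gaussian KL divergence can be checked numerically. A small Monte-Carlo sketch (the values of $\mu$ and $\sigma$ below are arbitrary toy choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
mu    = np.array([0.5, -1.0, 0.3])       # toy μ(x)
sigma = np.array([0.8,  1.2, 0.5])       # toy σ(x)

# Monte-Carlo estimate of E_u[ log q(u)/q(F_x(u)) - log|det ∂F_x/∂u| ]
# with F_x(u) = σ ⊗ u + μ and q a standard Gaussian.
u = rng.normal(size=(200_000, 3))
z = sigma * u + mu
term = 0.5 * np.sum(z**2 - u**2, axis=1) - np.sum(np.log(sigma))
mc = term.mean()

# Closed-form KL( N(μ, σ²) || N(0, I) ).
kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

assert abs(mc - kl) < 0.05               # the two agree up to sampling noise
```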

Second, we choose

$$F_x(u) = F(x + \sigma u),\qquad q(x|z) = N\big(x;\, F^{-1}(z),\, \sigma^2\big),$$

where $\sigma$ is a small positive constant and $F$ is a flow-based encoder whose parameters are independent of $x$ (an unconditional flow). Then we can see

$$-\log q\big(x|F_x(u)\big) = \frac{1}{2}\|u\|^2 + \text{const} = -\log q(u) + \text{const},$$

which means this term contributes no trainable parameters. Therefore, the total loss equals

$$\iint \tilde{p}(x)\, q(u)\left[-\log q\big(F(x+\sigma u)\big) - \log\left|\det\frac{\partial F(v)}{\partial v}\right|_{v=x+\sigma u}\right]du\,dx + \text{const}.$$

That is the loss of ordinary flow-based models whose inputs carry added Gaussian noise with variance $\sigma^2$. Interestingly, the standard training strategy of flow-based models does in fact add some noise to the inputs, and our result includes it naturally.
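This correspondence with input noising is easy to picture. A minimal numpy sketch (array shapes and the value of σ are arbitrary toy choices) of the noise injection the derivation recovers:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 8, 8)).astype(np.float64)  # fake 8-bit data

sigma = 0.5                                   # small noise scale (toy value)
noisy = images + sigma * rng.normal(size=images.shape)

# The discrete pixel grid becomes a continuous distribution, which is what a
# flow-based model (a continuous density estimator) actually needs as input.
assert noisy.shape == images.shape and not np.array_equal(images, noisy)
```

Common practice for flows is uniform dequantization noise; the case above recovers a Gaussian variant of the same idea directly from the f-VAE framework.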

Our Model

From the above we can see that ordinary VAEs and flow-based models are naturally included in this framework. $F_x(u)$ describes how the information of $x$ and $u$ mixes, and in theory we can use an arbitrarily complicated $F_x$ to improve the fitting ability of $p(z|x)$, for example

$$F_x(u) = F_2\big(\sigma(x)\otimes F_1(u) + \mu(x)\big),$$

where $F_1, F_2$ are unconditional flows.

Up to now we have not imposed any constraint on the size of $z$ (or $u$); it is actually a hyperparameter of the model. Thus f-VAEs allow us to train a better dimension-reduced VAE or a dimension-reduced flow-based model. However, for image generation, we have seen that dimension-reduced AutoEncoders may lead to blurring. So in this paper, we let the size of $z$ equal the size of the input image $x$.

Out of practicality and simplicity, we combine the two cases above into a general formulation

$$F_x(u) = F\big(E(x) + \sigma_1 u\big),\qquad q(x|z) = N\big(x;\, D(F^{-1}(z)),\, \sigma_2^2\big),$$

where $\sigma_1, \sigma_2$ are trainable parameters (both scalars), $E, D$ are a trainable encoder and decoder, and $F$ is an unconditional flow. Plugging this into the general loss, we get the final loss

$$\iint \tilde{p}(x)\, q(u)\left[\frac{1}{2\sigma_2^2}\big\|x - D\big(E(x)+\sigma_1 u\big)\big\|^2 - \log q\big(F(E(x)+\sigma_1 u)\big) - \log\left|\det\frac{\partial F(v)}{\partial v}\right|_{v=E(x)+\sigma_1 u}\right]du\,dx + d\log\frac{\sigma_2}{\sigma_1} + \text{const},$$

where $d$ is the dimension of $x$. The sampling procedure is

$$z \sim q(z),\qquad x = D\big(F^{-1}(z)\big).$$

Related Work

In fact, "flow-based models" is a general term for a family of models. Besides the Coupling layer above, there are also autoregressive flows, whose representative works include PixelRNNs, PixelCNNs, and so on [Oord2016Pixel, Salimans2017PixelCNN]. Autoregressive flows also work well, but they generate images pixel by pixel, so models based on them are usually very slow.

Normalizing flows like RealNVP and Glow are another line of improvement; Glow in particular has shown amazing generation results for flow-based models. Glow has a low inference cost, but its training cost is still very heavy.

The first attempt to combine VAEs and flow-based models is [Rezende2015Variational], with follow-up work in [Chen2016Variational] and [Kingma2017Improving], which introduce (Inverse) Autoregressive Flows into VAEs. All of these results (ours included) are similar in spirit. However, none of the previous works gives a general framework like ours, and none of them achieves a breakthrough improvement on image generation.

Our work seems to be the first to introduce (normalizing) flows like RealNVP and Glow into VAEs. These flows are based on the Coupling layer and can be computed in parallel, so they are more efficient than autoregressive flows and can be stacked deeply. We also keep the dimension of the latent variable uncompressed, thus alleviating the blurring problem.



Limited by GPUs, we can only evaluate our model on CelebA HQ at 64x64 and 128x128 resolution. We first make a quick comparison with ordinary VAEs and Glow at 64x64, then demonstrate high-quality generation at 128x128.

The encoder $E$ is a stack of Convolution and Squeeze operations. In detail, $E$ contains several blocks, with a Squeeze operation applied before each block. Each block contains several steps, where each step is a stack of a 3x3 ReLU convolution and a 1x1 linear convolution.
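The Squeeze operation used between blocks is a standard space-to-depth reshape; a minimal numpy sketch of it and its inverse (assuming even height and width):

```python
import numpy as np

def squeeze(x):
    """Space-to-depth: (H, W, C) -> (H/2, W/2, 4C), trading resolution for channels."""
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)

def unsqueeze(y):
    """Inverse of squeeze: (H, W, 4C) -> (2H, 2W, C)."""
    H, W, C4 = y.shape
    y = y.reshape(H, W, 2, 2, C4 // 4)
    return y.transpose(0, 2, 1, 3, 4).reshape(2 * H, 2 * W, C4 // 4)

x = np.arange(4 * 4 * 3, dtype=np.float64).reshape(4, 4, 3)
assert squeeze(x).shape == (2, 2, 12)
assert np.array_equal(unsqueeze(squeeze(x)), x)   # lossless, hence invertible
```

Because it is a pure permutation of entries, Squeeze is exactly invertible with unit Jacobian determinant, so it mixes spatial information without affecting the likelihood computation.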

The decoder $D$ is a stack of Convolution and UnSqueeze operations, whose structure is the inverse of $E$. Adding an output activation at the end of $D$ is usually a good choice, but not always necessary. The flow $F$ we use also has a multi-scale design like Glow, but with smaller depth and smaller convolution kernel size.

(a) Ordinary VAEs
(b) flow-based models
(c) f-VAEs (ours)
Figure 1: Comparison of the procedures of VAEs, flow-based models, and f-VAEs.


Comparing figures 2(a) and 2(c), we can see that f-VAEs basically solve the blurring problem. We also compare with a similarly-sized Glow at the same epoch (figure 2(b)). We do not doubt that Glow would also perform well with more layers and more depth, but f-VAEs obviously perform much better than Glow under the same complexity and the same training time. To achieve the above results, we trained for only about 7 hours (120 epochs) on a single GTX 1060.

Actually, the total encoder of f-VAEs is the composition $F \circ E$. In ordinary flow-based models we would need the Jacobian determinant of $E$, but in f-VAEs we do not. Therefore $E$ can be a general convolutional network, which realizes strong nonlinearity.

Random interpolations in figure 3 also show that the encoder of f-VAEs transforms input images into a good embedding space.

Our results on 128x128 are shown in Appendix B.

(a) Samples from VAEs
(b) Samples from Glow
(c) Samples from f-VAEs
Figure 2: Samples from VAEs, Glow and f-VAEs.
Figure 3: Linear interpolation in latent space of f-VAEs between real images



Actually, the original goal of this paper was to answer the following questions about Glow:

  1. How to reduce the computation of Glow?

  2. How to implement a dimension-reduced version of Glow?

Our results show that a dimension-kept f-VAE equals a tiny-computation version of Glow with better results, and f-VAEs actually allow us to train a dimension-reduced flow-based model as well. We also show that ordinary VAEs and flow-based models are theoretically included in our framework. So we obtain a more general generative and inference framework.

Future Work

We can see that random samples from f-VAEs still have an oil-painting style. One reason, we think, is that our experimental model is not complex enough to fit the details. But the most important reason, we believe, is the overuse of 3x3 convolutions: it expands the receptive field indefinitely, preventing the convolutions from focusing on the major details.

Therefore, a challenging task is to find out how to design workable and reasonable encoders and decoders. Structures like Network in Network [Lin2013Network] may be suitable for this, but that remains to be validated. Progressive growing structures like [Karras2017Progressive] are also worth a try.


Appendix

A. Detailed Derivation of the General Loss

Combining the VAE loss with the conditional-flow posterior, we have

$$KL\big(\tilde{p}(x)p(z|x)\,\big\|\,q(z)q(x|z)\big) = \iint \tilde{p}(x)\,p(z|x)\log\frac{p(z|x)}{q(z)\,q(x|z)}\,dz\,dx + \text{const}.$$

Let $z = F_x(u)$; then we have the relation of Jacobian determinants

$$p(z|x)\,\Big|_{z=F_x(u)} = q(u)\left|\det\frac{\partial F_x(u)}{\partial u}\right|^{-1},$$

and the loss becomes

$$\iint \tilde{p}(x)\,q(u)\left[-\log q\big(x|F_x(u)\big) + \log\frac{q(u)}{q\big(F_x(u)\big)} - \log\left|\det\frac{\partial F_x(u)}{\partial u}\right|\right]du\,dx + \text{const}.$$
B. Results on 128x128

We also validate our model on 128x128 CelebA HQ, with results shown in the figures below. We trained it on a single GTX 1060, costing only about 1.5 days (150 epochs).

Figure 4: Samples from the 128x128 model at temperature (standard deviation of the prior distribution) 0.8

Figure 5: Linear interpolation in latent space between 2 real images
Figure 6: Linear interpolation in latent space between 4 real images
(a) temperature 0
(b) temperature 0.5
(c) temperature 0.6
(d) temperature 0.7
(e) temperature 0.8
(f) temperature 0.9
(g) temperature 1.0
Figure 7: Effect of change of temperature. From left to right, samples obtained at temperatures 0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0