flow
A Keras implementation of flow-based models
In this paper, we successfully integrate VAEs and flow-based generative models into f-VAEs. Compared with VAEs, f-VAEs generate more vivid images and resolve the blurred-image problem of VAEs. Compared with flow-based models such as Glow, f-VAEs are more lightweight and converge faster, achieving the same performance with a smaller architecture.
Let p̃(x) be the evidence distribution of the dataset. The basic idea of generative models is to fit the dataset with a distribution of the form

(1)  q(x) = ∫ q(z) q(x|z) dz

whose prior distribution q(z) is usually a standard Gaussian. The term q(x|z) describes a generative procedure; it is a conditional Gaussian distribution (in VAEs) or a Delta distribution (in GANs and flow-based models). Ideally, we maximize the log-likelihood E_{x∼p̃(x)}[log q(x)], or equivalently minimize the KL divergence KL(p̃(x) ‖ q(x)), to find the best model. However, the integral usually has no closed form, so we need workarounds. One of them is VAEs, which introduce a posterior distribution p(z|x) and instead minimize the joint KL divergence KL(p̃(x)p(z|x) ‖ q(z)q(x|z)), which is an upper bound of KL(p̃(x) ‖ q(x)) and is always easy to compute [Su2018Variational].
Another solution is flow-based models, which let q(x|z) be a Delta distribution and compute the integral exactly through a well-designed invertible transformation (a stack of flows). The major component is the coupling layer: split x into two parts, x = [x₁, x₂], and then

(2)  y₁ = x₁,   y₂ = s(x₁) ⊗ x₂ + t(x₁)

whose inverse is

(3)  x₁ = y₁,   x₂ = (y₂ − t(y₁)) ⊘ s(y₁)

and whose Jacobi determinant is ∏ᵢ sᵢ(x₁). We call this transformation Affine Coupling (if s ≡ 1, it is also called Additive Coupling), denoted y = f(x). Many coupling layers can be composed into a complex invertible function F = f₁ ∘ f₂ ∘ ⋯ ∘ fₙ, called an (unconditional) flow.
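As an illustration, the coupling transform of Eqs. (2)-(3) and its log-Jacobi-determinant can be sketched in a few lines of NumPy; the toy s and t functions below are stand-ins of our own choosing, not the paper's architecture:

```python
import numpy as np

def coupling_forward(x, s_fn, t_fn):
    """Affine coupling, Eq. (2): keep the first half, transform the second."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    s, t = s_fn(x1), t_fn(x1)
    y2 = s * x2 + t
    # log|Jacobi determinant| = sum_i log|s_i(x1)|
    logdet = np.sum(np.log(np.abs(s)), axis=-1)
    return np.concatenate([x1, y2], axis=-1), logdet

def coupling_inverse(y, s_fn, t_fn):
    """Inverse transform, Eq. (3)."""
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    x2 = (y2 - t_fn(y1)) / s_fn(y1)
    return np.concatenate([y1, x2], axis=-1)
```

Since only half of the input is transformed at each layer, successive layers typically swap the roles of x₁ and x₂ so that every dimension is eventually updated.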
Therefore flow-based models can maximize the log-likelihood directly. Recently, the Glow model [Kingma2018Glow] demonstrated very realistic generation with flow-based models, arousing renewed interest in them. But flow-based models are usually very heavyweight; for example, the Glow model for 256x256 images needs to train for a week on 40 GPUs (details can be found at https://github.com/openai/glow/issues/14#issuecomment-406650950 and https://github.com/openai/glow/issues/37#issuecomment-410019221). So flow-based models are not friendly enough.
As we know, the images generated by VAEs are always blurry. Some attribute this to the MSE loss, others to an inherent problem of the KL divergence. However, one thing to note is that images reconstructed by ordinary autoencoders are always blurry too. So the blurring problem may be inherent to autoencoders, which must reconstruct high-dimensional data from a low-dimensional latent variable.
What if the size of the latent variable equals the input image size? Still not enough. Because of the Gaussian assumption on the posterior p(z|x), the fitting ability of VAEs is limited: Gaussian distributions are just a tiny subset of all possible posterior distributions.
So what is the problem with flow-based models? They design an invertible (and strongly nonlinear) transformation to encode the input distribution into a Gaussian. For this transformation, we have to guarantee not only invertibility but also the computability of the Jacobi determinant, which leads to the coupling-layer design above. However, a single coupling layer provides only weak nonlinearity, so we need to stack many coupling layers to achieve strong nonlinearity, which makes flow-based models very heavy.
Our new model introduces flow into VAEs, using a flow-based model to construct a more powerful posterior distribution instead of a Gaussian one. We call it Flow-based Variational Autoencoders (f-VAEs). It generates clearer images than VAEs and is lighter than Glow.
We start from the original loss of VAE (up to an additive constant):

(4)  L = E_{x∼p̃(x)} E_{z∼p(z|x)} [ log p(z|x) − log q(x|z) − log q(z) ]
Different from the standard VAE, we no longer assume that p(z|x) is Gaussian; instead we use a flow-based formulation

(5)  p(z|x) = ∫ q(u) δ(z − F_x(u)) du

Here q(u) is a standard Gaussian distribution, and F_x(u) is a bivariate function of x and u that is invertible with respect to u. We can regard it as a flow-based model of u whose parameters are functions of x; we call it a conditional flow. Replacing p(z|x) in (4) with it, we get

(6)  L = E_{x∼p̃(x)} E_{u∼q(u)} [ log q(u) − log |det ∂F_x(u)/∂u| − log q(x|F_x(u)) − log q(F_x(u)) ]
This is the general loss of f-VAEs. The details can be found in Appendix A.
Now we work out two simple special cases to see how (6) behaves. Firstly, we let

(7)  F_x(u) = σ(x) ⊗ u + μ(x)

so we have

(8)  E_{u∼q(u)} [ −log q(x | σ(x) ⊗ u + μ(x)) ]

and

(9)  E_{u∼q(u)} [ log q(u) − log q(σ(x) ⊗ u + μ(x)) ] − Σᵢ log σᵢ(x) = KL( N(μ(x), σ²(x)) ‖ N(0, I) )

The sum of them equals (6). So this choice recovers the loss of the standard VAE, with the reparameterization trick arising automatically.
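A quick numerical check of this reduction: for the choice in Eq. (7), a Monte Carlo estimate of the left side of Eq. (9) should match the closed-form Gaussian KL term of a standard VAE. The particular μ and σ values below are arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(42)
mu = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.3])

# log-density of a standard Gaussian, summed over dimensions
log_q = lambda v: -0.5 * np.sum(v ** 2, axis=-1) - 0.5 * v.shape[-1] * np.log(2 * np.pi)

# Monte Carlo estimate of Eq. (9): E_u[log q(u) - log q(sigma*u + mu)] - sum_i log sigma_i
u = rng.normal(size=(200_000, 2))
mc = np.mean(log_q(u) - log_q(sigma * u + mu)) - np.sum(np.log(sigma))

# Closed-form KL(N(mu, sigma^2) || N(0, I)) -- the standard VAE regularizer
kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))
```

The two quantities agree up to Monte Carlo error, confirming that the conditional-flow loss collapses to the familiar VAE objective for this affine choice of F_x.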
Secondly, we choose

(10)  F_x(u) = F(x + σu)

where σ is a small positive constant and F is a flow-based encoder whose parameters are independent of x (an unconditional flow). Then we can see

(11)  log |det ∂F_x(u)/∂u| = d log σ + log |det ∂F(h)/∂h| |_{h=x+σu}

so E_{u∼q(u)}[log q(u)] − d log σ contains no training parameters (and with the decoder chosen to invert F, the reconstruction term is also a constant). Therefore the total loss equals

(12)  E_{x∼p̃(x), u∼q(u)} [ −log q(F(x + σu)) − log |det ∂F(h)/∂h| |_{h=x+σu} ] + const

That is the loss of the original flow-based models, whose inputs have Gaussian noise of variance σ² added. Interestingly, the standard training strategy of flow-based models does add some noise to the inputs, and our result includes this naturally.

From the above we can see that ordinary VAEs and flow-based models are naturally included in (6). F_x(u) describes how the information of x and u is mixed, and in theory we can use an arbitrarily complicated F_x(u) to improve the fitting ability of p(z|x), for example

(13)  F_x(u) = F₁(σ(x) ⊗ F₂(u) + μ(x))

whose F₁, F₂ are unconditional flows.
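The noisy-input flow loss of Eq. (12) can be sketched with a toy invertible linear map standing in for F (a hypothetical stand-in of our own; a real model would use stacked coupling layers):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_nll(x, W, sigma=0.1):
    """Eq. (12) with a toy linear flow F(h) = h @ W.T: the average of
    -log q(F(x + sigma*u)) - log|det dF/dh| over noise u ~ N(0, I)."""
    u = rng.normal(size=x.shape)
    z = (x + sigma * u) @ W.T
    log_q = -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * x.shape[-1] * np.log(2 * np.pi)
    logdet = np.log(np.abs(np.linalg.det(W)))  # constant for a linear flow
    return np.mean(-log_q - logdet)
```

Minimizing this objective in W trains the flow on the perturbed inputs x + σu, which is exactly Eq. (12) up to the additive constant.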
Up to now we have not constrained the size of z (or u); it is actually a hyperparameter of the model. Thus (6) allows us to train a better dimension-reduced VAE or a dimension-reduced flow-based model. However, for image generation, we have seen that dimension-reducing autoencoders may lead to blurring. So in this paper we let the size of z equal the size of the input image x.

Out of practicability and simplicity, we combine (7) and (10) into a general formulation
(14)  F_x(u) = F(σ₁ u + E(x)),   q(x|z) = N(x; G(z), σ₂² I)

whose σ₁, σ₂ are training parameters (both scalars), E, G are the encoder and decoder to be trained, and F is an unconditional flow. Plugging (14) into (6), we get the final loss

(15)  E_{x∼p̃(x), u∼q(u)} [ ‖G(F(σ₁u + E(x))) − x‖² / (2σ₂²) + d log σ₂ − log q(F(σ₁u + E(x))) − log |det ∂F(h)/∂h| |_{h=σ₁u+E(x)} ] − d log σ₁ + const
The sampling procedure is
(16)  z ∼ q(z) = N(0, I),   x = G(z)
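Putting Eqs. (15) and (16) together, here is a minimal sketch with hypothetical stand-ins for E, G, and F (linear maps and an identity flow, chosen only so the code is self-contained; constants independent of the parameters are dropped):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Toy stand-ins (our own choices): linear encoder/decoder, identity flow
E = lambda x: 0.9 * x                      # encoder
G = lambda z: z / 0.9                      # decoder
F = lambda h: (h, 0.0)                     # unconditional flow -> (z, log|det dF/dh|)
log_s1, log_s2 = np.log(0.5), np.log(0.1)  # trainable scalars sigma_1, sigma_2

def fvae_loss(x):
    """Final f-VAE loss, Eq. (15), up to an additive constant."""
    s1, s2 = np.exp(log_s1), np.exp(log_s2)
    u = rng.normal(size=x.shape)
    z, logdet = F(s1 * u + E(x))
    recon = np.sum((G(z) - x) ** 2, axis=-1) / (2 * s2 ** 2) + d * log_s2
    prior = 0.5 * np.sum(z ** 2, axis=-1)  # -log q(z) up to a constant
    return np.mean(recon + prior - logdet) - d * log_s1

def sample(n):
    """Sampling procedure, Eq. (16): z ~ q(z), x = G(z)."""
    z = rng.normal(size=(n, d))
    return G(z)
```

In a real implementation E, G, and F would be the convolutional networks described below, and the loss would be minimized with stochastic gradient descent.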
In fact, "flow-based models" is an umbrella term for a series of models. Besides the coupling layers above, there are also autoregressive flows, whose representative works are PixelRNNs, PixelCNNs, and so on [Oord2016Pixel, Salimans2017PixelCNN]. Autoregressive flows also work well, but they generate images pixel by pixel, so models based on them are usually very slow.
Normalizing flows like RealNVP and Glow are another line of improvement; Glow in particular has shown amazing generation results for flow-based models. Glow has a low inference cost, but its training cost is still very heavy.
The first attempt to combine VAEs and flow-based models is [Rezende2015Variational], with follow-up work in [Chen2016Variational] and [Kingma2017Improving], which introduce (inverse) autoregressive flows into VAEs. All of these results (including ours) are similar in spirit. However, none of the earlier works gives a general framework like (6), and none of them achieves a breakthrough improvement in image generation.
Our work seems to be the first to introduce normalizing flows like RealNVP and Glow into VAEs. These flows are based on coupling layers and can be computed in parallel, so they are more efficient than autoregressive flows and can be stacked deeply. We also ensure that the dimension of the latent variable is not compressed, thus alleviating the blurry-generation problem.
Limited by GPUs, we can only evaluate our model on the 64x64 and 128x128 resolutions of CelebA HQ. We first make a quick comparison with ordinary VAEs and Glow at 64x64, and then demonstrate high-quality generation at 128x128.
The encoder E is a stack of convolution and Squeeze operations. In detail, E contains several blocks, and a Squeeze operation is applied before each block. Each block contains several steps, each step being a stack of a 3x3 ReLU convolution and a 1x1 linear convolution.

The decoder G is a stack of convolution and UnSqueeze operations, whose structure is the inverse of E. Adding an activation at the end of G is usually a good choice, but not always necessary. The flow F we use also has a multi-scale design like Glow, but with smaller depth and a smaller convolution kernel size.
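The Squeeze/UnSqueeze operations mentioned above can be sketched as space-to-depth reshapes (a common formulation; channels-last layout is our assumption here):

```python
import numpy as np

def squeeze(x, factor=2):
    """Squeeze: trade spatial resolution for channels,
    (H, W, C) -> (H/f, W/f, C*f*f), applied before each encoder block."""
    H, W, C = x.shape
    x = x.reshape(H // factor, factor, W // factor, factor, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H // factor, W // factor, C * factor * factor)

def unsqueeze(x, factor=2):
    """UnSqueeze: exact inverse of squeeze, used by the decoder."""
    H, W, C = x.shape
    c = C // (factor * factor)
    x = x.reshape(H, W, factor, factor, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H * factor, W * factor, c)
```

Because the operation is a pure permutation of pixels, it is invertible and contributes nothing to the Jacobi determinant.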
From the samples, we can see f-VAEs basically solve the blurring problem. We also compare with a comparable Glow at the same epoch (figure 2(b)). We do not doubt that Glow would also perform well with more layers and more depth. But, obviously, f-VAEs perform much better than Glow under the same complexity and the same training time. To achieve the above results, we used only one GTX 1060 and trained for about 7 hours (120 epochs).

Actually, the total encoder of f-VAEs is the composite of E and F. In ordinary flow-based models we would need to compute the Jacobi determinant of E, but in f-VAEs we do not. Therefore E can be a general convolutional network, which provides strong nonlinearity.

Random interpolations in figure 3 also show that the encoder of f-VAEs transforms input images into a good embedding space. Our results on 128x128 are shown in Appendix B.
Actually, the original goal of this paper was to answer the following questions about Glow:

How to reduce the computation of Glow?

How to implement a dimension-reduced version of Glow?

Our results show that a dimension-preserving f-VAE equals a low-computation version of Glow with better results, and (6) actually allows us to train a dimension-reduced flow-based model. We also show that ordinary VAEs and flow-based models are theoretically included in our framework. So we can say we obtain a more general generative and inference framework.
We can see that random samples from f-VAEs still have an oil-painting-like style. One reason, we think, is that our experimental model is not complex enough to fit the details. But the most important reason, we believe, is the overuse of 3x3 convolutions, which expands the receptive field indefinitely so that the convolutions cannot focus on the major details.

Therefore, a challenging task is to find out how to design a workable and reasonable encoder and decoder. Structures like Network in Network [Lin2013Network] may be suitable for this, but that remains to be validated. Progressive growing structures like [Karras2017Progressive] are also worth a try.
Combining (4) and (5), we have

(17)  L = E_{x∼p̃(x)} E_{u∼q(u)} [ log p(F_x(u)|x) − log q(x|F_x(u)) − log q(F_x(u)) ]
Let z = F_x(u); we have the relation between densities via the Jacobi determinant

(18)  p(F_x(u)|x) = q(u) / |det ∂F_x(u)/∂u|
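Equation (18) is the usual change-of-variables relation; it can be checked numerically for a one-dimensional affine conditional flow F_x(u) = σu + μ, whose transformed density is known in closed form (the values of μ, σ, u below are arbitrary):

```python
import numpy as np

def normal_pdf(v, mean=0.0, std=1.0):
    """Density of N(mean, std^2) at v."""
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# 1-D conditional flow F_x(u) = sigma*u + mu, with |det dF_x/du| = sigma
mu, sigma, u = 0.5, 0.8, 1.3
z = sigma * u + mu
lhs = normal_pdf(u) / sigma               # q(u) / |det dF_x/du|, Eq. (18)
rhs = normal_pdf(z, mean=mu, std=sigma)   # p(z|x): density of z = sigma*u + mu
```

Both sides give the same density value, as Eq. (18) asserts.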
And (17) becomes

(19)  L = E_{x∼p̃(x)} E_{u∼q(u)} [ log q(u) − log |det ∂F_x(u)/∂u| − log q(x|F_x(u)) − log q(F_x(u)) ]

which is exactly (6).
We also validate our model on 128x128 CelebA HQ; the results are shown below. We trained it on a single GTX 1060, costing only about 1.5 days (150 epochs).
Samples from the 128x128 model at temperature (standard deviation of the prior distribution) 0.8