With the development of deep learning [23], the capacity of neural networks has become sufficiently powerful to approximate highly nonlinear functions and to model complex distributions for real-world problems. Among the new families of algorithms that deep learning has enabled, deep generative models play an increasingly important role in tackling challenges across scientific disciplines: high-quality image generation [15, 47, 17, 18, 4], text-to-speech transformation [41, 43], information retrieval [44], 3D rendering [45, 9], signal-to-image acquisition [46], molecular and material science [5, 34], particle physics [31], etc.
Overall, generative models fall into four categories: the autoencoder and its most important variant, the Variational Auto-Encoder (VAE) [21], auto-regressive models [42, 41], the Generative Adversarial Network (GAN) [10], and normalizing flows (NF) [38, 37, 32]. To make our algorithm LIA easy to understand, we begin with an introduction to generative models from a dimensional view. Formally, let $x$ denote a data point in the $d_x$-dimensional observable space $\mathbb{R}^{d_x}$ and $y$ be its corresponding low-dimensional representation in the feature space $\mathbb{R}^{d_y}$. The general formulation of dimensionality reduction is
$$f: x \mapsto y,$$
where $f$ is the mapping function and $d_y \ll d_x$. Manifold learning is devoted to acquiring $f$ under various constraints on $y$ [39, 33]. However, the sparsity of data points in high-dimensional spaces sometimes leaves too little data for training algorithms or for characterizing the diversity of data patterns, thus necessitating research on the opposite mapping from $y$ to $x$, i.e.
$$g: y \mapsto x,$$
where $g$ is the opposite mapping function with respect to $f$. In this scenario, a convention is that the variables in low-dimensional spaces are endowed with simple prior distributions such as uniform, Gaussian, or logistic distributions. To mark this characteristic, we let $z$ represent the low-dimensional vector obeying such probabilistic distributions. Thus we can write
$$g: z \mapsto x.$$
The task of establishing the dual maps $f$ and $g$ is the central mission of generative models in machine learning, and it is the topic that we engage with in this paper. To help understand these concepts, we illustrate the mapping processes of the Swiss roll manifold, including its embedding and generation, in Figure 1.
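The dual maps can be made concrete on the Swiss roll. The following is a minimal sketch, not the setup used for Figure 1: the parameterization, the uniform prior ranges, and the function names are all illustrative assumptions. The generation map sends a 2-D coordinate to a 3-D point, and the embedding map recovers the coordinate.

```python
import math
import random

def swiss_roll_point(t, h):
    """Generation map (assumed form): 2-D coordinate (t, h) -> 3-D point."""
    return (t * math.cos(t), h, t * math.sin(t))

def unroll(point):
    """Embedding map: recover (t, h); t is the radius in the x-z plane (t > 0)."""
    x, h, z = point
    return (math.hypot(x, z), h)

def sample_swiss_roll(n, seed=0):
    """Sample n points by drawing (t, h) from a simple uniform prior."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        t = rng.uniform(1.5 * math.pi, 4.5 * math.pi)  # illustrative range
        h = rng.uniform(0.0, 10.0)                     # illustrative range
        points.append(swiss_roll_point(t, h))
    return points
```

For this toy manifold both maps are known in closed form; the point of a generative model is to learn such maps when they are not.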
The mappings of AE, VAE, NF, and GAN are compared in Table 1. In the parlance of probability, the process of $x \mapsto z$ is called inference, and the reverse procedure $z \mapsto x$ is called sampling or generation. VAE is capable of carrying out inference and generation in one framework with two collaborating functional modules, but it suffers from blurry generative results. Besides, posterior collapse frequently occurs for complex decoders [3, 20]. GAN is able to yield photo-realistic results [17, 18]. However, a critical limitation is that performing inference is challenging for GAN due to the absence of an encoder. Normalizing flows can perform exact inference and generation with one architecture in one pass by virtue of invertible networks, and the generated results are nearly photo-realistic as well [19]. But the principle of reversible normalizing flows requires that the dimension of the data space be identical to that of the latent space, which poses computational issues due to the high complexity of learning deep flows and computing Jacobians.
Inspired by GANs [10, 47, 18] and the success of normalizing flows [20, 19], we develop a new algorithm called the Latently Invertible Autoencoder (LIA). An invertible network is symmetrically harnessed to laterally bridge the encoder and the decoder of a VAE (Figure 1f). To promote photo-realistic generation, adversarial learning is performed via a discriminator, and a feature extractor is exploited to measure reconstruction. We summarize the key contributions as follows.
The symmetric application of the invertible network brings two benefits for LIA. The prior distribution can be exactly fitted from an unfolded feature space, thus significantly simplifying the inference problem. Besides, the latent space can be detached from the LIA framework, resulting in a simple autoencoder that can be easily trained without likelihood optimization.
We decompose the LIA framework into a Wasserstein GAN (only a prior needed) and a standard autoencoder without stochastic variables and then perform two-stage adversarial learning. Therefore the whole process of LIA training is deterministic, implying that the model is adversarially learned without posterior probability involved.
Our algorithm is free from posterior collapse, which breaks down the learning process of VAE when the decoder is complex or is followed by additional architectures such as discriminators and classifiers.
We compare LIA with state-of-the-art algorithms on inference (reconstruction) and interpolation tasks. The experiment of style mixing is also conducted. These experimental results verify the superiority of LIA and support the claims made in this paper.
Figure 1: (a) data, (b) feature, (c) prior, (d) feature, (e) generation, (f) neural architecture of LIA.
2 Latently invertible autoencoder
The neural architecture of LIA is designed such that the data distribution can be progressively re-shaped from a complex or manifold-valued one to a given simple prior. We now describe the structural details in the following sub-sections.
2.1 Neural architecture of LIA
The main part of LIA is based on the classic VAE and the realization of the normalizing flow. As shown in Figure 1f, we symmetrically embed an invertible neural network in the latent space of VAE, which gives the mapping process
$$x \xrightarrow{f} y \xrightarrow{\phi} z \xrightarrow{\phi^{-1}} y \xrightarrow{g} \tilde{x},$$
where $\phi = \phi_1 \circ \dots \circ \phi_k$ denotes the deep composite mapping of the normalizing flow and $k$ is its depth. LIA first performs nonlinear dimensionality reduction on the input data $x$ and transforms them into the low-dimensional feature space $\mathbb{R}^{d_y}$. The authors of StyleGAN [18] have analyzed and demonstrated that the attributes of the inputs in the feature space are disentangled, meaning that the role of $f$ in LIA can be regarded as unfolding the data manifold, as illustrated in Figures 1a and 1b. Therefore, Euclidean manipulations such as linear interpolation and vector arithmetic are reliable in the feature space. Then we establish an invertible mapping $\phi$ from the feature $y$ to the latent variable $z$, as opposed to various VAEs that directly map original data to latent variables. The feature can be exactly recovered from $z$ via the invertibility of $\phi$, which is the unique advantage of invertible networks. The recovered feature $y$ is then fed into a partial decoder $g$ to generate the corresponding data $\tilde{x}$. If the maps $\phi$ and $\phi^{-1}$ of the invertible network are bypassed, a standard autoencoder is left, i.e. $x \xrightarrow{f} y \xrightarrow{g} \tilde{x}$. In the full model, the sequential composite maps of the autoencoder and the invertible network collaborate to accomplish the inference $x \mapsto z$ and the sampling $z \mapsto \tilde{x}$.
In general, any invertible network is applicable in the LIA framework. We find in practice that a simple invertible network is sufficiently capable of constructing the mapping from the feature space to the latent space. Let $y = [y_t; y_b]$ and $z = [z_t; z_b]$ be the top and bottom fractions of $y$ and $z$, respectively. Then the simple invertible network [6] can be formed with
$$z_t = y_t, \quad z_b = y_b + \tau(y_t),$$
$$y_t = z_t, \quad y_b = z_b - \tau(z_t),$$
where $\tau$ is the transformation, which can be an arbitrary differentiable function. Alternatively, one can attempt to exploit a more complex invertible network with affine coupling mappings for difficult tasks [7, 19]. As conducted in , we set
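The coupling construction described above (the top fraction is passed through unchanged, and the bottom fraction is shifted by a transformation of the top fraction) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released implementation: the transformation `tau`, the depth of 4, and the reversal between layers are hypothetical choices, and the vector length is assumed to be even.

```python
import math

def tau(v):
    """Hypothetical transformation; it never needs to be inverted."""
    return [math.tanh(2.0 * a) + 0.5 * a * a for a in v]

def split(v):
    """Split a vector of even length into top and bottom halves."""
    assert len(v) % 2 == 0, "this sketch assumes an even dimension"
    half = len(v) // 2
    return v[:half], v[half:]

def coupling_forward(y):
    """z_t = y_t, z_b = y_b + tau(y_t)."""
    y_t, y_b = split(y)
    return y_t + [b + s for b, s in zip(y_b, tau(y_t))]

def coupling_inverse(z):
    """y_t = z_t, y_b = z_b - tau(z_t)."""
    z_t, z_b = split(z)
    return z_t + [b - s for b, s in zip(z_b, tau(z_t))]

def flow_forward(y, depth=4):
    """Stack coupling layers; reverse the vector so both halves get updated."""
    for _ in range(depth):
        y = coupling_forward(y)[::-1]
    return y

def flow_inverse(z, depth=4):
    """Undo flow_forward layer by layer, in the opposite order."""
    for _ in range(depth):
        z = coupling_inverse(z[::-1])
    return z
```

Because the inverse only subtracts what the forward pass added, the feature is recovered up to floating-point rounding, whatever function is used for `tau`.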
2.2 Reconstruction constraint with feature extractor
To guarantee the precision of the reconstruction $\tilde{x}$, the conventional approach for (variational) autoencoders is to use a norm-based distance or the cross entropy directly between $x$ and $\tilde{x}$. Here, we utilize the feature vectors derived from feature extractors such as VGG [35] and ResNet [13]. Let $\epsilon$ denote the feature extractor. Then we can write the loss
$$\mathcal{L}(\epsilon, x, \tilde{x}) = \| \epsilon(x) - \epsilon(\tilde{x}) \|_2,$$
where $\epsilon(x)$ is the feature vector and $\|\cdot\|_2$ signifies the $\ell_2$-norm. The benefit of using features instead of the original data is that feature extractors are able to accommodate more data variations, thus improving the accuracy of the reconstruction measurement. The feasibility of this use of feature extractors is evident in diverse image-to-image translation tasks [15, 47].
It suffices to emphasize that the functionality of $\epsilon$ here is in essence to produce the representations of the input $x$ and the output $\tilde{x}$. It can be attained by supervised or unsupervised learning, meaning that $\epsilon$ can be trained with or without class labels. The acquisition of $\epsilon$ is fairly flexible.
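A feature-space reconstruction measure of this kind can be sketched as follows. The extractor here is a hypothetical hand-written stand-in (mean intensity and mean horizontal contrast), used only so that the example runs without a pretrained network. The use of the Euclidean norm is likewise an assumption of the sketch.

```python
import math

def toy_features(image):
    """Hypothetical stand-in for a pretrained extractor such as VGG.

    `image` is a list of rows, each with at least two pixel values.
    Returns [mean intensity, mean absolute horizontal difference].
    """
    pixels = [p for row in image for p in row]
    diffs = [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]
    return [sum(pixels) / len(pixels), sum(diffs) / len(diffs)]

def feature_loss(x, x_rec, extractor=toy_features):
    """Euclidean distance between the feature vectors of x and x_rec."""
    fx, fr = extractor(x), extractor(x_rec)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fx, fr)))
```

For example, a checkerboard `[[0, 1], [1, 0]]` and a flat image `[[0.5, 0.5], [0.5, 0.5]]` have the same mean intensity, yet the loss is non-zero because the contrast feature differs.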
2.3 Adversarial learning with discriminator
Norm-based reconstruction constraints usually incur blurriness of generated images in autoencoder-like architectures. The real reason for this phenomenon is still subtle. An insightful explanation was presented in [24]. However, this problem can be handled via adversarial learning [10]. To do so, a discriminator $c$ is employed to balance the loss of the comparison between $x$ and $\tilde{x}$. Using the Wasserstein GAN [1, 12], we can write the optimization objective
where $P_x$ and $P_{\tilde{x}}$ denote the probability distributions of the real data and the generated data, respectively, and $\gamma$ is the hyper-parameter of the regularization. The regularizer is formulated in [29], which is proven more stable for algorithmic convergence. In practice, the sliced Wasserstein distance approximated by Monte Carlo sampling is preferred for the comparison between $x$ and $\tilde{x}$.
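For orientation, one common way of writing a Wasserstein objective with a gradient-based regularizer is sketched below. This is an assumption offered for illustration, not a verified transcription of the objective used here. It uses a discriminator $c$, real and generated distributions $P_x$ and $P_{\tilde{x}}$, and a regularization weight $\gamma$; the sign convention assumes that the discriminator maximizes and the generator side minimizes.

```latex
\min_{\phi,\, g} \; \max_{c} \;
\mathbb{E}_{x \sim P_x}\big[ c(x) \big]
- \mathbb{E}_{\tilde{x} \sim P_{\tilde{x}}}\big[ c(\tilde{x}) \big]
- \frac{\gamma}{2}\, \mathbb{E}_{x \sim P_x}\Big[ \big\| \nabla_x c(x) \big\|_2^2 \Big]
```

In this form the last term penalizes large discriminator gradients on real data, which is the kind of regularization that is reported to stabilize convergence.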
Figure 2: (a) GAN training. (b) Encoder training.
3 Two-stage stochasticity-free training
The conventional fashion of training a deep model is end-to-end training of the whole architecture. To backpropagate gradients through random variables for VAE, the reparameterization trick is usually harnessed [21], i.e., $z = \mu + \sigma \odot \mathcal{N}(0, I)$, where $\mu$ is the mean and $\sigma$ the standard deviation. The regularization coupling the prior and the posterior is the Kullback-Leibler divergence, which allows the parameters of the encoder to be optimized by backpropagation. For our framework, however, we find that this learning strategy cannot lead the algorithm to converge to satisfactory optima, even if the optimization of normalizing flows is taken into account. To proceed, we propose a scheme of two-stage stochasticity-free learning. As opposed to the traditional way, we decompose the framework into two sub-frameworks that can each be trained well end-to-end, as displayed in Figure 2.
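For contrast with the stochasticity-free scheme, the reparameterization trick used by VAEs can be sketched in a few lines. This is a generic illustration with a hypothetical function name, not part of LIA: the randomness is drawn from a fixed standard normal, so the sample is a deterministic function of the mean and standard deviation.

```python
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps per coordinate, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

With `sigma` set to zero the output equals `mu` exactly, which is the deterministic limit that the two-stage training operates in by construction.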
3.1 GAN training
GAN algorithms have made breakthrough progress in the past year. ProGAN [17], StyleGAN [18], and BigGAN [4] are capable of generating photo-realistic images of high quality from inputs randomly sampled from some priors. Such GAN models should therefore be able to recover a precise $\tilde{x}$ if we can find the latent variable $z$ for the given $x$. Namely, we may train the associated GAN model separately in the LIA framework. To do so, we single out a standard GAN model for the first stage of training, as displayed in Figure 2a, the diagram of which can be formalized by
$$z \xrightarrow{\phi^{-1}} y \xrightarrow{g} \tilde{x}, \tag{10}$$
where $z$ is directly sampled from a pre-defined prior. According to the principle of the vanilla GAN [10], the optimization objective can be written as
where the superscript denotes that the parameters of the corresponding mappings have already been learned. It is worth noting that the role of the invertible network here is just its transformation invertibility. In contrast to normalizing flows, we do not pose any optimization on the probabilities of $z$ and $y$.
Generally speaking, any GAN architecture can be re-formed into the GAN in (10) by attaching an invertible network to its generator. So, the strategy of building the composite generator with invertible networks is generic for GANs.
3.2 Encoder training
In the LIA architecture, the invertible network is embedded in the latent space in a symmetric fashion, in the sense that $y \xrightarrow{\phi} z \xrightarrow{\phi^{-1}} y$. This unique characteristic of the invertible network allows us to detach it from the LIA framework. Thus we attain a traditional autoencoder without stochastic variables, as shown in Figure 2b. We can write the diagram
$$x \xrightarrow{f} y \xrightarrow{g} \tilde{x}. \tag{12}$$
In practice, the feature extractor is frequently represented by VGG pre-trained on the ImageNet dataset. Therefore, only the parameters of the encoder $f$ need to be learned in (12) after the first-stage GAN training. Writing the optimization gives
where is the hyper-parameter. In fact, the optimization for the architecture in (12) is a well-studied problem in computer vision. It is the backbone framework of various GANs for diverse image processing tasks [15, 47]. For LIA, however, it is much simpler because we only need to learn the partial encoder $f$. This simplicity brought by the two-stage training enables the encoder to converge with faithful inference.
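The essence of the second stage, learning only the encoder against a frozen decoder, can be illustrated with a deliberately tiny example. Everything below is a hypothetical toy: a scalar linear decoder stands in for the stage-one generator, the loss is a plain squared reconstruction error, and the adversarial and feature-extractor terms are omitted.

```python
def decoder(y):
    """Frozen stage-one generator (toy stand-in); never updated below."""
    return 2.0 * y + 1.0

def train_encoder(data, steps=500, lr=0.05):
    """Fit the encoder f(x) = w * x + b by gradient descent on (g(f(x)) - x)^2."""
    w, b = 0.0, 0.0  # the only trainable parameters
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x in data:
            err = decoder(w * x + b) - x   # reconstruction error
            grad_w += 2.0 * err * 2.0 * x  # chain rule; decoder slope is 2
            grad_b += 2.0 * err * 2.0
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b
```

On data such as `[-1.0, -0.5, 0.0, 0.5, 1.0]` the encoder converges to the exact inverse of the decoder, $f(x) = 0.5x - 0.5$, without the decoder changing at all.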
4 Related work
Our LIA algorithm is relevant to the works that aim to solve the inference problem for VAEs with adversarial learning and to formulate encoders for GANs.
The integration of GAN with VAE can be traced back to the work of VAE/GAN [22] and the adversarial and implicit autoencoders [28, 27]. These methods encounter the difficulty of end-to-end training, because the gradients are prone to becoming unstable after passing through the latent space of deep complex architectures [3, 20]. There is an intriguing attempt at training VAE in an adversarial manner, in which the adversarial learning is established between the encoder and the decoder [40, 14]. These approaches confront a trade-off between the two roles of the encoder, which must both perform inference and compare the real and fake distributions, and this is difficult to tune. So we prefer the complete GAN with an indispensable discriminator.
The works most closely related to LIA are the algorithm combining VAE with the inverse autoregressive flow [20] and the f-VAE approach, a VAE with latent variables conditioned by normalizing flow [36]. Both models need to optimize the posterior probability of normalizing flows, which is essentially different from our deterministic optimization, in which there is no optimization loss for the invertible network in LIA. The stochasticity-free training is directly derived from the symmetric design of the invertible network in the latent space, which is the architectural difference from [20] and [36]. There is an alternative attempt at specifying the generator of GAN with a normalizing flow [11]. This approach suffers from high computational complexity in high dimensions.
There is also work pertaining to two-stage training, which combines GAN-based generation and VAE-based inference, such as [26]. There are two key differences between LIA and the inverse generator presented in [26]. The first difference is that the stochasticity of the latent variable cannot be guaranteed without the constraint; the second is that there is only the reconstruction loss based on cross entropy in [26], implying that the blurriness problem still exists. These issues are properly addressed via the symmetry of the invertible network and adversarial learning in LIA.
5 Experiments
In this paper, we instantiate the decoder of LIA with the generator of StyleGAN [18]. The difference is that we use the invertible network to replace the mapping network (MLP) in StyleGAN. The invertible network has 8 layers. The hyper-parameters for the discriminator are and . Based on the TensorFlow code of StyleGAN, we write the LIA code and will make it open source for peer evaluation.
The algorithms we compare against are Glow, the state-of-the-art algorithm in normalizing flows [19], the MSE-based optimization methods [30, 2, 25] (we use the code at https://github.com/Puzer/stylegan-encoder), adversarially learned inference (ALI) [8], and the adversarial generator-encoder (AGE) network [40]. To highlight the necessity of the invertible network, we also train an encoded StyleGAN that replaces the invertible network in LIA with a multi-layer perceptron. The same two-stage training scheme as for LIA is used. The generator and discriminator of the encoded StyleGAN are exactly those of StyleGAN.
All algorithms are tested on the Flickr-Faces-HQ (FFHQ) database (https://github.com/NVlabs/ffhq-dataset), created by the authors of StyleGAN as a benchmark for GANs. FFHQ contains 70,000 high-quality face images. We take the first 65,000 faces as the training set and the remaining 5,000 faces as the reconstruction test set, following the exact order of the dataset. We do not split the dataset by random sampling, so that interested readers can precisely reproduce all the reported results with our open source code and experimental protocol.
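The ordered split can be stated as a short sketch. The function name and the use of integer indices to stand for images are illustrative assumptions; only the counts and the rule of splitting by dataset order come from the protocol above.

```python
def split_in_order(n_total=70000, n_train=65000):
    """Return (train, test) index lists, splitting by dataset order."""
    indices = list(range(n_total))
    return indices[:n_train], indices[n_train:]
```

No random seed is involved, so the split is identical on every run and machine.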
For quantitative metrics of generation quality, we use the Fréchet inception distance (FID), the sliced Wasserstein distance (SWD), and the mean square error (MSE). These three metrics are among the most frequently employed to measure the numerical accuracy of generative algorithms [40, 17, 16, 18]. We directly use the code released by the authors of ProGAN [17]. The probability prior for $z$ is Gaussian.
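To convey the idea behind the sliced Wasserstein distance, a generic Monte Carlo sketch is given below. It is not the released ProGAN evaluation code, which operates on image patches. The function names, the number of projections, and the assumption that both sample sets have the same size are choices made for this illustration.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Average 1-D Wasserstein-1 distance over random projections."""
    assert len(xs) == len(ys), "this sketch assumes equal sample counts"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        d = random_unit_vector(dim, rng)
        px = sorted(sum(a * b for a, b in zip(x, d)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, d)) for y in ys)
        # In 1-D, optimal transport matches sorted samples in order.
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections
```

Each projection reduces the comparison to one dimension, where the distance between two equally sized samples is obtained by sorting.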
Figure 3: Raw FFHQ faces and their reconstructions.
Table 2: Quantitative comparison. Columns: Metric, LIA (ours), ALI, AGE, MSE-based optimization, Encoded StyleGAN.
5.1 Reconstruction
Figure 3 shows the reconstructed faces of all algorithms. LIA clearly outperforms the competing algorithms. The faces reconstructed by ALI and AGE look correct, but the quality is mediocre. The ideas of ALI and AGE are elegant, and their performance may be improved with newer techniques such as progressive growing of the neural architecture or a style-based one. The MSE-based optimization produces facial parts of quality comparable to LIA when the faces are normal, but it fails when the variations of faces become large, for example with long hair, hats, beards, and large poses. Interestingly, the encoded StyleGAN does not succeed in recovering the target faces using the same training strategy as LIA, even though it is capable of generating photo-realistic faces of high quality thanks to the StyleGAN generator. This is evidence that the invertible network plays a crucial role in making the algorithm work. The quantitative accuracy in Table 2 indicates the consistent superiority of LIA. More reconstruction results are shown in the supplementary material.
5.2 Interpolation and style mixing
Examining results interpolated in the latent or feature space is an effective way of studying the capability of generative algorithms to generate images or fit underlying distributions. Here we only compare the three algorithms whose results are visually comparable. As shown in Figure 4, LIA yields smooth interpolation while preserving the photo-realistic effect well. The interpolation quality of the MSE-based optimization depends on its reconstruction performance, because it has a good generator (StyleGAN). The intermediate interpolations from Glow deviate from real faces. The performance of Glow is data-sensitive, and we do not obtain sufficiently good results from Glow on FFHQ. More results are available in the supplementary material.
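Linear interpolation between two inferred codes can be sketched as follows. The function names and the number of steps are illustrative, and the codes are plain lists standing in for the inferred features of two faces. Each intermediate code would then be passed through the decoder.

```python
def lerp(a, b, t):
    """Linear interpolation between two codes, with t in [0, 1]."""
    return [(1.0 - t) * u + t * v for u, v in zip(a, b)]

def interpolation_path(code_a, code_b, steps=5):
    """Evenly spaced codes from code_a to code_b, endpoints included."""
    return [lerp(code_a, code_b, i / (steps - 1)) for i in range(steps)]
```

The path starts exactly at the first code and ends exactly at the second, so the endpoints of an interpolation sequence are the reconstructions themselves.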
To further demonstrate the capability of our algorithm, we perform style mixing using a small set of reconstructed faces. Style mixing is conducted using the same approach presented in [18]. The difference is that LIA uses real faces, thanks to its encoding capability. Figure 5 illustrates that our algorithm can infer accurate latent codes and generate high-quality mixed faces for such cases. Manipulating the latent feature of a given sample enables various tasks such as data augmentation with LIA, which is useful for supervised learning from few samples. More results are provided in the supplementary material.
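The style-mixing operation on inferred codes can be sketched at the level of per-layer style inputs. This is a schematic assumption rather than the released code: each face is represented by a list of per-layer codes (18 layers is assumed here, as in a 1024x1024 style-based generator), and the crossover index is a hypothetical parameter.

```python
def mix_styles(styles_a, styles_b, crossover):
    """Take per-layer codes from A before `crossover` and from B afterwards."""
    assert len(styles_a) == len(styles_b)
    assert 0 <= crossover <= len(styles_a)
    return styles_a[:crossover] + styles_b[crossover:]

# Usage with placeholder codes for two faces (labels instead of vectors).
N_LAYERS = 18  # assumed number of style inputs
face_a = [("a", i) for i in range(N_LAYERS)]
face_b = [("b", i) for i in range(N_LAYERS)]
mixed = mix_styles(face_a, face_b, crossover=4)
```

A small crossover index keeps only the coarse layers of the first face, while a large one keeps nearly all of them.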
6 Conclusion
A new generative algorithm, named Latently Invertible Autoencoder (LIA), has been proposed for generating photo-realistic images from a probability prior and simultaneously inferring accurate latent codes and features for given samples. The core idea of LIA is to symmetrically embed an invertible network in an autoencoder. The neural architecture is then trained with adversarial learning as two decomposed GAN frameworks. Put simply, LIA can be viewed as a GAN framework equipped with an encoder, so LIA is applicable to any GAN algorithm. In this paper, we instantiate LIA with the state-of-the-art StyleGAN. The effectiveness of LIA is validated with experiments on reconstruction (inference), interpolation, and style mixing.
With the accurate inference of LIA, the ability of GAN-based models may be promoted in various scenarios, e.g. image editing, data augmentation, few-shot learning, and 3D graphics. We encourage interested readers to explore these applications.
References
[1] Arjovsky, M., Chintala, S. & Bottou, L. (2017) Wasserstein GAN. In arXiv:1701.07875.
[2] Berthelot, D., Schumm, T. & Metz, L. (2017) BEGAN: Boundary equilibrium generative adversarial networks. In arXiv:1703.10717.
[3] Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R. & Bengio, S. (2015) Generating sentences from a continuous space. In arXiv:1511.06349.
[4] Brock, A., Donahue, J. & Simonyan, K. (2018) Large scale GAN training for high fidelity natural image synthesis. In arXiv:1809.11096.
[5] Butler, K.T., Davies, D.W., Cartwright, H., Isayev, O. & Walsh, A. (2018) Machine learning for molecular and materials science. Nature 559:547-555.
[6] Dinh, L., Krueger, D. & Bengio, Y. (2015) NICE: Non-linear independent components estimation. In International Conference on Learning Representations (ICLR).
[7] Dinh, L., Sohl-Dickstein, J. & Bengio, S. (2017) Density estimation using Real NVP. In International Conference on Learning Representations (ICLR).
[8] Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M. & Courville, A. (2017) Adversarially learned inference. In International Conference on Learning Representations (ICLR).
[9] Eslami, S.M.A., Rezende, D.J., Besse, F., Viola, F., Morcos, A.S., Garnelo, M., Ruderman, A., Rusu, A.A., Danihelka, I., Gregor, K., Reichert, D.P., Buesing, L., Weber, T., Vinyals, O., Rosenbaum, D., Rabinowitz, N., King, H., Hillier, C., Botvinick, M., Wierstra, D., Kavukcuoglu, K. & Hassabis, D. (2018) Neural scene representation and rendering. Science 360(6394):1204-1210.
[10] Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014) Generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS).
[11] Grover, A., Dhar, M. & Ermon, S. (2017) Flow-GAN: Combining maximum likelihood and adversarial learning in generative models. In arXiv:1705.08868.
[12] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. (2017) Improved training of Wasserstein GANs. In arXiv:1704.00028.
[13] He, K.M., Zhang, X.Y., Ren, S.J. & Sun, J. (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Heljakka, A., Solin, A. & Kannala, J. (2018) Pioneer Networks: Progressively growing generative autoencoder. In arXiv:1807.03026.
[15] Isola, P., Zhu, J.Y., Zhou, T. & Efros, A.A. (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Donahue, J., Krähenbühl, P. & Darrell, T. (2017) Adversarial feature learning. In International Conference on Learning Representations (ICLR).
[17] Karras, T., Aila, T., Laine, S. & Lehtinen, J. (2018) Progressive growing of GANs for improved quality, stability, and variation. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
[18] Karras, T., Laine, S. & Aila, T. (2018) A style-based generator architecture for generative adversarial networks. In arXiv:1812.04948.
[19] Kingma, D.P. & Dhariwal, P. (2018) Glow: Generative flow with invertible 1x1 convolutions. In arXiv:1807.03039.
[20] Kingma, D.P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I. & Welling, M. (2016) Improving variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems (NeurIPS).
[21] Kingma, D.P. & Welling, M. (2013) Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).
[22] Larsen, A.B.L., Sønderby, S.K., Larochelle, H. & Winther, O. (2016) Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning (ICML).
[23] LeCun, Y., Bengio, Y. & Hinton, G. (2015) Deep learning. Nature 521(7553):436-444.
[24] Lehtinen, J., Munkberg, J., Hasselgren, J., Laine, S., Karras, T., Aittala, M. & Aila, T. (2018) Noise2Noise: Learning image restoration without clean data. In Proceedings of the 35th International Conference on Machine Learning (ICML).
[25] Lipton, Z.C. & Tripathi, S. (2017) Precise recovery of latent vectors from generative adversarial networks. In International Conference on Learning Representations (ICLR).
[26] Luo, J., Xu, Y., Tang, C. & Lv, J. (2017) Learning inverse mapping by autoencoder based generative adversarial nets. In arXiv:1703.10094.
[27] Makhzani, A. (2018) Implicit autoencoders. In arXiv:1805.09804.
[28] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I. & Frey, B. (2015) Adversarial autoencoders. In arXiv:1511.05644.
[29] Mescheder, L., Geiger, A. & Nowozin, S. (2018) Which training methods for GANs do actually converge? In arXiv:1801.04406.
[30] Radford, A., Metz, L. & Chintala, S. (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR).
[31] Radovic, A., Williams, M., Rousseau, D., Kagan, M., Bonacorsi, D., Himmel, A., Aurisano, A., Terao, K. & Wongjirad, T. (2018) Machine learning at the energy and intensity frontiers of particle physics. Nature 560:41-48.
[32] Rezende, D.J. & Mohamed, S. (2015) Variational inference with normalizing flows. In International Conference on Machine Learning (ICML).
[33] Roweis, S.T. & Saul, L.K. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323-2326.
[34] Sanchez-Lengeling, B. & Aspuru-Guzik, A. (2018) Inverse molecular design using machine learning: Generative models for matter engineering. Science 361(6400):360-365.
[35] Simonyan, K. & Zisserman, A. (2014) Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
[36] Su, J. & Wu, G. (2018) f-VAEs: Improve VAEs with conditional flows. In arXiv:1809.05861.
[37] Tabak, E.G. & Turner, C.V. (2013) A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66(2):145-164.
[38] Tabak, E.G. & Vanden-Eijnden, E. (2010) Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences 8(1):217-233.
[39] Tenenbaum, J.B., Silva, V.D. & Langford, J.C. (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319-2323.
[40] Ulyanov, D., Vedaldi, A. & Lempitsky, V. (2017) It takes (only) two: Adversarial generator-encoder networks. In arXiv:1704.02304.
[41] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. & Kavukcuoglu, K. (2016) WaveNet: A generative model for raw audio. In arXiv:1609.03499.
[42] van den Oord, A., Kalchbrenner, N. & Kavukcuoglu, K. (2016) Pixel recurrent neural networks. In arXiv:1601.06759.
[43] van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G.v.d., Lockhart, E., Cobo, L.C., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D. & Hassabis, D. (2017) Parallel WaveNet: Fast high-fidelity speech synthesis. In arXiv:1711.10433.
[44] Wang, J., Yu, L., Zhang, W., Gong, Y., Xu, Y., Wang, B., Zhang, P. & Zhang, D. (2017) IRGAN: A minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[45] Wu, J., Zhang, C., Xue, T., Freeman, W.T. & Tenenbaum, J. (2016) Learning a probabilistic latent space of object shapes via 3D generative adversarial modeling. In Advances in Neural Information Processing Systems (NeurIPS).
[46] Zhu, B., Liu, J.Z., Cauley, S.F., Rosen, B.R. & Rosen, M.S. (2018) Image reconstruction by domain-transform manifold learning. Nature 555:487-492.
[47] Zhu, J.Y., Park, T., Isola, P. & Efros, A.A. (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision (ICCV).