1 Introduction
Generative models have become extremely popular in the machine learning and computer vision communities. Two main actors currently prevail in this scenario: Variational Autoencoders (VAEs) [24] and, especially, Generative Adversarial Networks (GANs) [16]. In this paper we focus on a different family, the so-called flow-based generative models [12], which remain in the shadow of VAEs and GANs despite offering very appealing properties. Compared to other generative methods, flow-based models build upon a sequence of reversible mappings between the input and latent space that allow for (1) exact latent-variable inference and log-likelihood evaluation, (2) efficient and parallelizable inference and synthesis, and (3) useful and simple data manipulation by operating directly on the latent space.
The main contribution of this paper is a novel approach to condition normalizing flows, making it possible to perform multi-modality transfer tasks which have so far not been explored under the umbrella of flow-based generative models. For this purpose, we introduce CFlow, a framework consisting of two parallel flow branches, interconnected across their reversible functions using conditional coupling layers and trained with an invertible cycle consistency. This scheme allows guiding a source domain towards a target domain while preserving the aforementioned properties of flow-based models. Conditional inference is then implemented in a simple manner, by (exactly) embedding the source sample into its latent space, sampling a point from a Gaussian prior, and then propagating them through the learned normalizing flow. For example, when synthesizing multiple plausible photos given a semantic segmentation mask, each image is generated by jointly propagating the segmentation embedding and a random point drawn from a prior distribution across the learned flow.
Our second contribution is a strategy to enable flow-based methods to model unordered 3D point clouds. Specifically, we introduce (1) a reordering of the 3D data points according to a Hilbert sorting scheme, (2) a global feature operation compatible with the reversible scheme, and (3) an invertible cycle consistency that penalizes the Chamfer distance. Combining this strategy with the proposed conditional scheme, we can then address tasks such as shape interpolation, 3D object reconstruction from an image, and rendering an image given a 3D point cloud (see the teaser figure).
Importantly, our new conditioning scheme enables a wide range of tasks beyond 3D point cloud modeling. In particular, ours is the first flow-based model to show mappings between a large diversity of domains, including image-to-image, point-cloud-to-image, edges-to-image, segmentation-to-image and their inverse mappings. We are also the first to demonstrate applications in image content manipulation and style transfer.
We believe our conditioning scheme, and its ability to deal with a variety of domains, opens the door to building general-purpose and easy-to-train solutions. We hope all this will spur future research in the domain of flow-based generative models.
2 Related Work
Flow-Based Generative Models.
Variational Autoencoders (VAEs) [24] and Generative Adversarial Networks (GANs) [16] are the most studied deep generative models so far. VAEs use deep networks as function approximators and maximize a lower bound on the data log-likelihood to model a continuous latent variable with intractable posterior distribution [41, 46, 27]. GANs, on the other hand, circumvent the need for dealing with likelihood estimation by leveraging an adversarial strategy. While GANs’ versatility has made possible advances in many applications [22, 36, 59, 6, 1], their training is unstable [40] and requires careful hyper-parameter tuning.
Flow-based generative models [12, 42] have received little attention compared to GANs and VAEs, despite offering very attractive properties such as the ability to estimate exact log-likelihood, efficient synthesis and exact latent-variable inference. Further advances have been proposed in RealNVP [13] by introducing the affine coupling layers and in Glow [25], through an architecture with 1x1 invertible convolutions for image generation and editing. These works have later been applied to audio generation [35, 23, 55, 44], image modeling [48, 18, 8] and video prediction [26].
Some recent works have proposed strategies for conditioning normalizing flows by combining them with other generative models. For instance, [28, 18] combine flows with GANs. These models, however, are more difficult to train as adversarial losses tend to introduce instabilities. Similarly, for the specific application of video prediction, [26] enforces an autoregressive model onto the past latent variables to predict them in the future. DualGlow [48] uses a conditioning scheme for MRI-to-PET brain scan mapping by concatenating the prior distribution of the source image with the latent variables of the target image.
In this paper, we introduce a novel mechanism to condition flow-based generative models by enforcing a source-to-target coupling at every transformation step instead of only feeding the source information into the target prior distribution. As we show experimentally, this enables fine-grained control over the modeling process (Sec. 7).
Modeling and reconstruction of 3D Shapes.
The success of deep learning has spurred a large number of discriminative approaches for 3D reconstruction [9, 37, 50, 45, 17]. These techniques, however, only learn direct mappings from input images to output shapes. Generative models, in contrast, capture the actual shape distribution from the training set, making it possible not only to reconstruct shapes from new test images, but also to sample new shapes from the learned distribution. There exist several works along this line. For instance, GANs have been used by Wu et al. [53] to model objects in a voxel representation; Ben-Hamu et al. [4] used them to model body parts; and Pumarola et al. [38] to learn the manifold of geometry images representing clothed 3D bodies. Auto-encoders [11, 47] and VAEs [14, 3, 29, 19] have also been applied to model 3D data. More recently, Park et al. [31] used auto-decoders [5, 15] to represent the surface of a shape with a continuous volumetric field.
PointFlow [56] is the only approach that uses normalizing flows to model 3D data. It learns a generative model for point clouds by first modeling the distribution of object shapes and then applying normalizing flows to model the point cloud distribution for each shape. This strategy, however, cannot condition the shape, preventing PointFlow from being used in applications such as 3D reconstruction and rendering. Also, its inference time is very high, as point clouds are generated one point at a time, while we generate the entire point cloud in one forward pass.
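3 Flow-Based Generative Model
Flow-based generative models aim to approximate an unknown true data distribution $x \sim p^*(x)$ from a limited set of observations $\{x^{(i)}\}_{i=1}^{N}$. The data is modeled by learning an invertible transformation $g_\theta(\cdot)$ mapping to $x$ from a latent space with tractable density $p_\vartheta(z)$:
$$x = g_\theta(z), \qquad z \sim p_\vartheta(z), \tag{1}$$
where $z$ is a latent variable and $p_\vartheta(z)$ is typically a Gaussian distribution $\mathcal{N}(z; 0, I)$. The function $g_\theta$, commonly known as a normalizing flow [42], is bijective, meaning that given a data point $x$ its latent variable $z$ is computed as:
$$z = g_\theta^{-1}(x), \tag{2}$$
where $g_\theta^{-1} = f_K \circ \cdots \circ f_1$ is composed of a sequence of $K$ invertible transformations defining a mapping between $x$ and $z$ such that:
$$x \triangleq h_0 \overset{f_1}{\longleftrightarrow} h_1 \overset{f_2}{\longleftrightarrow} \cdots \overset{f_K}{\longleftrightarrow} h_K \triangleq z, \tag{3}$$
$K$ being a fixed hyper-parameter.
The goal of generative models is to find the parameters $\theta$ such that $p_\theta(x)$ best approximates $p^*(x)$. Explicitly modeling such a probability density function is usually intractable, but using the normalizing flow mapping of Eq. (1) under the change of variables theorem, we can compute the exact log-likelihood for a given data point $x$ as:
$$\log p_\theta(x) = \log p_\vartheta(z) + \log \left| \det \frac{\partial z}{\partial x} \right| \tag{4}$$
$$\phantom{\log p_\theta(x)} = \log p_\vartheta(z) + \sum_{i=1}^{K} \log \left| \det \frac{\partial h_i}{\partial h_{i-1}} \right|, \tag{5}$$
where $\partial h_i / \partial h_{i-1}$ is the Jacobian matrix of $f_i$ at $h_{i-1}$ and the Jacobian determinant measures the change of log-density made by $f_i$ when transforming $h_{i-1}$ to $h_i$. Since we can now compute the exact log-likelihood, the training criterion of a flow-based generative model is directly the negative log-likelihood over the observations $\{x^{(i)}\}_{i=1}^{N}$. Note that optimizing the actual log-likelihood of the observations is more stable and informative than optimizing a lower bound of the log-likelihood, as in VAEs, or minimizing an adversarial loss, as in GANs. This is one of the major virtues of flow-based approaches.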
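The change-of-variables computation of Eqs. (4) and (5) can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the flow is a stack of element-wise affine steps (not the coupling layers used in this paper), the prior is a standard Gaussian, and the names `standard_normal_logpdf` and `affine_flow_logpdf` are hypothetical. It is written in plain Python for readability; a practical implementation would be vectorized.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of a standard Gaussian N(0, I) at the vector z."""
    return sum(-0.5 * (v * v + math.log(2.0 * math.pi)) for v in z)


def affine_flow_logpdf(x, steps):
    """Exact log p(x) under a toy flow of element-wise affine steps.

    `steps` is a list of (scale, shift) pairs, one pair of vectors per step.
    Each step maps h_{i-1} to h_i = scale * h_{i-1} + shift (data -> latent).
    Its Jacobian is diagonal, so log|det| is the sum of log|scale|.
    """
    h = list(x)
    log_det = 0.0
    for scale, shift in steps:
        h = [s * v + b for s, v, b in zip(scale, h, shift)]
        log_det += sum(math.log(abs(s)) for s in scale)
    # Prior log-density of the latent plus the accumulated log-determinants.
    return standard_normal_logpdf(h) + log_det
```

With a single step of scale 1/sigma and shift -mu/sigma, this reproduces the closed-form log-density of a Gaussian with mean mu and standard deviation sigma, which is a convenient sanity check for the sign of the log-determinant term.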
4 Conditional Flow-Based Generative Model
Given two input tensors and , the proposed conditional coupling layer transforms the second half of conditioned on the first halves of and . The first halves of all tensors are not updated. By sequentially concatenating these bijective operations we can transform data points into their latent representation (forward propagation) and vice versa (backward propagation).
Let us define a true data distribution . Our goal is to learn a model for to map sample points from domain to domain . For example, for the application of 3D reconstruction from a single view, would be an image and a 3D point cloud. To this end, we propose a conditional flow-based generative model extending the architectures of [13, 25]. Our L-level model learns both distributions with two bijective transformations and (Fig. 1):
(6)  
(7)  
(8) 
where and are latent variables, and and are tractable spherical multivariate Gaussian distributions with learnable mean and variance.
We then define the mapping to sample conditioned on , as a three-step operation:
(9)  
(10)  
generate cond. on  (11) 
In the following subsections we describe how this conditional framework is implemented. Sec. 4.1 discusses the foundations of the conditional coupling layer we propose to map source to target data using invertible functions, and how its Jacobian is computed. Sec. 4.2 describes the architecture we define for the practical implementation of the coupling layers. Sec. 4.3 presents an invertible cycle consistency loss introduced to further stabilize the training process. Finally, in Sec. 4.4 we define the total training loss.
4.1 Conditional Coupling Layer
When designing the conditional coupling layer we need to fulfill the constraint that each transformation has to be bijective and tractable. As shown in [12, 13], both requirements can be met by choosing transformations with a triangular Jacobian. In this case the determinant is the product of the diagonal terms, which makes the computation tractable and ensures invertibility. Motivated by these works, we propose an extension of their coupling layer to account for cross-domain conditioning. A schematic of the proposed layers is shown in Fig. 2. Formally, let us define and . We then write the invertible function to transform a data point based on as follows:
where is the number of channel dimensions in both data points, denotes element-wise multiplication and and are the scale and translation functions from . We set in all experiments.
The inverse of the conditional coupling layer is:
(12) 
and its Jacobian:
where is an identity matrix. Since the Jacobian is a triangular matrix, its determinant can be calculated efficiently as the product of the diagonal elements. Note that it is not required to compute the Jacobian of the functions and , enabling them to be arbitrarily complex. In practice, we implement these functions using a convolutional neural network that returns both and .
4.2 Coupling Network Architecture
We next describe the architecture of and used to regress the affine transform applied at every conditional coupling layer at each and respectively. We build upon the stack of three 2D convolution layers proposed by [25]. The first two layers have a filter size of and with output channels followed by actnorm [25] and a ReLU activation. The third layer regresses the final scale and translation by applying a 2D convolutional layer with filter size initialized with zeros such that each affine transformation at the beginning of training is equivalent to an identity function.
For the transformation we use exactly this architecture, but for we extend it to take into account the conditioning . Concretely, in , is initially transformed by two convolution layers, like the first two of . Then, is adapted with a channel-wise affine transform implemented by a convolution. Finally, its output is added to the transformed . To ensure a similar contribution of and their activations are normalized with actnorm so that they operate in the same range. A final convolution regresses the conditional coupling layer operators and .
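To make the mechanics of the conditional coupling layer of Sec. 4.1 concrete, the following is a minimal sketch under several assumptions: it follows the standard affine coupling form of [13] with the source tensor added as an extra input to the scale and translation functions; tensors are flat lists of channel values rather than feature maps; and `toy_scale_and_shift` is a hypothetical stand-in for the convolutional coupling network described above. All function names are illustrative and are not taken from the authors' implementation.

```python
import math


def toy_scale_and_shift(cond_a, cond_b, out_dim):
    """Stand-in for the coupling network (assumption, not the paper's CNN).

    Any deterministic function works here, because the layer never has to
    invert it: it is only evaluated on the halves that are left unchanged.
    """
    mix = sum(cond_a) - 0.5 * sum(cond_b)
    log_scale = [math.tanh(0.3 * mix + 0.1 * j) for j in range(out_dim)]
    shift = [0.2 * mix - 0.05 * j for j in range(out_dim)]
    return log_scale, shift


def conditional_coupling_forward(x_a, x_b):
    """Transform the second half of x_b given the first halves of x_a and x_b."""
    d = len(x_b) // 2
    log_scale, shift = toy_scale_and_shift(x_a[:d], x_b[:d], len(x_b) - d)
    y_tail = [v * math.exp(s) + t for v, s, t in zip(x_b[d:], log_scale, shift)]
    # Triangular Jacobian: the log-determinant is the sum of the log-scales.
    log_det = sum(log_scale)
    return x_b[:d] + y_tail, log_det


def conditional_coupling_inverse(x_a, y_b):
    """Recover x_b; the untouched first halves reproduce the same scale and shift."""
    d = len(y_b) // 2
    log_scale, shift = toy_scale_and_shift(x_a[:d], y_b[:d], len(y_b) - d)
    x_tail = [(v - t) * math.exp(-s) for v, s, t in zip(y_b[d:], log_scale, shift)]
    return y_b[:d] + x_tail
```

Because the first halves pass through unchanged, the inverse can recompute exactly the same scale and translation, which is what keeps the layer bijective regardless of how complex the coupling network is.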
4.3 Invertible Cycle Consistency
We train our model to maximize the log-likelihood of the training dataset. However, as in GAN training [33, 21], we found it beneficial to add a loss encouraging the generated and real samples to be similar in L1. To do so, we exploit the fact that our model is made of bijective transformations, and introduce what we call an invertible cycle consistency. This operation can be summarized as follows:
(13) 
Concretely, the data observations (, ) are initially mapped into their latent variables (, ), where each variable is composed of an L-level stack. As demonstrated in [13], the first levels encode the high frequencies (details) in the data, and the last levels the low frequencies.
We then resample the first dimensions of from a Gaussian distribution, i.e. . By doing this, is only retaining the lowest frequencies of the original .
As a final step, we invert , to recover and penalize its L1 difference w.r.t. the original . In essence, we force the model to use information from the condition so that the recovered sample is as similar as possible to the original . Note that if we reconstructed from the entire latent variable, the recovered sample would be identical to the original because is bijective, and this loss would be meaningless.
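The three steps of the invertible cycle consistency (encode, partially resample the latent, decode and compare in L1) can be sketched as follows. This is a toy illustration only: `encode_b` and `decode_b` are a hypothetical, parameter-free conditional bijection standing in for the learned flow, so nothing is optimized here; the sketch merely shows where the loss is computed and why it vanishes when no latent dimension is resampled.

```python
import random


def encode_b(x_b, z_a):
    """Toy conditional bijection (assumption): z_b = (x_b - 0.5 * z_a) / 2."""
    return [(xb - 0.5 * za) / 2.0 for xb, za in zip(x_b, z_a)]


def decode_b(z_b, z_a):
    """Exact inverse of encode_b for the same condition z_a."""
    return [2.0 * zb + 0.5 * za for zb, za in zip(z_b, z_a)]


def invertible_cycle_loss(x_b, z_a, num_resampled, rng):
    """Mean L1 gap between x_b and its reconstruction after resampling part of z_b."""
    z_b = encode_b(x_b, z_a)
    # Replace the first `num_resampled` latent dimensions with Gaussian noise
    # and keep the remaining ones, as described in the text.
    noise = [rng.gauss(0.0, 1.0) for _ in range(num_resampled)]
    z_tilde = noise + z_b[num_resampled:]
    x_tilde = decode_b(z_tilde, z_a)
    return sum(abs(u - v) for u, v in zip(x_tilde, x_b)) / len(x_b)
```

Calling `invertible_cycle_loss(x_b, z_a, 0, rng)` returns (numerically) zero because the map is bijective, which mirrors the remark that the loss would be meaningless if the entire latent variable were kept.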
4.4 Total Loss
Formally, denoting the training pairs of observations as , the model parameters are learned by minimizing the following loss function:
(14) 
The first term maximizes the joint likelihood of the data observations. With our design, it also maximizes the conditional likelihood of and thus forces the model to learn the desired mapping. To show this, we apply the chain rule of probability and factor it into:
(15) 
Due to the diagonal structure of the Jacobians, the marginal likelihood of depends only on (first sum), while the conditional of – only on . Maximizing the joint likelihood thus maximizes both likelihoods independently.
5 Modeling Unordered 3D Point Clouds
The model described so far can handle input data represented on regular grids but it fails to model unordered 3D point clouds, whose lack of spatial neighborhood ordering prevents convolutions from being applied. To process point clouds with deep networks, a common practice is to apply symmetry operations [39] that create fixed-size tensors of global features describing the entire point cloud sample. These operations require extracting point-independent features followed by a max-pool, which is not invertible and hence not applicable to normalizing flows. Another alternative would be graph convolutional networks [54], although their high computational cost makes them unsuitable for our scheme of multiple coupling layers. We propose a three-step mechanism to enable modeling 3D point clouds:
(i) Approximate Sorting with Space-Filling Curves. CFlow is based on convolutional layers, which require input data with a local neighborhood consistent across samples. To fulfill this condition on unordered point clouds, we propose to sort them based on proximity. As discussed in [39], for high-dimensional spaces it is not possible to produce a perfect ordering stable to point perturbations. In this paper we therefore use the approximation provided by the Hilbert space-filling curve algorithm [20]. For each training sample, we project its points onto a 3D Hilbert curve and reorder them based on their position along the curve (Fig. 3). Notice that we establish not only a neighborhood relationship but also a semantically stable ordering (e.g. in Fig. 3 the chair’s right leg is always blue). To the best of our knowledge there is no previous work using such preprocessing for point clouds.
(ii) Approximating Global Features. Hilbert Sort is not sufficient to model 3D data because of a major issue: it splits the space into equally sized quadrants and the Hilbert curve will cover all points in a quadrant before moving to the next. As a consequence, two points that were originally close in space, but lie near the boundaries of two different quadrants, will end up far away in the final ordering. To mitigate this effect we extend the proposed coupling network architecture (Sec. 4.2) with an approximate but invertible version of the global features proposed in [39] that describe the whole point cloud. Concretely, we first resample and reshape the reordered point cloud to form matrices (in practice we use the same size as that of the images). Then we approximate the global descriptors of [39] through a convolution to extract point-independent features followed by a max-pool applied only over the first half of the point cloud features (Fig. 4). The coupling layer remains bijective because during the backward propagation the approximated global features can be recovered using a similar strategy as in Eq. (12).
(iii) Symmetric Chamfer Distance for Cycle Consistency. For the specific case of point clouds, we observed that when penalizing the invertible cycle consistency with L1 the model converged to a mean Hilbert curve. Therefore, for point clouds, we substitute L1 by the symmetric Chamfer distance , which computes the mean Euclidean distance between the ground-truth point cloud and the recovered .
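As an illustration of step (i), the sketch below orders 3D points along a Hilbert curve. It is a sketch under assumptions, not the authors' preprocessing: the Hilbert index is computed with Skilling's axes-to-transpose transform (the paper does not state which implementation is used), points are quantized onto a grid with `bits` bits per axis (a hypothetical parameter), and the function names `hilbert_index` and `hilbert_sort` are illustrative.

```python
def hilbert_index(cell, bits):
    """Position of an integer grid cell along the Hilbert curve.

    `cell` holds one non-negative integer per axis, each below 2**bits.
    """
    x = list(cell)
    n = len(x)
    top = 1 << (bits - 1)
    q = top
    while q > 1:  # undo the rotations and reflections of each sub-cube
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p
            else:
                t = (x[0] ^ x[i]) & p
                x[0] ^= t
                x[i] ^= t
        q >>= 1
    for i in range(1, n):  # Gray encode
        x[i] ^= x[i - 1]
    t = 0
    q = top
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    x = [v ^ t for v in x]
    index = 0  # interleave the bits of all axes, most significant first
    for b in range(bits - 1, -1, -1):
        for v in x:
            index = (index << 1) | ((v >> b) & 1)
    return index


def hilbert_sort(points, bits=5):
    """Reorder 3D points by the Hilbert index of their quantized coordinates."""
    lo = [min(p[k] for p in points) for k in range(3)]
    hi = [max(p[k] for p in points) for k in range(3)]
    span = max(h - l for h, l in zip(hi, lo)) or 1.0
    side = (1 << bits) - 1

    def cell(p):
        # Same scale on every axis, so the shape's aspect ratio is preserved.
        return [min(side, int((p[k] - lo[k]) / span * side)) for k in range(3)]

    return sorted(points, key=lambda p: hilbert_index(cell(p), bits))
```

On a complete grid, consecutive cells along the resulting order are always direct neighbors, which is the locality property that makes the sorted point list suitable for convolutions. Points that fall into the same grid cell keep their input order, so a finer grid (larger `bits`) gives a more stable ordering at slightly higher cost.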
6 Implementation Details
Due to memory restrictions, we train with image samples of resolution. For 3D point clouds, to maintain the same architecture as in images, we reshape each point cloud sample (list of points) to . At test time we also regress 3D points per forward pass. Our implementation builds upon that of Glow [25]. We use Adam with learning rate , , and batch size . The multi-scale architecture consists of levels with 12 flow steps per level ( in Eq. (3)) each and squeezing operations. For conditional sampling we found additive coupling () to be more stable during training than affine transformation. The prior distributions and are initialized with mean 0 and variance 1. The rest of the weights are randomly initialized from a normal distribution with mean and std . in Eq. (14). As in previous likelihood-based generative models [32, 25], we observed that sampling from a reduced-temperature prior improves the results. To do so, we multiply the variance of by . The model is trained with 4 P100 GPUs for 10 days.
7 Experimental Evaluation
We next evaluate our system on diverse tasks: (1) modeling point clouds (Sec. 7.1), (2) 3D reconstruction and rendering (Sec. 7.2), (3) image-to-image mapping in a variety of domains and datasets (Sec. 7.3), and (4) image manipulation and style transfer (Sec. 7.4).
7.1 Modeling 3D Point Clouds
We evaluate the potential of our approach to model 3D point clouds on ShapeNet [7]. For this task, we do not consider the full conditioning scheme and only use one of the branches of CFlow in Fig. 1, which we denote as CFlow*.
In our first experiment we study the representation capacity of unknown shapes, formally defined as the ability to retain the information after mapping forward and backward between the original and latent spaces. For this purpose, we first map a real point cloud to the latent space . The full-size embedding has as many dimensions as the input (). Then we progressively remove information from by replacing its leftmost components with samples drawn from a Gaussian distribution, i.e. . Note that the embedding size can be set at test time with no need to retrain, making tasks like point cloud compression straightforward. Finally we map this embedding back to the original point cloud space and compare to .
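Table 1
Method  |  CD at full embedding size  |  CD at progressively smaller embedding sizes
CFlow* (equivalent to Glow [25])  |  0.00  |  0.39  |  0.39  |  0.39
CFlow* + Sort  |  0.00  |  0.19  |  0.21  |  0.22
CFlow* + Sort + GFCoupling  |  0.00  |  0.14  |  0.18  |  0.31
AtlasNetSph. [17]  |  0.75
AtlasNet25 [17]  |  0.37
DeepSDF [31]  |  0.20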
Tab. 1 reports the Chamfer Distance (CD) for different embedding sizes. The plain version of CFlow* (no conditioning, no sorting, no global features) is equivalent to Glow [25]. This version is consistently improved when introducing the sorting and global features strategies (Sec. 5). The error decreases gracefully as we increase the embedding size and, importantly, when using the full-size embedding we obtain a perfect recovery (Fig. 5, top). This is a virtue of bijective models, and is not a trivial property. Tab. 1 also reports the numbers of AtlasNet [17] and DeepSDF [31], showing that our approach achieves competitive results. This comparison is only indicative, as the representation used in these approaches is inherently different ([17] parametric and [31] a continuous surface).
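For reference, a symmetric Chamfer distance of the kind reported above can be sketched in a few lines. This follows one common convention (mean nearest-neighbor Euclidean distance in each direction, summed); the exact normalization used for the reported numbers is not stated here, so treat it as an assumption. The function name `chamfer_distance` is illustrative, and the quadratic-time loop is only meant for small point clouds.

```python
import math


def chamfer_distance(cloud_p, cloud_q):
    """Symmetric Chamfer distance between two point clouds (lists of 3D tuples)."""

    def mean_nearest(src, dst):
        # Average distance from each source point to its closest target point.
        return sum(min(math.dist(a, b) for b in dst) for a in src) / len(src)

    return mean_nearest(cloud_p, cloud_q) + mean_nearest(cloud_q, cloud_p)
```

Unlike an L1 penalty between corresponding entries, this measure does not depend on the order in which the points are listed, which is the property that matters when comparing unordered point clouds.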
Recall that the leftmost components randomly sampled in encode the fine details of the shape. We exploit this property to generate point clouds with an arbitrarily large number of points by performing multiple backward propagations () of a partial embedding (Fig. 5, bottom). Every time we propagate, we recover a new set of 3D points, which progressively improves the density of the reconstruction.
Another task that can be addressed with CFlow is shape interpolation in the latent space (Figure 6).
Table 2
Method  |  Image → PC: CD  |  PC → Image: BPD  |  PC → Image: IS
3DR2N2 [9]  |  0.27  |  -  |  -
PSGN [14]  |  0.26  |  -  |  -
Pix2Mesh [50]  |  0.27  |  -  |  -
AtlasNet [17]  |  0.21  |  -  |  -
ONet [29]  |  0.23  |  -  |  -
CFlow  |  0.86  |  4.38  |  1.80
CFlow + Sort  |  0.52  |  2.77  |  2.41
CFlow + Sort + GFCoupling  |  0.49  |  2.87  |  2.61
CFlow + Sort + GFCoupling + CD  |  0.26  |  -  |  -
7.2 3D Reconstruction & rendering
We next evaluate the ability of CFlow to model the conditional distributions (1) image → point cloud, which enables 3D reconstruction from a single image; and (2) point cloud → image, the inverse problem of rendering an image given a 3D point cloud. The teaser figure shows qualitative results on the Chair class of ShapeNet. In the top row our model is able to generate plausible 3D reconstructions of unknown objects even under strong self-occlusions (top-right example). The second row depicts results for rendering, which highlights another advantage of our model: it allows sampling multiple times from the conditional distribution to produce several images of the same object which exhibit different properties (e.g. viewpoint or texture).
In Table 2 we compare CFlow with other single-image 3D reconstruction methods: 3DR2N2 [9], PSGN [14], Pix2Mesh [50], AtlasNet [17] and ONet [29]. We evaluate 3D reconstruction in terms of the Chamfer distance (CD) to the ground-truth shapes. Our approach (last row) performs on par with [9, 14, 50] and is slightly below the state-of-the-art techniques specifically designed for 3D reconstruction [17, 29].
Table 3
Mapping  |  CFlow: BPD  |  SSIM  |  IS  |  CFlow + cycle: BPD  |  SSIM  |  IS
segmentation → street views  |  3.21  |  0.37  |  1.80  |  3.17  |  0.42  |  1.94
segmentation ← street views  |  3.25  |  0.33  |  2.19  |  3.05  |  0.36  |  2.23
structure → facades  |  3.55  |  0.24  |  1.92  |  3.54  |  0.26  |  1.69
structure ← facades  |  3.55  |  0.31  |  2.05  |  3.55  |  0.30  |  2.01
map → aerial photo  |  3.65  |  0.19  |  1.52  |  3.65  |  0.17  |  1.62
map ← aerial photo  |  3.65  |  0.54  |  1.95  |  3.65  |  0.57  |  1.97
edges → shoes  |  1.70  |  0.66  |  2.40  |  1.68  |  0.67  |  2.43
edges ← shoes  |  1.65  |  0.64  |  1.61  |  1.65  |  0.65  |  1.69
With the same model, we can also render images from point clouds. To the best of our knowledge, no previous work can perform such a mapping. While a few approaches do render point clouds [30, 2, 34], they rely on the strong assumptions of knowing the RGB color per point and the camera calibration needed to project the point cloud onto the image plane. Table 2 also reports an ablation study of the different operations we devised to handle 3D point clouds, namely sorting the point cloud (Sort), approximating global features (GFCoupling) and invertible cycle consistency with Chamfer distance (CD). In this case, evaluation is reported using Inception Score (IS) [43] and Bits Per Dimension (BPD), which is equivalent to the negative log2-likelihood typically used to report the performance of flow-based methods. Results show a performance boost when using each of these components, and especially when combining them.
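Since BPD is simply the negative log-likelihood expressed in base 2 and averaged over the data dimensions, the conversion from a log-likelihood in nats can be sketched as below. The function name and arguments are illustrative; implementations for discretized images often add a constant that accounts for dequantization, which this sketch deliberately leaves out.

```python
import math


def bits_per_dimension(nll_nats, num_dims):
    """Convert a negative log-likelihood in nats into bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))
```

As a sanity check, a model that spreads its mass uniformly over 256 values in each of `num_dims` dimensions has a negative log-likelihood of `num_dims * ln(256)` nats, which this function maps to 8 bits per dimension.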
7.3 Image-to-Image mappings
We evaluate the ability of CFlow to perform multi-domain image-to-image mapping: segmentation ↔ street views trained on Cityscapes [10], structure ↔ facade trained on CMP Facades [49], map ↔ aerial photo trained on [21] and edges ↔ shoes trained on [57, 58, 21]. The examples in Fig. 7, left, show mappings in which the target domain has a wide variance and multiple sampling generates different results (e.g. a semantic segmentation map can map to several grayscale images). The examples on the right have a target domain with a narrower variance, and despite multiple samplings the generated images are very similar (e.g. given an image its segmentation is well defined).
Table 3 reports quantitative evaluations using Structural Similarity (SSIM) [52], and again BPD and IS. When introducing the invertible cycle consistency loss (Sec. 4.3) the model does not improve its compression abilities (BPD) but improves in terms of structural similarity (SSIM) and semantic content (IS). It is worth mentioning that while GANs have shown impressive image-to-image mapping results, even at high resolution [51], ours is the first work that can address such tasks using normalizing flows.
7.4 Other Applications
Finally, we demonstrate the versatility of CFlow, which is the first flow-based method capable of performing style transfer and image content manipulation (Fig. 8). Importantly, the model was not retrained for these specific tasks, and we use the same parameters learned to perform image-to-image mappings (Sec. 7.3). For image manipulation we use the weights of segmentation → street views and for style transfer those of edges → shoes. Formally, let the domain be the structure (e.g. segmentation mask) and the domain be the image (e.g. street view). Then, image manipulation is achieved via three operations:
(16)  
(17)  
synthesise new image  (18) 
Note that following this generation approach we are no longer conditioning based only on , as in Sec. 7.3, and now the synthesised image is jointly conditioned on (for structure) and (for texture).
To perform style transfer, we first transform the content image into its structure . For instance, in Fig. 8, bottom, the content of the shoe is initially mapped onto its edge structure with the shoes → edges weights. Then, we apply the same procedure as for image manipulation using the edges → shoes weights, setting to be the structure of the content image and the style image.
8 Conclusions
We have proposed CFlow, a novel conditioning scheme for normalizing flows. This conditioning, in conjunction with a new strategy to model unordered 3D point clouds, has made it possible to address 3D reconstruction and the rendering of images from point clouds, problems which so far could not be tackled with normalizing flows. Furthermore, we demonstrate CFlow to be a general-purpose model, applicable to many more multi-modality problems, such as image-to-image translation, style transfer and image content editing. To the best of our knowledge, no previous model has demonstrated such adaptability.
References
[1] (2019) Generative adversarial networks for extreme learned image compression. In ICCV.
[2] (2019) Neural point-based graphics. arXiv preprint arXiv:1906.08240.
[3] (2018) Modeling facial geometry using compositional VAEs. In CVPR.
[4] (2018) Multi-chart generative surface modeling. In SIGGRAPH Asia.
[5] (2017) Optimizing the latent space of generative networks. PMLR.
[6] (2019) Fast video object segmentation with spatio-temporal GANs. arXiv preprint arXiv:1903.12161.
[7] (2015) ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012.
[8] (2019) BeautyGlow: on-demand makeup transfer framework with reversible generative network. In CVPR.
[9] (2016) 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In ECCV.
[10] (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR.
[11] (2017) Shape completion using 3D-encoder-predictor CNNs and shape synthesis. In CVPR.
[12] (2014) NICE: non-linear independent components estimation. In ICLR.
[13] (2017) Density estimation using Real NVP. ICLR.
[14] (2017) A point set generation network for 3D object reconstruction from a single image. In CVPR.
[15] (2018) Matrix completion by deep matrix factorization. Neural Networks.
[16] (2014) Generative adversarial nets. In NeurIPS.
[17] (2018) AtlasNet: a papier-mâché approach to learning 3D surface generation. CVPR.
[18] (2019) AlignFlow: cycle consistent learning from multiple domains via normalizing flows. arXiv preprint arXiv:1905.12892.
[19] (2019) Learning single-image 3D reconstruction by generative modelling of shape, pose and shading. IJCV.
[20] (1935) Über die stetige abbildung einer linie auf ein flächenstück. In Dritter Band: Analysis ⋅ Grundlagen der Mathematik ⋅ Physik Verschiedenes: Nebst Einer Lebensgeschichte.
[21] (2016) Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004.
[22] (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
[23] (2018) FloWaveNet: a generative flow for raw audio. ICML.
[24] (2014) Auto-encoding variational Bayes. In ICLR.
[25] (2018) Glow: generative flow with invertible 1x1 convolutions. In NeurIPS.
[26] (2019) VideoFlow: a flow-based generative model for video. arXiv preprint arXiv:1903.01434.
[27] (2017) A generative model of people in clothing. In ICCV.
[28] (2019) Conditional adversarial generative flow for controllable image synthesis. In CVPR.
[29] (2019) Occupancy networks: learning 3D reconstruction in function space. In CVPR.
[30] (2019) Neural rerendering in the wild. In CVPR.
[31] (2019) DeepSDF: learning continuous signed distance functions for shape representation. CVPR.
[32] (2018) Image transformer. arXiv preprint arXiv:1802.05751.
[33] (2016) Context encoders: feature learning by inpainting. In CVPR.
[34] (2019) Revealing scenes by inverting structure from motion reconstructions. In CVPR.
[35] (2018) WaveGlow: a flow-based generative network for speech synthesis. In ICASSP.
[36] (2019) GANimation: one-shot anatomically consistent facial animation. IJCV.
[37] (2018) Geometry-aware network for non-rigid shape prediction from a single view. In CVPR.
[38] (2019) 3DPeople: modeling the geometry of dressed humans. In ICCV.
[39] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In CVPR.
[40] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR.
[41] (2019) Generating diverse high-fidelity images with VQ-VAE-2. arXiv preprint arXiv:1906.00446.
[42] (2015) Variational inference with normalizing flows. In ICML.
[43] (2016) Improved techniques for training GANs. In NeurIPS.
[44] (2019) Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion. NeurIPS.
[45] (2017) SurfNet: generating 3D shape surfaces using deep residual networks. In CVPR.
[46] (2015) Learning structured output representation using deep conditional generative models. In NeurIPS.
[47] (2018) Learning 3D shape completion from laser scan data with weak supervision. In CVPR.
[48] (2019) DUALglow: conditional flow-based generative model for modality transfer. arXiv preprint arXiv:1908.08074.
[49] (2013) Spatial pattern templates for recognition of objects with regular structure. In GCPR.
[50] (2018) Pixel2Mesh: generating 3D mesh models from single RGB images. In ECCV.
[51] (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807.
[52] (2004) Image quality assessment: from error visibility to structural similarity. TIP.
[53] (2016) Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NeurIPS.
[54] (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596.
[55] (2019) AdaFlow: domain-adaptive density estimator with application to anomaly detection and unpaired cross-domain translation. In ICASSP.
[56] (2019) PointFlow: 3D point cloud generation with continuous normalizing flows. ICCV.
[57] (2014) Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 192–199.
[58] (2016) Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597–613.
[59] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.