C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds

by Albert Pumarola, et al.

Flow-based generative models have highly desirable properties like exact log-likelihood evaluation and exact latent-variable inference; however, they are still in their infancy and have not received as much attention as alternative generative models. In this paper, we introduce C-Flow, a novel conditioning scheme that brings normalizing flows to an entirely new scenario with great possibilities for multi-modal data modeling. C-Flow is based on a parallel sequence of invertible mappings in which a source flow guides the target flow at every step, enabling fine-grained control over the generation process. We also devise a new strategy to model unordered 3D point clouds that, in combination with the conditioning scheme, makes it possible to address 3D reconstruction from a single image and its inverse problem of rendering an image given a point cloud. We demonstrate our conditioning method to be very adaptable, being also applicable to image manipulation, style transfer and multi-modal image-to-image mapping in a diversity of domains, including RGB images, segmentation maps, and edge masks.








1 Introduction

Generative models have become extremely popular in the machine learning and computer vision communities. Two main actors currently prevail in this scenario, Variational Autoencoders (VAEs) [24] and especially Generative Adversarial Networks (GANs) [16]. In this paper we focus on a different family, the so-called flow-based generative models [12], which remain under the shadow of VAEs and GANs despite offering very appealing properties. Compared to other generative methods, flow-based models build upon a sequence of reversible mappings between the input and latent space that allow for (1) exact latent-variable inference and log-likelihood evaluation, (2) efficient and parallelizable inference and synthesis and (3) useful and simple data manipulation by operating directly on the latent space.

The main contribution of this paper is a novel approach to condition normalizing flows, making it possible to perform multi-modality transfer tasks which have so far not been explored under the umbrella of flow-based generative models. For this purpose, we introduce C-Flow, a framework consisting of two parallel flow branches, interconnected across their reversible functions using conditional coupling layers and trained with an invertible cycle consistency. This scheme allows guiding a source domain towards a target domain guaranteeing the satisfaction of the aforementioned properties of flow-based models. Conditional inference is then implemented in a simple manner, by (exactly) embedding the source sample into its latent space, sampling a point from a Gaussian prior, and then propagating them through the learned normalizing flow. For example, for the application of synthesizing multiple plausible photos given a semantic segmentation mask, each image is generated by jointly propagating the segmentation embedding and a random point drawn from a prior distribution across the learned flow.

Our second contribution is a strategy to enable flow-based methods to model unordered 3D point clouds. Specifically, we introduce (1) a re-ordering of the 3D data points according to a Hilbert sorting scheme, (2) a global feature operation compatible with the reversible scheme, and (3) an invertible cycle consistency that penalizes the Chamfer distance. Combining this strategy with the proposed conditioning scheme, we can then address tasks such as shape interpolation, 3D object reconstruction from an image, and rendering an image given a 3D point cloud (see the teaser figure).


Importantly, our new conditioning scheme enables a wide range of tasks beyond 3D point cloud modeling. In particular, we are the first flow-based model to show mappings between a large diversity of domains, including image-to-image, point-cloud-to-image, edges-to-image, segmentation-to-image and their inverse mappings. We are also the first to demonstrate applications in image content manipulation and style transfer.

We believe our conditioning scheme, and its ability to deal with a variety of domains, opens the door to building general-purpose and easy to train solutions. We hope all this will spur future research in the domain of flow-based generative models.

2 Related Work

Flow-Based Generative Models.

Variational Auto-Encoders (VAEs) [24] and Generative Adversarial Networks (GANs) [16] are the most studied deep generative models so far. VAEs use deep networks as function approximators and maximize a lower bound on the data log-likelihood to model a continuous latent variable with intractable posterior distribution [41, 46, 27]. GANs, on the other hand, circumvent the need for likelihood estimation by leveraging an adversarial strategy. While GANs' versatility has enabled advances in many applications [22, 36, 59, 6, 1], their training is unstable [40] and requires careful hyper-parameter tuning.

Flow-based generative models [12, 42] have received little attention compared to GANs and VAEs, despite offering very attractive properties such as the ability to estimate exact log-likelihood, efficient synthesis and exact latent-variable inference. Further advances have been proposed in RealNVP [13] by introducing the affine coupling layers and in Glow [25], through an architecture with 1x1 invertible convolutions for image generation and editing. These works have been later applied to audio generation [35, 23, 55, 44], image modeling [48, 18, 8] and video prediction [26].

Some recent works have proposed strategies for conditioning normalizing flows by combining them with other generative models. For instance, [28, 18] combine flows with GANs. These models, however, are more difficult to train as adversarial losses tend to introduce instabilities. Similarly, for the specific application of video prediction, [26] enforces an autoregressive model onto the past latent variables to predict future ones. Dual-Glow [48] uses a conditioning scheme for MRI-to-PET brain scan mapping by concatenating the prior distribution of the source image with the latent variables of the target image.

In this paper, we introduce a novel mechanism to condition flow-based generative models by enforcing a source-to-target coupling at every transformation step instead of only feeding the source information into the target prior distribution. As we show experimentally, this enables fine-grained control over the modeling process (Sec. 7).

Modeling and reconstruction of 3D Shapes.

The success of deep learning has spurred a large number of discriminative approaches for 3D reconstruction [9, 37, 50, 45, 17]. These techniques, however, only learn direct mappings between output shapes and input images. Generative models, in contrast, capture the actual shape distribution of the training set, enabling not only the reconstruction of new test images, but also the sampling of new shapes from the learned distribution. Several works exist along this line. For instance, GANs have been used by Wu et al. [53] to model objects in a voxel representation, by Hamu et al. [4] to model body parts, and by Pumarola et al. [38] to learn the manifold of geometry images representing clothed 3D bodies. Auto-encoders [11, 47] and VAEs [14, 3, 29, 19] have also been applied to model 3D data. More recently, Joon Park et al. [31] used auto-decoders [5, 15] to represent the surface of a shape with a continuous volumetric field.

PointFlow [56] is the only prior approach that uses normalizing flows to model 3D data. It learns a generative model for point clouds by first modeling the distribution of object shapes and then applying normalizing flows to model the point-cloud distribution of each shape. This strategy, however, cannot condition the shape on another modality, preventing PointFlow from being used in applications such as 3D reconstruction and rendering. Also, its inference time is high, as point clouds are generated one point at a time, while we generate the entire point cloud in one forward pass.

3 Flow-Based Generative Model

Flow-based generative models aim to approximate an unknown true data distribution $p^*(\mathbf{x})$ from a limited set of observations. The data is modeled by learning an invertible transformation $g_\theta$ mapping a latent space with tractable density to the data space:

$$\mathbf{x} = g_\theta(\mathbf{z}), \qquad (1)$$

where $\mathbf{z}$ is a latent variable and $p_\theta(\mathbf{z})$ is typically a Gaussian distribution. The function $f_\theta = g_\theta^{-1}$, commonly known as a normalizing flow [42], is bijective, meaning that given a data point $\mathbf{x}$ its latent variable is computed as:

$$\mathbf{z} = f_\theta(\mathbf{x}) = g_\theta^{-1}(\mathbf{x}),$$

where $f_\theta$ is composed of a sequence of $K$ invertible transformations defining a mapping between $\mathbf{x}$ and $\mathbf{z}$ such that:

$$\mathbf{x} \overset{f_1}{\longleftrightarrow} \mathbf{h}_1 \overset{f_2}{\longleftrightarrow} \mathbf{h}_2 \cdots \overset{f_K}{\longleftrightarrow} \mathbf{z}, \qquad (3)$$

$K$ being a fixed hyper-parameter.

The goal of generative models is to find the parameters $\theta$ such that $p_\theta(\mathbf{x})$ best approximates $p^*(\mathbf{x})$. Explicitly modeling such a probability density function is usually intractable, but using the normalizing flow mapping of Eq. (1) under the change of variables theorem, we can compute the exact log-likelihood of a given data point as:

$$\log p_\theta(\mathbf{x}) = \log p_\theta(\mathbf{z}) + \log \left| \det\!\left( \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \right) \right|,$$

where $\partial \mathbf{z} / \partial \mathbf{x}$ is the Jacobian matrix of $f_\theta$ at $\mathbf{x}$, and the Jacobian determinant measures the change of log-density made by $f_\theta$ when transforming $\mathbf{x}$ to $\mathbf{z}$. Since we can now compute the exact log-likelihood, the training criterion of flow-based generative models is directly the negative log-likelihood over the observations. Note that optimizing the actual log-likelihood of the observations is more stable and informative than optimizing a lower bound of the log-likelihood, as in VAEs, or minimizing an adversarial loss, as in GANs. This is one of the major virtues of flow-based approaches.
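To make the change-of-variables computation concrete, here is a minimal NumPy sketch with a hand-picked affine flow; the matrix `A` and offset `b` are illustrative values, not part of the paper's model:

```python
import numpy as np

# Toy 2-D illustration: for an invertible affine map x = g(z) = A z + b with
# p(z) = N(0, I), the exact log-likelihood is
#   log p(x) = log p(z) + log |det(dz/dx)|,  with z = f(x) = A^{-1}(x - b).

A = np.array([[2.0, 0.0],
              [1.0, 0.5]])            # invertible transformation (illustrative)
b = np.array([0.3, -0.1])

def f(x):                             # normalizing flow: data -> latent
    return np.linalg.solve(A, x - b)

def log_prob(x):
    z = f(x)
    log_pz = -0.5 * float(z @ z) - np.log(2 * np.pi)   # standard normal in 2-D
    log_det = -np.log(abs(np.linalg.det(A)))           # log |det(dz/dx)| = -log|det A|
    return log_pz + log_det

# Sanity check against the closed-form density of x ~ N(b, A A^T):
x0 = np.array([1.0, 2.0])
S = A @ A.T
d0 = x0 - b
ref = (-0.5 * d0 @ np.linalg.solve(S, d0)
       - np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(S)))
assert abs(log_prob(x0) - ref) < 1e-9
```

The agreement with the analytic Gaussian density is exactly the point of the change-of-variables formula: the flow gives the true likelihood, not a bound.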

4 Conditional Flow-Based Generative Model

Figure 1: The C-Flow model consists of two parallel flow branches mutually interconnected with conditional coupling layers. This scheme allows sampling one domain conditioned on the other. For a detailed description of the functions shown in grey, refer to [25].
(a) Forward Propagation
(b) Backward Propagation
Figure 2: Conditional coupling layer for forward and backward propagation.

Given two input tensors $\mathbf{a}$ and $\mathbf{b}$, the proposed conditional coupling layer transforms the second half of $\mathbf{b}$ conditioned on the first halves of $\mathbf{a}$ and $\mathbf{b}$. The first halves of all tensors are not updated. By sequentially concatenating these bijective operations we can transform data points into their latent representation (forward propagation) and vice versa (backward propagation).

Let us define a true data distribution $p^*(\mathbf{a}, \mathbf{b})$. Our goal is to learn a model of this distribution that maps sample points from domain $A$ to domain $B$. For example, for the application of 3D reconstruction from a single view, $\mathbf{a}$ would be an image and $\mathbf{b}$ a 3D point cloud. To this end, we propose a conditional flow-based generative model extending the architectures of [13, 25]. Our $L$-level model learns both distributions with two bijective transformations $f^A_\theta$ and $f^B_\theta$ (Fig. 1):

$$\mathbf{z}_a = f^A_\theta(\mathbf{a}), \qquad \mathbf{z}_b = f^B_\theta(\mathbf{b}; \mathbf{a}),$$

where $\mathbf{z}_a$ and $\mathbf{z}_b$ are latent variables, and $p_\theta(\mathbf{z}_a)$ and $p_\theta(\mathbf{z}_b)$ are tractable spherical multivariate Gaussian distributions with learnable mean and variance.

We then define the mapping to sample $\mathbf{b}$ conditioned on $\mathbf{a}$ as a three-step operation:

$$\mathbf{z}_a = f^A_\theta(\mathbf{a}) \quad \text{(embed the condition)},$$
$$\mathbf{z}_b \sim p_\theta(\mathbf{z}_b) \quad \text{(sample the target prior)},$$
$$\mathbf{b} = g^B_\theta(\mathbf{z}_b; \mathbf{a}) \quad \text{(generate } \mathbf{b} \text{ conditioned on } \mathbf{a}\text{)}. \qquad (11)$$

In the following subsections we describe how this conditional framework is implemented. Sec. 4.1 discusses the foundations of the conditional coupling layer we propose to map source to target data using invertible functions, and how its Jacobian is computed. Sec. 4.2 describes the architecture we define for the practical implementation of the coupling layers. Sec. 4.3 presents an invertible cycle consistency loss introduced to further stabilize the training process. Finally, in Sec. 4.4 we define the total training loss.

4.1 Conditional Coupling Layer

When designing the conditional coupling layer we need to fulfill the constraint that each transformation has to be bijective and tractable. As shown in [12, 13], both issues can be overcome by choosing transformations with triangular Jacobian: their determinant is then the product of the diagonal terms, which makes the computation tractable and ensures invertibility. Motivated by these works, we propose an extension of their coupling layer to account for cross-domain conditioning. A schematic of the proposed layer is shown in Fig. 2. Formally, let us split the inputs along the channel dimension as $\mathbf{a} = [\mathbf{a}_1, \mathbf{a}_2]$ and $\mathbf{b} = [\mathbf{b}_1, \mathbf{b}_2]$, where the first part of each tensor holds the first $d$ of the $D$ channel dimensions. We then write the invertible function transforming a data point $\mathbf{b}$ based on $\mathbf{a}$ as:

$$\mathbf{y}_1 = \mathbf{b}_1,$$
$$\mathbf{y}_2 = \mathbf{b}_2 \odot \exp\big(s(\mathbf{b}_1, \mathbf{a}_1)\big) + t(\mathbf{b}_1, \mathbf{a}_1), \qquad (12)$$

where $\odot$ denotes element-wise multiplication and $s(\cdot)$ and $t(\cdot)$ are the scale and translation functions regressed from $\mathbf{b}_1$ and $\mathbf{a}_1$. We set $d = D/2$ in all experiments.

The inverse of the conditional coupling layer is:

$$\mathbf{b}_1 = \mathbf{y}_1,$$
$$\mathbf{b}_2 = \big(\mathbf{y}_2 - t(\mathbf{y}_1, \mathbf{a}_1)\big) \odot \exp\big(-s(\mathbf{y}_1, \mathbf{a}_1)\big),$$

and its Jacobian:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{b}} = \begin{bmatrix} \mathbf{I}_d & \mathbf{0} \\ \frac{\partial \mathbf{y}_2}{\partial \mathbf{b}_1} & \operatorname{diag}\!\big(\exp(s(\mathbf{b}_1, \mathbf{a}_1))\big) \end{bmatrix},$$

where $\mathbf{I}_d$ is an identity matrix. Since the Jacobian is triangular, its determinant can be calculated efficiently as the product of the diagonal elements. Note that it is not required to compute the Jacobians of the functions $s(\cdot)$ and $t(\cdot)$, enabling them to be arbitrarily complex. In practice, we implement both functions with a single convolutional neural network that returns $s$ and $t$.
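As an illustration, the forward and inverse passes of a conditional affine coupling step can be sketched in NumPy on channel vectors, with a small random linear map standing in for the learned scale/translation network (all weights here are illustrative, not the paper's architecture):

```python
import numpy as np

# Conditional affine coupling sketch: the second half of b is transformed with
# a scale/translation regressed from the first halves of b and a; the first
# halves pass through unchanged, so the step is trivially invertible and its
# Jacobian is triangular. NN() stands in for the coupling network.

D = 8                                   # channel dimension; split at d = D // 2
d = D // 2
rng = np.random.default_rng(1)
W_s = rng.normal(size=(d, 2 * d))       # illustrative random weights
W_t = rng.normal(size=(d, 2 * d))

def NN(b1, a1):
    h = np.concatenate([b1, a1])
    return np.tanh(W_s @ h), W_t @ h    # scale s (bounded) and translation t

def forward(b, a):
    b1, b2 = b[:d], b[d:]
    s, t = NN(b1, a[:d])
    y2 = b2 * np.exp(s) + t             # affine transform of the second half
    log_det = s.sum()                   # log |det J| = sum of log-scales
    return np.concatenate([b1, y2]), log_det

def inverse(y, a):
    y1, y2 = y[:d], y[d:]
    s, t = NN(y1, a[:d])                # y1 == b1, so s and t are recomputable
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])

b, a = rng.normal(size=D), rng.normal(size=D)
y, log_det = forward(b, a)
assert np.allclose(inverse(y, a), b)    # exact invertibility
```

Note that invertibility holds no matter how complex `NN` is, since the inverse only needs to re-evaluate it on the untouched first halves.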

4.2 Coupling Network Architecture

We next describe the architecture of the coupling networks used to regress the affine transforms $(s, t)$ applied at every conditional coupling layer of the source and target flows, respectively. We build upon the stack of three 2D convolutional layers proposed by [25]: the first two layers (with 3×3 and 1×1 filters, respectively) are each followed by actnorm [25] and a ReLU activation, and the third layer regresses the final scale and translation with a 2D convolution initialized with zeros, such that each affine transformation at the beginning of training is equivalent to an identity function.

For the source-flow transformation we use exactly this architecture, but for the target flow we extend it to take the conditioning into account. Concretely, $\mathbf{b}_1$ is initially transformed by two convolutional layers, like the first two above. Then, the conditioning $\mathbf{a}_1$ is adapted with a channel-wise affine transform implemented by a 1×1 convolution, and its output is added to the transformed $\mathbf{b}_1$. To ensure a similar contribution of both inputs, their activations are normalized with actnorm so that they operate in the same range. A final convolution regresses the conditional coupling layer operators $s$ and $t$.

4.3 Invertible Cycle Consistency

We train our model to maximize the log-likelihood of the training dataset. However, as in GAN learning [33, 21], we found it beneficial to add a loss encouraging the generated and real samples to be similar in the L1 sense. To do so, we exploit the fact that our model is made of bijective transformations, and introduce what we call an invertible cycle consistency: encode, partially resample the latent variable, decode, and compare.

Concretely, the data observations $(\mathbf{a}, \mathbf{b})$ are initially mapped into their latent variables $(\mathbf{z}_a, \mathbf{z}_b)$, where each variable is composed of an $L$-level stack. As demonstrated in [13], the first levels encode the high frequencies (details) of the data, and the last levels the low frequencies.

We then resample the first dimensions of $\mathbf{z}_b$ from a Gaussian distribution, obtaining $\tilde{\mathbf{z}}_b$. By doing this, $\tilde{\mathbf{z}}_b$ retains only the lowest frequencies of the original $\mathbf{b}$.

As a final step, we invert $\tilde{\mathbf{z}}_b$ to recover $\hat{\mathbf{b}}$ and penalize its L1 difference w.r.t. the original $\mathbf{b}$. What we are essentially doing is forcing the model to use the information of the condition $\mathbf{a}$ so that the recovered sample $\hat{\mathbf{b}}$ is as similar as possible to the original $\mathbf{b}$. Note that if we reconstructed based on the entire latent variable, the recovered sample would be identical to the original because the flow is bijective, and this loss would be meaningless.
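The mechanics of this cycle can be sketched with a stand-in invertible map (a random linear bijection instead of the learned flow, purely for illustration): encode, resample the first latent dimensions, invert, and measure the L1 difference.

```python
import numpy as np

# Invertible cycle consistency, sketched with a stand-in bijection f:
# resampling the first m latent dimensions discards the "detail" part of z_b,
# so inverting the modified latent no longer reproduces b exactly, while the
# full embedding recovers b perfectly (bijectivity).

rng = np.random.default_rng(2)
D, m = 16, 12                            # latent size; dimensions to resample
A = rng.normal(size=(D, D))              # stand-in invertible map b -> z_b

f = lambda b: A @ b                      # "encode"
f_inv = lambda z: np.linalg.solve(A, z)  # "decode"

b = rng.normal(size=D)
z = f(b)
z_tilde = z.copy()
z_tilde[:m] = rng.normal(size=m)         # resample the detail dimensions
b_hat = f_inv(z_tilde)

cycle_l1 = np.abs(b - b_hat).mean()      # the quantity the loss penalizes
full_l1 = np.abs(b - f_inv(z)).mean()    # full embedding: exact recovery
assert full_l1 < 1e-8
assert cycle_l1 > full_l1
```

In C-Flow the decoder is additionally conditioned on $\mathbf{a}$, which is what allows the model to drive `cycle_l1` down by exploiting the condition rather than the discarded latent details.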

4.4 Total Loss

Formally, given training pairs of observations $(\mathbf{a}, \mathbf{b})$, the model parameters $\theta$ are learned by minimizing the following loss function:

$$\mathcal{L}(\theta) = -\log p_\theta(\mathbf{a}, \mathbf{b}) + \lambda \, \mathcal{L}_{\text{cyc}}. \qquad (14)$$

The first term maximizes the joint likelihood of the data observations. With our design, it also maximizes the conditional likelihood of $\mathbf{b}$ given $\mathbf{a}$ and thus forces the model to learn the desired mapping. To show this, we apply the law of total probability and factor it into:

$$\log p_\theta(\mathbf{a}, \mathbf{b}) = \log p_\theta(\mathbf{a}) + \log p_\theta(\mathbf{b} \,|\, \mathbf{a}).$$

Due to the triangular structure of the Jacobians, the marginal likelihood of $\mathbf{a}$ depends only on $\mathbf{z}_a$ (first term), while the conditional of $\mathbf{b}$ depends only on $\mathbf{z}_b$. Maximizing the joint likelihood thus maximizes both likelihoods independently.

The second term in Eq. (14) minimizes the cycle consistency loss, with $\lambda$ a hyper-parameter balancing the two terms. This loss is fully differentiable, and we provide details on how we optimize it in Sec. 6.

Figure 3: Sorting 3D point clouds. Point clouds corresponding to three different chairs. The colored line connects all points based on their ordering. Top: Unordered. Bottom: Applying the proposed sorting strategy. Note how the coloring is consistent across samples even for point clouds with different topology.

5 Modeling Unordered 3D Point Clouds

The model described so far can handle input data represented on regular grids, but it fails to model unordered 3D point clouds, whose lack of spatial neighborhood ordering prevents convolutions from being applied. To process point clouds with deep networks, a common practice is to apply symmetry operations [39] that create fixed-size tensors of global features describing the entire point cloud. These operations extract point-independent features followed by a max-pool, which is not invertible and thus not applicable to normalizing flows. Another alternative would be graph convolutional networks [54], although their high computational cost makes them unsuitable for our scheme of multiple coupling layers. We propose a three-step mechanism to enable modeling 3D point clouds:

(i) Approximate Sorting with Space-Filling Curves. C-Flow is based on convolutional layers, which require input data with a local neighborhood that is consistent across samples. To fulfill this condition on unordered point clouds, we propose to sort them based on proximity. As discussed in [39], for high-dimensional spaces it is not possible to produce a perfect ordering that is stable to point perturbations. We therefore use the approximation provided by Hilbert's space-filling curve algorithm [20]. For each training sample, we project its points onto a 3D Hilbert curve and reorder them based on their position along the curve (Fig. 3). Notice that this establishes not only a neighborhood relationship but also a semantically stable ordering (e.g., in Fig. 3 the chair's right leg is always blue). To the best of our knowledge, no previous work uses such preprocessing for point clouds.
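The preprocessing pipeline of step (i) can be sketched as follows. For brevity this sketch orders points by a Morton (Z-order) curve rather than Hilbert's curve, which the paper uses and which has better locality; the pipeline is otherwise the same: quantize each point to a grid, compute its index along the curve, and argsort.

```python
import numpy as np

# Illustrative stand-in for the Hilbert sort: order 3-D points along a Morton
# (Z-order) space-filling curve by interleaving the bits of their quantized
# coordinates. Consecutive points in the resulting order tend to be close in
# space, giving convolutions a (rough) local neighborhood.

BITS = 10                                    # grid resolution: 2^10 per axis

def morton_index(q):
    """Interleave the bits of quantized coords (x, y, z) -> curve index."""
    idx = 0
    for bit in range(BITS):
        for axis in range(3):
            idx |= int((q[axis] >> bit) & 1) << (3 * bit + axis)
    return idx

def sort_points(points):
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    q = ((pts - lo) / np.maximum(hi - lo, 1e-12) * (2**BITS - 1)).astype(int)
    order = np.argsort([morton_index(p) for p in q])
    return pts[order]

cloud = np.random.default_rng(3).uniform(size=(1024, 3))
ordered = sort_points(cloud)
# Consecutive points are now far closer on average than in the raw ordering.
```

A production version would substitute a Hilbert index for `morton_index`; everything else stays the same.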

Figure 4: Approximating global features in point clouds. When dealing with point clouds (reordered and reshaped to a fixed 2D grid), we approximate global features in the coupling layers with the operations shown in blue, while still remaining invertible. The affine block applies an affine transformation in which the first half of the input channels act as the scale and the other half as the translation.

(ii) Approximating Global Features. The Hilbert sort alone is not sufficient to model 3D data because of a major issue: it splits the space into equally sized quadrants, and the Hilbert curve covers all points in a quadrant before moving to the next. As a consequence, two points that were originally close in space, but lie near the boundary between two quadrants, can end up far apart in the final ordering. To mitigate this effect, we extend the proposed coupling network architecture (Sec. 4.2) with an approximate but invertible version of the global features proposed in [39] that describe the whole point cloud. Concretely, we first resample and reshape the reordered point cloud to form matrices (in practice, of the same size as the images). Then we approximate the global descriptors of [39] with a convolution that extracts point-independent features, followed by a max-pool applied only over the first half of the point cloud features (Fig. 4). The coupling layer remains bijective because during the backward propagation the approximated global features can be recovered with a strategy similar to that of Eq. (12).
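Why pooling over only the first half preserves invertibility can be seen in a tiny sketch (additive coupling with a max-pooled feature standing in for the paper's convolutional pipeline):

```python
import numpy as np

# Invertible "approximate global feature": a max-pool is taken only over the
# FIRST half of the point features and used to transform the SECOND half.
# Because the first half passes through untouched, the pooled feature can be
# recomputed during the inverse pass, keeping the step bijective.

def forward(feats):
    h1, h2 = np.split(feats, 2, axis=0)
    g = h1.max(axis=0)                    # global feature from first half only
    return np.concatenate([h1, h2 + g])   # additive coupling with g

def inverse(out):
    h1, y2 = np.split(out, 2, axis=0)
    g = h1.max(axis=0)                    # identical pooling is available again
    return np.concatenate([h1, y2 - g])

x = np.random.default_rng(4).normal(size=(8, 3))
assert np.allclose(inverse(forward(x)), x)
```

Pooling over *all* features would break this: the inverse pass could not recover the pooled value before undoing the transform.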

(iii) Symmetric Chamfer Distance for Cycle Consistency. For the specific case of point clouds, we observed that when penalizing the invertible cycle consistency with L1, the model converged to a mean Hilbert curve. Therefore, for point clouds, we substitute L1 by the symmetric Chamfer distance, which computes the mean Euclidean distance between the ground-truth point cloud $\mathbf{b}$ and the recovered $\hat{\mathbf{b}}$.
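One common formulation of the symmetric Chamfer distance (the sum of the two one-sided mean nearest-neighbour distances; variants divide by two or use squared distances) can be written in a few lines of NumPy:

```python
import numpy as np

# Symmetric Chamfer distance between two point clouds P and Q: the mean
# distance from each point to its nearest neighbour in the other cloud,
# accumulated over both directions.

def chamfer(P, Q):
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
Q = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
assert chamfer(P, P) == 0.0   # identical clouds -> zero distance
assert chamfer(P, Q) == 1.5   # (0+1)/2 one way + (0+2)/2 the other
```

Unlike L1 on sorted coordinates, this loss is permutation-invariant, which is what prevents the collapse to a mean curve mentioned above. The O(nm) pairwise matrix is fine for small clouds; a KD-tree is preferable at scale.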

Figure 5: Embedding 3D points clouds. Top: Reconstruction with partial embeddings. Bottom: Reconstruction with three iterations of backward propagations of partial embeddings.

6 Implementation Details

Due to memory restrictions, we train with low-resolution image samples. For 3D point clouds, to maintain the same architecture as for images, we reshape each point cloud sample (a list of 3D points) to the same spatial size, and at test time we also regress the 3D points in a single forward pass. Our implementation builds upon that of Glow [25]. We optimize with Adam. The multi-scale architecture consists of $L$ levels with 12 flow steps each ($K$ in Eq. (3)) and squeezing operations between levels. For conditional sampling we found additive coupling (i.e., dropping the scale term) to be more stable during training than the full affine transformation. The prior distributions of $\mathbf{z}_a$ and $\mathbf{z}_b$ are initialized with mean 0 and variance 1; the remaining weights are randomly initialized from a zero-mean normal distribution with small standard deviation, and $\lambda$ in Eq. (14) is kept fixed. As in previous likelihood-based generative models [32, 25], we observed that sampling from a reduced-temperature prior improves the results; to do so, we multiply the variance of the prior by a temperature factor smaller than one. The model is trained on 4 P-100 GPUs for 10 days.

7 Experimental Evaluation

We next evaluate our system on diverse tasks: (1) Modeling point clouds (Sec. 7.1), (2) 3D reconstruction and rendering (Sec. 7.2), (3) Image-to-image mapping in a variety of domains and datasets (Sec. 7.3), and (4) Image manipulation and style transfer (Sec. 7.4).

7.1 Modeling 3D Point Clouds

We evaluate the potential of our approach to model 3D point clouds on ShapeNet [7]. For this task, we do not consider the full conditioning scheme and only use one of the branches of C-Flow in Fig. 1, which we denote as C-Flow*.

In our first experiment we study the capacity to represent unknown shapes, formally defined as the ability to retain information after mapping forward and backward between the original and latent spaces. For this purpose, we first map a real point cloud $\mathbf{b}$ to its latent representation $\mathbf{z}_b$. The full-size embedding has as many dimensions as the input (4096). Then we progressively remove information from $\mathbf{z}_b$ by replacing its left-most components with samples drawn from a Gaussian distribution. Note that the embedding size can be set at test time with no need to retrain, making tasks like point cloud compression straightforward. Finally, we map this embedding back to the original point cloud space and compare with $\mathbf{b}$.

Figure 6: Interpolation. Results of interpolating between two 3D point clouds in the learned latent space.
Method                         CD ↓ (full embedding → progressively smaller)
C-Flow* (plain, ≡ Glow [25])   0.00   0.39   0.39   0.39
C-Flow* + Sort                 0.00   0.19   0.21   0.22
C-Flow* + Sort + GF-Coupling   0.00   0.14   0.18   0.31
AtlasNet-Sph. [17]             0.75
AtlasNet-25 [17]               0.37
DeepSDF [31]                   0.20
Table 1: Representing 3D point clouds. Chamfer distance when recovering point clouds with partial embeddings. For all C-Flow* variants the embedding size is changed at test time, with no further training; the fraction of retained dimensions is expressed with respect to the input dimension (4096). For AtlasNet and DeepSDF we provide the results reported in [31].

Tab. 1 reports the Chamfer distance (CD) for different embedding sizes. The plain version of C-Flow* (no conditioning, no sorting, no global features) is equivalent to Glow [25]. This version is consistently improved by introducing the sorting and global features strategies (Sec. 5). The error decreases gracefully as we increase the embedding size, and importantly, when using the full-size embedding we obtain a perfect recovery (Fig. 5-top). This is a virtue of bijective models and is not a trivial property. Tab. 1 also reports the numbers of AtlasNet [17] and DeepSDF [31], showing that our approach achieves competitive results. This comparison is only indicative, as the representations used in these approaches are inherently different ([17] is parametric and [31] a continuous surface).

Recall that the left-most, randomly resampled components of the embedding encode the fine details of the shape. We exploit this property to generate point clouds with an arbitrarily large number of points by performing multiple backward propagations of a partial embedding (Fig. 5-bottom). Every backward propagation recovers a new set of 3D points, allowing us to progressively increase the density of the reconstruction.

Another task that can be addressed with C-Flow is shape interpolation in the latent space (Figure 6).

Figure 7: Image-to-image. Results of image-to-image mappings on a variety of domains; each example shows the source image and the generated image in the target domain. The examples on the left correspond to target domains with high variability, where sampling multiple times generates different images. In the examples on the right the target domain has small variability and the sampling becomes deterministic.
                                   Image → PC    PC → Image
Method                             CD ↓          BPD ↓    IS ↑
3D-R2N2 [9]                        0.27          -        -
PSGN [14]                          0.26          -        -
Pix2Mesh [50]                      0.27          -        -
AtlasNet [17]                      0.21          -        -
ONet [29]                          0.23          -        -
C-Flow                             0.86          4.38     1.80
C-Flow + Sort                      0.52          2.77     2.41
C-Flow + Sort + GF-Coupling        0.49          2.87     2.61
C-Flow + Sort + GF-Coupling + CD   0.26          -        -
Table 2: 3D reconstruction and rendering. ↓: the lower the better; ↑: the higher the better. C-Flow is the first approach able to render images from point clouds, and the same model can be used to perform 3D reconstruction from images. The results of all other methods are taken from their original papers.

7.2 3D Reconstruction & rendering

We next evaluate the ability of C-Flow to model the conditional distributions (1) image → point cloud, which enables 3D reconstruction from a single image; and (2) point cloud → image, its inverse problem of rendering an image given a 3D point cloud. The teaser figure shows qualitative results on the chair class of ShapeNet. In the top row, our model generates plausible 3D reconstructions of unknown objects even under strong self-occlusions (top-right example). The second row depicts rendering results, which highlight another advantage of our model: we can sample multiple times from the conditional distribution to produce several images of the same object exhibiting different properties (e.g., viewpoint or texture).

In Table 2 we compare C-Flow with other single-image 3D reconstruction methods 3D-R2N2 [9], PSGN [14], Pix2Mesh [50], AtlasNet [17] and ONet [29]. We evaluate 3D reconstruction in terms of the Chamfer distance (CD) with the ground truth shapes. Our approach (last row) performs on par with [9, 14, 50] and it is slightly below the state-of-the-art techniques specifically designed for 3D reconstruction [17, 29].

                              C-Flow                  C-Flow + cycle
                              BPD ↓  SSIM ↑  IS ↑     BPD ↓  SSIM ↑  IS ↑
segmentation → street views   3.21   0.37    1.80     3.17   0.42    1.94
segmentation ← street views   3.25   0.33    2.19     3.05   0.36    2.23
structure → facades           3.55   0.24    1.92     3.54   0.26    1.69
structure ← facades           3.55   0.31    2.05     3.55   0.30    2.01
map → aerial photo            3.65   0.19    1.52     3.65   0.17    1.62
map ← aerial photo            3.65   0.54    1.95     3.65   0.57    1.97
edges → shoes                 1.70   0.66    2.40     1.68   0.67    2.43
edges ← shoes                 1.65   0.64    1.61     1.65   0.65    1.69
Table 3: Conditional image-to-image generation. Evaluation of C-Flow (plain) and C-Flow + cycle consistency loss on image-to-image mappings.
(a) Image content manipulation
(b) Style transfer
Figure 8: Other applications. Sample results on image manipulation and style transfer. The model was not retrained for these tasks; we used the same training weights as for the image-to-image mappings in Fig. 7.

With the same model, we can also render images from point clouds. To the best of our knowledge, no previous work can perform such a mapping. While a few approaches do render point clouds [30, 2, 34], they rely on the strong assumptions of knowing the RGB color per point and the camera calibration to project the point cloud onto the image plane. Table 2 also reports an ablation study of the different operations we devised to handle 3D point clouds, namely sorting the point cloud (Sort), approximating global features (GF-Coupling) and the invertible cycle consistency with Chamfer distance (CD). In this case, evaluation is reported using the Inception Score (IS) [43] and Bits Per Dimension (BPD), which is equivalent to the negative log2-likelihood typically used to report the performance of flow-based methods. Results show a performance boost when using each of these components, and especially when combining them.

7.3 Image-to-Image mappings

We evaluate the ability of C-Flow to perform multi-domain image-to-image mapping: segmentation ↔ street views trained on Cityscapes [10], structure ↔ facades trained on CMP Facades [49], map ↔ aerial photo trained on [21], and edges ↔ shoes trained on [57, 58, 21]. The examples in Fig. 7-left show mappings in which the target domain has a wide variance and multiple samplings generate different results (e.g., a semantic segmentation map can map to several grayscale images). The examples on the right have a target domain with a narrower variance, and despite multiple samplings the generated images are very similar (e.g., given an image, its segmentation is well defined).

Table 3 reports quantitative evaluations using the Structural Similarity index (SSIM) [52] and, again, BPD and IS. When introducing the invertible cycle consistency loss (Sec. 4.3), the model does not improve its compression abilities (BPD), but it does improve in terms of structural similarity (SSIM) and semantic content (IS). It is worth mentioning that while GANs have shown impressive image-to-image mapping results, even at high resolution [51], ours is the first work to address such tasks using normalizing flows.

7.4 Other Applications

Finally, we demonstrate the versatility of C-Flow as the first flow-based method capable of performing style transfer and image content manipulation (Fig. 8). Importantly, the model was not retrained for these specific tasks; we use the same parameters learned for the image-to-image mappings (Sec. 7.3). For image manipulation we use the weights of segmentation → street view, and for style transfer those of edges → shoes. Formally, let domain $A$ be the structure (e.g., a segmentation mask) and domain $B$ the image (e.g., a street view). Then, image manipulation is achieved via three operations:

$$\mathbf{z}_b = f^B_\theta(\mathbf{b}; \mathbf{a}) \quad \text{(embed the original image)},$$
$$\mathbf{a} \rightarrow \mathbf{a}' \quad \text{(manually edit the structure)},$$
$$\hat{\mathbf{b}} = g^B_\theta(\mathbf{z}_b; \mathbf{a}') \quad \text{(synthesize the new image)}. \qquad (18)$$

Note that with this generation approach we are no longer conditioning only on $\mathbf{a}$, as in Sec. 7.3: the synthesized image is now jointly conditioned on $\mathbf{a}'$ (for structure) and $\mathbf{z}_b$ (for texture).

To perform style transfer, we first transform the content image into its structure. For instance, in Fig. 8-bottom, the content of the shoe is initially mapped onto its edge structure with the shoes → edges weights. Then, we apply the same procedure as for image manipulation using the edges → shoes weights, setting $\mathbf{a}$ to be the structure of the content image and $\mathbf{b}$ the style image.

8 Conclusions

We have proposed C-Flow, a novel conditioning scheme for normalizing flows. This conditioning, in conjunction with a new strategy to model unordered 3D point clouds, has made it possible to address 3D reconstruction and rendering images from point clouds, problems which so far could not be tackled with normalizing flows. Furthermore, we demonstrate C-Flow to be a general-purpose model, applicable to many more multi-modality problems, such as image-to-image translation, style transfer and image content editing. To the best of our knowledge, no previous model has demonstrated such adaptability.


  • [1] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool (2019) Generative adversarial networks for extreme learned image compression. In ICCV, Cited by: §2.
  • [2] K. Aliev, D. Ulyanov, and V. S. Lempitsky (2019) Neural point-based graphics. arXiv preprint arXiv:1906.08240. Cited by: §7.2.
  • [3] T. Bagautdinov, C. Wu, J. Saragih, P. Fua, and Y. Sheikh (2018) Modeling facial geometry using compositional vaes. In CVPR, Cited by: §2.
  • [4] H. Ben-Hamu, H. Maron, I. Kezurer, G. Avineri, and Y. Lipman (2018) Multi-chart generative surface modeling. In SIGGRAPH Asia, Cited by: §2.
  • [5] P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam (2017) Optimizing the latent space of generative networks. PMLR. Cited by: §2.
  • [6] S. Caelles, A. Pumarola, F. Moreno-Noguer, A. Sanfeliu, and L. Van Gool (2019) Fast video object segmentation with spatio-temporal gans. arXiv preprint arXiv:1903.12161. Cited by: §2.
  • [7] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §7.1.
  • [8] H. Chen, K. Hui, S. Wang, L. Tsao, H. Shuai, and W. Cheng (2019) BeautyGlow: on-demand makeup transfer framework with reversible generative network. In CVPR, Cited by: §2.
  • [9] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In ECCV, Cited by: §2, §7.2, Table 2.
  • [10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §7.3.
  • [11] A. Dai, C. Ruizhongtai Qi, and M. Nießner (2017) Shape completion using 3d-encoder-predictor cnns and shape synthesis. In CVPR, Cited by: §2.
  • [12] L. Dinh, D. Krueger, and Y. Bengio (2014) Nice: non-linear independent components estimation. In ICLR, Cited by: §1, §2, §4.1.
  • [13] L. Dinh, J. Sohl-Dickstein, and S. Bengio (2017) Density estimation using real nvp. ICLR. Cited by: §2, §4.1, §4.3, §4.
  • [14] H. Fan, H. Su, and L. J. Guibas (2017) A point set generation network for 3d object reconstruction from a single image. In CVPR, Cited by: §2, §7.2, Table 2.
  • [15] J. Fan and J. Cheng (2018) Matrix completion by deep matrix factorization. Neural Networks. Cited by: §2.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §1, §2.
  • [17] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2018) AtlasNet: a papier-mâché approach to learning 3d surface generation. CVPR. Cited by: §2, §7.1, §7.2, Table 1, Table 2.
  • [18] A. Grover, C. D. Chute, R. Shu, Z. Cao, and S. Ermon (2019) AlignFlow: cycle consistent learning from multiple domains via normalizing flows. arXiv preprint arXiv:1905.12892. Cited by: §2, §2.
  • [19] P. Henderson and V. Ferrari (2019) Learning single-image 3D reconstruction by generative modelling of shape, pose and shading. IJCV. Cited by: §2.
  • [20] D. Hilbert (1935) Über die stetige abbildung einer linie auf ein flächenstück. In Dritter Band: Analysis ⋅ Grundlagen der Mathematik ⋅ Physik Verschiedenes: Nebst Einer Lebensgeschichte, Cited by: §5.
  • [21] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004. Cited by: §4.3, §7.3.
  • [22] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §2.
  • [23] S. Kim, S. Lee, J. Song, and S. Yoon (2018) FloWaveNet: a generative flow for raw audio. ICML. Cited by: §2.
  • [24] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §1, §2.
  • [25] D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In NeurIPS, Cited by: §2, Figure 1, §4.2, §4, §6, §7.1, Table 1.
  • [26] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma (2019) VideoFlow: a flow-based generative model for video. arXiv preprint arXiv:1903.01434. Cited by: §2, §2.
  • [27] C. Lassner, G. Pons-Moll, and P. V. Gehler (2017) A generative model of people in clothing. In ICCV, Cited by: §2.
  • [28] R. Liu, Y. Liu, X. Gong, X. Wang, and H. Li (2019) Conditional adversarial generative flow for controllable image synthesis. In CVPR, Cited by: §2.
  • [29] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In CVPR, Cited by: §2, §7.2, Table 2.
  • [30] M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely, and R. Martin-Brualla (2019) Neural rerendering in the wild. In CVPR, Cited by: §7.2.
  • [31] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) Deepsdf: learning continuous signed distance functions for shape representation. CVPR. Cited by: §2, §7.1, Table 1.
  • [32] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and D. Tran (2018) Image transformer. arXiv preprint arXiv:1802.05751. Cited by: §6.
  • [33] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, Cited by: §4.3.
  • [34] F. Pittaluga, S. J. Koppal, S. B. Kang, and S. N. Sinha (2019) Revealing scenes by inverting structure from motion reconstructions. In CVPR, Cited by: §7.2.
  • [35] R. Prenger, R. Valle, and B. Catanzaro (2018) Waveglow: a flow-based generative network for speech synthesis. In ICASSP, Cited by: §2.
  • [36] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer (2019) GANimation: one-shot anatomically consistent facial animation. IJCV. Cited by: §2.
  • [37] A. Pumarola, A. Agudo, L. Porzi, A. Sanfeliu, V. Lepetit, and F. Moreno-Noguer (2018) Geometry-aware network for non-rigid shape prediction from a single view. In CVPR, Cited by: §2.
  • [38] A. Pumarola, J. Sanchez, G. Choi, A. Sanfeliu, and F. Moreno-Noguer (2019) 3DPeople: Modeling the Geometry of Dressed Humans. In ICCV, Cited by: §2.
  • [39] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In CVPR, Cited by: §5, §5, §5.
  • [40] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR. Cited by: §2.
  • [41] A. Razavi, A. v. d. Oord, and O. Vinyals (2019) Generating diverse high-fidelity images with vq-vae-2. arXiv preprint arXiv:1906.00446. Cited by: §2.
  • [42] D. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. In ICML, Cited by: §2, §3.
  • [43] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In NeurIPS, Cited by: §7.2.
  • [44] J. Serrà, S. Pascual, and C. Segura (2019) Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion. NeurIPS. Cited by: §2.
  • [45] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani (2017) Surfnet: generating 3d shape surfaces using deep residual networks. In CVPR, Cited by: §2.
  • [46] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In NeurIPS, Cited by: §2.
  • [47] D. Stutz and A. Geiger (2018) Learning 3d shape completion from laser scan data with weak supervision. In CVPR, Cited by: §2.
  • [48] H. Sun, R. Mehta, H. Zhou, Z. Huang, S. Johnson, V. Prabhakaran, and V. Singh (2019) DUAL-glow: conditional flow-based generative model for modality transfer. arXiv preprint arXiv:1908.08074. Cited by: §2, §2.
  • [49] R. Tyleček and R. Šára (2013) Spatial pattern templates for recognition of objects with regular structure. In GCPR, Cited by: §7.3.
  • [50] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang (2018) Pixel2mesh: generating 3d mesh models from single rgb images. In ECCV, Cited by: §2, §7.2, Table 2.
  • [51] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, pp. 8798–8807. Cited by: §7.3.
  • [52] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. TIP. Cited by: §7.3.
  • [53] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum (2016) Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In NeurIPS, Cited by: §2.
  • [54] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §5.
  • [55] M. Yamaguchi, Y. Koizumi, and N. Harada (2019) AdaFlow: domain-adaptive density estimator with application to anomaly detection and unpaired cross-domain translation. In ICASSP, Cited by: §2.
  • [56] G. Yang, Z. Hao, M. Liu, S. Belongie, and B. Hariharan (2019) PointFlow: 3d point cloud generation with continuous normalizing flows. ICCV. Cited by: §2.
  • [57] A. Yu and K. Grauman (2014) Fine-grained visual comparisons with local learning. In CVPR, pp. 192–199. Cited by: §7.3.
  • [58] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros (2016) Generative visual manipulation on the natural image manifold. In ECCV, pp. 597–613. Cited by: §7.3.
  • [59] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §2.