Learning Likelihoods with Conditional Normalizing Flows

by Christina Winkler, et al.
University of Amsterdam

Normalizing Flows (NFs) are able to model complicated distributions p(y) with strong inter-dimensional correlations and high multimodality by transforming a simple base density p(z) through an invertible neural network under the change of variables formula. Such behavior is desirable in multivariate structured prediction tasks, where handcrafted per-pixel loss-based methods inadequately capture strong correlations between output dimensions. We present a study of conditional normalizing flows (CNFs), a class of NFs where the base density to output space mapping is conditioned on an input x, to model conditional densities p(y|x). CNFs are efficient in sampling and inference, they can be trained with a likelihood-based objective, and CNFs, being generative flows, do not suffer from mode collapse or training instabilities. We provide an effective method to train continuous CNFs for binary problems and in particular, we apply these CNFs to super-resolution and vessel segmentation tasks demonstrating competitive performance on standard benchmark datasets in terms of likelihood and conventional metrics.






1 Introduction

Learning conditional distributions p(y|x) is one of the oldest problems in machine learning. When the output y is high-dimensional this is a particularly challenging task, and the practitioner is left with many design choices. Do we factorize the conditional? If not, do we model correlations with, say, a conditional random field (Prince, 2012)? Do we use a unimodal distribution? How fat should the tails be? Do we use an explicit likelihood at all, or use implicit methods (Mohamed and Rezende, 2015) such as a GAN (Goodfellow et al., 2014)? Do we quantize the output? Ideally, the practitioner should not have to make design choices at all, and the distribution should be learned from the data.

In the field of density estimation, normalizing flows (NFs) are a relatively new family of models (Rezende and Mohamed, 2015). NFs model complicated high-dimensional marginal distributions by transforming a simple base distribution or prior through a learnable, invertible mapping and then applying the change of variables formula. NFs are efficient in inference and sampling, are able to learn inter-dimensional correlations and multi-modality, and they are exact likelihood models, amenable to gradient-based optimization.

Flow-based generative models (Dinh et al., 2016) are generally trained on the image space, and are in some cases computationally efficient in both the forward and inverse direction. They are advantageous over other likelihood-based methods because i) sampling is efficient, as opposed to autoregressive models (Van Oord et al., 2016), and ii) flows admit exact likelihood optimization, in contrast with variational autoencoders (Kingma and Welling, 2014).

Conditional random fields directly model correlations between pixels, and have been fused with deep learning (Chen et al., 2016). However, they require the practitioner to choose which pixels have pairwise interactions. Another approach uses adversarial training (Goodfellow et al., 2014). A downside is that the training procedure can be unstable, and adversarial models are difficult to evaluate quantitatively.

We propose to learn the likelihood of conditional distributions with few modeling choices using Conditional Normalizing Flows (CNFs). CNFs can be harnessed for conditional distributions by conditioning both the prior and the invertible mapping on the input x. In particular, we apply conditional flows to super-resolution (Wang et al., 2018) and vessel segmentation (Staal et al., 2004). We evaluate their performance gains on multivariate prediction tasks alongside architecturally matched factored baselines by comparing likelihoods and application-specific evaluation metrics.

2 Background

In the following, we present the relevant background material on normalizing flows and structured prediction. This section covers the change of variables formula, invertible modules, variational dequantization and conventional likelihood optimization.

2.1 Normalizing Flows

A standard NF in continuous space is based on a simple change of variables formula. Given two spaces Y and Z of equal dimension; a once-differentiable, parametric, bijective mapping f_φ: Y → Z, where φ are the parameters of f (we can also use injective mappings); and a prior distribution p_Z(z), we can model a complicated distribution p_Y(y) as

p_Y(y) = p_Z(f_φ(y)) |det ∂f_φ(y)/∂y|.   (1)

The term |det ∂f_φ(y)/∂y| is the Jacobian determinant of f_φ, evaluated at y, and it accounts for volume changes induced by f_φ. The transformation introduces correlations and multi-modality in p_Y(y). The main challenge in the field of normalizing flows is designing the transformation f_φ. It has to i) be bijective, ii) have an efficient and tractable Jacobian determinant, and iii) come from a ‘flexible’ model class. In addition, iv) for fast sampling the inverse f_φ^{-1} needs to be efficiently computable. Below we briefly state which invertible modules are used in our architectures, obeying the aforementioned points.
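To make the formula concrete, here is a minimal numpy sketch (an illustrative toy, not the paper's model) that evaluates log p_Y(y) for a one-dimensional affine flow z = s·y + t under a standard normal prior; all helper names are our own:

```python
import numpy as np

def base_logpdf(z):
    # log-density of the standard normal base distribution p_Z
    return -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)

def flow_logpdf(y, s, t):
    # change of variables with the affine flow z = f(y) = s*y + t,
    # whose Jacobian "determinant" in 1-D is simply |s|
    z = s * y + t
    return base_logpdf(z) + np.log(abs(s))
```

Since z ~ N(0, 1) implies y = (z − t)/s ~ N(−t/s, 1/s²), the value returned by `flow_logpdf` can be checked against the closed-form Gaussian log-density of that distribution.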

Coupling layers

Affine coupling layers (Dinh et al., 2016) are invertible, nonlinear layers. They work by splitting the input into two components z_a and z_b and nonlinearly transforming z_b as a function of z_a, before re-concatenating the result. If z = [z_a, z_b] and z' = [z'_a, z'_b], this is

z'_a = z_a,    z'_b = z_b ⊙ exp(s(z_a)) + t(z_a),

where the scale s and translation t functions can be any function, typically implemented with a CNN. Similar conditioning with normalizing flows has been done in previous works by Mohamed and Rezende (2015) and Kingma et al. (2016b).
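The coupling transform can be sketched in a few lines of numpy; `s_fn` and `t_fn` stand in for the CNNs and are hypothetical placeholders:

```python
import numpy as np

def coupling_forward(z, s_fn, t_fn):
    # split z into halves; z_a passes through unchanged, z_b is transformed
    za, zb = np.split(z, 2)
    s, t = s_fn(za), t_fn(za)
    zb_new = zb * np.exp(s) + t
    logdet = np.sum(s)  # log|det J| of the coupling is the sum of log-scales
    return np.concatenate([za, zb_new]), logdet

def coupling_inverse(zp, s_fn, t_fn):
    # invert by recomputing s, t from the untouched half z_a
    za, zb_new = np.split(zp, 2)
    s, t = s_fn(za), t_fn(za)
    return np.concatenate([za, (zb_new - t) * np.exp(-s)])
```

Because z_a is left untouched, the same s and t can be recomputed in the inverse pass, which is what makes the layer exactly invertible with a triangular Jacobian.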

Invertible 1 x 1 Convolutions

Proposed in Kingma and Dhariwal (2018), invertible 1 x 1 convolutions help mix information across channel dimensions. We implement them as regular 1 x 1 convolutions, and for the inverse, we convolve with the inverse of the kernel.

Squeeze layers

Squeeze layers (Dinh et al., 2016) are used to compress the spatial resolution of activations. They also help increase the spatial receptive field of pixels in the deeper activations.

Split Prior

Split priors (Dinh et al., 2016) work by splitting a set of activations z into two components z_a and z_b. We then condition z_b on z_a using a simple base density, e.g. p(z_b|z_a) = N(z_b; μ(z_a), σ(z_a)), where μ and σ are neural networks. The component z_a can be modeled by further flow layers. This prior is useful for modeling hierarchical correlations between dimensions, and it also helps reduce computation, since z_a is reduced in size.

Variational dequantization

When modeling discrete data, Theis et al. (2016) introduced the concept of dequantization. For this, they modeled the probability mass function over y as a latent variable model

P(y) = ∫ P(y|v) p(v) dv,

where the latent variables v are continuous-valued. This is a convenient model to use, since the marginal p(v), living on a continuous sample space, can be modelled with a continuous NF. The distribution P(y|v) is known as the quantizer and is typically an indicator function 1[v ∈ B(y)], where B(y) is the quantization bin of y. Other works (Hoogeboom et al., 2019a; Tran et al., 2019) directly model P(y) with a discrete-valued flow, but these are known to be difficult to optimize. As an extension of dequantization, Ho et al. (2019) introduced a variational distribution q(v|y), called a dequantizer, and write a lower bound on the data log-likelihood using Jensen’s inequality as follows

log P(y) = log E_{v∼q(v|y)}[ P(y|v) p(v) / q(v|y) ] ≥ E_{v∼q(v|y)}[ log P(y|v) + log p(v) − log q(v|y) ].

Noting that the joint is p(y, v) = P(y|v) p(v), we see that the dequantizer distribution must be defined such that its support lies where P(y|v) > 0, otherwise log P(y|v) = −∞ and the lower bound is undefined. Restricting q to satisfy this condition (so that P(y|v) = 1 on its support) results in the following variational dequantization bound

log P(y) ≥ E_{v∼q(v|y)}[ log p(v) − log q(v|y) ].
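As a sanity check of the bound, the simplest dequantizer is uniform noise on the quantization bin, for which log q(v|y) = 0 and the bound reduces to a Monte Carlo average of log p(y + u). A minimal numpy sketch (function names are ours) for a scalar y:

```python
import numpy as np

rng = np.random.default_rng(0)

def dequant_bound(y, model_logpdf, n_samples=10000):
    # q(v|y) uniform on [y, y+1): log q(v|y) = 0, so the variational bound
    # reduces to a Monte Carlo estimate of E[log p(y + u)], u ~ U[0, 1)
    u = rng.random(n_samples)
    return np.mean(model_logpdf(y + u))
```

By Jensen's inequality this estimate lies below the true log-mass log ∫_{y}^{y+1} p(v) dv, which can be verified for, e.g., a standard normal p(v).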
2.2 Structured prediction

Structured prediction tasks, such as image segmentation or super-resolution, can be probabilistically framed as learning an unknown target distribution p*(y|x), with an input x and a target y. In practice with deep learning models, the unknown distribution is often learned by a factored model:

p(y|x) = ∏_{d=1}^{D} p(y_d | x),

where y_d represents the d-th dimension of y. Several loss-based optimization methods are a special case of this factored model. The mean squared error is equivalent to the negative log-likelihood of a product of normal distributions with equal and fixed standard deviation. Other examples are cross entropy, equivalent to the negative log-likelihood of a product of categorical distributions, and binary cross entropy, equivalent to that of a product of Bernoulli distributions.
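The MSE-to-Gaussian correspondence is easy to verify numerically; the following sketch (our own illustration, not from the paper) shows that the factored Gaussian negative log-likelihood equals MSE/(2σ²) plus a constant that does not depend on the prediction:

```python
import numpy as np

def factored_gaussian_nll(y, y_hat, sigma=1.0):
    # negative log-likelihood of a factored Gaussian with fixed std sigma
    return np.sum(0.5 * ((y - y_hat) / sigma) ** 2
                  + np.log(sigma) + 0.5 * np.log(2 * np.pi))
```

Minimizing this objective over y_hat is therefore exactly equivalent to minimizing the sum of squared errors.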

With factorized independent likelihoods, the individual dimensions of y are assumed to be conditionally independent. As a result, sampling leads to results with uncorrelated noise over the output dimensions. In the literature, a fix for this problem is to visualize the mode of the distribution and interpret that as a prediction. However, because the likelihood was optimized assuming a conditionally independent noise distribution, these modes tend to be blurry and lack crisp details.

3 Method

In this section we present our main innovations: i) learning conditional likelihoods using CNFs, and ii) a variational dequantization framework for binary random variables.

3.1 Conditional Normalizing Flows

We propose to learn conditional likelihoods using conditional normalizing flows for complicated target distributions in multivariate prediction tasks. Take an input x and a regression target y. We learn a complicated distribution p(y|x) using a conditional prior p_Z(z|x) and a mapping z = f_φ(y, x), which is bijective in y and z. The likelihood of this model is:

p(y|x) = p_Z(f_φ(y, x) | x) |det ∂f_φ(y, x)/∂y|.

Notice that the difference from the unconditional change of variables formula is that all distributions are conditioned on x and the flow f_φ has a conditioning argument x.

The generative process from x to y (shown in Figure 1) can be described by first sampling z from a simple base density with its parameters conditioned on x (for us this is a diagonal Gaussian) and then passing it through a sequence of bijective mappings y = f_φ^{-1}(z, x). This allows for modelling multimodal conditional distributions in y, which factored models typically cannot capture.

Figure 1: Diagram of our model in the training and sampling phases. Solid lines represent deterministic mappings and dashed lines represent sampling. The conditioning variable x enters the network in the base density and in the bijective mappings.

For the training procedure, the process runs in reverse. We begin with a label y and conditioning input x. We ‘flow’ the label through f_φ to yield z = f_φ(y, x), and then we evaluate the log-likelihood of the parameters of the prior given this transformed label. The flow and prior parameters can be optimized using stochastic gradient descent, training in minibatches in the usual fashion. Note that this style of training a conditional density model differs fundamentally from traditional models, because we compute the log-likelihood in z-space and not y-space. As a result, we are not biasing our results with an arbitrary choice of output-space likelihood or, in the case of this paper, a handcrafted image loss. Instead, one could interpret this method as learning the correlational and multimodal structure of the likelihood, or simply put, loss-learning.
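The training objective per example can be sketched as follows; `flow` and `prior_logpdf` are hypothetical stand-ins for the conditional flow and conditional prior, and the toy instances below are ours, not the paper's architecture:

```python
import numpy as np

def cnf_nll(y, x, flow, prior_logpdf):
    # forward the label through the conditional flow, score z under the
    # conditional prior, and correct with the log-Jacobian-determinant
    z, logdet = flow(y, x)
    return -(prior_logpdf(z, x) + logdet)

# toy volume-preserving conditional flow: z = y - x (logdet = 0)
toy_flow = lambda y, x: (y - x, 0.0)
# toy conditional prior: standard normal, ignoring x
toy_prior = lambda z, x: float(-0.5 * np.sum(z ** 2)
                               - 0.5 * z.size * np.log(2 * np.pi))
```

In practice this scalar would be averaged over a minibatch and minimized with a stochastic gradient optimizer.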

Conditional modules

In our work, the conditioning is introduced in the prior, the split priors, and the affine coupling modules. For the prior, we set the mean and variance as functions of x. For the split prior, we add x as a conditioning argument to the conditional. And for the affine coupling layers, we pass x to the scale and translation networks, so that

Conditional Prior:        p(z|x) = N(z; μ(x), σ(x))
Conditional Split Prior:  p(z_b|z_a, x) = N(z_b; μ(z_a, x), σ(z_a, x))
Conditional Coupling:     z'_a = z_a;   z'_b = z_b ⊙ exp(s(z_a, x)) + t(z_a, x)

In practice these functions are implemented using deep neural networks. First the conditioning term x is transformed into a rich representation u = g(x) using a large network g. Subsequently, each function in the flow is applied to a concatenation of u and the relevant part of z. For example, the translation of a conditional coupling is computed as t([z_a, u]).
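A conditional coupling with a shared context vector u = g(x) might look like the following numpy sketch (the `s_net`/`t_net` names are placeholders, not the paper's architecture):

```python
import numpy as np

def conditional_coupling(z, u, s_net, t_net):
    # u = g(x) is a feature representation of the conditioning input x;
    # it is concatenated with z_a before computing scale and translation
    za, zb = np.split(z, 2)
    h = np.concatenate([za, u])
    s, t = s_net(h), t_net(h)
    return np.concatenate([za, zb * np.exp(s) + t]), np.sum(s)

def conditional_coupling_inv(zp, u, s_net, t_net):
    # the inverse recomputes s, t from the untouched half and the context
    za, zb_new = np.split(zp, 2)
    h = np.concatenate([za, u])
    s, t = s_net(h), t_net(h)
    return np.concatenate([za, (zb_new - t) * np.exp(-s)])
```

The layer stays exactly invertible for any fixed u, because u is an input to s and t rather than part of the transformed variables.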

3.2 Variational Dequantization for Binary Random Variables

We generalize the variational dequantization scheme to the binary setting. Let y ∈ {0,1}^D be a multivariate binary random variable and v its dequantized representation. In Ho et al. (2019) the bound is not guaranteed to be tight, since there is a domain mismatch between the support of the flow and that of the dequantizer. Technically, if the flow is modeled as a bijective mapping from a Gaussian distribution where the mapping only has finite volume changes, then the support of the flow is unbounded. On the other hand, the support of the dequantizer is bounded, and so we have to either redefine the dequantizer to map to all of R^D or restrict the support of the flow to a bounded volume inside R^D. We resolve this by dequantizing with half-infinite noise, where

v = (2y − 1) ⊙ softplus(NN),

with NN denoting samples produced by the dequantization network. The softplus guarantees that samples from the neural network NN are only positive. If y_i is 1, the term outputs positive-valued noise, and if y_i is 0 the noise is negative-valued.
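One way to realize this scheme is sketched below (an illustration under our assumptions; the paper's dequantization network is a conditional flow, which we replace here with raw noise inputs):

```python
import numpy as np

def softplus(x):
    # strictly positive for all real x
    return np.log1p(np.exp(x))

def binary_dequantize(y, raw_noise):
    # half-infinite noise: positive where y = 1, negative where y = 0
    return (2 * y - 1) * softplus(raw_noise)

def quantize(v):
    # the quantizer recovers y by thresholding the dequantized value at zero
    return (v > 0).astype(int)
```

Thresholding at zero inverts the dequantization exactly, so the quantizer P(y|v) is 1 on the support of the dequantizer, as required for the variational bound.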

4 Related Work

Normalizing flows were originally introduced to machine learning to learn a flexible variational posterior, a conditional distribution, in VAEs (Rezende and Mohamed, 2015; Kingma et al., 2016a; van den Berg et al., 2018). Flow-based generative models (Dinh et al., 2016; Papamakarios et al., 2017; Huang et al., 2018; Kingma and Dhariwal, 2018; Hoogeboom et al., 2019b; Grathwohl et al., 2019; Cao et al., 2019; Chen et al., 2019) are typically trained directly in the data space. Several of these are designed to be fast to invert, which makes them suitable for drawing samples after training.

Different versions and applications of conditional normalizing flows include Agrawal and Dukkipati (2016), who utilize flows in the decoder of variational autoencoders (Kingma and Welling, 2014), conditioned on the latent variable. Trippe and Turner (2018) utilize conditional flows for prediction problems in a Bayesian framework for density estimation. Atanov et al. (2019) introduce a semi-conditional flow that provides an efficient way to learn from unlabeled data for semi-supervised classification problems. Very recently, Ardizzone et al. (2019) have proposed conditional flow-based generative models for image colorization, which differ from our work in training objective, architecture and applicability to binary segmentation. Autoregressive models (Van Oord et al., 2016) have also been studied for conditional image generation (van den Oord et al., 2016) but are generally slow to sample from.

Adversarial methods (Goodfellow et al., 2014) have widely been applied to (conditional) image density modeling tasks (Vu et al., 2019; Sajjadi et al., 2017b; Yuan et al., 2018; Mechrez et al., 2018), because they tend to generate high-fidelity images. Disadvantages of adversarial methods are that they can be complicated to train, and it is difficult to obtain likelihoods. For this reason, it can be hard to assess whether they are overfitting or generalizing.

5 Experiments

Here we describe our experiments on super-resolution and vessel segmentation. All models were implemented using the PyTorch framework.

5.1 Single Image Super Resolution

Single Image Super Resolution (SISR) methods aim to find a high-resolution image y given a single (downsampled) low-resolution image x. Framing this problem as learning a likelihood, we utilize a CNF to learn the distribution p(y|x). To compare our method, we also train a factorized baseline likelihood model with a comparable architecture and parameter budget. The factorized baseline uses a product of discretized logistic distributions (Kingma et al., 2016a; Salimans et al., 2017). All methods are compared on negative log-likelihood where available, which has the information-theoretic interpretation of bits per dimension. In addition, we evaluate using the SSIM (Wang et al., 2004) and PSNR metrics.

Implementation Details

The flow is based on the multi-scale architectures of Dinh et al. (2016) and Kingma and Dhariwal (2018), consisting of L levels with K subflows each. One subflow consists of an activation normalization, a 1 x 1 convolution, and our conditional coupling layer. After completing a level, half of the representation is factored out and modeled using our conditional split prior. After all levels have been completed, our conditional prior is used to model the final part of the latent variable.

The conditioning variable is transformed into the feature representation using Residual-in-Residual Dense Block (RRDB) architecture (Wang et al., 2018), consisting of 16 residual-in-residual blocks. To match the parameter budget, the channel growth is 55 for the baseline and the growth is 32 for the CNF.


The models are trained on the natural image datasets ImageNet32 and ImageNet64 (Chrabaszcz et al., 2017). Since the dataset has no test set, we use its validation images as a test set. For validation we take 10000 images from the training set. The performance is always reported on the test set unless specified otherwise. We evaluate our models on the widely used benchmark datasets Set5 (Bevilacqua et al., 2012), Set14 (Zeyde et al., 2012) and BSD100 (Huang et al., 2015). At test time, we pad the test images with zeros at the right and bottom so that they are square and compatible with squeeze layers. When evaluating SSIM and PSNR, we extract the patch with the exact image shape. For all datasets the LR images are obtained using MATLAB's bicubic kernel with anti-aliasing, following Wang et al. (2018). For these experiments, the pixel values are dequantized by adding uniform noise (Theis et al., 2016).

Training Settings

We train on ImageNet32 and ImageNet64 with mini-batches of size 64 and a learning rate of 0.0001, using the Adam optimizer (Kingma and Ba, 2015). The high-resolution target is the original 32x32 or 64x64 input image, respectively. The flow architecture is built with L levels of K subflows each.

(a) Low resolution (b) Ground truth (c) CNF sample (d) Baseline mode
Figure 2: Super resolution results on the Imagenet64 test data. Samples are taken from the CNF and the mode is visualized for the factorized baseline model. Best viewed electronically.

5.1.1 Evaluation

In this section the performance of CNFs for SISR is compared against a baseline likelihood model on ImageNet32 and ImageNet64. Their performance measured in log-likelihood (bits per dimension) is shown in Table 1, which shows that the CNF outperforms the factorized baseline in likelihood. Recall that the baseline model is factorized and conditionally independent. These results indicate that it is advantageous to capture the correlations and multi-modality present in the data.

Dataset      CNF    Factorized baseline
ImageNet32   3.01   4.00
ImageNet64   2.90   3.61
Table 1: Comparison of likelihood learning with CNFs and the factorized discrete baseline on ImageNet32 and ImageNet64, measured in bits per dimension.

Super resolution samples from the Imagenet64 test data, together with the distribution mode of the factorized baseline, are shown in Figure 2. The baseline is able to learn a relationship between the conditioning variable x and the output y, but lacks crisp details. In the super-resolution images from the CNF, notice there are more high-frequency components modelled, for instance in grass and in hairs. Following Kingma and Dhariwal (2018), we sample from the base distribution with a temperature of 0.8 to achieve the best perceptual quality for the distribution learned by the CNF.
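Temperature sampling only rescales the standard deviation of the base density before decoding through the flow; a small sketch under that assumption (names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_base(mu, sigma, temperature=0.8, n=100000):
    # scaling the base std by T < 1 concentrates samples near the mode,
    # trading sample diversity for perceptual quality after decoding
    return mu + temperature * sigma * rng.standard_normal(n)
```

The drawn z would then be pushed through the inverse flow, conditioned on x, to obtain an image sample.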

As there is no standard metric for measuring perceptual quality, we report PSNR and SSIM between our predicted image and the ground-truth image in Table 2. We compare CNFs to other state-of-the-art per-pixel loss-based methods and the factorized baseline for a 2x upsampling task on standard super-resolution benchmarks. Where available, we report negative log-likelihood (bpd), computed as an average over 1000 randomly cropped 128 x 128 patches. Note that without any hyperparameter tuning or compositional loss weighting, as is typical in SISR, the CNF performs competitively with state-of-the-art super-resolution methods by simply optimizing the likelihood. The SSIM scores of the baseline are on par with or better than those of the adversarial methods and the CNF on all benchmarks. On PSNR, however, the CNF beats the factorized discrete baseline. The samples in Figure 3 show that the CNF predictions have more fine-grained texture details. Comparing this finding with the baseline, which outperforms every method on SSIM, shows that metrics can be misleading.

Notice how samples from an independent factorized likelihood model have a lot of color noise, whereas samples from the CNF do not have such problems. Increasing the temperature increases high-level detail, and we find that a temperature of 0.8 strikes a balance between noise smoothing and detail. This can be attributed to the ability of flows to model pixel correlations among output dimensions for high-dimensional data such as images.

Figure 3: Conditional samples from the CNF (ours) at several sampling temperatures, and from the factorized discrete baseline, for 2x upscaling. The conditioning image is a baboon from the Set14 test set. Both models were trained on ImageNet64. Best viewed electronically.
Set5 Set14 BSD100
Model Type bpd PSNR SSIM bpd PSNR SSIM bpd PSNR SSIM
Bicubic - 33.7 0.930 - 30.2 0.869 - 29.6 0.843
SRCNN - 36.7 0.954 - 32.4 0.906 - 31.4 0.888
PSyCO - 36.9 0.956 - 32.6 0.898 - 31.4 0.890
ENet - 37.3 0.958 - 33.3 0.915 - 32.0 0.898
LL Baseline 2.34 32.5 0.958 3.23 31.0 0.917 3.20 30.6 0.900
CNF (ours) 2.11 36.2 0.957 2.51 32.5 0.911 2.33 31.4 0.893
Table 2: CNF compared to the factorized discrete baseline and adversarial and pixel-wise methods (Dong et al., 2015; Sajjadi et al., 2017a; Pérez-Pellitero et al., 2016), based on negative log-likelihood (bits per dimension, bpd), PSNR and SSIM for 2x upscaling. Our methods were trained on ImageNet64.

5.2 Vessel segmentation

Figure 4: Example of retinal segmentations using DRIU, our likelihood baseline trained with the same loss, and our CNF. For the CNF, the mean of 100 samples is visualized. Notice that our segmentations more accurately capture the vessel width, which is overdilated in the DRIU and factored models.

Vessel segmentation is an important, long-standing medical imaging problem, where we seek to segment blood vessels in pictures of the retina (the back of the eye). This is a difficult task, because the vessels are thin and of varying thickness. A likelihood function commonly used in segmentation is a weighted Bernoulli distribution

p(y|x) ∝ ∏_i p_i^{w y_i} (1 − p_i)^{1 − y_i},

where p_i is the predicted probability that pixel i is positive (vessel class) and w is a class-balancing constant. This likelihood is preferred because it accounts for the apparent class imbalance in the ratio of vessels to background. In practice, the numerator of this likelihood is used as a loss function and the normalizer is ignored. The resulting loss is called a weighted cross-entropy. In our experiments we train using the weighted cross-entropy (as in the literature), but we report likelihood values including the normalizer, for a meaningful comparison.
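The weighted cross-entropy described above (the negative log of the unnormalized weighted likelihood) can be written directly; a small numpy sketch with an assumed weight value (names are ours):

```python
import numpy as np

def weighted_cross_entropy(p, y, w):
    # negative log of the unnormalized weighted Bernoulli likelihood:
    # positive (vessel) pixels, where y = 1, are up-weighted by w
    return -np.sum(w * y * np.log(p) + (1 - y) * np.log(1 - p))
```

With w > 1 the model pays more for missing a vessel pixel than for falsely marking a background pixel, which is the behavior the class-imbalance weighting is meant to induce.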

Dataset and comparisons

We test on the DRIVE database (Staal et al., 2004), consisting of 8-bit RGB images split into 20 train and 20 test images. To compare against other methods, we plot precision-recall curves, report the maximum F-score along each curve (shown as a dot in the graph), report the bits per dimension, and plot distributions in PR-space. The main CNN-based contenders are Deep Retinal Image Understanding (DRIU) (Maninis et al., 2016) and Holistically-Nested Edge Detection (HED) (Xie and Tu, 2015), both of which are instances of Deeply-Supervised Nets (Lee et al., 2015). The main difference between DRIU and HED is that DRIU is pretrained on ImageNet (Krizhevsky et al., 2012), whereas HED is not. Results for other competing methods are cited from Maninis et al. (2016). For fairness, we also train a model which we call the likelihood baseline, which uses the exact same architecture as the flow but is run as a feedforward model and trained with the weighted Bernoulli loss.

SE LD Wavelets Human HED KB DRIU Factored (ours) CNF Uniform (Ours) CNF (ours)
bpd - - - - - - - -
F-Score 0.658 0.692 0.762 0.791 0.794 0.800 0.805 0.821
Table 3: Numerical results on the DRIVE dataset. We see that the CNF is in the range of the SOTA model DRIU. SE: Structured Forests (Dollár and Zitnick, 2013), LD: Line Detector (Ricci and Perfetti, 2007), Wavelets (Soares et al., 2006), Human (Staal et al., 2004), HED: Holistic Edge Detector (Xie and Tu, 2015), KB: Kernel Boost (Becker et al., 2013), N4: N4-Fields (Ganin and Lempitsky, 2014), DRIU: Deep Retinal Image Understanding (Maninis et al., 2016). Our results are shown in mean ± standard deviation form, where statistics are taken over 5 runs.
Layer Type Res.
0 input 1024
1 block 1024
2 max-pool 512
3 block 512
4 block 512
5 max-pool 256
6 block 256
7 block 256
Table 4: Feature extractor architecture for retinal vessel segmentation. Res. abbreviates resolution. Outputs are at layer 4 and 7.

The flow is identical to the flow used in the previous section, with some key differences: i) instead of activation normalization, we use instance normalization (Ulyanov et al., 2016); ii) the conditional affine coupling layers do not contain a scaling component but just the translation t, and hence are volume-preserving; and iii) we train using variational dequantization. Since the data is binary-valued, we dequantize according to the Flow++ scheme of Ho et al. (2019), modified to binary variables (see Section 3.2), using a CNF at just a single scale for the dequantizer. The CNF is conditioned on resolution-matched features extracted from a VGG-like network (Simonyan and Zisserman, 2015). This model is composed of blocks of convolutional layers followed by 2x2 max-pooling layers, as shown in Table 4. All filter sizes are 3x3. The outputs are taken at layers 4 and 7; these are used to condition the resolution-512 and resolution-256 levels of the CNF, respectively.

Training/test settings

We train using the Adam optimizer at learning rate 0.001 and a minibatch size of 2 for 2000 epochs. All images are padded to 1024x1024 pixels, so that they are compatible with squeeze layers. We use random rotation augmentation, random isotropic scalings, and shears drawn from a zero-mean normal distribution. At test time we draw samples from our model and compare those against the ground-truth labels. This contrasts with other methods, which measure labels against thresholded versions of a factorized predictive distribution. To create the PR curve in Figure 5, we take the average of 100 samples and threshold the resulting map (example shown in Figure 4). While crude, this mean image is useful for defining a PR-curve, since there is no great topology change between samples.
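The mean-then-threshold procedure for turning model samples into a soft map and a hard segmentation can be sketched as follows (names are ours):

```python
import numpy as np

def mean_then_threshold(samples, tau=0.5):
    # average binary samples into a soft probability map, then threshold
    # it at tau to obtain one hard segmentation (one point in PR-space)
    soft = np.mean(samples, axis=0)
    return soft, (soft >= tau).astype(int)
```

Sweeping tau over [0, 1] on the soft map traces out the full precision-recall curve.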


The results of our experiments are shown in Table 3 and Figure 5, with a visualization in Figure 4. We see in the table that the CNF trained with our binary dequantization achieves the best bits per dimension, with an F-score comparable to the state-of-the-art model (DRIU), but our model does not require pretraining on ImageNet. Interestingly, we found training a flow with uniform dequantization slightly unstable, and the results were far from satisfactory. In the PR-curve of Figure 5, our binary-dequantized CNF shows a curve comparable to the DRIU model. These results, however, say nothing about the calibration of the probability outputs, only that the various probability predictions are well ranked. To gain insight into the calibration of the probabilities, we measure the distribution of precision and recall values for point samples drawn from all models, including a second human grader present in the original DRIVE dataset. We synthesized samples from the factored distributions (all except ours and ‘human’) by sampling images from a factored Bernoulli with mean equal to the soft image. We see the results in the right-hand plot of Figure 5, which shows that while the other CNN-based methods such as DRIU or HED have good precision, they suffer in terms of recall. On the other hand, the CNF drops in precision a little, but makes up for this in terms of high recall, with a PR distribution overlapping the human grader. This indicates that the CNF has learned a well-calibrated distribution, compared to the baseline methods. Further evidence of this is seen in the visualization in Figure 4, which shows details from the predicted means (soft images). It shows that DRIU and the likelihood baseline over-dilate segmentations and the CNF does not. This can be explained by the fact that under the weighted Bernoulli it is cheaper to over-dilate than to under-dilate. Since the CNF contains no handcrafted loss function, we circumvent this pathology.

(a) PR curve (b) PR scatter plot
Figure 5: Here we show two visualizations of the same data. Left: we show the PR-curves generated by sweeping a threshold on the soft images output by each listed method. Maximal F-scores for each curve are shown as circles, with the green lines indicating constant F-score. We see that our method beats all traditional methods and is on par with DRIU, which, unlike ours, was pretrained on Imagenet. Right: we show a scatter plot in PR-space of samples drawn from each model. To draw samples from the factored models, we sample images from a factored Bernoulli with mean equal to the soft image. We see that the DRIU and HED models, while having good precision, have poor recall in this regime. This indicates that while their networks produce a good ranking of probabilities, the values of the probabilities are poorly calibrated. Our method drops slightly in precision, but gains greatly in terms of recall, indicating that our samples are drawn from a better-calibrated distribution, overlapping significantly with the human distribution.

6 Conclusion

In this paper we propose to learn likelihoods of conditional distributions using conditional normalizing flows. In this setting, supervised prediction tasks can be framed probabilistically. In addition, we propose a generalization of variational dequantization for binary random variables, which is useful for binary segmentation problems. Experimentally, we show performance competitive with existing methods in the domains of super-resolution and binary image segmentation.


  • S. Agrawal and A. Dukkipati (2016) Deep variational inference without pixel-wise reconstruction. CoRR abs/1611.05209. Cited by: §4.
  • L. Ardizzone, C. Lüth, J. Kruse, C. Rother, and U. Köthe (2019) Guided image generation with conditional invertible neural networks. CoRR abs/1907.02392. External Links: Link, 1907.02392 Cited by: §4.
  • A. Atanov, A. Volokhova, A. Ashukha, I. Sosnovik, and D. Vetrov (2019) Semi-conditional normalizing flows for semi-supervised learning. Cited by: §4.
  • C. J. Becker, R. Rigamonti, V. Lepetit, and P. Fua (2013) Supervised feature learning for curvilinear structure segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2013 - 16th International Conference, Nagoya, Japan, September 22-26, 2013, Proceedings, Part I, pp. 526–533. External Links: Link, Document Cited by: Table 3.
  • M. Bevilacqua, A. Roumy, C. Guillemot, and M. Alberi-Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, Cited by: §5.1.
  • N. D. Cao, W. Aziz, and I. Titov (2019) Block neural autoregressive flow. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, pp. 511. Cited by: §4.
  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2016) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR abs/1606.00915. External Links: Link, 1606.00915 Cited by: §1.
  • R. T. Q. Chen, J. Behrmann, D. Duvenaud, and J. Jacobsen (2019) Residual flows for invertible generative modeling. CoRR abs/1906.02735. Cited by: §4.
  • P. Chrabaszcz, I. Loshchilov, and F. Hutter (2017) A downsampled variant of imagenet as an alternative to the CIFAR datasets. CoRR abs/1707.08819. External Links: Link, 1707.08819 Cited by: §5.1.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using real NVP. CoRR abs/1605.08803. External Links: Link, 1605.08803 Cited by: Appendix A, §1, §2.1, §2.1, §2.1, §4, §5.1.
  • P. Dollár and C. L. Zitnick (2013) Structured forests for fast edge detection. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pp. 1841–1848. External Links: Link, Document Cited by: Table 3.
  • C. Dong, C. C. Loy, K. He, and X. Tang (2015) Image super-resolution using deep convolutional networks. CoRR abs/1501.00092. External Links: Link, 1501.00092 Cited by: Table 2.
  • Y. Ganin and V. S. Lempitsky (2014) N-fields: neural network nearest neighbor fields for image transforms. CoRR abs/1406.6558. External Links: Link Cited by: Table 3.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2672–2680. External Links: Link Cited by: §1, §1, §4.
  • W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud (2019) FFJORD: free-form continuous dynamics for scalable reversible generative models. In 7th International Conference on Learning Representations, ICLR 2019, Cited by: §4.
  • J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel (2019) Flow++: improving flow-based generative models with variational dequantization and architecture design. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 2722–2730. External Links: Link Cited by: §2.1, §3.2, §5.2.
  • E. Hoogeboom, J. W. T. Peters, R. van den Berg, and M. Welling (2019a) Integer discrete flows and lossless compression. CoRR abs/1905.07376. External Links: Link, 1905.07376 Cited by: §2.1.
  • E. Hoogeboom, R. van den Berg, and M. Welling (2019b) Emerging convolutions for generative normalizing flows. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Cited by: §4.
  • C. Huang, D. Krueger, A. Lacoste, and A. C. Courville (2018) Neural autoregressive flows. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 2083–2092. Cited by: §4.
  • J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5197–5206. External Links: Document, ISSN Cited by: §5.1.
  • D. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, Cited by: Appendix A, §5.1.
  • D. P. Kingma and M. Welling (2014) Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, Cited by: §1, §4.
  • D. P. Kingma, T. Salimans, and M. Welling (2016a) Improving variational inference with inverse autoregressive flow. CoRR abs/1606.04934. External Links: Link, 1606.04934 Cited by: §4, §5.1.
  • D. P. Kingma, T. Salimans, and M. Welling (2016b) Improving variational inference with inverse autoregressive flow. CoRR abs/1606.04934. External Links: Link, 1606.04934 Cited by: §2.1.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 10236–10245. External Links: Link Cited by: Appendix A, §2.1, §4, §5.1, §5.1.1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pp. 1106–1114. External Links: Link Cited by: §5.2.
  • C. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu (2015) Deeply-supervised nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, External Links: Link Cited by: §5.2.
  • K. Maninis, J. Pont-Tuset, P. A. Arbeláez, and L. V. Gool (2016) Deep retinal image understanding. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016 - 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II, pp. 140–148. External Links: Link, Document Cited by: §5.2, Table 3.
  • R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor (2018) Learning to maintain natural image statistics. CoRR abs/1803.04626. External Links: Link, 1803.04626 Cited by: §4.
  • S. Mohamed and D. J. Rezende (2015) Variational information maximisation for intrinsically motivated reinforcement learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, Cambridge, MA, USA, pp. 2125–2133. External Links: Link Cited by: §1, §2.1.
  • G. Papamakarios, T. Pavlakou, and I. Murray (2017) Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 2338–2347. External Links: Link Cited by: §4.
  • E. Pérez-Pellitero, J. Salvador, J. Ruiz-Hidalgo, and B. Rosenhahn (2016) Manifold span reduction for super resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)., Cited by: Table 2.
  • S. J. Prince (2012) Computer vision: models, learning, and inference. Cambridge University Press. Cited by: §1.
  • D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1530–1538. External Links: Link Cited by: §1, §4.
  • E. Ricci and R. Perfetti (2007) Retinal blood vessel segmentation using line operators and support vector classification. IEEE Trans. Med. Imaging 26 (10), pp. 1357–1365. External Links: Link, Document Cited by: Table 3.
  • M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch (2017a) EnhanceNet: single image super-resolution through automated texture synthesis. In Proceedings IEEE International Conference on Computer Vision (ICCV), Piscataway, NJ, USA, pp. 4501–4510. External Links: Link Cited by: Table 2.
  • M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch (2017b) EnhanceNet: single image super-resolution through automated texture synthesis. In Proceedings IEEE International Conference on Computer Vision (ICCV), Piscataway, NJ, USA, pp. 4501–4510. External Links: Link Cited by: §4.
  • T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma (2017) PixelCNN++: improving the pixelcnn with discretized logistic mixture likelihood and other modifications. In 5th International Conference on Learning Representations, ICLR 2017, Cited by: §5.1.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §5.2.
  • J. V. B. Soares, J. J. G. Leandro, R. M. Cesar, H. F. Jelinek, and M. J. Cree (2006) Retinal vessel segmentation using the 2-d gabor wavelet and supervised classification. IEEE Trans. Med. Imaging 25 (9), pp. 1214–1222. External Links: Link, Document Cited by: Table 3.
  • J. Staal, M. D. Abràmoff, M. Niemeijer, M. A. Viergever, and B. van Ginneken (2004) Ridge-based vessel segmentation in color images of the retina. IEEE Trans. Med. Imaging 23 (4), pp. 501–509. External Links: Link, Document Cited by: §1, §5.2, Table 3.
  • L. Theis, A. van den Oord, and M. Bethge (2016) A note on the evaluation of generative models. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, External Links: Link Cited by: §2.1, §5.1.
  • D. Tran, K. Vafa, K. K. Agrawal, L. Dinh, and B. Poole (2019) Discrete flows: invertible generative models of discrete data. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019, External Links: Link Cited by: §2.1.
  • B. L. Trippe and R. E. Turner (2018) Conditional density estimation with Bayesian normalising flows. CoRR abs/1802.04908. External Links: Link Cited by: §4.
  • D. Ulyanov, A. Vedaldi, and V. S. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. CoRR abs/1607.08022. External Links: Link, 1607.08022 Cited by: §5.2.
  • R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling (2018) Sylvester normalizing flows for variational inference. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, pp. 393–402. Cited by: §4.
  • A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves (2016) Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pp. 4790–4798. Cited by: §4.
  • A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. In International Conference on Machine Learning, pp. 1747–1756. Cited by: §1, §4.
  • T. Vu, T. Luu, and C. Yoo (2019) Perception-enhanced image super-resolution via relativistic generative adversarial networks. In Computer Vision - ECCV 2018 Workshops, Munich, Germany, September 8-14, 2018, Proceedings, Part V, pp. 98–113. External Links: ISBN 978-3-030-11020-8, Document Cited by: §4.
  • X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang (2018) ESRGAN: enhanced super-resolution generative adversarial networks. CoRR abs/1809.00219. External Links: Link, 1809.00219 Cited by: Table 8, §1, §5.1, §5.1.
  • Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. Image Processing, IEEE Transactions on 13, pp. 600 – 612. External Links: Document Cited by: §5.1.
  • S. Xie and Z. Tu (2015) Holistically-nested edge detection. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1395–1403. External Links: Link, Document Cited by: §5.2, Table 3.
  • Y. Yuan, S. Liu, J. Zhang, Y. Zhang, C. Dong, and L. Lin (2018) Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 814–81409. Cited by: §4.
  • R. Zeyde, M. Elad, and M. Protter (2012) On single image scale-up using sparse-representations. In Proceedings of the 7th International Conference on Curves and Surfaces, Berlin, Heidelberg, pp. 711–730. External Links: ISBN 978-3-642-27412-1, Link, Document Cited by: §5.1.

Appendix A Architectures

This section describes the architecture and optimization details of the conditional normalizing flow network, the low-resolution image feature extractor, and the shallow convolutional neural networks inside the conditional coupling layers.

The conditional coupling layer is shown schematically in Figure 6. It shows that, conditioned on an input x, we are able to build a relatively straightforward invertible mapping between latent representations z1 and z2, which have been partitioned into vectors of equal dimension.

(a) Forward (b) Reverse
Figure 6: The forward and reverse paths of the conditional coupling layer. In our experiments we concatenate an embedding of the conditioning input x to the latent z1, which is fed through another neural network to output the affine transformation parameters applied to z2. This operation is invertible in z1 and z2, but not in x.
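As a concrete sketch, the forward and reverse paths of Figure 6 can be written as follows. This is a toy, pure-Python version under our own naming: `params_net` is a fixed stand-in for the shallow conditioning CNN described above, and the vectors stand in for flattened feature maps.

```python
import math

def params_net(z1, x_embed):
    # Stand-in for the shallow conditioning CNN: maps the concatenation of
    # z1 and the embedding of x to a log-scale and a shift per dimension.
    # The tanh keeps the log-scale bounded, a common stabilization choice.
    h = [a + b for a, b in zip(z1, x_embed)]
    log_s = [math.tanh(v) for v in h]
    t = [0.5 * v for v in h]
    return log_s, t

def coupling_forward(z1, z2, x_embed):
    # z1 passes through unchanged; z2 is transformed affinely with
    # parameters computed from (z1, x). The Jacobian is triangular,
    # so log|det J| is simply the sum of the log-scales.
    log_s, t = params_net(z1, x_embed)
    y2 = [z * math.exp(s) + b for z, s, b in zip(z2, log_s, t)]
    return z1, y2, sum(log_s)

def coupling_inverse(y1, y2, x_embed):
    # The inverse re-evaluates params_net on the untouched half, so it
    # needs x as well: the layer is invertible in z, but not in x.
    log_s, t = params_net(y1, x_embed)
    z2 = [(y - b) * math.exp(-s) for y, s, b in zip(y2, log_s, t)]
    return y1, z2
```

Because the transformation parameters depend only on the untouched half z1 and the conditioning input, the inverse is exact and costs the same as the forward pass.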

Details for the CNFs are given in Table 5, and the details of the individual coupling layers in Tables 6 and 7. The architecture of the feature extractor is given in Table 8. The architecture consists of multiple levels, each containing several subflows (see Table 5 for the exact numbers), following (Dinh et al., 2016; Kingma and Dhariwal, 2018). All networks are optimized using Adam (Kingma and Ba, 2015) for 200,000 iterations.

Dataset Minibatch Size Levels Sub-flows Learning rate
ImageNet32 64 2 8 0.0001
ImageNet64 64 2 8 0.0001
DRIVE 2 2 2 0.001
Table 5: Configuration of the CNF architectures for the super-resolution (ImageNet32/64) and vessel segmentation (DRIVE) tasks.
Layer Intermediate Channels Kernel size
Conv2d 512
Conv2d 512
Table 6: Architecture details for a single coupling layer in the super-resolution task. The first two convolutional layers are followed by a ReLU activation.
Layer Intermediate Channels Kernel size
Conv2d 32
InstanceNorm2d 32 -
ReLU - -
Table 7: Architecture details for a single coupling layer in the DRIVE segmentation task. The first two convolutional layers are followed by a ReLU activation.
Model Type RRDB blocks Channel growth Context Channels
CNF 16 32 128
Factorized LL 16 55 128
Table 8: Architecture details for the conditioning network in the super-resolution task. Residual-in-residual dense blocks (Wang et al., 2018) are used. The channel growth of the factorized baseline is adjusted so that the CNF and the baseline have an equal number of parameters.

Appendix B Conditional Image Generation

In this section, larger versions of the ImageNet64 samples are provided, sampled at different temperatures τ.
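Temperature sampling here amounts to drawing the base-density sample with a reduced standard deviation, z ~ N(0, τ²I), before pushing it through the inverse flow. A quick sketch, in our own notation rather than the paper's code:

```python
import random

def sample_base(dim, temperature, rng):
    # Draw z ~ N(0, tau^2 I) by scaling standard-normal draws by tau.
    # tau < 1 concentrates samples near the mode of the learned
    # conditional density; tau = 1 recovers the model's full distribution.
    return [temperature * rng.gauss(0.0, 1.0) for _ in range(dim)]
```

The resulting z is then mapped through the inverse flow, conditioned on the low-resolution input, to produce the super-resolved sample; lower temperatures trade sample diversity for fidelity.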

Figure 7: Super-resolution results for the CNF trained on ImageNet64, sampled at temperature τ.
Figure 8: Super-resolution results for the CNF trained on ImageNet64, sampled at temperature τ.
Figure 9: Super-resolution results for the CNF trained on ImageNet64, sampled at temperature τ.