Master thesis for the MSc. Artificial Intelligence at the University of Amsterdam, 2019. Topic: Super-resolution with Conditional Normalizing Flows.
Normalizing Flows (NFs) are able to model complicated distributions p(y) with strong inter-dimensional correlations and high multimodality by transforming a simple base density p(z) through an invertible neural network under the change of variables formula. Such behavior is desirable in multivariate structured prediction tasks, where handcrafted per-pixel loss-based methods inadequately capture strong correlations between output dimensions. We present a study of conditional normalizing flows (CNFs), a class of NFs where the base-density-to-output-space mapping is conditioned on an input x, to model conditional densities p(y|x). CNFs are efficient in sampling and inference, they can be trained with a likelihood-based objective, and, being generative flows, they do not suffer from mode collapse or training instabilities. We provide an effective method to train continuous CNFs for binary problems and, in particular, we apply these CNFs to super-resolution and vessel segmentation tasks, demonstrating competitive performance on standard benchmark datasets in terms of likelihood and conventional metrics.
Learning conditional distributions
is one of the oldest problems in machine learning. When the output is high-dimensional, this is a particularly challenging task, and the practitioner is left with many design choices. Do we factorize the conditional? If not, do we model correlations with, say, a conditional random field (Prince, 2012)? Do we use a unimodal distribution? How fat should the tails be? Do we use an explicit likelihood at all, or do we use implicit methods (Mohamed and Rezende, 2015) such as a GAN (Goodfellow et al., 2014)? Do we quantize the output? Ideally, the practitioner should not have to make design choices at all, and the distribution should be learned from the data.
In the field of density estimation, normalizing flows (NFs) are a relatively new family of models (Rezende and Mohamed, 2015). NFs model complicated high-dimensional marginal distributions by transforming a simple base distribution or prior through a learnable, invertible mapping and then applying the change of variables formula. NFs are efficient in inference and sampling, are able to learn inter-dimensional correlations and multi-modality, and they are exact likelihood models, amenable to gradient-based optimization.
Flow-based generative models (Dinh et al., 2016) are generally trained on the image space, and are in some cases computationally efficient in both the forward and inverse direction. These are advantageous over other likelihood-based methods because i) sampling is efficient, as opposed to autoregressive models (Van Oord et al., 2016), and ii) flows admit exact likelihood optimization, in contrast with variational autoencoders (Kingma and Welling, 2014).
Conditional random fields directly model correlations between pixels, and have been fused with deep learning (Chen et al., 2016). However, they require the practitioner to choose which pixels have pairwise interactions. Another approach uses adversarial training (Goodfellow et al., 2014). A downside is that the training procedure can be unstable and that such models are difficult to evaluate quantitatively.
We propose to learn the likelihood of conditional distributions with few modeling choices using Conditional Normalizing Flows (CNFs). CNFs can be harnessed for conditional distributions by conditioning the prior and the invertible mapping on the input x. In particular, we apply conditional flows to super-resolution (Wang et al., 2018) and vessel segmentation (Staal et al., 2004). We evaluate their performance gains on multivariate prediction tasks alongside architecturally-matched factored baselines, comparing likelihood and application-specific evaluation metrics.
In the following, we present the relevant background material on normalizing flows and structured prediction. This section covers the change of variables formula, invertible modules, variational dequantization and conventional likelihood optimization.
A standard NF in continuous space is based on a simple change of variables formula. Given two spaces Y and Z of equal dimension; a once-differentiable, parametric, bijective mapping f: Z -> Y (injective mappings can also be used), where θ are the parameters of f; and a prior distribution p(z); we can model a complicated distribution p(y) as

p(y) = p(z) |det ∂f/∂z|^(-1), with z = f^(-1)(y).
The term |det ∂f/∂z| is the Jacobian determinant of f, evaluated at z, and it accounts for volume changes induced by f. The transformation introduces correlations and multi-modality in p(y). The main challenge in the field of normalizing flows is designing the transformation f. It has to i) be bijective, ii) have an efficient and tractable Jacobian determinant, and iii) come from a 'flexible' model class. In addition, iv) for fast sampling the inverse f^(-1) needs to be efficiently computable. Below we briefly state which invertible modules are used in our architectures, obeying the aforementioned points.
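The change of variables computation can be sketched in a few lines. This is a minimal NumPy illustration (not the thesis code) using a one-dimensional affine map y = a·z + b with a standard normal prior, for which the resulting density is analytically N(b, a²):

```python
import numpy as np

# Minimal sketch of the change of variables formula:
#   p_Y(y) = p_Z(f^{-1}(y)) * |d f^{-1} / dy|
# with a 1-D affine flow y = f(z) = a*z + b (bijective for a != 0).

def f(z, a=2.0, b=1.0):
    return a * z + b

def f_inv(y, a=2.0, b=1.0):
    return (y - b) / a

def log_p_y(y, a=2.0, b=1.0):
    z = f_inv(y, a, b)
    log_p_z = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)  # standard normal prior
    log_det_inv = -np.log(abs(a))                    # |d f^{-1}/dy| = 1/|a|
    return log_p_z + log_det_inv

# Since y = 2z + 1 with z ~ N(0, 1), analytically y ~ N(1, 4).
y = 3.0
analytic = -0.5 * ((y - 1.0) / 2.0)**2 - 0.5 * np.log(2 * np.pi * 4.0)
assert np.isclose(log_p_y(y), analytic)
```

A deep flow replaces the affine map with a composition of the invertible modules below; the log-determinants of the individual layers simply add up.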
Affine coupling layers (Dinh et al., 2016) are invertible, nonlinear layers. They work by splitting the input y into two components y_a and y_b and nonlinearly transforming y_b as a function of y_a, before reconcatenating the result. Writing the output as z = [z_a, z_b], this is

z_a = y_a,  z_b = y_b ⊙ exp(s(y_a)) + t(y_a),
where the scale s and translation t functions can be any functions, typically implemented with a CNN. Similar conditioning with normalizing flows has been done in previous works by Mohamed and Rezende (2015) and Kingma et al. (2016b).
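A coupling layer of this form can be sketched as follows. The s and t here are tiny hand-rolled stand-ins for the CNNs used in practice; the point is that the layer inverts exactly regardless of how complicated s and t are:

```python
import numpy as np

# Sketch of an affine coupling layer.  Only the second half of the input
# is transformed, as a function of the untouched first half, so the
# inverse is available in closed form.

def s(ya): return np.tanh(ya)    # stand-in for a scale CNN
def t(ya): return 0.5 * ya       # stand-in for a translation CNN

def coupling_forward(y):
    ya, yb = np.split(y, 2)
    zb = yb * np.exp(s(ya)) + t(ya)
    return np.concatenate([ya, zb])  # log-det = sum(s(ya)), cheap to compute

def coupling_inverse(z):
    za, zb = np.split(z, 2)
    yb = (zb - t(za)) * np.exp(-s(za))
    return np.concatenate([za, yb])

y = np.array([0.3, -1.2, 0.7, 2.0])
assert np.allclose(coupling_inverse(coupling_forward(y)), y)
```

Because s and t never need to be inverted themselves, they can be arbitrarily deep networks without affecting invertibility.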
Proposed in Kingma and Dhariwal (2018), invertible 1x1 convolutions help mix information across the channel dimension. We implement them as regular 1x1 convolutions and, for the inverse, we convolve with the inverse of the kernel.
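A 1x1 convolution is, per pixel, just a matrix multiplication over the channel axis, so its inverse and log-determinant are those of the kernel matrix. A minimal NumPy sketch (all shapes and the near-identity initialization are illustrative choices, not the thesis configuration):

```python
import numpy as np

# Sketch of an invertible 1x1 convolution: each pixel's channel vector is
# multiplied by a learned C x C matrix W; the inverse convolves with
# W^{-1}, and the log-Jacobian-determinant is H * W_spatial * log|det W|.

def conv1x1(x, W):
    # x has shape (C, H, W); einsum applies W across the channel axis
    return np.einsum('ij,jhw->ihw', W, x)

rng = np.random.default_rng(0)
C, H, Wsp = 4, 8, 8
W = rng.normal(size=(C, C)) + np.eye(C)   # near-identity init, invertible here
x = rng.normal(size=(C, H, Wsp))

z = conv1x1(x, W)
x_rec = conv1x1(z, np.linalg.inv(W))
assert np.allclose(x_rec, x)

log_det = H * Wsp * np.log(abs(np.linalg.det(W)))  # volume change term
```

In practice the kernel can be parameterized (e.g. via an LU decomposition) so that the determinant stays cheap to compute during training.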
Squeeze layers (Dinh et al., 2016) are used to compress the spatial resolution of activations. These also help with increasing the spatial receptive field of pixels in the deeper activations.
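A squeeze layer is a pure reshaping, trading each 2x2 spatial block for four channels, so it is trivially invertible with unit Jacobian determinant. A minimal sketch:

```python
import numpy as np

# Sketch of a squeeze layer: (C, H, W) -> (4C, H/2, W/2) and back.

def squeeze(x):
    C, H, W = x.shape
    x = x.reshape(C, H // 2, 2, W // 2, 2)
    return x.transpose(0, 2, 4, 1, 3).reshape(4 * C, H // 2, W // 2)

def unsqueeze(z):
    C4, H2, W2 = z.shape
    z = z.reshape(C4 // 4, 2, 2, H2, W2)
    return z.transpose(0, 3, 1, 4, 2).reshape(C4 // 4, H2 * 2, W2 * 2)

x = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)
z = squeeze(x)
assert z.shape == (12, 2, 2)
assert np.allclose(unsqueeze(z), x)
```

This is why inputs are padded to dimensions divisible by powers of two in the experiments below.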
Split priors (Dinh et al., 2016) work by splitting a set of activations z into two components z_1 and z_2. We then condition z_2 on z_1 using a simple base density, e.g. N(z_2; μ(z_1), σ(z_1)), where μ and σ are neural networks. The component z_1 can be modeled by further flow layers. This prior is useful for modeling hierarchical correlations between dimensions, and it also helps reduce computation, since z_1 is reduced in size.
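A split prior can be sketched as below; the μ and σ networks are replaced by tiny hand-rolled functions purely for illustration:

```python
import numpy as np

# Sketch of a split prior: half of the activations (z2) is modelled
# immediately with a Gaussian whose mean/std depend on the other half
# (z1); z1 continues through further flow layers.

def mu_net(z1):  return 0.1 * z1           # stand-in for a small CNN
def sig_net(z1): return np.exp(0.05 * z1)  # positive std via exp

def split_prior_logp(z):
    z1, z2 = np.split(z, 2)
    mu, sig = mu_net(z1), sig_net(z1)
    logp_z2 = (-0.5 * ((z2 - mu) / sig)**2
               - np.log(sig) - 0.5 * np.log(2 * np.pi))
    return z1, logp_z2.sum()   # z1 is passed on to the remaining flow

z = np.array([0.5, -0.3, 1.0, 0.2])
z1, logp = split_prior_logp(z)
assert z1.shape == (2,)
assert np.isfinite(logp)
```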
When modeling discrete data, Theis et al. (2016) introduced the concept of dequantization. For this, they modeled the probability mass function over y as a latent variable model

P(y) = ∫ P(y|v) p(v) dv,
where the latent variables v are continuous-valued. This is a convenient model to use, since the marginal p(v), living on a continuous sample space, can be modelled with a continuous NF. The distribution P(y|v) is known as the quantizer and is typically an indicator function, e.g. P(y|v) = 1[floor(v) = y]. Other works (Hoogeboom et al., 2019a; Tran et al., 2019) directly model P(y) with a discrete-valued flow, but these are known to be difficult to optimize. As an extension of dequantization, Ho et al. (2019) introduced a variational distribution q(v|y), called a dequantizer, and write a lower bound on the data log-likelihood using Jensen's inequality as follows:

log P(y) = log E_{v~q(v|y)}[P(y|v) p(v) / q(v|y)] >= E_{v~q(v|y)}[log P(y|v) + log p(v) - log q(v|y)].
Noting that the joint is p(y, v) = P(y|v) p(v), we see that the dequantizer distribution q(v|y) must be defined such that it only places mass where P(y|v) > 0; otherwise log P(y|v) = -∞ on a set of positive q-probability and the lower bound is undefined. Restricting q to satisfy this condition (so that P(y|v) = 1 on its support) results in the following variational dequantization bound:

log P(y) >= E_{v~q(v|y)}[log p(v) - log q(v|y)].
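The simplest member of this family, plain uniform dequantization (the q(v|y) = U[y, y+1) special case that the variational scheme generalizes), can be sketched as:

```python
import numpy as np

# Sketch of uniform dequantization: adding U[0,1) noise makes discrete
# pixel values continuous, and flooring recovers them exactly, so the
# quantizer P(y|v) = 1[floor(v) = y] holds by construction on q's support.

rng = np.random.default_rng(0)
y = rng.integers(0, 256, size=(8, 8))          # discrete 8-bit "image"
v = y + rng.uniform(0.0, 1.0, size=y.shape)    # dequantized, continuous
assert (np.floor(v).astype(int) == y).all()    # support condition holds
```

Variational dequantization replaces the uniform noise with a learned, data-dependent distribution while keeping the same support condition.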
Structured prediction tasks, such as image segmentation or super-resolution, can be probabilistically framed as learning an unknown target distribution p(y|x), with an input x and a target y. In practice with deep learning models, the unknown distribution is often learned by a factored model:

p(y|x) = ∏_i p(y_i | x),

where y_i represents the i-th dimension of y. Several loss-based optimization methods are a special case of this factored model. The mean squared error is equivalent to the negative log-likelihood of a product of normal distributions with equal and fixed standard deviation. Other examples are cross-entropy, equivalent to the negative log-likelihood of a product of categorical distributions, and binary cross-entropy, equivalent to that of a product of Bernoulli distributions.
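The MSE-Gaussian correspondence can be verified numerically. A small sketch (the targets and predictions are made-up values) showing that the factored Gaussian NLL with fixed σ = 1 equals half the summed squared error plus a constant:

```python
import numpy as np

# Sketch: mean squared error equals the negative log-likelihood of a
# factored Gaussian with fixed sigma, up to an additive constant.

def gaussian_nll(y, y_hat, sigma=1.0):
    d = y.size
    return ((0.5 * ((y - y_hat) / sigma)**2).sum()
            + d * np.log(sigma * np.sqrt(2 * np.pi)))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.5, 2.0])
mse = ((y - y_hat)**2).sum()
const = y.size * np.log(np.sqrt(2 * np.pi))
assert np.isclose(gaussian_nll(y, y_hat), 0.5 * mse + const)
```

Since the constant and the factor 0.5 do not affect the argmin, minimizing MSE and maximizing this factored likelihood are the same optimization problem.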
With factorized independent likelihoods, individual dimensions of are assumed to be conditionally independent. As a result, sampling leads to results with uncorrelated noise over the output dimensions. In the literature, a fix for this problem is to visualize the mode of the distribution and interpret that as a prediction. However, because the likelihood was optimized assuming a conditionally independent noise distribution, these modes tend to be blurry and lack crisp details.
In this section we present our main innovations: i) learning conditional likelihoods using CNFs, and ii) a variational dequantization framework for binary random variables.
We propose to learn conditional likelihoods using conditional normalizing flows for complicated target distributions in multivariate prediction tasks. Take an input x and a regression target y. We learn a complicated distribution p(y|x) using a conditional prior p(z|x) and a mapping f(z; x) = y, which is bijective in z and y. The likelihood of this model is:

p(y|x) = p(z|x) |det ∂f/∂z|^(-1), with z = f^(-1)(y; x).
The generative process from x to y (shown in Figure 1) can be described by first sampling z from a simple base density with its parameters conditioned on x (for us this is a diagonal Gaussian) and then passing it through a sequence of bijective mappings f. This allows for modelling multimodal conditional distributions in y, a property conventional models typically lack.
For the training procedure, the process runs in reverse. We begin with a label y and conditioning input x. We 'flow' the label back through f to yield z = f^(-1)(y; x), and then we evaluate the log-likelihood of this transformed label under the conditional prior p(z|x). The flow and prior parameters can be optimized using stochastic gradient descent, training in minibatches in the usual fashion. Note that this style of training a conditional density model differs fundamentally from traditional models, because we compute the log-likelihood in z-space and not y-space. As a result, we are not biasing our results with an arbitrary choice of output-space likelihood or, in the case of this paper, a handcrafted image loss. Instead, one could interpret this method as learning the correlational and multimodal structure of the likelihood, or, simply put, loss-learning.
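One training-time likelihood evaluation can be sketched as below. This is a NumPy stand-in, not the thesis architecture: the flow is a single conditional affine layer, and s, t, mu, sig are tiny hand-rolled functions in place of the conditional networks:

```python
import numpy as np

# Sketch of one CNF log-likelihood evaluation.  The "flow" is a single
# conditional affine layer y = z * exp(s(x)) + t(x); in practice s, t and
# the prior parameters mu(x), sigma(x) are deep networks.

def s(x):   return 0.1 * x
def t(x):   return 0.5 * x
def mu(x):  return 0.2 * x
def sig(x): return np.ones_like(x)

def cond_log_likelihood(y, x):
    z = (y - t(x)) * np.exp(-s(x))                 # flow y back to z-space
    logp_z = (-0.5 * ((z - mu(x)) / sig(x))**2
              - np.log(sig(x)) - 0.5 * np.log(2 * np.pi)).sum()
    log_det_inv = (-s(x)).sum()                    # Jacobian of the inverse map
    return logp_z + log_det_inv                    # maximize w.r.t. parameters

x = np.array([0.3, -0.7])
y = np.array([1.0, 0.2])
assert np.isfinite(cond_log_likelihood(y, x))
```

In a deep-learning framework, the negative of this quantity would be the minibatch loss fed to the optimizer.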
In our work, the conditioning is introduced in the prior, the split priors, and the affine coupling modules. For the prior, we set the mean and variance as functions of x. For the split prior, we add x as a conditioning argument, giving the conditional p(z_2 | z_1, x). And for the affine coupling layers, we pass x to the scale and translation networks, so that z_b = y_b ⊙ exp(s(y_a, x)) + t(y_a, x).
[Figure: Conditional Split Prior]
In practice these functions are implemented using deep neural networks. First, the conditioning term x is transformed into a rich representation h using a large network g. Subsequently, each function in the flow is applied to a concatenation of h and the relevant part of the activations. For example, the translation of a conditional coupling is computed as t(concat(y_a, h)).
We generalize the variational dequantization scheme to the binary setting. Let y be a multivariate binary random variable and v its dequantized representation. In Ho et al. (2019) the bound is not guaranteed to be tight, since there is a domain mismatch between the support of the quantizer P(y|v) and that of the variational dequantizer q(v|y). Technically, if q(v|y) is modeled as a bijective mapping from a Gaussian distribution where the mapping only has finite volume changes, then the support of q is unbounded. On the other hand, the support of the indicator quantizer is bounded, and so we have to either redefine the quantizer to have support on all of R^D, or restrict the support of the flow to a bounded volume inside R^D. We resolve this by dequantizing with half-infinite noise, where

v = (2y - 1) ⊙ softplus(ε),

with ε produced by a neural network. The softplus guarantees that samples from the neural network are strictly positive: if y_i is 1, the noise term is positive-valued, and if y_i is 0 the noise is negative-valued.
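A minimal NumPy sketch of this half-infinite dequantization; in the thesis the noise ε comes from a conditional flow, so the i.i.d. Gaussian ε here is only a stand-in:

```python
import numpy as np

# Sketch of half-infinite binary dequantization: a label of 1 receives
# positive noise, a label of 0 negative noise, so thresholding v at zero
# recovers y exactly (the quantizer just checks the sign of v).

def softplus(a):
    return np.log1p(np.exp(a))          # strictly positive output

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=(16,))      # binary labels
eps = rng.normal(size=y.shape)          # stand-in for flow/NN samples
v = (2 * y - 1) * softplus(eps)         # half-infinite noise, signed by y
assert ((v > 0).astype(int) == y).all()
```

Because the two half-lines jointly cover all of R, the support mismatch of the bounded indicator quantizer disappears.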
Normalizing flows were originally introduced to machine learning to learn a flexible variational posterior, a conditional distribution, in VAEs (Rezende and Mohamed, 2015; Kingma et al., 2016a; van den Berg et al., 2018). Flow-based generative models (Dinh et al., 2016; Papamakarios et al., 2017; Huang et al., 2018; Kingma and Dhariwal, 2018; Hoogeboom et al., 2019b; Grathwohl et al., 2019; Cao et al., 2019; Chen et al., 2019) are typically trained directly in the data space. Several of these are designed to be fast to invert, which makes them suitable for drawing samples after training.
Different versions and applications of conditional normalizing flows include Agrawal and Dukkipati (2016), who utilize flows in the decoder of variational autoencoders (Kingma and Welling, 2014), conditioned on the latent variable; Trippe and Turner (2018), who utilize conditional flows for prediction problems in a Bayesian framework for density estimation; and Atanov et al. (2019), who introduce a semi-conditional flow that provides an efficient way to learn from unlabeled data in semi-supervised classification problems. Very recently, Ardizzone et al. (2019) have proposed conditional flow-based generative models for image colorization, which differ from our work in training objective, architecture, and applicability to binary segmentation. Autoregressive models (van den Oord et al., 2016) have also been studied for conditional image generation, but are generally slow to sample from.
Adversarial methods (Goodfellow et al., 2014) have been widely applied to (conditional) image density modeling tasks (Vu et al., 2019; Sajjadi et al., 2017b; Yuan et al., 2018; Mechrez et al., 2018), because they tend to generate high-fidelity images. Disadvantages of adversarial methods are that they can be complicated to train and that it is difficult to obtain likelihoods from them. For this reason, it can be hard to assess whether they are overfitting or generalizing.
Here we describe our experiments on super-resolution and vessel segmentation. All models were implemented using the PyTorch framework.
Single Image Super-Resolution (SISR) methods aim to find a high-resolution image y given a single (downsampled) low-resolution image x. Framing this problem as learning a likelihood, we utilize a CNF to learn the distribution p(y|x). To compare our method, we also train a factorized baseline likelihood model with a comparable architecture and parameter budget. The factorized baseline uses a product of discretized logistic distributions (Kingma et al., 2016a; Salimans et al., 2017). All methods are compared on negative log-likelihood when available, which has the information-theoretic interpretation of bits per dimension. In addition, we evaluate using the SSIM (Wang et al., 2004) and PSNR metrics.
The flow is based on the multi-scale architectures of Dinh et al. (2016) and Kingma and Dhariwal (2018), and consists of L levels, each containing K subflows. One subflow consists of an activation normalization, a 1x1 convolution, and our conditional coupling layer. After completing a level, half of the representation is factored out and modeled using our conditional split prior. After all levels have been completed, our conditional prior is used to model the final part of the latent variable.
The conditioning variable x is transformed into the feature representation h using the Residual-in-Residual Dense Block (RRDB) architecture (Wang et al., 2018), consisting of 16 residual-in-residual blocks. To match the parameter budget, the channel growth is 55 for the baseline and 32 for the CNF.
The models are trained on the natural image datasets ImageNet32 and ImageNet64 (Chrabaszcz et al., 2017). Since these datasets have no test set, we use the validation images as a test set. For validation, we take 10000 images from the training images. Performance is always reported on the test set unless specified otherwise. We evaluate our models on the widely used benchmark datasets Set5 (Bevilacqua et al., 2012), Set14 (Zeyde et al., 2012), and BSD100 (Huang et al., 2015). At test time, we pad the test images with zeros at the right and bottom so that they are square and compatible with squeeze layers. When evaluating SSIM and PSNR, we extract the patch with the exact image shape. For all datasets, the LR images are obtained using MATLAB's bicubic kernel function with reduced aliasing artifacts, following Wang et al. (2018). For these experiments, the pixel values are dequantized by adding uniform noise (Theis et al., 2016).
We train on ImageNet32 and ImageNet64 for 200000 iterations with minibatches of size 64 and a learning rate of 0.0001, using the Adam optimizer (Kingma and Ba, 2015). The high-resolution targets are the original images, and the low-resolution inputs are obtained by downsampling them. The flow architecture is built with L levels and K subflows per level.
In this section, the performance of CNFs for SISR is compared against a baseline likelihood model on ImageNet32 and ImageNet64. Their performance measured in log-likelihood (bits per dimension) is shown in Table 1, which shows that the CNF outperforms the factorized baseline in likelihood. Recall that the baseline model is factorized and conditionally independent. These results indicate that it is advantageous to capture the correlations and multi-modality present in the data.
Super-resolution samples from the ImageNet64 test data are shown in Figure 2. The distribution mode of the factorized baseline is displayed in Figure 2, panel c). We see that the baseline is able to learn a relationship between the conditioning variable x and y, but lacks crisp details. The super-resolution images from the CNF are shown in panel b). Notice that more high-frequency components are modelled, for instance in grass and in hairs. Following Kingma and Dhariwal (2018), we sample from the base distributions with a temperature of 0.8 to achieve the best perceptual quality for the distribution learned by the CNF.
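Temperature sampling amounts to rescaling the noise fed through the prior. A minimal sketch (the scalar Gaussian prior here is illustrative; in the model mu and sigma are outputs of the conditioning network):

```python
import numpy as np

# Sketch of temperature sampling: instead of z ~ N(mu, sigma^2) we draw
# z = mu + T * sigma * eps with eps ~ N(0, 1).  T < 1 concentrates
# samples around the mode, trading diversity for perceptual quality.

rng = np.random.default_rng(0)
mu, sigma, T = 0.0, 1.0, 0.8
eps = rng.normal(size=100_000)
z = mu + T * sigma * eps
assert abs(z.std() - T * sigma) < 0.02   # empirical std close to T * sigma
```

The tempered latents are then pushed through the flow f as usual to obtain image-space samples.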
As there is no standard metric for measuring perceptual quality, we report PSNR and SSIM between our predicted image and the ground-truth image in Table 2. We compare CNFs to other state-of-the-art per-pixel-loss-based methods and to the factorized baseline on a 2x upsampling task on standard super-resolution benchmarks. Where available, we report negative log-likelihood (bpd), computed as an average over 1000 randomly cropped 128 x 128 patches. Note that without any hyperparameter tuning or compositional loss weighting, as is typical in SISR, the CNF performs competitively with state-of-the-art super-resolution methods by simply optimizing the likelihood. The SSIM scores of the baseline are on par with or better than those of the adversarial methods and the CNF on all benchmarks. On PSNR, however, the CNF beats the factorized discrete baseline. Samples shown in Figure 3 show that the CNF predictions have more fine-grained texture details. Comparing this finding with the baseline, which outperforms every method on SSIM, shows that metrics can be misleading.
Notice how samples from an independent factorized likelihood model have a lot of color noise, whereas samples from the CNF do not have such problems. Increasing the temperature increases high-level detail; we find that a temperature of 0.8 strikes a balance between noise smoothing and detail. This can be attributed to the ability of flows to model pixel correlations among output dimensions for high-dimensional data such as images.
Vessel segmentation is an important, long-standing medical imaging problem, in which we seek to segment blood vessels in pictures of the retina (the back of the eye). This is a difficult task, because the vessels are thin and of varying thickness. A likelihood function commonly used in segmentation is a weighted Bernoulli distribution,

p(y|x) ∝ ∏_i p_i^(w y_i) (1 - p_i)^(1 - y_i),

where p_i is the predicted probability that pixel i is positive (vessel class) and w is a class-balancing constant. This likelihood is preferred because it accounts for the apparent class imbalance in the ratio of vessels to background. In practice, the numerator of this likelihood is used as a loss function and the per-pixel normalizer Z_i = p_i^w + (1 - p_i) is ignored. The resulting loss is called the weighted cross-entropy. In our experiments we train using the weighted cross-entropy (as in the literature), but we report likelihood values including the normalizer, for a meaningful comparison.
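The weighted Bernoulli objective described above can be sketched as follows. The per-pixel normalizer form Z_i = p_i^w + (1 - p_i) is our reading of the unnormalized weighted Bernoulli; the probabilities p, labels y, and weight w are made-up illustrative values:

```python
import numpy as np

# Sketch of the weighted Bernoulli objective: the usual weighted
# cross-entropy drops the per-pixel normalizer, which is added back here
# when reporting likelihoods so models remain comparable.

def weighted_ce(y, p, w):
    # unnormalized: positive (vessel) pixels are up-weighted by w
    return -(w * y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

def weighted_bernoulli_nll(y, p, w):
    Z = p**w + (1 - p)                 # per-pixel normalizer (assumed form)
    return weighted_ce(y, p, w) + np.log(Z).sum()

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.1])
w = 5.0
assert weighted_ce(y, p, w) > 0
assert np.isfinite(weighted_bernoulli_nll(y, p, w))
```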
We test on the DRIVE database (Staal et al., 2004), consisting of 40 8-bit RGB images of 565 x 584 pixels, split into 20 train and 20 test images. To compare against other methods, we plot precision-recall curves, report the maximum F-score along each curve (shown as a dot in the graph), report the bits per dimension, and plot distributions in PR-space. The main CNN-based contenders are Deep Retinal Image Understanding (DRIU) (Maninis et al., 2016) and Holistically-Nested Edge Detection (HED) (Xie and Tu, 2015), both of which are instances of Deeply-Supervised Nets (Lee et al., 2015). The main difference between DRIU and HED is that DRIU is pretrained on ImageNet (Krizhevsky et al., 2012), whereas HED is not. Other competing methods are reported with results cited from Maninis et al. (2016). For fairness, we also train a model, which we call the likelihood baseline, that uses the exact same architecture as the flow but runs as a feedforward model trained with the weighted Bernoulli loss.
[Table 3 methods: SE, LD, Wavelets, Human, HED, KB, DRIU, Factored (ours), CNF Uniform (ours), CNF (ours)]
The flow is identical to the flow used in the previous section, with some key differences: i) instead of activation normalization, we use instance normalization (Ulyanov et al., 2016); ii) the conditional affine coupling layers do not contain a scaling component, just the translation t, and hence are volume-preserving; and iii) we train using variational dequantization. Since the data is binary-valued, we dequantize according to the Flow++ scheme of Ho et al. (2019), modified to binary variables (see Section 3.2), using a CNF at just a single scale for the dequantizer. The CNF is conditioned on resolution-matched features extracted from a VGG-like network (Simonyan and Zisserman, 2015). This model is composed of blocks of convolutional layers followed by a 2x2 max-pooling layer, shown in Table 4. All filter sizes are 3x3. The outputs are taken at layers 4 and 7; these are used to condition the resolution-512 and resolution-256 levels of the CNF, respectively.
We train using the Adam optimizer at learning rate 0.001 and a minibatch size of 2 for 2000 epochs. All images are padded to 1024x1024 pixels so that they are compatible with squeeze layers. We use rotation augmentation, isotropic scalings, and shears drawn from a zero-mean normal distribution. At test time, we draw samples from our model and compare those against the ground-truth labels. This contrasts with other methods, which measure labels against thresholded versions of a factorized predictive distribution. To create the PR curve in Figure 5, we take the average of 100 samples and threshold the resulting map (an example is shown in Figure 4). While crude, this mean image is useful for defining a PR curve, since there is no great topology change between samples.
The results of our experiments are shown in Table 3 and Figure 5, with a visualization in Figure 4. We see in the table that the CNF trained with our binary dequantization achieves the best bits per dimension, with an F-score comparable to the state-of-the-art model (DRIU), but our model does not require pretraining on ImageNet. Interestingly, we found training a flow with uniform dequantization slightly unstable, and the results were far from satisfactory. In the PR curve of Figure 5, we show that our binary dequantized CNF yields a curve comparable to the DRIU model. These results, however, say nothing about the calibration of the probability outputs, only that the various probability predictions are well ranked. To gain insight into the calibration of the probabilities, we measure the distribution of precision and recall values for point samples drawn from all models, including a second human grader present in the original DRIVE dataset. We synthesized samples from the factored distributions (all except ours and 'human') by sampling images from a factored Bernoulli with mean equal to the soft image. We see the results in the right-hand plot of Figure 5, which shows that while the other CNN-based methods such as DRIU or HED have good precision, they suffer in terms of recall. On the other hand, the CNF drops a little in precision, but makes up for this with high recall, with a PR distribution overlapping the human grader. This indicates that the CNF has learned a well-calibrated distribution, compared to the baseline methods. Further evidence of this is seen in the visualization in Figure 4, which shows details from the predicted means (soft images). This shows that DRIU and the likelihood baseline overdilate segmentations while the CNF does not. This can be explained by the fact that under the weighted Bernoulli it is cheaper to overdilate than to underdilate. Since the CNF contains no handcrafted loss function, we circumvent this pathology.
In this paper we propose to learn likelihoods of conditional distributions using conditional normalizing flows. In this setting, supervised prediction tasks can be framed probabilistically. In addition, we propose a generalization of variational dequantization for binary random variables, which is useful for binary segmentation problems. Experimentally, we show performance competitive with existing methods in the domains of super-resolution and binary image segmentation.
Atanov et al. (2019). Semi-conditional normalizing flows for semi-supervised learning. Cited by: §4.
Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI 2019), Tel Aviv, Israel, July 22–25, 2019, pp. 511. Cited by: §4.
IEEE International Conference on Computer Vision (ICCV 2013), Sydney, Australia, December 1–8, 2013, pp. 1841–1848. Cited by: Table 3.
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5197–5206. Cited by: §5.1.
Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, Nevada, United States, pp. 1106–1114. Cited by: §5.2.
Mohamed and Rezende (2015). Variational information maximisation for intrinsically motivated reinforcement learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS 2015), Cambridge, MA, USA, pp. 2125–2133. Cited by: §1, §2.1.
Retinal blood vessel segmentation using line operators and support vector classification. IEEE Trans. Med. Imaging 26 (10), pp. 1357–1365. Cited by: Table 3.
This section describes architecture and optimization details of the conditional normalizing flow network, low-resolution image feature extractor, and shallow convolutional neural network in the conditional coupling layers.
The conditional coupling layer is shown schematically in Figure 6. This shows that, conditioned on an input x, we are able to build a relatively straightforward invertible mapping between the representations y = [y_a, y_b] and z = [z_a, z_b], which have been partitioned into vectors of equal dimension.
Details for the CNFs are given in Table 5 and details of the individual coupling layers in Tables 6 and 7. The architecture of the feature extractor is given in Table 8. The architecture has L levels and K subflows, following (Dinh et al., 2016; Kingma and Dhariwal, 2018). All networks are optimized using Adam (Kingma and Ba, 2015) for 200000 iterations.
[Table 5 columns: Dataset, Minibatch Size, Levels, Sub-flows, Learning rate]
[Tables 6 and 7 columns: Layer, Intermediate Channels, Kernel size]
[Table 8 columns: Model Type, RRDB blocks, Channel growth, Context Channels]
In this section, larger versions of the ImageNet64 samples are provided, sampled at different temperatures T.