Normalizing Flow as a Flexible Fidelity Objective for Photo-Realistic Super-resolution

by Andreas Lugmayr et al.

Super-resolution is an ill-posed problem, where a ground-truth high-resolution image represents only one possibility in the space of plausible solutions. Yet, the dominant paradigm is to employ pixel-wise losses, such as L_1, which drive the prediction towards a blurry average. This leads to fundamentally conflicting objectives when combined with adversarial losses, which degrades the final quality. We address this issue by revisiting the L_1 loss and show that it corresponds to a one-layer conditional flow. Inspired by this relation, we explore general flows as a fidelity-based alternative to the L_1 objective. We demonstrate that the flexibility of deeper flows leads to better visual quality and consistency when combined with adversarial losses. We conduct extensive user studies for three datasets and scale factors, where our approach is shown to outperform state-of-the-art methods for photo-realistic super-resolution. Code and trained models will be available at:







1 Introduction

Photo-realistic image super-resolution (SR) is the task of upscaling a low-resolution (LR) image by adding natural-looking high-frequency content. Since this information is not contained in the LR image, SR assumes that a prior can be learned to add plausible high-frequency components. In general, however, there are infinitely many possible high-resolution (HR) images mapped to the same LR image. Therefore, this task is highly ill-posed, rendering the learning of powerful deep SR models highly challenging.

To cope with the ill-posed nature of the SR problem, existing state-of-the-art methods employ an ensemble of multiple losses designed for different purposes [23, 39, 47]. In particular, these works largely rely on the L_1 loss for fidelity and the adversarial loss for perceptual quality. Theoretically, the L_1 objective aims to predict the average over all plausible HR image manifestations under a Laplace model. That leads to blurry SR predictions, which are generally not perceptually pleasing. In contrast, the adversarial objective prefers images with natural characteristics and high-frequency details. These two losses are thus fundamentally conflicting in nature [6, 5].

The conflict between the L_1 and the adversarial loss has important negative consequences, as seen in Figure LABEL:fig:intro. In order to find a decent trade-off, a precarious balancing of the two terms is needed. The resulting compromise is optimal in terms of neither fidelity nor perceptual quality. Moreover, the conflict between the two losses results in a remarkably inferior low-resolution consistency. That is, the down-sampled version of the predicted SR image is substantially different from the original LR image. The conflict between the losses drives the prediction towards a point outside the space of plausible HR images (illustrated in Fig. 1).

We attribute those shortcomings to the L_1 loss. Since SR is a highly ill-posed problem, the L_1 loss imposes a rigid and exceptionally inaccurate model of the complicated image manifold of solutions. Ideally, we want a loss that ensures fidelity while not penalizing realistic image patches preferred by the adversarial loss. In this work, we therefore first revisit the L_1 loss and view it from a probabilistic perspective. We observe that the L_1 objective corresponds to a one-layer conditional normalizing flow. This inspires us to explore flow-based generalizations capable of better capturing the manifold of plausible HR images, mitigating the conflict between adversarial and fidelity-based objectives.

A few very recent works [40, 30] have investigated flows for SR. However, these approaches use heavy-weight flow networks as an alternative to the adversarial loss for perceptual quality. In this work, we pursue a very different view, namely the flow as a fidelity-based generalization of the L_1 objective. Our goal is not to replace the adversarial loss but to find a fidelity-based companion that enhances the effectiveness of adversarial learning for SR. In contrast to [30], this allows us to employ much shallower and more practical flow networks, ensuring substantially faster training and inference times. Furthermore, we demonstrate that the adversarial loss effectively removes artifacts generated by purely flow-based methods.

Contributions:  The main contributions of this work are as follows: (i) We revisit the L_1 loss from a probabilistic perspective, expressing it as a one-layer conditional flow. (ii) We generalize the fidelity loss by employing a deep flow and demonstrate that it can be more effectively combined with an adversarial loss. (iii) We design a more practical, efficient, and stable flow architecture, better suited to the combined objective, leading to faster training and inference compared to [30]. (iv) We perform comprehensive experiments analyzing the flow loss combined with adversarial losses, giving valuable insights on the effects of increasing the flexibility of the fidelity-based objective. In comprehensive user studies, totaling over 50 000 votes, our approach outperforms the state-of-the-art on three different datasets and scale factors.

2 Related Work

Single Image Super-Resolution:

  is the task of estimating a high-resolution image from a low-resolution counterpart. It is fundamentally an ill-posed inverse problem. While originally addressed with interpolation techniques, learned methods are better suited for this complex task. Early learned approaches used sparse coding [7, 36, 44, 45] and local linear regression [37, 38, 43]. In recent years, deep learning based methods have largely replaced previous techniques for super-resolution owing to their highly impressive performance.

Initial deep learning approaches [10, 11, 19, 22, 25] for SR aimed at minimizing the L_2 or L_1 distance between the SR and ground-truth image. With this objective, the model is effectively trained to predict a mean of the plausible super-resolutions corresponding to the given input LR image. To alleviate this problem, [23] introduced an adversarial and a perceptual loss. Since then, this strategy has remained the predominant approach to super-resolution [2, 14, 35, 39, 28, 12, 17, 18]. Only very few works have investigated other learning formulations. Notably, Zhang et al. [47] introduce a selection mechanism based on perceptual quality metrics. In order to achieve an explorable SR formulation, Bahat et al. [4] recently trained a stochastic SR network based on mainly adversarial objectives. The output acts as a prior for a low-resolution consistency enforcing module, which optimizes the image in a post-processing step.

Figure 1: While the L_1 loss drags the SR prediction towards a blurry mean, both the flow and the GAN loss push the prediction towards the real image manifold. Replacing L_1 with a flow as the fidelity term therefore reduces the conflict with the GAN loss.

In recent works, invertible networks have gained popularity for image-to-image translation [42, 9, 8, 34, 40, 30, 41, 24]. Xiao et al. [42] use invertible networks to learn downscaling and upscaling of images. This is similar to compression, but where the compressed representation is constrained to be an LR image. For super-resolution, [30] recently introduced a new strategy based on normalizing flows. It aims at replacing adversarial losses with normalizing flows [9, 8, 34]. In contrast, we investigate conditional flows as a replacement for the L_1 loss. In fact, we demonstrate that the flow forms a direct generalization of the L_1 objective. The aim of this work is to investigate flows as an alternative fidelity-based companion to the adversarial loss.

3 Method

3.1 Revisiting the L_1 Loss

The standard paradigm for learning an SR network is to directly penalize the reconstruction error between a predicted image ŷ = g_θ(x) and the ground-truth HR image y corresponding to the LR input x. The reconstruction error is usually measured by applying simple norms in a color space (e.g., RGB or YCbCr). While initial methods [11, 19, 22] employed the L_2 norm, i.e. the mean squared error, later works [49, 25] studied the benefit of the L_1 error,

L_1(θ) = ‖y − g_θ(x)‖_1,   (1)

where x is the LR image, y the ground-truth HR image, and g_θ the SR network.
To understand the implications of this objective function, we use its probabilistic interpretation: the L_1 loss (1) corresponds to the Negative Log-Likelihood (NLL) of a Laplace distribution. This derivation will be particularly illustrative for the generalizations considered later.

We first consider a latent variable z with the standard Laplace distribution p_z(z) ∝ exp(−‖z‖_1). Let f be a function that encodes the LR-HR pair into the latent space as z = f(y, x) = y − g_θ(x). Through the inverse relation y = g_θ(x) + z, it is easy to see that y follows a Laplace distribution with mean g_θ(x),

p(y | x, θ) = 2^(−M) exp(−‖y − g_θ(x)‖_1).   (2)
Here, M is the total dimensionality of y. From a probabilistic perspective, we are thus predicting the conditional distribution of the HR output image y given the LR input x. In particular, our SR network g_θ estimates the mean of this distribution under a Laplacian model. In order to learn the parameters θ of the network, we simply minimize the NLL of (2), which is equal to the L_1 loss (1) up to an additive constant.
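This equivalence can be checked numerically. The sketch below (NumPy, with a toy array standing in for an image) evaluates the L_1 loss of (1) and the Laplace NLL of (2); they differ only by the constant M·log 2, so both objectives share the same gradients and minimizer.

```python
import numpy as np

def l1_loss(y, y_pred):
    # Eq. (1): L_1 reconstruction error between HR ground truth and prediction
    return np.abs(y - y_pred).sum()

def laplace_nll(y, y_pred):
    # Eq. (2): NLL of y under Laplace(mean=y_pred, scale=1), i.e. a one-layer
    # "flow" z = y - y_pred evaluated under a standard Laplace prior
    z = y - y_pred
    return np.abs(z).sum() + y.size * np.log(2.0)

y = np.array([0.2, 0.8, 0.5])        # toy "HR image"
y_pred = np.array([0.1, 0.7, 0.9])   # toy SR prediction
gap = laplace_nll(y, y_pred) - l1_loss(y, y_pred)   # constant M * log 2
```

The gap is independent of the prediction, which is exactly why minimizing the NLL and minimizing L_1 are equivalent.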

In the aforementioned Laplacian model (2), derived from the L_1 loss (1), only the mean g_θ(x) is estimated from the LR image. Thus, the model assumes that the variance, which reflects the possible variability of each pixel, remains constant. This assumption is, however, not accurate. Indeed, super-resolving a constant blue sky is substantially easier than estimating the pixel values of a highly textured region, such as the foliage of a tree. In the former case, the predicted pixels should have low variance, while the latter has high variability, corresponding to different possible textures of foliage. For a Laplace distribution, we can encode the variability in the scale parameter s, which is proportional to the standard deviation. By predicting the scale parameter s_i for each pixel, we can learn a more accurate distribution that also quantifies some aspect of the ill-posed nature of the SR problem.

We easily extend our model with scale parameter prediction by modifying our function f as,

z = f(y, x) = (y − μ_θ(x)) ⊘ s_θ(x),   (8)

where μ_θ(x) and s_θ(x) denote the predicted mean and scale, and ⊘ is element-wise division.
Since this yields a Laplace distribution y ∼ Laplace(μ_θ(x), s_θ(x)), we obtain the NLL,

−log p(y | x, θ) = ∑_i ( |y_i − μ_θ,i(x)| / s_θ,i(x) + log s_θ,i(x) ) + M log 2.   (10)
In practice, we can easily modify an SR network to jointly estimate the mean and scale by doubling the number of output dimensions. The loss (10) stimulates the network to predict larger scale values for ‘uncertain’ pixels, which are likely to have a large error |y_i − μ_θ,i(x)|. In principle, (10) thus extends the L_1 objective to better cope with the ill-posed nature of the SR problem by predicting a more flexible distribution of the HR image. In the next section, we further generalize the objectives (1) and (10) through normalizing flows, to achieve an even more flexible fidelity loss.
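The scale-aware NLL of (10) can be sketched directly (NumPy; helper names are ours). Note that for a fixed residual e, the per-pixel term |e|/s + log s is minimized at s = |e|, which is precisely how the loss rewards honest per-pixel uncertainty estimates.

```python
import numpy as np

def scaled_laplace_nll(y, mu, s):
    # Eq. (10): NLL under a per-pixel Laplace(mu_i, s_i) model
    # (the constant M * log 2 is kept for completeness)
    return (np.abs(y - mu) / s + np.log(s)).sum() + y.size * np.log(2.0)

# The per-pixel term |e|/s + log s is minimized at s = |e|:
e = 2.0
best = e / e + np.log(e)        # predicted scale equals the error magnitude
worse = e / 1.0 + np.log(1.0)   # overconfident: scale too small
```

Setting the derivative −|e|/s² + 1/s to zero confirms the optimum s = |e|, so under-predicting the scale on hard pixels is penalized.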

3.2 Generalizing the L_1 Loss With Flows

To capture the probability distribution of the error between the prediction and the ground-truth y, we also need to consider spatial dependencies. Neighboring pixels are generally highly correlated in natural images. Indeed, to create coherent textures, even long-range correlations need to be considered. However, the L_1 loss (1) and its extension (10) assume each pixel in y to be conditionally independent given x. In fact, sampling from the predicted conditional distribution p(y | x, θ) is equivalent to simply adding Laplacian white noise to the predicted mean. In super-resolution, we strive to create fine textures and details. To achieve this, the predictive distribution must capture complex correlations in the image space.

In this paper, we generalize the loss (10) with the aim of achieving a more flexible objective that better captures the ill-posed setting. This is done through the probabilistic interpretation discussed in Sec. 3.1. We observe that the function f introduced in Sec. 3.1 corresponds to a one-layer conditional normalizing flow with a Laplacian latent space. We can thus generalize this setting by constructing deeper flow networks f_θ. While prior works [3, 40, 30, 33] investigate conditional flows for SR as a replacement for adversarial losses, we see them as a generalization of the fidelity-based L_1 loss. With this view, we aim to find a fidelity-based objective better suited for ill-posed problems and, therefore, more effectively combined with adversarial losses.

Figure 2: Overview of our super-resolution approach. Our flow-based NLL loss replaces the often-used L_1 loss. We accomplish this by encoding the LR image x with the network g_θ. This conditions the flow f_θ, which encodes the GT image y. From that, we obtain the NLL loss that drives the SR fidelity. We combine this with a standard adversarial loss, calculated using a discriminator d_φ.

The purpose of the function f_θ is to map the HR-LR pair to a latent variable z, which follows a simple distribution. By increasing the depth and complexity of the flow f_θ, more flexible conditional densities, and therefore also NLL-based losses, are achieved. In the general case, we let the flow be conditioned on an embedding g_θ(x) of the LR image, as z = f_θ(y; g_θ(x)). In fact, the network g_θ can be seen as predicting the parameters of the conditional distribution p(y | x, θ). In this view, the embedding generalizes the purpose of the SR network, which predicts the mean of the Laplace distribution in the L_1 case (1). In the general Laplace case (8), the LR embedding network needs to generate both the mean μ_θ(x) and the scale s_θ(x). Thanks to the flexibility of conditional flow layers, we can, however, still use the underlying image representation of any standard SR architecture as g_θ. For example, we generate the embedding by concatenating a series of intermediate feature maps from, e.g., the RRDB [39] or the RCAN [48] architecture. For simplicity, we often drop the explicit dependence on g_θ(x) in the flow and simply write f_θ(y; x).
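Building the conditioning embedding from intermediate activations can be sketched as a simple channel-wise concatenation (NumPy; the shapes below are invented for illustration and do not correspond to the actual RRDB feature dimensions):

```python
import numpy as np

def lr_embedding(feature_maps):
    # Concatenate intermediate feature activations along the channel axis
    # to form the conditioning embedding g(x) for the flow.
    return np.concatenate(feature_maps, axis=0)

# Hypothetical activations from three intermediate blocks, each (C, H, W)
feats = [np.zeros((64, 32, 32)), np.zeros((64, 32, 32)), np.zeros((64, 32, 32))]
emb = lr_embedding(feats)   # shape (192, 32, 32)
```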

In order for f_θ to be a valid conditional flow network, we need to preserve invertibility in the first argument y. Under this condition, the conditional density is derived using the change-of-variables formula [9, 21, 30] as,

p(y | x, θ) = p_z(f_θ(y; x)) · |det (∂f_θ/∂y)(y; x)|.   (5)
The latent-space prior p_z is set to a simple distribution, e.g. a standard Gaussian or Laplacian. The second factor in (5) is the resulting volume scaling, given by the determinant of the Jacobian ∂f_θ/∂y. We can easily draw samples from the model by inverting the flow as ŷ = f_θ^{−1}(z; x) with z ∼ p_z. The network f_θ thus transforms a simple distribution to capture the complex correlations in the output image space.

The NLL training objective is obtained by applying the negative logarithm to (5),

−log p(y | x, θ) = −log p_z(z) − log |det (∂f_θ/∂y)(y; x)|   (6a)
                 = −log p_z(z_N) − ∑_{n=1}^{N} log |det (∂f_θ^n/∂z)(z_{n−1})|,   (6b)
where z = f_θ(y; x). In the second equality, we have decomposed f_θ = f_θ^N ∘ ⋯ ∘ f_θ^1 into the sequence of flow layers f_θ^n, with z_n = f_θ^n(z_{n−1}) and z_0 = y, z_N = z. This allows for efficient computation of the log-determinant term.

We can now see that the flow objective (6) generalizes the scaled loss (10), and thereby also the standard L_1 loss (1). By using the one-layer function f defined in (8) together with a standard Laplacian latent variable z, the first term in (10) is obtained by inserting (8) into the first term in (6a). For the second term, we immediately obtain the Jacobian of (8) as a diagonal matrix with elements 1/s_θ,i(x). Inserting this result into the log-determinant term in (6a) yields the second term in (10). A more detailed derivation is provided in the supplementary material. Next, we employ the flow-based fidelity objective in a full super-resolution framework by combining it with adversarial losses.
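The layer-wise decomposition in (6), and the fact that a single affine layer recovers (10), can be verified with a small sketch (NumPy; the layer and prior helper names are our own):

```python
import numpy as np

def flow_nll(y, layers, prior_nll):
    # Eq. (6): accumulate log|det| over the flow layers in the encoding
    # direction, then evaluate the prior NLL on the final latent z
    z, logdet = y, 0.0
    for layer in layers:          # each layer maps z -> (z_next, log|det|)
        z, ld = layer(z)
        logdet += ld
    return prior_nll(z) - logdet

def affine_layer(mu, s):
    # One conditional affine layer z = (y - mu) / s, log|det| = -sum(log s)
    return lambda y: ((y - mu) / s, -np.log(s).sum())

def laplace_prior_nll(z):
    return np.abs(z).sum() + z.size * np.log(2.0)

y = np.array([0.5, 1.5])
mu, s = np.array([0.0, 1.0]), np.array([0.5, 2.0])
nll = flow_nll(y, [affine_layer(mu, s)], laplace_prior_nll)
# Should match Eq. (10): sum |y-mu|/s + log s, plus M log 2
direct = (np.abs(y - mu) / s + np.log(s)).sum() + y.size * np.log(2.0)
```

Deeper flows simply append more layers to the list, each contributing its own log-determinant.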

3.3 Flow-Fidelity with Adversarial Losses

The introduction of adversarial losses [23] pioneered a new direction in super-resolution, aiming to generate perceptually pleasing HR outputs from the natural image manifold. In order to achieve this, the adversarial loss needs to be combined with fidelity-based objectives, ensuring that the generated SR image is close to the HR ground truth. Therefore, SRGAN [23] and later works [2, 14, 35, 28, 12, 17] most typically combine the adversarial loss with the L_1 objective. However, these two objectives are fundamentally conflicting. Unlike the L_1 loss, which pulls the super-resolution towards the mean of all plausible manifestations, the adversarial loss forces the generator to choose exactly one image from the natural image manifold. Hence, the adversarial objective ideally assigns a low loss to all natural image patches. In contrast, such predictions generate a high L_1 loss, since the L_1 objective prefers the blurry average of plausible predictions. We aim to resolve this issue by replacing the L_1 loss with the aforementioned flow-based generalizations.

The flow can learn a more flexible conditional distribution of the HR image y. It therefore better spans the natural image manifold while simultaneously encouraging consistency with the input LR image x. The NLL loss (6) of the flow distribution therefore does not penalize patches from the natural image manifold to the same extent. That allows the adversarial objective to drive the generated SR images towards perceptually pleasing results without being penalized by the fidelity-based loss. Conversely, the flow-based fidelity loss allows the network to learn from the single provided ground-truth HR image y without reducing perceptual quality or incurring a higher adversarial loss.

Interestingly, the flow network f_θ can also be seen as a stochastic generator for the adversarial learning. While stochastic generators are fundamental to unconditional Generative Adversarial Networks (GANs) [13], deterministic networks are most common in the conditional setting, including super-resolution. In fact, GANs are well known to be highly susceptible to mode collapse in the conditional setting [16, 32]. In contrast, flows are highly resistant to mode collapse due to the bijective constraint on f_θ. This is highly important for ill-posed problems such as SR, where we ideally want to span the space of possible predictions. However, it is important to note that the flow is not merely a generator, as in the standard GAN setting. The flow itself also serves as a flexible loss function (6).
Formally, we add the adversarial loss on samples generated by the flow network f_θ. Let d_φ be the discriminator with parameters φ. For one LR-HR pair (x, y) and a random latent sample z ∼ p_z, we consider the adversarial loss

L_adv(θ, φ) = log d_φ(y) + log(1 − d_φ(f_θ^{−1}(z; x))).   (7)
The loss (7) is minimized with respect to the flow and LR encoder parameters θ and maximized with respect to the discriminator parameters φ. In general, any other variant of the adversarial loss (7) can be employed. During training, we employ a linear combination of the NLL loss (6) and the adversarial loss (7). Our training procedure is detailed in Sec. 3.5.
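One generator-side evaluation of the combined objective can be sketched as follows (NumPy; all function names, the toy stand-ins, and the weight are hypothetical, and a simple non-saturating adversarial term stands in for the relativistic formulation used later in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def generator_loss(y, x, flow_nll, flow_inverse, discriminator, adv_weight=0.01):
    # Fidelity: NLL of the ground-truth HR under the conditional flow, Eq. (6)
    nll = flow_nll(y, x)
    # Sample an SR prediction by inverting the flow on a random latent
    z = rng.laplace(size=y.shape)
    sample = flow_inverse(z, x)
    # Adversarial term on the stochastic sample (one variant of Eq. (7))
    adv = -np.log(discriminator(sample) + 1e-8)
    return nll + adv_weight * adv

# Toy stand-ins, just to make the sketch executable
y, x = np.ones(4), np.ones(2)
loss = generator_loss(
    y, x,
    flow_nll=lambda y, x: np.abs(y - 1.0).sum(),   # toy fidelity term
    flow_inverse=lambda z, x: 1.0 + 0.1 * z,       # toy flow inversion
    discriminator=lambda img: 0.5,                 # toy critic score
)
```

The key point of the sketch is structural: the NLL is computed on the ground truth, while the adversarial term is computed on a fresh stochastic sample drawn through the inverse flow.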

3.4 Conditional Flow Architecture

Our full approach, depicted in Fig. 2, consists of the super-resolution embedding network g_θ, the flow network f_θ, and the discriminator d_φ. We construct our conditional flow network based on [30] and use the same settings where not mentioned otherwise. It is based on Glow [21] and RealNVP [9]. It employs a pyramid structure of scale levels, each halving the previous level’s spatial size using a squeeze layer and, depending on the number of channels, also bypassing half of the activations directly to the NLL calculation. We use 3, 4, and 4 scale levels for ×4, ×6, and ×8 respectively, each consisting of a series of K flow steps.
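The squeeze operation used at each scale level can be sketched as the standard space-to-channel reshape from RealNVP/Glow (NumPy, channel-first layout assumed):

```python
import numpy as np

def squeeze(x):
    # Halve each spatial dimension and quadruple the channels; purely a
    # reshape, so it is trivially invertible with unit Jacobian determinant.
    c, h, w = x.shape
    x = x.reshape(c, h // 2, 2, w // 2, 2)
    return x.transpose(0, 2, 4, 1, 3).reshape(4 * c, h // 2, w // 2)

x = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)
out = squeeze(x)   # shape (12, 2, 2)
```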

Each flow step consists of a sequence of four layers. In the encoding direction, we first apply ActNorm [21] to normalize the activations using a learned channel-wise scale and bias. To establish information transfer across the channel dimension, we then use an invertible 1×1 convolution [21]. The following layers condition the flow on the LR image, similar to [30]. First, the Conditional Affine Coupling [9, 30] partitions the channels into two halves. The first half is used as input, together with the LR encoding g_θ(x), to a 3-layer convolutional network module, which predicts the element-wise scale and bias for the second half. This module adds non-linearities and spatial dependencies to the flow network while ensuring easy invertibility and tractable log-determinants. Second, the Affine Image Injector is applied, which transforms all channels conditioned on the low-resolution encoding g_θ(x).

Instead of the learnable 1×1 convolutions used in [30, 21], we use constant orthonormal matrices that are randomly sampled at the start of training. We found this to significantly improve training speed while ensuring better stability due to these layers’ perfect conditioning. When combined with an adversarial loss, the flow network operates in both the encoding and decoding direction during training. To ensure stability in both directions, we reparametrize the prediction of the multiplicative unit in the conditional affine coupling layer. In particular, we predict the multiplicative factor as a bounded function of the unconstrained output u of the convolutional module in the coupling.
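Both choices can be sketched as follows (NumPy). The orthonormal matrix comes from a QR decomposition of a random Gaussian matrix; the particular bounding function for the multiplicative factor is a hypothetical choice, since the exact reparametrization is not preserved in this text.

```python
import numpy as np

def random_rotation(channels, seed=0):
    # Constant orthonormal matrix sampled once at the start of training:
    # perfectly conditioned, |det| = 1, and inverted by its transpose.
    g = np.random.default_rng(seed).normal(size=(channels, channels))
    q, _ = np.linalg.qr(g)
    return q

def bounded_scale(u, lo=0.5, hi=2.0):
    # Squash the unconstrained coupling output u into [lo, hi] so the
    # multiplicative factor stays well-behaved in both flow directions.
    return lo + (hi - lo) / (1.0 + np.exp(-u))

q = random_rotation(8)
```

Because q is orthonormal, the layer contributes zero to the log-determinant and its inverse is simply q.T, which is what makes it both cheap and perfectly conditioned.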

AdFlow compared to   ×4: DIV2K | BSD | Urban   ×6: DIV2K | BSD | Urban   ×8: DIV2K | BSD | Urban
BaseFlow 62.1% ± 2.2 68.3% ± 2.4 74.2% ± 2.1 73.4% ± 2.0 80.7% ± 1.8 82.9% ± 1.7 69.2% ± 2.1 73.1% ± 2.0 78.2% ± 1.9
SRFlow 60.1% ± 2.2 67.2% ± 2.4 66.3% ± 2.2 - - - 66.2% ± 2.1 67.2% ± 2.1 71.8% ± 2.0
RankSRGAN 56.9% ± 2.2 54.8% ± 2.6 67.5% ± 2.2 - - - - - -
ESRGAN 56.1% ± 2.2 51.2% ± 2.6 64.5% ± 2.3 57.5% ± 2.3 62.8% ± 2.2 63.8% ± 2.2 49.9% ± 2.3 54.5% ± 2.2 57.1% ± 2.2
Ground Truth 49.0% ± 2.2 25.4% ± 2.2 29.1% ± 2.2 27.4% ± 2.1 8.9% ± 1.3 11.6% ± 1.5 18.3% ± 1.7 4.2% ± 0.9 7.7% ± 1.2
Table 1: Quantitative results of the user study. Each entry is aggregated from 1 500 votes. For each dataset and scale factor, we directly compare AdFlow with each competing method in a pairwise fashion, as detailed in Sec. 4.1. For each compared method (left), we report the proportion of votes in favor of AdFlow along with the 95% confidence interval. We indicate whether AdFlow is significantly better or worse.

Super-Resolution embedding network g_θ:  Our flow-based objective is designed as a replacement for the L_1 loss. Our formulation is therefore agnostic to the architecture of the underlying SR embedding network g_θ. We use the popular RRDB [39] SR network as our encoder. Instead of outputting the final RGB SR image, the network predicts a rich embedding of the LR image. We obtain this in practice by simply concatenating the feature activations at the intermediate RRDB blocks 1, 4, 6 and 8.

Discriminator:  We use the VGG-based network from [39] as discriminator. Since we generate stochastic SR samples during training, we found it beneficial to reduce the discriminator’s capacity to ensure a balanced adversarial objective. We therefore reduce the internal channel dimension of the discriminator from 64 to 16.

3.5 Training Details

Our approach is trained with a weighted combination of the NLL loss (6) for fidelity and the adversarial loss (7) to increase perceptual quality. We consider the standard bicubic setting in our experiments, where the LR image is generated with the MATLAB bicubic downsampling kernel. In particular, we also train for the challenging higher scale factors. We first train the networks f_θ and g_θ using only the flow NLL loss for 200k iterations, with separate initial learning rates for f_θ and g_θ that are then decreased step-wise. We then fine-tune the network with the adversarial loss for 200k iterations and select the checkpoint with the lowest LPIPS [46], as measured on the training set. We employ the Adam [20] optimizer. As in [30], we add uniformly distributed noise to the ground-truth HR image. Our network is trained on HR patches whose size depends on the scale factor. In principle, our framework can employ any adversarial loss formulation. To allow for a direct comparison with the popular state-of-the-art network ESRGAN [39], we employ the same relativistic adversarial formulation. The adversarial loss weight and the discriminator learning rate are set per scale factor. We use the same training data employed by ESRGAN [39], consisting of the DF2K dataset. It comprises 2650 training images from Flickr2K [25] and 800 training images from the DIV2K [1] dataset.

4 Experiments

Figure 3: Qualitative comparison of RankSRGAN, ESRGAN [39], SRFlow [30], BaseFlow, and AdFlow on the DIV2K (val), BSD100 and Urban100 sets for ×4 SR.

Figure 4: Qualitative comparison of ESRGAN [39], BaseFlow, and AdFlow on the DIV2K (val), BSD100 and Urban100 sets for ×6 SR.

Figure 5: Qualitative comparison of ESRGAN [39], SRFlow [30], BaseFlow, and AdFlow on the DIV2K (val), BSD100 and Urban100 sets for ×8 SR.

We validate our proposed formulation by performing comprehensive experiments on three standard datasets, namely DIV2K [1], BSD100 [31] and Urban100 [15]. We train our approach for three different scale factors: ×4, ×6, and ×8. We term our flow-only baseline BaseFlow and our final method, which also employs adversarial learning, AdFlow. Prediction and evaluation are performed at the full image resolution. For the purely flow-based baseline, we found it best to use a reduced sampling temperature [21, 30]. For our final approach with adversarial loss, we found the standard sampling temperature to yield the best results. Detailed results and more visual examples are found in the supplementary material.

4.1 State-of-the-Art Comparison

We first compare our approach with the state-of-the-art. This work aims to achieve SR predictions that are (i) photo-realistic and (ii) consistent with the input LR image. Since it is well known [23, 35, 39, 28, 12, 17, 30, 26, 29, 27] that computed metrics, such as PSNR and SSIM, fail to rank methods according to photo-realism (i), we perform extensive user studies, as further described below. To assess the consistency of the prediction with the LR input (ii), we first downscale the predicted SR image with the given bicubic kernel and compare the result with the LR input. Their similarity is measured using PSNR, and we therefore refer to this metric as LR-PSNR. The LR consistency penalizes hallucinations and artifacts that cannot be explained from the input image.
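A minimal sketch of the LR-PSNR metric (NumPy; a simple box filter stands in for the MATLAB bicubic kernel actually used, so absolute numbers would differ):

```python
import numpy as np

def psnr(a, b, peak=1.0):
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def box_downscale(img, factor=4):
    # Average pooling as a stand-in for the bicubic downsampling kernel
    h, w = img.shape[0] // factor, img.shape[1] // factor
    return img[: h * factor, : w * factor].reshape(h, factor, w, factor).mean(axis=(1, 3))

def lr_psnr(sr, lr, factor=4):
    # Downscale the SR prediction with the (assumed) LR-generating kernel
    # and measure PSNR against the actual LR input.
    return psnr(box_downscale(sr, factor), lr)

sr = np.random.default_rng(1).random((32, 32))
lr = box_downscale(sr) + 0.01   # a slightly inconsistent LR input
```

A constant offset of 0.01 in a [0, 1] range corresponds to an MSE of 1e-4, i.e. an LR-PSNR of 40 dB, which gives a feel for the numbers in Tab. 2.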

User studies:  We compare the photo-realism of AdFlow with other methods in user studies. The user is shown the full low-resolution image, where a randomly selected region is marked with a bounding box. Next to this image, two different super-resolutions, or “zooms”, of the marked region are displayed. The user is asked to select “Which image zoom looks more realistic?”. In this manner, the user evaluates the photo-realism of AdFlow versus each compared method. To obtain an unbiased opinion, the methods were anonymized and shown in a different random order for each crop. In each study, we evaluate 3 random crops for each of the 100 images in a dataset (DIV2K, BSD100, or Urban100). We use 5 different users for every study, resulting in 1 500 votes per method-to-method comparison for each dataset and scale factor. The full user study, shown in Tab. 1, thus collects over 50 000 votes. Further details are provided in the supplement.
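The scale of the reported confidence intervals can be approximated with a normal-approximation (Wald) interval for a vote proportion; this is only a sketch, and the exact procedure used in the study may differ.

```python
import numpy as np

def vote_ci(votes_for, total, z=1.96):
    # Wald 95% confidence interval for the proportion of votes in favor
    p = votes_for / total
    half_width = z * np.sqrt(p * (1.0 - p) / total)
    return p, half_width

p, hw = vote_ci(931, 1500)   # e.g. 931 of 1 500 votes in favor
```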

Methods:  We compare our approach with state-of-the-art approaches for photo-realistic super-resolution: ESRGAN [39], RankSRGAN [47], and SRFlow [30]. For the two latter approaches, we use the publicly available code and trained models. In addition to the publicly available ESRGAN model for ×4 SR, we train models for ×6 and ×8 SR using the code provided by the authors. All compared methods are trained on the same training set, namely DF2K [25].

Results:  The results of our user study are given in Table 1. The first number represents the ratio of votes in favour of AdFlow and the second the 95% confidence interval. AdFlow outperforms all other methods at a 95% significance level on all datasets, except for one case. Interestingly, it almost matches the realism of the ground truth on the DIV2K dataset. The visual results in Fig. 3 show that AdFlow generates sharp and realistic textures and structures. In contrast, ESRGAN frequently generates visible artifacts, while SRFlow and BaseFlow achieve less sharp results. While RankSRGAN suffers fewer artifacts than ESRGAN, its predictions are less sharp than AdFlow’s.

Method   ×4: DIV2K | BSD | Urban   ×6: DIV2K | BSD | Urban   ×8: DIV2K | BSD | Urban
BaseFlow 49.9 49.9 49.5 48.3 48.6 47.9 49.8 50.2 48.7
SRFlow 50.0 49.9 49.5 - - - 49.0 51.0 48.1
RankSRGAN 42.3 41.7 39.9 - - - - - -
ESRGAN 39.0 37.7 36.8 33.2 32.8 30.9 31.3 31.7 28.9
AdFlow 45.2 45.6 43.4 37.5 38.5 36.0 46.0 46.8 42.1
Table 2: Consistency with the input in terms of LR-PSNR (dB) on the DIV2K (val), BSD100 and Urban100 datasets. While the purely flow-based methods (top) achieve high LR-PSNR, they have worse perceptual quality (Tab. 1). Among the methods that employ an adversarial loss (bottom), AdFlow achieves significantly better LR consistency.

The results for the higher scale factors ×6 and ×8 show a similar trend, as seen in Fig. 4 and 5. Our approach consistently outperforms the purely flow-based approaches BaseFlow and SRFlow for all scale factors and datasets, with over 60% of the votes. As seen in the visual examples, particularly for ×6 and ×8, the flow-based approaches often generate strong high-frequency artifacts. In contrast, our AdFlow generates structured and crisp textures, attributed to the adversarial loss. Compared to ESRGAN, which combines L_1 with an adversarial loss, AdFlow produces generally sharper results and has no visible color shift, as seen in Fig. 5. Interestingly, AdFlow demonstrates substantially better generalization to the BSD100 and Urban100 datasets than ESRGAN, as shown in the user study in Table 1. This indicates that ESRGAN tends to overfit to the DIV2K distribution. Qualitative examples for ×4, ×6, and ×8 are shown in Fig. 3, 4, and 5, respectively.

We report the LR-PSNR for all datasets and scale factors in Tab. 2. ESRGAN and RankSRGAN obtain poor LR consistency across all datasets. AdFlow gains between 4.3 dB and 15.1 dB in LR-PSNR over ESRGAN, indicating fewer hallucination artifacts and less color shift. By employing a flow-based fidelity loss instead of L_1, AdFlow achieves superior photo-realism while ensuring high LR consistency.

4.2 Analysis of Flow-based Fidelity Objective

Here, we analyze the impact of generalizing the L_1 loss towards a gradually more flexible flow-based NLL objective. This is done by increasing the number of flow steps K per level inside the flow architecture. We train and evaluate our AdFlow with different flow depths. Due to the difficulty and cost of running a large number of user studies, we here use the learned LPIPS [46] distance as a surrogate for photo-realism. In Fig. 6, we plot the LPIPS and LR-PSNR on the DIV2K validation set over the number of flow steps K. We also include the results obtained with the L_1 loss, which corresponds to an even simpler one-layer flow, as discussed in Sec. 3.1 and 3.2. Note that these results correspond to the standard RRDB and ESRGAN, respectively.

 Adv. Loss Affine Coup. Rand. Rot. Coupl. Mult. Percept. Loss LPIPS  LR-PSNR 
0.349 39.76
 ✓ 0.337 34.85
0.253 50.16
 ✓ - -
0.254 50.19
 ✓ - -
0.253 49.78
 ✓ 0.253 47.54
 ✓ 0.270 47.35
Table 3: Ablation of the architecture choices: adversarial loss (Adv. Loss), Affine Couplings (Affine Coup.), Random Depth-Wise Rotation (Rand. Rot.), Coupling Multiplication for the Affine Couplings (Coupl. Mult.), and the perceptual loss (Percept. Loss) [23].

As we increase the depth of the flow network, the LPIPS decreases while the LR-PSNR increases, indicating an improvement in both perceptual quality and low-resolution consistency. This trend also holds when starting from the NLL objective. Note that the brief initial increase in LPIPS is explained by the added stochasticity when transitioning from the deterministic L_1 prediction to the flow. Indeed, a too shallow flow network does not capture rich enough spatial correlations to generate more natural samples. However, already at a small number of flow steps, the flow-based generalization outperforms the L_1 loss in LPIPS. Increasing the flexibility of the NLL-based fidelity loss, starting from the L_1 loss, thus benefits both perceptual quality and consistency. This strongly indicates that a flow-based fidelity objective alleviates the conflict between the adversarial and L_1 losses.

Figure 6: Analysis of the input consistency (LR-PSNR) and perceptual quality (LPIPS) for different numbers of flow steps.

4.3 Ablation of Flow Architecture

In Tab. 3 we show results of our ablative experiments on DIV2K. First, we ablate the use of Conditional Affine Couplings. Removing this layer (top row) results in a conditionally linear flow, radically limiting its expressiveness. This leads to a substantially worse LPIPS and LR-PSNR, demonstrating the importance of a flexible flow network. Second, we replace the learnable invertible 1×1 convolutions with fixed random rotation matrices (Rand. Rot.). While largely preserving the quality in all metrics, this reduces the training time. Next, we consider the reparametrization of the coupling layers (Coupl. Mult.). We found this to be critical for training stability when combined with the adversarial loss. Lastly, we investigate the VGG-based perceptual loss [23] that is commonly used in SR methods. It is generally employed as a more perceptually inclined fidelity loss to complement the L_1 objective. However, we found the perceptual loss not to be beneficial. This indicates that the more flexible flow-based fidelity loss can also effectively replace the VGG loss.
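Since a rotation matrix is orthogonal with determinant one, a fixed random rotation mixes channels while contributing no log-determinant term to the NLL. A hypothetical sketch of drawing such a matrix once at initialization (not the authors' implementation):

```python
import numpy as np

def random_rotation(num_channels, seed=0):
    """Sample a fixed rotation matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((num_channels, num_channels)))
    q *= np.sign(np.diag(r))   # fix column signs to make the factorization unique
    if np.linalg.det(q) < 0:   # flip one column so det = +1 (a proper rotation)
        q[:, 0] = -q[:, 0]
    return q

R = random_rotation(8)
# R is orthogonal (R @ R.T = I) and volume preserving: log|det R| = 0,
# so it adds nothing to the flow's Jacobian term.
```

Because the matrix is fixed, no gradients need to be propagated through it, which is one plausible source of the reported training-time reduction.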

5 Conclusion

We explore conditional flows as a generalization of the L_1 loss in the context of photo-realistic super-resolution. In particular, we tackle the conflicting objectives between L_1 and adversarial losses. Our flow-based alternatives offer both improved fidelity to the low-resolution input and a higher degree of flexibility. Extensive user studies clearly demonstrate the advantages of our approach over the state of the art on three datasets and scale factors. Lastly, our experimental analysis brings new insights into the learning of super-resolution methods, paving the way for further explorations in the pursuit of more powerful learning formulations.

Acknowledgements:  This work was supported by the ETH Zürich Fund (OK), a Huawei Technologies Oy (Finland) project, an Nvidia GPU grant, and AWS.


  • [1] E. Agustsson and R. Timofte (2017) NTIRE 2017 challenge on single image super-resolution: dataset and study. In CVPR Workshops, Cited by: Figure 12, §3.5, Table 4, §4.
  • [2] N. Ahn, B. Kang, and K. Sohn (2018) Image super-resolution via progressive cascading residual network. In CVPR, Cited by: §2, §3.3.
  • [3] L. Ardizzone, C. Lüth, J. Kruse, C. Rother, and U. Köthe (2019) Guided image generation with conditional invertible neural networks. CoRR abs/1907.02392. External Links: Link, 1907.02392 Cited by: §3.2.
  • [4] Y. Bahat and T. Michaeli (2020) Explorable super resolution. In CVPR, Cited by: §2.
  • [5] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor (2018) The 2018 PIRM challenge on perceptual image super-resolution. In Computer Vision - ECCV 2018 Workshops - Munich, Germany, September 8-14, 2018, Proceedings, Part V, L. Leal-Taixé and S. Roth (Eds.), Lecture Notes in Computer Science, Vol. 11133, pp. 334–355. External Links: Document, Link Cited by: §1.
  • [6] Y. Blau and T. Michaeli (2018) The perception-distortion tradeoff. In CVPR, pp. 6228–6237. External Links: Document, Link Cited by: §1.
  • [7] D. Dai, R. Timofte, and L. V. Gool (2015) Jointly optimized regressors for image super-resolution. Comput. Graph. Forum 34 (2), pp. 95–104. External Links: Document Cited by: §2.
  • [8] L. Dinh, D. Krueger, and Y. Bengio (2015) NICE: non-linear independent components estimation. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, Cited by: §2.
  • [9] L. Dinh, J. Sohl-Dickstein, and S. Bengio (2017) Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §2, §3.2, §3.4, §3.4.
  • [10] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In ECCV, pp. 184–199. External Links: Document Cited by: §2.
  • [11] C. Dong, C. C. Loy, K. He, and X. Tang (2016) Image super-resolution using deep convolutional networks. TPAMI 38 (2), pp. 295–307. Cited by: §2, §3.1.
  • [12] M. Fritsche, S. Gu, and R. Timofte (2019) Frequency separation for real-world super-resolution. In 2019 IEEE/CVF International Conference on Computer Vision Workshops, ICCV Workshops 2019, Seoul, Korea (South), October 27-28, 2019, pp. 3599–3608. External Links: Document, Link Cited by: §2, §3.3, §4.1, Table 5, Table 6, Table 7.
  • [13] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 2672–2680. Cited by: §3.3.
  • [14] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep back-projection networks for super-resolution. In CVPR, Cited by: §2, §3.3.
  • [15] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206. Cited by: §4.
  • [16] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, pp. 5967–5976. External Links: Document, Link Cited by: §3.3.
  • [17] X. Ji, Y. Cao, Y. Tai, C. Wang, J. Li, and F. Huang (2020-06) Real-world super-resolution via kernel estimation and noise injection. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §2, §3.3, §4.1, Table 5, Table 6, Table 7.
  • [18] D. Kim, M. Kim, G. Kwon, and D. Kim (2019) Progressive face super-resolution via attention to facial landmark. In arxiv, Vol. abs/1908.08239. Cited by: §2.
  • [19] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, Cited by: §2, §3.1.
  • [20] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Cited by: §B, §3.5.
  • [21] D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 10236–10245. Cited by: §3.2, §3.4, §3.4, §3.4, §4.
  • [22] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, Cited by: §2, §3.1.
  • [23] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. CVPR. Cited by: §1, §2, §3.3, §4.1, §4.3, Table 3, Table 5, Table 6, Table 7.
  • [24] J. Liang, A. Lugmayr, K. Zhang, M. Danelljan, L. Van Gool, and R. Timofte (2021) Hierarchical conditional flow: a unified framework for image super-resolution and image rescaling. In ICCV, Cited by: §2.
  • [25] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. CVPR. Cited by: §2, §B, §3.1, §3.5, §4.1.
  • [26] A. Lugmayr, M. Danelljan, R. Timofte, et al. (2019) AIM 2019 challenge on real-world image super-resolution: methods and results. In ICCV Workshops, Cited by: §4.1.
  • [27] A. Lugmayr, M. Danelljan, R. Timofte, et al. (2020) NTIRE 2020 challenge on real-world image super-resolution: methods and results. CVPR Workshops. Cited by: §4.1.
  • [28] A. Lugmayr, M. Danelljan, and R. Timofte (2019) Unsupervised learning for real-world super-resolution. In ICCV Workshops, Cited by: §2, §3.3, §4.1, Table 5, Table 6, Table 7.
  • [29] A. Lugmayr, M. Danelljan, and R. Timofte (2021) NTIRE 2021 learning the super-resolution space challenge. In CVPRW, Cited by: §4.1.
  • [30] A. Lugmayr, M. Danelljan, L. Van Gool, and R. Timofte (2020) SRFlow: learning the super-resolution space with normalizing flow. In ECCV, Cited by: §1, §1, §2, §B, §3.2, §3.2, §3.4, §3.4, §3.4, §3.5, Figure 13, Figure 15, Figure 3, Figure 5, §4.1, §4.1, Table 5, Table 6, Table 7, §4.
  • [31] D. Martin, C. Fowlkes, D. Tal, and J. Malik (2001-07) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, Vol. 2, pp. 416–423. Cited by: §4.
  • [32] M. Mathieu, C. Couprie, and Y. LeCun (2016) Deep multi-scale video prediction beyond mean square error. In ICLR, External Links: Link Cited by: §3.3.
  • [33] A. Pumarola, S. Popov, F. Moreno-Noguer, and V. Ferrari (2020) C-flow: conditional generative flow models for images and 3d point clouds. In CVPR, pp. 7949–7958. Cited by: §3.2.
  • [34] D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1530–1538. Cited by: §2.
  • [35] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch (2017) EnhanceNet: single image super-resolution through automated texture synthesis. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 4501–4510. External Links: Document Cited by: §2, §3.3, §4.1, Table 5, Table 6, Table 7.
  • [36] L. Sun and J. Hays (2012) Super-resolution from internet-scale scene matching. In ICCP, Cited by: §2.
  • [37] R. Timofte, V. De Smet, and L. Van Gool (2014) A+: adjusted anchored neighborhood regression for fast super-resolution. In ACCV, pp. 111–126. Cited by: §2.
  • [38] R. Timofte, V. D. Smet, and L. V. Gool (2013) Anchored neighborhood regression for fast example-based super-resolution. In ICCV, pp. 1920–1927. External Links: Document, Link Cited by: §2.
  • [39] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang (2018) ESRGAN: enhanced super-resolution generative adversarial networks. ECCV. Cited by: §1, §2, Figure 12, §3.2, §C.2, §C.2, §3.4, §3.4, §3.5, Table 4, Figure 13, Figure 14, Figure 15, Figure 3, Figure 4, Figure 5, §4.1, §4.1, Table 5, Table 6, Table 7.
  • [40] C. Winkler, D. E. Worrall, E. Hoogeboom, and M. Welling (2019) Learning likelihoods with conditional normalizing flows. arxiv abs/1912.00042. External Links: Link, 1912.00042 Cited by: §1, §2, §3.2.
  • [41] V. Wolf, A. Lugmayr, M. Danelljan, L. Van Gool, and R. Timofte (2021) Deflow: learning complex image degradations from unpaired data with conditional flows. In CVPR, Cited by: §2.
  • [42] M. Xiao, S. Zheng, C. Liu, Y. Wang, D. He, G. Ke, J. Bian, Z. Lin, and T. Liu (2020) Invertible image rescaling. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Lecture Notes in Computer Science, Vol. 12346, pp. 126–144. External Links: Document, Link Cited by: §2.
  • [43] C. Yang and M. Yang (2013) Fast direct super-resolution by simple functions. In ICCV, pp. 561–568. External Links: Document, Link Cited by: §2.
  • [44] J. Yang, J. Wright, T. S. Huang, and Y. Ma (2008) Image super-resolution as sparse representation of raw image patches. In CVPR, External Links: Document Cited by: §2.
  • [45] J. Yang, J. Wright, T. S. Huang, and Y. Ma (2010) Image super-resolution via sparse representation. IEEE Trans. Image Processing 19 (11), pp. 2861–2873. External Links: Document, Link Cited by: §2.
  • [46] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. CVPR. Cited by: §B, §3.5, §4.2.
  • [47] W. Zhang, Y. Liu, C. Dong, and Y. Qiao (2019) RankSRGAN: generative adversarial networks with ranker for image super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3096–3105. Cited by: §1, §2, §4.1, Table 5.
  • [48] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In ECCV, Lecture Notes in Computer Science, Vol. 11211, pp. 294–310. External Links: Document, Link Cited by: §3.2.
  • [49] H. Zhao, O. Gallo, I. Frosio, and J. Kautz (2017) Loss functions for image restoration with neural networks. IEEE Trans. Computational Imaging 3 (1), pp. 47–57. External Links: Document, Link Cited by: §3.1.


In this appendix, we first provide further details on the user study in Sec. A. Second, we provide an analysis of the sampling temperature in Sec. B. Third, we present additional details about the minimally generalized L_1 loss in Sec. C. Finally, we provide a further qualitative and quantitative comparison of AdFlow with other state-of-the-art methods in Sec. D. Additional visual results, as used in our study, will be available on the project page.

Figure 7: Influence of temperature parameter during inference on perception and fidelity. As opposed to AdFlow, ESRGAN and RRDB have only a single operating point. ( super-resolution)
Figure 8: Influence of temperature parameter during inference on perception and fidelity. As opposed to AdFlow, ESRGAN and RRDB have only a single operating point. ( super-resolution)
Figure 9: Influence of temperature parameter during inference on perception and fidelity. As opposed to AdFlow, ESRGAN and RRDB have only a single operating point. ( super-resolution)

A User Study

Figure 10: Screenshot of the web GUI of our user study. The ‘Reference’ provides an overview and indicates which part of an image should be considered. The images ‘Zoom 1’ and ‘Zoom 2’ are the two candidates, where the latter was selected to look more realistic.

As described in Sec. 4.1 in the main paper, we conduct a user study. The GUI is shown in Figure 10. We ask the user to evaluate which of the two images looks more realistic. To select an image, the user presses the 1 key for the left and the 2 key for the right image. Once a selection is made, the user moves to the next image using the right arrow key until all tasks are completed. Finally, the form is submitted using the button on the top right.

To increase the data quality, we use a filtering mechanism: we add redundant questions and reject submissions with low self-consistency. A visualization of the study results is shown in Fig. 11. The green bars display the percentage of votes favoring the photo-realism of AdFlow, while the red bars show the percentage favoring the other method. We display statistical significance by showing the 95% confidence interval in black. Example images used in our study are shown in visuals.html.
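The self-consistency filter can be sketched as follows, assuming each submission answers a set of pairwise questions plus a repeated subset (function names and the threshold are illustrative, not the exact values used in the study):

```python
def self_consistency(answers, repeated):
    """Fraction of repeated questions answered the same way both times.
    `answers` and `repeated` map question ids to the chosen image (1 or 2)."""
    same = sum(answers[q] == repeated[q] for q in repeated)
    return same / len(repeated)

def accept(answers, repeated, threshold=0.8):
    """Reject submissions whose repeated answers disagree too often."""
    return self_consistency(answers, repeated) >= threshold
```

A participant answering the repeated questions inconsistently is likely clicking at random, so their votes are discarded before aggregating the bars in Fig. 11.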

Figure 11: User study results, as the percentage of votes favoring the photo-realism of AdFlow (green) versus each other method (red). A bar represents 1500 user votes. The 95% confidence interval is in black. We compare on DIV2K, BSD100, and Urban100 across all scale factors.

B Analysis of Sampling Temperature

Here, we analyze the trade-off between image quality, in terms of LPIPS [46], and consistency with the low-resolution input, in terms of LR-PSNR, when varying the sampling temperature. We sample from the latent space with a Gaussian prior distribution whose standard deviation is scaled by τ. The latter is usually termed the sampling temperature [21]. Similar to [30], we can set the operating point by adjusting the temperature τ. Figures 7, 8 and 9 show that ESRGAN [39] trades off much more low-resolution consistency to improve the perceptual quality than AdFlow. The temperatures achieving the best trade-off for BaseFlow and AdFlow are the ones used in the main paper.
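Concretely, sampling at temperature τ just scales the standard deviation of the latent prior; a minimal sketch (function name is illustrative):

```python
import numpy as np

def sample_latent(shape, temperature, rng=None):
    """Draw z ~ N(0, temperature**2 I): tau = 0 yields the deterministic mean
    prediction, tau = 1 samples from the full learned distribution."""
    if rng is None:
        rng = np.random.default_rng()
    return temperature * rng.standard_normal(shape)
```

Sweeping τ from 0 to 1 traces out the perception–fidelity curves in Figures 7–9, whereas ESRGAN and RRDB each correspond to a single fixed operating point.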

C Minimal Generalization of the L_1 Loss

Figure 12: Qualitative comparison of the standard L_1 approach and L_1 with learned variance on the DIV2K [1] validation set.

Here we provide further theoretical and empirical analysis of generalizing the L_1 loss with normalizing flows.

C.1 Relation of the Flow Loss to L_1

We here derive the generalized L_1 objective (Eq. (4) in the main paper) from the normalizing flow formulation as a 1-layer special case. Let z be standard Laplace distributed. We use the flow function f_\theta defined as (Eq. (3) of the main paper),

$$ z = f_\theta(y; x) = e^{-s_\theta(x)} \odot \big(y - \mu_\theta(x)\big), \tag{8} $$

we obtain the inverse as,

$$ y = f_\theta^{-1}(z; x) = \mu_\theta(x) + e^{s_\theta(x)} \odot z. \tag{9} $$

Since z follows a standard Laplace distribution, it is easy to see that the NLL corresponds to the generalized L_1 objective as given by Eq. (4) in the main paper, that is

$$ -\log p_{y|x}(y \,|\, x, \theta) = \sum_i \Big( e^{-s_{\theta,i}(x)} \, \big| y_i - \mu_{\theta,i}(x) \big| + s_{\theta,i}(x) \Big) + \mathrm{const}. \tag{10} $$

Hence, (8) is the flow corresponding to the density (10). Inserting (8) into the NLL formula for flows (Eq. (6a) in the main paper) gives,

$$ -\log p_{y|x}(y \,|\, x, \theta) = -\log p_z\big(f_\theta(y; x)\big) - \log \Big| \det \tfrac{\partial f_\theta}{\partial y} \Big| = \sum_i e^{-s_{\theta,i}(x)} \big| y_i - \mu_{\theta,i}(x) \big| + \sum_i s_{\theta,i}(x) + \mathrm{const}. \tag{11} $$

Here, the Jacobian \(\partial f_\theta / \partial y\) is a diagonal matrix with elements \(e^{-s_{\theta,i}(x)}\). The final result thus corresponds to the NLL derived directly in (10). We therefore conclude that the generalized L_1 objective is a special case given by the 1-layer normalizing flow defined in (8).
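The equivalence can be verified numerically: the change-of-variables NLL of the one-layer flow z = exp(−s)·(y − μ) matches the Laplace NLL computed directly, up to the shared constant. A small NumPy check (names are illustrative):

```python
import numpy as np

def direct_nll(y, mu, s):
    """Per-pixel Laplace NLL: |y - mu| * exp(-s) + s (dropping the log-2 constant)."""
    return np.sum(np.abs(y - mu) * np.exp(-s) + s)

def flow_nll(y, mu, s):
    """Same NLL via the one-layer flow z = exp(-s) * (y - mu):
    -log p_z(z) - log|det df/dy| with p_z standard Laplace."""
    z = np.exp(-s) * (y - mu)
    neg_log_pz = np.sum(np.abs(z))  # standard Laplace, constant dropped
    log_det = np.sum(-s)            # diagonal Jacobian with entries exp(-s)
    return neg_log_pz - log_det
```

Setting s = 0 everywhere recovers the plain L_1 loss, which is exactly the 1-layer special case discussed above.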

C.2 Empirical Analysis

We report results for the intermediate step of predicting an adaptive variance according to the Laplacian model described in Section 3.1, Equations (3)-(4) of the main paper. The network predicts three additional output channels, which encode the log-scale s_\theta(x) of the Laplace distribution. The loss in Eq. (4) of the main paper can thus be written as,

$$ \mathcal{L}(\theta) = \sum_i \Big( e^{-s_{\theta,i}(x)} \, \big| y_i - \mu_{\theta,i}(x) \big| + s_{\theta,i}(x) \Big). $$
We notice that even this minimal extension of the L_1 objective reduces the conflict with the adversarial loss to some extent. The effective removal of artifacts is especially apparent in the first row of Figure 12, comparing ESRGAN [39] and ESRGAN + Adaptive Variance. Our further generalization of the L_1 loss continues to improve the quality of the super-resolved images.

As the increase in visual quality alone would not be a good indicator for a reduced conflict of objectives, we also report the low-resolution consistency in Table 4, which improves by 2.91 dB from ESRGAN [39] to ESRGAN + Adaptive Variance. A further generalization to BaseFlow and AdFlow leads to further improved low-resolution consistency. Based on the improved visual quality and low-resolution consistency, we conclude that the minimally generalized L_1 loss reduces the conflict in objectives, which further validates our strategy of replacing the L_1 loss with a more flexible generalization.

Method PSNR↑ SSIM↑ LPIPS↓ LR-PSNR↑
RRDB [39] 25.52 0.697 0.419 45.31
RRDB [39] + Adaptive Variance 25.47 0.696 0.418 44.51
ESRGAN [39] 22.14 0.578 0.277 31.28
ESRGAN [39] + Adaptive Variance 22.94 0.593 0.280 34.19
BaseFlow 23.58 0.595 0.253 49.78
AdFlow 23.45 0.602 0.253 47.54
Table 4: Quantitative results of the standard L_1 approach and of L_1 with learned variance on the DIV2K [1] validation set. “Adaptive Variance” indicates the generalized L_1 loss with predicted variance, as described in Sec. C.

D Detailed Results

In this section, we provide an extended quantitative and qualitative analysis of the same BaseFlow and AdFlow networks evaluated in the main paper. For completeness, we report the PSNR, SSIM and LPIPS on the DIV2K, BSD100, and Urban100 datasets in Tables 5, 6 and 7. However, note that these metrics do not reflect photo-realism well, as discussed in Sec. 4.1 in the main paper.

Further qualitative results for the different scale levels are provided in Figures 13, 14 and 15, respectively.



DIV2K (val.)
Method PSNR↑ SSIM↑ LPIPS↓ LR-PSNR↑
Bicubic 26.69 0.766 0.409 38.69
RRDB [39] 29.44 0.844 0.253 49.17
ESRGAN [39] 26.20 0.747 0.124 39.01
RankSRGAN [47] 26.55 0.750 0.128 42.33
SRFlow [30] 27.08 0.756 0.120 49.97
BaseFlow 27.21 0.760 0.118 49.88
AdFlow 27.02 0.768 0.132 45.17

BSD100
Method PSNR↑ SSIM↑ LPIPS↓ LR-PSNR↑
Bicubic 22.40 0.508 0.713 37.13
RRDB [39] 23.58 0.572 0.554 45.26
ESRGAN [39] 20.99 0.462 0.332 31.68
SRFlow [30] 21.76 0.467 0.335 51.01
BaseFlow 22.03 0.478 0.325 50.17
AdFlow 22.01 0.486 0.327 48.78

Urban100
Method PSNR↑ SSIM↑ LPIPS↓ LR-PSNR↑
Bicubic 19.31 0.477 0.686 33.93
RRDB [39] 21.15 0.603 0.401 43.33
ESRGAN [39] 18.43 0.475 0.306 28.88
SRFlow [30] 19.29 0.501 0.309 48.11
BaseFlow 19.72 0.513 0.304 48.71
AdFlow 19.04 0.506 0.278 44.67
Table 5: Approximations for perceptual quality on DIV2K (val.), BSD100, and Urban100. Since [23, 35, 39, 28, 12, 17, 30] showed the limitations of calculated metrics for SR, our main metric is the user study.


DIV2K (val.)
Method PSNR↑ SSIM↑ LPIPS↓ LR-PSNR↑
Bicubic 24.87 0.680 0.519 37.78
RRDB 26.51 0.741 0.382 46.86
ESRGAN [39] 23.16 0.629 0.222 33.21
BaseFlow 24.04 0.621 0.216 49.12
AdFlow 23.94 0.6505 0.216 37.57

BSD100
Method PSNR↑ SSIM↑ LPIPS↓ LR-PSNR↑
Bicubic 23.26 0.564 0.645 37.51
RRDB 24.42 0.625 0.507 46.41
ESRGAN [39] 21.42 0.501 0.288 32.76
BaseFlow 21.98 0.500 0.274 48.60
AdFlow 22.17 0.533 0.269 38.52

Urban100
Method PSNR↑ SSIM↑ LPIPS↓ LR-PSNR↑
Bicubic 20.20 0.541 0.606 34.24
RRDB 21.95 0.650 0.371 44.81
ESRGAN [39] 19.43 0.541 0.251 30.87
BaseFlow 20.43 0.564 0.255 47.90
AdFlow 20.26 0.583 0.235 36.01
Table 6: Approximations for perceptual quality on DIV2K (val.), BSD100, and Urban100. Since [23, 35, 39, 28, 12, 17, 30] showed the limitations of calculated metrics for SR, our main metric is the user study.


DIV2K (val.)
Method PSNR↑ SSIM↑ LPIPS↓ LR-PSNR↑
Bicubic 23.74 0.627 0.584 37.14
RRDB [39] 25.52 0.697 0.419 45.31
ESRGAN [39] 22.14 0.578 0.277 31.28
SRFlow [30] 23.04 0.578 0.275 49.02
BaseFlow 23.58 0.595 0.253 49.78
AdFlow 23.38 0.600 0.264 46.02

BSD100
Method PSNR↑ SSIM↑ LPIPS↓ LR-PSNR↑
Bicubic 22.40 0.508 0.713 37.13
RRDB [39] 23.58 0.572 0.554 45.26
ESRGAN [39] 20.99 0.462 0.332 31.68
SRFlow [30] 21.76 0.467 0.335 51.01
BaseFlow 22.03 0.478 0.325 50.17
AdFlow 22.01 0.486 0.327 48.78

Urban100
Method PSNR↑ SSIM↑ LPIPS↓ LR-PSNR↑
Bicubic 19.31 0.477 0.686 33.93
RRDB [39] 21.15 0.603 0.401 43.33
ESRGAN [39] 18.43 0.475 0.306 28.88
SRFlow [30] 19.29 0.501 0.309 48.11
BaseFlow 19.72 0.513 0.304 48.71
AdFlow 19.04 0.506 0.278 44.67
Table 7: Approximations for perceptual quality on DIV2K (val.), BSD100, and Urban100. Since [23, 35, 39, 28, 12, 17, 30] showed the limitations of calculated metrics for SR, our main metric is the user study.
Figure 13: Qualitative comparison of the low-resolution input with RankSRGAN, ESRGAN [39], SRFlow [30], BaseFlow and AdFlow on the DIV2K (val), BSD100 and Urban100 sets.
Figure 14: Qualitative comparison of the low-resolution input with ESRGAN [39], BaseFlow and AdFlow on the DIV2K (val), BSD100 and Urban100 sets.
Figure 15: Qualitative comparison of the low-resolution input with ESRGAN [39], SRFlow [30], BaseFlow and AdFlow on the DIV2K (val), BSD100 and Urban100 sets.