1 Introduction
Photo-realistic image super-resolution (SR) is the task of upscaling a low-resolution (LR) image by adding natural-looking high-frequency content. Since this information is not contained in the LR image, SR assumes that a prior can be learned to add plausible high-frequency components. In general, however, there are infinitely many possible high-resolution (HR) images mapped to the same LR image. This task is therefore highly ill-posed, rendering the learning of powerful deep SR models highly challenging.
To cope with the ill-posed nature of the SR problem, existing state-of-the-art methods employ an ensemble of multiple losses designed for different purposes [23, 39, 47]. In particular, these works largely rely on the $L_1$ loss for fidelity and the adversarial loss for perceptual quality. Theoretically, the $L_1$ objective aims to predict the average over all plausible HR image manifestations under a Laplace model. That leads to blurry SR predictions, which are generally not perceptually pleasing. In contrast, the adversarial objective prefers images with natural characteristics and high-frequency details. These two losses are thus fundamentally conflicting in nature [6, 5].
The conflict between the $L_1$ and the adversarial loss has important negative consequences, as seen in Fig. 1. In order to find a decent trade-off, a precarious balancing between the two terms is needed. The found compromise is optimal in terms of neither fidelity nor perceptual quality. Moreover, the conflict between the two losses results in a remarkably inferior low-resolution consistency. That is, the downsampled version of the predicted SR image is substantially different from the original LR image. The conflict between the losses drives the prediction towards a point outside the space of plausible HR images (illustrated in Fig. 1).
We attribute those shortcomings to the $L_1$ loss. Since SR is a highly ill-posed problem, the $L_1$ loss imposes a rigid and exceptionally inaccurate model of the complicated manifold of solution images. Ideally, we want a loss that ensures fidelity while not penalizing realistic image patches preferred by the adversarial loss. In this work, we therefore first revisit the $L_1$ loss and view it from a probabilistic perspective. We observe that the $L_1$ objective corresponds to a one-layer conditional normalizing flow. This inspires us to explore flow-based generalizations capable of better capturing the manifold of plausible HR images, in order to mitigate the conflict between adversarial and fidelity-based objectives.
A few very recent works [40, 30] have investigated flows for SR. However, these approaches use heavyweight flow networks as an alternative to the adversarial loss for perceptual quality. In this work, we pursue a very different view, namely the flow as a fidelity-based generalization of the $L_1$ objective. Our goal is not to replace the adversarial loss but to find a fidelity-based companion that can enhance the effectiveness of adversarial learning for SR. In contrast to [30], this allows us to employ much shallower and more practical flow networks, ensuring substantially faster training and inference times. Furthermore, we demonstrate that the adversarial loss effectively removes artifacts generated by purely flow-based methods.
Contributions: The main contributions of this work are as follows: (i) We revisit the $L_1$ loss from a probabilistic perspective, expressing it as a one-layer conditional flow. (ii) We generalize the fidelity loss by employing a deep flow and demonstrate that it can be more effectively combined with an adversarial loss. (iii) We design a more practical, efficient, and stable flow architecture, better suited to the combined objective, leading to faster training and inference compared to [30]. (iv) We perform comprehensive experiments analyzing the flow loss combined with adversarial losses, giving valuable insights into the effects of increasing the flexibility of the fidelity-based objective. In comprehensive user studies, totaling over 50,000 votes, our approach outperforms the state of the art on three different datasets and scale factors.
2 Related Work
Single Image Super-Resolution is the task of estimating a high-resolution image from a low-resolution counterpart. It is fundamentally an ill-posed inverse problem. While originally addressed by employing interpolation techniques, learned methods are better suited for this complex task. Early learned approaches used sparse coding [7, 36, 44, 45] and local linear regression [37, 38, 43]. In recent years, deep learning based methods have largely replaced previous techniques for super-resolution owing to their highly impressive performance.
Initial deep learning approaches [10, 11, 19, 22, 25] for SR aimed at minimizing the $L_2$ or $L_1$ distance between the SR prediction and the ground-truth image. With this objective, the model is effectively trained to predict a mean of the plausible super-resolutions corresponding to the given LR input image. To alleviate this problem, [23] introduced an adversarial and a perceptual loss. Since then, this strategy has remained the predominant approach to super-resolution [2, 14, 35, 39, 28, 12, 17, 18]. Only very few works have investigated other learning formulations. Notably, Zhang et al. [47] introduce a selection mechanism based on perceptual quality metrics. In order to achieve an explorable SR formulation, Bahat and Michaeli [4] recently trained a stochastic SR network based mainly on adversarial objectives. The output acts as a prior for a low-resolution consistency enforcing module, which optimizes the image in a post-processing step.
In recent works, invertible networks have gained popularity for image-to-image translation [42, 9, 8, 34, 40, 30, 41, 24]. Xiao et al. [42] use invertible networks to learn the down- and upscaling of images. This is similar to compression, but where the compressed representation is constrained to be an LR image. For super-resolution, [30] recently introduced a new strategy based on normalizing flows. It aims at replacing adversarial losses with normalizing flows [9, 8, 34]. In contrast, we investigate conditional flows as a replacement for the $L_1$ loss. In fact, we demonstrate that the flow forms a direct generalization of the $L_1$ objective. The aim of this work is to investigate flows as an alternative fidelity-based companion to the adversarial loss.

3 Method
3.1 Revisiting the $L_1$ Loss
The standard paradigm for learning an SR network $g_\theta$ is to directly penalize the reconstruction error between a predicted image $g_\theta(x)$ and the ground-truth HR image $y$ corresponding to the LR input $x$. The reconstruction error is usually measured by applying simple norms in a color space (e.g., RGB or YCbCr). While initial methods [11, 19, 22] employed the $L_2$ norm, i.e. the mean squared error, later works [49, 25] studied the benefit of the $L_1$ error,

$$\mathcal{L}_1(\theta) = \|y - g_\theta(x)\|_1 = \sum_{i=1}^{D} \big|y_i - g_\theta(x)_i\big| \,. \tag{1}$$
To understand the implications of this objective function, we use its probabilistic interpretation: the $L_1$ loss (1) corresponds to the Negative Log-Likelihood (NLL) of a Laplace distribution. This derivation will be particularly illustrative for the generalizations considered later.
We first consider a latent variable $z$ with the standard Laplace distribution $p_z(z) = \prod_{i=1}^{D} \frac{1}{2} e^{-|z_i|}$. Let $f(y; x) = y - g_\theta(x)$ be a function that encodes the LR-HR pair $(x, y)$ into the latent space as $z = f(y; x)$. Through the inverse relation $y = f^{-1}(z; x) = g_\theta(x) + z$, it is easy to see that $y$ follows a Laplace distribution with mean $g_\theta(x)$,

$$p(y \,|\, x) = \prod_{i=1}^{D} \frac{1}{2}\, e^{-|y_i - g_\theta(x)_i|} \,. \tag{2}$$
Here, $D$ is the total dimensionality of $y$. From a probabilistic perspective, we are thus predicting the conditional distribution $p(y\,|\,x)$ of the HR output image $y$ given the LR input $x$. In particular, our SR network $g_\theta$ estimates the mean of this distribution under a Laplacian model. In order to learn the parameters $\theta$ of the network, we simply minimize the NLL of (2), which is equal to the $L_1$ loss (1) up to an additive constant.
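To make this correspondence concrete, the following PyTorch sketch (with arbitrary tensor shapes, our illustration rather than any released code) verifies numerically that the $L_1$ loss and the Laplace NLL differ only by the constant $D \log 2$:

```python
import math
import torch

y = torch.rand(1, 3, 32, 32)       # ground-truth HR image
y_hat = torch.rand(1, 3, 32, 32)   # prediction g_theta(x) of the SR network
D = y.numel()

l1 = (y - y_hat).abs().sum()
nll = -torch.distributions.Laplace(y_hat, 1.0).log_prob(y).sum()

# The NLL of (2) equals the L1 loss (1) plus the constant D*log(2).
assert torch.allclose(nll, l1 + D * math.log(2), atol=1e-3)
```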
In the aforementioned Laplacian model (2), derived from the $L_1$ loss (1), only the mean $g_\theta(x)$ is estimated from the LR image. The model thus assumes that the variance, which reflects the possible variability of each pixel, remains constant. This assumption is, however, not accurate. Indeed, super-resolving a constant blue sky is substantially easier than estimating the pixel values of a highly textured region, such as the foliage of a tree. In the former case, the predicted pixels should have low variance, while the latter has high variability, corresponding to different possible textures of foliage. For a Laplace distribution, we can encode the variability in the scale parameter, which is proportional to the standard deviation. By predicting the scale parameter for each pixel, we can learn a more accurate distribution that also quantifies some aspect of the ill-posed nature of the SR problem. We easily extend our model with the scale parameter prediction by modifying our function as,

$$z = f(y; x) = e^{-s_\theta(x)} \odot \big(y - \mu_\theta(x)\big) \,, \tag{3}$$

where the network now predicts both the mean $\mu_\theta(x)$ and the log-scale $s_\theta(x)$, and $\odot$ denotes the element-wise product.
Since this yields a Laplace distribution with mean $\mu_\theta(x)$ and scale $e^{s_\theta(x)}$, we obtain the NLL,

$$-\log p(y \,|\, x) = \sum_{i=1}^{D} \Big( e^{-s_\theta(x)_i} \big| y_i - \mu_\theta(x)_i \big| + s_\theta(x)_i \Big) + D \log 2 \,. \tag{4}$$
In practice, we can easily modify an SR network to jointly estimate the mean and scale by doubling the number of output dimensions. The loss (4) stimulates the network to predict larger scale values for 'uncertain' pixels that are likely to have a large error $|y_i - \mu_\theta(x)_i|$. In principle, (4) thus extends the $L_1$ objective to better cope with the ill-posed nature of the SR problem by predicting a more flexible distribution of the HR image. In the next section, we further generalize the objectives (1) and (4) through normalizing flows, to achieve an even more flexible fidelity loss.
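As an illustration, a minimal PyTorch sketch of the scale-adaptive Laplace NLL (4) could look as follows; the network output and tensor shapes are placeholders, not the authors' implementation:

```python
import torch

def adaptive_laplace_nll(y, mu, log_s):
    # NLL of Eq. (4), dropping the constant D*log(2): the network predicts
    # a per-pixel mean mu and log-scale log_s of a Laplace distribution.
    return (torch.exp(-log_s) * (y - mu).abs() + log_s).sum()

# Doubling the output channels of an SR network: 3 for mu, 3 for log_s.
out = torch.randn(1, 6, 64, 64)          # placeholder network output
mu, log_s = out.chunk(2, dim=1)
y = torch.rand(1, 3, 64, 64)             # ground-truth HR patch
loss = adaptive_laplace_nll(y, mu, log_s)
```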
3.2 Generalizing the $L_1$ Loss With Flows
To capture the probability distribution of the error between the prediction $\mu_\theta(x)$ and the ground truth $y$, we also need to consider spatial dependencies. Neighboring pixels are generally highly correlated in natural images. Indeed, to create coherent textures, even long-range correlations need to be considered. However, the $L_1$ loss (1) and its extension (4) assume each pixel in $y$ to be conditionally independent given $x$. In fact, sampling from the predicted conditional distribution is equivalent to simply adding Laplacian white noise to the predicted mean. In super-resolution, we strive to create fine textures and details. To achieve this, the predictive distribution must capture complex correlations in the image space.

In this paper, we generalize the loss (4) with the aim of achieving a more flexible objective that better captures the ill-posed setting. This is done through the probabilistic interpretation discussed in Sec. 3.1. We observe that the function $f$ introduced in Sec. 3.1 corresponds to a one-layer conditional normalizing flow with a Laplacian latent space. We can thus generalize this setting by constructing deeper flow networks $f_\theta$. While prior works [3, 40, 30, 33] investigate conditional flows for SR as a replacement for adversarial losses, we see them as a generalization of the fidelity-based $L_1$ loss. With this view, we aim to find a fidelity-based objective better suited for ill-posed problems and, therefore, more effectively combined with adversarial losses.
The purpose of the function $f_\theta$ is to map the HR-LR pair to a latent variable $z$, which follows a simple distribution. By increasing the depth and complexity of the flow $f_\theta$, more flexible conditional densities, and therefore also NLL-based losses, are achieved. In the general case, we let the flow be conditioned on an embedding $g_\theta(x)$ of the LR image as $z = f_\theta(y; g_\theta(x))$. In fact, the network $g_\theta$ can be seen as predicting the parameters of the conditional distribution $p(y\,|\,x)$. In this view, the embedding generalizes the purpose of the SR network $g_\theta$, which predicts the mean of the Laplace distribution in the $L_1$ case (1). In the general Laplace case (3), the LR embedding network needs to generate both the mean $\mu$ and the scale $s$. Thanks to the flexibility of conditional flow layers, we can, however, still use the underlying image representation of any standard SR architecture as $g_\theta$. For example, we generate the embedding by concatenating a series of intermediate feature maps from, e.g., the RRDB [39] or the RCAN [48] architecture. For simplicity, we often drop the explicit dependence on $g_\theta(x)$ in the flow and simply write $f_\theta(y; x)$.
In order for $f_\theta$ to be a valid conditional flow network, we need to preserve invertibility w.r.t. the first argument $y$. Under this condition, the conditional density $p(y\,|\,x)$ is derived using the change of variables formula [9, 21, 30] as,

$$p(y \,|\, x) = p_z\big(f_\theta(y; x)\big) \left| \det \frac{\partial f_\theta}{\partial y}(y; x) \right| \,. \tag{5}$$
The latent prior $p_z$ is set to a simple distribution, e.g. a standard Gaussian or Laplacian. The second factor in (5) is the resulting volume scaling, given by the determinant of the Jacobian $\frac{\partial f_\theta}{\partial y}$. We can easily draw samples from the model by inverting the flow as $y = f_\theta^{-1}(z; x)$ with $z \sim p_z$. The network thus transforms a simple distribution to capture the complex correlations in the output image space.
The NLL training objective is obtained by applying the negative logarithm to (5),

$$\mathcal{L}(\theta) = -\log p(y \,|\, x) = -\log p_z\big(f_\theta(y; x)\big) - \log \left| \det \frac{\partial f_\theta}{\partial y}(y; x) \right| \tag{6a}$$

$$= -\log p_z(z^N) - \sum_{n=1}^{N} \log \left| \det \frac{\partial f_\theta^n}{\partial z^{n-1}}(z^{n-1}) \right| \,, \tag{6b}$$
where $z^n = f_\theta^n(z^{n-1})$. In the second equality, we have decomposed $f_\theta = f_\theta^N \circ \cdots \circ f_\theta^1$ into a sequence of flow layers $f_\theta^n$, with $z^0 = y$ and $z^N = z$. This allows for efficient computation of the log-determinant term.
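The layer-wise computation of (6b) is what makes the objective practical. A minimal sketch, assuming the common normalizing-flow convention that each layer returns its output together with its log-determinant contribution:

```python
import torch

def flow_nll(y, layers, prior):
    """NLL of Eq. (6b). Each flow layer is assumed to follow the common
    convention of returning (output, log-determinant contribution)."""
    z, logdet = y, y.new_zeros(())
    for layer in layers:               # z^0 = y, ..., z^N = z
        z, ld = layer(z)
        logdet = logdet + ld
    return -prior.log_prob(z).sum() - logdet
```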
We can now show that the flow objective (6) generalizes the scaled Laplace loss (4), and thereby also the standard $L_1$ loss (1). Using the function $f$ defined in (3) and the standard Laplacian latent variable $z$, the first term in (4) is obtained by inserting (3) into the first term in (6a). For the second term, we first obtain the Jacobian of (3) as a diagonal matrix with elements $e^{-s_\theta(x)_i}$. Inserting this result into the log-determinant term in (6a) yields the second term in (4). A more detailed derivation is provided in the supplementary material. Next, we employ the flow-based fidelity objective in a full super-resolution framework by combining it with adversarial losses.
3.3 Flow-Fidelity with Adversarial Losses
The introduction of adversarial losses [23] pioneered a new direction in super-resolution, aiming to generate perceptually pleasing HR outputs from the natural image manifold. In order to achieve this, the adversarial loss needs to be combined with fidelity-based objectives, ensuring that the generated SR image is close to the HR ground truth. Therefore, SRGAN [23] and later works [2, 14, 35, 28, 12, 17] most typically combine the adversarial loss with the $L_1$ objective. However, these two objectives are fundamentally conflicting. Unlike the $L_1$ loss, which pulls the super-resolved image towards the mean of all plausible manifestations, the adversarial loss forces the generator to choose exactly one image from the natural image manifold. Hence, the adversarial objective ideally assigns a low loss to all natural image patches. In contrast, such predictions generate a high $L_1$ loss, since the $L_1$ objective prefers the blurry average of plausible predictions. We aim to resolve this issue by replacing the $L_1$ loss with the aforementioned flow-based generalizations.
The flow can learn a more flexible conditional distribution of the HR image $y$. It therefore better spans the natural image manifold while simultaneously encouraging consistency with the input LR image $x$. The NLL loss (6) of the flow distribution therefore does not penalize patches from the natural image manifold to the same extent. That allows the adversarial objective to drive the generated SR images towards perceptually pleasing results without being penalized by the fidelity-based loss. Conversely, the flow-based fidelity loss allows the network to learn from the single provided ground-truth HR image $y$, without reducing perceptual quality or incurring a higher adversarial loss.
Interestingly, the flow network $f_\theta$ can also be seen as a stochastic generator for the adversarial learning. While stochastic generators are fundamental to unconditional Generative Adversarial Networks (GANs) [13], deterministic networks are most common in the conditional setting, including super-resolution. In fact, GANs are well known to be highly susceptible to mode collapse in the conditional setting [16, 32]. In contrast, flows are highly resistant to mode collapse due to the bijective constraint on $f_\theta$. This is highly important for ill-posed problems such as SR, where we ideally want to span the space of possible predictions. However, it is important to note that the flow is not merely a generator, as in the standard GAN setting. The flow itself also serves as a flexible loss function (6).

Formally, we add the adversarial loss on samples generated by the flow network $f_\theta$. Let $d_\phi$ be the discriminator with parameters $\phi$. For one LR-HR pair $(x, y)$ and a random latent sample $z \sim p_z$, we consider the adversarial loss

$$\mathcal{L}_{\mathrm{adv}}(\theta, \phi) = \log d_\phi(y) + \log\Big(1 - d_\phi\big(f_\theta^{-1}(z; x)\big)\Big) \,. \tag{7}$$
The loss (7) is minimized w.r.t. the flow and LR encoder parameters $\theta$ and maximized w.r.t. the discriminator parameters $\phi$. In general, any other variant of the adversarial loss (7) can be employed. During training, we employ a linear combination of the NLL loss (6) and the adversarial loss (7). Our training procedure is detailed in Sec. 3.5.
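A sketch of the combined objective is given below. The module names (`flow.encode`, `flow.decode`, `encoder`, `disc`) and the loss weight are hypothetical placeholders; the discriminator is assumed to output probabilities, and its own (maximizing) update step is omitted for brevity:

```python
import torch

def generator_step(x, y, flow, encoder, disc, prior, lambda_adv=0.01):
    """One generator update combining the flow NLL (6) with the
    adversarial loss (7). All names and lambda_adv are placeholders."""
    u = encoder(x)                             # LR embedding g_theta(x)

    # Fidelity: NLL of the ground truth under the flow, Eq. (6).
    z, logdet = flow.encode(y, u)
    nll = -prior.log_prob(z).sum() - logdet

    # Adversarial: sample an SR image by inverting the flow, Eq. (7).
    z_rand = prior.sample(z.shape)
    y_fake = flow.decode(z_rand, u)
    adv = torch.log(1 - disc(y_fake)).mean()   # generator part of (7)

    return nll + lambda_adv * adv
```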
3.4 Conditional Flow Architecture
Our full approach, depicted in Fig. 2, consists of the super-resolution network $g_\theta$, the flow network $f_\theta$, and the discriminator $d_\phi$. We construct our conditional flow network based on [30] and use the same settings where not mentioned otherwise. It is based on Glow [21] and RealNVP [9]. It employs a pyramid structure of scale levels, each halving the previous level's spatial size using a squeeze layer and, depending on the number of channels, also bypassing half of the activations directly to the NLL calculation. We use 3, 4, and 4 scale levels for 4×, 6×, and 8× SR respectively, each consisting of a series of $K$ flow steps.
Each flow step consists of a sequence of four layers. In the encoding direction, we first employ ActNorm [21] to normalize the activations using a learned channel-wise scale and bias. To establish information transfer across the channel dimension, we then use an invertible 1×1 convolution [21]. The following layers condition the flow on the LR image, similar to [30]. First, the Conditional Affine Coupling [9, 30] partitions the channels into two halves. The first half is used as input, together with the LR encoding $g_\theta(x)$, to a 3-layer convolutional network module, which predicts the element-wise scale and bias for the second half. This module adds non-linearities and spatial dependencies to the flow network while ensuring easy invertibility and tractable log-determinants. Second, the Affine Image Injector is applied, which transforms all channels conditioned on the low-resolution encoding $g_\theta(x)$.
Instead of the learnable 1×1 convolutions used in [30, 21], we use constant orthonormal matrices that are randomly sampled at the start of training. We found this to significantly improve training speed while ensuring better stability due to these layers' perfect conditioning. When combined with an adversarial loss, the flow network operates in both the encoding and decoding direction during training. To ensure stability during training in both directions, we reparametrize the prediction of the multiplicative unit in the conditional affine coupling layer. In particular, we predict the multiplicative factor by passing the unconstrained prediction stemming from the convolutional module in the coupling through a bounded function.
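The fixed random orthonormal 1×1 convolution is straightforward to implement. The following sketch (our illustration, not the released code) samples the matrix once via a QR decomposition and follows the (output, log-determinant) layer convention assumed earlier; orthonormality makes the log-determinant vanish:

```python
import torch

class FixedRandomRotation1x1(torch.nn.Module):
    """Invertible 1x1 convolution with a constant orthonormal matrix,
    sampled once at the start of training. |det| = 1 for an orthonormal
    matrix, so the log-determinant contribution is zero and the layer is
    perfectly conditioned in both directions."""
    def __init__(self, channels):
        super().__init__()
        q, _ = torch.linalg.qr(torch.randn(channels, channels))
        self.register_buffer("w", q)          # constant: not a Parameter

    def forward(self, z):                     # encoding direction
        return torch.einsum("oc,bchw->bohw", self.w, z), z.new_zeros(())

    def inverse(self, z):                     # decoding direction: w^T = w^-1
        return torch.einsum("co,bchw->bohw", self.w, z)
```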
AdFlow compared to |            4×             |            6×             |            8×
                   | DIV2K | BSD | Urban | DIV2K | BSD | Urban | DIV2K | BSD | Urban
BaseFlow | 62.1% ± 2.2 | 68.3% ± 2.4 | 74.2% ± 2.1 | 73.4% ± 2.0 | 80.7% ± 1.8 | 82.9% ± 1.7 | 69.2% ± 2.1 | 73.1% ± 2.0 | 78.2% ± 1.9
SRFlow | 60.1% ± 2.2 | 67.2% ± 2.4 | 66.3% ± 2.2 | — | — | — | 66.2% ± 2.1 | 67.2% ± 2.1 | 71.8% ± 2.0
RankSRGAN | 56.9% ± 2.2 | 54.8% ± 2.6 | 67.5% ± 2.2 | — | — | — | — | — | —
ESRGAN | 56.1% ± 2.2 | 51.2% ± 2.6 | 64.5% ± 2.3 | 57.5% ± 2.3 | 62.8% ± 2.2 | 63.8% ± 2.2 | 49.9% ± 2.3 | 54.5% ± 2.2 | 57.1% ± 2.2
Ground Truth | 49.0% ± 2.2 | 25.4% ± 2.2 | 29.1% ± 2.2 | 27.4% ± 2.1 | 8.9% ± 1.3 | 11.6% ± 1.5 | 18.3% ± 1.7 | 4.2% ± 0.9 | 7.7% ± 1.2
Table 1: User study results, showing the percentage of votes in favour of AdFlow along with the 95% confidence interval. We indicate whether AdFlow is significantly better or worse.

Super-Resolution embedding network $g_\theta$: Our flow-based objective is designed as a replacement for the $L_1$ loss. Our formulation is therefore agnostic to the architecture of the underlying SR embedding network $g_\theta$. We use the popular RRDB [39] SR network as our encoder $g_\theta$. Instead of outputting the final RGB SR image, the network predicts a rich embedding of the LR image. We obtain this in practice by simply concatenating the underlying feature activations at the intermediate RRDB blocks 1, 4, 6, and 8.
Discriminator: We use the VGG-based network from [39] as discriminator. Since we generate stochastic SR samples during training, we found it beneficial to reduce the discriminator's capacity to ensure a balanced adversarial objective. We therefore reduce the internal channel dimension of the discriminator from 64 to 16.
3.5 Training Details
Our approach is trained by a weighted combination of the NLL loss (6) for fidelity and the adversarial loss (7) to increase perceptual quality. We consider the standard bicubic setting in our experiments, where the LR image is generated with the MATLAB bicubic downsampling kernel. In particular, we train for the 4× and 6× settings as well as the challenging 8× SR scenario. We first train the networks $f_\theta$ and $g_\theta$ using only the flow NLL loss for 200k iterations, with separate initial learning rates for $f_\theta$ and $g_\theta$ that are then decreased stepwise. We fine-tune the network with the adversarial loss for 200k iterations and select the checkpoint with the lowest LPIPS [46] as measured on the training set. We employ the Adam [20] optimizer. As in [30], we add uniformly distributed noise of a small fraction of the signal range to the ground-truth HR image. Our network is trained on HR patches, with the patch size chosen per scale factor. In principle, our framework can employ any adversarial loss formulation. To allow for a direct comparison with the popular state-of-the-art network ESRGAN [39], we employ the same relativistic adversarial formulation. The weight of the adversarial loss and the discriminator learning rate are set separately for the 4×/6× and the 8× settings. We use the same training data employed by ESRGAN [39], consisting of the DF2K dataset. It comprises 2650 training images from Flickr2K [25] and 800 training images from the DIV2K [1] dataset.

4 Experiments
[Figure: visual comparison of Low Resolution input, ESRGAN [39], BaseFlow, and AdFlow on the Urban100, BSD100, and DIV2K datasets.]
We validate our proposed formulation by performing comprehensive experiments on three standard datasets, namely DIV2K [1], BSD100 [31], and Urban100 [15]. We train our approach for three different scale factors, 4×, 6×, and 8×. We term our flow-only baseline BaseFlow and our final method, which also employs adversarial learning, AdFlow. Prediction and evaluation are performed at the full image resolution. For the purely flow-based baseline, we found it best to use a reduced sampling temperature [21, 30] ($\tau < 1$). For our final approach with adversarial loss, we found the standard sampling temperature of $\tau = 1$ to yield the best results. Detailed results and more visual examples are found in the supplementary material.
4.1 State-of-the-Art Comparison
We first compare our approach with the state of the art. This work aims to achieve SR predictions that are (i) photo-realistic and (ii) consistent with the input LR image. Since it has become well known [23, 35, 39, 28, 12, 17, 30, 26, 29, 27] that computed metrics, such as PSNR and SSIM, fail to rank methods according to photo-realism (i), we perform extensive user studies as described below. To assess the consistency of the prediction with the LR input (ii), we first downscale the predicted SR image with the given bicubic kernel and compare the result with the LR input. Their similarity is measured using PSNR, and we therefore refer to this metric as LR-PSNR. The LR consistency penalizes hallucinations and artifacts that cannot be explained from the input image.
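LR-PSNR is simple to compute; below is a sketch using PyTorch's bicubic resize as a stand-in for the MATLAB bicubic kernel used in the paper (the two kernels differ slightly in their anti-aliasing):

```python
import torch
import torch.nn.functional as F

def lr_psnr(sr, lr, scale):
    """Downscale the SR prediction and compare it with the LR input.
    Images are assumed to be (B, C, H, W) tensors in [0, 1]."""
    down = F.interpolate(sr, scale_factor=1.0 / scale, mode="bicubic",
                         align_corners=False, antialias=True)
    mse = F.mse_loss(down.clamp(0, 1), lr)
    return 10 * torch.log10(1.0 / mse)
```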
User studies: We compare the photo-realism of our AdFlow with other methods in user studies. The user is shown the full low-resolution image, where a randomly selected region is marked with a bounding box. Next to this image, two different super-resolutions, or "zooms", of the marked region are displayed. The user is asked to select "Which image zoom looks more realistic?". In this manner, the user evaluates the photo-realism of our AdFlow versus each compared method. To obtain an unbiased opinion, the methods were anonymized and shown in a different random order for each crop. In each study, we evaluate 3 random crops for each of the 100 images in a dataset (DIV2K, BSD100, or Urban100). We use 5 different users for every study, resulting in 1500 votes per method-to-method comparison in each dataset and scale factor. The full user study, shown in Tab. 1, thus collects over 50,000 votes. Further details are provided in the supplement.
Methods: We compare our approach with state-of-the-art approaches for photo-realistic super-resolution: ESRGAN [39], RankSRGAN [47], and SRFlow [30]. For the two latter approaches, we use the publicly available code and trained models. In addition to the publicly available ESRGAN model for 4× SR, we train models for 6× and 8× SR using the code provided by the authors. All compared methods are trained on the same training set, namely DF2K [25].
Results: The results of our user study are given in Tab. 1. The first number represents the ratio of votes in favour of AdFlow and the second the 95% confidence interval. AdFlow outperforms all other methods at a 95% significance level on all datasets, except for one case. Interestingly, it almost matches the realism of the ground truth on the DIV2K dataset. The visual results in Fig. 13 show that our AdFlow generates sharp and realistic textures and structures. In contrast, ESRGAN frequently generates visible artifacts, while SRFlow and BaseFlow achieve less sharp results. While RankSRGAN suffers fewer artifacts than ESRGAN, its predictions are less sharp than those of AdFlow.
Method |     4×: DIV2K | BSD | Urban |     6×: DIV2K | BSD | Urban |     8×: DIV2K | BSD | Urban
BaseFlow | 49.9 | 49.9 | 49.5 | 48.3 | 48.6 | 47.9 | 49.8 | 50.2 | 48.7
SRFlow | 50.0 | 49.9 | 49.5 | — | — | — | 49.0 | 51.0 | 48.1
RankSRGAN | 42.3 | 41.7 | 39.9 | — | — | — | — | — | —
ESRGAN | 39.0 | 37.7 | 36.8 | 33.2 | 32.8 | 30.9 | 31.3 | 31.7 | 28.9
AdFlow | 45.2 | 45.6 | 43.4 | 37.5 | 38.5 | 36.0 | 46.0 | 46.8 | 42.1

Table 2: LR-PSNR (dB) for all datasets and scale factors.
The results for the higher scale factors 6× and 8× show a similar trend, as seen in Fig. 14 and 15. Our approach consistently outperforms the purely flow-based approaches BaseFlow and SRFlow for all scale factors and datasets, with over 60% of the votes. As seen in the visual examples, particularly for 6× and 8×, the flow-based approaches often generate strong high-frequency artifacts. In contrast, our AdFlow generates structured and crisp textures, attributed to the adversarial loss. Compared to ESRGAN, which combines $L_1$ with an adversarial loss, AdFlow produces generally sharper results and exhibits no visible color shift, as seen in Fig. 15. Interestingly, AdFlow demonstrates substantially better generalization to the BSD100 and Urban100 datasets than ESRGAN, as shown in the user study in Tab. 1. This indicates that ESRGAN tends to overfit to the DIV2K distribution. Qualitative examples for 4×, 6×, and 8× are shown in Fig. 13, 14, and 15, respectively.
We report the LR-PSNR for all datasets and scale factors in Tab. 2. ESRGAN and RankSRGAN obtain poor LR consistency across all datasets. AdFlow gains between 4.3 dB and 15.1 dB in LR-PSNR over ESRGAN, indicating fewer hallucination artifacts and less color shift. By employing the flow-based fidelity instead of $L_1$, AdFlow achieves superior photo-realism while ensuring high LR consistency.
4.2 Analysis of the Flow-based Fidelity Objective
Here, we analyze the impact of generalizing the $L_1$ loss towards a gradually more flexible flow-based NLL objective. This is done by increasing the number of flow steps $K$ per level inside the flow architecture. We train and evaluate our AdFlow with different depths $K$. Due to the difficulty and cost of running a large number of user studies, we here use the learned LPIPS [46] distance as a surrogate to assess photo-realism. In Fig. 6 we plot the LPIPS and LR-PSNR on the DIV2K validation set as a function of the number of flow steps $K$. We also include the results obtained by the $L_1$ loss, which is an even simpler one-layer flow loss, as discussed in Sec. 3.1 and 3.2. Note that these results, without and with the adversarial loss, correspond to the standard RRDB and ESRGAN, respectively.
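For reference, the LPIPS metric is available as a Python package; a minimal usage sketch follows (the backbone variant shown here is an assumption, not taken from the paper):

```python
import torch
import lpips  # pip install lpips

# LPIPS as a photo-realism surrogate (Sec. 4.2). The metric expects RGB
# tensors scaled to [-1, 1]; 'alex' is the commonly reported variant.
loss_fn = lpips.LPIPS(net="alex")
sr = torch.rand(1, 3, 128, 128) * 2 - 1   # placeholder SR prediction
hr = torch.rand(1, 3, 128, 128) * 2 - 1   # placeholder ground truth
d = loss_fn(sr, hr)                       # lower is better
```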
Adv. Loss | Affine Coup. | Rand. Rot. | Coupl. Mult. | Percept. Loss | LPIPS | LR-PSNR
          |              |            |              |               | 0.349 | 39.76
✓         |              |            |              |               | 0.337 | 34.85
          | ✓            |            |              |               | 0.253 | 50.16
✓         | ✓            |            |              |               | —     | —
          | ✓            | ✓          |              |               | 0.254 | 50.19
✓         | ✓            | ✓          |              |               | —     | —
          | ✓            | ✓          | ✓            |               | 0.253 | 49.78
✓         | ✓            | ✓          | ✓            |               | 0.253 | 47.54
✓         | ✓            | ✓          | ✓            | ✓             | 0.270 | 47.35

Table 3: Ablation of the flow architecture and losses. Dashes indicate configurations for which training was not stable.
As we increase the depth $K$ of the flow network $f_\theta$, the LPIPS decreases while the LR-PSNR increases. This indicates an improvement in both perceptual quality and low-resolution consistency. This trend also holds when starting from the $L_1$ NLL objective. Note that the brief initial increase in LPIPS is explained by the added stochasticity when transitioning from the $L_1$ loss to the flow. Indeed, a too shallow flow network does not capture rich enough spatial correlations to generate more natural samples. However, already at a small number of flow steps, the flow-based generalization outperforms the $L_1$ loss in LPIPS. Increasing the flexibility of the NLL-based fidelity loss, starting from the $L_1$ loss, thus benefits perceptual quality and consistency. This strongly indicates that a flow-based fidelity objective alleviates the conflict between the adversarial loss and the $L_1$ loss.
4.3 Ablation of Flow Architecture
In Tab. 3 we show the results of our ablative experiments for 8× SR on DIV2K. First, we ablate the use of Conditional Affine Couplings. Removing this layer (top row) results in a conditionally linear flow, thereby radically limiting its expressiveness. This leads to a substantially worse LPIPS and LR-PSNR, demonstrating the importance of a flexible flow network. Second, we replace the learnable 1×1 convolutions with fixed random rotation matrices (Rand. Rot.). While widely preserving the quality in all metrics, this substantially reduces the training time. Next, we consider the reparametrization of the coupling layers (Coupl. Mult.). We found this to be critical for training stability when combined with the adversarial loss. Lastly, we investigate the use of the VGG-based perceptual loss [23] that is commonly used in SR methods. It is generally employed as a more perceptually inclined fidelity loss to complement the $L_1$ objective. However, we found the perceptual loss not to be beneficial. This indicates that the more flexible flow-based fidelity loss can also effectively replace the VGG loss.
5 Conclusion
We explore conditional flows as a generalization of the $L_1$ loss in the context of photo-realistic super-resolution. In particular, we tackle the conflicting objectives of the $L_1$ and adversarial losses. Our flow-based alternatives offer both improved fidelity to the low-resolution input and a higher degree of flexibility. Extensive user studies clearly demonstrate the advantages of our approach over the state of the art on three datasets and scale factors. Lastly, our experimental analysis brings new insights into the learning of super-resolution methods, paving the way for further exploration in the pursuit of more powerful learning formulations.
Acknowledgements: This work was supported by the ETH Zürich Fund (OK), a Huawei Technologies Oy (Finland) project, an Nvidia GPU grant, and AWS.
References
[1] (2017) NTIRE 2017 challenge on single image super-resolution: dataset and study. In CVPR Workshops.
[2] (2018) Image super-resolution via progressive cascading residual network. In CVPR.
[3] (2019) Guided image generation with conditional invertible neural networks. CoRR abs/1907.02392.
[4] (2020) Explorable super resolution. In CVPR.
[5] (2018) The 2018 PIRM challenge on perceptual image super-resolution. In ECCV 2018 Workshops, Lecture Notes in Computer Science, Vol. 11133, pp. 334–355.
[6] (2018) The perception-distortion tradeoff. In CVPR, pp. 6228–6237.
[7] (2015) Jointly optimized regressors for image super-resolution. Comput. Graph. Forum 34(2), pp. 95–104.
[8] (2015) NICE: non-linear independent components estimation. In ICLR Workshop Track.
[9] (2017) Density estimation using real NVP. In ICLR.
[10] (2014) Learning a deep convolutional network for image super-resolution. In ECCV, pp. 184–199.
[11] (2016) Image super-resolution using deep convolutional networks. TPAMI 38(2), pp. 295–307.
[12] (2019) Frequency separation for real-world super-resolution. In ICCV Workshops, pp. 3599–3608.
[13] (2014) Generative adversarial nets. In NeurIPS, pp. 2672–2680.
[14] (2018) Deep back-projection networks for super-resolution. In CVPR.
[15] (2015) Single image super-resolution from transformed self-exemplars. In CVPR, pp. 5197–5206.
[16] (2017) Image-to-image translation with conditional adversarial networks. In CVPR, pp. 5967–5976.
[17] (2020) Real-world super-resolution via kernel estimation and noise injection. In CVPR Workshops.
[18] (2019) Progressive face super-resolution via attention to facial landmark. arXiv abs/1908.08239.
[19] (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR.
[20] (2015) Adam: a method for stochastic optimization. In ICLR.
[21] (2018) Glow: generative flow with invertible 1x1 convolutions. In NeurIPS, pp. 10236–10245.
[22] (2017) Deep Laplacian pyramid networks for fast and accurate super-resolution. In CVPR.
[23] (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR.
[24] (2021) Hierarchical conditional flow: a unified framework for image super-resolution and image rescaling. In ICCV.
[25] (2017) Enhanced deep residual networks for single image super-resolution. In CVPR Workshops.
[26] (2019) AIM 2019 challenge on real-world image super-resolution: methods and results. In ICCV Workshops.
[27] (2020) NTIRE 2020 challenge on real-world image super-resolution: methods and results. In CVPR Workshops.
[28] (2019) Unsupervised learning for real-world super-resolution. In ICCV Workshops.
[29] (2021) NTIRE 2021 learning the super-resolution space challenge. In CVPR Workshops.
[30] (2020) SRFlow: learning the super-resolution space with normalizing flow. In ECCV.
[31] (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, Vol. 2, pp. 416–423.
[32] (2016) Deep multi-scale video prediction beyond mean square error. In ICLR.
[33] (2020) C-Flow: conditional generative flow models for images and 3D point clouds. In CVPR, pp. 7949–7958.
[34] (2015) Variational inference with normalizing flows. In ICML, pp. 1530–1538.
[35] (2017) EnhanceNet: single image super-resolution through automated texture synthesis. In ICCV, pp. 4501–4510.
[36] (2012) Super-resolution from internet-scale scene matching. In ICCP.
[37] (2014) A+: adjusted anchored neighborhood regression for fast super-resolution. In ACCV, pp. 111–126.
[38] (2013) Anchored neighborhood regression for fast example-based super-resolution. In ICCV, pp. 1920–1927.
[39] (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In ECCV.
[40] (2019) Learning likelihoods with conditional normalizing flows. arXiv abs/1912.00042.
[41] (2021) DeFlow: learning complex image degradations from unpaired data with conditional flows. In CVPR.
[42] (2020) Invertible image rescaling. In ECCV, Lecture Notes in Computer Science, Vol. 12346, pp. 126–144.
[43] (2013) Fast direct super-resolution by simple functions. In ICCV, pp. 561–568.
[44] (2008) Image super-resolution as sparse representation of raw image patches. In CVPR.
[45] (2010) Image super-resolution via sparse representation. IEEE Trans. Image Processing 19(11), pp. 2861–2873.
[46] (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.
[47] (2019) RankSRGAN: generative adversarial networks with ranker for image super-resolution. In ICCV, pp. 3096–3105.
[48] (2018) Image super-resolution using very deep residual channel attention networks. In ECCV, pp. 294–310.
[49] (2017) Loss functions for image restoration with neural networks. IEEE Trans. Computational Imaging 3(1), pp. 47–57.
Appendix
In this appendix, we first provide further details on the user study in Sec. A. Second, we provide an analysis of the sampling temperature in Sec. B. Third, we present additional details on the minimally generalized $L_1$ loss in Sec. C. Finally, we provide a further qualitative and quantitative comparison of AdFlow with other state-of-the-art methods in Sec. D. Additional visual results, used in our study, will be available on the project page git.io/AdFlow.
A User Study
As described in Sec. 4.1 of the main paper, we conduct a user study. The GUI is shown in Figure 10. We ask the user to evaluate which of the two images looks more realistic. To select the chosen image, the user presses the 1 key for the left and the 2 key for the right image. Once a selection is made, the user can move to the next image using the right arrow key until all tasks are completed. Finally, the form is submitted using the button on the top right.
To increase the data quality, we use a filtering mechanism. For that, we add redundant questions and reject submissions that have a low self-consistency. A visualization of the study results is shown in Fig. 11. The green bars display the percentage of votes favoring the photo-realism of AdFlow, while the red bars show the percentage favoring the other method. We display the statistical significance by showing the 95% confidence interval in black. Examples of images used in our study are shown in visuals.html.
B Analysis of Sampling Temperature
Here, we analyze the trade-off between image quality, in terms of LPIPS [46], and consistency with the low-resolution input, in terms of LR-PSNR, when varying the sampling temperature. We sample from the latent space with a zero-mean Gaussian prior distribution with variance $\tau^2$, where $\tau$ is usually termed the sampling temperature [21]. Similar to [30], we can set the operating point by adjusting the temperature $\tau$. Figures 7, 8 and 9 show that ESRGAN [39] trades off much more low-resolution consistency to improve the perceptual quality than AdFlow. The best trade-off is achieved at a reduced temperature for BaseFlow and at the standard temperature for AdFlow, as used in the main paper.
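For concreteness, temperature sampling amounts to scaling the latent standard deviation before inverting the flow; a minimal sketch, where `flow.decode`, the embedding `u`, and the value of `tau` are hypothetical placeholders:

```python
import torch

def sample_sr(flow, u, z_shape, tau=0.8):
    # Draw the latent from N(0, tau^2) instead of N(0, 1), then invert
    # the flow conditioned on the LR embedding u = g_theta(x).
    z = tau * torch.randn(z_shape)
    return flow.decode(z, u)
```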
C Minimal Generalization of the $L_1$ Loss
Here we provide further theoretical and empirical analysis of the minimal generalization of the $L_1$ loss with normalizing flows.
C.1 Relation of the Flow Loss to $L_1$
We here derive the generalized $L_1$ objective (Eq. (4) in the main paper) from the normalizing flow formulation as a one-layer special case. Let $z$ be standard Laplace. We use the function $f$ defined as (Eq. (3) of the main paper),

$$z = f(y; x) = e^{-s_\theta(x)} \odot \big(y - \mu_\theta(x)\big) \,, \tag{8}$$
for which we obtain the inverse as,

$$y = f^{-1}(z; x) = \mu_\theta(x) + e^{s_\theta(x)} \odot z \,. \tag{9}$$
Since $z$ follows a standard Laplace distribution, it is easy to see that the NLL of $y$ is as given by Eq. (4) in the main paper, that is,

$$-\log p(y \,|\, x) = \sum_{i=1}^{D} \Big( e^{-s_\theta(x)_i} \big| y_i - \mu_\theta(x)_i \big| + s_\theta(x)_i \Big) + D \log 2 \,. \tag{10}$$
Hence, (8) is the one-layer flow corresponding to (10). Inserting (8) into the NLL formula for flows (Eq. (6a) in the main paper) gives,

$$-\log p_z\big(f(y; x)\big) - \log \left| \det \frac{\partial f}{\partial y} \right| = \sum_{i=1}^{D} e^{-s_\theta(x)_i} \big| y_i - \mu_\theta(x)_i \big| + D \log 2 + \sum_{i=1}^{D} s_\theta(x)_i \,. \tag{11}$$
Here, the Jacobian $\frac{\partial f}{\partial y}$ is a diagonal matrix with elements $e^{-s_\theta(x)_i}$. The final result thus corresponds to the NLL derived directly in (10). We therefore conclude that the generalized $L_1$ objective is a special case given by the one-layer normalizing flow defined in (8).
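This equivalence is easy to verify numerically. The following sketch (our illustration) computes the NLL both via the flow formula, i.e. prior NLL plus log-determinant as in Eq. (11), and directly via Eq. (10):

```python
import torch

y = torch.rand(3, 8, 8)
mu, s = torch.rand(3, 8, 8), torch.randn(3, 8, 8)

z = torch.exp(-s) * (y - mu)                              # Eq. (8)
prior_nll = -torch.distributions.Laplace(0.0, 1.0).log_prob(z).sum()
logdet = (-s).sum()                                       # log|det df/dy|
flow_nll = prior_nll - logdet                             # Eq. (11)

direct_nll = (torch.exp(-s) * (y - mu).abs() + s).sum() \
             + y.numel() * torch.log(torch.tensor(2.0))   # Eq. (10)

assert torch.allclose(flow_nll, direct_nll, atol=1e-3)
```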
C.2 Empirical Analysis
We report results for the intermediate step of predicting an adaptive variance according to the Laplacian model described in Sec. 3.1, Eqs. (3)-(4) of the main paper. The three additional output channels predict the log-scale of the Laplace distribution. The loss in Eq. (4) of the main paper can thus be written as,

$$\mathcal{L}(\theta) = \sum_{i=1}^{D} \Big( e^{-s_\theta(x)_i} \big| y_i - \mu_\theta(x)_i \big| + s_\theta(x)_i \Big) \,. \tag{12}$$
We notice that even this minimal extension of the $L_1$ objective reduces the conflict with the adversarial loss to some extent. The effective removal of artifacts is especially apparent in the first row of Figure 12, comparing ESRGAN [39] and ESRGAN + Adaptive Variance. Our further generalization of the $L_1$ loss continues to improve the quality of the super-resolutions.
As an increase in visual quality alone would not be a good indicator of a reduced conflict of objectives, we also report the low-resolution consistency in Table 4, which improves by 2.91 dB from ESRGAN [39] to ESRGAN + Adaptive Variance. The additional generalization to BaseFlow and AdFlow leads to a further improved low-resolution consistency. Based on the improved visual quality and low-resolution consistency, we conclude that the minimally generalized $L_1$ loss reduces the conflict in objectives, which further validates our strategy of replacing the $L_1$ loss with a more flexible generalization.
Table 4: Effect of the adaptive-variance extension.

Method  PSNR  SSIM  LPIPS  LR-PSNR
RRDB [39]  25.52  0.697  0.419  45.31 
RRDB [39] + Adaptive Variance  25.47  0.696  0.418  44.51 
ESRGAN [39]  22.14  0.578  0.277  31.28 
ESRGAN [39] + Adaptive Variance  22.94  0.593  0.280  34.19 
BaseFlow  23.58  0.595  0.253  49.78 
AdFlow  23.45  0.602  0.253  47.54 
D Detailed Results
In this section, we provide an extended quantitative and qualitative analysis of the same BaseFlow and AdFlow networks evaluated in the main paper. For completeness, we here provide the PSNR, SSIM, and LPIPS on the DIV2K, BSD100, and Urban100 datasets. Results are reported in Tables 5, 6, and 7. Note, however, that these metrics do not reflect photo-realism well, as discussed in Sec. 4.1 of the main paper.
Further qualitative results for the scale factors 4×, 6×, and 8× are provided in Figures 13, 14, and 15, respectively.
Table 5: Quantitative results for 4× SR.

Method  PSNR  SSIM  LPIPS  LR-PSNR
DIV2K 
Bicubic  26.69  0.766  0.409  38.69 
RRDB [39]  29.44  0.844  0.253  49.17  
ESRGAN [39]  26.20  0.747  0.124  39.01  
RankSRGAN [47]  26.55  0.750  0.128  42.33  
SRFlow [30]  27.08  0.756  0.120  49.97  
BaseFlow  27.21  0.760  0.118  49.88  
AdFlow  27.02  0.768  0.132  45.17  
BSD100 
Bicubic  22.40  0.508  0.713  37.13 
RRDB [39]  23.58  0.572  0.554  45.26  
ESRGAN [39]  20.99  0.462  0.332  31.68  
SRFlow [30]  21.76  0.467  0.335  51.01  
BaseFlow  22.03  0.478  0.325  50.17  
AdFlow  22.01  0.486  0.327  48.78  
Urban100 
Bicubic  19.31  0.477  0.686  33.93 
RRDB [39]  21.15  0.603  0.401  43.33  
ESRGAN [39]  18.43  0.475  0.306  28.88  
SRFlow [30]  19.29  0.501  0.309  48.11  
BaseFlow  19.72  0.513  0.304  48.71  
AdFlow  19.04  0.506  0.278  44.67 
Table 6: Quantitative results for 6× SR.

Method  PSNR  SSIM  LPIPS  LR-PSNR
DIV2K 
Bicubic  24.87  0.680  0.519  37.78 
RRDB  26.51  0.741  0.382  46.86  
ESRGAN [39]  23.16  0.629  0.222  33.21  
BaseFlow  24.04  0.621  0.216  49.12  
AdFlow  23.94  0.6505  0.216  37.57  
BSD100 
Bicubic  23.26  0.564  0.645  37.51 
RRDB  24.42  0.625  0.507  46.41  
ESRGAN [39]  21.42  0.501  0.288  32.76  
BaseFlow  21.98  0.500  0.274  48.60  
AdFlow  22.17  0.533  0.269  38.52  
Urban100 
Bicubic  20.20  0.541  0.606  34.24 
RRDB  21.95  0.650  0.371  44.81  
ESRGAN [39]  19.43  0.541  0.251  30.87  
BaseFlow  20.43  0.564  0.255  47.90  
AdFlow  20.26  0.583  0.235  36.01 
Table 7: Quantitative results for 8× SR.

Method  PSNR  SSIM  LPIPS  LR-PSNR
DIV2K 
Bicubic  23.74  0.627  0.584  37.14 
RRDB [39]  25.52  0.697  0.419  45.31  
ESRGAN [39]  22.14  0.578  0.277  31.28  
SRFlow [30]  23.04  0.578  0.275  49.02  
BaseFlow  23.58  0.595  0.253  49.78  
AdFlow  23.38  0.600  0.264  46.02  
BSD100 
Bicubic  22.40  0.508  0.713  37.13 
RRDB [39]  23.58  0.572  0.554  45.26  
ESRGAN [39]  20.99  0.462  0.332  31.68  
SRFlow [30]  21.76  0.467  0.335  51.01  
BaseFlow  22.03  0.478  0.325  50.17  
AdFlow  22.01  0.486  0.327  48.78  
Urban100 
Bicubic  19.31  0.477  0.686  33.93 
RRDB [39]  21.15  0.603  0.401  43.33  
ESRGAN [39]  18.43  0.475  0.306  28.88  
SRFlow [30]  19.29  0.501  0.309  48.11  
BaseFlow  19.72  0.513  0.304  48.71  
AdFlow  19.04  0.506  0.278  44.67 
[Figures 13–15: further visual comparisons of Low Resolution input, ESRGAN [39], BaseFlow, and AdFlow on the Urban100, BSD100, and DIV2K datasets.]