Toward Real-World Super-Resolution via Adaptive Downsampling Models

09/08/2021 · Sanghyun Son, et al. · Seoul National University and Google

Most image super-resolution (SR) methods are developed on synthetic low-resolution (LR) and high-resolution (HR) image pairs that are constructed by a predetermined operation, e.g., bicubic downsampling. As existing methods typically learn an inverse mapping of the specific function, they produce blurry results when applied to real-world images whose exact formulation is different and unknown. Therefore, several methods attempt to synthesize much more diverse LR samples or learn a realistic downsampling model. However, due to restrictive assumptions on the downsampling process, they are still biased and less generalizable. This study proposes a novel method to simulate an unknown downsampling process without imposing restrictive prior knowledge. We propose a generalizable low-frequency loss (LFL) in the adversarial training framework to imitate the distribution of target LR images without using any paired examples. Furthermore, we design an adaptive data loss (ADL) for the downsampler, which can be adaptively learned and updated from the data during the training loops. Extensive experiments validate that our downsampling model can facilitate existing SR methods to perform more accurate reconstructions on various synthetic and real-world examples than the conventional approaches.


1 Introduction

Image super-resolution (SR), which aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) input, plays an essential role in computer vision and digital photography. There exist numerous applications, including enhancing the details and photorealism of an image [27], high-quality editing [3], and breaking the sensor limitation of mobile cameras [51]. Recently, a plethora of SR methods have been developed on the basis of deep CNNs [10, 25, 27, 31] and large-scale datasets [1, 31]. However, state-of-the-art methods [48, 61, 13, 62] do not generalize well to real-world inputs even though they perform relatively well on synthesized, e.g., bicubic-downsampled, LR images. To overcome this issue, a few recent approaches [60, 7, 6, 50] have collected high-quality pairs of real-world LR and HR examples to learn their SR models. Nevertheless, such an acquisition process remains challenging due to outdoor scene dynamics and spatial misalignments [60].

Conventional SR methods synthesize various LR samples $\mathbf{y}$ from ground-truth HR images $\mathbf{x}$ by the following:

$\mathbf{y} = (\mathbf{x} \ast \mathbf{k}) \downarrow_{s} + \mathbf{n}, \qquad (1)$

where $\mathbf{k}$ is a 2D degradation kernel, $\ast$ is a spatial convolution, $\downarrow_{s}$ is a decimation with a stride $s$, and $\mathbf{n}$ is a noise term. The decimation operator corresponds to direct downsampling mentioned in the super-resolution literature [57]. With a specific assumption on blur kernels, e.g., variants of Gaussian [12, 8, 57], LR and HR pairs can be synthesized to train the following SR models. However, such priors typically limit the kernel space, and the synthesized LR images may not reflect the distribution of real-world inputs [7]. Therefore, the learned SR models become less generalizable toward arbitrary real-world input images.
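To make (1) concrete, the following minimal PyTorch sketch synthesizes an LR image from an HR tensor. It is an illustration under our own naming, not the authors' released implementation; the reflect padding is an assumption.

```python
import torch
import torch.nn.functional as F

def degrade(x, k, s, noise_std=0.0):
    """Synthesize an LR image as in (1): convolve x with the kernel k,
    decimate with stride s, and optionally add noise n.

    x: HR tensor of shape (B, C, H, W); k: 2D kernel of shape (kh, kw).
    """
    b, c, h, w = x.shape
    kh, kw = k.shape
    # Apply the same 2D kernel to every channel via a depthwise convolution.
    weight = k.view(1, 1, kh, kw).expand(c, 1, kh, kw)
    x = F.pad(x, (kw // 2, kw // 2, kh // 2, kh // 2), mode='reflect')
    y = F.conv2d(x, weight, stride=s, groups=c)
    if noise_std > 0:
        y = y + noise_std * torch.randn_like(y)  # additive noise term n
    return y
```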

Fig. 1: SR results on a real-world LR image. (a) LR image magnified with bicubic interpolation. (b) Result of RRDB [48]. (c) Result of KernelGAN [4] + ZSSR [44]. (d) Our unsupervised approach (ADL + RRDB) reconstructs a sharp and visually pleasing output without artifacts and aliasing compared with the existing methods. (e) Ground-truth patch from RealSR-V3 [6]. Images are cropped from 'Canon/045.png.'

On the other hand, recent unsupervised methods simulate real-world LR samples that contain unknown noise [64, 34] and artifacts [32, 5]. Without using a paired dataset, they first learn a downsampling model under adversarial training frameworks [11] to imitate the distribution of real-world images. The following SR models can then be trained in a supervised manner on the simulated dataset to deliver accurate reconstruction results on real-world inputs. One of the challenges in such methods arises from preserving image content across different scales, i.e., HR and LR, while learning the downsampling model. Existing approaches deal with this problem by using a predetermined downsampling operator, e.g., bicubic downsampling, in their objective functions and constraining the simulated LR images not to deviate much from the known formulations. However, the manual selection of the operator can introduce a bias in the unsupervised learning framework, which can also act as a restrictive prior if the ground-truth downsampling model differs much from the one used. While KernelGAN [4] alleviates the issue by estimating a low-dimensional downsampling kernel $\mathbf{k}$ in (1) from an LR image $\mathbf{y}$, various regularization terms need to be applied to restrict the diversity of the possible kernel space.

Therefore, we propose an effective way of imitating real-world LR samples of an unknown distribution to address the aforementioned issues. Similar to the previous unsupervised methods [32, 4], we also train a downsampling CNN to simulate the LR images in our target distribution. However, rather than formulating the objective function with a handcrafted downsampling operator, we propose a novel and generalizable low-frequency loss (LFL) that does not impose substantial bias. Our LFL facilitates the downsampling model to learn much more diverse and precise functions without being constrained to a specific prior assumption. Furthermore, we develop an adaptive data loss (ADL) that iteratively adjusts the training objective for the given dataset and stabilizes the learning process. As shown in Fig. 1, our unsupervised learning framework is straightforward, effective, and generalizable to arbitrary downsampling models. Extensive experiments validate that the SR models learned on our downsampled images perform favorably on synthetic and real-world LR images. The contributions of this study are threefold:

  • We present a novel unsupervised learning framework to learn an unknown downsampling process without using any HR and LR image pairs.

  • We propose LFL and ADL to simulate accurate and realistic LR samples from HR images without relying on any predetermined downsampling operators.

  • We demonstrate that the proposed method can be easily integrated with existing SR frameworks and achieve much better results on synthetic and real-world images.

2 Related Work

2.1 SR on bicubic downsampled images

With the success of SRCNN [10], several CNN-based methods have been developed for image SR. As one of the most influential studies, VDSR [22] proposed a novel residual learning strategy to train a deep network and inspired many subsequent methods [27, 31, 2, 29]. Earlier works primarily focus on improving the network designs, such as pixel shuffling [43], progressive upsampling [25, 26, 49], dense connections [46, 62, 63], recursive structures [23, 39, 30], and back-projection [13]. Recent approaches utilize attention mechanisms [61, 9, 37], while designing architectures for efficient inference [16, 28, 33] is considered essential as well. From the perspective of image realism, several methods introduce perceptual losses [21, 42, 35, 58] to synthesize photorealistic textures [27, 48, 59, 40]. However, the existing methods are typically trained on synthesized image pairs in which LR inputs are generated from HR targets using conventional bicubic interpolation. While state-of-the-art algorithms perform impressively well when training and test distributions are matched, i.e., test inputs are also downsampled with the same operator, they cannot be fully generalized to arbitrary in-the-wild LR images [44, 17].

2.2 Synthesizing diverse LR images for SR

For practical SR applications, it is essential to determine how to generate LR images [55] so that a supervised SR model can be trained without real-world LR and HR image pairs. Several approaches have synthesized diverse LR images with multiple degradations to train their SR algorithms, assuming that generalization on such examples can improve SR performance on arbitrary inputs. SRMD [57] considers the formation of LR images under various downsampling kernels $\mathbf{k}$ in (1). It can reconstruct HR images from diverse types of LR inputs, using off-the-shelf methods [36, 38, 4] to predict a candidate kernel from a given LR input. USRNet [56] further allows more diversity in the downsampling kernel and can deliver clean SR results even when LR inputs are corrupted with motion blur and noise.

Furthermore, recent methods [12, 8, 53] present unified frameworks to jointly estimate the kernel and reconstruct visually pleasing results from an arbitrary LR image. However, since considering all possible forms of downsampling operation is not practical, the candidate kernels in such methods are often simplified to variants of 2D Gaussian. Recent studies have demonstrated that such approximations may not hold for actual LR images [7, 65] in the wild, thus making the abovementioned SR algorithms less generalizable. In this study, we demonstrate that existing approaches [12, 8] do not perform well on inputs from unknown downsampling kernels or real-world images, and our method provides better generalization.

2.3 Learning to simulate real-world LR images

Instead of synthesizing LR images from handcrafted formulations, numerous approaches [5, 64, 32, 34] have adopted adversarial training [11] to simulate the unknown distributions of real-world LR images using downsampler CNNs. These methods have shown impressive performance when dealing with unknown noise [64, 34] and artifacts [5, 32] in real-world LR images. Considering the definition of downsampling, one of the required characteristics of such methods is to preserve the contents of the HR input and generate a feasible LR image. Therefore, predetermined downsampling operators [5, 32, 34] and cyclic architectures [66] are used to keep the generated LR images from deviating much from the desired outputs. However, a significant limitation of this formulation is that the necessity of estimating an accurate downsampling process is often considered less important. In particular, the handcrafted operators may significantly differ from the unknown downsampling function and bias the following downsampler, making the model effective at handling noise and artifacts but less so at estimating the actual downsampling operator.

On the other hand, KernelGAN [4] is designed to directly predict the degradation kernel $\mathbf{k}$, which is used to generate the given LR image $\mathbf{y}$. The estimated kernel is then used to synthesize LR and HR pairs for the following SR model [44]. To address the ill-posed problem of finding the kernel in (1), several optimization constraints are assumed, such as patch recurrence [36] within a single image, a deep linear generator, and various prior knowledge on physically meaningful kernels. However, this approach may not handle practical cases in which such strong assumptions do not hold. While Ji et al. [20] have extended the approach to a set of LR images, several prior terms for the appropriate degradation kernel still act as a bottleneck for generalization. On the contrary, our LFL and ADL are designed to reduce the inherent bias from adopting a specific downsampling operator or strong kernel priors.

2.4 Paired datasets for real-world SR

Limitations of existing SR methods arise from difficulties in constructing real-world datasets. A few approaches capture paired datasets [7, 6, 50] by precisely manipulating camera parameters, in which images captured with long and short focal lengths are labeled as HR and LR samples, respectively. Zhang et al. [60] introduce SR-RAW based on raw images and a contextual bilateral loss to handle misalignments in the real-world pairs. Xu et al. [52] utilize raw and color images jointly in their model for effective real-world SR. Those image pairs can be used to learn real-world SR models to some extent. Nevertheless, they still suffer from a lack of scene diversity, misalignments, dynamic motions, and scalability issues. To overcome these challenges in acquiring realistic SR datasets, we synthesize accurate HR and LR pairs from unpaired examples. While we assume that a set of LR images undergoes the same or a similar formation process, the data collection is much easier since careful alignment and delicate post-processing are not required.

3 Learning to Downsample

In conventional frameworks, mismatches between the handcrafted kernel space and the real-world downsampling model [65, 7] make the following SR networks less generalizable. Thus, we develop an unsupervised learning framework to accurately simulate LR samples $\mathbf{y} \in \mathcal{Y}$ from the unpaired HR images $\mathbf{x} \in \mathcal{X}$. The following SR model can then be trained to reconstruct HR results from the given LR dataset $\mathcal{Y}$. For simplicity, we assume that the LR and HR images have spatial resolutions of $N \times N$ and $sN \times sN$, respectively, for a downsampling factor $s$.

3.1 Learning an unknown downsampling process

We synthesize LR images under the generalized formulation $\mathbf{y} = f(\mathbf{x})$, where $\mathbf{x}$ denotes a latent HR sample from the distribution $p_{\mathcal{X}}$ and $f$ is an unknown downsampling operator. The goal is to learn an SR model $S$, which can reconstruct a high-quality HR image from the given LR image $\mathbf{y}$. However, it is not straightforward to learn the upsampling function directly, as the corresponding ground-truth HR images are unavailable. Thus, we first learn a downsampling model $G$ in an unsupervised manner, so that the distribution of the synthesized images $\hat{\mathbf{y}} = G(\mathbf{x})$ is close to the distribution of the target LR samples $\mathbf{y} \in \mathcal{Y}$. By using the generated pairs $(\hat{\mathbf{y}}, \mathbf{x})$, our SR model can be trained to reconstruct HR images from the given LR distribution in a fully supervised manner.

To learn the downsampling function, we adopt an adversarial training framework [11] to jointly optimize the downsampler CNN $G$ and the discriminator CNN $D$. We formulate the downsampler objective $\mathcal{L}_{G}$ as follows:

$\mathcal{L}_{G} = \lambda \mathcal{L}_{\text{data}} + \mathcal{L}_{\text{adv}}, \qquad (2)$

where $\mathcal{L}_{\text{data}}$ is the data loss, $\mathcal{L}_{\text{adv}}$ is the adversarial loss [11], and $\lambda$ is a hyperparameter balancing the two terms. The discriminator $D$ is trained with the corresponding adversarial objective $\mathcal{L}_{D}$ [11]. If the learned downsampling model can accurately synthesize LR images from $\mathcal{X}$, i.e., the distributions of $\hat{\mathbf{y}}$ and $\mathbf{y}$ are approximately the same, the following SR model can be generalized on $\mathcal{Y}$ by learning from a set of training pairs $(\hat{\mathbf{y}}, \mathbf{x})$. Fig. 2 shows the overall pipeline of our method, which learns the downsampling and super-resolution models consecutively.

For simplicity, we assume that the target LR images are not corrupted with noise, i.e., the term $\mathbf{n}$ in (1) is ignored. The primary reason is that noise can be a discriminative feature between real LR and downsampled images in adversarial training. Since we do not include randomness in our downsampler architecture, such behavior also prevents the proposed method from learning an accurate downsampling function. In Section 4.7, we discuss the effect of real-world noise in the proposed framework.

(a) Downsampling
(b) Super-resolution
Fig. 2: Our two-stage approach for unpaired SR. (a) We first optimize a downsampling model $G$ to synthesize $\hat{\mathbf{y}}$ from $\mathbf{x}$. The primary goal is to learn the distribution of downsampled images rather than a proper downsampling function. (b) Using the generated pairs, we train the SR model $S$, which can also be generalized to the target LR images $\mathbf{y}$. Dotted lines represent latent components that are not available during the entire learning process. Blue items show learned elements in each stage, and red elements denote the actual goal we want to achieve.

3.2 Data constraint in the downsampling model

In practice, the actual formulation of the given LR images, i.e., the ground-truth downsampling model, is unknown. Thus, we introduce the adversarial loss to enforce the downsampled images to follow the target distribution without using ground-truth LR images. However, unlike other image generation tasks [41], appropriate constraints are also required to generate LR samples faithful to the given HR counterparts and preserve input contents. In particular, low-level information of a given image, e.g., pixel colors and edge structures, should not be changed during downsampling, as shown in Fig. 3(a) and (b). Thus, the appropriate formulation of the data term $\mathcal{L}_{\text{data}}$ in (2) plays a critical role in preserving the image content across different scales. A widely used approach is to define the data loss with a known operator $b$, such as bicubic downsampling or average pooling [5], as follows:

$\mathcal{L}_{\text{data}} = \lVert \hat{\mathbf{y}} - b(\mathbf{x}) \rVert_{1}. \qquad (3)$

That is, a reference example $b(\mathbf{x})$ constrains the generated LR sample to be a feasible downsampled image. A recent method from Maeda [34] has also combined the bicubic downsampling operator and an image-to-image translation CNN in its downsampling model, so that the generator first bicubic-downsamples the input and then translates it toward the target distribution. Under such a configuration, the translator network is trained to maintain consistency between its input and output, which corresponds to $b$ in (3).
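For reference, the predetermined data term in (3) amounts to a one-line comparison against a fixed operator; in this sketch, bicubic interpolation plays the role of $b$, and the L1 distance is our assumption.

```python
import torch.nn.functional as F

def data_loss_predetermined(x, y_fake, s):
    """Data term of (3): pull the generated LR image G(x) toward a fixed
    reference b(x); here, bicubic downsampling by the scale factor s."""
    ref = F.interpolate(x, scale_factor=1.0 / s, mode='bicubic', align_corners=False)
    return F.l1_loss(y_fake, ref)
```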

Fig. 3: Differences between the ground-truth and learned LR images under various configurations. (a) A reference HR patch $\mathbf{x}$. (b) A ground-truth LR patch $\mathbf{y}$ we want to synthesize, where the kernel is unknown. (c) The corresponding bicubic-downsampled LR $b(\mathbf{x})$, which is different from $\mathbf{y}$. (d) The absolute difference between $\mathbf{y}$ and the generated LR from our downsampling model, visualized with color-coding, where red pixels indicate large differences. (e)-(h) Differences between $\mathbf{y}$ and outputs from the downsampler learned under (2) with increasing weight $\lambda$, where the bicubic downsampling operator is used for $b$. Difference maps are normalized for better visualization. See more details in Section 4.2.

In (3), the data term enforces the downsampled images to be close to the references $b(\mathbf{x})$. Such a formulation contributes to preserving the image content and facilitates the training process for generating LR images. Nevertheless, optimizing the data term in (2) may bias the learned model toward the used operator $b$. The bias may conflict with the adversarial training objective if the distribution of the downsampled images deviates significantly from the target distribution.

Fig. 3 illustrates an example demonstrating the negative effect of using a predetermined downsampling operator, e.g., the bicubic kernel $\mathbf{k}_{\text{bic}}$, in the data term $\mathcal{L}_{\text{data}}$. The target LR images are generated using a different kernel $\mathbf{k}_{g}$, i.e., $\mathbf{y} = (\mathbf{x} \ast \mathbf{k}_{g}) \downarrow_{s}$ for an arbitrary HR image $\mathbf{x}$, as shown in Fig. 3(a)-(c). Then, we jointly minimize the data and adversarial loss terms in (2) under different $\lambda$ values so that the downsampling model can get close to the target distribution. Fig. 3(e)-(h) illustrate differences between the actual LR and downsampled images with a varying $\lambda$. If the data term is not used, i.e., $\lambda = 0$, the adversarial loss is solely optimized in training. As shown in Fig. 3(e), the downsampled image does not preserve the original colors and becomes inconsistent with the input in such a case.

On the other hand, if we increase the weight $\lambda$ to retain the input content, the resulting downsampled images will more likely resemble $b(\mathbf{x})$ rather than the desired output $\mathbf{y}$, as shown around the edge and corner regions of Fig. 3(f)-(h). The tradeoff between preserving image contents and synthesizing an accurate distribution of the LR images occurs due to the inherent conflict between the predetermined downsampling operator and the adversarial loss. While the data term is necessary to learn an appropriate downsampling function, it also operates as a restrictive prior and prevents an accurate simulation of the target LR images. Therefore, an SR method developed with the biased downsampler may not perform well on the target distribution, just as conventional bicubic SR algorithms cannot be generalized to real-world images.

(a) Data loss from a predetermined kernel
(b) LFL (Proposed)
(c) ADL (Proposed)
Fig. 4: Different formulations of the data term. We visualize how pixels in $\hat{\mathbf{y}}$ are constrained to $\mathbf{x}$ depending on the data term $\mathcal{L}_{\text{data}}$. (a) Data loss from a predetermined kernel. (b) In the proposed LFL, we apply low-pass filters to the HR and downsampled images so that image contents can be preserved across different scales regardless of the downsampling model. (c) In our adaptive data term, the orange kernel is learned from training samples and iteratively adjusted inside the training loops rather than handcrafted.

3.3 Data loss over low-frequency components

We propose an effective and generalizable formulation of the data term to address the limitations of the existing approaches. Similar to the previous methods, our downsampler also takes an input HR image $\mathbf{x}$ and generates a downsampled image $\hat{\mathbf{y}} = G(\mathbf{x})$. However, to preserve image contents and low-level structures in the downsampling process, we first define the operator $L_{\alpha}$ as a combination of low-pass filtering and subsampling by $\alpha$, which reduces the resolution of a given image by a factor of $\alpha$. Then, we rewrite the data loss in (2) with a low-frequency loss (LFL) as follows:

$\mathcal{L}_{\text{LFL}} = \lVert L_{\alpha s}(\mathbf{x}) - L_{\alpha}(\hat{\mathbf{y}}) \rVert_{1}, \qquad (4)$

where $\alpha$ is a scaling factor. Since the HR image is $s$ times larger than the downsampled one, the sizes of $L_{\alpha s}(\mathbf{x})$ and $L_{\alpha}(\hat{\mathbf{y}})$ are the same. We adopt two different low-pass filters, box and Gaussian, to formulate the loss term. As the HR and downsampled images have different resolutions, we adjust the filter weights proportionally so that the same context can be covered from images at different scales. By default, we use box filters, with the filter for the HR image enlarged by the factor $s$, and the scaling factor $\alpha = 2$. We provide more details and ablations regarding the low-pass filters in Appendix A.
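A minimal sketch of (4) with box filters could look as follows; the base filter size, the padding scheme, and the L1 distance are assumptions of this illustration, as the actual defaults are specified in Appendix A.

```python
import torch
import torch.nn.functional as F

def low_pass_subsample(img, ksize, alpha):
    """The operator L_alpha: box filtering followed by subsampling by alpha."""
    c = img.size(1)
    w = torch.full((c, 1, ksize, ksize), 1.0 / ksize ** 2,
                   device=img.device, dtype=img.dtype)
    img = F.pad(img, (ksize // 2,) * 4, mode='reflect')
    return F.conv2d(img, w, stride=alpha, groups=c)

def lfl(x, y_fake, s, alpha=2, ksize=4):
    """Low-frequency loss of (4): match the low-pass bands of x and G(x).

    The filter for the HR image x is enlarged by the scale s so that both
    branches cover the same spatial context; ksize here is a placeholder.
    """
    lf_hr = low_pass_subsample(x, ksize * s, alpha * s)
    lf_lr = low_pass_subsample(y_fake, ksize, alpha)
    return F.l1_loss(lf_hr, lf_lr)
```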

Fig. 4 illustrates the differences between the existing formulation and the proposed loss term. As shown in Fig. 4(a), the handcrafted operator constrains each pixel of the downsampled image to be a predetermined function of the input HR image. The primary limitation of such an approach is that the HR image $\mathbf{x}$ and the operator $b$ are both kept unchanged throughout the entire learning process. Therefore, the reference image $b(\mathbf{x})$ is also fixed for each HR image, which can bias the learning process. Thus, even with the adversarial training objective, the learned downsampler can be biased toward the predetermined operator rather than the desired downsampling model, especially when the weight $\lambda$ in (2) is large.

Our motivation is that we only need to preserve the low-frequency components of the image contents and structures. Fig. 4(b) demonstrates that, with our LFL, the downsampled image is no longer constrained to be a specific function of its HR counterpart. Instead, we adopt a relaxed objective designed to match low-frequency structures between the input and output of the downsampler. By doing so, the adversarial loss can play a significant role in rendering unknown types of LR images. Our LFL is not a restrictive constraint for a general downsampling model and generalizes well on various synthetic and real-world images. In other words, we can minimize the new data term without causing a notable conflict with the adversarial loss for LR images from an arbitrary downsampling model. More details are described in Section 4.5.

Scale transfer learning.

The proposed LFL does not include any scale-specific formulation and can be generalized to larger scales, e.g., $s = 4$. However, directly optimizing a high-scale downsampler may cause less stable behaviors due to the significant differences between the HR and downsampled images. To ensure stability, we learn a $\times 2$ model $G_{\times 2}$ on the desired distribution and apply it repeatedly [4] to obtain larger-scale downsampling models:

$G_{\times 4}(\mathbf{x}) = G_{\times 2}(G_{\times 2}(\mathbf{x})). \qquad (5)$
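In code, the scale transfer of (5) is simply function composition of the trained ×2 downsampler:

```python
def downsample_x4(G2, x):
    """Scale transfer of (5): realize a x4 downsampler by applying the
    learned x2 model G2 twice, instead of training an unstable x4 model."""
    return G2(G2(x))
```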

3.4 Adaptive data loss

Our LFL is designed to reduce the bias from selecting a predetermined downsampling operator for the data loss. While this formulation enables LFL to generalize well across various unknown degradations, several limitations exist. For example, an inherent ambiguity in LFL makes it challenging to solve the optimization problem, because LR images from different downsampling processes may share similar low-frequency components, i.e., $L_{\alpha}(\hat{\mathbf{y}}_{1}) \approx L_{\alpha}(\hat{\mathbf{y}}_{2})$ even for $\hat{\mathbf{y}}_{1} \neq \hat{\mathbf{y}}_{2}$. Considering that our goal is to simulate the unknown downsampling model $f$ with a CNN-based downsampler $G$, the ideal data loss should be zero only if $G(\mathbf{x}) = f(\mathbf{x})$ is satisfied. The primary limitation of LFL is that it is designed to maintain consistency between HR and LR images, not to simulate structures of LR images in the target distribution. Since LFL only considers low-frequency components of the image, optimizing the term is an ill-posed problem in which numerous possible solutions exist. In particular, minimizing LFL allows the downsampler to generate valid LR images, while it is not guaranteed that the learned downsampler achieves our desired behavior. Therefore, the LFL formulation is generalizable but cannot be optimal for an arbitrary downsampling model due to the ambiguity.

Moreover, LFL can be problematic when the downsampled image is corrupted with high-frequency noise, which is suppressed after low-pass filtering. Since the proposed LFL cannot reject such noisy estimations, it is challenging to generate clean and accurate LR samples of the desired distribution. Consequently, the downsampler heavily relies on the adversarial loss to simulate an accurate distribution of the target LR images, which may not be very stable in practice [41].

Therefore, we propose an adaptive data loss (ADL) to complement the limitations of LFL. The primary motivation is that the LFL-based downsampler can serve as a dataset-specific objective if $G$ and the ground-truth downsampling model are similar to some extent. To formulate the ADL, we first reduce the noise in the pre-trained model $G$. Rather than introducing a new objective term in (2) for regularization, we retrieve a low-rank approximation of the learned network with a simple function. From the observation that a proper downsampling function consists of low-pass filtering and decimation [36, 12, 8, 4], we linearize the learned downsampling model to a corresponding 2D kernel $\hat{\mathbf{k}}$:

$\hat{\mathbf{k}} = \arg\min_{\mathbf{k}} \frac{1}{M} \sum_{i=1}^{M} \lVert G(\mathbf{x}_{i}) - (\mathbf{x}_{i} \ast \mathbf{k}) \downarrow_{s} \rVert_{2}^{2}, \qquad (6)$

where $\mathbf{x}_{i}$ denotes the $i$-th example used to estimate the kernel and $M$ is the total number of samples, respectively. We note that there exists a closed-form solution to the least-squares problem in (6). Since (6) can be interpreted as an average of the possibly noisy downsampling network over $M$ inputs, the kernel $\hat{\mathbf{k}}$ is a regularized representation of the pre-trained network $G$.
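The least-squares problem in (6) is linear in the kernel entries and can be solved in closed form. The following sketch builds the linear system with im2col-style unfolding; the kernel size, the reflect padding, and treating each channel as an independent sample are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def estimate_kernel(G, xs, ksize, s):
    """Retrieve the 2D kernel k_hat of (6) from the learned downsampler G.

    xs: list of HR tensors of shape (1, C, H, W). Each output pixel of G(x)
    gives one linear equation: a row of A holds one HR patch, and b holds
    the matching pixel value. Assumes (ksize - s) is even.
    """
    A, b = [], []
    with torch.no_grad():
        for x in xs:
            y = G(x)
            pad = (ksize - s) // 2
            cols = F.unfold(F.pad(x, (pad,) * 4, mode='reflect'),
                            kernel_size=ksize, stride=s)   # (1, C*k*k, L)
            c = x.size(1)
            cols = cols.view(c, ksize * ksize, -1)          # channels as samples
            A.append(cols.permute(0, 2, 1).reshape(-1, ksize * ksize))
            b.append(y.reshape(-1, 1))
    A, b = torch.cat(A), torch.cat(b)
    k_hat = torch.linalg.lstsq(A, b).solution               # closed-form solution
    return k_hat.view(ksize, ksize)
```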

With the estimated kernel $\hat{\mathbf{k}}$, our novel ADL data term is defined as follows:

$\mathcal{L}_{\text{ADL}} = \lVert \hat{\mathbf{y}} - (\mathbf{x} \ast \hat{\mathbf{k}}) \downarrow_{s} \rVert_{1}. \qquad (7)$
Fig. 5: Examples of bicubic and randomly selected Gaussian kernels with corresponding LR images. (a) We use DIV2K [1] '0869.png' for the HR image $\mathbf{x}$. (b)-(f) We note that the bicubic kernel $\mathbf{k}_{\text{bic}}$ contains positive (green) and small negative (red) values together. The former two Gaussian kernels ($\mathbf{k}_{g1}$ and $\mathbf{k}_{g2}$) are isotropic, while the latter kernels ($\mathbf{k}_{g3}$ and $\mathbf{k}_{g4}$) are anisotropic. We note that there exist subtle differences between images from different downsampling kernels. (g)-(k) We also visualize downsampled images from the proposed ADL formulation, where the downsampling CNN $G$ and the approximated kernel $\hat{\mathbf{k}}$ in Algorithm 1 are learned on the synthetic DIV2K dataset. Kernel boundaries are cropped for better visualization. Best viewed with digital zoom.

While (7) looks identical to the data terms with handcrafted downsampling in (3), we can deduce several merits from the ADL formulation. Unlike the predetermined operators, the kernel $\hat{\mathbf{k}}$ is adaptively learned from the training data and shows less conflict with the adversarial loss. In other words, the linear downsampling process in (7) is less likely to deviate considerably from our desired downsampling model. Compared with the LFL formulation, the ADL term can provide a stable training objective and prevent the downsampler from learning false-negative cases. Moreover, the learned downsampling model is not constrained to be a deep linear network [4], as we jointly optimize the adversarial loss with the ADL.
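Given the retrieved kernel, the ADL of (7) reuses the degradation routine sketched earlier for (1); the L1 distance is again our assumption.

```python
import torch.nn.functional as F

def adl(x, y_fake, k_hat, s):
    """Adaptive data loss of (7): compare G(x) against x downsampled with
    the kernel k_hat retrieved from the downsampler itself via (6)."""
    ref = degrade(x, k_hat, s)  # the degradation sketch for (1), noise-free
    return F.l1_loss(y_fake, ref)
```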

We also introduce two modifications to utilize our ADL effectively in practice. First, the downsampler has been observed to simulate a target downsampling model to some extent under the LFL, even with few training iterations. Rather than using a fully pre-trained model for the kernel estimation, we start from scratch and train the downsampler for $T_{w}$ warm-up iterations with the LFL. We then replace our data term with the ADL, in which the kernel is calculated from the downsampler after the $T_{w}$ updates. Second, we periodically adjust the kernel to prevent our downsampling model from being biased toward a fixed operator. Similar to (3), our ADL may also bias the training process unless the estimated kernel matches the ground truth. Thus, we periodically update the kernel every $T_{u}$ iterations by retrieving it from the currently learned downsampler. Even if the initial estimation is less accurate, such periodic updates allow the kernel to be adaptively adjusted during the learning loops. The training pipeline of our downsampler with the modified ADL formulation is summarized in Algorithm 1.

Input:  Set of HR patches $\mathcal{X}$, set of unpaired LR patches $\mathcal{Y}$, warm-up interval $T_{w}$, update interval $T_{u}$, total training iterations $T$, and learning rate $\eta$.
Output:  Downsampler parameters $\theta_{G}$ and discriminator parameters $\theta_{D}$.
1:  Initialize $\theta_{G}$ and $\theta_{D}$. // Parameter initialization [41].
2:  $\hat{\mathbf{k}}$ = None.
3:  for $t = 1$ to $T$ do
4:     Sample $\mathbf{x} \sim \mathcal{X}$, $\mathbf{y} \sim \mathcal{Y}$. // Sample training batches.
5:     $\hat{\mathbf{y}} \leftarrow G(\mathbf{x})$.
6:     Update $\theta_{D}$ by (2).
7:     if $t \leq T_{w}$ then
8:        Calculate $\mathcal{L}_{\text{data}}$ with (4). // LFL warm-up.
9:     else
10:        if $t \bmod T_{u} = 0$ or $\hat{\mathbf{k}}$ is None then
11:           Calculate $\hat{\mathbf{k}}$ from $G$ with (6). // Retrieve the kernel.
12:        end if
13:        Calculate $\mathcal{L}_{\text{data}}$ with (7). // ADL.
14:     end if
15:     Update $\theta_{G}$ by (2).
16:  end for
Algorithm 1 ADL for learning our downsampler
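Putting the pieces together, a skeleton of Algorithm 1 in PyTorch might look as follows. It composes the sketches above (train_step, lfl, estimate_kernel, adl); sample_hr is a hypothetical helper that draws a few HR images for kernel retrieval, and the kernel size is a placeholder.

```python
def train_downsampler(G, D, opt_G, opt_D, hr_loader, lr_loader,
                      T, T_w, T_u, s=2, ksize=20):
    """Skeleton of Algorithm 1: LFL warm-up, then ADL with periodic
    kernel refresh."""
    k_hat = None
    for t, (x, y) in enumerate(zip(hr_loader, lr_loader), start=1):
        if t > T:
            break
        if t <= T_w:
            # Warm-up phase: low-frequency loss of (4).
            data_loss = lambda x, y_fake: lfl(x, y_fake, s)
        else:
            if k_hat is None or t % T_u == 0:
                k_hat = estimate_kernel(G, sample_hr(hr_loader), ksize, s)  # (6)
            # ADL phase: adaptive data loss of (7).
            data_loss = lambda x, y_fake, k=k_hat: adl(x, y_fake, k, s)
        train_step(G, D, opt_G, opt_D, x, y, data_loss, lam=100.0)  # lambda per Section 4.5
    return G, D
```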

3.5 Image super-resolution

To learn the SR model $S$, we first generate LR images from HR images with the learned downsampler $G$ to construct a training set of pairs $(\hat{\mathbf{y}}, \mathbf{x})$. A downsampling-specific SR model can then be trained in a supervised manner by optimizing the L1 loss [31, 25]:

$\mathcal{L}_{\text{SR}} = \lVert S(\hat{\mathbf{y}}) - \mathbf{x} \rVert_{1}, \qquad (8)$

where $S(\hat{\mathbf{y}})$ refers to a super-resolved image. As shown in (8), our approach does not require any paired examples, i.e., real LR images $\mathbf{y}$, to learn the SR model for the target distribution.
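One supervised SR iteration on the generated pairs is then standard; a brief sketch, with the downsampler kept frozen:

```python
import torch
import torch.nn.functional as F

def sr_step(S, opt_S, G, x):
    """One supervised step of (8) on a generated pair (G(x), x)."""
    with torch.no_grad():
        y_fake = G(x)  # frozen downsampler provides the LR input
    loss = F.l1_loss(S(y_fake), x)
    opt_S.zero_grad()
    loss.backward()
    opt_S.step()
    return loss.item()
```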

One of our contributions is that the downsampling and SR models can be learned independently. For instance, it is straightforward to introduce a perceptual objective [21, 27, 48, 58] for the SR network, which can be used to reconstruct photorealistic results. To reconstruct more realistic textures from real-world LR images, we jointly optimize the perceptual SR model $S^{P}$ and the discriminator network $D_{\text{SR}}$:

$\mathcal{L}_{S^{P}} = \lVert \phi(S^{P}(\hat{\mathbf{y}})) - \phi(\mathbf{x}) \rVert_{1} + \mu \mathcal{L}_{\text{adv}}, \qquad (9)$

where $\phi$ denotes features of the pre-trained VGG-19 [45, 27, 48] network after the conv5_4 layer, $S^{P}(\hat{\mathbf{y}})$ is a super-resolved image, $\mathcal{L}_{\text{adv}}$ is the adversarial loss, and $\mu$ is a hyperparameter, respectively.
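A sketch of the perceptual term in (9) with torchvision's VGG-19 follows; the feature index for conv5_4, the omitted ImageNet input normalization, and the BCE adversarial term are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

class VGGFeature(torch.nn.Module):
    """Frozen VGG-19 features after conv5_4 (pre-activation), as in (9).

    Note: ImageNet mean/std normalization of the inputs is omitted here
    for brevity but is required in practice.
    """
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.body = vgg.features[:35]  # layers up to and including conv5_4
        for p in self.body.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        return self.body(x)

def perceptual_loss(phi, sr, hr, d_sr, mu):
    """Perceptual objective of (9): VGG feature distance plus a weighted
    adversarial term from the SR discriminator d_sr."""
    pred = d_sr(sr)
    adv = F.binary_cross_entropy_with_logits(pred, torch.ones_like(pred))
    return F.l1_loss(phi(sr), phi(hr)) + mu * adv
```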

4 Experiments

We implement our method based on the PyTorch framework. More results are provided in our Appendix and on the project site: https://cv.snu.ac.kr/research/ADL. We will also release the source code and pre-trained models.

4.1 Experimental setups

Dataset. To validate whether our method can simulate an unknown distribution of LR images accurately, we construct a synthetic dataset using a bicubic kernel ($\mathbf{k}_{\text{bic}}$) and Gaussian kernels ($\mathbf{k}_{g1}$–$\mathbf{k}_{g4}$) with random shapes [12, 8]. Then, we obtain LR test inputs from HR images by following (1). We visualize the different degradation kernels used in our experiments and the corresponding LR images in Fig. 5. For the ×4 configurations, we use two-times-enlarged versions of the kernels. More details about the selected kernels are described in Appendix B.

We construct unpaired training data by splitting the 800 HR images of the DIV2K [1] training set in half. For each degradation kernel, we assign 400 HR samples ('0001.png'–'0400.png') to $\mathcal{X}$. The remaining 400 images ('0401.png'–'0800.png') are used to synthesize LR samples and are allocated to $\mathcal{Y}$. The images do not overlap between $\mathcal{X}$ and $\mathcal{Y}$. With the proposed LFL and ADL formulations, the downsampler can learn to simulate the distribution of the LR samples by using the given HR images. For fair evaluation, we use another 100 images from the DIV2K [1] validation set to generate test inputs for each degradation kernel $\mathbf{k}$.

Evaluation metrics. We evaluate our downsampling methods in two aspects. As the primary goal of our method is generating training examples to learn SR models on an unknown distribution of LR images, we generate pairs $(\hat{\mathbf{y}}, \mathbf{x})$ using the 400 HR images in $\mathcal{X}$ with the downsampler learned for each test degradation. Then, we train the SR model as described in Section 3.5 and report the PSNR values between the reconstructed images and the reference HR images. We note that generating more accurate LR images allows the following SR model to generalize better on inputs from unknown degradations. In addition to the SR task, we also measure the PSNR values between the downsampled images $\hat{\mathbf{y}}$ and the ground-truth LR images $\mathbf{y}$ to quantitatively evaluate the performance of the learned downsampling models. All PSNR values are calculated using RGB channels rather than luminance.
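For reference, RGB PSNR as used here is computed over all three channels jointly; a brief sketch, assuming float images in [0, 1]:

```python
import torch

def psnr_rgb(sr, hr, max_val=1.0):
    """PSNR over all RGB channels jointly (not luminance), in dB."""
    mse = torch.mean((sr - hr) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```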

Model architecture. We use a patch-based discriminator [19] with instance normalization [47] for training. For the SR task, we use a small EDSR [31] model with 1.5M parameters as the baseline. To demonstrate that our method is orthogonal to the choice of SR backbone, we also introduce the larger RRDB [48] architecture with 16.7M parameters. Details regarding our downsampling and discriminator CNNs are described in Appendix E.

Hyperparameters. In all experiments, we use a box filter for the downsampled image and a proportionally enlarged one for the HR image, with a scaling factor $\alpha$ of 2. Ablation studies on the filter selection and relevant hyperparameters are described in Section 4.5. In training, one epoch consists of 1,741 iterations, which is proportional to the number of total training patches. More details are provided in Appendix F.

PSNR (dB) between $\hat{\mathbf{y}}$ and $\mathbf{y}$ | $\mathbf{k}_{\text{bic}}$ | $\mathbf{k}_{g1}$ | $\mathbf{k}_{g2}$ | $\mathbf{k}_{g3}$ | $\mathbf{k}_{g4}$
×2, data term with bicubic $b$ | 40.70 | 39.64 | 35.79 | 37.88 | 37.22
×2, data term with average pooling $a$ | 40.33 | 38.54 | 36.03 | 37.35 | 35.13
×2, $\mathcal{L}_{\text{LFL}}$ in (4) (Proposed) | 43.16 | 42.03 | 43.30 | 43.69 | 43.75
×2, $\mathcal{L}_{\text{ADL}}$ in (7) (Proposed) | 45.83 | 45.61 | 46.34 | 44.86 | 46.41
×4, data term with bicubic $b$ | 26.36 | 26.91 | 26.85 | 25.54 | 25.67
×4, data term with average pooling $a$ | 24.18 | 25.63 | 25.40 | 26.64 | 26.70
×4, $\mathcal{L}_{\text{LFL}}$ in (4) (Proposed) | 31.13 | 34.28 | 39.66 | 38.31 | 37.17
×4, $\mathcal{L}_{\text{ADL}}$ in (7) (Proposed) | 38.24 | 38.12 | 43.54 | 39.08 | 41.10
TABLE I: Evaluation of LR images from our unsupervised downsampler. We report PSNR (dB) between downsampled and ground-truth LR images on the synthetic DIV2K dataset for each kernel. The first two rows use predetermined operators in the data term (Section 4.2).
Method | Training input | Training target
Bicubic | $b(\mathbf{x})$ | $\mathbf{x}$
Oracle | $(\mathbf{x} \ast \mathbf{k}) \downarrow_{s}$ | $\mathbf{x}$
Proposed | $G(\mathbf{x})$ | $\mathbf{x}$
TABLE II: Training configurations of different SR methods. All other hyperparameters are kept fixed to train these SR models. We note that the downsampler $G$ is learned for each specific degradation in an unsupervised manner.

4.2 Evaluating simulated LR images

The primary contribution of our LFL and ADL is that they do not conflict with the adversarial loss, which guides downsampled images to resemble LR samples from an unknown distribution. To demonstrate the advantages of the proposed framework when simulating an arbitrary downsampling process, we compare our LFL and ADL with data terms using predetermined operators. Maeda [34] proposes to utilize bicubic-downsampled images in an unsupervised downsampling model, especially for the cycle consistency and identity loss terms. While the unsupervised learning approach from Maeda [34] is not the same as our formulation, the objective between the generated LR and bicubic-downsampled images can be interpreted as $b$ in (3). Similarly, Bulat et al. [5] use average pooling ($a$) for the resizing operator in the data term, with the pooling window determined by the scaling factor.

Method | PSNR for ×2 SR ($\mathbf{k}_{\text{bic}}$ / $\mathbf{k}_{g1}$ / $\mathbf{k}_{g2}$ / $\mathbf{k}_{g3}$ / $\mathbf{k}_{g4}$) | PSNR for ×4 SR (same kernels)
EDSR [31] (Bicubic) | 34.61 / 31.51 / 27.76 / 27.91 / 27.95 | 28.92 / 26.35 / 24.06 / 24.21 / 24.21
RRDB [48] (Bicubic) | − | 29.45 / 26.44 / 24.07 / 24.22 / 24.22
EDSR (Oracle) | 34.61 / 34.44 / 33.64 / 33.23 / 33.27 | 28.92 / 28.73 / 28.02 / 27.79 / 27.84
RRDB (Oracle) | − | 29.45 / 29.28 / 28.39 / 28.08 / 28.62
KernelGAN [4] + ZSSR [44] | 22.32 / 26.42 / 30.44 / 29.10 / 29.12 | 20.11 / 24.67 / 25.85 / 25.21 / 25.36
IKC [12] | − | 28.59 / 28.07 / 27.65 / 24.15 / 25.12
BlindSR [8] | 26.56 / 26.62 / 26.49 / − / − | −
LFL + EDSR (Proposed) | 33.91 / 33.26 / 31.38 / 31.48 / 31.57 | 27.45 / 27.31 / 26.69 / 26.65 / 26.33
ADL + EDSR (Proposed) | 34.07 / 33.68 / 32.51 / 32.08 / 32.05 | 28.16 / 28.04 / 27.08 / 26.82 / 26.97
ADL + RRDB (Proposed) | − | 28.55 / 28.49 / 27.51 / 27.00 / 27.19
TABLE III: Blind super-resolution results on synthetic LR images. We show PSNR (dB) between ground-truth HR and SR images from various methods on the synthetic DIV2K test dataset. Performance is not reported (−) if a pre-trained model is available only for a specific scale or cannot generate output images. $\mathbf{k}_{\text{bic}}$ refers to the bicubic kernel.

We note that direct comparisons between ours and the existing generation-based methods [34, 5], including Lugmayr et al. [32], are not conducted for several reasons. First, we explicitly find the unknown downsampling operator, while previous approaches focus on modeling noise and artifacts in real-world LR images. In addition, as those methods do not provide source code, evaluation on diverse synthetic kernels cannot be carried out for fair comparisons. Therefore, we train multiple downsampling networks under different data terms on different synthetic kernels and scales (×2 and ×4). Then, we compare how the proposed data term outperforms the previous formulations in terms of the feasibility of the synthesized samples.

Table I reports the average PSNR between the generated LR images from each downsampler and the ground truth. When the predetermined operator is well-matched with the ground-truth downsampling function, e.g., using the bicubic operator $b$ to estimate $\mathbf{k}_{\text{bic}}$, the unsupervised models effectively simulate target LR images. However, if the predetermined functions (bicubic and average pooling) do not overlap with the unknown degradation kernels ($\mathbf{k}_{g1}$–$\mathbf{k}_{g4}$), the data term biases the training objective and conflicts with the adversarial loss. Table I demonstrates that the conflict affects the learned downsampling model negatively and makes the SR model less generalizable, even with synthetic kernels. On the other hand, the proposed LFL and ADL terms generalize better and facilitate the downsampler to generate accurate LR images for various configurations.

4.3 SR on the synthetic examples

Using the generated LR images from our downsampler, we train the baseline EDSR [31] and RRDB [48] models and evaluate them on each degradation kernel individually under the three configurations described in Table II. In the bicubic configuration, bicubic-downsampled images are used to train the SR model, as in existing approaches [10, 22, 27]. We note that the bicubic models are shared across different setups. In contrast, our method first learns a degradation-specific downsampling model from unpaired LR and HR images and leverages it to generate training samples for the SR model. We also introduce an oracle for each degradation, where the SR model can fully utilize the ground-truth kernel $\mathbf{k}$ to synthesize training images. As the distributions of the training and test images are matched, the oracle serves as an upper bound for a specific degradation kernel.

Table III compares various SR methods on the synthetic DIV2K dataset. EDSR and RRDB trained on bicubic LR images perform well when the inputs are also formed by the bicubic kernel ($\mathbf{k}_{\text{bic}}$). However, they do not generalize well when the inputs are downsampled by the other kernels ($\mathbf{k}_{g1}$–$\mathbf{k}_{g4}$), as the distribution of the test images deviates significantly from that of the training samples. Also, the larger RRDB network does not bring any advantage over the smaller EDSR model, showing that the bicubic SR models cannot be generalized to unknown types of LR images.

Method | # Parameters | Training data | PSNR for ×2 SR (Canon / Nikon) | PSNR for ×4 SR (Canon / Nikon)
EDSR [31] (Bicubic) | 1.5M | Synthetic (Bicubic $\mathbf{k}_{\text{bic}}$) | 30.58 / 30.00 | 26.05 / 25.89
RRDB [48] (Bicubic) | 16.7M | Synthetic (Bicubic $\mathbf{k}_{\text{bic}}$) | − | 26.05 / 25.91
EDSR (Oracle) | 1.5M | RealSR-V3 (Paired) | 32.45 / 31.59 | 27.59 / 27.14
RRDB (Oracle) | 16.7M | RealSR-V3 (Paired) | − | 27.90 / 27.39
KernelGAN [4] + ZSSR [44] | 50 × (0.2M + 0.2M) | A given LR input | 28.79 / 27.54 | 23.68 / 22.46
IKC [12] | 9.0M | Synthetic (Multiple $\mathbf{k}$) | − | 25.71 / 25.27
BlindSR [8] | 1.1M | Synthetic (Multiple $\mathbf{k}$) | 25.80 / 24.17 | −
LFL + EDSR (Proposed) | 0.9M + 1.5M | RealSR-V3 (Unpaired) | 31.67 / 30.75 | 26.47 / 25.90
ADL + EDSR (Proposed) | 0.9M + 1.5M | RealSR-V3 (Unpaired) | 31.81 / 30.99 | 26.79 / 26.46
ADL + RRDB (Proposed) | 0.9M + 16.7M | RealSR-V3 (Unpaired) | − | 26.90 / 26.64
TABLE IV: Blind super-resolution results on realistic LR images. We provide PSNR (dB) between ground-truth HR and SR results from different methods on the RealSR-V3 [6] dataset. Since the KernelGAN [4] and ZSSR [44] combination is learned on each test image, they require 50 × (0.2M + 0.2M) parameters in practice to handle the 50 inputs.

On the other hand, EDSR and RRDB achieve significant performance gains over the other approaches when the training LR images are generated with the proposed LFL and ADL. As discussed in Section 4.2, our approach can generate faithful LR and HR training pairs so that the following SR models achieve much better performance on LR images from an unknown downsampling process. Since LFL does not bias the learned downsampler toward a specific downsampling operator, e.g., bicubic or average pooling, the respective EDSR and RRDB models generalize well across various kernels ($\mathbf{k}_{\text{bic}}$–$\mathbf{k}_{g4}$) and scales (×2 and ×4). Furthermore, our downsampling model with ADL generates more accurate training LR images for the SR models and brings significant improvements to EDSR and RRDB consistently across all kernel configurations.

Interestingly, the larger RRDB model with ADL achieves better performance and is even comparable to the oracle EDSR, especially on the ×4 SR task. If the distribution of the downsampled images $\hat{\mathbf{y}}$ deviates much from that of the target LR images, then a better fit to the training data may worsen the performance on the test images. Therefore, the performance gain of the RRDB model demonstrates that our unsupervised downsampling framework can faithfully simulate the distribution of the target LR images to a certain extent. We also note that our data generation process is orthogonal to the architecture of the SR models. Therefore, integrating the proposed downsampling models with state-of-the-art SR architectures [13, 61] can directly improve performance.

We also apply the existing approaches to the various synthetic degradation kernels. The combination of KernelGAN [4] and ZSSR [44] first estimates an input-specific degradation kernel from a single image [4] and applies the zero-shot SR model [44] to deal with arbitrary LR images. Compared with bicubic EDSR and RRDB, the single-image approach achieves better performance on the Gaussian kernels, demonstrating the importance of image-specific kernel modeling for the blind SR task. However, this method does not perform well when test images are downsampled by $\mathbf{k}_{\text{bic}}$ and $\mathbf{k}_{g1}$, as depicted by the significant decrease in PSNR with respect to the oracle model.

Instead of synthesizing realistic LR images, IKC [12] and BlindSR [8] utilize large-scale synthetic data in which the degradation kernels follow specific shapes. They first predict the kernel used to generate the given LR image and derive input-dependent SR results within a single network architecture. As the IKC model only considers isotropic Gaussian kernels, it achieves comparable performance to the oracle models on the isotropic kernels $\mathbf{k}_{g1}$ and $\mathbf{k}_{g2}$. However, it does not perform well on the $\mathbf{k}_{g3}$ and $\mathbf{k}_{g4}$ cases, where the degradation kernels are anisotropic. While BlindSR [8] takes a similar strategy, it is less stable, unable to handle some inputs from the anisotropic kernels, and sometimes diverges. As shown in Fig. 6 (first row), RRDB trained on our data can reconstruct realistic details in this challenging SR task.

Fig. 6: Qualitative SR results on the various datasets. From left to right: the LR input, RRDB [48], IKC [12], KernelGAN [4] + ZSSR [44], ADL + RRDB [48], Oracle, and the ground truth. Patches in each row are from the synthetic DIV2K [1] dataset '0820.png' and '0853.png,' and the RealSR-V3 [6] dataset 'Canon/006.png' and 'Nikon/041.png,' respectively. The RRDB [48] model is used as the backbone SR architecture for our ADL as well as the Oracle.

4.4 SR on the RealSR-V3 dataset

Our approach can also be applied to real-world LR images from unknown camera processing pipelines. For quantitative evaluation, we utilize RealSR-V3 [6], which contains 200 well-aligned LR and HR image pairs for two different real-world cameras: Canon and Nikon. Similar to the description in Section 4.1, we divide the dataset in half for each camera model. The same number of images are assigned to $\mathcal{X}$ and $\mathcal{Y}$ without overlapping. We learn the corresponding downsampling and SR models by following our pipeline, as described in Section 4.1. As the dataset provides accurately aligned LR and HR examples for training a supervised SR model, the oracle models learn from those image pairs.

Compared to the experiments on synthetic images in Section 4.3, real-world cases are more challenging. First, a set of LR images may share a similar but not exactly identical degradation process. Also, since the dataset mainly consists of indoor scenes and static objects without large motions, the training images may lack diversity, which can hinder generalization. Table IV shows the results of the evaluated SR algorithms on RealSR-V3, where each of the Canon and Nikon splits contains 50 test images. KernelGAN + ZSSR, IKC, and BlindSR do not perform well, even compared to EDSR and RRDB learned on bicubic-downsampled images. The primary reason is that the numerous constraints in these methods, e.g., Gaussian kernels [12, 8] or kernel shape priors [4], do not usually hold for real-world scenes.

In contrast, our downsampling method with different SR backbones (LFL + EDSR, ADL + EDSR, and ADL + RRDB) achieves better results on both cameras at different scales (×2 and ×4) compared with the other approaches. Even if the LR images in RealSR-V3 are not formulated from the same kernel, our LFL and ADL can learn an average of all possible downsampling operators and generalize well on the real-world dataset. We also demonstrate that the larger RRDB model performs better in the realistic case, showing that a better fit on the generated LR images can help generalization on unseen real-world examples. Fig. 6 shows that our approach can reconstruct more visually pleasing results than the existing methods on RealSR-V3.

Method | PSNR for ×2 SR ($\mathbf{k}_{g4}$), $\lambda$ = 1 / 10 / 100 / 200
LFL + EDSR (Proposed) | 29.36 / 30.70 / 31.49 / 31.57
ADL + EDSR (Proposed) | 29.08 / 31.42 / 32.05 / 32.02
(a) Effect of the balancing parameter $\lambda$ in (2).

Method | PSNR for ×2 SR ($\mathbf{k}_{g4}$), number of LR images = 1 / 10 / 50 / 100
KernelGAN + ZSSR | 29.12 / − / − / −
LFL + EDSR (Proposed) | 25.43 / 28.70 / 29.86 / 31.21
ADL + EDSR (Proposed) | 29.84 / 30.20 / 31.36 / 32.10
(b) Effect of the number of training LR images.

Method | PSNR (dB) for ×2 SR: Joint / +BP / +BP+FQ / Two-stage
ADL + EDSR (Proposed) | 32.09 / 31.39 / 31.50 / 32.05
(c) Effect of the joint training.
TABLE V: Ablation studies on the proposed method. We report how different training configurations for the downsampling network affect the SR results on the synthetic DIV2K dataset.
Fig. 7: Evolution of the retrieved degradation kernel in the proposed ADL. We visualize the kernels estimated with (6) after 1, 2, 10, 20, 40, 60, and 80 epochs of training on two different datasets, together with the ground-truth kernel. For simplicity, we refer to 1,741 iterations as one epoch. Since we apply the ADL after 10 warm-up epochs, the downsampling networks at 1 and 2 epochs are trained under LFL, not ADL; for these, we visualize the linear approximations of the learned downsamplers rather than the approximated kernels $\hat{\mathbf{k}}$ used in ADL. In the RealSR-V3 [6] dataset, no ground-truth kernel is available for the Nikon camera configuration. We crop image boundaries for better illustration.

4.5 Ablation study

To examine the contribution of each design component in our LFL and ADL, we conduct extensive ablation studies in this section. The selected hyperparameters are used throughout the experiments in Sections 4.3, 4.4, and 4.7 without any additional adjustments. We note that an ablation regarding the shape of the low-pass filter is described in Appendix A. The stability of our LFL and ADL is described in Appendix C.

Effect of the balancing hyperparameter. As described in Section 3.2, the predetermined downsampling operator may bias the overall training objective of the downsampler, especially when the balancing hyperparameter $\lambda$ is large. To validate that our LFL and ADL terms do not impose a negative bias in the learning stage, we analyze how the balance between the data and adversarial losses in (2) affects the associated SR models. Table V(a) shows that the SR models with LFL and ADL do not perform well when $\lambda = 1$, as preserving image content is challenging during the downsampling process. With relatively large values, i.e., $\lambda = 100$ or $\lambda = 200$, the baseline SR models with LFL and ADL terms perform reasonably well without introducing bias. As such, we choose $\lambda = 200$ for the LFL and $\lambda = 100$ for the ADL to achieve the best performance.

Effect of the number of training samples. In Section 4.3, we train the proposed downsampling model on 400 LR images generated from the same kernel. Compared with the KernelGAN [4] method, which predicts a proper degradation kernel from a single input image, our approach requires more examples to estimate an unknown degradation accurately. For a fair comparison, we vary the number of LR images used to train the downsampling model and analyze the performance of the following SR models. We use the first 1, 10, 50, and 100 examples from $\mathcal{Y}$, e.g., '0401.png' for the single-image case, to learn our downsampling model. The other hyperparameters are fixed unless mentioned otherwise.

Table V(b) shows how the size of $\mathcal{Y}$ for the downsampler affects the following SR performance. Our methods (LFL + EDSR and ADL + EDSR) gradually achieve better performance as the number of training samples increases, validating the effectiveness of using large-scale datasets. Nevertheless, even with a single LR sample, ADL + EDSR outperforms the single-image method. Table V(b) also shows that ADL consistently outperforms LFL, especially when the number of training LR images is limited. Specifically, ADL + EDSR with a single LR image performs on par with LFL + EDSR with 50 LR images. In Appendix D, we also analyze the opposite case, in which KernelGAN is trained with multiple LR and HR images for a fair comparison.

Fig. 8: Perceptual SR results on the DPED dataset. From left to right: the LR input, RRDB [48], IKC [12], KernelGAN [4] + ZSSR [44], ADL + RRDB, and ADL + RRDB^P. We note that no ground-truth HR images exist for this dataset; therefore, neither a quantitative comparison on the DPED images nor an oracle model is available. RRDB^P denotes the RRDB [48] model learned with (9) to reconstruct more realistic textures and sharper details. From the top, patches are cropped from DPED-val '20.png,' '49.png,' and '63.png,' respectively.

Joint training of downsampling and SR networks. Our two-stage (downsampling + SR) pipeline has several advantages. First, if the two models are jointly learned, one may drive the other toward a suboptimal solution. For example, the downsampler may generate LR images that can be easily upsampled rather than accurately simulating the desired target. In addition, connecting the two models increases the algorithmic complexity, making it hard to train the whole model. Finally, the two-stage approach accommodates more effective models and objective functions, as we have a fixed downsampling network and corresponding LR images.

On the other hand, it is possible to optimize the downsampling and SR networks jointly. Table V(c) provides experimental results of joint training with different setups. To reduce the training time, we optimize the downsampling and SR networks together (Joint), contrary to the original formulation (Two-stage). We note that the gradient from the SR model does not backpropagate to the downsampler in this configuration. Interestingly, the joint training approach achieves a marginal performance gain for the SR network. Since we do not fix the input of the SR network and keep updating the downsampler, joint training has an effect similar to data augmentation and slightly improves the following SR model.

We also train the downsampling and SR networks together in an end-to-end manner, where the backpropagated gradient from the SR model flows into the downsampler (+BP). However, this approach negatively affects the following SR network in two specific aspects. First, the downsampler tends to generate images that are easy to upscale rather than accurately simulating samples in the target LR distribution. Second, each color pixel in the generated image is a continuous variable, while pixels in our test samples take only 256 discrete values. We further introduce fake quantization (+BP+FQ) to address the second issue, where outputs of the downsampler are quantized while the gradient flows as if the pixel values were continuous. Although it brings a 0.11 dB performance gain over +BP, end-to-end learning does not bring any advantage in our framework. As such, we use the two-stage approach in all experiments.

Total training epochs | 20 / 40 / 60 / 80
Kernel similarity | 0.9682 / 0.9707 / 0.9714 / 0.9718
PSNR for ×2 SR | 31.48 / 31.96 / 31.93 / 32.05
(a) Effect of the total training epochs in Algorithm 1.

Update interval $T_{u}$ (epochs) | 1 / 10 / 20 / ∞
PSNR for ×2 SR (ADL + EDSR) | 31.65 / 32.05 / 32.01 / 31.85
(b) Effect of the iterative adjustment in Algorithm 1.

Number of samples $M$ | 1 / 5 / 10 / 50
PSNR for ×2 SR (ADL + EDSR) | 31.09 / 31.84 / 32.01 / 32.05
(c) Effect of the number of samples for kernel estimation in (6).
TABLE VI: Ablation study on the proposed ADL method. To evaluate SR performance, we use the ADL + EDSR configuration and the synthetic DIV2K dataset. (a) We note that the kernel similarity [15] is measured after the last update, between the ground truth and a linear approximation of the learned downsampler for each configuration. (b) ∞ means that the kernel is not updated after the first estimation.

4.6 Analysis on ADL

As the estimated kernel $\hat{\mathbf{k}}$ in Algorithm 1 is derived from the training dataset, the ADL does not significantly conflict with the adversarial loss. To demonstrate the effectiveness of our adaptive adjustment strategy, we visualize how the estimated kernels on the synthetic DIV2K and RealSR-V3 datasets are updated in Fig. 7. We note that the kernels from the RealSR-V3 [6] dataset do not resemble the standard Gaussian forms that are preferred in the existing approaches [12, 8]. Since we do not constrain the downsampling network to resemble specific kernel shapes, our approach can yield better generalizability.

In the synthetic cases, the estimated kernel should be similar to the ground truth for the following SR training. Thus, we show the kernel similarity [15] between the retrieved and ground-truth degradation kernels in Table VI(a) to validate that our prediction becomes more accurate as training proceeds. It is also demonstrated that the more precise estimation helps the following SR network perform better, and the performance is maximized at 80 epochs. Furthermore, Table VI(b) shows that our iterative update prevents the downsampler from being biased toward a fixed kernel and helps with convergence. As described in Section 3.4, ADL is designed to stabilize the potentially noisy downsampling model. Therefore, the number of samples $M$ for the kernel estimation is also crucial. Table VI(c) demonstrates that the ADL does not perform well with only a single training image. However, as the number of examples increases, our method can assist the following SR model to generate better results.

4.7 SR on the real-world images

Our method assumes that the set of available LR images follows the same downsampling process. In practice, acquiring multiple images from a similar degradation pipeline, such as photos from a fixed camera configuration [18] or multiple frames in a video, is much easier than collecting real-world LR and HR pairs. We validate the proposed method on the DPED [18] dataset, which consists of low-quality photos captured by an iPhone 3GS camera. To train the downsampler, we randomly assign 120 images from the DPED [18] dataset to $\mathcal{Y}$, while DIV2K is used for the HR samples $\mathcal{X}$. Since the dataset consists of low-quality examples, we remove the unknown noise and artifacts in $\mathcal{Y}$ by applying the off-the-shelf RL-restore [54] algorithm as preprocessing. Fig. 8 compares SR results on the preprocessed DPED [18] dataset. As no ground-truth HR images exist in this dataset, no oracle model is available. Compared with the existing methods, our ADL facilitates RRDB [48] to reconstruct sharper edges and more detailed textures without introducing visual artifacts. We also train an ADL-based downsampler without RL-restore [54] preprocessing, denoted as ADL-n, for the comparison in Fig. 9.

We also train the perceptual SR network using images from the ADL-based downsampler to demonstrate the merit of our approach for real-world SR. Compared with the PSNR-oriented model (ADL + RRDB) trained with (8), the perceptual model (ADL + RRDB^P) from (9) reconstructs more realistic and visually pleasing results. An advantage of the two-stage approach is that we do not require additional training of the downsampler to introduce a different optimization objective for the SR network. Thus, the only difference between the last two columns of Fig. 8 is the training loss, while the backbone networks are the same. Additional qualitative SR results are presented in Appendix G.

In Fig. 9, we analyze how much our ADL is affected by noise and artifacts. For comparison, we evaluate the pretrained RealSR [20] and BSRGAN [55] models on the same DPED images. Since those methods explicitly consider noise and artifacts in LR inputs, their results are robust to such degradations. In contrast, our ADL yields sharper but slightly noisier outputs. Note that without the preprocessing (ADL-n), the results become blurry because noisy LR samples prevent the discriminator from learning distinguishable features from the downsampled images.

(a) RealSR
(b) BSRGAN
(c) ADL
(d) ADL-n
Fig. 9: Effect of preprocessing on the DPED dataset. In ADL, appropriate preprocessing is required for more effective learning. Patches are cropped from DPED-val ‘45.png,’ ‘63.png,’ and ‘84.png,’ respectively.

5 Conclusions

We propose a novel unsupervised method to estimate the unknown distribution of LR images using unpaired LR and HR examples. The proposed LFL and ADL terms enable the downsampler to synthesize LR images that accurately follow the desired distribution. In contrast to conventional approaches, we impose no restrictive priors on the learned function in the adversarial training framework. Consequently, existing SR models trained on our LR images achieve significant performance gains on both synthetic and realistic datasets. We also demonstrate that our approach applies to sets of arbitrary images in the wild [6, 18], verifying that the proposed method can handle real-world SR problems. In future work, we will extend our approach to jointly estimating a feasible downsampling model together with real-world noise.

Acknowledgement

This work was partly supported by IITP grant funded by the Korea government [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)].

References

  • [1] E. Agustsson and R. Timofte (2017) NTIRE 2017 challenge on single image super-resolution: dataset and study. In CVPR Workshops, Cited by: §1, Fig. 5, Fig. 6, §4.1, Fig. S3.
  • [2] N. Ahn, B. Kang, and K. Sohn (2018) Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, Cited by: §2.1.
  • [3] Y. Bahat and T. Michaeli (2020) Explorable super resolution. In CVPR, Cited by: §1.
  • [4] S. Bell-Kligler, A. Shocher, and M. Irani (2019) Blind super-resolution kernel estimation using an internal-gan. In NeurIPS, Cited by: Fig. 1, 1(c), §1, §1, §2.2, §2.3, §S2, §3.3, §3.4, §3.4, 6(), 8(), §4.3, §4.4, §4.5, TABLE III, TABLE IV, §S5, 3(), 4(), 5().
  • [5] A. Bulat, J. Yang, and G. Tzimiropoulos (2018) To learn image super-resolution, use a GAN to learn how to do image degradation first. In ECCV, Cited by: §1, §2.3, §3.2, §4.2, §4.2.
  • [6] J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang (2019) Toward real-world single image super-resolution: a new benchmark and a new model. In ICCV, Cited by: Fig. 1, §1, §2.4, Fig. 6, Fig. 7, §4.4, §4.6, TABLE IV, §5, Fig. S4.
  • [7] C. Chen, Z. Xiong, X. Tian, Z. Zha, and F. Wu (2019) Camera lens super-resolution. In CVPR, Cited by: §1, §1, §2.2, §2.4, §3.
  • [8] V. Cornillère, A. Djelouah, W. Yifan, O. Sorkine-Hornung, and C. Schroers (2019) Blind image super-resolution with spatially variant degradations. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–13. Cited by: §1, §2.2, §3.4, §4.1, §4.3, §4.4, §4.6, TABLE III, TABLE IV.
  • [9] T. Dai, J. Cai, Y. Zhang, S. Xia, and L. Zhang (2019) Second-order attention network for single image super-resolution. In CVPR, Cited by: §2.1.
  • [10] C. Dong, C. C. Loy, K. He, and X. Tang (2016) Image super-resolution using deep convolutional networks. TPAMI. Cited by: §1, §2.1, §4.3.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §1, §2.3, §3.1.
  • [12] J. Gu, H. Lu, W. Zuo, and C. Dong (2019) Blind super-resolution with iterative kernel correction. In CVPR, Cited by: §1, §2.2, §3.4, 6(), 8(), §4.1, §4.3, §4.4, §4.6, TABLE III, TABLE IV, 3(), 4(), 5(), §S7.
  • [13] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep back-projection networks for super-resolution. In CVPR, Cited by: §1, §2.1, §4.3, §S6.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §S5.
  • [15] Z. Hu and M. Yang (2012) Good regions to deblur. In ECCV, Cited by: §4.6, TABLE VI.
  • [16] Z. Hui, X. Wang, and X. Gao (2018) Fast and accurate single image super-resolution via information distillation network. In CVPR, Cited by: §2.1.
  • [17] S. A. Hussein, T. Tirer, and R. Giryes (2020) Correction filter for single image super-resolution: robustifying off-the-shelf deep super-resolvers. In CVPR, Cited by: §2.1.
  • [18] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool (2017) DSLR-quality photos on mobile devices with deep convolutional networks. In ICCV, Cited by: §4.7, §5, Fig. S5.
  • [19] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-Image translation with conditional adversarial networks. In CVPR, Cited by: §4.1, §S5.
  • [20] X. Ji, Y. Cao, Y. Tai, C. Wang, J. Li, and F. Huang (2020) Real-world super-resolution via kernel estimation and noise injection. In CVPRW, Cited by: §2.3, §4.7.
  • [21] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, Cited by: §2.1, §3.5.
  • [22] J. Kim, J. K. Lee, and K. M. Lee (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, Cited by: §2.1, §4.3, §S5.
  • [23] J. Kim, J. K. Lee, and K. M. Lee (2016) Deeply-recursive convolutional network for image super-resolution. In CVPR, Cited by: §2.1.
  • [24] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv. Cited by: §S6.
  • [25] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, Cited by: §1, §2.1, §3.5.
  • [26] W. Lai, J. Huang, N. Ahuja, and M. Yang (2018) Fast and accurate image super-resolution with deep laplacian pyramid networks. TPAMI 41 (11), pp. 2599–2613. Cited by: §2.1.
  • [27] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, Cited by: §1, §2.1, §3.5, §4.3.
  • [28] W. Lee, J. Lee, D. Kim, and B. Ham (2020) Learning with privileged information for efficient image super-resolution. In ECCV, Cited by: §2.1.
  • [29] J. Li, F. Fang, K. Mei, and G. Zhang (2018) Multi-scale residual network for image super-resolution. In ECCV, Cited by: §2.1.
  • [30] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu (2019) Feedback network for image super-resolution. In CVPR, Cited by: §2.1.
  • [31] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In CVPR Workshops, Cited by: §1, §2.1, §3.5, §4.1, §4.3, TABLE III, TABLE IV, §S6.
  • [32] A. Lugmayr, M. Danelljan, and R. Timofte (2019) Unsupervised learning for real-world super-resolution. arXiv. Cited by: §1, §1, §2.3, §4.2.
  • [33] X. Luo, Y. Xie, Y. Zhang, Y. Qu, C. Li, and Y. Fu (2020) LatticeNet: towards lightweight image super-resolution with lattice block. In ECCV, Cited by: §2.1.
  • [34] S. Maeda (2020) Unpaired image super-resolution using pseudo-supervision. In CVPR, Cited by: §1, §2.3, §3.2, §4.2, §4.2.
  • [35] R. Mechrez, I. Talmi, and L. Zelnik-Manor (2018) The contextual loss for image transformation with non-aligned data. In ECCV, Cited by: §2.1.
  • [36] T. Michaeli and M. Irani (2013) Nonparametric blind super-resolution. In CVPR, Cited by: §2.2, §2.3, §3.4.
  • [37] B. Niu, W. Wen, W. Ren, X. Zhang, L. Yang, S. Wang, K. Zhang, X. Cao, and H. Shen (2020) Single image super-resolution via a holistic attention network. In ECCV, Cited by: §2.1.
  • [38] J. Pan, Z. Hu, Z. Su, and M. Yang (2014) Deblurring text images via l0-regularized intensity and gradient prior. In CVPR, Cited by: §2.2.
  • [39] Y. Qiu, R. Wang, D. Tao, and J. Cheng (2019) Embedded block residual network: a recursive restoration model for single-image super-resolution. In ICCV, Cited by: §2.1.
  • [40] M. S. Rad, B. Bozorgtabar, U. Marti, M. Basler, H. K. Ekenel, and J. Thiran (2019) SROBB: targeted perceptual loss for single image super-resolution. In ICCV, Cited by: §2.1.
  • [41] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv. Cited by: Fig. S2, §3.2, §3.4, 1.
  • [42] M. S. M. Sajjadi, B. Scholkopf, and M. Hirsch (2017) EnhanceNet: single image super-resolution through automated texture synthesis. In ICCV, Cited by: §2.1.
  • [43] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, Cited by: §2.1.
  • [44] A. Shocher, N. Cohen, and M. Irani (2018) “Zero-Shot” super-resolution using deep internal learning. In CVPR, Cited by: Fig. 1, 1(c), §2.1, §2.3, 6(), 8(), §4.3, TABLE III, TABLE IV, 3(), 4(), 5().
  • [45] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §3.5.
  • [46] T. Tong, G. Li, X. Liu, and Q. Gao (2017) Image super-resolution using dense skip connections. In ICCV, Cited by: §2.1.
  • [47] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv. Cited by: Fig. S2, §4.1, §S5.
  • [48] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In ECCV Workshops, Cited by: Fig. 1, 1(b), §1, §2.1, §3.5, Fig. 6, 6(), 6(), Fig. 8, 8(), §4.1, §4.3, §4.7, TABLE III, TABLE IV, 3(), 3(), Fig. S4, 4(), 4(), 5(), 5(), §S7.
  • [49] Y. Wang, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung, and C. Schroers (2018) A fully progressive approach to single-image super-resolution. In CVPRW, Cited by: §2.1.
  • [50] P. Wei, Z. Xie, H. Lu, Z. Zhan, Q. Ye, W. Zuo, and L. Lin (2020) Component divide-and-conquer for real-world image super-resolution. In ECCV, Cited by: §1, §2.4.
  • [51] B. Wronski, I. Garcia-Dorado, M. Ernst, D. Kelly, M. Krainin, C. Liang, M. Levoy, and P. Milanfar (2019) Handheld multi-frame super-resolution. ACM TOG 38 (4), pp. 1–18. Cited by: §1.
  • [52] X. Xu, Y. Ma, and W. Sun (2019) Towards real scene super-resolution with raw images. In CVPR, Cited by: §2.4.
  • [53] Y. Xu, S. R. Tseng, Y. Tseng, H. Kuo, and Y. Tsai (2020) Unified dynamic convolutional network for super-resolution with variational degradations. In CVPR, Cited by: §2.2.
  • [54] K. Yu, C. Dong, L. Lin, and C. Change Loy (2018) Crafting a toolchain for image restoration by deep reinforcement learning. In CVPR, Cited by: §4.7.
  • [55] K. Zhang, J. Liang, L. Van Gool, and R. Timofte (2021) Designing a practical degradation model for deep blind image super-resolution. arXiv. Cited by: §2.2, §4.7.
  • [56] K. Zhang, L. Van Gool, and R. Timofte (2020) Deep unfolding network for image super-resolution. In CVPR, Cited by: §2.2.
  • [57] K. Zhang, W. Zuo, and L. Zhang (2018) Learning a single convolutional super-resolution network for multiple degradations. In CVPR, Cited by: §1, §2.2.
  • [58] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: §2.1, §3.5.
  • [59] W. Zhang, Y. Liu, C. Dong, and Y. Qiao (2019) RankSRGAN: generative adversarial networks with ranker for image super-resolution. In ICCV, Cited by: §2.1.
  • [60] X. Zhang, Q. Chen, R. Ng, and V. Koltun (2019) Zoom to learn, learn to zoom. In CVPR, Cited by: §1, §2.4.
  • [61] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In ECCV, Cited by: §1, §2.1, §4.3, §S6.
  • [62] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In CVPR, Cited by: §1, §2.1.
  • [63] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2020) Residual dense network for image restoration. TPAMI. Cited by: §2.1.
  • [64] T. Zhao, W. Ren, C. Zhang, D. Ren, and Q. Hu (2018) Unsupervised degradation learning for single image super-resolution. arXiv. Cited by: §1, §2.3.
  • [65] R. Zhou and S. Susstrunk (2019) Kernel modeling super-resolution on real low-resolution images. In ICCV, Cited by: §2.2, §3.
  • [66] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §2.3.

S1 Details about the Low-pass Filters

Formulation of the low-pass filters. We describe the specific implementations of the low-pass filters in the proposed LFL formulation. Our LFL utilizes conventional low-pass filters to extract low-frequency components from given images. Therefore, we rewrite the low-pass term in (4) as a kernel representation followed by subsampling to simplify the description:

$\mathcal{L}^{w}(\mathbf{x}) = \left( k^{w} \circledast \mathbf{x} \right) \downarrow_{w}$,  (S1)

where $w$ is the subsampling factor of the LR image, and (S1) is equivalent to (4) in our main manuscript. For example, our default formulation corresponds to a small 2D box kernel applied to LR images, while the kernel for HR images is a correspondingly larger box filter that accounts for the scale factor. Here, the coordinate system of the downsampling kernel includes a sub-pixel shift: the center of the kernel, which does not coincide with a pixel, lies at the origin, and the 4 neighboring pixels are represented by half-integer coordinates. This formulation is also useful for Gaussian kernels on an even-sized grid as follows:

$k_{g}(x, y) = \dfrac{1}{Z} \exp\left( -\dfrac{\tilde{x}^{2}}{2\sigma_{x}^{2}} - \dfrac{\tilde{y}^{2}}{2\sigma_{y}^{2}} \right), \quad \begin{pmatrix} \tilde{x} \\ \tilde{y} \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$,  (S2)

where $Z$ is a normalization factor so that the kernel weights sum to 1. For the proposed LFL, only isotropic cases are tested, where $\sigma_{x}$ and $\sigma_{y}$ are equal. In the Gaussian cases, we follow a convention and set the kernel grid size to the nearest power of two determined by $\sigma$. While we adopt the filtering-based method for our LFL for simplicity, more complex formulations such as wavelets can be introduced without losing generality.
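Both (S1) and (S2) translate directly into code. Below is a minimal PyTorch sketch, assuming the box variant of the low-pass filter and the even-grid Gaussian described above; the function names, the grouped strided-convolution formulation, and the rotation handling are our reading of the equations, not an official implementation.

```python
import math
import torch
import torch.nn.functional as F

def low_pass_subsample(x, w):
    """L^w(x) in (S1): w-by-w box filtering followed by subsampling with stride w.

    x: (N, C, H, W) batch. A grouped convolution applies the box kernel per
    channel; using stride w fuses the filtering and decimation into one op.
    """
    c = x.shape[1]
    k = torch.full((c, 1, w, w), 1.0 / (w * w), dtype=x.dtype, device=x.device)
    return F.conv2d(x, k, stride=w, groups=c)

def even_gaussian_kernel(size, sigma_x, sigma_y, theta=0.0):
    """Gaussian kernel of (S2) on an even-sized grid with a sub-pixel center.

    Coordinates are half-integers (..., -1.5, -0.5, 0.5, 1.5, ...), so the
    kernel center (0, 0) falls between pixels, matching the sub-pixel shift
    described above; theta rotates the grid for anisotropic kernels.
    """
    assert size % 2 == 0, "even-sized grid expected"
    coords = torch.arange(size) - (size - 1) / 2.0   # half-integer offsets
    yy, xx = torch.meshgrid(coords, coords, indexing="ij")
    c, s = math.cos(theta), math.sin(theta)
    xr, yr = c * xx + s * yy, -s * xx + c * yy       # rotated coordinates
    k = torch.exp(-xr ** 2 / (2 * sigma_x ** 2) - yr ** 2 / (2 * sigma_y ** 2))
    return k / k.sum()                               # Z: weights sum to 1
```

In this reading, the LFL compares low_pass_subsample(x, w) on an LR image against low_pass_subsample(y, s * w) on the corresponding HR image, so that both sides share the same spatial resolution.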

Selection of the low-pass filters. An appropriate choice of the low-pass filter in our LFL plays an essential role. Therefore, we conduct an extensive ablation study to determine the low-pass filter used when training the downsampler. Table S2 shows how different types and shapes of low-pass filters affect the SR results; we report performance on the synthetic DIV2K dataset with a challenging anisotropic Gaussian kernel. As shown in Table S2(a), a small box filter, e.g., of size 2, may bias the training objective and degrade the subsequent SR performance. On the other hand, a very large box filter, e.g., of size 64, operates as an extremely loose constraint and cannot contribute to preserving image contents across different scales. In Table S2(b), we also introduce 2D Gaussian filters for the LFL; however, simple box filters demonstrate relatively better performance. Thus, we use box filters by default throughout our experiments.

TABLE S1: Specifications of the low-pass filters we use (type, size, and shape): 2D box filters of two sizes, applied to LR and HR images respectively, and 2D Gaussian filters. For box and Gaussian filters, weights are normalized so that their values sum to 1. We note that a subsampling step by the corresponding factor follows each filter to reduce the image resolution. More details about the filters are described in Appendix A.
Method: LFL + EDSR (Proposed)

(a) Box filters for the low-pass filter
Box size     2      4      8      16     32     64
PSNR (dB)    29.51  30.17  31.06  31.57  31.58  28.11

(b) Gaussian filters for the low-pass filter
Gaussian σ   0.8    1.2    1.6    2.0    2.5    3.0
PSNR (dB)    29.64  30.24  30.42  30.62  30.96  30.75

TABLE S2: Ablation study on the shapes and sizes of the low-pass filter. We train the LFL-based downsampler and the subsequent SR model on DIV2K to observe how different low-pass filters affect the performance of our approach.

S2 Details about the Synthetic Kernels

We present the formulation of the synthetic downsampling kernels used in the experiments of our main manuscript. As we describe in Section 4.1, one kernel is the widely used MATLAB bicubic kernel; the others are Gaussian kernels sampled following (S2). Table S3 lists the actual parameters used to instantiate our synthetic kernels. To validate the generalization ability of the proposed method, we do not resort only to radial kernels, which are relatively easy to model, and also introduce anisotropic kernels. The most challenging case further includes a rotation and has neither vertical nor horizontal symmetry. For a larger downsampling factor, we follow the approach of KernelGAN [4] and convolve the same kernel twice to generate a larger one.

σ_x    σ_y    θ (deg)    Type
1.0    1.0    0          Isotropic
1.6    1.6    0          Isotropic
1.0    2.0    0          Anisotropic
1.0    2.0    29         Anisotropic
TABLE S3: Detailed parameters used to implement the synthetic Gaussian kernels. Fig. 5 in our main manuscript also visualizes each downsampling kernel in detail.
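Given a kernel builder such as the even_gaussian_kernel sketch from Appendix S1, the kernels of Table S3 and their enlarged counterparts can be instantiated as follows; the kernel size of 16 is an assumed placeholder, and the self-convolution mirrors the KernelGAN-style construction described above.

```python
import math
import torch
import torch.nn.functional as F

# Table S3 parameters as (sigma_x, sigma_y, theta in degrees); the kernel
# size of 16 is an assumed placeholder, not the value used in the paper.
params = [(1.0, 1.0, 0.0), (1.6, 1.6, 0.0), (1.0, 2.0, 0.0), (1.0, 2.0, 29.0)]
kernels = [even_gaussian_kernel(16, sx, sy, math.radians(t)) for sx, sy, t in params]

def self_convolve(k):
    """Convolve a kernel with itself to cover a larger scale (cf. KernelGAN [4])."""
    pad = k.shape[-1] - 1                      # 'full' convolution support
    k2 = F.conv2d(k[None, None], k.flip(0, 1)[None, None], padding=pad)
    return (k2 / k2.sum()).squeeze()           # renormalize the enlarged kernel

kernels_x4 = [self_convolve(k) for k in kernels]
```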

S3 Stability of the proposed methods

Since our LFL and ADL rely on unsupervised adversarial training, the stability and reproducibility of the proposed method are essential issues. Therefore, we conduct five independent experiments for each of the five downsampling kernels at a single scale factor to analyze the stability of our training scheme. Fig. S1 shows the average performance of the subsequent SR model, after training the downsampler with LFL and ADL, across the five kernel configurations on the synthetic DIV2K dataset. Even in the unsupervised learning framework, the SR model trained with our LFL and ADL schemes performs consistently with small variation. Since the SR model is stable across different experimental configurations, we report results from a single run in the other sections.

Fig. S1: Stability analysis of the proposed methods. We visualize the average performance and standard deviation of the baseline EDSR model over five runs on each downsampling kernel. Notably, ADL consistently outperforms LFL in all synthetic downsampling kernel configurations.

(a) Downsampling CNN
(b) Discriminator CNN
Fig. S2: Our CNN architectures. C in the navy box denotes a convolutional layer, e.g., C5 for a 5 × 5 kernel. For a sequence of convolutional, instance normalization [47], and ReLU (or LeakyReLU [41]) activation layers, we use the shorthand C-I-R (or C-I-L) for simplicity. The green block in (a) incorporates a shortcut connection wrapping around the C-I-R sequence. We note that the number of output channels and the stride of each convolutional layer, with a default value of 1, are annotated in the figure.
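As a concrete reading of this shorthand, the C-I-R sequence and the green residual block of Fig. S2(a) map onto standard PyTorch layers as below; the channel count and kernel size are placeholders rather than the exact configuration.

```python
import torch.nn as nn

class CIR(nn.Module):
    """C-I-R: convolution, instance normalization [47], and ReLU."""
    def __init__(self, channels=64, k=3, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, k, stride=stride, padding=k // 2),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class ResidualCIR(nn.Module):
    """Green block of Fig. S2(a): a shortcut wrapping the C-I-R sequence."""
    def __init__(self, channels=64):
        super().__init__()
        self.cir = CIR(channels)

    def forward(self, x):
        return x + self.cir(x)
```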

S4 Detailed comparison with KernelGAN

Table 4 in our main manuscript shows that the proposed ADL + EDSR outperforms the KernelGAN + ZSSR combination by a significant margin, even when only one LR image is available for training. In this section, we train KernelGAN on multiple images to demonstrate the advantage of our method when large-scale unpaired images are available. Table S4 reports extensive experimental results for different training datasets of KernelGAN. We note that ZSSR is used for reconstruction after KernelGAN by default unless mentioned otherwise.

Case    Dataset for the generator             SR model    PSNR (dB) for SR (five kernel configurations)
        HR image(s)      LR image(s)
1       –                0801–0900            ZSSR        21.54  26.41  30.55  31.18  27.68
2       –                0801–0900            EDSR        16.87  18.55  29.95  31.08  23.28
3       0001–0400        0401–0800            ZSSR        20.91  26.83  29.97  28.46  27.99
4       0001–0400        0401–0800            EDSR        16.11  20.35  29.26  24.85  19.68