Super-resolution (SR) aims to enhance the resolution of natural
images. Recent years have seen increased interest in the problem, driven by emerging applications. Most notably, current generations of smartphones allow for the deployment of powerful image enhancement techniques based on machine learning. This calls for super-resolution methods that can be applied to natural images, which are often subject to significant levels of sensor noise, compression artifacts, or other corruptions encountered in applications. In this work, we therefore address the problem of super-resolution in the real-world setting.
Real-world SR poses a fundamental challenge that has been largely ignored until very recently. The lack of natural low-resolution (LR) and high-resolution (HR) image pairs greatly complicates the evaluation and training of SR methods. Therefore, research in the field has long relied on known degradation operators, such as the bicubic kernel, to artificially generate a corresponding LR image [9, 31, 34]. While this straightforward approach enables simple and efficient benchmarking and the generation of virtually unlimited training data, it comes with significant drawbacks. Bicubic downsampling can drastically change the natural characteristics of an image by, e.g., removing sensor noise and compression artifacts.
State-of-the-art methods trained only to reconstruct images artificially downsampled with a bicubic kernel do not generalize to natural images. As visualized in Figure 1, even small levels of noise cause a network trained only on bicubic images, in this case ESRGAN, to output significant artifacts. In fact, this is expected, as deep learning methods are known to be sensitive to significant differences between the training and test distributions. ESRGAN has not seen noisy input images during training, due to the smoothing effect introduced by bicubic downsampling.
In this work, we present a novel way of training a generic method to overcome the challenges of real-world SR. We address the shift between training and testing distributions arising from the bicubic downsampling by learning the corresponding inverse mapping. To this end, we train a mapping from bicubic images to the distribution of real-world LR images. By employing cycle consistency losses, we learn this mapping in a fully unsupervised manner. The learned network is applied to bicubically downsampled images to generate paired LR and HR images that follow the real-world distribution. This allows us to learn the SR network on a realistic dataset, unaffected by the bicubic shift. Furthermore, the SR network is trained with direct pixel-wise supervision in the HR domain, without the need for any paired ground-truth data. Visual results of our approach on natural images are shown in Figure 1.
Due to the unavailability of paired data, we introduce a protocol for benchmarking real-world SR methods, based on simulating natural degradations. We analyze our approach in two scenarios, namely Domain-Specific Super-Resolution (DSR) and Clean Super-Resolution (CSR). In the former case, the real-world data distribution is defined by a single set of natural images. However, our approach generalizes to the case where the real-world input and output distributions of the SR network differ. We therefore introduce the CSR task, where the goal is to produce a clean super-resolved image, defined by a separate output distribution of high-quality images. We demonstrate the effectiveness of our approach on the aforementioned benchmark, and compare it to baseline methods and state-of-the-art approaches. Finally, we show qualitative results for the task of super-resolving real-world smartphone images on the DPED dataset.
2 Related Work
Until very recently, single image super-resolution (SISR) methods were primarily benchmarked in terms of PSNR, for the task of super-resolving bicubically downsampled images. While traditionally addressed with classical techniques [19, 12, 29, 36, 15], current approaches [9, 10, 22, 23, 26, 11, 2, 3, 14, 16] employ deep learning to train a mapping from LR to HR. Among the latter, EDSR notably introduced a ResNet-inspired architecture, better adapted to the task at hand. For training, however, these methods rely on the L1 or L2 loss. While these losses are closely related to the PSNR evaluation metric, they do not preserve natural image characteristics, generally leading to blurry results. To address this problem, Ledig et al. introduced an objective function aimed at perceptually more pleasing results, combining a GAN discriminator with a loss computed in the VGG feature space. While yielding lower PSNR than the state-of-the-art, the super-resolved images exhibited significantly better perceptual quality. Following this philosophy, ESRGAN, the recent winner of the PIRM 2018 challenge, proposed architectural improvements to further enhance perceptual quality.
Despite their success, the aforementioned approaches are severely limited by their reliance on the bicubic downsampling operation for training data generation. This operation eliminates most high-frequency components, thereby significantly altering natural image characteristics such as noise, compression artifacts, and other corruptions. The bicubic assumption therefore rarely reflects the real-world scenario. Blind SR generalizes the problem by assuming LR and HR image pairs related by an unknown degradation and downsampling kernel. Early attempts at this problem include explicitly estimating the unknown point spread function itself [28, 13]. Another line of research aims to completely remove the need for external training data by performing image-specific SR. Following this idea, ZSSR trains a lightweight network using only the test image itself, by performing extensive data augmentation. However, this approach still employs a fixed downsampling operation to generate synthetic pairs at test time. Furthermore, the image-specific learning leads to extremely slow prediction.
A few recent works address the unsupervised SR setting, where no LR-HR pairs are given and the relation between LR and HR images is unknown. The Cycle-in-Cycle network learns a mapping from the original input image to a clean image space, using a framework that employs cycle consistency losses. The SR network itself is trained with only indirect supervision in the LR domain, in addition to the usual perceptual GAN discriminator. In contrast, our framework allows direct supervision in the HR domain, resulting in better training of the SR network itself. Furthermore, instead of "cleaning" the input image at both training and test time, we learn a mapping to the original input domain that is used only during training. Another work focuses on the downsampling process in order to improve SR. However, SR is there only performed on images produced by the learned downsampling operation, and the method is therefore not applicable to our real-world scenario. Bulat et al. also focus on the problem of learning the downsampling process. However, their approach specifically addresses the problem of super-resolving faces, where strong content priors can be learned by the network. In contrast, we tackle the general SR problem, without any assumptions on the image content. Lastly, recent works [39, 7, 6] propose strategies to capture real LR-HR image pairs. However, these methods rely on complicated data collection procedures requiring specialized hardware, which is difficult and expensive to scale. Our approach operates without the need for any additional data, greatly increasing its applicability.
3 Proposed Method
3.1 The Super-Resolution Problem
In essence, super-resolution (SR) is the problem of increasing the resolution of natural images. However, this problem comes with a fundamental challenge that has been largely ignored until very recently: the lack of natural LR and HR image pairs, which are needed for evaluation and training. Therefore, research in SR has long relied on known downscaling operators (e.g. bicubic) to artificially generate a corresponding LR image. While this simplification has historically served the development of SR methods, it is fundamentally limiting.
Bicubic downsampling can drastically change the natural characteristics of an image by, e.g., removing sensor noise and compression artifacts. A real-world example is shown in Figure 2. The natural image (left) is affected by natural sensor noise. However, the corresponding bicubically downsampled image does not preserve these characteristics. Hence, a network trained to super-resolve the latter image cannot be expected to generalize to the original real-world distribution.
To formalize the problem, we let x denote the natural image we wish to super-resolve. We also introduce the distribution p_X of such natural images, on which we want our SR approach to operate. In practice, p_X could be defined as images obtained from a specific camera or a dataset of real-world images. The aim is to learn a function S that maps an image x ~ p_X to a high-resolution image ŷ = S(x) that is distributed according to the output distribution p_Y. In applications, we could have p_Y exhibit the same image characteristics as p_X, meaning that we want the characteristics of the image to remain unchanged after super-resolution. We term this setting domain-specific super-resolution (DSR). Alternatively, p_Y can be defined by a set of high-quality images, which we call the clean super-resolution (CSR) setting.
For most real-world applications, it is extremely hard and strenuous to collect natural image pairs for SR. In classical SR, this is addressed by artificially constructing the input image as B(y), where B is the bicubic downsampling operation and y the original image. The task is then to super-resolve B(y) to match the original image y. However, as illustrated in Figure 2, the bicubically downsampled images do not match the input distribution p_X. Unfortunately, methods trained in this manner struggle when supplied with real data x ~ p_X.
Related to our discussion is the concept of blind SR. In this setting, the input images are assumed to be generated from the output images by some fixed and simple, but unknown, transformation. Often, a more general downsampling kernel k is used in combination with a non-linear degradation function d, such that the input image is given by d(k * y) for an output image y. Some methods try to estimate the kernel k from data, while others learn the transformation end-to-end.
The real-world SR setting, addressed in this work, can be seen as a generalization of blind SR. In our approach, we assume no particular relation, such as a parameterized transformation, between the input and output images. We only assume that a set of input image samples x ~ p_X and a set of output image samples y ~ p_Y are available. These image samples are not paired. Given this data, the problem is to learn a mapping S that can super-resolve a new image x ~ p_X such that S(x) follows the output distribution p_Y. In order to train from such unpaired data, we learn a function G that maps the bicubically downsampled version B(y) of an output image y to an image G(B(y)) that fits the input distribution p_X. This effectively constructs an input-output training pair (G(B(y)), y), allowing the SR network to be learned in a supervised manner such that S(G(B(y))) ≈ y. The main advantage of our approach is that the SR network can be trained with direct pixel-wise supervision in the HR domain. The proposed framework is depicted in Figure 3.
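The pair-construction idea above can be sketched as follows. Everything here is a toy stand-in: the block-average downsampler stands in for the bicubic operator B, and a noise-adding lambda stands in for the learned domain distribution network G.

```python
import numpy as np

def bicubic_down(y, scale=4):
    """Stand-in for the bicubic downsampling B: a simple block average."""
    h, w, c = y.shape
    return y.reshape(h // scale, scale, w // scale, scale, c).mean(axis=(1, 3))

def make_training_pair(y, G, scale=4):
    """Construct (G(B(y)), y): an LR image pushed toward the real-world
    input distribution, paired with the clean HR target y."""
    x_hat = G(bicubic_down(y, scale))
    return x_hat, y

# Toy domain-distribution network: adds sensor-like noise (stand-in for G).
rng = np.random.default_rng(0)
G = lambda z: np.clip(z + rng.normal(0, 0.05, z.shape), 0.0, 1.0)

y = rng.random((64, 64, 3))          # a "real-world" HR sample in [0, 1]
x_hat, target = make_training_pair(y, G)
# The SR network S is then trained with direct HR supervision: S(x_hat) ≈ target.
```

In the actual method, G is of course a trained generator rather than a fixed noise model; the point is only that each unpaired HR sample y yields a supervised pair.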
We first train the generator G, called the domain distribution network, in a conditional GAN setting. This is performed by employing a discriminator network tasked with differentiating the generated images G(B(y)) from true input images x ~ p_X. Since no paired output is available, we enforce a cycle consistency loss by employing a second generator U, mapping input images back to the bicubic domain. Crucially, we train the domain distribution network G independently from the SR network S. While this may seem counter-intuitive at first, it is clearly motivated by the fact that the networks G and S have fundamentally conflicting objectives. The aim of G is to map a bicubically downsampled image from the output distribution to an image following the input distribution p_X, such that a faithful training sample is generated for the SR network. The network S simply aims to super-resolve any image from p_X. If both networks were trained jointly using the cycle-consistency loss for S, the networks G and S would collaborate in order to minimize this loss. This leads to severe overfitting and poor generalization. As illustrated in Figure 3, we train the SR network in a second, separate training stage, using the training pairs generated by the network G.
3.3 Domain Distribution Learning
The task of domain distribution learning is to map a bicubically downsampled image B(y), obtained from an image y of the output distribution, to the input distribution p_X. Since we do not have access to paired samples, we need to venture into unsupervised learning territory. We first employ a GAN discriminator D_X, tasked with differentiating between the generated images G(B(y)) and images drawn from the input distribution p_X. For this, we employ the original GAN formulation,

L_GAN(G, D_X) = E_{x~p_X}[log D_X(x)] + E_{y~p_Y}[log(1 − D_X(G(B(y))))].
To preserve the image content despite the lack of paired images, we employ cycle consistency losses. A second generator U is tasked with mapping images from the input domain back to the domain of bicubically downsampled images, such that U(G(B(y))) ≈ B(y). We then add the cycle consistency losses,

L_cyc = E_{y~p_Y}[ ||U(G(B(y))) − B(y)||_1 ] + E_{x~p_X}[ ||G(U(x)) − x||_1 ].
They constrain the generators G and U to be each other's approximate inverses. Hence, an image is preserved if mapped through G and then back to the original domain by U. Analogous to the GAN loss above, we add a discriminator D_B and a similar loss on the bicubic side. The full objective is thus,

L = L_GAN(G, D_X) + L_GAN(U, D_B) + λ_cyc L_cyc.
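The two building blocks of this objective, the GAN term and the L1 cycle term, can be exercised numerically with scalar stand-ins. The generators `G` and `U` below are hypothetical placeholders chosen to be exact inverses, not trained networks.

```python
import numpy as np

def gan_loss(d_real, d_fake):
    """Original GAN objective for the discriminator:
    E[log D(x)] + E[log(1 - D(G(z)))], with D outputs already in (0, 1)."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def cycle_loss(a, a_reconstructed):
    """L1 cycle-consistency term, e.g. ||U(G(b)) - b||_1 averaged over pixels."""
    return np.mean(np.abs(a - a_reconstructed))

# Toy example: generators that are near-inverses give a vanishing cycle loss.
b = np.random.default_rng(1).random((8, 8, 3))   # bicubic-domain image
G = lambda z: z + 0.01                            # bicubic -> input domain
U = lambda z: z - 0.01                            # input -> bicubic domain
total = cycle_loss(b, U(G(b)))
```

Because the toy `U` exactly undoes `G`, the cycle loss is zero up to floating-point error; trained generators would only approximate this.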
The full architecture is shown in Figure 3 (blue).
Network architectures For our experiments, we design the domain distribution mapping based on the CycleGAN architecture. The generators G and U use a ResNet architecture with nine blocks. We replace the transposed convolution layers with bilinear upsampling followed by a standard convolution. We found this to be beneficial for learning stability, and it effectively removes checkerboard artifacts. Furthermore, we found a non-linearity at the output to be harmful for color consistency, and therefore use no non-linear activation at the output. The discriminators D_X and D_B consist of a three-layer network architecture that operates on a patch level [25, 20].
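The transposed-convolution replacement can be sketched in PyTorch as below. Layer widths and kernel sizes here are illustrative, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """Bilinear upsampling followed by a standard convolution, used instead
    of a transposed convolution to avoid checkerboard artifacts."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(self.up(x)))

# No non-linearity at the generator output (helps color consistency):
head = nn.Conv2d(64, 3, kernel_size=7, padding=3)

x = torch.randn(1, 64, 32, 32)
y = head(UpsampleBlock(64, 64)(x))   # doubled spatial resolution, 3 channels
```

The design choice is the standard remedy for checkerboard patterns: the fixed bilinear kernel distributes gradients evenly, and the following convolution restores learnable capacity.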
We adopt the training procedure proposed in CycleGAN, training for 200 epochs with the Adam optimizer; the starting learning rate and remaining hyperparameters follow CycleGAN.
3.4 Super-Resolution Learning
Here we describe the learning of the SR network S. In the absence of paired ground-truth data, we train the network with pairs (G(B(y)), y), where the input image is generated by our domain distribution network G. We employ the pixel-wise content loss,

L_pix = || S(G(B(y))) − y ||_1.
Following the success of SRGAN, we also employ the VGG feature loss, which is known to correlate better with perceptual quality,

L_VGG = || φ(S(G(B(y)))) − φ(y) ||_2².

Here, φ denotes the feature activations extracted from the VGG network. We extract the features at the same depth as SRGAN, namely after the activation of the 4th convolutional layer of the 5th block, before the 5th max-pooling layer.
For better perceptual quality, we further employ a GAN discriminator. To this end, we adopt the relativistic discriminator employed in ESRGAN. As opposed to the conventional discriminator, which provides an absolute real/fake probability for each image, it estimates a relative real/fake score by comparing against a set of real or fake images,

D_Ra(a, b) = σ( C(a) − E_b[C(b)] ).
Here, C(·) denotes the raw discriminator output and σ is the sigmoid function. The SR network is trained with the added adversarial loss,

L_GAN = − E_y[ log(1 − D_Ra(y, ŷ)) ] − E_ŷ[ log D_Ra(ŷ, y) ],

where ŷ = S(G(B(y))) denotes the super-resolved image.
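The relativistic average formulation can be checked numerically with stand-in logits; the values for `c_real` and `c_fake` below are arbitrary illustrations, not actual discriminator outputs.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def d_ra(c_a, c_b):
    """Relativistic average score: probability that a is 'more real' than
    the average of b, given raw discriminator outputs C(.)."""
    return sigmoid(c_a - np.mean(c_b))

# Raw outputs C(.) for a batch of real and generated (fake) images.
c_real = np.array([2.0, 1.5, 1.8])
c_fake = np.array([-1.0, -0.5, -1.2])

# Discriminator objective in binary cross-entropy form: real images should
# score high relative to the fakes, fakes low relative to the reals.
loss_d = -np.mean(np.log(d_ra(c_real, c_fake))) \
         - np.mean(np.log(1.0 - d_ra(c_fake, c_real)))
```

Note that, unlike an absolute discriminator, each score depends on the whole opposite batch through the mean term.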
This results in the total loss,

L_S = L_pix + L_VGG + λ L_GAN.

The GAN loss is multiplied by a weight λ, balancing the guidance of the two content losses L_pix and L_VGG against the GAN loss L_GAN.
Network architecture Our approach is agnostic to the specific architecture of the SR network S. For simplicity, we adopt the recently proposed ESRGAN architecture, winner of the PIRM 2018 challenge. It introduced a new building block, the Residual-in-Residual Dense Block, improving training stability. We augment the ESRGAN network with a final color adjustment layer, to ensure a faithful reproduction of the color palette of the input LR image. This layer adjusts the local mean RGB value to that of the low-resolution image.
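One way to realize such a color adjustment, sketched here with a blockwise box filter as the local-mean estimator (the exact filter used in the layer is an assumption of this sketch):

```python
import numpy as np

def local_mean(img, k=8):
    """Blockwise local mean per channel (box-filter stand-in); k must divide
    both spatial dimensions."""
    h, w, c = img.shape
    blocks = img.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3), keepdims=True)
    return np.broadcast_to(blocks, (h // k, k, w // k, k, c)).reshape(h, w, c)

def color_adjust(sr, lr, scale=4, k=8):
    """Shift the SR output so its local mean RGB matches the (upsampled) LR input."""
    lr_up = np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)  # nearest upsample
    return sr + local_mean(lr_up, k) - local_mean(sr, k)

rng = np.random.default_rng(2)
lr = rng.random((16, 16, 3))
sr = rng.random((64, 64, 3)) + 0.3        # SR output with a global color cast
adjusted = color_adjust(sr, lr)
```

After the shift, each local block of the output has exactly the mean RGB value of the corresponding LR region, while the high-frequency content of the SR prediction is left untouched.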
Training details To train our SR network, we start from pre-trained ESRGAN generator and discriminator networks. We then perform 50,000 training iterations, using the Adam optimizer with the same initial learning rate and momentum parameters for both the generator and the discriminator. We use a step learning rate schedule, decreasing the rate by a factor of 0.5 after 10%, 20%, 40% and 60% of the total number of iterations.
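The stated schedule can be written as a small helper. The base learning rate below is an arbitrary placeholder, since the actual value is not given above.

```python
def learning_rate(iteration, total_iters=50_000, base_lr=1e-4,
                  milestones=(0.10, 0.20, 0.40, 0.60), factor=0.5):
    """Halve the learning rate after 10%, 20%, 40% and 60% of training."""
    passed = sum(iteration >= int(m * total_iters) for m in milestones)
    return base_lr * factor ** passed

# The rate is halved at iterations 5000, 10000, 20000 and 30000:
schedule = [learning_rate(i) for i in (0, 5000, 10000, 20000, 30000, 49999)]
```

With a framework optimizer, the same schedule corresponds to a multi-step decay with those four milestone iterations and a decay factor of 0.5.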
4 Experiments
In this section, we present a comprehensive quantitative and qualitative evaluation of our approach. We first discuss the setup and datasets employed in our experiments. Detailed results are provided in the supplementary material.
4.1 Experimental Setup
We present a novel strategy for evaluating real-world SR methods. In traditional SR, the bicubically downsampled image is super-resolved and compared to the original. In the real-world scenario, we do not have access to a ground-truth image, complicating quantitative analysis. On the other hand, we can closely simulate the real-world scenario by constructing the sets of input and output training images through downscaling and synthetic degradations applied to a dataset of original images. The type of degradation is unknown to the SR approach. For evaluation, we further generate a set of ground-truth image pairs. These are inaccessible to the network during training, and only used for evaluation purposes. We consider two scenarios, DSR and CSR, detailed below.
Domain-Specific Super-Resolution (DSR) The input and target images share the same real-world distribution. Thus, the aim is to produce super-resolved images that follow the same distribution as the input images. The training set is generated by first downsampling the image and then simulating the real-world degradation. In the DSR case, the same training set represents both the input and output distributions. For evaluation, the input image is constructed using the same procedure as for the training images. The corresponding ground-truth output image is obtained by directly adding the degradation to the original HR image. The procedure is visualized in Figure 4.
In Clean Super-Resolution (CSR), the goal is to super-resolve an input image such that the output fits a different distribution, defined by a dataset of high-quality images. We therefore employ the unaltered original image from the dataset as the ground-truth output. The corresponding LR image used for evaluation is generated as in the DSR case above. We also employ the same training set of input images as for DSR. In the CSR case, however, the output training data represents a different distribution of clean images. These are generated by bicubically downsampling the original images. The resulting images thus represent the clean, ideal output of the SR network. See Figure 4 for a schematic description of the procedure.
To model the real-world setting, we evaluate unsupervised SR approaches using two types of image degradations: JPEG compression artifacts and simulated sensor noise. For JPEG artifacts, we use a quality setting of 30. JPEG compression artifacts are common when applying super-resolution to images captured by smartphones or acquired from the internet. In the second case, we employ additive white Gaussian noise of fixed standard deviation. This simulates real-world sensor noise as present in, e.g., low-light conditions or with small sensor sizes.
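The two degradations can be simulated in a few lines. Quality 30 matches the setting above; the noise standard deviation here is a placeholder, since the exact value is not stated.

```python
import io
import numpy as np
from PIL import Image

def add_jpeg_artifacts(img, quality=30):
    """Round-trip an RGB uint8 array through JPEG compression."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf).convert("RGB"))

def add_sensor_noise(img, sigma=8.0, seed=0):
    """Additive white Gaussian noise (sigma in 8-bit intensity units)."""
    noise = np.random.default_rng(seed).normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)

clean = (np.random.default_rng(3).random((64, 64, 3)) * 255).astype(np.uint8)
jpeg_img = add_jpeg_artifacts(clean)
noisy_img = add_sensor_noise(clean)
```

In the benchmark these operations are applied when constructing the training and evaluation sets, and remain unknown to the compared SR methods.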
Quantitative Evaluation Measures To quantitatively compare the different approaches, we use the PSNR, SSIM, and LPIPS metrics. While PSNR and SSIM are handcrafted measures, LPIPS is a learned metric for perceptual similarity between two images. In Figure 5, we provide a comparison of PSNR and LPIPS, measured using the model provided by the authors.
Datasets We use the DF2K dataset that was introduced for training the ESRGAN. It is a merge of the DIV2K dataset, with 800 images, and the Flickr2K dataset, with 2650 images. The mean image size in DF2K is 1439×1935 pixels. We also perform experiments on the DPED dataset, acquired with a smartphone camera. It contains natural images with real-world sensor noise and other effects.
| Method | Sensor Noise (PSNR / SSIM / LPIPS) | JPEG Artifacts (PSNR / SSIM / LPIPS) |
|---|---|---|
| DSR: Cleaning the input | 19.81 / 0.32 / 0.5737 | 22.23 / 0.56 / 0.4295 |
| DSR: Low res. supervision | 21.46 / 0.36 / 0.4363 | 20.00 / 0.49 / 0.4483 |
| DSR: Supervised (ref.) | 23.97 / 0.48 / 0.1778 | 22.97 / 0.59 / 0.3526 |
| CSR: Cleaning the input | 22.10 / 0.55 / 0.4516 | 20.28 / 0.46 / 0.4889 |
| CSR: Low res. supervision | 22.03 / 0.55 / 0.4401 | 20.16 / 0.49 / 0.4752 |
| CSR: Supervised (ref.) | 25.54 / 0.70 / 0.2103 | 22.60 / 0.58 / 0.3484 |
4.2 Ablation Study
For our ablation study, we use the DF2K dataset as training data and the DIV2K validation images to measure the performance of the different methods. The quantitative comparison is performed using the PSNR, SSIM, and LPIPS measures. For comparisons, we mainly consider the LPIPS distance, due to its higher correlation with perceptual similarity.
We evaluate four different approaches for the DSR and CSR settings. All methods are trained using the same settings. For the SR network, we employ a pretrained ESRGAN model that is then fine-tuned for each method, as described below. Quantitative and visual results are shown in Table 1 and Figure 6, respectively.
Baseline First, we compare with the standard approach of training the network on LR images generated by bicubic downsampling. For this purpose, we finetune the ESRGAN using bicubic image pairs. In the case of sensor noise, the baseline achieves significantly inferior performance for both DSR and CSR (Table 1) compared to ours. This is due to the smoothing behaviour of bicubic downsampling: the baseline ESRGAN thus does not see appropriate levels of noise during training. This leads to severe artifacts in both the DSR and CSR case (Figure 6). Our approach also improves in the case of JPEG artifacts, leading to a clear improvement in LPIPS in the CSR case.
Figure 6: Visual ablation results for DSR and CSR, with JPEG and sensor noise degradations. From left to right: Baseline, Cleaning the input, Low resolution supervision, Ours, GT.
Cleaning the input Another strategy for tackling the shift between training and test distributions, caused by the bicubic downsampling, is to map the input image to the bicubic distribution before applying the SR network, as previously proposed. With this strategy, the SR network is trained using bicubic data, exactly as in the baseline setting discussed previously. During inference, the input image is first mapped to the bicubic domain and then super-resolved. In fact, our approach already trains such a mapping to ensure cycle consistency when learning the domain distribution network (Section 3.3). Since our training is fully symmetric between the two domains, including the generator architectures, we reuse the corresponding generator trained in our framework for a fair comparison.
In the case of sensor noise, this version improves results compared to the baseline method, suggesting that some of the domain-shift problem is alleviated. However, our approach further improves the LPIPS distance by 50% in the DSR and 19% in the CSR case. This is partly due to the fact that our SR network acts directly on the input image, while the cleaning step can introduce artifacts or remove information. Our approach also achieves a significant improvement in the JPEG case.
Low resolution supervision Here, we compare our approach with performing supervision in the LR domain. Similar to prior work, we add another generator network that maps the super-resolved image back to the original LR domain, and then perform the direct pixel-wise supervision in the LR domain instead, using the same pixel-wise and perceptual losses. We observe that this approach leads to stronger GAN hallucinations, as shown in Figure 6. This tendency is also visible in the quantitative results, with significantly worse LPIPS and PSNR in all cases. This demonstrates the importance of the direct HR supervision provided by our method.
Fully supervised To assess the performance of our approach, we compare with fully supervised training using paired samples, otherwise unavailable to the networks. We generate paired data using the same strategy employed for evaluation, by applying the ground-truth degradation. The ESRGAN is then finetuned on this data directly. Note that the ground-truth degradation operation is unknown to all other methods in this comparison. We observe that our approach achieves performance much closer to this upper bound for both DSR and CSR, in particular in the case of JPEG artifacts, where our unsupervised method is only slightly worse than full supervision.
4.3 State-of-the-art Comparison on DIV2K
| Method | Sensor Noise (PSNR / SSIM / LPIPS) | JPEG Artifacts (PSNR / SSIM / LPIPS) |
|---|---|---|
| ESRGAN FT IN | 18.35 / 0.23 / 0.7595 | 23.01 / 0.60 / 0.4903 |
| ESRGAN FT OUT | 17.35 / 0.19 / 0.9040 | 22.82 / 0.59 / 0.5087 |
In the following, we compare our approach with other state-of-the-art methods: ZSSR, EDSR, and ESRGAN, using the original code and trained models. ZSSR applies a zero-shot learning strategy, where weights are learned for each image individually. EDSR trains a ResNet-based model without perceptual losses. ESRGAN applies the same SR architecture and perceptual losses as our approach. Both EDSR and ESRGAN are trained using bicubic supervision.
We also report the results of finetuning the ESRGAN on the same training data employed by our approach. In the case of CSR, where two distinct training sets are available, we further compare with finetuning the ESRGAN on each of them. The method "ESRGAN FT IN" is trained on the set of input images, while "ESRGAN FT OUT" is trained on the set of output images. In all cases, we follow the original ESRGAN training procedure, constructing the corresponding LR images using bicubic downsampling.
Figure 7: Qualitative comparison on DIV2K. Panels: ZSSR, EDSR, ESRGAN, ESRGAN FT IN, ESRGAN FT OUT, Ours (DSR), Ours (CSR), GT.
Sensor Noise In the case of sensor noise, the ESRGAN provides poor perceptual quality in terms of LPIPS, for both DSR and CSR. Finetuning the model on the given data only provides minor improvements. This is due to the domain distribution shift caused by bicubic downsampling, which leads to clean input images during training. EDSR exhibits better robustness to noisy input for both DSR and CSR. Among previous approaches, ZSSR achieves the best performance, owing to its zero-shot learning strategy. Our unsupervised approach achieves the best overall perceptual quality, significantly reducing the LPIPS error for both DSR and CSR in the sensor noise setting.
As shown in Figure 7, the ESRGAN approaches produce strong artifacts in the case of noisy input. This is likely due to the perceptual loss, which encourages the network to output high-frequency components in order to produce a sharp image. In contrast, our method does not suffer from such artifacts, despite employing an equally strong perceptual loss. This demonstrates that our SR network has learned significant robustness to image noise through our unsupervised training strategy.
JPEG Compression In the DSR case, where SR is performed within the same domain, our approach achieves a significantly lower LPIPS distance than the state-of-the-art. For CSR, where the task is to additionally clean the input from artifacts, our approach has a strong advantage, reducing the LPIPS by a large margin. However, as shown in Figure 7, the difference in perceptual quality is not fully captured by the quantitative results. All compared approaches produce highly visible block artifacts, stemming from the JPEG compression of the input image. In contrast, our approach provides visually pleasing output without such artifacts.
Figure: Qualitative comparison on DPED. Panels: Input, Ours (DSR), Ours (CSR), ZSSR, EDSR, ESRGAN, ESRGAN FT.
4.4 Real-World Evaluation on the DPED Dataset
In this section, we apply our method to the original images of the DPED iPhone3 dataset. It contains natural images with real-world degradations due to poor sensor and lens quality. We train our model using the training split to represent both the input and output distributions. We also finetune the ESRGAN model on the same data, as described in the previous experiment. Since ground-truth images are not available in this real-world setting, we show a diverse set of qualitative examples from the DPED iPhone3 validation set in Figure S9. The artifacts produced by ZSSR, EDSR, ESRGAN, and the finetuned ESRGAN are of a similar nature as in the sensor noise case of the DIV2K setting (Figure 7). Note that these limitations of previous approaches cannot be alleviated by more training data or architectural design. Instead, these issues originate from an oversimplified problem formulation that does not reflect most real-world applications. Our approach overcomes the limitations of previous methods by learning the input image distribution, generating high-quality images with very few artifacts.
5 Conclusion
We tackle the problem of real-world super-resolution, where no paired data is available. To avoid the artifacts caused by bicubic downsampling, we learn a network that maps the low-resolution image to the real-world image distribution. This allows us to generate realistic training pairs for our super-resolution model. Furthermore, we propose a benchmark, based on the DIV2K dataset, for quantitatively evaluating real-world super-resolution approaches. Experiments are performed on our real-world benchmark and the DPED dataset. Compared to previous methods, our approach generalizes to natural images affected by significant sensor noise, compression artifacts, and other effects.
Acknowledgements This work was partly supported by the ETH General Fund, Amazon through an AWS grant, Nvidia through a GPU grant, and Huawei.
-  (2017) NTIRE 2017 challenge on single image super-resolution: dataset and study. In CVPR Workshops, Cited by: §S1, §S4.
-  (2018) Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, Cited by: §2.
-  (2018) Image super-resolution via progressive cascading residual network. In CVPR, Cited by: §2.
-  (2004) Blind super-resolution using a learning-based approach. In ICPR, Cited by: §2.
-  (2018) To learn image super-resolution, use a gan to learn how to do image degradation first. arXiv preprint arXiv:1807.11458. Cited by: §2.
-  (2019-06) NTIRE 2019 challenge on real image super-resolution: methods and results. In CVPR Workshops, Cited by: §2.
-  (2019) Camera lens super-resolution. In CVPR, Cited by: §2.
-  (2017) Cyclegan, a master of steganography. arXiv preprint arXiv:1712.02950. Cited by: §3.3, §3.3.
-  (2014) Learning a deep convolutional network for image super-resolution. In ECCV, Cited by: §1, §2.
-  (2016) Image super-resolution using deep convolutional networks. TPAMI 38 (2), pp. 295–307. Cited by: §2.
-  (2017) Balanced two-stage residual networks for image super-resolution. In CVPR, Cited by: §2.
-  (2002) Example-based super-resolution. IEEE Computer graphics and Applications. Cited by: §2.
-  (2019) Blind super-resolution with iterative kernel correction. In CVPR, Cited by: §2.
-  (2018) Deep back-projection networks for super-resolution. In CVPR, Cited by: §2.
-  (2015) Single image super-resolution from transformed self-exemplars. In CVPR, Cited by: §2.
-  (2018) Densely connected high order residual network for single frame image super resolution. arXiv preprint arXiv:1804.05902. Cited by: §2.
-  (2017) DSLR-quality photos on mobile devices with deep convolutional networks. In ICCV, Cited by: §1, §4.1, §4.4.
-  (2018) Pirm challenge on perceptual image enhancement on smartphones: report. arXiv preprint arXiv:1810.01641. Cited by: §2, §3.4.
-  (1991) Improving resolution by image registration. CVGIP. Cited by: §2.
-  (2017) Image-to-image translation with conditional adversarial networks. Cited by: §3.3.
-  (2018) Task-aware image downscaling. ECCV. Cited by: §2.
-  (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, Cited by: §2.
-  (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, Cited by: §2.
-  (2017) Photo-realistic single image super-resolution using a generative adversarial network. CVPR. Cited by: §2, §3.4.
-  (2016) Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV, Cited by: §3.3.
-  (2017) Enhanced deep residual networks for single image super-resolution. CVPR. Cited by: §2, §S2, §3.4, §4.3.
-  (2019) AIM 2019 challenge on real-world image super-resolution: methods and results. ICCV Workshops. Cited by: §S1.
-  (2013) Nonparametric blind super-resolution. In ICCV, Cited by: §2.
-  (2003) Super-resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine. Cited by: §2.
-  (2018) “Zero-shot” super-resolution using deep internal learning. In CVPR, Cited by: §2, §S2, §4.3.
-  (2017) Memnet: a persistent memory network for image restoration. In ICCV, Cited by: §1.
-  (2017) NTIRE 2017 challenge on single image super-resolution: methods and results. In CVPR Workshops, Cited by: §S1, §4.1.
-  (2018) NTIRE 2018 challenge on single image super-resolution: methods and results. In CVPR Workshops, Cited by: §S2.
-  (2017) Image super-resolution using dense skip connections. In ICCV, Cited by: §1.
-  (2018) ESRGAN: enhanced super-resolution generative adversarial networks. ECCV. Cited by: Figure 1, §1, §2, §S2, §3.4, §3.4, §4.1, §4.3, §4.3, §S4.
-  (2010) Image super-resolution via sparse representation. TIP. Cited by: §2.
-  (2018) Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. CVPR Workshops. Cited by: §2, §4.2, §4.2.
-  (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: §2, §S2, §4.1.
-  (2019) Zoom to learn, learn to zoom. In CVPR, Cited by: §2.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV. Cited by: §1.
S1 AIM 2019 Real-World Super Resolution Challenge
In addition to the experiment settings presented in Section 4.1, we introduce the real-world SR benchmark employed in the recent AIM 2019 Real-World Super Resolution Challenge. We use the same overall procedure as described in Section 4.1, but employ a more complex degradation mapping, composed of several different operations. The input domain train images are obtained by directly applying the degradation to the high-resolution Flickr2K images, while we use the clean DIV2K train split for the target domain set. For validation and testing we use the corresponding DIV2K splits. The results of our approach compared to state-of-the-art methods are shown in Tables S1 and S2. Note that Track 1 and 2 correspond to the DSR and CSR settings, respectively. Our approach achieves a superior LPIPS score compared to previous approaches.
Table S1:

| Method | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| ESRGAN FT IN | 21.40 | 0.52 | 0.5191 |
| ESRGAN FT OUT | 21.66 | 0.55 | 0.5282 |

Table S2:

| Method | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| ESRGAN FT IN | 21.36 | 0.54 | 0.5275 |
| ESRGAN FT OUT | 21.94 | 0.59 | 0.5083 |
Table S3 compares the following methods: Bicubic, ZSSR, EDSR, ESRGAN (Baseline), Cleaning the Input, Low resolution supervision, and Ours.
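The exact composition of the AIM degradation mapping is not reproduced here. As a hypothetical illustration only, such a mapping can be built by chaining simple image operations; the blur and noise below are assumed stand-ins, not the challenge's actual pipeline. Note that, as described above, the degradation is applied directly to the high-resolution images to form the input domain:

```python
import numpy as np

def box_blur(img, k=3):
    """Simple box blur (an assumed stand-in for an optical blur operation)."""
    pad = k // 2
    x = img.astype(np.float64)
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for dy in range(k):          # accumulate the k*k shifted copies
        for dx in range(k):
            out += padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

def sensor_noise(img, sigma=5.0, seed=0):
    """Additive Gaussian noise (an assumed stand-in for sensor noise)."""
    rng = np.random.default_rng(seed)
    return img + rng.normal(0.0, sigma, img.shape)

def degrade(hr):
    """Compose the operations; applied directly to the HR images."""
    out = sensor_noise(box_blur(hr))
    return np.clip(out, 0, 255).astype(np.uint8)
```

Any number of further operations can be appended to `degrade` in the same fashion; the point is only that the input-domain images are produced by a fixed, composed mapping.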
S2 NTIRE 2018 Evaluation
We perform an additional experiment on the NTIRE 2018 challenge. To evaluate the approaches in the unsupervised scenario, we use the Track 2: Realistic Mild adverse conditions setting. Here, an unknown degradation operation has been applied to the low resolution images to simulate realistic sensor noise and motion blur. The goal is to produce clean output images, similar to our CSR scenario. Although LR and HR training pairs are provided, we do not utilize the paired data. As in the main paper, we train our approach in the fully unsupervised real-world setting, using only unpaired LR and HR images. Compared to the official challenge, our approach is thus trained under considerably harder, but more realistic, conditions.
The results are provided in Table S3. We compare our approach to simple bicubic upscaling, ZSSR, EDSR, and the approaches considered in our ablative study: the baseline ESRGAN, Cleaning the Input, and Low resolution supervision. See Section 4.2 in the main paper and Section S3 here for more details about the latter baseline versions. Results are reported in terms of LPIPS, PSNR and SSIM. Since our goal is perceptual super-resolution, we focus on the LPIPS metric. A comparison between these metrics is shown in Figure 5. Although naïve bicubic upscaling achieves the best PSNR in Table S3, our approach clearly achieves better perceptual results, as indicated by the LPIPS metric. Among the compared previous SR methods, the baseline ESRGAN achieves the best LPIPS. Employing the generator to clean the input image during inference further improves the LPIPS metric, and supervision in the low resolution domain is evaluated as another alternative. Our approach, based on domain distribution learning and thereby allowing direct supervision in the high resolution domain, achieves the best LPIPS overall. This additional experiment demonstrates the same trend as the results shown in the main paper.
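Of the three reported metrics, PSNR can be computed directly from the mean squared error, while SSIM and LPIPS require more machinery (LPIPS in particular relies on a pretrained deep network, available e.g. through the `lpips` Python package). A minimal PSNR sketch for 8-bit images, assuming a peak value of 255:

```python
import numpy as np

def psnr(ref, est, peak=255.0):
    """Peak signal-to-noise ratio in dB between two 8-bit images."""
    mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

A higher PSNR does not imply better perceptual quality (bicubic upscaling wins on PSNR in Table S3), which is why the comparison focuses on LPIPS.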
S3 Details of the Baseline Approaches
Here we provide additional details of the ablative methods evaluated in Section 4.2 in the paper. Figure S5 illustrates the three baseline versions, along with our approach. In the baseline ESRGAN version (Figure S5a), we follow the same protocol as ESRGAN to generate the low resolution training samples and finetune the pretrained model. The super-resolution network is thus trained to reconstruct bicubically downsampled images. As shown, this produces strong artifacts due to the mismatch between the input domains during the training and testing phases. A first approach to match the input distribution of the super-resolution network is to clean the low resolution image during test time (Figure S5b). The domain distribution mapping is learned using a cycle consistency loss to map input images to the domain of bicubically downsampled images. Another way to obtain matching domains during train and test time is to perform supervision in low resolution, since one can then directly use the natural image as low resolution input for training (Figure S5c).
In our approach (Figure S5d), we ensure that the inputs during train and test match by applying the learned mapping during training to invert the effects of bicubic downsampling on the image distribution. This mapping is learned with our unsupervised domain distribution learning (Section 3.3). Compared to the version in Figure S5b, our approach requires no extra network for processing the input image at test time before the super-resolution is applied. Instead, our approach learns to super-resolve natural images directly. Moreover, in contrast to Figure S5c, we perform pixel-wise supervision in high resolution against the original natural image, which helps the super-resolution network to produce crisp, photo-realistic images.
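The variants differ only in where the domain mapping is applied and where the supervision takes place. As a rough illustration, the pipelines can be sketched with placeholder functions; every name below is a hypothetical stand-in (identity mappings and naive resampling), not the actual networks:

```python
import numpy as np

# Illustrative stand-ins for the learned networks and the degradation.
def bicubic_down(hr):                # stride-4 stand-in for bicubic downsampling
    return hr[::4, ::4]
def sr_net(lr):                      # x4 super-resolution network (nearest-neighbor stand-in)
    return np.repeat(np.repeat(lr, 4, axis=0), 4, axis=1)
def clean_net(x):                    # natural -> bicubic-domain mapping (identity stand-in)
    return x
def degrade_net(x):                  # bicubic-domain -> natural mapping (identity stand-in)
    return x

def baseline_esrgan(natural_lr):
    # (a) trained on bicubic LR only; a natural input mismatches the training domain
    return sr_net(natural_lr)

def clean_then_sr(natural_lr):
    # (b) an extra cleaning network is applied at test time before SR
    return sr_net(clean_net(natural_lr))

def ours_training_pair(hr):
    # (d) during training, invert the effect of bicubic downsampling so that
    # train and test inputs match; supervision stays pixel-wise in HR against hr
    lr_input = degrade_net(bicubic_down(hr))
    return lr_input, hr
```

Variant (c), low resolution supervision, instead feeds the natural image itself as the LR input and supervises in the LR domain, which is why it cannot use the HR image as a pixel-wise target.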
S4 Real-world SR Benchmarking Details
Here, we provide additional details of the proposed benchmarking procedure and dataset for real-world super-resolution methods. In classical super-resolution, the low resolution images for training and evaluation are constructed by applying bicubic downsampling to a dataset of natural images. Such methods are therefore only evaluated on how well they super-resolve images that were downsampled in the same way as during training. In the real-world setting, however, the aim is to super-resolve natural images to a higher resolution that is unseen during training. Thus, the desired high resolution images are not available for training the super-resolution network. To reflect this crucial aspect, our proposed benchmark applies different procedures for constructing training and test data. Importantly, the training data only contains low resolution images, while the original high resolution images are only used when evaluating the methods. The final desired image resolution is thus unseen during training, as in real-world applications.
We construct the train and test images for real-world super-resolution based on a dataset of original images. An illustration of the procedure is shown in Figure 4 (also shown in the main paper). For evaluation, we construct an input-output image pair. The input image is obtained by first downsampling the original image and then applying the degradation transformation to simulate the real-world case. Since the degradation is heavily affected by image downsampling, it is always applied after the downsampling procedure. In the domain-specific super-resolution (DSR) case, where the goal is to super-resolve the image while preserving the original characteristics, we construct the ground-truth output image by applying the same degradation operation directly to the original image. In the clean super-resolution (CSR) case, we employ the original image as ground-truth. The training set is constructed by first downsampling the original image and then applying the degradation operation. In the CSR case, a set of clean images is also available during training (see Section 4.1 in the paper). These are constructed by only applying bicubic downsampling to the original images, with no degradations. Note that no images of the original resolution, used for evaluation, are available for training.
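The construction above can be summarized programmatically. The sketch below is a simplified illustration: it uses average pooling as a stand-in for bicubic downsampling and additive Gaussian noise as a stand-in for the degradations of Section 4.1, so the concrete operations are assumptions:

```python
import numpy as np

def downsample(img, factor=4):
    """Average-pool stand-in for bicubic downsampling (grayscale image)."""
    h = img.shape[0] // factor * factor
    w = img.shape[1] // factor * factor
    x = img[:h, :w].astype(np.float64)
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def degrade(img, sigma=5.0, seed=0):
    """Placeholder degradation (sensor-style Gaussian noise)."""
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0, 255)

def make_eval_pair(original, mode="DSR"):
    # The degradation is always applied AFTER downsampling on the input side.
    inp = degrade(downsample(original))
    if mode == "DSR":                      # preserve the source characteristics
        gt = degrade(original)
    else:                                  # "CSR": clean ground truth
        gt = original.astype(np.float64)
    return inp, gt

def make_train_lr(original):
    # Training data: low resolution only; the original resolution is never seen.
    return degrade(downsample(original))
```

In the CSR case, the additional clean training images would be produced by `downsample` alone, without the `degrade` step.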
Table S4 columns: Dataset, Num. images, Mean image size, Set.
In this work, we employ the DF2K dataset, consisting of the DIV2K and Flickr2K datasets. The dataset specifications are summarized in Table S4. For testing, the 100 validation images in DIV2K are used. The training sets are constructed using the 3450 training images in DF2K. The images for training and evaluation have no overlap. We always employ downsampling, and evaluate the methods for sensor noise and JPEG degradation, as described in the paper (Section 4.1).
S5 Domain Mapping Examples
As described in Section 3.3 of the main paper, we learn a generator that transfers the downsampled images to the natural domain. In Figure S6 and Figure S7 we show how the downsampled images are transferred to images with sensor noise and JPEG artifacts, respectively. In the sensor noise case, the output exhibits Gaussian-noise-like image characteristics. In the JPEG case, the generator has learned to simulate blocky compression artifacts, especially at sharp edges.
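For intuition, the blocking artifacts that the generator learns to imitate can also be produced by hand: coarsely quantizing the coefficients of an 8×8 block DCT, as JPEG does, yields similar blockiness at sharp edges. The sketch below is a hand-crafted illustration with an assumed uniform quantization step, not the learned generator or a real JPEG codec:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix, as used for JPEG's 8x8 block transform."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def jpeg_like(img, q=40.0, block=8):
    """Coarsely quantize 8x8 DCT blocks to mimic JPEG blocking artifacts."""
    C = dct_matrix(block)
    h = img.shape[0] // block * block
    w = img.shape[1] // block * block
    out = img[:h, :w].astype(np.float64)
    for i in range(0, h, block):
        for j in range(0, w, block):
            coeffs = C @ out[i:i + block, j:j + block] @ C.T
            coeffs = np.round(coeffs / q) * q      # coarse uniform quantization
            out[i:i + block, j:j + block] = C.T @ coeffs @ C
    return np.clip(out, 0, 255)
```

Applied to an image with a sharp edge inside a block, the quantization error concentrates around the edge, producing the characteristic blocky artifacts seen in Figure S7.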
S6 Visual Results
S7 Failure Cases
We developed our method to match the natural input domain at test time also for real-world datasets with significant noise or JPEG artifacts. This requires domain distribution learning, which itself introduces some artifacts, as shown in the first row of Figure S6. When training the network on clean, high-quality data, for which bicubic downsampling does not strongly affect the image distribution, our method therefore produces slightly more artifacts due to the domain transfer learning.