Unsupervised Learning for Real-World Super-Resolution

09/20/2019 · by Andreas Lugmayr et al.

Most current super-resolution methods rely on low and high resolution image pairs to train a network in a fully supervised manner. However, such image pairs are not available in real-world applications. Instead of directly addressing this problem, most works employ the popular bicubic downsampling strategy to artificially generate a corresponding low resolution image. Unfortunately, this strategy introduces significant artifacts, removing natural sensor noise and other real-world characteristics. Super-resolution networks trained on such bicubic images therefore struggle to generalize to natural images. In this work, we propose an unsupervised approach for image super-resolution. Given only unpaired data, we learn to invert the effects of bicubic downsampling in order to restore the natural image characteristics present in the data. This allows us to generate realistic image pairs, faithfully reflecting the distribution of real-world images. Our super-resolution network can therefore be trained with direct pixel-wise supervision in the high resolution domain, while robustly generalizing to real input. We demonstrate the effectiveness of our approach in quantitative and qualitative experiments.


1 Introduction

Super-resolution (SR) aims to enhance the resolution of natural images. Recent years have seen an increased interest in the problem, driven by emerging applications. Most notably, current generations of smartphones allow for the deployment of powerful image enhancement techniques based on machine learning. This calls for super-resolution methods that can be applied to natural images, which are often subject to significant levels of sensor noise, compression artifacts or other corruptions encountered in practice. In this work, we therefore address the problem of super-resolution in the real-world setting.

Real-world SR poses a fundamental challenge that has been largely ignored until very recently. The lack of natural low resolution (LR) and high resolution (HR) image pairs greatly complicates the evaluation and training of SR methods. Therefore, research in the field has long relied on known degradation operators, such as the bicubic kernel, to artificially generate a corresponding LR image [9, 31, 34]. While this straightforward approach enables simple and efficient benchmarking and the generation of virtually unlimited training data, it comes with significant drawbacks. Bicubic downsampling can drastically change the natural characteristics of an image by, e.g., removing sensor noise and compression artifacts.

Figure 1: Super-resolving natural images (left) with ESRGAN [35], trained on bicubic LR images, and with our unsupervised approach. Ground truth data is unavailable in the real-world setting. Our approach learns to handle sensor noise and other artifacts in natural images, while ESRGAN fails to generalize.

State-of-the-art methods trained only to reconstruct images artificially downsampled with a bicubic kernel do not generalize to natural images. As visualized in Figure 1, even small levels of noise cause a network trained only on bicubic images, in this case ESRGAN [35], to output significant artifacts. In fact, this is expected, as deep learning methods are known to be sensitive to differences between the train and test distributions. ESRGAN has not seen noisy input images during training due to the smoothing effect introduced by bicubic downsampling.

In this work, we present a novel way of training a generic method in order to overcome the challenges of real-world SR. We address the shift between training and testing distributions arising from the bicubic downsampling by learning the corresponding inverse mapping. To this end, we train a mapping from bicubic images to the distribution of real-world LR images. By employing cycle consistency losses [40], we learn this mapping in a fully unsupervised manner. The learned network is applied to bicubically downsampled images to generate paired LR and HR images that follow the real-world distribution. This allows us to train the SR network on a realistic dataset, unaffected by the bicubic shift. Furthermore, the SR network is trained with direct pixel-wise supervision in the HR domain, without the need for any paired ground-truth data. Visual results of our approach on natural images are shown in Figure 1.

Due to the unavailability of paired data, we introduce a protocol for benchmarking real-world SR methods, based on simulating natural degradations. We analyze our approach in two scenarios, namely Domain-Specific Super-Resolution (DSR) and Clean Super-Resolution (CSR). In the former case, the real-world data distribution is defined by one set of natural images. However, our approach generalizes to the case where the real-world input and output distributions of the SR network differ. We therefore introduce the CSR task, where the goal is to achieve a clean super-resolved image, defined by a separate output distribution of high-quality images. We demonstrate the effectiveness of our approach on the aforementioned benchmark, and compare it to baseline methods and state-of-the-art approaches. Finally, we show qualitative results for the task of super-resolving real-world smartphone images on the DPED [17] dataset.

2 Related Work

Until very recently, single image super-resolution (SISR) methods were primarily benchmarked in terms of PSNR, for the task of super-resolving bicubically downsampled images. While traditionally addressed with classical techniques [19, 12, 29, 36, 15], current approaches [9, 10, 22, 23, 26, 11, 2, 3, 14, 16] employ deep learning to train a mapping from LR to HR. Among the latter, EDSR [26] notably introduced a ResNet-inspired architecture better adapted for the task at hand. For training the network, however, these methods rely on the L1 or L2 losses. While these losses are closely related to the PSNR evaluation metric, they do not preserve the natural image characteristics, generally leading to blurry results [38]. To address this problem, Ledig et al. [24] introduced an objective function aimed at perceptually more pleasing results. The novel objectives were a GAN discriminator and a loss computed in the VGG feature space. While providing inferior PSNR compared to the state of the art, the super-resolved images exhibited significantly better perceptual quality. Following this philosophy, the recent winner of the PIRM 2018 challenge [18], ESRGAN [35], proposed architectural improvements to further enhance the perceptual quality.

Despite their success, the aforementioned approaches are severely limited by their reliance on the bicubic downsampling operation for training data generation. This operation eliminates most high frequency components and therefore significantly alters the natural image characteristics, such as noise, compression artifacts, and other corruptions. The bicubic assumption therefore rarely reflects the real-world scenario. Blind SR generalizes the problem by assuming LR and HR image pairs with an unknown degradation and downsampling kernel. Early attempts [4] to this problem include explicitly estimating the unknown point spread function itself [28, 13]. Another direction of research aims to completely remove the need for external training data by performing image-specific SR. Following this idea, ZSSR [30] trains a lightweight network using only the test image itself, by performing extensive data augmentation. However, this approach still employs a fixed downsampling operation to generate synthetic pairs at test time. Furthermore, the image-specific learning leads to extremely slow prediction.

A few recent works address the unsupervised SR setting, where no LR-HR pairs are given and the relation between LR and HR images is unknown. The Cycle-in-Cycle network [37] learns a mapping from the original input image to a clean image space, using a framework that employs cycle consistency losses. The SR network itself is trained with only indirect supervision in the LR domain, in addition to the usual perceptual GAN discriminator. In contrast, our framework allows direct supervision in the HR domain, resulting in better training of the SR network itself. Furthermore, instead of "cleaning" the input image during both training and test time, we learn a mapping to the original input domain that is used only during training. Another work focuses on the downsampling process [21] in order to improve the SR. However, SR is only performed on images with the learned downsampling operation, and is therefore not applicable to our real-world scenario. Bulat et al. [5] also focus on the problem of learning the downsampling process. However, this approach specifically addresses the problem of super-resolving faces, where strong content priors can be learned by the network. In contrast, we tackle the general SR problem, without making assumptions about the image content. Lastly, recent works [39, 7, 6] propose strategies to capture real LR-HR image pairs. However, these methods rely on complicated data collection procedures, requiring specialized hardware that is difficult and expensive to scale. Our approach operates without the need for any additional data, greatly increasing its applicability.

3 Proposed Method

3.1 The Super-Resolution Problem

In essence, super-resolution (SR) is the problem of increasing the resolution of natural images. However, this problem comes with a fundamental challenge that has been largely ignored until very recently: the lack of natural LR and HR image pairs, which are needed for evaluation and training. Therefore, research in SR has long relied on known downscaling operators (e.g. bicubic) to artificially generate a corresponding LR image. While this simplification has historically served the development of SR methods, it is fundamentally limiting.

Bicubic downsampling can drastically change the natural characteristics of an image by, e.g., removing sensor noise and compression artefacts. A real-world example is shown in Figure 2. The natural image (left) is affected by natural sensor noise. However, the corresponding bicubically downsampled image does not preserve these characteristics. Hence, a network trained to super-resolve the latter image cannot be expected to generalize to the original real-world distribution.

To formalize the problem, we let $x$ denote the natural image we wish to super-resolve. We also introduce the distribution $p_X$ of such natural images, on which we want our SR approach to operate. In practice, $p_X$ could be defined as images obtained from a specific camera or a dataset of real-world images. The aim is to learn a function $S$ that maps an image $x \sim p_X$ to a high resolution image $\hat{y} = S(x)$ that is distributed according to the output distribution $p_Y$. In some applications, the output distribution $p_Y$ should share the characteristics of $p_X$, meaning that we want the characteristics of the image to remain unchanged after super-resolution. We term this setting domain-specific super-resolution (DSR). Another alternative is to let $p_Y$ be defined by a set of high-quality images, which we call the clean super-resolution (CSR) setting.

For most real-world applications it is extremely hard and expensive to collect natural image pairs for SR. In classical SR, this is addressed by artificially constructing the input image $b = B(y)$ from an original HR image $y$, where $B$ denotes the bicubic downsampling operation. The task is then to super-resolve $b$ to match the original image $y$. However, as illustrated in Figure 2, the bicubically downsampled images do not match the input distribution, i.e. $B(y) \not\sim p_X$. Unfortunately, methods trained in this manner struggle when supplied with real data $x \sim p_X$.

Related to our discussion is the concept of blind SR. In this setting, the input images are assumed to be generated from the output images with some fixed and simple, but unknown, transformation. Often, a more general downsampling kernel $k$ is used in combination with a non-linear degradation function $f$, such that $x = f\big((y \ast k)\!\downarrow_s\big)$, where $\downarrow_s$ denotes downsampling by a factor $s$. Some methods try to find the kernel from data, while others learn the transformation end-to-end.

Figure 2: The effects of bicubic downsampling compared to our domain distribution learning (panels: Original, Crop, Bicubic, Restored). The original HR image contains significant sensor noise, which almost completely disappears after bicubic downsampling. This is clearly observed when comparing with the same-resolution crop of the original image. Our learned mapping restores the image characteristics present in the original image (right).
Figure 3: Schematic overview of our approach. In the first step, we learn the domain distribution network $G$, depicted in blue. Given unpaired data from the input and output distributions, the generator is trained in a GAN framework by employing cycle consistency losses. The SR network $S$ is trained in a second stage, depicted in orange, using pairs generated by our domain distribution network $G$.

3.2 Overview

The real-world SR setting addressed in this work can be seen as a generalization of blind SR. In our approach, we assume no particular relation, such as a parameterized transformation, between the input and output images. We only assume that a set of input image samples $x_i \sim p_X$ and a set of output image samples $y_j \sim p_Y$ are available. These image samples are not paired. Given this data, the problem is to learn a mapping $S$ that can super-resolve a new image $x \sim p_X$ such that $S(x) \sim p_Y$. In order to train from such unpaired data, we learn a function $G$ that maps a bicubically downsampled image $B(y)$ from the output distribution to an image sample $\hat{x} = G(B(y))$ that fits the input distribution $p_X$. This effectively constructs an input-output training pair $(\hat{x}, y)$, allowing the SR network to be learned in a supervised manner such that $S(\hat{x}) \approx y$. The main advantage of our approach is that the SR network can be trained with direct pixel-wise supervision in the HR domain. The proposed framework is depicted in Figure 3.
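To make the two-stage procedure concrete, the following is a minimal PyTorch-style sketch of the pipeline described above; the tiny networks, tensor sizes and the 4x scale factor are illustrative assumptions for this example, not the architectures or settings used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Small stand-in network; G and S use far larger architectures in practice."""
    def __init__(self, scale=1):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 3, 3, padding=1))
    def forward(self, x):
        if self.scale > 1:  # upscale before the conv body (SR case)
            x = F.interpolate(x, scale_factor=self.scale, mode='bilinear', align_corners=False)
        return self.body(x)

def bicubic_down(y, scale=4):
    # Bicubic downsampling B(y); the factor 4 is an assumption for illustration.
    return F.interpolate(y, scale_factor=1.0 / scale, mode='bicubic', align_corners=False)

G = TinyNet()           # domain distribution network: B(y) -> input distribution p_X
S = TinyNet(scale=4)    # super-resolution network
opt = torch.optim.Adam(S.parameters(), lr=1e-4)

# Stage 1 (training G on unpaired data, Section 3.3) is omitted here.
# Stage 2: generate realistic LR-HR pairs and supervise S directly in the HR domain.
y = torch.rand(2, 3, 128, 128)            # stand-in HR batch, y ~ p_Y
with torch.no_grad():
    x_hat = G(bicubic_down(y))            # realistic LR input following p_X
loss = F.l1_loss(S(x_hat), y)             # pixel loss; VGG and GAN terms are added in practice
loss.backward()
opt.step()
```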

We first train the generator $G$, called the domain distribution network, in a conditional GAN setting. This is performed by employing a discriminator network $D_X$ that aims to differentiate the generated images $G(B(y))$ from true input images $x \sim p_X$. Since no paired output is available, we enforce a cycle consistency loss by employing a second generator $F$, mapping input images back to the bicubic domain. Crucially, we train the domain distribution network $G$ independently from the SR network $S$. While this may seem counterintuitive at first, it is clearly motivated by the fact that the networks $G$ and $S$ have fundamentally conflicting objectives. The aim of $G$ is to map a bicubically downsampled image from the output distribution to an image following the input distribution $p_X$, such that a faithful training sample is generated for the SR network. The network $S$ simply aims to super-resolve any image from $p_X$. If both networks were trained jointly, using the cycle-consistency loss to supervise $S$, the networks $G$ and $S$ would collaborate in order to minimize the aforementioned loss. This leads to severe overfitting and poor generalization. As illustrated in Figure 3, we therefore train the SR network in a second, separate training stage, using the training pairs generated by the network $G$.

3.3 Domain Distribution Learning

The task of the domain distribution learning is to map a bicubically downsampled image $B(y)$, with $y \sim p_Y$, to the input distribution $p_X$. Since we do not have access to paired samples, we need to venture into unsupervised learning territory. We first employ a GAN discriminator $D_X$, tasked with differentiating between the generated images $G(B(y))$ and images drawn from the input distribution $p_X$. For this, we employ the original GAN formulation,

$$\mathcal{L}_{\mathrm{GAN}}(G, D_X) = \mathbb{E}_{x \sim p_X}\big[\log D_X(x)\big] + \mathbb{E}_{y \sim p_Y}\big[\log\big(1 - D_X(G(B(y)))\big)\big] . \qquad (1)$$

To preserve the image content, despite the lack of paired images, we employ cycle consistency losses [8]. A second generator $F$ is tasked with mapping images from the input domain $p_X$ back to the domain of bicubically downsampled images $B(y)$, $y \sim p_Y$. We then add cycle consistency losses as,

$$\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{y \sim p_Y}\big[\big\| F(G(B(y))) - B(y) \big\|_1\big] + \mathbb{E}_{x \sim p_X}\big[\big\| G(F(x)) - x \big\|_1\big] . \qquad (2)$$

They constrain the generators $G$ and $F$ to be each other's approximate inverses. Hence, the image content is preserved if mapped through one generator and then back to the original domain. Analogously to (1), we add a discriminator $D_B$ and a similar GAN loss on the bicubic side. The full objective is thus,

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{GAN}}(G, D_X) + \mathcal{L}_{\mathrm{GAN}}(F, D_B) + \lambda_{\mathrm{cyc}}\, \mathcal{L}_{\mathrm{cyc}}(G, F) , \qquad (3)$$

where $\lambda_{\mathrm{cyc}}$ weights the cycle consistency term.

The full architecture is shown in Figure 3 (blue).
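As an illustration of the objective (1)-(3), the following is a minimal PyTorch sketch of one generator update for the domain distribution learning. The stand-in networks, the non-saturating form of the GAN term, and the cycle weight are assumptions made for clarity, not the paper's exact setup; the bicubic-side GAN term and the discriminator updates are analogous and omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_net():
    # Tiny stand-in generator (the paper uses a 9-block ResNet).
    return nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 3, 3, padding=1))

G = conv_net()        # bicubic domain -> input distribution p_X
Fgen = conv_net()     # input distribution p_X -> bicubic domain
D_X = nn.Sequential(  # small patch discriminator on the p_X side
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 4, stride=2, padding=1))
bce = nn.BCEWithLogitsLoss()

x = torch.rand(2, 3, 32, 32)   # real LR samples x ~ p_X
b = torch.rand(2, 3, 32, 32)   # bicubically downsampled outputs B(y)

x_fake = G(b)                                   # generated sample aimed at p_X
d_out = D_X(x_fake)
loss_gan = bce(d_out, torch.ones_like(d_out))   # fool D_X (non-saturating form of (1))
loss_cyc = F.l1_loss(Fgen(x_fake), b) + F.l1_loss(G(Fgen(x)), x)   # cycle consistency (2)
lambda_cyc = 10.0                               # assumed weight (CycleGAN default)
loss = loss_gan + lambda_cyc * loss_cyc         # cf. (3), without the bicubic-side GAN term
loss.backward()
```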

Network architectures For our experiments we design the domain distribution mapping based on the CycleGAN architecture [8]. The generators $G$ and $F$ use a ResNet architecture with nine blocks. We replace the transposed convolution layers with bilinear upsampling followed by a standard convolution. We found this to be beneficial for learning stability, and it effectively removed checkerboard artifacts. Furthermore, we found a non-linearity on the output to be harmful for color consistency, and therefore use no non-linear activation at the output. The discriminators $D_X$ and $D_B$ consist of a three-layer network architecture that operates on a patch level [25, 20].
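The upsampling block described above can be sketched as follows; the channel sizes and the instance normalization are illustrative assumptions in the spirit of the CycleGAN generator, not the exact layer configuration.

```python
import torch.nn as nn

def upsample_block(in_ch, out_ch):
    """Bilinear upsampling followed by a standard convolution, used instead of a
    transposed convolution to avoid checkerboard artifacts."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )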

Training details We adopt the training procedure proposed in CycleGAN, training for 200 epochs with the Adam optimizer; the starting learning rate and momentum parameters follow the same setup.

3.4 Super-Resolution Learning

Here we describe the learning of the SR network $S$. In the absence of paired ground-truth data, we train the network with pairs $(\hat{x}, y)$, where the input image $\hat{x} = G(B(y))$ is generated by our domain distribution network $G$. We employ the pixel-wise content loss [26],

$$\mathcal{L}_{\mathrm{pix}} = \big\| S(\hat{x}) - y \big\|_1 . \qquad (4)$$

Following the success of SRGAN [24], we also employ the VGG feature loss, which is known to correlate better with perceptual quality,

$$\mathcal{L}_{\mathrm{VGG}} = \big\| \phi(S(\hat{x})) - \phi(y) \big\|_2^2 . \qquad (5)$$

Here, $\phi(\cdot)$ denotes the feature activations extracted from the VGG network. We extract the features at the same depth as SRGAN, which is after the activation of the 4th convolutional layer, before the 5th maxpooling layer.
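A possible implementation of the feature extractor $\phi$ with torchvision is sketched below. The slice index corresponding to the described depth (through the activation after the 4th convolution of the last block, before the 5th maxpooling layer) is our assumption about how this maps onto torchvision's VGG-19 layout, and the input is assumed to be normalized as expected by the pretrained model.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGFeatures(nn.Module):
    """Frozen VGG-19 feature extractor up to relu5_4 (index 36 excludes the 5th maxpool)."""
    def __init__(self):
        super().__init__()
        self.features = vgg19(pretrained=True).features[:36].eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, img):
        return self.features(img)

def vgg_loss(phi, sr, hr):
    # Squared distance in VGG feature space, cf. (5).
    return torch.mean((phi(sr) - phi(hr)) ** 2)
```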

For better perceptual quality, we further employ a GAN discriminator $D$. To this end, we adopt the relativistic discriminator employed in ESRGAN [35]. As opposed to a conventional discriminator, which provides an absolute real/fake probability for each image, a relative real/fake score is estimated with respect to a set of real or fake images,

$$D_{\mathrm{Ra}}(a, b) = \sigma\big(C(a) - \mathbb{E}_{b}\big[C(b)\big]\big) . \qquad (6)$$

Here, $C(\cdot)$ is the raw discriminator output and $\sigma$ is the sigmoid function. The SR network is trained with the added adversarial loss,

$$\mathcal{L}_{\mathrm{GAN}} = -\,\mathbb{E}_{y}\Big[\log\big(1 - D_{\mathrm{Ra}}(y, S(\hat{x}))\big)\Big] - \mathbb{E}_{\hat{x}}\Big[\log D_{\mathrm{Ra}}(S(\hat{x}), y)\Big] . \qquad (7)$$

This results in the total loss,

$$\mathcal{L}_{\mathrm{SR}} = \mathcal{L}_{\mathrm{pix}} + \mathcal{L}_{\mathrm{VGG}} + \lambda_{\mathrm{GAN}}\, \mathcal{L}_{\mathrm{GAN}} . \qquad (8)$$

The GAN loss is multiplied by a weight $\lambda_{\mathrm{GAN}}$, balancing the guidance of the two content losses $\mathcal{L}_{\mathrm{pix}}$ and $\mathcal{L}_{\mathrm{VGG}}$ against the GAN loss $\mathcal{L}_{\mathrm{GAN}}$.
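For illustration, the losses (4)-(8) can be combined as in the following sketch. The loss weight and the use of binary cross-entropy with logits for the relativistic terms are assumptions made for this example, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def relativistic_g_loss(d_real, d_fake):
    """Generator-side relativistic average GAN loss, cf. (6)-(7).
    d_real / d_fake are the raw discriminator outputs C(y) and C(S(x_hat))."""
    loss_rf = F.binary_cross_entropy_with_logits(
        d_real - d_fake.mean(), torch.zeros_like(d_real))   # -log(1 - D_Ra(real, fake))
    loss_fr = F.binary_cross_entropy_with_logits(
        d_fake - d_real.mean(), torch.ones_like(d_fake))    # -log(D_Ra(fake, real))
    return loss_rf + loss_fr

def sr_total_loss(sr, hr, phi, d_real, d_fake, lambda_gan=5e-3):
    l_pix = F.l1_loss(sr, hr)                     # pixel-wise content loss (4)
    l_vgg = F.mse_loss(phi(sr), phi(hr))          # VGG feature loss (5)
    l_gan = relativistic_g_loss(d_real, d_fake)   # adversarial loss (7)
    return l_pix + l_vgg + lambda_gan * l_gan     # total loss (8)
```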

Network architecture Our approach is agnostic to the specific architecture of the SR network $S$. For simplicity, we adopt the recently proposed ESRGAN architecture, the winner of the PIRM 2018 challenge [18]. It introduced a new building block, the Residual-in-Residual Dense Block, improving training stability. We augment the ESRGAN network with a final color adjustment layer, to ensure a faithful reproduction of the color palette of the input LR image. This layer adjusts the local mean RGB value to that of the low-resolution image.
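A minimal sketch of such a color adjustment is given below, assuming the local mean is computed with a fixed average-pooling window on the upsampled LR image; the window size is an illustrative choice, not the value used in the paper.

```python
import torch
import torch.nn.functional as F

def color_adjust(sr, lr, kernel_size=17):
    """Shift the local mean RGB of the super-resolved image towards that of the LR input."""
    lr_up = F.interpolate(lr, size=sr.shape[-2:], mode='bilinear', align_corners=False)
    pad = kernel_size // 2
    local_mean = lambda t: F.avg_pool2d(t, kernel_size, stride=1, padding=pad)
    return sr - local_mean(sr) + local_mean(lr_up)
```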

Training details To train our SR network, we start from pre-trained ESRGAN [35] generator and discriminator networks. We then perform 50,000 training iterations, using the Adam optimizer with the same initial learning rate and momentum parameters for both the generator and the discriminator. We use the learning rate schedule of [35], decreasing the rate by a factor of 0.5 after 10%, 20%, 40% and 60% of the total number of iterations.
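The schedule can be expressed, for example, with a standard step scheduler as sketched below; the base learning rate and the stand-in parameters are placeholders, not the values used in the paper.

```python
import torch

total_iters = 50_000
params = [torch.nn.Parameter(torch.zeros(1))]        # stand-in parameters
optimizer = torch.optim.Adam(params, lr=1e-4)        # assumed base learning rate
milestones = [int(total_iters * f) for f in (0.1, 0.2, 0.4, 0.6)]
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.5)

for it in range(total_iters):
    # ... one training iteration ...
    optimizer.step()
    scheduler.step()   # halves the learning rate at 10%, 20%, 40% and 60% of training
```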

4 Experiments

In this section, we present a comprehensive quantitative and qualitative evaluation of our approach. We first discuss the setup and datasets employed in our experiments. Detailed results are provided in the supplementary material.

4.1 Experimental Setup

Figure 4: Overview of our data generation procedure for benchmarking unsupervised SR methods. See text for details.

We present a novel strategy for evaluating real-world SR methods. In traditional SR, the bicubically downsampled image $B(y)$ is super-resolved and compared to the original image $y$. In the real-world scenario, we do not have access to a ground-truth image, which complicates quantitative analysis. On the other hand, we can closely simulate the real-world scenario by constructing the sets of input and output training images by applying downscaling and synthetic degradations to a dataset of original images. The type of degradation is unknown to the SR approach. For evaluation, we further generate a set of ground-truth input-output pairs. These are inaccessible to the network during training, and only used for evaluation purposes. We consider two scenarios, DSR and CSR, detailed below.

Domain-Specific Super-Resolution (DSR) The input and target images share the same real-world distribution. Thus, the aim is to produce super-resolved images that follow the same distribution as the input images. The training set is generated by first downsampling the image and then simulating the real-world degradation. In the DSR case, the same training set represents both the input and output distribution. For evaluation, the input image is constructed using the same procedure as for the training images. The corresponding ground-truth output image is obtained by directly applying the degradation to the original HR image. The procedure is visualized in Figure 4.

In Clean Super-Resolution (CSR), the goal is to super-resolve an input image such that the output image fits another distribution $p_Y$. We let $p_Y$ be defined by a dataset of high-quality images. Therefore, we employ the unaltered original image from the dataset as the ground-truth output. The corresponding LR image used for evaluation is generated as in the DSR case above. We also employ the same training set of input images as for DSR. In the CSR case, however, the output training data represents a different distribution of clean images. These are generated by bicubically downsampling the original images. The resulting images thus represent a clean, ideal output of the SR network. See Figure 4 for a schematic description of the procedure.
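The construction of evaluation pairs for the two settings can be summarized by the following sketch, where `degrade` stands in for the synthetic degradation and the downsampling factor is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def make_eval_pair(original, degrade, setting='DSR', scale=4):
    """original: (N, 3, H, W) tensor in [0, 1]; degrade: callable simulating the degradation."""
    lr = F.interpolate(original, scale_factor=1.0 / scale, mode='bicubic', align_corners=False)
    lr_input = degrade(lr)            # degraded LR input, same construction for DSR and CSR
    if setting == 'DSR':
        gt = degrade(original)        # ground truth keeps the input characteristics
    else:                             # 'CSR'
        gt = original                 # ground truth is the clean original image
    return lr_input, gt
```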

Degradations To model the real-world setting, we evaluate unsupervised SR approaches using two types of image degradations: JPEG compression artifacts and simulated sensor noise. For JPEG artifacts we use a quality setting of 30. JPEG compression artifacts are common when applying super-resolution to images captured by smartphones or acquired from the internet. In the second case, we employ additive white Gaussian noise, which simulates real-world sensor noise present in, e.g., low-light conditions or with small sensor sizes.
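Both degradations are straightforward to simulate; a sketch is given below, where the JPEG quality of 30 follows the text and the noise standard deviation is an assumed placeholder, since the exact value is not restated here.

```python
import io
import numpy as np
from PIL import Image

def jpeg_degrade(img, quality=30):
    """img: uint8 H x W x 3 array; re-encode as JPEG to introduce compression artifacts."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format='JPEG', quality=quality)
    buf.seek(0)
    return np.array(Image.open(buf))

def sensor_noise(img, sigma=8.0):
    """Additive white Gaussian noise; sigma is an assumed placeholder value."""
    noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```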

Quantitative Evaluation Measures In order to quantitatively compare the different approaches, we use the distance metrics PSNR, SSIM and LPIPS. While PSNR and SSIM are handcrafted measures, LPIPS is a learned metric for perceptual similarity [38] between two images. In Figure 5 we provide a comparison of PSNR and LPIPS, measured using the model provided by the authors.
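A possible way to compute the three measures with common Python packages is sketched below; the AlexNet backbone for LPIPS and the [0, 1] input range are assumptions of this example, not necessarily the exact evaluation configuration.

```python
import lpips
import numpy as np
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_model = lpips.LPIPS(net='alex')   # learned perceptual metric

def evaluate(pred, gt):
    """pred, gt: float H x W x 3 arrays in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1).float().unsqueeze(0) * 2 - 1
    dist = lpips_model(to_t(pred), to_t(gt)).item()   # LPIPS expects inputs in [-1, 1]
    return psnr, ssim, dist
```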

Figure 5: While the PSNR only depends on the pixel-wise distance between the ground truth and the prediction, LPIPS is a more elaborate measure that takes perceptual quality into account. Although the EDSR image is perceptually much worse than the prediction of our method, it scores higher in PSNR. The LPIPS distance is smaller for our method, which is perceptually superior.

Datasets We use the DF2K [35] dataset that was introduced for training ESRGAN. It is a merge of the DIV2K [32] dataset, with 800 images, and the Flickr2K dataset, with 2650 images. The mean image size of DF2K is 1439x1935. We also perform experiments on the DPED [17] dataset, acquired with a smartphone camera. It contains natural images with real-world sensor noise and other effects.

                                   Sensor Noise              JPEG Artifacts
Method                        PSNR   SSIM   LPIPS       PSNR   SSIM   LPIPS
DSR, unsupervised:
  Baseline                   18.46   0.23   0.8182     22.22   0.41   0.6319
  Cleaning the input         19.81   0.32   0.5737     22.23   0.56   0.4295
  Low res. supervision       21.46   0.36   0.4363     20.00   0.49   0.4483
  Ours                       22.43   0.40   0.2897     23.30   0.62   0.3732
DSR, supervised (ref.)       23.97   0.48   0.1778     22.97   0.59   0.3526
CSR, unsupervised:
  Baseline                   18.48   0.23   0.7532     23.15   0.60   0.4887
  Cleaning the input         22.10   0.55   0.4516     20.28   0.46   0.4889
  Low res. supervision       22.03   0.55   0.4401     20.16   0.49   0.4752
  Ours                       22.42   0.55   0.3645     22.80   0.57   0.3729
CSR, supervised (ref.)       25.54   0.70   0.2103     22.60   0.58   0.3484
Table 1: Comparison of the four different versions in our ablation study for the DSR (upper half) and CSR (lower half) setting. We validate the methods on the DIV2K dataset. We also compare with training the SR network with full supervision, providing an upper bound for unsupervised methods. Our approach achieves superior perceptual quality measured by the LPIPS distance for sensor noise and JPEG artifacts.

4.2 Ablation Study

For our ablation study we use the DF2K dataset as training data and the validation images from DIV2K to measure the performance of the different methods. The quantitative comparison is done using the PSNR, SSIM and LPIPS measures. For comparisons, we mainly consider the LPIPS distance due to its higher correlation with perceptual similarity.

We evaluate four different approaches for the DSR and CSR settings. All methods are trained using the same settings. For the SR network we employ a pretrained ESRGAN model that is then fine-tuned for each method, as described below. Quantitative and visual results are shown in Table 1 and Figure 6, respectively.

Baseline First, we compare with the standard approach of training the network on LR images generated by bicubic downsampling. For this purpose, we finetune ESRGAN using image pairs $(B(y), y)$. In the case of sensor noise, the baseline achieves significantly inferior performance for both DSR and CSR (Table 1) compared to ours. This is due to the smoothing behaviour of the bicubic downsampling: the baseline ESRGAN thus does not see appropriate levels of noise during training. This leads to severe artifacts in both the DSR and CSR case (Figure 6). Our approach also improves in the case of JPEG artifacts, leading to a clear improvement in LPIPS in the CSR case.

Figure 6: Ablation study for the DSR (top) and CSR (bottom) settings, for four different versions, using the DIV2K validation set (rows: DSR JPEG, DSR sensor noise, CSR JPEG, CSR sensor noise; columns: Baseline, Cleaning the input, Low resolution supervision, Ours, GT). Our approach, in the second column from the right, provides favorable results despite the strong noise or JPEG artifacts in the input images. The compared baseline variations generate strong artifacts in most cases. In the CSR JPEG case we observe a slightly blurred result compared to the ground truth. This is due to the loss of detail caused by the JPEG compression.

Cleaning the input Another strategy for tackling the shift between train and test distributions, caused by the bicubic downsampling, is to map the input image to the bicubic distribution before applying the SR network, as proposed in [37]. With this strategy, the SR network is trained using bicubic data, exactly as in the baseline setting discussed previously. During inference, the input image $x$ is first cleaned and then super-resolved as $S(F(x))$. In fact, in our approach, we already train such a mapping $F$ to ensure cycle consistency when training the domain distribution network (Section 3.3). Since our training is fully symmetric between the two domains, including the architectures of $G$ and $F$, we use the generator $F$ trained in our framework for a fair comparison.

In the case of sensor noise, this version improves the results compared to the baseline method, suggesting that some of the domain shift problem is alleviated. However, our approach further improves the LPIPS distance by 50% in the DSR and 19% in the CSR case. This is partly due to the fact that our SR network acts directly on the input image $x$, while the cleaning mapping $F$ can introduce artifacts or remove information. Our approach also achieves a significant improvement in the JPEG case.

Low resolution supervision Here, we compare our approach with performing supervision in the LR domain. Similar to [37], we add another generator network that maps the super-resolved image back to the original input domain. We then perform the direct pixel-wise supervision in the LR domain instead, using the same losses, to ensure that the super-resolved image, mapped back to the LR domain, matches the input. We observed that this approach leads to stronger GAN hallucinations, as shown in Figure 6. This tendency is also observed in the quantitative results, which show significantly worse LPIPS and PSNR in all cases. This demonstrates the importance of the direct HR supervision provided by our method.

Fully supervised To assess the performance of our approach, we compare with fully supervised training using paired samples, otherwise unavailable to the network. We generate paired data using the same strategy employed for evaluation, i.e. by applying the ground-truth degradation. ESRGAN is then finetuned on this data directly. Note that the ground-truth degradation operation is unknown to all other methods in this comparison. We observe that our approach achieves performance much closer to this upper bound for both DSR and CSR, in particular in the case of JPEG artifacts, where our unsupervised method is only slightly worse than full supervision.

4.3 State-of-the-art Comparison on DIV2K

                                   Sensor Noise              JPEG Artifacts
Method                        PSNR   SSIM   LPIPS       PSNR   SSIM   LPIPS
DSR:
  ZSSR                       23.65   0.47   0.6925     24.20   0.65   0.4584
  EDSR                       23.39   0.44   0.7137     24.12   0.65   0.4525
  ESRGAN                     17.17   0.19   0.7363     22.90   0.61   0.4447
  ESRGAN FT                  18.08   0.22   0.6233     23.24   0.62   0.4129
  Ours                       22.43   0.40   0.2897     23.30   0.62   0.3732
CSR:
  ZSSR                       24.87   0.60   0.6466     23.92   0.64   0.5447
  EDSR                       24.46   0.53   0.6824     23.83   0.63   0.5454
  ESRGAN                     17.39   0.19   0.9434     22.67   0.59   0.5069
  ESRGAN FT IN               18.35   0.23   0.7595     23.01   0.60   0.4903
  ESRGAN FT OUT              17.35   0.19   0.9040     22.82   0.59   0.5087
  Ours                       22.42   0.55   0.3645     22.80   0.57   0.3729
Table 2: Comparison to state-of-the-art super-resolution methods on the DIV2K dataset using the DSR and CSR setting. Our approach achieves the best perceptual results in both sensor noise and JPEG artifact case.

In the following we compare our approach with other state-of-the-art methods: ZSSR [30], EDSR [26] and ESRGAN [35]. For these we use the original code and trained models. ZSSR applies a zero-shot learning strategy, where weights are learned for each image individually. EDSR trains a ResNet-based model without perceptual loss. ESRGAN applies the same SR architecture and perceptual losses as our approach. Both EDSR and ESRGAN are trained using bicubic supervision.

We also report the results of finetuning ESRGAN on the same training data employed by our approach. In the case of CSR, where two distinct training sets are available, we further compare with finetuning ESRGAN on each of them. The method "ESRGAN FT IN" is trained on the set of input images, while "ESRGAN FT OUT" is trained on the set of output images. In all cases we follow the same training procedure as [35], constructing the corresponding LR image using bicubic downsampling.

We evaluate the aforementioned approaches on the DIV2K validation set using the DSR and CSR setting as described in Section 4.1. Results for the DSR and CSR settings are reported in Table 2.

Figure 7: Qualitative comparison between our approach and four state-of-the-art approaches, for the JPEG and sensor noise degradations (columns include ZSSR, EDSR, ESRGAN, the finetuned ESRGAN variants, Ours (DSR), Ours (CSR) and GT). Our approach is able to recover the original distribution, owing to the input and output domain awareness that the other methods lack.

Sensor Noise In the case of sensor noise, ESRGAN provides poor perceptual quality, with LPIPS distances of 0.7363 and 0.9434 for DSR and CSR respectively. Finetuning the model on the given data only provides minor improvements. This is due to the domain distribution shift caused by the bicubic downsampling, which leads to clean input images during training. EDSR exhibits better robustness to noisy input, with LPIPS values of 0.7137 and 0.6824 for DSR and CSR respectively. Among previous approaches, ZSSR achieves the best performance, owing to its zero-shot learning strategy. Our unsupervised approach achieves the best overall perceptual quality, significantly reducing the LPIPS error metric to 0.2897 and 0.3645 for DSR and CSR respectively in the sensor noise setting.

As shown in Figure 7, the ESRGAN approaches produce strong artifacts in the case of noisy input. This is likely due to the perceptual loss, which encourages the network to output high-frequency components in order to provide a sharp image. In contrast, our method does not suffer from such artifacts despite employing a strong perceptual loss. This demonstrates that our SR network has learned significant robustness towards image noise through our unsupervised training strategy.

JPEG Compression In the DSR case, where SR is performed within the same domain, our approach achieves a significantly lower LPIPS distance compared to the state of the art. For CSR, where the task is additionally to clean the input from artifacts, our approach has a strong advantage, reducing the LPIPS by more than 20% compared to the best competing method. However, as shown in Figure 7, the difference in perceptual quality is not fully captured by the quantitative results. All compared approaches produce highly visible block artifacts, stemming from the JPEG compression of the input image. In contrast, our approach provides visually pleasing output without such artifacts.

Figure 8: Qualitative comparison on real-world images from the DPED dataset (columns: Input, Ours (DSR), Ours (CSR), ZSSR, EDSR, ESRGAN, ESRGAN FT). Due to their training setup, the other state-of-the-art methods produce large artifacts on real-world data, while our methods (DSR, CSR) super-resolve these images in a perceptually satisfying manner.

4.4 Real-World Evaluation on the DPED Dataset

In this section we apply our method to the original images of the DPED iPhone3 dataset [17]. It contains natural images, which include real-world degradations due to poor sensor and lens quality. We train our model using the training split to represent both the input and output distributions. We also finetune the ESRGAN model on the same data, as described in the previous experiment. Since ground-truth images are not available in this real-world setting, we show a diverse set of qualitative examples from the DPED iPhone3 validation set in Figure 8 and Figure S9. The artifacts produced by ZSSR, EDSR, ESRGAN and the finetuned ESRGAN are of a similar nature as in the sensor noise case of the DIV2K setting in Figure 7. Note that these limitations of previous approaches cannot be alleviated by more training data or architectural designs. Instead, these issues originate from an oversimplified problem formulation that does not reflect most real-world applications. Our approach overcomes the limitations of previous methods by learning the input image distribution, and generates high-quality images with very few artifacts.

5 Conclusion

We tackle the problem of real-world super-resolution, where no paired data is available. To avoid the artifacts caused by bicubic downsampling, we learn a network that restores low resolution images to the real-world image distribution. This allows us to generate realistic training pairs for our super-resolution model. We further propose a benchmark, based on the DIV2K dataset, for quantitatively evaluating real-world super-resolution approaches. Experiments are performed on our real-world benchmark and on the DPED dataset. Compared to previous methods, our approach generalizes to natural images affected by significant sensor noise, compression artifacts and other effects.

Acknowledgements.

This work was partly supported by ETH General Fund, Amazon through an AWS grant, Nvidia through a GPU grant, and Huawei.

References

  • [1] E. Agustsson and R. Timofte (2017) NTIRE 2017 challenge on single image super-resolution: dataset and study. In CVPR Workshops, Cited by: §S1, §S4.
  • [2] N. Ahn, B. Kang, and K. Sohn (2018) Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, Cited by: §2.
  • [3] N. Ahn, B. Kang, and K. Sohn (2018) Image super-resolution via progressive cascading residual network. In CVPR, Cited by: §2.
  • [4] I. Begin and F. Ferrie (2004) Blind super-resolution using a learning-based approach. In ICPR, Cited by: §2.
  • [5] A. Bulat, J. Yang, and G. Tzimiropoulos (2018) To learn image super-resolution, use a gan to learn how to do image degradation first. arXiv preprint arXiv:1807.11458. Cited by: §2.
  • [6] J. Cai, S. Gu, R. Timofte, and L. Zhang (2019-06) NTIRE 2019 challenge on real image super-resolution: methods and results. In CVPR Workshops, Cited by: §2.
  • [7] C. Chen, Z. Xiong, X. Tian, Z. Zha, and F. Wu (2019) Camera lens super-resolution. In CVPR, Cited by: §2.
  • [8] C. Chu, A. Zhmoginov, and M. Sandler (2017) Cyclegan, a master of steganography. arXiv preprint arXiv:1712.02950. Cited by: §3.3, §3.3.
  • [9] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In ECCV, Cited by: §1, §2.
  • [10] C. Dong, C. C. Loy, K. He, and X. Tang (2016) Image super-resolution using deep convolutional networks. TPAMI 38 (2), pp. 295–307. Cited by: §2.
  • [11] Y. Fan, H. Shi, J. Yu, D. Liu, W. Han, H. Yu, Z. Wang, X. Wang, and T. S. Huang (2017) Balanced two-stage residual networks for image super-resolution. In CVPR, Cited by: §2.
  • [12] W. T. Freeman, T. R. Jones, and E. C. Pasztor (2002) Example-based super-resolution. IEEE Computer graphics and Applications. Cited by: §2.
  • [13] J. Gu, H. Lu, W. Zuo, and C. Dong (2019) Blind super-resolution with iterative kernel correction. In CVPR, Cited by: §2.
  • [14] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep back-projection networks for super-resolution. In CVPR, Cited by: §2.
  • [15] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In CVPR, Cited by: §2.
  • [16] Y. Huang and M. Qin (2018) Densely connected high order residual network for single frame image super resolution. arXiv preprint arXiv:1804.05902. Cited by: §2.
  • [17] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool (2017) DSLR-quality photos on mobile devices with deep convolutional networks. In ICCV, Cited by: §1, §4.1, §4.4.
  • [18] A. Ignatov, R. Timofte, T. Van Vu, T. M. Luu, T. X. Pham, C. Van Nguyen, Y. Kim, J. Choi, M. Kim, J. Huang, et al. (2018) Pirm challenge on perceptual image enhancement on smartphones: report. arXiv preprint arXiv:1810.01641. Cited by: §2, §3.4.
  • [19] M. Irani and S. Peleg (1991) Improving resolution by image registration. CVGIP. Cited by: §2.
  • [20] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. Cited by: §3.3.
  • [21] H. Kim, M. Choi, B. Lim, and K. M. Lee (2018) Task-aware image downscaling. ECCV. Cited by: §2.
  • [22] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, Cited by: §2.
  • [23] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, Cited by: §2.
  • [24] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. CVPR. Cited by: §2, §3.4.
  • [25] C. Li and M. Wand (2016) Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV, Cited by: §3.3.
  • [26] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. CVPR. Cited by: §2, §S2, §3.4, §4.3.
  • [27] A. Lugmayr, M. Danelljan, R. Timofte, et al. (2019) AIM 2019 challenge on real-world image super-resolution: methods and results. ICCV Workshops. Cited by: §S1.
  • [28] T. Michaeli and M. Irani (2013) Nonparametric blind super-resolution. In ICCV, Cited by: §2.
  • [29] S. C. Park, M. K. Park, and M. G. Kang (2003) Super-resolution image reconstruction: a technical overview. IEEE signal processing magazine. Cited by: §2.
  • [30] A. Shocher, N. Cohen, and M. Irani (2018) "Zero-shot" super-resolution using deep internal learning. In CVPR, Cited by: §2, §S2, §4.3.
  • [31] Y. Tai, J. Yang, X. Liu, and C. Xu (2017) Memnet: a persistent memory network for image restoration. In ICCV, Cited by: §1.
  • [32] R. Timofte, E. Agustsson, L. Van Gool, M. Yang, L. Zhang, B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, et al. (2017) Ntire 2017 challenge on single image super-resolution: methods and results. CVPR Workshops. Cited by: §S1, §4.1.
  • [33] R. Timofte, S. Gu, J. Wu, and L. Van Gool (2018) NTIRE 2018 challenge on single image super-resolution: methods and results. In CVPR Workshops, Cited by: §S2.
  • [34] T. Tong, G. Li, X. Liu, and Q. Gao (2017) Image super-resolution using dense skip connections. In ICCV, Cited by: §1.
  • [35] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang (2018) ESRGAN: enhanced super-resolution generative adversarial networks. ECCV. Cited by: Figure 1, §1, §2, §S2, §3.4, §3.4, §4.1, §4.3, §4.3, §S4.
  • [36] J. Yang, J. Wright, T. S. Huang, and Y. Ma (2010) Image super-resolution via sparse representation. TIP. Cited by: §2.
  • [37] Y. Yuan, S. Liu, J. Zhang, Y. Zhang, C. Dong, and L. Lin (2018) Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. CVPR Workshops. Cited by: §2, §4.2, §4.2.
  • [38] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. CVPR. Cited by: §2, §S2, §4.1.
  • [39] X. Zhang, Q. Chen, R. Ng, and V. Koltun (2019) Zoom to learn, learn to zoom. In CVPR, Cited by: §2.
  • [40] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV. Cited by: §1.

S1 AIM 2019 Real-World Super Resolution Challenge

In addition to the experiment settings presented in Section 4.1, we introduce the real-world SR benchmark employed in the recent AIM 2019 Real-World Super-Resolution Challenge [27]. We use the same overall procedure as described in Section 4.1, but employ a more complex degradation mapping, composed of several different operations. The input domain train images are obtained by directly adding the degradation to the high-resolution Flickr2K [32] images, while we use the clean DIV2K [1] train split for the target domain set. For validation and testing we use the corresponding splits of DIV2K. The results of our approach compared to state-of-the-art methods are shown in Tables S1 and S2. Note that Track 1 and Track 2 correspond to the DSR and CSR settings respectively. Our approach achieves a superior LPIPS score compared to previous approaches.

Method                        PSNR   SSIM   LPIPS
DSR (Track 1):
  ZSSR                       25.24   0.61   0.6613
  EDSR                       25.24   0.60   0.6698
  ESRGAN                     22.49   0.50   0.5829
  ESRGAN FT                  24.60   0.55   0.4482
  Ours                       24.81   0.56   0.4376
CSR (Track 2):
  ZSSR                       22.42   0.61   0.5996
  EDSR                       22.36   0.60   0.6150
  ESRGAN                     20.69   0.51   0.5604
  ESRGAN FT IN               21.40   0.52   0.5191
  ESRGAN FT OUT              21.66   0.55   0.5282
  Ours                       21.59   0.55   0.4720
Table S1: Comparison to state-of-the-art super-resolution methods on the AIM Real World Super-Resolution Challenge dataset using the DSR (Track 1) and CSR (Track 2) settings on the validation set.
Method                        PSNR   SSIM   LPIPS
DSR (Track 1):
  Bicubic                    25.34   0.61   0.7331
  EDSR                       25.14   0.60   0.6657
  ESRGAN                     22.57   0.51   0.5770
  ESRGAN FT                  24.54   0.56   0.4458
  Ours                       24.22   0.54   0.4356
  Fully Supervised           24.22   0.55   0.3011
CSR (Track 2):
  Bicubic                    22.37   0.63   0.6602
  EDSR                       22.35   0.62   0.5959
  ESRGAN                     20.76   0.52   0.5539
  ESRGAN FT IN               21.36   0.54   0.5275
  ESRGAN FT OUT              21.94   0.59   0.5083
  Ours                       21.17   0.54   0.4594
  Fully Supervised           22.80   0.65   0.2928
Table S2: Comparison to state-of-the-art super-resolution methods on the AIM Real World Super-Resolution Challenge dataset using the DSR (Track 1) and CSR (Track 2) settings on the test set.
Method                        LPIPS    PSNR   SSIM
Bicubic                      0.2537   22.72   0.52
ZSSR                         0.2320   22.66   0.51
EDSR                         0.2534   22.48   0.49
ESRGAN (Baseline)            0.2670   21.10   0.37
Cleaning the Input           0.2272   20.56   0.42
Low resolution supervision   0.2038   19.98   0.47
Ours                         0.1858   18.89   0.49
Table S3: Comparison of the different state-of-the-art methods and three versions from our ablation study for the DSR setting. We validate the methods on the DIV2K Track 2 dataset, which was published for the NTIRE 2018 challenge. Our approach achieves superior perceptual quality measured by the LPIPS distance.

S2 NTIRE 2018 Evaluation

We perform an additional experiment on the NTIRE 2018 [33] challenge. To evaluate the approaches in the unsupervised scenario, we use Track 2: Realistic Mild adverse conditions. In this setting, an unknown degradation operation has been applied to the low resolution images to simulate realistic sensor noise and motion blur. The goal is to produce clean output images, similar to our CSR scenario. Although LR and HR training pairs are provided, we do not utilize the paired data. As in the main paper, we train our approach in the fully unsupervised real-world setting, using only unpaired LR and HR images. Compared to the official challenge [33], our approach is thus trained under considerably harder, but more realistic, conditions.

The results are provided in Table S3. We compare our approach to simple bicubic upscaling, ZSSR [30], EDSR [26] and the approaches considered in our ablative study: the baseline ESRGAN [35], Cleaning the Input, and Low resolution supervision. See Section 4.2 in the main paper and Section S3 here for more details about the latter baseline versions. Results are reported in terms of LPIPS [38], PSNR and SSIM. Since our goal is perceptual super-resolution, we focus on the LPIPS metric. A comparison between these metrics is shown in Figure 5. Although naive bicubic upscaling achieves the best PSNR in Table S3, our approach clearly achieves better perceptual results, as indicated by the LPIPS metric. The baseline ESRGAN obtains an LPIPS of 0.2670. Employing the generator $F$ to clean the input image during inference improves the LPIPS metric to 0.2272, and using supervision in the low resolution domain yields an LPIPS of 0.2038. Our approach, based on domain distribution learning and allowing direct supervision in the high resolution domain, achieves the best LPIPS of 0.1858. This additional experiment demonstrates the same trend as the results shown in the main paper.

Figure S5: The four different approaches for real-world super-resolution analyzed in the ablation study (Section 4.2 in the paper): (a) Baseline, (b) Cleaning the input, (c) Low resolution supervision, (d) Ours. For each method, we illustrate how the super-resolution network is trained (left) and applied during inference (right). Pixel-wise supervision is indicated in the figure, while the discriminator losses are omitted for clarity.

S3 Details of the Baseline Approaches

Here we provide additional details of the ablative methods evaluated in Section 4.2 of the paper. Figure S5 illustrates the three baseline versions, along with our approach. In the baseline ESRGAN version (Figure S5a), we follow the same protocol for generating the low resolution training samples as ESRGAN when finetuning the pretrained model. The super-resolution network is trained to reconstruct images that are bicubically downsampled. As shown, this produces strong artifacts due to the mismatch of input domains during the training and testing phases. A first approach to match the input distribution of the super-resolution network is to clean the low resolution image at test time (Figure S5b). The domain distribution mapping is learned using a cycle consistency loss to map input images to the domain of bicubically downsampled images. Another way to obtain matching domains during train and test time is to perform the supervision in low resolution, as one can then directly use the natural image as low resolution input for training (Figure S5c).

In our approach (Figure S5d), we make sure that the inputs during training and testing match by applying the mapping $G$ during training to invert the effects of bicubic downsampling on the image distribution. This mapping is learned with our unsupervised domain distribution learning (Section 3.3). Compared to the version in Figure S5b, our approach requires no extra network for processing the input image at test time before the super-resolution is applied. Instead, our approach learns to super-resolve natural images directly. Moreover, in contrast to Figure S5c, we perform pixel-wise supervision in high resolution against the original natural image, which helps the super-resolution network produce crisp, photo-realistic images.

S4 Real-world SR Benchmarking Details

Here, we provide additional details of the proposed benchmarking procedure and dataset for real-world super-resolution methods. In classical super-resolution, the low resolution images for training and evaluation are constructed by applying bicubic downsampling to a dataset of natural images. Therefore, those methods are only evaluated on how well they super-resolve images that were downsampled the same way as during training. However, in the real-world setting the aim is to super-resolve natural images to a higher resolution that is unseen during training. Thus, the desired high resolution images are not available for training the super-resolution network. To reflect this crucial aspect in the proposed benchmarking procedure for real-world super-resolution, we apply different procedures for constructing training and test data. Importantly, the training data only contains low resolution images, while the original high resolution images are only used when evaluating the methods. Thus the final desired image resolution is unseen during training, as in real-world applications.

We construct the train and test images for real-world super-resolution based on a dataset of original images. An illustration of the procedure is shown in Figure 4 (also shown in the main paper). For evaluation, we construct an input-output image pair. The input image is obtained by first downsampling the original image and then applying the degradation transformation to simulate the real-world case. Since the degradation is heavily affected by image downsampling, it is always applied after the downsampling procedure. In the domain-specific super-resolution (DSR) case, where the goal is to super-resolve the image while preserving the original characteristics, we construct the ground-truth output image by applying the same degradation operation directly to the original image. In the clean super-resolution (CSR) case, we employ the original image as ground-truth. The training set is constructed by first downsampling the original image and then applying the degradation operation. In the CSR case, a set of clean images is also available during training (see Section 4.1 in the paper). These are constructed by applying only bicubic downsampling to the original images, with no degradations. Note that no images of the original resolution, used for evaluation, are available for training.

Dataset Num. images Mean image size Set
DF2K 3450 1439x1935 Training
DIV2K 100 1451x1979 Testing
Table S4: Datasets that were used for our experiments.

In this work, we employ the DF2K [35] dataset, consisting of the DIV2K [1] and Flickr2K [32] datasets. The dataset specifications are summarized in Table S4. For testing, the 100 validation images of DIV2K are used. The training sets are constructed using the 3450 training images of DF2K. The images for training and evaluation have no overlap. We always employ the same downsampling factor, and evaluate the methods for sensor noise and JPEG degradation, as described in the paper (Section 4.1).

S5 Domain Mapping Examples

As described in Section 3.3 of the main paper, we learn a generator $G$ that transfers the downsampled images to the natural domain. In Figure S6 and Figure S7 we show how the downsampled images are transferred to images with sensor noise and JPEG artifacts, respectively. In the sensor noise case, one can see Gaussian-noise-like image characteristics. In the JPEG case, the generator has learned to simulate blocky compression artifacts, especially at sharp edges.

S6 Visual Results

Visual examples from the NTIRE 2018 challenge evaluation (Section S2) are shown in Figure S8. We also show additional visual examples for the real-world DPED iPhone dataset in Figure S9.

S7 Failure Cases

We developed our method to match the natural input domain at test time, also for real-world datasets with significant noise or JPEG artifacts. This requires a domain distribution learning step that can itself introduce some artifacts, as shown in the first row of Figure S6. When the network is trained on clean, high-quality data, for which bicubic downsampling does not have a large effect on the image distribution, our method therefore produces slightly more artifacts due to the domain transfer learning.

Figure S6: Visual example outputs of the domain distribution learning network when trained with images corrupted by sensor noise. The input is the bicubically downsampled image $B(y)$, the output is the image transferred to the natural distribution by $G$, and GT shows the real degradation applied to $B(y)$.
Figure S7: Visual example outputs of the domain distribution learning network when trained with images corrupted by JPEG compression. The input is the bicubically downsampled image $B(y)$, the output is the image transferred to the natural distribution by $G$, and GT shows the real degradation applied to $B(y)$.
Figure S8: Qualitative comparison on the NTIRE2018 Track 2 challenge.
Figure S9: Qualitative comparison on real-world images from the DPED dataset (Section 4.4 in the paper; columns: Input, Ours, ZSSR, EDSR, ESRGAN, ESRGAN FT). Our approach achieves superior perceptual quality. As this is a real-world dataset, no super-resolution ground truth exists.