DASR
Official implementation of the paper 'Efficient and Degradation-Adaptive Network for Real-World Image Super-Resolution'
Efficient and effective real-world image super-resolution (Real-ISR) is a challenging task due to the unknown complex degradation of real-world images and the limited computation resources in practical applications. Recent research on Real-ISR has achieved significant progress by modeling the image degradation space; however, these methods largely rely on heavy backbone networks and they are inflexible to handle images of different degradation levels. In this paper, we propose an efficient and effective degradation-adaptive super-resolution (DASR) network, whose parameters are adaptively specified by estimating the degradation of each input image. Specifically, a tiny regression network is employed to predict the degradation parameters of the input image, while several convolutional experts with the same topology are jointly optimized to specify the network parameters via a non-linear mixture of experts. The joint optimization of multiple experts and the degradation-adaptive pipeline significantly extend the model capacity to handle degradations of various levels, while the inference remains efficient since only one adaptively specified network is used for super-resolving the input image. Our extensive experiments demonstrate that the proposed DASR is not only much more effective than existing methods on handling real-world images with different degradation levels but also efficient for easy deployment. Codes, models and datasets are available at https://github.com/csjliang/DASR.
Single image super-resolution (SISR) [20, 49, 65, 37, 45] is an active research topic in low-level vision, aiming at reconstructing a high-resolution (HR) version of a degraded low-resolution (LR) image. Since the seminal work of SRCNN [8], many convolutional neural network (CNN) based SISR methods [41, 22, 43, 66, 19] have been proposed, most of which assume a pre-defined degradation process (e.g., bicubic down-sampling) from HR to LR images. Despite their great success, the performance of these non-blind SISR methods deteriorates significantly on real-world images [30] because of the mismatch between the degradation models of the training data and the real-world test data [60].

Blind image super-resolution (BISR) methods [30, 61, 36, 14, 67] have been proposed to address this problem by considering more complex degradation kernels extracted from real-world images. However, the degradation space of these methods is actually restricted to a set of pre-collected kernels, such as the DPED kernel pool [67, 16]. The degradation space of real-world images can be much larger, including more types of kernels and more complex ones than the DPED kernel pool, more complex and stronger noise, and other degradation operations such as compression. Therefore, much recent research has focused on real-world image super-resolution (Real-ISR) tasks [4, 51, 35, 33, 10, 34, 18, 40] by modeling and synthesizing the complex degradation process of real-world images [3, 52]. Representative works include BSRGAN [60] and Real-ESRGAN [47], which introduce comprehensive degradation operations such as blur, noise, down-sampling, and JPEG compression, and control the severity of each operation by randomly sampling the respective hyper-parameters. They further employ a random shuffle of degradation orders [60] and a second-order degradation process [47], respectively, to better simulate real-world complex degradations.
Despite the remarkable progress of BSRGAN [60] and Real-ESRGAN [47] in improving image perceptual quality, they have several limitations for practical usage. On one hand, they are basically designed to work on severely degraded LR images. While BSRGAN and Real-ESRGAN can generate a certain amount of details on some heavily degraded LR images, they can hardly generate fine details on mildly degraded LR inputs. It is highly anticipated to develop Real-ISR models that can handle images with different degradation levels. On the other hand, BSRGAN and Real-ESRGAN rely on heavy backbone networks (e.g., RRDB [49]), which makes them difficult to deploy on devices with limited computational resources [7, 63, 1, 55, 44]. Efficient Real-ISR models are thus also highly desirable.
To tackle the above problems, in this paper we propose a degradation-adaptive super-resolution (DASR) network, whose parameters are adaptively specified for each input image according to its estimated degradation. Our DASR consists of a tiny regression network that estimates the degradation parameters of the input image and multiple lightweight super-resolution experts that are jointly optimized on a balanced degradation space. For each input image, an adaptive network is constructed via a non-linear mixture of experts, whose weighting factors are specified by the estimated degradation parameters. The multiple super-resolution experts and the degradation-aware mixture significantly improve the model capacity for handling images of different degradations. Meanwhile, the whole DASR pipeline is highly efficient and meets the requirements of Real-ISR tasks, as only one adaptive network is employed to super-resolve the image during inference and the cost of mixing the experts is negligible.
The contributions of this paper are two-fold. First, we propose a degradation-adaptive super-resolution network, which significantly improves the model capacity to super-resolve images of various degradation levels. Second, the pipeline of our DASR network is highly efficient, providing a good solution for Real-ISR in practical applications. Extensive experiments verify the effectiveness and efficiency of the proposed method.
How to reproduce the HR image effectively and efficiently from a low-quality, low-resolution real-world image is a challenging issue in SISR research. The distribution of real-world images can differ dramatically due to varying image degradation processes, different imaging devices, and image signal processing methods [30, 52]. Several works [4, 64] have tried to capture real-world HR-LR image pairs by adjusting the focal length of the camera, yet such data collection is tedious and can only describe a limited subspace of image degradations. Unsupervised methods [52, 10] have also been proposed to explore domain adaptation between synthesized LR images and real ones, yet the domain gap is still inevitable, which deteriorates the SR performance [33, 35].
Recently, several Real-ISR methods such as BSRGAN [60], Real-ESRGAN [47] and SwinIR [28] have achieved remarkable progress by introducing comprehensive degradation models to effectively synthesize real-world images. However, they rely on heavy and computationally intensive backbone networks, e.g., RRDB [49] and the Swin transformer [32], and are inflexible in processing images of different degradation levels. In this paper, we propose a degradation-adaptive framework to address this issue, targeting an effective and efficient network for the challenging Real-ISR task.
In many non-blind SISR methods [25, 49, 65, 48, 20, 37, 11], the degradation model is simply assumed to be bicubic down-sampling or down-sampling after Gaussian blurring. The performance of these non-blind methods can be dramatically undermined when they are applied to images with different degradations [30]. As a remedy, SRMD [61], UDVD [53] and some other methods [58, 62] extend the degradation space to cover more blur kernels and noise levels, and use the degradation map as an additional input to perform conditional SISR. While these methods can handle multiple degradations with a single model, they rely on accurate degradation estimation, which is itself a challenging task.
A few blind SISR methods have been proposed for unknown degradations [46, 15, 31, 39, 3, 56, 52]. In KMSR [67], a kernel pool is constructed from real photographs using a generative adversarial network [12], followed by synthesizing training pairs in a more realistic way. Some methods, such as IKC [14] and VBSR [6], incorporate a blur kernel estimator into the SISR framework, which can be adaptive to images degraded by different blur kernels [36, 23]. However, most blind SISR methods are trained with a pre-collected kernel pool [67, 16]; hence they are not truly blind and can hardly generalize to real-world images.

Recent Real-ISR methods such as BSRGAN [60] and Real-ESRGAN [47] further extend the degradation modeling space by incorporating comprehensive degradation types with randomly sampled degradation parameters to enhance the variation. The larger degradation space helps the trained Real-ISR models improve the perceptual quality of some heavily degraded LR inputs. However, the degradation parameter sampling in BSRGAN and Real-ESRGAN is too unbalanced to train a flexible network, limiting the trained model in generating fine details, especially for inputs with mild degradations. In this work, we propose to balance the degradation space by partitioning it into three levels with balanced sampling frequencies. Such a balanced space facilitates the optimization of our degradation-adaptive model on different degradation levels and brings a better approximation to real-world LR images.
The mixture of experts (MoE) [17, 21, 13, 2] is a long-standing technique that computes a weighted combination of multiple expert networks to improve performance. A trainable gating network computes the weight for activating each expert [38], usually based on an explicit (e.g., labeled classes) or implicit (e.g., content clustering) partition of the data. In this paper, we calculate the adaptive weights of the experts according to the degradation of the input image for the Real-ISR task. Moreover, instead of activating all experts and computing the weighted sum of their outputs as in previous MoE methods [50], we adaptively mix the network parameters, resulting in only one adapted network for inference. Such a pipeline is both effective, owing to the increased non-linearity, and efficient, owing to the fast inference.
Dynamic convolution [5, 27] and conditional convolution [26, 54] aim to enhance the feature representation capacity by making the convolutional parameters sample-adaptive. Most existing methods optimize multiple sets of convolutional parameters and learn feature self-attention to linearly combine them. However, this pipeline introduces considerable computation to obtain the self-attention, causing a trade-off between efficiency and effectiveness. In this paper, we achieve a non-linear mixture of experts via an adapted conditional convolution, where the conditions are the degradation parameters and the weighting factors are computed only once for all layers to keep the model efficient.
This section presents our degradation-adaptive network for real-world image super-resolution, i.e., DASR. As shown in Figure 1, DASR mainly consists of a degradation prediction network and a CNN-based SR network with multiple experts. In the following sections, we first provide the details of the proposed DASR framework and then introduce our degradation modeling to set degradation parameters and generate training pairs.
Degradation prediction network. To enable efficient and degradation-adaptive super-resolution, we estimate the degradation parameters of each input image x via a regression network P, i.e., v̂ = P(x), where v̂ denotes the estimate of the degradation parameter vector v. We employ the set of parameters v to elaborately describe the degradation space; the details of the degradation space modeling will be discussed in Section 3.2. To make the estimation efficient, we design P as a lightweight network consisting of several convolution layers with Leaky ReLU activation, followed by a global average pooling layer: the convolution layers extract spatial degradation features, and the pooling layer aggregates them into the estimated degradation parameters.
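A minimal PyTorch sketch of such a predictor is given below; the number of layers and the channel width are illustrative assumptions, since the exact configuration is not specified here.

```python
import torch
import torch.nn as nn

class DegradationPredictor(nn.Module):
    """Tiny regression network P: conv + LeakyReLU blocks followed by
    global average pooling, producing the estimate v_hat."""
    def __init__(self, num_params: int, channels: int = 64, num_layers: int = 5):
        super().__init__()
        layers = [nn.Conv2d(3, channels, 3, padding=1), nn.LeakyReLU(0.1, inplace=True)]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.1, inplace=True)]
        layers.append(nn.Conv2d(channels, num_params, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.body(x)           # spatial degradation features
        return feat.mean(dim=(2, 3))  # global average pooling -> v_hat
```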
To optimize the network P, we introduce a regression loss between the estimated degradation parameters v̂ and the ground-truth v using the ℓ1-norm distance:

ℒ_reg = ‖ v̂ − v ‖₁.  (1)
According to the degradation model, each parameter in v is randomly sampled to specify the degradation process used to generate the LR-HR image pairs.
Image super-resolution network. An ideal Real-ISR method is expected to be both effective and efficient. On one hand, computational resources are usually limited in real-world SR tasks, especially on edge devices. On the other hand, the model should effectively handle images with various kinds of degradations. Nevertheless, most current SR methods [28, 60, 47, 25, 29] can only trade off between efficiency and effectiveness, and they are inflexible in handling images with different degradation types and levels.
To develop an effective and efficient Real-ISR model, we propose a degradation-adaptive SR network that boosts the model capacity via a non-linear mixture-of-experts (MoE) strategy whose additional inference cost is negligible. Specifically, we employ N convolutional experts, denoted by {E₁, …, E_N}, where each expert is a lightweight SR network, e.g., SRResNet [25] or EDSR-M [29], with independent parameters. All experts share the same network topology and are optimized jointly under the supervision of the same loss. Our idea is to implicitly train each expert to handle images falling into a sub-space of the degradation space, so that the experts can work together to process images with various kinds of degradations over the whole space.
A vector of weighting factors α, which is adaptive to the degradation of the input x, is then calculated to adaptively mix the experts. We compute α conditioned on the estimated v̂ via a tiny network F_w with two fully-connected layers, i.e., α = F_w(v̂). As both v̂ and α are low-dimensional vectors, the network F_w is highly efficient. Note that if α were constrained to be a one-hot vector, only one expert would be activated for super-resolving the input, degrading our framework to a competitive MoE [38]; such a scheme may perform well on tasks whose sample distribution can be partitioned with clear boundaries, but it can hardly work well for the Real-ISR task with its large and continuous degradation space.

With the multiple experts and their adaptive weighting factors α, we mix the experts adaptively in a non-linear manner. For each convolution layer of the desired network, we employ the dynamic convolution technique [54, 5] to parameterize the convolutional kernels as follows:

y = g( ( ∑_{i=1}^{N} αᵢ Wᵢ ) ∗ x ),  (2)

where x and y denote the input and output features, αᵢ indicates the i-th entry of α, Wᵢ denotes the layer parameters of expert Eᵢ, ∗ denotes convolution, and g is the activation function. That is, we adaptively fuse the parameters of each layer among all experts, resulting in one adaptive network. Note that in classic dynamic convolution, the weighting factor of each layer is calculated by an independent network conditioned on the feature map of the previous layer, thus introducing non-negligible computational costs. In contrast, we learn a single set of degradation-adaptive weighting factors shared by all convolution layers, which is very efficient. Our framework follows the spirit of MoE but in a non-linear manner due to the activation operations in intermediate layers. The non-linearity and the degradation-adaptive mixture of multiple experts significantly extend the model capacity to handle degradations of various levels.
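To make the mixture concrete, below is a minimal PyTorch sketch of the weighting module and of one mixed convolution layer implementing Eq. (2). All layer shapes, the hidden width, and the initialization are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightingModule(nn.Module):
    """Two fully-connected layers mapping the estimated degradation
    parameters v_hat to the expert weighting factors alpha."""
    def __init__(self, num_params: int, num_experts: int, hidden: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(num_params, hidden)
        self.fc2 = nn.Linear(hidden, num_experts)

    def forward(self, v_hat: torch.Tensor) -> torch.Tensor:
        # No sigmoid on the output: since parameters (not feature maps) are
        # mixed, negative weights are permitted (see the ablation study).
        return self.fc2(F.leaky_relu(self.fc1(v_hat), 0.1))

class MoEConv2d(nn.Module):
    """One convolution layer whose kernel is the alpha-weighted sum of the
    corresponding kernels of all experts, i.e., Eq. (2) before activation."""
    def __init__(self, num_experts: int, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(0.02 * torch.randn(num_experts, out_ch, in_ch, k, k))
        self.bias = nn.Parameter(torch.zeros(num_experts, out_ch))

    def forward(self, x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # alpha has shape (num_experts,); this sketch assumes the whole batch
        # shares a single degradation estimate.
        w = (alpha.view(-1, 1, 1, 1, 1) * self.weight).sum(dim=0)
        b = (alpha.view(-1, 1) * self.bias).sum(dim=0)
        return F.conv2d(x, w, b, padding=w.shape[-1] // 2)
```

In a full network, the same α, computed once from v̂, would be reused by every such layer, and the activations between layers are what make the overall mixture of experts non-linear.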
Our DASR is very efficient. During inference, only one adapted network is deployed, rather than N models as in classic MoE methods [17, 21]. The degradation prediction network and the weighting module are also very lightweight. Therefore, the inference cost is of the same order as that of a single expert network. The computational overhead of the mixture operation is negligible: the mixture consists of multiplication and addition operations over the parameters of the N experts, and for a lightweight backbone like SRResNet or EDSR-M, each expert has only about 1.5M parameters (see Table 2), independent of the input image size. Compared with computing multiple feature maps, the complexity of mixing the parameters is therefore several orders of magnitude lower and can be neglected.
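A back-of-envelope check of this claim, using the rough numbers reported in Table 2 (an SRResNet-scale expert has about 1.5M parameters and one forward pass costs about 166 GMac); the number of experts used here is an assumption for illustration:

```python
N = 5                          # assumed number of experts
expert_params = 1.5e6          # parameters per expert (SRResNet-scale, Table 2)
fuse_ops = N * expert_params   # multiply-adds to fuse parameters; input-size independent
forward_ops = 166e9            # multiply-adds of one expert forward pass (Table 2)
print(f"fusion / forward = {fuse_ops / forward_ops:.1e}")  # ~4.5e-5
```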
Since high-quality real-world LR-HR pairs are hard to collect due to the misalignment issue [4, 64], degradation modeling is very important for synthesizing real-world LR inputs from given HR images for Real-ISR model training. A degradation space should be pre-defined to synthesize training pairs and perform degradation-adaptive optimization. The quality of an LR sample is controlled by a degradation parameter vector v, where each entry specifies the type or severity of a degrading operation. In our DASR, v also serves as the ground-truth for training the degradation prediction network.

The image degradation model has recently been improved significantly, from simple bicubic down-sampling [8, 49] to the shuffling [60] and second-order [47] pipelines. We adopt the degradation operations of blurring (both isotropic and anisotropic Gaussian blur), resizing (down-sampling and up-sampling with area, bilinear, and bicubic modes), noise corruption (both additive Gaussian and Poisson noise), and JPEG compression in our modeling. In v, we use a one-hot code to quantify the type of each degradation operation and a single value to record its degradation level, normalized by the respective dynamic range.
It is worth mentioning that, different from methods [61, 14] that quantify a blur kernel by its kernel coefficients, we quantify a blurring degradation by its kernel size, the standard deviations along the two principal axes, and the rotation degree. In this way, the degradation parameters are more interpretable in specifying the degradation types and levels, and can better support the degradation-aware mixture of experts. Meanwhile, this parameterization has only a few dimensions, while a vector of kernel coefficients would have a much higher dimension to estimate. Benefiting from the interpretability and compactness of the degradation space, our DASR allows explicit user control over the degradation parameters during inference, which can facilitate many user-interactive applications that customize the desired super-resolving effect.

Though the shuffling degradation in BSRGAN [60] and the second-order degradation pipeline in Real-ESRGAN [47] can generate a sufficiently large degradation space, it is hard for them to train a model that adaptively handles images with different levels of degradation. Our DASR is designed to be adaptive to a wide range of real-world inputs with multiple lightweight expert networks, each of which is expected to handle a subspace of images at a certain degradation level. Therefore, we partition the whole degradation space into three levels by specifying the parameter ranges accordingly. Among them, Level-I and Level-II are generated with first-order degradation using small and large parameter ranges, respectively, while Level-III is generated by second-order degradation. Due to space limitations, more details of the degradation operations and parameter specifications are provided in Section 6.1 of the Appendix.
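As an illustration of this encoding, the sketch below builds the blur-related part of v from a one-hot type code and range-normalized levels; the type set and all numeric ranges are assumptions for illustration, not the paper's specification (see Table 3 for the actual setting).

```python
import numpy as np

KERNEL_TYPES = ['isotropic', 'anisotropic']  # hypothetical type set

def normalize(x, x_min, x_max):
    # map a degradation level into [0, 1] by its dynamic range
    return (x - x_min) / (x_max - x_min)

def encode_blur(kernel_type, kernel_size, sigma1, sigma2, rotation):
    one_hot = [1.0 if kernel_type == t else 0.0 for t in KERNEL_TYPES]
    levels = [
        normalize(kernel_size, 7, 21),        # assumed kernel-size range
        normalize(sigma1, 0.2, 3.0),          # assumed sigma range
        normalize(sigma2, 0.2, 3.0),
        normalize(rotation, -np.pi, np.pi),   # rotation angle in radians
    ]
    return np.array(one_hot + levels, dtype=np.float32)
```

Such a code is only a handful of values per operation, in contrast to the hundreds of coefficients of an explicit blur kernel.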
The learnable modules of our DASR network include the degradation prediction network P, the weighting module F_w, and the expert networks {Eᵢ}. As mentioned in Section 3.1, the loss ℒ_reg is used to optimize P to predict the degradation parameters. To optimize the overall framework, following many works in the literature [47, 60, 49], we adopt the ℓ1-norm pixel-wise loss ℒ_pix, the perceptual loss ℒ_per and the adversarial loss ℒ_adv. The total loss is defined as follows (more details are provided in Section 6.2 of the Appendix):

ℒ = ℒ_pix + λ_per ℒ_per + λ_adv ℒ_adv + λ_reg ℒ_reg,  (3)

where λ_per, λ_adv and λ_reg denote the balancing parameters.
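Assembling the objective is straightforward; the sketch below assumes hypothetical callables for the three image losses and placeholder balancing weights:

```python
def total_loss(sr, hr, v_hat, v, pixel_loss, percep_loss, adv_loss,
               lam_per=1.0, lam_adv=0.1, lam_reg=1.0):  # weights are assumptions
    l_pix = pixel_loss(sr, hr)            # l1 pixel-wise loss
    l_per = percep_loss(sr, hr)           # perceptual loss
    l_adv = adv_loss(sr)                  # adversarial (generator) loss
    l_reg = (v_hat - v).abs().mean()      # Eq. (1): degradation regression
    return l_pix + lam_per * l_per + lam_adv * l_adv + lam_reg * l_reg
```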
Following previous works [49, 47], we employ the DIV2K, Flickr2K, and OutdoorSceneTraining datasets to train our DASR model. For efficiency, we employ SRResNet [25] as our backbone. The weights of the experts are initialized from a model pre-trained with the pixel-wise loss, and the whole network is trained with the Adam [24] optimizer. Without loss of generality and for a fair comparison, we conduct Real-ISR experiments with a scale factor of 4, following the settings in BSRGAN [60] and Real-ESRGAN [47]. Both the degradation parameter vector and the number of experts are kept small in our experiments, as discussed in Section 3.1.
We evaluate our DASR method both quantitatively and qualitatively. For quantitative evaluation, as in BSRGAN [60], we synthesize LR-HR pairs by applying the three levels of degradations to the 100 validation images of the DIV2K dataset, i.e., 100 LR-HR pairs for each level. We also make comparisons on the original DIV2K dataset with bicubic downsampling. An illustration of images with different degradations is shown in Fig. 2, and more samples are shown in Section 6.3 of the Appendix. For qualitative evaluation, we also employ the images in RealSRSet [60, 47], where the inputs are corrupted by various blur, noise, or other real degradation operations.
We compare the proposed DASR with representative and state-of-the-art SR methods, including RRDB [49], ESRGAN [49], IKC [14], BSRGAN [60], Real-ESRGAN [47] and Real-SwinIR (-M and -L) [28]. Among them, RRDB is trained on bicubic degradation with pixel-wise loss; ESRGAN is trained on bicubic degradation with pixel-wise, perceptual and adversarial losses; IKC is a representative BISR method trained on various isotropic Gaussian blur kernels; BSRGAN and Real-ESRGAN are state-of-the-art Real-ISR methods with a heavy RRDB backbone; Real-SwinIR is trained on the degradation space of BSRGAN with the computationally expensive SwinIR backbone.
For a more comprehensive and fair comparison, we also re-train those commonly used backbone networks, including SRResNet, EDSR, RRDB, and SwinIR, with our constructed training dataset. Following the common practice [60, 47], we employ PSNR (the larger the better) and LPIPS (learned perceptual image patch similarity, the smaller the better) to quantitatively compare the performance of different methods on synthetic datasets, and make visual comparisons on real-world images since there are no reference images.
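For reference, the two metrics can be computed as in the sketch below, assuming the publicly available lpips package for LPIPS (the AlexNet variant is an assumption) and images as float tensors of shape (1, 3, H, W) in [0, 1]:

```python
import torch
import lpips

lpips_fn = lpips.LPIPS(net='alex')  # assumed variant; lower is better

def psnr(sr: torch.Tensor, hr: torch.Tensor) -> float:
    mse = torch.mean((sr - hr) ** 2)
    return (10 * torch.log10(1.0 / mse)).item()  # higher is better

def lpips_score(sr: torch.Tensor, hr: torch.Tensor) -> float:
    # lpips expects inputs scaled to [-1, 1]
    return lpips_fn(sr * 2 - 1, hr * 2 - 1).item()
```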
Effectiveness. In Table 1 and Table 2, we quantitatively compare the performance of competing methods in terms of PSNR and LPIPS on datasets with different levels of degradations. Specifically, Table 1 compares the methods trained with their own degradation models, while Table 2 compares the methods re-trained on our proposed degradation space.
As shown in Table 1, existing methods can only achieve satisfactory performance on datasets with a specific type of degradation. For example, RRDB and ESRGAN can respectively achieve good fidelity and perceptual quality on the bicubic-downsampled dataset, yet their performance drops dramatically when handling images with other degradations, even the ‘Level-I’ degradation with mild noise and blur. Real-ESRGAN, BSRGAN, and Real-SwinIR perform well on the most severely degraded dataset, i.e., ‘Level-III’, yet their performance deteriorates considerably on the other three datasets.
In contrast, our DASR achieves stable and significant improvements over the other methods under the first three types of degradations, which cover the majority of real-world images, while achieving highly competitive (among the best two) results for the last type. For example, DASR outperforms Real-ESRGAN by about 1.7dB in PSNR and 0.06 in LPIPS on the ‘Level-I’ dataset. On the ‘Level-III’ dataset with severely degraded images (as shown in Fig. 2(d)), DASR achieves almost the same PSNR and LPIPS indices as BSRGAN. These observations clearly demonstrate that our DASR generalizes well to images with a wide range of degradations.
To further validate the effectiveness of our degradation-adaptive strategy, in Table 2 we re-train the backbones of popular SR models on our proposed degradation space. Note that the heavy RRDB backbone is adopted in both BSRGAN and Real-ESRGAN, while the lightweight SRResNet is adopted as the backbone of our DASR. As can be seen from the table, with the same network topology and similar computational overhead, our DASR outperforms the baseline SRResNet on all datasets by a clear margin, e.g., improving the PSNR by 0.5dB on the bicubic-downsampled dataset and the LPIPS by about 0.01 on the Level-II dataset. This demonstrates that the degradation-adaptive mixture of multiple experts can significantly extend the model capacity while keeping the efficiency.
Compared to the RRDB and SwinIR backbones adopted in recent state-of-the-art methods [60, 47, 28], our DASR consumes much less computational resources, e.g., about 31% and 8% of the latency of RRDB and SwinIR, respectively. At the same time, DASR outperforms these heavy models in terms of reconstruction fidelity on all datasets, demonstrating the effectiveness of degradation-adaptive super-resolution and its high efficiency for practical deployment.
| D-Level | Metric | RRDB | ESRGAN | IKC | BSRGAN | Real-ESRGAN | Real-SwinIR-M | Real-SwinIR-L | DASR |
|---|---|---|---|---|---|---|---|---|---|
| Bicubic | PSNR | 30.92 | 28.17 | 28.01 | 27.32 | 26.65 | 26.83 | 27.21 | 28.55 |
| | LPIPS | 0.2537 | 0.1154 | 0.2695 | 0.2364 | 0.2284 | 0.2221 | 0.2135 | 0.1696 |
| Level-I | PSNR | 26.27 | 21.16 | 24.09 | 26.78 | 26.17 | 26.21 | 26.45 | 27.84 |
| | LPIPS | 0.3419 | 0.4727 | 0.3805 | 0.2412 | 0.2312 | 0.2247 | 0.2161 | 0.1707 |
| Level-II | PSNR | 26.46 | 22.77 | 25.39 | 26.75 | 26.16 | 26.12 | 26.39 | 27.58 |
| | LPIPS | 0.4441 | 0.4900 | 0.4531 | 0.2462 | 0.2391 | 0.2313 | 0.2213 | 0.2126 |
| Level-III | PSNR | 23.91 | 23.63 | 22.91 | 24.05 | 23.81 | 23.34 | 23.46 | 23.93 |
| | LPIPS | 0.7631 | 0.7314 | 0.7583 | 0.3995 | 0.3901 | 0.3844 | 0.3765 | 0.4144 |
Efficiency. The inference efficiency is a crucial factor in Real-ISR tasks due to the limited computational resources in practical applications. We compare different backbone networks in terms of multiple efficiency-related metrics and depict the results in the bottom rows of Table 2.
| D-Level | Metric | SRResNet | EDSR | SwinIR | RRDB | DASR |
|---|---|---|---|---|---|---|
| Bicubic | PSNR | 28.05 | 28.26 | 28.28 | 27.92 | 28.55 |
| | LPIPS | 0.1747 | 0.1807 | 0.1488 | 0.1473 | 0.1696 |
| Level-I | PSNR | 27.60 | 27.79 | 27.78 | 27.84 | 27.84 |
| | LPIPS | 0.1772 | 0.1834 | 0.1531 | 0.1569 | 0.1707 |
| Level-II | PSNR | 27.34 | 27.53 | 27.45 | 27.29 | 27.58 |
| | LPIPS | 0.2228 | 0.2284 | 0.1854 | 0.1886 | 0.2126 |
| Level-III | PSNR | 23.71 | 23.87 | 23.60 | 23.54 | 23.93 |
| | LPIPS | 0.4419 | 0.4351 | 0.3869 | 0.3847 | 0.4144 |
| Efficiency | Latency (ms) | 113 | 105 | 1719 | 460 | 142 |
| | #FLOPs (GMac) | 166 | 130 | 539 | 1176 | 184 |
| | #Params (M) | 1.52 | 1.52 | 11.72 | 16.70 | 8.07 |
| | #Memory (MB) | 2359 | 2169 | 2699 | 2417 | 2452 |
As shown in the table, the computational overhead of different backbone networks differs dramatically. For example, RRDB [49], which is employed in recent Real-ISR methods [60, 47], consumes about 7 times the FLOPs and more than 4 times the inference time of SRResNet [25]. In other words, RRDB-based Real-ISR methods achieve superior performance at the price of applicability. The recent transformer-based SwinIR has an acceptable number of FLOPs, yet it actually consumes much more inference time due to the heavy attention computation and frequent memory I/O.
Benefiting from the light SRResNet-based backbone and the efficient degradation prediction and parameter fusion, our DASR is very efficient. Specifically, the degradation prediction network and the weighting module together consume only a small fraction of the total FLOPs, latency, parameters, and GPU memory. Besides, the cost of the parameter fusion operation is negligible, as it involves only a few million multiplications and additions (proportional to the number of expert parameters), which can be computed in parallel. Compared with classical MoE methods that mix the feature maps of all experts [17, 21, 9, 50], our DASR conducts only one forward pass. As a result, the computational cost increases only slightly with a larger number of experts N, which supports a flexible extension of the model capacity.
It is worth mentioning that although our model has more parameters, the maximum GPU memory consumption does not increase much, as shown in the #Memory row of Table 2, since storing model parameters costs much less memory than storing input-dependent feature maps. Moreover, the increased number of parameters does not demand much storage space, which is much easier to afford than computing power.
Fig. 3 shows visual comparisons between different methods on images with different degradations. One can see that DASR stably restores sharp and realistic details and removes artifacts across a wide range of degradations. Specifically, the first sample is degraded with bicubic downsampling and suffers from aliasing. Both BSRGAN and Real-ESRGAN fail to generate satisfactory texture details even with the heavy RRDB backbone, because they are trained on pairs with relatively severe degradations, which strengthens their denoising capacity but limits their detail-generation capacity. Similar observations can be made on all four samples in Fig. 3.
The RRDB backbone trained with the pixel-wise loss performs well on the first two samples in generating texture details, yet it does not generalize to the last two samples, whose degradations are severe. This is reasonable since all its training pairs are generated by bicubic downsampling. In addition, the results of RRDB on the first and third samples are blurry, a well-acknowledged side-effect of the pixel-wise loss. By applying perceptual and adversarial losses, ESRGAN achieves sharper results yet introduces many visual artifacts due to the instability of training generative adversarial networks; it also amplifies the noise, as shown in the second sample. By considering different blur kernels, IKC can restore rich textures on most images, yet it brings overshoot artifacts when facing unseen kernels in real-world images (the fourth sample), and it lacks the capacity to remove noise, as shown in the second sample.
The results of Real-SRGAN are obtained by re-training SRResNet on our proposed degradation space with the same losses as Real-ESRGAN [47]. Due to its insufficient feature representation capacity, Real-SRGAN does not perform as well as our DASR on the four samples. In the first three samples, Real-SRGAN generates messy details or artifacts, as the lightweight model limits its capacity for degradation-adaptive super-resolution; on the last sample, a real-world image, it fails to reconstruct rich details. In contrast, our proposed DASR outperforms the others in reconstructing realistic details and suppressing artifacts, thanks to the effective degradation-adaptive framework and the joint optimization of multiple experts. More visual comparisons can be found in Section 6.4 of the Appendix.
We conduct comprehensive ablation studies on our proposed DASR model by using real-world images and depict the visual results in Fig. 4.
Effectiveness of the number of experts N. The models in Figs. 4(a) and (b) evaluate the selection of N. Using fewer experts leads to relatively smooth results, while the models with more experts in (h) and (b) enhance the generation of details. As the larger N in (b) shows visual quality similar to the default setting in (h), we consider the default number of experts sufficient to model the proposed degradation space.
Effectiveness of model design. Figs. 4(c) and (d) validate the effectiveness of our model design. The result in (c) shows that adding a sigmoid layer to the weighting module does not improve the performance; since we mix the experts in terms of model parameters, there is no need to enforce positive weights with a sigmoid layer. The experts in Fig. 4(d) are fused following the strategy of classical MoE [17, 21, 9, 50], where all experts are forwarded and their outputs are fused. The result of classical MoE in (d) lacks fine details compared to (h), yet its computational cost is about N times that of our DASR.
Effectiveness of different dynamic convolutions. Figs. 4(e) and (f) compare different dynamic convolution strategies [27, 53], which do not introduce many additional parameters. While their inference latency and FLOPs increase, their performance drops, e.g., the artifacts generated in (e). We attribute the advantage of DASR over these methods to the joint optimization of multiple experts and the degradation-adaptive mixture.
Generalization to different backbones. Fig. 4(g) applies the EDSR-M backbone to DASR. The satisfactory perceptual quality of (g) demonstrates the generalization capacity of the proposed DASR to different backbone networks.
One interesting advantage of our DASR over other Real-ISR methods is that it supports easy user-interactive super-resolution during inference, owing to its interpretable and compact degradation representation.
We depict an example of user-interactive super-resolution in Fig. 5. As can be seen, the proposed DASR allows explicit user control to customize the super-resolution effects. Manually setting larger values for the blur-related parameters (e.g., the kernel scale) leads to sharper super-resolution results, as shown in Fig. 5(c), while adjusting the noise level flexibly balances between image details and noise, as shown in Figs. 5(e) and (f). Such flexible user control makes our DASR very attractive in practical Real-ISR tasks.
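A possible shape of this interaction is sketched below; the accessor names (predict_degradation, weighting, super_resolve) and the index constants are hypothetical, not the released API:

```python
import torch

@torch.no_grad()
def interactive_sr(model, lr_image, overrides):
    v_hat = model.predict_degradation(lr_image)  # hypothetical accessor
    for idx, value in overrides.items():         # e.g., {BLUR_SIGMA_IDX: 0.9}
        v_hat[:, idx] = value                    # user-specified level in [0, 1]
    alpha = model.weighting(v_hat)               # re-derive expert weights
    return model.super_resolve(lr_image, alpha)  # one adapted forward pass
```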
In this paper, we proposed an efficient degradation-adaptive network, namely DASR, for the real-world image super-resolution (Real-ISR) task. In order to improve the modeling capacity and flexibility of various degradation levels, we jointly learned multiple super-resolution experts and adaptively mixed them into one expert in a degradation-aware manner. The proposed DASR was not only degradation adaptive but also efficient during inference. Extensive quantitative and qualitative experiments were conducted. The results demonstrated that DASR not only achieved superior performance on images with a wide range of degradation levels but also kept good efficiency for easy deployment. In addition, DASR allowed easy user control for customized super-resolution results.
We report the detailed parameter settings of our degradation modeling in Table 3. We partition the whole degradation space into three levels and randomly select one of them, with balanced probability, to generate the LR-HR image pairs during training. For the blur operation, we use both isotropic and anisotropic Gaussian kernels, where the two standard deviations are set equal if an isotropic kernel is specified. In the second degradation stage of Level-III, following the practice in Real-ESRGAN [47], we skip the blur operation with a certain probability and perform sinc kernel filtering with a certain probability. We finally resize the image to the desired LR size, i.e., 1/4 of the original size. For operations that have more than one mode, e.g., the resize mode, we use a one-hot vector in v to indicate the chosen mode. For the other parameters, we normalize each value as x̂ = (x − x_min)/(x_max − x_min), where x, x̂, x_min and x_max indicate the original value, the normalized value, and the minimum and maximum values of the parameter, respectively.
As discussed in Section 3.3, the total training loss is defined as ℒ = ℒ_pix + λ_per ℒ_per + λ_adv ℒ_adv + λ_reg ℒ_reg, where the regression loss ℒ_reg is given in Eq. (1) of the main paper. For the other three losses, the settings are the same as in Real-ESRGAN [47]. Specifically, the pixel loss ℒ_pix is defined as the ℓ1 distance ‖ ŷ − y ‖₁, where ŷ and y denote the super-resolved image and the ground-truth HR image, respectively. For the perceptual loss ℒ_per, we first extract the {conv1, conv2, conv3, conv4, conv5} feature maps of ŷ and y using the pre-trained VGG19 network [42], and then calculate the weighted sum of the respective ℓ1 distances between the feature maps, where the weights are set to [0.1, 0.1, 1, 1, 1]. For the adversarial loss ℒ_adv, a U-Net discriminator with spectral normalization is adopted.
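A sketch of this perceptual loss with torchvision's pre-trained VGG19 is given below; the tap indices correspond to conv1_2, conv2_2, conv3_4, conv4_4 and conv5_4 in torchvision's layout (assumed tap points), and inputs are assumed to be already normalized with ImageNet statistics:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    def __init__(self, weights=(0.1, 0.1, 1.0, 1.0, 1.0)):
        super().__init__()
        self.features = vgg19(weights="DEFAULT").features.eval()
        for p in self.features.parameters():
            p.requires_grad = False
        self.taps = [2, 7, 16, 25, 34]  # conv1_2 ... conv5_4 (assumed)
        self.weights = weights

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        loss, x, y = 0.0, sr, hr
        for i, layer in enumerate(self.features):
            x, y = layer(x), layer(y)
            if i in self.taps:
                loss = loss + self.weights[self.taps.index(i)] * (x - y).abs().mean()
            if i >= self.taps[-1]:
                break
        return loss
```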
In Fig. 6, we provide more sample images with different degradation levels in our datasets, as well as the ground-truth HR images. As can be seen from the figure, those images can cover a wide range of real-world degradations. The balanced sampling from the three levels during training improves the generalization capacity of our DASR to real-world images with different degradations.
In Fig. 7, we provide more qualitative comparisons of competing methods on real-world images, while in Figs. 8, 9, 10 and 11 we provide more qualitative comparisons on datasets with bicubic, Level-I, Level-II and Level-III degradations, respectively. Our models are trained on images from the DIV2K, Flickr2K, and OutdoorSceneTraining datasets. To further validate the generalization capability of DASR to different image contents, the visual comparisons in Figs. 8, 9, 10 and 11 also include images from the Urban100 dataset, degraded with the same strategy as in our main paper. From these figures, observations consistent with the main paper can be made: our DASR generates more realistic structures and details under different degradations, benefiting from its degradation-adaptive strategy and the joint training and adaptive mixture of multiple experts.
| Level | Operation | Parameter | Stage 1 Range | Stage 2 Range |
|---|---|---|---|---|
| Level-I | Blur | kernel size | | – |
| | | standard deviation σ₁ | | – |
| | | standard deviation σ₂ | | – |
| | | rotation degree | | – |
| | Resize | direction [up, down, keep] | | – |
| | | scale factor | | – |
| | | resize mode | [‘a’, ‘b’, ‘b’] | – |
| | Noise | type | [‘G’, ‘P’] | – |
| | | sigma of Gaussian | | – |
| | | scale of Poisson | | – |
| | | gray probability | | – |
| | JPEG | quality factor | | – |
| | | mode of final resize | [‘a’, ‘b’, ‘b’] | – |
| Level-II | Blur | kernel size | | – |
| | | standard deviation σ₁ | | – |
| | | standard deviation σ₂ | | – |
| | | rotation degree | | – |
| | Resize | direction [up, down, keep] | | – |
| | | scale factor | | – |
| | | resize mode | [‘a’, ‘b’, ‘b’] | – |
| | Noise | type | [‘G’, ‘P’] | – |
| | | sigma of Gaussian | | – |
| | | scale of Poisson | | – |
| | | gray probability | | – |
| | JPEG | quality factor | | – |
| | | mode of final resize | [‘a’, ‘b’, ‘b’] | – |
| Level-III | Blur | kernel size | | |
| | | standard deviation σ₁ | | |
| | | standard deviation σ₂ | | |
| | | rotation degree | | |
| | | sinc kernel size | | |
| | | ω_c of sinc kernel | | |
| | Resize | direction [up, down, keep] | | |
| | | scale factor | | |
| | | resize mode | [‘a’, ‘b’, ‘b’] | [‘a’, ‘b’, ‘b’] |
| | Noise | type | [‘G’, ‘P’] | [‘G’, ‘P’] |
| | | sigma of Gaussian | | |
| | | scale of Poisson | | |
| | | gray probability | | |
| | JPEG | quality factor | | |
| | | operating order | – | R-J or J-R |
| | | mode of final resize | – | [‘a’, ‘b’, ‘b’] |
Notes: [‘a’, ‘b’, ‘b’] denote the resize modes of [area, bilinear, bicubic]; [‘G’, ‘P’] denote the noise types of [Gaussian, Poisson]; ω_c is the cutoff frequency of the sinc kernel; R-J and J-R indicate the two operating orders of resizing and JPEG compression.