Image Restoration Toolbox (PyTorch). Training and testing codes for DnCNN, FFDNet, SRMD, DPSR, MSRResNet, ESRGAN, IMDN
Learning-based single image super-resolution (SISR) methods are continuously showing superior effectiveness and efficiency over traditional model-based methods, largely due to the end-to-end training. However, different from model-based methods that can handle the SISR problem with different scale factors, blur kernels and noise levels under a unified MAP (maximum a posteriori) framework, learning-based methods generally lack such flexibility. To address this issue, this paper proposes an end-to-end trainable unfolding network which leverages both learning-based methods and model-based methods. Specifically, by unfolding the MAP inference via a half-quadratic splitting algorithm, a fixed number of iterations consisting of alternately solving a data subproblem and a prior subproblem can be obtained. The two subproblems then can be solved with neural modules, resulting in an end-to-end trainable, iterative network. As a result, the proposed network inherits the flexibility of model-based methods to super-resolve blurry, noisy images for different scale factors via a single model, while maintaining the advantages of learning-based methods. Extensive experiments demonstrate the superiority of the proposed deep unfolding network in terms of flexibility, effectiveness and also generalizability.
Deep Unfolding Network for Image Super-Resolution (CVPR, 2020) (PyTorch)
Single image super-resolution (SISR) refers to the process of recovering the natural and sharp detailed high-resolution (HR) counterpart from a low-resolution (LR) image. It is one of the classical ill-posed inverse problems in low-level computer vision and has a wide range of real-world applications, such as enhancing the image visual quality on high-definition displays [53, 42] and improving the performance of other high-level vision tasks.
Despite decades of studies, SISR still requires further study for academic and industrial purposes [64, 35]. The difficulty is mainly caused by the inconsistency between the simplistic degradation assumption of existing SISR methods and the complex degradations of real images. Actually, for a scale factor of $s$, the classical (traditional) degradation model of SISR [17, 18, 37] assumes the LR image $\mathbf{y}$ is a blurred, decimated, and noisy version of an HR image $\mathbf{x}$. Mathematically, it can be expressed by

$$\mathbf{y} = (\mathbf{x} \otimes \mathbf{k})\downarrow_{s} + \mathbf{n}, \tag{1}$$

where $\otimes$ represents two-dimensional convolution of $\mathbf{x}$ with blur kernel $\mathbf{k}$, $\downarrow_{s}$ denotes the standard $s$-fold downsampler, i.e., keeping the upper-left pixel for each distinct $s \times s$ patch and discarding the others, and $\mathbf{n}$ is usually assumed to be additive white Gaussian noise (AWGN) specified by standard deviation (or noise level) $\sigma$. With a clear physical meaning, Eq. (1) can approximate a variety of LR images by setting proper blur kernels, scale factors and noises for an underlying HR image. In particular, Eq. (1) has been extensively studied in model-based methods which solve a combination of a data term and a prior term under the MAP framework.
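As an illustration of Eq. (1), the following NumPy sketch synthesizes an LR image from an HR image by blurring, keeping the upper-left pixel of each distinct patch, and adding Gaussian noise. It is a minimal sketch, not the authors' data pipeline: the function names `circular_blur` and `degrade`, the single-channel input, and the circular boundary handling are assumptions made here for brevity.

```python
import numpy as np

def circular_blur(x, k):
    """2-D convolution of image x with kernel k under circular boundary conditions."""
    h, w = x.shape
    kh, kw = k.shape
    # Zero-pad the kernel to the image size and move its center to index (0, 0).
    k_pad = np.zeros((h, w))
    k_pad[:kh, :kw] = k
    k_pad = np.roll(k_pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k_pad)))

def degrade(x, k, s, sigma, rng):
    """Blur x with k, keep the upper-left pixel of each s x s patch, add AWGN."""
    blurred = circular_blur(x, k)
    decimated = blurred[::s, ::s]
    noise = rng.normal(0.0, sigma, size=decimated.shape)
    return decimated + noise

rng = np.random.default_rng(0)
x = rng.random((12, 12))          # toy single-channel "HR image"
k = np.ones((3, 3)) / 9.0         # toy box blur kernel
y = degrade(x, k, s=3, sigma=0.0, rng=rng)
```

With `sigma=0.0` the output is just the decimated blurred image, which makes the roles of the kernel and the downsampler easy to inspect in isolation.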
Though model-based methods are usually algorithmically interpretable, they typically lack a standard criterion for their evaluation because, apart from the scale factor, Eq. (1) additionally involves a blur kernel and added noise. For convenience, researchers resort to bicubic degradation without consideration of blur kernel and noise level [60, 56, 14]. However, bicubic degradation is mathematically complicated, which in turn hinders the development of model-based methods. For this reason, recently proposed SISR solutions are dominated by learning-based methods that learn a mapping function from a bicubicly downsampled LR image to its HR estimation. Indeed, significant progress on improving PSNR [26, 70] and perceptual quality [31, 47, 58] for the bicubic degradation has been achieved by learning-based methods, among which convolutional neural network (CNN) based methods are the most popular, due to their powerful learning capacity and the speed of parallel computing. Nevertheless, little work has been done on applying CNNs to tackle Eq. (1) via a single model. Unlike model-based methods, CNNs usually lack flexibility to super-resolve blurry, noisy LR images for different scale factors via a single end-to-end trained model (see Fig. 1).
In this paper, we propose a deep unfolding super-resolution network (USRNet) to bridge the gap between learning-based methods and model-based methods. On one hand, similar to model-based methods, USRNet can effectively handle the classical degradation model (i.e., Eq. (1)) with different blur kernels, scale factors and noise levels via a single model. On the other hand, similar to learning-based methods, USRNet can be trained in an end-to-end fashion to guarantee effectiveness and efficiency. To achieve this, we first unfold the model-based energy function via a half-quadratic splitting algorithm. Correspondingly, we can obtain an inference which iteratively alternates between solving two subproblems, one related to a data term and the other to a prior term. We then treat the inference as a deep network, by replacing the solutions to the two subproblems with neural modules. Since the two subproblems correspond respectively to enforcing degradation consistency knowledge and guaranteeing denoiser prior knowledge, USRNet is well-principled with explicit degradation and prior constraints, which is a distinctive advantage over existing learning-based SISR methods. It is worth noting that since USRNet involves a hyper-parameter for each subproblem, the network contains an additional module for hyper-parameter generation. Moreover, in order to reduce the number of parameters, all the prior modules share the same architecture and same parameters.
The main contributions of this work are as follows:
An end-to-end trainable unfolding super-resolution network (USRNet) is proposed. USRNet is the first attempt to handle the classical degradation model with different scale factors, blur kernels and noise levels via a single end-to-end trained model.
USRNet integrates the flexibility of model-based methods and the advantages of learning-based methods, providing an avenue to bridge the gap between model-based and learning-based methods.
USRNet intrinsically imposes a degradation constraint (i.e., the estimated HR image should accord with the degradation process) and a prior constraint (i.e., the estimated HR image should have natural characteristics) on the solution.
USRNet performs favorably on LR images with different degradation settings, showing great potential for practical applications.
Knowledge of the degradation model is crucial for the success of SISR [59, 16] because it defines how the LR image is degraded from an HR image. Apart from the classical degradation model and bicubic degradation model, several others have also been proposed in the SISR literature.
In some early works, the degradation model assumes the LR image is directly downsampled from the HR image without blurring, which corresponds to the problem of image interpolation. In [52, 34], the bicubicly downsampled image is further assumed to be corrupted by Gaussian noise or JPEG compression noise. In [15, 42], the degradation model focuses on Gaussian blurring and a subsequent downsampling with scale factor 3. Note that, different from Eq. (1), their downsampling keeps the center rather than the upper-left pixel for each distinct 3×3 patch. In another line of work, the degradation model assumes the LR image is the blurred, bicubicly downsampled HR image with some Gaussian noise. By assuming the bicubicly downsampled clean HR image is also clean, a further approach treats the degradation model as a composition of deblurring on the LR image and SISR with bicubic degradation.
While many degradation models have been proposed, CNN-based SISR for the classical degradation model has received little attention and deserves further study.
Although CNN-based SISR methods have achieved impressive success in handling bicubic degradation, applying them to other, more practical degradation models is not straightforward. For the sake of practicability, it is preferable to design a flexible super-resolver that takes the three key factors, i.e., scale factor, blur kernel and noise level, into consideration.
Several methods have been proposed to tackle bicubic degradation with different scale factors via a single model, such as LapSR with progressive upsampling, MDSR with scale-specific branches, and Meta-SR with a meta-upscale module. To flexibly deal with a blurry LR image, the methods proposed in [44, 67] take the PCA dimension reduced blur kernel as input. However, these methods are limited to Gaussian blur kernels. Perhaps the most flexible CNN-based works, which can handle various blur kernels, scale factors and noise levels, are the deep plug-and-play methods [65, 68]. The main idea of such methods is to plug the learned CNN prior into the iterative solution under the MAP framework. Unfortunately, these are essentially model-based methods which suffer from a high computational burden and involve manually selected hyper-parameters. How to design an end-to-end trainable model so that better results can be achieved with fewer iterations remains uninvestigated.
While learning-based blind image restoration has recently received considerable attention [50, 12, 62, 43, 39], we note that this work focuses on non-blind SISR which assumes the LR image, blur kernel and noise level are known beforehand. In fact, non-blind SISR is still an active research direction. First, the blur kernel and noise level can be estimated, or are known based on other information (e.g., camera setting). Second, users can control the preference of sharpness and smoothness by tuning the blur kernel and noise level. Third, non-blind SISR can be an intermediate step towards solving blind SISR.
Apart from the deep plug-and-play methods, deep unfolding methods can also integrate model-based methods and learning-based methods. Their main difference is that the latter optimize the parameters in an end-to-end manner by minimizing the loss function over a large training set, and thus generally produce better results even with fewer iterations. The early deep unfolding methods can be traced back to [4, 48, 54], where a compact MAP inference based on a gradient descent algorithm is proposed for image denoising. Since then, a flurry of deep unfolding methods based on certain optimization algorithms (e.g., half-quadratic splitting, alternating direction method of multipliers, and primal-dual [9, 1]) have been proposed to solve different image restoration tasks, such as image denoising [11, 32], image deblurring [49, 29], image compressive sensing [61, 63], and image demosaicking.
Compared to plain learning-based methods, deep unfolding methods are interpretable and can fuse the degradation constraint into the learning model. However, most of them suffer from one or several of the following drawbacks. (i) The solution of the prior subproblem without using a deep CNN is not powerful enough for good performance. (ii) The data subproblem is not solved by a closed-form solution, which may hinder convergence. (iii) The whole inference is trained via a stage-wise and fine-tuning manner rather than a complete end-to-end manner. Furthermore, given that there exists no deep unfolding SISR method to handle the classical degradation model, it is of particular interest to propose such a method that overcomes the above mentioned drawbacks.
Since bicubic degradation is well-studied, it is interesting to investigate its relationship to the classical degradation model. Actually, the bicubic degradation can be approximated by setting a proper blur kernel in Eq. (1). To achieve this, we adopt a data-driven method to solve the following kernel estimation problem by minimizing the reconstruction error over a large set of HR/bicubic-LR pairs $\{(\mathbf{x}, \mathbf{y})\}$,

$$\mathbf{k}^{\times s}_{bicubic} = \arg\min_{\mathbf{k}} \|(\mathbf{x} \otimes \mathbf{k})\downarrow_{s} - \mathbf{y}\|. \tag{2}$$
Fig. 2 shows the approximated bicubic kernels for scale factors 2, 3 and 4. It should be noted that since the downsampling operation selects the upper-left pixel for each distinct patch, the bicubic kernels for scale factors 2, 3 and 4 have a center shift of 0.5, 1 and 1.5 pixels to the upper-left direction, respectively.
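According to the MAP framework, the HR image could be estimated by minimizing the following energy function

$$E(\mathbf{x}) = \frac{1}{2\sigma^2}\|\mathbf{y} - (\mathbf{x} \otimes \mathbf{k})\downarrow_{s}\|^2 + \lambda\Phi(\mathbf{x}), \tag{3}$$

where $\frac{1}{2\sigma^2}\|\mathbf{y} - (\mathbf{x} \otimes \mathbf{k})\downarrow_{s}\|^2$ is the data term, $\Phi(\mathbf{x})$ is the prior term, and $\lambda$ is a trade-off parameter. In order to obtain an unfolding inference for Eq. (3), the half-quadratic splitting (HQS) algorithm is selected due to its simplicity and fast convergence in many applications. HQS tackles Eq. (3) by introducing an auxiliary variable $\mathbf{z}$, leading to the following approximate equivalence

$$E_{\mu}(\mathbf{x}, \mathbf{z}) = \frac{1}{2\sigma^2}\|\mathbf{y} - (\mathbf{z} \otimes \mathbf{k})\downarrow_{s}\|^2 + \lambda\Phi(\mathbf{x}) + \frac{\mu}{2}\|\mathbf{z} - \mathbf{x}\|^2, \tag{4}$$

where $\mu$ is the penalty parameter. Such a problem can be addressed by iteratively solving subproblems for $\mathbf{x}$ and $\mathbf{z}$

$$\mathbf{z}_k = \arg\min_{\mathbf{z}} \|\mathbf{y} - (\mathbf{z} \otimes \mathbf{k})\downarrow_{s}\|^2 + \mu\sigma^2\|\mathbf{z} - \mathbf{x}_{k-1}\|^2, \tag{5}$$

$$\mathbf{x}_k = \arg\min_{\mathbf{x}} \frac{\mu}{2}\|\mathbf{z}_k - \mathbf{x}\|^2 + \lambda\Phi(\mathbf{x}). \tag{6}$$

According to Eq. (5), $\mu$ should be large enough so that $\mathbf{x}$ and $\mathbf{z}$ are approximately equal to the fixed point. However, this would also result in slow convergence. Therefore, a good rule of thumb is to iteratively increase $\mu$. For convenience, the $\mu$ in the $k$-th iteration is denoted by $\mu_k$.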
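To solve the data subproblem in Eq. (5), the fast Fourier transform (FFT) can be utilized by assuming the convolution is carried out with circular boundary conditions. Notably, it has a closed-form expression

$$\mathbf{z}_k = \mathcal{F}^{-1}\left(\frac{1}{\alpha_k}\left(\mathbf{d} - \overline{\mathcal{F}(\mathbf{k})} \odot_{s} \frac{(\mathcal{F}(\mathbf{k})\mathbf{d})\Downarrow_{s}}{(\overline{\mathcal{F}(\mathbf{k})}\mathcal{F}(\mathbf{k}))\Downarrow_{s} + \alpha_k}\right)\right), \tag{7}$$

where $\mathbf{d}$ is defined as

$$\mathbf{d} = \overline{\mathcal{F}(\mathbf{k})}\mathcal{F}(\mathbf{y}\uparrow_{s}) + \alpha_k\mathcal{F}(\mathbf{x}_{k-1})$$

with $\alpha_k \triangleq \mu_k\sigma^2$ and where $\mathcal{F}(\cdot)$ and $\mathcal{F}^{-1}(\cdot)$ denote FFT and inverse FFT, $\overline{\mathcal{F}(\cdot)}$ denotes the complex conjugate of $\mathcal{F}(\cdot)$, $\odot_{s}$ denotes the distinct block processing operator with element-wise multiplication, i.e., applying element-wise multiplication to the $s \times s$ distinct blocks of $\overline{\mathcal{F}(\mathbf{k})}$, $\Downarrow_{s}$ denotes the distinct block downsampler, i.e., averaging the $s \times s$ distinct blocks, and $\uparrow_{s}$ denotes the standard $s$-fold upsampler, i.e., upsampling the spatial size by filling the new entries with zeros. It is especially noteworthy that Eq. (7) also works for the special case of deblurring when $s = 1$. For the solution of Eq. (6), it is known that, from a Bayesian perspective, it actually corresponds to a denoising problem with noise level $\beta_k \triangleq \sqrt{\lambda/\mu_k}$.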
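The closed-form data step of Eq. (7) can be sketched in NumPy as follows. This is a hedged, single-channel illustration rather than the authors' implementation: the helper names `kernel_otf`, `block_mean` and `data_step` are hypothetical, the kernel is assumed to be centered at its middle pixel, and the image size is assumed to be divisible by the scale factor. Here `block_mean` plays the role of the distinct block downsampler, and `np.tile` applies the block-wise multiplication.

```python
import numpy as np

def kernel_otf(k, shape):
    """FFT of the blur kernel, zero-padded to `shape` with its center at (0, 0)."""
    kh, kw = k.shape
    k_pad = np.zeros(shape)
    k_pad[:kh, :kw] = k
    k_pad = np.roll(k_pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(k_pad)

def block_mean(a, s):
    """Average the s x s distinct blocks of a 2-D spectrum."""
    h, w = a.shape
    return a.reshape(s, h // s, s, w // s).mean(axis=(0, 2))

def data_step(x_prev, y, k, s, alpha):
    """Closed-form minimizer of ||y - (z*k) downsampled||^2 + alpha * ||z - x_prev||^2."""
    Fk = kernel_otf(k, x_prev.shape)
    # s-fold upsampling of y by zero filling.
    y_up = np.zeros_like(x_prev)
    y_up[::s, ::s] = y
    d = np.conj(Fk) * np.fft.fft2(y_up) + alpha * np.fft.fft2(x_prev)
    ratio = block_mean(Fk * d, s) / (block_mean(np.abs(Fk) ** 2, s) + alpha)
    # Block-wise multiplication: repeat the low-resolution ratio over the s x s blocks.
    z_hat = (d - np.conj(Fk) * np.tile(ratio, (s, s))) / alpha
    return np.real(np.fft.ifft2(z_hat))

rng = np.random.default_rng(0)
s, alpha = 2, 0.1
x_prev = rng.random((8, 8))       # current HR estimate
y = rng.random((4, 4))            # LR observation
k = rng.random((3, 3))
k /= k.sum()                      # normalized toy blur kernel
z = data_step(x_prev, y, k, s, alpha)
```

One way to sanity-check such a sketch is to confirm that the gradient of the data subproblem vanishes at the returned `z`, since the objective is a strictly convex quadratic for positive `alpha`.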
Once the unfolding optimization is determined, the next step is to design the unfolding super-resolution network (USRNet). Because the unfolding optimization mainly consists of iteratively solving a data subproblem (i.e., Eq. (5)) and a prior subproblem (i.e., Eq. (6)), USRNet should alternate between a data module $\mathcal{D}$ and a prior module $\mathcal{P}$. In addition, as the solutions of the subproblems also take the hyper-parameters $\alpha_k$ and $\beta_k$ as input, respectively, a hyper-parameter module $\mathcal{H}$ is further introduced into USRNet. Fig. 3 illustrates the overall architecture of USRNet with $K$ iterations, where $K$ is empirically set to 8 for the speed-accuracy trade-off. Next, more details on $\mathcal{D}$, $\mathcal{P}$ and $\mathcal{H}$ are provided.
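The alternation between the data module and the prior module can be sketched as a short loop. The code below is only an outline under stated assumptions: `unfold` is a hypothetical driver, `toy_data_module` is the exact data-step solution for the special case of a delta blur kernel (the kernel argument is ignored), and `toy_prior_module` is a crude placeholder for the CNN denoiser, not ResUNet. The hyper-parameter values are arbitrary.

```python
import numpy as np

def unfold(y, s, k, data_module, prior_module, alphas, betas):
    """Alternate a data step and a prior step for len(alphas) iterations."""
    # x_0: nearest-neighbour upsampling of the LR image by the scale factor s.
    x = np.repeat(np.repeat(y, s, axis=0), s, axis=1)
    for alpha_k, beta_k in zip(alphas, betas):
        z = data_module(x, s, k, y, alpha_k)   # enforce degradation consistency
        x = prior_module(z, beta_k)            # denoise with noise level beta_k
    return x

def toy_data_module(x, s, k, y, alpha):
    # Hypothetical stand-in: exact data-step solution when the blur kernel is a
    # delta (k is ignored); sampled pixels are pulled toward y, others stay.
    z = x.copy()
    z[::s, ::s] = (y + alpha * x[::s, ::s]) / (1.0 + alpha)
    return z

def toy_prior_module(z, beta):
    # Hypothetical stand-in for the CNN denoiser: shrink toward the global mean.
    return (z + beta * z.mean()) / (1.0 + beta)

K = 8                                          # number of iterations, as in the text
rng = np.random.default_rng(0)
y = rng.random((4, 4))
x_hat = unfold(y, 2, None, toy_data_module, toy_prior_module,
               alphas=np.full(K, 0.5), betas=np.linspace(0.2, 0.02, K))
```

Passing the modules as callables keeps the loop independent of how each subproblem is solved, which mirrors the decoupling of data term and prior term discussed in the text.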
The data module plays the role of Eq. (7), which is the closed-form solution of the data subproblem. Intuitively, it aims to find a clearer HR image which minimizes a weighted combination of the data term and the quadratic regularization term with trade-off hyper-parameter $\alpha_k$. Because the data term corresponds to the degradation model, the data module not only has the advantage of taking the scale factor $s$ and blur kernel $\mathbf{k}$ as input but also imposes a degradation constraint on the solution. Actually, it is difficult to manually design such a simple but useful multiple-input module. For brevity, Eq. (7) is rewritten as

$$\mathbf{z}_k = \mathcal{D}(\mathbf{x}_{k-1}, s, \mathbf{k}, \mathbf{y}, \alpha_k). \tag{8}$$

Note that $\mathbf{x}_0$ is initialized by interpolating $\mathbf{y}$ with scale factor $s$ via the simplest nearest neighbor interpolation. It should be noted that Eq. (8) contains no trainable parameters, which in turn results in better generalizability due to the complete decoupling between data term and prior term. For the implementation, we use PyTorch, where the main FFT and inverse FFT operators can be implemented by torch.rfft and torch.irfft, respectively.
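The prior module aims to obtain a cleaner HR image $\mathbf{x}_k$ by passing $\mathbf{z}_k$ through a denoiser with noise level $\beta_k$. Inspired by previous work, we propose a deep CNN denoiser that takes the noise level as input

$$\mathbf{x}_k = \mathcal{P}(\mathbf{z}_k, \beta_k). \tag{9}$$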
The proposed denoiser, namely ResUNet, integrates residual blocks into U-Net. U-Net is widely used for image-to-image mapping, while ResNet owes its popularity to fast training and its large capacity with many residual blocks. ResUNet takes the concatenated $\mathbf{z}_k$ and noise level map as input and outputs the denoised image $\mathbf{x}_k$. By doing so, ResUNet can handle various noise levels via a single model, which significantly reduces the total number of parameters. Following the common setting of U-Net, ResUNet involves four scales, each of which has an identity skip connection between downscaling and upscaling operations. Specifically, the number of channels in each layer from the first scale to the fourth scale are set to 64, 128, 256 and 512, respectively. For the downscaling and upscaling operations, 2×2 strided convolution (SConv) and 2×2 transposed convolution (TConv) are adopted, respectively. Note that no activation function follows the SConv and TConv layers, nor the first and the last convolutional layers. For the sake of inheriting the merits of ResNet, a group of 2 residual blocks is adopted in the downscaling and upscaling of each scale. Each residual block is composed of two 3×3 convolution layers with ReLU activation in the middle and an identity skip connection summed to its output.
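To make the denoiser input concrete, the sketch below builds the concatenation of an image and a uniform noise level map, and lists the feature-map shapes implied by four scales with the stated channel widths. It is an illustration under assumptions: the channels-first layout, the helper names `prior_input` and `scale_shapes`, and the halving of the spatial size between consecutive scales (as a 2×2 strided convolution with stride 2 would produce) are choices made here, not details confirmed by the text.

```python
import numpy as np

def prior_input(z, beta):
    """Concatenate an image z of shape (C, H, W) with a uniform noise level map."""
    _, h, w = z.shape
    noise_map = np.full((1, h, w), beta, dtype=z.dtype)
    return np.concatenate([z, noise_map], axis=0)

def scale_shapes(h, w, channels=(64, 128, 256, 512)):
    """Feature-map shapes per scale, assuming the spatial size halves at each downscaling."""
    return [(c, h // 2 ** i, w // 2 ** i) for i, c in enumerate(channels)]

z = np.zeros((3, 64, 64))                 # toy RGB estimate
net_in = prior_input(z, beta=0.1)         # shape (4, 64, 64)
shapes = scale_shapes(64, 64)
```

Because the noise level enters as an extra input channel, the same set of weights can be reused for every noise level, which is what allows a single denoiser to serve all iterations.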
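The hyper-parameter module acts as a ‘slide bar’ to control the outputs of the data module and prior module. For example, the solution $\mathbf{z}_k$ would gradually approach $\mathbf{x}_{k-1}$ as $\alpha_k$ increases. According to the definition of $\alpha_k$ and $\beta_k$, $\alpha_k$ is determined by $\sigma$ and $\mu_k$, while $\beta_k$ depends on $\lambda$ and $\mu_k$. Although it is possible to learn a fixed $\lambda$ and $\mu_k$, we argue that a performance gain can be obtained if $\lambda$ and $\mu_k$ vary with two key elements, i.e., scale factor $s$ and noise level $\sigma$, that influence the degree of ill-posedness. Let $\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \ldots, \alpha_K]$ and $\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_K]$; we use a single module to predict $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$

$$[\boldsymbol{\alpha}, \boldsymbol{\beta}] = \mathcal{H}(\sigma, s). \tag{10}$$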
The hyper-parameter module consists of three fully connected layers with ReLU as the first two activation functions and Softplus as the last. The number of hidden nodes in each layer is 64. Considering the fact that $\alpha_k$ and $\beta_k$ should be positive, and Eq. (7) should avoid division by extremely small $\alpha_k$, the output Softplus layer is followed by an extra addition of 1e-6. We will show how the scale factor and noise level affect the hyper-parameters in Sec. 4.4.
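The structure of the hyper-parameter module can be sketched as a small fully connected network in NumPy. This is an untrained illustration with random weights: the function names `init_params` and `hyper_module`, the weight initialization, the two-element input (noise level, scale factor), and the output size of two values per iteration are assumptions made for the sketch.

```python
import numpy as np

K = 8  # number of unfolding iterations, as stated in the text

def init_params(rng, n_in=2, hidden=64, n_out=2 * K):
    """Random (untrained) weights for three fully connected layers."""
    sizes = [n_in, hidden, hidden, n_out]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def hyper_module(sigma, s, params):
    """Map (noise level, scale factor) to K values each for the two hyper-parameter sets."""
    h = np.array([sigma, s], dtype=float)
    (w1, b1), (w2, b2), (w3, b3) = params
    h = np.maximum(h @ w1 + b1, 0.0)              # ReLU
    h = np.maximum(h @ w2 + b2, 0.0)              # ReLU
    out = np.logaddexp(0.0, h @ w3 + b3) + 1e-6   # Softplus, then the extra 1e-6
    return out[:K], out[K:]

rng = np.random.default_rng(0)
params = init_params(rng)
alphas, betas = hyper_module(sigma=0.03, s=4, params=params)
```

Using `np.logaddexp(0, v)` for Softplus avoids overflow for large inputs, and the added 1e-6 keeps every output away from zero so that a division by the data-step weight stays well defined.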
The end-to-end training aims to learn the trainable parameters of USRNet by minimizing a loss function over a large training data set. Thus, this section mainly describes the training data, loss function and training settings. We use DIV2K and Flickr2K as the HR training dataset. The LR images are synthesized via Eq. (1). Although USRNet focuses on SISR, it is also applicable to the case of deblurring with $s = 1$. Hence, the scale factors are chosen from $\{1, 2, 3, 4\}$. However, due to limited space, this paper does not consider the deblurring experiments. For the blur kernels, we use anisotropic Gaussian kernels as in [44, 51, 67] and motion kernels. We fix the kernel size to 25×25. For the noise level, a fixed range is used during training.
With regard to the loss function, we adopt the L1 loss for PSNR performance. Once the model is obtained, we further adopt a weighted combination of L1 loss, VGG perceptual loss and relativistic adversarial loss for perceptual quality performance. We refer to such a fine-tuned model as USRGAN. As usual, USRGAN only considers scale factor 4. We do not use additional losses to constrain the intermediate outputs since the above losses work well. One possible reason is that the prior module shares parameters across iterations.
To optimize the parameters of USRNet, we adopt the Adam solver with mini-batch size 128. The learning rate is decayed by a factor of 0.5 at fixed iteration intervals during training. It is worth pointing out that due to the infeasibility of parallel computing for different scale factors, each mini-batch only involves one random scale factor. For USRGAN, the learning rate is kept fixed. The same HR patch size is used for both USRNet and USRGAN. We train the models with PyTorch on 4 Nvidia Tesla V100 GPUs in Amazon AWS cloud. It takes about two days to obtain the USRNet model.
[Figure panels: Zoomed LR (×4); RCAN; IKC; IRCNN; USRNet (ours); RankSRGAN; USRGAN (ours)]
We choose the widely-used color BSD68 dataset [40, 46] to quantitatively evaluate different methods. The dataset consists of 68 images with tiny structures and fine textures, and thus it is challenging to improve quantitative metrics such as PSNR. For the sake of synthesizing the corresponding testing LR images via Eq. (1), blur kernels and noise levels should be provided. Generally, it would be helpful to employ a large variety of blur kernels and noise levels for a thorough evaluation; however, this would also give rise to a burdensome evaluation process. For this reason, as shown in Table 1, we only consider 12 representative and diverse blur kernels, including 4 isotropic Gaussian kernels with different widths (i.e., 0.7, 1.2, 1.6 and 2.0), 4 anisotropic Gaussian kernels, and 4 motion blur kernels from [33, 5]. While it has been pointed out that anisotropic Gaussian kernels are enough for the SISR task [44, 51], an SISR method that can handle more complex blur kernels would be a preferred choice in real applications. Therefore, it is necessary to further analyze the kernel robustness of different methods, and we thus separately report the PSNR results for each blur kernel rather than for each type of blur kernel. Although it has been pointed out that the proper blur kernel should vary with scale factor, we argue that the 12 blur kernels are diverse enough to cover a large kernel space. For the noise levels, we choose 2.55 (1%) and 7.65 (3%).
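For reference, the isotropic Gaussian test kernels and the two noise levels can be generated along the following lines. This is a sketch under assumptions: the "width" is interpreted here as the Gaussian standard deviation in pixels, the 25×25 support follows the training kernel size mentioned in the text, and `isotropic_gaussian_kernel` is a hypothetical helper name. The noise levels are the stated percentages of the 8-bit intensity range 255.

```python
import numpy as np

def isotropic_gaussian_kernel(width, size=25):
    """Normalized isotropic Gaussian kernel; `width` is taken as the standard deviation."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * width ** 2))
    return k / k.sum()

# The four isotropic widths used in the evaluation.
kernels = {w: isotropic_gaussian_kernel(w) for w in (0.7, 1.2, 1.6, 2.0)}

# 1% and 3% of the 8-bit range 255.
noise_levels = [round(p * 255, 2) for p in (0.01, 0.03)]
```

Normalizing each kernel to unit sum keeps the mean image intensity unchanged after blurring.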
The average PSNR results of different methods for different degradation settings are reported in Table 1. The compared methods include RCAN, ZSSR, IKC and IRCNN. Specifically, RCAN is a state-of-the-art PSNR oriented method for bicubic degradation; ZSSR is a non-blind zero-shot learning method with the ability to handle Eq. (1) for anisotropic Gaussian kernels; IKC is a blind iterative kernel correction method for isotropic Gaussian kernels; IRCNN is a non-blind deep denoiser based plug-and-play method. For a fair comparison, we modified IRCNN to handle Eq. (1) by replacing its data solution with Eq. (7). Note that we fix the pixel shift issue before calculating PSNR if necessary.
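For readers unfamiliar with the metric, PSNR between a reference image and an estimate can be computed as below. This is the standard definition for a given peak value, written as a minimal sketch; the function name `psnr` and the default peak of 255 (8-bit images) are choices made here, and details of the evaluation protocol such as border cropping or pixel shift correction are not covered.

```python
import numpy as np

def psnr(ref, est, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference and an estimate."""
    ref = np.asarray(ref, dtype=float)
    est = np.asarray(est, dtype=float)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")      # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((4, 4), 100.0)
est = ref + 1.0                  # every pixel off by one intensity level
value = psnr(ref, est)
```

Since PSNR is logarithmic in the mean squared error, a gain of 1 dB corresponds to roughly a 21% reduction in MSE.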
According to Table 1, we can make the following observations. First, our USRNet with a single model significantly outperforms the other competitive methods on different scale factors, blur kernels and noise levels. In particular, with far fewer iterations, USRNet has at least an average PSNR gain of 1dB over IRCNN with 30 iterations due to the end-to-end training. Second, RCAN can achieve good performance on degradation settings similar to bicubic degradation but deteriorates seriously when the degradation deviates from bicubic degradation. Such a phenomenon has been well studied. Third, ZSSR performs well on both isotropic and anisotropic Gaussian blur kernels for small scale factors but loses effectiveness on motion blur kernels and large scale factors. Actually, ZSSR has difficulty in capturing natural image characteristics on severely degraded images due to the single image learning strategy. Fourth, IKC does not generalize well to anisotropic Gaussian kernels and motion kernels.
Although USRNet is not designed for bicubic degradation, it is interesting to test its results by taking the approximated bicubic kernels in Fig. 2 as input. From Table 2, one can see that USRNet still performs favorably without training on the bicubic kernels.
The visual results of different methods on super-resolving a noise-free LR image with scale factor 4 are shown in Fig. 4. Apart from RCAN, IKC and IRCNN, we also include RankSRGAN for comparison with our USRGAN. Note that the visual results of ZSSR are omitted due to the inferior performance on scale factor 4. It can be observed from Fig. 4 that USRNet and IRCNN produce much better visual results than RCAN and IKC on the LR image with motion blur kernel. While USRNet can recover sharper edges than IRCNN, both of them fail to produce realistic textures. As expected, USRGAN yields visually more pleasant results than USRNet. On the other hand, RankSRGAN does not perform well if the degradation largely deviates from the bicubic degradation. In contrast, USRGAN is flexible enough to handle various LR images.
Because the proposed USRNet is an iterative method, it is interesting to investigate the HR estimations of the data module $\mathcal{D}$ and prior module $\mathcal{P}$ in different iterations. Fig. 5 shows the results of USRNet and USRGAN in different iterations for an LR image with scale factor 4. As one can see, $\mathcal{D}$ and $\mathcal{P}$ can facilitate each other for iterative and alternating blur removal and detail recovery. Interestingly, $\mathcal{P}$ can also act as a detail enhancer for high-frequency recovery due to the task-specific training. In addition, it does not reduce blur kernel induced degradation, which verifies the decoupling between $\mathcal{D}$ and $\mathcal{P}$. As a result, the end-to-end trained USRNet has a task-specific advantage over Gaussian denoiser based plug-and-play SISR. To quantitatively analyze the role of $\mathcal{D}$, we have trained a USRNet model with 5 iterations; it turns out that the average PSNR value decreases by about 0.1dB on Gaussian blur kernels and 0.3dB on motion blur kernels. This further indicates that $\mathcal{D}$ aims to eliminate blur kernel induced degradation. In addition, one can see that USRGAN has similar results to USRNet in the first few iterations, but instead recovers tiny structures and fine textures in the last few iterations.
Fig. 6 shows outputs of the hyper-parameter module for different combinations of scale factor $s$ and noise level $\sigma$. It can be observed from Fig. 6(a) that $\boldsymbol{\alpha}$ is positively correlated with $\sigma$ and varies with $s$. This actually accords with the definition of $\alpha_k$ in Sec. 3.2 and our analysis in Sec. 3.3. From Fig. 6(b), one can see that $\boldsymbol{\beta}$ has a decreasing tendency with the number of iterations and increases with scale factor and noise level. This implies that the noise level of the HR estimation is gradually reduced across iterations and that complex degradation requires a large $\beta_k$ to tackle the ill-posedness. It should be pointed out that the learned hyper-parameter setting is in accordance with that of IRCNN. In summary, the learned $\mathcal{H}$ is meaningful as it plays the proper role.
[Figure panels: (a) Zoomed LR (×3); (b) USRNet; (c) Zoomed LR (×3); (d) USRGAN]
As mentioned earlier, the proposed method enjoys good generalizability due to the decoupling of data term and prior term. To demonstrate such an advantage, Fig. 7 shows the visual results of USRNet and USRGAN on an LR image with a kernel of much larger size than the training size of 25×25. It can be seen that both USRNet and USRGAN can produce visually pleasant results, which can be attributed to the trainable parameter-free data module. It is worth pointing out that USRGAN is trained on scale factor 4, while Fig. 7(b) shows its visual result on scale factor 3. This further indicates that the prior module of USRGAN can generalize to other scale factors. In summary, the proposed deep unfolding architecture generalizes well.
Because Eq. (7) is based on the assumption of circular boundary conditions, a proper boundary handling for the real LR image is generally required. We use the following three steps to do such pre-processing. First, the LR image is interpolated to the desired size. Second, a boundary handling method is applied to the interpolated image with the blur kernel. Last, the downsampled boundaries are padded to the original LR image. Fig. 8 shows the visual result of USRNet on a real LR image with scale factor 4. The blur kernel is manually selected as an isotropic Gaussian kernel with width 2.2 based on user preference. One can see from Fig. 8 that the proposed USRNet can reconstruct the HR image with improved visual quality.
[Figure panels: (a) Zoomed LR (×4); (b) USRNet]
In this paper, we focus on the classical SISR degradation model and propose a deep unfolding super-resolution network. Inspired by the unfolding optimization of traditional model-based methods, we design an end-to-end trainable deep network which integrates the flexibility of model-based methods and the advantages of learning-based methods. The main novelty of the proposed network is that it can handle the classical degradation model via a single model. Specifically, the proposed network consists of three interpretable modules: the data module that makes the HR estimation clearer, the prior module that makes the HR estimation cleaner, and the hyper-parameter module that controls the outputs of the other two modules. As a result, the proposed method can impose both a degradation constraint and a prior constraint on the solution. Extensive experimental results demonstrate the flexibility, effectiveness and generalizability of the proposed method for super-resolving various degraded LR images. We believe that our work can benefit the image restoration research community.
Acknowledgments: This work was partly supported by the ETH Zürich Fund (OK), a Huawei Technologies Oy (Finland) project, an Amazon AWS grant, and Nvidia.