DAN
This is an official implementation of Unfolding the Alternating Optimization for Blind Super Resolution
Previous methods decompose the blind super-resolution (SR) problem into two sequential steps: i) estimating the blur kernel from given low-resolution (LR) image and ii) restoring the SR image based on the estimated kernel. This two-step solution involves two independently trained models, which may not be well compatible with each other. A small estimation error of the first step could cause a severe performance drop of the second one. While on the other hand, the first step can only utilize limited information from the LR image, which makes it difficult to predict a highly accurate blur kernel. Towards these issues, instead of considering these two steps separately, we adopt an alternating optimization algorithm, which can estimate the blur kernel and restore the SR image in a single model. Specifically, we design two convolutional neural modules, namely Restorer and Estimator. Restorer restores the SR image based on the predicted kernel, and Estimator estimates the blur kernel with the help of the restored SR image. We alternate these two modules repeatedly and unfold this process to form an end-to-end trainable network. In this way, Estimator utilizes information from both LR and SR images, which makes the estimation of the blur kernel easier. More importantly, Restorer is trained with the kernel estimated by Estimator, instead of the ground-truth kernel, thus Restorer could be more tolerant to the estimation error of Estimator. Extensive experiments on synthetic datasets and real-world images show that our model can largely outperform state-of-the-art methods and produce more visually favorable results at a much higher speed. The source code is available at <https://github.com/greatlog/DAN.git>.
Single image super-resolution (SISR) aims to recover the high-resolution (HR) version of a given degraded low-resolution (LR) image. It has wide applications in video enhancement, medical imaging, as well as security and surveillance imaging. Mathematically, the degradation process can be expressed as
$$y = (x \otimes k)\downarrow_s + n, \tag{1}$$
where $x$ is the original HR image, $y$ is the degraded LR image, $\otimes$ denotes the two-dimensional convolution of $x$ with blur kernel $k$, $n$ denotes Additive White Gaussian Noise (AWGN), and $\downarrow_s$ denotes the standard $s$-fold downsampler, which means keeping only the upper-left pixel for each distinct $s \times s$ patch [58]. Then SISR refers to the process of recovering $x$ from $y$. It is a highly ill-posed problem due to this inverse property, and thus has always been a challenging task [5].
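The degradation model of Equation 1 can be illustrated with a minimal pure-Python sketch. It assumes a single-channel image stored as a list of rows and zero padding at the borders; the function names (`conv2d_same`, `downsample`, `degrade`) and the padding choice are illustrative assumptions, not part of the original work.

```python
import random


def conv2d_same(img, kernel):
    """2-D convolution with zero padding; the output has the same size as img."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    # Flipped kernel indices: true convolution, not correlation.
                    ii, jj = i + ph - u, j + pw - v
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += img[ii][jj] * kernel[u][v]
            out[i][j] = acc
    return out


def downsample(img, s):
    """s-fold downsampler: keep the upper-left pixel of each s x s patch."""
    return [row[::s] for row in img[::s]]


def degrade(hr, kernel, s, noise_sigma=0.0, seed=0):
    """Blur, downsample, then add white Gaussian noise (assumed seedable)."""
    rng = random.Random(seed)
    lr = downsample(conv2d_same(hr, kernel), s)
    return [[p + rng.gauss(0.0, noise_sigma) for p in row] for row in lr]
```

With a Dirac kernel and no noise, `degrade` reduces to plain subsampling, which is a convenient sanity check for the downsampler convention.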
During the past five years, deep neural networks (DNNs) have achieved remarkable results on SISR [3, 53]. But most of these methods [25, 26] assume that the blur kernel is predefined as the kernel of bicubic interpolation. In this case, the SR task degenerates to finding the inverse solution for bicubic downsampling. However, blur kernels in real applications are much more complicated. They are usually unknown and differ from image to image, as the blur kernels can be easily influenced by the camera intrinsic parameters, camera pose, etc. Consequently, there is a domain gap between bicubically synthesized training samples and real images. This domain gap will lead to a severe performance drop when these networks are applied to real applications [32]. Thus, more attention should be paid to SR in the context of unknown blur kernels, i.e. blind SR.
In blind SR, there is one more undetermined variable, i.e. the blur kernel $k$, and the optimization also becomes much more difficult. To make this problem easier to solve, previous methods [60, 28] usually decompose it into two sequential steps: i) estimating the blur kernel from the LR image and ii) restoring the SR image based on the estimated kernel. This two-step solution involves two independently trained models, which may not be well compatible with each other. Specifically, the model in the second step is usually trained with ground-truth kernels, while during testing it is provided with the kernel estimated in the first step. As a result, a small estimation error in the first step could cause a severe performance drop in the following one [20]. On the other hand, the first step can only utilize limited information from the LR image, which makes it difficult to predict a highly accurate blur kernel. Consequently, although both models can perform well individually, the final result may be suboptimal when they are combined.
Instead of considering these two steps separately, we adopt an alternating optimization algorithm, which can estimate the blur kernel and restore the SR image in the same model. In detail, we design two convolutional neural modules, namely Restorer and Estimator. Restorer restores the SR image based on the blur kernel predicted by Estimator, and the restored SR image is further used to help Estimator estimate a better blur kernel. Once the blur kernel is manually initialized, the two modules can cooperate well with each other to form a closed loop, which can be iterated over and over. The iterating process is then unfolded into an end-to-end trainable network, which is called a deep alternating network (DAN). In this way, Estimator can utilize information from both the LR and SR images, which makes the estimation of the blur kernel easier. More importantly, Restorer is trained with the kernel estimated by Estimator, instead of the ground-truth kernel. Thus, during testing, Restorer could be more tolerant to the estimation error of Estimator. Besides, the results of both modules could be substantially improved during the iterations, so our alternating optimization algorithm is likely to get better final results than the direct two-step solutions.
We summarize our contributions into three points:
We adopt an alternating optimization algorithm to estimate the blur kernel and restore the SR image for blind SR in a single network (DAN), which helps the two modules to be well compatible with each other and thus get better final results than the previous two-step solution.
We design two convolutional neural modules, which can be alternated repeatedly and then unfolded to form an end-to-end trainable network, without any pre/post-processing. It is easier to train and runs faster than the previous two-step solution. To the best of our knowledge, the proposed method is the first end-to-end network for blind SR.
Extensive experiments on synthetic and real-world images show that our model can largely outperform state-of-the-art methods and produce more visually favorable results at a much higher speed.
A preliminary version of this work has been presented as a conference paper [37]. In the current work, we incorporate additional contents in significant ways:
We propose a dual-path conditional block (DPCB) to optimize the architectures of both Estimator and Restorer (Sec 3.4.1). Compared with the original conditional residual block (CRB), DPCB has two advantages: i) DPCB can simultaneously explore deep features of both its basic and conditional inputs, while CRB only focuses on the basic one. This enables DPCB to model a deeper correlation between the two inputs and helps improve the performance of Estimator and Restorer. ii) The dual-path design in DPCB abandons the expansion and concatenation operations in CRB, which saves much computation. Experiments show that DPCB accelerates the whole network by 28%.
In the current version, Estimator is supervised by the complete blur kernel, instead of the kernel in the reduced space as in the conference version. On the one hand, stronger supervision may help Estimator to be better optimized. On the other hand, the complete predicted kernel can easily be used in other tasks, while the reduced kernel can only be used in the Restorer.
We investigate more details and add considerable analysis to the initial version, such as visualization of the predicted kernel, ablation studies about the architectures of Restorer and Estimator, etc.
Learning-based methods for SISR usually require a large number of paired HR and LR images as training samples. However, these paired samples are hard to obtain in the real world. Consequently, researchers manually synthesize LR images from HR images with predefined downsampling settings. The most popular setting is bicubic interpolation, i.e. defining $k$ in Equation 1 as the bicubic kernel. In this way, a large number of paired samples can be easily synthesized, which helps boost the development of various deep-learning-based methods. Since the emergence of SRCNN [14], various DNNs [53, 21, 23] have been proposed based on this setting, and most of them focus on optimizing the network architecture for SR. Strategies such as post-upscaling [15], residual learning [30], and pixel-shuffle operation [45] have become the default choices for building an SR network. After the proposal of RCAN [61], RRDB [52] and SAN [13], the performance in the context of bicubic downsampling has even started to saturate on common benchmark datasets.
Despite the great achievements made in super-resolving bicubically downsampled images, it is still difficult to apply SR methods in real scenarios, because the blur kernels for real images are usually unknown, differ from image to image, and are much more complicated than the bicubic one. Consequently, due to the domain gap between real and synthesized data, methods designed for bicubically downsampled images will suffer a severe performance drop in real applications [32, 11]. To address this issue, researchers have begun to work on more challenging cases where the degradations of test images are unknown, i.e. blind super resolution.
As indicated in Equation 1, blind super resolution involves solving for both the blur kernel $k$ and the SR image $x$. Previous methods usually decompose it into two sequential steps, and each step is an independent research field.
Kernel estimation. The first step is estimating the blur kernel from the test image. As this is an ill-posed problem [33, 35], some priors are usually needed to solve it properly. In [40], a non-parametric method is proposed that utilizes the patch recurrence between the test image and its downscaled version. A similar idea is also adopted in [6, 9], but powered with neural networks and adversarial training [19]. Another widely used family of priors is the extreme channel priors. In [41, 42], Pan et al. first propose the dark channel prior, i.e. the dark channel in a natural image is usually sparse, which can be used for solving the blur kernel from a blurred image. In [55, 10], the bright channel prior is further proposed and the idea is extended to extreme channel priors. Although these manually set priors may help in some cases, they are often violated in applications. Consequently, as we will show in the experimental section 4.1.3, the accuracy of estimated kernels is still limited.
Super Resolution with given kernel. The second step is super-resolving the LR image with the estimated kernel. This research field is also known as non-blind SR, in which methods are designed under the assumption that the ground-truth blur kernel is known. In [18, 46, 47], the blur kernel is used to downsample images and synthesize training samples, which can be used to train a specific model for a given kernel and LR image. In [60], the kernel and LR image are directly concatenated at the first layer of a DNN. Thus, the SR result can be closely correlated to both the LR image and the blur kernel. In [28], Zhang et al. propose a method based on the ADMM algorithm. They interpret this problem as MAP optimization and solve the data term and prior term alternately. A similar idea is adopted in [58]. These methods can achieve remarkable performance as long as the ground-truth blur kernel is known. However, in real applications, the blur kernels are predicted by kernel-estimating methods and thus deviate from the ground-truth ones. As we will illustrate in Sec 4.1.1, this deviation will cause a severe performance drop when the two steps are combined.
End-to-end methods for blind SR have rarely been studied before. In [20], a kernel-estimation module and a non-blind-SR module are integrated into a single blind SR method for the first time. It further proposes a correction module, which uses the super-resolved image to iteratively correct the estimated kernel. However, the three modules in [20] are still trained in two steps, which is complicated and may restrict its performance. In our proposed method, the kernel-estimation and SR modules are optimized end-to-end, which is not only much simpler but also helps the two modules become more compatible with each other and achieve better performance.
In this section, we first illustrate the overall algorithm of our proposed method and then go into the details. We start from the formulation of blind SR, which helps us explain our method mathematically. The design details are described at the end.
As shown in Equation 1, there are three variables, i.e. $x$, $k$ and $n$, to be determined in the blind SR problem. From Equation 1 we can get
$$n = y - (x \otimes k)\downarrow_s. \tag{2}$$
As $n$ is assumed to be Gaussian noise with zero mean, the blind SR problem can be mathematically expressed as an optimization problem in the Maximum A Posteriori (MAP) framework [59]:
$$\arg\min_{k,\,x} \; \| y - (x \otimes k)\downarrow_s \|. \tag{3}$$
Thus, the number of variables that need to be determined becomes two. However, this optimization problem is still ill-posed and has an infinite number of solutions [5]. To get it properly solved, some prior terms are usually added [44, 29]:
$$\arg\min_{k,\,x} \; \| y - (x \otimes k)\downarrow_s \| + \phi(x) + \Phi(k), \tag{4}$$
where $\phi(x)$ denotes the prior term for the HR image, and $\Phi(k)$ represents the prior term for the blur kernel. In [49], Tipping et al. model the process of imaging and parameterize it with several unknown variables. They further assume that these unknown variables follow high-dimensional Gaussian distributions. With the elaborate imaging model and strong assumptions, they succeed in solving this optimization problem directly. However, the imaging model or the assumptions about unknown variables may be easily violated in real applications. On the other hand, without these strong assumptions, it is extremely difficult to solve this problem directly.
Given that the overall blind SR problem is difficult to tackle, previous methods usually decompose it into two sequential steps:
$$k = M(y), \qquad x = \arg\min_{x} \; \| y - (x \otimes k)\downarrow_s \| + \phi(x), \tag{5}$$
where $M(\cdot)$ denotes the function that estimates $k$ from $y$, $\phi(x)$ is the prior term for the HR image as in Equation 4, and the second step is usually solved by a non-blind SR method described in Sec 2.2. As we have mentioned in Sec 2.2, the two steps are independent research fields in most cases. Both of them only consider the performance under their own given conditions, while ignoring the overall performance. This two-step solution has three drawbacks. Firstly, this algorithm usually requires training two or even more models, which is rather complicated. Secondly, $M(\cdot)$ can only utilize information from $y$. However, this is also an ill-posed problem: $k$ could not be properly solved without information from $x$. Lastly, the non-blind SR model for the second step is trained with ground-truth kernels, while during testing it only has access to kernels estimated in the first step. The difference between ground-truth and estimated kernels will usually cause a severe performance drop of the non-blind SR model [20].
To address the drawbacks of the two-step solution, we propose an end-to-end network that can largely alleviate these issues. We still split the problem into two subproblems. However, instead of solving them sequentially, we adopt an alternating optimization algorithm, which restores the SR image and estimates the corresponding blur kernel alternately. The mathematical expression is
$$k_{i+1} = \arg\min_{k} \; \| y - (x_i \otimes k)\downarrow_s \| + \Phi(k), \qquad x_{i+1} = \arg\min_{x} \; \| y - (x \otimes k_i)\downarrow_s \| + \phi(x). \tag{6}$$
We define two solvers, namely Estimator and Restorer, for the two subproblems respectively. For Estimator, there even exists an analytic solution [51]. However, in the current work, we choose to implement both solvers with convolutional neural modules. We have three reasons: 1) It is difficult to determine the appropriate analytic forms of the two prior terms, while neural modules are good at learning such priors implicitly [50, 4, 22]. 2) Both modules tackle intermediate results, i.e. $x_i$ and $k_i$ respectively, instead of ground-truth ones. Methods based on ground-truth assumptions may fail in this case. We also experimentally find that a neural-network-based Estimator is more robust than the analytic solution in our method. 3) Once the neural modules are trained, it is easy for them to perform inference.
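The idea of alternating between the two subproblems can be illustrated on a deliberately tiny analytic toy problem. This is only a sketch under strong assumptions: the "blur" is a scalar gain `k` acting on a signal `x` (a stand-in for convolution plus downsampling), the data term is squared L2, and the priors `lam * ||x||^2` and `mu * (k - 1)^2` are hypothetical choices made so that each sub-step has a closed form. The updates are applied sequentially, each using the most recent value of the other variable. None of this is the paper's learned solver.

```python
def objective(y, x, k, lam, mu):
    """Data term plus the two (hypothetical) quadratic priors."""
    fit = sum((yi - k * xi) ** 2 for yi, xi in zip(y, x))
    return fit + lam * sum(xi ** 2 for xi in x) + mu * (k - 1.0) ** 2


def update_x(y, k, lam):
    """Exact minimizer over x for fixed k: x_i = k * y_i / (k^2 + lam)."""
    return [k * yi / (k * k + lam) for yi in y]


def update_k(y, x, mu):
    """Exact minimizer over k for fixed x: k = (<x, y> + mu) / (<x, x> + mu)."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    return (xy + mu) / (xx + mu)


def alternate(y, lam=0.1, mu=0.1, steps=10):
    """Alternate the two exact sub-steps and record the objective value."""
    k = 1.0          # analogue of the Dirac initialization: identity "blur"
    x = list(y)
    trace = [objective(y, x, k, lam, mu)]
    for _ in range(steps):
        x = update_x(y, k, lam)   # "Restorer" step
        k = update_k(y, x, mu)    # "Estimator" step
        trace.append(objective(y, x, k, lam, mu))
    return x, k, trace
```

Because each sub-step exactly minimizes the objective over one variable, the recorded objective never increases. The learned modules in DAN carry no such guarantee, which is one reason the text relies on end-to-end training instead.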
Thus, we alternately solve the two subproblems with two neural modules. As shown in Figure 1, we fix the number of iterations as $T$ and unfold the iterating process to form an end-to-end trainable network, which is called a deep alternating network (DAN). We initialize the kernel with a Dirac function, i.e. the center of the kernel is one and all other entries are zero. Following [20, 60], the kernel is also reshaped and then reduced by principal component analysis (PCA) [43]. The value of $T$ is fixed in practice, and both modules are supervised only at the last iteration by L1 loss. The whole network could be well trained without any restrictions on intermediate results because the parameters of both modules are shared between different iterations.
In DAN, Estimator takes both LR and SR images as inputs, which makes the estimation of the blur kernel much easier. More importantly, Restorer is trained with the kernel estimated by Estimator, instead of the ground-truth kernel as in previous methods. Thus, Restorer could be more tolerant to the estimation error of Estimator during testing. Besides, compared with previous two-step solutions, the results of both modules in DAN could be substantially improved, and DAN is likely to get better final results. In particular, in the case where the scale factor $s = 1$, DAN becomes a deblurring network.
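The unfolded loop described above can be sketched as follows. This is a structural sketch only: `restorer` and `estimator` are placeholder callables standing in for the two neural modules, the PCA reduction of the kernel is omitted, and the function names and argument order are assumptions rather than the released implementation.

```python
def dirac_kernel(size):
    """Kernel whose center is one and whose other entries are zero."""
    kernel = [[0.0] * size for _ in range(size)]
    kernel[size // 2][size // 2] = 1.0
    return kernel


def unfold_alternation(lr, restorer, estimator, kernel_size, num_iters):
    """Run Restorer and Estimator alternately for a fixed number of iterations.

    The same two callables are reused in every iteration, which mirrors the
    parameter sharing between iterations described in the text.
    """
    kernel = dirac_kernel(kernel_size)
    sr = None
    for _ in range(num_iters):
        sr = restorer(lr, kernel)    # SR image from LR image and current kernel
        kernel = estimator(lr, sr)   # kernel from LR image and current SR image
    # Only these final outputs would be supervised during training.
    return sr, kernel
```

In an actual training setup the loop body would be differentiable, so a loss on the final `sr` and `kernel` propagates gradients through every iteration.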
The most direct way to build the Estimator and Restorer is to use the kernel estimation network and non-blind SR network from previous methods [20, 58]. However, on the one hand, those networks are too large to be directly combined. On the other hand, the performance of our proposed method depends more on the compatibility between Estimator and Restorer. Architectures designed for cases where the modules work alone may not be suitable for DAN. Thus, in this section, we specially design the architectures for Estimator and Restorer.
Analysis. Both modules in our network have two inputs. Estimator takes the LR and SR images, and Restorer takes the LR image and the blur kernel as inputs. We define the LR image as the basic input and the other one as the conditional input, i.e. the blur kernel is the conditional input of Restorer and the SR image is the conditional input of Estimator. During iterating, the basic inputs of both modules stay the same, but their conditional inputs are repeatedly updated. We claim that it is critically important to keep the output of each module closely related to its conditional input. Otherwise, the iterating results will collapse to a fixed point at the first iteration. Specifically, if Estimator outputs the same kernel regardless of the value of the SR image, or Restorer outputs the same SR image regardless of the value of the blur kernel, their outputs will only depend on the basic input, and the results will stay the same during the iterations.
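The collapse argument can be checked with a scalar toy example. The three functions below are hypothetical stand-ins (simple affine maps on numbers, not networks); they only illustrate that a module which ignores its conditional input freezes the whole iteration after the first step.

```python
def iterate(restorer, estimator, lr, k0, steps):
    """Alternate the two modules and record (sr, k) after each iteration."""
    history = []
    k = k0
    for _ in range(steps):
        sr = restorer(lr, k)
        k = estimator(lr, sr)
        history.append((sr, k))
    return history


def restorer_ignoring_kernel(lr, k):
    """Output depends only on the basic input (the conditional input is unused)."""
    return 2.0 * lr


def restorer_using_kernel(lr, k):
    """Output depends on both the basic and the conditional input."""
    return 2.0 * lr + 0.5 * k


def estimator(lr, sr):
    """Toy estimator that depends on its conditional input sr."""
    return 0.1 * sr
```

Running `iterate` with `restorer_ignoring_kernel` yields the same `(sr, k)` pair at every iteration, while `restorer_using_kernel` produces a sequence that keeps changing.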
Conditional Residual Block. In the conference version [37], a conditional residual block (CRB) is used to ensure that the outputs of Estimator and Restorer are closely related to their conditional inputs. However, this block has three drawbacks: 1) In Restorer, the conditional input, i.e. the estimated kernel, has to be expanded spatially to be concatenated with the LR features, which largely increases the computational cost. 2) Experiments show that the channel attention layer (CALayer) in CRB is time-consuming and easily leads to gradient explosion, which slows down the inference and makes the training unstable. 3) All blocks in the network are conditioned by the same features, which may restrict the representation ability of the whole network.
Dual-Path Conditional Block. To overcome the drawbacks of the conditional residual block, we propose a dual-path conditional block (DPCB) in this paper. As shown in Figure 2 (a), there are two paths in DPCB, i.e. the conditional path (top one) and the basic path (bottom one). We do not concatenate the conditional and basic paths directly. Instead, they are first processed independently and then multiplied to become correlated. If the conditional input has a different spatial size from the basic input, it is expanded just before the multiplication. In this way, convolutions on the conditional input are performed before the spatial expansion, which saves much computation. Besides, we add a skip connection on the conditional path, which enables the basic inputs at different depths to be conditioned by different features. This may improve the representation ability of the whole module and enhance the final results. We also remove the channel attention layer to accelerate the inference and stabilize the training.
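A heavily simplified sketch of the dual-path idea is given below. It is not the architecture of Figure 2: it assumes 1x1 convolutions (plain matrix-vector products per position), a ReLU activation, a residual connection on the basic path, and a conditional input with a 1x1 spatial size. All of these choices, and the name `dpcb`, are assumptions made to keep the example short.

```python
def linear(vec, weight):
    """1x1 convolution at one spatial position: a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def relu(vec):
    return [max(0.0, v) for v in vec]


def dpcb(basic, cond, w_basic, w_cond):
    """Simplified dual-path block.

    basic: H x W grid of channel vectors (basic input features).
    cond:  one channel vector (conditional features with 1x1 spatial size).
    Returns the updated basic grid and the updated conditional vector.
    """
    # Conditional path: processed once at its small spatial size,
    # with its own skip connection.
    cond_out = [c + f for c, f in zip(cond, relu(linear(cond, w_cond)))]
    out = []
    for row in basic:
        out_row = []
        for vec in row:
            feat = relu(linear(vec, w_basic))
            # Spatial expansion happens only here: the same conditional
            # vector multiplies the basic features at every position.
            mod = [f * c for f, c in zip(feat, cond_out)]
            out_row.append([b + m for b, m in zip(vec, mod)])
        out.append(out_row)
    return out, cond_out
```

Since the block returns the updated conditional features as well, stacking several blocks lets each depth be conditioned on different features. The conditional-path product is computed once per block rather than once per pixel, which is where the saving over expand-then-convolve comes from.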
Dual-Path Conditional Group. We further adopt the residual in residual (RIR) structure proposed in [61]. As shown in Figure 2 (b), we add long skip connections when several DPCBs are sequentially stacked. These blocks form what we call a dual-path conditional group (DPCG). These long skip connections could further help stabilize the training and enhance the results of very deep neural networks [61]. Since the conditional and basic paths are independently processed, the convolutional layers on the two paths can also have different configurations. As shown in Figure 2 (b), the kernel size and stride are denoted separately for the two paths.
The whole structure of Restorer is shown in Figure 2 (c). Both inputs are first mapped to the same number of channels, each by a single convolutional layer. The body of Restorer consists of only DPCGs. The conditional input, i.e. the reduced kernel, has a smaller spatial size than the basic input. In this case, the conditional input needs to be expanded spatially to be multiplied with the basic input in the DPCB. Fortunately, the conditional input can maintain its spatial size through the conditional path, which saves much computation compared with the conference version [37]. We use PixelShuffle [45] layers to upscale the features to the desired size. In practice, Restorer consists of multiple DPCGs, each containing multiple DPCBs, with a fixed number of channels in the body.
The whole structure of Estimator is shown in Figure 2 (d). The SR image super-resolved by Restorer is first downscaled by a convolutional layer with stride $s$. Then the feature maps are used as the conditional input of Estimator. The body of Estimator also consists of only DPCGs. The basic and conditional paths use the same kernel size. In practice, the body of Estimator consists of one DPCG, which contains multiple DPCBs, with a fixed number of channels in the body.
In the conference version [37], the Estimator only predicts kernels in the reduced space and is only supervised by the reduced kernel. There are two drawbacks to this design: 1) Estimator cannot predict complete kernels, i.e. kernels before being transformed by PCA. Even if the final SR result is good enough, we do not know what the blur kernel looks like. 2) Although the reduced kernel is well-supervised, the complete kernel is not well-constrained. According to [33], it is better to restrict the complete kernel to sum to one, which is important to the convergence of the whole algorithm [34]. Thus, in the current version, the Estimator directly predicts all elements of the blur kernel, i.e. the complete kernel. We further add a Softmax [8] layer at the end of Estimator, which explicitly forces the complete kernel to sum to one. Experiments in Sec 4.1.3 indicate that the kernels predicted by the modified Estimator show fewer visual differences from the ground truth and a smaller quantitative error.
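The effect of the final Softmax layer can be shown with a small sketch. It assumes the softmax is taken jointly over all elements of the predicted kernel (rather than per row or per channel), which is the natural reading of "forces the complete kernel to sum to one"; the function name is illustrative.

```python
import math


def softmax_kernel(logits):
    """Softmax over all elements of a 2-D array of raw kernel predictions.

    The result is strictly positive and sums to one, so it is a valid
    normalized blur kernel regardless of the raw values.
    """
    peak = max(v for row in logits for v in row)
    # Subtracting the maximum avoids overflow without changing the result.
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]
```

A side effect of this choice is that every kernel entry is strictly positive, so exact zeros can only be approached, never reached.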
Method | Scale | Set5 PSNR | Set5 SSIM | Set14 PSNR | Set14 SSIM | BSD100 PSNR | BSD100 SSIM | Urban100 PSNR | Urban100 SSIM | Manga109 PSNR | Manga109 SSIM |
---|---|---|---|---|---|---|---|---|---|---|---|
Bicubic | 2 | 28.82 | 0.8577 | 26.02 | 0.7634 | 25.92 | 0.7310 | 23.14 | 0.7258 | 25.60 | 0.8498 |
CARN [2] | 2 | 30.99 | 0.8779 | 28.10 | 0.7879 | 26.78 | 0.7286 | 25.27 | 0.7630 | 26.86 | 0.8606 |
Bicubic+ZSSR [46] | 2 | 31.08 | 0.8786 | 28.35 | 0.7933 | 27.92 | 0.7632 | 25.25 | 0.7618 | 28.05 | 0.8769 |
[41]+CARN [2] | 2 | 24.20 | 0.7496 | 21.12 | 0.6170 | 22.69 | 0.6471 | 18.89 | 0.5895 | 21.54 | 0.7946 |
CARN [2]+[41] | 2 | 31.27 | 0.8974 | 29.03 | 0.8267 | 28.72 | 0.8033 | 25.62 | 0.7981 | 29.58 | 0.9134 |
IKC [20] | 2 | 37.19 | 0.9526 | 32.94 | 0.9024 | 31.51 | 0.8790 | 29.85 | 0.8928 | 36.93 | 0.9667 |
DANv1 [37] | 2 | 37.34 | 0.9526 | 33.08 | 0.9041 | 31.76 | 0.8858 | 30.60 | 0.9060 | 37.23 | 0.9710 |
DANv2 | 2 | 37.60 | 0.9544 | 33.44 | 0.9094 | 32.00 | 0.8904 | 31.43 | 0.9174 | 38.07 | 0.9734 |
Bicubic | 3 | 26.21 | 0.7766 | 24.01 | 0.6662 | 24.25 | 0.6356 | 21.39 | 0.6203 | 22.98 | 0.7576 |
CARN [2] | 3 | 27.26 | 0.7855 | 25.06 | 0.6676 | 25.85 | 0.6566 | 22.67 | 0.6323 | 23.84 | 0.7620 |
Bicubic+ZSSR [46] | 3 | 28.25 | 0.7989 | 26.11 | 0.6942 | 26.06 | 0.6633 | 23.26 | 0.6534 | 25.19 | 0.7914 |
[41]+CARN [2] | 3 | 19.05 | 0.5226 | 17.61 | 0.4558 | 20.51 | 0.5331 | 16.72 | 0.4578 | 18.38 | 0.6118 |
CARN [2]+[41] | 3 | 30.31 | 0.8562 | 27.57 | 0.7531 | 27.14 | 0.7152 | 24.45 | 0.7241 | 27.67 | 0.8592 |
IKC [20] | 3 | 33.06 | 0.9146 | 29.38 | 0.8233 | 28.53 | 0.7899 | 24.43 | 0.8302 | 32.43 | 0.9316 |
DANv1 [37] | 3 | 34.04 | 0.9199 | 30.09 | 0.8287 | 28.94 | 0.7919 | 27.65 | 0.8352 | 33.16 | 0.9382 |
DANv2 | 3 | 34.19 | 0.9209 | 30.20 | 0.8309 | 29.03 | 0.7948 | 27.83 | 0.8395 | 33.28 | 0.9400 |
Bicubic | 4 | 24.57 | 0.7108 | 22.79 | 0.6032 | 23.29 | 0.5786 | 20.35 | 0.5532 | 21.50 | 0.6933 |
CARN [2] | 4 | 26.57 | 0.7420 | 24.62 | 0.6226 | 24.79 | 0.5963 | 22.17 | 0.5865 | 21.85 | 0.6834 |
Bicubic+ZSSR [46] | 4 | 26.45 | 0.7279 | 24.78 | 0.6268 | 24.97 | 0.5989 | 22.11 | 0.5805 | 23.53 | 0.7240 |
[41]+CARN [2] | 4 | 18.10 | 0.4843 | 16.59 | 0.3994 | 18.46 | 0.4481 | 15.47 | 0.3872 | 16.78 | 0.5371 |
CARN [2]+[41] | 4 | 28.69 | 0.8092 | 26.40 | 0.6926 | 26.10 | 0.6528 | 23.46 | 0.6597 | 25.84 | 0.8035 |
IKC [20] | 4 | 31.67 | 0.8829 | 28.31 | 0.7643 | 27.37 | 0.7192 | 25.33 | 0.7504 | 28.91 | 0.8782 |
DANv1 [37] | 4 | 31.89 | 0.8864 | 28.42 | 0.7687 | 27.51 | 0.7248 | 25.86 | 0.7721 | 30.50 | 0.9037 |
DANv2 | 4 | 32.00 | 0.8885 | 28.50 | 0.7715 | 27.56 | 0.7277 | 25.94 | 0.7748 | 30.45 | 0.9037 |
To fully investigate the proposed method, experiments are performed on both synthetic and real images. In experiments on synthetic images, we evaluate its quantitative results under different settings and perform controlled experiments to help analyze the proposed method. In experiments on real images, we provide a qualitative comparison to demonstrate the effectiveness of the proposed method.
To fully investigate the proposed method, extensive experiments are performed under two different degradation settings. Setting 1 only focuses on cases of isotropic Gaussian blur kernels. In this case, different blur kernels can be quantitatively compared, which can help study the influence of blur kernels. Setting 2 focuses on cases of more general and irregular blur kernels. Intuitively, Setting 2 is relatively more difficult and can help study the performance of the proposed method.
Setting 1. Following the setting in [20], the kernel size is set as . During training, the kernel width is uniformly sampled in [0.2, 4.0], [0.2, 3.0] and [0.2, 2.0] for scale factors , and respectively. For quantitative evaluation, we collect HR images from the commonly used benchmark datasets, i.e. Set5 [7], Set14 [57], Urban100 [24], BSD100 [38] and Manga109 [39]. Since determined kernels are needed for reasonable comparison, we uniformly choose 8 kernels, denoted as Gaussian8, from range [1.8, 3.2], [1.35, 2.40] and [0.80, 1.60] for scale factors , and respectively. The HR images are first blurred by the selected blur kernels and then downsampled to form synthetic test images.
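The isotropic Gaussian kernels of Setting 1 can be generated along the following lines. This sketch makes two assumptions that are not stated in the text: the eight test widths are taken as evenly spaced values including both endpoints of the range, and the kernel size of 21 used in the usage note is only a placeholder. Function names are illustrative.

```python
import math


def isotropic_gaussian_kernel(size, sigma):
    """Normalized isotropic Gaussian blur kernel of shape size x size."""
    center = (size - 1) / 2.0
    kernel = [
        [
            math.exp(-((i - center) ** 2 + (j - center) ** 2) / (2.0 * sigma ** 2))
            for j in range(size)
        ]
        for i in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def uniform_widths(low, high, count=8):
    """Evenly spaced kernel widths in [low, high], endpoints included."""
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]
```

For example, `[isotropic_gaussian_kernel(21, w) for w in uniform_widths(1.8, 3.2)]` would give eight fixed test kernels for the first width range listed above.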
Types | Method | Scale 2 PSNR | Scale 2 SSIM | Scale 4 PSNR | Scale 4 SSIM |
---|---|---|---|---|---|
Class 1 | Bicubic | 28.73 | 0.8040 | 25.33 | 0.6795 |
Class 1 | Bicubic kernel + ZSSR [46] | 29.10 | 0.8215 | 25.61 | 0.6911 |
Class 1 | EDSR [36] | 29.17 | 0.8216 | 25.64 | 0.6928 |
Class 1 | RCAN [61] | 29.20 | 0.8223 | 25.66 | 0.6936 |
Class 2 | PDN [48] - 1st in NTIRE’19 track4 | / | / | 26.34 | 0.7190 |
Class 2 | WDSR [56] - 1st in NTIRE’19 track2 | / | / | 21.55 | 0.6841 |
Class 2 | WDSR [56] - 1st in NTIRE’19 track3 | / | / | 21.54 | 0.7016 |
Class 2 | WDSR [56] - 2nd in NTIRE’19 track4 | / | / | 25.64 | 0.7144 |
Class 2 | Ji et al. [27] - 1st in NTIRE’20 track 1 | / | / | 25.43 | 0.6907 |
Class 3 | Cornillere et al. [12] | 29.46 | 0.8474 | / | / |
Class 3 | Michaeli et al. [40] + SRMD [60] | 25.51 | 0.8083 | 23.34 | 0.6530 |
Class 3 | Michaeli et al. [40] + ZSSR [46] | 29.37 | 0.8370 | 26.09 | 0.7138 |
Class 3 | KernelGAN [6] + SRMD [60] | 29.57 | 0.8564 | 25.71 | 0.7265 |
Class 3 | KernelGAN [6] + USRNet [58] | / | / | 20.06 | 0.5359 |
Class 3 | KernelGAN [6] + ZSSR [46] | 30.36 | 0.8669 | 26.81 | 0.7316 |
Ours | DANv1 | 32.56 | 0.8997 | 27.55 | 0.7582 |
Ours | DANv2 | 32.58 | 0.9048 | 28.74 | 0.7893 |
Setting 2. Following the setting in [6], the kernel size is set separately for scales $\times 2$ and $\times 4$. We first generate anisotropic Gaussian kernels. The lengths of both axes are uniformly distributed over a fixed range, and the kernel is rotated by a random angle drawn from a uniform distribution. To deviate from a regular Gaussian, we further apply uniform multiplicative noise (up to 25% of each pixel value of the kernel) and normalize the kernel to sum to one. For testing, we use the benchmark dataset DIV2KRK that is used in [6].
Data. For both settings, we collect HR images from DIV2K [1] and Flickr2K [16] as the training set. We first crop all HR images into fixed-size patches and use them to synthesize training pairs on the fly. The synthesized pairs are then further cropped such that the LR images have the same size for all scale factors.
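The Setting 2 kernel generation described above (anisotropic Gaussian, random rotation, multiplicative noise, normalization) can be sketched as follows. The parameter names, the symmetric interpretation of the noise as a factor in [1 - 0.25, 1 + 0.25], and the example argument values are assumptions for illustration only.

```python
import math
import random


def anisotropic_gaussian_kernel(size, sigma1, sigma2, theta, noise=0.25, rng=None):
    """Rotated anisotropic Gaussian kernel with multiplicative noise.

    sigma1, sigma2: lengths of the two axes; theta: rotation angle in radians.
    noise: maximum relative perturbation applied to each kernel value.
    """
    rng = rng or random.Random(0)
    center = (size - 1) / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for i in range(size):
        row = []
        for j in range(size):
            dy, dx = i - center, j - center
            # Rotate the coordinates into the principal axes of the Gaussian.
            u = cos_t * dx + sin_t * dy
            v = -sin_t * dx + cos_t * dy
            value = math.exp(-0.5 * ((u / sigma1) ** 2 + (v / sigma2) ** 2))
            # Uniform multiplicative noise on each kernel value.
            value *= 1.0 + rng.uniform(-noise, noise)
            row.append(value)
        kernel.append(row)
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]
```

A call such as `anisotropic_gaussian_kernel(11, 1.0, 3.0, 0.5)` returns an 11 x 11 kernel that sums to one. Passing a seeded `random.Random` instance makes the noise reproducible.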
Training. The batch sizes for all models are . All models are trained for iterations. We use Adam [31] as our optimizer, with , . The initial learning rate is , and will decay by half at every iterations. All models are trained on RTX2080Ti GPUs.
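The step-wise learning-rate decay mentioned above ("decay by half at every fixed number of iterations") can be written as a one-line schedule. The function name and the numbers in the usage note are placeholders chosen for illustration, not the values used in the experiments.

```python
def learning_rate(iteration, initial_lr, decay_every):
    """Halve the learning rate once every `decay_every` iterations."""
    return initial_lr * 0.5 ** (iteration // decay_every)
```

For instance, with the placeholder values `initial_lr=1e-3` and `decay_every=100`, the rate is `1e-3` for iterations 0 to 99, `5e-4` for iterations 100 to 199, and so on.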
Evaluation metric. All methods are evaluated by PSNR and SSIM [54]. Both metrics are calculated on the Y channel (i.e. luminance) of transformed YCbCr space.
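A minimal sketch of PSNR on the Y channel is given below. The text does not specify the exact colour conversion; the sketch assumes the ITU-R BT.601 "studio swing" luma formula that is commonly used in SR evaluation code (8-bit RGB in, Y in [16, 235]) and a peak value of 255. Border cropping, which many evaluation scripts apply, is omitted.

```python
import math


def rgb_to_y(pixel):
    """BT.601 luma of an 8-bit RGB pixel, in the range [16, 235]."""
    r, g, b = pixel
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(img_a, img_b, peak=255.0):
    """PSNR between two RGB images (lists of rows of (r, g, b)) on the Y channel."""
    squared_error, count = 0.0, 0
    for row_a, row_b in zip(img_a, img_b):
        for pa, pb in zip(row_a, row_b):
            diff = rgb_to_y(pa) - rgb_to_y(pb)
            squared_error += diff * diff
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

Because PSNR is computed on luminance only, chroma differences between two images do not affect the reported value.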
In this section, we provide quantitative results of different methods under different settings.
Setting 1. For the first setting, we evaluate our method on test images synthesized by Gaussian8 kernels. We denote the DAN of the conference version [37] as DANv1 and the DAN of the current paper as DANv2. We mainly compare our results with ZSSR [46] (using the bicubic kernel) and IKC [20]. We also include a comparison with CARN [2]. Since it is not designed for blind SR, we apply the deblurring method [41] before or after CARN. The results are shown in Table I.
Although CARN achieves remarkable results in the context of bicubic downsampling, it suffers a severe performance drop when applied to images with unknown blur kernels. Its performance is largely improved when it is followed by a deblurring method, but it is still inferior to that of blind-SR methods. ZSSR trains a specific network for each single test image by utilizing the internal patch recurrence. However, ZSSR has an inherent drawback: the training samples for each image are limited, and thus it cannot learn a good prior for HR images. IKC is also a two-step solution for blind SR. Although the accuracy of the estimated kernel is largely improved in IKC, the final result is still suboptimal.
Both DANv1 and DANv2 are trained in an end-to-end manner, which is not only much easier to be trained than two-step solutions but also more likely to reach a better optimum point. As shown in Table I, they outperform other methods by a large margin. Specially, DANv1 outperforms IKC by on Urban100 for scale . This comparison indicates the importance of end-to-end training in blind SR. On the other hand, DANv2 is also improved a lot on the basis of DANv1. It suggests that the optimized structures of Restorer and Estimator are better than the conference version.
Setting 2. The second setting involves irregular blur kernels, which are more general but also more difficult to handle. For Setting 2, we mainly compare methods of three different classes: i) SOTA SR algorithms trained on bicubically downsampled images, such as EDSR [36] and RCAN [61], ii) blind SR methods designed for the NTIRE competition, such as PDN [48] and WDSR [56], iii) two-step solutions, i.e. the combination of a kernel estimation method and a non-blind SR method, such as KernelGAN [6] and ZSSR [46]. The PSNR and SSIM results on the Y channel are shown in Table II.
Similarly, the performance of methods trained on bicubically downsampled images is limited by the domain gap. Thus, their results are only slightly better than those of interpolation. The methods in Class 2 are trained on synthesized images provided in the NTIRE competition. Although these methods achieve remarkable results in the competition, they still cannot generalize well to irregular blur kernels.
The comparison between methods of Class 3 is particularly instructive. Specifically, USRNet [58] achieves remarkable results when GT kernels are provided, and KernelGAN also performs well on kernel estimation. However, when they are combined, as shown in Table II, the final SR results are worse than those of most other methods. This indicates that it is important for the Estimator and Restorer to be compatible with each other. Additionally, although a better kernel estimation method can benefit the SR results, the overall performance is still largely inferior to that of both DANv1 and DANv2. This comparison also indicates the importance of end-to-end training for blind SR. Compared with DANv1, the performance of DANv2 is further improved. Specifically, DANv2 outperforms DANv1 by for scale . On the one hand, DPCB largely improves the representation ability of DANv2. On the other hand, DANv2 can be trained more stably than DANv1. Thus it can be better optimized and achieves better results.
In this section, we provide some visual results of different methods under different settings for qualitative comparisons.
Setting 1. The visual results of img 005, img 013, img 047 and img 052 in Urban100 are shown in Figure 3 for comparisons between DAN and other methods. As one can see, ZSSR and CARN cannot even restore clear edges. IKC performs better, but its edges are still severely blurred. DANv2 restores sharper edges and simultaneously alleviates the blurriness. This comparison indicates that DAN can produce more visually pleasing SR images. For the qualitative comparison between DANv1 and DANv2, we need to focus on harder cases, because for relatively easy cases both models perform well enough that their results are hard to distinguish visually. We provide their results on img 092 and img 096 in Urban100 for comparison. As shown in Figure 4, DANv1 tends to mix stripes of different directions during super-resolution, while DANv2 appears to be more stable in such cases.
Setting 2. The visual results of img 864, img 816, img 812 and img 853 in DIV2KRK are shown in Figure 6 for comparisons between DAN and other methods. Note that bicubic interpolation is actually a strong baseline in blind SR. Although KernelGAN + ZSSR and Ji et al. achieve better overall results on DIV2KRK, bicubic interpolation can still outperform them in many cases. As indicated in the figure, compared with the other three methods, the SR images produced by DAN are much sharper and cleaner. We also provide individual comparisons between DANv1 and DANv2 in Figure 7. As one can see, the SR images of DANv1 are still slightly blurred, while those of DANv2 are much cleaner.
Accuracy. We calculate the L1 error of predicted kernels to quantitatively evaluate their accuracy. As we want to investigate the performance over different kernels, we choose to measure the predicted kernels in Setting 1, because different kernels in Setting 1 can be classified via their standard deviation. We calculate their L1 errors in the reduced space, and the results on Urban100 are shown in Figure 8 (a). As one can see, the L1 errors of reduced kernels predicted by DANv1 and DANv2 are much lower than those of IKC. It suggests that the overall improvements of DAN may partially come from more accurately predicted kernels. Note that DANv2 predicts more accurate kernels than DANv1, which demonstrates the effectiveness of the modifications to Estimator in Sec 3.4.3. We also plot the PSNR results with respect to kernels with different standard deviations in Figure 8 (b). As the standard deviation increases, the performance gap between IKC and DAN also becomes larger. It indicates that DAN may have better generalization ability.
Visualization. Compared with DANv1, DANv2 directly predicts the complete blur kernel instead of its representation in the reduced space. This enables us to visualize the estimated kernels. In this section, we visualize some estimated kernels to qualitatively measure the performance of Estimator. Since Gaussian kernels in Setting 1 are hard to distinguish visually, we choose to visualize estimated kernels on DIV2KRK for scale factor . The irregular kernels of DIV2KRK are more difficult to estimate, and the performance of different methods is easier to assess visually. We use the results of KernelGAN [6] and Pan et al. [41] for comparison. As shown in Figure 5, kernels estimated by Pan et al. collapse to the central area. It indicates that this method fails in estimating relatively large kernels. The kernels estimated by KernelGAN tend to be isotropic and look very different from the ground-truth kernels. Compared with these two methods, DAN can estimate the kernel much more accurately, even if the ground-truth kernels are highly anisotropic.
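As an illustration of the L1 kernel error used in the accuracy comparison above, the sketch below builds isotropic Gaussian kernels of different standard deviations and measures the error between them in the full kernel space. The kernel size, the use of the mean absolute difference (rather than the sum), and the function names are assumptions made for this example; the paper's measurement is carried out in a reduced kernel space, which is not reproduced here.

```python
import numpy as np


def isotropic_gaussian_kernel(size, sigma):
    """Return a (size, size) isotropic Gaussian kernel normalized to sum to one."""
    ax = np.arange(size, dtype=np.float64) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def kernel_l1_error(pred, gt):
    """Mean absolute difference between a predicted and a ground-truth kernel."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("kernels must have the same shape")
    return float(np.mean(np.abs(pred - gt)))


if __name__ == "__main__":
    size = 21  # assumed kernel size for illustration
    gt = isotropic_gaussian_kernel(size, sigma=2.0)
    for sigma_pred in (2.0, 2.2, 3.0):
        pred = isotropic_gaussian_kernel(size, sigma_pred)
        print(sigma_pred, kernel_l1_error(pred, gt))
```

Grouping such errors by the ground-truth standard deviation gives a per-kernel accuracy profile of the kind summarized in Figure 8 (a).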
In this section, we replace the estimated kernel with the ground truth (GT) to further investigate the influence of Estimator. If GT kernels are provided, the iterative process becomes meaningless. Thus we test the Restorer with a single forward pass. The results for Setting 1 are shown in Table III. The results remain almost unchanged, and sometimes even get worse, when GT kernels are provided. It indicates that Estimator may have already satisfied the requirements of Restorer, and that the superiority of DAN also partially comes from the good cooperation between its Estimator and Restorer.
Methods | Set5 | Set14 | B100 | Urban100 | Manga109 |
---|---|---|---|---|---|
DANv2 | 32.00 | 28.50 | 27.56 | 25.94 | 30.45 |
DANv2(GT kernel) | 31.98 | 28.49 | 27.56 | 25.95 | 30.46 |
In this section, we investigate the influence of different architectural components, including DPCB, DPCG, and the Softmax layer in Estimator. We use DAN of the conference version, i.e. DANv1, as the baseline, which is denoted as experiment . In experiment , we replace the conditional residual block in DANv1 with DPCB. To control the model size, the number of blocks is increased from to . In experiment , we further add long skip connections. We introduce the Softmax layer to Estimator in experiment , and the network finally becomes DANv2. We report the results of different experiments on Set14 in Setting 1. As shown in Table IV, compared with the original conditional residual block in DANv1, DPCB and DPCG can improve the results by . The Softmax layer in Estimator can further improve the results by . It indicates that explicitly constraining the estimated kernels to sum to one is helpful.
Exp. | DPCB | DPCG | Softmax | Results |
---|---|---|---|---|
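The sum-to-one constraint that the Softmax layer imposes on the estimated kernel can be illustrated with a few lines of NumPy. This is a minimal sketch assuming the Estimator emits one unconstrained score per kernel position; the function name `softmax_kernel` is hypothetical, and the actual layer in DANv2 operates on network feature tensors rather than a single 2-D array.

```python
import numpy as np


def softmax_kernel(scores):
    """Map unconstrained scores (H, W) to a non-negative kernel summing to one.

    The softmax is taken jointly over all spatial positions.
    """
    scores = np.asarray(scores, dtype=np.float64)
    flat = scores.reshape(-1)
    flat = flat - flat.max()  # subtract the maximum for numerical stability
    weights = np.exp(flat)
    return (weights / weights.sum()).reshape(scores.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.normal(size=(5, 5))  # stand-in for raw Estimator outputs
    kernel = softmax_kernel(raw)
    print(kernel.sum(), kernel.min())
```

Because every output is positive and the outputs sum to one, the result is always a valid blur kernel in the sense that it preserves the mean intensity of the image it is applied to.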
After the model is trained, we also change the number of iterations to see whether the two modules have learned the property of convergence or have merely ‘remembered’ the iteration number. The model is trained with iterations, but during testing, we increase the iteration number from to . As shown in Figure 9 (a) and (c), the average PSNR results on Set5 and Set14 first increase rapidly and then gradually converge. It should be noted that when we iterate more times than in training, the performance does not become worse, and sometimes even becomes better. For example, the average PSNR on Set14 is when the iteration number is , higher than when we iterate times. Although the increment is relatively small, it suggests that the two modules may have learned to cooperate with each other, instead of solving this problem like ordinary end-to-end networks, whose performance drops significantly when the testing setting differs from the training setting. It also suggests that the estimation error of intermediate results does not destroy the convergence of DAN. In other words, DAN is robust to various estimation errors.
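The iteration experiment above relies on the fact that the unfolded network is a loop whose length can be changed at test time. The sketch below shows that control flow only. The function `alternate` and the two toy callables standing in for Restorer and Estimator are hypothetical placeholders chosen so the loop visibly converges; they are not the paper's networks, and the initial kernel value is an assumption.

```python
def alternate(lr, kernel_init, restorer, estimator, n_iters):
    """Alternate between restoring the SR image and re-estimating the kernel.

    Returns the (sr, kernel) pair produced at every iteration, so that
    intermediate results can be inspected or supervised.
    """
    kernel = kernel_init
    outputs = []
    for _ in range(n_iters):
        sr = restorer(lr, kernel)    # restore SR from LR and the current kernel
        kernel = estimator(lr, sr)   # re-estimate the kernel from LR and SR
        outputs.append((sr, kernel))
    return outputs


# Toy scalar stand-ins (placeholders only): together they form a contraction,
# so the SR value approaches the fixed point lr / 0.75.
def toy_restorer(lr, kernel):
    return lr + 0.5 * kernel


def toy_estimator(lr, sr):
    return 0.5 * sr


if __name__ == "__main__":
    # Running the same modules for more iterations than a nominal "training"
    # length of 4 keeps refining the result instead of degrading it.
    for n in (4, 8):
        sr, kernel = alternate(3.0, 0.0, toy_restorer, toy_estimator, n)[-1]
        print(n, sr, kernel)
```

Whether extra iterations help a trained network depends on what the modules have actually learned, which is what the experiment reported in Figure 9 examines.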
Compared with other blind SR methods, our end-to-end model also has an advantage in inference speed. To make a quantitative comparison, we evaluate the average speed of different methods on the same platform with an RTX2080Ti GPU, using the 40 images synthesized by Gaussian8 kernels from Set5 as testing images. We choose KernelGAN [6] + ZSSR [46] and IKC [20] as the comparison methods. The model complexities and inference speeds are shown in Table V. The FLOPs of KernelGAN + ZSSR are left out because it re-trains a different model for each test image, in which case FLOPs cannot indicate the model complexity. As shown in Table V, the average speed of DANv1 is 0.75 seconds per image, nearly 554 times faster than KernelGAN + ZSSR, and 5 times faster than IKC. It indicates that DAN not only largely outperforms SOTA blind SR methods on PSNR but also runs much faster. DANv2 further improves the speed of DANv1 by . This is mainly because DPCB removes the expansion and concatenation operations in CRB, which saves considerable computation and memory and thus accelerates inference.
We also conduct experiments to show that DAN can generalize well to real-world images. We use the model trained with Setting 1 for scale to upscale the commonly used real image chip [17]. We use KernelGAN [6] + ZSSR [46] and IKC [20] as the representative methods for blind SR, and CARN [2] as the representative non-blind SR method. It should be noted that this is a real image and we do not have the ground truth. Thus we can only provide a visual comparison in Figure [17]. As one can see, the result of KernelGAN + ZSSR is slightly better than bicubic interpolation but is still heavily blurred. The result of CARN is over-smoothed and its edges are not sharp enough. IKC produces a cleaner result, but there are still some artifacts. The letter ‘X’ restored by IKC has an obvious dark line at the top right part, whereas this dark line is much lighter in the image restored by DAN. It suggests that even though DAN is trained on synthesized image pairs, it can still generalize to images in real applications in some cases.
In this paper, we have proposed an end-to-end algorithm for blind SR. This algorithm is based on alternating optimization, the two parts of which are both implemented by convolutional modules, namely Restorer and Estimator. We unfold the alternating process to form an end-to-end trainable network. In this way, Estimator can utilize information from both LR and SR images, which makes it easier to estimate the blur kernel. More importantly, Restorer is trained with the kernel estimated by Estimator, instead of the ground-truth kernel, and thus Restorer can be more tolerant to the estimation error of Estimator. Experiments show that the compatibility of the two modules may be more important than their individual accuracy, and that is the main reason why the proposed method is better than the previous two-step solutions. Our main contributions are that we provide an end-to-end algorithm for blind SR and demonstrate that an end-to-end pipeline is important for the final performance. In the future, we will try to apply similar ideas to other low-level vision tasks, such as deblurring and denoising.
Proceedings of the European Conference on Computer Vision, pp. 252–268.
Accelerating the super-resolution convolutional neural network. In Proceedings of the European Conference on Computer Vision, pp. 391–407.
Meta-transfer learning for zero-shot super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3516–3525.
ESRGAN: enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision Workshops.