This is an official implementation of Unfolding the Alternating Optimization for Blind Super Resolution
Previous methods decompose the blind super resolution (SR) problem into two sequential steps: i) estimating the blur kernel from the given low-resolution (LR) image and ii) restoring the SR image based on the estimated kernel. This two-step solution involves two independently trained models, which may not be well compatible with each other. A small estimation error in the first step can cause a severe performance drop in the second. On the other hand, the first step can only utilize limited information from the LR image, which makes it difficult to predict a highly accurate blur kernel. To address these issues, instead of considering the two steps separately, we adopt an alternating optimization algorithm, which can estimate the blur kernel and restore the SR image in a single model. Specifically, we design two convolutional neural modules, namely Restorer and Estimator. Restorer restores the SR image based on the predicted kernel, and Estimator estimates the blur kernel with the help of the restored SR image. We alternate these two modules repeatedly and unfold this process to form an end-to-end trainable network. In this way, Estimator utilizes information from both LR and SR images, which makes the estimation of the blur kernel easier. More importantly, Restorer is trained with the kernel estimated by Estimator, instead of the ground-truth kernel, so Restorer can be more tolerant to the estimation error of Estimator. Extensive experiments on synthetic datasets and real-world images show that our model can largely outperform state-of-the-art methods and produce more visually favorable results at much higher speed. The source code is available at https://github.com/greatlog/DAN.git.
DAN: Unfolding the Alternating Optimization for Blind Super Resolution
Single image super resolution (SISR) aims to recover the high-resolution (HR) version of a given degraded low-resolution (LR) image. It has wide applications in video enhancement, medical imaging, as well as security and surveillance imaging. Mathematically, the degradation process can be expressed as
$$\mathbf{y} = (\mathbf{x} \otimes \mathbf{k})\downarrow_s + \mathbf{n}, \tag{1}$$
where $\mathbf{x}$ is the original HR image, $\mathbf{y}$ is the degraded LR image, $\otimes$ denotes the two-dimensional convolution of $\mathbf{x}$ with blur kernel $\mathbf{k}$, $\mathbf{n}$ denotes Additive White Gaussian Noise (AWGN), and $\downarrow_s$ denotes the standard $s$-fold downsampler, which means keeping only the upper-left pixel for each distinct $s \times s$ patch usr . Then SISR refers to the process of recovering $\mathbf{x}$ from $\mathbf{y}$. It is a highly ill-posed problem due to this inverse property, and thus has always been a challenging task.
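As a rough illustration of the degradation process described above (blur the HR image with a kernel, keep the upper-left pixel of each distinct patch, then add Gaussian noise), the following pure-Python sketch synthesizes an LR image from an HR one. The function names, the zero padding at the image borders, and the `noise_sigma` and `seed` parameters are assumptions made for this example and are not taken from the released code.

```python
import random


def conv2d_same(img, kernel):
    """2-D convolution with zero padding; the output has the same size as img."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    # True convolution: the kernel is flipped relative to correlation.
                    ii, jj = i + ph - u, j + pw - v
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[u][v] * img[ii][jj]
            out[i][j] = acc
    return out


def downsample(img, s):
    """s-fold downsampler: keep the upper-left pixel of each distinct s x s patch."""
    return [row[::s] for row in img[::s]]


def degrade(hr, kernel, s, noise_sigma=0.0, seed=0):
    """Blur, downsample, then add white Gaussian noise (assumed parameterization)."""
    rng = random.Random(seed)
    lr = downsample(conv2d_same(hr, kernel), s)
    return [[p + rng.gauss(0.0, noise_sigma) for p in row] for row in lr]


# With a Dirac kernel and no noise, only the downsampling step has an effect.
hr = [[float(4 * r + c) for c in range(4)] for r in range(4)]
dirac = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
lr = degrade(hr, dirac, s=2)
```

Blind SR then amounts to inverting `degrade` without being given `kernel`.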
Recently, deep neural networks (DNNs) have achieved remarkable results on SISR. But most of these methods rcan ; carn ; rdn ; edsr ; san ; vdsr assume that the blur kernel is predefined as the kernel of bicubic interpolation. In this way, a large number of training samples can be manually synthesized and further used to train powerful DNNs. However, blur kernels in real applications are much more complicated, and there is a domain gap between bicubically synthesized training samples and real images. This domain gap leads to a severe performance drop when these networks are applied to real applications. Thus, more attention should be paid to SR in the context of unknown blur kernels, i.e. blind SR.
In blind SR, there is one more undetermined variable, i.e. the blur kernel $\mathbf{k}$, and the optimization also becomes much more difficult. To make this problem easier to solve, previous methods srmd ; udvd ; dpsr ; usr usually decompose it into two sequential steps: i) estimating the blur kernel from the LR image and ii) restoring the SR image based on the estimated kernel. This two-step solution involves two independently trained models, so they may not be well compatible with each other. A small estimation error in the first step can cause a severe performance drop in the following one ikc . On the other hand, the first step can only utilize limited information from the LR image, which makes it difficult to predict a highly accurate blur kernel. As a result, although both models can perform well individually, the final result may be suboptimal when they are combined.
Instead of considering these two steps separately, we adopt an alternating optimization algorithm, which can estimate the blur kernel and restore the SR image in the same model. Specifically, we design two convolutional neural modules, namely Restorer and Estimator. Restorer restores the SR image based on the blur kernel predicted by Estimator, and the restored SR image is in turn used to help Estimator estimate a better blur kernel. Once the blur kernel is manually initialized, the two modules can cooperate well with each other to form a closed loop, which can be iterated over and over. The iterating process is then unfolded into an end-to-end trainable network, which is called deep alternating network (DAN). In this way, Estimator can utilize information from both LR and SR images, which makes the estimation of the blur kernel easier. More importantly, Restorer is trained with the kernel estimated by Estimator, instead of the ground-truth kernel. Thus, during testing Restorer can be more tolerant to the estimation error of Estimator. Besides, the results of both modules can be substantially improved during the iterations, so our alternating optimization algorithm is likely to get better final results than the direct two-step solutions. We summarize our contributions in three points:
We adopt an alternating optimization algorithm to estimate the blur kernel and restore the SR image for blind SR in a single network (DAN), which helps the two modules to be well compatible with each other and makes it likely to get better final results than previous two-step solutions.
We design two convolutional neural modules, which can be alternated repeatedly and then unfolded to form an end-to-end trainable network, without any pre/post-processing. It is easier to train and faster than previous two-step solutions. To the best of our knowledge, the proposed method is the first end-to-end network for blind SR.
Extensive experiments on synthetic datasets and real-world images show that our model can largely outperform state-of-the-art methods and produce more visually favorable results at much higher speed.
Learning-based methods for SISR usually require a large number of paired HR and LR images as training samples. However, such paired samples are hard to obtain in the real world. As a result, researchers manually synthesize LR images from HR images with predefined downsampling settings. The most popular setting is bicubic interpolation, i.e. defining $\mathbf{k}$ in Equation 1 as the bicubic kernel. Since the introduction of SRCNN srcnn , various DNNs vdsr ; rdn ; rcan ; fsrcnn ; dbpn ; meta_sr have been proposed based on this setting. Recently, after the proposal of RCAN rcan and RRDB esrgan , the performance of these non-blind methods has even started to saturate on common benchmark datasets. However, the blur kernels of real images are much more complicated. In real applications, kernels are unknown and differ from image to image. As a result, although these methods perform excellently in the context of bicubic downsampling, they still cannot be directly applied to real images due to the domain gap.
Another kind of non-blind SR methods aims to propose a single model for multiple degradations, i.e. the second step of the two-step solution for blind SR. These methods take both LR image and its corresponding blur kernel as inputs. In zssr_pre ; zssr , the blur kernel is used to downsample images and synthesize training samples, which can be used to train a specific model for given kernel and LR image. In srmd , the blur kernel and LR image are directly concatenated at the first layer of a DNN. Thus, the SR result can be closely correlated to both LR image and blur kernel. In dpsr , Zhang et al. proposed a method based on ADMM algorithm. They interpret this problem as MAP optimization and solve the data term and prior term alternately. In ikc , a spatial feature transform (SFT) layer is proposed to better preserve the details in LR image while blur kernel is an additional input. However, as pointed out in ikc , the SR results of these methods are usually sensitive to the provided blur kernels. Small deviation of provided kernel from the ground truth will cause severe performance drop of these non-blind SR methods.
Previous methods for blind SR are usually sequential combinations of a kernel-estimation method and a non-blind SR method. Thus kernel-estimation methods are also an important part of blind SR. In nonpara , Michaeli et al. estimate the blur kernel by utilizing the internal patch recurrence. In kernel_gan and gan_first , the LR image is first downsampled by a generative network, and then a discriminator is used to verify whether the downsampled image has the same distribution as the original LR image. In this way, the blur kernel can be learned by the generative network. In ikc , Gu et al. not only train a network for kernel estimation, but also propose a correction network to iteratively correct the kernel. Although the accuracy of the estimated kernel is largely improved, it requires training two or even three networks, which is rather complicated. In contrast, DAN is an end-to-end trainable network that is much easier to train and much faster.
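As shown in Equation 1, there are three variables, i.e. $\mathbf{x}$, $\mathbf{k}$ and $\mathbf{n}$, to be determined in the blind SR problem. Following the literature, we can apply a denoising algorithm ircan ; bm3d ; wnnm in the first place. Then the blind SR algorithm only needs to focus on solving $\mathbf{k}$ and $\mathbf{x}$. It can be mathematically expressed as an optimization problem:
$$\arg\min_{\mathbf{k},\, \mathbf{x}} \; \|\mathbf{y} - (\mathbf{x} \otimes \mathbf{k})\downarrow_s\|_1 + \phi(\mathbf{x}), \tag{2}$$
where the former part is the reconstruction term, and $\phi(\mathbf{x})$ is the prior term for the HR image. The prior term is usually unknown and has no analytic expression. Thus it is extremely difficult to solve this problem directly. Previous methods decompose this problem into two sequential steps:
$$\begin{cases} \mathbf{k} = M(\mathbf{y}) \\ \mathbf{x} = \arg\min_{\mathbf{x}} \; \|\mathbf{y} - (\mathbf{x} \otimes \mathbf{k})\downarrow_s\|_1 + \phi(\mathbf{x}) \end{cases} \tag{3}$$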
where $M(\cdot)$ denotes the function that estimates $\mathbf{k}$ from $\mathbf{y}$, and the second step is usually solved by a non-blind SR method described in Sec 2.2. This two-step solution has three drawbacks. Firstly, this algorithm usually requires training two or even more models, which is rather complicated. Secondly, $M(\cdot)$ can only utilize information from $\mathbf{y}$, which treats $\mathbf{k}$ as a kind of prior of $\mathbf{y}$. But in fact, $\mathbf{k}$ cannot be properly solved without information from $\mathbf{x}$. Lastly, the non-blind SR model for the second step is trained with ground-truth kernels, while during testing it only has access to kernels estimated in the first step. The difference between ground-truth and estimated kernels usually causes a severe performance drop of the non-blind SR model ikc .
To address these drawbacks, we propose an end-to-end network that can largely alleviate these issues. We still split the problem into two subproblems, but instead of solving them sequentially, we adopt an alternating optimization algorithm, which restores the SR image and estimates the corresponding blur kernel alternately. The mathematical expression is
$$\begin{cases} \mathbf{k}_{i+1} = \arg\min_{\mathbf{k}} \; \|\mathbf{y} - (\mathbf{x}_i \otimes \mathbf{k})\downarrow_s\|_1 \\ \mathbf{x}_{i+1} = \arg\min_{\mathbf{x}} \; \|\mathbf{y} - (\mathbf{x} \otimes \mathbf{k}_i)\downarrow_s\|_1 + \phi(\mathbf{x}) \end{cases} \tag{4}$$
We alternately solve these two subproblems via convolutional neural modules, namely Estimator and Restorer respectively. In fact, the Estimator subproblem even has an analytic solution. But we experimentally find that the analytic solution is more time-consuming and not robust enough (when noise is not fully removed). We fix the number of iterations and unfold the iterating process to form an end-to-end trainable network, which is called deep alternating network (DAN).
The kernel is also reshaped and then reduced by principal component analysis (PCA). Both modules are supervised only at the last iteration by L1 loss. The whole network can be trained well without any restrictions on intermediate results, because the parameters of both modules are shared between different iterations.
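The unfolded alternation can be sketched as a plain loop in which the same two callables are reused at every iteration, mirroring the parameter sharing mentioned above. This is only a schematic: `unfold_alternating`, the toy Restorer and Estimator, and the single-element "kernel" are hypothetical stand-ins chosen so the example runs without a deep learning framework, and the Restorer-first ordering is an assumption based on the kernel being initialized manually.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Kernel = List[float]


def unfold_alternating(
    lr: Image,
    init_kernel: Kernel,
    restorer: Callable[[Image, Kernel], Image],
    estimator: Callable[[Image, Image], Kernel],
    num_iters: int,
) -> Tuple[Image, Kernel]:
    """Alternate Restorer and Estimator for a fixed number of iterations."""
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")
    kernel = list(init_kernel)
    sr: Image = []
    for _ in range(num_iters):
        sr = restorer(lr, kernel)    # restore SR from LR and the current kernel
        kernel = estimator(lr, sr)   # re-estimate the kernel from LR and SR
    return sr, kernel


def toy_restorer(lr: Image, kernel: Kernel) -> Image:
    # Hypothetical module: brightens the image depending on the kernel value.
    return [[p * (1.0 + kernel[0]) for p in row] for row in lr]


def toy_estimator(lr: Image, sr: Image) -> Kernel:
    # Hypothetical module: derives a scalar "kernel" from the SR/LR mean ratio.
    def mean(img: Image) -> float:
        return sum(map(sum, img)) / sum(len(r) for r in img)
    return [0.5 * mean(sr) / mean(lr)]


sr, kernel = unfold_alternating([[1.0, 1.0]], [0.0], toy_restorer, toy_estimator, 4)
```

In this toy setting the scalar moves closer to its fixed point at every iteration, which is the behaviour the unfolded network is meant to learn for real kernels and images.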
In DAN, Estimator takes both LR and SR images as inputs, which makes the estimation of the blur kernel much easier. More importantly, Restorer is trained with the kernel estimated by Estimator, instead of the ground-truth kernel as in previous methods. Thus, Restorer can be more tolerant to the estimation error of Estimator during testing. Besides, compared with previous two-step solutions, the results of both modules in DAN can be substantially improved, and DAN is likely to get better final results. In the special case where the scale factor $s = 1$, DAN becomes a deblurring network. Due to limited pages, we only discuss SR cases in this paper.
Both modules in our network have two inputs. Estimator takes the LR and SR images, and Restorer takes the LR image and the blur kernel as inputs. We define the LR image as the basic input, and the other one as the conditional input. For example, the blur kernel is the conditional input of Restorer. During iterations, the basic inputs of both modules stay the same, but their conditional inputs are repeatedly updated. We claim that it is critically important to keep the output of each module closely related to its conditional input. Otherwise, the iterating results will collapse to a fixed point at the first iteration. Specifically, if Estimator outputs the same kernel regardless of the value of the SR image, or Restorer outputs the same SR image regardless of the value of the blur kernel, their outputs will depend only on the basic input, and the results will stay the same during the iterations.
To ensure that the outputs of Estimator and Restorer are closely related to their conditional inputs, we propose a conditional residual block (CRB). On the basis of the residual block in rcan , we concatenate the conditional and basic inputs at the beginning:
$$f_{out} = R(\mathrm{Concat}(f_{basic}, f_{cond})) + f_{basic}, \tag{5}$$
where $R(\cdot)$ denotes the residual mapping function of CRB and $\mathrm{Concat}(\cdot)$ denotes concatenation. $f_{basic}$ and $f_{cond}$ are the basic input and conditional input respectively. As shown in Figure 2 (a), the residual mapping function consists of two convolutional layers and one channel attention layer senet . Both Estimator and Restorer are built from CRBs.
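To make the data flow of a CRB concrete, here is a heavily simplified pure-Python sketch, assuming the block adds the output of the residual mapping back onto the basic input as in a standard residual block. Several pieces are stand-ins rather than the authors' design: the two convolutions are replaced by 1x1 (pointwise) convolutions, a ReLU is assumed between them, and the channel attention layer is replaced by a parameter-free sigmoid gate on each channel's global average. Feature maps are nested lists indexed as `[channel][row][column]`.

```python
import math


def pointwise_conv(feat, weight):
    """1x1 convolution; feat is [C_in][H][W], weight is [C_out][C_in]."""
    h, w = len(feat[0]), len(feat[0][0])
    return [
        [[sum(wc[c] * feat[c][i][j] for c in range(len(feat))) for j in range(w)]
         for i in range(h)]
        for wc in weight
    ]


def relu(feat):
    return [[[max(0.0, v) for v in row] for row in ch] for ch in feat]


def channel_attention(feat):
    """Stand-in attention: scale each channel by sigmoid(its global average)."""
    out = []
    for ch in feat:
        mean = sum(map(sum, ch)) / (len(ch) * len(ch[0]))
        gate = 1.0 / (1.0 + math.exp(-mean))
        out.append([[gate * v for v in row] for row in ch])
    return out


def crb(f_basic, f_cond, w1, w2):
    """Residual mapping on the concatenated inputs, plus the basic input."""
    concat = f_basic + f_cond                # concatenate along the channel axis
    res = relu(pointwise_conv(concat, w1))   # first layer of the residual mapping
    res = pointwise_conv(res, w2)            # second layer, back to C_basic channels
    res = channel_attention(res)
    return [
        [[r + b for r, b in zip(res_row, basic_row)]
         for res_row, basic_row in zip(res_ch, basic_ch)]
        for res_ch, basic_ch in zip(res, f_basic)
    ]
```

Because the conditional input enters the residual mapping through the concatenation, changing `f_cond` changes the output whenever the first-layer weights on the conditional channels are non-zero, which is the property the block is designed to enforce.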
Estimator. The whole structure of Estimator is shown in Figure 2 (b). We first downsample the SR image by a convolutional layer with stride $s$. Then the feature maps are sent to all CRBs as conditional inputs. At the end of the network, we squeeze the features by global average pooling to form the elements of the predicted kernel. Since the kernel is reduced by PCA, Estimator only needs to estimate the PCA result of the blur kernel. In practice, Estimator has CRBs, and both basic input and conditional input of each CRB have channels.
Restorer. The whole structure of Restorer is shown in Figure 2 (c). In Restorer, we stretch the kernel in spatial dimension to the same spatial size as LR image. Then the stretched kernel is sent to all CRBs of Restorer as conditional inputs. We use PixelShuffle pixel_shuffle layers to upscale the features to desired size. In practice, Restorer has CRBs, and the basic input and conditional input of each CRB has and channels respectively.
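The two shape-changing operations mentioned for the modules, stretching a kernel vector over the spatial grid (Restorer) and squeezing feature maps by global average pooling (Estimator), can be sketched in a few lines. The helper names and the nested-list layout `[channel][row][column]` are assumptions made for this example.

```python
def stretch_kernel(kernel_code, height, width):
    """Repeat each element of the kernel vector over an H x W grid."""
    return [[[float(v)] * width for _ in range(height)] for v in kernel_code]


def global_average_pool(feat):
    """Squeeze [C][H][W] feature maps into a length-C vector."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]


# A 3-element (reduced) kernel stretched to match a 2 x 3 LR feature map.
maps = stretch_kernel([0.5, 0.25, 0.0], height=2, width=3)
code = global_average_pool(maps)
```

The stretched maps have the same spatial size as the LR features, so they can be concatenated with them along the channel axis as the conditional input of each CRB.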
We collect HR images from DIV2K div2k and Flickr2K flickr2k as training set. To make reasonable comparison with other methods, we train models with two different degradation settings. One is the setting in ikc , which only focuses on cases with isotropic Gaussian blur kernels. The other is the setting in kernel_gan , which focuses on cases with more general and irregular blur kernels.
Setting 1. Following the setting in ikc , the kernel size is set as . During training, the kernel width is uniformly sampled in [0.2, 4.0], [0.2, 3.0] and [0.2, 2.0] for scale factors $\times 4$, $\times 3$ and $\times 2$ respectively. For quantitative evaluation, we collect HR images from the commonly used benchmark datasets, i.e. Set5, Set14 set14 , Urban100 urban100 , BSD100 bsd100 and Manga109 manga109 . Since fixed kernels are needed for a reasonable comparison, we uniformly choose 8 kernels, denoted as Gaussian8, from the ranges [1.8, 3.2], [1.35, 2.40] and [0.80, 1.60] for scale factors $\times 4$, $\times 3$ and $\times 2$ respectively. The HR images are first blurred by the selected blur kernels and then downsampled to form synthetic test images.
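A minimal sketch of how such isotropic Gaussian test kernels could be generated is given below. It assumes that "kernel width" means the standard deviation of the Gaussian and that "uniformly choose 8 kernels" means 8 evenly spaced widths including both ends of the range; both are readings of the text, not confirmed details. The kernel size used in the usage line is an arbitrary placeholder.

```python
import math


def isotropic_gaussian_kernel(size, sigma):
    """Normalized size x size isotropic Gaussian kernel centred on the grid."""
    c = (size - 1) / 2.0
    k = [
        [math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2.0 * sigma ** 2))
         for j in range(size)]
        for i in range(size)
    ]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


def uniform_widths(low, high, count=8):
    """Evenly spaced kernel widths from low to high, both ends included."""
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]


# Eight widths from the range quoted for the largest scale factor.
widths = uniform_widths(1.8, 3.2)
# Placeholder kernel size of 7; the actual size is not given here.
kernels = [isotropic_gaussian_kernel(7, w) for w in widths]
```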
Setting 2. Following the setting in kernel_gan , we set the kernel size as . We first generate anisotropic Gaussian kernels. The lengths of both axes are uniformly distributed in, rotated by a random angle uniformly distributed in [, ]. To deviate from a regular Gaussian, we further apply uniform multiplicative noise (up to 25% of each pixel value of the kernel) and normalize the kernel to sum to one. For testing, we use the benchmark dataset DIV2KRK that is used in kernel_gan .
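The kernel synthesis for this setting (anisotropic Gaussian, random rotation, multiplicative noise, normalization) can be sketched as follows. The sketch treats the two axis lengths as standard deviations along the rotated axes, which is an assumption, and leaves the sampling ranges to the caller because they are not recoverable from the text. All function and parameter names are hypothetical.

```python
import math
import random


def anisotropic_gaussian_kernel(size, len_x, len_y, theta, noise=0.25, seed=0):
    """Rotated anisotropic Gaussian with uniform multiplicative noise, summing to one."""
    rng = random.Random(seed)
    c = (size - 1) / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for i in range(size):
        row = []
        for j in range(size):
            dx, dy = j - c, i - c
            # Coordinates in the rotated frame aligned with the Gaussian axes.
            u = cos_t * dx + sin_t * dy
            v = -sin_t * dx + cos_t * dy
            g = math.exp(-0.5 * ((u / len_x) ** 2 + (v / len_y) ** 2))
            # Perturb each pixel by up to +/- noise (25% by default).
            g *= 1.0 + rng.uniform(-noise, noise)
            row.append(g)
        kernel.append(row)
    total = sum(map(sum, kernel))
    return [[g / total for g in row] for row in kernel]
```

A caller would draw `len_x`, `len_y` and `theta` from the chosen uniform ranges before each call, so that every training sample sees a different irregular kernel.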
The input size during training is for all scale factors. The batch size is . Each model is trained for iterations. We use Adam adam as our optimizer, with , . The initial learning rate is , and will decay by half at every iterations. All models are trained on RTX2080Ti GPUs.
Setting 1. For the first setting, we evaluate our method on test images synthesized by Gaussian8 kernels. We mainly compare our results with ZSSR zssr and IKC ikc , which are methods designed for blind SR. We also include a comparison with CARN carn . Since it is not designed for blind SR, we apply the deblurring method pan before or after CARN. The PSNR and SSIM results on the Y channel of the transformed YCbCr space are shown in Table 1.
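For readers unfamiliar with evaluating on the Y channel, the following sketch computes PSNR on luma only. The conversion coefficients are the ITU-R BT.601 values commonly used in SR evaluation scripts; whether the reported numbers use exactly this conversion, or crop image borders first, is not stated here, so treat those choices as assumptions. Images are nested lists of `(R, G, B)` tuples with values in [0, 255].

```python
import math


def rgb_to_y(pixel):
    """BT.601 luma of an (R, G, B) pixel with values in [0, 255]."""
    r, g, b = pixel
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(img_a, img_b, peak=255.0):
    """PSNR between two equally sized RGB images, measured on the Y channel."""
    diffs = [
        (rgb_to_y(pa) - rgb_to_y(pb)) ** 2
        for row_a, row_b in zip(img_a, img_b)
        for pa, pb in zip(row_a, row_b)
    ]
    mse = sum(diffs) / len(diffs)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)
```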
Although CARN achieves remarkable results in the context of bicubic downsampling, it suffers a severe performance drop when applied to images with unknown blur kernels. Its performance is largely improved when it is followed by a deblurring method, but it is still inferior to that of blind SR methods. ZSSR trains a specific network for each tested image by utilizing the internal patch recurrence. However, ZSSR has an inherent drawback: the training samples for each image are limited, and thus it cannot learn a good prior for HR images. IKC is also a two-step solution for blind SR. Although the accuracy of the estimated kernel is largely improved in IKC, the final result is still suboptimal. DAN is trained in an end-to-end scheme, which is not only much easier to train than two-step solutions, but also likely to reach a better optimum. As shown in Table 1, the PSNR result of DAN on Manga109 for scale is even higher than that of IKC. For other scales and datasets, DAN also largely outperforms IKC.
The visual results of img 005 in Urban100 are shown in Figure 3 for comparison. As one can see, CARN and ZSSR cannot even restore the edges of the window. IKC performs better, but the edges are severely blurred. In contrast, DAN restores sharp edges and produces a more visually pleasant result.
|pan +CARN carn||24.20||0.7496||21.12||0.6170||22.69||0.6471||18.89||0.5895||21.54||0.7946|
|CARN carn +pan||31.27||0.8974||29.03||0.8267||28.72||0.8033||25.62||0.7981||29.58||0.9134|
|pan +CARN carn||19.05||0.5226||17.61||0.4558||20.51||0.5331||16.72||0.4578||18.38||0.6118|
|CARN carn +pan||30.31||0.8562||27.57||0.7531||27.14||0.7152||24.45||0.7241||27.67||0.8592|
|pan +CARN carn||18.10||0.4843||16.59||0.3994||18.46||0.4481||15.47||0.3872||16.78||0.5371|
|CARN carn +pan||28.69||0.8092||26.40||0.6926||26.10||0.6528||23.46||0.6597||25.84||0.8035|
Setting 2. The second setting involves irregular blur kernels, which is more general, but also more difficult to solve. For Setting 2, we mainly compare methods of three different classes: i) SOTA SR algorithms trained on bicubically downsampled images such as EDSR edsr and RCAN rcan , ii) blind SR methods designed for the NTIRE competition such as PDN pdn and WDSR wdsr , iii) two-step solutions, i.e. the combination of a kernel estimation method and a non-blind SR method, such as KernelGAN kernel_gan and ZSSR zssr . The PSNR and SSIM results on the Y channel are shown in Table 2.
Similarly, the performance of methods trained on bicubically downsampled images is limited by the domain gap. Thus, their results are only slightly better than that of interpolation. The methods in Class 2 are trained on synthesized images provided in NTIRE competition. Although these methods achieve remarkable results in the competition, they still cannot generalize well to irregular blur kernels.
The comparison between methods of Class 3 is particularly instructive. Specifically, USRNet usr achieves remarkable results when GT kernels are provided, and KernelGAN also performs well on kernel estimation. However, when they are combined, as shown in Table 2, the final SR results are worse than those of all other methods. This indicates that it is important for the Estimator and Restorer to be compatible with each other. Additionally, although a better kernel-estimation method can benefit the SR results, the overall performance is still largely inferior to that of DAN. DAN outperforms the combination of KernelGAN and ZSSR by and for scales and respectively.
The visual results of img 892 in DIV2KRK are shown in Figure 4. Although the combination of KernelGAN and ZSSR can produce slightly sharper edges than interpolation, it suffers from severe artifacts. The SR image of DAN is much cleaner and has more reliable details.
|Class 1||Bicubic||28.73||0.8040||25.33||0.6795|
|Bicubic kernel + ZSSR zssr||29.10||0.8215||25.61||0.6911|
|Class 2||PDN pdn - 1st in NTIRE’19 track4||/||/||26.34||0.7190|
|WDSR wdsr - 1st in NTIRE’19 track2||/||/||21.55||0.6841|
|WDSR wdsr - 1st in NTIRE’19 track3||/||/||21.54||0.7016|
|WDSR wdsr - 2nd in NTIRE’19 track4||/||/||25.64||0.7144|
|Ji et al. ji2020real - 1st in NTIRE’20 track 1||/||/||25.43||0.6907|
|Class 3||Cornillere et al. siga||29.46||0.8474||/||/|
|Michaeli et al. nonpara + SRMD srmd||25.51||0.8083||23.34||0.6530|
|Michaeli et al. nonpara + ZSSR zssr||29.37||0.8370||26.09||0.7138|
|KernelGAN kernel_gan + SRMD srmd||29.57||0.8564||25.71||0.7265|
|KernelGAN kernel_gan + USRNet usr||/||/||20.06||0.5359|
|KernelGAN kernel_gan + ZSSR zssr||30.36||0.8669||26.81||0.7316|
To evaluate the accuracy of the predicted kernels, we calculate their L1 errors in the reduced space, and the results on Urban100 are shown in Figure 5 (a). As one can see, the L1 errors of the reduced kernels predicted by DAN are much lower than those of IKC. This suggests that the overall improvements of DAN may partially come from more accurately retrieved kernels. We also plot the PSNR results with respect to kernels with different sigma in Figure 5 (b). As sigma increases, the performance gap between IKC and DAN also becomes larger. This indicates that DAN may have better generalization ability.
We also replace the estimated kernel by the ground truth (GT) to further investigate the influence of Estimator. If GT kernels are provided, the iterating process becomes meaningless. Thus we test the Restorer with a single forward pass. The test results for Setting 1 are shown in Table 3. The results remain almost unchanged and sometimes even get worse when GT kernels are provided. This indicates that Estimator may already satisfy the requirements of Restorer, and the superiority of DAN also partially comes from this good cooperation between its Estimator and Restorer.
After the model is trained, we also change the number of iterations to see whether the two modules have learned the property of convergence or have just ‘remembered’ the iteration number. The model is trained with iterations, but during testing we increase the iteration number from to . As shown in Figure 6 (a) and (c), the average PSNR results on Set5 and Set14 first increase rapidly and then gradually converge. It should be noted that when we iterate more times than in training, the performance does not become worse, and sometimes even becomes better. For example, the average PSNR on Set14 is when the iteration number is , higher than when we iterate times. Although the increment is relatively small, it suggests that the two modules may have learned to cooperate with each other, instead of solving this problem like ordinary end-to-end networks, whose performance drops significantly when the test setting differs from the training setting. It also suggests that the estimation error of intermediate results does not destroy the convergence of DAN. In other words, DAN is robust to various estimation errors.
Another advantage of our end-to-end model is its higher inference speed. To make a quantitative comparison, we evaluate the average speed of different methods on the same platform. We choose the 40 images synthesized by Gaussian8 kernels from Set5 as testing images, and all methods are evaluated on the same platform with an RTX2080Ti GPU. We choose KernelGAN kernel_gan + ZSSR zssr as one of the representative methods. Its speed is 415.7 seconds per image. IKC ikc is much faster, at 3.93 seconds per image. In comparison, the average speed of DAN is 0.75 seconds per image, which is about 554 times faster than KernelGAN + ZSSR (415.7 / 0.75) and about 5 times faster than IKC (3.93 / 0.75). In other words, DAN not only largely outperforms SOTA blind SR methods on PSNR results, but also has much higher speed.
We also conduct experiments to show that DAN can generalize well to real-world images. In this case, we need to consider the influence of additive noise. As we mentioned in Sec 3.1, we can apply a denoising algorithm in the first place. But for simplicity, we retrain a different model by adding AWGN to the LR image during training. In this way, DAN is forced to generalize to noisy images. The covariance of noise is set as . We use KernelGAN kernel_gan + ZSSR zssr and IKC ikc as the representative methods for blind SR, and CARN carn as the representative method for non-blind SR. The commonly used image chip chip is chosen as the test image. It should be noted that it is a real image and we do not have the ground truth. Thus we can only provide a visual comparison in Figure chip . As one can see, the result of KernelGAN + ZSSR is slightly better than bicubic interpolation, but is still heavily blurred. The result of CARN is over-smoothed and the edges are not sharp enough. IKC produces a cleaner result, but there are still some artifacts. The letter ‘X’ restored by IKC has an obvious dark line at the top right part, while this dark line is much lighter in the image restored by DAN. This suggests that if trained with noisy images, DAN can also learn to denoise, and produce more visually pleasant results with more reliable details. This is because both modules are implemented via convolutional layers, which are flexible enough to be adapted to different tasks.
In this paper, we have proposed an end-to-end algorithm for blind SR. This algorithm is based on alternating optimization, the two parts of which are both implemented by convolutional modules, namely Restorer and Estimator. We unfold the alternating process to form an end-to-end trainable network. In this way, Estimator can utilize information from both LR and SR images, which makes it easier to estimate the blur kernel. More importantly, Restorer is trained with the kernel estimated by Estimator, instead of the ground-truth kernel, so Restorer can be more tolerant to the estimation error of Estimator. Besides, the results of both modules can be substantially improved during the iterations, so DAN is likely to get better final results than previous two-step solutions. Experiments also show that DAN outperforms SOTA blind SR methods by a large margin. In the future, if the two parts of DAN can be implemented by more powerful modules, we believe that its performance could be further improved.
Super Resolution is a traditional task in computer vision. It has been studied for several decades and has wide applications in video enhancement, medical imaging, as well as security and surveillance imaging. These techniques have largely benefited the society in various areas for years and have no negative impact yet. The proposed method (DAN) could further improve the merits of these applications especially in cases where the degradations are unknown. DAN has relatively better performance and much higher speed, and it is possible for DAN to be used in real-time video enhancement or surveillance imaging. This work does not present any negative foreseeable societal consequence.
This work is jointly supported by National Key Research and Development Program of China (2016YFB1001000), Key Research Program of Frontier Sciences, CAS (ZDBS-LY-JSC032), Shandong Provincial Key Research and Development Program (2019JZZY010119), and CAS-AIR.
2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1122–1131, 2017.
Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391–407. Springer, 2016.