Source code for paper "Iterative Network for Image Super-Resolution"
Single image super-resolution (SISR), as a traditional ill-conditioned inverse problem, has been greatly revitalized by the recent development of convolutional neural networks (CNNs). These CNN-based methods generally map a low-resolution image to its corresponding high-resolution version with sophisticated network structures and loss functions, showing impressive performance. This paper proposes a substantially different approach that relies on iterative optimization in HR space with an iterative super-resolution network (ISRN). We first analyze the observation model of the image SR problem, which inspires a feasible solution that mimics and fuses each iteration in a more general and efficient manner. Considering the drawbacks of batch normalization, we propose a feature normalization (F-Norm) method to regulate the features in the network. Furthermore, a novel block with F-Norm, termed FNB, is developed to improve the network representation. A residual-in-residual structure is used to form a very deep network, which groups FNBs with a long skip connection for better information delivery and for stabilizing the training phase. Extensive experimental results on testing benchmarks with bicubic (BI) degradation show that ISRN can not only recover more structural information, but also achieve competitive or better PSNR/SSIM results with far fewer parameters than other works. Besides BI, we simulate real-world degradation with blur-downscale (BD) and downscale-noise (DN). ISRN and its extension ISRN+ both achieve better performance than others with BD and DN degradation models.
Single image super-resolution (SISR) is a traditional ill-posed problem in image processing. Given a low-resolution (LR) image, the task of SISR is to find the corresponding high-resolution (HR) image. Convolutional neural networks (CNNs) have shown impressive performance for image restoration [1, 2, 3], and there are numerous recent CNN-based works for image super-resolution [4, 5]. SRCNN, proposed by Dong et al., is the first work to address SISR with a three-layer network, and it achieves better performance than traditional methods. The three layers of SRCNN correspond to the steps of traditional sparse coding methods. Deeper networks usually result in better performance. Kim et al. increased the number of layers and introduced global residual learning in VDSR for stronger network representation and better performance. The deconvolution layer was widely used in early SISR works to increase resolution. ESPCN, proposed by Shi et al., substituted the deconvolution with a sub-pixel convolutional layer for a more effective upscaling operation, which has proved to be an effective structure. After ESPCN, most SISR works choose the sub-pixel layer instead of deconvolution. The residual structure has shown strong performance on image and video restoration. To obtain better performance, EDSR, proposed by Lim et al., adopted residual blocks with more filters. Batch normalization layers are removed in EDSR to decrease the memory cost and build a deeper network. Besides deeper designs, there are works concentrating on effective blocks. Since dense connection has shown good performance for different tasks, SRDenseNet, proposed by Tong et al., stacked dense blocks for better performance. Zhang et al. combined residual and dense connections in RDN. He et al. designed ODENet based on ordinary differential equations. Multi-scale designs also turn out to be an effective component [14, 15]. MRFN introduced a multi-receptive-field design for feature exploration. Meanwhile, there are also works focusing on the attention mechanism [16, 17]. These methods share a straightforward structure that maps LR images directly to HR images.
Recursive designs have also been widely studied for image restoration problems. To the best of our knowledge, Kim et al. were the first to apply a recursive structure with shared convolution layers to the SISR problem, in DRCN. To expand the receptive field, DRCN increased the network depth by using shared filters with limited parameters. Inspired by the residual design, Tai et al. proposed DRRN, which incorporates residual blocks. DRRN introduced a recursive block design that combines convolution layers, achieving better performance than VDSR. MemNet, developed by Tai et al., is motivated by the long-term memory model of the human brain. In MemNet, recursive and gate units are proposed to simulate the memory mechanism, and memory blocks are adopted for better performance. Recently, Yang et al. designed DRFN with a recurrent structure for large scaling factors. However, these methods lack an explanation of the underlying optimization mechanism.
Numerous normalization methods have been developed to improve network representation. VDSR used batch normalization (BN) between different convolution layers. Since BN consumes more memory, recent works have replaced the normalization with more efficient convolutional layers. Weight normalization (WN), first proposed by Salimans et al. for recurrent models, was adopted in WDSR by Yu et al.
In this paper, an iterative super-resolution network, termed ISRN, is proposed to solve the SISR problem. We analyze the observation model and the target of image SR from the perspective of traditional energy optimization [26, 27, 28]. Motivated by those works, the half quadratic splitting (HQS) method is adopted to analyze the SR problem and obtain a feasible solution. The network is designed with an iterative structure based on this solution. Features from each iteration are collected and fused to obtain the final result based on maximum likelihood estimation (MLE). In the vanilla HQS method, the degradation model must be given explicitly to find the closed-form solution. However, when the degradation model is complex, it is challenging to describe it with a formula. From this perspective, a network structure is introduced to simulate both the degradation and the optimization.
In particular, a novel block with feature normalization (FN), termed FNB, is designed in ISRN. Different from other normalization methods, the proposed FN uses convolution layers to find adaptive weights and biases for every pixel. To pass features from shallow layers to deeper layers more efficiently, FNBs are grouped with a residual structure and padding layers, termed FNG. Extensive experimental results show that ISRN and the extension model with self-ensemble, ISRN+, are competitive or superior in terms of PSNR/SSIM with far fewer parameters. Subjective visualizations in Fig 8 clearly show that ISRN can recover structural textures more effectively. Besides bicubic (BI) degradation, we also simulate real-world degradation by blur-downscale (BD) and downscale-noise (DN) operations. ISRN and ISRN+ perform better in both objective and subjective comparisons with BD and DN degradation models.
The main contributions of this paper are summarized as follows:
We provide a new perspective on SISR by integrating the conventional optimization architecture with deep convolution networks. In this perspective, a novel and lightweight iterative super-resolution network (ISRN) is proposed.
We propose a novel block with feature normalization (FNB). FNBs are grouped with residual structure and padding layers to bypass the features with skip connections more effectively, termed as FNG.
Experimental results show that ISRN is competitive or better in terms of PSNR/SSIM with far fewer parameters. Visualization results indicate that ISRN delivers better performance on the recovery of complex structural information. Furthermore, ISRN and the extension model ISRN+ achieve better performance in both subjective and objective comparisons with BD and DN degradation models.
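The observation model of the SISR problem can be formulated as,

$$\mathbf{y} = \mathbf{D}\mathbf{x} + \mathbf{n}, \qquad (1)$$

where $\mathbf{D}$ is the degradation operator, $\mathbf{n}$ is the noise term, and $\mathbf{y}$ and $\mathbf{x}$ are the LR and HR images respectively. Generally speaking, $\mathbf{D}$ could be a bicubic down-sampler, a blur kernel, or a mixture of such operations.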
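Given an LR image $\mathbf{y}$, the target of super-resolution is to find an $\hat{\mathbf{x}}$ satisfying,

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{D}\mathbf{x}\|_2^2 + \lambda\,\Phi(\mathbf{x}), \qquad (2)$$

where $\Phi(\cdot)$ is the image prior term and $\lambda$ is a factor. $\|\cdot\|_2$ denotes the $\ell_2$-norm.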
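To obtain the HR image, numerous CNN-based works calculate a direct mapping from LR to HR, aiming to solve Eqn. (2). In this paper, the half quadratic splitting (HQS) [29, 32] method is applied to find the solution. Let $\mathbf{z} = \mathbf{D}\mathbf{x}$, then Eqn. (2) can be re-written as,

$$\min_{\mathbf{x},\mathbf{z}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{z}\|_2^2 + \lambda\,\Phi(\mathbf{x}) \quad \text{s.t.} \quad \mathbf{z} = \mathbf{D}\mathbf{x}. \qquad (3)$$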
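As such, Eqn. (3) can be solved iteratively by calculating $\mathbf{x}_k$ and $\mathbf{z}_k$ alternately,

$$\mathbf{x}_k = \arg\min_{\mathbf{x}} \tfrac{\mu_k}{2}\|\mathbf{z}_{k-1} - \mathbf{D}\mathbf{x}\|_2^2 + \lambda\,\Phi(\mathbf{x}), \qquad (4)$$

$$\mathbf{z}_k = \arg\min_{\mathbf{z}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{z}\|_2^2 + \tfrac{\mu_k}{2}\|\mathbf{z} - \mathbf{D}\mathbf{x}_k\|_2^2, \qquad (5)$$

where $\mu_k$ is a weighting factor for the $k$-th iteration that varies in a non-descending order across iterations. For Eqn. (5), $\mathbf{z}_k$ has a closed-form solution that linearly combines $\mathbf{y}$ and $\mathbf{D}\mathbf{x}_k$.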
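The closed form mentioned for Eqn. (5) can be written out explicitly. The derivation below is a sketch under stated assumptions: $\mathbf{y}$ denotes the LR image, $\mathbf{D}\mathbf{x}_k$ the down-sampled HR estimate of the $k$-th iteration, $\mathbf{z}_k$ the auxiliary LR variable, and $\mu_k$ the weighting factor; the $\tfrac{1}{2}$ scaling of both quadratic terms is a conventional choice and may differ from the original formulation.

```latex
\begin{aligned}
\mathbf{z}_k &= \arg\min_{\mathbf{z}} \tfrac{1}{2}\|\mathbf{y}-\mathbf{z}\|_2^2
              + \tfrac{\mu_k}{2}\|\mathbf{z}-\mathbf{D}\mathbf{x}_k\|_2^2 \\
\mathbf{0}   &= (\mathbf{z}_k-\mathbf{y}) + \mu_k(\mathbf{z}_k-\mathbf{D}\mathbf{x}_k)
              \quad\text{(set the gradient to zero)} \\
\mathbf{z}_k &= \frac{\mathbf{y}+\mu_k\,\mathbf{D}\mathbf{x}_k}{1+\mu_k}
              = \mathbf{y} + \frac{\mu_k}{1+\mu_k}\,(\mathbf{D}\mathbf{x}_k-\mathbf{y})
\end{aligned}
```

The last form makes the later gradient-descent analogy visible: the update starts at $\mathbf{y}$ and moves toward $\mathbf{D}\mathbf{x}_k$ with step length $\mu_k/(1+\mu_k)$, which approaches 1 as $\mu_k$ grows.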
Let $\lambda_k = \lambda / \mu_k$, then Eqn. (4) can be re-written as:

$$\mathbf{x}_k = \arg\min_{\mathbf{x}} \tfrac{1}{2}\|\mathbf{z}_{k-1} - \mathbf{D}\mathbf{x}\|_2^2 + \lambda_k\,\Phi(\mathbf{x}). \qquad (6)$$

The iterative solution can be interpreted from another perspective. In particular, Eqn. (6) can be cast as a mapping from the LR space to the HR space, such that a reasonably good result on average is obtained in each iteration. Eqn. (5) computes a linear combination of $\mathbf{y}$ and $\mathbf{D}\mathbf{x}_k$, which can be regarded as moving $\mathbf{y}$ in a specific direction with a specific step length governed by the parameter $\mu_k$. This is analogous to the gradient descent method. Since the exact distance between $\mathbf{x}_k$ and the true HR image in HR space is unknown, these iterative steps instead shrink the distance between $\mathbf{D}\mathbf{x}_k$ and $\mathbf{y}$. As such, we hold the notion that the iterative optimization steps gradually decrease the distance in the HR space by adjusting the distance in the LR space. The steps are illustrated in Fig. 9.
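The alternating updates can be tried on a toy problem. The sketch below is not the authors' method: it assumes a scalar signal, a scalar degradation `d` (so the LR observation is `y = d * x`), and a quadratic prior `0.5 * (x - prior_mean) ** 2`, chosen only because it gives both sub-problems a closed form. All names (`hqs_toy`, `prior_mean`, `mus`) are hypothetical.

```python
def hqs_toy(y, d, prior_mean, lam, mus):
    """Scalar half quadratic splitting with the splitting variable z = d * x.

    Assumptions (illustrative only): scalar degradation d and a quadratic
    prior 0.5 * (x - prior_mean)**2, so each update has a closed form.
    `mus` is the non-descending sequence of weighting factors.
    """
    z = y  # the first iteration starts from the LR observation
    xs, zs = [], []
    for mu in mus:
        lam_k = lam / mu
        # x-update: argmin_x 0.5*(z - d*x)**2 + lam_k*0.5*(x - prior_mean)**2
        x = (d * z + lam_k * prior_mean) / (d * d + lam_k)
        # z-update: argmin_z 0.5*(y - z)**2 + 0.5*mu*(z - d*x)**2
        # It depends on y and d*x only, not on the previous z (memory-less).
        z = (y + mu * d * x) / (1.0 + mu)
        xs.append(x)
        zs.append(z)
    return xs, zs


if __name__ == "__main__":
    xs, zs = hqs_toy(y=2.0, d=0.5, prior_mean=0.0, lam=0.1, mus=[1.0, 2.0, 4.0])
    for k, (x, z) in enumerate(zip(xs, zs), start=1):
        print(f"iteration {k}: x = {x:.4f}, z = {z:.4f}")
```

Each `z` is a convex combination of `y` and `d * x`, so it always lies between the two, which is the "direction and step length" reading of Eqn. (5).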
However, there are two critical issues. On one hand, the down-sampling operator $\mathbf{D}$, which accounts for the mapping from the HR space to the LR space, is difficult to simulate. In general, $\mathbf{D}$ can be regarded as a bicubic down-sampling operator during training. However, in some complicated situations, it can be difficult to express $\mathbf{D}$ explicitly. From Eqn. (5), the accuracy of $\mathbf{D}$ directly influences the optimization. From this perspective, the degradation model should be learned from paired data. On the other hand, the solution of Eqn. (5), i.e. $\mathbf{z}_k$, is a linear combination of $\mathbf{y}$ and $\mathbf{D}\mathbf{x}_k$ at the $k$-th step, which is close to a one-step gradient descent operation. When the $(k+1)$-th iteration begins, the starting point is still $\mathbf{y}$ instead of $\mathbf{z}_k$. This shows that the optimization is memory-less. In other words, the history of descent directions does not influence the starting point but only the next descent direction. To handle this issue, outputs from different iterations should be collected and considered jointly to find the final result. This can be regarded as a maximum likelihood estimation (MLE), expressed as,

$$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}} P(\mathbf{x}_1, \ldots, \mathbf{x}_K \mid \mathbf{x}), \qquad (7)$$

where $\hat{\mathbf{x}}$ denotes the final HR image, and $\mathbf{x}_k$ denotes the output of the $k$-th iteration.
The iterative super-resolution network (ISRN) is designed based on the preceding formulation study. From the problem formulation, $\mathbf{x}_k$ for the $k$-th iteration is obtained from Eqn. (6), which can be cast as an independent super-resolution problem mapping the input LR image $\mathbf{z}_{k-1}$ to the HR image $\mathbf{x}_k$. From this observation, a solver for image super-resolution is suitable for finding the solution. We design a network module to find the result, termed Solver SR. During training, the implicit expressions of the terms in Eqn. (6) are learned from the paired data, and the adaptive optimization is performed. Solver SR is shared across iterations to find suitable mapping relations between the LR space and the HR space.
The closed-form solution of Eqn. (5) is a linear combination of $\mathbf{y}$ and $\mathbf{D}\mathbf{x}_k$. Since there is no explicit expression for $\mathbf{D}$ when degradation models are complex, it is hard to find $\mathbf{D}\mathbf{x}_k$ given $\mathbf{x}_k$. We therefore use a network module to simulate the degradation, termed Down-sampler. Furthermore, considering that the weighting factor $\mu_k$ in Eqn. (5) varies across iterations, we utilize a network module to learn feasible factors for each iteration and find the solution, which is termed Solver LR.
From the formulation study, the optimization steps for each iteration are memory-less. It is necessary to collect the outputs of different iterations and find a suitable result considering all descent directions. The MLE step can be designed as a network module that finds the $\hat{\mathbf{x}}$ with maximum probability, termed Solver MLE.
As shown in Fig. 10, there are four modules in the proposed ISRN: Solver SR, Solver LR, Down-sampler and Solver MLE. These modules are detailed as follows.
Solver SR is the main component that generates images in HR space from the LR space and is shared across iterations. It is formulated as $\mathbf{x}_k = f_{SR}(\mathbf{z}_{k-1})$. Most recent networks for image restoration are deep, which may accumulate feature variance. Batch normalization was proposed for performance improvement, but it may consume much memory. In this paper, a novel feature normalization (F-Norm) method is proposed, formulated as,

$$\mathbf{F}_{out}^{i} = \mathbf{F}_{in}^{i} + \left(\mathbf{k}^{i} * \mathbf{F}_{in}^{i} + b^{i}\right),$$

where $i$ is the channel index, $\mathbf{F}_{in}^{i}$ and $\mathbf{F}_{out}^{i}$ are the corresponding input and output feature channels, $\mathbf{k}^{i}$ is a convolution kernel, and $b^{i}$ is the bias. To preserve the original feature information, the features before and after normalization are added to form the final output.
The proposed F-Norm is designed under the hypothesis that different channels contain different information. Channels are processed in parallel and independently, so that information is not fused across channels. This parallel normalization reduces the number of parameters and the computational complexity, making it flexible for various network designs.
F-Norm has a formulation similar to BN. If $\mathbf{k}^{i}$ is regarded as a convolution kernel of size $1 \times 1$, then it performs the same operation as BN with the batch size set to 1. F-Norm normalizes each feature map independently, which avoids the influence of the mini-batch in BN. The normalization factors are derived from the feature maps alone. Different from BN, F-Norm is implemented with only one convolution layer, which is fast and has little memory cost.
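The per-channel behaviour described above can be sketched in a few lines. This is an illustrative reading, not the released implementation: it assumes F-Norm is a depth-wise convolution (one kernel and one bias per channel, zero padding, cross-correlation as in deep learning frameworks) whose output is added back to the input. The function names `depthwise_conv` and `f_norm` are hypothetical.

```python
def depthwise_conv(channel, kernel, bias):
    """'Same' zero-padded cross-correlation of one 2-D channel with one odd-sized kernel."""
    h, w = len(channel), len(channel[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = bias
            for a in range(kh):
                for b in range(kw):
                    r, c = i + a - ph, j + b - pw
                    if 0 <= r < h and 0 <= c < w:  # zero padding outside the map
                        acc += kernel[a][b] * channel[r][c]
            out[i][j] = acc
    return out


def f_norm(features, kernels, biases):
    """Assumed F-Norm: per-channel conv + bias, then a residual add with the input.

    features: list of C channels (each an H x W nested list)
    kernels:  list of C kernels, one per channel (no mixing across channels)
    biases:   list of C scalars
    """
    out = []
    for ch, k, b in zip(features, kernels, biases):
        normed = depthwise_conv(ch, k, b)
        out.append([[x + n for x, n in zip(row_x, row_n)]
                    for row_x, row_n in zip(ch, normed)])
    return out


if __name__ == "__main__":
    # With a 1x1 kernel the conv is a per-channel scale and shift, like BN's affine step.
    print(f_norm([[[2.0, 4.0]]], [[[0.5]]], [1.0]))  # [[[4.0, 7.0]]]
```

Because every channel has its own kernel and nothing is summed across channels, the parameter count grows with the number of channels rather than with its square, which is the saving the text refers to.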
A novel block named feature normalization block (FNB) is built with F-Norm. In FNB, F-Norm is applied at the end of the residual block. On one hand, it normalizes the feature maps after non-linear processing. On the other hand, using only one normalization layer in each block saves parameters and computation.
The residual structure gradually passes shallow-layer features to deeper layers. To speed up feature delivery and make better use of shallow-layer features, a residual-in-residual (RIR) structure is applied in the network. A group of FNBs with a skip connection is proposed, termed FNG. Each FNB in the group has its own residual structure. The FNG as a whole also has a shortcut that passes shallow features to deeper layers and improves gradient transmission.
There is a padding structure after the FNBs, composed of two convolution layers with ReLU activation and an F-Norm layer. This padding structure introduces a non-linear processing step for the main-path information. In FNG, the F-Norm layer following the last convolution layer normalizes the features on the main path.
The entire network structure of Solver SR is shown in Fig. 11. Analogous to other super-resolution networks, Solver SR has a main enhancement path and a skip bypass, which together form the global residual framework. The bypass in Solver SR upscales the LR input by a convolution and a sub-pixel layer. A convolution layer is applied after each sub-pixel layer to introduce spatial correlation.
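For readers unfamiliar with the sub-pixel layer, the rearrangement it performs can be sketched as follows. This is a minimal pure-Python illustration of the usual periodic shuffling (each group of `r * r` channels becomes one channel that is `r` times larger in both dimensions); the channel ordering follows the common framework convention and is an assumption here, and `pixel_shuffle` is a hypothetical helper name.

```python
def pixel_shuffle(x, r):
    """Rearrange C*r*r channels of size H x W into C channels of size (H*r) x (W*r).

    x is a nested list indexed as x[channel][row][col]. Output pixel
    (row*r + i, col*r + j) of channel c is taken from input channel
    c*r*r + i*r + j at (row, col).
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


if __name__ == "__main__":
    # Four 1x1 channels become one 2x2 channel for r = 2.
    print(pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2))  # [[[1, 2], [3, 4]]]
```

Since each output pixel is copied from a different channel, neighbouring HR pixels are produced independently; a convolution placed after the shuffle, as described above, lets them interact spatially.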
Solver SR can be regarded as a complete network for single image super-resolution, since it directly maps an LR image into HR space. There are four modules in Solver SR. The first convolution layer in the main path is the feature extraction module. After feature extraction, several FNGs compose the non-linear mapping module. The restoration module is made up of two convolution layers with a sub-pixel layer. Finally, a skip connection is applied as the shortcut.
Different from RCAN and other RIR-based works, there is no global residual connection in the non-linear mapping module. On one hand, there is a residual structure in the proposed FNG. With the stack of FNGs, information can be fully delivered along the shortcuts from top to bottom. On the other hand, the skip module can be regarded as a global residual connection of the entire network, helping information transmission.
Solver LR is a network proposed to solve Eqn. (5), formulated as $\mathbf{z}_k = f_{LR}^{k}(\mathbf{y}, f_{D}^{k}(\mathbf{x}_k))$, where $f_{D}^{k}$ denotes the Down-sampler. Although Eqn. (5) has a closed-form solution, the result is a linear combination of $\mathbf{y}$ and $\mathbf{D}\mathbf{x}_k$, which implies that the SR solution will fall into the space spanned by $\mathbf{y}$ and $\mathbf{D}\mathbf{x}_k$. Meanwhile, the weighting factor $\mu_k$ varies for every iteration. From this perspective, a network is designed that introduces both non-linearity and adaptive factors.
The structure of Solver LR is a 3-layer network with ReLU activation after the second convolution layer. The first convolution layer linearly combines the features of $\mathbf{y}$ and the down-sampled $\mathbf{x}_k$. The second layer with ReLU activation introduces the non-linearity. The last convolution layer keeps the number of output channels equal to that of the inputs.
Solver LR has a structure similar to SRCNN, which has proved effective for filtering. Different from the closed-form solution of Eqn. (5), which can be regarded as a point-wise operation, the filter-based Solver LR enlarges the receptive field to take nearby information into account.
Down-sampler is a network dedicated to simulating $\mathbf{D}$. In previous works, the degradation model is usually chosen as bicubic down-sampling, which has an explicit formulation. However, when the degradation is more general or even unknown, it is difficult to calculate $\mathbf{D}\mathbf{x}$. To address this issue, a network is designed to simulate the degradation during training. Down-sampler is designed with 4 convolution layers. Considering the mechanism of vanilla bicubic down-sampling, where one pixel corresponds to a $4 \times 4$ window during interpolation, the first 2 layers extract features with kernel size 3, which gives a $5 \times 5$ receptive field. To simulate the down-sampling operation, two convolution layers with different strides and ReLU activation are applied subsequently. Notice that the stride should be no larger than the kernel size to prevent information loss. When the scaling factor is $\times 2$ or $\times 3$, the stride is applied on the first of these layers. When the scaling factor is $\times 4$, strides of 2 and 2 are applied on the two layers. The structure of Down-sampler is shown in Fig. 12.
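The stride rule and the receptive-field claim can be checked with a short calculation. The sketch below assumes scaling factors of 2, 3 and 4 and a kernel size of 3 for the strided layers (the only split of a factor of 4 into two strides that respects "stride no larger than the kernel size" is then 2 and 2); `downsampler_strides` and `receptive_field` are hypothetical helper names, not part of the released code.

```python
def downsampler_strides(scale, kernel_size=3):
    """Return the strides of the two strided layers for a given scaling factor.

    Assumed rule: for x2 and x3 the whole stride sits on the first layer;
    for x4 it is split as 2 and 2 so that no stride exceeds the kernel size.
    """
    if scale in (2, 3):
        strides = (scale, 1)
    elif scale == 4:
        strides = (2, 2)
    else:
        raise ValueError(f"unsupported scaling factor: {scale}")
    if any(s > kernel_size for s in strides):
        raise ValueError("stride larger than kernel size would skip pixels")
    return strides


def receptive_field(kernel_sizes, strides):
    """Receptive field of a stack of convolutions, in input pixels."""
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= s             # later layers step over `jump` input pixels at a time
    return rf


if __name__ == "__main__":
    print(receptive_field([3, 3], [1, 1]))  # 5: two stride-1 3x3 layers see a 5x5 window
    for scale in (2, 3, 4):
        print(scale, downsampler_strides(scale))
```

Two stride-1 layers with kernel size 3 therefore cover slightly more than the 4 x 4 support that bicubic interpolation uses for each output pixel.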
Solver MLE is designed to simulate the maximum likelihood estimation, formulated as $\hat{\mathbf{x}} = f_{MLE}(\mathbf{x}_1, \ldots, \mathbf{x}_K)$. Solver MLE analyzes the outputs $\mathbf{x}_k$ from every step and estimates a final result $\hat{\mathbf{x}}$. This module is designed as a network of 2 convolution layers with a ReLU activation.
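The processing steps are as follows. Given an LR input $\mathbf{y}$, the input of the first iteration is $\mathbf{z}_0 = \mathbf{y}$. For the $k$-th step, there is

$$\mathbf{x}_k = f_{SR}(\mathbf{z}_{k-1}),$$

and the input of the $(k+1)$-th iteration is:

$$\mathbf{z}_k = f_{LR}^{k}\left(\mathbf{y}, f_{D}^{k}(\mathbf{x}_k)\right).$$

Solver SR is shared across iterations, while Solver LR and Down-sampler differ from one iteration to the next. We hold the notion that using different Solver LR and Down-sampler modules diversifies the input space and ultimately enhances the final result. The output of the network is given by,

$$\hat{\mathbf{x}} = f_{MLE}(\mathbf{x}_1, \ldots, \mathbf{x}_K).$$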
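The data flow through the four modules can be summarized as a short loop. This is a structural sketch only: the solvers are passed in as plain callables, the toy stand-ins in the usage example are arbitrary scalar functions rather than networks, and skipping the LR update after the last iteration is an assumption. The name `isrn_forward` and all argument names are hypothetical.

```python
def isrn_forward(y, solver_sr, down_samplers, solvers_lr, solver_mle, num_iters):
    """Iterate SR -> down-sample -> LR update, then fuse all HR outputs.

    solver_sr     : one callable shared by every iteration (LR -> HR)
    down_samplers : per-iteration callables (HR -> LR)
    solvers_lr    : per-iteration callables (y, down-sampled HR) -> next LR input
    solver_mle    : callable fusing the list of HR outputs into the final result
    """
    z = y  # the first iteration takes the LR observation itself
    outputs = []
    for k in range(num_iters):
        x = solver_sr(z)          # shared Solver SR
        outputs.append(x)
        if k < num_iters - 1:     # assumed: no LR update is needed after the last step
            z = solvers_lr[k](y, down_samplers[k](x))
    return solver_mle(outputs), outputs


if __name__ == "__main__":
    # Toy scalar stand-ins, for illustration only.
    sr = lambda z: 2.0 * z
    downs = [lambda x: 0.4 * x] * 2
    lrs = [lambda y, dx: 0.5 * (y + dx)] * 2
    fuse = lambda xs: sum(xs) / len(xs)
    final, outs = isrn_forward(1.0, sr, downs, lrs, fuse, num_iters=3)
    print(outs, final)
```

Only `solver_sr` is reused across iterations; the per-iteration lists mirror the statement that Solver LR and Down-sampler are different in each iteration, and the fusion step sees every intermediate output rather than only the last one.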
Table I (BI degradation), columns: Dataset | Scale | Bicubic | SRCNN | VDSR | LapSRN | EDSR | RDN | SRFBN | Ours | Ours+
Table II (BD and DN degradation), columns: Dataset | Scale | Bicubic | SRCNN | IRCNN_G | IRCNN_C | RDN | RCAN | SRFBN | Ours | Ours+
Comparisons with RCAN. To the best of our knowledge, RCAN is the first work to introduce the residual-in-residual structure to image super-resolution. In RCAN, residual-in-residual is combined with the Squeeze-and-Excitation block to perform channel attention. Different from RCAN, ISRN uses an iterative structure to obtain better performance with fewer parameters. At the same time, ISRN concentrates on feature normalization rather than channel attention, and the experiments provide supporting evidence for this choice. RCAN aims to find a direct mapping from LR space to HR space, while ISRN provides an optimization perspective for finding the solution.
Comparisons with SRFBN. SRFBN applies a feedback mechanism to recursively enhance the super-resolution performance, and directly concatenates shallow and deep features. Different from SRFBN, ISRN provides a mathematical derivation of the model and simulates each solver with a corresponding network component. ISRN feeds the network with different inputs in every iteration. SRFBN is trained with the outputs of every iteration, while ISRN is trained with only one output, from Solver MLE.
Comparisons with IRCNN. IRCNN applies HQS to convert image restoration into two sub-problems, which are solved alternately with a denoiser network. In the proposed ISRN, on one hand, a constraint different from that of IRCNN is adopted, and variable splitting is used to convert the SR problem into two sub-problems with more flexibility. On the other hand, ISRN is an end-to-end network rather than a plug-and-play pipeline. The experimental results show that ISRN performs better than IRCNN.
Comparisons with DBPN. DBPN applies a back-projection method for iterative up- and down-sampling, which concentrates on information from different depths of the network. In the proposed ISRN, a different perspective is applied to the SISR problem, and the network is built on mathematical analysis. The projection in DBPN is used for effective information transmission in the non-linear mapping step, which maps LR images to HR images directly. In ISRN, the Solver SR module maps LR images to HR, and the results in LR and HR spaces are optimized iteratively. Finally, the experimental results show that ISRN achieves better performance than DBPN with fewer parameters.
Plug-and-Play. Since Solver SR is an independent network for SR, it is feasible to consider building the pipeline as plug-and-play. We hypothesize that a straightforward image restoration network can be regarded as a sparse-coding-like solver. After being trained with the loss, the network finds the best mapping on average. Notice that different networks learn different coding dictionaries, which vary widely. It is therefore difficult to fit general parameters for the other components. From this perspective, the proposed ISRN is treated as an end-to-end structure rather than plug-and-play.
In ISRN, all layers are with kernel size as except for the skip bypass in Solver SR and all layers after sub-pixel. These layers are with kernel size as for a larger receptive field. Layers of Solver SR, Solver MLE, and Solver LR have filters, and layers of Down-sampler have filters. For each FNG, there are FNBs; and for Solver SR, there are FNGs. There are iterations in the network. During training, the loss is chosen as loss function.
For training, 800 images from the DIV2K dataset are used, and 5 images are used for validation. Five benchmark datasets are used for testing: Set5, Set14, B100, Urban100 and Manga109. Images in B100 are real-world photographs containing rich high-frequency information. Urban100 contains numerous buildings, so abundant straight-line textures are included. Manga109 consists of cartoons with structural information. Three benchmark degradation models are used to simulate LR images: bicubic (BI), blur-downscale (BD), and downscale-noise (DN). All parameter settings of the degradation models are identical to those of RDN. The Adam optimizer, which is widely used in super-resolution tasks [12, 33, 9], is used with learning rate . The learning rate is halved every 200 epochs. The patch size of LR inputs is . The training data are augmented by random flipping and rotation. In total, the network is trained with 1000 iterations. Self-ensemble is adopted to improve the performance of ISRN, and the extended model is named ISRN+. The source code and pre-trained models of ISRN and ISRN+ can be downloaded at: https://github.com/yuqing-liu-dut/ISRN.
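Two of the training and inference details above lend themselves to a short sketch: the step-wise halving of the learning rate and the self-ensemble used for the extended model. The source does not spell out which transforms the self-ensemble uses; the code assumes the common choice of horizontal flip combined with the four 90-degree rotations (8 variants), and takes the initial learning rate as a free parameter because its value is not given here. `halved_lr`, `self_ensemble` and the helper names are hypothetical.

```python
def halved_lr(base_lr, epoch, step=200):
    """Learning rate after halving every `step` epochs (base_lr is an assumed input)."""
    return base_lr * 0.5 ** (epoch // step)


def rot90(img):
    """Rotate a 2-D nested list by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img):
    """Mirror a 2-D nested list left to right."""
    return [row[::-1] for row in img]


def self_ensemble(model, img):
    """Average the model outputs over 8 flip/rotation variants of the input.

    Each variant is transformed, passed through `model`, mapped back with the
    inverse transform, and the 8 aligned outputs are averaged pixel-wise.
    """
    results = []
    for flip in (False, True):
        for k in range(4):
            t = hflip(img) if flip else img
            for _ in range(k):
                t = rot90(t)
            out = model(t)
            for _ in range((4 - k) % 4):  # undo the rotation
                out = rot90(out)
            if flip:                       # undo the flip
                out = hflip(out)
            results.append(out)
    n = len(results)
    h, w = len(results[0]), len(results[0][0])
    return [[sum(r[i][j] for r in results) / n for j in range(w)] for i in range(h)]


if __name__ == "__main__":
    print(halved_lr(1.0, 400))  # 0.25
    identity = lambda im: [row[:] for row in im]
    print(self_ensemble(identity, [[1, 2, 3], [4, 5, 6]]))
```

With an identity model the ensemble returns the input unchanged, which is a quick check that every transform is inverted correctly; with a real network the eight outputs differ slightly and averaging them is what yields the gain, at roughly eight times the inference cost.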
The experiments are conducted with BI degradation (scaling factors $\times 2$, $\times 3$, and $\times 4$). In particular, ISRN and ISRN+ are compared with several methods, and Table I shows the quantitative comparisons. From the results, ISRN+ achieves the best performance on all benchmark datasets, and ISRN achieves better performance than the others on Urban100 and Manga109. Moreover, ISRN and ISRN+ are superior in terms of SSIM, which implies that the models recover structural information more effectively, as shown on the B100, Urban100 and Manga109 datasets. A comparison of PSNR against parameter count on Set5 is shown in Fig. 13, which reveals that the proposed model achieves competitive results with fewer parameters than state-of-the-art methods.
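For reference, PSNR, the main objective metric in these comparisons, follows directly from the mean squared error. The sketch below uses the standard definition for a single-channel image with a peak value of 255; evaluation details such as which channel is used or whether borders are cropped are not stated here and are left out. The function name `psnr` is hypothetical.

```python
import math


def psnr(reference, estimate, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    sq_errors = [(a - b) ** 2
                 for row_a, row_b in zip(reference, estimate)
                 for a, b in zip(row_a, row_b)]
    mse = sum(sq_errors) / len(sq_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


if __name__ == "__main__":
    ref = [[10, 20], [30, 40]]
    est = [[11, 21], [31, 41]]  # every pixel off by one, so MSE = 1
    print(round(psnr(ref, est), 2))  # 48.13
```

Because the scale is logarithmic, halving the mean squared error raises PSNR by about 3 dB, which helps when reading gaps of a few tenths of a dB between methods.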
The visual quality comparisons on the B100 dataset, which contains abundant complex structural textures from the real world, are shown in Fig. 38. From these results, ISRN and ISRN+ recover structural information more effectively. This also explains why the models achieve promising SSIM results. When processing structural information, especially line textures, ISRN and ISRN+ show very competitive performance.
To show the performance on large images with more textures, we compare the models with other works on the Urban100 dataset. The images are urban photos, which contain more line and structural textures. Visual quality comparisons on Urban100 are shown in Fig. 71. Compared with RDN, ISRN and ISRN+ recover more textures on buildings. Specifically, our models distinguish mixtures of lines more effectively.
Experiments are also conducted with BD and DN degradation to simulate more complex situations. Quantitative results are shown in Table II. From the results, ISRN and ISRN+ both achieve better performance than the others. In particular, for DN degradation, the proposed ISRN/ISRN+ are superior in terms of both PSNR and SSIM. The promising performance originates from the iterative structure, which is the distinctive component compared with prominent existing methods.
The visual quality comparisons are shown in Fig. 100 and Fig. 122. In Fig. 100, with BD degradation, the lines recovered by other works are warped or blurry, whereas ISRN and ISRN+ recover the lines better. In Fig. 122, with DN degradation, tiny lines are missing from the results of other works because of the introduced noise. Random noise may disturb the original texture and cause tiny lines to be lost. ISRN and ISRN+ again recover the lines better than other methods.
Study on Network Designs. To show the effect of feature normalization, experiments are conducted on Set5. The results are shown in Table III. From the results, the model with feature normalization achieves better performance with BI degradation (scaling factors $\times 2$, $\times 3$, and $\times 4$). Since feature normalization is a lightweight but effective component, introducing the block leads to only a small increase in parameters and computational cost.
In the proposed ISRN, different Down-sampler and Solver LR modules are applied in different iterations, leading to better performance in general. For comparison, a variant is evaluated that uses the same Down-sampler and Solver LR in all iterations. The results are shown in Table IV. From Table IV, the performance is better when using different components, which provides useful evidence regarding the training of different solvers. Using different components in different iterations leads to only a small increase in parameters.
The number of blocks also influences the performance. Experiments with different numbers of blocks and groups are conducted on the five validation images from DIV2K, and the results are shown in Fig. 123. From Fig. 123, the performance improves as the numbers of blocks and groups increase, showing that more blocks achieve better performance.
Study on Iteration Mechanism. To show the effect of iterations, results with and without iterations are compared. The comparison is set with iteration number . The experiments are conducted on the 5 validation images from DIV2K. The results are shown in Fig. 124. From the results, iterations indeed improve the performance: as the number of iterations increases, the PSNR becomes higher.
Moreover, the feature maps of different iterations are analyzed. The visualization results of each iteration and the output image are shown in Fig. 131. From the visual comparison, richer details appear on the feature maps as the iterations proceed. In the first and second iterations, there is structural information with clear lines and contours. In the next three iterations, there are more details on the wing. It is also observed that different iterations concentrate on different features. This is in line with the hypothesis of a distinct descent direction in Solver LR for each iteration. The MLE fuses the results from every iteration and makes full use of them.
Study on Plug-and-Play. To better illustrate the hypothesis about plug-and-play, Solver SR is substituted with a pre-trained RCAN. The pre-trained model is downloaded from the GitHub repository provided by its authors. Notice that nothing is changed except for Solver SR. The result is shown in Fig. 135. The images are chosen from the Set5 dataset. From the results, plug-and-play cannot deliver satisfactory results in either color or texture, which provides useful evidence for the hypothesis that different networks learn different coding dictionaries.
In this paper, a novel iterative super-resolution network (ISRN) was proposed for the SISR problem. We analyzed the problem from an optimization perspective and found a feasible solution in an iterative manner. Based on the formulation study, each module of ISRN was carefully designed, and a maximum likelihood estimation was performed to take the results from all iterations into account. Specifically, a novel block with feature normalization, named FNB, was introduced to compose the network, and the blocks were grouped in a residual-in-residual way. Considering the drawbacks of batch normalization, feature normalization (F-Norm) was designed for feature regulation with depth-wise convolution. Extensive experimental results on benchmark datasets with different degradation models show that the proposed ISRN and the extension model ISRN+ are able to recover structural information more effectively, and to achieve competitive or better performance with far fewer parameters.
W. Yang, X. Zhang, Y. Tian, W. Wang, J. Xue, and Q. Liao, “Deep learning for single image super-resolution: A brief review,” IEEE Transactions on Multimedia, vol. 21, no. 12, pp. 3106–3121, 2019.
Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, 2015, p. 448–456.