Iterative Network for Image Super-Resolution

05/20/2020 ∙ by Yuqing Liu, et al. ∙ City University of Hong Kong ∙ Dalian University of Technology ∙ Peking University

Single image super-resolution (SISR), as a traditional ill-conditioned inverse problem, has been greatly revitalized by the recent development of convolutional neural networks (CNN). These CNN-based methods generally map a low-resolution image to its corresponding high-resolution version with sophisticated network structures and loss functions, showing impressive performance. This paper proposes a substantially different approach relying on iterative optimization in HR space with an iterative super-resolution network (ISRN). We first analyze the observation model of the image SR problem, which inspires a feasible solution that mimics and fuses each iteration in a more general and efficient manner. Considering the drawbacks of batch normalization, we propose a feature normalization (F-Norm) method to regulate the features in the network. Furthermore, a novel block with F-Norm, termed FNB, is developed to improve the network representation. A residual-in-residual structure is proposed to form a very deep network, which groups FNBs with a long skip connection for better information delivery and a more stable training phase. Extensive experimental results on testing benchmarks with bicubic (BI) degradation show that our ISRN can not only recover more structural information, but also achieve competitive or better PSNR/SSIM results with much fewer parameters compared to other works. Besides BI, we simulate real-world degradation with blur-downscale (BD) and downscale-noise (DN). ISRN and its extension ISRN+ both achieve better performance than others with BD and DN degradation models.




I Introduction

Single image super-resolution (SISR) is a traditional ill-posed problem in image processing. Given a low-resolution (LR) image, the task of SISR is to find the corresponding high-resolution (HR) image. Convolutional neural networks (CNNs) have shown impressive performance for image restoration [1, 2, 3], and there are numerous recent CNN-based works on image super-resolution [4, 5]. SRCNN [6], proposed by Dong et al., is the first work on the SISR problem with a three-layer network, and achieves better performance than traditional methods. The three layers of SRCNN correspond to the steps of traditional sparse coding methods. Deeper networks usually result in better performance. Kim et al. increased the number of layers and introduced global residual learning in VDSR [7] for stronger network representation and better performance. Deconvolution layers were widely used in early SISR works to increase resolution. ESPCN [8], proposed by Shi et al., replaced deconvolution with a sub-pixel convolutional layer for a more effective upscaling operation, which has proved to be an effective structure. After ESPCN, most SISR works choose the sub-pixel layer instead of deconvolution. Residual structures have shown strong performance on image and video restoration [9]. To obtain better performance, EDSR [10], proposed by Lim et al., adopted residual blocks with more filters. Batch normalization layers are removed in EDSR to decrease the memory cost and build a deeper network. Besides deeper designs, there are works concentrating on effective blocks. Since dense connections have shown good performance on different tasks, SRDenseNet [11], proposed by Tong et al., stacked dense blocks for better performance. Zhang et al. combined residual and dense connections in RDN [12]. He et al. designed ODENet [13] based on ordinary differential equations. Multi-scale designs also turn out to be an effective component [14, 15]. MRFN [15] introduced a multi-receptive-field design for feature exploration. Meanwhile, there are also works focusing on the attention mechanism [16, 17]. These methods use a straightforward structure to map LR images to HR images.

Recursive designs have also been widely studied for image restoration problems. To the best of our knowledge, Kim et al. were the first to apply a recursive structure with shared convolution layers to the SISR problem, in DRCN [18]. To expand the receptive field, DRCN increased the network depth by using shared filters with limited parameters. Inspired by the residual design, Tai et al. proposed DRRN [19], which incorporates residual blocks. DRRN introduced a recursive block design that combines convolution layers, achieving better performance than VDSR. MemNet [20], developed by Tai et al., is motivated by the long-term memory model of the human brain. In MemNet, recursive and gate units are proposed to simulate the memory mechanism, and memory blocks are adopted for better performance. Recently, Yang et al. designed DRFN [21] with a recurrent structure for large scaling factors. However, these methods lack an explanation of the intrinsic optimization mechanism.

Numerous normalization methods have been developed to improve network representation. VDSR used batch normalization (BN) [22] between different convolution layers. Since BN consumes more memory [23], recent works have replaced the normalization with more efficient convolutional layers. Weight normalization (WN), first proposed by Salimans et al. for recurrent models [25], was adopted in WDSR [24] by Yu et al.

In this paper, an iterative super-resolution network, termed ISRN, is proposed to solve the SISR problem. We analyze the observation model and the target of image SR from the perspective of traditional energy optimization [26, 27, 28]. Motivated by those works, the half quadratic splitting (HQS) method [29] is adopted to analyze the SR problem and obtain a feasible solution. The network is designed with an iterative structure based on this solution. Features from each iteration are collected and fused to obtain the final result based on maximum likelihood estimation (MLE). In the vanilla HQS method, the degradation model should be given explicitly to find the closed-form solution. However, when the degradation models are complex, it is challenging to find an explicit formula. From this perspective, a network structure is introduced to simulate the degradation and optimization.

In particular, a novel block with feature normalization (FN), termed FNB, is designed in ISRN. Different from other normalization methods, the proposed FN uses convolution layers to find adaptive weights and biases for every pixel. To pass features from shallow layers to deeper ones more efficiently, FNBs are grouped with a residual structure and padding layers, termed FNG. Extensive experimental results show that ISRN and the extension model ISRN+ with self-ensemble are competitive or superior in terms of PSNR/SSIM with much fewer parameters. Subjective visualizations in Fig. 8 clearly show that ISRN can recover structural textures more effectively. Besides bicubic (BI) degradation, we also simulate real-world degradation by blur-downscale (BD) and downscale-noise (DN) operations. ISRN and ISRN+ perform better on both objective and subjective comparisons with BD and DN degradation models.

Fig. 8: Visual quality comparisons for various image SR methods (PSNR/SSIM) on an example image from Urban100 [30]. Bicubic: 22.43/0.6037; RDN [12]: 24.22/0.7552; SAN [9]: 24.22/0.7572; SRFBN [31]: 24.21/0.7538; Ours: 24.34/0.7606.

The main contributions of this paper are summarized as follows:

  • We provide a new perspective on SISR by integrating the conventional optimization architecture with deep convolution networks. In this perspective, a novel and lightweight iterative super-resolution network (ISRN) is proposed.

  • We propose a novel block with feature normalization (FNB). FNBs are grouped with residual structure and padding layers to bypass the features with skip connections more effectively, termed as FNG.

  • Experimental results show ISRN is competitive or better in terms of PSNR/SSIM with much fewer parameters. Visualization results indicate that ISRN delivers better performance on complex structural information recovery. Furthermore, ISRN and the extension model ISRN+ achieve better performance in both subjective and objective comparisons with BD and DN degradation models.

II Method

II-A Formulation Study

Fig. 9: A simple illustration of our proposed iterative scheme. The distance shrinks on LR space for each iteration, and the results are optimized on HR space.

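The observation model of the SISR problem can be formulated as

$$ y = Dx + n, \tag{1} $$

where $D$ is the degradation operator, $n$ is the noise term, and $y$ and $x$ are the LR and HR images respectively. Generally speaking, $D$ could be a bicubic down-sampler, a blur kernel, or a mixture of these operations.

Given an LR image $y$, the target of super-resolution is to find an $\hat{x}$ satisfying

$$ \hat{x} = \arg\min_{x} \tfrac{1}{2}\lVert y - Dx \rVert_2^2 + \lambda\,\Phi(x), \tag{2} $$

where $\Phi(x)$ is the image prior term and $\lambda$ is a factor. $\lVert\cdot\rVert_2$ means the $\ell_2$-norm.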

To obtain the HR image, there are numerous CNN-based works calculating a direct mapping from LR to HR, aiming to solve Eqn. (2). In this paper, half quadratic splitting (HQS) [29, 32] method is applied for finding the solutions. Let , then Eqn. (2) could be re-written as,


As such, Eqn. (3) could be solved in an iterative way by calculating and alternatively,


where is a weighting factor for the -th iteration and varies in a non-descending order for each iteration. For Eqn. (5), has the closed-form solution by linearly combining and .

Let , then Eqn. (4) can be re-written as:

Fig. 10: The network structure of the proposed iterative super-resolution network (ISRN). There are four components in ISRN: Solver SR, Solver LR, Down-Sampler, and Solver MLE, corresponding to different steps in formulation study. Solver SR is shared for each iteration to find the suitable mapping from LR space to HR space.

The iterative solution can be interpreted from another perspective. In particular, Eqn. (6) can be cast into a mapping from the LR space to HR space, such that a reasonably good result on average can be obtained in each iteration. Eqn. (5) aims to achieve a linear combination of and , which can be regarded as guiding with a specific direction and a specific step length governed by the parameter . This is analogous to the gradient descent method. Since the exact distance between and on HR space is unknown, these iterative steps shrink the distance between and . As such, we hold the notion that iterative optimization steps gradually decrease the distance on the HR space by adjusting the distance on the LR space. An illustration of the steps is given in Fig. 9.
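For reference, a generic half quadratic splitting of a data-fidelity-plus-prior objective can be written as below. The notation is an assumption made for this sketch ($y$: LR image, $x$: HR estimate, $D$: degradation operator, $z$: auxiliary LR-space variable, $\Phi$: prior, $\lambda$ and $\mu_k$: weights), and the exact form used by the authors may differ.

```latex
\begin{aligned}
\hat{x} &= \arg\min_{x}\ \tfrac{1}{2}\lVert y - z\rVert_2^2 + \lambda\,\Phi(x)
  \quad \text{s.t. } z = Dx, \\
x_k &= \arg\min_{x}\ \tfrac{\mu_k}{2}\lVert z_{k-1} - Dx\rVert_2^2 + \lambda\,\Phi(x), \\
z_k &= \arg\min_{z}\ \tfrac{1}{2}\lVert y - z\rVert_2^2 + \tfrac{\mu_k}{2}\lVert z - Dx_k\rVert_2^2
  = \frac{y + \mu_k\, D x_k}{1 + \mu_k}.
\end{aligned}
```

In this form the $z$-update is a convex combination of $y$ and $Dx_k$, and dividing the $x$-update by $\mu_k$ turns it into a super-resolution problem with LR input $z_{k-1}$ and prior weight $\lambda/\mu_k$.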

However, there are two critical issues. On one hand, the down-sampler operator which accounts for the mapping from the HR space to LR space is difficult to simulate. In general, could be regarded as a bicubic down-sampling operator while training. However, in some complicated situations, it could be difficult to explicitly express . From Eqn. (5), the accuracy of directly influences the optimization. From this perspective, the degradation model should be learned from paired data. On the other hand, the solution of Eqn. (5), i.e. , is a linear combination of and on -th step, which is close to the one-step gradient descent operation. When the -th iteration begins, the start point is still instead of . This shows the optimization is memory-less. In other words, the historical descent directions do not influence the starting point but only the next descent direction. To handle this issue, outputs from different iterations should be collected and considered jointly to find the final result. It can be regarded as a maximum likelihood estimation (MLE), demonstrated as,


where denotes the final HR image, and denotes the output of -th iteration.

Iterative super-resolution network (ISRN) is designed based on the previous formulation study. From the problem formulation, for -th iteration is optimized from Eqn. (6), which could be cast into an independent super-resolution problem mapping the input LR image to HR image . From this observation, a solver for image super-resolution is suitable to find the solution. We design a network module to find the result, termed as Solver SR. While training, the implicit expression of and will be learned from the paired data, and the adaptive optimization will be performed. Solver SR is shared for each iteration to find the suitable mapping relations between LR space and HR space while training.

The closed-form solution of Eqn. (5) is a linear combination of and . Since there is no explicit expression for when degradation models are complex, it is hard to find while given . We investigate a network module to simulate the degradation, and term it Down-sampler. Furthermore, considering that the weighting factor in Eqn. (5) varies across iterations, we utilize a network module to learn feasible factors for each iteration and find the solution, which is termed Solver LR.
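The linear combination mentioned above can be illustrated with a minimal sketch. It assumes the LR sub-problem has the standard HQS form, minimizing 0.5*||y - z||^2 + 0.5*mu*||z - dx||^2 over z; the function name `combine_lr` and the parameter names are hypothetical, and the paper replaces this closed form with the learned Solver LR.

```python
def combine_lr(y, dx, mu):
    """Closed-form minimizer of 0.5*||y - z||^2 + 0.5*mu*||z - dx||^2 over z.

    y  : observed LR signal (list of floats)
    dx : current HR estimate mapped back to LR space (same length as y)
    mu : non-negative weighting factor (hypothetical name)
    """
    if mu < 0:
        raise ValueError("mu must be non-negative")
    if len(y) != len(dx):
        raise ValueError("y and dx must have the same length")
    # Setting the gradient (z - y) + mu * (z - dx) to zero gives the blend below.
    return [(yi + mu * di) / (1.0 + mu) for yi, di in zip(y, dx)]


y = [1.0, 2.0, 3.0]
dx = [3.0, 2.0, 1.0]
for mu in (0.0, 1.0, 9.0):
    # mu = 0 returns y itself; larger mu moves the result towards dx
    print(mu, combine_lr(y, dx, mu))
```

A fixed blend like this is point-wise and linear, which is the limitation the learned module is meant to address.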

From the formulation study, the optimization steps for each iteration are memory-less. It is necessary to collect the outputs of different iterations, and find a suitable result considering all descent directions. The MLE step could be designed as a network module, termed Solver MLE, to find the final HR image with maximum probability.

II-B Network Design

As shown in Fig. 10, there are four modules in the proposed ISRN, corresponding to Solver SR, Solver LR, Down-sampler and Solver MLE separately. Herein, these modules are detailed as follows.

Fig. 11: Framework of the Solver SR and its components. There are four modules in Solver SR, mapping the features from LR space onto HR space.

Solver SR is the main component that generates images in HR space from the LR space, shared for every iteration, which is formulated as . Most recent networks for image restoration are deep, which may accumulate the feature variance. Batch normalization is proposed for performance improvement, which may consume much memory [23]. In this paper, a novel feature normalization (F-Norm) method is proposed, formulated as,


where is the channel index, and are corresponding input and output feature channels, is a convolution kernel, and is the bias. To preserve the original feature information, the features before and after normalization are added as the final output.

The proposed F-Norm is designed with the hypothesis that different channels contain different information. Different channels are processed in parallel to prevent information fusion across channels. The parallel normalization decreases the parameters and computational complexity, making it flexible for various network designs.

The F-Norm has a similar formulation to BN. If is regarded as a convolution kernel with size , then it performs the same operation as BN with batch size 1. The F-Norm normalizes features independently, avoiding the influence of the mini-batch in BN. The normalization factors are learned from the feature maps alone. Different from BN, F-Norm is implemented with only one convolution layer, which is fast and has little memory cost.
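As a rough illustration of the description above, the sketch below applies one kernel and one bias per channel and adds the result back to the input. It is a sketch under stated assumptions, not the authors' implementation: the kernel size, the zero padding, and the function name `fnorm` are hypothetical, and the weights here are fixed numbers rather than learned parameters.

```python
def fnorm(features, kernels, biases):
    """Per-channel convolution plus bias, added back to the input.

    features : list of C channels, each an H x W list of lists
    kernels  : list of C square kernels with odd side length (one per channel)
    biases   : list of C floats
    """
    out = []
    for chan, ker, b in zip(features, kernels, biases):
        h, w = len(chan), len(chan[0])
        r = len(ker) // 2  # kernel radius; zero padding keeps the H x W size
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = b
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += ker[di + r][dj + r] * chan[ii][jj]
                # skip connection: features before and after are summed
                res[i][j] = chan[i][j] + acc
        out.append(res)
    return out


x = [[[1.0, 2.0], [3.0, 4.0]]]            # one channel, 2 x 2
print(fnorm(x, [[[0.5]]], [0.1]))          # 1 x 1 kernel: x + 0.5 * x + 0.1
```

Because each channel has its own kernel and no channel mixing takes place, the parameter count grows only linearly with the number of channels.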

A novel block named feature normalization block (FNB) is proposed with F-Norm. In FNB, F-Norm is applied at the bottom of residual block. On one hand, it could normalize the feature maps after non-linear processing. On the other hand, using only one normalization layer in each block could save the parameters and computation cost.

Residual structure can gradually pass the shallow layer features to deeper layers. To speed up the feature delivery and make better use of shallow layer features, residual-in-residual (RIR) structure is applied in the network. A group of FNBs with a skip connection is proposed, termed as FNG. For each FNB in the group, there is a residual structure. The global FNG also acquires a shortcut to pass the shallow features to the deeper and improve the gradient transmission.

There is a padding structure after the FNBs, composed of two convolution layers with ReLU activation and an F-Norm layer. This padding structure introduces a non-linear processing step for main-path information. In FNG, the F-Norm layer following the last convolution layer aims to normalize the features on the main path.
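The residual-in-residual grouping described above can be sketched as plain function composition. The helpers `make_fnb`, `make_fng`, and `scale` are hypothetical names, and the scaling functions are toy stand-ins for the learned convolution, ReLU, and F-Norm layers; only the placement of the short and long skip connections is illustrated.

```python
def make_fnb(body):
    """Residual block: output = input + body(input)."""
    def block(x):
        return [xi + bi for xi, bi in zip(x, body(x))]
    return block


def make_fng(blocks, padding):
    """Group of blocks followed by a padding stage, wrapped in a long skip."""
    def group(x):
        h = x
        for block in blocks:
            h = block(h)
        # long skip connection around the whole group
        return [xi + pi for xi, pi in zip(x, padding(h))]
    return group


def scale(s):
    """Toy stand-in for a learned layer (hypothetical): multiply by s."""
    return lambda x: [s * v for v in x]


fnbs = [make_fnb(scale(0.1)) for _ in range(3)]
fng = make_fng(fnbs, padding=scale(0.5))
print(fng([1.0, 2.0]))  # each value becomes v * (1 + 0.5 * 1.1**3)
```

The input reaches the group output directly through the long skip, independent of how many blocks are stacked inside.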

The entire network structure of Solver SR is shown in Fig. 11. Analogous to other super-resolution networks, Solver SR has a main enhancement path and a skip bypass, which form the global residual framework. The bypass in Solver SR upscales the by convolution and sub-pixel layer. A convolution layer is applied after each sub-pixel layer to introduce the spatial correlation.

Solver SR could be regarded as a complete network structure for single image super-resolution, since it directly maps the LR image into HR space. There are four modules in Solver SR. The first convolution layer in the main path serves as the feature extraction module. After feature extraction, several FNGs are used to compose the non-linear mapping module. The restoration module is made up of two convolution layers with a sub-pixel layer. Finally, a skip connection is applied as the shortcut.
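The sub-pixel layer used in the restoration module and the bypass rearranges channels into spatial positions. The following minimal sketch shows that rearrangement on nested lists; the channel ordering follows the convention commonly used by deep learning frameworks, which is an assumption here since the text does not specify it, and `pixel_shuffle` is a hypothetical name.

```python
def pixel_shuffle(features, r):
    """Rearrange C*r*r channels of size H x W into C channels of size rH x rW."""
    c_in = len(features)
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    h, w = len(features[0]), len(features[0][0])
    out = []
    for c in range(c_in // (r * r)):
        plane = [[0.0] * (w * r) for _ in range(h * r)]
        for i in range(r):
            for j in range(r):
                # channel c*r*r + i*r + j fills sub-pixel offset (i, j)
                src = features[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        plane[y * r + i][x * r + j] = src[y][x]
        out.append(plane)
    return out


lr = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]   # 4 channels, each 1 x 1
print(pixel_shuffle(lr, 2))                  # [[[1.0, 2.0], [3.0, 4.0]]]
```

The operation itself has no parameters, which is one reason a convolution is placed after it to model spatial correlation between the interleaved pixels.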

Different from RCAN [33] and other RIR-based works, there is no global residual connection in the non-linear mapping module. On one hand, there is a residual structure in the proposed FNG. With the stack of FNGs, information can be fully delivered on the shortcuts from top to bottom. On the other hand, the skip module can be regarded as a global residual connection of the entire network, helping the information transmission.

Solver LR is a network proposed to solve Eqn. (5), formulated as . Although Eqn. (5) has a closed-form solution, the result is a linear combination of and , which implies the SR solution will fall into a space spanned by and . Meanwhile, the weight factor varies for every iteration. From this perspective, a network is designed which introduces both non-linearity and adaptive factors.

The structure of Solver LR is a 3-layer network with ReLU activation after the second convolution layer. The first convolution layer aims to linearly combine the feature of and . The second layer with ReLU activation introduces the non-linearity. The last convolution layer maintains the same channel number of inputs and the output.

Solver LR has a similar structure to SRCNN [6], which has been proved effective for filtering. Different from the closed-form solution of Eqn. (5), which could be regarded as a point-wise operation, the filter-based Solver LR enlarges the receptive field to consider neighboring information.

Down-sampler is a network dedicated to simulating . In previous works, the degradation model is usually chosen as bicubic down-sampling, which has an explicit formulation. However, when the degradation is more general or even unknown, it is difficult to calculate . To address this issue, a network is designed to simulate the degradation while training. Down-sampler is designed with 4 convolution layers. Considering the mechanism of vanilla bicubic down-sampling, where one pixel corresponds to a 4×4 window during interpolation, the first 2 layers extract the features with kernel size 3, which equals a 5×5 receptive field. To simulate the down-sampling operation, two convolution layers with different strides and ReLU activation are applied subsequently. Notice that the stride size should be no larger than the kernel size to prevent information loss. When the scaling factors are ×2 and ×3, the strides are performed on the first layer. When the scaling factor is ×4, the strides are performed as 2 and 2 on both layers. The structure of Down-sampler is shown in Fig. 12.
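How strided convolutions produce the required down-scaling can be checked with the standard output-size formula. The sketch below assumes kernel size 3 with padding 1 for the strided layers and uses hypothetical stride plans (one strided layer for ×2 and ×3, two stride-2 layers for ×4); these values are illustrative assumptions rather than settings confirmed by the text.

```python
def conv_out_size(n, kernel=3, stride=1, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (n + 2 * padding - kernel) // stride + 1


def downsampled_size(n, strides, kernel=3):
    """Apply a sequence of strided convolutions and return the final size."""
    for s in strides:
        if s > kernel:
            # a stride larger than the kernel would skip input pixels entirely
            raise ValueError("stride should not exceed the kernel size")
        n = conv_out_size(n, kernel=kernel, stride=s)
    return n


# Hypothetical stride plans for the two strided layers of a down-sampler.
plans = {2: [2, 1], 3: [3, 1], 4: [2, 2]}
for factor, strides in plans.items():
    print(factor, downsampled_size(96, strides))   # 48, 32, 24
```

With a 96-pixel side, each plan yields exactly 96 divided by the scaling factor, so the simulated LR image matches the size of the real LR input.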

Fig. 12: Illustration of the Down-sampler with different scaling factors.

Solver MLE is designed to simulate the maximum likelihood estimation, formulated as . Solver MLE is used to analyze from every step and estimate a final result . This model is designed as a network of 2 convolution layers with a ReLU activation.

Processing step can be demonstrated as follows. Given an LR input , the input of the first iteration is . For the -th step, there is


and the input of -th iteration is:


Solver SR is shared across iterations, while each iteration has its own Solver LR and Down-sampler. We hold the notion that distinct Solver LR and Down-sampler modules diversify the input space and ultimately enhance the final result. The output of the network is produced by Solver MLE from the outputs of all iterations.
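The processing steps can be summarized as a data-flow sketch. All module implementations below are toy stand-ins on 1-D lists (nearest-neighbour up-sampling, pair averaging, a fixed blend, and a mean), the function names are hypothetical, and the iteration count of 3 is an arbitrary choice for the example; only the wiring between the four modules follows the description.

```python
def run_isrn_like(y, solver_sr, solvers_lr, down_samplers, solver_mle):
    """Data flow of the iterative scheme with stand-in modules.

    solver_sr     : shared LR -> HR mapping
    solvers_lr    : one (y, lr_estimate) -> next LR input module per iteration
    down_samplers : one HR -> LR module per iteration
    solver_mle    : fuses the list of per-iteration HR outputs
    """
    z = y                              # input of the first iteration is the LR image
    outputs = []
    for solver_lr, down in zip(solvers_lr, down_samplers):
        x_k = solver_sr(z)             # the same Solver SR in every iteration
        outputs.append(x_k)
        z = solver_lr(y, down(x_k))    # next input, combined in LR space
    return solver_mle(outputs)


def upsample2(v):                      # toy Solver SR: nearest-neighbour x2
    return [a for a in v for _ in range(2)]


def average_pairs(v):                  # toy Down-sampler: average adjacent pairs
    return [(v[i] + v[i + 1]) / 2 for i in range(0, len(v), 2)]


def blend(y, d):                       # toy Solver LR: fixed equal-weight blend
    return [(a + b) / 2 for a, b in zip(y, d)]


def mean_fuse(outs):                   # toy Solver MLE: element-wise mean
    return [sum(col) / len(col) for col in zip(*outs)]


K = 3  # number of iterations (arbitrary for this example)
result = run_isrn_like([1.0, 3.0], upsample2, [blend] * K,
                       [average_pairs] * K, mean_fuse)
print(result)  # [1.0, 1.0, 3.0, 3.0]
```

Note that every per-iteration output is kept and passed to the fusion step, which is how the scheme compensates for the memory-less updates.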

Dataset    Scale  Bicubic       SRCNN [6]     VDSR [7]      LapSRN [34]   EDSR [10]     RDN [12]      SRFBN [31]    Ours          Ours+
Set5       ×2     33.66/0.9299  36.66/0.9542  37.53/0.9590  37.52/0.9591  38.11/0.9601  38.24/0.9614  38.11/0.9609  38.20/0.9613  38.25/0.9615
Set5       ×3     30.39/0.8682  32.75/0.9090  33.67/0.9210  33.82/0.9227  34.65/0.9282  34.71/0.9296  34.70/0.9292  34.68/0.9294  34.76/0.9300
Set5       ×4     28.42/0.8104  30.48/0.8628  31.35/0.8830  31.54/0.8850  32.46/0.8968  32.47/0.8990  32.47/0.8983  32.55/0.8992  32.66/0.9004
Set14      ×2     30.24/0.8688  32.45/0.9067  33.05/0.9130  33.08/0.9130  33.92/0.9195  34.01/0.9212  33.82/0.9196  33.84/0.9199  34.03/0.9212
Set14      ×3     27.55/0.7742  29.30/0.8215  29.78/0.8320  29.87/0.8320  30.52/0.8462  30.57/0.8468  30.51/0.8461  30.60/0.8475  30.67/0.8487
Set14      ×4     26.00/0.7027  27.50/0.7513  28.02/0.7680  28.19/0.7720  28.80/0.7876  28.81/0.7871  28.81/0.7868  28.79/0.7872  28.91/0.7891
B100       ×2     29.56/0.8431  31.36/0.8879  31.90/0.8960  31.80/0.8950  32.32/0.9013  32.34/0.9017  32.29/0.9010  32.35/0.9019  32.39/0.9023
B100       ×3     27.21/0.7385  28.41/0.7863  28.83/0.7990  28.82/0.7980  29.25/0.8093  29.26/0.8093  29.24/0.8084  29.25/0.8096  29.31/0.8105
B100       ×4     25.96/0.6675  26.90/0.7101  27.29/0.7260  27.32/0.7270  27.71/0.7420  27.72/0.7419  27.72/0.7409  27.74/0.7422  27.80/0.7435
Urban100   ×2     26.88/0.8403  29.50/0.8946  30.77/0.9140  30.41/0.9101  32.93/0.9351  32.89/0.9353  32.62/0.9328  32.96/0.9357  33.10/0.9371
Urban100   ×3     24.46/0.7349  26.24/0.7989  27.14/0.8290  27.07/0.8280  28.80/0.8653  28.80/0.8653  28.73/0.8641  28.83/0.8666  29.01/0.8691
Urban100   ×4     23.14/0.6577  24.52/0.7221  25.18/0.7540  25.21/0.7560  26.64/0.8033  26.61/0.8028  26.60/0.8015  26.64/0.8033  26.83/0.8070
Manga109   ×2     30.80/0.9339  35.60/0.9663  37.22/0.9750  37.27/0.9740  39.10/0.9773  39.18/0.9780  39.08/0.9779  39.20/0.9781  39.38/0.9785
Manga109   ×3     26.95/0.8556  30.48/0.9117  32.01/0.9340  32.21/0.9350  34.17/0.9476  34.13/0.9484  34.18/0.9481  34.19/0.9487  34.45/0.9499
Manga109   ×4     24.89/0.7866  27.58/0.8555  28.83/0.8870  29.09/0.8900  31.02/0.9148  31.00/0.9151  31.15/0.9160  31.16/0.9166  31.48/0.9190
Param (M)         -             0.057         0.665         0.813         43            20            3.6           3.4           3.4
TABLE I: Average PSNR/SSIM results with BI degradation. The best performance is shown in bold. The extension model achieves the best PSNR/SSIM results on all benchmarks.
Dataset    Degradation  Bicubic       SRCNN [6]     IRCNN_G [26]  IRCNN_C [26]  RDN [12]      RCAN [33]     SRFBN [31]    Ours          Ours+
Set5       BD           28.34/0.8161  31.63/0.8888  33.38/0.9182  29.55/0.8246  34.57/0.9280  34.70/0.9288  34.66/0.9283  34.74/0.9291  34.83/0.9297
Set5       DN           24.14/0.5445  27.16/0.7672  24.85/0.7205  26.18/0.7430  28.46/0.8151  -             28.53/0.8182  28.59/0.8201  28.66/0.8214
Set14      BD           26.12/0.7106  28.52/0.7924  29.73/0.8292  27.33/0.7135  30.53/0.8447  30.63/0.8462  30.48/0.8439  30.69/0.8473  30.78/0.8484
Set14      DN           23.14/0.4828  25.49/0.6580  23.84/0.6091  24.68/0.6300  26.60/0.7101  -             26.60/0.7144  26.71/0.7167  26.75/0.7175
B100       BD           26.02/0.6733  27.76/0.7526  28.65/0.7922  26.46/0.6572  29.23/0.8079  29.32/0.8093  29.21/0.8069  29.31/0.8099  29.36/0.8107
B100       DN           22.94/0.4461  25.11/0.6151  23.89/0.5688  24.52/0.5850  25.93/0.6573  -             25.95/0.6625  26.00/0.6637  26.03/0.6644
Urban100   BD           23.20/0.6661  25.31/0.7612  26.77/0.8154  24.89/0.7172  28.46/0.8581  28.81/0.8647  28.48/0.8581  28.83/0.8652  29.01/0.8680
Urban100   DN           21.63/0.4701  23.32/0.6500  21.96/0.6018  22.63/0.6205  24.92/0.7362  -             24.99/0.7424  25.25/0.7525  25.35/0.7549
Manga109   BD           25.03/0.7987  28.79/0.8851  31.15/0.9245  28.68/0.8574  33.97/0.9465  34.38/0.9483  34.07/0.9466  34.46/0.9489  34.73/0.9501
Manga109   DN           23.08/0.5448  25.78/0.7889  23.18/0.7466  24.74/0.7701  28.00/0.8590  -             28.02/0.8618  28.25/0.8669  28.39/0.8688
TABLE II: Average PSNR/SSIM results with BD and DN degradation. The best performance is shown in bold. The basic and extension models achieve better PSNR/SSIM results on all benchmarks than state-of-the-art methods.
Fig. 13: A visualization comparison of PSNR and parameters on Set5 with scaling factor .

II-C Discussion

Comparisons with RCAN [33]. To the best of our knowledge, RCAN is the first work introducing the residual-in-residual structure to image super-resolution. In RCAN, residual-in-residual is combined with the Squeeze-and-Excitation [35] block to perform channel attention. Different from RCAN, ISRN uses an iterative structure for better performance with fewer parameters. At the same time, ISRN concentrates on feature normalization rather than channel attention, and supporting evidence has been provided. RCAN aims to find a direct mapping from LR space to HR space, while ISRN provides an optimization perspective for finding the solution.

Comparisons with SRFBN [31]. SRFBN applies a feedback mechanism to recursively enhance the super-resolution performance, which directly concatenates shallow and deep features. Different from SRFBN, ISRN provides a mathematical proof of the model, and simulates each solver with a corresponding network component. ISRN feeds the network with different inputs in every iteration. SRFBN is trained with outputs from every iteration, while ISRN is trained with only one output from Solver MLE.

Comparisons with IRCNN [26]. IRCNN applies HQS to convert image restoration into two sub-problems, which can be solved alternately with a denoiser network. In the proposed ISRN, on one hand, a constraint different from that of IRCNN is adopted, and variable splitting is used to convert the SR problem into two sub-problems with more flexibility. On the other hand, ISRN is an end-to-end network, instead of a plug-and-play pipeline. The experimental results show ISRN has better performance than IRCNN.

Comparisons with DBPN [36]. DBPN applies a back-projection method for iterative up-and-down sampling, which concentrates on information from different depths of the network. In the proposed ISRN, a different perspective is applied to the SISR problem, and the network is built based on mathematical analysis. The projection of DBPN is used for effective information transmission in the non-linear mapping step, which maps LR images to HR images directly. In ISRN, the Solver SR module is designed for mapping LR images to HR, and the results on LR and HR spaces are optimized iteratively. Finally, the experimental results show ISRN achieves better performance than DBPN with fewer parameters.

Plug-and-Play. Since Solver SR is an independent network for SR, it is feasible to consider building the pipeline as plug-and-play. We hold the hypothesis that a straightforward image restoration network can be regarded as a sparse-coding-like solver. After being trained with loss, the network will find the best mapping on average. Notice that different networks learn different coding dictionaries, which vary widely. It is difficult to fit general parameters for other components. From this perspective, the proposed ISRN is built as an end-to-end structure rather than plug-and-play.

Fig. 38: Visual quality comparisons on B100 dataset with BI degradation (PSNR/SSIM per panel).
Image 8023: VDSR [7] 29.54/0.8651; LapSRN [34] 30.09/0.8670; RDN [12] 30.33/0.8760; CARN [37] 30.94/0.8755; Ours 31.08/0.8785; Ours+ 31.18/0.8793.
Image 223061: VDSR [7] 24.46/0.6537; LapSRN [34] 24.51/0.6566; RDN [12] 25.01/0.7032; CARN [37] 24.75/0.6745; Ours 25.07/0.7045; Ours+ 25.12/0.7058.
Image 253027: VDSR [7] 22.35/0.6902; LapSRN [34] 22.40/0.6922; RDN [12] 22.85/0.7118; CARN [37] 22.67/0.7043; Ours 22.93/0.7143; Ours+ 23.12/0.7163.
Fig. 71: Visual quality comparisons on Urban100 dataset with BI degradation (PSNR/SSIM per panel).
Image img_062: VDSR [7] 20.75/0.7474; LapSRN [34] 20.80/0.7500; RDN [12] 22.31/0.8401; CARN [37] 21.39/0.7969; Ours 22.41/0.8412; Ours+ 22.53/0.8439.
Image img_069: VDSR [7] 24.40/0.7320; LapSRN [34] 24.39/0.7345; RDN [12] 25.19/0.7732; CARN [37] 24.78/0.7537; Ours 25.23/0.7744; Ours+ 25.28/0.7752.
Image img_070: VDSR [7] 21.92/0.5767; LapSRN [34] 21.93/0.5776; RDN [12] 22.20/0.6070; CARN [37] 22.12/0.5936; Ours 22.35/0.6100; Ours+ 22.37/0.6110.
Image img_096: VDSR [7] 23.31/0.8014; LapSRN [34] 22.53/0.7851; RDN [12] 26.14/0.8921; CARN [37] 25.11/0.8629; Ours 26.63/0.9849; Ours+ 27.10/0.9006.
Fig. 100: Visual quality comparisons of different methods with BD degradation (PSNR/SSIM per panel).
img_012 from Urban100: Bicubic 22.52/0.6147; RDN [12] 25.53/0.8236; SRFBN [31] 25.44/0.8221; Ours 26.18/0.8390; Ours+ 26.29/0.8403.
img_024 from Urban100: Bicubic 18.41/0.4882; RDN [12] 22.85/0.7799; SRFBN [31] 22.92/0.7889; Ours 23.58/0.8058; Ours+ 23.89/0.8123.
img_078 from Urban100: Bicubic 25.73/0.6848; RDN [12] 29.92/0.8503; SRFBN [31] 29.82/0.8488; Ours 30.62/0.8619; Ours+ 30.75/0.8621.
86000 from B100: Bicubic 25.73/0.6848; RDN [12] 29.92/0.8503; SRFBN [31] 29.82/0.8488; Ours 30.62/0.8619; Ours+ 30.75/0.8621.

III Experimental Results

III-A Settings

In ISRN, all layers have kernel size except for the skip bypass in Solver SR and all layers after the sub-pixel layer, which have kernel size for a larger receptive field. Layers of Solver SR, Solver MLE, and Solver LR have filters, and layers of Down-sampler have filters. For each FNG, there are FNBs; and for Solver SR, there are FNGs. There are iterations in the network. During training, the loss is chosen as the loss function.

In the training process, 800 images from the DIV2K [38] dataset are used for training, and 5 images are used for validation. Five benchmark datasets are used for testing: Set5 [39], Set14 [40], B100 [41], Urban100 [30] and Manga109 [42]. Images in B100 are real-world photographs containing rich high-frequency information. Urban100 contains numerous buildings, so abundant straight-line textures are included. Manga109 consists of cartoons with structural information. Three benchmark degradation models are used to simulate LR images: bicubic (BI), blur-downscale (BD), and downscale-noise (DN). All the parameter settings of the degradation models are identical to those of RDN [12]. The Adam optimizer, which is widely used in super-resolution tasks [12, 33, 9], is used with learning rate . The learning rate is halved every 200 epochs. The patch size of LR inputs is . The training data are augmented by random flipping and rotation. In total, the network is trained with 1000 iterations. Self-ensemble [12] is adopted to improve the performance of ISRN, and the extended model is named ISRN+. The source code and pre-trained models of ISRN and ISRN+ are available for download.
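Self-ensemble is commonly implemented by running the model on the eight flip and 90-degree rotation variants of the input, undoing each transform on the output, and averaging. The sketch below follows that common recipe on nested lists; whether ISRN+ uses exactly these eight variants is an assumption, and `self_ensemble` and its helpers are hypothetical names.

```python
def rot90(img):
    """Rotate an H x W nested list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def hflip(img):
    """Mirror an image left to right."""
    return [row[::-1] for row in img]


def rotate(img, k):
    """Apply k quarter turns counter-clockwise (negative k turns clockwise)."""
    for _ in range(k % 4):
        img = rot90(img)
    return img


def self_ensemble(model, lr):
    """Average model outputs over 8 flip/rotation variants of the input."""
    outputs = []
    for flip in (False, True):
        for k in range(4):
            x = hflip(lr) if flip else lr
            x = rotate(x, k)
            y = model(x)
            y = rotate(y, -k)             # undo the rotation first
            y = hflip(y) if flip else y   # then undo the flip
            outputs.append(y)
    h, w = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / len(outputs) for j in range(w)]
            for i in range(h)]


img = [[1.0, 2.0], [3.0, 4.0]]
print(self_ensemble(lambda x: x, img))    # identity model returns the input
```

The ensemble adds no parameters, which is consistent with Ours and Ours+ sharing the same parameter count in Table I, but it multiplies the inference cost by the number of variants.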

Fig. 122: Visual quality comparisons of different methods with DN degradation (PSNR/SSIM per panel).
img_044 from Urban100: Bicubic 24.58/0.5733; RDN [12] 28.35/0.7962; SRFBN [31] 29.26/0.8251; Ours 30.15/0.8604; Ours+ 30.43/0.8639.
img_004 from Urban100: Bicubic 20.78/0.5852; RDN [12] 23.27/0.8029; SRFBN [31] 23.00/0.7995; Ours 23.76/0.8216; Ours+ 24.08/0.8287.
img_011 from Urban100: Bicubic 16.66/0.5377; RDN [12] 19.08/0.8234; SRFBN [31] 19.90/0.8413; Ours 20.68/0.8653; Ours+ 20.91/0.8642.

III-B Results with BI Degradation

The experiments are conducted with BI degradation (scaling factors ×2, ×3, and ×4). In particular, ISRN and ISRN+ are compared with several methods, and Table I shows quantitative comparisons. From the results, ISRN+ achieves the best performance on all benchmark datasets, and ISRN achieves better performance than others on Urban100 and Manga109. Moreover, ISRN and ISRN+ are superior in terms of SSIM values, which implies that the models can recover structural information more effectively, as shown on the B100, Urban100 and Manga109 datasets. Results on Urban100 and Manga109 in particular reflect the performance on recovering structural information. A visualization comparison of PSNR and parameters on Set5 is shown in Fig. 13, which reveals that the proposed model achieves competitive results with fewer parameters than state-of-the-art methods.
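For readers unfamiliar with the metric, PSNR can be computed from the mean squared error as in the minimal sketch below. It assumes 8-bit intensities with a peak value of 255 and operates on a single channel; SR benchmarks often evaluate on the luminance channel with border cropping, details that are not specified in this section and are left out here.

```python
import math


def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(ref, test)
             for a, b in zip(row_a, row_b)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf          # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


ref = [[100.0, 100.0]]
out = [[110.0, 90.0]]
print(round(psnr(ref, out), 2))  # MSE of 100 gives about 28.13 dB
```

Because the scale is logarithmic, a gain of 0.1 dB corresponds to roughly a 2.3% reduction in mean squared error.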

The visual quality comparisons on the B100 dataset, which contains abundant complex structural textures from the real world, are shown in Fig. 38. From these results, ISRN and ISRN+ recover structural information more effectively, which also explains why the models achieve promising SSIM results. When processing structural information, especially line textures, ISRN and ISRN+ show very competitive performance.

To show the performance on large images with more textures, we compare the models with other works on the Urban100 dataset. The images are urban photos, which contain more line and structural textures. Visual quality comparisons on Urban100 are shown in Fig. 71. Compared with RDN, ISRN and ISRN+ recover more textures on buildings. Specifically, our models distinguish mixtures of lines more effectively.

III-C Results with BD and DN Degradation

Experiments are also conducted with BD and DN degradation at a single scaling factor to simulate more complex situations. Quantitative results are shown in Table II. From the results, ISRN and ISRN+ both achieve better performance than the other methods. In particular, for the DN degradation, the proposed ISRN and ISRN+ are superior in terms of both PSNR and SSIM. The promising performance is attributed to the iterative structure, which is the distinctive component compared with prominent methods.

The visual quality comparisons are shown in Fig. 100 and Fig. 122. In Fig. 100 with BD degradation, the lines recovered by other works are warped or blurry, whereas ISRN and ISRN+ recover the lines better. In Fig. 122 with DN degradation, the tiny lines are missing from the results of other works because of the introduced noise: random noise may disturb the original texture and cause tiny lines to be omitted. Again, ISRN and ISRN+ recover these lines better than the other methods.
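To make the two degradation models concrete, the sketch below generates BD (blur, then downscale) and DN (downscale, then add noise) inputs from a grayscale HR image. It is an illustration under stated assumptions, not the authors' data pipeline: block averaging stands in for the bicubic downscaler, and the kernel size, blur sigma and noise level are placeholder defaults, since the actual settings follow RDN [12] and are not restated in this section.

```python
import numpy as np

def downscale(img, scale):
    """Stand-in downscaler: block averaging (a simplification of bicubic)."""
    h, w = img.shape
    h, w = h - h % scale, w - w % scale
    return img[:h, :w].reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def gaussian_kernel(size, sigma):
    """Normalized 2-D Gaussian kernel with an odd side length."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def blur(img, kernel):
    """Filter the image with a symmetric kernel, replicating edge pixels."""
    r = kernel.shape[0] // 2
    padded = np.pad(img, r, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def degrade_bd(hr, scale, size=7, sigma=1.6):
    """BD: Gaussian blur followed by downscaling (size and sigma are assumptions)."""
    return downscale(blur(hr, gaussian_kernel(size, sigma)), scale)

def degrade_dn(hr, scale, noise_std=30.0, rng=None):
    """DN: downscaling followed by additive Gaussian noise (noise_std is an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    lr = downscale(hr, scale)
    return lr + rng.normal(0.0, noise_std, lr.shape)

hr = np.full((12, 12), 100.0)
print(degrade_bd(hr, 3).shape, degrade_dn(hr, 3, rng=np.random.default_rng(0)).shape)
```

The order of operations is the point of the sketch: in BD the blur removes fine lines before sampling, while in DN the noise is injected after sampling and can mask the thin structures that survive.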

III-D Ablation Study

Study on Network Designs. To show the effect of feature normalization, experiments are conducted on Set5. The results are shown in Table III. The model with feature normalization achieves better performance with BI degradation at all three scaling factors. Since feature normalization is a compact but effective component, introducing it leads to only a small increase in parameters and computational cost.

With F-Norm 38.20/0.9613 34.68/0.9294 32.55/0.8992
Without F-Norm 38.18/0.9611 34.59/0.9289 32.48/0.8986
TABLE III: PSNR/SSIM results with and without feature normalization on Set5 with BI degradation. The three columns correspond to the three scaling factors.
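The paper states that F-Norm is built on depth-wise convolution, which is one reason such a component adds few parameters. The short calculation below compares the parameter count of a standard convolution with that of a depth-wise one. The channel width of 64 and the 3×3 kernel are hypothetical values chosen for illustration; the actual F-Norm configuration is not given in this section.

```python
def conv_params(in_ch, out_ch, kernel, groups=1, bias=True):
    """Parameter count of a 2-D convolution with square kernels."""
    if in_ch % groups or out_ch % groups:
        raise ValueError("channels must be divisible by groups")
    weights = out_ch * (in_ch // groups) * kernel * kernel
    return weights + (out_ch if bias else 0)

channels, kernel = 64, 3  # hypothetical layer width and kernel size
standard = conv_params(channels, channels, kernel)
depthwise = conv_params(channels, channels, kernel, groups=channels)

print(standard)   # 36928
print(depthwise)  # 640
print(f"{depthwise / standard:.1%} of a standard convolution")
```

Under these assumptions a depth-wise layer costs under 2% of the parameters of a standard convolution of the same width, which is consistent with the small overhead reported for feature normalization.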

In the proposed ISRN, different Down-sampler and Solver LR modules are applied in different iterations, which leads to better performance in general. For comparison, a variant that shares the same Down-sampler and Solver LR across all iterations is evaluated. The results are shown in Table IV. Using different components performs better in most cases; the only exception is the PSNR on Set14 (28.79 versus 28.80). These results provide useful evidence for training a different solver in each iteration. Using different components in different iterations leads to only a small increase in parameters.

Solvers Set5 Set14 B100
Same 32.43/0.8980 28.80/0.7870 27.70/0.7409
Different 32.55/0.8992 28.79/0.7872 27.74/0.7422
TABLE IV: PSNR/SSIM results with same and different solvers with BI  degradation.
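The differences in Table IV are small, so it helps to compute them explicitly. The snippet below uses only the values printed in the table and reports the change obtained by using different solvers per iteration instead of shared ones; the dictionary layout and function name are illustrative.

```python
# (PSNR, SSIM) pairs copied from Table IV.
table_iv = {
    "Set5":  {"same": (32.43, 0.8980), "different": (32.55, 0.8992)},
    "Set14": {"same": (28.80, 0.7870), "different": (28.79, 0.7872)},
    "B100":  {"same": (27.70, 0.7409), "different": (27.74, 0.7422)},
}

def gains(table):
    """Return (PSNR gain, SSIM gain) of 'different' over 'same' per dataset."""
    result = {}
    for name, rows in table.items():
        d_psnr = round(rows["different"][0] - rows["same"][0], 2)
        d_ssim = round(rows["different"][1] - rows["same"][1], 4)
        result[name] = (d_psnr, d_ssim)
    return result

for name, (d_psnr, d_ssim) in gains(table_iv).items():
    print(f"{name}: PSNR {d_psnr:+.2f}, SSIM {d_ssim:+.4f}")
```

SSIM improves on all three datasets, while PSNR improves on Set5 and B100 and is marginally lower on Set14.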
Fig. 123: PSNR comparisons on different blocks and groups with BI  degradation.

The number of blocks also influences the performance. Experiments with different numbers of blocks and groups are conducted, and the results are shown in Fig. 123. The comparisons are conducted on the five validation images from DIV2K. From Fig. 123, the performance improves as the number of blocks and the number of groups increase, showing that more blocks achieve better performance.

Study on Iteration Mechanism. To show the effect of iterations, experimental results with and without iterations are compared for different iteration numbers. The experiments are conducted on the 5 validation images from DIV2K at a single scaling factor. The results are shown in Fig. 124. The results show that iterations indeed improve the performance: as the number of iterations increases, the PSNR becomes higher.

Fig. 124: Performance comparisons for different iteration times with BI  degradation.

Moreover, the feature maps of different iterations are analyzed. The visualization of each iteration's result and of the output image is shown in Fig. 131. The comparison shows that richer details appear on the feature maps as the iteration number increases. In the first and second iterations, there is structural information with clear lines and contours. In the next three iterations, there are more details on the wing. It is also observed that different iterations concentrate on different features. This is in line with the hypothesis of a descent direction in Solver LR for each iteration. The MLE fuses the results from every iteration and makes full use of them.
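The idea of refining an HR estimate over several iterations and then fusing the per-iteration results can be sketched with a classical, non-learned analogue. The code below is not the ISRN architecture: it uses iterative back-projection with fixed operators, where block averaging loosely plays the role of the Down-sampler, the LR-space residual plays the role of the descent direction, and nearest-neighbour upsampling maps the correction back to HR space. The fusion step assumes each iteration's result is the true image plus independent zero-mean Gaussian noise, in which case the maximum likelihood estimate is the inverse-variance weighted mean. All names, the step size and the variances are assumptions.

```python
import numpy as np

def down(x, s):
    """Fixed down-sampler: s-by-s block averaging."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def up(x, s):
    """Fixed up-sampler: nearest-neighbour replication."""
    return np.kron(x, np.ones((s, s)))

def iterate_and_fuse(lr, scale, n_iter=5, step=0.5, variances=None):
    """Run back-projection iterations and fuse all intermediate HR estimates."""
    h, w = lr.shape
    x = np.zeros((h * scale, w * scale))
    estimates, residuals = [], []
    for _ in range(n_iter):
        r = lr - down(x, scale)        # mismatch measured in LR space
        x = x + step * up(r, scale)    # correction applied in HR space
        estimates.append(x)
        residuals.append(float(np.linalg.norm(lr - down(x, scale))))
    # Inverse-variance weights; equal variances reduce to a plain average.
    var = np.ones(n_iter) if variances is None else np.asarray(variances, dtype=float)
    weights = (1.0 / var) / (1.0 / var).sum()
    fused = np.tensordot(weights, np.stack(estimates), axes=1)
    return fused, residuals

lr = np.arange(1, 7, dtype=float).reshape(2, 3)
fused, residuals = iterate_and_fuse(lr, scale=2)
print(fused.shape)
print(residuals)  # the LR-space residual halves at every iteration
```

In this toy setting every iteration moves along the same direction, so later estimates are strictly better; passing smaller `variances` for later iterations weights them more heavily in the fusion. A learned network can instead let each iteration contribute different features, which is the behaviour described for Fig. 131.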

Fig. 131: Visualizations of different iterations with BI degradation. (a)-(e): results of the successive iterations; (f): the final result after MLE.
(a) HR image
(b) Plug-and-Play
(c) End-to-End
Fig. 135: Visual quality comparisons of Plug-and-Play and End-to-End with BI degradation.

Study on Plug-and-Play. To better illustrate the hypothesis about plug-and-play, Solver SR is substituted with a pre-trained RCAN [33]. The pre-trained model is downloaded from the GitHub repository provided by the authors. Note that nothing is changed except for the Solver SR. The result is shown in Fig. 135. The images are chosen from the Set5 dataset. The plug-and-play variant cannot deliver satisfactory results in either color or texture, which provides useful evidence for the hypothesis of learning different coding directories.

IV Conclusion

In this paper, a novel iterative super-resolution network (ISRN) was proposed for the SISR problem. We analyzed the problem from an optimization perspective and found a feasible solution in an iterative manner. Based on this formulation, each module of ISRN was elaborately designed, and a maximum likelihood estimation (MLE) was performed to take the results from all iterations into account. Specifically, a novel block named FNB with feature normalization was introduced to compose the network, with blocks grouped in a residual-in-residual way. Considering the drawbacks of batch normalization, the feature normalization (F-Norm) was designed for feature regulation with depth-wise convolution. Extensive experimental results on benchmark datasets with different degradation models show that the proposed ISRN and its extension ISRN+ recover structural information more effectively and achieve competitive or better performance with much fewer parameters.


References

  • [1] Z. Jin, M. Z. Iqbal, D. Bobkov, W. Zou, X. Li, and E. Steinbach, “A flexible deep CNN framework for image restoration,” IEEE Transactions on Multimedia, vol. 22, no. 4, pp. 1055–1068, 2020.
  • [2] B. Yan, B. Bare, C. Ma, K. Li, and W. Tan, “Deep objective quality assessment driven single image super-resolution,” IEEE Transactions on Multimedia, vol. 21, no. 11, pp. 2957–2971, 2019.
  • [3] Y. Wang, L. Wang, H. Wang, and P. Li, “Resolution-aware network for image super-resolution,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 5, pp. 1259–1269, 2019.
  • [4] W. Yang, X. Zhang, Y. Tian, W. Wang, J. Xue, and Q. Liao, “Deep learning for single image super-resolution: A brief review,” IEEE Transactions on Multimedia, vol. 21, no. 12, pp. 3106–3121, 2019.
  • [5] Y. Hu, J. Li, Y. Huang, and X. Gao, “Channel-wise and spatial feature modulation network for single image super-resolution,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019.
  • [6] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
  • [7] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
  • [8] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883.
  • [9] T. Dai, J. Cai, Y. Zhang, S. Xia, and L. Zhang, “Second-order attention network for single image super-resolution,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11057–11066.
  • [10] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1132–1140.
  • [11] T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4809–4817.
  • [12] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
  • [13] X. He, Z. Mo, P. Wang, Y. Liu, M. Yang, and J. Cheng, “ODE-inspired network design for single image super-resolution,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1732–1741.
  • [14] Z. He, S. Tang, J. Yang, Y. Cao, M. Ying Yang, and Y. Cao, “Cascaded deep networks with multiple receptive fields for infrared image super-resolution,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2310–2322, 2019.
  • [15] Z. He, Y. Cao, L. Du, B. Xu, J. Yang, Y. Cao, S. Tang, and Y. Zhuang, “MRFN: Multi-receptive-field network for fast and accurate single image super-resolution,” IEEE Transactions on Multimedia, vol. 22, no. 4, pp. 1042–1054, 2020.
  • [16] H. Wu, Z. Zou, J. Gui, W. Zeng, J. Ye, J. Zhang, H. Liu, and Z. Wei, “Multi-grained attention networks for single image super-resolution,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2020.
  • [17] Y. Hu, J. Li, Y. Huang, and X. Gao, “Channel-wise and spatial feature modulation network for single image super-resolution,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019.
  • [18] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1637–1645.
  • [19] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2790–2798.
  • [20] Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: A persistent memory network for image restoration,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4549–4557.
  • [21] X. Yang, H. Mei, J. Zhang, K. Xu, B. Yin, Q. Zhang, and X. Wei, “DRFN: Deep recurrent fusion network for single-image super-resolution with large factors,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 328–337, 2019.
  • [22] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, 2015, pp. 448–456.
  • [23] Z. Wang, J. Chen, and S. C. H. Hoi, “Deep learning for image super-resolution: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
  • [24] J. Yu, Y. Fan, J. Yang, N. Xu, Z. Wang, X. Wang, and T. Huang, “Wide activation for efficient and accurate image super-resolution,” 2018.
  • [25] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. Red Hook, NY, USA: Curran Associates Inc., 2016, pp. 901–909.
  • [26] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2808–2817.
  • [27] K. Zhang, W. Zuo, and L. Zhang, “Deep plug-and-play super-resolution for arbitrary blur kernels,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1671–1681.
  • [28] J. Gu, H. Lu, W. Zuo, and C. Dong, “Blind super-resolution with iterative kernel correction,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1604–1613.
  • [29] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, “Fast image recovery using variable splitting and constrained optimization,” IEEE Transactions on Image Processing, vol. 19, no. 9, pp. 2345–2356, 2010.
  • [30] J. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5197–5206.
  • [31] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu, “Feedback network for image super-resolution,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3862–3871.
  • [32] D. Geman and G. Reynolds, “Constrained restoration and the recovery of discontinuities,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 3, pp. 367–383, 1992.
  • [33] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds.   Cham: Springer International Publishing, 2018, pp. 294–310.
  • [34] W. Lai, J. Huang, N. Ahuja, and M. Yang, “Deep Laplacian pyramid networks for fast and accurate super-resolution,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5835–5843.
  • [35] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
  • [36] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 1664–1673.
  • [37] N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds.   Cham: Springer International Publishing, 2018, pp. 256–272.
  • [38] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1122–1131.
  • [39] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in Proceedings of the British Machine Vision Conference. BMVA Press, 2012, pp. 135.1–135.10.
  • [40] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Curves and Surfaces, J.-D. Boissonnat, P. Chenin, A. Cohen, C. Gout, T. Lyche, M.-L. Mazure, and L. Schumaker, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 711–730.
  • [41] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 2, 2001, pp. 416–423 vol.2.
  • [42] Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa, “Sketch-based manga retrieval using Manga109 dataset,” Multimedia Tools and Applications, vol. 76, no. 20, pp. 21811–21838, 2017.