Single image super-resolution (SR) aims to estimate a high-resolution (HR) image from a low-resolution (LR) image. It is a classical image processing problem that has received active research attention in the vision and graphics communities over the last decade. The renewed interest is due to the high-definition devices that are now ubiquitous in daily life, such as the iPhone XS, Pixel 3, iPad Pro, Samsung Galaxy Note9, and 4K UHD TVs. There is a great need to super-resolve existing LR images so that they can be pleasantly viewed on high-definition devices.
Recently, significant progress has been made by using convolutional neural networks (CNNs) in a regression manner. For example, numerous methods [3, 10, 11, 22, 15, 14, 4, 35] develop feed-forward networks with advanced architectures (e.g., residual networks [33]) or optimization strategies to learn the LR-to-HR mapping. These methods are efficient and outperform conventional hand-crafted prior-based methods by large margins. However, as the SR problem is highly ill-posed, feed-forward networks may not be sufficient to estimate the LR-to-HR mapping. In particular, the reconstructed HR images often do not strictly satisfy the image formation model of SR.
To address this issue, several methods improve feed-forward networks with feedback schemes, such as re-implementing the iterative back-projection method  with deep CNNs , using deep CNNs as image priors to constrain the solution space in a variational setting , or using the image formation model in a feedback step to constrain the training process . However, these algorithms all regenerate LR images from the reconstructed intermediate HR results. The downsampling operation leads to information loss and thus makes it difficult for these algorithms to estimate details and structures (e.g., Figure 1(f)).
We note that the LR image is usually assumed to be obtained by a convolution followed by a downsampling process on the HR image. Under this assumption, at the un-decimated positions, the LR image should have the same pixel values as the blurred HR image which is obtained by applying a convolution operation to the clear HR image. Thus, we should impose this image formation constraint in the network architecture to generate high-quality images.
However, it is challenging to apply the hard image formation constraint to deep neural networks, because it requires a feedback loop. To this end, we propose a cascaded architecture to efficiently learn the network parameters. The algorithm first generates an intermediate HR image by a deep neural network and then uses the LR image to update the intermediate HR image based on the image formation process. The updated intermediate HR image is further refined by the same deep neural network. Extensive experiments show that the proposed algorithm based on this cascaded manner converges quickly and can generate high-quality images with clear structures.
2 Related Work
We briefly discuss methods most relevant to this work and refer interested readers to  for comprehensive reviews.
Dong et al.  are the first to develop a CNN method for SR, named SRCNN. Kim et al.  show that the SRCNN algorithm is less effective at recovering image details and propose a residual learning algorithm using a 20-layer CNN. In , Kim et al. introduce a deep recursive convolutional network (DRCN) using recursive supervision and skip connections. The recursive learning algorithm is further improved by Tai et al. , where both global and local learning are used to increase the performance. However, these methods usually upscale LR images to the desired spatial resolution using bicubic interpolation before feeding them to the network, which is less effective for detail restoration as bicubic interpolation usually removes details.
As a remedy, the sub-pixel convolutional layer  and the deconvolution layer  have been developed based on SRCNN. In , the Laplacian Pyramid Super-Resolution Network (LapSRN) is proposed to progressively predict sub-band residuals at various scales. Based on the sub-pixel convolutional layer, several algorithms develop networks with advanced architectures and strategies, e.g., dense skip connections [27, 35], dual-state recurrent models , and residual channel attention . These algorithms are effective for super-resolving LR images but usually tend to over-smooth structural details. To generate more realistic images, Generative Adversarial Networks (GANs) with both pixel-wise and perceptual loss functions have been used to solve the SR problem [14, 19]. Recent work  first uses GANs to generate more realistic training images and then trains GANs with the generated images for SR. Motivated by the generative network in , Lim et al.  remove some unnecessary nonlinear activation functions in the generator and propose an Enhanced Deep Super-Resolution (EDSR) network. However, all these methods directly predict the nonlinear LR-to-HR mapping with feed-forward networks. They do not explore the domain knowledge of the SR problem and tend to fail at recovering fine image details.
To generate high-quality images that satisfy the image formation constraint, Wang et al.  propose a sparse coding network (SCN) based on the sparse representation prior. In , Zhang et al. learn a CNN as an image prior to constrain the iterative back-projection algorithm . More recently, deep neural networks with feedback schemes have been used in SR. Haris et al.  improve the conventional iterative back-projection algorithm using CNNs. Pan et al.  propose a GAN model with an image formation constraint for image restoration. However, these algorithms need to regenerate LR images in the feedback step, which accordingly increases the difficulty of restoring details and structures. Moreover, the image formation in these methods is used as a soft constraint, which does not directly help the SR results. Using the image formation as a hard constraint is first introduced by Shan et al.  in a variational framework. This method  uses pixel substitution to ensure that the generated SR results satisfy the image formation of SR in a hard way. However, it cannot effectively recover details and structures as only a gradient sparsity prior is used.
In this work, we revisit the idea of pixel substitution to impose the hard image formation constraint in a deep neural network. The proposed algorithm explores the information from both HR images and LR inputs with a deep neural network in a regression manner and is able to generate results satisfying the image formation model, thus facilitating high-quality image restoration.
3 Image Formation Process
We first describe the image formation process of the SR problem and then derive the image formation constraint. Given an HR image $\mathbf{H}$, the process of generating the LR image $\mathbf{L}$ is usually defined as

$$\mathbf{L} = (\mathbf{H} \otimes \mathbf{k})\downarrow_{s}, \qquad (1)$$

where $\mathbf{k}$ denotes the blur kernel, $\otimes$ denotes the convolution operator, and $\downarrow_{s}$ denotes the downsampling operation with a scale factor $s$. Mathematically, this image formation process can be rewritten as

$$\mathbf{l} = \mathbf{S}\mathbf{K}\mathbf{h}, \qquad (2)$$

where $\mathbf{K}$ denotes the filtering matrix corresponding to the blur kernel $\mathbf{k}$; $\mathbf{S}$ denotes the downsampling operation; and $\mathbf{h}$ and $\mathbf{l}$ denote the vector forms of $\mathbf{H}$ and $\mathbf{L}$.
Applying the upsampling matrix, i.e., $\mathbf{S}^{\top}$, we have

$$\mathbf{S}^{\top}\mathbf{l} = \mathbf{S}^{\top}\mathbf{S}\mathbf{K}\mathbf{h} = \mathbf{M}\mathbf{K}\mathbf{h}, \qquad (3)$$

where $\mathbf{M} = \mathbf{S}^{\top}\mathbf{S}$ is a selection matrix which is defined as

$$\mathbf{M}(i,j) = \begin{cases} 1, & \text{if } i = j \text{ and pixel } i \text{ is retained by } \mathbf{S}, \\ 0, & \text{otherwise}, \end{cases} \qquad (4)$$

where $i$ and $j$ denote pixel locations; $1 \le i, j \le N$, and $N$ denotes the number of the pixels in $\mathbf{h}$. If $\mathbf{M}(i,i) = 1$, we denote $i$ as the un-decimated position. The constraint (3) indicates that the pixel value of $\mathbf{l}$ in $\mathbf{S}^{\top}\mathbf{l}$ is equal to the pixel value of the blurred high-resolution image $\mathbf{K}\mathbf{h}$ at the un-decimated positions. In the following, we will use the image formation constraint (3) to guide our SR algorithm so that it can generate high-resolution images satisfying this constraint.
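To make the constraint concrete, the following toy 1-D sketch (our own illustration, not the paper's code; the names `S`, `K`, `M`, `h`, `l` stand for the downsampling matrix, blur matrix, selection matrix, and the HR/LR vectors) builds the matrices explicitly and checks that the upsampled LR signal equals the blurred HR signal at the un-decimated positions:

```python
import numpy as np

s, n = 2, 8                       # scale factor and number of HR pixels
S = np.zeros((n // s, n))
for r in range(n // s):
    S[r, r * s] = 1.0             # keep every s-th pixel (decimation)

# Toy blur matrix K: normalized 3-tap box filter with circular boundary.
K = np.zeros((n, n))
for i in range(n):
    for d in (-1, 0, 1):
        K[i, (i + d) % n] = 1.0 / 3.0

h = np.arange(1.0, n + 1.0)       # toy HR signal
l = S @ K @ h                     # LR observation, as in the matrix form

M = S.T @ S                       # selection matrix: diagonal 0/1 entries
assert np.allclose(np.diag(M), [1, 0] * (n // s))
# Constraint: upsampled LR equals blurred HR at the un-decimated positions.
assert np.allclose(S.T @ l, M @ K @ h)
```

The check holds exactly because the upsampling matrix is the transpose of the decimation matrix, so their product simply masks out the decimated positions.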
4 Proposed Algorithm
The analysis above inspires us to use the image formation process to constrain deep neural networks for SR. Specifically, we first generate an intermediate HR image from an LR image by a deep neural network. Then we apply the blur kernel to the intermediate HR image and use pixel substitution (Section 4.2) to enforce the image formation constraint in the feedback step, as shown in Figure 2. In the following, we explain the details of the proposed algorithm.
4.1 Intermediate HR image estimation
The effectiveness of using deep CNNs to super-resolve images has been extensively validated. Our goal here is not to propose a novel network structure but to develop a new framework that constrains the generated SR results using the image formation process. Thus, we can use an existing network architecture, such as EDSR , SRCNN , or VDSR . In this paper, we adopt a network architecture similar to that of  as our HR image estimation sub-network. Figure 2 shows the proposed network architecture for one stage of the cascaded approach. The parameters of the network are shown in Table 1.
4.2 Pixel substitution
Let $\tilde{\mathbf{h}}$ be the output of the HR image estimation sub-network. If $\tilde{\mathbf{h}}$ is the ground truth HR image, the equality in the SR formation model (3) strictly holds. Thus, to enforce the intermediate HR image $\tilde{\mathbf{h}}$ to be close to the ground truth HR image, we adopt the pixel substitution operation . Specifically, we first obtain the upsampled image $\mathbf{S}^{\top}\mathbf{l}$ by applying the upsampling matrix $\mathbf{S}^{\top}$ to the LR image $\mathbf{l}$, and the blurred intermediate HR image $\mathbf{K}\tilde{\mathbf{h}}$ by applying the blur kernel to the intermediate HR image $\tilde{\mathbf{h}}$. Then the output $\hat{\mathbf{h}}$ of the pixel substitution operation is

$$\hat{\mathbf{h}} = \mathbf{S}^{\top}\mathbf{l} + (\mathbf{I} - \mathbf{M})\mathbf{K}\tilde{\mathbf{h}}, \qquad (5)$$

where $\mathbf{I}$ is the identity matrix; that is, the blurred estimate keeps its values at the decimated positions while the values at the un-decimated positions are replaced by the observed LR pixels.
Empirically, we find that this scheme for enforcing the image formation constraint converges well, as shown in Figure 8.
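In image form, the substitution amounts to blurring the intermediate HR estimate and overwriting the pixels at the un-decimated positions with the observed LR values. A minimal sketch follows (our own illustration, not the paper's implementation; the function name and the SciPy-based blur are assumptions):

```python
import numpy as np
from scipy.signal import convolve2d

def pixel_substitution(lr, hr_est, kernel, s):
    """Blur the intermediate HR estimate, then replace its values at the
    un-decimated positions (every s-th pixel) with the LR observations."""
    blurred = convolve2d(hr_est, kernel, mode="same", boundary="symm")
    out = blurred.copy()
    out[::s, ::s] = lr  # enforce the image formation constraint in a hard way
    return out

# Toy usage: with a delta (identity) kernel, the output matches hr_est
# everywhere except at the substituted positions.
delta = np.zeros((3, 3)); delta[1, 1] = 1.0
hr_est = np.random.rand(8, 8)
lr = np.random.rand(4, 4)
out = pixel_substitution(lr, hr_est, delta, s=2)
assert np.allclose(out[::2, ::2], lr)
```

The substituted output is then fed to the next-stage network, which effectively deconvolves and refines it.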
|LapSRN ||37.52/0.9591||33.08/0.9130||31.08/0.8950||30.41/0.9101||37.27/0.9740||35.31/0.9442|
|VDSR ||33.67/0.9210||29.78/0.8320||28.83/0.7990||27.14/0.8290||32.01/0.9340||31.76/0.8780|
|VDSR ||31.35/0.8830||28.02/0.7680||27.29/0.7260||25.18/0.7540||28.83/0.8870||29.82/0.8240|
|EDSR ||32.48/0.8985||28.80/0.7876||27.72/0.7419||26.65/0.8032||31.03/0.9156||30.73/0.8445|
4.3 Cascaded training
As the proposed algorithm consists of both intermediate HR image estimation and pixel substitution, we perform these two steps in a cascaded manner. Let $\Theta^{t}$ denote the model parameters at stage (iteration) $t$, and $\{\mathbf{l}_i, \mathbf{h}_i\}_{i=1}^{N}$ denote a set of training samples. We learn the stage-dependent model parameters $\Theta^{t}$ by minimizing the cost function

$$\min_{\Theta^{t}} \sum_{i=1}^{N} \big\|\mathcal{N}(\hat{\mathbf{h}}_i^{t-1}; \Theta^{t}) - \mathbf{h}_i\big\|^{2}, \qquad (6)$$

where $\mathcal{N}(\cdot\,; \Theta^{t})$ denotes the HR image estimation sub-network at stage $t$ and $\hat{\mathbf{h}}_i^{t-1}$ is the output of the pixel substitution step at stage $t-1$.
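The cascade can be sketched as a simple loop (an illustration of our own; `upsample`, `sr_net`, and `substitute` are placeholder callables for the initial upsampling, the HR-estimation sub-network, and the pixel-substitution step, and the stage count is illustrative):

```python
def cascaded_forward(lr, upsample, sr_net, substitute, stages=3):
    """Alternate HR estimation and hard image-formation feedback;
    return the intermediate HR estimate produced at every stage."""
    x = upsample(lr)
    outputs = []
    for _ in range(stages):
        x = sr_net(x)            # intermediate HR estimate at this stage
        outputs.append(x)
        x = substitute(lr, x)    # enforce the image formation constraint
    return outputs

# Training would minimize a reconstruction loss between each stage's
# output and the ground-truth HR image, stage by stage.
```

A toy call with scalar stand-ins, e.g. `cascaded_forward(1.0, lambda l: 2 * l, lambda x: x + 1, lambda l, x: x)`, walks through three estimate/feedback rounds.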
[Figure 3: Visual comparisons with a scale factor of 3 on two example images. Panels: (a) GT HR image, (b) HR patch, (c) Bicubic, (d) A+, (e) SRCNN, (f) FSRCNN, (g) VDSR, (h) LapSRN, (i) EDSR, (j) RDN, (k) Ours.]
[Figure 4: Visual comparisons with a scale factor of 4 on two example images. Panels: (a) GT HR image, (b) HR patch, (c) Bicubic, (d) A+, (e) SRCNN, (f) VDSR, (g) LapSRN, (h) EDSR, (i) RDN, (j) DBPN, (k) Ours.]
[Figure 5: Visual comparisons on two real input images. Panels: (a) Input image, (b) Bicubic, (c) SRCNN, (d) VDSR, (e) DRCN, (f) LapSRN, (g) DRRN, (h) EDSR, (i) Ours.]
5 Experimental Results
We examine the proposed algorithm using publicly available benchmark datasets and compare it to state-of-the-art single image SR methods.
5.1 Parameter settings and training data
In the learning process, we use the ADAM optimizer  with parameters , , and . The minibatch size is set to be . The learning rate is initialized to be . We use a Gaussian kernel in (3) with the same settings used in . We empirically set as a trade-off between accuracy and speed. In the first stage, we use the same upsampling layer as  to upsample the features before the Conv layer.
For fair comparisons, we first follow the standard protocols adopted by existing methods (e.g., [6, 15, 34, 35]) to generate LR images using bicubic downsampling from the DIV2K dataset  for training and use Set5  as the validation set. Then, we evaluate the effectiveness of our algorithm when LR images are obtained with different image formation models of SR in Section 6. We implement our algorithm based on the PyTorch version of . The code will be made publicly available on the authors' website.
5.2 Comparisons with the state of the art
To evaluate the performance of the proposed algorithm, we compare it against state-of-the-art algorithms including A+ , SRCNN , FSRCNN , VDSR , LapSRN , MemNet , DRCN , DRRN , EDSR , RDN , and DBPN . We use the benchmark datasets Set5 , Set14 , B100 , Urban100 , Manga109 , and DIV2K (validation set)  to evaluate the performance. These datasets cover diverse image content: Set5, Set14, and B100 consist of natural scenes; Urban100 mainly contains urban scenes with details in different frequency bands; Manga109 is a dataset of Japanese manga; and the DIV2K validation set contains 100 natural images at 2K resolution. We use PSNR and SSIM to evaluate the quality of each recovered image.
Table 2 summarizes the quantitative results on these benchmark datasets for the upsampling factors of 2, 3, and 4. Overall, the proposed method performs favorably against the state-of-the-art methods.
Figure 3 shows some SR results with a scale factor of 3 by the evaluated methods. The results by the feed-forward models [3, 4, 10, 13, 15, 35] do not recover the structures well. The EDSR algorithm  simplifies and improves the network architectures in . However, the structures of the super-resolved images are not sharp (Figure 3(i1) and (i2)). Although the proposed network is based on the network structure of EDSR , using pixel substitution to enforce the image formation constraint generates high-quality images.
Figure 4 shows SR results with a scale factor of 4 by the evaluated methods. The recent DBPN algorithm  adopts a feedback network to super-resolve images using information from the LR images. However, this method needs to regenerate LR features from intermediate HR features. Consequently, the information at the un-decimated pixels is lost, which makes it hard to estimate details and structures. The results in Figure 4(j1) and (j2) show that the structures of the images super-resolved by the DBPN method are not recovered well. In contrast, the proposed method recovers finer image details and structures than the state-of-the-art algorithms.
6 Analysis and Discussions
We have shown that enforcing the image formation constraint using pixel substitution leads to an algorithm that outperforms state-of-the-art methods. To better understand the proposed algorithm, we perform further analysis, compare it with related methods, and discuss its limitations.
[Figure 6: Ablation comparisons. Panels: (a) HR patch, (b) Bicubic, (c) w/ only one stage, (d) w/o conv. & w/o (5), (e) one stage with (7), (f) Ours.]
|w/o conv. & (5)||38.20/0.9612||33.96/0.9195||32.33/0.9017||32.78/0.9347||39.07/0.9780|
|w/ only one stage||38.19/0.9609||33.92/0.9195||32.35/0.9019||32.97/0.9358||39.20/0.9783|
|one stage with (7)||38.22/0.9612||33.84/0.9167||32.33/0.9014||32.83/0.9351||39.00/0.9777|
Effectiveness of the image formation constraint.
As our cascaded architecture uses a basic SR network several times, one may wonder whether the performance gains merely come from the use of a larger network. To answer this question, we remove the pixel substitution step from our cascaded network architecture for fair comparisons. The comparisons in Figure 6(d) and (f) demonstrate the benefit of using the image formation constraint in generating clearer images with finer details and structures. We note that there is little performance improvement by simply cascading a basic SR network several times to increase the network capacity (Figure 6(d)). The results in Table 3 show that using the image formation constraint of SR consistently improves SR results, which further demonstrates the effectiveness of this constraint.
As the proposed network architectures are similar to those used in , the proposed algorithm with only one stage would reduce to the feed-forward model  to some extent. Both the quantitative evaluations in Table 3 and comparisons in Figure 6(c) show that only using one feed-forward model does not generate high-quality HR images.
We further note that an alternative approach is to add the image formation model (1) to the loss function to constrain the network training instead of using a feedback scheme, where the new loss function is defined as

$$\mathcal{L} = \sum_{i=1}^{N} \big\|\mathcal{N}(\mathbf{l}_i; \Theta) - \mathbf{h}_i\big\|^{2} + \lambda \big\|\mathbf{S}\mathbf{K}\,\mathcal{N}(\mathbf{l}_i; \Theta) - \mathbf{l}_i\big\|^{2}, \qquad (7)$$

where $\lambda$ is a weight parameter, which we set empirically for fair comparisons in this paper. We quantitatively evaluate the feed-forward network trained with (7) on the benchmark datasets. Both the quantitative results in Table 3 and the visual comparison (Figure 6(e)) demonstrate that adding the image formation loss to the overall loss function does not always improve the performance.
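For reference, the soft-constraint objective can be sketched as follows (our own illustration; `degrade` stands for the blur-plus-downsampling operator, and the default weight value is illustrative, not the paper's setting):

```python
import numpy as np

def soft_formation_loss(hr_est, hr_gt, lr, degrade, lam=0.1):
    """Reconstruction loss plus a soft image-formation penalty:
    the degraded HR estimate should match the observed LR image."""
    fidelity = np.mean((hr_est - hr_gt) ** 2)
    formation = np.mean((degrade(hr_est) - lr) ** 2)
    return fidelity + lam * formation

# If hr_est equals the ground truth and satisfies the formation model,
# both terms vanish and the loss is zero.
```

Unlike pixel substitution, this penalty only encourages consistency with the formation model on average; it does not guarantee it at the un-decimated positions.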
[Figure 7: Comparisons with feedback-based methods. Panels: (a) HR patch, (b) Bicubic, (c) w/ only one stage, (d) Shan et al., (e) DBPN, (f) Ours.]
Several notable methods [6, 21] improve the back-projection algorithm  for single image SR. The DBPN algorithm  extends the back-projection method  using a deep neural network. It needs a downsampling operation after obtaining intermediate HR images in the feedback stage. As the information at the un-decimated pixels of the intermediate HR images may be lost due to the downsampling operation, DBPN is less effective at recovering details and structures (Figure 7(e)). The method of Shan et al.  first proposes pixel substitution to enforce the image formation constraint in an iterative optimization scheme. However, this method cannot effectively restore edges and textures (Figure 7(d)) because only a gradient sparsity prior is used. In contrast, our algorithm uses pixel substitution to constrain the deep CNN, and both edges and textures are well recovered (see Figure 7(f)).
We further examine whether the estimated HR images satisfy the image formation constraint. To this end, we apply the image formation model to the estimated HR images to regenerate the LR images and use the PSNR and mean squared error (MSE) as metrics. The MSE values in Table 4 indicate that the results generated by the proposed method satisfy the image formation model well.
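This consistency check can be reproduced with a short sketch (our own illustration; the kernel and scale are assumptions): regenerate the LR image from an estimated HR image via the formation model and measure the MSE against the observed input.

```python
import numpy as np
from scipy.signal import convolve2d

def formation_mse(hr_est, lr, kernel, s):
    """MSE between the observed LR image and the LR image regenerated
    from hr_est by blurring and decimating (the formation model)."""
    blurred = convolve2d(hr_est, kernel, mode="same", boundary="symm")
    return float(np.mean((blurred[::s, ::s] - lr) ** 2))

# Sanity check: an HR image that actually generated the LR input
# yields zero MSE under the same formation model.
kernel = np.ones((3, 3)) / 9.0
hr = np.random.rand(8, 8)
lr = convolve2d(hr, kernel, mode="same", boundary="symm")[::2, ::2]
assert formation_mse(hr, lr, kernel, s=2) == 0.0
```

A small MSE here means the super-resolved result is consistent with the observed LR image, which is exactly what the hard constraint enforces.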
Robustness to general degradation models of SR.
We have shown that using the image formation constraint makes deep CNNs more compact and thus facilitates the SR problem when the degradation model is approximated by the bicubic downsampling operation in Section 5. We further evaluate our method on other degradation models [32, 34]. One degradation model is based on (1), where the blur kernel is Gaussian (denoted as GD). We use this model to generate the LR images using 800 images from DIV2K for training. The size of the Gaussian kernel used for generating LR images ranges from  to  pixels. Table 5 demonstrates that the proposed algorithm performs favorably against state-of-the-art methods due to the use of the image formation constraint.
We then evaluate the proposed algorithm when the degradation model is approximated by bicubic downsampling with noise. To generate LR images for training, we add Gaussian noise to each LR image used in Section 5.1, where the noise level ranges from 0 to 10%. Table 6 shows that our algorithm is robust to image noise due to the cascaded optimization method.
All the above results on both synthetic and real-world images demonstrate that the proposed algorithm generalizes well even though the image formation constraint is based on known blur kernels.
To quantitatively evaluate the convergence properties of our algorithm, we evaluate our method on the benchmark dataset Set5. Figure 8 shows that the network converges after 250 epochs in terms of the average PSNR values. We further note that using a 2-stage cascaded model generates better results, and using more stages does not significantly improve the performance.
Figure 9 shows some intermediate HR images from the proposed method. We note that the structural details are better recovered with more stages. This further demonstrates that using the image formation constraint in a deep CNN helps the restoration of the structural details.
Running time performance.
As our algorithm uses a cascaded architecture, it increases the computation. We examine the running time of the proposed algorithm and compare it with state-of-the-art methods on the Set5 dataset, as shown in Table 7. The proposed algorithm takes slightly more running time compared with the feed-forward models, e.g., [10, 15]. The proposed algorithm is about times faster than the feedback DBPN method .
As our algorithm uses the known image formation of SR to approximate the unknown degradation model, it is less effective when this approximation does not hold. Figure 10 shows an example with significant JPEG compression artifacts, where the image formation model of SR does not approximate the degradation caused by image compression well. Our algorithm exacerbates the compression artifacts, while the results by the feed-forward models contain fewer artifacts. Building the compression process into the network architecture is likely to reduce these artifacts.
[Figure 9: Intermediate HR results at each stage of the proposed method. Panels: (a) Bicubic, (b) DBPN, (c) w/ only one stage, (d) Stage 1, (e) Stage 2, (f) Stage 3.]
[Figure 10: A failure case with significant JPEG compression artifacts. Panels: (a) Input image, (b) VDSR, (c) Ours.]
|Avg. running time (s)||0.88||1.16||2.01||6.81||2.21|
7 Concluding Remarks
We have introduced a simple and effective super-resolution algorithm that exploits the image formation constraint. The proposed algorithm first uses a deep CNN to estimate an intermediate HR image and then uses pixel substitution to enforce that the intermediate HR image satisfies the image formation model at the un-decimated pixel positions. Our cascaded architecture can be applied to existing feed-forward super-resolution networks. Both quantitative and qualitative results show that the proposed algorithm performs favorably against state-of-the-art methods.
-  (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, pp. 1–10. Cited by: §5.1, §5.2.
-  (2018) To learn image super-resolution, use a GAN to learn how to do image degradation first. In ECCV, pp. 187–202. Cited by: §2.
-  (2014) Learning a deep convolutional network for image super-resolution. In ECCV, pp. 184–199. Cited by: §1, §2, Figure 3, Figure 4, Figure 5, §4.1, Table 2, §5.2, §5.2, §5.2.
-  (2016) Accelerating the super-resolution convolutional neural network. In ECCV, pp. 391–407. Cited by: §1, §2, Figure 3, Table 2, §5.2, §5.2.
-  (2018) Image super-resolution via dual-state recurrent networks. In CVPR, pp. 1654–1663. Cited by: §2.
-  (2018) Deep back-projection networks for super-resolution. In CVPR, pp. 1664–1673. Cited by: Figure 1, §1, §2, Figure 4, Table 2, §5.1, §5.2, §5.2, Figure 7, Figure 9, Table 4, §6, §6.
-  (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, Table 1.
-  (2015) Single image super-resolution from transformed self-exemplars. In CVPR, pp. 5197–5206. Cited by: §5.2.
-  (1991) Improving resolution by image registration. CVGIP: Graphical Model and Image Processing 53 (3), pp. 231–239. Cited by: §1, §2, Figure 7, §6.
-  (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, pp. 1646–1654. Cited by: Figure 1, §1, §2, Figure 3, Figure 4, Figure 5, §4.1, Table 2, §5.2, §5.2, §5.2, Figure 10, §6.
-  (2016) Deeply-recursive convolutional network for image super-resolution. In CVPR, pp. 1637–1645. Cited by: §1, §2, Figure 5, Table 2, §5.2.
-  (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. Cited by: §5.1.
-  (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, pp. 624–632. Cited by: §2, Figure 3, Figure 4, Figure 5, Table 2, §5.2, §5.2, §5.2.
-  (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pp. 105–114. Cited by: §1, §2, §5.2.
-  (2017) Enhanced deep residual networks for single image super-resolution. In CVPR, pp. 1132–1140. Cited by: Figure 1, §1, §2, Table 1, Figure 3, Figure 4, Figure 5, §4.1, §4.3, Table 2, §5.1, §5.1, §5.2, §5.2, §5.2, Table 5, Table 6, §6, §6.
-  (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, pp. 416–425. Cited by: §5.2.
-  (2017) Sketch-based manga retrieval using manga109 dataset. Multimedia Tools Appl. 76 (20), pp. 21811–21838. Cited by: §5.2.
-  (2018) Physics-based generative adversarial models for image restoration and beyond. CoRR abs/1808.00605. Cited by: §1, §2.
-  (2017) EnhanceNet: single image super-resolution through automated texture synthesis. In ICCV, pp. 4491–4500. Cited by: §2.
-  (2008) High-quality motion deblurring from a single image. ACM TOG 27 (3), pp. 73:1–73:10. Cited by: §2.
-  (2008) Fast image/video upsampling. ACM TOG 27 (5), pp. 153:1–153:7. Cited by: §2, §4.2, §5.1, Figure 7, §6.
-  (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pp. 1874–1883. Cited by: §1, §2, §2.
-  (2017) MemNet: a persistent memory network for image restoration. In ICCV, pp. 4539–4547. Cited by: Table 2, §5.2.
-  (2017) Image super-resolution via deep recursive residual network. In CVPR, pp. 3147–3155. Cited by: §2, Figure 5, §5.2.
-  (2017) NTIRE 2017 challenge on single image super-resolution: methods and results. In CVPR Workshops, pp. 1110–1121. Cited by: §5.1, §5.2.
-  (2014) A+: adjusted anchored neighborhood regression for fast super-resolution. In ACCV, pp. 111–126. Cited by: Figure 3, Figure 4, Table 2, §5.2.
-  (2017) Image super-resolution using dense skip connections. In CVPR, pp. 4799–4807. Cited by: §2.
-  (2015) Deep networks for image super-resolution with sparse prior. In CVPR, pp. 370–378. Cited by: §2.
-  (2014) Single-image super-resolution: A benchmark. In ECCV, pp. 372–386. Cited by: §2.
-  (2010) On single image scale-up using sparse-representations. In The 7th International Conference on Curves and Surfaces, pp. 711–730. Cited by: §5.2.
-  (2017) Learning deep CNN denoiser prior for image restoration. In CVPR, pp. 2808–2817. Cited by: §1, §2.
-  (2018) Learning a single convolutional super-resolution network for multiple degradations. In CVPR, pp. 3262–3271. Cited by: §6.
-  (2018) Image super-resolution using very deep residual channel attention networks. In ECCV, pp. 286–301. Cited by: §1, §2.
-  (2018) Image super-resolution using very deep residual channel attention networks. In ECCV, pp. 294–310. Cited by: §5.1, Table 5, §6.
-  (2018) Residual dense network for image super-resolution. In CVPR, pp. 2472–2481. Cited by: §1, §2, Figure 3, Figure 4, Table 2, §5.1, §5.2, §5.2, Table 5, Table 6.