Single Image Super-Resolution (SISR), which aims to restore a visually pleasing High-Resolution (HR) image from its Low-Resolution (LR) counterpart, remains a challenging task in the computer vision community [34, 36]. Since multiple HR solutions exist for any given LR input, SISR is highly ill-posed. To regularize the solution of SISR, various priors of natural images have been exploited; in particular, the current leading learning-based methods [39, 6, 22, 15, 16, 33, 32, 21, 1, 11, 20, 50] directly learn the non-linear LR-to-HR mapping.
By modeling the sparse prior of natural images, Sparse Coding (SC) based methods for SR [46, 47, 44] enjoy strong theoretical support and are widely used owing to their excellent performance. To cope with the complexity of images, these methods divide the image into overlapping patches and jointly train two over-complete dictionaries for LR/HR patches. Their framework usually consists of three steps. First, overlapping patches are extracted from the input image. Then, to reconstruct each HR patch, the sparse representation of the LR patch is applied to the HR dictionary, under the assumption that an LR/HR patch pair shares a similar sparse representation. Finally, the HR image is produced by aggregating the recovered HR patches.
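The three-step pipeline above (patch extraction, per-patch recovery, aggregation) can be sketched as follows. This is a minimal NumPy illustration with hypothetical helper names of our own; the per-patch SR mapping itself is left abstract:

```python
import numpy as np

def extract_patches(img, size, stride):
    """Extract overlapping size x size patches from a 2-D image."""
    H, W = img.shape
    patches, coords = [], []
    for i in range(0, H - size + 1, stride):
        for j in range(0, W - size + 1, stride):
            patches.append(img[i:i + size, j:j + size].copy())
            coords.append((i, j))
    return patches, coords

def aggregate_patches(patches, coords, shape, size):
    """Average overlapping patches back into an image."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    for p, (i, j) in zip(patches, coords):
        acc[i:i + size, j:j + size] += p
        cnt[i:i + size, j:j + size] += 1
    return acc / np.maximum(cnt, 1)
```

With stride 1 every pixel is covered by several patches, and averaging the overlaps only imposes a soft consistency between neighboring patch reconstructions — the very limitation the CSC discussion below addresses.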
Recently, with the development of Deep Learning (DL), many researchers have attempted to combine the advantages of DL and SC for image SR. Dong et al. proposed the seminal CNN model for SR, termed SRCNN, which exploits a shallow convolutional neural network to learn a non-linear LR-HR mapping in an end-to-end manner and dramatically outperforms conventional methods [47, 35]. However, the sparse prior is largely ignored in SRCNN, as it adopts a generic architecture without considering domain expertise. To address this issue, Wang et al. implemented a Sparse Coding based Network (SCN) for image SR that combines the merits of sparse coding and deep learning, fully exploiting the approximation of sparse coding learned by a LISTA-based sub-network.
It is worth noting that most SC based methods utilize the sparse prior locally, i.e., on overlapping image patches, so the consistency of pixels in overlapping patches is ignored [10, 28]. To address this issue, Convolutional Sparse Coding (CSC) has been proposed to treat the sparse prior as a global prior [48, 28, 27]; it bridges the local-global gap by working directly on the entire image through convolution. Consequently, CSC has attracted much attention from researchers [48, 4, 13, 10, 30, 8]. However, very few studies have validated CSC for image SR, and no CSC based image SR method has been reported to achieve state-of-the-art performance. Can CSC based image SR be highly competitive with recent state-of-the-art methods [6, 22, 15, 16, 33, 32, 37, 21, 20, 50]? To answer this question, the following issues need to be considered:
Optimization Issue. The previous CSC based image SR method consists of several steps that are optimized independently, and hundreds of iterations are required to solve the CSC problem in each step.
Memory Issue. To solve the CSC problem, ADMM is commonly employed [4, 42, 13, 43, 8], which requires the whole training set to be loaded into memory. As a consequence, it is impractical to improve performance by enlarging the training set.
Multi-Scale Issue. Training a single model for multiple scales is difficult for the previous CSC based image SR method.
Based on these considerations, in this paper we attempt to answer the aforementioned question. Specifically, we exploit the advantages of CSC and the powerful learning ability of deep networks to address the image SR problem. Moreover, the extensive theoretical foundations of CSC [28, 27, 7] make our proposed architectures interpretable and enable a theoretical analysis of our SR performance. In the rest of this paper, we first introduce CISTA, which can be naturally implemented with CNN architectures to solve the CSC problem. We then develop a framework for CSC based image SR, which addresses the Framework Issue. Subsequently, CRNet-A (CSC and Residual learning based Network) and CRNet-B, both inspired by this framework, are proposed for image SR. They are classified as pre- and post-upsampling models respectively, as the former takes Interpolated LR (ILR) images as input while the latter processes LR images directly. By adopting CNN architectures, the Optimization Issue and the Memory Issue are mitigated to a large extent. As for the Multi-Scale Issue, with the help of the recently introduced scale augmentation [15, 16] and scale-specific multi-path learning [21, 40] strategies, both of our models can handle the multi-scale SR problem effectively and achieve favorable performance against the state of the art, as shown in Fig. 1.
The main contributions of this paper include:
We introduce CISTA, which can be naturally implemented using CNN architectures for solving the CSC problem.
A novel framework for CSC based image SR is developed. Two models, CRNet-A and CRNet-B, inspired by this framework are proposed for image SR.
2 Related Work
2.1 Sparse Coding for Image Super-Resolution
Sparse coding has been widely used in a variety of applications. As for SISR, Yang et al. proposed the representative Sparse coding based Super-Resolution (ScSR) method. In the training stage, ScSR learns the LR/HR over-complete dictionary pair $D_l$/$D_h$ jointly from a group of LR/HR training patch pairs. In the test stage, each HR patch is reconstructed from its LR version under the assumption that they share the same sparse code. Specifically, the optimal sparse code is obtained by minimizing the sparsity-inducing $\ell_1$-norm regularized objective function
$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \tfrac{1}{2}\|\mathbf{y} - D_l\boldsymbol{\alpha}\|_2^2 + \lambda\|\boldsymbol{\alpha}\|_1,$$
and then the HR patch is obtained by $\mathbf{x} = D_h\hat{\boldsymbol{\alpha}}$. Finally, the HR image is estimated by aggregating all the reconstructed HR patches. Inspired by ScSR, many SC based SR methods have been proposed that impose various constraints on the sparse codes and dictionaries [45, 38].
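A minimal NumPy sketch of how such an $\ell_1$-regularized sparse code can be computed with ISTA, the classical iterative shrinkage-thresholding algorithm that LISTA later approximates; the function names and the step-size choice are our own assumptions:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, lam, n_iter=500):
    """ISTA for min_a 0.5*||y - D a||_2^2 + lam*||a||_1."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1/L with L the Lipschitz constant
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        # gradient step on the quadratic term, then shrinkage
        a = soft_threshold(a + step * D.T @ (y - D @ a), step * lam)
    return a
```

Under the shared-code assumption, the HR patch would then simply be `D_h @ a` for the HR dictionary `D_h`.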
2.2 Convolutional Sparse Coding for Image Super-Resolution
Traditional SC based SR algorithms usually process images in a patch-based manner to reduce the burden of modeling and computation, which leads to the inconsistency problem. As a special case of SC, CSC is inherently suited to this issue: it avoids the inconsistency problem by representing the whole image directly. Specifically, an image $\mathbf{x}$ can be represented as the summation of feature maps $\mathbf{z}_i$ convolved with the corresponding filters $\mathbf{f}_i$: $\mathbf{x} = \sum_i \mathbf{f}_i * \mathbf{z}_i$, where $*$ is the convolution operation.
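The CSC image model $\mathbf{x} = \sum_i \mathbf{f}_i * \mathbf{z}_i$ can be made concrete with a naive NumPy sketch; the helper names are ours and 'same'-size zero padding is assumed. The point is only to show that the model acts on the whole image, with no patches involved:

```python
import numpy as np

def conv2d_same(z, f):
    """'Same'-size 2-D convolution of feature map z with filter f (zero padding)."""
    kh, kw = f.shape
    ph, pw = kh // 2, kw // 2
    zp = np.pad(z, ((ph, ph), (pw, pw)))
    H, W = z.shape
    out = np.zeros((H, W))
    fr = f[::-1, ::-1]  # flip the kernel for true convolution
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(zp[i:i + kh, j:j + kw] * fr)
    return out

def csc_reconstruct(filters, feature_maps):
    """The CSC image model: x = sum_i f_i * z_i."""
    return sum(conv2d_same(z, f) for f, z in zip(filters, feature_maps))
```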
Gu et al. proposed the CSC-SR method and revealed the potential of CSC for image SR. CSC-SR requires solving a CSC based optimization problem of the form
$$\min_{\{\mathbf{f}_i\},\{\mathbf{z}_i\}} \tfrac{1}{2}\Big\|\mathbf{x} - \sum_i \mathbf{f}_i * \mathbf{z}_i\Big\|_2^2 + \lambda \sum_i \|\mathbf{z}_i\|_1$$
in both the training and testing phases.
CSC-SR solves this problem by alternately optimizing the filter and feature-map subproblems. The feature-map subproblem is a standard CSC problem; hundreds of iterations are required to solve it, so the aforementioned Optimization Issue and Memory Issue cannot be completely avoided. Inspired by the success of deep learning based sparse coding, we exploit the natural connection between CSC and CNN to solve the CSC problem efficiently.
3 CISTA for Solving the CSC Problem
CSC can be considered a special case of conventional SC, since the convolution operation can be expressed as matrix multiplication. The objective function of CSC can thus be formulated as
$$\min_{\mathbf{z}} \tfrac{1}{2}\|\mathbf{x} - W\mathbf{z}\|_2^2 + \lambda\|\mathbf{z}\|_1,$$
where $\mathbf{x}$ and $\mathbf{z}$ are in vectorized form and $W$ is a sparse convolution matrix with the following attributes:
where the flipping operators follow the notation of Zeiler et al., denoting that an array is flipped in the left/right or up/down direction, respectively.
Here $I$ denotes the identity matrix. Note that the identity matrix is itself a sparse convolution matrix, so according to (4) there exists a filter satisfying:
so that (6) becomes:
Even though (3) is formulated for a single image with one channel, the extension to multiple channels (for both images and filters) and to multiple images is mathematically straightforward, and (8) still holds in those settings.
As for the threshold, prior work reveals two important facts: (1) the expressiveness of the sparsity-inspired model is not affected even when the coefficients are restricted to be nonnegative; (2) the ReLU activation function and the soft nonnegative thresholding operator coincide, that is, $\mathrm{ReLU}(\mathbf{x} - \mathbf{b}) = \max(\mathbf{x} - \mathbf{b}, 0) = S_{\mathbf{b}}^{+}(\mathbf{x})$.
With this simplification, the final form of (8) is:
One can see that (10) is a convolutional form of (5), so we name it CISTA. It provides the solution of (3) with theoretical guarantees. Furthermore, this convolutional form can be implemented with CNN architectures, so the filters in (10) become trainable.
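Using the matrix form introduced above (convolutions unrolled into a sparse convolution matrix $W$), the recursion we assume in (10) can be sketched as a LISTA-style update $\mathbf{z}^{k+1} = \mathrm{ReLU}(W_e\mathbf{x} + S\mathbf{z}^k - \mathbf{b})$ with weights shared across recursions. The specific choices $W_e = W^{\top}/L$ and $S = I - W^{\top}W/L$ below are the classical ISTA values; in CRNet these operators would instead be trainable convolutions:

```python
import numpy as np

def cista_matrix_form(W, x, lam, n_rec=300):
    """LISTA/CISTA-style recursion with tied weights shared across recursions.
    Approximately solves min_z 0.5*||x - W z||^2 + lam*||z||_1 with z >= 0."""
    L = np.linalg.norm(W, 2) ** 2            # Lipschitz constant of the gradient
    W_e = W.T / L                             # 'encoder' weights
    S = np.eye(W.shape[1]) - W.T @ W / L      # lateral weights
    b = lam / L                               # nonnegative soft threshold
    z = np.zeros(W.shape[1])
    for _ in range(n_rec):
        # ReLU coincides with nonnegative soft thresholding
        z = np.maximum(W_e @ x + S @ z - b, 0.0)
    return z
```

Since the update starts from $\mathbf{z} = 0$ and each step is a proximal-gradient step, the objective value never exceeds its starting value.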
4 Proposed Method
In this section, our framework for CSC based image SR is first introduced, and then we implement it using CNN techniques. Since most image SR methods can be attributed to two frameworks with different upsampling strategies, i.e., pre-upsampling and post-upsampling, we propose two models: CRNet-A for pre-upsampling and CRNet-B for post-upsampling.
4.1 The framework for CSC based Image SR
By analogy with sparse coding based SR, we develop a framework for CSC based image SR. As shown in Fig. 3, LR feature maps are first extracted from the input LR image using the learned LR filters. Then the convolutional sparse codes of the LR feature maps are obtained using CISTA with the LR dictionary and a shared parameter, as indicated in (10). Under the assumption that the HR feature maps share the same convolutional sparse codes as the LR feature maps, the HR feature maps can be recovered from these codes. Finally, the HR image is reconstructed using the learned HR filters.
In this work, we implement this framework using CNN techniques. However, when combining CSC with CNN, the characteristics of CNNs themselves must be considered. With more recursions in CISTA, the network becomes deeper and tends to suffer from vanishing/exploding gradients. Residual learning [15, 16, 33] not only mitigates these difficulties but also helps the network converge faster. In Fig. 4, residual and non-residual networks with different numbers of recursions are compared experimentally; the residual network converges much faster and achieves better performance. Based on these observations, both of our proposed models adopt residual learning.
4.2 CRNet-A Model for Pre-upsampling
As shown in Fig. 5, CRNet-A takes the ILR image as input and predicts the output HR image. Two convolution layers are utilized for hierarchical feature extraction from the ILR image:
The ILR features are then fed into a CISTA block to learn the convolutional sparse codes. As stated in (10), two convolutional layers are needed:
The convolutional sparse codes are learned after a number of recursions, with the weights shared across every recursion. Once the convolutional sparse codes are obtained, they are passed through a convolution layer to recover the HR feature maps, and the last convolution layer serves as the HR filters:
Note that we pad zeros before all convolution operations so that all feature maps have the same size, a common strategy used in a variety of methods [15, 16, 33]. Hence the residual image has the same size as the input ILR image, and the final HR image is reconstructed by:
Given ILR-HR image patch pairs as the training set, our goal is to minimize the following objective function:
4.3 CRNet-B Model for Post-upsampling
We extend CRNet-A to a post-upsampling version to further explore its potential. Since most post-upsampling models [19, 37, 20, 50] need to train and store many scale-dependent models without fully exploiting inter-scale correlation, we adopt the scale-specific multi-path learning strategy presented in MDSR, with minor modifications, to address this issue. The complete model is shown in Fig. 6. The main branch is our CRNet-A module. The pre-processing modules reduce the variance of input images from different scales, and only one residual unit is used in each pre-processing module. At the end of CRNet-B, upsampling modules are used for multi-scale reconstruction.
5 Experimental Results
Tab. 2 (column headers): Dataset | Scale | Bicubic | SRCNN | RED30 | VDSR | DRCN | DRRN | MemNet | CRNet-A (ours)
Tab. 3 (column headers): Dataset | Scale | SRDenseNet | MSRN | D-DBPN | EDSR | MDSR | RDN | CRNet-B (ours) | CRNet-B+ (ours)
5.1 Datasets and metrics
Training Set Following [15, 33], the training set of CRNet-A consists of 291 images: 91 images from Yang et al. with the addition of 200 images from the Berkeley Segmentation Dataset. For CRNet-B, the 800 training images of DIV2K are used.
Testing Set During testing, Set5, Set14, B100, and Urban100 are employed. Since recent post-upsampling methods [21, 20, 11, 50] also evaluate their performance on Manga109, so does CRNet-B.
Metrics Both PSNR and SSIM on the Y channel (i.e., luminance) of the transformed YCbCr space are calculated for evaluation.
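Evaluation on the luminance channel can be reproduced with a small NumPy sketch using the standard BT.601 RGB-to-YCbCr conversion; the function names are ours:

```python
import numpy as np

def rgb_to_y(img):
    """Luminance (Y) channel of BT.601 YCbCr from an 8-bit RGB image."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(ref, est, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

PSNR would then be computed as `psnr(rgb_to_y(hr), rgb_to_y(sr))` after any border cropping the compared methods use.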
5.2 Implementation details
CRNet-A Data augmentation and scale augmentation [15, 16, 33] are used to train a single model for all scales (×2, ×3, and ×4). Every convolution layer in CRNet-A contains the same number of filters of the same spatial size, with two exceptions that use a different filter count. The network is optimized using SGD; the learning rate is initially set to a fixed value and then decreased by a constant factor at regular intervals. CRNet-A is trained with the L2 loss.
CRNet-B Every weight layer in CRNet-B uses filters of the same spatial size, again with two exceptions that use a different filter count. CRNet-B is optimized using Adam; the initial learning rate is halved at regular intervals during training. Unlike CRNet-A, CRNet-B is trained with the L1 loss for better convergence speed.
5.3 Comparison with CSC-SR
We first compare our proposed models with the existing CSC based image SR method, i.e., CSC-SR. Since CSC-SR takes LR images as input, it can be considered a post-upsampling method, so CRNet-B is used for the comparison. Tab. 1 shows that CRNet-B outperforms CSC-SR by a large margin.
5.4 Comparison with State of the Arts
We now compare the proposed models with other state-of-the-art methods from recent years. We compare CRNet-A with pre-upsampling models (i.e., SRCNN, RED30, VDSR, DRCN, DRRN, MemNet) and CRNet-B with post-upsampling architectures (i.e., SRDenseNet, MSRN, D-DBPN, EDSR/MDSR, RDN). Similar to [21, 50], the self-ensemble strategy is also adopted to further improve the performance of CRNet-B; we denote the self-ensembled version as CRNet-B+.
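The self-ensemble (geometric ensemble) strategy averages the model's outputs over the eight flip/rotation variants of the input; a minimal sketch, assuming `model` maps an array to an array in the same orientation:

```python
import numpy as np

def self_ensemble(model, lr):
    """Geometric self-ensemble: run the model on the 8 flip/rotation
    variants of the input, undo each transform, and average the outputs."""
    outs = []
    for flip in (False, True):
        img = np.flip(lr, axis=1) if flip else lr
        for k in range(4):
            out = model(np.rot90(img, k))
            out = np.rot90(out, -k)              # undo the rotation
            if flip:
                out = np.flip(out, axis=1)       # undo the flip
            outs.append(out)
    return np.mean(outs, axis=0)
```

This costs eight forward passes per image; with an identity model the ensemble returns the input unchanged, which makes the transform bookkeeping easy to check.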
Tab. 2 and Tab. 3 show the quantitative comparisons on the benchmark testing sets. Both of our models achieve superior performance against the state of the art, which indicates their effectiveness. Qualitative results are provided in Fig. 7: our methods tend to produce sharper edges and more accurate textures, while the outputs of other methods may be blurred or distorted. More visual comparisons are available in the supplementary material.
Fig. 11 shows performance versus the number of parameters: our CRNet-B and CRNet-B+ achieve better results with fewer parameters than EDSR and RDN. It is worth noting that EDSR/MDSR and RDN are far deeper than CRNet-B, whereas CRNet-B is considerably wider. As reported in [21], when the number of filters is increased beyond a certain level, the training procedure of EDSR without residual scaling [31, 21] becomes numerically unstable, as shown in Fig. (a). CRNet-B, however, is relieved from the residual scaling trick. The training loss of CRNet-B is depicted in Fig. (b): it converges quickly at the beginning, then keeps decreasing, and finally fluctuates within a narrow range.
5.5 Parameter Study
The key parameters in both of our models are the number of filters and the number of recursions.
Number of Filters The number of filters for CRNet-A is set as stated in Section 5.2. In Fig. (a), CRNet-A with different numbers of filters is tested (DRCN is used for reference). We find that a moderate decrease in the number of filters does not greatly affect performance, while a larger decrease causes an obvious drop, although the result is still better than DRCN. Based on these observations, we configure CRNet-B with more filters and fewer recursions to trade off model size against performance. As shown in Fig. (b), the performance of CRNet-B is significantly boosted with more filters (MDSR and MSRN are used for reference). Even with a small number of filters, CRNet-B still outperforms MSRN with fewer parameters.
Number of Recursions We also trained and tested CRNet-A with different numbers of recursions, which correspond to different network depths. The results are presented in Fig. (a): CRNet-A still outperforms DRCN at the same depth, and increasing the number of recursions improves the final performance. The results of using different numbers of recursions in CRNet-B are shown in Fig. (b), which demonstrates that more recursions lead to better performance.
6 Discussion
We discuss the differences between our proposed models and several recent CNN models for SR that use a recursive learning strategy, i.e., DRRN, SCN, and DRCN. Since CRNet-B is an extension of CRNet-A, i.e., the main part of CRNet-B has the same structure as CRNet-A, we use CRNet-A for this comparison. The simplified structures of these models are shown in Fig. 12, where the digits to the left of the recursion line denote the number of recursions.
Difference to DRRN. The main part of DRRN is its recursive block structure, in which several residual units with BN layers are stacked. In contrast, guided by (10), CRNet-A contains no BN layers. As noted for EDSR/MDSR, by normalizing features, BN layers remove range flexibility from the network; moreover, BN consumes a significant amount of GPU memory and increases computational complexity. Experimental results on benchmark datasets under commonly used assessments demonstrate the superiority of CRNet-A.
Difference to SCN. There are two main differences between CRNet-A and SCN: the CISTA block and residual learning. Specifically, CRNet-A takes the consistency constraint into consideration through the CISTA block, while SCN uses linear layers and ignores the information from the consistency prior. Moreover, CRNet-A adopts residual learning, a powerful tool for training deeper networks, and is much deeper than SCN. As indicated in prior work, a deeper network has a larger receptive field, so more contextual information in an image can be utilized to infer high-frequency details. In Fig. (a), we show that more recursions can be used to achieve better performance.
Difference to DRCN. CRNet-A differs from DRCN in two aspects: the recursive block and the training techniques. In the recursive block, both local residual learning and pre-activation [12, 33] are utilized in CRNet-A, and both have been demonstrated to be effective. As for training techniques, DRCN is not easy to train, so recursive supervision is introduced to help the network converge; moreover, an ensemble strategy (in Fig. (c), the final output is the weighted average of all intermediate predictions) is used to further improve performance. CRNet-A requires none of these techniques and can be easily trained with more recursions.
7 Conclusion
In this work, we propose two effective CSC based image SR models, i.e., CRNet-A and CRNet-B, for pre- and post-upsampling SR, respectively. By combining the merits of CSC and CNN, we achieve superior performance compared with recent state-of-the-art methods. Furthermore, our framework and CISTA block are expected to be applicable to various CSC based tasks, although in this paper we focus on CSC based image SR.
-  (2018) Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, Cited by: §1.
-  (2012) Low-Complexity Single-Image Super-Resolution based on Nonnegative Neighbor Embedding. BMVC, pp. 135.1–135.10. Cited by: Figure 1, §5.1.
-  (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 (1), pp. 1–122. Cited by: §1.
-  (2013) Fast Convolutional Sparse Coding. In CVPR, Cited by: §1, §1.
-  (2004) An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics 57 (11), pp. 1413–1457. Cited by: §3, §3.
-  (2016) Image super-resolution using deep convolutional networks. TPAMI 38 (2), pp. 295–307. Cited by: §1, §1, §1, §5.4, Table 2.
-  (2018) Convolutional Dictionary Learning: A Comparative Review and New Algorithms. IEEE Transactions on Computational Imaging 4 (3), pp. 366–381. Cited by: §1, §1.
-  (2010) Learning Fast Approximations of Sparse Coding. In ICML, Cited by: §1, §2.2.
-  (2015) Convolutional Sparse Coding for Image Super-Resolution. In ICCV, Cited by: 3rd item, §1, §1, §1, §2.2, §2.2, Table 1, §5.3.
-  (2018) Deep back-projection networks for super-resolution. In CVPR, Cited by: §1, §5.1, §5.4, Table 3.
-  (2016) Identity mappings in deep residual networks. In ECCV, Cited by: §6.
-  (2015) Fast and flexible convolutional sparse coding. In CVPR, Cited by: §1, §1.
-  (2015) Single image super-resolution from transformed self-exemplars. In CVPR, Cited by: Figure 1, Figure 2, §5.1.
-  (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, Cited by: §1, §1, §1, §4.1, §4.2, §5.1, §5.2, §5.4, Table 2, §6.
-  (2016) Deeply-Recursive Convolutional Network for Image Super-Resolution. In CVPR, Cited by: 4th item, §1, §1, §1, §4.1, §4.2, §5.2, §5.4, §5.5, Table 2, Figure 12, §6, §6.
-  (2014) Adam: a method for stochastic optimization. In ICLR, Cited by: §5.2.
-  (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.2.
-  (2017) Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In CVPR, Cited by: §4.3.
-  (2018) Multi-scale residual network for image super-resolution. In ECCV, Cited by: §1, §1, §4.3, §5.1, §5.4, §5.5, Table 3.
-  (2017) Enhanced deep residual networks for single image super-resolution. In CVPR Workshops, Cited by: 3rd item, §1, §1, §1, §4.3, §5.1, §5.4, §5.4, §5.5, Table 3, §6.
-  (2016) Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In NIPS, Cited by: §1, §1, §5.4, Table 2.
-  (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, Cited by: §5.1, §5.1.
-  (2017) Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications. Cited by: §5.1.
-  (2010) Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: §3.
-  (2017) Convolutional neural networks analyzed via convolutional sparse coding. The Journal of Machine Learning Research 18 (1), pp. 2887–2938. Cited by: §3.
-  (2018) Theoretical foundations of deep learning via sparse representations: a multilayer sparse model and its connection to convolutional neural networks. IEEE Signal Processing Magazine 35 (4), pp. 72–89. Cited by: §1, §1.
-  (2017) Working locally thinking globally: theoretical guarantees for convolutional sparse coding. IEEE Transactions on Signal Processing 65 (21), pp. 5687–5701. Cited by: §1, §1, §2.2.
-  (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §5.2.
-  (2018) Learned convolutional sparse coding. In ICASSP, Cited by: §1.
-  (2016) Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv:1602.07261. Cited by: §5.4.
-  (2017) Memnet: a persistent memory network for image restoration. In ICCV, Cited by: §1, §1, §5.4, Table 2.
-  (2017) Image Super-Resolution via Deep Recursive Residual Network. In CVPR, Cited by: 4th item, §1, §1, §4.1, §4.2, §5.1, §5.2, §5.4, Table 2, Figure 12, §6, §6, §6.
-  (2017) Ntire 2017 challenge on single image super-resolution: methods and results. In CVPR Workshops, Cited by: §1, §5.1.
-  (2014) A+: adjusted anchored neighborhood regression for fast super-resolution. In ACCV, Cited by: §1.
-  (2018) NTIRE 2018 challenge on single image super-resolution: methods and results. In CVPR Workshops, Cited by: §1.
-  (2017) Image super-resolution using dense skip connections. In ICCV, Cited by: §1, §4.3, §5.4, Table 3.
-  (2012) Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis. In CVPR, pp. 2216–2223. Cited by: §2.1.
-  (2015) Deep networks for image super-resolution with sparse prior. In ICCV, Cited by: 4th item, §1, §1, Figure 12, §6, §6.
-  (2019) Deep learning for image super-resolution: a survey. arXiv:1902.06068. Cited by: §1, §4.3.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE TIP 13 (4), pp. 600–612. Cited by: §5.1.
-  (2014) Efficient convolutional sparse coding. In ICASSP, Cited by: §1, §2.2.
-  (2016) Boundary handling for convolutional sparse representations. In ICIP, Cited by: §1.
-  (2014) Single-image super-resolution: a benchmark. In ECCV, Cited by: §1.
-  (2012) Coupled dictionary training for image super-resolution. IEEE TIP 21 (8), pp. 3467–3478. Cited by: §2.1.
-  (2008) Image super-resolution as sparse representation of raw image patches. In CVPR, Cited by: §1, §1, §2.1.
-  (2010) Image super-resolution via sparse representation. IEEE TIP 19 (11), pp. 2861–2873. Cited by: §1, §1, §1, §5.1.
-  (2010) Deconvolutional networks. In CVPR, Cited by: §1, §2.2, §3.
-  (2010) On single image scale-up using sparse-representations. In International conference on curves and surfaces, pp. 711–730. Cited by: §5.1.
-  (2018) Residual dense network for image super-resolution. In CVPR, Cited by: 3rd item, §1, §1, §4.3, §5.1, §5.4, §5.4, Table 3.
-  (2015) A Survey of Sparse Representation - Algorithms and Applications. IEEE Access 3, pp. 490–530. Cited by: §2.1.