Image super-resolution (SR) is the process of recovering a detailed, high-resolution (HR) image from one or more low-resolution (LR) observations of the same scene. Rapid developments in image processing and the deployment of scene recognition for visual communications have created a strong need for high-resolution images, not only to provide better visualization (fidelity) but also to extract additional detail (recognition). High-resolution images are useful for isolating regions in multi-spectral remote sensing images [62, 8, 19, 3, 60] and for assisting radiologists in making diagnostic decisions [24, 20, 29, 7, 22, 21, 23]. In video surveillance systems, higher-resolution frames allow more accurate identification of the objects and people of interest. The most direct way to obtain higher-resolution images is to reduce the pixel size on the sensor (e.g., a charge-coupled device) of an image acquisition device (e.g., a digital camera). Sensor technology, however, limits how far pixels can be shrunk: if the pixel size is too small, the quality of captured images inevitably deteriorates, because signal power decreases in proportion to the pixel size while noise power remains roughly the same. In addition, a larger chip incurs a higher cost. SR image processing is therefore an attractive alternative.
Despite having been explored for decades, image super-resolution remains a challenging task in computer vision. The problem is ill-posed because each LR image can correspond to multiple HR images that differ slightly in camera angle, color, brightness, and other variables. Moreover, the relationship between LR and HR data is fundamentally ambiguous, since two different HR images can be downscaled to yield the same LR image. In a nutshell, the HR-to-LR degradation is a many-to-one mapping, so its inverse, super-resolution, is one-to-many.
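The ambiguity described above can be made concrete with a small sketch. It assumes a simple 2×2 box-average downscale (a stand-in chosen for illustration, not a degradation model taken from any cited work), and the function and variable names are hypothetical.

```python
def downscale_2x(image):
    """Average each non-overlapping 2x2 block (a simple box-filter downscale)."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


# Two visibly different hypothetical HR images ...
hr_a = [[10, 30, 0, 0],
        [30, 10, 0, 0],
        [5, 5, 8, 8],
        [5, 5, 8, 8]]
hr_b = [[20, 20, 0, 0],
        [20, 20, 0, 0],
        [0, 10, 16, 0],
        [10, 0, 0, 16]]

# ... that collapse to exactly the same LR image.
lr_a = downscale_2x(hr_a)
lr_b = downscale_2x(hr_b)
print(lr_a == lr_b, lr_a)
```

Given only the shared LR image, no method can tell which of the two HR images produced it without additional prior knowledge, which is why learned priors play such a central role in SISR.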
Current image super-resolution methods are either single-image super-resolution (SISR) or multiple-image methods. In single-image SR, each LR-HR image pair is learned separately, while in multiple-image SR, the LR-HR pairs belonging to a scene are learned jointly so that an HR image can be generated from several images of that scene. Video super-resolution is a special form of multiple-image SR in which successive images (frames) of a scene are super-resolved using the relationships between them.
Traditional super-resolution methods include statistical, prediction-based, patch-based, and edge-based methods. More recently, advances in computational power and the availability of big data have prompted researchers to use deep learning (DL) to address SR. Over the past decade, DL-based SR studies have outperformed classical methods, and DL methods are now commonly used to achieve SR. A variety of architectures have been investigated, from the first Convolutional Neural Network (CNN) approaches to the more recent Generative Adversarial Nets (GANs).
This study first gives a brief overview of the classical SR methods and then focuses on the most recent deep learning research in SR, specifically on SISR.
II Early Days of Super-Resolution
Image Super-resolution (SR) techniques try to construct a high resolution (HR) image from one or more observed low resolution (LR) images. Due to SR’s ill-posed nature, many possible solutions exist. Concerning LR input images, SR techniques can be divided into two main groups, namely single-image super-resolution (SISR) and multiple-image super-resolution or multi-frame super-resolution. As the SISR requires only one input LR image to produce a corresponding HR image, it has attracted the attention of researchers as it is closer to everyday life settings.
Early SISR techniques can be divided into two types:
The SISR problem has been alleviated by methods that learn prior information from LR and HR pairs. Examples include neighbor embedding regression [47] and deep convolutional neural networks.
III Deep Learning Era of Image Super-Resolution
Computer vision applications have become more robust with deep learning, especially convolutional neural networks (CNNs). Although CNNs are not perfect, their performance in different computer vision applications has been reported to be outstanding [59, 53]. This section discusses recent SISR methods based on CNNs and related methods.
III-A Convolutional Neural Networks
In the SRCNN method, the input LR image is mapped to the HR image by learning an end-to-end mapping. The technique employs bicubic interpolation as a pre-processing step. It then extracts feature vectors from image patches by convolution and maps them non-linearly to find the most representative patches for reconstructing the HR image. Because SRCNN uses only convolutional layers, it accepts input images of any size, and its algorithm is not patch-based. The SRCNN model outperforms many “traditional” models.
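The three stages described above (pre-upsampling, feature extraction, non-linear mapping, reconstruction) can be sketched as a forward pass in plain Python. This is an illustrative sketch under stated assumptions, not the published SRCNN: nearest-neighbour repetition stands in for bicubic interpolation, the weights are random rather than trained, biases are omitted, and the layer sizes and all function names are hypothetical.

```python
import random


def upscale_nearest(image, scale):
    """Stand-in for bicubic pre-upsampling: repeat each pixel scale x scale times."""
    return [[image[r // scale][c // scale]
             for c in range(len(image[0]) * scale)]
            for r in range(len(image) * scale)]


def conv2d(channels, weights, relu):
    """Zero-padded 'same' convolution; weights[o][i] is a k x k kernel."""
    height, width = len(channels[0]), len(channels[0][0])
    k = len(weights[0][0])
    pad = k // 2
    out = []
    for kernels in weights:  # one output feature map per kernel group
        fmap = [[0.0] * width for _ in range(height)]
        for r in range(height):
            for c in range(width):
                total = 0.0
                for chan, kernel in zip(channels, kernels):
                    for dr in range(k):
                        for dc in range(k):
                            rr, cc = r + dr - pad, c + dc - pad
                            if 0 <= rr < height and 0 <= cc < width:
                                total += kernel[dr][dc] * chan[rr][cc]
                fmap[r][c] = max(total, 0.0) if relu else total
        out.append(fmap)
    return out


def random_weights(rng, n_out, n_in, k):
    """Untrained placeholder weights, only used to exercise the pipeline."""
    return [[[[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(k)]
             for _ in range(n_in)] for _ in range(n_out)]


def srcnn_like_forward(lr_image, scale, layers):
    x = [upscale_nearest(lr_image, scale)]      # pre-upsampling step
    x = conv2d(x, layers[0], relu=True)         # patch feature extraction
    x = conv2d(x, layers[1], relu=True)         # non-linear mapping
    return conv2d(x, layers[2], relu=False)[0]  # reconstruction


rng = random.Random(0)
layers = [random_weights(rng, 4, 1, 3),   # 1 image -> 4 feature maps
          random_weights(rng, 4, 4, 1),   # 4 -> 4 maps with 1x1 kernels
          random_weights(rng, 1, 4, 3)]   # 4 maps -> 1 output image
sr = srcnn_like_forward([[0.1, 0.5, 0.9], [0.2, 0.4, 0.6]], 2, layers)
print(len(sr), len(sr[0]))
```

Because every layer is a convolution, the same `layers` can be applied to an input of any height and width, which mirrors the any-size property noted above.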
With such a simple model, it appeared that accuracy could not be improved much further. As a result, the question arose whether “the deeper, the better” holds in SR. Following the success of very deep networks, Kim et al. proposed two new algorithms, Very Deep Convolutional Networks (VDSR) and Deeply Recursive Convolutional Networks (DRCN), both of which use 20 convolutional layers, as illustrated in Figure 2.
VDSR was trained with a very high learning rate to accelerate convergence, and gradient clipping was used to control the exploding-gradient problem.
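Gradient clipping, mentioned above as the remedy for exploding gradients at high learning rates, can be illustrated with a generic sketch. Two common variants are shown (clipping by value and clipping by global norm); which variant and which thresholds VDSR uses are not stated here, so the function names and numbers below are assumptions for illustration only.

```python
import math


def clip_by_value(grads, limit):
    """Clamp every gradient component to the interval [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def sgd_step(params, grads, learning_rate, max_norm):
    """One plain SGD update that uses the norm-clipped gradient."""
    clipped = clip_by_norm(grads, max_norm)
    return [p - learning_rate * g for p, g in zip(params, clipped)]


# A gradient of norm 50 is rescaled to norm 5 before the update is applied.
print(sgd_step([1.0, 1.0], [30.0, 40.0], learning_rate=0.1, max_norm=5.0))
```

Clipping bounds the size of each update, which is what makes an aggressive learning rate usable without the parameters diverging.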
Currently, the best “vanilla” SISR CNN results are dominated by the U-Net architecture, which uses staggered “leaky” layers to compute the HR image. This architecture is highly compressible and lightweight compared to other CNN architectures, which usually use stacked dense layers.
NTIRE (New Trends in Image Restoration and Enhancement), one of the most competitive image processing workshops, annually organizes various machine learning challenges (image restoration, downscaling, recoloring, etc.). The most recent SISR challenge was won by the SuperRior team, which used a U-Net architecture. The model is referred to as U-shaped Deep Super-Resolution (UDSR) and is illustrated in Figure 3.
In UDSR, a convolution layer extracts deep feature maps from the low-resolution input image. The feature maps are then processed by residual blocks and down-sampled to a lower resolution. To obtain high-resolution feature maps, the feature maps are upsampled and further residual blocks are applied. The left side of the U-shaped network is connected to the right side by direct paths. The final output is created from a residual image derived from the highest-resolution feature maps.
In addition, the models are trained with a cascaded approach so that the image is refined further at each stage. Each stage uses a UDSR that takes the output of the previous stage as its input. The three stages have different supervision signals, from coarse to fine. The high-resolution ground truth is first downsampled by a factor of four and then upsampled back to the original size; this 4× blurred image supervises the output of the first UDSR model. The output of the second UDSR model is supervised with the 2× blurred image. In the third stage, the ground-truth image itself supervises the UDSR model.
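The coarse-to-fine supervision signals can be sketched as follows. The sketch assumes box-average downsampling and nearest-neighbour upsampling as simple stand-ins for whatever resampling kernels the original training pipeline used, and all names are hypothetical.

```python
def box_downsample(image, factor):
    """Average non-overlapping factor x factor blocks."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    return [[sum(image[r * factor + dr][c * factor + dc]
                 for dr in range(factor) for dc in range(factor)) / factor ** 2
             for c in range(cols)]
            for r in range(rows)]


def nearest_upsample(image, factor):
    """Repeat every pixel factor x factor times."""
    return [[image[r // factor][c // factor]
             for c in range(len(image[0]) * factor)]
            for r in range(len(image) * factor)]


def blurred_target(ground_truth, factor):
    """Downsample then upsample, giving a blurred image of the original size."""
    return nearest_upsample(box_downsample(ground_truth, factor), factor)


def cascade_targets(ground_truth):
    """Supervision for stages 1, 2 and 3: 4x blurred, 2x blurred, ground truth."""
    return [blurred_target(ground_truth, 4),
            blurred_target(ground_truth, 2),
            ground_truth]


def mse(a, b):
    """Mean squared error between two images of equal size."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


ground_truth = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
for stage, target in enumerate(cascade_targets(ground_truth), start=1):
    print(stage, mse(target, ground_truth))
```

Each later target is closer to the ground truth than the previous one, so every stage is asked to add a moderate amount of detail instead of recovering everything at once.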
Finally, the results were merged using an adaptive multi-model ensemble. Different models have different characteristics, and even for a single model, performance varies a great deal across patches. These observations motivate ensembling multiple models adaptively: the fusion weights of the different models are conditioned on the outputs generated by these models, at image-patch granularity. A CNN operates on the outputs of the individual models and learns a normalized weight for each pixel of every model.
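The per-pixel fusion step can be sketched as below. In the method described above the weights are predicted by a CNN; here the unnormalized weight maps are simply passed in as a hypothetical input (`weight_logits`), and a softmax across models is assumed as the normalization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(outputs, weight_logits):
    """Blend model outputs per pixel.

    outputs[m][r][c] is the prediction of model m, and weight_logits[m][r][c]
    is the unnormalized weight assigned to model m at that pixel (in the
    described method these would come from a learned fusion CNN).
    """
    rows, cols = len(outputs[0]), len(outputs[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            weights = softmax([logits[r][c] for logits in weight_logits])
            fused[r][c] = sum(w * out[r][c] for w, out in zip(weights, outputs))
    return fused


model_a = [[0.0, 1.0]]
model_b = [[1.0, 0.0]]
equal = [[0.0, 0.0]]
print(fuse([model_a, model_b], [equal, equal]))  # plain average of the two
```

With equal logits the fusion reduces to an ordinary average, while strongly unequal logits let a single model dominate at the pixels where it is trusted most.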
III-B Adaptive Models
In addition to using CNNs for classification tasks, many researchers build SISR models that adapt more closely to the content of images (pixels or structures). One such method is the Deep Projection CNN (DPN), in which model adaptation is used to seek out repetitive structures in LR images.
Another is the pixel recursive super-resolution network, which consists of a conditioning network and a prior network. The conditioning network transforms LR images into logits that predict the likelihood of each HR pixel value, which allows multiple plausible predictions per pixel. The prior network is a PixelCNN. Models built in this way can add realistic details to images while enhancing resolution. A further model, deep joint super-resolution (DJSR), adapts the deep model to joint similarities.
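To show how logits from two networks can yield a distribution over pixel values, here is a minimal sketch. It assumes that the conditioning and prior logits are summed before a softmax, and it uses only four intensity levels for readability; both choices, along with all names, are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_distribution(conditioning_logits, prior_logits):
    """Assumption: the two networks' logits are added before the softmax."""
    return softmax([a + b for a, b in zip(conditioning_logits, prior_logits)])


def sample_pixel(probabilities, rng):
    """Draw one intensity level according to the predicted distribution."""
    levels = range(len(probabilities))
    return rng.choices(levels, weights=probabilities, k=1)[0]


# Hypothetical logits over four intensity levels for a single HR pixel.
conditioning_logits = [2.0, 0.0, 0.0, 0.0]  # the LR evidence favours level 0
prior_logits = [0.0, 0.0, 3.0, 0.0]         # the context prior favours level 2
probs = pixel_distribution(conditioning_logits, prior_logits)
value = sample_pixel(probs, random.Random(0))
print([round(p, 3) for p in probs], value)
```

Sampling from the distribution, instead of taking its mean, is what lets such a model commit to one sharp plausible detail where an averaged prediction would be blurry.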
In 2018, an adaptive residual network (ARN) was proposed for high-quality image restoration. The ARN is a deep residual network consisting of six cascaded adaptive shortcuts, convolutional layers, and PReLUs. Each adaptive shortcut contains two small convolutional layers, followed by PReLU activation layers, and one adaptive skip connection. The ARN model can be trained for the application at hand.
One of the most recent adaptive approaches applies adaptation to target generation. Its authors describe a simple and effective way to encourage sharp outputs by accepting solutions other than the one provided by the training pair. The method computes the loss against an adaptive target ỹ_i instead of comparing directly to the original target y_i. In theory, this alternative target relaxes the typical pixel reconstruction loss by allowing different HR predictions for the same LR input. The adaptive target is made from the original target so that the network prediction f(x_i) is penalized as little as possible while the original content and perceptual impression are maintained. In particular, for every small non-overlapping piece of y_i, an affine transform matrix is found, within a range of acceptable transforms, that maps it to the corresponding piece of f(x_i). Each piece is then transformed to construct the adaptive target (Figure 4). This can be done on the fly during training with minimal computational overhead. In each training iteration, the SR network is trained with the loss computed against the adapted target.
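The idea of an adaptive target can be sketched in a heavily simplified form. Instead of searching over affine transforms, the sketch below only tries integer translations of at most one pixel for each non-overlapping patch, and it samples the shifted patch from the original target with border clamping. These simplifications and all names are assumptions made for illustration, not the authors' procedure.

```python
def shifted_patch(image, top, left, size, dy, dx):
    """Read a size x size patch displaced by (dy, dx), clamping at the borders."""
    rows, cols = len(image), len(image[0])
    return [[image[min(max(top + r + dy, 0), rows - 1)]
                  [min(max(left + c + dx, 0), cols - 1)]
             for c in range(size)]
            for r in range(size)]


def squared_error(a, b):
    """Sum of squared differences between two equally sized 2D arrays."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def adaptive_target(target, prediction, size, max_shift=1):
    """Per patch, keep the allowed shift of the target closest to the prediction."""
    rows, cols = len(target), len(target[0])
    shifts = [(dy, dx)
              for dy in range(-max_shift, max_shift + 1)
              for dx in range(-max_shift, max_shift + 1)]
    adapted = [row[:] for row in target]
    for top in range(0, rows - size + 1, size):
        for left in range(0, cols - size + 1, size):
            pred_patch = [row[left:left + size]
                          for row in prediction[top:top + size]]
            best = min((shifted_patch(target, top, left, size, dy, dx)
                        for dy, dx in shifts),
                       key=lambda cand: squared_error(cand, pred_patch))
            for r in range(size):
                adapted[top + r][left:left + size] = best[r]
    return adapted


# The prediction places a vertical edge one pixel to the right of the target.
target = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
prediction = [[0.0, 0.0, 0.0, 1.0] for _ in range(4)]
adapted = adaptive_target(target, prediction, size=2)
print(squared_error(target, prediction), squared_error(adapted, prediction))
```

Because the zero shift is always among the candidates, the loss against the adapted target can never exceed the loss against the original one, so a sharp but slightly displaced edge is no longer punished as heavily as a blurry compromise.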
III-C Generative Adversarial Network Based Models
In contrast to traditional machine learning methods, generative adversarial networks (GANs) are known for their ability to preserve texture details in images, create solutions that are close to the real image, and appear perceptually convincing. Thus, GANs are also suitable for SISR. One example is the Depixelated Super-Resolution Convolutional Neural Network (DSRCNN), which is designed to super-resolve partially pixelated images. Depixelation is achieved by combining an autoencoder with depixelating layers. The autoencoder is composed of a generator and a discriminator. In another work, a GAN-based architecture using densely connected convolutional neural networks (DenseNets) is proposed for super-resolving overhead imagery by as much as 8×.
The best-known and first successful GAN-based SISR model is the Super-Resolution Generative Adversarial Network (SRGAN), in which a generative network upsamples LR images to super-resolved (SR) images and a discriminative network tries to distinguish ground-truth HR images from SR images. Pixel-wise quality assessment metrics have been criticized for correlating poorly with human perception. With the addition of an adversarial loss, GAN-based algorithms were able to produce more perceptually natural images. The GAN-based SISR model has been further developed in [11, 55] by fusing a pixel-wise loss, a perceptual loss, and a newly proposed texture transfer loss. SRFeat, proposed by Park et al., employs an additional discriminator in the feature domain. Its generator is trained in two phases: pre-training and adversarial training. During pre-training, the generator is trained to minimize the MSE loss in order to achieve a high PSNR. The adversarial training phase then aims to improve perceptual quality by using a perceptual similarity loss, a GAN loss in the pixel domain, and a GAN loss in the feature domain. A major disadvantage of GAN-based SISR methods is the difficulty of training the models.
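Two quantities from the paragraph above can be written down compactly: PSNR, which follows directly from the MSE, and a weighted sum of loss terms for the generator. PSNR is the standard definition, 10·log10(MAX² / MSE). The weighted sum is only a generic sketch; the individual loss values and weights are hypothetical inputs, not settings from any cited model.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two flat lists of pixel values."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(reference, estimate)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


def generator_loss(pixel_loss, perceptual_loss, adversarial_loss,
                   w_pixel, w_perceptual, w_adversarial):
    """Weighted sum of loss terms; the weights are tuning choices (assumed)."""
    return (w_pixel * pixel_loss
            + w_perceptual * perceptual_loss
            + w_adversarial * adversarial_loss)


print(round(psnr([0.0, 0.0], [1.0, 1.0]), 2))
print(generator_loss(1.0, 2.0, 3.0,
                     w_pixel=1.0, w_perceptual=0.5, w_adversarial=0.1))
```

Minimizing the MSE alone maximizes PSNR by construction, which is why a pre-training phase based on MSE yields high PSNR while the additional perceptual and adversarial terms trade some of that PSNR for perceived sharpness.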
III-D Sparsity Based Models
Researchers have shown that sparse coding combined with CNNs can perform better than CNNs alone [13, 52, 43, 57]. By using sparse priors, the sparse coding-based network (SCN) is more compact and accurate than SRCNN. Another model, SRCNN-Pr, also explores image priors during the training of a deep CNN; compared with other state-of-the-art methods, lower training time and better super-resolution results are reported. A hybrid wavelet convolution network (HWCN) has also been proposed. LR images are fed into a scattering convolution network (in essence a wavelet tree) to obtain scattering feature maps. Sparse codes are then extracted from these maps and used as input to a CNN. With this model, complex deep networks can be trained on a tiny dataset with better generalization.
A sparse representation-based, noise-robust super-resolution approach has also been proposed that incorporates a smoothing prior to enforce the same sparse coding coefficients in similar training patches. It combines a LASSO-based smoothness constraint with a locality-based smoothness constraint to obtain stable reconstruction weights, especially when the noise level in the input LR image is high.
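As background for the LASSO-based constraint, the sketch below solves a plain LASSO sparse-coding problem, min 0.5·||y − D·a||² + λ·||a||₁, with the iterative shrinkage-thresholding algorithm (ISTA). It deliberately omits the locality-based constraint, and ISTA is chosen here only as a well-known generic solver, not as the solver used in the cited approach; the names and the toy dictionary are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink a value toward zero by threshold (proximal map of the L1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def ista(dictionary, signal, lam, step, iterations=100):
    """Sparse codes for signal under dictionary (rows: samples, columns: atoms).

    step should not exceed 1 / L, where L is the largest eigenvalue of D^T D.
    """
    n_atoms = len(dictionary[0])
    codes = [0.0] * n_atoms
    for _ in range(iterations):
        residual = [s - sum(d * a for d, a in zip(row, codes))
                    for row, s in zip(dictionary, signal)]
        # Correlation of every atom with the residual (negative gradient).
        correlation = [sum(dictionary[i][j] * residual[i]
                           for i in range(len(signal)))
                       for j in range(n_atoms)]
        codes = [soft_threshold(a + step * g, step * lam)
                 for a, g in zip(codes, correlation)]
    return codes


identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
print(ista(identity, [3.0, -0.2, 0.5], lam=0.5, step=1.0))
```

With an identity dictionary the result is simply the soft-thresholded signal: small, noise-like coefficients are set exactly to zero, which illustrates why an L1 penalty helps stabilize reconstruction weights under noise.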
In one of the most recent sparsity-based SISR papers, a model named SSR2 is specifically tuned to extract and amplify information from extremely low-resolution images of human faces. The overall similarity of human faces provides a base output for the model to build on. Following this logic, sparsity is used to extract only the most defining features from the low-resolution face image, which are then amplified to obtain a 16× super-resolution. The base human face also helps to remove obstructions that may occur in the low-resolution image. The overall SSR2 method is shown in Figure 5. The model is designed to enhance human faces from surveillance camera footage to recognizable sizes.
In this paper, we decided to focus on the different types of single-image super-resolution models rather than on individual works by other researchers. The reason is that, although they fall under the same broad term of SISR, these models are structured and optimized for entirely different tasks. In our observations, SISR models built to work on LR images that carry less information than usual tend to use sparsity-based structures; models built for extreme super-resolution tasks are an example of this category. GAN-based SISR models, on the other hand, dominate research areas that work on images with similar attributes, such as human faces and medical imagery. This can be attributed to the tear-down, build-from-scratch nature of GAN-based models. We also observed that adaptive models, while not the most popular in the SISR world, do wonders for restoring details and LR images of subpar quality. CNN-based models dominate areas where the desired SR images do not contain many details, such as animation images and space photography.
GAN-based models were by far the most popular way of implementing SISR, owing to their ease of implementation and decent performance in every type of research area.
When it comes to the best possible performance on any type of target image, we see combinations of the model types listed above. Models that learn in this way are called ensemble models. Ensemble models achieve peak results for a given objective because they compensate for the shortcomings of the main model by introducing side models of different types.
In this paper, we took a closer look at why different types of SISR models are still trained for different image categories, and what each type is best at. In conclusion, we observed that no single type of structure is best for the ill-posed SISR problem. The objective of a project and the type of images used greatly affect a model’s performance, which is why different research topics have different predominant structures. Lately, ensemble models have been widely praised and used in the most innovative research papers on any given topic.
In the near future, barring scientific breakthroughs in deep learning, we expect SISR models to become more specialized for the type of image used and to take the form of ensemble models. Currently, for a general-purpose SISR model, GAN-based ensemble models are the main choice. We expect this to change into an ensemble in which a pre-deciding machine learning model chooses the proportions of the different SISR model outputs within the ensemble itself. This approach could greatly mitigate the shortcomings of the individual model types while remaining adaptable to any image type.
- (2019) SSR2: sparse signal recovery for single-image super-resolution on faces with extreme low resolutions. Pattern Recognition 90, pp. 308–324.
- (2005) Image up-sampling using total-variation regularization with a new observation model. IEEE Transactions on Image Processing 14 (10), pp. 1647–1659.
- (2020) CNN based spectral super-resolution of remote sensing images. Signal Processing 169, pp. 107394.
- (2018) Super-resolution for overhead imagery using DenseNets and adversarial learning. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1414–1422.
- (2019) NTIRE 2019 challenge on real image super-resolution: methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
- (2020) Self-calibrated attention neural network for real-world super resolution. In European Conference on Computer Vision, pp. 453–467.
- (2020) Super-resolution ultrasound imaging. Ultrasound in Medicine & Biology 46 (4), pp. 865–891.
- (2020) Small object detection in remote sensing images based on super-resolution with auxiliary generative adversarial networks. Remote Sensing 12 (19), pp. 3152.
- (2017) Pixel recursive super resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5439–5448.
- (2007) Soft edge smoothness prior for alpha channel super resolution. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
- (2021) D-SRGAN: DEM super-resolution with generative adversarial networks. SN Computer Science 2 (1), pp. 1–11.
- (2019) Deep residual dense U-Net for resolution enhancement in accelerated MRI acquisition. In Medical Imaging 2019: Image Processing, Vol. 10949, pp. 110–117.
- (2018) A deeply-recursive convolutional network for crowd counting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1942–1946.
- (2014) Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pp. 184–199.
- (2015) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307.
- (2002) Example-based super-resolution. IEEE Computer Graphics and Applications 22 (2), pp. 56–65.
- (2012) Image super-resolution with sparse neighbor embedding. IEEE Transactions on Image Processing 21 (7), pp. 3194–3205.
- (2016) A hybrid wavelet convolution network with sparse-coding for image super-resolution. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 1439–1443.
- (2021) Enlighten-GAN for super resolution reconstruction in mid-resolution remote sensing images. Remote Sensing 13 (6), pp. 1104.
- (2009) Super-resolution in medical imaging. The Computer Journal 52 (1), pp. 43–63.
- (2020) MedSRGAN: medical images super-resolution using generative adversarial networks. Multimedia Tools & Applications 79.
- (2020) Super-resolution using GANs for medical imaging. Procedia Computer Science 173, pp. 28–35.
- (2020) Super-resolution magnetic resonance imaging reconstruction using deep attention networks. In Medical Imaging 2020: Image Processing, Vol. 11313, pp. 113132J.
- (2015) Super resolution techniques for medical image processing. In 2015 International Conference on Technologies for Sustainable Development (ICTSD), pp. 1–6.
- (2021) Tackling the ill-posedness of super-resolution through adaptive target generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16236–16245.
- (2016) Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging 2 (2), pp. 109–122.
- (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654.
- (2016) Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1637–1645.
- (2009) Super-resolution in medical imaging: an illustrative approach through ultrasound. In 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 249–252.
- (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105.
- (1997) Face recognition: a convolutional neural-network approach. IEEE Transactions on Neural Networks 8 (1), pp. 98–113.
- (2015) Deep learning. Nature 521 (7553), pp. 436–444.
- (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690.
- (2020) S-LWSR: super lightweight super-resolution network. IEEE Transactions on Image Processing 29, pp. 8368–8380.
- (2017) Single image super resolution-when model adaptation matters. arXiv preprint arXiv:1703.10889.
- (2016) Incorporating image priors with deep convolutional neural networks for image super-resolution. Neurocomputing 194, pp. 340–347.
- (2016) Robust single image super-resolution via deep networks with sparse prior. IEEE Transactions on Image Processing 25 (7), pp. 3194–3207.
- (2020) Ensemble CNN in transform domains for image super-resolution from small data sets. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 384–391.
- (2020) Progressive multi-scale residual network for single image super-resolution. arXiv preprint arXiv:2007.09552.
- (2020) MRI super-resolution with ensemble learning and complementary priors. IEEE Transactions on Computational Imaging 6, pp. 615–624.
- (2016) Super resolution of the partial pixelated images with deep convolutional neural network. In Proceedings of the 24th ACM International Conference on Multimedia, pp. 322–326.
- (2016) Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328.
- (2014) Image super-resolution with fast approximate convolutional sparse coding. In International Conference on Neural Information Processing, pp. 250–257.
- (2020) Real image super resolution via heterogeneous model ensemble using GP-NAS. In European Conference on Computer Vision, pp. 423–436.
- (2018) SRFeat: single image super-resolution with feature discrimination. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 439–455.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- (2015) Fast and accurate image upscaling with super-resolution forests. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3791–3799.
- (2021) Proposing a novel cascade ensemble super resolution generative adversarial network (CESR-GAN) method for the reconstruction of super-resolution skin lesion images. Informatics in Medicine Unlocked, pp. 100628.
- (2017) Failures of gradient-based deep learning. In International Conference on Machine Learning, pp. 3067–3075.
- (2008) Fast image/video upsampling. ACM Transactions on Graphics (TOG) 27 (5), pp. 1–7.
- (2020) Perceptual extreme super-resolution network with receptive field block. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 440–441.
- (2017) Structure-preserving image super-resolution via contextualized multitask learning. IEEE Transactions on Multimedia 19 (12), pp. 2804–2815.
- (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- (2013) Anchored neighborhood regression for fast example-based super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1920–1927.
- (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
- (2015) Self-tuned deep super resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–8.
- (2015) Deep networks for image super-resolution with sparse prior. In Proceedings of the IEEE International Conference on Computer Vision, pp. 370–378.
- (2016) Image super-resolution: the techniques, applications, and future. Signal Processing 128, pp. 389–408.
- (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.
- (2020) Remote sensing image super-resolution via mixed high-order attention network. IEEE Transactions on Geoscience and Remote Sensing 59 (6), pp. 5183–5196.
- (2012) Single image super-resolution with non-local means and steering kernel regression. IEEE Transactions on Image Processing 21 (11), pp. 4544–4556.
- (2020) Scene-adaptive remote sensing image super-resolution using a multiscale attention network. IEEE Transactions on Geoscience and Remote Sensing 58 (7), pp. 4764–4779.
- (2020) PQA-CNN: towards perceptual quality assured single-image super-resolution in remote sensing. In 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS), pp. 1–10.
- (2018) Adaptive residual networks for high-quality image restoration. IEEE Transactions on Image Processing 27 (7), pp. 3150–3163.