‘Everything has been said before, but since nobody listens we have to keep going back and beginning all over again.’
Image super-resolution (SR) has received increasing attention from the research community in recent years. Super-resolution aims to convert a given low-resolution image with coarse details into a corresponding high-resolution image with better visual quality and refined details. Image super-resolution is also referred to by other names such as image scaling, interpolation, upsampling, zooming, and enlargement. The process of generating a raster image with higher resolution can be performed using a single image or multiple images. This exposition mainly focuses on single image super-resolution (SISR), both due to its challenging nature and because multi-image SR builds directly on SISR.
High-resolution images provide improved reconstruction of scene details and constituent objects, which is critical for many devices such as large computer displays, HD television sets, and hand-held devices (mobile phones, tablets, cameras, etc.). Furthermore, super-resolution has important applications in many other domains, e.g., object detection in scenes (particularly of small objects), face recognition in surveillance videos, medical imaging, improving the interpretation of images in remote sensing, astronomical imaging, and forensics.
Super-resolution is a classical problem that is still considered a challenging and open research problem in computer vision, for several reasons. Firstly, SR is an ill-posed inverse problem, i.e., an under-determined case: instead of a single unique solution, multiple valid solutions exist for the same low-resolution image. To constrain the solution space, reliable prior information is typically required. Secondly, the complexity of the problem increases with the up-scaling factor. At higher factors, the recovery of missing scene details becomes even harder, and consequently it often leads to the reproduction of wrong information. Furthermore, assessment of output quality is not straightforward, i.e., quantitative metrics (e.g., PSNR, SSIM) only loosely correlate with human perception.
Super-resolution methods can be broadly divided into two main categories: traditional and deep learning methods. Classical algorithms have been around for decades now, but are out-performed by their deep learning based counterparts. Therefore, most recent algorithms rely on data-driven deep learning models to reconstruct the required details for accurate super-resolution. Deep learning is a branch of machine learning that aims to automatically learn the relationship between input and output directly from the data. Alongside SR, deep learning algorithms have shown promising results on other sub-fields of Artificial Intelligence such as object classification and detection [11, 12], image processing [13, 14], and audio signal processing . For these reasons, in this survey, we mainly focus on deep learning algorithms for SR and only provide a brief background on traditional approaches (Section 2).
Our Contributions: In this exposition, our focus is on deep neural networks for single (natural) image super-resolution. Our contribution is five-fold. 1) We provide a thorough review of the recent techniques for image super-resolution. 2) We introduce a new taxonomy of SR algorithms based on their structural differences. 3) A comprehensive analysis is performed based on the number of parameters, algorithm settings, training details, and the important architectural innovations that lead to significant performance improvements. 4) We provide a systematic evaluation of algorithms on six publicly available datasets for SISR. 5) We discuss the challenges and provide insights into possible future directions.
Let a Low-Resolution (LR) image be denoted by $y$ and the corresponding high-resolution (HR) image by $x$; then the degradation process is given as:

$$ y = \Phi(x; \theta_\eta), \quad (1) $$

where $\Phi$ is the degradation function and $\theta_\eta$ denotes the degradation parameters (such as the scaling factor, noise, etc.). In a real-world scenario, only $y$ is available, while no information is known about the degradation process or the degradation parameters $\theta_\eta$. Super-resolution seeks to nullify the degradation effect and recover an approximation $\hat{x}$ of the ground-truth image as:

$$ \hat{x} = \Phi^{-1}(y; \theta_\varsigma), \quad (2) $$

where $\theta_\varsigma$ are the parameters for the inverse function $\Phi^{-1}$. The degradation process is unknown and can be quite complex. It can be affected by several factors such as noise (sensor and speckle), compression, blur (defocus and motion), and other artifacts. Therefore, most research works prefer the following degradation model over that of Eq. 1:

$$ y = (x \otimes k)\downarrow_s + n, \quad (3) $$

where $k$ is the blurring kernel, $\otimes$ is the convolution operation between the HR image and the blur kernel, and $\downarrow_s$ is a downsampling operation with a scaling factor $s$. The variable $n$ denotes additive white Gaussian noise (AWGN) with a standard deviation of $\sigma$ (noise level). In image super-resolution, the aim is to minimize the data fidelity term associated with the model in Eq. 3, as:

$$ \hat{x} = \operatorname*{argmin}_{x} \; \lVert y - (x \otimes k)\downarrow_s \rVert_2^2 + \alpha\,\psi(x), \quad (4) $$

where $\alpha$ is the balancing factor between the data fidelity term and the image prior $\psi(x)$. According to Yang et al. , based on the image prior, super-resolution methods can be roughly categorized into: prediction methods , edge-based methods , statistical methods , patch-based methods [20, 21, 22], and deep learning methods . In this article, our focus is on the methods which employ deep neural networks to learn the prior.
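To make the observation model concrete, the following minimal NumPy sketch implements the blur, decimate, and additive-noise steps described above. The Gaussian kernel width, scale factor, and noise level are illustrative choices, not those of any specific paper:

```python
import numpy as np

def gaussian_kernel(size=7, sigma=1.5):
    """Normalized 2D Gaussian blur kernel (illustrative choice of k)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def degrade(x, k, scale=2, noise_sigma=0.01, rng=None):
    """Apply y = (x blurred by k), downsampled by `scale`, plus AWGN."""
    rng = rng or np.random.default_rng(0)
    pad = k.shape[0] // 2
    xp = np.pad(x, pad, mode="edge")
    blurred = np.zeros_like(x)
    H, W = x.shape
    for i in range(H):               # blur: slide kernel over the HR image
        for j in range(W):
            blurred[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    y = blurred[::scale, ::scale]    # downsample by the scaling factor
    y = y + rng.normal(0, noise_sigma, y.shape)  # additive white Gaussian noise
    return y

x = np.random.default_rng(1).random((32, 32))   # toy HR image
y = degrade(x, gaussian_kernel(), scale=2)
print(y.shape)  # (16, 16)
```

Recovering $x$ from such a $y$ is exactly the under-determined inversion that the image prior must regularize.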
3 Single Image Super-resolution
The SISR problem has been extensively studied in the literature using a variety of deep learning based techniques. We categorize existing methods into nine groups according to the most distinctive features in their model designs. The overall taxonomy used in this article is shown in Figure 1. Among these, we begin the discussion with the earliest and simplest network designs, called linear networks.
3.1 Linear networks
Linear networks have a simple structure consisting of only a single path for signal flow, without any skip connections or multiple branches. In such network designs, several convolution layers are stacked on top of each other and the input flows sequentially from the initial to the later layers. Linear networks differ in the way the up-sampling operation is performed, i.e., early upsampling or late upsampling. Note that some linear networks learn to reproduce the residual image, i.e., the difference between the LR and HR images [24, 25, 26]. Since the network architecture is linear in such cases, we categorize them as linear networks, as opposed to residual networks that have skip connections in their design (Sec. 3.2). We elaborate notable linear network designs in these two sub-categories below.
3.1.1 Early Upsampling Designs
The early upsampling designs are linear networks that first upsample the LR input to match the desired HR output size and then learn hierarchical feature representations to generate the output. A common upsampling operation used for this purpose is bicubic interpolation, which is computationally expensive. A seminal work based on this pipeline is SRCNN, which we explain next.
SRCNN: SRCNN  is the first successful attempt at using only convolutional layers for super-resolution. This effort can rightfully be considered the pioneering work in deep learning based SR that inspired several later attempts in this direction. The SRCNN structure is straightforward: it consists only of convolutional layers, where each layer (except the last one) is followed by a rectified linear unit (ReLU) non-linearity. There are a total of three convolutional and two ReLU layers, stacked together linearly. Although the layers are all of the same type (i.e., convolution layers), the authors named them according to their functionality. The first convolutional layer is termed patch extraction (or feature extraction) and creates feature maps from the input image. The second convolutional layer is called non-linear mapping and converts the feature maps into high-dimensional feature vectors. The last convolutional layer aggregates the feature maps to output the final high-resolution image. The structure of SRCNN is shown in Figure 2.
The training dataset is synthesized by extracting non-overlapping dense patches of size 32×32 from the HR images. The LR input patches are first downsampled and then upsampled using bicubic interpolation so that they have the same size as the high-resolution output image. SRCNN is an end-to-end trainable network and minimizes the difference between the reconstructed high-resolution outputs and the ground-truth high-resolution images using the Mean Squared Error (MSE) loss function.
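The three-layer pipeline described above can be sketched in a few lines of PyTorch. The 9-1-5 filter sizes and the 64/32 channel widths below follow a commonly reported SRCNN configuration and should be treated as illustrative:

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Three stacked conv layers: patch extraction -> non-linear mapping
    -> reconstruction. Filter sizes 9-1-5 and widths 64/32 are the
    commonly used configuration, assumed here for illustration."""
    def __init__(self, channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, 9, padding=4),  # patch/feature extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, 5, padding=2),  # reconstruction
        )

    def forward(self, x):  # x: bicubic-upsampled LR image (early upsampling)
        return self.body(x)

net = SRCNN()
out = net(torch.randn(1, 1, 32, 32))
print(out.shape)  # torch.Size([1, 1, 32, 32])
```

Note that the input is already at the HR size, so every convolution operates in the expensive high-resolution space, which is precisely what later late-upsampling designs avoid.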
VDSR: Unlike the shallow network architectures used in SRCNN  and FSRCNN , Very Deep Super-Resolution  (VDSR) is based on a deep CNN architecture originally proposed in . This architecture is popularly known as the VGG-net and uses fixed-size convolutions (3×3) in all network layers. To avoid the slow convergence of deep networks (specifically, with 20 weight layers), they propose two effective strategies. Firstly, instead of directly generating an HR image, they learn a residual mapping that generates the difference between the HR and LR images. This provides an easier objective, and the network focuses only on high-frequency information. Secondly, gradients are clipped to a predefined range, which allows very high learning rates to speed up the training process. Their results support the argument that deeper networks can provide better contextualization and learn generalizable representations that can be used for multi-scale super-resolution.
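The two training strategies, residual learning and gradient clipping, can be sketched in one training step. The depth, learning rate, and clip threshold below are illustrative values, not the paper's exact settings:

```python
import torch
import torch.nn as nn

def vdsr_like(depth=20, width=64):
    """VGG-style stack of fixed-size 3x3 convolutions (illustrative widths)."""
    layers = [nn.Conv2d(1, width, 3, padding=1), nn.ReLU(True)]
    for _ in range(depth - 2):
        layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(True)]
    layers += [nn.Conv2d(width, 1, 3, padding=1)]
    return nn.Sequential(*layers)

net = vdsr_like()
opt = torch.optim.SGD(net.parameters(), lr=0.1)  # high LR enabled by clipping
lr_up = torch.randn(4, 1, 41, 41)  # bicubic-upsampled LR batch
hr = torch.randn(4, 1, 41, 41)     # ground-truth HR batch

residual = net(lr_up)              # predict only the high-frequency residual
loss = nn.functional.mse_loss(lr_up + residual, hr)  # HR = LR + residual
loss.backward()
torch.nn.utils.clip_grad_value_(net.parameters(), clip_value=0.4)  # clipping
opt.step()
print(residual.shape)
```

Because the easy low-frequency content is carried by the identity term `lr_up`, the network's output distribution stays near zero, which is what makes the deep stack trainable at high learning rates.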
DnCNN: DnCNN  learns to predict the high-frequency residual directly instead of the latent super-resolved image, where the residual image is the difference between the LR and HR images. The architecture of DnCNN is very simple and similar to SRCNN, as it only stacks convolutional, batch normalization, and ReLU layers. The architecture of DnCNN is shown in Figure 2.
Although both models report favorable results, their performance depends heavily on the accuracy of noise estimation, without knowledge of the underlying structures and textures present in the image. Besides, they are computationally expensive due to the batch normalization operations after every convolutional layer.
IRCNN: Image Restoration CNN (IRCNN)  proposes a set of CNN based denoisers that can be jointly used for several low-level vision tasks such as image denoising, deblurring, and super-resolution. This technique aims to combine high-performing discriminative CNN networks with model-based optimization approaches to achieve better generalizability across image restoration tasks. Specifically, the Half Quadratic Splitting (HQS) technique is used to uncouple the regularization and fidelity terms in the observation model. Afterwards, a denoising prior is discriminatively learned using a CNN due to its superior modeling capacity and test-time efficiency. The CNN denoiser is composed of a stack of 7 dilated convolution layers interleaved with batch normalization and ReLU non-linearity layers. The dilation operation helps in modeling a larger context by enclosing a bigger receptive field. To speed up the learning process, residual image learning is performed in a manner similar to previous architectures such as VDSR , DRCN  and DRRN . The authors also propose to use small-sized training samples along with zero-padding to avoid boundary artifacts due to the convolution operation.
A set of 25 denoisers is trained over the noise-level range [0, 50]; these are collectively used for image restoration tasks. The proposed unified approach provides strong performance simultaneously on image denoising, deblurring, and super-resolution.
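The effect of dilation on context size can be checked with simple receptive-field arithmetic. The helper below is generic; the 1-2-3-4-3-2-1 dilation schedule is the one commonly reported for IRCNN's 7-layer denoiser, assumed here for illustration:

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of stacked stride-1 convolutions: each layer with
    dilation d adds (kernel - 1) * d pixels of context."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

print(receptive_field([1, 1, 1, 1, 1, 1, 1]))  # plain 7-layer 3x3 stack: 15
print(receptive_field([1, 2, 3, 4, 3, 2, 1]))  # dilated 7-layer stack: 33
```

The dilated stack more than doubles the receptive field with the same parameter count, which is the "bigger receptive field at no extra cost" argument made above.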
3.1.2 Late Upsampling Designs
As we saw in the previous examples, linear networks generally perform early upsampling on the input images. This operation can be computationally expensive, since the later network structure grows in proportion to the larger input size. To address this problem, post-upsampling networks perform learning on the low-resolution inputs and then upsample the features near the output of the network. This strategy results in efficient approaches with a low memory footprint. We discuss such designs in the following.
FSRCNN: The Fast Super-Resolution Convolutional Neural Network (FSRCNN)  improves both speed and quality over SRCNN . The aim is to bring the computation rate to real-time (24 fps), compared to SRCNN's 1.3 fps. FSRCNN  also has a simple architecture, consisting of four convolution layers and one deconvolution layer. The architecture of FSRCNN  is shown in Figure 2.
Although the first four layers all implement convolution operations, FSRCNN  names each layer according to its function, i.e., feature extraction, shrinking, non-linear mapping, and expansion layers. The feature extraction step is similar to SRCNN ; the only difference lies in the input size and the filter size. The input to SRCNN  is an upsampled bicubic patch, while the input to FSRCNN  is the original patch without upsampling. The second convolution layer is named the shrinking layer due to its ability to reduce the feature dimensions (number of parameters) by adopting a smaller filter size (i.e., f=1), increasing computational efficiency. Next, a convolutional layer acts as the non-linear mapping step; according to the authors, this is a critical step in both SRCNN  and FSRCNN , as it helps in learning non-linear mappings and consequently has a strong influence on performance. Through experimentation, the filter size in the non-linear mapping layer is set to three, while the number of channels is kept the same as in the previous layer. The last convolutional layer, termed expanding, is the inverse of the shrinking step and increases the number of dimensions; this layer improves performance by 0.3 dB.
The final part of the network is a deconvolution layer that upsamples and aggregates the features; deconvolution is the inverse process of convolution. In a convolution operation with a stride, the output is 1/stride the size of the input. The role of the filter is exactly opposite in a deconvolutional layer, where the stride acts as an upscaling factor. Another subtle difference from SRCNN is the use of the Parametric Rectified Linear Unit (PReLU)  instead of the Rectified Linear Unit (ReLU) after each convolutional layer.
FSRCNN employs the same cost function as SRCNN , i.e., mean squared error. For training, the authors used the 91-image dataset  along with another 100 images collected from the internet. Data augmentation (rotation, flipping, and scaling) is also employed to increase the number of images by 19 times.
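A minimal PyTorch sketch of this late-upsampling pipeline is given below. The channel widths d=56 and s=12 follow commonly reported FSRCNN settings, and the layer counts are simplified; treat the whole block as illustrative:

```python
import torch
import torch.nn as nn

class FSRCNNLike(nn.Module):
    """Feature extraction -> 1x1 shrinking -> non-linear mapping ->
    1x1 expanding -> transposed convolution whose stride upscales."""
    def __init__(self, scale=2, d=56, s=12):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, d, 5, padding=2), nn.PReLU(d),   # feature extraction
            nn.Conv2d(d, s, 1), nn.PReLU(s),              # shrinking (f=1)
            nn.Conv2d(s, s, 3, padding=1), nn.PReLU(s),   # non-linear mapping
            nn.Conv2d(s, d, 1), nn.PReLU(d),              # expanding
            # deconvolution: the stride acts as the upscaling factor
            nn.ConvTranspose2d(d, 1, 9, stride=scale,
                               padding=4, output_padding=scale - 1),
        )

    def forward(self, x):  # x: original LR patch, no pre-upsampling
        return self.body(x)

out = FSRCNNLike(scale=2)(torch.randn(1, 1, 16, 16))
print(out.shape)  # torch.Size([1, 1, 32, 32])
```

All convolutions here run on the 16×16 LR grid; only the final transposed convolution touches the HR resolution, which is where the speed advantage over SRCNN comes from.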
ESPCN: The efficient sub-pixel convolutional neural network (ESPCN)  is a fast SR approach that can operate in real-time on both images and videos. As discussed above, traditional SR techniques first map the LR image to a higher resolution, usually with bicubic interpolation, and subsequently learn the SR model in the higher-dimensional space. The ESPCN authors note that this pipeline incurs much higher computational requirements and instead propose to perform feature extraction in the LR space. After the features are extracted, ESPCN uses a sub-pixel convolution layer at the very end to aggregate the LR feature maps and simultaneously project them to the high-dimensional space to reconstruct the HR image. Feature processing in the LR space significantly reduces the memory and computational requirements.
The sub-pixel convolution operation used in this work is essentially similar to a transposed convolution (deconvolution) operation , where a fractional kernel stride is used to increase the spatial resolution of the input feature maps. A separate upscaling kernel is used for each feature map, which provides more flexibility in modeling the LR to HR mapping. An $\ell_2$ (MSE) loss is used to train the overall network. ESPCN provides competitive SR performance with efficiency as high as real-time processing of 1080p videos on a single GPU.
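The rearrangement performed by the sub-pixel layer can be written directly as a reshape-and-transpose. The NumPy sketch below mirrors the semantics of the usual pixel-shuffle operator:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) feature tensor into a (C, H*r, W*r)
    image: each group of r^2 channels fills an r x r block of output
    pixels, so all heavy computation can stay in the LR space."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # split channels into r x r sub-pixel offsets
    x = x.transpose(0, 3, 1, 4, 2)  # interleave: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

feat = np.arange(4 * 3 * 3).reshape(4, 3, 3)  # C=1, r=2, 3x3 LR feature maps
hr = pixel_shuffle(feat, r=2)
print(hr.shape)  # (1, 6, 6)
```

Each output pixel at position (h·r+i, w·r+j) is read from channel i·r+j at LR position (h, w), so the four 3×3 maps above tile a single 6×6 output.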
3.2 Residual Networks
In contrast to linear networks, residual learning uses skip connections in the network design to avoid vanishing gradients and makes it feasible to design very deep networks. Its significance was first demonstrated for the image classification problem . Recently, several networks [37, 38] have boosted SR performance using residual learning. In this approach, algorithms learn the residual, i.e., the high-frequency difference between the input and the ground truth. Based on the number of stages used in such networks, we categorize existing residual learning approaches into single-stage [37, 38] and multi-stage networks [39, 40, 41].
3.2.1 Single-stage Residual Nets
EDSR: The Enhanced Deep Super-Resolution (EDSR) network  modifies the ResNet architecture , originally proposed for image classification, to work with the SR task. Specifically, the authors demonstrate substantial improvements by removing the Batch Normalization layers (from each residual block) and the ReLU activation (outside the residual blocks). Similar to VDSR, they also extend their single-scale approach to work on multiple scales. Their proposed Multi-scale Deep SR (MDSR) architecture, however, reduces the number of parameters by sharing the majority of them across scales: scale-specific layers are applied in parallel only close to the input and output blocks to learn scale-dependent representations. The proposed deep architectures are trained using the $\ell_1$ loss. Data augmentation (rotations and flips) is used to create a 'self-ensemble', i.e., transformed inputs are passed through the network, reverse-transformed, and averaged together to create a single output. The authors note that such a self-ensemble scheme does not require learning multiple separate models, yet yields a gain comparable to conventional ensemble-based models. EDSR and MDSR achieve better performance, in terms of quantitative measures (e.g., PSNR), compared to older architectures such as SRCNN, VDSR and other closely related ResNet-based architectures (e.g., SRGAN ).
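A sketch of the modified residual block, with the batch normalization removed and no activation on the skip path, is shown below. The residual scaling factor of 0.1 is the stabilization trick reported for very wide EDSR models; the width here is illustrative:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style block: conv-ReLU-conv plus identity skip.
    No batch normalization, no ReLU outside the block."""
    def __init__(self, width=64, res_scale=0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(width, width, 3, padding=1)
        self.conv2 = nn.Conv2d(width, width, 3, padding=1)
        self.res_scale = res_scale  # scales the residual branch for stability

    def forward(self, x):
        r = self.conv2(torch.relu(self.conv1(x)))
        return x + self.res_scale * r  # identity skip, unactivated

y = ResBlock()(torch.randn(2, 64, 24, 24))
print(y.shape)
```

Dropping the normalization layers both saves memory and avoids re-centering the feature statistics, which the authors found harmful for the regression-style SR objective.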
CARN: The cascading residual network (CARN)  employs ResNet blocks  to learn the relationship between the low-resolution input and high-resolution output. The difference from earlier models is the presence of local and global cascading modules: features from intermediate layers are cascaded and converged onto a 1×1 convolutional layer. The local cascading connections are identical to the global ones, except that the blocks are simple residual blocks. This strategy makes information propagation efficient due to multi-level representations and the many shortcut connections. The architecture of CARN is shown in Figure 2.
3.2.2 Multi-Stage Residual Nets
A multi-stage design is composed of multiple subnets that are generally trained in succession [39, 40]. The first subnet usually predicts coarse features, while the other subnets improve these initial predictions. Here, we also include encoder-decoder designs that first downsample the input using an encoder and then perform upsampling via a decoder (hence two distinct stages). The following architectures super-resolve the image in multiple stages.
FormResNet: FormResNet  builds upon DnCNN, as shown in Figure 2. This model is composed of two networks, both similar to DnCNN; the difference lies in the loss layers. The first network, termed the "Formatting layer", combines Euclidean and perceptual losses. Classical algorithms such as BM3D can also replace this formatting layer. The second deep network, "DiffResNet", is similar to DnCNN, and its input is fed from the first network. The formatting layer removes high-frequency corruption in uniform areas, while DiffResNet learns the structured regions. FormResNet improves upon the results of DnCNN by a small margin.
BTSRN: BTSRN stands for balanced two-stage residual network  for image super-resolution. The network is composed of a low-resolution stage and a high-resolution stage. In the low-resolution stage, the feature maps have a smaller size, the same as the input patch. These feature maps are upsampled using a deconvolution followed by nearest-neighbor upsampling, and then fed into the high-resolution stage. In both the low-resolution and high-resolution stages, a variant of the residual block  called the projected convolution is employed. The residual block consists of a 1×1 convolutional layer acting as a feature-map projection to decrease the input size of the 3×3 convolutional features. The LR stage has six residual blocks, while the HR stage consists of four residual blocks.
As a competitor in the NTIRE 2017 challenge , the model is trained on 900 images from the DIV2K dataset  (800 training and 100 validation images combined). During training, the images are cropped into 108×108 patches and augmented using flipping and rotation operations. The initial learning rate was set to 0.001 and is exponentially decreased by a factor of 0.6 after each iteration. Optimization is performed using Adam . The residual block takes 128 feature maps as input and outputs 64. The distance between the predicted output and the ground truth is minimized as the training objective.
REDNet: Recently, following the success of UNet ,  proposed a super-resolution algorithm using an encoder (based on convolutional layers) and a decoder (based on deconvolutional layers). REDNet  stands for Residual Encoder-Decoder Network and is mainly composed of convolutional and symmetric deconvolutional layers. A rectification layer (ReLU) follows each convolutional and deconvolutional layer. The convolutional layers extract feature maps while preserving object structures and removing degradations, whereas the deconvolutional layers reconstruct the missing details of the images. Furthermore, skip connections are added between each convolutional layer and its symmetric deconvolutional layer: the feature maps of the convolutional layer are summed with the output of the mirrored deconvolutional layer before applying the non-linear rectification. The input to the network is a bicubic-interpolated image, and the output of the final deconvolutional layer is the high-resolution image. The proposed network is end-to-end trainable, and convergence is achieved by minimizing the $\ell_2$-norm between the output of the system and the ground truth. The architecture of REDNet  is shown in Figure 2.
The authors propose three variants of the REDNet architecture, where the overall structure remains the same but the number of convolutional and deconvolutional layers changes. The best-performing architecture has 30 weight layers, each with 64 feature maps. Furthermore, the luminance channel of the Berkeley Segmentation Dataset (BSD)  is used to generate the training image set: patches of size 50×50 are extracted with a regular stride as the ground truth, and the input patches are formed from the ground truth by downsampling the patches and then upsampling them to the original size using bicubic interpolation.
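A toy version of this symmetric encoder-decoder wiring is sketched below, with only three convolution/deconvolution pairs rather than REDNet's 30 weight layers. The summation-before-ReLU skip follows the description above; widths and depth are illustrative:

```python
import torch
import torch.nn as nn

class REDLike(nn.Module):
    """Convolutional encoder, symmetric deconvolutional decoder, and
    skip connections that SUM encoder features into the mirrored
    decoder layer before the non-linear rectification."""
    def __init__(self, width=64, depth=3):
        super().__init__()
        self.enc = nn.ModuleList(
            [nn.Conv2d(1 if i == 0 else width, width, 3, padding=1)
             for i in range(depth)])
        self.dec = nn.ModuleList(
            [nn.ConvTranspose2d(width, width if i < depth - 1 else 1,
                                3, padding=1)
             for i in range(depth)])

    def forward(self, x):  # x: bicubic-interpolated input image
        skips = []
        for conv in self.enc:
            x = torch.relu(conv(x))
            skips.append(x)
        for i, deconv in enumerate(self.dec):
            x = deconv(x)
            if i < len(self.dec) - 1:
                # symmetric skip: add the mirrored encoder feature map
                x = torch.relu(x + skips[len(skips) - 2 - i])
        return x

out = REDLike()(torch.randn(1, 1, 32, 32))
print(out.shape)
```

The summed skips both carry image detail past the bottleneck and shorten the gradient path, which is what makes the 30-layer variant trainable.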
The network is trained by extracting patches from 91 images  and employing the mean squared error (MSE) loss function. The input and output patch sizes are 9×9 and 5×5, respectively. The patches are normalized by their means and variances, which are later added back to the corresponding restored high-resolution outputs. Furthermore, the kernel has a size of 5×5 with 128 feature channels.
3.3 Recursive networks
As the name indicates, recursive networks [31, 32, 48] either employ recursively connected convolutional layers or recursively linked units. The main motivation behind these designs is to progressively break the harder SR problem into a set of simpler ones that are easier to solve. The basic architecture is shown in Figure 2, and we provide further details of recursive models in the following sections.
DRCN: As the name indicates, the Deep Recursive Convolutional Network (DRCN)  applies the same convolution layers multiple times. An advantage of this technique is that the number of parameters remains constant regardless of the number of recursions. DRCN  is composed of three smaller networks, termed the embedding net, inference net, and reconstruction net.
The first sub-network, called the embedding net, converts the input (either a grayscale or color image) into feature maps. The subsequent sub-network, known as the inference net, performs super-resolution: it analyzes image regions by recursively applying a single layer consisting of convolution and ReLU, and the size of the receptive field increases after each recursion. The output of the inference net is a set of high-resolution feature maps, which are transformed into the final grayscale or color image by the reconstruction net.
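The key property of the recursively applied layer, a constant parameter count regardless of the number of recursions, can be verified with a small sketch (the width and recursion count below are illustrative):

```python
import torch
import torch.nn as nn

class RecursiveBody(nn.Module):
    """DRCN-style inference net: the SAME conv layer is applied T
    times, so unrolling more recursions grows the receptive field
    without adding any parameters."""
    def __init__(self, width=64, recursions=5):
        super().__init__()
        self.shared = nn.Conv2d(width, width, 3, padding=1)  # one weight set
        self.recursions = recursions

    def forward(self, x):
        for _ in range(self.recursions):
            x = torch.relu(self.shared(x))  # same layer reused each step
        return x

body = RecursiveBody()
n_params = sum(p.numel() for p in body.parameters())
print(n_params)  # one 3x3, 64->64 conv (+bias): 36928, for ANY recursion depth
```

Changing `recursions` from 5 to 50 leaves `n_params` untouched, which is exactly the memory argument made for recursive designs.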
DRRN: The Deep Recursive Residual Network (DRRN)  proposes a deep CNN model with conservative parametric complexity. Compared to previous models such as VDSR , REDNet  and DRCN , this model introduces an even deeper architecture with as many as 52 convolutional layers. At the same time, it reduces the network complexity by factors of 14, 6 and 2 compared to REDNet, DRCN and VDSR, respectively. This is achieved by combining residual image learning  with local identity connections between small blocks of layers within the network. The authors stress that such parallel information flow enables stable training of deeper architectures.
Since parameters are shared between the replications, the memory cost and computational complexity are significantly reduced. The final architecture is obtained by stacking multiple recursive blocks. The standard SGD optimizer with gradient clipping is used for parameter learning, and the loss layer is based on the MSE loss, similar to other popular architectures. The proposed architecture reports a consistent improvement over previous methods, which supports the case for deeper recursive architectures and residual learning.
MemNet: A novel persistent memory network for image super-resolution (abbreviated as MemNet) is presented by Tai et al. . MemNet can be broken down into three parts, similar to SRCNN . The first part, called the feature extraction block, extracts features from the input image; this part is consistent with earlier designs such as [27, 28, 35]. The second part consists of a series of memory blocks stacked together and plays the most crucial role in this network. The memory block, as shown in Figure 2, consists of a recursive unit and a gate unit. The recursive unit is similar to ResNet  and is composed of two convolutional layers with a pre-activation mechanism and dense connections to the gate unit. Each gate unit is a convolutional layer with a 1×1 kernel size.
The MSE loss function is adopted by MemNet . The experimental settings are the same as for VDSR , using 200 images from BSD  and 91 images from Yang et al. . The network consists of six memory blocks with six recursions each; the total number of layers in MemNet is 80. MemNet is also employed for other image restoration tasks, such as image denoising and JPEG deblocking, where it shows promising results.
3.4 Progressive reconstruction designs
Typically, CNN algorithms predict the output in one step; however, this may not be feasible for large scaling factors. To deal with large factors, some algorithms [50, 51] predict the output in multiple steps, i.e., a 2× output followed by 4× and so on. Here, we introduce such algorithms.
SCN: Wang et al.  propose a scheme which consolidates the merits of sparse coding  with the domain knowledge of deep neural networks, aiming for a compact model and improved performance. The proposed sparse coding-based network (SCN)  mimics a Learned Iterative Shrinkage and Thresholding Algorithm (LISTA) network to build a multi-layer neural network.
Similar to SRCNN , the first convolutional layer extracts features from the low-resolution patches, which are then fed into a LISTA network. To obtain the sparse code for each feature, the LISTA network consists of a finite number of recurrent stages. Each LISTA stage is composed of two linear layers and a non-linear layer with an activation threshold that is learned/updated during training. To simplify training, the authors decompose the non-linear neuron into two linear scaling layers and a unit-threshold neuron. The two scaling layers are diagonal matrices that are reciprocal to each other, e.g., if a multiplicative scaling layer precedes the threshold unit, a division follows it. After the LISTA network, the original high-resolution patches are reconstructed by multiplying the sparse codes with the high-resolution dictionary in the subsequent linear layer. As a final step, again using a linear layer, the high-resolution patches are placed at their original locations in the image to obtain the high-resolution output.
LapSRN: The deep Laplacian pyramid super-resolution network (LapSRN)  employs a pyramidal framework. LapSRN consists of three sub-networks that progressively predict residual images up to a factor of 8×. The residual images of each sub-network are added to the upsampled input to obtain the SR images: the output of the first sub-network is the 2× residual, the second sub-network provides the 4× residual, and the last one gives the 8× residual image. These residual images are added to the correspondingly scaled, upsampled images to obtain the final super-resolved outputs. The authors term the residual prediction branch feature extraction, while the addition of the bicubic images to the residuals is called the image reconstruction branch. Figure 2 shows the LapSRN network, which consists of three types of elements, i.e., convolutional layers, leaky ReLUs, and deconvolutional layers. Following CNN convention, each convolutional layer precedes a leaky ReLU (allowing a negative slope of 0.2), with a deconvolutional layer at the end of each sub-network to increase the size of the residual image to the corresponding scale.
LapSRN uses a differentiable variant of the $\ell_1$ loss function known as the Charbonnier loss, which can handle outliers. The loss is employed at every sub-network, resembling a multi-loss structure. Furthermore, the filter sizes for the convolutional and deconvolutional layers are 3×3 and 4×4, respectively, with 64 channels each. The training data is similar to SRCNN , i.e., 91 images from Yang et al.  and 200 images from the BSD dataset .
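For reference, the Charbonnier penalty is simply a smoothed absolute difference. A minimal sketch follows, with eps = 1e-3 as a commonly used value (an assumption here, not necessarily the paper's setting):

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier loss: sqrt(diff^2 + eps^2), a differentiable variant
    of the l1 penalty that stays smooth at zero yet grows linearly for
    large errors, making it robust to outliers."""
    diff = pred - target
    return np.mean(np.sqrt(diff * diff + eps * eps))

a = np.zeros(4)
b = np.array([0.0, 1.0, -1.0, 2.0])
print(round(charbonnier(a, b), 4))  # 1.0003
```

For large |diff| the value approaches |diff| (l1-like), while near zero it behaves quadratically, which avoids the over-smoothed outputs that a pure l2 loss tends to produce.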
The LapSRN model uses three distinct models to perform 2×, 4× and 8× SR. The authors also propose a single model, termed Multi-scale (MS) LapSRN, that jointly learns to handle multiple SR scales . Interestingly, this single MS-LapSRN model outperforms the results obtained from the three distinct models. One explanation for this effect is that the single model leverages common inter-scale traits that help achieve more accurate results.
3.5 Densely Connected Networks
Inspired by the success of the DenseNet  architecture for image classification, super-resolution algorithms based on densely connected CNN layers have been proposed to improve performance. The main motivation in such a design is to combine hierarchical cues available along the network depth to achieve high flexibility and richer feature representations. We discuss some popular designs in this category below.
SR-DenseNet: In SR-DenseNet , following the DenseNet design, a layer directly operates on the output of all previous layers. Such an information flow from low- to high-level feature layers avoids the vanishing gradient problem, enables learning compact models, and speeds up the training process. Towards the rear of the network, SR-DenseNet uses a pair of deconvolution layers to upscale the inputs. The authors propose three variants of SR-DenseNet: (1) a sequential arrangement of dense blocks followed by deconvolution layers, so that only high-level features are used for reconstructing the final SR image; (2) low-level features from the initial layers are combined before the final reconstruction, using a skip connection to join low- and high-level features; (3) all features are combined using multiple skip connections between the low-level features and the dense blocks, allowing a direct flow of information for better HR reconstruction. Since complementary features are encoded at multiple stages in the network, the combination of all feature maps gives the best performance among the variants. The MSE loss is used to train the full model. Overall, SR-DenseNet models demonstrate a consistent improvement over models that do not use dense connections between layers.
As the name implies, Residual Dense Network 
(RDN) combines residual skip connections (inspired by SR-ResNet) with dense connections (inspired by SR-DenseNet). The main motivation is that hierarchical feature representations should be fully used to learn local patterns. To this end, residual connections are introduced at two levels: local and global. At the local level, a novel residual dense block (RDB) is proposed where the input to each block (an image or the output of a previous block) is forwarded to all layers within the RDB and also added to the block's output, so that each block focuses on the residual patterns. Since the dense connections quickly lead to high-dimensional outputs, a local feature fusion step with 1×1 convolutions is used in each RDB to reduce the dimensionality. At the global level, the outputs of multiple RDBs are fused together (via concatenation and convolution operations) and global residual learning is performed to combine features from multiple blocks in the network. The residual connections help stabilize network training and result in an improvement over SR-DenseNet.
In contrast to the L2 loss used in SR-DenseNet, RDN utilizes the L1 loss function, advocating its improved convergence properties. Network training is performed on patches randomly selected in each batch, with flips and rotations applied as data augmentation. The authors also experiment with settings where different forms of degradation (e.g., noise and artifacts) are present in the LR images. The proposed approach shows good resilience against such degradations and recovers much-enhanced SR images.
The dense deep back-projection network (D-DBPN) for super-resolution takes inspiration from conventional SR approaches that iteratively perform back-projections to learn the feedback error signal between LR and HR images. The motivation is that a purely feed-forward approach is not optimal for modelling the LR-to-HR mapping, and a feedback mechanism can greatly help in achieving better results. For this purpose, the proposed architecture comprises a series of up- and down-sampling layers that are densely connected with each other. In this manner, HR images from multiple depths in the network are combined to achieve the final output.
The architecture of the up- and down-sampling blocks is shown in Fig. 2. For brevity, the simpler case of a single connection from previous layers is shown; readers are directed to the original paper for the complete densely connected block. An important feature of this design is the combination of the upsampled input feature map with the residual signal. The explicit addition of the residual signal to the upsampled feature map provides error feedback and forces the network to focus on fine details. The network is trained using a standard pixel-wise loss function. D-DBPN has a relatively high parameter count for 4× SR; however, a lower-complexity version of the final model was also proposed that leads to only a slight drop in performance.
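The up-projection step described above can be sketched with naive operators; nearest-neighbour upscaling and strided subsampling are stand-ins for D-DBPN's learned deconvolution and strided-convolution layers, so this shows only the error-feedback flow, not the trained behaviour:

```python
import numpy as np

def upsample(x, s=2):
    return np.kron(x, np.ones((s, s)))   # nearest-neighbour upscaling

def downsample(x, s=2):
    return x[::s, ::s]                   # strided subsampling

def up_projection(lr, s=2):
    """Sketch of a D-DBPN-style up-projection: upsample the LR map, project
    it back down, and upsample the back-projected error so the block
    corrects what the first upsampling missed."""
    h0 = upsample(lr, s)        # initial HR estimate
    l0 = downsample(h0, s)      # project back to LR space
    err = lr - l0               # residual (feedback) signal
    return h0 + upsample(err, s)

lr = np.random.rand(4, 4)
hr = up_projection(lr)
```

With these toy operators the back-projection is exact (the error is zero); with learned, imperfect operators the residual term carries the useful feedback.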
3.6 Multi-branch designs
In contrast to single-stream (linear) and skip-connection based designs, multi-branch networks aim to obtain a diverse set of features at multiple context scales. Such complementary information is then fused to obtain better HR reconstructions. This design also enables a multi-path signal flow, leading to better information exchange in forward-backward steps during training. Multi-branch designs are becoming common in several other computer vision tasks as well. We explain multi-branch networks in the section below.
Ren et al. proposed fusing multiple convolutional neural networks for image super-resolution, terming their model Context-wise Network Fusion (CNF). Each branch is an SRCNN constructed with a different number of layers; the output of each SRCNN is passed through a single convolutional layer, and all branches are finally fused using sum-pooling.
The inputs to the network are 33×33-pixel patches from the luminance channel only. First, each SRCNN is trained individually for 50 epochs with a learning rate of 1e-4; then the fused network is trained for ten epochs at the same rate. Such a progressive training strategy resembles curriculum learning, starting from a simple task and then moving on to the more complex task of jointly optimizing multiple sub-networks to achieve improved SR. Mean squared error is used as the training loss.
The cascaded multi-scale cross-network (CMSC) is composed of a feature extraction layer, cascaded subnets, and a reconstruction network. The feature extraction layer performs the same function as in SRCNN and FSRCNN. Each subnet is composed of merge-and-run (MR) blocks, each comprising two parallel branches with two convolutional layers each. The residual connections from the branches are accumulated and then added to the output of both branches individually, as shown in Figure 2. Each subnet of CMSC is formed with four MR blocks having different receptive fields of 3×3, 5×5, and 7×7 to capture contextual information at multiple scales. Furthermore, each convolutional layer in an MR block is followed by batch normalization and a Leaky-ReLU activation. The final reconstruction layer generates the output.
The loss function combines the intermediate outputs with the final one using a balancing term. The input to the network is upsampled via bicubic interpolation, with a patch size of 41×41. The model is trained on 291 images, similar to VDSR, with an initial learning rate that is decreased by a factor of 10 after every ten epochs, for a total of 50 epochs. CMSC lags in performance behind EDSR and its variant MDSR.
The Information Distillation Network (IDN) consists of three blocks: a feature extraction block, multiple stacked information distillation blocks, and a reconstruction block. The feature extraction block is composed of two convolutional layers. Each distillation block is made up of two units: an enhancement unit and a compression unit. The enhancement unit has six convolutional layers, each followed by a leaky ReLU. The output of the third convolutional layer is sliced along the channel dimension: one half is concatenated with the input of the block, while the other half serves as input to the fourth convolutional layer. The concatenated features are then added to the output of the enhancement unit. In total, four enhancement blocks are utilized. The compression unit is realized using a 1×1 convolutional layer after each enhancement block. The reconstruction block is a deconvolution layer with a kernel size of 17×17.
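The channel slicing at the heart of the distillation block can be sketched as below; the 50/50 split ratio and the zero-filled stand-in feature maps are assumptions for illustration, not IDN's exact configuration:

```python
import numpy as np

def split_channels(features, keep_ratio=0.5):
    """Sketch of IDN-style feature slicing: part of the channels is
    'distilled' (carried forward by concatenation with the block input)
    while the rest continues through further convolutions."""
    c = features.shape[0]
    k = int(c * keep_ratio)
    return features[:k], features[k:]

x = np.zeros((64, 8, 8))     # block input (stand-in feature map)
mid = np.zeros((64, 8, 8))   # output of the 3rd conv layer (stand-in)

retain, process = split_channels(mid)
distilled = np.concatenate([x, retain], axis=0)  # carried forward: 64 + 32 channels
```

The split lets the block preserve part of the current information unchanged while refining the remainder, which is the "distillation" the name refers to.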
3.7 Attention-based Networks
The previously discussed network designs treat all spatial locations and channels as uniformly important for super-resolution. In several cases, it helps to selectively attend to only a few features at a given layer. Attention-based models [64, 65] allow this flexibility by assuming that not all features are essential for super-resolution; rather, their importance varies. Coupled with deep networks, recent attention-based models have shown significant improvements for SR. The following are examples of CNN algorithms using attention mechanisms.
Choi and Kim proposed a novel selection unit for image super-resolution, terming their network SelNet. The selection unit serves as a gate between convolutional layers, passing only selected values from the feature maps. It is composed of an identity mapping and a cascade of a ReLU, a 1×1 convolution, and a sigmoid layer. SelNet consists of 22 convolutional layers in total, with a selection unit added after each one. Similar to VDSR, residual learning and gradient switching (a version of gradient clipping) are employed in SelNet for faster learning.
The network takes low-resolution patches of size 120×120 cropped from the DIV2K dataset, and the number of training epochs is set to 50.
The Residual Channel Attention Network (RCAN) is a recently proposed deep CNN architecture for single-image super-resolution. The main highlights of the architecture are: (a) a residual-in-residual design where residual connections exist within each block of a globally residual network, and (b) a channel attention mechanism within each local residual block, whereby the filter activations are collapsed by global pooling into a per-channel descriptor vector (passed through a bottleneck) that acts as a selective attention over the channel maps. The first contribution allows multiple pathways for information flow from initial to final layers. The second allows the network to focus on the feature maps most important for the end task and effectively models the relationships between feature maps.
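The channel attention mechanism in (b) can be sketched as follows; the random weight matrices are stand-ins for RCAN's learned 1×1 convolutions, and the reduction factor of 16 is one commonly reported choice, assumed here for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(features, reduction=16):
    """Sketch of RCAN-style channel attention: spatial dimensions are
    collapsed by global average pooling, the per-channel statistics pass
    through a bottleneck, and a sigmoid yields weights in (0, 1) that
    rescale each feature map."""
    c = features.shape[0]
    pooled = features.mean(axis=(1, 2))                # (C,) channel descriptor
    w_down = np.random.randn(c // reduction, c) * 0.1  # bottleneck (stand-in weights)
    w_up = np.random.randn(c, c // reduction) * 0.1
    attn = sigmoid(w_up @ np.maximum(w_down @ pooled, 0.0))
    return features * attn[:, None, None]              # rescale channel maps

x = np.random.rand(64, 8, 8)
y = channel_attention(x)
```

Because the sigmoid weights lie strictly in (0, 1), the mechanism can only attenuate channels, which is what implements the "selective attention" described above.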
RCAN uses the L1 loss function for network training. It was observed that the residual-in-residual architecture leads to better convergence properties for very deep networks. Furthermore, it leads to better performance compared to contemporary approaches such as IRCNN, VDSR and RDN. This shows the effectiveness of channel attention mechanisms for low-level vision tasks. Having said that, one shortcoming of the proposed framework is its high parameter count for 4× SR compared to, e.g., LapSRN, MemNet and VDSR.
This recent work focuses on the attention blocks used for single-image super-resolution. The authors evaluate a range of attention mechanisms within common SR architectures to compare their performance and individual merits and demerits, and propose a Residual Attention Module for SR (SRRAM). The structure of SRRAM is similar to RCAN, as both methods are inspired by EDSR. SRRAM can be divided into three parts: feature extraction, feature upscaling and feature reconstruction. The first and last parts are similar to previously discussed methods [23, 28], while the feature upscaling part is composed of residual attention modules (RAM). The RAM, the basic unit of SRRAM, is formed of residual blocks followed by spatial and channel attention for learning the intra-channel and inter-channel dependencies, respectively.
The model is trained on randomly cropped 48×48 patches from the DIV2K dataset with data augmentation. The filters are of size 3×3 with 64 feature maps, and the network is optimized with Adam using the L1 loss. A total of 64 RAM blocks are used in the final model.
3.8 Multiple-degradation handling networks
The super-resolution networks discussed so far (e.g., [23, 24]) assume a bicubic degradation. In reality, this assumption often does not hold, as multiple degradations can occur simultaneously. To deal with such real-world scenarios, the following methods have been proposed.
ZSSR stands for Zero-Shot Super-Resolution; it follows in the footsteps of classical methods by super-resolving images using internal image statistics, while employing the power of deep neural networks. ZSSR uses a simple network architecture that is trained on a downsampled version of the test image itself: the aim is to predict the test image from an LR version created from it. Once the network has learned the relationship between this LR image and the test image, the same network is used to predict the SR image with the test image as input. Hence ZSSR does not require training images for a particular degradation and can learn an image-specific network on-the-fly during inference. The network has a total of eight convolutional layers, each followed by a ReLU, with 64 channels. Similar to [24, 37], ZSSR learns the residual image using the L1 norm.
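The zero-shot recipe can be sketched as below; `train_model` and the toy pixel-repetition "model" are hypothetical stand-ins for fitting the small image-specific CNN the text describes, and the average-pool downscaler is one simple choice of degradation:

```python
import numpy as np

def downscale(img, s=2):
    """Naive average-pool downscaling used to fabricate the 'LR' copy."""
    h, w = img.shape
    return img[:h - h % s, :w - w % s].reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def zero_shot_sr(test_image, train_model, scale=2):
    """Sketch of the ZSSR recipe: build a training pair from the test image
    itself (test image = 'HR', its downscaled copy = 'LR'), fit an
    image-specific model on that pair, then apply the model to the test
    image itself to super-resolve it."""
    lr_son = downscale(test_image, scale)    # fabricated LR input
    model = train_model(lr_son, test_image)  # learn the LR -> HR mapping
    return model(test_image)                 # reuse it on the real input

# toy 'trainer': ignores the pair and returns pixel repetition, just to run the flow
toy_trainer = lambda lr, hr: (lambda img: np.kron(img, np.ones((2, 2))))
sr = zero_shot_sr(np.random.rand(8, 8), toy_trainer)
```

The key point is that training and inference both happen at test time, on data derived from the single input image.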
The super-resolution network for multiple degradations (SRMD) takes as input a low-resolution image concatenated with its degradation maps. The architecture of SRMD is similar to [26, 25, 23]. First, a cascade of 3×3 convolutional layers is applied to extract features, followed by a sequence of Conv, ReLU and batch normalization layers. Furthermore, similar to sub-pixel approaches, a convolution operation is utilized to extract HR sub-images, and as a final step the multiple HR sub-images are assembled into the final single HR output. SRMD directly learns HR images instead of residuals. The authors also introduce a variant called SRMDNF, which learns from noise-free degradations: the connections from the noise-level maps to the first convolutional layer are removed, while the rest of the architecture is identical to SRMD. The network architecture of SRMD is presented in Figure 2.
The authors train individual models for each upsampling scale, in contrast to multi-scale training. The training patch size is set to 40×40. The number of convolutional layers is fixed at 12, each with 128 feature maps. Training is performed on 5,944 images from the BSD, DIV2K and Waterloo datasets. The learning rate is decreased during training based on the error change between successive epochs. Both SRMD and its variant are unable to surpass the PSNR of earlier SR networks such as EDSR, MDSR and CMSC; however, the ability to jointly tackle multiple degradations offers a unique capability.
3.9 GAN Models
Generative Adversarial Networks (GANs) [71, 72] employ a game-theoretic approach in which two components of the model, a generator and a discriminator, compete: the generator tries to fool the discriminator. The generator creates SR images that the discriminator cannot distinguish from real HR images. In this manner, HR images with better perceptual quality are generated. The corresponding PSNR values are generally lower, which highlights the problem that the quantitative measures prevalent in the SR literature do not capture the perceptual soundness of the generated HR outputs. The super-resolution methods [42, 73] based on the GAN framework are explained next.
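The min-max game between the generator $G$ and discriminator $D$ can be written as the standard GAN objective; in the SR setting, the generator's input $z$ is the LR image and $x$ ranges over real HR images:

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z}\!\left[\log\!\left(1 - D\!\left(G(z)\right)\right)\right]
```

The generator is rewarded when $D(G(z))$ approaches 1, i.e. when its super-resolved outputs are mistaken for real HR images, which is what pushes outputs toward the natural-image manifold.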
Single-image super-resolution at large upscaling factors is very challenging. SRGAN proposes an adversarial objective function that promotes super-resolved outputs lying close to the manifold of natural images. The main highlight of the work is a multi-task loss formulation consisting of three parts: (1) an MSE loss that encodes pixel-wise similarity, (2) a perceptual similarity metric, defined as a distance over high-level image representations (e.g., deep network features), and (3) an adversarial loss that balances a min-max game between a generator and a discriminator (the standard GAN objective). The proposed framework favors outputs that are perceptually similar to natural high-resolution images. To quantify this capability, the authors introduce a Mean Opinion Score (MOS), assigned manually by human raters to indicate the bad-to-excellent quality of each super-resolved image. Since other techniques generally optimize direct data-dependent measures (such as pixel errors), SRGAN outperformed its competitors by a significant margin on this perceptual quality metric.
This network design, EnhanceNet, focuses on creating faithful texture details in super-resolved images. A key problem with standard image quality measures such as PSNR is their non-compliance with the perceptual quality of an image, which results in overly smooth images without sharp textures. To overcome this problem, EnhanceNet uses two loss terms besides the regular pixel-level MSE loss: (a) a perceptual loss, defined as a distance on the intermediate feature representations of a pretrained network, and (b) a texture matching loss, which matches the textures of the low- and high-resolution images and is quantified as the loss between Gram matrices computed from deep features. The whole network is adversarially trained, with the SR network's goal being to fool a discriminator network.
The architecture of EnhanceNet is based on the fully convolutional network and the residual learning principle. The results show that although the best PSNR is achieved when only a pixel-level loss is used, the additional loss terms and adversarial training lead to more realistic and perceptually better outputs. On the downside, the adversarial training can create visible artifacts when super-resolving highly textured regions, a limitation addressed further by recent work on high-perceptual-quality SR.
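The Gram-matrix texture term in (b) can be sketched as below; the exact normalisation and the mean-squared comparison are assumptions of this sketch rather than EnhanceNet's precise formulation:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map: channel-by-channel inner
    products that capture texture statistics irrespective of location."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (h * w)

def texture_loss(feat_sr, feat_hr):
    """Sketch of a Gram-based texture matching term: mean squared
    difference between the Gram matrices of SR and HR deep features."""
    g_sr, g_hr = gram_matrix(feat_sr), gram_matrix(feat_hr)
    return float(np.mean((g_sr - g_hr) ** 2))

a = np.random.rand(16, 8, 8)
b = np.random.rand(16, 8, 8)
loss_same = texture_loss(a, a)   # identical textures give zero loss
loss_diff = texture_loss(a, b)   # differing textures give a positive loss
```

Because the spatial dimensions are summed out, the loss compares texture statistics rather than pixel positions, which is why it encourages sharp textures without requiring exact alignment.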
SRFeat is another GAN-based super-resolution algorithm, built around feature discrimination. This work focuses on realistic perception of the output image, using an additional discriminator that helps the generator produce high-frequency structural features rather than noisy artifacts. This is achieved by distinguishing between the features of synthetic (machine-generated) and real images. The network uses a 9×9 convolutional layer to extract features, followed by residual blocks similar to SRGAN with long-range skip connections that use 1×1 convolutions. The feature maps are upsampled by pixel-shuffle layers to reach the desired output size. The authors use 16 residual blocks with two settings for the number of feature maps, i.e., 64 and 128. The model is trained with a combination of perceptual (adversarial) and pixel-level losses, optimized with Adam. The input resolution is 74×74, and the output is a 296×296 image. The generator is pre-trained on 120k images from ImageNet, followed by fine-tuning on the augmented DIV2K dataset.
The Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) builds upon SRGAN by removing batch normalization and incorporating dense blocks. The input of each dense block is also added to the output of that block, forming a residual connection over each dense block. ESRGAN also has a global residual connection to enforce residual learning. Moreover, the authors employ an enhanced discriminator based on the relativistic GAN.
Training is performed on a total of 3,450 images from the DIV2K and Flickr2K datasets with augmentation: the model is first trained with the L1 loss and then fine-tuned using a perceptual loss. The patch size for training is set to 128×128, with a network depth of 23 blocks; each block contains five convolutional layers with 64 feature maps each. The visual results are better than those of RCAN; however, ESRGAN lags behind RCAN in terms of quantitative measures.
4 Experimental Evaluation
We compare the state-of-the-art algorithms on publicly available benchmark datasets, which include Set5, Set14, BSD100, Urban100, DIV2K and Manga109. Representative images from all the datasets are shown in Figure 3.
Set5  is a classical dataset and only contains five test images of a baby, bird, butterfly, head, and a woman.
DIV2K is the dataset used for the NTIRE challenge. The images are of 2K resolution; the dataset is composed of 800 training images and 100 images each for validation and testing. As the test set is not publicly available, results for all algorithms are reported on the validation images.
Manga109 is the latest addition to the datasets for evaluating super-resolution algorithms. It is a collection of 109 test images drawn from manga volumes. These manga were drawn by professional Japanese artists and were commercially available between the 1970s and 2010s.
4.2 Quantitative Measures
The algorithms detailed in Section 3 are evaluated using the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) measures. Table II presents the results for 2×, 3× and 4× super-resolution. Currently, RCAN achieves the best PSNR and SSIM for 2× and 3×, and ESRGAN for 4×. However, it is difficult to declare one algorithm a clear winner over the rest, as many factors are involved, such as network complexity, network depth, training data, training patch size, and number of feature maps. A fair comparison is only possible by keeping all of these parameters consistent.
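PSNR, the primary measure in Table II, is a simple function of the mean squared error between the reference HR image and the reconstruction; a minimal implementation for 8-bit images is:

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB, for images with values in [0, peak]."""
    mse = np.mean((reference.astype(np.float64) - estimate.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

hr = np.full((16, 16), 128.0)
sr = hr + 10.0   # a uniform error of 10 grey levels
# mse = 100, so psnr = 10 * log10(255^2 / 100) ≈ 28.13 dB
```

Because PSNR is a monotone function of pixel-wise MSE, it inherits MSE's weak correlation with perceived quality, which motivates the perceptual metrics discussed elsewhere in this survey.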
In Figure 6, we present the visual comparison between a few of the state-of-the-art algorithms which aim to improve the PSNR of the images. Furthermore, Figure 7 shows the output of the GAN-based algorithms which are perceptually-driven and aim to enhance the visual quality of the generated outputs. As one can notice, outputs in Figure 7 are generally more crisp, but the corresponding PSNR values are relatively lower compared to methods that optimize pixel-level loss measures.
4.3 Number of parameters
Table I compares the number of parameters of different SR algorithms. Methods with direct reconstruction perform one-step upsampling from the LR to the HR space, while progressive reconstruction predicts the HR image in multiple upsampling steps. Depth represents the number of convolutional and transposed-convolutional layers on the longest path from input to output for 4× SR. Global residual learning (GRL) indicates that the network learns the difference between the ground-truth HR image and the upsampled (i.e., using bicubic interpolation or learned filters) LR image. Local residual learning (LRL) refers to local skip connections between intermediate convolutional layers. As one can notice, methods that perform late upsampling [28, 35] have considerably lower computational cost than methods that upsample earlier in the network pipeline [37, 27, 65].
4.4 Choice of network loss
The most popular choices of network loss for CNN-based image super-resolution are the mean squared error (L2) and the mean absolute error (L1). Generative adversarial networks (GANs) additionally employ an adversarial (perceptual) loss on top of such pixel-level losses. From Table I it is evident that the initial CNN methods were trained using the L2 loss; more recently, however, the trend has shifted towards L1, which has been shown to be more robust. The reason is that L2 puts more emphasis on the largest errors, while L1 weights the error distribution more evenly.
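The difference in emphasis is easy to make concrete with a toy batch of residuals containing one outlier; the specific numbers below are arbitrary, chosen only to illustrate the effect:

```python
import numpy as np

# L2 squares residuals, so a single large error dominates the objective;
# L1 weights every residual linearly.
residuals = np.array([0.1, 0.1, 0.1, 5.0])

l2 = np.mean(residuals ** 2)      # dominated by the outlier
l1 = np.mean(np.abs(residuals))   # the outlier counts only linearly

# fraction of each objective contributed by the single outlier
outlier_share_l2 = residuals[3] ** 2 / np.sum(residuals ** 2)  # ≈ 99.9%
outlier_share_l1 = residuals[3] / np.sum(np.abs(residuals))    # ≈ 94.3%
```

Under L2 the outlier contributes almost the entire loss, so gradients concentrate on the worst pixels; under L1 the remaining residuals retain proportionally more influence.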
4.5 Network depth
Contrary to the claim made in SRCNN that network depth does not contribute to better results and may even degrade quality, VDSR showed that deeper networks improve PSNR and image quality. EDSR further strengthened this finding by increasing the number of convolutional layers to nearly four times that of VDSR. Recently, RCAN employed more than four hundred convolutional layers to enhance image quality. The current batch of CNNs [38, 32] incorporates ever more convolutional layers to build deeper networks that improve both visual quality and quantitative scores, a trend that has remained dominant in deep SR since the inception of SRCNN.
4.6 Skip Connections
Overall, skip connections have played a vital role in improving SR results. They can be broadly categorized into four main types: global, local, recursive, and dense connections. Initially, VDSR utilized global residual learning (GRL) and showed an enormous performance improvement over SRCNN. DRRN and DRCN then demonstrated the effectiveness of recursive connections. Recently, EDSR and RCAN employed local residual learning (LRL), i.e., local connections, while retaining global residual learning. Similarly, RDN and ESRGAN use dense connections alongside global ones. Modern CNNs continue to innovate on the types of connections between different layers and modules. Table I lists the skip connection types used by each method.
5 Future Directions/Open Problems
Although deep networks have shown exceptional performance on the super-resolution task, there remain several open research questions. We outline some of these future research directions below.
Incorporation of Priors: Current deep networks for SR are data-driven models learned in an end-to-end fashion. While this approach has shown excellent results in general, it proves sub-optimal when a particular class of degradation occurs for which a large amount of training data does not exist (e.g., in medical imaging). In such cases, if information about the sensor, the imaged object/scene and the acquisition conditions is available, useful priors can be designed to obtain high-resolution images. Recent works in this direction have proposed both deep-network and sparse-coding based priors for better super-resolution.
Objective Functions and Metrics: Existing SR approaches predominantly use pixel-level error measures, e.g., the L1 and L2 distances or a combination of both. Since these measures only encapsulate local pixel-level information, the resulting images do not always provide perceptually sound results. For example, it has been shown that images with high PSNR and SSIM values can be overly smooth, with low perceptual quality. To counter this issue, several perceptual loss measures have been proposed in the literature. Conventional perceptual metrics were fixed, e.g., SSIM and multi-scale SSIM, while more recent ones are learned to model human perception of images, e.g., LPIPS and PieAPP. Each of these measures has its own failure cases; as a result, there is no universal perceptual metric that works optimally in all conditions and perfectly quantifies image quality. The development of new objective functions is therefore an open research problem.
Need for Unified Solutions: Two or more degradations often occur simultaneously in real-life situations. An important consideration in such cases is how to jointly recover images with higher resolution, low noise and enhanced details. Current SR models are generally restricted to a single degradation and suffer in the presence of others. Furthermore, problem-specific models differ in their architectures, loss functions and training details. It remains a challenge to design unified models that perform well across several low-level vision tasks simultaneously.
Unsupervised Image SR: Models discussed in this survey generally consider LR-HR image pairs to learn a super-resolution mapping function. One interesting direction is to explore how SR can be performed for cases where corresponding HR images are not available. One solution to this problem is Zero-shot SR  which learns the SR model on a further downsampled version of a given image. However, when an input image is already of poor resolution, this solution cannot work. The unsupervised image SR aims to solve this problem by learning a function from unpaired LR-HR image sets . Such a capability is very useful for real-life settings since it is not trivial to obtain matched HR images in several cases.
Higher SR rates: Current SR models generally do not tackle extreme super-resolution, which can be useful for cases such as super-resolving faces in crowd scenes. Very few works target SR rates higher than 8× (e.g., 16× and 32×). In such extreme upsampling conditions, it becomes challenging to preserve accurate local details in the image. A further open question is how to preserve high perceptual quality in these super-resolved images.
Arbitrary SR rates: In practical scenarios, it is often not known which upsampling factor is optimal for a given input. When the downsampling factor is unknown across the images in a dataset, training becomes significantly more challenging, since a single model must encapsulate several levels of detail. In such cases, it is important to first characterize the level of degradation before training and performing inference with a specific SR model.
Real vs Artificial Degradation: Existing SR works mostly use a bicubic interpolation to generate LR images. Actual LR images that are encountered in real-world scenarios have a totally different distribution compared to the ones generated synthetically using bicubic interpolation. As a result, SR networks trained on artificially created degradations do not generalize well to actual LR images in practical scenarios. One recent effort towards the solution of this problem first learns a GAN to model the real-world degradation .
Single-image super-resolution is a challenging research problem with important real-life applications. The phenomenal success of deep learning approaches has resulted in rapid growth in deep convolutional network based techniques for image super-resolution. A diverse set of approaches have been proposed with exciting innovations in network architectures and learning methodologies. This survey provides a comprehensive analysis of existing deep-learning based methods for super-resolution. We note that the super-resolution performance has been greatly enhanced in recent years with a corresponding increase in the network complexity. Remarkably, the state-of-the-art approaches still suffer from limitations that restrict their application to key real-world scenarios (e.g., inadequate metrics, high model complexity, inability to handle real-life degradations). We hope this survey will attract new efforts towards the solution of these crucial problems.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” TPAMI, 2016.
-  Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, “Sod-mtgan: Small object detection via multi-task generative adversarial network,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 206–221.
-  S. P. Mudunuri and S. Biswas, “Low resolution face recognition across variations in pose and illumination,” TPAMI, 2016.
-  H. Greenspan, “Super-resolution in medical imaging,” CJ, 2008.
-  T. Lillesand, R. W. Kiefer, and J. Chipman, Remote sensing and image interpretation, 2014.
-  A. P. Lobanov, “Resolution limits in astronomical images,” arXiv preprint astro-ph/0503225, 2005.
-  A. Swaminathan, M. Wu, and K. R. Liu, “Digital image forensics via intrinsic fingerprints,” TIFS, 2008.
-  S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun, “A guide to convolutional neural networks for computer vision,” Synthesis Lectures on Computer Vision, vol. 8, no. 1, pp. 1–207, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
-  A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher, “Ask me anything: Dynamic memory networks for natural language processing,” in ICML, 2016.
-  R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” JMLR, 2011.
-  S. Anwar, C. Li, and F. Porikli, “Deep underwater image enhancement,” arXiv preprint arXiv:1807.03528, 2018.
-  S. Anwar, C. P. Huynh, and F. Porikli, “Chaining identity mapping modules for image denoising,” arXiv preprint arXiv:1712.02933, 2017.
-  G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on audio, speech, and language processing, 2012.
-  C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in ECCV, 2014.
-  M. Irani and S. Peleg, “Improving resolution by image registration,” CVGIP, 1991.
-  R. Fattal, “Image upsampling via imposed edge statistics,” ACM TOG, 2007.
-  J. Huang and D. Mumford, “Statistics of natural images and models,” in CVPR, 1999.
-  W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Example-based super-resolution,” CGA, 2002.
-  H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in CVPR, 2004.
-  J. Yang, Z. Lin, and S. Cohen, “Fast image super-resolution based on in-place example regression,” in CVPR, 2013.
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” TPAMI, 2016.
-  J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in CVPR, 2016.
-  K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” TIP, 2017.
-  K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for image restoration,” in CVPR, 2017.
-  C. Dong, C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in ECCV, 2014.
-  C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in ECCV, 2016.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  D. Geman and C. Yang, “Nonlinear image recovery with half-quadratic regularization,” TIP, 1995.
-  J. Kim, J. Kwon Lee, and K. Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” in CVPR, 2016.
-  Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in CVPR, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in ICCV, 2015.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” TIP, 2010.
-  W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in CVPR, 2016.
-  W. Shi, J. Caballero, L. Theis, F. Huszar, A. Aitken, C. Ledig, and Z. Wang, “Is the deconvolution layer the same as a convolutional layer?” arXiv preprint arXiv:1609.07009, 2016.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in CVPRW, 2017.
-  N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” arXiv preprint arXiv:1803.08664, 2018.
-  J. Jiao, W.-C. Tu, S. He, and R. W. Lau, “Formresnet: formatted residual learning for image restoration,” in CVPRW, 2017.
-  Y. Fan, H. Shi, J. Yu, D. Liu, W. Han, H. Yu, Z. Wang, X. Wang, and T. S. Huang, “Balanced two-stage residual networks for image super-resolution,” in CVPRW, 2017.
-  X. Mao, C. Shen, and Y.-B. Yang, “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” in NIPS, 2016.
-  C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in ICCV, 2001.
-  R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, L. Zhang, B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee et al., “Ntire 2017 challenge on single image super-resolution: Methods and results,” in CVPRW, 2017.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
-  Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A persistent memory network for image restoration,” in CVPR, 2017.
-  Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, “Deep networks for image super-resolution with sparse prior,” in ICCV, 2015.
-  W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in CVPR, 2017.
-  W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Fast and accurate image super-resolution with deep laplacian pyramid networks,” TPAMI, 2018.
-  G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017.
-  T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in ICCV, 2017.
-  Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in CVPR, 2018.
-  M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in CVPR, 2018.
-  H. Ren, M. El-Khamy, and J. Lee, “Image super resolution based on fusing multiple convolution neural networks,” in CVPRW, 2017.
-  I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy, “Openimages: A public dataset for large-scale multi-label and multi-class image classification.” Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017.
-  A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari, “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” arXiv preprint arXiv:1811.00982, 2018.
-  Y. Hu, X. Gao, J. Li, Y. Huang, and H. Wang, “Single image super-resolution via cascaded multi-scale cross network,” arXiv preprint arXiv:1802.08808, 2018.
-  A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML, 2013.
-  Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image super-resolution via information distillation network,” in CVPR, 2018.
-  J. Choi and M. Kim, “A deep convolutional neural network with selection units for super-resolution,” in CVPRW, 2017.
-  Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” arXiv preprint arXiv:1807.02758, 2018.
-  J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, 2017.
-  J.-H. Kim, J.-H. Choi, M. Cheon, and J.-S. Lee, “Ram: Residual attention module for single image super-resolution,” arXiv preprint arXiv:1811.12043, 2018.
-  A. Shocher, N. Cohen, and M. Irani, “‘Zero-shot’ super-resolution using deep internal learning,” arXiv preprint arXiv:1712.06087, 2017.
-  K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in CVPR, 2018.
-  K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo exploration database: New challenges for image quality assessment models,” TIP, 2017.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.
-  A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
-  M. S. Sajjadi, B. Schölkopf, and M. Hirsch, “Enhancenet: Single image super-resolution through automated texture synthesis,” in ICCV, 2017.
-  J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, 2016.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
-  X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang, “Esrgan: Enhanced super-resolution generative adversarial networks,” arXiv preprint arXiv:1809.00219, 2018.
-  S.-J. Park, H. Son, S. Cho, K.-S. Hong, and S. Lee, “Srfeat: Single image super-resolution with feature discrimination,” in ECCV, 2018.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
-  A. Jolicoeur-Martineau, “The relativistic discriminator: a key element missing from standard gan,” arXiv preprint arXiv:1807.00734, 2018.
-  M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in BMVC, 2012.
-  R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International Conference on Curves and Surfaces, 2010.
-  J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in CVPR, 2015.
-  A. Fujimoto, T. Ogawa, K. Yamamoto, Y. Matsui, T. Yamasaki, and K. Aizawa, “Manga109 dataset and creation of metadata,” in International Workshop on coMics ANalysis, Processing and Understanding, 2016.
-  Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” TIP, 2004.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in CVPR, 2018.
-  W. Dong, Z. Yan, X. Li, and G. Shi, “Learning hybrid sparsity prior for image restoration: Where deep learning meets sparse coding,” arXiv preprint arXiv:1807.06920, 2018.
-  Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, “The 2018 pirm challenge on perceptual image super-resolution,” in ECCV, 2018.
-  Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in ACSSC, 2003.
-  R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018.
-  E. Prashnani, H. Cai, Y. Mostofi, and P. Sen, “PieAPP: Perceptual image-error assessment through pairwise preference,” in CVPR, 2018.
-  Y. Yuan, S. Liu, J. Zhang, Y. Zhang, C. Dong, and L. Lin, “Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks,” in CVPRW, 2018.
-  A. Bulat, J. Yang, and G. Tzimiropoulos, “To learn image super-resolution, use a gan to learn how to do image degradation first,” in ECCV, 2018.