I Introduction
Deep learning (DL) [1] is a branch of machine learning that aims to learn hierarchical representations of data. Deep learning has shown prominent superiority over other machine learning algorithms in many artificial intelligence domains, such as computer vision [2], speech recognition [3], and natural language processing [4]. Generally, the strong capacity of DL to handle large amounts of unstructured data is attributable to two main contributors: the development of efficient computing hardware and the advancement of sophisticated algorithms.
Single image super-resolution (SISR) is a notoriously challenging ill-posed problem because a specific low-resolution (LR) input can correspond to many possible high-resolution (HR) images, and the HR space (in most instances, the natural image space) to which we intend to map the LR input is usually intractable [5]. Previous methods for SISR have two main drawbacks: one is the unclear definition of the mapping that we aim to develop between the LR space and the HR space, and the other is the inefficiency of establishing a complex high-dimensional mapping given massive raw data. Benefiting from the strong capacity to extract effective high-level abstractions that bridge the LR and HR spaces, recent DL-based SISR methods have achieved significant improvements, both quantitatively and qualitatively.
In this survey, we attempt to give an overall review of recent DL-based SISR algorithms. We mainly focus on two areas: efficient neural network architectures designed for SISR and effective optimization objectives for DL-based SISR learning. The reason for this taxonomy is that when we apply DL algorithms to tackle a specified task, it is best for us to consider both the universal DL strategies and the specific domain knowledge. From the perspective of DL, although many other techniques such as data preprocessing [6] and model training techniques are also quite important [7, 8], the combination of DL and domain knowledge in SISR is usually the key to success and is often reflected in the innovations of neural network architectures and optimization objectives for SISR. In each of these two focused areas, based on the benchmark, several representative works are discussed mainly from the perspective of their contributions and experimental results as well as our comments and views.
The rest of the paper is arranged as follows. In Section II, we present relevant background concepts of SISR and DL. In Section III, we survey the literature on exploring efficient neural network architectures for various SISR tasks. In Section IV, we survey the studies on proposing effective objective functions for different purposes. In Section V, we summarize some trends and challenges for DL-based SISR. We conclude this survey in Section VI.
II Background
II-A Single Image Super-Resolution
Super-resolution (SR) [9] refers to the task of restoring high-resolution images from one or more low-resolution observations of the same scene. According to the number of input LR images, SR can be classified into single image super-resolution (SISR) and multi-image super-resolution (MISR). Compared with MISR, SISR is much more popular because of its high efficiency. Since an HR image with high perceptual quality has more valuable details, it is widely used in many areas, such as medical imaging, satellite imaging and security imaging. In the typical SISR framework, as depicted in Fig. 1, the LR image $y$ is modeled as follows:
$$y = (x \otimes k)\downarrow_{s} + n, \tag{1}$$
where $x \otimes k$ is the convolution between the blur kernel $k$ and the unknown HR image $x$, $\downarrow_{s}$ is the downsampling operator with scale factor $s$, and $n$ is the independent noise term. Solving (1) is an extremely ill-posed problem because one LR input may correspond to many possible HR solutions. To date, mainstream SISR algorithms are mainly divided into three categories: interpolation-based methods, reconstruction-based methods and learning-based methods.
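As a minimal sketch of the degradation model in (1), the following NumPy code blurs a grayscale image, downsamples it and adds noise. The isotropic Gaussian blur kernel, the edge padding, the direct subsampling and the names `gaussian_kernel` and `degrade` are assumptions made for illustration; the model itself does not prescribe them.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized isotropic Gaussian blur kernel (an assumed choice of k)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def degrade(x, k, s, noise_std=0.0, rng=None):
    """Simulate y = (x conv k) downsampled by s, plus Gaussian noise n.

    x is a 2-D grayscale HR image and k an odd-sized blur kernel.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    pad = k.shape[0] // 2
    xp = np.pad(x, pad, mode="edge")
    h, w = x.shape
    blurred = np.zeros((h, w), dtype=float)
    # Sliding-window sum; for a symmetric kernel this equals convolution.
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            blurred += k[i, j] * xp[i:i + h, j:j + w]
    y = blurred[::s, ::s]                      # keep every s-th pixel
    return y + noise_std * rng.standard_normal(y.shape)
```

Because blurring and subsampling discard information, many different HR images `x` produce the same `y`, which is the ill-posedness discussed above.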
Interpolation-based SISR methods, such as bicubic interpolation [10] and Lanczos resampling [11], are very fast and straightforward but suffer from limited accuracy. Reconstruction-based SR methods [12, 13, 14, 15] often adopt sophisticated prior knowledge to restrict the possible solution space, with the advantage of generating flexible and sharp details. However, the performance of many reconstruction-based methods degrades rapidly as the scale factor increases, and these methods are usually time-consuming.
Learning-based SISR methods, also known as example-based methods, are brought into focus because of their fast computation and outstanding performance. These methods usually utilize machine learning algorithms to analyze statistical relationships between the LR and its corresponding HR counterpart from substantial training examples. The Markov random field (MRF) [16] approach was first adopted by Freeman et al. to exploit the abundant real-world images to synthesize visually pleasing image textures. Neighbor embedding methods [17] proposed by Chang et al. took advantage of similar local geometry between LR and HR to restore HR image patches. Inspired by the sparse signal recovery theory [18], researchers applied sparse coding methods [19, 20, 21, 22, 23, 24] to SISR problems. Lately, random forest [25] has also been used to improve the reconstruction performance. Meanwhile, many works combined the merits of reconstruction-based methods with the learning-based approaches to further reduce artifacts introduced by external training examples [26, 27, 28, 29]. Very recently, DL-based SISR algorithms have demonstrated great superiority to reconstruction-based and other learning-based methods.
II-B Deep Learning
Deep learning is a branch of machine learning algorithms based on directly learning diverse representations of data [30]. In contrast to traditional task-specific learning algorithms that select useful handcrafted features with expert domain knowledge, deep learning algorithms aim to learn informative hierarchical representations automatically and then leverage them to achieve the final purpose, where the whole learning process can be seen as an entirety [31].
Because of the high approximating capacity and hierarchical property of an artificial neural network (ANN), most modern deep learning models are based on ANNs [32]. Early ANNs can be traced back to perceptron algorithms in the 1960s [33]. Then, in the 1980s, the multilayer perceptron could be trained with the backpropagation algorithm [34], and the convolutional neural network (CNN) [35] and recurrent neural network (RNN) [36], two representative derivatives of the traditional ANN, were introduced to the computer vision and speech recognition fields, respectively. Despite remarkable progress achieved by ANNs during that period, there were still many deficiencies handicapping ANNs from developing further [37, 38]. Thereafter, the rebirth of the modern ANN was marked by pretraining the deep neural network (DNN) with the restricted Boltzmann machine (RBM) proposed by Hinton in 2006 [39]. Consequently, benefiting from the boom of computing power and the development of advanced algorithms, models based on the DNN have achieved remarkable performance in various supervised tasks [40, 41, 2]. Meanwhile, DNN-based unsupervised algorithms such as the deep Boltzmann machine (DBM) [42], variational autoencoder (VAE) [43, 44] and generative adversarial nets (GAN) [45] have attracted much attention owing to their potential to address challenging unlabeled data. Readers can refer to [46] for an extensive analysis of DL.
III Deep Architectures for SISR
In this section, we mainly discuss the efficient architectures proposed for SISR in recent years. First, we set the network architecture of super-resolution CNN (SRCNN) [47, 48] as the benchmark. When we discuss each related architecture in detail, we focus on their universal parts that can apply to other tasks and their specific parts that characterize SISR properties. To meaningfully construct fair comparisons among different models, we will illustrate the importance of the training dataset and attempt to compare models with the same training dataset.
III-A Benchmark of Deep Architecture for SISR
We select the SRCNN architecture as the benchmark in this section. The overall architecture of SRCNN is shown in Fig. 2. As established in many traditional methods, for simplicity, SRCNN is trained only on the luminance component. SRCNN is a three-layer CNN. The functions of its three nonlinear transformations are patch extraction, nonlinear mapping and reconstruction. The loss function for optimizing SRCNN is the mean square error (MSE), which will be discussed in the next section.
The formulation of SRCNN is relatively simple and can be envisioned as an ordinary CNN that approximates the complex mapping between the LR and HR spaces in an end-to-end manner. SRCNN reportedly demonstrated vast superiority over concurrent traditional methods, and we argue that its acclaim is owing to the CNN's strong capability of learning valid representations from big data in an end-to-end manner.
Despite the success of SRCNN, the following problems have inspired more effective architectures:
1) The input of SRCNN is the bicubic-interpolated LR, an approximation of HR. However, these interpolated inputs have three drawbacks: (a) detail-smoothing effects introduced by these inputs may lead to further wrong estimations of the image structure; (b) employing interpolated versions as input is very time-consuming; and (c) when the downsampling kernel is unknown, one specific interpolated input as a raw estimation is unreasonable. Therefore, the first question emerges: can we design CNN architectures that directly take LR as input to address these problems? (Footnote 1: Generally, the first problem can be grouped into the third problem below. Because the solutions to this problem form the basis of many other models, it is necessary to introduce this problem separately as the first drawback.)
2) The SRCNN is just a three-layer architecture. Can more complex CNN architectures (with different depths, widths and topologies) achieve better results? If yes, then how can we design such models of greater complexity?
3) The prior terms in the loss function that reflect properties of HR images are trivial. Can we integrate any property of the SISR process into the design of the CNN frame or other parts in the algorithms for SISR? If yes, then can these deep architectures with SISR properties be more effective in addressing some challenging SISR problems, such as the large scale factor SISR and the unknown downsampling of SISR?
Based on some solutions to these three questions, recent studies on deep architectures for SISR will be discussed in Sections III-B1, III-B2 and III-B3.
III-B State-of-the-Art Deep SISR Networks
III-B1 Learning Effective Upsampling with CNN
One solution to the first question is to design a module in the CNN architecture that adaptively increases the resolution. Convolution with pooling and strided convolution are the common downsampling operators in the basic CNN architecture. Naturally, one can implement the corresponding upsampling operation, which is known as deconvolution [50] or transposed convolution [51]. Given the upsampling factor, the deconvolution layer is composed of an arbitrary interpolation operator (usually, we choose the nearest neighbor interpolation for simplicity) and a following convolution operator with a stride of 1, as shown in Fig. 3. Readers should be aware that such deconvolution may not completely recover the information lost by convolution with pooling or strided convolution. Such a deconvolution layer has been successfully adopted in the context of network visualization [52], semantic segmentation [53] and generative modeling [54]. For a more detailed illustration of the deconvolution layer, readers can refer to [55]. To the best of our knowledge, FSRCNN [56] is the first work using this normal deconvolution layer to reconstruct HR images from LR feature maps. As mentioned previously, the usage of the deconvolution layer has two main advantages: one is that a reduction in computation is achieved because we just need to increase resolution at the end of the network; the other is that it avoids committing to a fixed interpolated input when the downsampling kernel is unknown, a case in which many reports, e.g., [57], have shown that an inaccurate input estimation has side effects on the final performance.
Although the normal deconvolution layer, which has already been included in popular open source packages such as Caffe [58] and TensorFlow [59], offers a reasonably good solution to the first question, there is still an underlying problem: when we use the nearest neighbor interpolation, the points in the upsampled features are repeated several times in each direction. This configuration of the upsampled pixels is redundant. To circumvent this problem, Shi et al. proposed an efficient sub-pixel convolution layer in [49], known as ESPCN; the structure of ESPCN is shown in Fig. 4. Rather than increasing resolution by explicitly enlarging feature maps as the deconvolution layer does, ESPCN expands the channels of the output features to store the extra points and then rearranges these points to obtain the HR output through a specific mapping criterion. As the expansion is carried out in the channel dimension, a smaller kernel size is sufficient. The work in [55] further shows that when the ordinary but redundant nearest neighbor interpolation is replaced with the interpolation that pads the sub-pixels with zeroes, the deconvolution layer can be simplified into the sub-pixel convolution in ESPCN. Compared with the nearest neighbor interpolation, this interpolation is more efficient, which also supports the effectiveness of ESPCN.
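The rearrangement step of sub-pixel convolution can be sketched in a few lines of NumPy. This is only an illustration of the channel-to-space mapping: the function name `pixel_shuffle` and the channel ordering (the one commonly used by deep learning frameworks, where channel index `c*r*r + i*r + j` fills sub-pixel position `(i, j)`) are assumptions, and the convolution that produces the expanded channels is omitted.

```python
import numpy as np

def pixel_shuffle(feat, r):
    """Rearrange a (C*r*r, H, W) feature map into a (C, H*r, W*r) map.

    Each group of r*r channels supplies the r x r sub-pixels of one
    output channel, so resolution grows without repeating any value.
    """
    c_r2, h, w = feat.shape
    assert c_r2 % (r * r) == 0, "channel count must be divisible by r*r"
    c = c_r2 // (r * r)
    out = feat.reshape(c, r, r, h, w)      # axes: (c, i, j, h, w)
    out = out.transpose(0, 3, 1, 4, 2)     # axes: (c, h, i, w, j)
    return out.reshape(c, h * r, w * r)    # row = h*r + i, col = w*r + j

# Usage: four 2x2 LR feature maps become one 4x4 HR map for r = 2.
lr_features = np.arange(16).reshape(4, 2, 2)
hr = pixel_shuffle(lr_features, 2)
```

All convolutions before this step run at LR resolution, which is where the computational saving comes from.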
III-B2 The Deeper, The Better
In DL research, there is theoretical work [60] showing that the solution space of a DNN can be expanded by increasing its depth or its width. In some situations, to attain more hierarchical representations more effectively, many works mainly focus on improvements acquired by increasing the depth. Recently, various DL-based applications have also demonstrated the great power of very deep neural networks despite many training difficulties. VDSR [61] is the first very deep model used in SISR. As shown in Fig. 5(a), VDSR is a 20-layer VGG-net [62]. The VGG architecture sets all kernel sizes as $3 \times 3$ (the kernel size is usually odd and takes the increase in the receptive field into account, and $3 \times 3$ is the smallest such kernel size). To train this deep model, the authors used a relatively high initial learning rate to accelerate convergence and used gradient clipping to prevent the gradient explosion problem.
In addition to the innovative architecture, VDSR has made two more contributions. The first one is that a single model is used for multiple scales since the SISR processes with different scale factors have a strong relationship with each other. This fact is the basis of many traditional SISR methods. Similar to SRCNN, VDSR takes the bicubic-interpolated LR as input. During training, VDSR puts the bicubic-interpolated LR images of different scale factors together for training. For larger scale factors ($\times 3$, $\times 4$), the mapping for a smaller scale factor ($\times 2$) may also be informative. The second contribution is residual learning. Unlike the direct mapping from the bicubic version to HR, VDSR uses a deep CNN to learn the mapping from the bicubic version to the residual between the bicubic version and HR. The authors argued that residual learning could improve performance and accelerate convergence.
The convolution kernels in the nonlinear mapping part of VDSR are very similar, and in order to reduce parameters, Kim et al. further proposed DRCN [63], which utilizes the same convolution kernel in the nonlinear mapping part 16 times, as shown in Fig. 5(b). To overcome the difficulties of training a deep recursive CNN, a multi-supervised strategy is applied, and the final result can be regarded as the fusion of 16 intermediate results. The coefficients for fusion are a list of trainable positive scalars that sum to 1. As they showed, DRCN and VDSR have quite similar performance.
Here, we believe that it is necessary to emphasize the importance of the multi-supervised training in DRCN. This strategy not only creates short paths through which the gradients can flow more smoothly during backpropagation but also guides all the intermediate representations to reconstruct raw HR outputs. Finally, fusing all these raw HR outputs produces a strong result. However, for fusion, this strategy has two flaws: 1) once the weight scalars are determined in the training process, they will not change with different inputs; and 2) using a single scalar to weight each HR output does not take pixel-wise differences into consideration, that is, it would be better to weight different parts differently in an adaptive way.
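The difference between the two fusion styles can be sketched as follows. `fuse_scalar` mirrors the description above (one positive scalar per intermediate output, normalized to sum to 1), while `fuse_pixelwise` is a hypothetical adaptive alternative in which a separate weight map is used per output; the softmax normalization and both function names are assumptions made for illustration, not taken from DRCN.

```python
import numpy as np

def fuse_scalar(outputs, weights):
    """Fuse D raw HR outputs of shape (D, H, W) with one scalar each."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # positive scalars that sum to 1
    return np.tensordot(w, outputs, axes=1)

def fuse_pixelwise(outputs, logits):
    """Fuse with a weight per output AND per pixel (assumed softmax over D).

    In an adaptive scheme the logits would be predicted from the input,
    so the weights change from image to image and from pixel to pixel.
    """
    z = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    w = np.exp(z)
    w = w / w.sum(axis=0, keepdims=True)
    return (w * outputs).sum(axis=0)
```

With scalar weights, every pixel of a given intermediate output receives the same importance; the pixel-wise variant removes that restriction at the cost of predicting `D` extra maps.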
It is hard to go deeper with a plain architecture such as VGG-net. Various deep models based on skip connections can be extremely deep and have achieved state-of-the-art performance in many tasks. Among them, ResNet [64, 65], proposed by He et al., is the most representative model. Readers can refer to [66, 67] for further discussions on why ResNet works well. In [68], the authors proposed SRResNet, which is composed of 16 residual units (a residual unit consists of two nonlinear convolutions with residual learning). In each unit, batch normalization (BN) [69] is used to stabilize the training process. The overall architecture of SRResNet is shown in Fig. 5(c). Based on the original residual unit in [65], Tai et al. proposed DRRN [70], in which basic residual units are rearranged in a recursive topology to form a recursive block, as shown in Fig. 5(d). Then, to reduce the number of parameters, each block shares the same parameters and is reused recursively, as with the single recursive convolution kernel in DRCN.
EDSR [71] was proposed by Lim et al. and has achieved state-of-the-art performance. EDSR has mainly made three improvements on the overall frame: 1) Compared with the residual unit used in previous work, EDSR removes BN, as shown in Fig. 5(e). The original ResNet with BN was designed for classification, where inner representations are highly abstract, and these representations can be insensitive to the shift introduced by BN. Regarding image-to-image tasks such as SISR, since the input and output are strongly related, if the convergence of the network is not a problem, then such a shift may harm the final performance. 2) In addition to the regular increase in depth, EDSR also increases the number of output features of each layer on a large scale. To alleviate the difficulties of training such a wide ResNet, the residual scaling trick proposed in [72] is employed. 3) Additionally, inspired by the fact that the SISR processes with different scale factors have strong relationships with each other, when training the models for the $\times 3$ and $\times 4$ scales, the authors of [71] initialized the parameters with the pretrained $\times 2$ network. This pretraining strategy accelerates the training and improves the final performance.
The effectiveness of the pretraining strategy in EDSR implies that models for different scales may share many intermediate representations. To explore this idea further, similar to building a multiscale architecture as VDSR does on the condition of bicubic input, the authors of EDSR proposed MDSR to achieve the multiscale architecture, as shown in Fig. 5(g). In MDSR, the convolution kernels for nonlinear mapping are shared across different scales, where only the front convolution kernels for extracting features and the final sub-pixel upsampling convolution are different. At each update during the training of MDSR, minibatches for $\times 2$, $\times 3$ and $\times 4$ are randomly chosen, and only the corresponding parts of MDSR are updated.
In addition to ResNet, DenseNet [73] is another effective architecture based on skip connections. In DenseNet, each layer is connected with all the preceding representations, and bottleneck layers are used in units and blocks to reduce the number of parameters. In [74], the authors pointed out that ResNet enables feature reuse while DenseNet enables new feature exploration. Based on the basic DenseNet, SRDenseNet [75], as shown in Fig. 5(f), further concatenates all the features from different blocks before the deconvolution layer, which is shown to be effective in improving performance. MemNet [76], proposed by Tai et al., uses the residual unit recursively to replace the normal convolution in the block of the basic DenseNet and adds dense connections among different blocks, as shown in Fig. 5(h). The authors explained that the local connections in the same block resemble short-term memory and the connections with previous blocks resemble long-term memory [77]. Recently, RDN [78] was proposed by Zhang et al. and uses a similar structure. In an RDN block, basic convolution units are densely connected as in DenseNet, and at the end of an RDN block, a bottleneck layer is used, followed by residual learning across the whole block. Before entering the reconstruction part, features from all previous blocks are fused by dense connection and residual learning.








III-B3 Combining Properties of the SISR Process with the Design of the CNN Frame
In this subsection, we discuss some deep frames whose architectures or procedures are inspired by some representative methods for SISR. Compared with the above-mentioned NN-oriented methods, these methods can be better interpreted, and they sometimes are more sophisticated in addressing certain challenging cases for SISR.
Combining sparse coding with deep NN: The sparse prior in natural images and the relationships between the HR and LR spaces rooted in this prior were widely used for their great performance and theoretical support. SCN [79] was proposed by Wang et al. and uses the learned iterative shrinkage and thresholding algorithm (LISTA) [80], which produces an approximate estimation of sparse coding based on NN, to solve the time-consuming inference in traditional sparse coding SISR. They further introduced a cascaded version (CSCN) [81] that employs multiple SCNs. Previous works such as SRCNN tried to explain general CNN architectures with the sparse coding theory, which from today's view may be somewhat unconvincing. SCN combines these two important concepts innovatively and gains both quantitative and qualitative improvements.
Learning to ensemble by NN: Different models specialize in different image patterns of SISR. From the perspective of ensemble learning, a better result can be acquired by adaptively fusing various models with different purposes at the pixel level. Motivated by this idea, MSCN was proposed by Liu et al. [82], who developed an extra module in the form of a CNN that takes the LR as input and outputs several tensors with the same shape as the HR. These tensors can be viewed as adaptive element-wise weights for each raw HR output. By selecting NNs as the raw SR inference modules, the raw estimating parts and the fusing part can be optimized jointly. However, in MSCN, the coefficients at each pixel do not sum to 1, which may be slightly incongruous.
Deep architectures with progressive methodology: Increasing SISR performance progressively has been extensively studied previously, and many recent DL-based approaches also exploit it from various perspectives. Here, we mainly discuss three novel works within this scope: DEGREE [83], combining the progressive property of ResNet with traditional sub-band reconstruction; LapSRN [84], generating SR of different scales progressively; and PixelSR [85], leveraging conditional autoregressive models to generate SR pixel-by-pixel.
Compared with other deep architectures, ResNet is intriguing for its progressive properties. Taking SRResNet for example, one can observe that directly sending the representations produced by intermediate residual blocks to the final reconstruction part will also yield a quite good raw HR estimator. The deeper these representations are, the better the results that can be obtained. A similar phenomenon of ResNet applied in recognition is reported in [66]. DEGREE, proposed by Yang et al., combines this progressive property of ResNet with the sub-band reconstruction of traditional SR methods [86]. The residues learned in each residual block can be used to reconstruct high-frequency details, resembling the signals from a certain high-frequency band. To simulate sub-band reconstruction, a recursive residual block is used. Compared with the traditional supervised sub-band recovery methods that need to obtain sub-band ground truth by diverse filters, this simulation with recursive ResNet avoids explicitly estimating intermediate sub-band components, benefiting from the end-to-end representation learning.
As mentioned above, models for small scale factors can be used as a raw estimator for large-scale SISR. In the SISR community, SISR under large scale factors (e.g., $\times 8$) has been a challenging problem for a long time. In such situations, plausible priors are imposed to restrict the solution space. A straightforward way to address this is to gradually increase resolution by adding extra supervision on the auxiliary SISR process of the small scale. Based on this heuristic prior, LapSRN, proposed by Lai et al., uses the Laplacian pyramid structure to reconstruct HR outputs. LapSRN has two branches: the feature extraction branch and the image reconstruction branch, as shown in Fig. 6. At each scale, the image reconstruction branch estimates a raw HR output of the present stage, and the feature extraction branch outputs a residue between the raw estimator and the corresponding ground truth as well as extracts useful representations for the next stage.
When faced with large scale factors with a severe loss of necessary details, some researchers suggest that synthesizing rational details can achieve better results. In this situation, deep generative models, which will be discussed in the next sections, could be good choices. Compared with the traditional independent point estimation of the lost information, conditional autoregressive generative models using conditional maximum likelihood estimation in directed graphical models gradually generate high-resolution images based on the previously generated pixels. PixelRNN [87] and PixelCNN [88] are recent representative autoregressive generative models. The current pixel in PixelRNN and PixelCNN is explicitly dependent on the left and top pixels that have already been generated. To implement such operations, novel network architectures are elaborated. PixelSR was proposed by Dahl et al. and is the first to apply conditional PixelCNN to SISR. The overall architecture is shown in Fig. 7. The conditioning CNN takes LR as input, which provides LR-conditional information to the whole model, and the PixelCNN part is the autoregressive inference part. The current pixel is determined by these two parts together using the current softmax probability:
$$p(y_i \mid x, y_{<i}) = \mathrm{softmax}\big(A_i(x) + B_i(y_{<i})\big), \tag{2}$$
where $x$ is the LR input, $y_i$ is the current HR pixel to be generated, $y_{<i}$ are the generated pixels, $A_i(\cdot)$ denotes the conditioning network predicting a vector of logit values corresponding to the possible values, and $B_i(\cdot)$ denotes the prior network predicting a vector of logit values of the $i$-th output pixel. The value with the highest probability is taken as the final output pixel. Similarly, the whole network is optimized by minimizing the cross-entropy loss (maximizing the corresponding log-likelihood) between the model's prediction and the discrete ground-truth labels.
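The way the two sets of logits are combined for one pixel can be sketched as below. The logit vectors are placeholder inputs standing in for the outputs of the conditioning network and the prior network; the function names are hypothetical, and the networks themselves are not modeled here.

```python
import numpy as np

def pixel_distribution(cond_logits, prior_logits):
    """Softmax over the sum of conditioning and prior logits for one pixel."""
    z = np.asarray(cond_logits, dtype=float) + np.asarray(prior_logits, dtype=float)
    z = z - z.max()                 # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def pick_pixel(cond_logits, prior_logits):
    """Return the index of the most probable intensity value."""
    return int(np.argmax(pixel_distribution(cond_logits, prior_logits)))

def cross_entropy(cond_logits, prior_logits, target):
    """Negative log-likelihood of the discrete ground-truth value `target`."""
    return float(-np.log(pixel_distribution(cond_logits, prior_logits)[target]))
```

For 8-bit images each logit vector would have 256 entries, one per possible intensity; generation then proceeds pixel by pixel, feeding each chosen value back into the prior network.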
Deep architectures with back-projection: Iterative back-projection [89] is an early SR algorithm that iteratively computes the reconstruction error and then feeds it back to tune the HR results. Recently, DBPN [90], proposed by Haris et al., uses deep architectures to simulate iterative back-projection and further improves performance with dense connections [73], which is shown to achieve strong performance in the $\times 8$ scale. As shown in Fig. 8, the dense connection and convolution for reducing the dimension are first applied across different up-projection (down-projection) units; next, in the $t$-th up-projection unit, the current LR feature input $L^{t-1}$ is first deconvoluted to obtain a raw HR feature $H_0^t$, and $H_0^t$ is back-projected to the LR feature $L_0^t$. The residue between the two LR features, $e_t^l = L_0^t - L^{t-1}$, is then deconvoluted and added to $H_0^t$ to obtain a finer HR feature $H^t$. The down-projection unit is defined very similarly in an inverse way.
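The data flow of one up-projection unit can be sketched with stand-in operators. In DBPN the three operators are learned (de)convolution layers; here they are passed in as plain callables, and the nearest-neighbor upsampling `nn_up` and average pooling `avg_down` used in the usage lines are assumptions chosen only to make the sketch runnable.

```python
import numpy as np

def nn_up(a, s):
    """Stand-in upsampler: nearest-neighbor enlargement by factor s."""
    return np.kron(a, np.ones((s, s)))

def avg_down(a, s):
    """Stand-in downsampler: s x s average pooling."""
    h, w = a.shape
    return a.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def up_projection(l_prev, up0, down, up1):
    """One up-projection step: LR feature in, refined HR feature out."""
    h0 = up0(l_prev)          # raw HR feature from the LR input
    l0 = down(h0)             # back-project the raw HR feature to LR
    e = l0 - l_prev           # residue between the two LR features
    return h0 + up1(e)        # correct the raw HR feature with the residue

# Usage with the stand-in operators (scale factor 2).
l = np.arange(4.0).reshape(2, 2)
h = up_projection(l, lambda a: nn_up(a, 2), lambda a: avg_down(a, 2),
                  lambda a: nn_up(a, 2))
```

With these particular stand-ins the back-projected feature equals the input, so the residue vanishes; with learned operators the residue is generally non-zero and carries the correction.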
Usage of additional information from LR: Although modern deep NNs are skillful in extracting various ranges of useful representations in an end-to-end manner, in some cases, it is still helpful to select some information to process explicitly. For example, DEGREE [83] takes the edge map of LR as another input. Recent studies tend to use more complex information of LR directly, two examples of which are the following: SFT-GAN [91], with extra semantic information of LR for better perceptual quality, and SRMD [92], incorporating degradation into input for multiple degradations.
The work in [93] reported that using a semantic prior helps improve the performance of many SISR algorithms. Leveraging powerful deep architectures recently designed for segmentation, Wang et al. [91] used semantic segmentation maps of the interpolated LR as additional input and designed the spatial feature transformation (SFT) layer to handle them. With this extra information from high-level tasks, the proposed work is more skilled in generating textural details.
To take degradations of different LRs into account, SRMD first applied a parametric zero-mean anisotropic Gaussian kernel to stand for the blur kernel and the additive white Gaussian noise with hyperparameter $\sigma$ to represent noise. Then, a simple regression is used to obtain its covariance matrix. These sufficient statistics are dimensionally stretched to concatenate with LR in the channel dimension, and with such input, a deep model is trained. Notably, when SRMD is tested with real images, the needed parameters on the degradation level are obtained by grid search.
Reconstruction-based frameworks based on priors offered by deep NN: Sophisticated priors are key for reconstruction-based SISR algorithms to address different cases flexibly. Recent works showed that deep NNs could provide well-performing priors mainly from two perspectives: deep NNs that learn priors from data in advance and are used in a plug-and-play manner, and direct reconstruction of the output that leverages the intriguing but still poorly understood priors of deep architectures themselves.
Given the degraded version $y$, the reconstruction-based algorithms aim to obtain the desired result $\hat{x}$ by solving
$$\hat{x} = \arg\min_{x} \frac{1}{2}\|y - Hx\|_2^2 + \lambda \Omega(x), \tag{3}$$
where $H$ is the degradation matrix, $\Omega(x)$ is regularization, also called a prior from the Bayesian view, and $\lambda$ weights the regularization. The authors of [94] split (3) into a data part and a prior part with variable splitting techniques and then replaced the prior part with efficient denoising algorithms. Regarding different degradation cases, one only needs to change denoising algorithms for the prior part, behaving in a so-called plug-and-play manner. Recent works [95, 96, 97] use deep discriminatively trained NNs under different noise levels as denoisers in various inverse problems, and IRCNN [96] is the first one among them to address SISR. In IRCNN, the authors first trained a series of CNN-based denoisers with different noise levels and took back-projection as the reconstruction part. The LR is first processed by several back-projection iterations and then denoised by CNN denoisers with decreasing noise levels along with back-projection. The iteration number is empirically set to 30. In IRCNN, the authors use deep networks to learn a set of image priors and then plug the priors into the reconstruction framework; the experimental results in these cases are better than the contemporary methods that only employ example-based training.
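The alternation between a back-projection data step and a denoising prior step can be sketched as follows. This is a simplified illustration under stated assumptions, not the IRCNN implementation: `nn_up` and `avg_down` stand in for the degradation operator and its approximate inverse, `box_denoise` is a hypothetical placeholder for a trained CNN denoiser, and the sequence `denoisers` plays the role of denoisers with decreasing noise levels.

```python
import numpy as np

def nn_up(a, s):
    """Stand-in upsampler: nearest-neighbor enlargement by factor s."""
    return np.kron(a, np.ones((s, s)))

def avg_down(a, s):
    """Stand-in degradation: s x s average pooling."""
    h, w = a.shape
    return a.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def box_denoise(x):
    """Placeholder denoiser: 3x3 mean filter with edge padding."""
    xp = np.pad(x, 1, mode="edge")
    h, w = x.shape
    return sum(xp[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def plug_and_play_sr(y, s, denoisers):
    """Alternate a data-consistency step and a denoising (prior) step."""
    x = nn_up(y, s)                              # initial HR estimate
    for denoise in denoisers:                    # one denoiser per iteration
        x = x + nn_up(y - avg_down(x, s), s)     # back-project the LR error
        x = denoise(x)                           # apply the plugged-in prior
    return x

# Usage: 30 iterations, matching the empirical setting quoted above.
y = np.random.default_rng(0).random((8, 8))
x_hat = plug_and_play_sr(y, 2, [box_denoise] * 30)
```

Swapping the denoiser changes the prior without touching the data step, which is the appeal of the plug-and-play formulation.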
Recently, Ulyanov et al. showed in [98] that the structure of deep neural networks can capture a considerable amount of low-level image statistical priors. They reported that when neural networks are used to fit images with different statistical properties, the convergence speed differs across kinds of images. As shown in Fig. 9, natural-looking images, whose different parts are highly correlated, converge much faster. In contrast, images such as noise and shuffled images, which have little inner relationship, tend to converge more slowly. Many inverse problems such as denoising and super-resolution are modeled as the pixel-wise summation of the original image and independent additive noise. Based on the observed prior, when used to fit these degraded images, the neural networks tend to fit the natural-looking content first, which can be exploited to retain the natural-looking parts and to filter out the noisy ones. To illustrate the effectiveness of the proposed prior for SISR, given only the LR image $y$, the authors took a fixed random vector $z$ as input to fit the HR image $x$ with a randomly initialized DNN $f_{\theta}$ by optimizing
$$\min_{\theta} \|d(f_{\theta}(z)) - y\|_2^2, \tag{4}$$
where $d(\cdot)$ is a common differentiable downsampling operator. The optimization is terminated early so that only the noisy parts are filtered out. Although these totally unsupervised methods are outperformed by supervised learning methods, they perform considerably better than some naive methods.
Deep architectures with internal examples: Internal-example SISR algorithms are based on the recurrence of small pieces of information across different scales of a single image, and they have been shown to be better at addressing specific details that rarely exist in external images [99]. ZSSR [100], proposed by Shocher et al., is the first work to combine deep architectures with internal-example learning. In ZSSR, no images other than the test image are needed, and all training patches are taken from different degraded pairs of the test image. As demonstrated in [101], the visual entropy inside a single image is much smaller than that of a large training dataset collected from a wide range of sources, so unlike external-example SISR algorithms, a very small CNN is sufficient. As we mentioned previously for VDSR, the training data for a small-scale model can also be useful for training large-scale models. Based on this trick, ZSSR can be made more robust by collecting more internal training pairs with small scale factors for training large-scale models. However, this approach increases the runtime immensely. Notably, when combined with the kernel estimation algorithms mentioned in [102], ZSSR performs quite well with unknown degradation kernels.
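As a rough illustration of how internal training pairs can be built from the test image alone, the sketch below crops patches from the test image and degrades each one to form its LR counterpart. It is a simplification under our own assumptions: the degradation is plain average pooling rather than the kernels used in ZSSR, and the names `downscale` and `internal_pairs` are hypothetical.

```python
import numpy as np

def downscale(img, s):
    """Assumed degradation: average-pool by integer factor s (crops any remainder)."""
    h, w = img.shape
    h, w = h - h % s, w - w % s
    return img[:h, :w].reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def internal_pairs(test_img, scale, patch=8, n_pairs=16, seed=0):
    """Return (LR patch, HR patch) pairs cropped from the test image only."""
    rng = np.random.default_rng(seed)
    h, w = test_img.shape
    size = patch * scale
    pairs = []
    for _ in range(n_pairs):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        hr_patch = test_img[top:top + size, left:left + size]
        lr_patch = downscale(hr_patch, scale)   # degraded version of the same patch
        pairs.append((lr_patch, hr_patch))
    return pairs
```

A small CNN would then be trained on these pairs and applied to the test image itself; that training loop is omitted here.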
Recently, Tirer et al. argued that degradation in the LR image decreases the performance of internal-example algorithms [103]. Therefore, they proposed to use the reconstruction-based deep framework IDBP [97] to obtain an initial SR result and then conduct internal-example-based network training similar to ZSSR. This method is believed to combine two successful techniques that address the mismatch between training and testing, and it has achieved robust performance in these cases.
III-C Comparisons among Different Models and Discussion
In this section, we will summarize recent progress in deep architectures for SISR from two perspectives: quantitative comparisons for those trained by specific blurring, and comparisons on those models for handling nonspecific blurring.
For the first part, quantitative criteria mainly include the following:
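1) PSNR/SSIM [104] for measuring reconstruction quality: Given two images $x$ and $\hat{x}$, both with $N$ pixels, the MSE and peak signal-to-noise ratio (PSNR) are defined as
$$\mathrm{MSE} = \frac{1}{N}\|x - \hat{x}\|_F^2, \tag{5}$$
$$\mathrm{PSNR} = 10\log_{10}\left(\frac{L^2}{\mathrm{MSE}}\right), \tag{6}$$
where $\|\cdot\|_F$ is the Frobenius norm and $L$ is usually 255. The structural similarity index (SSIM) is defined as
$$\mathrm{SSIM}(x, \hat{x}) = \frac{2\mu_x\mu_{\hat{x}} + k_1}{\mu_x^2 + \mu_{\hat{x}}^2 + k_1}\cdot\frac{2\sigma_{x\hat{x}} + k_2}{\sigma_x^2 + \sigma_{\hat{x}}^2 + k_2}, \tag{7}$$
where $\mu_x$ and $\sigma_x^2$ are the mean and variance of $x$, $\sigma_{x\hat{x}}$ is the covariance between $x$ and $\hat{x}$, and $k_1$ and $k_2$ are constant relaxation terms.
2) Number of parameters of NN for measuring storage efficiency (Params).
3) Number of composite multiply-accumulate operations for measuring computational efficiency (Mult&Adds): Since operations in NNs for SISR are mainly multiplications with additions, we use Mult&Adds in CARN [105] to measure computation, assuming that the desired SR is 720p.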
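The first criterion can be made concrete with a short NumPy sketch of (5)-(7). Two points are our assumptions rather than part of the text: the SSIM statistics are computed once over the whole image, whereas common library implementations average SSIM over local sliding windows, and the constants are set to the frequently used defaults $k_1 = (0.01L)^2$ and $k_2 = (0.03L)^2$.

```python
import numpy as np

def mse(x, x_hat):
    """Mean squared error over all pixels, as in (5)."""
    x = np.asarray(x, dtype=np.float64)
    x_hat = np.asarray(x_hat, dtype=np.float64)
    return np.mean((x - x_hat) ** 2)

def psnr(x, x_hat, peak=255.0):
    """Peak signal-to-noise ratio in dB, as in (6); infinite for identical images."""
    err = mse(x, x_hat)
    if err == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / err)

def global_ssim(x, x_hat, peak=255.0):
    """SSIM of (7) with whole-image statistics (assumed simplification)."""
    x = np.asarray(x, dtype=np.float64)
    x_hat = np.asarray(x_hat, dtype=np.float64)
    k1, k2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2   # assumed default constants
    mu_x, mu_y = x.mean(), x_hat.mean()
    var_x, var_y = x.var(), x_hat.var()
    cov = np.mean((x - mu_x) * (x_hat - mu_y))
    luminance = (2 * mu_x * mu_y + k1) / (mu_x ** 2 + mu_y ** 2 + k1)
    structure = (2 * cov + k2) / (var_x + var_y + k2)
    return luminance * structure
```

In SISR papers these metrics are often computed on the luminance channel only and after cropping image borders, so reported numbers depend on such protocol details as well as on the formula.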
Notably, it has been shown in [48] and [49] that the training datasets have a great influence on the final performance, and usually, more abundant training data lead to better results. Generally, these models are trained on three main datasets: 1) 91 images from [19] and 200 images from [106], called the 291 dataset (some models use only the 91 images); 2) images randomly sampled from ImageNet [107]; and 3) the newly published DIV2K dataset [108]. In addition to the different number of images each dataset contains, the image quality also differs. Images in the 291 dataset are usually small, images in ImageNet are much larger, and images in DIV2K are of very high quality. Because of the restricted resolution of the images in the 291 dataset, models trained on this set have difficulty obtaining large patches with large receptive fields. Therefore, models based on the 291 dataset usually take the bicubic-interpolated LR image as input, which is quite time-consuming. Table I compares different models on the mentioned criteria.
From Table I, we can see that generally as the depth and the number of parameters grow, the performance improves. However, the growth rate of performance levels off. Recently, some works on designing light models [109, 105, 110] and learning sparse structural NN [111] were proposed to achieve relatively good performance with less storage and computation, which are very meaningful in practice.
For the second part, we mainly show that the performance of the models for some specific degradation dropped drastically when the true degradation mismatches the one assumed for training. For example, we use four models, including EDSR trained with bicubic degradation [71], IRCNN [96], SRMD [92] and ZSSR [100], to address LRs generated by Gaussian kernel degradation (kernel size of with bandwidth 1.6), as shown in Fig. 10, and the performance of EDSR dropped drastically with obvious blur, while other models for nonspecific degradation perform quite well. Therefore, to address some longstanding problems in SISR, such as unknown degradation, the direct usage of general deep learning techniques may not be sufficient. More effective solutions can be achieved by combining the power of DL and the specific properties of the SISR scene.
IV Optimization Objectives for DL-based SISR
IV-A Benchmark of Optimization Objectives for DL-based SISR
We select the MSE loss used in SRCNN as the benchmark. It is known that using MSE favors a high PSNR, and PSNR is a widely used metric for quantitatively evaluating image restoration quality. Optimizing MSE can be viewed as a regression problem, leading to a point estimation of $\theta$ as
$$\hat{\theta} = \arg\min_{\theta} \sum_{i} \|x_i - F(y_i;\theta)\|_2^2, \tag{8}$$
where $(y_i, x_i)$ are the $i$th training examples and $F(\cdot;\theta)$ is a CNN parameterized by $\theta$. Here, (8) can be interpreted in a probabilistic way by assuming Gaussian white noise ($\mathcal{N}(\epsilon; 0, \sigma^2 I)$) independent of the image in the regression model, and then, the conditional probability of $x$ given $y$ becomes a Gaussian distribution with mean $F(y;\theta)$ and the diagonal covariance matrix $\sigma^2 I$, where $I$ is the identity matrix:
$$p(x|y) = \mathcal{N}(x; F(y;\theta), \sigma^2 I). \tag{9}$$
Then, using maximum likelihood estimation (MLE) on the training examples with (9) will lead to (8).
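The step from (9) to (8) can be spelled out in one short derivation. The notation is assumed for illustration: $x_i$ and $y_i$ denote an HR/LR training pair, $F(\cdot;\theta)$ the network, $\sigma^2$ the fixed noise variance, and $n$ the number of pixels per image.

```latex
\begin{aligned}
\hat{\theta}_{\mathrm{MLE}}
&= \arg\max_{\theta} \sum_{i} \log \mathcal{N}\!\left(x_i;\, F(y_i;\theta),\, \sigma^2 I\right) \\
&= \arg\max_{\theta} \sum_{i} \left[ -\frac{1}{2\sigma^2}\,\lVert x_i - F(y_i;\theta)\rVert_2^2
   - \frac{n}{2}\log\!\left(2\pi\sigma^2\right) \right] \\
&= \arg\min_{\theta} \sum_{i} \lVert x_i - F(y_i;\theta)\rVert_2^2 .
\end{aligned}
```

The second term inside the brackets and the factor $1/(2\sigma^2)$ do not depend on $\theta$, which is why the noise level drops out of the point estimate.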
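The Kullback-Leibler divergence (KLD) between the conditional empirical distribution $P_{data}$ and the conditional model distribution $P_{model}$ is defined as
$$D_{KL}(P_{data}\|P_{model}) = \mathbb{E}_{z\sim P_{data}}[\log P_{data}(z)] - \mathbb{E}_{z\sim P_{data}}[\log P_{model}(z)]. \tag{10}$$
We call (10) the forward KLD, where $z$ denotes the HR (SR) image conditioned on its LR counterpart, and $P_{data}$ and $P_{model}$ are the conditional distributions of the HR and SR images given the LR image, respectively. The term $\mathbb{E}_{z\sim P_{data}}[\log P_{data}(z)]$ is an intrinsic term determined by the training data regardless of the parameter $\theta$ of the model (or the model distribution $P_{model}$). Hence, when we use the training samples to estimate the parameter $\theta$, minimizing this KLD is equivalent to MLE.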
Here, we have demonstrated that MSE is a special case of MLE, and MLE is a special case of KLD. However, we may conjecture whether the assumptions underlying these specializations are violated. This consideration has led to some emerging objective functions from four perspectives:
1) Translating MLE into MSE can be achieved by assuming Gaussian white noise. Although the Gaussian model is the most widely used model for its simplicity and technical support, what if this independent Gaussian noise assumption is violated in a complicated scene such as SISR?
2) To use MLE, we need to assume the parametric form of the data distribution. What if the parametric form is misspecified?
3) Apart from KLD in (10), are there any other distances between probability measures that we can use as the optimization objectives for SISR?
4) Under specific circumstances, how can we choose the suitable objective functions according to their properties?
IV-B Objective Functions Based on Non-Gaussian Additive Noises
The poor perceptual quality of the SISR images obtained by optimizing MSE directly demonstrates a fact: using Gaussian additive noise in the HR space is not good enough. To address this problem, solutions are proposed from two aspects: use other distributions for this additive noise, or transfer the HR space to some space where the Gaussian noise is reasonable.
IV-B1 Denote Additive Noise with Other Probability Distributions
In [112], Zhao et al. investigated the difference between the mean absolute error (MAE) and MSE when used to optimize NNs in image processing. Similar to (8), MAE can be written as
$$\hat{\theta} = \arg\min_{\theta} \sum_{i} \|x_i - F(y_i;\theta)\|_1. \tag{11}$$
From the perspective of probability, (11) can be interpreted as introducing Laplacian white noise, and similar to (9), the conditional probability becomes
$$p(x|y) = \mathrm{Laplace}(x; F(y;\theta), bI). \tag{12}$$
Compared with MSE in regression, MAE is believed to be more robust against outliers. As reported in [112], when MAE is used to optimize an NN, the NN tends to converge faster and produce better results. The authors argued that the reason might be that MAE guides the NN toward a better local minimum. Other similar loss functions in robust statistics can be viewed as modeling additive noise with other probability distributions.
Although these specific distributions often cannot represent unknown additive noise very precisely, their corresponding robust statistical loss functions are used in many DL-based SISR works for their conciseness and advantages over MSE.
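The robustness claim can be checked on a toy regression with a single constant prediction: the MSE-optimal constant is the sample mean, which an outlier drags away, whereas the MAE-optimal constant is the sample median. The data and the helper `best_constant` below are made up for illustration.

```python
import numpy as np

def best_constant(targets, loss):
    """Grid-search the constant prediction c that minimises loss(targets - c)."""
    grid = np.linspace(targets.min(), targets.max(), 10001)
    costs = [loss(targets - c) for c in grid]
    return grid[int(np.argmin(costs))]

targets = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100.0 plays the role of an outlier

c_mse = best_constant(targets, lambda r: np.mean(r ** 2))      # close to the mean
c_mae = best_constant(targets, lambda r: np.mean(np.abs(r)))   # close to the median
```

Here `c_mse` lands near 22, far from the bulk of the data, while `c_mae` stays near 3. For an image regressor, the analogous effect is that a few large residuals dominate the MSE gradient much more than the MAE gradient.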
IV-B2 Using MSE in a Transformed Space
Alternatively, we can search for a mapping that transforms the HR space to some space where Gaussian white noise can be used reasonably. From this perspective, Bruna et al. [113] proposed the so-called perceptual loss to leverage deep architectures. In [113], the conditional probability of the residual $r$ between HR and LR given the LR $y$ is modeled by the Gibbs energy model:
$$p(r|y) = \frac{1}{Z}\exp\left(-\|\Phi(y) - \Psi(r)\|_2^2\right), \tag{13}$$
where $\Phi$ and $\Psi$ are two mappings between the original spaces and the transformed ones, and $Z$ is the partition function. The features produced by sophisticated supervised deep architectures have been shown to be perceptually stable and discriminative, and are denoted by $\Psi$ (either the scattering network or VGG can be denoted by $\Psi$; when $\Psi$ is VGG, there is no residual learning and fine-tuning). Then, $\Psi$ represents the corresponding deep architectures. In contrast, $\Phi$ is the mapping between the LR space and the manifold represented by $\Psi$, trained by minimizing the Euclidean distance as
$$\hat{\Phi} = \arg\min_{\Phi} \sum_{i} \|\Phi(y_i) - \Psi(r_i)\|_2^2. \tag{14}$$
After $\Phi$ is obtained, the final result $\hat{r}$ can be inferred with SGD by solving
$$\hat{r} = \arg\min_{r} \|\Phi(y) - \Psi(r)\|_2^2. \tag{15}$$
For further improvement, [113] also proposed a fine-tuning algorithm in which $\Phi$ and $\Psi$ can be fine-tuned to the data. Similar to the alternating updating in GAN, $\Phi$ and $\Psi$ are fine-tuned with SGD based on the current $\hat{r}$. However, this fine-tuning involves calculating the gradient of the partition function $Z$, which requires the well-known and difficult decomposition into the positive phase and the negative phase of learning. Hence, to avoid sampling within inner loops, a biased estimator of this gradient is chosen for simplicity.
The inference algorithm in [113] is extremely time-consuming. To improve efficiency, Johnson et al. utilized this perceptual loss in an end-to-end training manner [114]. In [114], the SISR network is directly optimized with SGD by minimizing the MSE in the feature manifold produced by VGG16 as follows:
$$\hat{\theta} = \arg\min_{\theta} \|\Psi(F(y;\theta)) - \Psi(x)\|_2^2, \tag{16}$$
where $\Psi$ is the mapping represented by VGG16, $F(\cdot;\theta)$ denotes the SISR network, and $x$ is the ground truth. Compared with [113], [114] replaces the nonlinear mapping and the expensive inference with an end-to-end trained CNN, and their results show that this change does not affect the restoration quality but does accelerate the whole process.
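To show the shape of such a feature-space loss without depending on pretrained weights, the sketch below swaps VGG16 for a hypothetical fixed feature extractor (one layer of random 3x3 filters followed by ReLU). Only the structure of the computation is meant to carry over; the filters, names and sizes are our assumptions.

```python
import numpy as np

_rng = np.random.default_rng(0)
FILTERS = _rng.standard_normal((8, 3, 3))   # hypothetical fixed filters, not VGG16

def features(img):
    """Stand-in feature extractor: 3x3 cross-correlation with 8 filters, then ReLU."""
    h, w = img.shape
    p = np.pad(img, 1, mode="edge")
    # Gather each pixel's 3x3 neighbourhood into the last axis: shape (h, w, 9).
    patches = np.stack([p[i:i + h, j:j + w] for i in range(3) for j in range(3)], axis=-1)
    maps = patches @ FILTERS.reshape(8, 9).T   # shape (h, w, 8)
    return np.maximum(maps, 0.0)

def pixel_loss(sr, hr):
    """MSE in the image (HR) space."""
    return np.mean((sr - hr) ** 2)

def perceptual_loss(sr, hr):
    """MSE between feature maps of the SR output and the ground truth."""
    return np.mean((features(sr) - features(hr)) ** 2)
```

In an end-to-end setting, the gradient of `perceptual_loss` with respect to the SR output would be back-propagated into the SISR network while the feature extractor stays frozen.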
Perceptual loss mitigates blurring and leads to more visually pleasing results compared with directly optimizing MSE in the HR space. However, there remains no theoretical analysis of why this approach works. In [113], the authors generally concluded that successful supervised networks used for high-level tasks could produce very compact and stable features. In these feature spaces, small pixel-level variations and much other trivial information can be omitted, making these feature maps mainly focus on pixels of human interest. At the same time, with the deep architectures, the most specific and discriminative information of the input is shown to be retained in feature spaces because of the great performance of the models applied in various high-level tasks. From this perspective, using MSE in these feature spaces will focus more on the parts that are attractive to human observers with little loss of original content, so perceptually pleasing results can be obtained.
IV-C Optimizing Forward KLD with Nonparametric Estimation
Parametric estimation methods such as MLE need to specify the parametric form of the data distribution in advance, and therefore suffer from model misspecification. In contrast, nonparametric estimation methods such as kernel density estimation (KDE) fit the data without distributional assumptions and are robust when the real distributional form is unknown. Based on nonparametric estimation, the contextual loss [115, 116] was recently proposed by Mechrez et al. to maintain natural image statistics. In the contextual loss, a Gaussian kernel function is applied:
(17) 
where can be any symmetric distance between and , is the bandwidth, and the partition function . Then, and are
(18) 
and (10) can be rewritten as
(19) 
The first log term in (19) is a constant with respect to the model parameters. Let us denote the kernel in the second log term by . Then, the optimization objective in (19) can be rewritten as
(20) 
With the Jensen inequality, we can obtain a lower bound of (20):
(21) 
The first equality holds if and only if , . Both equalities hold if and only if . When (20) reaches 0, the given lower bound also reaches 0. Therefore, we can take this lower bound as the optimization objective alternatively.
We can further simplify the lower bound in (21). The lower bound can be rewritten as
(22) 
where , and is the norm. When the bandwidth , the affinity will degrade into the indicator function, which means if , ; otherwise, . In this case, the norm can be approximated well by the norm, which returns the maximum element of the vector. Thus, (22) can degenerate into the contextual loss in [115, 116]:
(23) 
Recently, implicit likelihood estimation (IMLE) [117] was proposed and its conditional version was applied to SISR [118]. Here, we will briefly show that minimizing IMLE equals minimizing an upper bound of the forward KLD with KDE. Let us use a Gaussian kernel as
(24) 
As with (20), the optimization objective can be rewritten as
(25) 
With and , we can obtain a simple upper bound of (25) as
(26) 
Minimizing (26) equals minimizing
(27) 
which is the core of the optimization objective of IMLE.
As above, the recently proposed contextual loss and IMLE are illustrated via nonparametric estimation and KLD. Visually pleasing results were reported using the contextual loss and IMLE. However, as KDE is generally very timeconsuming, several reasonable approximations along with acceleration algorithms were applied.
IV-D Other Distances between Probability Measures Used in SISR
As KLD is an asymmetric (pseudo) distance for measuring similarity between two distributions, in this subsection, we begin with the inverse form of forward KLD, namely, backward KLD. The backward KLD is defined as
$$D_{KL}(P_{model}\|P_{data}) = \mathbb{E}_{z\sim P_{model}}[\log P_{model}(z)] - \mathbb{E}_{z\sim P_{model}}[\log P_{data}(z)], \tag{28}$$
where $P_{data}$ and $P_{model}$ are the conditional data and model distributions in (10). When $P_{model} = P_{data}$, both KLDs reach the minimum of 0. However, when the solution is inadequate, these two KLDs lead to quite different results. Here, we use a toy example to illustrate a simple case of inadequate solutions, as shown in Fig. 11.
The unknown target distribution is a Gaussian mixture model (GMM) with two modes, denoted as $P$, and we model it by a single Gaussian distribution $Q$. We can easily see that optimizing the forward KLD results in a solution located in the middle area between the two modes, while optimizing the backward KLD makes the result close to the most prominent mode.
From Fig. 11 we can see that, under inadequate solutions, optimizing the forward KLD leads to the well-known regression-to-the-mean problem, while optimizing the backward KLD only concentrates on the main modality. The former is one of the reasons for blurring, and some researchers [119] argued that the latter improves the visual quality but makes the results collapse to some patterns.
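The toy example can be reproduced numerically with a brute-force search. The mixture weights, means and grid ranges below are our own choices for illustration and are not taken from Fig. 11; both divergences are evaluated by simple numerical integration on a 1-D grid.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]
# Assumed two-mode target: a dominant mode at -3 and a smaller one at +3.
p = 0.7 * gaussian(x, -3.0, 0.7) + 0.3 * gaussian(x, 3.0, 0.7)

def kl(a, b):
    """Numerical KL(a || b) on the grid; the tiny eps only guards against log(0)."""
    eps = 1e-300
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

def fit(direction):
    """Grid-search the single Gaussian q that minimises the chosen divergence."""
    best = None
    for mu in np.linspace(-5.0, 5.0, 101):
        for sigma in np.linspace(0.3, 5.0, 48):
            q = gaussian(x, mu, sigma)
            cost = kl(p, q) if direction == "forward" else kl(q, p)
            if best is None or cost < best[0]:
                best = (cost, mu, sigma)
    return best[1], best[2]

mu_f, sigma_f = fit("forward")    # mass-covering: wide Gaussian between the modes
mu_b, sigma_b = fit("backward")   # mode-seeking: narrow Gaussian on the main mode
```

With these settings the forward fit is centred near the mixture mean (about -1.2) with a large spread, while the backward fit sits on the dominant mode near -3 with a spread close to that of the mode itself.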
Different distances may lead to different results under an inadequate solution. Readers can refer to [120] for further understanding. In most low-level computer vision tasks, $P_{data}$ is an empirical distribution and $P_{model}$ is an intractable distribution. For this reason, the backward KLD is impractical for optimizing deep architectures. To relieve optimization difficulties, we replace the asymmetric KLD with the symmetric Jensen-Shannon divergence (JSD) as follows:
$$\mathrm{JSD}(P_{data}\|P_{model}) = \frac{1}{2}D_{KL}\left(P_{data}\,\Big\|\,\frac{P_{data}+P_{model}}{2}\right) + \frac{1}{2}D_{KL}\left(P_{model}\,\Big\|\,\frac{P_{data}+P_{model}}{2}\right). \tag{29}$$
Optimizing (29) explicitly is also very difficult. Generative adversarial nets (GANs) proposed by Goodfellow et al. use the objective function below to implicitly address this problem in a game theory scenario, successfully avoiding the troublesome approximate inference and approximation of the partition function gradient:
$$\min_{G}\max_{D}\; \mathbb{E}_{z\sim P_{data}}[\log D(z)] + \mathbb{E}_{z\sim P_{model}}[\log(1 - D(z))], \tag{30}$$
where $G$ is the main part, called the generator, which induces $P_{model}$ and is supervised by an auxiliary part $D$ called the discriminator. The two parts are updated alternately, and when the discriminator can no longer give useful information to the generator, in other words, when the outputs of the generator totally confuse the discriminator, the optimization procedure is completed. For a detailed discussion of GANs, readers can refer to [45]. Recent works have shown that sophisticated architectures and suitable hyperparameters can help GANs perform excellently. The representative works on GAN-based SISR are [68] and [121]. In [68], the generator of the GAN is the SRResNet mentioned previously, and the discriminator follows the design criterion of DCGAN [54]. In the context of GANs, a recent work [121] follows a similar path except with a different architecture. Very recently, by leveraging the extension of the basic GAN framework [122], [123] was proposed as an unsupervised SR algorithm. Fig. 12 shows the results of the GAN and MSE with the same architecture; despite the lower PSNR due to artifacts, the visual quality improves by using the GAN for SISR.
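The link between the adversarial objective and JSD can be verified numerically on small discrete distributions. For a fixed generator distribution, the discriminator that maximises the value function $V(D, G) = \mathbb{E}_{P_{data}}[\log D] + \mathbb{E}_{P_{model}}[\log(1 - D)]$ is $D^{*} = P_{data}/(P_{data} + P_{model})$, and the value at $D^{*}$ equals $2\,\mathrm{JSD} - \log 4$, as shown in [45]. The two four-bin distributions below are arbitrary illustrative values.

```python
import numpy as np

def kl(a, b):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(a * np.log(a / b)))

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value(p_data, p_model, d):
    """V(D, G) = E_{p_data}[log D] + E_{p_model}[log(1 - D)] for discrete bins."""
    return float(np.sum(p_data * np.log(d)) + np.sum(p_model * np.log(1.0 - d)))

p_data = np.array([0.1, 0.2, 0.3, 0.4])
p_model = np.array([0.4, 0.3, 0.2, 0.1])

d_star = p_data / (p_data + p_model)              # optimal discriminator per bin
v_star = gan_value(p_data, p_model, d_star)
v_blind = gan_value(p_data, p_model, np.full(4, 0.5))   # uninformative discriminator
```

Here `v_star` matches `2 * jsd(p_data, p_model) - np.log(4)`, and `v_blind` equals `-np.log(4)`, the value reached when the generator distribution is indistinguishable from the data.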
Generally, GANs offer an implicit optimization strategy through adversarial training with deep neural networks. Based on this, more rational but complicated measures such as Wasserstein distances [124], $f$-divergence [125] (forward KLD, backward KLD and JSD can all be regarded as special cases of $f$-divergence) and maximum mean discrepancy (MMD) [126] are taken as alternatives to JSD for training GANs.
IV-E Characteristics of Different Objective Functions
Now, we can see that the losses mentioned in Section IV-B explicitly model the relation between the LR image and its HR counterpart. Here, we follow the methodology of [127] and call the losses that are based on measuring the dissimilarity between training pairs the distortion-aimed losses. When the training data are not sufficient, distortion-aimed losses usually ignore the particularity of the data and appear ineffective at measuring the similarity between the source and target distributions.
The losses mentioned in Sections IV-C and IV-D are rooted in measuring the similarity between distributions, which is thought to measure the perceptual quality. Here, we call them perception-aimed losses. Recently, Blau et al. [127] discussed the inherent tradeoff between the two kinds of losses. Their discussion can be simplified into an optimization problem:
$$P(D) = \min_{P_{\hat{X}|Y}} d(P_X, P_{\hat{X}}) \quad \text{s.t.} \quad \mathbb{E}[\Delta(X, \hat{X})] \le D, \tag{31}$$
where $\Delta$ is the distortion-aimed loss, and $d$ is the (pseudo) distance between distributions. Furthermore, the authors also proved that if $d(\cdot,\cdot)$ is convex in its second argument, then $P(D)$ is monotonically nonincreasing and convex. From this property, we can draw the curve of $P(D)$ and easily see this tradeoff, as shown in Fig. 13(a): improving one must come at the expense of the other. However, as shown in Section IV-B, using MSE in the VGG feature space achieves better quality, and choosing suitable $\Delta$ and $d$ may ease this tradeoff.
For the perception-aimed losses mentioned in Sections IV-C and IV-D, there has so far been no rigorous analysis of their differences. Here, we apply the no-reference quality assessment proposed by Ma et al. [95] together with RMSE to conduct quantitative comparisons, and the representative qualitative comparisons are depicted in Fig. 13(b). To summarize, we should be aware that there is no one-size-fits-all objective function, and we should choose one that suits the context of the application.
V Trends and Challenges
Along with the promising performance that DL algorithms have achieved in SISR, there remain several important challenges and inherent trends as follows.
V-1 Lighter Deep Architectures for Efficient SISR
Although advanced deep models have achieved high accuracy for SISR, it is still difficult to deploy these models in real-world scenarios, mainly because of their massive parameters and computation. To address this issue, we need to design light deep models or slim the existing deep models for SISR so that they use fewer parameters and less computation with little or no performance degradation. Hence, in the future, researchers are expected to focus more on reducing the size of NNs to speed up the SISR process.
V-2 More Effective DL Algorithms for Large-scale SISR and SISR with Unknown Corruption
Generally, DL algorithms proposed in recent years have improved the performance of traditional SISR tasks by a large margin. However, large-scale SISR and SISR with unknown corruption, the two major challenges in the SR community, still lack very effective remedies. DL algorithms are thought to be skilled at addressing many inference or unsupervised problems, which is of key importance for these two challenges. Therefore, by leveraging the great power of DL, more effective solutions to these two demanding problems are expected.
V-3 Theoretical Understanding of Deep Models for SISR
The success of deep learning is said to be attributed to learning powerful representations. However, to date, we still cannot understand these representations very well, and the deep architectures are treated as a black box. For DLbased SISR, the deep architectures are often viewed as a universal approximation, and the learned representations are often omitted for simplicity. This behavior is not beneficial for further exploration. Therefore, we should not only focus on whether a deep model works but also concentrate on why and how it works. That is, more theoretical explorations are needed.
V-4 More Rational Assessment Criteria for SISR in Different Applications
In many applications, we need to design the desired objective function for a specific application. However, in most cases, we cannot give an explicit and precise definition to assess the requirement for the application, which leads to the vagueness of the optimization objectives. Many works, although for different purposes, simply employ MSE as the assessment criterion, which has been shown as a poor criterion in many cases. In the future, we think that it is of great necessity to make clear definitions for assessments in various applications. Based on these criteria, we can design better targeted optimization objectives and compare algorithms in the same context more rationally.
VI Conclusion
This paper presents a brief review of recent deep learning algorithms on SISR. It divides the recent works into two categories: the deep architectures for simulating the SISR process and the optimization objectives for optimizing the whole process. Despite the promising results reported so far, there are still many underlying problems. We summarize the main challenges into three aspects: the acceleration of deep models, the extensive comprehension of deep models and the criteria for designing and evaluating the objective functions. Along with these challenges, several directions may be further explored in the future.
Acknowledgment
We are grateful to the authors of [47, 84, 71, 61, 68, 121, 116, 114, 96, 92, 100] for kindly releasing their experimental results or codes, as well as to the three anonymous reviewers for their constructive criticism, which has significantly improved our manuscript. Moreover, we thank Qiqi Bao for helpful discussions.
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
 [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 [3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
 [4] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the International Conference on Machine Learning, 2008, pp. 160–167.
 [5] C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 372–386.

 [6] R. Timofte, R. Rothe, and L. Van Gool, “Seven ways to improve example-based single image super resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1865–1873.
 [7] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [8] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing humanlevel performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
 [9] S. C. Park, M. K. Park, and M. G. Kang, “Superresolution image reconstruction: a technical overview,” IEEE Signal Processing Magazine, vol. 20, no. 3, pp. 21–36, 2003.
 [10] R. Keys, “Cubic convolution interpolation for digital image processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 6, pp. 1153–1160, 1981.
 [11] C. E. Duchon, “Lanczos filtering in one and two dimensions,” Journal of Applied Meteorology, vol. 18, no. 8, pp. 1016–1022, 1979.
 [12] S. Dai, M. Han, W. Xu, Y. Wu, Y. Gong, and A. K. Katsaggelos, “Softcuts: a soft edge smoothness prior for color image superresolution,” IEEE Transactions on Image Processing, vol. 18, no. 5, pp. 969–981, 2009.
 [13] J. Sun, Z. Xu, and H.Y. Shum, “Image superresolution using gradient profile prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
 [14] Q. Yan, Y. Xu, X. Yang, and T. Q. Nguyen, “Single image superresolution based on gradient profile sharpness,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3187–3202, 2015.
 [15] A. Marquina and S. J. Osher, “Image superresolution by TVregularization and Bregman iteration,” Journal of Scientific Computing, vol. 37, no. 3, pp. 367–382, 2008.
 [16] W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Examplebased superresolution,” IEEE Computer Graphics and Applications, vol. 22, no. 2, pp. 56–65, 2002.
 [17] H. Chang, D.Y. Yeung, and Y. Xiong, “Superresolution through neighbor embedding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004, pp. 275–282.
 [18] M. Aharon, M. Elad, A. Bruckstein et al., “KSVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, p. 4311, 2006.
 [19] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image superresolution via sparse representation,” IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
 [20] R. Zeyde, M. Elad, and M. Protter, “On single image scaleup using sparserepresentations,” in Proceedings of the International Conference on Curves and Surfaces, 2010, pp. 711–730.
 [21] R. Timofte, V. De Smet, and L. Van Gool, “Anchored neighborhood regression for fast example-based super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1920–1927.
 [22] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast superresolution,” in Proceedings of the Asian Conference on Computer Vision, 2014, pp. 111–126.
 [23] F. Cao, M. Cai, Y. Tan, and J. Zhao, “Image superresolution via adaptive regularization and sparse representation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 7, pp. 1550–1561, 2016.
 [24] J. Liu, W. Yang, X. Zhang, and Z. Guo, “Retrieval compensated group structured sparsity for image superresolution,” IEEE Transactions on Multimedia, vol. 19, no. 2, pp. 302–316, 2017.
 [25] S. Schulter, C. Leistner, and H. Bischof, “Fast and accurate image upscaling with superresolution forests,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3791–3799.
 [26] K. Zhang, D. Tao, X. Gao, X. Li, and J. Li, “Coarsetofine learning for singleimage superresolution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 5, pp. 1109–1122, 2017.
 [27] J. Yu, X. Gao, D. Tao, X. Li, and K. Zhang, “A unified learning framework for single image superresolution,” IEEE Transactions on Neural Networks and Learning systems, vol. 25, no. 4, pp. 780–792, 2014.
 [28] C. Deng, J. Xu, K. Zhang, D. Tao, X. Gao, and X. Li, “Similarity constraintsbased structured output regression machine: An approach to image superresolution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 12, pp. 2472–2485, 2016.
 [29] W. Yang, Y. Tian, F. Zhou, Q. Liao, H. Chen, and C. Zheng, “Consistent coding scheme for singleimage superresolution via independent dictionaries,” IEEE Transactions on Multimedia, vol. 18, no. 3, pp. 313–325, 2016.
 [30] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
 [31] H. A. Song and S.Y. Lee, “Hierarchical representation using NMF,” in Proceedings of the International Conference on Neural Information Processing, 2013, pp. 466–473.
 [32] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
 [33] N. Rochester, J. Holland, L. Haibt, and W. Duda, “Tests on a cell assembly theory of the action of the brain, using a large digital computer,” IRE Transactions on Information Theory, vol. 2, no. 3, pp. 80–93, 1956.
 [34] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by backpropagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.
 [35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
 [36] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
 [37] Y. Bengio, P. Simard, and P. Frasconi, “Learning longterm dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
 [38] J. F. Kolen and S. C. Kremer, Gradient Flow in Recurrent Nets: The Difficulty of Learning LongTerm Dependencies. IEEE, 2001. [Online]. Available: https://ieeexplore.ieee.org/document/5264952
 [39] G. E. Hinton, “Learning multiple layers of representation,” Trends in Cognitive Sciences, vol. 11, no. 10, pp. 428–434, 2007.
 [40] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “Flexible, high performance convolutional neural networks for image classification,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2011, pp. 1237–1242.
 [41] D. Cireşan, U. Meier, J. Masci, and J. Schmidhuber, “Multicolumn deep neural network for traffic sign classification,” Neural Networks, vol. 32, pp. 333–338, 2012.
 [42] R. Salakhutdinov and H. Larochelle, “Efficient learning of deep Boltzmann machines,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010, pp. 693–700.
 [43] D. P. Kingma and M. Welling, “Autoencoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
 [44] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” arXiv preprint arXiv:1401.4082, 2014.
 [45] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
 [46] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge: MIT Press, 2016, vol. 1.
 [47] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image superresolution,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 184–199.
 [48] ——, “Image superresolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
 [49] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Realtime single image and video superresolution using an efficient subpixel convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
 [50] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 2018–2025.
 [51] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016.
 [52] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 818–833.
 [53] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
 [54] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
 [55] W. Shi, J. Caballero, L. Theis, F. Huszár, A. Aitken, C. Ledig, and Z. Wang, “Is the deconvolution layer the same as a convolutional layer?” arXiv preprint arXiv:1609.07009, 2016.
 [56] C. Dong, C. C. Loy, and X. Tang, “Accelerating the superresolution convolutional neural network,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 391–407.
 [57] N. Efrat, D. Glasner, A. Apartsin, B. Nadler, and A. Levin, “Accurate blur models vs. image priors in single image superresolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2832–2839.
 [58] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.
 [59] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A system for largescale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
 [60] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 2924–2932.
 [61] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image superresolution using very deep convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654.
 [62] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [63] J. Kim, J. K. Lee, and K. M. Lee, “Deeplyrecursive convolutional network for image superresolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1637–1645.
 [64] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [65] ——, “Identity mappings in deep residual networks,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 630–645.
 [66] A. Veit, M. J. Wilber, and S. Belongie, “Residual networks behave like ensembles of relatively shallow networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 550–558.
 [67] D. Balduzzi, M. Frean, L. Leary, J. Lewis, K. W.D. Ma, and B. McWilliams, “The shattered gradients problem: If resnets are the answer, then what is the question?” in Proceedings of the International Conference on Machine Learning, 2017, pp. 342–350.
 [68] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photorealistic single image superresolution using a generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
 [69] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the International Conference on Machine Learning, 2015, pp. 448–456.
 [70] Y. Tai, J. Yang, and X. Liu, “Image superresolution via deep recursive residual network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3147–3155.
 [71] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image superresolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.

 [72] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inceptionv4, inceptionresnet and the impact of residual connections on learning,” in Proceedings of the Association for the Advancement of Artificial Intelligence, 2017, pp. 4278–4284.
 [73] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
 [74] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2017, pp. 4470–4478.
 [75] T. Tong, G. Li, X. Liu, and Q. Gao, “Image superresolution using dense skip connections,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4809–4817.
 [76] Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: A persistent memory network for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4539–4547.

 [77] S. Hochreiter and J. Schmidhuber, “Long shortterm memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [78] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image superresolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
 [79] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, “Deep networks for image superresolution with sparse prior,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 370–378.
 [80] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the International Conference on Machine Learning, 2010, pp. 399–406.
 [81] D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang, “Robust single image superresolution via deep networks with sparse prior,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3194–3207, 2016.
 [82] D. Liu, Z. Wang, N. Nasrabadi, and T. Huang, “Learning a mixture of deep networks for single image superresolution,” in Proceedings of the Asian Conference on Computer Vision, 2016, pp. 145–156.
 [83] W. Yang, J. Feng, J. Yang, F. Zhao, J. Liu, Z. Guo, and S. Yan, “Deep edge guided recurrent residual learning for image superresolution,” IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5895–5907, 2017.
 [84] W.S. Lai, J.B. Huang, N. Ahuja, and M.H. Yang, “Deep Laplacian pyramid networks for fast and accurate superresolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 624–632.
 [85] R. Dahl, M. Norouzi, and J. Shlens, “Pixel recursive super resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5439–5448.
 [86] A. Singh and N. Ahuja, “Superresolution using subband selfsimilarity,” in Proceedings of the Asian Conference on Computer Vision, 2014, pp. 552–568.
 [87] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in Proceedings of the International Conference on Machine Learning, 2016, pp. 1747–1756.
 [88] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with PixelCNN decoders,” in Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
 [89] M. Irani and S. Peleg, “Improving resolution by image registration,” CVGIP: Graphical Models and Image Processing, vol. 53, no. 3, pp. 231–239, 1991.
 [90] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep backprojection networks for superresolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1664–1673.
 [91] X. Wang, K. Yu, C. Dong, and C. C. Loy, “Recovering realistic texture in image superresolution by deep spatial feature transform,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615.
 [92] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional superresolution network for multiple degradations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3262–3271.
 [93] R. Timofte, V. De Smet, and L. Van Gool, “Semantic superresolution: When and where is it useful?” Computer Vision and Image Understanding, vol. 142, pp. 1–12, 2016.
 [94] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plugandplay priors for model based reconstruction,” in Proceedings of the IEEE Global Conference on Signal and Information Processing, 2013, pp. 945–948.
 [95] T. Meinhardt, M. Moller, C. Hazirbas, and D. Cremers, “Learning proximal operators: Using denoising networks for regularizing inverse imaging problems,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1781–1790.
 [96] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3929–3938.
 [97] T. Tirer and R. Giryes, “Image restoration by iterative denoising and backward projections,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1220–1234, 2019.
 [98] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9446–9454.
 [99] K. Zhang, X. Gao, D. Tao, and X. Li, “Single image superresolution with multiscale similarity learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1648–1659, 2013.
 [100] A. Shocher, N. Cohen, and M. Irani, “‘Zeroshot’ superresolution using deep internal learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3118–3126.
 [101] M. Zontak and M. Irani, “Internal statistics of a single natural image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2011, pp. 977–984.
 [102] T. Michaeli and M. Irani, “Nonparametric blind superresolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 945–952.
 [103] T. Tirer and R. Giryes, “Superresolution based on imageadapted CNN denoisers: Incorporating generalization of training data and internal learning in test time,” arXiv preprint arXiv:1811.12866, 2018.
 [104] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
 [105] N. Ahn, B. Kang, and K.A. Sohn, “Fast, accurate, and lightweight superresolution with cascading residual network,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 252–268.
 [106] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proceedings of the IEEE International Conference on Computer Vision, 2001, pp. 416–423.
 [107] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and F.F. Li, “ImageNet: A largescale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
 [108] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image superresolution: Dataset and study,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 126–135.
 [109] Z. Yang, K. Zhang, Y. Liang, and J. Wang, “Single image superresolution with a parameter economic residuallike convolutional neural network,” in Proceedings of the International Conference on Multimedia Modeling, 2017, pp. 353–364.
 [110] Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image superresolution via information distillation network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 723–731.
 [111] X. Fan, Y. Yang, C. Deng, J. Xu, and X. Gao, “Compressed multiscale feature fusion network for single image superresolution,” Signal Processing, vol. 146, pp. 50–60, 2018.
 [112] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for neural networks for image processing,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47–51, 2017.
 [113] J. Bruna, P. Sprechmann, and Y. LeCun, “Superresolution with deep convolutional sufficient statistics,” arXiv preprint arXiv:1511.05666, 2015.
 [114] J. Johnson, A. Alahi, and F.F. Li, “Perceptual losses for realtime style transfer and superresolution,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 694–711.
 [115] R. Mechrez, I. Talmi, F. Shama, and L. ZelnikManor, “Learning to maintain natural image statistics,” arXiv preprint arXiv:1803.04626, 2018.
 [116] R. Mechrez, I. Talmi, and L. ZelnikManor, “The contextual loss for image transformation with nonaligned data,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 768–783.
 [117] K. Li and J. Malik, “Implicit maximum likelihood estimation,” arXiv preprint arXiv:1809.09087, 2018.
 [118] K. Li, S. Peng, and J. Malik, “Superresolution via conditional implicit maximum likelihood estimation,” arXiv preprint arXiv:1810.01406, 2018.
 [119] F. Huszár, “How (not) to train your generative model: Scheduled sampling, likelihood, adversary?” arXiv preprint arXiv:1511.05101, 2015.
 [120] L. Theis, A. van den Oord, and M. Bethge, “A note on the evaluation of generative models,” arXiv preprint arXiv:1511.01844, 2015.
 [121] M. S. Sajjadi, B. Schölkopf, and M. Hirsch, “EnhanceNet: Single image superresolution through automated texture synthesis,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4501–4510.

 [122] J.Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired imagetoimage translation using cycleconsistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
 [123] Y. Yuan, S. Liu, J. Zhang, Y. Zhang, C. Dong, and L. Lin, “Unsupervised image superresolution using cycleincycle generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 814–823.
 [124] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 214–223.
 [125] S. Nowozin, B. Cseke, and R. Tomioka, “fGAN: Training generative neural samplers using variational divergence minimization,” in Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 271–279.
 [126] D. J. Sutherland, H.Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton, “Generative models and model criticism via optimized maximum mean discrepancy,” arXiv preprint arXiv:1611.04488, 2016.
 [127] Y. Blau and T. Michaeli, “The perceptiondistortion tradeoff,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6228–6237.