starter from "How to Train a GAN?" at NIPS2016
Recently, several models based on deep neural networks have achieved great success in terms of both reconstruction accuracy and computational performance for single image super-resolution. In these methods, the low resolution (LR) input image is upscaled to the high resolution (HR) space using a single filter, commonly bicubic interpolation, before reconstruction. This means that the super-resolution (SR) operation is performed in HR space. We demonstrate that this is sub-optimal and adds computational complexity. In this paper, we present the first convolutional neural network (CNN) capable of real-time SR of 1080p videos on a single K2 GPU. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space. In addition, we introduce an efficient sub-pixel convolution layer which learns an array of upscaling filters to upscale the final LR feature maps into the HR output. By doing so, we effectively replace the handcrafted bicubic filter in the SR pipeline with more complex upscaling filters specifically trained for each feature map, whilst also reducing the computational complexity of the overall SR operation. We evaluate the proposed approach using images and videos from publicly available datasets and show that it performs significantly better (+0.15dB on Images and +0.39dB on Videos) and is an order of magnitude faster than previous CNN-based methods.
The recovery of a HR image or video from its LR counterpart is a topic of great interest in digital image processing. This task, referred to as SR, finds direct applications in many areas such as HDTV, medical imaging [28, 33], satellite imaging [17] and surveillance. The global SR problem assumes LR data to be a low-pass filtered (blurred), downsampled and noisy version of HR data. It is a highly ill-posed problem, due to the loss of high-frequency information that occurs during the non-invertible low-pass filtering and subsampling operations. Furthermore, the SR operation is effectively a one-to-many mapping from LR to HR space which can have multiple solutions, of which determining the correct solution is non-trivial. A key assumption that underlies many SR techniques is that much of the high-frequency data is redundant and thus can be accurately reconstructed from low frequency components. SR is therefore an inference problem, and thus relies on our model of the statistics of images in question.
Many methods assume multiple images are available as LR instances of the same scene with different perspectives, i.e. with unique prior affine transformations. These can be categorised as multi-image SR methods [1, 11] and exploit explicit redundancy by constraining the ill-posed problem with additional information and attempting to invert the downsampling process. However, these methods usually require computationally complex image registration and fusion stages, the accuracy of which directly impacts the quality of the result. An alternative family of methods is single image super-resolution (SISR) techniques. These techniques seek to learn implicit redundancy that is present in natural data to recover missing HR information from a single LR instance. This usually arises in the form of local spatial correlations for images and additional temporal correlations in videos. In this case, prior information in the form of reconstruction constraints is needed to restrict the solution space of the reconstruction.
The goal of SISR methods is to recover a HR image from a single LR input image. Recent popular SISR methods can be classified into edge-based, image statistics-based [9, 18, 46, 12] and patch-based [2, 43, 52, 13, 54, 40, 5] methods. A detailed review of more generic SISR methods is available in the literature. One family of approaches that has recently thrived in tackling the SISR problem is sparsity-based techniques. Sparse coding is an effective mechanism that assumes any natural image can be sparsely represented in a transform domain. This transform domain is usually a dictionary of image atoms [25, 10], which can be learnt through a training process that tries to discover the correspondence between LR and HR patches. This dictionary is able to embed the prior knowledge necessary to constrain the ill-posed problem of super-resolving unseen data. This approach is proposed in the methods of [47, 8]. A drawback of sparsity-based techniques is that introducing the sparsity constraint through a nonlinear reconstruction is generally computationally expensive.
to train on large image databases such as ImageNet in order to learn nonlinear mappings of LR and HR image patches. Stacked collaborative local auto-encoders have been used to super-resolve the LR image layer by layer. Osendorfer et al. suggested a method for SISR based on an extension of the predictive convolutional sparse coding framework. A multiple layer CNN inspired by sparse-coding methods has also been proposed. Chen et al. proposed to use multi-stage TNRD as an alternative to CNN where the weights and the nonlinearity are trainable. Wang et al. trained a cascaded sparse coding network from end to end inspired by LISTA (learning iterative shrinkage and thresholding algorithm) to fully exploit the natural sparsity of images. The network structure is not limited to neural networks; for example, a random forest has also been successfully used for SISR.
With the development of CNNs, the efficiency of the algorithms, especially their computational and memory cost, gains importance. The flexibility of deep network models to learn nonlinear relationships has been shown to attain superior reconstruction accuracy compared to previously hand-crafted models [27, 7, 44, 31, 3]. To super-resolve a LR image into HR space, it is necessary to increase the resolution of the LR image to match that of the HR image at some point.
In Osendorfer et al., the image resolution is increased gradually in the middle of the network. Another popular approach is to increase the resolution before or at the first layer of the network [7, 44, 3]. However, this approach has a number of drawbacks. Firstly, increasing the resolution of the LR images before the image enhancement step increases the computational complexity. This is especially problematic for convolutional networks, where the processing speed directly depends on the input image resolution. Secondly, interpolation methods typically used to accomplish the task, such as bicubic interpolation [7, 44, 3], do not bring additional information to solve the ill-posed reconstruction problem.
Learning upscaling filters was briefly suggested in a footnote of Dong et al. However, the importance of integrating it into the CNN as part of the SR operation was not fully recognised and the option was not explored. Additionally, as noted by Dong et al., there are no efficient implementations of a convolution layer whose output size is larger than the input size, and well-optimized implementations such as convnet do not trivially allow such behaviour.
In this paper, contrary to previous works, we propose to increase the resolution from LR to HR only at the very end of the network and super-resolve HR data from LR feature maps. This eliminates the need to perform most of the SR operation in the far larger HR resolution. For this purpose, we propose an efficient sub-pixel convolution layer to learn the upscaling operation for image and video super-resolution.
The advantages of these contributions are twofold:
In our network, upscaling is handled by the last layer of the network. This means each LR image is directly fed to the network and feature extraction occurs through nonlinear convolutions in LR space. Due to the reduced input resolution, we can effectively use a smaller filter size to integrate the same information while maintaining a given contextual area. The resolution and filter size reduction lower the computational and memory complexity substantially enough to allow super-resolution of HD videos in real-time as shown in Sec. 3.5.
For a network with $L$ layers, we learn $n_{L-1}$ upscaling filters for the $n_{L-1}$ feature maps as opposed to one upscaling filter for the input image. In addition, not using an explicit interpolation filter means that the network implicitly learns the processing necessary for SR. Thus, the network is capable of learning a better and more complex LR to HR mapping compared to a single fixed filter upscaling at the first layer. This results in additional gains in the reconstruction accuracy of the model as shown in Sec. 3.3.2 and Sec. 3.4.
We validate the proposed approach using images and videos from publicly available benchmark datasets and compare our performance against previous works including [7, 3, 31]. We show that the proposed model achieves state-of-the-art performance and is nearly an order of magnitude faster than previously published methods on images and videos.
The task of SISR is to estimate a HR image $I^{SR}$ given a LR image $I^{LR}$ downscaled from the corresponding original HR image $I^{HR}$. The downsampling operation is deterministic and known: to produce $I^{LR}$ from $I^{HR}$, we first convolve $I^{HR}$ using a Gaussian filter - thus simulating the camera's point spread function - then downsample the image by a factor of $r$. We will refer to $r$ as the upscaling ratio. In general, both $I^{LR}$ and $I^{HR}$ can have $C$ colour channels, thus they are represented as real-valued tensors of size $H \times W \times C$ and $rH \times rW \times C$, respectively.
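The degradation model described above (Gaussian blur followed by subsampling by the upscaling ratio) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the kernel size, the value of sigma, the edge padding and the function names `gaussian_kernel` and `downscale` are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalised 2D Gaussian kernel (size and sigma are illustrative choices)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def downscale(hr, r, size=5, sigma=1.0):
    """Blur an (rH, rW, C) image with a Gaussian, then keep every r-th pixel."""
    kernel = gaussian_kernel(size, sigma)
    pad = size // 2  # assumes an odd kernel size
    padded = np.pad(hr, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    blurred = np.zeros_like(hr, dtype=np.float64)
    # Accumulate shifted copies of the image, weighted by the kernel taps.
    for dy in range(size):
        for dx in range(size):
            blurred += kernel[dy, dx] * padded[dy:dy + hr.shape[0],
                                               dx:dx + hr.shape[1], :]
    return blurred[::r, ::r, :]

hr = np.random.default_rng(0).random((12, 18, 3))  # rH = 12, rW = 18, C = 3
lr = downscale(hr, r=3)
print(lr.shape)  # (4, 6, 3)
```

With an upscaling ratio of 3, a 12 x 18 HR image yields a 4 x 6 LR image with the same number of channels.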
To solve the SISR problem, the SRCNN recovers the HR image from an upscaled and interpolated version of $I^{LR}$ instead of $I^{LR}$ itself. To recover $I^{SR}$, a 3 layer convolutional network is used. In this section we propose a novel network architecture, as illustrated in Fig. 1, to avoid upscaling $I^{LR}$ before feeding it into the network. In our architecture, we first apply an $l$ layer convolutional neural network directly to the LR image, and then apply a sub-pixel convolution layer that upscales the LR feature maps to produce $I^{SR}$.
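For a network composed of $L$ layers, the first $L-1$ layers can be described as follows:

$$f^1(I^{LR}; W_1, b_1) = \phi\left(W_1 * I^{LR} + b_1\right), \qquad (1)$$

$$f^l(I^{LR}; W_{1:l}, b_{1:l}) = \phi\left(W_l * f^{l-1}(I^{LR}) + b_l\right), \qquad (2)$$

where $W_l, b_l$, $l \in (1, L-1)$, are learnable network weights and biases respectively. $W_l$ is a 2D convolution tensor of size $n_{l-1} \times n_l \times k_l \times k_l$, where $n_l$ is the number of features at layer $l$, $n_0 = C$, and $k_l$ is the filter size at layer $l$. The biases $b_l$ are vectors of length $n_l$. The nonlinearity function (or activation function) $\phi$ is applied element-wise and is fixed. The last layer $f^L$ has to convert the LR feature maps to a HR image $I^{SR}$.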
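As a rough illustration of the feature extraction layers described above, the following NumPy sketch applies a stack of stride-1 convolutions, each followed by an element-wise nonlinearity, directly at LR resolution. The layer sizes, the zero padding, the default choice of `np.tanh` and the names `conv2d_same` and `lr_features` are assumptions for the example only.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Stride-1, zero-padded convolution (cross-correlation, as in most CNN code).

    x: (H, W, n_in), w: (n_in, n_out, k, k) with odd k, b: (n_out,).
    """
    n_in, n_out, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W = x.shape[:2]
    out = np.tile(b.astype(np.float64), (H, W, 1))
    for dy in range(k):
        for dx in range(k):
            # (H, W, n_in) @ (n_in, n_out) -> (H, W, n_out)
            out += xp[dy:dy + H, dx:dx + W, :] @ w[:, :, dy, dx]
    return out

def lr_features(i_lr, weights, biases, phi=np.tanh):
    """Apply phi(W * f + b) layer by layer; phi is fixed and element-wise."""
    f = i_lr
    for w, b in zip(weights, biases):
        f = phi(conv2d_same(f, w, b))
    return f

rng = np.random.default_rng(0)
C, n1, n2 = 1, 8, 4  # hypothetical channel counts
weights = [rng.normal(0, 0.1, (C, n1, 5, 5)), rng.normal(0, 0.1, (n1, n2, 3, 3))]
biases = [np.zeros(n1), np.zeros(n2)]
features = lr_features(rng.random((6, 7, C)), weights, biases)
print(features.shape)  # (6, 7, 4)
```

The spatial size of the feature maps stays at the LR resolution throughout; only the number of channels changes.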
The addition of a deconvolution layer is a popular choice for recovering resolution from max-pooling and other image down-sampling layers. This approach has been successfully used in visualizing layer activations and for generating semantic segmentations using high level features from the network. It is trivial to show that the bicubic interpolation used in SRCNN is a special case of the deconvolution layer, as suggested already in [24, 7]. The deconvolution layer can be seen as multiplication of each input pixel by a filter element-wise with stride $r$, followed by a sum over the resulting output windows, also known as backwards convolution.
The other way to upscale a LR image is convolution with fractional stride of $\frac{1}{r}$ in the LR space, as mentioned in prior work, which can be naively implemented by interpolation, perforate or un-pooling from LR space to HR space followed by a convolution with a stride of 1 in HR space. These implementations increase the computational cost by a factor of $r^2$, since convolution happens in HR space.
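The extra cost of convolving in HR space can be checked with simple arithmetic: for a stride-1 convolution the number of multiply-accumulates is proportional to the number of output pixels, so the same layer costs $r^2$ times more at HR resolution. The frame size and layer sizes below are hypothetical values chosen only for illustration.

```python
def conv_macs(height, width, n_in, n_out, k):
    """Multiply-accumulate count for a stride-1, same-size convolution layer."""
    return height * width * n_in * n_out * k * k

H, W, r = 360, 640, 3        # hypothetical LR frame size and upscaling ratio
n_in, n_out, k = 64, 32, 3   # hypothetical layer sizes

macs_lr = conv_macs(H, W, n_in, n_out, k)
macs_hr = conv_macs(r * H, r * W, n_in, n_out, k)
print(macs_hr / macs_lr)  # 9.0, i.e. r ** 2
```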
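Alternatively, a convolution with stride of $\frac{1}{r}$ in the LR space with a filter $W_s$ of size $k_s$ with weight spacing $\frac{1}{r}$ would activate different parts of $W_s$ for the convolution. The weights that fall between the pixels are simply not activated and do not need to be calculated. The number of activation patterns is exactly $r^2$. Each activation pattern, according to its location, has at most $\lceil k_s / r \rceil^2$ weights activated. These patterns are periodically activated during the convolution of the filter across the image depending on the sub-pixel location $\mathrm{mod}(x, r)$, $\mathrm{mod}(y, r)$, where $x, y$ are the output pixel coordinates in HR space. In this paper, we propose an effective way to implement the above operation when $\mathrm{mod}(k_s, r) = 0$:

$$I^{SR} = f^L(I^{LR}) = \mathcal{PS}\left(W_L * f^{L-1}(I^{LR}) + b_L\right), \qquad (3)$$

where $\mathcal{PS}$ is a periodic shuffling operator that rearranges the elements of a $H \times W \times C \cdot r^2$ tensor to a tensor of shape $rH \times rW \times C$. The effects of this operation are illustrated in Fig. 1. Mathematically, this operation can be described in the following way:

$$\mathcal{PS}(T)_{x,y,c} = T_{\lfloor x/r \rfloor,\, \lfloor y/r \rfloor,\, C \cdot r \cdot \mathrm{mod}(y, r) + C \cdot \mathrm{mod}(x, r) + c} \qquad (4)$$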
The convolution operator $W_L$ thus has shape $n_{L-1} \times r^2 C \times k_L \times k_L$. Note that we do not apply nonlinearity to the outputs of the convolution at the last layer. It is easy to see that when $k_L = \frac{k_s}{r}$ and $\mathrm{mod}(k_s, r) = 0$ it is equivalent to sub-pixel convolution in the LR space with the filter $W_s$. We will refer to our new layer as the sub-pixel convolution layer and our network as ESPCN. This last layer produces a HR image from LR feature maps directly with one upscaling filter for each feature map as shown in Fig. 4.
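The periodic shuffling step can be written as a reshape and a transpose. The sketch below assumes the index convention stated in the code comment (output pixel `(x, y, c)` reads input channel `C*r*(y mod r) + C*(x mod r) + c` at LR position `(x // r, y // r)`); library implementations of similar operations may order the channels differently. A slow loop-based version is included only to check the vectorised one.

```python
import numpy as np

def periodic_shuffle(t, r):
    """Rearrange an (H, W, C*r*r) tensor into (r*H, r*W, C).

    Assumed convention:
    out[x, y, c] = t[x // r, y // r, C*r*(y % r) + C*(x % r) + c]
    """
    H, W, depth = t.shape
    C = depth // (r * r)
    # Split the channel axis into (y mod r, x mod r, c).
    t = t.reshape(H, W, r, r, C)
    # Reorder to (h, x mod r, w, y mod r, c) so that x = r*h + (x mod r), etc.
    t = t.transpose(0, 3, 1, 2, 4)
    return t.reshape(H * r, W * r, C)

def periodic_shuffle_reference(t, r):
    """Direct, loop-based evaluation of the same index formula."""
    H, W, depth = t.shape
    C = depth // (r * r)
    out = np.empty((H * r, W * r, C), dtype=t.dtype)
    for x in range(H * r):
        for y in range(W * r):
            for c in range(C):
                out[x, y, c] = t[x // r, y // r, C * r * (y % r) + C * (x % r) + c]
    return out

r, H, W, C = 3, 2, 3, 2
t = np.arange(H * W * C * r * r, dtype=np.float64).reshape(H, W, C * r * r)
sr = periodic_shuffle(t, r)
print(sr.shape)                                               # (6, 9, 2)
print(np.array_equal(sr, periodic_shuffle_reference(t, r)))   # True
```

Because the operation is a pure rearrangement of elements, it adds no arithmetic to the last layer.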
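Given a training set consisting of HR image examples $I^{HR}_n$, $n = 1 \ldots N$, we generate the corresponding LR images $I^{LR}_n$, $n = 1 \ldots N$, and calculate the pixel-wise MSE of the reconstruction as an objective function to train the network:

$$\ell(W_{1:L}, b_{1:L}) = \frac{1}{r^2 H W} \sum_{x=1}^{rH} \sum_{y=1}^{rW} \left( I^{HR}_{x,y} - f^L_{x,y}(I^{LR}) \right)^2 \qquad (5)$$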
It is noticeable that the implementation of the above periodic shuffling can be avoided in training time. Instead of shuffling the output as part of the layer, we can pre-shuffle the training data to match the output of the layer before $\mathcal{PS}$. Thus our proposed layer is $\log_2 r^2$ times faster compared to a deconvolution layer in training and $r^2$ times faster compared to implementations using various forms of upscaling before convolution.
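The pre-shuffling idea can be illustrated as follows: since periodic shuffling only permutes elements, the MSE computed against the HR target after shuffling equals the MSE computed against a pre-shuffled target in LR layout. This is a sketch under an assumed channel ordering; the helper names and the random stand-in for the network output are hypothetical.

```python
import numpy as np

def periodic_shuffle(t, r):
    """(H, W, C*r*r) -> (r*H, r*W, C), same assumed convention as the text."""
    H, W, depth = t.shape
    C = depth // (r * r)
    return t.reshape(H, W, r, r, C).transpose(0, 3, 1, 2, 4).reshape(H * r, W * r, C)

def inverse_periodic_shuffle(hr, r):
    """(r*H, r*W, C) -> (H, W, C*r*r); used to pre-shuffle training targets."""
    rH, rW, C = hr.shape
    H, W = rH // r, rW // r
    t = hr.reshape(H, r, W, r, C)    # axes: (h, x mod r, w, y mod r, c)
    t = t.transpose(0, 2, 3, 1, 4)   # axes: (h, w, y mod r, x mod r, c)
    return t.reshape(H, W, r * r * C)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
r, H, W, C = 3, 4, 5, 1
target_hr = rng.random((r * H, r * W, C))
pred_lr = rng.random((H, W, C * r * r))  # stand-in for the last conv output

loss_after_shuffle = mse(periodic_shuffle(pred_lr, r), target_hr)
loss_pre_shuffled = mse(pred_lr, inverse_periodic_shuffle(target_hr, r))
print(np.isclose(loss_after_shuffle, loss_pre_shuffled))  # True
```

In this form the shuffle is applied once to the training targets rather than at every forward pass.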
A detailed report of the quantitative evaluation, including the original data (images and videos), down-sampled data, super-resolved data, overall and individual scores, and run-times on a K2 GPU, is provided in the supplemental material (https://twitter.box.com/s/47bhw60d066imhh88i2icqnbu7lwiza2).
During the evaluation, we used publicly available benchmark datasets including the Timofte dataset, widely used by SISR papers [7, 44, 3], which provides source code for multiple methods, 91 training images and two test datasets, Set5 and Set14, which provide 5 and 14 images; the Berkeley segmentation datasets BSD300 and BSD500, which provide 100 and 200 images for testing; and the super texture dataset, which provides 136 texture images. For our final models, we use 50,000 randomly selected images from ImageNet for the training. Following previous works, we only consider the luminance channel in YCbCr colour space in this section because humans are more sensitive to luminance changes. For each upscaling factor, we train a specific network.
For video experiments we use 1080p HD videos from the publicly available Xiph database (Xiph.org Video Test Media [derf's collection], https://media.xiph.org/video/derf/), which has been used to report video SR results in previous methods [37, 23]. The database contains a collection of HD videos approximately seconds in length and with width and height . In addition, we also use the Ultra Video Group database (Ultra Video Group Test Sequences, http://ultravideo.cs.tut.fi/), containing videos of in size and seconds in length.
For the ESPCN, we set , , and in our evaluations. The choice of the parameter is inspired by SRCNN’s 3 layer 9-5-5 model and the equations in Sec. 2.2. In the training phase, pixel sub-images are extracted from the training ground truth images , where is the upscaling factor. To synthesize the low-resolution samples , we blur using a Gaussian filter and sub-sample it by the upscaling factor. The sub-images are extracted from original images with a stride of from and a stride of from . This ensures that all pixels in the original image appear once and only once as the ground truth of the training data. We choose instead of as the activation function for the final model motivated by our experimental results.
Training stops when no improvement of the cost function is observed for 100 epochs. The initial learning rate is set to 0.01 and the final learning rate to 0.0001; the rate is updated gradually when the improvement of the cost function is smaller than a threshold. The final layer learns 10 times slower, following previous work. The training takes roughly three hours on a K2 GPU on 91 images, and seven days on images from ImageNet for an upscaling factor of 3. We use the PSNR as the performance metric to evaluate our models. The PSNR of SRCNN and Chen's models on our extended benchmark set is calculated based on the Matlab code and models provided by [7, 3].
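For reference, PSNR follows directly from the mean squared error between a reference image and its reconstruction. The sketch below assumes 8-bit intensities with a peak value of 255; the peak value and the function name `psnr` are assumptions, and the exact evaluation protocol (cropping, colour conversion) used for the reported numbers is not reproduced here.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

reference = np.zeros((8, 8))
estimate = np.ones((8, 8))   # every pixel off by one grey level, so MSE = 1
print(round(psnr(reference, estimate), 2))  # 48.13
```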
|Dataset||Scale||SRCNN (91)||ESPCN (91 )||ESPCN (91)||SRCNN (ImageNet)||ESPCN (ImageNet )|
0.001 with paired t-test).
In this section, we demonstrate the positive effect of the sub-pixel convolution layer as well as activation function. We first evaluate the power of the sub-pixel convolution layer by comparing against SRCNN’s standard 9-1-5 model . Here, we follow the approach in , using as the activation function for our models in this experiment, and training a set of models with 91 images and another set with images from ImageNet. The results are shown in Tab. 1. ESPCN with trained on ImageNet images achieved statistically significantly better performance compared to SRCNN models. It is noticeable that ESPCN (91) performs very similar to SRCNN (91). Training with more images using ESPCN has a far more significant impact on PSNR compared to SRCNN with similar number of parameters (+0.33 vs +0.07).
To make a visual comparison between our model with the sub-pixel convolution layer and SRCNN, we visualized the weights of our ESPCN (ImageNet) model against the SRCNN 9-5-5 ImageNet model in Fig. 3 and Fig. 4. The weights of our first and last layer filters have a strong similarity to designed features including the log-Gabor filters, wavelets and Haar features. It is noticeable that although each filter is independent in LR space, our independent filters are actually smooth in the HR space after periodic shuffling. Compared to SRCNN's last layer filters, our final layer filters have complex patterns for different feature maps and much richer and more meaningful representations.
In this section, we show ESPCN trained on ImageNet compared to results from SRCNN  and the TNRD  which is currently the best performing approach published. For simplicity, we do not show results which are known to be worse than . For the interested reader, the results of other previous methods can be found in . We choose to compare against the best SRCNN 9-5-5 ImageNet model in this section . And for , results are calculated based on the stages model.
Our results shown in Tab. 2 are significantly better than the SRCNN 9-5-5 ImageNet model, whilst being close to, and in some cases out-performing, the TNRD. Although TNRD uses a single bicubic interpolation to upscale the input image to HR space, it possibly benefits from a trainable nonlinearity function. This trainable nonlinearity function is not mutually exclusive with our network and will be interesting to explore in the future. A visual comparison of the super-resolved images is given in Fig. 5 and Fig. 6. The CNN methods create much sharper and higher contrast images, and ESPCN provides a noticeable improvement over SRCNN.
In this section, we compare the ESPCN trained models against single frame bicubic interpolation and SRCNN on two popular video benchmarks. One big advantage of our network is its speed. This makes it an ideal candidate for video SR, which allows us to super-resolve the videos frame by frame. Our results shown in Tab. 3 and Tab. 4 are better than the SRCNN 9-5-5 ImageNet model. The improvement is more significant than the results on the image data; this may be due to differences between datasets. A similar disparity can be observed between different categories of the image benchmark, such as Set5 vs SuperTexture.
In this section, we evaluate our best model's run time on Set14 with an upscale factor of 3. (It should be noted that our results outperform all other algorithms in accuracy on the larger BSD datasets. However, Set14 on a single CPU core is used here in order to allow a straightforward comparison with previously published results [31, 6].) We evaluate the run time of other methods [2, 51, 39] using the provided Matlab codes. For methods which use convolutions, including our own, a python/theano implementation is used to improve the efficiency, based on the Matlab codes provided in [7, 3]. The results are presented in Fig. 2. Our model runs an order of magnitude faster than the fastest methods published so far. Compared to SRCNN 9-5-5 ImageNet model, the number of convolution required to super-resolve one image is times smaller and the number of total parameters of the model is times smaller. The total complexity of the super-resolution operation is thus times lower. We have achieved a stunning average speed of for super-resolving one single image from Set14 on a K2 GPU. Given the speed of the network, it will be interesting to explore ensemble prediction using independently trained models to achieve better SR performance in the future.
We also evaluated the run time of 1080p HD video super-resolution using videos from the Xiph and the Ultra Video Group databases. With an upscale factor of 3, the SRCNN 9-5-5 ImageNet model takes 0.435s per frame whilst our ESPCN model takes only 0.038s per frame. With an upscale factor of 4, the SRCNN 9-5-5 ImageNet model takes 0.434s per frame whilst our ESPCN model takes only 0.029s per frame.
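The per-frame timings above translate into frame rates and speed-ups by simple division. The short calculation below uses only the four reported timings; the dictionary layout and variable names are illustrative.

```python
# Seconds per 1080p frame, as reported in the text above.
timings = {
    ("SRCNN 9-5-5", 3): 0.435,
    ("ESPCN", 3): 0.038,
    ("SRCNN 9-5-5", 4): 0.434,
    ("ESPCN", 4): 0.029,
}

fps = {key: 1.0 / seconds for key, seconds in timings.items()}
speedup = {
    scale: timings[("SRCNN 9-5-5", scale)] / timings[("ESPCN", scale)]
    for scale in (3, 4)
}

for (model, scale), rate in fps.items():
    print(f"{model} x{scale}: {rate:.1f} frames/s")
for scale, factor in speedup.items():
    print(f"x{scale}: ESPCN is {factor:.1f} times faster per frame")
```

This works out to roughly 26 and 34 frames per second for ESPCN at upscale factors 3 and 4, against a little over 2 frames per second for SRCNN 9-5-5, which is consistent with the order-of-magnitude speed-up claimed.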
In this paper, we demonstrate that a non-adaptive upscaling at the first layer provides worse results than an adaptive upscaling for SISR and requires more computational complexity. To address the problem, we propose to perform the feature extraction stages in the LR space instead of HR space. To do that we propose a novel sub-pixel convolution layer which is capable of super-resolving LR data into HR space with very little additional computational cost compared to a deconvolution layer at training time. Evaluation performed on an extended benchmark dataset with an upscaling factor of 4 shows that we have a significant speed and performance (+0.15dB on images and +0.39dB on videos) boost compared to the previous CNN approach with more parameters (5-3-3 vs 9-5-5). This makes our model the first CNN model that is capable of super-resolving HD videos in real time on a single GPU.
A reasonable assumption when processing video information is that most of a scene’s content is shared by neighbouring video frames. Exceptions to this assumption are scene changes and objects sporadically appearing or disappearing from the scene. This creates additional data-implicit redundancy that can be exploited for video super-resolution as has been shown in [32, 23]. Spatio-temporal networks are popular as they fully utilise the temporal information from videos for human action recognition [19, 41]. In the future, we will investigate extending our ESPCN network into a spatio-temporal network to super-resolve one frame from multiple neighbouring frames using 3D convolutions.
Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 399–406, 2010.
Efficient learning of sparse representations with an energy-based model. In Advances in neural information processing systems, pages 1137–1144, 2006.