The task of single image super-resolution aims at restoring a high-resolution (HR) image from a given low-resolution (LR) one. Super-resolution has wide applications in fields where image details are in demand, such as medical imaging, remote sensing, video surveillance, and entertainment. Over the past decades, super-resolution has attracted much attention from the computer vision community. Early methods include bicubic interpolation , Lanczos resampling , statistical priors , neighbor embedding , and sparse coding . However, super-resolution is highly ill-posed, since the process from HR to LR involves non-invertible operations such as low-pass filtering and subsampling.
Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in computer vision tasks such as image classification , object detection , and image enhancement . Recently, CNNs have been widely used to address the ill-posed inverse problem of super-resolution, and have demonstrated superiority over traditional methods [9, 15, 4, 23] in terms of both reconstruction accuracy and computational efficiency. Dong et al. [6, 7] designed a super-resolution convolutional neural network (SRCNN), demonstrating that a CNN can learn the mapping from LR to HR in an end-to-end manner. A fast super-resolution convolutional neural network (FSRCNN)  was proposed to accelerate SRCNN [6, 7]; it takes the original LR image as input and adopts a deconvolution layer in place of bicubic interpolation. In , an efficient sub-pixel convolution layer is introduced to achieve real-time performance. Kim et al.  use a very deep super-resolution (VDSR) network with 20 convolutional layers, which greatly improves the accuracy of the model.
Previous CNN-based methods have achieved great progress in both restoration quality and efficiency. However, they still have limitations, mainly in the following aspects:
CNN-based methods make efforts to enlarge the receptive field of the models by stacking more layers. They reconstruct every type of content from LR images using only a single-scale region, thus ignoring the varying scales of different details. For instance, restoring a detail in the sky probably relies on a larger image region, while a tiny piece of text may only be relevant to a small patch.
Most previous approaches learn a specific model for a single up-scale factor, so a model learned for one up-scale factor cannot work well for another. That is, many scale-specific models have to be trained for different up-scale factors, which is inefficient in terms of both time and memory. Though  trains a model for multiple up-scale factors, it ignores the fact that a single receptive field may contain different amounts of information at various resolutions.
In this paper, we propose a multi-scale super-resolution (MSSR) convolutional neural network to address these problems; the term multi-scale carries two meanings here. First, the proposed network combines multi-path subnetworks of different depths, which correspond to multi-scale regions in the input image. Second, the multi-scale network is capable of selecting a proper receptive field for different up-scale factors to restore the HR image. Only a single model is trained for multiple up-scale factors through multi-scale training.
2 Multi-Scale Super-Resolution
Given a low-resolution image, super-resolution aims at restoring its high-resolution version. For this ill-posed recovery problem, an effective strategy is to estimate a target pixel by taking more context information in its neighborhood into account. In [6, 7, 14], the authors found that a larger receptive field tends to achieve better performance due to richer structural information. However, we argue that the restoration process should not depend only on single-scale regions with a large receptive field.
Different kinds of components in an image may be relevant to different scales of neighborhood. In , a multi-scale neighborhood has been proven effective for super-resolution, simultaneously integrating local and non-local sparse priors. Multi-scale feature extraction [3, 24] is also effective for representing image patterns. For example, the inception architecture in GoogLeNet  uses parallel convolutions with varying filter sizes, and better addresses the issue of objects of varying size in input images, resulting in state-of-the-art performance in object recognition. Motivated by this, we propose a multi-scale super-resolution convolutional neural network to improve performance (see Fig. 1): the low-resolution image is first up-sampled to the desired size by bicubic interpolation, and MSSR is then applied to predict the details.
2.1 Multi-Scale Architecture
With a fixed filter size larger than 1, the receptive field grows larger as the network stacks more layers. The proposed architecture is composed of two parallel paths, as illustrated in Fig. 1. The upper path (Module-L) stacks more convolutional layers and is thus able to capture a large region of information in the LR image. The other path (Module-S) contains fewer convolutional layers to ensure a relatively small receptive field. The response of the $i$-th convolutional layer in Module-L/S for input $F_{i-1}(Y)$ is given by
$$F_i(Y) = \sigma\big(W_i * F_{i-1}(Y) + b_i\big),$$
where $W_i$ and $b_i$ are the weights and bias respectively, and $\sigma(\cdot)$ represents the nonlinear operation (ReLU). Here we denote the interpolated low-resolution image as $Y$, with $F_0(Y) = Y$. The output of Module-L is $F_L(Y)$, and the output of Module-S is $F_S(Y)$.
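As a concrete illustration of the layer response above, the following is a minimal single-channel sketch in NumPy; the averaging kernel and bias are illustrative stand-ins, not learned values, and CNN "convolution" is implemented as correlation (no kernel flip), as is conventional.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Naive 'same'-size 2D correlation with zero-padding followed by ReLU,
    mirroring the layer response F_i = sigma(W_i * F_{i-1} + b_i)."""
    k = w.shape[0]                       # square kernel, k x k
    p = k // 2
    xp = np.pad(x, p)                    # zero-padding keeps the spatial size
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w) + b
    return np.maximum(out, 0.0)          # ReLU nonlinearity

# toy interpolated LR image Y and a 3x3 averaging filter (illustrative)
Y = np.arange(25, dtype=float).reshape(5, 5)
W = np.full((3, 3), 1.0 / 9.0)
F1 = conv2d_same(Y, W, b=0.1)
```

A real implementation would use a framework's multi-channel convolution; this loop form only makes the per-pixel arithmetic explicit.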
To save parameters, Module-S shares its weights with the front part of Module-L. The outputs of the two modules are fused into one, which can take various functional forms (e.g. concatenation, weighting, and summation). We find that simple summation is sufficient for our purpose, and the fusion result is generated as $F(Y) = F_L(Y) + F_S(Y)$. To further vary the spatial scales of the ensemble architecture, a similar subnetwork is cascaded to the previous one. A final reconstruction module of stacked convolutional layers is employed to make the prediction. Following , the size of all convolutional kernels is set to $3 \times 3$ with zero-padding. With respect to the local information involved in the LR image, there are streams of three scales (Small/Middle/Large-Scale), corresponding to the small/small, small/large, and large/large paths through the two cascaded subnetworks, respectively. Each layer consists of 64 filters, except for the last reconstruction layer, which contains only a single filter without nonlinear operation.
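The receptive fields of the three streams follow directly from the layer counts given in the experimental settings (9 layers per Module-L, 2 per Module-S, 2 in the reconstruction module), since each 3×3 layer widens the receptive field by 2 pixels. A short arithmetic sketch:

```python
# Receptive-field arithmetic for the cascaded two-subnetwork design.
# Each 3x3 layer adds 2 pixels to the receptive field: RF = 2*n + 1
# for n stacked layers. Depths follow Sec. 3.2 of the text.
L_DEPTH, S_DEPTH, REC_DEPTH = 9, 2, 2

def receptive_field(n_layers, kernel=3):
    return n_layers * (kernel - 1) + 1

streams = {
    "small":  receptive_field(S_DEPTH + S_DEPTH + REC_DEPTH),  # S -> S path
    "middle": receptive_field(S_DEPTH + L_DEPTH + REC_DEPTH),  # S -> L path
    "large":  receptive_field(L_DEPTH + L_DEPTH + REC_DEPTH),  # L -> L path
}
print(streams)  # {'small': 13, 'middle': 27, 'large': 41}
```

The three values match the 13/27/41 scales reported in the experimental settings.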
2.2 Multi-Scale Residual Learning
High-frequency content is more important for HR restoration, as reflected by the gradient features taken into account in [1, 2, 4]. Since the input is highly similar to the output in the super-resolution problem, the proposed network (MSSR) focuses on estimating high-frequency details through multi-scale residual learning.
The given training set $\{(x^{(i)}_s, y^{(i)})\}_{i=1}^{N}$ includes pairs of multi-scale LR images $x^{(i)}_s$ with scale factors $s$ and the corresponding HR image $y^{(i)}$. The multi-scale residual image for each sample is computed as $r^{(i)}_s = y^{(i)} - x^{(i)}_s$. The goal of MSSR is to learn the nonlinear mapping $f(\cdot)$ from multi-scale LR images to predict the residual image $r^{(i)}_s$. The network parameters $\Theta$ are obtained by minimizing the loss function
$$L(\Theta) = \frac{1}{2N} \sum_{i} \sum_{s} \big\| r^{(i)}_s - f\big(x^{(i)}_s; \Theta\big) \big\|^2 .$$
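The residual-learning objective can be sketched numerically as follows; this is a toy NumPy version assuming a squared-error loss over residuals, with `predict` standing in for the network $f(\cdot;\Theta)$:

```python
import numpy as np

def mssr_loss(pairs, predict):
    """Mean squared error between predicted and true residuals, averaged
    over samples (toy sketch of the residual-learning objective).
    `pairs` holds (x, y): interpolated LR input and HR target;
    `predict` maps x to an estimated residual image."""
    total, n = 0.0, 0
    for x, y in pairs:
        r = y - x                          # residual image r = y - x
        total += np.sum((r - predict(x)) ** 2)
        n += 1
    return total / (2 * n)

# toy check: a zero predictor is penalized by the full residual energy
x = np.zeros((2, 2)); y = np.ones((2, 2))
loss = mssr_loss([(x, y)], predict=lambda x: np.zeros_like(x))
print(loss)  # 2.0  (sum of squared residuals 4, divided by 2*1)
```

In training, this minimization is carried out by a stochastic optimizer (Adam, per the experimental settings) rather than evaluated in closed form.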
With multi-scale residual learning, we train only one general model for multiple up-scale factors. For LR images with different down-sampling scales, even a region of the same size in the LR image may contain a different amount of information. As noted in the work of Dong et al. , a small patch in LR space can cover almost all the information of a large patch in HR space. For samples of multiple up-scale factors, a model with only a single receptive field cannot make the best of them all simultaneously; our multi-scale network, however, is capable of handling this problem. The advantages of multi-scale learning include not only memory and time savings, but also a way to adapt the model to different down-sampling scales.
3 Experiments
3.1 Datasets
Training dataset. The model is trained on 91 images from Yang et al.  and 200 images from the training set of the Berkeley Segmentation Dataset (BSD) , which are widely used for the super-resolution problem [7, 14, 8, 18]. As in , to make full use of the training data, we apply data augmentation in two ways: 1) rotate the images by fixed angles; 2) downscale the images by factors of 0.9, 0.8, 0.7 and 0.6. Following the sample cropping in , training images are cropped into non-overlapping sub-images. In addition, to train a general model for multiple up-scale factors, we combine the LR-HR pairs of all three up-scale factors into one training set.
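The rotation augmentation and non-overlapping cropping can be sketched as below; the sub-image size `PATCH = 41` is a hypothetical value chosen for illustration (the exact crop size is not specified here), and the rescaling step is omitted to keep the sketch dependency-free:

```python
import numpy as np

PATCH = 41   # hypothetical sub-image size, for illustration only

def augment(img):
    """Rotation-based augmentation: the original image plus three
    90-degree rotations (the paper also rescales images; omitted)."""
    return [np.rot90(img, k) for k in range(4)]

def crop_nonoverlap(img, size=PATCH):
    """Split an image into non-overlapping size x size sub-images,
    discarding any remainder at the right/bottom borders."""
    h, w = img.shape[:2]
    return [img[i:i + size, j:j + size]
            for i in range(0, h - size + 1, size)
            for j in range(0, w - size + 1, size)]

img = np.zeros((123, 90))
crops = crop_nonoverlap(img)
print(len(crops))  # 3 rows x 2 cols = 6 crops
```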
Test dataset. The proposed method is evaluated on four publicly available benchmark datasets: Set5  and Set14  provide 5 and 14 images respectively; B100  contains 100 natural images collected from BSD; Urban100  consists of 100 high-resolution real-world images rich in structure. Following previous works [12, 8, 14], we transform the images to the YCbCr color space and apply the algorithm only on the luminance channel, since human vision is more sensitive to details in intensity than in color.
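Extracting the luminance channel can be approximated with the standard BT.601 luma weights, as in the sketch below; the paper's exact YCbCr transform may include an offset and scaling, so this is a simplified full-range version:

```python
import numpy as np

def rgb_to_luminance(rgb):
    """Approximate luminance (Y) from an RGB image in [0, 1] using
    BT.601 luma weights; chroma channels are discarded since the
    algorithm operates on intensity only."""
    weights = np.array([0.299, 0.587, 0.114])  # R, G, B
    return rgb @ weights

white = np.ones((2, 2, 3))
print(rgb_to_luminance(white)[0, 0])  # 1.0
```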
3.2 Experimental Settings
In the experiments, the Caffe  package is used to train the proposed MSSR with Adam . To ensure varying receptive-field scales, each Module-L in Fig. 1 stacks 9 convolutional layers, while Module-S stacks 2, and the reconstruction module is built of 2 layers. Thus, the longest path in the network consists of 20 convolutional layers in total, and there are streams of three different scales corresponding to receptive fields of 13, 27 and 41. Model weights are initialized according to the approach described in . The learning rate is decreased by a factor of 10 after 80 epochs, and the training phase stops at 100 epochs. The batch size and momentum are set to 64 and 0.9 respectively, with weight decay applied.
To quantitatively assess the proposed model, MSSR is evaluated for three up-scale factors from 2 to 4 on the four test datasets mentioned above. We compute the Peak Signal-to-Noise Ratio (PSNR) and structural similarity (SSIM) of the results and compare with several recent competitive methods, including A+ , SelfEx , SRCNN , FSRCNN  and VDSR . As shown in Table 1, the proposed MSSR outperforms the other methods on almost every up-scale factor and test set. The only suboptimal result is the PSNR on B100 for up-scale factor 4, which is slightly lower than that of VDSR  but still competitive, with a higher SSIM. Visual comparisons can be found in Fig. 2 and Fig. 3.
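For reference, PSNR is computed from the mean squared error between the reconstruction and the ground truth; a minimal sketch (SSIM is omitted, as it involves local statistics and is usually taken from a library):

```python
import numpy as np

def psnr(ref, out, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference image and a
    reconstruction, both with values in [0, peak]."""
    mse = np.mean((ref.astype(float) - out.astype(float)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((8, 8), 100.0)
out = ref + 16.0                     # constant error of 16 gray levels
print(round(psnr(ref, out), 2))      # 24.05
```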
As for efficiency, we evaluate the execution time using the public code of state-of-the-art methods. The experiments are conducted with an Intel CPU (Xeon E5-2620, 2.1 GHz) and an NVIDIA GPU (GeForce GTX 1080). Fig. 4 shows the PSNR performance of several state-of-the-art super-resolution methods versus their execution time. The proposed MSSR network achieves better super-resolution quality than existing methods, and is tens of times faster.
Table 1. PSNR/SSIM comparison of A+ , SelfEx , SRCNN , FSRCNN , VDSR  and the proposed MSSR for each dataset and scale factor.
4 Conclusion
In this paper, we highlight the importance of scale in the super-resolution problem, which has been neglected in previous work. Instead of simply enlarging the size of input patches, we propose a multi-scale convolutional neural network for single image super-resolution. Combining paths of different scales enables the model to synthesize a wider range of receptive fields. Since different components in an image may be relevant to a diversity of neighborhood sizes, the proposed network benefits from multi-scale features. Our model also generalizes well across different up-scale factors. Experimental results show that our approach achieves state-of-the-art results on standard benchmarks at a relatively high speed.
-  Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding (2012)
-  Bevilacqua, M., Roumy, A., Guillemot, C., Morel, M.L.A.: Super-resolution using neighbor embedding of back-projection residuals. In: Digital Signal Processing (DSP), 2013 18th International Conference on. pp. 1–8. IEEE (2013)
-  Cai, B., Xu, X., Jia, K., Qing, C., Tao, D.: Dehazenet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing 25(11), 5187–5198 (2016)
-  Chang, H., Yeung, D.Y., Xiong, Y.: Super-resolution through neighbor embedding. In: Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. vol. 1, pp. I–I. IEEE (2004)
-  De Boor, C.: Bicubic spline interpolation. Studies in Applied Mathematics 41(1-4), 212–218 (1962)
-  Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: European Conference on Computer Vision. pp. 184–199. Springer (2014)
-  Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38(2), 295–307 (2016)
-  Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network. In: European Conference on Computer Vision. pp. 391–407. Springer (2016)
-  Duchon, C.E.: Lanczos filtering in one and two dimensions. Journal of Applied Meteorology 18(8), 1016–1022 (1979)
-  Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 580–587 (2014)
-  He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision. pp. 1026–1034 (2015)
-  Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5197–5206 (2015)
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on Multimedia. pp. 675–678. ACM (2014)
-  Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1646–1654 (2016)
-  Kim, K.I., Kwon, Y.: Single-image super-resolution using sparse regression and natural image prior. IEEE transactions on pattern analysis and machine intelligence 32(6), 1127–1133 (2010)
-  Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on. vol. 2, pp. 416–423. IEEE (2001)
-  Schulter, S., Leistner, C., Bischof, H.: Fast and accurate image upscaling with super-resolution forests. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3791–3799 (2015)
-  Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1874–1883 (2016)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–9 (2015)
-  Timofte, R., De Smet, V., Van Gool, L.: A+: Adjusted anchored neighborhood regression for fast super-resolution. In: Asian Conference on Computer Vision. pp. 111–126. Springer (2014)
-  Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE transactions on image processing 19(11), 2861–2873 (2010)
-  Zeng, L., Xu, X., Cai, B., Qiu, S., Zhang, T.: Multi-scale convolutional neural networks for crowd counting. arXiv preprint arXiv:1702.02359 (2017)
-  Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: International conference on curves and surfaces. pp. 711–730. Springer (2010)
-  Zhang, K., Gao, X., Tao, D., Li, X.: Multi-scale dictionary for single image super-resolution. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. pp. 1114–1121. IEEE (2012)