Single image super-resolution (SR) aims to infer a high resolution (HR) image from one low resolution (LR) input image, and has wide applications in video surveillance, remote sensing, medical imaging and digital entertainment. Since the SR process is inherently an ill-posed inverse problem, exploring and enforcing strong prior information about the HR image is necessary to guarantee the stability of the SR process. Many traditional example-based SR methods have been devoted to this problem via probabilistic graphical models [1, 2], neighbor embedding [3, 4], sparse coding [5, 6], linear or nonlinear regression [7, 8, 9] and random forests.
More recently, deep networks have been utilized for image SR by modeling the mapping from LR to HR space, and have achieved impressive results. Dong et al. present a deep convolutional neural network with three convolutional layers (SRCNN) to predict the nonlinear relationship between LR and HR images. Due to slow convergence and the difficulty of training deeper networks, their deeper variants with more convolutional layers do not perform better. To break through the limitation of SRCNN, Kim et al. propose a very deep convolutional network (VDSR) for highly accurate SR and adopt an extremely high learning rate as well as residual learning to speed up the training process. Besides, they use adjustable gradient clipping to solve the gradient explosion problem. Meanwhile, to control the number of model parameters in deeper networks, Kim et al. also propose a deeply-recursive convolutional network (DRCN) in which recursive-supervision and skip-connections are used to ease the difficulty of training. For the same reason, the deep recursive residual network (DRRN) is proposed by Tai et al., in which global and local residual learning as well as a recursive module are introduced to reduce the number of model parameters. Since the identity mapping in the residual network (ResNet) makes the training of very deep networks easy, the ResNet architecture with identity mapping has been applied to image restoration. Ledig et al. employ a deep residual network with 16 residual blocks (SRResNet) and skip-connections to super-resolve LR images with an upscaling factor of 4. Lim et al. develop an enhanced deep super-resolution network by removing the batch normalization layers in SRResNet, and their method won the NTIRE2017 super-resolution challenge. Further, to conveniently pass information across several layers or modules, the pattern of multiple or dense skip connections between layers or modules has been adopted in SR. Mao et al. propose a 30-layer convolutional residual encoder-decoder network (RED30) for image restoration, which uses skip-layer connections to symmetrically link convolutional and deconvolutional layers. Inspired by densely connected convolutional networks (DenseNet), which achieve high performance in image classification, Tong et al. utilize the DenseNet structure as building blocks to reuse learnt feature maps and introduce dense skip connections to fuse features at different levels. Meanwhile, Tai et al. propose the deepest persistent memory network (MemNet) for image restoration, in which a memory block is applied to achieve persistent memory and multiple memory blocks are stacked with a densely connected structure to ensure maximum information flow between blocks.
Assembling a set of independent subnetworks, widely adopted in image SR, is also an effective way to improve SR performance. Wang et al. explore a new deep CNN architecture by jointly training both deep and shallow CNNs, where the shallow network stabilizes training and the deep network ensures accurate HR reconstruction. Similarly, Tang et al. propose a joint residual network with three parallel subnetworks to learn the low-frequency and high-frequency information for SR. Yamanaka et al. combine skip connection layers and parallelized CNNs into a deep CNN architecture for image reconstruction. Also, different subnetwork fusion schemes are discussed in , which proposes a context-wise network fusion approach integrating the outputs of individual networks by additional convolutional layers. In addition to parallel network fusion, progressive network fusion, i.e., a cascaded networks structure, has been adopted in SR. Wang et al. exploit the natural sparsity of images and build a cascaded sparse coding network in which each subnetwork is trained for a small upscaling factor. Recently, cascades of convolutional neural networks have also been utilized to progressively upscale low-resolution images.
For image SR, the input and output images, as well as the information among layers in the network, are highly correlated. It is therefore important for SR to combine the knowledge of features at different levels and different scales. Although previous SR approaches utilize deeper networks to capture more contextual information, fusing complementary multi-scale information under different receptive fields remains difficult due to the single-stream structure of their networks, such as SRCNN, FSRCNN, VDSR and DRCN. Besides, it is found in [31, 32] that for deep networks, increasing the width can be more effective than increasing the depth. Therefore, in order to conveniently promote information integration for image SR, a multi-stream structure and network widening may be beneficial. On the other hand, applying residual information learning and a cascaded or progressive structure to SR can reduce the difficulty of directly super-resolving images, which has been demonstrated in several SR methods. Taking the above into consideration, we propose a deep Cascaded Multi-Scale Cross network (CMSC) for SR (illustrated in Fig.2), which consists of a feature extraction network, a set of cascaded subnetworks and a reconstruction network. The cascaded subnetworks are utilized to gradually reconstruct HR features. In each stage of the cascade, we develop a multi-scale cross (MSC) module with two branches (depicted in Fig.3(b)) for fusing multi-level information under different receptive fields, and then stack MSC modules into a subnetwork (shown in Fig.2(b)) for learning the residual information between HR and LR features. During training, a multiple cascaded-supervision strategy is adopted to supervise all of the predictions from the subnetworks and the final output of the overall CMSC model. Compared with state-of-the-art SR methods, our proposed CMSC network achieves the best performance with relatively low execution time, as illustrated in Fig.1. In summary, the major contributions of our proposed method include:
1) A multi-scale cross module that not only fuses multi-scale complementary information under different receptive fields but also helps information flow across the network. In the multi-scale cross module, two branches with different receptive fields are assembled in parallel via averaging and adding.
2) A subnetwork with residual-features learning to reconstruct the high-resolution features. In the subnetwork, multiple multi-scale cross modules are stacked in sequence and an identity mapping is used to add the input to the end of the subnetwork. Therefore, instead of inferring the direct mapping from LR to HR features, the subnetwork uses the stacked multi-scale cross modules to reconstruct only the residual features.
3) A cascaded networks structure to gradually decrease the gap between the estimated HR features and the ground truth HR features. Several residual-features learning subnetworks are cascaded to reconstruct the HR features in a coarse-to-fine manner. All of the outputs from the cascaded stages are fed into the corresponding reconstruction networks to obtain intermediate predictions, which are used to compute the final prediction via a weighted average. Both the intermediate predictions and the final prediction are supervised during training. Thus, SR performance is boosted by the cascaded-supervision and by assembling the intermediate predictions.
The remainder of this paper is organized as follows. Section II discusses the related single image SR methods and network architectures. Section III describes the proposed CMSC network for SR in detail. Model analysis and experimental comparison with other state-of-the-art methods are presented in Section IV, and Section V concludes the paper with observations and discussion.
II. Related Work
Numerous single image SR methods and different network architectures have been proposed in the literature. Here, we focus our discussion on the approaches most relevant to our method.
II-A Multi-branch Module
To improve information flow and to ease training, networks consisting of multi-branch modules have been developed, such as highway networks, residual networks [15, 34], and GoogLeNet [35, 36, 37]. A residual network is built by stacking a sequence of residual blocks, each containing a residual branch and an identity mapping. The inception module in GoogLeNet consists of parallel convolutional branches with different kernel sizes, which are then concatenated for width increase and information fusion. More recently, Zhao et al. propose a deep merge-and-run neural network, which contains a set of merge-and-run (MR) blocks (depicted in Fig.3(a)). The MR block assembles residual branches in parallel with a merge-and-run mapping, which is shown to be a linear idempotent function. Idempotent mapping implies that information from the early blocks can quickly flow to the later blocks and that gradients can be quickly back-propagated from the later blocks to the early ones, thus reducing the training difficulty. Although these methods vary in network topology and training procedure, they all share a key characteristic: they contain multiple branches in each block. On the other hand, a multi-branch network can be viewed as an ensemble of many networks of different depths, by which performance can be boosted. Considering the advantages of multi-branch modules, we extend the merge-and-run block by operating convolutions with different kernel sizes on the two parallel branches to fuse information at different scales for SR.
II-B Residual Learning
Since the SR prediction is largely similar to the input, residual learning is widely adopted in SR to achieve faster convergence and better accuracy. In , the residual feature patches between the estimated HR feature patches and the ground truth feature patches are estimated via a set of cascaded linear regressions. Among deep learning based methods, the two networks VDSR and DRCN are built to learn the residual image between the HR and LR images by using a skip-connection from the input to the reconstruction layer. Later, residual learning was extensively utilized in different SR networks [21, 22, 23, 24, 25]. In , instead of using bicubic interpolation to obtain coarse HR images, a shallow network of three convolutional layers is applied to coarsely estimate the HR images, and a deep network is then utilized to predict the residual images between the ground truth HR images and the coarse HR images. Tai et al. term this way of estimating the residual image by a skip-connection global residual learning (GRL) (as in VDSR and DRCN) and introduce a multi-path local residual learning scheme that is combined with GRL to boost SR performance. Taking the effectiveness of GRL in training deep networks into account, we also adopt GRL in our proposed method. Further, we introduce residual learning into the feature space, termed residual-features learning (RFL), which is performed in each stage of the cascaded process.
II-C Progressive Structure
In image SR, reconstructing the high-frequency details becomes very challenging when the upscaling factor is large. To reduce the difficulty of directly super-resolving the details, some progressive or cascaded structures for SR have been proposed. There are two fashions of cascaded structure for SR: stage-by-stage refining and stage-by-stage upscaling. In the former, each stage takes the output of the previous stage as its input and the ground truth as its target, where the input and output have the same size. The cascade thus minimizes the prediction errors at each stage, so the prediction gradually approaches the target. Hu et al. develop a cascaded linear regressions framework to refine the predicted feature patches in a coarse-to-fine fashion and merge all predicted patches to generate an HR image. In the stage-by-stage upscaling manner, one stage of the cascade upscales the LR image by a smaller scale factor, and its output is fed into the next stage until the desired image scale is reached. In this manner, a deep network cascade is proposed, using multiple stacked collaborative local auto-encoders to gradually upscale low-resolution images. More recently, Lai et al. present the Laplacian pyramid super-resolution network (LapSRN) based on a cascade of convolutional neural networks, which progressively predicts the sub-band residuals of HR images at multiple pyramid levels. To supervise the intermediate stages in LapSRN, HR images at the corresponding scales need to be generated by downsampling the ground truth HR image. Compared with the refining mode, in which the cascade explicitly minimizes the prediction errors at each stage, this incremental approach has looser control over the errors. Therefore, we utilize the cascaded structure with a stage-by-stage refining fashion to gradually refine the HR features.
III. The Proposed CMSC Network
Our proposed CMSC model for SR, outlined in Fig.2, consists of a feature extraction network, a set of cascaded subnetworks and a reconstruction network. The feature extraction network represents the input image as feature maps via a convolutional layer. The set of cascaded subnetworks is designed to reconstruct the HR features from the LR features in a coarse-to-fine manner. The reconstructed HR feature maps are then fed into the reconstruction network to generate the HR image via a convolution operation. In this section, we describe the proposed model in detail, from the multi-scale cross module to the residual-features learning subnetwork and finally the overall cascaded network.
III-A Multi-scale Cross Module
Due to the identity mappings that skip the residual branches in ResNet, a deep residual network is easy to train. Similarly, to further reduce the training difficulty and to improve information flow, deep merge-and-run neural networks, built by stacking a sequence of merge-and-run (MR) blocks, are proposed for the standard image classification task. As depicted in Fig.3(a), the MR block assembles the residual branches in parallel with a merge-and-run mapping: the inputs of the residual branches are averaged, and the average is added to the output of each residual branch to form the input of the subsequent residual branch.
Inspired by the MR block, and considering that different receptive fields of convolutional networks provide different contextual information that is very important for SR, we propose a multi-scale cross (MSC) module to incorporate information obtained under different sizes of receptive fields. As shown in Fig.3(b), an MSC module similarly integrates two residual branches in parallel with a merge-and-run mapping, but operates convolutions of different kernel sizes on the two branches, where the different convolutions are differentiated by rectangles of different colors in Fig.3(b). Each residual branch in the MSC module contains two convolutional layers, and each convolutional layer is followed by batch normalization (BN) and LeakyReLU. Moreover, the "+" operators in gray circles in Fig.3, placed between BN and LeakyReLU, denote the addition operation, while the "+" operators in white circles denote the average operation. With this design, the branches provide complementary contextual information at multiple scales, which is further combined by the merge-and-run mapping. Denoting by $H_1(\cdot)$ and $H_2(\cdot)$ the transition functions of the two residual branches, the proposed module can be represented in matrix form as

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} H_1(x_1) \\ H_2(x_2) \end{bmatrix} + P \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad P = \frac{1}{2}\begin{bmatrix} I & I \\ I & I \end{bmatrix}, \tag{1}$$

where $x_1$ and $x_2$ ($y_1$ and $y_2$) are the inputs (outputs) of the two residual branches of the module, $I$ is the identity matrix, and $P$ is the transformation matrix of the merge-and-run mapping. With this multi-scale information fusing structure, the proposed module can exploit a wide variety of contextual information to infer the missing high-frequency components. On the other hand, as analyzed in , the transformation matrix of the merge-and-run mapping is idempotent ($P^2 = P$), a property that the proposed MSC module inherits. Idempotency not only helps rich information flow across the different modules, but also encourages gradient back-propagation during training. Experimental analysis is described in Section IV-C.
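The idempotency of the merge-and-run mapping is easy to verify numerically. Below is a minimal check (in NumPy, with a toy feature dimension of our choosing) that builds the transformation matrix and confirms that applying it twice equals applying it once:

```python
import numpy as np

# Transformation matrix of the merge-and-run mapping for a toy feature
# dimension n: P = (1/2) [[I, I], [I, I]], matching the "average the
# branch inputs, add the average to each branch" operation.
n = 4
I = np.eye(n)
P = 0.5 * np.block([[I, I], [I, I]])

# Idempotency: applying the mapping twice equals applying it once, so
# early-module information can pass unchanged to later modules.
assert np.allclose(P @ P, P)
```

The same check passes for any feature dimension, since each block of $P^2$ reduces to the corresponding block of $P$.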
III-B Residual-Features Learning Subnetwork
Aiming to infer the HR features, a subnetwork with residual-features learning (RFL) is built. As illustrated in Fig.2(b), the building subnetwork contains a convolutional layer and a sequence of multi-scale cross (MSC) modules. At the end of the last MSC module, the two outputs from the two branches and the average of their inputs are fused by addition, which can be formulated as

$$y = H_1(x_1) + H_2(x_2) + \frac{1}{2}\left(x_1 + x_2\right), \tag{2}$$

where $y$ denotes the output of the last MSC module and the other notations have the same meanings as those in Eq. (1). The first convolutional layer in the subnetwork transforms the input of the subnetwork into a specified number of feature maps. Then, with the parallel and intersecting structure, the stacked MSC modules process these features through different processing procedures. Meanwhile, the multiple merge-and-run mappings in the subnetwork are exploited to integrate the different information coming from the two branches as well as to create quick paths that directly send the information of the intermediate branches to the later modules.
Since the features of an HR image are highly correlated with the features of the corresponding LR image, we introduce residual-features learning into our subnetwork by adding an identity branch from the input to the end of the subnetwork (blue curves in Fig.2(b)). Therefore, instead of directly inferring the HR features, the subnetwork is built to estimate the residual features between the LR and HR features. We denote $u$ and $v$ as the input and the output of the subnetwork, $M$ as the number of stacked MSC modules, and $C_0(\cdot)$ as the convolution operation of the first convolutional layer. Thus, the output of the subnetwork is

$$v = F_M\left(F_{M-1}\left(\cdots F_1\left(C_0(u)\right) \cdots\right)\right) + u, \tag{3}$$

where $F_m(\cdot)$ is the function of the MSC module from the last subsection (depicted in Eq. (1)) and $F_M(\cdot)$ denotes the function of the last MSC module in Eq. (2). Since convolutions with different kernel sizes are performed and residual-features learning is applied, the subnetwork is able to capture different characteristics and representations for inferring the HR features.
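To make the data flow of the subnetwork concrete, the following sketch traces a toy input through stacked MSC modules and the identity branch. The convolutional branches are replaced by plain callables, and all function names are our own illustration rather than the paper's code:

```python
import numpy as np

# Sketch of the residual-features learning (RFL) subnetwork: stacked
# multi-scale cross (MSC) modules followed by an identity mapping from
# the subnetwork input to its end. Convolutions are stand-in callables
# here, so only the wiring of Eqs. (1)-(3) is illustrated.

def msc_module(x1, x2, H1, H2):
    """Eq. (1): add the average of the branch inputs to each branch output."""
    avg = 0.5 * (x1 + x2)
    return H1(x1) + avg, H2(x2) + avg

def last_msc_module(x1, x2, H1, H2):
    """Eq. (2): fuse both branch outputs and the input average by addition."""
    return H1(x1) + H2(x2) + 0.5 * (x1 + x2)

def rfl_subnetwork(u, branches, first_conv):
    """Eq. (3): first conv layer, stacked MSC modules, then + u (RFL)."""
    x1 = x2 = first_conv(u)          # first conv layer feeds both branches
    for H1, H2 in branches[:-1]:
        x1, x2 = msc_module(x1, x2, H1, H2)
    H1, H2 = branches[-1]
    return last_msc_module(x1, x2, H1, H2) + u   # identity branch (RFL)

# Toy run with identity transitions: every module simply rescales the
# features, so the output is a scalar multiple of the input.
u = np.ones(8)
identity = (lambda x: x, lambda x: x)
out = rfl_subnetwork(u, [identity] * 3, lambda x: x)
```

With identity transitions, each module rescales the features by a fixed factor, which makes the wiring easy to verify by hand.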
III-C Overall Model with Cascaded Structure
In order to reconstruct HR features and reduce the difficulty of directly super-resolving the details, we build a cascade of subnetworks to estimate the HR features from the LR features extracted by the feature extraction network. All subnetworks share the same structure and settings as described in Section III-B and are jointly trained with simultaneous supervision. It is expected that each subnetwork stage brings the predictions closer to the ground truth HR features, so that the cascade progressively minimizes the prediction errors. All of the estimated HR features from the subnetworks are then fed into the corresponding reconstruction layers to reconstruct the HR image. The full model is illustrated in Fig.2 and termed the cascaded multi-scale cross network (CMSC).
Let $X$ and $Y$ denote the input and output of the CMSC network. We utilize a convolutional layer followed by BN and LeakyReLU as the feature extraction network to extract features from the LR input image:

$$B_0 = f_{ext}(X), \tag{4}$$

where $f_{ext}(\cdot)$ denotes the operation of the feature extraction network and $B_0$ is the extracted features, which are fed into the first subnetwork. Supposing $C$ subnetworks with RFL are cascaded to progressively infer the HR features, and following the notations in Section III-B, the process of inference is represented as

$$B_c = g_c\left(B_{c-1}\right), \quad c = 1, 2, \ldots, C, \tag{5}$$
where $g_c(\cdot)$ represents the function of the $c$-th subnetwork, as depicted in Eq. (3). In order to make the output of each cascaded subnetwork closer to the ground truth HR features, we supervise all predictions from the cascaded subnetworks via cascaded-supervision. The output of each subnetwork is fed into the reconstruction network, where each convolutional layer takes the output of the corresponding stage as its input to reconstruct the corresponding HR residual image, and the intermediate predictions of the cascaded stages are thus generated, as illustrated in Fig.2(a). Similar to [12, 13, 14, 22], we also adopt global residual learning (GRL) in our network via an identity branch from the input to the reconstruction network. Thus, the intermediate prediction is

$$\hat{Y}_c = R_c\left(B_c\right) + X, \quad c = 1, 2, \ldots, C, \tag{6}$$

where $B_c$ is the output features of the $c$-th stage and $R_c(\cdot)$ denotes the function of the corresponding reconstruction layer. Then, all of the intermediate predictions are assembled to generate the final output via a weighted average:

$$\hat{Y} = \sum_{c=1}^{C} w_c \hat{Y}_c, \tag{7}$$

where $w_c$ denotes the weight for the prediction from the $c$-th subnetwork. All of the weights in the above equation are learned during training, and the final output is also supervised.
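The pipeline above — per-stage reconstruction layers, adding back the interpolated input, and a learned weighted average over the intermediate predictions — can be sketched as follows, with identity stand-ins for the reconstruction layers and placeholder weights of our choosing:

```python
import numpy as np

# Sketch of the cascaded prediction stage: each stage's HR features pass
# through their own reconstruction layer, the interpolated input X is
# added back (global residual learning), and the intermediate predictions
# are combined by a learned weighted average.

def cmsc_outputs(X, stage_features, recon_layers, weights):
    intermediates = [R(B) + X for R, B in zip(recon_layers, stage_features)]
    final = sum(w * Y for w, Y in zip(weights, intermediates))
    return intermediates, final

X = np.full(4, 10.0)                         # toy interpolated LR input
feats = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
recons = [lambda B: B] * 3                   # identity stand-in layers
w = [0.2, 0.3, 0.5]                          # placeholder ensemble weights
inter, final = cmsc_outputs(X, feats, recons, w)
```

In the real model the weights are trainable scalars and each reconstruction layer is a convolution; here they are fixed so the arithmetic can be traced directly.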
Given a training dataset $\left\{X^{(i)}, Y^{(i)}\right\}_{i=1}^{N}$, where $N$ is the number of training patches and $X^{(i)}$ and $Y^{(i)}$ are the LR and HR patch pairs, the loss function of our model with cascaded-supervision can be formulated as

$$L(\Theta) = \frac{\alpha}{2N} \sum_{i=1}^{N} \left\| Y^{(i)} - \hat{Y}^{(i)} \right\|^2 + \frac{1-\alpha}{2CN} \sum_{i=1}^{N} \sum_{c=1}^{C} \left\| Y^{(i)} - \hat{Y}_c^{(i)} \right\|^2, \tag{8}$$

where $\Theta$ denotes the parameter set. The weight $\alpha$ balances the loss on the final output against the losses on the intermediate outputs, and is set empirically.
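A minimal sketch of this cascaded-supervision loss, balancing the final output's error against the averaged errors of the intermediate outputs; the balancing weight `alpha` below is a placeholder, not the paper's value:

```python
import numpy as np

# Cascaded-supervision loss: alpha-weighted MSE on the final ensembled
# output plus the averaged MSEs over the C intermediate outputs.

def cmsc_loss(Y, Y_final, Y_inters, alpha=0.5):
    mse = lambda a, b: 0.5 * np.mean((a - b) ** 2)
    inter_term = np.mean([mse(Y, Yc) for Yc in Y_inters])
    return alpha * mse(Y, Y_final) + (1.0 - alpha) * inter_term
```

Supervising the intermediate outputs in this way is what forces each stage of the cascade, not just the final ensemble, to approach the ground truth.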
When the depth of a network is defined as the longest path from the input to the output, the depth of our overall model can be calculated as

$$D = C \times (2M + 1) + 2, \tag{9}$$

where $C$ and $M$ denote the number of cascaded subnetworks and the number of MSC modules in each subnetwork, respectively. The $M$ multiplied by 2 accounts for the two convolutional layers contained in a branch of an MSC module, and the 1 in the parentheses accounts for the first convolutional layer in each subnetwork. Besides, the 2 at the end of the equation accounts for the two convolutional layers in the feature extraction network and the reconstruction network, respectively.
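The depth formula is easy to sanity-check in code; with three cascaded subnetworks of five MSC modules each (the configuration used later in the experiments), it yields the reported depth of 35:

```python
# Depth of the overall CMSC model, Eq. (9): each of the C subnetworks
# contributes 2M conv layers (two per MSC module along the longest path)
# plus one leading conv layer; the feature extraction and reconstruction
# networks add one conv layer each.
def cmsc_depth(C, M):
    return C * (2 * M + 1) + 2

assert cmsc_depth(3, 5) == 35   # configuration chosen in Section IV-C4
```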
IV. Experiments and Analysis
IV-A Datasets and Evaluation Metrics
We conduct comparison studies on the widely used datasets Set5, Set14, BSD100 and Urban100, which contain 5, 14, 100 and 100 images respectively. For benchmarking against other methods, we use a training set of 291 images, which consists of 91 images from Yang et al. and 200 images from the training set of the BSD300 dataset. For the results in the model analysis section, the 91 images are used so that networks can be trained quickly. In addition, data augmentation is used, including flipping, downscaling and rotation.
We use the peak signal-to-noise ratio (PSNR), the structural similarity (SSIM) index and the information fidelity criterion (IFC) as evaluation metrics. Since the reconstruction is performed on the Y channel in the YCbCr color space, all criteria are calculated on the Y channel of the images after pixels near the image boundary are removed.
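As a concrete instance of this protocol, PSNR on the Y channel with shaved boundary pixels can be computed as below (8-bit pixel range assumed; the border width `b` is a parameter of the protocol, not a value fixed by the paper):

```python
import numpy as np

# PSNR between a ground-truth and a super-resolved Y-channel image,
# computed after removing b pixels near the image boundary.

def psnr_y(gt, pred, b=4):
    gt = gt[b:-b, b:-b].astype(np.float64)
    pred = pred[b:-b, b:-b].astype(np.float64)
    mse = np.mean((gt - pred) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

SSIM and IFC follow the same convention: compute on the shaved Y channel only, with bicubic upsampling of the chroma channels left out of the evaluation.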
IV-B Implementation Details
Given the HR image, the input LR image is generated by first downsampling and then upsampling the HR image via bicubic interpolation at a given scale factor, so that the LR input has the same size as the HR image. Following , we train only a single model for all scale factors (×2, ×3 and ×4), and the LR and HR pairs of all scales are combined into one training dataset.
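A sketch of this LR-input generation, using scipy's cubic-spline `zoom` as a stand-in for MATLAB-style bicubic interpolation (the function name `make_lr` is ours):

```python
import numpy as np
from scipy.ndimage import zoom

# Build a training input: downsample the HR image by scale s, then
# upsample back so the LR input matches the HR size. scipy's cubic
# spline (order=3) approximates, but is not identical to, bicubic
# interpolation as used in the paper.

def make_lr(hr, s):
    small = zoom(hr, 1.0 / s, order=3)
    zy = hr.shape[0] / small.shape[0]
    zx = hr.shape[1] / small.shape[1]
    return zoom(small, (zy, zx), order=3)

hr = np.random.rand(48, 48)
lr = make_lr(hr, 3)   # same spatial size as hr, high frequencies lost
```

Because the network input and output share the same size, no transposed convolution or sub-pixel layer is needed anywhere in the model.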
We split the training images into sub-images with no overlap and set the mini-batch size to 32 for the stochastic gradient descent (SGD) solver with momentum and weight decay. The initial learning rate is decreased by a factor of 10 every 10 epochs. To suppress vanishing and exploding gradients, we clip individual gradients to a fixed range $[-\theta, \theta]$. We find that the convergence procedure is fast: our model converges after about 50 epochs.
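The step schedule and clipping rule can be sketched in a few lines; the base learning rate and clipping threshold below are placeholders, since the exact values are not reproduced here:

```python
# Step learning-rate schedule and per-entry gradient clipping used with
# SGD: the rate drops by a factor of 10 every 10 epochs, and each
# gradient entry is clipped to [-theta, theta]. base_lr and theta are
# placeholder values, not the paper's settings.

def learning_rate(base_lr, epoch):
    return base_lr * (0.1 ** (epoch // 10))

def clip_gradient(g, theta):
    return max(-theta, min(theta, g))
```

With a 10-epoch step size and convergence after about 50 epochs, training passes through roughly five learning-rate plateaus.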
In our CMSC network, each convolutional layer has 64 filters and is followed by BN and LeakyReLU. The kernel size of the convolutional layers is set to 3×3, except for the convolutional layers in the MSC modules, whose kernel sizes are determined according to the experimental analysis in Section IV-C. For weight initialization in all convolutional layers, we apply the same method as He et al. We use PyTorch on an NVIDIA Titan X Pascal GPU (12 GB memory) for model training and testing.
TABLE I: Study on the Effect of the Cascaded Structure. Average PSNRs Are Reported. Boldface Indicates the Best Performance.
IV-C Model Analysis
In this section, the design and the contributions of the different components of our model are analyzed through experiments, covering the MSC module, the cascaded structure, residual-features learning, cascaded-supervision and the use of different reconstruction layers. For all experiments in this section, we use models trained on the 91 images for faster training.
IV-C1 Multi-scale cross module
To design the multi-scale cross (MSC) module for fusing multi-level information, and to demonstrate its superiority, we build four modules for comparison. Two merge-and-run (MR) blocks with different filter sizes are shown in Fig.4(a): (I) MR_L3R3, where the four convolutional layers in the two assembled residual branches have 64 filters of size 3×3; and (II) MR_L5R5, in which each convolutional layer has 64 filters of size 5×5. In addition, Fig.4(a) depicts two MSC modules that fuse information from different receptive fields: (III) MSC_L3R5, in which one residual branch contains two stacked convolutional layers with 64 filters of size 3×3 while the other residual branch uses the same number of filters of size 5×5; and (IV) MSC_L3R7, which has a structure similar to MSC_L3R5 but stacks two convolutional layers with filters of size 7×7 in one residual branch (orange rectangles in Fig.4(a)). We stack these different modules to construct the corresponding plain networks with global residual learning (GRL) (Fig.4(b)) for comparing SR performance. As illustrated in Fig.4(b), the basic network is composed of a convolutional layer, four stacked modules and another convolutional layer for reconstructing the residual image. By applying the four modules of Fig.4(a) to the basic structure in Fig.4(b), we obtain four networks named after their constituent modules (i.e., MR_L3R3, MR_L5R5, MSC_L3R5 and MSC_L3R7). In addition, we compare these four networks with VDSR (illustrated in Fig.4(c)), which has 20 stacked convolutional layers. We apply the trained models of these networks to the Set5 dataset and illustrate the PSNRs of these structures in Fig.5.
One can see that, by introducing interactions between branches, the MR_L3R3 network achieves higher PSNR than VDSR with fewer parameters and fewer layers, which manifests the effectiveness of the merge-and-run mapping for SR. On the other hand, our proposed MSC_L3R5 performs better than MR_L5R5 while containing fewer parameters, which suggests that the gains of our MSC module over the MR blocks (MR_L3R3 and MR_L5R5) come not only from the larger receptive field but also from multi-scale complementary information fusion and richer representations. Among these models, MSC_L3R7 achieves the best performance. Considering both performance and the number of parameters, we adopt MSC_L3R5 as our MSC module in the following experiments.
IV-C2 Cascaded structure
To validate the effectiveness of the cascaded structure, we use fifteen MSC modules to build a plain structure and a cascaded structure for comparison, which are shown in Fig.6. For the plain structure, denoted PMSC, we stack the fifteen MSC modules to reconstruct the HR features and also apply residual-features learning (RFL) as well as global residual learning (GRL). For the cascaded structure, we utilize five MSC modules in each subnetwork and adopt three subnetworks for three stages of cascade. For a fair comparison, cascaded-supervision and the predictions ensemble are excluded from the cascaded structure, which is denoted CMSC_NS. Both structures use one convolutional layer as the feature extraction layer and one as the reconstruction layer. TABLE I presents the SR results on four benchmark datasets: Set5, Set14, BSD100 and Urban100. From the results, it can be seen that the cascaded structure (CMSC_NS) achieves moderate performance improvements over the plain structure. Therefore, image SR can benefit from the cascaded structure.
TABLE II: Study on the Effects of Residual-Features Learning, Cascaded-Supervision and Different Reconstruction Layers Utilization. Average PSNRs/SSIMs on the Set5 Dataset Are Reported. Boldface Indicates the Best Performance.
TABLE III: Quantitative Evaluations of State-of-the-art SR Methods. Average PSNRs/SSIMs/IFCs for Scale Factors of ×2, ×3 and ×4 Are Reported. Boldface Indicates the Best Performance and Underline Indicates the Second-best Performance.
IV-C3 Residual-features learning, cascaded-supervision and different reconstruction layers utilization
To study the contributions of residual-features learning (RFL), cascaded-supervision and different reconstruction layers utilization to SR performance, we build three models for comparison besides our final model (CMSC), termed CMSC_NRS, CMSC_NS and CMSC_SR respectively. For CMSC_NRS, the identity branch between the beginning and the end of each cascaded subnetwork (blue curves in Fig.2(b)) is removed from the CMSC, and the multiple supervisions are also excluded; the final prediction is obtained by directly feeding the output of the last cascaded stage into the reconstruction network. Based on CMSC_NRS, we restore the RFL in each subnetwork to obtain CMSC_NS, in which the cascaded-supervision is still not applied. The difference between CMSC_SR and CMSC is that all subnetworks in CMSC_SR share the same reconstruction layer to obtain the intermediate predictions, while the subnetworks in CMSC have their own respective reconstruction layers. The four networks have the same number of cascaded subnetworks and the same number of MSC modules in each subnetwork. TABLE II shows the SR performance of the four models in terms of PSNR and SSIM on the Set5 dataset for scale factors of ×2, ×3 and ×4. We can see that both residual-features learning and cascaded-supervision contribute to improving SR performance. Further, CMSC achieves better performance than CMSC_SR with only a very small increase in parameters, which manifests that applying different reconstruction layers to different cascaded stages can further boost SR performance.
IV-C4 The number of stages (C) and the number of modules (M)
The capacity of the CMSC is mainly determined by two parameters: the number of cascaded subnetworks (C) and the number of MSC modules (M) in each subnetwork. In this subsection, we test the effects of these two parameters on image SR. First, we fix M and vary the number of stages C. Fig.7(a) shows the curve of PSNR versus C on the BSD100 dataset, with the corresponding average execution time in seconds marked beside the curve. We can see that the performance improves gradually with an increasing number of stages, but at the expense of increased computational cost.
Next, C is fixed and the number of modules M in each stage is increased. The curve of PSNR versus M is illustrated in Fig.7(b). The more MSC modules a subnetwork contains, the deeper the network becomes. Therefore, the curve in Fig.7(b) illustrates that a deeper network still achieves better performance, but with more execution time. To strike a balance between performance and speed, we choose C = 3 and M = 5 for our CMSC model, whose depth is 35 according to Eq. (9).
IV-D Comparisons With the State-of-the-Art
To illustrate the effectiveness of the proposed CMSC model, several state-of-the-art single-image SR methods, including A+ , SelfExSR , SRCNN , FSRCNN , VDSR , DRCN , LapSRN , DRRN  and MemNet , are compared in terms of quantitative evaluation, visual quality and execution time. For comparison, we also construct the CMSC_SR model, which has the same parameters () and () as the CMSC model but shares the reconstruction network among all subnetwork stages, similar to DRCN  and MemNet . All methods are applied only to the luminance channel of an image, while bicubic interpolation is used for the color components.
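Evaluating on the luminance channel requires an RGB-to-YCbCr conversion; a minimal sketch of the conversion step is below. The full-range ITU-R BT.601 coefficients are an assumption (the paper does not state which variant it uses); in the full pipeline the Y channel would go through the SR network while Cb and Cr are upscaled bicubically.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Full-range ITU-R BT.601 RGB -> YCbCr for img in [0, 1], shape (H, W, 3).
    Channel 0 (Y, luminance) is what the SR networks operate on; channels
    1-2 (Cb, Cr) would be interpolated bicubically instead."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)
```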
The quantitative evaluations on the four benchmark datasets for three scale factors (, , ) are summarized in TABLE III. Since the trained model for a scale factor of is not provided by LapSRN , we generate the corresponding results by downscaling its upscaling results, in the same way as in . While the proposed CMSC_SR achieves results comparable to state-of-the-art approaches, our final model CMSC significantly outperforms all existing methods on all datasets for all upscaling factors in terms of PSNR, SSIM and IFC. Compared to MemNet , which obtains the highest performance among the prior methods, our proposed CMSC achieves improvements of 0.12 dB, 0.11 dB and 0.12 dB on the average PSNRs of the four datasets for the three upscaling factors (, , ), respectively. In particular, on the very challenging Urban100 dataset, the proposed CMSC outperforms the state-of-the-art method (MemNet ) by PSNR gains of 0.16 dB, 0.13 dB and 0.14 dB for scale factors of , and , respectively. In addition, objective image quality assessment in terms of SSIM and IFC scores further validates the superiority of the proposed method.
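For reference, the PSNR figure quoted throughout can be computed as below. The optional border `shave` is a common SR evaluation convention, not something the paper specifies, so it is kept as an explicit, hedged parameter.

```python
import numpy as np

def psnr(ref, est, shave=0, max_val=1.0):
    """PSNR in dB between two single-channel images with values in
    [0, max_val]. `shave` crops that many border pixels before comparison,
    a common convention in SR benchmarks (set to 0 if not desired)."""
    ref = np.asarray(ref, dtype=np.float64)
    est = np.asarray(est, dtype=np.float64)
    if shave > 0:
        ref = ref[shave:-shave, shave:-shave]
        est = est[shave:-shave, shave:-shave]
    mse = np.mean((ref - est) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A gain of 0.12 dB on this scale corresponds to a roughly 2.7% reduction in mean squared error, which is a meaningful margin between closely matched deep SR models.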
Visual comparisons of the different methods are shown in Fig.8, Fig.9, Fig.10 and Fig.11. Our proposed CMSC accurately and clearly reconstructs texture patterns, grid patterns and fine lines. The results generated by the prior methods contain severe distortions and artifacts, such as in the marked zebra stripes and the texture regions of the wing in Fig.8 and Fig.9. In contrast, our method avoids the distortions and suppresses the artifacts through cascaded feature reconstruction, residual-features learning and multi-scale information fusion. Moreover, in Fig.10 and Fig.11, only our method is able to reconstruct finer edges and clearer grids, while the other methods generate very blurry results.
We also adopt the publicly available source codes of the state-of-the-art methods to measure execution time. Since the testing codes of SRCNN  and FSRCNN  are implemented on the CPU, we rebuild both models as well as the VDSR  model in PyTorch with the same network parameters to evaluate their runtime on GPU. Fig.1 shows PSNR performance versus execution time in the testing phase on the Set5 dataset for a scale factor of . Our proposed CMSC outperforms all the mentioned methods with relatively low execution time. Our source code will be released publicly later.
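Comparing runtimes across frameworks has one well-known pitfall: GPU kernels launch asynchronously, so naive wall-clock timing can under-report. A small, framework-agnostic sketch of a fair measurement loop follows; the function name and parameters are ours, not from the paper's code, and `synchronize` would be `torch.cuda.synchronize` when timing the PyTorch models on GPU.

```python
import time

def measure_runtime(forward, warmup=3, runs=10, synchronize=None):
    """Average wall-clock time of `forward()` in seconds.

    `warmup` iterations absorb one-time costs (memory allocation, kernel
    autotuning). `synchronize` (e.g. torch.cuda.synchronize) must be passed
    when timing GPU code, since CUDA kernels launch asynchronously and would
    otherwise be excluded from the measured interval."""
    for _ in range(warmup):
        forward()
    if synchronize:
        synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        forward()
    if synchronize:
        synchronize()
    return (time.perf_counter() - start) / runs
```

Usage on CPU is simply `measure_runtime(lambda: model(lr_image))`; on GPU the synchronization callback is required for the numbers to be comparable with CPU timings.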
In this paper, we propose a deep cascaded multi-scale cross network (CMSC) for modeling the super-resolution reconstruction process, in which a sequence of subnetworks is cascaded to gradually refine high-resolution features with cascaded-supervision in a coarse-to-fine manner. In each cascaded subnetwork, multiple multi-scale cross (MSC) modules are stacked not only to fuse complementary information under different receptive fields but also to improve information flow across layers. In addition, to make full use of the relative information between high-resolution and low-resolution features, residual-features learning is introduced into the cascaded subnetworks to further boost reconstruction performance. Comprehensive evaluations on benchmark datasets demonstrate that our CMSC network outperforms state-of-the-art super-resolution methods in both quantitative and qualitative terms, with relatively low execution time.
Since the subnetworks of the CMSC at all stages have the same structure and the same aim, it is possible for our model to share network parameters across the cascaded stages. In future work, we will explore a suitable strategy for sharing parameters both across and within the cascaded stages, so as to control the number of model parameters without a decrease in performance. We will also extend our CMSC model to other image restoration and heterogeneous image transformation tasks.
-  W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-level vision,” Int. J. Comput. Vis., vol. 40, no. 1, pp. 25–47, Oct. 2000.
-  G. Polatkan, M. Zhou, L. Carin, and D. Blei, “A Bayesian non-parametric approach to image super-resolution,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 346–358, Feb. 2015.
-  H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun./Jul. 2004, pp. 275–282.
-  X. Gao, K. Zhang, D. Tao, and X. Li, “Image super-resolution with sparse neighbor embedding,” IEEE Trans. Image Process., vol. 21, no. 7, pp. 3194–3205, Jul. 2012.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
-  L. He, H. Qi, and R. Zaretzki, “Beta process joint dictionary learning for coupled feature spaces with application to single image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 345–352.
-  R. Timofte, V. D. Smet, and L. V. Gool, “A+: adjusted anchored neighborhood regression for fast super-resolution,” in Proc. 12th Asian Conf. Comput. Vis. (ACCV), Nov. 2014, pp. 111–126.
-  Y. Hu, N. Wang, D. Tao, X. Gao, and X. Li, “SERF: a simple, effective, robust, and fast image super-resolver from cascaded linear regression,” IEEE Trans. Image Process., vol. 25, no. 9, pp. 4091–4102, Sep. 2016.
-  H. Wang, X. Gao, K. Zhang, and J. Li, “Single image super-resolution using active-sampling Gaussian process regression,” IEEE Trans. Image Process., vol. 25, no. 2, pp. 935–948, Feb. 2016.
-  S. Schulter, C. Leistner, and H. Bischof, “Fast and accurate image upscaling with super-resolution forests,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3791–3799.
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2016.
-  J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1646–1654.
-  J. Kim, and J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1637–1645.
-  Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3147–3155.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2016, pp. 630–645.
-  C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 105–114.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 136–144.
-  R. Timofte, E. Agustsson, L. V. Gool, M.-H. Yang, L. Zhang, B. Lim, and et al., “NTIRE 2017 challenge on single image super-resolution: methods and results,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 1110–1121.
-  X.-J. Mao, C. Shen, and Y.-B. Yang, “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2016, pp. 2802–2810.
-  G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
-  T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4799–4807.
-  Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: a persistent memory network for image restoration,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4549–4557.
-  Y. Wang, L. Wang, H. Wang, and P. Li, “End-to-end image super-resolution via deep and shallow convolutional networks,” arXiv: 1607.07680, Jul. 2016.
-  Z. Tang, L. Luo, H. Peng, and S. Li, “A joint residual network with paired ReLUs activation for image super-resolution,” Neurocomputing, vol. 273, pp. 37–46, Jan. 2018.
-  J. Yamanaka, S. Kuwashima, and T. Kurita, “Fast and accurate image super resolution by deep CNN with skip connection and network in network,” in Int. Conf. Neural Inf. Process. (ICONIP), Nov. 2017, pp. 217–225.
-  H. Ren, M. El-Khamy, and J. Lee, “Image super resolution based on fusing multiple convolution neural networks,” in IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 54–61.
-  Z. Wang, D. Liu, J. Yang, W. Han, and T. S. Huang, “Deep networks for image super-resolution with sparse prior,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 370–378.
-  Z. Cui, H. Chang, S. Shan, B. Zhong, and X. Chen, “Deep network cascade for image super-resolution,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2014, pp. 49–64.
-  W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep Laplacian pyramid networks for fast and accurate super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 624–632.
-  C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 391–407.
-  S. Zagoruyko and N. Komodakis, “DiracNets: training very deep neural networks without skip-connections,” arXiv: 1706.00388, Jun. 2017.
-  L. Zhao, J. Wang, X. Li, Z. Tu, and W. Zeng, “Deep convolutional neural networks with merge-and-run mappings,” arXiv: 1611.07718, Jul. 2017.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2015, pp. 2377–2385.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun./Jul. 2016, pp. 770–778.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
-  C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” arXiv: 1602.07261, Aug. 2016.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun./Jul. 2016, pp. 2818–2826.
-  R. Timofte, R. Rothe, and L. V. Gool, “Seven ways to improve example-based single image super resolution,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun./Jul. 2016, pp. 1865–1873.
-  S. Ioffe and C. Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” in Int. Conf. Mach. Learn. (ICML), Jul. 2015, pp. 448–456.
-  A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Int. Conf. Mach. Learn. (ICML), Jun. 2013.
-  M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. A. Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in Proc. 23rd British Mach. Vis. Conf. (BMVC), Sep. 2012, pp. 135.1–135.10.
-  R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Proc. 7th Int. Conf. Curves Surfaces, Jun. 2010, pp. 711–730.
-  P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 898–916, May 2011.
-  J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5197–5206.
-  Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
-  H. R. Sheikh, A. C. Bovik, and G. de Veciana, “An information fidelity criterion for image quality assessment using natural scene statistics,” IEEE Trans. Image Process., vol. 14, no. 12, pp. 2117–2128, Dec. 2005.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: surpassing human-level performance on imagenet classification,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.