Single image super-resolution (SISR) is an important low-level computer vision task that aims to recover a high-resolution (HR) image from a low-resolution (LR) image. It is a severely ill-posed problem, since an LR image can be mapped to an infinite number of HR images. Recently, deep convolutional neural networks (CNNs) have greatly advanced this field. Dong et al. first proposed a three-layer CNN to establish a mapping between LR and HR images. Kim et al. proposed the well-known VDSR, which introduced residual learning and adaptive gradient clipping to ease the training of deep networks. In DRCN, a recursive network was used to reduce the number of model parameters, and a multi-supervised strategy was adopted to fuse intermediate results. Because skip connections can alleviate the vanishing-gradient problem, Lim et al. built a very deep network, MDSR (more than 160 layers), with residual blocks.
Researchers usually deepen and widen the network to achieve better performance. However, even when built with small convolution kernels, such networks consume a large amount of memory. Several strategies have been adopted to lighten deep networks. DRRN employs a parameter-sharing strategy to reduce parameters, but it still requires a large amount of computation. CARN-M adopts group convolution to strike a trade-off between computation and performance. Unfortunately, applying group convolution directly to SISR noticeably impairs performance.
To address these problems, we propose a lightweight network, LFFN, to compute the HR image from the original LR image. In LFFN, we introduce a new organization of the inception-residual block, named the spindle block, which contains a dimension extension unit, a feature exploration unit and a feature refinement unit. The dimension extension unit learns feature maps suited to the next unit, so the backbone can be lightened by using fewer filters. Inspired by ResNeXt and Xception, we introduce a feature exploration unit to explore linear, nonlinear and multi-scale information in 4 different channel groups. This unit improves the representational power of the model and further lightens the architecture, since each group uses fewer filters. We also consider using feature maps of different receptive fields to enhance performance. Taking computation into account and motivated by the feature recalibration demonstrated in SENets, we develop a softmax feature fusion module (SFFM) that aggregates features of different levels through self-adaptive, channel-wise convex weighting, rather than the multi-supervised method used in DRCN and MemNet. SFFM adds few parameters, since only one dense layer is applied to the global feature of each level, and it learns to combine the features that are most conducive to reconstruction.
II Proposed Method
II-A Network Structure
As shown in Fig. 2, the overall architecture consists of spindle blocks, a softmax feature fusion module (SFFM) and an up-sampling module. LFFN takes the LR image as input and produces the reconstructed HR image as output. First, a single convolutional layer with 48 filters extracts feature maps from the original LR image, and these feature maps serve as the input to the next part.
The next part consists of stacked local feature fusion modules. Inspired by MemNet and SRDenseNet, we concatenate the feature maps from the stacked spindle blocks to make full use of local features. We also introduce residual learning in each module to make training the deep network easier. Each module thus applies its spindle blocks in sequence, concatenates their outputs and passes the result through the module's convolution, with a residual connection around the module.
More details about the spindle block are given in the next section. After progressively extracting complex features with the modules, we apply softmax feature fusion (SFFM) to the module outputs.
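The data flow of one local feature fusion module can be sketched as follows. This is an illustrative sketch only: features are reduced to a single channel vector, `toy_block` and `toy_fuse` are hypothetical stand-ins for a spindle block and the module's convolution, and the exact placement of the concatenation and the skip connection is an assumption, since the original equation is not available here.

```python
from typing import Callable, List

Vector = List[float]  # channel vector at one spatial position (simplification)


def fusion_module(x: Vector,
                  blocks: List[Callable[[Vector], Vector]],
                  fuse: Callable[[Vector], Vector]) -> Vector:
    """Run stacked blocks, concatenate their outputs, fuse, add the skip."""
    outputs = []
    h = x
    for block in blocks:
        h = block(h)                # each block consumes the previous output
        outputs.append(h)
    concat = [v for out in outputs for v in out]  # channel-wise concatenation
    fused = fuse(concat)            # stands in for the module's convolution
    return [a + b for a, b in zip(x, fused)]      # residual connection


# Hypothetical stand-ins: each "block" doubles its input, and the "fusion"
# averages the concatenated copies back down to the input width.
def toy_block(v: Vector) -> Vector:
    return [2.0 * a for a in v]


def toy_fuse(concat: Vector, width: int = 2) -> Vector:
    n = len(concat) // width
    return [sum(concat[i + k * width] for k in range(n)) / n
            for i in range(width)]


print(fusion_module([1.0, -1.0], [toy_block] * 4, toy_fuse))  # [8.5, -8.5]
```

With four blocks per module, the fusion step sees four times as many channels as the module input, which is why a compressing convolution is needed before the skip connection can be added.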
SFFM is a composite function that maps the intermediate feature maps to a single set of fused feature maps. Finally, as in previous work, we use ESPCN followed by a convolution layer to upscale the refined feature maps and obtain the output of LFFN. It is worth mentioning that we change the convolution used in the upscale module and in the last layer to further reduce parameters.
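The rearrangement step at the heart of ESPCN-style sub-pixel upscaling can be sketched as below. This is a minimal pure-Python illustration, not the implementation used for LFFN: the nested-list tensor layout and the channel ordering (the same ordering used by common pixel-shuffle operators) are assumptions, and the convolution that precedes the rearrangement is omitted.

```python
from typing import List

Image = List[List[float]]   # H x W
Tensor = List[Image]        # C x H x W


def pixel_shuffle(x: Tensor, r: int) -> Tensor:
    """Rearrange a (C*r*r) x H x W tensor into C x (H*r) x (W*r)."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # The sub-pixel offset (i % r, j % r) selects the source channel.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out


# Four 1x1 feature maps become one 2x2 image for an upscaling factor of 2.
lr_features = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
print(pixel_shuffle(lr_features, 2))  # [[[1.0, 2.0], [3.0, 4.0]]]
```

Because all convolutions run at LR resolution and only this rearrangement produces HR-sized output, the cost of the upscale module stays low.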
II-B Spindle Block
To reap the benefits of the inception-residual block and group convolution, we propose a residual block named the spindle block. The block composes three basic units in sequence: a dimension extension unit, a feature exploration unit and a feature refinement unit. Each unit is a compound function and is detailed below.
II-B1 Dimension Extension Unit
The number of filters is a critical factor in the efficiency of deep networks, and it is currently fixed to 64 in many deep SISR methods. We can lighten the architecture by using fewer filters, but the performance fluctuates accordingly. Using a “bottleneck layer” (1×1 convolution) to compress dimensions resembles a pooling operation along the channel dimension. We believe that reducing feature channels before a non-linear layer can lead to information loss. Here, we instead expand the dimensions from 48 to 64 before the non-linear mapping to maintain performance with fewer parameters.
II-B2 Feature Exploration Unit
As shown in Fig. 3(b), we first slice the feature maps into four 16-dimensional groups. We then explore nonlinear information in three of the groups and linear information in the remaining group. Specifically, we adopt sequences of convolutional layers followed by parametric rectified linear units (PReLUs) to make full use of multi-scale image information. Unlike previous work, we combine linear and nonlinear information to boost the representational power of the basic block, and we process the expanded feature maps directly instead of first reducing their dimension with additional convolutions.
II-B3 Feature Refinement Unit
The concatenated feature maps are then sent to a convolutional layer that refines the features, compresses the dimensions and counteracts the weakening of information flow caused by the slice operation.
Overall, as shown in Fig. 3, the spindle block exploits linear, nonlinear and multi-scale information with fewer parameters than the baseline residual block. With the configuration in Fig. 3, a spindle block has only a fraction of the parameters of a residual block, and this ratio can be decreased further by replacing the convolutions in the spindle block with depthwise convolutions. Further analysis is given in the experiments.
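To make the parameter savings from grouping concrete, the sketch below counts the weights of a single convolution layer under different group settings. The channel widths (48, 64) and the four groups come from the text; the 3×3 and 1×1 kernel sizes are hypothetical choices for illustration, so the numbers are not the ratios reported for the spindle block.

```python
def conv_params(c_in: int, c_out: int, k: int,
                groups: int = 1, bias: bool = False) -> int:
    """Number of learnable parameters in a k x k (grouped) convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)


# Hypothetical 3x3 layers on 64 channels:
standard = conv_params(64, 64, 3)              # 36864
grouped = conv_params(64, 64, 3, groups=4)     # 9216 (four 16-channel groups)
depthwise = conv_params(64, 64, 3, groups=64)  # 576 (one filter per channel)

# Hypothetical 1x1 expansion from 48 to 64 channels:
expand = conv_params(48, 64, 1)                # 3072

print(standard, grouped, depthwise, expand)
print(grouped / standard)                      # 0.25
```

Splitting a layer into g groups divides its weight count by g, which is why slicing the expanded features into four groups and, further, using depthwise convolutions both shrink the block.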
II-C Softmax Feature Fusion Module
Information in different levels of feature maps can complement each other for reconstruction. To gain richer and more efficient information, we focus on hierarchical features and design a fusion mechanism. As shown in Fig. 4, SFFM takes all intermediate feature maps as input and generates a fused representation. Inspired by the squeeze operation in SENets, we apply global average pooling to each channel of each intermediate feature map to obtain a global channel feature for every level. We then follow each global feature with its own dense layer to fully exploit inter-channel correlation.
We use concatenation, slicing and a softmax function on the dense-layer outputs to produce, for each channel, one weight per level. Because the softmax is taken across levels, the weights for a given channel are non-negative and sum to one. The final output of SFFM is then obtained channel by channel as the weighted sum of the corresponding channels of the intermediate feature maps.
SFFM aims to incorporate hierarchical features with as few parameters as possible. Each weight vector in SFFM depends on the global features of all intermediate feature maps, which differs from the channel attention in SENets.
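The fusion procedure described above can be sketched in a few lines of pure Python. This is a hedged illustration rather than the authors' implementation: the nested-list tensor layout, the bias-free dense layers and all function names are assumptions introduced here.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]   # C x H x W
Matrix = List[List[float]]             # C_out x C_in dense weights


def global_avg_pool(fm: FeatureMap) -> List[float]:
    """Squeeze each channel to its spatial mean."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fm]


def dense(v: List[float], weight: Matrix) -> List[float]:
    """Bias-free fully connected layer (assumption)."""
    return [sum(w * x for w, x in zip(row, v)) for row in weight]


def softmax(zs: List[float]) -> List[float]:
    m = max(zs)                               # subtract max for stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def sffm(levels: List[FeatureMap], weights: List[Matrix]) -> FeatureMap:
    # One dense layer per level, applied to that level's global feature.
    scores = [dense(global_avg_pool(fm), w) for fm, w in zip(levels, weights)]
    n_ch, h, w_ = len(levels[0]), len(levels[0][0]), len(levels[0][0][0])
    fused = []
    for c in range(n_ch):
        # Softmax across levels gives convex weights for channel c.
        alpha = softmax([s[c] for s in scores])
        fused.append([[sum(a * fm[c][i][j] for a, fm in zip(alpha, levels))
                       for j in range(w_)] for i in range(h)])
    return fused


# Two levels, one channel, 1x1 maps; zero dense weights give uniform fusion.
low, high = [[[2.0]]], [[[4.0]]]
print(sffm([low, high], [[[0.0]], [[0.0]]]))  # [[[3.0]]]
```

Since the weights for each channel form a convex combination, every fused value lies between the smallest and largest value of that channel across levels, and changing any one level's global feature shifts the weights of all levels.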
III-A Implementation Details
We first pre-train our model on 91 images from Yang et al. and 200 images from the Berkeley Segmentation Dataset. To further improve performance, we fine-tune the pre-trained model on DIV2K, a recently proposed high-quality image dataset consisting of 800 images. Data augmentation (rotation and flip) is performed on both the 291-image dataset and DIV2K. To produce LR images, we downscale the HR images by the given scaling factors with bicubic interpolation. The proposed method is evaluated on four widely used benchmark datasets: Set5, Manga109, BSD100 and Urban100. For fair comparison, we evaluate the model with PSNR and SSIM on the Y channel (i.e., luminance) of the transformed YCbCr space.
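Evaluating PSNR on the luminance channel can be sketched as follows. The text does not state which RGB-to-YCbCr conversion is used; the sketch assumes the ITU-R BT.601 conversion with the 16 to 235 luminance range, which is a common choice in super-resolution evaluation, and it omits SSIM and any border cropping. The function names are illustrative.

```python
import math
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # 8-bit RGB


def rgb_to_y(p: Pixel) -> float:
    """Luminance of an 8-bit RGB pixel (ITU-R BT.601, range 16 to 235)."""
    r, g, b = p
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(ref: List[Pixel], out: List[Pixel]) -> float:
    """PSNR in dB between two equally sized images, on the Y channel only."""
    if len(ref) != len(out) or not ref:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((rgb_to_y(a) - rgb_to_y(b)) ** 2
              for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return math.inf                      # identical images
    return 10.0 * math.log10(255.0 ** 2 / mse)


reference = [(255, 255, 255), (0, 0, 0), (128, 64, 32)]
estimate = [(250, 255, 255), (0, 3, 0), (128, 64, 40)]
print(round(psnr_y(reference, estimate), 2))
```

Measuring on Y only means that color errors which do not change luminance leave the score unaffected, which is the convention that makes results comparable across SISR papers.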
Our final architecture, LFFN, comprises 15 modules, each containing 4 spindle blocks (i.e., B4M15). We initialize all convolutional filters using the method of He et al. We use the L1 loss as our loss function instead of the L2 loss, and optimize with the Adam optimizer. Each training batch consists of 16 RGB patches cropped from the LR images, and the learning rate is halved every 20 epochs. To accelerate convergence, we adopt adjustable gradient clipping, which is well implemented in TensorFlow. Both training stages use the configuration described above, except that a different initial learning rate is used during fine-tuning. Training an LFFN takes roughly four days on a GTX 1080Ti GPU for the ×2 model.
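The loss and learning-rate schedule described above can be sketched as below. The halving interval of 20 epochs comes from the text; the initial learning rate passed in the example is a placeholder, not the value used for LFFN, and the function names are illustrative.

```python
from typing import Sequence


def step_lr(initial_lr: float, epoch: int,
            step: int = 20, factor: float = 0.5) -> float:
    """Learning rate after `epoch` epochs, halved every `step` epochs."""
    return initial_lr * factor ** (epoch // step)


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth pixel values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Placeholder initial learning rate of 1.0 to show the decay pattern.
for epoch in (0, 19, 20, 45):
    print(epoch, step_lr(1.0, epoch))   # 1.0, 1.0, 0.5, 0.25

print(l1_loss([1.0, 2.0], [0.0, 4.0]))  # 1.5
```

Compared with the L2 loss, the L1 loss grows linearly with the error, so a few badly reconstructed pixels influence the gradient less.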
III-B Model Analysis
Table I shows the effects of the spindle block and the softmax feature fusion module (SFFM). LFFN-NF is LFFN without SFFM, and LFFN-NS is LFFN with the spindle blocks replaced by residual blocks (Fig. 3(a)). The three networks have the same number of basic blocks (B4M15). Compared with LFFN, LFFN-NS performs worse while its parameter count increases by three times, indicating that the proposed spindle block is more effective than the residual block. LFFN is clearly superior to LFFN-NF with only a small increase in parameters, showing that SFFM is effective at incorporating hierarchical information. Moreover, as shown in Fig. 5, the channels of the feature maps used for reconstruction draw on all levels: high-level features play a major role in some channels, while low-level features dominate in others. This indicates that aggregating hierarchical features is important for SISR and that SFFM does so well.
III-C Comparisons With State-of-the-Art Methods
We compare LFFN (B4M15) and LFFN-S (B4M4 + depthwise convolution) with state-of-the-art methods, and also compare the parameters and computation (Mult-Adds) of each method. Mult-Adds are calculated for a fixed spatial resolution of the HR image. As shown in Table II, LFFN performs favorably against state-of-the-art methods on all datasets. On Set5, LFFN exceeds MemNet by 0.41 dB in PSNR while requiring 30.32 times less computation. Our smallest network, LFFN-S, needs only a fraction of the Mult-Adds of MemNet, DRRN and DWSR, yet still achieves comparable performance. Fig. 1 shows the execution time of different methods. We use the original code of the state-of-the-art methods to measure runtime on the same machine, with a 2.1 GHz Intel Xeon CPU and a GTX 1080 Ti GPU (12G memory). LFFN is faster, lighter and more accurate than the latest lightweight network, CARN. LFFN-S is about 400 times faster and 3.7 times smaller than MemNet.
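As a rough guide to how Mult-Adds depend on image size, the sketch below counts the multiply-accumulate operations of one convolution layer. The 48-filter first layer on an RGB input comes from the text; the HR resolution of 1280×720, the ×2 scale and the 3×3 kernel are placeholder assumptions for illustration, not the settings behind Table II.

```python
def conv_mult_adds(h_out: int, w_out: int, c_in: int, c_out: int,
                   k: int, groups: int = 1) -> int:
    """Multiply-accumulate operations of one (grouped) k x k convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    # Every output position of every output channel needs
    # k * k * (c_in / groups) multiply-adds.
    return h_out * w_out * c_out * k * k * (c_in // groups)


# Placeholder setting: 1280x720 HR image, x2 upscaling, so layers that run
# before the upscale module operate on a 640x360 LR grid.
hr_w, hr_h, scale = 1280, 720, 2
lr_w, lr_h = hr_w // scale, hr_h // scale

first_layer = conv_mult_adds(lr_h, lr_w, c_in=3, c_out=48, k=3)
print(first_layer)                            # 298598400

# The same layer evaluated on the HR grid costs scale**2 times more.
print(conv_mult_adds(hr_h, hr_w, 3, 48, 3) // first_layer)  # 4
```

Methods that first interpolate the LR image to HR size pay this factor in every layer, which is one reason networks that operate on the original LR image need far fewer Mult-Adds.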
We also provide a qualitative comparison in Fig. 6. Our smallest network, LFFN-S, produces almost the same results as other state-of-the-art methods (e.g., MemNet). In addition, LFFN recovers clearer and more accurate contours with fewer artifacts than other methods.
In this paper, we propose a novel lightweight feature fusion network (LFFN) for single image super-resolution. To build a more effective and accurate architecture, we focus on making full use of the information in the feature maps. Both the softmax feature fusion module (SFFM) and the proposed spindle block, which serves as the basic building unit, significantly improve the representational capacity of the network with fewer parameters. Experiments demonstrate the effectiveness of our method.
This work was partly supported by the Natural Science Foundation of China (No.61471216 and No.61771276), and the Special Foundation for the Development of Strategic Emerging Industries of Shenzhen (No.JCYJ20170817161845824 and No.JCYJ20170307153940960).
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European conference on computer vision. Springer, 2014, pp. 184–199.
-  J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654.
-  J. Kim, J. Kwon Lee, and K. Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1637–1645.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  J. Chu, J. Zhang, W. Lu, and X. Huang, “A novel multiconnected convolutional network for super-resolution,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 946–950, 2018.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in The IEEE conference on computer vision and pattern recognition (CVPR) workshops, vol. 1, no. 2, 2017, p. 4.
-  Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 5.
-  N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” arXiv preprint arXiv:1803.08664, 2018.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in AAAI, vol. 4, 2017, p. 12.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 5987–5995.
-  F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” arXiv preprint arXiv:1610.02357, 2017.
-  J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, vol. 7, 2017.
-  Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A persistent memory network for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4539–4547.
-  T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 4809–4817.
-  C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network.”
-  W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE transactions on image processing, vol. 19, no. 11, pp. 2861–2873, 2010.
-  P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 5, pp. 898–916, 2011.
-  E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, vol. 3, 2017, p. 2.
-  M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” 2012.
-  Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa, “Sketch-based manga retrieval using manga109 dataset,” Multimedia Tools and Applications, vol. 76, no. 20, pp. 21811–21838, 2017.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2. IEEE, 2001, pp. 416–423.
-  J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5197–5206.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  T. Guo, H. S. Mousavi, T. H. Vu, and V. Monga, “Deep wavelet prediction for image super-resolution,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.