Single Image Super-Resolution (SISR) aims to reconstruct visually pleasing High-Resolution (HR) images from Low-Resolution (LR) ones, and has various applications such as satellite imaging [27], medical imaging [17, 20, 19] and small object detection [14, 18]. However, a given LR image can map to many HR solutions, which makes the task ill-posed. Benefiting from powerful feature representation and end-to-end training, Convolutional Neural Networks (CNNs) have achieved significant results in various computer vision tasks and have greatly promoted the development of SISR. Dong et al. [2] first proposed SRCNN, a three-layer network that maps an LR image to a Super-Resolution (SR) one. Later networks adopted deeper and more complicated structures to further improve performance. Deepening the network has been considered useful for SISR, especially after He et al. [4] proposed ResNet with residual learning and Huang et al. [6] proposed DenseNet based on dense connections. Lim et al. [11] then designed a very deep network, termed EDSR, by stacking residual blocks for super-resolution. Furthermore, Zhang et al. [29] combined residual learning and dense connections to make full use of the hierarchical features from different convolutional layers and further enhance SR performance. This excellent performance has verified the importance of depth for SISR. However, we argue that simply deepening the network is not the most desirable approach for SISR, as the relevance of features has not been thoroughly explored.
To address the issue mentioned above, several CNN-based methods have been explored that focus attention on particular features for SISR. For example, Liu et al. [12] made use of the non-local attention block proposed in [26] for image restoration. In [10], Li et al. utilized a spatial attention module and DenseNet to reconstruct realistic HR images. Unlike these methods, which only exploit correlations in the spatial domain, other methods attempted to explore the channel correlation of features. In [28], Zhang et al. utilized the channel attention block (SE) [5] to improve SR performance. Later, methods such as [8, 13, 24, 21] made use of both spatial attention and channel attention to improve SR performance.
Inspired by the above methods, we propose a novel Residual Neuron Attention Network (RNAN) that learns better feature representations and exploits long-range global contextual information to enhance SISR. On the one hand, we propose RNA blocks that explicitly model the interdependencies between the neurons of features and can selectively re-weight the key neurons to learn more characteristic features. On the other hand, a global context block is embedded into each Global Context-enhanced Residual Group (GCRG) to further model the correlations of global contextual information. The experimental results show that our method effectively improves both quantitative results and visual quality compared with state-of-the-art methods.
Our contributions are summarized as follows:
• We design cascaded Global Context-enhanced Residual Groups (GCRGs) to construct the novel Residual Neuron Attention Networks (RNAN) for Single Image Super-Resolution (SISR).
• We propose a Residual Neuron Attention (RNA) block to concentrate more on neuron-wise relationships, and employ a lightweight Global Context (GC) block at the end of each GCRG to incorporate global contextual information.
• Extensive experiments on several benchmark datasets demonstrate that our RNAN achieves superior results with fewer parameters.
2 Proposed method
2.1 Network architecture
As shown in Figure 1, our RNAN can be divided into four parts: a shallow feature extractor, Global Context-enhanced Residual Groups (GCRGs), an up-sampling module, and a reconstruction layer. Let $I_{LR}$ and $I_{SR}$ denote the input and output of RNAN, respectively. Following [11, 30, 22], we apply only one convolutional layer to extract the shallow features from the LR input
$$F_0 = H_{SF}(I_{LR}),$$
where $H_{SF}$ represents the convolutional operation that extracts features from the shallow layers, and $F_0$ is the input of the GCRGs. Suppose we have G GCRGs; the output of the g-th GCRG can be expressed as
$$F_g = H_{GCRG}^{g}(F_{g-1}), \quad g = 1, \dots, G,$$
where $H_{GCRG}^{g}$ denotes the representation of the g-th GCRG. The GCRG is used to enhance the sensitivity of feature maps and to capture global contextual information. We then extract features from the GCRG blocks and conduct uniformly spaced feature fusion. To stabilize the training, we introduce global residual learning as
$$F_{DF} = F_0 + H_{conv}\big(H_{FF}(F_N, F_{2N}, \dots, F_G)\big),$$
where $F_{DF}$ is the output features of the GCRGs, $H_{FF}$ represents feature fusion, which concatenates the outputs of uniformly spaced GCRGs with an interval N (e.g., N equals 2), and $H_{conv}$ denotes the convolutional layers, including a convolutional (conv) layer for feature dimension reduction and a conv layer for further feature fusion. After that, the up-sampling module upsamples the residual-learned feature maps $F_{DF}$, followed by the reconstruction layer
$$I_{SR} = H_{REC}\big(H_{UP}(F_{DF})\big) = H_{RNAN}(I_{LR}),$$
where $H_{REC}$ and $H_{UP}$ denote the reconstruction layer and the up-sampling module, respectively, and $H_{RNAN}$ is the representation of the proposed RNAN. Inspired by [16], we use a sub-pixel convolutional layer as our up-sampling module. The reconstruction layer employs three convolutional kernels to generate the 3-channel super-resolved RGB image. It is worth noting that using residual learning and concatenation in both the global architecture and every GCRG can bypass more abundant low-frequency information during training [11, 29].
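The rearrangement step of a sub-pixel convolutional layer can be sketched as follows. This is a minimal illustration only, assuming the commonly used channel ordering in which output pixel (h·r + i, w·r + j) of channel c is read from input channel c·r² + i·r + j; the function name `pixel_shuffle` and the nested-list tensor layout (channels, height, width) are assumptions made for this example, not details given in the text.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list.

    Assumed layout (not specified in the text): input channel c*r*r + i*r + j
    supplies the output pixel at row h*r + i and column w*r + j of channel c.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]  # one low-resolution plane
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 feature planes become one 2x2 plane for an upscaling factor of 2.
features = [[[0]], [[1]], [[2]], [[3]]]
upsampled = pixel_shuffle(features, 2)
```

In a full model the convolution preceding this rearrangement would produce the C·r² channels; only the reshuffling is shown here.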
2.2 Global Context-enhanced Residual Group
We now give more details on the proposed GCRG, which is composed of several (10 in our experiments) Residual Neuron Attention (RNA) blocks and one Global Context (GC) block. To further facilitate feature extraction, we concatenate the hierarchical features generated by the RNA blocks at uniform intervals, in the same way as the feature fusion across GCRG blocks. Therefore, the final representation of the g-th GCRG can be defined as
where and are the output and input of the g-th GCRG, respectively. denotes feature concatenation, and denotes convolutions with the kernel size as and , respectively. M denotes the interval that we concatenate the features of RNA blocks.
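The uniformly spaced concatenation described above can be illustrated with a small sketch. It assumes that blocks are counted from 1 and that every M-th block output (M, 2M, ...) is kept, and it represents each feature tensor as a nested list of channels so that channel-wise concatenation is plain list concatenation. The helper names `select_uniform` and `concat_channels` are hypothetical.

```python
def select_uniform(outputs, interval):
    """Keep the outputs of blocks interval, 2*interval, ... (1-based indexing assumed)."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    return [feat for idx, feat in enumerate(outputs, start=1) if idx % interval == 0]


def concat_channels(features):
    """Concatenate (C, H, W) nested lists along the channel axis."""
    merged = []
    for feat in features:
        merged.extend(feat)
    return merged


# With 10 blocks and an interval of 2, blocks 2, 4, 6, 8 and 10 are fused.
block_ids = list(range(1, 11))
kept = select_uniform(block_ids, 2)
```

The concatenated tensor would then pass through the convolutions mentioned in the text, which are omitted from this sketch.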
2.2.1 Residual Neuron Attention (RNA) block
Inspired by the Residual Blocks (RB) in [25, 4, 23] and the Neuron Attention (NA) in [15], we integrate NA into RB and propose the Residual Neuron Attention (RNA) block, as shown in Figure 2. Denoting the input and output features of the b-th RNA block in the g-th GCRG as $F_{g,b-1}$ and $F_{g,b}$, respectively, the process of RNA can be formulated as
$$F_{g,b} = F_{g,b-1} + H_{NA}\big(H_{RB}(F_{g,b-1})\big),$$
where $H_{NA}$ and $H_{RB}$ denote the NA module and the RB, respectively.
Previous CNN-based methods utilize convolutional filters to incorporate channel-wise and spatial-wise information within a local receptive field to generate the final convolutional features. However, the contextual information outside the local receptive field of the last convolutional layer cannot be used. To this end, we exploit the interdependencies of neurons modeled by the Neuron Attention (NA) mechanism to recalibrate neuron-wise responses adaptively and dynamically. NA consists of two main operations, Depthwise Convolution (DC) and Pointwise Convolution (PC). DC makes use of the spatial information in each individual channel, keeping the number of filters equal to the number of channels of the input features. To overcome the drawback that DC cannot fully utilize the information of different maps at the same spatial location, we adopt the PC, which uses 1×1 convolution kernels whose number of filters equals the depth of the input features. Similar to the attention mechanism in [25], we employ a sigmoid activation function after the PC. The operations of NA can be expressed as
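$$Y = X \otimes \sigma\big(W_{PC} * \delta(W_{DC} * X)\big),$$
where $W_{DC}$ and $W_{PC}$ denote the weights of the DC and the PC, respectively, $\sigma$ and $\delta$ represent the sigmoid and ReLU activation functions, respectively, $*$ denotes convolution, and $\otimes$ denotes element-wise multiplication. $X$ is the input features, and $Y$ is the corresponding output. With the NA module, the residual component in RNA can be adaptively recalibrated.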
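A neuron attention step of this kind can be sketched in plain Python. The sketch rests on several assumptions that the text does not confirm: the ReLU is placed between the depthwise and pointwise convolutions, the depthwise kernel uses zero padding, biases are omitted, and the resulting sigmoid map is multiplied element-wise with the input. All function and parameter names are hypothetical.

```python
import math


def depthwise_conv(x, kernels):
    """Apply one k x k filter per channel with zero padding. x: (C, H, W), kernels: (C, k, k)."""
    k = len(kernels[0])
    pad = k // 2
    h, w = len(x[0]), len(x[0][0])
    out = []
    for plane, ker in zip(x, kernels):
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += ker[di][dj] * plane[ii][jj]
                res[i][j] = acc
        out.append(res)
    return out


def pointwise_conv(x, weights):
    """Mix channels at every position with a 1x1 convolution. weights: (C_out, C_in)."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wc * x[c][i][j] for c, wc in enumerate(row))
              for j in range(w)] for i in range(h)] for row in weights]


def neuron_attention(x, dc_kernels, pc_weights):
    """Assumed form: Y = X * sigmoid(PC(ReLU(DC(X)))), one weight per neuron."""
    d = depthwise_conv(x, dc_kernels)
    d = [[[max(0.0, v) for v in row] for row in plane] for plane in d]  # ReLU
    a = pointwise_conv(d, pc_weights)
    return [[[x[c][i][j] / (1.0 + math.exp(-a[c][i][j]))  # x * sigmoid(a)
              for j in range(len(x[0][0]))] for i in range(len(x[0]))]
            for c in range(len(x))]
```

Because the attention map has the same shape as the input, every neuron (channel and position) receives its own weight, which is the property the text attributes to NA.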
2.2.2 Global Context (GC) block
The Global Context (GC) block [1] is placed at the end of each GCRG to learn global contextual information. GC mainly consists of context modeling and feature transform, as illustrated in the bottom panel of Figure 2. In this way, GC benefits from both the simplified non-local block and the Squeeze-Excitation (SE) block [5]. The former can effectively model long-range dependencies throughout the full image at a smaller computational cost than the original non-local block [26], while the latter can fully capture channel-wise dependencies.
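We denote $x = \{x_i\}_{i=1}^{N_p}$ as the fused feature maps of multiple RNA blocks and $z = \{z_i\}_{i=1}^{N_p}$ as the output of GCRG, where $N_p$ is the number of positions in the feature map (e.g., $N_p = H \cdot W$ in an image). The detailed architecture of the GC block is illustrated in the bottom panel of Figure 2. The GC block can be formulated as
$$z_i = x_i + \delta\Big(\sum_{j=1}^{N_p} \alpha_j x_j\Big), \quad \delta(\cdot) = W_{v2}\,\mathrm{ReLU}\big(\mathrm{LN}(W_{v1}(\cdot))\big), \quad \alpha_j = \frac{e^{W_k x_j}}{\sum_{m=1}^{N_p} e^{W_k x_m}},$$
where $W_k$, $W_{v1}$ and $W_{v2}$ denote convolution operations, $\delta(\cdot)$ denotes the feature bottleneck transform, and $\alpha_j$ denotes the global context modeling weights. ReLU and LN stand for ReLU and LayerNorm, respectively. We set the bottleneck ratio r to 16 in our experiments.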
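The two stages of a GC block (context modeling followed by a bottleneck transform) can be sketched as below. The sketch follows the formulation published with GCNet [1] as far as it is generally known: softmax attention pooling over positions, then a channel bottleneck with LayerNorm and ReLU, then broadcast addition to every position. It assumes a LayerNorm without learnable scale and shift, omits biases, and uses hypothetical names (`gc_block`, `w_k`, `w_v1`, `w_v2`); the bottleneck ratio is implied by the shape of `w_v1`.

```python
import math


def gc_block(x, w_k, w_v1, w_v2, eps=1e-5):
    """x: list of N_p position vectors, each of length C.

    w_k: length-C vector (1x1 conv producing one attention logit per position).
    w_v1: (C/r, C) matrix, w_v2: (C, C/r) matrix (bottleneck transform).
    """
    channels = len(x[0])
    # Context modeling: softmax over positions, then a weighted sum of features.
    logits = [sum(wk * v for wk, v in zip(w_k, xi)) for xi in x]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # shifted for numerical stability
    total = sum(exps)
    alpha = [e / total for e in exps]
    context = [sum(a * xi[c] for a, xi in zip(alpha, x)) for c in range(channels)]
    # Bottleneck transform: reduce, LayerNorm (no affine, assumed), ReLU, expand.
    hidden = [sum(w * v for w, v in zip(row, context)) for row in w_v1]
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    hidden = [max(0.0, (h - mean) / math.sqrt(var + eps)) for h in hidden]
    delta = [sum(w * v for w, v in zip(row, hidden)) for row in w_v2]
    # Fusion: the same context vector is added to every position.
    return [[xi[c] + delta[c] for c in range(channels)] for xi in x]
```

Since a single context vector is computed for the whole feature map, the cost grows linearly with the number of positions, which is the source of the saving over a full non-local block.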
|Method|Set5 (PSNR/SSIM)|Set14 (PSNR/SSIM)|BSD100 (PSNR/SSIM)|Urban100 (PSNR/SSIM)|
|RNAN+|38.31/ · 34.80/ · 32.66/|34.10/ · 30.69/ · 28.92/|33.42/ · 29.33/ · 27.79/|33.28/ · 29.08/ · 26.90/|
Following [7, 11], we use 800 images from the DIV2K dataset as the training set. The LR images are obtained by bicubic downsampling of the HR images using MATLAB. For testing, we use four standard benchmark datasets: Set5, Set14, B100, and Urban100.
During training, we randomly crop patches from the LR images and the corresponding HR images. We also augment the training images by random rotations of 90°, 180° and 270° and by horizontal flipping. In every training mini-batch, 16 cropped color LR patches with size of are provided as inputs. We train our model with the Adam optimizer with , , and
to calculate the loss between input and output. The initial learning rate is set to 0.0001 and is halved every 200 epochs. Moreover, we set the number of RNA blocks (RNAB) to 20 and the number of GCRGs to 10. Similar to [11], a self-ensemble strategy, which averages the outputs for augmented versions of an input image at test time, is used to maximize the potential performance of our model.
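The step learning-rate schedule and the self-ensemble strategy described above can be sketched as follows. The schedule uses the values stated in the text (0.0001, halved every 200 epochs). The self-ensemble part assumes the eight geometric transforms commonly used for this purpose (four rotations, each with and without a horizontal flip) and a hypothetical `model` callable that maps a 2-D list to a 2-D list; neither detail is confirmed by the text.

```python
def learning_rate(epoch, base_lr=1e-4, step=200):
    """Halve the learning rate every `step` epochs."""
    return base_lr * 0.5 ** (epoch // step)


def rot90(img):
    """Rotate a 2-D list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def hflip(img):
    """Flip a 2-D list horizontally."""
    return [row[::-1] for row in img]


def self_ensemble(model, img):
    """Average model outputs over 8 flipped/rotated copies (assumed transform set)."""
    outputs = []
    for flip in (False, True):
        for k in range(4):
            t = hflip(img) if flip else img
            for _ in range(k):
                t = rot90(t)
            y = model(t)
            for _ in range((4 - k) % 4):  # undo the rotation
                y = rot90(y)
            if flip:                      # undo the flip
                y = hflip(y)
            outputs.append(y)
    h, w = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / len(outputs) for j in range(w)]
            for i in range(h)]


# An identity "model" leaves the image unchanged after ensembling.
restored = self_ensemble(lambda t: t, [[1.0, 2.0], [3.0, 4.0]])
```

Each output is mapped back to the original orientation before averaging, so the ensemble only smooths out orientation-dependent differences in the model's predictions.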
3.2 Comparison with state-of-the-art methods
We compare our RNAN with several state-of-the-art SR methods: SRCNN [2], FSRCNN [3], VDSR [7], LapSRN [9], EDSR [11], NLRN [12], and RDN [29]. The models are compared both quantitatively and qualitatively.
For fair comparison, we follow a common setting [11, 6, 12] and evaluate our model on the luminance channel (Y) of the transformed YCbCr space for quantitative measurement. Table 1 shows the PSNR and SSIM values of the compared SR methods for , , and super resolution, respectively. As Table 1 shows, RNAN with the self-ensemble strategy achieves better performance than the other methods on all benchmark datasets and for all scaling factors. Without self-ensemble, RNAN and RDN achieve very similar results and still outperform the other methods; however, RNAN has fewer parameters than RDN (about , see Table 2). Besides, we observe that the gap between RNAN and EDSR decreases as the upsampling factor increases (e.g., : 0.13dB, : 0.08dB, : 0.04dB in Set14), but the slightly better performance of RNAN on scale brings about a significant visual improvement (see Figure 3). It is worth noting that the number of parameters of RNAN is about of that of EDSR. Table 1 and Table 2 show that our proposed models improve performance with a better trade-off between parameters and performance.
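The Y-channel evaluation can be sketched as below. The luminance conversion uses the ITU-R BT.601 coefficients commonly applied to 8-bit RGB values in this line of work; the text does not state which conversion or border cropping is used, so both the coefficients and the absence of cropping are assumptions, and the function names are hypothetical.

```python
import math


def rgb_to_y(r, g, b):
    """Luminance (Y) of an 8-bit RGB pixel, BT.601 coefficients (assumed)."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr(y_ref, y_test, peak=255.0):
    """PSNR in dB between two equally sized 2-D lists of Y values."""
    sq_err = [(a - b) ** 2
              for row_ref, row_test in zip(y_ref, y_test)
              for a, b in zip(row_ref, row_test)]
    mse = sum(sq_err) / len(sq_err)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Convert two tiny RGB images to Y and compare them.
hr = [[(255, 255, 255), (0, 0, 0)]]
sr = [[(250, 250, 250), (5, 5, 5)]]
y_hr = [[rgb_to_y(*p) for p in row] for row in hr]
y_sr = [[rgb_to_y(*p) for p in row] for row in sr]
score = psnr(y_hr, y_sr)
```

Evaluating on Y only makes scores comparable with earlier methods that were trained or reported on the luminance channel.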
In Figure 3, we visually illustrate the qualitative comparisons on scale on images from Set14 and Urban100. RNAN clearly recovers more details than the compared SR methods. For the image 'ppt3' from the Set14 dataset, RNAN generates more clearly distinguishable words than the other methods. For the image 'img074' from the Urban100 dataset, the compared methods cannot reconstruct the realistic and clear structure of the building. In contrast, RNAN reconstructs an image that is more faithful to the ground truth, with sharper edges and more high-frequency details. These comparisons demonstrate that networks with NA and GC can extract more sophisticated features from the LR image.
In this paper, we propose the Residual Neuron Attention Networks (RNAN) for highly realistic image super-resolution. Specifically, we propose the Global Context-enhanced Residual Groups (GCRGs), each composed of multiple Residual Neuron Attention (RNA) blocks and one Global Context (GC) block, to recalibrate neuron-wise feature responses adaptively and capture global contextual information. Extensive experiments on several benchmark datasets demonstrate that our RNAN can significantly improve super-resolution performance with fewer parameters.
References
[1] (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492.
[2] (2014) Learning a deep convolutional network for image super-resolution. In ECCV, pp. 184–199.
[3] (2016) Accelerating the super-resolution convolutional neural network. In ECCV, pp. 391–407.
[4] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[5] (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141.
[6] (2017) Densely connected convolutional networks. In CVPR, pp. 4700–4708.
[7] (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, pp. 1646–1654.
[8] (2018) Ram: residual attention module for single image super-resolution. arXiv preprint arXiv:1811.12043.
[9] (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, pp. 624–632.
[10] (2019) Image super-resolution using attention based densenet with residual deconvolution. arXiv preprint arXiv:1907.05282.
[11] (2017) Enhanced deep residual networks for single image super-resolution. In CVPRW, pp. 136–144.
[12] (2018) Non-local recurrent network for image restoration. In NIPS, pp. 1673–1682.
[13] (2019) Hybrid residual attention network for single image super resolution. arXiv preprint arXiv:1907.05514.
[14] (2019) Better to follow, follow to be better: towards precise supervision of feature super-resolution for small object detection. In CVPR, pp. 9725–9734.
[15] (2019) NASNet: a neuron attention stage-by-stage net for single image deraining. arXiv preprint arXiv:1912.03151.
[16] (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pp. 1874–1883.
[17] (2013) Cardiac image super-resolution with global correspondence using multi-atlas patchmatch. In MICCAI, pp. 9–16.
[18] (2020) Face attribute invertion. arXiv preprint arXiv:2001.04665.
[19] MR image segmentation and bias field estimation based on coherent local intensity clustering with total variation regularization. Medical & biological engineering & computing 54 (12), pp. 1807–1818.
[20] (2017) Automatic categorization and scoring of solid, part-solid and non-solid pulmonary nodules in ct images with convolutional neural network. Scientific reports 7 (1), pp. 1–10.
[21] (2019) Deep transfer across domains for face antispoofing. Journal of Electronic Imaging 28 (4), pp. 043001.
[22] (2019) Enhance the motion cues for face anti-spoofing using cnn-lstm architecture. arXiv preprint arXiv:1901.05635.
[23] Joint 3d face reconstruction and dense face alignment from a single image with 2d-assisted self-supervised learning. arXiv preprint arXiv:1903.09359.
[24] (2019) Learning generalizable and identity-discriminative representations for face anti-spoofing. arXiv preprint arXiv:1901.05602.
[25] (2017) Residual attention network for image classification. In CVPR, pp. 3156–3164.
[26] (2018) Non-local neural networks. In CVPR, pp. 7794–7803.
[27] (2012) A novel image fusion method using ikonos satellite images. GGS 1 (1), pp. 75–83.
[28] (2018) Image super-resolution using very deep residual channel attention networks. In ECCV, pp. 286–301.
[29] (2018) Residual dense network for image super-resolution. In CVPR, pp. 2472–2481.
[30] Multi-prototype networks for unconstrained set-based face recognition. arXiv preprint arXiv:1902.04755.