Single image super-resolution (SISR), which aims at reconstructing a high-resolution (HR) image from its single low-resolution (LR) counterpart, is an ill-posed inverse problem. To tackle this inverse problem, numerous learning-based super-resolution (SR) methods have been proposed to learn the mapping function between LR and HR image pairs via probabilistic graphical models [1, 2], neighbor embedding [3, 4], sparse coding [5, 6], linear or nonlinear regression [7, 8, 9], and random forests.
More recently, benefiting from the powerful representational ability of convolutional neural networks (CNNs), deep-learning-based SR methods have achieved better performance in terms of both effectiveness and efficiency. As an early attempt, the SRCNN proposed by Dong et al. employed three convolutional layers to predict the nonlinear mapping from a bicubic-upscaled middle-resolution image to the high-resolution image, which outperformed most conventional SR methods. Later, various works followed a similar network design and consistently improved SR performance via residual learning [15, 16], recursive learning [16, 17], symmetric skip connections  and cascading memory blocks . Differing from the above pre-upscaling approaches, which perform SR on bicubic-upsampled images, FSRCNN  and ESPCN , designed by Dong et al. and Shi et al. respectively, extract features from the original LR images and upscale the spatial resolution only at the end of the processing pipeline via a deconvolution layer or a sub-pixel convolution module . Following this post-upscaling architecture, Ledig et al.  employed the residual blocks proposed in  to construct a deeper network (SRResNet) for image SR, which was further improved by EDSR  and MDSR  via removing unnecessary modules. Further, to conveniently pass information across several layers, dense blocks  were also introduced to construct several deep networks [25, 13, 26] suited to image super-resolution. Meanwhile, to reduce the difficulty of directly super-resolving the details, [26, 27, 28] adopted progressive structures to reconstruct the HR image in a stage-by-stage upscaling manner. In addition, [29, 30] incorporated feedback mechanisms into the network design to exploit LR and HR signals jointly.
Although these existing deep-learning-based approaches have made good efforts to improve SR performance, the reconstruction of high-frequency details for SISR is still a challenge. In deep neural networks, the LR inputs and extracted features contain different types of information across channels, spatial positions and layers, such as low-frequency and high-frequency information or low-level and high-level features. These have different reconstruction difficulties (e.g., high-frequency features or pixels in texture areas are more difficult to reconstruct than low-frequency features or pixels in flat areas) as well as different contributions to recovering the implicit high-frequency details. However, most CNN-based methods treat different types of information equally and lack the flexibility to modulate them, which limits the representational ability and fitting capacity of deep networks. Therefore, for deeper neural networks, simply increasing depth or width can hardly achieve further improvement. On the other hand, for image restoration tasks, the hierarchical features produced by deep neural networks are informative and useful. However, many very deep networks, such as VDSR , LapSRN , EDSR  and IDN , adopt single-path direct connections or short skip connections among layers, where hierarchical features can hardly be fully utilized and long-term information that provides clues for SR is lost as the network depth grows. Although SRDenseNet  and RDN  employ densely connected blocks to fuse different levels of features for SR, the extreme connectivity pattern in their networks not only hinders their scalability to large width or depth but also produces redundant computation. The memory blocks adopted in MemNet  also integrate information from preceding memory blocks to achieve persistent memory, but the fused features are extracted from bicubic pre-upscaled images, which may lose some details and introduce new noise. Therefore, how to effectively make full use of multi-level, channel-wise and spatial features within neural networks is crucial for HR image reconstruction and remains to be explored.
To address these issues, we propose a Channel-wise and Spatial Feature Modulation network (illustrated in Fig. 10) for SISR, named CSFM, which not only adaptively learns to pay attention to every feature entry in the multi-level, channel-wise and spatial feature responses but also fully and effectively exploits the hierarchical features to maintain persistent memory. In the CSFM network, we construct a feature-modulation memory (FMM) module (shown in Fig. 10(b)) as the building module and stack several FMM modules in a densely connected structure. An FMM module contains a channel-wise and spatial attention residual (CSAR) blockchain and a gated fusion (GF) node. We develop the CSAR block by integrating channel-wise and spatial attention into the residual block , and stack a collection of CSAR blocks into a blockchain to modulate multi-level features for adaptively capturing more important information. In addition, by adopting a GF node in the FMM module, the states of the current FMM module and of the preceding FMM modules are conveniently concatenated and adaptively fused for short-term and long-term information preservation as well as for information-flow enhancement. As shown in Fig. 7, our proposed CSFM network generates more realistic visual results compared with other methods.
In summary, the major contributions of our proposed SISR method are three-fold:
1) We develop a CSAR block by combining channel-wise and spatial attention mechanisms in the residual block, which can adaptively recalibrate feature responses in a global-and-local manner by explicitly modelling channel-wise and spatial feature interdependencies.
2) We construct an FMM module by stacking a set of CSAR blocks to modulate multi-level features and adding a GF node to adaptively fuse hierarchical features for important information preservation. The block-stacking structure in the FMM module enables it to capture different types of attention and thereby enhance high-contribution information for image super-resolution, while the gating mechanism helps it adaptively distill more effective information from short-term and long-term states.
3) We design a CSFM network for accurate single image SR, in which the stacked FMM modules enhance the discriminative learning ability of the network and the densely connected structure helps to fully exploit multi-level information as well as to ensure maximum information flow between modules.
The remainder of this paper is organized as follows. Section II discusses the related SISR methods and correlative mechanisms applied in neural networks. Section III describes the proposed CSFM network for SR in detail. Model analysis and experimental comparisons with other state-of-the-art methods are presented in Section IV, and Section V concludes the paper with observations and discussions.
II. Related Work
Numerous SISR methods, learning mechanisms and network architectures have been proposed in the literature. Here, we focus our discussion on the approaches most related to our method.
II-A. Deep-Learning-Based Image Super-Resolution
Since Dong et al.  first proposed the super-resolution convolutional neural network (SRCNN) to predict the nonlinear relationship between the bicubic-upscaled image and the HR image, various CNN architectures have been studied for SR. As deeper CNNs have larger receptive fields to capture more contextual information, Kim et al. proposed two deep networks, VDSR  and DRCN , which utilized global residual learning and recursive layers respectively to improve SR accuracy. To control the number of model parameters and maintain persistent memory, Tai et al. constructed recursive blocks with global-and-local residual learning in DRRN  and designed memory blocks with dense connections in MemNet . For these methods, the LR images need to be bicubic-interpolated to the desired size before entering the networks, which inevitably increases the computational complexity and may introduce new noise.
To alleviate the computational load and overcome this disadvantage of the pre-upscaling structure, Dong et al.  exploited the deconvolution operator to upscale spatial resolution at the network tail. Later, Shi et al.  proposed a more effective sub-pixel convolution layer to replace the deconvolution layer for upscaling the final LR feature-maps into the HR output, which was recently extended by an enhanced upscaling module (EUM)  that applies residual learning and multi-path concatenation within the module. Benefiting from this post-upscaling strategy, deeper and deeper networks, such as SRResNet , EDSR  and SRDenseNet , achieved high performance with lower computational load. Recently, Hui et al.  developed information distillation blocks and stacked them to construct a deep and compact convolutional network, and Zhang et al.  proposed a residual dense network (RDN) which uses densely connected convolutional layers to extract abundant local features and adopts a local-and-global feature fusion procedure to adaptively fuse hierarchical features in the LR space.
Taking the effectiveness of the post-upscaling strategy into account, we also apply the sub-pixel convolution layer  at the end of the network to upscale spatial resolution. Furthermore, we exploit the feature modulation mechanism to enhance the discriminative ability of the network for different types of information.
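As an illustration of the post-upscaling strategy, a sub-pixel convolution tail can be sketched in a few lines of PyTorch. This is a generic sketch, not the paper's exact configuration; the layer sizes here are our assumptions:

```python
import torch
import torch.nn as nn

# Minimal post-upscaling tail: a convolution expands the channel count by r^2,
# then PixelShuffle rearranges those channels into an r-times larger spatial grid.
scale, channels = 4, 64
upscaler = nn.Sequential(
    nn.Conv2d(channels, channels * scale ** 2, kernel_size=3, padding=1),
    nn.PixelShuffle(scale),          # (B, C*r^2, H, W) -> (B, C, r*H, r*W)
    nn.Conv2d(channels, 3, kernel_size=3, padding=1),  # reconstruct an RGB image
)

lr_features = torch.randn(1, channels, 12, 12)   # features in LR space
sr_image = upscaler(lr_features)                  # HR-space output
```

Because all heavy computation happens at LR resolution, the cost of the network body is independent of the upscaling factor.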
II-B. Attention Mechanism
The aim of an attention mechanism in a neural network is to recalibrate feature responses toward the most informative and important components of the inputs. Recently, several works have integrated attention modules into deep network architectures for a range of tasks, such as image generation , image captioning [34, 35], image classification [36, 37] and image restoration [38, 39]. Xu et al.  refined the spatial attention model by stacking two spatial attention models for visual question answering. Moreover, by investigating the interdependencies between the channels of convolutional features, Hu et al.  introduced a channel-wise attention mechanism and proposed the squeeze-and-excitation (SE) block to adaptively recalibrate channel-wise feature responses for image classification. Inspired by SE networks, Zhang et al.  integrated channel-wise attention into residual blocks and proposed a very deep residual channel attention network, which pushed the state-of-the-art SISR performance forward. In addition, Chen et al.  stacked spatial and channel-wise attention modules at multiple layers for image captioning, where the second attention (spatial or channel-wise) operates on the attentive feature-maps recalibrated by the first (channel-wise or spatial). Besides spatial and channel-wise attention, Wang et al.  utilized semantic segmentation probability maps as prior knowledge and introduced semantic attention to modulate spatial features for realistic texture generation; however, this model requires external resources to train the semantic attributes.
Inspired by the attention mechanism, and considering that the different types of information within and across feature-maps make different contributions to image SR, we combine channel-wise and spatial attention in the residual blocks to adaptively modulate feature representations in a global-and-local way and capture more important information.
II-C. Skip Connections
As the depth of a network grows, weakened information flow and vanishing gradients hamper its training. Many recent methods have been devoted to resolving these problems. The ResNet proposed by He et al.  was built by stacking a sequence of residual blocks, which utilize skip connections between layers to improve information flow and ease training. Residual blocks were also widely applied in [22, 12] to construct very wide and deep networks for improving SR performance. To fully explore the advantages of skip connections, Huang et al.  constructed DenseNets by directly connecting each layer to all previous layers. Meanwhile, to allow the networks to scale to deeper and wider ones, block compression was applied in DenseNets to halve the number of channels in the concatenation of previous layers. Dense connections were likewise utilized in [19, 25, 13] for image SR to improve the flow of information and gradients throughout the networks. However, extremely dense connections and frequent concatenations may increase information redundancy and computational cost. Considering this, Chen et al.  combined the insights of ResNets  and DenseNets  and proposed the DualPathNet, which utilizes both concatenation and summation of previous features.
Recognizing the advantages of both the residual path in the residual block and the densely connected paths in the dense block, we stack several attention-based residual blocks within each module and utilize densely connected paths between modules for effective feature re-exploitation and important information preservation.
III. The Proposed CSFM Network
The proposed CSFM network for SISR, outlined in Fig. 10, consists of an initial feature extraction sub-network (IFENet), a feature transformation sub-network (FTNet) and an upscaling sub-network (UpNet). The IFENet represents an LR input as a set of feature-maps via a convolutional layer. The FTNet captures more informative features for SR via a sequence of stacked feature-modulation memory (FMM) modules and two convolutional layers. The transformed features are then fed into the UpNet to generate the HR image. In this section, we detail the proposed model, from the channel-wise and spatial attention residual (CSAR) block to the FMM module and finally the overall network architecture.
III-A. The CSAR Block
The features generated by a deep network contain different types of information across channels and spatial regions, which make different contributions to the recovery of high-frequency details. If we can increase the network's sensitivity to higher-contribution features and make it focus on learning more important features, the representational power of the network will be enhanced and its performance improved. With this in mind, we design a channel-wise attention (CA) unit and a spatial attention (SA) unit that exploit the interdependencies between channels and between spatial locations of the features, and then combine the two types of attention in residual blocks to adaptively modulate feature representations.
III-A1) The CA Unit
The aim of the CA unit is to perform feature recalibration in a global way: per-channel summary statistics are calculated and then used to selectively emphasize informative feature-maps and suppress useless ones (e.g., redundant feature-maps). The structure of the CA unit is illustrated in Fig. 14(a)–(b). We denote $U = [u_1, u_2, \dots, u_C]$ as the input of the CA unit, which consists of $C$ feature-maps of size $H \times W$. To generate the channel-wise summary statistics $z \in \mathbb{R}^C$, global average pooling is applied to each feature channel across the spatial dimensions $H \times W$, as done in . The $c$-th element of $z$ is computed by

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j), \tag{1}$$

where $u_c(i, j)$ is the value at position $(i, j)$ of the $c$-th channel $u_c$. To assign different attentions to different types of feature-maps, we apply a gating mechanism with a sigmoid activation to the summary statistics $z$:

$$\gamma = \sigma\big(W_2 * \delta(W_1 * z + b_1) + b_2\big), \tag{2}$$

where $\sigma(\cdot)$ and $\delta(\cdot)$ represent the sigmoid and ReLU functions respectively, and $*$ denotes the convolution operation. $W_1$ and $b_1$ are the weights and bias of the first $1 \times 1$ convolutional layer, which is followed by a ReLU activation and decreases the number of channels of $z$ by the reduction ratio $r$. The number of channels is then increased back to the original amount via another $1 \times 1$ convolutional layer with parameters $W_2$ and $b_2$. The channel-wise attention weights $\gamma$ are thereby mapped to values between $0$ and $1$ and used to rescale the input features:

$$\hat{U} = F_{CA}(U) = \gamma \odot U, \tag{3}$$

where $\odot$ denotes channel-wise multiplication between the feature channels and their corresponding channel weights, $\hat{U}$ is the channel-wise recalibrated output, and $F_{CA}(\cdot)$ represents the CA unit, which is conditioned on the input $U$.
With the above process, the CA unit is able to adaptively modulate the channel-wise features according to the channel-wise statistics of the input, and helps the network boost channel-wise feature discriminability.
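A minimal PyTorch sketch of such a channel-wise attention unit (a squeeze-and-excitation-style gate; the class name is ours, with the channel count and reduction ratio set to the 64 and 16 used later in the implementation details):

```python
import torch
import torch.nn as nn

class CAUnit(nn.Module):
    """Channel-wise attention: global pooling -> bottleneck -> sigmoid gating."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # per-channel statistic z
            nn.Conv2d(channels, channels // reduction, 1),  # reduce channels by r
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # restore channel count
            nn.Sigmoid(),                                   # weights gamma in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)   # rescale each feature-map by its learned weight

x = torch.randn(2, 64, 16, 16)
y = CAUnit()(x)
```

Since each attention weight lies in (0, 1), the unit can only attenuate channels, never amplify them; the network learns to attenuate the less informative ones.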
III-A2) The SA Unit
The channel-wise attention exploits global average pooling to squeeze global spatial information into a channel statistical descriptor, which, however, discards the spatial information within each feature-map. On the other hand, the information contained in the inputs and feature-maps is also diverse over spatial positions. For example, edge or texture regions usually contain more high-frequency information while smooth areas carry more low-frequency information. Therefore, to recover high-frequency details for image SR, it is helpful to give the network discriminative ability over different local regions and make it pay more attention to the regions that are more important and more difficult to reconstruct.
Considering the aforementioned discussion, besides the channel-wise attention, we explore a complementary form of attention, termed spatial attention, to improve the representations of the network. As shown in Fig. 14(a)–(c), let $U$ be the input of the SA unit, which has $C$ feature-maps of size $H \times W$. To make use of the feature channel interdependencies of the input, and inspired by the local computations in computational-neuroscience models , we use a two-layer neural network followed by a sigmoid function to generate a spatial attention mask $M \in \mathbb{R}^{H \times W}$. The SA unit is defined as

$$M = \sigma\big(W_4 * \delta(W_3 * U + b_3) + b_4\big), \tag{4}$$

where the meanings of the notations $\sigma(\cdot)$, $\delta(\cdot)$ and $*$ are the same as those used in Eq. (2). The first $1 \times 1$ convolutional layer, with parameters $W_3$ and $b_3$, yields per-channel attentive maps (increasing the number of channels by the ratio $\mu$), which are then combined into a single attentive map by the second $1 \times 1$ convolutional layer (parameterized by $W_4$ and $b_4$). Further, the sigmoid function normalizes the attentive map to the range $(0, 1)$ to obtain the spatial attention soft mask $M$. The process of the input features being spatially modulated by $M$ can be formulated as

$$\hat{U} = F_{SA}(U) = M \otimes U, \tag{5}$$

where $\otimes$ is an element-wise multiplication between the spatial positions of each feature-map and their corresponding spatial attention weights, and $F_{SA}(\cdot)$ denotes the SA model.
With the SA unit, the features are adaptively modulated in a local way, which interplays with the global channel-wise modulation to help the network enhance its representational power.
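A corresponding sketch of the SA unit, assuming $1 \times 1$ convolutions and an expansion ratio of 2 as stated later in the implementation details (the class name is ours):

```python
import torch
import torch.nn as nn

class SAUnit(nn.Module):
    """Spatial attention: two 1x1 convs produce a single H x W soft mask."""
    def __init__(self, channels=64, expansion=2):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels * expansion, 1),  # per-channel attentive maps
            nn.ReLU(inplace=True),
            nn.Conv2d(channels * expansion, 1, 1),         # collapse to one map
            nn.Sigmoid(),                                  # normalize mask to (0, 1)
        )

    def forward(self, x):
        return x * self.mask(x)   # broadcast the H x W mask over all channels

x = torch.randn(2, 64, 16, 16)
y = SAUnit()(x)
```

Note the asymmetry with the CA unit: here one weight per spatial position is shared across all channels, whereas the CA unit shares one weight per channel across all positions.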
III-A3) Integration of CA and SA into the Residual Block
Since the residual blocks introduced in ResNets  can improve information flow and have achieved better performance for image SR in , we combine the channel-wise and spatial attention units into the residual block and propose the CSAR block.
As illustrated in Fig. 14, if we denote $X$ and $Y$ as the input and output of a CSAR block, and $F_{CSA}(\cdot)$ as the combinational attention model of CA and SA that will be detailed later, the CSAR block can be formulated as

$$Y = F_{CSAR}(X) = X + F_{CSA}\big(R(X)\big), \tag{6}$$

where $F_{CSAR}(\cdot)$ and $R(\cdot)$ represent the functions of the CSAR block and the residual branch respectively. The residual branch contains two stacked convolutional layers with a ReLU activation,

$$R(X) = W_{R_2} * \delta(W_{R_1} * X + b_{R_1}) + b_{R_2}, \tag{7}$$

where $\{W_{R_1}, W_{R_2}\}$ and $\{b_{R_1}, b_{R_2}\}$ are the weight and bias sets of the residual branch, and $R = R(X)$ is the set of produced residual features.

To capture more important information, we apply the combinational attention model $F_{CSA}(\cdot)$ to modulate the residual features $R$. First, we operate the CA unit and the SA unit on the residual features respectively to obtain the channel-wise weighted feature-maps $F_{CA}(R)$ and the spatially weighted feature-maps $F_{SA}(R)$, as described in Sections III-A1) and III-A2). Then, the two sets of modulated feature-maps are concatenated as the input to a $1 \times 1$ convolutional layer (parameterized by $W_f$ and $b_f$), which fuses the two types of attention-modulated features with learned adaptive weights:

$$F_{CSA}(R) = W_f * \big[F_{CA}(R), F_{SA}(R)\big] + b_f, \tag{8}$$

where $[\cdot, \cdot]$ represents the operation of feature concatenation.
Inserting the combinational attention model into the deep network in this way has two benefits. First, since the combinational attention model only modulates the residual features, the beneficial property of identity mapping in the residual block is not broken and information flow is still improved. Second, as the two attention units are combined in a residual block, we can conveniently apply channel-wise and spatial attention to multi-level features by stacking multiple CSAR blocks, and thus capture more multi-level important information.
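Putting the pieces together, a CSAR block can be sketched as follows; the helper names and default sizes are our assumptions, with compact CA/SA branches inlined to keep the sketch self-contained:

```python
import torch
import torch.nn as nn

def ca_unit(c=64, r=16):
    """Compact channel-wise attention branch (squeeze-and-excitation style)."""
    return nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c // r, 1),
                         nn.ReLU(inplace=True), nn.Conv2d(c // r, c, 1), nn.Sigmoid())

def sa_unit(c=64, mu=2):
    """Compact spatial attention branch producing a single H x W mask."""
    return nn.Sequential(nn.Conv2d(c, c * mu, 1), nn.ReLU(inplace=True),
                         nn.Conv2d(c * mu, 1, 1), nn.Sigmoid())

class CSARBlock(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.residual = nn.Sequential(nn.Conv2d(c, c, 3, padding=1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv2d(c, c, 3, padding=1))
        self.ca, self.sa = ca_unit(c), sa_unit(c)
        self.fuse = nn.Conv2d(2 * c, c, 1)   # learned fusion of the two attentions

    def forward(self, x):
        r = self.residual(x)                           # residual features R(X)
        r_ca, r_sa = r * self.ca(r), r * self.sa(r)    # two modulated copies
        return x + self.fuse(torch.cat([r_ca, r_sa], dim=1))  # identity preserved

x = torch.randn(1, 64, 16, 16)
y = CSARBlock()(x)
```

Only the residual branch is modulated; the identity path `x + ...` is left untouched, matching the first benefit noted above.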
III-B. The FMM Module
To make full use of the attention mechanism and conveniently maintain persistent memory, the FMM module is built. As illustrated in Fig. 10(b), the FMM module contains a CSAR blockchain and a gated fusion (GF) node.
The CSAR blockchain is constructed by stacking multiple CSAR blocks in a chain structure, which performs channel-wise and spatial feature modulation at multiple levels. Supposing $B$ CSAR blocks in a blockchain are stacked in sequence, the input of the first CSAR block and the output of the last CSAR block are the input and output of the CSAR blockchain respectively. Thus, the CSAR blockchain can be formulated as

$$H = F_{CB}(X) = F_{CSAR}^{B}\big(F_{CSAR}^{B-1}(\cdots F_{CSAR}^{1}(X) \cdots)\big), \tag{9}$$

where $F_{CSAR}^{1}(\cdot), \dots, F_{CSAR}^{B}(\cdot)$ are the functions of the CSAR blocks as depicted in Eq. (6), and $F_{CB}(\cdot)$ denotes the operation of the CSAR blockchain.

To preserve long-term information when multiple FMM modules are stacked in the deep network, the GF node is attached to integrate the information coming from the previous FMM modules and from the current blockchain through an adaptive learning process. In the GF node, the features generated by the preceding FMM modules and by the current CSAR blockchain are first concatenated and then fed into a $1 \times 1$ convolutional layer to be adaptively fused. Let $S_0, S_1, \dots, S_{m-1}$ be the output features of the previous FMM modules and $H_m$ be the output of the current CSAR blockchain with $B$ CSAR blocks. The process of gated fusion is formulated as

$$S_m = f_{GF}\big([S_0, S_1, \dots, S_{m-1}, H_m]\big) = W_g * [S_0, S_1, \dots, S_{m-1}, H_m] + b_g, \tag{10}$$

where $f_{GF}(\cdot)$ denotes the function of the $1 \times 1$ convolutional layer with parameters $W_g$ and $b_g$. This convolutional layer implements the gating mechanism that learns adaptive weights for different information and then controls the output information. Based on the above, the formulation of the $m$-th FMM module can be written as

$$S_m = F_{FMM}^{m}(S_{m-1}), \tag{11}$$

where $F_{FMM}^{m}(\cdot)$ denotes the function of the $m$-th FMM module, and $S_{m-1}$ and $S_m$ are the input and output of the module. As $S_{m-1}$ is also the input of the CSAR blockchain $F_{CB}(\cdot)$ in the FMM module, we have $H_m = F_{CB}(S_{m-1})$ in Eq. (11).
Thus, in the CSAR blockchain, the stacked CSAR blocks modulate multi-level features to capture more important information, and multiple short-term skip connections allow rich information to flow across different layers and modules. Meanwhile, in the GF node, the long-term dense connections among the FMM modules not only alleviate the long-term information loss of the deep network during forward propagation but also contribute to multi-level information fusion, which benefits image SR.
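The blockchain-plus-gated-fusion wiring can be sketched as below; for brevity, plain residual blocks stand in for the full CSAR blocks, and the class names and sizes are our assumptions:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Stand-in for a CSAR block (the attention units are omitted for brevity)."""
    def __init__(self, c=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, 3, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(c, c, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class FMM(nn.Module):
    """A chain of B blocks plus a gated-fusion (GF) 1x1 conv that fuses the
    chain output with the states of all preceding FMM modules."""
    def __init__(self, c=64, n_blocks=4, n_prev_states=1):
        super().__init__()
        self.chain = nn.Sequential(*[ResBlock(c) for _ in range(n_blocks)])
        self.gate = nn.Conv2d(c * (n_prev_states + 1), c, 1)  # adaptive fusion weights

    def forward(self, x, prev_states):
        h = self.chain(x)                                  # short-term information H_m
        return self.gate(torch.cat(prev_states + [h], 1))  # long-term gated fusion

s0 = torch.randn(1, 64, 16, 16)     # state entering the module
s1 = FMM()(s0, prev_states=[s0])    # fused with the preceding state
```

Because the GF node is a single 1x1 convolution over the concatenated states, its learned weights act as per-feature-map gates, which is also what the weight-norm analysis at the end of Section IV inspects.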
TABLE I

| Components | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| In residual blocks: channel-wise attention (CAR) | × | ✓ | × | – | × | ✓ | × | – |
| In residual blocks: spatial attention (SAR) | × | × | ✓ | – | × | × | ✓ | – |
| In residual blocks: combinational attention of CA and SA (CSAR) | × | – | – | ✓ | × | – | – | ✓ |
| Gated fusion (GF) node with long-term dense connections | × | × | × | × | ✓ | ✓ | ✓ | ✓ |

Ablation study on the effects of the channel-wise and spatial attention residual (CSAR) block and the gated fusion (GF) node with long-term dense connections. Average PSNRs for a scale factor of  on the Urban100 dataset are reported.
III-C. Network Architecture
As shown in Fig. 10, we stack multiple FMM modules to build the feature transformation sub-network (FTNet), which maps the features generated by the initial feature extraction sub-network (IFENet) to highly informative features for the upscaling sub-network (UpNet). In addition, similar to [12, 13], we adopt global residual-feature learning in the FTNet by adding an identity branch from its input to its output (green curve in Fig. 10). The three sub-networks thus make up our CSFM network for super-resolving the LR image. Let us denote $I_{LR}$ and $I_{SR}$ as the input and output of the CSFM network. We adopt a convolutional layer as the IFENet to extract the initial features from the LR input image,

$$F_0 = f_{IFE}(I_{LR}), \tag{12}$$

where $f_{IFE}(\cdot)$ denotes the function of the IFENet, and $F_0$ is the set of extracted features, which is then fed into the FTNet and also used for global residual-feature learning.

In the FTNet, the input $F_0$ is first sent to a convolutional layer for receptive-field expansion, and the generated features $S_0$ are then used as the input to the first FMM module. Supposing $M$ FMM modules and one convolutional layer are stacked to perform the feature transformation, the output of the FTNet can be obtained by

$$F_T = f_{FT}(F_0) = f_{conv}\big(F_{FMM}^{M}(F_{FMM}^{M-1}(\cdots F_{FMM}^{1}(S_0) \cdots))\big) + F_0, \tag{13}$$

where $f_{FT}(\cdot)$ represents the FTNet, of which the output is $F_T$, $f_{conv}(\cdot)$ is the convolutional operation, and $F_{FMM}^{m}(\cdot)$ denotes the function of the $m$-th FMM module as described in Eq. (11).
After acquiring the highly informative features $F_T$, we exploit the UpNet to upsample them for HR image reconstruction. Specifically, we adopt a sub-pixel convolutional layer  followed by a convolutional layer as the UpNet for converting multiple HR sub-images to a single HR image:

$$I_{SR} = f_{CSFM}(I_{LR}) = f_{UP}(F_T), \tag{14}$$

where $f_{UP}(\cdot)$ and $f_{CSFM}(\cdot)$ denote the functions of the UpNet and the whole CSFM network respectively.

The CSFM network is optimized by minimizing the difference between the super-resolved image $I_{SR}$ and the corresponding ground-truth image $I_{HR}$. As done in previous work [12, 13], we adopt the $L_1$ loss function to measure the difference. Given a training dataset $\{I_{LR}^{(i)}, I_{HR}^{(i)}\}_{i=1}^{N}$, where $N$ is the number of training patch pairs and $\{I_{LR}^{(i)}, I_{HR}^{(i)}\}$ is the $i$-th LR and HR patch pair, the objective function for training the CSFM network is formulated as

$$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \big\| f_{CSFM}(I_{LR}^{(i)}) - I_{HR}^{(i)} \big\|_1, \tag{15}$$

where $\Theta$ denotes the parameter set of the CSFM network.
With the stacked FMM modules and the densely connected structure, the proposed CSFM network not only possesses discriminative learning ability for different types of information but also enables the information that is easier to reconstruct to take shorter forward/backward paths across the network, so that more attention is paid to the more important and more difficult information.
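The overall three-stage pipeline can be sketched end to end as follows; plain convolutional blocks stand in for the FMM modules, and all hyper-parameters here are illustrative rather than the paper's:

```python
import torch
import torch.nn as nn

class CSFMSketch(nn.Module):
    """Simplified CSFM pipeline: IFENet (one conv) -> FTNet (module stack with
    dense gated fusion and a global residual) -> UpNet (sub-pixel upscaling).
    Plain conv blocks stand in for the FMM modules; sizes are our choices."""
    def __init__(self, c=64, n_modules=4, scale=4):
        super().__init__()
        self.ifenet = nn.Conv2d(3, c, 3, padding=1)       # initial features F0
        self.head = nn.Conv2d(c, c, 3, padding=1)         # receptive-field expansion
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
                           nn.Conv2d(c, c, 3, padding=1)) for _ in range(n_modules)])
        # one GF-style 1x1 conv per module fuses all preceding states
        self.gates = nn.ModuleList(
            [nn.Conv2d(c * (m + 2), c, 1) for m in range(n_modules)])
        self.tail = nn.Conv2d(c, c, 3, padding=1)
        self.upnet = nn.Sequential(nn.Conv2d(c, c * scale ** 2, 3, padding=1),
                                   nn.PixelShuffle(scale),
                                   nn.Conv2d(c, 3, 3, padding=1))

    def forward(self, lr):
        f0 = self.ifenet(lr)
        states = [self.head(f0)]                             # S0
        for block, gate in zip(self.blocks, self.gates):
            h = block(states[-1])                            # current module output
            states.append(gate(torch.cat(states + [h], 1)))  # dense gated fusion
        ft = self.tail(states[-1]) + f0                      # global residual learning
        return self.upnet(ft)

sr = CSFMSketch()(torch.randn(1, 3, 12, 12))
```

Training such a model would then amount to minimizing the mean absolute error between `sr` and the ground-truth HR patch, following the objective above.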
IV. Experiments and Analysis
In this section, we first provide implementation details, including both model hyper-parameters and training data settings. Then, we study the contributions of different components of the proposed CSFM network through ablation experiments. Finally, we compare our CSFM model with other state-of-the-art methods on several benchmark datasets.
IV-A. Datasets and Metrics
We conduct comparison studies on the widely used datasets Set5 , Set14 , BSD100 , Urban100  and Manga109 , which contain 5, 14, 100, 100 and 109 images respectively. Set5, Set14 and BSD100 contain natural scene images, while Urban100 consists of urban scene images with many details in different frequency bands and Manga109 is made up of Japanese comic images with many fine structures. We use the 800 high-quality training images from DIV2K  to train our model. Data augmentation is performed on these training images, including random horizontal flipping and random rotation.
We use the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index  as evaluation metrics. Higher PSNR and SSIM values indicate better quality. As is common in SISR, all criteria are calculated on the luminance channel of the image after pixels near the image boundary are removed.
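The PSNR part of this protocol can be sketched as follows (a BT.601 luma conversion followed by a border crop; the crop width and the data here are illustrative, not the paper's settings):

```python
import numpy as np

def rgb_to_y(img):
    """ITU-R BT.601 luma for 8-bit RGB arrays of shape (H, W, 3) in [0, 255]."""
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                   + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr, hr, border=4):
    """PSNR on the luminance channel, cropping `border` pixels per side,
    as is conventional in SISR evaluation (the crop width is our choice)."""
    y_sr = rgb_to_y(sr.astype(np.float64))[border:-border, border:-border]
    y_hr = rgb_to_y(hr.astype(np.float64))[border:-border, border:-border]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

hr = np.random.randint(0, 255, (64, 64, 3)).astype(np.float64)
val = psnr_y(hr + 1.0, hr)   # uniform error of one gray level
```

Evaluating on the luminance channel matters because human perception is far more sensitive to luminance error than to chrominance error, and it keeps results comparable across methods that handle color differently.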
IV-B. Implementation Details
We apply our model to super-resolve RGB low-resolution images, which are generated by downsampling the corresponding HR images with a bicubic kernel to a certain scale. Following , we pre-process all images by subtracting the mean RGB values of the DIV2K dataset. For training, LR color patches are randomly cropped from the LR images as the inputs of our proposed model, and the mini-batch size is set to 16. We train our model with the ADAM optimizer  using its default momentum settings. The initial learning rate is reduced to half after a fixed number of mini-batch updates and then halved periodically thereafter. We implement the model in PyTorch and run training and testing on an NVIDIA GTX 1080Ti GPU.
In our CSFM network, all convolutional layers have 64 filters with $3 \times 3$ kernels, except the $1 \times 1$ convolutional layers in the CA and SA units and those in the GF nodes. Meanwhile, we zero-pad the boundaries of each feature-map so that its spatial size matches the input size after convolution. In addition, in the CSAR block, the reduction ratio $r$ in the CA unit and the increase ratio $\mu$ in the SA unit are empirically set to 16 and 2 respectively.
IV-C. Model Analysis
In this subsection, the contributions of different components and designs of our model are analyzed experimentally, including the CSAR block, the GF node for information persistence, and the performance under different numbers of CSAR blocks and FMM modules. For fairness and convenience, all models used in these comparisons are trained for the same number of mini-batch updates.
IV-C1) The CSAR Block
To validate the effectiveness of the CSAR block, we construct three other blocks for comparison. (I) The base residual (BR) block contains two convolutional layers with one ReLU activation, as shown in Fig. 18(a). Compared with the CSAR block, the BR block removes both the CA unit and the SA unit, corresponding to the 1st and 5th combinations in TABLE I. (II) The channel-wise attention residual (CAR) block is constructed by integrating the CA unit into the BR block to adaptively rescale channel-wise features, as depicted in Fig. 18(b); it corresponds to the 2nd and 6th combinations in TABLE I. (III) The spatial attention residual (SAR) block (the 3rd and 7th combinations in TABLE I), illustrated in Fig. 18(c), is developed by introducing the SA unit into the BR block to modulate pixel-wise features. Specifically, we apply 64 of these blocks in the respective networks for experimental comparison and report the SR performance of these networks on the Urban100 dataset in TABLE I. When the combinational attention of CA and SA is adopted in our CSAR block (the 4th and 8th combinations), the separate channel-wise or spatial attention need not be introduced; we therefore mark these cases with the symbol “–” in TABLE I. In addition, Fig. 24 provides visual comparisons of the network with BR blocks (the 1st combination in TABLE I), the network with CSAR blocks (the 4th combination), and the network with both CSAR blocks and GF nodes (the last combination).
From TABLE I, we can see that when both the CA unit and the SA unit are removed (the BR block), the PSNR values are relatively low, especially when the GF nodes are not used for long-term information preservation. By integrating the CA unit or the SA unit into the BR blocks, SR performance is moderately improved. Moreover, when our proposed CSAR blocks with the combinational attention are utilized, the performance is further boosted: in both cases, without and with the GF nodes, the network with CSAR blocks outperforms that with BR blocks by PSNR gains of 0.16 dB and 0.11 dB respectively. Furthermore, in Fig. 24, it can be seen that the network with only BR blocks (Fig. 24(b)) generates blurry and false fence lines, while the network with the proposed CSAR blocks (Fig. 24(c)) accurately reconstructs the fence rows and presents a better result by combining the channel-wise and spatial attentions. These observations demonstrate the superiority of our CSAR block over blocks without attention or with only one type of attention (i.e., the BR, CAR and SAR blocks), and also show that integrating channel-wise and spatial attention in residual blocks to modulate multi-level features benefits image SR.
TABLE II: Quantitative Evaluations of State-of-the-art SR Methods. The Average PSNRs/SSIMs for Scale Factors ×2, ×3 and ×4 Are Reported. Bold Indicates the Best Performance and Underline Indicates the Second-best Performance.
IV-C2 The GF Node with Long-term Dense Connections
As illustrated in Fig. 10(b), the GF node is added at the end of the FMM module to contribute to persistent memory maintenance and the fusion of different types of information. To investigate its contribution, we conduct ablation tests and present the effect of the GF node in TABLE I and Fig. 24. In TABLE I, the first four columns list the results produced by networks without GF nodes, where 64 blocks are cascaded for feature transformation, while the last four columns show the performances of networks with GF nodes, in which 16 blocks and one GF node constitute a module and 4 modules are stacked with a densely connected structure (similar to the architecture of the CSFM network). Comparing the first four columns with the last four, we find that the networks with GF nodes perform better than those without. Specifically, when the CSAR blocks with combinational attentions are utilized, the network with GF nodes achieves a PSNR improvement of 0.21dB over the baseline network with only BR blocks. Besides, from Fig. 24, we can observe that by introducing the information maintenance mechanism, the network with GF nodes generates finer and clearer fence rows than those without GF nodes. These comparisons show that applying the GF nodes eases long-term information preservation, so that more important information can be effectively exploited for image SR.
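The gated fusion described above can be sketched as a 1×1 convolution over concatenated feature-maps. The following NumPy sketch illustrates the general mechanism only; the tensor layout and the name `gf_node` are assumptions, not the paper's implementation.

```python
import numpy as np

def gf_node(feature_maps, filters):
    """Gated fusion: concatenate all inputs, then mix them with a 1x1 convolution.

    feature_maps: list of M arrays, each (C, H, W) -- the long-term outputs of
    preceding FMM modules plus the short-term output of the current module.
    filters: (C_out, M*C) weights of the 1x1 convolution acting as the gate;
    each output channel adaptively weights every input channel, deciding how
    much long-term and short-term information is passed on.
    """
    x = np.concatenate(feature_maps, axis=0)          # (M*C, H, W)
    return np.einsum('oc,chw->ohw', filters, x)       # fused output, (C_out, H, W)
```

Because the gate is learned, channels carrying little useful information can be suppressed by small weights, which is what the weight-norm analysis below measures.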
To further analyze the contributions of the different kinds of information fed into the GF nodes and to illustrate how the GF nodes control the output information, inspired by , we compute statistics on the norms of the weights of all filters in the GF nodes. For each feature-map input to a GF node, we calculate the weight norm in the corresponding part of the filter as

\[ v_m = \left\| \mathbf{W}_m \right\|_2, \quad m = 1, \ldots, M, \]

where $v_m$ represents the weight norm of the $m$-th feature-map fed into the GF node (receiving $M$ feature-maps as input), and $\mathbf{W}_m$ denotes the set of filter weights in the GF node acting on the $m$-th feature-map. A larger norm indicates that the feature-map provides more information to the GF node for fusion, and vice versa. For the sake of comparison, we average the weight norms of the long-term feature-maps from the preceding FMM modules and of the short-term feature-maps from the current FMM module respectively. Similar to , we normalize the weight norms to the range of 0 to 1 for better visualization. Fig. 27 presents the average norms of the two types of feature-maps (long-term and short-term) in the eight GF nodes of the eight FMM modules for two scale factors. One can see that the long-term information from the preceding modules makes a non-negligible contribution, especially in the late modules, regardless of the upscaling factor, which indicates that long-term information plays an important role in super-resolving the LR image. Therefore, adding the GF nodes for information persistence is beneficial for improving SR performance.
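The weight-norm statistic can be computed directly from a GF node's 1×1-convolution weights. Below is a NumPy sketch assuming the gate weights are stored as a (C_out, M·C) matrix and using the ℓ2 norm; the storage layout and the function name are illustrative assumptions.

```python
import numpy as np

def feature_map_weight_norms(filters, num_inputs):
    """Per-input-feature-map weight norms of a GF node's 1x1-conv filters.

    filters: (C_out, num_inputs * C) gate weights; the m-th input feature-map
    is acted on by the weight slice filters[:, m*C:(m+1)*C].
    Returns the norms normalized to [0, 1] for visualization.
    """
    C = filters.shape[1] // num_inputs
    norms = np.array([np.linalg.norm(filters[:, m * C:(m + 1) * C])
                      for m in range(num_inputs)])
    span = norms.max() - norms.min()
    return (norms - norms.min()) / (span + 1e-12)     # rescale to [0, 1]
```

Averaging the entries that belong to long-term inputs versus the short-term input then yields the two curves plotted per GF node.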
IV-C3 The Number of FMM Modules and the Number of CSAR Blocks in Each FMM Module
The capacity of the CSFM network is mainly determined by the number of FMM modules and the number of CSAR blocks in each FMM module. In this subsection, we test the effects of these two parameters on image SR. For brevity, we refer to each network configuration by its number of FMM modules and its number of CSAR blocks per module.
Fig. 28 shows the PSNR performance (illustrated by different colors according to the color bar on the right) versus these two parameters on the BSD100 dataset. We can see that better performance can be achieved by increasing the number of modules or the number of blocks per module. Since larger values of either parameter result in a deeper network, the comparisons in Fig. 28 suggest that deeper models remain advantageous. On the other hand, two configurations with the same total number of CSAR blocks achieve comparable performance (32.212dB versus 32.208dB in PSNR) even though the configuration with fewer modules has fewer GF nodes for long-term skip connections, and a similar observation holds for another such pair. These results indicate that properly utilizing a limited number of skip connections does not lose accuracy but reduces redundancy and computational cost. To effectively exploit long-term skip connections for information persistence while controlling the computational cost, we adopt eight FMM modules with 16 CSAR blocks per module as our CSFM model for the following comparison experiments.
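The tradeoff between the number of modules and the number of blocks per module can be made concrete by counting components. The sketch below uses a hypothetical accounting of two convolutions per CSAR block and one 1×1 convolution per GF node; the function and its counts are illustrative, not taken from the paper.

```python
def network_size(n_modules, n_blocks_per_module, convs_per_block=2):
    """Count CSAR blocks, GF nodes and conv layers for a given configuration."""
    total_blocks = n_modules * n_blocks_per_module
    gf_nodes = n_modules                       # one gated-fusion node per FMM module
    conv_layers = total_blocks * convs_per_block + gf_nodes
    return total_blocks, gf_nodes, conv_layers
```

For example, `network_size(8, 16)` and `network_size(4, 32)` have the same 128 CSAR blocks, but the latter has only half the GF nodes (and hence fewer long-term skip connections), illustrating why two such configurations can trade skip connections for per-module depth at similar cost.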
IV-D Comparisons with the State-of-the-arts
To illustrate the effectiveness of the proposed CSFM network, several state-of-the-art SISR methods, including SRCNN , FSRCNN , VDSR , LapSRN , DRRN , MemNet , IDN , EDSR , SRMDNF , D-DBPN  and RDN , are compared in terms of quantitative evaluation, visual quality and number of parameters. Since some of the existing networks, such as SRCNN, FSRCNN, VDSR, DRRN, MemNet, EDSR and IDN, did not report SR results on the Manga109 dataset, we generate the corresponding results by applying their publicly available trained models to the Manga109 dataset for evaluation. In addition, we rebuild the VDSR network in PyTorch with the same network parameters for training and testing, as its trained model is not provided.
The quantitative evaluations on the five benchmark datasets for three scale factors (×2, ×3 and ×4) are summarized in TABLE II. When compared with MemNet and RDN, both of which introduce a persistent memory mechanism via extremely dense skip connections, our CSFM network achieves the highest performance with fewer skip connections. This indicates that our FMM module with long-term skip connections not only advances the memory block in MemNet  and the residual dense block in RDN , but also reduces the redundancy of the extremely dense connection structure. Meanwhile, our CSFM model significantly outperforms the remaining methods on all datasets for all upscaling factors, in terms of both PSNR and SSIM. In particular, on the challenging Urban100 dataset, the proposed CSFM network advances the state-of-the-art (achieved by EDSR or RDN) by margins of 0.19dB, 0.18dB and 0.14dB for scale factors of ×2, ×3 and ×4 respectively. Even larger improvements are obtained on the Manga109 dataset, where the proposed CSFM model outperforms EDSR (the best among the prior methods) by PSNR gains of 0.21dB, 0.32dB and 0.29dB for ×2, ×3 and ×4 enlargement respectively. These results validate the superiority of the proposed method, especially for super-resolving images with fine structures such as those in the Urban100 and Manga109 datasets.
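PSNR, the primary metric reported in TABLE II, can be computed as follows. This is the standard definition; note that SR papers commonly evaluate it on the luminance channel after cropping border pixels, a detail omitted in this sketch.

```python
import numpy as np

def psnr(hr, sr, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a ground-truth HR image and an SR result."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')                    # identical images
    return 10.0 * np.log10(peak ** 2 / mse)    # higher is better

# e.g. two 8-bit images differing by exactly 1 at every pixel give
# 10 * log10(255**2 / 1) ≈ 48.13 dB
```

A PSNR gain of 0.2dB, as in the Urban100 comparison above, therefore corresponds to a small but consistent reduction in mean squared reconstruction error.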
Visual comparisons of the different methods are shown in Fig. 29 – Fig. 34. Thanks to the proposed FMM modules for adaptive multi-level feature modulation and long-term memory preservation, our CSFM network accurately and clearly reconstructs stripe patterns, grid structures, texture regions and characters. Severe distortions and noticeable artifacts can be observed in the results generated by the prior methods, such as the marked stripes on the wall in Fig. 29, the color lines on the balcony in Fig. 32 and the grids on the building in Fig. 33. In contrast, our method avoids the distortions, suppresses the artifacts and generates more faithful results. Besides, in Fig. 30, Fig. 31 and Fig. 34, only our method is able to recover accurate textures and recognizable characters, while the other methods suffer from severe information loss and heavy blurring artifacts. These visual comparisons further demonstrate the powerful representational ability of our CSFM network.
We also compare the tradeoff between performance and the number of network parameters for our CSFM network and existing networks. Fig. 35 shows the PSNR performances of several models versus their numbers of parameters, evaluated on the Set5 dataset. We can see that our CSFM network significantly outperforms the relatively small models. Furthermore, compared with EDSR and RDN, our CSFM network achieves higher PSNR with 72% and 47% fewer parameters respectively. These comparisons indicate that our model achieves a better tradeoff between performance and model size.
In this paper, we propose a channel-wise and spatial feature modulation (CSFM) network for single image super-resolution, in which stacked feature-modulation memory (FMM) modules with a densely connected structure effectively improve discriminative learning ability and let the network concentrate on worthwhile information. The FMM module consists of a chain of cascaded channel-wise and spatial attention residual (CSAR) blocks and a gated fusion (GF) node. The CSAR block is constructed by incorporating channel-wise attention and spatial attention into the residual block, and is utilized to modulate the residual features in a global-and-local way. Further, when a sequence of CSAR blocks is cascaded in the FMM module, the two types of attention can be jointly applied to multi-level features so that more informative features are captured. Meanwhile, the GF node, designed by introducing a gating mechanism to establish long-term skip connections among the FMM modules, helps maintain long-term information and enhance information flow. Comprehensive evaluations on benchmark datasets demonstrate the better performance of our CSFM network in terms of both quantitative and qualitative measurements.
-  W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-level vision,” Int. J. Comput. Vis., vol. 40, no. 1, pp. 25–47, Oct. 2000.
-  G. Polatkan, M. Zhou, L. Carin, and D. Blei, “A Bayesian non-parametric approach to image super-resolution,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 346–358, Feb. 2015.
-  H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun./Jul. 2004, pp. 275–282.
-  J. Jiang, R. Hu, Z. Wang, Z. Han, and J. Ma, “Facial image hallucination through coupled-layer neighbor embedding,” IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 9, pp. 1674–1684, Sep. 2016.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
-  L. He, H. Qi, and R. Zaretzki, “Beta process joint dictionary learning for coupled feature spaces with application to single image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 345–352.
-  R. Timofte, V. D. Smet, and L. V. Gool, “A+: adjusted anchored neighborhood regression for fast super-resolution,” in Proc. 12th Asian Conf. Comput. Vis. (ACCV), Nov. 2014, pp. 111–126.
-  Y. Hu, N. Wang, D. Tao, X. Gao, and X. Li, “SERF: a simple, effective, robust, and fast image super-resolver from cascaded linear regression,” IEEE Trans. Image Process., vol. 25, no. 9, pp. 4091–4102, Sep. 2016.
-  J. Huang and W. Siu, “Learning hierarchical decision trees for single-image super-resolution,” IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 5, pp. 937–950, May 2017.
-  S. Schulter, C. Leistner, and H. Bischof, “Fast and accurate image upscaling with super-resolution forests,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3791–3799.
-  J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5197–5206.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 136–144.
-  Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 2472–2481.
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2016.
-  J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1646–1654.
-  Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3147–3155.
-  J. Kim, and J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1637–1645.
-  X.-J. Mao, C. Shen, and Y.-B. Yang, “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2016, pp. 2802–2810.
-  Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: a persistent memory network for image restoration,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4549–4557.
-  C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 391–407.
-  W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1874–1883.
-  C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 105–114.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
-  G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
-  T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4799–4807.
-  Y. Wang, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung, and C. Schroers, “A fully progressive approach to single-image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 977–986.
-  W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep Laplacian pyramid networks for fast and accurate super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 624–632.
-  Z. He, S. Tang, J. Yang, Y. Cao, M. Y. Yang, and Y. Cao, “Cascaded deep networks with multiple receptive fields for infrared image super-resolution,” IEEE Trans. Circuits Syst. Video Technol., 2018.
-  M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 1664–1673.
-  W. Han, S. Chang, D. Liu, M. Yu, M. Witbrock, and T. S. Huang, “Image super-resolution via dual-state recurrent networks,” arXiv:1805.02704, May 2018.
-  Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image super-resolution via information distillation network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 723–731.
-  J.-H. Kim and J.-S. Lee, “Deep residual network with enhanced upscaling module for super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 913–921.
-  E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov, “Generating images from captions with attention,” in Int. Conf. Learn. Rep. (ICLR), May. 2016, pp. 1–4.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Int. Conf. Mach. Learn. (ICML), Jul. 2015, pp. 2048–2057.
-  L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5659–5667.
-  J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7132–7141.
-  F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3156–3164.
-  Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sept. 2018, pp. 1–16.
-  X. Wang, K. Yu, C. Dong, and C. C. Loy, “Recovering realistic texture in image super-resolution by deep spatial feature transform,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 606–615.
-  H. Xu and K. Saenko, “Ask, attend and answer: Exploring question guided spatial attention for visual question answering,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 451–466.
-  Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2017, pp. 1–9.
-  L. Itti and C. Koch, “Computational modelling of visual attention,” Nat. Rev. Neurosci., vol. 2, no. 3, pp. 194–203, Mar. 2001.
-  M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. A. Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in Proc. 23rd British Mach. Vis. Conf. (BMVC), Sep. 2012, pp. 135.1–135.10.
-  R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Proc. 7th Int. Conf. Curves Surfaces, Jun. 2010, pp. 711–730.
-  P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 898–916, May. 2011.
-  Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa, “Sketch-based manga retrieval using manga109 dataset,” Multimedia Tools Appl., vol. 76, no. 20, pp. 21 811–21 838, Oct. 2017.
-  R. Timofte, E. Agustsson, L. V. Gool, M.-H. Yang, L. Zhang, B. Lim, and et al., “NTIRE 2017 challenge on single image super-resolution: methods and results,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 1110–1121.
-  Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Int. Conf. Learn. Rep. (ICLR), May. 2015, pp. 1–13.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in Proc. Adv. Neural Inf. Process. Syst. Workshop Autodiff, Dec. 2017, pp. 1–4.
-  K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 3262–3271.