Multi-grained Attention Networks for Single Image Super-Resolution

09/26/2019, by Huapeng Wu et al.

Deep Convolutional Neural Networks (CNN) have drawn great attention in image super-resolution (SR). Recently, visual attention mechanism, which exploits both of the feature importance and contextual cues, has been introduced to image SR and proves to be effective to improve CNN-based SR performance. In this paper, we make a thorough investigation on the attention mechanisms in a SR model and shed light on how simple and effective improvements on these ideas improve the state-of-the-arts. We further propose a unified approach called "multi-grained attention networks (MGAN)" which fully exploits the advantages of multi-scale and attention mechanisms in SR tasks. In our method, the importance of each neuron is computed according to its surrounding regions in a multi-grained fashion and then is used to adaptively re-scale the feature responses. More importantly, the "channel attention" and "spatial attention" strategies in previous methods can be essentially considered as two special cases of our method. We also introduce multi-scale dense connections to extract the image features at multiple scales and capture the features of different layers through dense skip connections. Ablation studies on benchmark datasets demonstrate the effectiveness of our method. In comparison with other state-of-the-art SR methods, our method shows the superiority in terms of both accuracy and model size.


I Introduction

Image Super-Resolution (SR) is an important image processing technique that recovers high-resolution images from low-resolution ones. Image SR has recently drawn great attention in a variety of research fields, such as remote sensing imaging [35], medical imaging [28], and video surveillance [15]. Image SR is a representative ill-posed inverse problem: information is lost during the degradation process, and each Low-Resolution (LR) image may correspond to multiple High-Resolution (HR) ones. Great efforts have been devoted to this problem, and previous methods can be divided into three groups: interpolation-based methods [49], reconstruction-based methods [45], and learning-based methods [39, 6, 16, 17, 32, 19, 21, 23].

In recent years, the fast development of deep learning technology [20] has greatly advanced the research on single image SR. Due to their high-level representations and strong learning ability, deep convolutional neural networks (CNN) have quickly become the de facto framework in the SR community. The first attempt to use a CNN for single image SR was made by Dong et al. [6]. They built a three-layer CNN model called “SRCNN” to learn a nonlinear mapping from LR to HR image pairs in an end-to-end fashion, which showed significant improvement over most conventional methods. Later, other researchers proposed much deeper networks by integrating the ideas of residual learning [16, 17, 32] and recursive learning [17, 32], achieving substantial improvement over SRCNN. However, these methods first upscale the LR input image to the desired output size using bicubic interpolation before feeding it into the network, leading to extra computational cost and reconstruction artifacts. To overcome this problem, new methods were proposed that upsample the spatial resolution in the later layers of the network. Dong et al. [7] proposed a deconvolution operation for upscaling the final LR feature maps. Shi et al. [27] introduced a more effective sub-pixel convolution layer to upscale the LR feature maps to the size of the HR image at the output end of the network. This efficient post-upscaling strategy has since been widely used in SR [23, 47], as it not only allows further increasing the depth of the network, but also reduces the computational load. In a recent work by Lim et al. [23], the SR model is built with over 160 layers. They employed a simplified ResNet [9] architecture by removing the normalization layers of SRResNet [21] and won the championship of the NTIRE2017 super-resolution challenge [33].


Fig. 1: An overview of our multi-grained attention network (MGAN). Our network consists of several multi-grained attention blocks. To further exploit the information of hierarchical features [22, 48], the features produced by different blocks are concatenated together to obtain the final representation.

More recently, the idea of multi-scale feature fusion has been introduced to SR, where most methods are based on the inception mechanism of GoogLeNet [30]. In image SR, it has been demonstrated that making full use of multi-scale features improves the restoration results [44]. Specifically, Li et al. [22] employed an inception-style architecture to construct a multi-scale residual block as the building block of their SR model. However, like many deep networks such as VDSR [16], LapSRN [19], and EDSR [23], this method [22] ignores the integration of information from different convolutional layers for image reconstruction. Although some methods, e.g., SRDenseNet [34], RDN [48], and CSAR [12], achieve improved results by utilizing dense skip connections [13] to enrich their features, they ignore the importance of multi-scale features. Besides, other methods such as MemNet [31] use a similar approach by combining information from preceding memory blocks; however, they take pre-upsampled images as input, which results in additional computational overhead.

Attention mechanism in image SR. More recently, the “attention mechanism” has been introduced to single image SR and has proven effective in improving the performance of deep CNN-based SR models. The attention mechanism was originally proposed in machine translation to improve the performance of an RNN model by taking into account the input from several time steps when making a one-step prediction [2]. In a CNN-based model, the introduction of an attention module is helpful for investigating the correlations among different feature channels and spatial locations, and it has now been widely used in many computer vision tasks, such as object detection [46], optical character recognition [38], and image captioning [40]. Although CNNs have shown great superiority in SR tasks, their drawbacks are obvious. As the convolution operation in a standard CNN model is translation invariant, a convolution layer treats different feature locations equally and is therefore hard-pressed to learn representations that exploit contextual cues effectively.

To this end, some researchers have introduced the attention mechanism into SR models to improve the features [47, 5, 12]. Zhang et al. [47] introduced the Squeeze-and-Excitation (SE) module [10] (a.k.a. the channel attention module) into an SR model to re-weight the importance of each feature channel by learning the interdependencies among channels and then rescaling the features in a channel-wise manner. The SE module [10] selects the most useful features among channels and improves the effectiveness of the feature representations. In addition to the interdependencies of the channel-wise features, spatial correlations are also important for SR tasks. Hu et al. [12] combine channel attention with a spatial attention that sets trainable convolution filters on each spatial location individually [4] to improve accuracy. Although incremental improvements have been obtained, the above methods still have drawbacks. For example, RCAN [47] only focuses on channel-wise attention while ignoring the importance of spatial information. Although [12] uses both channel and spatial attention mechanisms, its spatial attention incurs high computational costs. Besides, learning features in flat image areas may require a large receptive field, while textural details may require a small one; in the above methods, however, the diversity of features at different scales is ignored. Therefore, it is particularly important to integrate multi-scale information into an attention-based image SR model [29, 11, 22].

In this paper, we make a thorough investigation of attention-based image SR methods and then shed light on how simple and effective modifications further improve state-of-the-art performance. We further propose a unified approach called “multi-grained attention networks (MGAN)” to fully exploit the advantages of attention mechanisms in SR tasks, where the “channel attention” [47, 5, 12] and “spatial attention” [12] strategies in previous methods can be essentially considered as two special cases of our method. In our method, the importance of each neuron is computed according to its surrounding regions in a multi-grained fashion and is then used to adaptively re-scale the feature responses.

An overview of our method, MGAN, is shown in Fig. 1. In the network, we not only jointly learn feature interdependencies in the channel and spatial dimensions, but also make full use of the skip connections between different layers and scales to improve the information flow during training. The proposed MGAN consists of a series of Multi-Grained Attention Blocks (MGAB). The details of each block are shown in Fig. 2: an MGAB produces multi-scale features at each layer and passes them to all subsequent layers through dense connections [13]. In each block, the feature responses are adaptively re-scaled in the channel and spatial dimensions to capture richer contextual information, which is beneficial for enhancing the discriminability of the network. Specifically, we introduce a multi-grained attention structure [36] to compute importance values from the surrounding regions of each neuron at different scales and then individually recalibrate the features at each location (as shown in Fig. 2(b)). In addition, the output of each MGAB has direct access to a bottleneck convolutional layer that conducts global feature fusion, which not only boosts the flow of information, but also avoids information loss as the network depth grows.


Fig. 2: (a) The architecture of the basic building block of our method: the Multi-Grained Attention Block (MGAB). (b) The details of the feature refinement operation using multi-grained attention, where an element-wise product is computed between two feature maps. The recalibrated features at different scales are further fused into a final representation for subsequent operations.

The contributions of this paper can be summarized as follows.

(1) We introduce multi-grained attention to an image SR model and show how simple improvements to attention-based methods advance the state of the art. The proposed multi-grained attention mechanism not only captures the importance of feature channels, but also fully exploits spatial context cues.

(2) We further introduce multi-scale dense connections to fully exploit the features of different layers at multiple scales through dense skip connections. Our model learns richer contextual information and enhances the discriminative representation ability of the features. Ablation studies on benchmark datasets demonstrate the effectiveness of our method. In comparison with other state-of-the-art SR methods, our method shows superiority in terms of both accuracy and model size.

The rest of this paper is organized as follows. In Section II, we introduce the proposed MGAN in detail. Experimental results and analysis are given in Section III. Conclusions are drawn in Section IV.

II Proposed Method

In this section, we will introduce the proposed multi-grained attention networks in detail.

II-A Network Architecture

As shown in Fig. 1, the processing flow of our method mainly consists of two stages: 1) a feature extraction stage and 2) a reconstruction stage. Let $I_{LR}$ and $I_{SR}$ denote the input LR image and the output SR image, respectively. We first apply one convolutional layer to extract initial features from the input LR image. Then, the initial features are fed into a series of multi-grained attention blocks to produce informative feature maps. Finally, a sub-pixel convolution layer [27] followed by a convolution layer is used to reconstruct the HR image. The relationship between the image $I_{LR}$ and the image $I_{SR}$ can be written as follows:

$I_{SR} = H_{MGAN}(I_{LR}; \theta) = H_{REC}\big(H_{FE}(I_{LR}); \theta\big),$   (1)

where $H_{MGAN}(\cdot)$ represents the proposed multi-grained attention network, $H_{FE}(\cdot)$ represents the feature extraction operation, $H_{REC}(\cdot)$ represents the up-sampling and reconstruction operation, and $\theta$ denotes the network parameters to be optimized. To obtain more informative features, we also apply hierarchical feature fusion and global residual learning in the feature extraction pipeline, as shown in Fig. 1.
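For readers who prefer code, the following PyTorch sketch mirrors Eq. (1) and the pipeline in Fig. 1: an initial convolution ($H_{FE}$), a chain of attention blocks whose outputs are concatenated and fused (hierarchical feature fusion with a global residual), and a sub-pixel reconstruction head ($H_{REC}$). The class and argument names are ours, and the layer settings are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class MGANSketch(nn.Module):
    """Minimal sketch of the MGAN pipeline: H_FE (feature extraction), a chain
    of attention blocks with hierarchical feature fusion, and H_REC
    (sub-pixel upsampling followed by reconstruction)."""

    def __init__(self, block, n_blocks=8, n_feats=64, scale=4, n_colors=3):
        super().__init__()
        self.head = nn.Conv2d(n_colors, n_feats, 3, padding=1)      # initial feature extraction
        self.blocks = nn.ModuleList([block(n_feats) for _ in range(n_blocks)])
        # bottleneck that fuses the outputs of all blocks (hierarchical feature fusion)
        self.fusion = nn.Conv2d(n_feats * n_blocks, n_feats, 1)
        self.upsample = nn.Sequential(                               # H_REC: sub-pixel conv + reconstruction
            nn.Conv2d(n_feats, n_feats * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(n_feats, n_colors, 3, padding=1),
        )

    def forward(self, x):
        f0 = self.head(x)
        feats, f = [], f0
        for blk in self.blocks:
            f = blk(f)
            feats.append(f)                                          # collect hierarchical features
        fused = self.fusion(torch.cat(feats, dim=1)) + f0            # global residual learning
        return self.upsample(fused)
```

The `block` argument is a constructor for the attention block described in Section II-B; a matching sketch of that block is given there.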

Our MGAN is optimized under a standard regression paradigm by minimizing the difference between the reconstructed image $I_{SR}$ and the ground truth $I_{HR}$. Given a training dataset with $N$ image pairs $\{I_{LR}^{(i)}, I_{HR}^{(i)}\}_{i=1}^{N}$, we use the $L_1$ loss function [23, 47, 12, 48] to train our model, as it introduces less blurring. The objective function can be written as

$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \big\| H_{MGAN}(I_{LR}^{(i)}; \theta) - I_{HR}^{(i)} \big\|_1.$   (2)
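A hedged sketch of one optimization step under the $L_1$ objective of Eq. (2) is given below; `model`, `optimizer`, and the LR/HR batches are placeholders for whatever data pipeline is used.

```python
import torch.nn.functional as F

def train_step(model, optimizer, lr_batch, hr_batch):
    """One optimization step minimizing the pixel-wise L1 difference
    between the reconstructed image and the ground truth (Eq. 2)."""
    optimizer.zero_grad()
    sr_batch = model(lr_batch)            # I_SR = H_MGAN(I_LR; theta)
    loss = F.l1_loss(sr_batch, hr_batch)  # mean absolute error over all pixels
    loss.backward()
    optimizer.step()
    return loss.item()
```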

In the following subsection, we will give more details of the proposed multi-grained attention block.

II-B Multi-grained Attention Block

The architecture of our multi-grained attention block (MGAB) is shown in Fig. 2. The process flow of each MGAB consists of two stages: 1) feature fusion based on multi-scale dense connections, and 2) feature refinement based on multi-grained attention mechanism.

To further enhance the network's representation ability and the information flow, a local residual connection is introduced in each attention block. Let $F_{n-1}$ and $F_n$ denote the input and output feature representations of the $n$-th MGAB; their relationship can be written as

$F_n = F_{n-1} + H_{MGAB,n}(F_{n-1}),$   (3)

where $H_{MGAB,n}(\cdot)$ represents the forward mapping function of the $n$-th MGAB. In the above operation, we adjust the dimensions of the feature maps to make them equal-sized by passing them through a set of convolutional filters.

II-B1 Multi-scale dense connections

Exploiting the multi-scale semantic information of an image is important for effective image representation [44]. The inception module of GoogLeNet [30] adopts parallel convolutions with different filter sizes to learn multi-scale image representations, leading to state-of-the-art results in object recognition. Inspired by GoogLeNet [30], we introduce multi-scale dense connections into our model, which not only learn richer semantic information but also boost the gradient flow during training.

As illustrated in Fig. 2(a), we introduce two parallel paths with different filter sizes (e.g., 3×3 and 5×5) in each processing unit. The outputs of each unit are concatenated together and then fed to all subsequent layers in a densely connected fashion, as introduced by DenseNet [13]. Before the feature refinement stage, we perform local feature fusion across the different layers to further extract informative features and to make the dimensions of the input and output features the same.
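The sketch below illustrates this stage, assuming the 3×3/5×5 parallel paths of the MSRN building block [22] and DenseNet-style concatenation; the growth rate, the number of units, and the class names are illustrative choices, and the local residual of Eq. (3) is added by the enclosing block rather than here.

```python
import torch
import torch.nn as nn

class MultiScaleDenseUnit(nn.Module):
    """One processing unit: two parallel convolution paths with different
    kernel sizes whose outputs are concatenated along the channel axis."""

    def __init__(self, in_channels, growth=64):
        super().__init__()
        self.path3 = nn.Sequential(nn.Conv2d(in_channels, growth, 3, padding=1), nn.ReLU(inplace=True))
        self.path5 = nn.Sequential(nn.Conv2d(in_channels, growth, 5, padding=2), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.path3(x), self.path5(x)], dim=1)


class MultiScaleDenseBlock(nn.Module):
    """Units wired densely: each unit sees the block input plus all preceding
    unit outputs; a 1x1 local-fusion conv restores the channel count before
    the feature refinement stage."""

    def __init__(self, n_feats=64, n_units=3, growth=64):
        super().__init__()
        self.units = nn.ModuleList()
        channels = n_feats
        for _ in range(n_units):
            self.units.append(MultiScaleDenseUnit(channels, growth))
            channels += 2 * growth                               # each unit appends two paths
        self.local_fusion = nn.Conv2d(channels, n_feats, 1)

    def forward(self, x):
        feats = x
        for unit in self.units:
            feats = torch.cat([feats, unit(feats)], dim=1)       # dense connection
        return self.local_fusion(feats)                          # local feature fusion
```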

II-B2 Squeeze and excitation operation

Most previous methods apply standard convolutions to build image SR networks. As the standard convolution operation only focuses on local regions and cannot capture long-range dependencies across different spatial locations, some recent methods take advantage of the attention mechanism to improve the feature representations of SR models [47, 12].

To introduce the attention mechanism, most previous methods compute channel-wise global statistics via average pooling [47]; the squeeze-and-excitation (SE) network [10] then performs feature re-calibration by computing the importance of each feature channel. More specifically, let $X = [x_1, x_2, \ldots, x_C]$ denote the input features with $C$ feature maps of size $H \times W$. The SE network first conducts an average pooling operation over the global spatial locations of each feature map:

$z_c = H_{GP}(x_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i, j),$   (4)

where $H_{GP}(\cdot)$ is the global pooling function and $x_c(i, j)$ is the pixel value at position $(i, j)$ of the $c$-th feature channel $x_c$. Then, a gating function is introduced to learn channel-wise interdependencies, which can be written as follows:

$s = \sigma\big(W_U\,\delta(W_D\, z)\big),$   (5)

where $\sigma(\cdot)$ and $\delta(\cdot)$ denote the sigmoid and ReLU [8] functions, respectively, $W_D$ and $W_U$ are the weights of two convolutional layers that adjust the number of channels and learn the channel importance, and $s$ is the vector of learned attention weights.

Finally, the SE module [10] re-scales the channel feature representations by re-weighting each feature map with its channel-wise attention weight:

$\hat{x}_c = s_c \cdot x_c.$   (6)
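A compact PyTorch rendering of Eqs. (4)-(6) is shown below; the 1×1 convolutions stand in for $W_D$ and $W_U$, and the reduction ratio follows the value of 16 used later in the implementation details.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation gate: global average pooling (Eq. 4), a two-layer
    gating function (Eq. 5), and channel-wise rescaling (Eq. 6)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                      # z_c: one statistic per channel
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),       # W_D: channel reduction
            nn.ReLU(inplace=True),                               # delta
            nn.Conv2d(channels // reduction, channels, 1),       # W_U: channel expansion
            nn.Sigmoid(),                                        # sigma
        )

    def forward(self, x):
        s = self.gate(self.pool(x))   # attention weights of shape (N, C, 1, 1)
        return x * s                  # Eq. (6): re-scale each feature map
```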

II-B3 Multi-grained attention mechanism

Although the SE network [47, 10] learns channel-wise interdependencies effectively, it ignores the diversity of spatial positions, which is crucial for image representation. This is because such methods exploit all pixels equally when recalibrating the channel features, without considering any location prior. For this purpose, we introduce a multi-grained attention mechanism that divides the feature maps into different spatial regions and allows different locations to have different attention weights.

Specifically, we evenly split the feature map into a set of $S$ spatial regions (e.g., $S$ = 1, 2, or 4), as illustrated in Fig. 2(b). Each region is then individually processed by the subsequent SE module. Finally, the recalibrated features of the regions at different grains are concatenated together to further enhance the discriminative power of the representation. When $S$ is set to 1, the above operation reduces to a standard SE module [10]. In contrast, when $S > 1$, each spatial region is learned with its own attention weights. In particular, when each pixel location of the feature map is considered as a unique region, the above operation reduces to the spatial attention of [12].

The Channel-wise and Spatial Attention Residual network (CSAR) [12] is a recently proposed method similar to ours, in which individual spatial weights are learned for every pixel location. However, that method suffers from a huge number of training parameters, since it needs to learn an individual weight for each pixel location of a feature map. In comparison, our method requires fewer parameters and has lower computational costs. In addition, our method can be considered a unified approach to introducing attention in both the spatial and channel dimensions, of which most methods in the previous literature can be viewed as special cases. With the above designs, our method captures rich contextual information in the spatial and channel dimensions at the same time, improving the discriminative ability of the model.
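As a concrete reading of this subsection, the following sketch splits the feature map into a small grid of regions, applies the `ChannelAttention` gate from the previous sketch to each region, and stitches the results back together; a 1×1 grid recovers plain channel attention, and a per-pixel grid would recover spatial attention. The grid layout, the use of separate gates per region, and the composition into an MGAB with the local residual of Eq. (3) are our assumptions for illustration; the full block additionally fuses recalibrations computed at several grains (Fig. 2(b)), which is omitted here.

```python
import torch
import torch.nn as nn

class MultiGrainedAttention(nn.Module):
    """Region-wise channel attention over a grid of spatial regions."""

    def __init__(self, channels, grid=(2, 2), reduction=16):
        super().__init__()
        self.grid = grid
        # one lightweight SE gate per region (weights could also be shared)
        self.gates = nn.ModuleList(
            [ChannelAttention(channels, reduction) for _ in range(grid[0] * grid[1])]
        )

    def forward(self, x):
        gh, gw = self.grid
        _, _, h, w = x.shape
        rows = []
        for i in range(gh):
            cols = []
            for j in range(gw):
                region = x[:, :, i * h // gh:(i + 1) * h // gh,
                                 j * w // gw:(j + 1) * w // gw]
                cols.append(self.gates[i * gw + j](region))      # recalibrate this region
            rows.append(torch.cat(cols, dim=3))                  # reassemble columns
        return torch.cat(rows, dim=2)                            # reassemble rows


class MGABSketch(nn.Module):
    """Eq. (3): F_n = F_{n-1} + H_MGAB(F_{n-1}), with H_MGAB composed of the
    multi-scale dense fusion stage and the multi-grained refinement stage."""

    def __init__(self, n_feats=64):
        super().__init__()
        self.body = nn.Sequential(
            MultiScaleDenseBlock(n_feats),       # stage 1: multi-scale dense connections
            MultiGrainedAttention(n_feats),      # stage 2: multi-grained feature refinement
        )

    def forward(self, x):
        return x + self.body(x)                  # local residual connection
```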

III Experimental Results and Analysis

III-A Datasets and Evaluation Metrics

In our experiments, we train our model on the DIV2K dataset [33], which contains 800 high-quality images. Our model is then tested on five standard benchmark datasets, Set5 [3], Set14 [41], Bsd100 [1], Urban100 [14], and Manga109 [24], with four upscaling factors and two degradation configurations: 1) bicubic (BI) degradation and 2) blur-downscale (BD) degradation [47, 48, 43]. The Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity index (SSIM) [37] are used to evaluate SR performance on the Y channel (i.e., luminance) of the transformed YCbCr space.
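For reference, the Y-channel PSNR used throughout the tables can be computed as sketched below; the ITU-R BT.601 conversion is standard, but the amount of border cropping (`shave`) varies slightly across papers, so treat it as an assumption rather than the authors' exact evaluation script.

```python
import numpy as np

def rgb_to_y(img):
    """Y (luminance) channel of an RGB image in [0, 255], ITU-R BT.601."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr, shave=4):
    """PSNR between two RGB images, computed on the Y channel with a border
    of `shave` pixels (often the scale factor) cropped away."""
    sr_y = rgb_to_y(sr.astype(np.float64))[shave:-shave, shave:-shave]
    hr_y = rgb_to_y(hr.astype(np.float64))[shave:-shave, shave:-shave]
    mse = np.mean((sr_y - hr_y) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```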

Configuration                                    PSNR (dB)
C1: MSRN [22]                                    32.25
C2: Multi-scale dense connections (baseline)     32.23
C3: C2 + hierarchical feature fusion             32.31
C4: C3 + channel attention (i.e., S=1)           32.37
C5: C3 + spatial attention [12]                  32.39
C6: C3 + multi-grained attention (S=2)           32.42
C7: C3 + multi-grained attention (S=4)           32.45
TABLE I: An analysis of the importance of each technical component in our method. The configurations C1-C7 correspond to different combinations of the components. We report the average PSNR on Set5 [3] (×4 upscaling).

III-B Implementation Details

We implement our method in PyTorch [25] on an NVIDIA Titan Xp GPU. We use data augmentation to increase the diversity of the training images, including random rotation (90°, 180°, 270°) and horizontal flipping. In each training batch, we randomly select 16 LR color patches as inputs. Our model is trained with the Adam optimizer [18]. The learning rate is halved every 200 epochs. In our network, all convolution layers have 64 filters and a 3×3 kernel, except for the 5×5 convolutional layers inside the MGABs and the 1×1 layers used for feature fusion. To keep the size of the feature maps unchanged during convolution, we apply zero-padding at the edges of each feature map. In each MGAB, the numbers of 3×3 and 5×5 convolutional layers are the same, namely 3. The reduction ratio in the multi-grained attention unit is set to 16. For the upscaling layers, we follow the literature [23, 47, 48, 27] and apply ESPCNN [27] to perform the upsampling operation. A final convolution layer with 3 filters reconstructs the HR output.
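Putting the earlier sketches together, the snippet below shows one way to set up the augmentation, the model, and the step schedule described above; the initial learning rate is an illustrative value, not one reported here, and `MGANSketch`/`MGABSketch` are the hypothetical classes from the previous sketches.

```python
import random
import torch

def augment(lr, hr):
    """Apply the same random 90-degree rotation and horizontal flip to an LR/HR patch pair."""
    if random.random() < 0.5:
        lr, hr = torch.flip(lr, dims=[-1]), torch.flip(hr, dims=[-1])
    k = random.randint(0, 3)                               # rotate by 0, 90, 180, or 270 degrees
    return torch.rot90(lr, k, dims=[-2, -1]), torch.rot90(hr, k, dims=[-2, -1])

model = MGANSketch(block=MGABSketch, n_blocks=8, n_feats=64, scale=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # illustrative initial learning rate
# call scheduler.step() once per epoch: the learning rate is halved every 200 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
```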


Fig. 3: A comparison of different SR methods on the Urban100 dataset [14] with bicubic degradation. The best result is marked in bold.

III-C Ablation Analysis

To further evaluate the effectiveness of the proposed method, ablation studies are conducted to analyze the importance of each of its components, including the “multi-scale dense connections”, “hierarchical feature fusion”, “channel attention”, “spatial attention”, and the “multi-grained attention”. For a fair comparison, we configure our MGAN and its variants with the same number of MGABs (8) and convolution filters (64). All evaluations of the ablation analysis are performed under the same set of configurations. The baseline method (C2 in Table I) is evaluated first, and then the other technical components are gradually added:

  • Multi-scale dense connections: This corresponds to the multi-scale feature fusion and dense connection in each of our multi-grained attention block. Please refer to Fig. 2 (a) for more details.

  • Hierarchical feature fusion: This corresponds to the implementations in the method MSRN [22].

  • Channel attention: This corresponds to the “channel attention” in the method CSAR [12]. It can be also considered as a special case of our method when the grid size is set to 1. Please refer to Fig. 2 (b) for more details.

  • Spatial attention: This corresponds to the “spatial attention” in the method CSAR [12]. It can be also considered as a special case of our method when each grid is set to every pixel in the feature map. Please refer to Fig. 2 (b) for more details.

  • Multi-grained attention (S = 2, 4): This corresponds to the proposed method with different grid sizes.

Table I shows the ablation results. Compared with MSRN [22] (C1), the configuration that adds the multi-scale dense connections (C3) obtains better SR results, which validates the effectiveness of the multi-scale dense connections. Meanwhile, we can see that the baseline (C2, without hierarchical feature fusion and multi-grained attention) obtains a relatively low accuracy, where the PSNR only reaches 32.23 dB on Set5 (×4). We then integrate the hierarchical feature fusion and the multi-grained attention into the base block, respectively, and observe a consistent increase in accuracy. This experiment demonstrates the effectiveness of the proposed hierarchical feature fusion and multi-grained attention. It can also be noticed that the network with channel attention (C4) performs better than those without channel attention (C2 and C3), with the PSNR reaching 32.37 dB. In addition, when we compare the multi-grained attention mechanism (C6 and C7) with the channel and spatial attention mechanisms (C4 and C5), we can see that our model always achieves better performance than the networks with those attention mechanisms. These comparisons consistently confirm the superiority of our method. We also notice that adding more scales to the attention module does not guarantee better performance, while it greatly increases the computational overhead. To balance speed and accuracy, we adopt the multi-grained attention with S = 4 (configuration C7) in our final method.

III-D Results with Bicubic Degradation (BI)

We compare our method with several state-of-the-art SR methods, including Bicubic, SRCNN [6], VDSR [16], DRCN [17], LapSRN [19], EDSR [23], and MSRN [22] (MSRN has two different implementations, and we use its latest results published on its GitHub homepage). As suggested by [23, 47, 48], we also adopt the self-ensemble strategy to further improve our MGAN, and accordingly denote the self-ensembled MGAN as MGAN+. Table II shows quantitative evaluations for various upscaling factors.

Compared to these methods, our MGAN+ achieves higher accuracy on most datasets and scale factors. Without self-ensemble, our MGAN still obtains favorable results and outperforms most of the other state-of-the-art methods in terms of both PSNR and SSIM. When the upscaling factor is set to ×4, the average PSNR gains of our MGAN over MSRN [22] are 0.20 dB, 0.11 dB, 0.07 dB, 0.25 dB, and 0.24 dB on the Set5, Set14, Bsd100, Urban100, and Manga109 datasets, respectively. Our method achieves better performance on the Urban100 and Manga109 datasets mainly because they contain richer structures and details, for which the multi-grained attention mechanism in our method shows clear advantages.


Fig. 4: A comparison of different SR methods on the image “YukiNoFuruMachi” from the Manga109 dataset [24] with bicubic degradation. The best result is marked in bold.

Fig. 5: A comparison of different SR methods on “img_043” from the Urban100 dataset [14] with bicubic degradation. The best result is marked in bold.

To further demonstrate the effectiveness of our method, we provide a visual assessment compared to the other state-of-the-art methods, as shown in Fig. 3, Fig. 4, and Fig. 5.

In Fig. 3, we show the visual results of different methods on the Urban100 dataset. Our method achieves the best visual quality among all methods in terms of both structures and high-frequency details. Taking the displayed image as an example, we can see that bicubic interpolation generates the worst results, with severe blurring along the edges and visually displeasing textures. MSRN [22] and the other methods synthesize details reasonably well, but they fail to recover straight edges and produce distorted results. In comparison, our method alleviates the bending artifacts along the edges and reconstructs the high-frequency components of the HR images more faithfully. This is because the multi-grained attention mechanism in our method learns not only high-frequency details but also contextual cues, which helps recover regular shapes and structures in an image. The above results verify the superiority of our MGAN, especially for images with fine structures and details.

III-E Results with Blur-downscale Degradation (BD)

As suggested by [47, 48, 43], we also conduct experiments with blur-downscale (BD) degraded inputs. We compare our MGAN with eight state-of-the-art methods: SPMSR [26], SRCNN [6], FSRCNN [7], VDSR [16], IRCNN [42], SRMD [43], RDN [48], and MSRN [22]. As shown in Table III, our MGAN outperforms the other methods on most datasets at the ×3 upscaling factor. Our enhanced version, MGAN+, further improves the accuracy via self-ensemble. In particular, on Urban100 and Manga109, which are more challenging than the other datasets, the PSNR gains of the proposed MGAN over MSRN reach 0.36 dB and 0.37 dB, respectively.


Fig. 6: A comparison of different SR methods on “img_044” from the Urban100 dataset [14] with blur-downscale degradation. The best result is marked in bold.

In Fig. 6, we show visual results for the ×3 upscaling factor under the BD degradation model. On “img_044”, RDN and MSRN suffer from blurring artifacts. In comparison, our MGAN alleviates the blurring artifacts and recovers more faithful results.

III-F Model Size Comparison

We further compare the number of parameters and the average PSNR of different methods on Set5, as shown in Fig. 7. The results show that our MGAN and MGAN+ achieve a good tradeoff between model size and accuracy. In comparison with the other methods, our MGAN+ obtains the highest accuracy. Note that although MGAN achieves performance comparable to EDSR, as shown in Table II, it has far fewer parameters (11.7 M vs. 43 M).
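Parameter counts such as the 11.7 M reported above can be obtained for any PyTorch model with a helper like the one below (the `model` variable is a placeholder).

```python
def count_parameters(model):
    """Total number of trainable parameters, reported in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# e.g., print(f"{count_parameters(model):.1f} M parameters")
```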


Fig. 7: Average accuracy (PSNR) of different methods on Set5 [3] with bicubic degradation vs. the number of their parameters.
Set5 Set14 Bsd100 Urban100 Manga109
Methods Scale PSNR SSIM PSNR SSIM PSNR SSIM PSNR SSIM PSNR SSIM
Bicubic x2 33.66 0.9299 30.24 0.8688 29.56 0.8431 26.88 0.8403 30.80 0.9339
SRCNN [6] x2 36.66 0.9542 32.45 0.9067 31.36 0.8879 29.50 0.8946 35.60 0.9663
VDSR [16] x2 37.53 0.9590 33.05 0.9130 31.90 0.8960 30.77 0.9140 37.22 0.9750
DRCN [17] x2 37.63 0.9588 33.04 0.9118 31.85 0.8942 30.75 0.9133 37.63 0.9723
LapSRN [19] x2 37.52 0.9591 33.08 0.9130 31.08 0.8950 30.41 0.9101 37.27 0.9740
EDSR [23] x2 38.11 0.9602 33.92 0.9195 32.32 0.9013 32.93 0.9351 39.10 0.9773
MSRN [22] x2 38.07 0.9608 33.68 0.9184 32.22 0.9002 32.32 0.9304 38.64 0.9771
MGAN (ours) x2 38.16 0.9612 33.83 0.9198 32.28 0.9009 32.75 0.9340 39.11 0.9778
MGAN+ (ours) x2 38.21 0.9614 33.91 0.9205 32.33 0.9015 32.95 0.9354 39.31 0.9782
Bicubic x3 30.39 0.8682 27.55 0.7742 27.21 0.7385 24.46 0.7349 26.95 0.8556
SRCNN [6] x3 32.75 0.9090 29.30 0.8215 28.41 0.7863 26.24 0.7989 30.48 0.9117
VDSR [16] x3 33.67 0.9210 29.78 0.8320 28.83 0.7990 27.14 0.8290 32.01 0.9340
DRCN [17] x3 33.82 0.9226 29.76 0.8311 28.80 0.7963 27.15 0.8276 32.31 0.9328
LapSRN [19] x3 33.82 0.9227 29.87 0.8320 28.82 0.7980 27.07 0.8280 32.21 0.9350
EDSR [23] x3 34.65 0.9280 30.52 0.8462 29.25 0.8093 28.80 0.8653 34.17 0.9476
MSRN [22] x3 34.48 0.9276 30.40 0.8436 29.13 0.8061 28.31 0.8560 33.56 0.9451
MGAN (ours) x3 34.65 0.9292 30.51 0.8460 29.22 0.8086 28.61 0.8621 34.00 0.9474
MGAN+ (ours) x3 34.75 0.9299 30.60 0.8474 29.29 0.8098 28.82 0.8651 34.31 0.9490
Bicubic x4 28.42 0.8104 26.00 0.7027 25.96 0.6675 23.14 0.6577 24.89 0.7866
SRCNN [6] x4 30.48 0.8628 27.50 0.7513 26.90 0.7101 24.52 0.7221 27.58 0.8555
VDSR [16] x4 31.35 0.8830 28.02 0.7680 27.29 0.7251 25.18 0.7540 28.83 0.8870
DRCN [17] x4 31.53 0.8854 28.02 0.7670 27.23 0.7233 25.14 0.7510 28.98 0.8816
LapSRN [19] x4 31.54 0.8850 28.19 0.7720 27.32 0.7270 25.21 0.7560 29.09 0.8900
EDSR [23] x4 32.46 0.8968 28.80 0.7876 27.71 0.7420 26.64 0.8033 31.02 0.9148
MSRN [22] x4 32.25 0.8958 28.63 0.7833 27.61 0.7377 26.22 0.7905 30.57 0.9103
MGAN (ours) x4 32.45 0.8980 28.74 0.7852 27.68 0.7400 26.47 0.7981 30.81 0.9131
MGAN+ (ours) x4 32.57 0.8993 28.85 0.7874 27.75 0.7415 26.68 0.8027 31.15 0.9161
Bicubic x8 24.40 0.6580 23.10 0.5660 23.67 0.5480 20.74 0.5160 21.47 0.6500
SRCNN [6] x8 25.33 0.6900 23.76 0.5910 24.13 0.5660 21.29 0.5440 22.46 0.6950
VDSR [16] x8 25.93 0.7240 24.26 0.6140 24.49 0.5830 21.70 0.5710 23.16 0.7250
DRCN [17] x8 25.93 0.6743 24.25 0.5510 24.49 0.5168 21.71 0.5289 23.20 0.6686
LapSRN [19] x8 26.15 0.7380 24.35 0.6200 24.54 0.5860 21.81 0.5810 23.39 0.7350
EDSR [23] x8 26.96 0.7762 24.91 0.6420 24.81 0.5985 22.51 0.6221 24.69 0.7841
MSRN [22] x8 26.95 0.7728 24.87 0.6380 24.77 0.5954 22.35 0.6124 24.40 0.7729
MGAN (ours) x8 26.90 0.7722 25.00 0.6415 24.81 0.5979 22.47 0.6190 24.55 0.7803
MGAN+ (ours) x8 27.09 0.7801 25.16 0.6458 24.91 0.6005 22.69 0.6275 24.87 0.7882
TABLE II: Quantitative results (PSNR/SSIM) of different SR methods with bicubic degradation. The proposed MGAN and its enhanced version MGAN+ achieve the best SR results for most datasets and experimental configurations.
Set5 Set14 Bsd100 Urban100 Manga109
Methods Scale PSNR SSIM PSNR SSIM PSNR SSIM PSNR SSIM PSNR SSIM
Bicubic x3 28.78 0.8308 26.38 0.7271 26.33 0.6918 23.52 0.6862 25.46 0.8149
SPMSR [26] x3 32.21 0.9001 28.89 0.8105 28.13 0.7740 25.84 0.7856 29.64 0.9003
SRCNN [6] x3 32.05 0.8944 28.80 0.8074 28.13 0.7736 25.70 0.7770 29.47 0.8924
FSRCNN [7] x3 26.23 0.8124 24.44 0.7106 24.86 0.6832 22.04 0.6745 23.04 0.7927
VDSR [16] x3 33.25 0.9150 29.46 0.8244 28.57 0.7893 26.61 0.8136 31.06 0.9234
IRCNN [42] x3 33.38 0.9182 29.63 0.8281 28.65 0.7922 26.77 0.8154 31.15 0.9245
SRMD [43] x3 34.01 0.9242 30.11 0.8364 28.98 0.8009 27.50 0.8370 32.97 0.9391
RDN [48] x3 34.58 0.9280 30.53 0.8447 29.23 0.8079 28.46 0.8582 33.97 0.9465
MSRN [22] x3 34.50 0.9271 30.43 0.8427 29.15 0.8060 28.15 0.8513 33.74 0.9447
MGAN (ours) x3 34.63 0.9284 30.54 0.8450 29.24 0.8081 28.51 0.8580 34.11 0.9467
MGAN+ (ours) x3 34.73 0.9290 30.64 0.8462 29.30 0.8091 28.70 0.8610 34.42 0.9483
TABLE III: Quantitative results (PSNR/SSIM) of different SR methods with blur-downscale degradation. The proposed MGAN and its enhanced version MGAN+ achieve the best SR results on all datasets among the compared state-of-the-art methods.

IV Conclusion

We have made a thorough investigation of attention mechanisms in super-resolution models and shown how simple improvements over previous attention-based models improve accuracy. We propose a method called “multi-grained attention networks (MGAN)” that fully exploits the advantages of attention mechanisms in SR tasks. We found that both “channel attention” and “spatial attention” are essential for an SR model, and that multi-grained attention can further boost the accuracy of current state-of-the-art models. Besides, the proposed multi-scale dense connections allow MGAN not only to capture image features at different scales, but also to make full use of the information from different layers. Extensive experiments and ablation studies, with comparisons against a variety of super-resolution models, demonstrate the superiority of the proposed method in terms of both quantitative results and visual quality.

References

  • [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2010) Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (5), pp. 898–916. Cited by: §III-A.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §I.
  • [3] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference, Cited by: Fig. 7, §III-A, TABLE I.
  • [4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua (2017) Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5659–5667. Cited by: §I.
  • [5] X. Cheng, X. Li, J. Yang, and Y. Tai (2018) SESR: single image super resolution with recursive squeeze and excitation networks. In International Conference on Pattern Recognition, pp. 147–152. Cited by: §I, §I.
  • [6] C. Dong, C. C. Loy, K. He, and X. Tang (2015) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307. Cited by: §I, §I, §III-D, §III-E, TABLE II, TABLE III.
  • [7] C. Dong, C. C. Loy, and X. Tang (2016) Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pp. 391–407. Cited by: §I, §III-E, TABLE III.
  • [8] X. Glorot, A. Bordes, and Y. Bengio (2011) Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 315–323. Cited by: §II-B2.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §I.
  • [10] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141. Cited by: §I, §II-B2, §II-B2, §II-B3, §II-B3.
  • [11] Y. Hu, X. Gao, J. Li, Y. Huang, and H. Wang (2018) Single image super-resolution via cascaded multi-scale cross network. arXiv preprint arXiv:1802.08808. Cited by: §I.
  • [12] Y. Hu, J. Li, Y. Huang, and X. Gao (2019) Channel-wise and spatial feature modulation network for single image super-resolution. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §I, §I, §I, §II-A, §II-B2, §II-B3, §II-B3, 3rd item, 4th item, TABLE I.
  • [13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708. Cited by: §I, §I, §II-B1.
  • [14] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206. Cited by: Fig. 3, Fig. 5, Fig. 6, §III-A.
  • [15] J. Jiang, C. Chen, J. Ma, Z. Wang, Z. Wang, and R. Hu (2016) SRLSP: a face image super-resolution algorithm using smooth regression with local structure prior. IEEE Transactions on Multimedia 19 (1), pp. 27–40. Cited by: §I.
  • [16] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654. Cited by: §I, §I, §I, §III-D, §III-E, TABLE II, TABLE III.
  • [17] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1637–1645. Cited by: §I, §I, §III-D, TABLE II.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-B.
  • [19] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 624–632. Cited by: §I, §I, §III-D, TABLE II.
  • [20] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §I.
  • [21] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690. Cited by: §I, §I.
  • [22] J. Li, F. Fang, K. Mei, and G. Zhang (2018) Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision, pp. 517–532. Cited by: Fig. 1, §I, §I, 2nd item, §III-C, §III-D, §III-D, §III-D, §III-E, TABLE I, TABLE II, TABLE III.
  • [23] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 136–144. Cited by: §I, §I, §I, §II-A, §III-B, §III-D, TABLE II.
  • [24] Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa (2017) Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications 76 (20), pp. 21811–21838. Cited by: Fig. 4, §III-A.
  • [25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §III-B.
  • [26] T. Peleg and M. Elad (2014) A statistical prediction model based on sparse representations for single image super-resolution. IEEE Transactions on Image Processing 23 (6), pp. 2569–2582. Cited by: §III-E, TABLE III.
  • [27] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883. Cited by: §I, §II-A, §III-B.
  • [28] W. Shi, J. Caballero, C. Ledig, X. Zhuang, W. Bai, K. Bhatia, A. M. S. M. de Marvao, T. Dawes, D. O’Regan, and D. Rueckert (2013) Cardiac image super-resolution with global correspondence using multi-atlas patchmatch. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 9–16. Cited by: §I.
  • [29] W. Shi, F. Jiang, and D. Zhao (2017) Single image super-resolution with dilated convolution based multi-scale information learning inception module. In IEEE International Conference on Image Processing, pp. 977–981. Cited by: §I.
  • [30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §I, §II-B1.
  • [31] Y. Tai, J. Yang, X. Liu, and C. Xu (2017) Memnet: a persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4539–4547. Cited by: §I.
  • [32] Y. Tai, J. Yang, and X. Liu (2017) Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3147–3155. Cited by: §I, §I.
  • [33] R. Timofte, E. Agustsson, L. Van Gool, M. Yang, and L. Zhang (2017) Ntire 2017 challenge on single image super-resolution: methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 114–125. Cited by: §I, §III-A.
  • [34] T. Tong, G. Li, X. Liu, and Q. Gao (2017) Image super-resolution using dense skip connections. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4799–4807. Cited by: §I.
  • [35] L. Wang, K. Lu, and P. Liu (2014) Compressed sensing of a remote sensing image based on the priors of the reference image. IEEE Geoscience and Remote Sensing Letters 12 (4), pp. 736–740. Cited by: §I.
  • [36] Y. Wang, L. Xie, S. Qiao, Y. Zhang, W. Zhang, and A. L. Yuille (2018) Multi-scale spatially-asymmetric recalibration for image classification. In Proceedings of the European Conference on Computer Vision, pp. 509–525. Cited by: §I.
  • [37] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §III-A.
  • [38] Z. Wojna, A. Gorban, D. Lee, K. Murphy, Q. Yu, Y. Li, and J. Ibarz (2017) Attention-based extraction of structured information from street view imagery. arXiv preprint arXiv:1704.03549. Cited by: §I.
  • [39] H. Wu, J. Zhang, and Z. Wei (2018) High resolution similarity directed adjusted anchored neighborhood regression for single image super-resolution. IEEE Access 6, pp. 25240–25247. Cited by: §I.
  • [40] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057. Cited by: §I.
  • [41] R. Zeyde, M. Elad, and M. Protter (2010) On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pp. 711–730. Cited by: §III-A.
  • [42] K. Zhang, W. Zuo, S. Gu, and L. Zhang (2017) Learning deep cnn denoiser prior for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3929–3938. Cited by: §III-E, TABLE III.
  • [43] K. Zhang, W. Zuo, and L. Zhang (2018) Learning a single convolutional super-resolution network for multiple degradations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3262–3271. Cited by: §III-A, §III-E, TABLE III.
  • [44] K. Zhang, X. Gao, D. Tao, and X. Li (2012) Multi-scale dictionary for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1114–1121. Cited by: §I, §II-B1.
  • [45] K. Zhang, X. Gao, D. Tao, and X. Li (2012) Single image super-resolution with non-local means and steering kernel regression. IEEE Transactions on Image Processing 21 (11), pp. 4544–4556. Cited by: §I.
  • [46] S. Zhang, J. Yang, and B. Schiele (2018) Occluded pedestrian detection through guided attention in cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6995–7003. Cited by: §I.
  • [47] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision, pp. 286–301. Cited by: §I, §I, §I, §II-A, §II-B2, §II-B2, §II-B3, §III-A, §III-B, §III-D, §III-E.
  • [48] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2472–2481. Cited by: Fig. 1, §I, §II-A, §III-A, §III-B, §III-D, §III-E, TABLE III.
  • [49] F. Zhou, W. Yang, and Q. Liao (2012) Interpolation-based image super-resolution using multisurface fitting. IEEE Transactions on Image Processing 21 (7), pp. 3312–3318. Cited by: §I.