Image Super-Resolution Using Attention Based DenseNet with Residual Deconvolution

07/03/2019 ∙ by Zhuangzi Li, et al. ∙ NetEase, Inc

Image super-resolution is a challenging task and has attracted increasing attention in research and industrial communities. In this paper, we propose a novel end-to-end Attention-based DenseNet with Residual Deconvolution, named ADRD. In ADRD, a weighted dense block, in which the current layer receives weighted features from all previous levels, is proposed to adaptively capture valuable features residing in the dense layers. A novel spatial attention module is presented to generate a group of attentive maps for emphasizing informative regions. In addition, we design an innovative strategy to upsample residual information via the deconvolution layer, so that high-frequency details can be accurately upsampled. Extensive experiments conducted on publicly available datasets demonstrate the promising performance of the proposed ADRD against state-of-the-art methods, both quantitatively and qualitatively.




1 Introduction

Image super-resolution aims at recovering high-resolution (HR) images from their low-resolution (LR) versions. By far, it has been widely applied to various intelligent image processing applications, e.g. license plate recognition [Liu et al.2017] and video surveillance [Zou and Yuen2012]. However, image super-resolution is an inherently ill-posed problem, since the mapping from the LR to the HR space can have multiple solutions. To deal with this issue, various promising super-resolution approaches have been proposed over the past years [Kim and Kwon2010, Yang et al.2013, Freedman and Fattal2011, Tai et al.2017, Hui et al.2018].

In image super-resolution, recovering high-frequency information is a key problem: the super-resolved images should be full of edges, textures, and other details. Recently, convolutional neural networks (CNNs) have gradually been applied to image super-resolution, relying on the CNN's strong approximation ability to capture high-frequency information. Dong et al. [Dong et al.2016a] first introduced a CNN architecture for image super-resolution. Later, a series of CNNs [Kim et al.2016a, Kim et al.2016b, Tai et al.2017, Lai et al.2017, Zhang et al.2018] tried to solve the problem by increasing network depth. Shortcut connections [Kim et al.2016b, Tai et al.2017, Lai et al.2017, Zhang et al.2018] demonstrate the power of recovering high-quality images. As a kind of shortcut connection, dense connections are introduced in [Tong et al.2017, Zhu et al.2018] to recover images by extracting additional information from hierarchical features. However, the above methods treat all hierarchical features equally and lack the flexibility to select valuable features. Moreover, spatial features are not well explored, resulting in the loss of high-frequency information during the feedforward pass.

Figure 1: Side-by-side image super-resolution comparisons of bicubic interpolation, the state-of-the-art RDN, our method, and the ground-truth HR image.

Furthermore, high-frequency information can not be well upscaled by the conventional deconvolution as stated in [Dong et al.2016b, Mao et al.2016, Tong et al.2017].

Figure 2: Framework of our attention based DenseNet with Residual Deconvolution (ADRD) for image super-resolution.

To practically tackle the above-mentioned problems, we propose a novel image super-resolution framework based on an attention-based densely connected network (DenseNet) with a residual deconvolution (ADRD). As shown in Figure 1, our method can generate high-quality super-resolved images compared with the state-of-the-art RDN [Zhang et al.2018]. Specifically, a weighted dense block (WDB) is proposed, in which features from preceding layers are weighted into the current layer. In such a way, different hierarchical features can be effectively combined according to their significance. Then, we present a novel spatial attention module that learns a feature residual from the WDB, enhancing the informative details for feature modeling, so that high-frequency regions can be highlighted. Further, an innovative upsampling strategy is devised that allows abundant low-frequency information to be bypassed through interpolation while focusing on accurately upsampling high-frequency information. To summarize, the main contributions of this paper are four-fold:

  • We propose ADRD for image super-resolution and achieve state-of-the-art performance.

  • We propose a weighted dense block to adaptively combine valuable features.

  • We present a spatial attention method to emphasize high-frequency information.

  • We propose an innovative residual deconvolution algorithm for upsampling.

Our anonymized training and testing code, final model, and supplementary experimental results are available at:

2 Our method

The framework of ADRD is shown in Figure 2, and it contains four parts. The LR image is first fed into a convolution layer and a PReLU [He et al.2015] to obtain primary feature maps. Then, the primary feature maps are passed through a cascade of feature-transformation groups.

In each group, a weighted dense block (WDB) obtains deeply diversified representations via weighted dense connections. A bottleneck layer compresses the growing number of feature maps extracted from the WDB. Next, the spatial attention (SA) module receives the compressed features and generates a residual output via attentive maps. The residual output is integrated with the compressed features to obtain enhanced features. To ease training and increase the width of the network, skip connections [Tong et al.2017, Zhu et al.2018] are introduced to concatenate the input feature maps of the WDB with the enhanced features. At the end of the feature transformation, a bottleneck layer compresses the global features.

The transformed features are upsampled by a residual deconvolution approach, which amplifies the feature maps to the HR size. Finally, the reconstruction component, a 3-channel output convolution layer, maps the feature maps back to the RGB channel space, and the prospective HR image is obtained. Our contributions, the weighted dense block, the spatial attention module, and the residual deconvolution strategy, are described in detail in the following sections.

2.1 Weighted dense block

Dense connections can alleviate the vanishing-gradient problem, strengthen feature propagation, and substantially reduce the number of parameters [Huang et al.2017]. Inspired by [Zhu et al.2018], we take advantage of dense connections for capturing diverse information from different hierarchies. In the dense blocks of a densely connected network, dense layers are sequentially stacked and have short paths from all previous dense layers. Consequently, the l-th dense layer receives the feature maps of all preceding layers. Let x_i denote the input feature maps of the i-th dense layer. Then the output of the l-th layer can be formulated as:

x_l = H_l([x_0, x_1, ..., x_{l-1}]),    (1)

where [·] denotes channel-wise concatenation of feature maps, and H_l(·) denotes the composite function, which consists of Rectified Linear Units (ReLUs) and two convolution layers. A group of dense layers is combined into a dense block. However, existing dense-block based methods [Zhu et al.2018, Zhang et al.2018] treat the features of previous levels equally. Consequently, some beneficial features cannot be well represented, and some uninformative features will restrain the final super-resolution performance.

Figure 3: Calculation of WDB in the l-th dense layer. ⊙ denotes element-wise product.

To solve this problem, we propose the WDB. It aims to increase the flexibility of feature combination by adaptively learning a group of weights. As shown in Figure 3, each dense layer assigns a set of weights to the preceding layers. Thus, valuable features will be adequately exploited in the current level, while unimportant features will be suppressed. The WDB output of the l-th layer can be formulated as:

x_l = H_l([w_{l,0} x_0, w_{l,1} x_1, ..., w_{l,l-1} x_{l-1}]),    (2)

where w_{l,i} is the weight of the i-th preceding level's features. From Eq. 1 and Eq. 2, we can see that the dense connection is a special case of the weighted dense connection under the condition w_{l,i} = 1. Notably, the channel number of each dense layer's output is called the growth rate, which is equal in all blocks.
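The weighted concatenation of Eq. 2 can be sketched in a few lines of NumPy. This is a minimal illustration only: the composite function H and the learning of the weights are omitted, and the array shapes are arbitrary.

```python
import numpy as np

def dense_layer(inputs, weights=None):
    """One (weighted) dense layer: scale each preceding feature map by its
    weight and concatenate channel-wise. The composite function H(.) that
    would follow the concatenation is omitted in this sketch."""
    if weights is None:                      # plain dense connection: all w = 1
        weights = [1.0] * len(inputs)
    scaled = [w * x for w, x in zip(weights, inputs)]
    return np.concatenate(scaled, axis=0)    # channel-wise concatenation

# two 1-channel 2x2 feature maps from preceding layers (illustrative values)
x0 = np.ones((1, 2, 2))
x1 = 2 * np.ones((1, 2, 2))

plain    = dense_layer([x0, x1])                   # DenseNet special case (Eq. 1)
weighted = dense_layer([x0, x1], weights=[0.5, 2.0])  # WDB combination (Eq. 2)
```

Setting all weights to 1 recovers the plain dense connection, which is exactly the special-case relationship between Eq. 1 and Eq. 2.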

2.2 Spatial attention

The spatial attention module aims to enhance the high-frequency information by learning a group of attentive maps. The attentive maps give large weights to informative regions. The flowchart of spatial attention is shown in Figure 4.

Figure 4: Flowchart of spatial attention: (a) Residual features generation. (b) Attentive maps generation. (c) Enhanced feature maps generation.

In detail, the spatial attention module includes three stages: (a) residual features generation, (b) attentive maps generation, and (c) enhanced feature maps generation. In step (a), the information residual between the head layer of the WDB (denoted as F_h) and the features compressed by the bottleneck layer (denoted as F_c) is computed. The bottleneck here is composed of a convolutional layer and a ReLU function, which guarantees that F_c has the same channel number as F_h. The residual feature maps R can be obtained as:

R = F_h − F_c.    (3)

In step (b), the residual feature maps are then fed into an attention function f_att(·), which contains several convolutional layers. Thus, the attentive maps A are generated and formulated as:

A = tanh(f_att(R)),    (4)

where tanh represents the hyperbolic tangent function, which has larger gradients than the Sigmoid near 0. In step (c), A and R are combined to generate the residual attentive features F_r:

F_r = A ⊙ R,    (5)

where ⊙ is the Hadamard product. Based on the residual attentive feature maps F_r and the compressed features F_c, the enhanced feature maps F_e are then generated by:

F_e = F_c + λ F_r,    (6)

where λ is a hyper-parameter that keeps an attention level. Our attention method can extract the content information of features and learn to generate attentive maps. The super-resolved images tend to be clearer and sharper, because F_e contains more high-frequency information.
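The three SA stages above can be sketched in NumPy. For brevity the attention function (a stack of convolutions in the paper) is replaced here by the identity, so only the residual, tanh, Hadamard, and blending steps are shown; all values are illustrative.

```python
import numpy as np

def spatial_attention(f_head, f_comp, lam=0.1):
    """Sketch of the SA module. f_head: head-layer features of the WDB,
    f_comp: bottleneck-compressed features, lam: attention-level
    hyper-parameter. The attention function f_att is stood in for by
    the identity in this sketch."""
    r = f_head - f_comp        # (a) residual features
    a = np.tanh(r)             # (b) attentive maps in (-1, 1)
    f_res = a * r              # (c) Hadamard product -> residual attentive features
    return f_comp + lam * f_res  # enhanced feature maps

f_head = np.array([[1.0, 3.0]])
f_comp = np.array([[1.0, 1.0]])
out = spatial_attention(f_head, f_comp, lam=1.0)
```

Note that where the residual is zero (no high-frequency difference), the output simply passes the compressed features through unchanged.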

2.3 Residual deconvolution

Deconvolution is a popular conventional upsampling method in image super-resolution [Dong et al.2016b, Mao et al.2016, Tong et al.2017]. However, these methods treat high-frequency and low-frequency information equally, so high-frequency details are hard to fully exploit during upscaling. Moreover, according to our experiments, we find that deconvolution easily destabilizes the training process. To solve these issues, we separately upscale high-frequency and low-frequency information with a pyramid structure for upsampling.

Figure 5: Structure of the residual deconvolution strategy; the red parts are trainable. ⊕ denotes element-wise addition.

As shown in Figure 5, the structure consists of two blocks. Each block contains a deconvolution layer, a PReLU, and a convolution layer. We use a "nearest" interpolation function I(·) and the convolution to upsample low-frequency information, which can be formulated as:

F_low = w * I(x),    (7)

where "*" denotes the convolutional operation, w is the convolution kernel, and x is the input feature map. In addition, the deconvolution layer and the PReLU upsample the high-frequency part of the feature map in each block:

F_high = PReLU(D(x)),    (8)

where D(·) denotes the deconvolution layer's operation. We perform element-wise addition of F_low and F_high to obtain the final upsampled output of each building block. Notably, the input and output channel numbers should be equal, and the two deconvolution layers have different weights.
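A toy version of one RD block can be written as follows. The nearest-neighbor upsampler stands in for I(·), the convolution on the low-frequency path is omitted, and the trainable deconvolution is replaced by a hypothetical stand-in function, so this only illustrates the two-path add, not the learned behavior.

```python
import numpy as np

def nearest_upsample(x, scale=2):
    """'Nearest' interpolation used for the low-frequency bypass."""
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

def residual_deconv_block(x, deconv):
    """One RD block sketch: low-frequency path is nearest interpolation,
    high-frequency path is a user-supplied deconvolution-like upsampler;
    the two paths are added element-wise."""
    low = nearest_upsample(x)   # low-frequency information, bypassed
    high = deconv(x)            # high-frequency residual (trainable in the paper)
    return low + high

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
# stand-in "deconvolution": upsample and contribute a small residual
fake_deconv = lambda t: 0.1 * nearest_upsample(t)
y = residual_deconv_block(x, fake_deconv)
```

The deconvolution branch only has to model the residual on top of the interpolated base, which is the intuition behind the reported training stability.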

3 Experiment

3.1 Data and evaluation metrics

We follow [Haris et al.2018] to train our network using the high-quality (2K resolution) DIV2K dataset [Timofte et al.2017] and the ImageNet dataset [Deng et al.2009]. Data augmentation is adopted with random flips and rotations. To evaluate our method, four benchmark datasets are adopted: Set5 [Bevilacqua and et al.2012], Set14 [Zeyde and et al.2010], BSD100 [R. and et al.2001], and Urban100 [Huang et al.2015]. Set5 and Set14 contain 5 and 14 different types of images, respectively. BSD100 includes 100 natural images, and Urban100 contains 100 images of urban scenarios. All experiments are performed using a fixed up-scaling factor from low resolution to high resolution. The peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index are the two criterion metrics for evaluation. PSNR and SSIM are calculated on the Y channel of images.

3.2 Ablation investigation

We build a lightweight ADRD architecture to evaluate each proposed module. It contains 4 dense blocks with different numbers of dense layers each. Experiments adopt cropped patches for training, and the other settings are the same as in Section 3.5.

WDB evaluation.

We investigate WDB under different growth rates. To verify the effectiveness of WDB, the experiment compares it with the dense block (DB), whose weights are fixed and equal to 1. As shown in Table 1, by adopting a group of trainable weights, WDB consistently achieves higher scores than DB under different growth rates. The PSNR promotion becomes more apparent as the growth rate increases.

Index: DB vs. WDB at three growth rates
Table 1: Investigations of WDB with different growth rates on Set5.
Figure 6: Weight matrices of different blocks; the first five dense layers are selected for exhibition.

An example of the weight matrices of WDB is shown in Figure 6, which shows the weights of the first five layers. The red part in each dense block is the maximum weight and the yellow one is the minimum weight. For three of the four dense blocks, the minimum value exists in the head layer while the biggest value comes from the nearest layer. As for the remaining block, the maximum and minimum values both appear in the nearest layer. It reveals that the weights of the nearest features are more sensitive and important than those of the preceding levels. In conclusion, WDB can adaptively learn meaningful weights from the training data.

Spatial attention evaluation.

We adopt growth rates of 16, 20, and 32 to verify the effectiveness of SA. "noSA" denotes a network without SA. In addition to PSNR and SSIM, we introduce the relative content increasing rate (RCIR) to verify the ability of the SA module to enhance high-frequency features. According to [Ledig et al.2017], a pre-trained VGG network can be used to optimize a content loss so that super-resolved images contain more high-frequency information. We use this property to calculate RCIR. First, we calculate the mean absolute error (MAE) between the HR and interpolated images in the content (VGG feature) space:

e_int = MAE(φ_j(I_HR), φ_j(I_int)),    (9)

where φ_j is the j-th layer's output of VGG16, I_HR is the high-resolution image, and I_int is the interpolated image. Then, the MAE between the HR and super-resolved images is calculated:

e_SR = MAE(φ_j(I_HR), φ_j(I_SR)),    (10)

where I_SR is a super-resolved image. We assume that e_int is larger than e_SR, and that e_int is not equal to zero. At last, the RCIR can be calculated as:

RCIR = (e_int − e_SR) / e_int.    (11)

A model achieves a high RCIR when it has a relatively low error between HR and SR.
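The RCIR definition implied above (high RCIR when the SR error is low relative to the interpolation error) can be written directly; the MAE values used here are hypothetical stand-ins for errors measured in VGG feature space.

```python
def rcir(e_int, e_sr):
    """Relative content increasing rate: the fraction of the interpolation
    baseline's content error that the super-resolved image removes.
    Assumes e_int > e_sr >= 0 and e_int != 0, as stated in the text."""
    return (e_int - e_sr) / e_int

# hypothetical MAE values in VGG feature space
print(rcir(0.8, 0.6))   # 0.25 -> SR removes 25% of the content error
```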

Index noSA-16 SA-16 noSA-20 SA-20 noSA-32 SA-32
Table 2: Evaluation of SA with different growth rates on Set14 with up-scaling factor.

As shown in Table 2, SA improves PSNR at every growth rate and increases RCIR by a large margin, so high-frequency details of an image tend to be recovered more clearly. Besides, comparing an SA model with a noSA model of a larger growth rate, they have almost the same number of parameters, but SA still achieves higher RCIR and SSIM. Notably, increasing the growth rate can achieve better performance; however, it constructs a wide network and brings a severalfold computational load, so utilizing SA modules is an effective way to boost image SR performance without too much computational cost.

Residual deconvolution evaluation.

The residual deconvolution (RD) strategy bypasses low-frequency information and focuses the deconvolution on high-frequency information. Here, we take the WDB with a 16 growth rate and utilize the SA module to exhibit the training curves of plain deconvolution (denoted D) and RD, as shown in Figure 7.

Figure 7: Curve convergence of PSNR and SSIM on Set5.
Set5: PSNR 32.47, SSIM 0.8999
Set14: PSNR 28.84, SSIM 0.7923
BSD100: PSNR 27.72, SSIM 0.7477
Urban100: PSNR 24.52, SSIM 0.8041
Table 3: Comparisons with the state-of-the-art methods by PSNR and SSIM. Scores in bold denote the highest values (some entries are computed by dividing the input into four parts due to the computation limitation on large-size images).

The suffix C denotes the number of feature channels. Compared with plain deconvolution, RD not only makes the network achieve better results, but also stabilizes the training process, because it reduces the influence of low-frequency information. Though the 64-channel RD has fewer channels, it acquires performance comparable to the wider plain deconvolution, showing the superiority of the proposed strategy.

3.3 Comparisons with the state-of-the-arts

We compare ADRD with state-of-the-art methods, as shown in Table 3. Here, bicubic interpolation is viewed as the baseline for comparisons. A+ [Timofte et al.2013] is introduced as a conventional machine learning approach. Several CNN-based methods, i.e. SRCNN [Dong et al.2016a], VDSR [Kim et al.2016a], LapSRN [Lai et al.2017], and D-DBPN [Haris et al.2018], are included. SRDenseNet [Tong et al.2017] (denoted SRDense), SR-DDNet [Zhu et al.2018], and RDN [Zhang et al.2018], three dense-block based networks of different sizes, are also cited in the comparison. ADRD achieves the highest SSIM among all methods, so it tends to have better quality in human perception [Wang et al.2004], because ADRD is adept at recovering high-frequency information. Additionally, ADRD also outperforms D-DBPN in PSNR on the Urban100 dataset, which contains many large-size real-world images.

Figure 8: Parameters and PSNR comparison on Set14.

For a comprehensive view, we visualize the parameters-versus-PSNR comparison on the Set14 dataset. As shown in Figure 8, ADRD has fewer than half the parameters of RDN (about 9,700K fewer), but still shows a slight improvement. ADRD also outperforms the dense-block based networks SRDenseNet and SR-DDNet, demonstrating the superiority of our method. Visual comparisons are shown in Figure 10: in the first group, ADRD clearly recovers the letter "W" while the other methods exhibit breakage; the second group shows the strong recovery capability of ADRD on textures, which are close to the HR image.

3.4 Robustness comparison

Robustness is also essential for image super-resolution. We evaluate our method under different Gaussian noise levels, using four noise variances. Bicubic interpolation is viewed as the baseline. Three state-of-the-art networks, D-DBPN [Haris et al.2018], RDN [Zhang et al.2018], and LapSRN [Lai et al.2017], are introduced for comparison. The detailed results are shown in Table 4.

  Level Bicubic LapSRN  RDN D-DBPN ADRD
Table 4: PSNR results of different noise levels on Set5.

ADRD outperforms all other methods at each noise level. Though RDN is also a dense-block based network, it is easily attacked by noise. Although D-DBPN surpasses ADRD on Set5 in PSNR as shown in Table 3, it is lower than ours under noisy conditions. Visual comparisons under noise are shown in Figure 9. ADRD suffers less damage in local details, mainly because the attention mechanism can reduce the weights of noisy features via the attentive maps. Therefore, ADRD is not only an effective model but also a robust one, showing superior anti-noise capability.
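The noise corruption used in this robustness comparison can be reproduced with a short sketch; the variance value, image size, and random seed here are illustrative, not the paper's settings.

```python
import numpy as np

def add_gaussian_noise(img, variance, rng=None):
    """Corrupt an image (values in [0, 1]) with zero-mean Gaussian noise
    of the given variance, clipping back to the valid range."""
    if rng is None:
        rng = np.random.default_rng(0)  # fixed seed for reproducibility
    noisy = img + rng.normal(0.0, np.sqrt(variance), img.shape)
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((16, 16), 0.5)
noisy = add_gaussian_noise(clean, variance=0.01)
```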

Figure 9: Visual comparison on Set5 under noise.
Figure 10: Visual comparisons under the evaluation up-scaling factor. From top to bottom: "ppt3" from Set14 and "img_093" from BSD100.

3.5 Implementation details

Network setting.

The final ADRD is trained specifically for a single up-scaling factor. The primary convolution is composed of a convolutional layer and a ReLU. The proposed ADRD model contains several WDBs with increasing numbers of dense layers. It utilizes 32-channel primary features; the growth rate of the WDBs, the λ of SA, the channel number of the global bottleneck layer, and the convolutional filter sizes with matching padding are fixed as hyper-parameters. Notably, there is no batch normalization in ADRD, because batch normalization removes the range flexibility of the features [Haris et al.2018].

Training details.

We randomly crop a set of patches from the HR images for training, together with the correspondingly sized LR patches. The training batch size is fixed for each back-propagation step. All weights of the weighted dense connections are initialized to a fixed value. The network is trained via a pixel-wise mean square error (MSE) loss between super-resolved images and ground-truth HR images. Adam [Kingma and Ba2014] is adopted for optimizing ADRD; after a fixed number of epochs, the learning rate decreases by a constant scale. After the initial training, we randomly select 50,000 images from ImageNet to fine-tune our network. Experiments are performed on two NVIDIA Titan Xp GPUs for training and testing.
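The step-decay learning-rate schedule described above can be sketched as follows; the initial learning rate, step length, and decay factor are placeholders for the paper's elided values.

```python
def lr_schedule(initial_lr, epoch, step, gamma):
    """Step decay: every `step` epochs the learning rate is multiplied
    by `gamma`. All three hyper-parameters stand in for the paper's
    elided settings."""
    return initial_lr * (gamma ** (epoch // step))

# e.g. initial lr 1e-4, halved every 100 epochs (illustrative numbers)
print(lr_schedule(1e-4, 0, 100, 0.5))     # 0.0001
print(lr_schedule(1e-4, 250, 100, 0.5))   # 2.5e-05
```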

3.6 Application for recognition

ADRD is also beneficial for low-resolution image recognition. Here, we conduct the experiment on the real-world Paris & Oxford datasets [Philbin et al.2007, Philbin et al.2008]. A VGG16 is trained on the dataset. Then, we adopt different models to super-resolve the LR testing images. The super-resolved testing images are fed into the VGG network to test recognition accuracy.

As shown in Table 5, ADRD promotes Top-1 accuracy more than RDN does. The results demonstrate that ADRD is good at real-world image super-resolution. As shown in Figure 11, the super-resolved images have clear textures and conform to human perception.

Acc (%) Bicubic LapSRN RDN D-DBPN ADRD
Top-1 55.7
Top-5 84.2
Table 5: Recognition accuracy on Paris & Oxford.
Figure 11: Visual results of different super-resolution approaches.

4 Conclusion

We propose a novel attention-based DenseNet with residual deconvolution for image super-resolution. In our framework, a weighted dense block is proposed to weight the features from all preceding layers into the current layer, so as to adaptively combine informative features. A spatial attention module is presented to emphasize high-frequency information after each WDB. Besides, we present a residual deconvolution strategy that focuses on high-frequency upsampling. Experimental results on benchmark datasets demonstrate that ADRD achieves state-of-the-art performance. Our future work will concentrate on more lightweight model design and on applying the model to low-resolution retrieval and recognition.


  • [Bevilacqua and et al.2012] Marco Bevilacqua and Aline Roumy et al. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, 2012.
  • [Deng et al.2009] Jia Deng, Wei Dong, and Richard Socher et al. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [Dong et al.2016a] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., 38(2):295–307, 2016.
  • [Dong et al.2016b] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In ECCV, 2016.
  • [Freedman and Fattal2011] Gilad Freedman and Raanan Fattal. Image and video upscaling from local self-examples. ACM Trans. Graph., 30(2):12:1–12:11, 2011.
  • [Haris et al.2018] Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Deep back-projection networks for super-resolution. In CVPR, 2018.
  • [He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
  • [Huang et al.2015] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, 2015.
  • [Huang et al.2017] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [Hui et al.2018] Zheng Hui, Xiumei Wang, and Xinbo Gao. Fast and accurate single image super-resolution via information distillation network. In CVPR, June 2018.
  • [Kim and Kwon2010] Kwang In Kim and Younghee Kwon. Single-image super-resolution using sparse regression and natural image prior. IEEE Trans. Pattern Anal. Mach. Intell., 32(6):1127–1133, 2010.
  • [Kim et al.2016a] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
  • [Kim et al.2016b] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In CVPR, 2016.
  • [Kingma and Ba2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • [Lai et al.2017] Wei-Sheng Lai, Jia-Bin Huang, and Narendra Ahuja et al. Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, 2017.
  • [Ledig et al.2017] Christian Ledig, Lucas Theis, and Ferenc Huszar et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • [Liu et al.2017] Wu Liu, Xinchen Liu, Huadong Ma, and Peng Cheng. Beyond human-level license plate super-resolution with progressive vehicle search and domain priori GAN. In ACM MM, 2017.
  • [Mao et al.2016] Xiao-Jiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using convolutional auto-encoders with symmetric skip connections. CoRR, abs/1606.08921, 2016.
  • [Philbin et al.2007] James Philbin, Ondrej Chum, and Michael Isard et al. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
  • [Philbin et al.2008] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
  • [R. and et al.2001] David R. and Martin et al. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
  • [Tai et al.2017] Ying Tai, Jian Yang, and Xiaoming Liu. Image super-resolution via deep recursive residual network. In CVPR, 2017.
  • [Timofte et al.2013] Radu Timofte, Vincent De Smet, and Luc J. Van Gool. Anchored neighborhood regression for fast example-based super-resolution. In ICCV, 2013.
  • [Timofte et al.2017] Radu Timofte, Eirikur Agustsson, and Luc Van Gool et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In CVPR Workshops, 2017.
  • [Tong et al.2017] Tong Tong, Gen Li, Xiejie Liu, and Qinquan Gao. Image super-resolution using dense skip connections. In ICCV, 2017.
  • [Wang et al.2004] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Processing, 13(4):600–612, 2004.
  • [Yang et al.2013] Jianchao Yang, Zhe Lin, and Scott Cohen. Fast image super-resolution based on in-place example regression. In CVPR, 2013.
  • [Zeyde and et al.2010] Roman Zeyde and Michael Elad et al. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, 2010.
  • [Zhang et al.2018] Yulun Zhang, Yapeng Tian, and Yu Kong et al. Residual dense network for image super-resolution. In CVPR, 2018.
  • [Zhu et al.2018] Xiaobin Zhu, Zhuangzi Li, and Xiaoyu Zhang et al. Generative adversarial image super-resolution through deep dense skip connections. Comput. Graph. Forum, 37(7):289–300, 2018.
  • [Zou and Yuen2012] Wilman W. W. Zou and Pong C. Yuen. Very low resolution face recognition problem. IEEE Trans. Image Processing, 21(1):327–340, 2012.