Image Super-Resolution via Attention based Back Projection Networks

10/10/2019 ∙ by Zhi-Song Liu, et al. ∙ Hong Kong Polytechnic University

Deep learning based image Super-Resolution (SR) has developed rapidly thanks to its ability to learn from large datasets. Generally, deeper and wider networks can extract richer feature maps and generate SR images of remarkable quality. However, the more complex the network, the more computation time practical applications require, so a simplified network is important for efficient image SR. In this paper, we propose an Attention based Back Projection Network (ABPN) for image super-resolution. Similar to some recent works, we believe that the back projection mechanism can be further developed for SR. Enhanced back projection blocks are suggested to iteratively update low- and high-resolution feature residues. Inspired by recent studies on attention models, we propose a Spatial Attention Block (SAB) to learn the cross-correlation across features at different layers. Based on the assumption that a good SR image should be close to the original LR image after down-sampling, we propose a Refined Back Projection Block (RBPB) for final reconstruction. Extensive experiments on several public datasets and the AIM2019 Image Super-Resolution Challenge datasets show that the proposed ABPN provides state-of-the-art or even better performance in both quantitative and qualitative measurements.




1 Introduction

Figure 1: SR results on image HinagikuKenzan with SR factor 16×. We applied the 4× SR model twice.

As a fundamental low-level vision problem, image super-resolution (SR) has attracted much attention in the past few years. The objective of image SR is to super-resolve low-resolution (LR) images to the same dimensions as the corresponding high-resolution (HR) images with pleasing visual quality. For $\alpha\times$ image SR, we need to approximate $\alpha^2$ times as many pixels for up-sampling. Thanks to architectural innovations and computation advances, it is possible to utilize larger datasets and more complex models for image SR. Various deep learning based approaches with different network architectures have achieved image SR with good quality. Most SR works are based on the residual mapping modified from ResNet [12]. In order to deliver good super-resolution quality, we need to build a very deep network whose receptive field covers as much of the image as possible, so that it can learn different levels of feature abstraction. The advent of 4K/8K UHD (Ultra High Definition) displays demands more accurate image SR with less computation at different up-sampling factors. It is essential to have a deep neural network with the ability to capture long-term dependencies to efficiently learn the reconstruction mapping for SR. Attention or non-local modeling is one of the choices to globally capture the feature response across the whole image. Many related works [31, 7, 26, 27, 15, 5] have been successfully proposed for computer vision. There are several advantages of using attention operations: 1) it can directly compute the correlation between patterns across the image regardless of their distances; 2) it can efficiently reduce the number of kernels and the depth of the network while achieving comparable or even better performance; and 3) it is easy to embed into any structure. As shown in Figure 1, we tested the state-of-the-art SR approaches on 16× enlargement by applying the pre-trained 4× SR models twice. ESRGAN [28] and RCAN [31] tend to generate fake edges which do not exist in the HR images while the proposed ABPN can still predict correct patterns.

Inspired by Non-local neural networks [27] and Back Projection based image SR [20], we propose an Attention based Back Projection Network (ABPN) for efficient image SR. Our method focuses on studying the global feature correlation to make full use of the non-local mean operation. Specifically, instead of using plain concatenation or addition operations, we propose the Spatial Attention Block (SAB) to compute the auto- and cross-correlation of the feature maps extracted at different levels. That is, we use the proposed SAB to measure the similarity between two feature maps to obtain the global correlation maps. By further investigating the SR methods, we find that a back projection based network is a better choice for the backbone of feature extraction because it can iteratively up- and down-sample the feature maps to update the residues of LR and HR features. To take a step forward, we propose a Refined Back Projection Block (RBPB) as the final stage to directly minimize the residues between the original LR images and the down-sampled predicted SR images.

We summarize our contributions as follows: (1) By making use of the proposed Spatial Attention Block, we modified the back projection network into the Attention based Back Projection Network (ABPN) for efficient single image super-resolution. (2) We propose a Refined Back Projection Block (RBPB) to replace the common post back projection process in image SR. (3) We tested our proposed SR method on various datasets and real images. Extensive experiments show that ABPN can achieve state-of-the-art or even better SR performance both quantitatively and qualitatively.

Figure 2: Proposed ABPN structure. It can iteratively up- and down-sample the feature maps to update feature residues.

2 Related Work

Non-local Image Processing. Non-local mean is a conventional algorithm for image processing. The idea is that it searches not only the local areas but also the non-local areas for repeated patterns. It allows distant pixels or patches to contribute to the filtered region. The idea is generalized as a non-local convolution operation which maps the neighborhood region to the whole region of images or videos. It is commonly used in image denoising  [6], inpainting  [2] and super-resolution  [10].
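To make the non-local idea concrete, the following is a minimal sketch of non-local means on a 1D signal, written with the Python standard library only. The function name `non_local_means_1d`, the patch radius and the bandwidth `h` are illustrative assumptions and are not taken from the cited works. Every sample is replaced by a weighted average of all samples, where the weight depends on patch similarity rather than on distance.

```python
import math


def non_local_means_1d(signal, patch_radius=1, h=1.0):
    """Filter a 1D signal with a basic non-local means scheme.

    Each output sample is a weighted mean of ALL input samples; the weight
    of sample j for sample i decays with the squared distance between the
    patches centred at i and j (hypothetical Gaussian weighting).
    """
    n = len(signal)

    def patch(i):
        # Clamp indices at the borders so every patch has the same length.
        return [signal[min(max(i + d, 0), n - 1)]
                for d in range(-patch_radius, patch_radius + 1)]

    patches = [patch(i) for i in range(n)]
    output = []
    for i in range(n):
        weights = []
        for j in range(n):
            dist = sum((a - b) ** 2 for a, b in zip(patches[i], patches[j]))
            weights.append(math.exp(-dist / (h * h)))
        total = sum(weights)
        output.append(sum(w * v for w, v in zip(weights, signal)) / total)
    return output


if __name__ == "__main__":
    noisy = [0.0, 1.1, 0.1, 0.9, 0.0, 1.0, -0.1, 1.0]
    print(non_local_means_1d(noisy, patch_radius=1, h=0.5))
```

Because the weights are positive and normalized, each output value is a convex combination of the input samples, so distant but similar patches contribute to the result without pushing it outside the input range.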

Nowadays, non-local processing is also explicitly or implicitly embedded into deep neural networks to capture long-term dependencies. In most deep learning algorithms, stacking more and more convolution operations with small kernels (e.g. 3×3) can cover a larger receptive field for global modeling. This repeated local operation has the limitations of 1) inefficient computation for practical applications, 2) difficulty in optimizing networks and 3) a feedforward operation without feedback. Recurrent Neural Networks (RNN) [29] are the dominant approaches for sequential data, forming a closed loop to progressively process the data. However, an RNN still works on a local neighborhood and its performance is not optimal. Recently, there is a trend of using self-attention [26] or non-local neural networks [27] for modeling sequential data in language and images. Note that in this paper, we use the term “attention” to describe the non-local modeling process in deep feature extraction.

There are several notable works that make use of the attention mechanism in computer vision and related fields. [26] first proposed self-attention for machine translation. The idea is to represent each word as a weighted combination of all positions in the sequence. That is, the model looks at both preceding and following words to ensure the consistency of the translation. Similar self-attention based works were proposed in various computing fields. For example, [27] proposed the non-local neural network to investigate spatial attention for video classification. [15] proposed an efficient attention computation mechanism called Criss-Cross Network for semantic segmentation. [5] used the idea of the bilateral filter to learn a robust weighting model for object recognition. Besides, “attention” has also been proposed for image super-resolution and has shown great potential. For example, inspired by the squeeze and excitation network [13], [31] proposed to model the channel correlation with a residual channel attention network. [7] further modified the idea of channel attention to second-order attention enhancement. However, these approaches still do not fully explore the non-local property in the spatial domain. Hence, there is great potential for further study.

Super-Resolution Deep Neural Networks. In the past few years, deep neural networks have shown remarkable ability on image SR. Since the pioneering work [8], CNNs have outperformed conventional learning approaches significantly. Their capability to resolve complex nonlinear mapping models and to learn from huge datasets encourages researchers to design deeper networks for better performance. Most of the state-of-the-art SR approaches adopt the residual architecture, like SRGAN [18], EDSR [19], DenseSR [32] and ESRGAN [28]. There are also some SR approaches that have different architectures for reconstruction. For example, [25] proposed the PixelCNN for image reconstruction. [22] proposed to use a recursive neural network to iteratively predict the SR image. [11, 20] proposed to embed back projection into super-resolution to update the LR and HR feature residues. This can be considered as a generalized residual model.

Recently, using generative adversarial networks (GAN) for perceptual image SR has attracted a lot of attention. The idea is to add one discriminator as an indicator for SR estimation. The backbones of the generator and discriminator are more or less the same as in the aforementioned SR algorithms. A better architecture can further improve the perceptual quality. Once the training is finished, we only need the generator for testing. It is important to keep the model complexity of the generator as small as possible for real-time applications. In this paper, we have not investigated our proposed SR method on perceptual quality, but it can be modified to serve as an efficient generator.

3 Method

3.1 Problem Formulation

Let us formally define image SR. Mathematically, we are given an LR image $X \in \mathbb{R}^{m/\alpha \times n/\alpha}$ down-sampled from the corresponding HR image $Y \in \mathbb{R}^{m \times n}$, where $(m, n)$ is the dimension of the image and $\alpha$ is the up-sampling factor. They are related by the following degradation model,

$$X = DY + \mu \tag{1}$$

where $\mu$ is the additive white Gaussian noise and $D$ is the down-sampling operator. The goal of image SR is to resolve Equation 1 as a Maximum A Posteriori (MAP) problem as follows,

$$\hat{Y} = \arg\max_{Y} \; \log P(X \mid Y) + \log P(Y) \tag{2}$$

where $\hat{Y}$ is the predicted SR image. $\log P(X \mid Y)$ represents the log-likelihood of LR images given HR images and $\log P(Y)$ is the prior of HR images that is used for model optimization. Formally, we resolve the image SR problem as follows,

$$\min_{Y} \; \|X - DY\|^{p} + \Omega(Y) \tag{3}$$

where $\|\cdot\|^{p}$ represents the $p$-th order estimation of pixel based distortion. The regularization term $\Omega(Y)$ controls the complexity of the model.

Using external or internal images, we can form LR-HR image pairs to train the proposed Attention based Back Projection Network (ABPN) model to approximate the ideal mapping model. As shown in Figure 2, the complete structure of ABPN contains three basic modules: Feature extraction, Enhanced Back Projection Blocks and Refined Back Projection Block. Feature extraction consists of two convolution layers followed by a self-attention block that acts as a global weighting process. Enhanced Back Projection Blocks are modified from [20] and the differences are twofold: 1) the concatenation layer is replaced by the proposed Spatial Attention Block and 2) the LR feature maps are combined with the HR feature maps to form the final feature maps. Finally, the Refined Back Projection Block updates the feature residues between the estimated and original LR images to refine the final SR image. The detailed structure is discussed in the following parts.

3.2 Back Projection Blocks for image SR

The Back Projection block was first proposed in DBPN [11] and a further modified version is used in HBPN [20]. As illustrated in Figure 3, the idea of back projection is based on the assumption that a good SR image should have an estimated LR image that is as close as possible to the original LR image. We follow the same idea to build our basic modules, named Enhanced Down-sampling Back Projection blocks (EDBP) for down-sampling and Enhanced Up-sampling Back Projection blocks (EUBP) for up-sampling. As shown in Figure 2, we stack multiple back projection blocks in up-down order to extract deep feature representations. For the final reconstruction, the intermediate feature maps are concatenated together to learn the SR images. The only structural difference between [20] and ours is that we also concatenate the LR feature maps (yellow lines shown in Figure 2) with the HR feature maps for final reconstruction. Note that since the LR feature maps are smaller than the HR ones, we use one deconvolution layer to up-sample them to the same size as the HR feature maps.

Figure 3: Back Projection procedure.

3.3 Spatial Attention Blocks (SAB)

Spatial Attention Blocks are the major contribution of this work. The idea is to learn the cross-correlation between features at different levels. In the proposed ABPN network, we have two types of attention blocks: self-attention blocks and spatial attention blocks. The self-attention block is exactly the same as the one in [26] and is situated at the end of the feature extraction (the pink block in Figure 2(a)). The spatial attention block is located at each EDBP block (pink blocks in Figure 2 labeled “SAB”) to extract the attention maps for the following up-sampling. Their detailed differences are described in Figure 4.

Figure 4: Comparison between self-attention and spatial attention blocks.

Inside the self-attention and spatial attention blocks, there are three convolution layers that decompose the input data into three components: $\theta$, $\phi$ and $g$. Then two dot product operations are applied to these components: the first between two of them and the second between that result and the remaining one. There is a short connection from the input to the output, so the attention models need to learn only the residual mapping. The difference is that the self-attention block takes only the input X for calculation while the spatial attention block takes both X and Y.

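The attention model can be understood as a non-local convolution process. For input X, we can define the non-local operation as follows,

$$\hat{X}_{i} = \sum_{\forall j} f(X_{i}, X_{j})\, g(X_{j}) \tag{4}$$

where f represents the relationship of each pixel to another on the input image X and g embeds the input at position j. Following the description of self-attention, we can further rewrite Equation 4 as,

$$\hat{X} = \mathrm{softmax}\big(\theta(X)^{T} \phi(X)\big)\, g(X) \tag{5}$$

Similarly, for the spatial attention block, we can write it as,

$$\hat{Y} = \mathrm{softmax}\big(\theta(X)^{T} \phi(X)\big)\, g(Y) \tag{6}$$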
The non-local operation in both self-attention and spatial attention considers all positions on the feature maps. The dot product of $\theta(X)$ and $\phi(X)$ can be regarded as the covariance of the input data. It measures the degree of tendency between two feature maps at different channels. A convolution operation or a channel attention model [31] can only sum up the weighted input in a local region, while the attention model can compute over the whole data.

The attention model can also be related to Principal Component Analysis (PCA). As shown in Figure 4, input X is decomposed into $\theta(X)$ and $\phi(X)$. Then we vectorize the feature maps along the channel dimension so that the i-th vector represents the feature map at the i-th channel. Their dot products calculate the autocorrelation of the input data. The Softmax operation normalizes each of the vectors so that it can be treated as a unit vector (its entries sum to one). Once this is done, each unit vector can be interpreted as an axis of the input data. Multiplying $g(X)$ by the normalized vectors can be considered as projecting the data onto a new coordinate system. The output of Softmax can be called the global weighting matrix that measures the importance of each feature map. Note that the goal of PCA is to reduce the dimension of data, so it calculates the statistical correlation of a group of data and finds the eigenvectors that project all the data with maximum variance. In contrast, self-attention and spatial attention focus on finding the principal features across the whole spatial domain, so they calculate the feature correlation across the channel domain and find the basis for projection.
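The projection view described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: feature maps are stored as C x N matrices (one row per channel, N = H*W positions), the 1×1 convolutions are modeled as plain C x C mixing matrices named `w_theta`, `w_phi` and `w_g`, the correlation is taken across channels, and the shortcut of the spatial attention block is assumed to come from the second input `y`.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax; every output row sums to one."""
    out = []
    for row in a:
        peak = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def spatial_attention(x, y, w_theta, w_phi, w_g):
    """Weights come from x, the projected data comes from y.

    x, y: C x N feature matrices. w_*: C x C matrices standing in for
    linear 1x1 convolutions (no activation). Returns a C x N matrix.
    """
    theta = matmul(w_theta, x)
    phi = matmul(w_phi, x)
    g = matmul(w_g, y)
    # Channel-by-channel correlation of x, normalized into a weighting matrix.
    weights = softmax_rows(matmul(theta, transpose(phi)))  # C x C
    projected = matmul(weights, g)                         # C x N
    # Shortcut connection (assumed to start from y): learn a residual.
    return [[p + s for p, s in zip(p_row, s_row)]
            for p_row, s_row in zip(projected, y)]


def self_attention(x, w_theta, w_phi, w_g):
    """Self-attention is the special case where both inputs are x."""
    return spatial_attention(x, x, w_theta, w_phi, w_g)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
    y = [[0.0, 1.0, 0.0], [2.0, 2.0, 2.0]]
    print(spatial_attention(x, y, identity, identity, identity))
```

With `w_g` set to zero the block returns `y` unchanged, which illustrates why the shortcut leaves only a residual mapping for the attention branch to learn.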

Generally, most deep learning based SR approaches concatenate feature maps from different layers to form a large feature map for the next operation. In order to reduce the computation, a 1×1 convolution is used to globally weight all feature maps and output one compressed result. The disadvantage is that as the model goes deeper, more feature maps are concatenated and the convolution becomes more expensive. It is also difficult to train the global weighting to obtain optimal results. In contrast, spatial attention blocks can enhance the correlation of feature maps from different layers: because the feature maps are not equally important, we only need an attention map to assign confidence scores to the feature maps for estimation. Importantly, the symbols $\theta$, $\phi$ and $g$ represent 1×1 convolution operations without any activation functions because 1) the correlation or covariance is a measure of linear dependence among data, and nonlinear data is more computationally demanding, and 2) the input data X are already activated feature maps, so there is no need to add another activation operation that would increase the training difficulty.

3.4 Refined Back Projection Block (RBPB)

Finally, we have modified the Enhanced Back Projection Block into the proposed Refined Back Projection Block (RBPB) for final reconstruction. The detailed structure is shown in Figure 2d. The reason is that the EDBP and EUBP blocks are stacked in order to update LR and HR feature residues, but they never feed back to the original LR images to simulate the iterative back projection process. To form a closed loop like the one in Figure 3, we use RBPB to connect the input LR image to the final SR image. Most SR approaches assume that the LR image is down-sampled by the Bicubic operator, so we also use Bicubic to down-sample the estimated SR image to obtain the estimated LR image. Then we estimate the LR residues between the estimated LR and input LR images by using another feature extraction block (the purple box at the top of Figure 2). Finally, we up-sample the LR residues by Bicubic and add them to the estimated SR image to obtain the final SR image.
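The refinement steps above (down-sample the estimate, compare with the input LR image, up-sample the residue, add it back) can be sketched as follows. This is a simplified illustration, not the RBPB itself: average pooling and nearest-neighbor replication stand in for the Bicubic operators, and the learned feature extraction on the residue is replaced by a hypothetical `residual_net` callable that defaults to the identity.

```python
def downsample(img, scale):
    """Average-pool a 2D image (list of rows) by an integer factor.

    Stand-in for the Bicubic down-sampling used in the paper.
    """
    h, w = len(img) // scale, len(img[0]) // scale
    return [[sum(img[i * scale + di][j * scale + dj]
                 for di in range(scale) for dj in range(scale)) / (scale * scale)
             for j in range(w)]
            for i in range(h)]


def upsample(img, scale):
    """Nearest-neighbor up-sampling (stand-in for Bicubic up-sampling)."""
    return [[img[i // scale][j // scale]
             for j in range(len(img[0]) * scale)]
            for i in range(len(img) * scale)]


def refine(sr, lr, scale, residual_net=lambda residue: residue):
    """One back projection refinement step on an estimated SR image.

    sr: estimated SR image, lr: original LR image, scale: SR factor.
    residual_net: placeholder for the learned residue processing.
    """
    estimated_lr = downsample(sr, scale)
    residue = [[a - b for a, b in zip(lr_row, est_row)]
               for lr_row, est_row in zip(lr, estimated_lr)]
    residue = residual_net(residue)
    correction = upsample(residue, scale)
    return [[s + c for s, c in zip(sr_row, c_row)]
            for sr_row, c_row in zip(sr, correction)]


if __name__ == "__main__":
    lr = [[1.0, 2.0], [3.0, 4.0]]
    sr0 = [[0.0] * 4 for _ in range(4)]  # deliberately poor initial estimate
    sr1 = refine(sr0, lr, scale=2)
    print(downsample(sr1, 2))  # estimated LR after refinement
```

With these particular stand-in operators, down-sampling the up-sampled residue returns the residue itself, so a single step already makes the estimated LR image match the input. With Bicubic operators and a learned residue network this no longer holds exactly, which is one reason to train the refinement jointly with the rest of the network.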

4 Experiments

4.1 Data Preparation and Network Implementation

We synthesized the training image pairs based on the settings of the AIM2019 SR challenge [4]. The training images include 800 2K images from DIV2K [24] and 2650 2K images from Flickr [19]. Each image was rotated and flipped for augmentation, increasing the number of images eight times. The LR images were obtained by using the Bicubic function in MATLAB according to the down-sampling factor $\alpha$. We extracted LR-HR patch pairs of size 32×32 and 32$\alpha$×32$\alpha$, respectively. The testing images include Set5 [3], Set14 [30], BSD100 [1], Urban100 [14], Manga109 [21], DIV2K [24] and DIV8K [4] with 4×, 8× and 16× SR enlargement.
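The eightfold augmentation mentioned above corresponds to the four 90-degree rotations of an image, each with and without a horizontal flip. A minimal sketch on nested lists follows; the helper names `rotate90`, `flip_horizontal` and `eight_variants` are illustrative and not taken from the released code.

```python
def rotate90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in img]


def eight_variants(img):
    """Return the 8 rotated/flipped versions of an image."""
    variants = []
    for base in (img, flip_horizontal(img)):
        current = [list(row) for row in base]
        for _ in range(4):
            variants.append(current)
            current = rotate90(current)
    return variants


if __name__ == "__main__":
    for variant in eight_variants([[1, 2], [3, 4]]):
        print(variant)
```

The same set of transforms can be reused at test time: each variant is super-resolved, the inverse transform is applied to the output, and the results are averaged.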

To efficiently super-resolve images, we designed the proposed ABPN network using 32 kernels for all convolution and deconvolution layers. For short connections and attention models, we used 1×1 kernels with stride 1 and pad 1. For the convolution and deconvolution in EDBP and EUBP, we used 6×6 kernels with stride 4 and pad 1 for 4× SR and 10×10 kernels with stride 8 and pad 1 for 8× SR. Note that most SR approaches use 64 kernels for convolution or deconvolution, whereas we use only half as many kernels to build the network. In the following experiments, we will demonstrate that, with the help of the proposed attention blocks, ABPN can achieve comparable or even better SR performance with far fewer convolutional parameters.
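The kernel settings above can be checked with the standard output-size formulas for convolution and transposed convolution (dilation 1, no output padding). The sketch below also counts the weights of a single layer to show the effect of halving the number of kernels. The function names are illustrative, and the per-layer count assumes one bias per output channel.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a strided convolution."""
    return (size + 2 * pad - kernel) // stride + 1


def deconv_out(size, kernel, stride, pad):
    """Spatial output size of a transposed convolution (deconvolution)."""
    return (size - 1) * stride - 2 * pad + kernel


def conv_params(kernel, c_in, c_out, bias=True):
    """Number of learnable parameters in one (de)convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


if __name__ == "__main__":
    lr_size = 32  # LR patch size used for training
    for scale, kernel, stride in ((4, 6, 4), (8, 10, 8)):
        hr_size = deconv_out(lr_size, kernel, stride, pad=1)
        back = conv_out(hr_size, kernel, stride, pad=1)
        print(f"{scale}x: {lr_size} -> {hr_size} -> {back}")
    print("6x6 layer, 32 kernels:", conv_params(6, 32, 32))
    print("6x6 layer, 64 kernels:", conv_params(6, 64, 64))
```

Both settings scale a 32-pixel patch by exactly the SR factor and map it back to 32 pixels. Halving the channel count from 64 to 32 reduces the weights of a layer by roughly a factor of four, since both the input and output channel counts shrink. The same formula gives 34 for a 1×1 kernel with stride 1 and pad 1 on a 32-pixel input, so a size-preserving 1×1 layer would need pad 0.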

We conducted our experiments using PyTorch 1.1 and MATLAB R2016b on two NVIDIA GTX1080Ti GPUs. During training, we set the learning rate to 0.0001 for all layers. The batch size is 8 for 1

iterations. For optimization, we used Adam with the momentum set to 0.9 and a weight decay of 0.0001. The executable code and experimental results can be found at the following link:

4.2 Model analysis

Attention Back Projection Block. For our proposed ABPN, the attention back projection block replaces the concatenation layer to combine feature maps from different layers. Self-attention is used in the feature extraction and spatial attention is used after the enhanced down-sampling back projection blocks. To demonstrate the capability of the attention models, we built the same ABPN network using concatenation layers as Model-C and the ABPN network using attention layers as Model-A. We conducted experiments for 2×, 4× and 8× enlargement on Set5 and Set14 for comparison.

| Algorithm | Scale | Set5 PSNR | Set5 SSIM | Set14 PSNR | Set14 SSIM |
|---|---|---|---|---|---|
| Model-C | 2× | 37.78 | 0.955 | 33.77 | 0.913 |
| Model-A | 2× | 38.29 | 0.961 | 34.18 | 0.922 |
| Model-C | 4× | 32.48 | 0.894 | 28.78 | 0.774 |
| Model-A | 4× | 32.69 | 0.900 | 28.94 | 0.789 |
| Model-C | 8× | 26.84 | 0.774 | 24.65 | 0.618 |
| Model-A | 8× | 27.25 | 0.786 | 25.08 | 0.638 |

Table 1: Comparison of the network using plain concatenation block or attention block, including PSNR and SSIM for scale 2×, 4× and 8× SR on Set5 and Set14. Red indicates the best results.

The results are shown in Table 1. We compare Model-C and Model-A on SR with different up-sampling factors. Model-A outperforms Model-C by about 0.2 to 0.5 dB in PSNR and about 0.01 in SSIM, which indicates the effectiveness of using attention over concatenation. Furthermore, to understand the physical meaning of the attention models, we visualize the feature maps obtained from the EDBP and SAB blocks. The feature maps on the first row of Figure 5 were used to compute the basis for projection (same as input X in Figure 4) and the feature maps on the second row of Figure 5 are projected onto the basis to obtain the SAB outputs (the third row of Figure 5). EDBP_n represents the n-th down-sampling back projection block. Note the red boxes on the visualization: the outputs of the SAB blocks are the weighted results of two EDBP blocks. For example, the red boxes in EDBP_1 are located at the feature maps that estimate the complete image, so the basis spans the whole frequency band and shows no focus on specific features. However, the feature maps in EDBP_3 only respond to the edges in the neighborhood area. After the projection, the feature map of the SAB block has enhanced edge information across the whole image, which is the purpose of using the attention model to find the non-local property for reconstruction.

Figure 5: Visualization of the proposed spatial attention blocks. The SAB is obtained by computing the correlation between EDBP features on the first and second rows.

4.3 Refined Back Projection Block

For the final reconstruction, we used the proposed Refined Back Projection Block (RBPB) to further improve the SR performance. There are some related deep learning based SR works [16, 33, 28] that first super-resolve the LR image via the deep network and then use back projection as post processing to refine the obtained SR image. It can improve the PSNR by about 0.01 to 0.1 dB, but the problem is that the back projection is not connected to the network to form an end-to-end architecture. We directly attached the post back projection at the end of the network to jointly train the model for better SR. To make a comparison, we tested ABPN without final back projection (A), ABPN with post back projection (B) and ABPN with RBPB (C) on Set5 and Set14 for 2×, 4× and 8× enlargement.

| Algorithm | Scale | Back Projection | Set5 PSNR | Set5 SSIM | Set14 PSNR | Set14 SSIM |
|---|---|---|---|---|---|---|
| A | 2× | none | 38.05 | 0.960 | 33.89 | 0.919 |
| B | 2× | post BP | 38.20 | 0.961 | 34.07 | 0.921 |
| C | 2× | RBPB | 38.29 | 0.961 | 34.18 | 0.922 |
| A | 4× | none | 32.48 | 0.899 | 28.74 | 0.788 |
| B | 4× | post BP | 32.58 | 0.899 | 28.83 | 0.788 |
| C | 4× | RBPB | 32.69 | 0.900 | 28.94 | 0.789 |
| A | 8× | none | 27.16 | 0.786 | 24.97 | 0.638 |
| B | 8× | post BP | 27.20 | 0.786 | 25.01 | 0.638 |
| C | 8× | RBPB | 27.25 | 0.786 | 25.08 | 0.638 |

Table 2: Comparison of the network with or without back projection or RBPB, including PSNR and SSIM for scale 2×, 4× and 8× SR on Set5 and Set14. Red indicates the best results.

The results are shown in Table 2. Compared to model (A), using back projection as post processing in (B) helps to boost the PSNR performance. When we add the Refined Back Projection Block to the network, model (C) further improves the PSNR by about 0.1 dB. Note that the effect of back projection is limited when we super-resolve LR images with larger up-sampling factors. For example, in 4× image SR, using RBPB outperforms the model without back projection by about 0.2 dB, but the improvement decreases to about 0.1 dB in 8× super-resolution. The reason is that the residual information becomes smaller as the down-sampling factor grows. Using Bicubic as the assumed down-sampling operator may not be sufficient to estimate the ground truth distribution of the LR images.

Each cell lists PSNR / SSIM.

| Algorithm | Scale | Set5 | Set14 | BSD100 | Urban100 | Manga109 |
|---|---|---|---|---|---|---|
| Bicubic | 4× | 28.42 / 0.810 | 26.10 / 0.704 | 25.96 / 0.669 | 23.64 / 0.659 | 25.15 / 0.789 |
| A+ [23] | 4× | 30.30 / 0.859 | 27.43 / 0.752 | 26.82 / 0.710 | 24.34 / 0.720 | 27.02 / 0.850 |
| CRFSR [33] | 4× | 31.10 / 0.871 | 27.87 / 0.765 | 27.05 / 0.719 | 24.89 / 0.744 | 28.12 / 0.872 |
| SRCNN [8] | 4× | 30.49 / 0.862 | 27.61 / 0.754 | 26.91 / 0.712 | 24.53 / 0.724 | 27.66 / 0.858 |
| LapSRN [17] | 4× | 31.54 / 0.885 | 28.19 / 0.772 | 27.32 / 0.728 | 25.21 / 0.756 | 29.09 / 0.890 |
| EDSR [19] | 4× | 32.46 / 0.897 | 28.80 / 0.788 | 27.71 / 0.742 | 26.64 / 0.803 | 31.02 / 0.915 |
| RCAN [31] | 4× | 32.63 / 0.900 | 28.87 / 0.789 | 27.77 / 0.744 | 26.82 / 0.809 | 31.22 / 0.917 |
| ESRGAN [28] | 4× | 32.73 / 0.901 | 28.99 / 0.792 | 27.85 / 0.745 | 27.03 / 0.815 | 31.66 / 0.920 |
| ABPN (Ours) | 4× | 32.69 / 0.900 | 28.94 / 0.789 | 27.82 / 0.743 | 27.06 / 0.811 | 31.79 / 0.921 |
| Bicubic | 8× | 24.39 / 0.657 | 23.19 / 0.568 | 23.67 / 0.547 | 21.24 / 0.516 | 21.68 / 0.647 |
| A+ [23] | 8× | 25.52 / 0.692 | 23.98 / 0.597 | 24.20 / 0.568 | 21.37 / 0.545 | 22.39 / 0.680 |
| CRFSR [33] | 8× | 26.07 / 0.732 | 23.97 / 0.600 | 24.20 / 0.569 | 21.36 / 0.550 | 22.59 / 0.688 |
| SRCNN [8] | 8× | 25.33 / 0.689 | 23.85 / 0.593 | 24.13 / 0.565 | 21.29 / 0.543 | 22.37 / 0.682 |
| LapSRN [17] | 8× | 26.15 / 0.738 | 24.42 / 0.622 | 24.59 / 0.587 | 21.88 / 0.583 | 23.60 / 0.742 |
| EDSR [19] | 8× | 26.97 / 0.775 | 24.94 / 0.640 | 24.80 / 0.596 | 22.47 / 0.620 | 24.58 / 0.778 |
| RCAN [31] | 8× | 27.47 / 0.791 | 25.40 / 0.655 | 25.05 / 0.608 | 23.22 / 0.652 | 25.58 / 0.809 |
| HBPN [20] | 8× | 27.17 / 0.785 | 24.96 / 0.642 | 24.93 / 0.602 | 23.04 / 0.647 | 25.24 / 0.802 |
| ABPN (Ours) | 8× | 27.25 / 0.786 | 25.08 / 0.638 | 24.99 / 0.604 | 23.04 / 0.641 | 25.29 / 0.802 |

| Algorithm | Scale | DIV8K val | DIV2K val | BSD100 | Urban100 | Manga109 |
|---|---|---|---|---|---|---|
| Bicubic | 16× | - | 22.867 / 0.598 | 21.73 / 0.477 | 18.92 / 0.434 | 19.10 / 0.568 |
| EDSR [19] | 16× | - | 24.13 / 0.631 | 22.62 / 0.506 | 19.96 / 0.481 | 20.62 / 0.635 |
| RCAN [31] | 16× | - | 24.30 / 0.639 | 22.69 / 0.511 | 20.20 / 0.496 | 20.88 / 0.656 |
| ESRGAN [28] | 16× | - | 19.09 / 0.421 | 18.01 / 0.281 | 15.42 / 0.262 | 17.41 / 0.428 |
| ABPN (Ours) | 16× | 26.71 / 0.65 | 24.38 / 0.641 | 22.72 / 0.512 | 20.39 / 0.515 | 21.25 / 0.673 |

Table 3: Quantitative evaluation of state-of-the-art SR approaches, including PSNR and SSIM for scale 4×, 8× and 16×. Red indicates the best and blue indicates the second best results.

4.4 Comparison with the state-of-the-art SR approaches

To demonstrate the effectiveness of the proposed ABPN network, we compared it with most (if not all) of the state-of-the-art SR algorithms: Bicubic, A+ [23], CRFSR [33], SRCNN [8], LapSRN [17], EDSR [19], HBPN [20], RCAN [31] and ESRGAN [28]. PSNR and SSIM were used to evaluate the proposed method and the others. PSNR and SSIM were calculated by converting the RGB image to YUV and taking only the Y channel into consideration. During testing, we flipped and rotated the LR images to generate several augmented inputs, then applied the inverse augmentation to the outputs and averaged them to form the final SR images. For each scaling factor s, we excluded s pixels at the boundaries to avoid boundary effects. Regarding the SR results, those of A+ and CRFSR were provided by the corresponding authors, SRCNN was reimplemented and provided by the authors of [17], and EDSR, HBPN, RCAN and ESRGAN were reproduced using the code provided by the corresponding authors. Note that our proposed approach also participated in the AIM2019 Image Super-resolution Challenge [4]. Table 3 shows the comparison of all SR approaches at 4×, 8× and 16×. We did not conduct image SR with up-sampling factors smaller than 4 because all state-of-the-art SR approaches achieve great performance in that scenario and the differences are too small to compare. Instead, we show the extreme case of 16× enlargement. We chose the SR approaches that achieve the best performance at 4× and 8× for this extreme comparison. The 16× results for EDSR, RCAN and ESRGAN were obtained by applying the provided pre-trained 4× SR models twice. For a fair comparison, we also applied our proposed 4× ABPN SR model twice for enlargement. The proposed ABPN achieves an improvement of about 0.1 to 0.2 dB in PSNR and 0.01 to 0.2 in SSIM. This indicates that the proposed ABPN is more robust than the others, since it can handle this image SR task even without further training.

Note that we did not test Set5 and Set14 at 16× for two reasons: 1) the images in these two datasets are too small for evaluation and 2) the released code for EDSR, RCAN and ESRGAN could not be run on these two datasets, so we tested on the DIV2K validation, BSD100, Urban100 and Manga109 datasets. Furthermore, the AIM2019 Image Super-resolution Challenge provided another 8K dataset for 16× SR and we show the results of our proposed ABPN on its validation dataset. In conclusion, from the comparison of PSNR and SSIM across different up-sampling factors, the proposed ABPN achieves comparable or even better performance compared with other state-of-the-art SR approaches. This demonstrates that the proposed ABPN is robust and accurate in handling image SR with different up-sampling factors, even in extreme conditions.
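The evaluation protocol described above (luminance channel only, s boundary pixels excluded) can be sketched with the standard library as follows. The sketch assumes 8-bit RGB images stored as nested lists of (r, g, b) tuples and uses the BT.601 luma coefficients applied by MATLAB's rgb2ycbcr; the exact color conversion used by the authors is not stated beyond "YUV", so this choice is an assumption, and the function names are illustrative.

```python
import math


def rgb_to_y(rgb):
    """Luma channel of an 8-bit RGB image (BT.601, assumed convention)."""
    return [[16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0
             for (r, g, b) in row]
            for row in rgb]


def crop_border(img, s):
    """Drop s pixels on every side to avoid boundary effects."""
    return [row[s:len(row) - s] for row in img[s:len(img) - s]]


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized 2D images."""
    squared = [(a - b) ** 2
               for ref_row, est_row in zip(reference, estimate)
               for a, b in zip(ref_row, est_row)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak * peak / mse)


def evaluate_psnr(hr_rgb, sr_rgb, scale):
    """PSNR on the Y channel after removing `scale` boundary pixels."""
    hr_y = crop_border(rgb_to_y(hr_rgb), scale)
    sr_y = crop_border(rgb_to_y(sr_rgb), scale)
    return psnr(hr_y, sr_y)


if __name__ == "__main__":
    hr = [[(100, 120, 140)] * 12 for _ in range(12)]
    sr = [[(102, 120, 140)] * 12 for _ in range(12)]
    print(round(evaluate_psnr(hr, sr, scale=4), 2), "dB")
```

Since PSNR depends only on the mean squared error, cropping matters mainly when errors concentrate near the image borders, which is common for up-sampling filters with large support.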

Figure 6: Comparison between model complexity and image quality. The left vertical axis is the number of parameters and the right vertical axis is the size of the model file.

Figure 7: Visual comparison of different SR approaches on Urban100 for 4× enlargement.

More importantly, we are also interested in the computational complexity of different models. Hence, we selected some of the state-of-the-art SR approaches for comparison, including SRCNN, VDSR, LapSRN, DBPN, HBPN, ESRGAN and RCAN. Note that we used the models and network settings that the authors reported as the best in their papers. We calculated the number of parameters using the source code provided by [9], and used it as one indicator of model complexity. We also list the size of the pre-trained model file as another indicator. Since different models can be implemented on different computers and platforms, we did not measure the running time, which would complicate the comparison. In Figure 6, we show the number of parameters and the PSNR for 4× SR on the Urban100 dataset.

In Figure 6, orange dots indicate the model size and green dots indicate the number of parameters. Points toward the bottom-right corner are better: higher PSNR with lower model complexity. We can see that the proposed ABPN achieves better PSNR than ESRGAN and RCAN with far fewer parameters. Note that the size of the model is consistent with the number of parameters (for some SR approaches, the orange and green dots overlap) because the SR approaches used for comparison were all implemented in PyTorch and saved in files with the same format. With the help of attention models, ABPN uses at least 2 to 3 times fewer parameters while improving PSNR by about 0.1 dB.

Finally, we show some typical images from the testing datasets for visual comparison. Figure 7 gives the visualization of 4× image SR. We can see that the proposed ABPN generates SR images with quality comparable to other state-of-the-art SR approaches. For example, the pattern in Figure 7 B is supposed to be approximately horizontal. Affected by the vertical lines in the original image, other SR approaches tend to reconstruct diagonal patterns while the proposed ABPN correctly reconstructs the pattern. In Figure 7 C, EDSR and HBPN generate sharp edges around the balcony but with some distortion. Our proposed ABPN generates the pattern with better quality.

5 Discussion

In this paper, we explore the attention mechanism in image super-resolution and propose the Attention based Back Projection Network (ABPN) for image SR. The network makes three contributions: modified enhanced back projection blocks, the Spatial Attention Block (SAB) and the Refined Back Projection Block (RBPB). The key modification is the Spatial Attention Block, which replaces the concatenation layer so that the correlation between intermediate feature maps can be extracted as a non-local weighting model. Without increasing the complexity of the CNN, SAB substantially improves the quality of super-resolution. The final Refined Back Projection Block works as a residual feedback that forms a closed loop between the input LR and output SR images to further boost performance. Quantitative and qualitative evaluations show the advantages of ABPN over other approaches. The encouraging results of attention models for image SR indicate their great potential for further study.
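The closed-loop idea behind the final refinement can be illustrated with the classic, hand-crafted form of back projection on a 1-D signal: the SR estimate is down-sampled, compared with the LR input, and the up-sampled residual is added back. This is only a toy sketch under simple assumptions (average-pooling as the degradation and nearest-neighbour up-sampling, with hypothetical input values); the RBPB in ABPN is a learned block and is not reproduced here.

```python
def downsample(signal, factor):
    """Average each group of `factor` samples (toy degradation model)."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]


def upsample(signal, factor):
    """Nearest-neighbour up-sampling: repeat every sample `factor` times."""
    return [value for value in signal for _ in range(factor)]


def back_project(sr, lr, factor, steps=1):
    """Add the up-sampled LR residual to the SR estimate `steps` times."""
    for _ in range(steps):
        residual = [l - d for l, d in zip(lr, downsample(sr, factor))]
        correction = upsample(residual, factor)
        sr = [s + c for s, c in zip(sr, correction)]
    return sr


lr = [1.0, 3.0]                    # observed low-resolution signal
sr_initial = [0.5, 1.0, 2.0, 3.0]  # hypothetical imperfect SR estimate
sr_refined = back_project(sr_initial, lr, factor=2)

# After refinement, down-sampling the SR estimate reproduces the LR input.
print("refined SR:", sr_refined)
print("down-sampled:", downsample(sr_refined, 2))
```

With these particular operators a single step already makes the refined estimate consistent with the LR input; with more realistic kernels, several iterations or a learned correction would be needed.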

6 Acknowledgment

This work was supported by the Centre for Signal Processing, Department of Electronic and Information Engineering, Earning Account, The Hong Kong Polytechnic University Internal Research Grant (ZZHR), and an RGC project of the Hong Kong Special Administrative Region, China (Grant No. PolyU 152208/17E).


  • [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2011-05) Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (5), pp. 898–916.
  • [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009-08) PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH) 28 (3).
  • [3] M. Bevilacqua, A. Roumy, C. Guillemot, and M. A. Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference, pp. 135.1–135.10. ISBN 1-901725-46-4.
  • [4] AIM 2019 Image Super-Resolution Challenge.
  • [5] Y. Chen, M. Rohrbach, Z. Yan, S. Yan, J. Feng, and Y. Kalantidis (2018) Graph-based global reasoning networks. CoRR abs/1811.12814.
  • [6] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian (2007-08) Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing 16 (8), pp. 2080–2095.
  • [7] T. Dai, J. Cai, Y. Zhang, S. Xia, and L. Zhang (2019) Second-order attention network for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11065–11074.
  • [8] C. Dong, C. C. Loy, K. He, and X. Tang (2016-02) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307.
  • [9] AIM 2019 Image Super-Resolution evaluation.
  • [10] D. Glasner, S. Bagon, and M. Irani (2009-09) Super-resolution from a single image. In 2009 IEEE 12th International Conference on Computer Vision, pp. 349–356.
  • [11] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep back-projection networks for super-resolution. CoRR abs/1803.02735.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385.
  • [13] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks.
  • [14] J. Huang, A. Singh, and N. Ahuja (2015-06) Single image super-resolution from transformed self-exemplars. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5197–5206.
  • [15] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2018) CCNet: criss-cross attention for semantic segmentation. CoRR abs/1811.11721.
  • [16] J. Kim, J. K. Lee, and K. M. Lee (2015) Accurate image super-resolution using very deep convolutional networks. CoRR abs/1511.04587.
  • [17] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. CoRR abs/1704.03915.
  • [18] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi (2016) Photo-realistic single image super-resolution using a generative adversarial network. CoRR abs/1609.04802.
  • [19] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. CoRR abs/1707.02921.
  • [20] Z. Liu, L. Wang, C. Li, and W. Siu (2019) Hierarchical back projection network for image super-resolution. CoRR abs/1906.06874.
  • [21] Y. Matsui, K. Ito, Y. Aramaki, T. Yamasaki, and K. Aizawa (2015) Sketch-based manga retrieval using manga109 dataset. CoRR abs/1510.04389.
  • [22] Y. Tai, J. Yang, and X. Liu (2017-07) Image super-resolution via deep recursive residual network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2790–2798.
  • [23] R. Timofte, V. De Smet, and L. Van Gool (2015-04) A+: adjusted anchored neighborhood regression for fast super-resolution. Vol. 9006, pp. 111–126.
  • [24] R. Timofte et al. (2017-08-22) NTIRE 2017 challenge on single image super-resolution: methods and results. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2017), pp. 1110–1121.
  • [25] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu (2016) Conditional image generation with pixelcnn decoders. CoRR abs/1606.05328.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762.
  • [27] X. Wang, R. B. Girshick, A. Gupta, and K. He (2017) Non-local neural networks. CoRR abs/1711.07971.
  • [28] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang (2018) ESRGAN: enhanced super-resolution generative adversarial networks. CoRR abs/1809.00219.
  • [29] W. Zaremba, I. Sutskever, and O. Vinyals (2014) Recurrent neural network regularization. CoRR abs/1409.2329.
  • [30] R. Zeyde, M. Elad, and M. Protter (2010-06) On single image scale-up using sparse-representations. Vol. 6920, pp. 711–730.
  • [31] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. CoRR abs/1807.02758.
  • [32] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. CoRR abs/1802.08797.
  • [33] L. Zhi-Song and W. Siu (2018-10) Cascaded random forests for fast image super-resolution. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 2531–2535.