Attention based Back Projection Network (ABPN) for image ultra-resolution
Deep learning based image Super-Resolution (SR) has developed rapidly thanks to its ability to exploit large datasets. Generally, deeper and wider networks can extract richer feature maps and generate SR images with remarkable quality. However, the more complex the network, the more time it consumes in practical applications, so a simplified network is important for efficient image SR. In this paper, we propose an Attention based Back Projection Network (ABPN) for image super-resolution. Similar to some recent works, we believe that the back projection mechanism can be further developed for SR. Enhanced back projection blocks are suggested to iteratively update low- and high-resolution feature residues. Inspired by recent studies on attention models, we propose a Spatial Attention Block (SAB) to learn the cross-correlation across features at different layers. Based on the assumption that a good SR image should be close to the original LR image after down-sampling, we propose a Refined Back Projection Block (RBPB) for final reconstruction. Extensive experiments on some public and AIM2019 Image Super-Resolution Challenge datasets show that the proposed ABPN can provide state-of-the-art or even better performance in both quantitative and qualitative measurements.
As a fundamental low-level vision problem, image super-resolution (SR) has attracted much attention in the past few years. The objective of image SR is to super-resolve low-resolution (LR) images to the same dimension as the corresponding high-resolution (HR) images with pleasing visual quality. For image SR with up-sampling factor $s$, we need to estimate $s^2$ times as many pixels. Thanks to architectural innovations and advances in computation, it is possible to utilize larger datasets and more complex models for image SR. Various deep learning based approaches with different network architectures have achieved image SR with good quality. Most SR works are based on the residual mapping modified from ResNet. In order to deliver good super-resolution quality, we need to build a very deep network that covers as large a receptive field of the image as possible to learn different levels of feature abstraction. The advent of 4K/8K UHD (Ultra High Definition) displays demands more accurate image SR with less computation at different up-sampling factors. It is essential to have a deep neural network with the ability to capture long-term dependencies to efficiently learn the reconstruction mapping for SR.

Attention or non-local modeling is one of the choices to globally capture the feature response across the whole image. A lot of related works [31, 7, 26, 27, 15, 5] have been successfully proposed for computer vision. There are several advantages of using attention operations: 1) it can directly compute the correlation between patterns across the image regardless of their distances; 2) it can efficiently reduce the number of kernels and the depth of the network while achieving comparable or even better performance; and 3) it can easily be embedded into any structure. As shown in Figure 1, we tested the state-of-the-art SR approaches on 16× enlargement by applying 4× SR twice using pre-trained models. ESRGAN and RCAN tend to generate fake edges which do not exist in the HR images, while the proposed ABPN can still predict correct patterns.

In this paper, we propose an Attention based Back Projection Network (ABPN) for efficient image SR. Our method focuses on studying the global feature correlation to make full use of the non-local mean operation. Specifically, instead of using plain concatenation or addition operations, we propose the Spatial Attention Block (SAB) to compute the auto- and cross-correlation of the feature maps extracted at different levels. That is, we use the proposed SAB to measure the similarity between two feature maps to obtain the global correlation maps. By further investigating the SR methods, we find that a back projection based network is a better choice for the backbone of feature extraction because it can iteratively up- and down-sample the feature maps to update the residues of LR and HR features. To take a step forward, we propose a Refined Back Projection Block (RBPB) as the final stage to directly minimize the residues between the original LR images and the down-sampled predicted SR images.

We summarize our contributions as follows: 1) By making use of the proposed Spatial Attention Block, we modified the back projection network into the Attention based Back Projection Network (ABPN) for efficient single image super-resolution. 2) We propose a Refined Back Projection Block (RBPB) to replace the common post back projection process in image SR. 3) We tested our proposed SR method on various datasets and real images. Extensive experiments show that ABPN can achieve state-of-the-art or even better SR performance both quantitatively and qualitatively.
Non-local Image Processing. Non-local mean is a conventional algorithm for image processing. The idea is that it searches not only local areas but also non-local areas for repeated patterns, which allows distant pixels or patches to contribute to the filtered region. The idea is generalized as a non-local convolution operation which maps the neighborhood region to the whole region of images or videos. It is commonly used in image denoising, inpainting and super-resolution.
Nowadays, non-local processing is also explicitly or implicitly embedded into deep neural networks to capture long-term dependencies. In most deep learning algorithms, stacking more and more convolution operations with small kernels (e.g. 3×3) can cover a larger receptive field for global modeling. This repeated local operation has the limitations of 1) inefficient computation for practical applications, 2) difficulty in optimizing networks and 3) a feedforward operation without feedback. Recurrent Neural Networks (RNN) are the dominant approaches for sequential data, forming a closed loop to progressively process the data. However, they still work on a local neighborhood and their performance is not optimal. Recently, there is a trend of using self-attention or non-local neural networks for modeling sequential data in language and images. Note that in this paper, we use the term “attention” to describe the non-local modeling process in deep feature extraction.

There are several notable works that make use of the attention mechanism in computer vision. Self-attention was first proposed for machine translation. The idea is to represent each word as a weighted combination of all positions in the sequence. That is, the model looks at both preceding and following words to ensure the consistency of the translation. Similar self-attention based works were proposed in various computing fields. For example, the non-local neural network was proposed to investigate spatial attention for video classification. The Criss-Cross Network was proposed as an efficient attention computation mechanism for semantic segmentation. Another work used the idea of the bilateral filter to learn a robust weighting model for object recognition. Besides, “attention” has also been applied to image super-resolution and has shown great potential. For example, inspired by the squeeze-and-excitation network, the residual channel attention network was proposed to model the channel correlation, and this idea of channel attention was later extended to second-order attention enhancement. However, these approaches still do not fully explore the non-local property in the spatial domain. Hence, there is great potential for further study.
Super-Resolution Deep Neural Networks. In the past few years, deep neural networks have shown remarkable ability on image SR. Since the pioneering work, CNNs have significantly outperformed conventional learning approaches. Their capability to resolve complex nonlinear mapping models and to digest huge datasets encourages researchers to design deeper networks for better performance. Most of the state-of-the-art SR approaches adopt the residual architecture, like SRGAN, EDSR, DenseSR and ESRGAN. There are also some SR approaches that use different architectures for reconstruction. For example, PixelCNN has been proposed for image reconstruction, and recursive neural networks have been used to iteratively predict the SR image. [11, 20] proposed to embed back projection into super-resolution to update the LR and HR feature residuals. This can be considered as a generalized residual model.
Recently, using generative adversarial networks (GAN) for perceptual image SR has attracted a lot of attention. The idea is to add a discriminator as an indicator for SR estimation. The backbones of the generator and discriminator are more or less the same as the aforementioned SR algorithms, and a better architecture can further improve the perceptual quality. Once the training is finished, only the generator is needed for testing, so it is important to keep the model complexity of the generator as small as possible for real-time applications. In this paper, we have not investigated the perceptual quality of our proposed SR method, but it could be adapted as an efficient generator.
Let us formally define the image SR problem. Mathematically, we are given an LR image $I_{LR} \in \mathbb{R}^{h \times w}$ down-sampled from the corresponding HR image $I_{HR} \in \mathbb{R}^{sh \times sw}$, where $(h, w)$ is the dimension of the LR image and $s$ is the up-sampling factor. They are related by the following degradation model,

$$I_{LR} = D\,I_{HR} + \mu \qquad (1)$$

where $\mu$ is the additive white Gaussian noise and $D$ is the down-sampling operator. The goal of image SR is to resolve Equation 1 as a Maximum A Posteriori (MAP) problem as follows,

$$\hat{I}_{SR} = \arg\max_{I_{HR}} \; \log P(I_{LR} \mid I_{HR}) + \log P(I_{HR}) \qquad (2)$$

where $\hat{I}_{SR}$ is the predicted SR image, $\log P(I_{LR} \mid I_{HR})$ represents the log-likelihood of LR images given HR images and $\log P(I_{HR})$ is the prior of HR images that is used for model optimization. Formally, we resolve the image SR problem as follows,

$$\hat{I}_{SR} = \arg\min_{I_{HR}} \; \| I_{LR} - D\,I_{HR} \|^{p} + \Omega(I_{HR}) \qquad (3)$$

where $\|\cdot\|^{p}$ represents the $p$-th order estimation of pixel based distortion. The regularization term $\Omega(I_{HR})$ controls the complexity of the model.

Using external or internal images, we can form LR-HR image pairs to train the proposed Attention based Back Projection Network (ABPN) model to approximate the ideal mapping model. As shown in Figure 2, the complete structure of ABPN contains three basic modules: Feature Extraction, Enhanced Back Projection Blocks and a Refined Back Projection Block. Feature extraction includes two convolution layers followed by a self-attention block as a global weighting process. The Enhanced Back Projection Blocks are modified from earlier back projection designs, and the differences are twofold: 1) the concatenation layer is replaced by the proposed Spatial Attention Block and 2) the LR feature maps are combined with the HR feature maps to form the final feature maps. Finally, the Refined Back Projection Block updates the feature residues between the estimated and original LR images to refine the final SR image. The detailed structure is discussed in the following parts.
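Returning to the MAP formulation, a short worked step may help connect the log-likelihood to the pixel distortion term. This is a standard derivation rather than something spelled out in the text; it assumes the noise $\mu$ in the degradation model is i.i.d. zero-mean Gaussian with variance $\sigma^2$, and the symbols $I_{LR}$, $I_{HR}$, $\hat{I}_{SR}$, $D$ and $\sigma$ are notation chosen here for illustration.

```latex
\begin{aligned}
P(I_{LR} \mid I_{HR}) &\propto \exp\!\left(-\frac{\| I_{LR} - D\,I_{HR} \|_2^2}{2\sigma^2}\right), \\
\hat{I}_{SR} &= \arg\max_{I_{HR}} \; \log P(I_{LR} \mid I_{HR}) + \log P(I_{HR}) \\
             &= \arg\min_{I_{HR}} \; \| I_{LR} - D\,I_{HR} \|_2^2 \; - \; 2\sigma^2 \log P(I_{HR}).
\end{aligned}
```

Under this assumption the distortion term is the second-order ($p = 2$) case, and the regularization term plays the role of a scaled negative log-prior on the HR image.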
The Back Projection block was first proposed in DBPN and a further modified version appears in HBPN. As illustrated in Figure 3, the idea of back projection is based on the assumption that a good SR image should have an estimated LR image that is as close as possible to the original LR image. We follow the same idea to build our basic modules, named the Enhanced Down-sampling Back Projection block (EDBP) for down-sampling and the Enhanced Up-sampling Back Projection block (EUBP) for up-sampling. As shown in Figure 2, we stack multiple back projection blocks in up-down order to extract a deep feature representation. For the final reconstruction, the intermediate feature maps are concatenated together to learn the SR images. The only structural difference between the earlier design and ours is that we also concatenate the LR feature maps (yellow lines in Figure 2) with the HR feature maps for final reconstruction. Note that since the LR feature maps are smaller than the HR ones, we use one deconvolution layer to up-sample them to the same size as the HR feature maps.
Spatial Attention Blocks are the major contribution of this work. The idea is to learn the cross-correlation between features at different levels. In the proposed ABPN network, we have two types of attention blocks: self-attention blocks and spatial attention blocks. The self-attention block follows the existing self-attention design and is situated at the end of the feature extraction (the pink block in Figure 2(a)). The spatial attention block is located at each EDBP block (pink blocks labeled “SAB” in Figure 2) to extract the attention maps for the following up-sampling. Their detailed differences are described in Figure 4.
Inside the self-attention and spatial attention blocks, there are three convolution layers that decompose the input data into three components: $\theta$, $\phi$ and $g$. Then two dot product operations are performed on these components. There is a short connection from the input to the output, so the attention models need to learn the residual mapping relationship. The difference is that the self-attention block takes only the input X for calculation while the spatial attention block takes both X and Y.

The attention model can be understood as a non-local convolution process. For input X, we can define the non-local operation as follows,

$$Z_i = \frac{1}{C(X)} \sum_{\forall j} f(X_i, X_j)\, g(X_j) \qquad (4)$$

where $f$ represents the relationship of each pixel to another on the input image X and $C(X)$ is a normalization factor. Following the description of self-attention, we can further rewrite Equation 4 as,

$$Z = \mathrm{softmax}\!\left(\theta(X)^{T} \phi(X)\right) g(X) \qquad (5)$$

Similarly, for the spatial attention block, we can write it as,

$$Z = \mathrm{softmax}\!\left(\theta(X)^{T} \phi(X)\right) g(Y) \qquad (6)$$
The non-local operation in both self-attention and spatial attention considers all positions on the feature maps. The dot product of $\theta(X)$ and $\phi(X)$ can be regarded as the covariance of the input data. It measures the degree of tendency between two feature maps at different channels. A convolution operation or a channel attention model can only sum up the weighted input in a local region, while the attention model can operate on the whole data. It can also be related to Principal Component Analysis (PCA). As shown in Figure 4, input X is decomposed into $\theta(X)$ and $\phi(X)$. Then we vectorize the feature maps along the channel dimension so that the i-th vector represents the feature map at the i-th channel. Their dot products calculate the autocorrelation of the input data. The Softmax operation normalizes each of the vectors (so that its entries sum to one). Once this is done, each normalized vector can be interpreted as an axis of the input data. Multiplying $g(X)$ by the normalized vectors can be considered as projecting the data to a new coordinate system. The output of Softmax can be called the global weighting matrix that measures the importance of each feature map. Note that the goal of PCA is to reduce the dimension of data, so it calculates the statistical correlation of a group of data and finds the eigenvectors onto which all the data are projected with maximum variance. In contrast, self-attention and spatial attention focus on finding the principal features across the whole spatial domain, so they calculate the feature correlation across the channel domain and find the basis for projection.

Generally, most deep learning based SR approaches concatenate feature maps from different layers to form a large feature map for the next operation. In order to reduce the computation, a convolution is used to globally weight all feature maps and output one compressed result. The disadvantage is that as the model goes deeper, more feature maps are concatenated and the convolution becomes more expensive. It is difficult to train such global weighting to obtain optimal results. In contrast, spatial attention blocks can enhance the correlation of feature maps from different layers: because the feature maps are not equally important, we only need an attention map to assign confidence scores to the feature maps for estimation. Importantly, the symbols $\theta$, $\phi$ and $g$ represent 1×1 convolution operations without any activation functions because 1) the correlation or covariance is a measure of linear dependence among data, and nonlinear data is more computationally demanding, and 2) the input data X are already activated feature maps, so there is no need to add another activation operation that would increase the training difficulty.
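The attention computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the released implementation: it assumes the standard non-local form in which the softmax weights are computed between all spatial positions, that X and Y share the same spatial size, and that the short connection adds the value input back to the output. The function names (`conv1x1`, `spatial_attention`) and the weight arguments are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(x, w):
    """1x1 convolution without activation: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def spatial_attention(x, y, w_theta, w_phi, w_g):
    """Attention weights come from x; the values being re-weighted come from y.
    Passing y = x gives the self-attention case (assumed residual: output + y)."""
    c, h, w = y.shape
    theta = conv1x1(x, w_theta).reshape(-1, h * w)  # (C', N) with N = H * W
    phi = conv1x1(x, w_phi).reshape(-1, h * w)      # (C', N)
    g = conv1x1(y, w_g).reshape(c, h * w)           # (C, N)
    weights = softmax(theta.T @ phi, axis=-1)       # (N, N), each row sums to 1
    out = (weights @ g.T).T.reshape(c, h, w)        # weighted sum over all positions
    return out + y                                  # short connection

# Small example with random feature maps and random 1x1 weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6))
y = rng.standard_normal((4, 6, 6))
w_theta = rng.standard_normal((2, 4))
w_phi = rng.standard_normal((2, 4))
w_g = rng.standard_normal((4, 4))
z_self = spatial_attention(x, x, w_theta, w_phi, w_g)   # self-attention case
z_cross = spatial_attention(x, y, w_theta, w_phi, w_g)  # spatial attention case
```

Because every output position is a softmax-weighted average over all positions of $g(Y)$, a single block already has a global receptive field, which is the property the text contrasts with stacked local convolutions.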
Finally, we have modified the Enhanced Back Projection Block into the proposed Refined Back Projection Block (RBPB) for final reconstruction. The detailed structure is shown in Figure 2(d). The reason is that the EDBP and EUBP blocks are stacked in order to update LR and HR feature residues, but they never feed back to the original LR images to simulate the iterative back projection process. To form a closed loop like the one in Figure 3, we use RBPB to connect the input LR image to the final SR image. In most SR approaches, researchers assume that the LR image is down-sampled by the Bicubic operator, so we also use Bicubic to down-sample the estimated SR image to obtain the estimated LR image. Then we estimate the LR residues between the estimated LR and input LR images by using another feature extraction block (the purple box at the top of Figure 2). Finally, we up-sample the LR residues by Bicubic and add them to the estimated SR image to obtain the final SR image.
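The refinement step can be illustrated with a minimal NumPy sketch. It makes simplifying assumptions that are not in the paper: average pooling and nearest-neighbour up-sampling stand in for the Bicubic operators, and the learned feature extraction applied to the LR residue is omitted (treated as the identity). The function names are hypothetical.

```python
import numpy as np

def down(x, s):
    """Average-pool a 2-D image by factor s (stand-in for Bicubic down-sampling)."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def up(x, s):
    """Nearest-neighbour up-sampling by factor s (stand-in for Bicubic up-sampling)."""
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def refine(sr, lr, s):
    """One back projection refinement: correct SR by the up-sampled LR residue."""
    lr_est = down(sr, s)          # estimated LR from the current SR image
    residue = lr - lr_est         # LR residue between input and estimate
    return sr + up(residue, s)    # add the up-sampled residue to the SR image

rng = np.random.default_rng(0)
hr = rng.random((8, 8))
lr = down(hr, 2)
sr = hr + 0.1 * rng.standard_normal((8, 8))   # an imperfect SR estimate
refined = refine(sr, lr, 2)

err_before = np.abs(down(sr, 2) - lr).max()
err_after = np.abs(down(refined, 2) - lr).max()
```

With these stand-in operators, down-sampling the refined image reproduces the input LR image up to floating point error, which is the consistency assumption that back projection relies on. In the actual network the residue passes through learned layers and the whole path is trained end to end.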
We synthesized the training image pairs based on the settings of the AIM2019 SR challenge. The training images include 800 2K images from DIV2K and 2650 2K images from Flickr. Each image was rotated and flipped for augmentation to increase the number of images eight times. The LR images were obtained by using the Bicubic function in MATLAB according to the down-sampling factor $s$. We extracted LR-HR patch pairs of size 32×32 and $32s \times 32s$, respectively. The testing images include Set5, Set14, BSD100, Urban100, Manga109, DIV2K and DIV8K with 4×, 8× and 16× SR enlargement.
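The eightfold augmentation and the aligned patch extraction can be sketched as follows. This is a sketch under assumptions: the eight variants are taken to be the four 90-degree rotations with and without a horizontal flip (a common convention; the exact set is not stated in the text), and the function names are hypothetical.

```python
import numpy as np

def augment_eightfold(img):
    """Return 8 rotated/flipped variants of an (H, W) or (H, W, C) image."""
    variants = []
    for k in range(4):                               # rotate by 0, 90, 180, 270 degrees
        rotated = np.rot90(img, k, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))    # horizontal flip of each rotation
    return variants

def crop_pair(lr, hr, top, left, s, size=32):
    """Crop an aligned LR patch (size x size) and HR patch (s*size x s*size)."""
    lr_patch = lr[top:top + size, left:left + size]
    hr_patch = hr[s * top:s * (top + size), s * left:s * (left + size)]
    return lr_patch, hr_patch

variants = augment_eightfold(np.arange(9).reshape(3, 3))
lr_patch, hr_patch = crop_pair(np.zeros((64, 64)), np.zeros((256, 256)),
                               top=8, left=16, s=4)
```

The HR crop indices are the LR indices multiplied by the up-sampling factor, so the two patches cover the same image region.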
To efficiently super-resolve images, we designed the proposed ABPN network using 32 kernels for all convolution and deconvolution layers. For short connections and attention models, we used 16 kernels with stride 4 and pad 1 for 4× SR and 10×10 kernels with stride 8 and pad 1 for 8× SR. Note that while most SR approaches use 64 kernels for convolution or deconvolution, we use only half as many convolution kernels to build the network. With the help of the proposed attention blocks, the following experiments demonstrate that the proposed ABPN can achieve comparable or even better SR performance with far fewer convolutional parameters.
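The relation between kernel size, stride and padding can be checked with the standard output-size formulas for strided and transposed convolutions (no dilation, no output padding). The sketch below only verifies the arithmetic; the helper names are hypothetical. It confirms that a 10×10 kernel with stride 8 and pad 1 scales by exactly 8×. By the same arithmetic, exact scaling with pad 1 needs a kernel of size stride + 2, which would be 6 for stride 4; that value is an inference from the formula, not a setting taken from the text.

```python
def deconv_out_size(n, kernel, stride, pad):
    """Output length of a transposed convolution (no dilation, no output padding)."""
    return (n - 1) * stride - 2 * pad + kernel

def conv_out_size(n, kernel, stride, pad):
    """Output length of a strided convolution (no dilation)."""
    return (n + 2 * pad - kernel) // stride + 1

# 8x case from the text: 10x10 kernel, stride 8, pad 1.
up8 = deconv_out_size(32, kernel=10, stride=8, pad=1)
down8 = conv_out_size(up8, kernel=10, stride=8, pad=1)

# 4x case: kernel = stride + 2 * pad = 6 is inferred from the arithmetic only.
up4 = deconv_out_size(32, kernel=6, stride=4, pad=1)
down4 = conv_out_size(up4, kernel=6, stride=4, pad=1)
```

Choosing the kernel this way keeps the up-sampling and down-sampling layers exact inverses in size, which is what lets back projection blocks alternate between LR and HR feature maps without cropping or padding.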
We conducted our experiments using PyTorch 1.1 and MATLAB R2016b on two NVIDIA GTX1080Ti GPUs. During training, we set the learning rate to 0.0001 for all layers. The batch size is 8 for 1iterations. For optimization, we used Adam with a momentum of 0.9 and a weight decay of 0.0001. The code and experimental results can be found at the following link: https://github.com/Holmes-Alan/ABPN.
Attention Back Projection Block. For our proposed ABPN, the attention back projection block replaces the concatenation layer to combine feature maps from different layers. Self-attention is used in the feature extraction and spatial attention is used after the enhanced down-sampling back projection blocks. To demonstrate the capability of the attention models, we designed the same ABPN network using concatenation layers as Model-C and the ABPN network using attention layers as Model-A. We conducted multiple experiments for 2×, 4× and 8× enlargement on Set5 and Set14 for comparison.
The results are shown in Table 1. We compare Model-C and Model-A on SR with different up-sampling factors. Model-A outperforms Model-C by about 0.4 dB in PSNR and 0.01 in SSIM, which indicates the effectiveness of using attention over concatenation. Furthermore, to understand the physical meaning of the attention models, we visualize the feature maps obtained from the EDBP and SAB blocks. The feature maps in the first row of Figure 5 were used to compute the basis for projection (same as input X in Figure 4), and the feature maps in the second row of Figure 5 are projected onto the basis to obtain the SAB outputs (the third row of Figure 5). EDBP_n represents the n-th down-sampling back projection block. Note the red boxes in the visualization: the outputs of the SAB blocks are the weighted results of two EDBP blocks. For example, the red boxes in EDBP_1 are located at feature maps that estimate the complete image, so the basis can span the whole frequency band and shows no focus on specific features. However, the feature maps in EDBP_3 only respond to the edges in the neighborhood area. After the projection, the feature map of the SAB block enhances the edge information across the whole image, which is the purpose of using the attention model to find the non-local property for reconstruction.
For the final reconstruction, we used the proposed Refined Back Projection Block (RBPB) to further improve the SR performance. There are some related deep learning based SR works [16, 33, 28] that first super-resolve the LR image via deep networks and then use back projection as post processing to refine the obtained SR image. It can improve the PSNR by about 0.01 to 0.1 dB, but the problem is that the back projection is not connected to the network to form an end-to-end architecture. We directly attached the post back projection at the end of the network to jointly train the model for better SR. To make a comparison, we tested ABPN without final back projection (A), ABPN with post back projection (B) and ABPN with RBPB (C) on Set5 and Set14 for 2×, 4× and 8× enlargement.
The results are shown in Table 2. Compared to model (A), using back projection as post processing in (B) helps to boost the PSNR performance. When we add the Refined Back Projection Block to the network, model (C) further improves the PSNR by about 0.1 dB. Note that the effect of back projection is limited when we super-resolve LR images with larger up-sampling factors. For example, in 4× image SR, using RBPB outperforms the model without back projection by about 0.2 dB, but the improvement decreases to about 0.1 dB in 8× super-resolution. The reason is that the residual information becomes smaller when the down-sampling factor is larger. Using Bicubic as the assumed down-sampling operator may not be sufficient to estimate the ground truth distribution of the LR images.
Table 3. PSNR and SSIM comparison on the DIV8K val, DIV2K val, BSD100, Urban100 and Manga109 datasets.
To prove the effectiveness of the proposed ABPN network, we conducted experiments comparing it with most (if not all) of the state-of-the-art SR algorithms: Bicubic, A+, CRFSR, SRCNN, LapSRN, EDSR, HBPN, RCAN and ESRGAN. PSNR and SSIM were used to evaluate the proposed method and the others. Generally, PSNR and SSIM were calculated by converting the RGB image to YUV and taking only the Y-channel image into consideration. During testing, we flipped and rotated the LR images to generate several augmented inputs, then applied the inverse augmentation and averaged all the outputs to form the final SR images. For each scaling factor s, we excluded s pixels at the boundaries to avoid boundary effects. Among these SR results, A+ and CRFSR were provided by the corresponding authors, SRCNN was reimplemented and provided by the authors of another work, and EDSR, HBPN, RCAN and ESRGAN were reimplemented using the code provided by the corresponding authors. Note that our proposed approach also participated in the AIM2019 Image Super-resolution Challenge. Table 3 shows the comparison of all SR approaches at 4×, 8× and 16×. We did not conduct image SR with up-sampling factors smaller than 4× because all state-of-the-art SR approaches have achieved great performance in that scenario and the differences are too small to be compared. Instead, we show the extreme case of 16× enlargement. We chose the SR approaches that achieve the best performance at 4× and 8× for this extreme comparison. The 16× results for EDSR, RCAN and ESRGAN were obtained by applying 4× SR twice using the provided pre-trained models. For a fair comparison, we also applied our proposed 4× ABPN SR model twice for enlargement. We find that the proposed ABPN achieves an improvement of 0.1 to 0.2 dB in PSNR and 0.01 to 0.2 in SSIM. This indicates that the proposed ABPN is more robust than the others and can handle larger enlargement even without further training.
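The test-time augmentation described above (often called self-ensemble) can be sketched as follows. This is a minimal sketch, assuming the augmentations are the four 90-degree rotations with and without a horizontal flip; `model` is a hypothetical callable mapping an LR array to an SR array, and a nearest-neighbour up-sampler is used as a placeholder for a trained network.

```python
import numpy as np

def self_ensemble(model, lr):
    """Average the model outputs over 8 flip/rotation variants of the input."""
    outputs = []
    for k in range(4):
        for flip in (False, True):
            aug = np.rot90(lr, k, axes=(0, 1))
            if flip:
                aug = np.flip(aug, axis=1)
            out = model(aug)
            if flip:
                out = np.flip(out, axis=1)            # undo the flip first
            out = np.rot90(out, -k, axes=(0, 1))      # then undo the rotation
            outputs.append(out)
    return np.mean(outputs, axis=0)

def up2(x):
    """Placeholder 2x model: nearest-neighbour up-sampling."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

lr = np.arange(12, dtype=np.float64).reshape(3, 4)
sr = self_ensemble(up2, lr)
```

The inverse transforms are applied in the reverse order of the forward transforms so that all eight outputs are aligned before averaging. For a model that is already equivariant to flips and rotations, such as the placeholder here, the ensemble returns the same result as a single forward pass.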
Note that we did not test on Set5 and Set14 for two reasons: 1) the images in these two datasets are too small for evaluation and 2) the released code for EDSR, RCAN and ESRGAN could not be run on these two datasets, so we tested on the DIV2K validation, BSD100, Urban100 and Manga109 datasets. Furthermore, the AIM2019 Image Super-resolution Challenge provided another 8K dataset for 16× SR, and we show the results of our proposed ABPN on its validation set. In conclusion, from the comparison of PSNR and SSIM across different up-sampling factors, the proposed ABPN achieves comparable or even better performance than other state-of-the-art SR approaches. This demonstrates that the proposed ABPN is robust and accurate in handling image SR with different up-sampling factors, even in extreme conditions.
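The PSNR part of the evaluation protocol (Y channel only, with s boundary pixels excluded) can be sketched as below. The luma conversion is an assumption: it uses the ITU-R BT.601 coefficients as in MATLAB's `rgb2ycbcr` for 8-bit images, which is a common choice in SR evaluation but is not specified in the text. The function names are hypothetical.

```python
import numpy as np

def rgb_to_y(img):
    """Luma of an RGB image with values in [0, 255] (BT.601, output in [16, 235])."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr, s):
    """PSNR in dB on the Y channel, excluding s pixels at each boundary (s > 0)."""
    y_sr = rgb_to_y(sr.astype(np.float64))[s:-s, s:-s]
    y_hr = rgb_to_y(hr.astype(np.float64))[s:-s, s:-s]
    mse = np.mean((y_sr - y_hr) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)

hr = np.full((16, 16, 3), 100, dtype=np.uint8)
sr = np.full((16, 16, 3), 101, dtype=np.uint8)
value = psnr_y(sr, hr, s=4)
```

Cropping the border before computing the error avoids penalizing padding artifacts near the image edges, which grow with the up-sampling factor.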
More importantly, we are also interested in the computational complexity of different models. Hence, we selected some of the state-of-the-art SR approaches for comparison, including SRCNN, VDSR, LapSRN, DBPN, HBPN, ESRGAN and RCAN. Note that we used the models and network settings that the authors reported as the best in their papers. We calculated the number of parameters using the available source code and used it as one indicator of model complexity. We also list the size of the pre-trained model file as another indicator. Since different models can be implemented on different computers and platforms, we did not measure the running time, which would complicate the comparison. In Figure 6, we show the number of parameters and the PSNR for 4× SR on the Urban100 dataset.
In Figure 6, orange dots indicate the model size and green dots indicate the number of parameters. The bottom right corner is the desirable region, with higher PSNR and lower model complexity. We can see that the proposed ABPN achieves better PSNR than ESRGAN and RCAN with far fewer parameters. Note that the size of the model is consistent with the number of parameters (for some SR approaches, the orange and green dots overlap) because the SR approaches used for comparison were all implemented in PyTorch and saved in files of the same format. With the help of attention models, ABPN can reduce the number of parameters by at least 23 times while gaining about 0.1 dB in PSNR.
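To make the parameter comparison concrete, the parameter count of a single convolution layer can be computed directly. The sketch below is illustrative only: the 3×3 kernel size is an assumption chosen for the example, and `conv_params` is a hypothetical helper, not the tool used to produce Figure 6.

```python
def conv_params(c_in, c_out, kernel, bias=True):
    """Number of learnable parameters in one 2-D convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

p64 = conv_params(64, 64, 3)   # a layer with 64 input and 64 output channels
p32 = conv_params(32, 32, 3)   # the same layer with 32 channels
ratio = p64 / p32
```

Since the weight count scales with the product of input and output channels, halving the number of kernels in every layer cuts the parameters of each layer by roughly a factor of four.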
Finally, we show some typical images from the testing datasets for visual comparison. Figure 7 gives the visualization of 4× image SR. We can see that the proposed ABPN generates SR images with quality comparable to other state-of-the-art SR approaches. For example, the pattern in Figure 7(B) is supposed to be approximately horizontal. Affected by the vertical lines in the original image, other SR approaches tend to reconstruct diagonal patterns while the proposed ABPN correctly reconstructs the pattern. In Figure 7(C), EDSR and HBPN generate sharp edges around the balcony but with some distortion. Our proposed ABPN generates the pattern with better quality.
In this paper, we explore the attention mechanism in image super-resolution and propose the Attention based Back Projection Network (ABPN) for image SR. There are three contributions in this network: modified enhanced back projection blocks, the Spatial Attention Block (SAB) and the Refined Back Projection Block (RBPB). The key modification is the Spatial Attention Block, which replaces the concatenation layer so that the correlation between the intermediate feature maps can be extracted as a non-local weighting model. Without increasing the complexity of the CNN, SAB can substantially improve the quality of super-resolution. The final Refined Back Projection Block works as a residual feedback that forms a closed loop between the input LR and output SR images to further boost the performance. Results of quantitative and qualitative evaluation show its advantages over other approaches. The promising results of attention models for image SR indicate their great potential for further study.
This work was supported by the Centre for Signal Processing, Department of Electronic and Information Engineering. Earning Account, The Hong Kong Polytechnic University Internal Research Grant (ZZHR), and an RGC project of the Hong Kong Special Administrative Region, China (Grant No. PolyU 152208/17E).
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11065–11074.
Cascaded random forests for fast image super-resolution. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 2531–2535.