An Attention-Based Approach for Single Image Super Resolution

07/18/2018 · by Yuan Liu, et al. · Federation University Australia

The main challenge of single image super resolution (SISR) is the recovery of high frequency details such as tiny textures. However, most state-of-the-art methods lack specific modules to identify high frequency areas, causing the output image to be blurred. We propose an attention-based approach that discriminates between texture areas and smooth areas. Once the positions of high frequency details are located, high frequency compensation is carried out. This approach can be incorporated into previously proposed SISR networks; by providing high frequency enhancement, it achieves better performance and visual quality. We also propose our own SISR network composed of DenseRes blocks, which provide an effective way to combine low level features and high level features. Extensive benchmark evaluation shows that our proposed method achieves significant improvement over state-of-the-art works in SISR.


I Introduction

The task of single image super-resolution (SISR) is to infer a high resolution (HR) image from a single low resolution (LR) input image. It is a highly ill-posed problem because high frequency information such as tiny textures is lost during low-pass filtering and down-sampling; SISR is thus a one-to-many mapping. Our task is to find the most plausible HR image, recovering the tiny textures as closely as possible.

In order to recover HR images from LR images, a large receptive field is needed to take more contextual information from the LR image. Using deeper networks is an effective way to increase the receptive field, but one drawback of deep networks is the vanishing-gradient problem, which makes them difficult to train. He et al. [1] use the residual learning framework to ease the training of networks. Skip connections are another solution to boost the flow of gradient and information through the network: the low level features contain effective information and can be used to reconstruct the HR image, so SISR benefits from collecting information at different levels.

The difficulty of SISR is the recovery of high frequency details such as tiny textures. The mean squared error (MSE) between the output image and the original image is often applied as a loss function to train the convolutional neural network. However, in the process of pursuing a high peak signal-to-noise ratio (PSNR), MSE returns the mean of many possible solutions, so the output image looks blurry and implausible. In order to recover the high frequency details, perceptual losses [2] have been proposed, which encourage the network to produce images whose feature representations are similar, resulting in sharper images. Sajjadi et al. [3] and Ledig et al. [4] combine an adversarial network, perceptual loss and texture loss to encourage the output image to recover high frequency details such as tiny textures. But all these networks do not explicitly know the positions of the high frequency details; they just try to restore the textures blindly. Thus, the performance of these networks is not satisfactory.

To solve these problems, first, based on DenseNet [5], which connects each layer to every subsequent layer, we propose a novel block called the DenseRes block, composed of residual building blocks (Resblocks) [1]. The output of every Resblock is connected to every other Resblock, boosting the flow of information and avoiding the re-learning of redundant features. With the DenseRes block, the gradient vanishing problem is alleviated and the network is easy to train. Second, we provide an attention mechanism to cope with the recovery of high frequency details. Inspired by U-net [6], which is used for semantic pixel-wise segmentation, we propose a novel hybrid densely connected U-net that helps the network discriminate whether areas are full of tiny textures in need of repair or similar to the interpolated image. It works as a feature selector which selectively enhances high frequency features. Thus, the textures can be restored as closely as possible.

It is the first time that the attention mechanism is introduced into SISR. The method is simple and effective: by selectively providing high frequency enhancement, it alleviates the tendency of output images to be blurred. The attention mechanism can be incorporated into previously proposed SISR networks, yielding higher PSNR and SSIM. Another contribution is the DenseRes block, which provides an efficient way to combine low level features and high level features and is beneficial for the recovery of high frequency details.

We evaluate our model on four publicly available benchmark datasets. It outperforms the current state-of-the-art approaches in terms of PSNR and the structural similarity (SSIM) index. In terms of PSNR, we obtain improvements of 0.54 dB and 0.52 dB over VDSR [7] and DRCN [8] respectively.

The remainder of this paper is organized as follows: related work on super resolution (SR) algorithms and the attention mechanism is presented in Section II, followed by the proposed network structure in Section III. Experimental results and visual comparisons with state-of-the-art methods are provided in Section IV. We conclude in Section V.

II Related Work

II-A SISR

Early approaches such as bicubic and Lanczos [9] interpolation are simple and fast, but they often produce blurry results lacking high frequency details. Many powerful methods such as sparse coding [10] were proposed to establish a complex mapping between low resolution and high resolution images. Sparse coding [11, 12] is based on the assumption that the sparse representation of the LR image over the LR dictionary is the same as that of the corresponding HR image over the HR dictionary.

Recently, algorithms based on convolutional neural networks (CNNs) have achieved excellent results and outperform other algorithms. Dong et al. [13] upscaled an input image with bicubic interpolation and then trained a shallow convolutional network end-to-end to learn a nonlinear mapping from the LR input to a super-resolution output. Subsequently, various works [7, 8, 3] have successfully used deep networks in SISR and achieved higher PSNR values than shallow convolutional architectures. Lim et al. [14] obtained the best results in the NTIRE 2017 Super-Resolution Challenge [15] with a network up to 32 layers deep.

In many deep learning algorithms for SISR, the LR image is upsampled via bicubic interpolation before being fed to the network [7, 8]. This means that the SISR operation is performed in high resolution space, which is sub-optimal and adds computational complexity. Instead of an interpolated image, sub-pixel convolution layers [16] can be applied in the later layers of the network to upsample the feature maps to the size of the ground truth. This reduces computation while the model capacity is preserved.
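As a hedged illustration, the sub-pixel idea can be sketched in a few lines of TensorFlow (the framework our implementation is based on); the channel and kernel sizes here are illustrative assumptions, not a specific network's configuration:

```python
import tensorflow as tf

def subpixel_upsample(feature_maps, scale):
    """Sub-pixel convolution sketch: expand channels to C * scale^2 with a
    convolution, then rearrange them into an (H*scale, W*scale, C) tensor."""
    c = feature_maps.shape[-1]
    x = tf.keras.layers.Conv2D(c * scale ** 2, 3, padding="same")(feature_maps)
    return tf.nn.depth_to_space(x, scale)  # the pixel-shuffle rearrangement

# Usage: upsample 64-channel LR feature maps by a factor of 3.
lr_features = tf.keras.Input(shape=(None, None, 64))
hr_features = subpixel_upsample(lr_features, scale=3)
```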

II-B Attention

Methods based on the attention mechanism have shown good performance on a range of tasks. In the field of speech recognition, an attention-based recurrent network decoder has been used to transcribe speech utterances to characters [17]. Chorowski et al. [18] improve robustness to long speech inputs with the attention mechanism, and Hou et al. [19] propose a simple but effective attention mechanism for online speech recognition. In the field of machine translation, Vaswani et al. [20] propose a simple network based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, which shows superior quality in machine translation tasks; other works [21, 22] also achieve decent performance with the assistance of attention. In the field of computer vision, the attention mechanism has been used in image generation [23, 24] and image caption generation [25]. Yao et al. [26] propose a temporal attention mechanism to automatically select the most relevant temporal segments in the task of video description. As for salient object detection, which aims to identify and locate distinctive regions that attract human attention, Zhang et al. [27] design a symmetrical fully convolutional network to extract saliency features, and Li et al. [28] use weakly supervised methods to achieve results comparable with strongly-supervised methods.

Fig. 1: The network architecture of our method. It consists of two parts: the feature reconstruction network which aims to recover the HR image and the attention producing network whose purpose is to selectively enhance high frequency features.

III Proposed Method

In this section, we describe the proposed model architecture for SISR. The network aims to learn an end-to-end mapping function F between the LR image and the HR image. As shown in Fig. 1, our network is composed of two parts: the feature reconstruction network, which aims to recover the HR image, and the attention producing network, whose purpose is to find the high frequency details to be repaired. Multiplying the two networks' outputs yields the residual of the HR image.

III-A Feature Reconstruction Network

The feature reconstruction network aims to recover high-frequency details and reconstruct the high resolution image. It is a fully convolutional network. Let $x_{l-1}$ denote the input of the $l$-th convolutional layer; the output of the layer is expressed as:

$x_l = \sigma(W_l \ast x_{l-1} + b_l)$   (1)

where $\ast$ refers to the operation of convolution, $\sigma(\cdot)$ refers to the activation of rectified linear units (ReLU) [29], and $W_l$ and $b_l$ refer to the weights and biases of layer $l$ respectively.

The feature reconstruction network consists of three parts: a convolutional layer for feature extraction, multiple stacked DenseRes blocks, and a sub-pixel convolution layer [16] as an upsampling module, as illustrated in Fig. 1 (transition layers between DenseRes blocks are omitted for simplicity). In order to recover HR images from LR images, a large receptive field is needed to take more contextual information into account when predicting pixels in HR images. By cascading multiple DenseRes blocks, our network is deep and can make use of more pixels for better SISR performance.

The DenseRes block is an important component of the feature reconstruction network. We now present the details of the block. The DenseRes block consists of residual building blocks (Resblocks) [1], which show powerful learning ability for object recognition. Let $x$ be the input of the Resblock; the output $y$ can be expressed as:

$y = F(x, W) + x$   (2)

where $W$ is the weight set to be learned in the Resblock and $F$ is the Resblock function. A Resblock includes two convolutional layers. Specifically, the Resblock function can be expressed as follows:

$F(x, W) = B\big(W_2 \ast \sigma_B(W_1 \ast x)\big)$   (3)

where $W_1$ and $W_2$ are the weights of the two convolutional layers respectively and the bias is omitted for simplicity. $\sigma_B(\cdot)$ represents batch normalization [30] followed by the activation of ReLU [29], and $B(\cdot)$ represents batch normalization.
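As a minimal sketch of Eq. (2) and (3) using Keras layers (the filter count is an illustrative assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers

def resblock(x, filters=64):
    """Residual building block of Eq. (2)-(3). Assumes x already has
    `filters` channels so that the identity skip connection type-checks."""
    f = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    f = layers.BatchNormalization()(f)  # sigma_B: BN followed by ReLU
    f = layers.ReLU()(f)
    f = layers.Conv2D(filters, 3, padding="same", use_bias=False)(f)
    f = layers.BatchNormalization()(f)  # B: BN only
    return layers.Add()([f, x])         # y = F(x, W) + x
```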

The DenseRes block includes several Resblocks. The input of the $i$-th Resblock, $H_i$, is derived from the concatenation of the previous Resblocks' outputs. In a typical feedforward CNN, high frequency information is easily lost at latter layers, and dense connections from previous Resblocks can alleviate such loss and further enhance high frequency signals. However, if a large number of feature maps are directly fed into the subsequent Resblock, the model size and the computational cost will increase dramatically. Thus, we use a convolutional layer with a $1\times1$ kernel to control how much of the previous states should be reserved; it adaptively learns the weights for different states. The input of the $i$-th Resblock is expressed as:

$H_i = \sigma_B\big(W_{1\times1} \ast [y_1, y_2, \ldots, y_{i-1}]\big)$   (4)

where $W_{1\times1}$ is the weight of the convolutional layer with the $1\times1$ kernel, $[y_1, \ldots, y_{i-1}]$ denotes the concatenation of the previous Resblocks' outputs, and $\sigma_B(\cdot)$ represents the batch normalization followed by the activation of ReLU.
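Building on the resblock sketch above, a DenseRes block with the 1×1 transition of Eq. (4) might look as follows (4 Resblocks per block, as reported in Section IV-B; treating the block input as the first "previous state" is our own assumption to handle the first Resblock):

```python
def denseres_block(x, num_resblocks=4, filters=64):
    """DenseRes block sketch: each Resblock input is the concatenation of all
    previous Resblock outputs, compressed by a 1x1 conv + BN + ReLU (Eq. 4)."""
    states = [x]  # the block input serves as the first "previous state"
    h = x
    for _ in range(num_resblocks):
        y = resblock(h, filters)
        states.append(y)
        h = layers.Concatenate()(states)                  # dense connections
        h = layers.Conv2D(filters, 1, use_bias=False)(h)  # 1x1 compression
        h = layers.BatchNormalization()(h)
        h = layers.ReLU()(h)
    return h
```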

III-B Attention Producing Network

The difficulty of SISR is the recovery of tiny textures. In the work of [31], the Sobel operator is used to extract high-frequency components such as edges, which are then combined with the LR image as input to the network; this is helpful for the recovery of sharper edges. One disadvantage of this method, however, is that the hand-crafted features are not robust, so the performance is not satisfactory.

If the network knows the exact locations of the tiny textures, it can enhance the features of these areas, so more high frequency details will be recovered. The attention producing network implements such functionality by providing an attention mechanism, which needs a large receptive field. With an architecture inspired by U-net [6], an encoder-decoder structure used for semantic segmentation, the attention producing network can make use of a large region to provide attention. As shown in Fig. 1, the network consists of a contracting path (left side), an expansive path (right side) and skip connections. It takes an interpolated LR image (at the desired size) as input. The increased redundancy from interpolation reduces the information loss in the forward propagation and is thus beneficial for a precise segmentation between texture areas and smooth areas; if we instead took the raw LR image as input, the attention mechanism could not perform well.

Compared with U-net, we replace the convolution layers with dense blocks. In the DenseNet structure, each layer is connected to all subsequent layers. This strengthens the reuse of information and alleviates the vanishing-gradient problem. In addition, by reusing features, the DenseNet structure substantially reduces the number of parameters. Thus, it is easy to train and requires less computation and memory.

In the contracting path, low level features are first extracted from the interpolated image by convolutional layers. Max pooling is then applied to reduce the data dimension and enlarge the receptive field; we use pooling twice in the contracting path. In this way, the network can make use of a larger region to predict whether a pixel belongs to a high-frequency region or not. In the expansive path, deconvolution layers upsample the previous feature maps. Low-level features contain much useful information, and much of it is lost during the forward propagation; by combining low-level features and high-level features in the expansive path, the output can give a precise segmentation of whether an area is a texture field that needs to be repaired by the feature reconstruction network. The network's output has one feature channel and the same size as the HR image. In the final layer, we use a sigmoid activation to constrain the output to the range 0 to 1. We call the output the mask. The higher the probability that pixels belong to texture areas, the closer the mask values are to 1, meaning those pixels need to be given more attention; otherwise, the mask values are closer to 0.
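A minimal sketch of such a mask network is given below, with two pooling stages, two deconvolution stages, skip connections and a sigmoid head; the dense-block depth, growth rate and channel widths are our own illustrative assumptions, not the paper's exact configuration:

```python
def dense_block(x, num_layers=3, growth=16):
    # Simplified dense block: each conv sees all previous feature maps.
    for _ in range(num_layers):
        f = layers.Conv2D(growth, 3, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, f])
    return x

def attention_net():
    # Input: interpolated LR image at HR size (spatial dims divisible by 4).
    inp = tf.keras.Input(shape=(None, None, 3))
    e1 = dense_block(inp)                              # contracting path
    e2 = dense_block(layers.MaxPooling2D(2)(e1))
    b = dense_block(layers.MaxPooling2D(2)(e2))
    d2 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(b)
    d2 = dense_block(layers.Concatenate()([d2, e2]))   # skip connection
    d1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(d2)
    d1 = dense_block(layers.Concatenate()([d1, e1]))   # skip connection
    mask = layers.Conv2D(1, 1, activation="sigmoid")(d1)  # one-channel mask
    return tf.keras.Model(inp, mask)
```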

Fig. 2: Mask evolution during training, showing the HR image and the masks after 1, 5, 10 and 50 epochs. The upscaling factor is 3.
Fig. 3: Effect of the attention mechanism on the residual image, showing the original image, the mask, the residual before and after the mask, and the ground truth. The upscaling factor is 3.

III-C Residual Learning of Attention

We get the residual of the HR image by the element-wise product of the output of the feature reconstruction network and the mask values. By adding the interpolated LR image, which is also the input of the attention producing network, the final HR result is achieved. It can be expressed as:

$I^{SR}_{x,y,c} = F_{x,y,c} \cdot M_{x,y} + I^{LR\uparrow}_{x,y,c}$   (5)

where $F$ is the output of the feature reconstruction network with 3 output channels, $M$ is the mask, $I^{LR\uparrow}$ is the interpolated LR image, and $I^{SR}$ is the final high resolution result of our method. $x$ and $y$ represent the pixel position in each channel and $c$ is the channel index. Thus, the attention producing network encourages residual values in texture areas to be large and those outside texture areas to be close to 0. The mask works as a feature selector which enhances high frequency features and suppresses noise: in the output images, high frequency details are recovered, and in smooth areas, noise is removed.
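In code, Eq. (5) is a single broadcasted operation; the one-channel mask gates the three-channel residual per pixel:

```python
def compose_sr(residual, mask, lr_upsampled):
    """Eq. (5) sketch. residual: (B, H, W, 3) from the feature reconstruction
    network; mask: (B, H, W, 1) from the attention producing network;
    lr_upsampled: (B, H, W, 3) interpolated LR image."""
    return residual * mask + lr_upsampled  # mask broadcasts over RGB channels
```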

IV Experiments and Analysis

TABLE I: Comparison of different methods (PSNR/SSIM) per dataset and scale: Bicubic, SelfEx [32], SRCNN [13], A+ [33], VDSR [7], DRCN [8], our model without the attention producing network, and the full proposed method. The upscaling factor for super-resolution ranges from 2 to 4. The best performance is in bold.

IV-A Datasets

For training, we use images from the DIV2K dataset [15], which consists of 1000 diverse 2K resolution RGB images: 800 for training, 100 for validation and 100 for testing. For evaluation, we perform experiments on four publicly available benchmark datasets: Set5 [34], Set14 [35], BSD100 and Urban100 [32]. Set5 and Set14 contain 5 and 14 images respectively; BSD100 is the testing set of the Berkeley segmentation dataset BSD300 [36]; Urban100 contains 100 images with a variety of real-world structures. Our experiments are performed with scale factors varying from 2 to 4 between the low resolution and high resolution images. We use PSNR and SSIM as evaluation metrics.

IV-B Training Details and Parameters

We use RGB image patches of size 48×48 as input, with the corresponding HR patches of size 48r×48r as ground truth, where r is the upscaling factor. We augment the training data with random horizontal flips and 90° rotations. All the images are from the DIV2K dataset. Before being fed into the network, the image values are normalized to the range between 0 and 1. As for the network, the filter size is set to 3×3 in all weight layers. We use the Xavier method proposed in [37] to initialize the weights, and the biases are initialized to zero. We choose rectified linear units (ReLU) as the activation function. The attention producing network includes a contracting path and an expansive path: in the contracting path, a 2×2 pooling operation with stride 2 is used for downsampling; in the expansive path, a deconvolution layer is used to upsample the feature maps. As for the feature reconstruction network, one DenseRes block includes 4 Resblocks, and we use 6 DenseRes blocks in all.

The network is optimized end to end using Adam [38]. The batch size is 16 and the training stops after 80 epochs, when no further improvement is observed. The initial learning rate is 0.0001, decreased by 50 percent every ten epochs. Because an MSE loss causes the output images to be blurry and lacking high-frequency details, the L1 loss is used instead, which provides better convergence according to [14]. Our implementation is based on Tensorflow [39].
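A minimal sketch of this training setup is shown below; `model`, `lr_patches` and `hr_patches` are placeholders for the assembled network and the patch pipeline described above:

```python
import tensorflow as tf

def l1_loss(y_true, y_pred):
    # L1 loss, preferred over MSE for sharper results [14].
    return tf.reduce_mean(tf.abs(y_true - y_pred))

def lr_schedule(epoch, lr=None):
    # Initial learning rate 1e-4, halved every ten epochs; the current-lr
    # argument is ignored on purpose.
    return 1e-4 * 0.5 ** (epoch // 10)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss=l1_loss)
model.fit(lr_patches, hr_patches, batch_size=16, epochs=80,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```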

IV-C The Importance of the Attention Producing Network

The feature reconstruction network and the attention producing network are trained jointly throughout the whole procedure. We call the output of the attention producing network the mask. Fig. 2, which shows how the output masks change during training, indicates that the mask gives precise attention once training is finished. Thus, texture areas can be recovered and noise removed with the assistance of the mask. Fig. 3 illustrates that when the mask is applied, the residual image becomes rich in texture information. As illustrated in Fig. 5, with the help of the attention producing network, tiny textures are recovered; the output image even has a better visual effect than the ground truth image due to the high frequency enhancement of the attention mechanism. On the contrary, when the attention producing network is removed, the network cannot provide as good a recovery of tiny textures. The quantitative results in terms of PSNR and SSIM are presented in Table I. Compared with our method without the attention producing network, our method with it obtains an improvement of 0.368 dB in PSNR on average; an increase of 1.02 dB is even achieved on Set14 for ×3 super resolution. As for SSIM, our method with the mask obtains an improvement of 0.0111 over that without the mask. Thus, we conclude that the attention producing network is critical to the recovery of HR images.

We also compare the attention mechanism with the work of [31], which uses the Sobel operator to help recover high frequency details. We substitute the Sobel operator for the attention producing network while the feature reconstruction network remains unchanged. After training, quantitative results show a decline of 0.28 dB compared with the network with the attention mechanism. Thus, our attention mechanism gives better assistance to the feature reconstruction network.

Fig. 4: The architecture of SRResNet with the attention producing network.
Fig. 5: Comparison of our method with and without the attention mechanism. The SR result with the attention mechanism has an even better visual effect than the ground truth due to high frequency enhancement.
Fig. 6: Performance curves for SRResNet, VDSR and DRCN, comparing each network with and without the attention mechanism. The networks are tested on the Set14 dataset with upscaling factor 3.

IV-D The Flexibility of the Attention Producing Network

The attention producing network can be applied to other SISR networks. By providing attention, it improves their overall performance. To illustrate this point, we integrate the attention mechanism into SRResNet [4], as shown in Fig. 4. For comparison, we train SRResNet with and without attention from scratch, using the same training sets and training details. The performance curves are illustrated in Fig. 6: SRResNet with the attention producing network converges better, and its test performance is higher than that of SRResNet without it. After training, SRResNet with attention is 0.15 dB higher in PSNR on average than SRResNet without attention, tested on the four publicly available benchmark datasets with upscaling factor 3. We also integrate the attention mechanism into VDSR [7] and DRCN [8]. From Fig. 6, we can conclude that VDSR and DRCN converge better when the attention mechanism is added; in terms of PSNR, we obtain improvements of 0.093 dB and 0.18 dB for VDSR and DRCN respectively. Thus, the effectiveness of the attention mechanism is verified.

Fig. 7: Comparison of our model with other works on ×3 SR. For each example, results are shown for Bicubic, SRCNN, DRCN, VDSR, the proposed method, and HR. The images are from Set14 [35] and BSD100 [36].

IV-E Comparison with State-of-the-Art Methods

In this section, we compare the results of our model with other state-of-the-art methods including SelfExSR [32], SRCNN [13], A+ [33], VDSR [7] and DRCN [8]. Qualitative results of these methods are presented in Fig. 7. The results of SRCNN, DRCN and VDSR lack high frequency details and look blurry; in contrast, our method outputs high resolution images with high frequency details that are perceptually similar to the ground truth.

The quantitative results in terms of PSNR and SSIM are presented in Table I. For fair comparison, we evaluate on the y channel only, because human vision is more sensitive to details in intensity than in color, and we ignore as many pixels from the image boundary as the scale factor in order to eliminate the effect of zero-padding in the convolution layers. MATLAB functions are used for the evaluation. Our method with the attention producing network achieves the best results in terms of both PSNR and SSIM. On average, the proposed method improves PSNR by 0.54 dB over VDSR and by 0.52 dB over DRCN, and its SSIM is also higher than that of the other methods. On the whole, our method achieves superior results over many existing state-of-the-art super resolution methods, especially on images with rich texture information.
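The evaluation protocol can be sketched as follows; the BT.601 luma conversion matching MATLAB's rgb2ycbcr is an assumption on our part, as the paper only states that MATLAB functions are used:

```python
import numpy as np

def psnr_y(sr, hr, scale):
    """PSNR on the y (luminance) channel, cropping `scale` pixels from each
    border to remove zero-padding effects. sr, hr: uint8 RGB arrays."""
    def rgb_to_y(img):
        img = img.astype(np.float64)
        # ITU-R BT.601 luma, as in MATLAB's rgb2ycbcr for 8-bit images
        return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                       + 24.966 * img[..., 2]) / 255.0
    y_sr = rgb_to_y(sr)[scale:-scale, scale:-scale]
    y_hr = rgb_to_y(hr)[scale:-scale, scale:-scale]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```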

V Conclusion

We have proposed an attention-based approach to discriminate between texture areas and smooth areas. Once the high frequency details are located, the attention mechanism works as a feature selector which enhances high frequency features and suppresses noise in smooth areas; our method thus avoids recovering high frequency details blindly. We integrate the mechanism into SISR networks including SRResNet, VDSR and DRCN, and the performance of all these networks improves, verifying the effectiveness of the attention mechanism. As for the feature reconstruction network, we propose the DenseRes block, which provides an efficient way to combine low level features and high level features. By cascading multiple DenseRes blocks, our network has a large receptive field, so useful contextual information from large regions of the LR image is captured to recover the high frequency details in the HR image. Our method achieves the best performance compared with state-of-the-art methods. In the future, we will explore applications of the attention mechanism in video super resolution to generate visually and quantitatively high quality results.

VI Acknowledgments

This work was supported in part by the Natural Science Foundation of Jiangsu Province under Grant BK20151102, in part by the Ministry of Education Key Laboratory of Machine Perception, Peking University under Grant K-2016-03, in part by the AI lab of Southeast University, in part by the Open Project Program of the Ministry of Education Key Laboratory of Underwater Acoustic Signal Processing, Southeast University under Grant UASP1502, and in part by the Natural Science Foundation of China under Grant 61673108.

References

  • [1] K. He, X. Zhang, S. Ren et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [2] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision (ECCV), 2016, pp. 694–711.
  • [3] M. S. Sajjadi, B. Schölkopf, and M. Hirsch, “Enhancenet: Single image super-resolution through automated texture synthesis,” arXiv preprint arXiv:1612.07919, 2016.
  • [4] C. Ledig, L. Theis, F. Huszár et al., “Photo-realistic single image super-resolution using a generative adversarial network,” arXiv preprint arXiv:1609.04802, 2016.
  • [5] G. Huang, Z. Liu, L. v. d. Maaten et al., “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 2261–2269.
  • [6] O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation.   Springer International Publishing, 2015.
  • [7] J. Kim, J. K. Lee, K. M. Lee et al., “Accurate image super-resolution using very deep convolutional networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
  • [8] J. Kim, J. Kwon Lee, and K. Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1637–1645.
  • [9] C. E. Duchon, “Lanczos filtering in one and two dimensions,” in Journal of Applied Meteorology, vol. 18, no. 8, 1979, pp. 1016–1022.
  • [10] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” in IEEE Transactions on Signal Processing, vol. 54, no. 11, Nov 2006, pp. 4311–4322.
  • [11] J. Yang, J. Wright, T. Huang et al., “Image super-resolution as sparse representation of raw image patches,” in Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
  • [12] Y. Xie, W. Zhang, C. Li et al., “Discriminative object tracking via sparse representation and online dictionary learning,” in IEEE transactions on cybernetics, vol. 44, no. 4.   IEEE, 2014, pp. 539–553.
  • [13] C. Dong, C. C. Loy, K. He et al., “Learning a deep convolutional network for image super-resolution,” in ECCV, 2014, pp. 184–199.
  • [14] B. Lim, S. Son, H. Kim et al., “Enhanced deep residual networks for single image super-resolution,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 1132–1140.
  • [15] R. Timofte, E. Agustsson, L. V. Gool et al., “Ntire 2017 challenge on single image super-resolution: Methods and results,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 1110–1121.
  • [16] W. Shi, J. Caballero, F. Huszár et al., “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 1874–1883.
  • [17] W. Chan, N. Jaitly, Q. V. Le et al., “Listen, attend and spell,” Computer Science, 2015.
  • [18] J. Chorowski, D. Bahdanau, D. Serdyuk et al., “Attention-based models for speech recognition,” Computer Science, 2015.
  • [19] J. Hou, S. Zhang, and L. Dai, “Gaussian prediction based attention for online end-to-end speech recognition,” Proc. Interspeech 2017, pp. 3692–3696, 2017.
  • [20] A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
  • [21] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” Computer Science, 2014.
  • [22] B. Sankaran, H. Mi, Y. Alonaizan et al., “Temporal attention model for neural machine translation,” arXiv:1608.02927, 2016.
  • [23] K. Gregor, I. Danihelka, A. Graves et al., “Draw: a recurrent neural network for image generation,” Computer Science, pp. 1462–1471, 2015.
  • [24] E. Mansimov, E. Parisotto, J. L. Ba et al., “Generating images from captions with attention,” Computer Science, 2015.
  • [25] K. Xu, J. Ba, R. Kiros et al., “Show, attend and tell: Neural image caption generation with visual attention,” Computer Science, pp. 2048–2057, 2015.
  • [26] L. Yao, A. Torabi, K. Cho et al., “Describing videos by exploiting temporal structure,” in Proceedings of the IEEE international conference on computer vision (ICCV), vol. 53, 2015, pp. 199–211.
  • [27] P. Zhang, W. Liu, H. Lu et al., “Salient object detection by lossless feature reflection,” arXiv preprint arXiv:1802.06527, 2018.
  • [28] G. Li, Y. Xie, and L. Lin, “Weakly supervised salient object detection using image labels,” arXiv preprint arXiv:1803.06503, 2018.
  • [29] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
  • [30] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Computer Science, 2015.
  • [31] W. Yang, J. Feng, J. Yang et al., “Deep edge guided recurrent residual learning for image super-resolution,” IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5895–5907, Dec 2017.
  • [32] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5197–5206.
  • [33] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Asian Conference on Computer Vision (ACCV).   Springer, 2014, pp. 111–126.
  • [34] M. Bevilacqua, A. Roumy, C. Guillemot et al., “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in BMVC, 6 2012.
  • [35] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International conference on curves and surfaces.   Springer, 2010, pp. 711–730.
  • [36] D. Martin, C. Fowlkes, D. Tal et al., “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in ICCV, vol. 2, 2001, pp. 416–423.
  • [37] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
  • [38] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2014.
  • [39] M. Abadi, A. Agarwal, P. Barham et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.