It has been shown that capturing long-range dependencies in deep neural networks improves their performance, especially for image processing. Previously, long-range dependencies in images were captured through the large receptive fields formed by deep stacks of convolutional operations [Fukushima1980, LeCun et al.1989]. However, since a convolutional operation only attends to a local neighborhood, long-range dependencies can only be captured by applying it repeatedly. Repeating such local operations has two main limitations. First, it makes optimization difficult [Hochreiter and Schmidhuber1997, He et al.2016]. Second, it is computationally inefficient.
Recently, Wang et al. [Wang et al.2018] applied non-local operations, derived from the non-local means filter [Buades et al.2005], to efficiently capture long-range dependencies in deep neural networks. A non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. Zhang et al. [Zhang et al.2018a] proposed the Self-Attention Generative Adversarial Network (SAGAN), which uses a non-local operation to implement a self-attention module, and obtained state-of-the-art results on the ImageNet dataset [Russakovsky et al.2015]. SAGAN is the first work to combine Generative Adversarial Networks with the Self-Attention (SA) mechanism, opening a new direction in computer vision, especially for image reconstruction. However, the SA module of SAGAN has two limitations:
The SA module is hard to employ on larger datasets with higher-dimensional inputs, since it has a space complexity of O(N²), where N is the number of spatial elements in the feature map. This limits many applications of self-attention in computer vision. For example, according to [Brock et al.2018], the performance of generative tasks correlates positively with the training batch size, and a high space complexity restricts how far the batch size can be increased.
It also has a time complexity of O(N²). Although it improves the quality of image generation, the self-attention module incurs a large time cost in both the training and testing phases.
In this paper, we propose a new self-attention module that overcomes these limitations. Compared with the original module, in theory the new self-attention module has space and time complexity of O(N) instead of O(N²); in practice, substantial time and memory can be saved (the amount depends on the dimensions of the input data and the network structure) while obtaining performance comparable to the vanilla SA. Further, the proposed self-attention mechanism can be introduced into GANs or other deep neural networks to reduce their computational costs. Our contributions include:
We implement a new self-attention module with a time and space complexity of O(N).
We analyze the proposed module from the viewpoint of channel attention [Hu et al.2018] and compare the two.
We provide two experiments to verify the performance of the proposed module for image reconstruction.
2 Related Works
2.1 Self-attention (SA)
The advantages of self-attention in capturing global dependencies have made attention mechanisms an integral part of modern models [Bahdanau et al.2014, Xu et al.2015, Yang et al.2016, Gregor et al.2015]. In particular, self-attention [Cheng et al.2016, Parikh et al.2016] computes the response at a position as a weighted sum of the features at all positions in the input feature maps. By adding a self-attention module to an autoregressive model for image generation, Parmar et al. [Parmar et al.2018] proposed the Image Transformer. Wang et al. [Wang et al.2018] formalized self-attention as non-local operations inspired by the non-local means filter [Buades et al.2005]. Based on non-local operations, Zhang et al. [Zhang et al.2018a] presented the Self-Attention Generative Adversarial Network (SAGAN), which generates images on ImageNet [Russakovsky et al.2015] and improves the Fréchet inception distance (FID) from the previous state of the art of 27.62 to 18.65.
2.2 Self-Attention Module in SAGAN
The self-attention module in SAGAN, shown in Figure 1(a), is based on non-local neural networks, which are inspired by non-local means filters. Following that baseline, the non-local operation can be defined generally as

    O = (1/𝒞(X)) f(θ(X), φ(X)) g(X),   (1)

where X ∈ ℝ^{C×N} is the matrix of image features, and C, N denote the number of channels and the number of elements per channel, respectively. θ(X) = W_θX and φ(X) = W_φX are two embeddings of X, and W_θ and W_φ are their embedding matrices. O is the output signal. The pairwise function f computes a relationship between the i-th and the j-th elements of X, the unary function g represents the features of X, and the result is normalized by a factor 𝒞(X). The self-attention mechanism is a special case of the embedded non-local operation, implemented as

    O = g(X) softmax(θ(X)ᵀφ(X))ᵀ,   (2)

where g(X) = W_gX is a linear feature transform of X. In this case, f can be seen as the exponentiated dot product exp(θ(x_i)ᵀφ(x_j)), and for normalized embeddings the product θ(x_i)ᵀφ(x_j) becomes the cosine similarity.
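As a concrete illustration, here is a minimal NumPy sketch of this vanilla self-attention. The weight names and the toy sizes (C = 8, N = 16) are our own illustrative choices, not the paper's code; the point is that the N × N attention map is materialized explicitly:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_self_attention(x, w_theta, w_phi, w_g):
    """x: (C, N) feature map flattened over spatial positions.
    w_*: (C, C) weights of the 1x1 convolutions (plain matrices here)."""
    theta = w_theta @ x                       # (C, N)
    phi   = w_phi @ x                         # (C, N)
    g     = w_g @ x                           # (C, N)
    attn  = softmax(theta.T @ phi, axis=-1)   # (N, N) attention map: O(N^2) memory
    return g @ attn.T                         # (C, N) output

C, N = 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((C, N))
w_theta, w_phi, w_g = (rng.standard_normal((C, C)) for _ in range(3))
out = vanilla_self_attention(x, w_theta, w_phi, w_g)
print(out.shape)
```

Each output column is a weighted sum of all N columns of g(X), with the weights given by one row of the N × N softmax map.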
3 Improved Self-Attention Module
Using SA modules, SAGAN reduces the FID from 27.62 to 18.65 on the challenging ImageNet dataset. These results show that the SA mechanism has great potential for image reconstruction. However, the time and space complexity of the SA module of SAGAN are both O(N²) (analyzed in Section 4). The space complexity is especially problematic: if we enlarge the resolution of the generated images from 128 × 128 to 256 × 256, the attention map grows from (128²)² to (256²)² entries, a 16-fold increase. Such heavy consumption of computing resources seriously limits the applications of SA, even with the help of GPUs.
3.1 The Novel Self-attention Module
As Figure 1(a) shows, the reason the SA module of SAGAN consumes so much memory and time lies in the computation of the attention map, which relates every pair of elements of its input. Hence, the key to reducing the computational complexity is to change how the attention map is computed. Inspired by the associativity of matrix multiplication, we observe that if we first calculate the product g(X)φ(X)ᵀ in Equation (2), we obtain a C × C matrix rather than an N × N matrix. In convolutional operations, C is a hyper-parameter, and generally C ≪ N, so this reordering reduces the computational complexity substantially. However, softmax is not a linear function, so we first need to replace it with a linear normalization.
Although recent self-attention modules mostly take softmax as the normalization factor, [Wang et al.2018] use two alternative versions of non-local operations to show that the nonlinear attentional behavior is not essential, and verify experimentally that these versions give comparable results on video classification and image recognition. We therefore employ one of these versions, with 𝒞(X) = N, to rewrite the self-attention module as:

    O = (1/N) g(X) (θ(X)ᵀφ(X))ᵀ.   (3)
We rewrite Equation (3) further using the associativity of matrix multiplication:

    O = (1/N) (g(X)φ(X)ᵀ) θ(X),   (4)
where g(X)φ(X)ᵀ ∈ ℝ^{C×C} has a space complexity of only O(C²). Following Equation (4), we design a network whose structure is shown in Figure 1(b). The structures in Figure 1(b) and Figure 1(a) are different and also carry different meanings, which we analyze in the following section.
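The reordering can be checked numerically. The following sketch (again with hypothetical weight names and toy sizes) computes Equation (4) without ever materializing the N × N map, and verifies by associativity that the result equals the quadratic-cost evaluation order of Equation (3):

```python
import numpy as np

def linear_self_attention(x, w_theta, w_phi, w_g):
    """Proposed module: the 1/N normalization lets us reassociate the product.
    x: (C, N); w_*: (C, C)."""
    n = x.shape[1]
    theta, phi, g = w_theta @ x, w_phi @ x, w_g @ x
    # g @ phi.T is only (C, C) -- the N x N attention map is never built.
    return (g @ phi.T) @ theta / n            # (C, N)

rng = np.random.default_rng(0)
C, N = 8, 16
x = rng.standard_normal((C, N))
w_theta, w_phi, w_g = (rng.standard_normal((C, C)) for _ in range(3))

fast = linear_self_attention(x, w_theta, w_phi, w_g)
# Equation (3): build the N x N map first, then multiply.
slow = (w_g @ x) @ ((w_theta @ x).T @ (w_phi @ x)).T / N
assert np.allclose(fast, slow)
```

Both orders give identical outputs; only the size of the intermediate product changes, from N × N to C × C.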
3.2 The Principle Behind the Module
In Figure 1, we use a schematic diagram to compare the two self-attention modules. Generally, C ≤ N. For simplicity, we keep the number of channels invariant under the transforms, i.e., the number of channels is not divided by 8. The explanation starts from calculating a single element of the output matrix O in the two different ways.
where θ_{i,·}, φ_{i,·} represent the row vectors consisting of the elements in the i-th rows of θ(X), φ(X), respectively, and θ_{·,j}, φ_{·,j} denote the column vectors consisting of the elements in the j-th columns of θ(X), φ(X), respectively.
For the original SA module, calculating an element of O amounts to computing cosine similarities between one element and all elements of the input. In Equation (5), the term θ_{·,j}ᵀφ_{·,i} represents the similarity between the j-th and the i-th elements: each output element is a weighted sum of all elements, with weights given by this similarity, defined as the product of two normalized vectors. The matrix θ(X)ᵀφ(X) is thus the weight matrix whose (j, i)-th entry is the similarity between the j-th and the i-th elements.
In the proposed module, however, the map g(X)φ(X)ᵀ is produced by plain inner products, so its entries cannot represent cosine similarities between elements. Proceeding as for the attention map above, we calculate an element of O in the proposed module. Through Equation (6), the outputs of the proposed module are formed by a weighted sum over channels, with weights that play a role analogous to the cosine similarity. That is, the proposed module computes the similarity of every two channels across all elements rather than of every two elements across all channels. The principle behind the proposed module is therefore similar to that of channel attention modules [Hu et al.2018, Zhang et al.2018b], which make the network focus on important channels. To explain this more clearly, we first transform Equation (6):
where the vectors consist of the elements of the corresponding channels. Since θ(X), φ(X), g(X) are produced by 1 × 1 convolutions, we can obtain:
where W_θ, W_φ, W_g denote the weights of the corresponding 1 × 1 convolutional operations. Collecting the learnable terms into a single weight per channel, we obtain:
Equation (9) shows that the proposed module assigns a weight to every channel so that important channels receive more focus. Each weight is composed of two parts: the first part is calculated from global information, and the second part consists of learnable parameters.
3.3 Comparison with Channel Attention
Channel attention (CA) was first proposed in [Hu et al.2018]; it generates a different score for each channel-wise feature. As shown in Figure 2(a), the original channel attention uses global average pooling to obtain global information, defined as

    z_c = (1/N) Σ_{j=1}^{N} x_{c,j},   (10)

where z ∈ ℝ^{C} represents the global information captured by the CA module. This information is then processed by two non-linear transformations to obtain a score s_c for each channel:

    s = σ(W₂ δ(W₁ z)),   (11)

where δ is the ReLU function, σ is the sigmoid function, and W₁, W₂ are the weights of two fully connected layers. Finally, the initial feature map is rescaled channel-wise by s:

    x̃_{c,j} = s_c · x_{c,j}.   (12)
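For comparison with the sketches above, here is a minimal SE-style channel attention in the same NumPy notation (the reduction ratio r = 2 and the weight names are illustrative assumptions, not values from [Hu et al.2018]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_channel_attention(x, w1, w2):
    """Squeeze-and-Excitation-style channel attention (a sketch).
    x: (C, N) flattened feature map; w1: (C//r, C); w2: (C, C//r)."""
    z = x.mean(axis=1)                          # squeeze: global average pool -> (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # excitation: FC-ReLU-FC-sigmoid -> (C,)
    return x * s[:, None]                       # rescale each channel by its score

rng = np.random.default_rng(0)
C, N, r = 8, 16, 2
x = rng.standard_normal((C, N))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
out = se_channel_attention(x, w1, w2)
print(out.shape)
```

Like Equation (9), each channel is scaled by a scalar weight; the difference lies in how that weight is derived from the global information.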
The CA module in [Hu et al.2018] and the proposed module have a similar purpose: both assign a learnable weight to each channel. There are two differences between them. The first lies in how global information is obtained: the original CA module uses global pooling to aggregate all elements of a channel, while our module computes the relationship between every pair of channels. Second, the original module is nonlinear, whereas ours is linear; the nonlinearity of the original module comes from its nonlinear activation function. The linearity of our module does not impair its effectiveness, however: a network generally uses more than one SA module (as in Figure 3), and the nonlinear layers between those modules provide the nonlinear transformations.
4 Complexity of Computation
As shown in Figure 1(a), the original self-attention module computes an attention map that explicitly represents the relationship between every two positions in the convolutional feature map, so the attention map has N² elements. Since the size of the attention map grows quadratically with the number of elements N of the image feature, the space complexity is O(N²). The time complexity of the original module is also O(N²), since it multiplies the attention map by the convolutional feature map, whose dimensions are N × N and C × N, respectively.
Our self-attention module has space and time complexity of O(N). According to Figure 1(b), there are two multiplications. The first, between a C × N matrix and an N × C matrix, produces a C × C matrix, which is then multiplied by a C × N matrix. Hence, treating C as a constant, the time and memory costs grow linearly with N, and both the time and space complexity are O(N).
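A back-of-the-envelope calculation makes the gap concrete. Assuming 4-byte floats (our assumption), the following sketch compares the memory needed to store the vanilla N × N attention map against the proposed C × C matrix:

```python
def attention_map_bytes(h, w, channels, variant):
    """Bytes needed to store the intermediate map, assuming 4-byte floats."""
    n = h * w  # number of spatial elements N
    elems = n * n if variant == "vanilla" else channels * channels
    return 4 * elems

# A 128 x 128 feature map with 64 channels:
print(attention_map_bytes(128, 128, 64, "vanilla"))   # 1073741824 bytes (~1 GiB)
print(attention_map_bytes(128, 128, 64, "proposed"))  # 16384 bytes (16 KiB)
```

At this resolution the vanilla map alone costs about 1 GiB per attention layer per sample, while the C × C product is a few kilobytes.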
By reducing memory and computation time, the self-attention mechanism can be applied to image reconstruction tasks such as image completion and super-resolution, which demand relatively large computational resources.
5 Experiments

We provide two experiments to measure and compare the proposed module against the original one. The first is an ablation study that applies our module to complete images with large margins missing, in order to measure its effectiveness at capturing long-range dependencies. The second concerns image generation, in order to show that the refined module obtains results comparable to the original module in SAGAN while consuming less memory and running time. Both experiments are carried out on a platform with an NVIDIA GTX 1080ti GPU, 32 GB RAM, and an i7-7700k CPU.
5.1 Ablation Study
This experiment verifies the ability of our self-attention module to capture long-range dependencies for image reconstruction. The task is to complete an image (e.g. Figure 4) from which a quarter has been cut off on both the left and the right side (e.g. Figure 4(b)). We observe the effect of replacing two convolutional layers of a standard GAN with our SA module (illustrated in Figure 3).
5.1.1 Details of Implementation
The network used is a basic super-resolution generative adversarial network (SRGAN) [Ledig et al.2017]. All settings are inherited from [Ledig et al.2017], except that we re-implement the generator structure as in Figure 3 and add spectral normalization [Miyato et al.2018] to every convolutional layer of the generator to stabilize training. The dataset used is a simple coast dataset (available at http://cvcl.mit.edu/scenedatabase/coast.zip), whose images are resized to 256 × 256 in pre-processing.
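Spectral normalization divides each weight matrix by an estimate of its largest singular value. A minimal NumPy sketch via power iteration follows (the function name and iteration count are our own; [Miyato et al.2018] additionally carry the estimate vector across training steps instead of restarting it):

```python
import numpy as np

def spectral_normalize(w, n_iters=50, eps=1e-12):
    """Divide a weight matrix by its largest singular value,
    estimated by power iteration as in spectral normalization."""
    u = np.ones(w.shape[0])
    for _ in range(n_iters):
        v = w.T @ u
        v /= (np.linalg.norm(v) + eps)   # right singular vector estimate
        u = w @ v
        u /= (np.linalg.norm(u) + eps)   # left singular vector estimate
    sigma = u @ w @ v                    # approximate spectral norm
    return w / sigma

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 6))
w_sn = spectral_normalize(w)
# After normalization, the spectral norm is approximately 1:
print(np.linalg.svd(w_sn, compute_uv=False)[0])
```

Constraining every layer's spectral norm to 1 bounds the Lipschitz constant of the discriminator (and here the generator), which is what stabilizes GAN training.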
5.1.2 Experimental Results
Reconstructing an image with large margins missing cannot achieve satisfactory results if the network struggles to capture long-range dependencies, as shown in Figure 4(c). This experiment inspects whether our SA module still captures long-range information after the structural transformation. Figure 4(d) shows the results with our SA module. Without it, the regions to be completed receive hardly any valid signal from the remaining regions (Figure 4(c)). The experiment thus verifies that our SA module inherits this ability from the original self-attention module.
5.2 Generative Experiments
Furthermore, we compare the performance of the two modules in a real computational environment. To ensure a fair comparison, we choose three networks that differ only in the implementation of the SA module: the self-attention generative adversarial network (SAGAN), the improved self-attention generative adversarial network (ISAGAN), and a standard generative adversarial network (SGAN). SAGAN uses its original implementation from [Zhang et al.2018a] (code available at https://github.com/heykeetae/Self-Attention-GAN), ISAGAN uses the structure of SAGAN but replaces the SA modules with the proposed one, and SGAN replaces the SA modules with convolutional layers as a baseline. We train the three networks on two benchmark datasets and evaluate them by the Fréchet inception distance (FID) [Heusel et al.2017] (lower is better). Generative tasks are usually evaluated by both the Inception score [Salimans et al.2016] and FID, but according to [Barratt and Sharma2018], the Inception score is misleading when a generative network is not trained on ImageNet.
5.2.1 Details of Implementation
All the generative models are designed to generate 64 × 64 images. By default, the batch size is 64, and the other hyper-parameters of the discriminators, generators, and optimizers are inherited from SAGAN.
5.2.2 Experimental Results
The results of the three networks (SAGAN, ISAGAN, SGAN) are tabulated in Table 1. Compared with SGAN, SAGAN and ISAGAN achieve comparable effects. Concretely, on the human-face dataset, both SAGAN and ISAGAN degrade generative quality, since generating faces may depend more on local than on global features, so the advantages of the self-attention mechanism do not help and may even interfere. In contrast, on the multi-class dataset, the two self-attention modules improve quality thanks to their long-range dependencies. From the generative experiments and the derivation above, we infer that the two self-attention modules are comparable in effectiveness.
Furthermore, we evaluate the time and space costs of the proposed module. Table 2 shows the time spent in forward propagation (per 30k images) and backward propagation (per 10 × batch-size images).
| time (in seconds) | SAGAN | ISAGAN | SGAN |
ISAGAN is faster than SAGAN in both propagations mainly because our self-attention module avoids the large-scale matrix multiplication: in forward propagation, that multiplication involves more units, and in backward propagation, it requires differentiating through a larger computational graph. As for memory usage, Table 3 shows the training of SAGAN and ISAGAN with different batch sizes, where 'no' means the module cannot be run in our environment and 'ok' means it can.
Since larger batch sizes require more memory for training, we can also infer that ISAGAN uses less memory than SAGAN.
6 Conclusion

We improve the original self-attention module to reduce its time and space complexity. Owing to its lower memory consumption, our self-attention module can be used in image reconstruction, which often requires processing higher-dimensional data. Theoretically, our SA module is a special kind of channel attention mechanism. Experimental results verify that our self-attention module obtains effects comparable to the vanilla one while using less time and memory. In future work, we will apply the proposed self-attention module to other deep learning tasks beyond image reconstruction.
- [Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- [Barratt and Sharma2018] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
- [Brock et al.2018] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
- [Buades et al.2005] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 60–65. IEEE, 2005.
- [Cheng et al.2016] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.
- [Fukushima1980] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, Apr 1980.
- [Gregor et al.2015] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. Draw: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, pages 1462–1471. JMLR.org, 2015.
- [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
- [Heusel et al.2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
- [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [Hu et al.2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
- [Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [LeCun et al.1989] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
- [Ledig et al.2017] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
- [Miyato et al.2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
- [Parikh et al.2016] Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933, 2016.
- [Parmar et al.2018] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
- [Russakovsky et al.2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
- [Salimans et al.2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
- [Wang et al.2018] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
- [Xu et al.2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
- [Yang et al.2016] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 21–29, 2016.
- [Zhang et al.2018a] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
- [Zhang et al.2018b] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018.