Less Memory, Faster Speed: Refining Self-Attention Module for Image Reconstruction

by Zheng Wang, et al.
Beijing Institute of Technology

Self-attention (SA) mechanisms can effectively capture global dependencies in deep neural networks and have been applied successfully to natural language processing and image processing. However, SA modules for image reconstruction have high time and space complexity, which restricts their application to higher-resolution images. In this paper, we refine the SA module in self-attention generative adversarial networks (SAGAN) by adapting a non-local operation, revising the connectivity among the units in the SA module, and re-implementing its computational pattern, such that its time and space complexity is reduced from O(n^2) to O(n) while it remains equivalent to the original SA module. Further, we explore the principles behind the module and find that our module is a special kind of channel attention mechanism. Experimental results on two benchmark datasets for image reconstruction verify that, under the same computational environment, the two models achieve comparable effectiveness for image reconstruction, but the proposed one runs faster and takes up less memory.



1 Introduction

Capturing long-range dependencies in deep neural networks has been shown to improve their performance, especially in image processing. Previously, long-range dependencies in images were captured by large receptive fields formed by deep stacks of convolutional operations [Fukushima1980, LeCun et al.1989]. However, since a convolutional operation only attends to a local neighborhood, long-range dependencies can only be captured by applying such operations repeatedly. Repeating local operations has two main limitations. First, it makes optimization difficult [Hochreiter and Schmidhuber1997, He et al.2016]. Second, it is computationally inefficient.

Recently, Wang et al. [Wang et al.2018] applied non-local operations, based on the non-local means operation [Buades et al.2005], to capture long-range dependencies efficiently in deep neural networks. A non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. Zhang et al. [Zhang et al.2018a] proposed Self-Attention Generative Adversarial Networks (SAGAN), which use one of the non-local operations to implement self-attention modules. SAGAN obtains state-of-the-art results on the ImageNet dataset [Russakovsky et al.2015]. SAGAN is the first to combine Generative Adversarial Networks with the Self-Attention (SA) mechanism, and it offers a new approach to computer vision tasks, especially image reconstruction. However, the SA module of SAGAN has two limitations:

  1. The SA module is hard to apply to larger datasets with higher dimensions, since it has a space complexity of O(n^2). This limits many applications of self-attention in computer vision. For example, according to [Brock et al.2018], the performance of generative tasks is positively correlated with the training batch size, and a high space complexity restricts how far the batch size can be increased.

  2. It also has a time complexity of O(n^2). Although the self-attention module improves the quality of image generation, it incurs large time costs in both the training and testing phases.

In this paper, we propose a new self-attention module to overcome these limitations. Compared with the original module, the new self-attention module has, in theory, a space and time complexity of O(n) instead of O(n^2), and in practice it saves time and memory (by an amount that depends on the dimensions of the input data and the structure of the network), while obtaining performance comparable to the vanilla SA. Further, the proposed self-attention mechanism can be introduced into GANs or other deep neural networks to reduce their computational costs. Our contributions include:

  1. We implement a new self-attention module which has a time and space complexity of O(n).

  2. We analyze the proposed module from the viewpoint of channel attention [Hu et al.2018] and compare the two.

  3. We provide two experiments to verify the performance of the proposed module for image reconstruction.

2 Related Works

2.1 Self-attention (SA)

The advantages of self-attention in capturing global dependencies have made attention mechanisms an integral part of many models [Bahdanau et al.2014, Xu et al.2015, Yang et al.2016, Gregor et al.2015]. In particular, self-attention [Cheng et al.2016, Parikh et al.2016] computes the response at a position as a weighted sum of the features at all positions in the input feature maps. By adding a self-attention module to an autoregressive model for image generation, Parmar et al. [Parmar et al.2018] propose the image transformer. Wang et al. [Wang et al.2018] formalize self-attention as non-local operations inspired by the non-local means filter [Buades et al.2005]. Based on non-local operations, Zhang et al. [Zhang et al.2018a] present the Self-Attention Generative Adversarial Network (SAGAN) to generate images based on ImageNet [Russakovsky et al.2015] and obtain a Fréchet Inception Distance (FID) of 18.65, compared with the previous state-of-the-art 27.62.

2.2 Self-Attention Module in SAGAN

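The self-attention module in SAGAN, shown in Figure 1(a), is based on non-local neural networks, which are inspired by non-local means filters. Following the baseline, the non-local operation can be defined generally as:

$$y_i = \frac{1}{C(X)} \sum_{j=1}^{n} f(x_i, x_j)\, g(x_j) \qquad (1)$$

where $X \in \mathbb{R}^{c \times n}$ is the matrix of image features, $x_j$ is its $j$th column, and $c$, $n$ denote the numbers of channels and elements of one channel, respectively. $\theta = W_\theta X$, $\phi = W_\phi X$ are two embeddings of $X$ used inside $f$, and $W_\theta$ and $W_\phi$ are their embedding matrices. $Y$ is the output signal, whose $i$th row is $y_i$. The pairwise function $f$ computes a relationship between the $i$th and the $j$th elements of $X$. The unary function $g$ represents features of $x_j$, and the equation is normalized by a factor $C(X)$. The self-attention mechanism is a special case of this embedded non-local operation. Its implementation is:

$$Y = \mathrm{softmax}(\theta^{\top} \phi)\, g \qquad (2)$$

where $g = (W_g X)^{\top} \in \mathbb{R}^{n \times c}$ is a linear feature transform of $X$. In this case, the normalization by $C(X)$ can be seen as the softmax function, and $f$ becomes the cosine similarity.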

3 Improved Self-Attention Module

Using SA modules, SAGAN reduces the FID from 27.62 to 18.65 on the challenging ImageNet dataset. This result shows that the SA mechanism has great potential in image reconstruction. However, both the time and the space complexity of the SA module of SAGAN are O(n^2) (see Section 4). The space complexity is especially limiting: if we enlarge the resolution of the generated images from 128 * 128 to 256 * 256, the number of elements n quadruples, so the memory consumed by the attention map grows 16-fold. Such heavy consumption of computing resources seriously limits the applications of SA, even with the help of GPUs.

3.1 The Novel Self-attention Module

From Figure 1(a), the reason why the SA module of SAGAN consumes so much memory and time lies in the computation of the attention map $\theta^{\top}\phi$. The attention map relates every pair of elements in its input. Hence, the key to reducing the computational complexity is to change the way the attention map is computed. Inspired by the associativity of matrix multiplication, we find that if we first calculate $\phi g$ in Equation (2), we obtain a $c \times c$ matrix rather than an $n \times n$ matrix. In convolutional operations, $c$ is a hyper-parameter and generally $c \ll n$. Obviously, the revised computation can reduce the computational complexity to a large extent. However, since softmax is not a linear function, we need to replace it with a linear function.

(a) The self-attention module of SAGAN
(b) The proposed self-attention module
Figure 1: The structures of the two kinds of SA modules. One symbol in the diagrams denotes matrix multiplication and another denotes scalar division.

Following [Wang et al.2018], although recent self-attention modules mostly take softmax as the normalization factor, [Wang et al.2018] uses two alternative versions of non-local operations to show that the nonlinear attentional behavior is not essential. Their experiments further verify that the results of those versions are comparable in video classification and image recognition. Thus we can employ one of the versions to rewrite the self-attention module:

$$Y = \frac{1}{n}\,(\theta^{\top} \phi)\, g \qquad (3)$$

We continue to rewrite Equation (3) using the associativity of matrix multiplication:

$$Y = \frac{1}{n}\,\theta^{\top} (\phi\, g) \qquad (4)$$

where $\phi g$ has a space complexity of $O(c^2)$. Following Equation (4), we design a network whose structure is shown in Figure 1(b). The structures in Figure 1(b) and Figure 1(a) are different and also have different meanings, which are analyzed in the following section.
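The reordering behind Equation (4) can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the shapes, the random inputs, and the names `theta`, `phi`, `g`, `sa_quadratic` and `sa_linear` are assumptions made for illustration, with the two embeddings stored as channels-by-positions matrices and `g` as a positions-by-channels matrix.

```python
import numpy as np

def sa_quadratic(theta, phi, g):
    """Original ordering: build the n x n attention map first."""
    n = theta.shape[1]
    attention = theta.T @ phi           # shape (n, n), quadratic in n
    return attention @ g / n            # shape (n, c)

def sa_linear(theta, phi, g):
    """Reordered: build the small channel-by-channel matrix first."""
    n = theta.shape[1]
    channel_map = phi @ g               # shape (c, c), independent of n
    return theta.T @ channel_map / n    # shape (n, c)

# Hypothetical sizes: c channels, n positions per channel.
rng = np.random.default_rng(0)
c, n = 8, 1024
theta = rng.standard_normal((c, n))
phi = rng.standard_normal((c, n))
g = rng.standard_normal((n, c))

out_quadratic = sa_quadratic(theta, phi, g)
out_linear = sa_linear(theta, phi, g)
print(np.allclose(out_quadratic, out_linear))
```

Both functions return the same output up to floating-point rounding; only the size of the intermediate matrix differs, which is where the savings come from.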

3.2 The Principle Behind the Module

In Figure 1, we use a schematic diagram to compare the two self-attention modules. Generally, the embeddings use $c/8$ channels. For simplicity, we assume that the number of channels stays unchanged by the transforms, i.e., it is not divided by 8, so that $\theta, \phi \in \mathbb{R}^{c \times n}$ and $g \in \mathbb{R}^{n \times c}$. The explanation starts from calculating an element $y_{ij}$ of the matrix $Y$ in two different ways:

$$y_{ij} = \frac{1}{n}\, a_i\, g_j = \frac{1}{n} \sum_{k=1}^{n} a_{ik}\, g_{kj} \qquad (5)$$

$$y_{ij} = \frac{1}{n}\, \bar{\theta}_i\, b_j = \frac{1}{n} \sum_{k=1}^{c} \theta_{ki}\, b_{kj} \qquad (6)$$

where $A = \theta^{\top}\phi$ and $B = \phi g$. $a_i$, $\bar{\theta}_i$ represent row vectors consisting of the elements in the $i$th row of $A$, $\theta^{\top}$, respectively. And $g_j$, $b_j$ denote column vectors consisting of the elements in the $j$th column of $g$, $B$, respectively.

For the original SA module, calculating an element of $Y$ amounts to computing the cosine similarities between one element and all elements of the input. In Equation (5), where $a_{ik}$ represents the cosine similarity between the $i$th element and the $k$th element, $y_{ij}$ is computed as a weighted sum over all elements, and the weights depend on the cosine similarity, which is defined as a product of two normalized vectors. Matrix $A$ is the weight matrix whose element $a_{ik}$ is the cosine similarity between the $i$th column of $\theta$ and the $k$th column of $\phi$.

However, the attention map $B$ of the proposed module is produced by inner products between channels, which means that $b_{kj}$ cannot represent the cosine similarity between the $k$th and the $j$th elements. As in the analysis of the attention map $A$, we also calculate an element of $Y$ in the proposed module. Through Equation (6), the results of the proposed module are formed by a weighted sum over all channels, and the weights depend on a similarity that corresponds to the cosine similarity. That is, the proposed module computes the similarity between every two channels over all elements (e.g. $b_{kj}$) rather than between every two elements over all channels (e.g. $a_{ik}$). The principle behind the proposed module is similar to that of channel attention modules [Hu et al.2018, Zhang et al.2018b], which make the network focus on important channels. To explain it more clearly, we first transform Equation (6):


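$$y_{ij} = \frac{1}{n} \sum_{k=1}^{c} \theta_{ki}\, (\phi^{k})^{\top} g^{j} \qquad (7)$$

where $\phi^{k}$, $g^{j}$ represent column vectors consisting of the elements in the $k$th channel of $\phi$ and the $j$th channel of $g$, respectively. Since $\phi$ and $g$ are produced by convolutional operations on $X$, we can obtain:

$$y_{ij} = \frac{1}{n} \sum_{k=1}^{c} \theta_{ki}\, (w_{\phi}^{k})^{\top} X X^{\top} w_{g}^{j} \qquad (8)$$

where $w_{\phi}^{k}$, $w_{g}^{j}$ denote the weights of the $k$th and $j$th convolutional operations that produce $\phi$ and $g$, respectively. Further letting $\alpha_{kj} = (w_{\phi}^{k})^{\top} X X^{\top} w_{g}^{j}$, where $X X^{\top} \in \mathbb{R}^{c \times c}$, we obtain:

$$y_{ij} = \frac{1}{n} \sum_{k=1}^{c} \alpha_{kj}\, \theta_{ki} \qquad (9)$$

Equation (9) shows that the proposed module assigns a weight to every channel so that important channels are emphasized. The weight is composed of two parts. The first part includes $X$ and $X^{\top}$, which are calculated from global information, and the second part includes $w_{\phi}^{k}$ and $w_{g}^{j}$, which are learnable parameters.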
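The split of the channel weights into a data-dependent part and a learnable part can be illustrated with a small NumPy sketch. It rests on assumptions that are not spelled out in the text: the embeddings are taken to be 1 x 1 convolutions without bias, written here as plain matrices `W_theta`, `W_phi` and `W_g`, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
c, n = 4, 50                         # hypothetical channel and position counts
X = rng.standard_normal((c, n))      # input features, channels x positions

# Assumed learnable 1x1-convolution weights (no bias), one row per output channel.
W_theta = rng.standard_normal((c, c))
W_phi = rng.standard_normal((c, c))
W_g = rng.standard_normal((c, c))

theta = W_theta @ X                  # (c, n)
phi = W_phi @ X                      # (c, n)
g = (W_g @ X).T                      # (n, c)

# Data-dependent part: channel-by-channel statistics gathered over all positions.
gram = X @ X.T                       # (c, c)

# Learnable part: the convolution weights wrapped around the statistics.
channel_weights = W_phi @ gram @ W_g.T

# Each output channel is a weighted sum of the channels of theta.
output = theta.T @ channel_weights / n
print(np.allclose(channel_weights, phi @ g))
```

The check confirms that the channel-by-channel map used by the reordered module equals the learnable weights applied on both sides of the matrix of channel statistics.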

3.3 Comparison with Channel Attention

(a) channel attention
(b) channel attention (ours)
Figure 2: Two channel attention modules.

Channel attention (CA) was first proposed in [Hu et al.2018]; it generates a different score for each channel-wise feature. As shown in Figure 2(a), the original channel attention uses global average pooling to obtain global information, which is defined as:

$$z_k = \frac{1}{n} \sum_{j=1}^{n} x_{kj} \qquad (10)$$

where $z_k$ represents the global information of the $k$th channel captured by the CA module. Then the information is processed by two non-linear transformations to obtain the score $s$ for each channel:

$$s = \sigma\big(W_2\, \delta(W_1 z)\big) \qquad (11)$$

where $\delta$ and $\sigma$ denote the ReLU and sigmoid functions, and $W_1$, $W_2$ are learnable weights. Finally, the initial feature map is multiplied by $s$:

$$\tilde{x}_k = s_k\, x_k \qquad (12)$$

where $x_k$ is the $k$th channel of the feature map. The CA module in [Hu et al.2018] and the proposed module have a similar purpose: both assign a learnable weight to each channel. There are two differences between the two CA modules. The first lies in the way global information is obtained. The original CA module uses global pooling to add up all elements of a channel, while our module computes the relationship between every pair of channels. Second, the original module is nonlinear, whereas our module is linear. The nonlinearity of the original module comes from its nonlinear activation functions. However, the linearity of our module does not harm its effectiveness. For instance, a network generally uses more than one SA module, as in Figure 3, and there are nonlinear layers among those modules that perform the nonlinear transformation.
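For comparison, the original channel attention of [Hu et al.2018] can be sketched in a few lines of NumPy. This is an illustrative sketch only: the reduction ratio `r`, the random weights `W1` and `W2`, and the function name `channel_attention` are assumptions, and biases are omitted.

```python
import numpy as np

def channel_attention(X, W1, W2):
    """Pool each channel, score it with two nonlinear layers, rescale the channel."""
    z = X.mean(axis=1)                               # global average pooling, shape (c,)
    hidden = np.maximum(W1 @ z, 0.0)                 # first transform with ReLU
    scores = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))    # second transform with sigmoid
    return scores[:, None] * X, scores               # one score per channel

rng = np.random.default_rng(2)
c, n, r = 8, 100, 4                                  # hypothetical sizes and reduction ratio
X = rng.standard_normal((c, n))
W1 = rng.standard_normal((c // r, c))
W2 = rng.standard_normal((c, c // r))

rescaled, scores = channel_attention(X, W1, W2)
print(scores.shape, rescaled.shape)                  # (8,) (8, 100)
```

Here each channel is summarized by a single pooled number before scoring, whereas the proposed module relates every pair of channels and involves no nonlinearity.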

4 Complexity of Computation

As shown in Figure 1(a), the original self-attention module computes an attention map that explicitly represents the relationship between any two positions in the convolutional feature map, so the attention map has $n \times n$ elements. Since the size of the attention map is quadratic in the number of elements of the image feature, the space complexity is O(n^2). The time complexity of the original module is also O(n^2), since it multiplies the attention map by the convolutional feature map, whose dimensions are $n \times n$ and $n \times c$, respectively.

Our self-attention module has a space and time complexity of O(n). According to Figure 1(b), there are two multiplications. The first, between a $c \times n$ matrix and an $n \times c$ matrix, produces a $c \times c$ matrix, which is then multiplied with an $n \times c$ matrix. Hence, the time and memory costs grow linearly with $n$, and correspondingly both the time and space complexity are O(n).

By reducing memory usage and computation time, we can apply the self-attention mechanism to image reconstruction tasks such as image completion and super-resolution, which need relatively more computational resources.
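The two cost profiles can be made concrete by counting the elements of the intermediate matrix and the multiply-add operations of the two matrix products. The sketch below is a back-of-the-envelope count under simplifying assumptions: all embeddings have `c` channels, only the two matrix products are counted, and the function names and the value of `c` are hypothetical.

```python
def original_sa_cost(c, n):
    """Costs when the n x n attention map is built first."""
    return {
        "intermediate_elements": n * n,     # the attention map
        "multiply_adds": 2 * c * n * n,     # (n x c)(c x n), then (n x n)(n x c)
    }

def proposed_sa_cost(c, n):
    """Costs when the c x c channel map is built first."""
    return {
        "intermediate_elements": c * c,     # the channel-by-channel map
        "multiply_adds": 2 * c * c * n,     # (c x n)(n x c), then (n x c)(c x c)
    }

c = 64                                      # hypothetical number of channels
for side in (32, 64):                       # feature maps of side x side positions
    n = side * side
    print(side, original_sa_cost(c, n), proposed_sa_cost(c, n))
```

Doubling the side length quadruples `n`: the original module's attention map and multiply-adds grow 16-fold, while the proposed module's intermediate matrix keeps its size and its multiply-adds grow 4-fold.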

5 Experiments

We provide two experiments to measure the proposed module and compare it with the original one. The first experiment is an ablation study that applies our module to complete images with large missing margins, in order to measure how effectively it captures long-range dependencies. The second experiment is about image generation, in order to show that the refined module obtains results comparable to those of the original module in SAGAN while consuming less memory and running time. Both experiments are carried out on a platform with an NVIDIA GTX 1080 Ti GPU, 32 GB of RAM and an i7-7700K CPU.

5.1 Ablation Study

This experiment verifies the ability of our self-attention module to capture long-range dependencies for image reconstruction. The task is to complete an image (e.g. Figure 4) from which one quarter has been cut on both its left and right sides (e.g. Figure 4(b)). We observe the effect of replacing two convolutional layers of a standard GAN with our SA module (illustrated in Figure 3).
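The input preparation described above can be sketched as follows. This is only an illustration under assumptions: the text does not say how the cut regions are filled, so the sketch zeroes them out, and the function name `cut_side_margins` and the height-width-channel layout are hypothetical.

```python
import numpy as np

def cut_side_margins(image):
    """Blank out the left and right quarters of an (H, W, C) image."""
    width = image.shape[1]
    quarter = width // 4
    masked = image.copy()
    masked[:, :quarter] = 0             # left quarter (assumed zero fill)
    masked[:, width - quarter:] = 0     # right quarter (assumed zero fill)
    return masked

image = np.ones((256, 256, 3), dtype=np.float32)
masked = cut_side_margins(image)
print(masked[:, 64:192].min(), masked[:, :64].max(), masked[:, 192:].max())
```

Only the central half of each row keeps its original content, so any plausible completion of the margins has to draw on information from distant positions.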

5.1.1 Details of Implementation

Figure 3: The structure of generator in the experiments.

The network used is a basic super-resolution generative adversarial network (SRGAN) [Ledig et al.2017]. All settings of the experiment are inherited from [Ledig et al.2017], except that we re-implement the structure of the generator as in Figure 3 and add spectral normalization [Miyato et al.2018] to every convolutional layer of the generator to stabilize training. The dataset used is a simple coast dataset (it can be downloaded from http://cvcl.mit.edu/scenedatabase/coast.zip), whose images are resized to 256*256 in pre-processing.

5.1.2 Experimental Results

(a) Ground truth
(b) Input
(c) The results of standard GAN
(d) The Results of the Proposed Module
Figure 4: The results of image completion. We train GANs to complete the images in (b). (c) shows the results of a standard GAN. When two convolutional layers are replaced by our SA module, we obtain the results in (d).

Reconstructing an image with large missing margins cannot achieve satisfactory results if the network has difficulty handling long-range dependencies, as shown in Figure 4(c). This experiment inspects whether our SA module retains the ability to capture long-range information after the structural changes. Figure 4(d) shows the results of our SA module. Without our SA module, the regions to be completed can hardly receive valid signals from the remaining regions (Figure 4(c)). Hence the experiment verifies that our SA module inherits this ability from the original self-attention module.

5.2 Generative Experiments

Furthermore, we compare the performance of the two modules in a real computational environment. To ensure a fair comparison, we choose three networks that differ only in the implementation of the SA module. The three networks are the self-attention generative adversarial network (SAGAN), the improved self-attention generative adversarial network (ISAGAN) and the standard generative adversarial network (SGAN). SAGAN uses its original implementation from [Zhang et al.2018a] (the code can be downloaded from https://github.com/heykeetae/Self-Attention-GAN), and ISAGAN uses the structure of SAGAN but replaces the SA modules with the proposed one. SGAN replaces the SA modules with convolution layers and serves as a baseline. We train the three networks on two benchmark datasets and evaluate them by the Fréchet Inception Distance (FID) [Heusel et al.2017] (lower is better). Generally, a generative task is evaluated by the Inception score [Salimans et al.2016] and the FID, but according to [Barratt and Sharma2018], the Inception score is misleading when a generative network is not trained on ImageNet.
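For reference, the FID of [Heusel et al.2017] compares the statistics of Inception features extracted from real and generated images. In the notation assumed here, $\mu_r$, $\Sigma_r$ are the mean and covariance of the features of real images and $\mu_g$, $\Sigma_g$ those of generated images:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big)$$

The value is zero when the two sets of statistics coincide and grows as they diverge, which is why lower values indicate better generative quality.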

5.2.1 Details of Implementation

All the generative models are designed to generate 64*64 images. By default, the batch size is 64 and other hyper-parameters of discriminators, generators and optimizers are inherited from SAGAN.

5.2.2 Experimental Results

FID SAGAN ISAGAN SGAN
human face dataset 8.864 8.723 6.394
multi-class dataset 12.739 12.236 16.066
Table 1: The results based on two datasets

The results of the three networks (SAGAN, ISAGAN, SGAN) are tabulated in Table 1. Relative to SGAN, SAGAN and ISAGAN perform comparably to each other. Concretely, when trained on the human face dataset, both SAGAN and ISAGAN degrade the generative quality, since generating human faces may depend more on local features than on global features, so the self-attention mechanism does not help and may even interfere. In contrast, when trained on the multi-class dataset, the two self-attention modules improve the quality thanks to their ability to capture long-range dependencies. From the generative experiments and the formula derivation, we can infer that the two self-attention modules are comparable in effectiveness.

Furthermore, we need to evaluate the costs of time and space of our proposed module. Table 2 shows the time spent in forward (every 30k images) and backward (every 10 * batch size images) propagation.

time (in seconds) SAGAN ISAGAN SGAN
forward 1.077 0.762 0.564
backward 5.992 3.049 2.390
Table 2: The time spent in forward and backward propagation.

ISAGAN is faster than SAGAN in both propagation directions mainly because our self-attention module avoids the large-scale matrix multiplication. In forward propagation, the large-scale matrix multiplication involves more units, and in backward propagation, differentiation has to be carried out on a larger computational graph. Regarding memory usage, Table 3 shows whether SAGAN, ISAGAN and SGAN can be trained with different batch sizes, where ‘no’ means that a model cannot be run in our environment and ‘ok’ means the opposite.

batch size SAGAN ISAGAN SGAN
256 ok ok ok
512 no ok ok
1024 no ok ok
Table 3: Training of the three models with different batch sizes.

Since increasing the batch size needs more memory space for training, we can also infer that ISAGAN uses less memory space than SAGAN.

6 Conclusion

We improve the original self-attention module to reduce its time and space complexity. Because it consumes less memory, our self-attention module can be used in image reconstruction, which often needs to process higher-dimensional data. Theoretically, our SA module is a special kind of channel attention mechanism. Experimental results verify that our self-attention module obtains results comparable to the vanilla one while using less time and memory. In future work, we will apply the proposed self-attention module to other deep learning tasks beyond image reconstruction.