
Fair Comparison between Efficient Attentions

Transformers have been successfully used in various fields and are becoming standard tools in computer vision. However, self-attention, a core component of transformers, has a quadratic complexity problem, which limits the use of transformers in vision tasks that require dense prediction. Many studies aiming at solving this problem have been proposed. However, no comparative study of these methods at the same scale has been reported due to different model configurations, training schemes, and additional techniques. In this paper, we validate these efficient attention models on the ImageNet1K classification task by changing only the attention operation and examining which efficient attention performs better.


1 Introduction

Transformer [10] models have received widespread attention due to their effectiveness in various fields. The transformer relies on a self-attention operation, whose time and memory complexity is quadratic in the number of input tokens. In natural language processing (NLP), the first field to recognize the potential of transformers, various works [11, 8, 15, 16, 9] have been reported that improve the quadratic complexity problem for handling documents consisting of many words. Such efforts have continued in computer vision [3, 1]. This is because the vision transformer, which creates tokens by dividing an image into patches, exhibits an inverse relationship between patch size and performance [5], yet quadratic complexity makes the computation difficult to handle when patch sizes are reduced.

Previous studies on efficient self-attentions that improve quadratic complexity to linear have never been compared in the same environment because of several differences among them. Those studies use different model configurations and training schemes beyond their self-attention operations. Thus, isolating the influence of efficient self-attention alone is difficult. Furthermore, the performance of some efficient self-attentions proposed in NLP has not been evaluated on vision tasks.

In this paper, we conduct a comparison experiment by changing only the self-attention operations to efficient self-attentions with linear time complexity. Here, using 4×4 or 7×7 patches increases the number of patches by 4× to 16× compared with 14×14 or 16×16 patches, increasing the computations to an infeasible extent despite using efficient attention. Therefore, we employ the pyramid structure of previous studies [12, 6].

As a result, we observe that performance improves as computational complexity increases for 4×4 and 7×7 patches, regardless of which self-attention is used, and that efficient attentions do not perform better than normal self-attention without additional methods.

2 Efficient Self-Attentions

The original transformer architecture [10] and the vision transformer for image classification [5] both conduct global self-attention, where the relationships between a token and all other tokens are computed. Global self-attention has complexity quadratic in the number of tokens, impeding the use of many tokens to obtain fine-grained features for dense prediction.

To solve this problem, efficient self-attentions [11, 8, 15, 1, 3], which are characterized by linear complexity and global token interaction, have been proposed. In this section, we outline the efficient self-attentions used in this paper and indicate the computational complexity of each method. Here, we only consider the self-attention operation when stating complexity. Table 1 summarizes the abbreviation and complexity of each efficient attention used in this paper.

Self-attention The original self-attention operation [10] computes the dot products of queries with keys, divides each by √d, and applies a softmax function to obtain the weights on the values. Then, a linear projection using Wᴼ is performed; however, this projection is omitted in this paper for clarity:

    SA(Q, K, V) = σ(QKᵀ/√d)V,    (1)

where Q, K, and V ∈ ℝ^{N×d} represent queries, keys, and values, respectively. Generally, the dimension d of the query equals that of the key. They are calculated from the input sequence by projecting onto three learnable weight matrixes Wᵠ, Wᴷ, and Wⱽ. σ denotes the application of the softmax function along the last axis of the matrix. The computation complexity of the original self-attention is O(N²d).
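As a concrete reference point for the variants below, the self-attention operation above can be sketched in NumPy. This is a minimal single-head version, assuming Q, K, and V are already-projected token matrices (the output projection Wᴼ is omitted, as in the text):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V; the explicit (N, N) matrix gives O(N^2 d) cost."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (N, N) attention weights
    return A @ V                                # (N, d) output tokens
```

The (N, N) weight matrix A is exactly the term that every method in this section tries to avoid materializing.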

Linformer This technique [11] is a direct modification of self-attention. Since the original self-attention has quadratic complexity with respect to the number of tokens, Linformer projects K and V into k×d matrixes using learnable parameters E, F ∈ ℝ^{k×N} and then performs the self-attention operation. The linear self-attention is computed as:

    LA(Q, K, V) = σ(Q(EK)ᵀ/√d)(FV).    (2)

The computation complexity is O(Nkd). In our experiments, we fixed the projected dimension k.
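A minimal NumPy sketch of the Linformer attention above, where E and F are the learnable projections reducing N tokens to k rows (here passed in as plain matrices for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Project keys/values from N tokens down to k rows, then attend: O(N k d)."""
    d = Q.shape[-1]
    A = softmax(Q @ (E @ K).T / np.sqrt(d), axis=-1)  # (N, k) instead of (N, N)
    return A @ (F @ V)                                # (N, d)
```

The attention matrix shrinks from (N, N) to (N, k), which is where the linear cost in N comes from.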

Efficient attention To improve the quadratic complexity, efficient attention [8] changes the order of operations in self-attention. A similar method was proposed as the non-local module [13] in convolutional neural networks. EA denotes the efficient attention and is written as:

    EA(Q, K, V) = σ_row(Q)(σ_col(K)ᵀV),    (3)

where σ_row applies the softmax function along the feature axis of Q and σ_col applies it along the token axis of K. Here, the computation complexity is O(Nd²).
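The reordering above can be sketched directly: computing the (d, d) global context first means the (N, N) matrix is never formed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(Q, K, V):
    """sigma_row(Q) (sigma_col(K)^T V): associativity gives O(N d^2) cost."""
    context = softmax(K, axis=0).T @ V    # (d, d) global context, softmax over tokens
    return softmax(Q, axis=-1) @ context  # (N, d); no (N, N) matrix is materialized
```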

Performer The performer [3] improves on efficient attention with a kernel method. It uses a kernel approximating the softmax function via positive orthogonal random features and is denoted as PA:

    PA(Q, K, V) = D̂⁻¹(φ(Q)(φ(K)ᵀV)),  D̂ = diag(φ(Q)(φ(K)ᵀ1_N)),    (4)

where φ is the kernel feature map using positive orthogonal random features and 1_N is the all-ones vector of length N. The computation complexity of the kernel is O(Nmd), where m is the number of random features, and the total computation complexity is O(Nmd). In our experiment, we set the projected dimension m to ensure that m is smaller than N, according to the original work [3].
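A hedged NumPy sketch of the performer attention above. The feature map follows the positive random-feature form of [3]; for simplicity the random matrix W is drawn i.i.d. Gaussian here, whereas the original work uses orthogonal rows:

```python
import numpy as np

def phi(X, W):
    """Positive random features approximating the softmax kernel (FAVOR+-style)."""
    m = W.shape[0]
    Xs = X / X.shape[-1] ** 0.25                        # absorb the 1/sqrt(d) scaling
    h = np.exp(-0.5 * (Xs ** 2).sum(-1, keepdims=True)) # exp(-|x|^2 / 2) normalizer
    return h * np.exp(Xs @ W.T) / np.sqrt(m)            # (N, m), all entries positive

def performer_attention(Q, K, V, W):
    pQ, pK = phi(Q, W), phi(K, W)   # (N, m) feature maps
    out = pQ @ (pK.T @ V)           # O(N m d); no (N, N) matrix
    den = pQ @ pK.sum(axis=0)       # (N,) row normalizer, the diagonal of D-hat
    return out / den[:, None]
```

Because every feature is positive, the normalizer is strictly positive and the division is safe.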

XCiT The cross-covariance attention [1] is a transposed version of self-attention, which operates across the feature dimension rather than the token dimension. This change provides only implicit communication between tokens and degrades the quality of the representation. Therefore, XCiT [1] introduced a local patch interaction module (LPI) consisting of two convolution layers. We evaluated results both with and without this module, denoted as LPI and XCA, respectively. The cross-covariance attention is given as:

    XCA(Q, K, V) = Vσ(K̂ᵀQ̂/τ),    (5)

where Q̂ and K̂ are the ℓ2-normalized queries and keys and τ is a learnable temperature scaling parameter that compensates for the reduced representational power due to the normalization. For comparison, we set τ to a fixed value and removed this regularization from the experiment. The computation complexity is O(Nd²).
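A NumPy sketch of the cross-covariance attention above. This is our reading of [1], not its reference implementation: the queries and keys are ℓ2-normalized along the token axis so the attention map is a (d, d) matrix over channels, and τ is fixed as in our experiments:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def xca(Q, K, V, tau=1.0):
    """Cross-covariance attention: a (d, d) map over channels, O(N d^2)."""
    Qh = Q / (np.linalg.norm(Q, axis=0, keepdims=True) + 1e-8)  # unit-norm channels
    Kh = K / (np.linalg.norm(K, axis=0, keepdims=True) + 1e-8)
    A = softmax((Kh.T @ Qh) / tau, axis=-1)  # (d, d) channel-attention map
    return V @ A                             # (N, d); tokens mix only implicitly
```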

Fastformer Instead of modeling global attention using matrix multiplication, the fastformer [15] employs element-wise multiplication. This method transforms each token representation using its global context. For the global query q and global key k, learnable parameters w_q and w_k are required. The additive attention is given as:

    α = σ(Qw_q/√d),  q = Σᵢ αᵢQᵢ,
    P = q ⊙ K,  β = σ(Pw_k/√d),  k = Σᵢ βᵢPᵢ,
    AA(Q, K, V) = R + Q,  R = (k ⊙ V)Wᴼ,    (6)

where R is computed using the global key k and the values, and ⊙ denotes element-wise multiplication. Wᴼ is the projection parameter we mentioned but omitted in Equation 1. The computation complexity of the additive attention is O(Nd).
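A hedged NumPy sketch of the additive attention above. The final projection Wᴼ and the residual with the query are omitted here, so the function returns the value-side context k ⊙ V; w_q and w_k are the learnable scoring vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(Q, K, V, w_q, w_k):
    """Fastformer-style additive attention via element-wise products: O(N d)."""
    d = Q.shape[-1]
    alpha = softmax(Q @ w_q / np.sqrt(d), axis=0)  # (N,) weights for the global query
    q = alpha @ Q                                  # (d,) global query vector
    P = q * K                                      # (N, d) query-modulated keys
    beta = softmax(P @ w_k / np.sqrt(d), axis=0)   # (N,) weights for the global key
    k = beta @ P                                   # (d,) global key vector
    return k * V                                   # (N, d); W^O and residual omitted
```

Every step is a vector-matrix product or an element-wise product over N tokens, which is where the O(Nd) cost comes from.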

Swin Transformer In addition to the above efficient self-attentions, several other approaches to solving the quadratic complexity exist, including the Swin transformer [6]. It regards global interaction as the core problem and uses shifted window attention with normal self-attention [10]. Therefore, window attention is strictly outside the scope of our experiment. However, since the Swin transformer is an important benchmark for vision transformers, we include its performance in this study. The computation complexity of the shifted window attention is O(NM²d), where M is the size of the window. In our experiments, we set M to 7 and 8 when the patch sizes are 4 and 7, respectively.
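The window partition that gives the O(NM²d) cost can be sketched as follows. This simplified version uses the token sequence itself as queries, keys, and values, and omits the window shifting that Swin applies in alternating blocks:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(x, H, W, M):
    """Self-attention inside non-overlapping M x M windows: O(N M^2 d)."""
    N, d = x.shape
    assert N == H * W and H % M == 0 and W % M == 0
    # partition the H x W token grid into (H/M)*(W/M) windows of M*M tokens each
    g = x.reshape(H // M, M, W // M, M, d).transpose(0, 2, 1, 3, 4).reshape(-1, M * M, d)
    A = softmax(g @ g.transpose(0, 2, 1) / np.sqrt(d), axis=-1)  # per-window weights
    out = A @ g
    # undo the window partition back to an (N, d) token sequence
    return out.reshape(H // M, W // M, M, M, d).transpose(0, 2, 1, 3, 4).reshape(N, d)
```

Each of the N/M² windows pays the quadratic cost only over its own M² tokens, hence O(NM²d) overall.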

Figure 1: Baseline pyramid architecture of our experiment. We split an image into fixed-size patches, linearly embed each of them, add positional embeddings, and feed the resulting sequence to four ‘Stages’. Next, we perform patch pooling, an operation that averages the feature along the patch axis. After computing the last feature, we feed it into the linear classifier.

3 Experiment

In our experiment, we benchmarked various efficient self-attentions on the ImageNet1K dataset [4], which contains 1.28M training images and 50K validation images from 1,000 classes. For a fair comparison, we trained all models on the training set using a similar configuration and the same training scheme except for self-attention and patch size. We report the top-1 error in the validation set.

3.1 Baseline architecture

An overview of the baseline architecture is presented in Figure 1. First, we reshape the 2D image into a sequence of flattened patches x_p ∈ ℝ^{N×(P²·3)}, where P and N are the size and number of patches, respectively. If P is 4 and the input resolution is (224, 224), N becomes 3136. Then, we use learned embeddings to convert the patches into token representations and add positional encoding. Next, this representation is averaged after passing through the ‘Stages’ and input into the linear classifier to obtain logits. Every ‘Stage’ consists of several blocks, and we only change the self-attention layer in each block to an efficient self-attention for a fair comparison.
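The patchify step above is a pure reshape, sketched here in NumPy (the learned embedding and positional encoding are separate, subsequent steps):

```python
import numpy as np

def patchify(img, P):
    """(H, W, C) image -> N = HW / P^2 flattened patches of length P*P*C."""
    H, W, C = img.shape
    g = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return g.reshape(-1, P * P * C)
```

For a (224, 224, 3) image with P = 4, this yields 3136 patches of length 48, matching the N quoted in the text.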

Transformer block The blocks of each ‘Stage’ consist of a multi-head self-attention (MSA) using the efficient attentions and a feed-forward network (FFN), as in a transformer [10]. The FFN consists of a 2-layer MLP with a GELU non-linearity in between. A LayerNorm (LN) [2] layer is applied before each MSA and FFN, and a residual connection is applied after each module.

Pyramid structure Even if the computation decreases using efficient self-attention, the computation rapidly increases when many tokens are used for attention. Therefore, we employed a pyramid structure to lower the computation to a feasible level. A pyramid structure [6, 12] consists of several ‘Stages’, each maintaining its number of tokens. Each ‘Stage’ has several transformer blocks with efficient or normal self-attention. At the end of each ‘Stage’, except the last, there is a ‘Patch Merging’ layer in which neighboring patches are merged to reduce the number of tokens:

    PM(x) = LN(Reshape(x))W_m,    (7)

where x ∈ ℝ^{N×d} represents the input tokens and W_m ∈ ℝ^{4d×2d} is a linear projection parameter. Reshape is an operation that reshapes the input sequence from ℝ^{N×d} to ℝ^{N/4×4d} by concatenating each group of 2×2 neighboring patches, and LN refers to LayerNorm.
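A minimal NumPy sketch of the patch merging operation above, assuming the tokens form an H × W grid and using a LayerNorm without learned affine parameters for brevity:

```python
import numpy as np

def patch_merging(x, H, W, Wm):
    """Merge 2x2 neighbor patches: (N, d) -> (N/4, 2d) via LN and projection."""
    N, d = x.shape
    assert N == H * W and H % 2 == 0 and W % 2 == 0
    # concatenate each 2x2 neighborhood along the channel axis: (N/4, 4d)
    g = x.reshape(H // 2, 2, W // 2, 2, d).transpose(0, 2, 1, 3, 4).reshape(-1, 4 * d)
    mu, var = g.mean(-1, keepdims=True), g.var(-1, keepdims=True)
    g = (g - mu) / np.sqrt(var + 1e-6)  # LayerNorm without learned affine terms
    return g @ Wm                       # Wm has shape (4d, 2d) -> (N/4, 2d)
```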

Columnar structure Another structure that can be used is a columnar structure, which works similarly to ViT [5]. This is a reference point for showing the difference between the pyramid and column structures when each structure has a similar amount of computation. Therefore, we do not use efficient attention with a columnar structure. Here, we trained the columnar structure with normal self-attention and set patch sizes to 14 and 16 due to the large computation.

Model details To make the number of feature channels smaller than the number of tokens in each stage, we set the number of feature channels per head to 32. With the pyramid structure, the numbers of attention layers and heads were set to [2, 2, 6, 2] and [6, 12, 24, 32] along the ‘Stages’. With the columnar structure, we set the number of heads to 12. Next, we adopted the positional encoding proposed by El-Nouby et al. [1]. This method first produces an encoding in an intermediate 64-dimensional space before projecting it to the feature dimension of the transformer.

Figure 2: ImageNet1K top-1 accuracy versus GFLOPs under different patch sizes. All models were trained with the same scheme.

3.2 Training Scheme

For training, we employed the AdamW optimizer [7] for 310 epochs using a cosine decay learning-rate scheduler, 20 epochs of linear warm-up, and 10 epochs of cool-down. Furthermore, we used a batch size of 128, an initial learning rate of 0.005, and a weight decay of 0.05. We also applied most of the augmentation and regularization strategies of XCiT [1]. For the baseline with a patch size of 4, we used a batch size of 32 due to limited memory. For implementation, we developed our code using the timm library [14].

3.3 Experimental results

Model (Abbreviation)          Complexity per Self-Attention
Transformer (SA) [10]         O(N²d)
Linformer (LA) [11]           O(Nkd)
Efficient Attention (EA) [8]  O(Nd²)
Performer (PA) [3]            O(Nmd)
Fastformer (AA) [15]          O(Nd)
XCiT (XCA) [1]                O(Nd²)
Swin Transformer (Swin) [6]   O(NM²d)
Table 1: Abbreviations and complexity for each self-attention used. k, m, and M are hyperparameters described in Section 2.
Type of Attention    #params (M, millions)    FLOPs (G, giga)    FLOPs ratio    Top-1 Acc. (%)
SA-4 [10] 28.27 8.821 1 81.80
SA-7 [10] 28.28 1.915 1 78.97
Efficient Attentions
LA-4 [11] 30.91 5.496 0.62 79.04(-2.76)
LA-7 [11] 28.56 1.561 0.81 77.47(-1.5)
EA-4 [8] 28.27 4.480 0.51 79.87(-1.93)
EA-7 [8] 28.28 1.473 0.77 77.91(-1.06)
PA-4 [3] 28.27 4.481 0.51 78.73(-3.07)
PA-7 [3] 28.28 1.473 0.77 77.87(-1.1)
AA-4 [15] 28.27 4.394 0.50 77.60(-3.93)
AA-7 [15] 28.28 1.445 0.75 76.02(-2.95)
XCA-4 [1] 28.27 4.480 0.51 78.67(-3.13)
XCA-7 [1] 28.28 1.473 0.77 77.62(-1.35)
Swin-4 [6] 28.27 4.528 0.51 80.08(-1.72)
Swin-7 [6] 28.28 1.500 0.78 78.72(-0.25)
LPI-4 [1] 28.38 4.520 0.51 81.54(-0.26)
LPI-7 [1] 28.39 1.486 0.78 79.7(+0.73)
COL-14 22.00 6.117 0.69 81.30(-0.50)
COL-16 22.00 4.589 0.52 80.97(-0.83)
Table 2: Experiment results with efficient attentions and benchmarks. COL means the columnar structure and LPI means the local patch interaction described in Section 2. The number after each attention is the patch size. The number of parameters and GFLOPs were measured with an input resolution of 224. Additionally, the FLOPs ratio was calculated relative to the normal self-attention baseline (SA) with the same patch size.

This section presents comparison experiments on the ImageNet1K classification task. Table 2 presents the number of parameters, FLOPs, and top-1 accuracy of the various efficient attentions. Figure 2 shows the top-1 accuracy of the various efficient attentions versus GFLOPs under different patch sizes. From Table 2, no efficient attention performs better than the baseline. Furthermore, with 4×4 patches, the models rank in the order EA, LA, PA, XCA, and AA. When the patch size is 7×7, the order is EA, PA, XCA, LA, and AA. In both cases, EA attained the least performance loss, and AA achieved the lowest accuracy along with the largest reduction in computation. Also, PA, which uses a more complex kernel than the softmax-like kernel in EA, performed poorly. However, LPI showed similar or higher performance than the baseline. Compared with XCA, the performance of LPI was higher by 2.87% and 2.08% for 4×4 and 7×7 patches, respectively. This shows that efficient attention can outperform normal self-attention when used with a proper module.

4 Conclusion

In this paper, we conducted a comparison experiment among pyramid transformers with efficient attentions. For a fair comparison, we used the same environment for all experiments. The experimental results show that efficient attentions achieve lower accuracy along with lower computation than normal self-attention. However, in some cases, efficient attentions perform on par with or better than the baseline. This result shows the potential of efficient attentions to reduce computation.

Acknowledgments This work was partially supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2020R1C1C1007423 and project BK21 FOUR).