
X-ViT: High Performance Linear Vision Transformer without Softmax

by   Jeonggeun Song, et al.

Vision transformers have become one of the most important models for computer vision tasks. Although they outperform prior works, they require heavy computational resources that scale quadratically with the number of tokens, N. This is a major drawback of the traditional self-attention (SA) algorithm. Here, we propose the X-ViT, a ViT with a novel SA mechanism that has linear complexity. The main approach of this work is to eliminate nonlinearity from the original SA. We factorize the matrix multiplication of the SA mechanism without complicated linear approximation. By modifying only a few lines of code from the original SA, the proposed models outperform most transformer-based models on image classification and dense prediction tasks in most capacity regimes.



1 Introduction

Following early successes in natural language processing (NLP), several transformer-based studies have shown impressive results in vision tasks. Recent studies have shown that transformer-based architectures set a new state of the art across a wide range of subject areas, including image classification, object detection and semantic segmentation, and generative models.

Despite its great success, the original self-attention (SA) mechanism has $O(N^2)$ time and memory complexity due to the matrix multiplication of $\mathrm{softmax}(QK^{\top}) \in \mathbb{R}^{N \times N}$ and $V$. This is one of the well-known drawbacks of traditional transformers. For vision tasks, $N$ is proportional to the input resolution. Doubling both the width and the height of the input image quadruples $N$, so SA consumes 16 times the computational resources.
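The 16-fold figure can be checked with a few lines of arithmetic. The sketch below is only an illustration: the 224-pixel base resolution and the 16-pixel patch size are assumed values, not settings stated in this section, and both helper functions are hypothetical names.

```python
def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens N when the image is cut into patch x patch tiles.

    The patch size of 16 is an assumption made for this illustration.
    """
    return (height // patch) * (width // patch)


def attention_map_entries(n_tokens: int) -> int:
    """Entries in the N x N attention map that softmax SA has to build."""
    return n_tokens * n_tokens


base = num_tokens(224, 224)      # assumed base resolution
doubled = num_tokens(448, 448)   # width and height both doubled

token_ratio = doubled / base
cost_ratio = attention_map_entries(doubled) / attention_map_entries(base)

print(f"tokens: {base} -> {doubled} (x{token_ratio:.0f})")
print(f"attention-map entries grow by x{cost_ratio:.0f}")
```

The token count grows with the image area, and the attention map grows with the square of the token count, which is where the factor of 16 comes from.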

Here, we propose a new model that implements an alternative SA mechanism to avoid this drawback. It is called the X-ViT, the vision transformer with XNorm. Our key method replaces the softmax nonlinearity with a simple $\ell_2$-norm. Using the associative law of matrix multiplication, our new SA algorithm requires far fewer computational resources than the original SA.

Figure 1: Top-1 accuracy vs. Model capacity. Comparison of ImageNet1k top-1 accuracy of various models according to model capacity. Our models show the best results at the same parameter sizes compared to the other models.

The main contributions in this paper are summarized as follows:

  • We propose a novel constraint scheme, XNorm, that generates a unit hypersphere to extract relational features. It eliminates nonlinearity from SA by replacing the softmax function. Our module has $O(N)$ complexity, and it handles high-resolution inputs efficiently.

  • We demonstrate that X-ViT can be adopted for general purposes. Our proposed method outperforms most of the state-of-the-art models based on transformers at lower capacity and FLOPs. In particular, our models perform well in lightweight regimes.

  • We empirically show that X-ViT models have faster inference speed and require less GPU memory.

2 Related Works

Dosovitskiy et al.[dosovitskiy2020image] proposed a vision transformer (ViT), which showed that transformer-based models could be used for vision tasks. After the achievements of ViT, DeiT[touvron2020training] introduced data-efficient training strategies for vision transformers with detailed ablation studies. They solved the ViT data efficiency problem successfully, and most of the current transformer-based models follow their schemes.

Apart from architectural strategies, many approaches have been proposed to address the quadratic complexity of the SA mechanism. They fall into several categories: those that use their own spatial patterns[ho2019axial, child2019generating, sukhbaatar2019adaptive], those that use various low-rank factorization methods[choromanski2020rethinking, shen2021efficient, wang2020linformer], those that use linear approximation by sampling important tokens[kitaev2020reformer, xiong2021nystr], and those that use cross-covariance matrices instead of Gram matrices[el2021xcit]. Although the detailed methods are quite different, our XNorm is mainly related to low-rank factorization methods.

Tokens-to-token ViT, introduced by Yuan et al.[yuan2021tokens], aims to achieve a similar objective through different approaches. They presented a method of overlapping tokens to locally correlate patches. They did not use additional methods to reduce the computation, except when using small channels. El-Nouby et al. introduced local patch interactions in XCiT[el2021xcit]. With two depthwise convolutions[chollet2017xception] added after XCA, XCiT achieved better performance. Our models are generally inspired by the intrinsic optimization strategies that XCiT[el2021xcit] introduced, while we present our own SA method.

3 Methods

Figure 2: Overview of X-ViT module. Note that affine layers[touvron2021resmlp] follow each module.

Figure 3: X-ViT module.

3.1 XNorm

The structure of our model is shown in Figure 2. It is a mixture of convolutional layers, an X-ViT module, and a simple feed-forward MLP layer.

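For an input $x$ with $N$ tokens, the original SA mechanism is formulated as follows:

$$A(x) = \sigma\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

$$Q = xW_Q, \quad K = xW_K, \quad V = xW_V$$

where $A$ denotes the attention operator, $\sigma$ denotes the softmax function, and $d_k$ is the key dimension.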
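As a concrete reference point, the standard softmax SA can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: it takes the query, key, and value matrices directly (the learned projections are omitted), uses nested lists to stay dependency-free, and the toy inputs are made up.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def softmax_attention(q, k, v):
    """Standard SA: softmax(Q K^T / sqrt(d_k)) V. Builds an N x N matrix."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                      # N x N
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)                        # each row sums to 1
    return matmul(weights, v), weights


# Toy inputs: N = 3 tokens, d_k = 2 channels (illustrative values only).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = softmax_attention(q, k, v)
print(len(weights), "x", len(weights[0]), "attention map")
```

The `weights` matrix has one row and one column per token, which is the $N \times N$ object responsible for the quadratic cost.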

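By removing the softmax function from the original SA, the product $QK^{\top}V$ of the query, key, and value matrices can be regrouped as $Q(K^{\top}V)$ using the associative law, so the $N \times N$ matrix is never formed. Compared to the $O(N^2)$ complexity of the original SA, each matrix multiplication then has $O(N)$ complexity.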
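The effect of dropping the softmax can be illustrated numerically. In the sketch below, both groupings of the product of the query, key, and value matrices give the same result, while the number of scalar multiplications differs by a factor of $N/d$. The toy matrices, the token count of 196, and the channel width of 64 are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def mults_quadratic(n, d):
    """Scalar multiplications for (Q K^T) V: N*N*d for Q K^T, N*N*d for the rest."""
    return 2 * n * n * d


def mults_linear(n, d):
    """Scalar multiplications for Q (K^T V): d*d*N for K^T V, N*d*d for the rest."""
    return 2 * n * d * d


# Toy inputs: N = 4 tokens, d = 2 channels (illustrative values only).
q = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [2.0, 1.0]]
k = [[0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, -1.0]]

quadratic = matmul(matmul(q, transpose(k)), v)   # forms an N x N intermediate
linear = matmul(q, matmul(transpose(k), v))      # forms only a d x d intermediate

max_diff = max(
    abs(a - b) for row_a, row_b in zip(quadratic, linear) for a, b in zip(row_a, row_b)
)
print("largest difference between the two groupings:", max_diff)
print("multiplications at N=196, d=64:", mults_quadratic(196, 64), "vs", mults_linear(196, 64))
```

The regrouping is only valid because no nonlinearity sits between the two products; with the softmax in place, the $N \times N$ matrix has to be formed first.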

We therefore designed a simple constraint to replace the softmax function. Our proposed method, called cross-normalization or XNorm, is defined as follows:


where is a learnable parameter and is the number of embedding dimensions. It is a common $\ell_2$-norm, applied to the patches of and the filters of .

In the above formulation, the patches of are projected to dimension by . After that, the pixel-to-pixel relations are computed by multiplying . In this process, we observed that the variance in the magnitudes of the pixel vectors can harm the stability of training at the initial stage. With XNorm, all pixels are normalized to unit-sized vectors. This makes training stable and improves the performance of the model.
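The idea of normalizing to unit-sized vectors can be sketched as follows. This is a hedged illustration, not the authors' definition: the exact XNorm formula is not reproduced in this text, so the choice of axes (each row of the query matrix, each column of the product $K^{\top}V$), the scalar `gamma` standing in for the learnable parameter, and the function names are all assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def l2_normalize_rows(a, gamma=1.0, eps=1e-12):
    """Rescale every row to L2 length gamma (gamma is a fixed stand-in here)."""
    out = []
    for row in a:
        norm = math.sqrt(sum(value * value for value in row)) + eps
        out.append([gamma * value / norm for value in row])
    return out


def xnorm_attention(q, k, v, gamma=1.0):
    """Softmax-free attention sketch: normalize(Q) @ normalize(K^T V).

    ASSUMPTION: rows of Q and columns of K^T V are the normalized axes.
    """
    kv = matmul(transpose(k), v)                                  # d x d, no N x N map
    q_hat = l2_normalize_rows(q, gamma)                           # unit-length token vectors
    kv_hat = transpose(l2_normalize_rows(transpose(kv), gamma))   # unit-length columns
    return matmul(q_hat, kv_hat)


# Token vectors with very different magnitudes (illustrative values only).
q = [[3.0, 4.0], [100.0, 0.0], [0.1, 0.1]]
k = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]
v = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]

out = xnorm_attention(q, k, v)
out_rescaled = xnorm_attention([[10.0 * value for value in row] for row in q], k, v)
print(out)
```

Under these assumptions every output entry is a dot product of two unit vectors, so it stays within [-1, 1] regardless of how large the input vectors are, and rescaling the queries leaves the output unchanged.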

To build our X-ViT model, we adopted architectural strategies from earlier vision transformer models[graham2021levit, xiao2021early, el2021xcit, touvron2021going]. First, we used convolutional layers instead of linear patch embedding layers. Several recent studies[graham2021levit, xiao2021early] claimed that early convolutional layers help vision transformers to be well-trained. We also added the local patch interaction (LPI) layers proposed in XCiT[el2021xcit]. We found that the LPI layers performed better than the other types of convolutional modules. The overall structure is illustrated in Figure 2.

| Model | Top-1 Acc. (%) | Params | FLOPs |
|---|---|---|---|
| RegNetY-1.6G[radosavovic2020designing] | 78.0 | 11M | 1.6G |
| DeiT-Ti[touvron2020training] | 72.2 | 5M | 1.3G |
| XCiT-T12/16[el2021xcit] | 77.1 | 26M | 1.2G |
| X-ViT-T | 78.8 | 10M | 1.9G |
| ResNet-50[he2016deep] | 75.3 | 26M | 3.8G |
| RegNetY-4G[radosavovic2020designing] | 80.0 | 21M | 4.0G |
| DeiT-S[touvron2020training] | 79.8 | 22M | 4.6G |
| Swin-T[liu2021swin] | 81.3 | 29M | 4.5G |
| XCiT-S12/16[el2021xcit] | 82.0 | 26M | 4.8G |
| X-ViT-S | 82.0 | 21M | 3.7G |
| ResNet-101[he2016deep] | 75.3 | 47M | 7.6G |
| RegNetY-8G[radosavovic2020designing] | 81.7 | 39M | 8.0G |
| Swin-S[liu2021swin] | 83.0 | 50M | 8.7G |
| XCiT-S24/16[el2021xcit] | 82.6 | 48M | 9.1G |
| X-ViT-M | 82.8 | 37M | 7.0G |
| RegNetY-16G[radosavovic2020designing] | 82.9 | 84M | 16.0G |
| DeiT-B[touvron2020training] | 81.8 | 86M | 17.5G |
| Swin-B[liu2021swin] | 83.5 | 88M | 15.4G |
| XCiT-M24/16[el2021xcit] | 82.9 | 84M | 16.2G |
| X-ViT-B | 83.3 | 64M | 11.9G |
| EfficientNet-B7[tan2019efficientnet] | 84.3 | 66M | 37.0G |
| XCiT-S24/8[el2021xcit] | 83.9 | 48M | 36.0G |
| Swin-B/384[liu2021swin] | 84.5 | 48M | 47.0G |
| X-ViT-M/384 | 83.8 | 37M | 20.5G |
| X-ViT-B/384 | 84.3 | 64M | 35.1G |

Table 1: Comparison with state-of-the-art models. Image classification results, model capacity, and FLOPs of various models on the ImageNet1k dataset.

3.2 X-ViT

To build our X-ViT model, we adopted architectural strategies from earlier vision transformer models[graham2021levit, xiao2021early, el2021xcit, touvron2021going]. In this section, we introduce several intrinsic structures that improve performance. The overall structure is illustrated in Figure 2.

Replace linear patch embedding with convolutions. Several recent studies[graham2021levit, xiao2021early] claimed that early convolutional layers help vision transformers to be well-trained. To adopt their strategy, we used convolutional layers instead of linear patch-embedding layers.

Multi-headed attention. Following the original transformer[vaswani2017attention], our modules are multi-headed for better regularization. The parameter in Eq.3 is applied to all heads to scale the importance of each head.

Convolutional layers. Designing an extra module to extract local features is not a new idea. We chose the simplest method by adding various types of convolutional layers. We experimented with both simple depthwise convolutions and the local patch interaction (LPI) layers proposed in XCiT[el2021xcit]. We found that the latter showed better performance across the regimes overall.

Class attention. In the ImageNet1k experiments, we used the class attention layers presented in CaiT[touvron2021going]. This helps the class token gather spatial information. As in the original paper, class attention is computed on the class token only to reduce computation. We implemented the class attention layers using X-ViT modules, whereas CaiT used the SA module for class attention.

| Backbone | Params | $AP^{b}$ | $AP^{m}$ |
|---|---|---|---|
| ResNet50[he2016deep] | 44M | 41.0 | 37.1 |
| PVT-Small[wang2021pyramid] | 44M | 43.0 | 39.9 |
| Swin-T[liu2021swin] | 48M | 46.0 | 41.6 |
| XCiT-S12/16[el2021xcit] | 44M | 45.3 | 40.8 |
| X-ViT-S | 40M | 44.6 | 40.4 |
| ResNet101[he2016deep] | 63M | 42.8 | 39.2 |
| PVT-Medium[wang2021pyramid] | 64M | 44.2 | 40.5 |
| Swin-S[liu2021swin] | 69M | 48.5 | 43.3 |
| XCiT-S24/16[el2021xcit] | 66M | 46.5 | 41.8 |
| X-ViT-M | 56M | 46.0 | 41.0 |
| ResNeXt101-64[xie2017aggregated] | 102M | 44.4 | 39.7 |
| PVT-Large[wang2021pyramid] | 81M | 44.5 | 40.7 |
| XCiT-M24/16[el2021xcit] | 101M | 46.7 | 42.0 |
| X-ViT-B | 82M | 45.8 | 41.2 |

Table 2: Object detection performance on the COCO val2017.

4 Experiments

4.1 Image Classification

Dataset. For the image classification task, we trained our models using the ImageNet1k[deng2009imagenet] dataset from scratch.

Implementation details. Our setup was almost the same as that of DeiT[touvron2020training]. However, we optimized some hyperparameters according to the model size. The learning rate was scaled relative to a batch size of 512 following the linear scaling rule and linearly warmed up for the first 5 epochs. We trained our models for 400 epochs using the AdamW optimizer[loshchilov2017decoupled] and a cosine scheduler. For data augmentation, CutMix[yun2019cutmix] and RandAugment[cubuk2020randaugment] were used. We applied stronger augmentation to larger models.
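The learning-rate recipe described here (linear scaling against a batch size of 512, 5 warm-up epochs, cosine decay over 400 epochs) can be sketched as a small schedule function. The base learning rate of 5e-4, the batch size of 1024, the per-epoch granularity, and the decay to zero are placeholder assumptions for illustration; the text does not give the values used per model.

```python
import math


def scaled_lr(base_lr: float, batch_size: int, reference_batch: int = 512) -> float:
    """Linear scaling rule: the learning rate grows in proportion to the batch size."""
    return base_lr * batch_size / reference_batch


def lr_at_epoch(epoch: int, peak_lr: float, warmup_epochs: int = 5,
                total_epochs: int = 400) -> float:
    """Linear warm-up followed by cosine decay, evaluated once per epoch."""
    if epoch < warmup_epochs:
        # Ramp up linearly and reach peak_lr at the last warm-up epoch.
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Placeholder values (assumptions, not the paper's hyperparameters).
peak = scaled_lr(base_lr=5e-4, batch_size=1024)
schedule = [lr_at_epoch(epoch, peak) for epoch in range(400)]

for epoch in (0, 4, 5, 200, 399):
    print(f"epoch {epoch:3d}: lr = {schedule[epoch]:.6f}")
```

In practice such a schedule is usually stepped per iteration rather than per epoch and paired with an AdamW optimizer; the per-epoch form is used here only to keep the sketch short.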

Fine-tune at higher resolution. Instead of training from scratch again, we fine-tuned X-ViT-M and X-ViT-B at a higher resolution. Our models achieved better performance in roughly 0.1× the training time required to train from scratch.

Comparison with state-of-the-art models. We experimented with four models that used the same architectural design schemes as DeiT[touvron2020training]. (See Table 1.) As summarized in Figure 1, all our models showed better performance and parameter efficiency than most of the concurrent transformer-based models.

Figure 4: Allocated memory vs. # of tokens. To check the linearity of our models empirically, we measured the maximum value of allocated GPU memory on different resolutions. For a batch size of 64, the memory consumption of our models shows linearity with the number of tokens. Moreover, our models require significantly less memory than the other models.

4.2 Object Detection with Mask R-CNN

Implementation details. Our models were trained and evaluated on the COCO benchmark dataset[lin2014microsoft] for the object detection task. We used our models as the backbone and Mask R-CNN[he2017mask] as the detector heads. Our training setups and hyperparameters follow those of DETR[carion2020end]. All experiments were performed on a 3x schedule. The input resolution was fixed at for all the experiments.

Evaluation on COCO dataset. We compared CNNs[he2016deep, xie2017aggregated] and ViT models on object detection and instance segmentation tasks. To make the comparison fair, the experimental environment was the same for all the results. All models were pre-trained on the ImageNet1k dataset.

According to Table 2, our models significantly outperform the CNN-based models and achieve results that are higher than or competitive with those of state-of-the-art vision transformers. Notably, Swin transformer[liu2021swin] models showed better results across all regimes. Their architectural strategy is better optimized for dense prediction tasks, while that of our models is not.

Figure 5: GPU throughput according to the input resolution. Note that the scale of throughput axis is scale. ’max batch’ means throughput measured on maximum available batch size.

4.3 Measuring Computational Efficiency

We measured the various computational resources required for the inference. All measurements were performed on a single V100 GPU with 32GB of VRAM.

Memory efficiency. According to Figure 4, our models consumed much less memory at larger resolutions than DeiT[touvron2020training] and Swin models[liu2021swin]. Our model can process a batch size up to 4× larger than other models showing similar performance.
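A rough back-of-the-envelope estimate shows why the gap widens with resolution. The sketch below counts only the intermediate matrix inside one attention layer: the token-by-token map for softmax SA versus the channel-by-channel product used by a softmax-free, regrouped attention. The batch size of 64 matches the Figure 4 caption; the 6 heads, 64 channels per head, 16-pixel patches, and 4-byte values are assumptions, and all other activations are ignored.

```python
def tokens(resolution: int, patch: int = 16) -> int:
    """Number of patch tokens for a square input (patch size is an assumption)."""
    return (resolution // patch) ** 2


def softmax_map_bytes(resolution: int, batch: int = 64, heads: int = 6,
                      bytes_per_value: int = 4, patch: int = 16) -> int:
    """Bytes for the N x N attention maps of one softmax SA layer."""
    n = tokens(resolution, patch)
    return batch * heads * n * n * bytes_per_value


def linear_map_bytes(batch: int = 64, heads: int = 6, head_dim: int = 64,
                     bytes_per_value: int = 4) -> int:
    """Bytes for the d x d products of one regrouped (softmax-free) layer."""
    return batch * heads * head_dim * head_dim * bytes_per_value


for resolution in (224, 384, 448):
    quadratic_mib = softmax_map_bytes(resolution) / 2**20
    linear_mib = linear_map_bytes() / 2**20
    print(f"{resolution}px: N x N maps {quadratic_mib:8.1f} MiB, d x d maps {linear_mib:.1f} MiB")
```

The channel-by-channel term does not depend on the resolution at all, so in this simplified picture the remaining memory is dominated by token activations, which grow linearly with the number of tokens.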

GPU throughput. Figure 5 shows that our model is faster than other models showing similar performance. Moreover, the GPU throughput of our model decreases more slowly than that of other models as the input resolution increases.

5 Conclusion

In this paper, we proposed a simple method that ensures linear complexity for SA without loss of performance. By replacing the softmax function, we removed the quadratic operation using the associative law of matrix multiplication. This type of factorization has typically caused performance degradation in earlier studies. The X-ViT models outperformed most of the existing state-of-the-art transformer-based and CNN-based models for image classification. We have shown that our models can also be deployed well for general purposes. On dense prediction tasks, our X-ViT models show performance that is competitive with or better than that of earlier models. With more optimized structures for dense prediction, we expect our models to become more efficient and perform better.