1 Introduction
Following early successes in natural language processing (NLP), several transformer-based studies have shown impressive results in vision tasks. Recent studies have shown that transformer-based architectures advance the state of the art across a wide range of subject areas, including image classification, object detection, semantic segmentation, and generative models.
Despite its great successes, the original self-attention (SA) mechanism has $O(N^2)$ time and memory complexity, where $N$ is the number of tokens, due to the matrix multiplication of $Q$ and $K^{\top}$. This is one of the well-known drawbacks of traditional transformers. For vision tasks, $N$ is proportional to the input resolution, so SA consumes 16 times the computational resources if the width and height of the input image are doubled (doubling both sides quadruples $N$, and the cost grows with $N^2$).
Here, we propose a new model that implements an alternative, novel SA mechanism to avoid this drawback. We call it X-ViT, the vision transformer with XNorm. Our key idea is to replace the softmax nonlinearity with a simple $\ell^2$-norm. Using the associative law of matrix multiplication, our new SA algorithm requires far fewer computational resources than the original SA.

The main contributions of this paper are summarized as follows:
- We propose a novel constraint scheme, XNorm, that generates a unit hypersphere to extract relational features. It eliminates the non-linearity from SA by replacing the softmax function. Our module has $O(N)$ complexity and handles high-resolution inputs efficiently.
- We demonstrate that X-ViT can be adopted for general purposes. Our proposed method outperforms most state-of-the-art transformer-based models at lower capacity and FLOPs. In particular, our models perform well in lightweight regimes.
- We empirically show that X-ViT models have faster inference speed and require less GPU memory.
2 Related Works
Dosovitskiy et al.[dosovitskiy2020image] proposed a vision transformer (ViT), which showed that transformer-based models could be used for vision tasks. After the achievements of ViT, DeiT[touvron2020training] introduced data-efficient training strategies for vision transformers with detailed ablation studies. They solved the ViT data efficiency problem successfully, and most of the current transformer-based models follow their schemes.
Apart from architectural strategies, many approaches have been proposed to address the complexity of the SA mechanism. They can be summarized in several categories: those that use their own spatial patterns[ho2019axial, child2019generating, sukhbaatar2019adaptive], those that use various low-rank factorization methods[choromanski2020rethinking, shen2021efficient, wang2020linformer], those that use linear approximation by sampling important tokens[kitaev2020reformer, xiong2021nystr], and those that use cross-covariance matrices instead of Gram matrices[el2021xcit]. Although the detailed methods differ considerably, our XNorm is most closely related to the low-rank factorization methods.
Tokens-to-token ViT, introduced by Yuan et al.[yuan2021tokens], aims to achieve a similar objective through different approaches. They presented a method of overlapping tokens to locally correlate patches. They did not use additional methods to reduce the computation, except when using small channels. El-Nouby et al. introduced local patch interactions in XCiT[el2021xcit]. With two depthwise convolutions[chollet2017xception] added after XCA, XCiT achieved better performance. Our models are generally inspired by the intrinsic optimization strategies that XCiT[el2021xcit] introduced, while we present our own SA method.
3 Methods

Figure 2: Overall structure of the proposed model. Note that affine layers[touvron2021resmlp] follow each module.
3.1 XNorm
The structure of our model is shown in Figure 2. It is a mixture of convolutional layers, an X-ViT module, and a simple feed-forward MLP layer.
For an input $x \in \mathbb{R}^{N \times d}$, the original SA mechanism is formulated as follows:

$$Q = xW_Q, \quad K = xW_K, \quad V = xW_V, \tag{1}$$

$$\mathrm{A}(x) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \tag{2}$$

where $\mathrm{A}(\cdot)$ denotes the attention operator.
By removing the softmax function from the original SA, $(QK^{\top})V$ can be rearranged into $Q(K^{\top}V)$ by the associative law. Compared to the $O(N^2)$ complexity of the original SA, each of the two matrix multiplications then has only $O(N)$ complexity in the number of tokens $N$.
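To make the cost difference concrete, the sketch below (a minimal PyTorch illustration; the sizes are ours, not the paper's) contrasts the two multiplication orders: $(QK^{\top})V$ materializes an $N \times N$ matrix, while $Q(K^{\top}V)$ only forms a $d \times d$ one.

```python
import torch

# Illustrative sizes: 3136 tokens (a 56x56 grid) with 64-dimensional heads.
# Double precision is used so the equality check below is exact up to tiny float error.
N, d = 3136, 64
Q, K, V = (torch.randn(N, d, dtype=torch.double) for _ in range(3))

# Original ordering: materializes an N x N matrix -> time and memory grow quadratically in N.
out_quadratic = (Q @ K.transpose(-2, -1)) @ V      # (N, N) @ (N, d)

# Re-ordered by associativity: K^T V is a small d x d matrix -> cost grows linearly in N.
out_linear = Q @ (K.transpose(-2, -1) @ V)         # (N, d) @ (d, d)

# Without the softmax nonlinearity, the two orderings give the same result.
assert torch.allclose(out_quadratic, out_linear)
```

This re-ordering is only valid because the softmax between $QK^{\top}$ and $V$ has been removed, which is exactly the role the constraint introduced next plays.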
Thus, we designed a simple constraint to replace the softmax function. Our proposed method, called cross-normalization (XNorm), is defined as follows:
$$\mathrm{XN}(\mathbf{x}) = \frac{\gamma\,\mathbf{x}}{\sqrt{\sum_{i=1}^{d} x_i^{2}}}, \tag{3}$$

$$\mathrm{A}(x) = \mathrm{XN}(Q)\,\mathrm{XN}(K^{\top}V), \tag{4}$$

where $\gamma$ is a learnable parameter and $d$ is the number of embedding dimensions. XNorm is a common $\ell^2$-norm, applied to the patches of $Q$ and the filters of $K^{\top}V$.
In the above formulation, the patches of $x$ are projected to dimension $d$ by $W_Q$. After that, the pixel-to-pixel relations are computed by multiplying $K^{\top}V$. In this process, we observed that the variance of the magnitudes of the pixel vectors can harm the stability of training in the initial stage. With XNorm, all pixels are normalized to unit-length vectors, which stabilizes training and improves the performance of the model.
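The following PyTorch module is a minimal sketch of how we read Eqs. 3 and 4; the module and helper names, the shape and sharing of $\gamma$, the epsilon, and which axis of $K^{\top}V$ is normalized are assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn


def xnorm(x, gamma, eps=1e-6):
    # Eq. 3 (as reconstructed): rescale each vector to unit L2 length, then by a learnable gamma.
    return gamma * x / (x.norm(dim=-1, keepdim=True) + eps)


class XViTAttention(nn.Module):
    """Sketch of the linearized attention of Eq. 4: XN(Q) @ XN(K^T V)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.gamma = nn.Parameter(torch.ones(1))   # shared across heads (shape is an assumption)

    def forward(self, x):                          # x: (B, N, C) patch tokens
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each: (B, heads, N, head_dim)

        context = k.transpose(-2, -1) @ v          # (B, heads, head_dim, head_dim); no N x N matrix
        out = xnorm(q, self.gamma) @ xnorm(context, self.gamma)
        out = out.transpose(1, 2).reshape(B, N, C) # back to (B, N, C)
        return self.proj(out)
```

For instance, `XViTAttention(dim=384)` maps a `(B, N, 384)` token tensor to a tensor of the same shape while only ever forming `head_dim x head_dim` context matrices, which is what gives the module its linear scaling in the number of tokens.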
Table 1: Comparison with state-of-the-art models on ImageNet1k classification.
Model | Top-1 Acc. (%) | Params | FLOPs |
---|---|---|---|
RegNetY-1.6G[radosavovic2020designing] | 78.0 | 11M | 1.6G |
DeiT-Ti[touvron2020training] | 72.2 | 5M | 1.3G |
XCiT-T12/16[el2021xcit] | 77.1 | 26M | 1.2G |
X-ViT-T | 78.8 | 10M | 1.9G |
ResNet-50[he2016deep] | 75.3 | 26M | 3.8G |
RegNetY-4G[radosavovic2020designing] | 80.0 | 21M | 4.0G |
DeiT-S[touvron2020training] | 79.8 | 22M | 4.6G |
Swin-T[liu2021swin] | 81.3 | 29M | 4.5G |
XCiT-S12/16[el2021xcit] | 82.0 | 26M | 4.8G |
X-ViT-S | 82.0 | 21M | 3.7G |
ResNet-101[he2016deep] | 75.3 | 47M | 7.6G |
RegNetY-8G[radosavovic2020designing] | 81.7 | 39M | 8.0G |
Swin-S[liu2021swin] | 83.0 | 50M | 8.7G |
XCiT-S24/16[el2021xcit] | 82.6 | 48M | 9.1G |
X-ViT-M | 82.8 | 37M | 7.0G |
RegNetY-16G[radosavovic2020designing] | 82.9 | 84M | 16.0G |
DeiT-B[touvron2020training] | 81.8 | 86M | 17.5G |
Swin-B[liu2021swin] | 83.5 | 88M | 15.4G |
XCiT-M24/16[el2021xcit] | 82.9 | 84M | 16.2G |
X-ViT-B | 83.3 | 64M | 11.9G |
EfficientNet-B7[tan2019efficientnet] | 84.3 | 66M | 37.0G |
XCiT-S24/8[el2021xcit] | 83.9 | 48M | 36.0G |
Swin-B/384[liu2021swin] | 84.5 | 48M | 47.0G |
X-ViT-M/384 | 83.8 | 37M | 20.5G |
X-ViT-B/384 | 84.3 | 64M | 35.1G |
3.2 X-ViT
To build our X-ViT model, we adopted architectural strategies from earlier vision transformer models[graham2021levit, xiao2021early, el2021xcit, touvron2021going]. In this section, we introduce several intrinsic structures that improve performance. The overall structure is illustrated in Figure 2.
Replace linear patch embedding with convolutions. Several recent studies[graham2021levit, xiao2021early] claimed that early convolutional layers help vision transformers to be well-trained. To adopt their strategy, we used convolutional layers instead of linear patch-embedding layers.
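As an illustration only, a convolutional stem of this kind might look like the sketch below; the number of stages, kernel sizes, and channel widths are placeholders rather than the configuration used in the paper.

```python
import torch.nn as nn


class ConvPatchEmbed(nn.Module):
    """Sketch of a convolutional stem that replaces linear patch embedding.

    Four stride-2 convolutions reduce a 224x224 image to a 14x14 token grid,
    i.e. an effective patch size of 16 (all widths here are illustrative).
    """

    def __init__(self, in_chans=3, embed_dim=384):
        super().__init__()
        widths = [embed_dim // 8, embed_dim // 4, embed_dim // 2, embed_dim]
        layers, prev = [], in_chans
        for i, w in enumerate(widths):
            layers.append(nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1))
            if i < len(widths) - 1:
                layers.append(nn.GELU())
            prev = w
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.stem(x)                        # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)     # (B, N, embed_dim) token sequence
```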
Multi-headed attention. Following the original transformer[vaswani2017attention], our modules are multi-headed for better regularization. The parameter $\gamma$ in Eq. 3 is applied to all heads to scale the importance of each head.
Convolutional layers. Designing an extra module to extract local features is not a new idea. We chose the simplest approach and experimented with several types of convolutional layers: simple depthwise convolutions and the local patch interaction (LPI) layers proposed in XCiT[el2021xcit]. We found that the latter performed better overall.
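A minimal sketch of such an LPI-style block is shown below; the choice of BatchNorm and GELU follows our reading of XCiT, and the interface (passing the grid size explicitly) is an assumption.

```python
import torch.nn as nn


class LocalPatchInteraction(nn.Module):
    """Sketch of an LPI block: two depthwise 3x3 convolutions over the token grid."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.act = nn.GELU()
        self.norm = nn.BatchNorm2d(dim)
        self.conv2 = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)

    def forward(self, x, H, W):                     # x: (B, N, C) with N == H * W
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)   # tokens back to a 2D grid
        x = self.conv2(self.norm(self.act(self.conv1(x))))
        return x.reshape(B, C, N).transpose(1, 2)   # back to a token sequence
```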
Class attention. In the ImageNet1k experiments, we used the class attention layers presented in CaiT[touvron2021going], which help the class token gather spatial information. Class attention is computed on the class token only to reduce computation, as in the original paper. We implemented the class attention layers using X-ViT modules, whereas CaiT used the SA module for class attention.
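One possible reading of this layer, reusing the `xnorm` helper and the XNorm form sketched in Section 3.1, is given below; queries come from the class token only, so a single output token is computed (the exact design is our interpretation, not the reference code).

```python
import torch
import torch.nn as nn


class XViTClassAttention(nn.Module):
    """Sketch: class attention where only the class token forms queries."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, x):                       # x: (B, 1 + N, C), class token first
        B, T, C = x.shape
        q = self.q(x[:, :1]).reshape(B, 1, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv(x).reshape(B, T, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)        # each: (B, heads, 1 + N, head_dim)

        context = k.transpose(-2, -1) @ v       # (B, heads, head_dim, head_dim)
        cls = xnorm(q, self.gamma) @ xnorm(context, self.gamma)
        cls = cls.transpose(1, 2).reshape(B, 1, C)
        return self.proj(cls)                   # updated class token only
```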
Backbone | Params | AP^b | AP^m |
---|---|---|---|
ResNet50[he2016deep] | 44M | 41.0 | 37.1 |
PVT-Small[wang2021pyramid] | 44M | 43.0 | 39.9 |
Swin-T[liu2021swin] | 48M | 46.0 | 41.6 |
XCiT-S12/16[el2021xcit] | 44M | 45.3 | 40.8 |
X-ViT-S | 40M | 44.6 | 40.4 |
ResNet101[he2016deep] | 63M | 42.8 | 39.2 |
PVT-Medium[wang2021pyramid] | 64M | 44.2 | 40.5 |
Swin-S[liu2021swin] | 69M | 48.5 | 43.3 |
XCiT-S24/16[el2021xcit] | 66M | 46.5 | 41.8 |
X-ViT-M | 56M | 46.0 | 41.0 |
ResNeXt101-64[xie2017aggregated] | 102M | 44.4 | 39.7 |
PVT-Large[wang2021pyramid] | 81M | 44.5 | 40.7 |
XCiT-M24/16[el2021xcit] | 101M | 46.7 | 42.0 |
X-ViT-B | 82M | 45.8 | 41.2 |
Table 2: Object detection and instance segmentation performance on COCO val2017 (AP^b: box AP, AP^m: mask AP).
4 Experiments
4.1 Image Classification
Dataset. For the image classification task, we trained our models using the ImageNet1k[deng2009imagenet] dataset from scratch.
Implementation details. Our setup was almost the same as that of DeiT[touvron2020training]. However, we optimized some hyperparameters according to the model size. The learning rate was scaled per 512 batch size following the linear scaling rule[you2017large] and linearly warmed up for the first 5 epochs. We trained our models for 400 epochs using the AdamW optimizer[loshchilov2017decoupled] with a cosine schedule. For data augmentation, CutMix[yun2019cutmix] and RandAugment[cubuk2020randaugment] were used, and we applied stronger augmentation to larger models.

Fine-tuning at higher resolution. Instead of training from scratch again, we fine-tuned X-ViT-M and X-ViT-B at a higher resolution. Our models achieved better performance in about one-tenth of the training time compared with training from scratch.
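For orientation, the recipe above can be summarized as a small configuration dictionary; only the values stated in this section come from the paper, and the remaining defaults (batch size, base learning rate, weight decay, augmentation magnitudes) are assumptions borrowed from common DeiT-style setups.

```python
batch_size = 1024  # assumed; only the per-512 scaling rule is stated in the text

train_config = {
    "epochs": 400,
    "optimizer": "AdamW",                        # [loshchilov2017decoupled]
    "base_lr": 5e-4 * batch_size / 512,          # linear scaling rule per 512 batch; 5e-4 base is assumed
    "weight_decay": 0.05,                        # assumed (DeiT default)
    "lr_schedule": "cosine",
    "warmup_epochs": 5,
    "augmentations": ["CutMix", "RandAugment"],  # stronger settings for larger models
}
```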
Comparison with state-of-the-art models. We experimented with four model sizes that use the same architectural design schemes as DeiT[touvron2020training] (see Table 1). As summarized in Figure 1, all of our models showed better accuracy and parameter efficiency than most concurrent transformer-based models.

4.2 Object Detection with Mask R-CNN
Implementation details. Our models were trained and evaluated on the COCO benchmark dataset[lin2014microsoft] for the object detection task. We used our models as the backbone and Mask R-CNN[he2017mask] as the detector head. Our training setup and hyperparameters follow those of DETR[carion2020end]. All experiments were performed with a 3× schedule, and the input resolution was fixed across all experiments.
Evaluation on COCO dataset. We compared CNNs[he2016deep, xie2017aggregated] and ViT models on object detection and instance segmentation tasks. To make the comparison fair, the experimental environment was the same for all the results. All models were pre-trained on the ImageNet1k dataset.
According to Table 2, our models significantly outperform the CNN-based models and achieve results that are competitive with or better than those of state-of-the-art vision transformers. Notably, Swin Transformer[liu2021swin] models showed better results across the board; their architectural strategy is better optimized for dense prediction tasks, whereas ours is not.

4.3 Measuring Computational Efficiency
We measured the various computational resources required for the inference. All measurements were performed on a single V100 GPU with 32GB of VRAM.
Memory efficiency. According to Figure 4, our models consume much less memory at larger resolutions than DeiT[touvron2020training] and Swin models[liu2021swin]. Our model can process up to a 4× larger batch size than other models showing similar performance.
GPU throughput. Figure 5 shows that our model is faster than other models with similar performance. Moreover, its GPU throughput degrades more slowly than that of other models as the input resolution increases.
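A sketch of how such measurements can be made on a single GPU is shown below; `model` stands for any of the compared networks, and the batch size, warm-up count, and step count are arbitrary choices for illustration.

```python
import time
import torch


@torch.no_grad()
def benchmark(model, batch_size=64, resolution=224, steps=50):
    """Measure peak GPU memory (GB) and throughput (images/s) at one input resolution."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)

    for _ in range(10):                          # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats(device)

    start = time.time()
    for _ in range(steps):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    throughput = batch_size * steps / elapsed
    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 1e9
    return throughput, peak_mem_gb
```

Sweeping `resolution` over increasing values with this kind of routine yields the memory- and throughput-versus-resolution comparisons discussed above.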
5 Conclusion
In this paper, we proposed a simple method that gives SA linear complexity without loss of performance. By replacing the softmax function, we removed the quadratic operation using the associative law of matrix multiplication. This type of factorization has typically caused performance degradation in earlier studies. The X-ViT models outperformed most existing state-of-the-art transformer-based and CNN-based models on image classification. We have also shown that our models can be deployed for general purposes: they achieve performance on dense prediction tasks that is competitive with or better than that of earlier models. With structures further optimized for dense prediction, we expect our models to become even more efficient and perform better.