EIT: Efficiently Lead Inductive Biases to ViT

by   Rui Xia, et al.
NetEase, Inc

Vision Transformer (ViT) depends on properties similar to the inductive bias inherent in Convolutional Neural Networks to perform better on non-ultra-large scale datasets. In this paper, we propose an architecture called Efficiently lead Inductive biases to ViT (EIT), which can effectively lead the inductive biases to both phases of ViT. In the Patches Projection phase, a convolutional max-pooling structure is used to produce overlapping patches. In the Transformer Encoder phase, we design a novel inductive bias introduction structure called decreasing convolution, which is introduced parallel to the multi-headed attention module and by which the embedding's different channels are processed separately. On four popular small-scale datasets, compared with ViT, EIT has an average accuracy improvement of 12.6% with fewer parameters and FLOPs. Compared with ResNet, EIT exhibits higher accuracy with only 17.7% of the parameters and fewer FLOPs. Finally, ablation studies show that EIT is efficient and does not require position embedding. Code is coming soon: https://github.com/MrHaiPi/EIT




1 Introduction

In recent years, Transformer [16] has swept the field of natural language processing (NLP) due to its superior performance. The fields of computer vision (CV) [3, 5, 20] and even multi-agent reinforcement learning (MARL) [12, 17] are gradually being infiltrated by it. Vision Transformer (ViT) [5] is the first CV model that is completely based on the Transformer architecture and achieves better performance on ultra-large-scale image classification tasks than convolutional neural networks (CNN). ViT is also the first model that tries to break the barrier of unifying CV and NLP with the same backbone network, and it achieved a breakthrough. Specifically, ViT divides images into non-overlapping patches that correspond to the words in NLP, and uses the full-transformer architecture to model the patches and complete the image classification task.

Figure 1: Introduction of IB. (a) Data is pre-processed by an IB structure [6] [25]. (b) MHA with built-in IB [21]. (c) MHA handles all data together with a module with IB structure [19]. (d) The MHA and the IB structure each process an invariant portion of the data [22]. (e) The IB structure in an earlier layer processes more data than the one in the next layer (decreasing structure).

Although ViT has been successful on ultra-large-scale datasets, it struggles to match CNN (e.g., ResNet [7]) when trained on small-scale datasets. One possible reason is that the transformer architecture lacks some of the properties similar to the inductive biases (IB) inherent to CNN, such as translation equivariance and locality [5]. Many studies have explored introducing IB into ViT, such as locality-based methods like TNT [6] and T2T [25], or CNN-based methods like LSRA [22], CvT [21] and ViT$_C$ [24]. These studies have experimentally demonstrated that the introduction of IB can improve the performance of ViT. The current ways of introducing IB can be grouped into four approaches, as shown in Fig.1 (a)-(d). The first three approaches suffer from a significant increase in parameters and FLOPs, and therefore further introduce CNN's hierarchical architecture in the final implementation. However, this destroys the structure of ViT and departs from the original intention of ViT, which is to unify CV and NLP with the same structure. The fourth approach does not introduce more parameters and FLOPs. Furthermore, all four approaches suffer from the same inefficient introduction of IB. The reason is that the input of each layer's MHA in all four approaches contains data directly processed by some MHA of a preceding layer. The presence of each MHA weakens the IB of the whole expression and thus causes IB to weaken layer by layer.

This paper proposes a model called EIT, which can Efficiently lead the IB to ViT without changing its backbone structure and with fewer parameters and FLOPs. Unlike the above four approaches, which use the same structure at each layer, we propose a novel decreasing IB introduction structure that aims to carry IB through the entire ViT network with as little weakening as possible and with fewer parameters and FLOPs. We make the following two changes to obtain the decreasing structure: 1) the linear Patches Projection (PaPr) of ViT is replaced by a convolutional layer plus a maximum pooling layer, which is called EIT$_P$; 2) a decreasing convolutional structure, called EIT$_T$, is introduced in parallel to the multi-headed attention (MHA) of the Transformer Encoder (TrEn), so that different channel dimensions of the embedding are processed by the convolution and by MHA, respectively.

Since ViT exhibits poor performance on small-scale datasets, we validated the performance of EIT on four popular small-scale datasets showing that the EIT outperforms the similar ViT-like methods (ViTs) and the CNN-like methods (CNNs). Moreover, the proposed decreasing structure is generally applicable for attention-based architectures and has further impacts on a broader range of applications. We summarize the contributions as follows:

  1. To the best of our knowledge, we are the first to find that the introduction of IB can improve the diversity of head-attention distances in the transformer, which leads to an improvement in performance.

  2. We analyze the composition of the MHA's inputs and demonstrate that the decreasing structure can introduce IB more efficiently.

  3. We propose two novel structures that efficiently lead IB to ViT with fewer parameters and FLOPs, without modifying the backbone of ViT, which preserves the unification of CV and NLP.

  4. We conduct comprehensive experiments that show EIT has higher accuracy with fewer parameters and FLOPs than similar IB introduction methods. Compared with CNNs, EIT shows better performance with much fewer parameters.

| Method | Position Encoding | Lead IB to PaPr phase | Lead IB to TrEn phase | Backbone Structure |
|---|---|---|---|---|
| LSRA [22] | Cosine | None | Lite Transformer Block (LTB) | No Chg. |
| ViT [5] | Trainable | None | None | No Chg. |
| TNT [6] | Trainable | Patch + Pixel | Patch + Pixel | Changed |
| T2T [25] | Trainable | Concatenate | Concatenate | Changed |
| CvT [21] | None | Convolutional Embedding (CvT$_P$) | Convolutional Mapping QKV (CvT$_T$) | |
| ViT$_C$ [24] | Trainable | Convolutional Stem (CoSt) | None | No Chg. |
| EIT (ours) | None | Convolutional Embedding and Maxpool | Embedded Parallel Decreasing Convolution | No Chg. |

Table 1: Representative works of IB introduction.

2 Related Work

Transformer [16] is a network architecture that relies heavily on the self-attention mechanism to obtain global sensing capabilities. Since its introduction for machine translation in 2017, Transformer has achieved state-of-the-art results in many NLP tasks [4, 1]. Given this, researchers in the CV field have also applied Transformer to their research and achieved competitive results compared to CNN. Examples include image classification [3, 5], object detection [27], segmentation [18, 2], image enhancement [2], image generation [10], and video processing [26].

2.1 Vision Transformer

Although there are many Transformer-based models in the CV field, ViT [5] was the first model based entirely on Transformer that tried to unify CV and NLP with the same network structure. In its implementation, ViT first splits an image into non-overlapping patches and then maps the patches into patch embeddings with a linear mapping layer. Finally, it classifies the images by stacking multiple standard TrEns. However, compared with CNN, the better performance of ViT relies heavily on ultra-large-scale datasets (e.g., ImageNet-21k and JFT-300M) because ViT lacks IB. In this paper, we study how to efficiently lead CNN's inherent IB to ViT without changing its backbone structure. The ultimate goal is to improve its performance on small-scale datasets without breaking the uniformity of CV and NLP.

2.2 Lead Inductive Biases to ViT

Many methods have been proposed to lead IB to ViT. For example, Long-Short Range Attention (LSRA) [22] introduced the Lite Transformer Block (LTB) to TrEn, which splits the data along the channel dimension and hands half of the data that MHA would process to a convolutional layer. Transformer-in-Transformer (TNT) [6] introduced a transformer module inside patches to model a more detailed pixel-level representation. Tokens-to-Token (T2T) [25] stitched together neighbouring embeddings to form a new embedding at the original location, changing ViT's PaPr and TrEn. Such an operation can preemptively improve the similarity of neighbouring embeddings, which in turn introduces IB. CvT [21] leads IB to both PaPr and TrEn by convolution operations with a hierarchical structure. Specifically, CvT changed the original linear mapping of ViT to a convolutional mapping in both PaPr and TrEn, with a convolution stride smaller than the kernel size. ViT$_C$ [24] replaced the PaPr with a Convolutional Stem to help transformers see better.

Summarizing the above works, we find two phases in which IB can be led to ViT: the PaPr phase and the TrEn phase. Table 1 summarizes the contributions of the above works. However, all of the above works suffer either from the IB introduction weakening layer by layer or from the inability to reconcile the contradiction between keeping the Transformer backbone unchanged and improving performance. Compared with the above methods, our method ensures both the efficiency of IB introduction and the backbone invariance without increasing the FLOPs and parameters.

Figure 2: Attention distances of three models. Panels: ViT [5]; Lead IB to PaPr (ViT$_C$) [24]; Lead IB to TrEn (LSRA) [22]. All models consist of five TrEns, and each TrEn contains ten heads. The results are from training on Cifar10 [9]. Attention distance was computed for 2000 example images from Cifar10 by summing the distance between the selected query pixel and all other pixels, weighted by the attention weight [5].

3 Our Approach

3.1 Motivation

To investigate the effect of IB introduction on ViT without changing the backbone of ViT, we explored the IB introduction in the two phases of ViT using ViT$_C$ (PaPr phase) and LSRA (TrEn phase); the results are shown in Fig.2. The results show that the IB introduction improves the model's performance, which is intuitively manifested in the increased diversity of head-attention distances (head diversity), ensuring that each layer has as many scales of attention distance as possible. The head diversity of LSRA is greater than that of ViT$_C$, and LSRA thus performs better. We believe this is because ViT$_C$ only introduces IB at the very beginning of the model, while LSRA introduces IB throughout all TrEns. In other words, the IB introduction of LSRA is more efficient than that of ViT$_C$. However, we find that the head diversity of LSRA's deep layers is still small, so we believe that the IB introduction of LSRA is not efficient enough. We speculate that the network's performance will improve further if IB is introduced more efficiently so that the head diversity at the deep layers becomes greater. To further improve the efficiency of IB introduction, we investigated why LSRA (Embedded Parallel) cannot bring IB to the deeper layers.
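The attention distance described in the caption of Fig.2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `mean_attention_distance`, the use of a square patch grid and the averaging over query positions are assumptions made here for the example.

```python
import numpy as np

def mean_attention_distance(attn, grid_size, patch_size=1):
    """Attention-weighted distance per head.

    attn: array of shape (heads, N, N) with N = grid_size ** 2. Each row holds
    the attention weights of one query patch over all patches (class token
    removed, rows summing to 1). Returns one value per head.
    """
    ys, xs = np.divmod(np.arange(grid_size * grid_size), grid_size)
    coords = np.stack([ys, xs], axis=1).astype(float) * patch_size  # (N, 2)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))            # pairwise distances, (N, N)
    per_query = (attn * dist[None]).sum(-1)        # weighted sum per query
    return per_query.mean(-1)                      # average over query positions

# A head that attends only to itself has distance 0; a head with uniform
# attention has a larger distance.
local_head = np.eye(4)[None]
global_head = np.full((1, 4, 4), 0.25)
print(mean_attention_distance(local_head, grid_size=2))
print(mean_attention_distance(global_head, grid_size=2))
```

Under this sketch, a layer with high head diversity is one whose heads yield widely spread values, some close to the local case and some close to the global case.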

3.2 Embedded Parallel

First, we give a short form of the expression for each layer of ViT. We ignore the operations (position embedding, class token and normalization) that do not affect the attention mechanism. For the input $X^{0}$, the output of each ViT layer can be abbreviated as the following equation.

$$X^{l} = X^{l-1} + \mathrm{MHA}(X^{l-1}) + \mathrm{MLP}\big(X^{l-1} + \mathrm{MHA}(X^{l-1})\big) \tag{1}$$

We can understand the above equation as follows: if $X^{l-1}$ has IB of intensity $I^{l-1}$, then $X^{l}$ has IB of intensity $I^{l}$. Since the presence of MHA weakens the IB of the whole equation (MHA has an innate global attention mechanism), and the MLP (fully connected layer) does not enhance the IB of the whole equation (the MLP does not fuse different patches of the same channel), $I^{l} < I^{l-1}$. This means that the IB intensity of $X^{l}$ is weaker than that of $X^{l-1}$.

If the Embedded Parallel structure (convolution) is added, we can rewrite Eq.1 as follows, ignoring the index of patches in $X$ and the input of the MLP.

$$X^{l} = X^{l-1} + \big[\mathrm{Conv}(X^{l-1}_{:c_l}),\ \mathrm{MHA}(X^{l-1}_{c_l:})\big] + \mathrm{MLP}(\cdot) \tag{2}$$

where $\mathrm{Conv}$ stands for the convolution with some reshape operations, $c_l$ is the number of channels processed by $\mathrm{Conv}$ in layer $l$, the subscripts $:c_l$ and $c_l:$ select the first $c_l$ channels and the remaining channels, and $[\cdot,\cdot]$ denotes concatenation along the channel dimension. Due to the specificity of the convolution operation, $\mathrm{Conv}$ does not handle the class embedding. Since the network's final output depends on the class embedding, and MHA is the only bridge between the class embedding and the other patches for data interaction, we can directly observe the IB introduction of each layer through the input of its MHA. If $c_l$ is the same for all layers ($c_l = c$), then the input of MHA can be expressed as follows.

$$X^{l-1}_{c:} = X^{l-2}_{c:} + \mathrm{MHA}(X^{l-2}_{c:}) + \mathrm{MLP}(\cdot)_{c:} \tag{3}$$

It can be seen that Eq.1 and Eq.3 are formally the same. The only difference is that the MLP's weakening of IB for the entire equation is reduced due to the enhanced IB of the MLP input, but the IB at each layer is still weak. This means that the introduction of IB is not efficient when $c_l$ equals $c_{l-1}$. This explains why the head diversity in the deeper layers of LSRA in Fig.2 does not increase. Although the above equation describes the inefficiency of introducing IB by LSRA (embedded parallel), the argument can be generalized to the other three methods (front, inside and parallel) of introducing IB (Fig.1). The only difference is the degree to which the MLP weakens IB in Eq.3. Our next goal is to find a way to enhance the IB of Eq.3.

3.3 Decreasing Embedded Parallel

We find that the IB of Eq.3 can be greatly enhanced with a small change: let $c_l$ decrease layer by layer, i.e., $c_l < c_{l-1}$ (if $c_l$ is increasing, the situation is similar to Eq.3). The input of MHA in the decreasing structure can be expressed as follows.

$$X^{l-1}_{c_l:} = X^{l-2}_{c_l:} + \big[\mathrm{Conv}(X^{l-2}_{:c_{l-1}})_{c_l:c_{l-1}},\ \mathrm{MHA}(X^{l-2}_{c_{l-1}:})\big] + \mathrm{MLP}(\cdot)_{c_l:} \tag{4}$$

Eq.4 differs significantly in form from Eq.3, specifically in that $\mathrm{Conv}$ now appears in the second term. Furthermore, the input of $\mathrm{Conv}$ in Eq.4, which we can expand to get the following equation, carries IB with little weakening.

$$X^{l-2}_{:c_{l-1}} = X^{l-3}_{:c_{l-1}} + \mathrm{Conv}(X^{l-3}_{:c_{l-2}})_{:c_{l-1}} + \mathrm{MLP}(\cdot)_{:c_{l-1}} \tag{5}$$

There is no longer an MHA operation in Eq.5. The input ($X^{l-3}_{:c_{l-2}}$) to the $\mathrm{Conv}$ of Eq.5 and the first term ($X^{l-3}_{:c_{l-1}}$) of Eq.5 have the same form as Eq.5. This means that channels $c_l$ to $c_{l-1}$ of each MHA layer's input are not directly disturbed by any MHA of the preceding layers. Eq.4 and Eq.5 enable the strongest IB in each layer to be passed to the MHA of the next layer with little distortion, thus enabling the efficient introduction of IB.
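The argument of Sections 3.2 and 3.3 can be illustrated with a small bookkeeping sketch. It only tracks, for every channel that MHA reads in a layer, whether that channel was handled by the convolution or by MHA in the previous layer. The function name and the convention that the convolution takes the first `c` channels of each layer (and MHA the rest) are assumptions made for this illustration.

```python
def mha_input_provenance(conv_channels_per_layer, total_channels):
    """Count where the channels read by MHA in layer l came from in layer l-1.

    conv_channels_per_layer[l] is the number of leading channels given to the
    convolution in layer l (assumed layout); MHA reads the remaining channels.
    """
    report = []
    for layer in range(1, len(conv_channels_per_layer)):
        c_prev = conv_channels_per_layer[layer - 1]
        c_cur = conv_channels_per_layer[layer]
        # MHA in this layer reads channels [c_cur, total_channels).
        # Of those, channels [c_cur, c_prev) went through the convolution before.
        from_conv = max(0, c_prev - c_cur)
        from_mha = (total_channels - c_cur) - from_conv
        report.append({"layer": layer, "from_conv": from_conv, "from_mha": from_mha})
    return report

print(mha_input_provenance([4, 4, 4], total_channels=8))     # invariant
print(mha_input_provenance([2, 4, 6], total_channels=8))     # increasing
print(mha_input_provenance([6, 4, 2, 0], total_channels=8))  # decreasing
```

Only the decreasing schedule gives every MHA some channels that come straight from a convolution; with the invariant and increasing schedules, everything MHA reads was already processed by MHA in the previous layer.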

3.4 Network Architecture

3.4.1 EIT

Based on the above analysis, we design a decreasing convolution structure to improve the efficiency of the IB introduction. The architecture of the proposed model is shown in Fig.3.

Figure 3: Model Overview. The backbone of EIT is the same as ViT. EIT$_T$ and MHA each process all the patch embeddings, but in different channel dimensions. $c_l$ and $m_l$ are the numbers of channels processed by EIT$_T$ and MHA, respectively. EIT$_T$ does not handle the class embedding.

We propose two simple structures called EIT$_P$ (EIT for PaPr) and EIT$_T$ (EIT for TrEn) to form a complete decreasing structure. EIT$_P$ uses a convolution layer with a stride smaller than the kernel size, i.e., the patches overlap, which improves the IB introduction efficiency of PaPr. However, it also means we get redundant patches, which introduce excess FLOPs. To solve this problem without seriously destroying the similarity of adjacent patches, we filter out the redundant patches with a maximum pooling layer.

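3.4.2 EIT$_P$

Formally, given a 2D image $I$ of height $H$ and width $W$, we learn a function $f$ that maps $I$ into new embeddings $X' \in \mathbb{R}^{H' \times W' \times C}$. $f$ is a 2D convolution operation with $C$ kernels, kernel size $k$, stride $s$ (in ViT, $s = k$, but in this work, $s < k$) and padding $p$. The height and width of the new embedding take the following values.

$$H' = \left\lfloor \frac{H + 2p - k}{s} \right\rfloor + 1, \qquad W' = \left\lfloor \frac{W + 2p - k}{s} \right\rfloor + 1$$

where $\lfloor \cdot \rfloor$ denotes rounding down. The height and width of $X'$ are then reduced by a maximum pooling ($\mathrm{MaxPool}$) layer with kernel size $k_m$ and stride $s_m$. By adjusting $s_m$, we can reduce the redundant patches introduced by the convolution operation. The pooled embedding is $X'' \in \mathbb{R}^{H'' \times W'' \times C}$, where $H''$ and $W''$ follow from the same rule applied with the pooling kernel size, stride and padding. Finally, $X''$ is flattened into $X \in \mathbb{R}^{N \times C}$ with $N = H'' W''$ as the final output of EIT$_P$. The above descriptions can be summarized in the following expression.

$$X = \mathrm{Flatten}\big(\mathrm{MaxPool}(\mathrm{Conv}(I))\big)$$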
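As a worked example of the size rule above, the following sketch computes the patch grid produced by a convolution followed by max pooling. The kernel and stride values follow the "Conv:(3,3,1); Maxpool:(4,4,4)" setting listed in Table 2; the convolution padding of 1 and the pooling padding of 0 are assumptions, since the text does not state them, and the function names are hypothetical.

```python
def output_size(size, kernel, stride, padding=0):
    """Standard output length of a strided convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

def eit_p_grid(image_size, conv_kernel, conv_stride, conv_padding,
               pool_kernel, pool_stride, pool_padding=0):
    """Return (side after convolution, side after max pooling)."""
    after_conv = output_size(image_size, conv_kernel, conv_stride, conv_padding)
    after_pool = output_size(after_conv, pool_kernel, pool_stride, pool_padding)
    return after_conv, after_pool

# 32x32 input, 3x3 convolution with stride 1 (padding 1 assumed),
# then 4x4 max pooling with stride 4.
after_conv, after_pool = eit_p_grid(32, 3, 1, 1, 4, 4)
print(after_conv, after_pool, after_pool ** 2)

# For comparison: ViT's non-overlapping 4x4 patches on the same image.
print(output_size(32, 4, 4) ** 2)
```

Under these assumptions the overlapping convolution first yields a 32x32 map, and the pooling layer reduces it to the same 8x8 grid of 64 patches that a plain ViT with patch size 4 would produce.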


3.4.3 EIT$_T$

The structure of IB introduction in LSRA [22] contains one activation layer, one convolutional layer, and one fully connected layer. In contrast, EIT$_T$ contains only one convolutional layer, because we believe that a convolutional layer alone is enough to introduce IB. It does not matter much whether there are other types of layers (e.g., activation layers, fully connected layers) or how many convolutional layers there are; adding them does not bring a significant performance increase. We discuss this issue in the ablation studies.

Formally, given the normalized input $X^{l} \in \mathbb{R}^{(1+N) \times C}$ of the $l$-th layer ("1" represents the added class embedding), different channel dimensions of the data are processed by EIT$_T$ and MHA, respectively.

$$X^{l}_{T} = X^{l}_{:c_l}, \qquad X^{l}_{M} = X^{l}_{c_l:}$$

where $c_l$ and $m_l$ are the numbers of channel dimensions processed by EIT$_T$ and MHA, respectively, satisfying $c_l + m_l = C$. The final output is the combination of the output of the MHA and the EIT$_T$.

$$Y^{l} = \big[\mathrm{EIT}_T(X^{l}_{T}),\ \mathrm{MHA}(X^{l}_{M})\big]$$

Since the convolution in EIT$_T$ handles two-dimensional (2D) data, some dimensional transformations are involved before and after the convolution operation. Additionally, EIT$_T$ does not model the class embedding because it is difficult to perform a 2D convolution once the class embedding is added to the patch sequence. MHA uses the same operation as ViT [5], which is not repeated here.
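The channel split, the reshape around the convolution and the final concatenation can be sketched in NumPy as below. This is an illustrative sketch under several assumptions, not the authors' code: the convolution takes the first `k` channels, the class embedding's first `k` channels are passed through unchanged, the convolution is a plain 3x3 convolution with zero padding, and `attention_stub` is a projection-free single-head placeholder for MHA. All function names are hypothetical.

```python
import numpy as np

def conv3x3_same(x, weight):
    """x: (H, W, Cin); weight: (3, 3, Cin, Cout); zero padding, stride 1."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, weight.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] @ weight[dy, dx]
    return out

def attention_stub(tokens):
    """Placeholder for MHA: single-head self-attention without projections."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def eit_t_mixing(x, k, grid, conv_weight):
    """x: (1 + N, C) with the class embedding in row 0 and N = grid * grid."""
    cls_tok, patches = x[:1], x[1:]
    conv_in = patches[:, :k].reshape(grid, grid, k)      # tokens -> 2D map
    conv_out = conv3x3_same(conv_in, conv_weight).reshape(grid * grid, k)
    # Assumption: the class embedding skips the convolution unchanged.
    conv_branch = np.concatenate([cls_tok[:, :k], conv_out], axis=0)
    mha_branch = attention_stub(x[:, k:])                # includes class embedding
    return np.concatenate([conv_branch, mha_branch], axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(1 + 16, 8))          # 4x4 patch grid, 8 channels
weight = rng.normal(size=(3, 3, 4, 4))    # convolution over the first 4 channels
print(eit_t_mixing(x, k=4, grid=4, conv_weight=weight).shape)
```

The output keeps the input shape, so the block can replace the MHA step of a standard encoder layer without touching the rest of the backbone.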


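To ensure that the number of channels processed by MHA is divisible by the number of heads while the number of channels processed by EIT$_T$ decreases layer by layer, for a network with a total of $L$ encoder layers, the channel allocation of layer $l$ is set to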


Here, integer division is used, and the number of heads in MHA is generally required to divide $C$. A division ratio controls how the $C$ channels are divided between EIT$_T$ and MHA. It is worth noting that in the last layer the number of channels processed by EIT$_T$ takes the value 0. This is valid because the classification MLP uses only the class embedding of the network's last layer, and EIT$_T$ does not operate on the class embedding. An EIT$_T$ in the last layer would therefore not contribute to the result of the network.
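The exact allocation formula is not reproduced here. The sketch below shows one hypothetical schedule that satisfies the constraints stated in the text: the convolution's share shrinks layer by layer, it reaches 0 in the last layer, and the number of channels left to MHA stays divisible by the number of heads. The linear decay, the `ratio` parameter and its default of 0.5 are assumptions for illustration only.

```python
def decreasing_channel_schedule(total_channels, heads, num_layers, ratio=0.5):
    """Hypothetical schedule: returns [(conv_channels, mha_channels)] per layer.

    The convolution's share decays linearly from roughly ratio * total_channels
    to 0 and is rounded down to a multiple of `heads`, so that the MHA share
    remains divisible by `heads`.
    """
    if total_channels % heads != 0:
        raise ValueError("heads must divide total_channels")
    schedule = []
    for layer in range(1, num_layers + 1):
        target = total_channels * ratio * (num_layers - layer) / num_layers
        conv_channels = int(target) // heads * heads
        schedule.append((conv_channels, total_channels - conv_channels))
    return schedule

# Mini configuration from Table 2: C = 250, 10 heads, 5 layers.
for layer, (conv_c, mha_c) in enumerate(decreasing_channel_schedule(250, 10, 5), 1):
    print(layer, conv_c, mha_c)
```

With these inputs the convolution receives 100, 70, 50, 20 and 0 channels across the five layers, and every MHA width is a multiple of 10.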

4 Experiments

In this section, we evaluate the performance of EIT on small-scale datasets and verify the efficiency of the IB introduction of EIT.

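| Model | C | EIT$_P$ | EIT$_T$ | Layers | MLP Size | Heads | Params |
|---|---|---|---|---|---|---|---|
| EIT3/4-Mini | 250 | Conv: (3,3,1); Maxpool: (4,4,4) | Conv: (3,3,1) | 5 | 4 | 10 | 3.766M |
| EIT3/4-Tiny | 330 | Conv: (3,3,1); Maxpool: (4,4,4) | Conv: (3,3,1) | 8 | 4 | 10 | 10.59M |
| EIT3/4-Base | 400 | Conv: (3,3,1); Maxpool: (4,4,4) | Conv: (3,3,1) | 10 | 4 | 16 | 19.54M |

Table 2: Details of EIT model variants. "EIT3/4" indicates that the convolutional kernel size used in EIT$_P$ and EIT$_T$ is 3 (the stride is 1 by default), the kernel size and stride of the maxpool in EIT$_P$ are 4, and C is the number of channels (i.e., the embedding dimension). "Params" is for a 10-category classification task with an input image size of 32 × 32 and includes the trainable position embedding.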
| Method | ViTs | ResNet | | | | |
|---|---|---|---|---|---|---|
| Optimizer | SGD | Adam | SGD | SGD | Adam | Adam |
| Lr. Scheduler | Cosine | None | Cosine | Cosine | None | None |
| Ini Lr. Rate | 1e-3 | 5e-4 | 1e-3 | 1e-3 | 1e-3 | 5e-4 |
| End Lr. Rate | 1e-5 | None | 1e-5 | 1e-5 | None | None |
| Drop Ratio | 2e-1 | None | 1e-1 | 1e-1 | 1e-1 | None |

Table 3: Details of model training. All the models are trained with a total batch size of 25 for 300 epochs. EIT belongs to ViTs.

4.1 Set up

We use four popular small-scale datasets to evaluate the performance of EIT: Cifar10/100 [9], Fashion-Mnist, and Tiny ImageNet-200 [15]. We conducted four major sets of experiments as follows. 1) The comparison with ViT-like methods (ViTs) (ViT [5], LSRA [22], CvT [21] and ViT$_C$ [24]). 2) The comparison with CNN-like methods (CNNs) (ResNet [7], EfficientNet [13], EfficientNetV2 [14], MobileNetV2 [11] and MobileNetV3 [8]). 3) The visualization of attention maps and attention distance for ViTs. 4) The ablation studies with Cifar10/100.

4.1.1 Model Variants

We design three sets of parameters as shown in Table 2 for the experiments. Next, we will use a simple notation to denote the model used.

4.1.2 Training

The details of training are shown in Table 3. In all experiments, we use only random horizontal flipping and normalization for data augmentation.
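The two augmentation steps mentioned above can be sketched in a few lines of NumPy. The function name, the flip probability of 0.5 and the per-channel statistics used in the usage lines are assumptions for illustration; the text does not give these values.

```python
import numpy as np

def augment(image, mean, std, rng, flip_prob=0.5):
    """Random horizontal flip followed by per-channel normalization.

    image: float array of shape (H, W, C); mean and std: arrays of shape (C,).
    """
    if rng.random() < flip_prob:
        image = image[:, ::-1, :]  # mirror along the width axis
    return (image - mean) / std

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
# Hypothetical statistics; in practice they are computed on the training set.
augmented = augment(image, mean=np.full(3, 0.5), std=np.full(3, 0.25), rng=rng)
print(augmented.shape)
```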

Classes: 10; Image Size: 32 × 32; Patch Size: 4

| Model Idx | Method | Lead IB to PaPr Phase | Lead IB to TrEn Phase | FLOPs | Pram. Num. | Cifar10 | Cifar100 | Fashion-Mnist | Tiny ImageNet | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| 1.1 | ViT [5] | None | None | 0.515G | 3.798M | 0.682 | 0.413 | 0.888 | 0.246 | 0.557 (+0.0%) |
| 1.2 | LSRA [22] | None | LTB | 0.487G | 3.679M | 0.778 | 0.477 | 0.910 | 0.276 | 0.610 (+5.3%) |
| 1.3 | CvT* [21] | CvT$_P$ | None | 2.296G | 3.846M | 0.738 | 0.452 | 0.899 | 0.280 | 0.592 (+3.5%) |
| 1.4 | CvT* [21] | None | CvT$_T$ | 1.838G | 14.11M | 0.712 | 0.414 | 0.909 | 0.233 | 0.567 (+1.0%) |
| 1.5 | CvT* [21] | CvT$_P$ | CvT$_T$ | 7.580G | 14.16M | 0.777 | 0.497 | 0.914 | 0.288 | 0.619 (+6.2%) |
| 1.6 | ViT$_C$ [24] | CoSt | None | 0.553G | 4.108M | 0.700 | 0.431 | 0.902 | 0.269 | 0.576 (+1.9%) |
| 1.7 | EIT | EIT$_P$ | None | 0.527G | 3.793M | 0.746 | 0.479 | 0.911 | 0.283 | 0.605 (+4.8%) |
| 1.8 | EIT | None | EIT$_T$ | 0.501G | 3.771M | 0.818 | 0.523 | 0.922 | 0.313 | 0.644 (+8.7%) |
| 1.9 | EIT | EIT$_P$ | EIT$_T$ | 0.514G | 3.766M | 0.855 | 0.605 | 0.926 | 0.346 | 0.683 (+12.6%) |

Table 4: Comparison with the ViTs. The structural parameters (C, Layers, MLP size and Heads) of ViTs are the same. In ViTs, all convolution operations are implemented without further parameter optimization. * We did not use the hierarchy in CvT, to ensure the fairness of the comparison of ViTs. The convolutional kernel size of CvT$_P$ is 4, and the stride is 2. The value of in ViT$_C$ is 2, the convolutional kernel size is 4, and the stride is 2. Furthermore, we use a simple index (Idx) notation to denote the models compared in experiments.

4.2 Comparison

4.2.1 Comparison with the ViTs

We discuss the performance of EIT based on EIT3/4-Mini. In addition to the comparison with ViT, we also compare how EIT introduces IB in both phases of ViT with how CvT, LSRA and ViT$_C$ do so. The results are shown in Table 4. They show that both structures of EIT introduce IB more efficiently. Averaged over the four datasets, EIT improves on ViT by 12.6% (in absolute accuracy) with fewer parameters and FLOPs. Compared with CvT, LSRA and ViT$_C$, the average improvements of EIT are 6.4%, 7.3% and 10.7%, respectively.
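The averages and improvements quoted above can be re-derived from the per-dataset accuracies in Table 4. The short check below is only an arithmetic sketch: it assumes that the four accuracy columns are Cifar10, Cifar100, Fashion-Mnist and Tiny ImageNet, that averages are rounded half-up to three decimals, and that the improvements are differences of these rounded averages expressed in percentage points.

```python
from decimal import Decimal, ROUND_HALF_UP

# Accuracies copied from Table 4 (four datasets per model).
TABLE4 = {
    "ViT (1.1)":   ["0.682", "0.413", "0.888", "0.246"],
    "LSRA (1.2)":  ["0.778", "0.477", "0.910", "0.276"],
    "CvT (1.5)":   ["0.777", "0.497", "0.914", "0.288"],
    "ViT_C (1.6)": ["0.700", "0.431", "0.902", "0.269"],
    "EIT (1.9)":   ["0.855", "0.605", "0.926", "0.346"],
}

def average(accuracies):
    """Mean accuracy rounded half-up to three decimals."""
    total = sum(Decimal(a) for a in accuracies)
    return (total / len(accuracies)).quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

averages = {name: average(acc) for name, acc in TABLE4.items()}
# Improvement of EIT over each model, in percentage points.
gains = {name: (averages["EIT (1.9)"] - avg) * 100 for name, avg in averages.items()}

for name in TABLE4:
    print(name, averages[name], gains[name])
```

Exact decimal arithmetic is used instead of floats so that values such as 0.5755 round the same way as in the table.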

4.2.2 Comparison with the CNNs

We compare EIT with CNNs using the three proposed model variants; the results are shown in Table 5. They show that EIT can achieve higher accuracy with fewer parameters than CNNs. In particular, comparing Model Idx 2.4 with Model Idx 2.13 of Table 5, we can see that EIT has only 17.7% as many parameters as ResNet (3.766M vs. 21.30M), while the FLOPs and accuracy are almost the same as those of ResNet, or even "better".
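The ratios behind this comparison follow directly from the two rows of Table 5. The snippet below is a plain arithmetic check with the values copied from Model Idx 2.4 (ResNet34-3/1) and Model Idx 2.13 (EIT3/4-Mini); the dictionary layout is only for illustration.

```python
# Values copied from Table 5.
resnet34 = {"flops_g": 0.585, "params_m": 21.30, "avg_acc": 0.680}
eit_mini = {"flops_g": 0.514, "params_m": 3.766, "avg_acc": 0.683}

param_ratio = round(eit_mini["params_m"] / resnet34["params_m"] * 100, 1)
flops_ratio = round(eit_mini["flops_g"] / resnet34["flops_g"] * 100, 1)

print(f"EIT parameters relative to ResNet34: {param_ratio}%")
print(f"EIT FLOPs relative to ResNet34: {flops_ratio}%")
```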

Classes: 10; Image Size: 32 × 32

| Model Idx | Method | FLOPs | Pram. Num. | Cifar10 | Cifar100 | Fashion-Mnist | Tiny ImageNet | Average |
|---|---|---|---|---|---|---|---|---|
| 2.1 | ResNet18-4/2 [7] | 0.071G | 11.19M | 0.806 | 0.504 | 0.926 | 0.332 | 0.642 |
| 2.2 | ResNet18-3/1 [7] | 0.221G | 11.19M | 0.841 | 0.544 | 0.934 | 0.380 | 0.675 |
| 2.3 | ResNet34-4/2 [7] | 0.417G | 21.31M | 0.807 | 0.506 | 0.928 | 0.324 | 0.641 |
| 2.4 | ResNet34-3/1 [7] | 0.585G | 21.30M | 0.846 | 0.549 | 0.936 | 0.388 | 0.680 |
| 2.5 | EfficientNet-b0 [13] | 0.017G | 4.062M | 0.736 | 0.442 | 0.910 | 0.217 | 0.576 |
| 2.6 | EfficientNet-b1 [13] | 0.026G | 6.588M | 0.730 | 0.414 | 0.916 | 0.228 | 0.572 |
| 2.7 | EfficientNet-b3 [13] | 0.043G | 10.80M | 0.730 | 0.410 | 0.916 | 0.218 | 0.569 |
| 2.8 | EfficientNetV2-s [14] | 0.124G | 20.34M | 0.769 | 0.424 | 0.925 | 0.226 | 0.586 |
| 2.9 | EfficientNetV2-m [14] | 0.239G | 53.16M | 0.595 | 0.258 | 0.897 | 0.189 | 0.485 |
| 2.10 | MobileNetV2-($\alpha$=1.3) [11] | 0.021G | 3.783M | 0.778 | 0.300 | 0.923 | 0.237 | 0.560 |
| 2.11 | MobileNetV2-($\alpha$=2.2) [11] | 0.056G | 10.61M | 0.786 | 0.262 | 0.925 | 0.271 | 0.561 |
| 2.12 | MobileNetV3-large [8] | 0.014G | 4.239M | 0.746 | 0.369 | 0.923 | 0.191 | 0.557 |
| 2.13 | EIT3/4-Mini | 0.514G | 3.766M | 0.855 | 0.605 | 0.926 | 0.346 | 0.683 |
| 2.14 | EIT3/3-Mini | 0.804G | 3.775M | 0.865 | 0.610 | 0.932 | 0.343 | 0.688 |
| 2.15 | EIT3/4-Tiny | 1.415G | 10.59M | 0.859 | 0.618 | 0.928 | 0.354 | 0.690 |
| 2.16 | EIT3/3-Tiny | 2.214G | 10.60M | 0.873 | 0.616 | 0.933 | 0.363 | 0.696 |
| 2.17 | EIT3/4-Base | 2.593G | 19.54M | 0.863 | 0.614 | 0.929 | 0.356 | 0.691 |
| 2.18 | EIT3/3-Base | 4.056G | 19.63M | 0.875 | 0.638 | 0.930 | 0.370 | 0.703 |

Table 5: Comparison with the CNNs. Model Idx 2.13 is the same as Model Idx 1.9. The "4/2" in "ResNet18-4/2" means that the kernel size of the first convolution in ResNet is 4 × 4 and the stride is 2.

4.3 Visualization

To verify whether the performance improvement of EIT is due to the improvement in head diversity, we computed the attention maps and attention distances of ViTs, as shown in Fig.4.

It is clear that continuing to increase the head diversity of the network does improve the performance of ViT. Moreover, the efficient IB introduction can directly improve the head diversity of deep layers. Compared with other ViTs, EIT can introduce IB more efficiently based on Eq.4 and Eq.5, which results in better performance. Compared with CNNs, in which each layer has a constant attention distance, EIT has a variety of attention distances for each layer, which is why it is possible to achieve better performance with much fewer parameters. See more in Fig.5 and Fig.6.

The increase of head diversity leads to a decrease in the attentional range of heads in each layer, which exhibits more focused attention on the average attention maps of each layer.

4.4 Ablation Study

On the Cifar10/100 datasets, we designed four ablation experiments based on EIT3/4-Mini to verify that: 1) EIT$_T$ works better than Parallel Convolution; 2) the presence of a convolutional layer in EIT$_T$ is essential, while additional layers are not; 3) compared with the increasing and invariant structures, the decreasing structure is optimal; 4) EIT does not require position embedding.

Figure 4: Attention Distance and Attention Maps of ViTs. From top to bottom, these are the results for ViT [5], LSRA [22], CvT [21] and EIT, using Model Idx 1.1, 1.2, 1.5 and 1.9, respectively. The attention map of each layer is obtained by averaging the maps of its heads. The attention map of each head is the average over all embeddings. The attention distance is obtained by the same operation mentioned in Fig.2.

Figure 5: Attention Distance of EIT$_P$, EIT$_T$ and EIT. From left to right, these are the results for Model Idx 1.7, 1.8 and 1.9, respectively. The attention distance is obtained by the same operation mentioned in Fig.2. It can be seen that the improved head diversity in the deeper layers of the network is mainly due to the effect of EIT$_T$, which also demonstrates the efficiency of the decreasing structure for IB introduction.

Figure 6: Attention Distance of EIT-Tiny and EIT-Base. From top to bottom and left to right, these are the results for Model Idx 2.15, 2.16, 2.17 and 2.18, respectively. The attention distance is obtained by the same operation mentioned in Fig.2. It can be seen that as the network deepens, the deeper layers of EIT still possess great head diversity, which supports Eq.4 and Eq.5.

4.4.1 Parallel Convolution

We compared the performance of EIT$_T$ (Decreasing Embedded Parallel Convolution) and Parallel Convolution (a residual convolutional structure in which the convolution and MHA each process all the data and their outputs are summed as the final output). The results are shown in Table 6. The accuracy of EIT$_T$ is on average 2.2% higher than that of Parallel Convolution (0.730 vs. 0.708). Additionally, EIT$_T$ needs only about 57% of the parameters and 58% of the FLOPs of Parallel Convolution.

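Classes: 10; Image Size: 32 × 32

| Model Idx | TrEn with | FLOPs | Pram. Num. | Cifar10 | Cifar100 | Average |
|---|---|---|---|---|---|---|
| 3.1 (1.9) | EIT$_T$ | 0.514G | 3.766M | 0.855 | 0.605 | 0.730 |
| 3.2 | Parallel Convolution | 0.887G | 6.589M | 0.841 | 0.574 | 0.708 |

Table 6: Ablations on Parallel Convolution.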

4.4.2 Complicating EIT

We assume that the role of the convolutional layer in EIT$_T$ is only to introduce IB, so we use only one convolutional layer in EIT$_T$. Next, we try to complicate EIT$_T$ to see whether this improves performance. For example, we add multiple convolutional layers, activation layers, normalization layers, and fully connected layers. The results in Table 7 show that a single convolutional layer is the most efficient choice.

Classes: 10; Image Size: 32 × 32

| Model Idx | Structure of EIT$_T$ | FLOPs | Pram. Num. | Cifar10 | Cifar100 | Average |
|---|---|---|---|---|---|---|
| 4.1 (1.9) | Conv | 0.514G | 3.766M | 0.855 | 0.605 | 0.730 |
| 4.2 (1.7) | None | 0.527G | 3.793M | 0.746 | 0.479 | 0.613 |
| 4.3 | | 0.687G | 5.116M | 0.843 | 0.593 | 0.718 |
| 4.4 | | 0.524G | 3.841M | 0.846 | 0.573 | 0.710 |
| 4.5 | | 0.514G | 3.768M | 0.851 | 0.577 | 0.714 |

Table 7: Ablations on Complicating EIT$_T$. The direction of data flow is from top to bottom of each structure.

4.4.3 Increasing and Invariant

We examine the performance of the decreasing structure in EIT$_T$; the results are shown in Table 8. Compared with the Invariant and Increasing structures, the average accuracy of the Decreasing structure is about 9% higher (0.730 vs. 0.647 and 0.636). Fig.7 shows that both the increasing and invariant structures have lower head diversity than the decreasing structure because of the ineffective introduction of IB, which is consistent with the inference in Sections 3.2-3.3. The last layer of the increasing and invariant structures has great head diversity because the last two layers form a decreasing structure (the number of channels processed by EIT$_T$ in the last layer equals 0 for all three structures). This setting is used because the convolution in the last layer does not operate on the class embedding.

4.4.4 Removing Position Embedding

Considering the introduction of convolutional operations in EIT, we investigated whether it still requires position embedding. The results in Table 9 show that the impact of removing position embedding on model performance is negligible. This is consistent with CvT [21]. A network without position embedding offers the possibility of simplified adaptation to more visual tasks without the need to redesign the embedding. However, in the experiments of this paper, ViTs (including EIT) use trainable position embedding by default, which ensures the consistency of the comparison among the various methods.

Image Size: 32 × 32

| Model Idx | Structure | Cifar10 | Cifar100 | Avg. |
|---|---|---|---|---|
| 5.1 (1.9) | Decreasing | 0.855 | 0.605 | 0.730 |
| 5.2 | Increasing | 0.790 | 0.481 | 0.636 |
| 5.3 | Invariant | 0.817 | 0.476 | 0.647 |

Table 8: Ablations on Increasing and Invariant structure.

Figure 7: Attention Distance of Decreasing, Increasing and Invariant structure. The attention distance is obtained by the same operation mentioned in Fig.2.

Image Size: 32 × 32

| Model Idx | Position Embedding | Cifar10 | Cifar100 | Average |
|---|---|---|---|---|
| 6.1 | None | 0.856 | 0.600 | 0.728 |
| 6.2 (1.9) | Trainable | 0.855 | 0.605 | 0.730 |

Table 9: Ablations on position embedding.

5 Conclusion

In this work, we present a simple yet efficient network architecture that leads IB to ViT with fewer parameters and FLOPs, called EIT. EIT ensures the efficiency of introducing IB without destroying the unification of the network in CV and NLP. Extensive experiments are conducted to validate that the EIT has better performance than the previous ViTs (with IB) and CNNs. In addition, to the best of our knowledge, we find for the first time a strong correlation between the performance of the transformer and the diversity of head attention distance, which gives new ideas for further improving the performance of the transformer.


  • [1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Neural Information Processing Systems, volume 33, pages 1877–1901, 2020.
  • [2] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Computer Vision and Pattern Recognition, pages 12299–12310, 2021.
  • [3] Mark Chen, Alec Radford, Rewon Child, Jeffrey K. Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, volume 1, pages 1691–1703, 2020.
  • [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina N. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2018.
  • [5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  • [6] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. In Neural Information Processing Systems, volume 34, 2021.
  • [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [8] Brett Koonce. Mobilenetv3. In

    Convolutional Neural Networks with Swift for Tensorflow

    , pages 125–144. Springer, 2021.
  • [9] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [10] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In Dy Jennifer and Krause Andreas, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4055–4064. PMLR, 2018.
  • [11] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
  • [12] Dianxi Shi, Chenran Zhao, Yajie Wang, Huanhuan Yang, Gongju Wang, Hao Jiang, Chao Xue, Shaowu Yang, and Yongjun Zhang.

    Multi actor hierarchical attention critic with rnn-based feature extraction.

    Neurocomputing, 471:79–93, 2022.
  • [13] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
  • [14] Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In International Conference on Machine Learning, pages 10096–10106. PMLR, 2021.
  • [15] Amirhossein Tavanaei. Embedded encoder-decoder in convolutional networks towards explainable ai. arXiv preprint arXiv:200706712T, 2020.
  • [16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, volume 30, pages 5998–6008, 2017.
  • [17] Yajie Wang, Dianxi Shi, Chao Xue, Hao Jiang, Gongju Wang, and Peng Gong. Ahac: Actor hierarchical attention critic for multi-agent reinforcement learning. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 3013–3020, 2020.
  • [18] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In Computer Vision and Pattern Recognition, pages 8741–8750, 2021.
  • [19] Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, and Yunhai Tong. Evolving attention with residual convolutions. In International Conference on Machine Learning, pages 10971–10980. PMLR, 2021.
  • [20] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677, 2020.
  • [21] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. In International Conference on Computer Vision, pages 22–31, 2021.
  • [22] Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. Lite transformer with long-short range attention. In International Conference on Learning Representations, 2020.
  • [23] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:170807747X, 2020.
  • [24] Tete Xiao, Piotr Dollar, Mannat Singh, Eric Mintun, Trevor Darrell, and Ross Girshick. Early convolutions help transformers see better. Advances in Neural Information Processing Systems, 34, 2021.
  • [25] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis E. H. Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In International Conference on Computer Vision, pages 558–567, 2021.
  • [26] Yanhong Zeng, Jianlong Fu, and Hongyang Chao. Learning joint spatial-temporal transformations for video inpainting. In European Conference on Computer Vision, pages 528–543, 2020.
  • [27] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations, 2021.