
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Vision transformers have recently gained explosive popularity, but their huge computational cost remains a severe issue. Recent efficient designs for vision transformers follow two pipelines, namely, structural compression based on a local spatial prior and non-structural token pruning. However, token pruning breaks the spatial structure that is indispensable for the local spatial prior. To take advantage of both pipelines, this work seeks to dynamically identify uninformative tokens for each instance and trim down both the training and inference complexity while maintaining complete spatial structure and information flow. To achieve this goal, we propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers. Specifically, we conduct unstructured instance-wise token selection by taking advantage of the global class attention that is unique to vision transformers. Then, we propose to update informative tokens and placeholder tokens that contribute little to the final prediction with different computational priorities, namely, slow-fast updating. Thanks to the slow-fast updating mechanism that guarantees information flow and spatial structure, our Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that the proposed method can significantly reduce the computational costs of vision transformers while maintaining comparable performance on image classification. For example, our method accelerates DeiT-S by over 60% in throughput while sacrificing only 0.4% top-1 accuracy.






Recently, transformers have shown strong power on various computer vision tasks such as image classification, object detection, and instance segmentation. The reason for introducing transformers into computer vision lies in properties that convolutional neural networks (CNNs) lack, especially the ability to model long-range dependencies in the data. However, dense modeling of long-range dependencies among image tokens across the layers of a transformer is computationally inefficient, because images contain large regions of low-level texture and uninformative background.

Figure 1: An illustration of technique pipelines for efficient token sparsity. The first branch: the pipeline of unstructured token pruning Rao et al. (2021); Tang et al. (2021). The second branch: the pipeline of structural compression Wang et al. (2021a); Graham et al. (2021); Heo et al. (2021). The third branch: our proposed pipeline, which performs unstructured updating while remaining suitable for structural compression models.
Figure 2: Visualization of class attention in DeiT-T. Interpretability denotes the method Chefer et al. (2021).

As shown in the first two pathways of Fig. 1, existing methods follow two mainstream approaches to address the inefficiency of modeling long-range dependencies in vision transformers. The first is to perform structural compression based on a local spatial prior, such as local linear projection Wang et al. (2021a), convolutional projection Wu et al. (2021); Heo et al. (2021), and shifted windows Liu et al. (2021). The second paradigm is non-structural token pruning. Tang et al. (2021) improve the efficiency of a pre-trained transformer network by developing a top-down layer-by-layer token slimming approach that identifies and removes redundant tokens based on the reconstruction error of the pre-trained network. The final pruning mask is fixed for all instances. Rao et al. (2021) also propose to accelerate a pre-trained transformer network by removing redundant tokens hierarchically, but explore an unstructured and data-dependent down-sampling strategy.

In this paper, as shown in the third pathway of Fig. 1, we propose to handle the inefficiency problem in a dynamic, data-dependent way that remains suitable for structural compression methods. We refer to uninformative tokens, which contribute little to the final prediction but add computational cost by bridging redundant long-range dependencies, as placeholder tokens. Different from the simple structural compression that reduces local spatial redundancy in Wang et al. (2021a); Graham et al. (2021), we propose to unstructurally and dynamically distinguish informative tokens from placeholder tokens for each instance, and to update them with different computation priorities. In contrast to searching for redundancy and pruning in a pre-trained network as in Tang et al. (2021); Rao et al. (2021), preserving placeholder tokens allows the redundancy problem to be alleviated from the beginning of the training process of a new model, and our method can serve as a generic plugin for most vision transformers of both flat and deep-narrow structures.

Concretely, this paper proposes Evo-ViT, a self-motivated slow-fast token evolution method for dynamic vision transformers. We claim that since a transformer has insight into global dependencies among image tokens and learns for classification, it is naturally able to distinguish informative tokens from placeholder tokens for each instance, which is what we mean by self-motivated. Taking DeiT Touvron et al. (2020) in Fig. 2 as an example, we find that the class token of DeiT-T estimates the importance of each token for dependency modeling and for the final classification objective. Especially in deeper layers (e.g., layer 10), it usually augments informative tokens with higher attention scores and has a sparse attention response, which is quite consistent with the visualization result provided by Chefer et al. (2021) for transformer interpretability. In shallow layers (e.g., layer 5), the attention of the class token is relatively scattered but still mainly focuses on informative regions. Thus, taking advantage of class tokens, informative tokens and placeholder tokens are determined, and the preserved placeholder tokens also ensure complete information flow in the shallow layers of a transformer for modeling accuracy. After the two kinds of tokens are determined, the placeholder tokens are summarized into a representative token that is evolved via the full transformer encoder simultaneously with the informative tokens in a slow and elaborate way. Then, the evolved representative token is exploited to fast update the placeholder tokens.

We evaluated the effectiveness of the proposed Evo-ViT method on two kinds of baseline models, namely transformers of a flat structure such as DeiT Touvron et al. (2020) and transformers of a deep-narrow structure such as LeViT Graham et al. (2021), on the ImageNet Deng et al. (2009) dataset. Our self-motivated slow-fast token evolution method allows the DeiT model to improve computational throughput by 40%-60% while maintaining comparable performance.

Related Work

Vision Transformer

Recently, a series of transformer models Han et al. (2020); Khan et al. (2021); Tay et al. (2020b) have been proposed to solve various computer vision tasks. Due to their strong capability of modeling long-range dependencies, transformers have achieved promising success in image classification Dosovitskiy et al. (2020); Touvron et al. (2020); d’Ascoli et al. (2021), object detection Carion et al. (2020); Huang et al. (2021); Liu et al. (2021); Zhu et al. (2020) and segmentation Duke et al. (2021); Zheng et al. (2021).

The pioneering works Dosovitskiy et al. (2020); Touvron et al. (2020); Jiang et al. (2021) directly split an image into patches of fixed length and transform these image patches into tokens as inputs to a flat transformer. Vision Transformer (ViT) Dosovitskiy et al. (2020) is one such attempt that achieved state-of-the-art performance with large-scale pre-training. DeiT Touvron et al. (2020) tackles the data-inefficiency problem in ViT by simply adjusting training strategies and adding an extra token alongside the class token for knowledge distillation. To achieve better accuracy/speed trade-offs for general dense prediction, recent works Yuan et al. (2021); Heo et al. (2021); Graham et al. (2021); Wang et al. (2021a) design transformers of deep-narrow structures by adopting sub-sampling operations (e.g., strided down-sampling, local average pooling, convolutional sampling) to reduce the number of tokens in intermediate layers. These structural sub-sampling operations usually help reduce spatial redundancies among neighboring tokens and introduce some locality prior. In this paper, we propose to handle instance-wise unstructured redundancies for both flat and deep-narrow transformers.

Redundancy Reduction

Transformers incur a high computational cost because Multi-head Self-Attention (MSA) has quadratic space and time complexity in the number of tokens and the Feed Forward Network (FFN) increases the dimension of latent features. Existing acceleration methods for transformers can be mainly categorized into sparse attention mechanisms (e.g., low rank factorization Xu et al. (2021); Wang et al. (2020), fixed local patterns and learnable patterns Tay et al. (2020a); Beltagy et al. (2020); Liu et al. (2021)), pruning Tang et al. (2021); Rao et al. (2021); Frankle and Carbin (2018); Michel et al. (2019), knowledge distillation Sanh et al. (2019) and so on. For example, Liu et al. (2021) propose a shifted windowing scheme that brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. The Linformer Wang et al. (2020) is a classic example of low rank methods, as it projects the length dimension of keys and values to a lower-dimensional representation. Tang et al. (2021) present a top-down layer-by-layer patch slimming algorithm to reduce the computational cost of pre-trained vision transformers. The patch slimming scheme is conducted under careful control of the feature reconstruction error, so that the pruned transformer network can maintain the original performance with lower computational cost. Rao et al. (2021) devise a lightweight prediction module to estimate the importance score of each token given the current features of a pre-trained transformer. The module is added to different layers to prune redundant tokens unstructurally and is supervised by a distillation loss based on the predictions of the original pre-trained transformer. In this paper, we propose to handle the redundancy problem from the very beginning of the training process of a versatile transformer.


Figure 3: The overall diagram of the proposed slow-fast token evolution (Evo-ViT) method.

Vision Transformer (ViT) Dosovitskiy et al. (2020) proposes a simple tokenization strategy that handles 2D images by reshaping them into flattened sequential patches and linearly projecting each patch into latent embedding. An extra class token (CLS) is added to the sequence and serves as the image representation. Moreover, since self-attention in the transformer encoder is position-agnostic and vision applications highly need position information, ViT adds position embedding into each token, including the CLS token. Afterwards, all tokens are passed through stacked transformer encoders and finally the CLS token is used for classification.

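The transformer is composed of a series of stacked encoders where each encoder consists of two modules, namely a Multi-head Self-Attention (MSA) module and a Feed Forward Network (FFN) module. The FFN module contains two linear transformations with a GELU activation. Layer normalization (LN) is applied before each of the MSA and FFN modules, and a residual connection is employed around both. The input of ViT, $x_0$, and the processing of the k-th encoder can be expressed as

$$x_0 = [x_{cls} \,|\, x_{patch}] + x_{pos}, \tag{1}$$
$$y_k = x_{k-1} + \mathrm{MSA}(\mathrm{LN}(x_{k-1})), \tag{2}$$
$$x_k = y_k + \mathrm{FFN}(\mathrm{LN}(y_k)), \tag{3}$$

where $x_{cls} \in \mathbb{R}^{1 \times C}$ and $x_{patch} \in \mathbb{R}^{N \times C}$ are CLS and patch tokens respectively and $x_{pos} \in \mathbb{R}^{(1+N) \times C}$ is the position embedding. $N$ and $C$ are the number of patch tokens and the dimension of the embedding.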

Specifically, a self-attention (SA) module projects the input sequence into query, key, and value vectors (i.e., $Q$, $K$, $V$) using three learnable linear mappings $W_Q$, $W_K$, and $W_V$. Then, a weighted sum over all values in the sequence is computed through:

$$\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V, \tag{4}$$

where $d$ is the dimension of the key vectors.

MSA is an extension of SA. It splits queries, keys, and values $h$ times and performs the attention function in parallel, then linearly projects their concatenated outputs.

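It is worth noting that one design of ViT that differs markedly from CNNs is the CLS token. The CLS token interacts with patch tokens at each encoder and summarizes all the patch tokens for the final embedding. We denote the similarity scores between the CLS token and patch tokens as class attention $A_{cls}$, formulated as:

$$A_{cls} = \mathrm{Softmax}\!\left(\frac{q_{cls} K^{T}}{\sqrt{d}}\right), \tag{5}$$

where $q_{cls}$ is the query vector of the CLS token, $K$ is the key matrix, and $d$ is the dimension of the key vectors.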
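The class attention of Eqn. 5 can be illustrated with a minimal single-head sketch in plain Python. The function names, the toy vectors, and the choice to drop the CLS-to-CLS entry before ranking patches are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_attention(q_cls, keys):
    """Attention of the CLS query over all keys (CLS key first, then patches).

    q_cls: list of d floats. keys: list of (1 + N) lists of d floats.
    Returns 1 + N weights that sum to 1.
    """
    d = len(q_cls)
    scores = [sum(q * k for q, k in zip(q_cls, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Toy example with d = 2: one CLS key followed by three patch keys.
q_cls = [1.0, 0.0]
keys = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
attn = class_attention(q_cls, keys)
# Assumption: only the patch entries are used to rank patch tokens.
patch_scores = attn[1:]
```

In this toy setting the first patch key is aligned with the CLS query, so it receives the largest patch score, which is the behavior the token selection relies on.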

Computational complexity.

In ViT, for $N$ tokens of dimension $C$, the computation costs of the MSA and FFN modules are $O(4NC^2 + 2N^2C)$ and $O(8NC^2)$, respectively. For pruning methods Rao et al. (2021); Tang et al. (2021), pruning tokens reduces the FLOPs in both the FFN and MSA modules. Thanks to the slow-fast token update strategy, our method can achieve the same efficiency while preserving the placeholder tokens, which keeps it suitable for training from scratch and for versatile downstream applications.
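To make the saving concrete, the following sketch evaluates a per-encoder cost model for a full token set and for the slow path only. It assumes the standard cost model of 4NC^2 + 2N^2C for MSA and 8NC^2 for an FFN with expansion ratio 4, and it assumes DeiT-T-like sizes (196 patch tokens, width 192) and a 0.5 selection ratio; the function names are hypothetical.

```python
def msa_cost(n, c):
    """Cost of one MSA module: QKV and output projections plus the attention map."""
    return 4 * n * c * c + 2 * n * n * c

def ffn_cost(n, c, ratio=4):
    """Cost of one FFN module whose hidden width is ratio * c."""
    return 2 * ratio * n * c * c

def encoder_cost(n, c):
    """Cost of one encoder, ignoring normalization and the fast update."""
    return msa_cost(n, c) + ffn_cost(n, c)

C = 192                               # embedding width (assumed, DeiT-T-like)
full = encoder_cost(1 + 196, C)       # CLS token plus all 196 patch tokens
# Slow path: CLS token + 98 informative tokens + 1 representative token.
slow = encoder_cost(1 + 98 + 1, C)
saving = 1 - slow / full              # fraction of per-encoder cost removed
```

Under these assumptions roughly half of the per-encoder cost is removed in every layer where selection is active; the end-to-end gain is smaller because the early layers run on all tokens.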



Figure 4: Two folds that illustrate the difficulty of pruning the shallow layers. (a) Inter-layer fold: the CKA similarity with the final CLS token of token features in each layer. (b) Intra-layer fold: the correlation coefficient of the token features in each layer.

In this paper, we aim to handle the modeling inefficiency for each input instance from the very beginning of the training process of a versatile transformer. As shown in Fig. 3, the pipeline of Evo-ViT mainly contains two parts: the structure preserving token selection module and the slow-fast token update module. In the structure preserving token selection module, the informative tokens and the placeholder tokens are determined by the evolved global class attention, so that they can be updated in different manners in the following slow-fast token update module. Namely, the placeholder tokens are summarized into and updated by a representative token, while the long-term dependencies and feature richness of the representative token and the informative tokens are evolved via the MSA and FFN modules.

We first elaborate on the proposed structure preserving token selection module. Then, we will introduce how to update the informative tokens and the placeholder tokens in a slow-fast way. Finally, the training details such as loss and other training strategy are introduced.

Structure preserving token selection

In this paper, we propose to preserve all the tokens and dynamically distinguish informative tokens from placeholder tokens to keep the information flow complete. The reason is that it is not trivial to prune tokens in the shallow and middle layers of a vision transformer, especially at the beginning of its training process. We explain this problem from both inter-layer and intra-layer perspectives. First, shallow and middle layers usually show a fast-growing capability of feature representation, so pruning tokens there brings severe information loss. Following Refiner Zhou et al. (2021), taking DeiT-T as an example, we use CKA similarity Kornblith et al. (2019) to measure the similarity between the intermediate token features output by each encoder and the final CLS token, assuming that the CLS token is strongly correlated with classification. As shown in Fig. 4(a), the token features of DeiT-T keep evolving fast as the model goes deeper, and the final CLS token feature is quite different from the token features in shallow layers. This means that the representations in shallow or middle layers are insufficiently encoded and still diverse, which makes token pruning quite difficult. Second, tokens have low correlation with each other in the shallow layers. Following Tang et al. (2021), we also show how the average similarity among different patch tokens varies w.r.t. network depth in the DeiT-T model to reveal redundancies. As shown in Fig. 4(b), the lower similarities with larger variation in the shallow layers also indicate how difficult it is to distinguish redundancies in shallow features.

The attention weight is the easiest and most popular tool Abnar and Zuidema (2020); Wang et al. (2021b) for interpreting a model’s decisions and gaining insight into the propagation of information among tokens. The class attention weight described in Eqn. 5 reflects how information is collected and broadcast through the CLS token, which acts as a router. We find that our proposed evolving global class attention can serve as a simple measure to dynamically distinguish informative tokens from placeholder tokens in a vision transformer. In Fig. 4(a), the distinguished informative tokens have high CKA correlations with the CLS token, while the placeholder tokens have low CKA correlations. As shown in Fig. 2, the global class attention is able to focus on the object tokens, which is consistent with the visualization results of Chefer et al. (2021). In the remainder of this section, we introduce our structure preserving token selection method in detail.

As discussed in the Preliminaries section, the class attention is calculated by Eqn. 5. We rank the patch tokens by their class-attention scores and select the top-scoring ones, at a fixed selection ratio, as the informative tokens. The remaining tokens are recognized as placeholder tokens that contain less information. Note that the placeholder tokens are kept and fast-updated rather than directly dropped.
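The selection step can be sketched as a simple partition of token indexes by score. The helper name `split_tokens`, the rounding rule for the number of kept tokens, and the toy scores are assumptions for illustration. Both index lists are returned in their original order, so the spatial layout of the token grid is not disturbed.

```python
def split_tokens(scores, keep_ratio=0.5):
    """Partition token indexes into (informative, placeholder) by score.

    scores: one class-attention score per patch token.
    keep_ratio: fraction of tokens treated as informative (assumed rounding: floor).
    """
    n = len(scores)
    k = max(1, int(n * keep_ratio))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    informative = sorted(ranked[:k])   # restore original spatial order
    placeholder = sorted(ranked[k:])   # kept for the fast update, not dropped
    return informative, placeholder

scores = [0.05, 0.30, 0.10, 0.25, 0.02, 0.28]
informative, placeholder = split_tokens(scores)
# informative == [1, 3, 5], placeholder == [0, 2, 4]
```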

To better capture the underlying information among tokens in different layers, we propose a global class attention that augments the class attention by evolving it across layers, as shown in Fig. 3. Specifically, residual connections between the class attentions of consecutive layers are designed to facilitate the attention information flow with some regularization effect. Mathematically,

$$A_{cls,g}^{k} = \alpha \cdot A_{cls,g}^{k-1} + (1 - \alpha) \cdot A_{cls}^{k}, \tag{6}$$

where $A_{cls,g}^{k}$ is the global class attention in the k-th layer, $A_{cls}^{k}$ is the class attention in the k-th layer, and $\alpha$ is the trade-off between the two. We use $A_{cls,g}^{k}$ for the token selection in the (k+1)-th layer for stability and efficiency.
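A minimal sketch of how the global class attention could be accumulated across layers is given below. It assumes that the residual connection is a convex combination with weight `alpha` on the previous layer's global attention, which matches the later statement that a larger trade-off means stronger dependence on previous information. Initializing the running value with the first layer's class attention is an additional assumption.

```python
def evolve_global_attention(prev_global, current, alpha=0.5):
    """Blend the previous global class attention with the current layer's."""
    if prev_global is None:
        # Assumption: the first selecting layer starts from its own attention.
        return list(current)
    return [alpha * g + (1 - alpha) * a for g, a in zip(prev_global, current)]

# Class attention over three patch tokens in three consecutive layers (toy values).
layer_attn = [
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.1, 0.1, 0.8],
]
global_attn = None
history = []
for attn in layer_attn:
    global_attn = evolve_global_attention(global_attn, attn)
    history.append(global_attn)
```

With `alpha = 0.5` the ranking of the second and third tokens flips only after the second layer, which illustrates the smoothing effect of the residual on token selection.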

Slow-fast token update

Once the informative tokens and the placeholder tokens are determined by the global class attention, instead of harshly dropping placeholder tokens as Tang et al. (2021); Rao et al. (2021), we propose to update tokens in a slow-fast way. As shown in Fig. 3, informative tokens are carefully evolved via MSA and FFN modules, while placeholder tokens are coarsely summarized and updated via a representative token. We introduce our slow-fast token updating strategy mathematically as follows.

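For the $N$ patch tokens $x_{patch}$, we first split them into informative tokens $x_{inf}$ and placeholder tokens $x_{ph}$ by the token selection strategy introduced above. Secondly, the placeholder tokens $x_{ph}$ are aggregated into a representative token $x_{rep}$, as:

$$x_{rep} = \phi_{agg}(x_{ph}), \tag{7}$$

where $\phi_{agg}$ denotes an aggregating function such as weighted pooling or transposed linear projection. Here we use weighted pooling based on the corresponding global attention score in Eqn. 6.

Then, both the informative tokens $x_{inf}$ and the representative token $x_{rep}$ are fed into the MSA and FFN modules, and their residuals are recorded as $x_{inf}^{(1)}, x_{rep}^{(1)}$ and $x_{inf}^{(2)}, x_{rep}^{(2)}$ for skip connections, which can be denoted by:

$$x_{inf}^{(1)}, x_{rep}^{(1)} = \mathrm{MSA}(x_{inf}, x_{rep}), \quad x_{inf} \leftarrow x_{inf} + x_{inf}^{(1)}, \quad x_{rep} \leftarrow x_{rep} + x_{rep}^{(1)}, \tag{8}$$
$$x_{inf}^{(2)}, x_{rep}^{(2)} = \mathrm{FFN}(x_{inf}, x_{rep}), \quad x_{inf} \leftarrow x_{inf} + x_{inf}^{(2)}, \quad x_{rep} \leftarrow x_{rep} + x_{rep}^{(2)}. \tag{9}$$

Here MSA and FFN denote the modules together with their layer normalization. Thus, the informative tokens $x_{inf}$ and the representative token $x_{rep}$ are updated in a slow and elaborate way.

Finally, the placeholder tokens $x_{ph}$ are updated in a fast way by the residuals of $x_{rep}$:

$$x_{ph} \leftarrow x_{ph} + \phi_{exp}(x_{rep}^{(1)}) + \phi_{exp}(x_{rep}^{(2)}), \tag{10}$$

where $\phi_{exp}$ denotes an expanding function such as a simple copy in our method.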
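The whole slow-fast step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the CLS token is omitted for brevity, `msa` and `ffn` are passed in as callables that return one residual per input token, and `toy_msa` and `toy_ffn` are hypothetical stand-ins for the real modules.

```python
def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

def weighted_pool(tokens, weights):
    """Aggregate tokens into one representative token by weighted averaging."""
    total = sum(weights)
    return [sum(w * t[j] for w, t in zip(weights, tokens)) / total
            for j in range(len(tokens[0]))]

def slow_fast_update(tokens, scores, informative, placeholder, msa, ffn):
    """One encoder step: slow path for informative tokens, fast path for the rest."""
    x_inf = [tokens[i] for i in informative]
    x_ph = [tokens[i] for i in placeholder]
    x_rep = weighted_pool(x_ph, [scores[i] for i in placeholder])

    # Slow path: informative tokens and the representative token go through
    # both modules, and the residuals are kept for the fast path.
    slow = x_inf + [x_rep]
    res_msa = msa(slow)
    slow = [add(t, r) for t, r in zip(slow, res_msa)]
    res_ffn = ffn(slow)
    slow = [add(t, r) for t, r in zip(slow, res_ffn)]

    # Fast path: copy the representative token's residuals to every placeholder.
    rep_residual = add(res_msa[-1], res_ffn[-1])
    x_ph = [add(t, rep_residual) for t in x_ph]

    # Scatter both groups back so the token grid keeps its original layout.
    out = list(tokens)
    for i, t in zip(informative, slow[:-1]):
        out[i] = t
    for i, t in zip(placeholder, x_ph):
        out[i] = t
    return out

def toy_msa(seq):
    """Hypothetical stand-in for MSA: pull every token toward the sequence mean."""
    mean = [sum(col) / len(seq) for col in zip(*seq)]
    return [[0.5 * (m - v) for m, v in zip(mean, tok)] for tok in seq]

def toy_ffn(seq):
    """Hypothetical stand-in for FFN: a per-token scaling residual."""
    return [[0.1 * v for v in tok] for tok in seq]

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
scores = [0.4, 0.1, 0.3, 0.2]
updated = slow_fast_update(tokens, scores, [0, 2], [1, 3], toy_msa, toy_ffn)
```

The output has as many tokens as the input, and every placeholder token is shifted by the same vector, which is what the simple-copy expansion implies.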

Training Strategies

Layer-to-stage training schedule.

Our proposed token selection mechanism becomes increasingly stable and consistent as the training process of a transformer progresses. Fig. 6 shows that the indexes of the selected informative tokens in different layers of the same stage of a transformer gradually become similar during the training process. The transformer tends to augment the representations of meaningful informative tokens. Thus, we propose a layer-to-stage training strategy for further consistency and efficiency. Specifically, we conduct the token selection and slow-fast token update layer by layer during the first 200 epochs. During the remaining 100 epochs, we only conduct token selection at the beginning of each stage, and then, according to the determined informative tokens and placeholder tokens, the slow-fast update is performed as usual until the end of the stage. For transformers with a flat structure such as DeiT Touvron et al. (2020), we manually arrange 4 layers as one stage.
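The schedule can be sketched as a function that returns the layers at which token selection is recomputed for a given epoch. The function name, the zero-based indexing, the 12-layer depth, and aligning stage boundaries to multiples of the stage size are assumptions for illustration; the 200-epoch switch point, the stage size of 4, and the fifth starting layer follow the settings stated in the text.

```python
def selection_layers(epoch, num_layers=12, start_layer=4,
                     stage_size=4, switch_epoch=200):
    """Zero-based layer indexes where token selection is recomputed."""
    layers = range(start_layer, num_layers)
    if epoch < switch_epoch:
        # Layer-wise phase: every layer after the start re-selects tokens.
        return list(layers)
    # Stage-wise phase: re-select only at the first layer of each stage and
    # reuse that split for the remaining layers of the stage.
    return [layer for layer in layers if layer % stage_size == 0]

early = selection_layers(epoch=0)     # [4, 5, 6, 7, 8, 9, 10, 11]
late = selection_layers(epoch=250)    # [4, 8]
```

In the stage-wise phase the slow-fast update still runs in every layer after the start; only the ranking and splitting of tokens is skipped inside a stage.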

Assisted CLS token loss.

Although many state-of-the-art vision transformers Wang et al. (2021a); Graham et al. (2021) remove the CLS token and use the final average pooled features for classification, it is not difficult to add a CLS token to their models for our token selection strategy. We empirically find that the ability of the CLS token to distinguish the two kinds of tokens, as illustrated in Fig. 2, is kept in these models even without supervision on the CLS token. For better stability, we calculate classification losses based on the CLS token together with the final average pooled features during training. Mathematically,

$$\hat{y}_{cls}, \hat{y} = m(x_{cls}, x_{patch}), \qquad \mathcal{L}_{cls} = \phi(\hat{y}_{cls}, y) + \phi(\mathrm{Avg}(\hat{y}), y),$$

where $x_{cls}$ and $x_{patch}$ denote the CLS token and patch tokens, respectively, and $y$ is their corresponding ground-truth. $m$ denotes the transformer model, $\hat{y}_{cls}$ and $\hat{y}$ are its outputs for the CLS token and the patch tokens, and $\mathrm{Avg}$ is average pooling. $\phi$ is the classification metric function, usually realized by the cross-entropy loss. During inference, the final average pooled features are used for classification and the CLS token is only used for token selection.
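The assisted loss can be sketched as the sum of two cross-entropy terms, one on the logits predicted from the CLS token and one on the logits predicted from the average pooled patch features. Equal weighting of the two terms and the function names are assumptions made for illustration; the text only states that both losses are calculated together.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample, computed with a stable log-sum-exp."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

def assisted_cls_loss(cls_logits, pooled_logits, target):
    """Sum of the CLS-token loss and the average-pooled-feature loss."""
    return cross_entropy(cls_logits, target) + cross_entropy(pooled_logits, target)

# Two-class toy example: the CLS head is undecided, the pooled head is confident.
loss = assisted_cls_loss([0.0, 0.0], [4.0, 0.0], target=0)
```

At inference time only the pooled branch would be evaluated for the prediction, while the CLS branch keeps providing the attention used for token selection.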



In this section, we demonstrate the superiority of the proposed Evo-ViT method through extensive experiments on the ImageNet-1k Deng et al. (2009) classification dataset. To demonstrate the generalization of our method, we conduct experiments on vision transformers of both flat and deep-narrow structures, i.e., DeiT Touvron et al. (2020) and LeViT Graham et al. (2021). For overall comparisons with the state-of-the-art (SOTA) methods Rao et al. (2021); Tang et al. (2021); Chen et al. (2021); Pan et al. (2021), we conduct the token selection and slow-fast token update from the fifth layer of DeiT and the third layer (excluding the convolution layers) of LeViT, respectively. The selection ratio of informative tokens in all selected layers of both DeiT and LeViT is set to 0.5. The global CLS attention trade-off in Eqn. 6 is set to 0.5 for all layers. For fair comparisons, all the models are trained for 300 epochs.

Main Results

Acceleration comparisons with state-of-the-art pruning methods.

In Table 1, we compare our method with existing token pruning methods Rao et al. (2021); Pan et al. (2021); Tang et al. (2021); Chen et al. (2021). Since token pruning methods cannot recover the 2D structure and are usually designed for flat-structured transformers, we conduct the comparisons based on DeiT Touvron et al. (2020) on the ImageNet dataset. We report the top-1 accuracy and throughput for performance evaluation. The throughput is measured on a single NVIDIA V100 GPU with the batch size fixed to 256, which is the same as the setting of DeiT. Results indicate that our method outperforms previous token pruning methods on both accuracy and efficiency. Our method accelerates inference at runtime by over 60% with a negligible accuracy drop (-0.4%) on DeiT-S.

Comparisons with state-of-the-art transformer models.

Thanks to placeholder tokens, our method can preserve the spatial structure that is indispensable for most modern vision transformer architectures. Thus, we further apply our method to the state-of-the-art efficient transformer LeViT Graham et al. (2021), which has a deep-narrow architecture. As shown in Table 2, in addition to performing well on DeiT, our method can further accelerate a deep-narrow transformer like LeViT.

Method                           Top-1 Acc. (%)   Throughput (img/s)   Throughput gain (%)

DeiT-T
Baseline Touvron et al. (2020)   72.2             2536                 -
PS-ViT Tang et al. (2021)        72.0             3563                 40.5
DynamicViT Rao et al. (2021)     71.2             3890                 53.4
SViTE Chen et al. (2021)         70.1             2836                 11.8
Evo-ViT (ours)                   72.0             3978                 56.9

DeiT-S
Baseline Touvron et al. (2020)   79.8             940                  -
PS-ViT Tang et al. (2021)        79.4             1308                 43.6
SViTE Chen et al. (2021)         79.2             1215                 29.3
DynamicViT Rao et al. (2021)     79.3             1479                 57.3
IA-RED Pan et al. (2021)         79.1             1360                 44.7
Evo-ViT (ours)                   79.4             1510                 60.6
Table 1: Comparison with existing token pruning methods on DeiT.
Figure 5: Ablation results on Hyper parameters: (a) keeping ratio, (b) starting layer index, (c) global attention tradeoff, (d) starting epoch.

Ablation Analysis

Effectiveness of each module.

To evaluate the effectiveness of each sub-method, we add improvements step by step in Tab. 3 on both flat structure DeiT and deep-narrow structure LeViT. The improvements include:

  • Naive selection. To directly prune the uninformative tokens.

  • Placeholder token. To preserve the uninformative tokens but not fast update them.

  • Global attention. To utilize the proposed evolved global class attention instead of vanilla class attention for token selection.

  • Fast updating. To augment the placeholder tokens with fast updating.

  • Layer to stage. To apply the proposed layer-to-stage training strategy to further accelerate inference.

Model             Param (M)   Throughput (img/s)   Top-1 Acc. (%)
DeiT-T            5.9         2536                 72.2
LeViT-128S        7.8         8755                 74.5
LeViT-128         9.2         6109                 76.2
LeViT-192         10.9        4705                 78.4
PVTv2-B1          14.0        1225                 78.7
CoaT-Lite Tiny    5.7         1083                 76.6
T2T-ViT-7         4.2         2012                 71.2
Evo-DeiT-Ti       5.9         3978                 72.0
Evo-LeViT-128S    7.9         10008                73.2
Evo-LeViT-128     9.3         8190                 74.8
Evo-LeViT-192     11.0        6114                 76.9
DeiT-S            22.5        940                  79.8
LeViT-256         18.9        3357                 80.1
PVTv2-B2          25.4        687                  82.0
T2T-ViT-14        21.4        793                  80.6
Evo-DeiT-S        22.5        1510                 79.4
Evo-LeViT-256     19.0        4270                 79.1
DeiT-B            86.2        317                  81.8
LeViT-384         39.1        1838                 81.6
PVTv2-B3          45.2        457                  83.2
CoaT-Lite Small   20.0        550                  81.9
T2T-ViT-19        39.0        486                  81.2
Evo-LeViT-384     39.3        2371                 81.0
Table 2: Comparison with state-of-the-art vision transformers. The image resolution is 224 × 224.

Results on DeiT show that our placeholder token strategy further improves the selection performance thanks to its capacity to preserve complete information flow. The global attention strategy enhances the consistency of token selection across layers and achieves better performance. The fast updating strategy has less effect on DeiT than on LeViT. We believe that the performance of DeiT is already saturated with placeholder tokens and global attention, while LeViT still has some room for improvement. LeViT exploits spatial pooling for token reduction, which makes unstructured token reduction in each stage more difficult. The fast updating strategy makes it possible to collect some extra cues from placeholder tokens to improve accuracy and augment feature representations. We also evaluate the layer-to-stage training strategy. Results indicate that it maintains the accuracy while further accelerating inference.

Hyper parameter analysis.

We further investigate the hyper parameters of our method on DeiT-T, namely, keeping ratio, starting layer index, global attention trade-off, and starting epoch. We initialize these hyper parameters as described in Setup. During the ablation analysis, only the hyper parameter under study is changed and the others remain fixed.

Figure 6: Token selection results of our method on DeiT-T. The left three columns demonstrate results on different layers of a well-trained model. The right three columns demonstrate results on the fifth layer at different training epochs.

Keeping ratio denotes the fraction of tokens kept for slow update in each layer. For precision, we set all layers to the same keeping ratio and investigate the trade-off between accuracy and inference throughput in Fig. 5(a). Results show that the accuracy saturates when the keeping ratio is larger than 0.5. Another interesting finding is that the accelerated model with a 0.9 keeping ratio outperforms the full baseline by 0.2% (72.2% to 72.4%), which is consistent with the conclusion in Chen et al. (2021) that properly dropping several uninformative tokens can serve as regularization for vision transformers.

                      DeiT-T                                LeViT-128S
Strategy              Top-1 Acc. (%)   Throughput (img/s)   Top-1 Acc. (%)   Throughput (img/s)
baseline              72.2             2536                 74.5             8755
+ naive selection     70.8             3824                 -                -
+ placeholder token   71.6             3802                 72.1             9892
+ global attention    72.0             3730                 72.5             9452
+ fast updating       72.0             3610                 73.2             9360
+ layer to stage      72.0             3978                 73.2             10008
Table 3: Method ablation on DeiT and LeViT.

Starting layer index denotes the layer at which token selection and slow-fast updating start. Fig. 5(b) indicates that the accuracy becomes stable when we start from the fifth layer. We find that the accuracy drops greatly as the starting layer becomes shallower, especially within the first three layers. We attribute this to the features in these layers still varying widely and being unstable, as shown in Fig. 4(a) and Fig. 4(b).

Global attention trade-off in Eqn. 6 controls the dependence on previous-layer information when conducting token selection in each layer. A larger trade-off means stronger dependence on previous information. Fig. 5(c) illustrates that it is best to weigh the previous and current information equally.

Starting epoch denotes the epoch at which our token selection and evolution strategy starts. As shown in Fig. 5(d), the accuracy drops sharply when we start only in the last 100 epochs. We believe that the dynamic token selection requires enough training epochs to learn the refined features. For accuracy and training efficiency, we start our token selection and evolution strategy from the very beginning of training.

Different Token Selection Strategy.

Method                   Top-1 Acc. (%)   Throughput (img/s)
random selection         66.4             3760
average pooling          69.5             3703
max pooling              69.8             3698
convolution              70.2             3688
global class attention   70.8             3750
Table 4: Different sub-sampling methods on DeiT-T. To align throughput, we conduct subsampling at the seventh layer for pooling and convolution, and the fifth layer for random selection and global class attention.

We further compare the global attention token selection strategy with some common subsampling methods in Tab. 4 to evaluate the effectiveness of our token selection metric. For fair comparisons, we directly drop the unselected tokens instead of keeping them as placeholder tokens. We align the throughput by setting a different subsampling location in the network for each method and compare their accuracy. Tab. 4 shows that our global class attention metric achieves the highest accuracy among the common subsampling methods at comparable throughput.


We further visualize the token selection in Fig. 6 to demonstrate the performance of our method during both the training and inference stages. The left three columns show results on different layers of a well-trained DeiT-T model. Results show that our token selection method mainly focuses on objects instead of backgrounds, which means that our method can effectively discriminate the informative tokens from placeholder tokens. The selection results tend to be consistent across layers, which supports the feasibility of our layer-to-stage training strategy. Another interesting finding is that some tokens missed in the shallow layers are retrieved in the deep layers thanks to our structure preserving strategy. Taking the baseball images in the fourth row as an example, tokens of the bat are gradually picked up as the layers go deeper. We also investigate how the token selection evolves during the training stage in the right three columns. Results demonstrate that some informative tokens, such as the fish tail, are determined as placeholder tokens at the early epochs. With more training epochs, the ability of our method to recognize redundancy gradually increases and finally becomes stable, yielding discriminative token selection.


In this work, we investigate the efficiency of vision transformers by developing a self-motivated slow-fast token evolution (Evo-ViT) method, which can be conducted from the very beginning of the training process of a vision transformer. For each instance, the informative tokens and placeholder tokens are determined by the evolved global class attention of the transformer. By preserving placeholder tokens and updating them coarsely based on a representative token, both the complete information flow and the 2D spatial structure are preserved for training stability and generalization to transformers of flat and deep-narrow structures. Meanwhile, the informative tokens and the representative token are carefully evolved via the MSA and FFN modules of a vanilla transformer encoder. Experiments on DeiT indicate that the proposed Evo-ViT method improves the throughput of the model by 40%-60% while maintaining comparable classification performance. Interesting future directions include extending our method to downstream tasks such as detection and segmentation.