Self-slimmed Vision Transformer

Vision transformers (ViTs) have become popular structures and have outperformed convolutional neural networks (CNNs) on various vision tasks. However, such powerful transformers bring a huge computation burden, and the essential barrier behind this is the exhausting token-to-token comparison. To alleviate this, we delve deeply into the model properties of ViTs and observe that ViTs exhibit sparse attention with high token similarity. This intuitively points us to a feasible structure-agnostic dimension, the token number, for reducing the computational cost. Based on this exploration, we propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT. Specifically, we first design a novel Token Slimming Module (TSM), which boosts the inference efficiency of ViTs by dynamic token aggregation. Different from token hard dropping, our TSM softly integrates redundant tokens into fewer informative ones; it can dynamically zoom visual attention without cutting off discriminative token relations in the images. Furthermore, we introduce a concise Dense Knowledge Distillation (DKD) framework, which densely transfers unorganized token information in a flexible auto-encoder manner. Due to the similar structure between teacher and student, our framework can effectively leverage structure knowledge for better convergence. Finally, we conduct extensive experiments to evaluate our SiT. They demonstrate that our method can speed up ViTs by 1.7x with a negligible accuracy drop, and even speed up ViTs by 3.6x while maintaining 97% of their performance. Moreover, by simply arming LV-ViT with our SiT, we achieve new state-of-the-art performance on ImageNet, surpassing all the CNNs and ViTs in the recent literature.


1 Introduction

Since vision transformer (ViT) [10] started the era of transformer structures in fundamental computer vision tasks [2, 38, 4], variant transformers have been designed to challenge the dominance of convolutional neural networks (CNNs). Different from CNNs that stack convolutions to encode local features progressively, ViTs directly capture long-term token dependencies. However, because of the exhausting token-to-token comparison, current powerful transformers require huge computation, limiting their wide application in reality [12]. Hence, in this paper, we aim to design a generic learning framework for boosting the efficiency of vanilla vision transformers.

 

Method | Token Pruning | Knowledge Source | Acc. Drop (%) | Throughput Gain (%)
PS-ViT [28] | Hard | None | 0.5 | 43.6
IA-RED [22] | Hard | None | 0.7 | 44.7
Dynamic-ViT [24] | Hard | Final layer | 0.5 | 57.3
Evo-ViT [39] | Semi-Hard | None | 0.4 | 60.6
Our SiT | Soft | All layers | 0.0 | 43.2
Our SiT | Soft | All layers | 0.4 | 102.1

 

Table 1: Comparison to recent token pruning methods for ViT. All models are based on DeiT-S [29]. Our SiT achieves the best trade-off between accuracy and throughput.
(a) Token similarity becomes higher and higher in deeper layers.
(b) All tokens tend to focus on the same informative tokens in deeper layers.
(c) Our soft slimming can automatically zoom the attention scope according to the object size.
Figure 1: Our motivation. In Figure 1(a), we calculate the correlation coefficients among tokens and count the proportion of tokens that are similar (correlation of at least 0.7) to at least 4/8/16 other tokens in different layers. As for Figure 1(b), we randomly select two tokens in the tenth layer to show their attention. Moreover, we compare different token pruning methods in Figure 1(c). Darker tokens receive less attention.

To make ViTs more efficient, we explore the inherent properties of the token-to-token comparison. We conduct a series of experiments based on LV-ViT, which reveal that sparse attention with high token similarity exists in ViTs. Figure 1(a) shows that even in the first layer, more than 50% of the tokens are similar to 3 other tokens, and the token similarity becomes higher in deeper layers. Besides, the attention tends to focus on specific tokens in the deeper layers (Figure 1(b)), which indicates that the number of decision-relevant tokens becomes smaller. These observations demonstrate that only a few token candidates carry meaningful information. This inspires us to exploit a feasible structure-agnostic dimension, the token number, to reduce the computational cost. Intuitively, we can progressively drop the redundant tokens as the network deepens.

Recent studies have tried to compress tokens via data-independent dropping that minimizes reconstruction error [28], or data-dependent dropping with differentiable scoring [24]. However, data-independent dropping requires layer-by-layer optimization, which is hard to generalize. Moreover, token hard dropping inevitably discards vital tokens as the dropping ratio increases, e.g., the shape of the otterhound is destroyed in the deep layers (Figure 1(c)), thus limiting performance as shown in Table 1.

In contrast, we propose flexible token soft slimming to dynamically aggregate decision-relevant information into a slimmed token set. Specifically, we design a concise Token Slimming Module (TSM), which generates decision-relevant tokens via a data-dependent weight matrix. As shown in Figure 1(c), by simply inserting multiple TSMs into LV-ViT, our network learns to localize the key object tokens. More importantly, the attention scope can be zoomed automatically without cutting off the discriminative token relations, e.g., our network can concentrate on the most informative parts of the otterhound and the oxygen mask, which is totally different from token hard dropping.

Going a step further, we introduce a novel Dense Knowledge Distillation (DKD) to achieve stable and efficient optimization for model slimming. In DKD, the original network teaches its slimmed version in a dense (layer-to-layer) supervision manner. Note that previous hint knowledge distillation methods [25, 45, 42, 18] are usually performed between different structures, leading to a sparse knowledge source and inevitable knowledge loss [33]. Besides, they leverage spatially structured operators (e.g., convolution or pooling) to align the hint dimension between student and teacher, which are unable to handle unorganized tokens (e.g., the hint in a CNN has a strict spatial structure, but the token set does not). To solve these problems, we first design a reverse version of the token slimming module (RTSM) to align the token number for each layer in a flexible auto-encoder manner. Thus we can densely transfer all the token information. Benefiting from the innate knowledge inheritance (structure knowledge), our DKD is more suitable for teaching itself, i.e., self-slimmed learning.

Our self-slimmed learning method is flexible and easy to generalize to all vanilla vision transformers (SiT), e.g., DeiT [29] and LV-ViT [17]. We conduct extensive experiments on ImageNet [8] to verify its effectiveness and efficiency. Interestingly, our method performs better than DynamicViT [24] even with TSM alone. Besides, SiT-XS achieves 81.8% top-1 accuracy with much faster inference, and SiT-L achieves a competitive 85.6% top-1 accuracy while running faster than its teacher. More importantly, our SiT based on LV-ViT achieves new state-of-the-art performance on ImageNet, surpassing recent CNNs and ViTs.

2 Related Works

Vision transformers. The transformer architecture [31] was first proposed for machine translation in the field of natural language processing (NLP). Its success in NLP inspired the application of transformers to various vision tasks, for example, DETR [2] for object detection and ViT [10] for image recognition. ViT is the first pure transformer that achieves state-of-the-art performance on ImageNet [8]. Recent ViT variants mainly focus on better optimization and more powerful performance [29, 46, 30, 11, 44, 43, 21, 36, 17, 14, 3, 9, 37, 7, 6, 41, 16, 20, 13]. However, few of them explore how to improve the efficiency of vision transformers [12]. In this paper, we aim to design a general optimization framework, named self-slimmed learning, to promote the efficiency of ViTs.

Figure 2: The framework of our self-slimmed learning. We insert our token slimming modules (TSM) into a vanilla vision transformer. To reduce the loss of decision-relevant information, we apply dense knowledge distillation (DKD) to provide layer-to-layer supervision, wherein the reverse version of TSM (RTSM) is utilized for token reconstruction. The dashed lines indicate that the prediction supervision from an extra CNN teacher is optional and complementary to our method.

Transformer slimming. The large computation of self-attention hinders the wide application of ViTs, for example in detection and segmentation with high-resolution input images. To solve this problem, several prior works concentrate on designing sparse attention [34, 21] or structure pruning [5]. SViTE [5] dynamically extracts and trains sparse subnetworks of ViTs while sticking to a fixed small parameter budget. However, model structure pruning struggles to trim down the inference latency. Other works try to reduce token redundancy [24, 28, 22, 40] by entirely dropping the unimportant tokens, which brings larger throughput improvements than structure pruning. Different from the above works, our SiT aggregates all tokens into fewer informative tokens in a soft manner through a concise slimming module. It can automatically zoom the attention scope to localize the key object for better recognition.

3 Method

In this section, we describe our self-slimmed learning for vision transformers (SiT) in detail. First, we introduce the overall architecture of SiT. Then, we explain the vital designs of our SiT, i.e., the token slimming module (TSM) and dense knowledge distillation (DKD). Finally, we thoroughly compare our TSM and DKD with other counterparts.

3.1 Overview of Self-slimmed Learning

We formally describe the details of our self-slimmed learning for vision transformers (SiT). The overall framework is illustrated in Figure 2. We first design a lightweight Token Slimming Module (TSM) for conventional ViTs to perform token slimming, and its reverse version (RTSM) for token reconstruction. Following the hierarchical feature representations of prior works [12, 21], we progressively perform token slimming three times, halving the token number each time. To decrease the information loss, we propose a layer-to-layer dense knowledge distillation (DKD), wherein the original vision transformer serves as a teacher to minimize the difference between itself and the slimmed student. Finally, we integrate TSM and DKD to form a general self-slimmed learning method for all vanilla ViTs.
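To make the pipeline concrete, the following minimal PyTorch sketch (our own illustration, not the authors' released code) shows how slimming modules could be interleaved with standard transformer blocks at stage boundaries, halving the token count after each slimming step. The names `block_fn`, `tsm_fn`, and the stage depths are placeholders; the slimming module itself is described in Section 3.2.

```python
import torch.nn as nn


class SlimmedViT(nn.Module):
    """Sketch of a self-slimmed ViT: a TSM is inserted after each stage but the last.

    `block_fn(dim)` builds any standard ViT encoder block; `tsm_fn(dim, keep_ratio)`
    builds a token slimming module that keeps `keep_ratio` of the tokens.
    The stage depths are illustrative values only (the class token is omitted
    here for simplicity).
    """

    def __init__(self, block_fn, tsm_fn, dim=384, depths=(1, 1, 1, 13), keep_ratio=0.5):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.ModuleList(block_fn(dim) for _ in range(d)) for d in depths
        )
        # One TSM per stage boundary; each halves the token number by default.
        self.tsms = nn.ModuleList(tsm_fn(dim, keep_ratio) for _ in range(len(depths) - 1))

    def forward(self, x):                # x: (B, N, C) patch tokens
        for i, stage in enumerate(self.stages):
            for blk in stage:
                x = blk(x)
            if i < len(self.tsms):       # slim tokens at the stage boundary
                x = self.tsms[i](x)      # (B, N, C) -> (B, ~keep_ratio*N, C)
        return x
```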

3.2 Token Slimming Module

Given a sequence of input tokens $X \in \mathbb{R}^{N \times C}$ with $C$ channels (the class token is omitted as it will never be pruned), token slimming aims to dynamically aggregate the $N$ redundant tokens into $\hat{N}$ informative tokens $\hat{X} \in \mathbb{R}^{\hat{N} \times C}$:

$\hat{X} = \mathcal{A} X$,   (1)

where $\mathcal{A} \in \mathbb{R}^{\hat{N} \times N}$ is a normalized weight matrix:

$\sum_{j=1}^{N} \mathcal{A}_{ij} = 1, \quad i = 1, \dots, \hat{N}$.   (2)

Such an operation is differentiable and friendly to end-to-end training. We follow the design paradigm of self-attention [32] and propose a lightweight token slimming module (TSM) shown in Figure 3:

$\mathcal{A} = \mathrm{Softmax}\!\left(\frac{\left(\sigma(X W_1)\, W_2\right)^{\top}}{s}\right)$,   (3)

where $W_1$ and $W_2$ are both learnable parameters, the Softmax normalizes each row over the $N$ input tokens, and $\sigma$ and $s$ represent the nonlinear function (GELU) and the scaling factor respectively. Similar to self-attention, TSM generates a global attention matrix, but it requires much less overhead in terms of throughput and memory usage during both training and inference. Thanks to the learnable scaling factor $s$, the attention tends to be sparse in our experiments, which means it learns to focus on the most informative tokens.
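As a concrete reference, here is a minimal PyTorch sketch of how such a slimming module could be implemented. The two-layer projection with GELU, the learnable scaling factor, and the row-wise softmax follow the description above, but the hidden width and the exact parameterization are our own assumptions rather than the paper's released configuration.

```python
import torch
import torch.nn as nn


class TokenSlimmingModule(nn.Module):
    """Aggregates N input tokens into N_hat informative tokens (sketch)."""

    def __init__(self, dim, n_in, keep_ratio=0.5, hidden_ratio=0.5):
        super().__init__()
        n_out = max(1, int(n_in * keep_ratio))
        hidden = max(1, int(dim * hidden_ratio))      # assumed hidden width
        self.proj1 = nn.Linear(dim, hidden)           # W1
        self.act = nn.GELU()                          # sigma
        self.proj2 = nn.Linear(hidden, n_out)         # W2: one score per output token
        self.scale = nn.Parameter(torch.ones(1))      # learnable scaling factor s

    def forward(self, x):                             # x: (B, N, C)
        logits = self.proj2(self.act(self.proj1(x)))  # (B, N, N_hat)
        attn = logits.transpose(1, 2) / self.scale    # (B, N_hat, N)
        attn = attn.softmax(dim=-1)                   # each row sums to 1 over N inputs
        return attn @ x                               # (B, N_hat, C) slimmed tokens
```

Because the slimming weights are computed from the input tokens themselves, the aggregation adapts to each image, unlike a fixed pooling or striding operation.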

Hard dropping vs. soft slimming. Prior works have tried to compress tokens via hard dropping [28, 24], in which the slimming weight is a binary decision matrix, i.e., dropping or keeping the corresponding token. However, this binary decision leads to severe information loss if numerous tokens are discarded. Such weakness limits their efficiency on ImageNet [8], wherein the objects often occupy a large part of the picture. On the contrary, we design soft slimming with a normalized weight matrix $\mathcal{A}$. It is able to discriminate the meaningful tokens in a global view, thus effectively generating decision-relevant tokens. Moreover, as shown in Figure 1(c), our soft slimming can dynamically zoom the attention scope to cover the regions significant for classification.
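The difference can be seen in a few lines: hard dropping keeps a binary subset of tokens and discards the rest, whereas soft slimming mixes all tokens with normalized weights. This is a toy illustration with random scores and weights, not either method's actual scoring function.

```python
import torch

x = torch.randn(1, 8, 16)                  # (batch, N=8 tokens, C=16 channels)
scores = torch.randn(1, 8)                 # per-token importance (toy values)

# Hard dropping: keep the top-4 tokens, discard the rest entirely.
idx = scores.topk(4, dim=1).indices                            # (1, 4)
x_hard = x.gather(1, idx.unsqueeze(-1).expand(-1, -1, 16))     # (1, 4, 16)

# Soft slimming: every output token is a normalized mixture of all inputs.
weights = torch.randn(1, 4, 8).softmax(dim=-1)                 # rows sum to 1
x_soft = weights @ x                                           # (1, 4, 16)
```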

(a) TSM
(b) RTSM
Figure 3: The pipelines of the token slimming module (TSM) and its reverse version (RTSM).

3.3 Dense Knowledge Distillation

Token reconstruction. Though token slimming significantly reduces the inference latency, it inevitably discards some decision-relevant token candidates, leading to an accuracy drop. To ensure stable extraction of the decision-relevant information, we propose Dense Knowledge Distillation (DKD), which regards the original vision transformer as a teacher to provide structure knowledge. We first design a reverse version of the token slimming module (RTSM) to reconstruct the original tokens in a flexible auto-encoder manner (Figure 3). Therefore, all the token information can be seamlessly transferred from the teacher. Note that we only perform RTSM during training, thus no extra computation is introduced during inference. We first linearly transform the informative tokens into plenty of token candidates, then utilize a nonlinear function (GELU) to filter the vital representations. Finally, another linear transformation is performed to compress the token candidates back to the original token number:

$\tilde{X} = W_2^{r}\, \sigma\!\left(W_1^{r}\, \hat{X}\right)$,   (4)

where $W_1^{r}$ and $W_2^{r}$ are learnable parameters operating along the token dimension, expanding the $\hat{N}$ slimmed tokens into token candidates and compressing them back to $N$ tokens. To further enhance the token representations, we introduce an extra multi-layer perceptron (MLP) block [32] with a residual connection [15]:

$X' = \tilde{X} + \mathrm{MLP}\!\left(\mathrm{LN}(\tilde{X})\right)$.   (5)

The recovered tokens $X'$ will be forced to be consistent with the original tokens in DKD, which guarantees sufficient information in the slimmed tokens $\hat{X}$.
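Below is a hedged PyTorch sketch of such a reconstruction branch: two linear maps over the token axis with a GELU in between recover the original token count, and an MLP block with a residual connection refines the result. The expansion size, the use of LayerNorm inside the MLP block, and the 4x MLP width are our assumptions; the module is only used during training.

```python
import torch.nn as nn


class ReverseTSM(nn.Module):
    """Reconstructs the N original tokens from N_hat slimmed tokens (training only)."""

    def __init__(self, dim, n_slim, n_full, expand_ratio=2):
        super().__init__()
        hidden_tokens = expand_ratio * n_full            # assumed number of token candidates
        # These linear maps act on the token axis: (B, N_hat, C) -> (B, N, C).
        self.up = nn.Linear(n_slim, hidden_tokens)
        self.act = nn.GELU()
        self.down = nn.Linear(hidden_tokens, n_full)
        # Channel-wise MLP block with a residual connection to refine the tokens.
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x_hat):                            # x_hat: (B, N_hat, C)
        z = x_hat.transpose(1, 2)                        # (B, C, N_hat)
        z = self.down(self.act(self.up(z)))              # (B, C, N), cf. Eq. (4)
        z = z.transpose(1, 2)                            # (B, N, C)
        return z + self.mlp(self.norm(z))                # cf. Eq. (5): residual MLP block
```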

(a) SiT surpasses the state-of-the-art CNNs by a large margin, even compared with EfficientNetV2 [27].
(b) SiT achieves superior speed-accuracy trade-off to all the previous ViTs at different complexity levels.
(c) The randomly sampled SiT-Ti models outperform other distilled or pruned ViTs.
Figure 4: Speed vs. accuracy and robustness study.

We compare our SiT with the previous state-of-the-art CNNs and ViTs in Figure 4(a) and 4(b), respectively. To verify the robustness of our method, we randomly change the numbers of blocks at each stage and adjust the keeping ratio of TSM from 0.3 to 0.7 to sample a series of SiT-Ti models in Figure 4(c). All of them are trained for 125 epochs.

Dense knowledge distillation. Due to the invariant model structure, we design a dense (layer-to-layer) knowledge distillation for the recovered tokens:

$\mathcal{L}_{\mathrm{token}} = \frac{1}{L N} \sum_{l=1}^{L} \sum_{i=1}^{N} \left\| x^{s}_{l,i} - x^{t}_{l,i} \right\|_2^2$,   (6)

where $x^{s}_{l,i}$ and $x^{t}_{l,i}$ refer to the $i$-th token embedding at the $l$-th layer of the student and teacher, respectively, and $L$ means the layer number. Note that $x^{s}_{l,i}$ refers to the recovered tokens $X'$ in Eq. 5. With such dense distillation, the student model is forced to maintain as much knowledge as possible in the informative tokens. Besides, to alleviate the classification performance deterioration caused by token slimming, we introduce logits distillation to minimize the prediction difference between the student and teacher:

$\mathcal{L}_{\mathrm{soft}} = \mathrm{KL}\!\left(\psi(z^{s}) \,\|\, \psi(z^{t})\right)$,   (7)

where KL denotes the Kullback–Leibler divergence loss and $\psi$ is the softmax function. $z^{s}$ and $z^{t}$ are respectively the predictions of the student and teacher model. Moreover, the above DKD is complementary to the hard distillation recommended in DeiT [29]:

$\mathcal{L}_{\mathrm{hard}} = \mathrm{CE}\!\left(\psi(z^{s}_{\mathrm{dist}}),\, y^{c}\right)$,   (8)

where $z^{s}_{\mathrm{dist}}$ indicates the prediction of the distillation head and $y^{c}$ is the hard decision of the extra CNN teacher. It can further improve the performance with longer training epochs. Our final objective of distillation for self-slimmed learning is:

$\mathcal{L}_{\mathrm{dist}} = \alpha\, \mathcal{L}_{\mathrm{token}} + \mathcal{L}_{\mathrm{soft}} + \beta\, \mathcal{L}_{\mathrm{hard}}$,   (9)

where $\alpha$ is the coefficient balancing the three distillation losses, and $\beta$ is set to 1 when the CNN teacher is involved and 0 otherwise. As for the training objective of self-slimmed learning, we treat the classification task and the distillation task equally:

$\mathcal{L}_{\mathrm{cls}} = \mathrm{CE}\!\left(\psi(z^{s}),\, y\right)$,   (10)

$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{dist}}$,   (11)

where $y$ means the ground truth, i.e., the one-hot label.
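The full training objective can be assembled as in the sketch below. This is a simplified illustration: the token-level consistency is measured with a mean-squared error and the loss weight defaults are placeholders, not the paper's tuned values.

```python
import torch.nn.functional as F


def self_slimmed_loss(student_tokens, teacher_tokens, z_s, z_t, z_dist, y,
                      y_cnn=None, alpha=1.0):
    """Sketch of the self-slimmed learning objective (cf. Eqs. 6-11).

    student_tokens / teacher_tokens: lists of (B, N, C) features per layer, where
    the student features are the RTSM-recovered tokens.
    z_s, z_t: student / teacher classification logits; z_dist: distillation head logits.
    y: ground-truth labels; y_cnn: hard labels from an optional CNN teacher.
    """
    # Eq. (6): dense layer-to-layer token distillation (MSE is an assumed distance).
    l_token = sum(F.mse_loss(s, t.detach()) for s, t in zip(student_tokens, teacher_tokens))
    l_token = l_token / len(student_tokens)

    # Eq. (7): soft logits distillation with KL divergence.
    l_soft = F.kl_div(F.log_softmax(z_s, dim=-1), F.softmax(z_t, dim=-1).detach(),
                      reduction="batchmean")

    # Eq. (8): optional hard distillation from a CNN teacher (as in DeiT).
    l_hard = F.cross_entropy(z_dist, y_cnn) if y_cnn is not None else z_s.new_zeros(())

    # Eqs. (9)-(11): distillation objective plus the usual classification loss.
    l_dist = alpha * l_token + l_soft + l_hard
    l_cls = F.cross_entropy(z_s, y)
    return l_cls + l_dist
```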

 

Method | Unstructured Tokens | Same Architecture | Feature Adaption | Knowledge Source
FitNets [25] | ✗ | ✗ | Linear | Early layers
AT [45] | ✗ | ✗ | Attention map | End of layer group
FSP [42] | ✗ | ✗ | FSP relation | End of layer group
MINILM [35] | ✗ | ✗ | Self-attention | End of layer group
Our DKD | ✓ | ✓ | RTSM | All layers

 

Table 2: Comparison between our DKD and other hint knowledge distillation methods. Our DKD provides dense supervision for unstructured tokens, which are first reconstructed by RTSM.

DKD vs. other knowledge distillation. We compare our well-designed dense knowledge distillation with other distillation methods. Firstly, current methods [29, 17] simply select a strong teacher network with a totally different architecture, e.g., RegNet for DeiT and NFNet for LV-ViT. Hence, only sparse knowledge can be used to supervise the student, such as the single image-level or dense token-level predictions generated by the last classification layer. Due to the structural isolation between student and teacher in conventional KD, the semantic information in the intermediate layers is ignored. In DKD, thanks to the consistency between the teacher and student, we naturally conduct dense layer-wise and token-level supervision for each layer, which greatly improves the stability of the model mimicking. Secondly, as shown in Table 2, we compare our DKD with some popular hint knowledge distillation methods [25, 45, 42, 35]. Since they are conducted between different structures, i.e., teacher and student differ, they only transfer partial, pre-defined layer knowledge to the student. Furthermore, they are often designed for transferring contiguous token information: when handling misaligned resolutions, they simply adopt local operations like convolution to align token numbers, which is impossible for unstructured tokens. In contrast, we elaborately design RTSM to reconstruct the original tokens, so token information can be seamlessly transferred to the student layer-to-layer.

4 Experiments

4.1 Implementation Details

In this section, we conduct comprehensive experiments to empirically analyze the effectiveness of our proposed self-slimmed learning for vision transformers (SiT). All the models are evaluated on the ImageNet dataset [8]. For our teacher models, we train LV-ViTs [17] following the original settings, but we replace the patch embedding module with lightweight stacked convolutions inspired by LeViT [12]. All the teacher models share the same head dimension (64) for self-attention and expansion ratio (3) for the FFN [10]. As for the student models, all the training hyper-parameters are the same as in DeiT [29] by default. For initialization, we load all the weights from the corresponding teacher models to accelerate convergence and train the students for 125 epochs. If an extra CNN teacher is utilized, we train the student for 300 epochs for better improvements. Moreover, we set different initial learning rates for the backbone and the token reconstruction branch. For token slimming, we insert TSM three times, thus there are four stages in SiT. The default keeping ratio is set to 0.5, which means the token number is halved after each slimming.
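For instance, the separate learning rates for the backbone and the reconstruction branch can be realized with two optimizer parameter groups, as sketched below. The concrete rates and the weight decay are hypothetical placeholders; the paper only states that the two branches use different initial learning rates and otherwise follows DeiT's recipe.

```python
import torch
from torch import nn


def build_optimizer(model: nn.Module, lr_backbone: float = 1e-4, lr_rtsm: float = 1e-3):
    """Two parameter groups: backbone vs. RTSM reconstruction branch.

    The learning-rate values and the `rtsm` name filter are illustrative
    assumptions, not the paper's exact configuration.
    """
    backbone_params = [p for n, p in model.named_parameters() if "rtsm" not in n.lower()]
    rtsm_params = [p for n, p in model.named_parameters() if "rtsm" in n.lower()]
    return torch.optim.AdamW(
        [{"params": backbone_params, "lr": lr_backbone},
         {"params": rtsm_params, "lr": lr_rtsm}],
        weight_decay=0.05,   # placeholder, DeiT-style default
    )
```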

Model | Depth | Stage | Embed Dim | Heads | Resolution | #Params (M) | FLOPs (G)
SiT-Ti | 14 | {1,1,1,11} | 320 | 5 | 224 | 15.9 | 1.0
SiT-XS | 16 | {1,1,1,13} | 384 | 6 | 224 | 25.6 | 1.5
SiT-S | 16 | {9,3,2,2} | 384 | 6 | 224 | 25.6 | 4.0
SiT-M | 20 | {10,4,3,3} | 512 | 8 | 224 | 55.6 | 8.1
SiT-L | 24 | {10,4,3,7} | 768 | 12 | 288 | 148.2 | 34.4
(a) Model architecture settings
Model | Student Throughput (image/s) | Student Top-1 (%) | Student Top-1 w/ CNN teacher (%) | Teacher Throughput (image/s) | Teacher Top-1 (%)
SiT-Ti | 5896 | 80.1 (-2.0) | 80.6 (-1.5) | 1827 | 82.1
SiT-XS | 4839 | 81.1 (-2.2) | 81.8 (-1.5) | 1360 | 83.3
SiT-S | 1892 | 83.2 (-0.1) | 83.4 (+0.1) | 1360 | 83.3
SiT-M | 1197 | 84.1 (-0.1) | 84.3 (+0.1) | 804 | 84.2
SiT-L | 346 | 85.6 (-0.1) | - | 204 | 85.7
(b) Efficiency comparisons
Table 3: Main results on ImageNet. We apply our self-slimmed learning to the state-of-the-art vanilla vision transformer LV-ViT [17]. The column "Student Top-1 w/ CNN teacher" reports students trained with an extra CNN teacher. Our SiT can speed up LV-ViT with a slight accuracy drop. For fast inference, our SiT can maintain 97% of the performance while significantly speeding up the original transformers.

 

Type | Method | ImageNet Top-1 (%) | Throughput (image/s) | Throughput Gain (%)
Baseline | DeiT [29] | 79.8 | 1637 | 0
Structure Pruning | SViTE [5] | 79.2 (0.6) | 2117 | 29.3
Token Hard Dropping | PS-ViT [28] | 79.4 (0.5) | 2351 | 43.6
Token Hard Dropping | IA-RED [22] | 79.1 (0.7) | 2369 | 44.7
Token Hard Dropping | Dynamic-ViT [24] | 79.3 (0.5) | 2575 | 57.3
Token Hard Dropping | Evo-ViT [39] | 79.4 (0.4) | 2629 | 60.6
Token Soft Slimming | Our SiT | 79.8 (0.0) | 2344 | 43.2
Token Soft Slimming | Our SiT | 79.4 (0.4) | 3308 | 102.1

 

Table 4: Comparison to recent model pruning methods for ViT. Our SiT surpasses all the other methods based on structure pruning or token hard dropping.

 

Model  #Params (M)  FLOPs (G)  Throughput (image/s)  ImageNet Top-1 (%)
EfficientNet-B0 [26] 5.3 0.4 4204 77.1
EfficientNet-B1 [26] 7.8 0.7 2559 79.1
EfficientNet-B2 [26] 9.1 1.1 1808 80.1
PVT-S [36] 24.5 3.8 1003 79.8
DeiT-T [29] 5.9 1.3 3346 74.5
LeViT-256 [12] 18.9 1.1 5802 80.1
CaiT-XXS36 [30] 17.3 3.8 513 79.7
SiT-Ti 15.9 1.0 5896 80.1
SiT-Ti 16.2 1.0 5833 80.6

 

EfficientNet-B3 [26] 12.2 1.9 1062 81.6
PVT-M [36] 44.2 6.7 711 81.2
PVT-L [36] 61.4 9.9 496 81.7
Swin-T [21] 28.3 4.5 1023 81.3
DeiT-S [29] 22.4 4.6 1598 81.2
LeViT-384 [12] 39.1 2.4 3876 81.6
SiT-XS 25.6 1.5 4839 81.1
SiT-XS 26.0 1.5 4798 81.8

 

EfficientNet-B4 [26] 19.3 4.6 545 82.9
Swin-S [21] 49.6 8.8 649 83.0
Swin-B [21] 87.8 15.5 474 83.3
DeiT-B [29] 87.3 17.7 718 83.4
CaiT-XS36 [30] 38.6 8.1 368 82.9
LV-ViT-S [17] 26.2 6.6 1270 83.3
SiT-S 25.6 4.0 1892 83.2
SiT-S 26.0 4.0 1873 83.4

 

EfficientNet-B5 [26] 30.4 10.8 246 83.6
EfficientNet-B6 [26] 43.0 19.9 153 84.0
EfficientNet-B7 [26] 66.3 39.2 86 84.3
EfficientNetV2-S [27] 21.5 8.5 742 83.9
NFNet-F0 [1] 71.5 12.6 361 83.6
CaiT-S36 [30] 68.2 13.9 233 83.9
LV-ViT-M [17] 55.8 12.7 768 84.1
SiT-M 55.6 8.1 1197 84.1
SiT-M 56.2 8.1 1185 84.3

 

EfficientNetV2-M [27] 54.1 25.0 271 85.1
NFNet-F1 [1] 132.6 36.0 128 84.7
NFNet-F2 [1] 193.8 63.2 72 85.1
CaiT-M36 [30] 270.1 53.4 130 85.1
LV-ViT-L [17] 150.1 58.8 208 85.3
SiT-L 148.2 34.4 346 85.6

 

Table 5: Comparison to the state-of-the-art on ImageNet. For each SiT variant, the second entry is trained with distillation supervision from a powerful CNN for 300 epochs. Our SiT achieves the best balance between throughput and accuracy.

4.2 Main Results

We conduct our self-slimmed learning on LV-ViT [17], the state-of-the-art vanilla vision transformer. Table 3 shows the detailed settings of different SiT variants. For SiT-Ti and SiT-XS, we explore their capacity for fast inference, thus we insert TSMs in the early layers. The results demonstrate that our self-slimmed method is able to significantly speed up the original vision transformers while maintaining at least 97% of their accuracy. Besides, we adopt another CNN teacher to provide hard labels as in DeiT [29]. The results show that this complementary prediction supervision can further improve performance. As for the other variants, we insert TSMs in the deeper layers. Surprisingly, with a negligible accuracy drop, our SiTs are considerably faster than their teacher models. It is worth mentioning that the extra CNN prediction supervision brings little improvement here, mainly because the CNN teacher is worse than the original transformer teacher.

Robustness of TSM locations and keeping ratio. To verify the robustness of our method, we conduct experiments as shown in Figure 4. We choose our SiT-Ti as the baseline and randomly change the numbers of blocks in each stage (i.e., the TSM locations) and the keeping ratio of TSM from 0.3 to 0.7. All the models are trained for 125 epochs without a CNN teacher. It clearly shows that all of the randomly sampled models outperform popular ViTs with knowledge distillation, e.g., DeiT [29] and XCiT [11]. Besides, compared with other counterparts based on token hard dropping [24, 28] and structure pruning, our models surpass them by a large margin. These results demonstrate that our SiT is insensitive to the settings of TSM locations and keeping ratio. To make a fair comparison with the state-of-the-art ViTs, we set these hyperparameters according to the GFLOPs.

Method | Top-1 (%) | Throughput (image/s)
Structure-width | 76.3 | 2947
Structure-depth | 69.4 | 5652
DynamicViT [24] | 75.7 | 5762
SiT w/o DKD | 77.7 | 5896
(a) Efficiency comparison. Performing token slimming only with TSM still yields the best efficiency among other scaling down methods.
Knowledge | Self | CaiT | RegNet
Teacher Top-1 | 83.3 | 83.5 | 82.9
Scratch | 80.1 | 79.9 | 79.2
Fine-tuning | 80.5 | 80.2 | 80.0
Fine-tuning + Structure | 81.1 | 80.6 | 80.2
(b) Inherited knowledge. Pre-trained weights accelerate convergence and structure knowledge increases accuracy with different teachers.
FLOPs Ratio | DKD | CNN Distillation
1 | 82.1 | 82.1
0.75 | 82.0 | 82.0
0.5 | 81.6 | 81.3
0.25 | 80.1 | 78.4
(c) Robustness analysis. Our self-slimming learning with DKD is robust to FLOPs ratio.
Method | GFLOPs | Top-1 (%)
None | 3.5 | 82.1
3×3 AvgPool | 1.0 | 77.4
3×3 Conv | 1.0 | 79.3
Token-Mixer | 1.1 | 79.3
Our TSM | 1.0 | 80.1
(d) Token slimming methods. Our dynamic TSM reaches better accuracy than those methods with fixed parameters.
Method | Top-1 (%)
None | 79.0
Token-Linear | 78.8
Token-Mixer | 79.0
Token-Linear+MLP | 79.6
Our RTSM | 80.1
(e) Token reconstruction methods. The extra MLP is critical to the token reconstruction.
Method | Top-1 (%)
Baseline | 77.7
+ soft logits distillation | 79.0
+ layer-to-layer token distillation | 80.1
+ hard label (CNN) distillation | 80.2
+ longer training | 80.6
(f) Knowledge distillation. Each distillation supervision helps improve the performance. Training for longer epochs with CNN distillation further improves the performance.
Table 6: Ablation studies. If not otherwise specified, all ablation experiments are conducted on SiT-Ti and run for only 125 training epochs under the supervision of the original teacher model. "Token-Linear" and "Token-Mixer" refer to single and double linear layers without residual connection, respectively. "MLP" means the MLP block proposed in ViT [10].

4.3 Comparison to state-of-the-art

In Table 5, we compare SiT with other competitive CNNs and ViTs. For a fair comparison, we group these methods according to their top-1 accuracy. The throughput is measured on a single 16GB V100 GPU under the same setting as LeViT [12]. Our SiT-Ti is competitive with LeViT, while its throughput is higher than that of EfficientNet [26]. Note that EfficientNet is designed via extensive neural architecture search and LeViT is elaborately designed for fast inference. Our larger model variants perform better than EfficientNetV2 [27] with simple training strategies. Compared with the original LV-ViT [17], our SiT is consistently faster at similar accuracy. We further visualize the comparisons to the upper bounds of CNNs and ViTs in Figure 4(a) and 4(b). It clearly shows that our SiT achieves the best balance between throughput and accuracy, surpassing the recent state-of-the-art CNNs and ViTs. Finally, we compare our SiT with recent model pruning methods for ViT in Table 4. On one hand, we are able to improve the throughput by 43.2% without a performance drop. On the other hand, our SiT can obtain comparable accuracy to state-of-the-art approaches, e.g., Evo-ViT [39], while accelerating the inference by over 100%.

4.4 Ablation Studies

Does token slimming outperform model scaling down? In Table 6(a), we compare token slimming with model scaling-down rules under the same computation limit, in order to verify the effectiveness of our TSM. For model scaling down, we adapt the channel width and depth individually. Note that these two models are trained from scratch for 300 epochs with the distillation technique of token labeling [17]. For token slimming, we simply insert TSMs without DKD. We also drop tokens and train with extra distillation as in DynamicViT [24]. It shows that scaling along the channel achieves higher accuracy than scaling along the depth, but with lower throughput. Besides, token slimming can largely improve the throughput with higher performance. However, DynamicViT performs worse than our SiT without distillation, mainly because token hard dropping loses much discriminative information at a large slimming ratio. These results demonstrate that simply inserting our TSM into a vanilla ViT is able to achieve great performance.

Does structure knowledge matter to self-slimmed learning? We further investigate whether structure knowledge benefits the performance, as shown in Table 6(b). For the teacher models, we adopt different architectures (LV-ViT-S [17], CaiT-S24 [30], and RegNetY-16GF [23]) with similar accuracies for a fair comparison. It shows that training with the pre-trained weights for 125 epochs converges to higher accuracy than training from scratch for 300 epochs. Moreover, we utilize structure knowledge via layer-to-layer mimicking, which further boosts the performance. It also reveals that higher similarity between student and teacher brings greater improvements.

Is self-slimmed learning robust to different FLOPs ratios? In Table 6(c), we empirically train models with different FLOPs ratios. When the ratio is larger than 0.5, our DKD and CNN distillation are both helpful for maintaining performance. However, when the ratio is small, CNN distillation leads to a larger performance drop, while our DKD only drops the accuracy by 2.0%. These results demonstrate that our method is robust to different FLOPs ratios.

Figure 5: Visualizations of our progressive token slimming. The blue tokens contribute less to the final informative tokens, while the red tokens contribute more. Our method can zoom the attention scope to cover the key object, even with only 12.5% of tokens.
Figure 6: Cross CKA heatmap between different student models and the teacher models. We adopt LV-ViT-S [17] as the student. Transferring knowledge densely from the same structure yields the largest similarity, while achieving the best results as shown in Table 6(b).

Dynamic vs. Static: Which aggregation manner works better for token slimming? To explore whether dynamic aggregation is better for token slimming, we perform ablation experiments as shown in Table 6(d). For static aggregation, we choose different data-independent operations and maintain similar computation: 3×3 average pooling/convolution with stride 2, and double linear layers with a GELU function ("Token-Mixer"). It shows that learnable parameters are vital for token slimming, since average pooling leads to a severe accuracy drop. Besides, the static aggregation methods with data-independent weights yield similar but inferior performance to our TSM (79.3% vs. 80.1%). Such comparisons prove that our TSM can generate more informative tokens.

How much does the MLP bring for token reconstruction? We first reconstruct the original tokens with only single or double linear layers. As shown in Table 6(e), "Token-Linear" and "Token-Mixer" do not bring any accuracy gains and even hurt the capacity compared with the baseline (without layer-to-layer mimicking). Surprisingly, simply introducing an MLP [10] obviously improves the performance, by 0.8% and 1.1% respectively. It shows that by enhancing the token representations individually, the MLP can guarantee sufficient information in the slimmed tokens.

Does each distillation supervision help? Table 6(f) presents that the soft logits supervision brings a 1.4% accuracy gain. When further introducing layer-to-layer knowledge supervision, our model improves the accuracy by 1.1%. Finally, combining the complementary hard label supervision, the accuracy reaches 80.6% with longer training epochs.

4.5 Visualization

Qualitative token slimming visualization. Figure 5 shows the original images and the token slimming procedure of our SiT-Ti. We observe that the tokens with higher scores, i.e., the brighter tokens, are concentrated and tend to cover the key objects in the image. It demonstrates that our proposed TSM is able to localize the significant regions and predict accurate scores for the most informative tokens.

Qualitative DKD visualization. In Figure 6, we compute the CKA [19] heatmap by comparing all layers of the student models (LV-ViT-S) with all layers of their teacher models. It shows that the CKA similarities between similar structures are generally higher than those between different structures (0.75/0.85 vs. 0.33/0.38). Interestingly, we find that the pre-trained weights inherited by the student force it to stay similar to its teacher. Besides, for similar structures, the CKA similarities in the shallow layers are higher than those in the deep layers, mainly because we slim a large number of tokens after the third layer, leading to an inevitable information loss. As for different structures, the CKA similarities in the deep layers are higher than those in the shallow layers, mainly because the logits distillation provides direct supervision for features in the deeper layers. Note that the above observations are consistent with the results in Table 6(b), which reveals that teachers with similar structures can transfer structure knowledge better for higher performance.
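For reference, the linear CKA similarity used for such heatmaps can be computed as follows; this is a standard implementation of linear CKA [19], not code specific to this paper.

```python
import torch


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two feature matrices of shape (num_samples, dim)."""
    x = x - x.mean(dim=0, keepdim=True)      # center the features
    y = y - y.mean(dim=0, keepdim=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = (y.t() @ x).norm(p="fro") ** 2
    norm_x = (x.t() @ x).norm(p="fro")
    norm_y = (y.t() @ y).norm(p="fro")
    return cross / (norm_x * norm_y)
```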

5 Conclusions

In this paper, we propose a generic self-slimmed learning method for vanilla vision transformers (SiT), which can speed up the ViTs with negligible accuracy drop. Our concise TSM softly integrates redundant tokens into fewer informative ones. For stable and efficient training, we introduce a novel DKD framework to leverage structure knowledge, which can densely transfer token information in a flexible auto-encoder manner. Extensive experiments demonstrate the effectiveness of our SiT. By simply arming LV-ViT with our SiT, we achieve new state-of-the-art performance on ImageNet, surpassing recent CNNs and ViTs.

Limitations. Though token soft slimming can automatically generate informative tokens, it undoubtedly damages location information, which is significant for dense prediction tasks, e.g., detection and segmentation. In future work, we will explore feasible designs of RTSM to recover location information explicitly.

References

Appendix A More details about teacher models

 

Teacher | Student | Resolution | Depth | Heads | Patch Stem (ks, st, oc)
Our LV-ViT-Ti | SiT-Ti | 224 | 14 | 5 | (3×3, 2×2, 40), (3×3, 2×2, 80), (3×3, 2×2, 160), (3×3, 2×2, 320)
Our LV-ViT-S | SiT-XS, SiT-S | 224 | 16 | 6 | (3×3, 2×2, 48), (3×3, 2×2, 96), (3×3, 2×2, 192), (3×3, 2×2, 384)
Our LV-ViT-M | SiT-M | 224 | 20 | 8 | (3×3, 2×2, 64), (3×3, 2×2, 128), (3×3, 2×2, 256), (3×3, 2×2, 512)
Our LV-ViT-L | SiT-L | 288 | 24 | 12 | (3×3, 2×2, 96), (3×3, 1×1, 96), (3×3, 1×1, 96), (3×3, 2×2, 192), (3×3, 2×2, 384), (3×3, 2×2, 768)

 

Table 7: Details of teacher models. The head dimensions of all the models are set to 64. ‘ks’ means kernel size. ‘st’ means stride. ‘oc’ means output channel number.

Table 7 shows more details about our teacher models. We elaborately design different patch stems for our LV-ViT [17] models.

Appendix B More robustness analysis.

 

FLOPs Ratio | DKD | CNN Distillation
1 | 83.3 | 83.3
0.75 | 83.2 | 83.0
0.5 | 82.6 | 82.2
0.25 | 80.9 | 80.0

 

Table 8: Robustness analysis based on our LV-ViT-S.

We conduct more analysis based on our LV-ViT-S in Table 8. It shows that our self-slimmed learning is also robust to different FLOPs ratios on LV-ViT-S. Moreover, our method still performs better than CNN distillation on the larger model.

Appendix C More experiments on DeiT

We also verify the effectiveness of our self-slimmed learning on DeiT, as illustrated in Table 9. For the FLOPs ratios of 0.5 and 0.25, the stage numbers are {3,4,3,2} and {1,1,1,9} respectively. Specifically, we conduct the experiments on the original DeiT [29] and its variant with a lightweight convolutional patch embedding (denoted DeiT-S (conv) in Table 9). Both models achieve similar accuracy with the same computational cost. However, we observe that the performance of their students is quite different, especially at a small FLOPs ratio. The original DeiT suffers severe performance deterioration when computation is reduced, while the convolutional variant only drops the accuracy by 2.5%. More importantly, the variant generally obtains higher accuracy than the original DeiT at a relatively higher FLOPs ratio. It demonstrates that models with convolutional patch embedding are more redundant and friendly to slimming. In addition, we also compare our DKD with the CNN distillation under different settings. The layer-to-layer dense knowledge distillation consistently brings more performance gains than CNN distillation. It is worth mentioning that self-slimmed learning is also complementary to the extra CNN distillation. Surprisingly, under the joint supervision, the best student model of the convolutional variant even outperforms its teacher by 0.6% top-1 accuracy while running nearly twice as fast. These results prove the effectiveness and generalization ability of our self-slimmed learning.

As described in Table 10, we further compare our self-slimmed learning with the recent method DynamicViT. We observe that our SiT runs slightly faster than DynamicViT at the same FLOPs, which reveals that our TSM offers better inference efficiency than the prediction module of DynamicViT. More importantly, thanks to the soft slimming design, SiT outperforms DynamicViT by a large margin (5.3%-10.0%) at the FLOPs ratio of 0.25. At the larger FLOPs ratio, our SiT still obtains at least 0.7% higher accuracy than DynamicViT, proving that soft slimming triumphs over hard dropping.

Appendix D More visualizations

We present more visualizations of our progressive token slimming in Figure 7.

 

Model | FLOPs Ratio | FLOPs (G) | Throughput (image/s) | ImageNet Top-1 (%)
DeiT-S | 0.25 | 1.1 | 6413 | 71.6 (-8.2)
DeiT-S | 0.25 | 1.1 | 6413 | 75.9 (-3.9)
DeiT-S | 0.25 | 1.1 | 6286 | 72.9 (-6.9)
DeiT-S | 0.25 | 1.1 | 6286 | 75.3 (-4.5)
DeiT-S | 0.5 | 2.3 | 3308 | 78.6
DeiT-S | 0.5 | 2.3 | 3308 | 79.4
DeiT-S | 0.5 | 2.3 | 3262 | 78.8
DeiT-S | 0.5 | 2.3 | 3262 | 79.8
DeiT-S | 1 | 4.6 | 1637 | 79.8
DeiT-S (conv) | 0.25 | 1.1 | 5898 | 76.1 (-3.9)
DeiT-S (conv) | 0.25 | 1.1 | 5898 | 78.4 (-1.6)
DeiT-S (conv) | 0.25 | 1.1 | 5830 | 77.5 (-2.5)
DeiT-S (conv) | 0.25 | 1.1 | 5830 | 78.8 (-1.2)
DeiT-S (conv) | 0.5 | 2.3 | 3150 | 79.1
DeiT-S (conv) | 0.5 | 2.3 | 3150 | 79.9
DeiT-S (conv) | 0.5 | 2.3 | 3106 | 80.3
DeiT-S (conv) | 0.5 | 2.3 | 3106 | 80.6
DeiT-S (conv) | 1 | 4.6 | 1597 | 80.0

 

Table 9: More results on DeiT. "DeiT-S" indicates the original DeiT and "DeiT-S (conv)" refers to the variant with a lightweight convolutional patch embedding stacked by four 3×3 convolutions (stride 2×2) and one point-wise convolution. Rows sharing the same FLOPs ratio correspond to different distillation settings (with and without our DKD and the extra CNN teacher).

 

Model | FLOPs Ratio | FLOPs (G) | DynamicViT Throughput (image/s) | DynamicViT Top-1 (%) | SiT Throughput (image/s) | SiT Top-1 (%)
DeiT-S | 0.25 | 1.1 | 6254 | 65.6 (-14.2) | 6413 | 75.9
DeiT-S | 0.5 | 2.3 | 3248 | 78.4 (-1.4) | 3308 | 79.4
DeiT-S | 1 | 4.6 | 1637 | 79.8 | 1637 | 79.8
DeiT-S (conv) | 0.25 | 1.1 | 5689 | 73.4 (-6.6) | 5898 | 78.4
DeiT-S (conv) | 0.5 | 2.3 | 3092 | 79.2 (-0.8) | 3150 | 79.9
DeiT-S (conv) | 1 | 4.6 | 1597 | 80.0 | 1597 | 80.0

 

Table 10: Comparisons between DynamicViT and our SiT on DeiT.
Figure 7: More visualizations of our SiT.