The vision transformer has achieved stunning success in computer vision since ViT [dosovitskiy2020image]. It has shown impressive capability over convolutional neural networks (CNNs) across prevalent visual domains, including image classification [deit, twins], object detection [detr, zhu2020deformable], semantic segmentation [liu2021swin], and action recognition [mvit, bertasius2021space], under both supervised and self-supervised [he2021masked, bao2021beit] training configurations. Along with the development of ViT models, the deployment of vision transformers is becoming an issue due to their high computational cost.
How to accelerate ViT is an important yet rarely explored topic. Due to the substantial architectural differences between CNNs and ViT [naseer2021intriguing, mahmood2021robustness, raghu2021vision], many techniques for accelerating CNNs (e.g., pruning, distillation, and neural architecture search) cannot be directly applied to ViT. As the attention module in the transformer computes fully-connected relations among all input patches [vaswani2017attention], the computational cost is quadratic in the length of the input sequence [choromanski2020rethinking, beltagy2020longformer]. As a result, the transformer suffers heavy computational costs, especially when the input sequence is long. In the ViT model, an image is split into a fixed number of tokens following the conventional paradigm [dosovitskiy2020image]. We aim to reduce the computational complexity of ViT by reducing the number of tokens used to split an image. Our motivation is illustrated in Figure 1, which shows three examples predicted by ViT models with three different token lengths. The results are obtained from three individually trained DeiT-S [deit] models with different token lengths; a check-mark denotes a correct prediction and a cross denotes a wrong one. We observe that some "easy-to-classify" images need only a small number of tokens to be correctly categorized (e.g., the dog image on the left), whereas other images need many more tokens to be predicted correctly. These observations suggest that the computational complexity of the existing ViT model can be drastically reduced if each input can be accurately classified with the minimum number of tokens.
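The quadratic cost noted above is why shrinking the token grid pays off so strongly. The following back-of-the-envelope sketch is purely illustrative (constant factors and the MLP blocks are ignored; the 384-dimensional embedding is an assumption matching DeiT-S, and the grid sizes are the ones used later in the paper):

```python
# Rough self-attention cost model: for n tokens of dimension d, the
# attention matrices (QK^T and AV) alone cost on the order of n^2 * d
# multiply-accumulates, so halving the grid side cuts cost ~16x.

def attention_flops(num_tokens: int, dim: int) -> int:
    """Approximate FLOPs of one self-attention layer (QK^T and AV only)."""
    return 2 * num_tokens * num_tokens * dim

if __name__ == "__main__":
    d = 384  # embedding dim of a small ViT such as DeiT-S (assumed)
    for side in (14, 10, 7):
        n = side * side
        print(f"{side}x{side} tokens: {attention_flops(n, d):,} FLOPs")
```

Under this toy model, moving from a 14×14 to a 7×7 grid reduces the attention cost by a factor of 16, which is the asymmetry the token-length assignment exploits.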
Ideally, suppose we knew the minimum number of tokens needed to correctly predict each individual image. In that case, a decent model could be trained with this information to assign the "optimal" token length for the ViT model. We therefore decouple the task into ViT training and token-length assignment. A naive approach is to train multiple ViT models, each dealing with one token length; however, this would be computationally prohibitive in both training and testing. To tackle this issue, we begin by modifying a transformer from "static," meaning the model can only be evaluated at a fixed, single token length, to "dynamic," meaning the ViT model can adaptively process images with multiple token lengths. This dynamic vision transformer, called Resizable-ViT (ReViT), can mark the minimum sufficient token length the ViT model needs to obtain the correct prediction for each image. Then, we train a lightweight Token-Length Assigner (TLA) to predict an appropriate token length for a given image, where the label is retrieved from the ReViT. As such, the ReViT can use significantly less computation by splitting images according to the assigned token length.
The biggest challenge of this approach is training the ReViT so that the ViT model can process an image at any token length provided by the TLA. We introduce a token-length-aware layer normalization to switch the normalization statistics for each token length, and a self-distillation module to improve the performance of short token lengths in ReViT. Another issue is that the ViT model has to "see" images with the corresponding token lengths beforehand to handle the various token lengths. As the number of predefined token-length choices increases, the training cost increases linearly. To reduce the training cost, we introduce a parallel computing strategy for efficient training, which makes training ReViT almost as cheap as training a vanilla ViT model.
We illustrate the effectiveness of our method on representative ViT models, including DeiT [deit] and LV-ViT [lvvit] for image classification, as well as TimeSformer [bertasius2021space] for video recognition. Our experiments demonstrate that our approach can effectively reduce the computational cost while maintaining performance. For example, we manage to speed up the DeiT-S [deit] model by 50% with only a 0.3% accuracy drop. On action recognition, the computational cost of TimeSformer [bertasius2021space] is reduced by up to 33% on Kinetics-400 while sacrificing only 0.5% recognition accuracy.
2 Related Work
Vision Transformer. Transformers [vaswani2017attention] have drawn much attention in computer vision recently due to their strong capability of modeling long-range relations. Many attempts have been made to integrate long-range modeling into CNNs, such as non-local networks [wang2018non, yin2020disentangled], relation networks [hu2018relation], etc. Vision Transformer (ViT) [dosovitskiy2020image] first introduced a pure-transformer backbone for image classification, and its follow-ups soon modified the vision transformer to dominate many downstream tasks in computer vision, such as object detection [carion2020end, zhu2020deformable], semantic segmentation [liu2021swin], action recognition [bertasius2021space, mvit], 2D/3D human pose estimation [yang2020transpose, poseformer], and 3D object detection [pointformer]. It has shown great potential as an alternative backbone to convolutional neural networks.
Model Compression. It is known that over-parameterized models have many attractive merits and can achieve better performance than small models. However, computational efficiency is critical in real-world scenarios, where the executed computation translates into power consumption or carbon emission. Many works have tried to reduce the computational cost of CNNs via neural architecture search [liu2018darts, zoph2016neural, fbnnas, chu2021fairnas, guo2020single], knowledge distillation [hinton2015distilling, zhu2021student], dynamic routing [elbayad2019depth, cai2019once, wang2020glance, zhu2019resizable, yu2018slimmable], and pruning [han2015deep, frankle2018lottery], but how to accelerate the ViT model has rarely been explored.
Recently, some attempts have been made to compress the vision transformer model, including quantization [liu2021post], distillation [li2021mst], and pruning [tang2021patch]. Some works have started to reduce the computational cost of ViT models in terms of token length. These works fall into two categories: unstructured token sparsification and structured token division. Most works, including PatchSlim [tang2021patch], TokenSparse [rao2021dynamicvit], GlobalEncoder [song2021dynamic], IA-RED [pan2021ia], and TokenLearner [ryoo2021tokenlearner], focus on the former. They aim to remove uninformative tokens, such as those that learn features from the background of the image, so that inference is accelerated by keeping only the informative tokens. These approaches typically need to progressively reduce the number of tokens based on the inputs and can be performed either jointly with ViT training or afterward. The latter, structured token division, is the work most related to ours. Wang et al. [wang2021not] proposed DVT to dynamically determine the number of patches into which an image is divided. Specifically, they leverage a cascade of ViT models, where each ViT is responsible for one token length. The cascade makes a sequential decision: it stops inference for an input image once it has sufficient confidence in the prediction at the current token length. Different from DVT [wang2021not], our method is more accessible and practical since only a single ViT model is required. Moreover, we pay more attention to how to accurately decide the smallest token length that gives a correct prediction for each image.
3 Methodology

The vision transformer treats an image as a sentence: it splits each 2D image into 1D tokens and models the long-range dependencies between tokens with the multi-head self-attention mechanism. Self-attention has been recognized as the computational bottleneck of the transformer model, and its cost increases quadratically with the number of incoming tokens. As aforementioned, our approach is motivated by the fact that many "easy-to-recognize" images do not require the full token length [dosovitskiy2020image] to be correctly classified. Thus, computation can be saved by processing fewer tokens for "easy" images while using more tokens for "hard" images. It is worth noting that the key to a successful input-dependent token-adaptive ViT model is knowing precisely the minimum number of tokens sufficient to correctly classify each image.
Therefore, we decouple model training into two stages. In the first stage, we train a ViT model that can process an image at any of the predefined token lengths (normally, a single ViT model can only deal with one token length); we discuss the detailed model design and training strategy of such a model in Section 3.2. In the second stage, we train a model to assign a token length to each image. First, we obtain the token-length label, i.e., the minimum number of tokens this ViT model needs to perform a correct classification, from the previously trained ViT model. Then, a Token-Length Assigner (TLA) is trained on the training data, where the input is an image and the label is its token length. Intuitively, this decoupled procedure helps the TLA make better decisions when selecting token lengths. During inference, the TLA directly tells the ViT model how many tokens are sufficient for the decision based on the input. The complete training and testing process is shown in Figure 2.
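The two-stage inference just described can be sketched as follows. Both `tla` and `revit` are toy stand-ins for the trained assigner and the Resizable-ViT (the names and the dictionary-based "image" are illustrative, not from any released code):

```python
# Toy stand-ins: a TLA that routes "easy" inputs to the shortest token
# length, and a ReViT stub that records which length it was asked to use.

def tla(image):
    """Lightweight assigner: pick a token length from the image alone."""
    return 49 if image["easy"] else 196

def revit(image, token_length):
    """ReViT stub: classify using exactly the assigned token length."""
    return {"pred": image["cls"], "tokens": token_length}

def infer(image):
    # Stage 2 model chooses the budget; stage 1 model spends it.
    return revit(image, tla(image))
```

The point of the sketch is the control flow: the assigner runs first and the single shared ViT then processes the image at the chosen token length.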
In the following, we first introduce the Token-Length Assigner, then present the training method for the Resizable-ViT model and the improved techniques.
3.1 Token-Length Assigner
The purpose of the Token-Length Assigner (TLA) is to make precise predictions based on feedback from the ReViT; TLA training is performed after the ReViT is trained. We first define a list of token lengths L = (l_1, l_2, ..., l_K), where the token lengths are in descending order. For notational simplicity, we use a single number to denote a token length, e.g., L = (14×14, 10×10, 7×7). The model with 7×7 tokens has the lowest computational cost among the three token lengths.
To train the TLA, we need to retrieve token-length labels from the trained ReViT. We define the token-length label of an image as the smallest token length with which the ViT model can still correctly classify that particular image. For instance, the inference speed of the ReViT satisfies Speed(l_1) < Speed(l_2) < ... < Speed(l_K), where K denotes the total number of token-length choices. For each input x, we obtain the prediction y_j at every token length l_j. The label of input x is the smallest token length l_j such that y_j equals the ground-truth label y_gt while any smaller token length fails to make the correct prediction. As such, we obtain a set of input-label pairs (x, l), which is used to train the TLA. The TLA is a lightweight module, since token-length assignment is an easy task with only a few labels. The additional computational overhead introduced by the TLA is quite small, especially considering the computation saved by removing unnecessary tokens in the ViT model.
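The label-extraction rule above can be sketched as a short routine. The ReViT itself is stubbed out by a lookup of per-token-length predictions; the function names and the fallback-to-largest behavior when even intermediate lengths fail are our own illustrative choices:

```python
# Token lengths in descending order, matching L = (14x14, 10x10, 7x7).
TOKEN_LENGTHS = [14 * 14, 10 * 10, 7 * 7]

def token_length_label(preds_by_length, ground_truth):
    """Return the smallest token length whose prediction is still correct.

    `preds_by_length` maps token length -> predicted class (the stubbed
    ReViT output). We walk the lengths from largest to smallest and keep
    the last one that still classifies correctly; if even the largest is
    wrong, we fall back to it.
    """
    label = TOKEN_LENGTHS[0]
    for tl in TOKEN_LENGTHS:          # descending order
        if preds_by_length[tl] == ground_truth:
            label = tl                # still correct: try an even smaller one
        else:
            break                     # this length fails; keep the last success
    return label
```

For example, an image that is classified correctly at 14×14 and 10×10 but not at 7×7 receives the label 10×10 = 100 tokens.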
3.2 Resizable-ViT

This section introduces the Resizable-ViT (ReViT), a dynamic ViT model that can predict the category of a given image at various token lengths. We introduce two techniques that improve the performance of ReViT, then present the training strategy along with an efficient implementation that accelerates ReViT training.
Token-Length-Aware Layer Normalization. The Layer Normalization (LN/LayerNorm) layer is a standard normalization method that accelerates training and improves generalization in the Transformer architecture. In both natural language processing and computer vision, it is common to adopt an LN after the addition in each transformer block. Because the feature maps of both the self-attention matrices and the feed-forward networks change as the token length changes during training, sharing LN layers leads to inaccurate normalization statistics across the different token lengths, which impairs test accuracy. We also empirically find that LN layers cannot be shared in ReViT.
To tackle this issue, we propose a Token-Length-Aware LayerNorm (TAL-LN), which uses an independent LayerNorm for each choice in the predefined token-length list. In other words, we use "{LN_1, ..., LN_k}" as a building block, where k denotes the number of predefined token lengths. As such, each LayerNorm layer specifically calculates layer-wise statistics and learns the parameters of its corresponding feature map. Moreover, the number of extra parameters in TAL-LN is negligible, since the parameters in normalization layers usually account for less than one percent of the total model size [yu2018slimmable]. A brief summary is shown in Figure 3.
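A minimal dependency-free sketch of TAL-LN, assuming one learnable (gamma, beta) pair per predefined token length; plain Python lists stand in for tensors, and the class names are illustrative:

```python
import math

class LayerNorm:
    """Per-feature normalization with learnable scale and shift."""
    def __init__(self, dim, eps=1e-5):
        self.gamma = [1.0] * dim  # learnable scale
        self.beta = [0.0] * dim   # learnable shift
        self.eps = eps

    def __call__(self, x):
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        inv = 1.0 / math.sqrt(var + self.eps)
        return [(v - mean) * inv * g + b
                for v, g, b in zip(x, self.gamma, self.beta)]

class TokenAwareLayerNorm:
    """TAL-LN sketch: one independent LayerNorm per token length."""
    def __init__(self, dim, token_lengths):
        self.norms = {tl: LayerNorm(dim) for tl in token_lengths}

    def __call__(self, x, token_length):
        # Switch to the statistics/parameters of the active token length.
        return self.norms[token_length](x)
```

All heavy layers (attention, MLP) stay shared; only these k small normalization layers are duplicated, which is why the parameter overhead is negligible.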
Self-Distillation. It is known that the performance of ViT is strongly correlated with the number of patches, and experiments show that reducing the token length significantly hampers the accuracy of small-token-length ViT. Directly optimizing with supervision from the ground truth alone poses difficulty for the small-token-length sub-model. Motivated by self-distillation, a variant of knowledge distillation in which the teacher can be insufficiently trained or even the student model itself [yu2018slimmable, zhu2021student, yu2019universally, yu2020bignas], we present token-length-aware self-distillation (TLSD). As we show in the next section, the model with the largest token length l_1 is always trained first. For l_1, the training objective is to minimize the cross-entropy loss L_CE. For a model with another token length l_j (j > 1), we use a distillation objective to train the target model:

L = (1 − λ) L_CE(ψ(Z_s), y) + λ τ² KL(ψ(Z_t / τ), ψ(Z_s / τ)),

where Z_s and Z_t are the logits of the student model and the teacher model, respectively, τ is the temperature for the distillation, λ is the coefficient balancing the KL loss (Kullback-Leibler divergence) and the CE loss (cross-entropy) on the ground-truth label y, and ψ is the softmax function. Similar to DeiT, we add a distillation token for the student models. Figure 3 gives an overview. Notably, this distillation scheme is computation-free: we can directly use the prediction of the model with the largest token length as the training target for the other sub-models, while the largest-token-length model itself is trained on the ground truth.
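The objective above can be sketched in pure Python. This follows the DeiT-style soft-distillation form reconstructed in the text; treating KL(teacher ∥ student) and the τ² scaling as our reading of it, with all function names illustrative:

```python
import math

def softmax(z):
    m = max(z)                       # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl_div(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(student_logits, teacher_logits, target, lam=0.5, tau=1.0):
    """(1 - lam) * CE(student, target) + lam * tau^2 * KL(teacher || student)."""
    p_s = softmax(student_logits)
    ce = -math.log(p_s[target])                        # cross-entropy on ground truth
    p_t = softmax([z / tau for z in teacher_logits])   # softened teacher
    p_st = softmax([z / tau for z in student_logits])  # softened student
    kd = kl_div(p_t, p_st) * tau * tau
    return (1 - lam) * ce + lam * kd
```

When the student matches the teacher exactly, the KL term vanishes and only the (down-weighted) cross-entropy remains, which matches the intuition that the teacher's logits act as free extra supervision for the small-token-length sub-models.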
3.3 Training Strategy
To let the ViT model adaptively process any token length in the predefined choice list, images with the various token lengths have to be "seen" during training. Inspired by batch gradient accumulation, a technique that bypasses the issue of small batch sizes by accumulating gradients and batch statistics within a single iteration (one parameter update counts as one iteration), we propose mixing token length training. As illustrated in Algorithm 1, a batch of images is processed with each of the different token lengths in the feed-forward pass to compute the loss, and the individual gradients are obtained. After all token-length choices have been looped over, the gradients computed from the different token lengths are accumulated to perform the parameter update.
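The loop structure of this mixed token-length training step can be sketched with a toy model. The single scalar weight and its hand-written gradient are stand-ins for the shared ViT backbone and autograd (everything here is illustrative; a real implementation would run a forward/backward pass per token length):

```python
TOKEN_LENGTHS = [14 * 14, 10 * 10, 7 * 7]

def grad_of_loss(w, token_length, batch):
    # Stand-in gradient: d/dw of 0.5 * (w - mean(batch))^2, pretending the
    # forward pass split the inputs into `token_length` tokens.
    target = sum(batch) / len(batch)
    return w - target

def train_step(w, batch, lr=0.1):
    """One iteration: accumulate gradients over ALL token lengths, then
    perform a single parameter update, as in Algorithm 1."""
    accum = 0.0
    for tl in TOKEN_LENGTHS:          # forward/backward once per token length
        accum += grad_of_loss(w, tl, batch)
    return w - lr * accum             # one update per iteration
```

The key property mirrored here is that the shared weights receive one combined update from all token lengths, rather than separate updates that could pull the model toward whichever length was seen last.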
Efficient Training Implementation. An issue with the above training strategy is that the training time increases linearly with the number of predefined token-length choices. Here, we propose an efficient implementation that trades memory for training time. As illustrated in Figure 4, we replicate the model so that each replica corresponds to one token length. At the end of each iteration, the gradients of the different replicas are synchronized and accumulated. We always send the gradients of the replicas with smaller token lengths to the replica with the largest token length, because the latter is the training bottleneck; the communication cost of the gradient synchronization step is therefore effectively free. The parameters of that model are then updated through back-propagation. After the parameter update is complete, the main process distributes the learned parameters to the rest of the replicas. These steps are repeated until the end of training, and all replicas except the model in the main process can then be discarded. As such, the training time of the Resizable-ViT is reduced from roughly K times the cost of training a vanilla ViT to that of a single one, where K is the number of predefined token lengths. Though K is small in practice (e.g., K = 3), the cost of each training run itself is high. With our designed parallel computing, training the Resizable-ViT costs almost the same as training a vanilla ViT, with the communication cost between replicas negligible compared to the model training cost. In exchange for fast training, extra computational resources are required for the parallel computation.
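A simplified single-process simulation of the gather-update-broadcast cycle described above. Dictionaries stand in for the per-token-length replicas; real code would synchronize across processes or GPUs, and all names here are illustrative:

```python
def efficient_step(params, replica_grads, lr=0.1):
    """One iteration of the replica scheme.

    `params`: current shared parameters (list of floats).
    `replica_grads`: token length -> that replica's gradient list.
    Returns the per-replica parameter copies after the broadcast.
    """
    # 1) Synchronize: gather and accumulate every replica's gradient.
    accum = [0.0] * len(params)
    for grads in replica_grads.values():
        accum = [a + g for a, g in zip(accum, grads)]
    # 2) Update once on the main replica (the largest token length).
    new_params = [p - lr * a for p, a in zip(params, accum)]
    # 3) Broadcast the updated parameters back to all replicas.
    return {tl: list(new_params) for tl in replica_grads}
```

After the broadcast, every replica holds identical parameters, so the next iteration again behaves like a single shared model trained on all token lengths at once.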
4 Experiments

Implementation details. We train on the ImageNet [deng2009imagenet] training set with approximately 1.2 million images and report accuracy on the 50k images of the validation set. By default, the predefined token lengths are set to 14×14, 10×10, and 7×7. We did not use 4×4 tokens, as shown in the motivation figure, since accuracy drops significantly with that number of tokens. We conduct experiments on DeiT-S [deit] and LV-ViT-S [lvvit] with an image resolution of 224 × 224 in training and testing unless otherwise specified. For the training settings and optimization methods, we simply follow the original papers of DeiT [deit] and LV-ViT [lvvit].
For LV-ViT, we obtain the token labels for smaller token lengths according to their method. Since our method is generally applicable to diverse ViT models, we did not carefully fine-tune the hyper-parameters. We also train ReViT on resized images with higher resolutions, 228 on DeiT-S and 448 on LV-ViT-S. Because convolutional layers with large kernels and strides are then required to perform patch embedding, which may cause optimization difficulty, we replace the large-kernel convolution with consecutive convolutions, following Xiao et al. [xiao2021early]. After ReViT finishes training, we obtain the token-length labels for all training data and train the Token-Length Assigner. The TLA is a shrunk version of EfficientNet-B0 and is extremely small compared to the ViT model. We give a detailed description of the TLA architecture and training details in the Appendix.
We use Kinetics-400 [kay2017kinetics] for our action recognition experiments. Kinetics-400 is a widely used benchmark for action recognition that includes around 240k training videos and 20k validation videos across 400 classes. We follow the training settings of TimeSformer [bertasius2021space]. Specifically, two versions of TimeSformer are tested: TimeSformer, the default version of the model operating on video clips, and TimeSformer-HR, a high-spatial-resolution variant. All models are pretrained on ImageNet-1K. TimeSformer uses its default patch size, while we use the same token-length settings as in the ImageNet classification experiments.
Main Results on ImageNet Classification. We report the main results of ReViT based on DeiT-S and LV-ViT-S in Figure 5. We compare our approach with a number of models, including DeiT [deit], CaiT [cait], LV-ViT [lvvit], CoaT [coat], Swin [liu2021swin], Twins [twins], Visformer [visformer], ConViT [wu2021cvt], TNT [tnt], and EfficientNet [tan2019efficientnet]. The comparison shows that our method achieves a favorable accuracy-throughput trade-off. Specifically, ReViT reduces the computational cost of its baseline counterpart by reducing the number of tokens used at inference; for instance, the inference speed of DeiT-S is increased by 50% with a 0.3% accuracy drop. By increasing the input resolution, we manage to outperform the baseline counterpart at a similar computational cost: the ReViT based on LV-ViT-S with the larger image size achieves 0.2% higher top-1 accuracy with slightly faster inference than the LV-ViT-S baseline.
Main Results on Kinetics-400 Action Recognition. We further apply ReViT to video action recognition. We train ReViT-TimeSformer and ReViT-TimeSformer-HR and compare them with the baseline TimeSformer and TimeSformer-HR, respectively. The results are listed in Table 1. Our method clearly speeds up the TimeSformer baseline: for TimeSformer, we reduce the computational cost by approximately 33% with a 0.5% accuracy drop, and when training on the larger image resolution we reduce the computational cost by 28% with a 0.4% accuracy drop, slightly worse than at the smaller resolution. Nevertheless, our experiments show that ReViT works generally well on action recognition.
Visualization of samples with different token lengths. We select eight classes from the ImageNet validation set and pick three samples each from three categories, easy, medium, and hard, corresponding to token lengths 7×7, 10×10, and 14×14, respectively. Samples are selected according to the token length assigned by the Token-Length Assigner. The images are shown in Figure 6. Note that some classes do not have all slots filled, because fewer than three samples in the validation set belong to that category; for instance, only one image in the dog class needs the largest token length for classification. We observe that the token length needed to predict a class is correlated with the object's size: if the object is large, only a few tokens are sufficient to predict its category.
4.1 Ablation Study
Shared patch embedding and position encoding. We experiment to see the impact of sharing the patch embedding and the position encoding across token lengths. Because the token number changes during training, we apply some techniques to make both operations shareable. For the position encoding, we follow ViT [dosovitskiy2020image] and zero-pad the position encoding module whenever the token length changes; this technique was originally used to adjust the positional encoding in the pretrain-finetune paradigm. For the shared patch embedding, we use a weight-sharing kernel [cai2019once]: a large kernel processes a large patch size, and when the patch size changes, a smaller kernel whose weights are shared from the center of the large one is adopted to flatten the image patch.
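The center-shared kernel idea can be sketched as a simple crop. Lists of lists stand in for the convolution weight tensor, and the function name is our own (a sketch of our reading of the shared-kernel scheme, not the paper's actual implementation):

```python
def center_crop_kernel(kernel, size):
    """Take the centered `size` x `size` sub-kernel of a square kernel.

    The large kernel embeds large patches; the returned centered
    sub-kernel reuses (shares) those same weights for smaller patches.
    """
    full = len(kernel)
    off = (full - size) // 2
    return [row[off:off + size] for row in kernel[off:off + size]]
```

For example, a 4×4 kernel cropped to 2×2 keeps only its four central weights, so the small-patch embedding is tied to the large-patch one rather than learned independently.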
As shown in Table 3, both shared patch embedding and shared position encoding decrease the model's accuracy. For the shared patch strategy in particular, accuracy drops by nearly 14% on the model with the largest token length. The shared position encoding module performs better than the shared patch embedding, but it still severely hurts the performance of ReViT.
Table 2 (header): Method | SD* | Top-1 Acc (%) at 14×14 / 10×10 / 7×7

Table 3 (header): Method | Shared Patch / Shared Pos | Top-1 Acc (%) at 14×14 / 10×10 / 7×7
The effect of self-distillation and the choice of λ. We conduct experiments to verify the effectiveness of self-distillation in ReViT and to discuss the impact of the hyper-parameter λ. We test two values of λ, 0.9 and 0.5, for all sub-networks, and report the results in Table 2. Without self-distillation, the accuracy at the intermediate token length is comparable, but it is much worse at the smallest token length. When applying self-distillation with λ = 0.5, the accuracy of both smaller-token-length models increases. We further evaluate the model with λ = 0.9. The higher λ affects the accuracy of the largest token length, dropping it by around 0.3%, but significantly increases the performance of the smaller-token-length models. This shows the necessity of self-distillation in our scenario.
Training cost and memory consumption. We compare ReViT with DeiT-S and DVT [wang2021not] in terms of training cost and memory consumption, as shown in Figure 7. ReViT-B denotes the baseline version of ReViT, and ReViT-E the efficient implementation. We observe that the training cost of both ReViT-B and DVT increases linearly as the number of token-length choices increases; ReViT-B is cheaper because the back-propagation of the multiple token lengths is merged. The training time of ReViT-E, on the other hand, increases only slightly, due to the growing communication cost between the parallel models. As for memory consumption (number of parameters) at test time, since our method uses a single ViT in which most computationally heavy components are shared, the memory cost is only slightly higher than the baseline. Compared to DVT, the increase in parameters with respect to the number of token-length choices is negligible for our method. This indicates that our approach is more practical than DVT in terms of both training and memory cost, and that it is also easier than DVT to apply to existing ViT models.
5 Conclusion

This paper aims to reduce the number of tokens used to split images in the ViT model, eliminating unnecessary computational cost. We propose the Resizable-ViT (ReViT), which can adaptively process a given image at any predefined token length, and a Token-Length Assigner that decides the minimum number of tokens with which the transformer can correctly classify each individual image. Extensive experiments indicate that ReViT can significantly accelerate state-of-the-art ViT models.