SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models

05/24/2023
by Zekun Wang, et al.

Despite achieving remarkable performance on various vision-language tasks, Transformer-based pretrained vision-language models (VLMs) still suffer from efficiency issues arising from long inputs and numerous parameters, which limit their real-world applications. However, much of this computation is redundant for most samples, and both the degree of redundancy and the components responsible vary significantly across tasks and input instances. In this work, we propose SmartTrim, an adaptive acceleration method for VLMs that adjusts the inference overhead according to the complexity of each instance. Specifically, SmartTrim incorporates lightweight trimming modules into the backbone to perform task-specific pruning of redundant inputs and parameters, without additional pre-training or data augmentation. Since visual and textual representations complement each other in VLMs, we leverage cross-modal interaction information to provide more critical semantic guidance for identifying redundant parts. Meanwhile, we introduce a self-distillation strategy that encourages the trimmed model to stay consistent with its full-capacity counterpart, which yields further performance gains. Experimental results demonstrate that SmartTrim significantly reduces the computation overhead of various VLMs (by 2-3x) with comparable performance (only a 1-2% degradation) on various vision-language tasks. Compared to previous acceleration methods, SmartTrim attains a better efficiency-performance trade-off, demonstrating great potential for application in resource-constrained scenarios.
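
The abstract describes two ingredients: a lightweight, cross-modal-aware trimming module that scores and prunes redundant tokens, and a self-distillation loss that keeps the trimmed model consistent with the full-capacity model. The sketch below illustrates one plausible way to realize these ideas in PyTorch; it is based only on the abstract, and the names (TokenTrimmer, self_distillation_loss), the mean-pooled cross-modal context, and the top-k keep rule are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenTrimmer(nn.Module):
    """Illustrative trimming module: score each token with a small MLP over
    [token; cross-modal summary] and keep only the highest-scoring tokens."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim // 2),
            nn.GELU(),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, tokens: torch.Tensor, other_modality: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) from one modality; other_modality: (B, M, D) from the other.
        # Mean-pool the complementary modality as a cheap cross-modal context vector.
        context = other_modality.mean(dim=1, keepdim=True).expand(-1, tokens.size(1), -1)
        scores = self.scorer(torch.cat([tokens, context], dim=-1)).squeeze(-1)  # (B, N)
        num_keep = max(1, int(tokens.size(1) * self.keep_ratio))
        keep_idx = scores.topk(num_keep, dim=1).indices                          # (B, k)
        # Gather the retained tokens; downstream layers then see a shorter sequence.
        return torch.gather(
            tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        )


def self_distillation_loss(trimmed_logits: torch.Tensor,
                           full_logits: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """KL divergence pushing the trimmed model's predictions toward the
    (detached) full-capacity model's predictions."""
    p_full = F.softmax(full_logits.detach() / temperature, dim=-1)
    log_p_trim = F.log_softmax(trimmed_logits / temperature, dim=-1)
    return F.kl_div(log_p_trim, p_full, reduction="batchmean") * temperature ** 2

In this reading, a trimmer would be inserted before selected backbone layers (e.g. trimmed_vis = TokenTrimmer(dim)(vision_tokens, text_tokens)), and the task loss would be combined with self_distillation_loss between the trimmed and full forward passes during fine-tuning; the actual SmartTrim training recipe may differ.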


Related research

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning (10/14/2022)
Pre-trained vision-language models (VLMs) have achieved impressive resul...

PuMer: Pruning and Merging Tokens for Efficient Vision Language Models (05/27/2023)
Large-scale vision language (VL) models use Transformers to perform cros...

Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model (05/21/2023)
The prevalence of Transformer-based pre-trained language models (PLMs) h...

CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers (05/27/2023)
Vision-language models have achieved tremendous progress far beyond what...

Attribution-based Task-specific Pruning for Multi-task Language Models (05/09/2022)
Multi-task language models show outstanding performance for various natu...

Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding (01/10/2023)
Current natural language understanding (NLU) models have been continuous...

Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (07/26/2023)
Vision-language pre-training (VLP) models have shown vulnerability to ad...
