PuMer: Pruning and Merging Tokens for Efficient Vision Language Models

05/27/2023
by   Qingqing Cao, et al.

Large-scale vision language (VL) models use Transformers to perform cross-modal interactions between the input text and image. These cross-modal interactions are computationally expensive and memory-intensive due to the quadratic complexity of processing the input image and text. We present PuMer: a token reduction framework that uses text-informed Pruning and modality-aware Merging strategies to progressively reduce the tokens of the input image and text, improving model inference speed and reducing memory footprint. PuMer learns to keep salient image tokens related to the input text and merges similar textual and visual tokens by adding lightweight token reducer modules at several cross-modal layers in the VL model. Training PuMer is mostly the same as finetuning the original VL model but faster. Our evaluation of two vision language models on four downstream VL tasks shows PuMer increases inference throughput by up to 2x and reduces memory footprint by over 50% while incurring less than a 1% accuracy drop.
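The abstract describes two operations: text-informed pruning (keep image tokens most relevant to the text) and modality-aware merging (combine similar tokens). In PuMer these are learned, lightweight modules inside the model; the sketch below is only a toy, training-free illustration of the two ideas with hypothetical function names and a simple similarity-based heuristic, not the paper's actual reducer modules.

```python
import numpy as np

def prune_tokens(image_tokens, text_cls, keep_ratio=0.5):
    """Text-informed pruning (toy version): score each image token by its
    dot-product similarity to a text summary embedding and keep the top-k."""
    scores = image_tokens @ text_cls                     # (num_tokens,)
    k = max(1, int(len(image_tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])              # preserve token order
    return image_tokens[keep]

def merge_tokens(tokens, merge_ratio=0.25):
    """Token merging (toy version): repeatedly average the most similar
    adjacent pair of tokens, shrinking the sequence by merge_ratio."""
    n_merge = int(len(tokens) * merge_ratio)
    tokens = tokens.copy()
    for _ in range(n_merge):
        norm = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sims = (norm[:-1] * norm[1:]).sum(axis=1)        # adjacent cosine sims
        i = int(np.argmax(sims))
        merged = (tokens[i] + tokens[i + 1]) / 2.0       # average the pair
        tokens = np.vstack([tokens[:i], merged[None, :], tokens[i + 2:]])
    return tokens

# Example: 16 image tokens of dimension 8, pruned to 8, then merged to 6.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(16, 8))
text_cls = rng.normal(size=8)
pruned = prune_tokens(image_tokens, text_cls, keep_ratio=0.5)
reduced = merge_tokens(pruned, merge_ratio=0.25)
```

Applying such reducers at several layers progressively shortens the token sequence, which is where the quadratic attention cost savings come from.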


research
11/09/2021

FILIP: Fine-grained Interactive Language-Image Pre-Training

Unsupervised large-scale vision-language pre-training has shown promisin...
research
05/27/2023

CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

Vision-language models have achieved tremendous progress far beyond what...
research
05/24/2023

SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models

Despite achieving remarkable performance on various vision-language task...
research
09/05/2021

Data Efficient Masked Language Modeling for Vision and Language

Masked language modeling (MLM) is one of the key sub-tasks in vision-lan...
research
10/14/2022

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Pre-trained vision-language models (VLMs) have achieved impressive resul...
research
05/02/2022

CenterCLIP: Token Clustering for Efficient Text-Video Retrieval

Recently, large-scale pre-training methods like CLIP have made great pro...
research
10/17/2022

Token Merging: Your ViT But Faster

We introduce Token Merging (ToMe), a simple method to increase the throu...
