Instance-Wise Adaptive Tuning and Caching for Vision-Language Models

07/29/2023
by Chunjin Yang, et al.

Large-scale vision-language models (LVLMs) pretrained on massive image-text pairs have achieved remarkable success in learning visual representations. However, existing paradigms for transferring LVLMs to downstream tasks face two primary challenges. First, the text features remain fixed once computed and cannot be adjusted according to the image features, which limits the model's adaptability. Second, the model's output depends solely on the similarity between the text and image features, leading to excessive reliance on the LVLM. To address these two challenges, we introduce a novel two-branch model named Instance-Wise Adaptive Tuning and Caching (ATC). Specifically, one branch implements our proposed ConditionNet, which guides image features to form an adaptive textual cache that adjusts based on those features, achieving instance-wise inference and improving the model's adaptability. The other branch introduces image-to-image similarities and incorporates a learnable visual cache, designed to decouple new and previous knowledge, allowing the model to acquire new knowledge while preserving prior knowledge. The model's output is jointly determined by the two branches, thus overcoming the limitations of existing methods that rely solely on LVLMs. Additionally, our method requires limited computing resources to tune its parameters, yet outperforms existing methods on 11 benchmark datasets.
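The two-branch idea in the abstract can be illustrated with a rough NumPy sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the function name `two_branch_logits`, the single linear layer plus tanh standing in for ConditionNet (`cond_w`), the key-value form of the visual cache (borrowed from the style of cache used in Tip-Adapter), and the hyperparameters `alpha`, `beta`, and `scale`. It is not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def two_branch_logits(image_feat, text_feats, cond_w, cache_keys, cache_values,
                      alpha=1.0, beta=5.5, scale=100.0):
    """Combine an image-conditioned text branch with a visual cache branch.

    image_feat:   (d,)   image embedding
    text_feats:   (C, d) one text embedding per class
    cond_w:       (d, d) weights of a hypothetical stand-in for ConditionNet
    cache_keys:   (N, d) visual cache keys (learnable in a real setup)
    cache_values: (N, C) one-hot labels of the cached samples
    """
    f = l2_normalize(image_feat)

    # Branch 1 (assumed form): shift every class text feature by an offset
    # computed from the image, so the text side adapts per instance.
    offset = np.tanh(cond_w @ f)                       # (d,)
    adapted_text = l2_normalize(text_feats + offset)   # (C, d)
    text_logits = scale * (adapted_text @ f)           # (C,)

    # Branch 2 (assumed form): image-to-image similarity against the cache,
    # sharpened with an exponential and aggregated per class.
    affinity = l2_normalize(cache_keys) @ f            # (N,)
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_values  # (C,)

    # The output is determined jointly by both branches.
    return text_logits + alpha * cache_logits

# Toy usage with random features: 3 classes, 6 cached samples, 8-dim embeddings.
rng = np.random.default_rng(0)
d, C, N = 8, 3, 6
logits = two_branch_logits(
    image_feat=rng.normal(size=d),
    text_feats=rng.normal(size=(C, d)),
    cond_w=0.1 * rng.normal(size=(d, d)),
    cache_keys=rng.normal(size=(N, d)),
    cache_values=np.eye(C)[rng.integers(0, C, size=N)],
)
print(logits.shape)  # (3,)
```

In this sketch, setting `alpha=0` and `cond_w` to zeros reduces the output to plain scaled cosine similarity between the text and image features, which is the baseline behaviour the abstract describes as relying solely on the LVLM.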
