Variational prompt tuning improves generalization of vision-language models

Prompt tuning provides an efficient mechanism to adapt large vision-language models to downstream tasks by treating part of the input language prompts as learnable parameters while freezing the rest of the model. Existing works on prompt tuning are, however, prone to damaging the generalization capabilities of the foundation models, because the learned prompts lack the capacity to cover certain concepts within the language model. To avoid this limitation, we propose a probabilistic modeling of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. This results in a more complete and richer transfer of the information captured by the language model, providing better generalization capabilities for downstream tasks. The resulting algorithm relies on a simple yet powerful variational framework that can be directly integrated with other developments. We show that our approach integrates seamlessly into both standard and conditional prompt learning frameworks, improving performance in both cases considerably, especially with regard to preserving the generalization capability of the original model. Our method sets the current state of the art for prompt learning, surpassing CoCoOp by 1.6% and surpassing the original CLIP model in terms of generalization to new classes. Implementation code will be released.
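The core idea above, modeling the learnable prompt as a distribution rather than a point estimate, and drawing prompts by stochastic sampling, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions `n_ctx` and `d`, the diagonal-Gaussian parameterization, and the standard-normal prior are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a prompt of n_ctx context tokens, each a
# d-dimensional embedding, modeled as a diagonal Gaussian q(p) = N(mu, sigma^2).
n_ctx, d = 4, 8

# Variational parameters (learned by backprop in practice; fixed here).
mu = np.zeros((n_ctx, d))            # mean of the prompt distribution
log_var = np.full((n_ctx, d), -2.0)  # log-variance (small initial spread)

def sample_prompt(mu, log_var, rng):
    """Draw one prompt via the reparameterization trick: p = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(q || N(0, I)) for a diagonal Gaussian, summed over all entries.

    This is the regularizer a variational framework would add to the
    task loss so the prompt distribution stays close to its prior.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

prompt = sample_prompt(mu, log_var, rng)  # one stochastic prompt realization
kl = kl_to_standard_normal(mu, log_var)
```

Each training step would sample a fresh `prompt`, prepend it to the class-name tokens fed to the frozen text encoder, and minimize the task loss plus the KL term; at test time one can average predictions over several sampled prompts to cover more of the concept's support.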


Consistency-guided Prompt Learning for Vision-Language Models

We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tun...

Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models

Prompt learning has become one of the most efficient paradigms for adapt...

Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment

Large-scale vision-language models (VLMs), e.g., CLIP, learn broad visua...

Prompting Is Programming: A Query Language For Large Language Models

Large language models have demonstrated outstanding performance on a wid...

Cross-Modal Concept Learning and Inference for Vision-Language Models

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, est...

Patch-Token Aligned Bayesian Prompt Learning for Vision-Language Models

For downstream applications of vision-language pre-trained models, there...

On Conditional and Compositional Language Model Differentiable Prompting

Prompts have been shown to be an effective method to adapt a frozen Pret...
