Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models

09/15/2022
by Manli Shu, et al.

Pre-trained vision-language models (e.g., CLIP) have shown promising zero-shot generalization on many downstream tasks when given properly designed text prompts. Instead of relying on hand-engineered prompts, recent works learn prompts using training data from downstream tasks. While effective, training on domain-specific data reduces a model's ability to generalize to unseen new domains. In this work, we propose test-time prompt tuning (TPT), a method that learns adaptive prompts on the fly with a single test sample. For image classification, TPT optimizes the prompt by minimizing entropy with confidence selection, so that the model makes consistent predictions across different augmented views of each test sample. In evaluating generalization to natural distribution shifts, TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average, surpassing previous prompt tuning approaches that require additional task-specific training data. In evaluating cross-dataset generalization with unseen categories, TPT performs on par with state-of-the-art approaches that use additional training data. Project page: https://azshue.github.io/TPT.
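The abstract compresses the method into one sentence, so a concrete sketch may help. Below is a minimal PyTorch sketch of a single TPT update, assuming a CLIP-style model: only the prompt is optimized, on one test image, by minimizing the entropy of the prediction averaged over the most confident augmented views. The interface names (`encode_image`, `encode_text_with_prompt`), the `augment` callable, and the hyperparameters (64 views, top-10% confidence selection) are illustrative assumptions in the spirit of the paper, not the authors' released code.

```python
# Minimal sketch of one test-time prompt tuning (TPT) step, assuming PyTorch
# and a CLIP-style model. `encode_text_with_prompt` is a hypothetical hook
# that re-encodes the class names with the current learnable prompt; a real
# implementation would wire this into CLIP's text encoder.
import torch
import torch.nn.functional as F

def tpt_step(model, prompt_embeddings, image, augment,
             n_views=64, top_fraction=0.1, lr=5e-3):
    # `prompt_embeddings` must be a leaf tensor with requires_grad=True;
    # it is the only thing TPT updates at test time.
    optimizer = torch.optim.AdamW([prompt_embeddings], lr=lr)

    # A batch of random augmented views of the single test sample.
    views = torch.stack([augment(image) for _ in range(n_views)])

    with torch.no_grad():  # image features are constants w.r.t. the prompt
        image_feats = F.normalize(model.encode_image(views), dim=-1)

    # Text features depend on the learnable prompt, so gradients flow here.
    text_feats = F.normalize(
        model.encode_text_with_prompt(prompt_embeddings), dim=-1)

    logits = 100.0 * image_feats @ text_feats.t()  # (n_views, n_classes)
    probs = logits.softmax(dim=-1)

    # Confidence selection: keep only the lowest-entropy (most confident) views.
    view_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    k = max(1, int(top_fraction * n_views))
    keep = view_entropy.topk(k, largest=False).indices

    # Minimize the entropy of the prediction averaged over the selected views,
    # pushing the model toward consistent, confident predictions.
    avg_probs = probs[keep].mean(dim=0)
    loss = -(avg_probs * avg_probs.clamp_min(1e-12).log()).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After one (or a few) such steps, the final prediction is made on the original, unaugmented test image with the updated prompt; the prompt is reset before the next test sample, so no cross-sample state accumulates.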

Related research

- Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning (08/11/2023): Benefiting from prompt tuning, recent years have witnessed the promising...
- Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models (05/29/2023): Misalignment between the outputs of a vision-language (VL) model and tas...
- Test of Time: Instilling Video-Language Models with a Sense of Time (01/05/2023): Modeling and understanding time remains a challenge in contemporary vide...
- Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs (06/09/2022): To build an artificial neural network like the biological intelligence s...
- DeAR: Debiasing Vision-Language Models with Additive Residuals (03/18/2023): Large pre-trained vision-language models (VLMs) reduce the time for deve...
- Audio-free Prompt Tuning for Language-Audio Models (09/15/2023): Contrastive Language-Audio Pretraining (CLAP) is pre-trained to associat...
- Maximum Entropy Models for Fast Adaptation (06/30/2020): Deep Neural Networks have shown great promise on a variety of downstream...
