Unsupervised Prompt Learning for Vision-Language Models

04/07/2022
by Tony Huang, et al.

Contrastive vision-language models like CLIP have shown great progress in zero-shot transfer learning. This paradigm trains on large-scale image-text pairs and aligns images and texts in a common embedding space. At inference, a proper text description, known as a prompt, needs to be carefully designed for zero-shot transfer. To avoid laborious prompt engineering and simultaneously improve transfer performance, recent works such as CoOp, CLIP-Adapter, and Tip-Adapter adapt vision-language models to downstream image recognition tasks by either optimizing continuous prompt representations or training an additional adapter network on top of the pre-trained model, using a small set of labeled data. Though promising improvements are achieved, relying on labeled images from target datasets may violate the intention of zero-shot transfer of pre-trained vision-language models. In this paper, we propose an unsupervised prompt learning (UPL) framework, which does not require any annotations of the target dataset, to improve the zero-shot transfer of CLIP-like vision-language models. Experimentally, for zero-shot transfer, our UPL outperforms the original CLIP with prompt engineering on ImageNet as well as 10 other datasets. An enhanced version of UPL is even on par with 8-shot CoOp and 8-shot Tip-Adapter on most datasets, while our method does not need any labeled images for training. Code and models are available at https://github.com/tonyhuang2022/UPL.
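To make the setup concrete, the sketch below shows the CLIP zero-shot pipeline the abstract refers to: hand-crafted prompts are embedded by the text encoder, unlabeled images are classified by cosine similarity, and confident predictions can be kept as pseudo-labels for subsequent prompt optimization, which is the kind of annotation-free adaptation UPL pursues. The class names, image paths, and 0.9 confidence threshold are illustrative assumptions, not the paper's settings; the actual UPL training code is in the linked repository.

```python
# Minimal sketch of CLIP zero-shot transfer with a hand-crafted prompt,
# plus a pseudo-labelling step in the spirit of UPL's unsupervised setting.
# Class names, image paths, and the confidence threshold are illustrative.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "car"]                      # hypothetical target classes
prompts = [f"a photo of a {c}." for c in class_names]    # manual prompt engineering
text_tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

def zero_shot_predict(image_path):
    """Classify one image by cosine similarity to the prompt embeddings."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    return class_names[pred.item()], conf.item()

# Keep only confident predictions on unlabeled target images as pseudo-labels;
# prompt learning can then be run on these without any ground-truth annotations.
unlabeled_images = ["img_001.jpg", "img_002.jpg"]        # hypothetical paths
pseudo_labeled = [(path, *zero_shot_predict(path)) for path in unlabeled_images]
confident = [(path, label) for path, label, conf in pseudo_labeled if conf > 0.9]
```

In UPL, the hand-written template above would then be replaced by learnable continuous prompt representations optimized on such pseudo-labeled images, as described in the abstract.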
