Unsupervised Prototype Adapter for Vision-Language Models

08/22/2023
by Yi Zhang, et al.

Recently, large-scale pre-trained vision-language models (e.g., CLIP and ALIGN) have demonstrated remarkable effectiveness in learning transferable visual representations. To leverage the knowledge encoded in these models for downstream tasks, several fine-tuning approaches have been developed, including prompt tuning and adapter-based methods, which adapt vision-language models with supervision. However, these methods rely on annotated samples, which are labor-intensive and time-consuming to acquire, limiting scalability. To address this issue, we design an unsupervised fine-tuning approach for vision-language models called the Unsupervised Prototype Adapter (UP-Adapter). Specifically, for an unannotated target dataset, we leverage CLIP's text-image alignment capability to automatically select the most confident samples for each class. From these selected samples, we generate class prototypes, which serve as the initialization for a learnable prototype model. After fine-tuning, the prototype model's prediction is combined with the original CLIP prediction via a residual connection to perform downstream recognition tasks. Extensive experiments on image recognition and domain generalization show that the proposed unsupervised method outperforms 8-shot CoOp, 8-shot Tip-Adapter, and the state-of-the-art UPL method by large margins.
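The pipeline the abstract describes (pseudo-labeling with zero-shot CLIP, prototype initialization, and residual blending with the frozen CLIP logits) can be sketched as follows. This is a minimal illustration assuming PyTorch and the OpenAI `clip` package; the helper names (`select_confident`, `train_up_adapter`, `predict`), the prompt template, the per-class sample count `k`, the blend weight `alpha`, and the training schedule are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the UP-Adapter idea, not the authors' code.
# `image_feats` is assumed to be L2-normalized output of model.encode_image.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


@torch.no_grad()
def clip_logits(image_feats, class_names):
    # Zero-shot CLIP scores: scaled cosine similarity to the class prompts.
    tokens = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)
    text_feats = model.encode_text(tokens).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return 100.0 * image_feats @ text_feats.t()


@torch.no_grad()
def select_confident(image_feats, class_names, k=8):
    # Pseudo-label the unannotated images with zero-shot CLIP, then keep
    # the k most confident images per class.
    probs = clip_logits(image_feats, class_names).softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = []
    for c in range(len(class_names)):
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if idx.numel() > 0:
            keep.append(idx[conf[idx].topk(min(k, idx.numel())).indices])
    keep = torch.cat(keep)
    return image_feats[keep], labels[keep]


def train_up_adapter(image_feats, class_names, k=8, alpha=1.0, steps=100, lr=1e-3):
    # Class prototypes (the mean feature of each class's confident samples)
    # initialize a learnable prototype matrix. This sketch assumes every
    # class receives at least one confident pseudo-label.
    feats, labels = select_confident(image_feats, class_names, k)
    init = torch.stack([feats[labels == c].mean(dim=0)
                        for c in range(len(class_names))])
    protos = torch.nn.Parameter(init.clone())
    base = clip_logits(feats, class_names)  # frozen CLIP branch (residual)
    opt = torch.optim.AdamW([protos], lr=lr)
    for _ in range(steps):
        p = protos / protos.norm(dim=-1, keepdim=True)
        logits = 100.0 * feats @ p.t() + alpha * base  # residual connection
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return protos.detach()


@torch.no_grad()
def predict(image_feats, protos, class_names, alpha=1.0):
    # Final prediction: fine-tuned prototype logits plus CLIP's own logits.
    p = protos / protos.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feats @ p.t() + alpha * clip_logits(image_feats, class_names)
    return logits.argmax(dim=-1)
```

Note the design point: only the prototype matrix is trained while CLIP stays frozen, so the adapter adds just a classes-by-dimension parameter matrix, and the residual term keeps CLIP's zero-shot behavior as a fallback when the learned prototypes are uninformative.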

Related research

07/28/2023 · Cross-Modal Concept Learning and Inference for Vision-Language Models
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, est...

09/03/2023 · BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ...

10/19/2022 · Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models
Prompt learning is a new learning paradigm which reformulates downstream...

05/20/2022 · Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning
Learning with little data is challenging but often inevitable in various...

08/24/2023 · Towards Realistic Unsupervised Fine-tuning with CLIP
The emergence of vision-language models (VLMs), such as CLIP, has spurre...

06/01/2023 · Consistency-guided Prompt Learning for Vision-Language Models
We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tun...

08/29/2022 · Prompt Tuning with Soft Context Sharing for Vision-Language Models
Vision-language models have recently shown great potential on many compu...
