Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

11/06/2021
by Renrui Zhang et al.

Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations by contrastive learning on large-scale image-text pairs, and it shows impressive performance on zero-shot knowledge transfer to downstream tasks. To further enhance CLIP's few-shot capability, CLIP-Adapter proposed fine-tuning a lightweight residual feature adapter, which significantly improves few-shot classification performance but requires extra training and computational resources. In this paper, we propose Training-Free CLIP-Adapter (Tip-Adapter), which not only inherits CLIP's training-free advantage but also performs comparably to or even better than CLIP-Adapter. Tip-Adapter requires no backpropagation to train the adapter; instead, it creates the adapter weights from a key-value cache model constructed from the few-shot training set. In this non-parametric manner, Tip-Adapter acquires well-performing adapter weights without any training, which is both efficient and effective. Moreover, Tip-Adapter's performance can be boosted further by fine-tuning the properly initialized adapter for only a few epochs, with super-fast convergence. We conduct extensive few-shot classification experiments on ImageNet and 10 other datasets to demonstrate the superiority of the proposed Tip-Adapter. The code will be released at <https://github.com/gaopengcuhk/Tip-Adapter>.
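The key-value cache model described above reduces to a few lines of linear algebra: the keys are the L2-normalized CLIP features of the few-shot training images, the values are their one-hot labels, and test-time predictions blend the cache's affinity-weighted labels with zero-shot CLIP logits. Below is a minimal NumPy sketch of training-free inference under that reading; the function and argument names are our own, and the alpha/beta defaults are illustrative (the paper tunes the residual ratio and sharpness per dataset), so treat this as a sketch rather than the official implementation.

```python
import numpy as np

def tip_adapter_logits(test_feats, cache_keys, cache_values, clip_weights,
                       alpha=1.0, beta=5.5):
    """Training-free Tip-Adapter inference (illustrative sketch).

    test_feats:   (N, D) L2-normalized CLIP image features of test images
    cache_keys:   (K*C, D) L2-normalized CLIP features of the few-shot
                  training images (K shots per class, C classes)
    cache_values: (K*C, C) one-hot labels of the cached training images
    clip_weights: (C, D) L2-normalized CLIP text embeddings of class prompts
    alpha, beta:  residual ratio and sharpness hyperparameters (assumed values)
    """
    # Zero-shot CLIP branch: scaled cosine similarity between image
    # features and the class-prompt text embeddings.
    clip_logits = 100.0 * test_feats @ clip_weights.T     # (N, C)

    # Cache branch: cosine affinity between test features and cached keys,
    # sharpened by beta, then used to weight the cached one-hot labels.
    affinity = test_feats @ cache_keys.T                  # (N, K*C)
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_values

    # Blend the two branches with the residual ratio alpha.
    return clip_logits + alpha * cache_logits
```

The only "training" here is a single CLIP forward pass to encode the K-shot images into `cache_keys`; making those keys a learnable matrix with this same initialization and fine-tuning them for a few epochs corresponds to the boosted variant mentioned in the abstract.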


