CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention

09/28/2022
by Ziyu Guo, et al.

Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with great transferability, achieving promising accuracy for zero-shot classification. To further improve its downstream performance, existing works add learnable modules on top of CLIP and fine-tune them on few-shot training sets. However, the resulting extra training cost and data requirement severely hinder the efficiency of model deployment and knowledge transfer. In this paper, we introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free attention module. Specifically, we guide visual and textual representations to interact with each other and explore cross-modal informative features via attention. Since the pre-training has largely reduced the embedding distance between the two modalities, we discard all learnable parameters in the attention and update the multi-modal features bidirectionally, making the whole process parameter-free and training-free. In this way, the images are blended with textual-aware signals and the text representations become visual-guided, enabling better adaptive zero-shot alignment. We evaluate CALIP on 14 benchmark datasets covering both 2D image and 3D point cloud classification, showing consistent zero-shot performance improvement over CLIP. Based on that, we further insert a small number of linear layers into CALIP's attention module and verify its robustness under few-shot settings, where it also achieves leading performance compared to existing methods. These extensive experiments demonstrate the superiority of our approach for efficient enhancement of CLIP.
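To make the parameter-free attention concrete, below is a minimal PyTorch sketch under stated assumptions: the visual features fv (spatial tokens from CLIP's image encoder) and textual features ft (one embedding per class prompt) are pre-extracted and L2-normalized, and the function name calip_attention, the temperature alpha, and the blending weights beta1/beta2 are illustrative choices rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def calip_attention(fv, ft, alpha=1.0, beta1=1.0, beta2=1.0):
    """Illustrative parameter-free cross-modal attention sketch.

    fv: (Nv, D) L2-normalized spatial visual features from CLIP.
    ft: (K, D)  L2-normalized textual features, one per class prompt.
    alpha: softmax temperature; beta1/beta2: residual blending weights
           (hypothetical names; in practice such scalars would be tuned).
    """
    # Cross-modal affinity with no learnable projections: a plain dot
    # product is meaningful because CLIP pre-training aligns both spaces.
    attn = fv @ ft.t()                                  # (Nv, K)

    # Bidirectional, parameter-free feature updates.
    fv_new = F.softmax(attn / alpha, dim=-1) @ ft       # textual-aware visual
    ft_new = F.softmax(attn.t() / alpha, dim=-1) @ fv   # visual-guided textual

    # Residual blending preserves the original zero-shot signal.
    fv_out = F.normalize(fv + beta1 * fv_new, dim=-1)
    ft_out = F.normalize(ft + beta2 * ft_new, dim=-1)

    # Zero-shot logits: pool spatial tokens, compare against class texts.
    return fv_out.mean(dim=0) @ ft_out.t()              # (K,)

# Toy usage with random, normalized features (stand-ins for CLIP outputs).
fv = F.normalize(torch.randn(49, 512), dim=-1)   # e.g. 7x7 spatial tokens
ft = F.normalize(torch.randn(10, 512), dim=-1)   # 10 class prompts
print(calip_attention(fv, ft).shape)             # torch.Size([10])
```

Because CLIP pre-training already aligns the two embedding spaces, the raw dot-product affinity is informative on its own; this is why no learnable projections are needed and the module can run entirely at inference time with zero training.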
