CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

04/12/2023 · by Yi Li, et al.

Contrastive Language-Image Pre-training (CLIP) is a powerful multimodal vision-language model that has demonstrated significant benefits for downstream tasks, including many zero-shot learning and text-guided vision tasks. However, we notice some severe problems regarding the model's explainability, which undermine its credibility and impede related tasks. Specifically, we find that CLIP prefers background regions over foregrounds according to the predicted similarity map, which contradicts human understanding. Besides, there are obvious noisy activations in the visualization results at irrelevant positions. To address these two issues, we conduct in-depth analyses and reveal the reasons with new findings and evidence. Based on these insights, we propose CLIP Surgery, a method that enables surgery-like modifications to the inference architecture and features, for better explainability and enhancement in multiple open-vocabulary tasks. The proposed method significantly improves the explainability of CLIP for both convolutional networks and vision transformers, surpassing existing methods by large margins. Besides, our approach also demonstrates remarkable improvements in open-vocabulary segmentation and multi-label recognition tasks. For example, the mAP improvement on NUS-Wide multi-label recognition is 4.41% without additional training, and our CLIP Surgery surpasses the state-of-the-art method by 8.74%. Furthermore, our method benefits other tasks, including multimodal visualization and interactive segmentation with the Segment Anything Model (SAM). The code is available at https://github.com/xmed-lab/CLIP_Surgery
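To make the similarity maps mentioned in the abstract concrete, below is a minimal sketch of how a text-to-patch cosine-similarity map can be computed from CLIP-style per-patch image features and a text embedding. This is not the authors' actual CLIP Surgery implementation; the function name, the dummy feature tensors, and the 14x14 patch grid are assumptions for illustration only.

import torch
import torch.nn.functional as F

def similarity_map(patch_feats: torch.Tensor,   # (N, D) per-patch image features
                   text_feat: torch.Tensor,     # (D,) text embedding
                   grid: tuple) -> torch.Tensor:
    """Cosine similarity between each image patch and the text query,
    reshaped into a 2-D map suitable for overlaying on the input image."""
    patch_feats = F.normalize(patch_feats, dim=-1)  # unit-normalize patches
    text_feat = F.normalize(text_feat, dim=-1)      # unit-normalize text
    sim = patch_feats @ text_feat                   # (N,) cosine similarities
    return sim.reshape(grid)                        # (H, W) similarity map

# Toy usage with random tensors standing in for real CLIP outputs
# (a ViT-B/16 on a 224x224 image yields a 14x14 patch grid).
feats = torch.randn(14 * 14, 512)
txt = torch.randn(512)
heatmap = similarity_map(feats, txt, (14, 14))
print(heatmap.shape)  # torch.Size([14, 14])

In a real pipeline, the per-patch features would come from the vision encoder's token outputs projected into the shared embedding space; the paper's contribution lies in modifying that inference path so the resulting map highlights foregrounds rather than backgrounds and suppresses the noisy activations described above.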



Related research

09/28/2021 · VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
We present VideoCLIP, a contrastive approach to pre-train a unified mode...

07/05/2022 · Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer
Real-world recognition systems often encounter plenty of unseen labels...

07/03/2022 · Can Language Understand Depth?
Besides image classification, Contrastive Language-Image Pre-training (C...

09/15/2022 · Exploring Visual Interpretability for Contrastive Language-Image Pre-training
Contrastive Language-Image pre-training (CLIP) learns rich representatio...

07/14/2023 · Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-Training
The foundation models based on pre-training technology have significantl...

06/08/2023 · R-MAE: Regions Meet Masked Autoencoders
Vision-specific concepts such as "region" have played a key role in exte...

01/29/2023 · Composer's Assistant: Interactive Transformers for Multi-Track MIDI Infilling
We consider the task of multi-track MIDI infilling when arbitrary (track...
