Exploring Visual Interpretability for Contrastive Language-Image Pre-training

09/15/2022
by   Yi Li, et al.

Contrastive Language-Image Pre-training (CLIP) learns rich representations from readily available natural-language supervision. It improves general performance on downstream vision tasks, including but not limited to zero-shot classification, long-tail recognition, segmentation, retrieval, captioning, and video understanding. However, to the best of our knowledge, the visual interpretability of CLIP has not yet been studied. To provide visual explanations of its predictions, we propose the Image-Text Similarity Map (ITSM). Using it, we surprisingly find that CLIP prefers background regions over the foregrounds, presenting erroneous visualizations that contradict human understanding. Experimentally, we find the devil is in the pooling part, where inappropriate pooling methods lead to a phenomenon we call semantic shift. To correct and boost the visualization results, we propose Masked Max Pooling, which uses attention maps from a self-supervised image encoder. Meanwhile, the interpretability task and the recognition task require different representations; to address this, we propose dual projections that cater to both requirements. We integrate the above methods as Interpretable Contrastive Language-Image Pre-training (ICLIP), and experiments suggest ICLIP greatly improves interpretability. For example, the nontrivial improvements are 32.85% and 49.10%, respectively, on the VOC 2012 dataset.
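The abstract's two key mechanisms, the Image-Text Similarity Map and Masked Max Pooling, can be illustrated with a short sketch. This is a minimal illustration, not the paper's implementation: it assumes a ViT-style CLIP image encoder that exposes per-patch token features, and the names (`image_text_similarity_map`, `masked_max_pool`, `patch_features`, `attn_map`, `keep_ratio`) and the top-k masking rule are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def image_text_similarity_map(patch_features, text_embedding, grid_hw):
    """Sketch of an Image-Text Similarity Map (ITSM).

    patch_features: (N, D) per-patch features from a ViT image encoder
        (CLS token excluded).
    text_embedding: (D,) text feature for one class prompt.
    grid_hw: (H, W) patch grid, with H * W == N.
    Returns an (H, W) map of per-patch cosine similarities.
    """
    patch_features = F.normalize(patch_features, dim=-1)
    text_embedding = F.normalize(text_embedding, dim=-1)
    sim = patch_features @ text_embedding   # cosine similarity per patch
    return sim.reshape(grid_hw)

def masked_max_pool(patch_features, attn_map, keep_ratio=0.5):
    """Sketch of Masked Max Pooling: restrict max pooling to patches that
    a self-supervised attention map marks as salient. The top-k selection
    rule here is an assumption; the paper's exact masking rule may differ.

    patch_features: (N, D) per-patch features.
    attn_map: (N,) attention scores from a self-supervised encoder.
    """
    k = max(1, int(keep_ratio * attn_map.numel()))
    keep = attn_map.topk(k).indices          # indices of foreground patches
    return patch_features[keep].amax(dim=0)  # max pool over kept patches only

# Example shapes: a 14x14 patch grid, e.g. from a ViT-B/16 on a 224x224 image.
patches = torch.randn(196, 512)
text = torch.randn(512)
itsm = image_text_similarity_map(patches, text, (14, 14))   # (14, 14)
pooled = masked_max_pool(patches, torch.rand(196))           # (512,)
```

The intuition follows the abstract: global average pooling blends background patches into the pooled feature (the "semantic shift"), whereas max pooling over an attention-selected foreground mask keeps the pooled representation aligned with the object the text describes.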
