ComCLIP: Training-Free Compositional Image and Text Matching

11/25/2022
by Kenan Jiang, et al.

Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for image-text matching because of its holistic use of natural language supervision that covers large-scale, open-world visual concepts. However, it remains challenging to adapt CLIP to compositional image and text matching, a harder task that requires the model to understand compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, in this paper we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause matching failures. We therefore propose a novel training-free compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subject, object, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over the compositional text embedding and the sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP model and dynamically assess the contribution of each entity when performing image and text matching. Experiments on compositional image-text matching (SVO and ComVG) and general image-text retrieval (Flickr8K) demonstrate the effectiveness of our plug-and-play method, which boosts CLIP's zero-shot inference ability without any further training or fine-tuning.
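To make the mechanism concrete, here is a minimal sketch of that kind of entity-aware scoring with a frozen CLIP, written against the Hugging Face transformers CLIP API. It is an illustration of the general idea, not the authors' exact procedure: the subject/object/action sub-images are assumed to be extracted upstream (e.g., by a detector or dense-captioning model), the file names are hypothetical, and the softmax entity weighting is a simple stand-in for the paper's evolving-matching formulation.

```python
# Illustrative sketch: compositional image-text scoring with a frozen CLIP.
# Assumptions (not from the paper's code): sub-images are pre-extracted crops,
# and per-entity contributions are combined via a simple softmax weighting.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_texts(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def compositional_score(full_image, sub_images, sentence, sub_texts, temp=0.1):
    """Score one image-sentence pair; each (sub-image, sub-text) entity pair
    votes on how much it contributes to the global match."""
    img_feats = embed_images([full_image] + sub_images)      # (1+K, d)
    txt_feats = embed_texts([sentence] + sub_texts)          # (1+K, d)

    global_score = img_feats[0] @ txt_feats[0]               # whole image vs. sentence
    entity_scores = (img_feats[1:] * txt_feats[1:]).sum(-1)  # per-entity similarities

    # Entities whose sub-image matches their text span strongly get more
    # weight, dampening spurious whole-image correlations (illustrative only).
    weights = torch.softmax(entity_scores / temp, dim=0)
    return global_score + (weights * entity_scores).sum()

# Usage with hypothetical crops for "a dog" (subject) and "a ball" (object):
image = Image.open("scene.jpg")
subs = [Image.open("dog_crop.jpg"), Image.open("ball_crop.jpg")]
score = compositional_score(image, subs, "a dog chasing a ball", ["a dog", "a ball"])
```

Because CLIP's weights stay frozen throughout, a plug-in scoring rule of this form preserves the training-free property the abstract emphasizes.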


Related research

- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic (11/29/2021)
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding (07/03/2023)
- CLIP also Understands Text: Prompting CLIP for Phrase Understanding (10/11/2022)
- Rethinking the Openness of CLIP (06/04/2022)
- A causal view of compositional zero-shot recognition (06/25/2020)
- Do Vision-Language Pretrained Models Learn Primitive Concepts? (03/31/2022)
- Augmenting CLIP with Improved Visio-Linguistic Reasoning (07/18/2023)
