Augmenting CLIP with Improved Visio-Linguistic Reasoning

07/18/2023
by   Samyadeep Basu, et al.

Image-text contrastive models such as CLIP are useful for a variety of downstream applications including zero-shot classification, image-text retrieval, and transfer learning. However, these contrastively trained vision-language models often fail on compositional visio-linguistic tasks such as Winoground, performing no better than random chance. In our paper, we address this issue and propose a sample-efficient, lightweight method called SDS-CLIP to improve the compositional visio-linguistic reasoning capabilities of CLIP. The core idea of our method is to use differentiable image parameterizations to fine-tune CLIP with a distillation objective from large text-to-image generative models such as Stable-Diffusion, which are relatively good at visio-linguistic reasoning tasks. On the challenging Winoground compositional reasoning benchmark, our method improves the absolute visio-linguistic performance of different CLIP models by up to 7%, and on the ARO dataset it improves visio-linguistic performance by up to 3%. As a byproduct of inducing visio-linguistic reasoning into CLIP, we also find that zero-shot performance improves marginally on a variety of downstream datasets. Our method reinforces that carefully designed distillation objectives from generative models can be leveraged to extend existing contrastive image-text models with improved visio-linguistic reasoning capabilities.
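To make the core idea concrete, the sketch below illustrates a score-distillation-style gradient through a differentiable image parameterization, in the spirit the abstract describes. This is a toy numpy illustration, not the paper's actual implementation: the "CLIP embedding", the learnable projection `W`, the frozen `denoiser`, and the weighting `w(t)` are all stand-ins chosen for readability. The key mechanics it shows are the SDS recipe: noise the parameterized latent, ask a frozen noise predictor to recover the noise, and backpropagate the residual through the parameterization while skipping the denoiser's own Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (illustrative shapes and names, not the paper's code):
# a "CLIP image embedding", a learnable linear map into the diffusion
# latent space, and a frozen toy "denoiser" that predicts added noise.
d_clip, d_latent = 8, 16
clip_embed = rng.standard_normal(d_clip)
W = rng.standard_normal((d_latent, d_clip)) * 0.1            # learnable (theta)
denoiser_weights = rng.standard_normal((d_latent, d_latent)) * 0.1  # frozen

def denoiser(x_t, t):
    """Frozen toy noise predictor eps_hat(x_t, t)."""
    return np.tanh(denoiser_weights @ x_t) * (1.0 - t)

def sds_grad(W, clip_embed, t, eps):
    """Score-distillation-style gradient w.r.t. the projection W.

    x = W @ clip_embed is the differentiable parameterization; the
    gradient is w(t) * (eps_hat - eps) pushed back through d x / d W,
    deliberately ignoring the denoiser's Jacobian as SDS does.
    """
    x = W @ clip_embed                                  # parameterized latent
    alpha = 1.0 - t                                     # toy noise schedule
    x_t = np.sqrt(alpha) * x + np.sqrt(1.0 - alpha) * eps
    eps_hat = denoiser(x_t, t)
    w_t = 1.0                                           # weighting w(t), constant here
    return w_t * np.sqrt(alpha) * np.outer(eps_hat - eps, clip_embed)

# One distillation step: nudge the projection along the SDS gradient.
t = 0.5
eps = rng.standard_normal(d_latent)
g = sds_grad(W, clip_embed, t, eps)
W_new = W - 0.01 * g
print(g.shape)
```

In the actual method this distillation term is added alongside CLIP's contrastive objective during fine-tuning, so the frozen diffusion model acts as a teacher rather than the sole training signal.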


