Text encoders bottleneck compositionality in contrastive vision-language models

05/24/2023
by Amita Kamath, et al.

Performant vision-language (VL) models like CLIP represent captions with a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (ranging from a single object, to an object with a property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from the single-vector text representations produced by several VL models. This approach requires no images, which lets us test on a broader range of scenes than prior work. We find that: (1) CLIP's text encoder falls short on object relationships, attribute-object association, counting, and negations; (2) some text encoders work significantly better than others; and (3) text-only recovery performance predicts multimodal matching performance on ControlledImCaps, a new evaluation benchmark we collect and release consisting of fine-grained compositional images and captions. Specifically, our results suggest that text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive VL models. We release our data and code.
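The core measurement described above, training a text-only probe to reconstruct a caption from a frozen single-vector text embedding, can be sketched in a few dozen lines. The sketch below uses HuggingFace's openai/clip-vit-base-patch32 checkpoint as the frozen VL text encoder and a small GRU decoder as the probe; the probe architecture, hyperparameters, and toy captions are illustrative assumptions, not the authors' exact setup or the CompPrompts data.

```python
# Minimal sketch of a text-only recovery probe (illustrative, not the paper's exact setup).
# Idea: freeze the VL model's text encoder, embed each caption into a single vector,
# then train a small decoder to regenerate the caption tokens from that vector.
# High reconstruction accuracy implies the single vector preserves the caption's content.

import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()

class RecoveryProbe(nn.Module):
    """GRU decoder that reconstructs a caption from one frozen CLIP text embedding."""
    def __init__(self, embed_dim, vocab_size, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(embed_dim, hidden)       # map CLIP vector -> initial hidden state
        self.tok_emb = nn.Embedding(vocab_size, hidden)  # decoder input token embeddings
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, clip_vec, target_ids):
        h0 = torch.tanh(self.init_h(clip_vec)).unsqueeze(0)  # (1, B, H)
        x = self.tok_emb(target_ids[:, :-1])                 # teacher forcing: shift right
        y, _ = self.gru(x, h0)
        return self.out(y)                                   # (B, L-1, vocab) logits

probe = RecoveryProbe(clip.config.projection_dim, tokenizer.vocab_size).to(device)
opt = torch.optim.AdamW(probe.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

# Toy stand-ins for compositional captions (the real probe would train on CompPrompts).
captions = ["a red cube to the left of a blue sphere", "two dogs are not running"]
batch = tokenizer(captions, padding=True, return_tensors="pt").to(device)

with torch.no_grad():                              # the VL text encoder stays frozen
    clip_vec = clip.get_text_features(**batch)     # (B, projection_dim): the single-vector bottleneck

# One training step of the probe: predict each next token given the bottleneck vector.
opt.zero_grad()
logits = probe(clip_vec, batch.input_ids)
loss = loss_fn(logits.reshape(-1, logits.size(-1)), batch.input_ids[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

Aggregating a trained probe's reconstruction accuracy over caption types (relations, attributes, counts, negations) is the kind of text-only signal the abstract describes for comparing text encoders, separate from any multimodal matching evaluation.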


Related research

04/07/2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
We present a novel task and dataset for evaluating the ability of vision...

10/06/2022
Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models
We explore the idea of compressing the prompts used to condition language...

05/05/2023
COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?
Compositional reasoning is a hallmark of human visual intelligence; yet ...

05/31/2023
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Vision and Language (VL) models offer an effective method for aligning r...

03/17/2023
GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
Text-to-image (T2I) models based on diffusion processes have achieved re...

10/04/2022
When and why vision-language models behave like bags-of-words, and what to do about it?
Despite the success of large vision and language models (VLMs) in many d...

09/30/2022
Linearly Mapping from Image to Text Space
The extent to which text-only language models (LMs) learn to represent t...
