Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

04/07/2022
by Tristan Thrush, et al.

We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain exactly the same words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them does much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
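To make the matching task concrete, here is a minimal sketch of how a Winoground example can be scored, assuming a model exposes a pairwise caption-image similarity. An example consists of captions c0, c1 and images i0, i1, where (c0, i0) and (c1, i1) are the correct pairings; the text score asks whether the model prefers the correct caption for each image, the image score asks whether it prefers the correct image for each caption, and the group score requires both. The similarity values below are made up for illustration.

```python
def text_score(s):
    """s[(c, i)] = model similarity of caption c with image i, c/i in {0, 1}.
    True iff the correct caption wins for both images."""
    return s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]

def image_score(s):
    """True iff the correct image wins for both captions."""
    return s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]

def group_score(s):
    """True iff the model gets both the text and image pairings right."""
    return text_score(s) and image_score(s)

# Hypothetical similarities from a model that matches c0->i0 and c1->i1.
sims = {(0, 0): 0.9, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.8}
print(text_score(sims), image_score(sims), group_score(sims))  # True True True
```

Averaging these boolean scores over the dataset yields accuracies; a model scoring at random sits well below a perfect group score, which is the "chance" baseline the models in the paper fail to clearly beat.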


