Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

11/01/2022
by   Anuj Diwan, et al.
0

Recent visuolinguistic pre-trained models show promising progress on various end tasks such as image retrieval and video captioning. Yet, they fail miserably on the recently proposed Winoground dataset, which challenges models to match paired images and English captions, with items constructed to overlap lexically but differ in meaning (e.g., "there is a mug in some grass" vs. "there is some grass in a mug"). By annotating the dataset using new fine-grained tags, we show that solving the Winoground task requires not just compositional language understanding, but a host of other abilities like commonsense reasoning or locating small, out-of-focus objects in low-resolution images. In this paper, we identify the dataset's main challenges through a suite of experiments on related tasks (probing task, image retrieval task), data augmentation, and manual inspection of the dataset. Our analysis suggests that a main challenge in visuolinguistic models may lie in fusing visual and textual representations, rather than in compositional language understanding. We release our annotation and code at https://github.com/ajd12342/why-winoground-hard .

READ FULL TEXT
research
01/28/2022

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Vision-Language Pre-training (VLP) has advanced the performance for many...
research
08/22/2023

Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

Given a query composed of a reference image and a relative caption, the ...
research
04/18/2016

Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval

Deep convolutional neural network models pre-trained for the ImageNet cl...
research
09/27/2020

A Brief Survey and Comparative Study of Recent Development of Pronoun Coreference Resolution

Pronoun Coreference Resolution (PCR) is the task of resolving pronominal...
research
05/31/2023

Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

Vision and Language (VL) models offer an effective method for aligning r...
research
05/27/2023

FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing

Textual scene graph parsing has become increasingly important in various...
research
03/29/2022

Image Retrieval from Contextual Descriptions

The ability to integrate context, including perceptual and temporal cues...

Please sign up or login with your details

Forgot password? Click here to reset