Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

03/13/2023
by   Nitzan Bitton Guetta, et al.
0

Weird, unusual, and uncanny images pique the curiosity of observers because they challenge commonsense. For example, an image released during the 2022 world cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly-available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models and code are available at the project website: whoops-benchmark.github.io

READ FULL TEXT

page 1

page 2

page 5

page 7

page 8

page 11

page 13

page 14

research
06/03/2022

A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge

The Visual Question Answering (VQA) task aspires to provide a meaningful...
research
12/08/2022

VASR: Visual Analogies of Situation Recognition

A core process in human cognition is analogical mapping: the ability to ...
research
03/19/2016

Generating Natural Questions About an Image

There has been an explosion of work in the vision & language community d...
research
10/16/2022

COFAR: Commonsense and Factual Reasoning in Image Search

One characteristic that makes humans superior to modern artificially int...
research
07/17/2023

The FathomNet2023 Competition Dataset

Ocean scientists have been collecting visual data to study marine organi...
research
11/17/2016

Answering Image Riddles using Vision and Reasoning through Probabilistic Soft Logic

In this work, we explore a genre of puzzles ("image riddles") which invo...
research
07/25/2022

WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

While vision-and-language models perform well on tasks such as visual qu...

Please sign up or login with your details

Forgot password? Click here to reset