WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

07/25/2022
by Yonatan Bitton, et al.

While vision-and-language models perform well on tasks such as visual question answering, they struggle with basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game for collecting vision-and-language associations (e.g., werewolves to a full moon), used as a dynamic benchmark to evaluate state-of-the-art models. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player has to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90%) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%. Our analysis, as well as the feedback we collect from players, indicates that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code, and the interactive game, aiming to allow future data collection that can be used to develop models with better association abilities.
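
To make the scoring concrete, here is a minimal sketch of how one guess in such a game could be scored. It assumes the score is a Jaccard index between the images a solver selects and the images the spymaster intended, which the abstract as shown here does not state. The function name, the cue, and the file names are hypothetical and chosen only for illustration.

```python
def jaccard_score(selected: set[str], gold: set[str]) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    union = selected | gold
    if not union:
        # Both sets empty: treat as a perfect match (an assumption of this sketch).
        return 1.0
    return len(selected & gold) / len(union)


# Hypothetical instance: the cue is "werewolf" and the spymaster
# associated it with two of the candidate images.
gold = {"full_moon.jpg", "wolf.jpg"}

# A solver (human or model) picks two images, one of them wrong.
guess = {"full_moon.jpg", "cat.jpg"}

score = jaccard_score(guess, gold)  # one shared image out of three distinct ones
print(f"score = {score:.2f}")
```

Under this assumption, a solver is penalized both for missing an intended image and for picking an unintended one, so the score only reaches 1.0 on an exact match.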

research
10/19/2020

Deriving Commonsense Inference Tasks from Interactive Fictions

Commonsense reasoning simulates the human ability to make presumptions a...
research
12/01/2021

Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text

Communicating with humans is challenging for AIs because it requires a s...
research
07/05/2023

Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models

Are current language models capable of deception and lie detection? We s...
research
03/16/2023

ESCAPE: Countering Systematic Errors from Machine's Blind Spots via Interactive Visual Analysis

Classification models learn to generalize the associations between data ...
research
03/13/2023

Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Weird, unusual, and uncanny images pique the curiosity of observers beca...
research
02/05/2020

Stimulating Creativity with FunLines: A Case Study of Humor Generation in Headlines

Building datasets of creative text, such as humor, is quite challenging....
research
08/15/2023

CALYPSO: LLMs as Dungeon Masters' Assistants

The role of a Dungeon Master, or DM, in the game Dungeons & Dragons is...
