VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

12/14/2021
by   Letitia Parcalabescu, et al.

We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed to test general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark for measuring future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.
