VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

05/30/2022
by Wangchunshu Zhou, et al.

Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance on a range of vision-language (VL) tasks. However, several challenges remain in measuring the community's progress toward general multi-modal intelligence. First, most downstream VL datasets are annotated on raw images that were already seen during pre-training, which may lead to an overestimation of current VLP models' generalization ability. Second, recent VLP work focuses mainly on absolute performance and overlooks the efficiency-performance trade-off, which is also an important indicator of progress. To this end, we introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task, multi-dimension benchmark for evaluating both the generalization capabilities and the efficiency-performance trade-off (“Pareto SOTA”) of VLP models. We demonstrate a sizable generalization gap for all VLP models when tested on out-of-distribution test sets annotated on images drawn from a more diverse distribution that spans cultures. Moreover, we find that measuring the efficiency-performance trade-off of VLP models yields complementary insights into several VLP design choices. We release the VLUE benchmark to promote research on building vision-language models that generalize well to more diverse images and concepts unseen during pre-training, and that are practical in terms of the efficiency-performance trade-off.
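To make the “Pareto SOTA” criterion concrete: a model is Pareto-optimal when no other model is both more efficient (e.g., lower inference latency) and more accurate. The sketch below is a minimal illustration of selecting such a frontier; the model names and numbers are hypothetical placeholders, not results from the paper.

```python
# Minimal sketch of the "Pareto SOTA" idea: a model sits on the
# efficiency-performance frontier if no other model is at least as fast
# AND at least as accurate, with a strict improvement on one axis.
# All names and numbers below are invented for illustration only.

models = {
    "model_a": {"latency_ms": 120.0, "score": 78.4},
    "model_b": {"latency_ms": 45.0,  "score": 74.1},
    "model_c": {"latency_ms": 200.0, "score": 78.0},  # dominated by model_a
    "model_d": {"latency_ms": 30.0,  "score": 70.2},
}

def pareto_frontier(models):
    """Return names of models not dominated on (latency, score)."""
    frontier = []
    for name, m in models.items():
        dominated = any(
            other["latency_ms"] <= m["latency_ms"]
            and other["score"] >= m["score"]
            and (other["latency_ms"] < m["latency_ms"]
                 or other["score"] > m["score"])
            for other_name, other in models.items()
            if other_name != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))  # ['model_a', 'model_b', 'model_d']
```

Under this reading, a new model counts as progress not only if it beats the best absolute score, but also if it extends the frontier, for example by matching a larger model's score at a fraction of its latency.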

