VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations

07/01/2022
by Tiancheng Zhao, et al.

Vision-Language Pretraining (VLP) models have recently facilitated many cross-modal downstream tasks with great success. Most existing work evaluates these systems by comparing fine-tuned downstream task performance. However, average downstream accuracy alone provides little information about the pros and cons of each VLP method, let alone insight into how the community can improve these systems in the future. Inspired by CheckList for testing natural language processing, we introduce VL-CheckList, a novel framework for understanding the capabilities of VLP models. The proposed method divides the image-text matching ability of a VLP model into three categories, objects, attributes, and relations, and uses a novel taxonomy to further break down each of these aspects. We conduct comprehensive studies analyzing seven recent, popular VLP models with the proposed framework. The results confirm the effectiveness of the method: it reveals fine-grained differences among the compared models that are invisible under downstream-task-only evaluation. Further results point to promising research directions for building better VLP models. Data and Code: https://github.com/om-ai-lab/VL-CheckList
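The evaluation idea described above can be sketched as a simple probing loop: for each image, the model's image-text matching score for the correct caption is compared against a negative caption in which a single object, attribute, or relation has been perturbed, and accuracy is reported per category. The sketch below is an illustrative assumption, not the paper's actual code; `toy_score` is a deliberately naive bag-of-words stand-in for a real VLP matching score (e.g., CLIP image-text similarity).

```python
from collections import defaultdict

def checklist_accuracy(samples, score_fn):
    """Per-category probing accuracy, in the spirit of VL-CheckList.

    Each sample is (image, positive_caption, negative_caption, category),
    where category is 'object', 'attribute', or 'relation'. The model is
    counted correct when the positive caption scores strictly higher.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for image, pos, neg, category in samples:
        total[category] += 1
        if score_fn(image, pos) > score_fn(image, neg):
            correct[category] += 1
    return {c: correct[c] / total[c] for c in total}

# Toy stand-in for a real matching score: the "image" is represented by a
# set of ground-truth phrases, and the score is the caption-word overlap.
def toy_score(image, caption):
    return sum(word in image for word in caption.split())

samples = [
    ({"dog", "red", "ball"}, "dog with red ball", "cat with red ball", "object"),
    ({"dog", "red", "ball"}, "dog with red ball", "dog with blue ball", "attribute"),
    ({"dog", "on", "sofa"}, "dog on sofa", "sofa on dog", "relation"),
]
print(checklist_accuracy(samples, toy_score))
```

Note that the bag-of-words scorer passes the object and attribute probes but scores both relation captions identically, so it fails that category. This mirrors the motivation for category-level probing: an aggregate score would hide exactly which capability is missing.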

