Scalable Performance Analysis for Vision-Language Models

05/30/2023
by Santiago Castro, et al.

Joint vision-language models have shown strong performance across a diverse set of tasks. However, little is known about their limitations, as the high-dimensional space learned by these models makes it difficult to identify semantic errors. Recent work has addressed this problem by designing highly controlled probing-task benchmarks. Our paper introduces a more scalable solution that relies on already-annotated benchmarks. Our method consists of extracting a large set of diverse features from a vision-language benchmark and measuring their correlation with the output of the target model. We confirm previous findings that CLIP behaves like a bag-of-words model and performs better with nouns and verbs; we also uncover novel insights, such as CLIP getting confused by concrete words. Our framework is available at https://github.com/MichiganNLP/Scalable-VLM-Probing and can be used with other multimodal models and benchmarks.
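The core of the method is simple to sketch: for each benchmark example, extract a scalar feature (e.g., a linguistic property of the caption) and correlate it with the target model's score on that example. The snippet below is a minimal illustration of that idea, not the paper's implementation; the feature (`noun_counts`) and the model scores are hypothetical placeholder data.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical per-example data: a caption-level feature (number of nouns)
# and the target model's image-text matching score for the same examples.
noun_counts = [1, 2, 3, 4, 5]
model_scores = [0.30, 0.35, 0.50, 0.55, 0.70]

# A strong positive correlation would suggest the model's performance
# depends on this feature -- the kind of signal the framework surfaces.
r = pearson(noun_counts, model_scores)
```

In practice the framework computes such correlations over many features at once (parts of speech, concreteness, word frequency, and so on), which is what makes the analysis scalable compared to hand-built probing tasks.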


Related research

- 06/26/2023, SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
  "In the last year alone, a surge of new benchmarks to measure composition..."
- 07/26/2022, V^2L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval
  "Product retrieval is of great importance in the ecommerce domain. This p..."
- 02/27/2023, Aligning Bag of Regions for Open-Vocabulary Object Detection
  "Pre-trained vision-language models (VLMs) learn to align vision and lang..."
- 09/06/2023, Distribution-Aware Prompt Tuning for Vision-Language Models
  "Pre-trained vision-language models (VLMs) have shown impressive performa..."
- 11/03/2022, LMentry: A Language Model Benchmark of Elementary Language Tasks
  "As the performance of large language models rapidly improves, benchmarks..."
- 12/15/2022, MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models Tasks
  "Vision and language models (VL) are known to exploit unrobust indicators..."
- 04/30/2020, WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context
  "In this paper, we present WiC-TSV (Target Sense Verification for Words i..."
