VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models

09/12/2022
by Felix Vogel, et al.

Vision-language models trained on large, randomly collected data have had a significant impact on many areas since they appeared. But while they show great performance in various fields, such as image-text retrieval, their inner workings are still not fully understood. The current work analyses the true zero-shot capabilities of those models. We start with an analysis of the training corpus, assessing to what extent (and which of) the test classes are really zero-shot, and how this correlates with individual class performance. We follow up with an analysis of the attribute-based zero-shot learning capabilities of these models, evaluating how well this classical zero-shot notion emerges from large-scale webly supervision. We leverage the recently released LAION400M data corpus as well as the publicly available pretrained models of CLIP, OpenCLIP, and FLAVA, evaluating the attribute-based zero-shot capabilities on the CUB and AWA2 benchmarks. Our analysis shows that: (i) most of the classes in popular zero-shot benchmarks are observed (often extensively) during pre-training; (ii) zero-shot performance mainly stems from the models' capability of recognizing class labels whenever they are present in the text, and a significantly lower-performing capability of attribute-based zero-shot learning is only observed when class labels are not used; (iii) the number of attributes used can have a significant effect on performance, and can easily cause a significant performance decrease.
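The evaluation protocol the abstract describes boils down to zero-shot classification by cosine similarity: an image embedding is compared against text embeddings built either from prompts that name the class or from prompts that list only its attributes. Below is a minimal NumPy sketch of that scoring step; the class names, attribute strings, and random vectors standing in for CLIP's encoder outputs are all illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_shot_predict(image_emb, text_embs):
    # L2-normalize both sides, then pick the prompt with the highest
    # cosine similarity to the image embedding.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

# Hypothetical classes and attribute descriptions (illustrative only).
classes = ["blue jay", "cardinal", "sparrow"]
attributes = {
    "blue jay": "a bird with blue wings and a crest",
    "cardinal": "a bird with red plumage and a crest",
    "sparrow": "a small brown bird with short wings",
}

# Label-based prompts mention the class name; attribute-based prompts do not.
label_prompts = [f"a photo of a {c}" for c in classes]
attr_prompts = [f"a photo of {attributes[c]}" for c in classes]

# Random stand-ins for text-encoder outputs; a real run would encode the
# prompts above with CLIP/OpenCLIP/FLAVA text encoders.
text_embs = rng.normal(size=(3, 512))
# Simulate an image whose embedding lies near the class-1 text embedding.
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)

print(zero_shot_predict(image_emb, text_embs))
```

Swapping `label_prompts` for `attr_prompts` in the text encoder is the only change needed to move from label-based to attribute-based evaluation, which is what makes the performance gap the paper reports directly comparable.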


