Training Vision-Language Models with Less Bimodal Supervision

11/01/2022
by Elad Segal, et al.

Standard practice in pretraining multimodal models, such as vision-language models, is to rely on pairs of aligned inputs from both modalities, for example, aligned image-text pairs. However, such pairs can be difficult to obtain in low-resource settings and for some modality pairs (e.g., structured tables and images). In this work, we investigate the extent to which we can reduce the reliance on such parallel data, which we term bimodal supervision, and use models that are pretrained on each modality independently. We experiment with a high-performing vision-language model, and analyze the effect of bimodal supervision on three vision-language tasks. We find that on simpler tasks, such as VQAv2 and GQA, one can eliminate bimodal supervision completely, suffering only a minor loss in performance. Conversely, for NLVR2, which requires more complex reasoning, training without bimodal supervision leads to random performance. Nevertheless, using only 5% of the bimodal data (142K images along with their captions), or leveraging weak supervision in the form of a list of machine-generated labels for each image, leads to only a moderate degradation compared to using 3M image-text pairs: 74%→∼70%. Our code is available at https://github.com/eladsegal/less-bimodal-sup.
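The weak-supervision variant described above replaces human-written captions with a list of machine-generated labels per image. As a rough illustration only (not the authors' pipeline; the tagger output format, the caption template, and all function names below are assumptions), a minimal sketch of turning such label lists into pseudo-captions usable as image-text pairs might look like this:

```python
# Minimal sketch (hypothetical, for illustration): convert machine-generated
# image labels into caption-like strings that can stand in for aligned
# image-text pairs during vision-language pretraining.

import random
from typing import Dict, List


def labels_to_pseudo_caption(labels: List[str], max_labels: int = 5, seed: int = 0) -> str:
    """Join a subset of predicted labels into a caption-like string."""
    rng = random.Random(seed)
    chosen = labels[:max_labels] if len(labels) <= max_labels else rng.sample(labels, max_labels)
    # A simple template; any phrasing that exposes the labels to the text encoder would do.
    return "a photo of " + ", ".join(chosen)


def build_weak_pairs(image_labels: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Produce (image, pseudo-caption) pairs in place of human-annotated image-text data."""
    return [
        {"image": path, "text": labels_to_pseudo_caption(labels)}
        for path, labels in image_labels.items()
        if labels  # skip images with no predicted labels
    ]


if __name__ == "__main__":
    # Hypothetical tagger output: image path -> machine-generated labels.
    tagger_output = {
        "images/000001.jpg": ["dog", "frisbee", "grass", "park"],
        "images/000002.jpg": ["kitchen", "refrigerator", "person"],
    }
    for pair in build_weak_pairs(tagger_output):
        print(pair)
```

The resulting pairs can then be fed to the same pretraining objective that would otherwise consume aligned image-caption data; the paper's actual setup is in the linked repository.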

