Seeing past words: Testing the cross-modal capabilities of pretrained V&L models

12/22/2020
by Letitia Parcalabescu, et al.

We investigate the ability of general-purpose pretrained vision and language (V&L) models to perform reasoning in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models use task (1) for pretraining. However, none of the pretrained V&L models are able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. Our investigations suggest that pretrained V&L representations are less successful than expected at integrating the two modalities. We propose a number of explanations for these findings: LXMERT's results on the image-sentence alignment task (and to a lesser extent those obtained by ViLBERT 12-in-1) indicate that the model may exhibit catastrophic forgetting. As for our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input.
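To make task (1) concrete, below is a minimal sketch of how such an image-sentence alignment probe can be run against LXMERT with the HuggingFace transformers library. The checkpoint name, the use of 36 precomputed Faster R-CNN region features, and the logit index chosen for the "matched" class are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a zero-shot image-sentence alignment probe with LXMERT.
# Assumes region features were precomputed with a Faster R-CNN detector,
# since LXMERT consumes ROI features rather than raw pixels.
import torch
from transformers import LxmertTokenizer, LxmertForPreTraining

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertForPreTraining.from_pretrained("unc-nlp/lxmert-base-uncased")
model.eval()

def alignment_probability(sentence, visual_feats, visual_pos):
    """Score one image-sentence pair.

    visual_feats: FloatTensor [1, num_regions, 2048], ROI features.
    visual_pos:   FloatTensor [1, num_regions, 4], normalized boxes.
    """
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
    # cross_relationship_score holds the two matched/mismatched logits from
    # LXMERT's cross-modality matching head; treating index 1 as "matched"
    # is an assumption about the checkpoint's label order.
    return out.cross_relationship_score.softmax(dim=-1)[0, 1].item()

# Random tensors stand in for real detector output in this sketch.
feats, boxes = torch.randn(1, 36, 2048), torch.rand(1, 36, 4)
print(alignment_probability("A dog catches a frisbee.", feats, boxes))
```

A pair is then classified as correct or incorrect by comparing or thresholding these scores, which corresponds to the zero-shot setting above; the finetuned setting would instead update the model on labelled aligned/misaligned pairs.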
