Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

09/09/2021
by Stella Frank, et al.

Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information. This method involves ablating inputs from one modality, either entirely or selectively based on cross-modal grounding alignments, and evaluating the model prediction performance on the other modality. Model performance is measured by modality-specific tasks that mirror the model pretraining objectives (e.g. masked language modelling for text). Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality. We find that recently proposed models have much greater relative difficulty predicting text when visual information is ablated, compared to predicting visual object categories when text is ablated, indicating that these models are not symmetrically cross-modal.
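The diagnostic described above can be sketched in miniature: ablate one modality's input (here, by zeroing visual features, as in full visual ablation) and compare the model's masked-token prediction loss with and without that input. This is a toy illustration only; the paper evaluates pretrained vision-and-language BERTs, and all weights, dimensions, and function names below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, txt_dim, img_dim = 10, 8, 8

# Hypothetical fixed weights standing in for a pretrained multimodal model.
W_txt = rng.normal(size=(txt_dim, vocab_size))
W_img = rng.normal(size=(img_dim, vocab_size))

def predict_masked_token(text_ctx, img_feats):
    """Probability distribution over the vocabulary for one masked text slot,
    conditioned on both the text context and the visual features."""
    logits = text_ctx @ W_txt + img_feats @ W_img
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(probs, target):
    """Masked-LM loss for the gold token."""
    return -np.log(probs[target] + 1e-12)

# One synthetic example: a text context, visual features, and a gold token id
# (chosen as the model's own top prediction, so the example is self-consistent).
text_ctx = rng.normal(size=txt_dim)
img_feats = rng.normal(size=img_dim)
gold = int(np.argmax(predict_masked_token(text_ctx, img_feats)))

# Full inputs vs. vision ablated: a model that truly integrates the visual
# modality should show a higher text-prediction loss under ablation.
loss_full = cross_entropy(predict_masked_token(text_ctx, img_feats), gold)
loss_ablated = cross_entropy(predict_masked_token(text_ctx, np.zeros(img_dim)), gold)

print(f"loss with vision: {loss_full:.3f}, vision ablated: {loss_ablated:.3f}")
```

The symmetric test (predicting visual object categories with text ablated) would follow the same pattern with the roles of the two modalities swapped; comparing the two loss gaps is what reveals the asymmetry the paper reports.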

Related research:

- 06/19/2023 · Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning
- 04/15/2023 · CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure for Vision-Language Retrieval
- 09/07/2023 · Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation
- 06/22/2023 · Learning Unseen Modality Interaction
- 12/08/2020 · Parameter Efficient Multimodal Transformers for Video Representation Learning
- 08/28/2019 · Adversarial Representation Learning for Text-to-Image Matching
- 05/19/2018 · Do Neural Network Cross-Modal Mappings Really Bridge Modalities?
