Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

by Stella Frank et al.
Università di Trento

Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information. This method involves ablating inputs from one modality, either entirely or selectively based on cross-modal grounding alignments, and evaluating the model prediction performance on the other modality. Model performance is measured by modality-specific tasks that mirror the model pretraining objectives (e.g. masked language modelling for text). Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality. We find that recently proposed models have much greater relative difficulty predicting text when visual information is ablated, compared to predicting visual object categories when text is ablated, indicating that these models are not symmetrically cross-modal.
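The ablation procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`ablate_visual_inputs`, `relative_degradation`) and the representation of visual input as a matrix of region features are assumptions made for the example. Full ablation zeroes all visual regions; selective ablation zeroes only the regions aligned (via grounding annotations) to the masked text span. Degradation is then measured by comparing the model's prediction loss with and without the ablated modality.

```python
import numpy as np

def ablate_visual_inputs(region_feats, aligned_idx=None):
    """Ablate visual region features for cross-modal input ablation.

    region_feats: (num_regions, feat_dim) array of visual features.
    aligned_idx:  indices of regions grounded to the masked text span;
                  if None, all regions are ablated (full ablation).
    """
    ablated = region_feats.copy()
    if aligned_idx is None:
        ablated[:] = 0.0            # full visual ablation
    else:
        ablated[aligned_idx] = 0.0  # selective, grounding-based ablation
    return ablated

def relative_degradation(loss_ablated, loss_full):
    """Relative increase in prediction loss when the other modality
    is removed; larger values mean the model relied more on it."""
    return (loss_ablated - loss_full) / loss_full
```

In the paper's setting, `loss_full` would be, e.g., the masked language modelling loss with both modalities present, and `loss_ablated` the same loss after visual ablation; the asymmetry finding corresponds to text-side degradation being much larger than visual-side degradation.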


