Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning

06/19/2023
by   Shivaen Ramshetty, et al.

The robustness of multimodal deep learning models to realistic changes in the input text is critical for their applicability to important tasks such as text-to-image retrieval and cross-modal entailment. To measure robustness, several existing approaches edit the text data, but do so without leveraging the cross-modal information present in multimodal data. Information from the visual modality, such as color, size, and shape, provides additional attributes that users can include in their inputs. Thus, we propose cross-modal attribute insertions as a realistic perturbation strategy for vision-and-language data that inserts visual attributes of the objects in the image into the corresponding text (e.g., "girl on a chair" to "little girl on a wooden chair"). Our proposed approach for cross-modal attribute insertions is modular, controllable, and task-agnostic. We find that augmenting input text using cross-modal insertions causes state-of-the-art approaches for text-to-image retrieval and cross-modal entailment to perform poorly, resulting in relative performance drops of ~15%. Further, human annotations demonstrate that cross-modal insertions lead to higher quality augmentations for multimodal data than augmentations using text-only data, and are equivalent in quality to original examples. We release the code to encourage robustness evaluations of deep vision-and-language models: https://github.com/claws-lab/multimodal-robustness-xmai.
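To make the perturbation strategy concrete, below is a minimal sketch of the general recipe the abstract describes: ground attributes in the image, then insert them before the matching object nouns in the text. The names `detect_attributes` and `insert_attributes` are hypothetical and used for illustration only; this is not the authors' implementation, which lives in the linked repository.

```python
# Minimal sketch of cross-modal attribute insertion as a text perturbation.
# Assumption: `detect_attributes` is a placeholder for any image-grounded
# object-attribute predictor; it is NOT the authors' implementation.

from typing import Dict, List


def detect_attributes(image_path: str) -> Dict[str, List[str]]:
    """Placeholder: map object nouns visible in the image to candidate
    visual attributes (color, size, material, ...)."""
    raise NotImplementedError("plug in an object-attribute detection model")


def insert_attributes(text: str, object_attributes: Dict[str, List[str]]) -> str:
    """Prepend a detected attribute to each matching noun in the text,
    e.g. 'girl on a chair' -> 'little girl on a wooden chair'."""
    out = []
    for token in text.split():
        attrs = object_attributes.get(token.lower().strip(".,"))
        out.append(f"{attrs[0]} {token}" if attrs else token)
    return " ".join(out)


# Hand-specified attributes stand in for the detector's output here:
caption = "girl on a chair"
attrs_from_image = {"girl": ["little"], "chair": ["wooden"]}
print(insert_attributes(caption, attrs_from_image))  # little girl on a wooden chair
```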

Related research

11/04/2022 · Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions
As multimodal learning finds applications in a wide variety of high-stak...

09/09/2021 · Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers
Pretrained vision-and-language BERTs aim to learn representations that c...

09/13/2022 · Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows
We present a new multimodal dataset called Visual Recipe Flow, which ena...

11/21/2022 · Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
We present Perceiver-VL, a vision-and-language framework that efficientl...

10/06/2020 · Learning to Represent Image and Text with Denotation Graph
Learning to fuse vision and language information and representing them i...

03/26/2019 · Cross-Modal Data Programming Enables Rapid Medical Machine Learning
Labeling training datasets has become a key barrier to building medical ...

09/22/2021 · Generating Compositional Color Representations from Text
We consider the cross-modal task of producing color representations for ...
