Multi-modal Robustness Analysis Against Language and Visual Perturbations

07/05/2022
by Madeline C. Schiappa, et al.

Joint visual and language modeling on large-scale datasets has recently shown good progress on multi-modal tasks compared to single-modal learning. However, the robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of such models against various real-world perturbations, focusing on video and language. We consider text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different textual perturbations. The study reveals some interesting findings: 1) the studied models are more robust when text is perturbed than when video is perturbed; 2) the transformer text encoder is more robust to non-semantic-changing text perturbations and to visual perturbations than word-embedding approaches; 3) using two-branch encoders in isolation is typically more robust than architectures that use cross-attention. We hope this study will serve as a benchmark and guide future research in robust multimodal learning.
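To make the evaluation setup in the abstract concrete, the following is a minimal sketch of how a text-to-video retrieval score might be compared on clean versus perturbed queries. It is not the paper's code: the character-swap perturbation, the function names `swap_adjacent_chars`, `recall_at_k`, and `relative_drop`, and the drop metric itself are all assumptions chosen for illustration, and the similarity matrices are expected to come from whatever retrieval model is under test.

```python
import random


def swap_adjacent_chars(text, rate, seed=0):
    """Hypothetical non-semantic text perturbation: swap neighbouring letters.

    Each pair of adjacent alphabetic characters is swapped with probability
    `rate`. This is an illustrative stand-in, not one of the benchmark's
    actual perturbations.
    """
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


def recall_at_k(similarity, k):
    """Recall@K for text-to-video retrieval.

    similarity[i][j] is the score between text query i and video j.
    Assumes the ground-truth video for query i has index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def relative_drop(clean_score, perturbed_score):
    """Fraction of the clean score lost under perturbation (assumed metric)."""
    if clean_score <= 0:
        return 0.0
    return (clean_score - perturbed_score) / clean_score
```

In use, one would encode the clean captions and the perturbed captions with the same model, build a similarity matrix against the video embeddings for each, and compare `recall_at_k` on the two matrices with `relative_drop`. A smaller drop suggests the model is more robust to that perturbation.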


