A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

12/15/2020
by   Linjie Li, et al.

Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although they achieve impressive performance on standard tasks, it remains unclear how robust these pre-trained models are. To investigate, we conduct a series of thorough evaluations of existing pre-trained models over four types of V+L-specific robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Interestingly, with standard finetuning alone, pre-trained V+L models already exhibit better robustness than many task-specific state-of-the-art methods. To further enhance model robustness, we propose MANGO, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models. Unlike previous studies that focus on one specific type of robustness, MANGO is task-agnostic and delivers a universal performance lift for pre-trained models across diverse tasks designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that MANGO achieves new state of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by a significant margin. As the first comprehensive study on V+L robustness, this work puts the robustness of pre-trained models into sharper focus, pointing to new directions for future study.
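The abstract only describes MANGO at a high level: a learned generator perturbs the model's embeddings, and adversarial training on those perturbed embeddings hardens the pre-trained model. The snippet below is a minimal, illustrative sketch of that general idea, not the authors' implementation; the names (NoiseGenerator, VLModelStub, adversarial_training_step) and the perturbation bound eps are assumptions introduced here for illustration.

```python
# Illustrative sketch of adversarial noise in the embedding space
# (in the spirit of the MANGO idea); all names and hyperparameters
# here are hypothetical, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoiseGenerator(nn.Module):
    """Small MLP that maps an embedding to a bounded additive perturbation."""

    def __init__(self, dim: int, hidden: int = 256, eps: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.eps = eps  # keeps the perturbation small so inputs stay semantically close

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        noise = self.eps * F.normalize(self.net(emb), dim=-1)
        return emb + noise


class VLModelStub(nn.Module):
    """Stand-in for a pre-trained V+L model's task head (e.g. a VQA classifier)."""

    def __init__(self, dim: int = 768, num_answers: int = 10):
        super().__init__()
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, fused_emb: torch.Tensor) -> torch.Tensor:
        return self.classifier(fused_emb)


def adversarial_training_step(model, generator, fused_emb, labels, opt_model, opt_gen):
    """One step of the min-max game: the generator maximizes the task loss,
    the model then minimizes it on the perturbed embeddings."""
    # 1) Generator step: increase the model's loss on perturbed embeddings.
    opt_gen.zero_grad()
    gen_loss = -F.cross_entropy(model(generator(fused_emb)), labels)
    gen_loss.backward()
    opt_gen.step()

    # 2) Model step: stay accurate under the freshly generated noise.
    opt_model.zero_grad()
    with torch.no_grad():
        adv_emb = generator(fused_emb)
    loss = F.cross_entropy(model(adv_emb), labels)
    loss.backward()
    opt_model.step()
    return loss.item()


if __name__ == "__main__":
    dim, num_answers, batch = 768, 10, 4
    model, generator = VLModelStub(dim, num_answers), NoiseGenerator(dim)
    opt_model = torch.optim.Adam(model.parameters(), lr=1e-4)
    opt_gen = torch.optim.Adam(generator.parameters(), lr=1e-4)

    fused_emb = torch.randn(batch, dim)  # stand-in for a fused multimodal embedding
    labels = torch.randint(0, num_answers, (batch,))
    print(adversarial_training_step(model, generator, fused_emb, labels, opt_model, opt_gen))
```

Because the perturbation is applied in the embedding space rather than to raw pixels or tokens, the same sketch applies unchanged across tasks, which is consistent with the task-agnostic framing in the abstract.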


