A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

by   Linjie Li, et al.

Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although achieving impressive performance on standard tasks, to date, it still remains unclear how robust these pre-trained models are. To investigate, we conduct a host of thorough evaluations on existing pre-trained models over 4 different types of V+L specific model robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Interestingly, by standard model finetuning, pre-trained V+L models already exhibit better robustness than many task-specific state-of-the-art methods. To further enhance model robustness, we propose Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models. Differing from previous studies focused on one specific type of robustness, Mango is task-agnostic, and enables universal performance lift for pre-trained models over diverse tasks designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that Mango achieves new state of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by a significant margin. As the first comprehensive study on V+L robustness, this work puts robustness of pre-trained models into sharper focus, pointing new directions for future study.


page 2

page 8

page 15


A Survey of Knowledge Enhanced Pre-trained Models

Pre-trained models learn contextualized word representations on large-sc...

A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval

Cross-Modal Retrieval (CMR) is an important research topic across multim...

Vision Checklist: Towards Testable Error Analysis of Image Models to Help System Designers Interrogate Model Capabilities

Using large pre-trained models for image recognition tasks is becoming i...

What Do Adversarially Robust Models Look At?

In this paper, we address the open question: "What do adversarially robu...

EBJR: Energy-Based Joint Reasoning for Adaptive Inference

State-of-the-art deep learning models have achieved significant performa...

Feature-informed Embedding Space Regularization For Audio Classification

Feature representations derived from models pre-trained on large-scale d...

Introducing Orthogonal Constraint in Structural Probes

With the recent success of pre-trained models in NLP, a significant focu...