A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

12/15/2020
by Linjie Li, et al.

Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although these models achieve impressive performance on standard tasks, it remains unclear how robust they are. To investigate, we conduct thorough evaluations of existing pre-trained models over four types of V+L-specific robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Interestingly, with standard finetuning alone, pre-trained V+L models already exhibit better robustness than many task-specific state-of-the-art methods. To further enhance model robustness, we propose Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models. Unlike previous studies, which focus on one specific type of robustness, Mango is task-agnostic and yields a universal performance lift for pre-trained models across diverse tasks designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that Mango achieves a new state of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by a significant margin. As the first comprehensive study on V+L robustness, this work puts the robustness of pre-trained models into sharper focus and points to new directions for future study.
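
The idea of learning a perturbation in embedding space that fools a model can be illustrated with a toy sketch. The code below is not Mango's architecture or training objective, which the abstract does not specify. It assumes a frozen linear classifier over two-dimensional embeddings and learns one bounded perturbation vector per label by gradient ascent on the classifier's loss. All names, the bound, the learning rate, and the data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def bce_loss(w, x, y):
    """Binary cross-entropy of a frozen linear classifier on embedding x."""
    p = sigmoid(dot(w, x))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def train_noise_means(w, data, bound=0.5, lr=0.1, steps=200):
    """Learn one bounded perturbation vector per label (assumed toy setup).

    The perturbation is updated by gradient ASCENT on the classifier's loss,
    so it is trained to fool the classifier; the classifier itself is frozen.
    """
    mu = {0: [0.0] * len(w), 1: [0.0] * len(w)}
    for step in range(steps):
        x, y = data[step % len(data)]  # cycle through the examples in order
        x_adv = [xi + mi for xi, mi in zip(x, mu[y])]
        p = sigmoid(dot(w, x_adv))
        # For a linear model with cross-entropy, d(loss)/d(x_adv) = (p - y) * w.
        grad = [(p - y) * wi for wi in w]
        # Ascent step, then clip each coordinate to [-bound, bound].
        mu[y] = [max(-bound, min(bound, mi + lr * gi))
                 for mi, gi in zip(mu[y], grad)]
    return mu


# Hypothetical frozen classifier weights and two labeled embeddings.
w = [2.0, -1.0]
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
mu = train_noise_means(w, data)

for x, y in data:
    clean = bce_loss(w, x, y)
    perturbed = bce_loss(w, [xi + mi for xi, mi in zip(x, mu[y])], y)
    print(f"label={y} clean_loss={clean:.3f} perturbed_loss={perturbed:.3f}")
```

In an adversarial training loop, the model would then be finetuned on the perturbed embeddings as well as the clean ones. That step is omitted here because the abstract gives no details on it.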


10/01/2021

A Survey of Knowledge Enhanced Pre-trained Models

Pre-trained models learn contextualized word representations on large-sc...
01/08/2022

A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval

Cross-Modal Retrieval (CMR) is an important research topic across multim...
01/27/2022

Vision Checklist: Towards Testable Error Analysis of Image Models to Help System Designers Interrogate Model Capabilities

Using large pre-trained models for image recognition tasks is becoming i...
05/19/2019

What Do Adversarially Robust Models Look At?

In this paper, we address the open question: "What do adversarially robu...
10/20/2021

EBJR: Energy-Based Joint Reasoning for Adaptive Inference

State-of-the-art deep learning models have achieved significant performa...
06/10/2022

Feature-informed Embedding Space Regularization For Audio Classification

Feature representations derived from models pre-trained on large-scale d...
12/30/2020

Introducing Orthogonal Constraint in Structural Probes

With the recent success of pre-trained models in NLP, a significant focu...