Related research:

- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
- Revisiting Pre-Trained Models for Chinese Natural Language Processing
- What Do Adversarially Robust Models Look At?
- SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions
- Introducing Orthogonal Constraint in Structural Probes
- Measuring and Reducing Gendered Correlations in Pre-trained Models
- A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although they achieve impressive performance on standard tasks, it remains unclear to date how robust these pre-trained models are. To investigate, we conduct a host of thorough evaluations of existing pre-trained models over four types of V+L-specific robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Interestingly, with standard fine-tuning alone, pre-trained V+L models already exhibit better robustness than many task-specific state-of-the-art methods. To further enhance model robustness, we propose MANGO, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models. Unlike previous studies that focus on one specific type of robustness, MANGO is task-agnostic and enables a universal performance lift for pre-trained models across diverse tasks designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that MANGO achieves new state of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by a significant margin. As the first comprehensive study of V+L robustness, this work puts the robustness of pre-trained models into sharper focus and points to new directions for future study.
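The abstract only describes MANGO at a high level: a generator is trained to produce adversarial noise in the embedding space, and the V+L model is trained to withstand it. The sketch below illustrates what such embedding-space adversarial training can look like in generic PyTorch. All names here (NoiseGenerator, adversarial_step, task model interface), the tanh-bounded noise, and the alternating max/min update scheme are illustrative assumptions, not the authors' actual implementation, whose details the abstract does not specify.

```python
# Minimal sketch of embedding-space adversarial training in the spirit of
# MANGO. Hypothetical components: the V+L `model` is assumed to map
# (image embeddings, text embeddings) -> task logits.
import torch
import torch.nn as nn


class NoiseGenerator(nn.Module):
    """Maps clean embeddings to bounded adversarial perturbations."""

    def __init__(self, dim: int, eps: float = 0.1):
        super().__init__()
        self.eps = eps  # assumed perturbation budget, not from the paper
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # tanh keeps each noise coordinate within [-eps, eps]
        return self.eps * torch.tanh(self.net(emb))


def adversarial_step(model, gen_img, gen_txt, img_emb, txt_emb, labels,
                     opt_model, opt_gen, criterion=nn.CrossEntropyLoss()):
    # 1) Generator max-step: perturb embeddings to *increase* the task loss.
    noisy_img = img_emb + gen_img(img_emb)
    noisy_txt = txt_emb + gen_txt(txt_emb)
    loss_adv = criterion(model(noisy_img, noisy_txt), labels)
    opt_gen.zero_grad()
    (-loss_adv).backward()  # gradient ascent on the task loss
    opt_gen.step()

    # 2) Model min-step: train on clean plus freshly perturbed embeddings.
    with torch.no_grad():  # freeze the generator while updating the model
        noisy_img = img_emb + gen_img(img_emb)
        noisy_txt = txt_emb + gen_txt(txt_emb)
    loss = (criterion(model(img_emb, txt_emb), labels)
            + criterion(model(noisy_img, noisy_txt), labels))
    opt_model.zero_grad()
    loss.backward()
    opt_model.step()
    return loss.item()


# Usage (dimensions, learning rates, and optimizers are illustrative):
# gen_img, gen_txt = NoiseGenerator(dim=768), NoiseGenerator(dim=768)
# opt_gen = torch.optim.Adam(
#     list(gen_img.parameters()) + list(gen_txt.parameters()), lr=1e-4)
# opt_model = torch.optim.AdamW(model.parameters(), lr=5e-5)
```

The alternating updates make this a min-max game: the generator searches for the worst-case noise within its budget, and training the model against that noise is what makes the approach task-agnostic, since the perturbation lives in the shared embedding space rather than in any task-specific input format.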