Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization

05/24/2022
by Aishwarya Agrawal, et al.

Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, we observe that these models exhibit poor out-of-distribution (OOD) generalization on the task of VQA. To better understand the underlying causes of poor generalization, we comprehensively investigate the performance of two pretrained V&L models under different settings (i.e., classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also argue that in most cases generative models are less susceptible to shifts in data distribution, while frequently performing better on our tested benchmarks. Moreover, we find that multimodal pretraining improves OOD performance in most settings. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.


