Generative Visual Question Answering

by Ethan Shen, et al.

Multi-modal tasks involving vision and language continue to rise in popularity, driving the development of models that can generalize beyond their training data. Current models lack temporal generalization, the ability to adapt to shifts in future data. This paper presents a viable approach to building an advanced Visual Question Answering (VQA) model that performs well under temporal generalization. We propose a new dataset, GenVQA, which uses images and captions from the VQAv2 and MS-COCO datasets to generate new images through Stable Diffusion. This augmented dataset is then used to evaluate a combination of seven baseline and cutting-edge VQA models. Performance evaluation focuses on questions mirroring the original VQAv2 dataset, with answers adjusted to the newly generated images. The paper's purpose is to investigate the robustness of several successful VQA models and assess their performance on future data distributions. Model architectures are analyzed to identify common design choices that improve generalization under temporal distribution shifts. This research highlights the importance of creating a large-scale, future-shifted dataset: such data can enhance the robustness of VQA models, allowing future models to better adapt to temporal distribution shifts.
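As a rough illustration of the caption-to-image augmentation step described in the abstract, the sketch below turns existing MS-COCO-style captions into text-to-image prompts. The `build_prompt` helper and the "future shift" phrasing are illustrative assumptions, not the paper's actual pipeline; the Stable Diffusion call (here via Hugging Face `diffusers`) is shown commented out, since it requires model weights and a GPU.

```python
# Sketch of a GenVQA-style augmentation step: reuse MS-COCO captions as
# prompts for a text-to-image model. Helper names and the temporal-shift
# suffix are assumptions for illustration, not the authors' code.

def build_prompt(caption: str, shift: str = "in the year 2035") -> str:
    """Append a temporal-shift cue to an existing image caption."""
    return f"{caption.rstrip('.')}, {shift}"

captions = [
    "A man riding a skateboard down a street.",
    "Two cats sleeping on a red couch.",
]
prompts = [build_prompt(c) for c in captions]

# Image generation would use Stable Diffusion, e.g. via Hugging Face
# diffusers (commented out: requires downloading weights and a GPU):
# from diffusers import StableDiffusionPipeline
# pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# images = [pipe(p).images[0] for p in prompts]
```

The generated images can then be paired with the original VQAv2 questions, with answers re-annotated against the new visual content.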

