Revisiting Visual Question Answering Baselines

06/27/2016
by Allan Jabri, et al.

Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to support "reasoning". For multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict an answer. This paper questions the value of these common practices and develops a simple alternative model based on binary classification. Instead of treating answers as competing choices, our model receives the answer as input and predicts whether or not an image-question-answer triplet is correct. We evaluate our model on the Visual7W Telling and the VQA Real Multiple Choice tasks, and find that even simple versions of our model perform competitively. Our best model achieves state-of-the-art performance on the Visual7W Telling task and compares surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task. We explore variants of the model and study its transferability between both datasets. We also present an error analysis of our model that suggests a key problem of current VQA systems lies in the lack of visual grounding of concepts that occur in the questions and answers. Overall, our results suggest that the performance of current VQA systems is not significantly better than that of systems designed to exploit dataset biases.
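The core idea of the abstract — scoring each image-question-answer triplet with a binary classifier instead of training a multi-class classifier over answers — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimensions, the randomly initialised two-layer MLP, and the function names are all hypothetical stand-ins for pretrained image features, text embeddings, and learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes; the paper uses pretrained CNN image features
# and text embeddings, so these dimensions are purely illustrative.
D_IMG, D_TXT, D_HID = 16, 8, 32

# Randomly initialised MLP weights stand in for trained parameters.
W1 = rng.normal(scale=0.1, size=(D_IMG + 2 * D_TXT, D_HID))
W2 = rng.normal(scale=0.1, size=(D_HID, 1))

def score_triplet(img, question, answer):
    """Binary score: P(this image-question-answer triplet is correct)."""
    x = np.concatenate([img, question, answer])  # fuse all three inputs
    h = np.maximum(x @ W1, 0.0)                  # ReLU hidden layer
    logit = (h @ W2).item()
    return 1.0 / (1.0 + np.exp(-logit))          # sigmoid probability

def predict(img, question, candidate_answers):
    """Multiple-choice prediction: candidate with the highest triplet score."""
    scores = [score_triplet(img, question, a) for a in candidate_answers]
    return int(np.argmax(scores))
```

Because the answer is an input rather than a class label, the same scorer handles any candidate set at test time, which is what makes transfer between datasets like Visual7W and VQA straightforward.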
