Analysis on Image Set Visual Question Answering

03/31/2021
by   Abhinav Khattar, et al.

We tackle the challenge of Visual Question Answering in a multi-image setting for the ISVQA dataset. Traditional VQA tasks focus on a single-image setting, where the target answer is generated from a single image. Image-set VQA, in contrast, presents a set of images and requires finding connections between the images, relating objects across images based on those connections, and generating a unified answer. In this report, we explore four approaches in a bid to improve performance on the task. We analyse and compare our results against three baseline models - LXMERT, HME-VideoQA and VisualBERT - and show that our approaches can provide a slight improvement over the baselines. Specifically, we try to improve the model's spatial awareness and color identification through enhanced pre-training, reduce language dependence using adversarial regularization, and improve counting using a regression loss and graph-based deduplication. We further present an in-depth analysis of the language bias in the ISVQA dataset and show how models trained on ISVQA implicitly learn to associate language more strongly with the final answer.
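The graph-based deduplication mentioned above can be illustrated with a minimal sketch. The abstract does not spell out the construction, so the following is an assumption about one plausible form: detected boxes from all images in the set become graph nodes, nodes with highly similar visual features are linked, and the number of connected components serves as the deduplicated object count (the names `dedup_count` and the similarity threshold are illustrative, not from the paper).

```python
import numpy as np

def dedup_count(features, threshold=0.9):
    """Estimate the number of distinct objects across an image set.

    features: list of 1-D feature vectors, one per detected box,
    pooled from every image in the set. Boxes whose cosine
    similarity exceeds `threshold` are assumed to be the same
    object seen in different views; connected components of the
    resulting similarity graph give the deduplicated count.
    """
    n = len(features)
    # L2-normalize so the dot product equals cosine similarity
    feats = np.stack([f / np.linalg.norm(f) for f in features])
    sim = feats @ feats.T

    # union-find over edges above the similarity threshold
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    # each connected component is counted as one distinct object
    return len({find(i) for i in range(n)})
```

For example, two near-identical box features (the same object in two overlapping views) plus one orthogonal feature yield a count of 2 rather than 3, which is the duplicate suppression the counting approach relies on.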


Related research

- Latent Variable Models for Visual Question Answering (01/16/2021)
- What is needed for simple spatial language capabilities in VQA? (08/17/2019)
- Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering (06/29/2023)
- Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (12/02/2016)
- Leveraging Visual Question Answering to Improve Text-to-Image Synthesis (10/28/2020)
- Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA (04/04/2023)
- Learning to Count Objects in Natural Images for Visual Question Answering (02/15/2018)
