Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering

We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as context. In such a setting, a model must first retrieve the relevant images from the pool and then answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (or RETVQA in short). RETVQA is distinctly different from, and more challenging than, the traditionally studied Visual Question Answering (VQA), where a given question has to be answered with a single relevant image in context. Towards solving the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and the images retrieved by our relevance encoder and generates free-form, fluent answers. Further, we introduce the largest dataset in this space, namely RETVQA, which has the following salient features: a multi-image and retrieval requirement for VQA, metadata-independent questions over a pool of heterogeneous images, and an expected mix of classification-oriented and open-ended generative answers. Our proposed framework achieves an accuracy of 76.5% on the proposed dataset, namely RETVQA, and also outperforms state-of-the-art methods by 4.9% and 11.8% on the accuracy and fluency metrics, respectively.
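The retrieve-then-answer flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: cosine similarity over toy vectors is an assumed stand-in for the relevance encoder, and the `generate` callable is a hypothetical placeholder for an answer generator such as MI-BART. All function names and the example pool are illustrative.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(question_vec: Vector,
             pool: Dict[str, Vector],
             top_k: int = 2) -> List[Tuple[str, float]]:
    """Score every image in the pool against the question and keep the top_k.

    Assumption: similarity of precomputed embeddings stands in for a
    learned relevance encoder.
    """
    scored = [(image_id, cosine(question_vec, vec)) for image_id, vec in pool.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def answer(question: str,
           question_vec: Vector,
           pool: Dict[str, Vector],
           generate: Callable[[str, List[str]], str],
           top_k: int = 2) -> str:
    """Stage 1: retrieve relevant images. Stage 2: generate an answer from them.

    `generate` is a hypothetical hook for a multi-image answer generator.
    """
    retrieved_ids = [image_id for image_id, _ in retrieve(question_vec, pool, top_k)]
    return generate(question, retrieved_ids)


# Toy pool: two images close to the question vector, one unrelated.
example_pool = {
    "img_a": [1.0, 0.0],
    "img_b": [0.9, 0.1],
    "img_c": [0.0, 1.0],
}
example_question_vec = [1.0, 0.0]
```

In this toy setup, `retrieve(example_question_vec, example_pool)` keeps `img_a` and `img_b` and discards `img_c`. The generator then sees only the retrieved images, which is the two-stage separation the task calls for.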


