Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions

01/27/2018
by   Qing Li, et al.

Visual Question Answering (VQA) has attracted attention from both the computer vision and natural language processing communities. Most existing approaches adopt a pipeline that represents an image via pre-trained CNNs and then uses the uninterpretable CNN features, in conjunction with the question, to predict the answer. Although such end-to-end models might report promising performance, they rarely provide any insight, apart from the answer, into the VQA process. In this work, we propose to break up end-to-end VQA into two steps: explaining and reasoning, in an attempt towards a more explainable VQA by shedding light on the intermediate results between these two steps. To that end, we first extract attributes and generate descriptions as explanations for an image using pre-trained attribute detectors and image captioning models, respectively. Next, a reasoning module uses these explanations, in place of the image, to infer an answer to the question. The advantages of such a breakdown include: (1) the attributes and captions reflect what the system extracts from the image, and thus can provide explanations for the predicted answer; (2) when the predicted answer is wrong, these intermediate results help us identify failures in both the image understanding step and the answer inference step. We conduct extensive experiments on a popular VQA dataset and analyze all results according to several measures of explanation quality. Our system achieves performance comparable to the state-of-the-art, with the added benefits of explainability and the inherent ability to improve further as explanation quality improves.
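
The sketch below is a minimal illustration of the "explain, then reason" decomposition described in the abstract, not the authors' implementation: all function names, the placeholder attribute detector, captioner, and the toy word-overlap reasoning step are assumptions made purely to show how the two-step interface exposes inspectable intermediate results.

```python
from typing import List


def detect_attributes(image) -> List[str]:
    # Placeholder: the paper uses pre-trained attribute detectors to produce
    # word-level attributes for the image (e.g. "dog", "grass", "frisbee").
    return ["dog", "grass", "frisbee", "running"]


def generate_captions(image) -> List[str]:
    # Placeholder: the paper uses pre-trained image captioning models to
    # generate sentence-level descriptions of the image.
    return ["a dog is running on the grass with a frisbee"]


def reason(question: str, attributes: List[str], captions: List[str]) -> str:
    # Toy reasoning step: the real reasoning module infers the answer from the
    # textual explanations alone (the image is not used at this stage). Here we
    # only check word overlap to illustrate the interface, not the actual model.
    explanation_words = set(attributes) | {w for c in captions for w in c.split()}
    question_words = set(question.lower().rstrip("?").split())
    overlap = question_words & explanation_words
    return next(iter(overlap), "unknown")


def answer_question(image, question: str) -> dict:
    # Step 1 (explain): turn the image into interpretable text.
    attributes = detect_attributes(image)
    captions = generate_captions(image)
    # Step 2 (reason): answer from the explanations only; when the answer is
    # wrong, the attributes and captions can be inspected to locate the failure.
    answer = reason(question, attributes, captions)
    return {"answer": answer, "attributes": attributes, "captions": captions}


if __name__ == "__main__":
    print(answer_question(image=None, question="What is the dog playing with?"))
```

Because the reasoning module sees only the attributes and captions, they act as an interpretable bottleneck: if the answer is wrong, one can check whether the explanations missed the relevant content (an image understanding failure) or the reasoning module misused correct explanations (an answer inference failure).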
