LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering

11/21/2020
by Weixin Liang, et al.

The predominant approach to visual question answering (VQA) encodes the image and question with a "black-box" neural encoder and decodes a single-token answer such as "yes" or "no". Despite its strong quantitative results, this approach struggles to provide intuitive, human-readable justification for its predictions. To address this shortcoming, we reformulate VQA as a full answer generation task, which requires the model to justify its predictions in natural language. We propose LRTA [Look, Read, Think, Answer], a transparent neural-symbolic reasoning framework for visual question answering that solves the problem step by step like humans and provides a human-readable justification at each step. Specifically, LRTA first learns to convert the image into a scene graph and to parse the question into a sequence of reasoning instructions. It then executes the reasoning instructions one at a time by traversing the scene graph with a recurrent neural-symbolic execution module. Finally, it generates a full answer to the given question with natural-language justifications. Our experiments on the GQA dataset show that LRTA outperforms the state-of-the-art model by a large margin (43.1% vs. 28.0%) on the full answer generation task. We also create a perturbed GQA test set by removing linguistic cues (attributes and relations) from the questions to analyze whether a model is merely making smart guesses from superficial data correlations. We show that LRTA takes a step toward truly understanding the question, whereas the state-of-the-art model tends to learn superficial correlations from the training data.
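To make the execution step concrete, below is a minimal, purely symbolic sketch of how parsed reasoning instructions can traverse a scene graph. The `SceneGraph`, `Obj`, and `execute` names and the select/relate/query instruction vocabulary are illustrative assumptions, not LRTA's actual interface; the paper's execution module is a recurrent neural-symbolic network rather than the hard-coded lookups shown here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Obj:
    """A scene-graph node: an object name plus attribute key/value pairs."""
    name: str
    attrs: tuple  # e.g. (("color", "silver"),)

class SceneGraph:
    """Toy scene graph: objects indexed by id, relations as (s, r, o) triples.
    Hypothetical structure for illustration; not the paper's implementation."""

    def __init__(self, objects, edges):
        self.objects = objects  # dict: id -> Obj
        self.edges = edges      # list of (subject_id, relation, object_id)

    def hop(self, ids, relation):
        """Follow `relation` edges out of the given object ids."""
        return {o for s, r, o in self.edges if s in ids and r == relation}

def execute(graph, instructions):
    """Run reasoning instructions one at a time, keeping a frontier of
    candidate objects -- a symbolic stand-in for LRTA's recurrent
    neural-symbolic execution module."""
    frontier = set(graph.objects)                 # start from every object
    for op, arg in instructions:
        if op == "select":                        # keep objects with this name
            frontier = {i for i in frontier if graph.objects[i].name == arg}
        elif op == "relate":                      # traverse one relation edge
            frontier = graph.hop(frontier, arg)
        elif op == "query":                       # read an attribute off the
            (ref,) = frontier                     # (assumed unique) referent
            return dict(graph.objects[ref].attrs)[arg]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return frontier

# "What color is the thing the cat is looking at?"
graph = SceneGraph(
    objects={
        0: Obj("cat", (("color", "gray"),)),
        1: Obj("laptop", (("color", "silver"),)),
        2: Obj("table", (("color", "brown"),)),
    },
    edges=[(0, "looking at", 1), (1, "on", 2)],
)
program = [("select", "cat"), ("relate", "looking at"), ("query", "color")]
print(execute(graph, program))  # -> silver
```

Because every intermediate frontier is an explicit set of objects, each traversal step can be inspected and supervised independently, which is the intuition behind the modular supervision and transparency the abstract describes.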

Related research

10/04/2018 · Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
We marry two powerful ideas: deep representation learning for visual rec...

07/13/2021 · Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering
Visual Question Answering (VQA) is concerned with answering free-form qu...

07/26/2022 · Equivariant and Invariant Grounding for Video Question Answering
Video Question Answering (VideoQA) is the task of answering the natural ...

05/16/2022 · A Neuro-Symbolic ASP Pipeline for Visual Question Answering
We present a neuro-symbolic visual question answering (VQA) pipeline for...

05/23/2023 · Image Manipulation via Multi-Hop Instructions – A New Dataset and Weakly-Supervised Neuro-Symbolic Approach
We are interested in image manipulation via natural language text – a ta...

12/21/2020 · Object-Centric Diagnosis of Visual Reasoning
When answering questions about an image, it not only needs knowing what ...

07/09/2019 · Learning by Abstraction: The Neural State Machine
We introduce the Neural State Machine, seeking to bridge the gap between...
