Object-Centric Diagnosis of Visual Reasoning

12/21/2020
by Jianwei Yang, et al.

Answering questions about an image requires not only knowing what (understanding the fine-grained content of the image, e.g., objects and relationships) but also telling why (reasoning over grounded visual cues to derive the answer). Over the last few years, we have seen significant progress on visual question answering. Despite impressive gains in accuracy, it remains unclear whether these models perform grounded visual reasoning or merely exploit spurious correlations in the training data. Recently, a number of works have attempted to answer this question from perspectives such as grounding and robustness. However, most of them either focus on the language side or study only coarse, pixel-level attention maps. In this paper, leveraging the step-wise object grounding annotations provided in the GQA dataset, we present a systematic object-centric diagnosis of visual reasoning in terms of grounding and robustness, with a particular focus on the vision side. Extensive comparisons across models show that even models with high accuracy do not ground objects precisely, nor are they robust to perturbations of the visual content. In contrast, symbolic and modular models ground better and are more robust, though at a cost in accuracy. To reconcile these aspects, we further develop a diagnostic model, the Graph Reasoning Machine. Our model replaces the purely symbolic visual representation with a probabilistic scene graph and applies teacher-forcing training to the visual reasoning module. The resulting model improves on all three metrics over the vanilla neural-symbolic model while retaining its transparency. Ablation studies suggest that the improvement stems mainly from more accurate image understanding and proper supervision of intermediate reasoning steps.
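The diagnosis hinges on checking whether a model attends to the right object at each reasoning step. Below is a minimal sketch of such an object-centric grounding check, assuming the model exposes one predicted bounding box per step and GQA-style annotations supply the ground-truth box for that step; the function names and the 0.5 IoU threshold are illustrative assumptions, not the paper's exact protocol.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predicted_boxes, annotated_boxes, iou_threshold=0.5):
    """Fraction of reasoning steps whose predicted box matches the
    step's ground-truth grounded object (GQA-style annotations)."""
    hits = sum(
        1
        for pred, gt in zip(predicted_boxes, annotated_boxes)
        if iou(pred, gt) >= iou_threshold
    )
    return hits / max(len(annotated_boxes), 1)
```

Robustness can then be probed analogously: perturb question-irrelevant visual content (e.g., remove or swap unrelated objects) and check whether the answer and the step-wise groundings stay stable.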

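Teacher forcing for the reasoning module means that, during training, each reasoning step is conditioned on the ground-truth grounding of the previous step rather than on the model's own (possibly wrong) prediction, so every step receives direct intermediate supervision. A hedged sketch follows, where `step_model`, the program representation, and the cross-entropy objective over scene-graph objects are illustrative assumptions rather than the paper's actual interfaces.

```python
import torch
import torch.nn as nn

def teacher_forced_loss(step_model: nn.Module,
                        scene_graph: torch.Tensor,
                        program: list,
                        gt_groundings: torch.Tensor) -> torch.Tensor:
    """Sum of per-step losses; each step is fed the previous step's
    ground-truth grounding (teacher forcing) instead of its own output."""
    criterion = nn.CrossEntropyLoss()
    loss = torch.zeros(())
    prev = None  # no grounding before the first step
    for t, op in enumerate(program):
        # Predict a distribution over scene-graph objects for step t.
        logits = step_model(scene_graph, op, prev)   # shape: (num_objects,)
        target = gt_groundings[t].view(1)            # ground-truth object id
        loss = loss + criterion(logits.view(1, -1), target)
        prev = gt_groundings[t]                      # feed ground truth forward
    return loss
```

At inference the module falls back to its own step-wise predictions; keeping the scene graph probabilistic rather than purely symbolic avoids committing to hard, possibly wrong perception decisions that later reasoning steps could not recover from.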

