Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning

03/18/2023
by   Shi Chen, et al.
0

Humans have the innate capability to answer diverse questions, which is rooted in the natural ability to correlate different concepts based on their semantic relationships and decompose difficult problems into sub-tasks. On the contrary, existing visual reasoning methods assume training samples that capture every possible object and reasoning problem, and rely on black-boxed models that commonly exploit statistical priors. They have yet to develop the capability to address novel objects or spurious biases in real-world scenarios, and also fall short of interpreting the rationales behind their decisions. Inspired by humans' reasoning of the visual world, we tackle the aforementioned challenges from a compositional perspective, and propose an integral framework consisting of a principled object factorization method and a novel neural module network. Our factorization method decomposes objects based on their key characteristics, and automatically derives prototypes that represent a wide range of objects. With these prototypes encoding important semantics, the proposed network then correlates objects by measuring their similarity on a common semantic space and makes decisions with a compositional reasoning process. It is capable of answering questions with diverse objects regardless of their availability during training, and overcoming the issues of biased question-answer distributions. In addition to the enhanced generalizability, our framework also provides an interpretable interface for understanding the decision-making process of models. Our code is available at https://github.com/szzexpoi/POEM.

READ FULL TEXT

page 2

page 3

page 4

page 7

page 8

research
04/14/2022

Measuring Compositional Consistency for Video Question Answering

Recent video question answering benchmarks indicate that state-of-the-ar...
research
05/02/2022

ComPhy: Compositional Physical Reasoning of Objects and Events from Videos

Objects' motions in nature are governed by complex interactions and thei...
research
12/20/2016

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

When building artificial intelligence systems that can reason and answer...
research
07/23/2019

Graph Reasoning Networks for Visual Question Answering

The interaction between language and visual information has been emphasi...
research
02/25/2019

GQA: a new dataset for compositional question answering over real-world images

We introduce GQA, a new dataset for real-world visual reasoning and comp...
research
07/26/2018

Unified Perceptual Parsing for Scene Understanding

Humans recognize the visual world at multiple levels: we effortlessly ca...
research
04/12/2022

AGQA 2.0: An Updated Benchmark for Compositional Spatio-Temporal Reasoning

Prior benchmarks have analyzed models' answers to questions about videos...

Please sign up or login with your details

Forgot password? Click here to reset