RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning

04/24/2022
by   Xiaojian Ma, et al.

Reasoning about visual relationships is central to how humans interpret the visual world. This task remains challenging for current deep learning algorithms since it requires addressing three key technical problems jointly: 1) identifying object entities and their properties, 2) inferring semantic relations between pairs of entities, and 3) generalizing to novel object-relation combinations, i.e., systematic generalization. In this work, we use vision transformers (ViTs) as our base model for visual reasoning and make better use of concepts, defined as object entities and their relations, to improve the reasoning ability of ViTs. Specifically, we introduce a novel concept-feature dictionary to allow flexible image feature retrieval at training time with concept keys. This dictionary enables two new concept-guided auxiliary tasks: 1) a global task for promoting relational reasoning, and 2) a local task for facilitating semantic object-centric correspondence learning. To examine the systematic generalization of visual reasoning models, we introduce systematic splits for the standard HICO and GQA benchmarks. We show that the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA by 16% [...] the original split, and by 43% [...]. Analyses also reveal our model's compatibility with multiple ViT variants and robustness to hyper-parameters.
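The abstract describes the concept-feature dictionary only at a high level: image features are stored and later retrieved at training time using concept keys. The sketch below is one plausible, minimal reading of that idea and is not the authors' implementation. The class name `ConceptFeatureDictionary`, its methods, the bounded first-in-first-out queue per concept, and the example concept strings are all assumptions made for illustration.

```python
import random
from collections import defaultdict, deque


class ConceptFeatureDictionary:
    """Hypothetical concept-keyed store of image features (illustrative only)."""

    def __init__(self, max_per_concept=4):
        # Assumption: one bounded FIFO queue per concept key,
        # so the oldest stored feature is evicted first.
        self._queues = defaultdict(lambda: deque(maxlen=max_per_concept))

    def add(self, concept, feature):
        """Store a feature vector under a concept key such as 'ride bicycle'."""
        self._queues[concept].append(list(feature))

    def retrieve(self, concept, rng=random):
        """Return one stored feature for the concept, or None if none is stored."""
        queue = self._queues.get(concept)
        if not queue:
            return None
        return rng.choice(list(queue))

    def __len__(self):
        # Total number of features currently stored across all concepts.
        return sum(len(q) for q in self._queues.values())


store = ConceptFeatureDictionary(max_per_concept=2)
store.add("ride bicycle", [0.1, 0.9])
store.add("ride bicycle", [0.2, 0.8])
store.add("ride bicycle", [0.3, 0.7])  # queue is full, so [0.1, 0.9] is evicted
store.add("hold cup", [0.5, 0.5])

positive = store.retrieve("ride bicycle")  # a feature that shares the concept
missing = store.retrieve("feed horse")     # a concept with no stored features
```

In a training loop, a retrieved feature could serve as the concept-matched target for an auxiliary loss, while a `None` result would signal that the auxiliary term should be skipped for that sample. How the actual model uses retrieved features in its global and local tasks is not specified in the abstract.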


