Visual Composite Set Detection Using Part-and-Sum Transformers

05/05/2021
by   Qi Dong, et al.
8

Computer vision applications such as visual relationship detection and human-object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end composite set detection. Different from existing Transformers in which queries are at a single level, we simultaneously model the joint part and sum hypotheses/interactions with composite queries and attention modules. We explicitly incorporate sum queries to enable better modeling of the part-and-sum relations that are absent in the standard Transformers. Our approach also uses novel tensor-based part queries and vector-based sum queries, and models their joint interaction. We report experiments on two vision tasks, visual relationship detection, and human-object interaction, and demonstrate that PST achieves state-of-the-art results among single-stage models, while nearly matching the results of custom-designed two-stage models.

READ FULL TEXT

page 8

page 13

page 14

page 15

research
06/07/2023

2D Object Detection with Transformers: A Review

Astounding performance of Transformers in natural language processing (N...
research
04/28/2021

HOTR: End-to-End Human-Object Interaction Detection with Transformers

Human-Object Interaction (HOI) detection is a task of identifying "a set...
research
06/09/2023

Single-Stage Visual Relationship Learning using Conditional Queries

Research in scene graph generation (SGG) usually considers two-stage mod...
research
12/03/2021

Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer

Recent developments in transformer models for visual data have led to si...
research
01/06/2021

Line Segment Detection Using Transformers without Edges

In this paper, we present a holistically end-to-end algorithm for line s...
research
04/11/2022

Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection

Human-Object Interaction detection is a holistic visual recognition task...
research
12/15/2020

Enhance Multimodal Transformer With External Label And In-Domain Pretrain: Hateful Meme Challenge Winning Solution

Hateful meme detection is a new research area recently brought out that ...

Please sign up or login with your details

Forgot password? Click here to reset