Agglomerative Transformer for Human-Object Interaction Detection

08/16/2023
by Danyang Tu, et al.

We propose an agglomerative Transformer (AGER) that, for the first time, enables Transformer-based human-object interaction (HOI) detectors to flexibly exploit extra instance-level cues in a single-stage, end-to-end manner. AGER acquires instance tokens by dynamically clustering patch tokens and aligning cluster centers to instances with textual guidance, which brings two benefits: 1) Integrality: each instance token is encouraged to contain all discriminative feature regions of an instance, which significantly improves the extraction of different instance-level cues and leads to a new state-of-the-art HOI detection performance of 36.75 mAP on HICO-Det. 2) Efficiency: the dynamic clustering mechanism allows AGER to generate instance tokens jointly with the feature learning of the Transformer encoder, eliminating the need for the additional object detector or instance decoder used in prior methods, and thus allowing the desired extra cues for HOI detection to be extracted in a single-stage, end-to-end pipeline. Concretely, AGER reduces GFLOPs by 8.5% compared with a DETR-like pipeline without extra cue extraction.
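The idea of clustering patch tokens into a smaller set of instance tokens can be illustrated with a minimal sketch. This is not the AGER implementation: the function name `cluster_patch_tokens`, the `temperature` parameter, the fixed iteration count, and the plain soft k-means update are all assumptions made for illustration, and the textual-guidance alignment described in the abstract is left out entirely.

```python
import numpy as np


def cluster_patch_tokens(patches, centers, n_iters=3, temperature=1.0):
    """Softly assign patch tokens to cluster centers and refine the centers.

    Hypothetical helper for illustration only.
    patches: (N, D) array of patch tokens.
    centers: (K, D) array of initial cluster centers.
    Returns (centers, assign), where assign is (N, K) with rows summing to 1.
    """
    if n_iters < 1:
        raise ValueError("n_iters must be at least 1")
    scale = temperature * np.sqrt(patches.shape[1])
    for _ in range(n_iters):
        # Similarity between every patch and every center, scaled as in attention.
        logits = patches @ centers.T / scale
        # Softmax over clusters, shifted by the row maximum for numerical stability.
        logits = logits - logits.max(axis=1, keepdims=True)
        assign = np.exp(logits)
        assign = assign / assign.sum(axis=1, keepdims=True)
        # Each center becomes the assignment-weighted mean of the patch tokens.
        weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
        centers = weights.T @ patches
    return centers, assign


rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(16, 8))   # 16 patch tokens of width 8
init_centers = rng.normal(size=(4, 8))    # 4 candidate instance tokens
instance_tokens, assignment = cluster_patch_tokens(patch_tokens, init_centers)
```

Here the 16 patch tokens are summarized by 4 cluster centers, so any later stage operates on far fewer tokens. In the paper's setting the clustering is learned jointly with the encoder rather than run as a separate fixed procedure like this one.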

research
06/04/2022

Video-based Human-Object Interaction Detection from Tubelet Tokens

We present a novel vision Transformer, named TUTOR, which is able to lea...
research
03/09/2023

Efficient Transformer-based 3D Object Detection with Dynamic Token Halting

Balancing efficiency and accuracy is a long-standing problem for deployi...
research
03/27/2023

Object Discovery from Motion-Guided Tokens

Object discovery – separating objects from the background without manual...
research
08/06/2022

IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation

Video 3D human pose estimation aims to localize the 3D coordinates of hu...
research
07/11/2022

Instance Shadow Detection with A Single-Stage Detector

This paper formulates a new problem, instance shadow detection, which ai...
research
11/29/2021

Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity

DETR is the first end-to-end object detector using a transformer encoder...
research
11/03/2022

PolyBuilding: Polygon Transformer for End-to-End Building Extraction

We present PolyBuilding, a fully end-to-end polygon Transformer for buil...
