Spatial-Temporal Enhanced Transformer Towards Multi-Frame 3D Object Detection

07/01/2023
by Yifan Zhang, et al.

The Detection Transformer (DETR) has revolutionized the design of CNN-based object detection systems, showcasing impressive performance. However, its potential in multi-frame 3D object detection remains largely unexplored. In this paper, we present STEMD, a novel end-to-end framework for multi-frame 3D object detection based on the DETR-like paradigm. Our approach treats multi-frame 3D object detection as a sequence-to-sequence task and captures spatial-temporal dependencies at both the feature and query levels. To model inter-object spatial interactions and complex temporal dependencies, we introduce a spatial-temporal graph attention network. This network represents queries as nodes in a graph and models object interactions within a social context. In addition, because the encoder's proposals for the current frame can miss hard cases, we use the output of the previous frame to initialize the query input of the decoder. We also address redundant detections, where the model generates numerous overlapping boxes from similar queries, by introducing an IoU regularization term in the loss function. During refinement, this term helps distinguish queries matched with a ground-truth box from similar but unmatched queries, reducing redundancy and yielding more accurate detections. Extensive experiments demonstrate the effectiveness of our approach in challenging scenarios at only a minor additional computational cost. The code will be available at <https://github.com/Eaphan/STEMD>.
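The abstract does not give the form of the IoU regularization term, so the following is only a rough sketch of one way such a penalty could look, not the authors' formulation. It assumes axis-aligned 2D boxes (a simplification of rotated 3D boxes) and penalizes the mean overlap between predictions matched to ground truth and unmatched predictions. The names `iou` and `iou_regularization` are hypothetical.

```python
from typing import List, Sequence

# Hypothetical simplification: axis-aligned boxes as (x_min, y_min, x_max, y_max).
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_regularization(boxes: List[Box], matched: List[bool]) -> float:
    """Mean IoU over all (matched, unmatched) prediction pairs.

    Assumed form of the penalty: a larger value means unmatched queries
    sit on top of matched ones, i.e. more redundant boxes.
    """
    matched_boxes = [b for b, m in zip(boxes, matched) if m]
    unmatched_boxes = [b for b, m in zip(boxes, matched) if not m]
    if not matched_boxes or not unmatched_boxes:
        return 0.0  # nothing to separate
    total = sum(iou(m, u) for m in matched_boxes for u in unmatched_boxes)
    return total / (len(matched_boxes) * len(unmatched_boxes))


# One matched box, one overlapping duplicate, one distant box.
predictions = [(0, 0, 2, 2), (1, 0, 3, 2), (10, 10, 12, 12)]
is_matched = [True, False, False]
penalty = iou_regularization(predictions, is_matched)
```

In a training setup, a term like `penalty` would be scaled by a weight and added to the detection loss, so that unmatched queries are pushed away from boxes already claimed by matched queries.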


