Transformer-based stereo-aware 3D object detection from binocular images

by Hanqing Sun, et al.

Vision Transformers have shown promising progress in various object detection tasks, including monocular 2D/3D detection and surround-view 3D detection. However, when applied to the classic and essential task of stereo 3D object detection, directly adopting those surround-view Transformers leads to slow convergence and significant precision drops. We argue that one cause of this defect is that surround-view Transformers do not consider stereo-specific image correspondence information: in a surround-view system, the overlapping areas between views are small, so correspondence is not a primary issue. In this paper, we explore the model design of vision Transformers for stereo 3D object detection, focusing particularly on extracting and encoding the task-specific image correspondence information. To this end, we present TS3D, a Transformer-based Stereo-aware 3D object detector. In TS3D, a Disparity-Aware Positional Encoding (DAPE) model is proposed to embed the image correspondence information into stereo features. The correspondence is encoded as normalized disparity and is used in conjunction with sinusoidal 2D positional encoding to provide the location information of the 3D scene. To extract enriched multi-scale stereo features, we propose a Stereo Reserving Feature Pyramid Network (SRFPN). The SRFPN is designed to preserve the correspondence information while fusing intra-scale and aggregating cross-scale stereo features. Our proposed TS3D achieves an average precision of 41.29 on the KITTI test set and takes 88 ms to detect objects from each binocular image pair. It is competitive with advanced counterparts in terms of both precision and inference speed.
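The idea of pairing normalized disparity with a sinusoidal 2D positional encoding can be illustrated with a minimal sketch. This is not the TS3D implementation: the abstract does not say how the two signals are combined, so the sketch simply assumes that disparity is encoded with the same sinusoidal scheme and concatenated to the 2D encoding. The function names, the `max_disparity` normalizer, the channel count `dim`, and the `2*pi` scaling are all hypothetical choices made for illustration.

```python
import math


def sinusoidal_1d(pos, dim, temperature=10000.0):
    """Standard sinusoidal encoding of a scalar into `dim` channels (dim must be even)."""
    enc = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        enc.append(math.sin(pos / freq))
        enc.append(math.cos(pos / freq))
    return enc


def disparity_aware_encoding(u, v, disparity, width, height, max_disparity, dim=8):
    """Hypothetical sketch: concatenate sinusoidal codes of (u, v) and normalized disparity.

    Assumption: concatenation is used here only for illustration; the paper may
    combine the terms differently.
    """
    scale = 2 * math.pi
    # Normalize pixel coordinates to [0, 1].
    u_n = u / (width - 1)
    v_n = v / (height - 1)
    # Normalize disparity to [0, 1], clamping out-of-range values.
    d_n = min(max(disparity / max_disparity, 0.0), 1.0)
    return (
        sinusoidal_1d(u_n * scale, dim)
        + sinusoidal_1d(v_n * scale, dim)
        + sinusoidal_1d(d_n * scale, dim)
    )


# Two features at the same pixel but with different disparities share the
# 2D part of the code and differ only in the disparity part.
near = disparity_aware_encoding(100, 50, 96.0, width=1280, height=384, max_disparity=192.0)
far = disparity_aware_encoding(100, 50, 4.0, width=1280, height=384, max_disparity=192.0)
```

In this sketch the first `2 * dim` channels depend only on image position, while the last `dim` channels vary with disparity, so features at the same pixel location but at different depths receive different codes.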




EGFN: Efficient Geometry Feature Network for Fast Stereo 3D Object Detection

Fast stereo based 3D object detectors have made great progress in the se...

Deep Laparoscopic Stereo Matching with Transformers

The self-attention mechanism, successfully employed with the transformer...

ORA3D: Overlap Region Aware Multi-view 3D Object Detection

In multi-view 3D object detection tasks, disparity supervision over over...

Triangulation Learning Network: from Monocular to Stereo 3D Object Detection

In this paper, we study the problem of 3D object detection from stereo i...

YOLOStereo3D: A Step Back to 2D for Efficient Stereo 3D Detection

Object detection in 3D with stereo cameras is an important problem in co...

Epipolar Transformers

A common approach to localize 3D human joints in a synchronized and cali...

MVSFormer: Multi-View Stereo with Pre-trained Vision Transformers and Temperature-based Depth

Feature representation learning is the key recipe for learning-based Mul...
