AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection

07/21/2022
by Zehui Chen, et al.

Point clouds and RGB images are two general perceptual sources in autonomous driving. The former provides accurate object localization, while the latter is denser and richer in semantic information. Recently, AutoAlign presented a learnable paradigm for combining these two modalities in 3D object detection. However, it suffers from the high computational cost introduced by global attention. To solve this problem, we propose the Cross-Domain DeformCAFA module. It attends to sparse learnable sampling points for cross-modal relational modeling, which improves tolerance to calibration errors and greatly speeds up feature aggregation across modalities. To overcome the complexity of GT-AUG under multi-modal settings, we design a simple yet effective cross-modal augmentation strategy based on convex combinations of image patches, given their depth information. Moreover, by adopting a novel image-level dropout training scheme, our model is able to infer in a dynamic manner. Combining these components, we propose AutoAlignV2, a faster and stronger multi-modal 3D detection framework built on top of AutoAlign. Extensive experiments on the nuScenes benchmark demonstrate the effectiveness and efficiency of AutoAlignV2. Notably, our best model reaches 72.4 NDS on the nuScenes test leaderboard, achieving new state-of-the-art results among all published multi-modal 3D object detectors. Code will be available at https://github.com/zehuichen123/AutoAlignV2.
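The core of the Cross-Domain DeformCAFA module is deformable cross-attention: each point-cloud feature attends to only a handful of learnable sampling points in the image feature map, rather than to every pixel as in global attention. Below is a minimal PyTorch sketch of that idea; the module name, tensor shapes, and number of sampling points are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossModalAggregation(nn.Module):
    """Sketch: each 3D query samples a few learnable points around its
    projected 2D reference location instead of attending to all pixels.
    Shapes and names are illustrative, not the paper's exact API."""

    def __init__(self, dim, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, num_points * 2)  # per-point (dx, dy)
        self.weight_proj = nn.Linear(dim, num_points)      # per-point attention
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, img_feat):
        # queries:    (B, Q, C)  voxel/instance features from the point cloud
        # ref_points: (B, Q, 2)  projected 2D positions, normalized to [-1, 1]
        # img_feat:   (B, C, H, W) image feature map
        B, Q, C = queries.shape
        offsets = self.offset_proj(queries).view(B, Q, self.num_points, 2)
        weights = self.weight_proj(queries).softmax(dim=-1)           # (B, Q, P)
        # Sampling locations = projected reference point + sparse offsets;
        # learned offsets also absorb small calibration errors.
        locs = (ref_points.unsqueeze(2) + offsets).clamp(-1, 1)       # (B, Q, P, 2)
        # Bilinear sampling of image features at the sparse locations only.
        sampled = F.grid_sample(img_feat, locs, align_corners=False)  # (B, C, Q, P)
        sampled = sampled.permute(0, 2, 3, 1)                         # (B, Q, P, C)
        fused = (weights.unsqueeze(-1) * sampled).sum(dim=2)          # (B, Q, C)
        return self.out_proj(fused)
```

Because each query touches only `num_points` locations instead of the full H x W grid, the cost of cross-modal aggregation drops from quadratic in image size to linear in the number of queries.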
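The cross-modal augmentation can be pictured as follows: when GT-AUG pastes sampled objects into the point cloud, the corresponding image patches are composited onto the camera image in depth order via a convex combination, so occlusion relationships stay plausible. The helper below is a hedged sketch of this procedure; the function name, patch layout, and mixing weight are assumptions for illustration.

```python
import torch

def mix_patches(canvas, patches, depths, alpha=0.5):
    """Sketch: paste image patches far-to-near, blending each patch with
    the existing canvas by a convex combination (alpha is illustrative).
    canvas:  (C, H, W) image tensor, modified in place
    patches: list of (x, y, patch) with patch of shape (C, h, w)
    depths:  list of per-patch depths from the pasted 3D objects
    """
    order = sorted(range(len(patches)), key=lambda i: -depths[i])  # far first
    for i in order:
        x, y, patch = patches[i]
        h, w = patch.shape[-2:]
        region = canvas[..., y:y + h, x:x + w]
        # Convex combination keeps both pasted and original content visible,
        # with nearer patches composited last so they occlude farther ones.
        canvas[..., y:y + h, x:x + w] = alpha * patch + (1 - alpha) * region
    return canvas
```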
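The image-level dropout training scheme that enables dynamic inference can be sketched simply: during training, the image branch is randomly zeroed out for some samples, so the fused detector also learns a LiDAR-only path and can run with or without images at test time. The function name and dropout rate below are assumptions, not the paper's setting.

```python
import torch

def image_level_dropout(img_feat, p=0.5, training=True):
    """Sketch: with probability p, drop the image features of a sample
    so the detector also learns to work from the point cloud alone."""
    if not training:
        return img_feat  # at inference, use whatever images are available
    keep = (torch.rand(img_feat.shape[0], device=img_feat.device) > p).float()
    return img_feat * keep.view(-1, 1, 1, 1)  # zero out dropped samples
```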

