UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

08/15/2023
by Haiyang Wang, et al.

Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computational overhead and inefficient collaboration between different sensor data. In this paper, we present UniTR, an efficient multi-modal backbone for outdoor 3D perception that processes a variety of modalities with unified modeling and shared parameters. Unlike previous works, UniTR introduces a modality-agnostic transformer encoder that handles these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction, without additional fusion steps. More importantly, to make full use of these complementary sensor types, we present a novel multi-modal integration strategy that considers both semantically rich 2D perspective relations and geometry-aware 3D sparse neighborhood relations. UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state-of-the-art on the nuScenes benchmark, achieving +1.1 NDS for 3D object detection and +12.0 mIoU for BEV map segmentation, with lower inference latency. Code will be available at https://github.com/Haiyang-W/UniTR .
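Since the abstract only sketches the architecture, the snippet below is a minimal PyTorch-style illustration of the core idea: one transformer block with weights shared across modalities, applied first to per-modality token groups in parallel and then to mixed groups so cross-modal interaction happens inside the same block rather than in a separate fusion module. All names here (UnifiedBlock, the group shapes, the regrouping step) are hypothetical illustrations, not UniTR's actual API; the real implementation, including the 2D-perspective and 3D-neighborhood partitioning, is in the linked repository.

```python
# Hypothetical sketch of a shared-parameter, modality-agnostic transformer
# block; not UniTR's actual code.
import torch
import torch.nn as nn

class UnifiedBlock(nn.Module):
    """One transformer block whose weights are shared across modalities.

    Tokens from each sensor (image patches, voxel features) first attend
    within modality-specific groups, then within mixed groups that place
    2D and 3D neighbors together, so cross-modal interaction happens in
    the same block with no extra fusion step.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (num_groups, group_size, dim); each group holds tokens
        # that should attend to each other (single- or cross-modal).
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens

block = UnifiedBlock(dim=128)
cam_tokens = torch.randn(64, 32, 128)    # grouped image tokens (toy sizes)
lidar_tokens = torch.randn(48, 32, 128)  # grouped voxel tokens (toy sizes)

# Intra-modal step: each modality runs in parallel through the same weights.
cam_tokens, lidar_tokens = block(cam_tokens), block(lidar_tokens)

# Cross-modal step: regroup tokens by 2D-perspective or 3D-neighborhood
# proximity (regrouping logic omitted here), then reuse the identical block.
mixed = torch.cat([cam_tokens.flatten(0, 1), lidar_tokens.flatten(0, 1)])
mixed = block(mixed.reshape(-1, 32, 128))
```

The point of the sketch is the reuse of a single `block` instance for both partitionings: because the parameters are modality-agnostic, no modality-specific branches or post-hoc fusion layers are needed.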

