Unifying Voxel-based Representation with Transformer for 3D Object Detection

06/01/2022
by Yanwei Li, et al.

In this work, we present a unified framework for multi-modality 3D object detection, named UVTR. The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection. To this end, a modality-specific space is first designed to represent the different inputs in the voxel feature space. Unlike previous work, our approach preserves the voxel space without height compression, which alleviates semantic ambiguity and enables spatial interactions. Benefiting from this unified representation, cross-modality interaction is then proposed to make full use of the inherent properties of the different sensors, including knowledge transfer and modality fusion. In this way, the geometry-aware expressions in point clouds and the context-rich features in images are both exploited for better performance and robustness. A transformer decoder is applied to efficiently sample features from the unified space at learnable positions, which facilitates object-level interactions. In general, UVTR presents an early attempt to represent different modalities in a unified framework. It surpasses previous work in single- and multi-modality entries and achieves leading performance in the nuScenes test set with 69.7 LiDAR, camera, and multi-modality inputs, respectively. Code is made available at https://github.com/dvlab-research/UVTR.
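
The two ideas in the abstract that are easiest to picture in code are the height-preserving voxel grid shared by both modalities and the sampling of that grid at query positions. The following is a minimal sketch under stated assumptions, not the UVTR implementation: the `grid[z][y][x]` layout, the function names `fuse_voxels` and `sample_voxel`, element-wise summation as the fusion step, and nearest-neighbour lookup as the sampling step are all hypothetical stand-ins chosen for illustration, and the query positions are fixed here rather than learned.

```python
from typing import List, Tuple

# Hypothetical layout: grid[z][y][x] holds one feature vector (list of floats).
# The z (height) axis is kept explicitly instead of being collapsed.
VoxelGrid = List[List[List[List[float]]]]


def fuse_voxels(lidar: VoxelGrid, camera: VoxelGrid) -> VoxelGrid:
    """Fuse two same-shaped voxel grids by element-wise summation.

    Summation is an assumption made for this sketch only.
    """
    return [
        [
            [[a + b for a, b in zip(cell_l, cell_c)]
             for cell_l, cell_c in zip(row_l, row_c)]
            for row_l, row_c in zip(plane_l, plane_c)
        ]
        for plane_l, plane_c in zip(lidar, camera)
    ]


def sample_voxel(grid: VoxelGrid,
                 position: Tuple[float, float, float]) -> List[float]:
    """Return the feature at a normalized (x, y, z) position in [0, 1].

    Uses nearest-neighbour lookup and clamps out-of-range coordinates.
    """
    depth, height, width = len(grid), len(grid[0]), len(grid[0][0])

    def to_index(coord: float, size: int) -> int:
        index = int(coord * (size - 1) + 0.5)  # round to the nearest cell
        return min(max(index, 0), size - 1)    # clamp to the grid bounds

    x, y, z = position
    return grid[to_index(z, depth)][to_index(y, height)][to_index(x, width)]


# Toy 2x2x2 grids with one channel: the LiDAR feature encodes the height
# index, the camera feature is a constant.
lidar = [[[[float(z)] for _x in range(2)] for _y in range(2)] for z in range(2)]
camera = [[[[0.5] for _x in range(2)] for _y in range(2)] for _z in range(2)]

unified = fuse_voxels(lidar, camera)
low = sample_voxel(unified, (0.0, 0.0, 0.0))   # query at the bottom layer
high = sample_voxel(unified, (0.0, 0.0, 1.0))  # same (x, y), top layer
print(low, high)
```

Because the height axis is kept, the two queries share the same (x, y) location yet read different features. A bird's-eye-view grid that collapses height would return a single value for both.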

research
05/31/2022

Voxel Field Fusion for 3D Object Detection

In this work, we present a conceptually simple yet effective framework f...
research
10/30/2021

Cross-Modality Fusion Transformer for Multispectral Object Detection

Multispectral image pairs can provide the combined information, making o...
research
09/22/2022

FusionRCNN: LiDAR-Camera Fusion for Two-stage 3D Object Detection

3D object detection with multi-sensors is essential for an accurate and ...
research
07/12/2022

Paint and Distill: Boosting 3D Object Detection with Semantic Passing Network

3D object detection task from lidar or camera sensors is essential for a...
research
07/03/2022

You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers

Most systems use different models for different modalities, such as one ...
research
08/18/2020

AssembleNet++: Assembling Modality Representations via Attention Connections

We create a family of powerful video models which are able to: (i) learn...
research
04/28/2023

Exploiting the Distortion-Semantic Interaction in Fisheye Data

In this work, we present a methodology to shape a fisheye-specific repre...
