X-Align: Cross-Modal Cross-View Alignment for Bird's-Eye-View Segmentation

10/13/2022
by Shubhankar Borse, et al.

The bird's-eye-view (BEV) grid is a common representation for perceiving road components, e.g., the drivable area, in autonomous driving. Most existing approaches rely on cameras alone to perform segmentation in BEV space, which is fundamentally constrained by the absence of reliable depth information. The latest works leverage both camera and LiDAR modalities but fuse their features sub-optimally, using simple, concatenation-based mechanisms. In this paper, we address these problems by enhancing the alignment of the unimodal features in order to aid feature fusion, as well as enhancing the alignment between the cameras' perspective view (PV) and the BEV representation. We propose X-Align, a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation consisting of the following components: (i) a novel Cross-Modal Feature Alignment (X-FA) loss, (ii) an attention-based Cross-Modal Feature Fusion (X-FF) module to align multi-modal BEV features implicitly, and (iii) an auxiliary PV segmentation branch with Cross-View Segmentation Alignment (X-SA) losses to improve the PV-to-BEV transformation. We evaluate our proposed method across two commonly used benchmark datasets, i.e., nuScenes and KITTI-360. Notably, X-Align significantly outperforms the state of the art by 3 absolute mIoU points on nuScenes. We also provide extensive ablation studies to demonstrate the effectiveness of the individual components.
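The abstract does not spell out the internals of the attention-based X-FF module, but the general idea of attention-weighted fusion of camera and LiDAR BEV feature maps can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the scoring weights `w` (standing in for a learned 1x1 convolution) and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(cam_bev, lidar_bev, w):
    """Fuse two BEV feature maps of shape (C, H, W) with per-cell attention.

    A small scoring head (here a 1x1 conv simulated by the matrix `w`,
    shape (2, 2*C)) produces two logits per BEV cell; a softmax turns
    them into mixing weights for the camera and LiDAR features, so the
    model can lean on LiDAR where camera depth is unreliable and
    vice versa.
    """
    stacked = np.concatenate([cam_bev, lidar_bev], axis=0)  # (2C, H, W)
    logits = np.einsum('kc,chw->khw', w, stacked)           # (2, H, W)
    alpha = softmax(logits, axis=0)                         # attention weights
    return alpha[0] * cam_bev + alpha[1] * lidar_bev        # (C, H, W)
```

With zero scoring weights the softmax yields 0.5/0.5 everywhere and the fusion reduces to a plain average; training would push the weights away from this uniform baseline, which is what distinguishes attention-based fusion from fixed concatenation.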


research
08/12/2023

BEV-DG: Cross-Modal Learning under Bird's-Eye View for Domain Generalization of 3D Semantic Segmentation

Cross-modal Unsupervised Domain Adaptation (UDA) aims to exploit the com...
research
11/23/2022

How do Cross-View and Cross-Modal Alignment Affect Representations in Contrastive Learning?

Various state-of-the-art self-supervised visual representation learning ...
research
09/11/2023

FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection

Multi-sensor modal fusion has demonstrated strong advantages in 3D objec...
research
04/01/2020

Shared Cross-Modal Trajectory Prediction for Autonomous Driving

We propose a framework for predicting future trajectories of traffic age...
research
06/05/2019

OctopusNet: A Deep Learning Segmentation Network for Multi-modal Medical Images

Deep learning models, such as the fully convolutional network (FCN), hav...
research
09/01/2020

Practical Cross-modal Manifold Alignment for Grounded Language

We propose a cross-modality manifold alignment procedure that leverages ...
research
08/02/2021

Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning

This paper introduces a two-phase deep feature calibration framework for...
