Text to Point Cloud Localization with Relation-Enhanced Transformer

by Guangzhi Wang, et al.

Automatically localizing a position from a few natural language instructions is essential for future robots to communicate and collaborate with humans. Toward this goal, we focus on the text-to-point-cloud cross-modal localization problem: given a textual query, the aim is to identify the described location within city-scale point clouds. The task poses two challenges. 1) In city-scale point clouds, similar ambient instances may exist in several locations, so searching a huge point cloud with only instances as guidance can yield weakly discriminative signals and incorrect results. 2) In textual descriptions, the hints are provided separately, so the relations among them are not explicitly described and are difficult to learn. To overcome these two challenges, we propose a unified Relation-Enhanced Transformer (RET) to improve representation discriminability for both point clouds and natural language queries. The core of RET is a novel Relation-enhanced Self-Attention (RSA) mechanism, which explicitly encodes instance (hint)-wise relations for the two modalities. Moreover, we propose a fine-grained cross-modal matching method to further refine the location predictions in a subsequent instance-hint matching stage. Experimental results on the KITTI360Pose dataset demonstrate that our approach surpasses the previous state-of-the-art method by a large margin.
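
The abstract does not spell out how RSA injects relations into attention, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each instance (or hint) has a feature vector and a 2D center, and that a pairwise relation score, computed from the offset and distance between centers, is added to the usual scaled dot-product logits before the softmax. The function name relation_enhanced_attention, the relation features, and the weight vector w_r are all hypothetical.

```python
import math

def matvec(vec, mat):
    """Multiply a row vector (length d_in) by a matrix (d_in x d_out)."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def relation_enhanced_attention(feats, centers, w_q, w_k, w_v, w_r):
    """Self-attention whose logits carry an explicit pairwise relation term.

    Assumption (not from the paper): the relation between instances i and j
    is a linear score over [offset..., distance] of their centers.
    """
    n = len(feats)
    q = [matvec(f, w_q) for f in feats]
    k = [matvec(f, w_k) for f in feats]
    v = [matvec(f, w_v) for f in feats]
    scale = math.sqrt(len(q[0]))
    attn = []
    for i in range(n):
        logits = []
        for j in range(n):
            # Standard content term: scaled dot product of query and key.
            content = sum(a * b for a, b in zip(q[i], k[j])) / scale
            # Hypothetical relation term from the geometry of the pair.
            offset = [ci - cj for ci, cj in zip(centers[i], centers[j])]
            dist = math.sqrt(sum(o * o for o in offset))
            relation = sum(r * w for r, w in zip(offset + [dist], w_r))
            logits.append(content + relation)
        attn.append(softmax(logits))
    # Each output row is the attention-weighted sum of the value vectors.
    out = [[sum(attn[i][j] * v[j][c] for j in range(n))
            for c in range(len(v[0]))] for i in range(n)]
    return out, attn

# Toy input: three instances with 2-D features and 2-D centers.
identity = [[1.0, 0.0], [0.0, 1.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
centers = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

# Zero relation weights reduce this to plain self-attention.
out_plain, attn_plain = relation_enhanced_attention(
    feats, centers, identity, identity, identity, [0.0, 0.0, 0.0])
# A negative weight on distance favors spatially closer instances.
out_rel, attn_rel = relation_enhanced_attention(
    feats, centers, identity, identity, identity, [0.0, 0.0, -1.0])
```

With the relation weights set to zero the sketch falls back to ordinary self-attention, which makes the contribution of the relation term easy to isolate. The same function could in principle be applied to textual hints by swapping the geometric relation features for features derived from the hint descriptions, though that choice is likewise an assumption.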




