3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection

by Junyu Luo et al.

3D visual grounding aims to locate the referred target object in a 3D point cloud scene according to a free-form language description. Previous methods mostly follow a two-stage paradigm, i.e., language-irrelevant detection followed by cross-modal matching, which is limited by its isolated architecture. In this paradigm, the detector must sample keypoints from the raw point cloud, owing to the irregular, large-scale nature of 3D point clouds, and generate an object proposal for each keypoint. However, sparse proposals may miss the target during detection, while dense proposals may confuse the matching model. Moreover, the language-irrelevant detection stage samples only a small proportion of keypoints on the target, degrading the target prediction. In this paper, we propose a 3D Single-Stage Referred Point Progressive Selection (3D-SPS) method, which progressively selects keypoints under the guidance of language and directly locates the target. Specifically, we propose a Description-aware Keypoint Sampling (DKS) module to coarsely focus on points of language-relevant objects, which provide significant clues for grounding. In addition, we devise a Target-oriented Progressive Mining (TPM) module to finely concentrate on points of the target, enabled by progressive intra-modal relation modeling and inter-modal target mining. 3D-SPS bridges the gap between detection and matching in the 3D visual grounding task, localizing the target in a single stage. Experiments demonstrate that 3D-SPS achieves state-of-the-art performance on both the ScanRefer and Nr3D/Sr3D datasets.
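The core idea of description-aware keypoint selection can be sketched as follows. This is a minimal illustration only, not the paper's actual DKS module: the function name, the dot-product scoring, and the simple top-k rule are all assumptions standing in for the learned cross-modal scoring described in the abstract.

```python
import numpy as np

def select_keypoints(point_feats, text_feat, k):
    """Language-guided keypoint selection (hypothetical sketch):
    score each point by its similarity to the description feature,
    then keep the k highest-scoring points as keypoints."""
    scores = point_feats @ text_feat          # (N,) relevance score per point
    top_idx = np.argsort(scores)[-k:][::-1]   # indices of the k best points
    return top_idx, point_feats[top_idx]

# Toy example: 6 points with 4-dim features and one description feature.
rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 4))
txt = rng.normal(size=4)
idx, keypoints = select_keypoints(pts, txt, k=3)
print(idx.shape, keypoints.shape)  # (3,) (3, 4)
```

In the actual method, this coarse selection would be repeated and refined by the TPM module, which alternates relation modeling among the surviving points with further language-conditioned filtering.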


