3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection

by Junyu Luo, et al.

3D visual grounding aims to locate the referred target object in 3D point cloud scenes according to a free-form language description. Previous methods mostly follow a two-stage paradigm, i.e., language-irrelevant detection and cross-modal matching, which is limited by the isolated architecture. In such a paradigm, the detector needs to sample keypoints from raw point clouds due to the inherent properties of 3D point clouds (irregular and large-scale), to generate the corresponding object proposal for each keypoint. However, sparse proposals may leave out the target in detection, while dense proposals may confuse the matching model. Moreover, the language-irrelevant detection stage can only sample a small proportion of keypoints on the target, deteriorating the target prediction. In this paper, we propose a 3D Single-Stage Referred Point Progressive Selection (3D-SPS) method, which progressively selects keypoints with the guidance of language and directly locates the target. Specifically, we propose a Description-aware Keypoint Sampling (DKS) module to coarsely focus on the points of language-relevant objects, which are significant clues for grounding. Besides, we devise a Target-oriented Progressive Mining (TPM) module to finely concentrate on the points of the target, which is enabled by progressive intra-modal relation modeling and inter-modal target mining. 3D-SPS bridges the gap between detection and matching in the 3D visual grounding task, localizing the target at a single stage. Experiments demonstrate that 3D-SPS achieves state-of-the-art performance on both ScanRefer and Nr3D/Sr3D datasets.





1 Introduction

Figure 1: Traditional two-stage 3D VG methods are limited by the isolation of the detection stage and the matching stage. (a) Sparse proposals may leave out the target in detection. (b) Dense proposals could confuse the matching model. (c) 3D-SPS progressively selects keypoints (blue points → red points → green points) and performs referring at a single stage. Note that dense surfaces are shown only to help readers understand the example 3D scene; the input of our method contains only sparse point clouds.

Visual Grounding (VG) aims to localize the target object in the scene based on an object-related linguistic description. In recent years, the 3D VG task has received increasing attention owing to its wide applications, such as autonomous robots and human-machine interaction in AR/VR/Metaverse. Even though much progress [yang2019dynamic, yang2019cross, wang2019neighbourhood, wang2018learning, zhang2018grounding, yu2018mattnet, yu2016modeling, sadhu2019zero, yang2019fast, yang2020improving] has been achieved in the 2D VG task, it is still challenging to locate the referred target object in 3D scenes since point clouds are irregular and large-scale.

Existing 3D VG methods [scanrefer, Yuan_2021_ICCV, sat, Zhao_2021_ICCV, he2021transrefer3d, Feng_2021_ICCV] are mainly based on the detection-then-matching two-stage pipeline. The first stage is language-irrelevant detection, where general 3D object detectors [DBLP:conf/iccv/QiLHG19, groupfree, DBLP:conf/cvpr/ChengSSY021] are adopted to produce numerous object proposals. The second stage is cross-modal matching, where specific vision-language attention mechanisms are usually designed to match the proposal and the description. Previous methods primarily focus on the second stage, i.e., exploring relations among proposals to distinguish the target object.

We argue that the separation of the two stages limits the existing methods. Previous 2D detection methods adopt data-independent anchor boxes as proposals on regular and well-organized images. However, the anchor-based fashion is generally impractical for large-scale and irregular 3D point clouds. Consequently, the 3D detector used in the first stage needs to sample a limited number of keypoints to represent the whole scene and generate a corresponding proposal for each keypoint. However, sparse proposals may leave out the target in the detection stage (e.g., the sofa chair in Figure 1 (a)), which makes it impossible to locate the target in the matching stage. Meanwhile, dense proposals may contain redundant objects, making the inter-proposal relationships so complex that the matching module struggles to distinguish the target. As shown in Figure 1 (b), it is difficult to select the right sofa chair from these numerous proposals with similar appearances. Therefore, two-stage grounding methods face a dilemma in choosing the number of proposals. Besides, the keypoint sampling strategy (e.g., Farthest Point Sampling (FPS) [pointnet++]) usually adopted by the first-stage detector is also language-irrelevant. The strategy aims to sample keypoints that cover the entire scene as much as possible in order to detect all potential objects. Thus, the proportion of target keypoints is relatively small, which is unfavorable for the target prediction.

To address the aforementioned issues, we propose a 3D Single-Stage Referred Point Progressive Selection (3D-SPS) method in this paper. Our main idea is to progressively select keypoints under the guidance of the language description throughout the whole process, as shown in Figure 1 (c). Based on this idea, we propose a Description-aware Keypoint Sampling (DKS) module to coarsely focus on the points of language-relevant objects, e.g., sofa chair, couch, and table in Figure 1 (c). These keypoints provide significant clues for localizing the grounding target in the following cross-modal interaction. Besides, we devise a Target-oriented Progressive Mining (TPM) module, which conducts progressive mining to finely figure out the target. We leverage the self/cross-attention mechanism to model intra/inter-modal relationships respectively. In addition, we fuse the keypoint features with point features of the whole scene to achieve global localization perception. To progressively select keypoints of the target, we utilize the language-points cross-attention map to select the keypoints that the language pays more attention to and discard irrelevant points. The model gradually concentrates on the target and obtains a condensed set of keypoints through multiple layers. Thus, the proportion of target points gradually increases with richer target-related features, which benefits the target box regression. Finally, 3D-SPS distinguishes the target from the condensed keypoint set and regresses its bounding box. Note that 3D-SPS is also consistent with how humans commonly find a target object: a person first selects a coarse candidate set according to the language description and then examines the candidates closely to pick out the target [jacob2021qualitative, ullman2016atoms].

In summary, we make the following contributions:

  • We propose the 3D-SPS method, which directly performs 3D VG at a single stage to bridge the gap between detection and matching. To the best of our knowledge, 3D-SPS is the first work investigating single-stage 3D VG.

  • We treat the 3D VG task as a keypoint selection problem. Two selection modules, i.e., DKS and TPM, are designed to progressively select target-related keypoints. DKS samples the coarse language-relevant keypoints, and TPM finely mines the cross-modal relationship to distinguish the target.

  • Extensive experiments confirm the effectiveness of our method. 3D-SPS achieves state-of-the-art performance on both ScanRefer [scanrefer] and Nr3D/Sr3D [achlioptas2020referit3d] datasets. The code is available at https://github.com/fjhzhixi/3D-SPS.

Figure 2: 3D-SPS framework. We take the 3D VG task as a keypoint selection problem and avoid the separation of detection and matching. Specifically, we use PointNet++ as the backbone to extract seed points from the input point cloud. After that, DKS coarsely samples the language-relevant keypoints using the word features; in the figure, these are mostly on the kitchen cabinets, refrigerator, and oven. Then, TPM finely selects target keypoints and predicts referring confidence scores. Here the keypoints are concentrated on the target kitchen cabinet. Finally, the target box is regressed from the keypoint with the highest referring confidence score. The blue box is the ground truth. The yellow boxes are objects of the same category as the target. The green box is our target prediction. Best viewed in color.

2 Related Work

Visual Grounding on 2D Images. The goal of visual grounding on 2D images is to select a referred target according to the referring expression [hu2016natural, yu2018mattnet, nagaraja2016modeling, gao2021room]. Two mainstream frameworks have been proposed in succession: two-stage and one-stage methods. Specifically, two-stage methods [yang2019dynamic, yang2019cross, wang2019neighbourhood, wang2018learning, zhang2018grounding, yu2018mattnet, yu2016modeling, hong2019learning, liu2019learning, zhuang2018parallel] first generate region proposals with object detectors and then select the target region by matching the language features with the proposals. Each proposal is treated the same in the matching stage, even though the proposals vary in importance within the referring context. In contrast, one-stage methods [sadhu2019zero, yang2019fast, yang2020improving, chen2018real, deng2021transvg, liao2020real] eliminate the proposal generation and feature extraction stage in two-stage frameworks. In these methods, linguistic features are densely fused with each pixel or patch to generate multi-modal feature maps for regressing the bounding box.

However, one-stage methods in 2D VG cannot be directly lifted to 3D VG. First, 3D point clouds are numerous and noisy, so it is computationally unacceptable [zhou2018voxelnet, sparseconv, graham2017submanifold] to treat each point as a candidate. Second, due to the large scale and complexity of 3D scenes, it is not easy to model the relationships among all objects and figure out the target [Zhao_2021_ICCV, he2021transrefer3d, sat]. Moreover, 2D one-stage methods adopt a sliding-window manner, as in [he2016deep, simonyan2014very], which cannot deal with 3D points since 2D input is highly regular while 3D points are inherently sparse, unordered, and irregular [qi2017pointnet, pointnet++]. In this paper, we propose 3D-SPS to address the problems introduced by 3D point clouds, achieving state-of-the-art 3D VG performance.

Visual Grounding on 3D Point Clouds. With the prevalence of deep learning technologies on 3D point clouds, the 3D VG task has attracted much attention. Chen et al. [scanrefer] released a 3D VG dataset, ScanRefer, in which the bounding boxes of objects are referred to by their corresponding descriptions in an indoor scene. ReferIt3D [achlioptas2020referit3d] also proposes two datasets, i.e., Sr3D and Nr3D, for the 3D VG task.

Existing 3D VG works [scanrefer, DBLP:conf/aaai/HuangLCL21, Yuan_2021_ICCV, Zhao_2021_ICCV, sat, he2021transrefer3d, Feng_2021_ICCV, roh2021languagerefer] mainly focus on better modeling the relationships among objects to locate the target object, e.g., adopting graph neural networks [DBLP:conf/aaai/HuangLCL21] and attention mechanisms [Zhao_2021_ICCV]. To the best of our knowledge, previous 3D grounding approaches can generally be summarized as a detection-then-matching two-stage framework. In these methods, the detection stage fails to leverage the language context to concentrate on the points that are more essential to the referring task. To overcome those shortcomings, we propose the first single-stage method in 3D VG to progressively select keypoints under the guidance of the description.

3 Method

In this section, we detail the 3D-SPS method. In Sec. 3.1, we present an overview of the 3D VG task and our method. In Sec. 3.2 and Sec. 3.3, we dive into the technical details and how we obtain the target by progressive keypoint selection. In Sec. 3.4, we introduce the training objectives of 3D-SPS.

3.1 Overview

In the 3D VG task, the inputs are a point cloud and a free-form plain-text description of the target object. Each point carries its 3D coordinates and auxiliary features (RGB, normal vectors, etc.). The goal of this task is to locate the target object (i.e., the object most relevant to the description) and predict its bounding box.

The main idea of 3D-SPS is the progressive keypoint selection process, as shown in Figure 2. First, we adopt the widely used PointNet++ [pointnet++] as the backbone network to extract point features from the input point cloud. The backbone outputs a set of seed points with their coordinates and enriched local features. Meanwhile, a language encoder extracts word features from the description. Second, the DKS module selects language-relevant keypoints, together with their features, from the seed points based on the word features. These keypoints belong to the objects whose categories are mentioned in the description, providing significant clues for distinguishing the grounding target. Third, the TPM module takes the keypoint features and word features as inputs: each TPM layer takes the keypoint and word features produced by the previous layer, selects a subset of the keypoints, and outputs their features along with updated word features. TPM thus progressively distinguishes the grounding target with multi-layer cross-modal transformers. Lastly, we predict the referring confidence score from the keypoint features and the cross-modally aligned word features with a simple MLP head. The keypoint with the highest referring confidence score is used to regress the bounding box of the grounding target, i.e., its center and size.

By treating the 3D VG task as a keypoint selection problem, our 3D-SPS concentrates on distinguishing the keypoints of the target object from point clouds for predicting the bounding box directly, which is more effective than traditional detection-then-matching two-stage methods.

Figure 3: The DKS module. We use the object confidence score to select points near object centers and the description relevance score to select language-relevant points.

3.2 Description-aware Keypoint Sampling

Since the search space of 3D anchor boxes is huge, the data-independent anchor assignment strategy widely adopted in 2D object detection [DBLP:conf/nips/RenHGS15] is impractical when lifted to 3D [groupfree].

Figure 4: The TPM module. It is a two-stream cross-modal transformer model. At each layer, we select the keypoints of the target based on the language-points cross-attention map.

To this end, most 3D object detection methods [DBLP:conf/iccv/QiLHG19, groupfree, DBLP:conf/cvpr/ChengSSY021] adopt sampling methods (e.g., FPS [pointnet++]) to sample keypoints from seed points and generate a proposal for each selected point. Existing detection-then-matching methods for the 3D VG task usually use the same strategy at the detection stage. However, directly applying the detection sampling strategy to the 3D VG task is not sensible because the two tasks have different interests. The sampling objective of 3D object detection is to cover the entire scene as much as possible for detecting potential objects, while the goal of 3D VG is to locate the referred target.
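To make the language-irrelevant behavior of FPS concrete, the following is a minimal pure-Python sketch of farthest point sampling on a toy scene. It is an illustration only, not the detector code used in the cited works; the function name `farthest_point_sampling`, the `start` index, and the toy coordinates are assumptions made for this example.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices, each as far as possible from the points already chosen."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

# Toy scene: three points clustered on one object (indices 0-2) and two isolated points.
scene = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
         (5.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
picked = farthest_point_sampling(scene, 3)
print(picked)
```

Because FPS only maximizes spatial coverage, just one of the three sampled keypoints falls on the clustered object in this toy scene, regardless of which object a description refers to.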

Therefore, we propose DKS to help the model focus on the keypoints of language-relevant objects instead of the whole scene. Specifically, we bring word features into the sampling process to select keypoints of the objects whose categories are mentioned in the description. These keypoints contain the information of not only the target object but also related objects to help determine the target.

Figure 3 details the DKS module. We first predict an object confidence score from the point features to indicate whether each point is near an object center, and keep the keypoints with the highest object confidence scores.


Then a description relevance score is used to select the top keypoints that are related to the description context. We jointly use the point features and the global word features to predict the description relevance score of each point.
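The two-step sampling can be sketched as follows, under stated assumptions: both scores are taken as given per-seed numbers (in the method they are predicted by the network), the second step is assumed to rank only the survivors of the first, and the names `top_k_indices`, `dks_select`, `k_obj`, and `k_desc` are hypothetical.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first (ties broken by lower index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

def dks_select(object_conf, desc_relevance, k_obj, k_desc):
    # Step 1: keep the seeds most likely to lie near an object center.
    near_center = top_k_indices(object_conf, k_obj)
    # Step 2: among those, keep the seeds most relevant to the description.
    ranked = top_k_indices([desc_relevance[i] for i in near_center], k_desc)
    return [near_center[j] for j in ranked]

# Toy scores for six seed points (made-up values).
object_conf = [0.9, 0.2, 0.8, 0.7, 0.1, 0.6]
desc_relevance = [0.1, 0.9, 0.7, 0.8, 0.95, 0.3]
keypoints = dks_select(object_conf, desc_relevance, k_obj=4, k_desc=2)
print(keypoints)
```

In this toy input, seeds 1 and 4 are highly relevant to the description but are far from any object center, so they are filtered out in the first step.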


3.3 Target-oriented Progressive Mining

With the language-relevant keypoints coarsely selected by DKS, we perform fine target mining with the TPM module. TPM is a stack of multi-modal two-stream transformer layers, in which word features and keypoint features are processed in separate streams and interact through cross-modal attention layers to model their relationship and mine the target. Each layer selects its output keypoints from the keypoints it receives from the previous layer. TPM thus progressively narrows the keypoint set and concentrates the attention by discarding target-irrelevant keypoints in each layer.

Intra/inter-modal Modeling. As Figure 4 shows, we employ the attention mechanism [vaswani2017attention] to learn intra-modal relationships. For point features, the point self-attention block helps to refine point visual features and exploits their spatial relationship. For word features, the language self-attention block is used to extract context relationships.

In particular, we leverage a point cross-attention block to model the global location of keypoints in the scene, because the interaction among the selected keypoints alone cannot model descriptions that involve a global location, such as “in the center/corner of room”. Therefore, the scene point clouds (point features before DKS) are fused to acquire global scene features.

Next, point features and word features interact in cross-modal attention blocks. In these blocks, the points branch is assisted by word features to distinguish the target, while the language branch fuses the scene information by attending to point features.
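As a rough illustration of how the language branch can attend to point features, here is a single-head scaled dot-product cross-attention in pure Python. It is a simplified sketch: learned projections, multiple heads, and normalization layers are omitted, and the names `softmax`, `cross_attention`, `word_feats`, and `point_feats` are assumptions for this example rather than identifiers from the released code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    outputs, attn_map = [], []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(logits)
        attn_map.append(weights)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, attn_map

# Two word features query three keypoint features (toy 2-D features).
word_feats = [[1.0, 0.0], [0.0, 1.0]]
point_feats = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
updated_words, lang_to_point_attn = cross_attention(word_feats, point_feats, point_feats)
```

Each row of `lang_to_point_attn` is one word's attention distribution over the keypoints, which is the kind of language-points map that the keypoint selection step relies on.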

Attention-guided Keypoint Selection. TPM reduces the keypoint set at each layer and gradually focuses on the target, as shown in Figure 4. We make use of the language-points cross-attention map, which represents the importance of keypoints to the referring task. Specifically, we perform average pooling on the attention map to obtain point-wise attention scores. The keypoints with the highest attention scores are then selected for the next layer.
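The selection step of one layer can be sketched as below. This is a minimal illustration under assumptions: the attention map is laid out as one row per word and one column per keypoint, pooling is taken over the word axis, and the function name `select_by_attention` and the toy values are hypothetical.

```python
def select_by_attention(attn_map, keypoint_ids, k):
    """Keep the k keypoints that receive the highest average attention from the words.

    attn_map[w][p] is the attention of word w on keypoint p (assumed layout).
    """
    n_words = len(attn_map)
    n_points = len(keypoint_ids)
    # Average pooling over the word axis gives one score per keypoint.
    point_scores = [sum(row[p] for row in attn_map) / n_words for p in range(n_points)]
    order = sorted(range(n_points), key=lambda p: (-point_scores[p], p))[:k]
    return [keypoint_ids[p] for p in order], [point_scores[p] for p in order]

# Three words attending to four keypoints with ids 10..13 (made-up values).
attn = [[0.10, 0.60, 0.30, 0.00],
        [0.30, 0.40, 0.20, 0.10],
        [0.20, 0.50, 0.15, 0.15]]
kept, scores = select_by_attention(attn, [10, 11, 12, 13], k=2)
print(kept)
```

Applying this step once per layer, each time on the attention map computed for the surviving keypoints, shrinks the keypoint set progressively.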


3.4 Training Objectives

Visual Grounding Loss. The 3D VG loss is the primary loss of our framework. In the training phase, we supervise the referring confidence scores predicted from the keypoint features with the target label. During inference, we only choose the keypoint with the highest referring confidence score to predict the target box. We adapt the loss in ScanRefer [scanrefer] to our framework. In ScanRefer, the target label is one-hot: the keypoint whose proposal box has the highest IoU with the ground truth target box is set to 1, and the others are set to 0. However, in 3D-SPS, we usually obtain several feasible keypoints of the target after TPM since the model aims to select points on it. Therefore, we modify this target label from one-hot to multi-hot. Specifically, we assign 1 to the keypoints whose predicted boxes’ IoUs with the ground truth target box are among the highest and greater than a threshold.
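The difference between the one-hot and multi-hot targets can be sketched as follows. The IoU values, the top-k size `k=4`, and the threshold `0.3` are illustrative assumptions, not the settings used in the experiments, and the function names are hypothetical.

```python
def one_hot_labels(ious):
    """ScanRefer-style target: only the keypoint with the best IoU is positive."""
    best = max(range(len(ious)), key=ious.__getitem__)
    return [int(i == best) for i in range(len(ious))]

def multi_hot_labels(ious, k, iou_threshold):
    """Multi-hot target: positives are keypoints in the top k by IoU that also exceed the threshold."""
    top = sorted(range(len(ious)), key=lambda i: (-ious[i], i))[:k]
    labels = [0] * len(ious)
    for i in top:
        if ious[i] > iou_threshold:
            labels[i] = 1
    return labels

# IoU of each keypoint's predicted box with the ground truth target box (made-up values).
ious = [0.62, 0.10, 0.55, 0.28, 0.71]
one_hot = one_hot_labels(ious)
multi_hot = multi_hot_labels(ious, k=4, iou_threshold=0.3)
print(one_hot, multi_hot)
```

Here three keypoints predict a reasonable box for the target; the one-hot label would penalize two of them, whereas the multi-hot label treats all three as positives.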

DKS Loss. In the DKS module, we supervise the object confidence score and the description relevance score with Focal Loss [Lin_2017_ICCV]. The object confidence score is supervised by whether the point is inside an object box and is among the closest points to the object center. The description relevance score is supervised by whether the point belongs to any object whose category is mentioned in the description.
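For reference, a per-point binary focal loss can be written as below. This is a generic sketch of the focal loss formulation; the defaults `alpha=0.25` and `gamma=2.0` are common choices from the focal loss literature and are assumptions here, not necessarily the values used for DKS.

```python
import math

def binary_focal_loss(prob, label, alpha=0.25, gamma=2.0):
    """Focal loss for one point: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct positive contributes far less than a badly missed one.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.05, 1)
print(easy, hard)
```

The modulating factor down-weights the many easy points, which matters for DKS because only a small fraction of seed points are positives for either score.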

Detection Loss. Following the loss used in [DBLP:conf/iccv/QiLHG19, groupfree], we use the object detection loss as an auxiliary loss for the VG task. Specifically, it comprises an object semantic classification loss, an objectness binary classification loss, a center offset regression loss, and a bounding box regression loss. In the training phase, we supervise the boxes predicted by all keypoints of each TPM layer. During inference, we only use the box prediction of the keypoint with the highest referring confidence score from the last TPM layer as our predicted grounding target.

Language Classification Loss. Following [scanrefer], we also introduce the language classification loss as an auxiliary loss, which includes a multi-class object classification loss for the target category based on the updated language features of each TPM layer.

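In summary, the total loss is a weighted sum of the visual grounding loss, the DKS loss, the detection loss, and the language classification loss, where the weights are used for balancing the different loss terms.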

4 Experiments

Method                             Pub.    Input    Unique              Multiple            Overall
                                                    Acc@0.25  Acc@0.5   Acc@0.25  Acc@0.5   Acc@0.25  Acc@0.5
SCRC [hu2016natural]               CVPR16  2D only  24.03     9.22      17.77     5.97      18.70     6.45
One-stage [yang2019fast]           ICCV19  2D only  29.32     22.82     18.72     6.49      20.38     9.04
ScanRefer [scanrefer]              ECCV20  3D only  67.64     46.19     32.06     21.26     38.97     26.10
TGNN [DBLP:conf/aaai/HuangLCL21]   AAAI21  3D only  68.61     56.80     29.84     23.18     37.37     29.70
InstanceRefer [Yuan_2021_ICCV]     ICCV21  3D only  77.45     66.83     31.27     24.77     40.23     32.93
SAT [sat]                          ICCV21  3D only  73.21     50.83     37.64     25.16     44.54     30.14
3DVG-Transformer [Zhao_2021_ICCV]  ICCV21  3D only  77.16     58.47     38.38     28.70     45.90     34.47
3D-SPS (Ours)                      -       3D only  81.63     64.77     39.48     29.61     47.65     36.43
ScanRefer [scanrefer]              ECCV20  2D + 3D  76.33     53.51     32.73     21.11     41.19     27.40
InstanceRefer [Yuan_2021_ICCV]     ICCV21  2D + 3D  75.72     64.66     29.41     22.99     38.40     31.08
3DVG-Transformer [Zhao_2021_ICCV]  ICCV21  2D + 3D  81.93     60.64     39.30     28.42     47.57     34.67
3D-SPS (Ours)                      -       2D + 3D  84.12     66.72     40.32     29.82     48.82     36.98
Table 1: Comparison on ScanRefer. Unique stands for samples with no distracting objects and Multiple for the remaining samples. We measure the percentage of predictions whose IoU with the ground truth is greater than 0.25 and 0.5, respectively.
Method                                  Pub.    Easy          Hard          View-dep.     View-indep.   Overall
Nr3D
ReferIt3DNet [achlioptas2020referit3d]  ECCV20  43.6% ± 0.8%  27.9% ± 0.7%  32.5% ± 0.7%  37.1% ± 0.8%  35.6% ± 0.7%
TGNN [DBLP:conf/aaai/HuangLCL21]        AAAI21  44.2% ± 0.4%  30.6% ± 0.2%  35.8% ± 0.2%  38.0% ± 0.3%  37.3% ± 0.3%
InstanceRefer [Yuan_2021_ICCV]          ICCV21  46.0% ± 0.5%  31.8% ± 0.4%  34.5% ± 0.6%  41.9% ± 0.4%  38.8% ± 0.4%
3DVG-Transformer [Zhao_2021_ICCV]       ICCV21  48.5% ± 0.2%  34.8% ± 0.4%  34.8% ± 0.7%  43.7% ± 0.5%  40.8% ± 0.2%
LanguageRefer [roh2021languagerefer]    CoRL21  51.0%         36.6%         41.7%         45.0%         43.9%
SAT [sat]                               ICCV21  56.3% ± 0.5%  42.4% ± 0.4%  46.9% ± 0.3%  50.4% ± 0.3%  49.2% ± 0.3%
3D-SPS (Ours)                           -       58.1% ± 0.3%  45.1% ± 0.4%  48.0% ± 0.2%  53.2% ± 0.3%  51.5% ± 0.2%
Sr3D
ReferIt3DNet [achlioptas2020referit3d]  ECCV20  44.7% ± 0.1%  31.5% ± 0.4%  39.2% ± 1.0%  40.8% ± 0.1%  40.8% ± 0.2%
TGNN [DBLP:conf/aaai/HuangLCL21]        AAAI21  48.5% ± 0.2%  36.9% ± 0.5%  45.8% ± 1.1%  45.0% ± 0.2%  45.0% ± 0.2%
InstanceRefer [Yuan_2021_ICCV]          ICCV21  51.1% ± 0.2%  40.5% ± 0.3%  45.4% ± 0.9%  48.1% ± 0.3%  48.0% ± 0.3%
3DVG-Transformer [Zhao_2021_ICCV]       ICCV21  54.2% ± 0.1%  44.9% ± 0.5%  44.6% ± 0.3%  51.7% ± 0.1%  51.4% ± 0.1%
LanguageRefer [roh2021languagerefer]    CoRL21  58.9%         49.3%         49.2%         56.3%         56.0%
SAT [sat]                               ICCV21  -             -             -             -             57.9% ± 0.1%
3D-SPS (Ours)                           -       56.2% ± 0.6%  65.4% ± 0.1%  49.2% ± 0.5%  63.2% ± 0.2%  62.6% ± 0.2%
Table 2: Comparison on Nr3D and Sr3D. Easy samples contain no distractor, and the remaining samples belong to Hard. View-dep./View-indep. refer to whether the description depends on the camera view.

4.1 Datasets

ScanRefer. The ScanRefer dataset [scanrefer] is a 3D visual grounding dataset with descriptions based on the ScanNet [dai2017scannet] scenes. Each scene has objects and descriptions on average. The evaluation metric of the dataset is Acc@kIoU, i.e., the fraction of descriptions whose predicted box overlaps the ground truth with an IoU greater than k, where k is 0.25 or 0.5. The accuracy is reported in the unique and multiple categories. Specifically, a target object is classified as unique if it is the only object of its class in the scene; otherwise, it is classified as multiple.
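The metric can be sketched as follows for axis-aligned boxes. This is an illustrative implementation, not the official evaluation script; the box layout `(cx, cy, cz, dx, dy, dz)` (center and full size) and the function names are assumptions made for this example.

```python
def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, dx, dy, dz)."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis] - box_a[axis + 3] / 2, box_b[axis] - box_b[axis + 3] / 2)
        hi = min(box_a[axis] + box_a[axis + 3] / 2, box_b[axis] + box_b[axis + 3] / 2)
        inter *= max(0.0, hi - lo)  # zero overlap on any axis means zero intersection
    vol_a = box_a[3] * box_a[4] * box_a[5]
    vol_b = box_b[3] * box_b[4] * box_b[5]
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred_boxes, gt_boxes, threshold):
    """Fraction of predictions whose IoU with the ground truth exceeds the threshold."""
    hits = sum(box_iou_3d(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)

# Two toy descriptions: the first prediction is shifted, the second is too flat.
gt = [(0.0, 0.0, 0.0, 2.0, 2.0, 2.0), (5.0, 5.0, 5.0, 2.0, 2.0, 2.0)]
pred = [(0.5, 0.0, 0.0, 2.0, 2.0, 2.0), (5.0, 5.0, 5.0, 2.0, 2.0, 1.0)]
print(acc_at_iou(pred, gt, 0.25), acc_at_iou(pred, gt, 0.5))
```

The two predictions have IoUs of 0.6 and 0.5 with their targets, so both count at the 0.25 threshold but only the first counts at the stricter 0.5 threshold.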

Nr3D and Sr3D. The ReferIt3D dataset [achlioptas2020referit3d] is also based on the ScanNet [dai2017scannet] scenes. It contains two subsets: Sr3D and Nr3D. Sr3D (Spatial Reference in 3D) contains synthetic expressions generated by templates, and Nr3D (Natural Reference in 3D) consists of human expressions. ReferIt3D directly provides segmented point clouds for each object as inputs rather than the whole scene. The evaluation metric of ReferIt3D is accuracy, i.e., whether the model correctly selects the target among the given objects.

4.2 Implementation Details

Our model is trained end-to-end with the AdamW optimizer [DBLP:journals/corr/abs-1711-05101] and a batch size of for epochs. The initial learning rates of TPM layers and the rest of the model are empirically set to and , respectively. We apply learning rate decay at epoch {, , } with a rate of . We adopt the pre-trained PointNet++ [pointnet++] following the settings in [groupfree] and the language encoder in [radford2021learning], while the rest of the network is trained from scratch. For the ScanRefer dataset, we use coordinates, RGB values, normal vectors, and extracted multiview features as inputs following [scanrefer]. The number of is empirically set to . The number of is empirically set to . The number of TPM layers is set to , and we select keypoints in each layer, i.e., . The loss weights are empirically set to , , , for balancing terms. We set to , to in , and to in . All experiments are implemented with PyTorch on a single NVIDIA V100 GPU.

4.3 Quantitative Comparison

Figure 5: Effectiveness Validation. (a) As the number of sampled points increases, our 3D-SPS performs better, whereas the performance of the two-stage baseline first increases and then decreases. (b) As the progressive language-relevant keypoint selection proceeds, the ratio of target keypoints in our 3D-SPS increases after each selection. This ratio also stays above that of the language-irrelevant sampling (e.g., FPS) used in the two-stage baseline.

In Tables 1 and 2, we compare 3D-SPS with existing 3D VG works on the ScanRefer and Nr3D/Sr3D datasets. The methods involved are the 2D-based methods SCRC [hu2016natural] and One-stage [yang2019fast], the segmentation-based two-stage methods TGNN [DBLP:conf/aaai/HuangLCL21] and InstanceRefer [Yuan_2021_ICCV], and the detection-based two-stage methods SAT [sat], 3DVG-Transformer [Zhao_2021_ICCV], ScanRefer [scanrefer], and ReferIt3DNet [achlioptas2020referit3d].

ScanRefer. 3D-SPS outperforms the existing methods, as shown in Table 1. In the Input column, 3D only stands for xyz + RGB + normals, and 2D + 3D means that an extra multiview feature for each point is added to 3D only. We concatenate these multiview features with our point features from the backbone and feed them into TPM together. In the 3D only setting, 3D-SPS improves the Overall accuracy by 1.75% at Acc@0.25 and 1.96% at Acc@0.5 compared to the existing state-of-the-art methods. In the 2D + 3D setting, 3D-SPS outperforms the existing methods by 1.25% at Acc@0.25 and 2.31% at Acc@0.5.

Note that TGNN and InstanceRefer both rely on a fixed, pre-trained 3D instance segmentation model, which explains why InstanceRefer performs better on the Acc@0.5 score in the Unique subset of the 3D only setting.

Nr3D & Sr3D. The task of the ReferIt3D dataset (Nr3D & Sr3D) is to identify the target object among the given ground truth object bounding boxes. We modify 3D-SPS accordingly, removing DKS and only verifying the effectiveness of TPM. For fair comparisons, we adopt the 2D semantic assisted training proposed by SAT [sat] in the training process and only use 3D inputs in the inference process. Results in Table 2 show that progressive selection is effective for referring tasks. 3D-SPS improves the overall grounding accuracy by 2.3% on Nr3D and 4.7% on Sr3D over the previous best method. Although LanguageRefer performs better on the Easy subset of the synthetic dataset Sr3D, 3D-SPS outperforms it by a large margin on the more challenging Hard subset.

Effectiveness Validation. Figure 5 confirms that our main idea, i.e., progressive keypoint selection, can address the issues from the motivation in Sec. 1. We analyze 3D-SPS and the two-stage method baseline [scanrefer] on the entire validation set of ScanRefer. As shown in Figure 5 (a), the two-stage baseline faces a dilemma in choosing the number of sampled points. In contrast, 3D-SPS benefits from more sampled points. According to Figure 5 (b), the two-stage baseline is limited by the small ratio of target keypoints due to the language-irrelevant keypoint sampling, while the ratio in 3D-SPS increases significantly after each selection.

Table 3: Ablations on the sampling strategy of DKS.
Table 4: Ablations on the layer number in TPM.
Table 5: Ablations of TPM on whether to select keypoints and different keypoint numbers. Our default setting is w/ selection, where the number of selected keypoints is progressively reduced.
Figure 6: The two-stage baseline (ScanRefer) fails while our 3D-SPS predicts correctly since 3D-SPS can select more valuable keypoints. (a) Language-relevant keypoints sampled by DKS. (b) Target keypoints selected by TPM. (c) Bounding boxes predicted by 3D-SPS. (d) Language-irrelevant keypoints sampled by FPS. (e) Bounding boxes predicted by ScanRefer.
Figure 7: Visualization of the same referring target with different descriptions in 3D-SPS. (a) Keypoints sampled by DKS. Comparing the left and right subfigures in each row, when the language-relevant objects change (e.g., window, desk, bed), 3D-SPS focuses on different keypoints (red keypoints). (b) Keypoints selected by TPM. (c) The predicted target bounding box.

4.4 Ablation Study

In this subsection, we investigate the contributions of the proposed DKS and TPM modules. We take ScanRefer as an example and report the Overall accuracy in the 3D only setting.

Sampling Strategy of DKS. Table 3 shows the ablations of the sampling strategy in the DKS module. FPS [pointnet++] is a widely adopted point sampling method, which tries to cover the whole scene without special attention to the language-relevant points. DKS (w/o description relevance score) means that only the object confidence score is used, and DKS (w/o object confidence score) means that only the description relevance score is used. DKS means that both scores are adopted, i.e., the full version of the proposed DKS module. According to the results in Table 3, both scores are beneficial to the referring task, helping DKS select description-related keypoints near object centers. The joint use of the two scores produces promising results.

Layer Number of TPM. We investigate the performance with different numbers of TPM layers. As shown in Table 4, more TPM layers bring higher accuracy, which demonstrates that TPM and the progressive mining are essential to grounding. We do not go beyond the default number of layers since more layers might force the model to leave out some keypoints of the target object and miss the best bounding box.

Progressive Selection of TPM. To further confirm the effectiveness of progressive keypoint selection, we compare the results with and without keypoint selection, as shown in Table 5. In detail, for the w/o selection setting, we only conduct multi-modal self/cross-attention. In this way, the number of keypoints does not change in TPM, and the predicted box is chosen from all keypoints after TPM. From Table 5, as the number of keypoints increases, the performance of the w/o selection setting rises at first and then declines. 3D-SPS (w/ selection) achieves significant improvement compared to the w/o selection settings. This observation supports the benefits of progressive keypoint selection.

4.5 Qualitative Comparison

In this subsection, we perform a qualitative comparison on the ScanRefer validation set to show how 3D-SPS works.

Language-relevant Keypoints.  We visualize the progressive keypoint selection process of 3D-SPS in Figure 6 and compare it with the two-stage baseline ScanRefer [scanrefer]. Enabled by DKS and TPM, 3D-SPS gradually focuses on the target. In contrast, the attention of ScanRefer is scattered everywhere in the scene and ultimately fails to locate the target due to the separation of detection and matching.

Language-adapted Keypoints.  3D-SPS selects different keypoints for the same target with different descriptions. As shown in Figure 7 (upper), to locate the table, 3D-SPS selects some keypoints on the window for subsequent mining when window is mentioned in the left sample. On the right, when only armchairs is mentioned, 3D-SPS only selects keypoints on armchairs and tables. In Figure 7 (lower), for the target shelf, 3D-SPS finds more keypoints related to the desk when the shelf is described as on the desk in the left sample. When the description contains under the bed, the model pays more attention to the bed.

5 Conclusion and Discussion

In this work, we propose a brand new 3D visual grounding framework on point clouds named 3D Single-Stage Referred Point Progressive Selection method (3D-SPS). Under the guidance of language, it progressively selects keypoints following a coarse-to-fine pattern and directly localizes the target at a single stage. Comprehensive experiments reveal that our method outperforms the existing 3D VG methods on both ScanRefer and Nr3D / Sr3D datasets by a large margin, leading to the new state-of-the-art performance.

Limitation. Although 3D-SPS significantly improves on existing methods, it remains limited by the complexity of 3D point clouds and free-form descriptions. In particular, view-dependent descriptions and ambiguous queries can both confuse the model. These limitations could guide our future work.

Acknowledgement. This research is partly supported by the National Natural Science Foundation of China (Grants 62122010, 61876177), the Fundamental Research Funds for the Central Universities, and the Key R&D Program of Zhejiang Province (2022C01082).