The deep learning revolution has enabled great progress in localizing objects in a visual context given a natural language description referring to the object. Seminal methods for visual grounding and referring expression resolution have drawn attention to this research area [20, 19, 35]. Datasets such as ReferIt, RefCOCO, and Flickr30K Entities are a key reason for this progress. At the same time, however, they are restricted to the 2D image domain. That is, bounding boxes and pixel segments representing objects are inherently limited to the context of a single RGB image and are unable to capture the full 3D extent of the objects. Knowing the 3D location and spatial extent of an object is critical for agents that need to interact with objects. For instance, if you ask a robot to bring “the black stool next to the fridge”, the robot must understand the underlying spatial layout of the environment in order to successfully navigate within it and interact with it.
In this work, we ground natural language expressions in 3D scenes. Specifically, we localize objects in 3D point clouds given natural language descriptions referring to the objects. To this end, we introduce an architecture that jointly learns to propose 3D bounding boxes for objects in an input point cloud and match the boxes against input descriptions.
To train and evaluate our architecture, we collect a new dataset, ScanRefer, which augments RGB-D scans in ScanNet with natural language descriptions. Concretely, we hire human annotators to describe objects in the reconstructed 3D environments, as well as their spatial context. This way, we annotate all unique scenes in ScanNet with free-form natural language descriptions. In total, we acquire 46,173 descriptions of 9,943 objects. To the best of our knowledge, our ScanRefer dataset is the first large-scale effort that combines 3D scene semantics and free-form descriptions in the 3D vision community.
In summary, our main contributions are as follows:
- We introduce the ScanRefer dataset containing human-written free-form descriptions of objects in 3D scans.
- We propose a data-driven method for localizing 3D target objects in a point cloud using natural language expressions.
2 Related Work
2.1 Grounding Referring Expressions in Images
There has been a plethora of successful work connecting images to natural language descriptions across tasks such as image captioning [25, 24, 55, 60], text-to-image retrieval [57, 22], and visual grounding [20, 35, 65]. The task of visual grounding (also known as referring expression comprehension or phrase localization) is to localize a region described by a given referring expression, the query. To address this task, many methods [20, 51, 57, 56, 43, 42, 35, 66, 67, 65, 10] have been proposed. A common strategy of these works is to follow a two-stage pipeline: in the first stage, an object detector, either unsupervised or pretrained, is used to propose regions of interest; in the second stage, the regions are ranked by similarity to the query, and the highest scoring region is returned as the final output. In contrast, other methods [19, 39, 63] address the referring expression task with single-stage, end-to-end approaches, and some [19, 32, 30, 36, 64, 4] produce a segmentation mask for the described object. Another line of research investigates how to improve visual grounding by incorporating syntax [33, 15], using graph attention networks [58, 62] or speaker-listener models [35, 67], and by exploring weakly supervised methods [59, 68, 9] and zero-shot settings for unseen nouns.
Despite the impressive work in visual grounding, these methods all operate on 2D image datasets [43, 26, 66]. A recent dataset integrates RGB-D images, but it still lacks complete 3D context beyond a single image. Qi et al. propose to study referring expressions in an embodied setting, where semantic annotations are projected from 3D to 2D bounding boxes on images observed by an agent. In our work, we go beyond individual images. Our core contribution is to lift NLP tasks to 3D by introducing the first large-scale effort that couples free-form descriptions to objects in 3D scans.
2.2 Object Detection in 3D
With the recent growth in popularity of 3D scene understanding in computer vision, a variety of object detection methods have been proposed for the 3D domain. Many of these works operate on volumetric grids [17, 23, 29, 38, 11] and have achieved great success on several 3D RGB-D datasets [54, 7, 2]. As an alternative to regular grids, point-based methods, such as PointNet or PointNet++, have been used as backbones for 3D detection and/or object instance segmentation [61, 12]. Very recently, Qi et al. introduced an object detection scheme based on Hough voting, which differs markedly from other metric or anchor-based methods and is specifically tailored to point-based representations. Our approach extracts geometric features in a similar fashion, but our architecture further backprojects 2D feature information, since the color signal is critical for describing 3D objects with natural language.
2.3 3D Vision Language
Although vision and language research has been gaining popularity in image domains (e.g., image captioning [24, 55, 60, 34], image-text matching [13, 28, 31, 21, 14], and text-to-image generation [49, 14, 53]), only very few researchers focus on the joint domain of vision and language in 3D. For instance, the work by Chen et al. learns a joint embedding of 3D shapes from ShapeNet and corresponding free-form natural language descriptions. Achlioptas et al. disambiguate between different objects using language. Recent work by Prabhudesai et al. has started to investigate grounding of language to 3D by identifying 3D bounding boxes of target objects for simple arrangements of primitive shapes of different colors. Instead of focusing on isolated objects, our work considers large 3D real-world environments obtained by RGB-D reconstructions that are typical in semantic 3D scene understanding.
3 Task
We introduce the task of object localization in 3D scenes using natural language as illustrated in Fig. 2. The input to our task consists of a 3D scene together with a free-form text describing an object in the scene. The scene is represented as a point cloud where each point has 3D spatial coordinates as well as additional features such as colors and normals. The goal of the task is to predict the 3D bounding box of the object that matches the input description.
4 Dataset
We build our ScanRefer dataset on ScanNet, which is composed of 1,513 RGB-D scans taken in 707 unique indoor environments. We provide around 5 unique descriptions for each object in each scene, focusing on complete coverage of all objects that are present in the reconstruction.
4.1 Data Collection
We develop a 3D web-based annotation interface using WebGL and deploy it on Amazon Mechanical Turk (AMT) to collect object descriptions in the ScanNet scenes. The annotation pipeline consists of two stages: description collection (see Fig. 3), followed by verification (see Fig. 4). This is similar in spirit to the ReferItGame, except that we separate the writing of the description and the selection of referred objects into two phases. We select the set of objects to annotate from ScanNet by restricting the object categories to indoor furniture and excluding structural objects such as “Floor” and “Wall” (see supplementary material for more details). We manually verify the reconstructed objects and filter out objects whose reconstructions are too incomplete and/or hard to identify.
To collect object descriptions, we present a 3D web-based UI that shows the object in context (see Fig. 3). We show the workers a visualization for the scene with all objects other than the target object faded out and an image gallery on the side to compensate for incomplete details in the reconstructions. Annotators are provided with a random initial viewpoint that includes the target object and camera controls that they can use to adjust the camera view to better examine the target object. We ask the annotator to describe the appearance of the target object as well as its spatial location within the scene and relation to other objects. To ensure the descriptions are informative, we require the annotator to provide at least two full sentences. The description collection tasks are batched and randomized so that each object is described by five different workers.
To ensure the quality of the descriptions, we recruit trained workers (students) to verify that the descriptions are discriminative and correct. Verifiers are shown the 3D scene and a description, and are asked to select the objects (potentially multiple) in the scene that match the description (see Fig. 4). In addition, verifiers are asked to correct any spelling and wording issues, and provide a corrected description when necessary. In the end, we filter out in total 2,823 invalid descriptions that do not match the target objects and fix writing issues for 2,129 problematic descriptions.
4.2 Dataset Statistics
In total, we collected 46,173 descriptions for 703 ScanNet scenes (4 scenes are excluded due to lack of objects to describe). On average, there are 14.14 objects and 65.68 descriptions per scene, and 4.64 descriptions per object after filtering. The data collection took place over one month and involved 1,929 AMT workers. Together, the description collection and verification took around 4,984 man-hours. Tab. 1 shows the detailed dataset statistics.
|Statistic|Value|
|---|---|
|Number of descriptions|46,173|
|Number of scenes|703|
|Number of objects|9,943|
|Number of objects per scene|14.14|
|Number of descriptions per scene|65.68|
|Number of descriptions per object|4.64|
|Size of vocabulary|7,378|
|Min/max/average length of descriptions|4/117/17.91|
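The per-scene and per-object averages in Tab. 1 can be re-derived from the three totals it lists. The short check below is only a sanity-check sketch; the helper name `per_unit` is an assumption made for this example.

```python
# Totals copied from the dataset statistics table.
num_descriptions = 46_173
num_scenes = 703
num_objects = 9_943


def per_unit(total: int, units: int) -> float:
    """Average count per unit, rounded to two decimals as in the table."""
    return round(total / units, 2)


objects_per_scene = per_unit(num_objects, num_scenes)
descriptions_per_scene = per_unit(num_descriptions, num_scenes)
descriptions_per_object = per_unit(num_descriptions, num_objects)

print(objects_per_scene, descriptions_per_scene, descriptions_per_object)
```

All three derived values agree with the corresponding rows of the table.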
5 Method
We propose an end-to-end approach on point clouds to address the localization task (see Fig. 5). Our architecture consists of two main modules: 1) detection & encoding; 2) fusion & localization. The detection & encoding module encodes the input point cloud and description, and outputs the object proposals and the language embedding, which are then fed into the fusion module to mask out invalid object proposals and produce the fused features. Finally, the localization module chooses the most likely object proposal as the final output.
5.1 Data Representations
We randomly sample 40,000 vertices of one scan from ScanNet as the input point cloud $\mathcal{P} = \{(p_i, f_i)\}$, where $p_i \in \mathbb{R}^3$ represents the point coordinates in 3D space and $f_i$ stands for the point features such as colors and normals. We use the point coordinates as the only geometrical information. Since descriptions of objects refer to attributes such as color and texture, we incorporate visual appearance by adapting the feature projection scheme of Dai and Nießner to project multi-view image features onto the point cloud. The image features are extracted using a pre-trained ENet. Following Qi et al., we also append the height of each point above the ground and its normal to the point features. The final point cloud data is prepared offline.
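As a rough illustration of the sampling and height-feature steps described above, the following sketch uses plain Python. The function names `sample_points` and `append_height`, and the choice of the lowest z value as the floor estimate, are assumptions made for this example; the multi-view feature projection and the normals are not covered.

```python
import random


def sample_points(vertices, num_samples, seed=0):
    """Randomly pick `num_samples` vertices.

    Samples without replacement when the scan is large enough,
    and with replacement otherwise.
    """
    rng = random.Random(seed)
    if len(vertices) >= num_samples:
        return rng.sample(vertices, num_samples)
    return [rng.choice(vertices) for _ in range(num_samples)]


def append_height(points, floor_z=None):
    """Append each point's height above the floor as an extra feature.

    Assumption: the floor height is approximated by the lowest z value.
    """
    if floor_z is None:
        floor_z = min(p[2] for p in points)
    return [tuple(p) + (p[2] - floor_z,) for p in points]


# Toy scan with (x, y, z, r, g, b) per vertex.
scan = [
    (0.0, 0.0, 0.1, 255, 0, 0),
    (1.0, 0.5, 0.6, 0, 255, 0),
    (0.2, 0.3, 1.1, 0, 0, 255),
]
cloud = append_height(sample_points(scan, num_samples=2))
```

In the setting described in the text, `num_samples` would be 40,000 and the per-point features would also include the projected multi-view features.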
5.2 Network Architecture
Our method takes as input the preprocessed point cloud and the word embedding sequence representing the input description and outputs the 3D bounding box for the proposal which is most likely referred to by the input description. Conceptually, our localization pipeline consists of the following four stages: detection, encoding, fusion and localization.
As the first step in our network, we detect all probable objects in the given point cloud. To construct our detection module, we adapt the PointNet++ backbone and the voting module of Qi et al. to process the point cloud input and aggregate all potential object candidates into individual clusters. The output of the detection module is a set of $M$ point clusters representing all object proposals with enriched point features, where $M$ is the upper bound on the number of proposals. Next, the proposal module takes in the point clusters and processes them further to predict the objectness mask and regress the bounding boxes for all $M$ proposals, where each bounding box consists of the coordinates of the box center and the size residuals in all 3 dimensions.
The sequence of word embedding vectors representing the input description is fed into a GRU cell to aggregate the textual information. We take the final hidden state of the GRU cell as the final language embedding.
The outputs from the previous detection and encoding modules are fed into the fusion module (see Fig. 6) to integrate the point features with the language embeddings. Specifically, each feature vector in the point cluster is concatenated with the language embedding to form the extended feature vector, which is then masked by the predicted objectness mask and fused by a multi-layer perceptron to give the final fused cluster features.
Point clusters with fused cluster features are processed by a single-layer perceptron to produce the raw scores, which are then squashed into the interval $[0, 1]$ using a softmax function. The object proposal with the highest score is the final output.
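The scoring step can be sketched as a softmax over proposals followed by an argmax. For brevity, this hypothetical `localize` helper applies the objectness mask directly to the scores, whereas the text applies it to the features before fusion; it is an illustration under that simplifying assumption, not the trained model.

```python
import math


def localize(raw_scores, objectness_mask):
    """Softmax over valid proposals; return (best index, confidences).

    Proposals whose mask entry is falsy get a confidence of 0.
    """
    valid = [i for i, keep in enumerate(objectness_mask) if keep]
    if not valid:
        raise ValueError("no valid proposals")
    # Subtract the maximum for numerical stability.
    top = max(raw_scores[i] for i in valid)
    exps = {i: math.exp(raw_scores[i] - top) for i in valid}
    total = sum(exps.values())
    confidences = [exps[i] / total if i in exps else 0.0
                   for i in range(len(raw_scores))]
    best = max(valid, key=lambda i: confidences[i])
    return best, confidences


best, confidences = localize([2.0, 5.0, 1.0], [1, 0, 1])
```

Here the second proposal has the highest raw score but is masked out, so the first proposal is selected.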
5.3 Loss Function
For the produced score $s_i$ of object proposal $i$ among all $M$ proposals, our target is constructed as $t_i \in \{0, 1\}$, where $t_i$ represents whether the proposal matches the ground truth object. Note that the detection module may produce overlapping proposals for a single object, so the ground truth vector is not necessarily a one-hot vector. We then formulate our reference loss using a binary cross-entropy function as:
$$\mathcal{L}_{\text{ref}} = -\sum_{i=1}^{M} \left[ w_1\, t_i \log s_i + w_0\, (1 - t_i) \log (1 - s_i) \right]$$
where $w_0$ and $w_1$ are the weights for negative and positive samples. In practice, we set those two weights to 0.01 and 1, respectively.
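A weighted binary cross-entropy of the kind described above can be sketched as follows. The function name `reference_loss`, the summation over proposals (rather than averaging), and the clamping constant `eps` are assumptions of this example; only the two weights, 0.01 for negatives and 1 for positives, come from the text.

```python
import math


def reference_loss(scores, targets, w_neg=0.01, w_pos=1.0, eps=1e-8):
    """Weighted binary cross-entropy summed over all proposals.

    scores:  predicted confidences in [0, 1], one per proposal
    targets: 1 if the proposal matches the ground truth object, else 0
    """
    loss = 0.0
    for s, t in zip(scores, targets):
        s = min(max(s, eps), 1.0 - eps)  # avoid log(0)
        loss -= w_pos * t * math.log(s) + w_neg * (1 - t) * math.log(1.0 - s)
    return loss


# Two proposals overlap the target object, so the target is not one-hot.
loss = reference_loss([0.6, 0.3, 0.1], [1, 1, 0])
```

With the small negative weight, confident scores on the many non-matching proposals are penalized far less than low scores on matching ones.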
Object detection loss
We follow the same detection loss as introduced in Qi et al. to handle object proposals. The detection loss is denoted $\mathcal{L}_{\text{det}}$.
Language to object classification loss
To further supervise the training, we include an object classification loss based on the input description. We consider the 18 ScanNet classes and exclude the labels “Floor” and “Wall” to match our dataset settings. The language to object classification loss here is a multi-class cross-entropy loss.
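The final loss is a linear combination of the reference loss, the object detection loss, and the language to object classification loss:
$$\mathcal{L} = \alpha\, \mathcal{L}_{\text{ref}} + \beta\, \mathcal{L}_{\text{det}} + \gamma\, \mathcal{L}_{\text{cls}}$$
where $\mathcal{L}_{\text{ref}}$, $\mathcal{L}_{\text{det}}$, and $\mathcal{L}_{\text{cls}}$ denote the three losses in that order, and $\alpha$, $\beta$, and $\gamma$ are the weights for the individual loss terms, set to 0.1, 10, and 1 in our experiments.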
5.4 Training and Inference
During training, the detection and encoding modules propose object candidates as point clusters, which are then fed into the fusion and localization modules to fuse the features from the previous module and predict the final bounding boxes. We use a softmax function to compress the raw scores from the localization module to $[0, 1]$. The higher the predicted score, the more likely the proposal is to be chosen as output. In addition, to filter out invalid object proposals that correspond to false detections, we use the predicted objectness mask to ensure that only the positive proposals are taken into account. We set the maximum number of proposals to 256 in practice.
Since there can be overlapping detections, we apply a non-maximum suppression module to suppress those overlapping proposals in the inference step. The remaining object proposals will be fed into the localization module to predict the final score for each proposal. The number of object proposals is less than the upper bound in the training step.
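A minimal sketch of greedy non-maximum suppression on 3D boxes is given below. It assumes axis-aligned boxes stored as `(cx, cy, cz, dx, dy, dz)`, that is, center and size; the IoU threshold of 0.25 and the function names are illustrative choices, not values stated in the text.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, cz, dx, dy, dz)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis] - a[axis + 3] / 2, b[axis] - b[axis + 3] / 2)
        hi = min(a[axis] + a[axis + 3] / 2, b[axis] + b[axis + 3] / 2)
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)


def nms_3d(boxes, scores, iou_threshold=0.25):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(box_iou_3d(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 0, 2, 2, 2), (0.2, 0, 0, 2, 2, 2), (5, 5, 5, 1, 1, 1)]
kept = nms_3d(boxes, scores=[0.9, 0.8, 0.5])
```

The second box overlaps the first heavily and is suppressed, while the distant third box survives.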
5.5 Implementation Details
We implement our architecture using PyTorch and train the model end-to-end using the ADAM optimizer with a learning rate of e. We train the model for roughly iterations until it fully converges. To avoid overfitting, we set the weight decay factor to e and apply data augmentations to our training data. For point clouds, we apply rotation about all three axes by a random angle in and randomly translate the point cloud within meters in all directions.
6 Experiments
We adopt the official ScanNet split, from which we construct a training set with 36,665 samples and a validation set with 9,508 samples. The split ensures disjoint train/val scenes (results are reported on val).
To evaluate the performance of our method, we measure precision for thresholded intersection over union (IoU) between the predicted and ground truth bounding boxes. Similar to work with 2D images, we use precision@IoU $k$ as our metric, where the threshold value $k$ is set to 0.25 and 0.5 in our experiments.
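The metric can be illustrated with a small sketch. It assumes axis-aligned boxes stored as `(cx, cy, cz, dx, dy, dz)` and counts a prediction as correct when its IoU is at least the threshold; the inclusive comparison and the function names are assumptions of this example.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, cz, dx, dy, dz)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis] - a[axis + 3] / 2, b[axis] - b[axis + 3] / 2)
        hi = min(a[axis] + a[axis + 3] / 2, b[axis] + b[axis + 3] / 2)
        if hi <= lo:
            return 0.0  # boxes are disjoint along this axis
        inter *= hi - lo
    union = a[3] * a[4] * a[5] + b[3] * b[4] * b[5] - inter
    return inter / union


def precision_at_iou(predictions, ground_truths, k):
    """Fraction of samples whose predicted box reaches IoU >= k."""
    hits = sum(
        1 for pred, gt in zip(predictions, ground_truths)
        if box_iou_3d(pred, gt) >= k
    )
    return hits / len(predictions)


gts = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 2, 2, 2)]
preds = [(0, 0, 0, 2, 2, 2), (1, 0, 0, 2, 2, 2)]  # IoUs: 1.0 and 1/3
p25 = precision_at_iou(preds, gts, k=0.25)
p50 = precision_at_iou(preds, gts, k=0.5)
```

In this toy case both predictions pass the 0.25 threshold, but only the exact match passes 0.5.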
SCRC: A 2D image baseline for visual grounding, obtained by extending SCRC to 3D using back-projection. Since the method operates on a single frame, we retrieve the closest frames from the ScanNet frames using the camera pose to construct the training set. At inference time, we feed in every 20th frame and predict the target bounding boxes in each frame. Finally, we select the bounding box with the highest similarity score from the bounding box candidates and project it to 3D using the corresponding depth map.
PointRefNet: We modify the PointNet++ semantic segmentation pipeline to include language descriptions by feeding in the sentence embeddings from a GloVe + GRU encoder and fusing them with the point feature maps.
VoteNetRand: From the predicted object proposals, we randomly select one of the bounding box proposals predicted by our architecture with the correct semantic class label.
OracleCatRand: To illustrate the difficulty of our task, we use an oracle with ground truth bounding boxes of objects, and select a random box that matches the object category.
6.1 Task Difficulty
To understand how informative the input description is beyond capturing the object category, we analyze the performance of the methods on “unique” and “multiple” subsets with 1,572 and 7,936 samples, respectively. The “unique” subset contains samples where only one unique object from a certain category matches the description, while the “multiple” subset contains ambiguous cases where there are multiple objects of the same category. For instance, if there is only one refrigerator in a scene, it is sufficient to identify that the sentence refers to a refrigerator. In contrast, if there are multiple objects of the same category in a scene (e.g., chair), the full description must be taken into account. From the OracleCatRand baseline, we see that information from the description, other than the object category, is necessary to disambiguate between multiple objects (see Tab. 2).
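The split into “unique” and “multiple” samples, and the reason a category-only oracle struggles on the latter, can be sketched as follows. The data layout (a dictionary from scene ID to per-object categories) and the function names are hypothetical; the expected hit rate simply assumes a uniformly random pick among same-category objects.

```python
from collections import Counter


def split_unique_multiple(samples, scene_objects):
    """Split (scene_id, object_id) samples by category ambiguity.

    scene_objects maps scene_id -> {object_id: category}.
    """
    unique, multiple = [], []
    for scene_id, object_id in samples:
        categories = scene_objects[scene_id]
        counts = Counter(categories.values())
        if counts[categories[object_id]] == 1:
            unique.append((scene_id, object_id))
        else:
            multiple.append((scene_id, object_id))
    return unique, multiple


def expected_category_oracle_hit_rate(samples, scene_objects):
    """Chance of picking the target when choosing uniformly among
    ground truth objects of the target's category."""
    total = 0.0
    for scene_id, object_id in samples:
        categories = scene_objects[scene_id]
        same = sum(1 for c in categories.values() if c == categories[object_id])
        total += 1.0 / same
    return total / len(samples)


scene_objects = {"scene_a": {0: "refrigerator", 1: "chair", 2: "chair"}}
samples = [("scene_a", 0), ("scene_a", 1), ("scene_a", 2)]
unique, multiple = split_unique_multiple(samples, scene_objects)
hit_rate = expected_category_oracle_hit_rate(samples, scene_objects)
```

In this toy scene the refrigerator is always found from its category alone, whereas each chair is found only half of the time, which is why the rest of the description matters for the “multiple” subset.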
6.2 Quantitative Analysis
As Tab. 2 shows, our method equipped with a language-to-object classifier and fed with spatial coordinates and multi-view features outperforms the baselines. Notably, the PointRefNet baseline performs better on the “unique” subset where the prediction relies more on identifying the object category, while it has lower performance on the “multiple” split where it cannot distinguish between multiple objects (see Fig. 7). With category information, VoteNetRand is able to perform relatively well on the “unique” subset, but has trouble identifying the correct object in the “multiple” case. However, the gap between the VoteNetRand and OracleCatRand for the “unique” case shows that the object detection component can still be improved.
Our method is able to improve over the bounding box predictions from VoteNetRand, and leverages additional information in the description to differentiate between ambiguous objects. It adapts better to the 3D context compared to SCRC which is limited by the view of a single frame.
6.3 Qualitative Analysis
Fig. 7 shows results produced by SCRC, PointRefNet, and our method. The successful localization cases in the green boxes show our architecture can handle the semantic correlation between the scene contexts and the textual descriptions. In contrast, PointRefNet fails to predict correct bounding boxes since it does not identify object instances, while SCRC is limited by the single view and hence cannot produce accurate bounding boxes in 3D space. Some failure cases of our method are displayed in Fig. 8, indicating that our architecture cannot handle all spatial relations to distinguish between ambiguous objects.
6.4 Ablation Studies
Does a language-based object classifier help?
To show the effectiveness of the extra supervision on input descriptions, we conduct an ablation experiment with the language to object classifier (+lobjcls) and without. As Tab. 2 shows, architectures with a language to object classifier outperform ones without it. This indicates that it is helpful to predict the category of the target object based on the input description.
Do colors help?
We compare our method trained with the geometry and multi-view image features (xyz+multiview+lobjcls) with its sibling model trained with only geometry (xyz+lobjcls) and the one trained with additional RGB values from the reconstructed meshes (xyz+rgb+lobjcls). As shown in Tab. 2, ScanRefer trained with geometry and pre-processed multi-view image features outperforms the other two models. Notably, the performance of models trained with extra color information experiences an immediate increase when compared to the one trained only with geometry.
Do other features help?
We include normals from the ScanNet meshes to the input point cloud features and compare performance against the networks trained without them. As Tab. 2 indicates, the additional 3D information can improve the scores and our architecture trained with geometry, multi-view features, and normals (xyz+multiview+normals+lobjcls) achieves the best performance among all ablations.
7 Conclusion
In this work, we introduce the task of localizing a target object in a 3D point cloud using natural language descriptions. We first introduce the ScanRefer dataset, which contains 46,173 unique descriptions for 9,943 objects from 703 ScanNet scenes. We further propose an end-to-end method for localizing an object with a free-form description as reference, which first proposes point clusters of interest and then matches them to the embeddings of the input sentence. Our architecture is capable of learning the semantic similarities of the given contexts and regressing the bounding boxes for the target objects. Our method outperforms state-of-the-art baseline methods by a large margin. Overall, we hope that our new dataset and method will enable future research in the 3D visual language field.
Acknowledgements
We would like to thank the expert annotators Josefina Manieu Seguel and Rinu Shaji Mariam, all anonymous workers on Amazon Mechanical Turk, and the student volunteers (Akshit Sharma, Yue Ruan, Ali Gholami, Yasaman Etesam, Leon Kochiev, Sonia Raychaudhuri) at Simon Fraser University for their efforts in building the ScanRefer dataset. This work is funded by Google (AugmentedPerception), the ERC Starting Grant Scan2CAD (804724), and a Google Faculty Award. We would also like to acknowledge the support of the TUM-IAS Rudolf Mößbauer and Hans Fischer Fellowships (Focus Group Visual Computing), as well as the German Research Foundation (DFG) under the Grant Making Machine Learning on Static and Dynamic 3D Data Practical. Finally, we thank Angela Dai for the video voice-over.
References
-  (2019) ShapeGlot: learning language for shape differentiation. In Proc. International Conference on Computer Vision (ICCV), Cited by: §2.3.
-  (2017) Matterport3D: learning from RGB-D data in indoor environments. In Proceedings of the International Conference on 3D Vision (3DV), Cited by: §2.2.
-  (2015) ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012. Cited by: §2.3.
-  (2019) See-through-text grouping for referring image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7454–7463. Cited by: §2.1.
-  (2018) Text2Shape: generating shapes from natural language by learning joint embeddings. In Proc. Asian Conference on Computer Vision (ACCV), Cited by: §2.3.
-  Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: Figure 5, §5.2, §6.
-  ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), Cited by: §A.1, Appendix B, ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language, §1, §2.2, §4, §6, §7.
-  (2018) 3DMV: joint 3D-multi-view prediction for 3D semantic scene segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 452–468. Cited by: §5.1.
-  (2019) Align2Ground: weakly supervised phrase grounding guided by image-caption alignment. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.1.
-  (2019) Neural sequential phrase grounding (SeqGROUND). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4175–4184. Cited by: §2.1.
-  (2019) 3D-BEVIS: birds-eye-view instance segmentation. arXiv preprint arXiv:1904.02199. Cited by: §2.2.
-  (2019) Dilated point convolutions: on the receptive field of point convolutions. arXiv preprint arXiv:1907.12046. Cited by: §2.2.
-  Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 7–16. Cited by: §2.3.
-  (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7181–7189. Cited by: §2.3.
-  (2019) Learning to compose and reason with language tree structures for visual grounding. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.1.
-  spaCy 2: natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. Note: To appear. Cited by: §5.1.
-  (2019) 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4421–4430. Cited by: §2.2.
-  (1959) Machine analysis of bubble chamber pictures. In Conf. Proc., Vol. 590914, pp. 554–558. Cited by: §2.2.
-  (2016) Segmentation from natural language expressions. In European Conference on Computer Vision, pp. 108–124. Cited by: §1, §2.1.
-  (2016) Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4555–4564. Cited by: §1, §2.1, Table 2, §6.
-  (2017) Instance-aware image and sentence matching with selective multimodal LSTM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2310–2318. Cited by: §2.3.
-  (2018) Learning semantic concepts and order for image and sentence matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6163–6171. Cited by: §2.1.
-  (2019) 3D-SIC: 3D semantic instance completion for RGB-D scans. In arXiv preprint arXiv:1904.12012, Cited by: §2.2.
-  (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §2.1, §2.3.
-  (2014) Deep fragment embeddings for bidirectional image sentence mapping. In Advances in neural information processing systems, pp. 1889–1897. Cited by: §2.1.
-  ReferItGame: referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798. Cited by: §1, §2.1, §4.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.5.
-  (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539. Cited by: §2.3.
-  (2019) 3D instance segmentation via multi-task metric learning. arXiv preprint arXiv:1906.08650. Cited by: §2.2.
-  (2018) Referring image segmentation via recurrent refinement networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5745–5753. Cited by: §2.1.
-  (2017) Identity-aware textual-visual matching with latent co-attention. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1890–1899. Cited by: §2.3.
-  (2017) Recurrent multimodal interaction for referring image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1271–1280. Cited by: §2.1.
-  (2019) Learning to assemble neural module tree networks for visual grounding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4673–4682. Cited by: §2.1.
-  (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 375–383. Cited by: §2.3.
-  (2016) Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 11–20. Cited by: §1, §2.1.
-  (2018) Dynamic multimodal instance segmentation guided by natural language queries. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 630–645. Cited by: §2.1.
-  (2019) SUN-Spot: an RGB-D dataset with spatial referring expressions. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §2.1.
-  (2019) PanopticFusion: online volumetric semantic mapping at the level of stuff and things. arXiv preprint arXiv:1903.01177. Cited by: §2.2.
-  (2018) Object captioning and retrieval with natural language. arXiv preprint arXiv:1803.06152. Cited by: §2.1.
-  (2016) ENet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147. Cited by: §5.1.
-  (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: Figure 5, §5.1, §6.
-  (2018) Conditional image-text embedding networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 249–264. Cited by: §2.1.
-  (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pp. 2641–2649. Cited by: §1, §2.1, §2.1.
-  (2019) Embodied language grounding with implicit 3D visual feature representations. arXiv preprint arXiv:1910.01210. Cited by: §2.3.
-  (2019) Deep hough voting for 3D object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: Table 3, Appendix B, §2.2, Figure 5, §5.1, §5.2, §5.3, Table 2.
-  (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §2.2.
-  (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: §2.2, Figure 5, §5.2, Table 2, §6.
-  (2019) RERERE: remote embodied referring expressions in real indoor environments. arXiv preprint arXiv:1904.10151. Cited by: §2.1.
-  (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396. Cited by: §2.3.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §2.1.
-  (2016) Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pp. 817–834. Cited by: §2.1.
-  (2019) Zero-shot grounding of objects from natural language queries. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4694–4703. Cited by: §2.1.
-  (2018) ChatPainter: improving text to image generation using dialogue. arXiv preprint arXiv:1802.08216. Cited by: §2.3.
-  (2015) SUN RGB-D: a RGB-D scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 567–576. Cited by: §2.2.
-  (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §2.1, §2.3.
-  (2018) Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 394–407. Cited by: §2.1.
-  (2016) Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5005–5013. Cited by: §2.1.
-  (2019) Neighbourhood watch: referring expression comprehension via language-guided graph attention networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1960–1968. Cited by: §2.1.
-  (2017) Weakly-supervised visual grounding of phrases with linguistic structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5945–5954. Cited by: §2.1.
-  (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.1, §2.3.
-  (2019) Learning object bounding boxes for 3D instance segmentation on point clouds. arXiv preprint arXiv:1906.01140. Cited by: §2.2.
-  (2019) Dynamic graph attention for referring expression comprehension. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4644–4653. Cited by: §2.1.
-  (2019) A fast and accurate one-stage approach to visual grounding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4683–4693. Cited by: §2.1.
-  (2019) Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10502–10511. Cited by: §2.1.
-  (2018) MAttNet: modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1307–1315. Cited by: §2.1.
-  (2016) Modeling context in referring expressions. In European Conference on Computer Vision, pp. 69–85. Cited by: §1, §2.1, §2.1.
-  (2017) A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7282–7290. Cited by: §2.1.
-  (2018) Weakly supervised phrase localization with multi-scale anchored transformer network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5696–5705. Cited by: §2.1.
-  (2014) Edge boxes: locating object proposals from edges. In European conference on computer vision, pp. 391–405. Cited by: §2.1.
Appendix A Dataset
We present the category distribution of the ScanRefer dataset in Fig. 9. ScanRefer provides broad coverage of furniture (e.g., chair, table, cabinet, bed) in indoor environments, with various sizes, colors, materials, and locations. We use the same category names as in the original ScanNet dataset. In total, we annotate 9,976 objects from 265 categories in ScanNet. Following the ScanNet voxel labeling task, we aggregate these finer-grained categories into 17 coarse categories and group the remaining object types into “Others”, for a total of 18 object categories that we use to train the language-based object classifier.
Fig. 10 shows the distribution of finer-grained objects in the category “Others”. For each of the 18 coarse categories, Fig. 11 shows the average and maximum number of objects of that category per scene, computed over the scenes in which at least one object of that category appears. For instance, for scenes that contain a bed, the average number of beds is 1.22 and the maximum is 3.
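The aggregation of fine-grained names into coarse categories with an “Others” fallback can be sketched as a dictionary lookup. This is a minimal illustration, not the mapping used for the dataset: the table `FINE_TO_COARSE` and the function `to_coarse` are hypothetical names, and the few entries shown are placeholders for the full table covering 265 fine-grained names.

```python
# Hypothetical, partial mapping for illustration only. The actual table
# covers all 265 fine-grained names and 17 coarse categories.
FINE_TO_COARSE = {
    "chair": "chair",
    "office chair": "chair",
    "table": "table",
    "coffee table": "table",
    "cabinet": "cabinet",
    "bed": "bed",
}


def to_coarse(fine_name: str) -> str:
    """Map a fine-grained category name to its coarse category.

    Names missing from the table fall back to the catch-all "others"
    category, which yields 17 + 1 = 18 classes in the full setting.
    """
    return FINE_TO_COARSE.get(fine_name.strip().lower(), "others")


if __name__ == "__main__":
    for name in ["Office Chair", "bed", "pillow", "keyboard"]:
        print(name, "->", to_coarse(name))
```

Under this sketch, object types such as “pillow” and “keyboard” that have no coarse counterpart are assigned to “others”.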
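The per-category statistics described above (average and maximum count, restricted to scenes where the category occurs) can be computed as in the following sketch. The function name `per_scene_stats`, the scene identifiers, and the toy labels are assumptions made for illustration; they are not taken from the dataset tooling.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def per_scene_stats(scenes: Dict[str, List[str]]) -> Dict[str, Tuple[float, int]]:
    """Return {category: (average count, maximum count)}.

    Counts are averaged only over scenes that contain at least one
    object of the category, so the average is always >= 1.
    """
    counts_per_category = defaultdict(list)
    for labels in scenes.values():
        # Counter only yields categories present in this scene, so scenes
        # without the category never contribute a zero.
        for category, n in Counter(labels).items():
            counts_per_category[category].append(n)
    return {
        category: (sum(ns) / len(ns), max(ns))
        for category, ns in counts_per_category.items()
    }


if __name__ == "__main__":
    # Toy data with hypothetical scene ids and labels.
    toy_scenes = {
        "scene_a": ["bed", "chair", "chair"],
        "scene_b": ["bed", "bed", "bed"],
        "scene_c": ["chair"],
    }
    print(per_scene_stats(toy_scenes))
```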
A.2 Collection Details
In this section, we provide more details of the data annotation and verification processes of ScanRefer.
We deploy our web-based annotation application on Amazon Mechanical Turk (AMT) to collect object descriptions in the reconstructed RGB-D scans. To ensure that the initial descriptions are written in proper English, we restrict the workers to be from the United States, the United Kingdom, Canada or Australia. The workers are asked to finish a batch of 5 description tasks within a time limit of 2 hours once the assignment is accepted on AMT. To ensure the descriptions are diverse and linguistically rich, we require that each description consists of at least two sentences. Before the annotation task begins, the AMT workers are also presented with the instructions shown in Fig. 12. We request that the workers provide the following information in the descriptions:
-  The appearance of the object, such as shape, color, material and so on.
-  The location of that object in the scene, e.g., “the chair is in the center of this room”.
-  The relative position to other objects in the scene, for instance, “this chair is the second one from the left”.
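The requirement that each description consist of at least two sentences could be checked automatically, for example as below. This is only a sketch under a simplifying assumption: sentences are split naively on terminal punctuation, so abbreviations such as “e.g.” would be over-counted. The function names are hypothetical, and the check is not claimed to be the one used in the annotation application.

```python
import re


def count_sentences(text: str) -> int:
    """Count sentences by splitting on runs of '.', '!' or '?'.

    Naive by design: it ignores abbreviations and decimal numbers.
    """
    parts = re.split(r"[.!?]+", text)
    return sum(1 for part in parts if part.strip())


def is_acceptable(description: str, min_sentences: int = 2) -> bool:
    """Return True if the description has at least `min_sentences` sentences."""
    return count_sentences(description) >= min_sentences


if __name__ == "__main__":
    print(is_acceptable("This is a brown chair. It is in the center of the room."))
    print(is_acceptable("A brown chair"))
```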
After collecting the descriptions from AMT, we quickly inspect them and manually filter out and reject obviously bad descriptions before starting the verification process. We then verify the collected object descriptions by recruiting trained students to perform the verification task in our WebGL-based application. To ensure that the descriptions are discriminative (e.g., that they can pick out which one of the chairs is being described), the verifiers are asked to select the objects in the scene that best match the descriptions. The verifiers are also asked to fix any spelling and wording issues, e.g., “hair” instead of “chair”, and submit the corrected descriptions to our database. To guide the trained verifiers, we provide the verification instructions shown in Fig. 13.
Appendix B Object Detection Results
and all ablations of our architectures. We use mean average precision (mAP) at an IoU threshold of 0.5 as our evaluation metric. All values in Tab. 3 are percentages. We exclude structural objects such as “Floor” and “Wall”. We group all categories that are not among the ScanNet benchmark categories, including “Otherfurniture”, “Otherstructure” and “Otherprop”, into the “Others” category in our evaluation. Note that the “Others” category in our evaluation includes more types of objects, such as “Pillow” and “Keyboard”, than the “Otherfurniture” category of the ScanNet benchmark.
As shown in Tab. 3, including point normals as extra point features (rows [3,5]) in training improves the detection results compared to the models trained without normals (rows [2,4]). Training with high-level color features extracted from the multi-view images (rows [4,5]) also produces better detection results than training with just the raw RGB values (rows [2,3]). Note that the networks equipped with the language-based object classifier (rows [6-10]) fail to produce better detections than the ones without the extra language classifier module (rows [1-5]). This is to be expected, as the language provides information that helps differentiate between objects of the same category but does not contain any information that would improve bounding box prediction for object detection.
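To make the metric concrete, the sketch below computes the 3D IoU of two boxes and the average precision for a single category at an IoU threshold of 0.5. It rests on assumptions not stated in the text: boxes are axis-aligned and given as (xmin, ymin, zmin, xmax, ymax, zmax), each ground-truth box can be matched at most once, and AP is taken as the un-interpolated mean of the precision at each true positive. Benchmark implementations typically interpolate the precision-recall curve, so their values can differ. The names `box3d_iou` and `average_precision` are illustrative.

```python
from typing import List, Tuple

# Axis-aligned box: (xmin, ymin, zmin, xmax, ymax, zmax). Assumed layout.
Box = Tuple[float, float, float, float, float, float]


def box3d_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def average_precision(
    preds: List[Tuple[float, Box]], gts: List[Box], iou_thresh: float = 0.5
) -> float:
    """Un-interpolated AP for one category.

    `preds` holds (confidence, box) pairs. A prediction is a true positive
    if its best-overlapping ground-truth box has IoU >= iou_thresh and has
    not already been matched by a higher-scoring prediction.
    """
    if not gts:
        return 0.0
    matched = [False] * len(gts)
    true_positives = 0
    precision_sum = 0.0
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    for rank, (_, box) in enumerate(ranked, start=1):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            iou = box3d_iou(box, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thresh and not matched[best_j]:
            matched[best_j] = True
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / len(gts)


def mean_average_precision(ap_per_category: List[float]) -> float:
    """mAP is the mean of the per-category AP values."""
    return sum(ap_per_category) / len(ap_per_category)
```

Averaging `average_precision` over the evaluated categories with `mean_average_precision` and multiplying by 100 would give a percentage comparable in form to the table entries.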
Appendix C Additional Qualitative Analysis
We present additional examples of localization results produced by our method for further qualitative analysis. As shown in Fig. 14, our method can directly incorporate the immediate detections and further produce the bounding boxes for the target objects from the learned fused features. Our method not only extracts the appearance and location information from the input description, but also produces accurate bounding boxes for the targets. There are also challenging cases where our method fails to localize the target object. As Fig. 15 shows, our method is sometimes limited by inaccurate detections, especially for small objects such as the pictures (5th row). This indicates that the object detection submodule can still be improved. In addition, our method cannot handle all spatial relations in the input descriptions. For instance, for comparative phrases (e.g., “leftmost” or “rightmost”) or counting (e.g., “the second one from the left”), the model fails to pick out the correct object, as shown in Fig. 16.