Robots in a variety of applications require the ability to recognize objects of interest in their environment and distinguish them from the background, using 3D sensors. Examples range from autonomous cars detecting nearby pedestrians, to an industrial robot identifying an object to be assembled. Modern lidar sensors and stereo cameras allow robots to perceive their 3D surroundings in detail as a point cloud. In particular, lidar sensors provide 3D spatial measurements with extreme precision. For a robot to effectively navigate and/or interact with its surroundings, a method is needed to reliably perform detection and point-by-point segmentation of objects in 3D point clouds. Figure 1 demonstrates this task.
Recently, object detection and segmentation in 2D images have undergone impressive improvements in accuracy and reliability through the rise of approaches based on deep convolutional neural networks [1, 2, 3]. 2D images are a natural fit for deep learning: their inherent pixel grid structure allows convolutional neural networks to be applied effectively and efficiently [4], and large data sets can be labeled easily by non-experts, enabling deep neural networks to thrive.
In contrast, 3D point clouds are unstructured by nature and laborious to label. Although some successful attempts have been made to convert point clouds into inputs for deep networks [5, 6, 7, 8, 9, 10, 11], the lack of large labeled data sets serves as an inherent limitation. Annotating 3D point clouds is a time-consuming and difficult process; a single lidar point cloud can contain tens of thousands of points, and labeling of many such point clouds would be required. Furthermore, these challenges scale with the number of object classes, and with the variety in environments. As evidence for the asymmetry between the difficulty of 2D and 3D labeling, one can look at the relative sizes of the MS COCO [12] and KITTI [13] data sets: the image-only MS COCO data set contains over 200,000 labeled images and 1.5 million labeled object instances, whereas the image-and-lidar KITTI object detection data set contains 7,481 image-point cloud pairs for training and 7,518 for testing, with 80,256 object instances.
In this paper we propose a novel approach to 3D object segmentation, which bypasses the challenges inherent in the use of 3D point clouds with deep neural networks. Our method leverages the success of convolutional neural networks for 2D image segmentation and frames 3D point cloud segmentation as a semi-supervised learning problem on graphs [14, 15]. In order to leverage the advantages of both data modalities, we observe a scene with a 3D lidar sensor and an aligned 2D camera. We apply an off-the-shelf object segmentation algorithm (Mask-RCNN [1]) to the 2D image in order to detect object classes and instances at the pixel-by-pixel level. Subsequently, we construct a graph by connecting 2D pixels to 3D lidar points according to their 2D projected locations, as well as connecting lidar points that neighbor one another in 3D space. We use label diffusion to propagate 2D segmentation labels through this graph, thereby labeling the 3D lidar points. We refer to our algorithm as Label Diffusion Lidar Segmentation (LDLS). The end result is a fully labeled 3D point cloud, obtained by leveraging the strengths of both data modalities. The camera provides rich semantic information about object classes and facilitates the application of deep neural networks, and the lidar sensor provides precise 3D spatial measurements. Additionally, by identifying objects in 3D at the point level, rather than outputting rectangular 3D bounding boxes around detected objects [9, 16, 10, 11, 17], LDLS allows much more precise localization of object instances which do not neatly fit into rectangular boxes, such as people and animals.
Although labeled images are required to train the 2D object segmentation model, this is a much lower barrier than requiring 3D annotated point clouds. Furthermore, open-source pretrained models are available for a variety of object classes; thus collection of training data may not even be necessary in some cases. By removing the need for labeled 3D training data, LDLS facilitates generalization to new environments, object classes, and robot sensor configurations.
We conduct experiments on the KITTI data set [13, 18] to quantitatively evaluate and analyze the accuracy of LDLS. Our experimental results include evaluations on a subset of the KITTI object detection data set which we have manually annotated with point-level labels, in order to evaluate point-wise segmentation. We additionally present qualitative evaluations on KITTI data, as well as on data collected with a mobile robot equipped with a camera and lower-resolution 3D lidar sensor. Our results demonstrate that LDLS can be applied successfully in different domains on different sensors, and is capable of detecting a variety of object classes due to the flexibility of the pretrained image object detector. Our manually labeled KITTI ground truth data set, and an open-source implementation of LDLS, are shared publicly at https://github.com/brian-h-wang/LDLS.
II Related Work
II-A Deep Learning on Point Clouds
Deep neural networks have been proposed for various perception tasks on point clouds in the past. PointNet [5] defines a network architecture that operates directly on unstructured point clouds and extracts features that are invariant to point re-ordering, capturing both local and global point cloud information. PointNet and its successor PointNet++ have been shown to be successful at point cloud classification and semantic segmentation, and have also been extended to object detection and instance segmentation.
Other methods extend convolutional neural networks to point clouds. Since 3D points lack the grid structure of images, one approach is to arrange the points into a 3D voxel grid and perform 3D convolution; however, this can be computationally inefficient, especially for sparse lidar point clouds. Alternatively, points can be projected into 2D, using panoramic projection [6, 8] or a bird’s-eye view. These projections allow 2D convolution, but information is lost in the reduction to 2D, rendering these approaches unsuitable for some environments, especially complex or cluttered scenes. Recent works propose direct convolution over point clouds by adjusting the kernel weights locally according to irregular point positions.
Several methods for object detection in autonomous driving scenes use lidar sensor data alongside images from an aligned camera [10, 11, 17], using the KITTI data set for training and evaluation. While the KITTI data set is an excellent benchmark, its annotations are 3D bounding boxes, and object classes are limited to driving-relevant objects such as pedestrians, cars, and cyclists. There is significant potential benefit to a variety of robotics applications in building methods that recognize a broader variety of objects, and output point-level segmentations. Along these lines, a key limitation, and opportunity, in deep learning on point clouds is the expansion of labeled data. Recent works [24, 25] attempt to ease the process of annotating ground truth point clouds.
II-B Graphical Models and 2D-3D Fusion
As an alternative to deep neural networks, graphical models have been successfully applied to point clouds for various tasks. Previous works by Maddern & Newman and Schoenberg et al. use graphical models to fuse lidar scans and stereo camera depth maps to produce accurate dense depth maps suitable for use on autonomous vehicles. Wang et al. propose a semantic segmentation method for image-aligned 3D point clouds by retrieving referenced labeled images of similar appearances and then propagating their labels to the 3D points using a graphical model.
Various other approaches for fusion of 2D and 3D information have also been considered. Wang & Neumann add depth-aware operations to standard CNNs to improve segmentation performance for RGBD images. Xie et al. developed a method to rapidly annotate 2D street scenes by first drawing labeled bounding primitives in 3D, and then transferring labels on to 2D images. Zhang et al. train a neural network for 2D semantic segmentation, then project onto dense 3D data from a long-range laser scanner. This work uses the additional assumption of coplanar dense points sharing labels to clean projected labels and achieve semantic background segmentation of classes such as buildings and roads. These various concepts of 2D-3D fusion demonstrate that point clouds and images can be used to complement one another in the perception process.
II-C Learning with Graphs
Our proposed approach is based on semi-supervised learning with graphs. Graph-based methods are well-established in machine learning, especially for nearest-neighbor graphs built upon the manifold assumption [32, 15]. In this paper, we construct the graph according to 3D lidar point locations, as well as their projected 2D image pixel coordinates, and adapt the label diffusion algorithm by Zhu.
This section presents our label diffusion method for object instance segmentation in lidar point clouds. We formulate the task as a semi-supervised learning problem on a graph, and leverage 2D segmentation from an RGB image along with 3D geometry from a point cloud to obtain a complete 3D segmentation. Figure 2 shows the segmentation pipeline.
III-A Problem Formulation
Object instance segmentation in 3D point clouds can be formulated as follows: given a point cloud of 3D points, we want to assign to every point an object instance label. Each instance is also associated with a class label drawn from a fixed label set. One instance label and one class label are reserved to indicate the background.
Our approach takes as inputs a lidar point cloud and an aligned RGB image. This is a common sensor configuration for autonomous vehicles and mobile robots. Note that we consider only lidar points which lie within the field of view of the camera. Each image pixel is described by its 2D coordinates. Algorithms for 2D object instance segmentation on RGB images have been well-developed (e.g., Mask R-CNN [1]) and pretrained models on large-scale data sets (e.g., MS COCO [12]) are readily accessible for a wide variety of object classes. We therefore leverage these resources for the task of labeling lidar points.
We approach the task of 3D segmentation from the perspective of semi-supervised learning, to avoid training a segmentation model on annotated point clouds. Semi-supervised learning assumes that a set of data points is available, of which a subset of points is labeled. For graph-based semi-supervised learning, we label the remaining points by defining connections between data points and then diffusing labels along these connections. To apply this framework to lidar point cloud segmentation, we construct a graph by drawing connections from 2D pixels to 3D lidar points, as well as among the 3D points. The 2D pixels are labeled according to results from 2D object segmentation of the RGB image, and the graph is then used to diffuse labels onto the 3D points, which are all initially unlabeled. The success of this approach depends on appropriate choices for the graph structure and the label diffusion process.
III-B Graph Construction
The graph used in our method consists of two types of nodes (2D image pixels, and 3D lidar points), as well as two types of connections between nodes (from a 2D pixel to a 3D point, and between two 3D points).
Initial Graph Node Labeling
Each 2D pixel and each 3D point is a node within the graph. Initially, all 3D points are unlabeled. The 2D pixels are labeled according to an image segmentation algorithm, which assigns every image pixel an instance label (with one label reserved for the background), and associates each instance with a class label. The output will therefore be several distinct instance masks, each containing many pixels, as seen in Figure 2 (iii). This instance-class association is deterministic for each image, simplifying the task of assigning instance labels to the lidar points within the camera’s field of view.
Since we assume the camera and lidar sensors are aligned, each lidar point can be projected from 3D into 2D image pixel coordinates, as shown in Figure 2 (iv). At this stage, a naive way to label the lidar point cloud would be to label each point according to the 2D instance mask into which it is projected. This method will however result in significant labeling errors, especially around the 2D instance boundaries, due to calibration errors between the sensors as well as the fact that the 2D segmentation masks are unaware of 3D depth. Background lidar points will therefore often project into a foreground object mask, or vice versa. Our segmentation pipeline should therefore combine 2D and 3D information, and leverage both information sources for producing a final 3D segmentation.
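The naive projection baseline described above can be illustrated with a minimal sketch. It assumes an ideal pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` and points already expressed in the camera frame; the function names and the toy mask are illustrative and are not taken from the LDLS implementation.

```python
def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera frame, z forward) to a pixel."""
    x, y, z = point
    if z <= 0:
        return None  # behind the camera, so not visible in the image
    u = fx * x / z + cx
    v = fy * y / z + cy
    return int(round(u)), int(round(v))


def naive_labels(points, mask, fx, fy, cx, cy):
    """Label each point with the instance id of the pixel it projects onto.

    mask[v][u] holds an instance id, with 0 meaning background.
    Points that fall outside the image receive None.
    """
    height, width = len(mask), len(mask[0])
    labels = []
    for p in points:
        uv = project_point(p, fx, fy, cx, cy)
        if uv is None or not (0 <= uv[0] < width and 0 <= uv[1] < height):
            labels.append(None)
        else:
            labels.append(mask[uv[1]][uv[0]])
    return labels


# Toy 4x4 mask: instance 1 occupies the right half of the image.
mask = [[0, 0, 1, 1] for _ in range(4)]
points = [
    (0.5, 0.0, 1.0),   # object surface, 1 m from the camera
    (5.0, 0.0, 10.0),  # background point, 10 m away, same image column
    (-1.0, 0.0, 1.0),  # projects onto the left half of the image
]
labels = naive_labels(points, mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

In this toy scene the distant background point receives the same label as the object in front of it, because the 2D mask carries no depth information. This is the failure mode that motivates combining the mask with 3D geometry.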
In order to combine 2D and 3D information in our graph for semi-supervised label diffusion, we construct a subgraph connecting 2D pixels to 3D lidar points, represented by a matrix.
Each lidar point is connected to the set of image pixels near its projected 2D location, and every such connection carries the same weight. In our implementation, this set contains all pixels in a 5 pixel by 5 pixel box centered on the projected 2D coordinates of the point. The connection weight controls the amount of information that can flow from a pixel to a connected lidar point. In our experiments, we use a small constant value for this weight, to mitigate sensor calibration errors by minimizing the influence of any one pixel. While more sophisticated schemes, such as setting different box sizes and weights for each point, may be applied, empirically we find our design choice to perform well across multiple domains.
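A minimal sketch of the pixel-to-point connections follows, assuming integer projected coordinates and a sparse dictionary representation. The weight value `0.01` is a placeholder chosen for illustration only, and the helper names are hypothetical.

```python
def pixel_box(u, v, width, height, box=5):
    """Pixels (column, row) in a box-by-box window centered on (u, v).

    The window is clipped to the image bounds.
    """
    half = box // 2
    return [
        (uu, vv)
        for vv in range(max(0, v - half), min(height, v + half + 1))
        for uu in range(max(0, u - half), min(width, u + half + 1))
    ]


def pixel_to_point_edges(projected, width, height, weight, box=5):
    """Map each lidar point index to {pixel: weight} for its nearby pixels."""
    return {
        i: {px: weight for px in pixel_box(u, v, width, height, box)}
        for i, (u, v) in enumerate(projected)
    }


# Two projected lidar points: one in the image interior, one at a corner.
edges = pixel_to_point_edges([(10, 10), (0, 0)], width=20, height=20, weight=0.01)
```

An interior point connects to all 25 pixels of its window, while a point projected at the image corner connects only to the 9 pixels that remain after clipping.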
In order to encode connections between 3D points, we construct a nearest neighbor graph from the points to reflect the underlying 3D geometry. This subgraph is likewise represented by a matrix.
Given a lidar point cloud, we construct an exponential-weighted nearest neighbors graph over the points. For each point, we compute the set of its nearest neighbors within the point cloud, according to Euclidean distance. The graph of 3D point connections then assigns each pair of a point and one of its nearest neighbors a weight given by an exponentially decaying function of the Euclidean distance between them; all other entries are zero.
Each nonzero element captures the similarity between the two points it connects. With a small number of neighbors, this subgraph is sparse, enabling fast computation during the diffusion step later on; the graph parameters are fixed constants in our experiments. We apply a KD-tree to speed up the construction of this subgraph.
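The nearest-neighbor graph can be sketched as below. This is a brute-force version for small inputs rather than the KD-tree construction mentioned above. The kernel form `exp(-d^2 / sigma^2)`, the exclusion of self-connections, and the parameter names `k` and `sigma` are assumptions made for illustration.

```python
import math


def knn_graph(points, k, sigma):
    """Brute-force k-nearest-neighbor graph with exponentially decaying weights.

    Returns a list of dicts, where graph[i][j] is the weight of the connection
    from neighbor j to point i. The kernel exp(-d^2 / sigma^2) is an assumed
    form, and a point is never listed as its own neighbor.
    """
    graph = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance to p.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph.append(
            {j: math.exp(-(d * d) / (sigma * sigma)) for d, j in dists[:k]}
        )
    return graph


# Three points clustered near the origin and one far away.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (5.0, 0.0, 0.0)]
graph = knn_graph(points, k=2, sigma=1.0)
```

Each row holds at most `k` nonzero entries, so the matrix stays sparse. The distant point still has two neighbors, but their weights are close to zero, so little label information flows across the gap.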
Full Label Diffusion Graph
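The full graph for label diffusion, combining the 2D-to-3D connections as well as the 3D-to-3D connections, is then assembled as a single row-normalized matrix: the rows for lidar points hold the 3D-to-3D and 2D-to-3D connections, while the bottom rows, which correspond to the pixels, are constructed so that pixel labels are left unchanged by diffusion.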
III-C Label Diffusion
The graph matrix guides the label diffusion process for lidar point labeling. Its nonzero elements indicate connections along which information on object instance labels should be diffused. The intuition behind the diffusion process is for the 2D pixels to act as source nodes that continuously push label information out through the 3D points, which is then diffused throughout the point cloud according to the connections between points. This process cleans up segmentation boundaries using 3D geometry information, resulting in a final 3D segmentation incorporating both 2D and 3D information.
To perform label diffusion, let us assume that a number of object instances, including the background instance, are detected by the 2D segmentation method (the background instance can be defined by the absence of any object mask). For each instance, let us define a label vector that contains one entry for each 3D point and 2D pixel. The entries corresponding to 3D points are initialized to zero, and the entries corresponding to the 2D pixels are defined according to the 2D segmentation masks: the entry for a pixel is 1 if that pixel is in the segmentation mask of the object instance, and 0 otherwise.
We then iteratively multiply each label vector by the graph matrix to diffuse labels throughout the graph nodes, for all instances. Note that if a point is unlabeled, but connected to at least one pixel labeled with a given instance, then after such a computation the point's entry for that instance becomes positive, indicating an increased likelihood that the point will be labeled with that instance as a result of label diffusion from the pixel.
Note that the bottom submatrix of the graph matrix, which corresponds to the pixel nodes, ensures that the pixel labels remain unchanged by this matrix multiplication. Since the labels of all initially labeled nodes within the graph remain fixed, and since the graph matrix is row-normalized, the label diffusion is proven to converge according to Zhu.
We iteratively apply label diffusion according to Eq. (6) until convergence of all label vectors, or until a maximum number of iterations (200, in our experiments). Finally, we convert the likelihood values to lidar point labels by taking, for each point, the instance whose label vector has the largest entry at that point.
That is, we assign each point the most likely label.
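The diffusion loop can be illustrated on a toy graph with three lidar points (nodes 0 to 2) and two pixels (nodes 3 and 4). This is a sketch under stated assumptions: the raw edge weights are invented, the pixel rows are modeled as self-loops so that their labels stay fixed, and the convergence tolerance `tol` is a hypothetical parameter. Only the cap of 200 iterations comes from the text.

```python
def row_normalize(rows):
    """Scale each sparse row so that its weights sum to one."""
    out = []
    for row in rows:
        total = sum(row.values())
        out.append({j: w / total for j, w in row.items()})
    return out


def diffuse(graph, z, max_iters=200, tol=1e-9):
    """Repeat z <- G z until the vector stops changing or max_iters is hit."""
    for _ in range(max_iters):
        new_z = [sum(w * z[j] for j, w in row.items()) for row in graph]
        if max(abs(a - b) for a, b in zip(new_z, z)) < tol:
            return new_z
        z = new_z
    return z


# Rows 0-2: lidar points. Rows 3-4: pixels, which only point at themselves.
graph = row_normalize([
    {1: 1.0, 3: 1.0},  # point 0: neighbor point 1, object pixel 3
    {0: 4.0, 2: 1.0},  # point 1: much closer to point 0 than to point 2
    {1: 1.0, 4: 1.0},  # point 2: neighbor point 1, background pixel 4
    {3: 1.0},          # pixel 3 keeps its own label
    {4: 1.0},          # pixel 4 keeps its own label
])

# One initial label vector per instance: 0 is background, 1 is the object.
initial = {0: [0.0, 0.0, 0.0, 0.0, 1.0], 1: [0.0, 0.0, 0.0, 1.0, 0.0]}
likelihoods = {inst: diffuse(graph, z) for inst, z in initial.items()}

# Each lidar point takes the instance with the largest diffused value.
labels = [max(likelihoods, key=lambda inst: likelihoods[inst][i]) for i in range(3)]
```

Point 1 has no direct pixel connection, yet it ends up with the object label because most of its weight comes from point 0, which is tied to the object pixel. Point 2 is pulled toward the background pixel instead.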
Label diffusion can sometimes result in disjointed sections of object segmentations; most often this occurs if projection or mask boundary errors result in a large number of contiguous background lidar points being projected to inside a 2D segmentation mask. To clean up these errors, we introduce an outlier removal step based on finding connected components within the graph. For each object instance, consider the subgraph defined by only the lidar points labeled as that instance, i.e. a point is a node of the subgraph if and only if it carries that instance label. We then find the largest connected component of this subgraph, treating it as an undirected graph, and update the lidar point labels so that only points within this largest component keep the instance label.
The final output of this pipeline is a lidar point cloud where each point is labeled as either a background point, or as part of an object instance (with a class label). Algorithm 1 summarizes our overall algorithm.
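The outlier removal step described above can be sketched with a breadth-first search over the point graph. Relabeling the discarded points as background (id `0`) is an assumption made here for illustration, as are the function names and the adjacency-list input format.

```python
from collections import deque


def largest_component(nodes, neighbors):
    """Largest connected component among `nodes`, with edges made undirected."""
    nodes = set(nodes)
    # Build a symmetric adjacency restricted to the given nodes.
    adj = {n: set() for n in nodes}
    for i in nodes:
        for j in neighbors.get(i, ()):
            if j in nodes:
                adj[i].add(j)
                adj[j].add(i)
    best, seen = set(), set()
    for start in nodes:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in comp:
                    comp.add(nxt)
                    queue.append(nxt)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def remove_outliers(labels, neighbors, background=0):
    """Relabel points outside their instance's largest component as background."""
    cleaned = list(labels)
    for inst in set(labels) - {background}:
        members = [i for i, lab in enumerate(labels) if lab == inst]
        keep = largest_component(members, neighbors)
        for i in members:
            if i not in keep:
                cleaned[i] = background
    return cleaned


# Points 0-2 form one cluster of instance 1. Point 4 also carries label 1,
# but its only neighbor (point 3) is a background point.
labels = [1, 1, 1, 0, 1]
neighbors = {0: [1], 1: [2], 2: [1], 3: [4], 4: [3]}
cleaned = remove_outliers(labels, neighbors)
```

Point 4 is dropped from instance 1 because it is not connected to the main cluster through other points of the same instance.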
IV Results and Evaluation
IV-A Quantitative Evaluation on the KITTI Data Set
We present experiments on the KITTI data set in order to quantitatively study the accuracy of LDLS. Our evaluation considers the Pedestrian and Car object classes, as these a) are the most common object classes in the KITTI data set and b) appear in both the KITTI and MS COCO data sets. This allows us to generate results using Mask-RCNN [1] pretrained on MS COCO, and then evaluate on KITTI. In most of our experiments, we merge the Pedestrian and Person_sitting KITTI classes. The KITTI object detection leaderboard also includes cyclists; however, since the Mask-RCNN model we use is trained to detect people and bicycles separately, we do not consider this class. We perform no training or fine-tuning of Mask-RCNN on KITTI data, allowing us to study the applicability of our method to a new domain unseen in the image training data.
IV-A1 Performance Metrics
We evaluate using performance metrics for both semantic (per-class) segmentation and instance segmentation. Given a set of lidar points, semantic segmentation is defined as assigning each point a class label from the set of all classes. Given point class labels, the segmentation is evaluated by computing precision, recall, and IoU over all points for each class $c$, as
\[ \mathrm{Precision}_c = \frac{|P_c \cap G_c|}{|P_c|}, \qquad \mathrm{Recall}_c = \frac{|P_c \cap G_c|}{|G_c|}, \qquad \mathrm{IoU}_c = \frac{|P_c \cap G_c|}{|P_c \cup G_c|}, \]
where $P_c$ is the set of lidar points predicted to have class $c$ in the segmentation results, $G_c$ is the set of points with class $c$ in the ground truth, and $|\cdot|$ denotes cardinality of a set.
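These per-class metrics can be computed directly from sets of point indices. The sketch below is illustrative; the class names and the toy label lists are made up for the example.

```python
def semantic_metrics(predicted, truth, cls):
    """Point-wise precision, recall, and IoU for one class."""
    pred = {i for i, c in enumerate(predicted) if c == cls}
    gt = {i for i, c in enumerate(truth) if c == cls}
    inter = len(pred & gt)
    union = len(pred | gt)
    precision = inter / len(pred) if pred else 0.0
    recall = inter / len(gt) if gt else 0.0
    iou = inter / union if union else 0.0
    return precision, recall, iou


# Six lidar points with predicted and ground-truth class labels.
predicted = ["car", "car", "car", "bg", "ped", "bg"]
truth = ["car", "car", "bg", "car", "ped", "bg"]
car = semantic_metrics(predicted, truth, "car")
ped = semantic_metrics(predicted, truth, "ped")
```

For the car class, two of the three predicted points are correct and two of the three true points are found, while the union covers four points, so IoU is lower than both precision and recall.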
An instance segmentation additionally assigns each point an instance label, distinguishing individual cars and pedestrians from one another. To evaluate an instance segmentation, predicted instance labels must be matched with corresponding ground truth instances. We do this by calculating the IoU between each prediction-ground truth instance pair with the same class label, and then calculating a bipartite graph matching which maximizes the sum of IoUs between matched instance pairs.
The number of true positives $TP$ is then calculated by counting the number of prediction-truth matchings with IoU over some predefined threshold. The false positive and false negative counts $FP$ and $FN$ are determined by the number of unmatched prediction instances and truth instances, respectively, and instance precision and recall are then computed as
\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}. \]
It is also possible to calculate instance precision and recall by counting over individual points, rather than instances. However for lidar data, this metric is unevenly weighted towards objects that are closer to the sensor because they possess a higher density of points. Our results report instance-level precision and recall in order to avoid this bias.
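The instance-level evaluation can be sketched as follows for a single class. The matching here is found by brute force over permutations, which is only practical for a handful of instances; an assignment solver such as the Hungarian algorithm would be the usual substitute at scale. Treating matched pairs that fall below the IoU threshold as unmatched is an assumption of this sketch, and all names are illustrative.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two sets of point indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_instances(preds, truths):
    """Brute-force bipartite matching that maximizes the summed IoU.

    Returns a list of (prediction index, truth index, IoU) triples.
    """
    n = max(len(preds), len(truths))
    # Pad the shorter side with None so every permutation is a full assignment.
    pred_ids = list(range(len(preds))) + [None] * (n - len(preds))
    truth_ids = list(range(len(truths))) + [None] * (n - len(truths))

    def score(p, t):
        return 0.0 if p is None or t is None else iou(preds[p], truths[t])

    best, best_total = [], -1.0
    for perm in permutations(truth_ids):
        total = sum(score(p, t) for p, t in zip(pred_ids, perm))
        if total > best_total:
            best_total = total
            best = [
                (p, t, score(p, t))
                for p, t in zip(pred_ids, perm)
                if p is not None and t is not None
            ]
    return best


def instance_precision_recall(preds, truths, threshold=0.5):
    """Instance-level precision and recall at a given IoU threshold."""
    matches = match_instances(preds, truths)
    tp = sum(1 for _, _, s in matches if s > threshold)
    fp = len(preds) - tp   # predictions without a sufficiently good match
    fn = len(truths) - tp  # ground truth instances that were missed
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if truths else 0.0
    return precision, recall


# Three predicted instances and two ground truth instances of one class.
preds = [{0, 1, 2, 3}, {10, 11}, {20}]
truths = [{0, 1, 2}, {10, 11, 12, 13, 14}]
precision, recall = instance_precision_recall(preds, truths, threshold=0.5)
```

Here only the first prediction clears the 0.5 threshold (IoU 0.75); the second overlaps its match with IoU 0.4 and the third matches nothing, so each instance counts once regardless of how many points it contains.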
IV-A2 Comparison to Other Lidar Segmentation Methods
[Table I. Column groups: Noisy Ground Truth (2791 Frames) | Manually Labeled Ground Truth (200 Frames)]
We benchmark LDLS through a comparison against SqueezeSeg [6], SqueezeSegV2 [7], and PointSeg [8], state-of-the-art convolutional neural network methods for object segmentation in lidar point clouds. These methods take as input a lidar point cloud transformed through panoramic projection into a multi-channel tensor, where the 5 channels are x-, y-, and z-coordinates, depth, and lidar intensity. Since KITTI provides 3D bounding box object annotations, rather than point labels, Wu et al. [6] generated segmentation annotations by labeling points that fall within the annotated 3D boxes, producing an 8057-frame training set and a 2791-frame validation set.
However, this bounding box-based annotation process results in labeling errors, for example from extra points that fall within the boxes. In order to maximize the quality of our evaluation, we therefore manually annotated 200 KITTI point clouds (randomly selected from within the validation set used by SqueezeSeg) with class and instance ground truth segmentations. In our annotation process, point labels are initialized according to the KITTI bounding boxes, and then manually cleaned up by annotators. The end result is a set of pristinely labeled ground truth point clouds. Note that KITTI objects whose bounding boxes contain no lidar points are dropped. In order to quantify the error of the bounding box-generated point labels, we calculated the semantic segmentation IoU of the KITTI box-derived labels with our manually annotated labels, and found an overall IoU of 81.1 for cars and 93.4 for pedestrians, indicating a significant amount of error in bounding box-generated labels.
We apply LDLS to the 2791-frame SqueezeSeg validation data set, as well as our 200-frame manually labeled data set. These two evaluations each present a different takeaway. The 2791-frame evaluation is useful due to its large scale, and due to its usage by previous works. However, due to the bias caused by annotation errors, this evaluation cannot definitively establish real-world segmentation performance. Rather, its purpose is to broadly establish competitiveness of our method alongside existing works. The smaller-scale evaluation on manual annotations complements the first evaluation by removing annotation error, and therefore offering the clearest possible indicator of expected real-world performance.
Table I presents results from these two evaluations. SqueezeSeg, SqueezeSegV2, and PointSeg results are produced after training on the 8057-point cloud training set. For these methods, results on the noisy ground truth are taken from their respective papers [6, 7, 8]. As the SqueezeSeg validation data set does not include instance labels, Table I presents results using semantic segmentation metrics only. For consistency with the other methods, we treat Pedestrian and Person_sitting as two separate classes in this experiment only, even though the Mask-RCNN model used in LDLS does not distinguish between the two. This only slightly impacts our results, due to the small number of Person_sitting instances in KITTI. Results in all other following experiments are presented with these two classes merged.
When measured on the SqueezeSeg validation data, LDLS achieves competitive semantic segmentation performance compared to SqueezeSeg, SqueezeSegV2, and PointSeg, although without any training on labeled 3D data. Additionally, LDLS outperforms the other methods in overall pedestrian segmentation IoU. On the manually labeled data, the difference is far more pronounced: our method achieves a higher IoU than the next-best method for both car segmentation and pedestrian segmentation. We hypothesize that one reason for the difference in performance between LDLS and SqueezeSeg/PointSeg is that the latter methods were trained on data that includes annotation errors. When the test set includes similar annotation errors and is therefore closer to the training domain (i.e. the SqueezeSeg validation set), the performance difference between LDLS and the other methods is smaller. However, this gap grows wider for evaluation on error-free annotations.
Note that this comparison is not symmetrical: SqueezeSeg and PointSeg make use of labeled 3D training data, while LDLS uses RGB images as well as a pretrained 2D segmentation model (implicitly also using 2D image training data). However, it is important to view this difference in the context of robotics applications. Lidar-equipped robots commonly also have a camera sensor, or can be inexpensively outfitted with one. On the other hand, creating a labeled training set of thousands of lidar point clouds is difficult and time-consuming. Therefore, if installing a camera on a robot and then applying an off-the-shelf image segmentation model can enable 3D segmentation performance similar to or greater than that of a model trained on labeled point clouds, adding the camera information is a reasonable choice in a robotics context.
IV-A3 Instance Segmentation Evaluation
We also present an instance segmentation evaluation by applying LDLS to our manually annotated KITTI ground truth data. Results calculated across all object instances in the ground truth are shown in Table II.
To better understand the performance of LDLS, we additionally study the effect of object range on accuracy. Label diffusion assumes that neighboring points are more likely to share a class label; this assumption weakens at further distances from the sensor, where lidar data becomes sparser. Therefore, we hypothesize that LDLS should be more reliable for objects that are closer to the sensor and visible with a higher density of points.
We test this hypothesis by performing evaluations at different ranges. For each evaluation, we exclude all lidar points above a given maximum range cutoff. Results are plotted in Figure 3. As range from the sensor increases, instance segmentation performance decreases significantly. Semantic segmentation degradation is not as significant; likely because these metrics count numbers of points and therefore are biased towards nearer, more densely populated lidar points.
Figure 4 plots each object instance within our test set on a scatter plot, as a function of distance to the object’s centroid against instance segmentation IoU. As objects become more distant, a wider range of IoU results appear.
These experiments indicate that LDLS generally segments object instances more reliably at closer distances, with performance falling off as range increases. This suggests that an all-purpose robotic perception system may be best served by using a vision-based bounding box object detector at far ranges, with LDLS applied at close ranges to allow a robot to precisely sense and interact with its immediate surroundings.
IV-A4 Ablation Study
To demonstrate the benefits of the different components of the LDLS pipeline, we perform an ablation study by removing different components and comparing results. The ablation settings we experimented with are:
Direct projection labeling Lidar points are naively labeled, without graph diffusion, based on whether they project to within a 2D segmentation mask in the image.
Diffusion without outlier removal The full pipeline is executed, except for the final outlier removal step.
Table III contains semantic and instance (using both the 0.50 and 0.70 IoU thresholds) segmentation results from running these settings on the manually annotated data. We see that both the diffusion and outlier removal steps improve overall performance, with the former contributing a major performance gain. This finding confirms the value of label diffusion in fusing 2D and 3D information. Note that the full pipeline results differ slightly from the results presented in Table I, since the ablation results are calculated after merging the Person_sitting and Pedestrian classes.
|Setting|Class|Semantic Seg. Precision|Semantic Seg. Recall|Semantic Seg. IoU|Instance Seg. IoU=0.50 Precision|Instance Seg. IoU=0.50 Recall|Instance Seg. IoU=0.70 Precision|Instance Seg. IoU=0.70 Recall|
|Direct projection labeling without diffusion|Car|69.2|83.2|60.7|44.1|59.8|16.4|22.2|
|Direct projection labeling without diffusion|Pedestrian + Person Sitting|60.1|72.2|48.8|30.0|51.9|5.3|9.1|
|Diffusion without outlier removal|Car|78.2|90.8|72.5|66.3|78.7|51.6|61.2|
|Original, with all components|Car|84.1|88.4|75.7|66.8|79.3|57.7|68.5|
IV-B Qualitative Evaluation
Our quantitative study establishes the accuracy of LDLS on large-scale annotated ground truth, but is limited to just two object classes, and data from a high-resolution lidar sensor. In order to complement the quantitative evaluation, and demonstrate the applicability of LDLS to different environments, classes, sensors, and data collection platforms, we present two additional sets of results: a) residential and urban sequences from KITTI, and b) a sequence captured on the Cornell University campus using a Clearpath™ Jackal mobile ground robot with a Velodyne VLP-16 lidar sensor and RGB camera.
For this evaluation, we present segmentation results on a wider variety of MS COCO classes, including people, cars, trucks, bicycles, chairs, benches, and potted plants. Segmentation of all of these object classes in lidar data is made possible since LDLS adopts all classes that are detected by the image segmentation model used; in this case the Mask-RCNN model pretrained on the MS COCO data set. An overview of results is shown in Figure 5. In the KITTI data, new object classes are generally segmented with qualitatively comparable accuracy to cars and pedestrians, although narrower objects such as bicycles present a challenge. In comparison, the campus data collection on the Jackal robot exhibits more segmentation errors, and performance breaks down more significantly for objects at farther distances. We hypothesize the following reasons for these differences. Firstly, the VLP-16 sensor outputs only 16 laser scan lines, as opposed to the 64-scan lidar used in KITTI, making the lidar point clouds sparser and more difficult to segment, especially at further ranges. Secondly, we believe that errors from sensor calibration and time synchronization were higher on the Jackal, compared to the KITTI data set. Still, we find LDLS to demonstrate adequate segmentation performance for a small, relatively inexpensive robot.
In terms of computation, the graph construction and sparse matrix multiplication iterations are highly parallelizable and can be GPU-accelerated. On the KITTI evaluation data, our current Python implementation averages approximately 0.38 seconds per frame on an Nvidia GTX 1080 Ti, excluding the computation of Mask-RCNN results.
In this paper, we present LDLS, a method for instance segmentation of 3D point clouds which leverages a pretrained 2D image segmentation model, followed by label diffusion on a graph connecting 2D pixels and 3D points, to remove any need for labeled 3D training data. By removing this requirement, we make LDLS suitable for application in various environments and on different robotic platforms. Quantitative evaluations on the KITTI data set demonstrate superior accuracy at car and pedestrian segmentation compared to previous methods, and qualitative evaluations demonstrate segmentation of a wider variety of object classes. Our results are additionally presented without fine-tuning of the image segmentation model, demonstrating generalizability to new domains.
The authors would like to thank Sarah Allen, Christopher Graef, Emily Sun, and Shuo Han for their help with collecting and preparing the data used in the presented experiments.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in ECCV, 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in NeurIPS, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NeurIPS, 2012.
-  C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in CVPR, 2017.
-  B. Wu, A. Wan, X. Yue, and K. Keutzer, “Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud,” in ICRA, 2018.
-  B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer, “Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud,” in ICRA, 2019.
-  Y. Wang, T. Shi, P. Yun, L. Tai, and M. Liu, “Pointseg: Real-time semantic segmentation based on 3d lidar point cloud,” arXiv preprint arXiv:1807.06288, 2018.
-  Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in CVPR, 2018.
-  J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3d proposal generation and object detection from view aggregation,” in IROS, 2018.
-  C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3d object detection from rgb-d data,” in CVPR, 2018.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
-  A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012.
-  X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with label propagation,” Tech. Rep., 2002.
-  X. Zhu, J. Lafferty, and R. Rosenfeld, “Semi-supervised learning with graphs,” Ph.D. dissertation, Carnegie Mellon University, 2005.
-  B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in CVPR, 2018.
-  X. Du, M. H. Ang, S. Karaman, and D. Rus, “A general pipeline for 3d detection of vehicles,” in ICRA, 2018.
-  A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” International Journal of Robotics Research (IJRR), 2013.
-  C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in NeurIPS, 2017.
-  W. Wang, R. Yu, Q. Huang, and U. Neumann, “Sgpn: Similarity group proposal network for 3d point cloud instance segmentation,” in CVPR, 2018.
-  M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun, “Sbnet: Sparse blocks network for fast inference,” in CVPR, 2018.
-  S. Wang, S. Suo, W.-C. Ma, A. Pokrovsky, and R. Urtasun, “Deep parametric continuous convolutional neural networks,” in CVPR, 2018.
-  Y. Li, R. Bu, and X. Di, “PointCNN: Convolution On X-Transformed Points,” in NeurIPS, 2018.
-  J. Lee, S. Walsh, A. Harakeh, and S. L. Waslander, “Leveraging pre-trained 3d object detection models for fast ground truth generation,” in ITSC, 2018.
-  K. Lertniphonphan, S. Komorita, K. Tasaka, and H. Yanagihara, “2d to 3d label propagation for object detection in point cloud,” in ICME Workshops, 2018.
-  W. Maddern and P. Newman, “Real-time probabilistic fusion of sparse 3d lidar and dense stereo,” in IROS, 2016.
-  J. R. Schoenberg, A. Nathan, and M. Campbell, “Segmentation of dense range information in complex urban scenes,” in IROS, 2010.
-  Y. Wang, R. Ji, and S.-F. Chang, “Label propagation from imagenet to 3d point clouds,” in CVPR, 2013.
-  W. Wang and U. Neumann, “Depth-Aware CNN for RGB-D Segmentation,” in ECCV, 2018.
-  J. Xie, M. Kiefel, M.-T. Sun, and A. Geiger, “Semantic instance annotation of street scenes by 3d to 2d label transfer,” in CVPR, 2016.
-  R. Zhang, G. Li, M. Li, and L. Wang, “Fusion of images and point clouds for the semantic segmentation of large-scale 3D scenes based on deep learning,” ISPRS Journal of Photogrammetry and Remote Sensing, 2018.
-  M. Belkin, I. Matveeva, and P. Niyogi, “Regularization and semi-supervised learning on large graphs,” in COLT, 2004.
-  A. Geiger, F. Moosmann, Ö. Car, and B. Schuster, “Automatic camera and range sensor calibration using a single shot,” in ICRA, 2012.
-  N. Bell and M. Garland, “Efficient Sparse Matrix-Vector Multiplication on CUDA,” Tech. Rep., 2008.