1 Introduction
There has been great progress in recent years on 3D object detection for robotics and autonomous driving applications. Previous work on 3D object detection takes one of the following approaches: (a) projecting LIDAR points to a 2D bird's-eye view and performing 2D detection on the projected image, (b) performing 2D detection on images and using a frustum to intersect the detections with the point cloud, or (c) using a two-stage approach where points are first grouped together and then an object is predicted for each group.
Each of these approaches comes with its own drawbacks. Projecting LIDAR to a bird's-eye view image sacrifices geometric details which may be critical in cluttered indoor environments. The frustum-based approaches are strictly dependent on the 2D detector and will miss an object entirely if it is not detected in 2D. Finally, the two-stage methods introduce additional hyperparameters and design choices which require tuning and adapting for each domain separately. Furthermore, we believe grouping the points is a harder task than predicting 3D objects. Solving the former in order to predict the latter imposes an unnecessary upper bound that limits the accuracy of 3D object detection.
In this paper, we propose a single-stage 3D object detection method that outperforms previous approaches. We predict 3D object properties for every point while allowing information to flow in the 3D adjacency graph of predictions. This way, we do not make hard grouping decisions, while at the same time letting information propagate from each point to its neighborhood.
In addition to predicting 3D bounding boxes, our pipeline can also output the reconstructed 3D object shapes as depicted in Figure 1. Even though there have been various approaches proposed for predicting 3D bounding boxes, predicting the 3D shapes and extents of objects remains largely underexplored. The main challenges in predicting the 3D shapes of objects are sparsity in LIDAR scans, predominant partial occlusion, and lack of ground-truth 3D shape annotations. In this work, we address these challenges by proposing a novel weakly-supervised approach.
Our proposed solution for shape prediction consists of two steps. First, we learn 3D object shape priors using an external 3D CAD-model dataset by training an encoder that maps an object shape into an embedding representation and a decoder that recovers the 3D shape of an object given its embedding vector. Then, we augment our 3D object detection network to predict a shape embedding for each object such that its corresponding decoded shape best fits the points observed on the surface of that object. Using this as an additional constraint, we train a network that learns to detect objects and to predict their semantic labels and their 3D shapes.
To summarize, our main contributions are as follows. First, we propose a single-stage 3D object detection method that achieves state-of-the-art results on both indoor and outdoor point cloud datasets. While previous methods often make certain design choices (e.g. projection to a bird's-eye view image) based on the problem domain, we show the possibility of having a generic pipeline that aggregates per-point predictions with graph convolutions. By forming better consensus predictions in an end-to-end hybrid network, our approach outperforms previous works in both indoor and outdoor settings while running at a speed of 12 ms per frame. Second, in addition to 3D bounding boxes, our model is also able to jointly predict the 3D shapes of the objects efficiently. Third, we introduce a training approach that does not require ground-truth 3D shape annotations in the target dataset (which are not available in large-scale self-driving car datasets). Instead, our method learns a shape prior from a dataset of CAD models and transfers that knowledge to the real-world self-driving car setup.
2 Related Works
2.1 3D Object Detection
3D object detection has been studied extensively. In this paper, we focus on applications such as autonomous driving, where the input is a collection of 3D points captured by a LIDAR range sensor. Processing this type of data using neural networks introduces new challenges. Most notably, unlike images, the input is highly sparse, making it inefficient to uniformly process all locations in the 3D space.
To deal with this problem, PointNet [30, 31] directly consumes the 3D coordinates of the sparse points and processes the point cloud as a set of unordered points. FoldingNet [40], AtlasNet [12], 3D Point Capsule Net [44], and PointWeb [43] improve the representation by incorporating the spatial relationships among the points into the encoding process. For the task of 3D object detection, various methods rely on PointNets for processing the point cloud data. To name a few, Frustum PointNets [29] uses these networks for the final refinement of the object proposals and PointRCNN [33] employs PointNets for the task of proposal generation. VoteNet [28] deploys PointNet++ to directly predict bounding boxes from points in a two-stage voting scheme.
Projecting the point cloud data to a 2D space and using 2D convolutions is an alternative approach for reducing the computation. Bird's-eye view (BEV), front view, native range view, and learned projections are among such 2D projections. PIXOR [39], Complex YOLO [35], and Complexer YOLO [34] generate 3D bounding boxes in a single stage based on the projected BEV representation. Chen et al. [3] and Liang et al. [20] use a BEV representation and fuse its extracted information with RGB images to improve the detection performance. VeloFCN [18] projects the points to the front view and uses 2D convolutions for 3D bounding box generation. Recently, LaserNet [25] showed that it is possible to achieve state-of-the-art results while processing the more compact native range view representation. PointPillars [17], on the other hand, learns this 2D projection by training a PointNet to summarize the information of points that lie inside vertical pillars in the 3D space.
Voxelization followed by 3D convolutions is also applied to point cloud-based object detection [46]. However, 3D convolution is computationally expensive, especially when the input has a high spatial resolution. Sparse 3D convolution [7, 9, 10] has been shown to be effective in solving this problem. Our backbone in this paper uses voxelization with sparse convolutions to process the point cloud.
Modeling auxiliary tasks has also been studied in the literature. Fast and Furious [22] performs detection, tracking, and motion forecasting using a single network. HDNET [38] estimates high-definition maps from LIDAR sweeps and uses the geometric features to improve 3D detection. Liang et al. [19] perform 2D detection, 3D detection, ground estimation, and depth completion. Likewise, our system predicts the 3D shape of the objects from incomplete point clouds in addition to detecting the objects.
2.2 3D Shape Prediction for Object Detection
For 3D object detection from images, 3D-RCNN [15] recovers the 3D shape of the objects by estimating the pose of known shapes. A render-and-compare loss with 2D segmentation annotation is used as supervision. Instead of using known shapes, Mesh R-CNN [8] first predicts a coarse voxelized shape followed by a refinement step. The 3D ground-truth information is assumed to be given. For semantic segmentation, ShapeMask [16] improves generalization to unseen categories by estimating the shape of the detected objects. For 3D detection, GSPN [41] learns a generative model to predict 3D points on objects and uses them for proposal generation. ROI-10D [23] annotates ground-truth shapes offline and adds a new branch for shape prediction. In contrast, our approach does not need 3D shape ground-truth annotations in the target dataset. We use the recently proposed implicit shape modeling [27, 24, 32] to learn a function for representing a shape prior. This prior is then used as a weakly supervised signal when training the shape prediction branch on the target dataset.
3 Approach
The overall architecture of our model is depicted in Figure 2. The model consists of four parts. The first one consumes a point cloud and predicts per-point object attributes and shape embedding. The second component builds a graph on top of these per-point predictions and uses graph convolutions to transfer information across the predictions. The third component proposes the final 3D boxes and their attributes by iteratively sampling high-scoring boxes which are farthest from the already selected ones. Finally, the fourth component decodes the predicted shape embeddings into SDF values, which we convert to 3D meshes using the Marching Cubes algorithm [21].
3.1 Per Point 3D Object Prediction
Given a point cloud whose points carry input features (e.g. positions, colors, intensities, normals), a 3D encoder-decoder network first predicts 3D object attributes (center, size, rotation matrix, and semantic logits) and the shape embedding for every point.
We use SparseConvNet [11] as the backbone to generate per-point features. Each of the object attributes and the shape embedding vector is computed by applying two layers of 3D sparse convolutions on the extracted features.
Box Prediction Loss: We represent a 3D object box by three properties: size (length, width, height), center location, and a 3x3 rotation matrix. Given these predictions, we use a differentiable function to compute the eight 3D corners of each predicted box. We apply a Huber loss on the distance between the predicted and the ground-truth corners. The loss automatically propagates back to the size, center, and rotation variables.
To compute the rotation matrix, our network predicts 6 parameters, from which the rotation matrix is then formulated.
The benefit of using this loss in comparison to separate losses for rotation, center, and size is that we do not need to tune the relative scale among multiple losses. Our box corner loss propagates back to all of these variables and minimizes the predicted corner errors. We define the per-point box corner regression loss as
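$$\mathcal{L}_c = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{8}\sum_{j=1}^{8}\mathbb{1}(x_i)\,\rho\left(\lVert C_{ij} - C^{gt}_{ij}\rVert_2\right) \qquad (1)$$

where $N$ is the number of points, $\rho$ is the Huber loss (i.e. smooth L1 loss), and $\mathbb{1}(x_i)$ is a binary function indicating whether point $x_i$ is on an object surface. $C$ and $C^{gt}$ are the sets of predicted and ground-truth corners, in which $C_{ij}$ represents the $j$'th predicted corner for point $i$, and $C^{gt}_{ij}$ denotes the corresponding ground-truth corner.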
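The corner loss described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names (`box_corners`, `huber`, `corner_loss`), the averaging over points and over the eight corners, and the Huber threshold of 1.0 are assumptions made for illustration.

```python
import math

def box_corners(center, size, rotation):
    """Return the 8 corners of a box.

    center: (x, y, z); size: (length, width, height);
    rotation: 3x3 matrix as nested lists (row-major).
    """
    length, width, height = size
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                local = (sx * length, sy * width, sz * height)
                # Rotate the local offset, then translate by the center.
                world = tuple(
                    center[r] + sum(rotation[r][c] * local[c] for c in range(3))
                    for r in range(3)
                )
                corners.append(world)
    return corners

def huber(x, delta=1.0):
    """Huber (smooth L1) penalty; delta=1.0 is an assumed threshold."""
    ax = abs(x)
    return 0.5 * ax * ax if ax <= delta else delta * (ax - 0.5 * delta)

def corner_loss(pred_boxes, gt_boxes, on_object):
    """Masked corner loss averaged over points and corners (assumed normalization).

    pred_boxes / gt_boxes: one (center, size, rotation) triple per point.
    on_object: one boolean per point, True if the point lies on an object.
    """
    total = 0.0
    for pred, gt, mask in zip(pred_boxes, gt_boxes, on_object):
        if not mask:
            continue  # background points contribute nothing
        pred_c = box_corners(*pred)
        gt_c = box_corners(*gt)
        total += sum(huber(math.dist(p, g)) for p, g in zip(pred_c, gt_c)) / 8.0
    return total / len(pred_boxes)
```

Because the corners are a differentiable function of center, size, and rotation, a single scalar of this form is enough to drive all three sets of variables in an autodiff framework.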
Dynamic Classification Loss: Every point in the point cloud predicts a 3D bounding box. The box prediction loss forces each point to predict the box that it belongs to. Some of the points make more accurate box predictions than others. Thus, we design a classification loss that classifies points that make accurate predictions as positive and others as negative. During the training stage, at each iteration, we compute the IoU overlap between the predicted boxes and their ground-truth matches, and classify the points that have an IoU of more than 70% as positive and the rest as negative. This loss gives an improvement of a few percent in comparison to a regular classification loss (where we would label points that fall inside an object as positive and points outside as negative). We use a softmax loss for classification.
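The dynamic labeling rule can be sketched as follows. This is a simplified illustration rather than the paper's code: it assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, whereas the method itself handles rotated boxes, and the function names are hypothetical.

```python
def iou_3d_axis_aligned(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for k in range(3):
        lo = max(a[k], b[k])
        hi = min(a[k + 3], b[k + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)

def dynamic_labels(pred_boxes, gt_boxes, threshold=0.7):
    """Label a point positive (1) only if its own predicted box is accurate.

    pred_boxes[i] is the box predicted by point i and gt_boxes[i] is its
    matched ground-truth box. Labels are recomputed at every iteration,
    so they change as the box predictions improve.
    """
    return [
        1 if iou_3d_axis_aligned(p, g) > threshold else 0
        for p, g in zip(pred_boxes, gt_boxes)
    ]
```

The contrast with a regular classification loss is that a point inside an object can still be labeled negative when the box it predicts is poor.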
3.2 Object Proposal Consolidation
Each point predicts its object center, size, and rotation matrix. We create a graph where the points are the nodes, and each point is connected to its nearest neighbors in the center space. In other words, each point is connected to those with similar center predictions. We perform a few layers of graph convolution to consolidate the per-point object predictions. A weight value is estimated per point by the network, which determines the significance of the vote a point casts in comparison to its neighbors. We update each object attribute predicted by the points as follows:
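$$a_i \leftarrow \frac{w_i a_i + \sum_{j \in \mathcal{N}(i)} w_j a_j}{w_i + \sum_{j \in \mathcal{N}(i)} w_j} \qquad (2)$$

where $a_i$ is an object attribute (e.g. object length) predicted for point $i$, $\mathcal{N}(i)$ is the set of neighbors of $i$ in the predicted center space, and $w_i$ is the weight predicted for point $i$.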
We apply the bounding box prediction loss both before and after the graph convolution step to let the network learn a set of weights that make final predictions more accurate. In this way, instead of directly applying a loss on the predicted point weights, the network automatically learns to assign larger weights to more confident points.
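One consolidation step can be sketched in a few lines of Python. The sketch assumes a weighted average over each point and its `k` nearest neighbors in predicted-center space, with the point itself included in the average; the brute-force neighbor search and the names `k_nearest` and `consolidate` are illustrative choices, not the paper's implementation.

```python
import math

def k_nearest(centers, i, k):
    """Indices of the k points whose predicted centers are closest to point i."""
    others = [j for j in range(len(centers)) if j != i]
    others.sort(key=lambda j: math.dist(centers[i], centers[j]))
    return others[:k]

def consolidate(attribute, weights, centers, k=2):
    """Weighted-average update of one scalar attribute per point.

    attribute[i]: value predicted by point i (e.g. object length)
    weights[i]:   vote weight predicted by the network for point i
    centers[i]:   object center predicted by point i
    """
    updated = []
    for i in range(len(attribute)):
        group = [i] + k_nearest(centers, i, k)
        weight_sum = sum(weights[j] for j in group)
        updated.append(sum(weights[j] * attribute[j] for j in group) / weight_sum)
    return updated
```

Points that vote for the same center end up with nearly identical attributes after a few such steps, without any hard grouping decision being made.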
3.3 Proposing Boxes
Our network predicts a 3D object box and a semantic score for every point. During the training stage, we apply the losses directly to the per-point predictions. However, during evaluation, we need a box proposal mechanism that can reduce the hundreds of thousands of box predictions to a few accurate box proposals. We can greedily pick boxes with high semantic scores. However, we also want to encourage spatial diversity in the locations of the proposed boxes. For this reason, we compute the distance between each predicted box center and all previously selected boxes and choose the one that is far from the already picked points (similar to the heuristic used by KMeans++ initialization [1]). More precisely, at each step, given the boxes selected in the previous steps, we select as the next seed the point whose box best combines a large distance to those boxes with a high foreground semantic score.

Selecting boxes with a high foreground semantic score favors high precision, and selecting diverse boxes favors high recall. Note that our sampling strategy is different from the non-maximum suppression (NMS) algorithm. In NMS, boxes that have a high IoU are suppressed and are not redeemable, while in our algorithm, we can tune the balance between confidence and diversity.
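The sampling idea can be illustrated with a short Python sketch. The exact selection criterion is not reproduced here; the product of the foreground score and the distance to the nearest already-selected center is an assumption chosen only to show how confidence and diversity can be traded off, and `sample_boxes` is a hypothetical name.

```python
import math

def sample_boxes(centers, scores, num_boxes):
    """Greedy farthest-and-highest sampling (illustrative criterion).

    centers[i]: predicted box center of point i
    scores[i]:  foreground semantic score of point i
    Returns the indices of the selected seed points, in selection order.
    """
    # Start from the most confident prediction.
    selected = [max(range(len(scores)), key=lambda i: scores[i])]
    target = min(num_boxes, len(centers))
    while len(selected) < target:
        def gain(i):
            # Distance to the closest box picked so far.
            nearest = min(math.dist(centers[i], centers[j]) for j in selected)
            return scores[i] * nearest  # assumed way to combine the two terms

        candidates = [i for i in range(len(centers)) if i not in selected]
        selected.append(max(candidates, key=gain))
    return selected
```

Unlike NMS, a near-duplicate box is never removed outright: it only receives a small gain and can still be selected if no better candidate remains.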
3.4 Shape Prediction
To predict shapes, first, we learn a shape prior function from an external synthetic 3D dataset of CAD models as discussed in Section 3.4.1. Then we deploy our learned prior to recover 3D shapes from the embedding space predicted by the object detection pipeline.
3.4.1 Modeling the Shape Prior
There are various ways to represent a shape prior. For our application, given that a shape embedding vector should be predicted for each point in the point cloud, the representation needs to be compact. We use an encoder-decoder architecture with a compact bottleneck to model the shape prior. The general framework is depicted in Figure 3.
The shape encoder consumes the point cloud of an object after data augmentation (e.g. random cropping) and outputs a compact shape embedding vector. The point cloud representation of the object is first voxelized and then forwarded through an encoder network. The network consists of three convolutional blocks, each having two 3D sparse convolution layers interleaved with BatchNorm and ReLU layers (not shown in the figure for simplicity). The spatial resolution of the feature maps is reduced by a factor of two after each convolutional block. Finally, a fully-connected layer followed by a global average pooling layer outputs the embedding vector of the input shape.
For shape decoding, we represent the shape as a level set of an implicit function [24, 32, 27]. That is, the shape is modeled as the zero level set of a signed distance field (SDF) function over a unit hypercube. Following [24], we rely on Conditional Batch Normalization [5, 6] layers to condition the decoder on the predicted embedding vector. The input to the decoder is a batch of 3D coordinates of the query points. After five conditional blocks, a fully connected layer followed by a tanh function predicts the signed distance of each query from the surface of the object in a canonical viewpoint.

During training, we sample some query points close to the object surface and some uniformly in the unit hypercube surrounding the object to predict their SDF values. However, as suggested in [32], we regress towards discrete label values to capture more detail near the surface boundaries. More precisely, given a batch of $N$ training queries $\{x_i\}$, their corresponding ground-truth signed distance values $\{s_i\}$, and their predicted embedding vectors $\{e_i\}$, the loss is defined as:

$$\mathcal{L}_s = \frac{1}{N}\sum_{i=1}^{N}\left\lVert f(x_i \mid e_i) - \mathrm{sign}(s_i)\right\rVert_2^2 \qquad (3)$$

where $f$ is the conditional decoder function and $\mathrm{sign}$ is the sign function.
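The idea of regressing toward discrete sign labels rather than raw signed distances can be sketched as below. The squared-error form and the mean over the batch are assumptions for illustration, and the decoder is replaced by a plain list of its outputs.

```python
def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)

def shape_loss(predicted_sdf, gt_sdf):
    """Mean squared error between decoder outputs and sign labels.

    predicted_sdf[i]: decoder output in [-1, 1] for query i (after tanh)
    gt_sdf[i]:        ground-truth signed distance of query i
    Only the sign of the ground truth is used as the regression target.
    """
    errors = [(p - sign(s)) ** 2 for p, s in zip(predicted_sdf, gt_sdf)]
    return sum(errors) / len(errors)
```

A query just outside the surface and a query far outside share the target +1, so the decoder is pushed to produce a sharp transition at the boundary.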
3.4.2 Training the Shape Prediction Branch
Although there is no ground-truth 3D shape annotation available in detection datasets collected for applications such as autonomous driving, once trained, the learned prior model can be deployed to enforce shape constraints. That is, for each object in the incomplete point cloud, we expect that most of the observed points in its bounding box lie on its surface.
To predict the shape embedding, we add a branch to the object detection pipeline that predicts a fixed-dimensional vector per point. The shape embeddings of all points belonging to an object are then average-pooled to form its shape representation. To enforce the constraints, we freeze the 3D decoder in Figure 3 and discard the encoder. Conditioned on the predicted shape embedding and given some shape queries per object, the frozen shape decoder should be able to predict the signed distances.
To define the queries, for each object present in the point cloud, we subtract the object center and rotate the queries to match the canonical orientation used when training the shape prior network. Then, the queries are projected into a unit hypercube. We also preprocess them by removing points on the ground and adding the symmetric counterparts of the points (if the object is symmetric). Finally, as the shape prior is trained with discrete sign labels, we sample a number of queries on the ray connecting the object center to each of the observed points and assign -1/+1 labels to inside/outside queries respectively (in this paper we sample two points per ray, at a fixed distance from each observed point). During training, we also optimize the loss defined in Eq. 3 for objects with a reasonable number of observed points (i.e. a minimum of 500 points in this paper).
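The ray-based query generation can be sketched as follows. The sketch assumes the observed points are already expressed in the object's canonical frame with the center at the origin; `ray_queries` and the parameter name `delta` (the offset along the ray) are hypothetical.

```python
import math

def ray_queries(points, delta):
    """Create one inside (-1) and one outside (+1) query per observed point.

    points: observed surface points, relative to the object center
    delta:  offset along the center-to-point ray (an assumed parameter name)
    Returns a list of (query_xyz, label) pairs.
    """
    queries = []
    for p in points:
        norm = math.sqrt(sum(c * c for c in p))
        if norm == 0.0:
            continue  # a point at the center defines no ray
        for label in (-1, +1):
            # Move toward the center for label -1, away from it for +1.
            scale = (norm + label * delta) / norm
            queries.append((tuple(c * scale for c in p), label))
    return queries
```

Each observed point thus contributes a pair of sign constraints on either side of the surface, which is the form of supervision the frozen decoder was trained with.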
3.5 Achieving Real-Time Speed
Our 3D sparse feature extractor with 3D sparse convolution layers, 3D sparse pooling layers, and 3D sparse unpooling layers achieves a speed of 12 ms per frame on the Waymo Open dataset (with around 200k input points per frame). Here we describe the implementation details of our TensorFlow sparse GPU operations.
We use CUDA to implement the submanifold sparse convolution [11] and sparse pooling GPU operations in TensorFlow. Since the input to the convolution operation is sparse, we need a mechanism to get all the neighbors of each non-empty voxel. We implemented a hashmap for this purpose, where the keys are the XYZ indices of the voxels and the values are the indices of the corresponding voxels in the input voxel array. We use an optimized spatial hash function [37]. Our experiments on the Waymo Open dataset show that with a load factor of , the average collision rate is . We precompute the neighbor indices for all non-empty voxels and reuse them in one or more subsequent convolution operations. We use various CUDA techniques to speed up the computation (e.g. partitioning and caching the filter in shared memory and using bit operations).
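The neighbor lookup described above can be illustrated on the CPU with a Python dictionary standing in for the GPU hashmap. This is only a sketch of the data structure: the function names are hypothetical, the spatial hash function of [37] is replaced by Python's built-in tuple hashing, and -1 is an assumed marker for an empty neighbor.

```python
from itertools import product

def build_voxel_index(voxel_coords):
    """Map each voxel's (x, y, z) index to its position in the input array."""
    return {tuple(c): i for i, c in enumerate(voxel_coords)}

def neighbor_indices(voxel_coords, kernel_size=3):
    """Precompute, for every non-empty voxel, the array index of each
    neighbor covered by a kernel_size^3 filter (-1 where the neighbor is empty).

    The result can be reused by every convolution that shares the same
    voxel layout, which is the case for submanifold convolutions.
    """
    table = build_voxel_index(voxel_coords)
    r = kernel_size // 2
    offsets = list(product(range(-r, r + 1), repeat=3))
    result = []
    for x, y, z in voxel_coords:
        row = [table.get((x + dx, y + dy, z + dz), -1) for dx, dy, dz in offsets]
        result.append(row)
    return result
```

With the neighbor table in hand, a sparse convolution reduces to gathering the features at the listed indices and multiplying them by the matching filter slices.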
Both 3D sparse max pooling and 3D sparse average pooling operations are implemented in CUDA. Since each voxel needs to be looked up only once during pooling, we do not reuse the convolution hashmap, which can introduce redundant lookups. Instead, we compute the pooled XYZ indices and use them as the keys to build a new "hash multimap" (multiple voxels can be pooled together and thus have the same key), and shuffle the voxels based on the keys. Our experiments show that this approach is more than 10X faster than the radix sort provided by the CUB library. Furthermore, since our pooling operation does not rely on the original XYZ indices, it can handle duplicate input indices. This allows us to use the same operation for voxelizing the point cloud, which is the most expensive pooling operation in the network. Our implementation is around 20X faster than a well-designed implementation with pre-existing TensorFlow operations.
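The pooling-by-key idea can be sketched on the CPU as follows. A `defaultdict` of lists plays the role of the hash multimap, features are scalars for brevity, and the name `sparse_pool` and the sorted output order are illustrative assumptions rather than properties of the CUDA operation.

```python
from collections import defaultdict

def sparse_pool(coords, features, stride, mode="max"):
    """Pool sparse voxels whose indices fall into the same coarse cell.

    coords:   integer (x, y, z) indices; duplicates are allowed
    features: one scalar feature per entry of coords
    stride:   pooling factor along every axis
    Returns (pooled_coords, pooled_features).
    """
    groups = defaultdict(list)  # pooled index -> all features mapped to it
    for c, f in zip(coords, features):
        key = tuple(v // stride for v in c)
        groups[key].append(f)

    pooled_coords = sorted(groups)
    pooled_features = []
    for key in pooled_coords:
        values = groups[key]
        if mode == "max":
            pooled_features.append(max(values))
        else:  # average pooling
            pooled_features.append(sum(values) / len(values))
    return pooled_coords, pooled_features
```

Because grouping depends only on the pooled key, the same routine voxelizes a raw point cloud once the point coordinates are quantized to integer indices, which is why duplicate inputs must be supported.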
4 Experiments
4.1 Experimental Setup
For our object detection backbone, we use an encoder-decoder U-Net with sparse 3D convolutions. The encoder consists of 6 blocks of 3D sparse convolutions, each of which has two 3D sparse convolutions inside. Going deeper, we increase the number of channels gradually (i.e. 64, 96, 128, 160, 192, 224, 256 channels). We also apply a 3D sparse pooling operation after each block to reduce the spatial resolution of the feature maps. For the decoder, we use the same structure but in the reverse order and replace the 3D sparse pooling layers with unpooling operations. Two 3D sparse convolutions with 256 channels connect the encoder and decoder and form the bottleneck. Models are trained on 20 GPUs with a batch size of 6 scenes per GPU. We use stochastic gradient descent with an initial learning rate of 0.3 and drop the learning rate every 10K iterations by the factors of [1.0, 0.3, 0.1, 0.01, 0.001, 0.0001]. We use weight decay and stop training when the loss plateaus. We use random rotations of (-10, 10) degrees along the z-axis and random scaling of (0.9, 1.1) for data augmentation.

The 3D sparse encoder in our shape prior network consists of three convolutional blocks with two 3D sparse convolutions in each. We use an embedding size of 128 dimensions and use the same number of channels in all 3D convolutional layers. We downsample the feature maps by a factor of 2 after each block. A global average pooling followed by a fully-connected layer outputs the predicted embedding. Our shape decoder consists of five conditional blocks with two 128-dimensional fully connected layers interleaved with conditional batch normalization layers. A tanh function maps predictions to [-1, +1]. We train our model with an initial learning rate of 0.1 with the same stepwise learning rate schedule used for training the detection pipeline.
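The stepwise schedule can be written as a small function. The sketch assumes that each listed factor multiplies the initial learning rate (rather than the previous value) and that the last factor is held once the list is exhausted; both are interpretations, and `learning_rate` is a hypothetical name.

```python
def learning_rate(step, base_lr=0.3, drop_every=10_000,
                  factors=(1.0, 0.3, 0.1, 0.01, 0.001, 0.0001)):
    """Piecewise-constant learning rate for a given training step.

    The stage advances every `drop_every` iterations; after the final
    stage the last factor is kept (assumed behavior).
    """
    stage = min(step // drop_every, len(factors) - 1)
    return base_lr * factors[stage]
```

Calling it with `base_lr=0.1` gives the corresponding schedule for the shape prior network under the same assumptions.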
4.2 Datasets
ScanNetV2 [4] is a dataset of 3D reconstructed meshes of around 1.5K indoor scenes with both 3D instance and semantic segmentation annotations. The meshes are reconstructed from RGB-D videos that are captured in various indoor environments. Following the setup in [28], we sample vertices from the reconstructed meshes as our input point clouds. Since ScanNetV2 does not provide amodal or oriented bounding box annotations, we predict axis-aligned bounding boxes instead, as in [28, 14].
Waymo Open Dataset [26, 45] is a large-scale self-driving car dataset, recently released for benchmarking 3D object detection. The dataset captures multiple major cities in the U.S., under a variety of weather conditions and across different times of the day. The dataset contains a total of 1000 sequences, where each sequence consists of around 200 frames that are 100 ms apart. The training split consists of 798 sequences containing 4.81M vehicle boxes. The validation split consists of 202 sequences with the same duration and sampling frequency, containing 1.25M vehicle boxes. The effective annotation radius in the Waymo Open dataset is 75 m for all object classes. For our experiments, we evaluate 3D object detection metrics for vehicles and predict 3D shapes for them.
Table 1: 3D object detection results on the ScanNetV2 dataset.

Method                  | Input         | mAP@0.25 | mAP@0.5
DSS [36, 14]            | Geo + RGB     | 15.2     | 6.8
MRCNN 2D-3D [13, 14]    | Geo + RGB     | 17.3     | 10.5
F-PointNet [29, 14]     | Geo + RGB     | 19.8     | 10.8
GSPN [42]               | Geo + RGB     | 30.6     | 17.7
3D-SIS [14]             | Geo + 1 view  | 35.1     | 18.7
3D-SIS [14]             | Geo + 3 views | 36.6     | 19.0
3D-SIS [14]             | Geo + 5 views | 40.2     | 22.5
3D-SIS [14]             | Geo only      | 25.4     | 14.6
DeepVote [28]           | Geo only      | 58.6     | 33.5
DOPS (ours)             | Geo only      | 63.7     | 38.2
4.3 Object Detection on ScanNetV2
We present our object detection results on the ScanNetV2 dataset in Table 1. For this dataset, we follow [28, 14] and predict axis-aligned bounding boxes. Although we only use the available geometric information, we also compare the proposed method with approaches that use the available RGB images and different viewpoints. Our approach improves on the previous state of the art by 5.1% and 4.7% with respect to the mAP@0.25 and mAP@0.5 metrics, respectively. We also report our per-category results on the ScanNetV2 dataset in Table 2. Figure 7 shows our qualitative results.
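Table 2: Per-category 3D object detection results on the ScanNetV2 dataset.

Category        | mAP@0.25 | mAP@0.5
Bathtub         | 86.6     | 71.0
Bed             | 83.3     | 70.2
Bookshelf       | 41.0     | 21.4
Cabinet         | 53.2     | 25.2
Chair           | 91.6     | 75.8
Counter         | 51.9     | 9.5
Curtain         | 53.9     | 24.4
Desk            | 73.7     | 39.4
Door            | 54.8     | 27.8
Other           | 59.2     | 35.0
Picture         | 26.3     | 12.3
Refrig.         | 49.2     | 33.7
Shower curtain  | 64.7     | 17.3
Sink            | 71.3     | 35.7
Sofa            | 82.6     | 54.8
Table           | 60.5     | 41.2
Toilet          | 98.0     | 80.6
Window          | 45.2     | 12.1
Mean            | 63.7     | 38.2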
4.4 Object Detection on Waymo Open
We achieve an mAP of 56.4% at an IoU of 0.7, compared with 53.0% for StarNet [26]. Note that [45] also reports 3D object detection results on the Waymo Open dataset. However, their results are not directly comparable to ours since they fuse 2D networks applied to multiple views in addition to a 3D network. Since our detection pipeline consists of different parts, we also perform our ablation studies on this dataset. Table 3 shows the contribution of each component of the system to its overall performance. Each column shows the performance when a single component of the system is excluded and the rest remain the same. Removing graph convolution over the predictions on the neighborhood graph reduces the detection performance by 1.9%, showing its importance. Replacing the dynamic classification loss with a regular classification loss drops the performance by 3.3%. Finally, if instead of the farthest and highest object sampling, one directly deploys NMS to form the objects, the performance drops by 0.7%. We also noticed that shape prediction does not have a noticeable impact on the detection precision. We believe the main reason is that the Waymo Open dataset has manually labeled bounding boxes for object detection, but no ground-truth shape annotations. As a result, the shape predictions are supervised only with noisy, partial, and sparse LIDAR data, which provides a relatively weaker training signal.






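Table 3: Ablation study on the Waymo Open dataset.

         | DOPS (full) | w/o graph convolution | w/o dynamic classification loss | NMS instead of farthest and highest sampling
mAP@0.7  | 56.4        | 54.5                  | 53.1                            | 55.7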
4.5 3D Shape Prediction on Waymo Open
To model shape, we first learn a prior from the synthetic ShapeNet dataset [2]. Figure 4 shows shapes recovered from the compact embedding vectors predicted for CAD models in ShapeNet. Each row represents one shape and the columns show the results for different embedding dimensions. We run Marching Cubes [21] with a resolution of 100 points per side on the SDF values predicted by our decoder for a uniform hypercube surrounding the object. As can be seen, the decoder can recover the extent of the object from the predicted embedding vector, even when the dimensionality of the embedding space is low.
Once trained on the ShapeNet dataset, we freeze the decoder and use it to recover shapes from the observed points in the real-world scenes captured by the LIDAR sensors. However, compared to the synthetic CAD models, LIDAR points are incomplete and noisy, and the distribution of the observed points can be different from that of the clean synthetic datasets. Consequently, we found proper preprocessing and data augmentation techniques crucial. Notably, ShapeNet contains dense annotations even for surfaces inside the objects. However, when it comes to autonomous driving datasets, only a sparse set of points on the surface of the object is observed. We remove internal points when training on the ShapeNet dataset and empirically noticed that this step improves convergence and shape prediction quality. Moreover, the LIDAR sensor frequently captures points on the ground, while this does not happen in ShapeNet. We therefore also remove points on the ground based on the coordinate frame of each object.
Given a set of observed points in the point cloud, a predicted encoding vector, and a frozen decoder, it is possible to enforce weakly supervised constraints to recover the shapes. The points which are observed should lie on the surface of the object with a high probability. That is, the frozen decoder conditioned on the predicted embedding should predict a zero SDF value for these points. However, this set of constraints is not enough for reliably recovering the shape. Figure 5 b shows the case when a shape is fitted to a set of points observed from an object in the Waymo Open dataset, shown in Figure 5 a. As can be seen, the decoder is able to fit a complex surface to the points: the shape passes almost perfectly through the observed points, yet it is not a plausible object shape.

Instead, we augment the points with additional ones sampled along the ray connecting the observations to the object center. For each observed point, we add two points on this ray, inside and outside the object, at a fixed distance from the surface and assign labels -1/+1 to them respectively. Figures 5 c and 5 d show the shape fitting for two different settings of this distance. As can be seen, this augmentation technique is crucial, and sampling closer points increases the quality of the recovered shape.
Finally, Figure 6 presents our end-to-end shape prediction results. Note that the car shapes fit the point cloud and are not simply copies of examples from a database.
5 Conclusions
We propose DOPS, a single-stage object detection system which operates on point cloud data. DOPS directly predicts object properties for each point. Instead of grouping points before prediction, a graph convolution module is deployed to aggregate the information across neighboring points. For a more accurate localization, it also outputs a 3D mesh using a shape prior learned on a synthetic dataset of CAD models. We show state-of-the-art results on 3D object detection datasets for both indoor and outdoor scenes. Topics for future work include detection and tracking over time, semi-supervised training of shape priors, and extending shape models to handle non-rigid objects.
References
 [1] (2007) Kmeans++: the advantages of careful seeding. Proc. symposium on discrete algorithms. Cited by: §3.3.
 [2] (2015) Shapenet: an informationrich 3d model repository. In arXiv:1512.03012, Cited by: §4.5.

[3]
(2017)
Multiview 3d object detection network for autonomous driving.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 1907–1915. Cited by: §2.1.  [4] (2017) Scannet: richlyannotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
 [5] (2017) Modulating early visual processing by language. In Advances in Neural Information Processing Systems, pp. 6594–6604. Cited by: Figure 3, §3.4.1.
 [6] (2016) Adversarially learned inference. arXiv preprint arXiv:1606.00704. Cited by: §3.4.1.

[7]
(2017)
Vote3deep: fast object detection in 3d point clouds using efficient convolutional neural networks
. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1355–1361. Cited by: §2.1.  [8] (2019) Mesh rcnn. arXiv preprint arXiv:1906.02739. Cited by: §2.2.
 [9] (201509) Sparse 3d convolutional neural networks. In Proceedings of the British Machine Vision Conference (BMVC), G. K. L. T. Xianghua Xie (Ed.), pp. 150.1–150.9. External Links: Document, ISBN 1901725537, Link Cited by: §2.1.
 [10] (2018) 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9224–9232. Cited by: §2.1.
 [11] (2017) Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307. Cited by: Figure 2, §3.1, §3.5.
 [12] (2018) AtlasNet: a papier-mâché approach to learning 3d surface generation. arXiv preprint arXiv:1802.05384. Cited by: §2.1.
 [13] (2017) Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988. Cited by: Table 1.
 [14] (2019) 3D-SIS: 3d semantic instance segmentation of rgb-d scans. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. Cited by: §4.2, §4.3, Table 1.
 [15] (2018) 3d-rcnn: instance-level 3d object reconstruction via render-and-compare. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3559–3568. Cited by: §2.2.
 [16] (2019) ShapeMask: learning to segment novel objects by refining shape priors. In IEEE International Conference on Computer Vision (ICCV). Cited by: §2.2.
 [17] (2019) PointPillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: §2.1.
 [18] (2016) Vehicle detection from 3d lidar using fully convolutional network. arXiv preprint arXiv:1608.07916. Cited by: §2.1.
 [19] (2019) Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7345–7353. Cited by: §2.1.
 [20] (2018) Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 641–656. Cited by: §2.1.
 [21] (1987) Marching cubes: a high resolution 3d surface construction algorithm. In ACM SIGGRAPH Computer Graphics, Vol. 21, pp. 163–169. Cited by: §3, Figure 4, §4.5.
 [22] (2018) Fast and furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3569–3577. Cited by: §2.1.
 [23] (2019) Roi-10d: monocular lifting of 2d detection to 6d pose and metric shape. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2069–2078. Cited by: §2.2.
 [24] (2019) Occupancy networks: learning 3d reconstruction in function space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4460–4470. Cited by: §2.2, §3.4.1.
 [25] (2019) LaserNet: an efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12677–12686. Cited by: §2.1.
 [26] (2019) StarNet: targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069. Cited by: §4.2, §4.4.
 [27] (2019) Deepsdf: learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103. Cited by: §2.2, §3.4.1.
 [28] (2019) Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision. Cited by: §2.1, §4.2, §4.3, Table 1.
 [29] (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927. Cited by: §2.1, Table 1.

 [30] (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §2.1.
 [31] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §2.1.
 [32] (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. arXiv preprint arXiv:1905.05172. Cited by: §2.2, §3.4.1.
 [33] (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–779. Cited by: §2.1.
 [34] (2019) Complexer-yolo: real-time 3d object detection and tracking on semantic point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Cited by: §2.1.
 [35] (2018) Complex-yolo: an euler-region-proposal for real-time 3d object detection on point clouds. In European Conference on Computer Vision, pp. 197–209. Cited by: §2.1.
 [36] (2016) Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 808–816. Cited by: Table 1.
 [37] (2003) Optimized spatial hashing for collision detection of deformable objects. In VMV, Vol. 3, pp. 47–54. Cited by: §3.5.
 [38] (2018) Hdnet: exploiting hd maps for 3d object detection. In Conference on Robot Learning, pp. 146–155. Cited by: §2.1.
 [39] (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7652–7660. Cited by: §2.1.
 [40] (2018) Foldingnet: point cloud autoencoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 206–215. Cited by: §2.1.
 [41] (2019) Gspn: generative shape proposal network for 3d instance segmentation in point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3947–3956. Cited by: §2.2.
 [42] (2018) GSPN: generative shape proposal network for 3d instance segmentation in point cloud. arXiv preprint arXiv:1812.03320. Cited by: Table 1.
 [43] (2019) PointWeb: enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5565–5573. Cited by: §2.1.
 [44] (2019) 3D point capsule networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1009–1018. Cited by: §2.1.
 [45] (2019) End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Conference on Robot Learning (CoRL). Cited by: §4.2, §4.4.
 [46] (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §2.1.