What is a good representation for processing 3D sensor data? While this is a fundamental challenge in machine vision dating back to stereoscopic processing, it has recently been explored in the context of deep neural processing of 3D sensors such as LiDARs. Various representations have been proposed, including graphical meshes, point clouds, voxel grids, and range images, to name a few.
Visibility: We revisit this question by pointing out that 3D sensor data is, in fact, not fully 3D! Instantaneous depth measurements captured from a stereo pair, structured light sensor, or LiDAR undeniably suffer from occlusions: once a particular scene element is measured at a particular depth, visibility ensures that all other scene elements behind it along its line-of-sight are occluded. Indeed, this loss of information is one of the fundamental reasons why 3D sensor readings can often be represented with 2D data structures, e.g., a 2D range image. From this perspective, such 3D sensor data might be better characterized as “2.5D”.
3D Representations: We argue that representations for processing LiDAR data should embrace visibility, particularly for applications that require instantaneous understanding of freespace (such as autonomous navigation). However, most popular representations are based on 3D point clouds (such as PointNet [21, 14]). Because these were often proposed in the context of truly 3D processing (e.g., of 3D mesh models), they do not exploit visibility constraints implicit in the sensor data (see the splash figure). Indeed, representing a LiDAR sweep as a collection of points fundamentally destroys such visibility information if the points are normalized (e.g., when centering point clouds).
Occupancy: By no means are we the first to point out the importance of visibility. In the context of LiDAR processing, visibility is well studied for the tasks of map-building and occupancy reasoning [27, 8]. However, it is not well-explored for object detection, with one notable exception: Yapo et al. build a probabilistic occupancy grid and perform template matching to directly estimate the probability of an object appearing at each discretized location. However, this approach requires knowing the surface shape of object instances beforehand and is therefore not scalable. In this paper, we demonstrate that deep architectures can be simply augmented to exploit visibility and freespace cues.
Range images: Given our arguments above, one solution might be defining a deep network on 2D range image input, which implicitly encodes such visibility information. Indeed, this representation is popular for structured light “RGBD” processing [10, 6], and has also been proposed for LiDAR. However, such representations do not seem to produce state-of-the-art accuracy for 3D object understanding, compared to 3D voxel-based or top-down, bird's-eye-view (BEV) projected grids. We posit that convolutional layers that operate along a depth dimension can reason about uncertainty in depth. To maintain this property, we introduce simple but novel approaches that directly augment state-of-the-art 3D voxel representations with visibility cues.
We propose a deep learning approach that efficiently augments point clouds with visibility. Our specific contributions are three-fold: (1) We first (re)introduce raycasting algorithms that efficiently compute on-the-fly visibility for a voxel grid. We demonstrate that these can be incorporated into batch-based gradient learning. (2) Next, we describe a simple approach to augmenting voxel-based networks with visibility: we add a voxelized visibility map as an additional input stream, exploring alternatives for early and late fusion. (3) Finally, we show that visibility can be combined with two crucial modifications common to state-of-the-art networks: synthetic data augmentation of virtual objects, and temporal aggregation of LiDAR sweeps over multiple time frames. We show that visibility cues can be used to better place virtual objects. We also demonstrate that visibility reasoning over multiple time frames is akin to online occupancy mapping.
2 Related Work
2.1 3D Representations
Most classic works on point representation employ hand-crafted descriptors and require robust estimates of local surface normals, such as spin-images and Viewpoint Feature Histograms (VFH). Since PointNet, a line of work has focused on learning better point representations, including PointNet++, Kd-networks, PointCNN, EdgeConv, and PointConv, to name a few. Recent works on point-wise representation tend not to distinguish between reconstructed and measured point clouds. We argue that when the input is a measured point cloud, e.g. a LiDAR sweep, we need to look beyond points and reason about the visibility that is hidden within points.
Most research on visibility representation has been done in the context of robotic mapping. For example, Buhmann et al. estimate a 2D probabilistic occupancy map from sonar readings to navigate a mobile robot, and more recently Hornung et al. have developed Octomap for general purpose 3D occupancy mapping. Visibility through raycasting is at the heart of developing such occupancy maps. Despite its popularity, such visibility reasoning has not been widely studied in the context of object detection, with the notable exception of Yapo et al., who develop a probabilistic framework based on occupancy maps to detect objects with known surface models.
2.2 LiDAR-based 3D Object Detection
We have seen LiDAR-based object detectors built upon range images, bird's-eye-view feature maps, raw point clouds, and also voxelized point clouds. One example of a range image based detector is LaserNet, which treats each LiDAR sweep as a cylindrical range image. Examples of bird's-eye-view detectors include AVOD, HDNet, and Complex-YOLO. One example that builds upon raw point clouds is PointRCNN. Examples of detectors built on voxelized point clouds include the initial VoxelNet, SECOND, and PointPillars. Other than Yapo et al., we have not seen a detector that uses visibility as the initial representation.
Yan et al. propose a novel form of data augmentation, which we call object augmentation. It copy-pastes object point clouds from one scene into another, resulting in new training data. This augmentation technique improves both convergence speed and final performance and is adopted in all recent state-of-the-art 3D detectors, such as PointRCNN and PointPillars. For objects captured under the same sensor setup, simple copy-paste preserves the relative pose between the sensor and the object, resulting in approximately correct return patterns. However, such practice often inserts objects regardless of whether they violate the scene visibility. In this paper, we propose to use visibility reasoning to maintain correct visibility while augmenting objects across scenes.
When learning 3D object detectors over a series of LiDAR sweeps, it has proven helpful to aggregate information across time. Luo et al. develop a recurrent architecture for detecting, tracking, and forecasting objects on LiDAR sweeps. Choy et al. propose to learn spatial-temporal reasoning through 4D ConvNets. Another technique for temporal aggregation, first found in SECOND, is to simply aggregate point clouds from different sweeps while preserving their timestamps relative to the current one. These timestamps are treated as an additional per-point input feature and fed into point-wise encoders such as PointNet. We explore temporal aggregation over visibility representations and point out that one can borrow ideas from classic robotic mapping to integrate visibility representation with learning.
3 Exploit Visibility for 3D Object Detection
Before we discuss how to integrate visibility reasoning into 3D detection, we first introduce a general framework for 3D detection. Many 3D detectors have adopted this framework, including AVOD, HDNet, Complex-YOLO, VoxelNet, SECOND, and PointPillars. Among the more recent ones, there are two crucial innovations: (1) object augmentation by inserting rarely seen (virtual) objects into training data and (2) temporal aggregation of LiDAR sweeps over multiple time frames.
We integrate visibility into the aforementioned 3D detection framework. First, we (re)introduce a raycasting algorithm that efficiently computes visibility. Then, we introduce a simple approach to integrate visibility into the existing framework. Finally, we discuss visibility reasoning within the context of object augmentation and temporal aggregation. For object augmentation, we modify the raycasting algorithm to make sure visibility remains intact while inserting virtual objects. For temporal aggregation, we point out that visibility reasoning over multiple frames is akin to online occupancy mapping.
3.1 A General Framework for 3D Detection
We visualize a general framework for 3D detection in Fig. 1. Please refer to the caption. We highlight the fact that once the input 3D point cloud is converted to a multi-channel BEV 2D representation, we can make use of standard 2D convolutional architectures. We later show that visibility can be naturally incorporated into this 3D detection framework.
Data augmentation is a crucial ingredient of contemporary training protocols. Most augmentation strategies perturb coordinates through random transformations (e.g. translation, rotation, flipping) [13, 20]. We focus on object augmentation proposed by Yan et al., which copy-pastes (virtual) objects of rarely-seen classes (such as buses) into LiDAR scenes. Our ablation studies (g→i in Tab. 3) suggest that it dramatically improves vanilla PointPillars by an average of +9.1% on the augmented classes.
In LiDAR-based 3D detection, researchers have explored various strategies for temporal reasoning. We adopt a simple method that aggregates (motion-compensated) points from different sweeps into a single scene [31, 4]. Importantly, each point is augmented with an additional channel that encodes its relative timestamp. Our ablation studies (g→j in Tab. 3) suggest that temporal aggregation dramatically improves the overall mAP of the vanilla PointPillars model by +8.6%.
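The aggregation step described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the function name `aggregate_sweeps`, the `(timestamp, points)` input layout, and the sign convention of the timestamp channel (current time minus capture time) are all assumptions made for illustration, and the points are assumed to be motion-compensated already.

```python
def aggregate_sweeps(sweeps, current_time):
    """Merge several sweeps into one scene with a relative-timestamp channel.

    sweeps: list of (timestamp, points) pairs, where points is a list of
            (x, y, z) tuples already motion-compensated into the current frame.
    Returns a list of (x, y, z, dt) tuples, dt = current_time - timestamp.
    """
    merged = []
    for timestamp, points in sweeps:
        dt = current_time - timestamp  # hypothetical sign convention
        merged.extend((x, y, z, dt) for x, y, z in points)
    return merged
```

The extra channel lets a point-wise encoder distinguish recent returns from older ones without changing the rest of the pipeline.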
3.2 Compute Visibility through Raycasting
Physical raycasting in LiDAR
Each LiDAR point is generated through a physical raycasting process. To generate a point, the sensor emits a laser pulse in a certain direction. The pulse travels forward through air and back after hitting an obstacle. Upon its return, one can compute a 3D coordinate derived from the direction and time-of-flight. However, coordinates are by no means the only information offered by such sensing. Crucially, active sensing also provides estimates of freespace along the ray traveled by the pulse.
Simulated LiDAR raycasting
By exploiting the causal relationship between freespace and point returns (points lie along the ray where freespace ends), we can re-create the instantaneous visibility encountered at the time of LiDAR capture. We do so by drawing a line segment from the sensor origin to every 3D point. We would like to use this line segment to define freespace across a discretized volume, e.g. a 3D voxel grid. Specifically, we compute all voxels that intersect this line segment. Those that are encountered along the way are marked as free, while the last voxel enclosing the 3D point is marked as occupied. This results in a visibility volume where all voxels are marked as occupied, free, or unknown (default). We will integrate the visibility volume into the general detection framework (Fig. 1) in the form of a multi-channel 2D feature map where visibility along the vertical dimension (z-axis) is treated as multiple channels.
Efficient voxel traversal
In order to ensure fast inference and efficient training, our visibility computation must be extremely efficient. Many detection networks exploit sparsity in LiDAR point clouds: PointPillars processes only non-empty pillars (about 3%) and SECOND employs spatially sparse 3D ConvNets. Inspired by these approaches, we exploit sparsity through an efficient voxel traversal algorithm. For any given ray, we need to traverse only the sparse set of voxels that intersect the ray. Intuitively, during the traversal, the algorithm enumerates over the six axis-aligned faces of the current voxel to determine which is intersected by the exiting ray (which is quite efficient). It then simply advances to the neighboring voxel with a shared face. The algorithm begins at the voxel at the origin and terminates when it encounters the (precomputed) voxel occupied by the 3D point. This algorithm is linear in the resolution of a single grid dimension, making it quite efficient. We perform raycasting of multiple points in parallel and aggregate computed visibility afterwards. We also follow best practices outlined in Octomap (Sec. 5.1 in Hornung et al.) to reduce discretization effects during aggregation.
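To make the traversal concrete, the following is a minimal, single-threaded Python sketch in the style of the fast voxel traversal of Amanatides and Woo. It is an illustration under stated assumptions, not the C++ implementation used in the experiments: the names `traverse`, `compute_visibility`, `FREE`, and `OCCUPIED` are hypothetical, the grid is an unbounded sparse dictionary, and the rule that an occupied voxel is never overwritten by a free observation is an assumed way of handling discretization conflicts.

```python
import math

FREE, OCCUPIED = 1, 2  # voxels absent from the map are "unknown"

def traverse(origin, point, voxel_size):
    """Yield voxel indices crossed by the segment origin -> point, in order."""
    cur = [math.floor(c / voxel_size) for c in origin]
    end = [math.floor(c / voxel_size) for c in point]
    step, t_max, t_delta = [], [], []
    for axis in range(3):
        d = point[axis] - origin[axis]
        if d > 0:    # next face crossed is the upper face of the voxel
            step.append(1)
            t_max.append(((cur[axis] + 1) * voxel_size - origin[axis]) / d)
            t_delta.append(voxel_size / d)
        elif d < 0:  # next face crossed is the lower face of the voxel
            step.append(-1)
            t_max.append((cur[axis] * voxel_size - origin[axis]) / d)
            t_delta.append(-voxel_size / d)
        else:        # ray is parallel to this axis, never crosses its faces
            step.append(0)
            t_max.append(math.inf)
            t_delta.append(math.inf)
    yield tuple(cur)
    while cur != end:
        axis = t_max.index(min(t_max))   # face that the ray exits through
        if t_max[axis] > 1.0 + 1e-9:     # safety stop: past the end of the segment
            break
        cur[axis] += step[axis]
        t_max[axis] += t_delta[axis]
        yield tuple(cur)

def compute_visibility(origin, points, voxel_size):
    """Return {voxel index: FREE or OCCUPIED} for one sweep."""
    volume = {}
    for point in points:
        end = tuple(math.floor(c / voxel_size) for c in point)
        for voxel in traverse(origin, point, voxel_size):
            if voxel != end:
                volume.setdefault(voxel, FREE)  # never overwrite OCCUPIED
        volume[end] = OCCUPIED
    return volume
```

Each ray visits only the voxels it actually crosses, so the cost per ray grows with the number of voxels along the ray rather than with the size of the full grid.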
Raycasting with augmented objects
Prior work augments virtual objects while ignoring visibility constraints, producing inconsistent LiDAR sweeps (e.g., by inserting an object behind a wall that should occlude it - Fig. 2-(b)). We can use ray-casting as a tool to “rectify” the LiDAR sweep. Specifically, we might wish to remove virtual objects that are occluded (a strategy we term culling - Fig. 2-(c)). Because this might excessively decrease the number of augmented objects, another option is to remove points from the original scene that occlude the inserted objects (a strategy we term drilling - Fig. 2-(d)). Fortunately, both strategies are efficient to implement with simple modifications to the above ray-casting algorithm. We only have to change the terminating condition of raycasting from arriving at the end point of the ray to hitting a voxel that is pre-occupied. When casting rays from the original scene, we set voxels occupied by virtual objects as pre-occupied. And when casting rays from the virtual objects, we set voxels occupied by original scenes as pre-occupied. As a consequence, points that should be occluded will be removed.
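The two rectification strategies can be sketched in Python as follows. This is a simplified, per-point reading of the description above, offered as an assumption rather than the method as implemented: the names `ray_voxels`, `blockers`, and `rectify` are hypothetical, occupancy is tested at voxel granularity, and culling is applied to individual points rather than to whole objects.

```python
import math

def voxel_of(p, size):
    return tuple(math.floor(c / size) for c in p)

def ray_voxels(origin, point, size):
    """Yield voxel indices crossed by the segment origin -> point."""
    cur, end = list(voxel_of(origin, size)), list(voxel_of(point, size))
    step, t_max, t_delta = [], [], []
    for a in range(3):
        d = point[a] - origin[a]
        if d == 0:
            step.append(0); t_max.append(math.inf); t_delta.append(math.inf)
        else:
            s = 1 if d > 0 else -1
            face = (cur[a] + (1 if d > 0 else 0)) * size  # next face along the ray
            step.append(s)
            t_max.append((face - origin[a]) / d)
            t_delta.append(size / abs(d))
    yield tuple(cur)
    while cur != end:
        a = t_max.index(min(t_max))
        if t_max[a] > 1.0 + 1e-9:  # safety stop past the end of the segment
            break
        cur[a] += step[a]
        t_max[a] += t_delta[a]
        yield tuple(cur)

def blockers(origin, point, size, pre_occupied):
    """Pre-occupied voxels that the ray meets before reaching its own end voxel."""
    end = voxel_of(point, size)
    return [v for v in ray_voxels(origin, point, size)
            if v != end and v in pre_occupied]

def rectify(origin, scene, virtual, size, mode):
    """Return (scene points, virtual points) with consistent visibility."""
    scene_vox = {voxel_of(p, size) for p in scene}
    if mode == "culling":      # drop virtual points hidden by the scene
        virtual = [p for p in virtual
                   if not blockers(origin, p, size, scene_vox)]
    elif mode == "drilling":   # drop scene points that hide virtual points
        drilled = set()
        for p in virtual:
            drilled.update(blockers(origin, p, size, scene_vox))
        scene = [p for p in scene if voxel_of(p, size) not in drilled]
    else:
        raise ValueError(mode)
    # In both modes, scene points behind a surviving virtual point are occluded.
    virtual_vox = {voxel_of(p, size) for p in virtual}
    scene = [p for p in scene if not blockers(origin, p, size, virtual_vox)]
    return scene, virtual
```

For a virtual point placed behind a wall, `"culling"` discards the virtual point and leaves the scene untouched, while `"drilling"` keeps it and removes the wall points along its line of sight.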
Online occupancy mapping
How do we extend instantaneous visibility into a temporal context? Assuming the sensor origin is known at each timestamp, we can compute instantaneous visibility over every sweep, resulting in 4D spatial-temporal visibility. Directly integrating a 4D volume into the detection framework would be too expensive. We instead turn to online occupancy mapping [28, 8] and apply Bayesian filtering to turn a 4D spatial-temporal visibility into a 3D posterior probability of occupancy. In Fig. 3, we plot a visual comparison between instantaneous visibility and temporal occupancy. We follow Octomap’s formulation and use their off-the-shelf hyper-parameters, e.g. the log-odds of observing freespace and occupied space.
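The Bayesian filtering step can be illustrated with a standard clamped log-odds update, sketched below in Python. The function names and the dictionary-based map are hypothetical, and the probabilities `p_hit`, `p_miss` and the clamping bounds `p_min`, `p_max` are illustrative placeholders, not the Octomap hyper-parameters used in the experiments.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def probability(log_odds_value):
    return 1.0 / (1.0 + math.exp(-log_odds_value))

def update_occupancy(log_odds, observations,
                     p_hit=0.7, p_miss=0.4, p_min=0.12, p_max=0.97):
    """Fold one sweep's visibility into a running log-odds occupancy map.

    log_odds:     {voxel: log-odds of occupancy}, missing voxels have prior 0.5
    observations: {voxel: "free" or "occupied"} from instantaneous raycasting
    """
    lo_min, lo_max = logit(p_min), logit(p_max)
    for voxel, state in observations.items():
        delta = logit(p_hit) if state == "occupied" else logit(p_miss)
        value = log_odds.get(voxel, 0.0) + delta      # prior 0.5 -> log-odds 0
        log_odds[voxel] = min(max(value, lo_min), lo_max)  # clamp the belief
    return log_odds
```

Because the update is a sum in log-odds space, each new sweep costs one addition per observed voxel, and clamping keeps the map responsive when a voxel changes state.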
3.3 Approach: A Two-stream Network
Now that we have discussed raycasting approaches for computing visibility, we introduce a novel two-stream network for 3D object detection. We add an additional stream for visibility input into the network of a state-of-the-art 3D detector, i.e. PointPillars. As a result, our approach leverages both point cloud representation and visibility representation and fuses them into a multi-channel representation. We explore two fusion strategies: early fusion and late fusion, as illustrated in Fig. 4. The overall network architecture follows the illustration in Fig. 1.
We implement our two-stream network by adding an additional input stream to PointPillars. We adopt PointPillars' resolution for discretization in order to ease integration. As a consequence, our visibility volume has the same 2D spatial size as the pillar feature maps. A simple strategy is to concatenate these two first and then feed them into a backbone network. We refer to this strategy as early fusion (Fig. 4-(a)). Another strategy is to have a separate backbone network for each of the pillar feature maps and the visibility volume, which we refer to as late fusion (Fig. 4-(b)). Please refer to the supplementary materials for more implementation details.
We present both qualitative (Fig. 5) and quantitative results on the NuScenes 3D detection benchmark. We first introduce the setup and baselines, before presenting the main results on the test benchmark. Afterwards, we perform diagnostic evaluation and ablation studies to isolate where improvements come from. Finally, we discuss the efficiency of computing visibility through raycasting on-the-fly.
We benchmark our approach on the NuScenes 3D detection dataset. The dataset contains 1,000 scenes captured in two cities. We follow the official protocol for the NuScenes detection benchmark. The training set contains 700 scenes (28,130 annotated frames). The validation set contains 150 scenes (6,019 annotated frames). Each annotated frame comes with one LiDAR point cloud captured by a 32-beam LiDAR, as well as up to 10 frames of (motion-compensated) point clouds. We follow the official evaluation protocol for 3D detection and evaluate average mAP over different classes and distance thresholds.
PointPillars achieves the best accuracy on the NuScenes detection leaderboard among all published methods that have released source code. The official PointPillars codebase (https://github.com/nutonomy/second.pytorch) only implements 3D detection on KITTI. To reproduce PointPillars results on NuScenes, the authors of PointPillars recommend a third-party implementation (https://github.com/traveller59/second.pytorch). Using an off-the-shelf configuration provided by the third-party implementation, we train a PointPillars model for 20 epochs from scratch on the full training set and use it as our baseline. This model achieves an overall mAP of 31.5% on the validation set, which is 2% higher than the official PointPillars mAP (29.5%) (Tab. 2). The official implementation of PointPillars reportedly employs pretraining (ImageNet/KITTI), whereas there is no pretraining in our re-implementation.
We submitted the results of our two-stream approach to the NuScenes test server. In Tab. 1, we compare our test-set performance to PointPillars on the official leaderboard. By augmenting visibility, our proposed approach achieves a significant improvement over PointPillars in overall mAP by a margin of 4.5%. Specifically, our approach outperforms PointPillars by 10.7% on cars, 5.3% on pedestrians, 7.4% on trucks, 18.4% on buses, and 16.7% on trailers. Our model underperforms official PointPillars on motorcycles by a large margin. We hypothesize this might be due to (1) a different configuration for PointPillars (e.g. xy-resolution) or (2) pretraining on ImageNet/KITTI.
: reproduced based on an author-recommended third-party implementation.
Improvement at different levels of visibility
We compare our two-stream approach to PointPillars on the validation set, where we see a 4% improvement from having visibility. We also evaluate each object class at different levels of visibility. Here, we plot results over the two most common classes: car and pedestrian. Interestingly, the biggest improvements are observed for cars that are heavily occluded (0-40% visible) and the smallest improvements correspond to fully-visible cars (80-100% visible). For pedestrians, we also see the smallest improvement on fully-visible pedestrians (3.2%), which is 1-3% less than what we observe for pedestrians with more occlusion.
To understand how much improvement each component provides, we perform additional ablation studies. We start from our final model and remove one component at a time. Key observations from Tab. 3 are:
(a, b) Replacing early fusion (a) with late fusion (b) results in a 1.4% drop in overall mAP.
(b, c, d) Replacing drilling (b) with culling (c) results in an 11.4% drop on bus and a 4.9% drop on trailer. In practice, most augmented trucks and trailers tend to be severely occluded and are removed if the culling strategy is applied. Replacing drilling (b) with naive augmentation (d) results in a 1.9% drop on bus and a 3.1% drop on trailer, likely due to inconsistent visibility when naively augmenting objects.
(b, e) Removing object augmentation (b→e) leads to significant drops in mAP on classes affected by object augmentation, including a 2.5% drop on truck, 13.7% on bus, and 7.9% on trailer.
(e, f) Removing temporal aggregation (e→f) results in worse performance on every class and a 9.4% drop in overall mAP.
(f, g, h) Removing the visibility stream from a vanilla two-stream approach (f→g) drops overall mAP by 1.4%. Interestingly, the most dramatic drops are on pedestrian (7.5%), barrier (3.3%), and traffic cone (3.7%). Shape-wise, these objects are all “skinny” and tend to have fewer LiDAR points on them. This suggests that visibility helps especially when there are fewer points. A single-stream network with only visibility (h) underperforms a vanilla PointPillars (g) by 4%.
(g, i, j, k) Object augmentation (g→i) improves vanilla PointPillars (g) by an average of 9.1% in AP on augmented classes. Temporal aggregation (g→j) improves vanilla PointPillars by 8.6% in overall mAP. Adding both (g→k) increases the overall mAP by 11.0%.
We implement raycasting in C++ and call it from Python with the help of PyBind11. We integrate visibility computation into our PyTorch pipeline as part of dataloader pre-processing. With our implementation, it takes 24.4±3.5 milliseconds on average to compute visibility over a 32-beam LiDAR point cloud on an Intel i9-9980XE CPU.
We revisit the problem of finding a good representation for 3D data. We point out that contemporary representations are designed for true 3D data (e.g. sampled from mesh models). In fact, 3D sensor data such as a LiDAR sweep is 2.5D. By processing such data as a collection of normalized points, important visibility information is fundamentally destroyed. In this paper, we augment 3D object detection with visibility. We first demonstrate that visibility can be efficiently re-created through 3D raycasting. We introduce a simple two-stream approach that adds visibility as a separate stream to an existing state-of-the-art 3D detector. We also discuss the role of visibility in placing virtual objects for data augmentation and explore visibility in a temporal context, building a local occupancy map in an online fashion. Finally, on the NuScenes detection benchmark, we demonstrate that the proposed network outperforms state-of-the-art detectors by a significant margin.
Appendix A Additional Qualitative Results
Appendix B Additional Method Details
Here, we provide additional details about our method, including pre-processing, network structure, initialization, loss function, training etc. These details apply to both the baseline method (PointPillars) and our two-stream approach.
We focus on points whose satisfies and ignore points outside the range when computing pillar features. We group points into vertical columns of size . We call each vertical column a pillar. We resample to make sure each non-empty pillar contains 60 points. For raycasting, we do not ignore points outside the range and use a voxel size of .
We introduce (1) pillar feature network; (2) backbone network; (3) detection heads.
Pillar feature network operates over each non-empty pillar. It takes points within the pillar and produces a 64-d feature vector. To do so, it first compresses to , where . Then it augments each point with its offset to the pillar’s arithmetic mean and geometric mean. Please refer to Sec. 2.1 of PointPillars for more details. Then, it processes augmented points with a 64-d linear layer, followed by BatchNorm, ReLU, and MaxPool, which results in a 64-d embedding for each non-empty pillar. Conceptually, this is equivalent to a mini one-layer PointNet. Finally, we fill empty pillars with all zeros. Based on our discretization choices, pillar feature network produces a feature map.
Backbone network is a convolutional network with an encoder-decoder structure. This is also sometimes referred to as a Region Proposal Network. Please read VoxelNet, SECOND, and PointPillars for more details. The network consists of three blocks of fully convolutional layers. Each block consists of a convolutional stage and a deconvolutional stage. The first (de)convolution filter of the (de)convolutional stage changes the spatial resolution and the feature dimension. All (de)convolutions are 3x3 and followed by BatchNorm and ReLU. For our two-stream early-fusion model, the backbone network takes an input of size , where channels are from pillar feature and channels are from visibility. The first block contains 4 convolutional layers and 4 deconvolutional layers. The second and the third block each contain 6 of each of these layers. Within the first block, the feature dimension changes from to during the convolutional stage, and to during the deconvolutional stage. Within the second block, the feature dimension changes from to . Within the third block, the feature map changes from to and back to . At last, features from all three blocks are concatenated as the final output, which has a size of .
Detection heads include one for large object classes (i.e. car, truck, trailer, bus, and construction vehicles) and one for small object classes (i.e. pedestrian, barrier, traffic cone, motorcycle, and bicycle). The large head takes the concatenated feature map from the backbone network as input () while the small head takes the feature from the backbone’s first convolutional stage as input (). Each head contains a linear predictor for anchor box classification and a linear predictor for bounding box regression. The classification predictor outputs a confidence score for each anchor box and the regression predictor outputs adjustment coefficients (i.e. ).
For classification, we adopt focal loss  and set and . For regression output, we use smooth L1 loss (a.k.a. Huber loss) and set , where controls where the transition between L1 and L2 happens. The final loss function is the classification loss multiplied by 2 plus the regression loss.
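For reference, the two loss terms and their combination can be written as the short Python sketch below. It uses the common per-anchor binary focal loss and the piecewise smooth L1 form with a transition parameter; both forms are assumptions about the exact variants used, the function names are hypothetical, and `alpha`, `gamma`, and `beta` are deliberately left as required arguments because their values are not reproduced here.

```python
import math

def focal_loss(p, target, alpha, gamma):
    """Binary focal loss for one anchor.

    p:      predicted foreground probability, strictly between 0 and 1
    target: 1 for a positive anchor, 0 for a negative anchor
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights anchors that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def smooth_l1(residual, beta):
    """Quadratic (L2-like) below beta, linear (L1-like) above it."""
    a = abs(residual)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta

def total_loss(classification_loss, regression_loss):
    """Classification loss multiplied by 2 plus the regression loss."""
    return 2.0 * classification_loss + regression_loss
```

The two branches of `smooth_l1` meet at `|residual| = beta`, so the loss is continuous at the transition point.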
We train all of our models for 20 epochs and optimize with Adam. We follow a learning rate schedule known as “one-cycle”. The schedule consists of 2 phases. The first phase includes the first 40% of training steps, during which we increase the learning rate from to 0.003 while decreasing the momentum from 0.95 to 0.85 following cosine annealing. The second phase includes the remaining 60% of training steps, during which we decrease the learning rate from 0.003 to while increasing the momentum from 0.85 to 0.95. We use a fixed weight decay of 0.01.
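The two-phase schedule can be sketched in Python as follows. This is a minimal illustration, not the training code: the names `cosine_interp` and `one_cycle` are hypothetical, the starting and final learning rates (`lr_start`, `lr_end`) are required arguments because their values are not reproduced here, and applying cosine annealing in the second phase as well as the first is an assumption.

```python
import math

def cosine_interp(start, end, fraction):
    """Cosine-anneal from start (fraction = 0) to end (fraction = 1)."""
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * fraction))

def one_cycle(step, total_steps, lr_start, lr_end,
              lr_max=0.003, mom_high=0.95, mom_low=0.85, warm_frac=0.4):
    """Return (learning rate, momentum) for a given training step."""
    warm_steps = warm_frac * total_steps
    if step < warm_steps:
        # Phase 1: learning rate rises to lr_max while momentum falls.
        f = step / warm_steps
        return (cosine_interp(lr_start, lr_max, f),
                cosine_interp(mom_high, mom_low, f))
    # Phase 2: learning rate decays to lr_end while momentum recovers.
    f = (step - warm_steps) / (total_steps - warm_steps)
    return (cosine_interp(lr_max, lr_end, f),
            cosine_interp(mom_low, mom_high, f))
```

Calling `one_cycle(step, total_steps, lr_start, lr_end)` once per optimizer step yields the pair of values to set before the update.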
Acknowledgements: This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research.
-  John Amanatides and Andrew Woo. A Fast Voxel Traversal Algorithm for Ray Tracing. In EG 1987-Technical Papers. Eurographics Association, 1987.
-  Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
-  Joachim Buhmann, Wolfram Burgard, Armin B Cremers, Dieter Fox, Thomas Hofmann, Frank E Schneider, Jiannis Strikos, and Sebastian Thrun. The mobile robot rhino. AI Magazine, 16(2):31–31, 1995.
-  Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
-  Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. arXiv preprint arXiv:1904.08755, 2019.
-  Andreas Eitel, Jost Tobias Springenberg, Luciano Spinello, Martin Riedmiller, and Wolfram Burgard. Multimodal deep learning for robust rgb-d object recognition. In IROS, pages 681–687. IEEE, 2015.
-  Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR, 32(11):1231–1237, 2013.
-  Armin Hornung, Kai M Wurm, Maren Bennewitz, Cyrill Stachniss, and Wolfram Burgard. Octomap: An efficient probabilistic 3d mapping framework based on octrees. Autonomous robots, 34(3):189–206, 2013.
-  Andrew E. Johnson and Martial Hebert. Using spin images for efficient object recognition in cluttered 3d scenes. TPAMI, 21(5):433–449, 1999.
-  Eunyoung Kim and Gerard Medioni. 3d object recognition in range images using visibility context. In IROS, pages 3800–3807. IEEE, 2011.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In ICCV, pages 863–872, 2017.
-  Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In IROS, pages 1–8. IEEE, 2018.
-  Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
-  Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In NeurIPS, pages 820–830, 2018.
-  Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In CVPR, pages 2980–2988, 2017.
-  Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In CVPR, pages 3569–3577, 2018.
-  David Marr and Herbert Keith Nishihara. Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London. Series B. Biological Sciences, 200(1140):269–294, 1978.
-  Gregory P Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, and Carl K Wellington. Lasernet: An efficient probabilistic 3d object detector for autonomous driving. In CVPR, pages 12677–12686, 2019.
-  Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In CVPR, pages 918–927, 2018.
-  Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, pages 652–660, 2017.
-  Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, pages 5099–5108, 2017.
-  Radu Bogdan Rusu, Gary Bradski, Romain Thibaux, and John Hsu. Fast 3d recognition and pose using the viewpoint feature histogram. In IROS, pages 2155–2162. IEEE, 2010.
-  Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR, pages 770–779, 2019.
-  Martin Simon, Stefan Milz, Karl Amende, and Horst-Michael Gross. Complex-yolo: An euler-region-proposal for real-time 3d object detection on point clouds. In ECCV, pages 197–209. Springer, 2018.
-  Leslie N Smith. Cyclical learning rates for training neural networks. In WACV, pages 464–472. IEEE, 2017.
-  Sebastian Thrun and Arno Bücken. Integrating grid-based and topological maps for mobile robot navigation. In Proceedings of the National Conference on Artificial Intelligence, pages 944–951, 1996.
-  Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic robotics. MIT press, 2005.
-  Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM TOG, 38(5):146, 2019.
-  Wenxuan Wu, Zhongang Qi, and Li Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. In CVPR, pages 9621–9630, 2019.
-  Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
-  Bin Yang, Ming Liang, and Raquel Urtasun. Hdnet: Exploiting hd maps for 3d object detection. In CoRL, pages 146–155, 2018.
-  Theodore C Yapo, Charles V Stewart, and Richard J Radke. A probabilistic representation of lidar range data for efficient 3d object detection. In CVPR Workshops, pages 1–8. IEEE, 2008.
-  Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In CVPR, pages 4490–4499, 2018.