3D object detection is an essential component of autonomous driving. 3D detectors identify road obstacles to help the driving system make correct decisions that ensure safety and driving effectiveness. With rapid development and decreasing production costs of 3D sensors, LiDAR has become a necessary module on self-driving cars for perceiving the scene, as it can scan the surrounding environment and capture an accurate 3D description. Unlike 2D images, point clouds generated by LiDAR have precise depth values and 3D spatial information, making it more reliable to reason about accurate 3D objects.
LiDAR-based methods, whether grid-based[31, 9, 27] or point-based[21, 22, 23, 26], utilize the point clouds in the original LiDAR xyz-coordinate whose origin is located at the LiDAR. We denote this 3D xyz-coordinate as the birds-eye view (BEV). Points in the BEV capture precise 3D structures of obstacles and environments, and the shapes of objects are distance-invariant. However, for distant or small-sized objects, the points become very sparse, making it very hard for detectors to tackle those challenging targets. Although LiDAR points are sparse in the 3D xyz-coordinate, they are inherently dense from the view of the sensor, i.e., the density of points is consistent in angle in the perspective view. Another issue that the birds-eye view encounters is local feature mismatching. The point densities vary with distance, which is difficult for local parameter-sharing feature extractors, such as 2D/3D CNNs or the PointNet++ architecture, to handle. This issue is remedied in the perspective view, where the point densities stay consistent in angle.
Therefore, some works[4, 14, 30, 25] attempt to employ the perspective view to remedy the drawbacks of the birds-eye view. However, the methods using perspective images[4, 14] suffer from the same trouble as monocular methods: object shapes are not distance-invariant. Moreover, due to the loss of depth information, objects overlap in 2D images, making them hard to accurately recognize and locate. Recent multi-view based approaches[30, 25] project points into a 3D spherical or cylindrical coordinate to fix the loss of depth information, and the distance-variance issue is also resolved by combining multi-view features. However, they only consider egocentric views. In the perspective view (PV), the space is voxelized into a series of sectored grids that become larger and sparser as the distance increases. In the egocentric PV, the features of distant objects are mixed with the surrounding context and lose discriminability. Hence, the egocentric PV is still insufficient to tackle this issue. The solution is to break through the traditional egocentric constraint.
In this paper, we propose a novel decentralized multi-view algorithm, denoted as X-view. Inspired by previous methods, X-view fuses features of different coordinate spaces. In contrast to previous works, X-view breaks through the traditional egocentric constraint that the origin of the perspective view coincides with that of the birds-eye view, which limits the power of multi-view fusion. X-view extends the number of perspective views by translating the origin to imitate different observers' perspectives, and the number of PVs can be set flexibly. The intuition is that different origins of the perspective view allow the network to exploit features at different distances with a dynamic context. Moreover, we propose the BEV-Dominant Linear Interpolation Fusion (BDLI-Fusion) module to fuse features from multiple views. BDLI-Fusion fixes an issue existing in previous multi-view methods[30, 25]: it avoids the inaccuracies introduced when conducting voxel-point-voxel projection. The features from multiple views boost the final performance. Thanks to efficient backbone architectures such as 3D convolution and PointNet++, the additional feature extraction brings little extra time and memory cost, and our X-view achieves robust performance while guaranteeing the real-time demand.
Our contributions can be summarized in three aspects:
We propose a novel multi-view 3D detection architecture, X-view. For the first time, X-view breaks away from the traditional egocentric view and extends the number of perspective views by translating the origin to imitate different observers' perspectives.
Aiming at the fusion issue existing in previous multi-view methods, we propose BDLI-Fusion to simplify the fusion operation and avoid the inaccuracies introduced during the voxel-point-voxel projection.
X-view, designed as a general, flexible architecture, can be applied to almost all mainstream 3D detectors. We conduct experiments on four popular state-of-the-art 3D detectors: SECOND, PointRCNN, Part-A², and PV-RCNN. The experiments on two challenging datasets, KITTI and NuScenes, demonstrate the effectiveness of X-view and show remarkable improvements.
X-view achieves real-time detection to meet the practical engineering demand via parallelization acceleration.
II Related Work
In this section, we review the recent development of 3D detection for autonomous driving. According to the number of views used by detectors, the 3D detection approaches can be grouped into two categories: single-view based and multi-view based methods.
II-A Single-View 3D Detectors.
Monocular methods employ a single 2D image to estimate 3D objects. However, due to the projective entanglement of depth and scale, it is hard to precisely locate 3D targets from a monocular 2D image, so these approaches have a huge performance gap compared to LiDAR-based methods. LiDAR-based methods are mainly divided into two categories: grid-based and point-based. Grid-based detectors[13, 31, 27, 32, 28, 9, 29] quantize the point clouds into a series of discrete grids and either exploit efficient sparse 3D CNNs[31, 27, 32] or compress the input into BEV feature maps[28, 9, 29] to meet the real-time demand. Point-based methods directly process raw 3D points and aggregate point-level features via the popular PointNet/PointNet++ architectures[22, 16, 20] or GCNs. Although LiDAR-based 3D detectors have recently achieved great success, they only take advantage of points in the birds-eye view (i.e., the 3D Cartesian coordinate). Using points in the birds-eye view not only suffers from sparsity in the distance but is also unfriendly to the popular local-parameter-sharing feature extractors, because the point densities vary with distance in the BEV. To remedy this issue, many researchers have begun to handle point clouds in multiple views.
II-B Multi-View 3D Detectors.
For the multi-view paradigms, many recent works[7, 4, 24, 15] employ features from multiple sensors (e.g., 3D LiDAR and 2D camera) to reason about 3D objects. Although these methods also use perspective-view features, they only use 2D image features. The 2D image, however, does not contain sufficient 3D clues, making it hard to locate 3D objects precisely. Moreover, there are modality differences between features from the 2D image and those from 3D voxels or raw points, which brings difficulties for multi-sensor fusion. Hence, the performance of recent multi-sensor based methods still falls slightly behind LiDAR-only methods. Therefore, in the following, we mainly discuss LiDAR-based multi-view methods, which are also compatible with multi-sensor based multi-view methods.
MV3D is an early multi-view method. Besides the image stream, it exploits the point clouds in two views, the birds-eye view and the perspective view, and directly fuses the feature maps of the two views in the RoI-Pooling operation. There are two drawbacks. First, it projects the 3D points onto a 2D plane to represent the perspective view and loses the depth information, which harms the positioning precision. Second, it does not consider the inconsistency between the feature maps from different perspectives and simply concatenates them together. MVF maps the 3D xyz-points into a 3D spherical coordinate to represent the perspective view. Meanwhile, it employs dynamic point-to-voxel and voxel-to-point projections to conduct point-wise fusion to synergize the BEV and PV. However, the sampling or pooling operations used when mapping points to voxels bring some ambiguities, which reduces the fusion robustness. PillarOD exploits cylindrical projection to replace the spherical coordinate but still suffers from the same projection issue as MVF. Some methods[14, 3, 11, 19] leverage the range view or its variants, which can also be regarded as a kind of perspective view, to reason about 3D objects. CVCNet extracts the BEV and range-view features in a unified coordinate and fuses them via a cross-view transformer. RangeRCNN uses the range-view features as priors to initialize the BEV features to overcome the weakness of range-image-based methods. Another recent work uses features from the cylindrical coordinate to guide the 3D convolution layers in the Cartesian coordinate. However, all of these multi-view methods only consider fusing features from the egocentric view, which mixes the sparse features in the distance with the surrounding context and thus hampers discriminability. To overcome these issues, we propose a novel decentralized multi-view fusion paradigm, X-view.
X-view extends the concept of multiple views: it increases the number of perspective views by imitating different observers' views to boost performance. Meanwhile, we propose BEV-Dominant Linear Interpolation Fusion (BDLI-Fusion) to remedy the projection fusion issue mentioned above.
The main idea of X-view is to break through the egocentric constraint of the perspective view by translating the origin of the perspective view coordinate and extending the number of perspective views to boost performance.
The main architecture of X-view is illustrated in Figure 2. Besides the traditional Cartesian coordinate and the egocentric PV coordinate, the raw point clouds are also projected into a series of non-egocentric perspective view coordinates. X-view applies an independent feature extractor to each view. Then X-view employs the BEV-Dominant Linear Interpolation Fusion module (abbreviated as BDLI-Fusion) to fuse the features from the multiple view streams. After the fusion, the fused multi-view features are fed into the following feature extractors to be further embedded and are then forwarded into the detector head to generate the final predictions. In the following, we first introduce the concept of multiple views for 3D detection. Then we introduce the details of X-view and BDLI-Fusion, respectively.
III-A Multiple Views
In this section, we first compare the strengths and drawbacks of the two traditional views, the birds-eye view and the egocentric perspective view, and then introduce our contribution: the non-egocentric perspective view.
Birds-eye View. Most mainstream detectors process point clouds in the birds-eye view, because its coordinate keeps consistent with the measurements of the evaluation metrics. However, BEV also encounters some problems. Due to the working mechanism of LiDAR, although the laser beams emitted by LiDAR are dense and have consistent densities in angle, the point densities descend rapidly as the distance increases, and the point densities of different local areas vary dramatically, as shown in the right part of Figure 3. The unbalanced point density brings another problem: local feature mismatching. Whether grid-based or point-based, these algorithms all leverage local-parameter-sharing extractors to embed features, such as 2D/3D CNNs or PointNet++. These local-parameter-sharing kernels encounter problems when tackling the density-variant features in the BEV. Besides, the extremely low point densities in the distance make it hard for detectors to precisely recognize and locate distant targets. Another challenging case in the BEV is small-sized objects. In deep-level BEV feature maps, small targets might only occupy a few voxels or keypoints, making them hard to precisely recognize and locate.
Perspective View. Unlike the birds-eye view, which captures the real 3D world, the perspective view (PV) imitates the perspective of a camera or the human eye and describes a projection space. Although points are sparsely distributed in the 3D Cartesian space, they become dense when projected onto the perspective-view image, like the process in which LiDAR sweeps the scene. Recent multi-view methods choose spherical or cylindrical coordinates to represent the perspective view. Compared to BEV, the voxel sizes of the PV coordinate vary with the distance from the origin, as illustrated in Figure 3. Therefore, the point densities in the voxels are roughly consistent, which avoids the local feature mismatching issue mentioned above. Moreover, for small objects, PV provides a "dense" description like that in the 2D camera image. Although PV can remedy some drawbacks of the traditional BEV, it still suffers from the problem that distant voxels' sizes are too large. Specifically, the sparsity in the distance, combined with the surrounding clutter brought by the large voxel sizes, hampers the discriminability of features.
Non-egocentric Perspective View. In PV coordinates, such as the spherical or cylindrical coordinate, the voxel sizes become large as the distance to the origin increases, which makes PV not robust for distant targets. Therefore, our proposed X-view leverages the non-egocentric perspective view to remedy this issue. As the name suggests, the origin of a non-egocentric PV is not located at the ego-car. To improve the robustness for hard objects in the distance, we can translate the origin of the egocentric coordinate to distant areas. As Figure 1 illustrates, the non-egocentric PV gives a more meticulous division in the target area than the egocentric one, and the non-egocentric coordinate also imitates another observer in the scene.
III-B Non-egocentric Multi-View Detector
X-view extends the concept of multi-view based 3D detection. In previous multi-view methods, such as MVF and PillarOD, the detector head relies on features extracted from two views, the BEV and the egocentric PV, where the origin of the PV coordinate coincides with the LiDAR coordinate. Although PV can remedy the local feature mismatching issue via its variant voxel sizes, it suffers from the large voxel shapes in the distance, making the features too indiscriminative to precisely recognize and locate distant targets. To remedy this issue, we propose X-view to break through the egocentric limitation and extend the number of perspective views to an arbitrary number. The non-egocentric perspective views imitate multiple observers' perspectives, allowing us to leverage the characteristics of the perspective view to aggregate a balanced context across different distances.
Considering the point clouds in Cartesian coordinates as $P = \{(x_i, y_i, z_i)\}_{i=1}^{N}$, where $N$ is the number of points, the traditional egocentric perspective view coordinate can be described as $P^{pv} = \{(\theta_i, u_i, d_i)\}_{i=1}^{N}$, where:

$$(\theta_i, u_i, d_i) = \begin{cases} \left(\arctan\frac{y_i}{x_i},\ \arctan\frac{z_i}{\sqrt{x_i^2 + y_i^2}},\ \sqrt{x_i^2 + y_i^2 + z_i^2}\right), & \text{sphe. coord.} \\ \left(\arctan\frac{y_i}{x_i},\ z_i,\ \sqrt{x_i^2 + y_i^2}\right), & \text{cylin. coord.} \end{cases}$$

where sphe. coord. and cylin. coord. represent the spherical and cylindrical coordinates, respectively.

The non-egocentric perspective view coordinate translates the origin of the egocentric one. Denote the non-egocentric perspective view centered at $c = (x_c, y_c, z_c)$ as $P^{pv}_c = \{(\theta_i^c, u_i^c, d_i^c)\}_{i=1}^{N}$, where:

$$(\theta_i^c, u_i^c, d_i^c) = \mathrm{PV}(x_i - x_c,\ y_i - y_c,\ z_i - z_c),$$

i.e., the same spherical or cylindrical transformation applied to the points after translating the origin to $c$.
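Assuming the standard spherical/cylindrical transforms, the projection into an egocentric or non-egocentric PV can be sketched in plain Python (function and parameter names are ours, for illustration only):

```python
import math

def to_perspective_view(points, origin=(0.0, 0.0, 0.0), mode="cylindrical"):
    """Project Cartesian points into a (possibly non-egocentric) perspective
    view centered at `origin`.

    Returns per point:
      - cylindrical: (azimuth theta, height z, planar distance rho)
      - spherical:   (azimuth theta, elevation phi, range r)
    """
    ox, oy, oz = origin
    out = []
    for x, y, z in points:
        # translating the origin imitates a different observer
        x, y, z = x - ox, y - oy, z - oz
        theta = math.atan2(y, x)        # azimuth angle
        rho = math.hypot(x, y)          # distance in the x-y plane
        if mode == "cylindrical":
            out.append((theta, z, rho))
        else:
            r = math.sqrt(x * x + y * y + z * z)
            phi = math.atan2(z, rho)    # elevation angle
            out.append((theta, phi, r))
    return out
```

Each non-egocentric stream simply reuses this projection with a different origin, e.g. `origin=(60.0, 0.0, 0.0)`.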
X-view considers both the birds-eye view and a set of perspective views centered at different locations. The choice of the perspective centers depends on the LiDAR type and the road scenes, and thus differs across datasets to fulfill different demands. In Section IV-C, we analyze the effects of different PV origin positions.
III-C BEV-Dominant Linear Interpolation Fusion
Previous multi-view methods[30, 25] take points as a bridge to fuse features from different views, as shown in part (a) of Figure 4. However, we notice that this fusion operation brings some inaccuracy. For the features in the birds-eye view, we use linear interpolation when retrieving point features from voxels, but we lose the points' relative positions with respect to the voxel center when recovering birds-eye view voxels; the same issue exists for the perspective view. Since the voxels in the birds-eye view (BEV) and in the perspective view (PV) are voxelized in different coordinate spaces and their sizes are also inconsistent, the voxel(PV)-point-voxel(BEV) projection also causes information losses, which are not beneficial for the following detection head. As Figure 4 explains, the problem happens at the point-to-voxel(BEV) mapping. When constructing voxel feature maps from point-wise features, the positional differences of the points in the same voxel with respect to the voxel center are lost, which causes the point features to blur within the same voxel due to the sampling or pooling operation used when building the voxel feature maps. These inaccuracies harm the localization precision, especially when the voxel sizes are large.
To remedy this issue, we propose BEV-Dominant Linear Interpolation Fusion, abbreviated as BDLI-Fusion. As illustrated in part (b) of Figure 4, the detailed process of BDLI-Fusion can be divided into three steps: (1) for each voxel in the birds-eye view, we project the xyz-coordinates of the voxel center to the perspective coordinate; (2) we use linear interpolation to retrieve the corresponding perspective view voxel features; (3) we fuse (concatenate or add) the retrieved features and the birds-eye view voxel features together. Compared to the fusion operation of [30, 25], the BEV voxel features remain unchanged and consistent throughout the whole process of BDLI-Fusion. Because the following network layers and the detector head rely on the BEV voxels to reason about the final predictions, we directly use the BEV voxel centers as the fusion target to avoid the inaccuracy mentioned above. Compared to the two-pass fusion operation in MVF and PillarOD, our BDLI-Fusion is simpler and more efficient with a one-pass structure.
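The three steps above can be sketched in plain Python (a minimal illustration; the grid layout, the parameter names, and the bilinear weighting over a 2D theta-rho PV grid are our assumptions, not the paper's exact implementation):

```python
import math

def bdli_fusion(bev_centers, bev_feats, pv_grid, pv_origin=(0.0, 0.0),
                theta0=-math.pi, dtheta=0.1, rho0=0.0, drho=1.0):
    """BEV-dominant fusion sketch.

    bev_centers: list of (x, y) BEV voxel centers
    bev_feats:   list of per-voxel BEV feature vectors
    pv_grid:     2D list [i_theta][i_rho] of PV feature vectors
    Returns fused features: each BEV feature concatenated with the linearly
    interpolated PV feature at the projected center.
    """
    fused = []
    n_t, n_r = len(pv_grid), len(pv_grid[0])
    dim = len(pv_grid[0][0])
    for (x, y), f_bev in zip(bev_centers, bev_feats):
        # step 1: project the BEV voxel center into the PV coordinate
        xo, yo = x - pv_origin[0], y - pv_origin[1]
        theta, rho = math.atan2(yo, xo), math.hypot(xo, yo)
        # continuous PV grid indices
        ti, ri = (theta - theta0) / dtheta, (rho - rho0) / drho
        t0, r0 = int(math.floor(ti)), int(math.floor(ri))
        wt, wr = ti - t0, ri - r0
        # step 2: bilinear interpolation over the four neighbouring PV voxels
        f_pv = [0.0] * dim
        for dt, dr, w in [(0, 0, (1 - wt) * (1 - wr)), (1, 0, wt * (1 - wr)),
                          (0, 1, (1 - wt) * wr), (1, 1, wt * wr)]:
            t = min(max(t0 + dt, 0), n_t - 1)
            r = min(max(r0 + dr, 0), n_r - 1)
            f_pv = [a + w * b for a, b in zip(f_pv, pv_grid[t][r])]
        # step 3: concatenate; the BEV feature itself stays untouched
        fused.append(list(f_bev) + f_pv)
    return fused
```

Note how the BEV features pass through unmodified, which is the "BEV-dominant" property: only the PV side is interpolated, so no point-to-voxel(BEV) blurring occurs.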
III-D Parallelization via Group Convolution
Real-time running is another challenge that must be considered for 3D detection algorithms. Considering that the context differs across views, X-view applies view-dependent feature extractors for the different views, as shown in Figure 2. Although the feature extractors of different views are logically parallel, the implementation in main-stream deep learning frameworks, such as PyTorch, has to be designed as a serial computation graph by using a for loop. The serial computing procedure causes a linear increment in the running time and makes it difficult to satisfy the real-time requirement. Considering that the number of input feature channels and the architectures of the feature extractors are consistent across views, we can employ the group convolution operation to parallelize the computation and reduce the linear increment of running time. Figure 5 illustrates the parallelization via the group convolution operation.
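The equivalence that makes this work — a loop of per-view convolutions equals one convolution with a block-diagonal weight — can be illustrated for the pointwise (1×1) case in plain Python (a sketch with names of our choosing; real frameworks implement grouping natively, e.g. the `groups` argument of PyTorch's `nn.Conv2d`, without materializing the zero blocks):

```python
def conv1x1(x, w):
    """Pointwise convolution: x is a [C_in][N] feature map, w is [C_out][C_in]."""
    return [[sum(w[o][i] * x[i][n] for i in range(len(x)))
             for n in range(len(x[0]))] for o in range(len(w))]

def block_diag(weights, cin_per_group):
    """Stack per-view weight matrices into one block-diagonal matrix, so a
    single convolution processes all view streams at once."""
    cout = sum(len(w) for w in weights)
    W = [[0.0] * (len(weights) * cin_per_group) for _ in range(cout)]
    row = 0
    for k, w in enumerate(weights):
        for o in range(len(w)):
            for i in range(cin_per_group):
                W[row][k * cin_per_group + i] = w[o][i]
            row += 1
    return W
```

Running `conv1x1(x, block_diag([w0, w1], cin))` in one pass produces the same output as looping `conv1x1` over each view's channel slice, which is why the serial per-view loop can be folded into a single grouped operation.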
|Metrics||3D mmAP||BEV mmAP|
|baseline + Ego PV||82.01||51.82||66.68||38.12||77.54||43.16||17.07||58.23||14.98||59.97||50.96||62.87|
|X-view (3 PVs)||83.45||53.24||67.33||38.24||78.56||43.25||17.74||58.89||15.11||61.03||51.69||63.76|
IV Experiments
In this section, we first introduce the experimental settings in Section IV-A; then we compare our X-view with recent state-of-the-art methods in Section IV-B; finally, we conduct ablation studies to analyze the performance of X-view in Section IV-C.
IV-A1 Datasets
KITTI Dataset. KITTI is a popular 3D object detection dataset for autonomous driving. It contains 7481 samples with labels for training and validation and 7518 samples without labels for testing. For 3D object detection, each sample provides a frame of LiDAR point clouds, an RGB image, and the calibration information. The training samples are labeled with a list of ground-truth 3D bounding boxes and the corresponding 2D bounding boxes on the 2D image. The ground-truth only contains the objects visible in the 2D image. Hence, when training and evaluating, we only input the 3D points in the camera frustum, following the popular practice. When conducting experiments, we follow the popular practice of dividing the labeled samples into two sets: 3712 samples as the train set and 3769 samples as the val set.
NuScenes Dataset. Compared to the KITTI dataset, NuScenes is a more challenging 3D object detection dataset for autonomous driving. NuScenes comprises 1000 scenes with 10 classes, divided into 700 scenes for training, 150 scenes for validation, and 150 scenes for testing. Specifically, each scene is a series of consecutive frames of 20 seconds duration, and annotated keyframes are sampled at a frequency of 2 Hz. The consecutive frames with calibration make it possible to use multiple LiDAR sweeps to enhance the single-frame point clouds. In total, it has 28000, 6000, and 6000 annotated keyframes for training, validation, and testing, respectively.
IV-A2 Experimental Details
X-view is implemented based on the official repository of PV-RCNN on GitHub. We employ consistent configurations for each baseline model when training and testing. On the KITTI dataset, we train the model with a batch size of 4, an initial learning rate of 0.003, a weight decay of 0.01, and a momentum of 0.9 on a single TITAN RTX GPU for 80 epochs. On the NuScenes dataset, we train the model with a batch size of 8, an initial learning rate of 0.003, a weight decay of 0.01, and a momentum of 0.9 on a single TITAN RTX GPU for 20 epochs on the train split. On both datasets, all models are trained end-to-end from scratch with the ADAM optimizer. In all our experiments, we employ all-in-one training, including for the results evaluated on the test set, i.e., we train all categories in one model, which is different from the common practice of previous works. Therefore, for a fair comparison, we reimplement the baselines in all experiments and only compare the improvements against our reimplemented baselines.
Data Augmentation. We follow the commonly adopted data augmentations, including random scaling with a factor sampled from [0.95, 1.05], random flipping along the x axis, and random rotation around the z axis with an angle sampled from a fixed range. We also employ the popular gt-aug, which randomly places ground-truths from other samples to increase the number of positive samples and the complexity of the scenes.
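These global augmentations can be sketched as follows (a minimal illustration: the rotation range below is our assumption, since the exact bounds are not stated here, and the box format (cx, cy, cz, l, w, h, yaw) is also illustrative):

```python
import math
import random

def augment(points, boxes, rot_range=(-math.pi / 4, math.pi / 4)):
    """Apply random scaling, flipping along the x axis, and rotation around
    the z axis to points and 3D boxes.

    points: list of (x, y, z); boxes: list of (cx, cy, cz, l, w, h, yaw)
    """
    scale = random.uniform(0.95, 1.05)       # random global scaling
    flip = random.random() < 0.5             # random flip along the x axis
    angle = random.uniform(*rot_range)       # random rotation around z
    c, s = math.cos(angle), math.sin(angle)

    def tf(x, y, z):
        x, y, z = x * scale, y * scale, z * scale
        if flip:
            y = -y                           # mirror across the x-z plane
        return c * x - s * y, s * x + c * y, z

    new_pts = [tf(*p) for p in points]
    new_boxes = []
    for cx, cy, cz, l, w, h, yaw in boxes:
        cx, cy, cz = tf(cx, cy, cz)
        yaw = (-yaw if flip else yaw) + angle  # keep box heading consistent
        new_boxes.append((cx, cy, cz, l * scale, w * scale, h * scale, yaw))
    return new_pts, new_boxes
```

The same transform is applied jointly to points and boxes so that the labels stay aligned with the augmented point cloud.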
Evaluation Metrics. For the KITTI dataset, we evaluate all models with the BEV and 3D AP (Average Precision) metrics. The AP is calculated with 40 recall positions. For the NuScenes dataset, we test with the mAP and NDS (nuScenes detection score) metrics. The mAP is based on the distance thresholds 0.5, 1.0, 2.0, and 4.0 meters, and the NDS is a weighted sum of the mAP and the precision of box location, scale, orientation, velocity, and attributes.
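The 40-recall-position AP can be sketched as below (function name is ours; the official KITTI evaluation additionally handles difficulty filtering and IoU-based matching, which are omitted here):

```python
def average_precision_r40(recalls, precisions):
    """KITTI-style AP: sample the interpolated precision at 40 equally
    spaced recall positions (1/40, 2/40, ..., 40/40) and average them.

    The interpolated precision at recall threshold t is the best precision
    achieved at any operating point with recall >= t.
    """
    total = 0.0
    for k in range(1, 41):
        t = k / 40.0
        p = max((p for r, p in zip(recalls, precisions) if r >= t),
                default=0.0)
        total += p
    return total / 40.0
```

For example, a precision-recall curve that holds precision 1.0 up to recall 0.5 and precision 0.5 up to recall 1.0 yields an AP of 0.75 under this metric.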
IV-B1 Results on the KITTI Dataset
Table I shows the comparisons among X-view and four state-of-the-art baselines: SECOND, PointRCNN, Part-A², and PV-RCNN. The mmAP in the table is the mean of the mAPs (i.e., the mean of the APs over the Easy, Moderate, and Hard difficulties) for the three categories Car, Pedestrian, and Cyclist. In the table, +egocentric PV means adding an egocentric PV (perspective view) to the baseline. X-view (2 PVs) means leveraging a non-egocentric PV stream, whose origin is at (60, 0, 0), besides the egocentric one. The results show that X-view obtains remarkable improvements over both the baseline and the baseline with egocentric PV on both the 3D and BEV AP metrics. In Table II, we display the detailed improvements on the Easy, Moderate, and Hard objects of the three categories. The results demonstrate that X-view achieves significant improvements on almost all categories and difficulty levels.
Table IV shows the performance comparison on the KITTI test set. We compare our X-view with the state-of-the-art method Part-A². The baseline results are the public results on the official KITTI 3D Object Benchmark. From the results, we can see that our X-view achieves improvements on the mAP metric, and the improvement on Moderate and Hard objects is especially remarkable. The improvements on distant objects demonstrate that the non-egocentric PV of X-view can fix the large-voxel-size issue of the egocentric view in distant areas.
| Origin | Easy | Moderate | Hard | mAP |
| --- | --- | --- | --- | --- |
| (20, 0, 0) | 91.74 | 75.42 | 72.03 | 79.73 |
| (40, 0, 0) | 93.12 | 75.98 | 72.38 | 80.49 |
| (60, 0, 0) | 92.45 | 77.95 | 74.06 | 81.49 |
| (60, -20, 0) | 91.68 | 76.68 | 72.81 | 80.39 |
| (60, 20, 0) | 90.85 | 76.74 | 73.01 | 80.20 |
IV-B2 Results on the NuScenes Dataset
Table III shows the comparison between X-view and the baseline on the val set of the NuScenes dataset. The baseline model is SECOND. The point clouds in the NuScenes dataset are scanned 360 degrees by LiDAR, and the annotations contain objects behind the ego car. Therefore, we place one more non-egocentric PV compared to the setting on the KITTI dataset. The origins of the non-egocentric PVs in the third row are set as (40, 0, 0) and (-40, 0, 0), respectively. The results on NuScenes suggest that our X-view achieves improvements compared to the baseline with only an egocentric PV, as on the KITTI dataset.
IV-C Ablation Studies
We conduct ablation analyses on the KITTI dataset to study each design of X-view. All ablation studies are trained on the KITTI train set and evaluated on the val set, and the baseline model is SECOND.
Number of Non-egocentric Perspective Views. One of X-view's advantages is that the number of non-egocentric PVs (perspective views) is extendable. Therefore, we investigate the effect of the number of non-egocentric PVs in Table VI. In the table, the first row "1 non-ego PV" places the origin at (40, 0, 0); the second row uses (40, -20, 0) and (40, 20, 0); the third row uses (60, -20, 0), (40, -20, 0), and (40, 20, 0). From the results, we can see that the performance improves as the number of PVs increases, but the running time and GPU memory also go up, and the marginal performance gain shrinks as the number of PVs increases. Therefore, we have to find a trade-off between the number of PVs and the speed. The results in Table VI only provide advice for the KITTI dataset (3D points in the KITTI dataset are only valid in the front camera frustum); the origin positions and the number of non-egocentric PVs should be adjusted according to the dataset and road scenes.
Ablation Study of the Fusion Methods. In Table V, we compare the effects of different fusion methods: the fusion of MVF and our BDLI-Fusion. The results show that BDLI-Fusion brings improvements, especially for the Cyclist and Pedestrian categories. This is rational because the object sizes of these two categories are generally small, and the estimation of such objects is easily influenced by inaccurate features.
Spherical or Cylindrical Coordinate? Previous work shows that the cylindrical coordinate can avoid the distortion of objects' physical scales compared to the spherical coordinate. However, this only holds for pillar-based baselines, because pillar-based methods do not voxelize along the z axis. Many grid-based backbones voxelize all of the x, y, z directions into 3D voxels. For these methods, applying a cylindrical coordinate faces the density-inconsistency issue along the z axis. Therefore, we reinvestigate the performance difference between the spherical and cylindrical coordinates in Table VII. It suggests that, whether for the baseline+PV model or our X-view model, although the spherical coordinate is slightly better than the cylindrical one, different coordinates do not introduce a significant difference.
Where to Place X-view? We investigate the effects of different origin locations of the non-egocentric PV on the detection performance. As shown in the first to third rows, the performance on the Moderate and Hard difficulty levels increases as the origin slides away along the x-axis from near to far, while the performance on Easy objects peaks at a point and then drops, which is reasonable because the best mixed voxel size for the Easy difficulty level peaks earlier than for the Moderate and Hard levels. A non-egocentric PV placed in a far area provides a more detailed grid partition and more discriminative features for distant targets. Moreover, the fourth to sixth rows show that placing the origin on the symmetry axis achieves the best detection performance. This is reasonable because the obstacles and road scenes are often symmetrically distributed with respect to the x axis, i.e., the driving direction, and the inputs are only the point clouds in the front camera frustum in the KITTI dataset.
In this paper, we generalize the research on multi-view 3D object detection. We propose a novel non-egocentric multi-view 3D detector, named X-view. X-view breaks through the traditional egocentric constraint that the origin of the perspective view is consistent with the ego coordinate. By adding non-egocentric perspective views, X-view remedies the issue that distant voxel sizes become too large in the egocentric view. Besides, X-view leverages our improved fusion module, BDLI-Fusion, which fixes the point-wise fusion issue existing in previous methods. The X-view architecture achieves remarkable improvements on four state-of-the-art 3D detection methods: SECOND, PointRCNN, Part-A², and PV-RCNN. Moreover, based on the parallelization operation, the different view streams can be accelerated to avoid the linear increment of running time, so that X-view can meet the real-time demand.
- (2020) Nuscenes: a multimodal dataset for autonomous driving. pp. 11621–11631.
- (2019) Deep optics for monocular depth estimation and 3d object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10193–10202.
- (2020) Every view counts: cross-view consistency in 3d object detection with hybrid-cylindrical-spherical voxelization. Advances in Neural Information Processing Systems 33.
- (2017) Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915.
- (2015) The KITTI vision benchmark suite. URL http://www.cvlibs.net/datasets/kitti.
- (2017) Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307.
- (2018) Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8.
- (2019) Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11867–11876.
- (2019) Pointpillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705.
- (2019) Gs3d: an efficient 3d object detection framework for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1019–1028.
- (2020) RangeRCNN: towards fast and accurate 3d object detection with range image representation. arXiv preprint arXiv:2009.00206.
- (2019) Roi-10d: monocular lifting of 2d detection to 6d pose and metric shape. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2069–2078.
- Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928.
- (2019) Lasernet: an efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12677–12686.
- (2020) Imvotenet: boosting 3d object detection in point clouds with image votes. arXiv preprint arXiv:2001.10692.
- (2019) Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9277–9286.
- (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108.
- Monogrnet: a geometric reasoning network for monocular 3d object localization. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8851–8858.
- (2020) It's all around you: range-guided cylindrical network for 3d object detection. arXiv preprint arXiv:2012.03121.
- (2020) Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10529–10538.
- (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–779.
- (2020) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Point-gnn: graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1711–1719.
- (2019) PointPainting: sequential fusion for 3d object detection. arXiv preprint arXiv:1911.10150.
- (2020) Pillar-based object detection for autonomous driving. arXiv preprint arXiv:2007.10323.
- (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 1–12.
- (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337.
- (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7652–7660.
- (2020) Center-based 3d object detection and tracking. arXiv preprint arXiv:2006.11275.
- (2020) End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Conference on Robot Learning, pp. 923–932.
- (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499.
- (2019) Class-balanced grouping and sampling for point cloud 3d object detection. arXiv preprint arXiv:1908.09492.