Reliable and accurate 3D object detection is a necessity for safe autonomous driving. Although LiDAR sensors can provide accurate 3D point cloud estimates of the environment, they are also prohibitively expensive for many settings. Recently, the introduction of pseudo-LiDAR (PL) has led to a drastic reduction in the accuracy gap between methods based on LiDAR sensors and those based on cheap stereo cameras. PL combines state-of-the-art deep neural networks for 3D depth estimation with those for 3D object detection by converting 2D depth map outputs to 3D point cloud inputs. However, so far these two networks have to be trained separately. In this paper, we introduce a new framework based on differentiable Change of Representation (CoR) modules that allow the entire PL pipeline to be trained end-to-end. The resulting framework is compatible with most state-of-the-art networks for both tasks and in combination with PointRCNN improves over PL consistently across all benchmarks – yielding the highest entry on the KITTI image-based 3D object detection leaderboard at the time of submission. Our code will be made available at https://github.com/mileyan/pseudo-LiDAR_e2e.
One of the most critical components in autonomous driving is 3D object detection: a self-driving car must accurately detect and localize objects such as cars and pedestrians in order to plan the path safely and avoid collisions. To this end, existing algorithms primarily rely on LiDAR (Light Detection and Ranging) as the input signal, which provides precise 3D point clouds of the surrounding environment. LiDAR, however, is very expensive. A 64-beam model can easily cost more than the car alone, making self-driving cars prohibitively expensive for the general public.
One solution is to explore alternative sensors like commodity (stereo) cameras. Although there is still a noticeable gap to LiDAR, it is an area with exceptional progress in the past year [15, 21, 30, 37, 48, 47]. For example, pseudo-LiDAR (PL) [37, 47] converts a depth map estimated from stereo images into a 3D point cloud, followed by applying (any) existing LiDAR-based detectors. Taking advantage of the state-of-the-art algorithms from both ends [2, 16, 31, 35, 47], pseudo-LiDAR achieves the highest image-based 3D detection accuracy ( and at the moderate case) on the KITTI leaderboard [11, 12].
While the modularity of pseudo-LiDAR is conceptually appealing, the combination of two independently trained components can yield an undesired performance hit. In particular, pseudo-LiDAR requires two systems: a depth estimator, typically trained on a generic depth estimation (stereo) image corpus, and an object detector trained on the point cloud data converted from the resulting depth estimates. It is unlikely that the two training objectives are optimally aligned for the ultimate goal of maximizing final detection accuracy. For example, depth estimators are typically trained with a loss that penalizes errors across all pixels equally, instead of focusing on objects of interest. Consequently, such a loss may over-emphasize nearby or non-object pixels, as they are over-represented in the data. Further, if the depth network is trained to estimate disparity, its intrinsic error will be exacerbated for far-away objects (depth is inversely proportional to disparity, so a fixed disparity error translates into a depth error that grows with distance).
To address these issues, we propose a 3D object detection framework that is trained end-to-end, while preserving the modularity and compatibility of pseudo-LiDAR with newly developed depth estimation and object detection algorithms. To enable back-propagation-based end-to-end training on the final loss, the change of representation (CoR) between the depth estimator and the object detector must be differentiable with respect to the estimated depth. We focus on two types of CoR modules, subsampling and quantization, which are compatible with different LiDAR-based object detector types. We study in detail how to enable effective back-propagation with each module. Specifically, for quantization, we introduce a novel differentiable soft quantization CoR module to overcome its inherent non-differentiability. The resulting framework is readily compatible with most existing (and hopefully future) LiDAR-based detectors and 3D depth estimators.
We validate our proposed end-to-end pseudo-LiDAR (E2E-PL) approach with two representative object detectors, PIXOR (quantized input) and PointRCNN (subsampled point input), on the widely-used KITTI object detection dataset [11, 12]. Our results are promising: we improve over the baseline pseudo-LiDAR pipeline and the improved PL++ pipeline in all the evaluation settings and significantly outperform other image-based 3D object detectors. At the time of submission our E2E-PL with PointRCNN holds the best results on the KITTI image-based 3D object detection leaderboard. Our qualitative results further confirm that end-to-end training can effectively guide the depth estimator to refine its estimates around object boundaries, which are crucial for accurately localizing objects (see Figure 1 for an illustration).
3D Object Detection. Most works on 3D object detection are based on 3D LiDAR point clouds [8, 9, 10, 17, 18, 20, 27, 35, 36, 43, 44, 46]. Among these, there are two streams in terms of point cloud processing: 1) directly operating on the unordered point clouds in 3D [17, 31, 35, 49], mostly by applying PointNet [32, 33] and/or applying 3D convolution over neighbors; 2) operating on quantized 3D/4D tensor data, which are generated by discretizing the locations of point clouds into some fixed grids [7, 16, 23, 45]. Images can be included in both types of approaches, but primarily to supplement the LiDAR signal [7, 9, 16, 22, 23, 26, 31, 42].
Besides LiDAR-based models, there are solely image-based models, which are mostly developed from the 2D frontal-view detection pipeline [13, 24, 34], but most of them are no longer competitive with the state of the art in localizing objects in 3D [1, 5, 6, 4, 28, 19, 29, 39, 40, 41].
Pseudo-LiDAR. This gap has been greatly reduced by the recently proposed pseudo-LiDAR framework [37, 47]. Different from previous image-based 3D object detection models, pseudo-LiDAR first utilizes an image-based depth estimation model to obtain the predicted depth $Z(u, v)$ of each image pixel $(u, v)$. The resulting depth $Z(u, v)$ is then projected to a “pseudo-LiDAR” point $(x, y, z)$ in 3D by

$$z = Z(u, v), \qquad x = \frac{(u - c_U) \cdot z}{f_U}, \qquad y = \frac{(v - c_V) \cdot z}{f_V}, \tag{1}$$

where $(c_U, c_V)$ is the camera center and $f_U$ and $f_V$ are the horizontal and vertical focal lengths. The “pseudo-LiDAR” points are then treated as if they were LiDAR signals, over which any LiDAR-based 3D object detector can be applied. By making use of the separately trained state-of-the-art algorithms from both ends [2, 16, 31, 35, 47], pseudo-LiDAR achieved the highest image-based performance on the KITTI benchmark [11, 12]. Our work builds upon this framework.
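To make the change of representation concrete, the following is a minimal sketch of the pinhole back-projection that pseudo-LiDAR relies on, i.e., $x = (u - c_U) z / f_U$, $y = (v - c_V) z / f_V$ with $z$ the predicted depth of pixel $(u, v)$. The function name `depth_to_points`, the nested-list depth map, and the choice to skip non-positive depths are illustrative assumptions, not part of the released code.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_to_points(depth: List[List[float]],
                    f_u: float, f_v: float,
                    c_u: float, c_v: float) -> List[Point]:
    """Back-project a dense depth map into 3D points in the camera frame.

    depth[v][u] holds the depth z of pixel (u, v): u indexes columns, v rows.
    (c_u, c_v) is the camera center; f_u, f_v are the focal lengths.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                # Assumption of this sketch: non-positive depth means "invalid".
                continue
            x = (u - c_u) * z / f_u
            y = (v - c_v) * z / f_v
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel.
cloud = depth_to_points([[10.0, 10.0], [0.0, 20.0]],
                        f_u=100.0, f_v=100.0, c_u=0.5, c_v=0.5)
```

Every operation here is a product or a difference of the depth with constants, which is why this step is differentiable with respect to the depth.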
One key advantage of the pseudo-LiDAR pipeline [37, 47] is its plug-and-play modularity, which allows it to incorporate any advances in 3D depth estimation or LiDAR-based 3D object detection. However, it also lacks the notion of end-to-end training of both components to ultimately maximize the detection accuracy. In particular, the pseudo-LiDAR pipeline is trained in two steps, with different objectives. First, a depth estimator is learned to estimate generic depths for all pixels in a stereo image; then a LiDAR-based detector is trained to predict object bounding boxes from depth estimates, generated by the frozen depth network.
As mentioned in Section 1, learning pseudo-LiDAR in this fashion does not align the two components well. On one end, a LiDAR-based object detector heavily relies on accurate 3D points on or in the proximity of the object surfaces to detect and localize objects, especially for far-away objects, which are rendered by relatively few points. On the other end, a depth estimator learned to predict all the pixel depths may place over-emphasis on the background and nearby objects since they occupy most of the pixels in an image. For example, in the KITTI dataset only about of all pixels correspond to cars and pedestrians/cyclists (Figure 2). Such a misalignment is aggravated by fixing the depth estimator while training the object detector: the object detector is unaware of the intrinsic depth error in the input and thus can hardly detect the far-away objects correctly.
Figure 3 illustrates our proposed end-to-end pipeline to resolve these shortcomings. Here, the error signal from misdetecting or mislocalizing an object can “softly attend” to pixels which affect the prediction most (likely those on or around objects in 2D), instructing the depth estimator where to improve for the subsequent detector. To enable back-propagating the error signal from the final detection loss, the change of representation (CoR) between the depth estimator and the object detector must be differentiable with respect to the estimated depth. In the following, we identify two major types of CoR — subsampling and quantization — in incorporating existing LiDAR-based detectors into the pseudo-LiDAR pipeline.
Several LiDAR-based object detectors take voxelized 3D or 4D tensors as inputs [7, 16, 23, 45]. The 3D point locations are discretized into a fixed grid, and only the occupation (i.e., $\{0, 1\}$) or densities (i.e., $[0, 1]$) are recorded in the resulting tensor. (Footnote 1: For LiDAR data the reflection intensity is often also recorded.) The advantage of this kind of approach is that 2D and 3D convolutions can be directly applied to extract features from the tensor. Such a discretization process, however, makes back-propagation difficult.
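Let us consider an example where we are given a point cloud $P = \{p_1, \dots, p_N\}$ with the goal to generate a 3D occupation tensor $T$ of $M$ bins, where each bin $m \in \{1, \dots, M\}$ is associated with a fixed center location $\hat{p}_m$. The resulting tensor $T$ is defined as follows,

$$T(m) = \begin{cases} 1, & \text{if } \exists\, p \in P \text{ s.t. } m = \arg\min_{m'} \|p - \hat{p}_{m'}\|_2, \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

In other words, if a point $p \in P$ falls into bin $m$, then $T(m) = 1$; otherwise, $T(m) = 0$. The forward pass of generating $T$ is straightforward. The backward pass to obtain the gradient signal of the detection loss $\mathcal{L}_{\text{det}}$ with respect to $p \in P$ or the depth map $Z$ (Equation 1), however, is non-trivial.

Concretely, we can obtain $\partial \mathcal{L}_{\text{det}} / \partial T(m)$ by taking the gradients of $\mathcal{L}_{\text{det}}$ with respect to $T$. Intuitively, if $\partial \mathcal{L}_{\text{det}} / \partial T(m) < 0$, it means that $T(m)$ should increase; i.e., there should be points falling into bin $m$. In contrast, if $\partial \mathcal{L}_{\text{det}} / \partial T(m) > 0$, it means that $T(m)$ should decrease by pushing points out from bin $m$. But how can we pass these messages back to the input point cloud $P$? More specifically, how can we translate the single digit $\partial \mathcal{L}_{\text{det}} / \partial T(m)$ of each bin into useful information in 3D in order to adjust the point cloud $P$?

As a remedy, we propose to modify the forward pass by introducing a differentiable soft quantization module (see Figure 4). We introduce a radial basis function (RBF) around the center $\hat{p}_m$ of a given bin $m$. Instead of binary occupancy counters, we keep a “soft” count of the points inside the bin, weighted by the RBF. (Footnote 2: We note that the issue of back-propagation cannot be resolved simply by computing real-value densities in Equation 2.) Further, we allow any given bin $m$ to be influenced by a local neighborhood $\mathcal{N}_m$ of close bins. We then modify the definition of $T$ accordingly. Let $P_m$ denote the set of points that fall into bin $m$,

$$P_m = \{\, p \in P \ \text{ s.t. } \ m = \arg\min_{m'} \|p - \hat{p}_{m'}\|_2 \,\}.$$

We define $T(m, m')$ to denote the average RBF weight of points in bin $m'$ w.r.t. bin $m$ (more specifically, $\hat{p}_m$),

$$T(m, m') = \begin{cases} 0, & \text{if } |P_{m'}| = 0, \\ \dfrac{1}{|P_{m'}|} \displaystyle\sum_{p \in P_{m'}} \exp\!\left(-\dfrac{\|p - \hat{p}_m\|^2}{\sigma^2}\right), & \text{if } |P_{m'}| > 0. \end{cases} \tag{3}$$

The final value of the tensor $T$ at bin $m$ is the combination of soft occupation from its own and neighboring bins,

$$T(m) = T(m, m) + \frac{1}{|\mathcal{N}_m|} \sum_{m' \in \mathcal{N}_m} T(m, m'). \tag{4}$$

We note that, when $\sigma^2 \gg 0$ and $\mathcal{N}_m = \emptyset$, Equation 4 recovers Equation 2. Throughout this paper, we set the neighborhood $\mathcal{N}_m$ to the 26 neighboring bins (considering a 3x3x3 cube centered on the bin) and . Following , we set the total number of bins to .

Our soft quantization module is fully differentiable. The partial derivative $\partial \mathcal{L}_{\text{det}} / \partial T(m)$ directly affects the points in bin $m$ (i.e., $P_m$) and its neighboring bins and enables end-to-end training. For example, to pass the partial derivative to a point $p$ in bin $m'$, we compute $\frac{\partial \mathcal{L}_{\text{det}}}{\partial T(m)} \times \frac{\partial T(m)}{\partial T(m, m')} \times \nabla_p T(m, m')$. More importantly, even when bin $m$ mistakenly contains no point, $\partial \mathcal{L}_{\text{det}} / \partial T(m) < 0$ allows it to drag points from other bins, say bin $m'$, to be closer to $\hat{p}_m$, enabling corrections of the depth error more effectively.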
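The forward pass of such a soft quantization can be sketched as follows: each occupied bin spreads the average RBF weight of its points onto itself and onto its 3x3x3 neighborhood, so that even empty bins receive a nonzero, differentiable value. This is a pure-Python illustration under stated assumptions, not the authors' implementation: the regular grid parametrization (`origin`, `bin_size`, `shape`), the dictionary output, and the fixed normalizer of 26 at grid boundaries are choices of this sketch, and `sigma2` is left as a required argument.

```python
import math
from collections import defaultdict
from itertools import product


def soft_quantize(points, origin, bin_size, shape, sigma2):
    """Soft occupancy on a regular 3D grid (sketch).

    points: iterable of (x, y, z); origin: grid corner; shape: (nx, ny, nz).
    Returns {bin_index: soft_value} for every bin with a nonzero value.
    """
    def in_grid(idx):
        return all(0 <= idx[k] < shape[k] for k in range(3))

    def center(idx):
        return tuple(origin[k] + (idx[k] + 0.5) * bin_size for k in range(3))

    # Hard assignment: the bin m' that each point falls into.
    members = defaultdict(list)
    for p in points:
        idx = tuple(int(math.floor((p[k] - origin[k]) / bin_size))
                    for k in range(3))
        if in_grid(idx):
            members[idx].append(p)

    # Each occupied bin m' contributes its average RBF weight, measured
    # w.r.t. the center of bin m, to itself and to its 26 neighbors.
    tensor = defaultdict(float)
    for src, pts in members.items():
        for off in product((-1, 0, 1), repeat=3):
            dst = tuple(src[k] + off[k] for k in range(3))
            if not in_grid(dst):
                continue
            c = center(dst)
            weight = sum(
                math.exp(-sum((p[k] - c[k]) ** 2 for k in range(3)) / sigma2)
                for p in pts
            ) / len(pts)
            # Own bin counts fully; neighbors are averaged over 26 bins
            # (assumption: the same normalizer is used at grid boundaries).
            tensor[dst] += weight if off == (0, 0, 0) else weight / 26.0
    return dict(tensor)


# A single point at the center of the middle bin of a 3x3x3 grid.
T = soft_quantize([(1.5, 1.5, 1.5)], (0.0, 0.0, 0.0), 1.0, (3, 3, 3),
                  sigma2=1.0)
```

In this toy example the middle bin gets the value 1, while all 26 empty neighbors get small positive values. Those values depend smoothly on the point coordinates, which is what allows a gradient on an empty bin to pull nearby points toward it.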
As an alternative to voxelization, some LiDAR-based object detectors take the raw 3D points as input (either as a whole or by grouping them according to metric locations [17, 43, 49] or potential object locations). For these, we can directly use the 3D point clouds obtained by Equation 1; however, some subsampling is required. Different from voxelization, subsampling is far more amenable to end-to-end training: the points that are filtered out can simply be ignored during the backward pass; the points that are kept are left untouched. First, we remove all 3D points higher than the normal heights that LiDAR signals can cover, such as pixels of the sky. Further, we may sparsify the remaining points by subsampling. This second step is optional but has been suggested in prior work because depth maps generate significantly more points than LiDAR: on average 300,000 points are in the pseudo-LiDAR signal but 18,000 points are in the LiDAR signal (in the frontal view of the car). Although denser representations can be advantageous in terms of accuracy, they do slow down the object detection network. We apply an angular-based sparsifying method. We define multiple bins in 3D by discretizing the spherical coordinates $(r, \theta, \phi)$. Specifically, we discretize $\theta$ (polar angle) and $\phi$ (azimuthal angle) to mimic the LiDAR beams. We then keep a single 3D point from those points whose spherical coordinates fall into the same bin. The resulting point cloud therefore mimics true LiDAR points.
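The angular sparsification described above can be sketched as a bucketing of points by polar and azimuthal angle, keeping one point per bucket. Several details are assumptions of this sketch rather than facts from the paper: the coordinate frame (z up, angles measured as in standard spherical coordinates), the angular step sizes, and the rule of keeping the first point encountered in each bin. The function returns indices rather than points so that the same selection can be reused when routing gradients back to pixels.

```python
import math


def angular_sparsify(points, theta_step, phi_step):
    """Return the indices of the points kept after angular binning.

    points: list of (x, y, z) in a frame with z pointing up (assumption).
    theta_step, phi_step: bin widths in radians for the polar and
    azimuthal angles (hypothetical parameters of this sketch).
    """
    kept = {}
    for i, (x, y, z) in enumerate(points):
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # the origin has no well-defined direction
        theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle
        phi = math.atan2(y, x)                         # azimuthal angle
        key = (int(math.floor(theta / theta_step)),
               int(math.floor(phi / phi_step)))
        # Keep the first point that lands in each angular bin (assumption).
        kept.setdefault(key, i)
    return sorted(kept.values())


pts = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.001, 0.0)]
kept_indices = angular_sparsify(pts, theta_step=0.1, phi_step=0.1)
```

Here the second and fourth points share an angular bin with the first and are dropped, while the third point lies in a different azimuthal bin and is kept.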
In terms of back-propagation, since these 3D object detectors directly process the 3D coordinates $(x, y, z)$ of a point $p$, we can obtain the gradients of the final detection loss $\mathcal{L}_{\text{det}}$ with respect to the coordinates; i.e., $\partial \mathcal{L}_{\text{det}} / \partial p$. As long as we properly record which points are subsampled in the forward pass or how they are grouped, back-propagating the gradients from the object detector to the depth estimates (at sparse pixel locations) can be straightforward. Here, we leverage the fact that Equation 1 is differentiable with respect to the depth $z$. However, due to the high sparsity of gradient information in $\partial \mathcal{L}_{\text{det}} / \partial Z$, where $Z$ is the depth map, we found that the initial depth loss used to train a conventional depth estimator is required to jointly optimize the depth estimator.
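As an illustration of why this backward pass is simple, the chain rule through the pinhole back-projection ($x = (u - c_U) z / f_U$, $y = (v - c_V) z / f_V$) gives, for each kept pixel, $\partial \mathcal{L} / \partial Z(u, v) = g_x (u - c_U) / f_U + g_y (v - c_V) / f_V + g_z$, where $(g_x, g_y, g_z)$ is the gradient of the loss with respect to the corresponding 3D point. The sketch below scatters these values into an otherwise zero gradient map. The function name and argument layout are hypothetical; in practice an automatic differentiation framework would perform this step.

```python
def point_grads_to_depth_grads(pixels, point_grads,
                               f_u, f_v, c_u, c_v, height, width):
    """Scatter per-point gradients back onto the depth map (sketch).

    pixels: list of (u, v) for the points kept by subsampling.
    point_grads: list of (gx, gy, gz), the loss gradient w.r.t. each point.
    Returns a height x width nested list; untouched pixels stay at 0.
    """
    grad = [[0.0] * width for _ in range(height)]
    for (u, v), (gx, gy, gz) in zip(pixels, point_grads):
        # Chain rule through x = (u - c_u) z / f_u, y = (v - c_v) z / f_v.
        grad[v][u] += gx * (u - c_u) / f_u + gy * (v - c_v) / f_v + gz
    return grad


# One kept pixel in a 2x2 depth map.
g = point_grads_to_depth_grads([(1, 0)], [(2.0, 4.0, 1.0)],
                               f_u=2.0, f_v=2.0, c_u=0.0, c_v=0.0,
                               height=2, width=2)
```

Only the pixels that survive subsampling receive a nonzero entry, which illustrates the sparsity of the detection gradient on the depth map.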
This subsection, together with subsection 3.1, presents a general end-to-end framework applicable to various object detectors. We do not claim this subsection as a technical contribution, but it provides details that make end-to-end training for point-cloud-based detectors successful.
To learn the pseudo-LiDAR framework end-to-end, we replace Equation 2 by Equation 4 for object detectors that take 3D or 4D tensors as input. For object detectors that take raw points as input, no specific modification is needed.
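We learn the object detector and the depth estimator jointly with the following loss,

$$\mathcal{L} = \lambda_{\text{det}} \mathcal{L}_{\text{det}} + \lambda_{\text{depth}} \mathcal{L}_{\text{depth}}, \tag{5}$$

where $\mathcal{L}_{\text{det}}$ is the loss from 3D object detection and $\mathcal{L}_{\text{depth}}$ is the loss of depth estimation. $\lambda_{\text{det}}$ and $\lambda_{\text{depth}}$ are the corresponding coefficients. The detection loss is the combination of classification loss and regression loss,

$$\mathcal{L}_{\text{det}} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{reg}}, \tag{6}$$

in which the classification loss $\mathcal{L}_{\text{cls}}$ aims to assign the correct class (e.g., car) to the detected bounding box; the regression loss $\mathcal{L}_{\text{reg}}$ aims to refine the size, center, and rotation of the box.

Let $Z$ be the predicted depth and $Z^*$ be the ground truth. We apply the following depth estimation loss,

$$\mathcal{L}_{\text{depth}} = \frac{1}{|\mathcal{A}|} \sum_{(u, v) \in \mathcal{A}} \ell\big(Z(u, v) - Z^*(u, v)\big), \tag{7}$$

where $\mathcal{A}$ is the set of pixels that have ground truth depth. $\ell(x)$ is the smooth L1 loss defined as

$$\ell(x) = \begin{cases} 0.5\, x^2, & \text{if } |x| < 1, \\ |x| - 0.5, & \text{otherwise.} \end{cases} \tag{8}$$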
We find that the depth loss is important, as the loss from object detection may only influence part of the pixels (due to quantization or subsampling). After all, our hope is to make the depth estimates around (far-away) objects more accurate, but not to sacrifice the accuracy of depths on the background and nearby objects. (Footnote 3: The depth loss can be seen as a regularizer to keep the output of the depth estimator physically meaningful. We note that 3D object detectors are designed with an inductive bias: the input is an accurate 3D point cloud. However, with the large capacity of neural networks, training the depth estimator and object detector end-to-end with the detection loss alone can lead to arbitrary representations between them that break the inductive bias but achieve a lower training loss. The resulting model thus will have a much worse testing loss than the one trained together with the depth loss.)
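The joint objective can be sketched in a few lines: a smooth L1 depth loss averaged over the pixels that have ground truth, combined with the detection loss (classification plus regression) through two coefficients. The representation of missing ground truth as `None`, the function names, and the example coefficient values are assumptions made for illustration only.

```python
def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def depth_loss(pred, gt):
    """Mean smooth L1 over pixels with ground truth (None = no ground truth)."""
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred, gt):
        for z, z_star in zip(pred_row, gt_row):
            if z_star is None:
                continue
            total += smooth_l1(z - z_star)
            count += 1
    return total / count if count else 0.0


def joint_loss(l_cls, l_reg, l_depth, lambda_det, lambda_depth):
    """Weighted sum of the detection loss (cls + reg) and the depth loss."""
    return lambda_det * (l_cls + l_reg) + lambda_depth * l_depth


# Toy example: two pixels, only the first has ground truth depth.
l_depth = depth_loss([[1.0, 5.0]], [[1.5, None]])
total = joint_loss(l_cls=1.0, l_reg=2.0, l_depth=l_depth,
                   lambda_det=0.1, lambda_depth=1.0)  # illustrative weights
```

Because the depth loss is only evaluated where ground truth exists, it acts on a sparse subset of the depth map, which is complementary to the (also sparse) support of the detection gradient.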
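Table 1: Statistics of the gradients of each loss (Depth Loss, P-RCNN Loss, PIXOR Loss) with respect to the predicted depth map. [Table body not recovered.]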
Table 2: 3D object detection results on the KITTI validation set (car category). Each entry is $\text{AP}_{\text{BEV}}$ / $\text{AP}_{\text{3D}}$ in %. Input: S = stereo, L = LiDAR, M = monocular.

| Method | Input | IoU = 0.5, Easy | IoU = 0.5, Moderate | IoU = 0.5, Hard | IoU = 0.7, Easy | IoU = 0.7, Moderate | IoU = 0.7, Hard |
|---|---|---|---|---|---|---|---|
| 3DOP | S | 55.0 / 46.0 | 41.3 / 34.6 | 34.6 / 30.1 | 12.6 / 6.6 | 9.5 / 5.1 | 7.6 / 4.1 |
| MLF-stereo | S | - | 53.7 / 47.4 | - | - | 19.5 / 9.8 | - |
| S-RCNN | S | 87.1 / 85.8 | 74.1 / 66.3 | 58.9 / 57.2 | 68.5 / 54.1 | 48.3 / 36.7 | 41.5 / 31.1 |
| OC-Stereo | S | 90.0 / 89.7 | 80.6 / 80.0 | 71.1 / 70.3 | 77.7 / 64.1 | 66.0 / 48.3 | 51.2 / 40.4 |
| PL: P-RCNN | S | 88.4 / 88.0 | 76.6 / 73.7 | 69.0 / 67.8 | 73.4 / 62.3 | 56.0 / 44.9 | 52.7 / 41.6 |
| PL++: P-RCNN | S | 89.8 / 89.7 | 83.8 / 78.6 | 77.5 / 75.1 | 82.0 / 67.9 | 64.0 / 50.1 | 57.3 / 45.3 |
| E2E-PL: P-RCNN | S | 90.5 / 90.4 | 84.4 / 79.2 | 78.4 / 75.9 | 82.7 / 71.1 | 65.7 / 51.7 | 58.4 / 46.7 |
| PL: PIXOR | S | 89.0 / - | 75.2 / - | 67.3 / - | 73.9 / - | 54.0 / - | 46.9 / - |
| PL++: PIXOR | S | 89.9 / - | 78.4 / - | 74.7 / - | 79.7 / - | 61.1 / - | 54.5 / - |
| E2E-PL: PIXOR | S | 94.6 / - | 84.8 / - | 77.1 / - | 80.4 / - | 64.3 / - | 56.7 / - |
| P-RCNN | L | 97.3 / 97.3 | 89.9 / 89.8 | 89.4 / 89.3 | 90.2 / 89.2 | 87.9 / 78.9 | 85.5 / 77.9 |
| PIXOR | L + M | 94.2 / - | 86.7 / - | 86.1 / - | 85.2 / - | 81.2 / - | 76.1 / - |
Dataset. We evaluate our end-to-end (E2E-PL) approach on the KITTI object detection benchmark [11, 12], which contains 3,712, 3,769 and 7,518 images for training, validation, and testing. KITTI provides for each image the corresponding 64-beam Velodyne LiDAR point cloud, right image for stereo, and camera calibration matrices.
Metric. We focus on 3D and bird’s-eye-view (BEV) object detection and report the results on the validation set. We focus on the “car” category, following [7, 37, 42]. We report the average precision (AP) with the IoU thresholds at 0.5 and 0.7. We denote AP for the 3D and BEV tasks by $\text{AP}_{\text{3D}}$ and $\text{AP}_{\text{BEV}}$. KITTI defines the easy, moderate, and hard settings, in which objects with 2D box heights smaller than or occlusion/truncation levels larger than certain thresholds are disregarded. The hard (moderate) setting contains all the objects in the moderate and easy (easy) settings.
Our end-to-end pipeline has two parts: stereo depth estimation and 3D object detection. In training, we first learn only the stereo depth estimation network to get a depth estimation prior; we then fix the depth network and use its output to train the 3D object detector from scratch. Finally, we jointly train the two parts with balanced loss weights.
Depth estimation. We apply SDN as the backbone to estimate a dense depth map $Z$. Following prior work, we pre-train SDN on the synthetic Scene Flow dataset and fine-tune it on the 3,712 training images of KITTI. We obtain the depth ground truth by projecting the corresponding LiDAR points onto images.
Object detection. We apply two LiDAR-based algorithms: PIXOR  (voxel-based, with quantization) and PointRCNN (P-RCNN)  (point-cloud-based). We use the released code of P-RCNN. We obtain the code of PIXOR from the authors of , which has slight modification to include visual information (denoted as PIXOR).
Joint training. We set the depth estimation and object detection networks trainable, and allow the gradients of the detection loss to back-propagate to the depth network. We study the gradients of the detection and depth losses w.r.t. the predicted depth map $Z$ to determine the hyper-parameters $\lambda_{\text{det}}$ and $\lambda_{\text{depth}}$ (the coefficients of the detection and depth losses). For each loss, we calculate the percentage of pixels on the entire depth map that have gradients. We further collect the mean and sum of the gradients on the depth map during training, as shown in Table 1. The depth loss only influences 3% of the depth map because the ground truth obtained from LiDAR is sparse. The P-RCNN loss, due to subsampling on the dense PL point cloud, can only influence 4% of the depth map. For the PIXOR loss, our soft quantization module can back-propagate the gradients to 70% of the pixels of the depth map. In our experiments, we find that balancing the sums of gradients between the detection and depth losses is crucial in making joint training stable. We carefully set $\lambda_{\text{det}}$ and $\lambda_{\text{depth}}$ to make sure that the sums are on the same scale in the beginning of training. For P-RCNN, we set and ; for PIXOR, we and .
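The gradient statistics discussed above (fraction of pixels with a gradient, mean, and sum) and one possible way of turning them into a loss weight can be sketched as follows. This is a hypothetical helper, not the procedure used in the paper: it assumes that statistics are taken over absolute gradient values, that the mean is computed over pixels with nonzero gradient, and that the detection weight is simply chosen so that the two gradient sums match at the start of joint training.

```python
def gradient_stats(grad_map):
    """Coverage, mean, and sum of absolute gradients on a depth map."""
    values = [abs(g) for row in grad_map for g in row]
    nonzero = [g for g in values if g > 0.0]
    total = sum(nonzero)
    return {
        "coverage": len(nonzero) / len(values),          # fraction of pixels
        "mean": total / len(nonzero) if nonzero else 0.0,
        "sum": total,
    }


def balance_det_weight(det_grad, depth_grad, lambda_depth=1.0):
    """Pick lambda_det so both weighted gradient sums are on the same scale."""
    det_sum = gradient_stats(det_grad)["sum"]
    depth_sum = gradient_stats(depth_grad)["sum"]
    if det_sum == 0.0:
        raise ValueError("detection loss has no gradient on the depth map")
    return lambda_depth * depth_sum / det_sum


# Toy 2x2 gradient maps (made-up numbers, for illustration only).
det_grad = [[0.0, 10.0], [0.0, 30.0]]
depth_grad = [[2.0, 0.0], [0.0, 0.0]]
stats = gradient_stats(det_grad)
lambda_det = balance_det_weight(det_grad, depth_grad)
```

A ratio of this kind only provides a starting point; the text above indicates that the coefficients were set by hand so that the sums are on the same scale at the beginning of training.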
On KITTI validation set. The main results on the KITTI validation set are summarized in Table 2. It can be seen that 1) the proposed E2E-PL framework consistently improves the object detection performance on both the model using subsampled point inputs (P-RCNN) and that using quantized inputs (PIXOR). 2) While the quantization-based model (PIXOR) performs worse than the point-cloud-based model (P-RCNN) when they are trained in a non end-to-end manner, end-to-end training greatly reduces the performance gap between these two types of models, especially for IoU at 0.5: the gap between these two models on $\text{AP}_{\text{BEV}}$ in moderate cases is reduced from 5.4 to -0.4 points (83.8 vs. 78.4 for PL++, and 84.4 vs. 84.8 for E2E-PL, in Table 2). As shown in Table 1, on depth maps, the gradients flowing from the loss of the PIXOR detector are much denser than those from the loss of the P-RCNN detector, suggesting that more gradient information is beneficial. 3) For IoU at 0.5 under easy and moderate cases, E2E-PL: PIXOR performs on a par with PIXOR using LiDAR.
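Table 3: 3D object detection results on the KITTI test set (car category). Each entry is $\text{AP}_{\text{BEV}}$ / $\text{AP}_{\text{3D}}$ in %.

| Method | Easy | Moderate | Hard |
|---|---|---|---|
| S-RCNN | 61.9 / 47.6 | 41.3 / 30.2 | 33.4 / 23.7 |
| RT3DStereo | 58.8 / 29.9 | 46.8 / 23.3 | 38.4 / 19.0 |
| OC-Stereo | 68.9 / 55.2 | 51.5 / 37.6 | 43.0 / 30.3 |
| PL | 67.3 / 54.5 | 45.0 / 34.1 | 38.4 / 28.3 |
| PL++: P-RCNN | 78.3 / 61.1 | 58.0 / 42.4 | 51.3 / 37.0 |
| E2E-PL: P-RCNN | 79.6 / 64.8 | 58.8 / 43.9 | 52.1 / 38.1 |
| PL++: PIXOR | 70.7 / - | 48.3 / - | 41.0 / - |
| E2E-PL: PIXOR | 71.9 / - | 51.7 / - | 43.3 / - |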
On KITTI test set. Table 3 shows the results on KITTI test set. We observe the same consistent performance boost by applying our E2E-PL framework on each detector type. At the time of submission, E2E-PL: P-RCNN achieves the state-of-the-art results over image-based models.
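Table 4: Ablation studies on the point-cloud-based pipeline with P-RCNN, reported at IoU = 0.5 and IoU = 0.7. [Table body not recovered.]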
We conduct ablation studies on the point-cloud-based pipeline with P-RCNN in Table 4. We divide the pipeline into three sub-networks: the depth estimation network (Depth), the region proposal network (RPN), and the Regional-CNN (RCNN). We try various combinations of the sub-networks (and their corresponding losses) by setting them to trainable in our final joint training stage. The first row serves as the baseline. In rows two to four, the results indicate that simply training each sub-network independently with more iterations does not improve the accuracy. In row five, jointly training RPN and RCNN (i.e., P-RCNN) does not yield a significant improvement, because the point cloud from Depth is not updated and remains noisy. In row six we jointly train Depth with RPN, but the result does not improve much either. We suspect that the loss on RPN is insufficient to guide the refinement of depth estimation. By combining the three sub-networks together and using the RCNN, RPN, and Depth losses to refine them, we get the best results (except for two cases).
For the quantization-based pipeline with soft quantization, we also conduct similar ablation studies, as shown in Table 5. Since PIXOR is a one-stage detector, we divide the pipeline into two components: Depth and Detector. Similar to the point-cloud-based pipeline (Table 4), simply training each component independently with more iterations does not improve the accuracy. However, when we jointly train both components, we see a significant improvement (last row). This demonstrates the effectiveness of our soft quantization module, which can back-propagate the Detector loss to influence 70% of the pixels on the predicted depth map.
More interestingly, applying the soft quantization module alone without jointly training (rows one and two) does not improve over or is even outperformed by PL++: PIXOR with hard quantization (whose result is , from Table 2). But with joint end-to-end training enabled by soft quantization, our E2E-PL:PIXOR consistently outperforms the separately trained PL++: PIXOR.
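Table 5: Ablation studies on the quantization-based pipeline with PIXOR. Each entry is $\text{AP}_{\text{BEV}}$ at IoU = 0.5 / IoU = 0.7 in %; the last row (joint training) matches E2E-PL: PIXOR in Table 2. [Columns indicating the trainable components were not recovered.]

| Row | Easy | Moderate | Hard |
|---|---|---|---|
| 1 | 89.8 / 77.0 | 78.3 / 57.7 | 69.5 / 53.8 |
| 2 | 89.9 / 76.9 | 78.7 / 58.0 | 69.7 / 53.9 |
| 3 | 90.2 / 78.1 | 79.2 / 58.9 | 69.6 / 54.2 |
| 4 | 94.6 / 80.4 | 84.8 / 64.3 | 77.1 / 56.7 |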
We show the qualitative results of the point-cloud-based and quantization-based pipelines.
Depth visualization. We visualize the predicted depth maps and the corresponding point clouds converted from the depth maps by the quantization-based pipeline in Figure 5. For the original depth network shown in the first row, since the ground truth is very sparse, the depth prediction is not accurate and we observe a clear misestimation (artifact) on the top of the car: many 3D points there are misestimated. With end-to-end joint training, the detection loss on cars forces the depth network to reduce the misestimation and guides it to generate more accurate point clouds. As shown in the second row, both the depth prediction quality and the point cloud quality are greatly improved. The last row shows the input image to the depth network, the zoomed-in patch of a specific car, and its corresponding LiDAR ground truth point cloud.
Detection visualization. We also show qualitative comparisons of the detection results in Figure 6. In the BEV, the ground truth bounding boxes are marked in red and the predictions are marked in green. In the first example (column), pseudo-LiDAR++ (PL++) misses one car in the middle and gives poor localization of the far-away cars. Our E2E-PL detects all the cars and gives accurate predictions on the farthest car. For the second example, the result is consistent for the far-away cars, where E2E-PL detects more cars and localizes them more accurately. Even for the nearby cars, E2E-PL gives better results. The third example shows a case where there is only one ground truth car. Our E2E-PL does not have any false positive predictions.
Speed. The inference time of our method is similar to pseudo-LiDAR and is determined by the stereo and detection networks. The soft quantization module (with 26 neighboring bins) only computes the RBF weights between a point and 27 bins (roughly points per scene), followed by grouping the weights of points into bins. The complexity is $O(N)$ in the number of points $N$, and both steps can be parallelized. Using a single GPU with a PyTorch implementation, E2E-PL: P-RCNN takes 0.49s/frame and E2E-PL: PIXOR takes 0.55s/frame, within which SDN (the stereo network) takes 0.39s/frame and deserves further study to speed it up (e.g., by code optimization, network pruning, etc.).

Others. See the Supplementary Material for more details and results.
Our proposed framework can work for 3D object detectors taking either direct point cloud inputs or quantized structured inputs. The resulting models set a new state of the art in image-based 3D object detection and further narrow the remaining accuracy gap between stereo- and LiDAR-based sensors. Although it will probably always be beneficial to include active sensors like LiDARs in addition to passive cameras, it seems possible that the benefit may soon be too small to justify large expenses. Considering the KITTI benchmark, it is worth noting that the stereo pictures are of relatively low resolution and only a few images contain (labeled) far-away objects. It is quite plausible that higher-resolution images with a higher ratio of far-away cars would result in further detection improvements, especially in the hard (far-away and heavily occluded) category.
This research is supported by grants from the National Science Foundation NSF (III-1618134, III-1526012, IIS-1149882, IIS-1724282, and TRIPODS-1740822), the Office of Naval Research DOD (N00014-17-1-2175), the Bill and Melinda Gates Foundation, and the Cornell Center for Materials Research with funding from the NSF MRSEC program (DMR-1719875). We are thankful for generous support by Zillow and SAP America Inc.
Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In ICRA, 2017.
3d bounding box estimation using deep learning and geometry. In CVPR, 2017.
In addition to 3D object detection on the Car category, in Table S6 we show the results on the Pedestrian and Cyclist categories in the KITTI object detection validation set [11, 12]. To be consistent with the main paper, we apply P-RCNN as the object detector. Our approach (E2E-PL) outperforms the baseline without end-to-end training (PL++) by a notable margin for image-based 3D detection.
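Table S6: Results on the Pedestrian and Cyclist categories on the KITTI validation set, with P-RCNN as the detector. Each entry is $\text{AP}_{\text{BEV}}$ / $\text{AP}_{\text{3D}}$ in %.

| Category | Model | Easy | Moderate | Hard |
|---|---|---|---|---|
| Pedestrian | PL++ | 31.9 / 26.5 | 25.2 / 21.3 | 21.0 / 18.1 |
| Pedestrian | E2E-PL | 35.7 / 32.3 | 27.8 / 24.9 | 23.4 / 21.5 |
| Cyclist | PL++ | 36.7 / 33.5 | 23.9 / 22.5 | 22.7 / 20.8 |
| Cyclist | E2E-PL | 42.8 / 38.4 | 26.2 / 24.1 | 24.5 / 22.7 |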
We analyze 3D object detection of the Car category for ground truths at different depth ranges (i.e., 0-30 or 30-70 meters). We report results with the point-cloud-based pipeline in Table S7 and the quantization-based pipeline in Table S8. E2E-PL achieves better performance at both depth ranges (except for 30-70 meters, moderate, $\text{AP}_{\text{3D}}$, where PL++ scores 18.1 against 18.0). Specifically, on AP, the relative gain between E2E-PL and the baseline becomes larger for the far-away range and the hard setting.
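Table S7: Car detection at different depth ranges with the point-cloud-based pipeline (P-RCNN). Each entry is $\text{AP}_{\text{BEV}}$ / $\text{AP}_{\text{3D}}$ in %; the last column is the count reported for each range.

| Range (meters) | Model | Easy | Moderate | Hard | Count |
|---|---|---|---|---|---|
| 0-30 | PL++ | 82.9 / 68.7 | 76.8 / 64.1 | 67.9 / 55.7 | 7379 |
| 0-30 | E2E-PL | 86.2 / 72.7 | 78.6 / 66.5 | 69.4 / 57.7 | 7379 |
| 30-70 | PL++ | 19.7 / 11.0 | 29.5 / 18.1 | 27.5 / 16.4 | 3583 |
| 30-70 | E2E-PL | 23.8 / 15.1 | 31.8 / 18.0 | 31.0 / 16.9 | 3583 |

Table S8: Car detection at different depth ranges with the quantization-based pipeline (PIXOR). Same layout as Table S7.

| Range (meters) | Model | Easy | Moderate | Hard | Count |
|---|---|---|---|---|---|
| 0-30 | PL++ | 81.4 / - | 75.5 / - | 65.8 / - | 7379 |
| 0-30 | E2E-PL | 82.1 / - | 76.4 / - | 67.5 / - | 7379 |
| 30-70 | PL++ | 26.1 / - | 23.9 / - | 20.5 / - | 3583 |
| 30-70 | E2E-PL | 26.8 / - | 36.1 / - | 31.7 / - | 3583 |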
In Figure S7, we compare the precision-recall curves of our E2E-PL and pseudo-LiDAR++ (named Pseudo-LiDAR V2 on the leaderboard). On the 3D object detection track (first row of Figure S7), pseudo-LiDAR++ has a notable drop of precision on easy cars even at low recalls, meaning that pseudo-LiDAR++ has many high-confidence false positive predictions. The same happens for moderate and hard cars. Our E2E-PL suppresses the false positive predictions, resulting in smoother precision-recall curves. On the bird’s-eye-view detection track (second row of Figure S7), the precision of E2E-PL is over 97% within the recall interval 0.0 to 0.2, which is higher than the precision of pseudo-LiDAR++, indicating that E2E-PL has fewer false positives.
We show more qualitative depth comparisons in Figure S8. We use red bounding boxes to highlight the depth improvement in car related areas. We also show detection comparisons in Figure S9, where our E2E-PL has fewer false positive and negative predictions.
We also visualize the gradients of the detection loss with respect to the depth map to indicate the effectiveness of our E2E-PL pipeline, as illustrated in Figure S10. We use JET Colormap to indicate the relative absolute value of gradients, where red color indicates higher values while blue color indicates lower values. The gradients from the detector focus heavily around cars.
We summarize the quantitative results of depth estimation (w/o or w/ end-to-end training) in Table S9. As the detection loss only provides semantic information to the foreground objects, which occupy merely of pixels (Figure 2), its improvement to the overall depth estimation is limited. But for pixels around the objects, we do see improvement at certain depth ranges. We hypothesize that the detection loss may not directly improve the metric depth, but will sharpen the object boundaries in 3D to facilitate object detection and localization.
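Table S10: 3D object detection results on the Argoverse validation set. Each entry is $\text{AP}_{\text{BEV}}$ / $\text{AP}_{\text{3D}}$ in %. Input: S = stereo, L = LiDAR.

| Method | Input | IoU = 0.5, Easy | IoU = 0.5, Moderate | IoU = 0.5, Hard | IoU = 0.7, Easy | IoU = 0.7, Moderate | IoU = 0.7, Hard |
|---|---|---|---|---|---|---|---|
| PL++: P-RCNN | S | 68.7 / 55.6 | 46.3 / 36.6 | 43.5 / 35.1 | 17.2 / 6.9 | 17.0 / 11.1 | 17.0 / 11.6 |
| E2E-PL: P-RCNN | S | 73.6 / 61.3 | 47.9 / 39.1 | 44.6 / 35.7 | 30.2 / 16.1 | 18.8 / 11.3 | 17.9 / 11.5 |
| P-RCNN | L | 93.2 / 89.7 | 85.1 / 79.4 | 84.5 / 76.8 | 73.8 / 42.3 | 66.5 / 34.6 | 63.7 / 37.4 |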
We also experiment with Argoverse. We convert the Argoverse dataset into KITTI format, following the original split, which results in and scenes (i.e., stereo images with the corresponding synchronized LiDAR point clouds) for training and validation. We use the same training scheme and hyperparameters as those in the KITTI experiments, and report the validation results in Table S10. We define the easy, moderate, and hard settings following . Note that since the synchronization rate of stereo images in Argoverse is 5Hz instead of 10Hz, the dataset used here is smaller than that used in . We note that the sensor calibration in Argoverse may not register the stereo images into the perfect epipolar correspondence (as indicated in argoverse-api). Our experimental results in Table S10 also confirm the issue: image-based results are much worse than the LiDAR-based ones. Nevertheless, our E2E-PL pipeline still outperforms PL++. We note that most existing image-based detectors only report results on KITTI.