End-to-End Pseudo-LiDAR for Image-Based 3D Object Detection

Reliable and accurate 3D object detection is a necessity for safe autonomous driving. Although LiDAR sensors can provide accurate 3D point cloud estimates of the environment, they are also prohibitively expensive for many settings. Recently, the introduction of pseudo-LiDAR (PL) has led to a drastic reduction in the accuracy gap between methods based on LiDAR sensors and those based on cheap stereo cameras. PL combines state-of-the-art deep neural networks for 3D depth estimation with those for 3D object detection by converting 2D depth map outputs to 3D point cloud inputs. However, so far these two networks have to be trained separately. In this paper, we introduce a new framework based on differentiable Change of Representation (CoR) modules that allow the entire PL pipeline to be trained end-to-end. The resulting framework is compatible with most state-of-the-art networks for both tasks and in combination with PointRCNN improves over PL consistently across all benchmarks – yielding the highest entry on the KITTI image-based 3D object detection leaderboard at the time of submission. Our code will be made available at https://github.com/mileyan/pseudo-LiDAR_e2e.









1 Introduction

Figure 1: An illustration of the effectiveness of our end-to-end pipeline. The green bounding box is the ground truth detection of a car. The yellow points are points from LiDAR. The pink point cloud is generated from an independently trained depth estimator; it is inaccurate and lies outside the green box. By training depth estimation and 3D object detection end-to-end, we obtain a better point cloud (blue). With this point cloud, the object detector can achieve state-of-the-art performance.

One of the most critical components in autonomous driving is 3D object detection: a self-driving car must accurately detect and localize objects such as cars and pedestrians in order to plan the path safely and avoid collisions. To this end, existing algorithms primarily rely on LiDAR (Light Detection and Ranging) as the input signal, which provides precise 3D point clouds of the surrounding environment. LiDAR, however, is very expensive. A 64-beam model can easily cost more than the car alone, making self-driving cars prohibitively expensive for the general public.

One solution is to explore alternative sensors like commodity (stereo) cameras. Although there is still a noticeable gap to LiDAR, it is an area with exceptional progress in the past year [15, 21, 30, 37, 48, 47]. For example, pseudo-LiDAR (PL) [37, 47] converts a depth map estimated from stereo images into a 3D point cloud, followed by applying (any) existing LiDAR-based detectors. Taking advantage of the state-of-the-art algorithms from both ends [2, 16, 31, 35, 47], pseudo-LiDAR achieves the highest image-based 3D detection accuracy ( and at the moderate case) on the KITTI leaderboard [11, 12].

While the modularity of pseudo-LiDAR is conceptually appealing, the combination of two independently trained components can yield an undesired performance hit. In particular, pseudo-LiDAR requires two systems: a depth estimator, typically trained on a generic depth estimation (stereo) image corpus, and an object detector trained on the point cloud data converted from the resulting depth estimates. It is unlikely that the two training objectives are optimally aligned for the ultimate goal of maximizing final detection accuracy. For example, depth estimators are typically trained with a loss that penalizes errors across all pixels equally, instead of focusing on objects of interest. Consequently, the depth estimator may over-emphasize nearby or non-object pixels, as they are over-represented in the data. Further, if the depth network is trained to estimate disparity, its intrinsic error will be exacerbated for far-away objects [47].
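The last point can be illustrated numerically. The sketch below assumes the standard stereo relation z = f * b / d between depth z and disparity d; the focal length (721 px), baseline (0.54 m), and the 0.5 px disparity error are illustrative assumptions, not values taken from this paper.

```python
def depth_from_disparity(disparity_px, focal_px=721.0, baseline_m=0.54):
    """Standard stereo relation z = f * b / d (camera constants are assumed)."""
    return focal_px * baseline_m / disparity_px


def depth_error(true_depth_m, disparity_error_px=0.5, focal_px=721.0, baseline_m=0.54):
    """Depth error caused by under-estimating the disparity by a fixed amount."""
    true_disparity = focal_px * baseline_m / true_depth_m
    estimated = depth_from_disparity(true_disparity - disparity_error_px,
                                     focal_px, baseline_m)
    return estimated - true_depth_m


# The same disparity error is evaluated at three different true depths.
errors = {z: depth_error(z) for z in (10.0, 20.0, 40.0)}
for depth, err in errors.items():
    print(f"true depth {depth:5.1f} m -> depth error {err:.2f} m")
```

Under these assumptions the depth error grows much faster than linearly with distance, which is why a fixed disparity error hurts far-away objects the most.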

To address these issues, we propose to design a 3D object detection framework that is trained end-to-end, while preserving the modularity and compatibility of pseudo-LiDAR with newly developed depth estimation and object detection algorithms. To enable back-propagation based end-to-end training on the final loss, the change of representation (CoR) between the depth estimator and the object detector must be differentiable with respect to the estimated depth. We focus on two types of CoR modules (subsampling and quantization), which are compatible with different LiDAR-based object detector types. We study in detail how to enable effective back-propagation with each module. Specifically, for quantization, we introduce a novel differentiable soft quantization CoR module to overcome its inherent non-differentiability. The resulting framework is readily compatible with most existing (and hopefully future) LiDAR-based detectors and 3D depth estimators.

We validate our proposed end-to-end pseudo-LiDAR (E2E-PL) approach with two representative object detectors — PIXOR [45] (quantized input) and PointRCNN [35] (subsampled point input) — on the widely-used KITTI object detection dataset [11, 12]. Our results are promising: we improve over the baseline pseudo-LiDAR pipeline and the improved PL++ pipeline [47] in all the evaluation settings and significantly outperform other image-based 3D object detectors. At the time of submission our E2E-PL with PointRCNN holds the best results on the KITTI image-based 3D object detection leaderboard. Our qualitative results further confirm that end-to-end training can effectively guide the depth estimator to refine its estimates around object boundaries, which are crucial for accurately localizing objects (see Figure 1 for an illustration).

2 Related Work

3D Object Detection. Most works on 3D object detection are based on 3D LiDAR point clouds [8, 9, 10, 17, 18, 20, 27, 35, 36, 43, 44, 46]. Among these, there are two streams in terms of point cloud processing: 1) directly operating on the unordered point clouds in 3D [17, 31, 35, 49], mostly by applying PointNet [32, 33] and/or applying 3D convolution over neighbors; 2) operating on quantized 3D/4D tensor data, which are generated by discretizing the locations of point clouds into fixed grids [7, 16, 23, 45]. Images can be included in both types of approaches, but primarily to supplement the LiDAR signal [7, 9, 16, 22, 23, 26, 31, 42].

Besides LiDAR-based models, there are solely image-based models, which are mostly developed from the 2D frontal-view detection pipeline [13, 24, 34], but most of them are no longer competitive with the state of the art in localizing objects in 3D [1, 5, 6, 4, 28, 19, 29, 39, 40, 41].

Pseudo-LiDAR. This gap has been greatly reduced by the recently proposed pseudo-LiDAR framework [37, 47]. Different from previous image-based 3D object detection models, pseudo-LiDAR first utilizes an image-based depth estimation model to obtain the predicted depth $Z(u, v)$ of each image pixel $(u, v)$. The resulting depth $Z(u, v)$ is then projected to a “pseudo-LiDAR” point $(x, y, z)$ in 3D by

$$z = Z(u, v), \qquad x = \frac{(u - c_U) \cdot z}{f_U}, \qquad y = \frac{(v - c_V) \cdot z}{f_V}, \tag{1}$$

where $(c_U, c_V)$ is the camera center and $f_U$ and $f_V$ are the horizontal and vertical focal lengths. The “pseudo-LiDAR” points are then treated as if they were LiDAR signals, over which any LiDAR-based 3D object detector can be applied. By making use of the separately trained state-of-the-art algorithms from both ends [2, 16, 31, 35, 47], pseudo-LiDAR achieved the highest image-based performance on the KITTI benchmark [11, 12]. Our work builds upon this framework.
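The projection from a depth map to pseudo-LiDAR points can be sketched as follows, assuming the usual pinhole camera model x = (u - c_u) z / f_u and y = (v - c_v) z / f_v. The function name `depth_to_points` and the tiny camera parameters are hypothetical and chosen only to keep the numbers easy to check.

```python
def depth_to_points(depth, f_u, f_v, c_u, c_v):
    """Back-project a depth map into 3D points.

    depth[v][u] is the depth (in meters) at pixel column u and row v.
    Returns one (x, y, z) tuple per pixel, in row-major order.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - c_u) * z / f_u  # horizontal offset scales with depth
            y = (v - c_v) * z / f_v  # vertical offset scales with depth
            points.append((x, y, z))
    return points


# Toy 2x3 depth map with a hypothetical camera (focal length 2 px).
depth_map = [[10.0, 10.0, 10.0],
             [20.0, 20.0, 20.0]]
points = depth_to_points(depth_map, f_u=2.0, f_v=2.0, c_u=1.0, c_v=0.5)
print(points[0])
```

Every coordinate of a point is linear in the depth of its pixel, which is what makes this step differentiable with respect to the estimated depth.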

3 End-to-End Pseudo-LiDAR

One key advantage of the pseudo-LiDAR pipeline [37, 47] is its plug-and-play modularity, which allows it to incorporate any advances in 3D depth estimation or LiDAR-based 3D object detection. However, it also lacks the notion of end-to-end training of both components to ultimately maximize the detection accuracy. In particular, the pseudo-LiDAR pipeline is trained in two steps, with different objectives. First, a depth estimator is learned to estimate generic depths for all pixels in a stereo image; then a LiDAR-based detector is trained to predict object bounding boxes from depth estimates, generated by the frozen depth network.

Figure 2: Pixel distribution: 90% of all pixels correspond to background. The 10% of pixels associated with cars and people (% people) are primarily within a depth of 20m.

As mentioned in Section 1, learning pseudo-LiDAR in this fashion does not align the two components well. On one end, a LiDAR-based object detector heavily relies on accurate 3D points on or in the proximity of the object surfaces to detect and localize objects, especially for far-away objects that are rendered by relatively few points. On the other end, a depth estimator learned to predict all the pixel depths may place over-emphasis on the background and nearby objects since they occupy most of the pixels in an image. For example, in the KITTI dataset [14] only about 10% of all pixels correspond to cars and pedestrians/cyclists (Figure 2). Such a misalignment is aggravated by fixing the depth estimator when training the object detector: the object detector is unaware of the intrinsic depth error in the input and thus can hardly detect the far-away objects correctly.

Figure 3: End-to-end image-based 3D object detection: We introduce a change of representation (CoR) layer to connect the output of the depth estimation network to the input of the 3D object detection network. The result is an end-to-end pipeline that yields object bounding boxes directly from stereo images and allows back-propagation throughout all layers. Black solid arrows represent the forward pass; blue and red dashed arrows represent the backward pass for the object detection loss and depth loss, respectively. The * denotes that our CoR layer is able to back-propagate the gradients between different representations.

Figure 4: Quantization: We voxelize an input pseudo-LiDAR (PL) point cloud using soft or hard quantization. Green voxels are those influenced by the PL points. A blue voxel having a positive gradient of the detection loss exerts a force to push points away from its center to other voxels, whereas a red voxel having a negative gradient exerts a force to pull points of other voxels to its center. These forces at the red and blue voxels can only affect PL points if PL points influence those voxels. Soft quantization increases the area of influence of the PL points and therefore the forces, allowing points from other voxels to be pushed away or pulled towards. The updated PL points thus can become closer to the ground truth LiDAR point cloud.

Figure 3 illustrates our proposed end-to-end pipeline to resolve these shortcomings. Here, the error signal from misdetecting or mislocalizing an object can “softly attend” to pixels which affect the prediction most (likely those on or around objects in 2D), instructing the depth estimator where to improve for the subsequent detector. To enable back-propagating the error signal from the final detection loss, the change of representation (CoR) between the depth estimator and the object detector must be differentiable with respect to the estimated depth. In the following, we identify two major types of CoR — subsampling and quantization — in incorporating existing LiDAR-based detectors into the pseudo-LiDAR pipeline.

3.1 Quantization

Several LiDAR-based object detectors take voxelized 3D or 4D tensors as inputs [7, 16, 23, 45]. The 3D point locations are discretized into a fixed grid, and only the occupation (i.e., a binary value) or densities (i.e., a real value) are recorded in the resulting tensor. (For LiDAR data the reflection intensity is often also recorded.) The advantage of this kind of approach is that 2D and 3D convolutions can be directly applied to extract features from the tensor. Such a discretization process, however, makes back-propagation difficult.

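Let us consider an example where we are given a point cloud $P = \{p_1, \ldots, p_N\}$ with the goal to generate a 3D occupation tensor $T$ of $M$ bins, where each bin $m \in \{1, \ldots, M\}$ is associated with a fixed center location $\hat{p}_m$. The resulting tensor $T$ is defined as follows,

$$T(m) = \begin{cases} 1 & \text{if } \exists\, p \in P \ \text{s.t.}\ m = \arg\min_{m'} \| p - \hat{p}_{m'} \|_2, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

In other words, if a point $p \in P$ falls into bin $m$, then $T(m) = 1$; otherwise, $T(m) = 0$. The forward pass of generating $T$ is straightforward. The backward pass to obtain the gradient signal of the detection loss $\mathcal{L}_{det}$ with respect to $p \in P$ or the depth map $Z$ (Equation 1), however, is non-trivial.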
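A minimal sketch of hard quantization is given below. It assumes a regular axis-aligned grid, for which assigning a point to its nearest bin center is equivalent to flooring its coordinates; the helper name `hard_occupancy` and the grid parameters are hypothetical.

```python
import math


def hard_occupancy(points, grid_min, bin_size, grid_shape):
    """Return the set of occupied bin indices; T(m) = 1 exactly for these bins."""
    occupied = set()
    for p in points:
        idx = tuple(int(math.floor((c - lo) / bin_size))
                    for c, lo in zip(p, grid_min))
        # Points outside the grid are ignored.
        if all(0 <= i < n for i, n in zip(idx, grid_shape)):
            occupied.add(idx)
    return occupied


grid = dict(grid_min=(0.0, 0.0, 0.0), bin_size=1.0, grid_shape=(4, 4, 4))
before = hard_occupancy([(0.2, 0.2, 0.2), (2.5, 1.5, 0.5)], **grid)
# Nudging the first point inside its bin leaves the tensor unchanged.
after = hard_occupancy([(0.3, 0.2, 0.2), (2.5, 1.5, 0.5)], **grid)
print(before == after)
```

Because a small change in a point's position does not change the tensor at all, the tensor carries no useful gradient with respect to the point coordinates, which is the difficulty discussed next.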

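Concretely, we can obtain $\partial \mathcal{L}_{det} / \partial T(m)$ by taking the gradients of $\mathcal{L}_{det}$ with respect to $T$. Intuitively, if $\partial \mathcal{L}_{det} / \partial T(m) < 0$, it means that $T(m)$ should increase; i.e., there should be points falling into bin $m$. In contrast, if $\partial \mathcal{L}_{det} / \partial T(m) > 0$, it means that $T(m)$ should decrease by pushing points out from bin $m$. But how can we pass these messages back to the input point cloud $P$? More specifically, how can we translate the single number $\partial \mathcal{L}_{det} / \partial T(m)$ of each bin into useful information in 3D in order to adjust the point cloud $P$?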

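As a remedy, we propose to modify the forward pass by introducing a differentiable soft quantization module (see Figure 4). We introduce a radial basis function (RBF) around the center $\hat{p}_m$ of a given bin $m$. Instead of binary occupancy counters, we keep a “soft” count of the points inside the bin, weighted by the RBF. (We note that the issue of back-propagation cannot be resolved simply by computing real-value densities in Equation 2.) Further, we allow any given bin $m$ to be influenced by a local neighborhood $\mathcal{N}_m$ of close bins. We then modify the definition of $T$ accordingly. Let $P_m$ denote the set of points that fall into bin $m$,

$$P_m = \{\, p \in P \ \text{s.t.}\ m = \arg\min_{m'} \| p - \hat{p}_{m'} \|_2 \,\}.$$

We define $T(m, m')$ to denote the average RBF weight of points in bin $m'$ w.r.t. bin $m$ (more specifically, $\hat{p}_m$),

$$T(m, m') = \begin{cases} 0 & \text{if } |P_{m'}| = 0, \\ \dfrac{1}{|P_{m'}|} \sum_{p \in P_{m'}} \exp\left(-\dfrac{\| p - \hat{p}_m \|^2}{\sigma^2}\right) & \text{if } |P_{m'}| > 0. \end{cases} \tag{3}$$

The final value of the tensor $T$ at bin $m$ is the combination of soft occupation from its own and neighboring bins,

$$T(m) = T(m, m) + \frac{1}{|\mathcal{N}_m|} \sum_{m' \in \mathcal{N}_m} T(m, m'). \tag{4}$$

We note that, when $\sigma^2 \gg 0$ and $\mathcal{N}_m = \emptyset$, Equation 4 recovers Equation 2. Throughout this paper, we set the neighborhood $\mathcal{N}_m$ to the 26 neighboring bins (considering a 3x3x3 cube centered on the bin) and keep $\sigma^2$ fixed. The total number of bins $M$ follows [45].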

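Our soft quantization module is fully differentiable. The partial derivative $\partial \mathcal{L}_{det} / \partial T(m)$ directly affects the points in bin $m$ (i.e., $P_m$) and its neighboring bins and enables end-to-end training. For example, to pass the partial derivative to a point $p$ in bin $m'$, we compute $\frac{\partial \mathcal{L}_{det}}{\partial T(m)} \times \frac{\partial T(m)}{\partial T(m, m')} \times \nabla_p T(m, m')$. More importantly, even when bin $m$ mistakenly contains no point, $\partial \mathcal{L}_{det} / \partial T(m) < 0$ allows it to drag points from other bins, say bin $m'$, to be closer to $\hat{p}_m$, enabling corrections of the depth error more effectively.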
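The soft quantization described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes a regular grid with unit bins starting at the origin, a Gaussian RBF of the form exp(-d^2 / sigma^2), and an illustrative bandwidth `sigma2=0.5`; the function name `soft_quantize` is hypothetical.

```python
import itertools
import math


def soft_quantize(points, grid_shape, bin_size=1.0, sigma2=0.5):
    """Soft occupancy T(m) on a regular grid (bin_size and sigma2 are assumed)."""
    # Group the points by the bin they fall into.
    bins = {}
    for p in points:
        idx = tuple(int(math.floor(c / bin_size)) for c in p)
        bins.setdefault(idx, []).append(p)

    def center(idx):
        return tuple((i + 0.5) * bin_size for i in idx)

    def pair_weight(m, m_prime):
        # Average RBF weight of the points in bin m_prime w.r.t. the center of m.
        pts = bins.get(m_prime, [])
        if not pts:
            return 0.0
        c = center(m)
        return sum(math.exp(-sum((a - b) ** 2 for a, b in zip(p, c)) / sigma2)
                   for p in pts) / len(pts)

    # The 26 neighbors of a bin in a 3x3x3 cube (the bin itself is excluded).
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=3) if any(o)]
    tensor = {}
    for m in itertools.product(*(range(n) for n in grid_shape)):
        neighbors = [tuple(i + d for i, d in zip(m, o)) for o in offsets]
        tensor[m] = (pair_weight(m, m)
                     + sum(pair_weight(m, n) for n in neighbors) / len(neighbors))
    return tensor


# One point in bin (0, 1, 1); bin (1, 1, 1) is empty but still gets a soft value.
t_far = soft_quantize([(0.90, 1.5, 1.5)], grid_shape=(3, 3, 3))
t_near = soft_quantize([(0.95, 1.5, 1.5)], grid_shape=(3, 3, 3))
print(t_far[(1, 1, 1)], t_near[(1, 1, 1)])
```

In this toy case the value of the empty bin (1, 1, 1) changes smoothly as the point moves towards it, so a gradient on that bin can move a point that lies in a neighboring bin. With hard quantization the same move would leave the tensor unchanged.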

3.2 Subsampling

As an alternative to voxelization, some LiDAR-based object detectors take the raw 3D points as input (either as a whole [35] or by grouping them according to metric locations [17, 43, 49] or potential object locations [31]). For these, we can directly use the 3D point clouds obtained by Equation 1; however, some subsampling is required. Different from voxelization, subsampling is far more amenable to end-to-end training: the points that are filtered out can simply be ignored during the backward pass; the points that are kept are left untouched. First, we remove all 3D points higher than the normal heights that LiDAR signals can cover, such as pixels of the sky. Further, we may sparsify the remaining points by subsampling. This second step is optional but suggested in [47] because depth maps generate significantly more points than LiDAR: on average 300,000 points are in the pseudo-LiDAR signal but 18,000 points are in the LiDAR signal (in the frontal view of the car). Although denser representations can be advantageous in terms of accuracy, they do slow down the object detection network. We apply an angular-based sparsifying method. We define multiple bins in 3D by discretizing the spherical coordinates $(r, \theta, \phi)$. Specifically, we discretize $\theta$ (polar angle) and $\phi$ (azimuthal angle) to mimic the LiDAR beams. We then keep a single 3D point from those points whose spherical coordinates fall into the same bin. The resulting point cloud therefore mimics true LiDAR points.
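The angular sparsification can be sketched as below. Several details are assumptions made for illustration: the z axis points up, both angular steps are 0.4 degrees, and the first point encountered in each angular bin is the one that is kept. The function name `sparsify` is hypothetical.

```python
import math


def sparsify(points, theta_step_deg=0.4, phi_step_deg=0.4):
    """Keep one point per (polar, azimuth) bin; return the kept input indices."""
    kept = {}
    for index, (x, y, z) in enumerate(points):
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # the origin has no well-defined direction
        cos_theta = max(-1.0, min(1.0, z / r))
        theta = math.degrees(math.acos(cos_theta))  # polar angle from the +z axis
        phi = math.degrees(math.atan2(y, x))        # azimuthal angle in the x-y plane
        key = (math.floor(theta / theta_step_deg), math.floor(phi / phi_step_deg))
        kept.setdefault(key, index)                 # first point in a bin wins (assumption)
    return sorted(kept.values())


cloud = [(10.0, 0.0, 0.0),   # kept
         (20.0, 0.01, 0.0),  # nearly the same direction as the first point: dropped
         (10.0, 5.0, 0.0),   # different azimuth: kept
         (10.0, 0.0, 2.0)]   # different polar angle: kept
kept_indices = sparsify(cloud)
print(kept_indices)
```

Returning indices rather than coordinates records which points survived the forward pass, which is the bookkeeping needed to route gradients back to the right pixels.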

In terms of back-propagation, since these 3D object detectors directly process the 3D coordinates $(x, y, z)$ of a point, we can obtain the gradients of the final detection loss $\mathcal{L}_{det}$ with respect to the coordinates; i.e., $\left(\frac{\partial \mathcal{L}_{det}}{\partial x}, \frac{\partial \mathcal{L}_{det}}{\partial y}, \frac{\partial \mathcal{L}_{det}}{\partial z}\right)$. As long as we properly record which points are subsampled in the forward pass or how they are grouped, back-propagating the gradients from the object detector to the depth estimates $Z$ (at sparse pixel locations) can be straightforward. Here, we leverage the fact that Equation 1 is differentiable with respect to $z$. However, due to the high sparsity of gradient information in $\nabla_Z \mathcal{L}_{det}$, we found that the initial depth loss used to train a conventional depth estimator is required to jointly optimize the depth estimator.
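The chain rule through the projection can be written out explicitly. The sketch below assumes the pinhole projection x = (u - c_u) z / f_u, y = (v - c_v) z / f_v with z equal to the predicted depth, so that dL/dZ = dL/dx (u - c_u) / f_u + dL/dy (v - c_v) / f_v + dL/dz. The function name and the numbers are hypothetical.

```python
def depth_gradients(point_grads, pixels, f_u, f_v, c_u, c_v):
    """Map gradients on kept 3D points back to their source pixels.

    point_grads[i] = (dL/dx, dL/dy, dL/dz) for the point generated from
    pixels[i] = (u, v). Pixels removed by subsampling do not appear in the
    result, i.e., they receive zero gradient from the detection loss.
    """
    grads = {}
    for (g_x, g_y, g_z), (u, v) in zip(point_grads, pixels):
        grads[(u, v)] = g_x * (u - c_u) / f_u + g_y * (v - c_v) / f_v + g_z
    return grads


grads = depth_gradients(
    point_grads=[(1.0, 0.0, 0.5), (0.0, 2.0, -1.0)],
    pixels=[(3, 1), (1, 2)],
    f_u=2.0, f_v=2.0, c_u=1.0, c_v=1.0,
)
print(grads)
```

In an automatic differentiation framework this bookkeeping happens implicitly when the kept points are selected by indexing; the explicit version only shows why the resulting depth gradient is sparse.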

This subsection, together with subsection 3.1, presents a general end-to-end framework applicable to various object detectors. We do not claim this subsection as a technical contribution, but it provides details that make end-to-end training for point-cloud-based detectors successful.

3.3 Loss

To learn the pseudo-LiDAR framework end-to-end, we replace Equation 2 by Equation 4 for object detectors that take 3D or 4D tensors as input. For object detectors that take raw points as input, no specific modification is needed.

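We learn the object detector and the depth estimator jointly with the following loss,

$$\mathcal{L} = \lambda_{det} \mathcal{L}_{det} + \lambda_{depth} \mathcal{L}_{depth}, \tag{5}$$

where $\mathcal{L}_{det}$ is the loss from 3D object detection and $\mathcal{L}_{depth}$ is the loss of depth estimation. $\lambda_{det}$ and $\lambda_{depth}$ are the corresponding coefficients. The detection loss is the combination of classification loss and regression loss,

$$\mathcal{L}_{det} = \mathcal{L}_{cls} + \mathcal{L}_{reg}, \tag{6}$$

in which the classification loss $\mathcal{L}_{cls}$ aims to assign the correct class (e.g., car) to the detected bounding box; the regression loss $\mathcal{L}_{reg}$ aims to refine the size, center, and rotation of the box.

Let $Z$ be the predicted depth and $Z^*$ be the ground truth. We apply the following depth estimation loss,

$$\mathcal{L}_{depth} = \frac{1}{|\mathcal{A}|} \sum_{(u, v) \in \mathcal{A}} \ell\left(Z(u, v) - Z^*(u, v)\right), \tag{7}$$

where $\mathcal{A}$ is the set of pixels that have ground truth depth. $\ell(x)$ is the smooth L1 loss defined as

$$\ell(x) = \begin{cases} 0.5\, x^2 & \text{if } |x| < 1, \\ |x| - 0.5 & \text{otherwise.} \end{cases} \tag{8}$$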
We find that the depth loss is important as the loss from object detection may only influence parts of the pixels (due to quantization or subsampling). After all, our hope is to make the depth estimates around (far-away) objects more accurate, but not to sacrifice the accuracy of depths on the background and nearby objects.

Note: The depth loss can be seen as a regularizer to keep the output of the depth estimator physically meaningful. 3D object detectors are designed with an inductive bias: the input is an accurate 3D point cloud. However, with the large capacity of neural networks, training the depth estimator and object detector end-to-end with the detection loss alone can lead to arbitrary representations between them that break the inductive bias but achieve a lower training loss. The resulting model thus will have a much worse testing loss than the one trained together with the depth loss.
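The joint objective can be sketched as follows: a smooth L1 depth loss averaged over the pixels that have ground truth, added to a detection loss with two coefficients. The detection loss value and the coefficients below are placeholders chosen for illustration, not values reported in this paper.

```python
def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def depth_loss(predicted, ground_truth):
    """Average smooth L1 error over pixels with ground truth depth.

    Both arguments map (u, v) -> depth in meters. The ground truth is sparse,
    so only its pixels contribute to the loss.
    """
    valid = [px for px in ground_truth if px in predicted]
    return sum(smooth_l1(predicted[px] - ground_truth[px]) for px in valid) / len(valid)


def joint_loss(det_loss, dep_loss, lambda_det, lambda_depth):
    """Weighted sum of the detection and depth losses."""
    return lambda_det * det_loss + lambda_depth * dep_loss


predicted = {(0, 0): 10.5, (1, 0): 22.0, (2, 0): 5.0}
ground_truth = {(0, 0): 10.0, (1, 0): 20.0}  # pixel (2, 0) has no LiDAR return
dep = depth_loss(predicted, ground_truth)
total = joint_loss(det_loss=2.0, dep_loss=dep, lambda_det=0.1, lambda_depth=1.0)
print(dep, total)
```

Pixel (2, 0) has no ground truth and therefore contributes nothing to the depth loss, which mirrors the sparsity of LiDAR supervision.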

4 Experiments

4.1 Setup

| | Depth Loss | P-RCNN Loss | PIXOR Loss |
|---|---|---|---|
| Ratio | 3% | 4% | 70% |
| Sum | 0.1 | 10 | 1 |

Table 1: Statistics of the gradients of different losses on the predicted depth map. Ratio: the percentage of pixels with gradients.
| Detection algo | Input | IoU = 0.5, Easy | IoU = 0.5, Moderate | IoU = 0.5, Hard | IoU = 0.7, Easy | IoU = 0.7, Moderate | IoU = 0.7, Hard |
|---|---|---|---|---|---|---|---|
| 3DOP [5] | S | 55.0 / 46.0 | 41.3 / 34.6 | 34.6 / 30.1 | 12.6 / 6.6 | 9.5 / 5.1 | 7.6 / 4.1 |
| MLF-stereo [41] | S | - | 53.7 / 47.4 | - | - | 19.5 / 9.8 | - |
| S-RCNN [21] | S | 87.1 / 85.8 | 74.1 / 66.3 | 58.9 / 57.2 | 68.5 / 54.1 | 48.3 / 36.7 | 41.5 / 31.1 |
| OC-Stereo [30] | S | 90.0 / 89.7 | 80.6 / 80.0 | 71.1 / 70.3 | 77.7 / 64.1 | 66.0 / 48.3 | 51.2 / 40.4 |
| PL: P-RCNN [37] | S | 88.4 / 88.0 | 76.6 / 73.7 | 69.0 / 67.8 | 73.4 / 62.3 | 56.0 / 44.9 | 52.7 / 41.6 |
| PL++: P-RCNN [47] | S | 89.8 / 89.7 | 83.8 / 78.6 | 77.5 / 75.1 | 82.0 / 67.9 | 64.0 / 50.1 | 57.3 / 45.3 |
| E2E-PL: P-RCNN | S | 90.5 / 90.4 | 84.4 / 79.2 | 78.4 / 75.9 | 82.7 / 71.1 | 65.7 / 51.7 | 58.4 / 46.7 |
| PL: PIXOR [37] | S | 89.0 / - | 75.2 / - | 67.3 / - | 73.9 / - | 54.0 / - | 46.9 / - |
| PL++: PIXOR [47] | S | 89.9 / - | 78.4 / - | 74.7 / - | 79.7 / - | 61.1 / - | 54.5 / - |
| E2E-PL: PIXOR | S | 94.6 / - | 84.8 / - | 77.1 / - | 80.4 / - | 64.3 / - | 56.7 / - |
| P-RCNN [35] | L | 97.3 / 97.3 | 89.9 / 89.8 | 89.4 / 89.3 | 90.2 / 89.2 | 87.9 / 78.9 | 85.5 / 77.9 |
| PIXOR [45] | L + M | 94.2 / - | 86.7 / - | 86.1 / - | 85.2 / - | 81.2 / - | 76.1 / - |

Table 2: 3D object detection results on the KITTI validation set. We report AP$_{BEV}$ / AP$_{3D}$ (in %) of the car category, corresponding to average precision of the bird’s-eye view and 3D object detection. We arrange methods according to the input signals: S for stereo images, L for 64-beam LiDAR, M for monocular images. PL stands for pseudo-LiDAR. The E2E-PL rows are the results of our end-to-end pseudo-LiDAR; the last two rows are methods with 64-beam LiDAR.

Dataset. We evaluate our end-to-end (E2E-PL) approach on the KITTI object detection benchmark [11, 12], which contains 3,712, 3,769 and 7,518 images for training, validation, and testing. KITTI provides for each image the corresponding 64-beam Velodyne LiDAR point cloud, right image for stereo, and camera calibration matrices.

Metric. We focus on 3D and bird’s-eye-view (BEV) object detection and report the results on the validation set. We focus on the “car” category, following [7, 37, 42]. We report the average precision (AP) with the IoU thresholds at 0.5 and 0.7. We denote AP for the 3D and BEV tasks by AP$_{3D}$ and AP$_{BEV}$. KITTI defines the easy, moderate, and hard settings, in which objects with 2D box heights smaller than or occlusion/truncation levels larger than certain thresholds are disregarded. The hard (moderate) setting contains all the objects in the moderate and easy (easy) settings.

Baselines. We compare to seven stereo-based 3D object detectors: pseudo-LiDAR (PL) [37], pseudo-LiDAR ++ (PL++) [47], 3DOP [5], S-RCNN [21], RT3DStereo [15], OC-Stereo [30], and MLF-stereo [41]. For pseudo-LiDAR ++, we only compare to its image-only method.

4.2 Details of our approach

Our end-to-end pipeline has two parts: stereo depth estimation and 3D object detection. In training, we first learn only the stereo depth estimation network to get a depth estimation prior, and then we fix the depth network and use its output to train the 3D object detector from scratch. In the end, we jointly train the two parts with balanced loss weights.

Depth estimation. We apply SDN [47] as the backbone to estimate a dense depth map $Z$. We follow [47] to pre-train SDN on the synthetic Scene Flow dataset [25] and fine-tune it on the 3,712 training images of KITTI. We obtain the depth ground truth by projecting the corresponding LiDAR points onto images.

Object detection. We apply two LiDAR-based algorithms: PIXOR [45] (voxel-based, with quantization) and PointRCNN (P-RCNN) [35] (point-cloud-based). We use the released code of P-RCNN. We obtain the code of PIXOR from the authors of [47], which has a slight modification to include visual information (denoted as PIXOR).

Joint training. We set the depth estimation and object detection networks trainable, and allow the gradients of the detection loss to back-propagate to the depth network. We study the gradients of the detection and depth losses w.r.t. the predicted depth map $Z$ to determine the hyper-parameters $\lambda_{det}$ and $\lambda_{depth}$. For each loss, we calculate the percentage of pixels on the entire depth map that have gradients. We further collect the mean and sum of the gradients on the depth map during training, as shown in Table 1. The depth loss only influences 3% of the depth map because the ground truth obtained from LiDAR is sparse. The P-RCNN loss, due to subsampling on the dense PL point cloud, can only influence 4% of the depth map. For the PIXOR loss, our soft quantization module can back-propagate the gradients to 70% of the pixels of the depth map. In our experiments, we find that balancing the sums of gradients between the detection and depth losses is crucial in making joint training stable. We carefully set $\lambda_{det}$ and $\lambda_{depth}$, separately for P-RCNN and for PIXOR, to make sure that the sums are on the same scale at the beginning of training.
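As a worked example of this balancing rule, the sketch below takes the gradient sums reported in Table 1 and computes the detection weight that would put both gradient sums on the same scale. It assumes the depth weight is fixed to 1; the resulting numbers illustrate the rule and are not claimed to be the coefficients used in the experiments.

```python
def balanced_detection_weight(depth_grad_sum, det_grad_sum, lambda_depth=1.0):
    """Detection weight that equalizes the two weighted gradient sums.

    Solves lambda_det * det_grad_sum == lambda_depth * depth_grad_sum.
    """
    return lambda_depth * depth_grad_sum / det_grad_sum


# Gradient sums on the predicted depth map, taken from Table 1.
table1_sums = {"depth": 0.1, "P-RCNN": 10.0, "PIXOR": 1.0}

weights = {
    name: balanced_detection_weight(table1_sums["depth"], table1_sums[name])
    for name in ("P-RCNN", "PIXOR")
}
for name, weight in weights.items():
    print(f"{name}: lambda_det / lambda_depth = {weight:.2f}")
```

Under this assumption the P-RCNN loss needs a weight roughly ten times smaller than the PIXOR loss, because its gradient sum on the depth map is ten times larger.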

4.3 Results

On KITTI validation set. The main results on the KITTI validation set are summarized in Table 2. It can be seen that 1) the proposed E2E-PL framework consistently improves the object detection performance on both the model using subsampled point inputs (P-RCNN) and that using quantized inputs (PIXOR). 2) While the quantization-based model (PIXOR) performs worse than the point-cloud-based model (P-RCNN) when they are trained in a non end-to-end manner, end-to-end training greatly reduces the performance gap between these two types of models, especially for IoU at 0.5: the gap between these two models on AP$_{BEV}$ in moderate cases is reduced from 5.4% to -0.4%. As shown in Table 1, on depth maps, the gradients flowing from the loss of the PIXOR detector are much denser than those from the loss of the P-RCNN detector, suggesting that more gradient information is beneficial. 3) For IoU at 0.5 under easy and moderate cases, E2E-PL: PIXOR performs on a par with PIXOR using LiDAR.

| Method | Easy | Moderate | Hard |
|---|---|---|---|
| S-RCNN [21] | 61.9 / 47.6 | 41.3 / 30.2 | 33.4 / 23.7 |
| RT3DStereo [15] | 58.8 / 29.9 | 46.8 / 23.3 | 38.4 / 19.0 |
| OC-Stereo [30] | 68.9 / 55.2 | 51.5 / 37.6 | 43.0 / 30.3 |
| PL [37] | 67.3 / 54.5 | 45.0 / 34.1 | 38.4 / 28.3 |
| PL++: P-RCNN [47] | 78.3 / 61.1 | 58.0 / 42.4 | 51.3 / 37.0 |
| E2E-PL: P-RCNN | 79.6 / 64.8 | 58.8 / 43.9 | 52.1 / 38.1 |
| PL++: PIXOR [47] | 70.7 / - | 48.3 / - | 41.0 / - |
| E2E-PL: PIXOR | 71.9 / - | 51.7 / - | 43.3 / - |

Table 3: 3D object (car) detection results on the KITTI test set. We compare E2E-PL with existing results retrieved from the KITTI leaderboard, and report AP$_{BEV}$ / AP$_{3D}$ at IoU=0.7.

On KITTI test set. Table 3 shows the results on KITTI test set. We observe the same consistent performance boost by applying our E2E-PL framework on each detector type. At the time of submission, E2E-PL: P-RCNN achieves the state-of-the-art results over image-based models.

| Row | IoU = 0.5, Easy | IoU = 0.5, Moderate | IoU = 0.5, Hard | IoU = 0.7, Easy | IoU = 0.7, Moderate | IoU = 0.7, Hard |
|---|---|---|---|---|---|---|
| 1 | 89.8/89.7 | 83.8/78.6 | 77.5/75.1 | 82.0/67.9 | 64.0/50.1 | 57.3/45.3 |
| 2 | 89.7/89.5 | 83.6/78.5 | 77.4/74.9 | 82.2/67.8 | 64.5/50.5 | 57.4/45.4 |
| 3 | 89.3/89.0 | 83.7/78.3 | 77.5/75.0 | 81.1/66.5 | 63.9/50.0 | 57.1/45.2 |
| 4 | 89.6/89.4 | 83.9/78.2 | 77.6/75.2 | 81.7/68.2 | 63.4/50.4 | 57.2/45.9 |
| 5 | 90.2/90.1 | 84.2/78.8 | 78.0/75.7 | 81.9/69.1 | 64.0/51.2 | 57.7/46.1 |
| 6 | 89.3/89.1 | 83.9/78.5 | 77.7/75.2 | 81.3/69.4 | 64.7/50.7 | 57.7/45.7 |
| 7 | 89.8/89.7 | 84.2/79.1 | 78.2/76.5 | 84.2/69.9 | 65.5/51.0 | 58.1/46.2 |
| 8 | 90.5/90.4 | 84.4/79.2 | 78.4/75.9 | 82.7/71.1 | 65.7/51.7 | 58.4/46.7 |

Table 4: Ablation studies on the point-cloud-based pipeline with P-RCNN. We report AP$_{BEV}$ / AP$_{3D}$ (in %) of the car category, corresponding to average precision of the bird’s-eye view and 3D detection. We divide our pipeline with P-RCNN into three sub networks: Depth, RPN and RCNN. Each row sets a different combination of sub networks trainable and uses the corresponding losses in joint training. Following Section 4.4, row 1 is the baseline, rows 2 to 4 each train a single sub network, row 5 trains RPN and RCNN, row 6 trains Depth and RPN, row 7 trains the remaining pair (Depth and RCNN), and row 8 trains all three. We note that the gradients of the later sub network would also back-propagate to the previous sub network. For example, if we choose Depth and RPN, the gradients of RPN would also be back-propagated to the Depth network.

4.4 Ablation studies

We conduct ablation studies on the point-cloud-based pipeline with P-RCNN in Table 4. We divide the pipeline into three sub networks: depth estimation network (Depth), region proposal network (RPN) and Regional-CNN (RCNN). We try various combinations of the sub networks (and their corresponding losses) by setting them to trainable in our final joint training stage. The first row serves as the baseline. In rows two to four, the results indicate that simply training each sub network independently with more iterations does not improve the accuracy. In row five, jointly training RPN and RCNN (i.e., P-RCNN) does not yield a significant improvement, because the point cloud from Depth is not updated and remains noisy. In row six we jointly train Depth with RPN, but the result does not improve much either. We suspect that the loss on RPN is insufficient to guide the refinement of depth estimation. By combining the three sub networks together and using the RCNN, RPN, and Depth losses to refine the three sub networks, we get the best results (except for two cases).

For the quantization-based pipeline with soft quantization, we also conduct similar ablation studies, as shown in Table 5. Since PIXOR is a one-stage detector, we divide the pipeline into two components: Depth and Detector. Similar to the point-cloud-based pipeline (Table 4), simply training each component independently with more iterations does not improve the accuracy. However, when we jointly train both components, we see a significant improvement (last row). This demonstrates the effectiveness of our soft quantization module, which can back-propagate the Detector loss to influence 70% of the pixels on the predicted depth map.

More interestingly, applying the soft quantization module alone without joint training (rows one and two) does not improve over, or is even outperformed by, PL++: PIXOR with hard quantization (whose result is 89.9/79.7, 78.4/61.1, 74.7/54.5, from Table 2). But with joint end-to-end training enabled by soft quantization, our E2E-PL: PIXOR consistently outperforms the separately trained PL++: PIXOR.

| Row | Easy | Moderate | Hard |
|---|---|---|---|
| 1 | 89.8 / 77.0 | 78.3 / 57.7 | 69.5 / 53.8 |
| 2 | 89.9 / 76.9 | 78.7 / 58.0 | 69.7 / 53.9 |
| 3 | 90.2 / 78.1 | 79.2 / 58.9 | 69.6 / 54.2 |
| 4 | 94.6 / 80.4 | 84.8 / 64.3 | 77.1 / 56.7 |

Table 5: Ablation studies on the quantization-based pipeline with PIXOR. We report AP$_{BEV}$ at IoU = 0.5 / 0.7 (in %) of the car category. We divide our pipeline into two sub networks: Depth and Detector. Each row sets a different combination of sub networks trainable and uses the corresponding losses in joint training; the last row jointly trains both sub networks, while the other rows train at most one of them.

Figure 5: Qualitative results on depth estimation. PL++ (image-only) has many misestimated pixels on the top of the car. By applying end-to-end training, the depth estimation around the cars is improved and the corresponding pseudo-LiDAR point cloud has much better quality. (Please zoom in for a better view.)

Figure 6: Qualitative results from the bird’s-eye view. The red bounding boxes are the ground truth and the green bounding boxes are the detection results. PL++ (image-only) misses many far-away cars and has poor bounding box localization. By applying end-to-end training, we get much more accurate predictions (first and second columns) and reduce the false positive predictions (the third column).

4.5 Qualitative results

We show the qualitative results of the point-cloud-based and quantization-based pipelines.

Depth visualization. We visualize the predicted depth maps and the corresponding point clouds converted from the depth maps by the quantization-based pipeline in Figure 5. For the original depth network shown in the first row, since the ground truth is very sparse, the depth prediction is not accurate and we observe a clear misestimation (artifact) on the top of the car: there is a large number of misestimated 3D points on the top of the car. By applying end-to-end joint training, the detection loss on cars forces the depth network to reduce the misestimation and guides it to generate more accurate point clouds. As shown in the second row, both the depth prediction quality and the point cloud quality are greatly improved. The last row shows the input image to the depth network, the zoomed-in patch of a specific car, and its corresponding LiDAR ground truth point cloud.

Detection visualization. We also show qualitative comparisons of the detection results in Figure 6. In the BEV, the ground truth bounding boxes are marked in red and the predictions are marked in green. In the first example (column), pseudo-LiDAR++ (PL++) misses one car in the middle and localizes the far-away cars poorly. Our E2E-PL detects all the cars and gives an accurate prediction for the farthest car. In the second example, the result is consistent for the far-away cars: E2E-PL detects more cars and localizes them more accurately. Even for the nearby cars, E2E-PL gives better results. The third example shows a case where there is only one ground truth car. Our E2E-PL does not produce any false positive predictions.

4.6 Other results

Speed. The inference time of our method is similar to that of pseudo-LiDAR and is determined by the stereo and detection networks. The soft quantization module (with 26 neighboring bins) only computes the RBF weights between a point and 27 bins (roughly points per scene), followed by grouping the weights of points into bins. The complexity is O(N), where N is the number of points, and both steps can be parallelized. Using a single GPU with a PyTorch implementation, E2E-PL: P-RCNN takes 0.49s/frame and E2E-PL: PIXOR takes 0.55s/frame, within which SDN (the stereo network) takes 0.39s/frame and deserves further study to speed it up (e.g., by code optimization, network pruning, etc.).

Others. See the Supplementary Material for more details and results.
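The cost of the soft quantization step described under Speed can be illustrated with a small sketch. It is a simplified, assumption-laden version and not the paper's exact module: the function name `soft_quantize`, the cubic bins of side `bin_size`, the bandwidth `sigma`, and the absence of any per-bin normalization are choices made here for illustration only.

```python
import math
from collections import defaultdict
from itertools import product


def soft_quantize(points, bin_size=1.0, sigma=1.0):
    """Accumulate RBF weights of each point into its own bin and its 26 neighbors.

    Returns a dict mapping an integer bin index (i, j, k) to the summed weight.
    Every point triggers exactly 27 weight evaluations, so the total work
    grows linearly with len(points).
    """
    occupancy = defaultdict(float)
    offsets = list(product((-1, 0, 1), repeat=3))  # 27 offsets, including (0, 0, 0)
    for p in points:
        home = tuple(math.floor(c / bin_size) for c in p)  # bin containing the point
        for off in offsets:
            idx = tuple(h + o for h, o in zip(home, off))
            center = tuple((i + 0.5) * bin_size for i in idx)
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, center))
            # Step 1: RBF weight between the point and this bin center.
            # Step 2: group (sum) the weight into the bin.
            occupancy[idx] += math.exp(-sq_dist / sigma ** 2)
    return dict(occupancy)
```

The weight computation is independent across points, and the grouping is a sum per bin, which is why both steps lend themselves to parallel execution on a GPU.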

5 Conclusion and Discussion

In this paper, we introduced an end-to-end training framework for pseudo-LiDAR [37, 47]. Our proposed framework can work for 3D object detectors taking either the direct point cloud inputs or quantized structured inputs. The resulting models set a new state of the art in image-based 3D object detection and further narrow the remaining accuracy gap between stereo and LiDAR-based sensors. Although it will probably always be beneficial to include active sensors like LiDARs in addition to passive cameras [47], it seems possible that the benefit may soon be too small to justify large expenses. Considering the KITTI benchmark, it is worth noting that the stereo pictures are of relatively low resolution and only a few images contain (labeled) far-away objects. It is quite plausible that higher resolution images with a higher ratio of far-away cars would result in further detection improvements, especially in the hard (far-away and heavily occluded) category.


This research is supported by grants from the National Science Foundation NSF (III-1618134, III-1526012, IIS-1149882, IIS-1724282, and TRIPODS-1740822), the Office of Naval Research DOD (N00014-17-1-2175), the Bill and Melinda Gates Foundation, and the Cornell Center for Materials Research with funding from the NSF MRSEC program (DMR-1719875). We are thankful for generous support by Zillow and SAP America Inc.


  • [1] Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teulière, and Thierry Chateau. Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In CVPR, 2017.
  • [2] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In CVPR, 2018.
  • [3] Ming-Fang Chang, John W Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3d tracking and forecasting with rich maps. In CVPR, 2019.
  • [4] Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, and Raquel Urtasun. Monocular 3d object detection for autonomous driving. In CVPR, 2016.
  • [5] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3d object proposals for accurate object class detection. In NIPS, 2015.
  • [6] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3d object proposals using stereo imagery for accurate object class detection. TPAMI.
  • [7] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017.
  • [8] Yilun Chen, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Fast point r-cnn. In ICCV, 2019.
  • [9] Xinxin Du, Marcelo H Ang Jr, Sertac Karaman, and Daniela Rus. A general pipeline for 3d detection of vehicles. In ICRA, 2018.
  • [10] Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In ICRA, 2017.
  • [11] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • [12] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.
  • [13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
  • [14] Jinyong Jeong, Younggun Cho, Young-Sik Shin, Hyunchul Roh, and Ayoung Kim. Complex urban dataset with multi-level sensors from highly diverse urban environments. The International Journal of Robotics Research, 38(6):642–657, 2019.
  • [15] Hendrik Königshof, Niels Ole Salscheider, and Christoph Stiller. Realtime 3d object detection for automated driving using stereo vision and semantic information. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), 2019.
  • [16] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven Waslander. Joint 3d proposal generation and object detection from view aggregation. In IROS, 2018.
  • [17] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
  • [18] Bo Li. 3d fully convolutional network for vehicle detection in point cloud. In IROS, 2017.
  • [19] Buyu Li, Wanli Ouyang, Lu Sheng, Xingyu Zeng, and Xiaogang Wang. Gs3d: An efficient 3d object detection framework for autonomous driving. In CVPR, 2019.
  • [20] Bo Li, Tianlei Zhang, and Tian Xia. Vehicle detection from 3d lidar using fully convolutional network. In Robotics: Science and Systems, 2016.
  • [21] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo r-cnn based 3d object detection for autonomous driving. In CVPR, 2019.
  • [22] Ming Liang, Bin Yang, Yun Chen, Rui Hu, and Raquel Urtasun. Multi-task multi-sensor fusion for 3d object detection. In CVPR, 2019.
  • [23] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In ECCV, 2018.
  • [24] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan, and Serge J Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.
  • [25] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
  • [26] Gregory P Meyer, Jake Charland, Darshan Hegde, Ankit Laddha, and Carlos Vallespi-Gonzalez. Sensor fusion for joint 3d object detection and semantic segmentation. In CVPRW, 2019.
  • [27] Gregory P Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, and Carl K Wellington. Lasernet: An efficient probabilistic 3d object detector for autonomous driving. In CVPR, 2019.
  • [28] Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Košecká. 3d bounding box estimation using deep learning and geometry. In CVPR, 2017.
  • [29] Cuong Cao Pham and Jae Wook Jeon. Robust object proposals re-ranking for object detection in autonomous driving using convolutional neural networks. Signal Processing: Image Communication, 53:110–122, 2017.
  • [30] Alex D Pon, Jason Ku, Chengyao Li, and Steven L Waslander. Object-centric stereo matching for 3d object detection. In ICRA, 2020.
  • [31] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In CVPR, 2018.
  • [32] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
  • [33] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
  • [34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [35] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR, 2019.
  • [36] Shaoshuai Shi, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. TPAMI, 2020.
  • [37] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q. Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In CVPR, 2019.
  • [38] Yan Wang, Xiangyu Chen, Yurong You, Erran Li Li, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, and Wei-Lun Chao. Train in germany, test in the usa: Making 3d object detectors generalize. In CVPR, 2020.
  • [39] Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Data-driven 3d voxel patterns for object category recognition. In CVPR, 2015.
  • [40] Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Subcategory-aware convolutional neural networks for object proposals and detection. In WACV, 2017.
  • [41] Bin Xu and Zhenzhong Chen. Multi-level fusion based 3d object detection from monocular images. In CVPR, 2018.
  • [42] Danfei Xu, Dragomir Anguelov, and Ashesh Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In CVPR, 2018.
  • [43] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
  • [44] Bin Yang, Ming Liang, and Raquel Urtasun. Hdnet: Exploiting hd maps for 3d object detection. In CoRL, 2018.
  • [45] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In CVPR, 2018.
  • [46] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Std: Sparse-to-dense 3d object detector for point cloud. In ICCV, 2019.
  • [47] Yurong You, Yan Wang, Wei-Lun Chao, Divyansh Garg, Geoff Pleiss, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. In ICLR, 2020.
  • [48] Z. Xu, W. Zhang, X. Ye, X. Tan, W. Yang, S. Wen, E. Ding, A. Meng, and L. Huang. Zoomnet: Part-aware adaptive zooming neural network for 3d object detection. In AAAI, 2020.
  • [49] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In CVPR, 2018.

Appendix S1 Results on Pedestrians and Cyclists

In addition to 3D object detection on the Car category, in Table S6 we show the results on the Pedestrian and Cyclist categories on the KITTI object detection validation set [11, 12]. To be consistent with the main paper, we apply P-RCNN [35] as the object detector. Our approach (E2E-PL) outperforms the baseline without end-to-end training (PL++) [47] by a notable margin for image-based 3D detection.

Category    Model   Easy         Moderate     Hard
Pedestrian  PL++    31.9 / 26.5  25.2 / 21.3  21.0 / 18.1
Pedestrian  E2E-PL  35.7 / 32.3  27.8 / 24.9  23.4 / 21.5
Cyclist     PL++    36.7 / 33.5  23.9 / 22.5  22.7 / 20.8
Cyclist     E2E-PL  42.8 / 38.4  26.2 / 24.1  24.5 / 22.7
Table S6: Results on pedestrians and cyclists (KITTI validation set). We report AP_BEV / AP_3D (in %) of the two categories at IoU=0.5, following existing works [35, 37]. PL++ denotes the pseudo-LiDAR++ pipeline with images only (i.e., SDN alone) [37]. Both approaches use P-RCNN [35] as the object detector.

Appendix S2 Evaluation at Different Depth Ranges

We analyze 3D object detection of the Car category for ground truths at different depth ranges (i.e., 0-30 or 30-70 meters). We report results with the point-cloud-based pipeline in Table S7 and the quantization-based pipeline in Table S8. E2E-PL achieves better performance at both depth ranges (except for AP_3D at 30-70 meters in the moderate setting). Specifically, on AP_BEV, the relative gain of E2E-PL over the baseline becomes larger for the far-away range and the hard setting.
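The relative-gain comparison can be checked directly from the numbers in Table S7. The short sketch below uses the first value of each pair in that table (taken here to be the bird's-eye-view AP, an assumption based on the table layout); the helper name `relative_gain` and the dictionary layout are illustrative only.

```python
# First value of each AP pair (in %) copied from Table S7,
# stored as (PL++, E2E-PL) per depth range and difficulty.
AP_BEV = {
    "0-30": {"easy": (82.9, 86.2), "moderate": (76.8, 78.6), "hard": (67.9, 69.4)},
    "30-70": {"easy": (19.7, 23.8), "moderate": (29.5, 31.8), "hard": (27.5, 31.0)},
}


def relative_gain(baseline, ours):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Relative gain of E2E-PL over PL++ for every range and difficulty.
gains = {
    rng: {level: relative_gain(*pair) for level, pair in levels.items()}
    for rng, levels in AP_BEV.items()
}

for rng, levels in gains.items():
    for level, gain in levels.items():
        print(f"{rng} m, {level}: {gain:+.1f}% relative gain")
```

For every difficulty level, the relative gain computed this way is larger in the 30-70 meter range than in the 0-30 meter range.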

Range (m)  Model   Easy         Moderate     Hard         #Objs
0-30       PL++    82.9 / 68.7  76.8 / 64.1  67.9 / 55.7  7379
0-30       E2E-PL  86.2 / 72.7  78.6 / 66.5  69.4 / 57.7  7379
30-70      PL++    19.7 / 11.0  29.5 / 18.1  27.5 / 16.4  3583
30-70      E2E-PL  23.8 / 15.1  31.8 / 18.0  31.0 / 16.9  3583
Table S7: 3D object detection via the point-cloud-based pipeline with P-RCNN at different depth ranges. We report AP_BEV / AP_3D (in %) of the car category at IoU=0.7, using P-RCNN for detection. The last column shows the number of car objects in the KITTI object validation set within each range.
Range (m)  Model   Easy      Moderate  Hard      #Objs
0-30       PL++    81.4 / -  75.5 / -  65.8 / -  7379
0-30       E2E-PL  82.1 / -  76.4 / -  67.5 / -  7379
30-70      PL++    26.1 / -  23.9 / -  20.5 / -  3583
30-70      E2E-PL  26.8 / -  36.1 / -  31.7 / -  3583
Table S8: 3D object detection via the quantization-based pipeline with PIXOR at different depth ranges. The setup is the same as in Table S7, except that PIXOR does not have height prediction and therefore no AP_3D is reported.
Figure S7: Precision-recall curves on the KITTI test set. We compare E2E-PL with PL++ on the 3D object detection track and the bird’s-eye-view detection track. Panels: (a) PL++ on the 3D track (AP_3D); (b) E2E-PL on the 3D track (AP_3D); (c) PL++ on the BEV track (AP_BEV); (d) E2E-PL on the BEV track (AP_BEV).

Appendix S3 On KITTI Test Set

In Figure S7, we compare the precision-recall curves of our E2E-PL and pseudo-LiDAR++ (named Pseudo-LiDAR V2 on the leaderboard). On the 3D object detection track (first row of Figure S7), pseudo-LiDAR++ shows a notable drop in precision on easy cars even at low recall, meaning that pseudo-LiDAR++ produces many high-confidence false positive predictions. The same happens for moderate and hard cars. Our E2E-PL suppresses the false positive predictions, resulting in smoother precision-recall curves. On the bird’s-eye view detection track (second row of Figure S7), the precision of E2E-PL is over 97% within the recall interval 0.0 to 0.2, which is higher than the precision of pseudo-LiDAR++, indicating that E2E-PL has fewer false positives.
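The link between high-confidence false positives and a precision drop at low recall can be seen in a minimal precision-recall sketch. It assumes that each detection has already been matched against the ground truth and flagged as a true or false positive; the function name `precision_recall_curve` and the input format are hypothetical and do not reproduce the KITTI evaluation code.

```python
def precision_recall_curve(detections, num_gt):
    """Compute (recall, precision) points by sweeping the confidence threshold.

    detections: list of (confidence, is_true_positive) pairs, assumed to be
    already matched to ground truth.
    num_gt: total number of ground-truth objects.
    Returns one (recall, precision) point per detection, from the highest
    confidence to the lowest.
    """
    curve = []
    tp = fp = 0
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        curve.append((tp / num_gt, tp / (tp + fp)))
    return curve


# A single false positive with the highest confidence pulls precision
# down at the very start of the curve, i.e., at low recall.
curve = precision_recall_curve([(0.9, False), (0.8, True), (0.7, True)], num_gt=2)
```

In this toy example the curve starts at zero precision because the most confident detection is wrong, which is the pattern described above for a detector with many confident false positives.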

Appendix S4 Additional Qualitative Results

We show more qualitative depth comparisons in Figure S8. We use red bounding boxes to highlight the depth improvement in car-related areas. We also show detection comparisons in Figure S9, where our E2E-PL has fewer false positive and false negative predictions.

Figure S8: Qualitative comparison of depth estimation. We compare PL++ with E2E-PL. Panels: input; depth estimation of PL++; depth estimation of E2E-PL.
Figure S9: Qualitative comparison of detection results. We compare PL++ with E2E-PL. The red bounding boxes are ground truth and the green bounding boxes are predictions. Panels: (a)-(c) images; (d)-(f) PL++ detections; (g)-(i) E2E-PL detections.

Appendix S5 Gradient Visualization on Depth Maps

We also visualize the gradients of the detection loss with respect to the depth map to indicate the effectiveness of our E2E-PL pipeline, as illustrated in Figure S10. We use the JET colormap to indicate the relative absolute value of the gradients, where red indicates higher values and blue indicates lower values. The gradients from the detector concentrate heavily around cars.
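As a rough illustration of what "relative absolute value" could mean in such a visualization, the sketch below rescales the absolute gradient at each pixel by the largest absolute gradient in the map. The max-based normalization and the function name `normalized_abs_gradients` are assumptions; the text does not state how the values were scaled before applying the colormap.

```python
def normalized_abs_gradients(grad):
    """Map a 2D gradient array (list of rows) to relative magnitudes in [0, 1].

    Assumed normalization: divide each absolute value by the maximum
    absolute value in the map, so the largest gradient maps to 1.0.
    """
    abs_grad = [[abs(g) for g in row] for row in grad]
    peak = max((max(row) for row in abs_grad if row), default=0.0)
    if peak == 0.0:
        return abs_grad  # all-zero gradient: nothing to scale
    return [[g / peak for g in row] for row in abs_grad]
```

The resulting values can then be passed to any colormap that maps 0 to its low end (blue for JET) and 1 to its high end (red for JET).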

Figure S10: Visualization of absolute gradient values of the detection loss. We use the JET colormap to indicate the relative absolute values of the resulting gradients, where red indicates larger values and blue indicates smaller values. Panels: image; gradient from detector.

Appendix S6 Other results

S6.1 Depth estimation

We summarize the quantitative results of depth estimation (w/o or w/ end-to-end training) in Table S9. As the detection loss only provides semantic information to the foreground objects, which occupy merely of pixels (Figure 2), its improvement to the overall depth estimation is limited. But for pixels around the objects, we do see improvement at certain depth ranges. We hypothesize that the detection loss may not directly improve the metric depth, but will sharpen the object boundaries in 3D to facilitate object detection and localization.
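A per-range summary of this kind can be computed as in the following sketch. It assumes that the error is the absolute difference between predicted and ground-truth depth, that pixels are bucketed by their ground-truth depth, and that the standard deviation is the population form; none of these details are stated in the text, and the function name `depth_error_by_range` is hypothetical.

```python
import math


def depth_error_by_range(pred, gt, ranges=((0, 10), (10, 20), (20, 30))):
    """Mean and standard deviation of absolute depth error per ground-truth range.

    pred, gt: equal-length sequences of depths in meters for valid pixels.
    ranges: half-open intervals [lo, hi) in meters.
    Returns {(lo, hi): (mean_error, std)} for ranges that contain pixels.
    """
    stats = {}
    for lo, hi in ranges:
        errors = [abs(p - g) for p, g in zip(pred, gt) if lo <= g < hi]
        if not errors:
            continue  # no ground-truth pixels fall into this range
        mean = sum(errors) / len(errors)
        var = sum((e - mean) ** 2 for e in errors) / len(errors)
        stats[(lo, hi)] = (mean, math.sqrt(var))
    return stats
```

Restricting `pred` and `gt` to pixels around the detected objects would give the object-focused comparison discussed above rather than a whole-image average.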

Range (meters)  Model   Mean error  Std
0-10            PL++    0.728       2.485
0-10            E2E-PL  0.728       2.435
10-20           PL++    1.000       3.113
10-20           E2E-PL  0.984       2.926
20-30           PL++    2.318       4.885
20-30           E2E-PL  2.259       4.679
Table S9: Quantitative results on depth estimation.
Method           Input  IoU=0.5 Easy  IoU=0.5 Moderate  IoU=0.5 Hard  IoU=0.7 Easy  IoU=0.7 Moderate  IoU=0.7 Hard
PL++: P-RCNN     S      68.7 / 55.6   46.3 / 36.6       43.5 / 35.1   17.2 / 06.9   17.0 / 11.1       17.0 / 11.6
E2E-PL: P-RCNN   S      73.6 / 61.3   47.9 / 39.1       44.6 / 35.7   30.2 / 16.1   18.8 / 11.3       17.9 / 11.5
P-RCNN           L      93.2 / 89.7   85.1 / 79.4       84.5 / 76.8   73.8 / 42.3   66.5 / 34.6       63.7 / 37.4
Table S10: 3D object detection via the point-cloud-based pipeline with P-RCNN on the Argoverse dataset. We report AP_BEV / AP_3D (in %) of the car category, using P-RCNN for detection. We arrange methods according to the input signals: S for stereo images, L for 64-beam LiDAR. PL stands for pseudo-LiDAR. Results of our end-to-end pseudo-LiDAR are in blue. Methods with 64-beam LiDAR are in gray. Best viewed in color.

S6.2 Argoverse dataset [3]

We also experiment with Argoverse [3]. We convert the Argoverse dataset into KITTI format, following the original split, which results in and scenes (i.e., stereo images with the corresponding synchronized LiDAR point clouds) for training and validation. We use the same training scheme and hyperparameters as those in the KITTI experiments, and report the validation results in Table S10. We define the easy, moderate, and hard settings following [38]. Note that since the synchronization rate of stereo images in Argoverse is 5Hz instead of 10Hz, the dataset used here is smaller than that used in [38]. We note that the sensor calibration in Argoverse may not register the stereo images into perfect epipolar correspondence (as indicated in argoverse-api). Our experimental results in Table S10 also confirm the issue: image-based results are much worse than the LiDAR-based ones. Nevertheless, our E2E-PL pipeline still outperforms PL++. We note that most existing image-based detectors only report results on KITTI.