1 Introduction
Safe driving in autonomous cars requires the detection and accurate 3D localization of cars, pedestrians, and other objects. This in turn requires accurate depth information, which can be obtained from LiDAR (Light Detection And Ranging) sensors. Although highly precise and reliable, LiDAR sensors are notoriously expensive: a 64-beam model can cost around $75,000 (USD), according to the automotive LiDAR market report at http://www.woodsidecap.com/wpcontent/uploads/2018/04/Yole_WCPLiDARReport_April2018FINAL.pdf. The alternative is to obtain depth information through inexpensive commodity cameras. However, in spite of recent dramatic progress in stereo-based 3D object detection brought by pseudo-LiDAR [36], a significant performance gap remains, especially for far-away objects (which we want to detect early to allow time to react). This trade-off between affordability and safety creates an ethical dilemma.
In this paper we propose a possible solution to this remaining challenge that combines insights from both perspectives. We observe that the higher 3D object localization error of stereo-based systems stems entirely from their higher error in depth estimation (once the 3D point cloud is obtained, the two approaches are identical [36]). Importantly, this error is not random but systematic: stereo methods do detect objects with high reliability, yet they estimate the depth of the entire object as either too far or too close. See Figure 1 for an illustration: the red stereo points capture the car but are shifted by about 2 m, completely outside the ground-truth location (green box). If we can de-bias these depth estimates, it should be possible to obtain accurate 3D localization even for distant objects without exorbitant costs.
We start by revisiting the depth estimation routine embedded at the heart of state-of-the-art stereo-based 3D detection approaches [36]. A major contributor to the systematic depth bias is the fact that depth is typically not computed directly. Instead, one first estimates the disparity (the horizontal shift of a pixel between the left and right images) and then inverts it to obtain pixel-wise depth. While the use of deep neural networks has largely improved disparity estimation
[2, 7, 24, 37], designing and training the networks to optimize the accuracy of disparity estimation simply over-emphasizes nearby objects, due to the reciprocal transformation. For instance, a unit disparity error (in pixels) for a 5-meter-away object means a 10 cm error in depth: the length of a side mirror. The same disparity error for a 50-meter-away object, however, becomes a 5.8 m error in depth: the length of an entire car. Penalizing both errors equally means that the network spends more time correcting subtle errors on nearby objects than gross errors on far-away objects, resulting in degraded depth estimates and ultimately poor detection and localization of far-away objects. We thus propose to adapt the stereo network architecture and loss function for direct depth estimation. Concretely, the cost volume that fuses the left-right images and the subsequent 3D convolutions are the key components of stereo networks. Following the central assumption of convolutions, that all neighborhoods can be operated on in an identical manner, we propose to construct the cost volume on the grid of depth rather than disparity, enabling the 3D convolutions and the loss function to operate exactly on the right scale for depth estimation. We refer to our network as the stereo depth network (SDN). See Figure 1 for a comparison of the 3D points obtained with SDN (purple) and with disparity estimation (red).

Although our SDN improves the depth estimates significantly, stereo images are still inherently 2D, and it is unclear if they can ever match the accuracy and reliability of a true 3D LiDAR sensor. While LiDAR sensors with 32 or 64 beams are expensive, LiDAR sensors with only 4 beams are two orders of magnitude cheaper (the Ibeo Wide Angle Scanning (ScaLa) sensor with 4 beams costs $600 (USD); in this paper we simulate the 4-beam LiDAR response on the KITTI benchmark [12, 11] by sparsifying the original 64-beam signal) and thus easily affordable.
The 4 laser beams are very sparse and ill-suited to capture 3D object shapes by themselves, but paired with stereo images they become the ideal tool to de-bias our dense stereo depth estimates: a single high-precision laser beam may inform us how to correct the depth of an entire car or pedestrian in its path. To this end, we present a novel depth-propagation algorithm, inspired by graph-based manifold learning [38, 33, 41]. In a nutshell, we connect our estimated 3D stereo point cloud locally by a nearest-neighbor graph, such that points corresponding to the same object share many local paths with each other. We match the few but exact LiDAR measurements first with pixels (independent of depth) and then with their corresponding 3D points to obtain accurate depth estimates for several nodes in the graph. Finally, we propagate this exact depth information along the graph using a label-diffusion mechanism, resulting in a dense and accurate depth map at negligible cost. In Figure 1, the few (yellow) LiDAR measurements are sufficient to position almost all final (blue) points of the entire car within the green ground-truth box.
We conduct extensive empirical studies of our approaches on the KITTI object detection benchmark [12, 11] and achieve remarkable results. With solely stereo images, we outperform the previous state-of-the-art [36] by a large margin. Further adding a cheap 4-beam LiDAR brings another substantial relative improvement; on some metrics, our approach is nearly on par with those based on a 64-beam LiDAR, at a small fraction of the cost.
2 Background
3D object detection. Most work on 3D object detection operates on 3D point clouds from LiDAR as input [18, 20, 26, 46, 8, 35, 9, 45, 17]. Frustum PointNet [29] applies PointNet [30, 31] to the points directly, while VoxelNet [48] quantizes them into 3D grids. For street scenes, several works find that processing points from the bird's-eye view can already capture object contours and locations [6, 47, 16]. Images have also been used, but mainly to supplement LiDAR [25, 43, 22, 6, 16]. Early work based solely on images, mostly built on the 2D frontal-view detection pipeline [32, 14, 23], fell far behind in localizing objects in 3D [19, 39, 40, 1, 27, 4, 42, 3, 28, 5].
Pseudo-LiDAR. This gap has recently been reduced significantly with the introduction of the pseudo-LiDAR framework [36], which takes a drastically different approach from previous image-based 3D object detectors. Instead of directly detecting 3D bounding boxes from the frontal view of a scene, pseudo-LiDAR begins with image-based depth estimation, predicting the depth Z(u, v) of each image pixel (u, v). The resulting depth map Z is then back-projected into a 3D point cloud: a pixel (u, v) is transformed to a 3D point (x, y, z) by
z = Z(u, v),   x = (u − c_U) · z / f_U,   y = (v − c_V) · z / f_V        (1)
where (c_U, c_V) is the camera center and f_U and f_V are the horizontal and vertical focal lengths. The 3D point cloud is then treated exactly like a LiDAR signal: any LiDAR-based 3D detector can be applied seamlessly. By taking the state-of-the-art algorithms from both ends [2, 16, 29], pseudo-LiDAR obtains the highest image-based performance on the KITTI object detection benchmark [12, 11]. Our work builds upon this framework.
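The back-projection in Equation 1 is mechanical enough to state in a few lines of numpy. The following is an illustrative sketch (function and variable names are ours, not taken from any released implementation):

```python
import numpy as np

def backproject_depth_map(depth, f_u, f_v, c_u, c_v):
    """Back-project a dense depth map into a pseudo-LiDAR point cloud (Eq. 1).

    depth: (H, W) array of per-pixel depths z = Z(u, v).
    Returns an (H*W, 3) array of (x, y, z) points in the camera frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grids
    z = depth
    x = (u - c_u) * z / f_u   # horizontal offset from the optical axis
    y = (v - c_v) * z / f_v   # vertical offset
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

A pixel on the optical axis maps to (0, 0, z), and off-axis pixels spread out proportionally to their depth, which is exactly why depth errors translate directly into 3D localization errors.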
Stereo disparity estimation. Pseudo-LiDAR relies heavily on the quality of depth estimation. Essentially, if the estimated pixel depths match those provided by LiDAR, pseudo-LiDAR with any LiDAR-based detector should achieve the same performance as applying the same detector to the LiDAR signal. According to [36], depth estimation from stereo pairs of images [24, 44, 2] is more accurate than that from monocular (i.e., single) images [10, 13] for 3D object detection. We therefore focus on stereo depth estimation, which is routinely obtained by estimating the disparity between the two images.
A disparity estimation algorithm takes a pair of left-right images I_l and I_r as input, captured by cameras with a horizontal offset (i.e., baseline) b. Without loss of generality, we assume the algorithm treats the left image I_l as reference and outputs a disparity map D, recording for each pixel (u, v) its horizontal disparity to I_r. Ideally, I_l(u, v) and I_r(u − D(u, v), v) picture the same 3D location. We can therefore derive the depth map Z via the following transform (f_U: horizontal focal length),
Z(u, v) = f_U · b / D(u, v)        (2)
A common pipeline of disparity estimation is to first construct a 4D disparity cost volume C_disp, in which C_disp(u, v, d, :) is a feature vector that captures the difference between the pixels I_l(u, v) and I_r(u − d, v). It then estimates the disparity D(u, v) of each pixel according to the cost volume. One basic algorithm is to build a 3D cost volume with C_disp(u, v, d) = |I_l(u, v) − I_r(u − d, v)| and determine D(u, v) by argmin_d C_disp(u, v, d). Advanced algorithms exploit more robust features in constructing C_disp and perform structured prediction for D. In what follows, we give a concise introduction to PSMNet [2], a state-of-the-art algorithm used in [36].

PSMNet begins by extracting deep feature maps h_l and h_r from I_l and I_r, respectively. It then constructs C_disp(u, v, d, :) by concatenating the features h_l(u, v) and h_r(u − d, v), followed by layers of 3D convolutions. The resulting 3D tensor S_disp, whose feature channel size ends up being one, is then used to derive the pixel disparity via the following weighted combination,

D(u, v) = Σ_d softmax(−S_disp(u, v, d)) × d        (3)
where the softmax is performed along the third (disparity) dimension of S_disp. PSMNet can be learned end-to-end, including the image feature extractor and the 3D convolution kernels, to minimize the disparity error
Σ_{(u,v) ∈ A} ℓ(D(u, v) − D*(u, v))        (4)
where ℓ is the smooth L1 loss, D* is the ground-truth disparity map, and A contains the pixels with ground truth.
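The soft-argmin regression of Equation 3 and the smooth L1 penalty of Equation 4 can be sketched in a few lines of numpy; this is a simplified stand-in for the learned network (names are ours):

```python
import numpy as np

def soft_argmin_disparity(S):
    """Disparity regression from a cost tensor S of shape (H, W, D) (cf. Eq. 3).

    softmax(-S) along the disparity axis turns costs into weights;
    the predicted disparity is the weighted sum of candidate disparities d.
    """
    neg = -S
    w = np.exp(neg - neg.max(axis=-1, keepdims=True))  # numerically stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    d = np.arange(S.shape[-1])
    return (w * d).sum(axis=-1)

def smooth_l1(x):
    """Smooth L1 (Huber-style) penalty used in the disparity loss (cf. Eq. 4)."""
    a = np.abs(x)
    return np.where(a < 1.0, 0.5 * a * a, a - 0.5)
```

When the cost curve has a single sharp minimum at disparity d, the soft-argmin output is essentially d; the softness is what makes the whole pipeline differentiable.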
3 Stereo Depth Network (SDN)
A stereo network designed and trained to minimize the disparity error (cf. Equation 4) may over-emphasize nearby objects with smaller depths and therefore perform poorly in estimating depths for far-away objects. To see this, note that Equation 2 implies that for a given error δD in disparity, the resulting error δZ in depth increases quadratically with depth:
δZ = |∂Z/∂D| · δD ∝ Z² · δD        (5)
and the middle term is obtained by differentiating Z w.r.t. D: |∂Z/∂D| = f_U · b / D² = Z² / (f_U · b). In particular, using the settings of the KITTI dataset, a single-pixel error in disparity implies only a 0.1 m error in depth at a depth of 5 meters, but a 5.8 m error at a depth of 50 meters. See Figure 2 for the mapping from disparity to depth.
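The quadratic amplification can be checked numerically. The snippet below uses KITTI-like stereo geometry (f_U ≈ 721 px, baseline b ≈ 0.54 m; illustrative values, not the exact calibration of any particular sequence):

```python
# Depth error caused by a 1-pixel disparity overestimate (cf. Eq. 5).
f_u, b = 721.0, 0.54  # KITTI-like intrinsics (illustrative assumption)

def depth_error(z, d_err=1.0):
    d = f_u * b / z                         # true disparity for depth z (Eq. 2)
    return abs(f_u * b / (d + d_err) - z)   # depth after the disparity error

near = depth_error(5.0)    # a few centimeters
far = depth_error(50.0)    # several meters
```

The same one-pixel error costs centimeters at 5 m but meters at 50 m, a ratio close to (50/5)² = 100, matching the Z² scaling in Equation 5.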
Depth Loss. We propose two essential changes to adapt a stereo network for direct depth estimation. First, we train the stereo network to directly optimize the depth loss
Σ_{(u,v) ∈ A} ℓ(Z(u, v) − Z*(u, v))        (6)
where Z and Z* are obtained from D and D*, respectively, using Equation 2. The change from the disparity loss to the depth loss corrects the disproportionately strong emphasis on tiny depth errors of nearby objects; it is a necessary but still insufficient change to overcome the problems of disparity estimation.
Depth Cost Volume. To facilitate accurate depth learning (rather than disparity learning), we need to address the internals of the depth estimation pipeline. A crucial source of error is the 3D convolutions over the 4D disparity cost volume, where the same convolutional kernels are applied across the entire volume. This is highly problematic, as it implicitly assumes that the effect of a convolution is homogeneous throughout, which is clearly violated by the reciprocal depth-to-disparity relation (Figure 2). For example, it may be completely appropriate to locally smooth two neighboring pixels with disparities 85.7 and 86.3 (changing their depths by a few centimeters to smooth out a surface), whereas applying the same kernel to two pixels with disparities 3.7 and 4.3 could easily move the 3D points by 10 m or more.
Taking into account this insight and the central assumption of convolutions, that all neighborhoods can be operated upon in an identical manner, we propose to instead construct a depth cost volume C_depth, in which C_depth(u, v, z, :) encodes features describing how likely the depth Z(u, v) of pixel (u, v) is to be z. The subsequent 3D convolutions then operate on the grid of depth rather than disparity, affecting neighboring depths identically, independent of their location. The resulting 3D tensor S_depth is used to predict the pixel depth analogously to Equation 3.
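As an illustration, re-gridding a per-pixel cost curve from disparity bins to uniform depth bins amounts to sampling C_disp at the fractional disparity d = f_U · b / z. The sketch below uses simple per-pixel 1D linear interpolation as a stand-in for the bilinear interpolation over the full 4D volume; all names are ours:

```python
import numpy as np

def depth_cost_volume(c_disp, disp_grid, depth_grid, f_u, b):
    """Resample per-pixel disparity costs onto a uniform depth grid.

    C_depth(u, v, z) takes the value of C_disp(u, v, d) at the
    fractional disparity d = f_u * b / z (Eq. 2).
    c_disp: (H, W, D) costs on the ascending grid `disp_grid`.
    """
    d_of_z = f_u * b / depth_grid          # target disparity for each depth bin
    h, w, _ = c_disp.shape
    out = np.empty((h, w, len(depth_grid)))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.interp(d_of_z, disp_grid, c_disp[i, j])
    return out
```

Because the depth bins are uniform, a convolution kernel sliding over the new volume now moves in equal depth steps everywhere, near or far.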
We construct the new depth volume C_depth based on the intuition that C_depth(u, v, z, :) and C_disp(u, v, D(u, v), :) should lead to an equivalent "cost", where z and D(u, v) are related through Equation 2. To this end, we apply bilinear interpolation to construct C_depth from C_disp using the depth-to-disparity transform in Equation 2. Figure 5 (top) depicts our stereo depth network (SDN) pipeline. Crucially, all convolution operations are performed exclusively on C_depth. Figure 4 compares the median absolute depth estimation errors using the disparity cost volume (disparity net: PSMNet) and the depth cost volume (SDN). As expected, for far-away objects, SDN yields drastically smaller errors, with only marginal increases in the very near range (which disparity-based methods over-optimize).

4 Depth Correction
Our SDN significantly improves depth estimation and renders object contours more precisely. However, stereo has a fundamental limitation owing to the discrete nature of pixels: the disparity, being the difference in horizontal coordinates between corresponding pixels, is quantized at the level of individual pixels, while depth is continuous. Although the quantization error can be alleviated with higher-resolution images, the computational cost of depth prediction scales cubically with the image resolution, pushing the limits of GPUs in autonomous vehicles.
We therefore explore a hybrid approach that leverages a cheap LiDAR with extremely sparse (e.g., 4 beams) but accurate depth measurements to correct this bias. We note that such sensors are too sparse to capture object shapes and cannot be used alone for detection. However, by projecting the LiDAR points onto the image plane, we obtain exact depths for a small set of "landmark" pixels.
We present a graph-based depth correction (GDC) algorithm that effectively combines the dense stereo depth, which renders object shapes, with the sparse but accurate LiDAR measurements. Conceptually, we expect the corrected depth map to have the following properties: globally, landmark pixels associated with LiDAR points should take on the exact LiDAR depths; locally, object shapes captured by neighboring 3D points, back-projected from the input depth map (cf. Equation 1), should be preserved. Figure 5 (bottom) illustrates the algorithm.
Input Matching. We take as input the two point clouds from LiDAR and from pseudo-LiDAR (PL) by stereo depth estimation. The latter is obtained by converting each pixel (u, v) with depth z into a 3D point (cf. Equation 1). First, we characterize the local shapes by a directed K-nearest-neighbor (KNN) graph in the PL point cloud that connects each point to its K nearest neighbors with appropriate weights (using accelerated KD-trees [34]). Similarly, we can project the 3D LiDAR points onto pixel locations and match them to the corresponding 3D stereo points. W.l.o.g., assume that we are given "ground-truth" LiDAR depths for the first n points and no ground truth for the remaining m points. We refer to the vector of 3D stereo depth estimates as Z ∈ R^{n+m} and to the LiDAR depths as G ∈ R^n.

Edge weights. To construct the KNN graph in 3D, we ignore the LiDAR information on the first n points and only use their predicted stereo depths in Z. Let N_i denote the set of neighbors of the i-th point. Further, let W denote the weight matrix, where W_ij denotes the edge weight between points i and j (with W_ij = 0 if j ∉ N_i). Inspired by prior work in manifold learning [33, 38], we choose the weights to be the coefficients that reconstruct the depth of any point from the depths of its neighbors in N_i. We can solve for these weights with the following constrained quadratic optimization problem:
W = argmin_W ‖Z − W Z‖²   s.t.   W · 1 = 1, and W_ij = 0 for j ∉ N_i        (7)
Here 1 denotes the all-ones vector. As long as we pick K ≥ 3 and the points are in general position, there are infinitely many solutions that satisfy Z = W Z (each row of W has K free weights but only two constraints), and we pick the minimum-norm solution (obtained with slight regularization) for robustness reasons.
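The per-point subproblem of Equation 7 is small enough to sketch directly. With K neighbors there are only two linear constraints (exact reconstruction and weights summing to one), and `np.linalg.lstsq` on the underdetermined system returns exactly the minimum-norm solution (a sketch with illustrative names, omitting the slight regularization):

```python
import numpy as np

def reconstruction_weights(z_i, z_neighbors):
    """Minimum-norm weights w with w @ z_neighbors = z_i and w.sum() = 1 (cf. Eq. 7)."""
    K = len(z_neighbors)
    A = np.vstack([z_neighbors, np.ones(K)])  # 2 x K constraint matrix
    rhs = np.array([z_i, 1.0])
    w, *_ = np.linalg.lstsq(A, rhs, rcond=None)  # min-norm solution via SVD
    return w
```

Stacking these per-point weight vectors row by row yields the sparse matrix W used in the propagation step.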
Depth Correction. Let us denote the corrected depth values as Z' ∈ R^{n+m}, with Z'_i = G_i for i ≤ n. For the n points with LiDAR measurements, we thus update the depths to the (ground-truth) values G. We then solve for the remaining depths given G and the weighted KNN graph encoded in W. Concretely, we update the remaining depths such that the depth of any point can still be reconstructed with high fidelity as a weighted sum of its KNN neighbors' depths using the learned weights: if point i is moved to its new depth Z'_i, then its neighbors in N_i must also be corrected such that Z'_i ≈ Σ_{j ∈ N_i} W_ij Z'_j still holds. Further, the neighbors' neighbors must be corrected, and the depth information of the few landmark points propagates across the entire graph. We can solve for the final Z' directly with another quadratic optimization,
Z' = argmin_{Z'} ‖Z' − W Z'‖²   s.t.   Z'_i = G_i for i = 1, …, n        (8)
To illustrate the correction process, imagine the simplest case where the depth of only a single point (n = 1) is updated to Z'_1 = Z_1 + ε. An optimal solution for Equation 8 is to move all the remaining points similarly, i.e., Z'_i = Z_i + ε: as Z = W Z and W · 1 = 1, we must have Z + ε1 = W (Z + ε1). In the setting with n > 1, the least-squares loss ensures a soft diffusion between the different LiDAR depth estimates. Both optimization problems in Equation 7 and Equation 8 can be solved exactly and efficiently with sparse matrix solvers. We summarize the procedure as an algorithm in the supplemental material. From the view of graph-based manifold learning, our GDC algorithm is reminiscent of locally linear embedding [33] with landmarks to guide the final solution [38]. Figure 1 illustrates beautifully how the initial 3D point cloud from SDN (purple) of a car in the KITTI dataset is corrected with a few sparse LiDAR measurements (yellow). The resulting points (blue) lie right inside the ground-truth box and clearly show the contour of the car. Figure 4 shows the additional improvement from GDC (blue) over the pure SDN depth estimates. The error is corrected over the entire image, including many regions that have no LiDAR measurements. For objects such as cars, the improvements through GDC are far more pronounced, as these are typically touched by the four LiDAR beams and can be corrected effectively.
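The propagation step of Equation 8 reduces to a sparse linear least-squares problem once the landmark depths are clamped. The sketch below (names ours, using scipy's sparse `lsqr` solver) splits the columns of (I − W) into fixed and free parts and solves for the free depths:

```python
import numpy as np
from scipy.sparse import identity, csr_matrix
from scipy.sparse.linalg import lsqr

def gdc_propagate(Z, W, landmark_idx, G):
    """Propagate sparse LiDAR depths G over the KNN graph (cf. Eq. 8).

    Clamps the landmark entries of Z to G and finds the remaining depths
    minimizing ||Z' - W Z'||^2, preserving the learned local weights W.
    """
    n = len(Z)
    A = identity(n, format="csr") - csr_matrix(W)
    free = np.setdiff1d(np.arange(n), landmark_idx)
    rhs = -A[:, landmark_idx] @ G      # move fixed columns to the right-hand side
    sol = lsqr(A[:, free], rhs)[0]     # least-squares solve for free depths
    Z_new = np.empty(n)
    Z_new[landmark_idx] = G
    Z_new[free] = sol
    return Z_new
```

On a toy graph whose weights exactly reconstruct Z (rows of W summing to one), shifting the landmark depths by a constant ε shifts every corrected depth by ε, matching the diffusion argument above.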
5 Experiments
5.1 Setup
We refer to our combined method (SDN and GDC) for 3D object detection as pseudo-LiDAR++ (PL++ in short). To analyze the contribution of each component, we evaluate SDN and GDC independently and jointly across several settings. For GDC, we fix the number of neighbors K and add the signal from a (simulated) 4-beam LiDAR, unless stated otherwise.
Dataset, Metrics, and Baselines. We evaluate on the KITTI dataset [11, 12], which contains 7,481 training and 7,518 testing images. We follow [4] to separate the 7,481 images into 3,712 for training and 3,769 for validation. For each (left) image, KITTI provides the corresponding right image, the 64-beam Velodyne LiDAR point cloud, and the camera calibration matrices. We focus on 3D object detection and bird's-eye-view (BEV) localization and report results on the validation set. Specifically, we focus on the "car" category, following [6, 43]. We report average precision (AP) with IoU (intersection over union) thresholds at 0.5 and 0.7, and denote AP for the 3D and BEV tasks by AP_3D and AP_BEV. KITTI divides each category into easy, moderate, and hard cases according to the 2D box height and occlusion/truncation level. We compare to four stereo-based detectors: pseudo-LiDAR (PL in short) [36], 3DOP [4], SRCNN [21], and MLF-stereo [42].
Stereo depth network (SDN). We use PSMNet [2] as the backbone of our stereo depth network (SDN). Following [36], we pre-train SDN on the synthetic Scene Flow dataset [24] and fine-tune it on the 3,712 training images of KITTI. We obtain ground-truth depth for these images by projecting the corresponding LiDAR points onto the images. For comparison, we also train a PSMNet in the same way, which minimizes the disparity error.
3D object detection. We apply three algorithms for 3D object detection: AVOD [16], PIXOR [47], and PRCNN [35]. All can utilize information from LiDAR and/or monocular images. We use the released implementations of AVOD (specifically, AVOD-FPN) and PRCNN, and implement PIXOR ourselves with a slight modification to include visual information. We train all models from scratch on the 3,712 training images, replacing the LiDAR points with pseudo-LiDAR data generated from stereo depth estimation (see the supplemental material for details).
Sparser LiDAR. We simulate a sparser LiDAR signal with fewer beams by first projecting the 64-beam LiDAR points onto a 2D plane of horizontal and vertical angles. We quantize the vertical angles into 64 levels with a fixed interval, closely matching the specification of the 64-beam LiDAR, and keep the points that fall into a subset of beams to mimic the sparser signal (see the supplemental material for details).
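The sparsification described above can be sketched as follows. The vertical field-of-view bounds below are illustrative (roughly those of a 64-beam Velodyne), and all names are ours:

```python
import numpy as np

def sparsify_beams(points, keep_beams, n_levels=64, fov=(-24.9, 2.0)):
    """Simulate a sparser LiDAR by keeping a subset of vertical beams.

    points: (N, 3) LiDAR points (x forward, y left, z up).
    Vertical angles are quantized into `n_levels` uniform bins over the
    vertical field of view `fov` (degrees); a point survives if its bin
    index is listed in `keep_beams`.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    theta = np.degrees(np.arctan2(z, np.hypot(x, y)))  # vertical angle per point
    lo, hi = fov
    bins = np.clip(((theta - lo) / (hi - lo) * n_levels).astype(int),
                   0, n_levels - 1)
    return points[np.isin(bins, keep_beams)]
```

Passing, say, four evenly spaced bin indices as `keep_beams` yields the 4-beam signal used throughout the experiments.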
5.2 Experimental results
Table 1: 3D object detection results on the KITTI validation set for the "car" category. Each cell shows AP_BEV / AP_3D. Input signals: S = stereo images, L# = simulated 4-beam LiDAR, L = 64-beam LiDAR, M = monocular image. "-" indicates a value not reported.

                                  IoU = 0.5                                      IoU = 0.7
Detection algorithm  Input   Easy         Moderate     Hard         Easy         Moderate     Hard
3DOP [4]             S       55.0 / 46.0  41.3 / 34.6  34.6 / 30.1  12.6 / 6.6   9.5 / 5.1    7.6 / 4.1
MLF-stereo [42]      S       -            53.7 / 47.4  -            -            19.5 / 9.8   -
SRCNN [21]           S       87.1 / 85.8  74.1 / 66.3  58.9 / 57.2  68.5 / 54.1  48.3 / 36.7  41.5 / 31.1
PL: AVOD [36]        S       89.0 / 88.5  77.5 / 76.4  68.7 / 61.2  74.9 / 61.9  56.8 / 45.3  49.0 / 39.0
PL: PIXOR            S       89.0 / -     75.2 / -     67.3 / -     73.9 / -     54.0 / -     46.9 / -
PL: PRCNN            S       88.4 / 88.0  76.6 / 73.7  69.0 / 67.8  73.4 / 62.3  56.0 / 44.9  52.7 / 41.6
PL++: AVOD           S       89.4 / 89.0  79.0 / 77.8  70.1 / 69.1  77.0 / 63.2  63.7 / 46.8  56.0 / 39.8
PL++: PIXOR          S       89.9 / -     78.4 / -     74.7 / -     79.7 / -     61.1 / -     54.5 / -
PL++: PRCNN          S       89.8 / 89.7  83.8 / 78.6  77.5 / 75.1  82.0 / 67.9  64.0 / 50.1  57.3 / 45.3
PL++: AVOD           L# + S  90.2 / 90.1  87.7 / 86.9  79.8 / 79.2  86.8 / 70.7  76.6 / 56.2  68.7 / 53.4
PL++: PIXOR          L# + S  95.1 / -     85.1 / -     78.3 / -     84.0 / -     71.0 / -     65.2 / -
PL++: PRCNN          L# + S  90.3 / 90.3  87.7 / 86.9  84.6 / 84.2  88.2 / 75.1  76.9 / 63.8  73.4 / 57.4
AVOD [16]            L + M   90.5 / 90.5  89.4 / 89.2  88.5 / 88.2  89.4 / 82.8  86.5 / 73.5  79.3 / 67.1
PIXOR [47, 22]       L + M   94.2 / -     86.7 / -     86.1 / -     85.2 / -     81.2 / -     76.1 / -
PRCNN [35]           L       96.3 / 96.1  88.6 / 88.5  88.6 / 88.5  87.8 / 81.7  86.0 / 74.4  85.8 / 74.5
We summarize the main results on KITTI object detection in Table 1. Several important trends can be observed: 1) Our PL++, with depth estimates enhanced by SDN and GDC, yields consistent improvement over PL across all settings. 2) PL++ with GDC refinement from 4 laser beams (Input: L# + S) performs significantly better than PL++ with only stereo inputs (Input: S). 3) PL experiences a substantial drop in accuracy from IoU 0.5 to 0.7 for hard objects. This indicates that PL does indeed manage to detect objects that are far away, but systematically places them at the wrong depth: once an overlap of 0.7 is required, the object is too far out of place and no longer registers as detected. Interestingly, here is where we experience the largest gain, from PL: PRCNN (41.6 AP_3D) to PL++: PRCNN (57.4 AP_3D) with input L# + S. Note that the majority of the gain originates from GDC, as PL++ with solely stereo input only improves the score to 45.3 AP_3D. 4) Compared to 64-beam LiDAR, PL++ is outperformed by at most 12.4 AP_BEV, even in the hard case under IoU at 0.7. 5) For IoU at 0.5, with the aid of only 4 LiDAR beams, PL++ is boosted to a level comparable to models with 64-beam LiDAR signals.
Table 2: Results on the KITTI test set for the "car" category (each cell: AP_BEV / AP_3D).

Input signal      Easy         Moderate     Hard
PL++ (SDN)        75.5 / 60.4  57.2 / 44.6  53.4 / 38.5
PL++ (SDN + GDC)  83.8 / 68.5  73.5 / 54.7  66.5 / 51.2
LiDAR             89.5 / 85.9  85.7 / 75.8  79.1 / 68.3
Results on the secret KITTI test set. Table 2 summarizes our results for the car category on the KITTI test-set server. We observe a gap between our methods and LiDAR similar to that on the validation set, suggesting that our approach does not simply overfit to the validation data. There is no separate leaderboard category for 4-beam LiDAR, but at the time of submission, our approach without LiDAR refinement (pure SDN) placed first among all image-based algorithms on the KITTI leaderboard.
In the following sections, we conduct a series of experiments to analyze the performance gains of our approaches and discuss several key observations. We mainly experiment with PRCNN; results with AVOD and PIXOR follow similar trends and are included in the supplemental material.
Table 3: Ablation of the depth loss (DL) and the depth cost volume, with PRCNN as the detector (car category, IoU = 0.7; each cell: AP_BEV / AP_3D).

Stereo depth  Easy         Moderate     Hard
PSMNet        73.3 / 62.3  55.9 / 44.8  52.6 / 41.4
PSMNet + DL   80.1 / 65.5  61.9 / 46.8  56.0 / 43.0
SDN           82.0 / 67.9  64.0 / 50.1  57.3 / 45.3
Depth loss and depth cost volume. To turn a disparity network (e.g., PSMNet) into SDN, we make two subsequent changes: 1) replace the disparity loss with the depth loss; 2) replace the disparity cost volume with the depth cost volume. In Table 3, we examine the effect of these two changes separately. Regarding AP_BEV / AP_3D (moderate), the metric used on the KITTI leaderboard, the depth loss gains us 6.0 / 2.0 points of improvement, and the depth cost volume brings another 2.1 / 3.3. The two components are thus complementary in improving depth estimation.
Table 4: Combining stereo depth (SDN) with a sparse 4-beam LiDAR signal (L#), with PRCNN as the detector (car category, IoU = 0.7; each cell: AP_BEV / AP_3D).

Stereo depth  Easy         Moderate     Hard
SDN           82.0 / 67.9  64.0 / 50.1  57.3 / 45.3
L#            73.2 / 56.1  71.3 / 53.1  70.5 / 51.5
SDN + L#      86.3 / 72.0  73.0 / 56.1  67.4 / 54.1
SDN + GDC     88.2 / 75.1  76.9 / 63.8  73.4 / 57.4
Impact of sparse LiDAR beams. In pseudo-LiDAR++, we leverage a 4-beam LiDAR via GDC to correct stereo depth. A natural question follows: could the gain come solely from adding the 4-beam LiDAR points? In Table 4, we study this question by comparing against models using 1) the 4-beam LiDAR point cloud alone (L#) and 2) the pseudo-LiDAR point cloud with its corresponding parts replaced by 4-beam LiDAR (SDN + L#). The 4-beam LiDAR by itself localizes far-away objects fairly well but cannot capture close objects precisely, while simply splicing LiDAR points into pseudo-LiDAR prevents the model from detecting far-away objects accurately. In contrast, our proposed GDC method effectively combines the merits of the two signals, achieving performance superior to either alone.
6 Conclusion
In this paper we made two contributions to improve 3D object detection for autonomous vehicles without expensive LiDAR. First, we identify disparity estimation as a main source of error for stereo-based systems and propose a novel approach to learn depth directly, end-to-end, instead of through disparity estimates. Second, we advocate that one should not use expensive LiDAR sensors to capture the local structure and depth of objects. Instead, one can use commodity stereo cameras for the former and a cheap sparse LiDAR to correct the systematic bias in the resulting depth estimates. We provide a novel graph-propagation algorithm that integrates the two data modalities and propagates the initial depth estimates using two sparse matrix solvers. The resulting system, pseudo-LiDAR++, performs almost on par with $75,000 64-beam LiDAR systems but requires only 4 beams and two commodity cameras, which can be obtained at a total cost of less than $800.
Acknowledgments
This research is supported in part by grants from the National Science Foundation (III1618134, III1526012, IIS1149882, IIS1724282, and TRIPODS1740822), the Office of Naval Research DOD (N000141712175), and the Bill and Melinda Gates Foundation. We are thankful for generous support by Zillow and SAP America Inc. We thank Gao Huang for helpful discussion.
References
 [1] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teulière, and T. Chateau. Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In CVPR, 2017.
 [2] J.R. Chang and Y.S. Chen. Pyramid stereo matching network. In CVPR, 2018.
 [3] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3d object detection for autonomous driving. In CVPR, 2016.
 [4] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In NIPS, 2015.
 [5] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals using stereo imagery for accurate object class detection. IEEE transactions on pattern analysis and machine intelligence, 40(5):1259–1272, 2018.
 [6] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017.
 [7] X. Cheng, P. Wang, and R. Yang. Depth estimation via affinity learned with convolutional spatial propagation network. In ECCV, 2018.
 [8] X. Du, M. H. Ang Jr, S. Karaman, and D. Rus. A general pipeline for 3d detection of vehicles. In ICRA, 2018.

 [9] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In ICRA, 2017.
 [10] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, pages 2002–2011, 2018.
 [11] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
 [12] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.
 [13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
 [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask rcnn. In ICCV, 2017.
 [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [16] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander. Joint 3d proposal generation and object detection from view aggregation. In IROS, 2018.
 [17] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
 [18] B. Li. 3d fully convolutional network for vehicle detection in point cloud. In IROS, 2017.
 [19] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang. Gs3d: An efficient 3d object detection framework for autonomous driving. In CVPR, 2019.
 [20] B. Li, T. Zhang, and T. Xia. Vehicle detection from 3d lidar using fully convolutional network. In Robotics: Science and Systems, 2016.
 [21] P. Li, X. Chen, and S. Shen. Stereo rcnn based 3d object detection for autonomous driving. In CVPR, 2019.
 [22] M. Liang, B. Yang, S. Wang, and R. Urtasun. Deep continuous fusion for multisensor 3d object detection. In ECCV, 2018.
 [23] T.Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.
 [24] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
 [25] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. VallespiGonzalez. Sensor fusion for joint 3d object detection and semantic segmentation. arXiv preprint arXiv:1904.11466, 2019.
 [26] G. P. Meyer, A. Laddha, E. Kee, C. VallespiGonzalez, and C. K. Wellington. Lasernet: An efficient probabilistic 3d object detector for autonomous driving. In CVPR, 2019.

 [27] A. Mousavian, D. Anguelov, J. Flynn, and J. Košecká. 3d bounding box estimation using deep learning and geometry. In CVPR, 2017.
 [28] C. C. Pham and J. W. Jeon. Robust object proposals reranking for object detection in autonomous driving using convolutional neural networks. Signal Processing: Image Communication, 53:110–122, 2017.
 [29] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3d object detection from rgbd data. In CVPR, 2018.
 [30] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
 [31] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
 [32] S. Ren, K. He, R. Girshick, and J. Sun. Faster rcnn: Towards realtime object detection with region proposal networks. In NIPS, 2015.
 [33] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. science, 2000.
 [34] M. Shevtsov, A. Soupikov, and A. Kapustin. Highly parallel fast kdtree construction for interactive ray tracing of dynamic scenes. In Computer Graphics Forum, volume 26, pages 395–404. Wiley Online Library, 2007.
 [35] S. Shi, X. Wang, and H. Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR, 2019.
 [36] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In CVPR, 2019.
 [37] Y. Wang, Z. Lai, G. Huang, B. H. Wang, L. van der Maaten, M. Campbell, and K. Q. Weinberger. Anytime stereo image depth estimation on mobile devices. arXiv preprint arXiv:1810.11408, 2018.
 [38] K. Q. Weinberger, B. Packer, and L. K. Saul. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In AISTATS, 2005.
 [39] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Datadriven 3d voxel patterns for object category recognition. In CVPR, 2015.
 [40] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Subcategoryaware convolutional neural networks for object proposals and detection. In WACV, 2017.
 [41] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
 [42] B. Xu and Z. Chen. Multilevel fusion based 3d object detection from monocular images. In CVPR, 2018.
 [43] D. Xu, D. Anguelov, and A. Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In CVPR, 2018.
 [44] K. Yamaguchi, D. McAllester, and R. Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In ECCV, 2014.
 [45] Y. Yan, Y. Mao, and B. Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
 [46] B. Yang, M. Liang, and R. Urtasun. Hdnet: Exploiting hd maps for 3d object detection. In Conference on Robot Learning, pages 146–155, 2018.
 [47] B. Yang, W. Luo, and R. Urtasun. Pixor: Realtime 3d object detection from point clouds. In CVPR, 2018.
 [48] Y. Zhou and O. Tuzel. Voxelnet: Endtoend learning for point cloud based 3d object detection. In CVPR, 2018.
Appendix A Graphbased Depth Correction (Gdc) Algorithm
Here we present the GDC algorithm in detail (see the algorithm listing below). The two steps described in the main paper can be turned into two (sparse) linear systems and solved using Lagrange multipliers. For the first step, we solve a slightly modified version of the problem described in the main paper (for more accurate reconstruction). For the second step, we use the Generalized Minimal Residual method (GMRES) to iteratively solve the sparse linear system.
(Algorithm listing: Graph-based Depth Correction (GDC).)
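As a concrete illustration, the two steps can be sketched with sparse linear algebra in Python. This is a minimal sketch, not the authors' implementation: the neighborhood size, regularization constant, and the `gdc` function below are our own placeholders. Step 1 solves the sum-to-one Lagrange-multiplier system for local reconstruction weights (as in locally linear embedding [33]); step 2 pins the sparse LiDAR depths and solves the resulting sparse normal equations with scipy's GMRES.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import gmres
from scipy.spatial import cKDTree

def gdc(points, z_est, lidar_idx, lidar_z, k=4, reg=1e-3):
    """Graph-based depth correction (illustrative sketch).

    points:    (n, 3) pseudo-LiDAR points
    z_est:     (n,)   estimated depths
    lidar_idx: indices of points with sparse LiDAR measurements
    lidar_z:   their true depths
    """
    n = len(points)
    # Step 1: weights reconstructing each depth from its k nearest neighbors,
    # with a sum-to-one constraint (Lagrange multipliers, LLE-style).
    _, nbrs = cKDTree(points).query(points, k=k + 1)
    W = sparse.lil_matrix((n, n))
    for i in range(n):
        js = [j for j in nbrs[i] if j != i][:k]
        d = z_est[i] - z_est[js]                # depth differences to neighbors
        C = np.outer(d, d) + reg * np.eye(k)    # regularized local Gram matrix
        w = np.linalg.solve(C, np.ones(k))
        W[i, js] = w / w.sum()                  # enforce sum-to-one
    A = sparse.identity(n, format="csr") - W.tocsr()

    # Step 2: keep LiDAR depths fixed, solve for the remaining depths so the
    # reconstruction residual ||A z||^2 stays small (normal equations + GMRES).
    M = (A.T @ A).tocsr()
    known = np.zeros(n, dtype=bool)
    known[lidar_idx] = True
    u = np.flatnonzero(~known)
    kn = np.flatnonzero(known)
    rhs = -M[u][:, kn] @ np.asarray(lidar_z, dtype=float)
    z_u, _ = gmres(M[u][:, u], rhs)
    z = np.asarray(z_est, dtype=float).copy()
    z[kn] = lidar_z
    z[u] = z_u
    return z
```

Because the weights sum to one, a constant depth bias lies in the null space of A, so anchoring even a handful of LiDAR points removes the bias across the whole neighborhood graph.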
Appendix B Experimental Setup
b.1 Sparse LiDAR generation
In this section, we explain in detail how we generate sparser LiDAR signals with fewer beams from the 64-beam LiDAR point clouds of the KITTI dataset. For every point $(x, y, z)$ of the point cloud in a scene (in the LiDAR coordinate system: $x$ points forward, $y$ left, $z$ up, with the origin at the LiDAR sensor), we compute its elevation angle to the LiDAR sensor as
$$\theta = \arctan\left(\frac{z}{\sqrt{x^2 + y^2}}\right).$$
We order the points by their elevation angles and slice them into separate beams at a fixed angular step, starting from the lowest elevation angle (close to the Velodyne 64-beam LiDAR spec). We select the points whose elevation angles fall within two designated slices to form the 2-beam LiDAR signal, and similarly within four slices to form the 4-beam LiDAR signal. We choose the slices such that consecutive beams are separated by a fixed angular interval, following the spec of the "cheap" 4-beam LiDAR ScaLa. We visualize these sparsified LiDAR point clouds from the bird's-eye view on one example scene in Figure 6.
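The slicing procedure above can be sketched as follows. This is a minimal illustration, not the exact KITTI processing code; the starting angle and step size are placeholder values, since the exact numbers follow the Velodyne spec, and the helper names are our own.

```python
import numpy as np

def elevation_angles(points):
    # elevation angle (degrees) of each (x, y, z) point w.r.t. the sensor origin
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.degrees(np.arctan2(z, np.hypot(x, y)))

def beam_ids(points, start_deg=-24.0, step_deg=0.5):
    # assign each point to a beam by discretizing its elevation angle;
    # start_deg / step_deg are placeholders, not the paper's exact spec
    return np.floor((elevation_angles(points) - start_deg) / step_deg).astype(int)

def sparsify(points, keep):
    # keep only the points belonging to the selected beams (e.g. 2 or 4 of them)
    return points[np.isin(beam_ids(points), keep)]
```

Selecting every beam id reproduces the full point cloud; passing two or four evenly spaced ids yields the 2-beam and 4-beam signals.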
b.2 3D object detection algorithms
In this section, we provide more details on how we train 3D object detection models on pseudo-LiDAR point clouds. For AVOD, we use the same model as in [36]. For P-RCNN, we use the implementation provided by the authors. Since the P-RCNN model exploits the sparse nature of LiDAR point clouds, when training it with pseudo-LiDAR input we first sparsify the point cloud into 64 beams using the method described in subsection B.1. For PIXOR, we implement the same base model structure and data augmentation specified in [47], but without the "decode fine-tune" step and focal loss. Inspired by the trick in [22], we add an image feature branch (ResNet-18 [15]) alongside the LiDAR branch and concatenate the corresponding image features onto the LiDAR branch at each stage. We train PIXOR using RMSProp with momentum for 90 epochs, decaying the learning rate by a factor of 10 after 50 and 80 epochs. The BEV evaluation results are similar to the reported results; see Table 1.
Appendix C Additional Results
c.1 Ablation study
In Table 5 and Table 6 we provide further experimental results aligned with the experiments in subsection 5.2 of the main paper. We conduct the same experiments on two other models, AVOD and PIXOR, and observe similar trends of improvement brought by learning with the depth loss (from PSMNet to PSMNet + DL), constructing the depth cost volume (from PSMNet + DL to SDN), and applying GDC to correct the bias in stereo depth estimation (comparing SDN + GDC with SDN).
Table 5: (– : not reported)
| Depth Estimation | PIXOR Easy | PIXOR Moderate | PIXOR Hard | AVOD Easy | AVOD Moderate | AVOD Hard |
| --- | --- | --- | --- | --- | --- | --- |
| PSMNet | 73.9 / – | 54.0 / – | 46.9 / – | 74.9 / 61.9 | 56.8 / 45.3 | 49.0 / 39.0 |
| PSMNet + DL | 75.8 / – | 56.2 / – | 51.9 / – | 75.7 / 60.5 | 57.1 / 44.8 | 49.2 / 38.4 |
| SDN | 79.7 / – | 61.1 / – | 54.5 / – | 77.0 / 63.2 | 63.7 / 46.8 | 56.0 / 39.8 |
Table 6: (– : not reported)
| Depth Estimation | PIXOR Easy | PIXOR Moderate | PIXOR Hard | AVOD Easy | AVOD Moderate | AVOD Hard |
| --- | --- | --- | --- | --- | --- | --- |
| SDN | 79.7 / – | 61.1 / – | 54.5 / – | 77.0 / 63.2 | 63.7 / 46.8 | 56.0 / 39.8 |
| L# | 72.0 / – | 64.7 / – | 63.6 / – | 77.0 / 62.1 | 68.8 / 54.7 | 67.1 / 53.0 |
| SDN + L# | 75.6 / – | 59.4 / – | 53.2 / – | 84.1 / 66.0 | 67.0 / 53.1 | 58.8 / 46.4 |
| SDN + GDC | 84.0 / – | 71.0 / – | 65.2 / – | 86.8 / 70.7 | 76.6 / 56.2 | 68.7 / 53.4 |
c.2 Using fewer LiDAR beams
In PL++ (i.e., SDN + GDC), we use a 4-beam LiDAR to correct the predicted point cloud. In Table 7, we investigate using fewer (and thus potentially cheaper) LiDAR beams for depth correction. We observe that even with 2 beams, GDC already manages to combine the two signals and yields better performance than using either the 2-beam LiDAR or pseudo-LiDAR alone.
Table 7: (– : not reported)
| Depth Estimation | P-RCNN Easy | P-RCNN Moderate | P-RCNN Hard | PIXOR Easy | PIXOR Moderate | PIXOR Hard |
| --- | --- | --- | --- | --- | --- | --- |
| L# (2) | 69.2 / 46.3 | 62.8 / 41.9 | 61.3 / 40.0 | 66.8 / – | 55.5 / – | 53.3 / – |
| L# (4) | 73.2 / 56.1 | 71.3 / 53.1 | 70.5 / 51.5 | 72.0 / – | 64.7 / – | 63.6 / – |
| SDN + GDC (2) | 87.2 / 73.3 | 72.0 / 56.6 | 67.1 / 54.1 | 82.0 / – | 65.3 / – | 61.7 / – |
| SDN + GDC (4) | 88.2 / 75.1 | 76.9 / 63.8 | 73.4 / 57.4 | 84.0 / – | 71.0 / – | 65.2 / – |
c.3 Qualitative results
In Figure 7, we show detection results using P-RCNN with different input signals on a randomly chosen scene in the KITTI object validation set. Specifically, we show the results on the frontal-view images and the bird's-eye view (BEV) point clouds. In the BEV map, the observer is on the left-hand side looking to the right. For nearby objects (i.e., bounding boxes close to the left in the BEV map), we see that P-RCNN with any point cloud performs fairly well in localization. However, for far-away objects (i.e., bounding boxes close to the right), pseudo-LiDAR with depth estimated from PSMNet predicts objects (green boxes) that deviate from the ground truths (red boxes). Moreover, the noisy PSMNet points also lead to several false positives. In contrast, the boxes detected by our pseudo-LiDAR ++, either with SDN alone or with SDN + GDC, align well with the ground-truth boxes, justifying our targeted improvement in estimating far-away depths.