Object-Centric Stereo Matching for 3D Object Detection

09/17/2019 ∙ by Alex D. Pon, et al. ∙ 10

Safe autonomous driving requires reliable 3D object detection, that is, determining the 6 DoF pose and dimensions of objects of interest. Using stereo cameras to solve this task is a cost-effective alternative to the widely used LiDAR sensor. The current state-of-the-art for stereo 3D object detection takes the existing PSMNet stereo matching network, with no modifications, converts the estimated disparities into a 3D point cloud, and feeds this point cloud into a LiDAR-based 3D object detector. The issue with existing stereo matching networks is that they are designed for disparity estimation, not 3D object detection; the shape and accuracy of object point clouds are not the focus. Stereo matching networks commonly suffer from inaccurate depth estimates at object boundaries, which we define as streaking, because background and foreground points are jointly estimated. Existing networks also penalize disparity instead of the estimated position of object point clouds in their loss functions. We propose a novel 2D box association and object-centric stereo matching method that only estimates the disparities of the objects of interest to address these two issues. Our method achieves state-of-the-art results on the KITTI 3D and BEV benchmarks.




I Introduction

Safe autonomous driving requires determining the six DoF pose and dimensions of objects of interest in a scene, i.e., 3D object detection. Existing methods can be categorized by the sensors they use: LiDAR [33, 30, 26, 21, 15], LiDAR and camera [22, 13, 17], monocular camera [14, 18, 23], and stereo camera setups [16, 27, 24, 6]. Methods that incorporate LiDAR measurements set the standard for 3D detection performance as LiDAR has the ability to acquire accurate depth information. However, most multi-beam LiDAR sensors remain expensive and bulky, and their returns are sparse, particularly at long distances. On the other hand, acquiring depth from monocular cameras is ill-posed by nature and thus inaccurate and less reliable. Stereo camera setups are generally less expensive than LiDAR, and they resolve the under-constrained monocular problem through stereopsis. Moreover, given high resolution cameras and a large stereo baseline, stereo methods have the potential for accurate long range perception. Stereo object detection is therefore an important alternative to both monocular and LiDAR/camera methods.

Prior to our work, the state-of-the-art stereo 3D object detection method on the KITTI benchmark [8] was Pseudo-LiDAR [27]. Pseudo-LiDAR uses the existing 3D object detector AVOD [13] and replaces the LiDAR input with a point cloud derived from the disparity output of the stereo matching network PSMNet [3]. The performance loss on cars from replacing the LiDAR input is approximately 30% AP. To understand this discrepancy, this work shows that the point clouds derived from PSMNet contain streaking artifacts that warp the piecewise-smooth surfaces in the scene, leading to significant classification and localization errors. Streaking originates from the ambiguity of depth values at object edges; it can be hard to discern whether a pixel belongs to the object or the background. For such pixels, deep learning methods are often encouraged to produce depths between these two extremes [10]. Furthermore, in deep stereo networks, closer objects are often favored during training for two main reasons. First, the inversely proportional relation between depth and disparity causes the same disparity error to have drastically different depth errors depending on the distance of objects. For example, for a car 60 m from the camera in the KITTI dataset, a disparity error of only 0.5 pixels corresponds to a large depth error of about 5 m, but for a car 10 m away the same disparity error corresponds to a depth error of only 0.13 m. The second reason closer depths are favored during training is that there is a natural imbalance in training data. In a typical driving scene, the image is dominated by foreground pixels.
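The asymmetry between near and far depth errors can be checked with a few lines of arithmetic. The sketch below assumes KITTI-like calibration values (a focal length of roughly 721.5 pixels and a 0.54 m baseline), which are not stated in the text; the function names are illustrative only.

```python
# Assumed KITTI-like calibration (not given in the text).
FOCAL_PX = 721.5    # horizontal focal length in pixels
BASELINE_M = 0.54   # stereo baseline in metres


def depth_from_disparity(disparity_px: float) -> float:
    """Pinhole stereo relation: z = f * b / d."""
    return FOCAL_PX * BASELINE_M / disparity_px


def depth_error(true_depth_m: float, disparity_error_px: float) -> float:
    """Depth error caused by under-estimating the disparity by a fixed amount."""
    true_disparity = FOCAL_PX * BASELINE_M / true_depth_m
    estimated_depth = depth_from_disparity(true_disparity - disparity_error_px)
    return estimated_depth - true_depth_m


for depth in (10.0, 60.0):
    print(f"object at {depth:.0f} m: depth error {depth_error(depth, 0.5):.2f} m")
```

Under these assumptions the same 0.5 pixel error yields roughly 0.13 m at 10 m and roughly 5 m at 60 m, since the depth error grows approximately with the square of the distance.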

This work presents an object-centric stereo matching network, OC Stereo, to address the problems that arise from typical deep stereo matching methods. First, to resolve the streaking issue described above, we propose an object-centric representation of depth. In 3D object detection, one is primarily concerned with the objects of interest; therefore, we perform stereo matching on only object image crops and mask the ground truth background disparities during training to only penalize errors for object pixels. As a result, we avoid creating streaking artifacts in the object point clouds, and thus capture the true shapes of objects more accurately. Furthermore, as a result of only estimating disparities for the objects of interest, the runtime efficiency is significantly improved, which is an important aspect for safe self-driving vehicles. Second, to resolve the issue of stereo matching networks favoring closer objects, we introduce a point cloud loss that jointly penalizes the estimated position and shape of the object instances directly, and canonically resize the image crops of objects to balance the number of pixels for close and far objects.

Our main contributions are as follows: 1) A fast 2D box association algorithm that accurately matches detections between left and right images; 2) A novel object-centric stereo matching architecture that addresses the pixel imbalance problem between near and far objects and suppresses streaking artifacts in the resulting point clouds to improve 3D localization; 3) A point cloud loss within the stereo matching network to help recover object shape and to directly penalize depth errors; 4) State-of-the-art results on the KITTI 3D object detection benchmark [8] while running 31% faster than the previous state-of-the-art.

II Related Work

Stereo Correspondence.

Determining stereo correspondences is an active area of research in computer vision. End-to-end networks typically construct correlation layers [19] or cost volumes [3, 11, 31, 32], which can be processed efficiently on GPUs to produce high quality disparity maps. These methods already achieve less than 2% 3-pixel error on the KITTI 2015 stereo benchmark [20]. However, due to the inversely proportional relation between disparity and depth, the 3-pixel error metric allows for large inaccuracies in depth, especially at far distances, and thus this metric is not as meaningful for 3D object detection performance. We instead focus the stereo network on recovering meaningful object shapes and accurate depth to improve 3D detection performance.

Streaking Depth. Streaking depths are a common artifact in typical stereo matching networks. Point clouds of foreground objects generated from re-projecting depth maps into 3D space are generally blurred into the background, as it can be ambiguous whether pixels belong to the object or the background. The cause of streaking has been investigated by  [10], who find that at ambiguous depths common loss functions prefer the mean of the foreground and background depths or do not adequately penalize estimates within this discontinuity. MonoPLiDAR [29] proposes to eliminate streaking artifacts with instance segmentation masks. Using a depth map estimated from a monocular image, instance segmentation masks are applied to remove background points. While their method removes some streaking, streaking still persists, as shown in Fig. 1, since the instance segmentation masks are not perfect, especially at object edges where the depth ambiguities exist. Also, the full depth map is still predicted, which requires additional computation.

Stereo 3D Object Detection. One of the early stereo 3D detectors, 3DOP [5], generates candidate 3D anchors which are scored and regressed to final detections using several handcrafted features. Current state-of-the-art methods are deep learning based. Pseudo-LiDAR [27] adapts the 3D detectors AVOD [13] and F-PointNet [22] to use point clouds from disparity maps predicted by PSMNet [3]. However, this method results in point clouds with streaking artifacts, and requires additional computation by estimating depths of background areas that are not necessarily relevant for 3D object detection. In contrast, we save computation and avoid streaking artifacts with an object-centric approach that only estimates the depths of the objects of interest. Stereo R-CNN [16] creates 2D anchors that automatically associate left and right bounding boxes. These anchors are used with keypoints to estimate rough 3D bounding boxes that are later refined using photometric alignment on object image crops. TLNet [24] employs 3D anchors for object correspondence and also uses triangulation. However, Stereo R-CNN and TLNet perform 8% AP and 36% AP lower, respectively, than Pseudo-LiDAR on the KITTI moderate car category. This discrepancy suggests that explicit photometric errors and sparse anchor triangulations may be inferior to using disparity cost volumes to learn depth, and that depth estimation is the main area for improvement, which is one of the focuses of this work.

III Method

Fig. 1: 3D localization is improved with our object-centric point cloud that avoids streaking artifacts, which occur with PSMNet even when masked using Mask R-CNN. Ground truth and predictions are shown in red and green, respectively.

Given a pair of left and right images, our objective is to estimate the 3D pose and dimensions of each object of interest in the scene. The main motivation behind our method is the belief that focusing on the objects of interest will result in better object detection performance. Therefore, instead of performing stereo matching on the full image, we perform stereo matching on Regions of Interest (RoIs), and only for pixels belonging to objects. This approach has three key advantages: 1) we resize the RoIs so there is a similar number of pixels for each object, which reduces class imbalance in depth values, 2) by only comparing RoIs we reduce the possible range of disparity values, and thus have faster runtime because the RoI disparity cost volumes are smaller, 3) we avoid streaking artifacts by ignoring background pixels.

Overall, the pipeline, shown in Fig. 2, works as follows. First, a 2D detector generates 2D boxes in the left and right images. Next, a box association algorithm matches object detections across both images. Each matched detection pair is passed into the object-centric stereo network, which jointly produces a disparity map and instance segmentation mask for each object. Together, these form a disparity map containing only the objects of interest. Lastly, the disparity map is transformed into a point cloud that can be used by any LiDAR-based 3D object detection network to predict the 3D bounding boxes.

III-A 2D Object Detector and Box Association Algorithm

Given the stereo image pair input, we identify left and right RoIs, $l$ and $r$, using a 2D object detector. After applying a 2D detection score threshold, we acquire $m$ and $n$ RoIs in the left and right images, respectively. We perform association by computing the Structural SIMilarity index (SSIM) [28] for each RoI pair combination then matching the highest scores. SSIM is calculated as follows,

$$ \mathrm{SSIM}(l, r) = \frac{(2 \mu_l \mu_r + C_1)(2 \sigma_{lr} + C_2)}{(\mu_l^2 + \mu_r^2 + C_1)(\sigma_l^2 + \sigma_r^2 + C_2)}, $$

where $\mu_l$, $\mu_r$, $\sigma_l^2$, $\sigma_r^2$ are the left and right RoI pixel intensity mean and variance, $\sigma_{lr}$ is the correlation of the pixel intensities, and $C_1$ and $C_2$ are constants to prevent division by zero. This metric is calculated per image channel and averaged. Our assumption is that objects in the left and right images have similar appearance, as SSIM measures the visual similarity between two images, emphasizing relations of spatially close pixels.

Each RoI is then interpolated to a standard size. The SSIM index is calculated between each left and right RoI. The algorithm determines association by going in order of highest to lowest scoring SSIM indices using the image with fewer boxes. Once a box is associated, it is removed for faster comparison. At the end of the algorithm, unmatched boxes are considered false positives and removed.
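The association step can be illustrated with a short NumPy sketch. It is a simplification under stated assumptions rather than the implementation used in this work: SSIM is computed over a single window covering the whole crop (the reference formulation [28] uses local windows), the constants `c1` and `c2` are the common defaults for intensities in [0, 1], crops are assumed to be already resized to one common H x W x C size, and the names `ssim_global` and `associate` are hypothetical.

```python
import numpy as np


def ssim_global(left: np.ndarray, right: np.ndarray,
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """Single-window SSIM of two H x W x C crops in [0, 1], averaged over channels."""
    scores = []
    for ch in range(left.shape[2]):
        l = left[..., ch].astype(np.float64).ravel()
        r = right[..., ch].astype(np.float64).ravel()
        mu_l, mu_r = l.mean(), r.mean()
        var_l, var_r = l.var(), r.var()
        cov = ((l - mu_l) * (r - mu_r)).mean()
        num = (2 * mu_l * mu_r + c1) * (2 * cov + c2)
        den = (mu_l ** 2 + mu_r ** 2 + c1) * (var_l + var_r + c2)
        scores.append(num / den)
    return float(np.mean(scores))


def associate(left_rois, right_rois):
    """Greedy matching: visit pairs from highest to lowest SSIM and
    remove a box from further consideration once it is matched."""
    scored = [(ssim_global(l, r), i, j)
              for i, l in enumerate(left_rois)
              for j, r in enumerate(right_rois)]
    scored.sort(reverse=True)
    used_left, used_right, matches = set(), set(), []
    for _, i, j in scored:
        if i not in used_left and j not in used_right:
            matches.append((i, j))
            used_left.add(i)
            used_right.add(j)
    # Boxes that never appear in `matches` are treated as false positives.
    return matches
```

Because every matched box is removed, the number of matches equals the number of boxes in the image with fewer detections, and any remaining boxes in the other image stay unmatched.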

To improve the robustness of the association, we ensure that the difference between associated 2D bounding box centers is within an adaptive box center threshold. MonoPSR [14] shows the depth of objects is well correlated with bounding box height. This means that closer objects should have larger disparities while farther objects should have smaller disparities. Using the KITTI dataset, we model the relationship between box height and center disparity using linear regression. Based on an RoI's box height, the fitted model provides the expected center disparity for its associated box. We therefore constrain the maximum distance between box centers of associated RoIs to be within three standard deviations of the expected disparity. Boxes that do not satisfy these conditions are ignored for the SSIM calculation, further improving the speed and accuracy of the associations. An example of the corresponding RoIs is shown in Fig. 
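One way to realize the adaptive threshold is sketched below. It assumes a straight-line fit of center disparity against box height and interprets "three standard deviations" as three standard deviations of the fit residuals; both the interpretation and the function names are illustrative assumptions, not details confirmed by the text.

```python
import numpy as np


def fit_height_to_disparity(box_heights, centre_disparities):
    """Least-squares line disparity ~ slope * height + intercept,
    together with the standard deviation of the residuals."""
    h = np.asarray(box_heights, dtype=np.float64)
    d = np.asarray(centre_disparities, dtype=np.float64)
    slope, intercept = np.polyfit(h, d, deg=1)
    residual_std = float(np.std(d - (slope * h + intercept)))
    return float(slope), float(intercept), residual_std


def is_plausible_pair(left_box, right_box, model, num_std=3.0):
    """Boxes are (x1, y1, x2, y2). The horizontal shift between box centers
    must lie within num_std residual deviations of the expected disparity."""
    slope, intercept, std = model
    height = left_box[3] - left_box[1]
    expected = slope * height + intercept
    centre_left = 0.5 * (left_box[0] + left_box[2])
    centre_right = 0.5 * (right_box[0] + right_box[2])
    observed = centre_left - centre_right
    return bool(abs(observed - expected) <= num_std * std)
```

Pairs rejected by `is_plausible_pair` would simply be skipped before the SSIM comparison, which is what reduces the number of similarity evaluations.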


Fig. 2: A 2D detector and box association algorithm determine associated RoIs. Our stereo matching network estimates disparities with a 3D CNN and soft argmin operation [3] for object pixels using the RoIs and instance segmentation. These are converted to a 3D point cloud and can be inputted to any LiDAR-based 3D object detector. X indicates multiplication.

While our method is reliant on 2D detection quality, we believe using a 2D detector is actually advantageous because 2D detection is a mature field with robust performance. Radosavovic et al. [25] even claim that the current performance of 2D detectors is accurate enough that detectors can be trained on their own inferred predictions, i.e., self-training. In Sec. V we show our 2D detections have higher AP compared to other state-of-the-art methods.

III-B Object-Centric Stereo Matching

Given the associated RoIs, we perform stereo matching to estimate a canonically resized disparity map of fixed dimensions per object. Within the RoIs, disparities are learned only for pixels belonging to the object to remove depth ambiguity and thus depth streaking artifacts.

Local Disparity Formulation. We estimate the horizontal pixel shift, or disparity, within the aligned left RoI and right RoI. We refer to this disparity estimation as local, in contrast to the global disparity shift between the pixels of the full-sized left-right stereo pair. This local formulation leads to positive and negative ground truth disparities. To obtain the ground truth local disparities, we first form an array of the local horizontal image coordinates of the left RoI,

$$ i_l = \begin{bmatrix} 0 & 1 & \cdots & w-1 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 1 & \cdots & w-1 \end{bmatrix}, $$

where $w$ is the canonical RoI width. The global horizontal image coordinates of the left RoI are

$$ x_l = e_l + i_l \, \frac{w_{box}}{w}, $$

where $e_l$ is the horizontal coordinate of the left RoI’s left edge and $w_{box}$ is the width of the non-resized RoI bounding box. We calculate the disparity map corresponding to the resized RoIs, $d_g$, by performing a nearest neighbor resizing of the ground truth global disparity map to the canonical size. Therefore, the corresponding right global image coordinates are calculated as

$$ x_r = x_l - d_g. $$

These coordinates are normalized to the local coordinate system,

$$ i_r = (x_r - e_r) \, \frac{w}{w_{box}}, $$

where $e_r$ is the horizontal coordinate of the right RoI’s left edge. Lastly, the local disparity of the crops can be calculated as

$$ d_l = i_l - i_r. $$

During training, we use ground truth instance segmentation masks to only train on disparity values corresponding to the object. This formulation removes depth ambiguity at edges and removes streaking artifacts as shown in Fig. 1.
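The ground truth construction can be sketched as follows. This is one consistent reading of the steps described above, under the assumptions that both RoIs share a single non-resized box width and that local coordinates scale linearly with the resize; the function name, argument names, and the use of NaN for unsupervised background pixels are illustrative choices.

```python
import numpy as np


def local_disparity_gt(global_disp, object_mask, left_edge, right_edge, box_width):
    """global_disp: (H, W) ground truth global disparity, already resized
    (nearest neighbor) to the canonical RoI size.
    object_mask: (H, W) boolean instance mask at the same size."""
    h, w = global_disp.shape
    # Local horizontal coordinates of the resized left RoI: 0 .. W-1 in every row.
    i_left = np.tile(np.arange(w, dtype=np.float64), (h, 1))
    # Map local coordinates to global coordinates in the left image.
    x_left = left_edge + i_left * box_width / w
    # A pixel at x_left in the left image appears at x_left - d in the right image.
    x_right = x_left - global_disp
    # Normalize into the local coordinates of the resized right RoI.
    i_right = (x_right - right_edge) * w / box_width
    local = i_left - i_right
    # Only object pixels are supervised; background is marked invalid.
    return np.where(object_mask, local, np.nan)
```

With a left edge at 100, a right edge at 80, and a box width twice the canonical width, a global disparity of 22 pixels maps to a local disparity of +1 while 16 pixels maps to -2, which illustrates why both signs occur.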

During inference, we mask background pixels using a predicted instance segmentation mask. From the predicted local disparity map we calculate the global disparity map, $d_g$, by reversing the above steps. The corresponding depth map is calculated from the known horizontal focal length $f_u$ and baseline $b$ as

$$ z = \frac{f_u \, b}{d_g}. $$

Lastly, each pixel $(u, v)$ of this depth map is converted into a 3D point as

$$ x = \frac{(u - c_u) \, z}{f_u}, \qquad y = \frac{(v - c_v) \, z}{f_v}, $$

where $(c_u, c_v)$ is the camera center and $f_v$ is the vertical focal length.
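The conversion from disparity to a point cloud follows the standard pinhole stereo model and can be sketched in a few lines. The function name and the convention of skipping non-positive or NaN disparities (for example, masked background) are assumptions made for illustration.

```python
import numpy as np


def disparity_to_points(disparity, f_u, f_v, c_u, c_v, baseline):
    """Back-project every valid disparity pixel to an (x, y, z) point.
    Pixels with NaN or non-positive disparity are skipped."""
    v, u = np.indices(disparity.shape)           # row (v) and column (u) indices
    valid = np.isfinite(disparity) & (disparity > 0)
    z = f_u * baseline / disparity[valid]        # depth from disparity
    x = (u[valid] - c_u) * z / f_u               # horizontal position
    y = (v[valid] - c_v) * z / f_v               # vertical position
    return np.stack([x, y, z], axis=1)           # shape (N, 3)
```

The returned array is in the camera frame; any transformation into the frame expected by a LiDAR-based detector would be applied afterwards.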

Object-Centric Stereo Architecture. The described object-centric stereo depth formulation is flexible with most stereo depth estimation networks. In our implementation, we build on PSMNet [3] with key modifications.

We use the same feature extractor as [3], but use one for the RoIs and another for the full-sized images. Despite only comparing RoIs, we leverage global context by performing RoI Align [9] on the full-sized image feature extractor outputs. The resulting features from the left image are multiplied with the left crop feature map, and the features from the right image are multiplied with the right crop feature map. To estimate disparity, the left and right feature maps are concatenated to form a 4D disparity cost volume (height × width × disparity range × feature size) as in [3]. Importantly, however, our input size and disparity range are smaller than what would be used for global disparity estimation because the local disparity range between two RoIs is smaller than the global disparity range between two full-sized images. As a result, we create a set of smaller cost volumes, which results in a faster runtime.

To predict the instance segmentation map, only the left feature maps are used. The instance segmentation network consists of a simple decoder; the feature map is processed by three repeating bilinear up-sampling and convolutional layers, resulting in an instance segmentation mask. For each instance, the predicted segmentation mask is applied to the estimated local disparity map. To deal with overlapping instance masks, each local disparity is converted to a global disparity, resized to the original box size, and placed in farthest to closest depth order in the scene.
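The handling of overlapping instances can be sketched as a painter's-style composition. The sketch assumes each instance has already been converted to global disparity and resized to its integer pixel box, and it orders instances by their median masked disparity; that ordering criterion and the function name are assumptions, since the text only specifies farthest to closest order.

```python
import numpy as np


def compose_scene_disparity(image_shape, instances):
    """instances: list of (box, disparity, mask), where box = (x1, y1, x2, y2)
    in integer pixels and disparity/mask have shape (y2 - y1, x2 - x1).
    A larger global disparity means the object is closer."""
    scene = np.zeros(image_shape, dtype=np.float64)

    def median_disparity(instance):
        _, disp, mask = instance
        return float(np.median(disp[mask])) if mask.any() else 0.0

    # Paint far objects first so that closer objects overwrite them on overlap.
    for (x1, y1, x2, y2), disp, mask in sorted(instances, key=median_disparity):
        region = scene[y1:y2, x1:x2]   # view into the scene disparity map
        region[mask] = disp[mask]
    return scene
```

Pixels not covered by any instance mask keep a disparity of zero here, which a later point cloud conversion can treat as invalid.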

Point Cloud Loss. Similar to [3], we use the smooth L1 loss to compare the predicted local disparity with the ground truth local disparity. Penalizing the disparities directly, however, is non-ideal because it places less emphasis on far objects due to the inverse relation between disparity and depth. For example, for a car 60 m from the camera in the KITTI dataset, a disparity error of only 0.5 pixels corresponds to a large depth error of 5 m, but for a car 10 m away the same disparity error corresponds to a depth error of only 0.13 m. An unwanted consequence of computing loss from disparity estimates is that drastically different depth errors can have the same loss value.

Therefore, we transform the predicted disparities to a point cloud. We then use the smooth L1 loss to compare each object’s predicted point cloud with its ground truth point cloud. Since we are concerned about 3D localization, this loss is more suitable as it directly penalizes predicted 3D point positions and resolves the lack of emphasis on far depths described above.
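The effect of the point cloud loss can be illustrated with a NumPy sketch. It is not differentiable training code and makes several assumptions: the loss is a plain mean over all coordinates of all object points, the smooth L1 transition point `beta` is 1, and the disparities passed in are global disparities in pixels. All names are hypothetical.

```python
import numpy as np


def smooth_l1(diff, beta=1.0):
    """Elementwise smooth L1: quadratic below beta, linear above."""
    a = np.abs(diff)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)


def point_cloud_loss(pred_disp, gt_disp, mask, f_u, f_v, c_u, c_v, baseline):
    """Mean smooth L1 between predicted and ground truth 3D points of one object."""
    v, u = np.nonzero(mask)                      # pixel rows and columns of the object

    def to_points(disp):
        z = f_u * baseline / disp[v, u]
        x = (u - c_u) * z / f_u
        y = (v - c_v) * z / f_v
        return np.stack([x, y, z], axis=1)

    return float(smooth_l1(to_points(pred_disp) - to_points(gt_disp)).mean())
```

With such a loss, a 0.5 pixel disparity error on a distant object is penalized far more than the same error on a nearby object, whereas a smooth L1 loss on disparity would assign both the same value.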

III-C 3D Box Regression Network

One of the benefits of our pipeline is that we can use the estimated point cloud as input to any 3D object detector that processes point clouds. In our implementation we build on the AVOD [13] architecture and make two modifications. We first note that in AVOD, the RoI cropping operation of the second stage returns identical BEV features regardless of the vertical location of a regressed anchor, or proposal. Given this, and since our stereo point cloud does not contain ground points, we append the proposal’s 3D position information to the feature vector used to regress each 3D proposal. We also check if the final 3D bounding boxes align with the 2D detections in the first stage. If a 3D box projected into the image plane does not overlap with a 2D detection by at least 0.5 IoU, it is removed.
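The 2D consistency check can be sketched as a simple IoU filter. The sketch assumes the 3D boxes have already been projected into the image as axis-aligned boxes in (x1, y1, x2, y2) form; the projection itself is omitted and the function names are illustrative.

```python
def iou_2d(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_by_2d_overlap(projected_boxes, detections_2d, min_iou=0.5):
    """Return indices of projected 3D boxes that overlap at least one
    2D detection with IoU >= min_iou; all other boxes are discarded."""
    return [i for i, box in enumerate(projected_boxes)
            if any(iou_2d(box, det) >= min_iou for det in detections_2d)]
```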

Fig. 3: Qualitative results on KITTI. Ground truth and predictions are in red and green, respectively. Colored points are predicted by our stereo matching network while LiDAR points are shown in black for visualization purposes only.

IV Implementation

2D Detector and Box Association. We use MS-CNN [2] as our 2D detector because it has fast runtime speed and high accuracy. The RoIs are cropped and bilinearly resized to fixed canonical sizes for association and for local stereo matching.

Object-Centric Stereo Network. During stereo matching, the minimum and maximum local disparities are set as -64 and 90 pixels. This range was found by calculating the range of local disparities for randomly jittered ground truth 2D boxes that maintain a minimum 0.7 IoU with the original box. For faster convergence, the feature extractors are pre-trained on full depth maps from the SceneFlow dataset [19] and depth completed LiDAR scans from the training split of the KITTI object detection dataset. No training is done on the KITTI raw or stereo datasets because these datasets contain overlapping samples with the object detection dataset. The object-centric stereo network, which leverages these feature extractors, is fine-tuned on crops of depth completed LiDAR scans. Depth completion is used for additional training points and faster convergence, and to remove the erroneous depths due to the differing locations of the camera and LiDAR sensor [1]. The depth completion method used is [12] because it is structure preserving and does not contain streaking artifacts. The ground truth instance segmentation masks used to mask the background disparity are created by projecting the points within the ground truth 3D boxes into the image. These instance masks exactly correspond to the pixels belonging to the object, but they are not smooth due to the depth completion, so the instance segmentation network is instead trained using masks from [4]. For optimization, Adam was used with a batch size of 16 and a learning rate of 0.001 for 8 epochs then 0.0001 for 4 more epochs.

3D Object Detection. For the 3D object detector, we use AVOD [13] to compare with Pseudo-LiDAR. With a batch size of 1, the Adam optimizer is used with a learning rate of 0.0001 for 50000 steps then decayed to 0.00001 and stopped using early stopping. The data augmentation used was horizontal flipping and PCA jittering.

| Method | Easy (0.5 IoU) | Moderate (0.5 IoU) | Hard (0.5 IoU) | Easy (0.7 IoU) | Moderate (0.7 IoU) | Hard (0.7 IoU) |
|---|---|---|---|---|---|---|
| TLNet [24] | 62.46 / 59.51 | 45.99 / 43.71 | 41.92 / 37.99 | 29.22 / 18.15 | 21.88 / 14.26 | 18.83 / 13.72 |
| S-RCNN [16] | 87.13 / 85.84 | 74.11 / 66.28 | 58.93 / 57.24 | 68.50 / 54.11 | 48.30 / 36.69 | 41.47 / 31.07 |
| PL-FP [27] | 89.8 / 89.5 | 77.6 / 75.5 | 68.2 / 66.3 | 72.8 / 59.4 | 51.8 / 39.8 | 44.0 / 33.5 |
| PL-AVOD [27] | 89.0 / 88.5 | 77.5 / 76.4 | 68.7 / 61.2 | 74.9 / 61.9 | 56.8 / 45.3 | 49.0 / 39.0 |
| Ours | 90.01 / 89.65 | 80.63 / 80.03 | 71.06 / 70.34 | 77.66 / 64.07 | 65.95 / 48.34 | 51.20 / 40.39 |

TABLE I: Car Localization and Detection. $AP_{BEV}$ / $AP_{3D}$ on val.

| Method | Pedestrians Easy | Pedestrians Moderate | Pedestrians Hard | Cyclists Easy | Cyclists Moderate | Cyclists Hard |
|---|---|---|---|---|---|---|
| PSMNet + AVOD | 36.68 / 27.39 | 30.08 / 26.00 | 23.76 / 20.72 | 36.12 / 35.88 | 22.99 / 22.78 | 22.11 / 21.94 |
| PL-FP [27] | 41.3 / 33.8 | 34.9 / 27.4 | 30.1 / 24.0 | 47.6 / 41.3 | 29.9 / 25.2 | 27.0 / 24.9 |
| Ours | 44.00 / 34.80 | 37.20 / 29.05 | 30.39 / 28.06 | 48.20 / 45.59 | 27.90 / 25.93 | 26.96 / 24.62 |

TABLE II: Pedestrian and Cyclist Localization and Detection. $AP_{BEV}$ / $AP_{3D}$ on val. We note that [27] only provides values up to one decimal place.
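
| Method | BEV AP Easy | BEV AP Moderate | BEV AP Hard | 3D AP Easy | 3D AP Moderate | 3D AP Hard |
|---|---|---|---|---|---|---|
| S-RCNN [16] | 61.67 | 43.87 | 36.44 | 49.23 | 34.05 | 28.39 |
| PL-FP [27] | 55.0 | 38.7 | 32.9 | 39.7 | 26.7 | 22.3 |
| PL-AVOD [27] | 66.83 | 47.20 | 40.30 | 55.40 | 37.17 | 31.37 |
| Ours | 66.97 | 54.16 | 46.70 | 55.11 | 38.80 | 31.86 |

TABLE III: Car Localization and Detection. $AP_{BEV}$ and $AP_{3D}$ on KITTI test.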

| Method | Pedestrians Easy | Pedestrians Moderate | Pedestrians Hard | Cyclists Easy | Cyclists Moderate | Cyclists Hard |
|---|---|---|---|---|---|---|
| PL-FP [27] | 31.3 / 29.8 | 24.0 / 22.1 | 21.9 / 18.8 | 4.1 / 3.7 | 3.1 / 2.8 | 2.8 / 2.1 |
| PL-AVOD [27] | 27.5 / 25.2 | 20.6 / 19.0 | 19.4 / 15.3 | 13.5 / 13.3 | 9.1 / 9.1 | 9.1 / 9.1 |
| Ours | 35.12 / 28.14 | 23.23 / 21.85 | 22.56 / 20.92 | 34.77 / 32.66 | 22.26 / 21.25 | 21.36 / 19.77 |

TABLE IV: Pedestrians and Cyclists Localization and Detection. $AP_{BEV}$ and $AP_{3D}$ on KITTI test.

| Method | Left Easy | Left Moderate | Left Hard | Right Easy | Right Moderate | Right Hard | Stereo Easy | Stereo Moderate | Stereo Hard |
|---|---|---|---|---|---|---|---|---|---|
| S-RCNN [16] | 98.73 | 88.48 | 71.26 | 98.71 | 88.50 | 71.28 | 98.53 | 88.27 | 71.14 |
| Ours | 97.77 | 89.93 | 80.53 | 98.23 | 90.09 | 80.50 | 97.13 | 89.63 | 80.02 |
| Ours Adaptive Thresh | 98.87 | 90.53 | 81.05 | 98.92 | 90.50 | 80.88 | 98.44 | 90.38 | 80.71 |

TABLE V: Stereo 2D AP. 2D detections and stereo box correspondence AP on val.

| Method | $AP_{BEV}$ |
|---|---|
| Baseline [27] | 56.8 |
| Baseline + Pre-trained weights | 57.10 |
| Baseline + Mask-RCNN [9] | 49.20 |
| Local | 64.90 |
| Local + AVOD mods. | 65.40 |
| Local + AVOD mods. + PC Loss | 65.95 |

TABLE VI: Ablation Studies. Comparisons of $AP_{BEV}$ at 0.7 IoU using [27] as the baseline. Local is our object-centric stereo network.

| Stage | Runtime (s) |
|---|---|
| MS-CNN [2] | 0.080 |
| Box Association | 0.009 |
| Stereo Matching | 0.161 |
| AVOD [13] | 0.100 |
| Total | 0.350 |

TABLE VII: Runtime Analysis. Runtime for each stage of our method. Our total runtime is faster than the previous state-of-the-art: PSMNet + AVOD (0.410 s + 0.100 s).

V Experimental Results

We compare against the state-of-the-art, perform ablation studies, and provide qualitative results (Fig. 3) using the KITTI dataset [8]. The KITTI dataset contains 7481 training images and 7518 test images, and groups objects into three difficulty categories, Easy, Moderate, and Hard, based on 2D box height, occlusion, and truncation. To compare with the state-of-the-art, we follow the 1:1 training to validation split of [7, 13, 22] and the standard practice of comparing BEV and 3D AP performance using IoUs of 0.5 and 0.7. We also benchmark our results on the online KITTI test server.

V-A 3D AP Comparison with the State-of-the-Art

As mentioned in Sec. II, 3-pixel error is not indicative of 3D object detection performance as it allows large inaccuracies in depth. An alternative is comparing depth map errors. The depth map RMSE for our method and Pseudo-LiDAR is 1.60 m and 1.81 m, respectively, when comparing the same pixels that are predicted in both Pseudo-LiDAR and our depth maps. However, we believe object detection AP is more meaningful than depth map metrics because depth map errors are not as indicative of the shape of each object. We therefore use object detection AP for the remaining comparisons.

We compare to the state-of-the-art using $AP_{BEV}$ and $AP_{3D}$ on the validation set in Tab. I and Tab. II. For the car class, we outperform the state-of-the-art in all categories. Most noticeably, we have a 9.2% AP increase in the BEV moderate category at 0.7 IoU, which is used to rank methods on the KITTI online server. For pedestrians and cyclists we surpass Pseudo-LiDAR with F-PointNet (PL-FP) in all but three categories and tie up to rounding error on hard cyclist BEV. We also surpass the performance of Pseudo-LiDAR implemented with AVOD, as shown in the top row, which indicates that much of the performance improvement for PL-FP can be attributed to F-PointNet. We leave using our stereo outputs on different 3D object detectors as future work. Results on the test set show similar performance improvements for our method in Tab. III and Tab. IV.

In Tab. VII we provide runtime analysis using a Titan Xp GPU. Our method runs faster than the current state-of-the-art, Pseudo-LiDAR, by 160 ms. They run PSMNet (0.410s) and AVOD (0.100s), while our entire pipeline takes 0.350s. Our speed boosts can be attributed to the fact we only estimate disparities for RoIs, and our object-centric formulation builds a set of smaller disparity cost volumes.

V-B 2D AP Comparison with Box Association

We compare our box association method with Stereo R-CNN [16] using 2D and stereo AP. Stereo AP [16] is calculated by requiring a minimum 0.7 IoU with the ground truth box for the left and right bounding boxes and for the left and right bounding boxes to belong to the same object. As shown in Tab. V, the 2D detector MS-CNN and our box association algorithm outperform or have comparable results to Stereo R-CNN. In particular, there is a 9.57% AP improvement in the hard category. Moreover, in Tab. V, there is a minimal decrease from our left and right AP to our stereo AP, which demonstrates that minimal performance is lost by performing association.

V-C Effect of Local Stereo Depth Estimation

In Tab. VI we provide ablation studies. The baseline used is Pseudo-LiDAR [27]. The third row of the table shows that we outperform an additional baseline that only keeps foreground depth pixels from PSMNet using a version of Mask R-CNN [9]. As shown in Fig. 1, this is in part because this Mask-RCNN baseline still contains ambiguous depths and is susceptible to streaking artifacts. We note that our object-centric disparity formulation makes our method robust to some erroneous segmentation predictions because our network is trained with only object pixels, so it learns to set some background pixels to the object depth to help maintain object shape. Tab. VI shows the benefits of our method (Local), pre-training AVOD on depth completed LiDAR, appending anchor information to AVOD’s proposal regression, and employing our point cloud loss.


  • [1] A. J. Amiri, S. Y. Loo, and H. Zhang (2019) Semi-supervised monocular depth estimation with left-right consistency using deep neural network. arXiv preprint arXiv:1905.07542. Cited by: §IV.
  • [2] Z. Cai, Q. Fan, R. Feris, and N. Vasconcelos (2016) A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, Cited by: TABLE VII, §IV.
  • [3] J. Chang and Y. Chen (2018) Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418. Cited by: §I, §II, §II, Fig. 2, §III-B, §III-B, §III-B.
  • [4] L. Chen, S. Fidler, A. Yuille, and R. Urtasun (2014) Beat the mturkers: automatic image labeling from weak 3d supervision. In CVPR, Cited by: §IV.
  • [5] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun (2015) 3D object proposals for accurate object class detection. In NIPS, Cited by: §II.
  • [6] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun (2017) 3d object proposals using stereo imagery for accurate object class detection. IEEE transactions on pattern analysis and machine intelligence 40 (5), pp. 1259–1272. Cited by: §I.
  • [7] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In CVPR, Cited by: §V.
  • [8] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, Cited by: §I, §I, §V.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §III-B, TABLE VI, §V-C.
  • [10] S. Imran, Y. Long, X. Liu, and D. Morris (2019-01) Depth coefficients for depth completion. In Proceedings of IEEE Computer Vision and Pattern Recognition, Long Beach, CA. Cited by: §I, §II.
  • [11] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry (2017) End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 66–75. Cited by: §II.
  • [12] J. Ku, A. Harakeh, and S. L. Waslander (2018) In defense of classical image processing: fast depth completion on the cpu. CRV. Cited by: §IV.
  • [13] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander (2018) Joint 3d proposal generation and object detection from view aggregation. IROS. Cited by: §I, §I, §II, §III-C, TABLE VII, §IV, §V.
  • [14] J. Ku, A. D. Pon, and S. L. Waslander (2019) Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11867–11876. Cited by: §I, §III-A.
  • [15] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) PointPillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: §I.
  • [16] P. Li, X. Chen, and S. Shen (2019) Stereo r-cnn based 3d object detection for autonomous driving. arXiv preprint arXiv:1902.09738. Cited by: §I, §II, TABLE I, TABLE III, TABLE V, §V-B.
  • [17] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019) Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7345–7353. Cited by: §I.
  • [18] F. Manhardt, W. Kehl, and A. Gaidon (2019) Roi-10d: monocular lifting of 2d detection to 6d pose and metric shape. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2069–2078. Cited by: §I.
  • [19] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048. Cited by: §II, §IV.
  • [20] M. Menze and A. Geiger (2015) Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3061–3070. Cited by: §II.
  • [21] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington (2019) LaserNet: an efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12677–12686. Cited by: §I.
  • [22] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018-06) Frustum pointnets for 3d object detection from rgb-d data. In CVPR, Cited by: §I, §II, §V.
  • [23] Z. Qin, J. Wang, and Y. Lu (2018) Monogrnet: a geometric reasoning network for monocular 3d object localization. arXiv preprint arXiv:1811.10247. Cited by: §I.
  • [24] Z. Qin, J. Wang, and Y. Lu (2019) Triangulation learning network: from monocular to stereo 3d object detection. arXiv preprint arXiv:1906.01193. Cited by: §I, §II, TABLE I.
  • [25] I. Radosavovic et al. (2018-06) Data distillation: towards omni-supervised learning. In CVPR, Cited by: §III-A.
  • [26] S. Shi, X. Wang, and H. Li (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–779. Cited by: §I.
  • [27] Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Weinberger (2018) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. arXiv preprint arXiv:1812.07179. Cited by: §I, §I, §II, TABLE I, TABLE II, TABLE III, TABLE IV, TABLE VI, §V-C.
  • [28] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §III-A.
  • [29] X. Weng and K. Kitani (2019) Monocular 3d object detection with pseudo-lidar point cloud. arXiv preprint arXiv:1903.09847. Cited by: §II.
  • [30] Y. Yan, Y. Mao, and B. Li (2018) SECOND: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §I.
  • [31] Z. Yin, T. Darrell, and F. Yu (2019) Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6044–6053. Cited by: §II.
  • [32] F. Zhang, V. Prisacariu, R. Yang, and P. H. Torr (2019) GA-net: guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 185–194. Cited by: §II.
  • [33] Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §I.