I Introduction
Object detection in popular domains like automotive has a deep bench of solutions. Single-shot detectors provide real-time, embedded opportunities [1], while two-stage detectors support highly accurate models [2]. Fundamentally differentiating these approaches, and separating them from preceding traditional approaches, is the proposal scheme, which dictates the regions of interest in the input space to be classified and regressed over. The flexibility of these proposal methodologies, combined with the fundamental power of convolutional frameworks in image domains, provides robust potential when optimizing traditional 2D localization.
Modern deep learning goes further. Although CNNs were developed to work with images, researchers have attempted to move them from image-based input to 3D inputs like LiDAR (e.g., [3], [4]). A corollary advancement predicts object locations in 3D, a vital objective for safety systems. These methodologies are not mutually exclusive in the literature, with some attempts to fuse LiDAR with camera data, e.g., by leveraging the 3D inputs for anchor generation and the image features for box regression and classification [5].

In some cases, it may be desirable to utilize stereo derived from two calibrated cameras instead of LiDAR. Often, this motivation is as simple as cost or the logistical hurdle of incorporating and synchronizing LiDAR with cameras; stereo is many times cheaper, and is naturally produced in the same space as the image pairs. While there are clear safety benefits to adding a LiDAR to an autonomy system [6], these can be undermined if a model learns jointly over LiDAR and imagery. In that case, sensor redundancy can be lost, and the model may fail if either input undergoes degradation. If stereo can be substituted for other range sensors, it frees them up to be used for independent object detection towards a redundancy-driven safety case.
We characterize the trade-off of utilizing stereo data by empirically bounding the performance of stereo methods, from below by one-to-one image-only methods and from above by state-of-the-art LiDAR methods:

- Towards constrained systems, we examine single-shot frameworks that are efficient, relying only on block matching for stereo inference, and that improve on equivalent image-only networks when trained with the exact same procedure and given the same number of parameters; AP increases from 0.605 to 0.907.

- Suggesting that error in stereo can be explicitly learned over, and that stereo can be a plausible substitute for LiDAR in some settings, we show that the localization difference between stereo and LiDAR models on KITTI is generally bounded within a narrow band of meters.

- We show that deep multimodal learning with independently computed stereo imbues some calibration invariance versus monocular models, and we examine this stability under sensor degradation.
Towards evaluating these models for safety, we also briefly note some common metric deficiencies. The true positive rate is unambiguous in classification schemes, but in object detection it requires pairing detections and labels accurately. When multiple labels and detections for a single input are not disjoint, this pairing is nontrivial. Existing greedy algorithms, like the ones used by default in the VOC toolkit and KITTI, can make infrequent pairing errors in the case of overlapping detections. Furthermore, their metric computation can be unstable or lack coherency when filtering for subset criteria, as they do with the difficulty classes, which may be necessary when building a safety case. We conclude with a brief discussion of these deficiencies and a recommended remedy.
II Related Work
Although less common than monocular or LiDAR networks, some authors already use stereo for object detection. One popular method involves inputting the left and right images jointly, to implicitly learn stereo information [7] [8] [9]. While performant, this methodology discards known information about the extrinsics of the cameras, which may introduce problems if these settings differ at evaluation (e.g., if the cameras' imagers or their orientation and position can change, as is common in cheaper plug-and-play or self-mounted systems). Because any depth map generated is done so implicitly, it is more difficult to build a secondary geometry-based detector to fuse with the image detector for security-driven redundancy, as is common in robotics applications (e.g., [10]). Furthermore, it doubles the number of parameters necessary for input to the network, which may pose speed challenges in real-time systems or memory challenges when deploying on memory-sensitive hardware like FPGAs.
Another method involves using disparity images, but only for anchor generation [11]. This model is not fully learned, and may be undermined at the clustering stage; while it may be effective for distinct cases, heavily occluded objects may be difficult to segregate. The authors of [12] compare methods of involving stereo, though they use a complicated stereo algorithm that requires segmentation and scene-flow analysis; moreover, their approach is centered on incorporating stereo into a cohesive 3D region proposal scheme. It requires HHA features [13], which also doubles the number of network parameters. Requiring manually engineered features runs counter to the current trend of allowing the model to learn end-to-end without hand-tuning.
The authors of [14] directly apply stereo information into LiDAR systems without attempting to fuse image information, and characterize the difference exclusively in terms of AP, which may hide in aggregate some of the particular failure surfaces relevant to a safety case involving stereo data.
Table I: AP and Brier score per model; the lower block reports the same image models under Gaussian image degradation (Section IV-C).

Method           | AP    | Brier
SSD-SA (LiDAR)   | 0.986 | 0.132
SSD-SA (Stereo)  | 0.852 | 0.222
SSD-550 (RGB)    | 0.605 | 0.344
SSD-550 (RGBH)   | 0.907 | 0.110
RPN (RGB)        | 0.871 | 0.117
RPN (RGBD)       | 0.940 | 0.061

Under degradation:
SSD-550 (RGB)    | 0.019 | 0.949
SSD-550 (RGBH)   | 0.551 | 0.970
RPN (RGB)        | 0.440 | 0.568
RPN (RGBD)       | 0.482 | 0.593
III Methods
Our analysis considers techniques that (a) maintain speed parity with existing single-shot models while improving performance and calibration and (b) characterize the maximal benefit available in stereo information.
III-A Single-Shot, 1:1 Image Improvement
Standard image representation encodes three channels at 8 bits each. To preserve the number of parameters exactly, we quantize the image from 8 bits per channel to 5, leaving 9 bits of the originally allocated space for geometric information. We then concatenate an additional channel with a height image taken from the reprojection of disparity and normalized into the pixel domain. Unlike the bird's-eye input used in some region-proposal schemes, the disparity and height images are in the same frame of reference as the left image. Since the only operational cost is the block-matching computation and a reprojection, the overhead is trivial compared to the network speed. This network is shown in Figure 1.
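As a concrete illustration, the quantize-and-concatenate step can be sketched as follows. This is a minimal sketch, assuming the height image is normalized per frame into the 8-bit pixel domain; the function name and normalization scheme are our own for illustration:

```python
import numpy as np

def make_rgbh(image_u8, height_map, bits=5):
    """Quantize an 8-bit RGB image to `bits` bits per channel and
    concatenate a height channel normalized into the pixel domain."""
    # Drop the low-order bits; values stay in [0, 255] for display/storage.
    shift = 8 - bits
    quantized = (image_u8 >> shift) << shift
    # Normalize the height map (here assumed metric) into [0, 255].
    h = height_map - height_map.min()
    h = (255.0 * h / max(float(h.max()), 1e-6)).astype(np.uint8)
    return np.dstack([quantized, h])
```

The resulting four-channel tensor occupies the same per-pixel bit budget as the original 24-bit image, which is the sense in which the substitution is one-to-one.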
The image channels can be compressed with an out-of-the-box algorithm that assigns weights to the components, as in a color-conversion or debayering kernel. However, in doing so (performed here using algorithms from the OpenCV API), an SSD model's average precision on KITTI improves only modestly. On the other hand, we can jointly learn the compression weighting with the model, finding the optimal quantized representation by introducing an additional convolution up front. Although this could reasonably be characterized as increasing the number of parameters of the network (and it certainly does, during training), we think of it as learning a preprocessing step equivalent to running a compression kernel; in any case, the kernel is very small compared to the backbone. In this scheme, as we see in Table I, AP improves dramatically, from 0.605 to 0.907, and results from this method are presented in subsequent sections.
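The two compression options can be contrasted in a few lines. The fixed BT.601 luma weights below stand in for an out-of-the-box color-conversion kernel, while the learned variant is simply a 1x1 convolution expressed as a matrix product; the function names and shapes are illustrative, not the exact training-time implementation:

```python
import numpy as np

# Fixed compression: a color-conversion kernel applied as a 1x1 "conv".
BT601 = np.array([0.299, 0.587, 0.114])  # standard luma weights

def compress_fixed(image):
    """Collapse RGB to one channel with fixed weights (out-of-the-box)."""
    # image: (H, W, 3) float array.
    return image @ BT601

def compress_learned(image, W, b):
    """A 1x1 convolution with learnable weights W (3, k) and bias b (k,):
    the jointly learned generalization of the fixed kernel above."""
    return image @ W + b
```

During training, W and b are optimized with the rest of the network; at inference, the learned kernel can be folded into preprocessing, which is why we do not count it against the parameter budget.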
III-B Upper-Bound Improvement on Image Detectors
In contrast, Figure 2 shows a more expensive version of multimodal learning that optimizes for performance over speed using a late-fusion ResNet-based RPN. This network passes both height and depth instead of raw disparity images, effectively mimicking a 3D point cloud in the plane of the image. The separate channel of convolutions removes any continuity between the spaces that a single-channel approach might imply. Moreover, the two-channel approach allows mirroring pretrained weights while allowing for separate feature kernels, potentially avoiding loss valleys. As opposed to previous work like [12], this method focuses on feature extraction from the stereo input, not region proposal, and is presented here not as a novel methodology but to understand the comparative performance of the different stereo options. This method is referred to as RPN with RGBD information, to distinguish it from the image-only RPN with RGB features.
III-C Model Evaluation
Some authors caution that models in safety cases need to be evaluated beyond their nominal cases [15]. Towards understanding the benefits of the different architectures, we measure performance of some of the models under image degradation; in particular, we apply Gaussian filters to the input image at evaluation (but not training) to model defocus. To quantify calibration in addition to performance, we provide a Brier score; because this value is highly sensitive to the underlying support, additional notes on how it should be computed appear in the final section.
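The degradation protocol amounts to an evaluation-time-only blur. A minimal sketch follows; the sigma parameterization is an assumption, since the exact filter settings are not restated here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(image, sigma):
    """Model defocus by blurring each channel with a Gaussian filter,
    applied only at evaluation time (never during training)."""
    # Blur the two spatial axes only; leave the channel axis untouched.
    return gaussian_filter(image.astype(np.float32), sigma=(sigma, sigma, 0))
```

Because the blur is withheld from training, the measured drop isolates robustness of the learned representation rather than augmentation coverage.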
We provide results on the KITTI object detection benchmark, focusing on ROC curves, AP, and Brier score as a measure of calibration. We default to the "Medium" difficulty KITTI subset except where specified otherwise.
In comparing the calibration of the networks, we also demonstrate an auxiliary benefit of RGBH input over images; under certain degradations, performance in the image-only network appears to collapse much faster than in the stereo-augmented networks.
III-D Stereo vs. LiDAR
To expand the case that stereo can be a situational replacement for LiDAR, and to quantify the error implicit in learning over block matching, we directly compare against the state-of-the-art LiDAR method SSD-SA [3] by adapting it to stereo point clouds generated from block matching.
The SSD-SA model is a single-stage 3D object detector leveraging several key insights, many of which do not require modification when using stereo input. Part-sensitive warping for improved calibration, multi-resolution features, and point-wise supervisors can be directly extrapolated to the new sensor. As opposed to schemes that voxelize the point cloud, SSD-SA maps points directly into indices in the input. For stereo data, a simple preprocessing that removes invalid points and truncates by distance is substituted. Where SSD-SA discards LiDAR coordinates not represented in the image view during training, this is unnecessary with stereo, since such points cannot exist. We retain the cut-and-paste augmentations used in the original LiDAR setting for this work.
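The substituted preprocessing can be sketched as below, operating on XYZ points already reprojected from disparity. The 60 m truncation range is an assumed placeholder, not the value used in the experiments:

```python
import numpy as np

def preprocess_stereo_cloud(points, max_range=60.0):
    """Substitute preprocessing for stereo clouds: drop invalid
    reprojections (non-finite values from zero-disparity sentinels) and
    truncate by distance, mirroring the range filter applied to LiDAR."""
    # points: (N, 3) XYZ in the camera frame, from disparity reprojection.
    finite = np.isfinite(points).all(axis=1)
    in_range = np.linalg.norm(points, axis=1) < max_range
    return points[finite & in_range]
```

The surviving points can then be fed to the detector exactly as LiDAR returns would be, with no image-frustum cropping needed.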
IV Results
IV-A Stereo vs. LiDAR for 3D Localization
The ROC curve in Figure 3 shows that stereo-only approaches perform competitively with LiDAR, both overall and on the reduced "medium"-difficulty vehicles; the tail of accuracy for stereo is shorter, especially for difficult objects, which tend to be heavily occluded or far away and so may be susceptible to stereo noise. In particular, from Table I, we see that AP degrades from 0.986 to 0.852 when switching from LiDAR to stereo point clouds.
However, we see in Figure 6 that although the error for the stereo model is highest for close objects (as implied by the wide baseline of the KITTI cameras) and grows gradually with distance, the overall difference in the models' depth predictions remains within a narrow band of meters. For many applications, including slower-moving vehicles in constrained environments, this may be a tolerable error considering the difference in price between the sensors. This error is also significantly lower than would be expected from block-matching-based disparity algorithms: from the standard derivation relating depth error to disparity error, ε_Z ≈ Z²·ε_d/(f·b) (with focal length f, baseline b, and disparity error ε_d), the error should increase quadratically with distance. Thus, we appear to have been able to learn over the intrinsic error in the disparity computation, to a limited extent.
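The expected block-matching depth error is easy to sketch numerically from the standard relation. As an illustration, the focal length (~721 px) and baseline (~0.54 m) below are approximate KITTI setup values, and a one-pixel disparity error is an assumption:

```python
def expected_depth_error(z, focal_px, baseline_m, disparity_err_px=1.0):
    """Standard stereo depth-error relation: a disparity error of eps_d
    pixels maps to a depth error of roughly z^2 * eps_d / (f * b),
    so the error grows quadratically with distance z."""
    return (z ** 2) * disparity_err_px / (focal_px * baseline_m)
```

Doubling the distance quadruples the expected error under this model, which is the quadratic growth referenced above.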
We note that machine-learning disparity algorithms, while widely better for many tasks, did not improve performance when substituted for block matching, and so are not reported here. Outputs from [16], e.g., yielded no significant change in AP; although it produces smoother disparity, it could be that "double-learning" over the images, as happens when learning once to produce disparity and then again for object detection, provides no statistical benefit. Learning depth maps and object detection jointly may still provide improvements as an auxiliary-task scheme.

IV-B Stereo to Augment Image Detectors
Incorporating stereo in image-driven single-shot detection provides notable improvements in AP. On the one hand, even keeping the number of parameters the same, we see in Table I that adding stereo features improves AP from 0.605 to 0.907 when requiring 0.7 IoU for overlapping boxes. This even outperforms a ResNet-backboned image-only RPN, which reaches 0.871 with significantly more parameters. A full RPN with depth features, on the other hand, approaches SSD-SA's benchmark, achieving 0.940 AP. Furthermore, stereo-based approaches improved calibration, bringing the Brier score for SSD from 0.344 to 0.110. The effect was even more dramatic with the ResNet RPN, with Brier improving from 0.117 to 0.061.
IV-C Degradation: Performance & Calibration
Notably, the stereo-based channel of input allows the model to hold up under certain degradations of the image. In Table I, we see that the AP for SSD with images drops from 0.605 to 0.019 after a Gaussian pass, but only from 0.907 to 0.551 in the model that uses height features. Similarly, though the Brier score of both collapses, the structure of the calibration curve in Figure 8 maintains the expected qualitative shape.
While image degradation may typically be expected to produce stereo degradation under traditional block matching, improvements can be made to anticipate this effect [17], although that is not done here. Even in the case of correlated errors, the calibration derived from imposing additional information in the stereo channel is seen to be more stable and relevant for decision making.
V Metric Refinement
In this section, we note small mistakes made in computing metrics on standard object detection datasets and provide a remedy, with proofs of correctness for the relevant components.
We take the true positive rate (tpr) of an object detector to be the maximal true positive classification rate over all possible associations of detections to labels. This definition captures the fundamental tpr goal: measuring how many instances of the primary class the model detects. Existing major object detection metric libraries do not compute tpr under this definition in the cases of overlapping candidates, and do not behave coherently in the presence of filtering criteria.
In Figure 9, the labels (in red) overlap. Some standard object detection libraries fail to accurately associate these detections (in blue) and labels. Such algorithms use a greedy approach to association, iterating sequentially over labels or detections and selecting the candidates of highest overlap. In the case of Figure 9, a greedy algorithm would report one TP, one false negative (FN), and one false positive (FP). While this is a valid classification true-positive rate, it does not accurately capture the object-detection tpr; i.e., there exists an association of labels and detections which, in this case, results in 2 TPs and no FPs or FNs.
Unfortunately, exhaustively checking all combinations of candidates is intractable, requiring O(n!) complexity in the worst case. The following algorithm provides the mapping of largest tpr while being easily implemented in O(n³) time.
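The failure mode of greedy pairing can be reproduced on a toy adjacency loosely mimicking Figure 9 (the IoU values are invented for illustration): greedy matching finds one TP, while an exhaustive search over associations finds two.

```python
import numpy as np
from itertools import permutations

# IoU matrix for a hypothetical overlapping case (rows: detections,
# cols: labels), mimicking the failure mode of Figure 9.
iou = np.array([[0.70, 0.60],
                [0.55, 0.00]])
t = 0.5  # minimum-IoU threshold

def greedy_tp(iou, t):
    """Greedy association, VOC/KITTI style: each detection in turn
    grabs its best remaining label."""
    taken, tp = set(), 0
    for row in iou:
        best = max((j for j in range(len(row)) if j not in taken),
                   key=lambda j: row[j], default=None)
        if best is not None and row[best] >= t:
            taken.add(best)
            tp += 1
    return tp

def optimal_tp(iou, t):
    """Exhaustive search over all associations (tractable only for tiny
    examples): the true object-detection tpr numerator."""
    n = iou.shape[0]
    return max(sum(iou[i, p[i]] >= t for i in range(n))
               for p in permutations(range(iou.shape[1])))
```

Here detection 0 greedily claims label 0 (IoU 0.70), leaving detection 1 with no above-threshold label, whereas the optimal association pairs detection 0 with label 1 and detection 1 with label 0.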
V-A Overview of Algorithm
For any set of detections and labels, take the adjacency matrix A to be constructed such that A_{ij} is a function of the overlap between detection i and label j, thresholded at a minimum IoU t. For Figure 9, the adjacency matrix is shown in Figure 11. In particular, A_{ij} = 0 whenever the IoU of detection i and label j falls below t.
An exhaustive search of all candidates would evaluate the permutations of A. In the two-by-two case, as in Figure 11, there are only two candidates: {A_{11}, A_{22}} and {A_{12}, A_{21}}. This follows directly from the fact that a single label or detection cannot be doubly associated.
Consider the selector function for a given adjacency matrix, given as the maximum over permutations of the summed entries, i.e., S(A) = max_σ Σ_i A_{iσ(i)}.
In Figure 11, we evaluate the two candidate permutations and note that the first has the larger sum.
Lemma 1.
The maximum sum over permutations of the adjacency matrix coincides with the maximum true positive (TP) selection.
First, a note on why the raw overlap sum, without the scaling factor, does not coincide in general with the maximum tpr, even though on the example in Figure 11 it would yield the correct solution. The raw sum fails to capture the tpr exactly when fewer terms can sum to a higher value than a greater number of smaller terms, allowing cases in which fewer total associations are selected for a higher sum. Such scenarios are trivial to construct with three detections and labels.
V-B Proof of Correctness
Lemma 2.
The scaled selector S coincides with the maximum tpr.
Assume for simplicity that the numbers of detections and labels are equal (if not, take the completion of the matrix with zeros and the proof follows identically; see Figure 13 for an example). Assume that S selects too few TPs; in this case, S has returned a permutation when there existed another permutation with more above-threshold terms, i.e., the sum of values in the first permutation is greater than that in the second despite the second selecting more elements. Substituting the bounds on the above-threshold entries of A yields a contradiction: the selected sum cannot exceed that of the larger selection. Not only is this a coinciding solution, it is the minimal coinciding solution.
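In practice, the maximal-tpr association can be sketched with an off-the-shelf assignment solver. Here the count-dominant weighting (1 plus a small scaled-overlap tie-breaker) is an illustrative choice standing in for the scaling factor above, not necessarily the exact construction:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_tp_association(iou, t):
    """Maximum-tpr association as an assignment problem. Entries below
    the IoU threshold get zero weight; valid matches get a count-dominant
    weight (1 plus a small overlap tie-breaker) so the solver prefers
    more matches before higher overlap."""
    n = max(iou.shape)
    A = np.zeros((n, n))  # zero-completed square matrix
    valid = iou >= t
    A[:iou.shape[0], :iou.shape[1]][valid] = 1.0 + iou[valid] / (n + 1)
    rows, cols = linear_sum_assignment(A, maximize=True)
    # Keep only real (non-padded), above-threshold pairs.
    return [(i, j) for i, j in zip(rows, cols)
            if i < iou.shape[0] and j < iou.shape[1] and valid[i, j]]
```

On the toy matrix from the greedy example, this recovers both true positives, and among equal-count solutions it returns the association of highest total overlap.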
We recognize this as the assignment problem, in which we want to find a matching of a particular size in a bipartite weighted graph while maximizing the weights of the edges. We recommend established solutions like the Hungarian algorithm, which runs in O(n³) time, because of the commonness of its implementation (in Python, available as scipy.optimize.linear_sum_assignment), though other solutions exist in the literature. In particular, the adjacency matrix is expected to be largely sparse, and several algorithms capitalize on this structure.

V-C Applying Filtering Criteria
Commonly, a second step in computing metrics is to filter on either the detections or the labels. One may be interested in the AP only on objects greater than some pixel size or closer than some physical distance. Whether performance degrades on objects that are small or far away is especially important to safety cases. KITTI partitions data into "Easy", "Medium", and "Hard" categories, but only asks that associations outside the group do not penalize the true positive rate; we describe a more robust process.
When evaluating multiple filters, it is helpful to enforce that a coherent filter must be "stable": there should not exist a set of detections and labels such that there are more errors on a subset (e.g., the subset with width greater than 40 pixels) than on the total set. Note that aggregate measures or rates, like tpr or recall, can still degrade under stable filters.
One possible approach is to strike from the original set the candidates that do not meet the criteria and then run the matching algorithm. This results in a lack of stability; a simple example occurs when a label is just underneath a threshold but its best-matching detection is not. An alternative formulation would compute tpr conditionally as a constrained optimization. However, that problem is difficult to formulate algorithmically without exhaustively searching candidates, may not be consistent across multiple criteria, and may not guarantee stability.
Instead, a measure on data under a filter can be computed stably as the measure on the subset of the (unfiltered) label-detection pairs (under the fixed association) for which both the detection and the label meet the filtering criteria. This formulation has the benefit of being simple and efficient to implement. Since it only considers the subset of the original associations, one never recomputes the adjacency matrix or the maximal partition, so filtered results can be computed in time linear in the number of associations. A corollary of the algorithm is that, in the case of multiple solutions (multiple label-detection pairs satisfying the overlap criteria), it returns the association of highest total overlap.
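A minimal sketch of this filtered evaluation, operating purely on the fixed associations; the helper name and the use of scalar widths as stand-ins for boxes are our own simplifications:

```python
def filtered_tpr(pairs, labels, detections, keep):
    """Stable filtered tpr: count surviving fixed pairs where BOTH the
    detection and the label pass the `keep` predicate, over all labels
    passing it. The matching itself is never re-run."""
    kept_pairs = [(i, j) for i, j in pairs
                  if keep(detections[i]) and keep(labels[j])]
    kept_labels = [l for l in labels if keep(l)]
    return len(kept_pairs) / max(len(kept_labels), 1)
```

Because `pairs` is fixed once by the assignment step, sweeping many filters (e.g., a range of area thresholds) costs only a linear scan per filter.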
Figure 12 shows the difference, on a standard SSD, of applying an area filter to labels and detections using the traditional naive approach versus the approach described in this section. Though the difference in recall and precision is small and only appears in corner cases (with SSD, these all but disappear if the IoU threshold is the recommended 0.7, which may be why the issue is traditionally ignored), we believe it is important to use accurate definitions of metrics, especially in safety-related applications. For this model, the "corrected" metrics tend to be greater than the naive metrics for a portion of the possible area filters up to 200 pixels.
Although this definition of filters is coherent, it does not guarantee that the tpr is optimal compared to an objective function that includes the filtering criteria. Consider the set of detections and labels in Figure 13, assuming we are filtering labels by width. Note that in this example we take the completion of the original detections by entering null IoU for the missing entries in the adjacency matrix.

In this case, following the proposed algorithm results in consideration only of the pair involving Label 2, yielding one FN. However, removing Label 1 under the filter and then computing the mapping would yield a TP. The proposed definition, while stable and coherent, under-represents the number of true positives with respect to that optimality. Accordingly, in Figure 12, we see that no curve moves monotonically with increased filtering, as expected.
Additional investigation may exploit the sparsity implicit in the problem as a further optimization, or consider the loss of numerical precision when computing the adjacency matrix as a source of minor instability for high-dimensional problems.
V-D Computing Brier Score
Calibration metrics are less common in the literature, but we offer a short consideration of them here. As Figure 14 makes clear, the choice of data over which to evaluate the Brier score is particularly sensitive. The complexity arises because the Brier score rewards low-confidence true negatives, and there is a massive ratio of negatives to positives in object detection.

Although the default method of computing the Brier score is to consider the set of all labels and detections for a frame, it produces highly counterintuitive results that we do not recommend for use. For a particular image, a model that produces no detections on a negative region makes no contribution to the cumulative Brier score, but a different model that produces a very-low-confidence detection in that negative region will be rewarded in its contribution to the score. This result should be avoided, since the first model is clearly better calibrated. On the other hand, evaluating only over the set of detections leaves out FNs, one of the core measures of object detectors. Thus, when reporting the Brier score, we report over the set of labels, precluding miscalibration on FPs but providing the most honest assessment of performance. Note that this decision has a much greater effect on results than the decisions in the rest of Section IV, in that it actually flips the conclusion one might derive when evaluating in the usual way. Thus, we present all of the possible scores in this section but advocate for the score that matches metric intuition in unbalanced cases.
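The label-support variant advocated here can be sketched as follows, assuming each label carries the confidence of its associated detection (zero if the label went undetected); the exact aggregation is our simplification:

```python
import numpy as np

def brier_over_labels(label_confidences):
    """Brier score computed over the set of labels: each label
    contributes (1 - c)^2, where c is the confidence of its associated
    detection (0 if undetected). FPs do not enter the score, but missed
    and low-confidence labels are penalized."""
    c = np.asarray(label_confidences, dtype=float)
    return float(np.mean((1.0 - c) ** 2))
```

Under this support, a model that emits nothing on a negative region is never penalized relative to one that emits a low-confidence false positive there, matching the intuition argued above.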
References

[1] A. Womg, M. J. Shafiee, F. Li, and B. Chwyl, "Tiny SSD: A tiny single-shot detection deep convolutional neural network for real-time embedded object detection," in 2018 15th Conference on Computer and Robot Vision (CRV). IEEE, 2018, pp. 95–101.
[2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
[3] C. He, H. Zeng, J. Huang, X.-S. Hua, and L. Zhang, "Structure aware single-stage 3D object detection from point cloud," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11873–11882.
[4] W. Ali, S. Abdelkarim, M. Zidan, M. Zahran, and A. El Sallab, "YOLO3D: End-to-end real-time 3D oriented object bounding box detection from LiDAR point cloud," in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[5] M. Fürst, O. Wasenmüller, and D. Stricker, "LRPD: Long range 3D pedestrian detection leveraging specific strengths of LiDAR and RGB," arXiv preprint arXiv:2006.09738, 2020.
[6] R. H. Rasshofer and K. Gresser, "Automotive radar and lidar systems for next generation driver assistance functions," Advances in Radio Science, vol. 3, 2005.
[7] P. Li, X. Chen, and S. Shen, "Stereo R-CNN based 3D object detection for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7644–7652.
[8] Y. Chen, S. Liu, X. Shen, and J. Jia, "DSGN: Deep stereo geometry network for 3D object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12536–12545.
[9] J. Sun, L. Chen, Y. Xie, S. Zhang, Q. Jiang, X. Zhou, and H. Bao, "Disp R-CNN: Stereo 3D object detection via shape prior guided instance disparity estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10548–10557.
[10] A. Stentz, C. Dima, C. Wellington, H. Herman, and D. Stager, "A system for semi-autonomous tractor operations," Autonomous Robots, vol. 13, no. 1, pp. 87–104, 2002.
[11] H. Königshof, N. O. Salscheider, and C. Stiller, "Realtime 3D object detection for automated driving using stereo vision and semantic information," in 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE, 2019, pp. 1405–1410.
[12] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun, "3D object proposals using stereo imagery for accurate object class detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 5, pp. 1259–1272, 2017.
[13] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in European Conference on Computer Vision. Springer, 2014, pp. 345–360.
[14] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, "Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8445–8453.
[15] Z. Pezzementi, T. Tabor, S. Yim, J. K. Chang, B. Drozd, D. Guttendorf, M. Wagner, and P. Koopman, "Putting image manipulations in context: Robustness testing for safe perception," in 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR). IEEE, 2018, pp. 1–8.
[16] Z. Yin, T. Darrell, and F. Yu, "Hierarchical discrete distribution decomposition for match density estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6044–6053.
[17] M. Pedone and J. Heikkilä, "Blur and contrast invariant fast stereo matching," in International Conference on Advanced Concepts for Intelligent Vision Systems. Springer, 2008, pp. 883–890.