Most contemporary architectures for geometric scene understanding cast the problem as one of regression: given an image, infer a depth for each pixel. However, in safety-critical systems such as autonomous vehicles, such perceptual inferences are used to make critical decisions and motion plans with considerable implications for safety. For example, what if the estimated depth of an obstacle on the road is incorrect? Here, it is crucial to build recognition systems that (1) allow for graceful degradation in functionality rather than catastrophic failure; (2) are self-aware enough to diagnose when such failures occur; and (3) extract enough information to take an appropriate action, e.g., slowing down, pulling over, or alerting a manual operator. Such requirements are explicitly laid out in Automotive Safety Integrity Level (ASIL) standards, which self-driving vehicles will be required to satisfy.
Such safety standards represent significant challenges for data-driven machine vision algorithms, which are unlikely to provide formal guarantees of performance. One attractive solution is probabilistic modeling, where uncertainty estimates are propagated throughout a model. In the contemporary world of deep learning, deep Bayesian methods [6, 17] provide uncertainty estimates over model parameters (e.g., observing a scene that looks different from experience) and uncertainty estimates arising from ambiguous data (e.g., a sensor failure). We apply such approaches to the problem of depth estimation from a single camera. Our particular approach differs from prior work in two notable aspects. First, prior methods often require Monte Carlo sampling to compute uncertainty estimates, which can be slow for real-time safety-critical applications. Second, while such certainty estimates provide some degree of self-awareness, they are limited to unimodal estimates of scene structure, implicitly producing a Gaussian estimate of depth represented by a regressed mean and regressed variance (or confidence). Instead, we develop representations that report back multimodal distributions, allowing us to ask more nuanced questions (e.g., “what is the second possible depth of a pixel?”, “how many modes exist in the distribution?”), as shown in Fig. 1 and Fig. 3.
From a practical perspective, one may ask why we bother estimating depth from a single camera when special-purpose sensors for depth estimation exist (such as LIDAR or multi-view camera rigs). Common arguments include the cost, payload, and power consumption of robots, but we motivate this problem from a safety perspective. One crucial method for ensuring ASIL certification is redundancy: estimates of scene geometry that are independently produced from various sensors (e.g., independently from LIDAR and independently from cameras) and that agree provide additional fault tolerance. In Fig. 4, we illustrate a situation in which monocular depth estimation complements range sensing.
Our overall approach to probabilistic reasoning is to recast the continuous problem of depth regression (given an image patch, regress a depth value) as a discrete problem of selecting one out of many possible discretized depths. Previous work has already demonstrated that discretization can improve the accuracy of the underlying depth regression task, but we show that discretization is even more useful for producing simple and efficient (and possibly multimodal) uncertainty estimates over possible discrete depths. Importantly, we find that such distributions can be further improved by recasting the multiclass formulation as a binary multilabel task: essentially, we train independent binary classifiers that classify patches at particular discrete depths. It is straightforward to show that the binary multilabel formulation can be seen as a relaxation of the multiclass problem that removes a linear constraint. Removing this constraint creates a more challenging learning problem that appears to be better regularized in terms of uncertainty reports. At test time, we use the logits as an unnormalized distribution over possible depths, though they can easily be normalized post-hoc (to compute summary statistics such as the expected depth).
Our main contributions are as follows:
We formulate the problem of monocular depth estimation in a probabilistic framework, which gives us confidence intervals over depth instead of point estimates.
We recast the problem of depth regression as multi-label depth classification, which yields reliable, multi-modal distributions over depth.
Our method produces accurate depth and significantly better uncertainty estimates than prior art on KITTI and NYU-depth, while running in near real-time.
Our predicted distribution over depths improves monocular 3D map reconstruction, reducing streak-like artifacts and improving accuracy as well as memory efficiency.
II Related Work
Early data-driven methods learn features in a coarse-to-fine network that refines global structure with local predictions. Some recent work substantially improves the performance of single-image depth estimation using better deep neural network architectures [20, 23, 28].
Depth Estimation as Classification: Closely related to our work, Cao et al. formulate depth estimation as a multi-class classification problem and use soft targets to train the model. However, they perform inference by choosing the most likely depth class, which does not take full advantage of the depth distribution, whereas we explore richer inference methods based on the predicted depth distributions. More importantly, the standard multi-class classification approach tends to make confident errors and does not yield reliable uncertainty estimates. Instead, we learn the classification model as independent binary classifiers, which regularizes the model and gives us much better uncertainty estimates as well as a noticeable performance improvement on standard benchmarks. Fu et al. formulate depth estimation as ordinal regression, aiming to predict a CDF over depth. However, they do not ensure that the predicted CDF is monotonically non-decreasing, which makes it unprincipled to apply probabilistic reasoning for uncertainty estimation. In contrast, we formulate depth estimation as a discrete classification problem, aiming to predict a valid depth PDF.
Uncertainty in Depth Estimation: Kendall et al. introduce two kinds of uncertainty: epistemic uncertainty (over model parameters) and aleatoric uncertainty (over output distributions). They show that aleatoric uncertainty is data-dependent, while epistemic uncertainty can be explained away given enough data. They model aleatoric uncertainty by fitting the variance of Gaussian distributions (also proposed in recent work on lightweight probabilistic extensions for deep networks). However, this can lead to unstable training and suboptimal performance. More importantly, it ignores the fact that depth distributions are multi-modal in many cases (for example, at depth discontinuities and reflective surfaces). They capture epistemic uncertainty with Bayesian neural networks, which require expensive Monte Carlo sampling to obtain depth predictions and uncertainty estimates. Instead, we focus on modeling the multi-modal distributions over depth, which gives us more reliable uncertainty metrics without the additional computational overhead.
Multiple Hypotheses Learning (MHL): Prior works [11, 21] formulate the problem of predicting a set of plausible hypotheses as multiple-choice learning. They train an ensemble of models to produce multiple possibilities and define an oracle to pick the best hypothesis. Rupprecht et al. use a shared architecture to produce multiple hypotheses and train the network by assigning each sample to the closest hypothesis. Different from these approaches, we train a single network to produce a multi-modal distribution, from which we can obtain multiple predictions without directly optimizing an oracle loss during training.
We solve the problem of inferring continuous depth through discrete classification. To illustrate the method, we first introduce how we discretize continuous depth into discrete categories. Then we formulate depth estimation as a multi-class classification task (mutually exclusive) and as a multi-label (binary) classification task (not mutually exclusive). Finally, we discuss the output of our model, i.e., a categorical probability distribution over discrete depths, and how we evaluate it, both as a standard depth estimation task and as depth estimation with uncertainty.
Discretization: We discretize continuous depth values in log space. Given a continuous range of depth, we discretize it into a fixed number of intervals whose widths are uniform in log space.
This matches the perceptual behavior of the human visual system: we care more about differences in depth for close objects than for distant ones. Furthermore, due to sensor sampling effects, we tend to encounter more close points than far-away ones; working in log space partially alleviates this class imbalance.
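As a concrete sketch of this discretization (the function name and closed-form edges are our own; the 1-80 m KITTI range and 64 bins are taken from the paper's experiments):

```python
import math

def log_space_bins(d_min, d_max, n_bins):
    """Edges of n_bins depth intervals spaced uniformly in log-depth.

    Returns n_bins + 1 edge values; bin i covers [edges[i], edges[i+1]).
    Bins are narrower near d_min, matching the perceptual emphasis on
    depth differences of close objects.
    """
    log_min, log_max = math.log(d_min), math.log(d_max)
    step = (log_max - log_min) / n_bins
    return [math.exp(log_min + i * step) for i in range(n_bins + 1)]

# Example: the KITTI depth range (1-80 m) with 64 bins.
edges = log_space_bins(1.0, 80.0, 64)
```

Note that consecutive edges share a constant ratio rather than a constant difference, which is what makes the binning uniform in log space.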
Multi-class Classification: As a baseline method, we first show how we recast continuous regression as a multi-class classification problem. A discrete distribution over depth can be parameterized by a categorical distribution. We learn to predict the probability of each depth label by minimizing the negative log-likelihood. Since we use the output of a softmax layer as the predicted probability, we also refer to this variant as “Softmax” in the following text. Given a ground-truth label, image features, and the model parameters, the loss function can be written as,
Here the distribution is predicted by a multi-class classifier over the discrete depth bins.
This loss is the cross-entropy between a one-hot label vector and the predicted distribution. To incorporate the ordinal nature of the depth labels, i.e., to penalize predictions close to the ground truth less than predictions far away, we replace the one-hot target vector with a discretized Gaussian centered on the ground truth, i.e.,
where the normalizing constant is the partition function.
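A minimal sketch of this discretized-Gaussian soft target (the bin-index parameterization and the value of sigma are assumptions; the paper elides the exact constants):

```python
import math

def soft_target(gt_bin, n_bins, sigma=1.0):
    """Discretized Gaussian centered on the ground-truth bin index,
    normalized to sum to one (the partition function)."""
    weights = [math.exp(-(k - gt_bin) ** 2 / (2 * sigma ** 2))
               for k in range(n_bins)]
    z = sum(weights)  # partition function
    return [w / z for w in weights]

# Bins adjacent to the ground truth receive nonzero probability mass,
# so near-misses are penalized less than distant predictions.
target = soft_target(gt_bin=10, n_bins=64, sigma=1.0)
```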
Binary Classification: To alleviate competition between depth classes, we further model continuous depth as a collection of independent Bernoulli random variables, where each variable encodes the probability of the depth falling into the corresponding interval. We also refer to this variant as multilabel in the paper. The loss function is written as,
where the target is an unnormalized version of the soft-target distribution.
One can see this as a relaxation of the training objective from Eq. (2) that drops the constraint that the predicted probabilities sum to one. The variance is chosen such that all depth classes within a small distance of the ground truth receive a sufficiently large soft label. At test time, we push the pre-logit scores of each binary classifier through a softmax and obtain a distribution over discrete depths, as shown in Fig. 5.
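The multilabel objective and the test-time normalization can be sketched as follows (a simplification under stated assumptions: plain per-bin sums, no batching or class weighting, which the paper does not specify here):

```python
import math

def binary_multilabel_loss(logits, soft_targets):
    """Sum of independent binary cross-entropies, one per depth bin.
    Unlike the multiclass loss, the targets need not sum to one."""
    loss = 0.0
    for z, t in zip(logits, soft_targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid per binary classifier
        loss += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return loss

def depth_distribution(logits):
    """At test time, push the binary classifiers' scores through a
    softmax to obtain a normalized distribution over depth bins."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]
```

The key design choice is that training treats the bins as independent (no sum-to-one constraint), while inference re-normalizes the scores into a valid PDF.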
Predicting Depth from a Distribution: After obtaining the distribution over depth, Cao et al. report the most confident depth class, ignoring the multi-modal nature of the predicted distribution. In contrast, we report the expected depth under the predicted distribution, which takes the whole distribution into account and yields better depth estimates.
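The expected-depth readout is a one-line computation (a sketch; the choice of bin centers and whether the expectation is taken in linear or log depth are implementation details the paper does not spell out here):

```python
def expected_depth(probs, bin_centers):
    """E[d] under the predicted categorical distribution, rather than
    the argmax class, so all modes contribute to the estimate."""
    return sum(p * d for p, d in zip(probs, bin_centers))

# Example: a uniform distribution over four bin centers.
d_hat = expected_depth([0.25, 0.25, 0.25, 0.25], [1.0, 2.0, 3.0, 4.0])
```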
Uncertainty and Multiple Hypotheses: We now describe various statistics that can be computed from our multimodal distribution, motivated by autonomous robotic perception. Because the perception module of a robot needs to be self-aware enough to report potential failures to the downstream planner or online-mapping module when faced with ambiguous scenes, the first statistic is uncertainty, computed as the Shannon entropy of the predicted distribution:
Secondly, even if the most-likely (or expected) depth of a particular pixel is far away, a robotic motion planner may wish to decrease speed if there is a non-negligible probability that the depth is in the near field (due to, say, a translucent obstacle). As such, our network can directly output multiple depth modes to downstream planners.
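Both statistics can be sketched directly from the predicted distribution (a minimal sketch; the `local_modes` helper and its tie-breaking are our own illustration, not the paper's exact mode-extraction procedure):

```python
import math

def shannon_entropy(probs):
    """Uncertainty of the predicted depth distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def local_modes(probs, bin_centers):
    """Depths at local maxima of the distribution, highest mass first,
    e.g. for reporting a second depth hypothesis to a planner."""
    modes = []
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else 0.0
        right = probs[i + 1] if i < len(probs) - 1 else 0.0
        if p > left and p > right:
            modes.append((p, bin_centers[i]))
    return sorted(modes, reverse=True)
```

A uniform distribution attains the maximum entropy (log of the number of bins), while a sharply peaked one approaches zero, so entropy directly ranks pixels from ambiguous to confident.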
Evaluation: Evaluating the above functionality on a robotic platform is difficult. Instead, to evaluate the quality of uncertainty estimation, we use the area under the ROC curve (AUC), which is widely used in stereo vision and optical flow [2, 15]. To assess the accuracy of the multi-hypothesis output, we follow past work on MHL [11, 21] and use an “oracle” evaluation protocol, in which an algorithm is allowed to report back multiple depth predictions and the best one is chosen to compute the accuracy. We also report standard metrics on depth estimation benchmarks.
Implementation: We follow the architecture of Kuznietsov et al., as shown in Fig. 5. We further add a spatial pyramid pooling module to extract global and semi-global features from the scene. We experimented with different numbers of bins on KITTI: with 32, 64, 96, and 128 bins, our method achieves an absolute relative error (ARE) of 9.34%, 8.61%, 8.60%, and 8.59%, respectively. As the improvement becomes marginal, we pick 64 bins and use that setting for all experiments in this paper. Fig. 6 shows the unnormalized soft-target distribution we use when training the binary classifiers.
We first introduce our experimental setup, including datasets and training details. We then compare to prior depth estimation methods that reason about uncertainty. Finally, we compare our method with the state of the art on the standard depth estimation task, as well as under multi-hypothesis evaluation.
Setup: We test our method on the standard depth estimation benchmarks, including KITTI  for outdoor scenes (1-80m) and NYU-v2  for indoor scenes (0.5-10m). On KITTI, we follow Eigen’s split  for training and testing. On NYU-v2, we sample k images following  for training and test on the official test split.
We first initialize the weights of our ResNet-50 backbone with ImageNet pre-trained weights. To augment the training data, we apply random gamma, brightness, and color shifts, as in prior work. We fine-tune the weights with an Adam optimizer, decreasing the initial learning rate by a fixed factor on a step schedule. We train our KITTI model for a total of 60 epochs and our NYU-v2 model for a total of 160 epochs. Our experiments run on a machine with a GeForce GTX Titan X GPU using TensorFlow.
IV-A Depth Estimation with Uncertainty
Baselines: Considering that most prior art does not reason about uncertainty, we compare against predictive Gaussian and predictive Gaussian with Monte Carlo dropout (“Gaussian-dropout”) [7, 17] in terms of depth estimation with uncertainty, as shown in Tab. I. For a fair comparison, we re-implement and train predictive Gaussian and Gaussian-dropout on KITTI and NYU-v2, ensuring that the re-implemented architectures are as close as possible to ours. For predictive Gaussian, we use the same backbone architecture but with a different prediction head, which predicts the mean and variance of a Gaussian distribution over depth in log space. To train predictive Gaussian, we minimize the per-batch negative log-likelihood based on the predicted mean and variance. For Gaussian-dropout, we use the same backbone architecture and prediction head, except that we apply dropout with probability 0.5 after several convolutional layers, as in Kendall et al. During inference, we draw 32 samples to make predictions and estimate uncertainty. Following the same idea, we apply Monte Carlo dropout to our binary model, referred to as Binary-dropout.
Following Hu et al., we plot ROC curves to evaluate depth estimation with uncertainty, as shown in Fig. 7 and Fig. 8. Such curves demonstrate how well the predicted uncertainty correlates with the actual depth estimation performance. A point on the curve indicates the performance on the least-uncertain fraction of predictions over all pixels in the test set. Perfect uncertainty estimation, from the perspective of the ROC curve, ranks predictions exactly as they would be ranked by actual error. As a reference, we include curves from such an oracle w.r.t. a specific error metric (absolute relative error, or ARE). Below, we first compare two variants of our model (binary classification and multiclass classification), and then compare our model to prior art that predicts uncertainty (predictive Gaussian and Gaussian-dropout). For each sub-metric under AUC, we follow the definitions in Eigen et al.
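The curve construction can be sketched as follows (a simplified illustration: we sweep the kept fraction from 1% to 100% and report the mean error on the retained pixels; ranking by the errors themselves reproduces the oracle curve):

```python
def uncertainty_roc(errors, uncertainties):
    """Mean error on the x% least-uncertain pixels, for x = 1..100.
    The area under this curve is the AUC metric; ranking by an oracle
    (the errors themselves) lower-bounds the curve."""
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    curve = []
    for frac in range(1, 101):
        k = max(1, round(frac / 100 * len(errors)))
        kept = [errors[order[i]] for i in range(k)]
        curve.append(sum(kept) / k)
    return curve
```

A well-calibrated uncertainty estimate yields a monotonically non-decreasing curve: adding more-uncertain pixels should only raise the mean error.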
Binary classification vs Multiclass classification: In Fig. 7, we compare the model trained with binary classification loss (“Binary”) to the model trained with multi-class classification loss (“Softmax”). As we can see on the left side of both plots, the uncertainty predicted by the multi-class classifier does not correlate well with the actual error rate, especially for those least uncertain (or most confident) pixels. In contrast, the model trained with binary classification loss produces a curve that monotonically increases as the uncertainty threshold goes up, because it is able to correctly rank more correct pixels as more confident. We posit that our multilabel loss (that removes a linear constraint present in the multi-class formulation) acts as an additional regularizer that improves uncertainty estimation.
Gaussian vs Binary: In Fig. 8, we find that predictive Gaussian also yields reliable uncertainty estimates, as it produces a monotonically increasing curve. Overall, however, it performs slightly worse than our model trained with binary classification. This might be due to its uni-modal assumption and to optimization difficulties at training time (discussed further in our ablation study). Interestingly, adding Monte Carlo dropout significantly improves NYU performance for both predictive Gaussian (“Gaussian-dropout”) and our approach (“Binary-dropout”). On KITTI, however, we see strictly worse performance for predictive Gaussian.
Quantitative evaluation: In Tab. I, we further compare uncertainty estimation quantitatively using the metrics introduced in Section III. Our binary classification method achieves better AUC on the ARE-based metrics than predictive Gaussian and its Monte Carlo dropout variant, without expensive Monte Carlo sampling. By adding Monte Carlo dropout to our model, we can further improve the AUC of ARE and RMSE on NYU-v2. Although predictive Gaussian with Monte Carlo dropout outperforms our binary loss on all RMSE-based metrics, it is too slow for real-time perception. Please refer to Tab. I for a more detailed discussion.
| Method | ARE (%) | RMSE | δ (%) | time (ms) |
|---|---|---|---|---|
| Fu et al. | 9.1 | 3.90 | 90.5 | 74 |
| Cao et al. | 9.3 | 4.02 | 90.8 | 74 |
| Eigen et al. | 19.0 | 7.16 | 69.2 | 13 |
| Godard et al. | 11.4 | 4.94 | 86.1 | 35 |
| Cao et al. | 11.5 | 4.71 | 88.7 | - |
| Fu et al. | 7.2 | 2.73 | 93.2 | 1250 |
| Kendall et al. | 14.4 | 0.51 | 81.5 | 353 |
| Eigen et al. | 15.8 | 0.64 | 76.9 | 10 |
| Laina et al. | 12.7 | 0.57 | 81.1 | 55 |
| Fu et al. | 11.5 | 0.51 | 82.8 | - |
| Kendall et al. | 11.0 | 0.51 | 81.7 | 7500 |
IV-B Multi-hypothesis Depth Prediction
We first evaluate standard depth prediction performance on KITTI and NYU-v2 using standard metrics, as shown in Tab. II. We then extend the evaluation by allowing multiple depth hypotheses. For a fair comparison, we re-implement Fu et al. and Cao et al. under the same setup as ours (a lightweight backbone and no test-time ensemble). We also include the numbers from the original papers as a reference. Please refer to Tab. II for a detailed comparison.
To evaluate our multi-modal distributions, we follow the standard protocol in multi-hypothesis learning. After computing the pre-logit scores, we report back the depth hypotheses with the highest scores, and the one with the lowest error is selected by the oracle for evaluation.
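This oracle protocol can be sketched as follows (a minimal illustration using absolute error over bin centers; the function name and error metric are our own):

```python
def oracle_error(probs, bin_centers, gt_depth, k):
    """Report the k highest-scoring depth bins; the oracle keeps the
    hypothesis closest to ground truth and returns its absolute error."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return min(abs(bin_centers[i] - gt_depth) for i in top)
```

By construction, the oracle error is non-increasing in k: allowing more hypotheses can only help, which is why multi-modal predictors benefit under this protocol.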
Since most methods cannot output multiple hypotheses, we compare to those that can be trained to do so, referred to as MHL. As in traditional regression, the MHL baseline directly regresses depth in log space; at training time, however, it makes multiple predictions and constructs an oracle loss by selecting the prediction closest to the ground truth. We train the MHL baseline and use an oracle to select the best prediction for evaluation. Please see Fig. 9 for an analysis of the results.
V Building Maps with Uncertainty
In this section, we demonstrate one application of geometric uncertainty estimation: robust map reconstruction. Though maps are often constructed in an offline stage, online mapping can be an integral part of autonomous navigation in unknown/changing environments .
In practice, it is notoriously difficult to build 3D maps from raw depth predictions because they tend to contain “streak-like artifacts”, which not only degrade the quality of the map but also increase memory usage (because they often result in larger occupied volumes). Empirically, we find that such artifacts often occur where the ground-truth depth is inherently ambiguous and follows a multi-modal distribution, e.g., at depth discontinuities and reflective surfaces. Since our depth estimator is designed to predict multi-modal distributions over depth, we use it to improve the accuracy of map reconstruction. By simply thresholding the uncertainty of each pixel's predicted distribution, we can significantly reduce streak artifacts and memory usage, as shown in Fig. 2.
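The thresholding step amounts to discarding high-entropy pixels before fusing a depth map into the 3D map (a sketch; the function name and the flat per-pixel lists are assumptions, and the entropy threshold itself is a tuning parameter):

```python
def filter_depth_by_uncertainty(depths, entropies, threshold):
    """Keep only pixels whose predictive entropy is at or below the
    threshold before map fusion; ambiguous pixels (depth discontinuities,
    reflective surfaces) are the main source of streak artifacts."""
    return [d for d, h in zip(depths, entropies) if h <= threshold]

# Example: drop the one ambiguous pixel before handing depths to mapping.
kept = filter_depth_by_uncertainty([10.0, 5.0, 30.0], [0.2, 1.5, 0.4], 0.5)
```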
We evaluate the performance of map reconstruction with and without uncertainty on KITTI odometry sequence-00, which is not included in the training set. Specifically, we run our monocular depth estimator on left RGB images and feed the output depth maps, together with ground-truth odometry, into OctoMap. Accuracy is measured as the percentage of correctly mapped cells, where a cell counts as correctly mapped if it has the same state (free or occupied) as the LiDAR map (ground truth). As shown in Tab. III, applying a simple uncertainty-based ranking and selection improves the accuracy of monocular maps by 1.8% and reduces memory usage by 25%.
| Method | Accuracy (%) | Memory (MB) |
Robotic applications of perception present new challenges for safety-critical, fault-tolerant operation. Inspired by past approaches that advocate a probabilistic Bayesian perspective, we demonstrate a simple but effective strategy of discretization (with the appropriate quantization, smoothing, and training scheme) as a mechanism for generating detailed predictions that support such safety-critical operations.
Appendix A Supplementary material
A-A Ablation Study
To reveal the contribution of each design choice to the accuracy of the standard depth estimation task, we perform an extensive ablation study as shown in Tab. V.
Classification vs Regression: We first compare the regression loss to the classification losses (Binary and Multiclass). We find that the classification losses consistently outperform the regression method in terms of absolute relative error. However, regression achieves competitive RMSE, likely because it directly minimizes squared error. We also implement the berHu regression loss, and it is still easily outperformed by the classification-based methods.
Multiclass vs Binary Classification: Training with the binary classification loss achieves similar performance to the multiclass classification loss on KITTI. However, it yields significantly better results on NYU. Since the test images in NYU differ more from the training images than those in KITTI do, we posit that the binary classification loss generalizes better than the multiclass classification loss.
Effect of Monte Carlo Dropout: On KITTI, Monte Carlo dropout worsens prediction performance for both the binary classification method and predictive Gaussian. On NYU, however, it improves results for both methods. This is plausible because NYU contains more diverse scenes, where dropout helps prevent overfitting, whereas on KITTI the training and testing data are highly correlated, so regularizing the model with dropout does not help.
Expectation vs Most-Likely Class Inference: On KITTI, we find that expectation yields better results on all metrics except one. On NYU, expectation always outperforms (or is on par with) the most-likely class. This indicates that expectation is a better way of making a prediction from a depth distribution, since it makes use of the whole distribution.
Soft Targets vs One-Hot Targets: Comparing the results of training with the soft-target distribution vs one-hot labels, we find that soft targets always perform better. We posit that training with soft targets lets our model benefit from sample sharing across nearby depth classes, and it thus performs better than using one-hot labels.
| Test dataset | Abs Rel (%) | RMSE | δ (%) |
A-B Training on mixed KITTI and NYU-v2
To obtain a robust model that works for both indoor and outdoor scenes, we train a single model on KITTI and NYU-v2 jointly. To capture the full depth range of both datasets, we enlarge the depth range and the number of depth intervals accordingly. At training time, we randomly crop the data and average the loss over each image before averaging over the whole batch. As shown in Tab. IV, when trained jointly, the performance of our model is not severely affected on either dataset.
Acknowledgements This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research.
- (2018) Robust dense mapping for large-scale dynamic environments. In ICRA.
- (2006) A confidence measure for variational optic flow methods. In Geometric Properties for Incomplete Data, pp. 283–298.
- (2017) Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Transactions on Circuits and Systems for Video Technology.
- (2014) Depth map prediction from a single image using a multi-scale deep network. In NIPS.
- (2018) Deep ordinal regression network for monocular depth estimation. In CVPR.
- (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In ICML.
- (2018) Lightweight probabilistic deep networks. In CVPR.
- (2013) Vision meets robotics: the KITTI dataset. IJRR.
- (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR.
- Unsupervised monocular depth estimation with left-right consistency. In CVPR.
- (2012) Multiple choice learning: learning to produce multiple structured outputs. In NIPS.
- (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV.
- (2005) Geometric context from a single image. In ICCV.
- (2013) OctoMap: an efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots 34(3), pp. 189–206.
- (2012) A quantitative evaluation of confidence measures for stereo vision. TPAMI.
- (2015) Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680.
- What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS.
- (2016) Challenges in autonomous vehicle testing and validation. SAE International Journal of Transportation Safety 4(1), pp. 15–24.
- (2017) Semi-supervised deep learning for monocular depth map prediction. In CVPR.
- (2016) Deeper depth prediction with fully convolutional residual networks. In 3DV.
- (2016) Stochastic multiple choice learning for training diverse deep ensembles. In NIPS.
- (2018) Brute-force facial landmark analysis with a 140,000-way classifier. In AAAI.
- (2016) Learning depth from single monocular images using deep convolutional neural fields. TPAMI.
- (2018) Autonomous vehicle navigation in rural environments without detailed prior maps. In ICRA.
- (2017) Learning in an uncertain world: representing ambiguity through multiple hypotheses. In ICCV.
- (2006) Learning depth from single monocular images. In NIPS.
- (2017) On a formal model of safe and scalable self-driving cars. CoRR abs/1708.06374.
- (2017) Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In CVPR.
- (2017) Real-time monocular dense mapping on aerial robots using visual-inertial fusion. In ICRA.