Inferring Distributions Over Depth from a Single Image

12/12/2019 ∙ by Gengshan Yang, et al. ∙ 14

When building a geometric scene understanding system for autonomous vehicles, it is crucial to know when the system might fail. Most contemporary approaches cast the problem as depth regression, whose output is a depth value for each pixel. Such approaches cannot diagnose when failures might occur. One attractive alternative is a deep Bayesian network, which captures uncertainty in both model parameters and ambiguous sensor measurements. However, estimating uncertainties is often slow and the distributions are often limited to being unimodal. In this paper, we recast the continuous problem of depth regression as discrete binary classification, whose output is an unnormalized distribution over possible depths for each pixel. Such output allows one to reliably and efficiently capture multi-modal depth distributions in ambiguous cases, such as depth discontinuities and reflective surfaces. Results on standard benchmarks show that our method produces accurate depth predictions and significantly better uncertainty estimates than prior art while running in near real time. Finally, by making use of the uncertainty of the predicted distribution, we significantly reduce streak-like artifacts and improve accuracy as well as memory efficiency in 3D map reconstruction.




I Introduction

Most contemporary architectures for geometric scene understanding cast the problem as one of regression - given an image, infer a depth for each pixel. However, in safety-critical systems such as autonomous vehicles, such perceptual inferences will be used to make critical decisions and motion plans with considerable implications for safety. For example, what if the estimated depth of an obstacle on the road is incorrect? Here, it is crucial to build recognition systems that (1) allow for safety-critical graceful-degradation in functionality, rather than catastrophic failures; (2) are self-aware enough to diagnose when such failures occur; and (3) extract enough information to take an appropriate action, e.g. a slow-down, pull-over, or alerting of a manual operator. Such requirements are explicitly laid out in Automotive Safety Integrity Level (ASIL) standards which self-driving vehicles will be required to satisfy [18].

Fig. 1: Given an input image, traditional methods predict a single depth value for each pixel. In this paper, we describe an approach that predicts a per-pixel multi-modal distribution over depth. In the example above, we zoom in on depth predictions along the dashed green line. Inside the input image, we highlight a segment filled with depth discontinuities, marked with a yellow double-headed arrow, where pixels could come from the car in the front, the car behind, or even the building in the back. In the output at the bottom, we mark ground truth depth with blue and depth with higher probabilities with red. While traditional methods incorrectly yield the mean of different modes, our approach successfully captures the multi-modal nature.
Fig. 2: From left to right, we show 3D maps built with LiDAR measurements (left), vanilla monocular depth predictions (middle), and most certain monocular depth predictions (right). Color encodes normalized heights. By thresholding depth predictions with uncertainty, we can remove streak-like artifacts (red dotted circles) and reduce memory usage by a quarter. To generate these maps, we feed depth measurements/predictions into OctoMap [14] and use odometry measurements as provided. LiDAR and monocular images come from the KITTI odometry sequence-00, which is not included in training.
Fig. 3: Visualization of multi-modal depth predictions on the glass table, where the surface is transparent, making its depth fundamentally ambiguous. Instead of regressing a single depth value or predicting a unimodal distribution, our method yields a multi-modal distribution over depth and successfully captures different modes (the table surface and the wall behind the table).

Such safety standards represent significant challenges for data-driven machine vision algorithms, which are unlikely to provide formal guarantees of performance [27]. One attractive solution is that of probabilistic modeling, where uncertainty estimates are propagated throughout a model. In the contemporary world of deep learning, deep Bayesian methods [6, 17] provide uncertainty estimates over model parameters (e.g., observing a scene that looks different than experience) and uncertainty estimates arising from ambiguous data (e.g., a sensor failure). We apply such approaches to the problem of depth estimation from a single camera. Our particular approach differs from prior work in two notable aspects. First, prior methods often require Monte Carlo sampling to compute uncertainty estimates [6], which can be slow for real-time safety-critical applications. Second, while certainty estimates provide some degree of self-awareness, they are limited to unimodal estimates of scene structure, implicitly producing a Gaussian estimate of depth represented by a regressed mean and regressed variance (or confidence) [17]. Instead, we develop representations that report back multimodal distributions that allow us to ask more nuanced questions (e.g., “what is the second possible depth of a pixel?”, “how many modes exist in the distribution?”), as shown in Fig. 1 and Fig. 3.

From a practical perspective, one may ask why bother estimating depth from a single camera when special-purpose sensors for depth estimation exist (such as LIDAR or multi-view camera rigs)? Common arguments include cost, payload and power consumption of robots [29], but we motivate this problem from a safety perspective. One crucial method for ensuring ASIL certification is redundancy, and so estimates of scene geometry that are independently produced from various sensors (e.g., independently from LIDAR and independently from cameras) and that agree provide additional fault tolerance. In Fig. 4, we illustrate a situation in which monocular depth estimation complements range sensing.

Our overall approach to probabilistic reasoning is to recast the continuous problem of depth regression (given an image patch, regress a depth value) as a discrete problem of selecting one out of many possible discretized depths. Previous work [3] has already demonstrated that discretization can improve the accuracy of the underlying depth regression task, but we show that discretization is even more useful for producing simple and efficient (and possibly multimodal) uncertainty estimates of depth. Intuitively, K-way classifiers are often trained with softmax loss functions, and so naturally report a distribution over possible discrete depths. Importantly, we find that such distributions can be further improved by recasting the multiclass formulation as a binary multilabel task: essentially, train independent binary classifiers that classify patches at particular discrete depths. It is straightforward to show that the binary multilabel formulation can be seen as a relaxation of the multiclass problem that removes a linear constraint. Removing this constraint creates a more challenging learning problem that appears to be better regularized in terms of uncertainty reports. At test time, we use the logits as an unnormalized distribution over possible depths, though they can easily be normalized post hoc (to compute summary statistics such as the expected depth).

Our main contributions are as follows:

  • We formulate the problem of monocular depth estimation in a probabilistic framework, which gives us confidence intervals of depth instead of point estimations.

  • We recast the problem of depth regression as multi-label depth classification, which yields reliable, multi-modal distributions over depth.

  • Our method produces accurate depth and significantly better uncertainty estimates than prior art on KITTI and NYU-depth while running in near real time.

  • Our predicted distribution over depths improves monocular 3D map reconstruction, reducing streak-like artifacts and improving accuracy as well as memory efficiency.

Fig. 4: A situation in which monocular depth estimation complements range sensing. In the top row, from left to right, we show a monocular image, a binary mask, and an entropy map. The binary mask shows where LiDAR readings are available and the entropy map summarizes the uncertainty of each pixel’s predicted distribution. Note that a large chunk of the truck body with black paint has no LiDAR returns, since LiDAR sensors are less reliable on less-reflective materials. Our monocular depth estimator successfully predicts high entropy in the area with black paint. In the bottom row, we show depth predictions with uncertain pixels removed. From left to right, we gradually increase the confidence threshold. The rightmost image plots the 30% of pixels with the most confident depth predictions, in which most predictions on the truck body are removed. If a perception system relies solely on LiDAR measurements, it will perceive plenty of free space on the left side, which might lead to catastrophic decisions. If a perception system is designed with redundancy, it would trust LiDAR measurements less at pixels where the monocular estimator predicts high uncertainty.

II Related Work

Single Image Depth Estimation: Early works [13, 26] popularize the problem of inferring scene depth maps from a single image, making use of handcrafted features. Eigen et al. [4] take a data-driven approach to learn features in a coarse-to-fine network that refines global structure with local predictions. Some recent work substantially improves the performance of single image depth estimation using better deep neural network architectures [20, 23, 28].

Depth Estimation as Classification: Closely related to our work, Cao et al. [3] formulate depth estimation as a multi-class classification problem and use soft targets to train the model. However, they perform inference by choosing the most likely depth class, which does not take full advantage of the depth distribution, while we explore richer inference methods based on the predicted depth distributions. More importantly, the standard multi-class classification approach tends to make confident errors and does not yield reliable uncertainty estimates. Instead, we learn the classification model as independent binary classifiers, which regularizes the model and gives us much better uncertainty estimation as well as a noticeable performance improvement on standard benchmarks. Fu et al. [5] formulate depth estimation as ordinal regression, aiming to predict a CDF over depth. However, they do not ensure that the predicted CDF is monotonically non-decreasing, so probabilistic reasoning about uncertainty is not well grounded. In contrast, we formulate depth estimation as a discrete classification problem, aiming to predict a valid depth PDF.

Uncertainty in Depth Estimation: Kendall et al. [17] introduce two kinds of uncertainties: epistemic uncertainty (over model parameters) and aleatoric uncertainty (over output distributions). They show that aleatoric uncertainty is data-dependent while epistemic uncertainty is not. They model aleatoric uncertainty by fitting the variance of Gaussian distributions (also proposed in recent work on lightweight probabilistic extensions for deep networks [7]). However, this might lead to unstable training and suboptimal performance. More importantly, it ignores the fact that depth distributions are multi-modal in many cases (for example at depth discontinuities and reflective surfaces). They capture epistemic uncertainty with Bayesian neural networks [6], which require expensive Monte Carlo sampling to obtain depth predictions and uncertainty estimates. Instead, we focus on modeling the multi-modal distributions over depth, which gives us more reliable uncertainty metrics without the additional computational overhead.

Multiple Hypotheses Learning (MHL): Prior works [11, 21] formulate the problem of learning to predict a set of plausible hypotheses as multiple-choice learning. They train an ensemble of models to produce multiple possibilities and define an oracle to pick the best hypothesis. Rupprecht et al. [25] use a shared architecture to produce multiple hypotheses and train the network by assigning each sample to the closest hypothesis. In contrast to these approaches, we train a single network to produce a multi-modal distribution, from which we can obtain multiple predictions without directly optimizing an oracle loss in training.

III Method

We solve the problem of inferring continuous depth through discrete classification. We first describe how we discretize continuous depth into discrete categories. We then formulate depth estimation as a multi-class classification task (mutually exclusive classes) and as a multi-label (binary) classification task (classes not mutually exclusive). Finally, we discuss the output of our model, i.e. a categorical distribution over discrete depths, and how we evaluate it, both as a standard depth estimation task and as depth estimation with uncertainty.

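Discretization: We discretize continuous depth values in the log space. Given a continuous range of depth $[\alpha, \beta]$, we discretize it into $K$ intervals, i.e. $\{d_1, \dots, d_K\}$, with

$$d_i = \exp\left(\log \alpha + \frac{i}{K} \log \frac{\beta}{\alpha}\right). \tag{1}$$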

This captures the perceptual difference in human visual systems, i.e., we care more about differences in depths of close objects than distant ones. Furthermore, due to sensor sampling effects, we tend to encounter more close points rather than far away ones. Working in log space partially alleviates this class imbalance problem.
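As a minimal sketch of log-space discretization, the snippet below splits a depth range into bins whose edges are uniformly spaced in log depth. The function names, the representation by bin edges with geometric-mean centers, and the example range (1 to 80 m with 64 bins, borrowed from the KITTI range and bin count reported later in the paper) are illustrative assumptions rather than the authors' implementation.

```python
import math

def log_space_edges(d_min, d_max, num_bins):
    """Return num_bins + 1 bin edges uniformly spaced in log depth."""
    log_min, log_max = math.log(d_min), math.log(d_max)
    step = (log_max - log_min) / num_bins
    return [math.exp(log_min + i * step) for i in range(num_bins + 1)]

def depth_to_bin(depth, d_min, d_max, num_bins):
    """Map a continuous depth to a bin index in [0, num_bins - 1]."""
    ratio = math.log(depth / d_min) / math.log(d_max / d_min)
    index = int(ratio * num_bins)
    return min(max(index, 0), num_bins - 1)  # clamp out-of-range depths

def bin_centers(edges):
    """Geometric mean of adjacent edges, i.e. the midpoint in log space."""
    return [math.sqrt(lo * hi) for lo, hi in zip(edges[:-1], edges[1:])]

# Hypothetical KITTI-like setup: 1 m to 80 m, 64 bins.
edges = log_space_edges(1.0, 80.0, 64)
centers = bin_centers(edges)

# Bins near the camera are narrow, bins far away are wide.
print("nearest bin width:", edges[1] - edges[0])
print("farthest bin width:", edges[-1] - edges[-2])
```

Because the edges grow geometrically, nearby depths receive finer resolution than distant ones, which matches the motivation given above.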

Fig. 5: Our network architecture consists of an encoder, a spatial pyramid pooling module, and a decoder. Our encoder is a ResNet-50 truncated before global pooling. The spatial pyramid pooling module takes ResNet features and extracts global and semi-global features through multi-scale pooling. The decoder processes the pooled features to predict an unnormalized score map for each discrete depth class. During training, each unnormalized score map is passed to a per-pixel soft-labeled binary cross-entropy loss; at test time, we perform per-pixel normalization using a softmax across all depth classes to ensure a valid per-pixel distribution over depth, from which we make the final prediction of depth and uncertainty. Original-resolution images are used as input, and the predictions are bilinearly up-sampled to the same resolution as the ground truth.

Multi-class Classification: As a baseline method, we first show how we recast the continuous regression as a multi-class classification problem. A discrete distribution over depth can be parameterized by a categorical distribution $\mathrm{Cat}(K, \mathbf{p})$. We learn to predict the probability $p_i$ of each depth label by minimizing the negative log-likelihood. Since we use the output of a softmax layer as the predicted probability, we will also refer to this variant as “Softmax” in the following text. Given a ground truth label $y \in \{1, \dots, K\}$, image feature $x$, and the model parameters $\theta$, the loss function can be written as

$$L_{\text{softmax}}(x, y; \theta) = -\sum_{i=1}^{K} \mathbb{1}[y = i] \, \log p_i(x; \theta). \tag{2}$$

Here the distribution $\mathbf{p}$ is predicted from a $K$-way multi-class classifier.

Equation (2) gives us the cross-entropy between a one-hot label vector and the predicted distribution $\mathbf{p}$. To incorporate the ordinal nature of the depth labels, i.e., penalize predictions closer to the ground truth less than predictions further away, we replace the one-hot target vector with a discretized Gaussian centered around the ground truth class $y$, i.e.,

$$\hat{p}_i = \frac{1}{Z} \exp\left(-\frac{(i - y)^2}{2\sigma^2}\right), \tag{3}$$

where $Z$ is the partition function.
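To make the effect of soft targets concrete, the following sketch builds a discretized Gaussian target and evaluates the cross-entropy against a softmax output. It assumes a Gaussian over class indices with a hand-picked width; the function names, `sigma`, and the toy score vectors are hypothetical and only illustrate that a near miss is penalized less than a far miss.

```python
import math

def soft_target(gt_index, num_bins, sigma):
    """Discretized Gaussian over class indices, normalized to sum to one."""
    weights = [math.exp(-((i - gt_index) ** 2) / (2.0 * sigma ** 2))
               for i in range(num_bins)]
    z = sum(weights)  # partition function
    return [w / z for w in weights]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, scores):
    """Cross-entropy between a target distribution and softmax(scores)."""
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)

target = soft_target(gt_index=3, num_bins=8, sigma=1.0)

# Two hypothetical predictions: one peaks next to the ground truth,
# the other peaks four bins away.
near_miss = [0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0]
far_miss = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0]

print("near-miss loss:", cross_entropy(target, near_miss))
print("far-miss loss:", cross_entropy(target, far_miss))
```

With a one-hot target both predictions would incur the same loss, since neither places its peak on the correct class; the soft target is what introduces the ordinal preference.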

Binary Classification: To alleviate competition between depth classes, we further model continuous depth as a collection of $K$ independent Bernoulli random variables $\{\mathrm{Bern}(p_i)\}_{i=1}^{K}$, where $p_i$ encodes the probability of falling into the $i$-th depth interval. We also refer to this variant as multilabel in the paper. The loss function is written as

$$L_{\text{binary}} = -\sum_{i=1}^{K} \left[ \tilde{p}_i \log p_i + (1 - \tilde{p}_i) \log (1 - p_i) \right], \tag{4}$$

where $\tilde{p}_i$ is an unnormalized version of the soft target distribution.

One can see this as a relaxation of the training objective from Eq. (2) that drops the constraint that the predicted probabilities sum to one, $\sum_{i=1}^{K} p_i = 1$ [22]. The variance is designed such that for all depth classes within difference to ground truth, their label is greater than . At test time, we push the pre-logit scores of each binary classifier through a softmax and obtain a distribution over discrete depth, as shown in Fig. 5.

Predicting Depth from a Distribution: After obtaining the distribution over depth, Cao et al. [3] report the most confident depth class, ignoring the multi-modal nature of the predicted distribution. In contrast, we report the expected depth under the predicted distribution, $\mathbb{E}[d] = \sum_{i=1}^{K} p_i d_i$, where $p_i$ is the predicted probability of the $i$-th depth class and $d_i$ its depth. This takes the whole distribution into account and yields better depth estimates.
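The difference between the two inference rules can be seen on a toy distribution. The bin depths and probabilities below are made up for illustration, and the helper names are hypothetical.

```python
def most_likely_depth(probs, depths):
    """Depth of the single most probable class."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return depths[best]

def expected_depth(probs, depths):
    """Probability-weighted average over all depth classes."""
    return sum(p * d for p, d in zip(probs, depths))

depths = [2.0, 4.0, 8.0, 16.0]   # hypothetical bin depths in meters
probs = [0.1, 0.5, 0.3, 0.1]     # hypothetical predicted distribution

print("most likely:", most_likely_depth(probs, depths))
print("expected:", expected_depth(probs, depths))
```

The most-likely rule can only return one of the discrete bin depths, while the expectation interpolates between bins according to how the probability mass is spread.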

Uncertainty and Multiple Hypotheses: We now describe various statistics that can be computed from our multimodal distribution, motivated by autonomous robotic perception. Because the perception module of robots needs to be self-aware enough to report potential failures to the downstream planner or online-mapping module when faced with ambiguous scenes, the first statistic is uncertainty, as computed with Shannon entropy:

$$H(\mathbf{p}) = -\sum_{i=1}^{K} p_i \log p_i, \tag{5}$$

where $p_i$ is the predicted probability of the $i$-th depth class.

Secondly, even if the most-likely (or expected) depth of a particular pixel is far away, a robotic motion planner may wish to decrease speed if there is a non-negligible probability that its depth is in the near-field (due to say, a translucent obstacle). As such, our network can directly output multiple depth modes to downstream planners.
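Both statistics can be computed directly from a per-pixel distribution. The sketch below computes Shannon entropy and extracts depth modes as strict local maxima above a probability floor; the mode-finding rule, the `min_prob` value, and the toy distributions are assumptions made for illustration, since the text does not specify how modes are extracted.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def find_modes(probs, min_prob=0.05):
    """Indices of strict local maxima above min_prob, strongest first."""
    modes = []
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else float("-inf")
        right = probs[i + 1] if i < len(probs) - 1 else float("-inf")
        if p >= min_prob and p > left and p > right:
            modes.append(i)
    return sorted(modes, key=lambda i: probs[i], reverse=True)

# Hypothetical pixel on a depth discontinuity: two plausible depths.
bimodal = [0.02, 0.40, 0.05, 0.03, 0.35, 0.15]
# Hypothetical pixel on an unambiguous surface.
unimodal = [0.01, 0.02, 0.94, 0.02, 0.01]

print("bimodal entropy:", entropy(bimodal), "modes:", find_modes(bimodal))
print("unimodal entropy:", entropy(unimodal), "modes:", find_modes(unimodal))
```

A planner could, for instance, treat the nearest returned mode as a conservative obstacle distance whenever more than one mode is present.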

Evaluation: Evaluating the above functionality on a robotic platform is difficult. Instead, to evaluate the quality of uncertainty estimation, we make use of the area under ROC curve (AUC), which is widely used in stereo vision and optical flow [2, 15]. To assess the accuracy of the multi-hypotheses output, we follow past work on MHL [11, 21] and use an “oracle” evaluation protocol where an algorithm is allowed to report back multiple depth predictions, and the best one is chosen to compute the accuracy [11]. We also report standard metrics [4] on depth estimation benchmarks.

Implementation: We follow the architecture of Kuznietsov et al. [19], as shown in Fig. 5. We further add a spatial pyramid pooling module [12] to extract global and semi-global features from the scene. We experimented with different numbers of bins on KITTI. With 32, 64, 96, and 128 bins, our method achieves an absolute relative error (ARE) of 9.34%, 8.61%, 8.60%, and 8.59%, respectively. As the improvement beyond 64 bins is marginal, we pick 64 as the number of bins and use it for all experiments in this paper. Fig. 6 shows the unnormalized soft-target distribution we use when training binary classifiers.

Fig. 6: Soft target distributions for binary classification in log scale (left) and linear scale (right). We plot the soft target centering on the th depth interval.

IV Experiments

We first introduce our experimental setup, including dataset and training details. We then compare to prior estimation methods that reason about uncertainties. Finally, we compare our method with the state-of-the-art on the standard depth estimation task, as well as using multi-hypotheses evaluation [25].

Setup: We test our method on the standard depth estimation benchmarks, including KITTI [8] for outdoor scenes (1-80m) and NYU-v2 [4] for indoor scenes (0.5-10m). On KITTI, we follow Eigen’s split [4] for training and testing. On NYU-v2, we sample k images following [20] for training and test on the official test split.


We first initialize the weights of our ResNet-50 backbone with the ImageNet pre-trained ones. To augment training data, we apply random gamma, brightness, and color shift, as in [10]. We fine-tune the weights with an Adam optimizer with an initial learning rate of and decrease the learning rate with a factor of after epochs. We train our KITTI model for a total of 60 epochs and our NYU-v2 model for a total of 160 epochs. Our experiments are run on a machine with a GeForce GTX Titan X GPU using TensorFlow.

Fig. 7: How well does the predicted uncertainty correlate with the actual depth estimation performance? We first sort all predictions in ascending order of uncertainty. Then we gradually include more predictions for evaluation by increasing the uncertainty threshold (including more uncertain predictions in the evaluation). The X-axis represents the percentage of pixels we include and Y-axis represents the ARE on the selected pixels. Notice uncertainties estimated by the model trained with multi-class classification loss (“Softmax” [3]) are not well correlated with error, especially for the most confident pixels. On the contrary, the error increases monotonically as confidence drops for our proposed approach (“Binary”). At 80%, our method also achieves a lower error rate (5.4% vs. 5.6%).

IV-A Depth Estimation with Uncertainty

Baselines: Since most prior work does not reason about uncertainty, we compare to predictive Gaussian and predictive Gaussian with Monte Carlo dropout (Gaussian-dropout) [7, 17] in terms of depth estimation with uncertainty, as shown in Tab. I. For a fair comparison, we re-implement and train predictive Gaussian and Gaussian-dropout on KITTI and NYU-v2. We make sure the re-implemented versions have an architecture that is as close as possible to ours. For predictive Gaussian, we use the same backbone architecture but with a different prediction head, which predicts the mean and variance of a Gaussian distribution over depth in log space. To train predictive Gaussian, we minimize the per-batch negative log-likelihood based on the predicted mean and variance. For Gaussian-dropout, we use the same backbone architecture and prediction head, except that we perform dropout with a probability of 0.5 after several convolutional layers, as in Kendall et al. [16]. During inference, we draw 32 samples to make predictions and estimate uncertainty. Following the same idea, we apply Monte Carlo dropout to our binary model, referred to as Binary-dropout.
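For reference, the per-pixel objective of a predictive Gaussian baseline can be sketched as the negative log-likelihood of the ground-truth log depth under the predicted mean and variance. Parameterizing the variance through its logarithm is an assumption made here for numerical convenience; the baseline's exact parameterization is not stated in the text, and the function name and toy values are hypothetical.

```python
import math

def gaussian_nll(log_depth_gt, mean, log_var):
    """Negative log-likelihood of a Gaussian over log depth.

    log_var is the log of the predicted variance (an assumed
    parameterization that keeps the variance positive).
    """
    var = math.exp(log_var)
    squared_error = (log_depth_gt - mean) ** 2
    return 0.5 * (math.log(2.0 * math.pi) + log_var + squared_error / var)

gt = math.log(10.0)          # ground truth: 10 m
wrong_mean = math.log(20.0)  # prediction: 20 m

# A confident wrong prediction is penalized far more than an uncertain one.
print("confident and wrong:", gaussian_nll(gt, wrong_mean, log_var=-4.0))
print("uncertain and wrong:", gaussian_nll(gt, wrong_mean, log_var=0.0))
```

The single mean and variance per pixel are what make this baseline unimodal: it can widen its uncertainty at a depth discontinuity, but it cannot place mass on two separate depths.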

Following Hu et al. [15], we plot ROC curves to evaluate our depth estimation with uncertainty, as shown in Fig. 7 and Fig. 8. Such curves demonstrate how well the predicted uncertainty correlates with the actual depth estimation performance. A point $(x, y)$ on the curve indicates a performance of $y$ on the least uncertain $x$ (%) predictions over all pixels in the test set. Perfect uncertainty estimation, from the perspective of the ROC curve, should rank predictions as if they were ranked by the actual error. As a reference, we include curves with such an oracle w.r.t. a specific error metric (absolute relative error or ARE). Below, we first compare two variants of our model (binary classification and multiclass classification). We then compare our model to prior art that predicts uncertainty (predictive Gaussian and Gaussian-dropout). For each sub-metric under AUC, we follow the definition in Eigen et al. [4].
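The curve described above can be sketched by sorting pixels by predicted uncertainty and averaging the error over progressively larger subsets. The trapezoidal integration, the choice of fractions, and the toy per-pixel errors are illustrative assumptions; the exact AUC computation used in the paper is not specified here.

```python
def error_vs_kept_fraction(errors, uncertainties, fractions):
    """Mean error over the least-uncertain fraction of pixels, per fraction."""
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    curve = []
    for f in fractions:
        keep = max(1, int(round(f * len(order))))
        kept = order[:keep]
        curve.append(sum(errors[i] for i in kept) / keep)
    return curve

def area_under_curve(fractions, values):
    """Trapezoidal area under the error curve (lower is better)."""
    area = 0.0
    for k in range(1, len(fractions)):
        width = fractions[k] - fractions[k - 1]
        area += 0.5 * width * (values[k] + values[k - 1])
    return area

errors = [0.01, 0.02, 0.05, 0.10, 0.30]   # hypothetical per-pixel ARE
good_unc = [0.1, 0.2, 0.3, 0.4, 0.5]      # uncertainty ranks pixels by true error
bad_unc = [0.5, 0.4, 0.3, 0.2, 0.1]       # uncertainty ranks pixels in reverse
fractions = [0.2, 0.4, 0.6, 0.8, 1.0]

good_curve = error_vs_kept_fraction(errors, good_unc, fractions)
bad_curve = error_vs_kept_fraction(errors, bad_unc, fractions)
oracle_curve = error_vs_kept_fraction(errors, errors, fractions)

print("AUC, well-ranked:", area_under_curve(fractions, good_curve))
print("AUC, badly ranked:", area_under_curve(fractions, bad_curve))
```

Both curves end at the same value, the mean error over all pixels; only the ranking quality determines how low the curve stays before that point, and ranking by the true error gives the oracle reference.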

Fig. 8: Compared to predictive Gaussian [17] (“Gaussian”), our method (“Binary”) yields a lower error rate when more than pixels are kept for KITTI, and more than pixels for NYU. By applying Monte Carlo dropout, both predictive Gaussian (“Gaussian-dropout”) and our approach (“Binary-dropout”) see a significant improvement on NYU. On KITTI, however, the performance gets strictly worse for predictive Gaussian.

Binary classification vs Multiclass classification: In Fig. 7, we compare the model trained with binary classification loss (“Binary”) to the model trained with multi-class classification loss (“Softmax”). As we can see on the left side of both plots, the uncertainty predicted by the multi-class classifier does not correlate well with the actual error rate, especially for those least uncertain (or most confident) pixels. In contrast, the model trained with binary classification loss produces a curve that monotonically increases as the uncertainty threshold goes up, because it is able to correctly rank more correct pixels as more confident. We posit that our multilabel loss (that removes a linear constraint present in the multi-class formulation) acts as an additional regularizer that improves uncertainty estimation.

Gaussian vs Binary: In Fig. 8, we find that predictive Gaussian also yields reliable uncertainty estimation, as it produces a monotonically increasing curve. Overall, it performs slightly worse than our model trained with binary classification. This might be due to its unimodal assumption and optimization difficulties at training time (discussed further in our ablation study). Interestingly, adding Monte Carlo dropout significantly improves NYU performance for both predictive Gaussian (“Gaussian-dropout”) and our approach (“Binary-dropout”). However, on KITTI, we see strictly worse performance for predictive Gaussian.

Quantitative evaluation: In Tab. I, we further compare uncertainty estimation quantitatively using metrics introduced in Section III. Our binary classification method produces better performance in terms of AUC compared to predictive Gaussian and its Monte Carlo dropout variant in terms of ARE and , without expensive Monte Carlo sampling. By adding Monte Carlo dropout to our model, we can further improve AUC of ARE, RMSE and on NYU depth v2. Although predictive Gaussian with Monte Carlo dropout outperforms our binary loss on all metrics based on RMSE, it is too slow for real-time perception. Please refer to Tab. I for more detailed discussion.

                                     AUC
    Method                   ARE     RMSE             time (ms)
K   Gaussian [17]            4.38    1.42     2.63       64
    Softmax                  5.19    2.88     2.93       74
    Binary                   4.17    1.33     1.79       74
    Gaussian-dropout [17]    5.18    1.21     3.61      467
    Binary-dropout           4.20    1.33     2.06      540
N   Gaussian [17]           10.94    0.41    10.95       44
    Softmax                 11.17    0.53    11.09       52
    Binary                  10.28    0.42     9.26       52
    Gaussian-dropout [17]   10.33    0.32    10.30      353
    Binary-dropout           9.39    0.40     7.79      410
TABLE I: Quantitative evaluation for uncertainty estimation on KITTI (K) and NYU-v2 (N). The best results among methods without Monte Carlo dropout are made bold, while the best considering Monte Carlo dropout are underlined. On both datasets, we compare our method trained with the binary loss (“Binary”) and the multiclass loss (“Softmax”) to predictive Gaussian [17] (“Gaussian”). The quantitative results are consistent with Fig. 7 and Fig. 8. In terms of AUC on ARE and  (the lower the better), our binary loss consistently outperforms predictive Gaussian on both KITTI and NYU-v2. Importantly, when combined with Monte Carlo dropout, our binary model (“Binary-dropout”) further reduces the AUC on NYUv2.
    Method                 ARE (↓)   RMSE (↓)   δ<1.25 (↑)   time (ms)
K   Binary                   8.9       3.85       90.7          74
    Fu et al. [5]            9.1       3.90       90.5          74
    Cao et al. [3]           9.3       4.02       90.8          74
    Eigen et al. [4]        19.0       7.16       69.2          13
    Godard et al. [10]      11.4       4.94       86.1          35
    Cao et al. [3]          11.5       4.71       88.7           -
    Fu et al. [5]            7.2       2.73       93.2        1250
N   Binary                  14.2       0.51       82.7          52
    Binary-dropout          13.9       0.50       82.8         410
    Kendall et al. [17]     14.4       0.51       81.5         353
    Eigen et al. [4]        15.8       0.64       76.9          10
    Laina et al. [20]       12.7       0.57       81.1          55
    Fu et al. [5]           11.5       0.51       82.8           -
    Kendall et al. [17]     11.0       0.51       81.7        7500
TABLE II: Performance on KITTI (K) Eigen’s split and NYU-v2 depth (N) dataset. The best results over the light-weight setup are bolded, while the best results overall are underlined. On KITTI, our method outperforms the state-of-the-art Fu et al. [5] under the same setup. With its original setup (a heavy-weight backbone and test-time ensemble), [5] runs nearly 17x slower (1250 ms vs 74 ms). On NYU-v2, our method outperforms Kendall et al. [17] with the same backbone network. With its original setup, Kendall et al. [17] runs 144x slower. Our method further improves when training with dropout and testing with MC sampling [16], referred to as Binary-dropout.

IV-B Multi-hypothesis Depth Prediction

We first evaluate standard depth prediction performance on KITTI and NYU-v2 using metrics proposed in [4], as shown in Tab. II. We then extend the evaluation by allowing multiple depth hypotheses. For a fair comparison, we re-implement Fu et al. [5] and Cao et al. [3] under the same setup as ours (a light-weight backbone and no test-time ensemble). We also include numbers in the original paper as a reference. Please refer to Tab. II for detailed comparison.

Fig. 9: Error as a function of hypotheses number on KITTI. Compared to MHL, our method always produces better results in terms of ARE. As for RMSE, our method performs worse than MHL when , possibly because the MHL baseline is trained to directly minimize squared error. However, MHL’s error stops going down after , while we do not observe this effect for our model. Compared to softmax (Cao et al. [3]), our method also achieves slightly better performance. Also, our method consistently out-performs Fu et al. [5] in ARE using more than two hypotheses, and in terms of RMSE using more than five hypotheses.

To evaluate our multi-modal distributions, we follow the standard protocol in multi-hypothesis learning [21]. After computing the pre-logits scores, we report back depth hypotheses with the highest scores, and the one with the lowest error is selected by the oracle for evaluation.
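The oracle protocol can be sketched as follows: take the highest-scoring depth classes as hypotheses and score only the one closest to the ground truth. The function names, the toy scores, and the use of absolute relative error as the per-pixel metric are illustrative assumptions.

```python
def top_hypotheses(scores, depths, num_hypotheses):
    """Depths of the num_hypotheses highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [depths[i] for i in ranked[:num_hypotheses]]

def oracle_abs_rel_error(hypotheses, gt_depth):
    """Absolute relative error of the best hypothesis, chosen by the oracle."""
    return min(abs(h - gt_depth) / gt_depth for h in hypotheses)

depths = [2.0, 4.0, 8.0, 16.0]   # hypothetical bin depths in meters
scores = [0.1, 2.0, 0.3, 1.5]    # hypothetical per-class scores
gt_depth = 15.0

for m in (1, 2, 3):
    hyps = top_hypotheses(scores, depths, m)
    print(m, hyps, oracle_abs_rel_error(hyps, gt_depth))
```

By construction the oracle error can only stay flat or decrease as more hypotheses are allowed, which is why the comparison is reported as a function of the number of hypotheses.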

Since most methods can’t output multiple hypotheses, we compare to the ones that can be trained to output multiple hypotheses [25], referred to as MHL. Similar to traditional regression, we directly regress to the depth in log space. However in training time, we make predictions and construct an oracle loss by selecting the prediction that best describes the ground-truth in terms of distance. We train the MHL baseline for , and use an oracle to select the best prediction for evaluation. Please see Fig. 9 for analysis of the results.

V Building Maps with Uncertainty

In this section, we demonstrate one application of geometric uncertainty estimation: robust map reconstruction. Though maps are often constructed in an offline stage, online mapping can be an integral part of autonomous navigation in unknown/changing environments [24].

In practice, it is notoriously difficult to build 3D maps from raw depth predictions because they tend to contain “streak-like artifacts” [1], which not only affect the quality of the map but also increase the memory usage (because they often result in larger occupied volumes). Empirically, we find that such artifacts often occur where the ground truth depth is inherently ambiguous and follows a multi-modal distribution, e.g. at depth discontinuities and reflective surfaces. Since our depth estimator is designed to predict multi-modal distributions over depth, we use it to improve the accuracy of map reconstruction. By simply thresholding the uncertainty of each pixel’s predicted distribution, we can significantly reduce streak artifacts and memory usage, as shown in Fig. 2.

We evaluate the performance of map reconstruction with and without uncertainty on KITTI odometry sequence-00 [9], which is not included in the training set. Specifically, we run our monocular depth estimator on left RGB images, and feed the output depth maps together with ground-truth odometry as the input of OctoMap [14]. The accuracy is measured as the percentage of correctly mapped cells, where a cell counts as correctly mapped if it has the same state (free or occupied) as the LiDAR map (ground truth). As shown in Tab. III, applying a simple uncertainty-based ranking and selection improves the accuracy of monocular maps by 1.6 percentage points (from 88.3% to 89.9%) and reduces the memory usage by 25%.
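The uncertainty-based ranking and selection step can be sketched as a simple filter that keeps the most confident fraction of pixels before the depth map is passed to the mapping stage. The function name, the use of `None` to mark dropped pixels, and the toy values are assumptions; the hand-off to the mapping library is omitted.

```python
def keep_most_confident(depths, entropies, keep_fraction=0.8):
    """Keep the keep_fraction least-uncertain pixels; mask the rest with None."""
    if len(depths) != len(entropies):
        raise ValueError("depths and entropies must have the same length")
    num_keep = int(round(keep_fraction * len(depths)))
    order = sorted(range(len(depths)), key=lambda i: entropies[i])
    kept = set(order[:num_keep])
    return [d if i in kept else None for i, d in enumerate(depths)]

depths = [5.0, 7.5, 12.0, 30.0, 9.0]    # hypothetical per-pixel depths (m)
entropies = [0.2, 0.3, 1.9, 0.4, 0.1]   # hypothetical per-pixel entropies

filtered = keep_most_confident(depths, entropies, keep_fraction=0.8)
print(filtered)
```

Only the surviving pixels would be back-projected into 3D, so ambiguous pixels at depth discontinuities never reach the map, which is the mechanism behind the reduction in streak artifacts described above.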

Method             Accuracy (%)   Memory (MB)
LiDAR-FOV              95.9         1220.9
Ours-binary            88.3         1682.6
Ours-binary-80%        89.9         1263.2
TABLE III: Accuracy and memory usage of online mapping. LiDAR-FOV indicates the map built using LiDAR points in the left camera field of view, which is the upper-bound of our methods. The map built with top 80% most confident estimations of our model (Ours-binary-80%) significantly reduces the memory usage and also improves the mapping accuracy.


VI Conclusion

Robotic applications of perception present new challenges for safety-critical, fault-tolerant operation. Inspired by past approaches that advocate a probabilistic Bayesian perspective, we demonstrate a simple but effective strategy of discretization (with the appropriate quantization, smoothing, and training scheme) as a mechanism for generating detailed predictions that support such safety-critical operations.

Appendix A Supplementary material

A-A Ablation study

To reveal the contribution of each design choice to the accuracy of the standard depth estimation task, we perform an extensive ablation study as shown in Tab. V.

Classification vs Regression: We first compare the regression loss to the classification losses (binary and multiclass). Classification consistently outperforms regression in terms of absolute relative error and δ < 1.25. However, regression achieves competitive RMSE, likely because it directly minimizes squared error. We also implement the BerHu [20] regression loss, which is still clearly outperformed by the classification-based methods.

Multiclass vs Binary classification: On KITTI, training with the binary classification loss performs similarly to training with the multiclass classification loss. On NYU, however, it yields significantly better results. Since the test images in NYU differ more from the training images than those in KITTI, we posit that the binary classification loss generalizes better than the multiclass classification loss.
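The difference between the two classification losses can be illustrated per pixel as follows. This is a sketch under simple assumptions: the multiclass loss is taken to be softmax cross-entropy over depth bins, and the binary loss a sum of independent per-bin sigmoid cross-entropies; the exact weighting and reduction used in training are not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multiclass_loss(logits, target_bin):
    """Softmax cross-entropy: bins compete for one unit of probability mass."""
    return -math.log(softmax(logits)[target_bin])


def binary_loss(logits, targets):
    """Sum of independent per-bin binary cross-entropies (targets in [0, 1])."""
    loss = 0.0
    for v, t in zip(logits, targets):
        p = sigmoid(v)
        loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss


# Toy logits for one pixel with two plausible depths (assumed values).
logits = [2.0, -1.0, 2.0, -3.0]
bin_scores = [sigmoid(v) for v in logits]  # un-normalized: each bin scored alone
bin_probs = softmax(logits)                # normalized: sums to one
```

With per-bin sigmoids both modes can receive a high score at once, whereas softmax forces them to share probability mass.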

Effect of Monte Carlo dropout: On KITTI, Monte Carlo dropout degrades prediction performance for both the binary classification method and the predictive Gaussian. On NYU, however, it improves results for both methods. This is possibly because NYU contains more diverse scenes, where dropout helps prevent overfitting, whereas on KITTI the training and testing data are highly correlated, so regularizing the model with dropout does not help.
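For readers unfamiliar with the technique, Monte Carlo dropout keeps dropout active at test time and aggregates several stochastic forward passes. The sketch below uses a toy linear "network" purely for illustration; the layer, drop probability, and sample count are assumptions and are unrelated to the architecture evaluated here.

```python
import random


def stochastic_forward(features, weights, drop_prob, rng):
    """One forward pass of a toy linear layer with dropout left switched on."""
    keep = 1.0 - drop_prob
    out = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            out += (x / keep) * w  # inverted-dropout rescaling of kept inputs
    return out


def mc_dropout_predict(features, weights, drop_prob=0.5, n_samples=200, seed=0):
    """Mean and variance of the output over n_samples stochastic passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(features, weights, drop_prob, rng)
               for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    variance = sum((s - mean) ** 2 for s in samples) / n_samples
    return mean, variance


features = [1.0, 2.0, 3.0]
weights = [0.5, -0.2, 1.0]
mc_mean, mc_var = mc_dropout_predict(features, weights)
det_mean, det_var = mc_dropout_predict(features, weights, drop_prob=0.0)
```

The sample variance serves as a proxy for model uncertainty; with `drop_prob=0.0` every pass is identical and the variance collapses to zero.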

Expectation vs Most-likely class inference: On KITTI, expectation yields better results on all metrics except δ < 1.25. On NYU, expectation always outperforms or matches the most-likely class. This indicates that expectation is a better way of making a prediction from a depth distribution, since it makes use of the whole distribution.
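The two inference rules compared above can be sketched as follows, assuming each depth bin is represented by a centre value and the per-bin scores are non-negative. The bin centres and scores are toy values chosen for illustration.

```python
def most_likely_depth(scores, centers):
    """Depth of the single highest-scoring bin."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return centers[best]


def expected_depth(scores, centers):
    """Score-weighted mean depth, normalizing the scores first."""
    total = sum(scores)
    return sum(s * c for s, c in zip(scores, centers)) / total


centers = [5.0, 15.0, 25.0, 35.0]  # bin centres in metres (assumed)
scores = [0.5, 0.1, 0.0, 0.4]      # a bimodal pixel (assumed)

mode_estimate = most_likely_depth(scores, centers)
mean_estimate = expected_depth(scores, centers)
```

On a bimodal pixel like this one the two estimates differ substantially (5 m versus 18 m here), which is why the choice of inference rule affects the reported metrics.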

Soft targets vs One-hot targets: Training with a soft-target distribution always performs better than training with one-hot labels. We posit that, by training with soft targets, our model benefits from sample sharing.
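To make the distinction concrete, the sketch below builds both kinds of target for a single ground-truth depth. The Gaussian-shaped kernel, its width `sigma`, and the normalization to unit sum are assumptions chosen for illustration; the smoothing scheme actually used for training may differ.

```python
import math


def one_hot_target(depth, centers):
    """All mass on the bin whose centre is nearest to the true depth."""
    nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - depth))
    return [1.0 if i == nearest else 0.0 for i in range(len(centers))]


def soft_target(depth, centers, sigma):
    """Mass spread over nearby bins with a Gaussian-shaped kernel (assumed)."""
    weights = [math.exp(-0.5 * ((c - depth) / sigma) ** 2) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]


centers = [5.0, 15.0, 25.0, 35.0]  # bin centres in metres (assumed)
hard = one_hot_target(14.0, centers)
soft = soft_target(14.0, centers, sigma=10.0)
```

With the soft target, bins adjacent to the true depth also receive non-zero supervision, which is one way to read the "sample sharing" effect mentioned above.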

Test dataset   Abs Rel (↓)   RMSE (↓)   δ < 1.25 (↑)
KITTI          9.9           3.969      89.1
NYU-v2         15.3          0.541      80.3
TABLE IV: Results of using a mixed KITTI and NYU-v2 dataset for training. The model is trained with binary classification loss and predicts the most-likely class at test time.

A-B Training on mixed KITTI and NYU-v2

To obtain a robust model that works for both indoor and outdoor scenes, we train a single model on KITTI and NYU-v2. To precisely capture the full depth range in both datasets, we adjust the depth range to m to m and the number of depth intervals to . At training time, we randomly crop the data to and average the loss over each image before averaging over the whole batch. As shown in Tab. IV, joint training does not severely affect the performance of our model on either dataset.
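The two-stage averaging described above can be sketched as follows. The example assumes that each image contributes only the pixels with valid ground truth, so images may have different pixel counts; that motivation and the function names are assumptions, not statements from the text.

```python
def per_image_then_batch(per_pixel_losses):
    """Average within each image first, then across the batch."""
    image_means = [sum(img) / len(img) for img in per_pixel_losses]
    return sum(image_means) / len(image_means)


def pooled_average(per_pixel_losses):
    """Average over all pixels in the batch at once, for comparison."""
    flat = [v for img in per_pixel_losses for v in img]
    return sum(flat) / len(flat)


# Toy batch: a sparsely labelled image (2 valid pixels) and a densely
# labelled one (8 valid pixels); the loss values are assumed.
batch = [[1.0, 1.0], [0.0] * 8]
two_stage = per_image_then_batch(batch)
pooled = pooled_average(batch)
```

Under this assumption the per-image average gives each image equal weight regardless of how many labelled pixels it has, whereas pooling would let the densely labelled image dominate.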

Acknowledgements This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research.


  • [1] I. A. Bârsan, P. Liu, M. Pollefeys, and A. Geiger (2018) Robust dense mapping for large-scale dynamic environments. In ICRA, Cited by: §V.
  • [2] A. Bruhn and J. Weickert (2006) A confidence measure for variational optic flow methods. In Geometric Properties for Incomplete Data, pp. 283–298. Cited by: §III.
  • [3] Y. Cao, Z. Wu, and C. Shen (2017) Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: TABLE V, §I, §II, §III, Fig. 7, Fig. 9, §IV-B, TABLE II.
  • [4] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In NIPS, Cited by: §II, §III, §IV-A, §IV-B, TABLE II, §IV.
  • [5] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In CVPR, Cited by: TABLE V, §II, Fig. 9, §IV-B, TABLE II.
  • [6] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In ICML, Cited by: §I, §II.
  • [7] J. Gast and S. Roth (2018) Lightweight probabilistic deep networks. In CVPR, Cited by: §II, §IV-A.
  • [8] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. IJRR. Cited by: §IV.
  • [9] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, Cited by: §V.
  • [10] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR, Cited by: TABLE II, §IV.
  • [11] A. Guzman-Rivera, D. Batra, and P. Kohli (2012) Multiple choice learning: learning to produce multiple structured outputs. In NIPS, Cited by: §II, §III.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, Cited by: §III.
  • [13] D. Hoiem, A. A. Efros, and M. Hebert (2005) Geometric context from a single image. In ICCV, Cited by: §II.
  • [14] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard (2013) OctoMap: an efficient probabilistic 3d mapping framework based on octrees. Autonomous robots 34 (3), pp. 189–206. Cited by: Fig. 2, §V.
  • [15] X. Hu and P. Mordohai (2012) A quantitative evaluation of confidence measures for stereo vision. TPAMI. Cited by: §III, §IV-A.
  • [16] A. Kendall, V. Badrinarayanan, and R. Cipolla (2015) Bayesian segnet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680. Cited by: §IV-A, TABLE II.
  • [17] A. Kendall and Y. Gal (2017) What uncertainties do we need in bayesian deep learning for computer vision? In NIPS, Cited by: TABLE V, §I, §II, Fig. 8, §IV-A, TABLE I, TABLE II.
  • [18] P. Koopman and M. Wagner (2016) Challenges in autonomous vehicle testing and validation. SAE International Journal of Transportation Safety 4 (1), pp. 15–24. Cited by: §I.
  • [19] Y. Kuznietsov, J. Stückler, and B. Leibe (2017) Semi-supervised deep learning for monocular depth map prediction. In CVPR, Cited by: §III.
  • [20] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In 3DV, Cited by: §A-A, TABLE V, §II, TABLE II, §IV.
  • [21] S. Lee, S. P. S. Prakash, M. Cogswell, V. Ranjan, D. Crandall, and D. Batra (2016) Stochastic multiple choice learning for training diverse deep ensembles. In NIPS, Cited by: §II, §III, §IV-B.
  • [22] M. Li, L. Jeni, and D. Ramanan (2018) Brute-force facial landmark analysis with a 140,000-way classifier. AAAI. Cited by: §III.
  • [23] F. Liu, C. Shen, G. Lin, and I. D. Reid (2016) Learning depth from single monocular images using deep convolutional neural fields.. TPAMI. Cited by: §II.
  • [24] T. Ort, L. Paull, and D. Rus (2018) Autonomous vehicle navigation in rural environments without detailed prior maps. In ICRA, Cited by: §V.
  • [25] C. Rupprecht, I. Laina, R. DiPietro, M. Baust, F. Tombari, N. Navab, and G. D. Hager (2017) Learning in an uncertain world: representing ambiguity through multiple hypotheses. In ICCV, Cited by: §II, §IV-B, §IV.
  • [26] A. Saxena, S. H. Chung, and A. Y. Ng (2006) Learning depth from single monocular images. In NIPS, Cited by: §II.
  • [27] S. Shalev-Shwartz, S. Shammah, and A. Shashua (2017) On a formal model of safe and scalable self-driving cars. CoRR abs/1708.06374. Cited by: §I.
  • [28] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe (2017) Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In CVPR, Cited by: §II.
  • [29] Z. Yang, F. Gao, and S. Shen (2017) Real-time monocular dense mapping on aerial robots using visual-inertial fusion. In ICRA, Cited by: §I.