Regression problems arise in many real-world machine learning tasks. To name just a few: depth estimation from a single image Eigen04, object localization, and acoustic localization Vera-Diaz18. Many of these tasks are solved by deep neural networks used within decision-making pipelines, which require the machine learning block not only to predict the target but also to output its confidence in the prediction. For example, the commonly used Kalman filter tracking algorithm Blackman14 requires a variance estimate for the observed object's location. In addition, we may want the system to output a final uncertainty, reflecting real-world empirical probabilities, to allow a safety-critical system such as a self-driving car agent to take appropriate actions when confidence drops. In practice, using the confidence in the localization of objects has been shown to improve the non-maximal suppression stage and consequently the overall detection performance He18. Similarly, Feng18 describe a probabilistic 3D vehicle detector for Lidar point clouds that can model both classification and spatial uncertainty.
To provide uncertainty estimation, each prediction produced by the machine learning module during inference should be a distribution over the target domain. There are several approaches for achieving this: Bayesian neural networks GalThesis16 ; Gal_Ghahramani16, ensembles Lakshminarayanan17, and outputting a parametric distribution directly Nix94. Bayesian neural networks place a probability distribution over the network parameters, which is translated to an uncertainty in the prediction, providing a technically sound approach but with overhead at inference time. The direct approach either uses the existing confidence values of output neurons in classification Niculescu_Caruana05, or adds outputs that represent distributions to existing networks Nix94. Note that the direct approach naturally captures the aleatoric uncertainty (inherent observation noise), but captures less of the epistemic uncertainty (uncertainty in the model) Kendall_Gal17. As a test case for our calibration method we chose the direct approach for producing uncertainty: we transform the network output from a single scalar to a Gaussian distribution by taking the scalar as the mean and adding a branch that predicts the standard deviation (STD), as in Lakshminarayanan17. While this is the simplest form, it is commonly used in practice, and our analysis is applicable to more complex distributions as well as to other approaches such as Bayesian neural networks and ensembles.
Adjusting the output distributions to match the observed empirical ones via a post-processing step is called uncertainty calibration. It was shown that modern deep networks tend to be overconfident in their predictions Guo17. The same study revealed that for classification tasks, Platt Scaling Platt99, a simple constant scaling of the pre-activation of the last layer, achieves well-calibrated confidence estimates Guo17. In this paper we show that a similar simple scaling strategy, applied to the standard deviations of the output distributions, can calibrate regression algorithms as well.
One major question is how to define calibration for regression, where the model outputs a continuous distribution over possible predictions. In recent work, KuleshovFE18 suggested a definition based on credible intervals where, if we take the $p$ percentiles of each predicted distribution, the output should fall below them for exactly $p$ percent of the data. Based on this definition the authors further suggested a calibration evaluation metric and a re-calibration method. While this seems very sensible and has the advantage of considering the entire distribution, we found serious flaws in this definition. The main problem arises from averaging over the whole dataset. We show, both empirically and analytically, that using this evaluation metric one can calibrate practically any output distribution, even one which is entirely uncorrelated with the empirical uncertainty, as can be seen in Fig. 1. We elaborate on this property of the evaluation method described in KuleshovFE18 in Section 2 and show empirical evidence in Section 4.
We further propose a new simple definition for calibration for regression, which is closer to the standard one for classification. Calibration for classification can be viewed as expecting the output for every single data point to correctly predict its error, in terms of misclassification probability. In a similar fashion, we define calibration for regression by simply replacing the misclassification probability with the mean square error. Based on this definition, we propose a new calibration evaluation metric similar to the Expected Calibration Error (ECE) Naeini15 , which groups examples into interval bins with similar uncertainty, and then measures the discrepancy between each bin’s parameters and the parameters of the empirical distribution within the bin. An additional dispersion measure completes our set of diagnostic tools by revealing cases where the individual uncertainty outputs are uninformative as they all return similar values.
Finally, we propose a calibration method where we re-adjust the predicted uncertainty, in our case the output Gaussian variance, by minimizing the negative log-likelihood (NLL) on a separate re-calibration set. We show good calibration results on a real-world dataset using a simple parametric model which scales the uncertainty by a constant factor. As opposed to KuleshovFE18, we show that our approach cannot calibrate predicted uncertainty that is uncorrelated with the real uncertainty, as one would expect.
1.1 Related Work
While shallow neural networks are typically well calibrated Niculescu_Caruana05, modern deep networks, albeit superior in accuracy, are no longer calibrated Guo17. Uncertainty calibration for classification is a relatively well-studied field. Calibration plots, or reliability diagrams, provide a visual representation of uncertainty prediction calibration DeGroot83 ; Niculescu_Caruana05 by plotting expected sample accuracy as a function of confidence. Confidence values are grouped into interval bins to allow computing the sample accuracy. A perfect model corresponds to the plot of the identity function. The Expected Calibration Error (ECE) Naeini15 summarizes the reliability diagram by averaging the error (the gap between confidence and accuracy) in each bin, producing a single-value measure of the calibration. Similarly, the Maximum Calibration Error (MCE) Naeini15 measures the maximal gap. Negative Log Likelihood (NLL) is a standard measure of a model's fit to the data Friedman01, but it combines both the accuracy of the model and its uncertainty estimation in one measure. Based on these measures, several calibration methods were proposed, which transform the network's confidence output to one that will produce a calibrated prediction. Non-parametric transformations include Histogram Binning Zadrozny_Elkan01, Bayesian Binning into Quantiles Naeini15, and Isotonic Regression Zadrozny_Elkan01, while parametric transformations include versions of Platt Scaling Platt99 such as Matrix Scaling and Temperature Scaling Guo17. In Guo17 it is demonstrated that simple Temperature Scaling, a model with a single scaling parameter that multiplies the last-layer logits, suffices to produce excellent calibration on many classification datasets.
In comparison with classification, calibration of uncertainty prediction in regression has received little attention so far. As already described, KuleshovFE18 propose a practical method for evaluation and calibration based on confidence intervals and isotonic regression. The proposed method is applied in the context of Bayesian neural networks. In recent work Phan18, the authors follow the definition and calibration method of KuleshovFE18 for regression, but use a scatter plot of standard deviation vs. MSE, somewhat similar to our approach, as a sanity check.
2 Confidence-intervals based calibration
We next review the method for regression uncertainty calibration proposed in KuleshovFE18 which is based on confidence intervals, and highlight its shortcomings. We refer to this method in short as the “interval-based” calibration method. We start by introducing basic notations for uncertainty calibration used throughout the paper.
Notations. Let $X, Y \sim \mathbb{P}$ be two random variables jointly distributed according to $\mathbb{P}$, and $\mathcal{X} \times \mathcal{Y}$ their corresponding domains. A dataset $\{(x_t, y_t)\}_{t=1}^{T}$ consists of i.i.d. samples of $X, Y$. A forecaster $H : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ outputs per example $x_t$ a distribution $p_t \equiv H(x_t)$ over the label space, where $\mathcal{P}(\mathcal{Y})$ is the set of all distributions over $\mathcal{Y}$. In classification tasks, $\mathcal{Y}$ is discrete and $p_t$ is a multinomial distribution, and in regression tasks, in which $\mathcal{Y}$ is a continuous domain, $p_t$ is usually a parametric probability density function, e.g. a Gaussian. For regression, we denote by $F_t : \mathcal{Y} \to [0, 1]$ the CDF corresponding to $p_t$.
According to KuleshovFE18, a forecaster in a regression setting is calibrated if:
$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{I}\{y_t \le F_t^{-1}(p)\} \to p \quad \forall p \in [0, 1] \tag{1}$$
as $T \to \infty$. Intuitively this means that $y_t$ is smaller than $F_t^{-1}(p)$ with probability approximately $p$, or that the predicted CDF matches the empirical one as the dataset size goes to infinity. In our setting a sufficient condition is:
$$\mathbb{P}_{X,Y}\left(Y \le F_X^{-1}(p)\right) = p \quad \forall p \in [0, 1] \tag{2}$$
where $F_X$ represents the CDF corresponding to $H(X)$. This notion is translated by KuleshovFE18 to a practical evaluation and calibration methodology. A re-calibration dataset $S = \{(x_t, y_t)\}_{t=1}^{T}$ is used to compute the empirical CDF value $\hat{P}(p)$ for each predicted CDF value $p = F_t(y_t)$:
$$\hat{P}(p) = \frac{\left|\{y_t \mid F_t(y_t) \le p,\ t = 1, \dots, T\}\right|}{T} \tag{3}$$
The calibration consists of fitting a regression function $R$ (e.g. isotonic regression) to the set of points $\{(F_t(y_t), \hat{P}(F_t(y_t)))\}_{t=1}^{T}$. For diagnosis the authors suggest a calibration plot of $\hat{P}(p)$ at equally spaced values of $p$.
We start by intuitively explaining the basic limitation of this methodology. From Eq. 3, $\hat{P}$ is non-decreasing and therefore isotonic regression finds a perfect fit. Therefore, the modified CDF $\tilde{F}_t = R \circ F_t$ will satisfy $\hat{P}(p) = p$ on the re-calibration set, and the new forecaster is calibrated up to sampling error. This means that perfect calibration is possible no matter what the CDF output is, even for output CDFs which are statistically independent of the actual empirical uncertainty. We note that this might be acceptable when the uncertainty prediction is degenerate, e.g. all output distributions are Gaussian with the same variance, but this is not the case here. We also note that the issue is with the calibration definition and not with the re-calibration, as we show with the following analytic example.
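The limitation described above can be illustrated with a small numerical sketch. It is not the authors' code: the variable names, the sample size, the uniform range of the random STDs, and the use of the empirical CDF of the predicted CDF values as the recalibration map (in place of a fitted isotonic regression, which would reproduce this monotone map on the re-calibration set) are all assumptions made for illustration.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gaussian_cdf(y, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated element-wise at y."""
    return 0.5 * (1.0 + _erf((y - mu) / (sigma * math.sqrt(2.0))))

def empirical_cdf(values, levels):
    """Fraction of `values` that are <= each entry of `levels`."""
    ordered = np.sort(values)
    return np.searchsorted(ordered, levels, side="right") / values.size

rng = np.random.default_rng(0)
T = 20_000                                     # assumed re-calibration set size
y = rng.normal(0.0, 1.0, size=T)               # targets; the true STD is 1 everywhere
mu = np.zeros(T)                               # predicted means
sigma_random = rng.uniform(1.0, 10.0, size=T)  # predicted STDs, independent of y

cdf_values = gaussian_cdf(y, mu, sigma_random)       # predicted CDF values F_t(y_t)
# Assumed stand-in for isotonic regression: the empirical CDF of the predicted
# CDF values, i.e. the monotone map that fits the points (p, P_hat(p)) exactly.
recalibrated = empirical_cdf(cdf_values, cdf_values)  # R(F_t(y_t))

levels = np.linspace(0.05, 0.95, 19)
gap_before = float(np.max(np.abs(empirical_cdf(cdf_values, levels) - levels)))
gap_after = float(np.max(np.abs(empirical_cdf(recalibrated, levels) - levels)))
print(f"max calibration gap before: {gap_before:.3f}, after: {gap_after:.5f}")
```

On the re-calibration set the gap after recalibration is at the level of sampling resolution, although the predicted STDs carry no information about the targets.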
We next present a concise analytic example in which the output distribution and the ground truth distribution are independent, yet fully calibrated according to Eq. 2. Consider the case where the target has a normal distribution $y_t \sim \mathcal{N}(0, 1)$ and the network output $H(x_t)$ has a Cauchy distribution with zero location parameter and random scale parameter $\gamma_t$, independent of $x_t$ and $y_t$, defined as follows:
$$z_t \sim \mathcal{N}(0, 1), \qquad \gamma_t = |z_t|, \qquad H(x_t) = \mathrm{Cauchy}(0, \gamma_t) \tag{4}$$
Following a known equality for Cauchy distributions, the CDF output of the network is $F_t(y) = \mathcal{F}(y / \gamma_t)$, where $\mathcal{F}$ is the CDF of a Cauchy distribution with location and scale parameters $0$ and $1$. First we note that $y_t / \gamma_t$ and $y_t / z_t$, i.e. with and without the absolute value, have the same distribution due to symmetry. Next we recall the well-known fact that the ratio of two independent standard normal random variables is distributed as Cauchy with location and scale parameters $0$ and $1$ (i.e. $y_t / z_t \sim \mathrm{Cauchy}(0, 1)$). This means that the probability that $F_t(y_t) \le p$ is exactly $p$ (recall that $\mathcal{F}$ is a CDF). In other words, the prediction is perfectly calibrated according to the definition in Eq. 2, even though the scale parameter was random and independent of the distribution of $y_t$.
While the Cauchy distribution is a bit unusual due to its lack of mean and variance, the example does not depend on it; it was chosen for simplicity of exposition. It is possible to prove the existence of a distribution for which the product of two independent samples is Gaussian (Thorin, reference 23), and to replace the Cauchy with a Gaussian, but this is an implicit construction and not a familiar distribution.
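The Cauchy example can also be checked by simulation. The following is a minimal sketch under assumed names and an assumed sample size: it draws standard normal targets, draws the scale as the absolute value of an independent standard normal sample, evaluates the zero-location Cauchy CDF at each target, and measures how far the fraction of CDF values below each level is from that level.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000                          # assumed sample size

y = rng.normal(size=T)               # targets, standard normal
gamma = np.abs(rng.normal(size=T))   # random Cauchy scale, independent of y

# CDF of a Cauchy(0, gamma) distribution evaluated at y:
# F(y) = 1/2 + arctan(y / gamma) / pi
cdf_values = 0.5 + np.arctan(y / gamma) / np.pi

levels = np.linspace(0.1, 0.9, 9)
coverage = np.array([(cdf_values <= p).mean() for p in levels])
max_gap = float(np.max(np.abs(coverage - levels)))

# The predicted scale carries no information about the size of the target.
corr = float(np.corrcoef(gamma, np.abs(y))[0, 1])
print(f"max |coverage - p| = {max_gap:.4f}, corr(gamma, |y|) = {corr:.4f}")
```

The coverage matches every level up to sampling noise while the correlation between the predicted scale and the magnitude of the target stays near zero, which is the behavior the analytic argument predicts.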
3 Our method
We present a new definition for calibration for regression, as well as several evaluation measures and a reliability diagram for calibration diagnosis, analogous to the ones used for classification Guo17. The basic idea is that for each value of uncertainty, measured through the standard deviation $\sigma$, the expected mistake, measured in mean square error (MSE), matches the predicted error $\sigma^2$. This is similar to classification, with MSE replacing the role of misclassification error. More formally, if $\mu(x)$ and $\sigma(x)^2$ are the predicted mean and variance respectively, then we consider a regressor well calibrated if
$$\forall \sigma: \quad \mathbb{E}_{x,y}\left[(\mu(x) - y)^2 \mid \sigma(x)^2 = \sigma^2\right] = \sigma^2 \tag{5}$$
In contrast to KuleshovFE18, this does not average over points with different values of $\sigma$ (at least in the definition; for practical measures some binning is needed), but it only considers the mean and variance and not the entire distribution. We claim that this captures the desired meaning of calibration, i.e. for each individual example one can correctly predict the expected mistake.
Since each exact value of $\sigma$ can be expected to appear only once in our dataset, we evaluate Eq. 5 empirically using binning, as is done for classification. Formally, let $\sigma_t$ be the standard deviation of the predicted output PDF $p_t$ and assume without loss of generality that the examples are ordered by increasing values of $\sigma_t$. We also assume for notational simplicity that the number of bins, $N$, divides the number of examples, $T$. We divide the indices of the examples into $N$ bins, $\{B_j\}_{j=1}^{N}$, such that $B_j = \{(j-1) \cdot \frac{T}{N} + 1, \dots, j \cdot \frac{T}{N}\}$. Each resulting bin therefore represents an interval on the standard deviation axis: $[\min_{t \in B_j} \sigma_t, \max_{t \in B_j} \sigma_t]$. The intervals are non-overlapping and their boundary values are increasing.
To evaluate how calibrated the forecaster is, we compare two quantities per bin. The root of the mean variance:
$$RMV(j) = \sqrt{\frac{1}{|B_j|} \sum_{t \in B_j} \sigma_t^2} \tag{6}$$
and the empirical root mean square error:
$$RMSE(j) = \sqrt{\frac{1}{|B_j|} \sum_{t \in B_j} (y_t - \mu_t)^2} \tag{7}$$
where $\mu_t$ is the mean of the predicted PDF ($p_t$).
For diagnosis, we propose a reliability diagram which plots the $RMSE$ as a function of the $RMV$, as shown in Figure 4. The idea is that for a calibrated forecaster the $RMV$ and the observed $RMSE$ should be approximately equal in each bin, and hence the plot should be close to the identity function. Apart from this diagnosis tool, which as we will show is valuable for assessing calibration, we propose additional scores for evaluation.
Expected Normalized Calibration Error (ENCE). For summarizing the error in the calibration we propose the following measure:
$$ENCE = \frac{1}{N} \sum_{j=1}^{N} \frac{|RMV(j) - RMSE(j)|}{RMV(j)} \tag{8}$$
This score averages the calibration error in each bin, normalized by the bin's root mean predicted variance, since for larger variance we naturally expect larger errors. This measure is analogous to the expected calibration error (ECE) used in classification.
STDs Coefficient of variation ($c_v$). In addition to the calibration error we would like to measure the dispersion of the predicted uncertainties. If, for example, the forecaster predicts a single homogeneous uncertainty measure for each example, which matches the empirical uncertainty of the predictor for the entire population, then the $ENCE$ would be zero, but the uncertainty estimation per example would be uninformative. Therefore, we complement the $ENCE$ measure with the Coefficient of Variation ($c_v$) for the predicted STDs, which measures their dispersion:
$$c_v = \frac{\sqrt{\frac{1}{T-1} \sum_{t=1}^{T} (\sigma_t - \mu_\sigma)^2}}{\mu_\sigma} \tag{9}$$
where $\mu_\sigma = \frac{1}{T} \sum_{t=1}^{T} \sigma_t$. Ideally the $c_v$ should be high, indicating a dispersed uncertainty estimation over the dataset.
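The diagnostic measures of this section can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `calibration_metrics`, the default of 10 bins, the use of `np.array_split` (which tolerates a bin count that does not divide the number of examples), and the sample standard deviation (`ddof=1`) in the coefficient of variation are choices made here. Per bin it computes the root of the mean predicted variance and the empirical root mean square error, then the mean normalized gap between them, and finally the coefficient of variation of the predicted STDs.

```python
import numpy as np

def calibration_metrics(y, mu, sigma, n_bins=10):
    """Return (rmv, rmse, ence, cv) for Gaussian predictions N(mu, sigma^2)."""
    y, mu, sigma = (np.asarray(a, dtype=float) for a in (y, mu, sigma))
    order = np.argsort(sigma)              # sort examples by predicted STD
    bins = np.array_split(order, n_bins)   # near-equal-size bins of indices
    # Root of the mean predicted variance per bin.
    rmv = np.array([np.sqrt(np.mean(sigma[b] ** 2)) for b in bins])
    # Empirical root mean square error per bin.
    rmse = np.array([np.sqrt(np.mean((y[b] - mu[b]) ** 2)) for b in bins])
    # Mean over bins of the gap, normalized by the bin's RMV.
    ence = float(np.mean(np.abs(rmv - rmse) / rmv))
    # Dispersion of the predicted STDs (sample STD divided by the mean).
    cv = float(np.std(sigma, ddof=1) / np.mean(sigma))
    return rmv, rmse, ence, cv

# Hypothetical data: the true noise STD varies per example.
rng = np.random.default_rng(0)
T = 50_000
sigma_true = rng.uniform(1.0, 10.0, size=T)
y = rng.normal(0.0, sigma_true)
mu = np.zeros(T)

_, _, ence_informative, cv_informative = calibration_metrics(y, mu, sigma_true)
# Shuffling the STDs keeps their distribution but breaks the link to the errors.
sigma_shuffled = rng.permutation(sigma_true)
_, _, ence_random, cv_random = calibration_metrics(y, mu, sigma_shuffled)
print(f"ENCE informative: {ence_informative:.3f}, random: {ence_random:.3f}")
```

Plotting `rmse` against `rmv` gives the reliability diagram. In this toy setup the shuffled STDs have the same dispersion as the informative ones, so only the calibration error separates the two cases.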
To understand the need for calibration, let us start by considering a trained neural network for regression which has very low mean squared error (MSE) on the training data. We now add a separate branch that predicts uncertainty as a standard deviation, which, together with the original network output interpreted as the mean, defines a Gaussian distribution per example. In this case, the NLL loss on the training data can be minimized by lowering the standard deviation of the predictions, without changing the MSE on training or test data. On test data, however, the MSE will naturally be higher. Since the predicted STDs remain low on test examples, this will result in higher NLL and ENCE values for the test data. This type of miscalibration is defined as overconfidence, but opposite or mixed cases can occur depending on how the model is trained.
Negative log-likelihood. $NLL$ is a standard measure for a probabilistic model's quality Friedman01. When training the network to output classification confidence or a regression distribution, it is commonly used as the objective function to minimize. It is defined as:
$$NLL = -\sum_{t=1}^{T} \log\left([H(x_t)](y_t)\right) \tag{10}$$
We propose using the NLL on the re-calibration set as our objective for calibration, and the reliability diagram, together with its summary measures ($ENCE$, $c_v$), for diagnosis of the calibration. In the most general setting a calibration function maps predicted PDFs to calibrated PDFs: $R_\theta : \mathcal{P}(\mathcal{Y}) \to \mathcal{P}(\mathcal{Y})$, where $\theta$ is the set of parameters defining the mapping.
Optimizing calibration over the re-calibration set is obtained by finding the $\theta$ yielding minimal NLL, where $p_t = H(x_t)$ is the predicted PDF:
$$\arg\min_{\theta} \; -\sum_{t=1}^{T} \log\left([R_\theta(p_t)](y_t)\right) \tag{11}$$
To ensure that the calibration generalizes, the diagnosis should be made on a separate validation set. Multiple choices exist for the family of functions $R_\theta$ belongs to. We propose using STD Scaling (in analogy to Temperature Scaling Guo17), which multiplies the STD of each predicted distribution by a constant scaling factor $s$. If the predicted PDF is that of a Gaussian distribution, $\mathcal{N}(\mu_t, \sigma_t^2)$, then the re-calibrated PDF is $\mathcal{N}(\mu_t, (s \cdot \sigma_t)^2)$. Hence, in this case the calibration objective (Eq. 11) is:
$$\arg\min_{s} \; T \log(s) + \frac{1}{2 s^2} \sum_{t=1}^{T} \frac{(y_t - \mu_t)^2}{\sigma_t^2} \tag{12}$$
If the original predictions are overconfident, as is common in neural networks, then the calibration should set $s > 1$. This is analogous to Temperature Scaling in classification: a single multiplicative parameter is tuned to fix over- or under-confidence of the model, and it does not modify the model's final prediction since $\mu_t$ remains unchanged.
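For Gaussian predictions, STD scaling with a single factor has a closed-form minimizer: the Gaussian NLL with scaled STDs equals, up to a constant, T log(s) + (1 / (2 s^2)) * sum((y_t - mu_t)^2 / sigma_t^2), and setting its derivative with respect to s to zero gives s^2 = (1 / T) * sum((y_t - mu_t)^2 / sigma_t^2). The sketch below uses this shortcut instead of the iterative optimization described in the experiments; the function names and the toy data (a forecaster that reports half of the true STD) are assumptions for illustration.

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Summed negative log-likelihood of y under N(mu, sigma^2)."""
    return float(np.sum(0.5 * np.log(2.0 * np.pi * sigma ** 2)
                        + (y - mu) ** 2 / (2.0 * sigma ** 2)))

def fit_std_scale(y, mu, sigma):
    """Closed-form scale s minimizing the NLL of N(mu, (s * sigma)^2)."""
    z2 = ((y - mu) / sigma) ** 2   # squared standardized residuals
    return float(np.sqrt(np.mean(z2)))

# Hypothetical re-calibration set with an overconfident forecaster.
rng = np.random.default_rng(0)
T = 20_000
sigma_true = rng.uniform(1.0, 10.0, size=T)
y = rng.normal(0.0, sigma_true)
mu = np.zeros(T)
sigma_pred = 0.5 * sigma_true      # predicted STDs are too small by a factor of 2

s = fit_std_scale(y, mu, sigma_pred)
nll_before = gaussian_nll(y, mu, sigma_pred)
nll_after = gaussian_nll(y, mu, s * sigma_pred)
print(f"s = {s:.3f}, NLL before = {nll_before:.1f}, after = {nll_after:.1f}")
```

Here the fitted factor comes out close to 2, i.e. greater than 1 for an overconfident forecaster, and the predicted means are left untouched. For calibration maps with more parameters no such closed form is generally available and a numerical optimizer would be needed.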
More complex calibration methods. Histogram binning and Isotonic Regression applied to the STDs can be also used as calibration methods. We chose STD scaling since: (a) it is less prone to overfit the validation set, (b) it does not enforce minimal and maximal STD values, (c) it is easy to implement and (d) empirically, it produced good calibration results.
4 Experimental results
We next show empirical results of our approach on two tasks: a controlled synthetic regression problem and object detection bounding box regression. In both tasks we examine the effect of outputting trained and random uncertainty on the calibration process. In all training and optimization stages we use an SGD optimizer with learning rate and momentum.
4.1 Synthetic regression problem
Experimenting with a synthetic regression problem enables us to control the target distribution and to validate our method. We randomly generate input samples . We sample from and from . This way, the target standard deviation of sample isloss function. We then add a separate branch with its own four layers to predict uncertainty.
Per example $x_t$, the original network output is considered the mean of a Gaussian distribution ($\mu_t$) and the additional output its standard deviation ($\sigma_t$). For numerical stability, as suggested by Kendall_Gal17, the network outputs the log variance $\log(\sigma_t^2)$. In the random uncertainty experiment, the standard deviation representing the uncertainty is drawn at random per example, independently of the data. For the predicted uncertainty experiment, the uncertainty branch is optimized using the NLL loss (Eq. 10) while the rest of the network weights are fixed. By fixing the remaining weights, the predicted mean ($\mu_t$) remains unchanged. We then calibrate the network as described in Sec. 3.1 on a separate re-calibration set.
As one can see in Fig. 2, the confidence interval method can almost perfectly calibrate the random, independent uncertainty estimation: the expected and observed confidence levels match and we get the desired identity curve. This phenomenon is extremely undesirable for safety-critical applications, where falsely relying on uninformative uncertainty can lead to severe consequences. It is important to note that the perfect calibration did not arise from assigning the same fixed standard deviation to each prediction, which would be acceptable, as the isotonic regression modifies the probabilities directly and not the output standard deviations. In contrast, our method can only marginally improve the calibration, and one can clearly see, both from the ENCE value and visually from the graph, that the predictions are not calibrated.
In the trained experiment, in which uncertainty is predicted by the network, we can see in Fig. 3 that the network almost perfectly learns the correct uncertainty, as expected from the simplicity of the problem and the high data availability. In this case neither method changes the calibration results much. The important thing to note is that our calibration and evaluation method can easily differentiate between the two cases, the random and the predicted uncertainty, while they are almost exactly the same after calibrating with KuleshovFE18.
4.2 Bounding box regression for object detection
In computer vision, an object detector outputs per input image a set of bounding boxes, each commonly represented by five outputs: a classification confidence and four positional outputs, namely height, width, and the x, y position of a predefined point (e.g. the box center). We show results on each positional output as an independent regression task. We use the R-FCN detector Dai16 with a ResNet-101 backbone He16ResNet as described originally. The R-FCN regression branch outputs per region candidate a 4-d vector $(t_x, t_y, t_w, t_h)$ that parametrizes the bounding box, following the accepted parametrization in Girshick15. We use these outputs in our experiments as four independent regression outputs. To this architecture we add an uncertainty branch, identical in structure to the regression branch, that outputs a 4-d vector whose elements each represent the log variance of the Gaussian distribution of the corresponding output. As before, the original regression output represents the Gaussian mean.
For training the network weights we use the entire Common Objects in Context (COCO) dataset COCO, while for uncertainty calibration we use the KITTI Geiger2012CVPR object detection benchmark dataset, which consists of road scenes. Training the uncertainty output on one dataset and performing calibration on a different one reduces the risk of over-fitting and increases the calibration validity. We divide the KITTI dataset into a re-calibration set, used for training the calibration parameters, and a validation set. The classes in the KITTI dataset represent a small subset of the classes in the COCO dataset, and therefore we reduce our model training on COCO to the 9 relevant classes (e.g. car, person) and map them accordingly to the KITTI classes.
We initially train the network without the additional uncertainty branch as in Dai16, while the uncertainty branch weights are randomly initialized. Therefore, in this state, which we refer to as untrained uncertainty, random uncertainties are assigned to each example. We then train the uncertainty branch by minimizing the NLL loss (Eq. 10) on the training set, freezing all network weights but the uncertainty head, with 6 images per training iteration. Freezing the rest of the network ensures that the additional uncertainty estimation does not sacrifice accuracy. The result of this stage is the network with predicted uncertainty. Finally, we minimize the NLL loss for additional training iterations on the re-calibration set, to optimize the single scaling parameter $s$, and obtain the calibrated uncertainty.
Figure 4 shows the resulting reliability diagrams before calibration (predicted uncertainty) and after (calibrated uncertainty) for all four positional outputs, on the validation set consisting of 37K object instances. As can be observed from the monotonically increasing curves before calibration, the output uncertainties are indeed correlated with the empirical ones. Additionally, since the curves are entirely above the ideal one, the predictions are overconfident. Using the learned scaling factor $s$, the ENCE is reduced significantly in all cases. For untrained uncertainty, Fig. 1 shows that after calibration with the interval-based method, just as with the synthetic dataset, the uncertainty appears almost perfectly calibrated. In contrast, our method reveals the lack of correlation between the predictions and the empirical uncertainties before and after applying calibration (see results in Appendix A).
Calibration, and more generally uncertainty prediction, are critical parts of machine learning, especially in safety-critical applications. In this work we exposed serious flaws in the current approach to defining and evaluating calibration for regression problems. We also proposed an alternative approach and showed that even a very simple re-calibration method can lead to significant improvement in real-world applications.
We would like to sincerely thank Roee Litman for his substantial advisory support to this work.
- (1) S. S. Blackman. Multiple hypothesis tracking for multiple target tracking. IEEE Aerospace and Electronic Systems Magazine, 19(1):5–18, Jan 2004.
- (2) J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 379–387, USA, 2016. Curran Associates Inc.
- (3) M. H. DeGroot and S. E. Fienberg. The comparison and evaluation of forecasters. The Statistician: Journal of the Institute of Statisticians, 32:12–22, 1983.
- (4) D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2366–2374. Curran Associates, Inc., 2014.
- (5) D. Feng, L. Rosenbaum, and K. Dietmayer. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection. CoRR, abs/1804.05132, 2018.
- (6) Y. Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
- (7) Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016.
- (8) A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- (9) R. Girshick. Fast r-cnn. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pages 1440–1448, Washington, DC, USA, 2015. IEEE Computer Society.
- (10) C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 2017.
- (11) T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
- (12) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
- (13) Y. He, X. Zhang, M. Savvides, and K. Kitani. Softer-nms: Rethinking bounding box regression for accurate object detection. CoRR, abs/1809.08545, 2018.
- (14) A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5574–5584. Curran Associates, Inc., 2017.
- (15) V. Kuleshov, N. Fenner, and S. Ermon. Accurate uncertainties for deep learning using calibrated regression. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 2801–2809, 2018.
- (16) B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
- (17) T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
- (18) M. P. Naeini, G. F. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In AAAI, pages 2901–2907. AAAI Press, 2015.
- (19) A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 625–632, New York, NY, USA, 2005. ACM.
- (20) D. A. Nix and A. S. Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), volume 1, pages 55–60 vol.1, June 1994.
- (21) B. Phan, R. Salay, K. Czarnecki, V. Abdelzad, T. Denouden, and S. Vernekar. Calibrating uncertainties in object localization task. CoRR, abs/1811.11210, 2018.
- (22) J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
- (23) O. Thorin. On the infinite divisibility of the lognormal distribution. Scandinavian Actuarial Journal, 1977(3):121–148, 1977.
- (24) J. M. Vera-Diaz, D. Pizarro, and J. M. Guarasa. Towards end-to-end acoustic localization using deep learning: from audio signal to source position coordinates. CoRR, abs/1807.11094, 2018.
- (25) B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 609–616, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
Appendix A Bounding box regression with untrained uncertainty
Figure 5 shows the reliability diagrams for the four bounding box regression outputs with untrained uncertainty before and after we apply our calibration method. As with the synthetic dataset, the graphs immediately reveal the disconnect between the random values and the empirical uncertainties. In all the cases the calibration results in a highly non-calibrated uncertainty according to our metrics.