1 Introduction
The proven utility of accurate data analysis has caused machine learning (ML) and deep neural networks (NNs) to emerge as crucially important tools in academia, industry, and society (LeCun et al., 2015). NNs have many documented successes in a wide variety of critical domains such as natural language processing (Collobert and Weston, 2008; Mikolov et al., 2013; Sutskever et al., 2014), computer vision (Krizhevsky et al., 2012), and speech recognition (Hinton et al., 2012; Hannun et al., 2014). The main aspect that differentiates ML methods from traditional statistical modeling techniques is their ability to provide tractable analysis on large and informationally dense datasets. As the amount of data being produced each year continues to accelerate, ML-based techniques are expected to dominate the future of data analysis.
Unless NN models are trained to make predictions that indicate uncertainty when they are not confident, these models can make overly confident, yet incorrect, predictions. While these models can guarantee a level of accuracy for data that is statistically similar to the data they were trained on, they offer no guarantee of accurate predictions on statistically different (known as out-of-distribution (Hendrycks and Gimpel, 2016)) data. For instance, after training a vanilla NN to classify the handwritten digits in the MNIST dataset, one observes (far more often than not) that feeding the NN a uniformly randomly generated image results in a prediction probability of one for the predicted digit. This overly certain and wrong prediction is in stark contrast with what the modeler would desire, e.g. a uniform distribution that indicates uncertainty (Sensoy et al., 2018).
The field of probabilistic machine learning seeks to avoid overly confident, yet incorrect, predictions by quantifying and estimating the predictive uncertainty of NN models (Krzywinski and Altman, 2013; Ghahramani, 2015). Given that these models are being integrated into real decision systems (e.g. self-driving vehicles, infrastructure control, medical diagnosis, etc.), a decision system should incorporate the uncertainty of a prediction to avoid ill-informed choices or reactions that could potentially lead to heavy or undesired losses (Amodei et al., 2016).
A comparative review of the progress made regarding predictive uncertainty estimation for NN models may be found in (Snoek et al., 2019). Most of the early proposed approaches are Bayesian in nature (Bernardo and Smith, 2009). These methods assign prior distributions to the NN’s parameters (weights), and the training process updates these distributions to the “learned” posterior distributions. The residual uncertainty in the posterior distribution of the parameters allows the network to estimate predictive uncertainty. Several methods were suggested for learning Bayesian NNs, including Laplace approximation (MacKay, 1992), Hamiltonian methods (Springenberg et al., 2016), Markov Chain Monte Carlo (MCMC) methods (Neal, 1996), expectation propagation (Jylänki et al., 2014; Li et al., 2015; Hasenclever et al., 2017), and variational inference (Graves, 2011; Louizos and Welling, 2016). Implementing Bayesian NNs is generally difficult and training them is computationally expensive. Recent state-of-the-art methods for predictive uncertainty estimation include probabilistic backpropagation (PBP) (Hernández-Lobato and Adams, 2015), Monte Carlo dropout (MC-dropout) (Gal and Ghahramani, 2016), and Deep Ensembles (Lakshminarayanan et al., 2017). These state-of-the-art methods achieve top performance when estimating predictive uncertainty.
Average generalization error captures the expected ability of a model to generalize to new in-distribution data, due to the i.i.d. nature of the train-test split. Among the error functions that can be used, the log of the predictive probabilities takes the predictive uncertainty into account while assessing the error. Thus, the log of the predictive probabilities is typically used for assessing the quality of predictive methods that quantify uncertainty in regression problems (Nix and Weigend, 1994; Lakshminarayanan et al., 2017).
Contributions: We present a new approach to quantify predictive uncertainty in NNs for regression tasks based on the Bayesian Validation Metric (BVM) framework proposed in (Vanslette et al., 2020). Using this framework, we propose a new loss function (log-likelihood cumulative distribution function difference) and use it to train an ensemble of NNs (inspired by the work of (Lakshminarayanan et al., 2017)). The proposed loss function reproduces maximum likelihood estimation in the limiting case. Our method is very simple to implement and only requires minor changes to the standard NN training procedure. We assess our method both qualitatively and quantitatively through a series of experiments on toy and real-world datasets, and show that our approach provides well-calibrated uncertainty estimates and is competitive with the existing state-of-the-art methods (when tested on in-distribution data). We introduce and utilize the concept of “outlier train-test splitting” to evaluate a method’s predictive ability on out-of-distribution examples whenever their presence in a dataset is not guaranteed. We show that our method has superior predictive power compared to Deep Ensembles (Lakshminarayanan et al., 2017) when tested on out-of-distribution (outlier) samples. As the statistics of training datasets often differ from the statistics of the environment of deployed systems, our method can be used to improve safety and decision-making in the deployed environment by better estimating out-of-distribution uncertainty.
2 The Bayesian Validation Metric for predictive uncertainty estimation
2.1 Notation and problem setup
Consider the following supervised regression task. We are given a dataset $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, consisting of $N$ i.i.d. paired examples, where $\mathbf{x}_n \in \mathbb{R}^d$ represents the $n$-th $d$-dimensional feature vector and $y_n \in \mathbb{R}$ denotes the corresponding continuous target variable (or label). We aim to learn the probabilistic distribution $p(y \mid \mathbf{x})$ over the targets $y$ for given inputs $\mathbf{x}$ using NNs.
2.2 Maximum likelihood estimation
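In regression tasks, it is common practice to train a NN with a single output node (corresponding to the predicted mean), say $\hat{\mu}(\mathbf{x})$, such that the network parameters (or weights) are optimized by minimizing the mean squared error (MSE) cost (or loss) function, expressed as

$$C_{\mathrm{MSE}} = \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \hat{\mu}(\mathbf{x}_n) \right)^2. \tag{1}$$

Note that the network output $\hat{\mu}(\mathbf{x}_n)$ can be thought of as an estimate of the true mean of the noisy target distribution for a given input feature (Nix and Weigend, 1994). However, this does not take into account the uncertainty or noise in the data.
To capture predictive uncertainty, an alternative approach based on maximum likelihood was proposed in (Nix and Weigend, 1994), and it consists of adding another node to the output layer of the neural network, $\hat{\sigma}^2(\mathbf{x})$, that estimates the true variance of the target distribution. In other words, we train a network with two nodes in its output layer: $(\hat{\mu}(\mathbf{x}), \hat{\sigma}^2(\mathbf{x}))$. By assuming the target values $y_n$ to be drawn from a Gaussian distribution with the predicted mean $\hat{\mu}(\mathbf{x}_n)$ and variance $\hat{\sigma}^2(\mathbf{x}_n)$, we can express the likelihood of observing the target value $y_n$ given the input vector $\mathbf{x}_n$ as follows:

$$p(y_n \mid \mathbf{x}_n) = \frac{1}{\sqrt{2\pi \hat{\sigma}^2(\mathbf{x}_n)}} \exp\left( -\frac{\left( y_n - \hat{\mu}(\mathbf{x}_n) \right)^2}{2 \hat{\sigma}^2(\mathbf{x}_n)} \right). \tag{2}$$

The aim is to train a network that infers $(\hat{\mu}(\mathbf{x}), \hat{\sigma}^2(\mathbf{x}))$ by maximizing the likelihood function in (2). This is equivalent to minimizing its negative log-likelihood, expressed as

$$-\log p(y_n \mid \mathbf{x}_n) = \frac{\log \hat{\sigma}^2(\mathbf{x}_n)}{2} + \frac{\left( y_n - \hat{\mu}(\mathbf{x}_n) \right)^2}{2 \hat{\sigma}^2(\mathbf{x}_n)} + \frac{\log 2\pi}{2}. \tag{3}$$

Hence the overall negative log-likelihood (NLL) cost function is given by

$$C_{\mathrm{NLL}} = \sum_{n=1}^{N} \left[ \frac{\log \hat{\sigma}^2(\mathbf{x}_n)}{2} + \frac{\left( y_n - \hat{\mu}(\mathbf{x}_n) \right)^2}{2 \hat{\sigma}^2(\mathbf{x}_n)} + \frac{\log 2\pi}{2} \right]. \tag{4}$$

Note that $\hat{\sigma}^2(\mathbf{x}) > 0$; we impose this positivity constraint on the variance by using the sigmoid function (instead of softplus as in (Lakshminarayanan et al., 2017)) as our data will be standardized.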
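As a minimal sketch of the maximum likelihood objective described above, the per-example Gaussian negative log-likelihood and its sum over a dataset can be written as follows. The sketch assumes the standard Gaussian form with a predicted mean and a strictly positive predicted variance; the names `gaussian_nll` and `nll_cost` are illustrative and do not come from the paper.

```python
import math

def gaussian_nll(y, mu, var):
    """Negative log-likelihood of target y under N(mu, var); var must be > 0."""
    return (0.5 * math.log(var)
            + (y - mu) ** 2 / (2.0 * var)
            + 0.5 * math.log(2.0 * math.pi))

def nll_cost(ys, mus, variances):
    """Sum of the per-example NLL terms over a dataset."""
    return sum(gaussian_nll(y, m, v) for y, m, v in zip(ys, mus, variances))

# The same wrong prediction (target 2.0, predicted mean 0.0) is penalized
# less when the network reports a larger predictive variance.
confident = gaussian_nll(y=2.0, mu=0.0, var=0.1)
hedged = gaussian_nll(y=2.0, mu=0.0, var=4.0)
```

Unlike MSE, this objective rewards the network for reporting a large variance where its mean prediction is poor, which is what allows the second output node to carry uncertainty information.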
2.3 The Bayesian Validation Metric
The Bayesian Validation Metric (BVM) is a general model validation and testing tool that was shown to generalize Bayesian model testing and regression (Vanslette et al., 2020; Tohme et al., 2020). The BVM measures the probability of agreement $A$ between the model $\mathcal{M}$ and the data $\mathcal{D}$ given the Boolean agreement function $B$, denoted as $p(A \mid \mathcal{M}, \mathcal{D})$. The probability of agreement is

$$p(A \mid \mathcal{M}, \mathcal{D}) = \int_{\hat{z}, z} \rho(\hat{z} \mid \mathcal{M}) \, \Theta\big(B(\hat{z}, z)\big) \, \rho(z \mid \mathcal{D}) \, d\hat{z} \, dz, \tag{5}$$

where $\hat{z}$ and $z$ correspond to the model output and observed data respectively, $\rho(\hat{z} \mid \mathcal{M})$ is the probability density function (pdf) representing the model predictive uncertainty, $\rho(z \mid \mathcal{D})$ is the data uncertainty pdf, and $\Theta(B(\hat{z}, z))$ is the indicator function of the Boolean that defines the meaning of model-data agreement. The indicator function behaves as a probabilistic kernel between the data and model prediction pdfs.
2.4 The BVM reproduces the NLL loss as a special case
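We show that the BVM is capable of replicating the maximum likelihood NN framework by representing the NLL cost function described in (4) as a special case. In terms of the BVM framework, the maximum likelihood formulation is achieved by modeling the predictions using a Gaussian likelihood given by

$$\rho(\hat{y} \mid \mathbf{x}_n, \mathcal{M}) = \frac{1}{\sqrt{2\pi \hat{\sigma}^2(\mathbf{x}_n)}} \exp\left( -\frac{\left( \hat{y} - \hat{\mu}(\mathbf{x}_n) \right)^2}{2 \hat{\sigma}^2(\mathbf{x}_n)} \right),$$

and by assuming the target variables to be deterministic, i.e. $\rho(y \mid \mathbf{x}_n, \mathcal{D}) = \delta(y - y_n)$, where $\delta$ is the Dirac delta function. In addition, the Boolean agreement function is defined such that the model and the data are required to “agree exactly” (as is the case with Bayesian model testing (Vanslette et al., 2020; Tohme, 2020)), and is given by $\Theta(B(\hat{y}, y)) = \delta(\hat{y} - y)$, when $p(A \mid \mathcal{M}, \mathcal{D})$ is a probability density. For a particular input feature vector $\mathbf{x}_n$, the probability density of agreement between the model and data is equal to

$$p(A \mid \mathcal{M}, \mathcal{D}, \mathbf{x}_n) = \int_{\hat{y}, y} \rho(\hat{y} \mid \mathbf{x}_n, \mathcal{M}) \, \delta(\hat{y} - y) \, \delta(y - y_n) \, d\hat{y} \, dy = \frac{1}{\sqrt{2\pi \hat{\sigma}^2(\mathbf{x}_n)}} \exp\left( -\frac{\left( y_n - \hat{\mu}(\mathbf{x}_n) \right)^2}{2 \hat{\sigma}^2(\mathbf{x}_n)} \right).$$

Maximizing the BVM probability density of agreement is equivalent to minimizing its negative log-likelihood,

$$-\log p(A \mid \mathcal{M}, \mathcal{D}, \mathbf{x}_n) = \frac{\log \hat{\sigma}^2(\mathbf{x}_n)}{2} + \frac{\left( y_n - \hat{\mu}(\mathbf{x}_n) \right)^2}{2 \hat{\sigma}^2(\mathbf{x}_n)} + \frac{\log 2\pi}{2},$$

which is Equation (3). Therefore, the negative log-likelihood BVM cost function over the set of all input feature vectors is given by

$$-\sum_{n=1}^{N} \log p(A \mid \mathcal{M}, \mathcal{D}, \mathbf{x}_n) = \sum_{n=1}^{N} \left[ \frac{\log \hat{\sigma}^2(\mathbf{x}_n)}{2} + \frac{\left( y_n - \hat{\mu}(\mathbf{x}_n) \right)^2}{2 \hat{\sigma}^2(\mathbf{x}_n)} + \frac{\log 2\pi}{2} \right],$$

which is Equation (4). Thus, with the assumptions put on the data, model, and agreement definition, the BVM method can reproduce the maximum likelihood method as a special case. That is, minimizing the BVM negative log-probability density of agreement is mathematically equivalent to minimizing the NLL loss, which was essentially used in Deep Ensembles (Lakshminarayanan et al., 2017).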
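2.5 The $\epsilon$-BVM loss: a relaxed version of the NLL loss
We now consider the $\epsilon$-Boolean agreement function $B(\hat{y}, y, \epsilon)$ being true iff $|\hat{y} - y| \leq \epsilon$. In the limit $\epsilon \to 0$, this Boolean function requires the model output and data to “agree exactly”, which leads to the maximum likelihood NN limit of the BVM discussed above. Again, assuming the model predictive uncertainty to be Gaussian, the target variables to be deterministic and the agreement function to be $B(\hat{y}, y, \epsilon)$, the $\epsilon$-BVM probability of agreement for a given input feature vector $\mathbf{x}_n$ can be expressed as

$$p(A \mid \mathcal{M}, \mathcal{D}, \mathbf{x}_n) = \int_{y_n - \epsilon}^{y_n + \epsilon} \rho(\hat{y} \mid \mathbf{x}_n, \mathcal{M}) \, d\hat{y} = \Phi\left( \frac{y_n + \epsilon - \hat{\mu}(\mathbf{x}_n)}{\hat{\sigma}(\mathbf{x}_n)} \right) - \Phi\left( \frac{y_n - \epsilon - \hat{\mu}(\mathbf{x}_n)}{\hat{\sigma}(\mathbf{x}_n)} \right),$$

where $\Phi$ is the cumulative distribution function (cdf) of the standard normal distribution. Thus, this $\epsilon$-BVM probability of agreement becomes the difference in likelihood cdfs around the mean. Taking its (overall) negative log gives

$$C_{\epsilon\text{-BVM}} = -\sum_{n=1}^{N} \log\left[ \Phi\left( \frac{y_n + \epsilon - \hat{\mu}(\mathbf{x}_n)}{\hat{\sigma}(\mathbf{x}_n)} \right) - \Phi\left( \frac{y_n - \epsilon - \hat{\mu}(\mathbf{x}_n)}{\hat{\sigma}(\mathbf{x}_n)} \right) \right]. \tag{17}$$

Having this looser definition of model-data agreement effectively coarse-grains the in-distribution data and prevents overfitting. While Section 3.3 shows that this coarse-graining increases the bias of the in-distribution test results, Section 3.4 shows that our method better generalizes to out-of-distribution sample predictions.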
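The relaxed loss can be illustrated with a short sketch. It assumes a Gaussian predictive distribution and an agreement window of half-width `eps` centred on the observed target, so that the probability of agreement is a difference of two normal cdf values; the function names are hypothetical, and the window convention is an assumption of this sketch rather than a confirmed detail of the paper.

```python
import math

def std_normal_cdf(z):
    """Cdf of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def eps_bvm_nll(y, mu, var, eps):
    """-log of the Gaussian probability mass in the window [y - eps, y + eps]."""
    sigma = math.sqrt(var)
    mass = (std_normal_cdf((y + eps - mu) / sigma)
            - std_normal_cdf((y - eps - mu) / sigma))
    return -math.log(mass)

def gaussian_nll(y, mu, var):
    """Standard Gaussian negative log-likelihood, for comparison."""
    return 0.5 * math.log(2.0 * math.pi * var) + (y - mu) ** 2 / (2.0 * var)

# For small eps the window mass is about 2 * eps * density, so the two losses
# differ only by the constant log(2 * eps), which does not move the minimizer.
eps = 1e-3
gap = (eps_bvm_nll(0.5, 0.0, 1.0, eps) + math.log(2.0 * eps)
       - gaussian_nll(0.5, 0.0, 1.0))
```

The quantity `gap` is close to zero, which is the numerical counterpart of the statement that the relaxed loss recovers maximum likelihood in the small-window limit. For targets far in the tails the window mass can underflow to zero, so a practical implementation would likely clamp it before taking the logarithm.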
2.6 Implementation and ensemble learning
Implementing our proposed method is straightforward and requires few modifications to typical NNs. We simply train a NN using the BVM loss function. Since our aim is to estimate and quantify the predictive uncertainty, our NN will have two nodes in its output layer, corresponding to the predicted mean $\hat{\mu}(\mathbf{x})$ and variance $\hat{\sigma}^2(\mathbf{x})$, as we mentioned earlier. More details about our NN architecture will be discussed in the next section.
Training an ensemble of NNs independently and statistically integrating their results was shown to improve predictive performance (Lakshminarayanan et al., 2017). This class of ensemble methods is known as a randomization-based approach (such as random forests (Breiman, 2001)) in contrast to a boosting-based approach where NNs are trained sequentially. Due to the randomized and independent training procedure, the local minima the NNs settle into vary across the ensemble. This causes the ensemble to “agree” where there is training data and “disagree” elsewhere, which increases the variance of the statistically integrated predictive distribution.
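We follow (Lakshminarayanan et al., 2017) and adopt their ensemble learning procedure by training an ensemble of $M$ NNs, but we use the BVM loss rather than the NLL loss (recall the NLL is a special case of the BVM). We let each network parametrize a distribution over the outputs, i.e. $p_{\theta_m}(y \mid \mathbf{x})$, where $\theta_m$ represents the vector of weights of network $m$. In addition, we assume the ensemble to be a uniformly-weighted mixture model. In other words, we combine the predictions as $p(y \mid \mathbf{x}) = M^{-1} \sum_{m=1}^{M} p_{\theta_m}(y \mid \mathbf{x})$. Letting the predictive distributions of the mixture be Gaussian, $\mathcal{N}(\mu_{\theta_m}(\mathbf{x}), \sigma^2_{\theta_m}(\mathbf{x}))$, the resulting statistically integrated mean and variance are given by $\hat{\mu}_*(\mathbf{x}) = M^{-1} \sum_{m} \mu_{\theta_m}(\mathbf{x})$ and $\hat{\sigma}^2_*(\mathbf{x}) = M^{-1} \sum_{m} \left( \sigma^2_{\theta_m}(\mathbf{x}) + \mu^2_{\theta_m}(\mathbf{x}) \right) - \hat{\mu}^2_*(\mathbf{x})$, respectively. These quantities are evaluated against the test set. It is worth noting that, in all our experiments, we train an ensemble of five NNs (i.e. $M = 5$).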
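The statistical integration step can be sketched as below. The sketch assumes the uniformly weighted Gaussian mixture of Lakshminarayanan et al. (2017), whose mean is the average of the member means and whose variance follows from the law of total variance; the function name `combine_gaussian_ensemble` is illustrative.

```python
def combine_gaussian_ensemble(means, variances):
    """Mean and variance of a uniformly weighted mixture of Gaussians."""
    m = len(means)
    mix_mean = sum(means) / m
    # Law of total variance: average of (sigma^2 + mu^2) minus squared mixture mean.
    mix_var = sum(v + mu ** 2 for mu, v in zip(means, variances)) / m - mix_mean ** 2
    return mix_mean, mix_var

# Two members that disagree on the mean: the disagreement adds to the variance.
mean, var = combine_gaussian_ensemble([1.0, 3.0], [1.0, 1.0])
```

In this example each member reports unit variance, yet the mixture variance is larger because the members disagree about the mean. This is the mechanism by which ensemble disagreement away from the training data widens the predictive distribution.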
3 Experimental results
We evaluate our proposed method both qualitatively and quantitatively through a series of experiments on regression benchmark datasets. In particular, we first conduct a regression experiment on a one-dimensional toy dataset, and then experiment with well-known, real-world datasets (available from the University of California, Irvine (UCI) machine learning data repository). Further, we show that our approach outperforms state-of-the-art methods in out-of-distribution generalization. In our experiments, we train NNs with one hidden layer and use the $\epsilon$-BVM loss function described by Equation (17) (in what follows, we will refer to the $\epsilon$-BVM loss as simply the BVM loss). We randomly initialize the NN weights (using the PyTorch default weight initialization) and randomly shuffle the paired training examples.
3.1 Toy dataset
We first qualitatively assess the performance of our proposed method on a toy dataset that was used in (Hernández-Lobato and Adams, 2015; Lakshminarayanan et al., 2017). The dataset is produced by uniformly sampling (at random) inputs in the interval . The label corresponding to each input is obtained by computing where . The NN architecture consists of one layer with 100 hidden units and the value of in the BVM loss is set to as the data is not normalized.
In order to measure and estimate uncertainty, a commonly used approach is to train multiple NNs independently (i.e. an ensemble of NNs) to minimize MSE, and compute the variance of the networks’ generated point predictions. We show that learning the variance by training using the BVM loss function results in better predictive uncertainty estimation. The results are shown in Figure 1.
From Figure 1, it is clear that predictive uncertainty estimation can be improved by learning the variance through training using the BVM loss, and it can be further improved by training an ensemble of NNs (the effect of ensemble learning becomes more apparent as we move further away from the training data). Note that the results we get using the proposed BVM loss are very similar to the results produced using NLL in (Lakshminarayanan et al., 2017) since is small relative to the range of the data. The goal of this experiment is to show that the BVM loss function is indeed suitable for predictive uncertainty estimation by reproducing the results in (Lakshminarayanan et al., 2017).
3.2 Training using MSE vs NLL vs BVM
This section shows that the predicted variance (using our method) is as well-calibrated as the one from Deep Ensembles (using NLL) and is better calibrated than the empirical variance (using MSE). In (Lakshminarayanan et al., 2017), it was shown that training an ensemble of NNs with a single output (representing the mean) using MSE and computing the empirical variance of the networks’ predictions to estimate uncertainty does not lead to well-calibrated predictive probabilities. This was due to the fact that MSE does not capture predictive uncertainty. It was then shown that learning the predictive variance by training an ensemble of NNs with two outputs (corresponding to the mean and variance) using NLL (i.e. Deep Ensembles) results in well-calibrated predictions. We show that this is also the case for the proposed BVM loss.
We reproduce an experiment from (Lakshminarayanan et al., 2017) using the BVM loss function (with ), where we construct reliability diagrams (also known as calibration curves) on the benchmark datasets. The procedure is as follows: (i) we calculate the prediction interval for each test point (using the predicted mean and variance), (ii) we then measure the actual fraction of test observations that fall within this prediction interval, and (iii) we repeat the calculations for in steps of . If the actual fraction is close to the expected fraction (i.e. ), this indicates that the predictive probabilities are well-calibrated. The ideal output would be a diagonal line. In other words, a regressor is considered to be well-calibrated if its calibration curve is close to the diagonal.
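The procedure above can be sketched in a few lines. The sketch assumes a Gaussian predictive distribution and central prediction intervals (mean plus or minus a multiple of the predicted standard deviation); the function names and the example levels are illustrative, and each level must lie strictly between 0 and 1.

```python
from statistics import NormalDist

def observed_coverage(ys, means, variances, level):
    """Fraction of targets inside the central `level` prediction interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # half-width in standard deviations
    inside = sum(abs(y - m) <= z * v ** 0.5
                 for y, m, v in zip(ys, means, variances))
    return inside / len(ys)

def reliability_curve(ys, means, variances, levels):
    """Pairs (expected fraction, observed fraction), one per interval level."""
    return [(p, observed_coverage(ys, means, variances, p)) for p in levels]

# Four test targets, all predicted as N(0, 1).
ys = [0.0, 0.5, 1.0, 2.0]
curve = reliability_curve(ys, [0.0] * 4, [1.0] * 4, [0.5, 0.9])
```

Plotting the observed fraction against the expected fraction gives the calibration curve; points below the diagonal indicate overconfident intervals, and points above it indicate intervals that are wider than necessary.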
We report the reliability diagram for the Energy dataset in Figure 2; diagrams for the other benchmark datasets are reported in Appendix A (the trend is the same for all datasets). We find that our method provides well-calibrated uncertainty estimates with a calibration curve very close to the diagonal (and almost overlapping with the curve of Deep Ensembles (Lakshminarayanan et al., 2017)). We also find that the predicted variance (learned using BVM or NLL) is better calibrated than the empirical variance (computed by training five NNs using MSE) which is overconfident. For instance, for the prediction interval (i.e. the expected fraction is equal to ), the actual fraction of test observations that fall within the interval is only (i.e. the observed fraction is around ). In other words, the empirical variance (using MSE) underestimates the true uncertainty.
Table 1: Average test RMSE and test NLL, with standard errors, on the regression benchmark datasets.

|Dataset|Examples|Features|RMSE: PBP|RMSE: MC-dropout|RMSE: Deep Ensembles|RMSE: BVM|NLL: PBP|NLL: MC-dropout|NLL: Deep Ensembles|NLL: BVM|
|---|---|---|---|---|---|---|---|---|---|---|
|Boston housing|506|13|3.01 ± 0.18|2.97 ± 0.19|3.28 ± 1.00|3.06 ± 0.22|2.57 ± 0.09|2.46 ± 0.06|2.41 ± 0.25|2.52 ± 0.08|
|Concrete|1,030|8|5.67 ± 0.09|5.23 ± 0.12|6.03 ± 0.58|6.07 ± 0.18|3.16 ± 0.02|3.04 ± 0.02|3.06 ± 0.18|3.18 ± 0.14|
|Energy|768|8|1.80 ± 0.05|1.66 ± 0.04|2.09 ± 0.29|2.16 ± 0.07|2.04 ± 0.02|1.99 ± 0.02|1.38 ± 0.22|1.67 ± 0.13|
|Kin8nm|8,192|8|0.10 ± 0.00|0.10 ± 0.00|0.09 ± 0.00|0.11 ± 0.00|-0.90 ± 0.01|-0.95 ± 0.01|-1.20 ± 0.02|-0.85 ± 0.10|
|Naval propulsion plant|11,934|16|0.01 ± 0.00|0.01 ± 0.00|0.00 ± 0.00|0.01 ± 0.00|-3.73 ± 0.01|-3.80 ± 0.01|-5.63 ± 0.05|-3.92 ± 0.01|
|Power plant|9,568|4|4.12 ± 0.03|4.02 ± 0.04|4.11 ± 0.17|4.18 ± 0.13|2.84 ± 0.01|2.80 ± 0.01|2.79 ± 0.04|3.07 ± 0.08|
|Protein|45,730|9|4.73 ± 0.01|4.36 ± 0.01|4.71 ± 0.06|4.29 ± 0.08|2.97 ± 0.00|2.89 ± 0.00|2.83 ± 0.02|3.02 ± 0.03|
|Wine|1,599|11|0.64 ± 0.01|0.62 ± 0.01|0.64 ± 0.04|0.64 ± 0.01|0.97 ± 0.01|0.93 ± 0.01|0.94 ± 0.12|1.01 ± 0.09|
|Yacht|308|6|1.02 ± 0.05|1.11 ± 0.09|1.58 ± 0.48|1.67 ± 0.25|1.63 ± 0.02|1.55 ± 0.03|1.18 ± 0.21|1.56 ± 0.18|
3.3 Real world datasets
We further evaluate our proposed method by comparing it to existing state-of-the-art methods. We adopt the same experimental setup as in (Hernández-Lobato and Adams, 2015) for evaluating PBP, (Gal and Ghahramani, 2016) for evaluating MC-dropout, and (Lakshminarayanan et al., 2017) for evaluating Deep Ensembles. We use NNs with one hidden layer and ReLU activations (Nair and Hinton, 2010), consisting of hidden units for all datasets except for the largest one (i.e. Protein) where we use NNs with hidden units. We train NNs using the BVM loss function with . Each dataset is randomly split into training and test sets with and of the available data, respectively. For each train-test split, we train an ensemble of five networks. We repeat the splitting process 20 times and report the average test performance of our proposed method. For the larger Protein dataset, we perform the train-test splitting times (instead of 20).
In our experiments, we run the training for epochs, using mini-batches of size and AdamW optimizer with fixed learning rate of . For all the datasets, we apply feature scaling by standardizing the input features to have zero mean and unit variance, and normalize the targets to have a range of (in the training set). Before evaluating the predictions, we invert the normalization factor on the predictions so they are back to the original scale of the targets for the purpose of error evaluation. Note that a sigmoid activation function is applied to the outputs of the NNs corresponding to the mean and variance. We summarize our results in Table 1, along with the results of PBP, MC-dropout, and Deep Ensembles as were outlined in their respective papers. For each dataset, the best method(s) is (are) highlighted in bold.
The results in Table 1 show that our proposed method is competitive with existing state-of-the-art methods. As might be expected, our method performs sub-optimally compared to other methods in terms of RMSE (e.g. on the Energy dataset). Since our method optimizes for the BVM loss, which learns both the mean and the variance (to better capture uncertainties) rather than learning only the mean, it gives less optimal RMSE values. Also note that, although our method outperforms PBP and MC-dropout in terms of NLL on many datasets, it did not outperform Deep Ensembles (e.g. on the Energy dataset, our method produces the second lowest NLL average of 1.67, behind Deep Ensembles whose NLL average is 1.38). Since the Deep Ensembles method optimizes for NLL, it is expected to perform better than the BVM approach for $\epsilon > 0$, at least when the splitting of the data into training and test sets is done randomly (i.e. when tested on in-distribution data). The methods are comparable, and identical in the limit $\epsilon \to 0$, because the BVM loss becomes equivalent to NLL. We intentionally used a nonzero $\epsilon$ to highlight its effect on the predictions (compared to Deep Ensembles) when tested on in-distribution samples. We later introduce and apply the concept of “outlier train-test splitting”, and show that our method outperforms Deep Ensembles when evaluated on out-of-distribution samples (see Section 3.4).
3.4 Robustness and out-of-distribution generalization
We aim to show that our proposed method is robust and able to generalize better to out-of-distribution (OOD) data than Deep Ensembles. That is, if we evaluate our method on data that is statistically different from the training data, we observe more robustness and higher predictive uncertainties.
We consider a training set consisting of Google stock prices for a period of years (from the beginning of till the end of ) and a test set containing the stock prices of January (see Figure 3). In particular, we consider the Google opening stock price, i.e. the stock price at the beginning of the financial/trading day. It is worth noting that the input feature vector is -dimensional corresponding to a -day window, i.e. for a given day, the NN will consider the stock prices for the past days, and based on the trends captured during this time window, it will predict the corresponding stock price (with its uncertainty).
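The windowed input construction described above can be sketched as follows. The window length used in the experiment is not recoverable from the text, so `window=3` below is purely illustrative, as are the function name `make_windows` and the toy price series.

```python
def make_windows(prices, window):
    """Build (features, target) pairs: the past `window` prices predict the next one."""
    features, targets = [], []
    for t in range(window, len(prices)):
        features.append(prices[t - window:t])  # prices of the preceding days
        targets.append(prices[t])              # price on the day to predict
    return features, targets

# Hypothetical opening prices for five consecutive trading days.
X, y = make_windows([10.0, 11.0, 12.0, 13.0, 14.0], window=3)
```

Each feature vector holds only prices that precede its target, so no information from the predicted day leaks into the input.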
We train an ensemble of NNs consisting of hidden layers with hidden units per layer. (Training recurrent NNs would improve forecasting accuracy; however, here we are more interested in predictive uncertainties, and standard NNs were enough to prove our point.) We run the training for epochs, using batch size of and Adam optimizer with fixed learning rate of . We repeat this process for three different loss functions: (i) The NLL loss in (4) used in Deep Ensembles (Lakshminarayanan et al., 2017), (ii) the BVM loss in (17) with , and (iii) the BVM loss with . We plot the predicted mean stock price along with the prediction interval corresponding to January . The results are shown in Figure 3.
|Statistical Difference||Test NLL|
|Naval propulsion plant||11,934||16||0.01||0.02||-4.42||-3.84|
The results demonstrate that the value of $\epsilon$ in the BVM loss affects the predictive uncertainty (i.e. the prediction interval). A small value of $\epsilon$ corresponds to a stricter agreement condition between the NN predictive means and the observed targets, which results in a narrower prediction interval (i.e. lower variance values). Note that using the smaller of the two $\epsilon$ values results in a predictive envelope that almost overlaps with the prediction interval of Deep Ensembles. When we increase the value of $\epsilon$ in the BVM loss, the agreement conditions become less stringent, and this leads to a wider prediction interval, which better captures the uncertainty of the stock price in the test set. This results in a lower NLL for this highly volatile test set.
The NLL results are summarized in Table 2. Due to the out-of-distribution nature of the test data, the BVM loss with a relatively large $\epsilon$ results in the lowest NLL. Since this loss leads to the largest variances (or uncertainties), its corresponding likelihood (2) will be the largest, which is equivalent to the lowest NLL. Using a very large $\epsilon$ will overly coarse-grain the data and one will lose predictive power.
We now compare our method to Deep Ensembles in terms of robustness and OOD generalization on the regression benchmark datasets used in Section 3.3 (see Table 1). Since the presence of OOD examples (for testing) is not guaranteed in these datasets, we apply “outlier train-test splitting”, which forces the generation of statistical differences between the training and test sets (when outliers exist). We repeat the experiment on the benchmarks from Section 3.3; however, instead of randomly splitting the datasets into training and test sets, we now detect outliers (e.g. of the dataset) and treat them as test examples and train on the remaining examples (e.g. of the dataset). The outliers represent out-of-distribution examples that could potentially lead to heavy losses if characterized poorly in a deployment environment. Using this splitting process, we can better evaluate a method’s predictive ability on out-of-distribution samples. To perform the outlier train-test data splitting, we use Isolation Forest (Liu et al., 2008) to detect the outliers in the datasets (which isolates anomalies that are less frequent and different in the feature space).
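A minimal sketch of outlier train-test splitting, using scikit-learn's `IsolationForest`, is given below. Several details are assumptions of this sketch rather than statements from the paper: the detector is fitted on the input features only, the outlier fraction of 0.1 is illustrative, and the function name `outlier_train_test_split` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def outlier_train_test_split(X, y, outlier_fraction=0.1, seed=0):
    """Put detected outliers in the test set and all other points in the training set."""
    detector = IsolationForest(contamination=outlier_fraction, random_state=seed)
    labels = detector.fit_predict(X)  # -1 marks outliers, 1 marks inliers
    is_outlier = labels == -1
    return X[~is_outlier], y[~is_outlier], X[is_outlier], y[is_outlier]

# Synthetic stand-in for a benchmark dataset: 200 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X.sum(axis=1)
X_train, y_train, X_test, y_test = outlier_train_test_split(X, y)
```

Because the test set is built from the points the detector finds easiest to isolate, its statistics differ from those of the training set by construction, which is the property the evaluation relies on.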
We train an ensemble of NNs consisting of one hidden layer with hidden units, using both the NLL loss (which is Deep Ensembles) and the BVM loss function (with ). We train for epochs, with batch size of and Adam optimizer with fixed learning rate of . We summarize our results along with the statistical differences between the normalized train-test targets in Table 3. For each dataset, the best method is highlighted in bold.
As shown in Table 3, our method consistently outperforms Deep Ensembles on all datasets that have significant statistical differences between the training and test sets (gray rows). In other words, the BVM approach is robust to statistical change. It is interesting to note that while our method did not directly optimize for NLL, it was still able to outperform Deep Ensembles, which did.
Why does BVM outperform Deep Ensembles on OOD samples?
Note that for a given input feature vector , the minimizer of the BVM loss function satisfies
Taylor expanding around leads to
The proof can be found in Appendix B. Thus, the minimizer of the BVM loss over the set of all input feature vectors can be approximated as
We can clearly see that for , the minimizer of the BVM loss is indeed the minimizer of the NLL loss in Equation (4). For a nonzero , will increase linearly with (see proof in Appendix B) leading to a larger variance, and hence a wider distribution (or prediction interval). Thus, the OOD samples near the tails (i.e. the outliers) will be more probable resulting in lower NLL values compared to Deep Ensembles (keeping in mind that the in-distribution samples near the mean will be less probable resulting in higher NLL values compared to Deep Ensembles, which was the case in Table 1).
4 Conclusion
In this work, we proposed a new loss function for regression uncertainty estimation (based on the BVM framework) which reproduces maximum likelihood estimation in the limiting case. This loss, boosted by ensemble learning, improves predictive performance when the training and test sets are statistically different. Experiments on in-distribution data show that our method generates well-calibrated uncertainty estimates and is competitive with existing state-of-the-art methods. When tested on out-of-distribution samples (outliers), our method exhibits superior predictive power by consistently displaying improved predictive log-likelihoods. Because the data source statistics in the learning and deployed environments are often known to be different, our method can be used to improve safety and decision-making in the deployed environment. Our future work involves expanding the BVM framework to address predictive uncertainty estimation in classification problems.
References
- Amodei et al. (2016). Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
- Bernardo and Smith (2009). Bayesian theory. Vol. 405, John Wiley & Sons.
- Breiman (2001). Random forests. Machine Learning 45 (1), pp. 5–32.
- Collobert and Weston (2008). A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167.
- Gal and Ghahramani (2016). Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
- Ghahramani (2015). Probabilistic machine learning and artificial intelligence. Nature 521, pp. 452–459.
- Graves (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356.
- Hannun et al. (2014). Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
- Hasenclever et al. (2017). Distributed Bayesian learning with stochastic natural gradient expectation propagation and the posterior server. The Journal of Machine Learning Research 18 (1), pp. 3744–3780.
- Hendrycks and Gimpel (2016). A baseline for detecting misclassified and out-of-distribution examples in neural networks.
- Hernández-Lobato and Adams (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869.
- Hinton et al. (2012). Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97.
- Jylänki et al. (2014). Expectation propagation for neural networks with sparsity-promoting priors. The Journal of Machine Learning Research 15 (1), pp. 1849–1901.
- Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- Krzywinski and Altman (2013). Points of significance: importance of being uncertain. Nature Methods 10, pp. 809–810.
- Lakshminarayanan et al. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413.
- LeCun et al. (2015). Deep learning. Nature 521, pp. 436–444.
- Li et al. (2015). Stochastic expectation propagation. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2323–2331.
- Liu et al. (2008). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422.
- Louizos and Welling (2016). Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning, pp. 1708–1716.
- MacKay (1992). Bayesian methods for adaptive models. Ph.D. Thesis, California Institute of Technology.
- Mikolov et al. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Nair and Hinton (2010). Rectified linear units improve restricted Boltzmann machines. In ICML.
- Neal (1996). Bayesian learning for neural networks. Vol. 118, Springer Science & Business Media.
- Nix and Weigend (1994). Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), Vol. 1, pp. 55–60.
- Sensoy et al. (2018). Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. 3179–3189.
- Snoek et al. (2019). Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pp. 13969–13980.
- Springenberg et al. (2016). Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems, pp. 4134–4142.
- Sutskever et al. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
- Tohme et al. (2020). A generalized Bayesian approach to model calibration. Reliability Engineering & System Safety 204, pp. 107141.
- Tohme (2020). The Bayesian validation metric: a framework for probabilistic model calibration and validation. Ph.D. Thesis, Massachusetts Institute of Technology.
- Vanslette et al. (2020). A general model validation and testing tool. Reliability Engineering & System Safety 195, pp. 106684.
Appendix A Training using MSE vs NLL vs BVM
This section shows that the variance predicted by our method is as well calibrated as the one from Deep Ensembles (using NLL) and better calibrated than the empirical variance (using MSE). In (Lakshminarayanan et al., 2017), it was shown that training an ensemble of NNs with a single output (representing the mean) using MSE, and then computing the empirical variance of the networks’ predictions to estimate uncertainty, does not lead to well-calibrated predictive probabilities, because MSE does not capture predictive uncertainty. It was then shown that learning the predictive variance by training NNs with two outputs (corresponding to the mean and variance) using NLL (i.e. Deep Ensembles) results in well-calibrated predictions. We show that this is also the case for the proposed BVM loss.
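The difference between the two training objectives can be illustrated with a minimal sketch. It assumes a Gaussian predictive distribution with a per-point mean and variance; the function names and the toy numbers are illustrative and do not come from the paper's code.

```python
import math

def mse(y, mu):
    """Mean squared error: depends only on the predicted means."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / len(y)

def gaussian_nll(y, mu, var):
    """Average Gaussian negative log-likelihood for per-point mean and variance."""
    total = 0.0
    for yi, mi, vi in zip(y, mu, var):
        total += 0.5 * math.log(2.0 * math.pi * vi) + (yi - mi) ** 2 / (2.0 * vi)
    return total / len(y)

# Toy targets and predicted means (illustrative values only).
y = [1.0, 2.0, 3.0]
mu = [1.1, 1.8, 3.5]

# MSE is blind to whatever variance the model would claim.
print("MSE:", mse(y, mu))

# NLL penalizes a claimed variance that is too small or too large
# relative to the squared residuals.
for v in (0.01, 0.05, 0.1, 0.2, 1.0):
    print(f"claimed variance {v:<4}: NLL = {gaussian_nll(y, mu, [v] * len(y)):.4f}")
```

For a constant claimed variance, the NLL is smallest when that variance equals the mean squared residual (here about 0.1), which is why an NLL-trained network has an incentive to report a variance that matches its actual errors.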
We reproduce an experiment from (Lakshminarayanan et al., 2017) using the BVM loss function, where we construct reliability diagrams (also known as calibration curves) on the benchmark datasets. The procedure is as follows: (i) we calculate the $z\%$ prediction interval for each test point (using the predicted mean and variance), (ii) we then measure the actual fraction of test observations that fall within this prediction interval, and (iii) we repeat the calculations over a range of values of $z$ at fixed steps. If the actual fraction is close to the expected fraction (i.e. $z\%$), this indicates that the predictive probabilities are well-calibrated. The ideal output would be the diagonal line. In other words, a regressor is considered to be well-calibrated if its calibration curve is close to the diagonal.
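The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes central Gaussian prediction intervals, uses synthetic data in place of the benchmark datasets, and the grid of levels (0.1 to 0.9) is an illustrative choice.

```python
import random
from statistics import NormalDist

def observed_fraction(y, mu, sigma, z):
    """Fraction of targets inside the central z-level Gaussian prediction interval."""
    # The central interval with probability mass z spans +/- q standard deviations.
    q = NormalDist().inv_cdf(0.5 + z / 2.0)
    inside = sum(1 for yi, mi, si in zip(y, mu, sigma) if abs(yi - mi) <= q * si)
    return inside / len(y)

def calibration_curve(y, mu, sigma, levels):
    """Pairs (expected fraction, observed fraction), one per level z in (0, 1)."""
    return [(z, observed_fraction(y, mu, sigma, z)) for z in levels]

# Synthetic test set (assumption): targets drawn from a standard normal.
rng = random.Random(0)
n = 20000
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
mu = [0.0] * n
levels = [k / 10 for k in range(1, 10)]

calibrated = calibration_curve(y, mu, [1.0] * n, levels)      # correct predicted std
overconfident = calibration_curve(y, mu, [0.5] * n, levels)   # predicted std too small

for (z, good), (_, bad) in zip(calibrated, overconfident):
    print(f"z={z:.1f}  calibrated={good:.3f}  overconfident={bad:.3f}")
```

The calibrated predictor tracks the diagonal, while the predictor with an understated standard deviation yields observed fractions below the expected ones, which is the signature of overconfidence in a reliability diagram.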
We report the reliability diagrams for the benchmark datasets in Figure 4. We find that our method provides well-calibrated uncertainty estimates with a calibration curve very close to the diagonal (and almost overlapping with the curve of Deep Ensembles (Lakshminarayanan et al., 2017)). We also find that the predicted variance (learned using BVM or NLL) is better calibrated than the empirical variance (computed by training five NNs using MSE), which is overconfident. For instance, in the reliability diagram for the Boston Housing dataset, the actual fraction of test observations that fall within a given prediction interval falls short of the expected fraction. In other words, the empirical variance (using MSE) underestimates the true uncertainty. The trend is the same for all datasets.
Appendix B Why does BVM outperform Deep Ensembles on OOD samples? (detailed proof)
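Recall from Section 2.5 that the $\epsilon$-BVM probability of agreement for a given input feature vector $x_n$ can be expressed as
$$p(A \mid x_n, \epsilon) = \Phi\!\left(\frac{y_n + \epsilon - \mu_n}{\sigma_n}\right) - \Phi\!\left(\frac{y_n - \epsilon - \mu_n}{\sigma_n}\right),$$
where $y_n$ is the observed target, $\mu_n$ and $\sigma_n^2$ are the predicted mean and variance at $x_n$, and $\Phi$ is the cumulative distribution function (cdf) of the standard normal distribution:
$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\, dt.$$
Also recall that the (overall) negative log $\epsilon$-BVM probability of agreement (i.e. the BVM loss function) over the set of all input feature vectors is
$$\mathcal{L}_{\text{BVM}} = -\sum_{n=1}^{N} \log p(A \mid x_n, \epsilon).$$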
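Note that for a given input feature vector $x_n$, the minimizer of the BVM loss function satisfies
$$(\mu_n^*, \sigma_n^*) = \arg\min_{\mu_n, \sigma_n} \; -\log\left[\Phi\!\left(\frac{y_n + \epsilon - \mu_n}{\sigma_n}\right) - \Phi\!\left(\frac{y_n - \epsilon - \mu_n}{\sigma_n}\right)\right].$$
Taylor expanding around $\epsilon = 0$ leads to
$$(\mu_n^*, \sigma_n^*) \approx \arg\min_{\mu_n, \sigma_n} \left[\frac{1}{2}\log \sigma_n^2 + \frac{(y_n - \mu_n)^2}{2\sigma_n^2} - \log\!\left(1 + \frac{\epsilon^2}{6\sigma_n^2}\left(\frac{(y_n - \mu_n)^2}{\sigma_n^2} - 1\right)\right)\right].$$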
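For a given input feature vector $x_n$, with observed target $y_n$ and predicted mean $\mu_n$ and standard deviation $\sigma_n$, let the function $f_n$ be defined by
$$f_n(\epsilon) = \Phi\!\left(\frac{y_n + \epsilon - \mu_n}{\sigma_n}\right) - \Phi\!\left(\frac{y_n - \epsilon - \mu_n}{\sigma_n}\right).$$
Then, we have
$$f_n'(\epsilon) = \frac{1}{\sigma_n}\left[\phi\!\left(\frac{y_n + \epsilon - \mu_n}{\sigma_n}\right) + \phi\!\left(\frac{y_n - \epsilon - \mu_n}{\sigma_n}\right)\right],$$
$$f_n''(\epsilon) = \frac{1}{\sigma_n^2}\left[\phi'\!\left(\frac{y_n + \epsilon - \mu_n}{\sigma_n}\right) - \phi'\!\left(\frac{y_n - \epsilon - \mu_n}{\sigma_n}\right)\right],$$
$$f_n'''(\epsilon) = \frac{1}{\sigma_n^3}\left[\phi''\!\left(\frac{y_n + \epsilon - \mu_n}{\sigma_n}\right) + \phi''\!\left(\frac{y_n - \epsilon - \mu_n}{\sigma_n}\right)\right],$$
where $\phi$ is the probability density function (pdf) of the standard normal distribution:
$$\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}.$$
In what follows we will also use $\phi'$ and $\phi''$, which are expressed as
$$\phi'(z) = -z\,\phi(z), \qquad \phi''(z) = (z^2 - 1)\,\phi(z).$$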
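The Taylor series approximation of $f_n$ near $\epsilon = 0$ is
$$f_n(\epsilon) \approx f_n(0) + f_n'(0)\,\epsilon + \frac{f_n''(0)}{2}\,\epsilon^2 + \frac{f_n'''(0)}{6}\,\epsilon^3,$$
where $f_n(0) = 0$, and $f_n'(0)$, $f_n''(0)$, and $f_n'''(0)$ can be derived as follows:
$$f_n'(0) = \frac{2}{\sigma_n}\,\phi\!\left(\frac{y_n - \mu_n}{\sigma_n}\right).$$
It follows that $f_n'(0)/2$ is the pdf of the general normal distribution:
$$\mathcal{N}(y_n; \mu_n, \sigma_n^2) = \frac{1}{\sigma_n\sqrt{2\pi}} \exp\!\left(-\frac{(y_n - \mu_n)^2}{2\sigma_n^2}\right).$$
It follows that
$$f_n''(0) = \frac{1}{\sigma_n^2}\left[\phi'\!\left(\frac{y_n - \mu_n}{\sigma_n}\right) - \phi'\!\left(\frac{y_n - \mu_n}{\sigma_n}\right)\right] = 0.$$
It follows that
$$f_n'''(0) = \frac{2}{\sigma_n^3}\,\phi''\!\left(\frac{y_n - \mu_n}{\sigma_n}\right) = \frac{2}{\sigma_n^2}\left(\frac{(y_n - \mu_n)^2}{\sigma_n^2} - 1\right)\mathcal{N}(y_n; \mu_n, \sigma_n^2).$$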
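Hence, the Taylor series approximation of $f_n$ around $\epsilon = 0$ is
$$f_n(\epsilon) \approx 2\epsilon\, \mathcal{N}(y_n; \mu_n, \sigma_n^2)\left[1 + \frac{\epsilon^2}{6\sigma_n^2}\left(\frac{(y_n - \mu_n)^2}{\sigma_n^2} - 1\right)\right].$$
Taking its negative log gives
$$-\log f_n(\epsilon) \approx -\log(2\epsilon) + \frac{1}{2}\log(2\pi\sigma_n^2) + \frac{(y_n - \mu_n)^2}{2\sigma_n^2} - \log\!\left(1 + \frac{\epsilon^2}{6\sigma_n^2}\left(\frac{(y_n - \mu_n)^2}{\sigma_n^2} - 1\right)\right).$$
Thus, the minimizer of the BVM loss over the set of all input feature vectors can be approximated as
$$\arg\min_{\{\mu_n, \sigma_n\}} \sum_{n=1}^{N} \left[\frac{1}{2}\log \sigma_n^2 + \frac{(y_n - \mu_n)^2}{2\sigma_n^2} - \log\!\left(1 + \frac{\epsilon^2}{6\sigma_n^2}\left(\frac{(y_n - \mu_n)^2}{\sigma_n^2} - 1\right)\right)\right],$$
where the terms that do not depend on $\mu_n$ and $\sigma_n$ have been dropped.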
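As a numerical sanity check of the expansion, the sketch below compares an exact probability of agreement with its third-order Taylor approximation. It rests on an assumption: that the $\epsilon$-BVM probability of agreement has the Gaussian form $\Phi((y + \epsilon - \mu)/\sigma) - \Phi((y - \epsilon - \mu)/\sigma)$. The function names and the test values of $y$, $\mu$, $\sigma$, and $\epsilon$ are illustrative only.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal: provides cdf (Phi) and pdf (phi)

def bvm_prob(y, mu, sigma, eps):
    """Probability that a N(mu, sigma^2) draw lies within eps of y (assumed form)."""
    return _STD.cdf((y + eps - mu) / sigma) - _STD.cdf((y - eps - mu) / sigma)

def bvm_prob_taylor(y, mu, sigma, eps):
    """Third-order Taylor approximation of bvm_prob around eps = 0."""
    a = (y - mu) / sigma
    density = _STD.pdf(a) / sigma  # N(y; mu, sigma^2)
    return 2.0 * eps * density * (1.0 + eps ** 2 / (6.0 * sigma ** 2) * (a ** 2 - 1.0))

def gaussian_nll(y, mu, sigma):
    """Gaussian negative log-likelihood of a single observation."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)

y, mu, sigma = 1.3, 1.0, 0.5  # illustrative values

# The relative error of the approximation grows with eps / sigma.
for eps in (0.001, 0.05, 0.25):
    exact = bvm_prob(y, mu, sigma, eps)
    approx = bvm_prob_taylor(y, mu, sigma, eps)
    print(f"eps={eps:<5}  exact={exact:.8f}  taylor={approx:.8f}  "
          f"rel.err={abs(exact - approx) / exact:.2e}")

# For small eps, -log p + log(2 * eps) approaches the Gaussian NLL.
eps = 1e-3
shifted = -math.log(bvm_prob(y, mu, sigma, eps)) + math.log(2.0 * eps)
print("shifted BVM loss:", shifted, " Gaussian NLL:", gaussian_nll(y, mu, sigma))
```

Under this assumed form, the small-$\epsilon$ behaviour matches the Gaussian NLL up to the additive constant $-\log(2\epsilon)$, and the $\epsilon$-dependent correction only becomes visible once $\epsilon$ is no longer small relative to $\sigma$.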