Maximizing Overall Diversity for Improved Uncertainty Estimates in Deep Ensembles

by Siddhartha Jain, et al.

The inaccuracy of neural network models on inputs that do not stem from the training data distribution is both problematic and at times unrecognized. Model uncertainty estimation can address this issue, where uncertainty estimates are often based on the variation in predictions produced by a diverse ensemble of models applied to the same input. Here we describe Maximize Overall Diversity (MOD), a straightforward approach to improve ensemble-based uncertainty estimates by encouraging larger overall diversity in ensemble predictions across all possible inputs that might be encountered in the future. When applied to various neural network ensembles, MOD significantly improves predictive performance for out-of-distribution test examples without sacrificing in-distribution performance on 38 Protein-DNA binding regression datasets, 9 UCI datasets, and the IMDB-Wiki image dataset. Across many Bayesian optimization tasks, the performance of UCB acquisition is also greatly improved by leveraging MOD uncertainty estimates.





1 Introduction

Model ensembling provides a simple, yet extremely effective technique for improving the predictive performance of arbitrary supervised learners, each trained via empirical risk minimization (ERM) (Breiman, 1996; Brown, 2004). Often, ensembles are utilized not only to improve predictions on test examples stemming from the same underlying distribution as the training data, but also to provide estimates of model uncertainty when learners are presented with out-of-distribution (OOD) examples that may look different from the data encountered during training (Lakshminarayanan et al., 2017; Osband et al., 2016). The widespread success of ensembles crucially relies on the variance reduction produced by aggregating predictions that are statistically prone to different types of individual errors (Kuncheva and Whitaker, 2003). Thus, prediction improvements are best realized by using a large ensemble with many base models, and a large ensemble is also typically employed to produce stable distributional estimates of model uncertainty (Breiman, 1996; Papadopoulos et al., 2001).

Despite this, practical applications of massive neural networks (NNs) are commonly limited to a small ensemble due to the unwieldy nature of these models (Osband et al., 2016; Balan et al., 2015; Beluch et al., 2018). Although supervised learning performance may still be enhanced by an ensemble comprised of only a few ERM-trained models, the resulting ensemble-based uncertainty estimates can exhibit excessive sampling variability in low-density regions of the underlying training distribution. Such unreliable uncertainty estimates are highly undesirable in applications where future data may not always stem from the same distribution (e.g. due to sampling bias, covariate shift, or the adaptive experimentation that occurs in bandits, Bayesian optimization (BO), and reinforcement learning (RL) contexts). Here, we propose a technique, Maximize Overall Diversity (MOD), to stabilize the OOD model uncertainty estimates produced by an ensemble of arbitrary neural networks. The core idea is to consider all possible inputs and encourage as much overall diversity in the corresponding model ensemble outputs as can be tolerated without diminishing the ensemble's predictive performance. MOD utilizes an auxiliary loss function and data-augmentation strategy that is easily integrated into any existing training procedure.

2 Related Work

NN ensembles have been previously demonstrated to produce useful uncertainty estimates, including for sequential experimentation applications in Bayesian optimization and reinforcement learning (Papadopoulos et al., 2001; Lakshminarayanan et al., 2017; Riquelme et al., 2018; Osband et al., 2016; Chen et al., 2017). Proposed methods to improve ensembles of limited size include adversarial training to enforce smoothness (Lakshminarayanan et al., 2017), and maximizing ensemble output diversity over the training data (Brown, 2004). In contrast, our focus is on controlling ensemble behavior over all possible inputs, not merely those presented during training.

Consideration of all possible inputs has previously been advocated by Hooker and Rosset (2012), although not in the context of uncertainty estimation. Like our approach, Hafner et al. (2018) also aim to control NN output-behavior beyond the training distribution, but our methods do not require the Bayesian formulation they impose and can be applied to arbitrary NN ensembles, which are one of the most straightforward methods used for quantifying neural network uncertainty (Papadopoulos et al., 2001; Osband et al., 2016). While we primarily consider regression settings here, our ideas can be easily adapted to classification by replacing variance terms with entropy terms; a similar variant that relies on an auxiliary generator network to produce augmented samples that do not stem from the training distribution has been recently proposed by Lee et al. (2018).

3 Methods

We consider a standard regression setup, assuming continuous target values are generated via y = f*(x) + ε with ε ~ N(0, σ²(x)), such that the noise variance σ²(x) may heteroscedastically depend on the feature values x. Given a limited training dataset D = {(x_i, y_i)} of N examples, where P_in specifies the underlying data distribution from which the in-distribution examples in the training data are sampled, our goal is to learn an ensemble of M neural networks that accurately models both the underlying function f* as well as the uncertainty in ensemble-estimates of f*. Of particular concern are scenarios where test examples may stem from a different distribution Q ≠ P_in, which we refer to as out-of-distribution (OOD) examples. As in Lakshminarayanan et al. (2017), each network (with parameters θ_m) in our NN ensemble outputs both an estimated mean μ_m(x) to predict y and an estimated variance σ²_m(x) to predict the noise level, and the per-network loss function is chosen as the negative log-likelihood (NLL) under the Gaussian assumption y | x ~ N(μ_m(x), σ²_m(x)). While traditional bagging provides different training data to each ensemble member, we simply train each NN using the entire dataset, since the randomness of separate NN initializations and SGD training suffices to produce performance comparable to bagging of NN models (Lakshminarayanan et al., 2017; Lee et al., 2015; Osband et al., 2016).

Following Lakshminarayanan et al. (2017), we estimate p(y | x) (and NLL with respect to the ensemble) by treating the aggregate ensemble output as a single Gaussian distribution N(μ_*(x), σ_*²(x)). Here, the ensemble-estimate of f* (used in RMSE calculations) is given by the ensemble mean μ_*(x) = (1/M) Σ_m μ_m(x), and the uncertainty in the target value is given by σ_*²(x) = σ²_noise(x) + σ²_model(x), based on the noise-level estimate σ²_noise(x) = (1/M) Σ_m σ²_m(x) and the model uncertainty estimate σ²_model(x) = (1/M) Σ_m μ_m(x)² − μ_*(x)². While we focus on Gaussian likelihoods for simplicity, our proposed methodology is applicable to general parametric conditional distributions.
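For concreteness, these mixture moments can be computed in a few lines. The sketch below uses hypothetical names and operates on one input at a time, with plain numpy rather than the paper's PyTorch:

```python
import numpy as np

def ensemble_gaussian(mus, sigma2s):
    """Aggregate M per-network Gaussian outputs into a single Gaussian.

    mus, sigma2s: arrays of shape (M,) holding each network's predicted
    mean and variance for one input x, following the mixture-moment
    formulas of Lakshminarayanan et al. (2017)."""
    mu_star = mus.mean()                          # ensemble mean prediction
    noise = sigma2s.mean()                        # estimated aleatoric noise
    model_var = (mus ** 2).mean() - mu_star ** 2  # disagreement among members
    return mu_star, noise + model_var, model_var
```

The returned model-variance term is exactly the quantity that MOD's auxiliary loss (next section) seeks to inflate away from the training data.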

3.1 Maximizing Overall Diversity (MOD)

Assuming inputs x have been scaled to a bounded region, MOD encourages higher ensemble diversity by introducing an auxiliary loss that is computed over augmented data sampled from another distribution Q defined over the input feature space. Q differs from the underlying training data distribution P_in and instead describes what sorts of OOD examples could possibly be encountered at test-time. The underlying population objective we target is:

minimize E_{(x,y) ~ P_in}[ ℓ(x, y) ] − λ · E_{x ~ Q}[ σ²_model(x) ]

with ℓ as the original supervised learning loss function (e.g. NLL), and a user-specified penalty λ ≥ 0. Since NLL entails a proper scoring rule (Lakshminarayanan et al., 2017), minimizing the above objective with a sufficiently small value of λ will ensure the ensemble seeks to recover p(y | x) for inputs x that lie in the support of the training distribution, and otherwise to output large model uncertainty for OOD x that lie beyond this support. As it is difficult in most applications to specify how future OOD examples may look, we aim to ensure the ensemble outputs high uncertainty estimates for any possible x by taking the entire input space into consideration. This corresponds to choosing Q uniform over X, where X is the bounded region of all possible inputs and q denotes the pdf (or pmf) of Q. In practice, we approximate the first expectation using the average loss over the training data (as in ERM), and train each network with respect to its contribution to this term independently of the others (as in bagging). To approximate E_{x ~ Q}[σ²_model(x)], we similarly utilize an empirical average based on augmented examples sampled uniformly throughout the feature space X. The formal MOD procedure is detailed in Algorithm 1. We advocate selecting λ as the largest value for which estimates of the predictive loss (on held-out validation data) do not indicate worse predictive performance. This strategy naturally favors smaller values of λ as the sample size grows, thus resulting in lower model uncertainty estimates (with σ²_model(x) → 0 as N → ∞ when P_in is supported everywhere and our NNs are universal approximators).
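A minimal sketch of the two ingredients this objective adds in practice, namely uniform sampling of augmented inputs from the bounded region and the negated ensemble-variance penalty, might look as follows (hypothetical helper names; the variance heads are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_augmented(low, high, n, dim):
    """Draw n augmented inputs uniformly from the bounded box [low, high]^dim."""
    return rng.uniform(low, high, size=(n, dim))

def mod_penalty(member_means, x_aug):
    """MOD auxiliary term: negative mean ensemble variance over augmented
    inputs, so that *minimizing* the total loss maximizes diversity there.
    member_means is a list of callables x -> predicted mean."""
    preds = np.stack([f(x_aug) for f in member_means])  # shape (M, n)
    return -preds.var(axis=0).mean()
```

The total per-member loss is then the supervised loss plus λ times this penalty, as in Algorithm 1.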

We also experiment with an alternative choice of Q, namely the uniform distribution over the finite training data (i.e. q(x) = 1/N for x in the training set and q(x) = 0 otherwise). We call this alternative method MOD-in, and note its similarity to the diversity-encouraging penalty proposed by Brown (2004), which is also measured over the training data. Note that MOD, in contrast, considers Q to be uniformly distributed over the entire feature space (i.e. all possible test inputs) rather than only the training examples. Maximizing diversity solely over the training data may fail to control ensemble behavior at OOD points that do not lie near any training example.

3.2 Maximizing Reweighted Diversity (MOD-R)

Aiming for high-quality OOD uncertainty estimates, we are most concerned with regularizing the ensemble variance around points located in low-density regions of the training data distribution. To obtain a simple estimate that intuitively reflects (the inverse of) the local density of P_in at a particular set of feature values, one can compute the feature-distance to the nearest training data points (Papernot and McDaniel, 2018). Under this perspective, we want to encourage greater model uncertainty for the lowest-density points that lie furthest from the training data. Commonly used covariance kernels for Gaussian Process regressors (e.g. radial basis functions) explicitly enforce a high amount of uncertainty on points that lie far from the training data. As calculating the distance of each point to the entire training set may be undesirably inefficient for large datasets, we only compute the distance of our augmented data to the current minibatch during training. Specifically, we use these distances to compute the following weights:

w_i = Σ_{x_j ∈ kNN(x̃_i)} ||x̃_i − x_j||  /  Σ_{i'} Σ_{x_j ∈ kNN(x̃_{i'})} ||x̃_{i'} − x_j||     (2)

where w_i are weights for each of the augmented points x̃_i, and kNN(x̃_i) denotes the members of the minibatch that are the k nearest neighbors of x̃_i. Throughout this paper, we use a fixed value of k.

The w_i are thus inversely related to a crude density estimate of the training distribution evaluated at each augmented sample x̃_i. Rather than optimizing the loss which uniformly weights each augmented sample (as done in Algorithm 1), we can instead form a weighted loss computed over the minibatch of augmented samples as Σ_i w_i · σ²_model(x̃_i), which should increase the model uncertainty for augmented inputs proportionally to their distance from the training data. We call this variant of our methodology with augmented-input reweighting MOD-R.
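This reweighting can be sketched as follows (a hypothetical helper; the exact distance measure and normalization in equation (2) may differ slightly):

```python
import numpy as np

def modr_weights(x_aug, x_batch, k=2):
    """MOD-R reweighting sketch: score each augmented point by its mean
    distance to its k nearest neighbors in the current minibatch (a crude
    inverse-density estimate), then normalize the scores to sum to one."""
    # pairwise distances, shape (n_aug, batch)
    d = np.linalg.norm(x_aug[:, None, :] - x_batch[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, :k]   # distances to the k nearest batch points
    scores = knn.mean(axis=1)
    return scores / scores.sum()
```

Augmented points far from every minibatch member thus receive a larger share of the diversity penalty.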

Input: Training data D, penalty λ, batch-size B
Output: Parameters of the ensemble of M neural networks θ_1, …, θ_M
Initialize θ_1, …, θ_M randomly
repeat
     Sample minibatch {(x_b, y_b)}, b = 1, …, B, from the training data D
     Sample augmented inputs x̃_1, …, x̃_B uniformly at random from X
     for m = 1, …, M do
          L_div ← (1/B) Σ_b σ²_model(x̃_b)
          if MOD-R then L_div ← Σ_b w_b · σ²_model(x̃_b)     (weights w_b defined in equation (2))
          Update θ_m via SGD with gradient ∇_{θ_m} [ (1/B) Σ_b ℓ(x_b, y_b) − λ · L_div ]
until iteration limit reached
Algorithm 1  MOD/MOD-R training procedure
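To make the procedure concrete, the toy sketch below runs the Algorithm 1 loop end-to-end with linear members f_m(x) = w_m·x + b_m standing in for the NNs and squared error in place of NLL; the ensemble mean is treated as constant when differentiating the diversity term. All simplifications and names here are ours, not the paper's:

```python
import numpy as np

def train_mod_linear(X, y, M=4, lam=0.1, lr=0.3, steps=500, seed=0):
    """Toy MOD training loop with 1-D linear ensemble members."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=M)
    b = rng.normal(size=M)
    lo, hi = X.min(), X.max()
    for _ in range(steps):
        xa = rng.uniform(lo, hi, size=X.shape[0])      # augmented inputs
        fa = w[:, None] * xa[None, :] + b[:, None]     # members on augmented
        fbar = fa.mean(axis=0)                         # ensemble mean there
        for m in range(M):
            pred = w[m] * X + b[m]
            g_w = 2 * ((pred - y) * X).mean()          # supervised gradient
            g_b = 2 * (pred - y).mean()
            # diversity term: push member away from the ensemble mean
            g_w -= lam * 2 * ((fa[m] - fbar) * xa).mean()
            g_b -= lam * 2 * (fa[m] - fbar).mean()
            w[m] -= lr * g_w
            b[m] -= lr * g_b
    return w, b
```

With λ = 0 this reduces to independently fitting each member; with a small λ > 0 the members still fit the data (consistent with the paper's claim that a suitably chosen λ does not hurt predictive performance) while disagreeing more off-distribution.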

4 Experiments

4.1 Baseline Methods

Here, we evaluate various alternative strategies for improving model ensembles. All strategies are applied to the same base NN ensemble, which is taken to be the Deep Ensembles (DeepEns) model of Lakshminarayanan et al. (2017) previously described in Section 3.

Deep Ensembles with Adversarial Training (DeepEns+AT)

Lakshminarayanan et al. (2017) used this strategy to improve their basic DeepEns model. The idea is to adversarially sample inputs that lie close to the training data but on which the NLL loss is high (assuming they share the same label as their neighboring training example). Then, we include these adversarial points as augmented data when training the ensemble, which smooths the function learned by the ensemble. Starting from training example (x, y), we sample the augmented datapoint x′ = x + ε · sign(∇_x ℓ(x, y)), with the label for x′ assumed to be the same as that for the corresponding x. Here ℓ denotes the NLL loss function, and the values for the hyperparameter ε that we search over include 0.05, 0.1, 0.2.
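This fast-gradient-sign augmentation step can be sketched as follows (a hypothetical grad_fn hook stands in for backpropagation of the NLL through the network):

```python
import numpy as np

def fgsm_augment(x, y, grad_fn, eps=0.1):
    """Adversarial augmentation as in DeepEns+AT: nudge x by eps in the
    sign direction of the loss gradient, keeping the original label y.
    grad_fn(x, y) -> d(loss)/dx."""
    return x + eps * np.sign(grad_fn(x, y)), y
```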

Negative Correlation (NegCorr)

This method from Liu and Yao (1999) minimizes the empirical correlation between predictions of different ensemble members over the training data. It adds a penalty to the loss of the form Σ_m (f_m(x) − f̄(x)) · Σ_{m′ ≠ m} (f_{m′}(x) − f̄(x)), where f_m(x) is the prediction of the m-th ensemble member and f̄(x) is the mean ensemble prediction. This penalty is weighted by a user-specified coefficient λ, as done in our methodology.

4.2 Experiment Details

All experiments were run on Nvidia TitanX 1080 Ti and Nvidia TitanX 2080 Ti GPUs with PyTorch version 1.0. Unless otherwise indicated, all p-values were computed using a single-tailed paired t-test per dataset, and the per-dataset p-values are combined using Fisher's method to produce an overall p-value across all datasets in a task. All hyperparameters – including the learning rate, ℓ2-regularization, the penalty λ for MOD/Negative Correlation, and ε for adversarial training – were tuned based on validation set NLL. In every regression task, the search for the hyperparameter λ was over the values 0.01, 0.1, 1, 5, 10, 20, 50.
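Fisher's method of combining p-values is straightforward to implement from scratch, since −2 Σ ln p_i is chi-squared with 2k degrees of freedom under the global null, and the chi-squared survival function with an even number of degrees of freedom has a closed form; a self-contained sketch:

```python
import math

def fisher_combine(pvalues):
    """Combine k independent p-values via Fisher's method.

    X = -2 * sum(ln p_i) ~ chi-squared with 2k d.o.f. under the null;
    for integer k the survival function is exp(-t) * sum_{i<k} t^i / i!
    with t = X / 2."""
    k = len(pvalues)
    t = -sum(math.log(p) for p in pvalues)
    return math.exp(-t) * sum(t ** i / math.factorial(i) for i in range(k))
```

With a single p-value the method is the identity; smaller constituent p-values yield a smaller combined p-value.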

4.3 Univariate Regression

We first consider a one-dimensional toy regression dataset similar to the one used by Blundell et al. (2015), generated from a noisy nonlinear function of a scalar input. Here, the training data only contain samples drawn from two limited-size regions. Using the standard NLL loss as well as the auxiliary MOD penalty, we train a deep ensemble of 4 neural networks of identical architecture (1 hidden layer with 50 units, ReLU activation, two sigmoid outputs to estimate the mean and variance of y, and ℓ2 regularization). To depict the improvement gained by simply adding ensemble members, we also train an ensemble of 120 networks with the same architecture. Figure 1 shows the predictions and confidence intervals of the ensembles. MOD produces more reliable uncertainty estimates in the left-hand regions, where regular deep ensembles fail even with many networks. MOD also properly inflates the predictive uncertainty in the center region where no training data are found. Using a smaller penalty λ in MOD ensures that the ensemble's predictive performance remains strong for in-distribution inputs that lie near the training data and that the ensemble exhibits adequate levels of certainty around these points. While a larger λ value leads to overly conservative uncertainty estimates that are large everywhere, we note the mean of the ensemble predictions remains highly accurate for in-distribution inputs.

Figure 1: Regression on toy data with confidence-interval ranges. The black dots are randomly generated training examples and the grey dotted line depicts the ground-truth function used to generate labels. The predicted mean and CI from each individual network are plotted as dashed lines, and the ensemble mean/CI are plotted as a solid purple line/band.

4.4 UCI Regression Datasets

We next experimented with nine real-world datasets with continuous inputs in some applicable bounded domain. We follow the experimental setup that Lakshminarayanan et al. (2017) and Hernández-Lobato et al. (2017) used to evaluate deep ensembles and deep Bayesian regressors. We split off all datapoints whose y-values fall in the top 5% as an OOD test set (so datapoints with such large y-values are never encountered during training). We simulate the situation where the training set is limited and thus use only small fractions of the data for training and for validation. The remaining data is used as an in-distribution test set. The analysis is repeated for 10 random splits of the data to ensure the results are robust. We again use an ensemble of 4 fully-connected neural networks with the same architecture as in the previous experiment and the NLL training loss (searching over values of the penalty λ and the learning rate). We report the negative log-likelihood (NLL) and root mean squared error (RMSE) on both in- and out-of-distribution test sets for ensembles trained via the different strategies. We also examine calibration curves, where predicted confidence levels are compared with the actual empirical level of correctness observed in the data.
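The target-based OOD split described above can be sketched as (hypothetical helper name):

```python
import numpy as np

def ood_split(y, top_frac=0.05):
    """Hold out the examples with the largest target values (top 5% by
    default) as the OOD test set; returns (in-dist, OOD) index arrays."""
    cutoff = np.quantile(y, 1.0 - top_frac)
    ood = np.where(y > cutoff)[0]
    ind = np.where(y <= cutoff)[0]
    return ind, ood
```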

As shown in Table 1 and Appendix Table A1, MOD outperforms DeepEns on 6 of the 9 datasets on OOD NLL, 7 of the 9 datasets on OOD RMSE, and all of the datasets on in-distribution NLL/RMSE. This shows that the MOD loss leads to higher-quality uncertainties on OOD data while also improving in-distribution performance. MOD also outperforms all baselines with a significant overall p-value on NLL/RMSE, except NegCorr, with which it is on par on RMSE, demonstrating the robustness of the improvement contributed by MOD.

Datasets DeepEns DeepEns+AT NegCorr MOD MOD-R MOD-in
Out-of-distribution NLL
protein 1.162±0.23 1.178±0.16 1.231±0.13 1.197±0.14 1.194±0.21 1.154±0.13
naval -2.580±0.10 -1.380±0.09 -2.618±0.06 -2.729±0.07 -2.130±0.07 -2.057±0.05
kin8nm -1.980±0.05 -1.970±0.09 -2.036±0.05 -1.999±0.05 -2.003±0.09 -1.993±0.08
power-plant -1.734±0.05 -1.731±0.09 -1.659±0.07 -1.638±0.15 -1.644±0.12 -1.731±0.05
bostonHousing 1.591±0.68 1.243±0.69 1.821±0.91 0.568±0.96 0.460±0.65 0.923±0.73
energy -1.590±0.25 -1.784±0.15 -1.718±0.19 -1.736±0.12 -1.741±0.26 -1.733±0.20
concrete -0.831±0.24 -0.915±0.20 -0.913±0.28 -0.904±0.12 -0.910±0.19 -0.924±0.19
yacht -1.597±0.84 -1.762±0.65 -1.972±0.57 -1.797±0.44 -1.761±0.58 -1.638±0.66
wine 0.133±0.13 0.115±0.09 0.113±0.10 0.153±0.11 0.084±0.07 0.085±0.06
MOD outperformance p-value 0.002 4.9e-07 0.034 1.6e-04 4.6e-05
In-distribution NLL
protein -0.514±0.01 -0.519±0.01 -0.544±0.01 -0.533±0.01 -0.532±0.01 -0.529±0.01
naval -2.735±0.08 -1.513±0.04 -2.810±0.04 -2.857±0.07 -2.297±0.06 -2.238±0.05
kin8nm -1.305±0.02 -1.315±0.02 -1.334±0.02 -1.317±0.02 -1.315±0.02 -1.322±0.02
power -1.521±0.02 -1.525±0.02 -1.524±0.02 -1.523±0.01 -1.522±0.02 -1.524±0.01
bostonHousing -0.901±0.15 -0.937±0.14 -0.656±0.67 -0.953±0.15 -0.883±0.19 -0.925±0.18
energy -2.426±0.15 -2.517±0.10 -2.620±0.13 -2.507±0.15 -2.525±0.10 -2.522±0.13
concrete -1.075±0.09 -1.129±0.08 -1.089±0.10 -1.155±0.09 -1.137±0.13 -1.090±0.09
yacht -3.286±0.69 -3.245±0.82 -3.570±0.17 -3.500±0.19 -3.461±0.25 -3.339±0.82
wine -0.070±0.85 -0.341±0.07 -0.266±0.29 -0.337±0.07 -0.348±0.05 -0.351±0.05
MOD outperformance p-value 2.4e-08 6.9e-11 0.116 8.9e-06 2.2e-06
Table 1: Averaged NLL on out-of-distribution/in-distribution test examples over 10 replicate runs for the UCI datasets; the top 5% of samples were held out as the OOD test set (see Appendix Table A1 for RMSE). The MOD outperformance p-value is the combined (via Fisher's method) p-value of the MOD NLL being less than the NLL of the method in the corresponding column (with the p-value per dataset computed using a paired single-tailed t-test). Bold indicates best in category and bold+italicized indicates second best. In case of a tie in the means, the method with the lower standard deviation is highlighted.

Figure 2 shows the calibration curves on two of the datasets where the basic deep ensembles exhibit over-confidence on OOD data. Note that retaining accurate calibration on OOD data is extremely difficult for most machine learning methods. MOD and MOD-R improve calibration by a significant margin compared to most of the baselines, validating the effectiveness of our MOD procedure.

Figure 2: Calibration curves for regression models trained on two of the UCI datasets (left) and two DNA TF binding datasets (right). A perfect calibration curve should lie on the diagonal.

4.5 Protein Binding Microarray Data

We next study scientific data with discrete features by predicting Protein-DNA binding. This is a collection of 38 different microarray datasets, each of which contains measurements of the binding affinity of a single transcription factor (TF) protein against all possible 8-base DNA sequences (Barrera et al., 2016). We consider each dataset as a separate task with y taken to be the binding affinity (re-scaled to the [0,1] interval) and x the one-hot embedded DNA sequence (as we ignore reverse-complements, there are 32,896 possible values of x).
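The one-hot input representation for this task can be sketched as (hypothetical helper name):

```python
import numpy as np

BASES = "ACGT"

def one_hot_8mer(seq):
    """One-hot embed an 8-base DNA sequence as a flattened 8x4 binary
    vector, one row per position and one column per base."""
    assert len(seq) == 8
    x = np.zeros((8, len(BASES)))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x.ravel()
```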


We trained a small ensemble of 4 neural networks with the same architecture as in the previous experiments. We consider 2 different OOD test sets, one comprised of the sequences with the top 10% y-values and the other comprised of the sequences with more than 80% of positions being G or C (high GC-content). For each OOD set, we use the remainder of the sequences as the corresponding in-distribution set. We separate these into an extremely small training set (300 examples) and validation set (300 examples), and use the rest as the in-distribution test set. We compare MOD along with the 2 alternative sampling distributions (MOD-R and MOD-in) against the 3 baselines mentioned in Section 4.1. We search over 0, 0.001, 0.01, 0.05, 0.1 for the penalty λ and use a learning rate of 0.01.

Methods NLL # of TFs (MOD/-R outperform) p-value RMSE # of TFs (MOD/-R outperform) p-value
In-distribution test performance
DeepEns -0.4266±0.031 33 9.0e-14 0.1591±0.005 35 5.1e-22
DeepEns+AT -0.4312±0.033 28 0.001 0.1582±0.005 29 6.0e-09
NegCorr -0.4314±0.032 19 0.236 0.1583±0.005 24 0.032
MOD -0.4312±0.031 22 0.075 0.1581±0.005 25 0.024
MOD-R -0.4325±0.032 – – 0.1579±0.005 – –
MOD-in -0.4317±0.032 22 0.146 0.1581±0.005 26 0.044
Out-of-distribution test performance (OOD as sequences with top 10% binding affinity)
DeepEns 0.7485±0.124 24 0.006 0.2837±0.011 28 3.1e-06
DeepEns+AT 0.7438±0.122 25 0.014 0.2812±0.011 24 0.036
NegCorr 0.7358±0.118 20 0.218 0.2814±0.010 25 0.060
MOD 0.7153±0.117 16 0.921 0.2802±0.010 21 0.333
MOD-R 0.7225±0.116 – – 0.2795±0.010 – –
MOD-in 0.7326±0.121 22 0.279 0.2801±0.010 22 0.298
Out-of-distribution test performance (OOD as sequences with >80% GC content)
DeepEns -0.6938±0.052 20 0.022 0.1190±0.004 25 0.008
DeepEns+AT -0.7010±0.041 23 0.007 0.1180±0.003 19 0.029
NegCorr -0.6805±0.065 25 0.011 0.1179±0.004 21 0.052
MOD -0.7007±0.047 – – 0.1173±0.003 – –
MOD-R -0.6959±0.040 24 0.004 0.1177±0.003 22 0.112
MOD-in -0.6948±0.054 21 0.103 0.1177±0.004 21 0.401
Table 2: NLL and RMSE on the OOD/in-distribution test sets averaged across 38 TFs over 10 replicate runs. The MOD/-R outperformance p-value is the combined p-value of the MOD/-R NLL/RMSE being less than the NLL/RMSE of the method in the corresponding row ("–" marks the reference method for that block). Bold indicates best in category and bold+italicized indicates second best. In case of a tie in the means, the method with the lower standard deviation is highlighted.

Table 2 shows mean OOD and in-distribution performance across the 38 TFs (averaged over 10 runs using random data splits and NN initializations). The MOD methods significantly improve performance on all metrics and OOD setups compared to DeepEns/DeepEns+AT, both in terms of the number of TFs outperformed and the overall p-value. The re-weighting scheme (MOD-R) further improves performance in the top 10% y-value OOD setup. Figure 2 shows the calibration curves on two of the TFs where the deep ensembles are over-confident on the top 10% y-value OOD examples. MOD-R and MOD improve calibration by a significant margin compared to most of the baselines.

Bayesian Optimization

Next, we compared how the MOD, MOD-R, and MOD-in ensembles performed against the DeepEns, DeepEns+AT, and NegCorr ensembles in 38 Bayesian optimization tasks using the same protein binding data (see Hashimoto et al. (2018)). For each TF, we performed 30 rounds of DNA-sequence acquisition, acquiring batches of 10 sequences per round in an attempt to maximize binding affinity. We used the upper confidence bound (UCB) as our acquisition function (Chen et al., 2017), ordering the candidate points via the score μ_*(x) + κ · σ_model(x) (with UCB coefficient κ).
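The batch UCB step can be sketched as follows (hypothetical names; this sketch uses the spread of the member means as the model-uncertainty term):

```python
import numpy as np

def ucb_acquire(member_means, batch_size=10, kappa=1.0):
    """Score each candidate by ensemble mean plus kappa times the
    model-uncertainty standard deviation, and return the indices of the
    top-scoring batch.  member_means has shape (M, n_candidates)."""
    mu = member_means.mean(axis=0)
    sigma = member_means.std(axis=0)
    scores = mu + kappa * sigma
    return np.argsort(-scores)[:batch_size]
```

Because the score rewards disagreement among members, better-behaved OOD uncertainty estimates translate directly into better exploration.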

Figure 3: Regret for two Bayesian optimization tasks (averaged over 20 replicate runs). The bands depict 50% confidence intervals, and the x-axis indicates the number of DNA sequences whose binding affinity has been profiled so far.
vs DeepEns DeepEns+AT NegCorr MOD-in MOD MOD-R
MOD-in 21 (0.111) 21 (0.041) 19 (0.356) – 17 (0.791) 16 (0.51)
MOD 26 (0.003) 24 (0.004) 20 (0.001) 19 (0.002) – 22 (0.173)
MOD-R 22 (0.019) 23 (0.007) 22 (0.017) 20 (0.052) 14 (0.674) –
Table 3: Simple-regret comparison. Each cell shows the number of TFs (out of 38) for which the method in the corresponding row outperforms the method in the corresponding column (lower regret). The number in parentheses is the combined (across 38 TFs) p-value of the MOD/-in/-R regret being less than the regret of the method in the corresponding column.

At every acquisition iteration, we randomly held out 10% of the training set as the validation set and chose the penalty λ (for MOD, MOD-in, MOD-R, and NegCorr) that produced the best validation NLL (out of the choices: 0, 5, 10, 20, 40, 80). The stopping epoch is chosen based on the validation NLL not improving for 10 epochs, with an upper limit of 30 epochs. Optimization was done with the Adam optimizer, a learning rate of 0.01, and an ℓ2 penalty of 0.01. For each of the 38 TFs, we performed 20 Bayesian optimization runs with different seed sequences (the same seeds used for all methods), using 200 points randomly sampled from the bottom 90% of y-values as our initial training set.

We evaluated performance via simple regret, max_x y(x) minus the maximum y-value among the acquired points (the second term quantifies the best point acquired so far and the first term is the global best). The results are presented in Table 3. MOD outperforms all other methods both in the number of TFs with better regret and in the combined p-value. MOD-R is also strong, outperforming all other methods except MOD, to which it is roughly equivalent in terms of statistical significance. Figure 3 shows regret curves for the TFs OVOL2 and HESX1, tasks in which MOD and MOD-R outperform the other methods.
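Simple regret itself is a one-liner (hypothetical name):

```python
def simple_regret(y_all, acquired):
    """Global best affinity minus the best affinity acquired so far;
    zero once the optimum has been found."""
    return max(y_all) - max(y_all[i] for i in acquired)
```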

4.6 Age Prediction from Images

To demonstrate the effectiveness of MOD-in, MOD, and MOD-R on high-dimensional data, we consider supervised learning with image data. Here, we use a dataset of human images collected from IMDB and Wikimedia and annotated with age and gender information (Rothe et al., 2015). The IMDB/Wiki parts of the dataset consist of 460K+/62K+ images, respectively. 28,601 images in the Wiki dataset are of males and the rest are of females.

Using the Wiki images, we predict a person's age from their image, using 2,000 images of males as the training set. We used the female images as the unlabeled OOD test set on which we increased variance via MOD and MOD-R. We used the Wide Residual Network architecture (Zagoruyko and Komodakis, 2016) with a depth of 4 and a width factor of 2. As before, we used an ensemble of size 4, and we again searched over a grid of values for the penalty λ. The stopping epoch is chosen based on the validation NLL not improving for 10 epochs, with an upper limit of 30 epochs. Optimization was done with the Adam optimizer, a learning rate of 0.001, and an ℓ2 penalty of 0.001. MOD attains an NLL of -0.16 on the OOD (female) data and -0.331 on the in-distribution (male) data, while MOD-R attains -0.148 on OOD and -0.328 in-distribution. This is in contrast to DeepEns, which attains only -0.13 on OOD and -0.32 in-distribution. Thus both MOD and MOD-R show significant improvements in NLL on the OOD (female) data (p-values of 0.014 and 0.102, respectively) and slight improvements in NLL on the in-distribution (male) data. In addition, while DeepEns+AT has a better mean OOD NLL than MOD, the difference is not significant (p-value 0.388), so the two methods can be considered equivalent on this dataset. Notably, every MOD variant improves performance across all 4 metrics, so augmenting the loss function with the MOD penalty should not make the model worse.

Methods OOD NLL In-Dist NLL OOD RMSE In-Dist RMSE
DeepEns -0.131±0.050 -0.320±0.024 0.208±0.007 0.180±0.003
DeepEns+AT -0.164±0.004 -0.334±0.024 0.203±0.007 0.177±0.004
NegCorr -0.153±0.038 -0.328±0.021 0.204±0.006 0.178±0.004
MOD-in -0.140±0.031 -0.331±0.018 0.207±0.005 0.177±0.003
MOD -0.160±0.029 -0.331±0.025 0.203±0.005 0.176±0.004
MOD-R -0.147±0.029 -0.330±0.027 0.205±0.005 0.177±0.004
Table 4: Image regression results showing mean performance across 20 randomly seeded runs (along with one standard deviation). In-Dist refers to the in-distribution test set; OOD refers to the out-of-distribution test set. Bold indicates best in category and bold+italicized indicates second best. In case of a tie in means, the method with the lower standard deviation is highlighted.

5 Conclusion

We have developed a new loss function and data-augmentation strategy that helps stabilize the out-of-distribution uncertainty estimates obtained from model ensembling. Our method increases model uncertainty over the entire input space while simultaneously maintaining predictive performance. We further propose a variant of our method which assesses the distance of an augmented sample from the training distribution and aims to ensure higher model uncertainty in regions with low density under the training data distribution. When used for training deep NN ensembles, our method produces strong improvements in in- and out-of-distribution NLL, out-of-distribution RMSE, and calibration on a variety of datasets drawn from biology, vision, and the UCI repository commonly used to benchmark uncertainty quantification methods. The resulting ensembles also prove very useful across numerous Bayesian optimization tasks. Future work could develop techniques to generate OOD augmented samples for structured data domains like images and text, as well as apply our uncertainty-aware ensembles to currently challenging tasks such as exploration in reinforcement learning.



Datasets DeepEns DeepEns+AT NegCorr MOD MOD-R MOD-in
Out-of-distribution RMSE
protein 0.301±0.01 0.301±0.00 0.294±0.00 0.299±0.01 0.300±0.01 0.300±0.01
naval 0.111±0.01 0.126±0.02 0.112±0.01 0.109±0.01 0.125±0.01 0.129±0.01
kin8nm 0.042±0.00 0.040±0.01 0.037±0.00 0.039±0.00 0.041±0.01 0.039±0.00
power 0.041±0.00 0.042±0.00 0.044±0.00 0.048±0.01 0.047±0.01 0.041±0.00
bostonHousing 0.221±0.02 0.213±0.02 0.210±0.02 0.209±0.02 0.200±0.01 0.212±0.02
energy 0.060±0.01 0.054±0.01 0.053±0.01 0.053±0.01 0.054±0.01 0.054±0.01
concrete 0.105±0.02 0.102±0.01 0.096±0.01 0.105±0.01 0.113±0.03 0.099±0.01
yacht 0.038±0.04 0.039±0.04 0.018±0.01 0.026±0.00 0.026±0.01 0.039±0.04
wine 0.239±0.01 0.236±0.01 0.234±0.01 0.239±0.01 0.234±0.01 0.235±0.01
MOD outperformance p-value 0.028 0.102 0.989 0.090 0.017
In-distribution RMSE
protein 0.164±0.00 0.164±0.00 0.161±0.00 0.163±0.00 0.163±0.00 0.163±0.00
naval 0.080±0.00 0.089±0.00 0.079±0.00 0.077±0.00 0.088±0.00 0.091±0.00
kin8nm 0.074±0.00 0.072±0.00 0.071±0.00 0.072±0.00 0.073±0.00 0.072±0.00
power-plant 0.053±0.00 0.053±0.00 0.052±0.00 0.053±0.00 0.053±0.00 0.053±0.00
bostonHousing 0.085±0.01 0.084±0.01 0.083±0.01 0.084±0.01 0.084±0.01 0.084±0.01
energy 0.042±0.00 0.039±0.00 0.036±0.00 0.039±0.00 0.039±0.00 0.039±0.00
concrete 0.086±0.00 0.085±0.00 0.084±0.00 0.083±0.00 0.083±0.00 0.084±0.00
yacht 0.017±0.02 0.019±0.02 0.010±0.00 0.012±0.00 0.012±0.00 0.018±0.02
wine 0.170±0.00 0.170±0.00 0.169±0.00 0.170±0.00 0.169±0.00 0.169±0.01
MOD outperformance p-value 4.0e-06 5.9e-07 0.943 8.9e-04 0.002
Table A1: RMSE on out-of-distribution/in-distribution test examples for the UCI datasets (over 10 replicate runs). Examples with y-values in the top 5% were held out as the OOD test set.