1 Introduction
Model ensembling provides a simple yet extremely effective technique for improving the predictive performance of arbitrary supervised learners, each trained via empirical risk minimization (ERM) (Breiman, 1996; Brown, 2004). Often, ensembles are utilized not only to improve predictions on test examples stemming from the same underlying distribution as the training data, but also to provide estimates of model uncertainty when learners are presented with out-of-distribution (OOD) examples that may look different from the data encountered during training (Lakshminarayanan et al., 2017; Osband et al., 2016). The widespread success of ensembles crucially relies on the variance reduction produced by aggregating predictions that are statistically prone to different types of individual errors (Kuncheva and Whitaker, 2003). Thus, prediction improvements are best realized by using a large ensemble with many base models, and a large ensemble is also typically employed to produce stable distributional estimates of model uncertainty (Breiman, 1996; Papadopoulos et al., 2001). Despite this, practical applications of massive neural networks (NNs) are commonly limited to a small ensemble due to the unwieldy nature of these models (Osband et al., 2016; Balan et al., 2015; Beluch et al., 2018). Although supervised learning performance may still be enhanced by an ensemble comprised of only a few ERM-trained models, the resulting ensemble-based uncertainty estimates can exhibit excessive sampling variability in low-density regions of the underlying training distribution. Such unreliable uncertainty estimates are highly undesirable in applications where future data may not always stem from the same distribution (e.g., due to sampling bias, covariate shift, or the adaptive experimentation that occurs in bandit, Bayesian optimization (BO), and reinforcement learning (RL) contexts). Here, we propose a technique, Maximize Overall Diversity (MOD), to stabilize the OOD model uncertainty estimates produced by an ensemble of arbitrary neural networks. The core idea is to consider all possible inputs and encourage as much overall diversity in the corresponding model ensemble outputs as can be tolerated without diminishing the ensemble's predictive performance. MOD utilizes an auxiliary loss function and data-augmentation strategy that is easily integrated into any existing training procedure.
2 Related Work
NN ensembles have been previously demonstrated to produce useful uncertainty estimates, including for sequential experimentation applications in Bayesian optimization and reinforcement learning (Papadopoulos et al., 2001; Lakshminarayanan et al., 2017; Riquelme et al., 2018; Osband et al., 2016; Chen et al., 2017). Proposed methods to improve ensembles of limited size include adversarial training to enforce smoothness (Lakshminarayanan et al., 2017), and maximizing ensemble output diversity over the training data (Brown, 2004). In contrast, our focus is on controlling ensemble behavior over all possible inputs, not merely those presented during training.
Consideration of all possible inputs has previously been advocated by Hooker and Rosset (2012), although not in the context of uncertainty estimation. Like our approach, Hafner et al. (2018) also aim to control NN output behavior beyond the training distribution, but our methods do not require the Bayesian formulation they impose and can be applied to arbitrary NN ensembles, which are one of the most straightforward methods used for quantifying neural network uncertainty (Papadopoulos et al., 2001; Osband et al., 2016). While we primarily consider regression settings here, our ideas can be easily adapted to classification by replacing variance terms with entropy terms; a similar variant that relies on an auxiliary generator network to produce augmented samples that do not stem from the training distribution has been recently proposed by Lee et al. (2018).
3 Methods
We consider a standard regression setup, assuming continuous target values are generated via $y = f(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2(x))$, such that the noise variance $\sigma^2(x)$ may heteroscedastically depend on the feature values $x$. Given a limited training dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$ with $x_n \sim P_{\mathrm{in}}$, where $P_{\mathrm{in}}$ specifies the underlying data distribution from which the in-distribution examples in the training data are sampled, our goal is to learn an ensemble of $M$ neural networks that accurately models both the underlying function $f$ as well as the uncertainty in ensemble estimates of $f$. Of particular concern are scenarios where test examples may stem from a different distribution $P_{\mathrm{out}} \neq P_{\mathrm{in}}$, which we refer to as out-of-distribution (OOD) examples. As in Lakshminarayanan et al. (2017), each network $m$ (with parameters $\theta_m$) in our NN ensemble outputs both an estimated mean $\mu_{\theta_m}(x)$ to predict $f(x)$ and an estimated variance $\sigma^2_{\theta_m}(x)$ to predict $\sigma^2(x)$, and the per-network loss function $\ell(\theta_m)$ is chosen as the negative log-likelihood (NLL) under the Gaussian assumption $y \mid x \sim \mathcal{N}(\mu_{\theta_m}(x), \sigma^2_{\theta_m}(x))$. While traditional bagging provides different training data to each ensemble member, we simply train each NN using the entire dataset, since the randomness of separate NN initializations and SGD training suffices to produce performance comparable to bagging of NN models (Lakshminarayanan et al., 2017; Lee et al., 2015; Osband et al., 2016).

Following Lakshminarayanan et al. (2017), we estimate $p(y \mid x)$ (and NLL with respect to the ensemble) by treating the aggregate ensemble output as a single Gaussian distribution $\mathcal{N}(\mu_*(x), \sigma^2_*(x))$. Here, the ensemble estimate of $f(x)$ (used in RMSE calculations) is given by $\mu_*(x) = \frac{1}{M}\sum_{m=1}^M \mu_{\theta_m}(x)$, and the uncertainty in the target value is given by $\sigma^2_*(x) = \sigma^2_{\mathrm{noise}}(x) + \sigma^2_{\mathrm{model}}(x)$, based on the noise-level estimate $\sigma^2_{\mathrm{noise}}(x) = \frac{1}{M}\sum_{m=1}^M \sigma^2_{\theta_m}(x)$ and the model uncertainty estimate $\sigma^2_{\mathrm{model}}(x) = \frac{1}{M}\sum_{m=1}^M \mu_{\theta_m}(x)^2 - \mu_*(x)^2$. While we focus on Gaussian likelihoods for simplicity, our proposed methodology is applicable to general parametric conditional distributions.

3.1 Maximizing Overall Diversity (MOD)
Assuming the features $x$ have been scaled to a bounded region $\mathcal{X}$, MOD encourages higher ensemble diversity by introducing an auxiliary loss that is computed over augmented data sampled from another distribution $Q$ defined over the input feature space. $Q$ differs from the underlying training data distribution $P_{\mathrm{in}}$ and instead describes what sorts of OOD examples could possibly be encountered at test time. The underlying population objective we target is:

$\min_{\theta_1,\dots,\theta_M} \; \mathbb{E}_{(x,y) \sim P_{\mathrm{in}}}\big[\ell(x, y)\big] \;-\; \lambda \, \mathbb{E}_{x \sim Q}\big[\sigma^2_{\mathrm{model}}(x)\big] \quad (1)$

with $\ell$ as the original supervised learning loss function (e.g., NLL), and a user-specified penalty $\lambda > 0$. Since NLL entails a proper scoring rule (Lakshminarayanan et al., 2017), minimizing the above objective with a sufficiently small value of $\lambda$ will ensure the ensemble seeks to recover the true $p(y \mid x)$ for inputs that lie in the support of the training distribution, and otherwise to output large model uncertainty for OOD inputs that lie beyond this support. As it is difficult in most applications to specify how future OOD examples may look, we aim to ensure the ensemble outputs high uncertainty estimates for any possible input by taking the entire input space into consideration. This corresponds to choosing $Q = \mathrm{Uniform}(\mathcal{X})$, where $\mathcal{X}$ is the bounded region of all possible inputs and $q(x)$ denotes the pdf (or pmf) of $Q$. In practice, we approximate the first expectation using the average loss over the training data (as in ERM), and train each network with respect to its contribution to this term independently of the others (as in bagging). To approximate the second expectation, we similarly utilize an empirical average based on augmented examples sampled uniformly throughout the feature space $\mathcal{X}$. The formal MOD procedure is detailed in Algorithm 1. We advocate selecting $\lambda$ as the largest value for which estimates of the supervised loss (on held-out validation data) do not indicate worse predictive performance. This strategy naturally favors smaller values of $\lambda$ as the sample size grows, thus resulting in lower model uncertainty estimates (with $\lambda \to 0$ as $N \to \infty$ when $P_{\mathrm{in}}$ is supported everywhere and our NNs are universal approximators).
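As a concrete illustration, a single evaluation of the MOD objective in Eq. (1) can be sketched as follows. This is a minimal NumPy sketch, not the paper's code: `networks`, `x_lo`, `x_hi`, and `n_aug` are illustrative names, the region is assumed to be a scalar box, and a full training loop would differentiate this quantity with an autodiff framework.

```python
import numpy as np

def mod_objective(networks, x_batch, y_batch, lam, x_lo, x_hi, n_aug, rng):
    """Monte Carlo estimate of the MOD objective (illustrative sketch).

    networks: list of callables x -> (mu, var) for each ensemble member.
    lam: diversity penalty weight; [x_lo, x_hi]: assumed bounded region;
    n_aug: number of augmented samples drawn uniformly over that region.
    """
    # Supervised term: average per-network Gaussian NLL on the training batch.
    nll = 0.0
    for net in networks:
        mu, var = net(x_batch)
        nll += np.mean(0.5 * np.log(2 * np.pi * var)
                       + 0.5 * (y_batch - mu) ** 2 / var)
    nll /= len(networks)

    # Diversity term: model uncertainty (variance of member means) evaluated
    # on augmented inputs drawn uniformly over the whole feature space.
    x_aug = rng.uniform(x_lo, x_hi, size=(n_aug,) + x_batch.shape[1:])
    mus = np.stack([net(x_aug)[0] for net in networks])
    model_unc = (mus ** 2).mean(axis=0) - mus.mean(axis=0) ** 2

    # MOD maximizes diversity on the augmented points, i.e. subtracts it.
    return nll - lam * np.mean(model_unc)
```

Increasing `lam` trades predictive fit for larger OOD model uncertainty, matching the validation-based selection of $\lambda$ described above.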
We also experiment with an alternative choice of $Q$: the uniform distribution over the finite training data (i.e., $q(x) = 1/N$ for each training input and $q(x) = 0$ otherwise). We call this alternative method MODin, and note its similarity to the diversity-encouraging penalty proposed by Brown (2004), which is also measured over the training data. Note that MOD, in contrast, considers $Q$ to be uniformly distributed over the entire feature space (i.e., all possible test inputs) rather than only the training examples. Maximizing diversity solely over the training data may fail to control ensemble behavior at OOD points that do not lie near any training example.

3.2 Maximizing Reweighted Diversity (MODR)
Aiming for high-quality OOD uncertainty estimates, we are most concerned with regularizing the ensemble variance around points located in low-density regions of the training data distribution. To obtain a simple estimate that intuitively reflects (the inverse of) the local density of $P_{\mathrm{in}}$ at a particular set of feature values, one can compute the feature-distance to the nearest training data points (Papernot and McDaniel, 2018). Under this perspective, we want to encourage greater model uncertainty for the lowest-density points that lie furthest from the training data. Commonly used covariance kernels for Gaussian process regressors (e.g., radial basis functions) explicitly enforce a high amount of uncertainty on points that lie far from the training data. As calculating the distance of each point to the entire training set may be undesirably inefficient for large datasets, we only compute the distance of our augmented data to the current minibatch during training. Specifically, we use these distances to compute the following weights:

$w_i = \frac{\frac{1}{K}\sum_{k=1}^{K} \lVert \tilde{x}_i - x_{(k)} \rVert}{\sum_{j} \frac{1}{K}\sum_{k=1}^{K} \lVert \tilde{x}_j - x_{(k)} \rVert} \quad (2)$

where $w_i$ are weights for each of the augmented points $\tilde{x}_i$, and $x_{(1)}, \dots, x_{(K)}$ are the members of the minibatch that are the $K$ nearest neighbors of $\tilde{x}_i$. Throughout this paper, we use a fixed value of $K$. The $w_i$ are thus inversely related to a crude density estimate of the training distribution evaluated at each augmented sample $\tilde{x}_i$. Rather than optimizing a diversity term that uniformly weights each augmented sample (as done in Algorithm 1), we can instead form a weighted loss computed over the minibatch of augmented samples as $\sum_i w_i \, \sigma^2_{\mathrm{model}}(\tilde{x}_i)$, which should increase the model uncertainty for augmented inputs proportionally to their distance from the training data. We call this variant of our methodology with augmented input reweighting MODR.
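The weight computation can be sketched as below. This is a hedged illustration: the Euclidean norm, the normalization to sum to one, and the default `k=5` are assumptions, since the exact constants may differ from the paper's implementation.

```python
import numpy as np

def modr_weights(x_aug, x_batch, k=5):
    """Distance-based weights for augmented samples (illustrative sketch).

    Each augmented point is weighted in proportion to its mean Euclidean
    distance to its k nearest neighbors in the current minibatch, so points
    far from the training data (low density) are up-weighted.
    """
    # Pairwise Euclidean distances, shape (n_aug, n_batch).
    d = np.linalg.norm(x_aug[:, None, :] - x_batch[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, :k]   # distances to k nearest batch points
    raw = knn.mean(axis=1)            # crude inverse-density estimate
    return raw / raw.sum()            # normalize weights to sum to 1
```

The resulting weights multiply the per-sample model-uncertainty terms in the auxiliary loss, concentrating the diversity penalty on augmented points far from the minibatch.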
4 Experiments
4.1 Baseline Methods
Here, we evaluate various alternative strategies for improving model ensembles. All strategies are applied to the same base NN ensemble, which is taken to be the Deep Ensembles (DeepEns) model of Lakshminarayanan et al. (2017) previously described in Section 3.
Deep Ensembles with Adversarial Training (DeepEns+AT)
Lakshminarayanan et al. (2017) used this strategy to improve their basic DeepEns model. The idea is to adversarially sample inputs that lie close to the training data but on which the NLL loss is high (assuming they share the same label as their neighboring training example). These adversarial points are then included as augmented data when training the ensemble, which smooths the function learned by the ensemble. Starting from training example $(x, y)$, we sample the augmented datapoint $x' = x + \epsilon \cdot \mathrm{sign}(\nabla_x \ell(x, y))$, with the label for each $x'$ assumed to be the same as that of the corresponding $x$.
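This fast-gradient-sign sampling step can be sketched as follows, assuming access to a user-supplied function that returns the gradient of the loss with respect to the inputs (in practice supplied by an autodiff framework):

```python
import numpy as np

def adversarial_augment(x, y, loss_grad_x, eps=0.1):
    """Fast-gradient-sign augmentation (sketch).

    loss_grad_x: callable (x, y) -> gradient of the loss w.r.t. x.
    Returns the perturbed inputs x + eps * sign(grad), paired with the
    unchanged labels y, as assumed in the DeepEns+AT strategy.
    """
    return x + eps * np.sign(loss_grad_x(x, y)), y
```

For example, with a squared-error loss $\tfrac12 (x - y)^2$ the input gradient is simply $x - y$, so a point at $x = 1$ with target $0$ is pushed to $1 + \epsilon$.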
Here, $\ell$ denotes the NLL loss function, and the values of the hyperparameter $\epsilon$ that we search over include 0.05, 0.1, and 0.2.

Negative Correlation (NegCorr)
This method from Liu and Yao (1999) minimizes the empirical correlation between predictions of different ensemble members over the training data. It adds a penalty to the loss of the form $\sum_{m} (f_m(x) - \bar{f}(x)) \sum_{m' \neq m} (f_{m'}(x) - \bar{f}(x))$, where $f_m(x)$ is the prediction of the $m$th ensemble member and $\bar{f}(x)$ is the mean ensemble prediction. This penalty is weighted by a user-specified coefficient $\lambda$, as done in our methodology.
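The penalty above can be computed as a short sketch (illustrative only; because the deviations from the ensemble mean sum to zero, the penalty reduces to the negative sum of squared deviations):

```python
import numpy as np

def negcorr_penalty(preds):
    """Negative-correlation penalty in the style of Liu and Yao (1999).

    preds: array of shape (M, n) with member predictions f_m(x) on a batch.
    For each member, its deviation from the ensemble mean is coupled with
    the summed deviations of the other members, averaged over the batch.
    """
    dev = preds - preds.mean(axis=0, keepdims=True)   # f_m - f_bar
    others = dev.sum(axis=0, keepdims=True) - dev     # sum over m' != m
    return np.mean(np.sum(dev * others, axis=0))
```

Since the deviations sum to zero, `others` equals `-dev`, so the penalty is most negative (most rewarding) when members disagree strongly on the training data.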
4.2 Experiment Details
All experiments were run on Nvidia TitanX 1080 Ti and Nvidia TitanX 2080 Ti GPUs with PyTorch version 1.0. Unless otherwise indicated, all p-values were computed using a single-tailed paired t-test per dataset, and the p-values are combined using Fisher's method to produce an overall p-value across all datasets in a task. All hyperparameters (including the learning rate, l2 regularization, $\lambda$ for MOD/Negative Correlation, and $\epsilon$ for adversarial training) were tuned based on validation-set NLL. In every regression task, the search for the hyperparameter $\lambda$ was over the values 0.01, 0.1, 1, 5, 10, 20, and 50.

4.3 Univariate Regression
We first consider a one-dimensional toy regression dataset similar to the one used by Blundell et al. (2015), generating noisy training data from a fixed nonlinear function. Here, the training data only contain samples drawn from two limited-size regions. Using the standard NLL loss as well as the auxiliary MOD penalty, we train a deep ensemble of 4 neural networks with identical architecture (one hidden layer with 50 units, ReLU activation, two sigmoid outputs to estimate the mean and variance of $y$, and l2 regularization). To depict the improvement gained by simply adding ensemble members, we also train an ensemble of 120 networks with the same architecture. Figure 1 shows the predictions and confidence intervals of the ensembles. MOD is able to produce more reliable uncertainty estimates in the left-hand regions, where regular deep ensembles fail even with many networks. MOD also properly inflates the predictive uncertainty in the center region where no training data are found. Using a smaller $\lambda$ in MOD ensures that the ensemble's predictive performance remains strong for in-distribution inputs that lie near the training data and that the ensemble exhibits adequate levels of certainty around these points. While a larger $\lambda$ value leads to overly conservative uncertainty estimates that are large everywhere, we note the mean of the ensemble predictions remains highly accurate for in-distribution inputs.

4.4 UCI Regression Datasets
We next experimented with nine real-world datasets with continuous inputs in some applicable bounded domain. We follow the experimental setup that Lakshminarayanan et al. (2017) and Hernández-Lobato et al. (2017) used to evaluate deep ensembles and deep Bayesian regressors. We split off all datapoints whose $y$-values fall in the top fraction of the data as an OOD test set (so datapoints with such large $y$-values are never encountered during training). We simulate the situation where the training set is limited, and thus used a small fraction of the data for training and another fraction for validation. The remaining data are used as an in-distribution test set. The analysis is repeated for 10 random splits of the data to ensure the results are robust. We again use an ensemble of 4 fully-connected neural networks with the same architecture as in the previous experiment and the NLL training loss (searching over the penalty $\lambda$ and learning-rate hyperparameter values). We report the negative log-likelihood (NLL) and root mean squared error (RMSE) on both in- and out-of-distribution test sets for ensembles trained via the different strategies. We also examine calibration curves, where predicted confidence levels are compared with the actual empirical level of correctness observed in the data.
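The target-based OOD split described above can be sketched as follows (the cutoff fraction is a parameter here, since the exact percentage is not restated in this excerpt):

```python
import numpy as np

def ood_split(x, y, ood_frac=0.1):
    """Split off the examples with the largest targets as an OOD test set.

    Returns ((x_in, y_in), (x_ood, y_ood)), where the OOD set holds the
    top `ood_frac` fraction of y-values, never seen during training.
    """
    cutoff = np.quantile(y, 1.0 - ood_frac)  # threshold for "large" targets
    ood = y >= cutoff
    return (x[~ood], y[~ood]), (x[ood], y[ood])
```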
As shown in Table 1 and Appendix Table 1, MOD outperforms DeepEns in 6 out of the 9 datasets on OOD NLL, 7 out of the 9 datasets on OOD RMSE, and all of the datasets on in-distribution NLL/RMSE. This shows that the MOD loss leads to higher-quality uncertainties on OOD data while also improving in-distribution performance. MOD also outperforms all baselines with a significant overall p-value on NLL/RMSE (except NegCorr on RMSE, with which it is on par), demonstrating the robustness of the improvement contributed by MOD.
Datasets | DeepEns | DeepEns+AT | NegCorr | MOD | MODR | MODin

Out-of-distribution NLL
protein | 1.162±0.23 | 1.178±0.16 | 1.231±0.13 | 1.197±0.14 | 1.194±0.21 | 1.154±0.13
naval | 2.580±0.10 | 1.380±0.09 | 2.618±0.06 | 2.729±0.07 | 2.130±0.07 | 2.057±0.05
kin8nm | 1.980±0.05 | 1.970±0.09 | 2.036±0.05 | 1.999±0.05 | 2.003±0.09 | 1.993±0.08
powerplant | 1.734±0.05 | 1.731±0.09 | 1.659±0.07 | 1.638±0.15 | 1.644±0.12 | 1.731±0.05
bostonHousing | 1.591±0.68 | 1.243±0.69 | 1.821±0.91 | 0.568±0.96 | 0.460±0.65 | 0.923±0.73
energy | 1.590±0.25 | 1.784±0.15 | 1.718±0.19 | 1.736±0.12 | 1.741±0.26 | 1.733±0.20
concrete | 0.831±0.24 | 0.915±0.20 | 0.913±0.28 | 0.904±0.12 | 0.910±0.19 | 0.924±0.19
yacht | 1.597±0.84 | 1.762±0.65 | 1.972±0.57 | 1.797±0.44 | 1.761±0.58 | 1.638±0.66
wine | 0.133±0.13 | 0.115±0.09 | 0.113±0.10 | 0.153±0.11 | 0.084±0.07 | 0.085±0.06
MOD outperformance p-value | 0.002 | 4.9e-07 | 0.034 | | 1.6e-04 | 4.6e-05

In-distribution NLL
protein | 0.514±0.01 | 0.519±0.01 | 0.544±0.01 | 0.533±0.01 | 0.532±0.01 | 0.529±0.01
naval | 2.735±0.08 | 1.513±0.04 | 2.810±0.04 | 2.857±0.07 | 2.297±0.06 | 2.238±0.05
kin8nm | 1.305±0.02 | 1.315±0.02 | 1.334±0.02 | 1.317±0.02 | 1.315±0.02 | 1.322±0.02
power | 1.521±0.02 | 1.525±0.02 | 1.524±0.02 | 1.523±0.01 | 1.522±0.02 | 1.524±0.01
bostonHousing | 0.901±0.15 | 0.937±0.14 | 0.656±0.67 | 0.953±0.15 | 0.883±0.19 | 0.925±0.18
energy | 2.426±0.15 | 2.517±0.10 | 2.620±0.13 | 2.507±0.15 | 2.525±0.10 | 2.522±0.13
concrete | 1.075±0.09 | 1.129±0.08 | 1.089±0.10 | 1.155±0.09 | 1.137±0.13 | 1.090±0.09
yacht | 3.286±0.69 | 3.245±0.82 | 3.570±0.17 | 3.500±0.19 | 3.461±0.25 | 3.339±0.82
wine | 0.070±0.85 | 0.341±0.07 | 0.266±0.29 | 0.337±0.07 | 0.348±0.05 | 0.351±0.05
MOD outperformance p-value | 2.4e-08 | 6.9e-11 | 0.116 | | 8.9e-06 | 2.2e-06
The best method in each row is highlighted, and a secondary marking indicates second best. In case of a tie in the means, the method with the lower standard deviation is highlighted.
Figure 2 shows the calibration curves on two of the datasets where the basic deep ensembles exhibit overconfidence on OOD data. Note that retaining accurate calibration on OOD data is extremely difficult for most machine learning methods. MOD and MODR improve calibration by a significant margin compared to most of the baselines, validating the effectiveness of our MOD procedure.
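The calibration curves compared here can be computed as below: for each nominal confidence level, we measure how often the target actually falls inside the corresponding central Gaussian prediction interval (a minimal sketch; the levels evaluated are illustrative defaults):

```python
import numpy as np
from statistics import NormalDist

def calibration_curve(y, mu, sigma, levels=(0.1, 0.5, 0.9)):
    """Empirical coverage of central Gaussian prediction intervals.

    For each confidence level p, returns the fraction of targets y inside
    the central p-probability interval of N(mu, sigma^2). A well-calibrated
    model yields coverage close to p at every level; overconfident models
    (as on OOD data) yield coverage below p.
    """
    coverages = []
    for p in levels:
        z = NormalDist().inv_cdf(0.5 + p / 2.0)  # interval half-width in std units
        coverages.append(np.mean(np.abs(y - mu) <= z * sigma))
    return np.array(coverages)
```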
4.5 Protein Binding Microarray Data
We next study scientific data with discrete features by predicting protein-DNA binding. This is a collection of 38 different microarray datasets, each of which contains measurements of the binding affinity of a single transcription factor (TF) protein against all possible 8-base DNA sequences (Barrera et al., 2016). We consider each dataset as a separate task, with $y$ taken to be the binding affinity (rescaled to the [0,1] interval) and $x$ the one-hot embedded DNA sequence (as we ignore reverse-complements, there are 32,896 possible values of $x$).
Regression
We trained a small ensemble of 4 neural networks with the same architecture as in the previous experiments. We consider 2 different OOD test sets: one comprised of the sequences with the top 10% of $y$-values, and the other comprised of the sequences with more than 80% of positions being G or C (GC-content). For each OOD set, we use the remainder of the sequences as the corresponding in-distribution set. We separate these into an extremely small training set (300 examples) and validation set (300 examples), and use the rest as an in-distribution test set. We compare MOD along with 2 alternative sampling distributions (MODR and MODin) against the 3 baselines mentioned in Section 4.1. We search over 0, 0.001, 0.01, 0.05, and 0.1 for the penalty $\lambda$, and use 0.01 for the learning rate.
Methods | NLL | # of TFs (MOD/R outperform) | p-value | RMSE | # of TFs (MOD/R outperform) | p-value

In-distribution test performance
DeepEns | 0.4266±0.031 | 33 | 9.0e-14 | 0.1591±0.005 | 35 | 5.1e-22
DeepEns+AT | 0.4312±0.033 | 28 | 0.001 | 0.1582±0.005 | 29 | 6.0e-09
NegCorr | 0.4314±0.032 | 19 | 0.236 | 0.1583±0.005 | 24 | 0.032
MOD | 0.4312±0.031 | 22 | 0.075 | 0.1581±0.005 | 25 | 0.024
MODR | 0.4325±0.032 | | | 0.1579±0.005 | |
MODin | 0.4317±0.032 | 22 | 0.146 | 0.1581±0.005 | 26 | 0.044

Out-of-distribution test performance (OOD as sequences with top 10% binding affinity)
DeepEns | 0.7485±0.124 | 24 | 0.006 | 0.2837±0.011 | 28 | 3.1e-06
DeepEns+AT | 0.7438±0.122 | 25 | 0.014 | 0.2812±0.011 | 24 | 0.036
NegCorr | 0.7358±0.118 | 20 | 0.218 | 0.2814±0.010 | 25 | 0.060
MOD | 0.7153±0.117 | 16 | 0.921 | 0.2802±0.010 | 21 | 0.333
MODR | 0.7225±0.116 | | | 0.2795±0.010 | |
MODin | 0.7326±0.121 | 22 | 0.279 | 0.2801±0.010 | 22 | 0.298

Out-of-distribution test performance (OOD as sequences with >80% GC content)
DeepEns | 0.6938±0.052 | 20 | 0.022 | 0.1190±0.004 | 25 | 0.008
DeepEns+AT | 0.7010±0.041 | 23 | 0.007 | 0.1180±0.003 | 19 | 0.029
NegCorr | 0.6805±0.065 | 25 | 0.011 | 0.1179±0.004 | 21 | 0.052
MOD | 0.7007±0.047 | | | 0.1173±0.003 | |
MODR | 0.6959±0.040 | 24 | 0.004 | 0.1177±0.003 | 22 | 0.112
MODin | 0.6948±0.054 | 21 | 0.103 | 0.1177±0.004 | 21 | 0.401
Table 2 shows mean OOD and in-distribution performance across the 38 TFs (averaged over 10 runs using random data splits and NN initializations). MOD methods have significantly improved performance on all metrics and OOD setups compared to DeepEns/DeepEns+AT, both in terms of the number of TFs on which they outperform and the overall p-value. The reweighting scheme (MODR) further improved performance on the top-10%-value OOD setup. Figure 2 shows the calibration curves on two of the TFs where the deep ensembles are overconfident on top-10%-value OOD examples. MODR and MOD improve the calibration results by a significant margin compared to most of the baselines.
Bayesian Optimization
Next, we compared how the MOD, MODR, and MODin ensembles performed against the DeepEns, DeepEns+AT, and NegCorr ensembles in 38 Bayesian optimization tasks using the same protein binding data (see Hashimoto et al. (2018)). For each TF, we performed 30 rounds of DNA-sequence acquisition, acquiring batches of 10 sequences per round in an attempt to maximize binding affinity. We used the upper confidence bound (UCB) as our acquisition function (Chen et al., 2017), ordering the candidate points via $\mu_*(x) + \kappa \, \sigma_*(x)$ (with UCB coefficient $\kappa$).
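The batched UCB acquisition step can be sketched as below (a minimal illustration; the function and parameter names are ours, and the value of the UCB coefficient is an assumption):

```python
import numpy as np

def ucb_acquire(mu_star, sigma_star, batch_size=10, kappa=1.0):
    """Rank candidates by an upper confidence bound mu + kappa * sigma.

    mu_star, sigma_star: ensemble predictive mean and standard deviation
    for each candidate. Returns the indices of the top `batch_size`
    candidates to acquire next.
    """
    scores = mu_star + kappa * sigma_star
    return np.argsort(-scores)[:batch_size]
```

Larger `kappa` favors candidates with high model uncertainty (exploration), which is exactly where the quality of the ensemble's OOD uncertainty estimates matters.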
Each cell reports the number of TFs on which the row method achieves better simple regret than the column method (combined p-value in parentheses).

vs | DeepEns | DeepEns+AT | NegCorr | MODin | MOD | MODR

MODin | 21 (0.111) | 21 (0.041) | 19 (0.356) | | 17 (0.791) | 16 (0.51)
MOD | 26 (0.003) | 24 (0.004) | 20 (0.001) | 19 (0.002) | | 22 (0.173)
MODR | 22 (0.019) | 23 (0.007) | 22 (0.017) | 20 (0.052) | 14 (0.674) |
At every acquisition iteration, we randomly held out 10% of the training set as the validation set and chose the penalty $\lambda$ (for MOD, MODin, MODR, and NegCorr) that produced the best validation NLL (out of the choices 0, 5, 10, 20, 40, and 80). The stopping epoch was chosen based on the validation NLL not improving for 10 epochs, with an upper limit of 30 epochs. Optimization was done with the Adam optimizer, a learning rate of 0.01, and an l2 penalty of 0.01. For each of the 38 TFs, we performed 20 Bayesian optimization runs with different seed sequences (the same seeds were used for all the methods), using 200 points randomly sampled from the bottom 90% of $y$-values as our initial training set. We evaluated the methods via simple regret $y^* - \max_{s \le t} y_s$ (the second term in the subtraction quantifies the best point acquired so far, and the first term is the global best). The results are presented in Table 3. MOD outperforms all other methods in both the number of TFs with better regret and the combined p-value. MODR is also strong, outperforming all other methods except MOD, with which it is about equivalent in terms of statistical significance. Figure 3 shows the regret curves for the TFs OVOL2 and HESX1, tasks in which MOD and MODR outperform the other methods.
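The simple-regret metric used above is a one-liner; this sketch makes the definition explicit:

```python
def simple_regret(y_global_best, y_acquired):
    """Simple regret: gap between the global optimum and the best point
    acquired so far. Reaches zero once the optimum has been acquired."""
    return y_global_best - max(y_acquired)
```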
4.6 Age Prediction from Images
To demonstrate the effectiveness of MODin, MOD, and MODR on high-dimensional data, we consider supervised learning with image data. Here, we use a dataset of human images collected from IMDB and Wikimedia and annotated with age and gender information (Rothe et al., 2015). The IMDB/Wiki parts of the dataset consist of 460K+/62K+ images, respectively; 28,601 images in the Wiki dataset depict males and the rest depict females. In the context of the Wiki images, we predict a person's age from their image, using 2000 images of males as the training set. We used the female images as the unlabeled OOD test set on which we increased variance via MOD and MODR. We used the Wide Residual Network architecture (Zagoruyko and Komodakis, 2016) with a depth of 4 and a width factor of 2. As before, we used an ensemble of size 4. The search for the optimal $\lambda$ value was over a fixed grid. The stopping epoch was chosen based on the validation NLL not improving for 10 epochs, with an upper limit of 30 epochs. Optimization was done with the Adam optimizer, a learning rate of 0.001, and an l2 penalty of 0.001. MOD achieves an NLL of -0.160 on the OOD (female) data and -0.331 on the in-distribution (male) data, while MODR achieves -0.148 on OOD and -0.328 in-distribution. This is in contrast to DeepEns, which achieves only -0.131 on OOD and -0.320 in-distribution. Thus both MOD and MODR show improvements in NLL on the OOD (female) data (p-values of 0.014 and 0.102, respectively) and slight improvements in NLL on the in-distribution (male) data. In addition, while DeepEns+AT has a better mean OOD NLL than MOD, the result is not significant (p-value 0.388), and thus the two methods can be considered equivalent on this dataset. Notably, every MOD variant improves performance across all 4 metrics; thus, augmenting the loss function with the MOD penalty should not degrade your model.

Methods | OOD NLL | In-Dist NLL | OOD RMSE | In-Dist RMSE

DeepEns | -0.131±0.050 | -0.320±0.024 | 0.208±0.007 | 0.180±0.003
DeepEns+AT | -0.164±0.004 | -0.334±0.024 | 0.203±0.007 | 0.177±0.004
NegCorr | -0.153±0.038 | -0.328±0.021 | 0.204±0.006 | 0.178±0.004
MODin | -0.140±0.031 | -0.331±0.018 | 0.207±0.005 | 0.177±0.003
MOD | -0.160±0.029 | -0.331±0.025 | 0.203±0.005 | 0.176±0.004
MODR | -0.148±0.029 | -0.330±0.027 | 0.205±0.005 | 0.177±0.004
5 Conclusion
We have developed a new loss function and data augmentation strategy that helps stabilize the distributional uncertainty estimates obtained from model ensembling. Our method increases model uncertainty over the entire input space while simultaneously maintaining predictive performance. We further propose a variant of our method that assesses the distance of an augmented sample from the training distribution and aims to ensure higher model uncertainty in regions with low density under the training data distribution. When used for training deep NN ensembles, our method produces strong improvements in in- and out-of-distribution NLL, out-of-distribution RMSE, and calibration on a variety of datasets drawn from biology, vision, and common UCI benchmarks for uncertainty quantification methods. The resulting ensembles are also shown to be very useful across numerous Bayesian optimization tasks. Future work could develop techniques to generate OOD augmented samples for structured data domains like images/text, as well as apply our ensembles with improved uncertainty-awareness to currently challenging tasks such as exploration in reinforcement learning.
References
 Balan et al. (2015) A. K. Balan, V. Rathod, K. P. Murphy, and M. Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems, 2015.
 Barrera et al. (2016) Luis A Barrera, Anastasia Vedenko, Jesse V Kurland, Julia M Rogers, Stephen S Gisselbrecht, Elizabeth J Rossin, Jaie Woodard, Luca Mariani, Kian Hong Kock, Sachi Inukai, et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 351(6280):1450–1454, 2016.

 Beluch et al. (2018) W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler. The power of ensembles for active learning in image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 Breiman (1996) Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
 Brown (2004) Gavin Brown. Diversity in neural network ensembles. PhD thesis, University of Birmingham, 2004.
 Chen et al. (2017) R. Y. Chen, S. Sidor, P. Abbeel, and J. Schulman. UCB exploration via Q-ensembles. arXiv:1706.01502, 2017.
 Hafner et al. (2018) Danijar Hafner, Dustin Tran, Timothy Lillicrap, Alex Irpan, and James Davidson. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv:1807.09289, 2018.

 Hashimoto et al. (2018) Tatsunori B. Hashimoto, Steve Yadlowsky, and John C. Duchi. Derivative free optimization via repeated classification. In International Conference on Artificial Intelligence and Statistics, 2018.
 Hernández-Lobato et al. (2017) José Miguel Hernández-Lobato, James Requeima, Edward O. Pyzer-Knapp, and Alán Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. arXiv:1706.01825, 2017.
 Hooker and Rosset (2012) Giles Hooker and Saharon Rosset. Prediction-focused regularization using data-augmented regression. Statistics and Computing, 1:237–349, 2012.

 Kuncheva and Whitaker (2003) Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51:181–207, 2003.
 Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.
 Lee et al. (2018) Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations, 2018.
 Lee et al. (2015) S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, and D. Batra. Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv:1511.06314, 2015.
 Liu and Yao (1999) Yong Liu and Xin Yao. Ensemble learning via negative correlation. Neural networks, 12(10):1399–1404, 1999.
 Osband et al. (2016) I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016.
 Papadopoulos et al. (2001) G. Papadopoulos, P. J. Edwards, and A. F. Murray. Confidence estimation methods for neural networks: A practical comparison. IEEE Transactions on Neural Networks, 12:1278–1287, 2001.
 Papernot and McDaniel (2018) Nicolas Papernot and Patrick McDaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.

 Riquelme et al. (2018) Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In International Conference on Learning Representations, 2018.
 Rothe et al. (2015) Rasmus Rothe, Radu Timofte, and Luc Van Gool. DEX: Deep expectation of apparent age from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 10–15, 2015.
 Zagoruyko and Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Appendix
Datasets | DeepEns | DeepEns+AT | NegCorr | MOD | MODR | MODin

Out-of-distribution RMSE
protein | 0.301±0.01 | 0.301±0.00 | 0.294±0.00 | 0.299±0.01 | 0.300±0.01 | 0.300±0.01
naval | 0.111±0.01 | 0.126±0.02 | 0.112±0.01 | 0.109±0.01 | 0.125±0.01 | 0.129±0.01
kin8nm | 0.042±0.00 | 0.040±0.01 | 0.037±0.00 | 0.039±0.00 | 0.041±0.01 | 0.039±0.00
power | 0.041±0.00 | 0.042±0.00 | 0.044±0.00 | 0.048±0.01 | 0.047±0.01 | 0.041±0.00
bostonHousing | 0.221±0.02 | 0.213±0.02 | 0.210±0.02 | 0.209±0.02 | 0.200±0.01 | 0.212±0.02
energy | 0.060±0.01 | 0.054±0.01 | 0.053±0.01 | 0.053±0.01 | 0.054±0.01 | 0.054±0.01
concrete | 0.105±0.02 | 0.102±0.01 | 0.096±0.01 | 0.105±0.01 | 0.113±0.03 | 0.099±0.01
yacht | 0.038±0.04 | 0.039±0.04 | 0.018±0.01 | 0.026±0.00 | 0.026±0.01 | 0.039±0.04
wine | 0.239±0.01 | 0.236±0.01 | 0.234±0.01 | 0.239±0.01 | 0.234±0.01 | 0.235±0.01
MOD outperformance p-value | 0.028 | 0.102 | 0.989 | | 0.090 | 0.017

In-distribution RMSE
protein | 0.164±0.00 | 0.164±0.00 | 0.161±0.00 | 0.163±0.00 | 0.163±0.00 | 0.163±0.00
naval | 0.080±0.00 | 0.089±0.00 | 0.079±0.00 | 0.077±0.00 | 0.088±0.00 | 0.091±0.00
kin8nm | 0.074±0.00 | 0.072±0.00 | 0.071±0.00 | 0.072±0.00 | 0.073±0.00 | 0.072±0.00
powerplant | 0.053±0.00 | 0.053±0.00 | 0.052±0.00 | 0.053±0.00 | 0.053±0.00 | 0.053±0.00
bostonHousing | 0.085±0.01 | 0.084±0.01 | 0.083±0.01 | 0.084±0.01 | 0.084±0.01 | 0.084±0.01
energy | 0.042±0.00 | 0.039±0.00 | 0.036±0.00 | 0.039±0.00 | 0.039±0.00 | 0.039±0.00
concrete | 0.086±0.00 | 0.085±0.00 | 0.084±0.00 | 0.083±0.00 | 0.083±0.00 | 0.084±0.00
yacht | 0.017±0.02 | 0.019±0.02 | 0.010±0.00 | 0.012±0.00 | 0.012±0.00 | 0.018±0.02
wine | 0.170±0.00 | 0.170±0.00 | 0.169±0.00 | 0.170±0.00 | 0.169±0.00 | 0.169±0.01
MOD outperformance p-value | 4.0e-06 | 5.9e-07 | 0.943 | | 8.9e-04 | 0.002