Model ensembling is a simple yet extremely effective technique for improving the predictive performance of arbitrary supervised learners trained via empirical risk minimization (ERM) (Breiman, 1996; Brown, 2004). Ensembles are often utilized not only to improve predictions on test examples stemming from the same underlying distribution as the training data, but also to provide estimates of model uncertainty when learners are presented with out-of-distribution (OOD) examples that may look different from the data encountered during training (Lakshminarayanan et al., 2017; Osband et al., 2016). The widespread success of ensembles crucially relies on the variance reduction produced by aggregating predictions that are statistically prone to different types of individual errors (Kuncheva and Whitaker, 2003). Prediction improvements are thus best realized by using a large ensemble with many base models, and a large ensemble is also typically required to produce stable distributional estimates of model uncertainty (Breiman, 1996; Papadopoulos et al., 2001).
Despite this, practical applications of massive neural networks (NNs) are commonly limited to a small ensemble due to the unwieldy nature of these models (Osband et al., 2016; Balan et al., 2015; Beluch et al., 2018). Although supervised learning performance may still be enhanced by an ensemble comprised of only a few ERM-trained models, the resulting ensemble-based uncertainty estimates can exhibit excessive sampling variability in low-density regions of the underlying training distribution. Such unreliable uncertainty estimates are highly undesirable in applications where future data may not always stem from the same distribution (e.g. due to sampling bias, covariate shift, or the adaptive experimentation that occurs in bandits, Bayesian optimization (BO), and reinforcement learning (RL) contexts). Here, we propose a technique - Maximize Overall Diversity (MOD) - to stabilize the OOD model uncertainty estimates produced by an ensemble of arbitrary neural networks. The core idea is to consider all possible inputs and encourage as much overall diversity in the corresponding ensemble outputs as can be tolerated without diminishing the ensemble's predictive performance. MOD utilizes an auxiliary loss function and data-augmentation strategy that is easily integrated into any existing training procedure.
2 Related Work
NN ensembles have been previously demonstrated to produce useful uncertainty estimates, including for sequential experimentation applications in Bayesian optimization and reinforcement learning (Papadopoulos et al., 2001; Lakshminarayanan et al., 2017; Riquelme et al., 2018; Osband et al., 2016; Chen et al., 2017). Proposed methods to improve ensembles of limited size include adversarial training to enforce smoothness (Lakshminarayanan et al., 2017), and maximizing ensemble output diversity over the training data (Brown, 2004). In contrast, our focus is on controlling ensemble behavior over all possible inputs, not merely those presented during training.
Consideration of all possible inputs has previously been advocated by Hooker and Rosset (2012), although not in the context of uncertainty estimation. Like our approach, Hafner et al. (2018) also aim to control NN output-behavior beyond the training distribution, but our methods do not require the Bayesian formulation they impose and can be applied to arbitrary NN ensembles, which are one of the most straightforward methods used for quantifying neural network uncertainty (Papadopoulos et al., 2001; Osband et al., 2016). While we primarily consider regression settings here, our ideas can be easily adapted to classification by replacing variance terms with entropy terms; a similar variant that relies on an auxiliary generator network to produce augmented samples that do not stem from the training distribution has been recently proposed by Lee et al. (2018).
3 Methods

We consider a standard regression setup, assuming continuous target values are generated via $y = f(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2(x))$, such that the noise level $\sigma^2(x)$ may heteroscedastically depend on feature values $x$. Given a limited training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \sim P$, where $P$ specifies the underlying data distribution from which the in-distribution examples in the training data are sampled, our goal is to learn an ensemble of $M$ neural networks that accurately models both the underlying function $f$ as well as the uncertainty in ensemble-estimates of $f$. Of particular concern are scenarios where test examples may stem from a different distribution $Q \neq P$, which we refer to as out-of-distribution (OOD) examples. As in Lakshminarayanan et al. (2017), each network $m$ (with parameters $\theta_m$) in our NN ensemble outputs both an estimated mean $\mu_{\theta_m}(x)$ to predict $f(x)$ and an estimated variance $\sigma^2_{\theta_m}(x)$ to predict $\sigma^2(x)$, and the per-network loss function $\ell(\theta_m; x, y)$ is chosen as the negative log-likelihood (NLL) under the Gaussian assumption $y \mid x \sim \mathcal{N}(\mu_{\theta_m}(x), \sigma^2_{\theta_m}(x))$. While traditional bagging provides different training data to each ensemble member, we simply train each NN using the entire dataset, since the randomness of separate NN initializations and SGD training suffices to produce performance comparable to bagging of NN models (Lakshminarayanan et al., 2017; Lee et al., 2015; Osband et al., 2016).
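As a minimal sketch of the per-network training criterion described above (the Gaussian NLL of Lakshminarayanan et al. (2017)), assuming each network emits a mean and a positive variance per example:

```python
import numpy as np

def gaussian_nll(y, mu, var, eps=1e-6):
    """Per-example negative log-likelihood under y ~ N(mu, var).

    `eps` is an illustrative guard against degenerate variance
    predictions, not a value specified in the paper.
    """
    var = np.maximum(var, eps)
    return 0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var)
```

In practice each ensemble member minimizes the average of this quantity over the training data (e.g. via SGD on its own random initialization).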
Following Lakshminarayanan et al. (2017), we estimate $\sigma^2(x)$ (and NLL with respect to the ensemble) by treating the aggregate ensemble output as a single Gaussian distribution. Here, the ensemble-estimate of $f(x)$ (used in RMSE calculations) is given by $\mu_*(x) = \frac{1}{M}\sum_{m=1}^M \mu_{\theta_m}(x)$, and the uncertainty in the target value is given by $\sigma^2_*(x) = \sigma^2_{noise}(x) + \sigma^2_{model}(x)$, based on the noise-level estimate $\sigma^2_{noise}(x) = \frac{1}{M}\sum_{m=1}^M \sigma^2_{\theta_m}(x)$ and the model uncertainty estimate $\sigma^2_{model}(x) = \frac{1}{M}\sum_{m=1}^M \mu_{\theta_m}(x)^2 - \mu_*(x)^2$. While we focus on Gaussian likelihoods for simplicity, our proposed methodology is applicable to general parametric conditional distributions.
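The aggregation above can be sketched directly; `ensemble_moments` is an illustrative helper name, with `mus` and `vars_` holding each member's Gaussian outputs:

```python
import numpy as np

def ensemble_moments(mus, vars_):
    """Aggregate M member Gaussians N(mu_m, var_m) into one Gaussian.

    mus, vars_: arrays of shape (M, ...) with each member's outputs.
    Returns (mu_star, var_star, noise, model_unc), where
    var_star = noise + model_unc.
    """
    mu_star = mus.mean(axis=0)                         # ensemble prediction
    noise = vars_.mean(axis=0)                         # avg. predicted noise level
    model_unc = (mus ** 2).mean(axis=0) - mu_star ** 2 # variance of member means
    return mu_star, noise + model_unc, noise, model_unc
```

Note that `model_unc` is exactly the spread of the member means, so it vanishes when all members agree, regardless of the predicted noise level.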
3.1 Maximizing Overall Diversity (MOD)
Assuming inputs $x$ have been scaled to bounded regions, MOD encourages higher ensemble diversity by introducing an auxiliary loss that is computed over augmented data sampled from another distribution $Q$ defined over the input feature space. $Q$ differs from the underlying training data distribution $P$ and instead describes what sorts of OOD examples could possibly be encountered at test-time. The underlying population objective we target is:

$$\min_{\theta_1, \dots, \theta_M} \; \mathbb{E}_{(x,y) \sim P}\left[\mathcal{L}(x, y)\right] \; - \; \lambda \, \mathbb{E}_{x \sim Q}\left[\sigma^2_{model}(x)\right]$$

with $\mathcal{L}$ as the original supervised learning loss function (e.g. NLL), and a user-specified penalty $\lambda > 0$. Since NLL entails a proper scoring rule (Lakshminarayanan et al., 2017), minimizing the above objective with a sufficiently small value of $\lambda$ will ensure the ensemble seeks to recover the true conditional distribution for inputs that lie in the support of the training distribution, and otherwise to output large model uncertainty for OOD inputs that lie beyond this support. As it is difficult in most applications to specify how future OOD examples may look, we aim to ensure the ensemble outputs high uncertainty estimates for any possible OOD input by taking the entire input space into consideration. This corresponds to $q(x) = 1/\mathrm{vol}(\mathcal{X})$ for all $x \in \mathcal{X}$, where $\mathcal{X}$ is the bounded region of all possible inputs and $q$ denotes the pdf (or pmf) of $Q$. In practice, we approximate $\mathbb{E}_P[\mathcal{L}]$ using the average loss over the training data (as in ERM), and train each $\theta_m$ with respect to its contribution to this term independently of the others (as in bagging). To approximate $\mathbb{E}_{x \sim Q}[\sigma^2_{model}(x)]$, we similarly utilize an empirical average based on augmented examples sampled uniformly throughout the feature space $\mathcal{X}$. The formal MOD procedure is detailed in Algorithm 1. We advocate selecting $\lambda$ as the largest value for which estimates of $\mathbb{E}_P[\mathcal{L}]$ (on held-out validation data) do not indicate worse predictive performance. This strategy naturally favors smaller values of $\lambda$ as the sample size grows, thus resulting in lower model uncertainty estimates (with $\sigma^2_{model}(x) \to 0$ as $n \to \infty$ when $P$ is supported everywhere and our NNs are universal approximators).
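A minimal NumPy evaluation of this objective on one batch might look as follows; `predict`, `x_lo`, `x_hi`, and `n_aug` are illustrative names (a hypothetical helper returning each member's Gaussian outputs, and the bounds of the scaled input region), not notation from the paper:

```python
import numpy as np

def mod_objective(predict, x_train, y_train, x_lo, x_hi, lam, n_aug=64, seed=0):
    """One evaluation of the MOD population objective (a sketch).

    predict(x) -> (mus, vars_), arrays of shape (M, n): per-member
    Gaussian outputs of an M-network ensemble.
    """
    # Supervised term: average Gaussian NLL of each member on the batch.
    mus, vars_ = predict(x_train)
    nll = 0.5 * (np.log(2 * np.pi * vars_) + (y_train - mus) ** 2 / vars_)
    sup = nll.mean()

    # Diversity term: ensemble variance of member means at inputs drawn
    # uniformly over the whole (bounded) feature space.
    rng = np.random.default_rng(seed)
    x_aug = rng.uniform(x_lo, x_hi, size=(n_aug,) + np.shape(x_lo))
    mu_aug, _ = predict(x_aug)
    model_var = mu_aug.var(axis=0).mean()

    return sup - lam * model_var  # minimized; -lam rewards OOD diversity
```

In actual training, the diversity term would be differentiated through each network's augmented-point outputs, with fresh uniform samples drawn per minibatch.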
We also experiment with an alternative choice of $Q$: the uniform distribution over the finite training inputs (i.e. $q(x) = 1/n$ for $x \in \{x_1, \dots, x_n\}$ and $q(x) = 0$ otherwise). We call this alternative method MOD-in, and note its similarity to the diversity-encouraging penalty proposed by Brown (2004), which is also measured over the training data. In contrast, MOD considers $Q$ to be uniformly distributed over the entire feature space (i.e. all possible test inputs) rather than only the training examples. Maximizing diversity solely over the training data may fail to control ensemble behavior at OOD points that do not lie near any training example.
3.2 Maximizing Reweighted Diversity (MOD-R)
Aiming for high-quality OOD uncertainty estimates, we are most concerned with regularizing the ensemble-variance around points located in low-density regions of the training data distribution. To obtain a simple estimate that intuitively reflects (the inverse of) the local density of $P$ at a particular set of feature values, one can compute the feature-distance to the nearest training data points (Papernot and McDaniel, 2018). Under this perspective, we want to encourage greater model uncertainty for the lowest-density points that lie furthest from the training data. Commonly used covariance kernels for Gaussian Process regressors (e.g. radial basis functions) explicitly enforce a high amount of uncertainty on points that lie far from the training data. As calculating the distance of each point to the entire training set may be undesirably inefficient for large datasets, we only compute the distance of our augmented data to a current minibatch during training. Specifically, we use these distances to compute the following weights:

$$w_j \propto \frac{1}{k} \sum_{x_i \in \mathrm{NN}_k(\tilde{x}_j)} \lVert \tilde{x}_j - x_i \rVert_2$$

where the $w_j$ (normalized to sum to 1) are weights for each of the augmented points $\tilde{x}_j$, and $\mathrm{NN}_k(\tilde{x}_j)$ are the $k$ members of the minibatch that are the nearest neighbors of $\tilde{x}_j$. Throughout this paper, we use a fixed value of $k$.
The $w_j$ are thus inversely related to a crude density estimate of the training distribution evaluated at each augmented sample $\tilde{x}_j$. Rather than optimizing the loss which uniformly weights each augmented sample (as done in Algorithm 1), we can instead form a weighted loss computed over the minibatch of augmented samples as $\sum_j w_j \, \sigma^2_{model}(\tilde{x}_j)$, which should increase the model uncertainty for augmented inputs in proportion to their distance from the training data. We call this variant of our methodology with augmented-input reweighting MOD-R.
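The distance-based weights can be sketched as below; `mod_r_weights` is an illustrative helper, and the weights are normalized to sum to 1:

```python
import numpy as np

def mod_r_weights(x_aug, x_batch, k=5):
    """Distance-based weights for augmented points (MOD-R sketch).

    Each augmented point's weight is its mean Euclidean distance to its
    k nearest neighbors in the current training minibatch, so points
    far from the data receive larger diversity rewards. k=5 is an
    illustrative default, not a value specified in the paper.
    """
    # Pairwise distances between augmented points and the minibatch.
    d = np.linalg.norm(x_aug[:, None, :] - x_batch[None, :, :], axis=-1)
    k = min(k, x_batch.shape[0])
    knn = np.sort(d, axis=1)[:, :k]   # k smallest distances per point
    w = knn.mean(axis=1)
    return w / w.sum()
```

The weighted diversity penalty is then the dot product of these weights with the per-point ensemble variances, in place of the uniform average used by MOD.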
4.1 Baseline Methods
Here, we evaluate various alternative strategies for improving model ensembles. All strategies are applied to the same base NN ensemble, which is taken to be the Deep Ensembles (DeepEns) model of Lakshminarayanan et al. (2017) previously described in Section 3.1.
Deep Ensembles with Adversarial Training (DeepEns+AT)
Lakshminarayanan et al. (2017) used this strategy to improve their basic DeepEns model. The idea is to adversarially sample inputs that lie close to the training data but on which the NLL loss is high (assuming they share the same label as their neighboring training example). These adversarial points are then included as augmented data when training the ensemble, which smooths the function learned by the ensemble. Starting from training example $(x, y)$, we sample the augmented datapoint $x' = x + \epsilon \cdot \mathrm{sign}(\nabla_x \ell(\theta; x, y))$, with the label for $x'$ assumed to be the same as that for the corresponding $x$. Here $\ell$ denotes the NLL loss function, and the values for hyperparameter $\epsilon$ that we search over include 0.05, 0.1, 0.2.
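The fast-gradient-sign augmentation above can be sketched framework-free; gradients are estimated here with central finite differences purely for illustration (in practice one would use autograd, e.g. `torch.autograd`), and `loss_fn` is a hypothetical scalar loss of the input:

```python
import numpy as np

def fgsm_augment(x, y, loss_fn, eps=0.1, h=1e-4):
    """Return x' = x + eps * sign(grad_x loss_fn(x, y)) (a sketch).

    loss_fn(x, y) -> scalar loss; eps matches the searched-over
    hyperparameter, h is the finite-difference step.
    """
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        # Central difference estimate of the i-th partial derivative.
        grad.flat[i] = (loss_fn(x + e, y) - loss_fn(x - e, y)) / (2 * h)
    return x + eps * np.sign(grad)
```

The augmented point is then trained on with the original label $y$, which smooths the ensemble in a neighborhood of each training example.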
Negative Correlation (NegCorr)
This method from Liu and Yao (1999) minimizes the empirical correlation between predictions of different ensemble members over the training data. It adds a penalty to the loss of the form $\sum_{m} (f_m(x) - \bar{f}(x)) \sum_{m' \neq m} (f_{m'}(x) - \bar{f}(x))$, where $f_m(x)$ is the prediction of the $m$th ensemble member and $\bar{f}(x)$ is the mean ensemble prediction. This penalty is weighted by a user-specified coefficient $\lambda$, as in our methodology.
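The negative-correlation penalty can be computed in a few lines; a sketch over one batch of member predictions:

```python
import numpy as np

def neg_corr_penalty(preds):
    """Negative-correlation penalty of Liu & Yao (1999) over one batch.

    preds: array (M, n) of each member's predictions f_m(x_i). For each
    member m, the term is (f_m - fbar) * sum_{m' != m} (f_m' - fbar);
    we sum over members and average over the batch. Since deviations
    from fbar sum to zero, this equals -sum_m (f_m - fbar)^2 per
    example, so minimizing it spreads the members apart.
    """
    dev = preds - preds.mean(axis=0, keepdims=True)   # f_m - fbar
    per_example = (dev * (dev.sum(axis=0, keepdims=True) - dev)).sum(axis=0)
    return per_example.mean()
```

Note that, unlike MOD, this penalty is evaluated only at training inputs, so it does not directly control behavior far from the data.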
4.2 Experiment Details
All experiments were run on Nvidia TitanX 1080 Ti and Nvidia TitanX 2080 Ti GPUs with PyTorch version 1.0. Unless otherwise indicated, all p-values were computed using a single-tailed paired t-test per dataset, and the p-values are combined using Fisher's method to produce an overall p-value across all datasets in a task. All hyperparameters – including learning rate, $\ell_2$-regularization, $\lambda$ for MOD/Negative Correlation, and $\epsilon$ for adversarial training – were tuned based on validation set NLL. In every regression task, the search for hyperparameter $\lambda$ was over the values 0.01, 0.1, 1, 5, 10, 20, 50.
4.3 Univariate Regression
We first consider a one-dimensional toy regression dataset similar to the one used by Blundell et al. (2015), generating training data as noisy evaluations of a fixed nonlinear function. Here, the training data only contain samples drawn from two limited-size regions. Using the standard NLL loss as well as the auxiliary MOD penalty, we train a deep ensemble of 4 neural networks with identical architecture (1 hidden layer with 50 units, ReLU activations, two sigmoid outputs to estimate the mean and variance of $y$, and $\ell_2$ regularization). To depict the improvement gained by simply adding ensemble members, we also train an ensemble of 120 networks with the same architecture. Figure 1 shows the predictions and confidence intervals of the ensembles. MOD produces more reliable uncertainty estimates in the left-hand regions, where regular deep ensembles fail even with many networks. MOD also properly inflates the predictive uncertainty in the center region where no training data is found. Using a smaller $\lambda$ in MOD ensures that the ensemble's predictive performance remains strong for in-distribution inputs that lie near the training data, and that the ensemble exhibits adequate levels of certainty around these points. While the larger $\lambda$ value leads to overly conservative uncertainty estimates that are large everywhere, we note the mean of the ensemble predictions remains highly accurate for in-distribution inputs.
4.4 UCI Regression Datasets
We next experimented with nine real-world datasets with continuous inputs in some applicable bounded domain. We follow the experimental setup that Lakshminarayanan et al. (2017) and Hernández-Lobato et al. (2017) used to evaluate deep ensembles and deep Bayesian regressors. We split off all datapoints whose $y$-values fall in the top quantile as an OOD test set (so datapoints with such large $y$-values are never encountered during training). We simulate the situation where the training set is limited and thus use small fractions of the data for training and for validation. The remaining data is used as an in-distribution test set. The analysis is repeated for 10 random splits of the data to ensure the results are robust. We again use an ensemble of 4 fully-connected neural networks with the same architecture as in the previous experiment and the NLL training loss (searching over a grid of penalty $\lambda$ and learning rate values). We report the negative log-likelihood (NLL) and root mean squared error (RMSE) on both in- and out-of-distribution test sets for ensembles trained via different strategies. We also examine the calibration curve, where predicted confidence levels are compared with the actual empirical level of correctness observed in the data.
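The top-quantile OOD split described above can be sketched as follows; `ood_frac` is an illustrative parameter (the exact cutoff fraction is not restated here):

```python
import numpy as np

def ood_split(x, y, ood_frac=0.1, seed=0):
    """Split off examples with the largest y-values as an OOD test set.

    The remaining (in-distribution) examples are shuffled so they can
    be further divided into train/validation/test subsets.
    """
    order = np.argsort(y)
    n_ood = int(len(y) * ood_frac)
    ood_idx = order[len(y) - n_ood:]          # largest-y examples
    in_idx = order[:len(y) - n_ood]
    rng = np.random.default_rng(seed)
    in_idx = rng.permutation(in_idx)
    return (x[in_idx], y[in_idx]), (x[ood_idx], y[ood_idx])
```

This guarantees the OOD test set lies beyond the range of target values ever observed during training.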
As shown in Table 1 and Appendix Table 1, MOD outperforms DeepEns on 6 of the 9 datasets in OOD NLL, 7 of the 9 in OOD RMSE, and all 9 in in-distribution NLL/RMSE. This shows that the MOD loss leads to higher-quality uncertainties on OOD data while also improving in-distribution performance. MOD also outperforms all baselines with a significant overall p-value on NLL/RMSE, except that it is on par with NegCorr on RMSE, demonstrating the robustness of the improvement contributed by MOD.
| MOD outperformance p-value | 0.002 | 4.9e-07 | 0.034 | 1.6e-04 | 4.6e-05 |
| MOD outperformance p-value | 2.4e-08 | 6.9e-11 | 0.116 | 8.9e-06 | 2.2e-06 |

In Table 1, second-best results are also indicated; in case of a tie in the means, the method with the lower standard deviation is highlighted.
The calibration curves for two of the datasets show that the basic deep ensembles exhibit over-confidence on OOD data. Note that retaining accurate calibration on OOD data is extremely difficult for most machine learning methods. MOD and MOD-R improve calibration by a significant margin compared to most of the baselines, validating the effectiveness of our MOD procedure.
4.5 Protein Binding Microarray Data
We next study scientific data with discrete features by predicting protein-DNA binding. This is a collection of 38 different microarray datasets, each of which contains measurements of the binding affinity of a single transcription factor (TF) protein against all possible 8-base DNA sequences (Barrera et al., 2016). We consider each dataset as a separate task, with $y$ taken to be the binding affinity (re-scaled to the [0,1] interval) and $x$ the one-hot embedded DNA sequence (as we ignore reverse-complements, there are roughly $4^8/2$ possible values of $x$).
We trained a small ensemble of 4 neural networks with the same architecture as in the previous experiments. We consider 2 different OOD test sets: one comprised of the sequences with the top 10% of $y$-values, and the other comprised of the sequences with more than 80% of positions being G or C (GC-content). For each OOD set, we use the remainder of the sequences as the corresponding in-distribution set. We separate these into an extremely small training set (300 examples) and validation set (300 examples), and use the rest as the in-distribution test set. We compare MOD along with 2 alternative sampling distributions (MOD-R and MOD-in) against the 3 baselines mentioned in Section 4.1. We search over 0, 0.001, 0.01, 0.05, 0.1 for the penalty $\lambda$ and use a learning rate of 0.01.
[Table 2 layout: for each method, NLL and RMSE are reported together with the number of TFs (and p-value) on which MOD/MOD-R outperform that method, separately for in-distribution test performance, out-of-distribution test performance with OOD defined as sequences with top 10% binding affinity, and out-of-distribution test performance with OOD defined as sequences with >80% GC content.]
Table 2 shows mean OOD and in-distribution performance across the 38 TFs (averaged over 10 runs using random data splits and NN initializations). The MOD methods significantly improve performance on all metrics and OOD setups compared to DeepEns/DeepEns+AT, both in terms of the number of TFs on which they outperform and the overall p-value. The re-weighting scheme (MOD-R) further improved performance on the top-10% $y$-value OOD setup. Figure 2 shows the calibration curves for two of the TFs where the deep ensembles are over-confident on top-10% $y$-value OOD examples. MOD-R and MOD improve calibration by a significant margin compared to most of the baselines.
Next, we compared how the MOD, MOD-R, and MOD-in ensembles performed against the DeepEns, DeepEns+AT, and NegCorr ensembles in 38 Bayesian optimization tasks using the same protein binding data (see Hashimoto et al. (2018)). For each TF, we performed 30 rounds of DNA-sequence acquisition, acquiring batches of 10 sequences per round in an attempt to maximize binding affinity. We used the upper confidence bound (UCB) as our acquisition function (Chen et al., 2017), ordering the candidate points via $\mathrm{UCB}(x) = \mu_*(x) + \kappa\,\sigma_*(x)$ (with a fixed UCB coefficient $\kappa$).
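The UCB-based batch acquisition can be sketched as follows; `predict` is a hypothetical helper returning each ensemble member's Gaussian outputs, and `kappa` stands in for the UCB coefficient:

```python
import numpy as np

def ucb_acquire(candidates, predict, kappa=1.0, batch_size=10):
    """Rank candidates by UCB(x) = mu_*(x) + kappa * sigma_*(x) (sketch).

    predict(x) -> (mus, vars_) of shape (M, n): per-member Gaussian
    outputs. Returns the indices of the top-scoring batch.
    """
    mus, vars_ = predict(candidates)
    mu_star = mus.mean(axis=0)
    # Total predictive std: avg. predicted noise plus spread of member means.
    sigma_star = np.sqrt(vars_.mean(axis=0) + mus.var(axis=0))
    scores = mu_star + kappa * sigma_star
    return np.argsort(scores)[::-1][:batch_size]
```

Better-calibrated $\sigma_*$ directly changes which sequences get acquired, which is why improved OOD uncertainty helps in this setting.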
Table 3 reports, for each MOD variant, the number of TFs (out of 38) with better simple regret than each competing method, with the combined p-value in parentheses:

| MOD-in | 21 (0.111) | 21 (0.041) | 19 (0.356) | 17 (0.791) | 16 (0.51) |
| MOD | 26 (0.003) | 24 (0.004) | 20 (0.001) | 19 (0.002) | 22 (0.173) |
| MOD-R | 22 (0.019) | 23 (0.007) | 22 (0.017) | 20 (0.052) | 14 (0.674) |
At every acquisition iteration, we randomly held out 10% of the training set as the validation set and chose the penalty $\lambda$ (for MOD, MOD-in, MOD-R, and NegCorr) that produced the best validation NLL (out of the choices 0, 5, 10, 20, 40, 80). The stopping epoch is chosen based on the validation NLL not improving for 10 epochs, with an upper limit of 30 epochs. Optimization used the Adam optimizer with a learning rate of 0.01 and $\ell_2$ penalty of 0.01. For each of the 38 TFs, we performed 20 Bayesian optimization runs with different seed sequences (the same seeds used for all methods), using 200 points randomly sampled from the bottom 90% of $y$-values as our initial training set.
We evaluated using simple regret, $\max_x f(x) - \max_{x \in \mathcal{A}_t} f(x)$, where the second term in the subtraction quantifies the best point acquired so far (over the acquired set $\mathcal{A}_t$) and the first term is the global best. The results are presented in Table 3. MOD outperforms all other methods in both the number of TFs with better regret and the combined p-value. MOD-R is also strong, outperforming all other methods except MOD, relative to which it is roughly equivalent in terms of statistical significance. Figure 3 shows the regret curves for the TFs OVOL2 and HESX1, tasks in which MOD and MOD-R outperform the other methods.
4.6 Age Prediction from Images
To demonstrate the effectiveness of MOD-in, MOD, and MOD-R on high-dimensional data, we consider supervised learning with image data. Here, we use a dataset of human images collected from IMDB and Wikimedia and annotated with age and gender information (Rothe et al., 2015). The IMDB/Wiki parts of the dataset consist of 460K+/62K+ images respectively. 28,601 images in the Wiki dataset are of males and the rest are of females.
Using the Wiki images, we predict a person's age from their image, with 2000 images of males as the training set. We used the female images as the unlabeled OOD test set on which we increased variance via MOD and MOD-R. We used the Wide Residual Network architecture (Zagoruyko and Komodakis, 2016) with a depth of 4 and a width factor of 2. As before, we used an ensemble of size 4. The search for the optimal $\lambda$ value was over a fixed grid. The stopping epoch is chosen based on the validation NLL not improving for 10 epochs, with an upper limit of 30 epochs. Optimization used the Adam optimizer with a learning rate of 0.001 and $\ell_2$ penalty of 0.001. MOD achieves an NLL of -0.16 on OOD (female) data and -0.331 on in-distribution (male) data, while MOD-R achieves -0.148 on OOD and -0.328 in-distribution. This contrasts with DeepEns, which achieves only -0.13 on OOD and -0.32 in-distribution. Thus both MOD and MOD-R show improvements in NLL on the OOD (female) data (p-values of 0.014 and 0.102 respectively) and slight improvements in NLL on in-distribution (male) data. In addition, while DeepEns+AT has a better mean OOD NLL than MOD, the result is not significant (p-value 0.388), so the two methods can be considered equivalent on this dataset. Notably, every MOD variant improves performance across all 4 metrics. Thus augmenting the loss function with the MOD penalty should not harm model performance.
| Methods | OOD NLL | In-Dist NLL | OOD RMSE | In-Dist RMSE |
| DeepEns | -0.131 ± 0.050 | -0.320 ± 0.024 | 0.208 ± 0.007 | 0.180 ± 0.003 |
| NegCorr | -0.153 ± 0.038 | -0.328 ± 0.021 | 0.204 ± 0.006 | 0.178 ± 0.004 |
5 Conclusion

We have developed a new loss function and data augmentation strategy that helps stabilize the distributional uncertainty estimates obtained from model ensembling. Our method increases model uncertainty over the entire input space while simultaneously maintaining predictive performance. We further propose a variant of our method which assesses the distance of an augmented sample from the training distribution and aims to ensure higher model uncertainty in regions of low density under the training data distribution. When used for training deep NN ensembles, our method produces strong improvements in in- and out-of-distribution NLL, out-of-distribution RMSE, and calibration on a variety of datasets drawn from biology, vision, and the UCI benchmarks commonly used to evaluate uncertainty quantification methods. The resulting ensembles also prove useful across numerous Bayesian optimization tasks. Future work could develop techniques to generate OOD augmented samples for structured data domains like images/text, as well as apply our uncertainty-aware ensembles to currently challenging tasks such as exploration in reinforcement learning.
- Balan et al. (2015) A. K. Balan, V. Rathod, K. P. Murphy, and M. Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems, 2015.
- Barrera et al. (2016) Luis A Barrera, Anastasia Vedenko, Jesse V Kurland, Julia M Rogers, Stephen S Gisselbrecht, Elizabeth J Rossin, Jaie Woodard, Luca Mariani, Kian Hong Kock, Sachi Inukai, et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 351(6280):1450–1454, 2016.
- Beluch et al. (2018) W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler. The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
- Breiman (1996) Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
- Brown (2004) Gavin Brown. Diversity in neural network ensembles. PhD thesis, University of Birmingham, 2004.
- Chen et al. (2017) R. Y. Chen, S. Sidor, P. Abbeel, and J. Schulman. UCB exploration via Q-ensembles. arXiv:1706.01502, 2017.
- Hafner et al. (2018) Danijar Hafner, Dustin Tran, Timothy Lillicrap, Alex Irpan, and James Davidson. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv:1807.09289, 2018.
- Hashimoto et al. (2018) Tatsunori B Hashimoto, Steve Yadlowsky, and John C Duchi. Derivative free optimization via repeated classification. In International Conference on Artificial Intelligence and Statistics, 2018.
- Hernández-Lobato et al. (2017) José Miguel Hernández-Lobato, James Requeima, Edward O Pyzer-Knapp, and Alán Aspuru-Guzik. Parallel and distributed thompson sampling for large-scale accelerated exploration of chemical space. arXiv:1706.01825, 2017.
- Hooker and Rosset (2012) Giles Hooker and Saharon Rosset. Prediction-focused regularization using data-augmented regression. Statistics and Computing, 1:237–349, 2012.
- Kuncheva and Whitaker (2003) Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51:181–207, 2003.
- Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.
- Lee et al. (2018) Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations, 2018.
- Lee et al. (2015) S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, and D. Batra. Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv:1511.06314, 2015.
- Liu and Yao (1999) Yong Liu and Xin Yao. Ensemble learning via negative correlation. Neural networks, 12(10):1399–1404, 1999.
- Osband et al. (2016) I. Osband, C. Blundell, A. Pritzel, and B Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016.
- Papadopoulos et al. (2001) G. Papadopoulos, P. J. Edwards, and A. F. Murray. Confidence estimation methods for neural networks: A practical comparison. IEEE Transactions on Neural Networks, 12:1278–1287, 2001.
- Papernot and McDaniel (2018) Nicolas Papernot and Patrick McDaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.
- Riquelme et al. (2018) Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In International Conference on Learning Representations, 2018.
- Rothe et al. (2015) Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 10–15, 2015.
- Zagoruyko and Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.