Ensembles of Locally Independent Prediction Models

by Andrew Slavin Ross, et al.

Many ensemble methods encourage their constituent models to be diverse, because ensembling provides no benefits when models are identical. Most methods define diversity in terms of differences in training set predictions. In this paper, however, we demonstrate that diversity in training set predictions does not necessarily imply diversity when extrapolating even slightly outside it (which can affect generalization). To address this issue, we introduce a new diversity metric and associated method of training ensembles of models that extrapolate differently on local patches of the data manifold. Across a variety of synthetic and real-world tasks, we find that our method improves generalization and diversity in qualitatively novel ways, especially under data limits and covariate shift.






Code Repositories


Code for AAAI 2020 paper "Ensembles of Locally Independent Prediction Models"


1 Introduction

An ensemble is generally more accurate than its constituent models. However, for this to hold true, those models must make different errors on unseen data [9, 7]. This is often described as the ensemble’s “diversity.”

Despite diversity’s well-recognized importance, there is no firm consensus on how best to foster it. Some procedures encourage it implicitly, e.g. by training models with different inputs [3], while others explicitly optimize for proxies [23] that tend to be functions of differences in training set predictions [19].

However, there has been increasing criticism of supervised machine learning for focusing too exclusively on cases where training and testing data are drawn from the same distribution [20]. In many real-world settings, this assumption does not hold, e.g. due to natural covariate shift over time [32] or selection bias in data collection [36]. Intuitively, we might hope that a “diverse” ensemble would more easily adapt to such problems, since ideally different members would be robust to different shifts. In this paper, however, we show that incentivizing members of ensembles to make different predictions on the training data often still results in models that make identical errors when extrapolating. This can hurt generalization in practical settings.

In this paper, we address the problem of extrapolation diversity. Specifically, our contributions are (1) a novel and differentiable diversity measure, defined as a formal proxy for the ability of classifiers to extrapolate differently away from data, and (2) a method for training an ensemble of classifiers to be diverse by this measure. We apply our method to a range of synthetic and real-world datasets under both normal conditions and covariate shift, and illustrate improvements and differences between our method and baselines.

2 Related Work

Ensembling is a well-established subfield of supervised learning [3, 4, 13, 30], and one of its important lessons is that model diversity is a necessary condition for creating predictive and robust ensembles [17]. There are a number of methods for fostering diversity, which can be roughly divided into two categories: those that implicitly promote diversity by random modifications to training conditions, and those that explicitly promote it by deliberate modifications to the objective function.

Implicit diversity methods sometimes operate by introducing stochasticity into which models see which parts of the data, e.g. by randomly resampling training examples [3] or subsets of input features [4]. Other implicit methods exploit model parameter stochasticity, e.g. by retraining from different initializations [16, 6, 33] or sampling from parameter snapshots saved during individual training cycles [14].

Methods that explicitly encourage diversity include boosting [30, 8], which sequentially modifies the objective function of each model to specialize on previous models’ mistakes, and methods like negative correlation learning [22], amended cross-entropy [31], and DPPs over non-maximal predictions [26], which simultaneously train models with penalties on both individual errors and pairwise similarities. Finally, methods such as Diverse Ensemble Evolution [37] and Competition of Experts [27] use explicit techniques to encourage models to specialize in different regions of input space.

Although at first glance these diverse training techniques seem quite diverse themselves, they are all similar in a crucial respect: they encourage diversity in terms of training set predictions. In the machine learning fairness, adversarial robustness, and explainability communities, however, there has been increasing movement away from the assumption that train is similar to test. For example, many methods for locally explaining ML predictions literally present simplified approximations of how models extrapolate away from given points [1, 29, 28], while adversarial attacks (and defenses) exploit (and mitigate) pathological extrapolation behavior [34, 25], sometimes in an ensemble setting [35]. Although our focus is not explicitly on explainability or adversarial robustness, our method can be seen as a reapplication of techniques in those subfields to the problem of ensemble diversity.

3 Method

In this section, building on Ross et al. (2018), we formally define our diversity measure and training procedure, beginning with notation. We use $x$ to denote $D$-dimensional inputs, which are supported over an input space $\Omega_x \subseteq \mathbb{R}^D$. We use $y$ to denote prediction targets in an output space $\Omega_y$. In this paper, $\Omega_y$ will be $\mathbb{R}$, and we focus on the case where it represents a log-odds used for binary classification, but our method can be generalized to classification or regression in $\mathbb{R}^K$ given any notion of distance between outputs. We seek to learn prediction models $f(\cdot; \theta): \Omega_x \to \Omega_y$ (parameterized by $\theta$) that estimate $y$ from $x$. We assume these models $f$ are differentiable with respect to $x$ and $\theta$ (which is true for linear models and neural networks).

In addition, we suppose a joint distribution over inputs and targets $p(x, y)$ and a distribution $p(y \mid f(x; \theta))$ quantifying the likelihood of the observed target given the model prediction. Typically, during training, we seek model parameters that maximize the likelihood of the observed data, $\mathbb{E}_{p(x,y)}[\log p(y \mid f(x; \theta))]$.

3.1 Diversity Measure: Local Independence

We now introduce a model diversity measure that quantifies how differently two models generalize over small patches of the data manifold $\Omega_x$. Formally, we define an $\epsilon$-neighborhood of $x$, denoted $N_\epsilon(x)$, on the data manifold to be the intersection of an $\epsilon$-ball centered at $x$ in the input space, $B_\epsilon(x) \subset \mathbb{R}^D$, and the data manifold: $N_\epsilon(x) = B_\epsilon(x) \cap \Omega_x$. We capture the notion of generalization difference on a small neighborhood of $x$ through an intuitive geometric condition: we say that two functions $f$ and $g$ generalize maximally differently at $x$ if $f$ is invariant in the direction of the greatest change in $g$ (or vice versa) within an $\epsilon$-neighborhood around $x$. That is:

$$f(x) = f(x_{g_{\max}}) \quad \text{for all } \epsilon' < \epsilon, \qquad (1)$$

where we define $x_{g_{\max}} = \arg\max_{x' \in N_{\epsilon'}(x)} g(x')$. In other words, perturbing $x$ by small amounts to increase $g$ inside $N_\epsilon(x)$ does not change the value of $f$. In the case that a choice of $\epsilon$ exists to satisfy Equation 1, we say that $f$ is locally independent of $g$ at $x$. We call $f$ and $g$ locally independent without qualification if for every $x \in \Omega_x$ the functions $f$ and $g$ are locally independent at $x$ for some choice of $\epsilon$. We note that in order for the right-hand side expression of Equation 1 to be well-defined, we assume that the gradient of $g$ is not zero at $x$ and that $\epsilon$ is chosen to be small enough that $g$ is convex or concave over $N_\epsilon(x)$.

In the case that $f$ and $g$ are classifiers, local independence intuitively implies a kind of dissimilarity between their decision boundaries. For example, if $f$ and $g$ are linear and the data manifold is Euclidean, then $f$ and $g$ are locally independent if and only if their decision boundaries are orthogonal.
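The linear case can be checked by hand. The following is a short sketch, assuming linear log-odds models $f(x) = w_f^\top x + b_f$ and $g(x) = w_g^\top x + b_g$ (the weight symbols are our own notation, not the paper's) and a Euclidean data manifold, so that the neighborhood of $x$ is the full $\epsilon'$-ball. Writing $x_{g_{\max}}$ for the maximizer of $g$ over that ball:

```latex
x_{g_{\max}}
  = \arg\max_{\|x' - x\|_2 \le \epsilon'} \left( w_g^\top x' + b_g \right)
  = x + \epsilon' \frac{w_g}{\|w_g\|_2},
\qquad
f(x_{g_{\max}}) - f(x)
  = \epsilon' \, \frac{w_f^\top w_g}{\|w_g\|_2}.
```

The difference vanishes for every $\epsilon' > 0$ exactly when $w_f^\top w_g = 0$. Since $w_f$ and $w_g$ are the normals of the two decision boundaries, this is the statement that the boundaries are orthogonal.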

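This definition motivates the formulation of a diversity measure, $\mathrm{IndepErr}(f, g)$, quantifying how far $f$ and $g$ are from being locally independent:

$$\mathrm{IndepErr}(f, g) \equiv \mathbb{E}\left[\left(f(x_{g_{\max}}) - f(x)\right)^2\right]. \qquad (2)$$

We can use this penalty within an ensemble-wide loss function $\mathcal{L}$ for a set of models $\{f_m\}$ as follows:

$$\mathcal{L}(\{f_m\}) = \sum_m \mathbb{E}_{p(x,y)}\left[-\log p(y \mid f_m(x))\right] + \lambda \sum_{\ell \neq m} \mathrm{IndepErr}(f_\ell, f_m). \qquad (3)$$

The first term encourages each model $f_m$ to be predictive and the second encourages diversity in terms of the diversity measure $\mathrm{IndepErr}$ (with a strength hyperparameter $\lambda$). Computing $\mathrm{IndepErr}$ exactly, however, is challenging, because it requires an inner optimization of $g$. Although it can be closely approximated for fixed small $\epsilon$ with projected gradient descent as in adversarial training [25], that procedure is computationally intensive. If we let $\epsilon \to 0$, however, we can approximate $x_{g_{\max}}$ by a fairly simple equation that only needs to compute $\nabla g$ once per $x$. In particular, we observe that under certain smoothness assumptions on $g$, with unconstrained $\Omega_x$ (see footnote 1), and as $\epsilon \to 0$, we can make the approximation

$$x_{g_{\max}} \approx x + \epsilon \nabla g(x). \qquad (4)$$

Assuming similar smoothness assumptions on $f$ (so we can replace it by its first-order Taylor expansion), we see that

$$f(x_{g_{\max}}) - f(x) \approx f(x + \epsilon \nabla g(x)) - f(x) = \epsilon \, \nabla f(x) \cdot \nabla g(x) + O(\epsilon^2). \qquad (5)$$

In other words, the independence error between $f$ and $g$ is determined, to first order, by the dot product of their gradients $\nabla f(x) \cdot \nabla g(x)$. Empirically, we find it helpful to normalize the dot product and work in terms of cosine similarity $\cos(\nabla f(x), \nabla g(x)) \equiv \frac{\nabla f(x) \cdot \nabla g(x)}{\|\nabla f(x)\|_2 \, \|\nabla g(x)\|_2} \in [-1, 1]$. We also add a small constant value to the denominator to prevent numerical underflow.

Footnote 1: The simplifying assumption that $N_\epsilon(x) \approx B_\epsilon(x)$ in a local neighborhood around $x$ is significant, though not always inappropriate. We discuss both limitations and generalizations in Section A.1.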
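As an illustration of the normalized quantity described above, here is a minimal sketch of the squared cosine similarity between two input-gradient vectors. It uses plain Python lists; the function name `cos_sq` and the value of the stabilizing constant `eps` are our own assumptions, not taken from the paper's code.

```python
import math

def cos_sq(grad_f, grad_g, eps=1e-12):
    """Squared cosine similarity between two input-gradient vectors."""
    dot = sum(a * b for a, b in zip(grad_f, grad_g))
    norm_f = math.sqrt(sum(a * a for a in grad_f))
    norm_g = math.sqrt(sum(b * b for b in grad_g))
    # The small constant keeps the division well-behaved for near-zero gradients.
    return (dot / (norm_f * norm_g + eps)) ** 2

# For linear models f(x) = w_f . x and g(x) = w_g . x, the input gradients are
# simply the weight vectors, so the measure can be evaluated directly on them.
orthogonal = cos_sq([1.0, 0.0], [0.0, 2.0])  # boundaries at right angles
parallel = cos_sq([1.0, 1.0], [3.0, 3.0])    # same boundary direction
```

Because of the normalization, rescaling either gradient leaves the value unchanged, so only the directions in which the two models are sensitive matter.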

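Alternate statistical formulation: As another way of obtaining this cosine similarity approximation, suppose we sample small perturbations $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ and evaluate $f(x + \epsilon) - f(x)$ and $g(x + \epsilon) - g(x)$. As $\sigma \to 0$, these differences approach $\epsilon \cdot \nabla f(x)$ and $\epsilon \cdot \nabla g(x)$, which are 1D Gaussian random variables whose correlation is given by $\cos(\nabla f(x), \nabla g(x))$ and whose mutual information is $-\frac{1}{2} \ln\left(1 - \cos^2(\nabla f(x), \nabla g(x))\right)$ per Gretton et al. (2003). Therefore, making the input gradients of $f$ and $g$ orthogonal is equivalent to enforcing statistical independence between their outputs when we perturb $x$ with samples from $\mathcal{N}(0, \sigma^2 I)$ as $\sigma \to 0$. This could be used as an alternate definition of “local independence.”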
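The statistical reading can be checked numerically. The sketch below is only an illustration: the two functions `f` and `g`, the evaluation point, the perturbation scale, and the sample count are arbitrary choices of ours. It compares the empirical correlation of output changes under small Gaussian input perturbations with the cosine similarity of the analytic input gradients.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical smooth functions of a 2D input, with hand-derived gradients.
def f(x): return math.sin(x[0]) + x[1] ** 2
def g(x): return x[0] * x[1]
def grad_f(x): return [math.cos(x[0]), 2.0 * x[1]]
def grad_g(x): return [x[1], x[0]]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

rng = random.Random(0)
x = [0.5, 1.0]
sigma = 1e-3  # small perturbation scale, standing in for the limit sigma -> 0
df, dg = [], []
for _ in range(20000):
    eps = [rng.gauss(0.0, sigma) for _ in x]
    x_pert = [xi + ei for xi, ei in zip(x, eps)]
    df.append(f(x_pert) - f(x))
    dg.append(g(x_pert) - g(x))

empirical_corr = pearson(df, dg)             # correlation of output changes
gradient_cos = cosine(grad_f(x), grad_g(x))  # cosine of the input gradients
mutual_info = -0.5 * math.log(1.0 - gradient_cos ** 2)  # in nats
```

With a small `sigma`, `empirical_corr` should agree with `gradient_cos` up to sampling noise, and `mutual_info` goes to zero as the gradients approach orthogonality.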

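Motivated by this approximation, we substitute

$$\mathrm{CosIndepErr}(f, g) \equiv \mathbb{E}\left[\cos^2(\nabla f(x), \nabla g(x))\right] \qquad (6)$$

into our ensemble loss from Equation (3), which gives us a final “local independence training” objective of

$$\mathcal{L}(\{f_m\}) = \sum_m \mathbb{E}_{p(x,y)}\left[-\log p(y \mid f_m(x))\right] + \lambda \sum_{\ell \neq m} \mathrm{CosIndepErr}(f_\ell, f_m). \qquad (7)$$

Note that we will sometimes abbreviate $\mathrm{CosIndepErr}$ as $\nabla_{\cos^2}$. In our experiments as well as Figure 10, we show that $\mathrm{CosIndepErr}$ is meaningfully correlated with other diversity measures and therefore may be useful in its own right, independently of its use within a loss function.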
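To make the structure of the local independence training objective concrete, the sketch below evaluates it for a tiny ensemble of linear log-odds models in pure Python. It is an illustration under stated assumptions rather than the authors' implementation: the helper names (`neg_log_lik`, `linear_model`, `lit_loss`), the Bernoulli likelihood, the choice to sum over ordered pairs of distinct models, and the toy data are all ours. A real implementation would minimize this quantity with automatic differentiation.

```python
import math

def neg_log_lik(logit, y):
    """-log p(y | logit) for a Bernoulli likelihood with y in {0, 1}."""
    return max(logit, 0.0) - y * logit + math.log1p(math.exp(-abs(logit)))

def cos_sq(u, v, eps=1e-12):
    """Squared cosine similarity with a small stabilizing constant."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return (dot / (norms + eps)) ** 2

def linear_model(w, b=0.0):
    """Return (predict, input_grad) for the log-odds model f(x) = w . x + b."""
    def predict(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    def input_grad(x):
        return list(w)  # constant in x for a linear model
    return predict, input_grad

def lit_loss(models, X, Y, lam):
    """Prediction loss plus lam times the pairwise squared-cosine penalty."""
    n = len(X)
    prediction = sum(neg_log_lik(f(x), y) for f, _ in models for x, y in zip(X, Y)) / n
    penalty = 0.0
    for l, (_, grad_l) in enumerate(models):
        for m, (_, grad_m) in enumerate(models):
            if l != m:  # ordered pairs of distinct models
                penalty += sum(cos_sq(grad_l(x), grad_m(x)) for x in X) / n
    return prediction + lam * penalty

# Toy data on the diagonal: either feature alone separates the classes.
X = [(2.0, 2.0), (1.0, 1.0), (-1.0, -1.0), (-2.0, -2.0)]
Y = [1, 1, 0, 0]
lam = 0.1
redundant = lit_loss([linear_model([1.0, 0.0]), linear_model([1.0, 0.0])], X, Y, lam)
diverse = lit_loss([linear_model([1.0, 0.0]), linear_model([0.0, 1.0])], X, Y, lam)
```

Both ensembles fit the data equally well, so the only difference between `redundant` and `diverse` is the penalty: the pair of models that relies on disjoint features pays nothing, while the duplicated pair pays the full penalty for each ordered pair.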

4 Experiments

Training details:

For the experiments that follow, we use 256-unit or 256x256-unit fully connected neural networks with rectifier (ReLU or softplus) activations, trained in TensorFlow with Adam.


We test local independence training (“LIT”) against random restarts (“RRs”), bagging [3] (“Bag”), AdaBoost [11] (“Ada”), 0-1 squared loss negative correlation learning [23] (“NCL”), and amended cross-entropy [31] (“ACE”). We omit Diverse Ensemble Evolution [37] and Competition of Experts [27], which require more complex inner submodular or adversarial optimization steps, but note that because they also operationalize diversity as making different errors on training points, we expect the results to be qualitatively similar to ACE and NCL.


For our non-toy results, we test all methods over a range of ensemble sizes, and test methods with regularization parameters (LIT, ACE, and NCL) with 16 logarithmically spaced values of $\lambda$, using validation AUC to select the best performing model (except when examining how results vary with $\lambda$ or size). For each hyperparameter setting and method, we run 10 full random restarts (though within each restart, different methods are tested against the same split), and present mean results with standard deviation errorbars.
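A grid of 16 logarithmically spaced regularization strengths can be generated as in the sketch below. The endpoints used here are placeholders chosen for illustration only; they are not taken from the text above, which does not preserve the actual range.

```python
def logspace(low_exp, high_exp, num):
    """Return `num` values spaced evenly in log10 between 10**low_exp and 10**high_exp."""
    step = (high_exp - low_exp) / (num - 1)
    return [10.0 ** (low_exp + i * step) for i in range(num)]

# Placeholder endpoints (assumed for illustration, not from the source text).
lambdas = logspace(-4.0, 1.0, 16)
```

Consecutive values differ by a constant multiplicative factor, which gives each order of magnitude of penalty strength the same number of grid points.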

4.1 Synthetic Examples

To provide intuition and an initial demonstration of our method, we first present several sets of 2D toy examples in Figure 1. These 2D examples are constructed to have gaps in the data distribution (i.e. $\Omega_x \neq \mathbb{R}^2$), but also to have the property that locally (except at the boundary), $\Omega_x$ still behaves like $\mathbb{R}^2$ (footnote 2: that is, as $\epsilon \to 0$, the $\epsilon$-neighborhood of every point $x$ away from the boundary coincides with the full $\epsilon$-ball around it). More informally, we construct these examples to have ambiguity in how a single accurate classifier should behave, but a single intuitive “right answer” for how two accurate but maximally different classifiers should behave.

Figure 1: 2D toy datasets with gaps. We argue that “diverse” ensemble methods applied to these datasets should produce accurate models with different decision boundaries.
Figure 2: Comparison of local independence training, random restarts and NCL on toy 2D datasets. For each ensemble, the first model’s decision boundary is plotted in orange and the other in dashed blue. Both NCL and LIT are capable of producing variation, but in qualitatively different ways.
Random Split
Method Mushroom Ionosphere Sonar SPECTF
RRs 1.0 1.0
Bag 1.0 1.0
Ada 1.0
ACE 1.0 1.0
NCL 1.0 1.0
LIT 1.0 1.0
Method Mushroom Ionosphere Sonar SPECTF
Table 1: UCI dataset results in both the normal prediction task (top) and the extrapolation task (bottom) over 10 reruns, with errorbars based on standard deviations and bolding based on standard errors. On random splits, LIT offers modest AUC improvements over random restarts, on par with other ensemble methods. On extrapolation splits, however, LIT tends to result in more significant improvements to AUC. In both cases, LIT almost always exhibits the lowest pairwise Pearson correlation between heldout model errors ($\rho_{av}$), and for other methods, $\rho_{av}$ roughly matches pairwise gradient cosine similarity ($\nabla_{\cos^2}$).

In Figure 2, we present neural network decision boundaries learned by random restarts, local independence training, and negative correlation learning (NCL) on these examples. Random restarts and NCL at low $\lambda$ learn identical boundaries. As we increase $\lambda$ for NCL, its boundaries stay essentially identical until a very narrow transition region (leading up to a critical value of $\lambda$ at which it becomes favorable for one model to always predict 0 and the other to always predict 1). In this transition region, we can obtain models with differences between decision boundaries, but not between their angles, and only at the cost of a symmetrical reduction in accuracy. The fact and symmetry of the accuracy reduction makes sense given that NCL can only increase its “diversity” by making different (and therefore not 100% accurate) training predictions.

LIT, on the other hand, outputs meaningfully different boundaries even at values of $\lambda$ that are very low compared to its prediction loss term. This is in large part because on most of these tasks (except Dataset 3), there is very little tradeoff to learning a near-orthogonal boundary. At larger $\lambda$, LIT outputs decision boundaries that are completely orthogonal (at the cost of a slight accuracy reduction on Dataset 3). The main takeaway from this set of toy examples is that optimizing for diverse training predictions, even across a very wide sweep of penalty values, may never succeed in encouraging diverse extrapolation.

Relationship between LIT and feature selection: Note that a trivial way of minimizing CosIndepErr is to train an ensemble of models sensitive to disjoint sets of features. This roughly occurs in the ensemble on Dataset 1. We ran additional synthetic experiments (Figures 12 and 13) to verify that LIT can be used to perform group feature selection in higher dimensions, which is usually approached with search-based enumeration methods [10] rather than a single pass of gradient descent. However, LIT also produces more general kinds of diversity, learning diverse linear boundaries on Datasets 2 and 3 as well as diverse nonlinear boundaries on Dataset 4.

4.2 UCI Datasets

Next, we test our method on several standard binary classification datasets from the UCI repository [21]. These are ionosphere, sonar, spectf, and mushroom (with categorical features one-hot encoded, and all features z-scored). For all datasets without canonical splits, we first randomly select 80% of the dataset for training and 20% for test, then take an additional 20% split of the training set to use for validation. In addition to these random splits, we also evaluate models on an extrapolation task, where instead of splitting datasets randomly, we train on the 50% of points closest to the origin (i.e. where $\|x\|_2$ is less than its median value over the dataset) and validate/test on the remaining points (which are furthest from the origin). This test is meant to evaluate robustness to covariate shift. Quantitative performance and diversity results are presented in Table 1, and additional metrics are shown in Figures 9, 10, and 11.
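The extrapolation split described above can be sketched as follows, assuming the features have already been z-scored so that the origin is the dataset mean. The function name and the toy points are ours, and the further division of the held-out half into validation and test sets is left out.

```python
import math

def extrapolation_split(X):
    """Return (train_idx, heldout_idx): train on points whose L2 norm is below the median."""
    norms = [math.sqrt(sum(v * v for v in x)) for x in X]
    ordered = sorted(norms)
    n = len(ordered)
    if n % 2:
        median = ordered[n // 2]
    else:
        median = 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])
    train_idx = [i for i, r in enumerate(norms) if r < median]
    heldout_idx = [i for i, r in enumerate(norms) if r >= median]
    return train_idx, heldout_idx

# Toy 2D inputs, assumed already standardized.
X = [(0.1, 0.0), (3.0, 4.0), (0.5, 0.5), (-2.0, 0.0), (0.0, -0.2), (1.0, 1.0)]
train_idx, heldout_idx = extrapolation_split(X)
```

Every held-out point lies at least as far from the origin as every training point, so a model evaluated on the held-out half is always extrapolating outward from its training region.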

4.3 ICU Mortality Case Study

As a final set of experiments, we run a more in-depth case study on a real-world clinical application. In particular, we predict in-hospital mortality for a cohort of patient visits extracted from the MIMIC-III database [15] based on labs, vital signs, and basic demographics. We follow the same cohort selection and feature selection process as Ghassemi et al. (2017). In addition to this full cohort, we also test on a limited data task where we restrict the size of the training set to measure robustness to data scarcity.

We visualize the results of these experiments in many different ways to help tease out the effects of $\lambda$, ensemble size, and dataset size on individual and ensemble predictive performance, diversity, and model explanations. Table 2 shows overall performance and diversity metrics for these two tasks after cross-validation, along with the most common values of $\lambda$ and ensemble size selected for each method. Drilling into the results, Figure 3 visualizes how multiple metrics for performance (AUC and accuracy) and diversity (error correlation $\rho_{av}$ and gradient cosine similarity $\nabla_{\cos^2}$) change with $\lambda$, while Figure 4 visualizes the relationship between optimal $\lambda$ and ensemble size.

Figure 5 (and Figures 7 and 8) visualize changes in the marginal distributions of input gradients for each model in their explanatory sense [1]. As a qualitative evaluation, we discussed these explanation differences with two intensive care unit clinicians and found that LIT revealed meaningful redundancies in which combinations of features encoded different underlying conditions.

ICU Mortality Task, Full Dataset
Method AUC #
RRs 13
Bag 8
Ada 8
ACE 13
NCL 13
ICU Mortality Task, Limited Slice
Method AUC #
RRs 8
Bag 8
Ada 2
NCL 13
LIT 13
Table 2: Quantitative results on the ICU mortality prediction task, where # and $\lambda$ signify the most commonly selected values of ensemble size and regularization parameter chosen for each method. On the full data task, although all methods perform similarly, NCL and AdaBoost edge out the others slightly, and LIT consistently selects its weakest regularization parameter. On the limited data task, LIT significantly outperforms baselines, with NCL and Bagging in second, ACE indistinguishable from restarts, and significantly worse performance for AdaBoost (which overfits).
Figure 3: Changes in individual AUC/accuracy and ensemble diversity with $\lambda$ for two-model ensembles on the ICU mortality dataset (averaged across 10 reruns, error-bars omitted for clarity). For NCL and ACE, there is a wide low-$\lambda$ regime where they are indistinguishable from random restarts. This is followed by a very brief window of meaningful diversity (at slightly lower $\lambda$ for ACE than for NCL), after which both methods output pairs of models which always predict 0 and 1 (respectively), as shown by the error correlation dropping to -1. LIT, on the other hand, exhibits smooth drops in individual model predictive performance, with error correlation falling towards 0. Results for other ensemble sizes were qualitatively similar.
Figure 4: Another exploration of the effect of ensemble size and $\lambda$ on ICU mortality predictions. In particular, we find that for LIT on this dataset, the optimal value of $\lambda$ depends on the ensemble size in a roughly log-linear relationship. Because $D$-dimensional datasets can support a maximum of $D$ locally independent models (and only one model if the data completely determines the decision boundary), it is intuitive that there should be an optimal value. For NCL, we also observe an optimal value of $\lambda$, but with a less clear relationship to ensemble size and very steep dropoff to random guessing at slightly higher $\lambda$.
Figure 5: Differences in cross-patient gradient distributions of ICU mortality prediction models for random restart and locally independent ensembles. Features with mass consistently above the x-axis have positive associations with predicted mortality (increasing them increases predicted mortality) while those with mass consistently below the x-axis have negative associations (decreasing them increases predicted mortality). Distance from the x-axis corresponds to the association strength. Models trained normally (top) consistently learn positive associations with age and bun (blood urea nitrogen; larger values indicate kidney failure) and negative associations with weight and urine (low weight is correlated with mortality; low urine output also indicates kidney failure or internal bleeding). However, they also learn somewhat negative associations with creatinine, which confused clinicians because high values are another indicator of kidney failure. When we trained LIT models, however, we found that creatinine regained its positive association with mortality (in model 2), while the other main features were more or less divided up. This collinearity between creatinine and bun/urine in indicating organ problems (and revealed by LIT) was one of the main insights derived in our qualitative evaluation with ICU clinicians.

5 Discussion

LIT matches or outperforms other methods, especially under data limits or covariate shift.

On the UCI datasets under normal conditions (random splits), LIT always offers at least modest improvements over random restarts, and often outperforms other baselines. Under extrapolation splits, LIT tends to do significantly better. This pattern repeats itself on the normal vs. data-limited versions of the ICU mortality prediction task. We hypothesize that on small or selectively restricted datasets, there is typically more predictive ambiguity, which hurts the generalization of normally trained ensembles (which consistently make similar guesses on unseen data). LIT is more robust to these issues.

Gradient cosine similarity can be a meaningful diversity metric.

In our UCI results in Table 1, we saw that for non-LIT methods, error correlation $\rho_{av}$ and gradient similarity $\nabla_{\cos^2}$ (which does not require labels to compute) were often close in value. We plot this relationship over all datasets, methods, restarts, and hyperparameters in Figure 10 and find that the correspondence continues to hold. One potential explanation for this correspondence is that, by our analysis at the end of Section 3.1, $\mathrm{CosIndepErr}$ can literally be interpreted as an average squared correlation (between changes in model predictions over infinitesimal Gaussian perturbations away from each input). We hypothesize that $\mathrm{CosIndepErr}$ may be a useful quantity independently of LIT.
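For readers who want to compute the label-dependent side of this comparison, the sketch below averages pairwise Pearson correlations between held-out model errors. The definition of a model's error as the signed residual (predicted probability minus label) is an assumption of ours, since the exact error definition is not spelled out here; the function names and toy predictions are also hypothetical.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def avg_error_correlation(model_probs, labels):
    """Average Pearson correlation between the errors of every pair of models."""
    # Assumed error definition: signed residual between probability and label.
    errors = [[p - t for p, t in zip(probs, labels)] for probs in model_probs]
    pairs = list(combinations(errors, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

labels = [1, 0, 1, 0]
probs_a = [0.9, 0.2, 0.6, 0.4]  # hypothetical held-out predictions of model A
probs_b = [0.6, 0.4, 0.9, 0.2]  # hypothetical held-out predictions of model B
rho_same = avg_error_correlation([probs_a, probs_a], labels)
rho_mixed = avg_error_correlation([probs_a, probs_b], labels)
```

Two copies of the same model give a correlation of 1, while models that err on different examples give lower values; unlike the gradient-based measure, this quantity cannot be computed without held-out labels.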

LIT is less sensitive to hyperparameters than baselines, but ensemble size matters more.

In both our toy examples (Figure 2) and our ICU mortality results (Figures 3 and 4), we found that LIT produced qualitatively similar (diverse) results over several orders of magnitude of $\lambda$. NCL, on the other hand, required extremely careful tuning of $\lambda$ to achieve meaningful diversity (before its performance plummeted). We hypothesize that this difference results from the fact that NCL’s diversity term is formulated as a direct tradeoff with individual model accuracy, so the balance must be precise, whereas LIT’s diversity term is often completely independent of individual model accuracy (which is true by construction in the toy examples). However, datasets only have the capacity to support a limited number of (mostly or completely) locally independent models. On the toy datasets, this capacity was exactly 2, but on real data, it is generally unknown, and it may be possible to achieve similar results either with a small fully independent ensemble or a large partially independent ensemble. For example, in Figure 4, we show that we can achieve similar improvements to ICU mortality prediction with 2 highly independent models (larger $\lambda$) or 13 more weakly independent models (smaller $\lambda$). We hypothesize that the trend-line of optimal LIT ensemble size and $\lambda$ may be a useful tool for characterizing the amount of ambiguity present in a dataset.

Interpretation of individual LIT models can yield useful dataset insights.

In Figure 5, we found that in discussions with ICU clinicians, mortality feature associations for normally trained neural networks were somewhat confusing due to hidden collinearities. LIT models made more clinical sense individually, and the differences between them helped reveal those collinearities (in particular between elevated levels of blood urea nitrogen and creatinine). Because LIT ensembles are often optimal when small, and because individual LIT models tend to be more accurate than the individual models of other methods such as AdaBoost, they also enable significantly more data interpretation than other ensemble methods.


LIT does come with restrictions and limitations. In particular, we found that it works well for rectifier activations (e.g. ReLU and softplus) but leads to inconsistent behavior with others (e.g. sigmoid and tanh). This may be related to the linear rather than saturating extrapolation behavior of rectifiers. Because it relies on cosine similarity, LIT is also sensitive to relative changes in feature scaling; however, in practice this issue can be resolved by standardizing variables first.

Additionally, our cosine similarity approximation in LIT makes the assumption that the data manifold is locally similar to $\mathbb{R}^D$ near most inputs. However, we introduce generalizations in Section A.1 to handle situations where this is not approximately true (such as with image data).

Finally, LIT requires computing a second derivative (the derivative of the penalty) during the optimization process, which increases memory usage and training time; in practice, LIT took approximately 1.5x as long as random restarts, while NCL took approximately half the time. However, significant progress is being made on making higher-order autodifferentiation more efficient [2], so we can expect improvements. Also, in cases where LIT achieves high accuracy with a comparatively small ensemble size (e.g. ICU mortality prediction), overall training time can remain short if cross-validation is terminated early.

6 Conclusion and Future Work

In this paper, we presented a novel diversity metric that formalizes the notion of difference in local extrapolations. Based on this metric we defined an ensemble method, local independence training, for building ensembles of highly predictive base models that generalize differently outside the training set. On datasets we knew supported multiple diverse decision boundaries, we demonstrated our method’s superior ability to recover them compared to baselines. On real-world datasets with unknown levels of redundancy, we showed that LIT ensembles perform competitively on traditional prediction tasks and were more robust to data scarcity and covariate shift (as measured by training on inliers and testing on outliers). Finally, through applying LIT to a clinical prediction task in the intensive care unit, we provided evidence that the extrapolation diversity exhibited by LIT ensembles improved data robustness and helped us reach meaningful clinical insights in conversations with clinicians.

There are ample directions for future improvements to the method. For example, it would be useful to consider methods for aggregating predictions of LIT ensembles using a more complex mechanism, such as a mixture-of-experts model. Along similar lines, combining pairwise $\mathrm{IndepErr}$s in a more informed way, such as a determinantal point process penalty [18] over the matrix of model similarities, may help us better quantify the diversity of the ensemble. Another interesting extension of our work would be to prediction tasks in semi-supervised settings, since labels are generally not required for computing local independence error. Finally, as we observe in Section 5, some datasets seem to support a particular number of locally independent models. It is worth exploring how best to formally quantify and characterize the inherent ambiguity present in a prediction task, a related but distinct problem from that of characterizing data complexity [12, 5, 24].


WP acknowledges the Harvard Institute for Applied Computational Science for its support. ASR is supported by NIH 1R56MH115187.


  • [1] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K. Müller (2010) How to explain individual classification decisions. Journal of Machine Learning Research 11 (Jun), pp. 1803–1831. Cited by: §2, §4.3.
  • [2] M. Betancourt (2018) A geometric theory of higher-order automatic differentiation. arXiv preprint arXiv:1812.11592. Cited by: §5.
  • [3] L. Breiman (1996) Bagging predictors. Machine learning 24 (2), pp. 123–140. Cited by: §1, §2, §2, §4.
  • [4] L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §2, §2.
  • [5] J. Cano (2013) Analysis of data complexity measures for classification. Expert Systems with Applications 40 (12), pp. 4820–4831. Cited by: §6.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Cited by: §2.
  • [7] T. G. Dietterich (2000) Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Cited by: §1.
  • [8] Y. Freund and R. E. Schapire (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55 (1), pp. 119–139. Cited by: §2.
  • [9] L. K. Hansen and P. Salamon (1990) Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence 12 (10), pp. 993–1001. Cited by: §1.
  • [10] S. Hara and T. Maehara (2017) Enumerate lasso solutions for feature selection. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 1985–1991. Cited by: §4.1.
  • [11] T. Hastie, S. Rosset, J. Zhu, and H. Zou (2009) Multi-class adaboost. Statistics and its Interface 2 (3), pp. 349–360. Cited by: §4.
  • [12] T. K. Ho and M. Basu (2002) Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis & Machine Intelligence (3), pp. 289–300. Cited by: §6.
  • [13] T. K. Ho (1995) Random decision forests. In Document analysis and recognition, 1995., proceedings of the third international conference on, Vol. 1, pp. 278–282. Cited by: §2.
  • [14] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger (2017) Snapshot ensembles: train 1, get m for free. arXiv preprint arXiv:1704.00109. Cited by: §2.
  • [15] A. E. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-iii, a freely accessible critical care database. Scientific data 3. Cited by: §4.3.
  • [16] J. F. Kolen and J. B. Pollack (1991) Back propagation is sensitive to initial conditions. In Advances in neural information processing systems, pp. 860–867. Cited by: §2.
  • [17] A. Krogh and J. Vedelsby (1995) Neural network ensembles, cross validation, and active learning. In Advances in neural information processing systems, pp. 231–238. Cited by: §2.
  • [18] A. Kulesza, B. Taskar, et al. (2012) Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning 5 (2–3), pp. 123–286. Cited by: §6.
  • [19] L. I. Kuncheva and C. J. Whitaker (2003) Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine learning 51 (2), pp. 181–207. Cited by: §1.
  • [20] P. Liang (2018) How should we evaluate machine learning for ai?. Note: Thirty-Second AAAI Conference on Artificial Intelligence, https://www.youtube.com/watch?v=7CcSm0PAr-Y. Cited by: §1.
  • [21] M. Lichman (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §4.2.
  • [22] Y. Liu and X. Yao (1999) Ensemble learning via negative correlation. Neural networks 12 (10), pp. 1399–1404. Cited by: §2.
  • [23] Y. Liu and X. Yao (1999) Simultaneous training of negatively correlated neural networks in an ensemble. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 29 (6), pp. 716–725. Cited by: §1.
  • [24] A. C. Lorena, L. P. Garcia, J. Lehmann, M. C. Souto, and T. K. Ho (2018) How complex is your classification problem? a survey on measuring classification complexity. arXiv preprint arXiv:1808.03591. Cited by: §6.
  • [25] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §2, §3.1.
  • [26] T. Pang, K. Xu, C. Du, N. Chen, and J. Zhu (2019) Improving adversarial robustness via promoting ensemble diversity. arXiv preprint arXiv:1901.08846. Cited by: §2.
  • [27] G. Parascandolo, N. Kilbertus, M. Rojas-Carulla, and B. Schölkopf (2017) Learning independent causal mechanisms. arXiv preprint arXiv:1712.00961. Cited by: §2.
  • [28] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) Why should i trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §2.
  • [29] A. S. Ross, M. C. Hughes, and F. Doshi-Velez (2017) Right for the right reasons: training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717. Cited by: §2.
  • [30] R. E. Schapire (1990) The strength of weak learnability. Machine learning 5 (2), pp. 197–227. Cited by: §2, §2.
  • [31] R. Shoham and H. Permuter (2019) Amended cross-entropy cost: an approach for encouraging diversity in classification ensemble (brief announcement). In International Symposium on Cyber Security Cryptography and Machine Learning, pp. 202–207. Cited by: §2, §4.
  • [32] M. Sugiyama, N. D. Lawrence, A. Schwaighofer, et al. (2017) Dataset shift in machine learning. Cited by: §1.
  • [33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §2.
  • [34] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §2.
  • [35] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204. Cited by: §2.
  • [36] B. Zadrozny (2004) Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning, pp. 114. Cited by: §1.
  • [37] T. Zhou, S. Wang, and J. A. Bilmes (2018) Diverse ensemble evolution: curriculum data-model marriage. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 5909–5920. Cited by: §2.

Appendix A

A.1 Imposing Penalties over Manifolds

At the start of our derivation of Equation 4, we assumed that, locally, inputs can be perturbed in every direction of the ambient input space. However, in many cases, our data manifold is much lower dimensional than that ambient space. In these cases, we have additional degrees of freedom to learn decision boundaries that, while locally orthogonal, are functionally equivalent over the dimensions that matter. To restrict spurious similarity, we can project our gradients down to the data manifold. Given a local basis for its tangent space, we can accomplish this by taking dot products between the input gradients of the two models being compared, $\nabla f(x)$ and $\nabla g(x)$, and each tangent vector, and then using these two vectors of dot products to compute the cosine similarity in Equation 6. More formally, if $J(x)$ is the Jacobian matrix of manifold tangents at $x$, we can replace our regular cosine penalty with

$$\cos^2\!\left(\nabla f(x)^\top J(x),\; \nabla g(x)^\top J(x)\right).$$

Figure 6 shows this method applied to a toy dataset. Alternatively, if we are using projected gradient descent adversarial training to minimize the original formulation in Equation 2, we can modify its inner optimization procedure to also project updates back to the manifold.
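As a rough illustration of the projection step, the sketch below computes a squared cosine similarity between two input gradients after taking their dot products with a set of tangent vectors. The function name `manifold_cos2`, the column-per-tangent layout of `tangents`, and the small `eps` stabilizer are assumptions made for this example, not details taken from the paper's code.

```python
import numpy as np

def manifold_cos2(grad_f, grad_g, tangents, eps=1e-12):
    """Squared cosine similarity of two gradients projected onto tangent vectors.

    grad_f, grad_g: shape (D,), input gradients of two models at the same point.
    tangents: shape (D, d), one manifold tangent vector per column (assumed layout).
    """
    pf = tangents.T @ grad_f  # dot products of grad_f with each tangent vector
    pg = tangents.T @ grad_g  # dot products of grad_g with each tangent vector
    return float((pf @ pg) ** 2 / ((pf @ pf) * (pg @ pg) + eps))

# Toy case: a flat 2D manifold (the x-y plane) embedded in 3D.
tangents = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 0.0]])
grad_f = np.array([1.0, 0.0, 1.0])
grad_g = np.array([1.0, 0.0, -1.0])

# Plain squared cosine in the ambient 3D space.
ambient = float((grad_f @ grad_g) ** 2 / ((grad_f @ grad_f) * (grad_g @ grad_g)))
# Squared cosine after projecting onto the manifold's tangent vectors.
on_manifold = manifold_cos2(grad_f, grad_g, tangents)
```

In this toy case the two gradients are orthogonal in the ambient space (`ambient` is 0) yet point the same way along the manifold (`on_manifold` is close to 1), which is the kind of spurious orthogonality the projected penalty is meant to expose.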

Figure 6: Toy 2D manifold dataset (randomly sampled from a neural network) embedded in $\mathbb{R}^3$, with decision boundaries shown in 2D chart space (top) and the 3D embedded manifold space (bottom). Naively imposing LIT penalties in $\mathbb{R}^3$ (middle) leads to only slight differences in the chart space decision boundary, but given knowledge of the manifold’s tangent vectors (right), we can recover maximally different chart space boundaries.

For many problems of interest, we do not have a closed form expression for the data manifold or its tangent vectors. In this case, however, we can approximate one, e.g. by performing PCA or training an autoencoder.
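One simple way to approximate tangent vectors with PCA is to run it on the nearest neighbors of each point. The sketch below takes that route; the neighborhood-based scheme, the function name `local_pca_tangents`, and the parameters `k` and `d` are assumptions for illustration and are not claimed to match the procedure used in the experiments.

```python
import numpy as np

def local_pca_tangents(X, x, k=20, d=2):
    """Estimate a d-dimensional tangent basis at x from its k nearest neighbors.

    X: shape (N, D), data sampled near the manifold.
    x: shape (D,), query point.
    Returns an array of shape (D, d) with orthonormal columns.
    """
    dists = np.linalg.norm(X - x, axis=1)
    neighbors = X[np.argsort(dists)[:k]]
    centered = neighbors - neighbors.mean(axis=0)
    # Rows of vt are the principal directions of the local neighborhood,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:d].T

# Toy data: a gently curved 2D sheet embedded in 3D.
rng = np.random.default_rng(0)
uv = rng.uniform(-1.0, 1.0, size=(2000, 2))
X = np.column_stack([uv[:, 0], uv[:, 1], 0.1 * uv[:, 0] ** 2])

x0 = np.zeros(3)                      # the sheet is tangent to the x-y plane here
T = local_pca_tangents(X, x0, k=30, d=2)
normal = np.array([0.0, 0.0, 1.0])    # true normal direction at x0
leakage = np.abs(T.T @ normal).max()  # near zero if the estimated basis is good
```

The choice of `k` trades off noise against curvature: too few neighbors give an unstable basis, while too many start to include points where the manifold has bent away from the tangent plane.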

A.2 Deeper Dive into ICU Mortality Gradients

To better understand qualitative differences between ICU mortality prediction models, we examine differences between each model’s marginal distributions of input gradients over the dataset in Figures 7 and 8.

Figure 7: Violin plots showing marginal distributions of ICU mortality input gradients across heldout data for 5-model ensembles trained on the slice (top 5 plots) and restarts on the full dataset (bottom). Distributions for each model in each ensemble are overlaid with transparency in the top 4 plots. From the top, we see that restarts and NCL learn models with similar gradient distributions. Bagging is slightly more varied, but only LIT (which performs significantly better on the prediction task) exhibits significant differences between models. When LIT gradients on this limited data task are averaged (second from bottom), their distribution comes to resemble (in both shape and scale) that of a model trained on the full dataset (bottom), which may explain LIT’s stronger performance.
Figure 8: Additional 2-model ensemble gradient comparisons (showing a sudden transition for NCL).
Figure 9: Full ensemble AUC results by method and ensemble size. LIT usually beats baselines when train ≠ test.
Figure 10: Empirical relationship between and across all experiments. The two metrics are meaningfully correlated.
Figure 11: Differences between predicted and true accuracy (averaged across models) using the 3-model method of estimating accuracy from unlabeled data from Equation 7 of Platanios, Blum, and Mitchell (2014), as a function of . For most methods, this method overestimates ensemble accuracy, but much less severely for LIT.
Figure 12: 8D synthetic group feature selection dataset whose inputs are restricted so that class labels can be redundantly (but exclusively) determined by four disjoint pairs of dimensions (highlighted; each 2D plot shows a projection of class labels from $\mathbb{R}^8$ to a particular plane). We argue that “diverse” ensemble methods applied to this dataset should produce models that utilize different pairs of features.
Figure 13: Results for LIT vs. other methods on the 8D synthetic dataset. Restarts and NCL consistently use dense combinations of all eight features, while networks obtained by LIT monopolize different pairs.