Ensembles of Locally Independent Prediction Models (AAAI 2020)
Many ensemble methods encourage their constituent models to be diverse, because ensembling provides no benefits when models are identical. Most methods define diversity in terms of differences in training set predictions. In this paper, however, we demonstrate that diversity in training set predictions does not necessarily imply diversity when extrapolating even slightly outside it (which can affect generalization). To address this issue, we introduce a new diversity metric and associated method of training ensembles of models that extrapolate differently on local patches of the data manifold. Across a variety of synthetic and real-world tasks, we find that our method improves generalization and diversity in qualitatively novel ways, especially under data limits and covariate shift.
An ensemble is generally more accurate than its constituent models. However, for this to hold true, those models must make different errors on unseen data [9, 7]. This is often described as the ensemble’s “diversity.”
Despite diversity’s well-recognized importance, there is no firm consensus on how best to foster it. Some procedures encourage it implicitly, e.g. by training models with different inputs [3], while others explicitly optimize for proxies [23] that tend to be functions of differences in training set predictions [19].
However, there has been increasing criticism of supervised machine learning for focusing too exclusively on cases where training and testing data are drawn from the same distribution [20]. In many real-world settings, this assumption does not hold, e.g. due to natural covariate shift over time [32] or selection bias in data collection [36]. Intuitively, we might hope that a “diverse” ensemble would adapt more easily to such problems, since ideally different members would be robust to different shifts. In this paper, however, we show that incentivizing members of ensembles to make different predictions on the training data often still results in models that make identical errors when extrapolating, which can hurt generalization in practical settings.

In this paper, we address the problem of extrapolation diversity. Specifically, our contributions are (1) a novel and differentiable diversity measure, defined as a formal proxy for the ability of classifiers to extrapolate differently away from the data, and (2) a method for training an ensemble of classifiers to be diverse by this measure. We apply our method to a range of synthetic and real-world datasets under both normal conditions and covariate shift, and illustrate improvements and differences between our method and baselines.
Ensembling is a well-established subfield of supervised learning [3, 4, 13, 30], and one of its important lessons is that model diversity is a necessary condition for creating predictive and robust ensembles [17]. There are a number of methods for fostering diversity, which can be roughly divided into two categories: those that implicitly promote diversity by random modifications to training conditions, and those that explicitly promote it by deliberate modifications to the objective function.

Implicit diversity methods sometimes operate by introducing stochasticity into which models see which parts of the data, e.g. by randomly resampling training examples [3] or subsets of input features [4]. Other implicit methods exploit model parameter stochasticity, e.g. by retraining from different initializations [16, 6, 33] or sampling from parameter snapshots saved during individual training cycles [14].
Methods that explicitly encourage diversity include boosting [30, 8], which sequentially modifies the objective function of each model to specialize on previous models’ mistakes, as well as methods like negative correlation learning [22], amended cross-entropy [31], and DPPs over non-maximal predictions [26], which simultaneously train models with penalties on both individual errors and pairwise similarities. Finally, methods such as Diverse Ensemble Evolution [37] and Competition of Experts [27] use explicit techniques to encourage models to specialize in different regions of input space.
Although at first glance these diverse training techniques seem quite diverse themselves, they are all similar in a crucial respect: they encourage diversity in terms of training set predictions. In the machine learning fairness, adversarial robustness, and explainability communities, however, there has been increasing movement away from the assumption that training and test distributions are similar. For example, many methods for locally explaining ML predictions literally present simplified approximations of how models extrapolate away from given points [1, 29, 28], while adversarial attacks (and defenses) exploit (and mitigate) pathological extrapolation behavior [34, 25], sometimes in an ensemble setting [35]. Although our focus is not explicitly on explainability or adversarial robustness, our method can be seen as a reapplication of techniques from those subfields to the problem of ensemble diversity.
In this section, building on ross2018learning, we formally define our diversity measure and training procedure, beginning with notation. We use $x \in \mathcal{X}$ to denote $D$-dimensional inputs, which are supported over an input space $\mathcal{X} \subseteq \mathbb{R}^D$. We use $y$ to denote prediction targets in an output space $\mathcal{Y}$. In this paper, $\mathcal{Y}$ will be $\mathbb{R}$, and we focus on the case where it represents a log-odds used for binary classification, but our method can be generalized to classification or regression in $\mathbb{R}^K$ given any notion of distance between outputs. We seek to learn prediction models $f(\cdot\,;\theta): \mathcal{X} \to \mathcal{Y}$ (parameterized by $\theta$) that estimate $y$ from $x$. We assume these models are differentiable with respect to $x$ and $\theta$ (which is true for linear models and neural networks).

In addition, we suppose a joint distribution $p(x, y)$ over inputs and targets, and a distribution $p(y \mid f(x;\theta))$ quantifying the likelihood of the observed target given the model prediction. Typically, during training, we seek model parameters $\theta$ that maximize the likelihood of the observed data, $\prod_i p(y_i \mid f(x_i;\theta))$.

We now introduce a model diversity measure that quantifies how differently two models generalize over small patches of the data manifold $\mathcal{X}$. Formally, we define an $\epsilon$-neighborhood of $x$, denoted $N_\epsilon(x)$, on the data manifold to be the intersection of an $\epsilon$-ball centered at $x$ in the input space, $B_\epsilon(x)$, and the data manifold: $N_\epsilon(x) = B_\epsilon(x) \cap \mathcal{X}$. We capture the notion of generalization difference on a small neighborhood of $x$ through an intuitive geometric condition: we say that two functions $f$ and $g$ generalize maximally differently at $x$ if $g$ is invariant in the direction of the greatest change in $f$ (or vice versa) within an $\epsilon$-neighborhood around $x$. That is:
$$g(x) = g\big(x^*_{f,\epsilon}(x)\big) \tag{1}$$

where we define $x^*_{f,\epsilon}(x) = \arg\max_{x' \in N_\epsilon(x)} f(x')$. In other words, perturbing $x$ by small amounts to increase $f$ inside $N_\epsilon(x)$ does not change the value of $g$. In the case that a choice of $\epsilon$ exists to satisfy Equation 1, we say that $g$ is locally independent of $f$ at $x$. We call $f$ and $g$ locally independent without qualification if for every $x \in \mathcal{X}$ the functions $f$ and $g$ are locally independent at $x$ for some choice of $\epsilon$. We note that in order for the right-hand side expression of Equation 1 to be well-defined, we assume that the gradient of $f$ is not zero at $x$ and that $\epsilon$ is chosen to be small enough that $f$ is convex or concave over $N_\epsilon(x)$.
In the case that $f$ and $g$ are classifiers, local independence intuitively implies a kind of dissimilarity between their decision boundaries. For example, if $f$ and $g$ are linear and the data manifold is Euclidean, then $f$ and $g$ are locally independent if and only if their decision boundaries are orthogonal.
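As a minimal worked illustration of this condition (our own example, with hypothetical weight vectors): consider two linear log-odds models on $\mathcal{X} = \mathbb{R}^2$,

$$f(x) = w_f^\top x, \qquad g(x) = w_g^\top x, \qquad w_f = (1, 0), \;\; w_g = (0, 1).$$

Within any $\epsilon$-ball, $f$ increases fastest in the direction $w_f$, and moving $x$ along $w_f$ leaves $g$ unchanged because $w_g^\top w_f = 0$. Equation 1 therefore holds at every $x$, and the two orthogonal decision boundaries $\{x_1 = 0\}$ and $\{x_2 = 0\}$ are locally independent.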
This definition motivates the formulation of a diversity measure, $\mathrm{IndepErr}_\epsilon(f, g)$, quantifying how far $f$ and $g$ are from being locally independent:

$$\mathrm{IndepErr}_\epsilon(f, g) = \mathbb{E}_{x \sim p(x)}\Big[\big(g(x^*_{f,\epsilon}(x)) - g(x)\big)^2\Big] \tag{2}$$
We can use this penalty within an ensemble-wide loss function for a set of $M$ models $\{f(\cdot\,;\theta_m)\}_{m=1}^M$ as follows:

$$\mathcal{L}(\theta_1, \ldots, \theta_M) = -\sum_{m=1}^{M} \sum_{i} \log p\big(y_i \mid f(x_i; \theta_m)\big) \;+\; \lambda \sum_{m < m'} \mathrm{IndepErr}_\epsilon\big(f(\cdot\,;\theta_m),\, f(\cdot\,;\theta_{m'})\big) \tag{3}$$
The first term encourages each model to be predictive and the second encourages diversity in terms of the diversity measure $\mathrm{IndepErr}_\epsilon$ (with a strength hyperparameter $\lambda$). Computing $\mathrm{IndepErr}_\epsilon$ exactly, however, is challenging, because it requires an inner optimization over $N_\epsilon(x)$. Although it can be closely approximated for fixed small $\epsilon$ with projected gradient descent as in adversarial training [25], that procedure is computationally intensive. If we let $\epsilon \to 0$, however, we can approximate $\mathrm{IndepErr}_\epsilon$ by a fairly simple expression that only needs to compute gradients once per $x$. In particular, we observe that under certain smoothness assumptions on $f$, with unconstrained $x'$ (i.e. taking $N_\epsilon(x) = B_\epsilon(x)$ in a local neighborhood around $x$; this simplifying assumption is significant, though not always inappropriate, and we discuss both limitations and generalizations in Section A.1), and as $\epsilon \to 0$, we can make the approximation

$$x^*_{f,\epsilon}(x) \approx x + \epsilon \frac{\nabla_x f(x)}{\lVert \nabla_x f(x) \rVert}. \tag{4}$$
Assuming similar smoothness conditions on $g$ (so we can replace it by its first-order Taylor expansion), we see that

$$\big(g(x^*_{f,\epsilon}(x)) - g(x)\big)^2 \approx \epsilon^2 \left( \nabla_x g(x) \cdot \frac{\nabla_x f(x)}{\lVert \nabla_x f(x) \rVert} \right)^2. \tag{5}$$
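To make the step from Equation (4) to Equation (5) explicit (our own restatement of the argument rather than text from the paper): a first-order Taylor expansion of $g$ around $x$ gives

$$g\Big(x + \epsilon \frac{\nabla_x f(x)}{\lVert \nabla_x f(x) \rVert}\Big) - g(x) = \epsilon\, \nabla_x g(x) \cdot \frac{\nabla_x f(x)}{\lVert \nabla_x f(x) \rVert} + O(\epsilon^2),$$

so squaring this difference and averaging over $x \sim p(x)$ yields Equation (5); to leading order in $\epsilon$, $\mathrm{IndepErr}_\epsilon(f, g)$ is therefore governed by the expected squared normalized dot product of the two models' input gradients.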
In other words, the independence error between $f$ and $g$ is approximately governed by the dot product of their input gradients, $\nabla_x f(x) \cdot \nabla_x g(x)$. Empirically, we find it helpful to normalize this dot product and work in terms of the cosine similarity $\cos(\nabla_x f(x), \nabla_x g(x))$. We also add a small constant value to the denominator to prevent numerical underflow.

Alternate statistical formulation: As another way of obtaining this cosine similarity approximation, suppose we sample small perturbations $\delta \sim \mathcal{N}(0, \sigma^2 I)$ and evaluate $f(x+\delta) - f(x)$ and $g(x+\delta) - g(x)$. As $\sigma \to 0$, these differences approach $\nabla_x f(x) \cdot \delta$ and $\nabla_x g(x) \cdot \delta$, which are 1D Gaussian random variables whose correlation $\rho$ is given by $\cos(\nabla_x f(x), \nabla_x g(x))$ and whose mutual information is $-\frac{1}{2}\log(1 - \rho^2)$ per gretton2003kernel. Therefore, making the input gradients of $f$ and $g$ orthogonal is equivalent to enforcing statistical independence between their outputs when we perturb $x$ with samples from $\mathcal{N}(0, \sigma^2 I)$ as $\sigma \to 0$. This could be used as an alternate definition of “local independence.”

Motivated by this approximation, we substitute
$$\mathrm{CosIndepErr}(f, g) = \mathbb{E}_{x \sim p(x)}\Big[\cos^2\big(\nabla_x f(x),\, \nabla_x g(x)\big)\Big] \tag{6}$$
into our ensemble loss from Equation (3), which gives us a final “local independence training” objective of
$$\mathcal{L}_{\mathrm{LIT}}(\theta_1, \ldots, \theta_M) = -\sum_{m=1}^{M} \sum_{i} \log p\big(y_i \mid f(x_i; \theta_m)\big) \;+\; \lambda \sum_{m < m'} \mathrm{CosIndepErr}\big(f(\cdot\,;\theta_m),\, f(\cdot\,;\theta_{m'})\big) \tag{7}$$
Note that we will sometimes refer to $\mathrm{CosIndepErr}(f, g)$ simply as the pairwise gradient cosine similarity between $f$ and $g$. Later (see also Figure 10), we show that this quantity is meaningfully correlated with other diversity measures and therefore may be useful in its own right, independently of its use within a loss function.
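To make the training objective concrete, below is a minimal sketch of how Equations (6) and (7) could be implemented in TensorFlow 2 for an ensemble of scalar log-odds models. The function names (`cos_indep_err`, `lit_loss`) and all implementation details are our own illustration under these assumptions, not the authors' released code.

```python
import tensorflow as tf

def cos_indep_err(model_a, model_b, x, eps=1e-8):
    """Batch-averaged squared cosine similarity between the two models'
    input gradients (an empirical estimate of Eq. 6)."""
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(x)
        fa = model_a(x)  # (batch, 1) log-odds
        fb = model_b(x)
    grad_a = tape.gradient(fa, x)  # (batch, D) input gradients
    grad_b = tape.gradient(fb, x)
    dot = tf.reduce_sum(grad_a * grad_b, axis=1)
    norms = tf.norm(grad_a, axis=1) * tf.norm(grad_b, axis=1) + eps  # eps avoids underflow
    return tf.reduce_mean((dot / norms) ** 2)

def lit_loss(models, x, y, lam):
    """Local independence training objective (Eq. 7): summed per-model
    cross-entropy plus lambda times all pairwise CosIndepErr penalties.
    y: (batch, 1) array of 0/1 labels."""
    bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    nll = tf.add_n([bce(y, m(x)) for m in models])
    pairs = [(i, j) for i in range(len(models)) for j in range(i + 1, len(models))]
    penalty = tf.add_n([cos_indep_err(models[i], models[j], x) for i, j in pairs]) if pairs else 0.0
    return nll + lam * penalty
```

Note that optimizing this loss requires differentiating through `tape.gradient`, i.e. a second derivative, which TensorFlow supports via nested gradient tapes.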
For the experiments that follow, we use 256- or 256x256-unit fully connected neural networks with rectifier (ReLU or softplus) activations, trained in TensorFlow with Adam.
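For reference, a base network of the kind described (here, a single 256-unit softplus hidden layer producing a scalar log-odds) and a training step on the `lit_loss` sketch above might look as follows; again this is our own illustrative reconstruction, not the released code.

```python
import tensorflow as tf

def make_base_model(input_dim, hidden_units=(256,)):
    """Fully connected log-odds model with rectifier activations."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(input_dim,)))
    for units in hidden_units:
        model.add(tf.keras.layers.Dense(units, activation="softplus"))
    model.add(tf.keras.layers.Dense(1))  # scalar log-odds (no sigmoid)
    return model

def train_step(models, x, y, lam, optimizer):
    """One Adam step on the joint LIT objective over all ensemble members."""
    variables = [v for m in models for v in m.trainable_variables]
    with tf.GradientTape() as tape:
        loss = lit_loss(models, x, y, lam)  # defined in the earlier sketch
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss

models = [make_base_model(input_dim=10) for _ in range(2)]  # e.g. a 2-member ensemble
optimizer = tf.keras.optimizers.Adam()
```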
We test local independence training (“LIT”) against random restarts (“RRs”), bagging [3] (“Bag”), AdaBoost [11] (“Ada”), 0–1 squared loss negative correlation learning (liu1999simultaneous; “NCL”), and amended cross-entropy [31] (“ACE”). We omit NIPS2018_7831 and parascandolo2017learning, which require more complex inner submodular or adversarial optimization steps, but note that because they also operationalize diversity as making different errors on training points, we expect the results to be qualitatively similar to ACE and NCL.
For our non-toy results, we test all methods over a range of ensemble sizes, and methods with regularization parameters (LIT, ACE, and NCL) with 16 logarithmically spaced values of $\lambda$, using validation AUC to select the best performing model (except when examining how results vary with $\lambda$ or ensemble size). For each hyperparameter setting and method, we run 10 full random restarts (though within each restart, different methods are tested against the same split), and present mean results with standard deviation error bars.
To provide intuition and an initial demonstration of our method, we first present several sets of 2D toy examples in Figure 1. These 2D examples are constructed to have gaps in the data distribution, but also to have the property that, locally (except at the boundary), the data manifold still behaves like $\mathbb{R}^2$ (that is, $N_\epsilon(x)$ approaches $B_\epsilon(x)$ as $\epsilon \to 0$ for all $x$ in the support). More informally, we construct these examples to have ambiguity in how a single accurate classifier should behave, but a single intuitive “right answer” for how two accurate but maximally different classifiers should behave.
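As a purely illustrative example of this kind of construction (our own hypothetical dataset, not one of the paper's four): place a positive cluster near $(2, 2)$ and a negative cluster near $(-2, -2)$, leaving a gap around the origin. Both the vertical boundary $x_1 = 0$ and the horizontal boundary $x_2 = 0$ then classify the training data perfectly, so a single accurate classifier is ambiguous, but two accurate and maximally different classifiers have an intuitive "right answer" (one boundary each).

```python
import numpy as np

def make_gap_dataset(n_per_class=100, seed=0):
    """Hypothetical 2D toy task with a gap: a positive cluster near (2, 2) and a
    negative cluster near (-2, -2), so that x1 = 0 and x2 = 0 are both perfect
    (and mutually orthogonal) decision boundaries."""
    rng = np.random.default_rng(seed)
    x_pos = np.array([2.0, 2.0]) + 0.3 * rng.standard_normal((n_per_class, 2))
    x_neg = np.array([-2.0, -2.0]) + 0.3 * rng.standard_normal((n_per_class, 2))
    x = np.vstack([x_pos, x_neg])
    y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return x, y
```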
[Table 1: AUC and heldout error correlation for RRs, Bag, Ada, ACE, NCL, and LIT on the Mushroom, Ionosphere, Sonar, and SPECTF datasets, under random splits (top) and extrapolation splits (bottom).]
Table 1: UCI dataset results in both the normal prediction task (top) and the extrapolation task (bottom) over 10 reruns, with error bars based on standard deviations and bolding based on standard errors. On random splits, LIT offers modest AUC improvements over random restarts, on par with other ensemble methods. On extrapolation splits, however, LIT tends to produce more significant improvements in AUC. In both cases, LIT almost always exhibits the lowest pairwise Pearson correlation between heldout model errors, which for the other methods roughly matches their pairwise gradient cosine similarity.

In Figure 2, we present neural network decision boundaries learned by random restarts, local independence training, and negative correlation learning (NCL) on these examples. Random restarts and NCL at low $\lambda$ learn identical boundaries. As we increase $\lambda$ for NCL, its boundaries stay essentially identical until a very narrow transition region (leading up to a critical value of $\lambda$ at which it becomes favorable for one model to always predict 0 and the other to always predict 1). In this transition region, we can obtain models with differences between decision boundaries, but not between their angles, and only at the cost of a symmetrical reduction in accuracy. The fact and symmetry of the accuracy reduction makes sense given that NCL can only increase its “diversity” by making different (and therefore not 100% accurate) training predictions.
LIT, on the other hand, outputs meaningfully different boundaries even at values of $\lambda$ that are very low compared to its prediction loss term. This is in large part because on most of these tasks (except Dataset 3), there is very little tradeoff to learning a near-orthogonal boundary. At larger $\lambda$, LIT outputs decision boundaries that are completely orthogonal (at the cost of a slight accuracy reduction on Dataset 3). The main takeaway from this set of toy examples is that optimizing for diverse training predictions, even across a very wide sweep of penalty values, may never succeed in encouraging diverse extrapolation.
Relationship between LIT and feature selection: Note that a trivial way of minimizing CosIndepErr is to train an ensemble of models sensitive to disjoint sets of features. This roughly occurs in the ensemble on Dataset 1. We ran additional synthetic experiments (Figures 12 and 13) to verify that LIT can be used to perform group feature selection in higher dimensions, which is usually approached with search-based enumeration methods [10] rather than a single pass of gradient descent. However, LIT also produces more general kinds of diversity, learning diverse linear boundaries on Datasets 2 and 3 as well as diverse nonlinear boundaries on Dataset 4.

Next, we test our method on several standard binary classification datasets from the UCI repository [21]. These are ionosphere, sonar, spectf, and mushroom
(with categorical features one-hot encoded, and all features z-scored). For all datasets without canonical splits, we first randomly select 80% of the dataset for training and 20% for test, then take an additional 20% split of the training set to use for validation. In addition to these random splits, we also evaluate models on an extrapolation task, where instead of splitting datasets randomly, we train on the 50% of points closest to the origin (i.e. where $\lVert x \rVert$ is less than its median value over the dataset) and validate/test on the remaining points, which are furthest from the origin (see the brief sketch below). This test is meant to evaluate robustness to covariate shift. Quantitative performance and diversity results are presented in Table 1, and additional metrics are shown in Figures 9, 10, and 11.

As a final set of experiments, we run a more in-depth case study on a real-world clinical application. In particular, we predict in-hospital mortality for a cohort of patient visits extracted from the MIMIC-III database [15] based on labs, vital signs, and basic demographics. We follow the same cohort selection and feature selection process as ghassemi2017predicting. In addition to this full cohort, we also test on a limited data task where we restrict the size of the training set, to measure robustness under data scarcity.
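For concreteness, an extrapolation split of the kind described above (training on the half of the standardized data nearest the origin) could be constructed as follows; this is our own sketch, and the function name is hypothetical.

```python
import numpy as np

def extrapolation_split(x, y):
    """Split a z-scored dataset by distance from the origin: the closest 50% of
    points form the training set, and the rest are held out for validation/test."""
    dist = np.linalg.norm(x, axis=1)
    train_mask = dist < np.median(dist)
    return (x[train_mask], y[train_mask]), (x[~train_mask], y[~train_mask])
```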
We visualize the results of these experiments in many different ways to help tease out the effects of $\lambda$, ensemble size, and dataset size on individual and ensemble predictive performance, diversity, and model explanations. Table 2 shows overall performance and diversity metrics for these two tasks after cross-validation, along with the most common values of $\lambda$ and ensemble size selected for each method. Drilling into the results, Figure 3 visualizes how multiple metrics for performance (AUC and accuracy) and diversity (error correlation and gradient cosine similarity) change with $\lambda$, while Figure 4 visualizes the relationship between the optimal $\lambda$ and ensemble size.
Figure 5 (and Figures 7 and 8) visualize changes in the marginal distributions of input gradients for each model in their explanatory sense [1]. As a qualitative evaluation, we discussed these explanation differences with two intensive care unit clinicians and found that LIT revealed meaningful redundancies in which combinations of features encoded different underlying conditions.
[Table 2: AUC, heldout error correlation, gradient cosine similarity, selected ensemble size, and selected $\lambda$ for RRs, Bag, Ada, ACE, NCL, and LIT on the ICU mortality task, using the full dataset (top) and a limited training slice (bottom).]
On the UCI datasets under normal conditions (random splits), LIT always offers at least modest improvements over random restarts, and often outperforms other baselines. Under extrapolation splits, LIT tends to do significantly better. This pattern repeats itself on the normal vs. data-limited versions of the ICU mortality prediction task. We hypothesize that on small or selectively restricted datasets, there is typically more predictive ambiguity, which hurts the generalization of normally trained ensembles (whose members consistently make similar guesses on unseen data). LIT is more robust to these issues.
In our UCI results in Table 1, we saw that for non-LIT methods, error correlation and gradient cosine similarity (the latter of which does not require labels to compute) were often close in value. We plot this relationship over all datasets, methods, restarts, and hyperparameters in Figure 10 and find that the correspondence continues to hold. One potential explanation for this correspondence is that, by our analysis at the end of Section 3.1, $\mathrm{CosIndepErr}$ can literally be interpreted as an average squared correlation (between changes in model predictions over infinitesimal Gaussian perturbations away from each input). We hypothesize that $\mathrm{CosIndepErr}$ may be a useful quantity independently of LIT.
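As a diagnostic, both quantities can be computed directly from a pair of trained models. The sketch below is our own illustration; it uses squared residuals of predicted probabilities as the per-example heldout error, which may differ from the paper's exact definition.

```python
import numpy as np
import tensorflow as tf

def diversity_diagnostics(model_a, model_b, x, y):
    """Pairwise heldout-error correlation and mean squared gradient cosine
    similarity for two log-odds models (our own diagnostic sketch)."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    # Pearson correlation between per-example squared errors (requires labels y).
    p_a = tf.sigmoid(model_a(x))[:, 0].numpy()
    p_b = tf.sigmoid(model_b(x))[:, 0].numpy()
    err_a, err_b = (p_a - y) ** 2, (p_b - y) ** 2
    rho_err = np.corrcoef(err_a, err_b)[0, 1]
    # Mean squared cosine similarity between input gradients (labels not needed).
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(x)
        fa, fb = model_a(x), model_b(x)
    ga, gb = tape.gradient(fa, x), tape.gradient(fb, x)
    cos = tf.reduce_sum(ga * gb, axis=1) / (tf.norm(ga, axis=1) * tf.norm(gb, axis=1) + 1e-8)
    return rho_err, float(tf.reduce_mean(cos ** 2))
```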
In both our toy examples (Figure 2) and our ICU mortality results (Figures 3 and 4), we found that LIT produced qualitatively similar (diverse) results over several orders of magnitude of $\lambda$. NCL, on the other hand, required extremely careful tuning of $\lambda$ to achieve meaningful diversity (before its performance plummeted). We hypothesize that this difference results from the fact that NCL's diversity term is formulated as a direct tradeoff with individual model accuracy, so the balance must be precise, whereas LIT's diversity term is often completely independent of individual model accuracy (which is true by construction in the toy examples). However, datasets only have the capacity to support a limited number of (mostly or completely) locally independent models. On the toy datasets, this capacity was exactly 2, but on real data, it is generally unknown, and it may be possible to achieve similar results either with a small fully independent ensemble or a large partially independent ensemble. For example, in Figure 4, we show that we can achieve similar improvements to ICU mortality prediction with 2 highly independent models or 13 more weakly independent models. We hypothesize that the trend-line of optimal LIT ensemble size and $\lambda$ may be a useful tool for characterizing the amount of ambiguity present in a dataset.
In Figure 5, we found that in discussions with ICU clinicians, mortality feature associations for normally trained neural networks were somewhat confusing due to hidden collinearities. LIT models made more clinical sense individually, and the differences between them helped reveal those collinearities (in particular between elevated levels of blood urea nitrogen and creatinine). Because LIT ensembles are often optimal when small, and because individual LIT models tend to be more accurate than the individual models of other methods such as AdaBoost, they also permit significantly deeper interpretation of the data than other ensemble methods.
LIT does come with restrictions and limitations. In particular, we found that it works well with rectifier activations (e.g. ReLU and softplus) but leads to inconsistent behavior with others (e.g. sigmoid and tanh). This may be related to the linear rather than saturating extrapolation behavior of rectifiers. Because it relies on cosine similarity, LIT is also sensitive to relative changes in feature scaling; however, in practice this issue can be resolved by standardizing variables first.
Additionally, our cosine similarity approximation in LIT assumes that the data manifold is locally similar to $\mathbb{R}^D$ near most inputs (i.e. that $N_\epsilon(x) \approx B_\epsilon(x)$). However, we introduce generalizations in Section A.1 to handle situations where this is not approximately true (such as with image data).
Finally, LIT requires computing a second derivative (the derivative of the penalty) during the optimization process, which increases memory usage and training time; in practice, LIT took approximately 1.5x as long as random restarts, while NCL took approximately half the time. However, significant progress is being made on making higher-order autodifferentiation more efficient [2], so we can expect improvements. Also, in cases where LIT achieves high accuracy with a comparatively small ensemble size (e.g. ICU mortality prediction), overall training time can remain short if cross-validation is terminated early.
In this paper, we presented a novel diversity metric that formalizes the notion of difference in local extrapolations. Based on this metric we defined an ensemble method, local independence training, for building ensembles of highly predictive base models that generalize differently outside the training set. On datasets we knew supported multiple diverse decision boundaries, we demonstrated our method’s superior ability to recover them compared to baselines. On real-world datasets with unknown levels of redundancy, we showed that LIT ensembles perform competitively on traditional prediction tasks and were more robust to data scarcity and covariate shift (as measured by training on inliers and testing on outliers). Finally, through applying LIT to a clinical prediction task in the intensive care unit, we provided evidence that the extrapolation diversity exhibited by LIT ensembles improved data robustness and helped us reach meaningful clinical insights in conversations with clinicians.
There are ample directions for future improvements to the method. For example, it would be useful to consider methods for aggregating the predictions of LIT ensembles with a more complex mechanism, such as a mixture-of-experts model. Along similar lines, combining pairwise penalties in a more informed way, such as a determinantal point process penalty [18] over the matrix of model similarities, may help us better quantify the diversity of the ensemble. Another interesting extension of our work would be to prediction tasks in semi-supervised settings, since labels are generally not required for computing local independence error. Finally, as we observe in Section 5, some datasets seem to support a particular number of locally independent models. It is worth exploring how best to formally quantify and characterize the inherent ambiguity present in a prediction task, a related but distinct problem from that of characterizing data complexity [12, 5, 24].
WP acknowledges the Harvard Institute for Applied Computational Science for its support. ASR is supported by NIH 1R56MH115187.
Appendix A.1

In the beginning of our derivation of $\mathrm{CosIndepErr}$ (Equation 4), we assumed that locally, $N_\epsilon(x) = B_\epsilon(x)$. However, in many cases, our data manifold is much lower dimensional than $\mathbb{R}^D$. In these cases, we have additional degrees of freedom to learn decision boundaries that, while locally orthogonal, are functionally equivalent over the dimensions which matter. To avoid rewarding this kind of spurious dissimilarity, we can project our gradients down to the data manifold. Given a local basis for its tangent space, we can accomplish this by taking dot products between $\nabla_x f(x)$ and $\nabla_x g(x)$ and each tangent vector, and then using these two vectors of dot products to compute the cosine similarity in Equation 6. More formally, if $J(x)$ is the Jacobian matrix of manifold tangents at $x$ (each row a tangent vector), we can replace our regular cosine penalty with

$$\cos^2\big(J(x)\,\nabla_x f(x),\; J(x)\,\nabla_x g(x)\big) \tag{8}$$
An example of this method applied to a toy dataset is given in Figure 6. Alternatively, if we are using projected gradient descent adversarial training to minimize the original formulation in Equation 2, we can modify its inner optimization procedure to also project updates back to the manifold.
For many problems of interest, we do not have a closed form expression for the data manifold or its tangent vectors. In this case, however, we can approximate one, e.g. by performing PCA or training an autoencoder.
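As a concrete sketch of this approximation (our own, using a global PCA basis as a stand-in for the tangent space; function names are hypothetical), the manifold-projected version of the penalty in Equation (8) could be computed as follows.

```python
import tensorflow as tf
from sklearn.decomposition import PCA

def fit_tangent_basis(x_train, n_components):
    """Approximate a (global) tangent basis for the data manifold with PCA.
    A local basis (e.g. PCA on nearest neighbors, or an autoencoder) could be
    substituted without changing the penalty below."""
    pca = PCA(n_components=n_components).fit(x_train)
    return tf.constant(pca.components_, dtype=tf.float32)  # (k, D) rows = tangent vectors

def projected_cos_indep_err(model_a, model_b, x, tangents, eps=1e-8):
    """Equation (8): squared cosine similarity between input gradients after
    projecting each gradient onto the span of the tangent vectors."""
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(x)
        fa, fb = model_a(x), model_b(x)
    ga = tf.matmul(tape.gradient(fa, x), tangents, transpose_b=True)  # (batch, k)
    gb = tf.matmul(tape.gradient(fb, x), tangents, transpose_b=True)
    dot = tf.reduce_sum(ga * gb, axis=1)
    norms = tf.norm(ga, axis=1) * tf.norm(gb, axis=1) + eps
    return tf.reduce_mean((dot / norms) ** 2)
```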
To better understand qualitative differences between ICU mortality prediction models, we examine differences between each model’s marginal distributions of input gradients over the dataset in Figures 7 and 8.