Variable importance (VI) tools describe how much a prediction model’s accuracy depends on the information in each covariate. For example, in Random Forests, VI is measured by the decrease in prediction accuracy when a covariate is permuted (Breiman, 2001; Breiman et al., 2001; see also Strobl et al., 2008; Altmann et al., 2010; Zhu et al., 2015; Gregorutti et al., 2015; Datta et al., 2016; Gregorutti et al., 2017
). A similar “Perturb” VI measure has been used for neural networks, where noise is added to covariates(Recknagel et al., 1997; Yao et al., 1998; Scardi and Harding, 1999; Gevrey et al., 2003). Such tools can be useful for improving the transparency of a “black box” a prediction model, for determining what scenarios may cause the model to fail, or for identifying covariates that must be measured with high precision.
However, existing VI measures do not generally account for the fact that many prediction models may fit the data almost equally well. In such cases, the model used by one analyst may rely on entirely different covariate information than the model used by another analyst. This common scenario has been called the “Rashomon” effect of statistics (Breiman et al., 2001; see also Statnikov et al., 2013; Tulabandhula and Rudin, 2014; Nevo and Ritov, 2015; Letham et al., 2016). The term is inspired by the 1950 Kurosawa film of the same name, in which four witnesses offer different descriptions and explanations for the same encounter. Under the Rashomon effect, how should analysts give comprehensive descriptions of the importance of each covariate? How well can one analyst recover the conclusions of another?
To address this, we define model class reliance (MCR) as the highest and lowest degree to which any well-performing model within a given class may rely on a variable of interest for prediction accuracy. Roughly speaking, MCR captures the range of explanations, or mechanisms, associated with well-performing models.
We make several technical contributions in deriving MCR. First, we define a core importance measure, model reliance (MR), as the degree to which a specific model relies on covariates of interest to predict well. This measure is based on permutation importance measures for Random Forests (Breiman, 2001; Breiman et al., 2001). We draw a connection between permutation-based importance estimates and U-statistics, which facilitates later theoretical results. Additionally, we derive connections between permutation importance, conditional causal effects, and coefficients for additive models. Next, we expand on MR to propose MCR, which captures the reliance values for a class of models. We derive finite-sample bounds for MCR, which motivate an intuitive estimator of MCR.
MCR and the Rashomon effect become especially relevant in the context of criminal recidivism prediction. Proprietary recidivism risk models trained from criminal records data are increasingly being used in U.S. courtrooms. One concern is that these models may be relying on information that would otherwise be considered unacceptable (for example, race, sex, or proxies for these variables), in order to estimate recidivism risk. The relevant models are often proprietary, and cannot be studied directly. Still, in cases where the predictions made by these models are publicly available, it may be possible to identify alternative prediction models that are sufficiently similar to the proprietary model of interest.
In this paper, we specifically consider the proprietary model COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), developed by the company Equivant (formerly known as Northpointe). Our goal is to estimate how much COMPAS relies on either race, sex, or proxies for these variables not measured in our dataset. To this end, we apply a broad class of flexible, kernel-based prediction models to predict COMPAS score. In this setting, the MCR interval reflects the highest and lowest degree to which any prediction model in our class can rely on race and sex while still predicting COMPAS score relatively accurately. Equipped with MCR, we can relax the common assumption of being able to correctly specify the unknown model of interest (here, COMPAS) up to a parametric form. Instead, rather than assuming that the COMPAS model itself is contained in our class, we assume that our class contains at least one well-performing alternative model that relies on sensitive covariates to the same degree that COMPAS does. Under this assumption, the MCR interval will contain the VI value for COMPAS. Applying our approach, we find that race, sex, and their potential proxy variables, are likely not the dominant predictive factors in the COMPAS score (see analysis and discussion in Section 9).
The tools we use to derive bounds for MCR and quite powerful and general, and can be used to make finite-sample (non-asymptotic) inferences for many other summary descriptions of well-performing models. For example, rather than describing how much well-performing models may rely on covariates of interest, the same techniques allow inference for the range of risk predictions that well-performing models may assign to a given covariate profile. In some cases, these novel techniques may provide finite-sample confidence intervals where none have previously existed. In others, they shed light on why certain types of model classes (for example, Neural Networks) have resisted rigorous inferential theory (see Section5).
The remainder of this paper is organized as follows. In Section 2 we introduce notation and terminology. In Sections 3 and 4 we present MR and MCR respectively, and derive theoretical properties of each. In Section 5, we discuss general applicability of our approach for determining finite-sample confidence intervals for other problems. In Section 6 we review additional related literature in more detail. In particular, we discuss the common practice of retraining a model after removing one of the covariates. In Section 7, we present a general procedure for computing MCR, with specific implementations for (regularized) linear models, and linear models in a reproducing kernel Hilbert space. In Section 8, we illustrate MR and MCR with a simulated toy example, to aid intuition. We also present simulation studies for the task of estimating MR for the unknown, underlying conditional expectation function, under misspecification. We analyze a well-known public dataset on recidivism in Section 9. All proofs are presented in the appendices.
2 Terminology & notation: “importance,” “models,” and “model classes.”
The label of “variable importance” measure has been broadly used to describe approaches for either inference (van der Laan, 2006; Díaz et al., 2015; Williamson et al., 2017) or prediction. While these two goals are highly related, we primarily focus on how much prediction models rely on covariates to achieve accuracy. We use terms such as “model reliance” rather than “importance” to clarify this context.
be a random variable with outcomeand covariates , where the covariate subsets and may each be multivariate. Our goal is to study how much different prediction models rely on to predict . We refer to our dataset as , a matrix composed of a -length outcome vector in the first column, and a covariate matrix in the remaining columns. We assume that observations of are , that , and that solutions to and operations exist whenever optimizing over sets mentioned in this paper (for example, in Theorem 6, below).
We use the term model class to refer to a prespecified subset of the measurable functions from to . We refer to member functions as prediction models, or simply as models. Given a model , we evaluate its performance using a nonnegative loss function . For example, may be the squared error loss for regression, or the hinge loss for classification. We use the term algorithm to refer to any procedure that takes a dataset as input and returns a model as output. We illustrate these terms with two brief toy examples in Appendix A.1.
For a given vector , let denote its element(s). For a given matrix , let , , , and respectively denote the transpose of , the row(s) of , the column(s) of , and the element(s) in the row(s) and column(s) of .
3 Model reliance
To describe how much the expected accuracy of a fixed prediction model relies on the random variable , we use the notion of a “switched” loss. Let and be independent random variables, each following the same distribution as . Denote realizations of and by and respectively. Given the realizations and , let be the loss of model on , if was first replaced with :
For a given prediction function , we wish to know the expectation of this quantity across pairs in the population, .
As a reference point, we compare against the standard expected loss when none of the variables are switched, . We define model reliance (MR) as the ratio of these two expected losses:
Higher values of signify greater reliance of on , and signifies no reliance on . Models with reliance values strictly less than 1 are more difficult to interpret, but are often accompanied by better performing models satisfying (see Appendix A.2).
Model reliance could alternatively be defined as a difference rather than a ratio, that is, with . In Appendix A.5, we discuss how many of our results remain similar under either definition.
3.1 Estimating model reliance with U-statistics, and connections to permutation-based variable importance.
Given a model and dataset , we estimate with the standard empirical loss
We estimate with
To illustrate the connection between Eq 3.1 and the permutation-based variable importance approach of Breiman (2001), let be a set of -length vectors, each containing a different permutation of the set . The approach of Breiman (2001) is analogous to computing the loss for a randomly chosen permutation vector . Similarly, our calculation in Eq 3.1 is proportional to the sum of losses over all possible permutations, excluding the unique combinations of the rows of and the rows of that appear in the original sample (see Appendix A.3).
As an alternative to Eq 3.1, if the summation over all possible pairs is computationally prohibitive due to sample size, another estimator of is
The above estimator divides the sample into a single, specific grouping of pairs, and averages across these pairs.
Finally, we can estimate with the plug-in estimator
which we refer to as the “empirical” model reliance of on .
Both and belong to the well-studied class of U-statistics. Thus, under fairly minor conditions, and are unbiased, asymptotically normal, and have finite-sample probabilistic bounds (Hoeffding, 1948; Serfling, 1980). To our knowledge, connections between permutation-based importance and U-statistics have not been previously established.
While the above results from U-statistics depend on being fixed a priori, we can also leverage these results to create uniform bounds on the MR estimation error for all models in a sufficiently regularized class . We formally present this bound in Section 4 (Theorem 8), after introducing required conditions on model class complexity. The existence of this uniform bound implies that it is feasible to train a model and to evaluate its importance using the same data. This differs from the classical VI approach of Random Forests (Breiman, 2001), which avoids in-sample importance estimation. There, each tree in the ensemble is fit on a random subset of data, and VI for the tree is estimated using the held-out data. The tree-specific VI estimates are then aggregated to obtain a VI estimate for the overall ensemble. Although sample-splitting approaches such as this are helpful in many cases, the uniform bound for MR suggests that they are not strictly necessary.
3.2 Model reliance and causal effects
In this section, we show a connection between population-level model reliance and the conditional average causal effect. For consistency with the causal inference literature, we temporarily rename the random variables as , with realizations . Here, represents a binary treatment indicator, represents a set of baseline covariates, and represents an outcome of interest. Under this notation, represents the expected loss of a prediction function , and denotes the expected loss in a pair of observations in which the treatment has been switched. Let be the (unknown) conditional expectation function for , where we place no restrictions on the functional form of .
Let and be potential outcomes under treatment and control respectively, such that . The treatment effect for an individual is defined as , and the average treatment effect is defined as . Let be the (unknown) conditional average treatment effect of for all patients with . Causal inference methods typically assume (conditional ignorability), and for all values of (positivity), in order for and CATE to be well defined and identifiable.
The next theorem quantifies the relation between the conditional average treatment effect function (CATE) and the model reliance of on , as measured by . We choose this form of the result for simplicity, although dividing both sides of Eq 3.3 (below) by immediately translates the result in terms of a model reliance ratio.
For any prediction model , let and be defined based on the squared error loss . Under the assumptions that (conditional ignorability) and for all values of (positivity), we have
where is the marginal variance of the treatment assignment.
is the marginal variance of the treatment assignment.
Intuitively, we see that the difference between and depends on the treatment prevalence and the conditional average treatment effect of on for each covariate profile . For example, if all patients are treated, then scrambling the treatment in a random pair of observations has no effect on the loss. In this case we see that and . Likewise, if the conditional average treatment effect is zero for every covariate profile , then , , and does not rely on . If the conditional average treatment effect is positive and constant for all values of , then a larger treatment effect will result in a larger value for , and a higher reliance of on .
Importantly, will yield different conclusions than the average treatment effect when there is treatment effect heterogeneity (that is, when ). For example, if a treatment is harmful, on average, in one subpopulation, but helpful in another subpopulation, then the average treatment effect may be zero while the average of the squared conditional average treatment effect, , will be positive.
3.3 Model reliance for linear models and additive models.
For linear models and the squared error loss, we can show both an interpretable definition of model reliance, as measured by , as well as a computationally efficient formula for . The result is similar to Theorem 1, and augments results from Gregorutti et al. (2017).
For any prediction model , let , , , and be defined based on the squared error loss for , , and , where and are positive integers. Let and satisfy , , and . Then
and, for finite samples,
where , is the -length vector of ones, and is the identity matrix.
Eq 3.4 shows that model reliance for linear models, as measured by , can be interpreted in terms of the population covariances and the model coefficients. Gregorutti et al. (2017) show an equivalent formulation of Eq 3.4 under the stronger assumptions that is equal to the conditional expectation function of (that is, ), and the covariates and are centered.
Although the number of terms in the definition of grows quadratically in (see Eq 3.1), in the case of linear models we see from Eq 3.5 that the computational complexity of grows only linearly in . Specifically, the terms and in Eq 3.5 can be computed as and respectively, where the computational complexity each term in parentheses grows linearly in .
4 Model class reliance
Like many statistical procedures, our MR measure (Section 3) produces a description of a single predictive model. Given a model with high predictive accuracy, MR describes how much the model’s performance hinges on covariates of interest (). However, there will often be many other models that perform similarly well, and that rely on to different degrees. With this notion in mind, we now study how much any well-performing model from a prespecified class may rely on covariates of interest.
To formally define the set of well-performing models, we require a “benchmark” or “reference” model, denoted by . For , and a reference model , let be the subset of models with expected loss no more than above that of . We refer to as a population-level “Rashomon set” around . This set can be thought of as representing models that might be arrived at due to differences in data measurement, processing, filtering, model parameterization, covariate selection, or other analysis choices.
While could be selected by minimizing the in-sample loss, the theoretical study of is simplified under the assumption that is prespecified. For example, may come from a flowchart used to predict injury severity in a hospital’s emergency room, or from another quantitative decision rule that is currently implemented in practice. The model can also be selected using sample splitting. In some cases it may be desirable to fix equal to the best-in-class model , but this is generally infeasible because is unknown. Still, for any , the Rashomon set around will always contain the Rashomon set around . Hereafter, we assume that and are fixed, and typically omit their specification when writing .
We define population-level model class reliance (MCR) by maximizing and minimizing MR over :
Studying the above quantities is the main focus of this paper, as these measures provide a more comprehensive view of importance than traditional measures (for example, Section 3). If is low, then no well-performing model exists that places high importance on , and can be discarded at low cost regardless of future modeling decisions. If is large, then every well-performing model must rely substantially on , and should be given careful attention during the modeling process. Here,) does not depend on the fitting algorithm used to select a model . The range is valid for any algorithm producing models in , and applies for any .
To study MCR empirically, we introduce the sample analogues
where . We refer to and as “empirical” MCR measures, and to as an “empirical Rashomon set.” We discuss the utility of these empirical measures in the next section.
In Appendix B.10 we consider an alternate formulation of Rashomon sets and MCR where we replace the relative loss threshold in the definition of with an absolute loss threshold. This alternate formulation can be similar in practice, but still requires the specification of a reference function to ensure that and are nonempty.
4.1 Finite-sample bounds for model class reliance.
In this section we derive finite-sample, probabilistic bounds for and based on empirical MCR. While the bounds we present are conservative, they imply that, under minimal assumptions, and are sensible point estimates for and . Thus, in Sections 8.1 & 9, we use and as point estimates, and apply a bootstrap procedure to obtain confidence intervals.
To derive these results we introduce three bounded loss assumptions, each of which can be assessed empirically. Let be known constants.
(Bounded individual loss) For a given model , assume that for any .
(Bounded relative loss) For a given model , assume that for any .
(Bounded aggregate loss) For a given model , assume that .
Each assumption is a property of a specific model . The notation and refer to bounds for any individual observation, and the notation and refer to bounds on the aggregated loss in a sample.
We give example methods of determining in Sections 7.4.2 & 7.5.2. For Assumption 5, we can approximate by training a highly flexible model to the data, and setting equal to half (or any positive fraction) of the resulting cross-validated loss. To determine we can simply set , although this may be conservative. For example, in the case of binary classification models for non-separated groups (see Section 8.1
), no linear classifier can misclassify all observations, particularly after a covariate is permuted. Thus, it must hold that. Similarly, if satisfies Assumption 3, then may be conservatively set equal to . If model reliance is redefined as a difference rather than a ratio, then a similar form of the results in this section will apply without Assumption 5 (see Appendix A.5).
Based on these assumptions, we can create a finite-sample upper bound for and lower bound for .
where , and .
states that, with high probability,is no higher than added to an error term . As increases, approaches and approaches zero. A similar interpretation holds for Eq 4.4.
We provide a visualization of the result in Appendix A.4, and also give a brief sketch of the proof here. First, we enlarge the empirical Rashomon set by increasing to , such that, by Hoeffding’s inequality, with high probability. When , we know that by the definition of . Next, the term accounts for estimator error of relative to . Thus, we can relate to both and in order to obtain Eq 4.3. Similar steps can be applied to obtain Eq 4.4.
The bounds in Theorem 6 naturally account for potential overfitting without an explicit limit on model class complexity, such as a covering number (Bousquet et al., 2004; Rudin and Schapire, 2009). Instead, these bounds depend on being able to fully optimize MR across sets in the form of . If we allow our model class to become more flexible, then the size of will also increase. Because the bounds in Theorem 6 result from optimizing over , increasing the size of results in wider, more conservative bounds. In this way, Eqs 4.3 and 4.4 implicitly capture model class complexity.
A corollary of Theorem 6 is that we can create a probabilistic bound for the reliance of the (unknown) best-in-class model on .
where , and .
The above result does not require that be unique. If several models achieve the minimum possible expected loss, the above boundaries apply simultaneously for each of them. In the special case when the true conditional expectation function is equal to , then we have a boundary for the reliance of the function on . This reliance bound can also be translated into a causal statement using Theorem 1.
So far, Theorem 6 allows us to cap the range of MR values corresponding to models that predict well, but it does not necessarily provide a complete understanding when this range is large. For example, even if the lower bound on from Eq 4.4 is below 1, we are not able conclude that there exists a well-performing model with no reliance on (that is, with ). To conclude that such a model exists, we need an upper bound on . Likewise, in order to make conclusions regarding the existence of well-performing models with high reliance on , we need a lower bound on .
To create such bounds, we take a contrary approach to that of Theorem 6, where we had expanded the empirical Rashomon set by increasing to . We now contract the empirical Rashomon set by subtracting a buffer term from . This requires that we generalize the definition of an empirical Rashomon set to for . The definition is unchanged for , but for the explicit inclusion of now ensures that is nonempty. As before, we typically omit the notation and , writing instead.
Creating these addition boundaries will also require a limit on the complexity of . We propose a complexity measure in the form of a covering number. Specifically, we define the set of functions as an -margin-expectation-cover if for any and any distribution , there exists such that
We define the covering number to be the size of the smallest -margin-expectation-cover for . In general, we use and to denote probabilities and expectations with respect to a random variable following the distribution . We abbreviate these quantities accordingly when or are clear from context, for example, as , , or simply . Unless otherwise stated, all expectations and probabilities are taken with respect to the (unknown) population distribution.
Equipped with this complexity measure, we can now uniformly bound the estimation error of for all . This, in turn, will let us determine the desired bounds for MCR.
Theorem 8 states that, with high probability, the largest possible estimation error for across all models in is bounded by ), which can be made arbitrarily small by increasing and decreasing . This, in turn, means that it is possible to train a model and estimate it’s reliance on variables without using sample-splitting. Theorem 8 also lets us derive a lower bound for , and upper bound for .
where , and , as defined in Eq 4.6.
Eq 4.8 states that, with high probability, is no higher than added to an error term , where both and can be made arbitrarily small by shrinking and increasing . A similar interpretation holds for the upper bound on in Eq 4.7. We provide a visualization of this result in Appendix A.4.
The proof for Theorem 9 follows a similar structure to that of Theorem 6, but incorporates Theorem 8’s uniform bound on MR estimation error ( term), as well as an additional uniform bound on the probability that any model has in-sample loss too far its expected loss ( term).
In some cases, it may be possible to improve the bounds in Theorem 9 by splitting the sample into two parts, using the first split to intelligently select a subset , and using the second split to calculate boundaries. For any subset , the value of will naturally serve as a lower bound on . Thus, rather than using Eq 4.7 to lower bound directly, we can search for a simpler model class such that , and . Balancing these two criteria may lead to a tighter bound from Theorem 9. For example, consider the special case where is indexed by a real-valued parameter . Given a training dataset, let and let , where and are computed using training data. We can then set as the subset of models resulting from convex combinations of and :
The bound from Eq 4.7 can be evaluated for in a held-out dataset, potentially resulting in a tighter bound for . In the case of linear classifiers, we can analytically derive a covering number for .
Consider the classification setting where ; ; is the hinge loss for ; is the set of linear classifiers ; and is defined as in Eq 4.9 for the prespecified vectors . In this setting, if for all , then holds for any .
4.2 Model class reliance on imputation residuals of correlated predictors
One common scenario where multiple models achieve low loss is when the sets of predictors and are highly correlated, or contain redundant information. Models may predict well either through reliance on , or through reliance on , and so MCR will correctly identify a wide range of potential reliances on . However, we may specifically be interested how much models that fully exhaust the information in can additionally rely on the information in .
For example, age and accumulated wealth may be correlated, and both may be predictive of future promotion. We may wish to know the potential importance of wealth when predicting promotions, but only for models that fully incorporate age.
To achieve this, we propose a two-step procedure:
Impute the values of based on using a prespecified model . Replace with its imputation residual .
Let be a prespecified class of models that use and to predict . After computing , estimate the model class reliance of on .
Such an approach will generally result in a smaller value of relative to the result that would occur from proceeding without an imputation step.
5 General purpose, finite-sample CIs from Rashomon sets
Theorem 6 implies that Rashomon sets can be used to derive non-asymptotic confidence intervals (CIs) for the MR of best-in-class model(s). In this section we generalize the Rashomon set approach to create finite-sample CIs for other summary descriptions of best-in-class model(s). The generalization also helps to illustrate a core aspect of the argument underlying Theorem 6: best-in-class models tend to have relatively good performance in random samples.
Let be a descriptor of interest for models in . For example, if is the linear model , then may be defined as the norm of the associated coefficient vector (that is, ) or the prediction would assign given a specific covariate profile (that is, ). For simplicity, we assume that can be determined exactly for any model . Note that this condition is not satisfied if we choose model reliance as our descriptor , and so, because of this simplification, the results of this section do not fully replace those of Section 4.1. We also temporarily assume that the best-in-class model uniquely minimizes the expected loss. In this setting, the following proposition allows us to create finite-sample CIs for based on empirical Rashomon sets.
Let and , where . Let be the prediction model that uniquely attains the lowest possible expected loss. If satisfies Assumption 4, then
Proposition 11 generates the finite-sample CI , which can be interpreted as the range of values corresponding to models with empirical loss not substantially above that of . Thus, the interval has both a rigorous coverage rate and a coherent in-sample interpretation. The proof of Proposition 11 uses Hoeffding’s inequality to show that is contained in with high probability.
Our assumption that uniquely minimizes over can be removed if we consider the following version of Proposition 11:
Let and , where . If Assumption 4 holds for all , then
In contrast to Proposition 11, Proposition 12 creates a finite-sample CI for the range of values corresponding to the models with expected loss no greater than . By definition, this range includes for any minimizing the expected loss.
As demonstrated in this section, Rashomon sets allow us to reframe a wide set of statistical inference problems as in-sample optimization problems. The implied confidence intervals are not necessarily in closed form, but the approach still opens an exciting pathway for deriving non-asymptotic results. Propositions 11 & 12 also shed light on why certain types of modeling approaches, such as Neural Networks and Random Forests, have been associated with slow progress in inferential theory. For such model classes, it is not even guaranteed that the empirical risk can be globally minimized, much less objective functions in the form of , as in and .
6 Limitations of existing variable importance methods
Several common approaches for variable selection, or for describing relationships between variables, do not necessarily capture a variable’s importance. Null hypothesis testing methods may identify a relationship, but do not describe the relationship’s strength. Similarly, checking whether a variable is included by a sparse model-fitting algorithm, such as the Lasso(Hastie et al., 2009), does not describe the extent to which the variable is relied on. Partial dependence plots (Breiman et al., 2001; Hastie et al., 2009) can be difficult to interpret if multiple variables are of interest, or if the prediction model contains interaction effects.
Another common VI procedure is to run a model-fitting algorithm twice, first on all of the data, and then again after removing from the dataset. The losses for the two resulting models are then compared to determine the importance, or “necessity,” of (Gevrey et al., 2003). Because this measure is a function of two prediction models rather than one, it does not measure how much either individual model relies on . We refer to this approach as measuring empirical Algorithm Reliance (AR) on , as the model-fitting algorithm is the common attribute between the two models. Related procedures were proposed by (Breiman et al., 2001; Breiman, 2001), which measure the sufficiency of .
The permutation-based VI measure from RFs forms the inspiration for our definition of MR (see Section 3). This RF VI measure has been the topic of empirical studies (Archer and Kimes, 2008; Calle and Urrea, 2010; Wang et al., 2016), and several variations of the measure have been proposed (Strobl et al., 2007, 2008; Altmann et al., 2010; Hapfelmeier et al., 2014). Mentch and Hooker (2016) use U-statistics to study predictions of ensemble models fit to subsamples, similar to the bootstrap aggregation used in RFs. Procedures related to “Mean Difference Impurity,” another VI measure derived for RFs, have been studied theoretically by (Louppe et al., 2013; Kazemitabar et al., 2017). All of this literature focuses on VI measures for RFs, for ensembles, or for individual trees. Our estimator for model reliance differs from the traditional RF VI measure (Breiman, 2001) in that we permute inputs to the overall model, rather than permuting the inputs to each individual ensemble member. Thus, our approach can be used generally, and is not limited to trees or ensemble models.