Please Stop Permuting Features: An Explanation and Alternatives

This paper advocates against permute-and-predict (PaP) methods for interpreting black box functions. Methods such as the variable importance measures proposed for random forests, partial dependence plots, and individual conditional expectation plots remain popular because of their ability to provide model-agnostic measures that depend only on the pre-trained model output. However, numerous studies have found that these tools can produce diagnostics that are highly misleading, particularly when there is strong dependence among features. Rather than simply add to this growing literature by further demonstrating such issues, here we seek to provide an explanation for the observed behavior. In particular, we argue that breaking dependencies between features in hold-out data places undue emphasis on sparse regions of the feature space by forcing the original model to extrapolate to regions where there is little to no data. We explore these effects through various settings where a ground-truth is understood and find support for previous claims in the literature that PaP metrics tend to over-emphasize correlated features both in variable importance and partial dependence plots, even though applying permutation methods to the ground-truth models does not. As an alternative, we recommend more direct approaches that have proven successful in other settings: explicitly removing features, conditional permutations, or model distillation methods.


1 Introduction

Machine learning methods have proved to be enormously successful tools for making predictions from data. However, most of these methods produce algebraically complex models that provide little or no insight into how they arrive at their predictions. As a consequence, a number of methods have been developed to “X-ray the black box”, providing insight into which input features are important and how they affect predictions.

This paper focuses on methods based on permuting feature values, or otherwise investigating the effect that changing the values of a feature has on predictions. Our message can be simply summarized by

When features in the training set exhibit statistical dependence, permutation methods can be highly misleading when applied to the original model.

Permutation methods are some of the oldest, most popular, and computationally convenient means of understanding complex learning algorithms. In this paper, we will focus primarily on three commonly-used techniques:

Variable Importance

Introduced in Breiman (2001), these methods are designed to provide a score for each feature based on how much difference replacing the feature with noise makes to predictive accuracy. In particular, Breiman (2001) proposed measuring the importance of the $j$th feature by permuting its values in the training data and examining the corresponding drop in predictive accuracy on a model built with the original training data.

Given a training set consisting of a matrix of feature values $X$, with rows $x_i$ giving each observation, and a corresponding response vector $y$, let $X_{\pi,j}$ be a matrix achieved by randomly permuting the $j$th column of $X$. Using $L(y, \hat{f}(X))$ as the loss for predicting $y$ from $X$, Breiman (2001) defined the importance of the $j$th feature as

$$\mathrm{VI}_j = L\big(y, \hat{f}(X_{\pi,j})\big) - L\big(y, \hat{f}(X)\big),$$

the increase in loss due to replacing $x_j$ with a value randomly chosen from the (marginal) distribution of feature $j$.

There are numerous variations on this. Breiman (2001) designed the method specifically for use with random forests and considered out-of-bag loss, based only on trees which were not trained using the observation being predicted. For more general learners, either training or test loss can be used. Fisher et al. (2018) considered averaging over all possible permutations and derived concentration bounds based on $U$-statistics. Nonetheless, all of these methods share the same permute-and-predict (PaP) structure.
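
As a concrete illustration, the following R sketch computes a permute-and-predict importance of this form for any fitted model with a predict method; the function name pap_importance, the use of squared-error loss, and the averaging over a handful of random permutations are our own choices rather than part of Breiman's proposal.

```r
# Permute-and-predict (PaP) importance under squared-error loss.
# `model` is any fitted object with a predict(model, newdata = ...) method;
# `data` holds the features and `y` the response used for evaluation.
pap_importance <- function(model, data, y, n_perm = 10) {
  base_loss <- mean((y - predict(model, newdata = data))^2)
  sapply(names(data), function(j) {
    mean(replicate(n_perm, {
      permuted <- data
      permuted[[j]] <- sample(permuted[[j]])  # break x_j's ties to the other features
      mean((y - predict(model, newdata = permuted))^2) - base_loss
    }))
  })
}
```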

Partial Dependence Plots (PDPs)

Friedman (2001) suggested examining the effect of feature $j$ by plotting the average prediction as that feature is changed. Specifically, letting $X_j(x)$ be the matrix of feature values in which the $j$th entry of every row has been replaced with the value $x$, we define the partial dependence function

$$\mathrm{PD}_j(x) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}\big(X_j(x)\big)_i$$

as the average prediction made with the $j$th feature replaced with the value $x$. Since these are univariate functions (multivariate versions can be defined naturally), they can be readily displayed and interpreted.
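
A minimal R sketch of this computation, assuming a fitted model with a predict method and numeric features on [0, 1] as in the simulations below; the grid, the function name, and the placeholder objects fit and X in the usage comment are ours.

```r
# Partial dependence of a fitted model on feature `j`: overwrite feature j
# everywhere with each grid value and average the model's predictions.
partial_dependence <- function(model, data, j, grid = seq(0, 1, length.out = 50)) {
  sapply(grid, function(x) {
    data[[j]] <- x                        # replace the j-th column by the value x
    mean(predict(model, newdata = data))
  })
}
# e.g. pd <- partial_dependence(fit, X, "x1")
#      plot(seq(0, 1, length.out = 50), pd, type = "l")
```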

Individual Conditional Expectation (ICE) Plots

Goldstein et al. (2015) proposed a refined version of partial dependence plots that involves simply plotting out the functions

$$\mathrm{ICE}_{ij}(x) = \hat{f}\big(x_{i,1},\ldots,x_{i,j-1},\,x,\,x_{i,j+1},\ldots,x_{i,p}\big)$$

that correspond to tracing out the prediction given to any one example $x_i$ as the $j$th feature is changed. PDPs are then exactly the average of the corresponding ICE plots, but the latter allow an investigation of how the effect of feature $j$ may change for different combinations of the remaining inputs. When $n$ is very large, a random selection of ICE plots can be presented as examples.
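
A corresponding sketch for ICE curves; as noted above, averaging the curves across observations recovers the partial dependence function (colMeans in the sketch below). The function name is again our own.

```r
# ICE curves: one prediction profile per observation as feature `j` is varied.
# The returned matrix has one row per observation and one column per grid value.
ice_curves <- function(model, data, j, grid = seq(0, 1, length.out = 50)) {
  sapply(grid, function(x) {
    data[[j]] <- x
    predict(model, newdata = data)
  })
}
# The partial dependence function is recovered as colMeans(ice_curves(model, data, j)).
```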

These techniques are attractive for a number of reasons: they are each computationally cheap, requiring only $O(n)$ model evaluations; they apply to the fitted function $\hat{f}$ derived from any learning method; they have no tuning parameters; and, relying only on averages, they are statistically very stable. Moreover, the approach is readily understood and easily explained to users in applied areas. For this reason, they have been frequently adopted and advocated for the general public.

However, procedures like these based on the PaP structure have been shown to exhibit serious flaws, especially when significant correlations exist between features. Strobl et al. (2007) investigate classification and note that the importance measures based on permuted out-of-bag (oob) error in CART-built trees are biased toward features that are correlated with other features and/or have many categories, and further suggest that bootstrapping exaggerates these effects. Archer and Kimes (2008) explore a similar set-up and also note improved performance when true features – those actually related to the response – are uncorrelated with noise features. Nicodemus et al. (2010) investigate these claims of feature preference in a large-scale simulation study and again find that the oob measures overestimate the importance of correlated predictors.

Here we provide an intuitive rationale for these observations: extrapolation. If, for example, features $x_1$ and $x_2$ are strongly positively dependent, there will be no training examples that pair a large value of $x_1$ with a small value of $x_2$, or vice versa. Thus, the predictions made in the upper-left corner of $(x_1, x_2)$ space will mostly depend on the extrapolation behavior of the particular learning method employed. And as we demonstrate in the following sections, permutation-based methods place significant weight on exactly these predictions. As a concrete example: an evaluation of the importance of pregnancy status in a model that also includes gender would result in the evaluation of the response of pregnant men as often as pregnant women.

For these reasons, we argue that these methods are flawed and can be strongly misleading. Hooker (2007) provided an interpretation of all these measures in the context of the functional ANOVA decomposition (Hoeffding, 1948) but only when every feature is statistically independent from all the rest. A generalization can be obtained when the features are dependent, but this does not readily translate into a permutation-based diagnostic. Similar critiques can be found in Strobl et al. (2007, 2008) and Toloşi and Lengauer (2011).

In this paper, we provide a more detailed examination of the effect of these diagnostic tools along with a new explanation for the behavior when applied to random forests. Because the success or otherwise of any diagnostic cannot be evaluated without knowledge of the ground-truth, we illustrate the behavior of these methods in simulation, using a known linear model. In addition to random forests, we also examine the behavior of neural networks where different, but similarly problematic behavior is observed.

Note that these methods are distinct from the permute-and-relearn approach to testing feature importance in Mentch and Hooker (2016, 2017), as well as the leave-out-covariates (LOCO) approach in Lei et al. (2018). There, rather than extrapolate, a model is re-learned using the modified data and the change in predictions is tested for statistical significance. In those approaches, the test points are evaluated only within the original feature distribution, and the use of $X_{\pi,j}$ serves to replace the $j$th feature with noise that should not contribute to predictive accuracy, thereby maintaining the same dimension as the original feature space.

In this paper we focus on permutation and re-learning procedures as being applicable to any machine learning method. Specific methods have their own diagnostics, such as the split-improvement measures suggested in Friedman (2001) that apply specifically to trees. We note that these methods can also be biased towards features with more potential split points (Strobl et al., 2007); potential corrections are given in Zhou and Hooker (2019).

In the following, we introduce a simple and understandable model in Section 2, which we use in Section 3 to illustrate how misleading variable importance measures and diagnostic plots can be. We provide an explanation of these results in terms of extrapolation behavior in Section 4 and suggest some remedies in Sections 5 and 6.

2 A Simple Simulated Example

Here we set up a simple model that will allow us to be clear about the values that we ought to obtain from variable importance measures and which we can then contrast to those we actually find. Specifically, we propose a linear regression model based on 10 features $x_1, \ldots, x_{10}$:

$$y = \sum_{j=1}^{10} \beta_j x_j + \epsilon, \qquad (1)$$

where $\epsilon \sim N(0, \sigma^2)$ produces process noise. The permutation methods we investigate here are not restricted to regression problems. However, in this setting we are able to relate all these diagnostic tools to the coefficients of the regression, providing a point of comparison. Our choice of noise variance is also deliberately small in order to make the behavior we observe clearer.

In addition to specifying the relationship between features and output, we need to provide a distribution for the features $x_1, \ldots, x_{10}$. It is this distribution that makes permutation methods misleading. In our model, each of the features is marginally distributed as uniform on [0, 1]. They are generated independently with the exception of the first two, which we make correlated via a Gaussian copula (Nelsen, 2007) with correlation parameter $\rho$. The strength of this correlation will play a large role in how strongly variable importance and other measures can mislead.
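
For concreteness, a sketch of this feature distribution in R, using the standard device of correlating two standard normals and mapping them through the normal CDF to obtain uniform margins; the coefficients of (1) are not reproduced here, and the function name is our own.

```r
# Generate n observations of 10 features, each marginally Uniform(0, 1).
# x1 and x2 are coupled through a Gaussian copula with correlation rho;
# the remaining eight features are independent.
simulate_features <- function(n, rho) {
  z1 <- rnorm(n)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)   # correlated standard normals
  X  <- cbind(pnorm(z1), pnorm(z2),             # pnorm() gives uniform margins
              matrix(runif(n * 8), n, 8))       # x3, ..., x10 independent
  colnames(X) <- paste0("x", 1:10)
  as.data.frame(X)
}
```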

For a linear model such as (1) with covariates that have the same scale, variable importance can be obtained from the magnitude of the coefficients; see Gregorutti et al. (2015) (and Theorem 1 below) for a formal connection to the variable importance measures in Breiman (2001). Here we will be interested in the relative importance of $x_1$ and $x_2$ compared to the other features, and we have chosen coefficients so that several features provide a reference with the same coefficient as $x_1$ and $x_2$, one feature has no influence, and the remaining features have a clear importance ordering.

We contend that permutation-based diagnostic tools are misleading when one both employs flexible learning methods and has correlated features. Neither, by itself, is sufficient, although we note that the combination is in fact very common. We illustrate this necessity in the following theorem, where we investigate the diagnostics returned by an oracle that has direct access to the response function (1).

Theorem 1.

For $f(x) = \beta_0 + \sum_{j} \beta_j x_j$ fit by least-squares:

  1. $E_\pi \mathrm{VI}_j = \frac{2\beta_j^2}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$, where $E_\pi$ indicates the expectation over permutations and $\bar{x}_j$ is the average value of the $j$th feature.

  2. $\mathrm{PD}_j(x) = \beta_j x + C_j$, where $C_j = \beta_0 + \sum_{k \neq j} \beta_k \bar{x}_k$.

  3. $\mathrm{ICE}_{ij}(x) = \beta_j x + C_{ij}$, where $C_{ij} = \beta_0 + \sum_{k \neq j} \beta_k x_{ik}$.

The first of these results is also stated in Gregorutti et al. (2015) and Fisher et al. (2018), but is repeated here for completeness. Proofs are provided in the appendix.
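
The first identity is easy to check numerically. The sketch below is our own and uses an illustrative linear response with arbitrary coefficients rather than the coefficients of (1); it compares a Monte Carlo average of $\mathrm{VI}_1$ over random permutations with $2\hat{\beta}_1^2$ times the ($1/n$-normalized) sample variance of $x_1$.

```r
set.seed(1)
n  <- 500
z1 <- rnorm(n); z2 <- 0.9 * z1 + sqrt(1 - 0.9^2) * rnorm(n)
dat <- data.frame(x1 = pnorm(z1), x2 = pnorm(z2), x3 = runif(n))
dat$y <- dat$x1 + dat$x2 + dat$x3 + rnorm(n, sd = 0.1)   # illustrative response only

fit  <- lm(y ~ x1 + x2 + x3, data = dat)
base <- mean((dat$y - predict(fit, dat))^2)

vi <- replicate(2000, {                   # Monte Carlo average over random permutations
  d <- dat; d$x1 <- sample(d$x1)
  mean((dat$y - predict(fit, d))^2) - base
})
c(monte_carlo = mean(vi),
  theorem_1   = unname(2 * coef(fit)["x1"]^2 * mean((dat$x1 - mean(dat$x1))^2)))
```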

As we will show, learning linear models returns accurate measures regardless of the correlation structure while more flexible techniques like random forests and neural networks can be very misleading.

For each choice of correlation parameter $\rho$ and sample size $n$, we learn both a random forest and a single-layer neural network with 20 hidden nodes (randomForest and nnet, respectively, in R). We then evaluate the variable importance $\mathrm{VI}_j$ as given above using the training data. For random forests we also obtain the original out-of-bag importance measures implemented in importance; these differ from $\mathrm{VI}_j$ in being evaluated from out-of-bag residuals for each tree and are thus specific to bagging methods.

Additionally, we recorded the partial dependence of the prediction on $x_1$ and $x_2$ for each model, as well as the ICE for 11 observations, with the value of $x_1$ for the $i$th of these observations taken at a fixed reference point and the remaining features generated uniformly but kept constant across all simulations. These values were chosen so that the reference points were always in the bulk of the feature distribution.

Because we are primarily interested in bias, we simulated each data set, with its associated learned function $\hat{f}$, feature importances and plots, 50 times and report the average over these simulations in order to remove variability from the results.

3 Simulation Results

In order to ensure that scaling does not affect our results, we report the importance rank of each feature: for each simulation we order the importances of the features from least to greatest and record the position of each feature. This is commonly employed in screening methods where only the most important features are retained.

Figure 1: Average variable importance rank (lowest to highest) computed using $\mathrm{VI}_j$ on the training data for each feature over 10 models trained on simulated data of size 2000. Rank is given for random forests (r), neural networks (n) and linear models (l), as well as the out-of-bag variable importance for random forests (o). Left: when all features are generated independently. Right: for $x_1$ and $x_2$ generated from a Gaussian copula model with correlation parameter $\rho = 0.9$. Dashed lines indicate the theoretical rank of the covariates.

Figure 1 shows the average importance rank of each of the 10 features obtained on the training set using the measures implemented in the randomForest package. We use 2,000 observations in each of 50 simulations and give results where $x_1$ and $x_2$ are related by a Gaussian copula with correlation either 0 or 0.9. Note that in the independent case ($\rho = 0$), random forests, neural networks and linear models all agree on the ordering of the covariates, including ties between features that share a coefficient, which also corresponds to the results in Theorem 1.

However, when $\rho = 0.9$, permutation-based methods applied to random forests and neural networks rank $x_1$ and $x_2$ as more important than the other features with the same coefficient, and frequently more important than even features with a larger coefficient. By contrast, in line with Theorem 1, linear models retained the same importance rank between the two correlation structures, an ordering which agrees with the known coefficients of the response function.

In Section 4, we explain this observation by noting that permuting only one of a set of correlated features breaks its relationship with the remaining correlated features, resulting in evaluations of the function in regions which, by construction, have no data nearby; see Figure 4. This may be viewed as being due to the very high correlation parameter employed, and to the specific sample size. In Figure 2, we examine the joint effect of sample size and correlation on the average rank given to $x_1$ and $x_2$. This time, we modify (1) so that $x_1$ and $x_2$ each have coefficient 0.8, making them less relevant than several of the other features, and plot their average importance rank over a range of correlations $\rho$ for each data set size $n$. These are averaged over 20 simulations to improve stability.

Figure 2: Left: change in the average rank of $x_1$ and $x_2$ as the correlation $\rho$ increases, for fixed $n$. Remaining plots (left to right): average importance rank for each $\rho$ as a function of $n$ for random forests, neural networks, and random forests using OOB importance measures, respectively. The (true) theoretical rank should be 4.

Here, we observe that for small $n$, nonzero correlation makes $x_1$ and $x_2$ appear to be more important than their coefficients warrant, particularly for the out-of-bag importance measures, but that for small $\rho$ this effect is removed at larger $n$. For large $\rho$, the importance rank decreases with $n$, though never below 5, the value indicated by Theorem 1.

Figure 3 gives the average partial dependence and ICE plots for $x_1$ for each of the models, comparing those trained on independent feature distributions to those with $\rho = 0.9$. For random forests, we observe what appears to be less attenuation of the relationship due to edge effects when $x_1$ and $x_2$ are correlated, but note that these will still be compared to partial dependence functions for features where the edge effects are attenuated. Neural network partial dependence exhibits greater variability when the features are correlated. This is particularly evident in the ICE plots for neural networks, where plots in the high-correlation case diverge considerably from the underlying relationship. Recall that each of these lines is an average over 50 replications, making the ICE for any individual fit highly variable. The problematic behavior of these plots is more apparent in Figure 6, where a lower-dimensional setting reduces both bias and variance.

Figure 3: Left plots: average partial dependence for $x_1$ over 50 simulations for both correlation parameters, and the standard deviation of the estimated partial dependence, showing the increase in variability in neural networks as correlation increases, respectively. Right plots: example ICE plots for $x_1$ for random forests (middle right) and neural networks (far right). Dotted lines give predictions for simulations with $\rho = 0$, dashed for $\rho = 0.9$; letters indicate pairing. Solid portions of lines give the range of $x_1$ conditional on the remaining features. Black lines indicate partial dependence functions; blue the underlying relationship.

4 Extrapolation and Explanations

The patterns observed in the results above have also been observed in, for example, Toloşi and Lengauer (2011), although the periodic reappearance of these observations does not appear to have reduced the use of these types of diagnostics. Furthermore, to this point, there has been relatively little explanation offered as to why this sort of behavior occurs.

As noted in Theorem 1, for linear models with standardized features, permutation-based importances report the square of the feature coefficients. Moreover, where the features are not correlated, random forest permutation importances essentially provide the same ranking as permutation importances applied to linear models. The over-emphasis of correlated features is caused by random forests’ (and neural networks’) need to extrapolate in order to generate predictions at those permuted locations.

To illustrate this point, consider the model $y = x_1 + \epsilon$, so that $x_2$ does not influence the response. We generate noise yielding a signal-to-noise ratio of 100/3 and learn a random forest based on 200 points with $(x_1, x_2)$ marginally uniformly distributed but associated through a Gaussian copula with correlation 0.9. This was repeated 100 times to stabilize the resulting estimates, and to allow us to investigate the between-simulation variability of our answers.

The left panel of Figure 4 plots the generated points, the points used to assess permutation importance, and the contours of the learned random forest. Here we observe two things:

  1. Although the true contours are given by vertical lines, the random forest only approximates these contours within the convex hull of the data. Outside this, the contours become more diagonal.

  2. While the data distribution is concentrated along the diagonal, the points used to evaluate permutation importance fill out the unit square.

Thus, much of the evaluation of permutation importance is based on values of $(x_1, x_2)$ that are far from the training data, where random forests are (unsurprisingly) poor at mimicking the underlying data-generating function.
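
One way to make "far from the training data" concrete is to measure how far the permuted query points lie from their nearest training observation. The sketch below is our own illustration on the two-feature setting above; nn_dist is a hypothetical helper rather than a function from any package.

```r
set.seed(1)
n  <- 200
z1 <- rnorm(n); z2 <- 0.9 * z1 + sqrt(1 - 0.81) * rnorm(n)
X  <- cbind(x1 = pnorm(z1), x2 = pnorm(z2))    # correlated Uniform(0, 1) features
Xp <- X; Xp[, "x2"] <- sample(Xp[, "x2"])      # permute x2 as VI would

# Distance from each query point to its nearest training point (exact matches excluded).
nn_dist <- function(Q, X) apply(Q, 1, function(q) {
  d <- sqrt(colSums((t(X) - q)^2))
  min(d[d > 0])
})
summary(nn_dist(X, X))    # original points: all close to the data "diagonal"
summary(nn_dist(Xp, X))   # permuted points: many land in the empty corners
```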

The middle plot of Figure 4 overlays a shaded contour with the splits obtained from 10 example trees selected at random from our forests. Here we observe that at the top-left and bottom-right corners, individual tree predictions are obtained from splits that extend either horizontally or vertically from the data distribution at the adjacent corners. Thus, as discussed, each tree will predict a local average taken from one of the two adjacent corners.

The right-hand plot of Figure 4 makes this reasoning more explicit: the permuted value moves the query from the original data in the bottom left to a point $\tilde{x}$ in the bottom right of the plot, where there are very few observations. In tree-based methods, predictions from each tree at $\tilde{x}$ take the form

$$\hat{f}(\tilde{x}) = \sum_{i=1}^{n} w_i(\tilde{x})\, y_i, \qquad w_i(\tilde{x}) = \frac{\mathbf{1}\{i \in L(\tilde{x})\}}{|L(\tilde{x})|}, \qquad (2)$$

where $L(\tilde{x})$ denotes the set of observations falling in the same leaf as $\tilde{x}$. The collection of response values for which the corresponding weight can be non-zero is a subset of all training observations, and such observations are referred to as the potential nearest neighbours (pNN) of $\tilde{x}$.

For correlated features, the potential nearest neighbours of $\tilde{x}$ include observations whose values are far from the original point in both coordinates; indeed this happens with substantial frequency among the trees in our forest. In these trees, the permuted prediction $\hat{f}(\tilde{x})$ is likely to be far from the original prediction, causing a large perceived importance for the feature even when it is irrelevant. By contrast, when the observations are more uniformly distributed over the feature space, the pNNs of $\tilde{x}$ will be localised around it and will have feature values very similar to $\tilde{x}$, making $\hat{f}(\tilde{x})$ a reasonable comparison to the original prediction.

In fact, almost any of the data points in the right-most plot of Figure 4 are pNNs of $\tilde{x}$. This can be seen by examining the set of rectangles that reach both the data distribution and the bottom-right corner of the plot. Moving along one edge or the other, however, the geometry of forming a rectangle that encompasses both a small number of data points and the prediction point restricts the potential nearest neighbours to those with larger values of both $x_1$ and $x_2$ (or smaller values for points above the diagonal).

In Figure 4, this argument explains the shift in contours away from the observed data and an increase in the importance measure for $x_2$. However, it is not sufficient to explain the joint increase in importance for both $x_1$ and $x_2$ when they have symmetric effects, as in the simulations in Section 2. To explain this effect, we also observe that the concentration of observations along the diagonal increases the overall signal jointly attributed to them, increasing the number of splits given to at least one of these two covariates and, in particular, reducing edge bias.

Figure 6 plots the average PD and ICE functions for random forests and neural networks as in Figure 3. Here, the effect of extrapolation is made more explicitly evident. ICE plots reflect the underlying relationship reasonably well in regions that are supported by the data, but can be very different outside of them. In the case of neural networks, the plots exhibit greater variability; in random forests, extrapolation bias is a larger source of concern, suggesting more significance for $x_2$ than is warranted.

Figure 4: Random forest extrapolation behavior and variable importance. Left: contours from the average of 100 random forests trained on data given by '*' with response given by $y = x_1 + \epsilon$; dots give the locations at which the forests are queried when calculating permutation-based variable importance. Middle: the boundaries between leaves for 10 example trees in the forest; when the forest is extrapolating, the trees are local averages based on training points from one or the other adjacent corner of the data distribution. Right: an illustration of the potential nearest neighbours of a query point used to determine variable importance.

In contrast to this, our characterization of neural network extrapolation is simply in terms of increased variability. When fitting high-dimensional polynomials, the phenomenon of observing large oscillations outside the range of the data is termed Gibbs effects and we suspect that a similar phenomenon occurs here. Figure 5 presents a contour plot of the average of 100 20-hidden-node neural networks trained on the same data as above along with the standard deviation between these models. Here, the lack of reliability in permutation importance and in partial dependence plots can be explained by the large increase in variance as we move away from the data distribution, allowing fluctuations in predictions to increase apparent variable importance.

Figure 5: Left: contour plot of the average of 100 neural networks trained on correlated data with response $y = x_1 + \epsilon$. Example data is given by '*', with dots indicating the points used to evaluate variable importance. Right: the standard deviation between the 100 neural networks.
Figure 6: Example ICE plots for models trained on data generated from the two-feature model above. Left pair: neural network. Right pair: random forest. Thick black lines give partial dependence; thin blue lines indicate the theoretical model.

5 Variable Importance Alternatives

As has been noted, these effects of correlation on variable importance have been observed in various forms by several authors in recent years, though we believe we are the first to explicitly provide an explanation for this behavior in terms of extrapolation. Similarly, there have been several proposed alternatives, generally in line with one of two ideas:

  1. Permuting (or otherwise generating) new values of feature $j$ based on its distribution conditional on the remaining features. This was suggested in Strobl et al. (2008) and is related to the use of “knockoffs” in Barber et al. (2015) and Candes et al. (2018), although the latter require somewhat more stringent conditions. Similar ideas also appear in Tuv et al. (2009) and (more specifically in the context of linear regression) in Wu et al. (2007). These are also mentioned briefly in Fisher et al. (2018), though not explored in detail.

  2. Removing feature $j$ from the data and examining the drop in accuracy when a new model is learned without it. This leave-one-covariate-out (LOCO) framework is well studied within classical statistical methods (e.g., Lehmann and Romano (2006)) and suggested more generally in Lei et al. (2018).

These methods can also be combined. Mentch and Hooker (2016) examined the change in prediction between two random forests, one trained on the original data, the other trained on data in which feature $j$ was permuted. Here the permutation served simply to make the feature irrelevant to the prediction, and the resulting models were both evaluated within the bulk of the training distribution. As the authors note, the choice to permute, rather than drop, a feature was made in order to maintain the same complexity in the fitting procedure and thus make the two predictions statistically comparable. Candes et al. (2018) add a set of knockoff features and examine multiple retraining methods.

In fact, these methods all produce similar variable importance scores. In our simulation in Section 2, we also computed the following variable importance measures that combine these ideas:

Conditional Variable Importance

measured by creating $X_{C,j}$, in which the $j$th feature is simulated conditional on the remaining features, and measuring

$$\mathrm{VI}^{C}_j = L\big(y, \hat{f}(X_{C,j})\big) - L\big(y, \hat{f}(X)\big).$$

We note that for the Gaussian copula model used here, the conditional distribution of $x_1$ given the remaining features can be explicitly computed.
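
For the Gaussian copula of Section 2 the conditional draw is available in closed form, so conditional importance for $x_1$ can be sketched directly. The function below uses our own notation and again assumes a model with a predict method and squared-error loss.

```r
# Conditional variable importance for x1 in the Section 2 simulation:
# x1 is resampled from its conditional distribution given x2 under the
# Gaussian copula (correlation rho), so no extrapolation is induced.
conditional_vi_x1 <- function(model, data, y, rho, n_rep = 10) {
  base <- mean((y - predict(model, newdata = data))^2)
  mean(replicate(n_rep, {
    z2 <- qnorm(data$x2)
    cond <- data
    cond$x1 <- pnorm(rnorm(nrow(data), mean = rho * z2, sd = sqrt(1 - rho^2)))
    mean((y - predict(model, newdata = cond))^2) - base
  }))
}
```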

Dropped Variable Importance

obtained from the increase in training error when learning a new model $\hat{f}_{-j}$ from the data with the $j$th feature removed:

$$\mathrm{VI}^{D}_j = L\big(y, \hat{f}_{-j}(X_{-j})\big) - L\big(y, \hat{f}(X)\big).$$

This is equivalent to the LOCO methods explored in Lei et al. (2018).

Permute-and-Relearn Importance

obtained by permuting the $j$th feature and learning a new model $\hat{f}_{\pi,j}$ from $(X_{\pi,j}, y)$, giving

$$\mathrm{VI}^{L}_j = L\big(y, \hat{f}_{\pi,j}(X_{\pi,j})\big) - L\big(y, \hat{f}(X)\big).$$

This was the approach taken in Mentch and Hooker (2016), in which distributional results for random forests were used to assess the statistical significance of $\mathrm{VI}^{L}_j$.
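
A minimal sketch of the two retrain-based measures defined above, assuming learner is any function that refits the model on a modified data frame containing the response column y (for example, function(d) lm(y ~ ., data = d)); the function names and the use of squared-error training loss are our own choices.

```r
# Retrain-based importances: both refit the model after modifying feature j,
# then compare training loss with that of the model fit to the original data.
train_loss <- function(model, d) mean((d$y - predict(model, newdata = d))^2)

loco_importance <- function(learner, d, j) {
  full    <- train_loss(learner(d), d)
  reduced <- d[, setdiff(names(d), j)]       # drop feature j entirely
  train_loss(learner(reduced), reduced) - full
}

permute_relearn_importance <- function(learner, d, j) {
  full <- train_loss(learner(d), d)
  dp   <- d
  dp[[j]] <- sample(dp[[j]])                 # feature j becomes pure noise
  train_loss(learner(dp), dp) - full
}
```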

Condition-and-Relearn Importance

in which a new model $\hat{f}_{C,j}$ is learned from the data $(X_{C,j}, y)$ and we measure

$$\mathrm{VI}^{CL}_j = L\big(y, \hat{f}_{C,j}(X_{C,j})\big) - L\big(y, \hat{f}(X)\big).$$

These measures all necessarily entail additional computational costs: they require training a new model, and/or estimating, and simulating from, the distribution of $x_j$ given the remaining features. In our simulation this distribution can be computed analytically, but that will very rarely be the case.

As we can see in Figure 7, these measures all agree (up to variability in the procedure) on the ordering of covariate importance across all models. Unlike the permute-and-repredict measures, they reduce the importance of $x_1$ and $x_2$ when these become more correlated. For LOCO methods, this can be explained intuitively by each of the pair being able to account for some of the signal coming from the other. For conditional importance measures, it is associated with the conditional distribution of $x_1$ given $x_2$ having a much smaller variance than the marginal distribution of $x_1$ (see, for example, the points in Figure 5).

This is, in fact, what we should expect. For linear models, Theorem 2 states that these all have approximately the same expectation:

Theorem 2.

For $f(x) = \beta_0 + \sum_j \beta_j x_j$ fit by least-squares with linear dependence between features, write $x_j = X_{-j}\gamma + e_j$ and let $e_j$ be the least-squares residuals from predicting $x_j$ using the remaining features. Then the expectations of $\mathrm{VI}^{C}_j$, $\mathrm{VI}^{D}_j$, $\mathrm{VI}^{L}_j$ and $\mathrm{VI}^{CL}_j$ are all determined, up to a small additional term, by $\beta_j^2$ multiplied by the sample variance of $e_j$.

Here the additional term comes from estimating coefficients with a random covariate matrix; when the importances are normalized by the noise variance, this term is generally small.

All of these methods exhibit the compensation effect, reducing the importance of correlated features relative to their coefficients. Theorem 2 also suggests a way to correct for this effect by dividing by the appropriate quantity, if desired. However, these results are only exact for linear models, and the most reliable diagnostic that we know of is to examine the effect of jointly removing or permuting pairs of features and then re-learning. An extension of Theorem 1 yields a joint importance for the pair $(x_1, x_2)$, which is not recoverable from any of the univariate importance measures discussed in this paper.
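
A joint importance for a pair of features, obtained by dropping both columns and re-learning, is a one-line extension of the sketch above (again our own illustration, using the same hypothetical learner argument):

```r
# Joint importance of a pair of features: drop both columns and refit.
loco_pair_importance <- function(learner, d, pair) {
  full    <- mean((d$y - predict(learner(d), newdata = d))^2)
  reduced <- d[, setdiff(names(d), pair)]
  mean((reduced$y - predict(learner(reduced), newdata = reduced))^2) - full
}
# e.g. loco_pair_importance(function(d) lm(y ~ ., data = d), dat, c("x1", "x2"))
```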

Figure 7: Average rank of feature importance computed by various measures for random forests (left), neural networks (middle) and linear models (right), based on 10 simulated data sets of size 200. Features were generated independently (top) and with $x_1$ and $x_2$ generated from a Gaussian copula with correlation parameter $\rho$ (bottom).

A further set of approaches to variable importance use the behavior of $\hat{f}$ close to values of $x$ in the data. Saliency (Simonyan et al., 2013) obtains derivatives with respect to $x$, while LIME (Ribeiro et al., 2016) approximates this and carries out some feature selection. These do not suffer from extrapolation, but may not provide the same global-level picture of behavior that permutation methods purport to. If $\hat{f}$ essentially encodes a threshold with a transition point that is not close to observed data, we will see low saliency at all observed points, even though there may be a large range of predictions.

5.1 Real World Data: Bike Sharing

To illustrate the real-world impact of the difference among these importance measures, Figure 8 presents the importance rankings of features in the hourly bike share data reported in Fanaee-T and Gama (2013), as stored on the UCI repository. Here we predicted the log number of rentals each hour from 14 features using the randomForest package in R. We report the importance rank (ordered from least to most) of each feature as calculated by the default out-of-bag permutation measure, and as calculated by $\mathrm{VI}^{L}_j$ – permuting the $j$th feature, but then learning a new model before assessing the change in accuracy. $\mathrm{VI}^{L}_j$ was chosen because it requires only re-using the current learning method and maintains the dimension of the feature space.
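
For readers wishing to reproduce a comparison of this kind, the sketch below outlines the computation in R. It assumes the hourly file hour.csv from the UCI Bike Sharing repository has been downloaded to the working directory; the exact feature set and preprocessing behind Figure 8 may differ from this sketch.

```r
library(randomForest)

# Assumes hour.csv from the UCI Bike Sharing data set is in the working directory.
bike <- read.csv("hour.csv")
bike <- bike[, !(names(bike) %in% c("instant", "dteday", "casual", "registered"))]
bike$y   <- log(bike$cnt)            # log hourly count (add a constant if zeros occur)
bike$cnt <- NULL

rf <- randomForest(y ~ ., data = bike, importance = TRUE, ntree = 100)
oob_rank <- rank(importance(rf, type = 1)[, 1])   # OOB permutation importance ranks

# Permute-and-relearn rank for each feature (ntree kept modest to limit run time)
full_loss <- mean((bike$y - predict(rf, bike))^2)
pl <- sapply(setdiff(names(bike), "y"), function(j) {
  d <- bike; d[[j]] <- sample(d[[j]])             # permute feature j, then relearn
  m <- randomForest(y ~ ., data = d, ntree = 100)
  mean((d$y - predict(m, d))^2) - full_loss
})
pl_rank <- rank(pl)                               # compare with oob_rank
```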

Here we observe that while many of the features exhibit comparable ranks, there are some notable disagreements; temp, in particular, appears important in the OOB measures but quite unimportant otherwise. A screening tool based on OOB importance might have selected it rather than yr to include in candidate features, possibly harming subsequent analysis.

Figure 8: A real-world comparison of OOB variable importance ranks (x-axis) with those obtained by permute-and-re-learn importance (y-axis) using random forests on the bikeshare data.

6 Partial Dependence Alternatives

While restricting attention to the data distribution suggests a number of ways to redesign feature importance measures, as illustrated in Section 5, this is less straightforward for plots of effects. We now briefly discuss some alternatives that avoid extrapolation.

ICE plots can be restricted to a range of values consistent with the data distribution. In Figure 3, we have indicated regions with high data density by making the ICE plot lines solid over this portion of the range of $x_1$. This device provides local information, but also serves to demonstrate the extent of the dependence of $x_1$ on the other covariates. However, we note here that we need some means of determining an appropriate range of values for each observation. In Figure 3, we have obtained this from 2 standard deviations within the Gaussian copula used to generate the data. However, lacking this information, obtaining these bounds requires us to estimate a conditional distribution.
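
When the copula is known, as in our simulations, these bounds are available in closed form. The helper below is our own sketch of the 2-standard-deviation range of $x_1$ given $x_2$ under a Gaussian copula with correlation $\rho$, working on the normal scale and mapping back to [0, 1].

```r
# Conditional 2-sd range of x1 given x2 under a Gaussian copula with
# correlation rho: transform to the normal scale, then map the bounds back.
ice_support_x1 <- function(x2, rho) {
  z2 <- qnorm(x2)
  cbind(lower = pnorm(rho * z2 - 2 * sqrt(1 - rho^2)),
        upper = pnorm(rho * z2 + 2 * sqrt(1 - rho^2)))
}
# e.g. draw the ICE curve for observation i only over the interval
#      given by ice_support_x1(x2[i], rho = 0.9)
```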

Figure 6 suggests modifying partial dependence plots to average only the solid parts of each ICE line. This could be strongly misleading, attributing to $x_1$ changes that are in fact due to $x_2$. Hooker (2007) provides a re-interpretation of permutation diagnostics in terms of a functional ANOVA decomposition into functions of increasing dimension

$$f(x) = f_0 + \sum_{j} f_j(x_j) + \sum_{j < k} f_{jk}(x_j, x_k) + \cdots \qquad (3)$$

and proposed specifying these components by an optimal low-order approximation to $f$ under the feature distribution. Equivalents of partial dependence functions can then be obtained by finding functions $f_j$ and $f_{-j}$ to minimize

$$\int \big(f(x) - f_j(x_j) - f_{-j}(x_{-j})\big)^2\, w(x)\, dx,$$

in which $w(x)$ approximates the feature distribution and $f_{-j}$ is an unknown function of all features except $x_j$. Hooker (2007) minimized a quasi-Monte Carlo approximation to this integral, but required an estimate of $w(x)$, which will likely not be accurate in high dimensions. Chastaing et al. (2012) modified the estimation technique to include smoothing methods. Similar structures were used in Mentch and Hooker (2017) to test for feature interactions.

A key problem for these methods is the need to jointly estimate $f_j$ and the high-dimensional $f_{-j}$, requiring specialized learning methods. Lou et al. (2013) and Tan et al. (2017) instead fit a generalized additive model (Wood, 2006) that includes only one-dimensional functions on the right-hand side of (3). These can be fit to the data directly, or to the values of $\hat{f}$ as a model distillation method. In Figure 3, our linear model fits explicitly fall within this class and provide a coherent, extrapolation-free representation of the underlying function.
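
As a simple distillation of this kind, one can regress the black-box predictions on smooth one-dimensional functions of each feature and plot the fitted components in place of partial dependence curves. The sketch below uses the mgcv package as one convenient choice (Lou et al. and Tan et al. use their own additive learners), assumes numeric features, and uses our own function name.

```r
library(mgcv)

# Distill a black-box model into an additive surrogate: regress its predictions
# on smooth one-dimensional functions of each (numeric) feature, then plot the
# fitted component functions instead of partial dependence curves.
distill_gam <- function(model, data) {
  data$yhat <- predict(model, newdata = data)
  features  <- setdiff(names(data), "yhat")
  gam(reformulate(paste0("s(", features, ")"), response = "yhat"), data = data)
}
# plot(distill_gam(fit, X), pages = 1)   # one smooth component per feature
```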

7 Conclusions

While naïve permutation-based methods can be appealing, we have shown with some quite simple examples that they can give misleading results. We identify the extrapolation behavior of flexible models as a significant source of error in these diagnostics. The precise biases that permutation methods produce will depend on the learning method employed as well as on the specifics of the dependence among the features and of the response function.

Alternative methods for summarizing feature importance all require further computational effort. Our $\mathrm{VI}^{L}_j$ still employs permutations, but learns a new model after permuting the feature; Mentch and Hooker (2016) found that by maintaining the feature dimension, this made for a better statistical comparison than dropping a feature and re-learning as we have done in $\mathrm{VI}^{D}_j$. Methods that avoid re-learning the response can be based on conditional permutation or simulation, but in general these require a model for the distribution of $x_j$ given the remaining features; see Liu and Zheng (2018) for recent developments in this direction. Alternative measures of importance include generalizations of Sobol indices (Hooker, 2007) and Shapley values (Owen, 2014).

Beyond assigning importance scores, the use of permutation methods in PD and ICE plots is concerning for the same reasons. Alternatives to these are more challenging. Local explanation methods avoid extrapolation, but do not provide a global representation of the learned model over the whole range of feature values. We also note that the counterfactual explanation methods explored in Wachter et al. (2017) may pose similar extrapolation problems, but we have not explored this here.

As an alternative visual diagnostic, the additive models explored in Tan et al. (2017) produce effective graphical displays. Tan et al. (2018) found that these better represented the over-all behavior of the function than methods based on combining local explanations. However, specialized methods are still required to employ the diagnostics suggested in Hooker (2007).

References

  • Archer and Kimes (2008) Archer, K. J. and R. V. Kimes (2008). Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis 52(4), 2249–2260.
  • Barber et al. (2015) Barber, R. F., E. J. Candès, et al. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics 43(5), 2055–2085.
  • Breiman (2001) Breiman, L. (2001). Random Forests. Machine Learning 45, 5–32.
  • Candes et al. (2018) Candes, E., Y. Fan, L. Janson, and J. Lv (2018). Panning for gold: “Model-X” knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(3), 551–577.
  • Chastaing et al. (2012) Chastaing, G., F. Gamboa, C. Prieur, et al. (2012). Generalized Hoeffding–Sobol decomposition for dependent variables – application to sensitivity analysis. Electronic Journal of Statistics 6, 2420–2448.
  • Fanaee-T and Gama (2013) Fanaee-T, H. and J. Gama (2013). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 1–15.
  • Fisher et al. (2018) Fisher, A., C. Rudin, and F. Dominici (2018). Model class reliance: Variable importance measures for any machine learning model class, from the “Rashomon” perspective. arXiv preprint arXiv:1801.01489.
  • Friedman (2001) Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
  • Goldstein et al. (2015) Goldstein, A., A. Kapelner, J. Bleich, and E. Pitkin (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics 24(1), 44–65.
  • Gregorutti et al. (2015) Gregorutti, B., B. Michel, and P. Saint-Pierre (2015). Grouped variable importance with random forests and application to multiple functional data analysis. Computational Statistics & Data Analysis 90, 15–35.
  • Hoeffding (1948) Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics 19(3), 293–325.
  • Hooker (2007) Hooker, G. (2007). Generalized functional anova diagnostics for high-dimensional functions of dependent variables. Journal of Computational and Graphical Statistics 16(3).
  • Lehmann and Romano (2006) Lehmann, E. L. and J. P. Romano (2006). Testing statistical hypotheses. Springer Science & Business Media.
  • Lei et al. (2018) Lei, J., M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association 113(523), 1094–1111.
  • Liu and Zheng (2018) Liu, Y. and C. Zheng (2018). Auto-encoding knockoff generator for fdr controlled variable selection. arXiv preprint arXiv:1809.10765.
  • Lou et al. (2013) Lou, Y., R. Caruana, J. Gehrke, and G. Hooker (2013). Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 623–631. ACM.
  • Mentch and Hooker (2016) Mentch, L. and G. Hooker (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research 17(1), 841–881.
  • Mentch and Hooker (2017) Mentch, L. and G. Hooker (2017). Formal hypothesis tests for additive structure in random forests. Journal of Computational and Graphical Statistics 26(3), 589–597.
  • Nelsen (2007) Nelsen, R. B. (2007). An introduction to copulas. Springer Science & Business Media.
  • Nicodemus et al. (2010) Nicodemus, K. K., J. D. Malley, C. Strobl, and A. Ziegler (2010). The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC bioinformatics 11(1), 110.
  • Owen (2014) Owen, A. B. (2014). Sobol' indices and Shapley value. SIAM/ASA Journal on Uncertainty Quantification 2(1), 245–251.
  • Ribeiro et al. (2016) Ribeiro, M. T., S. Singh, and C. Guestrin (2016). “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM.
  • Simonyan et al. (2013) Simonyan, K., A. Vedaldi, and A. Zisserman (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
  • Strobl et al. (2008) Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008). Conditional variable importance for random forests. BMC bioinformatics 9(1), 307.
  • Strobl et al. (2007) Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics 8(1), 25.
  • Tan et al. (2018) Tan, S., R. Caruana, G. Hooker, P. Koch, and A. Gordo (2018). Learning global additive explanations for neural nets using model distillation. arXiv preprint arXiv:1801.08640.
  • Tan et al. (2017) Tan, S., R. Caruana, G. Hooker, and Y. Lou (2017). Distill-and-compare: Auditing black-box models using transparent model distillation.
  • Toloşi and Lengauer (2011) Toloşi, L. and T. Lengauer (2011). Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 27(14), 1986–1994.
  • Tuv et al. (2009) Tuv, E., A. Borisov, G. Runger, and K. Torkkola (2009). Feature selection with ensembles, artificial variables, and redundancy elimination. Journal of Machine Learning Research 10(Jul), 1341–1366.
  • Wachter et al. (2017) Wachter, S., B. Mittelstadt, and C. Russell (2017). Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology 31, 841.
  • Wood (2006) Wood, S. N. (2006). Generalized additive models: an introduction with R. Chapman and Hall/CRC.
  • Wu et al. (2007) Wu, Y., D. D. Boos, and L. A. Stefanski (2007). Controlling variable selection by the addition of pseudovariables. Journal of the American Statistical Association 102(477), 235–243.
  • Zhou and Hooker (2019) Zhou, Z. and G. Hooker (2019). Unbiased measurement of feature importance in tree-based methods. arXiv preprint arXiv:1903.05179.

Appendix A Proofs of Results

Proof of Theorem 1: We observe that $y_i - \hat{f}(x_{\pi(i),j}, x_{i,-j}) = r_i + \hat{\beta}_j(x_{ij} - x_{\pi(i)j})$, where $r_i = y_i - \hat{f}(x_i)$ are the least-squares residuals, and write

$$\mathrm{VI}_j = \frac{2\hat{\beta}_j}{n}\sum_{i=1}^{n} r_i\,(x_{ij} - x_{\pi(i)j}) + \frac{\hat{\beta}_j^2}{n}\sum_{i=1}^{n} (x_{ij} - x_{\pi(i)j})^2.$$

Examining the second term, standard results for uniformly random permutations give that $E_\pi x_{\pi(i)j} = \bar{x}_j$ and $E_\pi x_{\pi(i)j}^2 = \frac{1}{n}\sum_k x_{kj}^2$, so that its expectation over permutations is $\frac{2\hat{\beta}_j^2}{n}\sum_i (x_{ij} - \bar{x}_j)^2$. Adding and subtracting $\bar{x}_j$, the first term can be seen to have expectation zero, since the residuals are orthogonal to both the $j$th column of $X$ and the intercept, yielding the first result.

The remaining identities are the result of direct calculation.

Proof of Theorem 2:

Observe that for any invertible matrix $A$, regressing $y$ on $XA$ is equivalent to regressing $y$ on $X$.

For simplicity, we will assume that the feature matrix includes a column of 1's to allow for an intercept. Writing $x_j = X_{-j}\gamma + e_j$, we obtain a reparameterization of the regression in which $e_j$ is orthogonal to the columns of $X_{-j}$, as is the least-squares residual vector. Then, writing the change in loss in terms of this reparameterization, the cross term is zero by standard results and we obtain the result by substitution.

The result for $\mathrm{VI}^{D}_j$ can be shown by first centering the features, after which removing $x_j$ removes exactly the contribution of $e_j$ to the fit. Hence the increase in training error is governed by $\hat{\beta}_j$ and $e_j$, and the result follows from the calculation above.

The result for $\mathrm{VI}^{CL}_j$ is obtained by the same calculation, after subtracting $X_{-j}\gamma$ from $x_j$.

The result for $\mathrm{VI}^{L}_j$ follows from the calculations in the proof of Theorem 1, after observing that the permuted column is, in expectation, uncorrelated with the remaining features.