Many machine learning (ML) interpretation methods (see [18, 10] for an overview) are based on making predictions on perturbed input features, e.g., by permuting feature values. The partial dependence plot (PDP)  and permutation feature importance (PFI) 
perturb individual features without conditioning on the remaining features, i.e., feature values are changed while ignoring the joint distribution. If features are dependent, such perturbations will cause predictions that extrapolate to areas of the feature space with low density. Extrapolation can result in misleading interpretations (see Fig. 1).
An obvious approach to avoid extrapolation would be to perturb a feature conditional on all other features and thereby preserve the joint distribution. The interpretation of conditional feature effect and importance differ from the unconditional variants (see Fig. 2). The conditional effect of a feature is a mixture of its unconditional effect and the unconditional effects of all dependent features. The conditional PFI must be interpreted as the importance of a feature given the other features. If two features are highly dependent, their conditional importance is lower than their unconditional importance, because their shared information can be substituted by the other feature. For global interpretation methods, there is a trade-off between avoiding extrapolation and unconditional interpretation of feature effects and importance.
We propose novel, model-agnostic variants of the conditional PDP and conditional PFI based on interpretable subgroups. Our approach constructs subgroups in which the feature of interest is independent of the other features and permutes feature values within these subgroups. Subgroup permutation greatly reduces extrapolation while maintaining an unconditional interpretation within the subgroups. Furthermore, we introduce a data fidelity measure that quantifies the ability of an interpretation method to preserve the data distribution. Using simulated and real data, we show that conditional subgroup permutation achieves state-of-the-art data fidelity. We compare our conditional subgroup PFI with the true cPFI in a simulation and demonstrate state-of-the-art performance. In an application, we illustrate how our conditional PDP and PFI can reveal new insights into the ML model and the data.
2 Notation and Background
We consider ML prediction functions $\hat{f}: \mathbb{R}^p \rightarrow \mathbb{R}$, where $\hat{f}(x)$ is a model prediction and $x \in \mathbb{R}^p$ is a $p$-dimensional feature vector. We write $x_j$ for an observed single feature (vector) and $X_j$ to refer to the $j$-th feature as a random variable. With $X_{-j}$ we refer to the complementary feature space of all features except the $j$-th. We refer to the value of the $j$-th feature of the $i$-th instance as $x_j^{(i)}$ and to the tuples $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ as data.
Permutation Feature Importance (PFI) for a feature $j$ is estimated as the average increase in prediction loss when the feature is permuted in training or test data:

$$\widehat{PFI}_j = \frac{1}{m} \sum_{k=1}^{m} \frac{1}{n} \sum_{i=1}^{n} \left( L\left(y^{(i)}, \hat{f}\left(x_j^{(\tau_k(i))}, x_{-j}^{(i)}\right)\right) - L\left(y^{(i)}, \hat{f}\left(x^{(i)}\right)\right) \right) \quad (1)$$

where $\tau_k$ is a permutation of the indices $\{1, \ldots, n\}$ and $m$ is the number of repeated permutations. Numerous variations of this formulation exist. Breiman
proposed the PFI for random forests, which is computed from the out-of-bag samples of individual trees. Subsequently, Fisher et al. [Fisher2018] introduced a model-agnostic PFI version.
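The model-agnostic PFI described above can be sketched in a few lines of Python. This is our own minimal illustration, not the paper's implementation; the function name, the toy model and the loss are assumptions made for the example:

```python
import numpy as np

def permutation_importance(predict, X, y, j, loss, m=5, rng=None):
    """Model-agnostic PFI: mean increase in loss when column j is permuted."""
    rng = np.random.default_rng(rng)
    base = loss(y, predict(X))
    increases = []
    for _ in range(m):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # marginal (unconditional) permutation
        increases.append(loss(y, predict(Xp)) - base)
    return float(np.mean(increases))

# Toy check: the model uses only feature 0, so feature 1 must get importance 0.
X = np.random.default_rng(0).normal(size=(500, 2))
y = 3 * X[:, 0]
predict = lambda X: 3 * X[:, 0]
mse = lambda y, yhat: float(np.mean((y - yhat) ** 2))
pfi_0 = permutation_importance(predict, X, y, j=0, loss=mse, rng=1)
pfi_1 = permutation_importance(predict, X, y, j=1, loss=mse, rng=1)
```

Because the model ignores feature 1, permuting it leaves the predictions unchanged and its importance is exactly zero, while feature 0 receives a clearly positive importance.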
The Partial Dependence Plot (PDP)  describes the average effect of the $j$-th feature on the prediction. The PDP evaluated at feature value $x_j$ is:

$$\widehat{PDP}_j(x_j) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(x_j, x_{-j}^{(i)}\right) \quad (2)$$
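The PDP estimate simply averages predictions over the data while fixing $x_j$ at each grid value. A minimal sketch (our own helper, not code from the paper):

```python
import numpy as np

def partial_dependence(predict, X, j, grid):
    """PDP: average prediction with feature j fixed at each grid value."""
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v  # replace feature j everywhere, ignoring dependencies
        pd.append(predict(Xv).mean())
    return np.array(pd)

# For a linear model, the PDP of feature 0 is a line with the model's slope,
# shifted by the mean contribution of the other feature.
X = np.random.default_rng(0).normal(size=(300, 2))
predict = lambda X: 2 * X[:, 0] + X[:, 1]
grid = np.linspace(-2, 2, 9)
pdp = partial_dependence(predict, X, 0, grid)
```

Note that `Xv[:, j] = v` is exactly the marginal perturbation criticized in the introduction: if features are dependent, some rows of `Xv` may lie in low-density regions.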
3 Related Work
The marginal plot (M-Plot)  averages the predictions locally on the feature grid and mixes effects of dependent features (see Fig. 2).
Hooker (2007)  proposed a functional ANOVA decomposition with hierarchically orthogonal components. The decomposition requires access to the joint distribution of the data. The approach has the undesirable property that, e.g., in a linear model the coefficients are not recovered when features are correlated. Accumulated Local Effect (ALE) plots by Apley and Zhu  reduce extrapolation by accumulating the finite differences computed within intervals of the feature of interest. Interpretations of ALE plots are, by definition, only locally valid. Furthermore, there is no satisfactory approach to derive ALE plots for categorical features, since ALE requires ordered feature values. Our proposed approach can handle categorical features.
Another PDP variant based on stratification was proposed by . However, this stratified PDP describes only the data and is independent of the model.
Conditional PFI. Strobl et al.  proposed a conditional PFI for the random forest. While their approach relies on the splits of the underlying random forest trees and permutes the features within the resulting subgroups, we construct the subgroups explicitly from the conditional distribution of the features in a model-agnostic way.
Hooker and Mentch  suggested four methods for conditional feature importance: Conditional Variable Importance, Dropped Variable Importance, Permute-And-Relearn Importance and the Condition-and-Relearn Importance. Our proposed method is a variant of the Conditional Variable Importance measure based on interpretable subgroups.
Knockoffs are random variables that are "copies" of the original features: they preserve the joint distribution but are otherwise independent of the prediction target. Knockoffs can be used to replace feature values for conditional feature importance computation. Candes et al.  proposed knockoffs based on the correlation structure of the features. Others have proposed using generative adversarial networks to generate knockoffs . Knockoffs are not transparent with respect to how they condition on the features, whereas we report interpretable subgroups.
4 Conditional Subgroups
PFI and PDPs are based on sampling from marginal feature distributions which causes extrapolation when features are dependent .
Conditional variants of PFI and PDPs (see Section 3) avoid extrapolation by sampling from distributions conditional on the remaining features.
However, with conditional PFI and conditional PDP, the dependencies between $X_j$ and $X_{-j}$ can influence the sample, leading to an interpretation that mixes properties of the model with properties of the data .
We suggest approaching the dependent feature problem by constructing an interpretable grouping $G_j$ such that the feature of interest is independent of the remaining features within each subgroup, i.e., $X_j \perp X_{-j} \mid G_j$. Sampling from the group-wise marginal distribution reduces extrapolation (Fig. 3).
Within a group, samples from the marginal and the conditional distribution coincide.
The grouping consequently enables (1) the application of standard PFI and PDPs within each group without extrapolation and (2) sampling from the global conditional distribution and using group-wise permutation.
With our approach we exploit these properties to derive both (1) group-wise unconditional and (2) global conditional interpretations.
The group-wise unconditional PFIs and PDPs can be seen as a decomposition of the global conditional interpretation.
To get a good approximation of the marginal distribution in a group, the group should contain sufficient observations. Moreover, the groupings should be human-intelligible. Existing approaches that model the conditional distribution for interpretation [4, 24, 1] do not provide such a coarse, explicit interpretable grouping.
Transformation trees: We use transformation trees  to model the conditional distribution of the feature of interest $X_j$ given the remaining features $X_{-j}$. This approach partitions the feature space so that the distribution of $X_j$ within the resulting subgroups is homogeneous, i.e., the group-wise parameterization of the modeled distribution is independent of $X_{-j}$. By specifying a maximum tree depth or the minimum number of observations within a node, the granularity of the partitioning can be traded off against the homogeneity of distributions within a partition. Partitions can be described by the conditions that determine their boundaries, e.g., in the form of the decision path. We leverage this partitioning to construct an interpretable grouping $G_j$. The new variable is calculated by assigning every observation the indicator of the partition that it lies in (i.e., for an observation $x^{(i)}$ falling into the $k$-th partition, the group variable's value is defined as $G_j^{(i)} = k$).
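Transformation trees are implemented in R (the `trtf` package); there is no established Python equivalent. As a deliberately simplified, hypothetical stand-in for intuition only, the sketch below partitions on the complementary feature most correlated with $x_j$ using quantile splits. Unlike transformation trees, which partition on all of $X_{-j}$ so that the full distribution of $x_j$ is homogeneous per leaf, this only captures a single linear dependency:

```python
import numpy as np

def subgroup_ids(X, j, n_groups=4):
    """Quantile-split stand-in for transformation trees: split on the
    complementary feature most correlated with x_j. A rough sketch only;
    transformation trees model the full conditional distribution of x_j."""
    rest = [c for c in range(X.shape[1]) if c != j]
    corr = [abs(np.corrcoef(X[:, c], X[:, j])[0, 1]) for c in rest]
    split_col = rest[int(np.argmax(corr))]
    edges = np.quantile(X[:, split_col], np.linspace(0, 1, n_groups + 1)[1:-1])
    return np.digitize(X[:, split_col], edges)  # subgroup label per observation

# With x_1 strongly driven by x_0, grouping shrinks the within-group variance of x_1.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
X = np.column_stack([x0, x0 + 0.1 * rng.normal(size=500)])
g = subgroup_ids(X, j=1)
within_var = np.mean([X[g == k, 1].var() for k in np.unique(g)])
```

Within each quantile bin, the distribution of the feature of interest is much narrower than its marginal distribution, which is exactly the homogeneity property the grouping should achieve.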
In theory, the approach faces two challenges. First, not every distribution can be perfectly partitioned into homogeneous and interpretable parts (e.g., in the case of linear Gaussian dependencies). However, the granularity of the grouping can be adjusted using the model's hyperparameters, and as empirical results show, the method's performance is equal to or better than existing approaches in ground-truth evaluations (Section 6.3). Second, the distribution we specify for the model needs to be able to capture the dependencies between $X_j$ and $X_{-j}$ for the conditional independence $X_j \perp X_{-j} \mid G_j$ to hold. However, the approach is in principle agnostic to the specified distribution, and the default transformation family of distributions is very general, as empirical results suggest . In most settings, it is therefore reasonable to assume that the conditional independence holds approximately. For more detailed explanations of transformation trees, please refer to .
For the remainder of this paper, we set the minimum number of observations in a node to 30 and used Bernstein polynomials of degree five for the transformation function with the Normal distribution as target distribution. We denote the subgroups by $\mathcal{G}_j^k$, where $\mathcal{G}_j^k$ is the $k$-th subgroup for feature $j$, with $K_j$ groups in total for the $j$-th feature. The subgroups are disjoint: $\mathcal{G}_j^k \cap \mathcal{G}_j^l = \emptyset$ for $k \neq l$, and together they cover the entire feature space. Let $\mathcal{D}_j^k$ be the subset of $\mathcal{D}$ that refers to the data belonging to the subgroup $\mathcal{G}_j^k$.
4.1 Conditional Permutation Feature Importance
We estimate the PFI of feature $j$ within a subgroup $\mathcal{G}_j^k$ as $\widehat{PFI}_j^k = \frac{1}{n_k} \sum_{i \in \mathcal{D}_j^k} \left( L\left(y^{(i)}, \hat{f}\left(\tilde{x}_j^{(i)}, x_{-j}^{(i)}\right)\right) - L\left(y^{(i)}, \hat{f}\left(x^{(i)}\right)\right) \right)$, where $\tilde{x}_j^{(i)}$ refers to the permutation of $x_j^{(i)}$ within the subgroup $\mathcal{G}_j^k$. Algorithm 1 describes the cPFI estimation for one feature in detail on unseen data.
The algorithm has two outcomes: We get importance values for feature $j$ for each subgroup ($\widehat{PFI}_j^k$) and a global conditional feature importance ($\widehat{cPFI}_j$). The latter is equivalent to the average of the subgroup importances, weighted by the number of observations within each subgroup (Appendix 0.A).
The cPFI needs the same number of model evaluations as the PFI.
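The within-group permutation scheme can be sketched as follows. Again, this is our own illustration under assumed names, not the paper's Algorithm 1; `groups` is any subgroup labeling, e.g. as produced by transformation trees:

```python
import numpy as np

def cs_permutation_importance(predict, X, y, j, groups, loss, m=5, rng=None):
    """Conditional subgroup PFI: permute feature j only *within* each
    subgroup, then average the loss increase over repetitions. This equals
    the group-size-weighted average of the per-subgroup PFIs."""
    rng = np.random.default_rng(rng)
    base = loss(y, predict(X))
    increases = []
    for _ in range(m):
        Xp = X.copy()
        for k in np.unique(groups):
            idx = np.where(groups == k)[0]
            Xp[idx, j] = rng.permutation(X[idx, j])  # within-group permutation
        increases.append(loss(y, predict(Xp)) - base)
    return float(np.mean(increases))

# Sanity check: an unused feature keeps importance 0 even under grouping,
# while the feature the model relies on gets a clearly positive importance.
rng0 = np.random.default_rng(0)
X = rng0.normal(size=(400, 2))
y = 3 * X[:, 0]
groups = (X[:, 1] > 0).astype(int)
predict = lambda X: 3 * X[:, 0]
mse = lambda y, yhat: float(np.mean((y - yhat) ** 2))
cpfi_0 = cs_permutation_importance(predict, X, y, 0, groups, mse, rng=1)
cpfi_1 = cs_permutation_importance(predict, X, y, 1, groups, mse, rng=1)
```

The only difference to unconditional PFI is the inner loop: shuffling is restricted to rows sharing a subgroup label, which preserves the dependence between $x_j$ and $x_{-j}$ up to the subgroup resolution.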
4.2 Conditional Partial Dependence Plot (cPDP)
The conditional PDP has a different interpretation than the unconditional PDP, as the motivating example shows (Fig. 2). The proposed PDP variant solves the problem of extrapolation while allowing an unconditional interpretation within each subgroup. Since the marginal and conditional distributions coincide within groups, we compute the PDP for each group using the (unconditional) standard PDP formula in Equation 2. This results in multiple PDPs per feature, which can be displayed together in the same plot as in Fig. 8.
Again, we do not only get the groupwise result. We can aggregate subgroup PDPs to yield the conditional PDP (cPDP). A proof is given in Appendix 0.B.
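Computing one standard PDP per subgroup, restricted to that subgroup's data, can be sketched as follows (our own helper under assumed names):

```python
import numpy as np

def subgroup_pdps(predict, X, j, groups, n_grid=20):
    """One standard PDP per subgroup, each computed only on the subgroup's
    rows and restricted to the subgroup's own observed range of x_j, so no
    curve extrapolates beyond its group."""
    pdps = {}
    for k in np.unique(groups):
        Xk = X[groups == k]
        grid = np.linspace(Xk[:, j].min(), Xk[:, j].max(), n_grid)
        curve = []
        for v in grid:
            Xv = Xk.copy()
            Xv[:, j] = v  # standard PDP replacement, but within the group
            curve.append(predict(Xv).mean())
        pdps[k] = (grid, np.array(curve))
    return pdps

# For a model depending only on x_0, every subgroup PDP recovers the same
# effect curve over its own range.
X = np.random.default_rng(0).normal(size=(200, 2))
groups = (X[:, 1] > 0).astype(int)
pdps = subgroup_pdps(lambda X: 2 * X[:, 0], X, 0, groups)
```

Plotting all curves of `pdps` in one panel yields the kind of subgroup PDP display described below.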
We restrict each subgroup PDP to the interval $[\min_{i \in \mathcal{D}_j^k} x_j^{(i)}, \max_{i \in \mathcal{D}_j^k} x_j^{(i)}]$ of feature values observed within the subgroup. For our visualization, we suggest plotting the PDPs similar to boxplots, where the dense center quartiles are indicated with a bold line (see Fig. 4).
The subgroup PDPs do not break if features are independent.
When feature $X_j$ is independent of the features $X_{-j}$, each subgroup PDP has the same expectation as the unconditional PDP, and an $n/n_k$-times larger variance, where $n$ and $n_k$ are the number of observations in the data and in the subgroup $\mathcal{G}_j^k$, respectively.
The proof is shown in Appendix 0.C.
Equivalence in expectation and higher variance under independence of $X_j$ and $X_{-j}$ hold true even if the partitions were randomly chosen.
Assuming we perform the same number of permutations in both settings, both the PDP and the set of subgroup PDPs need the same number of model evaluations, since the subgroup datasets partition the data (with a worst case of evaluating at each observed feature value).
5 Data and Model Fidelity
5.1 Data Fidelity
PDP and PFI work by data perturbation, prediction and subsequent aggregation . We define a measure of data fidelity to quantify the ability to preserve the joint distribution under perturbation.
Definition 1 (Data Fidelity)
Data fidelity is the degree to which a perturbation $\tilde{X}_j$ of feature $X_j$ preserves the joint distribution, i.e., the degree to which $P(\tilde{X}_j, X_{-j}) = P(X_j, X_{-j})$.
This definition is similar to a property required for knockoffs, see e.g. . Based on data $\mathcal{D}$, perturbations create a new dataset $\tilde{\mathcal{D}}$ which is to be compared to the original data distribution. In this two-sample test scenario, the maximum mean discrepancy (MMD) can be used to compare whether two samples come from the same distribution. We propose to measure the data fidelity with the empirical MMD:

$$\widehat{MMD}^2(\mathcal{D}, \tilde{\mathcal{D}}) = \frac{1}{n^2} \sum_{i,l=1}^{n} k\left(x^{(i)}, x^{(l)}\right) + \frac{1}{n^2} \sum_{i,l=1}^{n} k\left(\tilde{x}^{(i)}, \tilde{x}^{(l)}\right) - \frac{2}{n^2} \sum_{i,l=1}^{n} k\left(x^{(i)}, \tilde{x}^{(l)}\right)$$

where $\mathcal{D}$ is the original dataset and $\tilde{\mathcal{D}}$ a dataset with perturbed $x_j$. As kernel $k$ we used the radial basis function kernel for all experiments. We require the features to be scaled to a mean of zero and a standard deviation of one. Categorical features are one-hot encoded. For the bandwidth parameter of the radial basis function kernel, we chose the median L2-distance between data points.
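The empirical MMD with RBF kernel and median heuristic is straightforward to compute; the sketch below (our own, with the biased two-sample estimator) illustrates why marginal permutation of a dependent feature scores poorly:

```python
import numpy as np

def mmd2_rbf(X, Z, sigma=None):
    """Biased empirical MMD^2 with an RBF kernel; sigma defaults to the
    median pairwise L2 distance between all points (median heuristic)."""
    XZ = np.vstack([X, Z])
    d2 = ((XZ[:, None, :] - XZ[None, :, :]) ** 2).sum(-1)  # squared distances
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]))
    K = np.exp(-d2 / (2 * sigma ** 2))
    n = len(X)
    return K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean()

# Marginal permutation of a strongly correlated feature destroys the joint
# distribution and should raise the MMD far above same-distribution noise.
rng = np.random.default_rng(0)
def sample(n):
    x0 = rng.normal(size=n)
    return np.column_stack([x0, x0 + 0.1 * rng.normal(size=n)])
X, Z_same = sample(150), sample(150)
Z_perm = X.copy()
Z_perm[:, 1] = rng.permutation(Z_perm[:, 1])
m_same, m_perm = mmd2_rbf(X, Z_same), mmd2_rbf(X, Z_perm)
```

A fresh sample from the same joint distribution yields a small MMD, while the permuted dataset, whose points scatter off the correlation diagonal, yields a much larger one; this is the effect measured in the data fidelity experiments below.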
5.2 Model Fidelity
Model fidelity has been defined as how close the predictions of an explanation method are to the ML model .
Similarly, we define model fidelity for feature effects as the mean squared error between the model prediction and the prediction of the partial function (which depends only on feature $x_j$) defined by the feature effect method. For a given data instance, the predicted outcome from, e.g., a PDP is the y-axis value at the observed $x_j$ value:

$$\widehat{MF}_j = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{f}\left(x^{(i)}\right) - g_j\left(x_j^{(i)}\right) \right)^2 \quad (4)$$

where $g_j$ is a feature effect function such as ALE or PDP. In order to evaluate ALE plots, they have to be adjusted such that they are on a comparable scale to a PDP , i.e., shifted by the mean prediction $\frac{1}{n} \sum_{i=1}^{n} \hat{f}(x^{(i)})$.
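Model fidelity is then just an MSE between the full model and the one-dimensional effect curve. A minimal sketch under assumed names, with an additive toy model where the residual error is known analytically:

```python
import numpy as np

def model_fidelity(predict, effect, X, j):
    """MSE between full model predictions and the 1-D effect function g_j
    evaluated at the observed values of feature j (lower is better)."""
    return float(np.mean((predict(X) - effect(X[:, j])) ** 2))

# For the additive model f(x) = x_0 + x_1, the PDP-style effect for feature 0
# is g(x_0) = x_0 + mean(x_1), leaving exactly Var(x_1) as residual error.
X = np.random.default_rng(0).normal(size=(300, 2))
predict = lambda X: X[:, 0] + X[:, 1]
pdp_effect = lambda xj: xj + X[:, 1].mean()
mf = model_fidelity(predict, pdp_effect, X, 0)
```

This also shows why a single global effect curve cannot reach perfect model fidelity for non-degenerate models: the variance of the remaining features' contribution stays in the residual.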
6.1 Data Fidelity Evaluation
We evaluated how different types of perturbations affect the data fidelity measure (based on MMD) for numerous datasets (see Table 1).
| No. of rows | 6497 | 6435 | 6574 | 3107 | 3848 | 2178 |
| No. of features | 12 | 37 | 15 | 7 | 6 | 4 |
We used 40% of the data to fit the transformation trees that find the subgroups. The remaining 60% were split in half. One half remained unchanged, while we perturbed one of the features in the other half. Then, we computed the MMD comparing the two datasets. We sampled a subset of observations from the PDP/PFI perturbation dataset by permuting $x_j$ once for each observation, so that we get a perturbed dataset of the same size as the original data. This differs from the PDP definition, where $x_j$ of each observation is replaced by a set of grid values. According to , PFI and PDP can be formulated with the same underlying feature replacement strategy, either by replacing the feature values using pre-defined grid points (as usually done in PDP) or by permuting the feature values (as usually done in PFI). For the perturbation in subgroups, we permuted $x_j$ once within each subgroup. For the interval-based perturbation of ALE plots, we used a grid based on 30 quantiles that determine the intervals. We averaged two MMD computations: once moving each observation to the left border of the containing ALE interval, and once to the right border. For Model-X knockoffs  we replaced the feature with its knockoff. We repeated the experiment 30 times with different random seeds.
Fig. 5 shows that the PDP/PFI type of perturbation has a low data fidelity (high MMD) compared to all other approaches. Model-X knockoffs and conditional subgroup permutation (with many groups) have the best data fidelity. Even splitting with a maximum depth of only 1 (two subgroups) strongly improves data fidelity. The deeper the trees are, the more subgroups are found and the better the data fidelity. Ranked across all features and datasets, the average rankings show that deep conditional subgroups even outperform ALE, see Table 2.
6.2 Model Fidelity Evaluation
In this section we evaluate the model fidelity of PDP, ALE and subgroup PDPs. We trained random forests (500 trees), linear models and k-nearest neighbours models (k = 7) on various datasets (Table 1). 70% of the data were used to train the models and the transformation trees. 30% of the data were used to evaluate model fidelity. For each model and each dataset, we measured model fidelity between effect prediction and model prediction (Equation 4), averaged across observations and features. Table 3 shows that the model fidelity of ALE and PDP is similar, while the subgroup PDPs have the best model fidelity. This is interesting since the grouping is neither based on the model nor the real target, but solely on the conditional dependence structure of the features.
Median model fidelity averaged over features in a random forest for various datasets. The cPDPs always had a lower loss (i.e., higher model fidelity) than PDP and ALE. The loss monotonically decreases with increasing maximum tree depth for subgroup construction. Using different models (knn or linear model) produced similar results, see Appendix 0.E.
6.3 Conditional Feature Importance Evaluation
We computed the true conditional feature importance for a simulated linear model with three Gaussian features and additive Gaussian noise (the data-generating process is specified in Appendix 0.D).
In the simulations, we varied the correlation between the feature of interest and one of the other features, while the third feature remained independent.
We repeated the experiments 30 times and sampled 1000 data points in each repetition.
We examined two experimental settings. In setting (I) we assumed that our machine learning model recovered the true linear regression model. We measured the absolute distance between the true cPFI of the feature of interest (derivation in Appendix 0.D) and the cPFI based on subgroups with different tree depths for subgroup generation. In setting (II) we trained a random forest (with 100 conditional inference trees , mtry = 2 and maxdepth = 10). For this random forest we computed the cPFI with various methods (computed on the random forest) and compared it to the true cPFI (based on the data generating process). We compared our subgroup cPFI approach, the random forest based cPFI by Strobl et al. , and Model-X knockoffs .
Fig. 6 shows that (I) the deeper the transformation trees (and the more subgroups), the better the true cPFI is approximated, and (II) our subgroup-based cPFI approach is equal or superior to the state of the art.
In the following practical application, we demonstrate that subgroup-based conditional variants are a valuable tool to understand model and data beyond the insights given by PFI, PDPs or ALE plots.
We trained a random forest to predict daily bike rentals  from weather and seasonal information. The data was divided into 70% training and 30% test data. The features in the bike data are dependent. For example, the correlation between temperature and humidity is 0.13. The data contains both categorical and numerical features and we are interested in multivariate, non-linear dependencies, so correlation is an inadequate measure of dependence. We therefore indicate the degree of dependence by showing the extent to which we can predict each feature from all other features in Table 4. Per feature, we trained a random forest to predict that feature from all other features. Random forests can capture non-linear dependencies and interactions and work reasonably well without tuning. We measured the proportion of loss explained to quantify the dependence of the respective feature on all other features. For numerical features we used the R-squared measure. For categorical features we computed one minus the ratio of the random forest's mean misclassification error to the mean misclassification error of always predicting the most common class in the training data. We divided the training data into two folds and trained the random forest on one half. Then we computed the proportion of explained loss on the other half, and vice versa. Finally, we averaged the results.
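The proportion-of-loss-explained measure for categorical features can be written compactly; this sketch is our own reading of the description above, with an assumed function name:

```python
import numpy as np

def proportion_loss_explained(y_true, y_pred, majority_class):
    """Dependence measure for a categorical feature: 1 - MCE(model)/MCE(baseline),
    where MCE is the mean misclassification error and the baseline always
    predicts the training data's most common class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mce_model = float(np.mean(y_true != y_pred))
    mce_base = float(np.mean(y_true != majority_class))
    return 1.0 - mce_model / mce_base
```

A perfect classifier scores 1, while a classifier that is no better than the majority baseline scores 0, mirroring how R-squared behaves for numerical features.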
To construct the subgroups, we set the maximum tree depth to 2, i.e. we limited the number of possible subgroups to 4. We compared the unconditional and conditional PFI for the bike rental predictions, see Fig. 7.
The most important feature, according to PFI, was the temperature. Temperature is less important when we condition on season and humidity. To get a deeper understanding of the temperature effect, we examined the effect plots (see Fig. 8). Both ALE and PDP show a monotonic increase of predicted bike rentals up to 25°C and a decrease beyond that. The PDP shows a weaker negative effect of very high temperatures, which might be caused by extrapolation: high-temperature days are combined with, e.g., winter days. A limitation of the ALE plot is that we can only interpret it locally. In contrast, our subgroups are explicit about the subgroup conditions under which the interpretation of the PDP is valid and show the distributions in which the feature effect may be interpreted. The PDPs in subgroups reveal a more nuanced picture: For dry summer days, increasing temperature mostly has a negative effect on the predicted number of bike rentals.
The change in intercepts of the subgroup PDP can be interpreted as the effect of the grouping features (season and humidity).
The slope can be interpreted as the temperature effect within a subgroup.
We also demonstrate the subgroup PDPs for the season, a categorical feature. Fig. 9 shows both the PDP and our subgroup PDPs. The standard PDP shows that on average there is no difference between spring, summer and fall, and only slightly fewer bike rentals in winter. The PDP with four subgroups conditional on temperature shows that the unconditional PDP is misleading.
We proposed the conditional PDP and the conditional PFI, both based on subgroups. This research addresses the inherent conflict between extrapolation and an unconditional interpretation. Our subgroup-based approach unites the best of both worlds: It reduces extrapolation and makes the conditioning explicit through interpretable subgroups, while allowing unconditional interpretation within the subgroups. We have shown that permuting data within subgroups greatly improves data fidelity compared to unconditional permutation. For this purpose we introduced a data fidelity measure based on the maximum mean discrepancy. As a surprising finding, the model fidelity of the subgroup PDPs is better than that of ALE or PDPs. In a simulation we showed that our conditional PFI exceeds or is equal to the state-of-the-art for conditional PFI. The measure of data fidelity can be used for other interpretation methods as well. For local explanation methods such as LIME  or Shapley Values  it could be adapted to measure local data fidelity.
Computational Details. All experiments were conducted using mlr  and R . The code for all experiments is available at https://github.com/compstat-lmu/paper_2019_dependent_features/.
Acknowledgements. This work is funded by the Bavarian State Ministry of Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B) and supported by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.
-  Aas, K., Jullum, M., Løland, A.: Explaining individual predictions when features are dependent: More accurate approximations to shapley values. arXiv preprint arXiv:1903.10464 (2019)
-  Apley, D.W., Zhu, J.: Visualizing the effects of predictor variables in black box supervised learning models. arXiv preprint arXiv:1612.08468 (2016)
-  Breiman, L.: Random forests. Machine learning 45(1), 5–32 (2001)
-  Candes, E., Fan, Y., Janson, L., Lv, J.: Panning for gold: 'Model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(3), 551–577 (2018)
-  Casalicchio, G., Bossek, J., Lang, M., Kirchhoff, D., Kerschke, P., Hofner, B., Seibold, H., Vanschoren, J., Bischl, B.: OpenML: An R package to connect to the machine learning platform OpenML. Comput. Stat. (2017)
-  Casalicchio, G., Molnar, C., Bischl, B.: Visualizing the feature importance for black box models. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 655–670. Springer (2018)
-  Dua, D., Graff, C.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml
-  Fisher, A., Rudin, C., Dominici, F.: All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research 20(177), 1–81 (2019)
-  Friedman, J.H., et al.: Multivariate adaptive regression splines. The Annals of Statistics 19(1), 1–67 (1991)
-  Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51(5), 1–42 (2018)
-  Hooker, G.: Generalized functional anova diagnostics for high-dimensional functions of dependent variables. J. Comput. Graph. Stat. 16(3) (2007)
-  Hooker, G., Mentch, L.: Please stop permuting features: An explanation and alternatives. arXiv preprint arXiv:1905.03151 (2019)
-  Hothorn, T., Hornik, K., Zeileis, A.: Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical statistics 15(3), 651–674 (2006)
-  Hothorn, T., Zeileis, A.: Transformation forests. arXiv preprint arXiv:1701.02110 (2017)
-  Janzing, D., Minorics, L., Blöbaum, P.: Feature relevance quantification in explainable ai: A causality problem. arXiv preprint arXiv:1910.13413 (2019)
-  Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L., Bischl, B.: mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software (2019)
-  Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: NIPS, vol. 30, pp. 4765–4774. Curran Associates, Inc. (2017)
-  Molnar, C.: Interpretable Machine Learning (2019), https://christophm.github.io/interpretable-ml-book/
-  Parr, T., Wilson, J.D.: A stratification approach to partial dependence for codependent variables. arXiv preprint arXiv:1907.06698 (2019)
-  R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2017)
-  Ribeiro, M.T., Singh, S., Guestrin, C.: Why should I trust you?: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1135–1144. ACM (2016)
-  Romano, Y., Sesia, M., Candès, E.: Deep knockoffs. Journal of the American Statistical Association pp. 1–12 (2019)
-  Scholbeck, C.A., Molnar, C., Heumann, C., Bischl, B., Casalicchio, G.: Sampling, intervention, prediction, aggregation: A generalized framework for model agnostic interpretations. arXiv preprint arXiv:1904.03959 (2019)
-  Strobl, C., Boulesteix, A.L., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable importance for random forests. BMC bioinformatics 9(1), 307 (2008)
-  Vanschoren, J., Van Rijn, J.N., Bischl, B., Torgo, L.: Openml: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15(2), 49–60 (2014)
Appendix 0.A Decompose cPFI into subgroup PFIs
Assuming a perfect construction of $G_j$, it holds that $X_j \perp X_{-j} \mid G_j$ and also that $P(X_j \mid G_j, X_{-j}) = P(X_j \mid G_j)$ (as $G_j$ is a compression of $X_{-j}$). When we sample the replacement for an $x_j^{(i)}$ from the marginal distribution within a group (e.g., via permutation), we therefore also sample from the conditional distribution. Every data point from the group-wise sample can thus equivalently be seen as a sample from the marginal distribution within the group, or as a sample from the global conditional distribution.
It follows that the weighted sum of marginal subgroup PFIs coincides with the cPFI.
Appendix 0.B Decompose cPDPs into subgroup PDPs
Since $P(X_j \mid X_{-j}) = P(X_j \mid G_j)$ under a perfect grouping, the conditional PDP as defined below can be seen as a point-wise, locally weighted sum of marginal PDPs.
To construct the cPDP from the group-wise PDPs, we need the probability of each group given the feature value. These probabilities can be approximated, but cannot trivially be derived analytically, as they depend on the unknown distributions of the feature within the groups.
If we instead constructed the groups by partitioning $X_j$ (and modelling the conditional distribution of $X_{-j}$), this term would evaluate to zero or one and the aggregation would be straightforward to perform.
Appendix 0.C Expectation and Variance of the PDP in a Subgroup
We show that under feature independence the PDP and a PDP in an arbitrary subgroup have the same expected value and the subgroup PDP has a higher variance.
Appendix 0.D Groundtruth cPFI
The data has the following distribution:
We assume an ML model $\hat{f}$ that recovers the true linear model. For any data point, the squared loss is:
The expectation for this is:
For the permutation feature importance, we permute one of the features. The following formula shows the permutation of the feature of interest:
So for a single data point, the feature importance is:
The expected value of this is:
Conditional on the observed values of the remaining features, the feature of interest follows a Gaussian distribution:
Finally, the expected conditional permutation feature importance becomes: